Video summary
LISA17 - Queueing Theory in Practice: Performance Modeling for the Working Engineer
Main summary
Key takeaways
Overview
- Speaker: Eben (engineer at Honeycomb).
- Talk goal: Build small experiments plus simple analytic models to extrapolate performance, identify bottlenecks, and guide design/scale decisions. Emphasis on validating models with real data and being explicit about assumptions.
Use small experiments and simple queueing models to reason about latency, capacity, and scaling without huge-scale testing.
Serial (single-server) modeling
- Experiment setup:
- Measure latency versus throughput on one CPU core.
- Send requests from many independent clients (arrivals assumed random) and observe latencies.
- Model:
- Simplest single-server queueing model: random arrivals at rate λ, constant service time s, single job processing at a time (M/D/1-like assumptions).
- Area-under-curve reasoning gives a closed-form expression for average wait time W as a function of λ and s.
- Model reproduces measured latency curves and reveals the “knee” — the utilization level where latency rapidly increases.
- Practical result: When restricted to a consistent performance regime, the simple model fits real data surprisingly well.
Key lessons from serial modeling
- Reduce per-request work (service time s): the biggest win for both latency and capacity. For example, halving s can allow much higher throughput while maintaining similar latency.
- Variability is harmful: variable inter-arrival times and variable job sizes increase queuing and latency. Constant-size jobs and uniform arrivals minimize queuing.
- Mitigations for variability:
- Batching
- Aggressive preemption/timeouts
- Client-side backpressure or concurrency control
- Reducing per-request variance
Parallel systems and scaling
- Naive assumption: N servers give N× single-server capacity — not necessarily true. The outcome depends on task assignment and load balancing.
- Optimal assignment (always pick the least-busy server) minimizes queuing at high utilization, but finding and maintaining that choice incurs coordination costs.
- Coordination costs:
- Assign-time overhead (α)
- Costs that grow with number of servers (e.g., probing each backend adds β·N)
- As N increases, coordination overhead can dominate and reduce net throughput.
- Universal Scalability Law: a compact model that captures throughput versus parallelism including serialization/coordination costs; explains diminishing or negative returns when adding servers without careful design.
Design patterns to improve scalability
- Power-of-two-choices (pick-two):
- Pick two random servers and send the task to the less-loaded one.
- Constant overhead independent of N; dramatically reduces maximum load (roughly from ~log N to ~log log N).
- Used in large-scale schedulers/load balancers (and in systems like Sparrow).
- Partitioning + hierarchical aggregation (iterative parallelization):
- Avoid aggregating all partial results at a single node (aggregation cost ∝ N).
- Use intermediate aggregators in a tree-shaped reduce to reduce aggregation cost to O(log N).
- Scan time decreases with fan-out, while aggregation cost grows — a tree reduces overall cost and improves scaling.
- Randomized approximation and iterative partitioning:
- Balance coordination cost against latency/throughput by trading exactness for lower coordination.
Practical guidance / workflow
- State goals and assumptions explicitly.
- Run small tests and microbenchmarks; collect instrumentation data needed to fit models.
- Validate simple models against production-like data before trusting extrapolations.
- If unsure, draw the system timeline/queueing picture, write a small simulation, or consult standard queueing results/textbooks.
- Watch for unbounded queues (which lead to unbounded latency) and minimize variance. The simplest capacity improvement is reducing work per request.
References / concepts / systems mentioned
- Theoretical concepts:
- Poisson/random arrivals
- Single-server queue (M/D/1-like assumptions)
- Variability effects
- Universal Scalability Law
- Algorithms / systems:
- Power-of-two-choices (pick-two) load balancing
- Sparrow scheduler
- Facebook Scuba (and general map-reduce / distributed query partitioning + aggregation)
- Papers:
- Literature on two-choice load balancing and related schedulers
- Follow-up talk:
- Barron — deeper dive on the Universal Scalability Law
Main speaker / sources
- Main speaker: Eben (engineer at Honeycomb) — presented experiments and models.
- Referenced speaker: Barron — follow-up on the Universal Scalability Law.
- Referenced systems/papers: power-of-two-choices literature, Sparrow scheduler, Facebook Scuba, and papers on randomized load balancing and distributed query aggregation.