Video summary

LISA17 - Queueing Theory in Practice: Performance Modeling for the Working Engineer

Main summary

Key takeaways

Technology

Overview

  • Speaker: Eben (engineer at Honeycomb).
  • Talk goal: Build small experiments plus simple analytic models to extrapolate performance, identify bottlenecks, and guide design/scale decisions. Emphasis on validating models with real data and being explicit about assumptions.

Use small experiments and simple queueing models to reason about latency, capacity, and scaling without huge-scale testing.

Serial (single-server) modeling

  • Experiment setup:
    • Measure latency versus throughput on one CPU core.
    • Send requests from many independent clients (arrivals assumed random) and observe latencies.
  • Model:
    • Simplest single-server queueing model: random arrivals at rate λ, constant service time s, single job processing at a time (M/D/1-like assumptions).
    • Area-under-curve reasoning gives a closed-form expression for average wait time W as a function of λ and s.
    • Model reproduces measured latency curves and reveals the “knee” — the utilization level where latency rapidly increases.
  • Practical result: When restricted to a consistent performance regime, the simple model fits real data surprisingly well.

Key lessons from serial modeling

  • Reduce per-request work (service time s): the biggest win for both latency and capacity. For example, halving s can allow much higher throughput while maintaining similar latency.
  • Variability is harmful: variable inter-arrival times and variable job sizes increase queuing and latency. Constant-size jobs and uniform arrivals minimize queuing.
  • Mitigations for variability:
    • Batching
    • Aggressive preemption/timeouts
    • Client-side backpressure or concurrency control
    • Reducing per-request variance

Parallel systems and scaling

  • Naive assumption: N servers give N× single-server capacity — not necessarily true. The outcome depends on task assignment and load balancing.
  • Optimal assignment (always pick the least-busy server) minimizes queuing at high utilization, but finding and maintaining that choice incurs coordination costs.
  • Coordination costs:
    • Assign-time overhead (α)
    • Costs that grow with number of servers (e.g., probing each backend adds β·N)
    • As N increases, coordination overhead can dominate and reduce net throughput.
  • Universal Scalability Law: a compact model that captures throughput versus parallelism including serialization/coordination costs; explains diminishing or negative returns when adding servers without careful design.

Design patterns to improve scalability

  • Power-of-two-choices (pick-two):
    • Pick two random servers and send the task to the less-loaded one.
    • Constant overhead independent of N; dramatically reduces maximum load (roughly from ~log N to ~log log N).
    • Used in large-scale schedulers/load balancers (and in systems like Sparrow).
  • Partitioning + hierarchical aggregation (iterative parallelization):
    • Avoid aggregating all partial results at a single node (aggregation cost ∝ N).
    • Use intermediate aggregators in a tree-shaped reduce to reduce aggregation cost to O(log N).
    • Scan time decreases with fan-out, while aggregation cost grows — a tree reduces overall cost and improves scaling.
  • Randomized approximation and iterative partitioning:
    • Balance coordination cost against latency/throughput by trading exactness for lower coordination.

Practical guidance / workflow

  1. State goals and assumptions explicitly.
  2. Run small tests and microbenchmarks; collect instrumentation data needed to fit models.
  3. Validate simple models against production-like data before trusting extrapolations.
  4. If unsure, draw the system timeline/queueing picture, write a small simulation, or consult standard queueing results/textbooks.
  5. Watch for unbounded queues (which lead to unbounded latency) and minimize variance. The simplest capacity improvement is reducing work per request.

References / concepts / systems mentioned

  • Theoretical concepts:
    • Poisson/random arrivals
    • Single-server queue (M/D/1-like assumptions)
    • Variability effects
    • Universal Scalability Law
  • Algorithms / systems:
    • Power-of-two-choices (pick-two) load balancing
    • Sparrow scheduler
    • Facebook Scuba (and general map-reduce / distributed query partitioning + aggregation)
  • Papers:
    • Literature on two-choice load balancing and related schedulers
  • Follow-up talk:
    • Barron — deeper dive on the Universal Scalability Law

Main speaker / sources

  • Main speaker: Eben (engineer at Honeycomb) — presented experiments and models.
  • Referenced speaker: Barron — follow-up on the Universal Scalability Law.
  • Referenced systems/papers: power-of-two-choices literature, Sparrow scheduler, Facebook Scuba, and papers on randomized load balancing and distributed query aggregation.

Original video