Summary of "Let’s Handle 1 Million Requests per Second, It’s Scarier Than You Think!"
High-level overview
- Purpose: simulate and analyze handling ≥1,000,000 HTTP requests per second (RPS). The creator builds, benchmarks and iterates on an end-to-end stack (client load generators → web servers → Redis / Postgres) to expose real-world bottlenecks, costs and design patterns.
- Key message:
At extreme scale you must optimize algorithms, architecture and tooling — small mistakes (O(n) vs O(log n), wrong storage choice, wrong language/framework) become catastrophically expensive. Monitoring, profiling and trade-offs (cost vs latency vs durability) are critical.
Technologies, tools and components used
- Languages & runtimes
- Node.js (Express, Fastify, custom “Cpeak”)
- C++ (Drogon framework) — final performance achieved in C++
- Brief experiments with Python, Go, Java
- Load testing
- autocannon (CLI + scripted JS wrapper)
- Multi-machine orchestration via AWS SSM and small bash/node scripts
- Process management / clustering
- PM2 for Node cluster-mode (spawn many Node instances per host)
- Drogon (C++) uses internal threading
- Databases / caches
- PostgreSQL (RDS/Aurora) for durable storage
- Redis (standalone and Redis Cluster) for hot paths and ingest buffering
- Migration tooling to move data Postgres → Redis for hot paths; background batch sync from Redis → Postgres
- JSON parsers
- RapidJSON used in C++ (much faster than default Drogon parser)
- Monitoring
- mpstat, top/activity monitor, system metrics, CloudWatch
- Manual measurement of bytes transferred
- Cloud / infra
- Hardware: local Mac Studio for early tests; AWS EC2 (C8i.32xlarge: 128 vCPUs, 50 Gbit/s network; C8GN.48xlarge "beast": 192 vCPUs, ~600 Gbit/s)
- RDS instance types with tuned IOPS/provisioned throughput
- S3 for storing results; AWS support/quota requests (NLB capacity reservation, regional EC2 vCPU limits)
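The PM2 cluster-mode setup mentioned above can be sketched as an ecosystem file; this is a minimal illustration, and the app name and entry-point filename are assumptions, not from the original project:

```javascript
// ecosystem.config.js — minimal PM2 cluster-mode sketch.
// "api" and "./server.js" are placeholder names for illustration.
module.exports = {
  apps: [{
    name: 'api',
    script: './server.js',
    exec_mode: 'cluster', // fork workers via Node's cluster module
    instances: 'max',     // one worker per available CPU core
  }],
};
```

Run with `pm2 start ecosystem.config.js`; PM2 then load-balances incoming connections across the worker processes.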
Key benchmarks / empirical results
Representative numbers from experiments and runs:
- Local Mac Studio single-route simple JSON
- Express: ~14k–20k RPS
- Fastify: ~66k–77k RPS
- Cpeak (custom Node minimal framework): ~73k RPS
- Trivial endpoints on powerful tester + server
- Millions of RPS possible when CPU and network are sufficient
- Patch route (heavier response ~30 KB)
- Server becomes network-bound — moved multiple GB/s
- Example peak ≈ 3M RPS when payloads made tiny and network was large enough
- PostgreSQL (writes)
- ~30k–66k inserts/sec on a very large RDS instance
- Increasing IO throughput improved RPS but cost rose sharply
- PostgreSQL (reads)
- Bad query patterns (ORDER BY random(), or SELECT COUNT(*) combined with a random pick) scale terribly (O(n)): seconds per response at millions of rows
- Indexed/ID lookup design reached ~200k–400k RPS with careful queries and high-end DB instance
- Redis
- A single Redis instance typically tops out around 100k RPS (reads/writes)
- Redis Cluster (many masters + replicas) scaled to >1,000,000 RPS for the “code” route (UUID-based keys to avoid sequential contention); example: 30 master nodes, 15 replicas
- C++ Drogon + RapidJSON
- Final optimized route reached ~1,000,000–1,200,000 RPS on beefy EC2 machines, handling huge aggregate throughput (example cited: 38 GB/s application-layer, ~300 Gbit/s)
- Final 30-minute large run
- Architecture: 1 “beast” server receiving traffic from 60 small tester VMs (each running autocannon scripts)
- Aggregate: ≈ 2 billion requests processed in 30 minutes, ≈ 60 TB transferred, ≈ 40 timeouts across billions of requests (very low error rate)
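The final-run figures above can be sanity-checked with back-of-envelope arithmetic (exact round numbers assumed here for illustration):

```javascript
// Sanity-check the reported 30-minute run: ≈ 2 billion requests, ≈ 60 TB moved.
const totalRequests = 2e9;       // ≈ 2 billion requests
const durationSec = 30 * 60;     // 30 minutes
const totalBytes = 60e12;        // ≈ 60 TB transferred

const avgRps = totalRequests / durationSec;          // average requests/sec
const gbytesPerSec = totalBytes / durationSec / 1e9; // application-layer GB/s
const gbitsPerSec = gbytesPerSec * 8;                // line-rate equivalent

console.log(Math.round(avgRps));      // ≈ 1,111,111 RPS
console.log(gbytesPerSec.toFixed(1)); // ≈ 33.3 GB/s (≈ 267 Gbit/s)
```

These derived rates line up with the ~1M+ RPS and ~300 Gbit/s figures cited for the optimized C++ server.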
Architectural lessons, optimizations and recommended patterns
- Measure first — resource monitoring is essential (CPU core utilization vs aggregated CPU percent, network throughput, disk IOPS).
- Algorithmic complexity matters — avoid O(n) database patterns (random scans, COUNT(*)); prefer indexed lookups, precomputed metadata, and O(1)/O(log n) approaches.
- Reduce framework overhead — use a fast minimal framework for hot routes (Fastify or custom Cpeak over Express). For CPU-heavy hot paths consider native code (C++/Drogon, Rust).
- In-memory storage for hot writes/reads
- Use Redis as an ingest buffer/queue for extremely hot routes, then batch-sync to durable DB asynchronously to reduce DB write load and cost.
- Shard Redis (cluster) to scale past a single-instance Redis limit.
- Avoid sequential global counters/locks for hot inserts — use random UUIDs to remove single-writer contention (collision probability negligible at extreme scale).
- Networking is often the bottleneck for large payloads — provision adequate network bandwidth, use instances with high network performance, and reserve load balancer capacity where required.
- For maximal raw throughput prefer compiled languages + optimized parsers (C++ + RapidJSON used here).
- For production, use horizontal distribution across multiple servers/regions and geo-routing rather than relying on a single supercomputer-style box.
- Be cost-aware — RDS IOPS and instance/network capacity scale cost quickly; test carefully and avoid mistakes that multiply cost.
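The Redis-as-ingest-buffer pattern above can be sketched with an in-memory simulation; the arrays here stand in for Redis (hot-path appends) and Postgres (durable batch writes), purely to show the control flow:

```javascript
// In-memory sketch of the buffer-then-batch-sync pattern: hot writes land in
// a fast buffer (Redis in the real setup) and a background job drains them
// into the durable store (Postgres) in bulk, instead of one INSERT per request.
const buffer = [];        // stands in for a Redis list on the hot path
const durableStore = [];  // stands in for Postgres

// Hot path: O(1) append, no database round-trip per request.
function ingest(event) {
  buffer.push(event);
}

// Background job: drain up to batchSize events per tick with one bulk write.
function drainBatch(batchSize = 1000) {
  const batch = buffer.splice(0, batchSize);
  if (batch.length > 0) {
    durableStore.push(...batch); // real code: a single multi-row INSERT
  }
  return batch.length;
}

for (let i = 0; i < 2500; i++) ingest({ id: i });
drainBatch(); // flushes 1000
drainBatch(); // flushes 1000
drainBatch(); // flushes the remaining 500
```

The trade-off is durability: events sitting in the buffer can be lost before a flush, which is why the author reserves this pattern for extremely hot routes and syncs to Postgres in the background.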
Engineering anti-patterns / gotchas uncovered
- Trusting naive SQL such as ORDER BY random() or SELECT COUNT(*) at scale; both degrade massively on large tables.
- Relying on single-node Redis or single DB for very hot routes.
- Under-reserving cloud load-balancer capacity or hitting EC2 vCPU regional limits.
- Using slow JSON parsers or high-level frameworks without benchmarking for hot endpoints.
- Not accounting for TCP/OS limits (number of open connections) on the tester side; saturating the server required spawning many smaller test VMs.
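The ORDER BY random() anti-pattern scans and sorts the whole table (O(n)). A common O(1)-style replacement is to track the maximum id, pick a random id in range, and do an indexed primary-key lookup. This is a sketch of that pattern, not the project's actual code; the table and column names ("urls", "id") are assumptions:

```javascript
// Anti-pattern: forces a full scan + sort on every request.
const badQuery = 'SELECT * FROM urls ORDER BY random() LIMIT 1';

// Replacement: random id in [1, maxId] + an indexed primary-key lookup,
// returned as a parameterized query object (node-postgres style shape).
function randomLookup(maxId) {
  const id = 1 + Math.floor(Math.random() * maxId);
  return { text: 'SELECT * FROM urls WHERE id = $1', values: [id] };
}

// Caveat: gaps left by deleted rows make the pick slightly non-uniform;
// retry on a miss, or maintain an id mapping if exact uniformity matters.
```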
Product / project notes
- Cpeak: a zero-dependency Node.js mini-framework created by the author, optimized for performance and used in benchmarks.
- Demo product referenced: a URL-shortener project (“weer.pro”) intended to be stress-tested; open-source repos and code were shared to reproduce tests locally.
- Repos & scripts: multiple repositories provided (Node and C++ code, load-test scripts, Redis cluster scripts, migration scripts). The author recommends running locally before attempting cloud runs.
Costs and operational considerations
- Example instance costs (vary by region/time)
- C8i.32xlarge ≈ $6/hr
- C8GN.48xlarge ≈ $11/hr
- RDS instances with high IOPS and large storage can add thousands to tens of thousands $/month.
- Example totals
- A high-throughput DB configuration might cost $7k–33k/month
- Multi-VM tester setups add significant EC2 charges
- Final experimental month cost ≈ $2,000 (could be reduced; mistakes would have increased cost)
- Energy & environmental note: moving terabytes per minute and sustaining many vCPUs consumes substantial power (per hour, roughly comparable to the energy of driving an electric car many kilometers).
Practical how-to / guidance (recap)
- Benchmarks
- autocannon options: -c (connections), -p (pipelining), -d (duration), -w (workers/threads)
- Concurrent in-flight requests = c * p
- CPU metrics
- Understand per-core vs aggregated utilization (sum across cores can exceed 100% vs normalized 0–100%)
- Node clustering
- PM2 to spawn processes equal to available cores
- Redis
- Migrate Postgres data into Redis using batching; run local Redis cluster scripts; clustered Redis required to hit >100k RPS
- Testing tips
- Use private VPC networking to avoid Internet throttles
- Use multiple client machines to avoid bottlenecking the tester side
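The concurrency rule above (in-flight requests = connections × pipelining) is worth making explicit, since it determines how hard each tester VM pushes the server. A tiny sketch, with option names mirroring autocannon's -c/-p/-d/-w flags (the sample values are illustrative, not the author's exact settings):

```javascript
// In-flight requests per autocannon run = connections × pipelining.
function inFlight(connections, pipelining) {
  return connections * pipelining;
}

// Example settings for a single tester VM (illustrative values):
const opts = { connections: 100, pipelining: 10, duration: 60, workers: 4 };
console.log(inFlight(opts.connections, opts.pipelining)); // 1000 concurrent requests
```

This is why many small tester VMs were needed: each machine's OS limits its open connections, so aggregate load scales by adding clients, not by raising -c on one box.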
Final takeaways
- Achieving 1M RPS is feasible but requires:
- Right language & parsers for CPU-bound handlers (e.g., C++ + RapidJSON)
- In-memory buffering (Redis) and sharding for hot routes
- Careful algorithm design (avoid O(n) DB ops)
- Provisioning of sufficient network bandwidth and cloud resource reservations
- Meticulous monitoring and cost-awareness
- The project is educational and reproducible: repos and scripts are shared so parts can be reproduced locally; careful testing is emphasized over blindly spinning up expensive cloud infra.
Main speaker / sources
- Speaker / presenter: the video author and builder of the demos (referred to in first person in the original content); provided repos and ran all experiments.
- Primary technologies / sources cited: AWS (EC2, RDS/Aurora, S3, CloudWatch, NLB, IAM), Node.js (Express, Fastify, Cpeak), autocannon, PM2, Redis (standalone & cluster), PostgreSQL, C++ (Drogon), RapidJSON.