Summary of "Kafka System Design Deep Dive w/ a Ex-Meta Staff Engineer"
High-level purpose
This is a deep-dive overview of Apache Kafka aimed at system-design interviews. It covers when to use Kafka, core concepts, how Kafka scales, durability, error and retry handling, performance tuning, and retention trade-offs.
Delivered by Evan (former Meta staff engineer), co-founder of Hello Interview. A written guide and additional resources are available on hellointerview.com.
Motivating example: world-cup events
Producers publish real-time game events (goals, substitutions, etc.) to Kafka and consumers update websites or other downstream systems.
Problems Kafka addresses in this example:
- Scale consumers horizontally while preserving per-game ordering by partitioning messages by game.
- Avoid duplicate processing with consumer groups.
- Separate topics by domain (e.g., soccer vs. basketball) to keep data and consumers decoupled.
Core Kafka concepts & lifecycle
- Broker: a Kafka server that stores partitions.
- Topic: a logical grouping of messages; topics contain partitions.
- Partition: an ordered, immutable append-only log on disk. Messages in a partition have offsets (0, 1, 2, …).
- Producer: writes records to a topic. Records contain a key, value, timestamp, and headers.
- Consumer: subscribes to topics, reads by offset, and commits offsets back to Kafka.
- Consumer groups: ensure each partition’s messages are delivered to exactly one consumer in the group (enables horizontal scaling).
- Controller: the broker that maps partitions to brokers and assigns leaders.
- Replication: each partition has a leader replica (handles reads/writes) and follower replicas that replicate data for failover.
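The consumer-group guarantee above (each partition delivered to exactly one consumer in the group) can be sketched as a simple round-robin assignment. This is an illustrative stand-in, not Kafka's actual rebalance protocol, which negotiates assignments (range, round-robin, or sticky assignors) among group members:

```javascript
// Assign each partition to exactly one consumer in the group.
// Illustrative only: Kafka's real assignors run during a group rebalance.
function assignPartitions(partitions, consumers) {
  const assignment = new Map(consumers.map((c) => [c, []]));
  partitions.forEach((p, i) => {
    // Round-robin: partition i goes to consumer i mod groupSize.
    assignment.get(consumers[i % consumers.length]).push(p);
  });
  return assignment;
}
```

With six partitions and three consumers, each consumer owns two partitions; adding a fourth consumer spreads the same partitions across four readers, which is how consumer groups enable horizontal scaling.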
Partitioning, keys, and ordering
- If a key is provided, Kafka hashes it (commonly MurmurHash) and uses `hash % num_partitions` to select a partition. This deterministic assignment preserves ordering for that key.
- Without a key, Kafka typically distributes messages round-robin and does not guarantee ordering across messages.
- Choosing an appropriate partition key is critical — it affects both load balancing and ordering guarantees.
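The key-to-partition mapping can be sketched as follows. The hash function here (djb2) is a simplified stand-in for the murmur2 hash Kafka's default partitioner actually uses; the point is that the mapping is deterministic, so every message with the same key lands on the same partition:

```javascript
// Simplified stand-in for Kafka's key hashing (Kafka uses murmur2).
function hashKey(key) {
  let h = 5381;
  for (let i = 0; i < key.length; i++) {
    h = ((h * 33) ^ key.charCodeAt(i)) >>> 0; // keep a 32-bit unsigned value
  }
  return h;
}

// Deterministic partition selection: hash(key) % numPartitions.
function selectPartition(key, numPartitions) {
  return hashKey(key) % numPartitions;
}
```

Because `selectPartition("game-42", 6)` always returns the same partition, all events for that game stay in one ordered log, which is exactly the world-cup ordering guarantee described earlier.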
Durability & availability
- Replication factor (commonly 3) controls how many replicas store each partition.
- The `acks` configuration controls the durability vs. latency trade-off. Use `acks=all` for maximum durability.
- Consumers commit offsets periodically. Commit offsets after work is complete so consumers can recover and resume from the last committed offset.
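The commit-after-work rule is what gives at-least-once delivery. A minimal simulation (no real broker; the log is just an array and the crash is simulated) shows why: a consumer that dies mid-batch restarts from the last committed offset and re-processes anything that was in flight, rather than losing it:

```javascript
// At-least-once consumption: commit the offset only after processing
// succeeds. `failAt` simulates a crash before that offset is committed.
function consume(log, startOffset, handler, { failAt = -1 } = {}) {
  let committed = startOffset;
  for (let offset = startOffset; offset < log.length; offset++) {
    if (offset === failAt) return committed; // crash: offset never committed
    handler(log[offset]);
    committed = offset + 1; // commit only after the work is done
  }
  return committed;
}
```

Committing *before* processing would instead risk losing the in-flight message on a crash (at-most-once), which is the trade-off the video calls out.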
Scaling guidance & constraints
- Keep messages small (recommended < 1 MB). Avoid storing large binary blobs (e.g., video) in Kafka; store them in object storage (S3) and send pointers/URLs in Kafka messages.
- Rough baseline (hardware-dependent): a single broker might handle ~1 TB of storage and ~10k messages/sec — useful as a handwavy interview estimate, not a hard rule.
- To scale: add brokers and increase the number of partitions. Design partition keys to distribute load evenly across partitions.
- Consider managed Kafka options to reduce operational burden: Confluent Cloud, AWS MSK, etc.
Hot-partition mitigation
If one partition becomes a hotspot (for example, a very popular ad ID):
- Remove the key if ordering isn’t required so messages can distribute evenly across partitions.
- Use compound keys to shard a hot key across partitions (trade ordering for throughput). Example key patterns: `adID:shard`, `adID:randomN`, `adID:userID`.
- Implement producer-side backpressure to slow production when partitions become overloaded.
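The compound-key idea above can be sketched as appending a random shard suffix to the hot key (the `adID:randomN` pattern). The shard count and key format here are illustrative assumptions:

```javascript
// Spread a hot key across numShards compound keys. Each compound key
// hashes to its own partition, so throughput rises, but per-adID
// ordering is lost because messages for one adID now span partitions.
function shardedKey(hotKey, numShards) {
  const shard = Math.floor(Math.random() * numShards);
  return `${hotKey}:${shard}`;
}
```

A consumer-side aggregation step (e.g., in Flink) is then typically needed to re-combine the shards for the same `adID`.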
Error handling & retries
- Producer retries: configure retry count and backoff, and enable idempotent producer mode to avoid duplicates.
- Kafka does not have built-in consumer retry scheduling. Common pattern:
- On consumer failure, publish the failed record to a retry topic with retry metadata.
- A retry consumer reads the retry topic and attempts processing (optionally with delays between retries).
- After exceeding max attempts, move the message to a dead-letter queue (DLQ) topic for manual inspection.
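The retry-topic/DLQ pattern above can be sketched as a routing function that tracks the attempt count in record headers. The topic names (`orders.retry`, `orders.dlq`) and the `attempts` header are illustrative choices, not a Kafka convention:

```javascript
// Route a failed record: back to a retry topic until maxAttempts is
// exhausted, then to the dead-letter queue for manual inspection.
// Topic names and the `attempts` header are hypothetical examples.
function routeFailure(record, maxAttempts) {
  const attempts = (record.headers?.attempts ?? 0) + 1;
  const topic = attempts >= maxAttempts ? "orders.dlq" : "orders.retry";
  return {
    topic,
    record: { ...record, headers: { ...record.headers, attempts } },
  };
}
```

A separate retry consumer would read `orders.retry`, optionally wait before re-processing, and call `routeFailure` again on failure until the record lands in the DLQ.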
- Consumer group rebalancing: Kafka automatically reassigns partitions if a consumer fails or leaves the group.
Performance optimizations
- Batch messages on the producer side (configure `batch.size` and `linger.ms`) to reduce request overhead.
- Compress messages (gzip, snappy, etc.) to reduce network I/O and storage.
- The most significant performance gains often come from a good partitioning strategy to maximize parallelism.
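Producer batching can be sketched as a buffer that flushes once it reaches a size threshold. Real clients also flush when `linger.ms` elapses; the timer is omitted here for brevity, and the class is an illustrative sketch rather than any client library's API:

```javascript
// Buffer messages and send them as one batch instead of one request
// per message. Real producers also flush after linger.ms elapses.
class BatchingProducer {
  constructor(batchSize, send) {
    this.batchSize = batchSize;
    this.send = send; // invoked once per batch
    this.buffer = [];
  }
  produce(msg) {
    this.buffer.push(msg);
    if (this.buffer.length >= this.batchSize) this.flush();
  }
  flush() {
    // splice(0) empties the buffer and hands the batch to send().
    if (this.buffer.length) this.send(this.buffer.splice(0));
  }
}
```

With a batch size of 100, a thousand messages cost ten network round-trips instead of a thousand, which is where the request-overhead savings come from.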
Retention and storage policies
- Topics are configured with `retention.ms` (time-based) and `retention.bytes` (size-based); whichever limit is reached first causes older data to be purged.
- Typical defaults: `retention.ms` ≈ 7 days, `retention.bytes` ≈ 1 GB (these are examples; actual defaults vary by deployment).
- Increase retention when you need long-term replay, but account for storage costs and operational impact.
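The whichever-limit-hits-first behavior can be sketched over a list of records. Note this is a simplification: Kafka actually deletes whole log segments, not individual records:

```javascript
// Apply both retention policies: drop records older than retentionMs,
// then evict oldest-first until total size fits within retentionBytes.
// Simplified: Kafka deletes whole segments, not individual records.
function applyRetention(records, now, retentionMs, retentionBytes) {
  const kept = records.filter((r) => now - r.timestamp <= retentionMs);
  let total = kept.reduce((sum, r) => sum + r.bytes, 0);
  while (total > retentionBytes) {
    total -= kept.shift().bytes; // evict the oldest record first
  }
  return kept;
}
```

Raising either limit extends how far back consumers can replay, at a direct cost in broker disk usage.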
APIs and tools referenced
- CLI tools: `kafka-console-producer`, `kafka-console-consumer`.
- Client libraries: KafkaJS (JavaScript) examples include `producer.send`, `consumer.subscribe`, and `eachMessage`.
- Record fields to be aware of: key, value, timestamp, headers.
When to use Kafka (typical use cases)
- Asynchronous/background processing (e.g., job queues for video transcoding).
- Ordered processing scenarios (e.g., ticketing or waiting queues).
- Decoupling producers and consumers so they can scale independently (e.g., workload submission and workers).
- Stream processing and real-time aggregation (e.g., ad-click counters with Flink).
- Pub/sub for pushing events to many consumers at once (e.g., live comments or notifications).
Interview-focused advice
Follow a high-level design, then deep-dive into a few concrete areas to show technical depth. Suggested focus areas:
- Scalability: partition strategy and broker count.
- Fault tolerance & durability: replication factor and `acks`.
- Errors & retries: producer idempotency, retry topics, and DLQs.
- Performance optimizations: batching, compression, and partitioning.
- Retention policies and associated cost/performance trade-offs.
Be prepared to estimate capacity and defend choices. Mention managed Kafka as an operational alternative.
Resources and tutorials
- Written guide linked in the video description covering the same topics.
- Hello Interview site: detailed writeups and system-design breakdowns (examples: ad-click aggregator, YouTube transcoding, ticketing) and paid mock interviews.
- Other deep-dives and example problems (previous content included a Redis deep-dive).
Presenters and sources: Evan (former Meta staff engineer, co-founder of Hello Interview). Other contributors mentioned: Stefan (co-founder) and Christian (engineering manager at Meta).