Summary of "We Don't Need KV Cache Anymore?"
Technology summary (KV cache vs residual stream / “KV direct”)
Baseline problem (standard KV cache)
Using Gemma 3 12B (described as a “production-size” LLM), the speaker measures two things across ~20 turns:
- Memory usage from the KV cache (attention state).
- Wall clock time per turn.
The KV cache grows every turn and becomes a bottleneck:
- KV cache memory rises from tens or hundreds of MB up to roughly ~978 MB (nearly 1 GB) for a single conversation (a rough per-token estimate follows after this list).
- Wall-clock time per turn increases roughly from ~3s to ~13.4s by turn 20, attributed to the growing cache slowing inference.
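As a rough sanity check on the reported growth, KV memory scales linearly with the number of cached tokens. The sketch below uses assumed layer/head/dtype figures for a 12B-class model; they are illustrative assumptions, not numbers given in the talk:

```python
# Back-of-the-envelope KV-cache size: 2 (K and V) * layers * kv_heads * head_dim * bytes/element,
# times the number of cached tokens. Config values are assumptions, not numbers from the talk.
layers, kv_heads, head_dim, dtype_bytes = 48, 8, 256, 2  # assumed 12B-class config, bf16

bytes_per_token = 2 * layers * kv_heads * head_dim * dtype_bytes
print(f"KV per cached token: {bytes_per_token / 1024:.0f} KB")            # -> 384 KB

for tokens in (500, 1000, 2500):
    print(f"{tokens:>5} cached tokens -> {tokens * bytes_per_token / 2**20:.0f} MB")
# A few thousand cached tokens already lands near 1 GB, matching the trend reported above.
```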
KV direct / bounded KV cache approach
The speaker uses the same model and conversation, but enforces a fixed KV budget (example: 150 MB).
Reported behavior:
- Memory grows for a few turns, then stops increasing (stays flat).
- Wall-clock time per turn stays stable (around ~3.6–4.2s early), and remains far below unbounded KV cache even at turn 10/15/20.
Key claim: the standard KV cache optimization can backfire; once the cache grows large enough, storing and managing the precomputed KV costs more than simply recomputing it.
Core theoretical claim: KV cache is redundant given the residual stream (“Markov property”)
The speaker argues that the residual stream (the transformer’s main internal “data highway”) contains the complete computational state needed to continue generation.
They claim:
- The residual-stream state is about ~8 KB per token (described initially as a single vector; later narration mentions specific figures such as 3,840 numbers and a 2,560-dimensional map).
- The KV cache is a derived projection of that residual stream:
“If you have the state, you can recompute the functions” (KV tensors) with a single matrix multiply / linear projection per head.
Because the current state is sufficient, they equate this with a Markov property: future depends only on the present state, not on the full history.
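To make the “derived projection” claim concrete, here is a minimal sketch under assumed sizes. Real layers normalize the residual state before projecting and also apply rotary position embeddings to the keys; both are omitted here:

```python
import torch

d_model, kv_heads, d_head = 3840, 8, 256        # illustrative sizes, not the talk's exact config
W_k = torch.randn(kv_heads * d_head, d_model)   # a layer's key projection
W_v = torch.randn(kv_heads * d_head, d_model)   # a layer's value projection

h = torch.randn(d_model)                        # residual-stream state for one token at that layer
k, v = W_k @ h, W_v @ h                         # one matmul each: K and V are derived, not primary
print(h.numel() * 2 / 1024, "KB")               # ~7.5 KB per token in bf16, close to the quoted ~8 KB
```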
Proof plan / experiments the speaker says you can verify (open-source code)
The speaker outlines three tests using real models and code (a GitHub link is mentioned).
Test 1: Exact layer-by-layer recoverability
- Compute KV cache normally during the forward pass.
- Recompute KV (“K′V′”) purely from the residual stream at each layer.
- Compare KV values across all layers.
Result claimed:
- For Gemma 3 270M (smaller than the 12B-class mentioned earlier), with 18 layers and a 6-token prompt, the comparison shows:
- difference = exactly zero across all layers.
Conclusion: “KV cache is exactly recoverable from the residual.”
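A sketch of how such a check could be run with Hugging Face transformers, assuming a Gemma/Llama-style module layout (`model.model.layers[i].input_layernorm`, `.self_attn.v_proj`) and the `google/gemma-3-270m` checkpoint. Cache formats and call signatures vary by library version, and only V is compared here because an exact K match additionally needs RoPE (and, in Gemma 3, a per-head key norm):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "google/gemma-3-270m"   # the small model referenced in the talk (assumed repo id)
tok = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.float32).eval()

# Capture each layer's residual-stream input with forward pre-hooks.
residual_inputs = []
hooks = [layer.register_forward_pre_hook(lambda m, args: residual_inputs.append(args[0]))
         for layer in model.model.layers]

ids = tok("The quick brown fox jumps over", return_tensors="pt").input_ids
with torch.no_grad():
    out = model(ids, use_cache=True)
for h in hooks:
    h.remove()

# Compare cached V against V recomputed from each layer's residual input.
for i, layer in enumerate(model.model.layers):
    v_cached = out.past_key_values[i][1]                 # (batch, kv_heads, seq, head_dim); format may vary
    x = layer.input_layernorm(residual_inputs[i])        # the layer's own pre-attention norm
    v_re = layer.self_attn.v_proj(x)                     # single linear projection
    v_re = v_re.view(v_cached.shape[0], -1, v_cached.shape[1], v_cached.shape[3]).transpose(1, 2)
    print(i, (v_re - v_cached).abs().max().item())       # expected: 0.0 at every layer
```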
Test 2: Token-by-token generation correctness without persistent KV
Run generation side-by-side:
- Standard inference with KV cache.
- Inference where KV is recomputed from the residual stream, with no persistent KV state.
Result claimed:
- All 50 generated tokens match exactly.
- Memory usage remains tiny relative to KV growth.
They also note recomputation can be slower, since KV is repeatedly recomputed.
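The side-by-side comparison can be illustrated with a toy single-layer decoder (purely a sketch, not the talk's code): one path caches K/V as usual, the other stores only residual states and recomputes K/V from them at every step, and greedy decoding then produces identical tokens:

```python
import torch

torch.manual_seed(0)
d = 64
W_q, W_k, W_v, W_o = (torch.randn(d, d) * 0.05 for _ in range(4))  # toy attention weights
W_out = torch.randn(d, 100) * 0.05                                 # toy LM head (100-token vocab)
embed = torch.randn(100, d)

def attend(h_t, keys, values):
    """One attention step for the newest token, given K/V for all tokens so far."""
    q = h_t @ W_q
    att = torch.softmax(q @ torch.stack(keys).T / d ** 0.5, dim=-1)
    ctx = att @ torch.stack(values)
    return (h_t + ctx @ W_o) @ W_out                                # next-token logits

def generate(token0, n_steps, kv_cache):
    h = embed[token0]
    ks, vs, residuals, out = [], [], [], []
    for _ in range(n_steps):
        if kv_cache:                       # Path A: append K/V once, keep them around
            ks.append(h @ W_k)
            vs.append(h @ W_v)
        else:                              # Path B: keep residuals, recompute K/V every step
            residuals.append(h)
            ks = [r @ W_k for r in residuals]
            vs = [r @ W_v for r in residuals]
        nxt = attend(h, ks, vs).argmax().item()
        out.append(nxt)
        h = embed[nxt]
    return out

# Both paths attend over identical keys/values, so greedy outputs match token for token.
assert generate(3, 20, kv_cache=True) == generate(3, 20, kv_cache=False)
```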
Test 3 (“KV direct”): Precompute KV from residual once, then reuse
“KV direct” computes KV from the residual stream during prefill/refill, then stores that KV for subsequent tokens.
Claimed outcome:
- Still matches standard KV output token-for-token.
- Performance can approach standard KV cache speed.
- In multi-turn / repeated prefill scenarios, they claim KV direct can outstrip standard KV cache due to memory behavior.
Practical architecture described: bounded memory with residual “checkpoints”
Active window with bounded KV memory
- Set an active KV window budget (example: 150 MB).
- Claimed capacity: roughly ~400 tokens' worth of K and V tensors stored and “ready for attention” (see the quick check below).
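Under the same assumed per-token KV footprint as the earlier estimate (~384 KB for a 12B-class config), a 150 MB budget does indeed come out at roughly 400 tokens:

```python
budget_mb = 150
kv_bytes_per_token = 2 * 48 * 8 * 256 * 2       # K and V * layers * kv_heads * head_dim * bf16 bytes (assumed)
print(budget_mb * 2**20 // kv_bytes_per_token)  # -> 400 tokens fit in the 150 MB active window
```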
Sliding window without losing context (because residual holds state)
When the window slides and old tokens leave:
- The speaker claims context is not lost because the residual stream is already the compressed complete state.
- The residual stream’s Markov property allows continuation without storing full token history in working memory.
Residual checkpointing instead of token/KV history
When tokens leave the window:
- Store a residual checkpoint (only a few KB per position/state), not gigabytes of KV tensors.
- Rationale: the residual checkpoint already reflects the results of all layers (attention + feedforward), and KV tensors can be derived from it later with cheap projections (see the sketch below).
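A minimal sketch of the checkpoint mechanics as described (module names, sizes, and the plain LayerNorm are illustrative assumptions; note that an exact per-layer rebuild needs the residual input at that layer, whereas the talk quotes a single ~8 KB state per token):

```python
import torch
import torch.nn as nn

d_model, kv_heads, d_head = 3840, 8, 256
norm = nn.LayerNorm(d_model)                                  # stand-in for the layer's pre-attention norm
k_proj = nn.Linear(d_model, kv_heads * d_head, bias=False)    # that layer's key projection
v_proj = nn.Linear(d_model, kv_heads * d_head, bias=False)    # that layer's value projection

checkpoints = {}                                              # position -> residual state (a few KB each)

def evict(pos, residual):
    """Token leaves the active KV window: drop its K/V, keep only its residual state."""
    checkpoints[pos] = residual.detach().to(torch.bfloat16)   # 3,840 dims * 2 bytes ~ 7.5 KB

def rederive_kv(pos):
    """If the token is needed again, rebuild K and V with one cheap projection each.
    Shown for a single layer; rebuilding every layer exactly needs each layer's residual input."""
    x = norm(checkpoints[pos].float())
    return k_proj(x), v_proj(x)

evict(0, torch.randn(d_model))
k, v = rederive_kv(0)                                         # ready for attention again
```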
Token IDs kept as fallback
They still keep token IDs (example: ~10.2 KB for the full conversation) as a “ground truth”/safety net to reconstruct from if needed.
But the architecture’s “hero” remains the residual state, not the tokens.
Main takeaway
- KV cache grows and can become the real bottleneck in long multi-turn conversations.
- If the residual stream truly contains the complete state, then KV cache is an optimization, not a necessity:
- You can recompute KV from residual (or precompute once per refill) and avoid unbounded KV growth.
- Result: bounded-memory inference with stable latency and better scaling to very long contexts and/or smaller hardware.
Main speakers / sources
- Primary speaker: the presenter of the concept and experiments (no name is given in the subtitles); they mention a GitHub repo and an “open source” code link.
- Model sources referenced: Gemma 3 12B and Gemma 3 270M (Gemma-family LLMs).