Summary of "We Don't Need KV Cache Anymore?"

Technology summary (KV cache vs residual stream / “KV direct”)

Baseline problem (standard KV cache)

Using Gemma 3 12B (described as a "production-size" LLM), the speaker measures two things across ~20 turns:

  1. Memory usage from the KV cache (attention state).
  2. Wall clock time per turn.

The KV cache grows every turn and becomes a bottleneck.
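To make the growth concrete, here is a back-of-the-envelope sketch; the hyperparameters below are illustrative assumptions, not Gemma 3 12B's actual configuration. Every generated token adds a fixed-size slab of K and V to every layer, so memory scales linearly with conversation length.

```python
# Rough per-token KV cost: 2 tensors (K and V) per layer, each of size
# num_kv_heads * head_dim, stored in 2-byte floats.
# NOTE: these hyperparameters are illustrative, not Gemma 3 12B's real config.
num_layers = 48
num_kv_heads = 8
head_dim = 128
bytes_per_elem = 2  # fp16 / bf16

kv_bytes_per_token = 2 * num_layers * num_kv_heads * head_dim * bytes_per_elem
print(f"KV cache per token: {kv_bytes_per_token / 1024:.1f} KiB")  # 192.0 KiB

# The cache grows with every token of every turn:
for total_tokens in (1_000, 10_000, 100_000):
    mb = total_tokens * kv_bytes_per_token / 1024**2
    print(f"{total_tokens:>7} tokens -> {mb:,.0f} MB of KV cache")
```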

KV direct / bounded KV cache approach

The speaker uses the same model and conversation, but enforces a fixed KV budget (example: 150 MB).
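Under the same illustrative hyperparameters as the sketch above, a fixed budget translates directly into a hard cap on how many tokens' worth of K/V can stay resident:

```python
# How many tokens of KV fit in a fixed 150 MB budget?
# kv_bytes_per_token is taken from the illustrative config sketched earlier.
budget_bytes = 150 * 1024**2
kv_bytes_per_token = 196_608  # 192 KiB per token under the assumed config

max_resident_tokens = budget_bytes // kv_bytes_per_token
print(f"{max_resident_tokens} tokens of KV fit in the budget")  # 800
```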

Reported behavior: memory stays capped at the fixed budget instead of growing turn over turn.

Key claim: standard KV caching can backfire; once the cache grows large enough, the "saved computation" from precomputed KV costs more than simply recomputing it.


Core theoretical claim: KV cache is redundant given the residual stream (“Markov property”)

The speaker argues that the residual stream (the transformer’s main internal “data highway”) contains the complete computational state needed to continue generation.

They claim:

"If you have the state, you can recompute the functions": the K and V tensors can be re-derived with a single linear projection (one matrix multiply) per head.

Because the current state is sufficient, they equate this with a Markov property: the future depends only on the present state, not on the full history.
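A minimal PyTorch sketch of that claim, for a generic pre-norm transformer layer (the shapes and the placement of the input norm are assumptions, not a specific model's code): given the residual stream h entering a layer, K and V each fall out of one linear projection.

```python
import torch

def recompute_kv(h: torch.Tensor, norm, w_k: torch.Tensor, w_v: torch.Tensor):
    """Re-derive K and V for one layer from the residual stream.

    h:    residual stream entering the layer, shape (seq_len, d_model)
    norm: the layer's input normalization (e.g. RMSNorm), as in pre-norm blocks
    w_k:  key projection weights, shape (d_model, n_kv_heads * head_dim)
    w_v:  value projection weights, same shape as w_k
    """
    x = norm(h)      # pre-norm transformers normalize before attention
    k = x @ w_k      # one matmul recovers all keys for this layer
    v = x @ w_v      # one matmul recovers all values
    return k, v
```

Nothing else is needed: W_K and W_V are fixed model weights, so the residual stream alone determines KV.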


Proof plan / experiments the speaker says you can verify (open-source code)

The speaker outlines three tests using real models and code (a GitHub link is mentioned).

Test 1: Exact layer-by-layer recoverability

  1. Compute KV cache normally during the forward pass.
  2. Recompute KV (“K′V′”) purely from the residual stream at each layer.
  3. Compare KV values across all layers.

Result claimed: the recomputed K′V′ matches the cached KV at every layer.

Conclusion: “KV cache is exactly recoverable from the residual.”
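A hedged sketch of what such a check could look like. The harness below assumes access to each layer's input residuals and the K/V tensors the forward pass actually cached; the attribute names (layer.norm, layer.w_k, layer.w_v) are hypothetical, not from the speaker's repository.

```python
import torch

def verify_recoverability(model, residuals, cached_k, cached_v):
    """Test 1: recompute K'V' from the residual stream at every layer and
    compare against the KV the normal forward pass stored."""
    for i, layer in enumerate(model.layers):
        x = layer.norm(residuals[i])          # residual entering layer i
        k_prime = x @ layer.w_k
        v_prime = x @ layer.w_v
        # "Exact" recoverability, up to floating-point tolerance:
        assert torch.allclose(k_prime, cached_k[i], atol=1e-5)
        assert torch.allclose(v_prime, cached_v[i], atol=1e-5)
    print("K'V' matches the cached KV at every layer")
```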

Test 2: Token-by-token generation correctness without persistent KV

Run generation side-by-side: one run with a standard persistent KV cache, and one that recomputes KV from the residual stream at each step instead of storing it.

Result claimed: both runs produce identical tokens at every step.

They also note this variant can be slower, since K and V are recomputed at every step.
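One way such a side-by-side run could be structured (a sketch only: step_cached, step_from_residuals, and prefill_residuals are hypothetical interfaces invented for illustration, not real library calls or the speaker's code):

```python
import torch

def side_by_side(model, prompt_ids: torch.Tensor, n_new: int = 64):
    """Decode twice with greedy sampling: once using a persistent KV cache,
    once rebuilding K/V from stored residual streams at every step."""
    a, b = prompt_ids.clone(), prompt_ids.clone()
    cache = None
    residuals = model.prefill_residuals(prompt_ids)
    for _ in range(n_new):
        logits_a, cache = model.step_cached(a, cache)
        logits_b, residuals = model.step_from_residuals(b, residuals)
        tok_a = logits_a[:, -1].argmax(-1, keepdim=True)
        tok_b = logits_b[:, -1].argmax(-1, keepdim=True)
        assert torch.equal(tok_a, tok_b), "outputs diverged"
        a = torch.cat([a, tok_a], dim=-1)
        b = torch.cat([b, tok_b], dim=-1)
    print(f"identical outputs for {n_new} tokens")
```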

Test 3 (“KV direct”): Precompute KV from residual once, then reuse

"KV direct" computes KV from the residual stream once during prefill/refill, then stores it for reuse on later tokens.

Claimed outcome: generation matches the baseline while avoiding the per-step recomputation cost noted in Test 2.
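A sketch of the compute-once-then-reuse idea, using the same hypothetical per-layer attributes as the Test 1 sketch: KV is derived from the residual stream at prefill time, stored, and then used exactly like an ordinary cache.

```python
def kv_direct_prefill(model, prompt_ids):
    """Derive KV from the residual stream once at prefill time, then return
    it as a cache for ordinary decoding. Interfaces are hypothetical."""
    residuals = model.prefill_residuals(prompt_ids)  # per-layer residual streams
    cache = []
    for layer, h in zip(model.layers, residuals):
        x = layer.norm(h)
        cache.append((x @ layer.w_k, x @ layer.w_v))  # stored once, then reused
    return cache
```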


Practical architecture described: bounded memory with residual “checkpoints”

Active window with bounded KV memory

Sliding window without losing context (because residual holds state)

When the window slides and old tokens leave: their KV entries are evicted so memory stays at the fixed budget, but context is not lost, because the residual stream already carries their state.

Residual checkpointing instead of token/KV history

When tokens leave the window: a residual-stream checkpoint is stored in their place, instead of the raw token/KV history.
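A minimal sketch of that eviction policy, assuming the per-layer residual state is available when a token ages out of the window (the data layout is illustrative, not the speaker's implementation):

```python
from collections import deque

class BoundedKVWindow:
    """Keep KV for at most `window` recent tokens; when an older token is
    evicted, keep a residual-stream checkpoint instead of its K/V."""
    def __init__(self, window: int):
        self.window = window
        self.kv = deque()       # (k, v) per token in the active window
        self.checkpoints = []   # residual snapshots for evicted tokens

    def append(self, k, v, residual):
        self.kv.append((k, v))
        if len(self.kv) > self.window:
            self.kv.popleft()                  # oldest token's KV is dropped...
            self.checkpoints.append(residual)  # ...its residual state is kept
```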

Token IDs kept as fallback

They still keep the token IDs (example: ~10.2 KB for the full conversation) as a "ground truth" safety net for reconstructing context if needed.

But the architecture’s “hero” remains the residual state, not the tokens.
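The fallback is cheap because token IDs are minuscule next to KV tensors. Assuming 4-byte IDs and the illustrative per-token KV size from earlier, the gap is several orders of magnitude:

```python
# Token IDs vs KV cache footprint for the same conversation.
bytes_per_id = 4                  # assuming int32 token IDs
kv_bytes_per_token = 196_608      # illustrative per-token KV size from above

n_tokens = 10_200 // bytes_per_id # ~10.2 KB of IDs -> ~2,550 tokens
kv_mb = n_tokens * kv_bytes_per_token / 1024**2
print(f"{n_tokens} tokens: IDs ~10.2 KB vs KV ~{kv_mb:,.0f} MB")  # ~478 MB
```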


Main takeaway

The speaker's bottom line: the residual stream already carries the model's full computational state, so KV can be recomputed from it on demand. The unboundedly growing KV cache can therefore be replaced by a fixed-budget active window plus residual checkpoints, with token IDs kept as a cheap fallback.

