Summary of "LLMs Are Databases - So Query Them"
Overview
This video argues that transformer-based LLMs, or at least their FFN/knowledge component, can be treated as a literal graph database rather than a metaphorical one. The speaker claims:
- The model’s internal weights correspond to a physical graph of entities (nodes) and features/slots (edges).
- Relations are represented as edge labels.
- A custom query language can read, trace, and write facts in the model without training.
Core technical claim: LLM weights as a graph database
The model is described as having a three-stage architecture that maps directly onto transformer parts:
- Early layers: interpret syntax and context
- Middle layers: store and retrieve knowledge via graph-like edge/feature mechanics
- Output layers: commit to generated tokens
Tooling / method: Larql queries over model weights
The speaker uses Larql, described as a “language to query large language models.” They connect Larql to the weights of Google Gemma 3 (34B), mentioning Gemma 4 as a future topic.
They probe the internal representations and map them into a knowledge-graph-like structure:
- Example statistic: the mapped internal graph contains 1,785 features (per their probe/stats).
- They also describe layer bands aligned with:
- syntax
- knowledge
- output
“Describe” and entity-level browsing (read capabilities)
Using `describe France`, the speaker claims the model’s internal graph can be browsed as follows (a toy sketch follows the list):
- Early layers: detect “country” cues and query syntax signals (including signals like Spanish/international markers).
- Middle layers: contain knowledge facts for France (e.g., tags related to Europe/Italy, borders-related attributes, etc.).
- Output layers: select plausible answer tokens, with alternatives appearing depending on context.
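To make the claimed output shape concrete, here is a minimal Python sketch of what a `describe` result could look like once feature hits are grouped into the three layer bands. The band boundaries, hit tuples, and field layout are our assumptions for illustration, not Larql’s actual output format.

```python
from collections import defaultdict

# Hypothetical (layer, feature_id, label) hits for one entity; the values
# are illustrative stand-ins, not real probe output.
hits = [
    (3, 812, "country cue"),
    (5, 77, "proper-noun syntax"),
    (14, 5067, "borders"),
    (18, 2210, "Europe"),
    (27, 901, "answer token: Paris"),
]

def band(layer: int) -> str:
    """Map a layer index to the speaker's three bands (boundaries assumed)."""
    if layer < 10:
        return "syntax"
    if layer < 25:
        return "knowledge"
    return "output"

grouped = defaultdict(list)
for layer, feat, label in hits:
    grouped[band(layer)].append((layer, feat, label))

for b in ("syntax", "knowledge", "output"):
    print(b, "->", grouped[b])
```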
They also emphasize polysemanticity (overlapping, noisy feature representations):
- A single feature/slot can represent multiple unrelated concepts because compressed representations can overlap.
- Example claim: features may store “France” alongside other countries or relations in the same slot.
SQL-like graph queries (edges/features as database tables)
The speaker demonstrates queries analogous to SQL over the internal graph.
Query borders for France
Example (as described, with a score filter applied):

```
select * from edges where entity = France and relation = borders
```
The claim is that:
- The “France borders …” fact appears as a single strong stored relation.
- A concrete mapping is reported: layer 25, feature 5067 represents “borders” for “country” token variants (including spelling/capitalization variants).
Query nationality for France
Example:

```
select * from edges where entity = France and relation = nationality
```
They claim nationality-related outputs cluster across features/layers (e.g., Germany/Sweden/Italy; plus other sets in different layers).
Nearest-neighbor queries
Example:

```
select * from edges nearest to France at layer 26 limit 10
```
This is used to show “country clusters” around France, with other countries appearing in the nearest set.
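A hedged sketch of the query semantics in this section, assuming the probe has already materialized weights into an edge table of (entity, relation, target, layer, feature, score) rows. The rows, score threshold, and entity vectors below are illustrative stand-ins, not Larql internals.

```python
import numpy as np

# Toy edge table; values are made up for illustration.
edges = [
    {"entity": "France", "relation": "borders", "target": "Italy",
     "layer": 25, "feature": 5067, "score": 0.91},
    {"entity": "France", "relation": "nationality", "target": "Germany",
     "layer": 22, "feature": 310, "score": 0.64},
    {"entity": "France", "relation": "capital", "target": "Paris",
     "layer": 26, "feature": 4120, "score": 0.88},
]

def select_edges(entity, relation, min_score=0.5):
    """select * from edges where entity = ... and relation = ... (score filter)."""
    return [e for e in edges
            if e["entity"] == entity and e["relation"] == relation
            and e["score"] >= min_score]

# Nearest-neighbor query: cosine similarity between entity vectors at a layer.
vecs = {
    "France": np.array([0.9, 0.1, 0.3]),
    "Germany": np.array([0.8, 0.2, 0.3]),
    "Japan": np.array([0.2, 0.9, 0.1]),
}

def nearest(entity, limit=10):
    """select * from edges nearest to <entity> at a given layer, limit N."""
    q = vecs[entity] / np.linalg.norm(vecs[entity])
    sims = {name: float(v @ q / np.linalg.norm(v))
            for name, v in vecs.items() if name != entity}
    return sorted(sims.items(), key=lambda kv: -kv[1])[:limit]

print(select_edges("France", "borders"))
print(nearest("France", limit=2))
```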
FFN “features” explained as edge mechanics
The speaker claims:
- A feature corresponds to a column/gate in the FFN.
- One gate vector decides when the feature fires.
- One “down” vector contributes to pushing next-token prediction.
- Feature activation depends on cosine similarity between:
- the layer’s residual state and
- the feature’s gate direction.
They interpret the FFN as the knowledge store:
- FFN = graph
- Attention = routing/navigation
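A minimal NumPy sketch of this reading of the FFN, under stated assumptions: each feature i has a gate direction (decides firing) and a down vector (what it writes back), and firing strength is modeled as ReLU’d cosine similarity between the residual state and the gate. Gemma’s actual FFN uses GeGLU-style gating over dot products, so this is an analogy to the speaker’s description, not the exact computation.

```python
import numpy as np

d_model, n_features = 8, 4
rng = np.random.default_rng(0)

W_gate = rng.standard_normal((n_features, d_model))  # gate directions g_i
W_down = rng.standard_normal((n_features, d_model))  # down vectors d_i

def ffn_as_edges(resid: np.ndarray) -> np.ndarray:
    """One FFN read: fire features whose gate aligns with the residual."""
    g_norm = W_gate / np.linalg.norm(W_gate, axis=1, keepdims=True)
    r_norm = resid / np.linalg.norm(resid)
    acts = np.maximum(g_norm @ r_norm, 0.0)  # cosine similarity, ReLU'd
    return resid + acts @ W_down             # fired features update the residual

resid = rng.standard_normal(d_model)
print(ffn_as_edges(resid))
```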
Probing feature reuse across layers (why polysemy happens)
The speaker shows that feature indices are reused at every layer, but their semantic meaning changes across layers.
- Example pattern: the same feature ID at different layers corresponds to different concepts (e.g., “planet” in one layer, “foods” in another, capital-related concepts elsewhere).
- They attribute polysemanticity to:
- scalar (one-dimensional) activations per feature compressing a lot of context
- limited capacity / “dimensionality constraint”
- features that cannot disambiguate which sense/context triggered them
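One way to check feature-index reuse, sketched here with random stand-in weights: decode the same feature index’s down vector at several layers through the unembedding and compare top tokens. With real Gemma weights the top tokens would differ per layer, which is the reuse/polysemy effect described; `W_down`, `W_unembed`, and all shapes below are assumptions.

```python
import numpy as np

d_model, vocab, n_layers, feat_id = 8, 12, 3, 5
rng = np.random.default_rng(1)

W_down = rng.standard_normal((n_layers, 16, d_model))  # per-layer down vectors
W_unembed = rng.standard_normal((d_model, vocab))      # shared unembedding
tokens = [f"tok{i}" for i in range(vocab)]

for layer in range(n_layers):
    logits = W_down[layer, feat_id] @ W_unembed        # decode one feature
    top = np.argsort(logits)[::-1][:3]
    print(f"layer {layer}, feature {feat_id}:", [tokens[t] for t in top])
```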
Relation schema discovery (“show relations”)
They use the `show relations` command to list discovered relation types and counts.
- The claim: the probe discovered ~1,489 relation labels.
- Top relations resemble a natural knowledge graph schema, e.g.:
- manufacturer
- league
- genre
- language
- capital
- award/Nobel-related
Key point: the speaker claims this relational structure was learned during training, not manually provided.
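Mimicking `show relations` over a toy edge table reduces to a frequency count of relation labels; with the real probe, this is where the ~1,489 learned relation types would surface. The rows below are made up for illustration.

```python
from collections import Counter

# Toy edges; only the relation field matters for schema discovery.
edges = [
    {"entity": "France", "relation": "borders"},
    {"entity": "France", "relation": "capital"},
    {"entity": "BMW", "relation": "manufacturer"},
    {"entity": "Audi", "relation": "manufacturer"},
]

schema = Counter(e["relation"] for e in edges)
for relation, count in schema.most_common():
    print(relation, count)
```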
Inference tracing: answer derivation via “graph walk”
The speaker demonstrates actual inference:
`infer` on “capital of France …” returns Paris as the dominant next token (top five candidates shown).
A key Larql capability highlighted:
- Inference trace across layers, indicating which features activate and how attention routes the path through the graph.
They claim their Larql inference uses:
- FFN as a graph walk (rather than matrix multiplication)
- A “V-index” format that decomposes matrices into graph-structured components
- A per-layer KNN lookup:
- find features nearest to the current residual state
- fired features update the residual
- Attention remains dense routing via QKV projections, selecting which paths/features are used.
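A sketch of the “FFN as graph walk” idea as we read it: each layer does a KNN lookup of the residual state against gate directions, and only the k nearest features fire and update the residual, replacing the dense matmul. This is an interpretation of the description above, not the actual V-index implementation.

```python
import numpy as np

d_model, n_features, k = 8, 64, 4
rng = np.random.default_rng(2)

# Three layers of (gate, down) weight pairs, random stand-ins.
layers = [(rng.standard_normal((n_features, d_model)),
           rng.standard_normal((n_features, d_model)))
          for _ in range(3)]

def knn_walk(resid):
    for W_gate, W_down in layers:
        g = W_gate / np.linalg.norm(W_gate, axis=1, keepdims=True)
        sims = g @ (resid / np.linalg.norm(resid))
        top = np.argsort(sims)[::-1][:k]       # k nearest features to the state
        acts = np.maximum(sims[top], 0.0)
        resid = resid + acts @ W_down[top]     # fired features update the residual
    return resid

print(knn_walk(rng.standard_normal(d_model)))
```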
Write capability: insert facts (training-free editing)
A major demonstration is described in two phases.
1) Baseline: hallucinations for unknown facts
- For “capital of Atlantis is …”, the model hallucinates, producing hedging continuations (e.g., “believed”/“said”-style outputs).
- The speaker claims there is no real stored fact for Atlantis initially.
2) Insert into the graph
They perform an insert:
```
insert into edges (entity, relation, target) values (Atlantis, capital, Poseidon)
```
After insertion:
`infer` on “capital of Atlantis is …” now predicts Poseidon at ~99.98%.
They then check for leakage/breakage:
- “capital of France …” still returns Paris at ~81%.
Patch overlay vs compilation
Inserted edits initially live in a patch overlay:
- Base weights remain read-only during the session.
- Runtime overlay applies changes on top.
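A minimal overlay sketch, assuming each inserted fact is stored as one extra (gate, down) pair that is appended at query time while the base matrices stay read-only. Names and shapes are illustrative; the video does not show the overlay’s actual data layout.

```python
import numpy as np

d_model = 8
rng = np.random.default_rng(3)

base_gate = rng.standard_normal((16, d_model))  # read-only base weights
base_down = rng.standard_normal((16, d_model))

overlay = []  # session-local patch: list of (gate_dir, down_vec) edits

def insert_edge(context_dir, target_dir):
    """insert into edges ...: fire on the context, push the target."""
    overlay.append((context_dir, target_dir))

def effective_weights():
    """Base weights with the runtime overlay applied on top."""
    if not overlay:
        return base_gate, base_down
    og = np.stack([g for g, _ in overlay])
    od = np.stack([d for _, d in overlay])
    return np.vstack([base_gate, og]), np.vstack([base_down, od])

insert_edge(rng.standard_normal(d_model), rng.standard_normal(d_model))
W_gate, W_down = effective_weights()
print(W_gate.shape)  # (17, 8): 16 read-only base features + 1 patched edge
```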
To make changes permanent, they demonstrate compilation:
```
compile current into V-index temp Atlantis.Vindex
```

They describe this approach as using a technique they call “Memmet” to bake edits into a standalone index.
They can export and reload:
- export to safetensors or GGUF
- reconnect Larql to the compiled V-index for a fresh session
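The export step could look like the following with the real safetensors library: save merged tensors to disk, then reload them in a fresh session. Only `save_file`/`load_file` are actual safetensors APIs; the tensor names and shapes are made up to match the toy overlay above.

```python
import numpy as np
from safetensors.numpy import save_file, load_file

# Hypothetical merged tensors: base features plus one patched edge (17 rows).
merged = {
    "ffn.gate": np.random.default_rng(4).standard_normal((17, 8)).astype(np.float32),
    "ffn.down": np.random.default_rng(5).standard_normal((17, 8)).astype(np.float32),
}

save_file(merged, "atlantis_patch.safetensors")     # bake edits into a file
reloaded = load_file("atlantis_patch.safetensors")  # fresh-session reload
print(reloaded["ffn.gate"].shape)
```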
Claimed implications (product/tech impact)
Because:
- FFN execution can be treated like a graph walk (KNN-based), while
- attention performs routing,
the speaker argues:
- attention and the knowledge store can be decoupled
- the knowledge store could live on a different server
- more efficient loading/execution may be possible
- editing a model without training becomes possible by inserting into the graph database
Future items mentioned
- Test/extend the approach to Gemma 4 soon.
- Future videos may cover:
- larger models running locally (laptop feasibility is speculated)
- deeper implications of decoupling attention from the knowledge store
- training-free construction of models from inserted knowledge
Main speakers / sources
- Main speaker: the video’s author/presenter (no specific name given in the subtitles).
- Model source mentioned:
- Google Gemma 3 (34B) weights
- Gemma 4 as a future test