Summary of "Chroma For Code Part 1: Chunking a Codebase for Code Search"

High-level summary

Key technologies and products

Chunking strategy (design and algorithm)

Guiding principle:

Chunk into self-contained logical units (functions, classes, interfaces, etc.) so returned chunks are directly useful to the model.

Main steps:

  1. Configure tree-sitter per language (example: TypeScript/TSX) and define a set of “wanted node” types to select from the AST.
  2. collect_tree_nodes:
    • Recursively traverse the AST.
    • Add nodes whose type is in the wanted set.
    • Optionally record ancestor lineage (symbol path).
  3. Sort selected nodes by start line and walk the file sequentially, so that gaps between them (imports, constants, and other code outside any wanted node) are detected and chunked as well.
  4. For each selected node:
    • Extract the source span and node metadata (symbol name, parent chain).
    • If the node span exceeds the token limit, split it further on line boundaries (never mid-line), using a tokenizer to keep each chunk under the max token count.
  5. split_by_tokens:
    • Iterate lines, tokenize per-line, accumulate until adding a line would exceed max tokens, then flush the chunk.
    • Track start/end line numbers for metadata.
  6. Skip blank spans (don’t index empty chunks).
  7. Add common metadata for every chunk: file path, language, line range, and (for wanted nodes) symbol name.
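
The steps above can be sketched roughly as follows. This is a minimal illustration, not the code from the video: the `Node` dataclass is a hypothetical stand-in for a tree-sitter node (real nodes expose similar type, children, and position fields), the `WANTED` set and the metadata key names are illustrative, the whitespace token counter stands in for a real tokenizer, and skipping wanted nodes nested inside an already-emitted node is an assumption of this sketch.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """Hypothetical stand-in for a tree-sitter node (0-based, inclusive lines)."""
    type: str
    start_line: int
    end_line: int
    name: str = ""
    children: list["Node"] = field(default_factory=list)


# Illustrative "wanted node" types for TypeScript.
WANTED = {"function_declaration", "class_declaration", "interface_declaration"}


def collect_tree_nodes(node, wanted, lineage=()):
    """Step 2: recursively collect (node, symbol_path) pairs for wanted types."""
    found, path = [], lineage
    if node.type in wanted:
        path = lineage + (node.name,)
        found.append((node, path))
    for child in node.children:
        found.extend(collect_tree_nodes(child, wanted, path))
    return found


def split_by_tokens(lines, first_line, max_tokens, count_tokens):
    """Step 5: yield (start_line, end_line, text), flushing before a line would
    push the chunk over max_tokens. A single over-long line is kept whole."""
    buf, used, start = [], 0, first_line
    for offset, line in enumerate(lines):
        n = count_tokens(line)
        if buf and used + n > max_tokens:
            yield start, first_line + offset - 1, "\n".join(buf)
            buf, used, start = [], 0, first_line + offset
        buf.append(line)
        used += n
    if buf:
        yield start, first_line + len(lines) - 1, "\n".join(buf)


def chunk_file(source, root, path, language, max_tokens,
               count_tokens=lambda s: len(s.split())):
    """Steps 3-7: emit wanted nodes and the gaps between them as chunks."""
    lines = source.splitlines()
    chunks = []

    def emit(start, end, symbol=None):
        pieces = split_by_tokens(lines[start:end + 1], start, max_tokens, count_tokens)
        for s, e, text in pieces:
            if not text.strip():
                continue  # step 6: skip blank spans
            meta = {"file_path": path, "language": language,
                    "start_line": s + 1, "end_line": e + 1}  # 1-based for display
            if symbol:
                meta["symbol"] = symbol
            chunks.append({"text": text, "metadata": meta})

    cursor = 0
    selected = sorted(collect_tree_nodes(root, WANTED), key=lambda p: p[0].start_line)
    for node, symbol_path in selected:
        if node.start_line < cursor:
            continue  # assumption: nested inside a node that was already emitted
        if node.start_line > cursor:
            emit(cursor, node.start_line - 1)  # gap: imports, constants, etc.
        emit(node.start_line, node.end_line, ".".join(symbol_path))
        cursor = node.end_line + 1
    if cursor < len(lines):
        emit(cursor, len(lines) - 1)  # trailing gap after the last node
    return chunks
```

With a real parser, `root` would come from the tree-sitter parse tree and `count_tokens` from the tokenizer that matches the embedding model. Counting tokens per line and summing is an approximation of tokenizing the joined chunk, so leaving some headroom below the model's actual limit is a reasonable precaution.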

Notes:

Embedding and ingestion details

Search and query capabilities demonstrated

Practical demo

Design decisions and caveats

Next steps in the series

Main speaker / sources


