Summary of "Chroma For Code Part 1: Chunking a Codebase for Code Search"
High-level summary
- Goal: build a code-search / coding-agent pipeline that uses Chroma to provide an LLM with focused, relevant code context (context engineering) instead of dumping whole files. This reduces “context rot” and improves model performance.
- Outcome of the series: index a repo into Chroma, keep the index up-to-date, and build a simple coding agent that queries the indexed code.
Key technologies and products
- Chroma: vector database / retrieval system for storing embedded code chunks and running semantic and other retrievals.
- Chroma Cloud: used in the demo for dashboard visualization and interactive queries.
- Tree-sitter: fast parser used to build an AST and extract language constructs (classes, functions, interfaces, methods, etc.) as chunk candidates.
- OpenAI embeddings: example embedding model used (text-embedding-3-large).
- tiktoken (tokenizer): used to count tokens per chunk to respect embedding model token limits.
- Regex, full-text search, and metadata filtering: combined with semantic retrieval for precise queries.
Chunking strategy (design and algorithm)
Guiding principle:
Chunk into self-contained logical units (functions, classes, interfaces, etc.) so returned chunks are directly useful to the model.
Main steps:
- Configure tree-sitter per language (example: TypeScript/TSX) and define a set of “wanted node” types to select from the AST.
- collect_tree_nodes:
  - Recursively traverse the AST.
  - Add nodes whose type is in the wanted set.
  - Optionally record ancestor lineage (the symbol path).
- Sort selected nodes by start line and process the file sequentially to detect and handle gaps (imports, constants, other code not in wanted nodes).
- For each selected node:
  - Extract the source span and node metadata (symbol name, parent chain).
  - If the node span exceeds the token limit, split it further by line (not mid-line) using a tokenizer to keep chunks under max tokens.
- split_by_tokens:
  - Iterate lines, tokenize per line, accumulate until adding a line would exceed max tokens, then flush the chunk.
  - Track start/end line numbers for metadata.
  - Skip blank spans (don't index empty chunks).
- Add common metadata for every chunk: file path, language, line range, and (for wanted nodes) symbol name.
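The traversal step can be sketched as below. This is a minimal sketch, not the article's exact code: `Node` is a simplified stand-in for a tree-sitter node (real tree-sitter nodes expose `type`, `children`, and `start_point`/`end_point` instead), and the wanted-node set is an illustrative TypeScript selection.

```python
from dataclasses import dataclass, field

# Simplified stand-in for a tree-sitter AST node (the real library exposes
# node.type, node.children, node.start_point, node.end_point).
@dataclass
class Node:
    type: str
    name: str = ""
    start_line: int = 0
    children: list = field(default_factory=list)

# Node types to lift out as chunk candidates (illustrative TypeScript set).
WANTED = {
    "class_declaration",
    "function_declaration",
    "interface_declaration",
    "method_definition",
}

def collect_tree_nodes(node, lineage=(), out=None):
    """Recursively walk the AST, collecting wanted nodes with their symbol path."""
    if out is None:
        out = []
    if node.type in WANTED:
        out.append({"node": node, "symbol_path": "/".join(lineage + (node.name,))})
        lineage = lineage + (node.name,)  # record ancestry for nested symbols
    # Recursing into wanted nodes means inner symbols (e.g. a method inside a
    # class) are collected too, producing the overlapping chunks discussed later.
    for child in node.children:
        collect_tree_nodes(child, lineage, out)
    return out
```

The collected nodes would then be sorted by start line so gaps between them (imports, constants) can be detected and chunked separately.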
Notes:
- Splitting is line-based to avoid breaking syntax mid-line.
- Keep line-range metadata so entire files can be reconstructed if needed.
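The line-based split can be sketched as follows. The token counter is a pluggable callable here — the demo uses a tiktoken encoder, but any `str -> int` function (such as a crude whitespace word count) illustrates the logic; names and the chunk-dict shape are assumptions for this sketch.

```python
def split_by_tokens(source: str, max_tokens: int, count_tokens, start_line: int = 1):
    """Split source into line-aligned chunks, each at most max_tokens.

    Never breaks mid-line, tracks start/end line numbers for metadata,
    and skips blank spans so empty chunks are never indexed.
    """
    lines = source.splitlines()
    chunks = []
    buf, buf_tokens, chunk_start = [], 0, start_line
    for i, line in enumerate(lines):
        line_no = start_line + i
        t = count_tokens(line)
        # Flush before a line that would push the chunk over the limit.
        if buf and buf_tokens + t > max_tokens:
            text = "\n".join(buf)
            if text.strip():  # skip blank spans
                chunks.append({"text": text, "start_line": chunk_start,
                               "end_line": line_no - 1})
            buf, buf_tokens, chunk_start = [], 0, line_no
        buf.append(line)
        buf_tokens += t
    text = "\n".join(buf)
    if text.strip():  # flush the trailing chunk
        chunks.append({"text": text, "start_line": chunk_start,
                       "end_line": start_line + len(lines) - 1})
    return chunks
```

Because chunks carry contiguous line ranges, a whole file can be reassembled by concatenating its chunks in line order.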
Embedding and ingestion details
- Use a tokenizer to measure tokens and free encoder resources when done.
- Respect the embedding model’s token limits; choose the splitting strategy accordingly (demo uses a line-based split).
- Batch inserts into Chroma (demo uses batches of 100).
- Example environment: Chroma Cloud client with credentials supplied via environment variables.
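Batched ingestion might look like the sketch below. The id scheme (file path plus line range) and metadata fields are illustrative, not the article's exact code; `collection` is anything exposing Chroma's `add(ids=..., documents=..., metadatas=...)` method, and the commented-out client setup shows credentials coming from environment variables as in the demo.

```python
BATCH_SIZE = 100  # the demo inserts in batches of 100

def batched(items, size=BATCH_SIZE):
    """Yield successive fixed-size slices of items."""
    for i in range(0, len(items), size):
        yield items[i:i + size]

def ingest_chunks(collection, chunks):
    """Insert chunk dicts into a Chroma collection in batches."""
    for batch in batched(chunks):
        collection.add(
            ids=[f'{c["file_path"]}:{c["start_line"]}-{c["end_line"]}'
                 for c in batch],
            documents=[c["text"] for c in batch],
            metadatas=[{"file_path": c["file_path"],
                        "language": c["language"],
                        "start_line": c["start_line"],
                        "end_line": c["end_line"],
                        # symbol only present for wanted-node chunks
                        **({"symbol": c["symbol"]} if c.get("symbol") else {})}
                       for c in batch],
        )

# Chroma Cloud client with credentials from the environment (recent chromadb
# releases expose CloudClient; exact parameters may vary by version):
# client = chromadb.CloudClient(api_key=os.environ["CHROMA_API_KEY"],
#                               tenant=os.environ["CHROMA_TENANT"],
#                               database=os.environ["CHROMA_DATABASE"])
```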
Search and query capabilities demonstrated
- Semantic (vector) search via Chroma + embeddings (example query: “how is scrolling handled?”).
- Regex search to find code patterns (example: find all calls to setError).
- Metadata filtering to locate specific symbols or scope (e.g., symbol = “ChatMessage” or file_path = “app/page.tsx”).
- Full-text search combined with metadata filters (e.g., search for “import” scoped to a single file).
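The four retrieval modes map onto Chroma's query parameters. The sketch below only constructs the argument payloads: the `$regex` and `$contains` operators follow Chroma's documented `where_document` syntax (regex support depends on the Chroma version), and the metadata field names mirror those added during chunking.

```python
# Semantic (vector) search: embed the query text and rank by similarity.
semantic = {"query_texts": ["how is scrolling handled?"], "n_results": 5}

# Regex search over document text, e.g. all calls to setError(...).
regex = {"where_document": {"$regex": r"setError\("}}

# Metadata filtering: scope by symbol or by file path.
by_symbol = {"where": {"symbol": "ChatMessage"}}
by_file = {"where": {"file_path": "app/page.tsx"}}

# Full-text search combined with a metadata filter: "import" in one file.
fulltext_scoped = {
    "where_document": {"$contains": "import"},
    "where": {"file_path": "app/page.tsx"},
}

# These payloads would be passed to collection.query(**semantic) for vector
# search, or collection.get(**fulltext_scoped) for non-vector retrieval.
```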
Practical demo
- Indexed a Next.js repo (TypeScript/TSX) into a Chroma Cloud collection using the chunking pipeline.
- The Chroma Cloud dashboard renders chunks as formatted code when language metadata is provided.
- Demonstrated semantic hits, regex results, and metadata-filtered results working as intended.
Design decisions and caveats
- Granularity choice matters:
- Collecting only outer nodes avoids duplicates.
- Collecting inner nodes (post-order) leads to overlapping chunks (inner function and enclosing class), which may be desired depending on retrieval strategy.
- Must handle very large nodes by splitting by token/line.
- Maintain line-range metadata for reconstruction and context.
- Avoid indexing blank code spans.
Next steps in the series
- Efficiently index entire repositories.
- Keep the index up to date on code changes.
- Build a coding agent powered by Chroma that combines retrievals with an LLM.
Main speaker / sources
- Speaker: Itai
- Technologies/products referenced: Chroma (and Chroma Cloud), Tree-sitter, OpenAI embeddings (text-embedding-3-large), tiktoken (tokenizer).
Category
Technology