Summary of "Chroma For Code Part 1: Chunking a Codebase for Code Search"
High-level summary
- Goal: build a code-search / coding-agent pipeline that uses Chroma to provide an LLM with focused, relevant code context (context engineering) instead of dumping whole files. This reduces “context rot” and improves model performance.
- Outcome of the series: index a repo into Chroma, keep the index up-to-date, and build a simple coding agent that queries the indexed code.
Key technologies and products
- Chroma: vector database / retrieval system for storing embedded code chunks and running semantic and other retrievals.
- Chroma Cloud: used in the demo for dashboard visualization and interactive queries.
- Tree-sitter: fast parser used to build an AST and extract language constructs (classes, functions, interfaces, methods, etc.) as chunk candidates.
- OpenAI embeddings: example embedding model used (text-embedding-3-large).
- tiktoken (tokenizer): used to count tokens per chunk to respect embedding model token limits.
- Regex, full-text search, and metadata filtering: combined with semantic retrieval for precise queries.
Chunking strategy (design and algorithm)
Guiding principle:
Chunk into self-contained logical units (functions, classes, interfaces, etc.) so returned chunks are directly useful to the model.
Main steps:
- Configure tree-sitter per language (example: TypeScript/TSX) and define a set of “wanted node” types to select from the AST.
- collect_tree_nodes:
  - Recursively traverse the AST.
  - Add nodes whose type is in the wanted set.
  - Optionally record ancestor lineage (the symbol path).
- Sort selected nodes by start line and process the file sequentially to detect and handle gaps (imports, constants, other code not in wanted nodes).
- For each selected node:
  - Extract the source span and node metadata (symbol name, parent chain).
  - If the node span exceeds the token limit, split it further by line (not mid-line) using a tokenizer to keep chunks under max tokens.
- split_by_tokens:
  - Iterate lines, tokenize per line, accumulate until adding a line would exceed max tokens, then flush the chunk.
  - Track start/end line numbers for metadata.
  - Skip blank spans (don't index empty chunks).
- Add common metadata for every chunk: file path, language, line range, and (for wanted nodes) symbol name.
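The traversal step can be sketched as below. This is a minimal sketch, not the article's exact code: `Node` is a simplified stand-in for a tree-sitter node (real tree-sitter nodes expose `type`, `children`, and `start_point`/`end_point` instead), and the wanted-node set is an illustrative TypeScript selection.

```python
from dataclasses import dataclass, field

# Simplified stand-in for a tree-sitter AST node (the real library exposes
# node.type, node.children, node.start_point, node.end_point).
@dataclass
class Node:
    type: str
    name: str = ""
    start_line: int = 0
    children: list = field(default_factory=list)

# Node types to lift out as chunk candidates (illustrative TypeScript set).
WANTED = {
    "class_declaration",
    "function_declaration",
    "interface_declaration",
    "method_definition",
}

def collect_tree_nodes(node, lineage=(), out=None):
    """Recursively walk the AST, collecting wanted nodes with their symbol path."""
    if out is None:
        out = []
    if node.type in WANTED:
        out.append({"node": node, "symbol_path": "/".join(lineage + (node.name,))})
        lineage = lineage + (node.name,)  # record ancestry for nested symbols
    # Recursing into wanted nodes means inner symbols (e.g. a method inside a
    # class) are collected too, producing the overlapping chunks discussed later.
    for child in node.children:
        collect_tree_nodes(child, lineage, out)
    return out
```

The collected nodes would then be sorted by start line so gaps between them (imports, constants) can be detected and chunked separately.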
Notes:
- Splitting is line-based to avoid breaking syntax mid-line.
- Keep line-range metadata so entire files can be reconstructed if needed.
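The line-based split can be sketched as follows. The token counter is a pluggable callable here — the demo uses a tiktoken encoder, but any `str -> int` function (such as a crude whitespace word count) illustrates the logic; names and the chunk-dict shape are assumptions for this sketch.

```python
def split_by_tokens(source: str, max_tokens: int, count_tokens, start_line: int = 1):
    """Split source into line-aligned chunks, each at most max_tokens.

    Never breaks mid-line, tracks start/end line numbers for metadata,
    and skips blank spans so empty chunks are never indexed.
    """
    lines = source.splitlines()
    chunks = []
    buf, buf_tokens, chunk_start = [], 0, start_line
    for i, line in enumerate(lines):
        line_no = start_line + i
        t = count_tokens(line)
        # Flush before a line that would push the chunk over the limit.
        if buf and buf_tokens + t > max_tokens:
            text = "\n".join(buf)
            if text.strip():  # skip blank spans
                chunks.append({"text": text, "start_line": chunk_start,
                               "end_line": line_no - 1})
            buf, buf_tokens, chunk_start = [], 0, line_no
        buf.append(line)
        buf_tokens += t
    text = "\n".join(buf)
    if text.strip():  # flush the trailing chunk
        chunks.append({"text": text, "start_line": chunk_start,
                       "end_line": start_line + len(lines) - 1})
    return chunks
```

Because chunks carry contiguous line ranges, a whole file can be reassembled by concatenating its chunks in line order.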
Embedding and ingestion details
- Use a tokenizer to measure tokens and free encoder resources when done.
- Respect the embedding model’s token limits; choose the splitting strategy accordingly (demo uses a line-based split).
- Batch inserts into Chroma (demo uses batches of 100).
- Example environment: Chroma Cloud client with credentials supplied via environment variables.
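Batched ingestion might look like the sketch below. The id scheme (file path plus line range) and metadata fields are illustrative, not the article's exact code; `collection` is anything exposing Chroma's `add(ids=..., documents=..., metadatas=...)` method, and the commented-out client setup shows credentials coming from environment variables as in the demo.

```python
BATCH_SIZE = 100  # the demo inserts in batches of 100

def batched(items, size=BATCH_SIZE):
    """Yield successive fixed-size slices of items."""
    for i in range(0, len(items), size):
        yield items[i:i + size]

def ingest_chunks(collection, chunks):
    """Insert chunk dicts into a Chroma collection in batches."""
    for batch in batched(chunks):
        collection.add(
            ids=[f'{c["file_path"]}:{c["start_line"]}-{c["end_line"]}'
                 for c in batch],
            documents=[c["text"] for c in batch],
            metadatas=[{"file_path": c["file_path"],
                        "language": c["language"],
                        "start_line": c["start_line"],
                        "end_line": c["end_line"],
                        # symbol only present for wanted-node chunks
                        **({"symbol": c["symbol"]} if c.get("symbol") else {})}
                       for c in batch],
        )

# Chroma Cloud client with credentials from the environment (recent chromadb
# releases expose CloudClient; exact parameters may vary by version):
# client = chromadb.CloudClient(api_key=os.environ["CHROMA_API_KEY"],
#                               tenant=os.environ["CHROMA_TENANT"],
#                               database=os.environ["CHROMA_DATABASE"])
```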
Search and query capabilities demonstrated
- Semantic (vector) search via Chroma + embeddings (example query: “how is scrolling handled?”).
- Regex search to find code patterns (example: find all calls to setError).
- Metadata filtering to locate specific symbols or scope (e.g., symbol = “ChatMessage” or file_path = “app/page.tsx”).
- Full-text search combined with metadata filters (e.g., search for “import” scoped to a single file).
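The four retrieval modes map onto Chroma's query parameters. The sketch below only constructs the argument payloads: the `$regex` and `$contains` operators follow Chroma's documented `where_document` syntax (regex support depends on the Chroma version), and the metadata field names mirror those added during chunking.

```python
# Semantic (vector) search: embed the query text and rank by similarity.
semantic = {"query_texts": ["how is scrolling handled?"], "n_results": 5}

# Regex search over document text, e.g. all calls to setError(...).
regex = {"where_document": {"$regex": r"setError\("}}

# Metadata filtering: scope by symbol or by file path.
by_symbol = {"where": {"symbol": "ChatMessage"}}
by_file = {"where": {"file_path": "app/page.tsx"}}

# Full-text search combined with a metadata filter: "import" in one file.
fulltext_scoped = {
    "where_document": {"$contains": "import"},
    "where": {"file_path": "app/page.tsx"},
}

# These payloads would be passed to collection.query(**semantic) for vector
# search, or collection.get(**fulltext_scoped) for non-vector retrieval.
```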
Practical demo
- Indexed a Next.js repo (TypeScript/TSX) into a Chroma Cloud collection using the chunking pipeline.
- The Chroma Cloud dashboard renders chunks as formatted code when language metadata is provided.
- Demonstrated semantic hits, regex results, and metadata-filtered results working as intended.
Design decisions and caveats
- Granularity choice matters:
- Collecting only outer nodes avoids duplicates.
- Collecting inner nodes (post-order) leads to overlapping chunks (inner function and enclosing class), which may be desired depending on retrieval strategy.
- Must handle very large nodes by splitting by token/line.
- Maintain line-range metadata for reconstruction and context.
- Avoid indexing blank code spans.
Next steps in the series
- Efficiently index entire repositories.
- Keep the index up to date on code changes.
- Build a coding agent powered by Chroma that combines retrievals with an LLM.
Main speaker / sources
- Speaker: Itai
- Technologies/products referenced: Chroma (and Chroma Cloud), Tree-sitter, OpenAI embeddings (text-embedding-3-large), tiktoken (tokenizer).
Category
Technology