Summary of "Stanford CS336 Language Modeling from Scratch | Spring 2026 | Lecture 1: Overview, Tokenization"
Main ideas, concepts, and lessons
Course purpose (“build from scratch”)
- The class emphasizes understanding language model mechanics and an engineering mindset by implementing major pieces yourself rather than relying only on prompting or black-box models.
- The instructors refine what must be built “from scratch” to maximize learning within time constraints.
Why this class matters now
- Modern “frontier” LLMs are extremely expensive, and important details are often withheld (e.g., training costs and build details not fully shared).
- Small-scale experiments may not always transfer because:
- Compute bottlenecks and optimization priorities shift with scale (e.g., relative importance of MLP vs attention changes).
- Emergent behaviors may appear only past certain model scales.
What knowledge transfers across scales (3-part framing)
- Mechanics: how things work (transformers, parallelism, etc.).
- Mindset: how to build and optimize (profiling, benchmarking, efficiency-first).
- Intuitions: data/modeling decisions that work
- May require scale-specific experimentation and can be less transferable.
“Bitter lesson” clarified
- The lesson is not “scale is all that matters”; it is:
- Algorithms that scale matter.
- Efficiency is critical at large scale because compute is expensive—small gains can be financially significant.
- Mental model:
- accuracy ≈ efficiency × resources
- Better efficiency prevents wasted compute.
Course context: evolution of language models
- Historical lineage:
- Statistical LM ideas (e.g., Shannon-era entropy of English).
- N-gram models for translation/speech (as components).
- Neural progressions:
- Feedforward neural language models (early work)
- Sequence-to-sequence models (compressing a sequence into a vector)
- attention
- transformers
- LSTMs/optimization (mentioned historically)
- Pretrained + fine-tune (ELMo/BERT).
- Prompting and scaling laws enabling GPT-style models and in-context learning (GPT-2/3).
- More recent open-weight ecosystems (Llama, Mistral, DeepSeek, Qwen, etc.) approaching closed models.
- Current era: chat systems giving way to agents, with capabilities beyond what earlier researchers would have imagined.
- Fundamentals remain:
- GPUs/kernels, gradient-based training, transformers/attention
- But requirements change—especially longer context, making inference efficiency more important.
Course logistics and philosophy
- Lectures and materials are online; recordings are eventually posted to YouTube.
- 5 assignments are central; emphasis is “from scratch but not reckless”:
- No scaffolding code is provided; instead, unit tests check correctness so grading is not sparse or all-or-nothing.
- Work can be done locally, with a cluster for real training runs and benchmarking.
- Leaderboards evaluate outcomes like minimizing perplexity given a compute budget.
AI policy / using tools appropriately
- Because coding agents can solve homework too easily, students receive an AI prompt/agents.md template:
- The AI should be pedagogically minded:
- it may answer clarifying questions and explain concepts, but not directly generate the missing core implementation
- e.g., it should not write the transformer for you when implementing it is the point of the assignment.
Detailed methodology / instruction-style content (tokenization unit focus)
Tokenization goals and properties
Tokenizer function
- Converts raw input strings (Unicode text, as bytes) into the sequence of token indices the language model consumes.
- Must support decoding token indices back into strings.
- “Round trip” requirement: decoding the encoded text must reproduce the original string; losing information here is a problem (a minimal interface is sketched below).
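To make the interface concrete, here is a minimal sketch with encode/decode and the round-trip check (class and method names are illustrative assumptions, not the course's actual API):

```python
from abc import ABC, abstractmethod

class Tokenizer(ABC):
    """Minimal tokenizer interface: strings <-> token indices."""

    @abstractmethod
    def encode(self, text: str) -> list[int]:
        """Convert a Unicode string into a sequence of token indices."""

    @abstractmethod
    def decode(self, indices: list[int]) -> str:
        """Convert token indices back into a string."""

class ByteTokenizer(Tokenizer):
    """Trivial byte-level tokenizer: each UTF-8 byte is its own token (vocab size 256)."""

    def encode(self, text: str) -> list[int]:
        return list(text.encode("utf-8"))

    def decode(self, indices: list[int]) -> str:
        return bytes(indices).decode("utf-8")

# Round-trip requirement: decode(encode(s)) must give back s.
tok = ByteTokenizer()
s = "the cat in the hat"
assert tok.decode(tok.encode(s)) == s
```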
Efficiency metrics
- Compression ratio: bytes per token (a quick way to compute it is sketched below)
- Higher compression ratio → shorter sequences
- Important because attention cost is quadratic in sequence length.
- Trade-off: increasing vocab size can improve compression, but may cause sparsity and more rarely used tokens.
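A hedged sketch of measuring the compression ratio (the `encode` argument stands for whatever tokenizer is being evaluated):

```python
def compression_ratio(text: str, encode) -> float:
    """Bytes of UTF-8 input per token produced; higher means shorter sequences."""
    num_bytes = len(text.encode("utf-8"))
    num_tokens = len(encode(text))
    return num_bytes / num_tokens

# A pure byte-level tokenizer gives exactly 1.0 byte per token;
# a trained BPE tokenizer on English text typically sits well above that.
print(compression_ratio("the cat in the hat", lambda s: list(s.encode("utf-8"))))  # 1.0
```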
Tokenizer approaches discussed (and why earlier ones are suboptimal)
Character-level tokenization
- Pros: simple, covers all text
- Cons: huge vocabulary (the full Unicode code-point set) with many rare characters → vocabulary is used inefficiently and compression is poor
Byte-level tokenization
- Pros: fixed small vocab (0–255), no unseen tokens
- Cons: sequences become long → poor compression
Word/regex chunk tokenization
- Pros: semantically meaningful chunks
- Cons: huge or unbounded vocab; out-of-vocabulary words need an “UNK” token, which harms perplexity
Byte Pair Encoding (BPE) methodology (core algorithm)
Core idea
- Build a tokenizer vocabulary from training data, merging frequently co-occurring byte/token pairs.
- If something is rare, it decomposes into smaller parts rather than becoming UNK.
- Common sequences become single tokens; rare sequences become multiple tokens.
Training-time BPE algorithm
- Start with:
- A corpus treated as one long sequence of bytes
- Each byte is initially its own token
- Iteratively:
- Count all adjacent token pairs in the corpus and how often they occur.
- Select the most frequent pair (ties may exist; example chooses first).
- Merge that pair into a new token and add it to the vocabulary.
- Replace occurrences of that pair in the corpus with the new token.
- Repeat until reaching the desired vocab size (a runnable sketch follows the worked example below).
Token IDs and merges (illustrative example)
- Example string: “the cat in the hat”
- The string is converted to bytes; each byte starts as its own token (e.g., the bytes for “t” and “h”)
- Most frequent pair merged:
- “t” + “h” → new token id (e.g., 256)
- Later merges combine larger units (e.g., 256 with other adjacent bytes) producing increasing token IDs (e.g., 257, 258, …).
- Result: sequence shrinks; vocabulary grows.
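A minimal, unoptimized sketch of the training loop and the “the cat in the hat” example above (variable names and tie-breaking are illustrative, not the assignment's reference implementation):

```python
from collections import Counter

def train_bpe(text: str, num_merges: int):
    """Naive BPE training: repeatedly merge the most frequent adjacent token pair."""
    # Start from raw UTF-8 bytes; ids 0-255 are the initial vocabulary.
    tokens = list(text.encode("utf-8"))
    merges = {}            # (token_a, token_b) -> new token id
    next_id = 256

    for _ in range(num_merges):
        pair_counts = Counter(zip(tokens, tokens[1:]))
        if not pair_counts:
            break
        # max over Counter keys returns the earliest-seen pair among ties.
        best_pair = max(pair_counts, key=pair_counts.get)
        merges[best_pair] = next_id

        # Replace occurrences of best_pair in the token sequence with the new id.
        merged, i = [], 0
        while i < len(tokens):
            if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) == best_pair:
                merged.append(next_id)
                i += 2
            else:
                merged.append(tokens[i])
                i += 1
        tokens = merged
        next_id += 1

    return merges, tokens

# As in the example: the bytes for "t" and "h" are merged first, producing
# new ids 256, 257, ... and a shorter token sequence.
merges, tokens = train_bpe("the cat in the hat", num_merges=3)
print(merges, tokens)
```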
Encoding new text using learned BPE merges
- Convert the new text to bytes.
- Apply the learned merge rules, in the order they were created, to obtain token indices (sketched below).
- Decoding must reproduce the original string (round-trip correctness).
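A matching sketch of encoding new text by replaying the learned merges in creation order, plus the inverse decode. It assumes a `merges` dict like the one produced by the training sketch above; the per-merge scan over the whole sequence is exactly the naive slowness the next section mentions:

```python
def encode(text: str, merges: dict[tuple[int, int], int]) -> list[int]:
    """Apply learned merges, in creation order, to a new string."""
    tokens = list(text.encode("utf-8"))
    # Naive: one full pass over the sequence per merge rule.
    for pair, new_id in sorted(merges.items(), key=lambda kv: kv[1]):
        merged, i = [], 0
        while i < len(tokens):
            if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) == pair:
                merged.append(new_id)
                i += 2
            else:
                merged.append(tokens[i])
                i += 1
        tokens = merged
    return tokens

def decode(tokens: list[int], merges: dict[tuple[int, int], int]) -> str:
    """Invert the merges and recover the original string (round trip)."""
    vocab = {i: bytes([i]) for i in range(256)}
    for (a, b), new_id in sorted(merges.items(), key=lambda kv: kv[1]):
        vocab[new_id] = vocab[a] + vocab[b]
    return b"".join(vocab[t] for t in tokens).decode("utf-8")

# Round-trip check (with `merges` from the training sketch above):
# assert decode(encode("the cat in the hat", merges), merges) == "the cat in the hat"
```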
Practical implementation notes
- Naive BPE encoding is slow:
- encoding may loop over many merges
- Optimization direction for Assignment 1:
- Only consider merges that matter (need indexing/data structures)
- Chunking:
- Tokenize text in chunks (pre-tokenized pieces) rather than one full string to improve speed (see the sketch after this list)
- Implementation language:
- Python may be too slow; reimplementation in Rust/C is suggested as an option
- Special tokens:
- There are additional considerations (not deeply covered, but required for modern tokenizers)
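One common way to chunk is regex pre-tokenization: split the text into word-like pieces first and apply BPE merges only within each piece, never across piece boundaries. A sketch using the widely published GPT-2-style pattern and the third-party `regex` package (whether the course uses exactly this pattern is an assumption):

```python
import regex  # third-party package (pip install regex); supports \p{L} classes

# GPT-2-style pre-tokenization pattern: contractions, letter runs, digit runs,
# punctuation runs, and whitespace become separate chunks.
PAT = regex.compile(
    r"""'s|'t|'re|'ve|'m|'ll|'d| ?\p{L}+| ?\p{N}+| ?[^\s\p{L}\p{N}]+|\s+(?!\S)|\s+"""
)

def pretokenize(text: str) -> list[str]:
    """Split text into chunks; BPE merges are applied within each chunk only."""
    return PAT.findall(text)

print(pretokenize("the cat in the hat, obviously!"))
# ['the', ' cat', ' in', ' the', ' hat', ',', ' obviously', '!']
```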
Broader tokenization design requirements (stated as evaluation criteria for future end-to-end approaches)
Any replacement for tokenizers should ideally:
- Provide meaningful abstractions for sequences
- e.g., not modeling raw bytes directly for high-noise domains like video/DNA
- Support adaptive chunking / variable granularity
- compress common parts, keep rare/important parts more granular
What will be covered later in the course (high-level roadmap)
- 5 parts mirroring the 5 assignments
- Basics (first ~2 weeks): tokenize → define architecture → implement optimizer & trainer
- Assignment 1: implement BPE tokenizer + transformer + loss/optimizer/training stack + resource accounting; train on datasets; compete via perplexity/efficiency-like leaderboard.
- Systems: kernels, GPU parallelism, inference
- Resource accounting, roofline analysis (compute vs memory bottlenecks), profiling/benchmarking
- Triton kernels and distributed training concepts
- Inference optimizations: prefill/decode, speculative decoding, quantization/distillation/pruning.
- Scaling laws: construct scaling recipes; extrapolate loss across compute budgets
- Hyperparameter transfer and predictability
- Kaplan/Chinchilla-style compute-optimal trade-offs (a small worked example follows this roadmap)
- Assignment 3 simulates expensive training via offline cached experiments.
- Data: evaluation + data sourcing/curation/processing
- Internal vs external eval metrics; contamination avoidance; dataset diversity
- Data processing: transformation, filtering, deduplication, source mixing, synthetic data generation
- Assignment 4 focuses on turning raw corpora into clean training data.
- Alignment: improve beyond next-token prediction using weak supervision
- RL methods (PPO/GRPO), preference optimization (DPO), preference scoring (human or judge/LM judge)
- Orchestration/system challenges for RL at scale
- Assignment 5 content TBD, likely DPO/GRPO on a realistic benchmark.
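As a taste of the compute-optimal trade-off mentioned in the scaling-laws part, here is a small worked example using two common approximations (training compute C ≈ 6·N·D FLOPs and the Chinchilla-style rule of thumb of roughly 20 tokens per parameter); the numbers are illustrative only:

```python
def chinchilla_allocation(compute_flops: float, tokens_per_param: float = 20.0):
    """Split a FLOP budget between parameters N and tokens D, assuming
    C ~= 6 * N * D and D ~= 20 * N (both are rough approximations)."""
    n_params = (compute_flops / (6.0 * tokens_per_param)) ** 0.5
    n_tokens = tokens_per_param * n_params
    return n_params, n_tokens

# Example: a 1e21 FLOP budget (illustrative).
N, D = chinchilla_allocation(1e21)
print(f"~{N/1e9:.1f}B parameters trained on ~{D/1e9:.0f}B tokens")
# ~2.9B parameters trained on ~58B tokens
```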
Speakers / sources featured
Speakers / instructors / TAs (identified in the subtitles)
- Percy Liang (co-instructor)
- Tatsu (Tatsunori Hashimoto; co-instructor)
- Marcel (teaching staff; returning from prior offering)
- Herman (teaching staff / TA)
- Steven (first-time CA for the course)
Referenced sources / works / entities
- BERT (example pretrained model)
- ELMo (language model pretraining)
- GPT-2, GPT-3, GPT-4 (frontier GPT lineage; training cost mention)
- Noam Shazeer (SwiGLU activation paper; “divine benevolence” quote attribution)
- OpenAI (2020 algorithmic efficiency reference on ImageNet)
- Kaplan et al. (scaling-law reference)
- Chinchilla scaling laws (scaling-law reference)
- SwiGLU activation (from Shazeer paper)
- Llama series (Llama 1/2/3), Mistral, DeepSeek, Qwen, and other open-weight (including Chinese) model ecosystems
- The Marin project (speaker’s work; pre-registration of scaling results)
- How to Scale Your Model (Google book referenced)
- Modal (compute credits/platform for the course)
- Triton (for kernel implementation in Assignment 2)
- Andrej Karpathy (tokenization video reference)
- Common Crawl and other dataset sources mentioned (books, arXiv papers, GitHub code, etc.)
- DPO and GRPO (alignment methods)
- PPO (alignment/RL method mentioned)
- B200 GPUs (hardware referenced; inference/training context)
- DGX B200 and NVLink / InfiniBand / Ethernet (hardware/system architecture referenced)