Summary of "Stanford CS336 Language Modeling from Scratch | Spring 2026 | Lecture 1: Overview, Tokenization"
Main ideas, concepts, and lessons
Course purpose (“build from scratch”)
- The class emphasizes understanding language model mechanics and an engineering mindset by implementing major pieces yourself rather than relying only on prompting or black-box models.
- The instructors refine what must be built “from scratch” to maximize learning within time constraints.
Why this class matters now
- Modern “frontier” LLMs are extremely expensive, and important details are often withheld (e.g., training costs and build details not fully shared).
- Small-scale experiments may not always transfer because:
- Compute bottlenecks and optimization priorities shift with scale (e.g., relative importance of MLP vs attention changes).
- Emergent behaviors may appear only past certain model scales.
What knowledge transfers across scales (3-part framing)
- Mechanics: how things work (transformers, parallelism, etc.).
- Mindset: how to build and optimize (profiling, benchmarking, efficiency-first).
- Intuitions: data/modeling decisions that work
- May require scale-specific experimentation and can be less transferable.
“Bitter lesson” clarified
- The lesson is not “scale is all that matters”; it is:
- Algorithms that scale matter.
- Efficiency is critical at large scale because compute is expensive—small gains can be financially significant.
- Mental model:
- accuracy ≈ efficiency × resources
- Better efficiency prevents wasted compute.
Course context: evolution of language models
- Historical lineage:
- Statistical LM ideas (e.g., Shannon-era entropy of English).
- N-gram models for translation/speech (as components).
- Neural progressions:
- Feedforward neural language models (early work)
- Sequence-to-sequence models (compressing a sequence into a vector)
- attention
- transformers
- LSTMs/optimization (mentioned historically)
- Pretrained + fine-tune (ELMo/BERT).
- Prompting and scaling laws enabling GPT-style models and in-context learning (GPT-2/3).
- More recent open-weight ecosystems (Llama, Mistral, DeepSeek, Qwen, etc.) approaching closed models.
- Current era: chat systems giving way to agents, with capabilities beyond what earlier researchers would have imagined.
- Fundamentals remain:
- GPUs/kernels, gradient-based training, transformers/attention
- But requirements change—especially longer context, making inference efficiency more important.
Course logistics and philosophy
- Lectures and materials are online; recordings are eventually posted to YouTube.
- 5 assignments are central; emphasis is “from scratch but not reckless”:
- No scaffolding code is provided; instead, unit tests check correctness so grading is not sparse or all-or-nothing.
- Work can be done locally, with a cluster for real training runs and benchmarking.
- Leaderboards evaluate outcomes like minimizing perplexity given a compute budget.
AI policy / using tools appropriately
- Because coding agents can solve homework too easily, students receive an AI prompt/agents.md template:
- The AI should be pedagogically minded:
- it may answer clarifying questions and explain concepts, but not directly generate the missing core implementation
- e.g., it should not write the transformer for you when implementing it is the point of the assignment.
Detailed methodology / instruction-style content (tokenization unit focus)
Tokenization goals and properties
Tokenizer function
- Converts raw input strings (Unicode text, as bytes) into the sequence of token indices the language model consumes.
- Must support decoding token indices back into strings.
- “Round trip” requirement: decoding the encoded text must reproduce the original string; losing information here is a problem (a minimal interface is sketched below).
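To make the interface concrete, here is a minimal sketch with encode/decode and the round-trip check (class and method names are illustrative assumptions, not the course's actual API):

```python
from abc import ABC, abstractmethod

class Tokenizer(ABC):
    """Minimal tokenizer interface: strings <-> token indices."""

    @abstractmethod
    def encode(self, text: str) -> list[int]:
        """Convert a Unicode string into a sequence of token indices."""

    @abstractmethod
    def decode(self, indices: list[int]) -> str:
        """Convert token indices back into a string."""

class ByteTokenizer(Tokenizer):
    """Trivial byte-level tokenizer: each UTF-8 byte is its own token (vocab size 256)."""

    def encode(self, text: str) -> list[int]:
        return list(text.encode("utf-8"))

    def decode(self, indices: list[int]) -> str:
        return bytes(indices).decode("utf-8")

# Round-trip requirement: decode(encode(s)) must give back s.
tok = ByteTokenizer()
s = "the cat in the hat"
assert tok.decode(tok.encode(s)) == s
```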
Efficiency metrics
- Compression ratio: bytes per token (a quick way to compute it is sketched below)
- Higher compression ratio → shorter sequences
- Important because attention cost is quadratic in sequence length.
- Trade-off: increasing vocab size can improve compression, but may cause sparsity and more rarely used tokens.
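A hedged sketch of measuring the compression ratio (the `encode` argument stands for whatever tokenizer is being evaluated):

```python
def compression_ratio(text: str, encode) -> float:
    """Bytes of UTF-8 input per token produced; higher means shorter sequences."""
    num_bytes = len(text.encode("utf-8"))
    num_tokens = len(encode(text))
    return num_bytes / num_tokens

# A pure byte-level tokenizer gives exactly 1.0 byte per token;
# a trained BPE tokenizer on English text typically sits well above that.
print(compression_ratio("the cat in the hat", lambda s: list(s.encode("utf-8"))))  # 1.0
```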
Tokenizer approaches discussed (and why earlier ones are suboptimal)
Character-level tokenization
- Pros: simple, covers all text
- Cons: huge vocabulary (the full Unicode code-point set) with many rare characters → vocabulary is used inefficiently and compression is poor
Byte-level tokenization
- Pros: fixed small vocab (0–255), no unseen tokens
- Cons: sequences become long → poor compression
Word/regex chunk tokenization
- Pros: semantically meaningful chunks
- Cons: huge or unbounded vocab; out-of-vocabulary words need an “UNK” token, which harms perplexity
Byte Pair Encoding (BPE) methodology (core algorithm)
Core idea
- Build a tokenizer vocabulary from training data, merging frequently co-occurring byte/token pairs.
- If something is rare, it decomposes into smaller parts rather than becoming UNK.
- Common sequences become single tokens; rare sequences become multiple tokens.
Training-time BPE algorithm
- Start with:
- A corpus treated as one long sequence of bytes
- Each byte is initially its own token
- Iteratively:
- Count all adjacent token pairs in the corpus and how often they occur.
- Select the most frequent pair (ties may exist; example chooses first).
- Merge that pair into a new token and add it to the vocabulary.
- Replace occurrences of that pair in the corpus with the new token.
- Repeat until reaching the desired vocab size (a runnable sketch follows the worked example below).
Token IDs and merges (illustrative example)
- Example string: “the cat in the hat”
- The string is converted to bytes; each byte starts as its own token (e.g., the bytes for “t” and “h”)
- Most frequent pair merged:
- “t” + “h” → new token id (e.g., 256)
- Later merges combine larger units (e.g., 256 with other adjacent bytes) producing increasing token IDs (e.g., 257, 258, …).
- Result: sequence shrinks; vocabulary grows.
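A minimal, unoptimized sketch of the training loop and the “the cat in the hat” example above (variable names and tie-breaking are illustrative, not the assignment's reference implementation):

```python
from collections import Counter

def train_bpe(text: str, num_merges: int):
    """Naive BPE training: repeatedly merge the most frequent adjacent token pair."""
    # Start from raw UTF-8 bytes; ids 0-255 are the initial vocabulary.
    tokens = list(text.encode("utf-8"))
    merges = {}            # (token_a, token_b) -> new token id
    next_id = 256

    for _ in range(num_merges):
        pair_counts = Counter(zip(tokens, tokens[1:]))
        if not pair_counts:
            break
        # max over Counter keys returns the earliest-seen pair among ties.
        best_pair = max(pair_counts, key=pair_counts.get)
        merges[best_pair] = next_id

        # Replace occurrences of best_pair in the token sequence with the new id.
        merged, i = [], 0
        while i < len(tokens):
            if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) == best_pair:
                merged.append(next_id)
                i += 2
            else:
                merged.append(tokens[i])
                i += 1
        tokens = merged
        next_id += 1

    return merges, tokens

# As in the example: the bytes for "t" and "h" are merged first, producing
# new ids 256, 257, ... and a shorter token sequence.
merges, tokens = train_bpe("the cat in the hat", num_merges=3)
print(merges, tokens)
```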
Encoding new text using learned BPE merges
- Convert the new text to bytes.
- Apply the learned merge rules, in the order they were created, to obtain token indices (sketched below).
- Decoding must reproduce the original string (round-trip correctness).
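A matching sketch of encoding new text by replaying the learned merges in creation order, plus the inverse decode. It assumes a `merges` dict like the one produced by the training sketch above; the per-merge scan over the whole sequence is exactly the naive slowness the next section mentions:

```python
def encode(text: str, merges: dict[tuple[int, int], int]) -> list[int]:
    """Apply learned merges, in creation order, to a new string."""
    tokens = list(text.encode("utf-8"))
    # Naive: one full pass over the sequence per merge rule.
    for pair, new_id in sorted(merges.items(), key=lambda kv: kv[1]):
        merged, i = [], 0
        while i < len(tokens):
            if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) == pair:
                merged.append(new_id)
                i += 2
            else:
                merged.append(tokens[i])
                i += 1
        tokens = merged
    return tokens

def decode(tokens: list[int], merges: dict[tuple[int, int], int]) -> str:
    """Invert the merges and recover the original string (round trip)."""
    vocab = {i: bytes([i]) for i in range(256)}
    for (a, b), new_id in sorted(merges.items(), key=lambda kv: kv[1]):
        vocab[new_id] = vocab[a] + vocab[b]
    return b"".join(vocab[t] for t in tokens).decode("utf-8")

# Round-trip check (with `merges` from the training sketch above):
# assert decode(encode("the cat in the hat", merges), merges) == "the cat in the hat"
```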
Practical implementation notes
- Naive BPE encoding is slow:
- encoding may loop over many merges
- Optimization direction for Assignment 1:
- Only consider merges that matter (need indexing/data structures)
- Chunking:
- Tokenize text in chunks (pre-tokenized pieces) rather than one full string to improve speed (see the sketch after this list)
- Implementation language:
- Python may be too slow; reimplementation in Rust/C is suggested as an option
- Special tokens:
- There are additional considerations (not deeply covered, but required for modern tokenizers)
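One common way to chunk is regex pre-tokenization: split the text into word-like pieces first and apply BPE merges only within each piece, never across piece boundaries. A sketch using the widely published GPT-2-style pattern and the third-party `regex` package (whether the course uses exactly this pattern is an assumption):

```python
import regex  # third-party package (pip install regex); supports \p{L} classes

# GPT-2-style pre-tokenization pattern: contractions, letter runs, digit runs,
# punctuation runs, and whitespace become separate chunks.
PAT = regex.compile(
    r"""'s|'t|'re|'ve|'m|'ll|'d| ?\p{L}+| ?\p{N}+| ?[^\s\p{L}\p{N}]+|\s+(?!\S)|\s+"""
)

def pretokenize(text: str) -> list[str]:
    """Split text into chunks; BPE merges are applied within each chunk only."""
    return PAT.findall(text)

print(pretokenize("the cat in the hat, obviously!"))
# ['the', ' cat', ' in', ' the', ' hat', ',', ' obviously', '!']
```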
Broader tokenization design requirements (stated as evaluation criteria for future end-to-end approaches)
Any replacement for tokenizers should ideally:
- Provide meaningful abstractions for sequences
- e.g., not modeling raw bytes directly for high-noise domains like video/DNA
- Support adaptive chunking / variable granularity
- compress common parts, keep rare/important parts more granular
What will be covered later in the course (high-level roadmap)
- 5 parts mirroring the 5 assignments
- Basics (first ~2 weeks): tokenize → define architecture → implement optimizer & trainer
- Assignment 1: implement BPE tokenizer + transformer + loss/optimizer/training stack + resource accounting; train on datasets; compete via perplexity/efficiency-like leaderboard.
- Systems: kernels, GPU parallelism, inference
- Resource accounting, roofline analysis (compute vs memory bottlenecks), profiling/benchmarking
- Triton kernels and distributed training concepts
- Inference optimizations: prefill/decode, speculative decoding, quantization/distillation/pruning.
- Scaling laws: construct scaling recipes; extrapolate loss across compute budgets
- Hyperparameter transfer and predictability
- Kaplan/Chinchilla-style compute-optimal trade-offs (a small worked example follows this roadmap)
- Assignment 3 simulates expensive training via offline cached experiments.
- Data: evaluation + data sourcing/curation/processing
- Internal vs external eval metrics; contamination avoidance; dataset diversity
- Data processing: transformation, filtering, deduplication, source mixing, synthetic data generation
- Assignment 4 focuses on turning raw corpora into clean training data.
- Alignment: improve beyond next-token prediction using weak supervision
- RL methods (PPO/GRPO), preference optimization (DPO), preference scoring (human or judge/LM judge)
- Orchestration/system challenges for RL at scale
- Assignment 5 content TBD, likely DPO/GRPO on a realistic benchmark.
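As a taste of the compute-optimal trade-off mentioned in the scaling-laws part, here is a small worked example using two common approximations (training compute C ≈ 6·N·D FLOPs and the Chinchilla-style rule of thumb of roughly 20 tokens per parameter); the numbers are illustrative only:

```python
def chinchilla_allocation(compute_flops: float, tokens_per_param: float = 20.0):
    """Split a FLOP budget between parameters N and tokens D, assuming
    C ~= 6 * N * D and D ~= 20 * N (both are rough approximations)."""
    n_params = (compute_flops / (6.0 * tokens_per_param)) ** 0.5
    n_tokens = tokens_per_param * n_params
    return n_params, n_tokens

# Example: a 1e21 FLOP budget (illustrative).
N, D = chinchilla_allocation(1e21)
print(f"~{N/1e9:.1f}B parameters trained on ~{D/1e9:.0f}B tokens")
# ~2.9B parameters trained on ~58B tokens
```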
Speakers / sources featured
Speakers / instructors / TAs (identified in the subtitles)
- Percy Liang (co-instructor)
- Tatsu (Tatsunori Hashimoto; co-instructor)
- Marcel (teaching staff; returning from prior offering)
- Herman (teaching staff / TA)
- Steven (first-time CA for the course)
Referenced sources / works / entities
- BERT (example pretrained model)
- ELMo (language model pretraining)
- GPT-2, GPT-3, GPT-4 (frontier GPT lineage; training cost mention)
- Noam Shazeer (SwiGLU activation paper; “divine benevolence” quote attribution)
- OpenAI (2020 algorithmic efficiency reference on ImageNet)
- Kaplan et al. (scaling-law reference)
- Chinchilla scaling laws (scaling-law reference)
- SwiGLU activation (from Shazeer paper)
- Llama series (Llama 1/2/3), Mistral, DeepSeek, Qwen, and other open-weight (including Chinese) model ecosystems
- The Marin project (speaker’s work; pre-registration of scaling results)
- How to Scale Your Model (Google book referenced)
- Modal (compute credits/platform for the course)
- Triton (for kernel implementation in Assignment 2)
- Andrej Karpathy (tokenization video reference)
- Common Crawl and other dataset sources mentioned (books, arXiv papers, GitHub code, etc.)
- DPO and GRPO (alignment methods)
- PPO (alignment/RL method mentioned)
- B200 GPUs (hardware referenced; inference/training context)
- DGX B200 and NVLink / InfiniBand / Ethernet (hardware/system architecture referenced)