Summary of "Stanford CME295 Transformers & LLMs | Autumn 2025 | Lecture 1 - Transformer"
Course overview
This was the first lecture of Stanford CME 295 (Transformers & Large Language Models), taught by twin brothers Afshine (primary lecturer) and Shervine (co‑instructor). The lecture introduced course logistics and objectives, and gave a technical overview that traced NLP tasks, tokenization and representations, classic sequence models (RNN/LSTM), the attention mechanism, and the Transformer architecture (encoder/decoder, multi‑head attention, masking, positional encodings). Practical training/evaluation considerations and common hyperparameter magnitudes were covered, with student questions answered throughout.
Logistics and grading (short)
- Meeting time/location: Fridays 3:30–5:20; class recorded and slides posted online.
- Units/credit: 2 units; letter or credit/no‑credit options.
- Grading:
- Midterm 50% (week 5, Oct 24)
- Final 50% (final exam week, ~Dec 8)
- No programming/homework; exams are conceptual (no coding).
- Materials: course slides, recordings, syllabus; course textbook/study guide and a condensed “VIP cheat sheet” on GitHub (translated into several languages).
- Questions are handled via the Canvas/Ed discussion board and the course mailing list.
Main ideas, concepts and lessons
1. NLP task taxonomy
- Classification (single label)
- Examples: sentiment analysis, intent detection, language ID.
- Evaluation: accuracy; precision/recall/F1 when classes are imbalanced.
- Sequence labeling / token‑level classification (one label per token)
- Examples: Named Entity Recognition (NER), POS tagging, dependency/constituency parsing.
- Evaluate at token/entity type level.
- Generation (text → text)
- Examples: machine translation, summarization, question answering, code/poem generation.
- Evaluation: reference‑based metrics (BLEU, ROUGE), reference‑free measures (including later LLM‑based methods), and perplexity (lower is better).
2. Tokenization: units and trade‑offs
- Word‑level tokens: simple, but large vocabularies and OOV problems (need an unknown token).
- Subword tokenization (e.g., BPE, unigram): good compromise—shares roots across inflections, reduces OOV risk; increases sequence length vs. words.
- Character‑level: robust to misspellings/casing but greatly increases sequence length and makes semantics harder to capture per token.
- Practical vocab sizes: tens of thousands for single languages; hundreds of thousands for multilingual/code models.
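The subword idea can be illustrated with a single BPE merge step. The sketch below is a toy (corpus, frequencies, and helper names are invented for illustration, not from the lecture): count adjacent symbol pairs, merge the most frequent pair everywhere.

```python
from collections import Counter

def most_frequent_pair(words):
    """Count adjacent symbol pairs across a corpus of tokenized words."""
    pairs = Counter()
    for symbols, freq in words.items():
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return pairs.most_common(1)[0][0]

def merge_pair(words, pair):
    """Replace every occurrence of the pair with a single merged symbol."""
    merged = {}
    for symbols, freq in words.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        merged[tuple(out)] = freq
    return merged

# Toy corpus: word (split into characters) -> frequency.
corpus = {tuple("hug"): 10, tuple("pug"): 5, tuple("hugs"): 5}
pair = most_frequent_pair(corpus)   # ('u', 'g') occurs 20 times
corpus = merge_pair(corpus, pair)   # "hug" becomes ('h', 'ug'), etc.
```

Real BPE repeats this merge loop until the vocabulary reaches a target size, which is how shared roots across inflections end up as single tokens.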
3. Token / token embedding representations
- One‑hot vectors are orthogonal and unsuitable for capturing semantic similarity.
- Learn continuous embeddings via proxy tasks (e.g., Word2Vec CBOW / Skip‑gram); embeddings capture semantic relationships via vector arithmetic (e.g., king − man + woman ≈ queen; Paris − France + Italy ≈ Rome).
- Embedding dimensions are typically hundreds (e.g., 768); choice is a trade‑off between expressiveness and compute/latency.
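The one‑hot vs. learned‑embedding contrast can be made concrete with cosine similarity (the 3‑d embedding values below are invented for illustration):

```python
import numpy as np

def cosine(u, v):
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

# One-hot vectors: distinct tokens are orthogonal, so similarity is
# always 0 and carries no semantic signal.
cat_onehot, dog_onehot = np.eye(5)[0], np.eye(5)[1]
print(cosine(cat_onehot, dog_onehot))  # 0.0 regardless of meaning

# Hypothetical learned embeddings: semantically related words end up
# with a higher cosine similarity than unrelated ones.
cat = np.array([0.8, 0.1, 0.6])
dog = np.array([0.7, 0.2, 0.5])
car = np.array([-0.6, 0.9, -0.1])
print(cosine(cat, dog) > cosine(cat, car))  # True
```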
4. From token embeddings to sequence/contextual embeddings
- Simple pooling (e.g., averaging token embeddings) loses order and context.
- Sequence models:
- RNNs: process tokens sequentially, maintain a hidden state representing the sequence so far; suffer from vanishing/exploding gradients and inefficient sequential computation.
- LSTMs: mitigate forgetting with gated cell state, but remain sequential and relatively slow for long contexts.
- Attention idea: allow direct connections between any pair of positions so a token can “attend” to useful past/future tokens without passing through all intermediate steps.
5. Attention mechanism (intuition and math)
- Each token is projected into three vectors: query (Q), key (K), value (V).
- For a given query, compute similarity (dot product) with all keys, normalize via softmax to get attention weights, then compute a weighted sum of corresponding values.
- Scale dot products by 1/sqrt(d_k) to avoid large magnitudes as dimensionality grows.
- Multi‑head attention: run several independent Q/K/V projections (heads) in parallel so the model can capture different relationships; concatenate head outputs and project back to the model dimension.
- Masked self‑attention in the decoder prevents attending to future tokens during autoregressive decoding.
- Cross‑attention in the decoder: queries come from decoder states; keys/values come from encoder outputs.
6. Transformer architecture (encoder–decoder)
- Encoder: stacked layers of multi‑head self‑attention + position‑wise feed‑forward network (FFN), with residual connections and layer normalization.
- Decoder: stacked layers containing (1) masked self‑attention over decoded tokens so far, (2) cross‑attention to encoder outputs (keys/values), and (3) FFN.
- Positional encoding: because attention is permutation‑invariant, positional encodings (e.g., sinusoidal) are added to token embeddings so the model knows token order.
- Decoding loop: start from a BOS token, run the decoder to produce a distribution over the vocabulary, select the next token (greedy/beam/sampling), and repeat until EOS.
7. Training details and tricks
- Proxy training objectives: next‑token prediction (autoregressive) or masked language modeling (relevant in LLMs).
- Loss: cross‑entropy with softmax over the vocabulary.
- Label smoothing: replace a hard one‑hot target with a smoothed distribution (e.g., 1 − ε on the target, ε/(V−1) on the others) to reflect multiple valid continuations and improve generalization/BLEU.
- Monitor convergence via training loss; stop when the loss converges (the proxy‑task loss is an indicator of downstream quality, not the full metric).
- Perplexity: lower is better. BLEU/ROUGE: higher is better.
8. Practical considerations and limitations
- Compute and data scale matter: historical models (RNNs/LSTMs) were limited by compute/data; modern gains came from transformers + large compute + large datasets.
- Attention is quadratic in sequence length, so longer sequences imply much more compute.
- Model sizes and vocabulary choices are empirical; many practitioners reuse established design choices (embedding dims, number of heads, vocab sizes).
- Contextualized representations let the same surface token have different embeddings depending on context (e.g., “bank” as financial institution vs river bank).
End‑to‑end Transformer decoding workflow (stepwise)
- Tokenize source sentence (add BOS/EOS as needed).
- Construct token embeddings and add positional encodings.
- Encoder pass:
- For each encoder layer:
- Compute Q/K/V projections of the encoder inputs.
- Compute multi‑head self‑attention: softmax(Q K^T / sqrt(d_k)) · V.
- Apply FFN and residual/normalization.
- Output: encoded, context‑aware token representations.
- Decoder pass (autoregressive):
- Initialize with BOS token (tokenized + embedding + positional encoding).
- For each decoder layer at each decoding step:
- Masked self‑attention over decoded token representations so far.
- Cross‑attention: queries from decoder, keys/values from encoder outputs.
- FFN + residual/normalization.
- Final linear + softmax over vocabulary → distribution for next token.
- Select next token (strategy varies), append to decoded sequence, repeat until EOS.
Evaluation and datasets mentioned
- Datasets:
- Sentiment: IMDb, Amazon reviews, social media/X posts.
- Translation: WMT (Workshop on Machine Translation), European Parliament parallel corpus.
- Metrics:
- Classification: accuracy, precision/recall/F1.
- Generation: BLEU, ROUGE (reference‑based); reference‑free methods are an active area of research.
- Probabilistic: perplexity.
- Note: reference‑based metrics require labeled text pairs; collecting references has a cost.
Selected student questions answered in lecture
- Exams are conceptual (no coding).
- Waitlist: contact instructors; usually manageable.
- Missed lectures: recordings and slides are posted.
- Vocabulary sizes and tokenizer choices vary by language/task (guidance given on orders of magnitude).
- Hidden dimension choices: trade‑off between expressivity and compute; common values are in the hundreds (e.g., 768).
- How to stop generation: the EOS token.
- Label smoothing rationale: accounts for multiple valid continuations.
- Q/K/V and multi‑head projections are learned via gradient descent; multiple heads let the model capture different types of relations.
- Vanishing/exploding gradients in RNNs: caused by repeated multiplication of factors through time, leading to gradients that vanish or explode.
References and historical context
- Early sequence models: RNNs (1980s), LSTMs.
- Word2Vec (2013): CBOW and Skip‑gram proxy tasks for learning embeddings.
- Transformer: “Attention Is All You Need” (2017) introduced self‑attention and the transformer encoder/decoder.
- Datasets/metrics: WMT, BLEU, ROUGE, perplexity.
- Practical resources: course textbook/study guide, VIP cheat sheet on GitHub.
Speakers / sources featured
- Afshine — primary lecturer.
- Shervine — co‑instructor.
- Audience/students — asked questions during Q&A.
- Referenced works/datasets: Word2Vec, RNN/LSTM literature, Transformer paper, WMT, BLEU/ROUGE, label smoothing concept.