Summary of "Stanford CME295 Transformers & LLMs | Autumn 2025 | Lecture 1 - Transformer"
Course overview
This was the first lecture of Stanford CME 295 (Transformers & Large Language Models), taught by twin brothers Afshine (primary lecturer) and Shervine (co‑instructor). The lecture introduced course logistics and objectives, and gave a technical overview that traced NLP tasks, tokenization and representations, classic sequence models (RNN/LSTM), the attention mechanism, and the Transformer architecture (encoder/decoder, multi‑head attention, masking, positional encodings). Practical training/evaluation considerations and common hyperparameter magnitudes were covered, with student questions answered throughout.
Logistics and grading (short)
- Meeting time/location: Fridays 3:30–5:20; class recorded and slides posted online.
- Units/credit: 2 units; letter or credit/no‑credit options.
- Grading:
- Midterm 50% (week 5, Oct 24)
- Final 50% (final exam week, ~Dec 8)
- No programming/homework; exams are conceptual (no coding).
- Materials: course slides, recordings, syllabus; course textbook/study guide and a condensed “VIP cheat sheet” on GitHub (translated into several languages).
- Questions are handled via the Canvas/Ed discussion board and the course mailing list.
Main ideas, concepts and lessons
1. NLP task taxonomy
- Classification (single label)
- Examples: sentiment analysis, intent detection, language ID.
- Evaluation: accuracy; precision/recall/F1 when classes are imbalanced.
- Sequence labeling / token‑level classification (one label per token)
- Examples: Named Entity Recognition (NER), POS tagging, dependency/constituency parsing.
- Evaluate at token/entity type level.
- Generation (text → text)
- Examples: machine translation, summarization, question answering, code/poem generation.
- Evaluation: reference‑based metrics (BLEU, ROUGE), reference‑free measures (including later LLM‑based methods), and perplexity (lower is better).
2. Tokenization: units and trade‑offs
- Word‑level tokens: simple, but large vocabularies and OOV problems (need an unknown token).
- Subword tokenization (e.g., BPE, unigram): good compromise—shares roots across inflections, reduces OOV risk; increases sequence length vs. words.
- Character‑level: robust to misspellings/casing but greatly increases sequence length and makes semantics harder to capture per token.
- Practical vocab sizes: tens of thousands for single languages; hundreds of thousands for multilingual/code models.
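The subword idea can be illustrated with a single BPE merge step. The sketch below is a toy (corpus, frequencies, and helper names are invented for illustration, not from the lecture): count adjacent symbol pairs, merge the most frequent pair everywhere.

```python
from collections import Counter

def most_frequent_pair(words):
    """Count adjacent symbol pairs across a corpus of tokenized words."""
    pairs = Counter()
    for symbols, freq in words.items():
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return pairs.most_common(1)[0][0]

def merge_pair(words, pair):
    """Replace every occurrence of the pair with a single merged symbol."""
    merged = {}
    for symbols, freq in words.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        merged[tuple(out)] = freq
    return merged

# Toy corpus: word (split into characters) -> frequency.
corpus = {tuple("hug"): 10, tuple("pug"): 5, tuple("hugs"): 5}
pair = most_frequent_pair(corpus)   # ('u', 'g') occurs 20 times
corpus = merge_pair(corpus, pair)   # "hug" becomes ('h', 'ug'), etc.
```

Real BPE repeats this merge loop until the vocabulary reaches a target size, which is how shared roots across inflections end up as single tokens.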
3. Token / token embedding representations
- One‑hot vectors are orthogonal and unsuitable for capturing semantic similarity.
- Learn continuous embeddings via proxy tasks (e.g., Word2Vec CBOW / Skip‑gram); embeddings capture semantic relationships via vector arithmetic (e.g., king − man + woman ≈ queen; Paris − France + Italy ≈ Rome).
- Embedding dimensions are typically hundreds (e.g., 768); choice is a trade‑off between expressiveness and compute/latency.
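The one‑hot vs. learned‑embedding contrast can be made concrete with cosine similarity (the 3‑d embedding values below are invented for illustration):

```python
import numpy as np

def cosine(u, v):
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

# One-hot vectors: distinct tokens are orthogonal, so similarity is
# always 0 and carries no semantic signal.
cat_onehot, dog_onehot = np.eye(5)[0], np.eye(5)[1]
print(cosine(cat_onehot, dog_onehot))  # 0.0 regardless of meaning

# Hypothetical learned embeddings: semantically related words end up
# with a higher cosine similarity than unrelated ones.
cat = np.array([0.8, 0.1, 0.6])
dog = np.array([0.7, 0.2, 0.5])
car = np.array([-0.6, 0.9, -0.1])
print(cosine(cat, dog) > cosine(cat, car))  # True
```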
4. From token embeddings to sequence/contextual embeddings
- Simple pooling (e.g., averaging token embeddings) loses order and context.
- Sequence models:
- RNNs: process tokens sequentially, maintain a hidden state representing the sequence so far; suffer from vanishing/exploding gradients and inefficient sequential computation.
- LSTMs: mitigate forgetting with gated cell state, but remain sequential and relatively slow for long contexts.
- Attention idea: allow direct connections between any pair of positions so a token can “attend” to useful past/future tokens without passing through all intermediate steps.
5. Attention mechanism (intuition and math)
- Each token is projected into three vectors: query (Q), key (K), value (V).
- For a given query, compute similarity (dot product) with all keys, normalize via softmax to get attention weights, then compute a weighted sum of corresponding values.
- Scale dot products by 1/sqrt(d_k) to avoid large magnitudes as dimensionality grows.
- Multi‑head attention: run several independent Q/K/V projections (heads) in parallel so the model can capture different relationships; concatenate head outputs and project back to the model dimension.
- Masked self‑attention in the decoder prevents attending to future tokens during autoregressive decoding.
- Cross‑attention in the decoder: queries come from decoder states; keys/values come from encoder outputs.
6. Transformer architecture (encoder–decoder)
- Encoder: stacked layers of multi‑head self‑attention + position‑wise feed‑forward network (FFN), with residual connections and layer normalization.
- Decoder: stacked layers containing (1) masked self‑attention over decoded tokens so far, (2) cross‑attention to encoder outputs (keys/values), and (3) FFN.
- Positional encoding: because attention is permutation‑invariant, positional encodings (e.g., sinusoidal) are added to token embeddings so the model knows token order.
- Decoding loop: start from a BOS token, run the decoder to produce a distribution over the vocabulary, select the next token (greedy/beam/sampling), and repeat until EOS.
7. Training details and tricks
- Proxy training objectives: next‑token prediction (autoregressive) or masked language modeling (relevant in LLMs).
- Loss: cross‑entropy with softmax over the vocabulary.
- Label smoothing: replace a hard one‑hot target with a smoothed distribution (e.g., 1 − ε on the target, ε/(V−1) on the others) to reflect multiple valid continuations and improve generalization/BLEU.
- Monitor convergence via training loss; stop when the loss converges (the proxy‑task loss is an indicator of downstream quality, not the full metric).
- Perplexity: lower is better. BLEU/ROUGE: higher is better.
8. Practical considerations and limitations
- Compute and data scale matter: historical models (RNNs/LSTMs) were limited by compute/data; modern gains came from transformers + large compute + large datasets.
- Attention is quadratic in sequence length, so longer sequences imply much more compute.
- Model sizes and vocabulary choices are empirical; many practitioners reuse established design choices (embedding dims, number of heads, vocab sizes).
- Contextualized representations let the same surface token have different embeddings depending on context (e.g., “bank” as financial institution vs river bank).
End‑to‑end Transformer decoding workflow (stepwise)
- Tokenize source sentence (add BOS/EOS as needed).
- Construct token embeddings and add positional encodings.
- Encoder pass:
- For each encoder layer:
- Compute Q/K/V projections of the encoder inputs.
- Compute multi‑head self‑attention: softmax(Q K^T / sqrt(d_k)) · V.
- Apply FFN and residual/normalization.
- Output: encoded, context‑aware token representations.
- Decoder pass (autoregressive):
- Initialize with BOS token (tokenized + embedding + positional encoding).
- For each decoder layer at each decoding step:
- Masked self‑attention over decoded token representations so far.
- Cross‑attention: queries from decoder, keys/values from encoder outputs.
- FFN + residual/normalization.
- Final linear + softmax over vocabulary → distribution for next token.
- Select next token (strategy varies), append to decoded sequence, repeat until EOS.
Evaluation and datasets mentioned
- Datasets:
- Sentiment: IMDb, Amazon reviews, social media/X posts.
- Translation: WMT (Workshop on Machine Translation), European Parliament parallel corpus.
- Metrics:
- Classification: accuracy, precision/recall/F1.
- Generation: BLEU, ROUGE (reference‑based); reference‑free methods are an active area of research.
- Probabilistic: perplexity.
- Note: reference‑based metrics require labeled text pairs; collecting references has a cost.
Selected student questions answered in lecture
- Exams are conceptual (no coding).
- Waitlist: contact instructors; usually manageable.
- Missed lectures: recordings and slides are posted.
- Vocabulary sizes and tokenizer choices vary by language/task (guidance given on orders of magnitude).
- Hidden dimension choices: trade‑off between expressivity and compute; common values are in the hundreds (e.g., 768).
- How to stop generation: the EOS token.
- Label smoothing rationale: accounts for multiple valid continuations.
- Q/K/V and multi‑head projections are learned via gradient descent; multiple heads let the model capture different types of relations.
- Vanishing/exploding gradients in RNNs: caused by repeated multiplication of factors through time, leading to gradients that vanish or explode.
References and historical context
- Early sequence models: RNNs (1980s), LSTMs.
- Word2Vec (2013): CBOW and Skip‑gram proxy tasks for learning embeddings.
- Transformer: “Attention Is All You Need” (2017) introduced self‑attention and the transformer encoder/decoder.
- Datasets/metrics: WMT, BLEU, ROUGE, perplexity.
- Practical resources: course textbook/study guide, VIP cheat sheet on GitHub.
Speakers / sources featured
- Afshine — primary lecturer.
- Shervine — co‑instructor.
- Audience/students — asked questions during Q&A.
- Referenced works/datasets: Word2Vec, RNN/LSTM literature, Transformer paper, WMT, BLEU/ROUGE, label smoothing concept.