Summary of "Stanford CME295 Transformers & LLMs | Autumn 2025 | Lecture 1 - Transformer"

Course overview

This was the first lecture of Stanford CME 295 (Transformers & Large Language Models), taught by twin instructors Afshine Amidi (primary lecturer) and Shervine Amidi (co‑instructor). The lecture introduced course logistics and objectives, then gave a technical overview tracing NLP tasks, tokenization and representations, classic sequence models (RNN/LSTM), the attention mechanism, and the Transformer architecture (encoder/decoder, multi‑head attention, masking, positional encodings). Practical training/evaluation considerations and common hyperparameter magnitudes were covered, with student questions answered throughout.

Logistics and grading (short)

Main ideas, concepts and lessons

1. NLP task taxonomy

2. Tokenization: units and trade‑offs

3. Token / token embedding representations

4. From token embeddings to sequence/contextual embeddings

5. Attention mechanism (intuition and math)

6. Transformer architecture (encoder–decoder)

7. Training details and tricks

8. Practical considerations and limitations

End‑to‑end Transformer decoding workflow (stepwise)

  1. Tokenize source sentence (add BOS/EOS as needed).
  2. Construct token embeddings and add positional encodings.
  3. Encoder pass:
    • For each encoder layer:
      • Compute Q/K/V projections of the encoder inputs.
      • Compute multi‑head self‑attention: softmax(Q Kᵀ / √d_k) V.
      • Apply FFN and residual/normalization.
    • Output: encoded, context‑aware token representations.
  4. Decoder pass (autoregressive):
    • Initialize with BOS token (tokenized + embedding + positional encoding).
    • For each decoder layer at each decoding step:
      • Masked self‑attention over decoded token representations so far.
      • Cross‑attention: queries from decoder, keys/values from encoder outputs.
      • FFN + residual/normalization.
    • Final linear + softmax over vocabulary → distribution for next token.
    • Select next token (strategy varies), append to decoded sequence, repeat until EOS.
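The decoding steps above can be sketched numerically. The following is a minimal NumPy illustration of one decoding step, using single‑head attention, toy dimensions, and random weights in place of trained parameters; the learned Q/K/V projection matrices are omitted for brevity, and all names (`d_model`, `vocab`, etc.) are illustrative, not from the lecture.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, vocab = 8, 16  # toy dimensions, not lecture values

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V, mask=None):
    """Scaled dot-product attention: softmax(Q K^T / sqrt(d_k)) V."""
    scores = Q @ K.T / np.sqrt(K.shape[-1])
    if mask is not None:
        scores = np.where(mask, scores, -1e9)  # block attention to future positions
    return softmax(scores) @ V

# Stand-ins for encoder outputs (5 source tokens) and the
# embeddings of the 3 tokens decoded so far.
enc_out = rng.normal(size=(5, d_model))
dec_in = rng.normal(size=(3, d_model))

# Masked self-attention: each position attends only to itself and earlier ones.
causal = np.tril(np.ones((3, 3), dtype=bool))
self_attn = attention(dec_in, dec_in, dec_in, mask=causal)

# Cross-attention: queries from the decoder, keys/values from encoder outputs.
cross = attention(self_attn, enc_out, enc_out)

# Final linear + softmax over the vocabulary -> next-token distribution.
W_out = rng.normal(size=(d_model, vocab))
probs = softmax(cross[-1] @ W_out)
next_token = int(np.argmax(probs))  # greedy selection; other strategies exist
```

A real decoder would stack several such layers with residual connections, layer normalization, and position‑wise FFNs, and would loop this step until EOS is produced.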

Evaluation and datasets mentioned

Selected student questions answered in lecture

References and historical context

Speakers / sources featured

Category: Educational
