Video summary

Let's build GPT: from scratch, in code, spelled out.

High-level overview

  • The video demonstrates how to build a tiny, working decoder-only Transformer (a GPT-style language model) from first principles in roughly 200 lines of PyTorch, training on the Tiny Shakespeare character corpus.
  • It explains the core ideas behind language models and Transformers (from “Attention Is All You Need”, 2017) and shows how production systems like GPT/ChatGPT extend the basic recipe through scale, pretraining, and fine-tuning / RL from human feedback.
  • The pedagogical goal is to make mechanisms and code concrete: start with a simple baseline, then incrementally add self-attention, multi-head attention, feed-forward layers, residuals, layer norm, dropout, and stacking — showing practical implementation tips and improvements at each step.

Core concepts explained

  • Language modeling

    • Treat text as a sequence and predict the next token given previous context.
    • Generation is sequential: sample one token at a time and append to the context.
  • Tokenization choices

    • Character-level: simple, small vocabulary, longer sequences (used in the demo).
    • Subword tokenizers (SentencePiece, BPE / tiktoken): larger vocabularies, shorter sequences — used in production models.
  • Training / evaluation split

    • Hold out a validation split (e.g., last 10%) to detect overfitting.
  • Batching and chunking

    • Train on many short chunks sampled randomly rather than whole documents.
    • Block size (context length) determines how many past tokens the model sees.
    • Batching stacks independent chunks for efficient GPU utilization.
  • Baseline (bigram) model

    • A simple token embedding table that predicts the next token solely from the current token — useful to illustrate baseline loss and the training loop.
  • Loss and reshaping

    • PyTorch CrossEntropyLoss expects logits shaped (N, C).
    • For sequence logits (B, T, C) reshape to (B*T, C) and targets to (B*T,).
  • Generation

    • At each step, compute logits for the last timestep, apply softmax, sample with torch.multinomial, and append the sampled token to the context.
  • Self-attention (scaled dot-product)

    • Positions emit query (Q), key (K), and value (V) vectors.
    • Affinities are Q·K^T scaled by 1/sqrt(d_k).
    • Use a lower-triangular mask to enforce causality (no future token access).
    • Softmax over affinities produces attention weights; output is attention_weights · V.
  • Efficient implementation trick

    • Build a lower-triangular mask / weight matrix and use batched matrix multiplies to compute prefix-weighted aggregations in one shot (vectorizes nested loops).
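
The vectorization trick above can be sketched as follows. This is a minimal illustration, not the video's exact notebook cell: the sizes B, T, C and the variable names are assumptions chosen for the demo. It computes a running average over each prefix three ways (explicit loops, a row-normalized lower-triangular matrix, and mask plus softmax) so the results can be compared.

```python
import torch

torch.manual_seed(0)
B, T, C = 2, 4, 3                 # batch, time, channels (arbitrary demo sizes)
x = torch.randn(B, T, C)

# Reference: explicit loops, averaging each position with everything before it.
xbow_loop = torch.zeros(B, T, C)
for b in range(B):
    for t in range(T):
        xbow_loop[b, t] = x[b, : t + 1].mean(dim=0)

# Vectorized: a row-normalized lower-triangular matrix does the same averaging.
tril = torch.tril(torch.ones(T, T))
weights = tril / tril.sum(dim=1, keepdim=True)   # (T, T), each row sums to 1
xbow_matmul = weights @ x                        # (T, T) @ (B, T, C) -> (B, T, C)

# Same weights via masking + softmax, which is the form self-attention uses.
scores = torch.zeros(T, T).masked_fill(tril == 0, float("-inf"))
weights_softmax = torch.softmax(scores, dim=-1)
xbow_softmax = weights_softmax @ x
```

With all-zero scores the softmax reduces to a uniform average over the allowed positions; self-attention replaces those zeros with data-dependent Q·K^T affinities.
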
  • Multi-head attention

    • Run several attention heads in parallel with smaller head sizes, concatenate outputs, and project back — enables multiple independent communication patterns.
  • Feed-forward block

    • Per-token MLP (typically inner dimension ≈ 4× model dim) applied after attention so tokens can compute independently.
  • Residual (skip) connections

    • Add block outputs back to inputs (x = x + block(x)) to preserve identity pathways and stabilize gradients.
  • Layer normalization

    • Normalize per-token features (LayerNorm); pre-norm (apply before attention/MLP) usually improves training stability.
  • Regularization

    • Dropout applied to attention probabilities and projections to avoid overfitting at scale.
  • Two-stage workflow for production GPT-style systems

    1. Pretraining: large decoder-only Transformer trained on next-token prediction over huge corpora (billions → trillions of tokens).
    2. Fine-tuning / alignment: supervised fine-tuning, then a reward model trained on collected human rankings, then RL (e.g., PPO) to optimize for human-preferred outputs (RLHF).

Step-by-step methodology (implementation roadmap)

  1. Data & tokenizer

    • Download Tiny Shakespeare (≈1 MB).
    • Build vocabulary: sorted(set(text)) → e.g., vocab_size = 65.
    • Implement character-level encoder/decoder mappings and encode the entire text to a long integer tensor.
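
A character-level tokenizer along these lines might look like the sketch below. The short `text` string is a stand-in (an assumption) for the downloaded Tiny Shakespeare file, and the names `stoi`, `itos`, `encode`, and `decode` are illustrative.

```python
import torch

# Stand-in text (assumption); the real workflow reads the Tiny Shakespeare file.
text = "First Citizen:\nBefore we proceed any further, hear me speak."

chars = sorted(set(text))                        # unique characters, stable order
vocab_size = len(chars)

stoi = {ch: i for i, ch in enumerate(chars)}     # character -> integer id
itos = {i: ch for ch, i in stoi.items()}         # integer id -> character

def encode(s: str) -> list[int]:
    """Map a string to a list of integer ids, one per character."""
    return [stoi[c] for c in s]

def decode(ids: list[int]) -> str:
    """Map a list of integer ids back to a string."""
    return "".join(itos[i] for i in ids)

# Encode the whole corpus once into a long integer tensor.
data = torch.tensor(encode(text), dtype=torch.long)
```

On the full Tiny Shakespeare text the same code would yield the 65-character vocabulary mentioned above; on this stand-in string the vocabulary is smaller.
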
  2. Train / validation split

    • Use the first 90% of data for training and the last 10% for validation.
  3. Mini-batching and chunking

    • Set block_size (context length), e.g., 8 initially.
    • Data loader:
      • Sample random start offsets (one per batch element).
      • Extract chunks of length block_size + 1 → inputs X are the first block_size tokens, targets Y are the same window shifted one position to the right.
      • Stack to produce (B, T) input and target tensors.
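
Steps 2 and 3 can be sketched together as below. The random `data` tensor is a stand-in (an assumption) for the encoded corpus, and `get_batch` is an illustrative helper name; batch_size = 4 is an arbitrary demo value.

```python
import torch

torch.manual_seed(1337)
data = torch.randint(0, 65, (1000,))      # stand-in for the encoded corpus

# First 90% for training, last 10% for validation.
n = int(0.9 * len(data))
train_data, val_data = data[:n], data[n:]

block_size = 8    # context length
batch_size = 4    # number of independent chunks per batch

def get_batch(split: str):
    source = train_data if split == "train" else val_data
    # One random start offset per batch element; leave room for the shifted target.
    ix = torch.randint(len(source) - block_size, (batch_size,))
    x = torch.stack([source[i : i + block_size] for i in ix])          # (B, T)
    y = torch.stack([source[i + 1 : i + block_size + 1] for i in ix])  # (B, T)
    return x, y

xb, yb = get_batch("train")
```

Each row of `yb` is the matching row of `xb` shifted by one token, so every position in the chunk provides a training example.
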
  4. Baseline bigram model

    • PyTorch Module with:
      • token_embeddings = Embedding(vocab_size, n_embed)
      • lm_head = Linear(n_embed, vocab_size)
    • Forward: embeddings → logits; reshape logits and targets for CrossEntropyLoss.
    • Generation: given context (B, T), repeatedly sample from softmax of last-step logits to append tokens.
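
A baseline of this shape could be written as in the sketch below. It follows the module layout listed above; the class name, the hyperparameters (vocab_size = 65, n_embed = 32), and the random inputs are assumptions for the demo.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class BigramLanguageModel(nn.Module):
    def __init__(self, vocab_size: int, n_embed: int):
        super().__init__()
        self.token_embeddings = nn.Embedding(vocab_size, n_embed)
        self.lm_head = nn.Linear(n_embed, vocab_size)

    def forward(self, idx, targets=None):
        x = self.token_embeddings(idx)        # (B, T, n_embed)
        logits = self.lm_head(x)              # (B, T, vocab_size)
        loss = None
        if targets is not None:
            B, T, C = logits.shape
            # cross_entropy expects (N, C) logits and (N,) class indices.
            loss = F.cross_entropy(logits.view(B * T, C), targets.view(B * T))
        return logits, loss

    @torch.no_grad()
    def generate(self, idx, max_new_tokens: int):
        for _ in range(max_new_tokens):
            logits, _ = self(idx)
            probs = F.softmax(logits[:, -1, :], dim=-1)          # last timestep only
            idx_next = torch.multinomial(probs, num_samples=1)   # (B, 1)
            idx = torch.cat((idx, idx_next), dim=1)              # (B, T + 1)
        return idx

torch.manual_seed(0)
model = BigramLanguageModel(vocab_size=65, n_embed=32)
xb = torch.randint(0, 65, (4, 8))     # stand-in inputs
yb = torch.randint(0, 65, (4, 8))     # stand-in targets
logits, loss = model(xb, yb)
out = model.generate(torch.zeros((1, 1), dtype=torch.long), max_new_tokens=10)
```

Because each position's logits depend only on the token at that position, the model cannot use any earlier context; that limitation is what the attention layers later remove.
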
  5. Training loop basics

    • Optimizer: AdamW in the video (a typical learning rate is ~3e-4; very small networks tolerate higher rates such as 1e-3).
    • Batch size: increase (e.g., 32, 64) for GPU efficiency.
    • Steps: sample batch → compute loss → zero grads → loss.backward() → optimizer.step().
    • Estimate loss: evaluate train/validation losses over several batches with torch.no_grad() to report stable metrics.
    • Device management: move model/data to CUDA if available; use model.eval() and torch.no_grad() for evaluation.
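
The loop and the loss estimator might be put together as in the sketch below. Everything here is a stand-in under stated assumptions: the corpus is a synthetic repeating pattern, the model is an embedding followed by a linear head, and the values lr = 1e-2, 300 steps, and eval_iters = 10 are demo choices rather than settings from the video. It uses torch.optim.AdamW; torch.optim.Adam would slot in the same way.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)
device = "cuda" if torch.cuda.is_available() else "cpu"
vocab_size, block_size, batch_size, eval_iters = 16, 8, 32, 10

# Stand-in corpus with an easy pattern (token t is followed by t + 1, wrapping).
data = (torch.arange(2000) % vocab_size).to(device)
n = int(0.9 * len(data))
splits = {"train": data[:n], "val": data[n:]}

def get_batch(split):
    src = splits[split]
    ix = torch.randint(len(src) - block_size, (batch_size,))
    x = torch.stack([src[i : i + block_size] for i in ix])
    y = torch.stack([src[i + 1 : i + block_size + 1] for i in ix])
    return x, y

# Stand-in model: embedding -> linear head (the bigram baseline shape).
model = nn.Sequential(nn.Embedding(vocab_size, 32), nn.Linear(32, vocab_size)).to(device)

def loss_fn(x, y):
    logits = model(x)                                    # (B, T, vocab_size)
    return F.cross_entropy(logits.view(-1, vocab_size), y.view(-1))

@torch.no_grad()
def estimate_loss():
    """Average the loss over several batches per split for a less noisy estimate."""
    model.eval()
    out = {}
    for split in splits:
        losses = torch.stack([loss_fn(*get_batch(split)) for _ in range(eval_iters)])
        out[split] = losses.mean().item()
    model.train()
    return out

optimizer = torch.optim.AdamW(model.parameters(), lr=1e-2)
initial = estimate_loss()
for step in range(300):
    xb, yb = get_batch("train")
    loss = loss_fn(xb, yb)
    optimizer.zero_grad(set_to_none=True)
    loss.backward()
    optimizer.step()
final = estimate_loss()
```

Comparing `initial` and `final` for both splits is the same check the video uses to spot overfitting: training loss falling while validation loss stalls or rises.
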
  6. Add positional encodings

    • position_embeddings = Embedding(block_size, n_embed)
    • Sum token and position embeddings and pass into the attention stack.
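
Summing the two embeddings relies on broadcasting, which the following sketch makes explicit. The sizes and the random `idx` batch are assumptions for the demo.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
vocab_size, block_size, n_embed = 65, 8, 32

token_embeddings = nn.Embedding(vocab_size, n_embed)
position_embeddings = nn.Embedding(block_size, n_embed)

idx = torch.randint(0, vocab_size, (4, block_size))   # (B, T) stand-in token ids
B, T = idx.shape

tok_emb = token_embeddings(idx)                       # (B, T, n_embed)
pos_emb = position_embeddings(torch.arange(T))        # (T, n_embed)
x = tok_emb + pos_emb                                 # broadcast over batch -> (B, T, n_embed)
```

The position table has only block_size rows, which is why generation later has to crop the context to the last block_size tokens.
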
  7. Implement single self-attention head

    • Linear projections for Q, K, V from X (shape (B, T, C)) to (B, T, head_size).
    • Compute scores: Q @ K.transpose(-2, -1) → (B, T, T), scaled by 1/sqrt(head_size).
    • Apply lower-triangular mask (set future positions to -inf) then softmax and optional dropout.
    • Multiply attention weights by V → (B, T, head_size).
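
A single causal head following the four bullets above could be sketched like this. The class name, constructor arguments, and demo sizes are assumptions; the perturbation at the end is an added sanity check that earlier positions never see later ones.

```python
import math
import torch
import torch.nn as nn
import torch.nn.functional as F

class Head(nn.Module):
    def __init__(self, n_embed, head_size, block_size, dropout=0.0):
        super().__init__()
        self.key = nn.Linear(n_embed, head_size, bias=False)
        self.query = nn.Linear(n_embed, head_size, bias=False)
        self.value = nn.Linear(n_embed, head_size, bias=False)
        # Buffer, not a parameter: moves with the module but is never trained.
        self.register_buffer("tril", torch.tril(torch.ones(block_size, block_size)))
        self.dropout = nn.Dropout(dropout)

    def forward(self, x):
        B, T, C = x.shape
        q, k, v = self.query(x), self.key(x), self.value(x)         # each (B, T, head_size)
        scores = q @ k.transpose(-2, -1) / math.sqrt(k.shape[-1])   # (B, T, T)
        scores = scores.masked_fill(self.tril[:T, :T] == 0, float("-inf"))
        weights = self.dropout(F.softmax(scores, dim=-1))           # rows sum to 1
        return weights @ v                                          # (B, T, head_size)

torch.manual_seed(0)
head = Head(n_embed=32, head_size=16, block_size=8)
x = torch.randn(2, 8, 32)
out = head(x)

# Causality check: changing positions 5.. must leave outputs 0..4 unchanged.
x_perturbed = x.clone()
x_perturbed[:, 5:] = 0.0
out_perturbed = head(x_perturbed)
```
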
  8. Value aggregation and projection

    • Project attention outputs back into model dimension (concatenate heads then linear project).
  9. Multi-head attention

    • Run several heads in parallel (each head_size = n_embed / n_heads), concatenate along the channel dimension, and project back to n_embed.
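
One way to realize multi-head attention is to compute all heads in a single batched tensor operation instead of looping over separate head modules. The sketch below takes that batched route, which is a swapped-in variant and not necessarily the code built in the video; the layer names `qkv` and `proj` and the demo sizes are assumptions.

```python
import math
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiHeadAttention(nn.Module):
    def __init__(self, n_embed, n_heads, block_size, dropout=0.0):
        super().__init__()
        assert n_embed % n_heads == 0, "n_embed must be divisible by n_heads"
        self.n_heads = n_heads
        self.qkv = nn.Linear(n_embed, 3 * n_embed, bias=False)   # Q, K, V for all heads
        self.proj = nn.Linear(n_embed, n_embed)                  # output projection
        self.attn_dropout = nn.Dropout(dropout)
        self.resid_dropout = nn.Dropout(dropout)
        self.register_buffer("tril", torch.tril(torch.ones(block_size, block_size)))

    def forward(self, x):
        B, T, C = x.shape
        hs = C // self.n_heads                                   # head_size
        q, k, v = self.qkv(x).split(C, dim=-1)                   # each (B, T, C)
        # (B, T, C) -> (B, n_heads, T, hs): heads become a batch dimension.
        q = q.view(B, T, self.n_heads, hs).transpose(1, 2)
        k = k.view(B, T, self.n_heads, hs).transpose(1, 2)
        v = v.view(B, T, self.n_heads, hs).transpose(1, 2)
        att = q @ k.transpose(-2, -1) / math.sqrt(hs)            # (B, n_heads, T, T)
        att = att.masked_fill(self.tril[:T, :T] == 0, float("-inf"))
        att = self.attn_dropout(F.softmax(att, dim=-1))
        out = att @ v                                            # (B, n_heads, T, hs)
        out = out.transpose(1, 2).contiguous().view(B, T, C)     # concatenate heads
        return self.resid_dropout(self.proj(out))

torch.manual_seed(0)
mha = MultiHeadAttention(n_embed=48, n_heads=6, block_size=8)
x = torch.randn(2, 8, 48)
out = mha(x)

# Causality check, as for a single head.
x_perturbed = x.clone()
x_perturbed[:, 5:] = 0.0
out_perturbed = mha(x_perturbed)
```

The final `transpose` and `view` perform the concatenation along the channel dimension, so the output has the same (B, T, n_embed) shape as the input and can feed a residual connection directly.
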
  10. Add feed-forward MLP

    • Two-layer MLP with a nonlinearity (ReLU in the video; GELU in nanoGPT), inner dim ≈ 4× n_embed, applied per token.

  11. Residual connections and pre-norm LayerNorm

    • Wrap attention and MLP with residual connections.
    • Apply LayerNorm before each sub-block (pre-norm).
    • Register the lower-triangular mask as a module buffer for masking.
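
A pre-norm block with residual connections can be sketched as below. To keep the example short and self-contained, attention is delegated to PyTorch's built-in nn.MultiheadAttention as a stand-in for the hand-written heads, and the mask is stored as a boolean "future positions" buffer (the complement of the lower-triangular mask). Class names and sizes are assumptions.

```python
import torch
import torch.nn as nn

class FeedForward(nn.Module):
    """Per-token MLP with a 4x inner dimension."""
    def __init__(self, n_embed, dropout=0.0):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(n_embed, 4 * n_embed),
            nn.ReLU(),
            nn.Linear(4 * n_embed, n_embed),
            nn.Dropout(dropout),
        )

    def forward(self, x):
        return self.net(x)

class Block(nn.Module):
    def __init__(self, n_embed, n_heads, block_size, dropout=0.0):
        super().__init__()
        self.ln1 = nn.LayerNorm(n_embed)
        self.ln2 = nn.LayerNorm(n_embed)
        # Stand-in attention (assumption): the built-in module, not custom heads.
        self.attn = nn.MultiheadAttention(n_embed, n_heads, dropout=dropout, batch_first=True)
        self.ffwd = FeedForward(n_embed, dropout)
        # Boolean mask where True means "may not attend" (strictly future positions).
        future = torch.triu(torch.ones(block_size, block_size, dtype=torch.bool), diagonal=1)
        self.register_buffer("future_mask", future)

    def forward(self, x):
        T = x.shape[1]
        h = self.ln1(x)                                   # pre-norm before attention
        attn_out, _ = self.attn(h, h, h, attn_mask=self.future_mask[:T, :T], need_weights=False)
        x = x + attn_out                                  # residual around attention
        x = x + self.ffwd(self.ln2(x))                    # pre-norm + residual around MLP
        return x

torch.manual_seed(0)
# Stacking blocks: two layers here, purely as a demo value.
stack = nn.Sequential(*[Block(n_embed=32, n_heads=4, block_size=8) for _ in range(2)])
x = torch.randn(2, 8, 32)
out = stack(x)

x_perturbed = x.clone()
x_perturbed[:, 5:] = 0.0
out_perturbed = stack(x_perturbed)
```

Because LayerNorm and the MLP act on each token independently, the only cross-token path is the masked attention, so causality survives stacking.
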

  12. Additional training stability features

    • Scale dot-product scores by 1/sqrt(d_k).
    • Use dropout inside attention and on linear projections.
    • Use standard parameter initialization practices.
    • Re-evaluate learning rate and use estimate_loss more frequently when adding attention.

  13. Stack blocks and scale up

    • Create n_layers Transformer blocks.
    • Increase n_embed, n_layers, block_size, batch_size, n_heads and train longer.
    • Use dropout and larger compute to reduce validation loss.

  14. Practical tips

    • Always reshape logits for CrossEntropyLoss as (B*T, C).
    • Crop generation context to block_size to stay within position embedding range.
    • Use self.register_buffer (an nn.Module method) for non-parameter tensors such as masks.
    • Use torch.no_grad() and model.eval() during evaluation and generation.
    • Save/load checkpoints and manage weight decay exclusions.
    • Profile and run on GPU; CPU training for large configs is impractical.
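
The context-cropping tip can be illustrated with a deliberately tiny model. `TinyLM` is a hypothetical stand-in (token plus position embeddings feeding a linear head, no attention), and `generate` is an illustrative helper; the point is only the `idx[:, -block_size:]` crop, without which the position embedding lookup would go out of range once the sequence exceeds block_size.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyLM(nn.Module):
    """Hypothetical minimal model: token + position embeddings -> linear head."""
    def __init__(self, vocab_size, n_embed, block_size):
        super().__init__()
        self.block_size = block_size
        self.tok = nn.Embedding(vocab_size, n_embed)
        self.pos = nn.Embedding(block_size, n_embed)
        self.head = nn.Linear(n_embed, vocab_size)

    def forward(self, idx):
        T = idx.shape[1]
        x = self.tok(idx) + self.pos(torch.arange(T, device=idx.device))
        return self.head(x)                               # (B, T, vocab_size)

@torch.no_grad()
def generate(model, idx, max_new_tokens):
    model.eval()
    for _ in range(max_new_tokens):
        idx_cond = idx[:, -model.block_size:]             # keep only the last block_size tokens
        logits = model(idx_cond)[:, -1, :]                # logits for the final position
        probs = F.softmax(logits, dim=-1)
        idx_next = torch.multinomial(probs, num_samples=1)
        idx = torch.cat((idx, idx_next), dim=1)
    return idx

torch.manual_seed(0)
model = TinyLM(vocab_size=65, n_embed=32, block_size=8)
start = torch.zeros((1, 1), dtype=torch.long)
out = generate(model, start, max_new_tokens=20)           # longer than block_size
```
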

Empirical results (from the video)

  • Baseline bigram:
    • Initial loss ~4.87 at random initialization (uniform guessing over 65 characters would give ln 65 ≈ 4.17) → trained down to ~2.5.
  • Adding self-attention:
    • Validation loss ~2.4.
  • Multi-head attention + feed-forward:
    • Loss dropped to ~2.24 → ~2.08 after residuals and expanding feed-forward inner dim (×4).
  • Adding pre-norm LayerNorm and final norm:
    • Improved further (~2.06).
  • Scaled-up small model (example):
    • n_embed = 384, block_size = 256, n_layers = 6, n_heads = 6, dropout 0.2, larger batch.
    • Final reported validation loss ≈ 1.48 after ~15 minutes on an A100 GPU.
    • Generated text became noticeably Shakespeare-like (syntactic structure similar, but often semantically nonsensical).

Where the released code fits (nanoGPT)

  • The speaker’s GitHub repo nanoGPT includes:
    • train.py: training boilerplate (data loading, optimizer, LR schedules, checkpointing, distributed options).
    • model.py: Transformer implementation (causal self-attention, MLP, blocks, positional embeddings, generate).
  • The video-built model is intentionally small and didactic; production GPTs differ mainly by scale (parameters, tokens, compute), data, and fine-tuning/alignment pipelines.

How ChatGPT differs from this tiny GPT implementation

  • Architectural similarity

    • ChatGPT is a decoder-only Transformer like the demo model but vastly larger (e.g., GPT-3: 175B parameters) and trained on much larger corpora (hundreds of billions → trillions of tokens).
  • Two-stage workflow in practice

    1. Pretraining: large next-token prediction on web-scale text.
    2. Fine-tuning / alignment: supervised fine-tuning on assistant examples, collect human preference rankings, train a reward model, and apply RL (PPO-like) to align outputs with human preferences (RLHF).
  • Additional engineering required

    • Production requires tokenizers (tiktoken / BPE), data curation, safety filters, caching, serving infrastructure, multi-GPU distributed training, and extensive evaluation — much of which is private and expensive.

Practical takeaways / lessons

  • A working GPT-style model is conceptually compact:
    • Token embeddings, positional encodings, scaled dot-product self-attention, per-token MLPs, residuals, layer norms, softmax sampling, and cross-entropy training.
  • The main challenges are scale (compute, data, distributed training) and the fine-tuning / alignment pipeline to produce useful and safe assistants.
  • Key implementation tricks:
    • Causal masking via a lower-triangular mask.
    • Batched matrix multiplies for efficiency.
    • Scale attention scores by 1/sqrt(d_k).
    • Use pre-norm LayerNorm and residuals for stability and gradient flow.
    • Careful reshaping for loss computation in PyTorch.

Files / code referenced

  • nanoGPT GitHub repository (includes train.py, model.py).
  • Google Colab notebook used for live demonstration.
  • Libraries / tools: PyTorch, tiktoken (OpenAI), SentencePiece (Google), Google Colab, A100 GPU.

Speakers / sources featured

  • Speaker: Andrej Karpathy (presenter; author of the nanoGPT walkthrough and the “makemore” video series).
  • Paper: “Attention Is All You Need” (Vaswani et al., 2017).
  • Other referenced papers / methods:
    • Residual learning (ResNet), Dropout, LayerNorm, Scaled dot-product attention, PPO / policy optimization for RL.
  • Projects / tools:
    • OpenAI GPT family (GPT, GPT-2, GPT-3), tiktoken, Tiny Shakespeare dataset, SentencePiece / BPE tokenizers, nanoGPT, PyTorch, Google Colab.
