Summary of "MIT 6.S191: Recurrent Neural Networks, Transformers, and Attention"
Summary of MIT 6.S191 Lecture 2: Recurrent Neural Networks, Transformers, and Attention
This lecture provides a foundational overview of sequence modeling in deep learning, focusing on Recurrent Neural Networks (RNNs), their limitations, and the modern Transformer architecture with the attention mechanism. The goal is to prepare students for advanced topics like large language models (LLMs) by building intuition and understanding from first principles.
Main Ideas and Concepts
1. Introduction to Sequence Modeling
Sequence modeling involves predicting or generating outputs based on sequential data such as time series, text, or audio.
- Example: Predicting the next position of a moving ball based on its prior trajectory.
- Sequential data is ubiquitous: speech, text, ECG signals, stock prices, biological sequences, weather, video, etc.
- Common tasks include:
- Single input to single output (classification).
- Sequence input to single output (e.g., sentiment classification).
- Sequence input to sequence output (e.g., language translation, text generation).
2. From Feedforward Networks to Recurrent Neural Networks (RNNs)
- Feedforward networks operate on static inputs with no notion of time or sequence.
- Processing each time step independently ignores temporal dependencies.
- RNNs introduce an internal hidden state \( h_t \) that carries information from previous time steps, enabling memory and temporal dependency modeling.
- The hidden state is updated recurrently (see the code sketch after this list):
\[ h_t = f(W_{xh} x_t + W_{hh} h_{t-1}) \]
- Output at each time step depends on both current input and hidden state.
- RNNs can be visualized as cyclic graphs or unrolled over time steps.
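A minimal NumPy sketch of one recurrent step, assuming \( f = \tanh \) and illustrative dimensions (not the lecture's exact code):

```python
# Minimal single-step RNN update in NumPy; weights and sizes are illustrative.
import numpy as np

input_dim, hidden_dim, output_dim = 4, 8, 3
rng = np.random.default_rng(0)
W_xh = 0.1 * rng.normal(size=(hidden_dim, input_dim))   # input-to-hidden weights
W_hh = 0.1 * rng.normal(size=(hidden_dim, hidden_dim))  # hidden-to-hidden (recurrent) weights
W_hy = 0.1 * rng.normal(size=(output_dim, hidden_dim))  # hidden-to-output weights

def rnn_step(x_t, h_prev):
    """One time step: h_t = f(W_xh x_t + W_hh h_{t-1}), output computed from h_t."""
    h_t = np.tanh(W_xh @ x_t + W_hh @ h_prev)
    y_t = W_hy @ h_t
    return y_t, h_t

h = np.zeros(hidden_dim)           # initial hidden state h_0 = 0
x = rng.normal(size=input_dim)     # one time step of input
y, h = rnn_step(x, h)              # the same weights are reused at every step
```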
3. Training RNNs
- Loss is computed at each time step and summed over the sequence.
- Backpropagation Through Time (BPTT) propagates gradients through the unrolled network.
- Challenges include:
- Vanishing gradients (gradients shrink exponentially).
- Exploding gradients (gradients grow exponentially).
- These issues make it difficult for RNNs to learn long-term dependencies.
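One standard way to see where these gradient problems come from (a sketch of the usual derivation, built from the recurrence defined above rather than transcribed from the lecture's slides): the gradient of a late loss with respect to an early hidden state is a product of per-step Jacobians.

```latex
\[
\frac{\partial \mathcal{L}_T}{\partial h_0}
  = \frac{\partial \mathcal{L}_T}{\partial h_T}
    \prod_{t=1}^{T} \frac{\partial h_t}{\partial h_{t-1}},
\qquad
\frac{\partial h_t}{\partial h_{t-1}}
  = \operatorname{diag}\!\big(f'(a_t)\big)\, W_{hh},
\quad a_t = W_{xh} x_t + W_{hh} h_{t-1}.
\]
```

Because \( W_{hh} \) is multiplied in at every step, this product tends to vanish when its largest singular value is below 1 and can explode when it is well above 1, which is exactly why long-term dependencies are hard to learn.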
4. Improvements to RNNs: LSTMs
- Long Short-Term Memory (LSTM) networks add gating mechanisms to control information flow.
- They help mitigate vanishing/exploding gradient problems and better capture long-term dependencies.
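As a minimal sketch (assuming TensorFlow's Keras API and illustrative sizes; the lecture does not prescribe this exact code), an LSTM layer can stand in for a simple recurrent layer in a next-word prediction model:

```python
# Illustrative next-word model with a gated (LSTM) recurrent layer.
import tensorflow as tf

vocab_size = 10_000  # assumed vocabulary size

model = tf.keras.Sequential([
    tf.keras.layers.Embedding(input_dim=vocab_size, output_dim=64),  # learned embeddings
    tf.keras.layers.LSTM(128),          # gates control what is written, kept, and read
    tf.keras.layers.Dense(vocab_size),  # logits over the vocabulary for the next word
])
```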
5. Practical Example: Language Modeling
- Task: Predict the next word in a sentence given previous words.
- Words must first be vectorized, i.e., converted into numeric representations, before being fed into a neural network.
- Common vectorization methods:
- One-hot encoding: Sparse vectors with a single 1 at the word index.
- Learned embeddings: Dense, lower-dimensional vectors learned during training.
- Sequence length variability and long-range dependencies make modeling challenging.
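A minimal NumPy sketch of both vectorization options on a tiny made-up vocabulary (in practice the embedding matrix is learned during training rather than left random):

```python
# One-hot encoding vs. dense embeddings for a toy vocabulary.
import numpy as np

vocab = ["i", "love", "recurrent", "neural", "nets"]
word_to_index = {w: i for i, w in enumerate(vocab)}

def one_hot(word):
    v = np.zeros(len(vocab))
    v[word_to_index[word]] = 1.0      # sparse vector: single 1 at the word's index
    return v

embedding_dim = 3
embedding_matrix = np.random.default_rng(0).normal(size=(len(vocab), embedding_dim))

def embed(word):
    return embedding_matrix[word_to_index[word]]   # dense, lower-dimensional vector

print(one_hot("love"))   # [0. 1. 0. 0. 0.]
print(embed("love"))     # 3-dimensional dense vector
```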
6. Limitations of RNNs
- Fixed-size hidden state limits information capacity (bottleneck).
- Sequential processing limits parallelization and efficiency.
- Difficulty in capturing very long-range dependencies.
7. Introduction to Attention and Transformers
- Motivating question: Can we model sequences without step-by-step recurrence?
- Naive approach: Concatenate all inputs and feed them to a single feedforward network; this discards order information and scales poorly.
- Attention mechanism enables models to “attend” to important parts of the input sequence dynamically.
- Inspired by human selective attention and search:
- Query: What we want to find.
- Keys: Descriptors of data elements.
- Values: Data elements associated with keys.
- Attention computes similarity (via dot product) between queries and keys, applies softmax to get weights, and uses these weights to combine values.
- This allows the model to relate different parts of the sequence regardless of their distance.
- Transformers use multi-head self-attention to capture diverse relationships simultaneously.
- Positional embeddings encode order information since attention alone is order-agnostic.
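In standard notation (not transcribed from the lecture's slides), the attention computation described above is usually written as:

```latex
\[
\operatorname{Attention}(Q, K, V)
  = \operatorname{softmax}\!\left(\frac{Q K^{\top}}{\sqrt{d_k}}\right) V
\]
```

where \( Q \), \( K \), and \( V \) are the query, key, and value matrices, \( d_k \) is the key dimension used for scaling, and each row of the softmax output gives the weights with which one position attends to every other position.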
8. Applications and Extensions
- Transformers have revolutionized natural language processing (e.g., GPT, ChatGPT).
- Extended to other domains:
- Biological sequence analysis.
- Vision Transformers (ViT) for image processing.
- Students will get hands-on experience with RNNs and Transformers in course labs.
Methodology / Instructions Highlighted
Building an RNN from Scratch (Pseudo-code Outline)
- Initialize hidden state \( h_0 = 0 \).
- For each input in the sequence:
- Update hidden state using current input and previous hidden state.
- Generate output prediction from hidden state.
- Use predictions to compute loss at each time step.
- Train using backpropagation through time.
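A minimal end-to-end sketch of this outline, assuming TensorFlow/Keras and toy random data (vocabulary size, dimensions, and the next-token setup are illustrative assumptions, not the course lab's code); the gradient tape unrolls the recurrence, so backpropagation through time is handled by automatic differentiation:

```python
# Toy next-token RNN trained for one step; the loss covers every time step.
import tensorflow as tf

vocab_size, embed_dim, hidden_dim, seq_len, batch = 50, 16, 32, 10, 4

model = tf.keras.Sequential([
    tf.keras.layers.Embedding(vocab_size, embed_dim),
    tf.keras.layers.SimpleRNN(hidden_dim, return_sequences=True),  # h_t at every step
    tf.keras.layers.Dense(vocab_size),                             # prediction at every step
])
loss_fn = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True)
optimizer = tf.keras.optimizers.Adam()

# Toy data: predict token t+1 from tokens up to t.
tokens = tf.random.uniform((batch, seq_len + 1), maxval=vocab_size, dtype=tf.int32)
inputs, targets = tokens[:, :-1], tokens[:, 1:]

with tf.GradientTape() as tape:
    logits = model(inputs)            # output at each time step
    loss = loss_fn(targets, logits)   # per-step losses combined over the sequence
grads = tape.gradient(loss, model.trainable_variables)     # BPTT via autodiff
optimizer.apply_gradients(zip(grads, model.trainable_variables))
```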
Vectorizing Text Input
- Define a fixed vocabulary.
- Map each word to an index.
- Convert indices to one-hot vectors or embeddings.
Attention Mechanism Steps
- Compute query, key, and value matrices from input embeddings.
- Calculate dot product similarity between queries and keys.
- Scale and apply softmax to get attention weights.
- Multiply attention weights by values to get weighted output features.
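A minimal NumPy sketch of these four steps for single-head self-attention, with illustrative dimensions and random projection weights:

```python
# Single-head self-attention over a toy sequence of embeddings.
import numpy as np

seq_len, embed_dim, d_k = 5, 8, 8
rng = np.random.default_rng(0)

X = rng.normal(size=(seq_len, embed_dim))   # one embedding per position
W_q = rng.normal(size=(embed_dim, d_k))
W_k = rng.normal(size=(embed_dim, d_k))
W_v = rng.normal(size=(embed_dim, d_k))

Q, K, V = X @ W_q, X @ W_k, X @ W_v         # 1. query, key, and value matrices

scores = (Q @ K.T) / np.sqrt(d_k)           # 2.-3a. dot-product similarity, scaled

weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
weights /= weights.sum(axis=-1, keepdims=True)   # 3b. softmax over keys (attention weights)

output = weights @ V                        # 4. weighted combination of values
```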
Training RNNs
- Compute loss at each time step.
- Sum losses over the sequence.
- Backpropagate errors through time steps (BPTT).
Speakers / Sources Featured
- Ava – Primary lecturer presenting the material.
- Alexander – Instructor of lecture 1 and contributor to foundational concepts.
- John Werner – Host of the in-person reception mentioned at the end.
- Startup example – A company that trained a neural network on classical music to complete Schubert’s unfinished symphony (unnamed).
This lecture builds a solid conceptual and practical foundation for understanding sequence modeling, starting from simple feedforward networks, progressing through RNNs and LSTMs, and culminating in the modern attention-based Transformer architecture that underpins today’s state-of-the-art language models.