Summary of "Recurrent Neural Networks (RNNs), Clearly Explained!!!"
What recurrent neural networks (RNNs) are for
RNNs are neural networks designed to handle sequential data of variable length (for example, stock prices over different numbers of days). Unlike feedforward networks that require a fixed-size input, RNNs can process sequences because they include a feedback (recurrent) connection that passes information from one time step to the next. RNNs still have weights, biases, layers and activation functions like other neural nets; the defining feature is the feedback loop.
Illustrative example (StatLand stock rules)
Data in the toy example is discretized/scaled: low = 0, medium = 0.5, high = 1.
Simple rules used to motivate the example:
- low followed by low → next day likely low
- low followed by medium → next day likely higher
- high followed by medium → next day likely lower
- high followed by high → next day likely high
The RNN is used to predict tomorrow’s price using past days’ scaled values.
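The toy rules above can be written down as a tiny lookup table; the numeric encodings come from the summary itself, and the trend labels are a paraphrase of "likely low/higher/lower/high":

```python
# Illustrative encoding of the StatLand toy example.
# Scaled values are from the summary: low = 0, medium = 0.5, high = 1.
LOW, MEDIUM, HIGH = 0.0, 0.5, 1.0

# (yesterday, today) -> qualitative expectation for tomorrow
rules = {
    (LOW, LOW): "low",
    (LOW, MEDIUM): "higher",
    (HIGH, MEDIUM): "lower",
    (HIGH, HIGH): "high",
}

print(rules[(LOW, MEDIUM)])  # the kind of pattern the RNN is meant to learn
```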
How the RNN processes sequential inputs (intuitive mechanics)
- At each time step the input is combined with the recurrent signal (the previous hidden/output value multiplied by a recurrent weight) and passed through an activation to produce the next hidden/output.
- The recurrent output at one time step is fed into the next time step’s summation, allowing past inputs to influence future predictions.
- You can “unroll” the RNN in time: create a copy of the network for each time step so the sequence becomes a feedforward-like chain of copies. This makes it easy to visualize how inputs from time 1, 2, …, t flow to the final output.
- When unrolled, intermediate outputs can be ignored and only the final output used as the prediction (for example, the last output predicts tomorrow).
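The mechanics above can be sketched as a short forward pass. The weight names (W1, W2, B1) follow the methodology section later in this summary; tanh is an assumed activation, not necessarily the one used in the video:

```python
import math

def rnn_forward(xs, W1, W2, B1, act=math.tanh):
    """Unrolled vanilla-RNN forward pass over a sequence xs.

    The same W1, W2, B1 are reused at every time step (weight sharing),
    so unrolling adds no new trainable parameters.
    """
    h = 0.0  # hidden/output before the first time step
    for x in xs:                      # fed oldest -> newest
        h = act(x * W1 + h * W2 + B1)
    return h                          # intermediate outputs ignored; final one is the prediction

# e.g. two days of scaled prices (low = 0, medium = 0.5) with illustrative weights
prediction = rnn_forward([0.0, 0.5], W1=1.0, W2=1.0, B1=0.0)
print(prediction)
```

Note that the loop works for any sequence length, which is exactly the variable-length property the summary opens with.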
Key properties of unrolled RNNs
- Weight and bias sharing: all time-step copies share the same weights and biases (the same parameters are reused across time). Unrolling does not increase the number of trainable parameters.
- Inputs are fed oldest → newest; the final time-step output is the prediction for the next time point.
Training and the vanishing / exploding gradient problem
- RNNs are typically trained with backpropagation through time (BPTT), which computes gradients for parameters shared across timesteps.
- Because the recurrent connection multiplies signals repeatedly across time, gradients propagated backward can:
  - Explode if the recurrent weight (call it W2) has magnitude > 1. Example: W2 = 2; after n time steps an early input is effectively multiplied by 2^n, producing huge gradient terms and wildly large parameter updates that prevent convergence.
  - Vanish if the recurrent weight has magnitude < 1. Example: W2 = 0.5; after n steps an early input is multiplied by 0.5^n (nearly 0 for large n), producing tiny gradients and preventing learning of long-range dependencies.
- Consequence: long sequences make training vanilla RNNs difficult because gradient magnitudes either blow up or shrink toward zero, making optimization (gradient descent) ineffective.
- Simply constraining |W2| < 1 avoids exploding gradients but induces vanishing gradients; conversely, allowing larger magnitudes risks explosion.
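The 2^n versus 0.5^n argument is easy to check numerically; the two weight values are the summary's own examples:

```python
# Effective multiplier applied to the earliest input after n time steps,
# isolating the repeated multiplication by the recurrent weight W2
# (activations omitted, so this is the bare geometric growth/decay).
for W2 in (2.0, 0.5):
    for n in (10, 50):
        print(f"W2={W2}, n={n}: earliest input scaled by {W2 ** n:.3g}")
```

For n = 50, W2 = 2 scales the earliest input by roughly 10^15, while W2 = 0.5 shrinks it below 10^-15: both regimes make gradient descent ineffective on long sequences.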
Mitigation and next steps
- Long Short-Term Memory networks (LSTMs) and Transformers are popular solutions to the vanishing/exploding gradient problem and are discussed in later material.
- Practical note: vanilla RNNs are conceptually useful stepping stones but are less common in modern practice because of these training issues.
Assumptions and notes
- The StatQuest assumes familiarity with standard neural network ideas, backpropagation, and activation functions (e.g., ReLU).
- The toy example demonstrates how RNNs can incorporate multiple prior time steps (2, 3, …) without increasing parameter count, by unrolling and reusing parameters.
Methodology — how to run a sequence through a vanilla RNN
- Preprocess/scale data (in the example: low → 0, medium → 0.5, high → 1).
- Decide how many past time steps to use. If using T days, unroll the RNN into T copies (one per timestep).
- For each timestep t = 1..T (feed in order oldest → newest):
  - Multiply the input at time t by the input-to-hidden weight (W1) and add the bias (B1).
  - Add the recurrent contribution: the previous hidden/output (from t-1) multiplied by the recurrent weight (W2).
  - Pass the sum through the activation function to get the hidden/output for time t.
- Optionally ignore intermediate outputs; use the final time-step output as the prediction for the next time point.
- During training, compute gradients via backpropagation through time; remember all time-step copies share the same parameters, so gradients accumulate across time.
- Be aware: repeated multiplications by W2 across many timesteps cause vanishing/exploding gradients; choose architectures (e.g., LSTM, GRU) or training techniques to mitigate this for long sequences.
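The gradient-accumulation point can be made concrete with a hand-derived BPTT gradient. This sketch assumes a linear activation (purely to get a closed-form derivative; the video's network uses a nonlinearity), and checks the result against a finite difference:

```python
def forward(xs, W1, W2, B1):
    h = 0.0
    for x in xs:
        h = x * W1 + h * W2 + B1   # linear activation for a closed-form gradient
    return h

def grad_W2(xs, W1, W2, B1):
    """d(final output)/dW2, accumulated across time steps (BPTT by hand).

    For the linear recurrence, dh_T/dW2 = sum_t h_{t-1} * W2**(T-t):
    one term per time step because W2 is shared, and each term carries
    a power of W2 -- the factor that vanishes (|W2| < 1) or explodes
    (|W2| > 1) on long sequences.
    """
    hs = [0.0]
    for x in xs:
        hs.append(x * W1 + hs[-1] * W2 + B1)
    T = len(xs)
    return sum(hs[t - 1] * W2 ** (T - t) for t in range(1, T + 1))

xs = [0.0, 0.5, 1.0, 0.5]          # a scaled toy sequence
g = grad_W2(xs, W1=1.0, W2=0.9, B1=0.0)

# finite-difference check that the accumulated gradient is correct
eps = 1e-6
num = (forward(xs, 1.0, 0.9 + eps, 0.0) - forward(xs, 1.0, 0.9 - eps, 0.0)) / (2 * eps)
print(g, num)
```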
Speakers / sources featured
- Josh Starmer (narrator, StatQuest host)
- StatSquatch (character who speaks in the example)
Category
Educational