The algorithm that taught machines to learn
An interactive 10-minute visual guide to the most important algorithm in deep learning
They learn by making mistakes — then fixing them. Backpropagation is *how* they figure out what to fix.
Imagine throwing darts blindfolded. After each throw, a friend tells you "too far left" or "too high."
You adjust your aim accordingly. Over hundreds of throws, you get closer to the bullseye.
Backpropagation is the friend telling the network how to adjust.
A network is layers of interconnected neurons, each with adjustable weights.
Each line is a weight — a number that controls signal strength. Learning = finding the right weights.
Data flows from input to output — the network makes a prediction.
Each neuron: multiply inputs by weights, add bias, apply activation function. That's it!
Compare prediction to the true answer using a loss function.
L = (y - ŷ)²
If true y = 2.0 and prediction ŷ = 1.1:
L = (2.0 - 1.1)² = 0.81
Make the loss as small as possible. A loss of 0 means the prediction is perfect.
0.81 is high — the network needs to improve!
Backpropagation is just the chain rule from calculus applied systematically.
If changing A causes B to change, and changing B causes C to change, then we can figure out how A affects C by multiplying the two effects together.
"How does the loss change when we nudge weight w?" — multiply the chain of local derivatives.
Gradients flow backward — from the loss to every weight.
The gradient −3.6 tells us: increase w to reduce the loss (since gradient is negative).
Use gradients to take a step downhill on the loss landscape.
η (eta) is the learning rate — how big each step is. Too big = overshoot. Too small = too slow.
Adjust the weight and learning rate — watch the loss change in real time.
The same algorithm works for networks with millions of parameters.
Gradients propagate backward through every layer. Each layer computes its local gradient and passes it back.
Modern frameworks (PyTorch, TensorFlow) automatically build a graph of all operations and compute gradients via autograd.
Matrix multiplications in backprop are massively parallelizable. GPUs can process millions of gradients simultaneously.
Real-world training faces several obstacles.
In very deep networks, gradients shrink exponentially as they pass through layers, making early layers almost impossible to train. Solutions: ReLU activation, residual connections (ResNets), batch normalization.
The opposite problem — gradients grow uncontrollably. Solutions: gradient clipping, careful weight initialization (Xavier, He).
The loss landscape is complex. Optimizers like Adam, RMSprop, and SGD with momentum help navigate these tricky terrains.
Large models (GPT-4, etc.) require thousands of GPUs training for weeks. Techniques: mixed-precision training, gradient checkpointing, distributed training.
The complete training loop in four steps, repeated thousands of times.
Feed a batch of data through the network to produce predictions.
Measure how wrong the predictions are using a loss function (MSE, cross-entropy, etc.).
Compute gradients of the loss with respect to every weight using the chain rule.
Nudge every weight in the direction that reduces the loss: w = w − η × ∇L.
An algorithm that computes how much each weight in a neural network contributes to the prediction error.
It applies the chain rule of calculus backward through the network to compute gradients efficiently.
Without backprop, deep learning wouldn't exist. It enables training of everything from image classifiers to large language models.
Complex learning is just small, repeated corrections guided by mathematics. Simple rules, emergent intelligence.
First described by Rumelhart, Hinton & Williams (1986) — now the backbone of all modern AI.