Backpropagation

The algorithm that taught machines to learn

An interactive 10-minute visual guide to the most important algorithm in deep learning

Deep Learning Mathematics Visual Guide

01 / 12 — The Big Picture

How Do Neural Networks Learn?

They learn by making mistakes — then fixing them. Backpropagation is *how* they figure out what to fix.

🎯

Imagine throwing darts blindfolded. After each throw, a friend tells you "too far left" or "too high." You adjust your aim accordingly. Over hundreds of throws, you get closer to the bullseye.

Backpropagation is the friend telling the network how to adjust.

02 / 12 — Neural Network Anatomy

Inside a Neural Network

A network is layers of interconnected neurons, each with adjustable weights.

Each connection between two neurons carries a weight, a number that controls signal strength. Learning = finding the right weights.

03 / 12 — Forward Pass

Step 1: The Forward Pass

Data flows from input to output — the network makes a prediction.

Input
x = 2.0

→

Multiply
× w = 0.5

→

Add Bias
+ b = 0.1

→

Activate
ReLU

→

Output
ŷ = 1.1

ŷ = ReLU(w · x + b) = ReLU(0.5 × 2.0 + 0.1) = 1.1

Each neuron: multiply inputs by weights, add bias, apply activation function. That's it!
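The forward pass above can be sketched in a few lines of Python. This is a minimal illustration for the single neuron on this slide; the function names `relu` and `forward` are chosen here for the example and do not come from any library.

```python
def relu(z: float) -> float:
    """Rectified linear unit: keeps positive values, maps everything else to 0."""
    return max(0.0, z)


def forward(x: float, w: float, b: float) -> float:
    """One neuron: multiply the input by the weight, add the bias, activate."""
    z = w * x + b  # pre-activation value
    return relu(z)


# The numbers from the slide: x = 2.0, w = 0.5, b = 0.1
y_hat = forward(x=2.0, w=0.5, b=0.1)
print(f"prediction: {y_hat:.2f}")
```

With a negative pre-activation, for example `forward(2.0, -1.0, 0.1)`, ReLU returns 0.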

04 / 12 — The Loss

Step 2: Measure the Error

Compare prediction to the true answer using a loss function.

📎

Mean Squared Error

L = (y − ŷ)²   (for a single example; over a batch, average the squared errors)

If true y = 2.0 and prediction ŷ = 1.1:
L = (2.0 − 1.1)² = 0.81

🎯

The Goal

Make the loss as small as possible. A loss of 0 means the prediction is perfect.

A loss of 0.81 is well above 0, so the network needs to improve!

Loss: L = (y − ŷ)² = (2.0 − 1.1)² = 0.81
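As a small sketch, the loss on this slide can be written as a function. The names `squared_error` and `mean_squared_error` are illustrative; the batch version assumes the usual convention of averaging per-example squared errors.

```python
def squared_error(y: float, y_hat: float) -> float:
    """Loss for a single example: squared difference between target and prediction."""
    return (y - y_hat) ** 2


def mean_squared_error(ys, y_hats) -> float:
    """Average of the per-example squared errors over a batch."""
    errors = [squared_error(y, y_hat) for y, y_hat in zip(ys, y_hats)]
    return sum(errors) / len(errors)


# The slide's example: target 2.0, prediction 1.1
loss = squared_error(2.0, 1.1)
print(f"loss: {loss:.2f}")
```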

05 / 12 — The Chain Rule

The Key Math: Chain Rule

Backpropagation is just the chain rule from calculus applied systematically.

⛓️

If changing A causes B to change, and changing B causes C to change, then we can figure out how A affects C by multiplying the two effects together.

∂L/∂w = ∂L/∂ŷ × ∂ŷ/∂z × ∂z/∂w

Here z = w · x + b is the neuron's value before the activation is applied. "How does the loss change when we nudge weight w?" Multiply the chain of local derivatives.
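The "nudge" idea can be tested directly: change w by a tiny amount and see how much the loss moves. The sketch below compares that numerical estimate with the product of local derivatives, assuming the same example values used in this guide (x = 2.0, b = 0.1, y = 2.0). The helper names are illustrative.

```python
def loss_for_weight(w: float, x: float = 2.0, b: float = 0.1, y: float = 2.0) -> float:
    """Run the forward pass and return the squared-error loss for a given weight."""
    z = w * x + b
    y_hat = max(0.0, z)  # ReLU
    return (y - y_hat) ** 2


def numerical_gradient(f, w: float, eps: float = 1e-6) -> float:
    """Central finite difference: nudge w up and down, measure the change in f."""
    return (f(w + eps) - f(w - eps)) / (2 * eps)


approx = numerical_gradient(loss_for_weight, 0.5)

# Product of the three local derivatives at w = 0.5:
# dL/dy_hat = -1.8, dy_hat/dz = 1.0, dz/dw = 2.0
analytic = (-1.8) * (1.0) * (2.0)

print(f"numerical: {approx:.4f}  chain rule: {analytic:.4f}")
```

The two numbers should agree closely, which is a common sanity check for hand-written gradient code.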

06 / 12 — Backward Pass

Step 3: The Backward Pass

Gradients flow backward — from the loss to every weight.

Loss
L=0.81

←

∂L/∂ŷ
−1.8

←

∂ŷ/∂z
1.0

←

∂z/∂w
2.0

←

∂L/∂w
−3.6

∂L/∂w = (−1.8) × (1.0) × (2.0) = −3.6

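Where the numbers come from: ∂L/∂ŷ = −2(y − ŷ) = −2 × 0.9 = −1.8; ∂ŷ/∂z = 1.0 because ReLU passes positive inputs through unchanged (z = 1.1 > 0); and ∂z/∂w = x = 2.0.

The gradient −3.6 is negative, so increasing w reduces the loss.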
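The backward pass for this one-neuron example can be sketched as follows. The function name `backward` and the dictionary keys are chosen for illustration; the derivative formulas are the ones for squared error and ReLU.

```python
def backward(x: float, w: float, b: float, y: float) -> dict:
    """Forward pass followed by the chain rule, for one ReLU neuron and squared error."""
    # Forward pass, keeping the intermediate values the backward pass needs
    z = w * x + b
    y_hat = max(0.0, z)
    loss = (y - y_hat) ** 2

    # Local derivatives, one per link in the chain
    dL_dyhat = -2.0 * (y - y_hat)      # derivative of (y - y_hat)^2 w.r.t. y_hat
    dyhat_dz = 1.0 if z > 0 else 0.0   # ReLU slope: 1 for positive z, otherwise 0
    dz_dw = x                          # z = w*x + b, so dz/dw = x
    dz_db = 1.0                        # and dz/db = 1

    # Multiply backward along the chain
    dL_dz = dL_dyhat * dyhat_dz
    return {"loss": loss, "dL_dw": dL_dz * dz_dw, "dL_db": dL_dz * dz_db}


grads = backward(x=2.0, w=0.5, b=0.1, y=2.0)
print({name: round(value, 2) for name, value in grads.items()})
```

The shared factor `dL_dz` is computed once and reused for both the weight and the bias, which is the same reuse that makes backpropagation efficient in larger networks.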

07 / 12 — Gradient Descent

Step 4: Update the Weights

Use gradients to take a step downhill on the loss landscape.

w_new = w_old − η × ∂L/∂w = 0.5 − 0.01 × (−3.6) = 0.536

η (eta) is the learning rate — how big each step is. Too big = overshoot. Too small = too slow.
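A short sketch of the update rule, plus a toy experiment with different learning rates. It assumes the same single-neuron example (x = 2.0, b = 0.1, y = 2.0) and updates only w; the helper names and the specific rates tried are illustrative choices.

```python
def gradient_step(w: float, grad: float, lr: float) -> float:
    """One gradient descent update: move against the gradient, scaled by the learning rate."""
    return w - lr * grad


def loss_and_grad(w: float, x: float = 2.0, b: float = 0.1, y: float = 2.0):
    """Loss and dL/dw for one ReLU neuron with squared error."""
    z = w * x + b
    y_hat = max(0.0, z)
    loss = (y - y_hat) ** 2
    grad = -2.0 * (y - y_hat) * (1.0 if z > 0 else 0.0) * x
    return loss, grad


def descend(lr: float, steps: int, w: float = 0.5) -> float:
    """Repeat the update for a number of steps and return the final loss."""
    for _ in range(steps):
        _, grad = loss_and_grad(w)
        w = gradient_step(w, grad, lr)
    return loss_and_grad(w)[0]


# The single update from the slide
w_new = gradient_step(0.5, -3.6, 0.01)
print(f"w_new = {w_new:.3f}")

# Same starting point, 20 steps, three learning rates
for lr in (0.01, 0.1, 0.3):
    print(f"lr = {lr}: final loss = {descend(lr, 20):.4f}")
```

In this toy setup the small rate makes slow but steady progress, the middle rate reaches a near-zero loss quickly, and the largest rate overshoots back and forth and ends with a higher loss than it started with.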

08 / 12 — Interactive Demo

Try It Yourself

Adjust the weight, bias, and learning rate, and watch the loss change in real time.

Weight (w): 0.50

Bias (b): 0.10

Learning Rate (η): 0.010

Prediction (ŷ): 1.10

Target (y): 2.00

Loss: 0.81

Gradient ∂L/∂w = −3.60 | ∂L/∂b = −1.80

09 / 12 — Going Deeper

Scaling to Deep Networks

The same algorithm works for networks with millions of parameters.

📚

Layer by Layer

Gradients propagate backward through every layer. Each layer computes its local gradient and passes it back.

⚡

Computational Graphs

Modern frameworks (PyTorch, TensorFlow) record the operations of the forward pass as a graph and compute gradients through it by automatic differentiation (called autograd in PyTorch).

🚀

GPU Acceleration

The matrix multiplications in backprop are highly parallelizable, so GPUs can compute the gradients for many weights and many examples at the same time.
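To make the computational-graph idea concrete, here is a toy reverse-mode automatic differentiation engine for scalars. It is a teaching sketch only and is not how PyTorch or TensorFlow are implemented; the class name `Value` and its attributes are hypothetical names chosen for this example.

```python
class Value:
    """A scalar that remembers how it was computed, so gradients can flow back."""

    def __init__(self, data, parents=(), local_grads=()):
        self.data = data
        self.grad = 0.0
        self._parents = parents          # Values this one was computed from
        self._local_grads = local_grads  # d(self)/d(parent), one per parent

    def __add__(self, other):
        return Value(self.data + other.data, (self, other), (1.0, 1.0))

    def __sub__(self, other):
        return Value(self.data - other.data, (self, other), (1.0, -1.0))

    def __mul__(self, other):
        return Value(self.data * other.data, (self, other), (other.data, self.data))

    def relu(self):
        slope = 1.0 if self.data > 0 else 0.0
        return Value(max(0.0, self.data), (self,), (slope,))

    def backward(self):
        """Walk the graph from this node back to the inputs, applying the chain rule."""
        order, seen = [], set()

        def visit(node):
            if id(node) not in seen:
                seen.add(id(node))
                for parent in node._parents:
                    visit(parent)
                order.append(node)  # appended only after all of its inputs

        visit(self)
        self.grad = 1.0
        # Reversed order: every node is handled before the nodes it was built from
        for node in reversed(order):
            for parent, local in zip(node._parents, node._local_grads):
                parent.grad += local * node.grad


# Build the graph for the guide's example, then backpropagate
x, w, b, y = Value(2.0), Value(0.5), Value(0.1), Value(2.0)
y_hat = (w * x + b).relu()
diff = y - y_hat
loss = diff * diff
loss.backward()
print(f"loss = {loss.data:.2f}, dL/dw = {w.grad:.2f}, dL/db = {b.grad:.2f}")
```

Each operation stores only its own local derivatives; `backward` combines them, so adding more layers means adding more nodes, not new gradient formulas.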

10 / 12 — Challenges

When Backprop Struggles

Real-world training faces several obstacles.

1

Vanishing Gradients

In very deep networks, gradients can shrink exponentially as they pass backward through the layers, so the earliest layers learn very slowly or not at all. Common mitigations: ReLU activation, residual connections (ResNets), batch normalization.

2

Exploding Gradients

The opposite problem: gradients grow exponentially from layer to layer and destabilize training. Common mitigations: gradient clipping, careful weight initialization (Xavier, He).

3

Local Minima & Saddle Points

The loss landscape is non-convex, with flat regions, saddle points, and poor local minima where plain gradient descent can stall. Optimizers like Adam, RMSprop, and SGD with momentum help training move through them.

4

Computational Cost

Training the largest models (GPT-4-scale and similar) can take thousands of GPUs running for weeks. Techniques: mixed-precision training, gradient checkpointing, distributed training.
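The first two obstacles above can be illustrated with a deliberately simplified model: a chain of scalar layers where each layer multiplies the gradient by weight × sigmoid′(z). The weights, depth, and clipping threshold are assumptions picked for the demonstration, and clipping is shown by value on a single number, whereas in practice it is often applied to the norm of the whole gradient vector.

```python
import math


def sigmoid_derivative(z: float) -> float:
    """Slope of the sigmoid; its maximum is 0.25, reached at z = 0."""
    s = 1.0 / (1.0 + math.exp(-z))
    return s * (1.0 - s)


def gradient_through_layers(num_layers: int, weight: float, z: float = 0.0) -> float:
    """Multiply the per-layer factor weight * sigmoid'(z) along a chain of scalar layers."""
    grad = 1.0
    for _ in range(num_layers):
        grad *= weight * sigmoid_derivative(z)
    return grad


def clip(grad: float, max_abs: float) -> float:
    """Clip a gradient value into the range [-max_abs, max_abs]."""
    return max(-max_abs, min(max_abs, grad))


vanishing = gradient_through_layers(20, weight=1.0)  # factor 0.25 per layer
exploding = gradient_through_layers(20, weight=8.0)  # factor 2.0 per layer
clipped = clip(exploding, max_abs=1.0)

print(f"vanishing: {vanishing:.2e}")
print(f"exploding: {exploding:.2e}")
print(f"clipped:   {clipped}")
```

A per-layer factor below 1 drives the product toward zero and a factor above 1 blows it up, which is why depth makes both problems worse.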

11 / 12 — The Full Algorithm

Putting It All Together

The complete training loop in four steps, repeated thousands of times.

1

Forward Pass

Feed a batch of data through the network to produce predictions.

2

Compute Loss

Measure how wrong the predictions are using a loss function (MSE, cross-entropy, etc.).

3

Backward Pass (Backpropagation)

Compute gradients of the loss with respect to every weight using the chain rule.

4

Update Weights (Gradient Descent)

Nudge every weight in the direction that reduces the loss: w = w − η × ∇L.

Repeat for N epochs × M batches per epoch → the loss typically decreases toward a (local) minimum
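The four steps can be put together in a small training loop. This sketch trains the single ReLU neuron from earlier slides on a made-up dataset generated from y = 1.5x + 0.5; the data, hyperparameters, and function names are all assumptions for illustration. Batches are taken in a fixed order to keep the run reproducible, whereas real training usually shuffles them.

```python
def mean_loss(xs, ys, w: float, b: float) -> float:
    """Mean squared error of the neuron over the whole dataset."""
    return sum((y - max(0.0, w * x + b)) ** 2 for x, y in zip(xs, ys)) / len(xs)


def train(xs, ys, w=0.5, b=0.1, lr=0.05, epochs=500, batch_size=2):
    """Mini-batch gradient descent for one ReLU neuron with squared-error loss."""
    for _ in range(epochs):
        for start in range(0, len(xs), batch_size):
            batch = list(zip(xs[start:start + batch_size], ys[start:start + batch_size]))
            grad_w = grad_b = 0.0
            for x, y in batch:
                # Step 1: forward pass
                z = w * x + b
                y_hat = max(0.0, z)
                # Steps 2 and 3: the loss is (y - y_hat)^2; backpropagate it to z
                dL_dz = -2.0 * (y - y_hat) * (1.0 if z > 0 else 0.0)
                # Average the gradients over the batch
                grad_w += dL_dz * x / len(batch)
                grad_b += dL_dz / len(batch)
            # Step 4: update the parameters
            w -= lr * grad_w
            b -= lr * grad_b
    return w, b


# Hypothetical training data drawn from y = 1.5x + 0.5
xs = [1.0, 2.0, 3.0, 4.0]
ys = [1.5 * x + 0.5 for x in xs]

initial_loss = mean_loss(xs, ys, w=0.5, b=0.1)
w, b = train(xs, ys)
final_loss = mean_loss(xs, ys, w, b)
print(f"loss before: {initial_loss:.3f}, after: {final_loss:.6f}, w = {w:.3f}, b = {b:.3f}")
```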

12 / 12 — Summary

Key Takeaways

What is it?

An algorithm that computes how much each weight in a neural network contributes to the prediction error.

How does it work?

It applies the chain rule of calculus backward through the network to compute gradients efficiently.

Why does it matter?

Without backprop, deep learning wouldn't exist. It enables training of everything from image classifiers to large language models.

The Insight

Complex learning is just small, repeated corrections guided by mathematics. Simple rules, emergent intelligence.

Popularized by Rumelhart, Hinton & Williams (1986), building on earlier work by Linnainmaa and Werbos in the 1970s. Now the backbone of modern deep learning.