Backpropagation is the chain rule, organized
Backpropagation is not a special learning trick. It is an efficient way to compute gradients of the loss with respect to every parameter in a network by systematically applying the chain rule and reusing intermediate results from the forward pass.
The key idea: instead of recomputing derivatives from scratch for each weight, we compute “local gradients” at each node (each operation) and multiply them along the paths that connect a weight to the loss. This turns a potentially expensive computation into one that scales well with the number of parameters.
Start small: one neuron, one loss, one weight
Consider a single neuron with one input feature (to keep notation minimal). Let the neuron compute:
- Pre-activation: z = w * x + b
- Activation: a = f(z)
- Loss: L = loss(a, y)
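To keep the notation concrete, here is that forward pass as a few lines of Python; the sigmoid activation and squared-error loss are assumed examples, not the only possible choices.

```python
import math

# Example values (arbitrary, for illustration)
x, y = 2.0, 1.0          # input feature and target
w, b = 0.5, -1.0         # weight and bias

# Forward pass: pre-activation -> activation -> loss
z = w * x + b                    # z = w*x + b
a = 1.0 / (1.0 + math.exp(-z))   # a = f(z), here f = sigmoid (assumed example)
L = 0.5 * (a - y) ** 2           # L = loss(a, y), here squared error (assumed example)
```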
We want the gradient dL/dw: how changing the weight changes the loss.
Conceptual chain rule breakdown
The loss depends on w only through the intermediate variables z and a. The chain rule gives:
dL/dw = (dL/da) * (da/dz) * (dz/dw)

Each factor is a local gradient:

- dL/da: how the loss changes if the neuron output changes a bit (the learning signal arriving from the loss).
- da/dz: how the activation output changes if its input changes (depends on the activation’s slope at z).
- dz/dw: how the pre-activation changes if the weight changes. Since z = w*x + b, we have dz/dw = x.
So the gradient becomes:
dL/dw = (dL/da) * (da/dz) * x

This is the entire “learning rule” for a single weight: the input x times a sensitivity term that combines the loss sensitivity and the activation slope.
Step-by-step computation you would do in code
- Forward pass: compute z, then a, then L.
- Backward pass: compute dL/da from the loss, compute da/dz from the activation at z, multiply to get dL/dz, then multiply by x to get dL/dw (see the sketch just below).
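A minimal sketch of these two passes, again assuming a sigmoid activation and squared-error loss for concreteness:

```python
import math

x, y = 2.0, 1.0
w, b = 0.5, -1.0

# Forward pass: keep z and a around for the backward pass
z = w * x + b
a = 1.0 / (1.0 + math.exp(-z))   # sigmoid (assumed example activation)
L = 0.5 * (a - y) ** 2           # squared error (assumed example loss)

# Backward pass: local gradients, multiplied together
dL_da = a - y                    # derivative of 0.5*(a - y)^2 with respect to a
da_dz = a * (1.0 - a)            # sigmoid slope at z
dL_dz = dL_da * da_dz            # this is the quantity named delta just below
dL_dw = dL_dz * x                # dz/dw = x
dL_db = dL_dz * 1.0              # dz/db = 1
```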
It is common to name the intermediate gradient:
delta = dL/dz = (dL/da) * (da/dz)

Then:

dL/dw = delta * x

And similarly for the bias:

dL/db = delta * (dz/db) = delta * 1 = delta

Extend to two layers: gradients flow backward through dependencies
Now consider two layers (still scalar for clarity):
- Layer 1: z1 = w1*x + b1, a1 = f1(z1)
- Layer 2: z2 = w2*a1 + b2, a2 = f2(z2)
- Loss: L = loss(a2, y)
We want gradients for w2 and w1.
Gradient for the second-layer weight
For w2, the chain is short:
dL/dw2 = (dL/da2) * (da2/dz2) * (dz2/dw2)

Since z2 = w2*a1 + b2, dz2/dw2 = a1. Define:

delta2 = dL/dz2 = (dL/da2) * (da2/dz2)

Then:

dL/dw2 = delta2 * a1

Gradient for the first-layer weight: multiply local gradients along the path
For w1, the loss depends on it through layer 1 and then layer 2. The chain rule becomes:
dL/dw1 = (dL/da2) * (da2/dz2) * (dz2/da1) * (da1/dz1) * (dz1/dw1)

Each term is local:

- (dL/da2) * (da2/dz2) is delta2.
- dz2/da1 = w2 because z2 = w2*a1 + b2.
- da1/dz1 is the slope of f1 at z1.
- dz1/dw1 = x.
So:
dL/dw1 = delta2 * w2 * (da1/dz1) * x

It is helpful to define the “error signal” at layer 1:

delta1 = dL/dz1 = (dL/da1) * (da1/dz1)

But dL/da1 comes from layer 2:

dL/da1 = dL/dz2 * dz2/da1 = delta2 * w2

Therefore:

delta1 = (delta2 * w2) * (da1/dz1)

And then:

dL/dw1 = delta1 * x

This shows the reusable structure: once you have delta for a layer, gradients for its weights are simple products with that layer’s inputs.
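Here is the two-layer chain as a runnable sketch; the tanh and sigmoid activations and the squared-error loss are assumptions chosen only to make the local slopes concrete.

```python
import math

x, y = 1.5, 0.0
w1, b1 = 0.8, 0.1
w2, b2 = -0.5, 0.2

# Forward pass
z1 = w1 * x + b1
a1 = math.tanh(z1)                 # f1 = tanh (assumed example)
z2 = w2 * a1 + b2
a2 = 1.0 / (1.0 + math.exp(-z2))   # f2 = sigmoid (assumed example)
L = 0.5 * (a2 - y) ** 2            # squared error (assumed example)

# Backward pass, reusing the deltas
delta2 = (a2 - y) * a2 * (1.0 - a2)   # dL/dz2 = (dL/da2) * (da2/dz2)
dL_dw2 = delta2 * a1
dL_db2 = delta2

dL_da1 = delta2 * w2                  # gradient passed back through z2 = w2*a1 + b2
delta1 = dL_da1 * (1.0 - a1 ** 2)     # da1/dz1 = 1 - tanh(z1)^2
dL_dw1 = delta1 * x
dL_db1 = delta1
```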
General pattern: local gradients and “messages” passed backward
In vector/matrix form, the same idea holds: each layer computes a local derivative, and backprop multiplies it with the upstream gradient to produce the downstream gradient. Conceptually, each layer receives a gradient with respect to its outputs and converts it into a gradient with respect to its inputs and parameters.
For a layer that computes z = W a_prev + b and a = f(z), the backward pass has three main products:
- Compute delta = dL/dz by combining the upstream gradient with f'(z).
- Parameter gradients: dL/dW = delta * a_prev^T and dL/db = sum(delta) (summed over the batch).
- Pass the gradient backward: dL/da_prev = W^T * delta (all three products are sketched after this list).
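A NumPy sketch of one layer's backward step; the ReLU activation, the (features, batch) shapes, and the random stand-in for the upstream gradient are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Assumed shapes: a_prev is (n_in, batch), W is (n_out, n_in), b is (n_out, 1)
n_in, n_out, batch = 4, 3, 8
a_prev = rng.normal(size=(n_in, batch))
W = rng.normal(size=(n_out, n_in)) * 0.1
b = np.zeros((n_out, 1))

# Forward pass (cached for backward): z = W a_prev + b, a = f(z)
z = W @ a_prev + b
a = np.maximum(z, 0.0)                      # f = ReLU (assumed example)

# Upstream gradient dL/da arriving from the layers above (random stand-in here)
upstream = rng.normal(size=a.shape)

# Backward pass
delta = upstream * (z > 0.0)                # dL/dz = dL/da * f'(z); ReLU slope is 0 or 1
dL_dW = delta @ a_prev.T                    # dL/dW = delta * a_prev^T
dL_db = delta.sum(axis=1, keepdims=True)    # sum over the batch
dL_da_prev = W.T @ delta                    # gradient passed to the previous layer
```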
The efficiency comes from reusing z, a, and a_prev from the forward pass, and from computing each layer’s gradients once, in a single backward sweep.
Why gradients vanish or explode: products of many factors
Backpropagation multiplies many local terms as it moves backward through depth. In the two-layer example, dL/dw1 included a product like delta2 * w2 * f1'(z1) * x. In a deep network, the gradient for early layers contains a product of many weight terms and many activation derivatives.
When you multiply many numbers:
- If typical magnitudes are less than 1, the product shrinks toward 0 (vanishing gradients).
- If typical magnitudes are greater than 1, the product grows rapidly (exploding gradients).
This is not mysterious: it is a direct consequence of repeated chain rule multiplication.
A concrete mental model
Imagine a simplified deep chain where each layer contributes a factor roughly like c to the gradient magnitude. After L layers, the magnitude scales like c^L. If c = 0.9 and L = 50, then 0.9^50 is tiny. If c = 1.1 and L = 50, then 1.1^50 is huge. Real networks are more complex, but the core multiplication effect remains.
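A quick numerical check of that scaling, using the same numbers as the example above:

```python
# Product of 50 identical factors: c**L
print(0.9 ** 50)   # ~0.005: the signal arrives roughly 200x smaller
print(1.1 ** 50)   # ~117:   the signal arrives roughly 100x larger
```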
How activation choice affects gradient flow
Activation functions influence gradients through their derivatives f'(z), which appear in every layer’s chain.
- If f'(z) is often near 0 in the regions where z lands, gradients shrink as they pass through that layer.
- If f'(z) stays reasonably sized over typical z values, gradients propagate more reliably.
Practically, this means the distribution of pre-activations z matters: if many units operate in regimes where the activation slope is tiny, the network can struggle to update early layers because the learning signal gets damped at each step.
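For illustration, take the sigmoid as one example of a saturating activation: its slope peaks at 0.25 and collapses when z is far from zero, so a chain of such slopes shrinks quickly.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def sigmoid_slope(z):
    s = sigmoid(z)
    return s * (1.0 - s)

print(sigmoid_slope(0.0))         # 0.25: the largest slope the sigmoid can offer
print(sigmoid_slope(5.0))         # ~0.0066: a saturated unit barely passes any gradient
print(sigmoid_slope(0.0) ** 20)   # ~9e-13: even best-case slopes, multiplied over 20 layers, collapse
```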
How initialization affects vanishing/exploding gradients
Weights appear in backprop in two ways:
- They scale the backward signal when computing dL/da_prev = W^T * delta.
- They influence the forward activations z, which then determine the activation slopes f'(z).
If weights are initialized too large, pre-activations grow in magnitude, which can push saturating activations into regions with small derivatives (damping gradients) or cause signals to grow unstably. If weights are initialized too small, signals (and gradients) shrink layer by layer.
Good initialization aims to keep the variance of activations and gradients roughly stable across layers, so that neither systematic shrinkage nor systematic growth dominates as depth increases.
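One way to see this is to push a random input through a deep tanh stack at different weight scales; the 1/sqrt(fan_in) scaling used here is one common variance-preserving heuristic, included as an assumption for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
width, depth = 256, 30

def mean_activation_magnitude(scale):
    """Run a random input through `depth` tanh layers with weights drawn at the given scale."""
    a = rng.normal(size=(width, 1))
    for _ in range(depth):
        W = rng.normal(size=(width, width)) * scale
        a = np.tanh(W @ a)
    return float(np.abs(a).mean())

print(mean_activation_magnitude(3.0 / np.sqrt(width)))  # too large: units saturate near +/-1 (tiny slopes)
print(mean_activation_magnitude(0.1 / np.sqrt(width)))  # too small: activations shrink toward 0 layer by layer
print(mean_activation_magnitude(1.0 / np.sqrt(width)))  # ~1/sqrt(fan_in): magnitudes stay in a moderate range
```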
Depth makes the chain longer, not “more magical”
Depth increases the number of multiplications in the chain rule. Even if each layer is well-behaved on average, small biases in scaling accumulate. This is why deeper networks are more sensitive to activation choice and initialization: they simply have more opportunities for gradient magnitudes to drift toward zero or infinity.
Practical backprop checklist (what to compute and what to watch)
Step-by-step for a typical layer during training
- Cache the forward values needed for derivatives (commonly a_prev and z).
- In the backward pass, compute delta = upstream * f'(z).
- Compute dW from delta and a_prev; compute db from delta.
- Compute the gradient to pass to the previous layer: upstream_prev = W^T * delta (see the sketch below).
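One way to organize this caching is a small layer object; the class and method names below are illustrative rather than a fixed API, and ReLU is again an assumed example activation.

```python
import numpy as np

class DenseLayer:
    """A dense layer that caches the forward values its backward pass needs."""

    def __init__(self, W, b):
        self.W, self.b = W, b

    def forward(self, a_prev):
        self.a_prev = a_prev                        # cached for dW
        self.z = self.W @ a_prev + self.b           # cached for f'(z)
        return np.maximum(self.z, 0.0)              # ReLU (assumed example activation)

    def backward(self, upstream):
        delta = upstream * (self.z > 0.0)           # delta = upstream * f'(z)
        self.dW = delta @ self.a_prev.T             # dW from delta and a_prev
        self.db = delta.sum(axis=1, keepdims=True)  # db from delta
        return self.W.T @ delta                     # upstream_prev for the previous layer
```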
Signs of vanishing/exploding gradients during training
- Vanishing: early layers learn extremely slowly; gradients for early layers are near zero; loss decreases very slowly or stalls.
- Exploding: loss becomes unstable; gradients become very large; parameter updates overshoot; numerical issues can appear.
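A cheap diagnostic is to log per-layer gradient norms each training step and compare them across depth and over time; `grads` below is a hypothetical mapping from layer name to gradient array, not a specific library's structure.

```python
import numpy as np

def gradient_norms(grads):
    """Return the L2 norm of each layer's gradient, given e.g. {'layer1_W': array, ...} (hypothetical names)."""
    return {name: float(np.linalg.norm(g)) for name, g in grads.items()}

# Norms that fall by orders of magnitude toward early layers suggest vanishing gradients;
# norms that grow sharply over training steps suggest exploding gradients.
```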
Levers that directly target the chain-rule product problem
- Activation choice: prefer activations whose derivatives do not collapse to near-zero for typical pre-activations.
- Initialization: choose scales that keep forward activations and backward gradients in a reasonable range across depth.
- Depth awareness: deeper models require more careful control of signal/gradient scaling because the chain is longer.