Backpropagation is the chain rule, organized
Backpropagation is not a special learning trick. It is an efficient way to compute gradients of the loss with respect to every parameter in a network by systematically applying the chain rule and reusing intermediate results from the forward pass.
The key idea: instead of recomputing derivatives from scratch for each weight, we compute “local gradients” at each node (each operation) and multiply them along the paths that connect a weight to the loss. This turns a potentially expensive computation into one that scales well with the number of parameters.
Start small: one neuron, one loss, one weight
Consider a single neuron with one input feature (to keep notation minimal). Let the neuron compute:
- Pre-activation: z = w * x + b
- Activation: a = f(z)
- Loss: L = loss(a, y)
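To keep the notation concrete, here is that forward pass as a few lines of Python; the sigmoid activation and squared-error loss are assumed examples, not the only possible choices.

```python
import math

# Example values (arbitrary, for illustration)
x, y = 2.0, 1.0          # input feature and target
w, b = 0.5, -1.0         # weight and bias

# Forward pass: pre-activation -> activation -> loss
z = w * x + b                    # z = w*x + b
a = 1.0 / (1.0 + math.exp(-z))   # a = f(z), here f = sigmoid (assumed example)
L = 0.5 * (a - y) ** 2           # L = loss(a, y), here squared error (assumed example)
```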
We want the gradient dL/dw: how changing the weight changes the loss.
Conceptual chain rule breakdown
The loss depends on w only through the intermediate variables z and a. The chain rule gives:
dL/dw = (dL/da) * (da/dz) * (dz/dw)

Each factor is a local gradient:

- dL/da: how the loss changes if the neuron output changes a bit (the learning signal arriving from the loss).
- da/dz: how the activation output changes if its input changes (depends on the activation’s slope at z).
- dz/dw: how the pre-activation changes if the weight changes. Since z = w*x + b, we have dz/dw = x.
So the gradient becomes:
dL/dw = (dL/da) * (da/dz) * x

This is the entire “learning rule” for a single weight: the input x times a sensitivity term that combines the loss sensitivity and the activation slope.
Step-by-step computation you would do in code
- Forward pass: compute z, then a, then L.
- Backward pass: compute dL/da from the loss, compute da/dz from the activation at z, multiply to get dL/dz, then multiply by x to get dL/dw (see the sketch just below).
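A minimal sketch of these two passes, again assuming a sigmoid activation and squared-error loss for concreteness:

```python
import math

x, y = 2.0, 1.0
w, b = 0.5, -1.0

# Forward pass: keep z and a around for the backward pass
z = w * x + b
a = 1.0 / (1.0 + math.exp(-z))   # sigmoid (assumed example activation)
L = 0.5 * (a - y) ** 2           # squared error (assumed example loss)

# Backward pass: local gradients, multiplied together
dL_da = a - y                    # derivative of 0.5*(a - y)^2 with respect to a
da_dz = a * (1.0 - a)            # sigmoid slope at z
dL_dz = dL_da * da_dz            # this is the quantity named delta just below
dL_dw = dL_dz * x                # dz/dw = x
dL_db = dL_dz * 1.0              # dz/db = 1
```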
It is common to name the intermediate gradient:
delta = dL/dz = (dL/da) * (da/dz)

Then:

dL/dw = delta * x

And similarly for the bias:

dL/db = delta * (dz/db) = delta * 1 = delta

Extend to two layers: gradients flow backward through dependencies
Now consider two layers (still scalar for clarity):
- Layer 1: z1 = w1*x + b1, a1 = f1(z1)
- Layer 2: z2 = w2*a1 + b2, a2 = f2(z2)
- Loss: L = loss(a2, y)
We want gradients for w2 and w1.
Gradient for the second-layer weight
For w2, the chain is short:
dL/dw2 = (dL/da2) * (da2/dz2) * (dz2/dw2)

Since z2 = w2*a1 + b2, dz2/dw2 = a1. Define:

delta2 = dL/dz2 = (dL/da2) * (da2/dz2)

Then:

dL/dw2 = delta2 * a1

Gradient for the first-layer weight: multiply local gradients along the path
For w1, the loss depends on it through layer 1 and then layer 2. The chain rule becomes:
dL/dw1 = (dL/da2) * (da2/dz2) * (dz2/da1) * (da1/dz1) * (dz1/dw1)

Each term is local:

- (dL/da2) * (da2/dz2) is delta2.
- dz2/da1 = w2 because z2 = w2*a1 + b2.
- da1/dz1 is the slope of f1 at z1.
- dz1/dw1 = x.
So:
dL/dw1 = delta2 * w2 * (da1/dz1) * x

It is helpful to define the “error signal” at layer 1:

delta1 = dL/dz1 = (dL/da1) * (da1/dz1)

But dL/da1 comes from layer 2:

dL/da1 = dL/dz2 * dz2/da1 = delta2 * w2

Therefore:

delta1 = (delta2 * w2) * (da1/dz1)

And then:

dL/dw1 = delta1 * x

This shows the reusable structure: once you have delta for a layer, gradients for its weights are simple products with that layer’s inputs.
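Here is the two-layer chain as a runnable sketch; the tanh and sigmoid activations and the squared-error loss are assumptions chosen only to make the local slopes concrete.

```python
import math

x, y = 1.5, 0.0
w1, b1 = 0.8, 0.1
w2, b2 = -0.5, 0.2

# Forward pass
z1 = w1 * x + b1
a1 = math.tanh(z1)                 # f1 = tanh (assumed example)
z2 = w2 * a1 + b2
a2 = 1.0 / (1.0 + math.exp(-z2))   # f2 = sigmoid (assumed example)
L = 0.5 * (a2 - y) ** 2            # squared error (assumed example)

# Backward pass, reusing the deltas
delta2 = (a2 - y) * a2 * (1.0 - a2)   # dL/dz2 = (dL/da2) * (da2/dz2)
dL_dw2 = delta2 * a1
dL_db2 = delta2

dL_da1 = delta2 * w2                  # gradient passed back through z2 = w2*a1 + b2
delta1 = dL_da1 * (1.0 - a1 ** 2)     # da1/dz1 = 1 - tanh(z1)^2
dL_dw1 = delta1 * x
dL_db1 = delta1
```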
General pattern: local gradients and “messages” passed backward
In vector/matrix form, the same idea holds: each layer computes a local derivative, and backprop multiplies it with the upstream gradient to produce the downstream gradient. Conceptually, each layer receives a gradient with respect to its outputs and converts it into a gradient with respect to its inputs and parameters.
For a layer that computes z = W a_prev + b and a = f(z), the backward pass has three main products:
- Compute delta = dL/dz by combining the upstream gradient with f'(z).
- Parameter gradients: dL/dW = delta * a_prev^T and dL/db = sum(delta) (summed over the batch).
- Pass the gradient backward: dL/da_prev = W^T * delta (all three products are sketched after this list).
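A NumPy sketch of one layer's backward step; the ReLU activation, the (features, batch) shapes, and the random stand-in for the upstream gradient are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Assumed shapes: a_prev is (n_in, batch), W is (n_out, n_in), b is (n_out, 1)
n_in, n_out, batch = 4, 3, 8
a_prev = rng.normal(size=(n_in, batch))
W = rng.normal(size=(n_out, n_in)) * 0.1
b = np.zeros((n_out, 1))

# Forward pass (cached for backward): z = W a_prev + b, a = f(z)
z = W @ a_prev + b
a = np.maximum(z, 0.0)                      # f = ReLU (assumed example)

# Upstream gradient dL/da arriving from the layers above (random stand-in here)
upstream = rng.normal(size=a.shape)

# Backward pass
delta = upstream * (z > 0.0)                # dL/dz = dL/da * f'(z); ReLU slope is 0 or 1
dL_dW = delta @ a_prev.T                    # dL/dW = delta * a_prev^T
dL_db = delta.sum(axis=1, keepdims=True)    # sum over the batch
dL_da_prev = W.T @ delta                    # gradient passed to the previous layer
```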
The efficiency comes from reusing z, a, and a_prev from the forward pass, and from computing each layer’s gradients once, in a single backward sweep.
Why gradients vanish or explode: products of many factors
Backpropagation multiplies many local terms as it moves backward through depth. In the two-layer example, dL/dw1 included a product like delta2 * w2 * f1'(z1) * x. In a deep network, the gradient for early layers contains a product of many weight terms and many activation derivatives.
When you multiply many numbers:
- If typical magnitudes are less than 1, the product shrinks toward 0 (vanishing gradients).
- If typical magnitudes are greater than 1, the product grows rapidly (exploding gradients).
This is not mysterious: it is a direct consequence of repeated chain rule multiplication.
A concrete mental model
Imagine a simplified deep chain where each layer contributes a factor roughly like c to the gradient magnitude. After L layers, the magnitude scales like c^L. If c = 0.9 and L = 50, then 0.9^50 is tiny. If c = 1.1 and L = 50, then 1.1^50 is huge. Real networks are more complex, but the core multiplication effect remains.
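A quick numerical check of that scaling, using the same numbers as the example above:

```python
# Product of 50 identical factors: c**L
print(0.9 ** 50)   # ~0.005: the signal arrives roughly 200x smaller
print(1.1 ** 50)   # ~117:   the signal arrives roughly 100x larger
```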
How activation choice affects gradient flow
Activation functions influence gradients through their derivatives f'(z), which appear in every layer’s chain.
- If f'(z) is often near 0 in the regions where z lands, gradients shrink as they pass through that layer.
- If f'(z) stays reasonably sized over typical z values, gradients propagate more reliably.
Practically, this means the distribution of pre-activations z matters: if many units operate in regimes where the activation slope is tiny, the network can struggle to update early layers because the learning signal gets damped at each step.
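For illustration, take the sigmoid as one example of a saturating activation: its slope peaks at 0.25 and collapses when z is far from zero, so a chain of such slopes shrinks quickly.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def sigmoid_slope(z):
    s = sigmoid(z)
    return s * (1.0 - s)

print(sigmoid_slope(0.0))         # 0.25: the largest slope the sigmoid can offer
print(sigmoid_slope(5.0))         # ~0.0066: a saturated unit barely passes any gradient
print(sigmoid_slope(0.0) ** 20)   # ~9e-13: even best-case slopes, multiplied over 20 layers, collapse
```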
How initialization affects vanishing/exploding gradients
Weights appear in backprop in two ways:
- They scale the backward signal when computing dL/da_prev = W^T * delta.
- They influence the forward activations z, which then determine the activation slopes f'(z).
If weights are initialized too large, pre-activations grow in magnitude, which can push saturating activations into regions with small derivatives (damping gradients) or cause signals to grow unstably. If weights are initialized too small, signals (and gradients) shrink layer by layer.
Good initialization aims to keep the variance of activations and gradients roughly stable across layers, so that neither systematic shrinkage nor systematic growth dominates as depth increases.
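One way to see this is to push a random input through a deep tanh stack at different weight scales; the 1/sqrt(fan_in) scaling used here is one common variance-preserving heuristic, included as an assumption for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
width, depth = 256, 30

def mean_activation_magnitude(scale):
    """Run a random input through `depth` tanh layers with weights drawn at the given scale."""
    a = rng.normal(size=(width, 1))
    for _ in range(depth):
        W = rng.normal(size=(width, width)) * scale
        a = np.tanh(W @ a)
    return float(np.abs(a).mean())

print(mean_activation_magnitude(3.0 / np.sqrt(width)))  # too large: units saturate near +/-1 (tiny slopes)
print(mean_activation_magnitude(0.1 / np.sqrt(width)))  # too small: activations shrink toward 0 layer by layer
print(mean_activation_magnitude(1.0 / np.sqrt(width)))  # ~1/sqrt(fan_in): magnitudes stay in a moderate range
```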
Depth makes the chain longer, not “more magical”
Depth increases the number of multiplications in the chain rule. Even if each layer is well-behaved on average, small biases in scaling accumulate. This is why deeper networks are more sensitive to activation choice and initialization: they simply have more opportunities for gradient magnitudes to drift toward zero or infinity.
Practical backprop checklist (what to compute and what to watch)
Step-by-step for a typical layer during training
- Cache the forward values needed for derivatives (commonly a_prev and z).
- In the backward pass, compute delta = upstream * f'(z).
- Compute dW from delta and a_prev; compute db from delta.
- Compute the gradient to pass to the previous layer: upstream_prev = W^T * delta (see the sketch below).
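One way to organize this caching is a small layer object; the class and method names below are illustrative rather than a fixed API, and ReLU is again an assumed example activation.

```python
import numpy as np

class DenseLayer:
    """A dense layer that caches the forward values its backward pass needs."""

    def __init__(self, W, b):
        self.W, self.b = W, b

    def forward(self, a_prev):
        self.a_prev = a_prev                        # cached for dW
        self.z = self.W @ a_prev + self.b           # cached for f'(z)
        return np.maximum(self.z, 0.0)              # ReLU (assumed example activation)

    def backward(self, upstream):
        delta = upstream * (self.z > 0.0)           # delta = upstream * f'(z)
        self.dW = delta @ self.a_prev.T             # dW from delta and a_prev
        self.db = delta.sum(axis=1, keepdims=True)  # db from delta
        return self.W.T @ delta                     # upstream_prev for the previous layer
```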
Signs of vanishing/exploding gradients during training
- Vanishing: early layers learn extremely slowly; gradients for early layers are near zero; loss decreases very slowly or stalls.
- Exploding: loss becomes unstable; gradients become very large; parameter updates overshoot; numerical issues can appear.
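A cheap diagnostic is to log per-layer gradient norms each training step and compare them across depth and over time; `grads` below is a hypothetical mapping from layer name to gradient array, not a specific library's structure.

```python
import numpy as np

def gradient_norms(grads):
    """Return the L2 norm of each layer's gradient, given e.g. {'layer1_W': array, ...} (hypothetical names)."""
    return {name: float(np.linalg.norm(g)) for name, g in grads.items()}

# Norms that fall by orders of magnitude toward early layers suggest vanishing gradients;
# norms that grow sharply over training steps suggest exploding gradients.
```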
Levers that directly target the chain-rule product problem
- Activation choice: prefer activations whose derivatives do not collapse to near-zero for typical pre-activations.
- Initialization: choose scales that keep forward activations and backward gradients in a reasonable range across depth.
- Depth awareness: deeper models require more careful control of signal/gradient scaling because the chain is longer.