From Gradients to Parameter Updates
After backpropagation, you have gradients: for each parameter (weight or bias), a number telling you how changing that parameter would change the loss. Optimization is the practical procedure that turns those gradients into actual parameter updates that (usually) reduce the loss.
The core idea is iterative improvement: start with some parameters, measure loss on data, compute gradients, update parameters, repeat. Each update is small on purpose, because the loss surface can be curved and complicated, and large jumps can overshoot good regions.
Plain Gradient Descent Update Rule
For a parameter vector θ and loss L(θ), the simplest update is:
θ ← θ - η * ∇θ L(θ)

∇θ L(θ) is the gradient (direction of steepest increase). Subtracting it moves you in the direction of steepest decrease. η (eta) is the learning rate: it scales how big a step you take.
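To make this concrete, here is a minimal Python sketch of the update rule on a toy one-dimensional loss, L(θ) = (θ − 3)², whose gradient is 2(θ − 3). The function name and constants are illustrative, not from any library.

def toy_grad(theta):
    return 2.0 * (theta - 3.0)      # gradient of the toy loss (theta - 3)^2

theta = 0.0                         # initial parameter
eta = 0.1                           # learning rate (η)

for step in range(50):
    theta = theta - eta * toy_grad(theta)   # θ ← θ - η * ∇L(θ)

print(theta)                        # approaches 3.0, the minimizer of the toy loss

Each step moves θ a fraction of the way toward the minimum; with a larger η the steps are bigger, which is exactly where the trade-offs discussed next come from.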
Learning Rate: The Single Most Important Knob
The learning rate controls how aggressively you move. It does not change which direction you move (the gradient does); it changes how far you move in that direction.
What Happens When the Learning Rate Is Too Large
Overshooting minima: If the loss surface slopes down toward a valley, a huge step can jump across the valley to the other side, potentially increasing loss.
Oscillation: In narrow valleys (common in deep nets), a large learning rate can bounce from one side to the other, never settling.
Divergence: Loss can explode (becoming NaN), especially with unstable activations or poorly scaled inputs.
Practical signs: training loss spikes upward, becomes unstable, or never consistently decreases.
What Happens When the Learning Rate Is Too Small
Slow progress: Loss decreases, but painfully slowly; you waste compute.
Stuck on plateaus: If gradients are small in a region, tiny steps may look like no learning at all for many iterations.
Overfitting risk via long training: If you train far longer to compensate, you may overfit unless you use regularization/early stopping (covered elsewhere).
Practical signs: smooth but very slow loss reduction; validation metrics improve only after many epochs.
A Step-by-Step Way to Choose a Learning Rate
Step 1: Pick a reasonable starting point. For many problems, 1e-3 is a common starting guess for Adam, and 1e-1 to 1e-2 for SGD with momentum (these are not universal, but useful anchors).
Step 2: Run a short training probe. Train for a small number of iterations/epochs and watch the training loss curve.
Step 3: Adjust based on behavior.
If loss diverges or is wildly unstable: decrease learning rate by 10×.
If loss decreases but extremely slowly: increase learning rate by 2× to 10×.
If loss decreases quickly then starts oscillating: slightly reduce learning rate (for example, 2× smaller) or use momentum/Adam to smooth updates.
Step 4: Add a schedule if needed. A common pattern is a larger learning rate early (fast progress) and smaller later (fine-tuning). Even a simple step decay (drop by 10× at certain epochs) can help.
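For instance, a simple step-decay schedule can be sketched in a few lines of Python; the milestones, base rate, and decay factor below are illustrative choices, not recommendations.

def step_decay_lr(base_lr, epoch, milestones=(30, 60), factor=0.1):
    # Drop the learning rate by `factor` at each milestone epoch.
    lr = base_lr
    for m in milestones:
        if epoch >= m:
            lr *= factor
    return lr

# With base_lr = 0.1: 0.1 for epochs 0-29, 0.01 for epochs 30-59, 0.001 from epoch 60 on.
for epoch in (0, 29, 30, 59, 60):
    print(epoch, step_decay_lr(0.1, epoch))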
Why We Use Minibatches (and What Stochasticity Buys You)
Computing the gradient on the entire dataset each update is expensive. Instead, we estimate the gradient using a subset of data.
Batch, Minibatch, and Stochastic Updates
Full-batch gradient descent: Use all training examples to compute one gradient and take one step. Gradient is accurate but each step is expensive.
Stochastic gradient descent (SGD): Use one example per step. Steps are cheap and frequent, but gradients are very noisy.
Minibatch SGD: Use a small batch (for example, 32–1024 examples) per step. This is the standard compromise: efficient on hardware and less noisy than single-example updates.
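As a concrete illustration, here is a minimal Python sketch of splitting a dataset into shuffled minibatches once per epoch; the function name, batch size, and use of NumPy are illustrative.

import numpy as np

def minibatches(X, y, batch_size=64, seed=0):
    rng = np.random.default_rng(seed)
    indices = rng.permutation(len(X))              # shuffle once per epoch
    for start in range(0, len(X), batch_size):
        batch_idx = indices[start:start + batch_size]
        yield X[batch_idx], y[batch_idx]           # one minibatch per update step

Each yielded (X_batch, y_batch) pair is used for one gradient estimate and one parameter update.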
The Compute vs Noise Trade-off
A minibatch gives only an estimate of the gradient. Compared to the true full-dataset gradient, it is noisy because each minibatch is just a sample of the data.
Smaller batch: cheaper per step, more steps per second, noisier gradients. Noise can help exploration: it can jostle you out of shallow local traps or flat regions.
Larger batch: more expensive per step, fewer steps per second, smoother gradients. Can be more stable but may require careful learning-rate tuning and can generalize differently.
In practice, you usually pick a batch size that fits comfortably in memory and gives good throughput, then tune the learning rate for that batch size.
What “Noisy but Fast” Looks Like
With minibatches, the loss you measure per step will bounce around because each minibatch is different. This is normal. The trend over many steps matters more than any single step.
Plain SGD vs Momentum vs Adam (Intuition First)
All these methods use gradients, but they differ in how they transform gradients into updates.
Plain SGD: Direct Steps in the Gradient Direction
Minibatch SGD uses:
g_t = ∇θ L_batch(θ_t)   (gradient on current minibatch)
θ_{t+1} = θ_t - η * g_t

Intuition: you follow the slope you see right now. If the slope estimate is noisy, your steps jitter. If the landscape has both steep and shallow directions, a single learning rate can be awkward: a step size that is safe in the steep directions may be too small in the shallow ones.
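In code, one plain-SGD step can be sketched as follows, assuming params and grads are dictionaries of NumPy arrays keyed by parameter name (the names are illustrative):

def sgd_step(params, grads, lr):
    # Apply θ_{t+1} = θ_t - η * g_t to every parameter array.
    for key in params:
        params[key] = params[key] - lr * grads[key]
    return params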
SGD with Momentum: Smoothing and Faster Movement in Consistent Directions
Momentum keeps a running “velocity” that accumulates gradients:
v_{t+1} = μ * v_t + g_t
θ_{t+1} = θ_t - η * v_{t+1}

μ (mu) is typically around 0.9. This buys you three things (a short code sketch follows the points below):
Smoothing: If gradients point in slightly different directions each step (minibatch noise), momentum averages them over time, reducing jitter.
Acceleration: If gradients consistently point in the same direction, momentum builds up speed, helping you move faster along long, shallow slopes.
Less zig-zag: In narrow valleys, momentum can reduce side-to-side bouncing and make progress along the valley direction.
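Putting the two momentum equations into code, a minimal sketch (assuming per-parameter velocity buffers initialized to zero; all names are illustrative) looks like this:

def momentum_step(params, grads, velocities, lr, mu=0.9):
    for key in params:
        velocities[key] = mu * velocities[key] + grads[key]   # v_{t+1} = μ * v_t + g_t
        params[key] = params[key] - lr * velocities[key]      # θ_{t+1} = θ_t - η * v_{t+1}
    return params, velocities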
Adam: Momentum Plus Adaptive Step Sizes Per Parameter
Adam tracks two moving averages:
First moment (mean): like momentum (average gradient direction).
Second moment (uncentered variance): average of squared gradients, used to scale updates.
Intuitively, Adam does two things:
Smoothing: like momentum, it reduces noise by averaging gradients.
Adaptive step sizes: parameters that consistently see large gradients get smaller effective steps; parameters with small gradients get relatively larger steps. This helps when different parts of the network learn at different rates or when features are differently scaled.
A simplified intuition for the update is:
θ ← θ - η * smoothed_gradient / (sqrt(smoothed_squared_gradient) + ε)

ε prevents division by zero and stabilizes training.
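A minimal sketch of an Adam-style step follows, assuming per-parameter moving averages m and v initialized to zero and a step counter t starting at 1. This version includes the standard bias-correction terms, which the simplified formula above leaves out; the hyperparameter values are the common defaults.

import numpy as np

def adam_step(params, grads, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    for key in params:
        m[key] = beta1 * m[key] + (1 - beta1) * grads[key]        # smoothed gradient (first moment)
        v[key] = beta2 * v[key] + (1 - beta2) * grads[key] ** 2   # smoothed squared gradient (second moment)
        m_hat = m[key] / (1 - beta1 ** t)                         # bias correction for early steps
        v_hat = v[key] / (1 - beta2 ** t)
        params[key] = params[key] - lr * m_hat / (np.sqrt(v_hat) + eps)
    return params, m, v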
When to Prefer Simpler Optimizers vs Adaptive Ones
Prefer Plain SGD (Often with Momentum) When
You want strong, predictable generalization behavior: for many vision and classification tasks, SGD with momentum is a reliable baseline.
You can afford more tuning: SGD may require more careful learning-rate schedules, but can reward that effort.
Your training is stable and well-scaled: if gradients and feature scales are not wildly different across parameters, a single global learning rate can work well.
Prefer Adam (Adaptive) When
You want fast, robust progress with less tuning: Adam often works “out of the box” on many architectures.
Gradients are sparse or vary a lot across parameters: adaptive scaling helps when some parameters rarely get strong learning signals.
You are training models that are sensitive to optimization details: many transformer-style and NLP setups commonly use Adam-like optimizers because they handle differing gradient scales well.
A Practical Default Decision Rule
If you are unsure and want a strong starting point: use Adam with a modest learning rate and monitor stability.
If you are optimizing for final performance and have time to tune schedules: try SGD with momentum and compare.
If training is unstable: first reduce learning rate; if still unstable, consider Adam or adding momentum (depending on what you started with), and consider smaller batches to introduce beneficial noise or larger batches for smoother gradients depending on the failure mode.
Putting It Together: A Typical Training Loop (Conceptual)
Optimization becomes concrete when you see the repeated steps. Conceptually, a minibatch training loop looks like this:
initialize parameters θ
repeat for each epoch:
    shuffle training data
    for each minibatch B:
        predictions = model(x in B; θ)
        loss = compute_loss(predictions, y in B)
        gradients = backprop(loss w.r.t. θ)
        θ = optimizer_update(θ, gradients, learning_rate)

The optimizer update is where SGD, momentum, or Adam differ. Everything else in the loop stays the same: you repeatedly turn gradient information into parameter changes that, over many steps, reduce the loss.
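The same loop made concrete, as a self-contained Python sketch using minibatch SGD on a toy linear-regression problem (the data, model, and hyperparameters are all illustrative):

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 3))                  # toy inputs
true_w = np.array([2.0, -1.0, 0.5])
y = X @ true_w + 0.1 * rng.normal(size=1000)    # toy targets with a little noise

w = np.zeros(3)                                 # initialize parameters θ
lr, batch_size = 0.05, 32

for epoch in range(20):
    order = rng.permutation(len(X))             # shuffle training data
    for start in range(0, len(X), batch_size):
        idx = order[start:start + batch_size]
        preds = X[idx] @ w                                       # forward pass
        grad = 2.0 * X[idx].T @ (preds - y[idx]) / len(idx)      # gradient of mean squared error
        w = w - lr * grad                                        # optimizer update (plain SGD)

print(w)                                        # close to true_w after training

Swapping the last line of the inner loop for a momentum or Adam update (as sketched earlier) changes the optimizer without touching the rest of the loop.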