Deep Learning Foundations Without the Hype: Neural Networks Explained Clearly

Gradient Descent and Optimization: Turning Gradients Into Progress

Chapter 7

Estimated reading time: 7 minutes

From Gradients to Parameter Updates

After backpropagation, you have gradients: for each parameter (weight or bias), a number telling you how changing that parameter would change the loss. Optimization is the practical procedure that turns those gradients into actual parameter updates that (usually) reduce the loss.

The core idea is iterative improvement: start with some parameters, measure loss on data, compute gradients, update parameters, repeat. Each update is small on purpose, because the loss surface can be curved and complicated, and large jumps can overshoot good regions.

Plain Gradient Descent Update Rule

For a parameter vector θ and loss L(θ), the simplest update is:

θ ← θ - η * ∇θ L(θ)

∇θ L(θ) is the gradient (direction of steepest increase). Subtracting it moves you in the direction of steepest decrease. η (eta) is the learning rate: it scales how big a step you take.
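As a tiny, self-contained illustration, here is the update rule applied to the one-dimensional loss L(θ) = (θ - 3)², whose gradient is 2(θ - 3); the loss, starting point, and learning rate are made up for the example:

def loss(theta):
    return (theta - 3.0) ** 2

def grad(theta):
    # dL/dtheta for L(theta) = (theta - 3)^2
    return 2.0 * (theta - 3.0)

theta = 0.0  # initial parameter
eta = 0.1    # learning rate

for step in range(5):
    theta = theta - eta * grad(theta)  # theta <- theta - eta * gradient
    print(step, round(theta, 4), round(loss(theta), 4))

Each iteration moves θ toward 3 and the printed loss shrinks: exactly the iterative-improvement loop described above.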

Learning Rate: The Single Most Important Knob

The learning rate controls how aggressively you move. It does not change which direction you move (the gradient does); it changes how far you move in that direction.


What Happens When the Learning Rate Is Too Large

  • Overshooting minima: If the loss surface slopes down toward a valley, a huge step can jump across the valley to the other side, potentially increasing loss.

  • Oscillation: In narrow valleys (common in deep nets), a large learning rate can bounce from one side to the other, never settling.

  • Divergence: Loss can explode (becoming NaN), especially with unstable activations or poorly scaled inputs.

Practical signs: training loss spikes upward, becomes unstable, or never consistently decreases.

What Happens When the Learning Rate Is Too Small

  • Slow progress: Loss decreases, but painfully slowly; you waste compute.

  • Stuck on plateaus: If gradients are small in a region, tiny steps may look like no learning at all for many iterations.

  • Overfitting risk via long training: If you train far longer to compensate, you may overfit unless you use regularization/early stopping (covered elsewhere).

Practical signs: smooth but very slow loss reduction; validation metrics improve only after many epochs.
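To see both failure modes concretely, you can run plain gradient descent on a toy quadratic with different learning rates; this is an illustrative sketch, not a tuning recipe (for L(θ) = θ², any η > 1 diverges):

def final_loss(eta, steps=20):
    # Plain gradient descent on L(theta) = theta^2 (gradient 2*theta).
    theta = 1.0
    for _ in range(steps):
        theta -= eta * 2.0 * theta
    return theta ** 2

print("too large: ", final_loss(1.1))    # loss grows every step (divergence)
print("too small: ", final_loss(0.001))  # loss barely moves (slow progress)
print("reasonable:", final_loss(0.1))    # loss shrinks steadily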

A Step-by-Step Way to Choose a Learning Rate

  • Step 1: Pick a reasonable starting point. For many problems, 1e-3 is a common starting guess for Adam, and 1e-1 to 1e-2 for SGD with momentum (these are not universal, but useful anchors).

  • Step 2: Run a short training probe. Train for a small number of iterations/epochs and watch the training loss curve.

  • Step 3: Adjust based on behavior.

    • If loss diverges or is wildly unstable: decrease learning rate by 10×.

    • If loss decreases but extremely slowly: increase learning rate by 2× to 10×.

    • If loss decreases quickly then starts oscillating: slightly reduce learning rate (for example, 2× smaller) or use momentum/Adam to smooth updates.

  • Step 4: Add a schedule if needed. A common pattern is a larger learning rate early (fast progress) and smaller later (fine-tuning). Even a simple step decay (drop by 10× at certain epochs) can help.
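The step decay in Step 4 can be written as a small schedule function; the milestone epochs and decay factor below are assumptions chosen for illustration:

def step_decay_lr(base_lr, epoch, milestones=(30, 60), factor=0.1):
    # Drop the learning rate by `factor` at each milestone epoch.
    lr = base_lr
    for m in milestones:
        if epoch >= m:
            lr *= factor
    return lr

# base_lr = 0.1 -> 0.1 for epochs 0-29, 0.01 for 30-59, 0.001 from epoch 60 on.
print(step_decay_lr(0.1, 45))

You would call this at the start of every epoch and pass the result to the optimizer update.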

Why We Use Minibatches (and What Stochasticity Buys You)

Computing the gradient on the entire dataset each update is expensive. Instead, we estimate the gradient using a subset of data.

Batch, Minibatch, and Stochastic Updates

  • Full-batch gradient descent: Use all training examples to compute one gradient and take one step. Gradient is accurate but each step is expensive.

  • Stochastic gradient descent (SGD): Use one example per step. Steps are cheap and frequent, but gradients are very noisy.

  • Minibatch SGD: Use a small batch (for example, 32–1024 examples) per step. This is the standard compromise: efficient on hardware and less noisy than single-example updates.
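A minimal sketch of how minibatches are typically drawn each epoch (shuffle the indices once, then take contiguous slices); the array names here are placeholders:

import numpy as np

def minibatches(X, y, batch_size=32, rng=None):
    # Shuffle once per epoch, then yield batch_size-sized slices.
    rng = rng if rng is not None else np.random.default_rng()
    idx = rng.permutation(len(X))
    for start in range(0, len(X), batch_size):
        b = idx[start:start + batch_size]
        yield X[b], y[b]

One pass over minibatches(X, y) is one epoch: roughly len(X) / batch_size noisy gradient estimates instead of a single exact one.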

The Compute vs Noise Trade-off

A minibatch gradient is only an estimate: compared to the true full-dataset gradient, it is noisy because each minibatch is just a sample of the data.

  • Smaller batch: cheaper per step, more steps per second, noisier gradients. Noise can help exploration: it can jostle you out of shallow local traps or flat regions.

  • Larger batch: more expensive per step, fewer steps per second, smoother gradients. Can be more stable but may require careful learning-rate tuning and can generalize differently.

In practice, you usually pick a batch size that fits comfortably in memory and gives good throughput, then tune the learning rate for that batch size.

What “Noisy but Fast” Looks Like

With minibatches, the loss you measure per step will bounce around because each minibatch is different. This is normal. The trend over many steps matters more than any single step.

Plain SGD vs Momentum vs Adam (Intuition First)

All these methods use gradients, but they differ in how they transform gradients into updates.

Plain SGD: Direct Steps in the Gradient Direction

Minibatch SGD uses:

g_t = ∇θ L_batch(θ_t)   (gradient on the current minibatch)
θ_{t+1} = θ_t - η * g_t

Intuition: you follow the slope you see right now. If the slope estimate is noisy, your steps jitter. If the landscape has both steep and shallow directions, a single learning rate can be awkward: a step size that is safe in the steep directions may be too small to make progress in the shallow ones.
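In code, the plain SGD update is a single line per step; this sketch assumes θ and the gradient are floats or NumPy arrays (the subtraction is elementwise):

def sgd_step(theta, grad, eta=0.01):
    # theta_{t+1} = theta_t - eta * g_t
    return theta - eta * grad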

SGD with Momentum: Smoothing and Faster Movement in Consistent Directions

Momentum keeps a running “velocity” that accumulates gradients:

v_{t+1} = μ * v_t + g_t
θ_{t+1} = θ_t - η * v_{t+1}

μ (mu) is typically around 0.9.

  • Smoothing: If gradients point in slightly different directions each step (minibatch noise), momentum averages them over time, reducing jitter.

  • Acceleration: If gradients consistently point in the same direction, momentum builds up speed, helping you move faster along long, shallow slopes.

  • Less zig-zag: In narrow valleys, momentum can reduce side-to-side bouncing and make progress along the valley direction.
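A sketch of this update, carrying the velocity from step to step; μ = 0.9 follows the text, and the velocity starts at zero:

def momentum_step(theta, grad, velocity, eta=0.01, mu=0.9):
    # v_{t+1} = mu * v_t + g_t
    velocity = mu * velocity + grad
    # theta_{t+1} = theta_t - eta * v_{t+1}
    return theta - eta * velocity, velocity

The caller keeps the returned velocity and passes it back in on the next step.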

Adam: Momentum Plus Adaptive Step Sizes Per Parameter

Adam tracks two moving averages:

  • First moment (mean): like momentum (average gradient direction).

  • Second moment (uncentered variance): average of squared gradients, used to scale updates.

Intuitively, Adam does two things:

  • Smoothing: like momentum, it reduces noise by averaging gradients.

  • Adaptive step sizes: parameters that consistently see large gradients get smaller effective steps; parameters with small gradients get relatively larger steps. This helps when different parts of the network learn at different rates or when features are differently scaled.

A simplified intuition for the update is:

θ ← θ - η * (smoothed_gradient) / (sqrt(smoothed_squared_gradient) + ε)

ε prevents division by zero and stabilizes training.
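For reference, the full Adam step also bias-corrects the two moving averages early in training (a detail the simplified rule above omits); this sketch uses the commonly published defaults β1 = 0.9, β2 = 0.999, ε = 1e-8, with t counting steps from 1:

import numpy as np

def adam_step(theta, grad, m, v, t, eta=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    # First moment: smoothed gradient (momentum-like).
    m = beta1 * m + (1 - beta1) * grad
    # Second moment: smoothed squared gradient (per-parameter scale).
    v = beta2 * v + (1 - beta2) * grad ** 2
    # Bias correction: compensates for m and v starting at zero.
    m_hat = m / (1 - beta1 ** t)
    v_hat = v / (1 - beta2 ** t)
    theta = theta - eta * m_hat / (np.sqrt(v_hat) + eps)
    return theta, m, v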

When to Prefer Simpler Optimizers vs Adaptive Ones

Prefer Plain SGD (Often with Momentum) When

  • You want strong, predictable generalization behavior: for many vision and classification tasks, SGD with momentum is a reliable baseline.

  • You can afford more tuning: SGD may require more careful learning-rate schedules, but can reward that effort.

  • Your training is stable and well-scaled: if gradients and feature scales are not wildly different across parameters, a single global learning rate can work well.

Prefer Adam (Adaptive) When

  • You want fast, robust progress with less tuning: Adam often works “out of the box” on many architectures.

  • Gradients are sparse or vary a lot across parameters: adaptive scaling helps when some parameters rarely get strong learning signals.

  • You are training models that are sensitive to optimization details: many transformer-style and NLP setups commonly use Adam-like optimizers because they handle differing gradient scales well.

A Practical Default Decision Rule

  • If you are unsure and want a strong starting point: use Adam with a modest learning rate and monitor stability.

  • If you are optimizing for final performance and have time to tune schedules: try SGD with momentum and compare.

  • If training is unstable: first reduce the learning rate. If it is still unstable, switch to Adam or add momentum (depending on what you started with); depending on the failure mode, smaller batches can add beneficial noise, while larger batches give smoother gradients.

Putting It Together: A Typical Training Loop (Conceptual)

Optimization becomes concrete when you see the repeated steps. Conceptually, a minibatch training loop looks like this:

initialize parameters θ
repeat for each epoch:
    shuffle training data
    for each minibatch B:
        predictions = model(x in B; θ)
        loss = compute_loss(predictions, y in B)
        gradients = backprop(loss w.r.t. θ)
        θ = optimizer_update(θ, gradients, learning_rate)

The optimizer update is where SGD, momentum, or Adam differ. Everything else in the loop stays the same: you repeatedly turn gradient information into parameter changes that, over many steps, reduce the loss.
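Here is a minimal runnable version of that loop: plain minibatch SGD fitting a toy linear model in NumPy. The synthetic data, model, and hyperparameters are illustrative only:

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(256, 3))                 # toy inputs
true_w = np.array([1.5, -2.0, 0.5])
y = X @ true_w + 0.1 * rng.normal(size=256)   # toy targets with noise

w = np.zeros(3)               # parameters (theta)
eta, batch_size = 0.1, 32     # learning rate, minibatch size

for epoch in range(20):
    idx = rng.permutation(len(X))                 # shuffle training data
    for start in range(0, len(X), batch_size):    # for each minibatch B
        b = idx[start:start + batch_size]
        err = X[b] @ w - y[b]                     # predictions minus targets
        grad = 2.0 * X[b].T @ err / len(b)        # gradient of mean squared error
        w -= eta * grad                           # optimizer_update (plain SGD)

print(w)  # should end up close to true_w

Swapping the last line of the inner loop for momentum_step or adam_step from the sketches above changes the optimizer without touching the rest of the loop.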

Now answer the exercise about the content:

During training, the loss decreases quickly at first but then starts oscillating from step to step. What is the most appropriate adjustment?

Oscillation is a common sign the learning rate is a bit too large, especially in narrow valleys. Reducing it slightly or using momentum/Adam can smooth noisy updates and reduce bouncing.

Next chapter

Regularization and Generalization: Preventing “Just Memorizing”
