Free online course
Deep Neural Network Optimization: Hyperparameter Tuning, Regularization, and Training Tricks
Duration of the online course: 4 hours and 44 minutes
Build faster, more accurate models with this free deep learning course on tuning, regularization, and training tricks—learn to stabilize training and boost results.
In this free course, learn about:
How to split data into train/dev/test and size dev/test in big data settings
Diagnose bias vs variance from train/dev errors; apply the basic ML improvement recipe
Regularization: L2 updates, why it reduces overfitting, dropout, early stopping methods
Input normalization and using train-set stats to normalize dev/test consistently
Vanishing/exploding gradients; ReLU-friendly weight initialization (He) to mitigate them
Mini-batch GD mechanics: epochs, noisy cost curves, and practical batch training behavior
Exponentially weighted averages, bias correction, and momentum update equations
RMSProp and Adam: adaptive learning rates using EWAs of gradients/squared gradients
Learning rate decay to improve convergence during mini-batch optimization
Hyperparameter tuning: random search vs grid, log-scale sampling, panda vs caviar strategy
Batch normalization: gamma/beta roles, where applied, why it speeds training, test-time stats
Softmax regression: outputs, cross-entropy gradient for final layer, optimization challenges
TensorFlow basics: placeholders and feed_dict usage in a training loop
Course Description
Getting a deep neural network to train reliably is often harder than designing the architecture. Small choices like how you split data, set learning rates, initialize weights, or apply regularization can be the difference between a model that generalizes well and one that overfits or fails to converge. This free online course helps you develop the practical intuition and methods needed to optimize neural networks in real-world machine learning work.
You will learn how to diagnose training issues through the lens of bias and variance, and how to set up train, dev, and test sets so your evaluation truly matches your deployment goals. From there, the course builds the skills to reduce overfitting with techniques such as L2 regularization, dropout, and early stopping, while keeping model performance strong and training time reasonable. You will also explore why input normalization matters, and how thoughtful preprocessing can make optimization smoother and more predictable.
As networks get deeper, training instability can show up as vanishing or exploding gradients. The course explains how these problems arise and how to mitigate them using principled weight initialization strategies and gradient checking to verify correctness. You will then move into modern optimization approaches that make large-scale training practical, including mini-batch gradient descent, momentum, RMSProp, Adam, and learning rate decay, with an emphasis on understanding what each method is doing and when it is most useful.
Hyperparameter tuning is treated as a disciplined process rather than guesswork. You will practice choosing effective search strategies, sampling on appropriate scales, and making tradeoffs based on your project constraints. Finally, you will see how batch normalization and softmax-based classifiers fit into efficient training pipelines, and how frameworks like TensorFlow structure training loops in practice. By the end, you will be able to troubleshoot training behavior, select optimization and regularization techniques with confidence, and deliver models that learn faster and generalize better.
Course content
Video class: Train/Dev/Test Sets (C2W1L01) 12m
Exercise: In the Big Data era, why can the dev and test sets be much smaller percentages of the total dataset (e.g., ~1% each)?
Video class: Bias/Variance (C2W1L02) 08m
Exercise: You train a cat classifier and get 1% training error and 11% dev error (assume near-zero Bayes error and same distribution). What does this most likely indicate?
Video class: Basic Recipe for Machine Learning (C2W1L03) 06m
Exercise: Using the basic recipe for improving a neural network, what is the most appropriate next step if the model has high variance?
Video class: Regularization (C2W1L04) 09m
Exercise: In L2 regularization for a neural network, how does the gradient descent update for a weight matrix change?
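A minimal pure-Python sketch of the update this exercise refers to, shown for a single scalar weight (the values of alpha, lam, and m are illustrative, not from the course):

```python
# Illustrative values for the L2-regularized gradient descent update.
alpha = 0.1   # learning rate
lam = 0.7     # L2 regularization strength (lambda)
m = 100       # number of training examples

def l2_update(w, dw):
    # W := W - alpha * (dW + (lambda / m) * W)
    # Equivalently: w first shrinks by a factor (1 - alpha * lam / m)
    # ("weight decay"), then takes the usual gradient step.
    return w - alpha * (dw + (lam / m) * w)

w_new = l2_update(2.0, 0.5)
```

The only change from plain gradient descent is the extra (lambda/m)·W term added to the gradient, which is what pulls weights toward zero.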
Video class: Why Regularization Reduces Overfitting (C2W1L05) 07m
Exercise: Why can increasing L2 regularization (large λ) reduce overfitting in a deep neural network?
Video class: Dropout Regularization (C2W1L06) 09m
Exercise: In inverted dropout, why are activations divided by the keep_prob during training?
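A small sketch of inverted dropout for one layer's activations, using Python lists rather than the course's vectorized notation (function name and seed are illustrative):

```python
import random

def inverted_dropout(activations, keep_prob, seed=0):
    # Keep each unit with probability keep_prob, zero it otherwise,
    # then divide the survivors by keep_prob so the expected value of
    # the layer's output is unchanged between training and test time.
    rng = random.Random(seed)
    return [a / keep_prob if rng.random() < keep_prob else 0.0
            for a in activations]
```

Dividing by keep_prob during training is what lets you skip any rescaling at test time.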
Video class: Understanding Dropout (C2W1L07) 07m
Exercise: Why does dropout tend to reduce overfitting in a neural network?
Video class: Other Regularization Methods (C2W1L08) 08m
Exercise: What is the main idea behind early stopping as a way to reduce overfitting?
Video class: Normalizing Inputs (C2W1L09) 05m
Exercise: When normalizing inputs for training, what should be used to normalize the test set?
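A sketch of the answer in code, for a single scalar feature: the mean and variance are fit on the training set only and then reused, unchanged, on dev and test data (function names are illustrative):

```python
def fit_normalizer(train_xs):
    # Compute mean and variance on the TRAINING set only.
    mu = sum(train_xs) / len(train_xs)
    var = sum((x - mu) ** 2 for x in train_xs) / len(train_xs)
    return mu, var

def normalize(xs, mu, var, eps=1e-8):
    # Apply the same mu/var to train, dev, and test data alike, so all
    # splits go through an identical transformation.
    return [(x - mu) / (var + eps) ** 0.5 for x in xs]

mu, var = fit_normalizer([2.0, 4.0, 6.0])
test_scaled = normalize([5.0], mu, var)   # train-set stats, not test stats
```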
Video class: Vanishing/Exploding Gradients (C2W1L10) 06m
Exercise: In a very deep network with linear activations and zero biases, what happens if each weight matrix is slightly larger than the identity (e.g., 1.5·I)?
Video class: Weight Initialization in a Deep Network (C2W1L11) 06m
Exercise: Which weight initialization scaling is commonly recommended when using ReLU activations to help reduce vanishing/exploding gradients?
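A pure-Python sketch of He initialization, the scaling commonly paired with ReLU: weights are drawn from a Gaussian with standard deviation sqrt(2 / n_in), where n_in is the layer's fan-in (the layer sizes below are made up):

```python
import random

def he_init(n_in, n_out, seed=0):
    # He initialization: variance 2 / n_in keeps activation magnitudes
    # roughly stable through ReLU layers, mitigating vanishing and
    # exploding gradients in deep networks.
    rng = random.Random(seed)
    std = (2.0 / n_in) ** 0.5
    return [[rng.gauss(0.0, std) for _ in range(n_out)] for _ in range(n_in)]

W = he_init(512, 256)   # illustrative layer sizes
```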
Video class: Numerical Approximations of Gradients (C2W1L12) 06m
Exercise: Which numerical formula is preferred for gradient checking because it gives a more accurate gradient approximation?
Video class: Gradient Checking (C2W1L13) 06m
Exercise: In gradient checking, how is the numerical gradient for a single parameter computed?
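The two exercises above can be answered with a few lines of code. This sketch uses the two-sided (centered) difference, whose error shrinks as O(eps²) versus O(eps) for the one-sided formula, which is why it is preferred for gradient checking:

```python
def numerical_grad(f, theta, eps=1e-7):
    # Centered difference for a single scalar parameter:
    # (f(theta + eps) - f(theta - eps)) / (2 * eps)
    return (f(theta + eps) - f(theta - eps)) / (2.0 * eps)

g = numerical_grad(lambda t: t ** 2, 3.0)   # analytic gradient is 2t = 6
```

In a full check, this estimate is compared against the backprop gradient for every parameter, and a large relative difference flags a bug.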
Video class: Gradient Checking Implementation Notes (C2W1L14) 05m
Exercise: When using gradient checking, what is a recommended practice regarding dropout?
Video class: Mini-Batch Gradient Descent (C2W2L01) 11m
Exercise: In mini-batch gradient descent, what does one epoch correspond to when the training set is split into 5000 mini-batches?
Video class: Understanding Mini-Batch Gradient Descent (C2W2L02) 11m
Exercise: Why can the cost curve look noisy during mini-batch gradient descent?
Video class: Exponentially Weighted Averages (C2W2L03) 05m
Exercise: In an exponentially weighted average, what is the approximate number of days being averaged when β = 0.9?
Video class: Understanding Exponentially Weighted Averages (C2W2L04) 09m
Exercise: In exponentially weighted averages used in optimization (e.g., momentum), what is the rule-of-thumb for the effective number of recent steps being averaged when using parameter β?
Video class: Bias Correction of Exponentially Weighted Averages (C2W2L05) 04m
Exercise: What is the purpose of bias correction when computing exponentially weighted averages?
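The three exercises above fit in one short sketch: an exponentially weighted average started at v₀ = 0, with bias correction applied at each step (the input series is illustrative):

```python
def ewa_with_correction(values, beta=0.9):
    # v_t = beta * v_{t-1} + (1 - beta) * x_t, started at v_0 = 0.
    # Early values of v are biased toward zero; dividing by
    # (1 - beta**t) removes that startup bias. The effective averaging
    # window is roughly 1 / (1 - beta) steps: about 10 for beta = 0.9.
    v, corrected = 0.0, []
    for t, x in enumerate(values, start=1):
        v = beta * v + (1 - beta) * x
        corrected.append(v / (1 - beta ** t))
    return corrected

out = ewa_with_correction([5.0, 5.0, 5.0])
```

On a constant series, the corrected average recovers the constant from the very first step, while the uncorrected v would start near zero.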
Video class: Gradient Descent With Momentum (C2W2L06) 09m
Exercise: In gradient descent with momentum, which update correctly describes how the exponentially weighted average of gradients is computed for the weights?
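A scalar sketch of the momentum update the exercise asks about (hyperparameter values are the common defaults, not prescribed by the course):

```python
def momentum_step(w, dw, v, alpha=0.01, beta=0.9):
    # v_dW = beta * v_dW + (1 - beta) * dW
    # W    = W - alpha * v_dW
    # The update follows a smoothed average of recent gradients, which
    # damps oscillations and speeds progress along consistent directions.
    v = beta * v + (1 - beta) * dw
    return w - alpha * v, v

w, v = 1.0, 0.0
w, v = momentum_step(w, dw=0.5, v=v)
```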
Video class: RMSProp (C2W2L07) 07m
Exercise: In RMSProp, why are parameter updates divided by the square root of an exponentially weighted average of squared gradients?
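A scalar sketch of RMSProp illustrating the exercise's point (default hyperparameters are illustrative):

```python
def rmsprop_step(w, dw, s, alpha=0.01, beta2=0.999, eps=1e-8):
    # s_dW tracks an exponentially weighted average of dW**2.
    # Dividing the update by sqrt(s_dW) shrinks steps along directions
    # with large, oscillating gradients and enlarges them along
    # directions with consistently small gradients.
    s = beta2 * s + (1 - beta2) * dw * dw
    return w - alpha * dw / (s ** 0.5 + eps), s

w2, s2 = rmsprop_step(1.0, 0.5, 0.0)
```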
Video class: Adam Optimization Algorithm (C2W2L08) 07m
Exercise: What best describes how the Adam optimization algorithm works?
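Adam combines a momentum-style average of gradients with an RMSProp-style average of squared gradients, each bias-corrected. A scalar sketch (defaults follow common practice):

```python
def adam_step(w, dw, v, s, t, alpha=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    v = beta1 * v + (1 - beta1) * dw           # momentum term
    s = beta2 * s + (1 - beta2) * dw * dw      # RMSProp term
    v_hat = v / (1 - beta1 ** t)               # bias correction
    s_hat = s / (1 - beta2 ** t)
    return w - alpha * v_hat / (s_hat ** 0.5 + eps), v, s

w, v, s = 1.0, 0.0, 0.0
w, v, s = adam_step(w, dw=0.5, v=v, s=s, t=1)
```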
Video class: Learning Rate Decay (C2W2L09) 06m
Exercise: What is the main purpose of using learning rate decay during mini-batch gradient descent?
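One common decay schedule as a sketch (the schedule form matches the course's 1/(1 + decay·epoch) rule; the numbers are illustrative):

```python
def decayed_lr(alpha0, decay_rate, epoch):
    # alpha = alpha0 / (1 + decay_rate * epoch): large steps early for
    # fast progress, smaller steps later so mini-batch noise doesn't
    # keep the parameters bouncing around the minimum.
    return alpha0 / (1.0 + decay_rate * epoch)

lrs = [decayed_lr(0.2, 1.0, e) for e in range(4)]
```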
Video class: Tuning Process (C2W3L01) 07m
Exercise: Why is random sampling often preferred over grid search for hyperparameter tuning?
Video class: Using an Appropriate Scale (C2W3L02) 08m
Exercise: When tuning the learning rate α over a wide range (e.g., 0.0001 to 1), what is the recommended way to sample values?
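A sketch of log-scale sampling for the learning rate (the range and function name are illustrative):

```python
import math
import random

def sample_learning_rate(rng, low=1e-4, high=1.0):
    # Sample the EXPONENT uniformly, not the value itself, so each
    # decade (1e-4 to 1e-3, 1e-3 to 1e-2, ...) receives equal
    # probability mass. Uniform sampling of the raw value would spend
    # ~90% of trials in the top decade alone.
    r = rng.uniform(math.log10(low), math.log10(high))
    return 10.0 ** r

rng = random.Random(0)
samples = [sample_learning_rate(rng) for _ in range(5)]
```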
Video class: Hyperparameter Tuning in Practice (C2W3L03) 06m
Exercise: How should you decide between the “panda” approach and the “caviar” approach for hyperparameter search?
Video class: Normalizing Activations in a Network (C2W3L04) 08m
Exercise: In batch normalization, what is the main role of the learnable parameters γ (gamma) and β (beta) after normalizing z?
Video class: Fitting Batch Norm Into Neural Networks (C2W3L05) 12m
Exercise: In a deep network using batch normalization, where is batch normalization applied within a layer’s computations?
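The two exercises above can be sketched together: batch norm is applied to the pre-activations z (before the nonlinearity), and the learnable γ/β restore whatever scale and shift the layer actually needs after normalization. A pure-Python sketch for one unit across a mini-batch:

```python
def batchnorm(zs, gamma, beta, eps=1e-8):
    # Normalize z across the mini-batch to zero mean, unit variance,
    # then apply z_tilde = gamma * z_norm + beta. With gamma = sqrt(var)
    # and beta = mu, the layer could even undo the normalization, so no
    # expressive power is lost.
    mu = sum(zs) / len(zs)
    var = sum((z - mu) ** 2 for z in zs) / len(zs)
    return [gamma * (z - mu) / (var + eps) ** 0.5 + beta for z in zs]

out = batchnorm([1.0, 2.0, 3.0], gamma=1.0, beta=0.0)
```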
Video class: Why Does Batch Norm Work? (C2W3L06) 11m
Exercise: Which best describes a key reason batch normalization helps speed up training in deep networks?
Video class: Batch Norm At Test Time (C2W3L07) 05m
Exercise: In batch normalization, what is commonly used at test time to normalize a single example when mini-batch statistics aren’t available?
Video class: Softmax Regression (C2W3L08) 11m
Exercise: In a neural network using a softmax output layer with C classes, what does the softmax function produce?
Video class: Training Softmax Classifier (C2W3L09) 10m
Exercise: In softmax classification with cross-entropy loss, what is the gradient for the last layer pre-activation (dZ^L)?
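A sketch covering both softmax exercises: the output is a probability distribution over the C classes (non-negative, summing to 1), and with cross-entropy loss the last-layer gradient simplifies to dZ^[L] = A^[L] − Y. The max-subtraction trick is a standard numerical-stability detail, not specific to the course:

```python
import math

def softmax(zs):
    # Exponentiate and normalize; subtracting max(zs) first avoids
    # overflow without changing the result.
    m = max(zs)
    exps = [math.exp(z - m) for z in zs]
    total = sum(exps)
    return [e / total for e in exps]

def dz_last_layer(a, y):
    # Softmax + cross-entropy: dZ^[L] = A^[L] - Y, where y is one-hot.
    return [ai - yi for ai, yi in zip(a, y)]

a = softmax([2.0, 1.0, 0.1])
dz = dz_last_layer(a, [1.0, 0.0, 0.0])
```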
Video class: The Problem of Local Optima (C2W3L10) 05m
Exercise: In high-dimensional neural network optimization, which issue is typically more problematic than getting stuck in bad local optima?
Video class: TensorFlow (C2W3L11) 16m
Exercise: In a typical TensorFlow training loop, what is the main purpose of using a placeholder with a feed_dict?