Generalization vs. Overfitting: What We Actually Care About
Generalization means your model performs well on unseen data drawn from the same process as your training data. In practice, you estimate this using a validation/test set that the model does not train on.
Overfitting happens when the model has more effective capacity than the data and constraints justify. It learns patterns that reduce training loss but do not transfer: quirks, noise, or accidental correlations. A useful way to say it: overfitting is not “too many parameters” by itself; it is excess capacity relative to the amount of information in the data and the strength of the constraints.
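To make the estimate concrete, here is a toy scikit-learn sketch (the synthetic dataset and logistic regression model are arbitrary illustrations): the score on the held-out split, which the model never trained on, is your estimate of generalization.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic data standing in for "data drawn from some process".
X, y = make_classification(n_samples=1000, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2,
                                                  random_state=0)

model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
print("train accuracy:", model.score(X_train, y_train))
print("val accuracy:  ", model.score(X_val, y_val))  # generalization estimate
```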
A concrete mental picture
- If the model is too restricted, it cannot represent the true relationship well, so both training and validation performance are poor.
- If the model is too flexible and not constrained, it can match the training set extremely well, but validation performance degrades because the learned function is overly tailored to the training examples.
Bias–Variance Trade-off (Conceptual, Not Math-Heavy)
The bias–variance trade-off is a way to describe two different failure modes when you try to generalize.
- High bias: the model’s assumptions are too rigid. It systematically misses important structure. Symptoms: training loss stays high; validation loss is also high and close to training loss.
- High variance: the model is too sensitive to the particular training sample. Symptoms: training loss becomes very low; validation loss is noticeably higher and may worsen as training continues.
Regularization is the set of techniques that intentionally reduce effective flexibility (variance) or stabilize learning so that the model’s performance on unseen data improves, even if training performance gets slightly worse.
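Before turning to specific techniques, here is a toy sketch of how the two failure modes show up in final train/validation losses (the thresholds are illustrative and task-dependent, not universal rules):

```python
def diagnose(train_loss, val_loss, high_loss=1.0, big_gap=0.2):
    """Crude heuristic for reading a pair of final losses."""
    if train_loss > high_loss and abs(val_loss - train_loss) < big_gap:
        return "high bias: model underfits; add flexibility or features"
    if train_loss < high_loss and val_loss - train_loss > big_gap:
        return "high variance: model overfits; regularize or add data"
    return "no clear failure mode from these two numbers alone"

print(diagnose(train_loss=1.4, val_loss=1.5))  # high-bias pattern
print(diagnose(train_loss=0.1, val_loss=0.9))  # high-variance pattern
```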
Regularization as “Shrinking the Effective Hypothesis Space”
Think of training as choosing one function from a huge set of possible functions your network could represent. Regularizers work by changing which functions are easy or even possible to learn. They do this by adding constraints, injecting noise, or limiting training time.
Below are four practical regularizers and how each changes the effective hypothesis space.
Weight Decay (L2 Regularization): Prefer Smaller Weights
What it is
Weight decay adds a penalty for large weights, typically proportional to the sum of squared weights. Intuitively, it prefers “simpler” functions that do not require extreme parameter values.
Why it helps generalization
- Large weights can create very sharp decision boundaries or highly sensitive functions that react strongly to small input changes.
- Penalizing large weights encourages smoother mappings and reduces sensitivity to noise in the training set.
How it changes the hypothesis space
The network can still represent many functions, but functions that require large weights become expensive, so training is biased toward lower-norm solutions. This effectively narrows the set of functions you are likely to end up with.
Step-by-step: adding weight decay in practice
- Step 1: Choose a regularization strength (often called weight_decay or lambda).
- Step 2: Apply it to weight parameters (commonly exclude biases and normalization parameters).
- Step 3: Tune using validation performance: too small does little; too large underfits.
```
# Pseudocode (framework-agnostic idea):
# minimize loss + lambda * sum(W^2) over weights W
```

Practical note: With adaptive optimizers, “L2 penalty” and “true weight decay” can differ; many libraries provide a dedicated weight_decay option that implements decoupled weight decay.
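As a concrete illustration of Steps 1–3, here is a minimal PyTorch sketch (the model and the weight_decay value are arbitrary illustrations; torch.optim.AdamW implements decoupled weight decay):

```python
import torch
import torch.nn as nn

# Toy model for illustration.
model = nn.Sequential(nn.Linear(32, 64), nn.ReLU(), nn.Linear(64, 10))

# Step 2: split parameters so weight decay hits weight matrices only,
# excluding biases (and normalization parameters, if present).
decay, no_decay = [], []
for name, param in model.named_parameters():
    (no_decay if name.endswith("bias") else decay).append(param)

optimizer = torch.optim.AdamW(
    [
        {"params": decay, "weight_decay": 1e-2},    # Step 1: regularization strength
        {"params": no_decay, "weight_decay": 0.0},  # biases left unpenalized
    ],
    lr=1e-3,
)
# Step 3: treat weight_decay as a hyperparameter and tune it on validation metrics.
```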
Dropout: Noise Injection That Prevents Co-Adaptation
What it is
Dropout randomly zeroes a fraction of activations during training. At inference time, you use the full network with appropriately scaled activations (or equivalent scaling during training).
Why it helps generalization
- It prevents neurons from relying on specific other neurons being present (reduces “co-adaptation”).
- It acts like training an ensemble of many subnetworks that share weights, which tends to be more robust.
How it changes the hypothesis space
Dropout makes the training objective effectively optimize performance under random perturbations of the network. Functions that only work when very specific internal pathways are active become hard to learn; functions that are stable under these perturbations are favored.
Step-by-step: using dropout well
- Step 1: Decide where to apply it (often after dense layers; sometimes in attention/MLP blocks; less commonly right after input for structured data).
- Step 2: Choose a dropout rate (e.g., 0.1–0.5 depending on model size and data).
- Step 3: Train with dropout enabled; disable it for evaluation/inference.
- Step 4: Tune rate by validation loss/metrics; too high can cause underfitting or slow convergence.
```
# Pseudocode: during training only, a = dropout(a, rate=p)
```

Practical note: Dropout is less commonly used in some modern architectures that rely heavily on normalization/residual connections, but it remains a strong tool when overfitting is evident, especially in smaller datasets.
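For concreteness, a minimal PyTorch sketch of Steps 1–3 (the layer sizes and the rate p=0.3 are illustrative; nn.Dropout rescales activations during training, so no extra scaling is needed at inference):

```python
import torch
import torch.nn as nn

# Step 1: dropout placed after a dense layer; Step 2: illustrative rate.
model = nn.Sequential(
    nn.Linear(32, 64),
    nn.ReLU(),
    nn.Dropout(p=0.3),
    nn.Linear(64, 10),
)

x = torch.randn(8, 32)   # dummy batch

model.train()            # Step 3: dropout active during training
train_logits = model(x)

model.eval()             # dropout disabled for evaluation/inference
with torch.no_grad():
    val_logits = model(x)
```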
Data Augmentation: Encode Invariances Through More (Better) Data
What it is
Data augmentation creates additional training examples by applying label-preserving transformations. For images, this might be random crops, flips, color jitter, small rotations; for audio, time shifts or noise; for text, careful perturbations (often harder to guarantee label preservation).
Why it helps generalization
- It increases the effective dataset size and diversity.
- More importantly, it teaches the model invariances: the prediction should not change under certain transformations.
How it changes the hypothesis space
Augmentation restricts the set of functions that fit the training data well: the model must produce consistent outputs across transformed versions of the same example. Functions that depend on brittle, non-invariant cues become less viable.
Step-by-step: designing augmentation
- Step 1: List transformations that should not change the label (domain knowledge).
- Step 2: Start with mild transformations; verify they truly preserve labels.
- Step 3: Apply them stochastically during training (on-the-fly) rather than precomputing.
- Step 4: Monitor validation performance; if it drops, your augmentation may be too strong or not label-preserving.
```
# Example idea for images (conceptual):
# x_aug = random_crop(flip(color_jitter(x)))
```

Practical note: Augmentation is most effective when it matches real-world variation you expect at deployment time. Randomness that does not reflect deployment conditions can act like harmful noise.
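A minimal torchvision sketch of Steps 1–3, assuming natural images (CIFAR-10 and the specific transform parameters are illustrative starting points, not recommendations):

```python
import torchvision.transforms as T
from torchvision.datasets import CIFAR10

# Steps 1-2: mild, label-preserving transforms for natural images.
train_transform = T.Compose([
    T.RandomCrop(32, padding=4),
    T.RandomHorizontalFlip(),
    T.ColorJitter(brightness=0.2, contrast=0.2),
    T.ToTensor(),
])

# Step 3: transforms run on-the-fly, so each epoch sees a freshly
# transformed version of every example.
train_set = CIFAR10(root="data", train=True, download=True,
                    transform=train_transform)

# Validation data stays un-augmented so the metric reflects deployment inputs.
val_set = CIFAR10(root="data", train=False, download=True,
                  transform=T.ToTensor())
```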
Early Stopping: A Stopping Rule Tied to Validation Loss
What it is
Early stopping halts training when validation performance stops improving. You keep the model checkpoint that achieved the best validation score (typically lowest validation loss).
Why it helps generalization
- As training continues, the model can start fitting idiosyncrasies of the training set.
- Stopping at the point of best validation performance prevents the later drift into overfitting.
How it changes the hypothesis space
Even if the architecture can represent extremely complex functions, early stopping limits which solutions are reachable by the training process. It effectively constrains complexity by limiting training time, often favoring smoother or simpler solutions found earlier in optimization.
Step-by-step: implementing early stopping
- Step 1: Split out a validation set that reflects deployment data.
- Step 2: Choose a monitored metric (commonly validation loss).
- Step 3: Set patience (how many evaluations without improvement before stopping).
- Step 4: Save the best checkpoint; restore it after stopping.
- Step 5: If training is noisy, consider smoothing (e.g., require a minimum improvement threshold).
```
# Pseudocode: if val_loss hasn't improved for 'patience' epochs,
# stop and restore best model
```
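A minimal Python sketch of Steps 1–5, assuming a PyTorch-style model with state_dict()/load_state_dict(); train_one_epoch and evaluate are hypothetical placeholders for your own training and validation routines:

```python
import copy

def train_with_early_stopping(model, train_one_epoch, evaluate,
                              max_epochs=100, patience=5, min_delta=1e-4):
    """Stop when validation loss hasn't improved by min_delta for patience epochs."""
    best_loss = float("inf")
    best_state = copy.deepcopy(model.state_dict())
    epochs_since_best = 0

    for epoch in range(max_epochs):
        train_one_epoch(model)
        val_loss = evaluate(model)  # Step 2: monitored metric on the validation set

        if val_loss < best_loss - min_delta:  # Step 5: minimum improvement threshold
            best_loss = val_loss
            best_state = copy.deepcopy(model.state_dict())  # Step 4: save best
            epochs_since_best = 0
        else:
            epochs_since_best += 1
            if epochs_since_best >= patience:  # Step 3: patience exhausted
                break

    model.load_state_dict(best_state)  # Step 4: restore the best checkpoint
    return model
```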
Why Combining Regularizers Is Common
Different regularizers target different sources of overfitting and shape the effective hypothesis space in complementary ways:
- Weight decay biases toward small-norm (often smoother) solutions.
- Dropout enforces robustness to internal perturbations and discourages fragile feature interactions.
- Data augmentation encodes invariances and reduces reliance on spurious cues by expanding the training distribution in a controlled way.
- Early stopping limits how far optimization can chase training-specific patterns.
In practice, you often see combinations like: augmentation + weight decay as a baseline, then add dropout if overfitting persists, and use early stopping as a safety mechanism tied to validation loss. The key is to tune them together because they interact: stronger augmentation may allow weaker dropout; stronger weight decay may require longer training; early stopping criteria may need adjustment when validation curves are noisy.
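As a hedged illustration of such a baseline (all values are starting points to tune together; the augmentation and early-stopping pieces are the sketches shown earlier):

```python
import torch
import torch.nn as nn

# Dropout + weight decay live in the model/optimizer; augmentation lives in
# the data pipeline and early stopping wraps the training loop (see above).
model = nn.Sequential(
    nn.Linear(32, 64),
    nn.ReLU(),
    nn.Dropout(p=0.1),  # add or raise if overfitting persists
    nn.Linear(64, 10),
)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3, weight_decay=1e-2)

# e.g., model = train_with_early_stopping(model, train_one_epoch, evaluate)
```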