Reading Training Logs Like a Diagnostic Report
Training logs (and the curves you plot from them) are not just “progress indicators”; they are measurements of a system under stress. Your job is to map what you observe (symptom) to the most likely mechanisms (cause), then apply the smallest change that tests that hypothesis (fix). The most useful logs include: training loss, validation loss, training accuracy (or task metric), validation accuracy, learning rate, gradient norms (if available), and a few batch-level samples of predictions.
What to Plot and What to Record
- Loss curves: training loss vs. steps/epochs; validation loss vs. epochs.
- Metric curves: training accuracy (or F1/AUC/etc.) and validation metric.
- Learning rate: especially if using schedules or warmup.
- Gradient/weight norms: exploding/vanishing often shows up here before the loss becomes obviously wrong.
- Data sanity snapshots: a few (input, label, prediction) triplets per epoch to catch label issues and leakage.
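A minimal sketch of recording and plotting these quantities (assuming the metrics have already been collected per epoch into plain Python lists; the numbers and names here are illustrative):

```python
import matplotlib.pyplot as plt

# Illustrative history; in practice this comes from your training loop
# or from parsing the log file.
history = {
    "train_loss": [2.3, 1.1, 0.7, 0.5, 0.4],
    "val_loss":   [2.2, 1.2, 0.9, 0.9, 1.0],
    "lr":         [1e-4, 1e-4, 1e-4, 5e-5, 5e-5],
}

fig, (ax_loss, ax_lr) = plt.subplots(1, 2, figsize=(10, 4))

# Loss curves: the train/val gap pattern is the first thing to read.
ax_loss.plot(history["train_loss"], label="train loss")
ax_loss.plot(history["val_loss"], label="val loss")
ax_loss.set_xlabel("epoch")
ax_loss.set_ylabel("loss")
ax_loss.legend()

# Learning rate: schedules and warmup are easy to misconfigure silently.
ax_lr.plot(history["lr"])
ax_lr.set_xlabel("epoch")
ax_lr.set_ylabel("learning rate")

fig.tight_layout()
plt.show()
```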
Systematic Troubleshooting Workflow
When something looks wrong, avoid random tweaks. Use a repeatable checklist that narrows the search space.
Step-by-step: A Minimal Diagnostic Loop
- Step 1: Reproduce deterministically. Fix random seeds, log versions of code/data, and confirm the issue repeats (a seeding sketch follows this list).
- Step 2: Run a tiny overfit test. Train on a very small subset (e.g., 32–512 examples) and see if the model can drive training loss near zero. If it cannot, suspect data/labels, loss/activation mismatch, or optimization/gradient issues.
- Step 3: Check data pipeline invariants. Verify shapes, ranges, normalization, label encoding, shuffling, and that augmentations are applied only to training.
- Step 4: Compare train vs. validation behavior. The gap pattern is often the fastest clue (overfitting vs. underfitting vs. leakage).
- Step 5: Change one thing. Apply a single targeted fix, rerun, and compare curves.
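For Step 1, a minimal seeding sketch (assuming PyTorch and NumPy; extend it to your data loaders and any other source of randomness in your pipeline):

```python
import os
import random

import numpy as np
import torch


def set_seed(seed: int = 42) -> None:
    """Fix the common sources of randomness so a failure can be reproduced."""
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)
    # Trade some speed for determinism in cuDNN kernels.
    torch.backends.cudnn.deterministic = True
    torch.backends.cudnn.benchmark = False
    os.environ["PYTHONHASHSEED"] = str(seed)


set_seed(42)
```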
Symptom → Cause → Fix: Common Failure Modes
1) Exploding Loss (or NaNs/Infs)
Symptom: Loss shoots upward, becomes NaN/Inf, or oscillates wildly. Gradient norms (if logged) spike. Sometimes accuracy is random and never stabilizes.
Likely causes:
- Learning rate too high: updates overshoot minima and destabilize training.
- Bad initialization or scale mismatch: activations/gradients start too large.
- Unnormalized inputs: large feature magnitudes amplify activations.
- Numerical instability in loss: e.g., computing log of values too close to 0, or using an unstable formulation.
- Mixed precision issues: overflow in float16 without proper scaling.
Fixes (test in this order):
- Lower the learning rate by 10× (or more) and rerun. If it stabilizes, you found the main lever.
- Enable gradient clipping (e.g., clip the global norm) to prevent rare spikes from destroying training (a clipping sketch follows this list).
- Verify input scaling/normalization. Check min/mean/max per feature/channel; ensure consistent preprocessing between train and validation.
- Use numerically stable loss implementations. Prefer combined “logits + loss” functions that avoid separate sigmoid/softmax then log.
- Check initialization and layer scaling. If you changed defaults (custom init, very deep nets), revert to standard initializers.
- If using mixed precision: ensure dynamic loss scaling is on; try full precision to confirm the diagnosis.
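A minimal sketch of the gradient-clipping fix (assuming PyTorch; the model, data, and threshold here are illustrative stand-ins):

```python
import torch
from torch import nn

# Illustrative setup; substitute your own model, data, and loss.
model = nn.Linear(10, 1)
optimizer = torch.optim.SGD(model.parameters(), lr=1e-2)
x, y = torch.randn(32, 10), torch.randn(32, 1)

optimizer.zero_grad()
loss = nn.functional.mse_loss(model(x), y)
loss.backward()

# Clip the global gradient norm so a single bad batch cannot blow up the
# weights; max_norm=1.0 is a common starting point, not a universal constant.
grad_norm = torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
print(f"global grad norm before clipping: {grad_norm.item():.3f}")

optimizer.step()
```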
Practical check: Detecting explosion early
Log gradient norms and a few weight norms every N steps. If loss is still finite but gradient norms jump by orders of magnitude, reduce learning rate or clip before the run wastes hours.
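One possible implementation of the norm helpers referenced in the pseudocode below, written as a PyTorch-flavored sketch (the helper names mirror the pseudocode, with an explicit model argument added):

```python
import torch
from torch import nn


def global_grad_norm(model: nn.Module) -> float:
    """L2 norm over all parameter gradients (0.0 if no gradients exist yet)."""
    norms = [p.grad.norm(2) for p in model.parameters() if p.grad is not None]
    return torch.norm(torch.stack(norms), 2).item() if norms else 0.0


def global_weight_norm(model: nn.Module) -> float:
    """L2 norm over all parameters; useful for spotting runaway weights."""
    norms = [p.detach().norm(2) for p in model.parameters()]
    return torch.norm(torch.stack(norms), 2).item()
```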
```python
# Pseudocode logging idea (framework-agnostic)
log({
    "step": step,
    "train_loss": loss,
    "lr": lr,
    "grad_norm": global_grad_norm(),
    "weight_norm": global_weight_norm(),
})
```
2) Flat Loss (No Learning)
Symptom: Training loss barely decreases; training accuracy stays near chance. Validation mirrors training (also poor). Gradients may be near zero, or updates are tiny.
Likely causes:
- Vanishing gradients: gradients shrink through the network, especially with poor scaling or saturated activations.
- Saturation: activations stuck in regimes where gradients are tiny (often due to large pre-activations from unnormalized inputs or large weights).
- Learning rate too low: updates are too small to make progress.
- Frozen parameters: layers accidentally set to not train, or optimizer not receiving parameters.
- Bug in labels or loss: constant labels, wrong label mapping, or loss not connected to the model output you think it is.
- Mismatched loss/activation pair: the model outputs and the loss expect different formats (details below).
Fixes (targeted tests):
- Run the tiny overfit test. If you cannot overfit a tiny dataset, treat it as a pipeline/optimization bug until proven otherwise.
- Increase learning rate modestly (e.g., 2× to 5×) and see if loss begins to move. If it starts learning then later diverges, you overshot.
- Inspect gradients. If gradient norms are near zero from the start, suspect saturation/vanishing or a disconnected graph.
- Check trainable parameters. Count parameters with requires-grad/trainable flags and confirm the optimizer actually includes them (a sketch follows this list).
- Check input/label distributions. Print a batch: input stats, unique labels, label frequency.
- Reduce model complexity temporarily. A smaller model can be easier to optimize; if small works and large doesn’t, suspect optimization/initialization issues.
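A minimal PyTorch sketch of the trainable-parameter check (the model and optimizer here are placeholders for your own):

```python
import torch
from torch import nn

# Illustrative model; substitute your own.
model = nn.Sequential(nn.Linear(10, 32), nn.ReLU(), nn.Linear(32, 2))
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

# 1) How many parameters can actually learn?
trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
total = sum(p.numel() for p in model.parameters())
print(f"trainable params: {trainable} / {total}")

# 2) Does the optimizer actually see them?
in_optimizer = {id(p) for group in optimizer.param_groups for p in group["params"]}
missing = [name for name, p in model.named_parameters()
           if p.requires_grad and id(p) not in in_optimizer]
print("trainable params missing from optimizer:", missing or "none")
```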
Practical step-by-step: Tiny overfit test
- Step 1: Take 64–256 training examples (stratified if classification).
- Step 2: Turn off regularization that intentionally limits fit (strong dropout, heavy augmentation).
- Step 3: Train until the model should clearly memorize (many epochs).
- Expected: Training loss should drop dramatically; training accuracy should approach 100% for classification.
- If it fails: Focus on data/labels, loss/activation mismatch, optimizer wiring, and gradient flow.
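A minimal sketch of the tiny overfit test described above (assuming PyTorch and a classification setup; the random tensors stand in for a real subset of your training data):

```python
import torch
from torch import nn

# Stand-ins: 64 examples, 20 features, 4 classes. Use a real, stratified
# subset of your training data in practice.
x = torch.randn(64, 20)
y = torch.randint(0, 4, (64,))

model = nn.Sequential(nn.Linear(20, 256), nn.ReLU(), nn.Linear(256, 4))
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()

for step in range(3000):  # many passes over the same tiny set
    optimizer.zero_grad()
    logits = model(x)
    loss = loss_fn(logits, y)
    loss.backward()
    optimizer.step()
    if step % 500 == 0:
        acc = (logits.argmax(dim=1) == y).float().mean().item()
        print(f"step {step}: loss={loss.item():.4f} acc={acc:.2%}")

# Expected: loss near zero and accuracy near 100% on this tiny set.
# If not, suspect data/labels, loss/activation mismatch, or optimizer wiring.
```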
3) High Training Accuracy but Low Validation Accuracy (Overfitting Pattern)
Symptom: Training loss decreases steadily and training accuracy becomes high, but validation loss stops improving early or increases; validation accuracy stays low or degrades.
Likely causes:
- Model capacity too high for the data.
- Train/validation distribution mismatch. Different preprocessing, different sampling, or different time periods/domains.
- Train/validation asymmetry. Leakage usually inflates validation scores, but the reverse also happens: training receives information or preprocessing that validation does not (e.g., augmentation differences or label-dependent preprocessing), which makes validation look artificially worse.
- Label noise concentrated in validation or inconsistent labeling policy between splits.
- Class imbalance and misleading accuracy. Training accuracy can look high by predicting the majority class; validation may have a different class mix.
Fixes (choose based on evidence):
- Verify split integrity. Ensure no near-duplicates across train/val (same user, same product, same scene, same time window). For time series, split by time, not random rows (a duplicate-check sketch follows this list).
- Align preprocessing. Confirm normalization, tokenization, resizing, and feature engineering are identical across splits (except training-only augmentation).
- Reduce capacity or increase regularization. Try a smaller model, stronger weight decay, dropout, or early stopping based on validation loss.
- Increase effective data. More data, better augmentation, or better sampling can reduce the gap.
- Use better metrics. For imbalance, track precision/recall, F1, or AUC; also inspect confusion matrices.
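A minimal sketch of the duplicate check across splits (assuming examples can be serialized to bytes; exact hashing catches verbatim duplicates only, so use embedding or perceptual similarity for near-duplicates):

```python
import hashlib


def fingerprint(example: bytes) -> str:
    """Exact-match fingerprint of a serialized example."""
    return hashlib.sha256(example).hexdigest()


# Illustrative data: replace with serialized images, texts, or feature rows.
train_examples = [b"sample-1", b"sample-2", b"sample-3"]
val_examples = [b"sample-3", b"sample-4"]

train_hashes = {fingerprint(x) for x in train_examples}
overlap = [x for x in val_examples if fingerprint(x) in train_hashes]
print(f"exact duplicates shared with train: {len(overlap)}")
```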
Diagnostic: Is it truly overfitting or a data problem?
- If training improves but validation is random from the start: suspect split mismatch, label issues, or preprocessing mismatch.
- If validation improves early then degrades: classic overfitting; early stopping and regularization often help.
- If validation is much worse only for certain classes: suspect class imbalance, label noise, or domain shift for those classes.
4) Both Training and Validation Accuracy Are Low (Underfitting Pattern)
Symptom: Training loss decreases slowly or plateaus at a high value; training and validation metrics are both poor and close to each other.
Likely causes:
- Model too small or too constrained.
- Optimization not effective. Learning rate too low, poor schedule, or too few epochs.
- Features insufficient. Inputs lack signal, or preprocessing removed important information.
- Label noise so high that the achievable performance is limited.
- Wrong objective. Loss does not reflect the metric you care about, or the target is encoded incorrectly.
Fixes:
- Train longer and confirm the model is still improving (not just noisy).
- Increase learning rate or adjust the schedule to make faster early progress (a schedule sketch follows this list).
- Increase capacity (more units/layers) or reduce overly strong regularization.
- Improve input representation (better preprocessing, add missing features, fix normalization).
- Audit labels for systematic noise; if noise is high, consider cleaning or robust training strategies.
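One way to adjust the schedule, sketched with PyTorch's OneCycleLR (the particular scheduler and numbers are illustrative, not a recommendation for every problem):

```python
import torch
from torch import nn

model = nn.Linear(10, 1)  # placeholder model
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

# OneCycleLR warms up to max_lr and then decays; it often makes early
# progress faster than a constant, too-low learning rate.
scheduler = torch.optim.lr_scheduler.OneCycleLR(
    optimizer, max_lr=0.1, total_steps=1000
)

for step in range(1000):
    # ... forward/backward would go here ...
    optimizer.step()   # one optimizer update per scheduler step
    scheduler.step()
```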
Cross-Cutting Checks That Prevent Wasted Weeks
Data Leakage: The Silent Curve Distorter
Leakage happens when information that would not be available at inference time influences training or evaluation. It can make validation look unrealistically good, or create confusing patterns when the leakage differs between splits.
Common leakage sources:
- Duplicates across splits: same or near-identical samples in train and validation.
- Group leakage: samples from the same entity (user, patient, device, location) appear in both splits.
- Time leakage: using future information to predict the past; random split on time-dependent data.
- Preprocessing fit on full data: normalization/statistics computed using train+val together.
- Target leakage features: features derived from labels or post-outcome information.
Fixes:
- Split by group/time when appropriate.
- Fit preprocessing on training data only and apply it to validation/test (a sketch follows this list).
- Deduplicate using hashes or similarity checks.
- Feature audit: remove any feature that encodes the target directly or indirectly.
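A minimal sketch of the train-only preprocessing fit (assuming scikit-learn and NumPy; the arrays are stand-ins for your real splits):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Stand-in feature matrices; use your real splits.
X_train = np.random.randn(1000, 8)
X_val = np.random.randn(200, 8)

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)  # statistics come from train only
X_val_scaled = scaler.transform(X_val)          # val reuses the train statistics

# Anti-pattern (leakage): scaler.fit(np.vstack([X_train, X_val]))
```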
Label Noise: When the Best Possible Curve Is Worse Than You Expect
Label noise can look like underfitting (both curves low) or like overfitting (training climbs while validation stalls) depending on where the noise is concentrated.
Signs:
- Persistent high loss on a subset of samples across epochs.
- Confident wrong predictions that appear “reasonable” when you inspect the input.
- Disagreement among annotators or inconsistent labeling rules.
Fixes:
- Spot-check errors. Sample misclassified validation examples and inspect their labels (a sketch follows this list).
- Measure per-class confusion. Noise often clusters in specific classes.
- Clean or relabel a small high-impact subset first.
- Use robust evaluation. If labels are noisy, rely on multiple metrics and confidence intervals.
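A minimal sketch of the spot-check (assuming PyTorch; the model and validation tensors are placeholders), collecting the highest-loss validation examples as the first candidates for a label audit:

```python
import torch
from torch import nn

# Placeholders: substitute your trained model and real validation split.
model = nn.Linear(20, 4)
x_val = torch.randn(200, 20)
y_val = torch.randint(0, 4, (200,))

model.eval()
with torch.no_grad():
    logits = model(x_val)
    per_example_loss = nn.functional.cross_entropy(logits, y_val, reduction="none")

# Highest-loss examples are the first ones to review by hand.
worst = torch.topk(per_example_loss, k=10).indices
for i in worst.tolist():
    pred = logits[i].argmax().item()
    print(f"idx={i} label={y_val[i].item()} pred={pred} loss={per_example_loss[i]:.3f}")
```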
Class Imbalance: When Accuracy Lies
With imbalance, a model can achieve high accuracy by predicting the majority class, while being useless for minority classes. Curves may look “fine” while the model fails in practice.
Checks:
- Print class counts for train and validation.
- Track per-class precision/recall and confusion matrix, not just accuracy.
- Compare baseline: majority-class predictor accuracy vs. your model.
Fixes:
- Resampling: oversample minority or undersample majority (careful with duplicates).
- Class-weighted loss to penalize minority-class errors more heavily (a sketch follows this list).
- Threshold tuning for probabilistic outputs to match the cost of errors.
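A minimal sketch of a class-weighted loss (assuming PyTorch; inverse-frequency weighting is one common heuristic, not the only option):

```python
import torch
from torch import nn

# Illustrative label distribution: class 0 heavily outnumbers class 1.
labels = torch.tensor([0] * 950 + [1] * 50)

counts = torch.bincount(labels).float()
weights = counts.sum() / (len(counts) * counts)  # inverse-frequency weights
print("class weights:", weights)                 # rare class gets the larger weight

# Pass the weights to the loss so minority-class errors cost more.
loss_fn = nn.CrossEntropyLoss(weight=weights)
```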
Mismatched Loss/Activation Pairs: A Frequent “Flat Loss” or “Weird Metrics” Bug
Many training failures come from using an output activation that doesn’t match what the loss expects. This can produce slow learning, unstable loss, or metrics that don’t correlate with loss.
Common correct pairings (conceptual):
- Binary classification: output as a single logit (no sigmoid applied in the model head) + a binary cross-entropy loss that expects logits (stable implementation).
- Multi-class single-label: output as K logits (no softmax in the model head) + a cross-entropy loss that expects logits and integer class labels.
- Multi-label: output as K logits + independent binary cross-entropy per class (logits-based).
Failure patterns:
- Applying sigmoid/softmax twice (once in the model, once inside the loss) can squash gradients and slow learning.
- Using MSE for classification often learns slowly and can saturate early.
- Wrong label encoding: providing one-hot labels where integer indices are expected (or vice versa) can silently break training.
Fix: Confirm (1) what your model outputs (logits vs probabilities), (2) what your loss expects, and (3) label format. Then run the tiny overfit test again.
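A minimal PyTorch sketch of the logits-based pairings described above (shapes and names are illustrative):

```python
import torch
from torch import nn

batch, num_classes = 8, 5

# Binary classification: one logit per example, BCE-with-logits loss.
binary_logits = torch.randn(batch)            # no sigmoid in the model head
binary_labels = torch.randint(0, 2, (batch,)).float()
bce = nn.BCEWithLogitsLoss()(binary_logits, binary_labels)

# Multi-class single-label: K logits, integer class indices, cross-entropy.
mc_logits = torch.randn(batch, num_classes)   # no softmax in the model head
mc_labels = torch.randint(0, num_classes, (batch,))
ce = nn.CrossEntropyLoss()(mc_logits, mc_labels)

# Multi-label: K logits, independent binary targets per class.
ml_logits = torch.randn(batch, num_classes)
ml_labels = torch.randint(0, 2, (batch, num_classes)).float()
ml_bce = nn.BCEWithLogitsLoss()(ml_logits, ml_labels)

print(bce.item(), ce.item(), ml_bce.item())
```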
Putting It Together: Symptom-to-Fix Cheat Sheet
- Exploding loss / NaNs: lower learning rate → add gradient clipping → verify normalization → use stable logits-based loss → check mixed precision scaling.
- Flat loss / chance accuracy: tiny overfit test → verify trainable params/optimizer wiring → check gradients → adjust learning rate → check loss/activation/labels.
- High train, low val: verify split/preprocessing → check leakage/duplicates → inspect per-class metrics → regularize/reduce capacity → early stopping.
- Both low: train longer → tune learning rate/schedule → increase capacity or reduce constraints → improve features → audit label noise.