Start With Constraints, Not Architecture
Model design is easiest when you treat it like an engineering problem: you have constraints (data, noise, compute, latency, memory) and you need the smallest model that meets the target metric reliably. If you start by picking a fancy architecture, you often end up debugging symptoms (instability, overfitting, slow inference) that were predictable from the constraints.
Constraint checklist
- Data size: number of labeled examples, class balance, and how diverse the inputs are.
- Noise: label noise, ambiguous cases, sensor noise, and whether the task has an irreducible error floor.
- Latency: maximum inference time per example (server or on-device).
- Memory: model size on disk, activation memory during training, and batch size limits.
- Compute budget: training time, available GPU/CPU, and how often you can retrain.
- Deployment constraints: quantization requirements, supported ops, and input resolution limits.
Write these down as numbers when possible (e.g., “p95 latency < 20 ms”, “model < 30 MB”, “training < 6 hours”). These numbers will guide capacity choices more reliably than intuition.
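For illustration, these budgets can live next to the training code as plain numbers. The values in the sketch below are hypothetical placeholders that mirror the examples above, not recommendations.
# Illustrative constraint sheet; every value here is a hypothetical placeholder.
CONSTRAINTS = {
    "p95_latency_ms": 20,     # maximum p95 inference latency per example
    "model_size_mb": 30,      # maximum serialized model size on disk
    "max_train_hours": 6,     # wall-clock training budget
    "min_val_metric": 0.85,   # assumed minimum acceptable primary metric
}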
Pick a Baseline Architecture You Can Reason About
A baseline is not “the best model.” It is a stable reference point that trains predictably, fits within constraints, and is simple enough that you can interpret training/validation behavior. The baseline should be intentionally modest; you will scale it up or down based on evidence.
Baseline selection by input type
- Tabular / structured features: start with a small multilayer perceptron (MLP) with 2–4 hidden layers and moderate width.
- Images: start with a small convolutional network (few stages, modest channels) or a compact pretrained backbone if allowed by your setting.
- Text / sequences: start with an embedding + small recurrent/temporal model or a compact transformer variant, but keep sequence length and hidden size conservative.
Keep the baseline training settings conservative too: a learning rate that is known to be stable for your optimizer, a batch size that fits comfortably, and a straightforward data pipeline.
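As one possible starting point, here is a minimal sketch of a small MLP baseline, assuming PyTorch and a tabular task; `n_features`, `n_classes`, and the width/depth defaults are placeholders to scale up or down from.
# Minimal MLP baseline sketch (assumes PyTorch; all sizes are placeholders).
import torch.nn as nn

def make_baseline_mlp(n_features, n_classes, width=256, depth=3):
    layers, in_dim = [], n_features
    for _ in range(depth):
        layers += [nn.Linear(in_dim, width), nn.ReLU()]
        in_dim = width
    layers.append(nn.Linear(in_dim, n_classes))
    return nn.Sequential(*layers)

model = make_baseline_mlp(n_features=64, n_classes=10)  # hypothetical dimensions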
Baseline capacity knobs (what they really change)
- Width (more units/channels): increases representational capacity and often improves fit quickly, but parameter count and compute grow faster than linearly when both a layer's input and output widths increase (hidden-to-hidden weight matrices grow roughly quadratically with width); activation memory grows roughly linearly.
- Depth (more layers/blocks): can improve efficiency of representation, but increases training complexity and can slow inference; also increases activation memory.
- Input resolution / sequence length: often the most expensive knob; compute and memory typically scale with the number of pixels or tokens (and quadratically with sequence length for attention-based models), so it can dominate cost and may matter more than adding layers.
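A quick parameter count makes these trade-offs concrete. The sketch below counts dense-layer parameters for a plain MLP (a simplifying assumption; convolutional and attention layers scale differently) so you can see how width and depth change the budget before training anything.
# Rough parameter count for a dense MLP (weights + biases); plain Python.
def mlp_param_count(n_features, n_classes, width, depth):
    dims = [n_features] + [width] * depth + [n_classes]
    return sum(d_in * d_out + d_out for d_in, d_out in zip(dims[:-1], dims[1:]))

for w in (128, 256, 512):   # doubling width roughly quadruples hidden-to-hidden weights
    print("width", w, mlp_param_count(64, 10, w, depth=3))
for d in (3, 6):            # each extra hidden layer adds roughly width**2 parameters
    print("depth", d, mlp_param_count(64, 10, 256, depth=d))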
Decision Framework: From Constraints to a First Training Run
Step 1: Set a target metric and a minimum acceptable baseline
Choose one primary metric that matches the product goal (e.g., accuracy, F1, AUROC, mean absolute error). Define a minimum acceptable value and a validation protocol (holdout split, time-based split, or cross-validation). If you cannot trust your validation setup, capacity tuning becomes guesswork.
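A minimal sketch of one such protocol, assuming scikit-learn and a classification task with placeholder arrays `X` and `y`; for temporal data you would swap this for a time-based split.
# Stratified holdout split sketch (assumes scikit-learn; X and y are placeholders).
from sklearn.model_selection import train_test_split

X_train, X_val, y_train, y_val = train_test_split(
    X, y,
    test_size=0.2,     # hold out 20% for validation
    stratify=y,        # preserve class balance in both splits
    random_state=0,    # fixed seed so the protocol is reproducible
)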
Step 2: Choose the smallest model that fits latency/memory
If you have strict deployment constraints, start by sizing the model to meet them. It is easier to improve accuracy within a fixed budget than to shrink a model after it becomes dependent on large capacity.
- Latency-limited: prefer fewer layers, smaller input sizes, and operations that are efficient on your target hardware.
- Memory-limited: reduce width, reduce batch size, consider gradient checkpointing for training, and keep embeddings/feature maps modest.
- Training-time-limited: prefer architectures that converge quickly and are easy to optimize; avoid extremely deep or exotic designs until you have a working baseline.
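Before committing to a candidate, it is worth checking whether it even fits the latency budget. The sketch below times single-example inference with PyTorch; the placeholder model and input shape are assumptions, and real numbers should come from the target hardware.
# Rough p95 latency check (assumes PyTorch; model and input shape are placeholders).
import time
import torch

model = torch.nn.Sequential(torch.nn.Linear(64, 10))  # placeholder; use your candidate model
example = torch.randn(1, 64)                          # hypothetical single-example input

@torch.no_grad()
def p95_latency_ms(model, example, n_runs=200, warmup=20):
    model.eval()
    for _ in range(warmup):
        model(example)                                # warm-up runs are not timed
    times = []
    for _ in range(n_runs):
        t0 = time.perf_counter()
        model(example)                                # on GPU, synchronize before reading the clock
        times.append((time.perf_counter() - t0) * 1000.0)
    return sorted(times)[int(0.95 * len(times))]      # approximate 95th percentile in ms

print(p95_latency_ms(model, example))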
Step 3: Run a short “sanity training”
Before long training runs, do a short run to catch pipeline and optimization issues.
- Verify loss decreases on a small subset of data.
- Try to overfit a tiny batch (e.g., 32–256 examples). If the model cannot drive training loss very low on a tiny set, you likely have a bug, a mismatch between labels and inputs, or an optimization/initialization issue.
- Check that metrics move in the expected direction (e.g., accuracy increases as loss decreases for classification).
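A minimal sketch of the tiny-batch overfit check, assuming PyTorch and a classification setup; the random batch and placeholder model below only show the shape of the test, so in practice you would use a real slice of your training data and your actual baseline.
# Sanity check: can the model drive training loss near zero on one tiny, fixed batch?
import torch
import torch.nn as nn
import torch.nn.functional as F

model = nn.Sequential(nn.Linear(64, 256), nn.ReLU(), nn.Linear(256, 10))  # placeholder model
xb = torch.randn(64, 64)             # hypothetical tiny batch; use real training examples
yb = torch.randint(0, 10, (64,))     # hypothetical labels

opt = torch.optim.AdamW(model.parameters(), lr=1e-3)
model.train()
for _ in range(500):
    opt.zero_grad()
    loss = F.cross_entropy(model(xb), yb)
    loss.backward()
    opt.step()
print(loss.item())                   # should approach 0; if it stays high, suspect a bug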
Adjust Capacity Using Training vs Validation Signals
Capacity tuning should be driven by the shape of learning curves: training loss/metric vs validation loss/metric over epochs. The key is to diagnose whether you are limited by optimization, by insufficient capacity, or by generalization.
Reading the curves: common patterns
- Underfitting (capacity or optimization limited): training performance is poor and validation performance is similarly poor. Training loss plateaus at a high value.
- Overfitting (generalization limited): training performance keeps improving while validation stalls or degrades. The gap between training and validation grows.
- High variance across runs: validation metrics fluctuate a lot between seeds or folds; often indicates small data, noisy labels, or an unstable training setup.
- Both curves improve slowly: may indicate learning rate too low, poor initialization, insufficient normalization, or data pipeline bottlenecks.
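The same logic can be made mechanical. The sketch below is a crude heuristic over per-epoch metric values (higher is better); the thresholds are arbitrary assumptions meant only to show how the curve shapes above map to labels, not a standard diagnostic.
# Crude learning-curve diagnosis from per-epoch metric lists (assumes >= 10 recorded epochs).
def diagnose(train_metric, val_metric, gap_tol=0.05, flat_tol=0.002, patience=5):
    best_val_epoch = max(range(len(val_metric)), key=lambda i: val_metric[i])
    train_still_improving = (train_metric[-1] - train_metric[-patience]) > flat_tol
    val_stalled = (len(val_metric) - 1 - best_val_epoch) >= patience
    gap = train_metric[-1] - val_metric[-1]
    if train_metric[-1] < 0.7 and gap < gap_tol:   # 0.7 is an arbitrary "poor fit" threshold
        return "underfitting: both curves low and close together"
    if train_still_improving and val_stalled:
        return "overfitting: training improves while validation has stopped"
    return "keep training, or investigate optimization settings and run-to-run variance"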
When you see underfitting
Underfitting means the model is not capturing the signal available in the data (or training is not effectively finding it). Use this order of operations:
- First check optimization: increase learning rate cautiously, use a learning rate schedule, or adjust batch size. If training loss is not decreasing, adding capacity may not help.
- Then increase capacity: add width before depth if you need a quick capacity increase with simpler optimization; add depth if you suspect the task benefits from more hierarchical processing and you can train stably.
- Increase input information: higher resolution, longer context window, or better features can help more than adding layers, but watch compute costs.
- Train longer: if both training and validation are improving and not plateaued, you may simply need more epochs.
When you see overfitting
Overfitting means the model is learning patterns that do not generalize. You can respond by reducing effective capacity or increasing the amount/quality of signal seen during training.
- Prefer data improvements first: more data, better labels, better augmentation, and better sampling often beat architectural tweaks.
- Reduce capacity: fewer layers/units, smaller embeddings, lower input resolution, or earlier stopping based on validation.
- Increase regularization pressure: stronger weight decay, dropout, or data augmentation; the decision here is when to apply them, not what they are.
- Use early stopping: pick the checkpoint with best validation metric, not the final epoch.
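Two of these responses are often one-line changes. The sketch below shows dropout inside the model and decoupled weight decay in the optimizer, assuming PyTorch; the specific values are illustrative starting points, not recommendations.
# Two common regularization knobs (assumes PyTorch; values are illustrative).
import torch
import torch.nn as nn

model = nn.Sequential(
    nn.Linear(64, 256), nn.ReLU(),
    nn.Dropout(p=0.2),    # randomly zero 20% of activations during training
    nn.Linear(256, 10),
)
opt = torch.optim.AdamW(model.parameters(), lr=3e-4, weight_decay=1e-2)  # decoupled weight decay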
When you see a persistent train/validation gap but validation is still improving
A gap is not automatically bad. If validation keeps improving, you may accept the gap and continue training or slightly increase capacity. The real problem is when validation stops improving while training continues to improve.
Training Settings as Design Choices (Not Afterthoughts)
Batch size: stability vs generalization vs memory
- Larger batches: more stable gradient estimates and faster throughput on GPUs, but can require learning rate tuning and may generalize differently.
- Smaller batches: noisier gradients, sometimes better generalization, but slower and may destabilize training if too small.
Practical approach: pick the largest batch that comfortably fits memory, then tune learning rate; if generalization is weak, try smaller batches or stronger augmentation/regularization.
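If you change batch size, one widely cited heuristic (originally studied for SGD) is to scale the learning rate proportionally and then re-tune around that value; treat the sketch below as a starting point to verify empirically, not a rule.
# Linear-scaling heuristic for learning rate when batch size changes (verify empirically).
base_lr, base_batch = 3e-4, 128                 # settings known to be stable (illustrative)
new_batch = 512
new_lr = base_lr * (new_batch / base_batch)     # starting point; re-tune from here
print(new_lr)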
Learning rate and schedules: the main dial for progress
If training is unstable (loss spikes, diverges), the learning rate is often too high. If training is painfully slow and plateaus early, it may be too low. Schedules (warmup, cosine decay, step decay) are not decoration; they help you move quickly early and refine later.
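A minimal sketch of linear warmup followed by cosine decay, implemented as a multiplicative schedule with PyTorch's LambdaLR; the warmup length, total steps, and placeholder model are assumptions.
# Warmup + cosine decay as a multiplicative LR schedule (assumes PyTorch).
import math
import torch

def warmup_cosine(step, warmup_steps=500, total_steps=10_000):
    if step < warmup_steps:
        return step / max(1, warmup_steps)                 # linear warmup up to the base LR
    progress = min(1.0, (step - warmup_steps) / max(1, total_steps - warmup_steps))
    return 0.5 * (1.0 + math.cos(math.pi * progress))      # cosine decay toward 0

model = torch.nn.Linear(64, 10)                            # placeholder model
opt = torch.optim.AdamW(model.parameters(), lr=3e-4)
sched = torch.optim.lr_scheduler.LambdaLR(opt, lr_lambda=warmup_cosine)
# call opt.step() and then sched.step() once per optimization step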
Epoch count: stop based on validation, not habit
Set a maximum epoch budget, but select the model based on validation performance. Track the best checkpoint and use early stopping patience if validation stops improving.
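A minimal early-stopping sketch with a patience counter; `train_one_epoch`, `evaluate`, the model, and the data loaders are hypothetical pieces assumed to exist, and the patience value is an illustrative choice.
# Early stopping on the validation metric with patience (helpers below are hypothetical).
import copy

max_epochs, patience = 100, 5                   # illustrative budget and patience
best_metric, best_state, bad_epochs = float("-inf"), None, 0
for epoch in range(max_epochs):
    train_one_epoch(model, train_loader, opt)   # hypothetical helper
    metric = evaluate(model, val_loader)        # hypothetical helper (higher = better)
    if metric > best_metric:
        best_metric, bad_epochs = metric, 0
        best_state = copy.deepcopy(model.state_dict())  # remember the best checkpoint
    else:
        bad_epochs += 1
        if bad_epochs >= patience:              # validation stopped improving
            break
model.load_state_dict(best_state)               # keep the best epoch, not the last one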
Normalization: What It’s Doing Conceptually and When to Use Which
Normalization layers help training by keeping activations in a range where gradients behave well. Conceptually, they reduce sensitivity to scale changes inside the network, making optimization less brittle.
Batch Normalization (BN): good when batch statistics are reliable
- Concept: normalizes activations using mean/variance computed over the batch (and sometimes spatial dimensions), then applies a learned scale and shift.
- When it works well: vision models with reasonably sized batches; it often speeds up training and allows higher learning rates.
- When it can be problematic: very small batch sizes, highly variable batch composition, or when train/test behavior differs due to batch-statistics handling.
Layer Normalization (LN): good when batch size is small or variable
- Concept: normalizes across features within a single example, so it does not depend on batch statistics.
- When it works well: sequence models and transformers; also useful when batch size is constrained by memory.
Practical normalization guidance
- If you are forced into tiny batches and BN becomes unstable, try LN (or alternatives designed for small batches) rather than fighting BN with hacks.
- Normalization choice interacts with learning rate: with effective normalization, you can often use more aggressive learning rates.
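A small sketch of where each normalization sits in an otherwise identical block, assuming PyTorch; BatchNorm1d normalizes over the batch dimension, LayerNorm over the feature dimension of each example.
# BN vs LN in an otherwise identical block (assumes PyTorch; sizes are placeholders).
import torch.nn as nn

bn_block = nn.Sequential(
    nn.Linear(64, 256),
    nn.BatchNorm1d(256),  # statistics computed across the batch; wants reasonably large batches
    nn.ReLU(),
)
ln_block = nn.Sequential(
    nn.Linear(64, 256),
    nn.LayerNorm(256),    # statistics computed per example; independent of batch size
    nn.ReLU(),
)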
Initialization Intuition: Start in a “Reasonable Scale” Regime
Initialization is about keeping signals and gradients from exploding or vanishing at the start of training. A good initialization makes early training predictable: activations have sensible magnitudes, and gradients are neither tiny nor enormous.
What can go wrong with poor initialization
- Exploding activations/gradients: loss becomes NaN or diverges quickly.
- Vanishing gradients: training loss barely moves, especially in deeper networks.
- Dead units: some activations saturate or become inactive and stop contributing.
Practical approach
- Use standard initializations that match your activation functions and layer types (most frameworks default to sensible choices).
- If you change activation types or remove normalization, re-check training stability; what was stable before may no longer be.
- When debugging instability, simplify: reduce depth, reduce learning rate, add normalization, and verify you can overfit a tiny subset.
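If you remove normalization or change activation functions, a quick check of activation scale at initialization catches many problems before a long run. The sketch below re-initializes linear layers for ReLU and prints the activation standard deviation on random input; the model and shapes are placeholders.
# Check activation scale at initialization (assumes PyTorch; model and shapes are placeholders).
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(64, 256), nn.ReLU(), nn.Linear(256, 256), nn.ReLU())
for m in model.modules():
    if isinstance(m, nn.Linear):
        nn.init.kaiming_normal_(m.weight, nonlinearity="relu")  # init matched to ReLU
        nn.init.zeros_(m.bias)

x = torch.randn(512, 64)                 # random input just to probe scales
with torch.no_grad():
    print(model(x).std())                # should be of order 1, not near 0 or huge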
Monitoring Beyond Loss: Metrics That Reveal Failure Modes
Loss is the optimization target, but it can hide problems that matter in practice. Track metrics that match the decision you will make with the model.
Classification
- Accuracy: can be misleading with class imbalance.
- Precision/recall and F1: better when false positives/negatives have different costs.
- AUROC / AUPRC: useful when you care about ranking or thresholds; AUPRC is often more informative with rare positives.
- Calibration metrics: if predicted probabilities are used downstream, monitor calibration (e.g., reliability curves, expected calibration error) rather than only accuracy.
Regression
- MAE vs MSE: MAE is more robust to outliers; MSE penalizes large errors more.
- Percentile errors: track p50/p90/p95 absolute error if tail behavior matters.
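A minimal sketch of computing several of these metrics with scikit-learn and NumPy; `y_true`, `y_prob`, `y_true_reg`, and `y_pred_reg` are placeholder arrays you would replace with real predictions.
# Metrics beyond loss (assumes scikit-learn and NumPy; all arrays are placeholders).
import numpy as np
from sklearn.metrics import f1_score, roc_auc_score, average_precision_score, mean_absolute_error

# Classification: y_true in {0, 1}, y_prob = predicted probability of the positive class.
y_pred = (y_prob >= 0.5).astype(int)                       # the threshold is itself a design choice
print("F1:", f1_score(y_true, y_pred))
print("AUROC:", roc_auc_score(y_true, y_prob))
print("AUPRC:", average_precision_score(y_true, y_prob))

# Regression: y_true_reg and y_pred_reg are continuous arrays.
abs_err = np.abs(y_true_reg - y_pred_reg)
print("MAE:", mean_absolute_error(y_true_reg, y_pred_reg))
print("p90/p95 abs error:", np.percentile(abs_err, [90, 95]))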
Operational metrics
- Inference time and memory: measure on the target hardware, not just in development.
- Throughput: examples per second at the batch size you will actually use.
- Stability across seeds: if results vary widely, your model may be too sensitive for production decisions.
A Practical Tuning Loop You Can Reuse
Step-by-step loop
- 1) Fix constraints and validation protocol: define latency/memory budgets and a trustworthy validation split.
- 2) Train a modest baseline: keep architecture and settings simple; log training/validation curves and key metrics.
- 3) Diagnose from curves: underfitting vs overfitting vs optimization instability.
- 4) Make one change at a time: adjust learning rate/schedule, then capacity (width/depth), then data/augmentation, while keeping notes.
- 5) Re-measure constraints: every capacity increase should be checked against latency/memory budgets.
- 6) Lock the smallest model that meets the metric: if two models tie on validation, prefer the simpler one for robustness and deployment ease.
Example: interpreting a run and choosing the next move
Suppose you train a baseline and observe: training loss keeps decreasing, training accuracy rises to 99%, but validation accuracy peaks early and then declines. This is classic overfitting. Your next move should not be “add more layers.” Instead, work through the following options in order: improve augmentation or data quality, increase regularization pressure, reduce model width, and use early stopping. If validation improves and the gap shrinks, you are moving in the right direction.
Alternatively, if both training and validation accuracy stall at a low value and training loss plateaus, first test whether learning rate/schedule is limiting progress. If optimization is fine and the model still cannot fit the training data well, increase capacity (often width first), or increase input signal (resolution/context) if the task requires it.
Minimal Debug Checklist When Results Don’t Make Sense
- Can the model overfit a tiny subset? If not, suspect bugs, label mismatch, or unstable training settings.
- Are training and validation preprocessing identical? Differences can create artificial generalization gaps.
- Are metrics computed correctly? Check thresholding, averaging, and class mapping.
- Is the validation split appropriate? Leakage or distribution shift can make curves misleading.
- Do normalization and batch size match? BN with tiny batches is a common source of instability.
# Pseudocode: a simple experiment log structure to keep decisions evidence-based
experiment = {
    "constraints": {"latency_ms": 20, "model_mb": 30},
    "data": {"n_train": 50000, "label_noise_est": "medium"},
    "baseline": {"depth": 3, "width": 256, "norm": "batch"},
    "train": {"batch_size": 128, "lr": 3e-4, "schedule": "cosine"},
    "results": {
        "train_loss_curve": "...",
        "val_loss_curve": "...",
        "primary_metric": {"best": 0.87, "epoch": 12},
        "latency_ms_measured": 18,
    },
    "next_change": "reduce width to 192 and add stronger augmentation",
}