A neural network as a function you can tune
The most useful mental model is simple: a neural network is a parameterized function that maps inputs to outputs. You choose the “shape” of the function (the architecture), and learning means adjusting its parameters (weights and biases) so that its outputs match the targets on your data.
We can write this idea as: y_hat = f(x; θ), where x is the input, y_hat is the prediction, and θ collects all learnable parameters (weights and biases). Training is the process of finding parameter values that make y_hat good according to a loss function.
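To make this concrete in code, here is a minimal sketch of a parameterized function. The linear form and the names w and b are illustrative choices, not the only possibility:

import numpy as np

def f(x, theta):
    # Same code, different behavior depending on theta (the learnable parameters).
    w, b = theta
    return w @ x + b          # simplest possible case: a weighted sum plus an offset

x = np.array([1.0, 2.0, 3.0, 4.0])                # an input
theta = (np.array([0.5, -1.0, 0.0, 2.0]), 3.0)    # one particular parameter setting
y_hat = f(x, theta)                               # prediction for this x under this theta

Training searches over theta; the code for f never changes.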
Start with a concrete mapping task
Imagine a practical prediction problem: you want to estimate the selling price of an apartment from a few features.
- Inputs (features): x = [square_meters, bedrooms, distance_to_center, age]
- Output (target): y = price
- Goal: learn a mapping from features to price that generalizes to new apartments.
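As a concrete sketch, one training example might look like this (the values and units are invented for illustration):

# One training example; values and units invented for illustration.
x = [72.0, 2, 4.5, 15]    # square_meters, bedrooms, distance_to_center (km), age (years)
y = 310_000               # observed selling price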
A neural network is one way to represent a flexible mapping. It does not “understand apartments”; it learns statistical regularities that connect the input features to the target values in your dataset.
What must be learned: parameters, not rules
Before training, the network’s parameters are typically initialized to small random values. At that point, the network produces essentially arbitrary predictions. Learning is not about discovering explicit human-readable rules; it is about finding parameter values that make the function output the right numbers for the right inputs.
Even a small network has many parameters. For example, a simple two-layer network might look like:
h = σ(W1 x + b1)     # hidden representation (vector)
y_hat = W2 h + b2    # predicted price

W1, b1, W2, b2 are learnable parameters. σ is a non-linear activation (e.g., ReLU), which allows the model to represent non-linear relationships.
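For readers who want to run this, here is a NumPy sketch of the same two-layer network. The layer sizes, the small random initialization, and the choice of ReLU for σ are illustrative assumptions:

import numpy as np

rng = np.random.default_rng(0)

n_in, n_hidden = 4, 8      # 4 features in, 8 hidden units (an arbitrary choice)
W1 = rng.normal(0.0, 0.1, size=(n_hidden, n_in))   # small random initialization
b1 = np.zeros(n_hidden)
W2 = rng.normal(0.0, 0.1, size=(1, n_hidden))
b2 = np.zeros(1)
# Already 4*8 + 8 + 8*1 + 1 = 49 learnable parameters for this tiny network.

def relu(z):
    return np.maximum(0.0, z)          # the non-linearity σ

def predict(x):
    h = relu(W1 @ x + b1)              # hidden representation (vector)
    return W2 @ h + b2                 # predicted price

x = np.array([72.0, 2.0, 4.5, 15.0])
print(predict(x))    # essentially arbitrary until the parameters are trained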
Loss: how we measure “wrong”
To improve predictions, we need a numeric score that tells us how bad a prediction is. That score is the loss. For price prediction, a common choice is mean squared error (MSE):
loss(y_hat, y) = (y_hat - y)^2

Over a dataset, we average this across examples. The loss is the objective we optimize: training tries to find parameters that minimize the loss on the training data.
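In code, averaging the squared error over a dataset is a one-liner; the example values below are invented:

import numpy as np

def mse(y_hat, y):
    # Mean squared error: average of (y_hat - y)^2 over all examples.
    return np.mean((y_hat - y) ** 2)

y = np.array([300_000.0, 450_000.0])       # true prices (invented)
y_hat = np.array([320_000.0, 400_000.0])   # model predictions
print(mse(y_hat, y))                       # 1.45e9 here; large because prices are large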
Key point: the loss is not an afterthought. It defines what “good” means. If you choose a loss that does not match your real goal, the model will faithfully optimize the wrong thing.
The training loop: predict → compare → update
At a conceptual level, training repeats the same loop many times. Each repetition nudges the parameters in a direction that reduces the loss.
Step-by-step loop on one mini-batch
- 1) Predict: feed inputs through the network to get predictions: y_hat = f(x; θ)
- 2) Compare: compute the loss between predictions and targets: L = loss(y_hat, y)
- 3) Update: adjust parameters to reduce the loss, typically using gradient-based optimization: θ ← θ - η ∇θ L, where η is the learning rate.
In practice, we do this on mini-batches (small subsets of data) rather than one example at a time, because it is more efficient and the gradient estimate is usually good enough.
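Here is a minimal sketch of this loop for a linear model, where the gradient of the MSE loss can be written out by hand. The synthetic data, batch size, learning rate, and step count are all illustrative assumptions:

import numpy as np

rng = np.random.default_rng(0)

# Synthetic data: y depends linearly on x, plus noise (for illustration only).
X = rng.normal(size=(256, 4))
true_w = np.array([3.0, -2.0, 0.5, 1.0])
y = X @ true_w + 0.1 * rng.normal(size=256)

w = np.zeros(4)      # θ: parameters start uninformative
b = 0.0
eta = 0.1            # η: learning rate
batch_size = 32

for step in range(200):
    idx = rng.integers(0, len(X), batch_size)   # sample a mini-batch
    xb, yb = X[idx], y[idx]

    y_hat = xb @ w + b                      # 1) Predict
    err = y_hat - yb
    loss = np.mean(err ** 2)                # 2) Compare (MSE)

    grad_w = 2 * xb.T @ err / batch_size    # 3) Update: ∇θ L for MSE,
    grad_b = 2 * np.mean(err)               #    written by hand for this model
    w -= eta * grad_w
    b -= eta * grad_b

print(w)   # close to true_w = [3.0, -2.0, 0.5, 1.0] once the loss is small

For real networks the loop is the same in spirit; deep learning frameworks compute ∇θ L automatically via backpropagation instead of by hand.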
What “update” really means (without mysticism)
The update step uses gradients: numbers that tell you how sensitive the loss is to each parameter. If increasing a weight slightly would increase the loss, that weight's gradient is positive; the optimizer moves the weight in the opposite direction of its gradient to reduce the loss.
You do not need to view this as the network “learning concepts.” It is closer to tuning a very large set of knobs so that the function’s outputs match the training targets.
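A finite-difference check makes this knob-tuning picture concrete: nudge a parameter, measure how the loss changes, and step the other way. The toy loss below is invented for illustration:

def loss_of(w):
    # A toy loss depending on a single knob w: predict w * 2.0, target 10.0.
    return (w * 2.0 - 10.0) ** 2

w = 1.0
eps = 1e-6
grad = (loss_of(w + eps) - loss_of(w - eps)) / (2 * eps)   # numerical dL/dw
print(grad)             # ≈ -32.0: raising w would lower the loss here
w = w - 0.01 * grad     # so the update raises w, and loss_of(w) decreases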
Separating three things: capacity, data, objective
Many misunderstandings come from mixing up what the model can represent, what the data contains, and what the training process rewards. Keep these separate.
1) Model capacity: what functions are even possible
Capacity is the range of mappings your architecture can represent. More layers/units usually means higher capacity: the model can fit more complex patterns.
- Too little capacity: the model cannot fit the training data well (underfitting). Even with perfect training, loss stays high because the function family is too limited.
- Too much capacity: the model can fit the training data extremely well, including noise (overfitting), unless you control it with enough data, regularization, or appropriate training choices.
Capacity is not “intelligence.” It is flexibility of the function class.
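Capacity is easiest to see outside neural networks: with polynomial fits, the degree plays the role of capacity. A sketch with invented noisy data:

import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(0.0, 1.0, 20)
y = np.sin(2 * np.pi * x) + 0.2 * rng.normal(size=20)   # wave plus noise

for degree in (1, 3, 9):                  # low, moderate, high capacity
    coeffs = np.polyfit(x, y, degree)     # fit a polynomial of this degree
    train_mse = np.mean((np.polyval(coeffs, x) - y) ** 2)
    print(degree, round(float(train_mse), 4))
# Degree 1 cannot follow the wave (underfitting); degree 9 chases the noise
# and drives training error down without generalizing better (overfitting).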
2) Data: what information is available to learn from
The model can only learn patterns that are present in the inputs and reflected in the targets. If an important factor is missing from the features, the network cannot reliably infer it.
Example: if apartment price strongly depends on “view quality” but your features do not include anything correlated with view quality, the model cannot consistently learn that effect. It may still produce a number, but it will be guessing based on whatever weak proxies exist.
- Signal: stable relationships that generalize (e.g., larger apartments often cost more).
- Noise: idiosyncrasies that do not generalize (e.g., a one-off renovation not captured by features).
- Bias in data: if your dataset covers only one neighborhood type, the model may fail elsewhere because it never saw those conditions.
3) Objective: what the training process is incentivized to do
The objective is the loss (often plus regularization terms). It determines the behavior the optimizer will produce.
- If you optimize MSE, the model is rewarded for being close on average, and it tends to predict conditional means (see the sketch after this list).
- If you optimize a classification loss, the model is rewarded for separating classes, not for producing calibrated probabilities unless the loss and evaluation encourage it.
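The first point is easy to verify numerically: among all constant predictions, the one that minimizes MSE is the mean of the targets. A sketch with invented targets:

import numpy as np

y = np.array([1.0, 2.0, 9.0])              # targets observed for similar inputs
candidates = np.linspace(0.0, 10.0, 1001)  # constant predictions to try
losses = [np.mean((c - y) ** 2) for c in candidates]
best = candidates[np.argmin(losses)]
print(best, y.mean())                      # both 4.0: MSE is minimized at the mean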
When results are surprising, ask: did the objective actually match the outcome you wanted?
What deep learning can and cannot infer from data
What it can do (when conditions are right)
- Approximate complex mappings: if there is a learnable relationship from inputs to outputs, a network with enough capacity can approximate it.
- Exploit non-linear interactions: it can combine features in ways that simple linear models cannot (e.g., the effect of size might depend on distance to center).
- Improve with more relevant data: performance often scales with more high-quality, representative examples.
What it cannot do (even with a big model)
- Create information that is not in the inputs: if the target depends on unobserved variables not correlated with your features, the model cannot recover them reliably.
- Guarantee causality: learning correlations from observational data does not automatically reveal what would happen under interventions (e.g., “if we renovate, price will increase by X”).
- Generalize outside the data regime: if test cases differ substantially from training cases (distribution shift), performance can degrade sharply.
- Fix a mismatched objective: if the loss rewards the wrong behavior, training will optimize that behavior, even if it conflicts with your real-world goal.
Putting the mental model into a checklist
When you approach any deep learning problem, you can ground yourself with a few concrete questions:
- Inputs: What exactly is x? What information is present, and what is missing?
- Outputs: What is y? Is it well-defined and consistently measured?
- Function family: What architecture defines f(x; θ)? Is capacity too small or unnecessarily large?
- Loss/objective: What does the loss reward? Does it align with the real metric you care about?
- Training loop: Are you iterating predict → compare → update in a stable way (learning rate, batch size, optimization choices)?
- Generalization: Does the data represent the situations you will face at deployment?