A neural network as a function you can tune
The most useful mental model is simple: a neural network is a parameterized function that maps inputs to outputs. You choose the “shape” of the function (the architecture), and learning means adjusting its parameters (weights and biases) so that its outputs match the targets on your data.
We can write this idea as: y_hat = f(x; θ), where x is the input, y_hat is the prediction, and θ collects all learnable parameters (weights and biases). Training is the process of finding parameter values that make y_hat good according to a loss function.
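To make this concrete in code, here is a minimal sketch of a parameterized function. The linear form and the names w and b are illustrative choices, not the only possibility:

import numpy as np

def f(x, theta):
    # Same code, different behavior depending on theta (the learnable parameters).
    w, b = theta
    return w @ x + b          # simplest possible case: a weighted sum plus an offset

x = np.array([1.0, 2.0, 3.0, 4.0])                # an input
theta = (np.array([0.5, -1.0, 0.0, 2.0]), 3.0)    # one particular parameter setting
y_hat = f(x, theta)                               # prediction for this x under this theta

Training searches over theta; the code for f never changes.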
Start with a concrete mapping task
Imagine a practical prediction problem: you want to estimate the selling price of an apartment from a few features.
- Inputs (features): x = [square_meters, bedrooms, distance_to_center, age]
- Output (target): y = price
- Goal: learn a mapping from features to price that generalizes to new apartments.
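As a concrete sketch, one training example might look like this (the values and units are invented for illustration):

# One training example; values and units invented for illustration.
x = [72.0, 2, 4.5, 15]    # square_meters, bedrooms, distance_to_center (km), age (years)
y = 310_000               # observed selling price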
A neural network is one way to represent a flexible mapping. It does not “understand apartments”; it learns statistical regularities that connect the input features to the target values in your dataset.
What must be learned: parameters, not rules
Before training, the network’s parameters are typically initialized to small random values. At that point, the network produces essentially arbitrary predictions. Learning is not about discovering explicit human-readable rules; it is about finding parameter values that make the function output the right numbers for the right inputs.
Even a small network has many parameters. For example, a simple two-layer network might look like:
h = σ(W1 x + b1)     # hidden representation (vector)
y_hat = W2 h + b2    # predicted price

W1, b1, W2, b2 are learnable parameters. σ is a non-linear activation (e.g., ReLU), which allows the model to represent non-linear relationships.
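For readers who want to run this, here is a NumPy sketch of the same two-layer network. The layer sizes, the small random initialization, and the choice of ReLU for σ are illustrative assumptions:

import numpy as np

rng = np.random.default_rng(0)

n_in, n_hidden = 4, 8      # 4 features in, 8 hidden units (an arbitrary choice)
W1 = rng.normal(0.0, 0.1, size=(n_hidden, n_in))   # small random initialization
b1 = np.zeros(n_hidden)
W2 = rng.normal(0.0, 0.1, size=(1, n_hidden))
b2 = np.zeros(1)
# Already 4*8 + 8 + 8*1 + 1 = 49 learnable parameters for this tiny network.

def relu(z):
    return np.maximum(0.0, z)          # the non-linearity σ

def predict(x):
    h = relu(W1 @ x + b1)              # hidden representation (vector)
    return W2 @ h + b2                 # predicted price

x = np.array([72.0, 2.0, 4.5, 15.0])
print(predict(x))    # essentially arbitrary until the parameters are trained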
Loss: how we measure “wrong”
To improve predictions, we need a numeric score that tells us how bad a prediction is. That score is the loss. For price prediction, a common choice is mean squared error (MSE):
loss(y_hat, y) = (y_hat - y)^2

Over a dataset, we average this across examples. The loss is the objective we optimize: training tries to find parameters that minimize the loss on the training data.
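In code, averaging the squared error over a dataset is a one-liner; the example values below are invented:

import numpy as np

def mse(y_hat, y):
    # Mean squared error: average of (y_hat - y)^2 over all examples.
    return np.mean((y_hat - y) ** 2)

y = np.array([300_000.0, 450_000.0])       # true prices (invented)
y_hat = np.array([320_000.0, 400_000.0])   # model predictions
print(mse(y_hat, y))                       # 1.45e9 here; large because prices are large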
Key point: the loss is not an afterthought. It defines what “good” means. If you choose a loss that does not match your real goal, the model will faithfully optimize the wrong thing.
The training loop: predict → compare → update
At a conceptual level, training repeats the same loop many times. Each repetition nudges the parameters in a direction that reduces the loss.
Step-by-step loop on one mini-batch
- 1) Predict: feed inputs through the network to get predictions: y_hat = f(x; θ)
- 2) Compare: compute the loss between predictions and targets: L = loss(y_hat, y)
- 3) Update: adjust parameters to reduce the loss, typically using gradient-based optimization: θ ← θ - η ∇θ L, where η is the learning rate.
In practice, we do this on mini-batches (small subsets of data) rather than one example at a time, because it is more efficient and the gradient estimate is usually good enough.
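Here is a minimal sketch of this loop for a linear model, where the gradient of the MSE loss can be written out by hand. The synthetic data, batch size, learning rate, and step count are all illustrative assumptions:

import numpy as np

rng = np.random.default_rng(0)

# Synthetic data: y depends linearly on x, plus noise (for illustration only).
X = rng.normal(size=(256, 4))
true_w = np.array([3.0, -2.0, 0.5, 1.0])
y = X @ true_w + 0.1 * rng.normal(size=256)

w = np.zeros(4)      # θ: parameters start uninformative
b = 0.0
eta = 0.1            # η: learning rate
batch_size = 32

for step in range(200):
    idx = rng.integers(0, len(X), batch_size)   # sample a mini-batch
    xb, yb = X[idx], y[idx]

    y_hat = xb @ w + b                      # 1) Predict
    err = y_hat - yb
    loss = np.mean(err ** 2)                # 2) Compare (MSE)

    grad_w = 2 * xb.T @ err / batch_size    # 3) Update: ∇θ L for MSE,
    grad_b = 2 * np.mean(err)               #    written by hand for this model
    w -= eta * grad_w
    b -= eta * grad_b

print(w)   # close to true_w = [3.0, -2.0, 0.5, 1.0] once the loss is small

For real networks the loop is the same in spirit; deep learning frameworks compute ∇θ L automatically via backpropagation instead of by hand.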
What “update” really means (without mysticism)
The update step uses gradients: numbers that tell you how sensitive the loss is to each parameter. If increasing a weight slightly would increase the loss, that weight's gradient is positive; the optimizer moves the weight in the opposite direction of its gradient to reduce the loss.
You do not need to view this as the network “learning concepts.” It is closer to tuning a very large set of knobs so that the function’s outputs match the training targets.
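A finite-difference check makes this knob-tuning picture concrete: nudge a parameter, measure how the loss changes, and step the other way. The toy loss below is invented for illustration:

def loss_of(w):
    # A toy loss depending on a single knob w: predict w * 2.0, target 10.0.
    return (w * 2.0 - 10.0) ** 2

w = 1.0
eps = 1e-6
grad = (loss_of(w + eps) - loss_of(w - eps)) / (2 * eps)   # numerical dL/dw
print(grad)             # ≈ -32.0: raising w would lower the loss here
w = w - 0.01 * grad     # so the update raises w, and loss_of(w) decreases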
Separating three things: capacity, data, objective
Many misunderstandings come from mixing up what the model can represent, what the data contains, and what the training process rewards. Keep these separate.
1) Model capacity: what functions are even possible
Capacity is the range of mappings your architecture can represent. More layers/units usually means higher capacity: the model can fit more complex patterns.
- Too little capacity: the model cannot fit the training data well (underfitting). Even with perfect training, loss stays high because the function family is too limited.
- Too much capacity: the model can fit the training data extremely well, including noise (overfitting), unless you control it with enough data, regularization, or appropriate training choices.
Capacity is not “intelligence.” It is flexibility of the function class.
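Capacity is easiest to see outside neural networks: with polynomial fits, the degree plays the role of capacity. A sketch with invented noisy data:

import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(0.0, 1.0, 20)
y = np.sin(2 * np.pi * x) + 0.2 * rng.normal(size=20)   # wave plus noise

for degree in (1, 3, 9):                  # low, moderate, high capacity
    coeffs = np.polyfit(x, y, degree)     # fit a polynomial of this degree
    train_mse = np.mean((np.polyval(coeffs, x) - y) ** 2)
    print(degree, round(float(train_mse), 4))
# Degree 1 cannot follow the wave (underfitting); degree 9 chases the noise
# and drives training error down without generalizing better (overfitting).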
2) Data: what information is available to learn from
The model can only learn patterns that are present in the inputs and reflected in the targets. If an important factor is missing from the features, the network cannot reliably infer it.
Example: if apartment price strongly depends on “view quality” but your features do not include anything correlated with view quality, the model cannot consistently learn that effect. It may still produce a number, but it will be guessing based on whatever weak proxies exist.
- Signal: stable relationships that generalize (e.g., larger apartments often cost more).
- Noise: idiosyncrasies that do not generalize (e.g., a one-off renovation not captured by features).
- Bias in data: if your dataset covers only one neighborhood type, the model may fail elsewhere because it never saw those conditions.
3) Objective: what the training process is incentivized to do
The objective is the loss (often plus regularization terms). It determines the behavior the optimizer will produce.
- If you optimize MSE, the model is rewarded for being close on average, and it tends to predict conditional means (see the sketch after this list).
- If you optimize a classification loss, the model is rewarded for separating classes, not for producing calibrated probabilities unless the loss and evaluation encourage it.
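The first point is easy to verify numerically: among all constant predictions, the one that minimizes MSE is the mean of the targets. A sketch with invented targets:

import numpy as np

y = np.array([1.0, 2.0, 9.0])              # targets observed for similar inputs
candidates = np.linspace(0.0, 10.0, 1001)  # constant predictions to try
losses = [np.mean((c - y) ** 2) for c in candidates]
best = candidates[np.argmin(losses)]
print(best, y.mean())                      # both 4.0: MSE is minimized at the mean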
When results are surprising, ask: did the objective actually match the outcome you wanted?
What deep learning can and cannot infer from data
What it can do (when conditions are right)
- Approximate complex mappings: if there is a learnable relationship from inputs to outputs, a network with enough capacity can approximate it.
- Exploit non-linear interactions: it can combine features in ways that simple linear models cannot (e.g., the effect of size might depend on distance to center).
- Improve with more relevant data: performance often scales with more high-quality, representative examples.
What it cannot do (even with a big model)
- Create information that is not in the inputs: if the target depends on unobserved variables not correlated with your features, the model cannot recover them reliably.
- Guarantee causality: learning correlations from observational data does not automatically reveal what would happen under interventions (e.g., “if we renovate, price will increase by X”).
- Generalize outside the data regime: if test cases differ substantially from training cases (distribution shift), performance can degrade sharply.
- Fix a mismatched objective: if the loss rewards the wrong behavior, training will optimize that behavior, even if it conflicts with your real-world goal.
Putting the mental model into a checklist
When you approach any deep learning problem, you can ground yourself with a few concrete questions:
- Inputs: What exactly is x? What information is present, and what is missing?
- Outputs: What is y? Is it well-defined and consistently measured?
- Function family: What architecture defines f(x; θ)? Is capacity too small or unnecessarily large?
- Loss/objective: What does the loss reward? Does it align with the real metric you care about?
- Training loop: Are you iterating predict → compare → update in a stable way (learning rate, batch size, optimization choices)?
- Generalization: Does the data represent the situations you will face at deployment?