Deep Learning Foundations Without the Hype: Neural Networks Explained Clearly


Loss Functions and Learning Signals: Measuring What “Good” Means

Chapter 5

Estimated reading time: 7 minutes

+ Exercise

Loss as the Single Number That Drives Learning

A model produces predictions, but learning needs a single scalar objective to tell the optimizer whether a change in parameters made things better or worse. A loss function takes the model’s output (and the target) and returns one number: lower is better. Training is the repeated process of adjusting parameters to reduce the average loss over data.

Two practical points keep the idea grounded:

  • The loss is not “truth”; it is a measurement choice for what you mean by “good.” Different tasks (and different error costs) imply different losses.
  • Training typically minimizes expected loss: performance averaged over the data distribution you care about, not perfect recall of the training set.

Mapping Common Tasks to Common Losses

Regression: Predicting a Real Number

In regression, targets are continuous values (e.g., house price, temperature). A common choice is mean squared error (MSE):

MSE = (1/N) * Σ (y_i - ŷ_i)^2

Intuition: squaring makes large errors count much more than small errors. This is useful when large mistakes are especially costly, but it can also make training sensitive to outliers.

Another common option is mean absolute error (MAE):

MAE = (1/N) * Σ |y_i - ŷ_i|

MAE penalizes errors linearly and is more robust to outliers, but its gradient behavior can be less smooth around zero error.
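The two regression losses above can be sketched in a few lines of plain Python (the helper names are illustrative, not from any particular library):

```python
def mse(y_true, y_pred):
    # Mean squared error: squaring amplifies large errors.
    return sum((y - p) ** 2 for y, p in zip(y_true, y_pred)) / len(y_true)

def mae(y_true, y_pred):
    # Mean absolute error: errors count linearly, so outliers dominate less.
    return sum(abs(y - p) for y, p in zip(y_true, y_pred)) / len(y_true)

y_true = [3.0, 5.0, 2.0]
y_pred = [2.5, 7.0, 2.0]   # one moderate miss (2.0) and one small miss (0.5)

print(mse(y_true, y_pred))  # (0.25 + 4.0 + 0.0) / 3 ≈ 1.4167
print(mae(y_true, y_pred))  # (0.5 + 2.0 + 0.0) / 3 ≈ 0.8333
```

Note how the single miss of 2.0 contributes most of the MSE but only a proportional share of the MAE.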

Binary Classification: Two Classes (0/1)

Binary classification predicts the probability of class 1. The standard loss is binary cross-entropy (also called log loss):

For a single example (y in {0,1}, p = predicted P(y=1)):
Loss = -[ y * log(p) + (1 - y) * log(1 - p) ]

Why this loss? It strongly penalizes confident wrong predictions. If the true label is 1 and the model predicts p=0.01, the loss is large; if it predicts p=0.99, the loss is small. This matches the idea that probabilities should be calibrated: being confidently wrong should hurt more than being uncertain.
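The asymmetry between confident-right and confident-wrong predictions is easy to check numerically; this is a direct transcription of the formula above, not a library function:

```python
import math

def binary_cross_entropy(y, p):
    # y in {0, 1}; p = predicted P(y = 1)
    return -(y * math.log(p) + (1 - y) * math.log(1 - p))

# True label is 1:
print(binary_cross_entropy(1, 0.99))  # ≈ 0.0101 (confident and right: small loss)
print(binary_cross_entropy(1, 0.01))  # ≈ 4.61   (confident and wrong: large loss)
```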

Multiclass Classification: One of K Classes

For K classes, the model outputs a probability distribution over classes. The standard loss is categorical cross-entropy:

For one example with one-hot target y and predicted probabilities p:
Loss = - Σ_k y_k * log(p_k)

Because y is one-hot, this reduces to -log(p_true_class). Again, the loss is small when the model assigns high probability to the correct class, and large when it assigns low probability.
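The one-hot reduction can be verified directly (again a bare sketch of the formula, not a framework call):

```python
import math

def categorical_cross_entropy(y_onehot, p):
    # With a one-hot target, the sum reduces to -log(p) at the true class.
    return -sum(y * math.log(pk) for y, pk in zip(y_onehot, p) if y > 0)

y = [0, 1, 0]         # true class is index 1
p = [0.1, 0.7, 0.2]   # predicted distribution
print(categorical_cross_entropy(y, p))  # -log(0.7) ≈ 0.357
```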

How Loss Interacts with Output Activations

The last layer’s activation and the loss must agree on what the outputs mean. A mismatch can make training unstable or produce outputs that are hard to interpret.

Regression Outputs: Often Linear (No Squashing)

If you want to predict any real value (positive or negative), the final layer is often linear (no bounding activation). Then MSE or MAE compares the raw prediction ŷ to y directly.

If the target has known constraints, you may choose an activation that enforces them:

  • Predicting a value in (0, 1): use a sigmoid output and a loss appropriate for probabilities (often cross-entropy if it is truly a probability).
  • Predicting a strictly positive value: sometimes use an exponential/softplus output and a loss defined on that scale.

Binary Classification: Sigmoid + Binary Cross-Entropy

Binary classification typically uses a sigmoid output to map a real-valued logit z to a probability p:

p = sigmoid(z) = 1 / (1 + exp(-z))

Then binary cross-entropy measures how well p matches the target y.

Implementation detail that matters in practice: many libraries provide a combined function (often called something like “BCE with logits”) that takes logits directly and computes the loss in a numerically stable way. Conceptually it is still “sigmoid + cross-entropy,” but computed safely to avoid overflow/underflow when z is large in magnitude.
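A minimal sketch of why the combined form is safer, using the standard stable rewrite max(z, 0) - z*y + log(1 + exp(-|z|)); the function names are illustrative, not any specific library's API:

```python
import math

def bce_with_logits(z, y):
    # Numerically stable sigmoid + binary cross-entropy:
    # exp() is only ever applied to a non-positive number, so it cannot overflow.
    return max(z, 0) - z * y + math.log1p(math.exp(-abs(z)))

def naive_bce(z, y):
    p = 1 / (1 + math.exp(-z))   # sigmoid first...
    # ...then cross-entropy; log(1 - p) fails when p rounds to exactly 1.0
    return -(y * math.log(p) + (1 - y) * math.log(1 - p))

print(bce_with_logits(2.0, 1))    # ≈ 0.127, matches the naive version
print(bce_with_logits(800.0, 0))  # 800.0; naive_bce(800.0, 0) raises a math domain error
```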

Multiclass: Softmax + Categorical Cross-Entropy

For K classes, the model produces logits z (one per class). Softmax converts logits into probabilities:

p_k = exp(z_k) / Σ_j exp(z_j)

Categorical cross-entropy then penalizes low probability on the true class. Softmax and cross-entropy are commonly paired because:

  • Softmax ensures outputs form a valid probability distribution (nonnegative, sums to 1).
  • Cross-entropy directly rewards assigning probability mass to the correct class.
  • The combined gradient signal is well-behaved: increasing the true class logit decreases loss; increasing other logits increases loss.

As with the binary case, many libraries implement a single “softmax cross-entropy with logits” for numerical stability.
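A compact sketch of both ideas, using the usual max-subtraction trick for stability (helper names are illustrative):

```python
import math

def softmax(z):
    # Subtracting max(z) leaves the result unchanged but prevents exp() overflow.
    m = max(z)
    exps = [math.exp(zk - m) for zk in z]
    total = sum(exps)
    return [e / total for e in exps]

def softmax_cross_entropy(z, true_class):
    # Equivalent to -log(softmax(z)[true_class]), computed without forming
    # tiny intermediate probabilities: log-sum-exp(z) - z[true_class].
    m = max(z)
    log_sum_exp = m + math.log(sum(math.exp(zk - m) for zk in z))
    return log_sum_exp - z[true_class]

z = [2.0, 1.0, 0.1]
print(softmax(z))                   # ≈ [0.659, 0.242, 0.099], sums to 1
print(softmax_cross_entropy(z, 0))  # ≈ 0.417: small, since class 0 has the largest logit
```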

Step-by-Step: Choosing a Loss for a Task

Step 1: Write Down What the Output Should Represent

  • If the output is a real number: regression.
  • If the output is a probability of one class vs another: binary classification.
  • If the output is a probability distribution over K classes: multiclass classification.

Step 2: Choose the Output Activation That Matches That Meaning

  • Regression (unbounded): linear output.
  • Binary probability: sigmoid output (or logits + stable BCE-with-logits).
  • Multiclass probabilities: softmax output (or logits + stable softmax-cross-entropy).

Step 3: Choose the Loss That Measures Error in That Space

  • Regression: MSE (penalize large errors strongly) or MAE (more robust to outliers).
  • Binary classification: binary cross-entropy.
  • Multiclass classification: categorical cross-entropy.

Step 4: Check the Loss Against Real-World Costs

The “standard” loss is a starting point, not a law. Ask what mistakes cost you:

  • If outliers are common and you do not want them to dominate: MAE (or a robust alternative) may be better than MSE.
  • If false negatives are much worse than false positives (or vice versa): you may need class weighting or a different decision threshold, even if the loss stays cross-entropy.
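One common way to encode asymmetric costs while keeping cross-entropy is to weight its two terms; the weights below (5:1) are purely illustrative assumptions, not recommended values:

```python
import math

def weighted_bce(y, p, w_pos=5.0, w_neg=1.0):
    # w_pos > w_neg makes missing a positive (a false negative) cost more
    # than the mirror-image mistake on a negative example.
    return -(w_pos * y * math.log(p) + w_neg * (1 - y) * math.log(1 - p))

print(weighted_bce(1, 0.1))  # ≈ 11.51: positive predicted at 0.1, heavily penalized
print(weighted_bce(0, 0.9))  # ≈ 2.30:  same-sized error on a negative, penalized 5x less
```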

Learning Signal: How Loss Produces Useful Gradients

The optimizer updates parameters using gradients of the loss. A good loss provides a learning signal that is:

  • Directional: it indicates which changes reduce error.
  • Sensitive: it reacts more when the model is confidently wrong than when it is slightly off.
  • Aligned with the task: reducing loss should usually improve the metric you care about (accuracy, RMSE, etc.), even if they are not identical.

Cross-entropy is popular for classification because it creates strong gradients when the model assigns low probability to the correct class, which helps correct confident mistakes efficiently.
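That gradient behavior can be made concrete: for sigmoid + cross-entropy, the derivative of the loss with respect to the logit simplifies to p - y, a standard identity sketched here:

```python
import math

def sigmoid(z):
    return 1 / (1 + math.exp(-z))

def grad_wrt_logit(z, y):
    # d(loss)/dz for sigmoid + binary cross-entropy simplifies to p - y:
    # large in magnitude when confidently wrong, near zero when confidently right.
    return sigmoid(z) - y

print(grad_wrt_logit(-4.0, 1))  # ≈ -0.98: confidently wrong, strong push upward
print(grad_wrt_logit(4.0, 1))   # ≈ -0.02: confidently right, tiny adjustment
```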

Optimizing Expected Performance (Not Memorizing Examples)

Training data is a sample from a broader process (the environment). The goal is not to drive the loss to zero on the training set at any cost; the goal is to minimize expected loss on new samples from the same distribution.

Practically, training minimizes an empirical average over the N training examples, where f is the model:

Empirical risk (training objective) = (1/N) * Σ Loss(f(x_i), y_i)

This is used as an estimate of the expected loss:

Expected risk (what we actually care about) = E_{(x,y)~data} [ Loss(f(x), y) ]

The gap between these two is where generalization lives. A model can reduce training loss by fitting quirks of the sample (memorization), but that does not necessarily reduce expected loss. This is why you evaluate on held-out data and why you care about whether the loss choice matches the real objective.
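A toy simulation makes the estimate concrete. Here the "environment" is an assumed data generator (y = 2x plus Gaussian noise), and the model is fixed at ŷ = 2x, so the expected squared loss is exactly the noise variance, 0.25:

```python
import random
random.seed(0)

def sample(n):
    # The environment: y = 2x + noise with standard deviation 0.5.
    data = []
    for _ in range(n):
        x = random.uniform(-1, 1)
        y = 2 * x + random.gauss(0, 0.5)
        data.append((x, y))
    return data

def empirical_risk(data):
    # (1/N) * Σ squared error for the fixed model f(x) = 2x.
    return sum((y - 2 * x) ** 2 for x, y in data) / len(data)

print(empirical_risk(sample(10)))      # a noisy estimate of the expected loss
print(empirical_risk(sample(100000)))  # converges toward the true value, 0.25
```

Small samples give noisy estimates of the expected risk; large ones converge toward it, which is exactly why held-out evaluation needs enough data to be trustworthy.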

Practical Examples: What the Loss Is “Asking” the Model to Do

Regression with MSE: “Be Close, and Avoid Big Misses”

If one prediction is off by 10 and another is off by 1, MSE treats the first as 100 times worse. The learning signal pushes hard to fix large errors, sometimes at the expense of many small ones.

Binary Cross-Entropy: “Put High Probability on the True Label”

Suppose y=1:

  • If p=0.9, loss is small: -log(0.9).
  • If p=0.1, loss is large: -log(0.1).

This encourages not just correct classification, but confident, well-calibrated probabilities.

Softmax Cross-Entropy: “Make the True Class Logit Win”

Because softmax compares logits relative to each other, improving the true class probability can happen by increasing its logit, decreasing competing logits, or both. The loss provides a clean scalar objective that turns “pick the right class” into a smooth optimization problem.
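The relative nature of softmax is easy to demonstrate: raising the true class logit by 1 and lowering every competitor by 1 yield exactly the same probability (a small sketch, reusing the standard softmax definition):

```python
import math

def softmax(z):
    m = max(z)  # subtract the max for numerical stability
    exps = [math.exp(zk - m) for zk in z]
    total = sum(exps)
    return [e / total for e in exps]

base = [1.0, 1.0, 1.0]
raised = [2.0, 1.0, 1.0]    # raise the true class logit (index 0)
lowered = [1.0, 0.0, 0.0]   # or lower the competitors instead

print(softmax(base)[0])     # ≈ 0.333
print(softmax(raised)[0])   # ≈ 0.576
print(softmax(lowered)[0])  # ≈ 0.576: same probability via a different route
```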

Now answer the exercise about the content:

Why is cross-entropy commonly used for classification tasks instead of a loss that only counts whether the predicted class is correct?

Answer: Cross-entropy measures how much probability the model assigns to the true class. It gives large loss (and strong gradients) when the model is confidently wrong, encouraging better-calibrated probabilities and efficient correction of mistakes.

Next chapter

Backpropagation Intuition: How Neural Networks Learn Without Magic
