What “Supervised Learning” Means
Supervised learning is a way to train an AI model using examples where the correct answer is already provided. Each training example includes an input (what the model sees) and a target output (what the model should produce). The model’s job is to learn a mapping from inputs to outputs so that, later, it can predict the output for new inputs it has never seen.
Think of supervised learning as “learning with an answer key.” During training, the model makes a guess, compares it to the known correct answer, and adjusts itself to reduce future mistakes. Over many examples, it learns patterns that connect inputs to outputs. The key idea is not memorizing individual examples, but learning general rules that work on new data.
Two Main Types: Classification and Regression
Classification (predicting a category)
In classification, the output is a discrete label, such as “spam” vs. “not spam,” “fraud” vs. “legit,” or “cat” vs. “dog.” The model often produces a probability for each class and then chooses the most likely class.
- Email filtering: Input: email text and metadata. Output: spam/not spam.
- Medical triage support: Input: symptoms and measurements. Output: risk category (low/medium/high).
- Customer support routing: Input: ticket text. Output: department label (billing/technical/sales).
Regression (predicting a number)
In regression, the output is a continuous numeric value, such as a price, a temperature, or a time estimate.
- House price estimation: Input: size, location, number of rooms. Output: predicted price.
- Demand forecasting: Input: past sales, seasonality indicators. Output: predicted units sold.
- Delivery time prediction: Input: distance, traffic indicators. Output: minutes to deliver.
Some problems look like classification but are treated as regression (for example, predicting a score from 1 to 5). The choice depends on how you want the model to behave and how you plan to evaluate it.
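To make the distinction concrete, here is a minimal sketch in Python with scikit-learn (the library choice and the synthetic data are assumptions for illustration, not part of the examples above): the classifier returns a probability per class and picks the most likely one, while the regressor returns a number.

import numpy as np
from sklearn.linear_model import LogisticRegression, LinearRegression

rng = np.random.default_rng(0)

# Classification: inputs -> discrete label (0 = "negative", 1 = "positive")
X_cls = rng.normal(size=(200, 3))                     # 200 examples, 3 features
y_cls = (X_cls[:, 0] + X_cls[:, 1] > 0).astype(int)   # toy labels
clf = LogisticRegression().fit(X_cls, y_cls)
print(clf.predict_proba(X_cls[:1]))                   # probability per class
print(clf.predict(X_cls[:1]))                         # most likely class

# Regression: inputs -> continuous number (for example, a price)
X_reg = rng.normal(size=(200, 3))
y_reg = 3.0 * X_reg[:, 0] + rng.normal(scale=0.1, size=200)  # toy target
reg = LinearRegression().fit(X_reg, y_reg)
print(reg.predict(X_reg[:1]))                         # a numeric estimate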
What the Model Actually Learns: A Function That Generalizes
Supervised learning aims to learn a function that works well beyond the training set. If a model only performs well on the examples it saw during training but fails on new examples, it has not learned useful general patterns. This is why supervised learning is not just about “getting the training answers right,” but about building a predictor that generalizes.
Generalization depends on many factors: how representative the training examples are, how complex the model is, and how training is controlled to avoid overfitting (learning noise or accidental quirks instead of real signal).
Core Workflow (Step-by-Step)
The exact tools vary, but most supervised learning projects follow a similar sequence. The steps below are written in a practical, implementation-oriented way.
Step 1: Define the prediction task precisely
Write a one-sentence statement that includes the input, output, and when the prediction is made.
- Example (classification): “Given an incoming transaction’s details at purchase time, predict whether it is fraudulent (yes/no).”
- Example (regression): “Given a product page view session, predict the expected order value in dollars.”
Be explicit about what counts as the “correct” answer and what time window it refers to. Many real-world failures come from vague definitions like “predict churn” without defining churn.
Step 2: Choose the unit of prediction (what is one row?)
Decide what one training example represents. Is it one email, one customer, one transaction, one day, or one image? This choice affects everything: features, labels, and evaluation.
- Fraud: one row per transaction.
- Churn: one row per customer per month (or per customer at a specific snapshot date).
- Demand: one row per product per day.
Step 3: Build features (inputs the model can use)
Features are measurable signals derived from raw data. For tabular business problems, features might include counts, averages, time since last event, or categorical fields like country or device type. For text, features might come from embeddings; for images, from pixel values or learned representations.
Practical feature examples for a fraud classifier:
- Transaction amount
- Merchant category
- Country mismatch between billing and IP
- Number of transactions in last 10 minutes
- Account age in days
A common beginner pitfall is including features that “cheat” by leaking future information (for example, including a feature that is only known after the fraud investigation is complete). This is called data leakage, and it can make a model look great in testing but fail in production.
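Below is a hedged sketch of point-in-time feature construction for the fraud example, assuming pandas and hypothetical column names such as account_id, ts, amount, and account_created_at. The point is that every feature uses only data available at the moment of the transaction, which is exactly how leakage is avoided.

import pandas as pd

# Hypothetical transaction data; column names are illustrative only.
tx = pd.DataFrame({
    "account_id": [1, 1, 1, 2],
    "ts": pd.to_datetime(["2024-01-01 10:00", "2024-01-01 10:05",
                          "2024-01-01 10:07", "2024-01-01 11:00"]),
    "amount": [20.0, 500.0, 450.0, 30.0],
    "account_created_at": pd.to_datetime(["2023-12-31"] * 3 + ["2023-06-01"]),
}).sort_values(["account_id", "ts"])

# Number of transactions by the same account in the previous 10 minutes
# (including the current one); it only looks backward in time, so no leakage.
tx["tx_last_10min"] = (
    tx.set_index("ts")
      .groupby("account_id")["amount"]
      .rolling("10min").count()
      .values
)

# Account age in days at the moment of the transaction.
tx["account_age_days"] = (tx["ts"] - tx["account_created_at"]).dt.days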
Step 4: Split data into train/validation/test (or use cross-validation)
You need a way to estimate how well the model will perform on new data. A typical approach is:
- Training set: used to fit the model.
- Validation set: used to choose model settings (hyperparameters) and compare approaches.
- Test set: used once at the end for an unbiased final check.
For time-based problems (forecasting, churn over time, fraud patterns that evolve), splitting by time is often more realistic than random splitting. For example, train on months 1–10, validate on month 11, test on month 12.
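The sketch below contrasts the two split styles, assuming pandas and scikit-learn; the tiny frame and the month column are hypothetical stand-ins for a real dataset.

import pandas as pd
from sklearn.model_selection import train_test_split

df = pd.DataFrame({
    "month":   [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12],
    "feature": [0.1, 0.3, 0.2, 0.5, 0.4, 0.6, 0.7, 0.5, 0.8, 0.9, 1.0, 1.1],
    "label":   [0, 0, 1, 0, 1, 1, 0, 1, 1, 1, 0, 1],
})

# Random split: reasonable when examples are independent and patterns are stable.
train_df, val_df = train_test_split(df, test_size=0.25, random_state=42)

# Time-based split: train on months 1-10, validate on month 11, test on month 12.
train_t = df[df["month"] <= 10]
val_t   = df[df["month"] == 11]
test_t  = df[df["month"] == 12]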
Step 5: Pick an evaluation metric that matches the goal
Different tasks require different metrics. Choosing the wrong metric can lead you to optimize the wrong behavior.
- Classification metrics: accuracy, precision, recall, F1 score, ROC-AUC, PR-AUC.
- Regression metrics: MAE (mean absolute error), MSE (mean squared error), RMSE, R².
In imbalanced classification (like fraud), accuracy can be misleading. If only 1% of transactions are fraud, a model that predicts “not fraud” for everything gets 99% accuracy but is useless. In such cases, precision/recall and PR-AUC are often more informative.
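A small sketch of the accuracy trap, assuming numpy and scikit-learn's metric functions on synthetic labels: the always-negative "model" scores 99% accuracy while catching zero fraud.

import numpy as np
from sklearn.metrics import (accuracy_score, precision_score,
                             recall_score, average_precision_score)

# 1% positives; a model that always predicts "not fraud" still looks accurate.
y_true = np.array([1] * 10 + [0] * 990)
y_always_negative = np.zeros(1000, dtype=int)

print(accuracy_score(y_true, y_always_negative))    # 0.99, but useless
print(recall_score(y_true, y_always_negative))      # 0.0: catches no fraud
print(precision_score(y_true, y_always_negative, zero_division=0))  # 0.0

# PR-AUC (average precision) needs scores/probabilities, not hard labels.
y_scores = np.random.default_rng(0).random(1000)    # a random scorer, for contrast
print(average_precision_score(y_true, y_scores))    # near the 1% base rate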
Step 6: Train a baseline model first
A baseline is a simple reference point. It helps you confirm that your pipeline works and gives you a minimum standard to beat.
- Classification baseline: predict the most common class, or a simple logistic regression.
- Regression baseline: predict the average value, or a simple linear regression.
Baselines are also useful for explaining value to stakeholders: “We improved from X to Y compared to a simple rule.”
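The sketch below sets up both kinds of baselines, assuming scikit-learn; the synthetic arrays stand in for a real training split.

import numpy as np
from sklearn.dummy import DummyClassifier, DummyRegressor
from sklearn.linear_model import LogisticRegression, LinearRegression

rng = np.random.default_rng(0)
X_train = rng.normal(size=(100, 4))
y_class = (X_train[:, 0] > 0).astype(int)               # toy binary labels
y_value = 2.0 * X_train[:, 1] + rng.normal(size=100)    # toy numeric target

# Classification: "predict the most common class" vs. simple logistic regression.
majority = DummyClassifier(strategy="most_frequent").fit(X_train, y_class)
logreg   = LogisticRegression().fit(X_train, y_class)

# Regression: "predict the average value" vs. simple linear regression.
mean_model = DummyRegressor(strategy="mean").fit(X_train, y_value)
linreg     = LinearRegression().fit(X_train, y_value)

print(majority.score(X_train, y_class), logreg.score(X_train, y_class))
print(mean_model.score(X_train, y_value), linreg.score(X_train, y_value))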
Step 7: Train improved models and tune hyperparameters
Once the baseline works, you can try more expressive models or better features. Hyperparameters are settings you choose before training (for example, tree depth, regularization strength, learning rate). You use the validation set to compare options.
Common supervised learning model families:
- Linear models: logistic regression, linear regression (fast, interpretable, strong baselines).
- Tree-based models: decision trees, random forests, gradient boosting (often strong on tabular data).
- Neural networks: flexible, often used for text, images, and complex patterns.
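A sketch of validation-set tuning, assuming scikit-learn and synthetic data: each candidate hyperparameter value is trained on the training set, and the one with the best validation score wins.

import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 5))
y = (X[:, 0] + 0.5 * X[:, 1] > 0).astype(int)
X_train, X_val = X[:400], X[400:]
y_train, y_val = y[:400], y[400:]

best_depth, best_score = None, -1.0
for max_depth in [2, 4, 8, None]:              # candidate hyperparameter values
    model = RandomForestClassifier(max_depth=max_depth, random_state=0)
    model.fit(X_train, y_train)
    score = roc_auc_score(y_val, model.predict_proba(X_val)[:, 1])
    if score > best_score:
        best_depth, best_score = max_depth, score

print(best_depth, best_score)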
Step 8: Check for overfitting and underfitting
Two common failure modes:
- Underfitting: model is too simple or not trained enough; it performs poorly on both training and validation data.
- Overfitting: model performs very well on training data but worse on validation/test data; it learned noise or overly specific patterns.
Practical signs of overfitting include a large gap between training and validation performance. Remedies include simplifying the model, adding regularization, using more data, improving feature quality, or using early stopping for iterative models.
Step 9: Decide a threshold (for many classification problems)
Many classifiers output a probability between 0.0 and 1.0. You must choose a threshold above which you label something as positive (like “fraud”). The threshold controls the trade-off between catching more positives (higher recall) and avoiding false alarms (higher precision).
Example: If you set the fraud threshold to 0.9, you might catch fewer fraud cases but with fewer false positives. If you set it to 0.3, you catch more fraud but may block many legitimate transactions. The “best” threshold depends on business costs and constraints.
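One simple way to pick a threshold is to assign a cost to each error type and minimize total cost on validation data. The sketch below assumes numpy; the probabilities, labels, and cost figures are invented for illustration.

import numpy as np

rng = np.random.default_rng(0)
y_val   = rng.integers(0, 2, size=1000)                           # toy true labels
p_fraud = np.clip(y_val * 0.6 + rng.random(1000) * 0.5, 0, 1)     # toy model scores

COST_FALSE_NEGATIVE = 200.0   # missed fraud (assumed cost)
COST_FALSE_POSITIVE = 5.0     # blocked legitimate transaction (assumed cost)

best_threshold, best_cost = None, float("inf")
for threshold in np.linspace(0.05, 0.95, 19):
    pred = (p_fraud >= threshold).astype(int)
    fn = np.sum((pred == 0) & (y_val == 1))
    fp = np.sum((pred == 1) & (y_val == 0))
    cost = fn * COST_FALSE_NEGATIVE + fp * COST_FALSE_POSITIVE
    if cost < best_cost:
        best_threshold, best_cost = threshold, cost

print(best_threshold, best_cost)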
Step 10: Perform error analysis (look at mistakes)
Error analysis means inspecting false positives and false negatives to learn what the model struggles with and what data or features might fix it.
- Are many false positives coming from a particular country or merchant category?
- Are false negatives concentrated in new account types?
- Are there label issues (some “fraud” labels are wrong or delayed)?
This step often produces the biggest improvements because it turns model development into a targeted investigation rather than random tuning.
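A sketch of the mechanics, assuming pandas and hypothetical columns (country, label, predicted): slice out the false positives and false negatives, then group them by candidate features to see where they cluster.

import pandas as pd

val = pd.DataFrame({
    "country":   ["US", "US", "BR", "BR", "BR", "DE"],
    "label":     [0, 1, 0, 0, 0, 1],     # true label (1 = fraud)
    "predicted": [0, 1, 1, 1, 0, 0],     # model decision after thresholding
})

false_positives = val[(val["predicted"] == 1) & (val["label"] == 0)]
false_negatives = val[(val["predicted"] == 0) & (val["label"] == 1)]

# Where do the mistakes cluster? Here, both false positives come from one country.
print(false_positives["country"].value_counts())
print(false_negatives["country"].value_counts())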
How Supervised Learning “Learns”: Loss Functions and Optimization (Intuition)
During training, the model needs a numeric score that tells it how wrong it is. This score is called a loss function. Training tries to minimize the loss.
- Classification: a common loss is cross-entropy (it penalizes confident wrong predictions heavily).
- Regression: common losses include squared error or absolute error.
An optimizer (such as gradient descent for many models) adjusts the model’s internal parameters to reduce loss. You do not need to compute gradients by hand to understand the practical implication: training is an iterative process of “predict → measure error → adjust → repeat.”
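The toy sketch below (numpy assumed) shows that loop literally: it fits a single parameter by repeatedly predicting, measuring squared error, and adjusting against the gradient.

import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=100)
y = 3.0 * x + rng.normal(scale=0.1, size=100)     # true slope is 3.0

w = 0.0                  # initial guess for the parameter
learning_rate = 0.1
for step in range(100):
    predictions = w * x
    loss = np.mean((predictions - y) ** 2)           # how wrong are we?
    gradient = np.mean(2 * (predictions - y) * x)    # direction that increases the loss
    w -= learning_rate * gradient                    # adjust to reduce the loss

print(w)   # close to 3.0 after training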
Practical Example 1: Spam Detection (Classification)
Goal
Given an email, predict whether it is spam.
Inputs and outputs
- Input: email subject, body text, sender domain, presence of links, etc.
- Output: spam (1) or not spam (0).
Step-by-step approach (a code sketch follows the list)
- Define the unit: one row per email.
- Create features: text representation (for example, word counts or embeddings), number of links, sender reputation indicators.
- Split data: if spam patterns change over time, consider time-based splitting.
- Choose metric: precision and recall are important. High recall catches more spam; high precision avoids flagging legitimate emails.
- Train baseline: logistic regression on simple text features.
- Improve: add better text representations, include sender/domain features, tune regularization.
- Threshold selection: choose a threshold that matches tolerance for false positives (marking real emails as spam is costly).
- Error analysis: inspect false positives (e.g., newsletters) and false negatives (e.g., new scam templates).
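A compact sketch of the baseline described above, assuming scikit-learn; the four example emails and labels are invented, and a real system would add sender and domain features and far more data.

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

emails = [
    "win a free prize now, click this link",
    "meeting moved to 3pm, see agenda attached",
    "cheap loans, limited offer, act now",
    "can you review the quarterly report draft?",
]
labels = [1, 0, 1, 0]   # 1 = spam, 0 = not spam

# Text features (TF-IDF word weights) feeding a logistic regression baseline.
model = make_pipeline(TfidfVectorizer(), LogisticRegression())
model.fit(emails, labels)

# Probability of spam for a new email; apply the chosen threshold afterwards.
print(model.predict_proba(["free prize, click now"])[0, 1])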
Practical Example 2: Predicting House Prices (Regression)
Goal
Given house attributes, predict sale price.
Inputs and outputs
- Input: square footage, number of bedrooms, neighborhood, year built, lot size.
- Output: price in dollars.
Step-by-step approach (a code sketch follows the list)
- Define the unit: one row per house sale.
- Feature engineering: convert neighborhood to a categorical feature; create derived features like price per square foot in the area (taking care to avoid leakage by using only past sales data).
- Split data: random split may be fine, but if the market changes quickly, consider time-based split.
- Choose metric: MAE is easy to interpret (“average error is $18,000”).
- Train baseline: linear regression.
- Improve: gradient boosted trees often handle mixed feature types well.
- Error analysis: check where errors are largest (luxury homes, unusual properties) and whether additional features (renovation status, school district) would help.
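A sketch of this regression setup, assuming pandas and scikit-learn; the handful of rows and the one-hot neighborhood encoding are illustrative only, and in practice the MAE would be computed on held-out data rather than the training rows.

import pandas as pd
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import mean_absolute_error

houses = pd.DataFrame({
    "sqft":         [1200, 1500, 900, 2200, 1800, 1300],
    "bedrooms":     [3, 3, 2, 4, 3, 3],
    "neighborhood": ["north", "north", "south", "east", "east", "south"],
    "price":        [250000, 310000, 180000, 450000, 390000, 220000],
})

# One-hot encode the categorical neighborhood column; keep numeric columns as-is.
X = pd.get_dummies(houses[["sqft", "bedrooms", "neighborhood"]])
y = houses["price"]

model = GradientBoostingRegressor(random_state=0).fit(X, y)
predictions = model.predict(X)
print(mean_absolute_error(y, predictions))   # "average error in dollars"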
Common Pitfalls and How to Avoid Them
1) Data leakage (the silent killer)
Leakage happens when the model gets access to information that would not be available at prediction time. It can occur in subtle ways, such as using a feature computed with future data or using a label proxy that is too close to the answer.
- Example: predicting loan default but including “number of missed payments” measured after the loan was issued.
- Fix: design features as they would exist at the moment you want to predict; use time-aware feature generation.
2) Imbalanced classes
When one class is rare (fraud, disease detection), the model can ignore it. Solutions include using appropriate metrics, resampling strategies, class weights, and careful threshold selection.
3) Label noise and ambiguous labels
Sometimes the “correct” label is uncertain or inconsistently applied. For example, support tickets might be tagged differently by different agents. Noisy labels limit model performance and can mislead evaluation.
- Fix: improve labeling guidelines, audit samples, measure inter-annotator agreement, and consider simplifying label categories.
4) Distribution shift (the world changes)
A model trained on past data may perform worse when user behavior, products, or fraud tactics change. This is especially common in dynamic environments.
- Fix: monitor performance over time, retrain periodically, and use time-based validation to estimate future performance more realistically.
5) Optimizing the wrong objective
A model can score well on a metric but fail the real goal. For example, maximizing accuracy might not minimize financial losses. Align metrics and thresholds with real costs and constraints.
Interpreting Supervised Models: From “Why” to “What to Do”
In many applications, you need to understand why the model made a prediction, not just what it predicted. Interpretability can be crucial for trust, debugging, and compliance.
- Global understanding: which features generally matter most (for example, “transaction amount” is a strong driver).
- Local understanding: why a specific case was predicted as positive (for example, “unusual location + high amount + many recent attempts”).
Some models are naturally easier to interpret (linear models, small decision trees). For more complex models, you can use explanation techniques (like feature importance methods) to get approximate insights. Interpretability is also a practical tool for finding leakage and bias: if the model relies heavily on a suspicious feature, investigate it.
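As one concrete technique, the sketch below uses scikit-learn's permutation importance on synthetic data (an assumption for illustration, not the only option): shuffling a feature the model truly relies on should noticeably hurt its validation score.

import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 3))                       # feature 0 mostly drives the label
y = (X[:, 0] + 0.2 * X[:, 1] > 0).astype(int)

model = RandomForestClassifier(random_state=0).fit(X[:400], y[:400])

# Shuffle each feature on held-out data and measure how much the score drops.
result = permutation_importance(model, X[400:], y[400:], n_repeats=10, random_state=0)
print(result.importances_mean)   # feature 0 should dominate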
Deploying a Supervised Learning System: Practical Considerations
Prediction latency and throughput
Some predictions must be made instantly (fraud checks), while others can run in batches (daily demand forecasts). The required speed influences model choice and infrastructure.
Monitoring after deployment
Once deployed, you should monitor:
- Input data drift: are feature distributions changing?
- Prediction drift: are predicted probabilities shifting?
- Outcome performance: are precision/recall or MAE changing over time?
Monitoring is essential because supervised learning assumes that future data will resemble past data closely enough for learned patterns to remain valid.
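A minimal sketch of input-drift monitoring, assuming numpy; a production setup would more likely use a dedicated monitoring tool or a statistical test (for example, PSI or a Kolmogorov-Smirnov test), but the idea is the same: compare recent feature distributions against the training distribution and alert on large shifts.

import numpy as np

rng = np.random.default_rng(0)
train_amounts  = rng.lognormal(mean=3.0, sigma=1.0, size=10_000)   # training data
recent_amounts = rng.lognormal(mean=3.4, sigma=1.0, size=1_000)    # live traffic

# How far has the recent mean drifted, measured in training standard deviations?
shift_in_means = abs(recent_amounts.mean() - train_amounts.mean()) / train_amounts.std()
if shift_in_means > 0.25:        # alert threshold chosen by judgment
    print("Feature 'amount' is drifting: investigate or retrain.")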
Human-in-the-loop workflows
Many supervised learning systems work best when combined with human review. For example, a fraud model might automatically block only the highest-risk cases and send medium-risk cases to analysts. This setup can improve safety and also generate new labeled examples for future training.
A Minimal Pseudocode Blueprint
The following pseudocode shows the typical structure of a supervised learning pipeline. It is not tied to a specific programming language.
# 1) Load dataset of (X, y) pairs (features and labels/targets)
dataset = load_data()
X, y = dataset.features, dataset.targets

# 2) Split data
X_train, X_val, y_train, y_val = split(X, y, method="time" or "random")

# 3) Train baseline model
model = Model(type="logistic_regression" or "linear_regression")
model.fit(X_train, y_train)

# 4) Evaluate
val_predictions = model.predict(X_val)
score = metric(y_val, val_predictions)

# 5) Iterate: improve features, try new model, tune hyperparameters
best_model = tune_and_select(models, X_train, y_train, X_val, y_val, metric)

# 6) Choose threshold if classification probabilities are used
threshold = choose_threshold(best_model, X_val, y_val, cost_tradeoff)

# 7) Error analysis
analyze_errors(best_model, X_val, y_val)

When Supervised Learning Is a Good Fit (and When It’s Not)
Supervised learning is a strong choice when you can define a clear target output and you have enough labeled examples that reflect the situations you care about. It is especially effective for prediction tasks where you can measure correctness objectively (fraud confirmed later, sales numbers, known categories).
It is less suitable when labels are unavailable, extremely expensive, or when the goal is to discover structure without predefined categories. In those cases, other approaches (like unsupervised learning or reinforcement learning) may be more appropriate, or you may need a hybrid approach such as weak supervision or semi-supervised learning.