
Computer Vision Basics: Understanding Images, Features, and Modern Pipelines

Evaluating Performance: Metrics That Match Real Goals

Chapter 8

Estimated reading time: 11 minutes

Why “Good” Needs a Definition

Model evaluation is not about finding the biggest number on a leaderboard; it is about measuring whether the system meets real-world goals under real constraints. The same model can look “great” under one metric and “bad” under another, depending on class imbalance, error costs, and how predictions are consumed (top-1 label, ranked list, bounding boxes, masks, or alerts). This chapter focuses on choosing metrics that match the decision being made and interpreting results in context.

1) Classification Metrics: Choosing the Right Lens

Confusion matrix as the starting point

Most classification metrics are derived from counts of correct/incorrect predictions. For a binary classifier, define:

  • TP (true positives): predicted positive and actually positive
  • FP (false positives): predicted positive but actually negative
  • TN (true negatives): predicted negative and actually negative
  • FN (false negatives): predicted negative but actually positive

When you are unsure which metric to use, start by inspecting these counts (or the full confusion matrix for multi-class problems) because they reveal which errors dominate.
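As a minimal illustration (using NumPy and made-up placeholder arrays; `y_true` and `y_pred` stand in for binary label arrays you already have), the four counts can be derived directly:

# Deriving TP/FP/TN/FN from binary label arrays (illustrative data)
import numpy as np

y_true = np.array([1, 0, 1, 1, 0, 0, 1, 0])   # ground truth (1 = positive)
y_pred = np.array([1, 0, 0, 1, 0, 1, 1, 0])   # model predictions

TP = int(np.sum((y_pred == 1) & (y_true == 1)))  # predicted positive, actually positive
FP = int(np.sum((y_pred == 1) & (y_true == 0)))  # predicted positive, actually negative
TN = int(np.sum((y_pred == 0) & (y_true == 0)))  # predicted negative, actually negative
FN = int(np.sum((y_pred == 0) & (y_true == 1)))  # predicted negative, actually positive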

Accuracy

Accuracy = (TP + TN) / (TP + FP + TN + FN). Accuracy is appropriate when:

  • Classes are reasonably balanced.
  • False positives and false negatives have similar cost.
  • You care about overall correctness rather than a specific class.

Accuracy can be misleading with imbalance. If 99% of images are “normal,” a model that always predicts “normal” gets 99% accuracy but is useless for finding rare defects.
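A quick sketch of that failure mode, with synthetic data just to make the numbers concrete:

# 1% of images are defects; a model that always says "normal" is 99% accurate but useless
import numpy as np

y_true = np.array([1] * 10 + [0] * 990)   # 1 = defect, 0 = normal
y_pred = np.zeros_like(y_true)            # always predict "normal"

accuracy = (y_pred == y_true).mean()                 # 0.99
defect_recall = (y_pred[y_true == 1] == 1).mean()    # 0.0: no defect is ever found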

Precision and recall

Precision = TP / (TP + FP) answers: “When the model says positive, how often is it correct?”

Recall = TP / (TP + FN) answers: “Of all actual positives, how many did we catch?”

Use precision/recall when the positive class is rare or when the costs of FP and FN differ. Examples:

  • Safety alerting: prioritize high recall to avoid missing dangerous events (low FN).
  • Human review queues: prioritize high precision to avoid wasting reviewer time (low FP).

F1 score

F1 = 2 · (precision · recall) / (precision + recall). F1 is useful when you need a single number that balances precision and recall and the positive class matters. It is most meaningful when you compare models at a similar operating point (i.e., similar thresholding strategy).

In multi-class classification, you will often see:

  • Macro F1: average F1 across classes (treats all classes equally; highlights poor minority-class performance).
  • Micro F1: compute TP/FP/FN globally (dominated by frequent classes; closer to overall accuracy when single-label).
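As a rough sketch using scikit-learn (assuming it is available; the toy labels below are invented so that one rare class is always missed):

# Macro vs micro F1 on a toy multi-class problem with one rare class
from sklearn.metrics import f1_score

y_true = [0, 0, 0, 0, 1, 1, 1, 2]
y_pred = [0, 0, 0, 1, 1, 1, 0, 0]   # class 2 is never predicted correctly

print("macro F1:", f1_score(y_true, y_pred, average="macro"))  # pulled down by the missed rare class
print("micro F1:", f1_score(y_true, y_pred, average="micro"))  # dominated by the frequent classes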

ROC-AUC

Many classifiers output a score (probability or logit) rather than a hard label. The ROC curve plots:

  • TPR (true positive rate) = recall
  • FPR (false positive rate) = FP / (FP + TN)

ROC-AUC summarizes the ROC curve as the probability that a random positive example receives a higher score than a random negative example. ROC-AUC is appropriate when:

  • You care about ranking quality across thresholds.
  • Class imbalance is not extreme, or you will interpret it carefully.

With heavy imbalance, ROC-AUC can look strong even when precision is poor at practical thresholds. In those cases, precision–recall curves (PR curves) are often more informative because they focus on performance on the positive class.
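A small synthetic sketch of that effect (assuming scikit-learn; the score distributions are invented purely to create heavy imbalance):

# ROC-AUC vs PR-AUC (average precision) under heavy class imbalance
import numpy as np
from sklearn.metrics import roc_auc_score, average_precision_score

rng = np.random.default_rng(0)
y_true = np.concatenate([np.ones(50), np.zeros(5000)])                            # ~1% positives
scores = np.concatenate([rng.normal(2.0, 1.0, 50), rng.normal(0.0, 1.0, 5000)])   # positives score higher on average

print("ROC-AUC:", roc_auc_score(y_true, scores))            # typically looks strong
print("PR-AUC:", average_precision_score(y_true, scores))   # noticeably lower: positive-class precision suffers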

Practical step-by-step: compute and interpret classification metrics

  • Step 1: Collect predictions as scores (not just labels). Store per-image score for the positive class (or per-class scores for multi-class).
  • Step 2: Pick a baseline threshold (often 0.5) and compute TP/FP/TN/FN.
  • Step 3: Compute accuracy, precision, recall, F1 at that threshold.
  • Step 4: Sweep thresholds to produce ROC and PR curves; identify regions that meet constraints (e.g., recall ≥ 0.95).
  • Step 5: Inspect confusion matrix (multi-class) to see which classes are confused and whether errors are symmetric or one-sided.
# Threshold sweep (binary classification); score and y_true are arrays over the evaluation set
import numpy as np

eps = 1e-12   # avoids division by zero at extreme thresholds
curve = []
thresholds = [i / 100 for i in range(0, 101)]
for t in thresholds:
    y_pred = (score >= t).astype(int)
    TP = np.sum((y_pred == 1) & (y_true == 1))
    FP = np.sum((y_pred == 1) & (y_true == 0))
    TN = np.sum((y_pred == 0) & (y_true == 0))
    FN = np.sum((y_pred == 0) & (y_true == 1))
    precision = TP / (TP + FP + eps)
    recall = TP / (TP + FN + eps)
    f1 = 2 * precision * recall / (precision + recall + eps)
    curve.append((t, precision, recall, f1))

2) Detection Metrics: Measuring Boxes and Confidence

IoU (Intersection over Union)

Object detection outputs bounding boxes with class labels and confidence scores. To decide whether a predicted box matches a ground-truth box, we use IoU:

IoU = area(pred ∩ gt) / area(pred ∪ gt)

IoU captures localization quality. A prediction can have the correct class but still be wrong if it is poorly aligned. Common match thresholds are IoU ≥ 0.5 (lenient) or higher thresholds like 0.75 (strict).
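A minimal sketch of box IoU for axis-aligned boxes in (x1, y1, x2, y2) coordinates (the helper name box_iou is illustrative, not taken from a specific library):

# IoU between two axis-aligned boxes given as (x1, y1, x2, y2)
def box_iou(box_a, box_b):
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])   # intersection rectangle
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

print(box_iou((0, 0, 10, 10), (5, 5, 15, 15)))   # 25 / 175 ≈ 0.14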

Precision–recall for detection

Detection produces multiple predictions per image. As you lower the confidence threshold, you get more boxes (higher recall) but also more false positives (lower precision). A PR curve for detection is built by:

  • Sorting predictions by confidence (highest to lowest).
  • Walking down the list and marking each prediction as TP or FP based on class match and IoU threshold, with a rule that each ground-truth object can be matched at most once.
  • Computing precision and recall at each cutoff.
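The sketch below shows that greedy matching for a single class; the tuple layouts and names are assumptions for illustration, and box_iou is the helper from the previous sketch.

# Greedy TP/FP labeling for one class (predictions pooled across all images)
def match_predictions(predictions, ground_truths, iou_thr=0.5):
    # predictions: list of (image_id, confidence, box); ground_truths: list of (image_id, box)
    matched = set()   # ground-truth indices already claimed by a higher-confidence prediction
    results = []      # (confidence, is_true_positive), in descending confidence order
    for img_id, conf, box in sorted(predictions, key=lambda p: -p[1]):
        best_iou, best_gt = 0.0, None
        for gt_idx, (gt_img, gt_box) in enumerate(ground_truths):
            if gt_img != img_id or gt_idx in matched:
                continue
            iou = box_iou(box, gt_box)
            if iou > best_iou:
                best_iou, best_gt = iou, gt_idx
        if best_iou >= iou_thr:
            matched.add(best_gt)
            results.append((conf, True))    # true positive
        else:
            results.append((conf, False))   # false positive (unmatched or poorly localized)
    return results

Walking down results while accumulating TP and FP counts gives precision and recall at each confidence cutoff; recall uses the total number of ground-truth boxes as its denominator.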

mAP (mean Average Precision) conceptually

Average Precision (AP) summarizes the PR curve into a single number (area under the PR curve, with specific interpolation rules depending on the evaluation standard). mAP is the mean AP across classes (and sometimes across multiple IoU thresholds).

How to interpret mAP conceptually:

  • Higher AP means the detector ranks correct detections above incorrect ones and maintains good precision as recall increases.
  • mAP across multiple IoU thresholds rewards both classification and localization quality.

When comparing detectors, always check which IoU thresholds and class averaging scheme were used; “mAP” is not a single universal definition.

Confidence calibration for detection

A detector’s confidence score is often treated as a probability, but it may be miscalibrated (e.g., many 0.9-confidence boxes are wrong). Calibration matters when:

  • You use confidence to trigger actions (alerts, robot stops, auto-labeling).
  • You want stable thresholding across environments.

Practical checks:

  • Reliability diagram: bin predictions by confidence and compare predicted confidence to empirical correctness.
  • Expected Calibration Error (ECE): a summary of calibration gap across bins.

Even without changing the detector, you can sometimes improve calibration with post-processing (e.g., temperature scaling on logits) if you have a held-out calibration set.
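A minimal ECE sketch with equal-width bins (the bin edges and weighting are choices; published implementations differ in details):

# Expected Calibration Error with equal-width confidence bins
import numpy as np

def expected_calibration_error(confidences, correct, n_bins=10):
    confidences = np.asarray(confidences, dtype=float)   # predicted confidences in [0, 1]
    correct = np.asarray(correct, dtype=float)           # 1.0 where the prediction was actually right
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        in_bin = (confidences > lo) & (confidences <= hi)
        if not np.any(in_bin):
            continue
        avg_conf = confidences[in_bin].mean()   # what the model claims in this bin
        avg_acc = correct[in_bin].mean()        # what actually happened in this bin
        ece += in_bin.mean() * abs(avg_conf - avg_acc)   # weight the gap by bin size
    return ece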

Practical step-by-step: evaluate a detector

  • Step 1: Choose IoU threshold(s) that match the application. Tight grasping or precise cropping needs higher IoU; coarse counting may tolerate lower IoU.
  • Step 2: Run inference and collect all predicted boxes with class and confidence.
  • Step 3: Match predictions to ground truth per image and per class using IoU and one-to-one matching.
  • Step 4: Build PR curves by sorting predictions by confidence.
  • Step 5: Compute AP per class and average to mAP; also report per-class AP to avoid hiding weak classes.
  • Step 6: Check calibration of confidence if thresholds drive decisions.

3) Segmentation Metrics: Area vs Boundaries

Pixel accuracy

Pixel accuracy is the fraction of pixels labeled correctly. It is simple but can be misleading when:

  • The background dominates the image (common in semantic segmentation).
  • Small objects matter (a model can ignore them and still score high).

IoU (Jaccard) for segmentation

For a class mask, IoU = |pred ∩ gt| / |pred ∪ gt|, where the sets are pixels. IoU penalizes both over-segmentation and under-segmentation and is more informative than pixel accuracy for imbalanced foreground/background.

In multi-class segmentation, you will see:

  • Mean IoU (mIoU): average IoU across classes (often excluding “void”/ignore labels).
  • Frequency-weighted IoU: weights classes by pixel frequency (can hide rare-class failures).

Dice (F1) coefficient

Dice = 2|pred ∩ gt| / (|pred| + |gt|). Dice is closely related to IoU and often preferred in domains where foreground regions are small and you want a metric that is sensitive to overlap (e.g., thin structures). Dice tends to be numerically higher than IoU for the same prediction, so compare like with like.
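A minimal sketch of both metrics for boolean masks (the value returned when both masks are empty is a convention we chose; libraries differ):

# Region overlap metrics for boolean masks of the same shape
import numpy as np

def mask_iou(pred, gt):
    inter = np.logical_and(pred, gt).sum()
    union = np.logical_or(pred, gt).sum()
    return inter / union if union > 0 else 1.0   # both empty: treat as perfect agreement

def dice(pred, gt):
    inter = np.logical_and(pred, gt).sum()
    total = pred.sum() + gt.sum()
    return 2 * inter / total if total > 0 else 1.0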

Boundary sensitivity

Region metrics (IoU/Dice) can miss boundary quality issues. Two masks can have similar IoU but very different edge accuracy, which matters for tasks like measuring object size, cutting, or precise overlay.

Ways to evaluate boundary quality:

  • Boundary F-score: compare predicted and ground-truth boundaries within a tolerance (in pixels).
  • Trimap IoU: compute IoU only in a narrow band around the boundary to focus on edge alignment.
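One way to sketch a trimap-style IoU, assuming SciPy morphology to build the boundary band (the band width and the exact band definition are choices, not a standard):

# IoU restricted to a band around the ground-truth boundary
import numpy as np
from scipy.ndimage import binary_dilation, binary_erosion

def trimap_iou(pred, gt, band=3):
    # Boundary band = dilation minus erosion of the ground-truth mask, `band` pixels wide on each side
    band_mask = binary_dilation(gt, iterations=band) & ~binary_erosion(gt, iterations=band)
    p, g = pred[band_mask], gt[band_mask]
    inter = np.logical_and(p, g).sum()
    union = np.logical_or(p, g).sum()
    return inter / union if union > 0 else 1.0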

Practical step-by-step: evaluate segmentation

  • Step 1: Decide what matters: overall region coverage (IoU/Dice) vs edge precision (boundary metrics).
  • Step 2: Compute per-class metrics (IoU/Dice) and inspect the distribution, not just the mean.
  • Step 3: Report both macro and frequency-weighted summaries if class imbalance is present.
  • Step 4: Add a boundary-focused metric when downstream use depends on edges.
  • Step 5: Visualize worst cases by metric (lowest IoU or boundary score) to see failure modes.

4) Threshold Selection and Operating Points Based on Error Costs

Why thresholds are a product decision

Many vision systems output scores. Turning scores into actions requires choosing an operating point (thresholds, top-k, NMS settings, per-class thresholds). The “best” threshold depends on the cost of mistakes:

  • False positive (FP): acting when you should not (e.g., unnecessary alarm, wrong crop, wasted review time).
  • False negative (FN): failing to act when you should (e.g., missed defect, missed pedestrian).

Two teams can legitimately choose different thresholds for the same model because they optimize different costs.

Cost-based thresholding (practical recipe)

If you can estimate relative costs, you can choose a threshold that minimizes expected cost.

  • Step 1: Define costs: C_FP and C_FN (and optionally costs for TP/TN if relevant).
  • Step 2: Sweep thresholds and compute FP and FN counts on a validation set that matches deployment conditions.
  • Step 3: Compute expected cost: Cost(t) = C_FP · FP(t) + C_FN · FN(t).
  • Step 4: Choose threshold t* that minimizes Cost(t), subject to any hard constraints (e.g., recall ≥ 0.98).
# Cost-based threshold selection (reuses thresholds, score, y_true from the sweep above)
best_t, best_cost = None, float("inf")
for t in thresholds:
    y_pred = (score >= t).astype(int)
    FP = np.sum((y_pred == 1) & (y_true == 0))
    FN = np.sum((y_pred == 0) & (y_true == 1))
    cost = C_FP * FP + C_FN * FN   # C_FP, C_FN: costs defined in Step 1
    if cost < best_cost:
        best_cost, best_t = cost, t

Operating points for detection and segmentation

For detection, operating points include:

  • Confidence threshold per class (rare classes may need different thresholds).
  • NMS IoU threshold (controls duplicate suppression; affects recall vs precision).
  • Max detections per image (can cap false positives but may drop true objects in crowded scenes).
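To make the NMS IoU threshold concrete, here is a greedy NMS sketch (reusing the box_iou helper from earlier; real detectors usually apply a vectorized version per class):

# Greedy non-maximum suppression: keep the highest-scoring box, drop heavy overlaps, repeat
def nms(boxes, scores, iou_thr=0.5):
    order = sorted(range(len(boxes)), key=lambda i: -scores[i])
    keep = []
    while order:
        best = order.pop(0)
        keep.append(best)
        order = [i for i in order if box_iou(boxes[best], boxes[i]) < iou_thr]
    return keep   # indices of surviving boxes

Lowering iou_thr suppresses more near-duplicates, which raises precision in cluttered scenes but risks dropping genuinely overlapping objects in crowded ones.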

For segmentation, operating points include:

  • Probability threshold for binary masks.
  • Post-processing (morphology, connected components filtering) that trades small false positives against missed thin regions.

When thresholds are tuned, always tune on a held-out set and then verify on a separate test set to avoid optimistic estimates.

5) Error Analysis Workflow: Turning Metrics into Fixes

Start from examples, not just aggregates

Aggregate metrics tell you that a model fails; error analysis tells you how and why. A practical workflow combines quantitative slicing with qualitative inspection of representative failures.

Slice by condition (lighting, motion blur, viewpoint)

Overall performance can hide severe failures in specific conditions. Create “slices” of the dataset and compute metrics per slice. Common slices in vision:

  • Lighting: bright daylight, low light, backlit, mixed illumination.
  • Motion blur: none, mild, severe (can be estimated via blur metrics or metadata like shutter speed).
  • Viewpoint: frontal vs oblique, top-down vs side, distance to object.
  • Occlusion/crowding: isolated objects vs overlapping objects.
  • Scale: small vs large objects (especially for detection/segmentation).

Practical step-by-step slicing:

  • Step 1: Define slice labels using metadata (camera ID, time of day) or computed heuristics (blur score, brightness histogram).
  • Step 2: Compute the same metrics per slice (e.g., per-slice mAP, per-slice recall at fixed precision).
  • Step 3: Rank slices by performance drop relative to overall average.
  • Step 4: Inspect top failure slices with example images and predictions.
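A rough sketch of per-slice metrics, assuming you have a per-image results table with slice labels (the pandas dependency and the column names are assumptions):

# Per-slice recall at a fixed threshold from a per-image results table
import pandas as pd

df = pd.DataFrame({
    "slice":  ["daylight", "daylight", "low_light", "low_light", "low_light"],
    "y_true": [1, 0, 1, 1, 0],
    "score":  [0.9, 0.2, 0.4, 0.8, 0.7],
})
df["y_pred"] = (df["score"] >= 0.5).astype(int)

for name, group in df.groupby("slice"):
    tp = ((group.y_pred == 1) & (group.y_true == 1)).sum()
    fn = ((group.y_pred == 0) & (group.y_true == 1)).sum()
    recall = tp / (tp + fn) if (tp + fn) > 0 else float("nan")
    print(name, round(recall, 2))   # compare each slice to the overall recall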

Confusion patterns (classification and beyond)

For classification, a confusion matrix reveals systematic mix-ups (e.g., “cat” vs “fox”). For detection and segmentation, confusion can appear as:

  • Class swaps: correct box/mask but wrong label.
  • Localization errors: correct class but IoU too low.
  • Duplicate detections: multiple boxes on one object (precision drop).
  • Missed small objects: recall drop concentrated at small scales.

Actionable pattern extraction:

  • Step 1: Bucket errors into categories (miss, misclass, poor localization, duplicate, boundary error).
  • Step 2: Quantify each bucket (counts and rate) overall and per slice.
  • Step 3: Prioritize buckets by impact on the metric that matches your goal (e.g., missed detections if recall is critical).

Example-driven debugging

Once you know where performance drops, use targeted example review to generate hypotheses you can test.

  • Step 1: Collect “hard negatives” and “hard positives”: highest-confidence false positives and lowest-confidence true positives.
  • Step 2: Visually inspect predictions and annotate the failure reason (glare, blur, occlusion, unusual pose, confusing background, label ambiguity).
  • Step 3: Check for systematic label issues in the failure cluster (e.g., inconsistent box tightness, missing objects, boundary ambiguity). Even small inconsistencies can dominate metrics like IoU and mAP.
  • Step 4: Create targeted evaluation sets for the failure mode (e.g., a “low light” set) and track metrics separately so improvements are measurable.
  • Step 5: Re-evaluate after changes and confirm that improvements generalize across slices rather than shifting errors elsewhere.
# Practical debugging checklist (task-agnostic)
1) What metric matters for the product decision?
2) Which slice shows the largest drop?
3) What is the dominant error bucket in that slice?
4) Do the top-20 failures share a visual pattern?
5) Are scores calibrated (do high-confidence errors exist)?
6) After a fix, did the target slice improve without regressing others?

Now answer the exercise about the content:

In a binary classification system where missing a positive case is much more costly than raising extra alarms, which evaluation focus best matches this goal?

When false negatives are costly, the key goal is to catch as many true positives as possible, which is captured by recall (TP/(TP+FN)). Thresholds should be selected to satisfy recall targets or minimize expected cost under the FP/FN trade-off.

Next chapter

Common Failure Modes in Real-World Computer Vision Systems
