1) Classification: “What is present?”
Classification assigns one or more labels to an entire image (or to a pre-cropped region). The model does not say where the object is; it only predicts which categories apply.
Single-label vs multi-label
- Single-label (multi-class): exactly one class is correct (e.g., {cat, dog, car} and the image is one of them). Output is typically a vector of scores over classes.
- Multi-label: multiple classes can be simultaneously true (e.g., {person, bicycle, helmet} can all be present). Output is typically one score per class, interpreted independently.
What the outputs look like: logits, probabilities, top-k
- Logits: raw model scores before converting to probabilities. They can be any real numbers and are convenient for training and ranking.
- Probabilities: normalized scores. In single-label classification, probabilities typically sum to 1 across classes (e.g., after a softmax). In multi-label classification, each class gets its own independent probability (e.g., from a per-class sigmoid); these do not sum to 1 because multiple classes may be true at once (see the sketch after this list).
- Top-k: instead of only the best class, you return the k highest-scoring classes (useful when classes are similar or downstream logic can disambiguate).
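To make these concrete, here is a small NumPy sketch; the three classes and logit values are made up for illustration, and the same toy logits are reused for both cases. A softmax yields single-label probabilities that sum to 1, a per-class sigmoid yields independent multi-label probabilities, and top-k is just a sort of the scores.

import numpy as np

logits = np.array([2.1, -0.3, 0.8])   # hypothetical raw scores for 3 classes
classes = ["cat", "dog", "car"]

# Single-label: softmax turns logits into probabilities that sum to 1
probs_softmax = np.exp(logits - logits.max())
probs_softmax /= probs_softmax.sum()

# Multi-label: per-class sigmoid gives independent probabilities (no need to sum to 1)
probs_sigmoid = 1.0 / (1.0 + np.exp(-logits))

# Top-k: the k highest-scoring classes, best first
k = 2
top_k = sorted(zip(classes, probs_softmax), key=lambda pair: pair[1], reverse=True)[:k]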
Practical step-by-step: turning model scores into a decision
Single-label example (multi-class):
- Run the model and get a vector of logits, one per class.
- Convert logits to probabilities (if needed) and sort by score.
- Pick the top-1 class, or return top-k classes with their scores.
- Optionally reject uncertain predictions by requiring the top probability to exceed a threshold (e.g., “only accept if p > 0.8”).
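A minimal sketch of those steps, assuming the model returns a NumPy vector of logits and using an illustrative 0.8 rejection threshold:

import numpy as np

def decide_single_label(logits, classes, reject_below=0.8):
    # Softmax: convert logits to probabilities that sum to 1
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()
    best = int(np.argmax(probs))
    if probs[best] < reject_below:
        return None, float(probs[best])        # reject: prediction is too uncertain
    return classes[best], float(probs[best])   # accept the top-1 class

label, score = decide_single_label(np.array([2.1, -0.3, 0.8]), ["cat", "dog", "car"])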
Multi-label example:
- Run the model and get one logit per class.
- Convert each logit to an independent probability.
- Choose a per-class threshold (e.g., label is present if p > 0.5). Thresholds may differ by class depending on risk and class frequency.
# Pseudocode for multi-label decision rule (conceptual, not framework-specific)
probs = model(image)  # one probability per class
predicted_labels = [c for c in classes if probs[c] > threshold[c]]

2) Detection: “What is present, and where?”
Object detection finds instances of objects and returns their locations. Instead of one label for the whole image, the output is a set of predicted objects, each with a bounding box and a class score.
What the outputs look like: boxes + class scores
- Bounding box: typically represented as (x, y, width, height) or (x1, y1, x2, y2) in image coordinates (see the conversion sketch after this list).
- Class scores: a score per class (or a best class plus confidence). Each predicted box also has a confidence indicating how likely it is a real object of that class.
- Multiple predictions: detectors often propose many candidate boxes; post-processing reduces them to a clean set.
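Because libraries disagree on the box convention, converting between the two representations above is a common first step. A minimal sketch, assuming pixel coordinates with the origin at the top-left of the image:

def xywh_to_xyxy(box):
    # (x, y, width, height) -> (x1, y1, x2, y2)
    x, y, w, h = box
    return (x, y, x + w, y + h)

def xyxy_to_xywh(box):
    # (x1, y1, x2, y2) -> (x, y, width, height)
    x1, y1, x2, y2 = box
    return (x1, y1, x2 - x1, y2 - y1)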
Confidence thresholds: filtering weak detections
Detectors usually output many low-confidence boxes. A confidence threshold removes predictions below a chosen cutoff (e.g., keep only detections with confidence ≥ 0.4). This trades off between missing objects (too high a threshold) and producing false alarms (too low a threshold).
Non-maximum suppression (NMS): conceptually removing duplicates
Detectors may output several overlapping boxes for the same object. Non-maximum suppression keeps the highest-confidence box and removes nearby boxes that overlap it too much (based on an overlap measure such as intersection-over-union). Conceptually:
- Sort boxes by confidence (highest first).
- Take the top box as a final detection.
- Remove other boxes that overlap it beyond a chosen overlap threshold.
- Repeat with the remaining boxes.
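A minimal greedy implementation of that procedure, assuming each detection carries a box in (x1, y1, x2, y2) format and a confidence; the Detection container and the function names here are illustrative, not a specific library's API.

from dataclasses import dataclass

@dataclass
class Detection:
    box: tuple          # (x1, y1, x2, y2) in pixel coordinates
    confidence: float
    label: str

def iou(a, b):
    # Intersection-over-union of two (x1, y1, x2, y2) boxes
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter + 1e-9)

def non_max_suppression(detections, iou_thresh=0.5):
    # Greedy NMS: keep the most confident box, drop overlapping neighbors, repeat
    remaining = sorted(detections, key=lambda d: d.confidence, reverse=True)
    kept = []
    while remaining:
        best = remaining.pop(0)
        kept.append(best)
        remaining = [d for d in remaining if iou(best.box, d.box) < iou_thresh]
    return kept

A function along these lines is what the conceptual post-processing pseudocode later in this section refers to as non_max_suppression. As written it is class-agnostic; per-class NMS simply runs it separately for each label.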
Practical step-by-step: from raw detections to final boxes
- Run the detector to get candidate boxes and scores.
- Discard boxes below the confidence threshold.
- Apply NMS per class (or class-agnostic, depending on the system) to remove duplicates.
- Return the remaining boxes with class labels and confidences.
# Conceptual detection post-processing
def postprocess_detections(image, conf_thresh, iou_thresh):
    candidates = detector(image)                                    # raw boxes + scores
    filtered = [b for b in candidates if b.confidence >= conf_thresh]
    final = non_max_suppression(filtered, iou_thresh)
    return final

3) Segmentation: “Which pixels belong to what?”
Segmentation assigns a label to each pixel (or produces a mask) so you know not just where an object is, but its precise shape. This is useful when boundaries matter (e.g., medical imaging, robotics grasping, measuring area).
Semantic segmentation vs instance segmentation
- Semantic segmentation: every pixel gets a class label (e.g., road, sidewalk, car, sky). All objects of the same class share the same label; individual instances are not separated.
- Instance segmentation: each object instance gets its own mask (e.g., car #1, car #2), typically along with a class label per instance.
What the outputs look like: pixel-wise predictions and masks
- Semantic output: a per-pixel class score map (one score per class per pixel) or directly a per-pixel label map after taking the best class at each pixel.
- Instance output: a set of objects, each with (mask, class, confidence). Some systems also provide bounding boxes as a byproduct.
- Soft vs hard masks: models may output a probability per pixel (soft mask) that you threshold to obtain a binary mask (hard mask).
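Turning a soft mask into a hard one is a single thresholding operation. A sketch with NumPy, using a random array as a stand-in for a model's per-pixel probabilities and an illustrative 0.5 cutoff:

import numpy as np

prob_mask = np.random.rand(480, 640)   # stand-in for a soft mask of per-pixel probabilities
hard_mask = prob_mask > 0.5            # boolean [H, W] hard mask
area_px = int(hard_mask.sum())         # e.g., object area in pixels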
Practical step-by-step: producing a usable mask
Semantic segmentation:
- Run the model to get per-pixel class scores.
- For each pixel, choose the class with the highest score (or keep probabilities if downstream code needs uncertainty).
- Optionally apply simple cleanup rules (e.g., remove tiny isolated regions) depending on the application’s tolerance for noise.
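One common cleanup is removing tiny connected regions from a per-class binary mask. A sketch using scipy's connected-component labeling; the 50-pixel minimum is an arbitrary illustration, and the right value depends on image resolution and the application:

import numpy as np
from scipy import ndimage

def remove_small_regions(binary_mask, min_pixels=50):
    # Label connected regions, then drop any region smaller than min_pixels
    labeled, num_regions = ndimage.label(binary_mask)
    cleaned = binary_mask.copy()
    for region_id in range(1, num_regions + 1):
        region = labeled == region_id
        if region.sum() < min_pixels:
            cleaned[region] = False
    return cleaned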
Instance segmentation:
- Run the model to get candidate instance masks and confidences.
- Filter instances by confidence threshold.
- Resolve overlaps if needed (some pipelines use NMS-like logic for masks).
- Return each instance’s mask and label.
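The overlap-resolution step can reuse the greedy suppression idea with a mask-based overlap measure. A sketch, assuming each instance is a (mask, label, confidence) triple with boolean [H, W] masks; the thresholds are illustrative:

import numpy as np

def mask_iou(a, b):
    # Intersection-over-union of two boolean [H, W] masks
    inter = np.logical_and(a, b).sum()
    union = np.logical_or(a, b).sum()
    return inter / (union + 1e-9)

def filter_instances(instances, conf_thresh=0.5, iou_thresh=0.5):
    # instances: list of (mask, label, confidence); returns the kept instances
    kept = []
    for mask, label, conf in sorted(instances, key=lambda i: i[2], reverse=True):
        if conf < conf_thresh:
            continue
        if all(mask_iou(mask, kept_mask) < iou_thresh for kept_mask, _, _ in kept):
            kept.append((mask, label, conf))
    return kept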
# Conceptual semantic segmentation decoding
score_map = segmenter(image)           # shape: [H, W, C], one score per class per pixel
label_map = score_map.argmax(axis=-1)  # shape: [H, W], best class index per pixel

4) Labeling requirements, annotation cost, and typical applications
The three task families differ most in what supervision you need and how expensive it is to create. Choosing the right task is often a cost/benefit decision.
Classification labeling
- Annotation unit: one label per image (single-label) or a checklist of labels (multi-label).
- Cost: typically lowest. Many images can be labeled quickly.
- Common applications: product category tagging, presence/absence checks (e.g., “contains defect?”), content moderation, medical triage (“likely pneumonia?”) when localization is not required.
Detection labeling
- Annotation unit: draw a bounding box around each object instance and assign a class.
- Cost: medium. Boxes are faster than precise outlines but still require careful work, especially with crowded scenes.
- Common applications: counting objects, locating items for downstream actions (e.g., robotics pick points), traffic monitoring, retail shelf analytics, safety systems (finding people/vehicles).
Segmentation labeling
- Annotation unit: pixel-accurate masks (semantic: label every pixel; instance: outline each object instance).
- Cost: highest. Pixel-level annotation is time-consuming and requires clear guidelines for boundaries and occlusions.
- Common applications: medical imaging (tumor boundaries), autonomous driving (drivable area), industrial inspection (exact defect area), agriculture (leaf area measurement), background removal and compositing.
Practical implication: more detailed labels enable more detailed answers
Classification can tell you if something is present. Detection can tell you where it is approximately. Segmentation can tell you which pixels belong to it, enabling measurements like area, shape, and precise boundaries.
5) Decision rules: choosing the task based on the real-world question
Start from the question your system must answer. A useful way to decide is to map your needs to three progressively more specific questions: what is present, where it is, and how much of it (or what exact shape).
Rule A: Choose classification when you only need “what is present”
- Use classification if: the decision is global (image-level) and location does not matter.
- Examples: “Is there a crack in this X-ray?” “Which product category is this?” “Does this photo contain a dog?”
- Practical tip: if multiple things can co-occur, use multi-label classification and per-class thresholds.
Rule B: Choose detection when you need “where it is” at object level
- Use detection if: you need to locate objects, count them, or trigger actions based on approximate position.
- Examples: “Where are the pedestrians?” “How many bottles are on the conveyor?” “Find the license plate region.”
- Practical tip: plan for post-processing; confidence thresholds and NMS are part of making detections usable.
Rule C: Choose segmentation when you need “how much of it” or precise boundaries
- Use segmentation if: pixel-accurate shape or area matters, or you need to separate foreground/background precisely.
- Examples: “What is the exact tumor boundary?” “How much of the field is covered by weeds?” “Which pixels are drivable road?”
- Practical tip: decide between semantic vs instance segmentation based on whether you must distinguish individual objects (instance) or only classes (semantic).
Quick selection checklist
- If the output can be a single label (or a set of labels) for the whole image: classification.
- If the output must include rectangles around objects: detection.
- If the output must include pixel masks: segmentation (semantic for class-per-pixel, instance for object-per-mask).
- If annotation budget is tight: start with classification, then escalate to detection/segmentation only if the product requirements truly need localization or precise boundaries.
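As a toy summary, the checklist can be read as a small decision function; the names here are just for illustration, and real projects also weigh annotation budget and accuracy requirements:

def choose_task(needs_location, needs_pixel_masks, needs_separate_instances=False):
    # Maps the checklist questions to a task family
    if needs_pixel_masks:
        return "instance segmentation" if needs_separate_instances else "semantic segmentation"
    if needs_location:
        return "object detection"
    return "classification"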