From Pixels to Objects: Segmentation and Object Detection Concepts

Chapter 8

Estimated reading time: 10 minutes


Two Families of Methods: Classical Segmentation vs Learning-Based Detection

Robotics vision tasks often reduce to one question: “Which pixels belong to the thing I care about, and where is it?” Two broad method families answer this differently.

Classical (Rule-Based) Approaches

  • Thresholding: decide foreground/background by comparing pixel values to a fixed or adaptive threshold (e.g., intensity, depth, thermal).
  • Color segmentation: classify pixels by whether their color falls inside a chosen range (often in HSV/Lab rather than raw RGB).
  • Contour-based detection: find connected regions, extract contours, then filter by shape/size/geometry to decide if a region is an object.

These approaches are fast, interpretable, and can work well when the environment is controlled (consistent lighting, known colors, simple backgrounds). They typically output pixel regions and derived geometry (area, centroid, bounding rectangle).

Learning-Based Detection and Segmentation

  • Bounding-box detectors: output rectangles around objects plus class labels and confidence scores.
  • Instance segmentation: output a per-object pixel mask (often also a bounding box).
  • Keypoint detectors: output landmark points (e.g., corners of a pallet, joints of a robot arm, grasp points) with confidence.

These methods learn appearance variation from data and are more robust to clutter and moderate changes in lighting/background, but require careful dataset design, thresholding, and validation for robotics reliability.

What a Detector Outputs (and How to Read It)

Most modern detectors produce a list of N candidate detections. Each detection typically contains:

  • Geometry: bounding box (x, y, w, h) or (x1, y1, x2, y2); optionally a mask or keypoints.
  • Class label: e.g., person, forklift, box.
  • Confidence score: a number in [0,1] indicating how strongly the model believes the detection is correct.

Important: the confidence score is not automatically “probability of being correct” in a strict statistical sense. It is a model score that must be calibrated and validated in your domain.
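
For concreteness, a single detection can be represented as a small record like the one below; the field names and types are illustrative assumptions, not a standard API.

# Illustrative detection record (field names are an assumption, not a standard)
from dataclasses import dataclass
from typing import Optional
import numpy as np

@dataclass
class Detection:
    box: tuple[float, float, float, float]  # (x1, y1, x2, y2) in pixel coordinates
    label: str                              # e.g., "person", "forklift", "box"
    score: float                            # model confidence in [0, 1]; not a calibrated probability
    mask: Optional[np.ndarray] = None       # optional HxW boolean mask (instance segmentation)
    keypoints: Optional[np.ndarray] = None  # optional Kx2 array of (x, y) landmarks

# A detector's raw output is then a list of such records, one per candidate.
detections = [
    Detection(box=(120.0, 80.0, 220.0, 310.0), label="person", score=0.91),
    Detection(box=(400.0, 150.0, 480.0, 210.0), label="box", score=0.42),
]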


Interpreting Confidence Scores in Practice

  • High confidence, wrong class can happen when two classes look similar in your environment (class confusion).
  • Low confidence, correct object often happens for small objects, motion blur, occlusion, or unusual viewpoints.
  • Confidence thresholding trades off false positives vs false negatives; robotics safety often prefers different trade-offs depending on the action.

Errors You Must Plan For: FP, FN, and Class Confusion

False Positives (FP)

The system reports an object that is not actually present (e.g., a reflection detected as a person). In robotics, FPs can cause unnecessary stops or detours.

False Negatives (FN)

The system misses a real object (e.g., fails to detect a pallet corner). In robotics, FNs can be safety-critical if they lead to collision or unsafe interaction.

Class Confusion

The system detects an object but assigns the wrong class (e.g., “chair” vs “box”). This is especially important when different classes trigger different behaviors (slow down vs stop vs ignore).

Failure mode    | Typical cause                                                | Robotics impact
False positive  | Background patterns, reflections, repetitive textures        | Unnecessary braking, reduced throughput
False negative  | Small object, occlusion, unusual pose, domain shift          | Collision risk, missed grasp, unsafe proximity
Class confusion | Similar appearance, insufficient class diversity in training | Wrong policy/action selection

Classical Segmentation: Practical Step-by-Step Recipes

1) Thresholding + Connected Components (Fast Foreground Extraction)

Use when the object differs strongly from background in a single channel (intensity, depth, thermal, or a computed index).

  1. Choose the channel: e.g., grayscale intensity or depth.
  2. Pick threshold type: fixed threshold for stable conditions; adaptive threshold if illumination varies across the image.
  3. Binary mask: pixels above/below threshold become 1; others 0.
  4. Clean up: remove tiny blobs; fill holes if needed.
  5. Extract regions: connected components → area, centroid, bounding box.
  6. Filter by geometry: keep regions within expected size/aspect ratio range.
# Pseudocode (conceptual) for the threshold segmentation pipeline
mask = threshold(channel, T)
mask = remove_small_components(mask, min_area)
regions = connected_components(mask)
candidates = [r for r in regions if shape_ok(r)]
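
A minimal runnable version of the same pipeline, assuming OpenCV and NumPy and a single-channel input image; the helper name, parameter values, and the dictionary layout of each candidate are illustrative choices.

# Runnable sketch (assumes OpenCV/NumPy; parameter values are illustrative)
import cv2
import numpy as np

def threshold_candidates(frame_gray, thr=128, min_area=200, max_area=50000):
    # Fixed binary threshold; swap in cv2.adaptiveThreshold if illumination varies across the image.
    _, mask = cv2.threshold(frame_gray, thr, 255, cv2.THRESH_BINARY)
    # Morphological opening removes tiny noise blobs.
    mask = cv2.morphologyEx(mask, cv2.MORPH_OPEN, np.ones((3, 3), np.uint8))
    # Connected components give area, bounding box, and centroid per region.
    n, labels, stats, centroids = cv2.connectedComponentsWithStats(mask)
    candidates = []
    for i in range(1, n):  # label 0 is the background
        x, y, w, h, area = stats[i]
        if min_area <= area <= max_area:  # geometry filter
            candidates.append({"bbox": (x, y, w, h),
                               "centroid": tuple(centroids[i]),
                               "area": int(area)})
    return candidates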

2) Color Segmentation in HSV (Robust to Some Lighting Changes)

Use when the target has a distinctive color (e.g., safety vest, colored fiducial, painted parts) and the background is not similar.

  1. Convert color space: RGB → HSV (or Lab).
  2. Define color ranges: choose hue window(s) and saturation/value constraints.
  3. Create mask: mask = inRange(hsv, low, high).
  4. Post-process: remove noise blobs; optionally smooth edges.
  5. Find contours: compute contour(s) and bounding rectangles.
  6. Validate: reject candidates that violate expected size/shape.

Engineering note: if the object color is glossy, highlights can shift value/saturation; include constraints that tolerate specular variation or use multiple hue ranges.
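
A minimal sketch of steps 1-6, assuming OpenCV/NumPy and a BGR input frame; the hue/saturation/value range shown is only a placeholder and must be tuned for your target color (colors near hue 0, such as red, typically need two ranges).

# Color segmentation sketch (assumes OpenCV/NumPy; the color range is a placeholder)
import cv2
import numpy as np

def color_segment(frame_bgr, low=(0, 120, 70), high=(10, 255, 255), min_area=300):
    hsv = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2HSV)
    # Keep pixels whose (H, S, V) values fall inside the chosen range.
    mask = cv2.inRange(hsv, np.array(low), np.array(high))
    # Remove small noise blobs before extracting contours.
    mask = cv2.morphologyEx(mask, cv2.MORPH_OPEN, np.ones((5, 5), np.uint8))
    contours, _ = cv2.findContours(mask, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
    boxes = []
    for c in contours:
        if cv2.contourArea(c) < min_area:
            continue  # reject candidates that are too small
        boxes.append(cv2.boundingRect(c))  # (x, y, w, h)
    return mask, boxes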

3) Contour Detection + Shape Filters (When Geometry Is Known)

Use when the object has a predictable silhouette (e.g., circular markers, rectangular labels, known part outlines).

  1. Get a binary mask: via thresholding or color segmentation.
  2. Extract contours: find connected boundaries.
  3. Compute shape descriptors: area, perimeter, circularity, convexity, aspect ratio.
  4. Fit simple models: circle/ellipse/rectangle fitting if applicable.
  5. Reject outliers: keep only contours matching expected geometry.
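
A sketch of the shape-descriptor and model-fitting steps for circular targets, assuming OpenCV and a binary mask produced by one of the recipes above; the area and circularity bounds are illustrative.

# Contour + circularity filter sketch (assumes OpenCV; thresholds are illustrative)
import cv2
import numpy as np

def circular_contours(mask, min_area=200, min_circularity=0.8):
    contours, _ = cv2.findContours(mask, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
    accepted = []
    for c in contours:
        area = cv2.contourArea(c)
        perimeter = cv2.arcLength(c, True)
        if area < min_area or perimeter == 0:
            continue
        # Circularity = 4*pi*area / perimeter^2 equals 1.0 for a perfect circle.
        circularity = 4.0 * np.pi * area / (perimeter ** 2)
        if circularity >= min_circularity:
            (cx, cy), r = cv2.minEnclosingCircle(c)  # fit a simple circle model
            accepted.append({"center": (cx, cy), "radius": r, "circularity": circularity})
    return accepted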

Learning-Based Detection: Outputs, Thresholds, and Post-Processing

Bounding Boxes vs Masks vs Keypoints: Choosing the Right Output

  • Bounding boxes are sufficient for coarse navigation, obstacle awareness, and triggering behaviors (slow/stop).
  • Masks help when you need precise free-space reasoning, grasp planning, or separating touching objects.
  • Keypoints are ideal when the robot needs specific landmarks (handle location, corners for alignment, pose estimation inputs).

Non-Maximum Suppression (NMS): Removing Duplicate Detections

Detectors often propose multiple overlapping boxes for the same object. NMS keeps the best one and removes the rest based on overlap.

Conceptually:

  1. Sort detections by confidence (highest first).
  2. Take the top detection as “kept.”
  3. Remove any remaining detections whose overlap with the kept detection exceeds an IoU threshold.
  4. Repeat until no detections remain.
# NMS sketch (conceptual)
keep = []
dets = sort_by_score(dets)
while dets:
    d = dets.pop(0)
    keep.append(d)
    dets = [x for x in dets if IoU(x.box, d.box) < iou_thr]

Engineering trade-off: a lower IoU threshold removes more duplicates but can accidentally suppress nearby distinct objects (e.g., two boxes close together on a shelf).
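
A runnable version of the sketch above in plain Python, assuming each detection is a dict with "box" = (x1, y1, x2, y2) and "score"; that representation is an assumption for illustration, not a fixed API. In practice, NMS is usually applied per class so that overlapping objects of different classes do not suppress each other.

# Greedy NMS (assumes dict detections with "box" and "score" keys)
def iou(a, b):
    # Intersection-over-union of two (x1, y1, x2, y2) boxes.
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def nms(dets, iou_thr=0.5):
    # Keep the highest-scoring box, drop heavily overlapping ones, repeat.
    dets = sorted(dets, key=lambda d: d["score"], reverse=True)
    keep = []
    while dets:
        best = dets.pop(0)
        keep.append(best)
        dets = [d for d in dets if iou(d["box"], best["box"]) < iou_thr]
    return keep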

Temporal Smoothing: Stabilizing Detections Over Time

Robots operate on video streams, not single images. Frame-to-frame jitter can cause unstable behavior (start/stop oscillations). Temporal smoothing reduces noise by combining evidence across frames.

  • Score smoothing: average or exponential moving average of confidence for a tracked object.
  • Geometry smoothing: smooth box center/size over time to reduce jitter.
  • Persistence rules: require K consecutive frames above threshold before acting; require M misses before declaring the object gone.
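
A minimal sketch combining score smoothing (an exponential moving average) with K-frame confirmation and M-miss deletion; all parameter values are illustrative and should be tuned on validation data.

# Score smoothing + persistence rules (parameter values are illustrative)
class SmoothedTrack:
    def __init__(self, alpha=0.3, k_confirm=3, m_miss=5, score_thr=0.5):
        self.alpha = alpha          # EMA weight given to new evidence
        self.k_confirm = k_confirm  # consecutive frames above threshold before acting
        self.m_miss = m_miss        # consecutive misses before declaring the object gone
        self.score_thr = score_thr
        self.ema_score = 0.0
        self.hits = 0
        self.misses = 0

    def update(self, score_or_none):
        # Call once per frame with the detection score, or None if no detection matched.
        if score_or_none is None:
            self.misses += 1
            self.hits = 0
        else:
            self.misses = 0
            self.ema_score = self.alpha * score_or_none + (1 - self.alpha) * self.ema_score
            self.hits = self.hits + 1 if self.ema_score >= self.score_thr else 0

    @property
    def confirmed(self):
        return self.hits >= self.k_confirm

    @property
    def lost(self):
        return self.misses >= self.m_miss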

Tracking-by-Detection: Turning Per-Frame Detections into Object Tracks

Tracking-by-detection links detections across frames to maintain object identity and motion estimates. This helps with:

  • Reducing false positives: single-frame blips can be ignored if they don’t persist.
  • Handling brief occlusions: keep a track alive for a short time without detections.
  • Predicting motion: anticipate where an object will be next frame for better association.

A typical association loop:

  1. Run detector → candidate boxes with scores.
  2. Predict existing track positions (simple constant-velocity model is often enough).
  3. Match detections to tracks using IoU and/or distance between centers; optionally include class consistency.
  4. Update matched tracks; create new tracks for unmatched detections above a “birth” threshold.
  5. Delete tracks that have been unmatched for too long.
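
A deliberately simplified sketch of that loop, reusing the iou() helper from the NMS example; production trackers add motion prediction (e.g., a constant-velocity or Kalman model) and globally optimal matching (e.g., the Hungarian algorithm) instead of this greedy pass.

# Greedy IoU association sketch (track/detection dicts are an illustrative representation)
def associate(tracks, detections, iou_thr=0.3, birth_thr=0.6, max_misses=10):
    unmatched = list(detections)
    for track in tracks:
        best, best_iou = None, iou_thr
        for det in unmatched:
            overlap = iou(track["box"], det["box"])
            if overlap > best_iou:
                best, best_iou = det, overlap
        if best is not None:
            track["box"], track["misses"] = best["box"], 0  # update the matched track
            unmatched.remove(best)
        else:
            track["misses"] += 1  # no detection matched this track in this frame
    # Birth: start new tracks only for confident unmatched detections.
    tracks += [{"box": d["box"], "label": d["label"], "misses": 0}
               for d in unmatched if d["score"] >= birth_thr]
    # Death: drop tracks that have gone unmatched for too long.
    return [t for t in tracks if t["misses"] <= max_misses]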

Engineering Checklist: Making Detection and Segmentation Work on a Robot

1) Dataset Domain Match (The #1 Source of Surprises)

  • Environment match: floors, walls, shelves, outdoor scenes, reflective surfaces, dust/rain.
  • Object variants: different sizes, colors, wear, packaging, partial occlusion.
  • Operational lighting: day/night, shadows, flicker, backlighting.
  • Motion conditions: robot speed, object speed, blur frequency.

Checklist question: “Does my training/validation data include the exact failure cases I fear in deployment?” If not, treat confidence thresholds as untrusted until you collect representative data.

2) Camera Viewpoint Constraints (What Angles Will You Actually See?)

  • Mounting height and tilt: changes apparent object shape and size.
  • Expected pose range: front view vs top-down vs oblique.
  • Occlusion patterns: shelves, people, robot arm self-occlusion.

Practical step: build a “viewpoint envelope” document: min/max distance, min/max angle, and typical occluders. Use it to drive data collection and acceptance tests.

3) Minimum Object Pixel Size (Detectability Budget)

Detectors and segmenters fail when the object is too small in the image. Define a minimum pixel size requirement and enforce it in system design.

  • For bounding boxes: define minimum box height/width in pixels (e.g., at least 20–40 px on the shortest side, depending on model and class).
  • For masks: define minimum mask area in pixels; tiny masks are unstable.
  • For keypoints: define minimum inter-keypoint distance; if keypoints collapse into a few pixels, localization becomes noisy.

Practical step-by-step:

  1. Pick the smallest object instance you must detect safely.
  2. Measure its pixel size at the farthest operational distance and worst-case viewpoint.
  3. If below your model’s reliable range, change one of: camera placement, resolution, minimum distance policy, or model choice.
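
Under a simple pinhole camera model, the projected size is approximately focal_length_px * object_size / distance, so step 2 can be estimated on paper before collecting data; the numbers below are purely illustrative.

# Pixel-size budget estimate (pinhole approximation; numbers are illustrative)
def projected_size_px(object_size_m, distance_m, focal_length_px):
    # Ignores lens distortion and viewing angle; use as a first-order budget check.
    return focal_length_px * object_size_m / distance_m

# Example: a 0.10 m feature seen from 6 m with a 900 px focal length spans ~15 px,
# which falls below a 20-40 px detectability budget and forces a design change.
print(projected_size_px(0.10, 6.0, 900.0))  # -> 15.0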

4) Threshold Setting for Safety-Critical Robotics

In safety-critical behaviors, thresholds are not just “accuracy tuning”; they define risk.

  • Separate thresholds by action: e.g., a low threshold to trigger “slow down,” a higher threshold to trigger “hard stop,” and an even higher threshold to trigger “manipulate/grasp.”
  • Use asymmetric cost thinking: if a false negative could cause collision, prefer lower thresholds and add temporal confirmation + tracking to control false positives.
  • Per-class thresholds: some classes are inherently harder; tune thresholds per class based on validation data.

Practical step-by-step threshold workflow:

  1. Define the safety objective: e.g., “detect humans within 3 m with < X missed detections per hour.”
  2. Collect scenario-specific validation clips: include near-misses, occlusions, and edge lighting.
  3. Sweep confidence thresholds: compute FP/FN rates per scenario and per class.
  4. Choose thresholds per action: pick operating points that satisfy safety constraints, not just average metrics.
  5. Add temporal rules: require persistence for non-emergency actions; allow immediate reaction for emergency stop with lower threshold.
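
One way to make step 4 concrete is to version the chosen operating points as explicit per-class, per-action thresholds; the structure and every number below are illustrative placeholders, not recommended values.

# Illustrative per-class, per-action operating points chosen from a validation sweep
THRESHOLDS = {
    "person":   {"slow_down": 0.30, "hard_stop": 0.50, "grasp": None},  # never grasp people
    "box":      {"slow_down": 0.40, "hard_stop": 0.60, "grasp": 0.80},
    "forklift": {"slow_down": 0.35, "hard_stop": 0.55, "grasp": None},
}

def triggered_actions(det):
    # Returns which actions this detection's class and score justify.
    per_class = THRESHOLDS.get(det["label"], {})
    return [action for action, thr in per_class.items()
            if thr is not None and det["score"] >= thr]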

Validation: Confusion Matrices and Scenario-Based Tests

Confusion Matrix (Classification Quality Under Detection)

A confusion matrix counts how often true classes are predicted as each class. For detectors, you typically compute it after matching predictions to ground truth objects (e.g., by IoU for boxes or overlap for masks).

  • Diagonal entries: correct class predictions.
  • Off-diagonal entries: class confusion (wrong label).
  • Extra “background” row/column: captures false positives (predicted object with no match) and false negatives (missed ground truth).

Engineering use: if “box” is often predicted as “trash bin,” you may need more data diversity, clearer labeling rules, or to merge classes if the robot behavior is identical.
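
A simplified sketch of building such a matrix after greedy IoU matching, reusing the iou() helper from the NMS example; the fixed 0.5 IoU threshold and the greedy (rather than optimal) matching are simplifying assumptions.

# Detection confusion matrix with an extra "background" row/column (simplified)
import numpy as np

def detection_confusion_matrix(gts, preds, classes, iou_thr=0.5):
    # Rows = true class (last row = background); columns = predicted class (last column = background).
    idx = {c: i for i, c in enumerate(classes)}
    bg = len(classes)
    cm = np.zeros((len(classes) + 1, len(classes) + 1), dtype=int)
    used = set()
    for gt in gts:
        best_j, best_iou = None, iou_thr
        for j, pr in enumerate(preds):
            if j in used:
                continue
            overlap = iou(gt["box"], pr["box"])
            if overlap > best_iou:
                best_j, best_iou = j, overlap
        if best_j is None:
            cm[idx[gt["label"]], bg] += 1  # false negative: missed ground-truth object
        else:
            used.add(best_j)
            cm[idx[gt["label"]], idx[preds[best_j]["label"]]] += 1  # correct or class confusion
    for j, pr in enumerate(preds):
        if j not in used:
            cm[bg, idx[pr["label"]]] += 1  # false positive: prediction with no match
    return cm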

Scenario-Based Tests (Robotics-Relevant Acceptance)

Average dataset metrics can hide failures that matter in the field. Build tests around operational scenarios:

  • Distance sweep: run the robot (or replay logs) at increasing distance; record detection rate vs distance.
  • Viewpoint sweep: vary yaw/pitch angles; record failures at extreme angles.
  • Occlusion cases: partial visibility behind shelves/people/arm; measure time-to-detect after reappearance.
  • Lighting stress: backlight, shadows, flicker; measure FP spikes.
  • Clutter and look-alikes: backgrounds containing similar textures/colors; measure class confusion and FP rate.

Practical step-by-step test plan template:

  1. Define pass/fail criteria: e.g., “no more than 1 false stop per 30 minutes,” “detect obstacle within 0.5 s of entering ROI.”
  2. Define scenarios and parameters: distance, speed, occlusion percentage, lighting condition.
  3. Run repeated trials: enough repetitions to estimate variability (not just one run).
  4. Log raw outputs: boxes/masks, scores, NMS settings, track IDs, and robot actions.
  5. Analyze by scenario: compute FP/FN, confusion matrix, and time-to-detect/time-to-clear.

Putting It Together: A Deployment-Oriented Processing Stack

A typical robotics perception stack for objects combines multiple layers:

  • Detector/segmenter: produces candidate boxes/masks/keypoints with scores.
  • Post-processing: NMS, class filtering, per-class thresholds.
  • Temporal logic: smoothing, persistence, track management.
  • Decision interface: converts perception outputs into robot-relevant signals (e.g., nearest obstacle distance, human-in-zone flag, grasp pose candidate list).
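
A schematic per-frame composition of those layers, reusing helpers sketched earlier in this chapter (nms, associate, THRESHOLDS); it is a structural outline under those assumptions, not a complete system.

# Per-frame stack outline (reuses the illustrative nms/associate/THRESHOLDS sketches)
def process_frame(frame, detector, tracks):
    # 1. Detector/segmenter: raw candidates with class labels and scores.
    dets = detector(frame)
    # 2. Post-processing: per-class confidence filter, then NMS; unknown classes are dropped.
    dets = [d for d in dets
            if d["score"] >= THRESHOLDS.get(d["label"], {}).get("slow_down", 1.1)]
    dets = nms(dets, iou_thr=0.5)
    # 3. Temporal logic: associate detections with existing tracks across frames.
    tracks = associate(tracks, dets)
    # 4. Decision interface: convert tracks into robot-relevant signals.
    human_in_zone = any(t["label"] == "person" for t in tracks)
    return tracks, human_in_zone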

Engineering reminder: treat thresholds, NMS IoU, and temporal parameters as part of the safety case. Version them, test them, and tie them to scenario-based acceptance results.

Now answer the exercise about the content:

In a robotics vision pipeline, why might you apply temporal smoothing and persistence rules after running an object detector on each frame?


Answer: Robots process video streams, so per-frame detections can jitter or briefly appear as blips. Temporal smoothing and persistence rules combine evidence across frames to stabilize geometry and scores, and they reduce false positives by requiring detections to persist before the robot acts.

Next chapter

Depth Sensing Options: Stereo, Structured Light, Time-of-Flight, and Monocular Cues
