Two Families of Methods: Classical Segmentation vs Learning-Based Detection
Robotics vision tasks often reduce to one question: “Which pixels belong to the thing I care about, and where is it?” Two broad method families answer this differently.
Classical (Rule-Based) Approaches
- Thresholding: decide foreground/background by comparing pixel values to a fixed or adaptive threshold (e.g., intensity, depth, thermal).
- Color segmentation: classify pixels by whether their color falls inside a chosen range (often in HSV/Lab rather than raw RGB).
- Contour-based detection: find connected regions, extract contours, then filter by shape/size/geometry to decide if a region is an object.
These approaches are fast, interpretable, and can work well when the environment is controlled (consistent lighting, known colors, simple backgrounds). They typically output pixel regions and derived geometry (area, centroid, bounding rectangle).
Learning-Based Detection and Segmentation
- Bounding-box detectors: output rectangles around objects plus class labels and confidence scores.
- Instance segmentation: output a per-object pixel mask (often also a bounding box).
- Keypoint detectors: output landmark points (e.g., corners of a pallet, joints of a robot arm, grasp points) with confidence.
These methods learn appearance variation from data and are more robust to clutter and moderate changes in lighting/background, but they require careful dataset design, confidence-threshold selection, and validation for robotics reliability.
What a Detector Outputs (and How to Read It)
Most modern detectors produce a list of N candidate detections. Each detection typically contains:
- Geometry: bounding box `(x, y, w, h)` or `(x1, y1, x2, y2)`; optionally a mask or keypoints.
- Class label: e.g., `person`, `forklift`, `box`.
- Confidence score: a number in `[0, 1]` indicating how strongly the model believes the detection is correct.
Important: the confidence score is not automatically “probability of being correct” in a strict statistical sense. It is a model score that must be calibrated and validated in your domain.
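As an illustration, a single detection is often carried around as a small record like the following; the field names here are hypothetical and vary between frameworks.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple

import numpy as np

@dataclass
class Detection:
    box_xyxy: Tuple[float, float, float, float]           # (x1, y1, x2, y2) in pixels
    label: str                                             # e.g. "person", "forklift", "box"
    score: float                                           # model confidence in [0, 1]
    mask: Optional[np.ndarray] = None                      # per-pixel mask, if the model outputs one
    keypoints: Optional[List[Tuple[float, float]]] = None  # landmark points, if applicable
```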
Interpreting Confidence Scores in Practice
- High confidence, wrong class can happen when two classes look similar in your environment (class confusion).
- Low confidence, correct object often happens for small objects, motion blur, occlusion, or unusual viewpoints.
- Confidence thresholding trades off false positives vs false negatives; robotics safety often prefers different trade-offs depending on the action.
Errors You Must Plan For: FP, FN, and Class Confusion
False Positives (FP)
The system reports an object that is not actually present (e.g., a reflection detected as a person). In robotics, FPs can cause unnecessary stops or detours.
False Negatives (FN)
The system misses a real object (e.g., fails to detect a pallet corner). In robotics, FNs can be safety-critical if they lead to collision or unsafe interaction.
Class Confusion
The system detects an object but assigns the wrong class (e.g., “chair” vs “box”). This is especially important when different classes trigger different behaviors (slow down vs stop vs ignore).
| Failure mode | Typical cause | Robotics impact |
|---|---|---|
| False positive | Background patterns, reflections, repetitive textures | Unnecessary braking, reduced throughput |
| False negative | Small object, occlusion, unusual pose, domain shift | Collision risk, missed grasp, unsafe proximity |
| Class confusion | Similar appearance, insufficient class diversity in training | Wrong policy/action selection |
Classical Segmentation: Practical Step-by-Step Recipes
1) Thresholding + Connected Components (Fast Foreground Extraction)
Use when the object differs strongly from background in a single channel (intensity, depth, thermal, or a computed index).
- Choose the channel: e.g., grayscale intensity or depth.
- Pick threshold type: fixed threshold for stable conditions; adaptive threshold if illumination varies across the image.
- Binary mask: pixels above/below threshold become 1; others 0.
- Clean up: remove tiny blobs; fill holes if needed.
- Extract regions: connected components → area, centroid, bounding box.
- Filter by geometry: keep regions within expected size/aspect ratio range.
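Concretely, these steps map onto OpenCV roughly as in the sketch below; it is a minimal illustration, and the threshold `T`, `min_area`, and the aspect-ratio limit are placeholders to tune for your sensor and scene. The conceptual pseudocode that follows condenses the same flow.

```python
import cv2
import numpy as np

def threshold_segment(channel, T=128, min_area=200, max_aspect=4.0):
    """Threshold a single-channel image and return candidate regions."""
    # Binary mask: pixels above T become 255, everything else 0.
    _, mask = cv2.threshold(channel, T, 255, cv2.THRESH_BINARY)
    # Clean up: a small morphological opening removes tiny blobs.
    mask = cv2.morphologyEx(mask, cv2.MORPH_OPEN, np.ones((3, 3), np.uint8))
    # Connected components give per-region area, centroid, and bounding box.
    n_labels, _, stats, centroids = cv2.connectedComponentsWithStats(mask)
    candidates = []
    for i in range(1, n_labels):  # label 0 is the background
        x, y, w, h, area = stats[i]
        aspect = max(w, h) / max(min(w, h), 1)
        if area >= min_area and aspect <= max_aspect:  # geometry filter
            candidates.append({"bbox": (x, y, w, h), "centroid": tuple(centroids[i])})
    return candidates
```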
```python
# Pseudocode (conceptual) for threshold segmentation pipeline
mask = threshold(channel, T)
mask = remove_small_components(mask, min_area)
regions = connected_components(mask)
candidates = [r for r in regions if shape_ok(r)]
```

2) Color Segmentation in HSV (Robust to Some Lighting Changes)
Use when the target has a distinctive color (e.g., safety vest, colored fiducial, painted parts) and the background is not similar.
- Convert color space: RGB → HSV (or Lab).
- Define color ranges: choose hue window(s) and saturation/value constraints.
- Create mask: `mask = inRange(hsv, low, high)`.
- Post-process: remove noise blobs; optionally smooth edges.
- Find contours: compute contour(s) and bounding rectangles.
- Validate: reject candidates that violate expected size/shape.
Engineering note: if the object color is glossy, highlights can shift value/saturation; include constraints that tolerate specular variation or use multiple hue ranges.
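A minimal OpenCV sketch of this recipe is shown below; the HSV range (a blue-ish window here) and the minimum contour area are illustrative values you would calibrate against your own target and lighting.

```python
import cv2
import numpy as np

def segment_by_color(bgr, low_hsv=(100, 80, 50), high_hsv=(130, 255, 255), min_area=300):
    """Segment pixels whose HSV color falls inside a given range (blue-ish by default)."""
    hsv = cv2.cvtColor(bgr, cv2.COLOR_BGR2HSV)
    mask = cv2.inRange(hsv, np.array(low_hsv), np.array(high_hsv))
    # Remove noise blobs and close small gaps in the mask.
    kernel = np.ones((5, 5), np.uint8)
    mask = cv2.morphologyEx(mask, cv2.MORPH_OPEN, kernel)
    mask = cv2.morphologyEx(mask, cv2.MORPH_CLOSE, kernel)
    # Contours and bounding rectangles of the remaining regions.
    contours, _ = cv2.findContours(mask, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
    boxes = [cv2.boundingRect(c) for c in contours if cv2.contourArea(c) >= min_area]
    return mask, boxes
```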
3) Contour Detection + Shape Filters (When Geometry Is Known)
Use when the object has a predictable silhouette (e.g., circular markers, rectangular labels, known part outlines).
- Get a binary mask: via thresholding or color segmentation.
- Extract contours: find connected boundaries.
- Compute shape descriptors: area, perimeter, circularity, convexity, aspect ratio.
- Fit simple models: circle/ellipse/rectangle fitting if applicable.
- Reject outliers: keep only contours matching expected geometry.
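For a circular marker, the shape filter might look like the sketch below; the circularity measure 4πA/P² equals 1 for a perfect circle and drops for elongated or irregular shapes, and the cut-off values are assumptions to tune.

```python
import math

import cv2

def filter_circular_contours(mask, min_area=200, min_circularity=0.8):
    """Keep contours whose circularity suggests a roughly circular silhouette."""
    contours, _ = cv2.findContours(mask, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
    accepted = []
    for c in contours:
        area = cv2.contourArea(c)
        perimeter = cv2.arcLength(c, True)
        if area < min_area or perimeter == 0:
            continue
        circularity = 4 * math.pi * area / (perimeter ** 2)
        if circularity >= min_circularity:
            accepted.append(c)
    return accepted
```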
Learning-Based Detection: Outputs, Thresholds, and Post-Processing
Bounding Boxes vs Masks vs Keypoints: Choosing the Right Output
- Bounding boxes are sufficient for coarse navigation, obstacle awareness, and triggering behaviors (slow/stop).
- Masks help when you need precise free-space reasoning, grasp planning, or separating touching objects.
- Keypoints are ideal when the robot needs specific landmarks (handle location, corners for alignment, pose estimation inputs).
Non-Maximum Suppression (NMS): Removing Duplicate Detections
Detectors often propose multiple overlapping boxes for the same object. NMS keeps the best one and removes the rest based on overlap.
Conceptually:
- Sort detections by confidence (highest first).
- Take the top detection as “kept.”
- Remove any remaining detections whose overlap with the kept detection exceeds an IoU threshold.
- Repeat until no detections remain.
```python
# NMS sketch (conceptual)
keep = []
dets = sort_by_score(dets)
while dets:
    d = dets.pop(0)
    keep.append(d)
    dets = [x for x in dets if IoU(x.box, d.box) < iou_thr]
```

Engineering trade-off: a lower IoU threshold removes more duplicates but can accidentally suppress nearby distinct objects (e.g., two boxes close together on a shelf).
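For reference, a self-contained version of this sketch, with an explicit IoU computation and boxes assumed to be (x1, y1, x2, y2) tuples paired with scores, might look like this:

```python
def iou(a, b):
    """Intersection-over-union of two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter + 1e-9)

def nms(dets, iou_thr=0.5):
    """dets: list of (box, score) pairs. Returns the detections kept after suppression."""
    dets = sorted(dets, key=lambda d: d[1], reverse=True)  # highest score first
    keep = []
    while dets:
        best = dets.pop(0)
        keep.append(best)
        dets = [d for d in dets if iou(d[0], best[0]) < iou_thr]
    return keep
```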
Temporal Smoothing: Stabilizing Detections Over Time
Robots operate on video streams, not single images. Frame-to-frame jitter can cause unstable behavior (start/stop oscillations). Temporal smoothing reduces noise by combining evidence across frames.
- Score smoothing: average or exponential moving average of confidence for a tracked object.
- Geometry smoothing: smooth box center/size over time to reduce jitter.
- Persistence rules: require `K` consecutive frames above threshold before acting; require `M` misses before declaring the object gone.
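A minimal sketch combining score smoothing (an exponential moving average) with K/M persistence rules is shown below; the smoothing factor, frame counts, and threshold are illustrative values, not recommendations.

```python
class SmoothedTrackState:
    """Exponential moving average of confidence plus simple K/M persistence rules."""

    def __init__(self, alpha=0.4, k_confirm=3, m_miss=5, act_thr=0.6):
        self.alpha, self.k_confirm, self.m_miss, self.act_thr = alpha, k_confirm, m_miss, act_thr
        self.score = 0.0
        self.hits = 0    # consecutive frames with smoothed score above threshold
        self.misses = 0  # consecutive frames without a detection

    def update(self, detection_score=None):
        if detection_score is None:  # no detection this frame
            self.misses += 1
            self.hits = 0
        else:
            self.score = self.alpha * detection_score + (1 - self.alpha) * self.score
            self.misses = 0
            self.hits = self.hits + 1 if self.score >= self.act_thr else 0
        return {
            "act": self.hits >= self.k_confirm,   # enough consecutive confirmations to act
            "gone": self.misses >= self.m_miss,   # enough consecutive misses to drop the object
        }
```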
Tracking-by-Detection: Turning Per-Frame Detections into Object Tracks
Tracking-by-detection links detections across frames to maintain object identity and motion estimates. This helps with:
- Reducing false positives: single-frame blips can be ignored if they don’t persist.
- Handling brief occlusions: keep a track alive for a short time without detections.
- Predicting motion: anticipate where an object will be next frame for better association.
A typical association loop:
- Run detector → candidate boxes with scores.
- Predict existing track positions (simple constant-velocity model is often enough).
- Match detections to tracks using IoU and/or distance between centers; optionally include class consistency.
- Update matched tracks; create new tracks for unmatched detections above a “birth” threshold.
- Delete tracks that have been unmatched for too long.
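A stripped-down greedy version of this loop might look like the sketch below; the constant-velocity prediction step is omitted for brevity, and it reuses an `iou(box_a, box_b)` helper like the one sketched in the NMS example.

```python
def associate(tracks, detections, iou_gate=0.3, birth_score=0.5, max_misses=10):
    """Greedy IoU matching of detections to existing tracks for one frame.

    tracks: list of dicts with keys 'id', 'box', 'misses'
    detections: list of dicts with keys 'box', 'score'
    """
    unmatched = list(range(len(detections)))
    for trk in tracks:
        # Best remaining detection for this track by IoU, gated at iou_gate.
        best_j, best_iou = None, iou_gate
        for j in unmatched:
            overlap = iou(trk["box"], detections[j]["box"])
            if overlap >= best_iou:
                best_j, best_iou = j, overlap
        if best_j is not None:
            trk["box"] = detections[best_j]["box"]  # update with the matched detection
            trk["misses"] = 0
            unmatched.remove(best_j)
        else:
            trk["misses"] += 1                      # no match this frame
    # Births: unmatched detections above the birth threshold become new tracks.
    next_id = max((t["id"] for t in tracks), default=-1) + 1
    for j in unmatched:
        if detections[j]["score"] >= birth_score:
            tracks.append({"id": next_id, "box": detections[j]["box"], "misses": 0})
            next_id += 1
    # Deaths: drop tracks that have gone unmatched for too long.
    return [t for t in tracks if t["misses"] <= max_misses]
```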
Engineering Checklist: Making Detection and Segmentation Work on a Robot
1) Dataset Domain Match (The #1 Source of Surprises)
- Environment match: floors, walls, shelves, outdoor scenes, reflective surfaces, dust/rain.
- Object variants: different sizes, colors, wear, packaging, partial occlusion.
- Operational lighting: day/night, shadows, flicker, backlighting.
- Motion conditions: robot speed, object speed, blur frequency.
Checklist question: “Does my training/validation data include the exact failure cases I fear in deployment?” If not, treat confidence thresholds as untrusted until you collect representative data.
2) Camera Viewpoint Constraints (What Angles Will You Actually See?)
- Mounting height and tilt: changes apparent object shape and size.
- Expected pose range: front view vs top-down vs oblique.
- Occlusion patterns: shelves, people, robot arm self-occlusion.
Practical step: build a “viewpoint envelope” document: min/max distance, min/max angle, and typical occluders. Use it to drive data collection and acceptance tests.
3) Minimum Object Pixel Size (Detectability Budget)
Detectors and segmenters fail when the object is too small in the image. Define a minimum pixel size requirement and enforce it in system design.
- For bounding boxes: define minimum box height/width in pixels (e.g., at least 20–40 px on the shortest side, depending on model and class).
- For masks: define minimum mask area in pixels; tiny masks are unstable.
- For keypoints: define minimum inter-keypoint distance; if keypoints collapse into a few pixels, localization becomes noisy.
Practical step-by-step:
- Pick the smallest object instance you must detect safely.
- Measure its pixel size at the farthest operational distance and worst-case viewpoint.
- If below your model’s reliable range, change one of: camera placement, resolution, minimum distance policy, or model choice.
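For step 2, a quick pinhole-camera estimate is often enough to check whether the smallest object clears your pixel budget; the focal length, object height, distance, and 30 px cut-off below are placeholder numbers for your own setup.

```python
def apparent_pixel_height(object_height_m, distance_m, focal_length_px):
    """Approximate image height of an object under a pinhole camera model."""
    return focal_length_px * object_height_m / distance_m

# Example: a 0.30 m object at 8 m with a 900 px focal length
h_px = apparent_pixel_height(0.30, 8.0, 900.0)  # ~34 px
meets_budget = h_px >= 30                        # compare against your minimum-size rule
```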
4) Threshold Setting for Safety-Critical Robotics
In safety-critical behaviors, thresholds are not just “accuracy tuning”; they define risk.
- Separate thresholds by action: e.g., a low threshold to trigger “slow down,” a higher threshold to trigger “hard stop,” and an even higher threshold to trigger “manipulate/grasp.”
- Use asymmetric cost thinking: if a false negative could cause collision, prefer lower thresholds and add temporal confirmation + tracking to control false positives.
- Per-class thresholds: some classes are inherently harder; tune thresholds per class based on validation data.
Practical step-by-step threshold workflow:
- Define the safety objective: e.g., “detect humans within 3 m with < X missed detections per hour.”
- Collect scenario-specific validation clips: include near-misses, occlusions, and edge lighting.
- Sweep confidence thresholds: compute FP/FN rates per scenario and per class.
- Choose thresholds per action: pick operating points that satisfy safety constraints, not just average metrics.
- Add temporal rules: require persistence for non-emergency actions; allow immediate reaction for emergency stop with lower threshold.
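One way to encode the per-class, per-action operating points chosen in this workflow is a simple lookup like the sketch below; the class names and threshold values are placeholders, not recommendations.

```python
# Per-class confidence thresholds for each action, chosen from validation sweeps.
THRESHOLDS = {
    "person": {"slow": 0.30, "stop": 0.50, "grasp": None},  # never grasp people
    "box":    {"slow": 0.40, "stop": 0.60, "grasp": 0.80},
}

def actions_for(detection):
    """Return the set of actions a single detection is allowed to trigger."""
    per_action = THRESHOLDS.get(detection["label"], {})
    return {
        action
        for action, thr in per_action.items()
        if thr is not None and detection["score"] >= thr
    }

# Example: a person detected at confidence 0.55 triggers slow-down and stop.
print(actions_for({"label": "person", "score": 0.55}))  # e.g. {'slow', 'stop'}
```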
Validation: Confusion Matrices and Scenario-Based Tests
Confusion Matrix (Classification Quality Under Detection)
A confusion matrix counts how often true classes are predicted as each class. For detectors, you typically compute it after matching predictions to ground truth objects (e.g., by IoU for boxes or overlap for masks).
- Diagonal entries: correct class predictions.
- Off-diagonal entries: class confusion (wrong label).
- Extra “background” row/column: captures false positives (predicted object with no match) and false negatives (missed ground truth).
Engineering use: if “box” is often predicted as “trash bin,” you may need more data diversity, clearer labeling rules, or to merge classes if the robot behavior is identical.
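A sketch of how such a matrix can be accumulated after matching is shown below; the 0.5 IoU threshold is illustrative, greedy matching is only one of several matching strategies, and `iou` is a box-overlap helper like the one in the NMS example.

```python
from collections import defaultdict

def detection_confusion(gts, preds, iou_thr=0.5):
    """Count (true class, predicted class) pairs, with 'background' for FP/FN.

    gts and preds are lists of dicts with keys 'box' and 'label'.
    """
    matrix = defaultdict(int)
    used = set()
    for gt in gts:
        # Greedily match each ground-truth object to the best unused prediction.
        best_j, best_iou = None, iou_thr
        for j, pr in enumerate(preds):
            if j in used:
                continue
            overlap = iou(gt["box"], pr["box"])
            if overlap >= best_iou:
                best_j, best_iou = j, overlap
        if best_j is None:
            matrix[(gt["label"], "background")] += 1         # false negative
        else:
            used.add(best_j)
            matrix[(gt["label"], preds[best_j]["label"])] += 1
    for j, pr in enumerate(preds):
        if j not in used:
            matrix[("background", pr["label"])] += 1          # false positive
    return matrix
```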
Scenario-Based Tests (Robotics-Relevant Acceptance)
Average dataset metrics can hide failures that matter in the field. Build tests around operational scenarios:
- Distance sweep: run the robot (or replay logs) at increasing distance; record detection rate vs distance.
- Viewpoint sweep: vary yaw/pitch angles; record failures at extreme angles.
- Occlusion cases: partial visibility behind shelves/people/arm; measure time-to-detect after reappearance.
- Lighting stress: backlight, shadows, flicker; measure FP spikes.
- Clutter and look-alikes: backgrounds containing similar textures/colors; measure class confusion and FP rate.
Practical step-by-step test plan template:
- Define pass/fail criteria: e.g., “no more than 1 false stop per 30 minutes,” “detect obstacle within 0.5 s of entering ROI.”
- Define scenarios and parameters: distance, speed, occlusion percentage, lighting condition.
- Run repeated trials: enough repetitions to estimate variability (not just one run).
- Log raw outputs: boxes/masks, scores, NMS settings, track IDs, and robot actions.
- Analyze by scenario: compute FP/FN, confusion matrix, and time-to-detect/time-to-clear.
Putting It Together: A Deployment-Oriented Processing Stack
A typical robotics perception stack for objects combines multiple layers:
- Detector/segmenter: produces candidate boxes/masks/keypoints with scores.
- Post-processing: NMS, class filtering, per-class thresholds.
- Temporal logic: smoothing, persistence, track management.
- Decision interface: converts perception outputs into robot-relevant signals (e.g., nearest obstacle distance, human-in-zone flag, grasp pose candidate list).
Engineering reminder: treat thresholds, NMS IoU, and temporal parameters as part of the safety case. Version them, test them, and tie them to scenario-based acceptance results.