Robotics Vision Task: Obstacle Detection and Free-Space Estimation

Chapter 12

What “Obstacle Detection” and “Free Space” Mean for Navigation

For a mobile robot, obstacle detection answers “what volume is unsafe to enter,” while free-space estimation answers “where can I drive right now.” The navigation stack typically consumes one or more of these outputs (a minimal sketch of these containers follows the list):

  • Occupancy grid: 2D map where each cell is free/occupied/unknown (often probabilistic).
  • Costmap: like an occupancy grid but with graded costs (inflation around obstacles, traversability penalties).
  • Obstacle list: sparse set of obstacle points/segments/polygons in robot frame.
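
To make the later steps concrete, here is a minimal sketch of how these outputs might be held in memory. The grid size, resolution, state encoding (-1 unknown, 0 free, 1 occupied), and cost range are illustrative choices for this chapter, not a requirement of any particular framework.

    import numpy as np

    # Illustrative conventions used by the sketches in this chapter:
    # occupancy grid: -1 = unknown, 0 = free, 1 = occupied
    # costmap:        0 = no cost ... 100 = lethal
    GRID_SIZE = 200            # 200 x 200 cells
    RESOLUTION = 0.05          # 5 cm per cell -> 10 m x 10 m robot-centred window

    occupancy = np.full((GRID_SIZE, GRID_SIZE), -1, dtype=np.int8)  # all unknown at start
    costmap = np.zeros((GRID_SIZE, GRID_SIZE), dtype=np.uint8)

    # Obstacle list: sparse obstacles in the robot frame, e.g. (x_m, y_m, radius_m) per cluster
    obstacle_list = [(1.2, -0.4, 0.15), (2.0, 0.9, 0.30)]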

This chapter presents two practical approaches and how to turn their results into navigation-friendly representations: (1) depth-based obstacle detection (point cloud to occupancy) and (2) monocular cues (semantic segmentation, ground-plane estimation, motion cues). Each approach is organized as sensing choice, preprocessing, detection logic, and output representation.

Approach 1: Depth-Based Obstacle Detection (Point Cloud → Occupancy)

Sensing Choice: When Depth Is the Right Tool

Depth-based obstacle detection is usually the most direct path to reliable free-space estimation because it measures geometry. It works well for cluttered indoor spaces, narrow passages, and low-texture surfaces where monocular methods can struggle. However, depth sensors can fail on reflective/transparent materials (glass, shiny metal), at range limits, and in strong sunlight (for some technologies). Plan to represent unknown space explicitly rather than assuming it is free.

Preprocessing: From Raw Depth to a Usable Point Cloud

Goal: produce a point cloud in the robot/base frame with outliers reduced and missing data handled (a minimal filtering sketch follows the list below).

  • Temporal alignment: synchronize depth with robot pose/odometry; motion during exposure can smear depth and create “ghost” obstacles.
  • Range gating: discard points outside a trusted depth interval [z_min, z_max] (e.g., ignore very near points that are sensor noise and very far points that are sparse).
  • Downsampling: voxel grid downsample (e.g., 2–5 cm voxels) to stabilize occupancy updates and reduce compute.
  • Outlier removal: remove isolated points via radius/neighbor count filtering to reduce speckle.
  • Flying pixels handling: depth discontinuities (object edges) often generate points floating in front of or behind surfaces. Mitigations include:
      • Reject points with high local depth variance in a small window.
      • Prefer “closest valid depth” per pixel only after edge-aware filtering.
      • Use morphological cleanup on the depth validity mask before projection.
  • Missing depth: treat as unknown, not free. If you must fill, do it conservatively (e.g., small hole filling only, never across large gaps).
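
The range-gating, downsampling, and outlier-removal steps above can be sketched in a few lines of NumPy/SciPy. This is a minimal illustration, assuming the cloud arrives as an N x 3 array in the sensor frame with depth along z; all thresholds are placeholders to be tuned per sensor.

    import numpy as np
    from scipy.spatial import cKDTree

    def preprocess_cloud(points, z_min=0.3, z_max=5.0,
                         voxel=0.03, min_neighbors=5, radius=0.10):
        """Range-gate, voxel-downsample, and de-speckle an N x 3 point cloud.

        `points` is assumed to be in the sensor frame with depth along z.
        All thresholds are illustrative and must be tuned per sensor/robot.
        """
        # 1. Range gating: keep only points inside the trusted depth interval.
        depth = points[:, 2]
        points = points[(depth >= z_min) & (depth <= z_max)]

        # 2. Voxel-grid downsampling: keep one representative point per voxel
        #    (here: the mean of the points falling in each voxel).
        keys = np.floor(points / voxel).astype(np.int64)
        _, inverse = np.unique(keys, axis=0, return_inverse=True)
        inverse = inverse.ravel()
        sums = np.zeros((inverse.max() + 1, 3))
        counts = np.bincount(inverse)
        np.add.at(sums, inverse, points)
        points = sums / counts[:, None]

        # 3. Radius outlier removal: drop isolated points (speckle / flying pixels).
        tree = cKDTree(points)
        neighbor_counts = np.array([len(idx) - 1 for idx in
                                    tree.query_ball_point(points, r=radius)])
        return points[neighbor_counts >= min_neighbors]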

Detection Logic: Point Cloud to Obstacles and Free Space

A common pattern is to classify points as “ground” vs “non-ground,” then project non-ground points into a 2D grid for navigation.

Step-by-step: Ground Removal + 2D Occupancy Projection

  1. Transform to base frame: ensure points are expressed in the robot’s navigation frame (x forward, y left, z up).
  2. Ground model: choose one:
  • Planar ground assumption (fast): points with z near a known ground height are ground.
  • Fitted ground plane (robust): fit a plane (e.g., RANSAC) to low points and classify inliers as ground.
  • Grid-based ground segmentation: split space into radial/Cartesian bins and estimate local ground height to handle ramps and uneven floors.
  3. Obstacle height band: classify a point as obstacle if it is above ground by at least h_min and below h_max (ignore ceiling/overhangs if your robot can pass under them). Example: h_min = 0.10 m to ignore small floor noise; h_max = 1.5 m for typical indoor navigation.
  4. Project to 2D grid: for each obstacle point, compute its cell index (i, j) in a grid centered on the robot. Mark that cell occupied (or increase its occupancy probability).
  5. Free-space marking (ray tracing): for each depth measurement, cast a ray from sensor origin to the measured point and mark traversed cells as free until the hit cell. This prevents “unknown everywhere” and helps in narrow passages.
  6. Inflation / safety margin: expand occupied cells by robot radius plus a buffer m (to cover localization error, control overshoot, and sensor noise). This yields a costmap inflation layer.

Thresholds and margins should be tuned to your robot’s speed and stopping distance. A practical rule: increase inflation with speed, and add a fixed minimum buffer for localization uncertainty.
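
A minimal sketch of steps 2–4 (ground classification by height band and projection into the robot-centred grid) is shown below. It assumes the points are already in the base frame, a flat floor at z = 0, and the grid conventions introduced earlier; a fitted or grid-based ground model would replace the simple height test.

    import numpy as np

    def project_obstacles(points_base, occupancy, resolution=0.05,
                          h_min=0.10, h_max=1.5):
        """Classify points by height above an assumed flat ground (z = 0 in
        the base frame) and mark obstacle cells in a robot-centred grid.

        `occupancy` uses the conventions above (-1 unknown, 0 free, 1 occupied)
        with the robot at the grid centre; all parameters are illustrative.
        """
        half = occupancy.shape[0] // 2
        z = points_base[:, 2]

        # Height band: below h_min is treated as ground/noise, above h_max as
        # overhang the robot can pass under; everything in between is an obstacle.
        obstacles = points_base[(z >= h_min) & (z <= h_max)]

        # Project each obstacle point into a cell index around the robot.
        ij = np.floor(obstacles[:, :2] / resolution).astype(int) + half
        valid = ((ij >= 0) & (ij < occupancy.shape[0])).all(axis=1)
        occupancy[ij[valid, 0], ij[valid, 1]] = 1   # mark occupied
        return occupancy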

Output Representation: Occupancy Grid, Costmap, Obstacle List

Occupancy grid is a natural output of the projection step. Use three states: free, occupied, unknown. Unknown is critical when depth is missing (glass, glare, range dropouts).
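
Free-space marking (step 5 above) can be sketched with a simple sampled ray march; an exact Bresenham grid traversal is the more common implementation but needs more code. Note that rays are only traced for valid depth returns, so regions behind missing depth remain unknown.

    import numpy as np

    def mark_free_space(sensor_xy, hit_xy, occupancy, resolution=0.05):
        """Mark cells between the sensor and a valid depth hit as free.

        `sensor_xy` and `hit_xy` are 2-element arrays in the base frame (metres).
        Uses simple ray sampling (an exact Bresenham traversal would also work).
        Cells beyond the hit, or along rays with missing depth, are never touched,
        so they stay unknown rather than being optimistically marked free.
        """
        half = occupancy.shape[0] // 2
        dist = np.linalg.norm(hit_xy - sensor_xy)
        n_steps = max(int(dist / (0.5 * resolution)), 1)
        for t in np.linspace(0.0, 1.0, n_steps, endpoint=False):
            p = sensor_xy + t * (hit_xy - sensor_xy)
            i, j = np.floor(p / resolution).astype(int) + half
            if not (0 <= i < occupancy.shape[0] and 0 <= j < occupancy.shape[1]):
                break                      # left the local grid
            if occupancy[i, j] == 1:
                break                      # do not carve through known obstacles
            occupancy[i, j] = 0            # free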

Costmap is derived by inflating obstacles and optionally adding costs for risky areas (near unknown, near drop-offs, near reflective surfaces where depth is unreliable). A simple cost function is distance-to-nearest-obstacle with saturation:

cost(d) = 0                      if d > d_inflate_max
cost(d) = cost_max * (1 - d/d_inflate_max)  otherwise

Obstacle list can be produced by clustering occupied cells (connected components) and outputting bounding boxes or centroids for local planners that prefer sparse obstacles.
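
Both derived outputs can be computed directly from the occupied cells with SciPy's image tools. The sketch below implements the saturated linear cost above via a distance transform and extracts an obstacle list from connected components; parameter values are illustrative.

    import numpy as np
    from scipy import ndimage

    def inflate_and_cluster(occupancy, resolution=0.05,
                            d_inflate_max=0.5, cost_max=100):
        """Build a costmap from the occupied cells and extract an obstacle list.

        Implements cost(d) = cost_max * (1 - d / d_inflate_max), saturated at 0,
        where d is the metric distance to the nearest occupied cell.
        """
        occupied = (occupancy == 1)

        # Distance (in metres) from every cell to the nearest occupied cell.
        d = ndimage.distance_transform_edt(~occupied) * resolution
        costmap = np.where(d > d_inflate_max, 0.0,
                           cost_max * (1.0 - d / d_inflate_max)).astype(np.uint8)

        # Obstacle list: connected components of occupied cells -> centroids
        # expressed in the robot frame.
        labels, n = ndimage.label(occupied)
        centroids_ij = ndimage.center_of_mass(occupied, labels, range(1, n + 1)) if n else []
        half = occupancy.shape[0] // 2
        obstacle_list = [((i - half) * resolution, (j - half) * resolution)
                         for i, j in centroids_ij]
        return costmap, obstacle_list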

Handling Dynamic Obstacles with Depth

Depth makes dynamic obstacle detection straightforward if you maintain short-term temporal state (a minimal decay sketch follows the list):

  • Temporal decay: occupied cells fade back to unknown/free unless re-observed. This prevents “ghost obstacles” when a person walks away.
  • Velocity cues: track clusters across frames (nearest-neighbor association) to estimate motion; mark moving obstacles with higher cost or larger inflation.
  • Safety-first unknown: if a region alternates between missing depth and occupied, treat it as high cost (e.g., reflective surfaces or thin chair legs).
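
A minimal per-cell confidence filter that implements temporal decay might look like the following; the gain, decay rate, and threshold are illustrative and should be chosen relative to the sensor frame rate.

    import numpy as np

    def decay_occupancy(confidence, new_occupied, gain=0.4, decay=0.1,
                        occ_threshold=0.6):
        """One update step of a simple per-cell confidence filter.

        `confidence` is a float grid in [0, 1]; `new_occupied` is a boolean grid
        of this frame's obstacle observations. Cells that are re-observed gain
        confidence, unobserved cells fade, and only confident cells are exported
        as occupied. Gains/decays are illustrative and should match frame rate.
        """
        confidence = np.clip(confidence + gain * new_occupied - decay, 0.0, 1.0)
        occupied = confidence >= occ_threshold     # export to the costmap layer
        return confidence, occupied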

Dealing with Sensor Artifacts: Glass, Reflective Surfaces, and Range Dropouts

  • Glass: depth may return invalid or behind-glass readings. Treat persistent invalid returns in a region as unknown; if the robot operates near glass walls/doors, add a conservative “unknown inflation” band near those areas using semantic cues or map priors.
  • Reflective metal: may produce speckle or incorrect depths. Use outlier filters and require multi-frame confirmation before declaring free space in such regions.
  • Black/absorptive materials: can cause missing depth. Do not ray-trace through missing depth as free; stop rays at the last valid cell and keep beyond as unknown.
  • Edge flying pixels: reduce with edge-aware filtering and by ignoring isolated obstacle cells that appear only for one frame unless they are close to the robot.

Approach 2: Monocular Vision Cues (Semantics, Ground Plane, Motion)

Sensing Choice: When You Only Have a Camera (or Want Redundancy)

Monocular methods estimate free space without direct depth. They are attractive when depth sensors are unavailable, unreliable in sunlight, or when you want redundancy for safety. They are also useful for recognizing what an obstacle is (person, chair, curb) to adapt behavior. The trade-off is that geometry is inferred and can be brittle under changing light, motion blur, and textureless floors.

Preprocessing: Make the Image “Navigation-Ready”

Keep preprocessing minimal and consistent to avoid shifting model behavior:

  • Stabilize exposure as much as possible; sudden gain changes can confuse segmentation and motion cues.
  • Mask robot body (bumper, gripper) if visible to avoid self-obstacles.
  • Resize/crop consistently to the network’s expected input; keep the horizon region if using ground-plane cues.
  • Maintain a validity mask for saturated highlights, lens flare, or rain droplets; these regions should become unknown/high cost.

Detection Logic A: Semantic Segmentation → Free Space and Obstacles

Semantic segmentation labels each pixel (e.g., floor, wall, person). For navigation, you typically collapse labels into three sets: free (floor/road), occupied (walls, furniture, people), and unknown (ambiguous classes, low confidence, glare).

Step-by-step: From Segmentation to a Local Costmap

  1. Run segmentation and obtain per-pixel class probabilities.
  2. Confidence thresholding: if max probability < p_min, mark pixel unknown. This is essential under changing light.
  3. Free-space mask: pixels classified as traversable surface with confidence ≥ p_free.
  4. Obstacle mask: pixels classified as non-traversable with confidence ≥ p_obs.
  5. Project to ground plane: convert image pixels to ground coordinates using a ground-plane assumption (flat floor) and camera pose. Each pixel in the free-space mask votes for free cells; obstacle pixels vote for occupied cells.
  6. Apply safety margins: inflate occupied cells; additionally, shrink free-space slightly (erode free mask) to reduce “optimistic” free-space near boundaries.

Practical thresholds: make p_free the stricter threshold when false free-space is unacceptable but false obstacles are tolerable. For example, p_free = 0.7, p_obs = 0.6, and treat everything else as unknown.
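
The sketch below combines steps 2–5: confidence gating followed by projection of free and obstacle pixels onto an assumed flat floor (z = 0 in the base frame). The camera intrinsics K, the camera pose (R_base_cam, t_base_cam), and the floor_class id are assumed known and purely illustrative; the resulting ground-plane points would then be rasterized into the local grid exactly as in the depth pipeline.

    import numpy as np

    def seg_to_ground_points(probs, class_ids, K, R_base_cam, t_base_cam,
                             p_free=0.7, p_obs=0.6, floor_class=0):
        """Turn per-pixel segmentation into free/obstacle points on the floor.

        `probs`  : H x W max class probability; `class_ids` : H x W argmax class.
        `K`      : 3x3 camera intrinsics; `R_base_cam`, `t_base_cam`: camera pose
        in the base frame. A flat floor at z = 0 is assumed, as in the text.
        Returns (free_xy, obstacle_xy) arrays of ground-plane coordinates.
        """
        H, W = probs.shape
        u, v = np.meshgrid(np.arange(W), np.arange(H))
        pix = np.stack([u.ravel(), v.ravel(), np.ones(H * W)], axis=0)

        # Back-project every pixel to a ray in the base frame.
        rays = R_base_cam @ (np.linalg.inv(K) @ pix)          # 3 x (H*W)
        cam = t_base_cam.reshape(3, 1)

        # Intersect rays with the z = 0 ground plane; only downward-pointing
        # rays are kept, upward/horizontal rays are masked out below.
        down = rays[2] < -1e-6
        t = -cam[2] / rays[2]
        ground = (cam + t * rays).T                           # (H*W) x 3

        labels = class_ids.ravel()
        conf = probs.ravel()
        free_mask = down & (labels == floor_class) & (conf >= p_free)
        obs_mask = down & (labels != floor_class) & (conf >= p_obs)
        # Everything else (low confidence, upward rays) is left unknown.
        return ground[free_mask, :2], ground[obs_mask, :2]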

Detection Logic B: Ground-Plane Estimation → Geometric Free Space

If you can estimate the ground plane and identify pixels that belong to it, you can infer free space geometrically. Common cues include texture continuity, color consistency, and the fact that ground points lie on a plane in 3D. In practice, you often combine a weak geometric model with semantics: semantics proposes “floor,” geometry checks consistency.

  • Planar assumption: works well indoors with flat floors; fails on ramps, curbs, and uneven terrain unless you allow a slowly varying plane.
  • Horizon/vanishing cues: can stabilize ground estimation but degrade in cluttered scenes.

Navigation output is similar: project estimated ground region to a 2D grid, mark it free, and mark everything else unknown unless confirmed as obstacle by other cues.

Detection Logic C: Motion Cues → Dynamic Obstacle Awareness

Motion cues help detect moving objects even when semantics is uncertain (e.g., a person in unusual lighting). The key idea is to separate camera-induced motion (from robot movement) from independently moving objects.

  • Frame-to-frame motion consistency: estimate the dominant image motion; regions that deviate significantly are candidates for moving obstacles.
  • Time-to-collision heuristics: expanding regions in the image (increasing apparent size) can indicate approaching obstacles.
  • Conservative integration: motion-only detections should usually raise cost (caution) rather than declare hard occupancy unless persistent across several frames.

For navigation, motion cues often produce a dynamic cost layer: areas with detected independent motion get higher cost and larger inflation, encouraging the planner to give them space.
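
One lightweight way to obtain such a layer is to estimate the dominant image motion from sparse feature tracks and flag tracks that deviate from it. The OpenCV-based sketch below follows that idea; the feature settings, RANSAC threshold, and residual threshold are illustrative, and the output mask should feed a caution cost layer rather than hard occupancy.

    import cv2
    import numpy as np

    def independent_motion_mask(prev_gray, gray, residual_px=3.0):
        """Flag pixels whose motion deviates from the dominant (camera) motion.

        Sparse corners are tracked frame to frame; a homography fitted to the
        tracks approximates camera-induced motion, and tracks whose displacement
        deviates from it by more than `residual_px` are treated as candidate
        moving obstacles.
        """
        pts0 = cv2.goodFeaturesToTrack(prev_gray, maxCorners=400,
                                       qualityLevel=0.01, minDistance=8)
        if pts0 is None:
            return np.zeros(gray.shape, dtype=np.uint8)

        pts1, status, _ = cv2.calcOpticalFlowPyrLK(prev_gray, gray, pts0, None)
        good0 = pts0[status.ravel() == 1].reshape(-1, 2)
        good1 = pts1[status.ravel() == 1].reshape(-1, 2)
        if len(good0) < 8:
            return np.zeros(gray.shape, dtype=np.uint8)

        # Dominant image motion (camera ego-motion over a roughly planar scene).
        H, _ = cv2.findHomography(good0, good1, cv2.RANSAC, 3.0)
        if H is None:
            return np.zeros(gray.shape, dtype=np.uint8)
        pred = cv2.perspectiveTransform(good0.reshape(-1, 1, 2), H).reshape(-1, 2)
        residual = np.linalg.norm(good1 - pred, axis=1)

        # Dilate small discs around deviating tracks into a caution mask.
        mask = np.zeros(gray.shape, dtype=np.uint8)
        for (x, y), r in zip(good1.astype(int), residual):
            if r > residual_px:
                cv2.circle(mask, (int(x), int(y)), 12, 255, -1)
        return mask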

Output Representation: Fusing Monocular Cues into Navigation Layers

Layer | Source | Meaning | Typical action
Static occupancy | Semantic obstacles + stable geometry | Likely non-traversable | Mark occupied + inflate
Free space | High-confidence floor/road + ground-plane consistency | Traversable now | Mark free (possibly eroded)
Unknown | Low confidence, glare, shadows, reflections | Uncertain | Keep unknown; optionally add cost
Dynamic cost | Motion cues + tracked objects | Moving hazard | Increase cost + inflate more

A practical fusion rule is “occupied overrides free” and “unknown overrides free when confidence is low.” This prevents the robot from driving into areas where the camera is blinded or the model is uncertain.
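
In code, that precedence can be expressed by the order in which the layers are written into the fused grid, as in this minimal sketch (same -1/0/1 state convention as before):

    import numpy as np

    def fuse_layers(free_mask, occupied_mask, low_confidence_mask):
        """Fuse monocular layers with 'occupied overrides free' and
        'unknown overrides free when confidence is low'."""
        fused = np.full(free_mask.shape, -1, dtype=np.int8)   # default unknown
        fused[free_mask] = 0
        fused[low_confidence_mask] = -1    # blinded/uncertain regions beat free
        fused[occupied_mask] = 1           # occupied wins over everything
        return fused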

Thresholds, Safety Margins, and Conservative Design

Key Threshold Types to Tune

  • Height thresholds (h_min, h_max) for depth obstacles: too low creates noise obstacles; too high misses low hazards (cables, small steps).
  • Range thresholds (z_min, z_max): too permissive includes unreliable depth; too strict reduces lookahead.
  • Confidence thresholds (p_free, p_obs) for semantics: tune to minimize false free-space.
  • Temporal persistence: number of frames required to confirm free/occupied; higher persistence reduces flicker but increases latency.

Safety Margins That Reflect Real Robot Behavior

  • Inflation radius = robot footprint radius + localization error bound + control tracking error (a worked example follows this list).
  • Speed-dependent margin: at higher speeds, increase inflation and require longer free-space lookahead.
  • Unknown-space policy: in tight spaces, you may allow slow traversal near unknown; in safety-critical settings, treat unknown as occupied.
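
As a worked example with hypothetical numbers (adjust to your platform):

    # All values are illustrative, for a small indoor base moving at 0.8 m/s.
    footprint_radius = 0.25       # m, half of the robot's widest dimension
    localization_error = 0.05     # m, bound reported by the localizer
    tracking_error = 0.05         # m, controller overshoot at commanded speed
    speed_margin = 0.10 * 0.8     # m, e.g. 0.10 m of extra margin per m/s

    inflation_radius = footprint_radius + localization_error + tracking_error + speed_margin
    print(inflation_radius)       # 0.43 m of inflation around every occupied cell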

Evaluation in Realistic Environments (What to Test and What to Measure)

Test Scenarios That Expose Failure Modes

  • Clutter: chair legs, cables, small objects near the floor; evaluate whether h_min and downsampling miss hazards.
  • Narrow passages: doorways and corridors; evaluate inflation tuning and whether flying pixels create phantom blocks.
  • Reflective surfaces: mirrors, glossy floors, stainless appliances; evaluate unknown handling and false free-space.
  • Glass doors/walls: evaluate whether the system incorrectly marks through-glass space as free.
  • Changing light: moving from bright to dim, shadows, headlights; evaluate segmentation confidence gating and unknown policy.
  • Dynamic obstacles: people crossing, carts, pets; evaluate temporal decay, tracking stability, and planner behavior around moving costs.

Metrics That Connect Vision to Navigation Safety

  • False free-space rate: fraction of truly occupied area labeled free (most dangerous).
  • False obstacle rate: free area labeled occupied (hurts efficiency, can cause dead-ends).
  • Unknown coverage: too high means timid navigation; too low can mean overconfidence.
  • Latency: time from observation to costmap update; high latency causes collisions at speed.
  • Stability/flicker: how often cells toggle states; flicker destabilizes planners.

When evaluating, log synchronized sensor data, robot pose, and produced grids/costmaps. In post-analysis, inspect failure cases by overlaying occupancy/cost layers on the camera/depth view to see whether errors come from preprocessing, thresholds, projection, or fusion rules.
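
For post-analysis, the grid-level metrics above reduce to simple comparisons between predicted and reference grids. A minimal sketch, assuming a ground-truth grid with only free/occupied cells and the -1/0/1 convention used earlier:

    import numpy as np

    def grid_metrics(pred, truth):
        """Compare a predicted grid (-1 unknown, 0 free, 1 occupied) with ground truth.

        `truth` is assumed to contain only free (0) and occupied (1) cells from
        an annotated or surveyed reference map.
        """
        occupied_truth = (truth == 1)
        free_truth = (truth == 0)
        false_free = np.sum((pred == 0) & occupied_truth) / max(np.sum(occupied_truth), 1)
        false_obstacle = np.sum((pred == 1) & free_truth) / max(np.sum(free_truth), 1)
        unknown_coverage = np.mean(pred == -1)
        return false_free, false_obstacle, unknown_coverage

    def flicker_rate(prev_pred, pred):
        """Fraction of cells that changed state between consecutive updates."""
        return np.mean(prev_pred != pred)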

Now answer the exercise about the content:

In a navigation costmap built from depth measurements, what is the purpose of ray tracing from the sensor origin to each measured point?

Answer: Ray tracing marks traversed cells as free up to the measured obstacle cell, preventing large regions from staying unknown and helping navigation in narrow spaces.