Depth modalities at a glance
Depth sensing answers “how far is this surface from the sensor?” In robotics, the best choice depends on your operating range, lighting, surface materials, compute budget, and how you plan to fuse depth with color and motion. The main options are: passive stereo (two cameras), active structured light (projected pattern), active time-of-flight (ToF, modulated light), and monocular depth estimation (single image, learned or geometric cues).
| Modality | Typical range | Accuracy trend | Lighting sensitivity | Compute cost | Common failure cases |
|---|---|---|---|---|---|
| Stereo (passive) | ~0.3 m to 20+ m (depends on baseline, texture) | Better at close range; degrades with distance | Needs texture; struggles in low light unless you add illumination | Medium to high (matching) | Textureless walls, repetitive patterns, motion blur, occlusions, thin structures |
| Structured light (active) | ~0.2 m to 5 m (consumer); longer for some industrial systems | High at close range; drops with distance | Strongly affected by sunlight/IR interference | Low to medium (much of the processing happens on the sensor) | Sunlight, reflective/transparent objects, multi-path, projector-camera shadowing |
| Time-of-Flight (active) | ~0.2 m to 10+ m (varies by power and optics) | Stable over mid-range; can be noisy at edges | Sunlight can reduce SNR; generally better than structured light outdoors but still challenged | Low to medium (depth output is direct) | Multi-path interference, flying pixels at depth discontinuities, dark/absorptive surfaces, specular reflections |
| Monocular cues (learned) | “Any” (but scale is ambiguous without extra info) | Relative depth often good; absolute depth uncertain | Depends on training domain; can fail under unusual lighting | Medium to very high (DNN inference) | Domain shift, novel objects/scenes, scale drift, reflective/transparent surfaces, adversarial textures |
Stereo depth: intuitive geometry you can use
Stereo estimates depth by comparing where the same scene point appears in the left and right images. Because the cameras are separated by a baseline (distance between camera centers), closer points shift more between views than far points. That shift is called disparity.
Disparity and baseline
- Disparity: horizontal pixel difference between corresponding points in rectified stereo images.
- Baseline: larger baseline increases disparity for a given depth, improving depth precision at longer range, but increases occlusions and makes close-range matching harder.
- Rule of thumb: depth uncertainty grows roughly with the square of distance, because disparity becomes very small and each pixel of matching error spans more depth.
In a rectified stereo pair (epipolar lines are horizontal), depth is approximately:
Z ≈ (f * B) / d
where Z is depth, f is focal length in pixels, B is baseline (meters), and d is disparity (pixels). Practical implications (a small numeric sketch follows this list):
- Doubling baseline roughly halves depth error at the same distance (if matching is reliable).
- Higher resolution (more pixels) can improve disparity precision, but increases compute and bandwidth.
- Wide FOV lenses can reduce effective pixel focal length, reducing disparity for the same depth (hurting precision), even though they see more.
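A minimal numeric sketch of these relationships in Python, with illustrative (assumed) values for focal length, baseline, and matching error that are not tied to any particular camera: depth follows Z = f·B/d, and propagating a fixed disparity error shows why depth uncertainty grows roughly with Z².

```python
import numpy as np

# Illustrative (assumed) parameters for a rectified stereo pair.
f_px = 700.0             # focal length in pixels
baseline_m = 0.12        # baseline in meters
disparity_err_px = 0.25  # expected sub-pixel matching error

disparity_px = np.array([80.0, 20.0, 5.0])  # example disparities

# Z = f * B / d
depth_m = f_px * baseline_m / disparity_px

# dZ/dd = -f*B/d^2, so |dZ| ~= Z^2 / (f*B) * |dd|: error grows quadratically with depth.
depth_err_m = depth_m ** 2 / (f_px * baseline_m) * disparity_err_px

for d, z, e in zip(disparity_px, depth_m, depth_err_m):
    print(f"disparity {d:5.1f} px -> depth {z:5.2f} m ± {e:.3f} m")
```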
Where stereo works well (and where it doesn’t)
Stereo is strongest when the scene has texture: carpet, foliage, printed labels, brick, etc. It struggles when correspondence is ambiguous or impossible:
- Textureless surfaces (plain walls): many pixels look the same, so matching is unstable.
- Repetitive patterns (fences, blinds): multiple matches look plausible, causing “stair-step” or wrong depths.
- Occlusions: a point visible in one camera may be hidden in the other; this creates invalid depth regions near edges.
- Motion blur and rolling-shutter artifacts: reduce matching quality.
- Thin structures (wires, chair legs): can be missed or “fattened” depending on matching and filtering.
Active depth cameras
Active depth sensors emit light (often IR) and measure how it returns. This helps in low-texture scenes, but introduces sensitivity to sunlight, reflective materials, and interference.
Structured light
Structured light projects a known pattern (dots or stripes). The camera observes how the pattern deforms on surfaces and triangulates depth. You can think of it as “stereo,” where one “camera” is the projector with a known pattern.
- Strengths: excellent close-range detail; works on textureless walls indoors because the projected pattern provides texture.
- Weaknesses: sunlight can wash out the pattern; multiple devices can interfere; shiny/transparent objects distort or erase the pattern.
Typical failure signatures include holes on glossy surfaces, missing depth in sunlit areas, and shadowed regions where the projector cannot illuminate.
Time-of-Flight (ToF)
ToF measures distance by emitting modulated light and estimating how long it takes to return (directly or via phase shift). The sensor outputs depth per pixel without correspondence search.
- Strengths: depth is available even in low-texture scenes; often more robust than structured light in mixed lighting; simpler compute pipeline.
- Weaknesses: multi-path interference (light bouncing between surfaces) can bias depth; edges can produce mixed pixels (“flying pixels”); dark materials reduce signal-to-noise.
ToF noise often increases with distance and in low-reflectance regions. Depth discontinuities (object edges) are a common source of artifacts.
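A minimal sketch of one common mitigation: rejecting flying pixels whose depth jumps sharply relative to their direct neighbors. The function name and threshold are illustrative assumptions and should be tuned per sensor.

```python
import numpy as np

def remove_flying_pixels(depth_m: np.ndarray, jump_thresh_m: float = 0.1) -> np.ndarray:
    """Invalidate (set to 0) pixels whose depth differs from any 4-neighbor
    by more than jump_thresh_m. 0 is treated as unknown, not as "far"."""
    d = depth_m.astype(np.float32)
    # Absolute differences to the four direct neighbors.
    # Note: np.roll wraps at image borders; border pixels may need separate handling.
    up    = np.abs(d - np.roll(d,  1, axis=0))
    down  = np.abs(d - np.roll(d, -1, axis=0))
    left  = np.abs(d - np.roll(d,  1, axis=1))
    right = np.abs(d - np.roll(d, -1, axis=1))
    max_jump = np.maximum.reduce([up, down, left, right])
    out = d.copy()
    # Valid pixels adjacent to holes (0 depth) are also rejected by this test;
    # mask known-invalid neighbors first if that is undesirable.
    out[max_jump > jump_thresh_m] = 0.0
    return out
```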
Monocular depth: treat it as a probabilistic cue
Monocular depth estimation predicts depth (or relative depth) from a single RGB image using learned priors (e.g., object sizes, perspective, shading, context). Unlike stereo or active sensors, a single image does not uniquely determine absolute scale without additional information (known camera motion, known object size, ground plane constraints, or fusion with other sensors).
How to use monocular depth safely in robotics
- Prefer relative depth: use it to rank what is closer/farther, detect obstacles, or propose regions of interest.
- Model uncertainty: treat predictions as distributions (mean + variance) or confidence maps. Low confidence should reduce influence in planning.
- Anchor scale: fuse with wheel odometry/IMU + known camera height, or periodically correct with sparse range (ultrasonic, LiDAR, or occasional valid depth pixels); see the sketch below.
- Watch for domain shift: performance can drop in new buildings, outdoors, unusual lighting, or with novel objects.
Practical mindset: monocular depth is often best as a “soft sensor” that complements geometry-based depth, not a replacement for safety-critical ranging.
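A minimal sketch of the scale-anchoring idea: align a relative (up-to-scale) monocular depth map to metric units using a few sparse range measurements. The median-of-ratios estimator used here is one robust choice; a scale-plus-shift least-squares fit is a common alternative. All names and values are illustrative assumptions.

```python
import numpy as np

def anchor_scale(rel_depth: np.ndarray,
                 sparse_uv: np.ndarray,       # (N, 2) integer pixel coords (u, v) of range hits
                 sparse_range_m: np.ndarray   # (N,) metric ranges at those pixels
                 ) -> np.ndarray:
    """Scale a relative depth map so it agrees with sparse metric ranges."""
    rel = rel_depth[sparse_uv[:, 1], sparse_uv[:, 0]]   # sample prediction at (v, u)
    valid = (rel > 1e-6) & (sparse_range_m > 0)
    if valid.sum() < 3:
        raise ValueError("not enough anchor points to estimate scale")
    # Median of per-point ratios is robust to a few bad anchors.
    scale = np.median(sparse_range_m[valid] / rel[valid])
    return rel_depth * scale
```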
Integration steps that matter in real robots
1) Aligning depth to the color frame
Depth sensors and RGB cameras may have different viewpoints, resolutions, and timestamps. “Aligned depth” means each RGB pixel has a corresponding depth value in the RGB camera’s coordinate system.
- Step 1: Ensure time synchronization. Use hardware sync if available; otherwise, approximate with nearest timestamps and track latency.
- Step 2: Choose alignment direction:
- Depth-to-color: reproject depth pixels into 3D, then project into the RGB image. Best when you want depth at RGB pixels for perception.
- Color-to-depth: map RGB into the depth frame. Useful if depth is primary (e.g., obstacle grids) and you only need color for labeling.
- Step 3: Handle occlusions during reprojection. Multiple depth pixels can map to the same RGB pixel; keep the closest depth (z-buffer logic).
- Step 4: Interpolate carefully. Nearest-neighbor preserves edges but can look blocky; bilinear can blur edges and create invalid mixed depths. For manipulation, prefer edge-preserving strategies.
Implementation hint (conceptual): for each depth pixel (u_d, v_d, z), back-project to 3D in the depth camera, transform to the RGB camera frame, then project to RGB pixel (u_c, v_c). Store z at (u_c, v_c) if it is nearer than an existing value.
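That hint, written out as a sketch. It assumes pinhole intrinsics K_d and K_c and a depth-to-color extrinsic T_c_from_d from calibration; the z-buffer loop is kept explicit for clarity rather than speed.

```python
import numpy as np

def align_depth_to_color(depth_m: np.ndarray,    # (H_d, W_d) depth in meters, 0 = invalid
                         K_d: np.ndarray,        # (3, 3) depth camera intrinsics
                         K_c: np.ndarray,        # (3, 3) color camera intrinsics
                         T_c_from_d: np.ndarray, # (4, 4) depth-to-color extrinsics
                         color_shape: tuple      # (H_c, W_c)
                         ) -> np.ndarray:
    H_d, W_d = depth_m.shape
    H_c, W_c = color_shape
    aligned = np.zeros((H_c, W_c), dtype=np.float32)  # 0 = unknown

    # Pixel grid and valid depth values.
    u, v = np.meshgrid(np.arange(W_d), np.arange(H_d))
    z = depth_m.ravel()
    valid = z > 0
    u, v, z = u.ravel()[valid], v.ravel()[valid], z[valid]

    # Back-project to 3D in the depth camera frame (pinhole model).
    x = (u - K_d[0, 2]) * z / K_d[0, 0]
    y = (v - K_d[1, 2]) * z / K_d[1, 1]
    pts_d = np.stack([x, y, z, np.ones_like(z)])   # (4, N) homogeneous points

    # Transform into the color camera frame.
    pts_c = T_c_from_d @ pts_d
    xc, yc, zc = pts_c[0], pts_c[1], pts_c[2]
    in_front = zc > 1e-6
    xc, yc, zc = xc[in_front], yc[in_front], zc[in_front]

    # Project with the color intrinsics and keep in-image points.
    uc = np.round(K_c[0, 0] * xc / zc + K_c[0, 2]).astype(int)
    vc = np.round(K_c[1, 1] * yc / zc + K_c[1, 2]).astype(int)
    ok = (uc >= 0) & (uc < W_c) & (vc >= 0) & (vc < H_c)

    # Z-buffer: keep the nearest depth when several depth pixels land on one
    # color pixel (this handles the occlusion case from Step 3).
    for px_u, px_v, px_z in zip(uc[ok], vc[ok], zc[ok]):
        if aligned[px_v, px_u] == 0 or px_z < aligned[px_v, px_u]:
            aligned[px_v, px_u] = px_z
    return aligned
```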
2) Handling missing or wrong depth (reflective, transparent, absorptive)
All depth modalities can produce invalid pixels (holes) or biased depth. Common problematic materials:
- Specular/reflective (metal, glossy plastic): returns can bounce away or cause multi-path; depth may be missing or too far/too near.
- Transparent (glass): active sensors may pass through or reflect unpredictably; stereo may match background through the glass.
- Absorptive/dark (black fabric, matte rubber): active sensors get weak return; ToF noise increases; structured light dots disappear.
Practical steps:
- Step 1: Use the sensor’s validity/confidence output (if available). Treat invalid as unknown, not as “far.”
- Step 2: Apply morphological hole handling: small holes can be filled; large holes should remain unknown. Use size thresholds (see the sketch after this list).
- Step 3: Edge-aware filling: fill only within regions of similar color/normal to avoid bleeding foreground depth into background.
- Step 4: Add complementary sensing: for safety, fuse bumper, cliff sensors, or a 2D LiDAR for glass walls and shiny elevator doors.
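A minimal sketch of the size-threshold part of Steps 2 and 3: fill only small connected holes from the nearest valid pixel and leave large holes unknown. A full edge-aware version would additionally require that a filled pixel's color or surface normal match its source region. Uses SciPy; the threshold is an illustrative assumption.

```python
import numpy as np
from scipy import ndimage

def fill_small_holes(depth_m: np.ndarray, max_hole_px: int = 200) -> np.ndarray:
    """Fill invalid (0) regions smaller than max_hole_px from the nearest
    valid pixel; larger holes stay 0 (unknown)."""
    d = depth_m.astype(np.float32)
    invalid = d <= 0
    # Label connected invalid regions and measure their sizes.
    labels, n = ndimage.label(invalid)
    sizes = ndimage.sum(invalid, labels, index=np.arange(1, n + 1))
    small = np.isin(labels, np.nonzero(sizes <= max_hole_px)[0] + 1)
    # For every pixel, indices of the nearest valid pixel.
    _, (iy, ix) = ndimage.distance_transform_edt(invalid, return_indices=True)
    filled = d.copy()
    filled[small] = d[iy[small], ix[small]]
    return filled
```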
3) Filtering depth maps and point clouds for robotics
Filtering should reduce noise while preserving obstacles and graspable edges. A typical pipeline:
- Step 1: Depth range gating: clamp to [z_min, z_max] for your task to remove out-of-range noise.
- Step 2: Temporal smoothing: exponential moving average or a short median window. Keep it small to avoid lag in fast motion.
- Step 3: Spatial edge-preserving smoothing: bilateral filtering on depth (or joint bilateral guided by RGB) to denoise while keeping edges.
- Step 4: Outlier removal in 3D: statistical outlier removal (remove points with sparse neighborhoods) or radius filtering.
- Step 5: Downsample for planning: voxel grid downsampling to a resolution that matches your robot’s footprint and required clearance (Steps 4 and 5 are sketched below).
For obstacle avoidance, it is often better to keep unknown space as unknown rather than “free,” especially when depth holes correlate with problematic materials.
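A minimal sketch of the 3D end of that pipeline (Steps 4 and 5), assuming Open3D and an (N, 3) point array already produced from a filtered depth image; parameter values are illustrative and should be tuned to the robot and task.

```python
import numpy as np
import open3d as o3d

points = np.random.rand(10_000, 3) * 5.0  # placeholder (N, 3) point cloud, meters

pcd = o3d.geometry.PointCloud()
pcd.points = o3d.utility.Vector3dVector(points)

# Step 4: statistical outlier removal (drop points with sparse neighborhoods).
pcd, _ = pcd.remove_statistical_outlier(nb_neighbors=20, std_ratio=2.0)

# Step 5: voxel-grid downsampling to a planning-friendly resolution.
pcd = pcd.voxel_down_sample(voxel_size=0.05)  # 5 cm voxels (assumed)

filtered = np.asarray(pcd.points)
print(filtered.shape)
```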
4) Choosing resolutions and frame rates
Resolution and FPS trade off detail, range precision, compute, and bandwidth.
- Higher resolution: improves small obstacle detection and stereo disparity precision, but increases matching cost and point cloud size.
- Higher FPS: reduces motion-induced artifacts and improves control responsiveness, but may force lower exposure or higher noise.
- Practical approach: pick the lowest resolution that still resolves the smallest obstacle you must reliably detect at the required distance, then allocate compute for filtering and fusion (a sizing sketch follows the heuristics below).
Example heuristics:
- Mobile base obstacle avoidance: moderate resolution + higher FPS is often better than high resolution + low FPS.
- Manipulation: prioritize close-range precision and edge quality; moderate FPS is acceptable if the arm motion is controlled.
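To make the “smallest obstacle at the required distance” guideline concrete, a small sizing sketch: estimate how many pixels an obstacle spans for a candidate resolution and horizontal FOV, then compare that against whatever minimum your detector or stereo matcher needs. All numbers are illustrative assumptions.

```python
import math

def pixels_on_obstacle(obstacle_m: float, distance_m: float,
                       image_width_px: int, hfov_deg: float) -> float:
    """Approximate horizontal pixel extent of an object at a given distance."""
    f_px = (image_width_px / 2) / math.tan(math.radians(hfov_deg) / 2)
    return obstacle_m * f_px / distance_m

# Example: a 5 cm object at 2 m with a 640 px wide, 90-degree HFOV camera.
print(f"{pixels_on_obstacle(0.05, 2.0, 640, 90.0):.1f} px across")  # ~8 px
```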
Selection guidance by robotics task
Indoor obstacle avoidance (mobile robots)
- Best fits: ToF or stereo; structured light can work well indoors but is vulnerable to sunlit windows and IR interference.
- Why: you need robust mid-range depth, stable in low-texture hallways and varying lighting.
- Practical setup:
- Mount at a height that sees the floor ahead and avoids self-occlusion by the chassis.
- Use depth-to-costmap projection with conservative inflation around unknown regions (see the sketch after this list).
- Apply temporal smoothing lightly to avoid lag in dynamic environments.
- Watch for: glass doors, glossy floors, and black rugs; add redundancy (bump sensors, cliff sensors, or 2D LiDAR) if safety-critical.
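A minimal sketch of the depth-to-costmap idea, assuming the depth points have already been transformed into a robot-centered base frame; grid geometry, height band, and cost values are illustrative (the 100 / -1 convention mirrors ROS-style occupancy grids).

```python
import numpy as np

def depth_points_to_costmap(points_base: np.ndarray,   # (N, 3) points in the base frame, meters
                            grid_size_m: float = 6.0,
                            resolution_m: float = 0.05,
                            z_band=(0.05, 1.5)) -> np.ndarray:
    """Mark occupied cells from depth points; cells never observed stay unknown."""
    n_cells = int(grid_size_m / resolution_m)
    grid = np.full((n_cells, n_cells), -1, dtype=np.int8)   # -1 = unknown
    # Keep points in the obstacle height band (drop floor and overhead returns).
    mask = (points_base[:, 2] > z_band[0]) & (points_base[:, 2] < z_band[1])
    pts = points_base[mask]
    # Robot-centered x/y to grid indices.
    ij = ((pts[:, :2] + grid_size_m / 2) / resolution_m).astype(int)
    ok = (ij >= 0).all(axis=1) & (ij < n_cells).all(axis=1)
    grid[ij[ok, 1], ij[ok, 0]] = 100  # occupied
    # Free space would come from ray tracing sensor-to-hit; unobserved cells
    # remain -1, so depth holes are not silently treated as free.
    return grid
```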
Manipulation at close range (tabletop, bin picking)
- Best fits: structured light (high close-range detail) or high-quality ToF; stereo can work if you ensure texture or add active illumination.
- Why: grasp planning benefits from clean edges and accurate surface geometry at 0.2–1.5 m.
- Practical setup:
- Align depth to RGB for segmentation-to-grasp pipelines (RGB selects object; depth provides 3D pose/shape).
- Use edge-preserving filtering; avoid aggressive smoothing that rounds corners.
- Filter point clouds with voxel downsampling tuned to gripper tolerance (e.g., voxel size smaller than fingertip width).
- Watch for: shiny packaging, transparent containers, and black foam; plan fallbacks (multi-view scanning, tactile confirmation, or model-based pose from RGB).
Outdoor navigation under sunlight
- Best fits: stereo (passive) is often the most reliable camera-based depth outdoors; ToF can work depending on power and optical filtering but may degrade in strong sun; structured light is usually the least suitable.
- Why: sunlight overwhelms many active IR systems; passive stereo avoids emitter washout but needs texture and good exposure control.
- Practical setup:
- Use a larger baseline if you need longer-range depth for speed, balanced against occlusions and mounting constraints (a baseline-sizing sketch follows this list).
- Prioritize robust matching settings and consider adding polarization/ND filtering if glare is severe (while monitoring SNR).
- Fuse with other ranging (LiDAR, radar) when operating at higher speeds or in low-texture scenes (snow, uniform asphalt).
- Watch for: low-texture roads, strong shadows, and reflective vehicles; treat monocular depth as a supportive cue for semantics and relative ordering, not as a sole safety sensor.
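A minimal sketch of the baseline-sizing point above: invert the stereo error relation |ΔZ| ≈ Z² · Δd / (f · B) from the stereo section to get the baseline needed for a target depth error at a target range. Focal length and disparity-error values are illustrative assumptions.

```python
f_px = 700.0             # focal length in pixels (assumed)
disparity_err_px = 0.25  # expected matching error in pixels (assumed)

target_range_m = 15.0
target_depth_err_m = 0.5

# B = Z^2 * delta_d / (f * delta_Z)
baseline_m = target_range_m ** 2 * disparity_err_px / (f_px * target_depth_err_m)
print(f"required baseline ≈ {baseline_m:.2f} m")  # ~0.16 m for these numbers
```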