Computer Vision Basics: Understanding Images, Features, and Modern Pipelines

Image Preprocessing for Reliable Vision Pipelines

Chapter 2

Estimated reading time: 9 minutes

1) Resizing strategies and their effect on geometry

Most vision models expect a fixed input size (for example 224×224 for classification or 640×640 for detection). Resizing is not just a convenience: it changes pixel geometry, object proportions, and sometimes the meaning of coordinates (bounding boxes, keypoints). Choosing a strategy is about controlling what geometric distortions you allow.

Stretch (direct resize)

What it does: Scales the image to the target width and height independently, regardless of original aspect ratio.

  • Pros: Simple, fast, no empty pixels, preserves full field of view (nothing is cut off).
  • Cons: Distorts shapes (circles become ovals). For tasks sensitive to geometry (detection, pose, OCR), this can reduce accuracy.
  • Geometry impact: A point (x, y) maps to (x·sx, y·sy) with scale factors sx = W/Win and sy = H/Hin, which generally differ. Boxes and keypoints must be scaled with the same anisotropic factors.
As a concrete illustration, here is a minimal stretch-resize sketch in Python using OpenCV. The function name and the [x1, y1, x2, y2] box convention are assumptions for this example, not tied to any particular library:
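import cv2

def stretch_resize(img, target_w, target_h, boxes=None):
    # Scale x and y independently so the output matches the target size exactly.
    h_in, w_in = img.shape[:2]
    sx, sy = target_w / w_in, target_h / h_in
    out = cv2.resize(img, (target_w, target_h), interpolation=cv2.INTER_LINEAR)
    if boxes is not None:
        # Boxes are [x1, y1, x2, y2]; apply the same anisotropic factors.
        boxes = [[x1 * sx, y1 * sy, x2 * sx, y2 * sy] for x1, y1, x2, y2 in boxes]
    return out, boxes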

Pad / letterbox (aspect-ratio preserving resize + padding)

What it does: Resizes the image uniformly to fit within the target size, then pads the remaining area (often with zeros or a constant color).

  • Pros: Preserves object shapes and aspect ratio; common for detection models.
  • Cons: Adds artificial borders that the model may learn; changes where objects appear relative to the full canvas.
  • Geometry impact: Coordinates must be scaled uniformly and then shifted by the padding offset. If you forget the offset, predictions will be systematically displaced.

Step-by-step (with coordinate bookkeeping):

  • Compute uniform scale s = min(W/Win, H/Hin).
  • Resize to (round(Win·s), round(Hin·s)).
  • Compute padding: padX = W - newW, padY = H - newH.
  • Distribute padding (often half on each side): left = padX/2, top = padY/2.
  • To map an original point (x, y): x' = x·s + left, y' = y·s + top.
  • To map a box [x1,y1,x2,y2]: scale all coordinates by s, then add offsets to x and y coordinates.
The sketch below turns these formulas into Python using OpenCV and NumPy. The function names and the gray padding value of 114 (a convention some detectors use) are illustrative assumptions:
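import cv2
import numpy as np

def letterbox(img, target_w, target_h, pad_value=114):
    # Uniform scale s = min(W/Win, H/Hin) so the whole image fits the canvas.
    h_in, w_in = img.shape[:2]
    s = min(target_w / w_in, target_h / h_in)
    new_w, new_h = round(w_in * s), round(h_in * s)
    resized = cv2.resize(img, (new_w, new_h), interpolation=cv2.INTER_LINEAR)
    # Distribute the leftover pixels as padding, roughly half per side.
    left = (target_w - new_w) // 2
    top = (target_h - new_h) // 2
    canvas = np.full((target_h, target_w) + img.shape[2:], pad_value, dtype=img.dtype)
    canvas[top:top + new_h, left:left + new_w] = resized
    return canvas, s, left, top

def box_to_canvas(box, s, left, top):
    # Map an original-image box [x1, y1, x2, y2] onto the letterboxed canvas.
    x1, y1, x2, y2 = box
    return [x1 * s + left, y1 * s + top, x2 * s + left, y2 * s + top]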

Crop (center crop or random crop)

What it does: Selects a region of the image and discards the rest. Often used in classification training (random crop for augmentation) and sometimes in inference (center crop).

  • Pros: Can focus on the subject; can act as augmentation to improve robustness; avoids padding artifacts.
  • Cons: May cut off important content; for detection/segmentation it can remove objects or truncate them, requiring careful label updates.
  • Geometry impact: Coordinates must be shifted by the crop origin (subtract cropX, cropY), then possibly resized to the model input size.

Step-by-step (safe crop workflow):

  • Choose crop window (x0, y0, cropW, cropH).
  • Crop image to that window.
  • If needed, resize cropped image to model input size.
  • Update labels: subtract (x0, y0) from all coordinates; clip boxes to crop bounds; optionally drop boxes with too little remaining area (see the sketch below).
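A minimal sketch of this workflow in Python; the [x1, y1, x2, y2] box convention and the min_keep area threshold are illustrative assumptions:

def crop_with_labels(img, x0, y0, crop_w, crop_h, boxes, min_keep=0.25):
    # Crop the pixel window, then shift, clip, and filter the boxes.
    crop = img[y0:y0 + crop_h, x0:x0 + crop_w]
    kept = []
    for x1, y1, x2, y2 in boxes:
        orig_area = max(x2 - x1, 0) * max(y2 - y1, 0)
        # Shift into crop coordinates, then clip to the crop bounds.
        nx1, ny1 = max(x1 - x0, 0), max(y1 - y0, 0)
        nx2, ny2 = min(x2 - x0, crop_w), min(y2 - y0, crop_h)
        new_area = max(nx2 - nx1, 0) * max(ny2 - ny1, 0)
        # Drop boxes whose surviving area falls below the threshold.
        if orig_area > 0 and new_area / orig_area >= min_keep:
            kept.append([nx1, ny1, nx2, ny2])
    return crop, kept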

Practical guidance: choosing a resizing strategy

  • Classification: stretch is often acceptable; crop (random during training, center at inference) is common.
  • Detection / pose / OCR: prefer letterbox to preserve aspect ratio; only use stretch if the model was trained that way.
  • Any task with coordinates: treat resizing as a geometric transform and keep the exact mapping so you can invert it when projecting predictions back to the original image.

2) Normalization and standardization (mean/std per channel)

After resizing, pixel values are typically transformed to a numeric range and distribution that the model expects. Two common steps are (a) scaling to a range (like 0–1) and (b) standardizing each channel using a mean and standard deviation. The key requirement is consistency: the same preprocessing used during training must be used during inference.

Normalization (range scaling)

What it does: Converts integer pixels (0–255) to floats, typically scaling to 0–1 by dividing by 255; some pipelines scale to -1 to 1 instead.

  • Why it helps: Stabilizes optimization and avoids large numeric ranges.
  • Common pitfall: Forgetting to convert to float before division (integer division issues in some environments) or mixing 0–1 and 0–255 conventions between training and inference.

Standardization (per-channel mean/std)

What it does: For each channel c, compute: x' = (x - mean[c]) / std[c]. Means/stds may come from the training dataset or from a known reference used by a pretrained backbone.

  • Why it helps: Centers and scales features so early layers see a consistent distribution, improving training stability and transfer learning performance.
  • What can go wrong: Using RGB means on BGR images, applying mean/std to 0–255 values when they were intended for 0–1, or using different statistics at inference than training.

Step-by-step consistency checklist:

  • Confirm color order (RGB vs BGR) before any mean/std operation.
  • Confirm value range (0–255 vs 0–1) expected by the mean/std numbers.
  • Store preprocessing parameters with the model artifact (config file, model card, or code constants).
  • Write one shared preprocessing function used by both training and inference to avoid drift.
  • Add a unit test: feed a known image and assert the preprocessed tensor matches a saved reference within tolerance.
Per channel c, the formula is x' = (x - mean[c]) / std[c]. A minimal sketch, assuming RGB input scaled to 0–1 and the widely published ImageNet statistics; confirm the values your backbone actually expects:
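import numpy as np

# Widely published ImageNet statistics for RGB inputs in the 0-1 range;
# confirm against the configuration your pretrained backbone expects.
MEAN = np.array([0.485, 0.456, 0.406], dtype=np.float32)
STD = np.array([0.229, 0.224, 0.225], dtype=np.float32)

def standardize(img_rgb_uint8):
    # Convert to float before dividing, then standardize each channel.
    x = img_rgb_uint8.astype(np.float32) / 255.0
    return (x - MEAN) / STD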

3) Denoising and smoothing (median/gaussian) as concept-level tools

Denoising reduces unwanted high-frequency variations that come from sensors, compression, or transmission. It can improve robustness when noise is not informative for the task, but it can also remove small details that matter (fine texture, thin edges, small text). Treat denoising as a targeted tool, not a default step.

Gaussian blur (smoothing)

What it removes: High-frequency noise and small variations by averaging with a Gaussian kernel. It tends to blur edges.

  • When it helps: Low-light sensor noise, mild compression artifacts, stabilizing inputs for classical edge/feature pipelines, or reducing false positives from speckle-like noise.
  • When it hurts: Small object detection, OCR, or tasks where sharp edges are critical.
  • Knobs: Kernel size and sigma (larger means more blur). Prefer the smallest setting that addresses the noise.

Median filter (impulse noise removal)

What it removes: Salt-and-pepper (impulse) noise by replacing each pixel with the median in a neighborhood. It preserves edges better than Gaussian blur for this noise type.

  • When it helps: Dead pixels, transmission glitches, binary-like speckles.
  • When it hurts: Fine textures can become blocky; repeated application can distort thin structures.
  • Knobs: Neighborhood size (odd dimensions like 3×3, 5×5). Larger removes more impulses but risks detail loss.
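Both filters are single calls in OpenCV. A minimal sketch, where the file name is a placeholder and the kernel sizes are conservative starting points rather than recommendations:

import cv2

img = cv2.imread("frame.png")  # placeholder path; assumes a valid uint8 image

# Gaussian blur: odd kernel size; sigma=0 lets OpenCV derive it from the kernel.
smoothed = cv2.GaussianBlur(img, (5, 5), 0)

# Median filter: ksize is one odd integer; suited to impulse (salt-and-pepper) noise.
despeckled = cv2.medianBlur(img, 3)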

Practical step-by-step: deciding whether to denoise

  • Inspect a sample of failure cases: are errors correlated with visible noise?
  • Identify noise type: impulse-like (try median) vs grainy Gaussian-like (try Gaussian blur).
  • Run an A/B test: evaluate metrics with and without denoising on a validation set that matches deployment conditions.
  • Prefer training-time augmentation (adding noise) over heavy inference-time smoothing when possible, so the model learns robustness without losing detail.

4) Contrast adjustments (histogram equalization, CLAHE)

Contrast adjustments change the distribution of intensities to make details more visible. They can help when lighting varies or images are under/overexposed, but they can also amplify noise, create unnatural edges, or shift color relationships. For color images, contrast operations are often applied to a luminance channel rather than independently per RGB channel to avoid color artifacts.

Histogram equalization

What it does: Remaps intensities so the histogram becomes more uniform, increasing global contrast.

  • When it helps: Grayscale imagery with low contrast (some industrial inspection, document scans, and similar grayscale medical-style contexts).
  • When it hurts: It can over-amplify noise in flat regions and can wash out regions that were already well-exposed. On natural color images, applying it per RGB channel can produce unrealistic colors.

CLAHE (Contrast Limited Adaptive Histogram Equalization)

What it does: Performs histogram equalization locally (tile-by-tile) with a clip limit to prevent extreme amplification.

  • When it helps: Uneven illumination (shadows, vignetting), local low-contrast details, improving visibility without globally blowing out highlights.
  • When it hurts: Can introduce tile boundary artifacts if parameters are poor; can still amplify noise in very dark regions; may change the appearance distribution relative to training data.
  • Knobs: Tile grid size and clip limit. Smaller tiles increase local adaptation; higher clip limit increases contrast but risks noise amplification.

Step-by-step: safe contrast adjustment workflow

  • Decide the color space: for color images, convert to a luminance-based space and apply the adjustment to the luminance channel rather than per RGB channel (see the sketch after this list).
  • Start with CLAHE using conservative settings (moderate tile size, low clip limit).
  • Validate on representative data: check both quantitative metrics and qualitative artifacts (noise amplification, halos).
  • Keep training and inference aligned: if you apply CLAHE at inference, include it during training (or train with augmentations that mimic it).
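As one way to implement the first two steps, here is a sketch in Python with OpenCV that applies CLAHE to the lightness channel of LAB; the function name and the conservative defaults are assumptions:

import cv2

def clahe_on_luminance(img_bgr, clip_limit=2.0, tile_grid=(8, 8)):
    # Equalize only the L (lightness) channel to avoid per-RGB color artifacts.
    lab = cv2.cvtColor(img_bgr, cv2.COLOR_BGR2LAB)
    l, a, b = cv2.split(lab)
    clahe = cv2.createCLAHE(clipLimit=clip_limit, tileGridSize=tile_grid)
    lab = cv2.merge((clahe.apply(l), a, b))
    return cv2.cvtColor(lab, cv2.COLOR_LAB2BGR)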

5) Handling corrupted inputs and edge cases (empty frames, extreme exposure)

Reliable pipelines assume inputs can be missing, corrupted, or outside normal operating conditions. Instead of letting preprocessing crash or silently produce garbage tensors, add explicit validation and fallback behavior. This is especially important in production systems that process video streams, user uploads, or networked cameras.

Common edge cases

  • Empty frame / null image: No data returned from a camera read, decode failure, or zero-sized array.
  • Corrupted decode: Partial JPEG/PNG, wrong color format, unexpected number of channels.
  • Extreme exposure: Nearly all-black or all-white frames (lens cap, lights off, overexposed scene).
  • Unexpected dtype/range: Float images in 0–1 when you expect uint8 0–255, or vice versa.
  • NaNs/Infs: Can appear after prior processing steps or faulty sensors.

Basic validation checks (practical)

These checks are lightweight and can prevent cascading failures.

  • Existence and shape: Verify the image is not null and has positive width/height.
  • Channel check: Ensure expected channels (1 for grayscale or 3 for color). If 4 channels (RGBA), decide whether to drop alpha or composite onto a background.
  • Dtype and range: Confirm dtype (uint8 vs float32) and expected range before normalization/standardization.
  • Finite values: Check for NaN/Inf and handle (replace, clip, or reject).
  • Exposure sanity: Compute simple statistics (mean intensity, percent of pixels near 0 or 255). Flag frames that are almost uniform.
A minimal validation sketch in Python with NumPy; the reason codes and the uint8 exposure thresholds are illustrative assumptions:
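import numpy as np

def validate_frame(img):
    # Return (ok, reason) so callers can log a specific failure code.
    if img is None or getattr(img, "size", 0) == 0:
        return False, "empty_frame"
    if img.ndim not in (2, 3) or img.shape[0] == 0 or img.shape[1] == 0:
        return False, "bad_shape"
    channels = 1 if img.ndim == 2 else img.shape[2]
    if channels not in (1, 3, 4):
        return False, "bad_channels"
    if np.issubdtype(img.dtype, np.floating) and not np.isfinite(img).all():
        return False, "non_finite"
    # Exposure sanity for uint8 frames: flag nearly uniform black/white frames.
    if img.dtype == np.uint8 and (img.mean() < 2.0 or img.mean() > 253.0):
        return False, "extreme_exposure"
    return True, "ok"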

Fallback strategies

  • Skip with reason: For batch processing, skip the sample and log the error code (decode_failed, empty_frame, bad_shape).
  • Return a safe default: For real-time systems, return “no prediction” with a confidence of 0 and a status flag rather than forcing a tensor through the model.
  • Auto-repair (carefully): If channels are wrong (RGBA), convert deterministically; if grayscale but model expects RGB, replicate the channel. Avoid “guessing” ranges without detection.
  • Clamp and sanitize: Clip values to expected range and replace NaNs/Infs before standardization to avoid propagating invalid numbers.

Step-by-step: a robust preprocessing wrapper

  • Validate input (null/shape/channels/dtype/finite values).
  • Convert color format to the model’s expected order (and document it).
  • Apply resizing strategy (and store transform parameters if you need to map coordinates back).
  • Apply optional denoising/contrast steps only if they are part of the trained pipeline.
  • Normalize and standardize using the exact training configuration.
  • Return both the processed tensor and metadata (original size, scale, padding/crop offsets, any warnings). A sketch tying these steps together follows.
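As one way to put the pieces together, here is a sketch that reuses the letterbox, validate_frame, MEAN, and STD helpers from earlier in this chapter; the 640×640 default and the BGR-to-RGB conversion are illustrative assumptions:

import cv2
import numpy as np

def preprocess_with_metadata(img_bgr, target_w=640, target_h=640):
    # Validate first so corrupted frames never reach the model.
    ok, reason = validate_frame(img_bgr)
    if not ok:
        return None, {"status": reason}
    # Convert to the model's documented color order (RGB assumed here).
    img_rgb = cv2.cvtColor(img_bgr, cv2.COLOR_BGR2RGB)
    # Letterbox and keep the transform so predictions can be mapped back.
    canvas, s, left, top = letterbox(img_rgb, target_w, target_h)
    # Normalize and standardize with the exact training configuration.
    tensor = (canvas.astype(np.float32) / 255.0 - MEAN) / STD
    meta = {"status": "ok", "orig_hw": img_bgr.shape[:2],
            "scale": s, "pad": (left, top)}
    return tensor, meta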

Now answer the exercise about the content:

When using letterbox resizing (aspect-ratio preserving resize + padding) for a task with bounding boxes, what is the correct way to transform coordinates from the original image to the resized canvas?

Answer: Letterbox resizing preserves aspect ratio by using one uniform scale factor, then pads the remaining area. Coordinates must be scaled by that factor and then shifted by the padding offsets; otherwise predictions will be systematically displaced.

Next chapter

Data Augmentation: Creating Variation Without Changing the Label
