1) The image tensor: H×W×C
In computer vision, an image is typically handled as an array (often called a tensor) with shape H×W×C. This is the in-memory representation used by libraries and models, regardless of how the image was stored on disk.
- H (height): number of rows of pixels (top to bottom).
- W (width): number of columns of pixels (left to right).
- C (channels): how many values describe each pixel.
Common channel configurations:
- Grayscale: C=1. Each pixel has one intensity value (dark to bright).
- RGB color: C=3. Each pixel has three values: Red, Green, Blue.
- RGBA: C=4. RGB plus an alpha channel (transparency).
Be aware that some tools store the channel dimension first (C×H×W) instead of last (H×W×C). This difference is purely about memory layout and API expectations, but mixing them up causes subtle bugs (e.g., models receiving scrambled colors or wrong shapes).
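To make the layout difference concrete, here is a minimal sketch of converting between the two conventions (NumPy is assumed here purely for illustration; the text is not tied to a specific library):
# Sketch: channel-last (H×W×C) vs channel-first (C×H×W)
import numpy as np
img_hwc = np.zeros((480, 640, 3), dtype=np.uint8)   # H×W×C layout
img_chw = np.transpose(img_hwc, (2, 0, 1))          # move channels first: (3, 480, 640)
img_back = np.transpose(img_chw, (1, 2, 0))         # and back to channel-last: (480, 640, 3)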
# Example shapes (conceptual, not tied to a specific library)
grayscale: (H, W) or (H, W, 1)
RGB: (H, W, 3)
batch of RGB images: (N, H, W, 3) or (N, 3, H, W)
2) Intensity ranges: 0–255 vs 0–1 (and why normalization matters)
Pixels are numbers, but the numeric range depends on the data type and the stage of the pipeline.
- 8-bit images (uint8): values are integers in [0, 255]. This is common when decoding JPEG/PNG into memory.
- Floating-point images (float32/float64): values are often scaled to [0, 1] for learning algorithms.
Normalization matters because many models and algorithms assume inputs are in a particular range. If you feed 0–255 values into a model trained on 0–1, activations can saturate and predictions degrade. Conversely, scaling twice (e.g., dividing by 255 two times) makes the image too dark and reduces signal.
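A tiny sketch of the double-scaling pitfall (NumPy assumed, values chosen only for illustration): dividing an already-normalized image by 255 a second time pushes every value close to zero.
# Sketch: dividing by 255 twice washes the image out
import numpy as np
img = np.full((2, 2, 3), 255, dtype=np.uint8)   # a pure-white uint8 image
once = img.astype(np.float32) / 255.0           # correct: maximum value becomes 1.0
twice = once / 255.0                            # accidental second division: ~0.004
print(once.max(), twice.max())                  # 1.0 0.003921569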
Practical step-by-step: normalize safely
- Step 1: Inspect dtype and range. Check whether your array is uint8 or float, and look at min/max values.
- Step 2: Convert to float before dividing (to avoid integer truncation in some environments).
- Step 3: Scale to the expected range (commonly /255).
- Step 4 (optional): standardize using mean/std per channel if required by a pretrained model.
# Pseudocode
# img: H×W×C
if img.dtype == uint8:
    img = img.astype(float32) / 255.0
# Now in [0, 1]
# Optional: model-specific normalization
# img = (img - mean) / std
Tip: keep a consistent convention across your pipeline (e.g., “all tensors are float32 in [0,1] after decoding”), and enforce it at module boundaries.
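For step 4, here is a minimal runnable sketch of per-channel standardization (NumPy is assumed; the mean/std defaults are the widely cited ImageNet statistics and are only an example, so use whatever your pretrained model actually documents):
# Sketch: scale to [0, 1], then standardize per channel
import numpy as np

def normalize_image(img, mean=(0.485, 0.456, 0.406), std=(0.229, 0.224, 0.225)):
    # Scale a uint8 H×W×3 image to [0, 1], then standardize per channel.
    if img.dtype == np.uint8:
        img = img.astype(np.float32) / 255.0   # convert to float first, then scale
    mean = np.asarray(mean, dtype=np.float32)
    std = np.asarray(std, dtype=np.float32)
    return (img - mean) / std                  # broadcasts over H and W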
3) Color spaces: RGB, BGR, HSV, YCbCr and when conversions are used
A color space defines how the three (or more) channel values map to perceived color. The same image content can be represented in different color spaces depending on the task.
RGB
RGB is the most common conceptual representation: each pixel has red, green, and blue intensities. Many deep learning pipelines assume RGB ordering.
BGR
BGR is a common alternative ordering used by some computer vision libraries; the channels themselves are the same, just permuted. If you forget to convert BGR to RGB (or vice versa), colors look “wrong” (e.g., reds and blues swapped), and models trained on RGB can lose accuracy.
# Pseudocode: BGR to RGB (channel swap)
rgb = bgr[..., [2, 1, 0]]
HSV
HSV separates color into Hue (type of color), Saturation (colorfulness), and Value (brightness). It is often used for tasks like color-based segmentation because thresholds on hue can be more intuitive than thresholds on raw RGB.
- Example use: detect ripe fruit by selecting a hue range.
- Example use: robust color masking under moderate lighting changes (though not perfect).
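As a concrete illustration of hue-range masking, here is a minimal sketch using OpenCV (an assumption; the text does not name a library). Note that OpenCV stores 8-bit hue in the range 0–179, and the bounds below are illustrative only:
# Sketch: color masking by thresholding hue in HSV
import cv2
import numpy as np

bgr = cv2.imread("fruit.jpg")                   # hypothetical input path; OpenCV decodes to BGR
hsv = cv2.cvtColor(bgr, cv2.COLOR_BGR2HSV)

lower = np.array([0, 80, 80], dtype=np.uint8)   # illustrative reddish hue range; tune for your data
upper = np.array([10, 255, 255], dtype=np.uint8)
mask = cv2.inRange(hsv, lower, upper)           # 255 where a pixel falls in range, else 0
masked = cv2.bitwise_and(bgr, bgr, mask=mask)   # keep only the selected pixels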
YCbCr (and related YUV-family spaces)
YCbCr separates luminance (Y) from chrominance (Cb, Cr). This is common in image/video compression and can be useful when you want to process brightness separately from color.
- Example use: face detection or tracking sometimes benefits from focusing on luminance or using chroma cues.
- Example use: compression artifacts and subsampling are often discussed in YCbCr terms.
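As a small illustration of separating luminance from chrominance, the sketch below computes just the Y (luma) channel from an RGB image using the BT.601 weights (NumPy assumed; a full YCbCr conversion also includes the Cb/Cr chroma terms and offsets, which imaging libraries handle for you):
# Sketch: BT.601 luma from an H×W×3 RGB image
import numpy as np

def rgb_to_luma_bt601(rgb):
    rgb = rgb.astype(np.float32)
    # Weighted sum reflecting the eye's higher sensitivity to green
    return 0.299 * rgb[..., 0] + 0.587 * rgb[..., 1] + 0.114 * rgb[..., 2]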
Practical step-by-step: choose and convert intentionally
- Step 1: Identify what your model/algorithm expects (RGB vs BGR; normalized range; specific preprocessing).
- Step 2: Convert once at the boundary (right after decoding), not repeatedly throughout the code.
- Step 3: Validate with a sanity check: visualize a few samples after conversion and confirm colors look correct.
Important: conversions can be lossy if you quantize back to 8-bit repeatedly. Prefer keeping a high-precision float representation during processing and only quantize when saving outputs.
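To see why intermediate quantization matters, here is a minimal sketch of the rounding error introduced by a single float→uint8→float round trip (NumPy assumed; each extra quantization step in a pipeline can add error of this size):
# Sketch: quantization error from one float -> uint8 -> float round trip
import numpy as np

x = np.linspace(0.0, 1.0, 7, dtype=np.float32)   # sample values in [0, 1]
q = np.round(x * 255).astype(np.uint8)           # quantize to 8-bit
back = q.astype(np.float32) / 255.0              # dequantize
print(np.abs(back - x).max())                    # up to about 0.002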
4) Resolution trade-offs and aspect ratio considerations
Resolution is the image size in pixels (H×W). Higher resolution preserves more detail but increases memory and compute cost. Many operations scale roughly with the number of pixels, so doubling both height and width can increase compute by about 4×.
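A quick arithmetic check of that scaling (illustrative numbers only):
# 640×480 -> 1280×960 quadruples the pixel count
print(640 * 480)     # 307200 pixels
print(1280 * 960)    # 1228800 pixels, i.e. 4x as many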
Detail vs compute
- High resolution helps when small objects matter (e.g., reading text, detecting tiny defects).
- Lower resolution helps when speed matters or when the task depends on coarse structure (e.g., scene classification).
Practical step-by-step: resizing without breaking the task
- Step 1: Decide the target size based on model input requirements or latency constraints.
- Step 2: Preserve aspect ratio when possible to avoid geometric distortion.
- Step 3: If a fixed size is required, use one of these strategies:
- Letterbox/pad: resize to fit within the target while preserving aspect ratio, then pad the remaining area.
- Center crop: resize so the shorter side matches the target, then crop the center (risk: may cut off important content).
- Warp (stretch): resize directly to target H×W (fast, but distorts shapes; can harm detection/pose tasks).
# Pseudocode outline: letterbox
scale = min(target_w / w, target_h / h)
new_w, new_h = round(w*scale), round(h*scale)
resized = resize(img, (new_h, new_w))
padded = pad_to_target(resized, target_h, target_w, fill=0)
Aspect ratio mistakes show up as “squished” objects, shifted bounding boxes, or inconsistent performance between portrait and landscape images. If you resize images, you must also apply the same geometric transform to any associated labels (keypoints, masks, bounding boxes).
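Here is a runnable sketch of letterboxing that also carries bounding boxes through the same scale and offset (OpenCV is assumed for the resize, and boxes are assumed to be (x1, y1, x2, y2) in pixel coordinates; both are illustrative choices, not something the text prescribes):
# Sketch: letterbox an H×W×C image and transform its boxes consistently
import cv2
import numpy as np

def letterbox_with_boxes(img, boxes, target_h, target_w, fill=0):
    h, w = img.shape[:2]
    scale = min(target_w / w, target_h / h)
    new_w, new_h = int(round(w * scale)), int(round(h * scale))

    resized = cv2.resize(img, (new_w, new_h))            # note: cv2.resize takes (width, height)
    pad_x = (target_w - new_w) // 2                      # center the resized image on the canvas
    pad_y = (target_h - new_h) // 2

    canvas = np.full((target_h, target_w, img.shape[2]), fill, dtype=img.dtype)
    canvas[pad_y:pad_y + new_h, pad_x:pad_x + new_w] = resized

    boxes = np.asarray(boxes, dtype=np.float32) * scale  # same scale as the pixels...
    boxes[:, [0, 2]] += pad_x                            # ...and the same horizontal offset
    boxes[:, [1, 3]] += pad_y                            # ...and vertical offset
    return canvas, boxes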
5) Metadata pitfalls: EXIF orientation and pipeline breakage
Image files can contain metadata that affects how they should be displayed. A common example is EXIF orientation, especially from phone cameras. The pixel array stored in the file may be “sideways,” and the correct upright view is obtained by applying the rotation/flip specified by the orientation tag at display time.
This becomes a serious pipeline issue when:
- You train on images that appear upright in a viewer (which applies EXIF), but your decoder loads raw pixels without applying EXIF orientation.
- You compute bounding boxes or keypoints in one orientation but run inference in another.
- You mix sources: some images are already physically rotated (pixels corrected), others rely on EXIF tags.
Practical step-by-step: handle EXIF orientation reliably
- Step 1: After decoding, check for orientation metadata (if available in your I/O stack).
- Step 2: Apply the required rotate/flip to the pixel array to make the image “upright” in memory.
- Step 3: Clear or ignore the orientation tag afterward so downstream steps don’t apply it again.
- Step 4: Keep labels consistent: if you rotate/flip the image, transform annotations the same way.
# Pseudocode concept
img, exif = decode_with_metadata(path)
img = apply_exif_orientation(img, exif.orientation)
exif.orientation = 1  # normalized/upright
When debugging a vision pipeline, a fast diagnostic is to visualize a batch right after loading (before any augmentation) and confirm that people, text, and horizons are upright. If not, fix orientation handling before tuning models or augmentations.
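If Pillow happens to be part of your I/O stack (an assumption; the text does not name a library), steps 1–3 collapse into a single call, since ImageOps.exif_transpose both applies the rotation/flip and removes the orientation tag from the result:
# Sketch: upright the pixels and drop the orientation tag with Pillow
from PIL import Image, ImageOps

img = Image.open("photo_from_phone.jpg")   # hypothetical input path
img = ImageOps.exif_transpose(img)         # rotate/flip per EXIF; result no longer carries the tag
img.save("upright.jpg")                    # downstream code can now treat the pixels as upright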