End-to-End View: What a CNN Pipeline Is Trying to Do
A modern CNN-based pipeline turns an input image into a set of intermediate feature maps, then converts those features into task-specific predictions (for example, class probabilities, bounding boxes, or pixel masks). Conceptually, you can think of it as two big parts: (1) a feature extractor (often called a backbone) built from repeated convolutional blocks, and (2) one or more prediction heads that map features to outputs. Training adjusts the backbone and heads so that the outputs match ground truth; inference runs the same forward computation but replaces learning steps with postprocessing that turns raw predictions into final results.
1) Model Components: Convolution, Activation, Pooling/Strides, and Receptive Fields
Convolution: local pattern detectors applied everywhere
A convolution layer slides small filters (kernels) across the image or a feature map. Each filter produces one output channel (one feature map) by computing a weighted sum over a local neighborhood. Key ideas at a high level:
- Local connectivity: each output value depends only on a small patch of the input.
- Weight sharing: the same filter is used at every spatial location, so the model can detect the same pattern anywhere.
- Multiple channels: many filters run in parallel, producing a stack of feature maps that represent different learned patterns.
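To make these ideas concrete, here is a minimal sketch using PyTorch (the layer sizes and image resolution are illustrative assumptions, not values from the discussion above):

# Illustrative convolution layer: 16 filters over a 3-channel input (PyTorch assumed available)
import torch
import torch.nn as nn

conv = nn.Conv2d(in_channels=3, out_channels=16, kernel_size=3, padding=1)

x = torch.randn(1, 3, 224, 224)      # one RGB image as a batch of size 1
y = conv(x)

print(y.shape)            # torch.Size([1, 16, 224, 224]): 16 feature maps, one per filter
print(conv.weight.shape)  # torch.Size([16, 3, 3, 3]): each 3x3 filter is reused at every
                          # spatial location (weight sharing), seeing only a local patch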
Activation: adding nonlinearity so features can become expressive
After convolution, an activation function (commonly ReLU or variants) is applied elementwise. Without activation, stacking convolutions would still be equivalent to a single linear operation; activations make the network capable of modeling complex, nonlinear relationships. Practically, activations help the network build increasingly abstract features as depth increases.
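The "stacking without activations stays linear" point can be checked directly; a tiny NumPy illustration (the matrix sizes are arbitrary):

# Two linear maps with no activation collapse into a single linear map
import numpy as np

W1 = np.random.randn(4, 8)
W2 = np.random.randn(3, 4)
x = np.random.randn(8)

two_layers = W2 @ (W1 @ x)
one_layer = (W2 @ W1) @ x
print(np.allclose(two_layers, one_layer))          # True: stacking added no expressive power

relu = lambda z: np.maximum(z, 0.0)                # an elementwise nonlinearity
print(np.allclose(W2 @ relu(W1 @ x), one_layer))   # almost surely False: no longer one linear map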
Pooling and strides: reducing spatial size while increasing context
CNNs often reduce the spatial resolution of feature maps as they go deeper. This is done by:
- Strided convolution: moving the convolution window by more than one pixel (stride > 1), which downsamples the output.
- Pooling: summarizing local neighborhoods (for example, max pooling) to reduce resolution.
Downsampling has two main effects: it reduces computation and it increases how much of the original image each feature “sees” at deeper layers.
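The standard output-size formula makes this concrete; a small helper (hypothetical, but the arithmetic is standard) shows how stride-2 operations halve the resolution:

# Output spatial size for a convolution or pooling window
def conv_out_size(in_size, kernel, stride=1, padding=0):
    return (in_size + 2 * padding - kernel) // stride + 1

print(conv_out_size(224, kernel=3, stride=1, padding=1))  # 224: resolution preserved
print(conv_out_size(224, kernel=3, stride=2, padding=1))  # 112: strided conv downsamples
print(conv_out_size(224, kernel=2, stride=2))             # 112: 2x2 max pooling downsamples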
How receptive fields grow
The receptive field of a unit in a feature map is the region of the input image that can influence it. Receptive fields grow as you stack layers because each layer aggregates information from a neighborhood of the previous layer. Downsampling (stride/pooling) makes receptive fields grow faster because each step in the deeper feature map corresponds to a larger jump in the input image.
Conceptual example: if you stack several 3×3 convolutions, each layer expands the receptive field by roughly one pixel outward in each direction (in the coordinate system of the previous layer). If you also downsample, a single deep feature location can correspond to a large patch of the original image, enabling recognition of larger structures and object-level context.
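This growth can be computed with the usual receptive-field recurrence; the helper below is a sketch of that calculation (my own, using the standard formula):

# Receptive-field growth for a stack of layers, each given as (kernel_size, stride)
def receptive_field(layers):
    rf, jump = 1, 1        # start at one input pixel; neighboring outputs are 1 pixel apart
    for kernel, stride in layers:
        rf += (kernel - 1) * jump   # each layer reaches (kernel - 1) steps further outward
        jump *= stride              # downsampling makes each step a bigger jump in the input
    return rf

print(receptive_field([(3, 1), (3, 1), (3, 1)]))   # 7: three 3x3 convs, stride 1
print(receptive_field([(3, 2), (3, 2), (3, 2)]))   # 15: same convs, each downsampling by 2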
2) Feature Hierarchy: Early Layers vs Deeper Layers
Early layers: fine spatial detail and simple patterns
Early CNN layers operate on high-resolution feature maps and tend to represent simple, local patterns. They preserve precise spatial information (where something is) and respond strongly to small-scale structures. In practice, these layers are useful for tasks that need accurate boundaries and localization.
Middle layers: parts and texture-like compositions
As depth increases, features combine simpler patterns into more complex ones. Middle layers often represent repeated motifs, parts of objects, and more stable patterns that are less sensitive to small pixel-level changes.
Deeper layers: semantic, context-rich representations
Deep layers typically have lower spatial resolution but higher channel depth, capturing more abstract and context-aware information. These features are often more invariant to small shifts and local noise, and they are particularly useful for deciding what is present in the scene. The trade-off is that very deep features can lose fine-grained spatial precision unless the architecture explicitly preserves or recovers it (for example, with multi-scale features or skip connections).
3) Prediction Heads for Different Tasks
After the backbone produces feature maps, a head converts those features into the specific outputs needed by a task. Many systems share a backbone and attach different heads depending on the problem.
Classifier head: predicting class probabilities
A classifier head takes features and outputs scores for each class. Conceptually, it answers: “What is in this image?” or “What is in this region?” depending on whether it is image-level classification or region-based classification.
- Input: feature map(s) from the backbone (sometimes pooled into a single vector).
- Output: a vector of class logits or probabilities.
- Interpretation: higher score means the model believes that class is present.
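A sketch of what such a head might look like in PyTorch (the channel and class counts are assumptions): global average pooling collapses the feature map into one vector per image, and a linear layer scores each class.

# Hypothetical image-level classifier head (PyTorch)
import torch
import torch.nn as nn

class ClassifierHead(nn.Module):
    def __init__(self, in_channels=256, num_classes=10):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool2d(1)          # collapse H x W into one vector per image
        self.fc = nn.Linear(in_channels, num_classes)

    def forward(self, features):                     # features: [B, C, H, W] from the backbone
        x = self.pool(features).flatten(1)           # [B, C]
        return self.fc(x)                            # [B, num_classes] class logits

head = ClassifierHead()
logits = head(torch.randn(2, 256, 7, 7))
probs = torch.softmax(logits, dim=1)                 # probabilities; higher = more confident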
Box regression head: predicting object locations
A box regression head outputs bounding box coordinates. Conceptually, it answers: “Where is the object?” Typically, it predicts a small set of numbers per candidate object (for example, box center and size, or offsets relative to an anchor/reference box).
- Input: features corresponding to candidate locations/regions.
- Output: box parameters (e.g., x, y, w, h) or offsets that can be converted into a box.
- Interpretation: the model refines a coarse guess into a tight box around the object.
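One common parameterization predicts offsets relative to an anchor box; the sketch below decodes such offsets into a refined box (the exact encoding varies between detectors, so treat this as an assumption):

# Decode predicted offsets (dx, dy, dw, dh) relative to an anchor (cx, cy, w, h)
import math

def decode_box(anchor, offsets):
    cx, cy, w, h = anchor
    dx, dy, dw, dh = offsets
    new_cx = cx + dx * w             # shift the center, scaled by anchor size
    new_cy = cy + dy * h
    new_w = w * math.exp(dw)         # rescale width/height (exp keeps them positive)
    new_h = h * math.exp(dh)
    return (new_cx, new_cy, new_w, new_h)

# A coarse anchor plus small predicted offsets yields a tighter box around the object.
print(decode_box(anchor=(100.0, 80.0, 64.0, 64.0), offsets=(0.1, -0.05, 0.2, 0.0)))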
Mask head: predicting per-pixel object shapes
A mask head outputs a pixel-level mask for an object (instance segmentation) or for each class (semantic segmentation). Conceptually, it answers: “Which pixels belong to this object (or class)?”
- Input: higher-resolution features (often from earlier or multi-scale layers) plus object-specific features if doing instance masks.
- Output: a 2D mask (probability per pixel) for each object or class.
- Interpretation: thresholding the mask yields a binary shape; keeping probabilities supports soft blending or uncertainty-aware decisions.
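A sketch of a minimal instance-mask head in PyTorch, loosely in the spirit of Mask R-CNN-style heads (the channel counts and the 14-to-28-pixel upsampling are assumptions):

# Hypothetical instance-mask head: per-object features -> per-pixel probabilities
import torch
import torch.nn as nn

class MaskHead(nn.Module):
    def __init__(self, in_channels=256):
        super().__init__()
        self.conv = nn.Conv2d(in_channels, 256, kernel_size=3, padding=1)
        self.up = nn.ConvTranspose2d(256, 256, kernel_size=2, stride=2)  # 14x14 -> 28x28
        self.predict = nn.Conv2d(256, 1, kernel_size=1)                  # one mask channel

    def forward(self, roi_features):                  # [N_objects, C, 14, 14]
        x = torch.relu(self.conv(roi_features))
        x = torch.relu(self.up(x))
        return torch.sigmoid(self.predict(x))         # [N_objects, 1, 28, 28] probabilities

masks = MaskHead()(torch.randn(3, 256, 14, 14))       # one 28x28 probability map per object
binary = masks > 0.5                                  # thresholding yields hard shapes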
4) Training Loop Overview: Forward Pass, Loss, Backpropagation, Optimizer Steps
Training teaches the network parameters (filter weights and head weights) to produce correct predictions. At a conceptual level, training repeats the same loop over many batches of data.
Step-by-step training loop
- 1) Sample a batch: load a batch of images and their labels (classes, boxes, masks, depending on the task).
- 2) Forward pass: run images through the backbone to get feature maps, then through the head(s) to get raw predictions (logits, box parameters, mask probabilities).
- 3) Compute losses: compare predictions to ground truth using task-appropriate loss functions. In multi-head models, the total loss is often a weighted sum of several components (e.g., classification loss + box regression loss + mask loss).
- 4) Backpropagation: compute gradients of the total loss with respect to all learnable parameters. This tells you how each parameter should change to reduce the loss.
- 5) Optimizer step: update parameters using an optimizer (e.g., SGD or Adam) and a learning rate. This is the actual “learning” step.
- 6) Repeat: iterate over batches for many epochs, often with learning-rate schedules and regularization settings.
Practical mental model for multi-task losses
If a model has multiple heads, each head contributes a different error signal. For example, if boxes are good but classes are wrong, the classification loss will dominate and push features to become more discriminative. If classes are correct but boxes are misaligned, the regression loss will push features to better encode geometry and localization cues. Balancing these losses (via weights) is a practical lever to control training behavior.
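In code, that balancing is often just a weighted sum; a minimal sketch (the weights here are placeholders, not recommended values):

# Hypothetical multi-task loss balancing
loss_weights = {"cls": 1.0, "box": 1.0, "mask": 0.5}

def total_loss(losses):
    # losses: per-head scalars, e.g. {"cls": ..., "box": ..., "mask": ...}
    return sum(loss_weights[name] * value for name, value in losses.items())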
# Conceptual pseudocode (framework-agnostic) for one training step
images, targets = next_batch()
preds = model(images)                 # forward pass
loss = compute_loss(preds, targets)   # sum of task losses
loss.backward()                       # gradients via backprop
optimizer.step()                      # update weights
optimizer.zero_grad()                 # clear gradients for next step

5) Inference Pipeline Steps: Preprocessing, Model Forward, Postprocessing
Inference uses the trained model to produce predictions for new images. The core forward computation is the same as in training, but you do not compute gradients or update weights. Instead, you focus on consistent input handling and turning raw outputs into user-facing results.
Step-by-step inference pipeline
- 1) Preprocessing: apply the same input formatting the model expects (for example, resizing to a fixed shape, normalizing channels, and converting to the correct tensor layout). The key is consistency with what the model was trained on.
- 2) Model forward: run the image through the backbone and head(s) to obtain raw outputs (class scores, box coordinates, mask probabilities).
- 3) Postprocessing: convert raw outputs into final predictions. This typically includes score thresholding and, for detection, removing duplicate boxes with Non-Maximum Suppression (NMS).
- 4) Produce user-facing outputs: return structured results such as a list of detected objects (class, confidence, box), segmentation masks, or annotated overlays.
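A sketch of the preprocessing step using Pillow and NumPy (the target size and ImageNet-style normalization constants are typical defaults, not values prescribed above; what matters is matching the training setup):

# Hypothetical preprocessing: resize, normalize, and convert to NCHW layout
import numpy as np
from PIL import Image

MEAN = np.array([0.485, 0.456, 0.406], dtype=np.float32)    # must match training
STD = np.array([0.229, 0.224, 0.225], dtype=np.float32)

def preprocess(path, size=(224, 224)):
    image = Image.open(path).convert("RGB").resize(size)     # fixed input shape
    x = np.asarray(image, dtype=np.float32) / 255.0          # pixels to [0, 1]
    x = (x - MEAN) / STD                                      # per-channel normalization
    return x.transpose(2, 0, 1)[None]                         # HWC -> NCHW, batch of one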
Postprocessing details you should recognize
Thresholding: many heads output probabilities or scores. A confidence threshold filters out low-confidence predictions to reduce noise. The threshold is a practical knob: higher thresholds yield fewer, more confident results; lower thresholds yield more detections but more false positives.
Non-Maximum Suppression (NMS): detectors often produce multiple overlapping boxes for the same object. NMS keeps the highest-scoring box and suppresses others that overlap too much (based on an IoU overlap criterion). This turns a dense set of candidate boxes into a clean set of final detections.
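A minimal NumPy sketch of greedy NMS, with boxes as (x1, y1, x2, y2) corners (an illustration of the idea, not a production implementation; real detectors typically use library versions):

# Greedy NMS: keep the highest-scoring box, drop overlapping ones, repeat
import numpy as np

def iou(box, boxes):
    x1 = np.maximum(box[0], boxes[:, 0])
    y1 = np.maximum(box[1], boxes[:, 1])
    x2 = np.minimum(box[2], boxes[:, 2])
    y2 = np.minimum(box[3], boxes[:, 3])
    inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
    area = lambda b: (b[..., 2] - b[..., 0]) * (b[..., 3] - b[..., 1])
    return inter / (area(box) + area(boxes) - inter)

def nms(boxes, scores, iou_threshold=0.5):
    order = np.argsort(scores)[::-1]               # candidates sorted by confidence
    keep = []
    while order.size > 0:
        best = order[0]
        keep.append(best)
        rest = order[1:]
        overlaps = iou(boxes[best], boxes[rest])
        order = rest[overlaps <= iou_threshold]    # suppress near-duplicate boxes
    return keep                                    # indices of the surviving detections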
# Conceptual pseudocode for inference
image = load_image()
x = preprocess(image)
preds = model(x)                      # raw outputs

if task == "detection":
    boxes, scores, classes = decode_boxes(preds)
    keep = scores > score_threshold
    boxes, scores, classes = boxes[keep], scores[keep], classes[keep]
    boxes, scores, classes = nms(boxes, scores, classes, iou_threshold)
    return format_detections(boxes, scores, classes)

if task == "segmentation":
    mask_probs = preds["masks"]
    masks = mask_probs > mask_threshold
    return format_masks(masks)

What the user ultimately receives
Even though the model internally works with feature maps and raw numeric outputs, the pipeline should end with outputs that match the application: labeled boxes for an inspection tool, masks for a medical viewer, or class probabilities for a sorting system. Designing this final interface (what fields you return, how you represent uncertainty, and how you handle “no detection”) is part of building a reliable CNN-based system, not an afterthought.
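One way to make that final interface explicit is a small, typed result structure; a hypothetical sketch for a detection application:

# Hypothetical user-facing result types for a detection pipeline
from dataclasses import dataclass, field
from typing import List, Tuple

@dataclass
class Detection:
    label: str                                # human-readable class name
    confidence: float                         # score that survived thresholding
    box: Tuple[float, float, float, float]    # (x1, y1, x2, y2) in pixel coordinates

@dataclass
class InferenceResult:
    detections: List[Detection] = field(default_factory=list)   # empty list = "no detection"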