What “Features” Mean in Computer Vision
A feature is a measurable pattern extracted from an image that helps a model make a decision. Raw pixels are often too literal: a small shift, a shadow, or a viewpoint change can alter many pixel values even when the object is the same. Features aim to capture what matters (structure and appearance cues) while being less sensitive to nuisance changes.
Features can be local (computed in a small neighborhood, like a patch) or global (summarizing the whole image). Modern pipelines often build strong global understanding by first extracting many local patterns and then combining them.
Why Local Patterns Are So Informative: Edges, Corners, Textures
Edges: where intensity changes
Edges often correspond to boundaries: object outlines, part boundaries, and strong shading transitions. Even when color varies, the presence and orientation of an edge can remain stable. For example, a mug’s handle boundary produces consistent edge structure across many backgrounds.
Corners and junctions: distinctive points
Corners (or junctions where multiple edges meet) are informative because they are more unique than a straight edge segment. A long straight edge could appear in many places (table edge, book edge), but a specific corner configuration is less ambiguous.
Textures: repeated micro-patterns
Textures describe repeated local variations (e.g., fabric weave, grass, brick). They can be useful cues for materials and surfaces, but they can also be misleading if a background texture correlates with a label (e.g., “boats” often appearing with “water ripples”).
Practical intuition: what local patterns “buy” you
- Compactness: a handful of numbers can summarize a patch’s structure instead of hundreds of raw pixel values.
- Robustness: local patterns can be less sensitive to global illumination shifts or small translations.
- Compositionality: objects can be represented as arrangements of parts (edges → corners → motifs → object).
Classical Feature Ideas (Conceptual): Gradients and HOG-like Descriptors
Gradients: direction and strength of change
A gradient measures how quickly intensity changes and in what direction. In practice, you can think of it as answering: “Is this patch mostly flat, or does it have a strong edge? If so, which way does the edge run?”
Key idea: gradients emphasize structure (boundaries and shapes) and de-emphasize uniform regions. This often makes them more stable than raw intensity under certain lighting changes.
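As a minimal sketch of this idea (assuming NumPy and a grayscale image stored as a 2D array), finite differences give a per-pixel magnitude and orientation:

```python
import numpy as np

def gradient_magnitude_orientation(gray):
    """Approximate per-pixel gradient magnitude and orientation (radians)
    for a 2D grayscale image using central finite differences."""
    gray = gray.astype(np.float64)
    gy, gx = np.gradient(gray)          # vertical and horizontal changes
    magnitude = np.hypot(gx, gy)        # strength of the local change
    orientation = np.arctan2(gy, gx)    # direction of the change, in (-pi, pi]
    return magnitude, orientation

# Example: a synthetic image with a vertical edge down the middle.
img = np.zeros((8, 8))
img[:, 4:] = 255.0
mag, ang = gradient_magnitude_orientation(img)
print(mag[4, 3:6])   # large values at the edge, zero in the flat regions
```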
From gradients to a descriptor: summarize a patch
A descriptor turns a patch into a vector. A common strategy is to compute gradient directions in the patch and then summarize them as a histogram: “how much edge energy points up-right vs left vs down,” etc. This yields a representation that is less sensitive to exact pixel alignment.
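For instance, one patch can be boiled down to a short histogram. The sketch below uses NumPy; the 9-bin layout and the folding of angles into [0, 180) degrees are illustrative choices, not a fixed standard:

```python
import numpy as np

def orientation_histogram(magnitude, orientation, n_bins=9):
    """Summarize a patch as a histogram of gradient orientations, weighted by
    gradient magnitude. Angles are folded into [0, 180) degrees so that
    opposite directions share a bin."""
    angles_deg = np.rad2deg(orientation) % 180.0
    bins = (angles_deg // (180.0 / n_bins)).astype(int) % n_bins
    hist = np.zeros(n_bins)
    np.add.at(hist, bins.ravel(), magnitude.ravel())
    return hist   # e.g., 9 numbers describing one patch
```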
HOG-like intuition: local histograms of oriented gradients
A HOG-like descriptor can be understood as:
- Split a region into small cells (e.g., a grid).
- In each cell, build a histogram of gradient orientations weighted by gradient magnitude.
- Optionally normalize across neighboring cells to reduce sensitivity to contrast/illumination.
This produces a feature vector that captures local shape and contour patterns (like “there is a vertical edge here and a diagonal edge there”) without requiring exact pixel-level matching.
Step-by-step: conceptual pipeline for a gradient-histogram descriptor
- Step 1: Choose a patch/region. For example, a 64×128 window for a person candidate, or a small patch around a keypoint.
- Step 2: Compute gradients. Estimate horizontal and vertical changes to get magnitude and orientation at each pixel.
- Step 3: Pool into cells. Divide the patch into a grid of cells (e.g., 8×8 pixels per cell).
- Step 4: Build orientation histograms. For each cell, accumulate gradient magnitudes into bins by direction.
- Step 5: Normalize locally. Normalize groups of neighboring cells to reduce sensitivity to overall brightness/contrast.
- Step 6: Concatenate into a vector. This vector is the feature representation of the patch.
Even without implementing it, this mental model helps: classical features are often about measuring local structure and then pooling it into a stable summary.
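Putting the six steps together, a minimal gradient-histogram descriptor might look like the NumPy sketch below. The cell size, bin count, and simple per-cell L2 normalization (rather than normalization across neighboring cells) are simplifications for illustration, not the exact HOG recipe:

```python
import numpy as np

def hog_like_descriptor(gray, cell=8, n_bins=9, eps=1e-6):
    """Step 2: gradients; Steps 3-4: per-cell orientation histograms;
    Step 5: per-cell L2 normalization (a simplification of block
    normalization); Step 6: concatenation into one feature vector."""
    gray = gray.astype(np.float64)
    gy, gx = np.gradient(gray)
    magnitude = np.hypot(gx, gy)
    angles_deg = np.rad2deg(np.arctan2(gy, gx)) % 180.0

    h, w = gray.shape
    cells = []
    for r in range(0, h - cell + 1, cell):
        for c in range(0, w - cell + 1, cell):
            mag_patch = magnitude[r:r + cell, c:c + cell]
            ang_patch = angles_deg[r:r + cell, c:c + cell]
            bins = (ang_patch // (180.0 / n_bins)).astype(int) % n_bins
            hist = np.zeros(n_bins)
            np.add.at(hist, bins.ravel(), mag_patch.ravel())
            hist /= np.linalg.norm(hist) + eps   # Step 5 (simplified)
            cells.append(hist)
    return np.concatenate(cells)                  # Step 6

# Step 1: a 64x128 window (width x height) for a person candidate becomes one vector.
window = np.random.rand(128, 64)
feat = hog_like_descriptor(window)
print(feat.shape)   # (1152,) = 16 * 8 cells * 9 bins
```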
Transition to Learned Representations: Feature Maps and Embeddings
What changes with learned features?
Classical features are hand-designed: you decide what to measure (gradients, histograms) and how to pool. Learned representations instead let a model discover what measurements are useful for the task by optimizing parameters from data.
Feature maps: many detectors applied across the image
In a convolutional neural network (CNN), early layers behave like banks of local pattern detectors applied everywhere. The output is a set of feature maps: each map indicates where a learned pattern is present (e.g., a certain edge orientation, a corner-like pattern, a texture motif).
As you go deeper, feature maps tend to represent more complex patterns: combinations of edges into parts, and parts into object-level cues. Importantly, these are not fixed “edge filters” by design; they emerge because they help minimize the training loss.
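As a small sketch (assuming PyTorch), one convolutional layer already produces a stack of feature maps, one per filter; here the filter weights are random stand-ins for what training would normally discover:

```python
import torch
import torch.nn as nn

# A bank of 16 local pattern detectors (3x3 filters) applied at every location.
conv = nn.Conv2d(in_channels=3, out_channels=16, kernel_size=3, padding=1)

image = torch.rand(1, 3, 224, 224)   # one RGB image: (batch, channels, height, width)
feature_maps = conv(image)           # shape: (1, 16, 224, 224), one map per filter

# Each map records where its pattern responds strongly; in a trained network,
# the filter weights come from minimizing the task loss, not random initialization.
print(feature_maps.shape)
```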
Embeddings: compact vectors for similarity and decision-making
An embedding is a vector representation (often from a later layer) intended to place similar images (or patches) close together in feature space and dissimilar ones far apart. Embeddings are useful for:
- Classification: a linear layer can separate classes if embeddings are well-structured.
- Retrieval: nearest-neighbor search finds similar items.
- Clustering: grouping by visual similarity.
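For example, retrieval with embeddings reduces to a similarity lookup. A minimal NumPy sketch, where the 512-dimensional embeddings are random placeholders for vectors a real model would produce:

```python
import numpy as np

def nearest_neighbors(query, gallery, k=5):
    """Return indices of the k gallery embeddings most similar to the query,
    using cosine similarity."""
    q = query / (np.linalg.norm(query) + 1e-12)
    g = gallery / (np.linalg.norm(gallery, axis=1, keepdims=True) + 1e-12)
    sims = g @ q                      # cosine similarity to every gallery item
    return np.argsort(-sims)[:k]      # most similar first

gallery = np.random.rand(1000, 512)   # placeholder embeddings for 1000 images
query = np.random.rand(512)
print(nearest_neighbors(query, gallery))
```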
Step-by-step: how a CNN turns pixels into an embedding (conceptually)
- Step 1: Local filtering. Apply learned filters across the image to produce early feature maps (detect simple patterns).
- Step 2: Nonlinearity. Apply nonlinear functions so the model can represent complex combinations of patterns.
- Step 3: Downsampling/pooling/striding. Reduce spatial resolution while keeping salient responses, improving robustness to small shifts.
- Step 4: Deeper composition. Repeat filtering + nonlinearity to build higher-level patterns from lower-level ones.
- Step 5: Aggregation. Combine spatial information (e.g., global pooling) to produce a fixed-length vector.
- Step 6: Embedding output. Use the vector directly (retrieval) or feed it to task heads (classification/detection).
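The six steps can be sketched as a tiny, untrained PyTorch model; the layer sizes and embedding dimension are arbitrary illustrations, and a real network would be deeper and trained on data:

```python
import torch
import torch.nn as nn

class TinyEmbeddingNet(nn.Module):
    def __init__(self, embed_dim=128):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 32, 3, padding=1), nn.ReLU(),   # Steps 1-2: local filters + nonlinearity
            nn.MaxPool2d(2),                             # Step 3: downsample
            nn.Conv2d(32, 64, 3, padding=1), nn.ReLU(),  # Step 4: deeper composition
            nn.MaxPool2d(2),
        )
        self.pool = nn.AdaptiveAvgPool2d(1)              # Step 5: global aggregation
        self.head = nn.Linear(64, embed_dim)             # Step 6: fixed-length embedding

    def forward(self, x):
        x = self.features(x)
        x = self.pool(x).flatten(1)
        return self.head(x)

model = TinyEmbeddingNet()
embedding = model(torch.rand(1, 3, 224, 224))
print(embedding.shape)   # torch.Size([1, 128])
```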
Invariances: What They Are and Why They Matter
An invariance is a property of a representation in which certain changes to the input leave the features largely unchanged. In practice, you want invariance to nuisance factors while remaining sensitive to the factors that define the label.
Translation (small shifts)
If an object moves a few pixels, you usually want the representation to remain similar. Convolutions apply the same detector across locations, and pooling/striding can make the representation less sensitive to exact position. This is crucial for real images where objects are rarely perfectly aligned.
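A quick way to see the effect of pooling: compare a response map with a slightly shifted copy, before and after max pooling. The NumPy sketch below uses synthetic values:

```python
import numpy as np

def max_pool(x, k=4):
    """Non-overlapping k x k max pooling of a 2D response map."""
    h, w = x.shape
    x = x[:h - h % k, :w - w % k]
    return x.reshape(h // k, k, w // k, k).max(axis=(1, 3))

rng = np.random.default_rng(0)
response = rng.random((32, 32))                 # a synthetic feature-map response
shifted = np.roll(response, shift=2, axis=1)    # the same responses, moved 2 pixels right

raw_diff = np.abs(response - shifted).mean()
pooled_diff = np.abs(max_pool(response) - max_pool(shifted)).mean()
print(raw_diff, pooled_diff)   # the pooled maps differ much less than the raw maps
```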
Small rotations and viewpoint changes
Many objects appear at slightly different angles. A good representation should not treat a 5–10 degree rotation as a different object. Learned features can become tolerant to such changes by seeing varied examples and by composing local patterns into higher-level cues that are less tied to a single orientation.
Illumination and contrast changes
Lighting can change dramatically while the object identity stays the same. Gradient-based features often reduce sensitivity to uniform brightness shifts because they focus on changes rather than absolute intensity. Learned representations can also become robust by relying on stable cues (shape, relative contrasts) rather than raw brightness.
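A two-line NumPy check makes the brightness point concrete: adding a constant to every pixel leaves the gradients untouched, because gradients measure differences rather than absolute intensity:

```python
import numpy as np

img = np.random.rand(16, 16) * 255.0
brighter = img + 40.0                        # uniform brightness shift

gy, gx = np.gradient(img)
gy2, gx2 = np.gradient(brighter)
print(np.allclose(gx, gx2) and np.allclose(gy, gy2))   # True: gradients ignore the offset
```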
Invariance vs. sensitivity: the trade-off
Too much invariance can hurt. For example, if you make a representation fully rotation-invariant, you might lose information needed to distinguish “6” from “9” or to estimate pose. The goal is task-appropriate invariance: ignore what should not matter, preserve what should.
How Feature Quality Affects Downstream Tasks
Classification: separating classes in feature space
In classification, good features make classes easier to separate. If embeddings cluster by object identity rather than background, a simple classifier can work well. If features entangle object and background, the classifier may rely on spurious cues.
Example failure mode: Suppose a dataset shows “cows” mostly on green fields. A model might learn grass texture as a shortcut. A cow on a beach could then be misclassified because, with no grass-like texture present, its embedding lands in a “non-cow” region of feature space.
Detection: localizing objects needs spatially meaningful features
Detection requires both “what” and “where.” Feature maps must preserve enough spatial structure to localize objects while staying robust to small shifts. If early features are noisy or overly texture-biased, the detector may fire on repetitive background patterns (e.g., a window grid triggering false detections) or miss objects with unusual textures.
Diagnosing feature problems with practical checks
- Background sensitivity check: Evaluate on images where the object appears in unusual contexts. Large performance drops suggest features rely on background cues.
- Occlusion sensitivity check: Mask different regions of the image and see whether predictions change drastically. If masking the background flips the prediction, the features may be relying on spurious signals.
- Similarity sanity check (embeddings): Retrieve nearest neighbors using embeddings. If neighbors share background/lighting more than object identity, the representation is not capturing the intended semantics.
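One way to run the occlusion check is sketched below (PyTorch-style tensors assumed); `model` is a placeholder for any classifier that maps a batch of images to class scores, and the patch size and fill value are arbitrary:

```python
import torch

def occlusion_sensitivity(model, image, patch=32, stride=32, fill=0.5):
    """Slide a square mask over `image` (C, H, W) and record how much the score
    of the originally predicted class drops for each masked region."""
    model.eval()
    with torch.no_grad():
        base_scores = model(image.unsqueeze(0))[0]
        target = base_scores.argmax().item()
        _, h, w = image.shape
        drops = []
        for top in range(0, h - patch + 1, stride):
            for left in range(0, w - patch + 1, stride):
                occluded = image.clone()
                occluded[:, top:top + patch, left:left + patch] = fill
                score = model(occluded.unsqueeze(0))[0, target]
                drops.append(((top, left), (base_scores[target] - score).item()))
    # Big drops over background regions suggest the prediction leans on spurious cues.
    return sorted(drops, key=lambda d: -d[1])
```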
Step-by-step: improving downstream behavior by improving features (conceptually)
- Step 1: Identify spurious correlations. Use error analysis (e.g., failures in new backgrounds) to hypothesize what the model is using.
- Step 2: Encourage object-centric cues. Use training signals or architectural choices that emphasize shape/parts (e.g., multi-scale features for detection, stronger spatial supervision when available).
- Step 3: Validate invariances. Test controlled changes (small shifts, lighting changes) to ensure features remain stable when they should.
- Step 4: Re-check embedding neighborhoods. Confirm that similarity aligns with the task (same object class/instance closer than merely similar textures).
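Steps 3 and 4 can be partly automated. In the NumPy sketch below, `embed` is a placeholder for whatever function maps an image to its embedding vector; similarities near 1.0 under small perturbations indicate the invariances you probably want:

```python
import numpy as np

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))

def invariance_report(embed, image, shift_px=4, brightness=20.0):
    """Compare an image's embedding with embeddings of mildly perturbed copies.
    `embed` is assumed to map an (H, W, C) array to a 1-D vector."""
    base = embed(image)
    shifted = np.roll(image, shift=shift_px, axis=1)               # small horizontal shift
    brighter = np.clip(image.astype(float) + brightness, 0, 255)   # mild brightening
    return {
        "shift_similarity": cosine(base, embed(shifted)),
        "brightness_similarity": cosine(base, embed(brighter)),
    }
```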