The forward pass as a sequence of transformations
A forward pass is the deterministic computation that maps an input batch to a batch of predictions by applying the same pattern repeatedly: an affine transform (matrix multiply plus bias) followed by a nonlinearity, layer after layer. Each layer takes a tensor of activations, produces a new tensor (a new representation), and passes it forward.
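Before building the full network, here is what one such layer looks like in code. This is a minimal NumPy sketch, not a reference implementation: the helper name `dense_layer` and the choice of `np.tanh` as the default activation are illustrative assumptions, not fixed by the text.

```python
import numpy as np

def dense_layer(x, W, b, act=np.tanh):
    """One dense layer: an affine transform followed by an elementwise nonlinearity.

    x: (batch, in_features), W: (in_features, out_features), b: (out_features,).
    Returns activations of shape (batch, out_features).
    """
    z = x @ W + b   # affine transform; b broadcasts across the batch dimension
    return act(z)   # elementwise nonlinearity, so the shape is unchanged
```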
We will build a small fully connected network with two hidden layers and explicitly track tensor shapes. The goal is to make it easy to answer questions like: “What is the shape of the weight matrix here?” and “What does this layer output represent?”
Our example network and tensor shapes
Assume we have a batch of tabular inputs (or flattened features) with:
- Batch size: B
- Input features: D
- Hidden units in layer 1: H1
- Hidden units in layer 2: H2
- Output units: K (e.g., number of classes, or 1 for a single regression target)
We will use concrete numbers to make shapes tangible:
- B = 4 examples per batch
- D = 10 input features
- H1 = 8 units in hidden layer 1
- H2 = 6 units in hidden layer 2
- K = 3 output units
Input batch:
X has shape (B, D) = (4, 10)
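If you want to follow the shapes in code, one way to set them up in NumPy is sketched below. The random values stand in for real data and trained weights; the initialization is an arbitrary choice made only to make the shapes concrete.

```python
import numpy as np

rng = np.random.default_rng(0)
B, D, H1, H2, K = 4, 10, 8, 6, 3

X  = rng.normal(size=(B, D))    # input batch: (4, 10)
W1 = rng.normal(size=(D, H1))   # (10, 8)
b1 = np.zeros(H1)               # (8,)
W2 = rng.normal(size=(H1, H2))  # (8, 6)
b2 = np.zeros(H2)               # (6,)
W3 = rng.normal(size=(H2, K))   # (6, 3)
b3 = np.zeros(K)                # (3,)

print(X.shape)                  # (4, 10)
```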
Layer 1: affine transform then activation
The affine part is:
Z1 = X W1 + b1
- W1 has shape (D, H1) = (10, 8)
- b1 has shape (H1,) = (8,) and is broadcast across the batch
- Z1 has shape (B, H1) = (4, 8)
Then apply an activation function elementwise:
A1 = act(Z1)
A1 has the same shape as Z1: (4, 8)
Interpretation: each of the H1 columns in A1 is a learned feature of the input, computed for every example in the batch.
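Continuing the NumPy sketch above, with `np.tanh` standing in for `act` (an arbitrary choice of activation):

```python
Z1 = X @ W1 + b1            # (4, 10) @ (10, 8) -> (4, 8); b1 broadcasts over the batch
A1 = np.tanh(Z1)            # elementwise, so A1 is also (4, 8)
print(Z1.shape, A1.shape)   # (4, 8) (4, 8)
```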
Layer 2: affine transform then activation
Repeat the pattern:
Z2 = A1 W2 + b2
- W2 has shape (H1, H2) = (8, 6)
- b2 has shape (H2,) = (6,)
- Z2 has shape (B, H2) = (4, 6)
Activation:
A2 = act(Z2)
A2 has shape (4, 6)
Interpretation: A2 is a “hidden representation” that is typically more task-specific than A1. It is still not the final prediction; it is an intermediate encoding that the output layer will use.
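In the NumPy sketch, layer 2 is the same pattern applied to A1 with the new shapes:

```python
Z2 = A1 @ W2 + b2           # (4, 8) @ (8, 6) -> (4, 6)
A2 = np.tanh(Z2)            # (4, 6)
print(Z2.shape, A2.shape)   # (4, 6) (4, 6)
```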
Output layer: affine transform to produce logits/scores
Often the final layer is left as an affine transform that produces raw scores (sometimes called logits in classification):
Z3 = A2 W3 + b3
- W3 has shape (H2, K) = (6, 3)
- b3 has shape (K,) = (3,)
- Z3 has shape (B, K) = (4, 3)
From here, you may apply a task-specific output mapping:
- For multi-class classification: Y_hat = softmax(Z3) gives probabilities with shape (4, 3) (sketched below).
- For regression with K = 1: you might use Y_hat = Z3 directly (or apply a constraint like positivity if needed).
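Here is the classification case as a sketch, continuing the running NumPy example. Subtracting the row-wise maximum before exponentiating is a common numerical-stability convention, not something required by the math above.

```python
Z3 = A2 @ W3 + b3                              # logits: (4, 6) @ (6, 3) -> (4, 3)

# Softmax output mapping: exponentiate each row and normalize it to sum to 1.
exp_scores = np.exp(Z3 - Z3.max(axis=1, keepdims=True))
Y_hat = exp_scores / exp_scores.sum(axis=1, keepdims=True)   # (4, 3)

print(Y_hat.shape, Y_hat.sum(axis=1))          # (4, 3) [1. 1. 1. 1.]
```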
Step-by-step forward pass with shapes (summary)
```python
# Given: X shape (B, D) = (4, 10)

Z1 = X @ W1 + b1        # W1: (10, 8), b1: (8,)  => Z1: (4, 8)
A1 = act(Z1)            #                        => A1: (4, 8)

Z2 = A1 @ W2 + b2       # W2: (8, 6), b2: (6,)   => Z2: (4, 6)
A2 = act(Z2)            #                        => A2: (4, 6)

Z3 = A2 @ W3 + b3       # W3: (6, 3), b3: (3,)   => Z3: (4, 3)

Y_hat = output_map(Z3)  #                        => Y_hat: (4, 3)
```

Two practical shape checks that prevent many bugs (turned into assertions in the sketch after this list):
- The second dimension of the input to a layer must match the first dimension of that layer's weight matrix (e.g., A1 is (4, 8), so W2 must start with 8).
- Bias vectors match the number of units in the layer output (e.g., b2 is length 6 because Z2 has 6 columns).
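These two rules are easy to encode as assertions. The helper below, `check_dense_shapes`, is a hypothetical name used for illustration; it reuses the tensors from the running sketch.

```python
def check_dense_shapes(x, W, b):
    # Rule 1: the input's feature count must match W's first dimension.
    assert x.shape[1] == W.shape[0], f"{x.shape[1]} input features vs W expecting {W.shape[0]}"
    # Rule 2: the bias length must match the layer's output width.
    assert b.shape == (W.shape[1],), f"bias shape {b.shape} vs layer width ({W.shape[1]},)"

check_dense_shapes(A1, W2, b2)   # passes: A1 is (4, 8), W2 is (8, 6), b2 is (6,)
```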
What a “neuron” learns (conceptually) in this forward pass
In a dense layer, each unit corresponds to one column of the weight matrix and one bias value. For example, in layer 1, unit j computes (for each example):
z1_j = x · w1_j + b1_j
where w1_j is the j-th column of W1. Conceptually, this unit defines a direction in input space (the weight vector) and produces a scalar score indicating how strongly the input aligns with that direction, shifted by the bias. After the activation, that score becomes a feature value passed to the next layer.
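You can verify this column view numerically against the running sketch; picking unit j = 0 is arbitrary.

```python
j = 0                          # any unit index in layer 1
w1_j = W1[:, j]                # the j-th column of W1: a direction in input space, shape (10,)
z1_j = X @ w1_j + b1[j]        # one score per example: shape (4,)

# The per-unit view agrees with the j-th column of the full matrix product.
assert np.allclose(z1_j, Z1[:, j])
```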
Calling these units “feature detectors” is a useful shorthand if you keep it precise: a unit computes a particular weighted combination of its inputs, and after nonlinearity it behaves like a learned feature that can be reused downstream. In early layers, these features often correlate with simpler patterns in the input; in later layers, features are built from combinations of earlier features and tend to be more tailored to the prediction task.
Hidden representations vs. final outputs
Hidden representations: internal coordinates for the task
A1 and A2 are hidden representations: they are not directly supervised as “the answer,” but they are shaped by training so that the final layer can easily map them to the target. You can think of them as intermediate coordinate systems where the problem becomes simpler for the next affine transform.
Important practical implication: hidden units do not have to correspond to human-interpretable concepts. They only need to be useful as inputs to subsequent layers. Some may align with recognizable patterns; many will be distributed and only meaningful in combination.
Final outputs: task-specific meaning
Z3 (and Y_hat after the output mapping) is different: its dimensions are tied to the task definition.
- If K = 3 classes, each of the 3 output coordinates corresponds to one class score/probability.
- If K = 1 regression target, the output coordinate corresponds to the predicted numeric value.
This is why the last layer’s width is determined by the problem specification, while hidden layer widths are design choices that control capacity and computation.
Common forward-pass pitfalls (shape and meaning)
- Mixing up batch and feature dimensions: In the convention used here, tensors are (batch, features). If you accidentally treat (features, batch) as input, matrix multiplications may fail or silently produce wrong results.
- Forgetting bias broadcasting: Biases are typically stored as (units,) and added to every row in the batch. If you store biases as (B, units), you are making the bias depend on the example, which is not what a standard dense layer does.
- Confusing logits with probabilities: The output of the final affine transform (Z3) is often not yet a probability. The probability-like interpretation comes after applying the appropriate output mapping; see the sketch after this list.
- Assuming hidden units map to labels: Hidden dimensions are not "proto-classes." Only the output layer is directly aligned with the label space.
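Two of these pitfalls can be checked directly on the tensors from the running sketch:

```python
# Bias broadcasting: a (units,) bias is added to every row of the batch,
# which is the same as explicitly expanding it to shape (1, units).
assert np.allclose(X @ W1 + b1, X @ W1 + b1[np.newaxis, :])

# Logits vs. probabilities: rows of Z3 can be negative and need not sum to 1,
# while rows of Y_hat do sum to 1 after the softmax mapping.
print(Z3.sum(axis=1))     # arbitrary values
print(Y_hat.sum(axis=1))  # [1. 1. 1. 1.]

# Transposed input: X.T has shape (10, 4), so X.T @ W1 raises a ValueError,
# because 4 does not match W1's first dimension of 10.
```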