The forward pass as a sequence of transformations
A forward pass is the deterministic computation that maps an input batch to a batch of predictions by applying the same pattern repeatedly: an affine transform (matrix multiply plus bias) followed by a nonlinearity, layer after layer. Each layer takes a tensor of activations, produces a new tensor (a new representation), and passes it forward.
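Before building the full network, here is what one such layer looks like in code. This is a minimal NumPy sketch, not a reference implementation: the helper name `dense_layer` and the choice of `np.tanh` as the default activation are illustrative assumptions, not fixed by the text.

```python
import numpy as np

def dense_layer(x, W, b, act=np.tanh):
    """One dense layer: an affine transform followed by an elementwise nonlinearity.

    x: (batch, in_features), W: (in_features, out_features), b: (out_features,).
    Returns activations of shape (batch, out_features).
    """
    z = x @ W + b   # affine transform; b broadcasts across the batch dimension
    return act(z)   # elementwise nonlinearity, so the shape is unchanged
```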
We will build a small fully connected network with two hidden layers and explicitly track tensor shapes. The goal is to make it easy to answer questions like: “What is the shape of the weight matrix here?” and “What does this layer output represent?”
Our example network and tensor shapes
Assume we have a batch of tabular inputs (or flattened features) with:
- Batch size: B
- Input features: D
- Hidden units in layer 1: H1
- Hidden units in layer 2: H2
- Output units: K (e.g., number of classes, or 1 for a single regression target)
We will use concrete numbers to make shapes tangible:
- B = 4 examples per batch
- D = 10 input features
- H1 = 8 units in hidden layer 1
- H2 = 6 units in hidden layer 2
- K = 3 output units
Input batch:
X has shape (B, D) = (4, 10)
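If you want to follow the shapes in code, one way to set them up in NumPy is sketched below. The random values stand in for real data and trained weights; the initialization is an arbitrary choice made only to make the shapes concrete.

```python
import numpy as np

rng = np.random.default_rng(0)
B, D, H1, H2, K = 4, 10, 8, 6, 3

X  = rng.normal(size=(B, D))    # input batch: (4, 10)
W1 = rng.normal(size=(D, H1))   # (10, 8)
b1 = np.zeros(H1)               # (8,)
W2 = rng.normal(size=(H1, H2))  # (8, 6)
b2 = np.zeros(H2)               # (6,)
W3 = rng.normal(size=(H2, K))   # (6, 3)
b3 = np.zeros(K)                # (3,)

print(X.shape)                  # (4, 10)
```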
Layer 1: affine transform then activation
The affine part is:
Z1 = X W1 + b1
- W1 has shape (D, H1) = (10, 8)
- b1 has shape (H1,) = (8,) and is broadcast across the batch
- Z1 has shape (B, H1) = (4, 8)
Then apply an activation function elementwise:
A1 = act(Z1)
A1 has the same shape as Z1: (4, 8)
Interpretation: each of the H1 columns in A1 is a learned feature of the input, computed for every example in the batch.
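Continuing the NumPy sketch above, with `np.tanh` standing in for `act` (an arbitrary choice of activation):

```python
Z1 = X @ W1 + b1            # (4, 10) @ (10, 8) -> (4, 8); b1 broadcasts over the batch
A1 = np.tanh(Z1)            # elementwise, so A1 is also (4, 8)
print(Z1.shape, A1.shape)   # (4, 8) (4, 8)
```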
Layer 2: affine transform then activation
Repeat the pattern:
Z2 = A1 W2 + b2
- W2 has shape (H1, H2) = (8, 6)
- b2 has shape (H2,) = (6,)
- Z2 has shape (B, H2) = (4, 6)
Activation:
A2 = act(Z2)
A2 has shape (4, 6)
Interpretation: A2 is a “hidden representation” that is typically more task-specific than A1. It is still not the final prediction; it is an intermediate encoding that the output layer will use.
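In the NumPy sketch, layer 2 is the same pattern applied to A1 with the new shapes:

```python
Z2 = A1 @ W2 + b2           # (4, 8) @ (8, 6) -> (4, 6)
A2 = np.tanh(Z2)            # (4, 6)
print(Z2.shape, A2.shape)   # (4, 6) (4, 6)
```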
Output layer: affine transform to produce logits/scores
Often the final layer is left as an affine transform that produces raw scores (sometimes called logits in classification):
Z3 = A2 W3 + b3
- W3 has shape (H2, K) = (6, 3)
- b3 has shape (K,) = (3,)
- Z3 has shape (B, K) = (4, 3)
From here, you may apply a task-specific output mapping:
- For multi-class classification: Y_hat = softmax(Z3) gives probabilities with shape (4, 3) (sketched below).
- For regression with K = 1: you might use Y_hat = Z3 directly (or apply a constraint like positivity if needed).
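Here is the classification case as a sketch, continuing the running NumPy example. Subtracting the row-wise maximum before exponentiating is a common numerical-stability convention, not something required by the math above.

```python
Z3 = A2 @ W3 + b3                              # logits: (4, 6) @ (6, 3) -> (4, 3)

# Softmax output mapping: exponentiate each row and normalize it to sum to 1.
exp_scores = np.exp(Z3 - Z3.max(axis=1, keepdims=True))
Y_hat = exp_scores / exp_scores.sum(axis=1, keepdims=True)   # (4, 3)

print(Y_hat.shape, Y_hat.sum(axis=1))          # (4, 3) [1. 1. 1. 1.]
```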
Step-by-step forward pass with shapes (summary)
```python
# Given: X shape (B, D) = (4, 10)

Z1 = X @ W1 + b1        # W1: (10, 8), b1: (8,)  => Z1: (4, 8)
A1 = act(Z1)            #                        => A1: (4, 8)

Z2 = A1 @ W2 + b2       # W2: (8, 6), b2: (6,)   => Z2: (4, 6)
A2 = act(Z2)            #                        => A2: (4, 6)

Z3 = A2 @ W3 + b3       # W3: (6, 3), b3: (3,)   => Z3: (4, 3)

Y_hat = output_map(Z3)  #                        => Y_hat: (4, 3)
```

Two practical shape checks that prevent many bugs (turned into assertions in the sketch after this list):
- The second dimension of the input to a layer must match the first dimension of that layer's weight matrix (e.g., A1 is (4, 8), so W2 must start with 8).
- Bias vectors match the number of units in the layer output (e.g., b2 is length 6 because Z2 has 6 columns).
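These two rules are easy to encode as assertions. The helper below, `check_dense_shapes`, is a hypothetical name used for illustration; it reuses the tensors from the running sketch.

```python
def check_dense_shapes(x, W, b):
    # Rule 1: the input's feature count must match W's first dimension.
    assert x.shape[1] == W.shape[0], f"{x.shape[1]} input features vs W expecting {W.shape[0]}"
    # Rule 2: the bias length must match the layer's output width.
    assert b.shape == (W.shape[1],), f"bias shape {b.shape} vs layer width ({W.shape[1]},)"

check_dense_shapes(A1, W2, b2)   # passes: A1 is (4, 8), W2 is (8, 6), b2 is (6,)
```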
What a “neuron” learns (conceptually) in this forward pass
In a dense layer, each unit corresponds to one column of the weight matrix and one bias value. For example, in layer 1, unit j computes (for each example):
z1_j = x · w1_j + b1_j
where w1_j is the j-th column of W1. Conceptually, this unit defines a direction in input space (the weight vector) and produces a scalar score indicating how strongly the input aligns with that direction, shifted by the bias. After the activation, that score becomes a feature value passed to the next layer.
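You can verify this column view numerically against the running sketch; picking unit j = 0 is arbitrary.

```python
j = 0                          # any unit index in layer 1
w1_j = W1[:, j]                # the j-th column of W1: a direction in input space, shape (10,)
z1_j = X @ w1_j + b1[j]        # one score per example: shape (4,)

# The per-unit view agrees with the j-th column of the full matrix product.
assert np.allclose(z1_j, Z1[:, j])
```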
Calling these units “feature detectors” is a useful shorthand if you keep it precise: a unit computes a particular weighted combination of its inputs, and after nonlinearity it behaves like a learned feature that can be reused downstream. In early layers, these features often correlate with simpler patterns in the input; in later layers, features are built from combinations of earlier features and tend to be more tailored to the prediction task.
Hidden representations vs. final outputs
Hidden representations: internal coordinates for the task
A1 and A2 are hidden representations: they are not directly supervised as “the answer,” but they are shaped by training so that the final layer can easily map them to the target. You can think of them as intermediate coordinate systems where the problem becomes simpler for the next affine transform.
Important practical implication: hidden units do not have to correspond to human-interpretable concepts. They only need to be useful as inputs to subsequent layers. Some may align with recognizable patterns; many will be distributed and only meaningful in combination.
Final outputs: task-specific meaning
Z3 (and Y_hat after the output mapping) is different: its dimensions are tied to the task definition.
- If K = 3 classes, each of the 3 output coordinates corresponds to one class score/probability.
- If K = 1 regression target, the output coordinate corresponds to the predicted numeric value.
This is why the last layer’s width is determined by the problem specification, while hidden layer widths are design choices that control capacity and computation.
Common forward-pass pitfalls (shape and meaning)
- Mixing up batch and feature dimensions: In the convention used here, tensors are (batch, features). If you accidentally treat (features, batch) as input, matrix multiplications may fail or silently produce wrong results.
- Forgetting bias broadcasting: Biases are typically stored as (units,) and added to every row in the batch. If you store biases as (B, units), you are making the bias depend on the example, which is not what a standard dense layer does.
- Confusing logits with probabilities: The output of the final affine transform (Z3) is often not yet a probability. The probability-like interpretation comes after applying the appropriate output mapping; see the sketch after this list.
- Assuming hidden units map to labels: Hidden dimensions are not "proto-classes." Only the output layer is directly aligned with the label space.
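Two of these pitfalls can be checked directly on the tensors from the running sketch:

```python
# Bias broadcasting: a (units,) bias is added to every row of the batch,
# which is the same as explicitly expanding it to shape (1, units).
assert np.allclose(X @ W1 + b1, X @ W1 + b1[np.newaxis, :])

# Logits vs. probabilities: rows of Z3 can be negative and need not sum to 1,
# while rows of Y_hat do sum to 1 after the softmax mapping.
print(Z3.sum(axis=1))     # arbitrary values
print(Y_hat.sum(axis=1))  # [1. 1. 1. 1.]

# Transposed input: X.T has shape (10, 4), so X.T @ W1 raises a ValueError,
# because 4 does not match W1's first dimension of 10.
```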