Activations as Controlled Nonlinear Transformations
An activation function takes a neuron’s pre-activation value (often written as z, the weighted sum plus bias) and transforms it into an output a = f(z). If every layer used only linear transformations, stacking layers would still collapse into a single linear transformation, no matter how many layers you add. Activations are the controlled nonlinearity that prevents that collapse, letting the network represent curved decision boundaries and complex input-output relationships.
In practice, activations also shape how gradients flow during training. Because backpropagation multiplies derivatives layer by layer, the activation's derivative f'(z) strongly affects whether learning is stable, fast, or stuck.
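To make the collapse argument concrete, here is a minimal NumPy sketch (weights and inputs are random, chosen only for illustration): two stacked linear layers reduce exactly to one linear layer, while inserting a nonlinearity between them breaks that equivalence.

import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=(4, 3))                        # batch of 4 inputs, 3 features each
W1, b1 = rng.normal(size=(3, 5)), rng.normal(size=5)
W2, b2 = rng.normal(size=(5, 2)), rng.normal(size=2)

# Two stacked linear layers...
two_linear = (x @ W1 + b1) @ W2 + b2
# ...are exactly one linear layer with W = W1 @ W2 and b = b1 @ W2 + b2
one_linear = x @ (W1 @ W2) + (b1 @ W2 + b2)
print(np.allclose(two_linear, one_linear))         # True

# A nonlinearity (here ReLU) between the layers breaks that equivalence
nonlinear = np.maximum(0.0, x @ W1 + b1) @ W2 + b2
print(np.allclose(nonlinear, one_linear))          # False in general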
Four Common Activations Through Three Lenses
We will compare step, sigmoid, tanh, and ReLU using: (1) output range, (2) gradient behavior, and (3) practical implications.
1) Step Function
Definition (conceptually): outputs 0 for negative inputs and 1 for positive inputs.
- Output range: {0, 1} (binary, discontinuous).
- Gradient behavior: derivative is 0 almost everywhere and undefined at 0. That means gradient-based training cannot “nudge” weights to improve performance in a smooth way.
- Practical implications: useful for hard thresholding at inference time, but not used inside modern networks during training because it blocks learning via backpropagation.
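As a minimal sketch (NumPy, with the convention that the output at exactly 0 is 0), the step function and its everywhere-zero derivative look like this:

import numpy as np

def step(z):
    # 1 for positive inputs, 0 otherwise (the value at exactly 0 is a convention)
    return (z > 0).astype(float)

def step_grad(z):
    # 0 everywhere the derivative exists, so gradient descent gets no learning signal
    return np.zeros_like(z)

z = np.array([-2.0, -0.1, 0.5, 3.0])
print(step(z))        # [0. 0. 1. 1.]
print(step_grad(z))   # [0. 0. 0. 0.]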
2) Sigmoid
Definition: f(z) = 1 / (1 + exp(-z))
- Output range: (0, 1). Interpretable as a probability for binary outcomes.
- Gradient behavior: derivative is largest near 0 and becomes very small for large positive or negative z. This is the classic saturation problem: when z is far from 0, the neuron's output changes very little, and gradients shrink (see the sketch after this list).
- Practical implications: saturation can slow learning in hidden layers, especially in deep networks. However, sigmoid is still very useful as an output activation for binary classification because it maps naturally to probabilities.
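To see the saturation numerically, here is a small NumPy sketch (the sample values of z are arbitrary) using the identity f'(z) = f(z)(1 - f(z)):

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_grad(z):
    s = sigmoid(z)
    return s * (1.0 - s)                # peaks at 0.25 when z = 0

z = np.array([-10.0, -2.0, 0.0, 2.0, 10.0])
print(np.round(sigmoid(z), 4))          # roughly [0.0, 0.119, 0.5, 0.881, 1.0]
print(np.round(sigmoid_grad(z), 4))     # roughly [0.0, 0.105, 0.25, 0.105, 0.0]: tiny at the extremes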
3) Tanh
Definition: f(z) = tanh(z)
- Output range: (-1, 1). Unlike sigmoid, it is zero-centered, which often makes optimization easier because activations can be positive or negative.
- Gradient behavior: like sigmoid, it saturates for large |z|, causing small gradients in the extremes. Around 0, gradients are relatively strong.
- Practical implications: can work well in some hidden-layer settings, especially when you want bounded outputs and zero-centered activations. But saturation still makes it less common than ReLU in many feedforward hidden layers.
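A similar NumPy sketch (arbitrary sample values) shows tanh's zero-centered outputs and its saturation at the extremes, using the identity f'(z) = 1 - tanh(z)^2:

import numpy as np

z = np.array([-5.0, -1.0, 0.0, 1.0, 5.0])
a = np.tanh(z)
grad = 1.0 - a ** 2                 # derivative of tanh; largest (1.0) at z = 0

print(np.round(a, 4))               # roughly [-0.9999, -0.7616, 0.0, 0.7616, 0.9999]
print(np.round(grad, 4))            # roughly [0.0002, 0.42, 1.0, 0.42, 0.0002]: saturation at the extremes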
4) ReLU (Rectified Linear Unit)
Definition: f(z) = max(0, z)
- Output range: [0, ∞). Unbounded above, exactly 0 for negative inputs.
- Gradient behavior: derivative is 1 for positive z and 0 for negative z (undefined at 0, but handled consistently in implementations). This means gradients flow well when the neuron is "on" (z > 0), but stop entirely when it is "off" (z < 0); the sketch after this list shows both regimes.
- Practical implications: ReLU often trains faster, and scales better to deeper networks, than sigmoid/tanh because it avoids saturation on the positive side. The main risk is dead neurons: if a neuron's z becomes negative for all inputs, its output stays 0 and its gradient stays 0, so it may never recover.
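The same kind of sketch for ReLU (NumPy, arbitrary sample values) makes the on/off gradient behavior explicit:

import numpy as np

def relu(z):
    return np.maximum(0.0, z)

def relu_grad(z):
    # 1 for z > 0 and 0 otherwise; the value at exactly 0 is a framework convention
    return (z > 0).astype(float)

z = np.array([-3.0, -0.5, 0.0, 0.5, 3.0])
print(relu(z))        # [0.  0.  0.  0.5 3. ]
print(relu_grad(z))   # [0. 0. 0. 1. 1.] -- no gradient flows while the unit is "off"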
Saturation and Dead Neurons: What They Look Like in Practice
Saturation (Sigmoid/Tanh)
When sigmoid or tanh saturate, the neuron output becomes almost constant even if z changes. That makes the derivative tiny, so weight updates become tiny too.
Practical symptom: training loss decreases very slowly, especially in earlier layers, because gradients shrink as they propagate backward.
Dead Neurons (ReLU)
A ReLU neuron is “dead” when it outputs 0 for essentially all inputs. Since its gradient is 0 in the negative region, it stops learning.
Practical symptom: you may observe many activations stuck at exactly 0 and some units never activating across batches.
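One way to check for this, sketched below with NumPy and a hypothetical helper dead_unit_report, assumes you can collect a ReLU layer's activations over many examples into a 2-D array and then measure, per unit, the fraction of exact zeros:

import numpy as np

def dead_unit_report(activations, dead_threshold=0.99):
    # activations: ReLU outputs collected over many examples, shape (num_examples, num_units)
    zero_fraction = (activations == 0.0).mean(axis=0)          # per-unit fraction of exact zeros
    dead_units = np.where(zero_fraction >= dead_threshold)[0]
    return zero_fraction, dead_units

# Fabricated activations for illustration: unit 2 is stuck at exactly 0
acts = np.abs(np.random.default_rng(0).normal(size=(256, 4)))
acts[:, 2] = 0.0
fractions, dead = dead_unit_report(acts)
print(np.round(fractions, 2))    # roughly [0. 0. 1. 0.]
print(dead)                      # [2]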
Differentiability and Why It Matters for Training
Gradient-based training needs derivatives. If an activation is not differentiable, or has zero derivative almost everywhere, learning becomes unreliable or impossible with standard backpropagation.
- Step: not suitable for gradient descent because it is discontinuous and has no useful derivative.
- Sigmoid/tanh: differentiable everywhere, but can have very small derivatives in saturated regions.
- ReLU: not differentiable at 0, but this is manageable because it is differentiable almost everywhere and frameworks define a consistent subgradient at 0. The bigger issue is the zero gradient for negative inputs.
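A compact way to compare all three gradient behaviors is to evaluate each derivative at a few representative pre-activations; the NumPy sketch below uses arbitrary sample values and the common convention of derivative 0 at z = 0 for ReLU.

import numpy as np

z = np.array([-6.0, -1.0, 0.0, 1.0, 6.0])
s = 1.0 / (1.0 + np.exp(-z))

derivatives = {
    "step":    np.zeros_like(z),          # zero wherever it exists
    "sigmoid": s * (1.0 - s),             # tiny once |z| is large
    "tanh":    1.0 - np.tanh(z) ** 2,     # tiny once |z| is large
    "relu":    (z > 0).astype(float),     # 1 when the unit is on, 0 when off
}
for name, d in derivatives.items():
    print(f"{name:8s}", np.round(d, 4))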
Choosing Activations: Hidden Layers vs Output Layers
Hidden Layers: Default Choices and Rationale
Common default: ReLU in hidden layers.
- Why: strong gradient flow for active units, computational simplicity, and good empirical performance in many architectures.
- When to consider tanh: if you specifically want bounded, zero-centered activations and can manage saturation (for example, in some recurrent or control-like settings).
- When to avoid sigmoid in hidden layers: saturation and non-zero-centered outputs often make optimization harder in deeper networks.
Output Layers: Match the Activation to the Task
Output activations are chosen to produce outputs with the right meaning and constraints.
- Binary classification: use sigmoid to map a single logit to a probability in (0, 1).
- Multi-class (single-label) classification: use softmax to map a vector of logits to a probability distribution that sums to 1.
- Multi-label classification: use sigmoid independently per label (each label is its own probability).
- Regression: often use no activation (linear output) for unconstrained real values; use bounded activations only if the target is inherently bounded (e.g., sigmoid for outputs constrained to (0, 1)).
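As a hedged illustration (a hypothetical two-layer NumPy network, not a specific library API), the sketch below keeps the hidden layer fixed at ReLU and changes only the output activation with the task:

import numpy as np

def forward(x, W1, b1, W2, b2, task):
    h = np.maximum(0.0, x @ W1 + b1)                      # hidden layer: ReLU default
    logits = h @ W2 + b2
    if task == "binary":                                  # one sigmoid probability per example
        return 1.0 / (1.0 + np.exp(-logits))
    if task == "multiclass":                              # softmax over the K logits per example
        e = np.exp(logits - logits.max(axis=-1, keepdims=True))
        return e / e.sum(axis=-1, keepdims=True)
    return logits                                         # regression: linear output, no activation

rng = np.random.default_rng(0)
x = rng.normal(size=(2, 4))
W1, b1 = rng.normal(size=(4, 8)), np.zeros(8)
W2, b2 = rng.normal(size=(8, 3)), np.zeros(3)
print(forward(x, W1, b1, W2, b2, "multiclass").sum(axis=-1))   # each row sums to 1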
Practical Step-by-Step: Picking an Activation Setup
Step 1: Identify the output constraint
- If the output should be a probability for two classes, plan for sigmoid.
- If exactly one class among K is correct, plan for softmax.
- If multiple classes can be simultaneously true, plan for K sigmoids.
- If predicting a real number without bounds, plan for a linear output.
Step 2: Choose a hidden-layer default
- Start with ReLU for hidden layers.
- If you observe many dead units (lots of exact zeros), consider adjustments such as smaller learning rate, different initialization, or a ReLU variant (e.g., Leaky ReLU) depending on your toolkit.
- If training is unstable due to exploding activations, consider bounded activations (tanh) or normalization strategies, but be mindful of saturation.
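For reference, Leaky ReLU (mentioned above as one possible variant) can be sketched in a few lines of NumPy; the slope alpha = 0.01 is just a common default, not a prescribed value:

import numpy as np

def leaky_relu(z, alpha=0.01):
    # a small negative slope keeps a nonzero gradient for z < 0,
    # so a unit pushed into the negative region can still recover
    return np.where(z > 0, z, alpha * z)

def leaky_relu_grad(z, alpha=0.01):
    return np.where(z > 0, 1.0, alpha)

z = np.array([-4.0, 2.0])
print(leaky_relu(z))        # [-0.04  2.  ]
print(leaky_relu_grad(z))   # [0.01 1.  ]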
Step 3: Check gradient flow with a simple diagnostic
During training, inspect activation statistics per layer (mean, fraction of zeros for ReLU, or fraction near -1/1 for tanh, near 0/1 for sigmoid). These quick checks often reveal whether you are saturating or killing units.
- ReLU: if a layer has an extremely high fraction of zeros across batches, you may be losing capacity due to dead neurons.
- Sigmoid/tanh: if many activations cluster near the extremes, gradients are likely small and learning may be slow.
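A minimal version of that diagnostic, assuming you can pull one layer's activations into a 2-D array, might look like the hypothetical activation_stats helper below (NumPy; the 0.05 saturation tolerance is an arbitrary choice):

import numpy as np

def activation_stats(layer_acts, kind, tol=0.05):
    # layer_acts: activations for one layer, shape (num_examples, num_units)
    stats = {"mean": float(layer_acts.mean())}
    if kind == "relu":
        stats["frac_zero"] = float((layer_acts == 0.0).mean())
    elif kind in ("sigmoid", "tanh"):
        lo, hi = (0.0, 1.0) if kind == "sigmoid" else (-1.0, 1.0)
        saturated = (np.abs(layer_acts - lo) < tol) | (np.abs(layer_acts - hi) < tol)
        stats["frac_saturated"] = float(saturated.mean())
    return stats

# Fabricated ReLU activations for illustration
acts = np.maximum(0.0, np.random.default_rng(1).normal(size=(512, 64)))
print(activation_stats(acts, "relu"))   # frac_zero around 0.5 for this synthetic example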
Concrete Examples: What the Output Activation Changes
Binary classification with sigmoid
Suppose the network produces a logit z for “spam”. Applying sigmoid gives p = sigmoid(z). If p is close to 1, the model is confident it is spam; close to 0 means confident it is not spam. The activation enforces the probability interpretation.
Multi-class classification with softmax
Suppose the network outputs logits z = [z1, z2, z3] for three classes. Softmax converts them into probabilities that sum to 1, ensuring the model expresses a single-label distribution rather than independent scores.
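A small worked example (NumPy, with made-up logits) shows the conversion:

import numpy as np

z = np.array([2.0, 1.0, 0.1])            # example logits for three classes
e = np.exp(z - z.max())                   # subtract the max for numerical stability
p = e / e.sum()
print(np.round(p, 3))                     # roughly [0.659 0.242 0.099]
print(round(float(p.sum()), 6))           # 1.0 -- a proper single-label distribution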
Regression with linear output
If you are predicting house price, a linear output avoids artificially bounding predictions. Using sigmoid here would incorrectly force outputs into (0, 1) unless you also rescale targets and accept that constraint.
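A toy NumPy calculation (all numbers made up) illustrates the difference:

import numpy as np

a = np.array([1.2, -0.7, 3.5])                        # final hidden activations (made up)
w, b = np.array([120.0, 80.0, 200.0]), 50.0           # made-up output weights, price in $1000s

linear_price = a @ w + b                              # unconstrained real value
squashed = 1.0 / (1.0 + np.exp(-linear_price))        # what a sigmoid output would do
print(linear_price)   # 838.0 -> read as $838k
print(squashed)       # ~1.0 -- pinned near the top of (0, 1), useless without rescaling targets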
Reference Snippets (Shapes and Typical Placement)
# Hidden layer (common default)
a = ReLU(Wx + b)

# Output layer depends on task
# Binary classification:
p = sigmoid(w^T a + b)
# Multi-class single-label:
p = softmax(W a + b)
# Regression (unbounded):
y_hat = W a + b