18.5. Backpropagation and Training of Neural Networks: Activation Functions

The Backpropagation algorithm is essential for training deep neural networks (Deep Learning). It allows the network to adjust its internal weights efficiently, minimizing the difference between predicted outputs and actual outputs (the error). The training process is iterative and involves computing the gradient of the loss (or cost) function with respect to each weight in the network, which is done using the chain rule, propagating the error from the output layer back to the earlier layers.
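
In symbols, for a weight w that feeds a neuron with pre-activation z and activation output a, the chain rule decomposes the gradient of the loss L as follows (the notation here is a generic sketch, not defined elsewhere in this text):

```latex
\frac{\partial L}{\partial w}
  = \frac{\partial L}{\partial a}
    \cdot \frac{\partial a}{\partial z}
    \cdot \frac{\partial z}{\partial w}
```

Each factor is local to one layer, which is why the error can be pushed backwards one layer at a time.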

How Backpropagation Works

The backpropagation process can be divided into two main steps:

  1. Forward Propagation: In this step, the input data is passed through the network, layer by layer, until an output is produced. Each neuron computes a weighted sum of its input signals (using the associated weights) and passes the result through an activation function to generate an output signal.
  2. Backward Propagation: After obtaining the output, the difference between this output and the desired output (the error) is calculated. This error is then propagated back through the network, and the weights are updated in a way that reduces it: each weight is adjusted by the gradient of the loss function with respect to that weight, scaled by a learning rate, as the sketch after this list illustrates.
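
To make the two steps concrete, here is a minimal NumPy sketch of one training iteration for a tiny one-hidden-layer network with sigmoid activations and a squared-error loss. All names and values (W1, W2, lr, the layer sizes) are illustrative assumptions, not from the original text:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
x = rng.normal(size=(3, 1))        # one input sample with 3 features
y = np.array([[1.0]])              # desired output
W1 = rng.normal(size=(4, 3))       # hidden layer weights (4 neurons)
W2 = rng.normal(size=(1, 4))       # output layer weights
lr = 0.1                           # learning rate

# 1. Forward propagation: weighted sums followed by activations
z1 = W1 @ x
a1 = sigmoid(z1)
z2 = W2 @ a1
a2 = sigmoid(z2)                   # network output

# 2. Backward propagation: chain rule from the output layer backwards
# loss L = 0.5 * (a2 - y)**2, and sigmoid'(z) = a * (1 - a)
delta2 = (a2 - y) * a2 * (1 - a2)          # dL/dz2
grad_W2 = delta2 @ a1.T                    # dL/dW2
delta1 = (W2.T @ delta2) * a1 * (1 - a1)   # dL/dz1
grad_W1 = delta1 @ x.T                     # dL/dW1

# Weight update: gradient scaled by the learning rate
W2 -= lr * grad_W2
W1 -= lr * grad_W1
```

Repeating these three stages over many samples is, in essence, the whole training loop; real frameworks automate the gradient computation but follow the same logic.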

Activation Functions

Activation functions play a crucial role in training neural networks: they are what introduce non-linearity into the model, allowing the network to learn complex relationships between input and output data. Without non-linearity, the network would be equivalent to a linear model no matter how many layers it had (a composition of linear maps is still linear), and could not solve problems that are not linearly separable, such as the classic XOR problem.

Some of the most common activation functions are:

  • Sigmoid: Maps any input value to a value between 0 and 1. It is useful for outputting probabilities, but is rarely used in hidden layers due to the vanishing gradient problem.
  • Tanh (Hyperbolic Tangent): Similar to the sigmoid function, but maps input values to a range between -1 and 1. It can also suffer from the vanishing gradient problem, but is often preferred over sigmoid in hidden layers because its outputs are zero-centered.
  • ReLU (Rectified Linear Unit): Returns the input itself if it is positive, and zero otherwise. It is the most widely used activation function in deep neural networks due to its computational efficiency and the fact that it mitigates the vanishing gradient problem.
  • Leaky ReLU: A variation of ReLU that allows a small gradient when the input is negative, preventing neurons from becoming "dead" during training.
  • Softmax: Typically used in the last layer of a neural network for classification tasks, the softmax function transforms logits (the raw output scores) into probabilities that sum to 1.
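
For reference, all of the functions above fit in a few lines of NumPy (a minimal sketch; real frameworks ship optimized, numerically stable versions):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))       # maps to (0, 1)

def tanh(z):
    return np.tanh(z)                      # maps to (-1, 1), zero-centered

def relu(z):
    return np.maximum(0.0, z)              # identity for positives, 0 otherwise

def leaky_relu(z, alpha=0.01):
    return np.where(z > 0, z, alpha * z)   # small slope alpha for negatives

def softmax(z):
    e = np.exp(z - np.max(z))              # subtract max for numerical stability
    return e / e.sum()                     # probabilities that sum to 1
```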

Optimization and Weight Adjustment

The optimization process during the training of a neural network is typically performed using a gradient-based optimizer such as Gradient Descent or one of its variants (Stochastic Gradient Descent, Mini-batch Gradient Descent, Adam, RMSprop, etc.). The objective is to find the set of weights that minimizes the loss function.
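
All of these optimizers are built around the same basic update rule of gradient descent, where each weight w is moved a small step against the gradient of the loss L, with η denoting the learning rate:

```latex
w \leftarrow w - \eta \, \frac{\partial L}{\partial w}
```

Variants such as Adam and RMSprop keep running statistics of past gradients and use them to adapt the effective step size for each weight individually.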

The learning rate is a critical hyperparameter in training neural networks. Too high a rate can cause the algorithm to overshoot minima, or even diverge, while too low a rate can result in very slow convergence or getting stuck in poor local minima.
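
This effect is easy to observe on a toy problem. The sketch below (illustrative values, not from the original text) runs plain gradient descent on f(w) = w**2, whose minimum is at w = 0: a well-chosen rate converges quickly, a tiny rate barely moves, and an overly large rate makes the iterates diverge.

```python
def gradient_descent(lr, steps=20, w=1.0):
    """Minimize f(w) = w**2; its gradient is 2*w."""
    for _ in range(steps):
        w -= lr * 2 * w
    return w

for lr in (0.1, 0.001, 1.1):
    print(f"lr={lr}: w after 20 steps = {gradient_descent(lr):.6f}")
# lr=0.1  converges toward 0, lr=0.001 crawls, lr=1.1 blows up
```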

Conclusion

Backpropagation is the backbone of neural network training, allowing adjustments to be made to the network's weights in order to minimize prediction error. Activation functions are key components that allow the network to capture the complexity and nonlinearity of data. Proper choice of activation function and careful configuration of hyperparameters, such as learning rate, are critical to success in building efficient and accurate Machine Learning and Deep Learning models.

It is important to note that the field of Deep Learning is constantly evolving, and new techniques and approaches are being developed regularly. Therefore, it is essential to stay up to date with the latest literature and experiment with different architectures and hyperparameters to find the best solution for a specific problem.
