Activation functions are an essential component of artificial neural networks: they introduce non-linearity, allowing the network to learn complex patterns and relationships in the data and make accurate predictions. In this article, we’ll take a close look at these functions, understand why they’re crucial, and explore the different types. Whether you’re new to the topic or already familiar with it, I’ll walk you through everything you need to know about activation functions.
What Is an Activation Function?
An activation function is a mathematical function applied to the output of a neural network layer. Its main purpose is to introduce non-linearity into the network, allowing it to learn complex patterns and relationships in the data. Without one, the network can only learn linear patterns in the data.
Imagine a simple neural network with three layers and no activation function. During training, each layer performs only the fundamental computations: it multiplies its input by the weights and adds a bias, producing Output = (W*X) + b. If you look closely, you will notice that this is a linear function, and a composition of linear functions is itself linear, so the model as a whole remains purely linear.

Note that the image above illustrates the mathematical computations within the hidden layer. If you’re unfamiliar with the mathematical operations of neural networks, follow our comprehensive guides on Linear Algebra and Neural Networks for a better understanding.
Therefore, this neural network can only learn a linear relationship between the input data and the output data; training simply searches for suitable weight values W that produce the desired output. But in the real world, the problems we face are far more often non-linear than linear.
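To see this concretely, here is a small sketch (with arbitrary random weights, chosen only for illustration) showing that two stacked linear layers with no activation in between collapse into a single linear layer:

```python
import torch

# Two linear layers, no activation between them
W1, b1 = torch.randn(4, 3), torch.randn(4)
W2, b2 = torch.randn(2, 4), torch.randn(2)
x = torch.randn(3)

two_layers = W2 @ (W1 @ x + b1) + b2
# ...is exactly one linear layer with W = W2 @ W1 and b = W2 @ b1 + b2
one_layer = (W2 @ W1) @ x + (W2 @ b1 + b2)

print(torch.allclose(two_layers, one_layer))  # True
```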
For example, consider a simple XOR problem.

There is no single linear relationship that fits all of the XOR data: no straight line can separate the inputs that map to 0, (0,0) and (1,1), from the inputs that map to 1, (0,1) and (1,0). To solve this problem, we need to add non-linearity to the model, enabling it to find the deeper, more complex patterns and relationships in XOR.
Sometimes an activation function also plays a role in controlling the output values and producing the desired output format from the neural network. For example, the Softmax activation function transforms the neurons’ outputs into a probability distribution, where the output values are in the range [0, 1] and their sum is equal to 1.
It is important to note that activation functions are applied only to the outputs of the hidden layers and the output layer. Here, ‘output’ refers to the result obtained after adding the bias term to the sum of the weighted inputs of each neuron within the layer. It’s worth mentioning that the output of a neural network layer is a collection of values, not just a single one, so activation functions are typically applied element-wise, to each neuron’s output independently, producing a new value for each neuron.

Types of Activation Functions
There are lots of activation functions that can be used in deep learning. Here are some of the most essential and commonly used activation functions you should know.
Throughout this article, I use a combination of Python and PyTorch for the code examples. You can follow our installation guide to set up your machine properly!
Sigmoid Activation Function
The sigmoid is a smooth activation function (continuous, with a smooth derivative over its entire range) that maps its input (the layer outputs) to a value between 0 and 1 using the formula σ(x) = 1 / (1 + e^(-x)), which makes it well suited to binary classification tasks. It is also used as an output-layer activation in multi-label classification problems, where each output neuron acts as an independent binary classifier.

However, the sigmoid has some limitations, such as the vanishing gradient problem and saturation for large positive or negative inputs, which lead to slower training and convergence problems. For this reason, other activation functions such as ReLU and its variants are more commonly used in deep networks nowadays.
Sigmoid can be implemented in Python like this:
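Below is a minimal sketch using plain Python and the math module; the input values stand in for a layer’s outputs and are purely illustrative.

```python
import math

def sigmoid(values):
    # Apply 1 / (1 + e^(-x)) element-wise to the layer outputs
    return [1 / (1 + math.exp(-v)) for v in values]

layer_outputs = [-2.0, 0.0, 3.0]  # illustrative pre-activation values
print([round(v, 4) for v in sigmoid(layer_outputs)])
```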

Output:
[0.1192, 0.5, 0.9526]

Here is a standard implementation in PyTorch using the sigmoid() function from the torch library.
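A minimal sketch with the same illustrative inputs:

```python
import torch

layer_outputs = torch.tensor([-2.0, 0.0, 3.0])  # illustrative pre-activation values
print(torch.sigmoid(layer_outputs))
```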

Output:
tensor([0.1192, 0.5000, 0.9526])

Softmax Activation Function
The softmax activation function is commonly used in the output layer of neural networks for multi-class classification problems. It takes a vector that contains the final outputs of each neuron in the output layer and then transforms it into a probability distribution such that the sum of all probabilities is equal to 1. So Softmax is particularly useful when dealing with mutually exclusive classes.
The softmax function takes the exponential of each element in the input vector and then normalizes these values, dividing each exponential by the sum of the exponentials of all elements: softmax(x_i) = e^(x_i) / Σ_j e^(x_j).

The output of the softmax function is a probability distribution over the classes, where each element of the output vector represents the predicted probability that the input belongs to the corresponding class.
Softmax can be implemented in Python like this:
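Below is a minimal plain-Python sketch; the five logits are illustrative output-layer values.

```python
import math

def softmax(logits):
    # Exponentiate each logit, then normalize by the sum of the exponentials
    exps = [math.exp(v) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

logits = [0.0, 1.0, 2.0, 3.0, 4.0]  # illustrative outputs for five classes
print([round(p, 4) for p in softmax(logits)])
```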

Output:
[0.0117, 0.0317, 0.0861, 0.2341, 0.6364]

There are five classes [class1, class2, class3, class4, class5], and for this training example the corresponding probabilities are roughly [0.01, 0.03, 0.09, 0.23, 0.64]. You can see that the values in the output probability distribution sum up to 1. The class with the highest probability is taken as the prediction, so the model’s predicted class for this example is class5, with a probability of about 0.64.
Here is a standard implementation in PyTorch using the softmax() function from the torch library.
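A minimal sketch with the same illustrative logits:

```python
import torch

logits = torch.tensor([0.0, 1.0, 2.0, 3.0, 4.0])  # illustrative output-layer values
print(torch.softmax(logits, dim=0))  # dim=0: normalize across the class dimension
```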

Output:
tensor([0.0117, 0.0317, 0.0861, 0.2341, 0.6364])

ReLU Activation Function
Rectified Linear Unit (ReLU) is one of the most popular and widely used activation functions in deep learning due to its simplicity and effectiveness. It replaces all negative input values with zero and leaves positive values unchanged: ReLU(x) = max(0, x).

The ReLU activation function introduces non-linearity to the network by turning off negative values (setting them to zero). Unlike the sigmoid, ReLU does not saturate for positive inputs, so it largely avoids the vanishing gradient problem. This makes ReLU computationally more efficient, more stable, and faster to train.
However, ReLU also has some limitations:
ReLU can suffer from the dying ReLU problem, where neurons become stuck in a state of inactivity. This means that the output of some neurons remains zero for all inputs during training. Such neurons are said to be “dead” and no longer contribute to the learning process. This problem is addressed by variants of ReLU, such as Leaky ReLU.
ReLU can also suffer from exploding gradients for extremely large inputs. Because ReLU passes positive values through unchanged, a neuron that receives an extremely large positive input produces an equally large output, which can lead to extremely large gradients during backpropagation. These large gradients cause the model’s parameters to be updated by very large amounts during gradient descent optimization, which can make training unstable.
To address these problems, several variations of ReLU have been introduced, such as Leaky ReLU, Parametric ReLU, ELU, and Swish.
ReLU can be implemented in Python like this:
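A minimal plain-Python sketch, reusing the same illustrative inputs:

```python
def relu(values):
    # Replace negative values with zero, keep positive values unchanged
    return [max(0.0, v) for v in values]

layer_outputs = [-2.0, 0.0, 3.0]  # illustrative pre-activation values
print(relu(layer_outputs))
```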

Output:
[0.0, 0.0, 3.0]

Here is a standard implementation in PyTorch using the relu() function from the torch library.
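A minimal sketch with the same illustrative inputs:

```python
import torch

layer_outputs = torch.tensor([-2.0, 0.0, 3.0])
print(torch.relu(layer_outputs))
```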

Output:
tensor([0., 0., 3.])

Leaky Rectified Linear Unit (Leaky ReLU)
This is a variant of the ReLU function. Leaky ReLU addresses the dying ReLU problem (where the activation outputs zero for every negative input) by allowing a small slope (often denoted by α, e.g. 0.01) for negative values: LeakyReLU(x) = x if x > 0, otherwise α*x. Because the gradient for negative inputs is small but never exactly zero, neurons can keep learning instead of dying.

Leaky ReLU can be implemented in Python like this:
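A minimal plain-Python sketch with an α of 0.01 and the same illustrative inputs:

```python
def leaky_relu(values, alpha=0.01):
    # Keep positive values; scale negative values by the small slope alpha
    return [v if v > 0 else alpha * v for v in values]

layer_outputs = [-2.0, 0.0, 3.0]
print([round(v, 4) for v in leaky_relu(layer_outputs)])
```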

Output:
[-0.02, 0.0, 3.0]

Here is a standard implementation in PyTorch using the LeakyReLU() module from the torch.nn library.
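A minimal sketch with the same illustrative inputs:

```python
import torch

leaky_relu = torch.nn.LeakyReLU(negative_slope=0.01)  # 0.01 is also the default slope
layer_outputs = torch.tensor([-2.0, 0.0, 3.0])
print(leaky_relu(layer_outputs))
```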

Output:
tensor([-0.0200,  0.0000,  3.0000])

Parametric Rectified Linear Unit (PReLU)
The Parametric Rectified Linear Unit (PReLU) is a variation of the ReLU activation function. Unlike Leaky ReLU, where α is a fixed hyperparameter, PReLU treats the negative slope α as a model parameter that is learned during training. This helps the network adapt the negative slope to the data.

During the training process, the alpha (α) parameter is updated through backpropagation, just like the weights.
PReLU can be implemented in Python like this:
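A minimal plain-Python sketch; the α value of 0.25 is just an illustrative starting point (in a real network it would be learned):

```python
def prelu(values, alpha):
    # Same shape as Leaky ReLU, but alpha is treated as a learnable parameter
    return [v if v > 0 else alpha * v for v in values]

alpha = 0.25  # illustrative value; learned during training in practice
layer_outputs = [-2.0, 0.0, 3.0]
print(prelu(layer_outputs, alpha))
```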

Output:
[-0.5, 0.0, 3.0]

In PyTorch, you can directly use the built-in torch.nn.PReLU() module, which provides efficient computation and handles automatic differentiation for learning the ‘alpha’ parameter during training.
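A minimal sketch with the same illustrative inputs (nn.PReLU initializes alpha to 0.25 by default):

```python
import torch

prelu = torch.nn.PReLU(num_parameters=1)  # a single learnable alpha, initialized to 0.25
layer_outputs = torch.tensor([-2.0, 0.0, 3.0])
print(prelu(layer_outputs))
```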

num_parameters=1 – specifies that we want to learn a single alpha parameter (shared across all inputs).
Output (the exact grad_fn label can vary slightly between PyTorch versions):
tensor([-0.5000,  0.0000,  3.0000], grad_fn=<PreluKernelBackward0>)

The `grad_fn` attribute indicates that the operation is differentiable, allowing for backpropagation during training to learn the ‘alpha’ parameter.
GELU (Gaussian Error Linear Unit)
The GELU (Gaussian Error Linear Unit) is a smooth activation function (continuous, with a smooth curve that is easy to optimize) that behaves like a smoothed version of ReLU. It has become popular in transformer-based models like BERT due to its effectiveness in improving model performance and convergence speed.
The GELU function is continuous and differentiable, and it is defined as GELU(x) = x * Φ(x), where Φ is the standard normal cumulative distribution function. It approaches the identity function for large positive inputs, and for negative inputs it smoothly saturates to zero while keeping a non-zero gradient for moderately negative values, which helps avoid the “dying ReLU” problem.

GELU can be implemented in Python like this:
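A minimal plain-Python sketch of the exact (erf-based) form, with the same illustrative inputs:

```python
import math

def gelu(values):
    # Exact GELU: x * Phi(x), with Phi written via the error function
    return [0.5 * v * (1 + math.erf(v / math.sqrt(2))) for v in values]

layer_outputs = [-2.0, 0.0, 3.0]
print([round(v, 4) for v in gelu(layer_outputs)])
```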

Output:
[-0.0455, 0.0, 2.996]

Here is a standard implementation in PyTorch using the gelu() function from the torch.nn.functional module.
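A minimal sketch with the same illustrative inputs:

```python
import torch
import torch.nn.functional as F

layer_outputs = torch.tensor([-2.0, 0.0, 3.0])
print(F.gelu(layer_outputs))
```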

Output:
tensor([-0.0455,  0.0000,  2.9960])

Swish Activation Function
Swish is a more recent activation function that has gained popularity for its smoothness and better performance in some cases. It is inspired by the properties of the ReLU and sigmoid activation functions: it shows the roughly linear behavior of ReLU for large positive inputs and the saturating behavior of the sigmoid for negative inputs. It is defined as Swish(x) = x * sigmoid(β*x), where β is either fixed (commonly 1) or learned during training, and it is continuous and differentiable over its entire range, enabling efficient gradient-based optimization during training.

Thanks to this simple formulation, Swish is computationally cheaper than GELU.
Swish can be implemented in Python like this:
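A minimal plain-Python sketch with β fixed at 1 (the common choice, also known as SiLU) and the same illustrative inputs:

```python
import math

def swish(values, beta=1.0):
    # x * sigmoid(beta * x); beta = 1.0 is the common fixed setting
    return [v / (1 + math.exp(-beta * v)) for v in values]

layer_outputs = [-2.0, 0.0, 3.0]
print([round(v, 4) for v in swish(layer_outputs)])
```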

Output:
[-0.2384, 0.0, 2.8577]

Here is a standard implementation in PyTorch using the sigmoid() function from the torch library (with β = 1; torch.nn.functional.silu() computes the same thing).
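A minimal sketch with the same illustrative inputs:

```python
import torch

layer_outputs = torch.tensor([-2.0, 0.0, 3.0])
print(layer_outputs * torch.sigmoid(layer_outputs))  # Swish with beta = 1
```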

Output:
tensor([-0.2384,  0.0000,  2.8577])

These are some of the most important and popular activation functions used in deep learning. Each has its own advantages and disadvantages. So you should know how to choose the right activation function for your model.
How to Choose the Right Activation Function
The choice of activation function depends on various factors including the nature of the problem, the characteristics of the data, and the architecture of the neural network.
The activation functions in the hidden layers are used to capture complex patterns within the data. So the activation functions in the hidden layers mainly depend on the characteristics of the input data, while the activation function in the output layer depends on the specific problem being addressed.
- Hidden layers: Activation functions like tanh, Swish, GELU, ReLU, and its variants (Leaky ReLU, Parametric ReLU) are mainly used in the hidden layers of a neural network. The sigmoid activation function is also sometimes used in hidden layers.
- Output Layer: The choice of activation function in the output layer depends on the nature of the problem. For binary classification problems, the sigmoid function is commonly used in the output layer. For multi-class classification problems, the softmax function is used in the output layer.
Sometimes you might have several choices for a particular task or data feature. In such cases, you should experiment with different activation functions and compare their performance to find the best-suited one for your model.
In this comprehensive guide, we’ve dived into activation functions in neural networks, unraveling their significance in artificial neural networks and exploring various functions such as sigmoid, softmax, ReLU and its variants, GELU, and Swish. Lastly, we looked at some key factors you should consider when choosing the activation functions for your model. As the AI field advances rapidly, novel activation functions are continuously being devised to overcome existing limitations, so it’s worth staying up to date with the latest innovations.