
Activation Functions in Neural Networks

Updated: Aug 5, 2021

If you accidentally touch a hot object, you pull your hand away without even thinking. This is a reflex action. But what causes it? Sensory neurons in your hand are activated when they sense the hot object, and they send a signal through the nervous system that makes you pull your hand away. Only the neurons in the area that touched the hot object are activated; the other neurons remain inactive.

Similarly, activation functions are the decision-making units of a neural network in deep learning. When the input features are fed into the input layer of the neural network, each feature is multiplied by a weight and a bias is added to the sum. The resulting value, called the net input, can be anything from -∞ to +∞. An activation function is then applied to this net input to decide how strongly the neuron fires; many activation functions, such as sigmoid and tanh, squash it into a bounded range. Finally, the output of the activation function is passed to the next hidden layer and the same process is repeated.

Why Activation Functions?

Activation functions are important because, without them, the neural network is nothing but a linear regression model: a stack of linear layers collapses into a single linear transformation, no matter how many layers it has. An activation function applies a non-linear transformation to the input before sending it to the next layer of neurons, so that the model can learn and perform more complex tasks.
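To make the collapse concrete, here is a minimal sketch with two scalar "layers" and no activation function between them. The weights and biases are made-up numbers chosen only for illustration.

```python
# Two stacked layers with NO activation function, using scalar weights.
# All numbers are arbitrary illustrative values, not from any real model.
w1, b1 = 2.0, 1.0    # first layer
w2, b2 = -3.0, 0.5   # second layer


def two_linear_layers(x):
    hidden = w1 * x + b1        # layer 1, no activation
    return w2 * hidden + b2     # layer 2, no activation


# The same mapping written as ONE linear function:
# w2 * (w1 * x + b1) + b2 = (w2 * w1) * x + (w2 * b1 + b2)
w_combined = w2 * w1
b_combined = w2 * b1 + b2


def single_linear_layer(x):
    return w_combined * x + b_combined


# Both versions agree on every input we try.
for x in (-2.0, 0.0, 1.5):
    assert abs(two_linear_layers(x) - single_linear_layer(x)) < 1e-12
```

The second layer adds no expressive power here. Placing a non-linear function between the two layers is what breaks this equivalence.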

Mathematically, we can write it as:

y = (w1x1 + w2x2 + w3x3) + bias
z = Activation(y)
Z = z * w4
output = Activation(Z)

In general,

y = weights * input features + bias
z = Activation(y)
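The computation above can be sketched in a few lines of Python. The helper name `neuron`, the choice of sigmoid as the activation, and all numeric values are assumptions made for illustration; they are not part of the original formulas.

```python
import math


def sigmoid(v):
    # One possible activation function (an assumption for this sketch).
    return 1.0 / (1.0 + math.exp(-v))


def neuron(inputs, weights, bias, activation):
    # y = weights * input features + bias, then z = Activation(y)
    y = sum(w * x for w, x in zip(weights, inputs)) + bias
    return activation(y)


# Hidden neuron: y = w1*x1 + w2*x2 + w3*x3 + bias, z = Activation(y)
x = [1.0, 2.0, 3.0]      # illustrative input features
w = [0.5, -0.25, 0.1]    # illustrative weights w1, w2, w3
z = neuron(x, w, bias=0.2, activation=sigmoid)

# Output neuron: Z = z * w4, output = Activation(Z)
w4 = 0.8
output = neuron([z], [w4], bias=0.0, activation=sigmoid)

print(z, output)
```

Passing the activation in as an argument keeps the weighted-sum step separate from the non-linearity, so the same `neuron` helper could be reused with any of the functions discussed later.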

Activation functions can be divided into two types:

  • Linear Activation Function

  • Non-linear Activation Functions

Linear or Identity Activation Function

This function is a straight line: the output of the activation function is equal to the input (or, more generally, proportional to it). Therefore, the output of the function ranges from -∞ to +∞. It is useful when the data can be separated by a straight line, but it does not help the network model complex, non-linear relationships in the input.

Mathematical definition:

f(x) = x
f'(x) = 1

Non-linear Activation Functions

In neural networks, non-linear activation functions are used most often because they let the network separate data that is not linearly separable.

Non-linear activation functions are mainly classified by the range of their curves. Several different types are used in deep learning. Some of them are discussed below:

Sigmoid or Logistic Activation function

This function maps every input value to a value between 0 and 1. The threshold for its output is 0.5, which corresponds to an input of 0. If the output is greater than 0.5 (that is, the input is positive), the neuron is treated as activated and the output is read as 1; if the output is less than 0.5, the neuron is not activated and the output is read as 0. However, a problem arises during backpropagation when we use the sigmoid function.

In backpropagation, the weights are updated using the formula Wnew = Wold - η ∂L/∂Wold, where Wnew is the updated weight, Wold is the old weight, η is the learning rate, and ∂L/∂Wold is the derivative of the loss with respect to the weight. The derivative of the sigmoid function always lies between 0 and 0.25. By the chain rule, the gradient for a weight in an early layer is a product that contains one such derivative for every sigmoid layer after it, so as the number of layers increases, this gradient shrinks towards 0. The updated weight then becomes approximately equal to the old weight, and the early layers almost stop learning. This problem is called the Vanishing Gradient Problem.
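A rough sketch of why the update stalls: if each sigmoid layer contributes a factor of at most 0.25 to the gradient, the product over many layers becomes tiny. The function names and the learning rate below are illustrative assumptions, and the sketch deliberately ignores the weight terms that also appear in the real chain rule product.

```python
# Upper bound on the factor that n sigmoid layers contribute to a gradient:
# each layer multiplies it by at most 0.25 (the maximum sigmoid derivative).
MAX_SIGMOID_DERIVATIVE = 0.25


def max_gradient_factor(num_layers):
    return MAX_SIGMOID_DERIVATIVE ** num_layers


def update_weight(w_old, grad, learning_rate):
    # W_new = W_old - eta * dL/dW_old
    return w_old - learning_rate * grad


for n in (1, 5, 10):
    print(n, max_gradient_factor(n))

# With 10 layers the bound is about 9.5e-07, so the update barely moves the weight.
w_old = 0.8                          # illustrative starting weight
tiny_grad = max_gradient_factor(10)
w_new = update_weight(w_old, tiny_grad, learning_rate=0.1)
print(w_old, w_new)
```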

Mathematical definition:

f(x) = 1 / (1 + e^(-x))
f'(x) = f(x)(1 - f(x))
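The sigmoid function and its derivative can be written directly from this definition. The version below is a minimal sketch using only the standard library; splitting the computation by the sign of x is an implementation choice made here to avoid overflow for large negative inputs, not something the definition requires.

```python
import math


def sigmoid(x):
    # Numerically stable form: never calls exp() on a large positive number.
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def sigmoid_derivative(x):
    # f'(x) = f(x) * (1 - f(x)); largest at x = 0, where it equals 0.25.
    s = sigmoid(x)
    return s * (1.0 - s)


for x in (-10.0, 0.0, 10.0):
    print(x, sigmoid(x), sigmoid_derivative(x))
```

The derivative peaks at x = 0 and falls towards 0 as the input moves away from 0 in either direction, which is the saturation behind the vanishing gradient problem.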

Hyperbolic Tangent (tanh) Activation function

The tanh activation function is similar to the sigmoid activation function, but it maps all values to the range between -1 and 1. Unlike sigmoid, it is a zero-centered function. Its derivative lies between 0 and 1, so it is also prone to the vanishing gradient problem in deep networks. It can take slightly more time to compute than sigmoid.

Mathematical definition:

f(x) = (e^x - e^(-x)) / (e^x + e^(-x))
f'(x) = 1 - f(x)^2
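As a small sketch, tanh and its derivative can be computed with Python's standard `math.tanh`. The check at the end uses the standard identity tanh(x) = 2 * sigmoid(2x) - 1 to show how closely the two functions are related; the helper names are illustrative.

```python
import math


def tanh(x):
    # Thin wrapper around the standard library implementation.
    return math.tanh(x)


def tanh_derivative(x):
    # f'(x) = 1 - f(x)^2; equals 1 at x = 0 and approaches 0 for large |x|.
    t = math.tanh(x)
    return 1.0 - t * t


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


# tanh is a rescaled, zero-centered sigmoid: tanh(x) = 2 * sigmoid(2x) - 1
x = 0.7
assert abs(tanh(x) - (2.0 * sigmoid(2.0 * x) - 1.0)) < 1e-12

for v in (-3.0, 0.0, 3.0):
    print(v, tanh(v), tanh_derivative(v))
```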

ReLU (Rectified Linear Unit) Activation function

It is one of the most commonly used activation functions. This function returns the maximum of 0 and the input value: if the input is positive, it returns the input unchanged, and if the input is negative, it returns 0, because the maximum of 0 and a negative number is always 0. Because its derivative is 1 for every positive input, ReLU does not shrink the gradient the way sigmoid and tanh do, which greatly reduces the vanishing gradient problem.

However, ReLU has a problem of its own. For negative inputs, the output is 0 and so is the derivative. During backpropagation, when the chain rule is applied, the gradient for such a neuron is therefore 0, so Wnew = Wold and the weights are not updated. A neuron stuck in this state is called a dead neuron. To fix this problem, another activation function, Leaky ReLU, is used.

Mathematical definition:

f(x) = max(0, x)
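The sketch below puts ReLU next to Leaky ReLU to show how the latter avoids a zero gradient for negative inputs. The parameter name `negative_slope` and its value 0.01 are assumptions (0.01 is a common choice, but the text above does not specify one), and using a derivative of 0 at exactly x = 0 for ReLU is a convention, since the true derivative is undefined there.

```python
def relu(x):
    # f(x) = max(0, x)
    return max(0.0, x)


def relu_derivative(x):
    # 1 for positive inputs, 0 for negative inputs (0 at x = 0 by convention).
    return 1.0 if x > 0 else 0.0


def leaky_relu(x, negative_slope=0.01):
    # Like ReLU, but negative inputs are scaled by a small slope instead of zeroed.
    return x if x > 0 else negative_slope * x


def leaky_relu_derivative(x, negative_slope=0.01):
    # The gradient is never exactly 0, so the weights can still be updated.
    return 1.0 if x > 0 else negative_slope


for x in (-3.0, 2.5):
    print(x, relu(x), relu_derivative(x), leaky_relu(x), leaky_relu_derivative(x))
```

For a negative input, `relu_derivative` returns 0, which corresponds to the dead neuron case, while `leaky_relu_derivative` returns the small slope, so some gradient still flows back.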