Activation Function

Activation Functions - Sigmoid, TanH, ReLU, Softmax

4 minute read

Playlist Deep Learning Fundamentals | Full Course

Why do we need activation function?

Activation Function introduces non-linearity, which allows networks to learn complex patterns in the data.

Why is non-linearity important ?

Real-world data (images, speech, text, financial trends) is rarely linear. Non-linearity allows the network to learn and represent complex mappings between inputs and outputs.

It enables the network to become a ‘Universal Function Approximator’.

A neural network with following properties can approximate any continuous function.

at least one hidden layer
nonlinear activation

🎯 This theorem is the mathematical reason - why neural networks are so powerful.

What if we do not use any activation function ?

A deep neural network without any non-linear activation collapses into a single linear layer.

Say, if, f(x) = ax and g(x) = bx,
then, g(f(x)) = g(ax) = (ba)x = cx
where, c = ba (another constant)

Effectively, both the linear functions g(f(x)) can be represented by another single linear function h(x).

❌ So depth becomes useless.

Sigmoid
Tanh (Hyperbolic Tangent)
ReLU (Rectified Linear Unit)
- Leaky ReLU
Softmax

A mathematical function with a characteristic “S”-shaped curve (sigmoid curve) that maps any real-valued number into a range between 0 and 1.

\[\sigma(x) = \frac{1}{1 + e^{-x}}\]

Usage:
Mostly used in binary classification output layers.

Issue:
Suffers from vanishing gradient (gradients become near-zero for high or low input values), which slows down training.

images/deep_learning/fundamentals/activation_function/sigmoid.png

images/deep_learning/fundamentals/activation_function/sigmoid_derivative.png

Read more about Differentiation

A mathematical function with a S-shaped curve that maps any real-valued number into a range between -1 and 1.
It is zero-centered, making it more effective than the sigmoid function for hidden layers in neural networks.

\[tanh(x) = 2\sigma(2x) - 1 = \frac{e^x - e^{-x}}{e^x + e^{-x}}\]

Benefit:
Makes optimization faster as the data is zero-centered.

Issue:
TanH also suffers from vanishing gradient (gradients become near-zero for high or low input values), which slows down training.

images/deep_learning/fundamentals/activation_function/tanh.png

images/deep_learning/fundamentals/activation_function/tanh_derivative.png

Read more about Differentiation

A mathematical function that outputs the input value directly if it is positive, and zero otherwise.
Computationally simple; very fast to compute; does not saturate in the positive direction.

\[ReLU(x) = \text{max}(0,x)\]

Benefit:
It is computationally efficient and helps mitigate the ‘vanishing gradient’ problem, making it the most popular choice for hidden layers.

Issue:
‘Dying ReLU’ problem: negative inputs result in a zero gradient, meaning the neuron stops learning (no weight updates).

images/deep_learning/fundamentals/activation_function/relu.png

images/deep_learning/fundamentals/activation_function/relu_derivative.png

Read more about Differentiation

Instead of setting negative input values to zero like a standard ReLU, Leaky ReLU allows a small, non-zero gradient (slope) for negative values.
This ensures that neurons continue learning (even for negative values).

\[Leaky ~ ReLU(x) = \text{max}(\alpha x,x)\]

where ‘\(\alpha\)’ is a small constant (e.g., 0.01)

Benefit:
Fixes the ‘dying ReLU’ problem.

images/deep_learning/fundamentals/activation_function/leaky_relu.png

images/deep_learning/fundamentals/activation_function/leaky_relu_derivative.png

Read more about Differentiation

Multivariate activation function that takes a vector of raw scores (logits) and converts them into a probability distribution; sum of probabilities = 1.

\[\sigma(\mathbf{z})_i = \frac{e^{z_i}}{\sum_{j=1}^K e^{z_j}}\]

where ‘K’ = number of classes

Winner Takes Most
Since, the exponential function \(e^{x}\) grows rapidly.
A small lead in raw score (logit) results in a disproportionately large share of the final probability.
So, winner takes the majority share, but non-winners still retain a small, non-zero probability.

Role of Temperature (\(T\))
The “sharpness” of the distribution is controlled by a temperature parameter \(T\):

\[ \sigma (z)_{i}=\frac{e^{z_{i}/T}}{\sum _{j=1}^{K}e^{z_{j}/T}}\]

High Temperature (\(T \to \infty\)): The output becomes a uniform distribution, where all classes have nearly equal probability regardless of their input scores.
Low Temperature (\(T \to 0\)): The output becomes a “hard” max (one-hot vector), where the highest score gets a probability of \(1\) and all others \(0\).

Gradient of Softmax

\[\frac{\partial \sigma_i}{\partial z_j} = \begin{cases} \sigma_i(1 - \sigma_i) & \text{if } i = j \\ -\sigma_i\sigma_j & \text{if } i \neq j \end{cases} \]

Combining both cases:

\[\frac{\partial \sigma_i}{\partial z_j} = \sigma_i (\delta_{ij} - \sigma_j)\]

where, \(\delta_{ij}\) is the Kronecker delta (1 if \(i=j\) (diagonal), 0 otherwise).

Usage:

Almost exclusively used in the output layer of multi-class classification networks.
Attention score calculation.

Softmax Graph

images/deep_learning/fundamentals/activation_function/softmax.png

Example
Consider an AI model reading a product review to categorize the customer’s mood into three classes: Positive,Neutral, or Negative.

Class	Score (Logit)	Softmax
Positive	3	82.1%
Neutral	1	11.1%
Negative	0.5	6.7%

\[\sigma(\mathbf{z})_i = \frac{e^{z_i}}{\sum_{j=1}^K e^{z_j}}\]

\[ \begin{aligned} \text{softmax(positive) } = \frac{e^{3}}{e^3 + e^1 + e^{0.5}} \\[10pt] = \frac{20.085}{20.085 + 2.718 + 1.649} \\[10pt] = \frac{20.085}{24.452} \\[10pt] = 0.821 \text { or } 82.1\% \end{aligned} \]

Similarly, you can calculate for negative and neutral sentiments.

Video Activation Function | Why Non Linearity is important in Deep Learning ? | Sigmoid | ReLU | Softmax

Previous: XOR Problem Next: Optimization Methods