Fundamentals
1 - Intro to DL
📌 Deep learning is a subset of AI and machine learning that uses multi-layered artificial neural networks to simulate human-like learning, analyzing vast amounts of data to identify complex patterns, such as recognizing objects in photos, detecting medical anomalies, or processing natural language, as in LLMs.

💡 The "deep" in "deep learning" stands for the idea of successive layers of representations.
π It is “deep” because it uses many layers (often hundreds) to automatically extract, transform, and map data features into predictions, surpassing traditional machine learning in handling unstructured data.
🖼️ A deep neural network for digit classification

📌 Deep learning is a multistage way to learn data representations.
💡 It's a simple idea, but as it turns out, very simple mechanisms, sufficiently scaled, can end up looking like magic.
🖼️ A fully connected neural network

Feature Engineering: Deep learning completely automates what used to be the most crucial step in a machine learning workflow, making problem-solving much easier.
Read more about Feature Engineering
Performance: Better performance for solving many kinds of problems, especially with unstructured data.
Although deep learning is a fairly old subfield of machine learning, it only rose to prominence in the early 2010s.
- Perceptron (1957), Frank Rosenblatt
- Back Propagation (1986), Geoffrey Hinton
- LSTM (1997), Sepp Hochreiter and Jürgen Schmidhuber
Some of deep learning's important algorithms, such as backpropagation and long short-term memory (for time series), were well understood before the 2000s and have barely changed since then.
Breakthrough Moment
It began with a win in academic image-classification competitions with GPU-trained deep neural networks.
📌 But the watershed moment came in 2012, with the entry of Geoffrey Hinton's group in the yearly large-scale image-classification challenge ImageNet (ImageNet Large Scale Visual Recognition Challenge, or ILSVRC for short).
ImageNet
The ImageNet challenge was very difficult at the time, consisting of classifying high-resolution color images into 1,000 different categories after training on 1.4 million images.
- In 2011, the top-five accuracy of the winning model, based on classical approaches to computer vision, was only 74.3%.
- Then, in 2012, a team led by Alex Krizhevsky and advised by Geoffrey Hinton was able to achieve a top-five accuracy of 83.6%, a significant breakthrough.
- Since then, the competition has been dominated by deep convolutional neural networks.
Note: By 2015, the winner reached an accuracy of 96.4%, and the classification task on ImageNet was considered to be a completely solved problem.
Driving Forces
- Hardware
- Datasets and Benchmarks
- Algorithmic Advances
Note: The real bottlenecks throughout the 1990s and 2000s were data and hardware.
Experiments and Engineering
Because the deep learning field is guided by experimental findings rather than by theory, algorithmic advances only become possible when appropriate data and hardware are available to try new ideas (or to scale up old ideas, as is often the case).
Machine learning isn't mathematics or physics, where major advances can be done with a pen and a piece of paper.
📌 It's an engineering science.
Graphics Processing Unit (GPU)
Throughout the 2000s, companies like NVIDIA and AMD invested billions of dollars in developing fast, massively parallel chips (GPUs) for video games to render complex 3D scenes in real time on the computer screen.
This investment came to benefit the scientific community when, in 2007, NVIDIA launched CUDA, a programming interface for its line of GPUs.
Deep neural networks, consisting mostly of many small matrix multiplications, are also highly parallelizable using GPUs.
💡 AI is sometimes heralded as the new industrial revolution.
If deep learning is the steam engine of this revolution, then data is its coal: the raw material that powers our intelligent machines, without which nothing would be possible.
📌 When it comes to data, in addition to the exponential progress in storage hardware over the past 20 years (following Moore's law), the game changer has been the rise of the internet, making it feasible to collect and distribute very large datasets for machine learning.
Today, large companies work with image datasets, video datasets, and natural language datasets that could not have been collected without the internet.
In addition to hardware and data, until the late 2000s, we were missing a reliable way to train very deep neural networks.
As a result, neural networks were still fairly shallow, using only one or two layers of representations; thus, they weren't able to shine against more-refined shallow methods such as SVMs and Random Forests.
- The key issue was that of gradient propagation through deep stacks of layers.
- The feedback signal used to train neural networks would fade away as the number of layers increased.
🎯 This changed around 2009–2010 with the advent of several simple but important algorithmic improvements that allowed for better gradient propagation:
- Better activation functions for neural layers, such as ReLU.
- Better weight-initialization schemes, such as Xavier initialization (2010) and He initialization (2015).
- Better optimization schemes, such as RMSProp (2012) and Adam (2014).
- Advanced ways to improve gradient propagation, such as batch normalization (2015) and residual connections (2015).
2 - XOR Problem
The perceptron is the simplest form of an artificial neural network, acting as a single-layer binary classifier that categorizes input data into one of two groups.
It serves as a mathematical model of a biological neuron, receiving multiple signals (inputs), weighting their importance, and deciding whether to 'fire' (output 1) or stay 'inactive' (output 0).
🖼️ Perceptron

💡 Even logistic regression is a simple neural network, with a sigmoid activation (instead of the step function used in the perceptron).
🖼️ Logistic Regression as a Neural Network

XOR Function:
| Input A | Input B | Output (A ⊕ B) |
|---|---|---|
| 0 | 0 | 0 |
| 0 | 1 | 1 |
| 1 | 0 | 1 |
| 1 | 1 | 0 |
Let’s plot the input and output on a graph for visualization.

Linear Regression
The cost function (mean squared error):
\[J(\mathbf{w}, w_0) = \frac{1}{n}\sum_{i=1}^{n} (y_i - \hat{y}_i)^2\]where, \(\hat{y_i} = \mathbf{w^Tx_i} + w_0\)
Solving the normal equations, we get:
\(\mathbf{w = 0} , w_0 = 0.5\)
This implies that, whatever the input is, we always get 0.5 as the output, because linear regression tries to fit the best line to the data, which in this case lies midway between the points.
And that is definitely not the correct solution.
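To see this numerically, here is a minimal NumPy sketch (added for illustration; the variable names are assumptions, not part of the original derivation) that solves the least-squares problem for the XOR data:

```python
import numpy as np

# XOR data with a bias column, so w0 is learned as an extra weight
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([0, 1, 1, 0], dtype=float)
X_b = np.hstack([np.ones((4, 1)), X])

# Solve the normal-equation (least-squares) problem: theta = [w0, w1, w2]
theta, *_ = np.linalg.lstsq(X_b, y, rcond=None)
print(theta)        # ~[0.5, 0.0, 0.0]
print(X_b @ theta)  # every prediction is ~0.5
```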

Logistic Regression
Similarly, logistic regression cannot find a single linear decision boundary that separates the 4 XOR outputs.
❌ No straight line can separate the XOR points.
Therefore, a linear model is not sufficient to represent the XOR function.
So, we need more than one neuron to solve the XOR problem (because logistic regression is a neural network with a single neuron).
Let's solve the XOR problem with a simple neural network with 2 neurons (1 hidden layer) and a ReLU activation function.
1 Hidden layer and 2 Neurons

We will use linear algebra to demonstrate one of the solutions to the problem.
Let input = X and output = Y.
Out of many possible solutions, let's look at the one below:
Weight and bias of the hidden layer:
\[W = \begin{bmatrix} 1 & 1 \\ 1 & 1 \end{bmatrix}, \quad \mathbf{c} = \begin{bmatrix} 0 & -1 \end{bmatrix}\]
Output of hidden layer:
\[z = XW + c\]\[ XW = \begin{bmatrix} 0 & 0 \\ 0 & 1 \\ 1 & 0 \\ 1 & 1 \end{bmatrix} \begin{bmatrix} 1 & 1 \\ 1 & 1 \end{bmatrix} = \begin{bmatrix} 0 & 0 \\ 1 & 1 \\ 1 & 1 \\ 2 & 2 \end{bmatrix} \]\[ z = \begin{bmatrix} 0 & 0 \\ 1 & 1 \\ 1 & 1 \\ 2 & 2 \end{bmatrix} + \begin{bmatrix} 0 & -1 \\ 0 & -1 \\ 0 & -1 \\ 0 & -1 \end{bmatrix} = \begin{bmatrix} 0 & -1 \\ 1 & 0 \\ 1 & 0 \\ 2 & 1 \end{bmatrix} \]Now, let's apply the ReLU activation function to the output 'z' of the hidden layer:
\[ReLU(z) = \begin{bmatrix} 0 & 0 \\ 1 & 0 \\ 1 & 0 \\ 2 & 1 \end{bmatrix}\]Applying the ReLU non-linearity changes the position of the points in the hidden space, and now the points can be separated by a line.

Weight and bias of output layer:
\[\mathbf{w} = \begin{bmatrix} 1 \\ -2 \end{bmatrix}, \quad b = 0\]Output:
\[\hat{y} = \text{max}(0, ~ XW + c)~\mathbf{w} + b\]\[\hat{y} = \begin{bmatrix} 0 & 0 \\ 1 & 0 \\ 1 & 0 \\ 2 & 1 \end{bmatrix} \begin{bmatrix} 1 \\ -2 \end{bmatrix} = \begin{bmatrix} 0 \\ 1 \\ 1 \\ 0 \end{bmatrix}\]Therefore, we get the expected output of the XOR function.
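As a quick sanity check, here is a small NumPy sketch (added for illustration, not part of the original text) that evaluates this hand-constructed network on all four XOR inputs:

```python
import numpy as np

# Hand-constructed 2-neuron solution to XOR (values taken from the derivation above)
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
W = np.array([[1, 1], [1, 1]], dtype=float)   # hidden-layer weights
c = np.array([0, -1], dtype=float)            # hidden-layer bias
w = np.array([[1], [-2]], dtype=float)        # output-layer weights
b = 0.0                                       # output-layer bias

h = np.maximum(0, X @ W + c)  # ReLU(XW + c)
y_hat = h @ w + b             # linear output layer
print(y_hat.ravel())          # [0. 1. 1. 0.] -> XOR
```

The TensorFlow example that follows trains the same 2-neuron architecture from scratch instead of using hand-picked weights.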
import tensorflow as tf
import numpy as np

# 1. XOR Data
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=np.float32)
y = np.array([[0], [1], [1], [0]], dtype=np.float32)

# 2. Build the Model (2 Hidden Neurons)
model = tf.keras.Sequential([
    tf.keras.layers.Dense(2, activation='leaky_relu',
                          kernel_initializer='he_normal',
                          input_shape=(2,), name='hidden_layer'),
    # Output layer (Linear for MSE)
    tf.keras.layers.Dense(1, name='output_layer')
])

print("--- Model Architecture ---")
model.summary()

# 3. Compile
model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=0.05), loss='mse')

# 4. Train
print("Training XOR Neural Network ...")
model.fit(X, y, epochs=100, verbose=0)

# 5. Extract and Print Final Weights
weights = model.get_weights()
W, c, w_out, b = weights
print("\n--- Final Weights (W) ---")
print(W)
print(f"\nHidden Bias (c): {c}")
print(f"\nOutput Weights (w): \n{w_out}")
print(f"Output Bias (b): {b}")

# 6. Predictions
print("\n--- Final Predictions ---")
preds = model.predict(X)
for i in range(len(X)):
    print(f"Input: {X[i]} | Raw Output: {preds[i][0]:.4f} | Rounded: {int(np.round(preds[i][0]))}")
Output:
--- Model Architecture ---
Model: "sequential_19"
┌──────────────────────────────────┬────────────────────────┬───────────────┐
│ Layer (type)                     │ Output Shape           │       Param # │
├──────────────────────────────────┼────────────────────────┼───────────────┤
│ hidden_layer (Dense)             │ (None, 2)              │             6 │
├──────────────────────────────────┼────────────────────────┼───────────────┤
│ output_layer (Dense)             │ (None, 1)              │             3 │
└──────────────────────────────────┴────────────────────────┴───────────────┘
Total params: 9 (36.00 B)
Trainable params: 9 (36.00 B)
Non-trainable params: 0 (0.00 B)
Training XOR Neural Network ...
--- Final Weights (W) ---
[[-1.1186169 1.8888004]
[ 1.0687382 -1.8048155]]
Hidden Bias (c): [-0.20543203 -0.18170778]
Output Weights (w):
[[1.3961141]
[0.7466789]]
Output Bias (b): [0.0938225]
--- Final Predictions ---
1/1 ━━━━━━━━━━━━━━━━━━━━ 0s 105ms/step
Input: [0. 0.] | Raw Output: 0.0093 | Rounded: 0
Input: [0. 1.] | Raw Output: 1.0024 | Rounded: 1
Input: [1. 0.] | Raw Output: 0.9988 | Rounded: 1
Input: [1. 1.] | Raw Output: 0.0079 | Rounded: 0
3 - Activation Functions
Real-world data (images, speech, text, financial trends) is rarely linear.
Non-linearity allows the network to learn and represent complex mappings between inputs and outputs.
- It enables the network to become a "Universal Function Approximator".
A neural network with the following properties can approximate any continuous function:
- at least one hidden layer
- nonlinear activation
🎯 This theorem is the mathematical reason why neural networks are so powerful.
A deep neural network without any non-linear activation collapses into a single linear layer.
Say f(x) = ax and g(x) = bx.
Then g(f(x)) = g(ax) = (ba)x = cx,
where c = ba (another constant).
Effectively, the composition of the two linear functions, g(f(x)), can be represented by a single linear function h(x) = cx.
❌ So depth becomes useless.
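The same collapse can be seen with matrices. Below is a minimal NumPy sketch (added for illustration, with arbitrary example weights) showing that two stacked linear layers are equivalent to one:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=(4, 3))     # a batch of 4 inputs with 3 features

# Two "linear layers" with no activation in between
W1 = rng.normal(size=(3, 5))
W2 = rng.normal(size=(5, 2))

two_layer = (x @ W1) @ W2       # depth-2 network, no non-linearity
one_layer = x @ (W1 @ W2)       # a single equivalent linear layer

print(np.allclose(two_layer, one_layer))  # True
```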
- Sigmoid
- Tanh (Hyperbolic Tangent)
- ReLU (Rectified Linear Unit)
- Leaky ReLU
- Softmax
A mathematical function with a characteristic “S”-shaped curve (sigmoid curve) that maps any real-valued number into a range between 0 and 1.
\[\sigma(x) = \frac{1}{1 + e^{-x}}\]Usage:
Mostly used in binary classification output layers.
Issue:
Suffers from vanishing gradient (gradients become near-zero for high or low input values), which slows down training.
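As a rough numerical illustration of this saturation issue (an added sketch, not from the original text), the sigmoid gradient \(\sigma(x)(1-\sigma(x))\) becomes tiny for large |x|:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_grad(x):
    s = sigmoid(x)
    return s * (1.0 - s)   # derivative of the sigmoid

for x in [-10.0, -2.0, 0.0, 2.0, 10.0]:
    print(f"x={x:+.1f}  sigmoid={sigmoid(x):.5f}  gradient={sigmoid_grad(x):.6f}")
# The gradient peaks at 0.25 (at x = 0) and is ~0.000045 at |x| = 10,
# which is why saturated sigmoid units learn very slowly.
```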


A mathematical function with an S-shaped curve that maps any real-valued number into a range between -1 and 1.
\[\tanh(x) = \frac{e^{x} - e^{-x}}{e^{x} + e^{-x}}\]
It is zero-centered, making it more effective than the sigmoid function for hidden layers in neural networks.
Benefit:
Makes optimization faster as the data is zero-centered.
Issue:
TanH also suffers from vanishing gradient (gradients become near-zero for high or low input values), which slows down training.
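A quick numerical comparison (an added sketch) of how tanh, unlike the sigmoid, produces zero-centered outputs:

```python
import numpy as np

x = np.linspace(-3, 3, 7)          # a symmetric sample of inputs
sigmoid = 1.0 / (1.0 + np.exp(-x))
tanh = np.tanh(x)

print("mean of sigmoid outputs:", sigmoid.mean())  # ~0.5 (not zero-centered)
print("mean of tanh outputs:   ", tanh.mean())     # ~0.0 (zero-centered)
```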


A mathematical function that outputs the input value directly if it is positive, and zero otherwise.
\[f(x) = \max(0, x)\]
Computationally simple; very fast to compute; does not saturate in the positive direction.
Benefit:
It is computationally efficient and helps mitigate the 'vanishing gradient' problem, making it the most popular choice for hidden layers.
Issue:
'Dying ReLU' problem: negative inputs result in a zero gradient, meaning the neuron stops learning (no weight updates).
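A small sketch (added illustration) of ReLU and its gradient, showing why a neuron stuck in the negative region stops learning:

```python
import numpy as np

def relu(x):
    return np.maximum(0.0, x)

def relu_grad(x):
    # Gradient is 1 for positive inputs and 0 for negative inputs
    return (x > 0).astype(float)

x = np.array([-3.0, -0.5, 0.5, 3.0])
print("ReLU(x): ", relu(x))       # [0.  0.  0.5 3. ]
print("gradient:", relu_grad(x))  # [0. 0. 1. 1.] -> zero gradient = no weight update
```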


Instead of setting negative input values to zero like a standard ReLU, Leaky ReLU allows a small, non-zero gradient (slope) for negative values.
This ensures that neurons continue learning (even for negative inputs).
\[f(x) = \begin{cases} x & \text{if } x > 0 \\ \alpha x & \text{otherwise} \end{cases}\]where '\(\alpha\)' is a small constant (e.g., 0.01)
Benefit:
Fixes the 'dying ReLU' problem.
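A minimal sketch (added here for illustration) comparing Leaky ReLU with standard ReLU for negative inputs:

```python
import numpy as np

ALPHA = 0.01  # small slope for negative inputs

def leaky_relu(x, alpha=ALPHA):
    return np.where(x > 0, x, alpha * x)

def leaky_relu_grad(x, alpha=ALPHA):
    # Gradient is 1 for positive inputs and alpha (not 0) for negative inputs
    return np.where(x > 0, 1.0, alpha)

x = np.array([-3.0, -0.5, 0.5, 3.0])
print("LeakyReLU(x):", leaky_relu(x))       # [-0.03  -0.005  0.5  3. ]
print("gradient:    ", leaky_relu_grad(x))  # [0.01 0.01 1.   1.  ] -> never exactly zero
```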

🖼️ Leaky ReLU derivative
Multivariate activation function that takes a vector of raw scores (logits) and converts them into a probability distribution; the probabilities sum to 1.
\[\sigma(\mathbf{z})_i = \frac{e^{z_i}}{\sum_{j=1}^K e^{z_j}}\]where 'K' = number of classes
Usage:
Almost exclusively used in the output layer of multi-class classification networks.

Example:
Consider an AI model reading a product review to categorize the customer’s mood into three classes:
Positive, Neutral, or Negative.
| Class | Score (Logit) | Softmax |
|---|---|---|
| Positive | 3 | 82.1% |
| Neutral | 1 | 11.1% |
| Negative | 0.5 | 6.7% |
For example, the Positive probability is \(\frac{e^{3}}{e^{3} + e^{1} + e^{0.5}} \approx 0.821\); the Neutral and Negative probabilities are calculated similarly.
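The numbers in the table can be reproduced with a few lines of NumPy (an added sketch using a numerically stable softmax):

```python
import numpy as np

def softmax(z):
    z = np.asarray(z, dtype=float)
    e = np.exp(z - z.max())       # subtract max for numerical stability
    return e / e.sum()

logits = np.array([3.0, 1.0, 0.5])  # Positive, Neutral, Negative
probs = softmax(logits)
for label, p in zip(["Positive", "Neutral", "Negative"], probs):
    print(f"{label:8s}: {p:.1%}")
# Positive: 82.1%, Neutral: 11.1%, Negative: 6.7% -- sums to 100%
```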
4 - Optimization Methods
The loss function surface in deep learning is non-convex, i.e., it has multiple local minima, saddle points, and plateaus rather than a single global minimum.
So, in the context of neural network training, we usually do not care about finding the exact (global) minimum of a function, but seek only to reduce its value sufficiently to obtain good generalization error.
🖼️ Non-Convex Loss Surface Examples


Because of the non-convex loss surface, convergence to a good minimum is often slow, for multiple reasons:
- Multiple local minima; the optimizer may not land in a good enough local minimum.
- Saddle points; near a saddle point, the optimizer barely moves.
- Presence of flat regions (plateaus), where the gradient is near zero, offering minimal guidance for the optimizer.
- “Ravine-like” structures (steep on one side, flat on the other), where stochastic gradient descent oscillates uncontrollably.
- Different parameters require different learning rates; e.g., sparse parameters get very few updates.
Training deep neural networks is inherently complex because of the multiple layers and the vast number of parameters to be updated during training.
Therefore, we need to find ways to accelerate the optimization process.
The optimization process can be accelerated considerably by using stochastic gradient descent (instead of simple gradient descent), i.e., following the gradient of randomly selected mini-batches downhill.
\[w_{new} = w_{old} - \eta \cdot \text{(average gradient of a randomly chosen mini-batch of 'm' data points)}\]where, \(\eta\) = learning rate
In practice, it is common to decay the learning rate linearly until some pre-defined fixed number of iterations ‘\(\tau\)’.
The primary reason for this approach is to start with a high learning rate to rapidly traverse the loss landscape and escape poor local minima, while later using a small learning rate to fine-tune the parameters and settle into a deeper, more stable minimum without oscillating around it.
\[\eta_k = (1-\alpha)\eta_0 + \alpha\eta_{\tau}\]where, \(\alpha = \frac{k}{\tau}\)
After '\(\tau\)' iterations, keep the learning rate constant at \(\eta_{\tau}\).
e.g., \(\eta_0 = 0.1,~ \eta_{\tau}=0.01, \text{ and } \tau=100\)
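A short sketch (added illustration) of this linear decay schedule, using the example values above:

```python
def learning_rate(k, eta_0=0.1, eta_tau=0.01, tau=100):
    """Linear learning-rate decay: eta_k = (1 - alpha) * eta_0 + alpha * eta_tau."""
    if k >= tau:
        return eta_tau           # after tau iterations, keep the rate constant
    alpha = k / tau
    return (1 - alpha) * eta_0 + alpha * eta_tau

for k in [0, 25, 50, 100, 500]:
    print(f"iteration {k:3d}: eta = {learning_rate(k):.4f}")
# iteration   0: eta = 0.1000
# iteration  25: eta = 0.0775
# iteration  50: eta = 0.0550
# iteration 100: eta = 0.0100
# iteration 500: eta = 0.0100
```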
Say we have 'n' samples, and we divide them into mini-batches such that each mini-batch has 'm' < 'n' samples.
- 1 iteration = weight update after computing the gradient of 1 mini-batch
- 1 epoch = one complete pass through the entire training dataset = n/m iterations
- L epochs = L x (n/m) iterations
Note:
- The size 'm' of a mini-batch is decided based on the available computing resources (RAM, GPU, TPU, etc.); e.g., an NVIDIA H100 GPU has 80 GB of memory.
- In practice, the mini-batch size is chosen to be the largest possible power of 2 that fits within the available GPU memory while still allowing for good model performance.
- Samples in the mini-batches are randomized in every epoch.
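For example (an illustrative calculation with assumed numbers), with \(n = 50{,}000\) training samples and a mini-batch size of \(m = 100\):
\[\text{iterations per epoch} = \frac{n}{m} = \frac{50{,}000}{100} = 500, \qquad 10 \text{ epochs} = 10 \times 500 = 5{,}000 \text{ iterations}\]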
Methods to accelerate the optimization process in deep learning:
- Momentum Based; Polyak (1964) | Refined for Deep Learning: Sutskever et al. (2013)
- AdaGrad (Adaptive Gradient); Duchi, Hazan, and Singer (2011)
- RMSProp (Root Mean Square Propagation); Geoffrey Hinton (2012)
- Adam (Adaptive Moment Estimation); Kingma and Ba (2014)
💡 Momentum introduces velocity.
(term borrowed from physics, where momentum = mass × velocity)
It 'accumulates' velocity in directions of consistent gradients and cancels out directions that fluctuate.
Algorithm
- For each iteration (t):
- Instead of moving purely by gradient: \[w_{t+1} = w_{t} - \eta . g_t\]
- Accumulate previous gradients, i.e., the velocity (speed + direction): \[ v_{t} = \gamma . v_{t-1} + \eta. g_t\]
- where, \( \gamma \) = momentum coefficient (typically 0.9)
- Update parameter: \[ w_{t+1} = w_{t} - v_{t} \]
The size of each step depends on how large a sequence of gradients is and how well aligned they are.
\[ \begin{aligned} \text{Let, } v_0 &= 0 \\ v_1 &= \gamma. v_0 + \eta.g_0 = \eta.g_0\\ v_2 &= \gamma. v_1 + \eta.g_1 = \gamma (\eta.g_0) + \eta.g_1 \\ v_3 &= \gamma. v_2 + \eta.g_2 = \gamma (\gamma (\eta.g_0) + \eta.g_1 ) + \eta.g_2 = \eta(\gamma^2 g_0 + \gamma g_1 + g_2)\\ v_{k} &= \eta(\gamma^{k-1} g_0 + \gamma^{k-2} g_1 + \dots g_{k-1})\\ \end{aligned} \]If many successive gradients point in exactly the same direction, then we want to take larger steps.
\[ \lim_{k\rightarrow \infty} v_k = \eta.g(1+\gamma+ \gamma^2 + \dots \infty) \]The term inside the bracket, is a geometric progression with the common ratio \(\gamma < 1\).
So, if the momentum algorithm always observes gradient βgβ, then it will accelerate in the direction of βgβ, until reaching a terminal velocity where the size of each step is:
\[ \frac{\eta. \lVert g \rVert}{1-\gamma} \]where, \(0 < \gamma < 1\)
Say \(\gamma = 0.9\); this means the maximum step size is 10 times larger than for plain gradient descent (since \(1/(1-\gamma) = 10\)).
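A compact sketch (added illustration on a toy "ravine-like" quadratic loss, with assumed hyperparameters) of the momentum update compared to plain gradient descent:

```python
import numpy as np

def grad(w):
    # Gradient of a toy ravine-like quadratic loss: 0.5 * (100*w0^2 + w1^2)
    return np.array([100.0 * w[0], w[1]])

eta, gamma, steps = 0.009, 0.9, 50

# Plain gradient descent
w = np.array([1.0, 1.0])
for _ in range(steps):
    w = w - eta * grad(w)

# Gradient descent with momentum: v_t = gamma*v_{t-1} + eta*g_t ; w_{t+1} = w_t - v_t
w_m, v = np.array([1.0, 1.0]), np.zeros(2)
for _ in range(steps):
    g = grad(w_m)
    v = gamma * v + eta * g
    w_m = w_m - v

print("plain GD, w1 =", w[1])     # the flat direction decays slowly (~0.64 after 50 steps)
print("momentum, w1 =", w_m[1])   # much closer to the optimum at 0
```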
🖼️ Momentum-Based Optimizer vs SGD

Limitations
- Momentum can be like a heavy ball rolling down a hill; it gathers so much speed that it may overshoot the minimum.
- It does not adjust the learning rate based on the importance of specific features.
💡 Scales the learning rate for each parameter based on the historical sum of squares of its gradients.
Problem
In many datasets, some features are frequent while others are sparse.
e.g., predicting house prices based on a rare feature, such as the presence of a shopping mall nearby.
For most houses, the value of that feature is 0.
A single learning rate '\(\eta\)' for all parameters is inefficient.
We want larger updates for sparse features and smaller updates for frequent ones.
Algorithm
- For each iteration (t):
- Calculate gradient \(g_t\).
- Accumulate gradients: \[ r_{t} = r_{t-1} + g_t \odot g_t\]
- Update parameter: \[ w_{t+1} = w_{t} - \frac{\eta}{\sqrt{r_t} + \delta} \odot g_t \]
- where, \(\delta\) is a small smoothing term (e.g., \(10^{-8}\)) to avoid division by 0.
- if \(g = \begin{bmatrix} g_1 \\ g_2 \\ \vdots \\ g_d \end{bmatrix} \), then \( g \odot g = \begin{bmatrix} g_1^2 \\ g_2^2 \\ \vdots \\ g_d^2 \end{bmatrix} \) (element-wise product)
Since \(r_{t} = r_{t-1} + g_t \odot g_t\), and sparse features hardly receive any gradient signal ('g' is mostly 0),
the accumulation 'r' stays very small for them.
Since \(w_{t+1} = w_{t} - \frac{\eta}{\sqrt{r_t} + \delta} \odot g_t\), the effective learning rate is inversely proportional to \(\sqrt{r_t}\).
Therefore, sparse features get larger updates, whereas frequently-updated weights build up very large accumulations, so their effective learning rate keeps decaying.
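A minimal per-parameter AdaGrad update in NumPy (an added sketch with assumed toy gradients), showing how a sparse parameter keeps a larger effective learning rate:

```python
import numpy as np

eta, delta = 0.1, 1e-8
w = np.zeros(2)          # w[0]: frequent feature, w[1]: sparse feature
r = np.zeros(2)          # accumulated sum of squared gradients

for t in range(100):
    # Toy gradients: the frequent feature gets a gradient every step,
    # the sparse feature only once every 10 steps.
    g = np.array([1.0, 1.0 if t % 10 == 0 else 0.0])
    r += g * g                                   # r_t = r_{t-1} + g (element-wise) g
    w -= eta / (np.sqrt(r) + delta) * g          # per-parameter scaled update

print("effective learning rates:", eta / (np.sqrt(r) + delta))
# The sparse parameter (smaller accumulation r) keeps a larger
# effective learning rate than the frequent one.
```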
Limitation
Vanishing Learning Rate:
Since the accumulation of squared gradients increases monotonically, the effective learning rate keeps shrinking until it becomes infinitesimally small, effectively 'killing' the learning process before the model converges.