Deep Learning

Deep Learning - from Neural Networks, Activation Functions, to Back Propagation

1: Introduction
2: XOR Problem
3: Activation Function
4: Optimization Method
5: Regularization
6: Batch Normalization
7: Back Propagation

Playlist Deep Learning Fundamentals | Full Course

1 - Introduction

Introduction to Deep Learning

📘 Deep learning is a subset of AI and machine learning that uses multi-layered artificial neural networks to simulate human-like learning, analyzing vast amounts of data to identify complex patterns, such as recognizing objects in photos, detecting medical anomalies, or processing natural language (as large language models, LLMs, do).

images/deep_learning/fundamentals/intro_to_dl/dl_ai.png

💡 The ‘deep’ in ‘deep learning’ stands for the idea of successive layers of representations.

🐋 It is “deep” because it uses many layers (often hundreds) to automatically extract, transform, and map data features into predictions, surpassing traditional machine learning in handling unstructured data.

A deep neural network for digit classification

images/deep_learning/fundamentals/intro_to_dl/digit_classification.png

📘 Deep learning is a multistage way to learn data representations.

💡 It’s a simple idea - but, as it turns out, very simple mechanisms, sufficiently scaled, can end up looking like magic.

A fully connected neural network

images/deep_learning/fundamentals/intro_to_dl/neural_network.png

What makes deep learning different ?

Feature Engineering: Deep learning completely automates what used to be the most crucial step in a machine learning workflow, making problem-solving much easier.

2 - XOR Problem

XOR Problem - Why Linear/Logistic Regression Can’t Solve it ?

Before we dive into the XOR problem, let's get familiar with a few terms and concepts first.

Perceptron

The simplest form of an artificial neural network: a single-layer binary classifier that assigns input data to one of two groups.

It serves as a mathematical model of a biological neuron: it receives multiple signals (inputs), weights their importance, and decides whether to ‘fire’ (output 1) or stay ‘inactive’ (output 0).

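💡 Even logistic regression is a single-neuron network with a sigmoid activation (instead of the step function used in the perceptron).
Like the perceptron, it draws a single linear decision boundary, and no single straight line can separate the XOR outputs in the table below.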

Input A	Input B	Output (A ⊕ B)
0	0	0
0	1	1
1	0	1
1	1	0
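The truth table above cannot be reproduced by a single perceptron, but two layers are enough. The following is a minimal sketch, assuming hand-picked (not learned) weights and a step activation; the helper names `step` and `xor_net` are illustrative only. The hidden layer computes OR and AND, and the output fires when OR is on and AND is off.

```python
import numpy as np

def step(z):
    # Perceptron-style step activation: 1 if z > 0, else 0
    return (z > 0).astype(int)

def xor_net(x):
    # Hidden layer (hand-picked weights): unit 0 acts as OR, unit 1 acts as AND
    W1 = np.array([[1.0, 1.0],
                   [1.0, 1.0]])
    b1 = np.array([-0.5, -1.5])
    h = step(x @ W1 + b1)
    # Output unit: fires when OR is on and AND is off
    w2 = np.array([1.0, -1.0])
    b2 = -0.5
    return step(h @ w2 + b2)

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
print(xor_net(X))  # [0 1 1 0]
```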

3 - Activation Function

Activation Functions - Sigmoid, TanH, ReLU, Softmax

Why do we need activation function?

Activation Function introduces non-linearity, which allows networks to learn complex patterns in the data.

Why is non-linearity important ?

Real-world data (images, speech, text, financial trends) is rarely linear. Non-linearity allows the network to learn and represent complex mappings between inputs and outputs.

It enables the network to become a ‘Universal Function Approximator’.

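A neural network with the following properties can approximate any continuous function on a closed, bounded input range to arbitrary accuracy, given enough hidden units:

at least one hidden layer
nonlinear activation

🎯 This result, the Universal Approximation Theorem, is the mathematical reason why neural networks are so powerful.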

What if we do not use any activation function ?

A deep neural network without any non-linear activation collapses into a single linear layer.

Say f(x) = ax and g(x) = bx,
then g(f(x)) = g(ax) = (ba)x = cx
where c = ba (another constant).

Effectively, the composition g(f(x)) of the two linear functions can be represented by a single linear function h(x) = cx.

❌ So depth becomes useless.
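The same collapse holds for weight matrices, not just scalars. A small numerical check, assuming two bias-free linear layers with arbitrary random weights (the shapes and the seed are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
W1 = rng.normal(size=(4, 3))   # first "layer": 3 inputs -> 4 units
W2 = rng.normal(size=(2, 4))   # second "layer": 4 units -> 2 outputs
x = rng.normal(size=3)

two_layers = W2 @ (W1 @ x)     # g(f(x)) with no activation in between
one_layer = (W2 @ W1) @ x      # single equivalent linear map h(x)

print(np.allclose(two_layers, one_layer))  # True
```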

Sigmoid
Tanh (Hyperbolic Tangent)
ReLU (Rectified Linear Unit)
- Leaky ReLU
Softmax

Sigmoid

A mathematical function with a characteristic “S”-shaped curve (sigmoid curve) that maps any real-valued number into a range between 0 and 1.

\[\sigma(x) = \frac{1}{1 + e^{-x}}\]

Usage:
Mostly used in binary classification output layers.

Issue:
Suffers from vanishing gradients: the gradient becomes near-zero for large positive or large negative inputs, which slows down training.

images/deep_learning/fundamentals/activation_function/sigmoid.png

images/deep_learning/fundamentals/activation_function/sigmoid_derivative.png
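To make the vanishing-gradient issue concrete, here is a short sketch that evaluates the sigmoid and its derivative, \(\sigma'(x) = \sigma(x)(1 - \sigma(x))\). The function names are illustrative. The derivative peaks at 0.25 for x = 0 and is already around 4.5e-5 at x = ±10, so a product of such factors across many layers shrinks quickly.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_grad(x):
    # Derivative of the sigmoid: s * (1 - s)
    s = sigmoid(x)
    return s * (1.0 - s)

for x in [-10.0, 0.0, 10.0]:
    print(x, sigmoid(x), sigmoid_grad(x))
```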

Read more about Differentiation

Tanh (Hyperbolic Tangent)

A mathematical function with an S-shaped curve that maps any real-valued number into a range between -1 and 1.
It is zero-centered, making it more effective than the sigmoid function for hidden layers in neural networks.

\[\tanh(x) = 2\sigma(2x) - 1 = \frac{e^x - e^{-x}}{e^x + e^{-x}}\]

Benefit:
Makes optimization faster because its outputs are zero-centered.

Issue:
Tanh also suffers from vanishing gradients (the gradient becomes near-zero for large positive or large negative inputs), which slows down training.

images/deep_learning/fundamentals/activation_function/tanh.png

images/deep_learning/fundamentals/activation_function/tanh_derivative.png
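The identity \(\tanh(x) = 2\sigma(2x) - 1\) can be checked numerically. A minimal sketch, assuming NumPy and an illustrative `sigmoid` helper; it also evaluates the derivative \(1 - \tanh^2(x)\), which peaks at 1 for x = 0 and falls toward zero for large |x|.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

x = np.linspace(-5, 5, 11)            # includes x = 0
lhs = np.tanh(x)
rhs = 2.0 * sigmoid(2.0 * x) - 1.0
print(np.allclose(lhs, rhs))          # True

tanh_grad = 1.0 - np.tanh(x) ** 2     # derivative of tanh
print(tanh_grad.max(), tanh_grad.min())
```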

ReLU (Rectified Linear Unit)

A mathematical function that outputs the input value directly if it is positive, and zero otherwise.
It is computationally simple, very fast to compute, and does not saturate in the positive direction.

\[ReLU(x) = \text{max}(0,x)\]

Benefit:
It is computationally efficient and helps mitigate the ‘vanishing gradient’ problem, making it the most popular choice for hidden layers.

Issue:
‘Dying ReLU’ problem: negative inputs result in a zero gradient, so a neuron whose input stays negative for all training examples stops learning (no weight updates).

images/deep_learning/fundamentals/activation_function/relu.png

images/deep_learning/fundamentals/activation_function/relu_derivative.png

Leaky ReLU

Instead of setting negative input values to zero like a standard ReLU, Leaky ReLU allows a small, non-zero gradient (slope) for negative values.
This ensures that neurons continue learning (even for negative values).

\[\text{Leaky ReLU}(x) = \max(\alpha x, x)\]

where ‘\(\alpha\)’ is a small constant (e.g., 0.01)

Benefit:
Fixes the ‘dying ReLU’ problem.

images/deep_learning/fundamentals/activation_function/leaky_relu.png

images/deep_learning/fundamentals/activation_function/leaky_relu_derivative.png
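A side-by-side sketch of ReLU and Leaky ReLU with their gradients shows the difference for negative inputs. The function names are illustrative, and the gradient at exactly x = 0 is set by convention here (0 for ReLU, \(\alpha\) for Leaky ReLU). The `max(alpha * x, x)` form assumes 0 < alpha < 1.

```python
import numpy as np

def relu(x):
    return np.maximum(0.0, x)

def relu_grad(x):
    # 1 for positive inputs, 0 otherwise (no learning signal for x <= 0)
    return (x > 0).astype(float)

def leaky_relu(x, alpha=0.01):
    # Valid for 0 < alpha < 1
    return np.maximum(alpha * x, x)

def leaky_relu_grad(x, alpha=0.01):
    # 1 for positive inputs, alpha otherwise (small but non-zero)
    return np.where(x > 0, 1.0, alpha)

x = np.array([-2.0, -0.5, 0.5, 2.0])
print(relu(x), relu_grad(x))
print(leaky_relu(x), leaky_relu_grad(x))
```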

Softmax

A multivariate activation function that takes a vector of raw scores (logits) and converts them into a probability distribution, i.e., the probabilities sum to 1.

\[\sigma(\mathbf{z})_i = \frac{e^{z_i}}{\sum_{j=1}^K e^{z_j}}\]

where ‘K’ = number of classes

Winner Takes Most
The exponential function \(e^{x}\) grows rapidly, so a small lead in raw score (logit) results in a disproportionately large share of the final probability.
The winner takes the majority share, but non-winners still retain a small, non-zero probability.

Role of Temperature (\(T\))
The “sharpness” of the distribution is controlled by a temperature parameter \(T\):

\[ \sigma (z)_{i}=\frac{e^{z_{i}/T}}{\sum _{j=1}^{K}e^{z_{j}/T}}\]

High Temperature (\(T \to \infty\)): The output becomes a uniform distribution, where all classes have nearly equal probability regardless of their input scores.
Low Temperature (\(T \to 0\)): The output becomes a “hard” max (one-hot vector), where the highest score gets a probability of \(1\) and all others \(0\).
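The effect of temperature can be seen with a few lines of NumPy. This is a sketch only: the `softmax` helper and the chosen logits and temperatures are illustrative, and subtracting the maximum before exponentiating is a standard numerical-stability trick that does not change the result.

```python
import numpy as np

def softmax(z, T=1.0):
    z = np.asarray(z, dtype=float) / T
    z = z - z.max()          # numerical stability; the output is unchanged
    e = np.exp(z)
    return e / e.sum()

logits = [3.0, 1.0, 0.5]
for T in [0.1, 1.0, 10.0]:
    # Low T sharpens the distribution, high T flattens it
    print(T, np.round(softmax(logits, T), 3))
```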

Gradient of Softmax

\[\frac{\partial \sigma_i}{\partial z_j} = \begin{cases} \sigma_i(1 - \sigma_i) & \text{if } i = j \\ -\sigma_i\sigma_j & \text{if } i \neq j \end{cases} \]

Combining both cases:

\[\frac{\partial \sigma_i}{\partial z_j} = \sigma_i (\delta_{ij} - \sigma_j)\]

where, \(\delta_{ij}\) is the Kronecker delta (1 if \(i=j\) (diagonal), 0 otherwise).
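The combined formula maps directly onto a matrix expression, \(\mathrm{diag}(\sigma) - \sigma\sigma^{T}\). A minimal sketch with a finite-difference check of one column; the helper names, the logits, and the step size `eps` are illustrative assumptions.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - np.max(z))
    return e / e.sum()

def softmax_jacobian(z):
    s = softmax(z)
    # J[i, j] = s_i * (delta_ij - s_j)
    return np.diag(s) - np.outer(s, s)

z = np.array([3.0, 1.0, 0.5])
J = softmax_jacobian(z)

# Finite-difference check of column j = 0 (derivatives w.r.t. z_0)
eps = 1e-6
dz = np.zeros_like(z)
dz[0] = eps
numeric = (softmax(z + dz) - softmax(z - dz)) / (2 * eps)
print(np.allclose(J[:, 0], numeric, atol=1e-6))
```

Each column of the Jacobian sums to zero, because the outputs always sum to 1 and so their changes must cancel.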

Usage:

Almost exclusively used in the output layer of multi-class classification networks.
Attention score calculation.

Softmax Graph

images/deep_learning/fundamentals/activation_function/softmax.png

Example
Consider an AI model reading a product review to categorize the customer’s mood into three classes: Positive, Neutral, or Negative.

Class	Score (Logit)	Softmax
Positive	3	82.1%
Neutral	1	11.1%
Negative	0.5	6.7%

\[\sigma(\mathbf{z})_i = \frac{e^{z_i}}{\sum_{j=1}^K e^{z_j}}\]

\[ \begin{aligned} \text{softmax(positive)} &= \frac{e^{3}}{e^3 + e^1 + e^{0.5}} \\[10pt] &= \frac{20.085}{20.085 + 2.718 + 1.649} \\[10pt] &= \frac{20.085}{24.452} \\[10pt] &= 0.821 \text{ or } 82.1\% \end{aligned} \]

Similarly, you can calculate for negative and neutral sentiments.
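All three table values can be reproduced in a few lines. A small sketch using the logits from the example; the variable names are illustrative.

```python
import numpy as np

logits = {"Positive": 3.0, "Neutral": 1.0, "Negative": 0.5}
scores = np.array(list(logits.values()))
probs = np.exp(scores) / np.exp(scores).sum()

for name, p in zip(logits, probs):
    print(f"{name}: {p:.1%}")
# Positive: 82.1%
# Neutral: 11.1%
# Negative: 6.7%
```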

Video Activation Function | Why Non Linearity is important in Deep Learning ? | Sigmoid | ReLU | Softmax

4 - Optimization Method

Optimization Methods - Momentum, AdaGrad, RMSProp & Adam

The loss function surface in deep learning is non-convex, i.e., it has multiple local minima, saddle points, and plateaus rather than a single global minimum.
So, when training neural networks, we usually do not care about finding the exact (global) minimum of the loss function; we only seek to reduce its value enough to obtain good generalization error.

Non-Convex Loss Surface Examples

images/deep_learning/fundamentals/optimization_method/non_convex_loss_surface_1.png

images/deep_learning/fundamentals/optimization_method/non_convex_loss_surface_2.png

5 - Regularization

Regularization Methods - Early Stopping, Dropout & L2 Regularization

In deep learning, before thinking about regularization, we first make sure that the model is able to overfit the training data, and only then take steps to prevent overfitting.
Being able to overfit the training data shows that training itself works and the model is not under-fitting, i.e.:

No coding error.
Layers are all connected and have sufficient capacity to learn the complexity in the data.
Weight initialization and gradient descent settings are reasonable, so that training converges to a local minimum.
Model is trained for enough iterations/epochs.

Note: Overfitting on training data => very low training loss.

Once we have made sure that the model can overfit, we test its performance on a separate validation dataset. If the performance on the validation set is poor, this implies that either:

Training and validation data distributions are different, or
The model is overfitting the training data.

How to prevent overfitting ?

Data Augmentation
The best way to make a machine learning model generalize better is to train it on more data.
Of course, in practice, the amount of data we have is limited.
One way to get around this problem is to create fake data from the existing data and add it to the training set.
Regularization

L2 Regularization (Weight Decay)
Early Stopping
Dropout

L2 Regularization (Weight Decay)

Adds a penalty term proportional to the squared magnitude of the weights to the loss function.

It prevents overfitting by forcing weight values to be small, encouraging a smoother, simpler model that generalizes better to new data.
Weights ‘decay’ toward zero at every step, which is why it’s often called ‘Weight Decay’.

\[ \underset{w}{\mathrm{min}}\ J_{reg}(w) = \underset{w}{\mathrm{min}} \left[ J(w) + \lambda \sum_{j=1}^n \Vert w_j \Vert_2^2 \right] \]

where \(\lambda\) controls the strength of the penalty.
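The ‘decay’ becomes visible when the penalty is plugged into a plain gradient descent step. A short derivation, assuming scalar weights \(w_j\) (so \(\Vert w_j \Vert_2^2 = w_j^2\)) and a learning rate \(\eta\):

\[ \frac{\partial J_{reg}}{\partial w_j} = \frac{\partial J}{\partial w_j} + 2\lambda w_j \]

\[ w_j \leftarrow w_j - \eta \left( \frac{\partial J}{\partial w_j} + 2\lambda w_j \right) = (1 - 2\eta\lambda)\, w_j - \eta \frac{\partial J}{\partial w_j} \]

So at every step each weight is first shrunk by the factor \((1 - 2\eta\lambda)\) and then updated with the usual gradient of the unregularized loss.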

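Note: Optimizers such as AdamW apply weight decay directly in the weight update, decoupled from the loss gradient. For adaptive optimizers like Adam this is not exactly the same as adding an L2 penalty to the loss, although both keep weights small and help prevent overfitting.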

Early Stopping

Early stopping is a ‘free’ regularization technique that relies on monitoring the model’s performance on a separate validation set during training.

As training progresses, the error on both training and validation sets usually decreases.
However, at some point, the model begins to ‘memorize’ the training data.
While the training error continues to drop, the validation error starts to rise.
Early stopping halts training once the validation error has stopped improving, and keeps the weights from the epoch where it was lowest.
- e.g. if validation loss does not improve for 5 epochs, stop training.
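The ‘patience’ rule can be sketched without any framework. This is a simplified illustration, not the Keras implementation: the function name `early_stopping_epoch` and the list of validation losses are hypothetical.

```python
def early_stopping_epoch(val_losses, patience=3):
    """Return (stop_epoch, best_epoch), both 1-indexed.

    Training stops once `patience` consecutive epochs pass without
    a new lowest validation loss.
    """
    best_loss = float("inf")
    best_epoch = 0
    wait = 0
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best_loss:
            best_loss, best_epoch, wait = loss, epoch, 0
        else:
            wait += 1
            if wait >= patience:
                return epoch, best_epoch
    return len(val_losses), best_epoch

# Hypothetical validation losses: improve for 3 epochs, then get worse
losses = [0.90, 0.80, 0.75, 0.76, 0.78, 0.79, 0.81]
print(early_stopping_epoch(losses, patience=3))  # (6, 3)
```

Training stops at epoch 6, but the weights worth keeping are those from epoch 3.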

Code


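from tensorflow.keras.callbacks import EarlyStopping
from tensorflow.keras.layers import Dense
from tensorflow.keras.models import Sequential

# X_train, y_train, X_val, y_val are assumed to be defined already
# (5 input features per sample, binary labels).

# 1. Define the EarlyStopping callback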
# Monitors 'val_loss' and stops if no improvement for 3 epochs.
# restore_best_weights=True ensures you get the model from the best epoch.
early_stop_callback = EarlyStopping(
    monitor='val_loss', 
    patience=3,
    restore_best_weights=True
)

# 2. Create a simple model
model = Sequential([
    Dense(10, activation='relu', input_shape=(5,), name="hidden_1", kernel_initializer="he_normal"),
    Dense(1, activation='sigmoid', name="output")
])
model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])


# 3. Train the model with the EarlyStopping callback
# The 'callbacks' argument accepts a list of callbacks.
history = model.fit(
    X_train, y_train,
    epochs=100, # Set a large number of epochs
    validation_data=(X_val, y_val),
    callbacks=[early_stop_callback], # Pass the callback here
    verbose=1
)

Output

Epoch 22/100
4/4 ━━━━━━━━━━━━━━━━━━━━ 0s 23ms/step - accuracy: 0.6100 - loss: 0.6458 - val_accuracy: 0.6000 - val_loss: 0.7082
Epoch 23/100
4/4 ━━━━━━━━━━━━━━━━━━━━ 0s 23ms/step - accuracy: 0.6200 - loss: 0.6450 - val_accuracy: 0.6000 - val_loss: 0.7076
Epoch 24/100
4/4 ━━━━━━━━━━━━━━━━━━━━ 0s 23ms/step - accuracy: 0.6200 - loss: 0.6450 - val_accuracy: 0.6000 - val_loss: 0.7070
Epoch 25/100
4/4 ━━━━━━━━━━━━━━━━━━━━ 0s 23ms/step - accuracy: 0.6200 - loss: 0.6449 - val_accuracy: 0.6000 - val_loss: 0.7079
Epoch 26/100
4/4 ━━━━━━━━━━━━━━━━━━━━ 0s 23ms/step - accuracy: 0.6200 - loss: 0.6446 - val_accuracy: 0.6000 - val_loss: 0.7089
Epoch 27/100
4/4 ━━━━━━━━━━━━━━━━━━━━ 0s 24ms/step - accuracy: 0.6200 - loss: 0.6447 - val_accuracy: 0.6000 - val_loss: 0.7094
Training stopped early after 27 epochs.

Ensemble
Ensemble models reduce overfitting by combining the predictions of multiple diverse models, which reduces the overall variance of the final model.

Note: If the variance of each model is \(\sigma^2\) and the models’ errors are independent, then the variance of the averaged prediction of ‘k’ models is \(\frac{\sigma^2}{k}\).
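The \(\frac{\sigma^2}{k}\) figure follows from the variance of an average. A brief derivation, writing \(\hat{y}_i\) (a symbol introduced here for illustration) for the prediction of the i-th model and assuming the predictions are uncorrelated:

\[ \mathrm{Var}\left( \frac{1}{k} \sum_{i=1}^{k} \hat{y}_i \right) = \frac{1}{k^2} \sum_{i=1}^{k} \mathrm{Var}(\hat{y}_i) = \frac{k \sigma^2}{k^2} = \frac{\sigma^2}{k} \]

If the models’ errors are correlated, the reduction is smaller, which is why ensembles benefit from diverse models.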

Problem
Training multiple different deep learning models is costly. At inference time we also need the predictions from all ‘k’ models, which are then averaged, and this may be time-consuming.

Dropout

💡 Dropout provides an inexpensive approximation to training and running an ensemble of models.
It randomly removes non-output neurons, i.e., input or hidden layer neurons, from the network for each mini-batch during training (only for that mini-batch).

Note: Possible subnetworks = \(2^{n}\), where ’n’ is number of neurons in the input and hidden layers.

Research Paper: Improving neural networks by preventing co-adaptation of feature detectors, Hinton et al., 2012, https://arxiv.org/pdf/1207.0580

Let’s understand Dropout using the example below.
We will start with a fully connected neural network and randomly drop out (turn off) neurons.

Fully Connected Neural Network

images/deep_learning/fundamentals/regularization/fcnn.png

Dropout Neurons Randomly (iteration 1)

images/deep_learning/fundamentals/regularization/dropout_1.png

Thinned Network (iteration 1)

images/deep_learning/fundamentals/regularization/thinned_network_1.png

Note: Only the “weights” corresponding to retained neurons will be updated in each iteration (mini-batch).

Dropout Neurons Randomly (iteration 2)

images/deep_learning/fundamentals/regularization/dropout_2.png

Thinned Network (iteration 2)

images/deep_learning/fundamentals/regularization/thinned_network_2.png

Note: Only the “weights” corresponding to retained neurons will be updated in each iteration (mini-batch).

Since not all neurons are present in every iteration, only a subset of the weights is updated each time, which helps prevent overfitting.

We can think of the removal of hidden neurons as adding a form of random noise to the features, and the removal of input neurons as variations of the input.
Both effects help prevent overfitting.

Say the probability of retaining a neuron is
p(hidden neuron) = 0.6 and p(input neuron) = 0.8

For each input neuron, generate a random number \(r_i \in [0,1]\); if \(r_i \le 0.8\), retain the neuron, else drop it. This corresponds to an 80% retention probability.
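The sampling rule above can be sketched with NumPy. The helper name `dropout_mask`, the layer size, and the seed are illustrative assumptions; the retention probability of 0.8 is the one from the text.

```python
import numpy as np

rng = np.random.default_rng(42)

def dropout_mask(n_units, p_keep, rng):
    # r_i ~ Uniform[0, 1); keep unit i when r_i <= p_keep
    r = rng.random(n_units)
    return (r <= p_keep).astype(float)

x = np.ones(10_000)                          # dummy input activations
mask = dropout_mask(x.size, p_keep=0.8, rng=rng)
x_thinned = x * mask                         # dropped units output 0
print(mask.mean())                           # fraction retained, close to 0.8
```

A fresh mask is drawn for every mini-batch, so each iteration trains a different thinned network.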

Co-adaptation in deep learning occurs when neurons become overly dependent on others to correct errors, leading to fragile, overfitted models that perform poorly on new data.

💡 Dropout prevents co-adaptation.

No single neuron can rely on the presence of another specific neuron to correct its errors.
This forces every neuron to learn features independently.

Since each neuron is present in the network with probability ‘p’ during training, its outgoing weights are scaled by the factor ‘p’ at inference time, when all neurons are present, so that the expected input to the next layer matches what it saw during training.

💡 At inference time we scale the weights.

images/deep_learning/fundamentals/regularization/weight_scaling.png

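Note: For deep non-linear networks there is no rigorous theoretical justification for this weight-scaling rule; it is an approximation that works well in practice.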
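For a single linear unit, the scaling rule can be checked numerically: averaging the unit’s input over many random dropout masks gives roughly the same value as keeping all neurons and scaling the weights by ‘p’. A sketch under those assumptions (random activations and weights, 20,000 sampled masks; all names are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
p_keep = 0.6
x = rng.normal(size=50)        # hidden activations feeding one unit
w = rng.normal(size=50)        # outgoing weights to that unit

# Training-time view: average the unit's input over many random dropout masks
masks = rng.random((20_000, x.size)) <= p_keep
train_avg = ((masks * x) @ w).mean()

# Inference-time view: keep every neuron, scale the weights by p_keep
inference = x @ (p_keep * w)

print(train_avg, inference)    # the two values should be close
```

Keras’s `Dropout` layer takes the equivalent ‘inverted dropout’ route: it scales the retained activations by 1 / (1 - rate) during training, so no scaling is needed at inference.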

import tensorflow as tf
from tensorflow.keras import layers, regularizers, models
import numpy as np

# Create some dummy data for demonstration purposes
X_train = np.random.rand(1000, 32)
y_train = np.random.rand(1000, 1)
X_val = np.random.rand(200, 32)
y_val = np.random.rand(200, 1)

# Define the L2 regularization strength (e.g., 0.0001)
l2_strength = 1e-4

# Create a Sequential model with L2 regularization and Dropout layers
model = models.Sequential([
    # Add a Dense layer with L2 regularization
    layers.Dense(128, activation='relu',
                 kernel_regularizer=regularizers.l2(l2_strength),
                 input_shape=(32,)),
    
    # Add a Dropout layer with a dropout rate of 30%
    layers.Dropout(0.3),
    
    # Another Dense layer with L2 regularization
    layers.Dense(64, activation='relu',
                 kernel_regularizer=regularizers.l2(l2_strength)),
    
    # Another Dropout layer with a dropout rate of 20%
    layers.Dropout(0.2),
    
    # Output layer
    layers.Dense(1, activation='linear')
])

# Compile the model
model.compile(optimizer='adam',
              loss='mse', # Using Mean Squared Error loss for a regression example
              metrics=['mae']) # Mean Absolute Error as a metric

# Display the model summary
model.summary()

# Train the model (optional, for a complete example)
# history = model.fit(X_train, y_train, epochs=10, validation_data=(X_val, y_val), verbose=1)

Output

------Model Architecture-------
Model: "sequential_21"
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━┓
┃ Layer (type)                    ┃ Output Shape           ┃       Param # ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━┩
│ dense_3 (Dense)                 │ (None, 128)            │         4,224 │
├─────────────────────────────────┼────────────────────────┼───────────────┤
│ dropout_2 (Dropout)             │ (None, 128)            │             0 │
├─────────────────────────────────┼────────────────────────┼───────────────┤
│ dense_4 (Dense)                 │ (None, 64)             │         8,256 │
├─────────────────────────────────┼────────────────────────┼───────────────┤
│ dropout_3 (Dropout)             │ (None, 64)             │             0 │
├─────────────────────────────────┼────────────────────────┼───────────────┤
│ dense_5 (Dense)                 │ (None, 1)              │            65 │
└─────────────────────────────────┴────────────────────────┴───────────────┘
 Total params: 12,545 (49.00 KB)
 Trainable params: 12,545 (49.00 KB)
 Non-trainable params: 0 (0.00 B)

Video Regularization in Deep Learning | Dropout | Early Stopping | L2 Regularization | Explained with Code

6 - Batch Normalization

Batch Normalization

Problem
In deep neural networks, as the weights are updated during training, the distribution of the inputs to each hidden layer changes continuously. This is called ‘internal covariate shift’ (ICS).
This change forces layers to constantly adapt to new input distributions, which:

slows down training,
hinders convergence, and
makes hyper-parameter tuning difficult

A deep neural network for digit classification

images/deep_learning/fundamentals/batch_normalization/digit_classification.png

Batch Normalization is a technique to control the variation in the features, such that, they do not vary too much and are bounded (by normalizing the inputs to each layer).

\[ \hat{x}_i = \frac{x_i - \mu_{\mathcal{B}}}{\sqrt{\sigma_{\mathcal{B}}^2 + \epsilon}} \]

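where \(\mu_{\mathcal{B}}\) and \(\sigma_{\mathcal{B}}^2\) are the mean and variance of the feature over the current mini-batch \(\mathcal{B}\), and \(\epsilon \approx 10^{-5}\) is a tiny constant to prevent division by zero.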

If we always normalize to mean \(\mu\)=0 and variance \(\sigma^2\)=1, we might restrict the layer too much (e.g., forcing everything into the linear region of a Sigmoid function) and the network might lose representational power.

💡 So BatchNorm introduces learnable parameters:

\[y_i = \gamma \hat{x}_i + \beta\]

\(\gamma\) = scaling parameter
\(\beta\)= shifting parameter

Note: The network can now decide for itself if it wants the mean(\(\mu\)) to be 0 and variance(\(\sigma^2\)) to be 1.
If the optimal state for the network is something else, it can learn the values for ‘\(\gamma\)’ and ‘\(\beta\)’ to undo the normalization.

Inference Time

During training, mean(\(\mu\)) and variance(\(\sigma^2\)) come from current mini-batch.
At inference (test) time, we use frozen running averages of the mean(\(\mu\)) and variance(\(\sigma^2\)) calculated during training.
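The training and inference behaviour can be sketched in NumPy. This is a simplified illustration, not the Keras layer: the function names, the `momentum` value of 0.9 for the running averages, and the toy data are assumptions.

```python
import numpy as np

def batchnorm_train(x, gamma, beta, running_mean, running_var,
                    momentum=0.9, eps=1e-5):
    mu = x.mean(axis=0)                    # per-feature mini-batch mean
    var = x.var(axis=0)                    # per-feature mini-batch variance
    x_hat = (x - mu) / np.sqrt(var + eps)  # normalize
    # Exponential moving averages, kept for use at inference time
    running_mean = momentum * running_mean + (1 - momentum) * mu
    running_var = momentum * running_var + (1 - momentum) * var
    return gamma * x_hat + beta, running_mean, running_var

def batchnorm_infer(x, gamma, beta, running_mean, running_var, eps=1e-5):
    # Uses the frozen running statistics instead of batch statistics
    x_hat = (x - running_mean) / np.sqrt(running_var + eps)
    return gamma * x_hat + beta

rng = np.random.default_rng(0)
x = rng.normal(loc=5.0, scale=3.0, size=(64, 4))   # 64 samples, 4 features
gamma, beta = np.ones(4), np.zeros(4)
y, rm, rv = batchnorm_train(x, gamma, beta, np.zeros(4), np.ones(4))
print(y.mean(axis=0).round(6), y.std(axis=0).round(3))
```

With \(\gamma = 1\) and \(\beta = 0\), each output feature has mean close to 0 and standard deviation close to 1, regardless of the input scale.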

Research Paper: Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift, Ioffe & Szegedy, 2015, https://arxiv.org/pdf/1502.03167

Benefits:

Mitigates changing distributions (internal covariate shift).
Helps prevent vanishing/exploding gradients.
- Allows for higher learning rates.
Smooths the optimization landscape.
Acts as a regularizer to reduce overfitting.
- Because BN calculates the mean(\(\mu\)) and variance(\(\sigma^2\)) for each mini-batch, these statistics vary slightly across different batches. This randomness introduces a small amount of noise into the activations, which acts as a regularizer, similar to dropout.

import tensorflow as tf
from tensorflow.keras import layers, models, regularizers
import numpy as np

# 1. Setup Synthetic Data (Binary Classification)
# 1000 samples, 20 features per sample
X = np.random.rand(1000, 20).astype(np.float32)
y = np.random.randint(2, size=(1000, 1)).astype(np.float32)

# 2. Define the Sequential Model
model = models.Sequential([
    # Input Layer
    layers.Input(shape=(20,)),

    # Hidden Layer 1: Dense + L2 Regularization
    layers.Dense(64, kernel_regularizer=regularizers.l2(0.01), name="dense_1"),
    layers.BatchNormalization(name="batch_norm_1"), # Normalizes activations
    layers.Activation('relu'),
    layers.Dropout(0.3, name="dropout_1"),          # Prevents overfitting

    # Hidden Layer 2
    layers.Dense(32, kernel_regularizer=regularizers.l2(0.01), name="dense_2"),
    layers.BatchNormalization(name="batch_norm_2"),
    layers.Activation('relu'),
    layers.Dropout(0.2, name="dropout_2"),

    # Output Layer (Sigmoid for binary probability)
    layers.Dense(1, activation='sigmoid', name="output_layer")
])

# 3. Compile the Model
model.compile(
    optimizer=tf.keras.optimizers.Adam(learning_rate=0.001),
    loss='binary_crossentropy',
    metrics=['accuracy']
)

# 4. Print Architecture & Parameters
print("--- Model Architecture ---")
model.summary()

Output

--- Model Architecture ---
Model: "sequential_23"
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━┓
┃ Layer (type)                    ┃ Output Shape           ┃       Param # ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━┩
│ dense_1 (Dense)                 │ (None, 64)             │         1,344 │
├─────────────────────────────────┼────────────────────────┼───────────────┤
│ batch_norm_1                    │ (None, 64)             │           256 │
│ (BatchNormalization)            │                        │               │
├─────────────────────────────────┼────────────────────────┼───────────────┤
│ activation_2 (Activation)       │ (None, 64)             │             0 │
├─────────────────────────────────┼────────────────────────┼───────────────┤
│ dropout_1 (Dropout)             │ (None, 64)             │             0 │
├─────────────────────────────────┼────────────────────────┼───────────────┤
│ dense_2 (Dense)                 │ (None, 32)             │         2,080 │
├─────────────────────────────────┼────────────────────────┼───────────────┤
│ batch_norm_2                    │ (None, 32)             │           128 │
│ (BatchNormalization)            │                        │               │
├─────────────────────────────────┼────────────────────────┼───────────────┤
│ activation_3 (Activation)       │ (None, 32)             │             0 │
├─────────────────────────────────┼────────────────────────┼───────────────┤
│ dropout_2 (Dropout)             │ (None, 32)             │             0 │
├─────────────────────────────────┼────────────────────────┼───────────────┤
│ output_layer (Dense)            │ (None, 1)              │            33 │
└─────────────────────────────────┴────────────────────────┴───────────────┘
 Total params: 3,841 (15.00 KB)
 Trainable params: 3,649 (14.25 KB)
 Non-trainable params: 192 (768.00 B)
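The parameter counts in the summary can be reproduced by hand. Each BatchNormalization layer holds four values per feature: \(\gamma\) and \(\beta\) (trainable) plus the moving mean and moving variance (non-trainable). A small sketch with illustrative helper names:

```python
def dense_params(n_in, n_out):
    return n_in * n_out + n_out          # weights + biases

def batchnorm_params(n_features):
    trainable = 2 * n_features           # gamma and beta
    non_trainable = 2 * n_features       # moving mean and moving variance
    return trainable, non_trainable

bn1 = batchnorm_params(64)
bn2 = batchnorm_params(32)
trainable = (dense_params(20, 64) + bn1[0]
             + dense_params(64, 32) + bn2[0]
             + dense_params(32, 1))
non_trainable = bn1[1] + bn2[1]
print(trainable, non_trainable, trainable + non_trainable)  # 3649 192 3841
```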

Video Batch Normalization | Deep Learning | Internal Covariate Shift | Detailed Explanation

7 - Back Propagation

Back Propagation

Model Training
Training a model means repeatedly updating the weights, which requires calculating the gradient \(\frac{\partial{J(w)}}{\partial{w_{old}}}\) at each step.
🎯 Goal: Minimize the cost function J(w).

We use gradient descent to find the optimum weights.

Gradient Descent

\[ w_{new} = w_{old} - \eta \frac{\partial{J(w)}}{\partial{w_{old}}} \]
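The update rule can be tried on a toy problem. A minimal sketch, assuming a one-dimensional cost \(J(w) = (w - 3)^2\) chosen purely for illustration, with its minimum at w = 3; the starting point and learning rate are arbitrary.

```python
def grad_J(w):
    # Gradient of the toy cost J(w) = (w - 3)^2
    return 2.0 * (w - 3.0)

w = 0.0          # initial weight
eta = 0.1        # learning rate
for _ in range(100):
    w = w - eta * grad_J(w)   # w_new = w_old - eta * dJ/dw

print(round(w, 4))  # 3.0
```

In a real network the gradient is not available in closed form like this; back propagation is the procedure that computes it for every weight.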