Natural Language Processing

Natural Language Processing (NLP) - from Tokenization, Embeddings, Transformers to LLMs

1: Introduction
2: Text Pre-Processing
3: Tokenization
4: Text Embedding
5: RNN
6: LSTM
7: GRU
8: Attention
9: Self Attention
10: Transformer
11: LLM
12: BERT

Playlist Natural Language Processing | Full Course

1 - Introduction

Introduction to Natural Language Processing

Natural Language Processing (NLP) is a subfield of artificial intelligence that enables computers to understand, interpret, and generate human language.

e.g., Question Answering, Summarization, Dialogue Generation (Chatbot), Sentiment Analysis, Spam Classification, etc.

❌ Human language is ambiguous; meaning of the sentence is determined by its context.

He sat by the bank.
- ‘bank’ may mean a river or a financial institution.
I saw the man with the telescope.
- the man was with a telescope, or
- he saw a man using a telescope.
The chicken is ready to eat.
- chicken(dead) is cooked and ready to eat, or
- the chicken(alive) is hungry and ready to eat.

💡 Therefore, in NLP, it’s very important to capture the context, so that the exact meaning of the words in the sentence is understood.

Before we dive into how to capture the context, we need to convert human language into a form that machines can understand.
We know that deep learning models take vectors/matrices as input, which are collections of numbers.
So, first we need to convert the words into vectors for training the models.

2 - Text Pre-Processing

Text Pre-Processing - Cleaning, Stemming & Lemmatization

❌ Raw text is messy and inconsistent.

e.g., “WOw!!! The new iphone 18 pro is SOOO good! I luvv it… best phone ever? Check it out at someCoolSite.com #tech #apple”

So, first we need to do some pre-processing of this messy text, before it can be used for model training.
There are 2 main steps:

Cleaning: Removing punctuation, lowercasing, stop word removal and stripping special characters.
Stemming/Lemmatization: Reducing words to their root form.

Removing punctuation, lowercasing, stop word removal and stripping special characters.

Input: “👋 Hello, together we will learn NLP (Natural Language Processing)!!!”
Output: “hello learn nlp natural language processing”
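The cleaning step above can be sketched in a few lines of Python. This is a minimal illustration, not a production cleaner: the `STOP_WORDS` set and the `clean_text` name are assumptions made up for this example, and the stop-word list is deliberately tiny so that it reproduces the output shown above.

```python
import re

# Assumption: a tiny, hand-picked stop-word list chosen only so that this
# example reproduces the output above; real lists are much longer.
STOP_WORDS = {"together", "we", "will", "the", "is", "a", "an", "to"}

def clean_text(text: str) -> str:
    """Lowercase, strip non-letter characters, and drop stop words."""
    text = text.lower()
    # Replace everything that is not a lowercase letter or whitespace
    # (punctuation, digits, emoji, invisible joiners) with a space.
    text = re.sub(r"[^a-z\s]", " ", text)
    tokens = [tok for tok in text.split() if tok not in STOP_WORDS]
    return " ".join(tokens)

cleaned = clean_text("👋 Hello, together we will learn NLP (Natural Language Processing)!!!")
print(cleaned)
```

Note that this sketch also drops digits, which may not be desirable for text such as product names.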

Reduce words to their root form, by chopping off suffixes, often resulting in non-dictionary roots; very fast.

Input: “Running was considered better than going to gym.”
Porter Stemmer Output: [“run”, “wa”, “consid”, “better”, “than”, “go”, “to”, “gym”]

Reduce words to their root form, i.e, dictionary base form (lemma).

Input: “Running was considered better than going to gym.”
WordNet Lemmatizer Output: [“run”, “be”, “consider”, “good”, “than”, “go”, “to”, “gym”]
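To make the contrast concrete, here is a toy sketch of both ideas. It is not the Porter algorithm or the WordNet lemmatizer: `crude_stem` uses an assumed three-suffix rule set and `LEMMAS` is a hand-written lookup table standing in for a real dictionary. The point is only that suffix chopping is cheap but can yield non-words, while lemmatization needs dictionary knowledge.

```python
# Assumption: a toy rule set, far simpler than the real Porter rules.
SUFFIXES = ("ing", "ed", "s")

def crude_stem(word: str) -> str:
    """Chop the first matching suffix, keeping at least two characters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 2:
            return word[: -len(suffix)]
    return word

# Assumption: a hand-written lemma table standing in for a real dictionary.
LEMMAS = {"running": "run", "was": "be", "considered": "consider",
          "better": "good", "going": "go"}

def lemmatize(word: str) -> str:
    """Look the word up in the table; unknown words are returned unchanged."""
    return LEMMAS.get(word, word)

words = "running was considered better than going to gym".split()
stems = [crude_stem(w) for w in words]
lemmas = [lemmatize(w) for w in words]
print(stems)
print(lemmas)
```

The toy stemmer turns "running" into the non-word "runn" and "was" into "wa", which illustrates why stemmed output is often not a dictionary word.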

When to use Lemmatization or Stemming:

✅ Lemmatization: Accuracy; e.g., chatbots
✅ Stemming: Speed; e.g., searching massive datasets

Video Text Pre Processing for LLM | Text Cleaning | Stemming | Lemmatization

3 - Tokenization

Tokenization - Byte Pair Encoding, WordPiece & SentencePiece

Tokenization is the preprocessing step of breaking down raw text (sentences) into smaller units called tokens, such as words, characters, or subwords.

Text is split by whitespace or punctuation.
Each unique word gets its own ‘ID’ in the vocabulary.

e.g., “The player is playing”, becomes:
[“The”, “player”, “is”, “playing”]

Issues

Out-Of-Vocabulary (OOV): If the model has seen only “player” in the training set and at runtime it sees “players” or “playful”, then it maps them to a special unknown token (e.g., <UNK>).
Vocabulary Explosion: If we use full words, the vocabulary becomes very large, because every inflected form needs its own entry.
No Shared Meaning: The model treats “running” and “run” as completely unrelated symbols, even though they share the same root.
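The OOV issue is easy to reproduce with a small sketch. The token name `<UNK>` and the helper names `build_vocab` and `encode` are assumptions for illustration only.

```python
UNK = "<UNK>"  # assumption: name of the unknown-word token

def build_vocab(sentences):
    """Assign an integer ID to every unique whitespace-separated word."""
    vocab = {UNK: 0}
    for sentence in sentences:
        for word in sentence.split():
            vocab.setdefault(word, len(vocab))
    return vocab

def encode(sentence, vocab):
    """Map each word to its ID, falling back to the <UNK> ID."""
    return [vocab.get(word, vocab[UNK]) for word in sentence.split()]

vocab = build_vocab(["The player is playing"])
ids = encode("The players is playful", vocab)
print(vocab)
print(ids)
```

Both "players" and "playful" collapse to the same ID 0, so the model cannot tell them apart or relate them to "player".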

Text is split into individual letters or symbols.

e.g., “Apple”, becomes:
[“A”, “p”, “p”, “l”, “e”]

Issues

Loss of Semantics: Individual characters like “a” or “p” carry no inherent meaning.
- The model has to work much harder to learn that the sequence “A-p-p-l-e” represents a fruit.
Long Sequences: A single sentence becomes a massive string of tokens.
Computationally Expensive: Longer sequences require more memory and processing power to calculate relationships between tokens.

A fixed-size vocabulary that can represent any word by breaking it into meaningful sub-units.

e.g., “unhappiness”, becomes:
[“un”, “happi”, “ness”]

Benefits

Small vocabulary size.
No OOV (everything decomposable).
Better generalization.

Comparison: Word-Level, Character-Level & Sub-Word Tokenization

Feature	Word-Level	Char-Level	Sub-Word
Vocab Size	Very Large	Small	Medium (Fixed)
OOV Problem	Severe	None	None
Meaning (Per Token)	High	Low	High (Morpheme-based)
Sequence Length	Small	Very High	Balanced

Originally a data compression algorithm, Byte Pair Encoding (BPE) was adapted for NLP to iteratively merge the most frequent adjacent pairs of characters into new tokens.

BPE Algorithm

Initialization: Prepare a corpus and define a target vocabulary size. Break every word into individual characters (plus a special end-of-word symbol, </w>).
Frequency Count: Count all adjacent pairs of symbols in the corpus.
Merge: Identify the pair with the highest frequency and merge it into a single new symbol.
Repeat: Update the corpus with the new symbol and repeat until the vocabulary reaches the target size or no more merges are possible.

Example

--- Initial Word Corpus ---
Word                  Count
------------------  -------
l o w </w>                5
l o w e r </w>            2
n e w e s t </w>          6
w i d e s t </w>          3
h i g h e s t </w>        2

 Iteration 0
| Token   |   Frequency |
|---------+-------------|
| e       |          19 |
| </w>    |          18 |
| w       |          16 |
| s       |          11 |
| t       |          11 |
| l       |           7 |
| o       |           7 |
| n       |           6 |
| i       |           5 |
| h       |           4 |
Action: Merge ('e', 's')

Iteration 1
| Token   |   Frequency |
|---------+-------------|
| </w>    |          18 |
| w       |          16 |
| es      |          11 |
| t       |          11 |
| e       |           8 |
| l       |           7 |
| o       |           7 |
| n       |           6 |
| i       |           5 |
| h       |           4 |
Action: Merge ('es', 't')

Iteration 2
| Token   |   Frequency |
|---------+-------------|
| </w>    |          18 |
| w       |          16 |
| est     |          11 |
| e       |           8 |
| l       |           7 |
| o       |           7 |
| n       |           6 |
| i       |           5 |
| h       |           4 |
| d       |           3 |
Action: Merge ('est', '</w>')

Iteration 3
| Token   |   Frequency |
|---------+-------------|
| w       |          16 |
| est</w> |          11 |
| e       |           8 |
| </w>    |           7 |
| l       |           7 |
| o       |           7 |
| n       |           6 |
| i       |           5 |
| h       |           4 |
| d       |           3 |
Action: Merge ('l', 'o')

Iteration 4
| Token   |   Frequency |
|---------+-------------|
| w       |          16 |
| est</w> |          11 |
| e       |           8 |
| </w>    |           7 |
| lo      |           7 |
| n       |           6 |
| i       |           5 |
| h       |           4 |
| d       |           3 |
| g       |           2 |
Action: Merge ('lo', 'w')

Iteration 5
| Token   |   Frequency |
|---------+-------------|
| est</w> |          11 |
| w       |           9 |
| e       |           8 |
| </w>    |           7 |
| low     |           7 |
| n       |           6 |
| i       |           5 |
| h       |           4 |
| d       |           3 |
| g       |           2 |

OOV Handling Example
Original OOV: 'lowest'
BPE Segmented: ['low', 'est</w>']
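The walkthrough above can be reproduced with a short sketch. The function names are assumptions, and ties between equally frequent pairs are broken by taking the first pair encountered, which happens to match the merge order shown in the example; other implementations may break ties differently.

```python
from collections import Counter

def get_pair_counts(corpus):
    """Count adjacent symbol pairs, weighted by word frequency."""
    pairs = Counter()
    for symbols, freq in corpus.items():
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return pairs

def merge_pair(symbols, pair):
    """Replace every occurrence of `pair` in `symbols` by the merged symbol."""
    merged, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            merged.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            merged.append(symbols[i])
            i += 1
    return tuple(merged)

def train_bpe(word_counts, num_merges):
    """Learn up to `num_merges` merge rules from a word-frequency table."""
    corpus = {tuple(word) + ("</w>",): c for word, c in word_counts.items()}
    merges = []
    for _ in range(num_merges):
        pairs = get_pair_counts(corpus)
        if not pairs:
            break
        best = max(pairs, key=pairs.get)  # ties: first pair encountered wins
        merges.append(best)
        corpus = {merge_pair(s, best): c for s, c in corpus.items()}
    return merges

def segment(word, merges):
    """Apply the learned merges, in order, to a (possibly unseen) word."""
    symbols = tuple(word) + ("</w>",)
    for pair in merges:
        symbols = merge_pair(symbols, pair)
    return list(symbols)

word_counts = {"low": 5, "lower": 2, "newest": 6, "widest": 3, "highest": 2}
merges = train_bpe(word_counts, num_merges=5)
print(merges)
print(segment("lowest", merges))
```

Here the stopping rule is a fixed number of merges rather than a target vocabulary size; the two are equivalent up to the size of the initial character set.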

OpenAI Tokenizer

images/natural_language_processing/tokenization/tokens.png

images/natural_language_processing/tokenization/token_ids.png

Source: https://platform.openai.com/tokenizer

WordPiece merges the pair that maximizes the likelihood of the training data.
It chooses the pair \((s_1, s_2)\) that maximizes:

\[\text{Score} = \frac{P(s_1, s_2)}{P(s_1)P(s_2)}\]

\(P(s_1, s_2)\): probability of the pair appearing together.
\(P(s_1)P(s_2)\): probability of the pair appearing independently.

e.g., is “unfriendly” in the vocabulary? No.
Are “un” + “##friend” + “##ly” available? Yes.
Here, “##” marks a token that continues the previous token.

Note: WordPiece is used in BERT.
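At tokenization time, BERT-style WordPiece splits a word greedily, always taking the longest vocabulary entry that matches at the current position. The sketch below illustrates that lookup for the “unfriendly” example; the three-entry vocabulary and the function name are assumptions, and the training-time scoring above is not implemented here.

```python
def wordpiece_tokenize(word, vocab, unk="[UNK]"):
    """Greedy longest-match-first split of a single word into sub-words."""
    tokens, start = [], 0
    while start < len(word):
        end = len(word)
        piece = None
        # Shrink the candidate from the right until it is in the vocabulary.
        while end > start:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation marker
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            return [unk]  # no valid split: the whole word becomes unknown
        tokens.append(piece)
        start = end
    return tokens

vocab = {"un", "##friend", "##ly"}  # assumption: toy vocabulary
print(wordpiece_tokenize("unfriendly", vocab))
```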

SentencePiece is not a new algorithm but a language-independent subword tokenization framework that treats the input as a raw stream of characters, including spaces.

Key Innovations:

Space as a Symbol:
- It treats whitespace as a normal character (represented by the meta symbol “▁”, U+2581, which looks like an underscore).
- This makes it “lossless”; you can reconstruct the exact original string from the tokens.
Algorithm Agnostic:
- It can implement both BPE and Unigram Language Model (a probabilistic approach that removes tokens that least impact the overall likelihood of the corpus).
No Pre-tokenization:
- Unlike BPE/WordPiece, which often require a preliminary step to split text by whitespace, SentencePiece works directly on raw Unicode strings.
- This is vital for languages like Chinese or Japanese that do not use spaces between words.

Video What are Tokens in LLM ? | How tokenization works ? | Byte Pair Encoding | Detailed Explanation

4 - Text Embedding

Text Embeddings - OHE, BoW, TF-IDF, Word2Vec, GloVe & FastText

Process of converting raw text into numerical vectors that machines can understand.

We will discuss the following ways to represent text as vectors (embeddings):

One Hot Encoding (OHE) (discrete)
Bag of Words (BoW), TF-IDF (statistical)
Word2Vec, GloVe, FastText (distributed)

Requirements of Good Text Representation(Embeddings)

Capture meaning/similarity (semantics)
Capture context
Compact

Every word in the vocabulary ‘V’ is assigned a unique index.

e.g.

\[ \mathbf{v_{aardvark}} = \begin{pmatrix} 1 \\ 0 \\ 0 \\ \vdots \\ 0 \end{pmatrix}, \mathbf{v_{abacus}} = \begin{pmatrix} 0 \\ 1 \\ 0 \\ \vdots \\ 0 \end{pmatrix}, \dots , \mathbf{v_{zyzzyva}} = \begin{pmatrix} 0 \\ 0 \\ 0 \\ \vdots \\ 1 \end{pmatrix} \]

Limitations

High dimensional (curse of dimensionality)
Sparse vectors
No semantics is captured
- e.g., “love” vs “like” → completely different vectors
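A small sketch makes the "no semantics" limitation visible: the dot product between any two different one-hot vectors is 0, no matter how related the words are. The three-word vocabulary is an assumption for illustration.

```python
def one_hot(word, vocab):
    """Return a vector of zeros with a single 1 at the word's index."""
    vec = [0] * len(vocab)
    vec[vocab.index(word)] = 1
    return vec

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

vocab = ["like", "love", "hate"]  # assumption: toy vocabulary
love, like, hate = (one_hot(w, vocab) for w in ("love", "like", "hate"))

print(dot(love, like))  # similarity between related words
print(dot(love, hate))  # similarity between opposite words
```

Both similarities are 0, so the encoding cannot express that "love" is closer to "like" than to "hate".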

Represents a document by the frequency of words within it, disregarding grammar and word order.

Given a sequence of words in a document, \(D: (w_1, w_2, \dots , w_T)\).
Document vector, \(\mathbf{v}_D = \sum_{i=1}^T \mathbf{v}_{w_i}\), i.e., the sum of the one-hot vectors of its words.

e.g. Document, D = “We learn NLP. For NLP we need to learn BoW.”

Vocabulary = [“We”, “Learn”, “NLP”, “For”, “Need”, “To”, “BOW”]
Vector = [ 2, 2, 2, 1, 1, 1, 1]
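The counts above can be reproduced with a short sketch. The tokenization rule (lowercase, letters only) and the function name are assumptions; the vocabulary is ordered by first appearance.

```python
import re
from collections import Counter

def bag_of_words(document):
    """Return (vocabulary, count vector) for a single document."""
    tokens = re.findall(r"[a-z]+", document.lower())
    counts = Counter(tokens)
    vocabulary = list(counts)  # keys keep the order of first appearance
    vector = [counts[word] for word in vocabulary]
    return vocabulary, vector

vocabulary, vector = bag_of_words("We learn NLP. For NLP we need to learn BoW.")
print(vocabulary)
print(vector)
```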

Limitations

Stop Word Problem: Common words like “the” or “is” appear frequently but carry little information, often drowning out the meaningful terms.
Sparse Vector.
Semantics not captured.
Word ordering is not captured.
- e.g. \( \mathbf{v_{\text{man kills lion}}} = \begin{pmatrix} 0 \\ \vdots \\ 1 \\ 0 \\ \vdots \\ 1 \\ 0 \\ \vdots \\ 0 \\ 1 \\ \vdots \\ \end{pmatrix} = \mathbf{v_{\text{lion kills man}}} \)

To solve the stop-word problem, TF-IDF penalizes words that appear across almost all documents.

Term Frequency (TF): Measures how frequently a term ‘t’ occurs in a document ‘d’.
- \(TF(t, d) = \frac{\text{Number of times } t \text{ appears in } d}{\text{Total number of words in } d}\)
Inverse Document Frequency (IDF): Measures how important a term is across the entire corpus ‘D’.
- \(IDF(t, D) = \log\left(\frac{\text{Total number of documents } N}{\text{Number of documents containing term } t}\right)\)
\(\text{TF-IDF}(t, d, D) = TF(t, d) \times IDF(t, D)\)

If a word appears in every document (like “the,” “is,” “data”), it isn’t a good discriminator.

Note: The log function “dampens” the effect of very high frequencies.

TF-IDF Score Meaning

images/natural_language_processing/text_representation/tfidf_meaning.png

The minimum TF-IDF value is 0. This occurs when a term appears in all documents (IDF = 0) or is not present in the document at all (TF = 0).
No fixed upper bound for unnormalized TF-IDF; Max value depends on the corpus size (\(log(\text{Number of Documents})\)) and how rarely a word appears.

Let’s understand the TF-IDF score better with a very simple example.

Document 1: “I love coffee.”
Document 2: “I love tea.”
Vocabulary: {‘I’, ’love’, ‘coffee’, ’tea’}
Output Matrix:

	I	Love	Coffee	Tea
Doc 1	0	0	0.1	0
Doc 2	0	0	0	0.1
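The matrix above can be checked with a few lines of Python. The sketch assumes a base-10 logarithm, since that is what reproduces the value 0.1 for “coffee” and “tea” (\(\frac{1}{3} \times \log_{10} 2 \approx 0.1\)); the source formula does not state the base, and the function names are illustrative.

```python
import math

def tf(term, doc):
    """Term frequency: share of the document's tokens equal to `term`."""
    return doc.count(term) / len(doc)

def idf(term, corpus):
    """Inverse document frequency (assumption: base-10 logarithm)."""
    containing = sum(1 for doc in corpus if term in doc)
    return math.log10(len(corpus) / containing)

def tf_idf(term, doc, corpus):
    return tf(term, doc) * idf(term, corpus)

corpus = [["i", "love", "coffee"], ["i", "love", "tea"]]
vocab = ["i", "love", "coffee", "tea"]
matrix = [[round(tf_idf(t, doc, corpus), 1) for t in vocab] for doc in corpus]
for row in matrix:
    print(row)
```

“I” and “love” occur in both documents, so their IDF is \(\log(2/2) = 0\) and they contribute nothing to either row.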

Limitations

Semantics not captured, e.g., car ≠ vehicle.
Word ordering is not captured.
Sparse Vector.
- Vector size = vocabulary size;
- For a vocabulary of 50,000 words, every document is a 50,000-dimensional vector consisting mostly of zeros.
No context awareness, e.g., “bank” (river) vs “bank” (finance).
Corpus dependency.
- Changing corpus changes representation.
Poor generalization.
- New/unseen words → no representation

Note: TF-IDF answers → “Which words are important in this document?” , but NOT “What does this word mean?”.

So far, with OHE, BoW, and TF-IDF, each word has been mapped to exactly one dimension of the vector, so all of its semantic information sits in that single dimension.

Wouldn’t it be better if we captured the semantic and syntactic information separately in different dimensions, such that similar words have similar vectors and the representation generalizes well?

e.g. Say, a word like “Apple” is represented by a 300-dimensional vector.

Dimension 1 might represent “fruitiness,”
Dimension 2 “redness,” and
Dimension 3 “technology.”
and so on …

💡 If we are able to represent a word in multiple dimensions then we can compare words in those multiple dimensions on various aspects, such that all the similar words occur together.

Distributed Representation of Words

images/natural_language_processing/text_representation/distributed_representation_1.png

images/natural_language_processing/text_representation/distributed_representation_2.png

Note: We can see that similar words occur together.

Video Word Embeddings (Vector) | OHE | Bag of Words | Term Frequency Inverse Document Frequency (TFIDF)

“You shall know a word by the company it keeps.” ~ J.R. Firth, 1957

Word2Vec captures information about the meaning of the word based on the surrounding words, because meaning comes from context.

Developed at Google (Mikolov et al.), 2013.
Word2Vec moved NLP from “counting” to “predicting”.

Word2Vec uses 2 methods to learn one vector per word appearing in the corpus.

CBOW (Continuous Bag of Words): Predicts a target word based on its context (the surrounding words).
- Best for: Faster training; slightly better accuracy for frequent words; smooths over some noise.
Skip-gram: Predicts the context words given a single target word.
- Best for: Smaller datasets; better at representing rare words.

Word2Vec Architectures

images/natural_language_processing/text_representation/word2vec_example.png

Skip-Gram predicts the surrounding context words given the center or target word.
It operates on the principle that words appearing in similar contexts tend to have similar semantic meanings, making it effective at capturing both semantic and syntactic relationships.

images/natural_language_processing/text_representation/skip_gram_1.png

images/natural_language_processing/text_representation/skip_gram_2.png

\[P(\mathbf{w}_i \text{ is a context word} | \mathbf{w}_t) = \frac{exp(\mathbf{v}^T_{w_t}\mathbf{u}_{w_i})}{\sum_{w \in V} exp(\mathbf{v}^T_{w_t}\mathbf{u}_{w})}\]

If the training pair is (“brown”, “fox”), i.e., center word “brown” and context word “fox”, then the target output of the softmax unit for “fox” is 1 and the target for all other units is 0.

Note: \(\mathbf{v}_{w_t}\) is used as the final dense vector representation for the word \(w_t\).
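The softmax above can be evaluated by hand for a tiny vocabulary. In the sketch below, the 2-dimensional center vector `v_brown` and the context vectors in `U` are made-up numbers, not trained embeddings; the point is only to show how dot products are turned into a probability distribution over the vocabulary.

```python
import math

def softmax_context_probs(v_center, U):
    """P(w is a context word | center) for every word w in the vocabulary."""
    scores = {w: sum(a * b for a, b in zip(v_center, u)) for w, u in U.items()}
    top = max(scores.values())  # subtract the max for numerical stability
    exps = {w: math.exp(s - top) for w, s in scores.items()}
    total = sum(exps.values())
    return {w: e / total for w, e in exps.items()}

# Assumption: toy 2-dimensional vectors, chosen by hand.
v_brown = [1.0, 0.5]
U = {"fox": [2.0, 1.0], "quick": [0.5, 0.0], "potato": [-1.0, -1.0]}

probs = softmax_context_probs(v_brown, U)
for word, p in probs.items():
    print(word, round(p, 3))
```

The denominator loops over every word in the vocabulary, which is exactly the cost that becomes a problem when the vocabulary is large.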

Let’s dive deeper into Skip-Gram architecture and understand the optimization problem.
Cost Function

\[log ~ P(w_{t+j} | w_t) = \mathbf{v}^T_{w_t}\mathbf{u}_{w_{t+j}} - log\sum_{w \in V} exp(\mathbf{v}^T_{w_t}\mathbf{u}_{w}), \quad j = -k, \dots, k; ~ j \ne 0\]

Problem
Here, the computation of probability of context word given center word \(P(w_{t+j} | w_t)\), and its derivative, i.e, \(\frac{\partial {P(w_{t+j} | w_t)}}{\partial {\mathbf{v}_{w_t}}}\) (for gradient descent weight update) is expensive if vocabulary size is large, i.e, \(|V| \gg 0\).

Note: Loss Function = Cross Entropy

Computational Scale Analysis
If our vocabulary has \(10^{5}\) words, every single update for a single word pair requires \(10^{5}\) operations.
Since a typical corpus has billions of tokens, performing \(|V|\) operations per token makes training possible in principle but computationally impractical on standard hardware.

Solution
Negative Sampling
A clever trick !
Instead of summing over the whole vocabulary, the model only updates the “true” context word and a small sample (e.g., 5–20) of random “negative” words.
This turns a massive multiclass classification problem into a series of simple binary logistic regression problems.

Randomly select a small number of ‘K’ non-context words for which softmax unit output is 0.
For every “Positive” pair (words that actually appear together), we generate ‘K’ “negative” pairs (words that are randomly sampled from the dictionary and likely have no relationship).
- Positive Pair (\(w, c\)): “The quick brown fox” \(\rightarrow\)(quick, fox)
- Negative Pairs (\(w, c_{neg}\)): (quick, apple), (quick, potato), (quick, diary)

Benefit: If we have a vocabulary of 10,000 words and use \(K = 5\), we only update weights for 6 output neurons (1 positive + 5 negative) instead of all 10,000.

Cost Function

\[log ~ \hat{P}(w_{t+j} | w_t) = \underbrace{log(\sigma(\mathbf{v}^T_{w_t}\mathbf{u}_{w_{t+j}}))}_{\text{Positive Pair}} + \sum_{k=1}^K \underbrace{log(\sigma(-\mathbf{v}^T_{w_t}\mathbf{u}_{w_k}))}_{\text{Negative Pair}}\]

where \(w_1, \dots, w_K\) are the sampled negative words.

🎯 Goal is to maximize the log likelihood function.

We can see that both terms on the right-hand side of the equality are of the form \(log(\sigma(x))\), where \(x\) is the dot product \(v^Tu\) (negated for the negative pairs).
Since the sigmoid function outputs values in the open interval \((0, 1)\), the range of \(log(\sigma(x))\) is \((-\infty, 0)\).
Therefore, in order to maximize the log likelihood we need to bring each term closer to 0.

Let’s see how the values of \( x, \sigma(x)\), and \(log(\sigma(x))\) vary together:

Positive Pair: as \(x \rightarrow \infty, ~ \sigma(x) \rightarrow 1, \text{ and } log(\sigma(x)) \rightarrow 0\)
Negative Pair: as \(x \rightarrow -\infty, ~ \sigma(-x) \rightarrow 1, \text{ and } log(\sigma(-x)) \rightarrow 0\)

Note: \(\sigma(-x) = 1 - \sigma(x)\)
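A small numeric sketch of the negative-sampling objective, using made-up 2-dimensional vectors (all vectors and function names here are assumptions). It compares a "good" configuration, where the positive pair has a large dot product and the negatives have negative dot products, with the opposite configuration.

```python
import math

def log_sigmoid(x):
    """Numerically stable log(sigma(x))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def neg_sampling_objective(v_center, u_pos, u_negs):
    """log sigma(v.u_pos) + sum over negatives of log sigma(-v.u_neg)."""
    objective = log_sigmoid(dot(v_center, u_pos))
    objective += sum(log_sigmoid(-dot(v_center, u)) for u in u_negs)
    return objective

v = [1.0, 0.0]  # assumption: toy center-word vector
good = neg_sampling_objective(v, [3.0, 0.0], [[-3.0, 0.0], [-2.0, 1.0]])
bad = neg_sampling_objective(v, [-3.0, 0.0], [[3.0, 0.0], [2.0, 1.0]])
print(round(good, 4), round(bad, 4))
```

The objective is always negative and moves toward 0 as the positive pair is pulled together and the negative pairs are pushed apart. Only `1 + K` dot products are needed per training pair, instead of one per vocabulary word.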

Video Word2Vec Embedding | Detailed Explanation with Maths | Skip Gram | CBOW | Negative Sampling

Limitation of Word2Vec
Word2Vec only uses local context. It is excellent at analogies and at capturing semantics, but it ignores the vast amount of statistical information available in the global co-occurrence counts.
TF-IDF, as we saw earlier, does use the statistics of the entire corpus.

GloVe leverages a global co-occurrence matrix, allowing it to better capture how words relate across the entire dataset, which can improve semantic analogies and the representation of relationships between distant words.

Research Paper: GloVe: Global Vectors for Word Representation, Pennington et al., Stanford University, 2014, https://nlp.stanford.edu/pubs/glove.pdf

The GloVe model leverages statistical information by training only on the nonzero elements in a word-word co-occurrence matrix, and uses matrix factorization to generate low-dimensional word representations.

images/natural_language_processing/text_representation/glove.png

\(X_{ij}\) : Number of times word \(c_j\) appears in the context of word \(w_i\)

Word-word matrix or word-context matrix \(X\) is factorized into 2 matrices, such that:

\[X_{ij} \approxeq \mathbf{v}^T_{w_i}\mathbf{u}_{c_j} \text{ or } X_{ij} \approxeq exp(\mathbf{v}^T_{w_i}\mathbf{u}_{c_j})\]

Note: We do an exponentiation because dot product can be negative but the word count is always positive.

We have to find the 2 matrices such that the difference between the dot product and the actual word co-occurrence count is minimum.
This can be formulated as an optimization problem, because we have to find some kind of optimum, in this case minimum value.

Optimization Problem

\[ \underset{\mathbf{v}_w, \mathbf{u}_c}{\mathrm{min}} \lVert \mathbf{V}_w \mathbf{U}_c - \text{log}(\mathbf{X}) \rVert^2_F \]

where, F is Frobenius Norm

Estimate word and context vectors by solving the optimization problem:

\[ \underset{\mathbf{v}_w, \mathbf{u}_c, b, \tilde{b}}{\mathrm{min}} \sum_{i,j=1}^{|V|} f(X_{ij}) ~ ( \mathbf{v}^T_{w_i}\mathbf{u}_{c_j} + b_i + \tilde{b}_j - \text{log}(X_{ij}))^2\]

\[f(x) = \begin{cases} (x/x_{max})^{\alpha} & \text{if } x < x_{max} \\ 1 & otherwise \end{cases} \]

Note: In the paper they empirically found that the most suitable value of \(\alpha = 3/4\).

Let’s understand all the parameters.

\(f(X_{ij})\): Weighting function.
- If \(X_{ij} = 0\): \(f(x) = 0\); Ignore pairs that never co-occur.
- If \(X_{ij}\) is small: \(f(x)\) is also small, but increases as count increases.
- If \(X_{ij}\) is very large (for stop words): \(f(x)\) saturates (usually at 1); prevents common stop-words from over-influencing the vectors.
Bias terms
- \(b_i\): Captures the “inherent frequency” of the target word \(w_i\).
  - If word \(w_i\) is very common (like “the”), \(b_i\) will be large.
- \(\tilde{b}_j\): Captures the “inherent frequency” of the context word \(c_j\).
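The weighting function and a single term of the objective can be sketched as follows. The value `x_max = 100` is an assumption here, since the text above only gives \(\alpha = 3/4\); the function names are illustrative, and this sketch evaluates the loss rather than training anything.

```python
import math

def glove_weight(x, x_max=100.0, alpha=0.75):
    """f(x): grows as (x / x_max)^alpha, then saturates at 1."""
    return (x / x_max) ** alpha if x < x_max else 1.0

def glove_term(dot_product, b_i, b_j, x_ij):
    """One summand: f(X_ij) * (v.u + b_i + b_j - log X_ij)^2."""
    if x_ij == 0:
        return 0.0  # f(0) = 0, so pairs that never co-occur are skipped
    error = dot_product + b_i + b_j - math.log(x_ij)
    return glove_weight(x_ij) * error ** 2

for count in (0, 1, 10, 100, 10_000):
    print(count, round(glove_weight(count), 4))
```

A pair that co-occurs 10,000 times gets the same weight as one that co-occurs 100 times, which is how very frequent pairs are prevented from dominating the loss.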

✅ GloVe can often outperform Word2Vec on smaller datasets because it extracts more statistical signal from every available word pair.
✅ Pre-trained Word2Vec models are often significantly larger (e.g., 3.4GB for Google News) compared to GloVe’s more lightweight variants (e.g., 150MB for Wiki-Gigaword), making GloVe better for resource-constrained applications.
✅ In practice, we can start with GloVe for its stability and availability of pre-trained vectors, then experiment with Word2Vec if the specific syntactic nuances of our dataset are not being captured.

Video Global Vectors (GloVe) Embedding | Detailed Explanation | Matrix Factorization | Word2Vec Vs GloVe

Instead of treating each word as an atomic unit, FastText breaks words into a “bag” of character n-grams, which allows it to capture morphological information and effectively handle rare or out-of-vocabulary (OOV) words.

Developed by Facebook AI Research (FAIR).

e.g. if n = 3, (typically, n = 3, 4, 5, 6):
<where> : { <wh, whe, her, ere, re>, <where>}
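The n-gram extraction in this example can be sketched as below. The function name is an assumption; the word is wrapped in the boundary markers `<` and `>` and the full wrapped word is appended as one extra unit, as in the example above.

```python
def char_ngrams(word, n=3):
    """Character n-grams of '<word>' plus the whole wrapped word."""
    wrapped = f"<{word}>"
    grams = [wrapped[i:i + n] for i in range(len(wrapped) - n + 1)]
    return grams + [wrapped]

print(char_ngrams("where", n=3))
```

An unseen word such as “wherever” shares n-grams like “whe” and “her” with “where”, so a vector can still be assembled for it from its n-grams.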

Note: FastText also uses the Skip-Gram architecture, similar to Word2Vec, so we will not discuss the skip-gram model again and will focus directly on the difference between the FastText and Word2Vec approaches.

5 - RNN

Recurrent Neural Network

An n-gram model is a probabilistic tool used to predict the next item in a sequence by looking at the previous \(n-1\) items.

\[P(w_n| w_1, w_2, \dots, w_{n-1})\]

The “n” in n-gram refers to the number of items in the sequence.

Unigram (\(n=1\)): Single words (e.g., “We”, “learn”, “RNN”).
Bigram (\(n=2\)): Pairs of consecutive words (e.g., “We learn”, “learn RNN”).
Trigram (\(n=3\)): Triplets of consecutive words (e.g., “We learn RNN”)
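The three cases above come from the same sliding-window operation; a minimal sketch (the function name is an assumption):

```python
def ngrams(tokens, n):
    """All runs of n consecutive tokens, as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = ["We", "learn", "RNN"]
print(ngrams(tokens, 1))
print(ngrams(tokens, 2))
print(ngrams(tokens, 3))
```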

Limitation
Natural language text may not be of fixed size.
Every sentence is of different length.

We cannot use n-gram model, because it calculates the probability of a word based on the ‘\(n-1\)’ preceding words, relying only on a fixed, finite context window.

So, we need a model that can process an input of any length and use all the sequence information from the past.

At each step, the RNN maintains a “hidden state” that captures information about the words seen so far.

Recurrent Relation

\[P(x_{t+1} | x_1, x_2, \dots, x_t) \approx P(x_{t+1} | h_{t-1}, x_t)\]

\(h_{t-1}\): Hidden/latent vector that captures all the information seen till time instance ‘\(t-1\)’.

Note: Hidden state can theoretically capture information from the entire history of the sequence.

RNN Architecture

RNN Parameters

Note: Number of parameters is independent of input sequence length, because the same neural network is processing different words at different time stamps.

images/natural_language_processing/rnn/rnn_parameters_2.png

Sequence Information
See below how different words are input at different time stamps and the hidden state stores the sequence information seen till that time stamp.

images/natural_language_processing/rnn/sequence_info.png

Note: The input \(x_1, x_2, \dots, x_T \) are the vector representations of the words or word embeddings, earlier OHE was used, but after 2013, Word2Vec or GloVe etc. are mostly used.

Teacher Forcing
Say we are training the RNN model for text generation or language modeling, i.e, we predict the next word based on all the words seen in the past.

images/natural_language_processing/rnn/teacher_forcing.png

e.g. Let’s assume that we are using the following sentence to train the model:
“This is a cat.”
Now at time ’t’ the model sees the current word (\(w_t\)) “a” and tries to predict the “t+1” or next word.
It generates a probability distribution (softmax output) \(p_{t+1}\) of all the words in the vocabulary.
But the model has just started training and has not learned any sentence patterns yet, so it may predict an arbitrary, meaningless word.
So, during training, instead of feeding the word predicted from \(p_{t+1}\) as the next input \(w_{t+1}\), we feed the actual next word from the training sentence, i.e., “cat”.

This is called Teacher Forcing.

Note: This stops the error (caused by predicting some garbage value during training) from being propagated through the network.

RNN Algorithm

For every sentence ‘i’ in the corpus:
- For every token ‘t’ in the sentence:
  - Compute probability distribution \(p_t\) on the vocabulary, given the previous ‘t-1’ words in the sentence.
  - Compute the loss (cross entropy) between \(p_t\) and \(y_t\) (desired OHE). \[ E_t(\theta) = - \sum_{j=1}^{|V|} y_{t,j} ~ \text{log}(p_{t,j}) \]
  - Compute the average overall training loss for the sentence ‘i’. \[ E_i(\theta) = \frac{1}{T} \sum_{t=1}^{T} E_t(\theta) \]
Compute gradient information using back propagation through time and update model parameters.

Video Recurrent Neural Network (RNN) Architecture | Teacher Forcing | Detailed Explanation with Maths

Calculate the error at each time step and propagate it backward to update RNN weights.

images/natural_language_processing/rnn/bptt.png

Forward Pass

\(a_t = Ux_t + Wh_{t-1} + b\)
\(h_t = \sigma(a_t) = \sigma(Ux_t + Wh_{t-1} + b)\), (activation function = sigmoid, tanh, ReLU etc.)
\(o_t = Vh_t + c\)
\(p_{t+1} = \text{softmax} (o_t)\)
\(E_t = \psi(o_t, y_t)\), (cross-entropy loss)
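The forward pass can be traced with a tiny pure-Python sketch. Everything numeric here is an assumption: a 3-word vocabulary with one-hot inputs, a hidden size of 2, hand-picked weights, and tanh as the activation. The loop feeds the true current word at each step (teacher forcing) and averages the cross-entropy of the next-word prediction.

```python
import math

def matvec(M, v):
    return [sum(m * x for m, x in zip(row, v)) for row in M]

def add(*vectors):
    return [sum(vals) for vals in zip(*vectors)]

def softmax(o):
    top = max(o)
    exps = [math.exp(x - top) for x in o]
    total = sum(exps)
    return [e / total for e in exps]

def rnn_step(x_t, h_prev, U, W, V, b, c):
    """a_t = U x_t + W h_{t-1} + b;  h_t = tanh(a_t);  p = softmax(V h_t + c)."""
    a_t = add(matvec(U, x_t), matvec(W, h_prev), b)
    h_t = [math.tanh(a) for a in a_t]
    o_t = add(matvec(V, h_t), c)
    return h_t, softmax(o_t)

# Assumption: toy, hand-picked parameters (vocabulary size 3, hidden size 2).
U = [[0.5, -0.2, 0.1], [0.3, 0.8, -0.5]]      # hidden x input
W = [[0.1, 0.4], [-0.3, 0.2]]                 # hidden x hidden
V = [[1.0, -1.0], [0.5, 0.5], [-0.7, 0.2]]    # vocabulary x hidden
b, c = [0.0, 0.0], [0.0, 0.0, 0.0]

sequence = [[1, 0, 0], [0, 1, 0], [0, 0, 1]]  # one-hot words w_1, w_2, w_3
h = [0.0, 0.0]
total_loss = 0.0
for x_t, y_next in zip(sequence[:-1], sequence[1:]):
    h, p = rnn_step(x_t, h, U, W, V, b, c)    # the true word x_t is the input
    total_loss += -math.log(p[y_next.index(1)])
avg_loss = total_loss / (len(sequence) - 1)
print(round(avg_loss, 4))
```

The same `U`, `W`, and `V` are reused at every time step, which is why the parameter count does not depend on the sequence length.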

Derivative of the error with respect to a weight:

\[\frac{\partial E}{\partial w } = \sum_{t=1}^T \frac{\partial E_t}{\partial w }\]

We can re-write it using the chain rule of differentiation:

\[\frac{\partial E}{\partial w } = \sum_{t=1}^T \frac{\partial E_t}{\partial h_t } \frac{\partial h_t}{\partial w }\]

Let, \(\delta_t = \frac{\partial E_t}{\partial h_t }\)

Similarly, error at ‘k-th’ hidden state:

\[\frac{\partial E_t}{\partial h_k } = \frac{\partial E_t}{\partial h_t }. \frac{\partial h_{t} } {\partial h_{t-1} }. \frac{\partial h_{t-1} }{\partial h_{t-2} } \dots \frac{\partial h_{k+1} } {\partial h_k }\]

\[\implies \frac{\partial E_t}{\partial h_k } = \delta_t \prod_{i=k+1}^t \frac{\partial h_{i} } {\partial h_{i-1} }\]

Let’s calculate \(\frac{\partial h_{t} } {\partial h_{t-1}}\) now:

\[\frac{\partial h_{t} } {\partial h_{t-1}} = \frac{\partial h_{t} } {\partial a_{t}}. \frac{\partial a_{t} } {\partial h_{t-1}}\]

\[ \because h_t = \sigma(a_t)\]

\[ \tag{1} \implies \frac{\partial h_{t} } {\partial a_{t}} = \sigma'(a_t) \]

\[\because a_t = Ux_t + Wh_{t-1} + b\]

\[ \tag{2} \implies \frac{\partial a_{t} } {\partial h_{t-1}} = W\]

Therefore, combining equations 1 & 2:

\[\therefore \frac{\partial h_{t} } {\partial h_{t-1}} = \frac{\partial h_{t} } {\partial a_{t}}. \frac{\partial a_{t} } {\partial h_{t-1}} = \sigma'(a_t).W = \text{(Activation Derivative).(Weight)}\]

Now, let’s revisit the error at ‘k-th’ hidden state:

\[\frac{\partial E_t}{\partial h_k } = \delta_t \prod_{i=k+1}^t \frac{\partial h_{i} } {\partial h_{i-1} }\]

where, \(\delta_t = \frac{\partial E_t}{\partial h_t }\)

\[\implies \frac{\partial E_t}{\partial h_k } = \delta_t \prod_{i=k+1}^t \sigma'(a_i).W \propto \prod \text{Activation Gradient x Weight}\]

\[\text{We have seen already that: }\frac{\partial E_t}{\partial h_k } \propto \prod \text{Activation Gradient x Weight}\]

So, let’s understand the vanishing gradient problem using an example.
Say, we have a sentence with 10 words.

\(\lVert W \rVert = 0.7 < 1\)
Gradient of activation function (sigmoid or tanh) = 0.2 \[\frac{\partial E_t}{\partial h_k } \propto \prod_{i=1}^{10} \text{Activation Gradient x Weight}\] \[ (0.7 \times 0.2)^{10} = (0.14)^{10} = 2.89 \times 10^{-9} \]

Therefore, the signal dies as it travels back from the 10th word to the first.
The gradient is so small that the weights receive almost no update from the early words, so the model effectively stops learning from the early time steps.
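The arithmetic in this example is easy to check, and printing the intermediate products shows how quickly the factor shrinks. The values 0.7 and 0.2 are the ones assumed in the example above.

```python
weight_norm = 0.7      # ||W|| from the example
activation_grad = 0.2  # activation derivative from the example

# Gradient scaling factor after travelling back k time steps, k = 1..10.
factors = [(weight_norm * activation_grad) ** k for k in range(1, 11)]
for k, factor in enumerate(factors, start=1):
    print(k, f"{factor:.3e}")
```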

Solution

Skip connections
ReLU activation (derivative is 1 for positive inputs)

6 - LSTM

Long Short Term Memory

RNNs are not good at capturing long range dependencies (due to vanishing gradient problem).

images/natural_language_processing/lstm/long_range_dependency.png

By the time the RNN is processing “not entertaining”, it has forgotten the subject, i.e., “Jurassic Park movie”, because the two are too far apart.

LSTM solves vanishing gradient problem by introducing:

Cell State (\(C_t\)): acts as a “long-term” memory conveyor belt that runs parallel to the hidden state.
Hidden State (\(h_t\)): the “short-term” working memory (similar to RNN)

LSTM Memory Cell

LSTM Architecture

Each gate in an LSTM functions as a single-layer feedforward neural network that lives inside the LSTM cell.
Each gate has its own unique, learnable weight matrices (\(W\)) and bias vectors (\(b\)).

Decides what information from the previous cell state (\(C_{t-1}\)) is no longer useful and should be discarded.

\[f_t = \sigma(W_f \cdot [h_{t-1}, x_t] + b_f)\]

\(f_t = 0\): “forget” the previous memory entirely.
\(f_t = 1\): “remember” everything.

This stage decides what new information will be stored in the cell state (\(C_t\)).

Input Gate: Decides which values to update. \[ i_t = \sigma(W_i \cdot [h_{t-1}, x_t] + b_i) \]
Candidates: Creates a vector of new candidate values that could be added to the state. \[\tilde{C}_t = \tanh(W_C \cdot [h_{t-1}, x_t] + b_C)\]
Update the Cell State: This is the “memory update” where the old memory is scaled by the forget gate and the new information is scaled by the input gate. \[C_t = \underbrace{f_t \odot C_{t-1}}_{\text{Filtered\ Old\ Info}} + \underbrace{i_t \odot \tilde{C}_t}_{\text{Filtered\ New\ Info}}\]

Determines the next hidden state (\(h_t\)).

\[o_t = \sigma(W_o \cdot [h_{t-1}, x_t] + b_o) \]

\[h_t = o_t \odot \tanh(C_t)\]

This ‘\(h_t\)’ is used for predictions and as input for the next time step.
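The gate equations can be put together into one time step. The sketch below is a scalar version (hidden size 1, input size 1) so that it needs no matrix library; each gate's parameters are an assumed triple `(w_h, w_x, bias)`, and the numbers are hand-picked so that the forget gate is close to 1 and the input gate close to 0.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def lstm_step(x_t, h_prev, c_prev, params):
    """One LSTM step for scalar state; params[name] = (w_h, w_x, bias)."""
    def gate(name, activation):
        w_h, w_x, bias = params[name]
        return activation(w_h * h_prev + w_x * x_t + bias)

    f_t = gate("f", sigmoid)            # forget gate
    i_t = gate("i", sigmoid)            # input gate
    c_tilde = gate("c", math.tanh)      # candidate values
    o_t = gate("o", sigmoid)            # output gate
    c_t = f_t * c_prev + i_t * c_tilde  # cell state update
    h_t = o_t * math.tanh(c_t)          # next hidden state
    return h_t, c_t

# Assumption: hand-picked parameters, not trained values.
params = {
    "f": (0.0, 0.0, 10.0),   # large bias: forget gate ~ 1 (remember)
    "i": (0.0, 0.0, -10.0),  # very negative bias: input gate ~ 0 (write nothing)
    "c": (0.5, 0.5, 0.0),
    "o": (0.0, 0.0, 10.0),
}

h, c = 0.0, 2.0  # start with some content in the long-term memory
for x in [1.0, -1.0, 0.5]:
    h, c = lstm_step(x, h, c, params)
print(round(h, 4), round(c, 4))
```

With these settings the cell state stays near its initial value of 2.0 regardless of the inputs, which is the "remember everything" behaviour of \(f_t = 1\).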

Sigmoid (Gates): Outputs 0 to 1;
- For “gating” (blocking or passing info).
Tanh (Candidate): Outputs -1 to 1;
- Allows the candidate to either add to the memory (positive values) or subtract/correct the memory (negative values).
Element Wise Product: \[ \mathbf{a} \odot \mathbf{b} = \begin{bmatrix} a_1 \\ a_2 \\ \vdots \\ a_n \end{bmatrix} \odot \begin{bmatrix} b_1 \\ b_2 \\ \vdots \\ b_n \end{bmatrix} = \begin{bmatrix} a_1 b_1 \\ a_2 b_2 \\ \vdots \\ a_n b_n \end{bmatrix} \]

Gates Summary

Component	Role
Forget Gate	Decide which information to delete.
Input Gate	Decide which information to add.
Candidate	Actual new information.
Cell State	Long-term memory; conveyor belt.
Output Gate	Decide the next hidden state (short-term memory).
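The gate equations above can be sketched as a single LSTM time step in plain Python. This is only a minimal illustration: the sizes (hidden size 2, input size 3), the constant weights of 0.1 and the names `lstm_step`, `affine` and `params` are assumptions made for the example, not part of the original material.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def affine(W, v, b):
    """Return W·v + b for a matrix W (list of rows), a vector v and a bias b."""
    return [sum(w * x for w, x in zip(row, v)) + b_i for row, b_i in zip(W, b)]

def lstm_step(x_t, h_prev, c_prev, p):
    """One LSTM time step; p holds the weights W_f, W_i, W_C, W_o and their biases."""
    concat = h_prev + x_t  # [h_{t-1}, x_t] as a single vector
    f_t = [sigmoid(a) for a in affine(p["W_f"], concat, p["b_f"])]        # forget gate
    i_t = [sigmoid(a) for a in affine(p["W_i"], concat, p["b_i"])]        # input gate
    c_tilde = [math.tanh(a) for a in affine(p["W_C"], concat, p["b_C"])]  # candidates
    o_t = [sigmoid(a) for a in affine(p["W_o"], concat, p["b_o"])]        # output gate
    # C_t = f_t ⊙ C_{t-1} + i_t ⊙ C̃_t
    c_t = [f * c + i * g for f, c, i, g in zip(f_t, c_prev, i_t, c_tilde)]
    # h_t = o_t ⊙ tanh(C_t)
    h_t = [o * math.tanh(c) for o, c in zip(o_t, c_t)]
    return h_t, c_t

# Toy setup (assumed): hidden size 2, input size 3, every weight 0.1, zero biases.
hidden_size, input_size = 2, 3
params = {}
for gate in ("f", "i", "C", "o"):
    params["W_" + gate] = [[0.1] * (hidden_size + input_size) for _ in range(hidden_size)]
    params["b_" + gate] = [0.0] * hidden_size

h_t, c_t = lstm_step([1.0, 0.5, -0.5], [0.0, 0.0], [0.0, 0.0], params)
print(h_t, c_t)
```

In a trained LSTM the weights are learned; pushing the forget-gate bias up and the input-gate bias down makes the cell state pass through almost unchanged, which is the “conveyor belt” behavior.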

Now let’s understand how back propagation through time happens in LSTM.
It is similar to RNN as LSTM is also sequential in nature and processes words in a sentence one by one.

Cell State = \(C_t = f_t \odot C_{t-1} + i_t \odot \tilde{C}_t\)
Gradient (along the direct cell-state path, treating the gate values as constants) = \(\frac{\partial C_{t}}{\partial C_{t-1}} = f_t\)
Recursively (for all words at different time steps): \[ \frac{\partial C_{t}}{\partial C_{k}} = \prod_{i=k+1}^t f_i\]

Therefore, in LSTM, the gradient along the cell state is determined by the forget gate \(f_t\).

If the information is important, the network learns to set \(f_t \approx 1\).
This means the gradient can flow backwards through hundreds of time steps without being scaled down to zero.

Note: Therefore, unlike RNNs, the vanishing/exploding gradient problem is significantly reduced in LSTM, because of the stable gradient flow, allowing LSTM to learn long range dependencies.

In RNN, the gradient is given by:

\[ \frac{\partial E_t}{\partial h_k } = \delta_t \prod_{i=k+1}^t \sigma'(a_i) \cdot W \propto \prod (\text{Activation Gradient} \times \text{Weight}) \]

First, let us understand the meaning of “Carousel”.

Carousel 🎠 is a rotating circular amusement ride with seats (often horses) commonly called a merry-go-round, or a circular conveyor system.

Now let us understand the meaning of Constant Error Carousel.

If we place a “message” (the gradient) on a horse, and the carousel rotates with no friction or resistance (\(f_t=1\)), the information keeps circulating exactly as it is for an infinite number of rotations.

In other words, if the forget gate output for a word is set to 1, the model does not forget it, no matter how many words lie between it and the current position, i.e., there is no vanishing gradient problem.
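A small numeric sketch makes the carousel idea concrete: the gradient reaching a distant time step is a product of per-step factors. The 100 steps and the factor values below are assumptions chosen for illustration; 0.25 is used for the RNN case because it is the largest possible sigmoid derivative (taking the weight as 1).

```python
def gradient_factor(per_step_factors):
    """Multiply the per-step factors, as in the recursive product formula."""
    result = 1.0
    for factor in per_step_factors:
        result *= factor
    return result

steps = 100
carousel = gradient_factor([1.0] * steps)   # forget gate f_t = 1 at every step
leaky = gradient_factor([0.9] * steps)      # forget gate f_t = 0.9 at every step
rnn_like = gradient_factor([0.25] * steps)  # assumed RNN factor sigma'(a) * W = 0.25

print(carousel, leaky, rnn_like)
```

Only the \(f_t = 1\) case keeps the gradient at full strength; even a slightly smaller factor shrinks it noticeably over 100 steps, and the RNN-like factor makes it vanish.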

Video Long Short Term Memory (LSTM) | Forget, Input & Output Gate | Constant Error Carousel | Explained


7 - GRU

Gated Recurrent Unit


LSTMs are complex, which makes them slower to train and more computationally intensive:

Three gates (forget, input, output) and many parameters.
Separate long-term memory (cell state) and short-term memory (hidden state).

GRU is a simplified LSTM with fewer gates, fewer parameters and a simpler architecture. It retains the ability to capture long-term dependencies while often training faster and at lower computational cost.

Simpler Architecture:
- GRU has two gates (reset, update) instead of three.
- Counting the candidate, GRU has three sets of weights and biases versus four in LSTM, i.e. roughly 25% fewer parameters for the same layer sizes, leading to faster training and lower memory consumption.
Structural Simplicity:
- By merging the “Cell State” (\(C_t\)) and “Hidden State” (\(h_t\)) into a single vector, it eliminates the redundant storage of information.
Performance on Small Data:
- Due to fewer parameters, GRUs are often less prone to overfitting than LSTMs when training on smaller datasets.

GRU Memory Cell

GRU Architecture

Each gate in a GRU functions as a single-layer feedforward neural network that lives inside the GRU cell.
Each gate has its own unique, learnable weight matrices (\(W\)) and bias vectors (\(b\)).

Reset Gate: Decides how much of the past hidden state (\(h_{t-1}\)) is relevant to calculating the current candidate.

\[r_t = \sigma(U_r x_t + W_r h_{t-1} + b_r)\]

\(r_t = 0\): “forget” the past context entirely.
\(r_t = 1\): use everything from the past.

Candidate: This is the “suggestion” for what the new state should look like.

\[\tilde{h}_t = \tanh(U x_t + W(r_t \odot h_{t-1}) + b_h)\]

Note: If \(r_t=0\), then \(r_t \odot h_{t-1}\) will wipe out the previous state completely.

Update Gate: Performs a linear interpolation (slider) between the old state and the new suggestion.

\[z_t = \sigma(U_z x_t + W_z h_{t-1} + b_z)\]

Final state:

\[h_t = \underbrace{z_t \odot h_{t-1}}_{\text{Keep\ the\ Old}} + \underbrace{(1 - z_t) \odot \tilde{h}_t}_{\text{Adopt\ the\ New}} \]

Sigmoid (Gates): Outputs 0 to 1;
- For “gating” (blocking or passing info).
Tanh (Candidate): Outputs -1 to 1;
- Allows the candidate to either add to the memory (positive values) or subtract/correct the memory (negative values).
Element Wise Product: \[ \mathbf{a} \odot \mathbf{b} = \begin{bmatrix} a_1 \\ a_2 \\ \vdots \\ a_n \end{bmatrix} \odot \begin{bmatrix} b_1 \\ b_2 \\ \vdots \\ b_n \end{bmatrix} = \begin{bmatrix} a_1 b_1 \\ a_2 b_2 \\ \vdots \\ a_n b_n \end{bmatrix} \]

Gates Summary

Component	Role
Reset Gate	“Clear the cache for a new sub-topic.”
Candidate	Propose new state.
Update Gate	Mix old + new info.
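The reset, candidate and update equations above can be sketched as one GRU time step in plain Python. The sizes (hidden size 2, input size 2), the constant weights of 0.5 and the names `gru_step`, `matvec`, `add3` and `params` are assumptions made for this example only.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def matvec(M, v):
    """Matrix-vector product for a matrix stored as a list of rows."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]

def add3(a, b, c):
    """Element-wise sum of three vectors."""
    return [x + y + z for x, y, z in zip(a, b, c)]

def gru_step(x_t, h_prev, p):
    """One GRU time step following the reset, candidate and update equations."""
    r_t = [sigmoid(a) for a in add3(matvec(p["U_r"], x_t), matvec(p["W_r"], h_prev), p["b_r"])]
    z_t = [sigmoid(a) for a in add3(matvec(p["U_z"], x_t), matvec(p["W_z"], h_prev), p["b_z"])]
    reset_h = [r * h for r, h in zip(r_t, h_prev)]  # r_t ⊙ h_{t-1}
    h_tilde = [math.tanh(a) for a in add3(matvec(p["U"], x_t), matvec(p["W"], reset_h), p["b_h"])]
    # h_t = z_t ⊙ h_{t-1} + (1 - z_t) ⊙ h̃_t
    return [z * h + (1.0 - z) * g for z, h, g in zip(z_t, h_prev, h_tilde)]

# Toy setup (assumed): hidden size 2, input size 2, every weight 0.5, zero biases.
size = 2
params = {name: [[0.5] * size for _ in range(size)]
          for name in ("U_r", "W_r", "U_z", "W_z", "U", "W")}
params.update({name: [0.0] * size for name in ("b_r", "b_z", "b_h")})

h_prev = [0.2, -0.4]
h_t = gru_step([1.0, -1.0], h_prev, params)
print(h_t)
```

Note that there is a single state vector here: unlike the LSTM sketch, no separate cell state is carried between steps.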

Now let’s understand how back propagation through time happens in GRU.
It is similar to LSTM as GRU is also sequential in nature and processes words in a sentence one by one.

Hidden State = \(h_t = z_t \odot h_{t-1} + (1 - z_t) \odot \tilde{h}_t\)
Gradient (treating the gate values as constants) = \(\frac{\partial h_t}{\partial h_{t-1}} \approx z_t + (1-z_t)\,(W \cdot r_t \cdot \tanh')\)
Recursively (for all words at different time steps): \[ \frac{\partial h_t}{\partial h_{k}} \approx \prod_{i=k+1}^t \left[ z_i + (1-z_i)\,(W \cdot r_i \cdot \tanh') \right]\]

Therefore, in GRU, the gradient is mostly determined by \(z_t\).

If the information is important, the network learns to set \(z_t \approx 1\).
This means the gradient can flow backwards through hundreds of time steps without being scaled down to zero.

Note: Therefore, unlike RNNs, the vanishing/exploding gradient problem is significantly reduced in GRU, because of the stable gradient flow, allowing GRU to learn long range dependencies.

In RNN, the gradient is given by:

\[ \frac{\partial E_t}{\partial h_k } = \delta_t \prod_{i=k+1}^t \sigma'(a_i) \cdot W \propto \prod (\text{Activation Gradient} \times \text{Weight}) \]

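First, let us understand the meaning of “Carousel”.

Carousel 🎠 is a rotating circular amusement ride with seats (often horses) commonly called a merry-go-round, or a circular conveyor system.

Now let us understand the meaning of Constant Error Carousel in GRU: when \(z_t = 1\), the final state equation gives \(h_t = h_{t-1}\), so the error (gradient) is carried backwards unchanged, like a message on a carousel rotating without friction.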

Let us understand how GRU works with the help of a few examples.

1. Context Shift
Say there is a context shift between sentences, i.e., the previous sentence discussed “Stars” and the current sentence starts discussing “Fish”.
Clearly, there is a context shift, and we want the network to forget the previous context entirely and start fresh.

In that case \(r_t=0, ~ z_t=0 \), so that the new state depends only on the current input ‘\(x_t\)’: \(h_t = \tilde{h}_t = \tanh(U x_t + b_h)\).

images/natural_language_processing/gru/context_shift.png

2. Ignore Current Input
Say, we have a GRU processing a long review to determine sentiment:

“The phone I bought yesterday, despite having a slightly scratched screen that makes it annoying to use sometimes, is absolutely amazing and has a black charger.”

The model reads “absolutely amazing” and updates its hidden state to a strong positive sentiment.
Later the review adds irrelevant details like - “has a black charger”,
- so GRU learns to set \(z_t \approx 1\) when these words are processed, and ignores them.

Note: When GRU wants to ignore current input to preserve long-term memory, it sets \(z_t\) close to 1.

images/natural_language_processing/gru/ignore_input.png

3. Default Behavior (RNN)
GRU isn’t trying to store long-term memories; it is simply processing the current input \(x_t\) in the context of the immediately preceding hidden state \(h_{t-1}\).

Reset Gate (\(r_t = 1\)): The candidate hidden state \(\tilde{h}_t\) has full access to the previous hidden state (\(h_{t-1}\)).
- Nothing from the past is blocked.
Update Gate (\(z_t = 0\)): The network decides to entirely replace the old state with the new candidate state.

images/natural_language_processing/gru/default_rnn.png

Note: With \(r_t = 1\) and \(z_t = 0\), \(h_t = \tanh(U x_t + W h_{t-1} + b_h)\), so the GRU keeps no protected long-term memory and behaves like a standard simple RNN.
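The three cases above can be checked with a scalar version of the GRU state update in which the gate values are set by hand. This is a sketch under stated assumptions: the function `gru_update`, the toy weights `u = w = 1`, `b = 0` and the inputs are made up for illustration, and in a real GRU the gates are computed by the network rather than fixed.

```python
import math

def gru_update(x_t, h_prev, r_t, z_t, u=1.0, w=1.0, b=0.0):
    """Scalar GRU state update with hand-picked gate values (toy weights u, w, b)."""
    h_tilde = math.tanh(u * x_t + w * (r_t * h_prev) + b)  # candidate state
    return z_t * h_prev + (1.0 - z_t) * h_tilde            # final state

h_prev, x_t = 0.8, -0.5

context_shift = gru_update(x_t, h_prev, r_t=0.0, z_t=0.0)  # depends on x_t only
ignore_input = gru_update(x_t, h_prev, r_t=1.0, z_t=1.0)   # keeps h_prev unchanged
simple_rnn = gru_update(x_t, h_prev, r_t=1.0, z_t=0.0)     # tanh(u*x_t + w*h_prev + b)

print(context_shift, ignore_input, simple_rnn)
```

With these settings `context_shift` equals `tanh(x_t)`, `ignore_input` equals `h_prev`, and `simple_rnn` equals the plain RNN update of `x_t` and `h_prev`.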

Video Gated Recurrent Unit (GRU) | Reset & Update Gate | Constant Error Carousel | Detailed Explanation


8 - Attention

Attention Mechanism (Bahdanau Attention)


In a basic encoder–decoder, the entire input (\(h_1, h_2, \dots, h_T\)) is encoded into one fixed-length vector.
So, as the length of the input sentence increases, the performance of a basic encoder–decoder deteriorates rapidly: the single vector becomes an information bottleneck, on top of the vanishing gradient problem.

Encoder-Decoder Architecture (RNN)

images/natural_language_processing/attention/encoder_decoder.png

The decoder decides which parts of the source sentence to pay attention to.
By letting the decoder have an attention mechanism, we relieve the encoder from the burden of having to encode all information in the source sentence into a fixed-length vector.
While decoding an output, the model pays attention only to the relevant inputs.

Decoder Context

Note: Now the decoder context is not fixed.
While predicting the next word, the decoder (instead of relying only on the final encoder hidden state) dynamically pays attention to only relevant context from encoder at that time step.

Why it matters?
This allows the decoder to “look back” at the entire input sequence, preventing the information loss that occurs in basic encoder-decoder models when handling long sentences.

Goal
To determine the relevant context vector at each time step of the decoder.

Before that, we need to understand the meaning of a few terms, viz., “Query”, “Key” & “Value”.

Research Paper: Neural Machine Translation by Jointly Learning to Align and Translate ; Dzmitry Bahdanau, Kyunghyun Cho, Yoshua Bengio, 2014, https://arxiv.org/pdf/1409.0473

In the context of Attention mechanism, Query (Q), Key (K), and Value (V) are metaphors borrowed from retrieval systems (like a database or a search engine) to describe how information is accessed.
Let’s understand these terms via a “data fitting” problem.

Data Fitting

We are given some data points, i.e., some keys (\(k_i\)) and corresponding values (\(v_i\)), and the task is to find the best-fit curve (\(\psi(q)\)) so that, given a new query point ‘\(q\)’, we can predict the corresponding value.
This is similar to linear regression.

Similarity Score
One way is to compute the similarity (dot product) of the query point ‘\(q\)’ with every key (\(k_i\)); a high similarity score means the two are close, and vice versa. We then scale each value (\(v_i\)) by its similarity score and sum them up to get the predicted value for ‘\(q\)’.

images/natural_language_processing/attention/query_key_value.png

Here, in the above example, the query point ‘\(q\)’ is closer to keys \(k_7\) and \(k_8\), hence the predicted value for the query will be closer to \(v_7\) and \(v_8\) than to \(v_1\) or \(v_2\), which are quite far away.

Softmax
Moreover, we can apply a softmax on the similarity scores, so that the higher scores are amplified (winner takes most), thus giving us better prediction.
This turns our prediction into a weighted average of the values \(v_i\).
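The similarity-score and softmax steps above can be sketched in a few lines of plain Python. The keys, values and query below are made-up toy data, and the names `softmax` and `attend` are chosen for this example only.

```python
import math

def softmax(scores):
    """Turn raw scores into positive weights that sum to 1."""
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attend(query, keys, values):
    """Predict a value for `query` as a softmax-weighted average of `values`."""
    scores = [sum(q * k for q, k in zip(query, key)) for key in keys]  # dot products
    weights = softmax(scores)
    prediction = sum(w * v for w, v in zip(weights, values))
    return prediction, weights

# Toy data (assumed): three 2-D keys with scalar values.
keys = [[1.0, 0.0], [0.0, 1.0], [0.9, 0.1]]
values = [10.0, -10.0, 8.0]
query = [3.0, 0.0]

prediction, weights = attend(query, keys, values)
print(prediction, weights)
```

The query points in the same direction as the first and third keys, so most of the weight goes to their values and the prediction lands near them, while the second key contributes very little.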

9 - Self Attention

Self Attention - Transformer Architecture


RNNs process words sequentially, i.e., one by one.

To understand word 20, you must first process words 1 to 19.
By the time the model gets to the end of a long sentence, it often “forgets” the beginning (vanishing gradient problem).

images/natural_language_processing/self_attention/cross_attention.png

Note: Luong Attention (2015), which used “dot product” based attention, was faster than Bahdanau Attention (2014) (additive), but it was still an RNN-based Seq2Seq model.

10 - Transformer

Transformer Architecture - Attention Is All You Need !


Transformer replaced recurrence with attention, enabling faster training and superior performance in NLP and AI tasks.

images/natural_language_processing/transformer/transformer_architecture.png

Source: Attention is all you need, Vaswani et al., 2017; https://arxiv.org/pdf/1706.03762

Transformer Architecture
It was designed for machine translation, so it has 2 parts, viz., encoder and decoder.

Encoder
The encoder is composed of a stack of N = 6 identical layers.

Each encoder layer has 2 sub-layers:
- Multi-Head Self Attention
- Fully connected Neural Network (Feed Forward)
Residual connection around each of the sub-layers.
Each sub-layer has layer normalization.

Decoder
The decoder is also composed of a stack of N = 6 identical layers.

Each decoder layer has 3 sub-layers:
- Masked Multi-Head Attention
- Multi-Head Encoder Decoder (Cross) Attention
- Fully connected Neural Network (Feed Forward)
Residual connection around each of the sub-layers.
Each sub-layer has layer normalization.

Transformer replaced recurrence with attention.
3 kinds of attention are used in the transformer architecture:

Multi-Head Self Attention (Encoder)
Masked Multi-Head Attention (Decoder)
Encoder Decoder (Cross) Attention (Decoder)

Unlike older models that read left-to-right, self-attention sees the whole sentence at once, allowing it to catch relationships whether they are side-by-side or miles apart.

Every word in a sentence looks at every other word (including itself) to decide which word is most relevant to its own meaning in that specific context.
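The idea that every word looks at every other word can be sketched with scaled dot-product attention, the form used in Vaswani et al. (2017). To keep the sketch short it assumes Q = K = V = X, i.e. it leaves out the learned projection matrices and the multiple heads of the real architecture; the `causal` flag shows the masking used in the decoder. The function names and the toy input are assumptions made for illustration.

```python
import math

def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(X, causal=False):
    """Scaled dot-product self-attention with Q = K = V = X (no learned projections)."""
    d_k = len(X[0])
    outputs, all_weights = [], []
    for i, q in enumerate(X):
        scores = []
        for j, k in enumerate(X):
            if causal and j > i:
                scores.append(float("-inf"))  # masked: token i cannot see later tokens
            else:
                scores.append(sum(a * b for a, b in zip(q, k)) / math.sqrt(d_k))
        weights = softmax(scores)
        # Output for token i is the weighted average of all value vectors.
        out = [sum(w * v[d] for w, v in zip(weights, X)) for d in range(d_k)]
        outputs.append(out)
        all_weights.append(weights)
    return outputs, all_weights

# Toy input (assumed): 3 tokens, each a 2-D embedding.
X = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
full_out, full_w = self_attention(X)
masked_out, masked_w = self_attention(X, causal=True)
print(full_w)
print(masked_w)
```

In the unmasked case every row of weights covers all three tokens; with `causal=True` the first token can only attend to itself, which is what allows the decoder to be trained on whole sequences without seeing future tokens.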

We have discussed self attention in detail in the previous article.

11 - LLM

Large Language Model


Generative Pre-Trained Transformer (GPT); Decoder only
Bidirectional Encoder Representations from Transformers (BERT); Encoder only; Sentiment analysis, Question answering, Search query etc.
Bidirectional and Auto-Regressive Transformers (BART); Encoder-Decoder; Text summarization, Document denoising (reconstruct corrupted text) etc.
Text-To-Text Transfer Transformer (T5); Encoder-Decoder; Translation, Text summarization, etc; uses task-specific prefix

GPT Architecture (Decoder Only)

images/natural_language_processing/llm/gpt_architecture.png

GPT (2018) was the first Large Language Model (LLM) based on the Transformer (decoder only).
Because of the highly parallelizable architecture of the Transformer, it became possible to scale the model and train it on internet-scale data, which gave LLMs their remarkable capabilities.

An LLM is first pre-trained on a vast internet-scale dataset to understand the language, and is then fine-tuned to do desired tasks, such as question answering.

The LLM training is broadly divided into 3 phases:

Pre-Training
Supervised Fine-Tuning (SFT)
Reinforcement Learning from Human Feedback (RLHF)

LLM Training Phases

images/natural_language_processing/llm/llm_training.png

In pre-training, the model is trained on a massive, unlabeled dataset (e.g., internet scale) to learn general language features, grammar, syntax, and reasoning patterns, using self-supervised learning techniques.

The self-supervised techniques are:

Causal Language Modeling (Auto-regressive)
- Predict next token (used in GPT).
Masked Language Modeling (used in BERT)
- A percentage of tokens in a sentence (typically 15%) are randomly hidden or “masked,” and the model must predict these missing words using context from both directions (bidirectional).
Next Sentence Prediction
- Model is given two sentences and must determine if the second sentence naturally follows the first in the original text.
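The first two objectives above can be sketched by building training examples from a tokenized sentence. This is a simplified illustration: tokens are plain words, the function names are made up, and every selected token is replaced by `[MASK]` (the actual BERT recipe also sometimes keeps or randomly replaces the selected tokens).

```python
import random

def causal_lm_pairs(tokens):
    """(context, next token) training pairs for causal language modeling."""
    return [(tokens[:i], tokens[i]) for i in range(1, len(tokens))]

def mask_tokens(tokens, mask_rate=0.15, seed=0):
    """Hide a random subset of tokens for masked language modeling.

    Returns the masked sequence and a {position: original token} dict of targets.
    """
    rng = random.Random(seed)
    n_mask = max(1, round(len(tokens) * mask_rate))
    positions = sorted(rng.sample(range(len(tokens)), n_mask))
    masked = list(tokens)
    targets = {}
    for pos in positions:
        targets[pos] = masked[pos]
        masked[pos] = "[MASK]"
    return masked, targets

tokens = "the cat sat on the mat because it was tired".split()

pairs = causal_lm_pairs(tokens)
masked, targets = mask_tokens(tokens)

print(pairs[:2])
print(masked, targets)
```

In both cases the labels come from the text itself, which is why no human annotation is needed at this stage.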

Now we can take the “pre-trained” model that has already learned general language patterns and fine tune it for different tasks.

‘Transfer learning’ is primarily used to save time and computational resources when we have a limited amount of labeled data for a specific task.
By starting with a model that already “understands” general patterns (like shapes in images or grammar in text), we can adapt it to a new, related task with much less training.

Transfer Learning

While pre-training provides a broad knowledge of syntax, SFT focuses on alignment, teaching the model to follow specific instructions and produce functional, context-aware results.

Code Generation
- The model is trained on high-quality instruction-code pairs.
  - e.g., “Prompt: Write a Python function for binary search” → “Output: [Code snippet]”.
Math & Reasoning (Chain of Thought)
- Instead of training the model to give the answer directly (Prompt \(\rightarrow\) Answer), we train it to generate a rationale first (Prompt \(\rightarrow\) Rationale \(\rightarrow\) Answer). The training examples include intermediate steps (step 1, step 2, …), possibly over multiple turns, to teach the model “Chain of Thought” by walking through the logic.
Tool/Function Calling
- Enables LLMs to interact with external tools and APIs.
- Transformed LLMs from passive chatbots into Agentic AI.

The goal of Chain of Thought (CoT) is to ensure the model does not just ‘guess’ the next token based on probability, but instead follows a logical path where each step provides the context for the next step.

e.g.
User: Akshay has 5 apples. His 2 friends give him some apples. Each friend gives him 3 apples. How many apples does Akshay have now?

Agent: Let’s break it down in steps:

Number of apples Akshay had at start = 5.
Number of friends = 2
Number of apples from each friend = 3
Total number of apples from friends = 2x3 = 6
Total number of Apples = 5 + 6 = 11
Therefore, Akshay has 11 apples now.

Research Paper: Chain-of-Thought Prompting Elicits Reasoning in Large Language Models, Wei et al., 2023, https://arxiv.org/pdf/2201.11903

Train the model to recognize when to pause text generation and output a specific JSON format to invoke an external tool/function.
Say we ask for some information, like the weather of a city; the LLM will not have that information and needs to search the web or call some external API.

Note: Tool calling is the core enabler that transformed LLMs from passive chatbots into Agentic AI.

Function Calling (specific API):
- User Prompt: “What is the weather in Pune?”
- Model Output:
  - {"tool_call": {"name": "get_weather", "arguments": {"location": "Pune", "unit": "celsius"}}}
Tool Calling:
- Model decides to use a Code Interpreter or a Web Search tool that is not just a simple API function.
  - Tool 1 (Calculator): {"tool": "calculator", "input": "sqrt(15625)"}
  - Tool 2 (Web Search): {"tool": "web_search", "query": "recent AI news"}
    - {"name": "web_search", "arguments": {"query": "top AI news 2026/site:hackernews.com OR site:arxiv.org OR site:ieee.org OR site:wired.com"}}
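On the application side, the JSON emitted by the model has to be parsed and routed to real code. The sketch below shows one possible way to do that with the `tool_call` format from the weather example. Both tools are hypothetical stand-ins: `get_weather` does not call any real API, and `calculator` (with its `expression` argument) is an assumed schema that only understands `sqrt(<number>)`.

```python
import json
import math

def get_weather(location, unit="celsius"):
    """Hypothetical stand-in for a real weather API call."""
    return {"location": location, "unit": unit, "source": "stub"}

def calculator(expression):
    """Hypothetical calculator tool; this sketch only supports 'sqrt(<number>)'."""
    if expression.startswith("sqrt(") and expression.endswith(")"):
        return math.sqrt(float(expression[5:-1]))
    raise ValueError("unsupported expression: " + expression)

# Registry mapping tool names (as emitted by the model) to Python functions.
TOOLS = {"get_weather": get_weather, "calculator": calculator}

def dispatch(model_output):
    """Parse the model's JSON output and invoke the requested tool."""
    call = json.loads(model_output)["tool_call"]
    func = TOOLS[call["name"]]
    return func(**call["arguments"])

weather_output = '{"tool_call": {"name": "get_weather", "arguments": {"location": "Pune", "unit": "celsius"}}}'
calc_output = '{"tool_call": {"name": "calculator", "arguments": {"expression": "sqrt(15625)"}}}'

print(dispatch(weather_output))
print(dispatch(calc_output))
```

The tool result would then be appended to the conversation so that the model can use it to write the final answer.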

Research Paper: Toolformer: Language Models Can Teach Themselves to Use Tools, Schick et al., 2023, https://arxiv.org/pdf/2302.04761

Video Supervised Fine Tuning (SFT) | Chain of Thought | Tool/Function Calling | Detailed Explanation

The “Average” Problem:
- If a model has been fine-tuned to solve a category of maths problems in 5 different ways, out of which 2 are good but 3 are mediocre; SFT tries to satisfy the average of those patterns.
- It does not inherently know which one is “better”.
Safety/Nuance:
- It is easy to show a model a “correct” answer, but it is very hard to show it every possible “wrong” or “unsafe” answer.
- The model may inadvertently provide instructions for harmful activities (e.g., “how to build a bomb”) because that data existed in its massive training set.

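RLHF is the “polishing” phase that aligns an LLM’s raw intelligence with human values, safety, and helpfulness.

Here it is performed using Direct Preference Optimization (DPO), a simpler alternative to the reward-model-plus-PPO recipe used in InstructGPT.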

Key Use Cases of RLHF

Instruction Following: Ensuring the model actually follows complex constraints (e.g., “Write this in 50 words only.”).
Safety Guardrails: Teaching the model to decline requests for hate speech or dangerous instructions.
Subjective Nuance: Aligning the “tone” of the model (e.g., making it sound professional vs. casual).
Factuality: Encouraging the model to say “I don’t know” when it is uncertain, rather than hallucinating a confident but wrong answer.
Coding Style: Ranking code that is more “pythonic” or efficient even if multiple versions are functionally correct.

Research Paper: InstructGPT: Training language models to follow instructions with human feedback, Ouyang et al., 2022, https://arxiv.org/pdf/2203.02155

Pairwise comparison

Generate Response

The SFT model takes a prompt and generates 2 different responses.

Better Response

A human (or another LLM) chooses which is the better response.

Result

A triplet is generated:

\(\{prompt, chosen\_response, rejected\_response\}\)

Optimization
Compares the 2 versions of the model simultaneously.

Policy Model (\(\pi_{\theta}\)): The model we are currently training.
Reference Model (\(\pi_{ref}\)): A frozen copy of our SFT model.

Loss Function: The goal of the loss function is to increase the likelihood of the chosen response and decrease the likelihood of the rejected one, while staying “close” to the original model.

Note: Does not train a separate reward model as in Proximal Policy Optimization (PPO).
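The loss described above can be sketched for a single triplet using the standard DPO formulation. The inputs are the total log-probabilities that the policy and the frozen reference model assign to the chosen and rejected responses. The numbers below are made up, and `beta` (how strongly the policy is kept close to the reference) is a hyperparameter from the DPO paper that is not discussed in the text above.

```python
import math

def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """DPO loss for one {prompt, chosen_response, rejected_response} triplet."""
    # How much more (or less) likely each response is under the policy than the reference.
    chosen_margin = policy_chosen_logp - ref_chosen_logp
    rejected_margin = policy_rejected_logp - ref_rejected_logp
    logits = beta * (chosen_margin - rejected_margin)
    # -log(sigmoid(logits)), written as log(1 + exp(-logits))
    return math.log1p(math.exp(-logits))

# At the start of training the policy is a copy of the reference model.
start = dpo_loss(-5.0, -5.0, -5.0, -5.0)
# Policy now favors the chosen response more than the reference does: lower loss.
better = dpo_loss(-4.0, -6.0, -5.0, -5.0)
# Policy favors the rejected response instead: higher loss.
worse = dpo_loss(-6.0, -4.0, -5.0, -5.0)

print(start, better, worse)
```

Because both margins are measured relative to the reference model, the policy is rewarded for shifting probability from the rejected to the chosen response rather than for drifting arbitrarily far from the SFT model.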

Video Reinforcement Learning From Human Feedback (RLHF) | Direct Preference Optimization (DPO) | Explained

In SFT or RLHF we are not teaching the model language from scratch but only teaching a pre-trained model to follow instructions.

Training LLMs is very costly, with respect to both time and resources.

Challenge
When we fine-tune an LLM, we do not just store the model’s weights.
We also have to store optimizer states, gradients, and activations.

7B parameters LLM

images/natural_language_processing/llm/peft.png

Solution
Instead of retraining the entire \(d_{model} \times d_{model}\) weight matrices, Parameter-Efficient Fine-Tuning (PEFT) methods update only a tiny fraction (often \(<1\%\)) of the parameters while keeping the original “knowledge” of the pre-trained Transformer frozen.

Key PEFT methods include LoRA (most popular), QLoRA, Prefix Tuning, and Adapter layers.

Low-Rank Adaptation (LoRA)
It freezes the original model weights and injects two small, trainable low-rank matrices into the attention layers.
Reduces trainable parameters by up to 99% while maintaining near-identical performance to full fine-tuning.
Helps prevent “catastrophic forgetting”, since the original weights stay frozen.

Why Attention Layers?

Attention layers are the primary target for LoRA because they act as the “brain” for context and relationships in Transformers.
By modifying the Attention Mechanism, we change how the model decides which parts of the input are relevant to each other.

The weights of the Feed Forward Neural Network and Layer Normalization are kept frozen to avoid catastrophic forgetting.

Low Rank Matrices
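When we train a model, we are looking for a change in weights, which we call \(\Delta W\). LoRA assumes that this change has low rank, so it can be factorized into two much smaller matrices: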

\[\Delta W_{d \times d} = B_{d \times r} \times A_{r \times d}\]

\[W_{new} = W_{old} + BA\]

Note: ‘\(r\)’ is the rank, a small integer (e.g., 4, 8, or 16).

PEFT (LoRA) Working

Initialization:
- Matrix ‘\(B\)’ is initialized to 0.
- Matrix ‘\(A\)’ is initialized with random Gaussian noise.
- At the very start of training, \(BA = 0\), meaning the model starts with the exact same behavior as the original pre-trained model.
Forward Pass:
- Input ‘\(x\)’ is passed through both the frozen weights and the trainable “adapters” in parallel.
- \(h = Wx + BAx\)
Back Propagation:
- Gradients are calculated only for matrices ‘\(A\)’ and ‘\(B\)’, and only these two matrices are updated.
- Because ‘\(r\)’ is so small, we are often training less than 1% of the total parameters.
Inference:
- After training, we can mathematically “merge” the weights: \(W_{updated} = W + BA\).
- This results in a single matrix of the original size, meaning the model runs just as fast as the original with zero overhead.
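The initialization, forward pass and merge steps above can be sketched in plain Python. This is not a training loop: the “trained” values of B are made up to stand in for the result of back propagation, and the sizes (d = 4, r = 2), the Gaussian scale of 0.01 and all function names are assumptions made for illustration.

```python
import random

def matvec(M, v):
    """Matrix-vector product for a matrix stored as a list of rows."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]

def matmul(P, Q):
    """Matrix-matrix product P·Q for matrices stored as lists of rows."""
    return [[sum(p * q for p, q in zip(row, col)) for col in zip(*Q)] for row in P]

d, r = 4, 2
rng = random.Random(0)
W = [[rng.gauss(0, 1) for _ in range(d)] for _ in range(d)]     # frozen pre-trained weights
A = [[rng.gauss(0, 0.01) for _ in range(d)] for _ in range(r)]  # r x d, Gaussian init
B = [[0.0] * r for _ in range(d)]                               # d x r, zero init

def lora_forward(x, W, A, B):
    """h = Wx + BAx, computed as B(Ax) so the d x d product BA is never formed."""
    base = matvec(W, x)
    update = matvec(B, matvec(A, x))
    return [b + u for b, u in zip(base, update)]

def merge(W, A, B):
    """Fold the adapter into a single d x d matrix: W + BA."""
    BA = matmul(B, A)
    return [[w + delta for w, delta in zip(w_row, ba_row)] for w_row, ba_row in zip(W, BA)]

x = [1.0, 2.0, 3.0, 4.0]
h_start = lora_forward(x, W, A, B)  # same as Wx, because B = 0 at the start

# Made-up values standing in for B after training.
B_trained = [[0.5, -0.5] for _ in range(d)]
h_adapter = lora_forward(x, W, A, B_trained)
h_merged = matvec(merge(W, A, B_trained), x)

print(h_start, h_adapter, h_merged)
```

`h_adapter` and `h_merged` agree up to floating-point rounding, which is why the merged model can be served at the original size and speed.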

Example
\(W_q = d \times d\) and \(B = d \times r\), \(A = r \times d\)
Let, d = 1000, and r = 4.

Full Matrix (\(W\)): \(1000 \times 1000 = 1,000,000\) parameters
Matrix B: \(1000 \times 4 = 4,000\) parameters
Matrix A: \(4 \times 1000 = 4,000\) parameters
\(\frac{\text{LoRA Parameters}}{\text{Full Parameters}} = \frac{8,000}{1,000,000} = 0.008\)

Therefore, we train only 0.8% of the parameters for that layer.
Across the whole model, this usually results in training ~0.1% to 1%.
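The arithmetic in this example can be reproduced with a small helper, which also makes it easy to try other values of d and r. The function name `lora_param_counts` is made up for this sketch.

```python
def lora_param_counts(d, r):
    """Parameter counts for a d x d weight matrix and its rank-r LoRA adapter."""
    full = d * d           # full matrix W
    lora = d * r + r * d   # matrix B (d x r) plus matrix A (r x d)
    return full, lora, lora / full

full, lora, fraction = lora_param_counts(d=1000, r=4)
print(full, lora, f"{fraction:.1%}")  # prints: 1000000 8000 0.8%
```

Since the adapter size grows linearly with d while the full matrix grows quadratically, the trainable fraction becomes even smaller for larger layers at the same rank.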

Video Performance Efficient Fine Tuning (PEFT) | Low Rank Adaptation (LoRA) | Detailed Explanation

Video How Large Language Models (LLMs) are Trained ? | Pre-Training | Supervised Fine Tuning (SFT) | RLHF


12 - BERT

Bidirectional Encoder Representations from Transformers


Limitations of Word Vectors
Before 2018, models like Word2Vec and GloVe provided “context-free” embeddings.
The word “bank” had the same vector whether it was a “river bank” or a “bank account”.

BERT, developed by the Google AI Language team, is a Transformer (encoder-only) based language model designed to understand the meaning of words in a text by using context from both directions.

BERT Architecture (Encoder Only)

images/natural_language_processing/bert/bert_architecture.png

BERT Input Representation

Note: BERT uses WordPiece embeddings (Wu et al., 2016) with a 30,000 token vocabulary.