Probability
- 1: Introduction to Probability
- 2: Conditional Probability
- 3: Independence of Events
- 4: Cumulative Distribution Function
- 5: Probability Mass Function
- 6: Probability Density Function
- 7: Expectation
- 8: Moment Generating Function
- 9: Joint & Marginal
- 10: Independent & Identically Distributed
- 11: Convergence
- 12: Law of Large Numbers
- 13: Markov's Inequality
- 14: Cross Entropy & KL Divergence
- 15: Parametric Model Estimation
1 - Introduction to Probability
The world around us is inherently uncertain, and probability acts as
the fundamental language
to understand, express, and deal with this uncertainty.
- Toss a fair coin, \(P(H) = P(T) = 1/2\)
- Roll a die, \(P(1) = P(2) = P(3) = P(4) = P(5) = P(6) = 1/6\)
- Email classifier, \(P(spam) = 0.95 ,~ P(not ~ spam) = 0.05\)
Range: \([0,1]\)
\(P=0\): Highly unlikely
\(P=1\): Almost certain
Symbol: \(\Omega\)
\(P(\Omega) = 1\)
- Toss a fair coin, sample space: \(\Omega = \{H,T\}\)
- Roll a die, sample space: \(\Omega = \{1,2,3,4,5,6\}\)
- Choose a real number \(x\) from the interval \([2,3]\), sample space: \(\Omega = [2,3]\); sample size = \(\infty\)
Note: There can be infinitely many points between 2 and 3, e.g: 2.21, 2.211, 2.2111, 2.21111, …
- Randomly put a point in a rectangular region; sample size = \(\infty\)
Note: There can be infinitely many points in any rectangular region.
A,B,…⊆Ω
- Toss a fair coin, set of possible outcomes: \(\{H,T\}\)
- Roll a die, set of possible outcomes: \(\{1,2,3,4,5,6\}\)
- Roll a die, event \(A = \{1,2\} => P(A) = 2/6 = 1/3\)
- Email classifier, set of possible outcomes: \(\{spam,not ~spam\}\).
- Toss a fair coin, possible outcomes: \(\Omega = \{H,T\}\)
- Roll a die, possible outcomes: \(\Omega = \{1,2,3,4,5,6\}\)
- Choose a real number \(x\) from the interval \([2,3]\) with decimal precision, sample space: \(\Omega = [2,3]\).
Note: There are 99 numbers strictly between 2 and 3 with 2-decimal precision, i.e. from 2.01 to 2.99.
- Number of cars passing a specific traffic signal in 1 hour.
- A line segment between 2 and 3 - forms a continuum.
- Randomly put a point in a rectangular region.
graph TD
A[Sample Space] --> |Discrete| B(Finite)
A --> C(Infinite)
C --> |Discrete| D(Countable)
C --> |Continuous| E(Uncountable)
No overlapping or common outcomes.
If one event occurs, then the other event does NOT occur.
- Roll a die, sample space: \(\Omega = \{1,2,3,4,5,6\}\)
Odd outcome = \(A = \{1,3,5\}\)
Even outcome = \(B = \{2,4,6\}\) are mutually exclusive.
\(P(A \cap B) = 0\)
Since, \(P(A \cup B) = P(A) + P(B) - P(A \cap B)\)
Therefore, \(P(A \cup B) = P(A) + P(B)\)
Note: If we know that event \(A\) has occurred, then we can say for sure that the event \(B\) did NOT occur.
- Roll a die twice, sample space for each roll: \(\Omega = \{1,2,3,4,5,6\}\)
Odd number in 1st throw = \(A = \{1,3,5\}\)
Odd number in 2nd throw = \(B = \{1,3,5\}\)
Note: A and B are independent because whether we get an odd number in the 1st roll has NO impact on getting an odd number in the 2nd roll.
\(P(A \cap B) = P(A)*P(B)\)
Note: If we know that event \(A\) has occurred, then that gives us NO new information about the event \(B\).
Let’s understand why \(P=0\) means “highly unlikely” rather than “impossible” with an example.
Probability of choosing exactly one point on the number line or a real number, say 2.5,
from the interval \([2,3]\) is almost = 0, because there are infinitely many points between 2 and 3.
Also, we can NOT say that choosing exactly 2.5 is impossible, because it exists there on the number line.
But, for all practical purposes, \(P(2.5) = 0\).
Therefore, we say that \(P=0\) means “Highly Unlikely” and NOT “Impossible”.
Extending this line of reasoning, we can say that the probability of NOT choosing 2.5 is \(P(\text{not } 2.5) = 1\).
Does that mean we will certainly avoid 2.5? Theoretically yes, because there are infinitely many points between 2 and 3.
But, we cannot rule out choosing exactly 2.5.
There is some probability of choosing 2.5, but it is vanishingly small.
Therefore, we say that \(P=1\) means “Almost Sure” and NOT “Certain”.
For a finite sample space, however, such as a single die roll, we can say that \(P(7)=0\) and that does mean impossible.
Similarly, we can say that \(P(get ~any ~number ~between ~1 ~and ~6)=1\), and here \(P=1 =>\) Certain.
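The “long-run frequency” intuition behind these probabilities can be checked with a quick simulation. A small Python sketch (variable names are illustrative) estimates \(P(H)\) for a fair coin by counting relative frequency over many tosses:

```python
import random

random.seed(0)  # fixed seed so the run is reproducible

# Simulate n tosses of a fair coin; each toss is heads with probability 0.5.
n = 100_000
heads = sum(random.random() < 0.5 for _ in range(n))

# Relative frequency of heads approaches P(H) = 1/2 as n grows.
p_heads = heads / n
```

With 100,000 tosses the estimate typically lands within a couple of percent of 0.5.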
End of Introduction
2 - Conditional Probability
It is the probability of an event occurring, given that another event has already occurred.
Allows us to update probability when additional information is revealed.
\(P(A \cap B) = P(A)*P(B \mid A)\)
- Roll a die, sample space: \(\Omega = \{1,2,3,4,5,6\}\)
Event A = Get a 5 = \(\{5\} => P(A) = 1/6\)
Event B = Get an odd number = \(\{1, 3, 5\} => P(B) = 3/6 = 1/2\)
It is a formula that uses conditional probability.
It allows us to update our belief about an event’s probability based on new evidence.
We know from conditional probability and the chain rule that:
\(P(A \cap B) = P(A)*P(B \mid A) = P(B)*P(A \mid B)\)
Combining all the above equations gives us the Bayes’ Theorem:
$$ \begin{aligned} P(A \mid B) = \frac{P(A)*P(B \mid A)}{P(B)} \end{aligned} $$

- Roll a die, sample space: \(\Omega = \{1,2,3,4,5,6\}\)
Event A = Get a 5 = \(\{5\} => P(A) = 1/6\)
Event B = Get an odd number = \(\{1, 3, 5\} => P(B) = 3/6 = 1/2\)
Task: Find the probability of getting a 5 given that you rolled an odd number.
\(P(B \mid A) = 1\) = Probability of getting an odd number given that we have rolled a 5.
So, \(P(A \mid B) = \dfrac{P(A)*P(B \mid A)}{P(B)} = \dfrac{(1/6)*1}{1/2} = 1/3\)
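The die example can be verified with exact arithmetic; a small Python sketch using `fractions` (names are ours):

```python
from fractions import Fraction

# Bayes' theorem on the die example:
# A = roll a 5, B = roll an odd number.
p_a = Fraction(1, 6)          # P(A)
p_b = Fraction(1, 2)          # P(B)
p_b_given_a = Fraction(1, 1)  # P(B|A): a 5 is always odd

# P(A|B) = P(A) * P(B|A) / P(B)
p_a_given_b = p_a * p_b_given_a / p_b
```

`Fraction` keeps the arithmetic exact, so the result is precisely 1/3 rather than a rounded float.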
Now, let’s understand another concept called Law of Total Probability.
Here, we can say that the sample space \(\Omega\) is divided into 2 parts - \(A\) and \(A ^ \complement \)
So, the probability of an event \(B\) is given by:
\[ \begin{aligned} B &= (B \cap A) \cup (B \cap A^\complement) \\ P(B) &= P(B \cap A) + P(B \cap A^\complement) \\ \text{By Chain Rule: } P(B) &= P(A)*P(B \mid A) + P(A^\complement)*P(B \mid A^\complement) \end{aligned} \]
This is the overall probability of an event B, considering all the different, mutually exclusive ways it can occur.
If \(A_1, A_2, \dots, A_n\) are a set of events that partition the sample space, such that they are -
- Mutually exclusive: \(A_i \cap A_j = \emptyset\) for all \(i \neq j\)
- Exhaustive: \(A_1 \cup A_2 \cup \dots \cup A_n = \Omega\)
\[ P(B) = \sum_{i=1}^{n} P(A_i)*P(B \mid A_i) \]
where \(n\) is the number of mutually exclusive partitions of the sample space \(\Omega\).
Now, we can also generalize the Bayes’ Theorem using the Law of Total Probability:
\[ P(A_i \mid B) = \dfrac{P(A_i)*P(B \mid A_i)}{\sum_{j=1}^{n} P(A_j)*P(B \mid A_j)} \]
End of Section
3 - Independence of Events
Two events are independent if the occurrence of one event does not affect the probability of the other event.
There are 3 types of independence of events:
- Mutual Independence
- Pair-Wise Independence
- Conditional Independence
Joint probability of two events is equal to the product of the individual probabilities of the two events.
\(P(A \cap B) = P(A)*P(B)\)
Joint probability: The probability of two or more events occurring simultaneously.
\(P(A \cap B)\) or \(P(A, B)\)
- Toss a coin and roll a die -
\(A\) = Get a heads; \(P(A)=1/2\)
\(B\) = Get an odd number; \(P(B)=1/2\)
\(P(A \cap B) = 1/4 = P(A)*P(B)\) => A and B are independent.
Pair-wise independence != Mutual independence.
- Toss 3 coins;
For any pair of the 3 tosses, the sample space of that pair is: \(\Omega = \{HH,HT, TH, TT\}\)
\(A\) = First and Second toss outcomes are same i.e \(\{HH, TT\}\); \(P(A)= 2/4 = 1/2\)
\(B\) = Second and Third toss outcomes are same i.e \(\{HH, TT\}\); \(P(B)= 2/4 = 1/2\)
\(C\) = Third and First toss outcomes are same i.e \(\{HH, TT\}\); \(P(C)= 2/4 = 1/2\)
Now, to check pair-wise independence of the events A & B, we compute \(P(A \cap B)\):
\(P(A \cap B)\) => Outcomes of first and second toss are the same AND
outcomes of second and third toss are the same
=> Outcomes of all three tosses are the same.
Total number of outcomes = 8
Desired outcomes = \(\{HHH, TTT\}\) = 2
=> \(P(A \cap B) = 2/8 = 1/4 = P(A) * P(B) = 1/2 * 1/2 = 1/4\)
Therefore, \(A\) and \(B\) are pair-wise independent.
Similarly, we can also prove that \(A\) & \(C\), and \(B\) & \(C\), are pair-wise independent.
Now, let’s check for mutual independence of the above events A, B & C.
\(P(A \cap B \cap C) = P(A)*P(B)*P(C)\)
\(P(A \cap B \cap C)\) = Outcomes of all the three tosses are same i.e \(\{HHH, TTT\}\)
Total number of outcomes = 8
Desired outcomes = \(\{HHH, TTT\}\) = 2
So, \(P(A \cap B \cap C)\) = 2/8 = 1/4
But, \(P(A)*P(B)*P(C) = 1/2*1/2*1/2 = 1/8\)
Therefore \(P(A \cap B \cap C)\) ≠ \(P(A)*P(B)*P(C)\)
=> \(A, B, C\) are NOT mutually independent, but only pair-wise independent.
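The pairwise-but-not-mutual result above can be checked by brute-force enumeration of the 8 equally likely outcomes; a short Python sketch (names are illustrative):

```python
from itertools import product

# All 8 equally likely outcomes of 3 coin tosses, e.g. ('H', 'T', 'H').
outcomes = list(product("HT", repeat=3))

def prob(event):
    """Probability of an event = favourable outcomes / total outcomes."""
    return sum(1 for w in outcomes if event(w)) / len(outcomes)

A = lambda w: w[0] == w[1]  # tosses 1 and 2 agree
B = lambda w: w[1] == w[2]  # tosses 2 and 3 agree
C = lambda w: w[2] == w[0]  # tosses 3 and 1 agree

# Pairwise: P(A ∩ B) = 1/4 = P(A) * P(B).
p_ab = prob(lambda w: A(w) and B(w))

# Triple: P(A ∩ B ∩ C) = 1/4, but P(A)P(B)P(C) = 1/8 — mutual independence fails.
p_abc = prob(lambda w: A(w) and B(w) and C(w))
```

Enumerating the whole sample space avoids any simulation noise: the probabilities are exact counts over 8 outcomes.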
Two events A & B are conditionally independent given a third event C
if they are independent given that C has occurred.
Occurrence of C changes the context, causing the events A & B to become independent of each other.

If \(P(A \cap B) \neq P(A)*P(B)\) => A & B are NOT independent.
Now, let’s check for conditional independence of A & B given C:
If \(P(A \cap B \mid C) = P(A \mid C)*P(B \mid C)\), then A & B are conditionally independent given C.
End of Section
4 - Cumulative Distribution Function
Random Variable X is represented as, \(X: \Omega \to \mathbb{R} \)
👉 It maps abstract outcomes of a random experiment to concrete numerical values required for mathematical analysis.
- Toss a coin 2 times, sample space: \(\Omega = \{HH,HT, TH, TT\}\)
The above random experiment of coin tosses can be mapped to a random variable \(X: \Omega \to \mathbb{R} \)
\(X: \Omega = \{HH,HT, TH, TT\} \to \mathbb{R} \)
Say, if we count the number of heads, then
\[ \begin{aligned} (X=0) &= \{TT\} \Rightarrow 1 ~\text{outcome} \\ (X=1) &= \{HT, TH\} \Rightarrow 2 ~\text{outcomes} \\ (X=2) &= \{HH\} \Rightarrow 1 ~\text{outcome} \\ \end{aligned} \]
A similar mapping is obtained if we count the number of tails instead.
Depending upon the nature of output, random variables are of 2 types - Discrete and Continuous.
Typically obtained by counting.
Discrete random variable cannot take any value between 2 consecutive values.
Possible outcomes are infinite.
- A person’s height in a given range of say 150cm-200cm.
Height can take any value, not just round values, e.g: 150.1cm, 167.95cm, 180.123cm etc.
Now that we have understood how a random variable maps outcomes of an abstract random experiment to real values for mathematical analysis, let us look at its applications.
CDF = \(F_X(x) = P(X \leq x)\), i.e. the probability that the random variable \(X\) takes a value \(\le x\).
- Discrete random variable - Toss a coin 2 times, sample space: \(\Omega = \{HH,HT, TH, TT\}\)
Count the number of heads.
\[ \begin{aligned} P(X = 0) = P(\{TT\}) = 1/4 \\ P(X = 1) = P(\{HT, TH\}) = 1/2 \\ P(X = 2) = P(\{HH\}) = 1/4 \\ \\ CDF = F_X(x) = P(X \leq x) \\ F(0) = P(X < 0) + P(X = 0) = 0 + 1/4 = 1/4 \\ F(1) = P(X < 1) + P(X = 1) = 1/4 + 1/2 = 3/4 \\ F(2) = P(X < 2) + P(X = 2) = 3/4 + 1/4 = 1 \\ \end{aligned} \]

- Continuous random variable - Consider a line segment/interval from \(\Omega = [0,2] \)
Random variable \(X(\omega) = \omega\)
i.e \(X(1) = 1 ~and~ X(1.1) = 1.1 \)
\[ \begin{aligned} P[(0,1/2)] = (1/2)/2 = 1/4 \\ P[(3/4, 2)] = (2-3/4)/2 = 5/8 \\ P(X \le 1.1) = 1.1/2 \end{aligned} \] \[ F_X(x) = P(X \leq x) = \begin{cases} \frac{x}{2} & \text{if } x \in [0,2] \\ 1 & \text{if } x > 2 \\ 0 & \text{if } x < 0 \end{cases} \]
- Non-Decreasing:
For any 2 values \(x_1\) and \(x_2\) such that \(x_1 \leq x_2\), corresponding CDF must satisfy -
\(F(x_1) \leq F(x_2)\)
Note: The cumulative function can never decrease as x increases.
- Bounded:
Range of CDF is always between 0 and 1, because CDF represents total probability, which cannot be negative or greater than 1.
\(\lim_{x \to -\infty} F_X(x) = 0\); as \(x \to -\infty\), the event \((X \le x)\) becomes an impossible event,
i.e \(P(X \le x) = 0\)
\(\lim_{x \to \infty} F_X(x) = 1\); as \(x \to \infty\), the event \((X \le x)\) includes all possible outcomes, making sure that \(P(X \le x) = 1\)
- Right Continuous:
Function’s value at a point is same as the limit of the function, as we approach that point from right hand side (RHS).
Note: The reason for only right continuity is how the cumulative distribution function is defined (with \(\le\)); there may be a jump on the left side of the point.
\(\lim_{h \to 0^+} F(x+h) = F(x)\)
For example:
In the above coin toss CDF example, if we approach X=1 from right, say 1.001, 1.01 etc, the value of \(F(X) = P(X \leq x) = F(1) = 3/4\).
But, if we approach \(X=1\) from left, say 0.99, 0.999 etc, the value of \(F(X) = 1/4\), as these values do NOT yet include the value of the probability at \(X=1\).
Value of the probability of a random variable X, at any given value x, is calculated by summing up all the probabilities for values \(\le x\).
\( CDF = F_X(x) = \sum_{x_i \le x} P(X=x_i) \), where \(P(X=x_i)\) is the Probability Mass Function (PMF) at \(x_i\)
In the above coin toss example -
Height of the jump at ‘x’ = Probability at that value ‘x’.
e.g: Jump at (x=1) = 1/2 = Probability at (x=1).
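The “CDF = running sum of the PMF, jumps = PMF” relation can be computed directly for the two-coin-toss example; a small Python sketch (names are ours):

```python
from itertools import accumulate

# PMF of X = number of heads in 2 fair tosses:
# P(X=0) = 1/4, P(X=1) = 1/2, P(X=2) = 1/4.
pmf = {0: 0.25, 1: 0.5, 2: 0.25}

# CDF is the running (cumulative) sum of the PMF over sorted values.
values = sorted(pmf)
cdf = dict(zip(values, accumulate(pmf[v] for v in values)))
# cdf == {0: 0.25, 1: 0.75, 2: 1.0}

# The jump of the CDF at x equals the PMF at x.
jump_at_1 = cdf[1] - cdf[0]  # 0.5 == P(X=1)
```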
CDF for continuous random variable is calculated by integrating the probability density function (PDF) from \(-\infty\) to the given value \(x\).
\( CDF = F_X(x) = \int_{-\infty}^{x} f(x) \,dx \), where \(f(x)\) is the Probability Density Function (PDF) of random variable.
Note: We can also say that PDF is the derivative of CDF for continuous random variable.
\(PDF = f(x) = F_X'(x) = \frac{dF_X(x)}{dx} \)
Read more about Integration
End of Section
5 - Probability Mass Function
It assigns a “non-zero” mass or probability to a specific countable outcome.
Note: Called ‘mass’ because probability is concentrated at a single discrete point.
\(PMF = P(X=x)\)
e.g: Bernoulli, Binomial, Multinomial, Poisson etc.
Commonly visualised as a bar chart.
Note: PMF = Jump at a given point in CDF.
\(PMF = P_X(X=x_i) = F_X(x_i) - F_X(x_{i-1})\)

- Non-Negative: Probability of any value ‘x’ must be non-negative i.e \(P(X=x) \ge 0 ~\forall x~\).
- Sum = 1: Sum of probabilities of all possible outcomes must be 1.
\( \sum_{x} P(X=x) = 1 \)
- For any value that the discrete random variable can NOT take, the probability must be zero.
p = Probability of success
1-p = Probability of failure
Mean = p
Variance = p(1-p)
Note: A single trial that adheres to these conditions is called a Bernoulli trial.
\(PMF, P(x) = p^x(1-p)^{1-x}\), where \(x \in \{0,1\}\)
- Toss a coin, we get heads or tails.
- Result of a test, pass or fail.
- Machine learning, binary classification model.
It extends the Bernoulli distribution by modeling the number of successes that occur over a fixed number of
independent trials.
n = Number of trials
k = Number of successes
p = Probability of success
Mean = np
Variance = np(1-p)
\(PMF, P(x=k) = \binom{n}{k}p^k(1-p)^{n-k}\), where \(k \in \{0,1,2,3,...,n\}\)
\(\binom{n}{k} = \frac{n!}{k!(n-k)!}\) i.e number of ways to achieve ‘k’ successes in ‘n’ independent trials.
Note: Bernoulli is a special case of Binomial distribution where n = 1.
Also read about Multinomial distribution i.e where number of possible outcomes is > 2.
- Counting number of heads(success) in ’n’ coin tosses.
- Trials are independent.
- Probability of success remains constant for every trial.

Total number of outcomes in 3 coin tosses = 2^3 = 8
Desired outcomes i.e 2 heads in 3 coin tosses = \(\{HHT, HTH, THH\}\) = 3
Probability of getting exactly 2 heads in 3 coin tosses = \(\frac{3}{8}\) = 0.375
Now, let’s solve another question using the binomial distribution formula: winning a lottery exactly once in 10 independent attempts, where the probability of winning each attempt is 1/3.
Number of successes, k = 1
Number of trials, n = 10
Probability of success, p = 1/3
Probability of winning the lottery exactly once, \(P(k=1) = \binom{10}{1}(1/3)^1(2/3)^9 \approx 0.0867\)
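Both worked examples can be checked with a few lines of Python using `math.comb` (function name `binom_pmf` is ours):

```python
from math import comb

def binom_pmf(k, n, p):
    """P(X = k) for X ~ Binomial(n, p): C(n,k) * p^k * (1-p)^(n-k)."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)

# Exactly 2 heads in 3 fair-coin tosses: 3/8 = 0.375.
p_2_heads = binom_pmf(2, 3, 0.5)

# Lottery example: exactly k=1 win in n=10 attempts with p=1/3.
p_lottery = binom_pmf(1, 10, 1 / 3)
```

The first value matches the direct count (3 favourable outcomes out of 8); the second is approximately 0.0867.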
It expresses the probability of an event happening a certain number of times ‘k’ within a fixed interval of time.
Given that:
- Events occur with a known constant average rate.
- Occurrence of an event is independent of the time since the last event.
Parameters:
\(\lambda\): Expected number of events per interval
\(k\) = Number of events in the same interval
Mean = \(\lambda\)
Variance = \(\lambda\)
PMF = Probability of occurrence of ‘k’ events in the interval:
\[ P(X=k) = \dfrac{\lambda^k e^{-\lambda}}{k!} \]
Note: Useful to count data where total population size is large but the probability of an individual event is small.
- Model the number of customers arrival at a service center per hour.
- Number of website clicks in a given time period.

Expected (average) number of emails per hour, \(\lambda\) = 5
Probability of receiving exactly k=3 emails in the next hour = \( \dfrac{5^3 e^{-5}}{3!} \approx 0.14 \)
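The email example can be evaluated directly from the Poisson PMF; a minimal Python sketch (function name is ours):

```python
from math import exp, factorial

def poisson_pmf(k, lam):
    """P(X = k) for X ~ Poisson(lam): lam^k * e^(-lam) / k!."""
    return lam**k * exp(-lam) / factorial(k)

# Exactly 3 emails in the next hour when the average rate is 5 per hour.
p_3_emails = poisson_pmf(3, 5)  # ~0.14
```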
End of Section
6 - Probability Density Function
This is a function used for continuous random variables to describe the likelihood of the variable taking on a value
within a specific range or interval.
Since, at any given point the probability of a continuous random variable is zero,
we find the probability within a given range.
Note: Called ‘density’ because probability is spread continuously over a range of values
rather than being concentrated at a single point as in PMF.
e.g: Uniform, Gaussian, Exponential, etc.
Note: PDF is a continuous function \(f(x)\).
It is also the derivative of Cumulative Distribution Function (CDF) \(F_X(x)\)
\(PDF = f(x) = F_X'(x) = \frac{dF_X(x)}{dx} \)
- Non-Negative: Function must be non-negative everywhere i.e \(f(x) \ge 0 \forall x\).
- Sum = 1: Total area under curve must be equal to 1.
\( \int_{-\infty}^{\infty} f(x) \,dx = 1\)
- Probability of a continuous random variable in the range [a,b] is given by -
\( P(a \le x \le b) = \int_{a}^{b} f(x) \,dx\)
Consider a line segment/interval from \(\Omega = [0,2] \)
Random variable \(X(\omega) = \omega\)
i.e \(X(1) = 1 ~and~ X(1.1) = 1.1 \)

Note: If we know the PDF of a continuous random variable, then we can find the probability of any given region/interval by calculating the area under the curve.
All the outcomes within the given range are equally likely to occur.
Also known as ‘fair’ distribution.
Note: This is a natural starting point to understand randomness in general.
Mean = Median = \( \frac{a+b}{2} ~if~ x \in [a,b] \)
Variance = \( \frac{(b-a)^2}{12} \)
Standard uniform distribution: \( X \sim U(0,1) \)
PDF in terms of mean (\(\mu\)) and standard deviation (\(\sigma\)) -
\[ f(x) = \dfrac{1}{b-a} = \dfrac{1}{2\sqrt{3}\,\sigma}, \quad x \in [\mu - \sqrt{3}\sigma,~ \mu + \sqrt{3}\sigma] \]
since \( Var[X] = \dfrac{(b-a)^2}{12} = \sigma^2 \implies b-a = 2\sqrt{3}\,\sigma \).
Random number generator that generates a random number between 0 and 1.


It is a continuous probability distribution characterized by a symmetrical, bell-shaped curve with
most data clustered around the central average, with frequency of values decreasing as they move away from the center.
- Most outcomes are average; extremely low or extremely high values are rare.
- Characterised by mean and standard deviation/variance.
- Peak at mean = median, symmetric around mean.
Note: Most important and widely used distribution.
\[ X \sim N(\mu, \sigma^2) \] $$ \begin{aligned} PDF = f(x) = \dfrac{1}{\sqrt{2\pi}\sigma}e^{-\dfrac{(x-\mu)^2}{2\sigma^2}} \\ \end{aligned} $$
Mean = \(\mu\)
Variance = \(\sigma^2\)
Standard normal distribution:
\[ Z \sim N(0,1) ~i.e~ \mu = 0, \sigma^2 = 1 \]
Any normal distribution can be standardized using the Z-score transformation:
\[ Z = \dfrac{X - \mu}{\sigma} \]
- Human height, IQ scores, blood-pressure etc.
- Measurement of errors in scientific experiments.


- 68.27% of the data lie within 1 standard deviation of the mean i.e \(\mu \pm \sigma\)
- 95.45% of the data lie within 2 standard deviations of the mean i.e \(\mu \pm 2\sigma\)
- 99.73% of the data lie within 3 standard deviations of the mean i.e \(\mu \pm 3\sigma\)
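The 68-95-99.7 rule follows from the standard normal CDF; for \(Z \sim N(0,1)\), \(P(|Z| \le k) = \operatorname{erf}(k/\sqrt{2})\), which `math.erf` computes directly (helper name is ours):

```python
from math import erf, sqrt

def within_k_sigma(k):
    """Fraction of a normal distribution within k standard deviations
    of its mean: P(|Z| <= k) = erf(k / sqrt(2))."""
    return erf(k / sqrt(2))

one_sigma = within_k_sigma(1)    # ~0.6827
two_sigma = within_k_sigma(2)    # ~0.9545
three_sigma = within_k_sigma(3)  # ~0.9973
```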
It is used to model the amount of time until a specific event occurs.
Given that:
- Events occur with a known constant average rate.
- Occurrence of an event is independent of the time since the last event.
Parameters:
Rate parameter: \(\lambda\): Average number of events per unit time
Scale parameter: \(\mu ~or~ \beta \): Mean time between events
Mean = \(\frac{1}{\lambda}\)
Variance = \(\frac{1}{\lambda^2}\)
\[ \lambda = \dfrac{1}{\beta} = \dfrac{1}{\mu}\]


Mean time to serve 1 customer = \(\mu\) = 4 minutes
So, \(\lambda\) = average number of customers served per unit time = \(1/\mu\) = 1/4 = 0.25 per minute
Probability of serving a customer in less than 3 minutes can be found using the CDF, \(F(x) = 1 - e^{-\lambda x}\):
\( P(X < 3) = 1 - e^{-0.25 \times 3} = 1 - e^{-0.75} \approx 0.53 \)
So, the probability of a customer being served in less than 3 minutes is approximately 53%.
In this case x = 2 minutes and \(\lambda = 0.25\), so
\( P(X > 2) = e^{-0.25 \times 2} = e^{-0.5} \approx 0.61 \)
So, the probability of a customer being served in greater than 2 minutes is approximately 61%.
Probability of waiting for an additional period of time for an event to occur is independent of
how long you have already waited.
e.g: Lifetime of electronic items follow exponential distribution.
- Probability of a computer part failing in the next 1 hour is the same regardless of whether
it has been working for 1 day or 1 year or 5 years.
Note: Memoryless property makes exponential distribution particularly useful for -
- Modeling systems that do not experience ‘wear and tear’; where failure is due to a constant random rate rather than degradation over time.
- Also, useful for ‘reliability analysis’ of electronic systems where a ‘random failure’ model is more appropriate than
a ‘wear out’ model.
Task: We want to find the probability of the electronic item lasting for \(x > t+s \) days,
given that it has already lasted for \(x>t\) days.
This is a ‘conditional probability’.
Since \( P(X > t) = e^{-\lambda t} \), we want to find:
\[ P(X > t+s \mid X > t) = \dfrac{P(X > t+s)}{P(X > t)} = \dfrac{e^{-\lambda (t+s)}}{e^{-\lambda t}} = e^{-\lambda s} = P(X > s) \]
Hence, probability that the electronic item will survive for an additional time ’s’ days
is independent of the time ’t’ days it has already survived.
Poisson distribution models the number of events occurring in a fixed interval of time,
given a constant average rate \(\lambda \).
Exponential distribution models the time interval between those successive events.
- 2 faces of the same coin.
- \(\lambda_{poisson}\) is identical to rate parameter \(\lambda_{exponential}\).
Note: If the number of events in a given interval follow a Poisson distribution, then the waiting time between those events will necessarily follow an Exponential distribution.
Lets see the proof for the above statement.
Poisson Case:
The probability of observing exactly ‘k’ events in a time interval of length ’t’
with an average rate of \( \lambda \) events per unit time is given by -
\[ P(N(t) = k) = \dfrac{(\lambda t)^k e^{-\lambda t}}{k!} \]
The event that the waiting time until the next event is \(> t\) is the same as observing ‘0’ events in that interval.
=> We can use the PMF of the Poisson distribution with k=0:
\[ P(N(t) = 0) = \dfrac{(\lambda t)^0 e^{-\lambda t}}{0!} = e^{-\lambda t} \]
Exponential Case:
Now, let’s consider the exponential distribution that models the waiting time ‘T’ until the next event.
The event \((T > t)\) is the same as observing ‘0’ events in that interval, so:
\[ P(T > t) = e^{-\lambda t} \implies F_T(t) = P(T \le t) = 1 - e^{-\lambda t} \]
which is exactly the CDF of the exponential distribution.
Observation:
Exponential distribution is a direct consequence of Poisson distribution probability of ‘0’ events.
Using Poisson:
Average failure rate, \(\lambda\) = 1/20 = 0.05
Time interval, t = 10 hours
Number of events, k = 0
Average number of events in interval, (\(\lambda t\)) = (1/20) * 10 = 0.5
Probability of having NO failures in the next 10 hours = \( P(k=0) = e^{-\lambda t} = e^{-0.5} \approx 0.6065 \)
Using Exponential:
What is the probability that the wait time until the next failure > 10 hours?
Waiting time, T > t = 10 hours.
Average number of failures per hour, \(\lambda\) = 1/20 per hour
\( P(T > 10) = e^{-\lambda t} = e^{-0.5} \approx 0.6065 \)
Therefore, we have seen that this problem can be solved using both Poisson and Exponential distribution.
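The agreement between the two views can be confirmed in a couple of lines (the `1/20` failure rate is from the example above):

```python
from math import exp

# Machine fails on average once every 20 hours -> lam = 0.05 per hour.
lam, t = 1 / 20, 10

# Poisson view: P(0 events in t hours), with mean number of events lam * t.
p_poisson = (lam * t) ** 0 * exp(-lam * t)  # the / 0! factor is 1

# Exponential view: P(waiting time until next failure T > t).
p_exponential = exp(-lam * t)

# Both equal e^(-0.5) ~ 0.6065.
```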
End of Section
7 - Expectation
Long run average of the outcomes of a random experiment.
When we talk about ‘Expectation’ we mean - on an average.
- Value we expect to get when we repeat an experiment many times and take an average of the results.
- Measures the central tendency of a data distribution.
- Similar to ‘Center of Mass’ in Physics.
Note: If the probability of each outcome is different, then we take a weighted average.
Discrete Case:
\(E[X] = \sum_{i=1}^{n} x_i.P(X = x_i) \)
Continuous Case:
\(E[X] = \int_{-\infty}^{\infty} x.f(x) dx \)
Toss a fair coin; if it’s a head, you win Rs. 100, and
if it’s a tail, then you lose Rs. 50. What is the expected value of the amount that you will win per toss?
Possible outcomes are: \(x_1 = 100 ,~ x_2 = -50 \)
Probability of each outcome is \(P(X = 100) = 0.5,~ P(X = -50) = 0.5 \)
\(E[X] = 100 \times 0.5 + (-50) \times 0.5 = 25\)
Therefore, the expected value of the amount that you will win per toss is Rs. 25 in the long run.
It is the average of the squared differences from the mean.
- Measures the spread or variability in the data distribution.
- If the variance is low, then the data is clustered around the mean i.e less variability;
and if the variance is high, then the data is widely spread out i.e high variability.
Variance in terms of expected value, where \(E[X] = \mu\), is given by:
\[ \begin{aligned} Var[X] &= E[(X - E[X])^2] \\ &= E[X^2 + E[X]^2 - 2XE[X]] \\ &= E[X^2] + E[X]^2 - 2E[X]E[X] \\ &= E[X^2] + E[X]^2 - 2E[X]^2 \\ Var[X] &= E[X^2] - E[X]^2 \\ \end{aligned} \]
Note: This is the computational formula for variance, as it is easier to calculate than the average of squared distances from the mean.
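Both \(E[X]\) and the computational variance formula can be applied to the coin-game example from above; a short Python sketch (names are ours):

```python
# PMF of the coin game: win Rs. 100 on heads, lose Rs. 50 on tails.
pmf = {100: 0.5, -50: 0.5}

# E[X] = sum of x * P(X = x): the probability-weighted average.
mean = sum(x * p for x, p in pmf.items())        # 25

# Computational formula: Var[X] = E[X^2] - E[X]^2.
ex2 = sum(x * x * p for x, p in pmf.items())     # E[X^2] = 6250
variance = ex2 - mean**2                         # 5625
```

So the long-run average winning is Rs. 25, with a large spread (standard deviation Rs. 75) around it.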
PDF of continuous uniform random variable =
$$ \begin{aligned} f_X(x) &= \begin{cases} \dfrac{1}{b-a}, & x \in [a,b] \\ 0, & \text{otherwise.} \end{cases} \\ \end{aligned} $$
\[ \begin{aligned} \text{Expected value = mean: } E[X] = \dfrac{b+a}{2} \\[0.5em] Var[X] = E[X^2] - E[X]^2 \end{aligned} \]
We know \(E[X]\) already, now we will calculate \(E[X^2]\):
\[ \begin{aligned} E[X] &= \int_{-\infty}^{\infty} x \cdot f(x) dx \\ E[X^2] &= \int_{-\infty}^{\infty} x^2 \cdot f(x) dx \\ &= \int_{-\infty}^{a} x^2 f(x) dx + \int_{a}^{b} x^2 f(x) dx + \int_{b}^{\infty} x^2 f(x) dx \\ &= 0 + \int_{a}^{b} x^2 f(x) dx + 0 \\ &= \dfrac{1}{b-a} \int_{a}^{b} x^2 dx \\ &= \dfrac{1}{b-a} \cdot \left[\frac{x^3}{3}\right]_{a}^{b} \\ &= \dfrac{1}{b-a} \cdot \frac{b^3 - a^3}{3} \\ &= \dfrac{1}{b-a} \cdot \frac{(b-a)(b^2+ab+a^2)}{3} \\ E[X^2] &= \dfrac{b^2+ab+a^2}{3} \end{aligned} \]
Now, we know both \(E[X]\) and \(E[X^2]\), so we can calculate \(Var[X]\):
\[ \begin{aligned} Var[X] &= E[X^2] - E[X]^2 \\ &= \dfrac{b^2+ab+a^2}{3} - \dfrac{(b+a)^2}{4} \\ &= \dfrac{b^2+ab+a^2}{3} - \dfrac{b^2+2ab+a^2}{4} \\ &= \dfrac{4b^2+4ab+4a^2-3b^2-6ab-3a^2}{12} \\ &= \dfrac{b^2-2ab+a^2}{12} \\ Var[X] &= \dfrac{(b-a)^2}{12} \end{aligned} \]
Covariance is the measure of how 2 variables X & Y vary together.
It gives the direction of the relationship between the variables.
Note: For both direction as well as magnitude, we use Correlation.
Let’s use expectation to compute the covariance of two random variables X & Y:
\[ Cov(X, Y) = E[(X - E[X])(Y - E[Y])] = E[XY] - E[X]E[Y] \]
Note:
- In a multivariate setting, relationships between all the pairs of random variables are summarized in
a square symmetric matrix called the ‘Co-Variance Matrix’ \(\Sigma\).
- Covariance of a random variable with self gives the variance, hence the diagonals of covariance matrix are variances.
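The identity \(Cov(X,Y) = E[XY] - E[X]E[Y]\), and the fact that \(Cov(X,X) = Var(X)\), can be checked on a tiny data set (the sample values below are our own illustration):

```python
# Two small samples where Y = 2 * X, so they increase together.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 4.0, 6.0, 8.0]

def mean(v):
    return sum(v) / len(v)

def cov(a, b):
    """Population covariance via Cov(A, B) = E[AB] - E[A]E[B]."""
    return mean([x * y for x, y in zip(a, b)]) - mean(a) * mean(b)

cov_xy = cov(xs, ys)  # positive: X and Y move in the same direction
var_x = cov(xs, xs)   # covariance of X with itself is its variance
```

Here `cov_xy` is positive (the variables rise together), and `var_x` equals the population variance of `xs`, which is what sits on the diagonal of a covariance matrix.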
End of Section
8 - Moment Generating Function
They are statistical measures that describe the characteristics of a probability distribution,
such as its central tendency, spread/variability, and asymmetry.
Note: We will discuss ‘raw’ and ‘central’ moments.
Important moments:
- Mean : Central tendency
- Variance : Spread/variability
- Skewness : Asymmetry
- Kurtosis : Tailedness
It is a function that simplifies computation of moments, such as, the mean, variance, skewness, and kurtosis,
by providing a compact way to derive any moment of a random variable through differentiation.
- Provides an alternative to PDFs and CDFs of random variables.
- A powerful property of the MGF is that its \(n^{th}\) derivative, when evaluated at \((t=0)\) = the \(n^{th}\) moment of the random variable.
Not all probability distributions have MGFs; sometimes, integrals may not converge and moment may not exist.
Since \(MGF = M_X(t) = E[e^{tX}] \), expanding \(e^{tX}\) as a Taylor series, we can write the MGF as:
\[ M_X(t) = E\left[1 + tX + \frac{t^2X^2}{2!} + \frac{t^3X^3}{3!} + \cdots\right] = 1 + tE[X] + \frac{t^2}{2!}E[X^2] + \frac{t^3}{3!}E[X^3] + \cdots \]
Say, \( E[X^n] = m_n \), where \(m_n\) is the \(n^{th}\) moment of the random variable, then:
\[ M_X(t) = 1 + \frac{t}{1!}m_1 + \frac{t^2}{2!}m_2 + \frac{t^3}{3!}m_3 + \cdots + \frac{t^n}{n!}m_n + \cdots \]
Important: Differentiating \(M_X(t)\) with respect to \(t\), \(i\) times, and setting \(t=0\) gives us the \(i^{th}\) moment of the random variable.
\[ \frac{dM_X(t)}{dt} = M'_X(t) = 0 + m_1 + \frac{2t}{2!}m_2 + \frac{3t^2}{3!}m_3 + \cdots + \frac{nt^{n-1}}{n!}m_n + \cdots \]
Set \(t=0\), to get the first moment:
\[ M'_X(0) = m_1 \]
Similarly, second moment = \(M''_X(0) = m_2 = E[X^2]\)
And, \(n^{th}\) derivative at \(t=0\): \(M^{(n)}_X(0) = m_n = E[X^n]\)
e.g: \( Var[X] = M''_X(0) - (M'_X(0))^2 \)
Note: If we know the MGF of a random variable, then we do NOT need to do integration or summation to find the moments.
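The “derivatives at \(t=0\) give moments” property can be verified numerically with finite differences, using a distribution whose MGF has a simple closed form; the Bernoulli(p) choice below is our own example (its MGF is \(M(t) = (1-p) + p\,e^t\), so \(E[X] = E[X^2] = p\)):

```python
from math import exp

p = 0.3  # Bernoulli success probability (illustrative choice)

def mgf(t):
    """MGF of X ~ Bernoulli(p): E[e^{tX}] = (1-p) + p * e^t."""
    return (1 - p) + p * exp(t)

h = 1e-4
# Central finite differences approximate M'(0) and M''(0):
first_moment = (mgf(h) - mgf(-h)) / (2 * h)             # ~ E[X] = p
second_moment = (mgf(h) - 2 * mgf(0) + mgf(-h)) / h**2  # ~ E[X^2] = p
variance = second_moment - first_moment**2              # ~ p(1-p) = 0.21
```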
Continuous:
\[ M_X(t) = E[e^{tX}] = \int_{-\infty}^{\infty} e^{tx} f(x)\, dx \]
Discrete:
\[ M_X(t) = E[e^{tX}] = \sum_{i=1}^{\infty} e^{tx_i} P(X=x_i) \]
End of Section
9 - Joint & Marginal
It describes the probability of 2 or more random variables occurring simultaneously.
- The random variables can be from different distributions, such as, discrete and continuous.
Joint CDF:
\[ F_{X,Y}(x,y) = P(X \le x, Y \le y) \]
Discrete Case:
\[ F_{X,Y}(x,y) = \sum_{x_i \le x} \sum_{y_j \le y} P(X=x_i, Y=y_j) \]
Continuous Case:
\[ F_{X,Y}(x,y) = \int_{-\infty}^{x} \int_{-\infty}^{y} f_{X,Y}(u,v)\, dv\, du \]
Joint PMF:
\[ P_{X,Y}(x,y) = P(X=x, Y=y) \]
Key Properties:
- \(P(X = x, Y = y) \ge 0 ~ \forall (x,y) \)
- \( \sum_{i} \sum_{j} P(X = x_i, Y = y_j) = 1 \)
Joint PDF:
\[ P(a \le X \le b,~ c \le Y \le d) = \int_{a}^{b} \int_{c}^{d} f_{X,Y}(x,y)\, dy\, dx \]
Key Properties:
- \(f_{X,Y}(x,y) \ge 0 ~ \forall (x,y) \)
- \( \int_{-\infty}^{\infty} \int_{-\infty}^{\infty} f_{X,Y}(x,y) dy dx = 1 \)
- If we consider 2 random variables, say, height(X) and weight(Y), then the joint distribution will tell us the probability of finding a person having a particular height and weight.
There are 2 bags; bag_1 has 2 red balls & 3 blue balls, bag_2 has 3 red balls & 2 blue balls.
A ball is picked at random from each bag, such that both draws are independent of each other.
Let’s use this example to understand joint probability.

Let A & B be discrete random variables associated with the outcome of the ball drawn from first and second bags
respectively.
| | A = Red | A = Blue |
|---|---|---|
| B = Red | 2/5*3/5 = 6/25 | 3/5*3/5 = 9/25 |
| B = Blue | 2/5*2/5 = 4/25 | 3/5*2/5 = 6/25 |
Since, the draws are independent, joint probability = P(A) * P(B)
Each of the 4 cells in above table shows the probability of combination of results from 2 draws or joint probability.
It describes the probability distribution of an individual random variable in a joint distribution,
without considering the outcomes of other random variables.
- If we have the joint distribution, then we can get the marginal distribution of each random variable from it.
- Marginal probability equals summing the joint probability across other random variables.
Marginal CDF:
We know that Joint CDF = \( F_{X,Y}(x,y) = P(X \le x, Y \le y) \)
Marginal CDF is obtained by letting the other variable go to \(\infty\):
\[ F_X(a) = F_{X,Y}(a, \infty) = P(X \le a, Y < \infty) = P(X \le a) \]
Discrete Case:
\[ F_X(a) = \sum_{x_i \le a} \sum_{j} P(X=x_i, Y=y_j) \]
Continuous Case:
\[ F_X(a) = \int_{-\infty}^{a} \int_{-\infty}^{\infty} f_{X,Y}(u,y)\, dy\, du \]
Law of Total Probability
We know that Joint Probability Distribution = \( P(X=x, Y=y) \)
The events \((Y=y)\) partition the sample space, such that:
- \( (Y=y_i) \cap (Y=y_j) = \emptyset \) for all \(i \neq j\)
- \( (Y=y_1) \cup (Y=y_2) \cup ... \cup (Y=y_n) = \Omega \)
From Law of Total Probability, we get:
Marginal PMF:
\[ P_X(x) = \sum_{y} P(X=x, Y=y) \]
Marginal PDF:
\[ f_X(x) = \int_{-\infty}^{\infty} f_{X,Y}(x,y)\, dy \]
X: Roll a die ; \( \Omega = \{1,2,3,4,5,6\} \)
Y: Toss a coin ; \( \Omega = \{H,T\} \)
Joint PMF = \( P_{X,Y}(x,y) = P(X=x, Y=y) = 1/6*1/2 = 1/12\)
Marginal PMF of X = \( P_X(x) = \sum_{y \in \{H,T\}} P_{X,Y}(x,y) = 1/12 + 1/12 = 1/6\)
=> Marginally, X is uniform over 1-6 i.e a fair die.
Marginal PMF of Y = \( P_Y(y) = \sum_{x=1}^{6} P_{X,Y}(x,y) = 6 \cdot (1/12) = 1/2 \)
=> Marginally, Y is uniform over H,T i.e a fair coin.
\( X \sim U(0,1) \)
\( Y \sim U(0,1) \)
Marginal PDF = \(f_X(x) = \int_{-\infty}^{\infty} f_{X,Y}(x,y) dy \)
Joint PDF (with X and Y independent) =
\[ f_{X,Y}(x,y) = \begin{cases} 1 & \text{if } (x,y) \in [0,1]^2 \\ 0 & \text{otherwise} \end{cases} \]
Marginal PDF =
\[ \begin{aligned} f_X(x) &= \int_{0}^{1} f_{X,Y}(x,y) dy \\ &= \int_{0}^{1} 1 dy \\ &= 1 \\ f_X(x) &= \begin{cases} 1 & \text{if } x \in [0,1] \\ 0 & \text{otherwise } \end{cases} \end{aligned} \]
There are 2 bags; bag_1 has 2 red balls & 3 blue balls, bag_2 has 3 red balls & 2 blue balls.
A ball is picked at random from each bag, such that both draws are independent of each other.
Let’s use this example to understand marginal probability.

Let A & B be discrete random variables associated with the outcome of the ball drawn from first and second bags
respectively.
| | A = Red | A = Blue | P(B) (Marginal) |
|---|---|---|---|
| B = Red | 2/5*3/5 = 6/25 | 3/5*3/5 = 9/25 | 6/25 + 9/25 = 15/25 = 3/5 |
| B = Blue | 2/5*2/5 = 4/25 | 3/5*2/5 = 6/25 | 4/25 + 6/25 = 10/25 = 2/5 |
| P(A) (Marginal) | 6/25 + 4/25 = 10/25 = 2/5 | 9/25 + 6/25 = 15/25 = 3/5 | 1 |
We can see from the table above - P(A=Red) is the sum of joint distribution over all possible values of B i.e Red & Blue.
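The marginalization in the table can be reproduced by storing the joint distribution as a dictionary and summing over the other variable; a short Python sketch with exact fractions (data layout is ours):

```python
from fractions import Fraction

F = Fraction
# Joint distribution of the two-bag example; keys are (A, B).
# Draws are independent, so each cell is P(A) * P(B).
joint = {
    ("Red", "Red"): F(2, 5) * F(3, 5),
    ("Blue", "Red"): F(3, 5) * F(3, 5),
    ("Red", "Blue"): F(2, 5) * F(2, 5),
    ("Blue", "Blue"): F(3, 5) * F(2, 5),
}

# Marginals: sum the joint probabilities over the other variable.
p_a_red = sum(p for (a, _), p in joint.items() if a == "Red")    # 2/5
p_b_blue = sum(p for (_, b), p in joint.items() if b == "Blue")  # 2/5
total = sum(joint.values())                                      # 1
```

Summing any row or column of the joint table recovers the corresponding marginal, and all four cells together sum to 1.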
It measures the probability of an event occurring given that another event has already happened.
- It provides a way to update our belief about the likelihood based on new information.
P(A, B) = Joint Probability of A and B
P(B) = Marginal Probability of B
=> Conditional Probability = Joint Probability / Marginal Probability
\[ P(A \mid B) = \dfrac{P(A, B)}{P(B)} \]
Conditional CDF:
\[ F_{X \mid Y}(x \mid y) = P(X \le x \mid Y = y) \]
Discrete Case - Conditional PMF:
\[ P_{X \mid Y}(x \mid y) = \dfrac{P(X=x, Y=y)}{P_Y(y)} \]
Continuous Case - Conditional PDF:
\[ f_{X \mid Y}(x \mid y) = \dfrac{f_{X,Y}(x,y)}{f_Y(y)} \]
- Generative machine learning models, such as, GANs, learn the conditional distribution of pixels, given the style of input image.
Note: We only have information about the joint and marginal probabilities.
What is the conditional probability of drawing a red ball in the first draw, given that a blue ball is drawn in second draw?

Let A & B be discrete random variables associated with the outcome of the ball drawn from first and second bags
respectively.
A = Red ball in first draw
B = Blue ball in second draw.
| | A = Red | A = Blue | P(B) (Marginal) |
|---|---|---|---|
| B = Red | 6/25 | 9/25 | 3/5 |
| B = Blue | 4/25 | 6/25 | 2/5 |
| P(A) (Marginal) | 2/5 | 3/5 | 1 |
Therefore, the probability of drawing a red ball in the first draw, given that a blue ball is drawn in the second draw = \( P(A=Red, B=Blue) / P(B=Blue) = (4/25)/(2/5) = 2/5 \).
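The same number drops out of the definition conditional = joint / marginal; a one-line check:

```python
from fractions import Fraction

# From the table: joint P(A=Red, B=Blue) and marginal P(B=Blue).
joint_red_blue = Fraction(4, 25)
marg_b_blue = Fraction(2, 5)

# Conditional probability = joint / marginal.
p_red_given_blue = joint_red_blue / marg_b_blue
print(p_red_given_blue)  # 2/5
```

Since the two draws are independent, conditioning on B changes nothing: the result equals the marginal \(P(A=Red) = 2/5\).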
This gives us the conditional expectation of a random variable X, given another random variable Y=y.
Discrete Case:
Continuous Case:
- Conditional expectation of a person’s weight, given his/her height = 165 cm, will give us the average weight of all people with height = 165 cm.
Applications:
- The linear regression algorithm models the conditional expectation of the target variable ‘Y’, given the input feature variable ‘X’.
- Expectation Maximisation(EM) algorithm is built on conditional expectation.
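As an illustration of conditional expectation estimated from data, the sketch below averages weights within a height group; the (height, weight) samples are purely hypothetical:

```python
# Hypothetical (height_cm, weight_kg) samples, purely illustrative.
people = [(165, 60), (165, 64), (165, 68), (170, 70), (170, 74)]

def cond_expectation(samples, height):
    """E[weight | height = y]: average weight over samples with that height."""
    weights = [w for h, w in samples if h == height]
    return sum(weights) / len(weights)

print(cond_expectation(people, 165))  # 64.0 -> average weight at 165 cm
```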
This gives us the variance of a random variable calculated after taking into account the value(s) of another related variable.
For example:
- Variance of a car’s mileage for city driving might be small, but the variance will be large for a mix of city and highway driving.
Note: Models that take into account the change in variance or heteroscedasticity tend to be more accurate.
End of Section
10 - Independent & Identically Distributed
Let’s revisit and understand the independence of random variables first.
It means that knowing the outcome of one random variable does not impact the probability of the other random variable.
Two random variables X & Y are independent if:
- We know that if 2 random variables X,Y are independent, then their covariance is zero,
since there is no linear relationship between them.
- But the converse may NOT be true, i.e, if the covariance is zero, we canNOT say for sure that the random variables
are independent; they may still be related non-linearly.
Let’s go through few examples to understand the independence of random variables.
For example:
- Toss a coin + Roll a die.
\[ X = \begin{cases} 1 & \text{if Heads} \\ \\ 0 & \text{if Tails} \end{cases} \\ \text{} \\ Y = \{1,2,3,4,5,6\} \\ \text{} \\ \text{ Joint probability is all possible combinations of X and Y i.e 2x6 = 12 } \\[10pt] \text{ Sample space: } \Omega = \{ (H,1), (H,2), (H,3), (H,4), (H,5), (H,6), \\ (T,1), (T,2), (T,3), (T,4), (T,5), (T,6) \} \]
Here, clearly, X and Y are independent.
- Toss a coin twice.
\(X\) = Number of heads = \(\{0,1,2\}\)
\(Y\) = Number of tails = \(\{0,1,2\}\)
X, Y are NOT independent; in fact, they are perfectly dependent, because \(X + Y = 2\), so if we know about one, then we automatically know about the other.
- \(X\) is a continuous uniform random variable \( X \sim U(-1,1) \).
\(Y = 2X\).
Let’s check whether \(E[XY] = E[X]E[Y]\) holds, which is a necessary condition for independence of \(X\) and \(Y\).
Now, lets calculate the value of \(E[XY]\):
From (1) and (2), we can say that:
\[ E[XY] \neq E[X]E[Y] \]
Hence, \(X\) and \(Y\) are NOT independent.
Read more about Integration
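A Monte Carlo sketch of the example above: sampling \(X \sim U(-1,1)\) and setting \(Y = 2X\) gives \(E[XY] \approx 2/3\) while \(E[X]E[Y] \approx 0\):

```python
import random

random.seed(0)
n = 200_000
xs = [random.uniform(-1, 1) for _ in range(n)]
ys = [2 * x for x in xs]

e_x = sum(xs) / n
e_y = sum(ys) / n
e_xy = sum(x * y for x, y in zip(xs, ys)) / n

# Analytically: E[X] = E[Y] = 0, but E[XY] = E[2X^2] = 2/3.
print(e_x * e_y, e_xy)
```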
Similarly, random variables X & Y are said to be identically distributed if they belong to the same probability distribution.
I.I.D assumption for samples(data points) in a dataset means that the samples are:
- Independent, i.e, each sample is independent of the other.
- Identically distributed, i.e, each sample is drawn from the same probability distribution.
- We take the heights of a random sample of people to estimate the average height of the population of a city.
- Here ‘independent’ assumption means that the height of each person in the sample is independent of the other person.
Usually, heights of members of the same family may be highly correlated.
However, for practical purposes, we can assume that all the heights are independent of one another.
- And, for ‘identically distributed’ - we can assume that all the heights are from the same Gaussian distribution with some mean and variance.
End of Section
11 - Convergence
A sequence of random variables \(X_1, X_2, \dots, X_n\) is said to converge in probability
to a known random variable \(X\),
if for every number \(\epsilon >0 \), the following is true:
where,
\(X_n\): is the estimator or sample based random variable.
\(X\): is the known or limiting or target random variable.
\(\epsilon\): is the tolerance level or margin of error.
- Toss a fair coin:
Estimator:
\[ X_n = \begin{cases} \frac{n}{n+1} & \text{, if Head } \\ \\ \frac{1}{n} & \text{, if Tail} \\ \end{cases} \]
Known random variable (Bernoulli):
\[ X = \begin{cases} 1 & \text{, if Head } \\ \\ 0 & \text{, if Tail} \\ \end{cases} \]
Say, tolerance level \(\epsilon = 0.1\).
Then,
If n=5;
So, if n=5, then \(|X_n - X| > (\epsilon = 0.1)\).
\( \implies P(|X_n - X| \ge (\epsilon=0.1)) = 1\).
if n=20;
So, if n=20, then \(|X_n - X| < (\epsilon=0.1)\).
\(\implies P(|X_n - X| \ge (\epsilon=0.1)) = 0\).
\( \implies P(|X_n - X| \ge (\epsilon=0.1)) = 0 ~\forall ~n > 10\)
Therefore,
Similarly, we can prove that if \(\epsilon = 0.01\), then the probability will be equal to 0 for \( n > 100 \).
Note: The task is to check whether the sequence of random variables \(X_1, X_2, \dots, X_n\) converges in probability to a known random variable \(X\) as \(n\rightarrow\infty\).
So, we can conclude that, if \(n > \frac{1}{\epsilon}\), then:
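The coin example above can also be checked numerically: for each \(n\), the deviation \(|X_n - X|\) equals \(1/(n+1)\) on Heads and \(1/n\) on Tails, each with probability 1/2. A small sketch:

```python
def prob_deviation(n, eps):
    """P(|X_n - X| >= eps): the deviation is 1/(n+1) on Heads and 1/n on
    Tails, each branch occurring with probability 1/2."""
    devs = [1 / (n + 1), 1 / n]
    return sum((0.5 for d in devs if d >= eps), 0.0)

eps = 0.1
print(prob_deviation(5, eps))   # 1.0 -> both branches deviate by at least 0.1
print(prob_deviation(20, eps))  # 0.0 -> deviations 1/21 and 1/20 are below 0.1
```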
A sequence of random variables \(X_1, X_2, \dots, X_n\) is said to almost surely converge to a known random variable \(X\),
for \(n \ge 1\), if the following is true:
where,
\(X_n\): is the estimator or sample based random variable.
\(X\): is the known or limiting or target random variable.
If, \(X_n \xrightarrow{Almost ~ Sure} X \), \( \implies X_n \xrightarrow{Probability} X \)
But, converse is NOT true.
Note: Almost Sure convergence is the hardest to satisfy amongst all modes of convergence, such as, convergence in probability,
convergence in distribution, etc.
Read more about Limits
- \(X\) is a random variable such that \(X = \frac{1}{2} \), a constant, i.e \(X_1 = X_2 = \dots = X_n = \frac{1}{2}\).
\(Y_1, Y_2,\dots ,Y_n \) are another sequence of random variables, such that :
\[ Y_1 = X_1 \\[10pt] Y_2 = \frac{X_1 + X_2}{2} \\[10pt] Y_3 = \frac{X_1 + X_2 + X_3}{3} \\[10pt] \dots \\ Y_n = \frac{1}{n} \sum_{i=1}^{n} X_i \xrightarrow{Almost ~ Sure} \frac{1}{2} \]
End of Section
12 - Law of Large Numbers
This law states that given a sequence of independent and identically distributed (IID) samples \(X_1, X_2, \dots, X_n\) from a random variable with finite mean, the sample mean (\(\bar{X_n}\)) converges in probability to the expected value \(E[X]\) or population mean (\( \mu \)).
\[ \lim_{n\rightarrow\infty} P(|\bar{X_n} - E[X]| \ge \epsilon) = 0, \forall ~ \epsilon >0 \\[10pt] \frac{1}{n} \sum_{i=1}^{n} X_i \xrightarrow{Probability} E[X], \text{ as } n \rightarrow \infty \]
Note: Does NOT guarantee that the sample mean will be close to the population mean,
but instead says that - the probability of sample mean being far away from the population mean is low.
Read more about Limits
- Toss a coin a large number of times \('n'\); as \(n \rightarrow \infty\), the proportion of heads will probably be very
close to \(0.5\).
However, it does NOT rule out the possibility of a rare sequence, e.g., getting 10 consecutive heads.
But, the probability of such a rare event is extremely low.
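A quick simulation sketch of the Weak Law for coin tosses; as \(n\) grows, the proportion of heads drifts towards 0.5 (exact values depend on the random seed):

```python
import random

random.seed(42)

def head_proportion(n):
    """Proportion of heads in n simulated fair-coin tosses."""
    return sum(random.random() < 0.5 for _ in range(n)) / n

for n in (100, 10_000, 1_000_000):
    print(n, head_proportion(n))
```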
This law states that given a sequence of independent and identically distributed (IID) samples \(X_1, X_2, \dots, X_n\) from a random variable with finite mean, the sample mean (\(\bar{X_n}\)) converges almost surely to the expected value \(E[X]\) or population mean (\( \mu \)).
\[ P(\lim_{n\rightarrow\infty} \bar{X_n} = E[X]) = 1 \\[10pt] \frac{1}{n} \sum_{i=1}^{n} X_i \xrightarrow{Almost ~ Sure} E[X], \text{ as } n \rightarrow \infty \]
Note:
- It guarantees that the sequence of sample averages itself converges to population mean, with exception of set of outcomes that has probability = 0.
- Almost certain guarantee; Much stronger statement than Weak Law of Large Numbers.
- Toss a coin a large number of times \('n'\); as \(n \rightarrow \infty\), the proportion of heads will converge
to \(0.5\), with probability = 1.
This means that a sequence where the proportion of heads never settles down to 0.5, is a probability = 0 event.
- Almost sure convergence ensures ML model’s reliability by guaranteeing that the average error on a large dataset will
converge to the true error.
Thus, providing confidence that model will perform consistently and accurately on unseen data.
End of Section
13 - Markov's Inequality
It gives an upper bound on the probability that a non-negative random variable exceeds a given value, based on its expected value.
Note: It gives a very loose upper bound.
Read more about Expectation
What is the probability that the restaurant will serve more than 200 customers in the next hour?
Hence, there is at most a 25% chance of serving more than 200 customers.
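As a check, Markov's bound is just \(E[X]/a\); here we assume the restaurant's expected hourly count is 50 customers, the (hypothetical) value consistent with the 25% figure above:

```python
# Markov's inequality for non-negative X: P(X >= a) <= E[X] / a.
def markov_bound(mean, a):
    return mean / a

# Assumed E[X] = 50 customers/hour (hypothetical, inferred from the 25% bound).
print(markov_bound(50, 200))  # 0.25
```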
What is the probability that a randomly selected student gets a score of 90 marks or more?
Hence, there is at most a 78% chance (approx.) that a randomly selected student gets a score of 90 marks or more.
It states that the probability of a random variable deviating from its mean is small if its variance is small.
- It is a more powerful version of Markov’s Inequality.
- It uses variance of the distribution in addition to the expected value or mean.
- Also, it does NOT assume the random variable to be non-negative.
- It uses more information about the data i.e mean and variance.
Note: It gives a tighter upper bound than Markov’s Inequality.
What is the probability that a randomly selected student gets a score of 90 marks or more?
Given the standard deviation of the test score is 10 marks.
Given, the standard deviation of the test score \(\sigma\) = 10 marks.
=> Variance = \(\sigma^2\) = 100
Hence, Chebyshev’s Inequality gives a far tighter upper bound of 25%, compared to Markov’s Inequality’s approximately 78%.
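The two bounds can be compared side by side; here the mean score is taken as 70 marks, the (hypothetical) value consistent with the approx. 78% Markov figure, and the given \(\sigma = 10\):

```python
def markov_bound(mean, a):
    """Markov: P(X >= a) <= E[X] / a, using only the mean."""
    return mean / a

def chebyshev_bound(mean, sigma, a):
    """Chebyshev: P(|X - mean| >= k) <= sigma^2 / k^2, with k = a - mean."""
    k = a - mean
    return sigma**2 / k**2

print(markov_bound(70, 90))         # ~0.78
print(chebyshev_bound(70, 10, 90))  # 0.25
```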
It is an upper bound on the probability that a random variable deviates from its expected value.
It’s an exponentially decreasing bound on the “tail” of a random variable’s distribution,
which can be calculated using its moment generating function.
- It is used for sum or average of independent random variables (not necessarily identically distributed).
- It provides exponentially tighter bounds, better than Chebyshev’s Inequality’s quadratic decay.
- It uses all moments to capture the full shape of the distribution, using the moment generating function(MGF).
Proof:
For the sum of ’n’ independent random variables,
Note: Used to compute how far the sum of independent random variables deviate from their expected value.
Read more about Moment Generating Function
End of Section
14 - Cross Entropy & KL Divergence
It is a measure of the amount of information gained when a specific, individual event occurs, and is defined based on
the probability of that event.
It is defined as the negative logarithm of the probability of the event.
- It is also called ‘Surprisal’.
Note: Logarithm(for base > 1) is a monotonically increasing function, so as x increases, log(x) also increases.
- So, if probability P(x) increases, then surprise factor S(x) decreases.
- Common events have a high probability of occurrence, hence a low surprise factor.
- Rare events have a low probability of occurrence, hence a high surprise factor.
Units:
- The unit of surprise factor with log base 2 is bits.
- with base ’e’ or natural log its nats (natural units of information).
It conveys how much ‘information’ we expect to gain from a random event.
- Entropy is the average or expected value of surprise factor of a random variable.
- More uniform the distribution ⇒ greater average surprise factor, since all possible outcomes are equally likely.
Case 1:
Toss a fair coin; \(P(H) = P(T) = 0.5 = 2^{-1}\)
Surprise factor (Heads or Tails) = \(-log(P(x)) = -log_2(0.5) = -log_2(2^{-1}) = 1~bit\)
Entropy = \( -\sum_{x \in X} P(x)log(P(x)) = 0.5*1 + 0.5*1 = 1 ~bit \)
Case 2:
Toss a biased coin; \(P(H) = 0.9 ,~ P(T) = 0.1 \)
Surprise factor (Heads) = \(-log(P(x)) = -log_2(0.9) \approx 0.15 ~bits \)
Surprise factor (Tails) = \(-log(P(x)) = -log_2(0.1) \approx 3.32 ~bits\)
Entropy = \( -\sum_{x \in X} P(x)log(P(x)) = 0.9*0.15 + 0.1*3.32 \approx 0.47 ~bits\)
Therefore, a biased coin is less surprising on an average than a fair coin, hence lower entropy.
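Both cases can be verified with a few lines of Python:

```python
import math

def entropy_bits(probs):
    """H(P) = -sum p * log2(p); zero-probability outcomes contribute nothing."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

print(entropy_bits([0.5, 0.5]))            # 1.0 bit  (fair coin)
print(round(entropy_bits([0.9, 0.1]), 2))  # 0.47 bits (biased coin)
```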
It is a measure of the average ‘information gain’ or ‘surprise’ when using an imperfect model \(Q\) to encode events
from a true model \(P\).
- It measures how surprised we are on an average, if the true distribution is \(P\), but we predict using another distribution
\(Q\).
A model is trained to classify images as ‘cat’ or ‘dog’. Say, for an input image the true label is ‘cat’,
so the true distribution:
\(P = [1.0 ~(cat), 0.0 ~(dog)]\).
Let’s calculate the cross-entropy for the outputs of 2 models A & B.
Model A:
\(Q_A = [0.8 ~(cat), 0.2 ~(dog)]\)
Cross Entropy = \(H(P, Q_A) = -\sum_{i=1}^n P(x_i)log(Q_A(x_i)) \)
\(= -[1*log_2(0.8) + 0*log_2(0.2)] \approx 0.32 ~bits\)
Note: This is very low cross-entropy, since the predicted value is very close to actual, i.e low surprise.
Model B:
\(Q_B = [0.2 ~(cat), 0.8 ~(dog)]\)
Cross Entropy = \(H(P, Q_B) = -\sum_{i=1}^n P(x_i)log(Q_B(x_i)) \)
\(= -[1*log_2(0.2) + 0*log_2(0.8)] \approx 2.32 ~bits\)
Note: Here the cross-entropy is very high, since the predicted value is quite far from the actual truth, i.e high surprise.
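A minimal sketch reproducing both cross-entropy values:

```python
import math

def cross_entropy_bits(p, q):
    """H(P, Q) = -sum p_i * log2(q_i), skipping outcomes with p_i = 0."""
    return -sum(pi * math.log2(qi) for pi, qi in zip(p, q) if pi > 0)

p = [1.0, 0.0]    # true distribution: the image is a cat
q_a = [0.8, 0.2]  # model A's prediction
q_b = [0.2, 0.8]  # model B's prediction

print(round(cross_entropy_bits(p, q_a), 2))  # 0.32
print(round(cross_entropy_bits(p, q_b), 2))  # 2.32
```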
It measures the information lost when one probability distribution \(Q\) is used to approximate another distribution \(P\).
It quantifies the ‘extra cost’ in bits needed to encode data using the approximate distribution \(Q\) instead of
the true distribution \(P\).
KL Divergence = Cross Entropy(P,Q) - Entropy(P)
\[ \begin{aligned} D_{KL}(P \parallel Q) &= H(P, Q) - H(P) \\ &= -\sum_{i=1}^n P(x_i)log(Q(x_i)) - [-\sum_{i=1}^n P(x_i)log(P(x_i))] \\[10pt] &= \sum_{i=1}^n P(x_i)[log(P(x_i)) - log(Q(x_i)) ] \\[10pt] D_{KL}(P \parallel Q) &= \sum_{i=1}^n P(x_i)log(\frac{P(x_i)}{Q(x_i)}) \\[10pt] \text{ For continuous case: } \\ D_{KL}(P \parallel Q) &= \int_{-\infty}^{\infty} p(x)log(\frac{p(x)}{q(x)})dx \end{aligned} \]
Note:
- If \(P = Q\) ,i.e, P and Q are the same distributions, then KL Divergence = 0.
- KL divergence is NOT symmetrical ,i.e, \(D_{KL}(P \parallel Q) \neq D_{KL}(Q \parallel P)\).
Using the same cat, dog classification problem example as mentioned above.
A model is trained to classify images as ‘cat’ or ‘dog’. Say, for an input image the true label is ‘cat’,
so the true distribution:
\(P = [1.0 ~(cat), 0.0 ~(dog)]\).
Let’s calculate the KL divergence for the outputs of the 2 models A & B.
Let’s calculate entropy first, and we can reuse the cross-entropy values calculated above already.
Entropy: \(H(P) = -\sum_{x \in X} P(x)log(P(x)) = -[1*log_2(1) + 0*log_2(0)] = 0 ~bits\)
Note: \(0*log_2(0) = 0\) is a standard convention, justified by the limit \(\lim_{x \to 0^+} x ~ log_2(x) = 0\), since \(log(0)\) by itself is undefined.
Model A:
\( D_{KL}(P \parallel Q) = H(P, Q) - H(P) = 0.32 - 0 = 0.32 ~bits \)
Model A incurs an additional 0.32 bits of surprise due to its imperfect prediction.
Model B:
\( D_{KL}(P \parallel Q) = H(P, Q) - H(P) = 2.32 - 0 = 2.32 ~bits \)
Model B has much more ‘information loss’ or incurs a higher ‘penalty’ of 2.32 bits, as its prediction was far from the truth.
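Both values computed directly from the definition \(D_{KL} = H(P,Q) - H(P)\):

```python
import math

def entropy_bits(p):
    return -sum(pi * math.log2(pi) for pi in p if pi > 0)

def cross_entropy_bits(p, q):
    return -sum(pi * math.log2(qi) for pi, qi in zip(p, q) if pi > 0)

def kl_bits(p, q):
    """D_KL(P || Q) = H(P, Q) - H(P)."""
    return cross_entropy_bits(p, q) - entropy_bits(p)

p = [1.0, 0.0]  # true label: cat
print(round(kl_bits(p, [0.8, 0.2]), 2))  # 0.32 bits (model A)
print(round(kl_bits(p, [0.2, 0.8]), 2))  # 2.32 bits (model B)
```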
It is a smoothed and symmetric version of the Kullback-Leibler (KL) divergence and is calculated by averaging the
KL divergences between each of the two distributions and their combined average distribution.
- Symmetrical and smoothed version of KL divergence.
- Always finite; KL divergence can be infinite if \( P > 0 ~and~ Q = 0 \) for some outcome.
- Makes JS divergence more stable for ML models where some predicted probabilities may be exactly 0.
Let’s continue the cat and dog image classification example discussed above.
A model is trained to classify images as ‘cat’ or ‘dog’. Say, for an input image the true label is ‘cat’,
so the true distribution:
\(P = [1.0 ~(cat), 0.0 ~(dog)]\).
Step 1: Calculate the average distribution M.
Step 2: Calculate \(D_{KL}(P \parallel M)\).
Step 3: Calculate \(D_{KL}(Q \parallel M)\).
Step 4: Finally, lets put all together to calculate \(D_{JS}(P \parallel Q)\).
Therefore, lower JS divergence value => P and Q are more similar.
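A sketch of the computation, reusing model A's prediction from the earlier example as \(Q\):

```python
import math

def kl_bits(p, q):
    """D_KL(P || Q) in bits; terms with p_i = 0 contribute nothing."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def js_bits(p, q):
    """D_JS(P || Q) = 0.5*D_KL(P || M) + 0.5*D_KL(Q || M), with M = (P+Q)/2.
    M > 0 wherever P or Q is positive, so both KL terms stay finite."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl_bits(p, m) + 0.5 * kl_bits(q, m)

p = [1.0, 0.0]  # true label: cat
q = [0.8, 0.2]  # model A's prediction, reused from the example above
print(round(js_bits(p, q), 2))  # 0.11 bits
```

Note that `js_bits(p, q) == js_bits(q, p)`, unlike KL divergence.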
End of Section
15 - Parametric Model Estimation
It is the process of finding the best-fitting finite set of parameters \(\Theta\) for a model that assumes a
specific probability distribution for the data.
It involves using the dataset to estimate the parameters (like the mean and standard deviation of a normal distribution)
that define the model.
\(P(X \mid \Theta) \) : Probability of seeing data \(X: (X_1, X_2, \dots, X_n) \), given the parameters \(\Theta\) of the
underlying probability distribution from which the data is assumed to be generated.
Goal of estimation:
We observed data, \( D = \{X_1, X_2, \dots, X_n\} \), and we want to infer the unknown parameters \(\Theta\)
of the underlying probability distribution, assuming that the data is generated I.I.D.
Read more about I.I.D
Note: Most of the time, from experience, we know the underlying probability distribution of data, such as, Bernoulli, Gaussian, etc.
There are 2 philosophical approaches to estimate the parameters of a parametric model:
- Frequentist:
- Parameters \(\Theta\) are fixed, but unknown; only the data is random.
- It views probability as the long-run frequency of events in repeated trials; e.g toss a coin ’n’ times.
- It is favoured when the sample size is large.
- For example, Maximum Likelihood Estimation(MLE), Method of Moments, etc.
- Bayesian:
- Parameters \(\Theta\) itself is unknown, so we model it as a random variable with a probability distribution.
- It views probability as a degree of belief that can be updated with new evidence, i.e. data.
Thus, integrating prior knowledge with data to express the uncertainty about the parameters.
- It is favoured when the sample size is small, as it uses prior belief about the data distribution too.
- For example, Maximum A Posteriori Estimation(MAP), Minimum Mean Square Error Estimation(MMSE), etc.
It is the most popular frequentist approach to estimate the parameters of a model.
This method helps us find the parameters \(\Theta\) that make the data most probable.
Likelihood Function:
Say, we have data, \(D = X_1, X_2, \dots, X_n\), which are I.I.D discrete random variables with PMF = \(P_{\Theta}(.)\)
Then, the likelihood function is the probability of observing the data, \(D = \{X_1, X_2, \dots, X_n\}\),
given the parameters \(\Theta\).
Task:
Find the value of parameter \(\Theta_{ML}\) that maximizes the likelihood function.
In order to find the parameter \(\Theta_{ML}\) that maximises the likelihood function,
we need to take the first derivative of the likelihood function with respect to \(\Theta\) and equate it to zero.
But, taking derivative of a product is challenging, so we will take the logarithm on both sides.
Note: Log is a monotonically increasing function, i.e, as x increases, log(x) increases too.
Let us denote the log-likelihood function as \(\bar{L}\).
\[ \begin{aligned} \mathcal{\bar{L}_{X_1, X_2, \dots ,X_n}}(\Theta) &= log [\prod_{i=1}^{n} P_{\Theta}(X_i)] \\ &= \sum_{i=1}^{n} log P_{\Theta}(X_i) \\ \end{aligned} \]
Therefore, Maximum Likelihood Estimation is the parameter \(\Theta_{ML}\) that maximises the log-likelihood function.
\[ \Theta_{ML}(X_1, X_2, \dots, X_n) = \underset{\Theta}{\mathrm{argmax}}\ \bar{L}_{X_1, X_2, \dots ,X_n}(\Theta) \]
Given \(X_1, X_2, \dots, X_n\) are I.I.D. Bernoulli random variables with PMF as below:
Estimate the parameter \(\theta\) using Maximum Likelihood Estimation.
Let, \(n_1 = \) number of 1’s in the dataset.
Likelihood Function:
Log-Likelihood Function:
Maximum Likelihood Estimation:
In order to find the parameter \(\theta_{ML}\), we need to take the first derivative of the log-likelihood function
with respect to \(\theta\) and equate it to zero.
Read more about Derivative
Say, e.g., we have 10 observations for the Bernoulli random variable as: 1,0,1,1,0,1,1,0,1,0.
Then, the parameter \(\Theta_{ML} = \frac{6}{10} = 0.6\) i.e proportion of 1’s.
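The closed-form answer can be cross-checked by a crude grid search over the log-likelihood:

```python
import math

data = [1, 0, 1, 1, 0, 1, 1, 0, 1, 0]  # the 10 observations above
n, n1 = len(data), sum(data)

# Closed form from the derivation: theta_ML = n1 / n.
theta_ml = n1 / n

def log_likelihood(theta):
    """Bernoulli log-likelihood: n1*log(theta) + (n - n1)*log(1 - theta)."""
    return n1 * math.log(theta) + (n - n1) * math.log(1 - theta)

# A coarse grid search over (0, 1) agrees with the closed form.
grid = [i / 100 for i in range(1, 100)]
best = max(grid, key=log_likelihood)
print(theta_ml, best)  # 0.6 0.6
```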
\(x_1, x_2, \dots, x_n\) are the realisations/observations of the random variable.
Estimate the parameters \(\mu\) and \(\sigma\) of the Gaussian distribution using Maximum Likelihood Estimation.
Likelihood Function:
Log-Likelihood Function:
Note: Here, the first term \( -\frac{n}{2} log (2\pi) \) is a constant wrt both \(\mu\) and \(\sigma\), so we can ignore the term.
Maximum Likelihood Estimation:
Instead of finding \(\mu\) and \(\sigma\) that maximises the log-likelihood function,
we can find \(\mu\) and \(\sigma\) that minimises the negative of the log-likelihood function.
Now, let’s differentiate the log-likelihood function wrt \(\mu\) and \(\sigma\) separately to get \(\mu_{ML}, \sigma_{ML}\).
Let’s calculate \(\mu_{ML}\) first by taking the derivative of the log-likelihood function wrt \(\mu\) and equating it to 0.
Read more about Derivative
Similarly, we can calculate \(\sigma_{ML}\) by taking the derivative of the log-likelihood function wrt \(\sigma\)
and equating it to 0.
Note: In general, MLE is biased, i.e, it does NOT give an unbiased estimate of the variance => divides by \(n\) instead of \((n-1)\).
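A small sketch of the biased-vs-unbiased distinction on hypothetical samples:

```python
# Illustrative sketch with hypothetical samples: the MLE of variance divides
# by n (biased), while the unbiased sample variance divides by n - 1.
samples = [4.8, 5.1, 5.3, 4.9, 5.4]
n = len(samples)

mu_ml = sum(samples) / n                                      # MLE of the mean
var_ml = sum((x - mu_ml) ** 2 for x in samples) / n           # MLE, biased
var_unbiased = sum((x - mu_ml) ** 2 for x in samples) / (n - 1)

print(round(mu_ml, 2), round(var_ml, 3), round(var_unbiased, 3))
```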
Bayesian statistics model parameters by updating initial beliefs (prior probabilities) with observed data to form
a final belief (posterior probability) using Bayes’ Theorem.
Instead of a single point estimate, it provides a probability distribution over possible parameter values,
which allows us to quantify uncertainty and yields more robust models, especially with limited data.
Bayes’ Theorem:
Read more about Bayes’ Theorem
\(P(\Theta)\): Prior: Initial distribution of \(\Theta\) before seeing the data.
\(P(X \mid \Theta)\): Likelihood: Conditional distribution of data \(X\), given the parameter \(\Theta\).
\(P(\Theta \mid X)\): Posterior: Conditional distribution of parameter \(\Theta\), given the data \(X\).
\(P(X)\): Evidence: Probability of seeing the data \(X\).
\(x_1, x_2, \dots, x_n\) are the realisations, \(\Theta \in [0, 1] \).
Estimate the parameter \(\Theta\) using Bayesian statistics.
Let, \(n_1 = \) number of 1’s in the dataset.
Prior: \(P(\Theta)\) : \(\Theta \sim U(0, 1) \), i.e, parameter \(\Theta\) comes from a continuous uniform
distribution in the range [0,1].
\(f_{\Theta}(\theta) = 1, \theta \in [0, 1] \)
Likelihood: \(P_{X \mid \Theta}(x \mid \theta) = \theta^{n_1} (1 - \theta)^{n - n_1} \).
Posterior:
\[ f_{\Theta \mid X} (\theta \mid x) = \frac{f_{\Theta}(\theta) P_{X \mid \Theta}(x \mid \theta)}{f_{X}(x)} \\ \]
Note: The most difficult part is to calculate the denominator \(f_{X}(x)\).
So, either we try NOT to compute it at all, or we try to map it to some known functions to make the calculations easier.
We know that we can get the marginal probability by integrating the joint probability over another variable.
Also from conditional probability, we know:
\[ \tag{2} f_{X \mid Y}(x \mid y) = \frac{f_{X,Y}(x,y)}{f_{Y}(y)} \\ => f_{X,Y}(x,y) = f_{Y}(y) f_{X \mid Y}(x \mid y) \]
From equations 1 and 2, we have:
\[ \tag{3} f_{X}(x) = \int_{Y}f_{Y}(y) f_{X \mid Y}(x \mid y)dy \]
Now let’s replace the value of \(f_{X}(x) \) in the posterior from equation 3:
Posterior:
Beta Function:
It is a special mathematical function denoted by B(a, b) or β(a, b) that is defined by the integral formula:
Note: We can see that the denominator of the posterior is of the form of Beta function.
Posterior:
Suppose, in the above example, we are told that the parameter \(\Theta\) is closer to 1 than 0.
How will we incorporate this useful information (apriori knowledge) into our parameter estimation?
Since, we know apriori that \(\Theta\) is closer to 1 than 0, we should take this initial belief into account
to do our parameter estimation.
Prior:
Posterior:
\[ \begin{aligned} f_{\Theta \mid X} (\theta \mid x) &= \frac{f_{\Theta}(\theta) P_{X \mid \Theta}(x \mid \theta)}{f_{X}(x)} \\[10pt] &= \frac{f_{\Theta}(\theta) P_{X \mid \Theta}(x \mid \theta)}{\int_{\Theta}f_{\Theta}(\theta) P_{X \mid \Theta}(x \mid \theta)d\theta} \\[10pt] &= \frac{f_{\Theta}(\theta) P_{X \mid \Theta}(x \mid \theta)}{\int_{0}^1f_{\Theta}(\theta) P_{X \mid \Theta}(x \mid \theta)d\theta} \\[10pt] \text{ We know that: } f_{\Theta}(\theta) = 2\theta, \theta \in [0, 1] \\ &= \frac{2\theta * P_{X \mid \Theta}(x \mid \theta)}{\int_{0}^1 2\theta * P_{X \mid \Theta}(x \mid \theta)d\theta} \\[10pt] & = \frac{2\theta * \theta^{n_1} (1 - \theta)^{n - n_1}}{\int_{0}^1 2\theta * \theta^{n_1} (1 - \theta)^{n - n_1} d\theta} \\[10pt] => f_{\Theta \mid X} (\theta \mid x) &= \frac{\theta^{n_1+1} (1 - \theta)^{n - n_1}}{β(n_1+2, n-n_1+1)} \end{aligned} \]
Note: If we do NOT have enough data, then we should NOT ignore our initial belief.
However, if we have enough data, then the data will override our initial belief and the posterior will be dominated by data.

Note: Bayesian approach gives us a probability distribution of parameter \(\Theta\).
We can use Bayesian Point Estimators, after getting the posterior distribution, to summarize it with a single value for
practical use, such as,
- Maximum A Posteriori (MAP) Estimator
- Minimum Mean Square Error (MMSE) Estimator
It finds the mode(peak) of the posterior distribution.
- MAP has the minimum probability of error, since it picks single most probable value.
Estimate the unknown parameter \(\mu\) using MAP.
Given that :
\(\Theta\) is discrete, with probability 0.5 for both 0 and 1.
=> The Gaussian distribution is equally likely to be centered at 0 or 1.
Variance: \(\sigma^2 = 1\)
Prior:
Likelihood:
Posterior:
We need to find:
\[ \underset{\Theta}{\mathrm{argmax}}\ P_{\Theta \mid X} (\theta \mid x) \]
Taking log on both sides:
Let’s calculate the log-likelihood function first -
\[ \begin{aligned} \log f_{X \mid \Theta}(x \mid \theta) &= \log \prod_{i=1}^n \frac{1}{\sqrt{2\pi}} e^{-\frac{(x_i-\theta)^2}{2}} \\ &= log(\frac{1}{\sqrt{2\pi}})^n + \sum_{i=1}^n \log (e^{-\frac{(x_i-\theta)^2}{2}}), \text{ since \(\sigma = 1\)} \\ &= n\log(\frac{1}{\sqrt{2\pi}}) + \sum_{i=1}^n -\frac{(x_i-\theta)^2}{2} \\ => \tag{2} \log f_{X \mid \Theta}(x \mid \theta) &= n\log(\frac{1}{\sqrt{2\pi}}) - \sum_{i=1}^n \frac{(x_i-\theta)^2}{2} \end{aligned} \]
Here computing \(f_{X}(x)\) is difficult.
So, instead of differentiating above equation wrt \(\Theta\), and equating to 0,
we will calculate the above expression for \(\theta=0\) and \(\theta=1\) and compare the values.
This way we can get rid of the common value \(f_{X}(x)\) in both expressions that is not dependent on \(\Theta\).
When \(\theta=1\):
Similarly, when \(\theta=0\):
So, we can say that \(\theta = 1\) only if :
the value of the above expression for \(\theta = 1\) > the value of the above expression for \(\theta = 0\)
From equation 3 and 4:
Note: If the prior is uniform then \(\Theta_{MAP} = \Theta_{MLE}\), because uniform prior does NOT give any information about the initial bias, and all possibilities are equally likely.
What if, in the above example, we know that the initial belief is not uniform but biased towards 0?
Prior:
Now, let’s compare the log-posterior for both the cases i.e \(\theta=0\) and \(\theta=1\) as we did earlier.
But, note that this time the probabilities for \(\theta=0\) and \(\theta=1\) are different.
So, we can say that \(\theta = 1\) only if :
the value of the above expression for \(\theta = 1\) > the value of the above expression for \(\theta = 0\)
From equation 3 and 4 above:
Therefore, we can see that \(\Theta_{MAP}\) is biased further towards 0.
Note: For a non-uniform prior \(\Theta_{MAP}\) estimate will be pulled towards the prior’s mode.
e.g: Patient has a certain disease or not.
But, what if we want to minimize average magnitude of errors, over time, say predicting a stock’s price?
Just getting to know whether the prediction was right or wrong is not sufficient here.
We also want to know that the prediction was wrong by how much, so that we can minimize the loss over time.
Minimizes the expected value of squared error.
Mean of the posterior distribution is the conditional expectation of parameter \(\Theta\), given the data.
- Posterior mean minimizes the mean squared error.
\(\hat\Theta(X)\): Predicted value.
\(\Theta\): Actual value.
Let’s revisit the above examples that we used for MLE.
Case 1: Uniform Continuous Prior for parameter \(\Theta\) of Bernoulli distribution
Prior:
Posterior:
Let’s calculate the \(\Theta_{MMSE}\):
Similarly, for the second case where the prior is biased towards 1.
Case 2: Prior is biased towards 1 for parameter \(\Theta\) of Bernoulli distribution
Prior:
Posterior:
So, here in this case our \(\Theta_{MMSE}\) is:
- MMSE is the average of the posterior distribution, whereas MAP is the mode/peak.
- If posterior distribution is symmetric and unimodal (only 1 peak), then MAP and MMSE are very close.
- If posterior distribution is skewed and multimodal (many peaks), then MAP and MMSE can differ a lot.
- MMSE considers all the values of the posterior distribution; hence, it is more accurate than MAP, especially for skewed or multimodal distributions.
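As a closing sketch, for the uniform-prior Bernoulli setup of Case 1 the posterior is \(Beta(n_1+1, n-n_1+1)\), so MAP (the mode) and MMSE (the mean) have simple closed forms:

```python
from fractions import Fraction

# With a uniform prior and Bernoulli data (n1 ones out of n), the posterior
# is Beta(n1 + 1, n - n1 + 1): its mode gives MAP, its mean gives MMSE.
def map_estimate(n1, n):
    return Fraction(n1, n)            # mode: (a-1)/(a+b-2) for Beta(a, b)

def mmse_estimate(n1, n):
    return Fraction(n1 + 1, n + 2)    # mean: a/(a+b) for Beta(a, b)

# Reusing the dataset with six 1's out of 10 observations:
print(map_estimate(6, 10))   # 3/5
print(mmse_estimate(6, 10))  # 7/12
```

Here the posterior is unimodal and nearly symmetric, so the two estimates are close (0.6 vs approx. 0.583), as the notes above predict.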
End of Section