
Probability

Probability for AI & ML

This sheet contains all the topics that will be covered for Probability for AI & ML.

Below is the list of topics that will be covered in this section.

  • Probability
  • Sample Space
  • Random Variables
  • Discrete & Continuous Random Variables
  • Probability Distribution
  • Cumulative Distribution Function (CDF)
  • Probability Mass Function (PMF)
  • Probability Density Function (PDF)
  • Bayes’ Theorem
  • Conditional Probability
  • Joint Probability
  • Marginal Probability
  • Independence & Mutual Exclusion
  • Expectation
  • Markov Inequality
  • Chebyshev’s Inequality
  • Chernoff Bound
  • Law of Large Numbers
  • Independent & Identically Distributed Random Variables
  • Cross Entropy
  • Kullback-Leibler Divergence
  • Maximum Likelihood Estimation (MLE)
  • Maximum A Posteriori Estimation (MAP)
  • Minimum Mean Squared Error (MMSE)

End of Section

1 - Introduction to Probability

Introduction to Probability

This section will introduce you to basic terminologies and definitions used in probability for AI & ML.


πŸ’‘ Why do we need to understand what Probability is?
Because the world around us is full of uncertainty, and Probability is
the fundamental language to understand, express and deal with this uncertainty.

For example:

  1. Toss a fair coin, \(P(H) = P(T) = 1/2\)
  2. Roll a die, \(P(1) = P(2) = P(3) = P(4) = P(5) = P(6) = 1/6\)
  3. Email classifier, \(P(spam) = 0.95 ,~ P(not ~ spam) = 0.05\)

πŸ“˜ Probability:
Numerical measure of chance or likelihood that an event will occur.
Range: \([0,1]\)
\(P=0\): Highly unlikely
\(P=1\): Almost certain


πŸ“˜ Sample Space:
Set of all possible outcomes of an experiment.
Symbol: \(\Omega\)
For example:

  1. Toss a fair coin, sample space: \(\Omega = \{H,T\}\)
  2. Roll a die, sample space: \(\Omega = \{1,2,3,4,5,6\}\)
  3. Choose a real number \(x\) from the interval \([2,3]\), sample space: \(\Omega = [2,3]\); size of sample space = \(\infty\)
    Note: There can be infinitely many points between 2 and 3, e.g: 2.21, 2.211, 2.2111, 2.21111, …
  4. Randomly put a point in a rectangular region; size of sample space = \(\infty\)
    Note: There can be infinitely many points in any rectangular region.

\(P(\Omega) = 1\)

πŸ“˜ Event:
A set of outcomes of an experiment, i.e., a subset of the sample space.
\(A, B, \ldots \subseteq \Omega\)
For example:

  1. Toss a fair coin, set of possible outcomes: \(\{H,T\}\)
  2. Roll a die, set of possible outcomes: \(\{1,2,3,4,5,6\}\)
  3. Roll a die, event \(A = \{1,2\} => P(A) = 2/6 = 1/3\)
  4. Email classifier, set of possible outcomes: \(\{spam,not ~spam\}\).

πŸ“˜ Discrete:
The number of potential outcomes from an experiment is countable: the outcomes are distinct and can be listed in a sequence, even if the list is infinite (i.e countably infinite).
For example:

  1. Toss a fair coin, possible outcomes: \(\Omega = \{H,T\}\)
  2. Roll a die, possible outcomes: \(\Omega = \{1,2,3,4,5,6\}\)
  3. Choose a real number \(x\) from the interval \([2,3]\) restricted to 2 decimal places; the sample space is then a finite set.
    Note: There are only 99 such numbers strictly between 2 and 3, i.e from 2.01 to 2.99.
  4. Number of cars passing a specific traffic signal in 1 hour.

πŸ“˜ Continuous:
Potential outcomes from an experiment can take any value within a given range or interval, representing an uncountably infinite set of possibilities.
For example:

  1. A line segment between 2 and 3 - forms a continuum.
  2. Randomly put a point in a rectangular region.

graph TD
    A[Sample Space] --> |Discrete| B(Finite)
    A --> C(Infinite)
    C --> |Discrete| D(Countable)
    C --> |Continuous| E(Uncountable)


πŸ“˜ Mutually Exclusive (Disjoint) Events:
Two or more events that cannot happen at the same time.
No overlapping or common outcomes.
If one event occurs, then the other event does NOT occur.

For example:

  1. Roll a die, sample space: \(\Omega = \{1,2,3,4,5,6\}\)
    Odd outcome = \(A = \{1,3,5\}\)
    Even outcome = \(B = \{2,4,6\}\) are mutually exclusive.

    \(P(A \cap B) = 0\)
    Since, \(P(A \cup B) = P(A) + P(B) - P(A \cap B)\)
    Therefore, \(P(A \cup B) = P(A) + P(B)\)

Note: If we know that event \(A\) has occurred, then we can say for sure that the event \(B\) did NOT occur.

πŸ“˜ Independent Events:
Two events are independent if the occurrence of one event does NOT impact the outcome of the other event.

For example:

  1. Roll a die twice , sample space: \(\Omega = \{1,2,3,4,5,6\}\)
    Odd number in 1st throw = \(A = \{1,3,5\}\)
    Odd number in 2nd throw = \(B = \{1,3,5\}\)
    Note: A and B are independent because whether we get an odd number in the 1st roll has NO impact on getting an odd number in the 2nd roll.

    \(P(A \cap B) = P(A)*P(B)\)

Note: If we know that event \(A\) has occurred, then that gives us NO new information about the event \(B\).


πŸ’‘ Does \(P=0\) mean that the event is impossible or improbable?
It means that the event is highly unlikely (improbable), NOT that it is impossible.

Let’s understand this answer with an example.

πŸ’‘ What is the probability of choosing a real number, say 2.5, from the interval \([2,3]\) ?

Probability of choosing exactly one point on the number line or a real number, say 2.5,
from the interval \([2,3]\) is almost = 0, because there are infinitely many points between 2 and 3.

Also, we can NOT say that choosing exactly 2.5 is impossible, because it exists there on the number line.
But, for all practical purposes, \(P(2.5) = 0\).

Therefore, we say that \(P=0\) means “Highly Unlikely” and NOT “Impossible”.

Extending this line of reasoning, we can say that the probability of NOT choosing 2.5 is \(P(\lnot 2.5) = 1\),
because there are infinitely many other points between 2 and 3.
But we cannot say for sure that 2.5 will never be chosen; it is still a valid point on the number line,
so there is some chance of choosing 2.5 even though its probability is 0.

Therefore, we say that \(P=1\) means “Almost Sure” and NOT “Certain”.

πŸ’‘ What is the probability of getting a 7 when we roll a 6 faced die ?

Here, in this case we can say that \(P(7)=0\) and that means Impossible.

Similarly, we can say that \(P(get ~any ~number ~between ~1 ~and ~6)=1\) and \(P=1 => \) Certain.


End of Introduction

2 - Conditional Probability

Conditional Probability & Bayes Theorem

In this section, we will understand all the concepts related to Conditional Probability and Bayes’ Theorem.


πŸ“˜

Conditional Probability:
It is the probability of an event occurring, given that another event has already occurred.
Allows us to update probability when additional information is revealed.

\[P(A \mid B) = \frac{P(A \cap B)}{P(B)}\]

πŸ“˜ Chain Rule:
\(P(A \cap B) = P(B)*P(A \mid B)\) or
\(P(A \cap B) = P(A)*P(B \mid A)\)

For example:

  1. Roll a die, sample space: \(\Omega = \{1,2,3,4,5,6\}\)
    Event A = Get a 5 = \(\{5\} => P(A) = 1/6\)
    Event B = Get an odd number = \(\{1, 3, 5\} => P(B) = 3/6 = 1/2\)
$$ \begin{aligned} \because (A \cap B) = \{5\} ~ => P(A \cap B) = 1/6 \\ P(A \mid B) &= \frac{P(A \cap B)}{P(B)} \\ &= \frac{1/6}{1/2} \\ &= 1/3 \end{aligned} $$


πŸ“˜

Bayes’ Theorem:
It is a formula that uses conditional probability.
It allows us to update our belief about an event’s probability based on new evidence.
We know from conditional probability and chain rule that:

$$ \begin{aligned} P(A \cap B) = P(B)*P(A \mid B) \\ P(A \cap B) = P(A)*P(B \mid A) \\ P(A \mid B) = \frac{P(A \cap B)}{P(B)} \end{aligned} $$

Combining all the above equations gives us the Bayes’ Theorem:

$$ \begin{aligned} P(A \mid B) = \frac{P(A)*P(B \mid A)}{P(B)} \end{aligned} $$


For example:

  1. Roll a die, sample space: \(\Omega = \{1,2,3,4,5,6\}\)
    Event A = Get a 5 = \(\{5\} => P(A) = 1/6\)
    Event B = Get an odd number = \(\{1, 3, 5\} => P(B) = 3/6 = 1/2\)
    Task: Find the probability of getting a 5 given that you rolled an odd number.

\(P(B \mid A) = 1\) = Probability of getting an odd number given that we have rolled a 5.

$$ \begin{aligned} P(A \mid B) &= \frac{P(A) * P(B \mid A)}{P(B)} \\ &= \frac{1/6 * 1}{1/2} \\ &= 1/3 \end{aligned} $$
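As a quick sanity check, here is a minimal Python sketch (standard library only; the variable names are illustrative) that estimates \(P(A \mid B)\) for the die example by simulation: among rolls that land on an odd number, count how often that number is a 5.

import random

random.seed(42)
n_trials = 100_000
odd_rolls = 0        # rolls where B (an odd number) occurred
fives_among_odd = 0  # rolls where both A (a 5) and B occurred

for _ in range(n_trials):
    roll = random.randint(1, 6)
    if roll % 2 == 1:        # event B: odd number
        odd_rolls += 1
        if roll == 5:        # event A: rolled a 5
            fives_among_odd += 1

print(fives_among_odd / odd_rolls)  # close to P(A | B) = 1/3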

Now, let’s understand another concept called Law of Total Probability.
Here, we can say that the sample space \(\Omega\) is divided into 2 parts - \(A\) and \(A ^ \complement \)

So, the probability of an event \(B\) is given by:

$$ B = (B \cap A) \cup (B \cap A ^ \complement) \\ P(B) = P(B \cap A) + P(B \cap A ^ \complement ) \\ \text{By Chain Rule: } P(B) = P(A)*P(B \mid A) + P(A ^ \complement )*P(B \mid A ^ \complement ) $$


πŸ’‘ What if the sample space is divided into ’n’ such partitions ?

πŸ“˜

Law of Total Probability:
Overall probability of an event B, considering all the different, mutually exclusive ways it can occur.
If \(A_1, A_2, \dots, A_n\) are a set of events that partition the sample space, such that they are -

  • Mutually exclusive : \(A_i \cap A_j = \emptyset\) for all \(i \neq j\)
  • Exhaustive: \(A_1 \cup A_2 \cup \dots \cup A_n = \Omega\) $$ P(B) = \sum_{i=1}^{n} P(A_i)*P(B \mid A_i) $$ where \(n\) is the number of mutually exclusive partitions of the sample space \(\Omega\) .

Now, we can also generalize the Bayes’ Theorem using the Law of Total Probability.
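Substituting the Law of Total Probability into the denominator gives the generalized Bayes’ Theorem for a partition \(A_1, A_2, \dots, A_n\) of the sample space:

\[ P(A_i \mid B) = \frac{P(A_i)*P(B \mid A_i)}{\sum_{j=1}^{n} P(A_j)*P(B \mid A_j)} \]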


End of Section

3 - Independence of Events

Independence of Events

In this section, we will understand different kinds of Independence of Events.


πŸ“˜

Independence of Events:
Two events are independent if the occurrence of one event does not affect the probability of the other event.
There are 3 types of independence of events:

  • Mutual Independence
  • Pair-Wise Independence
  • Conditional Independence

πŸ“˜

Mutual Independence:
Joint probability of two events is equal to the product of the individual probabilities of the two events.
\(P(A \cap B) = P(A)*P(B)\)

Joint probability: The probability of two or more events occurring simultaneously.
\(P(A \cap B)\) or \(P(A, B)\)

For example:

  1. Toss a coin and roll a die -
    \(A\) = Get a heads; \(P(A)=1/2\)
    \(B\) = Get an odd number; \(P(B)=1/2\)
$$ \begin{aligned} P(A \cap B) &= P(\text{Heads and Odd}) = P(\{(H,1),(H,3),(H,5)\}) \\ &= \frac{3}{12} = \frac{1}{4} \\ \\ \text{also } P(A) * P(B) &= \frac{1}{2} * \frac{1}{2} = \frac{1}{4} \end{aligned} $$
=> A and B are mutually independent.

πŸ“˜ Pair-Wise Independence:
Every pair of events in the set is independent.
Pair-wise independence != Mutual independence.

For example:

  1. Toss 3 coins;
    For any pair of tosses, the sample space of that pair is: \(\Omega = \{HH,HT, TH, TT\}\)
    \(A\) = First and Second toss outcomes are same i.e \(\{HH, TT\}\); \(P(A)= 2/4 = 1/2\)
    \(B\) = Second and Third toss outcomes are same i.e \(\{HH, TT\}\); \(P(B)= 2/4 = 1/2\)
    \(C\) = Third and First toss outcomes are same i.e \(\{HH, TT\}\); \(P(C)= 2/4 = 1/2\)

Now, to check pair-wise independence of the above events A & B, compute \(P(A \cap B)\).
\(P(A \cap B)\) => Outcomes of first and second toss are same &
outcomes of second and third toss are same
=> Outcomes of all the three tosses are same.

Total number of outcomes = 8
Desired outcomes = \(\{HHH, TTT\}\) = 2
=> \(P(A \cap B) = 2/8 = 1/4 = P(A) * P(B) = 1/2 * 1/2 = 1/4\)

Therefore, \(A\) and \(B\) are pair-wise independent.
Similarly, we can also prove that the pairs \((A, C)\) and \((B, C)\) are pair-wise independent.

Now, let’s check for mutual independence of the above events A, B & C.
\(P(A \cap B \cap C) = P(A)*P(B)*P(C)\)
\(P(A \cap B \cap C)\) = Outcomes of all the three tosses are same i.e \(\{HHH, TTT\}\)
Total number of outcomes = 8
Desired outcomes = \(\{HHH, TTT\}\) = 2
So, \(P(A \cap B \cap C)\) = 2/8 = 1/4
But, \(P(A)*P(B)*P(C) = 1/2*1/2*1/2 = 1/8\)
Therefore \(P(A \cap B \cap C)\) β‰  \(P(A)*P(B)*P(C)\)
=> \(A, B, C\) are NOT mutually independent but only pair wise independent.
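We can verify this by brute force. The following is a small Python sketch (names are illustrative) that enumerates all 8 equally likely outcomes of 3 coin tosses and checks the pair-wise and mutual independence conditions numerically.

import itertools

# All 8 equally likely outcomes of 3 coin tosses
outcomes = list(itertools.product("HT", repeat=3))

def prob(event):
    # Probability of an event given as a set of equally likely outcomes
    return len(event) / len(outcomes)

A = {o for o in outcomes if o[0] == o[1]}  # 1st and 2nd toss are the same
B = {o for o in outcomes if o[1] == o[2]}  # 2nd and 3rd toss are the same
C = {o for o in outcomes if o[2] == o[0]}  # 3rd and 1st toss are the same

print(prob(A & B), prob(A) * prob(B))                # 0.25 0.25  -> pair-wise independent
print(prob(A & B & C), prob(A) * prob(B) * prob(C))  # 0.25 0.125 -> NOT mutually independent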

πŸ“˜ Conditional Independence:
Two events A & B are conditionally independent given a third event C,
if they are independent given that C has occurred.
Occurrence of C changes the context, causing the events A & B to become independent of each other.


For example, consider a sample space of \(\Omega = 50\) equally likely outcomes in which event \(A\) contains 10 outcomes, \(B\) contains 10, \(C\) contains 20, \(A \cap B\) contains 3, \(A \cap C\) contains 4, \(B \cap C\) contains 5, and \(A \cap B \cap C\) contains 1.

$$ \begin{aligned} P(A) &= 10/50 = 1/5 \\ P(B) &= 10/50 = 1/5 \\ P(A) * P(B) &= 1/5*1/5 = 1/25 \\ P(A \cap B) &= 3/50 \\ \text{clearly, } P(A \cap B) &\neq P(A) * P(B) \\ \end{aligned} $$

=> A & B are NOT independent.
Now, let’s check for conditional independence of A & B given C.

$$ \begin{aligned} P(A \mid C) &= \frac{P(A \cap C)}{P(C)} = 4/20 = 1/5 \\ P(B \mid C) &= \frac{P(B \cap C)}{P(C)} = 5/20 = 1/4 \\ P(A \mid C) * P(B \mid C) &= 1/5 * 1/4 = 1/20 \\ P(A \cap B \mid C) &= \frac{P(A \cap B \cap C)}{P(C)} = 1/20 \\ \text{clearly, } P(A \cap B \mid C) &= P(A \mid C)*P(B \mid C) \\ \end{aligned} $$

Therefore, A & B are conditionally independent given C.


End of Section

4 - Cumulative Distribution Function

Cumulative Distribution Function of a Random Variable

In this section, we will understand Cumulative Distribution Function of a Random Variable.


πŸ“˜ Random Variable(RV):
A random variable is a function that maps the outcomes of a sample space to real numbers.
A random variable X is represented as \(X: \Omega \to \mathbb{R} \)
It maps abstract outcomes of a random experiment to concrete numerical values required for mathematical analysis.

For example:

  1. Toss a coin 2 times, sample space: \(\Omega = \{HH,HT, TH, TT\}\)
    The above random experiment of coin tosses can be mapped to a random variable \(X: \Omega \to \mathbb{R} \)
    \(X: \Omega = \{HH,HT, TH, TT\} \to \mathbb{R} \)
    Say, if we count the number of heads, then
    $$ \begin{aligned} X(TT) &= 0 \\ X(HT) = X(TH) &= 1 \\ X(HH) &= 2 \\ \end{aligned} $$ Here, two of the four outcomes map to the value 1. A similar mapping is obtained if we count the number of tails.

Depending upon the nature of output, random variables are of 2 types - Discrete and Continuous.

πŸ“˜ Discrete Random Variable:
A random variable whose possible outcomes are finite or countably infinite.
Typically obtained by counting.
Discrete random variable cannot take any value between 2 consecutive values.
For example:

  1. The number of heads in 2 coin tosses can be 0, 1 or 2 but NOT 1.5.

πŸ“˜ Continuous Random Variable:
A random variable that can take any value within a given range/interval.
Possible outcomes are uncountably infinite.
For example:

  1. A person’s height in a given range of say 150cm-200cm.
    Height can take any value, not just round values, e.g: 150.1cm, 167.95cm, 180.123cm etc.

Now that we have understood how a random variable maps the outcomes of an abstract random experiment to real values for mathematical analysis, let’s look at its applications.

πŸ’‘ How to calculate the probability of a random variable?
Probability of a random variable is given by something called - Cumulative Distribution Function (CDF).

πŸ“˜ Cumulative Distribution Function(CDF):
It gives the cumulative probability of a random variable \(X\).
CDF = \(F_X(x) = P(X \leq x)\), i.e., the probability that the random variable \(X\) takes a value \(\leq x\).

For example:

  1. Discrete random variable - Toss a coin 2 times, sample space: \(\Omega = \{HH,HT, TH, TT\}\)
    Count the number of heads.
    $$ \begin{aligned} \{X=0\} = \{TT\} => P(X = 0) = 1/4 \\ \{X=1\} = \{HT, TH\} => P(X = 1) = 1/2\\ \{X=2\} = \{HH\} => P(X = 2) = 1/4 \\ \\ CDF = F_X(x) = P(X \leq x) \\ F(0) = P(X \leq 0) = P(X < 0) + P(X = 0) = 1/4 \\ F(1) = P(X \leq 1) = P(X < 1) + P(X = 1) = 1/4 + 1/2 = 3/4 \\ F(2) = P(X \leq 2) = P(X < 2) + P(X = 2) = 3/4 + 1/4 = 1 \\ \end{aligned} $$


2. Continuous random variable - Consider a line segment/interval from \(\Omega = [0,2] \)
Random variable \(X(\omega) = \omega\)
i.e \(X(1) = 1 ~and~ X(1.1) = 1.1 \)

$$ \begin{aligned} P[(0,1/2)] = \frac{1/2 - 0}{2} = 1/4 \\ P[(3/4, 2)] = \frac{2-3/4}{2} = 5/8 \\ P(X \le 1.1) = \frac{1.1 - 0}{2} = \frac{1.1}{2} \end{aligned} $$

\[ F_X(x) = P(X \leq x) = \begin{cases} \frac{x}{2} & \text{if } x \in [0,2] \\ 1 & \text{if } x > 2 \\ 0 & \text{if } x < 0 \end{cases} \]
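A minimal Python sketch (standard library only; function names are illustrative) that evaluates both CDFs above: the discrete CDF of the 2-coin-toss example as a running sum of the PMF, and the continuous CDF \(F_X(x) = x/2\) on \([0,2]\).

# Discrete case: number of heads in 2 tosses of a fair coin
pmf = {0: 0.25, 1: 0.5, 2: 0.25}

def discrete_cdf(x):
    # F(x) = P(X <= x): sum the PMF over all values <= x
    return sum(p for value, p in pmf.items() if value <= x)

print(discrete_cdf(0), discrete_cdf(1), discrete_cdf(2))  # 0.25 0.75 1.0

# Continuous case: X uniform on [0, 2], so F(x) = x / 2 on [0, 2]
def uniform_cdf(x, a=0.0, b=2.0):
    if x < a:
        return 0.0
    if x > b:
        return 1.0
    return (x - a) / (b - a)

print(uniform_cdf(1.1))  # 0.55  (= 1.1 / 2)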



End of Section

5 - Probability Mass Function

Probability Mass Function of a Discrete Random Variable

In this section, we will understand Probability Mass Function of a Discrete Random Variable.


πŸ“˜ Probability Mass Function(PMF):
It gives the exact value of a probability for a discrete random variable at a specific value \(x\).
It assigns a “non-zero” mass or probability to a specific countable outcome.
Note: Called ‘mass’ because probability is concentrated at a single discrete point.
\(PMF = P(X=x)\)
e.g: Bernoulli, Binomial, Multinomial, Poisson etc.

Commonly visualised as a bar chart.
Note: PMF = Jump at a given point in CDF.

\(PMF = P_X(X=x_i) = F_X(x_i) - F_X(x_{i-1})\)


πŸ“˜ Bernoulli Distribution:
It models a single event with two possible outcomes, success (1) or failure (0), with a fixed probability of success, ‘p’.
p = Probability of success
1-p = Probability of failure
Mean = p
Variance = p(1-p)
Note: A single trial that adheres to these conditions is called a Bernoulli trial.
\(PMF, P(x) = p^x(1-p)^{1-x}\), where \(x \in \{0,1\}\)
For example:

  1. Toss a coin, we get heads or tails.
  2. Result of a test, pass or fail.
  3. Machine learning, binary classification model.

πŸ“˜

Binomial Distribution:
It extends the Bernoulli distribution by modeling the number of successes that occur over a fixed number of independent trials.
n = Number of trials
k = Number of successes
p = Probability of success
Mean = np
Variance = np(1-p)

\(PMF, P(x=k) = \binom{n}{k}p^k(1-p)^{n-k}\), where \(k \in \{0,1,2,3,...,n\}\)
\(\binom{n}{k} = \frac{n!}{k!(n-k)!}\) i.e number of ways to achieve ‘k’ successes in ‘n’ independent trials.

Note: Bernoulli is a special case of Binomial distribution where n = 1.
Also read about Multinomial distribution i.e where number of possible outcomes is > 2.

For example:

  • Counting number of heads(success) in ’n’ coin tosses.




πŸ’‘ What is the probability of getting exactly 2 heads in 3 coin tosses?

Total number of outcomes in 3 coin tosses = 2^3 = 8
Desired outcomes i.e 2 heads in 3 coin tosses = \(\{HHT, HTH, THH\}\) = 3
Probability of getting exactly 2 heads in 3 coin tosses = \(\frac{3}{8}\) = 0.375
Now lets solve the question using the binomial distribution formula.

\[P(k=2) = \binom{3}{2}p^2(1-p)^{3-2} \\ = \frac{3!}{2!(3-2)!}(0.5)^2(0.5) \\ = 3*0.25*0.5 = 3*0.125 = 0.375\]



πŸ’‘ What is the probability of winning a lottery 1 out of 10 times, given that the probability of winning a single lottery = 1/3?

Number of successes, k = 1
Number of trials, n = 10
Probability of success, p = 1/3
Probability of winning lottery, P(k=1) =

\[\binom{10}{1}p^1(1-p)^{10-1} \\ = \frac{10!}{1!(10-1)!}(1/3)^1(2/3)^9 \\ = 10*0.333*0.026 \approx 0.0867 \approx 8.67\% \]
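Both of the above questions can be answered with a few lines of Python (a sketch using only the standard library; the function name is illustrative):

from math import comb

def binomial_pmf(k, n, p):
    # P(X = k) = C(n, k) * p^k * (1 - p)^(n - k)
    return comb(n, k) * p ** k * (1 - p) ** (n - k)

print(binomial_pmf(2, 3, 0.5))     # 0.375   -> exactly 2 heads in 3 tosses
print(binomial_pmf(1, 10, 1 / 3))  # ~0.0867 -> winning the lottery exactly once in 10 tries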




πŸ“˜

Poisson Distribution:
It expresses the probability of an event happening a certain number of times ‘k’ within a fixed interval of time.
Given that:

  1. Events occur with a known constant average rate.
  2. Occurrence of an event is independent of the time since the last event.

    Parameters:
    \(\lambda\): Expected number of events per interval
    \(k\) = Number of events in the same interval
    Mean = \(\lambda\)
    Variance = \(\lambda\)

PMF = Probability of occurrence of ‘k’ events in the same interval

\[PMF = \lambda^ke^{-\lambda}/k!\]

Note: Useful to count data where total population size is large but the probability of an individual event is small.

For example:

  1. Model the number of customer arrivals at a service center per hour.
  2. Number of website clicks in a given time period.

πŸ’‘ A company receives, on an average, 5 customer emails per hour. What is the probability of receiving exactly 3 emails in the next hour?

Expected (average) number of emails per hour, \(\lambda\) = 5
Probability of receiving exactly k=3 emails in the next hour =

\[P(k=3) = \lambda^3e^{-\lambda} / 3! \\ = 5^3e^{-5} / 3! = 125*e^{-5} / 6 \\ = 125*0.00674 / 6 \approx 0.14 ~or~ 14\% \]
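The same calculation as a small Python sketch (standard library only; the function name is illustrative):

from math import exp, factorial

def poisson_pmf(k, lam):
    # P(X = k) = lam^k * e^(-lam) / k!
    return lam ** k * exp(-lam) / factorial(k)

print(poisson_pmf(3, 5))  # ~0.1404 -> exactly 3 emails when the average is 5 per hour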





End of Section

6 - Probability Density Function

Probability Density Function of a Continuous Random Variable

In this section, we will understand Probability Density Function of a Continuous Random Variable.


πŸ“˜

Probability Density Function(PDF):
This is a function used for continuous random variables to describe the likelihood of the variable taking on a value within a specific range or interval.
Since the probability that a continuous random variable takes any single exact value is zero, we instead find the probability over a given range or interval.
Note: Called ‘density’ because probability is spread continuously over a range of values rather than being concentrated at a single point as in PMF.
e.g: Uniform, Gaussian, Exponential, etc.

Note: PDF is a continuous function \(f(x)\).
It is also the derivative of Cumulative Distribution Function (CDF) \(F_X(x)\)


\(PDF = f(x) = F'(X) = \frac{dF_X(x)}{dx} \)

For example:
Consider a line segment/interval from \(\Omega = [0,2] \)
Random variable \(X(\omega) = \omega\)
i.e \(X(1) = 1 ~and~ X(1.1) = 1.1 \)

$$ F_X(x) = P(X \leq x) = \begin{cases} \frac{x}{2} & \text{if } x \in [0,2] \\ 1 & \text{if } x > 2 \\ 0 & \text{if } x < 0 \end{cases} $$


$$ \begin{aligned} PDF = f_X(x) = \frac{dF_X(x)}{dx} \\ \end{aligned} $$

$$ \text{PDF } = f_X(x) = \begin{cases} \dfrac{1}{2}, & x \in [0,2] \\ 0, & \text{otherwise.} \end{cases} $$


Note: If we know the PDF of a continuous random variable, then we can find the probability of any given region/interval by calculating the area under the curve.


πŸ“˜

Uniform Distribution:
All the outcomes within the given range are equally likely to occur.
Also known as ‘fair’ distribution.
Note: This is a natural starting point to understand randomness in general.

\[ X \sim U(a,b) \]

$$ \begin{aligned} PDF = f(x) = \begin{cases} \frac{1}{b-a} & \text{if } x \in [a,b] \\ 0 & \text{~otherwise } \end{cases} \end{aligned} $$


Mean = Median = \( \frac{a+b}{2} \)
Variance = \( \frac{(b-a)^2}{12} \)

Standard uniform distribution: \( X \sim U(0,1) \)

$$ \begin{aligned} PDF = f(x) = \begin{cases} 1 & \text{if } x \in [0,1] \\ 0 & \text{~otherwise } \end{cases} \end{aligned} $$

PDF in terms of mean(\(\mu\)) and standard deviation(\(\sigma\)) -

$$ \begin{aligned} PDF = f(x) = \begin{cases} \frac{1}{2\sigma\sqrt{3}} & \text{if } \mu -\sigma\sqrt{3} \le x \le \mu + \sigma\sqrt{3}\\ \\ 0 & \text{~otherwise } \end{cases} \end{aligned} $$
For example:

  • Random number generator that generates a random number between 0 and 1.


πŸ“˜

Gaussian(Normal) Distribution:
It is a continuous probability distribution characterized by a symmetrical, bell-shaped curve with most data clustered around the central average, with frequency of values decreasing as they move away from the center.

  • Most outcomes are average; extremely low or extremely high values are rare.
  • Characterised by mean and standard deviation/variance.
  • Peak at mean = median, symmetric around mean.
    Note: Most important and widely used distribution.

    \[ X \sim N(\mu, \sigma^2) \] $$ \begin{aligned} PDF = f(x) = \dfrac{1}{\sqrt{2\pi}\sigma}e^{-\dfrac{(x-\mu)^2}{2\sigma^2}} \\ \end{aligned} $$
    Mean = \(\mu\)
    Variance = \(\sigma^2\)

Standard normal distribution:

\[ Z \sim N(0,1) ~i.e~ \mu = 0, \sigma^2 = 1 \]


Any normal distribution can be standardized using Z-score transformation:

$$ \begin{aligned} Z = \dfrac{X-\mu}{\sigma} \end{aligned} $$

For example:

  • Human height, IQ scores, blood-pressure etc.
  • Measurement of errors in scientific experiments.


πŸ“˜

Exponential Distribution:
It is used to model the amount of time until a specific event occurs.
Given that:

  1. Events occur with a known constant average rate.
  2. Occurrence of an event is independent of the time since the last event.

    Parameters:
    Rate parameter: \(\lambda\): Average number of events per unit time
    Scale parameter: \(\mu ~or~ \beta \): Mean time between events
    Mean = \(\frac{1}{\lambda}\)
    Variance = \(\frac{1}{\lambda^2}\)
    \[ \lambda = \dfrac{1}{\beta} = \dfrac{1}{\mu}\]
$$ \begin{aligned} PDF = f(x) = \lambda e^{-\lambda x} ~~~ \forall ~~~ x \ge 0 ~~~\&~~~ \lambda > 0 \\ CDF = F(x) = 1 - e^{-\lambda x} ~~~ \forall ~~~ x \ge 0 ~~~\&~~~ \lambda > 0 \end{aligned} $$


πŸ’‘ At a bank, a teller spends 4 minutes, on an average, with every customer. What is the probability that a randomly selected customer will be served in less than 3 minutes?

Mean time to serve 1 customer = \(\mu\) = 4 minutes
So, \(\lambda\) = average number of customers served per unit time = \(1/\mu\) = 1/4 = 0.25 customers per minute
Probability to serve a customer in less than 3 minutes can be found using CDF -

\[ F(x) = P(X \le x) = 1 - e^{-\lambda x}\]


$$ \begin{aligned} P(X \leq 3) &= 1 - e^{-0.25*3} \\ &= 1 - e^{-0.75} \\ &= 1 - 0.47 \\ &\approx 0.53 \\ &\approx 53\% \end{aligned} $$

So, probability of a customer being served in less than 3 minutes is 53%(approx).


πŸ’‘ At a bank, a teller spends 4 minutes, on an average, with every customer. What is the probability that a randomly selected customer will be served in greater than 2 minutes?

\[ CDF = F(x) = P(X \le x) = 1 - e^{-\lambda x} \\ \text{Total probability} = P(X \le x) + P(X > x) = 1\\ => 1 - e^{-\lambda x} + P(X > x) = 1 \\ => P(X > x) = e^{-\lambda x} \]


In this case x = 2 minutes, and \(\lambda\) = 0.25 so,

\[ P(X > 2) = e^{-\lambda x} \\ = e^{-0.25*2} \\ = e^{-0.5} \\ = 0.6065 \\ \approx 60.65\% \]


So, probability of a customer being served in greater than 2 minutes is 60%(approx).
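Both bank-teller questions can be checked with a small Python sketch (standard library only; names are illustrative):

from math import exp

def expon_cdf(x, lam):
    # P(X <= x) for an exponential distribution with rate lam
    return 1 - exp(-lam * x)

lam = 1 / 4  # the teller serves 1 customer per 4 minutes on average

print(expon_cdf(3, lam))      # ~0.5276 -> served in less than 3 minutes
print(1 - expon_cdf(2, lam))  # ~0.6065 -> service takes more than 2 minutes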


πŸ’‘ Suppose, we know that an electronic item has lasted for time \(x>t\) days, then what is the probability that it will last for an additional time of ’s’ days ?

Task: We want to find the probability of the electronic item lasting for \(x > t+s \) days,
given that it has already lasted for \(x>t\) days.
This is a ‘conditional probability’.
Since,

\[ P(A \mid B) = \dfrac{P(A \cap B)}{P(B)} \]


We want to find: \( P(X > t+s \mid X > t) = ~? \)

\[ P(X > t+s \mid X > t) = \dfrac{P(X > t+s ~and~ X > t)}{P(X > t)} \\ \text{Since, t+s > t, we can only consider t+s} \\ = \dfrac{P(X > t+s)}{P(X > t)} \\ = \dfrac{e^{-\lambda(t+s)}}{e^{-\lambda(t)}} \\ = e^{-\lambda(t) -\lambda(s) + \lambda(t)} \\ = e^{-\lambda(s)} \\ => \text{Independent of time 't'} \]

Hence, probability that the electronic item will survive for an additional time ’s’ days
is independent of the time ’t’ days it has already survived.


Let’s see the proof of the survival probability \(P(X > t) = e^{-\lambda t}\) used above.

Poisson Case:
The probability of observing exactly ‘k’ events in a time interval of length ’t’
with an average rate of \( \lambda \) events per unit time is given by -

\[ PMF = P(X=k) = \dfrac{(\lambda t)^k e^{-\lambda t}}{k!} \]


The event that waiting time until next event > t, is same as observing ‘0’ events in that interval.
=> We can use the PMF of Poisson distribution with k=0.

\[ PMF = P(X=k=0) = \dfrac{(\lambda t)^0 e^{-\lambda t}}{0!} = e^{-\lambda t} \]


Exponential Case:
Now, lets consider exponential distribution that models the waiting time ‘T’ until next event.
The event that waiting time ‘T’ > t, is same as observing ‘0’ events in that interval.

\[ P(T>t) = 1 - F(t) = e^{-\lambda t} \]

Observation:
Exponential distribution is a direct consequence of Poisson distribution probability of ‘0’ events.


πŸ’‘ Consider a machine that fails, on an average, every 20 hours. What is the probability of having NO failures in the next 10 hours?

Using Poisson:
Average failure rate, \(\lambda\) = 1/20 = 0.05
Time interval, t = 10 hours
Number of events, k = 0
Average number of events in interval, (\(\lambda t\)) = (1/20) * 10 = 0.5
Probability of having NO failures in the next 10 hours = ?

\[ P(X=0) = \dfrac{(\lambda t)^0 e^{-\lambda t}}{0!} = e^{-\lambda t} \\ = e^{-0.5} \approx 0.6065 \\ \approx 60.65\%\]


Using Exponential:
What is the probability that wait time until next failure > 10 hours?
Waiting time, T > t = 10 hours.
Average number of failures per hour, \(\lambda\) = 1/20 per hour

\[ P(T>t) = 1 - F(t) = e^{-\lambda t} \\ P(T>10) = e^{-(1/20) * 10} \\ = e^{-0.5} \approx 0.6065 \\ \approx 60.65\%\]


Therefore, we have seen that this problem can be solved using both Poisson and Exponential distribution.




End of Section

7 - Expectation

Expectation of a Random Variable

In this section, we will understand Expectation, Variance and Co-Variance of a Random Variable.


πŸ“˜

Expectation:
Long run average of the outcomes of a random experiment.
When we talk about ‘Expectation’ we mean - on an average.

  • Value we would expect to get if we repeated an experiment many times and averaged the results.
  • Measures the central tendency of a data distribution.
  • Similar to ‘Center of Mass’ in Physics.

Note: If the probability of each outcome is different, then we take a weighted average.

Discrete Case:
\(E[X] = \sum_{i=1}^{n} x_i.P(X = x_i) \)

Continuous Case:
\(E[X] = \int_{-\infty}^{\infty} x.f(x) dx \)

πŸ’‘ Let’s play a game where we flip a fair coin. If we get a head, then you win Rs. 100, and
if it’s a tail, then you lose Rs. 50. What is the expected value of the amount that you will win per toss?

Possible outcomes are: \(x_1 = 100 ,~ x_2 = -50 \)
Probability of each outcome is \(P(X = 100) = 0.5,~ P(X = -50) = 0.5 \)

\[ E[X] = \sum_{i=1}^{n} x_i.P(X = x_i) \\ = x_1.P(X = x_1) + x_2.P(X = x_2) \\ = 100*0.5 + (-50)*0.5 \\ = 50 - 25 \\ = 25 \]

Therefore, the expected value of the amount that you will win per toss is Rs. 25 in the long run.
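A minimal Python simulation (standard library only; names are illustrative) that confirms the long-run average winnings converge to Rs. 25:

import random

random.seed(0)
n = 100_000
total = 0
for _ in range(n):
    total += 100 if random.random() < 0.5 else -50  # +100 for heads, -50 for tails

print(total / n)  # close to the expected value E[X] = 25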

πŸ’‘ What is the expected value of a continuous uniform random variable distributed over the interval [a,b]?

$$ \text{PDF of continuous uniform random variable } = \\ f_X(x) = \begin{cases} \dfrac{1}{b-a}, & x \in [a,b] \\ 0, & \text{otherwise.} \end{cases} $$\[ \text{Expected value of continuous uniform random variable } = \\ E[X] = \int_{-\infty}^{\infty} x.f(x) dx \\ = \int_{-\infty}^{a} x.f(x) dx + \int_{a}^{b} x.f(x) dx + \int_{b}^{\infty} x.f(x) dx\\[0.5em] \text{Since, the PDF is defined only in the range [a,b] } \\[0.5em] = 0 + \int_{a}^{b} x.f(x) dx + 0 \\ = \int_{a}^{b} x.f(x) dx \\ = \dfrac{1}{b-a} \int_{a}^{b} x dx \\[0.5em] = \dfrac{1}{b-a} (\frac{x^2}{2})_{a}^{b} \\[0.5em] = \dfrac{1}{b-a} * (\frac{b^2 - a^2}{2}) \\[0.5em] = \dfrac{1}{b-a} * \{\frac{(b+a)(b-a)}{2}\}\\ = \dfrac{b+a}{2} \]



πŸ“˜

Variance:
It is the average of the squared differences from the mean.

  • Measures the spread or variability in the data distribution.
  • If the variance is low, then the data is clustered around the mean i.e less variability;
    and if the variance is high, then the data is widely spread out i.e high variability.

Variance in terms of expected value, where \(E[X] = \mu\), is given by:

\[ \begin{aligned} Var[X] &= E[(X - E[X])^2] \\ &= E[X^2 + E[X]^2 - 2XE[X]] \\ &= E[X^2] + E[X]^2 - 2E[X]E[X] \\ &= E[X^2] + E[X]^2 - 2E[X]^2 \\ Var[X] &= E[X^2] - E[X]^2 \\ \end{aligned} \]

Note: This is the computational formula for variance, as it is easier to calculate than the average of square distances from mean.

πŸ’‘ What is the variance of a continuous uniform random variable distributed over the interval [a,b]?

PDF of continuous uniform random variable =

$$ \begin{aligned} f_X(x) &= \begin{cases} \dfrac{1}{b-a}, & x \in [a,b] \\ 0, & \text{otherwise.} \end{cases} \\ \end{aligned} $$\[ \begin{aligned} \text{Expected value = mean } = \\[0.5em] E[X] = \dfrac{b+a}{2} \\[0.5em] Var[X] = E[X^2] - E[X]^2 \end{aligned} \]

We know \(E[X]\) already, now we will calculate \(E[X^2]\):

\[ \begin{aligned} E[X] &= \int_{-\infty}^{\infty} x.f(x) dx \\ E[X^2] &= \int_{-\infty}^{\infty} x^2.f(x) dx \\ &= \int_{-\infty}^{a} x^2f(x) dx + \int_{a}^{b} x^2f(x) dx + \int_{b}^{\infty} x^2f(x) dx\\ &= 0 + \int_{a}^{b} x^2f(x) dx + 0 \\ &= \int_{a}^{b} x^2f(x) dx \\ &= \dfrac{1}{b-a} \int_{a}^{b} x^2 dx \\ &= \dfrac{1}{b-a} * \{\frac{x^3}{3}\}_{a}^{b} \\ &= \dfrac{1}{b-a} * \{\frac{b^3 - a^3}{3}\} \\ &= \dfrac{1}{b-a} * \{\frac{(b-a)(b^2+ab+a^2)}{3}\} \\ E[X^2] &= \dfrac{b^2+ab+a^2}{3} \end{aligned} \]

Now, we know both \(E[X]\) and \(E[X^2]\), so we can calculate \(Var[X]\):

\[ \begin{aligned} Var[X] &= E[X^2] - E[X]^2 \\ &= \dfrac{b^2+ab+a^2}{3} - \dfrac{(b+a)^2}{4} \\ &= \dfrac{b^2+ab+a^2}{3} - \dfrac{b^2+2ab+a^2}{4} \\ &= \dfrac{4b^2+4ab+4a^2-3b^2-6ab-3a^2}{12} \\ &= \dfrac{b^2-2ab+a^2}{12} \\ Var[X]&= \dfrac{(b-a)^2}{12} \end{aligned} \]
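A quick numerical check of both formulas, as a Python sketch (standard library only; the interval [2,10] is just an illustrative choice):

import random

random.seed(0)
a, b = 2.0, 10.0
samples = [random.uniform(a, b) for _ in range(200_000)]

mean = sum(samples) / len(samples)
var = sum((x - mean) ** 2 for x in samples) / len(samples)

print(mean, (a + b) / 2)       # ~6.0  vs 6.0   -> E[X] = (a + b) / 2
print(var, (b - a) ** 2 / 12)  # ~5.33 vs 5.33  -> Var[X] = (b - a)^2 / 12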

πŸ“˜

Co-Variance:
It is the measure of how 2 variables X & Y vary together.
It gives the direction of the relationship between the variables.

\[ \begin{aligned} \text{Cov}(X, Y) &> 0 &&\Rightarrow \text{ } X \text{ and } Y \text{ increase or decrease together} \\[0.5em] \text{Cov}(X, Y) &= 0 &&\Rightarrow \text{ } \text{No linear relationship} \\[0.5em] \text{Cov}(X, Y) &< 0 &&\Rightarrow \text{ } \text{If } X \text{ increases, } Y \text{ decreases (and vice versa)} \end{aligned} \]

Note: For both direction as well as magnitude, we use Correlation.
Let’s use expectation to compute the co-variance of two random variables X & Y:

\[ \begin{aligned} Cov(X,Y) &= E[(X-E[X])(Y-E[Y])] \\ &\text{where, E[X] = mean of X and E[Y] = mean of Y} \\ & = E[XY - YE[X] -XE[Y] + E[X]E[Y]] \\ &= E[XY] - E[Y]E[X] - \cancel{E[X]E[Y]} + \cancel{E[X]E[Y]} \\ Cov(X,Y) &= E[XY] - E[X]E[Y] \\ \end{aligned} \]

Note:

  • In a multivariate setting, relationships between all the pairs of random variables are summarized in a square symmetric matrix called the ‘Co-Variance Matrix’ \(\Sigma\).
  • Covariance of a random variable with self gives the variance, hence the diagonals of covariance matrix are variances.
\[ \begin{aligned} Cov(X,Y) &= E[XY] - E[X]E[Y] \\ Cov(X,X) &= E[X^2] - E[X]^2 = Var[X] \\ \end{aligned} \]
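A minimal Python sketch (illustrative data and names) that computes covariance via \(E[XY] - E[X]E[Y]\) and confirms that \(Cov(X,X) = Var[X]\):

def covariance(xs, ys):
    # Cov(X, Y) = E[XY] - E[X]E[Y], estimated from paired samples
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    return sum(x * y for x, y in zip(xs, ys)) / n - mean_x * mean_y

xs = [1, 2, 3, 4, 5]
ys = [2, 4, 6, 8, 10]      # Y moves with X -> positive covariance
print(covariance(xs, ys))  # 4.0
print(covariance(xs, xs))  # 2.0 -> Cov(X, X) = Var[X]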



End of Section

8 - Moment Generating Function

Moment Generating Function

In this section, we will understand Moments of a probability distribution and Moment Generating Function.


πŸ“˜

Moments:
They are statistical measures that describe the characteristics of a probability distribution, such as its central tendency, spread/variability, and asymmetry.
Note: We will discuss ‘raw’ and ‘central’ moments.

Important moments:

  1. Mean : Central tendency
  2. Variance : Spread/variability
  3. Skewness : Asymmetry
  4. Kurtosis : Tailedness


πŸ“˜

Moment Generating Function:
It is a function that simplifies computation of moments, such as, the mean, variance, skewness, and kurtosis, by providing a compact way to derive any moment of a random variable through differentiation.

  • Provides an alternative to PDFs and CDFs of random variables.
  • A powerful property of the MGF is that its \(n^{th}\) derivative, when evaluated at \(t=0\), equals the \(n^{th}\) moment of the random variable.
\[ \text{MGF of random variable X is the expected value of } e^{tX} \\ MGF = M_X(t) = E[e^{tX}] \]

Not all probability distributions have MGFs; sometimes, integrals may not converge and moment may not exist.

\[ \begin{aligned} e^x = 1 + \frac{x}{1!} + \frac{x^2}{2!} + \frac{x^3}{3!} + \cdots + \frac{x^n}{n!} + \cdots\\ e^{tX} = 1 + \frac{tx}{1!} + \frac{(tx)^2}{2!} + \frac{(tx)^3}{3!} + \cdots + \frac{(tx)^n}{n!} + \cdots\\ \end{aligned} \]

Since, \(MGF = M_X(t) = E[e^{tX}] \), we can write the MGF as:

\[ \begin{aligned} M_X(t) &= E[e^{tX}] \\ &= 1 + \frac{t}{1!}E[X] + \frac{t^2}{2!}E[X^2] + \frac{t^3}{3!}E[X^3] + \cdots + \frac{t^n}{n!}E[X^n] + \cdots \\ \end{aligned} \]

Say, \( E[X^n] = m_n \), where \(m_n\) is the \(n^{th}\) moment of the random variable, then:

\[ M_X(t) = 1 + \frac{t}{1!}m_1 + \frac{t^2}{2!}m_2 + \frac{t^3}{3!}m_3 + \cdots + \frac{t^n}{n!}m_n + \cdots \\ \]

Important: Differentiating \(M_X(t)\) with respect to \(t\), \(i\) times, and setting \((t =0)\) gives us the \(i^{th}\) moment of the random variable.

\[ \frac{dM_X(t) }{dt} = M'_X(t) = 0 + m_1 + \frac{2t}{2!}m_2 + \frac{3t^2}{3!}m_3 + \cdots + \frac{nt^{n-1}}{n!}m_n + \cdots \\ \]

Set \(t=0\), to get the first moment:

\[ M'_X(0) = m_1 = E[X] \]

Similarly, second moment = \(M''_X(0) = m_2 = E[X^2]\)
And, the \(n^{th}\) derivative evaluated at \(t=0\) gives \(M^{(n)}_X(0) = m_n = E[X^n]\)

\[ E[X^n] = \frac{d^nM_X(t)}{dt^n} \bigg|_{t=0} \]

e.g: \( Var[X] = M''_X(0) - (M'_X(0))^2 \)

Note: If we know the MGF of a random variable, then we do NOT need to do integration or summation to find the moments.
Continuous:

\[ M_X(t) = E[e^{tX}] = \int_{-\infty}^{\infty} e^{tx} f_X(x) dx \]

Discrete:

\[ M_X(t) = E[e^{tX}] = \sum_{i} e^{tx_i} P(X=x_i) \]
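As a quick worked example, consider a Bernoulli random variable with success probability \(p\); its MGF recovers the mean and variance stated earlier:

\[ \begin{aligned} M_X(t) &= E[e^{tX}] = (1-p)e^{t \cdot 0} + p e^{t \cdot 1} = 1 - p + pe^t \\ M'_X(t) &= pe^t \implies M'_X(0) = p = E[X] \\ M''_X(t) &= pe^t \implies M''_X(0) = p = E[X^2] \\ Var[X] &= M''_X(0) - (M'_X(0))^2 = p - p^2 = p(1-p) \end{aligned} \]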


Read more about Integration & Differentiation



End of Section

9 - Joint & Marginal

Joint, Marginal & Conditional Probability

In this section, we will understand Joint, Marginal & Conditional Probability.
So far, we have dealt with a single random variable.
Now, let’s explore the probability distributions of 2 or more random variables occurring together.


πŸ“˜

Joint Probability Distribution:
It describes the probability of 2 or more random variables occurring simultaneously.

  • The random variables can be from different distributions, such as, discrete and continuous.

Joint CDF:

\[ F_{X,Y}(a,b) = P(X \le a, Y \le b),~ -\infty < a, b < \infty \]

Discrete Case:

\[ F_{X,Y}(a,b) = P(X \le a, Y \le b) = \sum_{x_i \le a} \sum_{y_j \le b} P(X = x_i, Y = y_j) \]

Continuous Case:

\[ F_{X,Y}(a,b) = P(X \le a, Y \le b) = \int_{-\infty}^{a} \int_{-\infty}^{b} f_{X,Y}(x,y) dy dx \]

Joint PMF:

\[ P_{X,Y}(x,y) = P(X = x, Y = y) \]

Key Properties:

  1. \(P(X = x, Y = y) \ge 0 ~ \forall (x,y) \)
  2. \( \sum_{i} \sum_{j} P(X = x_i, Y = y_j) = 1 \)

Joint PDF:

\[ f_{X,Y}(x,y) = \frac{\partial^2 F_{X,Y}(x,y)}{\partial x \partial y} \\ P((X,Y) \in A) = \iint_{A} f_{X,Y}(x,y) dy dx ,~ A \subseteq \mathbb{R}^2 \]

Key Properties:

  1. \(f_{X,Y}(x,y) \ge 0 ~ \forall (x,y) \)
  2. \( \int_{-\infty}^{\infty} \int_{-\infty}^{\infty} f_{X,Y}(x,y) dy dx = 1 \)
For example:

  • If we consider 2 random variables, say, height(X) and weight(Y), then the joint distribution will tell us the probability of finding a person having a particular height and weight.

πŸ’‘ There are 2 bags; bag_1 has 2 red balls & 3 blue balls, bag_2 has 3 red balls & 2 blue balls.
A ball is picked at random from each bag, such that both draws are independent of each other.
Let’s use this example to understand joint probability.


Let A & B be discrete random variables associated with the outcome of the ball drawn from first and second bags respectively.

             A = Red              A = Blue
B = Red      2/5*3/5 = 6/25       3/5*3/5 = 9/25
B = Blue     2/5*2/5 = 4/25       3/5*2/5 = 6/25

Since, the draws are independent, joint probability = P(A) * P(B)
Each of the 4 cells in above table shows the probability of combination of results from 2 draws or joint probability.




πŸ“˜

Marginal Probability Distribution:
It describes the probability distribution of an individual random variable in a joint distribution, without considering the outcomes of other random variables.

  • If we have the joint distribution, then we can get the marginal distribution of each random variable from it.
  • Marginal probability equals summing the joint probability across other random variables.

Marginal CDF:
We know that Joint CDF =

\[ F_{X,Y}(a,b) = P(X \le a, Y \le b),~ -\infty < a, b < \infty \]

Marginal CDF =

\[ F_X(a) = F_{X,Y}(a, \infty) = P(X \le a, Y < \infty) = P(X \le a) \]

Discrete Case:

\[ F_X(a) = P(X \le a, Y \le \infty) = \sum_{x_i \le a} \sum_{y_j \in \mathbb{R}} P(X = x_i, Y = y_j) \]

Continuous Case:

\[ F_X(a) = P(X \le a, Y \le \infty) = \int_{-\infty}^{a} \int_{-\infty}^{\infty} f_{X,Y}(x,y)dydx ,~ \text{ so the marginal PDF is } f_X(x) = \int_{-\infty}^{\infty} f_{X,Y}(x,y)dy \]

Law of Total Probability
We know that Joint Probability Distribution =

\[ P_{X,Y}(x,y) = P(X = x, Y = y) \]

The events \((Y=y)\) partition the sample space, such that:

  1. \( (Y=y_i) \cap (Y=y_j) = \emptyset \) for all \(i \neq j\)
  2. \( (Y=y_1) \cup (Y=y_2) \cup ... \cup (Y=y_n) = \Omega \)

From Law of Total Probability, we get:

Marginal PMF:

\[ P_X(x) = P(X=x) = \sum_{y} P_{X,Y}(x,y) = \sum_{y} P(X = x, Y = y) \]

Marginal PDF:

\[ f_X(x) = \int_{-\infty}^{\infty} f_{X,Y}(x,y) dy \]


πŸ’‘ Setup: Roll a die + Toss a coin.
X: Roll a die ; \( \Omega = \{1,2,3,4,5,6\} \)
Y: Toss a coin ; \( \Omega = \{H,T\} \)

Joint PMF = \( P_{X,Y}(x,y) = P(X=x, Y=y) = 1/6*1/2 = 1/12\)
Marginal PMF of X = \( P_X(x) =\sum_{y \in \{H,T\}} P_{X,Y}(x,y) = 1/12+1/12 = 1/6\)
=> Marginally, X is uniform over 1-6 i.e a fair die.

Marginal PMF of Y = \( P_Y(y) = \sum_{x=1}^{6} P_{X,Y}(x,y) = 6*(1/12) = 1/2 \)
=> Marginally, Y is uniform over H,T i.e a fair coin.

πŸ’‘ Setup: X and Y are two independent continuous uniform random variables.
\( X \sim U(0,1) \)
\( Y \sim U(0,1) \)

Marginal PDF = \(f_X(x) = \int_{-\infty}^{\infty} f_{X,Y}(x,y) dy \)
Joint PDF =

$$ f_{X,Y}(x,y) = \begin{cases} 1 & \text{if } x \in [0,1], y \in [0,1] \\ 0 & \text{otherwise } \end{cases} $$

Marginal PDF =

\[ \begin{aligned} f_X(x) &= \int_{0}^{1} f_{X,Y}(x,y) dy \\ &= \int_{0}^{1} 1 dy \\ &= 1 \\ f_X(x) &= \begin{cases} 1 & \text{if } x \in [0,1] \\ 0 & \text{otherwise } \end{cases} \end{aligned} \]

πŸ’‘ Let’s re-visit the ball drawing example.
There are 2 bags; bag_1 has 2 red balls & 3 blue balls, bag_2 has 3 red balls & 2 blue balls.
A ball is picked at random from each bag, such that both draws are independent of each other.
Let’s use this example to understand marginal probability.


Let A & B be discrete random variables associated with the outcome of the ball drawn from first and second bags respectively.

                   A = Red                        A = Blue                       P(B) (Marginal)
B = Red            2/5*3/5 = 6/25                 3/5*3/5 = 9/25                 6/25 + 9/25 = 15/25 = 3/5
B = Blue           2/5*2/5 = 4/25                 3/5*2/5 = 6/25                 4/25 + 6/25 = 10/25 = 2/5
P(A) (Marginal)    6/25 + 4/25 = 10/25 = 2/5      9/25 + 6/25 = 15/25 = 3/5

We can see from the table above - P(A=Red) is the sum of joint distribution over all possible values of B i.e Red & Blue.
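The same table can be reproduced with a short Python sketch (the dictionary keys and names are illustrative): build the joint PMF as a product of the two independent draws, then sum over B to recover the marginal of A.

# Joint PMF of the two independent draws: A = colour from bag_1, B = colour from bag_2
p_A = {"red": 2 / 5, "blue": 3 / 5}  # bag_1: 2 red, 3 blue
p_B = {"red": 3 / 5, "blue": 2 / 5}  # bag_2: 3 red, 2 blue
joint = {(a, b): p_A[a] * p_B[b] for a in p_A for b in p_B}

# Marginal PMF of A: sum the joint PMF over all values of B
marginal_A = {a: sum(joint[(a, b)] for b in p_B) for a in p_A}
print(marginal_A)  # {'red': 0.4, 'blue': 0.6} -> matches P(A) = 2/5 and 3/5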



πŸ“˜

Conditional Probability:
It measures the probability of an event occurring given that another event has already happened.

  • It provides a way to update our belief about the likelihood based on new information.
\[ P(A \mid B) = \frac{P(A \cap B)}{P(B)} \]

P(A, B) = Joint Probability of A and B
P(B) = Marginal Probability of B

=> Conditional Probability = Joint Probability / Marginal Probability

Conditional CDF:

\[ F_{X \mid Y}(x \mid y) = P(X \le x \mid Y = y) \\ \]

Discrete Case:

\[ F_{X \mid Y}(x \mid y) = P(X \le x \mid Y = y) = \sum_{x_i \le x} P(X = x_i \mid Y = y) \]

Continuous Case:

\[ F_{X \mid Y}(x \mid y) = \int_{-\infty}^{x} f_{X \mid Y}(x \mid y) dx = \int_{-\infty}^{x} \frac {f_{X,Y}(x, y)}{f_Y(y)} dx \\ f_Y(y) > 0 \]

Conditional PMF:

\[ P(X = x \mid Y = y) = \frac{P(X = x, Y = y)} {P(Y = y)} \\ P(Y = y) > 0 \]

Conditional PDF:

\[ f_{X \mid Y}(x \mid y) = \frac{f_{X,Y}(x,y)}{f_Y(y)} \\ f_Y(y) > 0 \]

Application:

  • Generative machine learning models, such as, GANs, learn the conditional distribution of pixels, given the style of input image.

πŸ’‘ Let’s re-visit the ball drawing example.
Note: We only have information about the joint and marginal probabilities.
What is the conditional probability of drawing a red ball in the first draw, given that a blue ball is drawn in second draw?


Let A & B be discrete random variables associated with the outcome of the ball drawn from first and second bags respectively.
A = Red ball in first draw
B = Blue ball in second draw.

                   A = Red     A = Blue    P(B) (Marginal)
B = Red            6/25        9/25        3/5
B = Blue           4/25        6/25        2/5
P(A) (Marginal)    2/5         3/5
\[ \begin{aligned} P(A \mid B) &= \frac{P(A \cap B)}{P(B)} \\ &= \frac{4/25}{2/5} \\ &= 2/5 \end{aligned} \]

Therefore, probability of drawing a red ball in the first draw, given that a blue ball is drawn in second draw = 2/5.
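The same answer in a couple of lines of Python (a sketch; the variable names are illustrative):

# P(A = Red | B = Blue) = P(A = Red, B = Blue) / P(B = Blue)
p_joint_red_blue = (2 / 5) * (2 / 5)  # 4/25, joint probability from the table above
p_B_blue = 2 / 5                      # marginal probability of blue in the second draw
print(p_joint_red_blue / p_B_blue)    # 0.4 = 2/5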


πŸ“˜

Conditional Expectation:
This gives us the conditional expectation of a random variable X, given another random variable Y=y.

Discrete Case:

\[ E[X \mid Y = y] = \sum_{x} x.P(X = x \mid Y = y) = \sum_{x} x.P_{X \mid Y}(x \mid y) \]

Continuous Case:

\[ E[X \mid Y = y] = \int_{-\infty}^{\infty} x.f_{X \mid Y}(x \mid y) dx \]

For example:

  • Conditional expectation of a person’s weight, given his/her height = 165 cm, will give us the average weight of all people with height = 165 cm.

Applications:

  • The linear regression algorithm models the conditional expectation of the target variable ‘Y’, given the input feature variable ‘X’.
  • Expectation Maximisation(EM) algorithm is built on conditional expectation.

πŸ“˜

Conditional Variance:
This gives us the variance of a random variable calculated after taking into account the value(s) of another related variable.

\[ \begin{aligned} Var[X \mid Y = y] &= E[(X - E[X \mid Y = y])^2 \mid Y=y] \\ => Var[X \mid Y = y] &= E[X^2 \mid Y=y] - (E[X \mid Y=y])^2 \\ \end{aligned} \]

For example:

  • The variance of a car’s mileage for city driving might be small, but the variance will be large for a mix of city and highway driving.

Note: Models that take into account the change in variance or heteroscedasticity tend to be more accurate.




End of Section

10 - Independent & Identically Distributed

Independent & Identically Distributed (I.I.D) Random Variables

In this section, we will understand Independent & Identically Distributed(I.I.D) Random Variables.


There are 2 parts in I.I.D, “Independent” and “Identically Distributed”.
Let’s revisit and understand the independence of random variables first.

πŸ“˜

Independence of Random Variables
It means that knowing the outcome of one random variable does not impact the probability distribution of the other random variable.
Two random variables X & Y are independent if:

\[ CDF = F_{X,Y}(x,y) = F_{X}(x)F_{Y}(y) \\ \text{ }\\ \text{Generalised for 'n' random variables:} \\ CDF = F_{X_1,X_2,...,X_n}(x_1,x_2,...,x_n) = \prod_{i=1}^{n}F_{X_i}(x_i) \\ \text{Discrete case: } \\ PMF = P_{X,Y}(x,y) = P_{X}(x)P_{Y}(y) \\ \text{ }\\ \text{Continuous case: } \\ PDF = f_{X,Y}(x,y) = f_{X}(x)f_{Y}(y) \\ \]


  • We know that if 2 random variables X,Y are independent, then their covariance is zero, since there is no linear relationship between them.
  • But the converse may NOT be true, i.e, if the covariance is zero, then we canNOT say for sure that the random variables are independent.

Read more about Covariance

\[ \text{For independent events: }\\ Cov(X,Y) = E[XY] - E[X]E[Y] = 0 \\ => E[XY] = E[X]E[Y] \\ \text{ }\\ \text{We know that: }\\ Var(X+Y) = Var(X) + Var(Y) + 2Cov(X,Y) \\ \text{ }\\ \text{But, for independent events Cov(X,Y)=0}, so: \\ Var(X+Y) = Var(X) + Var(Y) \\ \]
Let’s go through few examples to understand the independence of random variables.
For example:

  1. Toss a coin + Roll a die.
    \[ X = \begin{cases} 1 & \text{if Heads} \\ \\ 0 & \text{if Tails} \end{cases} \\ \text{} \\ Y = \{1,2,3,4,5,6\} \\ \text{} \\ \text{ Joint probability is all possible combinations of X and Y i.e 2x6 = 12 } \\[10pt] \text{ Sample space: } \Omega = \{ (H,1), (H,2), (H,3), (H,4), (H,5), (H,6), \\ (T,1), (T,2), (T,3), (T,4), (T,5), (T,6) \} \]

Here, clearly, X and Y are independent.

  2. Toss a coin twice.
    \(X\) = Number of heads = \(\{0,1,2\}\)
    \(Y\) = Number of tails = \(\{0,1,2\}\)

X, Y are NOT independent, because if we know one of them, then we automatically know the other (here, Y = 2 - X).

  3. \(X\) is a continuous uniform random variable \( X \sim U(-1,1) \).
    \(Y = 2X\).
\[ f_X(x) = \begin{cases} \frac{1}{b-a} = \frac{1}{1-(-1)} = \frac{1}{2} & \text{if } x \in [a,b] \\ \\ 0 & \text{otherwise} \end{cases} \\ \text{}\\ E[X] = \text{ mean } = \frac{a+b}{2} = \frac{-1+1}{2} = 0 \\ \]


Let’s check for independence of \(X\) and \(Y\) i.e \(E[XY] = E[X]E[Y]\) or not?

\[ \text{We know that: } E[X] = 0\\ E[Y] = E[2X] = 2E[X] = 0 \\ \tag{1} E[X]E[Y] = 0 \]

Now, lets calculate the value of \(E[XY]\):

\[ \begin{aligned} E[XY] &= E[X \cdot 2X] = 2E[X^2] \\ &= 2\int_{-1}^{1} x^2 f_X(x)\, dx = 2 * \frac{1}{2}\int_{-1}^{1} x^2\, dx \\ &= \frac{x^3}{3} \bigg|_{-1}^{1} = \frac{1}{3}*\{1^3-(-1)^3\} = \frac{2}{3} \\ \tag{2} E[XY] &= \frac{2}{3} \end{aligned} \]


From (1) and (2), we can say that:

\[ E[XY] \neq E[X]E[Y] \\ \]

Hence, \(X\) and \(Y\) are NOT independent.
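A minimal Python simulation (standard library only; names are illustrative) that confirms \(E[XY] \neq E[X]E[Y]\) for this pair:

import random

random.seed(0)
n = 200_000
xs = [random.uniform(-1, 1) for _ in range(n)]
ys = [2 * x for x in xs]  # Y = 2X, completely determined by X

e_x = sum(xs) / n
e_y = sum(ys) / n
e_xy = sum(x * y for x, y in zip(xs, ys)) / n

print(e_x * e_y)  # ~0
print(e_xy)       # ~0.667 = 2/3 -> E[XY] != E[X]E[Y], so X and Y are NOT independent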

Read more about Integration

πŸ“˜ Identically Distributed
Samples of a random variable X are said to be identically distributed if each sample comes from the same probability distribution (Gaussian, Bernoulli, Uniform, etc.) with the same parameters, i.e., the same mean, variance, etc.
Similarly, random variables X & Y are said to be identically distributed if they belong to the same probability distribution.


πŸ“˜

Independent & Identically Distributed(I.I.D)
I.I.D assumption for samples(data points) in a dataset means that the samples are:

  • Independent, i.e, each sample is independent of the other.
  • Identically distributed, i.e, each sample is drawn from the same probability distribution.
For example:

  • We take the heights of a random sample of people to estimate the average height of the population of a city.
    • Here ‘independent’ assumption means that the height of each person in the sample is independent of the other person.
      Usually, heights of members of the same family may be highly correlated.
      However, for practical purposes, we can assume that all the heights are independent of one another.
    • And, for ‘identically distributed’ - we can assume that all the heights are from the same Gaussian distribution with some mean and variance.



End of Section

11 - Convergence

Convergence of Random Variables

In this section, we will understand Convergence of Random Variables.
We will only discuss the below two types of convergence:

  • Convergence in Probability
  • Almost Sure Convergence


πŸ“˜

Convergence in Probability:
A sequence of random variables \(X_1, X_2, \dots, X_n\) is said to converge in probability to a known random variable \(X\),
if for every number \(\epsilon >0 \), the following is true:

\[ \lim_{n\rightarrow\infty} P(|X_n - X| > \epsilon) = 0, \forall \epsilon >0 \]

where,
\(X_n\): is the estimator or sample based random variable.
\(X\): is the known or limiting or target random variable.
\(\epsilon\): is the tolerance level or margin of error.

Read more about Limits

For example:

  • Toss a fair coin:
    Estimator:
    \[ X_n = \begin{cases} \frac{n}{n+1} & \text{, if Head } \\ \\ \frac{1}{n} & \text{, if Tail} \\ \end{cases} \]
    Known random variable (Bernoulli):
    \[ X = \begin{cases} 1 & \text{, if Head } \\ \\ 0 & \text{, if Tail} \\ \end{cases} \]
\[ X_n - X = \begin{cases} \frac{n}{n+1} - 1 = \frac{-1}{n+1} & \text{, if Head } \\ \\ \frac{1}{n} - 0 = \frac{1}{n} & \text{, if Tail} \\ \end{cases} \]\[ |X_n - X| = \begin{cases} \frac{1}{n+1} & \text{, if Head } \\ \\ \frac{1}{n} & \text{, if Tail} \\ \end{cases} \]

Say, tolerance level \(\epsilon = 0.1\).
Then,

\[ \lim_{n\rightarrow\infty} P(|X_n - X| > \epsilon) = ? \]

If n=5;

\[ |X_n - X| = \begin{cases} \frac{1}{n+1} = \frac{1}{6} \approx 0.16 & \text{, if Head } \\ \\ \frac{1}{n} = \frac{1}{5} = 0.2 & \text{, if Tail} \\ \end{cases} \]

So, if n=5, then \(|X_n - X| > (\epsilon = 0.1)\).
=> \(P(|X_n - X| \ge (\epsilon=0.1)) = 1\).

if n=20;

\[ |X_n - X| = \begin{cases} \frac{1}{n+1} = \frac{1}{21} \approx 0.04 & \text{, if Head } \\ \\ \frac{1}{n} = \frac{1}{20} = 0.05 & \text{, if Tail} \\ \end{cases} \]

So, if n=20, then \(|X_n - X| < (\epsilon=0.1)\).
=> \(P(|X_n - X| \ge (\epsilon=0.1)) = 0\).
=> \(P(|X_n - X| \ge (\epsilon=0.1)) = 0 ~\forall ~n > 10\)

Therefore,

\[ \lim_{n\rightarrow\infty} P(|X_n - X| \ge (\epsilon=0.1)) = 0 \]

Similarly, we can prove that if \(\epsilon = 0.01\), then the probability will be equal to 0 for \( n > 100 \).

\[ \lim_{n\rightarrow\infty} P(|X_n - X| \ge (\epsilon=0.01)) = 0 \]

Note: The task is to check whether the sequence of random variables \(X_1, X_2, \dots, X_n\) converges in probability to a known random variable \(X\) as \(n\rightarrow\infty\).

So, we can conclude that, if \(n > \frac{1}{\epsilon}\), then:

\[ \lim_{n\rightarrow\infty} P(|X_n - X| \ge \epsilon) = 0, \forall ~ \epsilon >0 \\[10pt] \text{Converges in Probability } \\[10pt] => X_n \xrightarrow{Probability} X \]


πŸ“˜

Almost Sure Convergence:
A sequence of random variables \(X_1, X_2, \dots, X_n\) is said to almost surely converge to a known random variable \(X\),
for \(n \ge 1\), if the following is true:

\[ P(\lim_{n\rightarrow\infty} X_n = X) = 1 \\[10pt] \text{Almost Sure or With Probability = 1 } \\[10pt] => X_n \xrightarrow{Almost ~ Sure} X \]

where,
\(X_n\): is the estimator or sample based random variable.
\(X\): is the known or limiting or target random variable.

If, \(X_n \xrightarrow{Almost ~ Sure} X \), => \(X_n \xrightarrow{Probability} X \)
But, converse is NOT true.

Note: Almost Sure convergence is the hardest to satisfy amongst the modes of convergence, such as, convergence in probability, convergence in distribution, etc.

Read more about Limits

For example:

  • \(X\) is a random variable such that \(X = \frac{1}{2} \), a constant, i.e \(X_1 = X_2 = \dots = X_n = \frac{1}{2}\).
    \(Y_1, Y_2,\dots ,Y_n \) are another sequence of random variables, such that :
    \[ Y_1 = X_1 \\[10pt] Y_2 = \frac{X_1 + X_2}{2} \\[10pt] Y_3 = \frac{X_1 + X_2 + X_3}{3} \\[10pt] \dots \\ Y_n = \frac{1}{n} \sum_{i=1}^{n} X_i \xrightarrow{Almost ~ Sure} \frac{1}{2} \]



End of Section

12 - Law of Large Numbers

Law of Large Numbers

In this section, we will understand Law of Large Numbers.

  • Weak Law of Large Numbers (WLLN)
  • Strong Law of Large Numbers (SLLN)


πŸ“˜

Weak Law of Large Numbers (WLLN):
This law states that given a sequence of independent and identically distributed (IID) samples \(X_1, X_2, \dots, X_n\) from a random variable with finite mean, the sample mean (\(\bar{X_n}\)) converges in probability to the expected value \(E[X]\) or population mean (\( \mu \)).

\[ \lim_{n\rightarrow\infty} P(|\bar{X_n} - E[X]| \ge \epsilon) = 0, \forall ~ \epsilon >0 \\[10pt] \\[10pt] \frac{1}{n} \sum_{i=1}^{n} X_i \xrightarrow{Probability} E[X], \text{ as } n \rightarrow \infty \]


Note: It does NOT guarantee that the sample mean will be close to the population mean,
but instead says that the probability of the sample mean being far away from the population mean becomes very small as \(n\) grows.


Read more about Limits

For example:

  • Toss a coin a large number of times \('n'\); as \(n \rightarrow \infty\), the proportion of heads will probably be very close to \(0.5\) (see the simulation sketch after this example).
    However, it does NOT rule out the possibility of a rare sequence, e.g., getting 10 consecutive heads.
    But, the probability of such a rare event is extremely low.
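A minimal Python simulation (standard library only; names are illustrative) of this behaviour, printing the running proportion of heads as the number of tosses grows:

import random

random.seed(0)
n = 100_000
heads = 0
for i in range(1, n + 1):
    heads += random.randint(0, 1)  # 1 = head, 0 = tail for a fair coin
    if i in (10, 100, 1_000, 10_000, 100_000):
        print(i, heads / i)        # the proportion of heads drifts towards 0.5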



πŸ“˜

Strong Law of Large Numbers (SLLN):
This law states that given a sequence of independent and identically distributed (IID) samples \(X_1, X_2, \dots, X_n\) from a random variable with finite mean, the sample mean (\(\bar{X_n}\)) converges almost surely to the expected value \(E[X]\) or population mean (\( \mu \)).

\[ P(\lim_{n\rightarrow\infty} \bar{X_n} = E[X]) = 1, \text{ as } n \rightarrow \infty \\[10pt] \frac{1}{n} \sum_{i=1}^{n} X_i \xrightarrow{Almost ~ Sure} E[X], \text{ as } n \rightarrow \infty \]


Note:

  • It guarantees that the sequence of sample averages itself converges to the population mean, with the exception of a set of outcomes that has probability = 0.
  • Almost certain guarantee; Much stronger statement than Weak Law of Large Numbers.

Read more about Limits

For example:

  • Toss a coin a large number of times \('n'\); as \(n \rightarrow \infty\), the proportion of heads will converge to \(0.5\), with probability = 1.
    This means that a sequence where the proportion of heads never settles down to 0.5, is a probability = 0 event.

Application:

  • Almost sure convergence ensures ML model’s reliability by guaranteeing that the average error on a large dataset will converge to the true error.
    Thus, providing confidence that model will perform consistently and accurately on unseen data.



End of Section

13 - Markov's Inequality

Markov’s, Chebyshev’s Inequality & Chernoff Bound

In this section, we will understand Markov’s, Chebyshev’s Inequality & Chernoff Bound.


πŸ“˜

Markov’s Inequality:
It gives an upper bound on the probability that a non-negative random variable exceeds a given value, based on its expected value.

\[ P(X \ge a) \le \frac{E[X]}{a} \]

Note: It gives a very loose upper bound.

Read more about Expectation


πŸ’‘ A restaurant, on an average, expects to serve 50 customers per hour.
What is the probability that the restaurant will serve more than 200 customers in the next hour?

\[ P(X \ge 200) \le \frac{E[X]}{200} = \frac{50}{200} = 0.25 \]

Hence, there is at most a 25% chance of serving more than 200 customers.


πŸ’‘ Consider a test where the average score is 70/100 marks.
What is the probability that a randomly selected student gets a score of 90 marks or more?

\[ P(X \ge a) \le \frac{E[X]}{a} \\[10pt] P(X \ge 90) \le \frac{70}{90} \approx 0.78 \]

Hence, there is at most a 78% chance that a randomly selected student gets a score of 90 marks or more (a very loose bound).


πŸ“˜

Chebyshev’s Inequality:
It states that the probability of a random variable deviating from its mean is small if its variance is small.

  • It is a more powerful version of Markov’s Inequality.
  • It uses variance of the distribution in addition to the expected value or mean.
  • Also, it does NOT assume the random variable to be non-negative.
  • It uses more information about the data i.e mean and variance.
\[ P(|X - E[X]| \ge k) \le \frac{E[(X - E[X])^2]}{k^2} \text{ ; k > 0 } \\[10pt] \text{ We know that: } Var[X] = E[(X - E[X])^2] \\[10pt] => P(\big|X - E[X]\big| \ge k) \le \frac{Var[X]}{k^2} \]

Note: It gives a tighter upper bound than Markov’s Inequality.


πŸ’‘ Consider a test where the average score is 70/100 marks.
What is the probability that a randomly selected student gets a score of 90 marks or more?
Given the standard deviation of the test score is 10 marks.

Given, the standard deviation of the test score \(\sigma\) = 10 marks.
=> Variance = \(\sigma^2\) = 100

\[ P(\big|X - E[X]\big| \ge k) \le \frac{Var[X]}{k^2} \\[10pt] E[X] = 70, Var[X] = 100 \\[10pt] P(X \ge 90) \le P(\big|X - 70\big| \ge 20) \\[10pt] P(\big|X - 70\big| \ge 20) \le \frac{100}{20^2} = \frac{1}{4} = 0.25 \]

Hence, Chebyshev’s Inequality gives a far tighter upper bound of 25% than Markov’s Inequality’s bound of approximately 78%.
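
The following sketch (NumPy assumed) computes the two bounds for the test-score example; the normal distribution used for the empirical check is only an illustrative assumption, since both inequalities are distribution-free.

```python
import numpy as np

mean, std, a = 70.0, 10.0, 90.0

# Markov's inequality: P(X >= a) <= E[X] / a  (X assumed non-negative).
markov_bound = mean / a

# Chebyshev's inequality: P(|X - E[X]| >= k) <= Var[X] / k^2, here k = a - mean.
k = a - mean
chebyshev_bound = std**2 / k**2

# For illustration only: empirical tail probability under an assumed N(70, 10^2).
rng = np.random.default_rng(1)
samples = rng.normal(mean, std, size=1_000_000)
empirical = (samples >= a).mean()

print(f"Markov bound          : {markov_bound:.3f}")    # ~0.778
print(f"Chebyshev bound       : {chebyshev_bound:.3f}")  # 0.250
print(f"Empirical (N(70,100)) : {empirical:.3f}")        # ~0.023
```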


πŸ“˜

Chernoff Bound:
It is an upper bound on the probability that a random variable deviates from its expected value.
It’s an exponentially decreasing bound on the “tail” of a random variable’s distribution, which can be calculated using its moment generating function.

  • It is used for sum or average of independent random variables (not necessarily identically distributed).
  • It provides exponentially tighter bounds, better than Chebyshev’s Inequality’s quadratic decay.
  • It uses all moments to capture the full shape of the distribution, using the moment generating function(MGF).
\[ P(X \ge c) \le e^{-tc}E[e^{tX}] , \forall t>0\\[10pt] \text{ where } E[e^{tX}] \text{ is the Moment Generating Function of } X \]


Proof:

\[ P(X \ge c) = P(e^{tX} \ge e^{tc}), \text{ provided } t>0 \\[10pt] \text{ using Markov's Inequality: } \\[10pt] P(e^{tX} \ge e^{tc}) \le \frac{E[e^{tX}]}{e^{tc}} =e^{-tc}E[e^{tX}] \\[10pt] \]


For the sum of ’n’ independent random variables,

\[ P(S_n \ge c) \le e^{-tc}\prod_{i=1}^{n} M_{X_i}(t) \\[10pt] \text{ which reduces to } e^{-tc}(M_X(t))^n \text{ when the } X_i \text{ are identically distributed, } \\[10pt] \text{ where } M_{X}(t) \text{ is the Moment Generating Function of } X \\[10pt] \]

Note: Used to compute how far the sum of independent random variables deviate from their expected value.

Read more about Moment Generating Function
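
As a rough numerical sketch (NumPy assumed; the values of \(n\), \(p\), and \(c\) are arbitrary), the Chernoff bound for a sum of I.I.D. Bernoulli(\(p\)) variables can be obtained by minimizing \(e^{-tc}(M_X(t))^n\) over a grid of \(t\) values, with \(M_X(t) = 1 - p + pe^t\):

```python
import numpy as np

n, p, c = 100, 0.5, 60          # bound P(S_n >= 60) where S_n ~ Binomial(100, 0.5)

# Moment generating function of a single Bernoulli(p): M_X(t) = 1 - p + p*e^t.
def mgf(t):
    return 1 - p + p * np.exp(t)

# Chernoff bound: P(S_n >= c) <= min_{t>0} e^{-tc} * (M_X(t))^n, minimized over a grid.
ts = np.linspace(1e-4, 5, 10_000)
chernoff_bound = np.min(np.exp(-ts * c) * mgf(ts) ** n)

# Chebyshev bound for comparison: Var[S_n] = n*p*(1-p), deviation k = c - n*p.
chebyshev_bound = (n * p * (1 - p)) / (c - n * p) ** 2

# Simulated tail probability for reference.
rng = np.random.default_rng(2)
empirical = (rng.binomial(n, p, size=1_000_000) >= c).mean()

print(f"Chernoff bound : {chernoff_bound:.4f}")   # ~0.13, exponentially tight
print(f"Chebyshev bound: {chebyshev_bound:.4f}")  # 0.25
print(f"Empirical      : {empirical:.4f}")        # ~0.028
```

The grid minimization over \(t\) is only an approximation of the optimal exponent; in closed form the best \(t\) can also be found by differentiating the bound.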




End of Section

14 - Cross Entropy & KL Divergence

Cross Entropy & KL Divergence

In this section, we will understand Entropy, Cross Entropy, KL Divergence & JS Divergence.
All these topics come from the field of ‘Information Theory’ and can be understood through the lens of
‘Information’ or ‘Surprise’.


πŸ“˜

Surprise Factor:
It is a measure of the amount of information gained when a specific, individual event occurs, and is defined based on the probability of that event.
It is defined as the negative logarithm of the probability of the event.

  • It is also called ‘Surprisal’.
\[ S(x) = -log(P(x)) \]

Note: Logarithm(for base > 1) is a monotonically increasing function, so as x increases, log(x) also increases.

  • So, if probability P(x) increases, then surprise factor S(x) decreases.
  • Common events have a high probability of occurrence, hence a low surprise factor.
  • Rare events have a low probability of occurrence, hence a high surprise factor.

Units:

  • The unit of surprise factor with log base 2 is bits.
  • With base \(e\) (natural log), the unit is nats (natural units of information).

πŸ“˜

Entropy:
It conveys how much ‘information’ we expect to gain from a random event.

  • Entropy is the average or expected value of surprise factor of a random variable.
  • The more uniform the distribution, the greater the average surprise factor, since all possible outcomes are equally likely.
\[ H(X) = E[-log(P(x))] = -\sum_{x \in X} P(x)log(P(x)) \]

Read more about Expectation

  • Case 1:
    Toss a fair coin; \(P(H) = P(T) = 0.5 = 2^{-1}\)
    Surprise factor (Heads or Tails) = \(-log(P(x)) = -log_2(0.5) = -log_2(2^{-1}) = 1~bit\)
    Entropy = \( -\sum_{x \in X} P(x)log(P(x)) = 0.5*1 + 0.5*1 = 1 ~bit \)

  • Case 2:
    Toss a biased coin; \(P(H) = 0.9 ,~ P(T) = 0.1 \)
    Surprise factor (Heads) = \(-log(P(x)) = -log_2(0.9) \approx 0.15 ~bits \)
    Surprise factor (Tails) = \(-log(P(x)) = -log_2(0.1) \approx 3.32 ~bits\)
    Entropy = \( -\sum_{x \in X} P(x)log(P(x)) = 0.9*0.15 + 0.1*3.32 \approx 0.47 ~bits\)

Therefore, a biased coin is less surprising on an average than a fair coin, hence lower entropy.
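
A small sketch (NumPy assumed) that computes the entropy of the fair and the biased coin directly from \(H(X) = -\sum_{x} P(x)log_2(P(x))\):

```python
import numpy as np

def entropy_bits(p):
    """Entropy H(X) = -sum p(x) log2 p(x), ignoring zero-probability outcomes."""
    p = np.asarray(p, dtype=float)
    p = p[p > 0]                      # convention: 0 * log(0) = 0
    return float(-np.sum(p * np.log2(p)))

print(entropy_bits([0.5, 0.5]))       # fair coin   -> 1.0 bit
print(entropy_bits([0.9, 0.1]))       # biased coin -> ~0.469 bits
```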



πŸ“˜

Cross Entropy:
It is a measure of the average ‘information gain’ or ‘surprise’ when using an imperfect model \(Q\) to encode events from a true model \(P\).

  • It measures how surprised we are on an average, if the true distribution is \(P\), but we predict using another distribution \(Q\).
\[ H(P,Q) = E_P[-log(Q(x))] = -\sum_{i=1}^n P(x_i)log(Q(x_i)) \\[10pt] \text{ where the true distribution of X} \sim P \text{ but the predictions are made using another distribution } Q \]

A model is trained to classify images as ‘cat’ or ‘dog’. Say, for an input image the true label is ‘cat’, so the true distribution:
\(P = [1.0 ~(cat), 0.0 ~(dog)]\).
Let’s calculate the cross-entropy for the outputs of 2 models A & B.

  • Model A:
    \(Q_A = [0.8 ~(cat), 0.2 ~(dog)]\)
    Cross Entropy = \(H(P, Q_A) = -\sum_{i=1}^n P(x_i)log(Q_A(x_i)) \)
    \(= -[1*log_2(0.8) + 0*log_2(0.2)] \approx 0.32 ~bits\)
    Note: This is very low cross-entropy, since the predicted value is very close to actual, i.e low surprise.

  • Model B:
    \(Q_B = [0.2 ~(cat), 0.8 ~(dog)]\)
    Cross Entropy = \(H(P, Q_B) = -\sum_{i=1}^n P(x_i)log(Q_B(x_i)) \)
    \(= -[1*log_2(0.2) + 0*log_2(0.8)] \approx 2.32 ~bits\)
    Note: Here the cross-entropy is very high, since the predicted value is quite far from the actual truth, i.e high surprise.
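
A short sketch (NumPy assumed) reproducing the cross-entropy values for models A and B; the helper name and the small epsilon guard against \(log(0)\) are our own choices, not part of the definition.

```python
import numpy as np

def cross_entropy_bits(p, q, eps=1e-12):
    """H(P, Q) = -sum p(x) log2 q(x); eps avoids log(0), purely for convenience."""
    p, q = np.asarray(p, float), np.asarray(q, float)
    return float(-np.sum(p * np.log2(q + eps)))

P   = [1.0, 0.0]          # true label: cat
Q_A = [0.8, 0.2]          # model A prediction
Q_B = [0.2, 0.8]          # model B prediction

print(cross_entropy_bits(P, Q_A))   # ~0.32 bits (low surprise)
print(cross_entropy_bits(P, Q_B))   # ~2.32 bits (high surprise)
```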



πŸ“˜

Kullback Leibler (KL) Divergence:
It measures the information lost when one probability distribution \(Q\) is used to approximate another distribution \(P\).
It quantifies the ‘extra cost’ in bits needed to encode data using the approximate distribution \(Q\) instead of the true distribution \(P\).

KL Divergence = Cross Entropy(P,Q) - Entropy(P)

\[ \begin{aligned} D_{KL}(P \parallel Q) &= H(P, Q) - H(P) \\ &= -\sum_{i=1}^n P(x_i)log(Q(x_i)) - [-\sum_{i=1}^n P(x_i)log(P(x_i))] \\[10pt] &= \sum_{i=1}^n P(x_i)[log(P(x_i)) - log(Q(x_i)) ] \\[10pt] D_{KL}(P \parallel Q) &= \sum_{i=1}^n P(x_i)log(\frac{P(x_i)}{Q(x_i)}) \\[10pt] \text{ For continuous case: } \\ D_{KL}(P \parallel Q) &= \int_{-\infty}^{\infty} p(x)log(\frac{p(x)}{q(x)})dx \end{aligned} \]

Note:

  • If \(P = Q\) ,i.e, P and Q are the same distributions, then KL Divergence = 0.
  • KL divergence is NOT symmetrical, i.e, \(D_{KL}(P \parallel Q) \ne D_{KL}(Q \parallel P)\).

Using the same cat, dog classification problem example as mentioned above.
A model is trained to classify images as ‘cat’ or ‘dog’. Say, for an input image the true label is ‘cat’, so the true distribution:
\(P = [1.0 ~(cat), 0.0 ~(dog)]\).
Let’s calculate the KL divergence for the outputs of the 2 models A & B.
Let’s calculate entropy first, and we can reuse the cross-entropy values calculated above already.
Entropy: \(H(P) = -\sum_{x \in X} P(x)log(P(x)) = -[1*log_2(1) + 0*log_2(0)] = 0 ~bits\)
Note: \(0*log_2(0) = 0\) is a convention, justified by the limit \(\lim_{x \to 0^+} x~log(x) = 0\), since \(log(0)\) itself is undefined.

  • Model A:
    \( D_{KL}(P \parallel Q) = H(P, Q) - H(P) = 0.32 - 0 = 0.32 ~bits \)
    Model A incurs an additional 0.32 bits of surprise due to its imperfect prediction.

  • Model B:
    \( D_{KL}(P \parallel Q) = H(P, Q) - H(P) = 2.32 - 0 = 2.32 ~bits \)
    Model B has much more ‘information loss’, or incurs a higher ‘penalty’ of 2.32 bits, as its prediction was far from the truth.



πŸ“˜

Jensen-Shannon Divergence:
It is a smoothed and symmetric version of the Kullback-Leibler (KL) divergence and is calculated by averaging the
KL divergences between each of the two distributions and their combined average distribution.

  • Symmetrical and smoothed version of KL divergence.
  • Always finite; KL divergence can be infinite if \( P(x) \ne 0 \) but \( Q(x) = 0 \) for some outcome \(x\).
    • Makes JS divergence more stable for ML models where some predicted probabilities may be exactly 0.
\[ D_{JS}(P \parallel Q) = \frac{1}{2}[D_{KL}(P \parallel M) + D_{KL}(Q \parallel M)] \\ \text{ where: } M = \frac{P + Q}{2} \]

Let’s continue the cat and dog image classification example discussed above.
A model is trained to classify images as ‘cat’ or ‘dog’. Say, for an input image the true label is ‘cat’, so the true distribution:
\(P = [1.0 ~(cat), 0.0 ~(dog)]\), and let \(Q = Q_A = [0.8 ~(cat), 0.2 ~(dog)]\) be Model A’s prediction.

Step 1: Calculate the average distribution M.

\[ M = \frac{P + Q}{2} = \frac{1}{2} [[1.0, 0.0] + [0.8, 0.2]] \\[10pt] => M = [0.9, 0.1] \]

Step 2: Calculate \(D_{KL}(P \parallel M)\).

\[ \begin{aligned} D_{KL}(P \parallel M) &= \sum_{i=1}^n P(x_i)log_2(\frac{P(x_i)}{M(x_i)}) \\[10pt] &= 1*log_2(\frac{1}{0.9}) + 0*log_2(\frac{0}{0.1}) \\[10pt] & = log_2(1.111) + 0 \\[10pt] => D_{KL}(P \parallel M) &\approx 0.152 ~bits \\[10pt] \end{aligned} \]

Step 3: Calculate \(D_{KL}(Q \parallel M)\).

\[ \begin{aligned} D_{KL}(Q \parallel M) &= \sum_{i=1}^n Q(x_i)log_2(\frac{Q(x_i)}{M(x_i)}) \\[10pt] &= 0.8*log_2(\frac{0.8}{0.9}) + 0.2*log_2(\frac{0.2}{0.1}) \\[10pt] &= 0.8*log_2(0.888) + 0.2*log_2(2) \\[10pt] &= 0.8*(-0.17) + 0.2*1 \\[10pt] &= -0.136 + 0.2 \\[10pt] => D_{KL}(Q \parallel M) &\approx 0.064 ~bits \\[10pt] \end{aligned} \]

Step 4: Finally, lets put all together to calculate \(D_{JS}(P \parallel Q)\).

\[ \begin{aligned} D_{JS}(P \parallel Q) &= \frac{1}{2}[D_{KL}(P \parallel M) + D_{KL}(Q \parallel M)] \\[10pt] &= \frac{1}{2}[0.152 + 0.064] \\[10pt] &= \frac{1}{2}*0.216 \\[10pt] => D_{JS}(P \parallel Q) &= 0.108 ~ bits \\[10pt] \end{aligned} \]

Therefore, lower JS divergence value => P and Q are more similar.
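
The calculation above can be reproduced with a small sketch (NumPy assumed; the helper function names are our own), which also gives the plain KL divergence \(D_{KL}(P \parallel Q_A)\):

```python
import numpy as np

def kl_bits(p, q):
    """D_KL(P || Q) = sum p(x) log2(p(x)/q(x)), skipping terms where p(x) = 0."""
    p, q = np.asarray(p, float), np.asarray(q, float)
    mask = p > 0
    return float(np.sum(p[mask] * np.log2(p[mask] / q[mask])))

def js_bits(p, q):
    """D_JS(P || Q) = 0.5 * [D_KL(P || M) + D_KL(Q || M)], with M = (P + Q) / 2."""
    p, q = np.asarray(p, float), np.asarray(q, float)
    m = (p + q) / 2
    return 0.5 * (kl_bits(p, m) + kl_bits(q, m))

P   = [1.0, 0.0]
Q_A = [0.8, 0.2]

print(kl_bits(P, Q_A))   # ~0.32 bits, matches H(P, Q_A) - H(P)
print(js_bits(P, Q_A))   # ~0.108 bits
```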




End of Section

15 - Parametric Model Estimation

Parametric Model Estimation

In this section, we will understand Parametric Model Estimation via two philosophical approaches.

  1. Frequentist: Parameter \(\Theta\) is fixed, but unknown.
  2. Bayesian: Parameter \(\Theta\) itself is unknown, so we model it as a random variable with a probability distribution.


πŸ“˜

Parametric Model Estimation:
It is the process of finding the best-fitting finite set of parameters \(\Theta\) for a model that assumes a specific probability distribution for the data.
It involves using the dataset to estimate the parameters (like the mean and standard deviation of a normal distribution) that define the model.

\(P(X \mid \Theta) \) : Probability of seeing data \(X: (X_1, X_2, \dots, X_n) \), given the parameters \(\Theta\) of the underlying probability distribution from which the data is assumed to be generated.

Goal of estimation:
We observed data, \( D = \{X_1, X_2, \dots, X_n\} \), and we want to infer the unknown parameters \(\Theta\) of the underlying probability distribution, assuming that the data is generated I.I.D.
Read more about I.I.D

Note: Most of the time, from experience, we know the underlying probability distribution of data, such as, Bernoulli, Gaussian, etc.


There are 2 philosophical approaches to estimate the parameters of a parametric model:

  1. Frequentist:
  • The parameter \(\Theta\) is fixed but unknown; only the data is random.
  • It views probability as the long-run frequency of events in repeated trials; e.g., toss a coin ’n’ times.
  • It is favoured when the sample size is large.
  • For example, Maximum Likelihood Estimation (MLE), Method of Moments, etc.
  2. Bayesian:
  • The parameter \(\Theta\) itself is unknown, so we model it as a random variable with a probability distribution.
  • It views probability as a degree of belief that can be updated with new evidence, i.e. data.
    Thus, it integrates prior knowledge with data to express the uncertainty about the parameters.
  • It is favoured when the sample size is small, as it also uses prior belief about the data distribution.
  • For example, Maximum A Posteriori Estimation (MAP), Minimum Mean Square Error Estimation (MMSE), etc.


πŸ“˜

Maximum Likelihood Estimation:
It is the most popular frequentist approach to estimate the parameters of a model.
This method helps us find the parameters \(\Theta\) that make the data most probable.

Likelihood Function:
Say we have data \(D = \{X_1, X_2, \dots, X_n\}\), where the \(X_i\) are I.I.D discrete random variables with PMF \(P_{\Theta}(.)\)
Then, the likelihood function is the probability of observing the data, \(D = \{X_1, X_2, \dots, X_n\}\), given the parameters \(\Theta\).

\[ \begin{aligned} \mathcal{L_{X_1, X_2, \dots, X_n}}(\Theta) &= P_{\Theta}(X_1, X_2, \dots, X_n) \\ &= \prod_{i=1}^{n} P_{\Theta}(X_i) \text{ ~ since, I.I.D } \\ \text{ for continuous case: } \\ & = \prod_{i=1}^{n} f_{\Theta}(x_i) \end{aligned} \]

Task:
Find the value of parameter \(\Theta_{ML}\) that maximizes the likelihood function.

\[ \begin{aligned} \Theta_{ML}(X_1, X_2, \dots, X_n) &= \underset{\Theta}{\mathrm{argmax}}\ \mathcal{L_{X_1, X_2, \dots ,X_n}}(\Theta) \\ &= \underset{\Theta}{\mathrm{argmax}}\ \prod_{i=1}^{n} P_{\Theta}(X_i) \end{aligned} \]

In order to find the parameter \(\Theta_{ML}\) that maximises the likelihood function,
we need to take the first derivative of the likelihood function with respect to \(\Theta\) and equate it to zero.
But, taking the derivative of a product is cumbersome, so we will take the logarithm on both sides.
Note: Log is a monotonically increasing function, i.e, as x increases, log(x) increases too.

Let us denote the log-likelihood function as \(\bar{L}\).

\[ \begin{aligned} \mathcal{\bar{L}_{X_1, X_2, \dots ,X_n}}(\Theta) &= log [\prod_{i=1}^{n} P_{\Theta}(X_i)] \\ &= \sum_{i=1}^{n} log P_{\Theta}(X_i) \\ \end{aligned} \]

Therefore, Maximum Likelihood Estimation is the parameter \(\Theta_{ML}\) that maximises the log-likelihood function.

\[ \Theta_{ML}(X_1, X_2, \dots, X_n) = \underset{\Theta}{\mathrm{argmax}}\ \bar{L}_{X_1, X_2, \dots ,X_n}(\Theta) \]

πŸ’‘

Given \(X_1, X_2, \dots, X_n\) are I.I.D. Bernoulli random variables with PMF as below:

\[ P(X_i = 1) = \theta \\ P(X_i = 0) = 1 - \theta \]

Estimate the parameter \(\theta\) using Maximum Likelihood Estimation.


Let, \(n_1 = \) number of 1’s in the dataset.

Likelihood Function:

\[ \begin{aligned} \mathcal{L_{X_1, X_2, \dots, X_n}}(\Theta) &= \prod_{i=1}^{n} P_{\Theta}(X_i) \\ &= \theta^{n_1} (1 - \theta)^{n - n_1} \\ \end{aligned} \]

Log-Likelihood Function:

\[ \begin{aligned} \bar{L}_{X_1, X_2, \dots, X_n}(\Theta) &= log [\theta^{n_1} (1 - \theta)^{n - n_1}] \\ &= n_1 log \theta + (n - n_1) log (1 - \theta) \\ \end{aligned} \]

Maximum Likelihood Estimation:

\[ \begin{aligned} \Theta_{ML} &= \underset{\Theta}{\mathrm{argmax}}\ \bar{L}_{X_1, X_2, \dots, X_n}(\Theta) \\ &= \underset{\Theta}{\mathrm{argmax}}\ n_1 log \theta + (n - n_1) log (1 - \theta) \\ \end{aligned} \]

In order to find the parameter \(\theta_{ML}\), we need to take the first derivative of the log-likelihood function with respect to \(\theta\) and equate it to zero.
Read more about Derivative

\[ \begin{aligned} \Theta_{ML} &= \underset{\Theta}{\mathrm{argmax}}\ n_1 log \theta + (n - n_1) log (1 - \theta) \\ =>& \frac{d}{d\theta} (n_1 log \theta + (n - n_1) log (1 - \theta)) = 0\\ =>& \frac{n_1}{\theta} + \frac{(n - n_1)}{1 - \theta}*(-1) = 0 \\ =>& \frac{n_1}{\theta} = \frac{(n - n_1)}{1 - \theta} \\ =>& n_1 - n_1\theta = n\theta - n_1\theta\\ =>& n_1 = n\theta \\ =>& \theta = \frac{n_1}{n} \text{ ~ i.e proportion of 1's} \\ \end{aligned} \]

Say, e.g., we have 10 observations for the Bernoulli random variable as: 1,0,1,1,0,1,1,0,1,0.
Then, the parameter \(\Theta_{ML} = \frac{6}{10} = 0.6\) i.e proportion of 1’s.
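
A quick sketch (NumPy assumed) for this example: the closed-form MLE \(\theta_{ML} = n_1/n\) is cross-checked against a brute-force maximization of the log-likelihood over a grid of candidate \(\theta\) values.

```python
import numpy as np

data = np.array([1, 0, 1, 1, 0, 1, 1, 0, 1, 0])   # the 10 observations above
n, n1 = len(data), int(data.sum())

# Closed-form MLE: proportion of 1's.
theta_closed_form = n1 / n

# Brute force: evaluate the log-likelihood n1*log(theta) + (n - n1)*log(1 - theta)
# on a grid of theta values and pick the maximizer.
thetas = np.linspace(0.001, 0.999, 999)
log_lik = n1 * np.log(thetas) + (n - n1) * np.log(1 - thetas)
theta_grid = thetas[np.argmax(log_lik)]

print(theta_closed_form)      # 0.6
print(round(theta_grid, 3))   # ~0.6
```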


πŸ’‘ Given \(X_1, X_2, \dots, X_n\) are I.I.D. Gaussian \( \sim N(\mu, \sigma^2) \)
\(x_1, x_2, \dots, x_n\) are the realisations/observations of the random variable.

Estimate the parameters \(\mu\) and \(\sigma\) of the Gaussian distribution using Maximum Likelihood Estimation.

Likelihood Function:

\[ \begin{aligned} \mathcal{L_{X_1, X_2, \dots, X_n}}(\Theta) &= \prod_{i=1}^{n} f_{\mu, \sigma}(X_i) \\ &= \prod_{i=1}^{n} \frac{1}{\sqrt{2\pi\sigma^2}} e^{-\frac{(x_i - \mu)^2}{2\sigma^2}} \\ &= (\frac{1}{\sqrt{2\pi\sigma^2}})^n \prod_{i=1}^{n} e^{-\frac{(x_i - \mu)^2}{2\sigma^2}} \\ \end{aligned} \]

Log-Likelihood Function:

\[ \begin{aligned} \bar{L}_{X_1, X_2, \dots, X_n}(\Theta) &= log [(\frac{1}{\sqrt{2\pi\sigma^2}})^n \prod_{i=1}^{n} e^{-\frac{(x_i - \mu)^2}{2\sigma^2}}] \\ &= log (2\pi\ \sigma^2)^{\frac{-n}{2}} - \sum_{i=1}^{n} \frac{(x_i - \mu)^2}{2\sigma^2} \\ => \bar{L}_{X_1, X_2, \dots, X_n}(\Theta) &= -\frac{n}{2} log (2\pi) -nlog(\sigma) - \frac{1}{2\sigma^2} \sum_{i=1}^{n} (x_i - \mu)^2 \\ \end{aligned} \]

Note: Here, the first term \( -\frac{n}{2} log (2\pi) \) is a constant wrt both \(\mu\) and \(\sigma\), so we can ignore the term.

Maximum Likelihood Estimation:

\[ \mu_{ML}, \sigma_{ML} = \underset{\mu, \sigma}{\mathrm{argmax}}\ \bar{L}_{X_1, X_2, \dots, X_n}(\Theta) \\ \]

Instead of finding \(\mu\) and \(\sigma\) that maximises the log-likelihood function,
we can find \(\mu\) and \(\sigma\) that minimises the negative of the log-likelihood function.

\[ \mu_{ML}, \sigma_{ML} = \underset{\mu, \sigma}{\mathrm{argmin}}\ -\bar{L}_{X_1, X_2, \dots, X_n}(\Theta) \\ \mu_{ML}, \sigma_{ML} = \underset{\mu, \sigma}{\mathrm{argmin}}\ nlog(\sigma) + \frac{1}{2\sigma^2} \sum_{i=1}^{n} (x_i - \mu)^2 \]

Now, let’s differentiate the negative log-likelihood function wrt \(\mu\) and \(\sigma\) separately to get \(\mu_{ML}, \sigma_{ML}\).

Let’s calculate \(\mu_{ML}\) first by taking the derivative of the negative log-likelihood function wrt \(\mu\) and equating it to 0.
Read more about Derivative

\[ \begin{aligned} &\frac{d}{d\mu} [-\bar{L}_{X_1, X_2, \dots, X_n}(\Theta)] = \frac{d}{d\mu} [nlog(\sigma) + \frac{1}{2\sigma^2} \sum_{i=1}^{n} (x_i - \mu)^2] = 0 \\ &=> 0 + \frac{2}{2\sigma^2} \sum_{i=1}^{n} (x_i - \mu)*(-1) = 0\\ &=> \sum_{i=1}^{n} x_i - n\mu = 0 \\ &=> n\mu = \sum_{i=1}^{n} x_i \\ &=> \mu_{ML} = \frac{1}{n} \sum_{i=1}^{n} x_i \\ \end{aligned} \]

Similarly, we can calculate \(\sigma_{ML}\) by taking the derivative of the log-likelihood function wrt \(\sigma\) and equating it to 0.

\[ \begin{aligned} \frac{d}{d\sigma} [-\bar{L}_{X_1, X_2, \dots, X_n}(\Theta)] &= \frac{d}{d\sigma} [nlog(\sigma) + \frac{1}{2\sigma^2} \sum_{i=1}^{n} (x_i - \mu)^2] = 0 \\ => & \frac{n}{\sigma} - \frac{1}{\sigma^3} \sum_{i=1}^{n} (x_i - \mu)^2 = 0 \\ => & \sigma^2_{ML} = \frac{1}{n} \sum_{i=1}^{n} (x_i - \mu_{ML})^2 \end{aligned} \]

Note: In general, the MLE is biased, i.e., it does NOT always give an unbiased estimate. Here, the MLE of the variance divides by \(n\) instead of \((n-1)\), so it is a biased estimator of \(\sigma^2\).
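
A small sketch (NumPy assumed; the true parameters and sample size are arbitrary) of the Gaussian MLE, contrasting the biased variance estimate (divide by \(n\)) with the unbiased one (divide by \(n-1\)):

```python
import numpy as np

rng = np.random.default_rng(3)
x = rng.normal(loc=5.0, scale=2.0, size=1_000)   # assumed sample from N(5, 2^2)

mu_ml = x.mean()                        # MLE of the mean: sample average
var_ml = np.var(x, ddof=0)              # MLE of the variance: divides by n (biased)
var_unbiased = np.var(x, ddof=1)        # sample variance: divides by n - 1 (unbiased)

print(f"mu_ML             = {mu_ml:.3f}")
print(f"sigma^2_ML (n)    = {var_ml:.3f}")
print(f"sigma^2  (n - 1)  = {var_unbiased:.3f}")
```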



πŸ“˜

Bayesian Statistics:
Bayesian statistics model parameters by updating initial beliefs (prior probabilities) with observed data to form a final belief (posterior probability) using Bayes’ Theorem.
Instead of a single point estimate, it provides a probability distribution over possible parameter values, which allows us to quantify uncertainty and yields more robust models, especially with limited data.

Bayes’ Theorem:

\[ P(\Theta \mid X) = \frac{P(\Theta)P(X \mid \Theta)}{P(X)} \]

Read more about Bayes’ Theorem

\(P(\Theta)\): Prior: Initial distribution of \(\Theta\) before seeing the data.
\(P(X \mid \Theta)\): Likelihood: Conditional distribution of data \(X\), given the parameter \(\Theta\).
\(P(\Theta \mid X)\): Posterior: Conditional distribution of parameter \(\Theta\), given the data \(X\).
\(P(X)\): Evidence: Probability of seeing the data \(X\).


πŸ’‘ \(X_1, X_2, \dots, X_n\) are I.I.D. Bernoulli random variables.
\(x_1, x_2, \dots, x_n\) are the realisations, \(\Theta \in [0, 1] \).
Estimate the parameter \(\Theta\) using Bayesian statistics.

Let, \(n_1 = \) number of 1’s in the dataset.
Prior: \(P(\Theta)\) : \(\Theta \sim U(0, 1) \), i.e, parameter \(\Theta\) comes from a continuous uniform distribution in the range [0,1].
\(f_{\Theta}(\theta) = 1, \theta \in [0, 1] \)

Likelihood: \(P_{X \mid \Theta}(x \mid \theta) = \theta^{n_1} (1 - \theta)^{n - n_1} \).

Posterior:

\[ f_{\Theta \mid X} (\theta \mid x) = \frac{f_{\Theta}(\theta) P_{X \mid \Theta}(x \mid \theta)}{f_{X}(x)} \\ \]

Note: The most difficult part is to calculate the denominator \(f_{X}(x)\).
So, either we try NOT to compute it at all, or we try to map it to some known function to make the calculations easier.

We know that we can get the marginal probability by integrating the joint probability over another variable.

\[ \tag{1} f_{X}(x) = \int_{Y}f_{X,Y}(x,y)dy \\ \]

Also from conditional probability, we know:

\[ \tag{2} f_{X \mid Y}(x \mid y) = \frac{f_{X,Y}(x,y)}{f_{Y}(y)} \\ => f_{X,Y}(x,y) = f_{Y}(y) f_{X \mid Y}(x \mid y) \]

From equations 1 and 2, we have:

\[ \tag{3} f_{X}(x) = \int_{Y}f_{Y}(y) f_{X \mid Y}(x \mid y)dy \]

Now let’s replace the value of \(f_{X}(x) \) in the posterior from equation 3:
Posterior:

\[ \begin{aligned} f_{\Theta \mid X} (\theta \mid x) &= \frac{f_{\Theta}(\theta) P_{X \mid \Theta}(x \mid \theta)}{f_{X}(x)} \\[10pt] &= \frac{f_{\Theta}(\theta) P_{X \mid \Theta}(x \mid \theta)}{\int_{\Theta}f_{\Theta}(\theta) P_{X \mid \Theta}(x \mid \theta)d\theta} \\[10pt] &= \frac{f_{\Theta}(\theta) P_{X \mid \Theta}(x \mid \theta)}{\int_{0}^1f_{\Theta}(\theta) P_{X \mid \Theta}(x \mid \theta)d\theta} \\[10pt] \text{ We know that: } f_{\Theta}(\theta) = 1, \theta \in [0, 1] \\ &= \frac{1* P_{X \mid \Theta}(x \mid \theta)}{\int_{0}^1 1* P_{X \mid \Theta}(x \mid \theta)d\theta} \\[10pt] => f_{\Theta \mid X} (\theta \mid x) & = \frac{\theta^{n_1} (1 - \theta)^{n - n_1}}{\int_{0}^1 \theta^{n_1} (1 - \theta)^{n - n_1}d\theta} \\ \end{aligned} \]

Beta Function:
It is a special mathematical function denoted by \(B(a, b)\) or \(\beta(a, b)\) that is defined by the integral formula:

\[ \beta(a, b) = \int_{0}^1 t^{a-1}(1-t)^{b-1}dt \]

Note: We can see that the denominator of the posterior is of the form of Beta function.
Posterior:

\[ f_{\Theta \mid X} (\theta \mid x) = \frac{\theta^{n_1} (1 - \theta)^{n - n_1}}{\beta(n_1+1, n-n_1+1)} \]
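
Since the posterior above is exactly a Beta\((n_1+1, n-n_1+1)\) density, it can be explored with a small sketch (SciPy assumed; the values of \(n\) and \(n_1\) are arbitrary):

```python
import numpy as np
from scipy.stats import beta

n, n1 = 10, 6                       # e.g. 6 ones in 10 Bernoulli observations
a, b = n1 + 1, n - n1 + 1           # posterior is Beta(n1 + 1, n - n1 + 1)

posterior = beta(a, b)

thetas = np.linspace(0, 1, 5)
print(posterior.pdf(thetas))        # posterior density at a few theta values
print(posterior.mean())             # posterior mean = (n1 + 1) / (n + 2) = 7/12
print((a - 1) / (a + b - 2))        # posterior mode = n1 / n = 0.6
```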

πŸ’‘

Suppose, in the above example, we are told that the parameter \(\Theta\) is closer to 1 than 0.
How will we incorporate this useful information (apriori knowledge) into our parameter estimation?


Since, we know apriori that \(\Theta\) is closer to 1 than 0, we should take this initial belief into account to do our parameter estimation.

Prior:

\[ f_{\Theta}(\theta) = \begin{cases} 2\theta, & \forall ~ \theta \in [0,1] \\ \\ 0, & \text{otherwise} \end{cases} \\ \]

Posterior:

\[ \begin{aligned} f_{\Theta \mid X} (\theta \mid x) &= \frac{f_{\Theta}(\theta) P_{X \mid \Theta}(x \mid \theta)}{f_{X}(x)} \\[10pt] &= \frac{f_{\Theta}(\theta) P_{X \mid \Theta}(x \mid \theta)}{\int_{\Theta}f_{\Theta}(\theta) P_{X \mid \Theta}(x \mid \theta)d\theta} \\[10pt] &= \frac{f_{\Theta}(\theta) P_{X \mid \Theta}(x \mid \theta)}{\int_{0}^1f_{\Theta}(\theta) P_{X \mid \Theta}(x \mid \theta)d\theta} \\[10pt] \text{ We know that: } f_{\Theta}(\theta) = 2\theta, \theta \in [0, 1] \\ &= \frac{2\theta * P_{X \mid \Theta}(x \mid \theta)}{\int_{0}^1 2\theta* P_{X \mid \Theta}(x \mid \theta)d\theta} \\[10pt] & = \frac{2\theta * \theta^{n_1} (1 - \theta)^{n - n_1}}{\int_{0}^1 2\theta * \theta^{n_1} (1 - \theta)^{n - n_1}d\theta} \\[10pt] => f_{\Theta \mid X} (\theta \mid x) &= \frac{\theta^{n_1+1} (1 - \theta)^{n - n_1}}{\beta(n_1+2, n-n_1+1)} \end{aligned} \]

Note: If we do NOT have enough data, then we should NOT ignore our initial belief.
However, if we have enough data, then the data will override our initial belief and the posterior will be dominated by the data.



Note: Bayesian approach gives us a probability distribution of parameter \(\Theta\).

πŸ’‘ What if we want to have a single point estimate of the parameter \(\Theta\) instead of a probability distribution?

We can use Bayesian Point Estimators, after getting the posterior distribution, to summarize it with a single value for practical use, such as,

  • Maximum A Posteriori (MAP) Estimator
  • Minimum Mean Square Error (MMSE) Estimator


πŸ“˜

Maximum A Posteriori (MAP) Estimator:
It finds the mode(peak) of the posterior distribution.

  • MAP has the minimum probability of error, since it picks the single most probable value.
\[ \Theta_{MAP} = \underset{\Theta}{\mathrm{argmax}}\ f_{\Theta \mid X} (\theta \mid x) \text{, \(\theta\) is continuous} \\ \Theta_{MAP} = \underset{\Theta}{\mathrm{argmax}}\ P_{\Theta \mid X} (\theta \mid x) \text{, \(\theta\) is discrete} \]
πŸ’‘ Given a Gaussian distribution, \( X \sim N(\Theta, 1) \), with a prior belief that the mean \(\Theta\) is equally likely to be 0 or 1.
Estimate the unknown mean \(\Theta\) using MAP.

Given that :
\(\Theta\) is discrete, with probability 0.5 for both 0 and 1.
=> The Gaussian distribution is equally likely to be centered at 0 or 1.
Variance: \(\sigma^2 = 1\)

Prior:

\[ P_{\Theta}(\theta=0) = P_{\Theta}(\theta=1) = 1/2 \]

Likelihood:

\[ f_{X \mid \Theta}(x \mid \theta) = \prod_{i=1}^n \frac{1}{\sqrt{2\pi}\sigma} e^{-\frac{(x_i-\theta)^2}{2 \sigma^2}} \\ f_{X \mid \Theta}(x \mid \theta) = \prod_{i=1}^n \frac{1}{\sqrt{2\pi}} e^{-\frac{(x_i-\theta)^2}{2}}, \text{ since \(\sigma = 1\)} \]

Posterior:

\[ P_{\Theta \mid X} (\theta \mid x) = \frac{P_{\Theta}(\theta) f_{X \mid \Theta}(x \mid \theta)}{f_{X}(x)} \\ \]

We need to find:

\[ \underset{\Theta}{\mathrm{argmax}}\ P_{\Theta \mid X} (\theta \mid x) \]

Taking log on both sides:

\[ \tag{1}\log P_{\Theta \mid X} (\theta \mid x) = \log P_{\Theta}(\theta) + \log f_{X \mid \Theta}(x \mid \theta) - \log f_{X}(x) \]

Let’s calculate the log-likelihood function first -

\[ \begin{aligned} \log f_{X \mid \Theta}(x \mid \theta) &= \log \prod_{i=1}^n \frac{1}{\sqrt{2\pi}} e^{-\frac{(x_i-\theta)^2}{2}} \\ &= log(\frac{1}{\sqrt{2\pi}})^n + \sum_{i=1}^n \log (e^{-\frac{(x_i-\theta)^2}{2}}), \text{ since \(\sigma = 1\)} \\ &= n\log(\frac{1}{\sqrt{2\pi}}) + \sum_{i=1}^n -\frac{(x_i-\theta)^2}{2} \\ => \tag{2} \log f_{X \mid \Theta}(x \mid \theta) &= n\log(\frac{1}{\sqrt{2\pi}}) - \sum_{i=1}^n \frac{(x_i-\theta)^2}{2} \end{aligned} \]

Here, computing \(f_{X}(x)\) is difficult.
Also, since \(\Theta\) is discrete and can only take the values 0 and 1, differentiating wrt \(\Theta\) is not meaningful.
So, we will evaluate the log-posterior for \(\theta=0\) and \(\theta=1\) and compare the two values.
This way, the common term \(\log f_{X}(x)\), which does not depend on \(\Theta\), cancels out of the comparison.

When \(\theta=1\):

\[ \begin{aligned} \log P_{\Theta \mid X} (1 \mid x) &= \log P_{\Theta}(1) + \log f_{X \mid \Theta}(x \mid 1) \\ \tag{3} \log P_{\Theta \mid X} (1 \mid x) &= \log(1/2) + n\log(\frac{1}{\sqrt{2\pi}}) - \sum_{i=1}^n \frac{(x_i - 1)^2}{2}\\ \end{aligned} \]

Similarly, when \(\theta=0\):

\[ \begin{aligned} \log P_{\Theta \mid X} (0 \mid x) &= \log P_{\Theta}(0) + \log f_{X \mid \Theta}(x \mid 0) \\ \tag{4} \log P_{\Theta \mid X} (0 \mid x) &= \log(1/2) + n\log(\frac{1}{\sqrt{2\pi}}) - \sum_{i=1}^n \frac{(x_i - 0)^2}{2}\\ \end{aligned} \]

So, the MAP estimate is \(\theta = 1\) if:
the value of the log-posterior for \(\theta = 1\) > the value of the log-posterior for \(\theta = 0\). From equations 3 and 4:

\[ \begin{aligned} &\log(1/2) + n\log(\frac{1}{\sqrt{2\pi}}) - \sum_{i=1}^n \frac{(x_i - 1)^2}{2} > \log(1/2) + n\log(\frac{1}{\sqrt{2\pi}}) - \sum_{i=1}^n \frac{(x_i - 0)^2}{2} \\ => &\cancel{\log(1/2)} + \cancel{n\log(\frac{1}{\sqrt{2\pi}})} - \sum_{i=1}^n \frac{(x_i - 1)^2}{2} > \cancel{\log(1/2)} + \cancel{n\log(\frac{1}{\sqrt{2\pi}})} - \sum_{i=1}^n \frac{(x_i - 0)^2}{2} \\ => &\sum_{i=1}^n \frac{(x_i - 0)^2}{2} - \sum_{i=1}^n \frac{(x_i - 1)^2}{2} > 0 \\ => & \sum_{i=1}^n \frac{\cancel{x_i^2} - \cancel{x_i^2} + 2x_i -1 }{2} > 0 \\ => & \sum_{i=1}^n [x_i - \frac{1}{2}] > 0 \\ => & \sum_{i=1}^n x_i - \frac{n}{2} > 0 \\ => & \sum_{i=1}^n x_i > \frac{n}{2} \\ => & \frac{1}{n} \sum_{i=1}^n x_i > \frac{1}{2} \\ => & \Theta_{MAP}(X) = \begin{cases} 1 & \text{if } \frac{1}{n} \sum_{i=1}^n x_i > \frac{1}{2} \\ \\ 0 & \text{otherwise.} \end{cases} \end{aligned} \]

Note: If the prior is uniform then \(\Theta_{MAP} = \Theta_{MLE}\), because uniform prior does NOT give any information about the initial bias, and all possibilities are equally likely.



πŸ’‘

What if, in the above example, we know that the initial belief is not uniform but biased towards 0?
Prior:

\[ P_{\Theta}(\theta=0) = 3/4 , P_{\Theta}(\theta=1) = 1/4 \]

Now, let’s compare the log-posterior for both the cases i.e \(\theta=0\) and \(\theta=1\) as we did earlier.
But, note that this time the probabilities for \(\theta=0\) and \(\theta=1\) are different.

So, the MAP estimate is \(\theta = 1\) if:
the value of the log-posterior for \(\theta = 1\) > the value of the log-posterior for \(\theta = 0\). From equations 3 and 4 above:

\[ \begin{aligned} &\log(1/4) + n\log(\frac{1}{\sqrt{2\pi}}) - \sum_{i=1}^n \frac{(x_i - 1)^2}{2} > \log(3/4) + n\log(\frac{1}{\sqrt{2\pi}}) - \sum_{i=1}^n \frac{(x_i - 0)^2}{2} \\ => &\log(1/4) + \cancel{n\log(\frac{1}{\sqrt{2\pi}})} - \sum_{i=1}^n \frac{(x_i - 1)^2}{2} > \log(3/4) + \cancel{n\log(\frac{1}{\sqrt{2\pi}})} - \sum_{i=1}^n \frac{(x_i - 0)^2}{2} \\ => &\sum_{i=1}^n \frac{(x_i - 0)^2}{2} - \sum_{i=1}^n \frac{(x_i - 1)^2}{2} > \log3 -\cancel{\log4} -\log1 +\cancel{\log4} \\ => & \sum_{i=1}^n \frac{\cancel{x_i^2} - \cancel{x_i^2} + 2x_i -1 }{2} > \log3 - 0 \text{ , since log(1) = 0}\\ => & \sum_{i=1}^n [x_i - \frac{1}{2}] > \log3 \\ => & \sum_{i=1}^n x_i - \frac{n}{2} > \log3 \\ => & \sum_{i=1}^n x_i > \frac{n}{2} + \log3 \\ \text{ dividing both sides by n: } \\ => & \frac{1}{n} \sum_{i=1}^n x_i > \frac{1}{2} + \frac{\log3}{n}\\ => & \Theta_{MAP}(X) = \begin{cases} 1 & \text{if } \frac{1}{n} \sum_{i=1}^n x_i > \frac{1}{2} + \frac{\log3}{n}\\ \\ 0 & \text{otherwise.} \end{cases} \end{aligned} \]

Therefore, we can see that \(\Theta_{MAP}\) is now biased towards 0: the decision threshold rises from \(\frac{1}{2}\) to \(\frac{1}{2} + \frac{\log3}{n}\), so stronger evidence is needed before we estimate \(\theta = 1\).

Note: For a non-uniform prior, the \(\Theta_{MAP}\) estimate will be pulled towards the prior’s mode.
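
A simulation sketch of the MAP decision rule derived above (NumPy assumed; the true mean, sample size, and seed are arbitrary), showing how the biased prior raises the decision threshold from \(\frac{1}{2}\) to \(\frac{1}{2} + \frac{\log3}{n}\):

```python
import numpy as np

rng = np.random.default_rng(4)
n = 20
true_theta = 1                                   # assumed true mean for this illustration
x = rng.normal(loc=true_theta, scale=1.0, size=n)

sample_mean = x.mean()

# Uniform prior P(theta=0) = P(theta=1) = 1/2: decide 1 iff mean > 1/2 (same as MLE).
map_uniform = 1 if sample_mean > 0.5 else 0

# Biased prior P(theta=0) = 3/4, P(theta=1) = 1/4: threshold shifts to 1/2 + ln(3)/n.
map_biased = 1 if sample_mean > 0.5 + np.log(3) / n else 0

print(f"sample mean        = {sample_mean:.3f}")
print(f"threshold (biased) = {0.5 + np.log(3) / n:.3f}")
print(f"MAP (uniform prior): {map_uniform}")
print(f"MAP (biased prior) : {map_biased}")
```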



πŸ’‘ The MAP estimator is well suited to classification-like problems, such as Yes/No, True/False, etc.
e.g: Patient has a certain disease or not.
But what if we want to minimize the average magnitude of errors over time, say when predicting a stock’s price?
Just knowing whether the prediction was right or wrong is not sufficient here.
We also want to know by how much the prediction was wrong, so that we can minimize the loss over time.

We can use Minimum Mean Square Error (MMSE) Estimator to do this.

πŸ“˜

Minimum Mean Square Error (MMSE) Estimation:
It minimizes the expected value of the squared error.
The MMSE estimate is the mean of the posterior distribution, i.e., the conditional expectation of the parameter \(\Theta\) given the data.

  • The posterior mean minimizes the mean squared error.
\[ \hat\Theta_{MMSE}(X) = \underset{\hat\Theta}{\mathrm{argmin}}\ \mathbb{E}[(\hat\Theta(X) - \Theta)^2] \]

\(\hat\Theta(X)\): Predicted value.
\(\Theta\): Actual value.

\[ \hat\Theta_{MMSE}(X) = \sum_{\Theta} \theta P_{\Theta \mid X} (\theta \mid x), \text{ if \(\Theta\) is discrete} \\ \hat\Theta_{MMSE}(X) = \int_{\Theta} \theta f_{\Theta \mid X} (\theta \mid x)d\theta, \text{ if \(\Theta\) is continuous} \]

Let’s revisit the Bernoulli examples from the Bayesian estimation above.

Case 1: Uniform Continuous Prior for parameter \(\Theta\) of Bernoulli distribution
Prior:

\[ f_{\Theta}(\theta) = 1, ~or~ \Theta \sim U(0,1), ~or~ \Theta \sim \beta(1,1) \]

Posterior:

\[ f_{\Theta \mid X} (\theta \mid x) = \frac{\theta^{n_1} (1 - \theta)^{n - n_1}}{\beta(n_1+1, n-n_1+1)} \]

Let’s calculate the \(\Theta_{MMSE}\):

\[ \begin{aligned} \Theta_{MMSE}(X) &= \int_{0}^1 \theta f_{\Theta \mid X} (\theta \mid x)d\theta \\ & = \frac{\int_{0}^1 \theta * \theta^{n_1} (1 - \theta)^{n - n_1}d\theta}{\beta(n_1+1, n-n_1+1)} \\[10pt] & = \frac{ \int_{0}^1 \theta^{n_1+1} (1 - \theta)^{n - n_1}d\theta}{\beta(n_1+1, n-n_1+1)} \\[10pt] \text{ Since: } \beta(a, b) = \int_0^1 t^{a-1}(1-t)^{b-1}dt \\[10pt] => \Theta_{MMSE}(X) &= \frac{\beta(n_1+2, n-n_1+1)}{\beta(n_1+1, n-n_1+1)} = \frac{n_1+1}{n+2} \end{aligned} \]

Similarly, for the second case where the prior is biased towards 1.
Case 2: Prior is biased towards 1 for parameter \(\Theta\) of Bernoulli distribution
Prior:

\[ f_{\Theta}(\theta) = 2\theta \]

Posterior:

\[ f_{\Theta \mid X} (\theta \mid x) = \frac{\theta^{n_1+1} (1 - \theta)^{n - n_1}}{\beta(n_1+2, n-n_1+1)} \]

So, here in this case our \(\Theta_{MMSE}\) is:

\[ \Theta_{MMSE}(X) = \frac{\beta(n_1+3, n-n_1+1)}{\beta(n_1+2, n-n_1+1)} = \frac{n_1+2}{n+3} \]
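
Both MMSE estimates can be evaluated as ratios of Beta functions and cross-checked by numerically integrating \(\int_0^1 \theta f_{\Theta \mid X}(\theta \mid x) d\theta\) (SciPy assumed; the values of \(n\) and \(n_1\) are arbitrary):

```python
import numpy as np
from scipy.special import beta as B
from scipy.integrate import quad

n, n1 = 10, 6

# Case 1: uniform prior -> posterior Beta(n1 + 1, n - n1 + 1).
mmse_uniform = B(n1 + 2, n - n1 + 1) / B(n1 + 1, n - n1 + 1)     # = (n1 + 1) / (n + 2)

# Case 2: prior f(theta) = 2*theta -> posterior Beta(n1 + 2, n - n1 + 1).
mmse_biased = B(n1 + 3, n - n1 + 1) / B(n1 + 2, n - n1 + 1)      # = (n1 + 2) / (n + 3)

# Numerical cross-check for case 1: integrate theta * posterior density over [0, 1].
posterior = lambda t: t**n1 * (1 - t)**(n - n1) / B(n1 + 1, n - n1 + 1)
check, _ = quad(lambda t: t * posterior(t), 0, 1)

print(mmse_uniform, (n1 + 1) / (n + 2))   # 0.5833...
print(mmse_biased, (n1 + 2) / (n + 3))    # 0.6153...
print(check)                              # matches mmse_uniform
```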



End of Section