Joint, Marginal & Conditional Probability
In this section, we will understand Joint, Marginal & Conditional Probability.
So far, we have dealt with a single random variable.
Now, let’s explore the probability distributions of 2 or more random variables occurring together.
📘
Joint Probability Distribution:
It describes the probability of 2 or more random variables occurring simultaneously.
- The random variables can come from different types of distributions, e.g., one discrete and one continuous.
Joint CDF:
\[
F_{X,Y}(a,b) = P(X \le a, Y \le b),~ -\infty < a, b < \infty
\]
Discrete Case:
\[
F_{X,Y}(a,b) = P(X \le a, Y \le b) = \sum_{x_i \le a} \sum_{y_j \le b} P(X = x_i, Y = y_j)
\]
Continuous Case:
\[
F_{X,Y}(a,b) = P(X \le a, Y \le b) = \int_{-\infty}^{a} \int_{-\infty}^{b} f_{X,Y}(x,y) dy dx
\]
Joint PMF:
\[
P_{X,Y}(x,y) = P(X = x, Y = y)
\]
Key Properties:
- \(P(X = x, Y = y) \ge 0 ~ \forall (x,y) \)
- \( \sum_{i} \sum_{j} P(X = x_i, Y = y_j) = 1 \)
Joint PDF:
\[
f_{X,Y}(x,y) = \frac{\partial^2 F_{X,Y}(x,y)}{\partial x \partial y} \\
P((X,Y) \in A) = \iint_{A} f_{X,Y}(x,y)\, dy\, dx,~ A \subseteq \mathbb{R}^2
\]
Key Properties:
- \(f_{X,Y}(x,y) \ge 0 ~ \forall (x,y) \)
- \( \int_{-\infty}^{\infty} \int_{-\infty}^{\infty} f_{X,Y}(x,y) dy dx = 1 \)
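As a quick numerical sanity check of these two properties, here is a minimal Python sketch (assuming SciPy is available; the density \(f(x,y) = x + y\) on the unit square is a hypothetical example chosen for illustration):
```python
# Check the joint-PDF properties for a hypothetical density
# f(x, y) = x + y on the unit square (and 0 elsewhere).
from scipy import integrate

def f_xy(y, x):
    # dblquad passes the inner variable (y) first
    return x + y  # non-negative everywhere on [0,1] x [0,1]

# Integrate over the support: outer x in [0,1], inner y in [0,1]
total, _ = integrate.dblquad(f_xy, 0, 1, lambda x: 0, lambda x: 1)
print(total)  # ~1.0, so the density integrates to 1
```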
For example:
- If we consider 2 random variables, say, height(X) and weight(Y), then the joint distribution will tell us
the probability of finding a person having a particular height and weight.
💡 There are 2 bags; bag_1 has 2 red balls & 3 blue balls, bag_2 has 3 red balls & 2 blue balls.
A ball is picked at random from each bag, such that both draws are independent of each other.
Let’s use this example to understand joint probability.

Let A & B be discrete random variables associated with the outcome of the ball drawn from first and second bags
respectively.
| | A = Red | A = Blue |
|---|---|---|
| B = Red | 2/5 * 3/5 = 6/25 | 3/5 * 3/5 = 9/25 |
| B = Blue | 2/5 * 2/5 = 4/25 | 3/5 * 2/5 = 6/25 |
Since the draws are independent, the joint probability = P(A) * P(B).
Each of the 4 cells in the table above shows the probability of one combination of results from the 2 draws, i.e., a joint probability.
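As a quick sketch, the table above can be reproduced in a few lines of Python (using exact fractions to avoid floating-point noise):
```python
# Joint PMF for the two-bag example: the draws are independent,
# so P(A=a, B=b) = P(A=a) * P(B=b).
from fractions import Fraction

P_A = {"Red": Fraction(2, 5), "Blue": Fraction(3, 5)}  # bag_1: 2 red, 3 blue
P_B = {"Red": Fraction(3, 5), "Blue": Fraction(2, 5)}  # bag_2: 3 red, 2 blue

joint = {(a, b): P_A[a] * P_B[b] for a in P_A for b in P_B}
for (a, b), p in joint.items():
    print(f"P(A={a}, B={b}) = {p}")
print("Total:", sum(joint.values()))  # 1, as required of a joint PMF
```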
📘
Marginal Probability Distribution:
It describes the probability distribution of an individual random variable in a joint distribution,
without considering the outcomes of other random variables.
- If we have the joint distribution, then we can get the marginal distribution of each random variable from it.
- The marginal probability is obtained by summing (or integrating) the joint probability over the other random variables.
Marginal CDF:
We know that Joint CDF =
\[
F_{X,Y}(a,b) = P(X \le a, Y \le b),~ -\infty < a, b < \infty
\]
Marginal CDF =
\[
F_X(a) = F_{X,Y}(a, \infty) = P(X \le a, Y < \infty) = P(X \le a)
\]
Discrete Case:
\[
F_X(a) = P(X \le a, Y < \infty) = \sum_{x_i \le a} \sum_{y_j} P(X = x_i, Y = y_j)
\]
Continuous Case:
\[
F_X(a) = P(X \le a, Y < \infty) = \int_{-\infty}^{a} \int_{-\infty}^{\infty} f_{X,Y}(x,y)\,dy\,dx = \int_{-\infty}^{a} f_X(x)\,dx, \text{ where } f_X(x) = \int_{-\infty}^{\infty} f_{X,Y}(x,y)\,dy
\]
Law of Total Probability
We know that Joint Probability Distribution =
\[
P_{X,Y}(x,y) = P(X = x, Y = y)
\]
The events \((Y=y)\) partition the sample space, such that:
- \( (Y=y_i) \cap (Y=y_j) = \emptyset ~ \forall ~ i \ne j \)
- \( (Y=y_1) \cup (Y=y_2) \cup ... \cup (Y=y_n) = \Omega \)
From Law of Total Probability, we get:
Marginal PMF:
\[
P_X(x) = P(X=x) = \sum_{y} P_{X,Y}(x,y) = \sum_{y} P(X = x, Y = y)
\]
Marginal PDF:
\[
f_X(x) = \int_{-\infty}^{\infty} f_{X,Y}(x,y) dy
\]
💡 Setup: Roll a die + Toss a coin.
X: Roll a die ; \( \Omega = \{1,2,3,4,5,6\} \)
Y: Toss a coin ; \( \Omega = \{H,T\} \)
Since the roll and the toss are independent, Joint PMF = \( P_{X,Y}(x,y) = P(X=x, Y=y) = 1/6 \cdot 1/2 = 1/12 \)
Marginal PMF of X = \( P_X(x) = \sum_{y \in \{H,T\}} P_{X,Y}(x,y) = 1/12 + 1/12 = 1/6 \)
=> Marginally, X is uniform over 1-6, i.e., a fair die.
Marginal PMF of Y = \( P_Y(y) = \sum_{x=1}^{6} P_{X,Y}(x,y) = 6 \cdot (1/12) = 1/2 \)
=> Marginally, Y is uniform over {H,T}, i.e., a fair coin.
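A minimal sketch of this marginalization in Python (exact fractions again):
```python
# Marginal PMFs from the die + coin joint PMF: every joint entry is 1/12,
# and each marginal is the sum of the joint over the other variable.
from fractions import Fraction

dice, coin = range(1, 7), ["H", "T"]
joint = {(x, y): Fraction(1, 12) for x in dice for y in coin}

P_X = {x: sum(joint[(x, y)] for y in coin) for x in dice}  # 1/6 each: fair die
P_Y = {y: sum(joint[(x, y)] for x in dice) for y in coin}  # 1/2 each: fair coin
print(P_X)
print(P_Y)
```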
💡 Setup: X and Y are two independent continuous uniform random variables.
\( X \sim U(0,1) \)
\( Y \sim U(0,1) \)
Recall: Marginal PDF = \( f_X(x) = \int_{-\infty}^{\infty} f_{X,Y}(x,y)\, dy \)
Joint PDF (by independence, the product of the two uniform densities) =
$$
f_{X,Y}(x,y) =
\begin{cases}
1 & \text{if } x \in [0,1], y \in [0,1] \\
0 & \text{otherwise }
\end{cases}
$$
Marginal PDF =
\[
\begin{aligned}
f_X(x) &= \int_{0}^{1} f_{X,Y}(x,y) dy \\
&= \int_{0}^{1} 1 dy \\
&= 1 \\
f_X(x) &=
\begin{cases}
1 & \text{if } x \in [0,1] \\
0 & \text{otherwise }
\end{cases}
\end{aligned}
\]
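The same marginalization can be checked numerically. A minimal sketch (assuming SciPy is available):
```python
# Numerically marginalize the joint density of two independent U(0,1)
# variables over y to recover the marginal PDF of X.
from scipy import integrate

def f_joint(x, y):
    return 1.0 if (0 <= x <= 1 and 0 <= y <= 1) else 0.0

def f_X(x):
    # The joint is 0 outside y in [0,1], so integrating over [0,1] suffices
    val, _ = integrate.quad(lambda y: f_joint(x, y), 0, 1)
    return val

print(f_X(0.3), f_X(0.8))  # ~1.0 inside the support [0,1]
print(f_X(1.5))            # 0.0 outside the support
```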
💡 Let’s re-visit the ball drawing example.
There are 2 bags; bag_1 has 2 red balls & 3 blue balls, bag_2 has 3 red balls & 2 blue balls.
A ball is picked at random from each bag, such that both draws are independent of each other.
Let’s use this example to understand marginal probability.

Let A & B be discrete random variables associated with the outcome of the ball drawn from first and second bags
respectively.
| | A = Red | A = Blue | P(B) (Marginal) |
|---|---|---|---|
| B = Red | 2/5 * 3/5 = 6/25 | 3/5 * 3/5 = 9/25 | 6/25 + 9/25 = 15/25 = 3/5 |
| B = Blue | 2/5 * 2/5 = 4/25 | 3/5 * 2/5 = 6/25 | 4/25 + 6/25 = 10/25 = 2/5 |
| P(A) (Marginal) | 6/25 + 4/25 = 10/25 = 2/5 | 9/25 + 6/25 = 15/25 = 3/5 | |
We can see from the table above that P(A = Red) is the sum of the joint probabilities over all possible values of B, i.e., Red & Blue.
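The row and column sums in the table correspond directly to axis sums on the joint matrix; a minimal NumPy sketch:
```python
# Recover both marginals from the joint table by summing over the other axis.
import numpy as np

# Rows: B = Red, Blue; columns: A = Red, Blue (values from the table above)
joint = np.array([[6, 9],
                  [4, 6]]) / 25

P_A = joint.sum(axis=0)  # sum over B -> [0.4, 0.6] = [2/5, 3/5]
P_B = joint.sum(axis=1)  # sum over A -> [0.6, 0.4] = [3/5, 2/5]
print("P(A):", P_A)
print("P(B):", P_B)
```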
📘
Conditional Probability:
It measures the probability of an event occurring given that another event has already happened.
- It provides a way to update our belief about the likelihood based on new information.
\[
P(A \mid B) = \frac{P(A \cap B)}{P(B)}
\]
P(A ∩ B) = joint probability of A and B
P(B) = marginal probability of B
=> Conditional Probability = Joint Probability / Marginal Probability
Conditional CDF:
\[
F_{X \mid Y}(x \mid y) = P(X \le x \mid Y = y)
\]
Discrete Case:
\[
F_{X \mid Y}(x \mid y) = P(X \le x \mid Y = y) = \sum_{x_i \le x} P(X = x_i \mid Y = y)
\]
Continuous Case:
\[
F_{X \mid Y}(x \mid y) = \int_{-\infty}^{x} f_{X \mid Y}(t \mid y)\, dt = \int_{-\infty}^{x} \frac{f_{X,Y}(t, y)}{f_Y(y)}\, dt,~ f_Y(y) > 0
\]
Conditional PMF:
\[
P(X = x \mid Y = y) = \frac{P(X = x, Y = y)} {P(Y = y)} \\
P(Y = y) > 0
\]
Conditional PDF:
\[
f_{X \mid Y}(x \mid y) = \frac{f_{X,Y}(x,y)}{f_Y(y)} \\
f_Y(y) > 0
\]
Application:
- Conditional generative models, such as conditional GANs, learn the conditional distribution of output pixels, given a conditioning input (e.g., the style of an input image).
💡 Let’s re-visit the ball drawing example.
Note: We only have information about the joint and marginal probabilities.
What is the conditional probability of drawing a red ball in the first draw, given that a blue ball is drawn in second draw?

Let A & B be discrete random variables associated with the outcome of the ball drawn from first and second bags
respectively.
Events of interest: A = Red (red ball in the first draw); B = Blue (blue ball in the second draw).
| | A = Red | A = Blue | P(B) (Marginal) |
|---|---|---|---|
| B = Red | 6/25 | 9/25 | 3/5 |
| B = Blue | 4/25 | 6/25 | 2/5 |
| P(A) (Marginal) | 2/5 | 3/5 | |
\[
\begin{aligned}
P(A \mid B) &= \frac{P(A \cap B)}{P(B)} \\
&= \frac{4/25}{2/5} \\
&= 2/5
\end{aligned}
\]
Therefore, probability of drawing a red ball in the first draw, given that a blue ball is drawn in second draw = 2/5.
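The same calculation as a minimal Python sketch, using only the joint and marginal values from the table:
```python
# Conditional probability = joint probability / marginal probability.
from fractions import Fraction

joint_A_red_B_blue = Fraction(4, 25)          # P(A=Red, B=Blue)
P_B_blue = Fraction(4, 25) + Fraction(6, 25)  # marginal P(B=Blue) = 2/5

print(joint_A_red_B_blue / P_B_blue)  # 2/5
```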
📘
Conditional Expectation:
This gives us the expected value of a random variable X, given that another random variable Y takes the value y.
Discrete Case:
\[
E[X \mid Y = y] = \sum_{x} x \cdot P(X = x \mid Y = y) = \sum_{x} x \cdot P_{X \mid Y}(x \mid y)
\]
Continuous Case:
\[
E[X \mid Y = y] = \int_{-\infty}^{\infty} x \cdot f_{X \mid Y}(x \mid y)\, dx
\]
For example:
- The conditional expectation of a person’s weight, given his/her height = 165 cm, gives us the average weight
of all people with height = 165 cm.
Applications:
- The linear regression algorithm models the conditional expectation of the target variable ‘Y’, given the input feature variable ‘X’.
- The Expectation Maximisation (EM) algorithm is built on conditional expectation.
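A minimal sketch of the discrete formula, using a small hypothetical joint PMF (the values below are made up for illustration):
```python
# Conditional expectation E[X | Y=y] from a discrete joint PMF.
from fractions import Fraction

# Hypothetical joint PMF over X in {0, 1, 2} and Y in {0, 1}; sums to 1
joint = {
    (0, 0): Fraction(1, 8), (0, 1): Fraction(1, 8),
    (1, 0): Fraction(1, 4), (1, 1): Fraction(1, 8),
    (2, 0): Fraction(1, 8), (2, 1): Fraction(1, 4),
}

def cond_expectation(y):
    p_y = sum(p for (_, yy), p in joint.items() if yy == y)  # marginal P(Y=y)
    return sum(x * p / p_y for (x, yy), p in joint.items() if yy == y)

print(cond_expectation(0))  # E[X | Y=0] = 1
print(cond_expectation(1))  # E[X | Y=1] = 5/4
```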
📘
Conditional Variance:
This gives us the variance of a random variable X, given that another random variable Y takes a particular value.
\[
\begin{aligned}
Var[X \mid Y = y] &= E[(X - E[X \mid Y = y])^2 \mid Y = y] \\
&= E[X^2 \mid Y=y] - (E[X \mid Y=y])^2 \\
\end{aligned}
\]
For example:
- The variance of a car’s mileage for city-only driving might be small, but it will be larger for a mix of city and highway driving.
Note: Models that account for changing variance (heteroscedasticity) tend to be more accurate.
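A minimal sketch of the second formula, reusing the same hypothetical joint PMF as in the conditional-expectation example:
```python
# Conditional variance via Var[X | Y=y] = E[X^2 | Y=y] - (E[X | Y=y])^2.
from fractions import Fraction

# Same hypothetical joint PMF over X in {0, 1, 2} and Y in {0, 1}
joint = {
    (0, 0): Fraction(1, 8), (0, 1): Fraction(1, 8),
    (1, 0): Fraction(1, 4), (1, 1): Fraction(1, 8),
    (2, 0): Fraction(1, 8), (2, 1): Fraction(1, 4),
}

def cond_moment(y, k):
    # k-th conditional moment E[X^k | Y=y]
    p_y = sum(p for (_, yy), p in joint.items() if yy == y)  # marginal P(Y=y)
    return sum((x ** k) * p / p_y for (x, yy), p in joint.items() if yy == y)

def cond_variance(y):
    return cond_moment(y, 2) - cond_moment(y, 1) ** 2

print(cond_variance(0), cond_variance(1))  # 1/2 and 11/16
```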
End of Section