Joint, Marginal & Conditional Probability
In this section, we will understand Joint, Marginal & Conditional Probability.
So far, we have dealt with a single random variable.
Now, let’s explore the probability distributions of 2 or more random variables occurring together.
📘
Joint Probability Distribution:
It describes the probability of 2 or more random variables occurring simultaneously.
- The random variables can come from different types of distributions, e.g., one discrete and one continuous.
Joint CDF:
\[
F_{X,Y}(a,b) = P(X \le a, Y \le b),~ -\infty < a, b < \infty
\]
Discrete Case:
\[
F_{X,Y}(a,b) = P(X \le a, Y \le b) = \sum_{x_i \le a} \sum_{y_j \le b} P(X = x_i, Y = y_j)
\]
Continuous Case:
\[
F_{X,Y}(a,b) = P(X \le a, Y \le b) = \int_{-\infty}^{a} \int_{-\infty}^{b} f_{X,Y}(x,y) dy dx
\]
Joint PMF:
\[
P_{X,Y}(x,y) = P(X = x, Y = y)
\]
Key Properties:
- \(P(X = x, Y = y) \ge 0 ~ \forall (x,y) \)
- \( \sum_{i} \sum_{j} P(X = x_i, Y = y_j) = 1 \)
Joint PDF:
\[
f_{X,Y}(x,y) = \frac{\partial^2 F_{X,Y}(x,y)}{\partial x \partial y} \\
P((X,Y) \in A) = \iint_{A} f_{X,Y}(x,y)\, dy\, dx,~ A \subseteq \mathbb{R}^2
\]
Key Properties:
- \(f_{X,Y}(x,y) \ge 0 ~ \forall (x,y) \)
- \( \int_{-\infty}^{\infty} \int_{-\infty}^{\infty} f_{X,Y}(x,y) dy dx = 1 \)
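As a quick numerical sanity check of these two properties, here is a minimal Python sketch (assuming SciPy is available; the density \(f(x,y) = x + y\) on the unit square is a hypothetical example chosen for illustration):
```python
# Check the joint-PDF properties for a hypothetical density
# f(x, y) = x + y on the unit square (and 0 elsewhere).
from scipy import integrate

def f_xy(y, x):
    # dblquad passes the inner variable (y) first
    return x + y  # non-negative everywhere on [0,1] x [0,1]

# Integrate over the support: outer x in [0,1], inner y in [0,1]
total, _ = integrate.dblquad(f_xy, 0, 1, lambda x: 0, lambda x: 1)
print(total)  # ~1.0, so the density integrates to 1
```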
For example:
- If we consider 2 random variables, say, height(X) and weight(Y), then the joint distribution will tell us
the probability of finding a person having a particular height and weight.
💡 There are 2 bags; bag_1 has 2 red balls & 3 blue balls, bag_2 has 3 red balls & 2 blue balls.
A ball is picked at random from each bag, such that both draws are independent of each other.
Let’s use this example to understand joint probability.

Let A & B be discrete random variables associated with the outcome of the ball drawn from first and second bags
respectively.
| | A = Red | A = Blue |
|---|---|---|
| B = Red | 2/5 * 3/5 = 6/25 | 3/5 * 3/5 = 9/25 |
| B = Blue | 2/5 * 2/5 = 4/25 | 3/5 * 2/5 = 6/25 |
Since the draws are independent, the joint probability = P(A) * P(B).
Each of the 4 cells in the table above shows the probability of one combination of results from the 2 draws, i.e., a joint probability.
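As a quick sketch, the table above can be reproduced in a few lines of Python (using exact fractions to avoid floating-point noise):
```python
# Joint PMF for the two-bag example: the draws are independent,
# so P(A=a, B=b) = P(A=a) * P(B=b).
from fractions import Fraction

P_A = {"Red": Fraction(2, 5), "Blue": Fraction(3, 5)}  # bag_1: 2 red, 3 blue
P_B = {"Red": Fraction(3, 5), "Blue": Fraction(2, 5)}  # bag_2: 3 red, 2 blue

joint = {(a, b): P_A[a] * P_B[b] for a in P_A for b in P_B}
for (a, b), p in joint.items():
    print(f"P(A={a}, B={b}) = {p}")
print("Total:", sum(joint.values()))  # 1, as required of a joint PMF
```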
📘
Marginal Probability Distribution:
It describes the probability distribution of an individual random variable in a joint distribution,
without considering the outcomes of other random variables.
- If we have the joint distribution, then we can get the marginal distribution of each random variable from it.
- The marginal probability is obtained by summing (or integrating) the joint probability over the other random variables.
Marginal CDF:
We know that Joint CDF =
\[
F_{X,Y}(a,b) = P(X \le a, Y \le b),~ -\infty < a, b < \infty
\]
Marginal CDF =
\[
F_X(a) = F_{X,Y}(a, \infty) = P(X \le a, Y < \infty) = P(X \le a)
\]
Discrete Case:
\[
F_X(a) = P(X \le a, Y < \infty) = \sum_{x_i \le a} \sum_{y_j} P(X = x_i, Y = y_j)
\]
Continuous Case:
\[
F_X(a) = P(X \le a, Y < \infty) = \int_{-\infty}^{a} \int_{-\infty}^{\infty} f_{X,Y}(x,y)\,dy\,dx = \int_{-\infty}^{a} f_X(x)\,dx, \text{ where } f_X(x) = \int_{-\infty}^{\infty} f_{X,Y}(x,y)\,dy
\]
Law of Total Probability
We know that Joint Probability Distribution =
\[
P_{X,Y}(x,y) = P(X = x, Y = y)
\]
The events \((Y=y)\) partition the sample space, such that:
- \( (Y=y_i) \cap (Y=y_j) = \emptyset ~ \forall ~ i \ne j \)
- \( (Y=y_1) \cup (Y=y_2) \cup ... \cup (Y=y_n) = \Omega \)
From Law of Total Probability, we get:
Marginal PMF:
\[
P_X(x) = P(X=x) = \sum_{y} P_{X,Y}(x,y) = \sum_{y} P(X = x, Y = y)
\]
Marginal PDF:
\[
f_X(x) = \int_{-\infty}^{\infty} f_{X,Y}(x,y) dy
\]
💡 Setup: Roll a die + Toss a coin.
X: Roll a die ; \( \Omega = \{1,2,3,4,5,6\} \)
Y: Toss a coin ; \( \Omega = \{H,T\} \)
Since the roll and the toss are independent, Joint PMF = \( P_{X,Y}(x,y) = P(X=x, Y=y) = 1/6 \cdot 1/2 = 1/12 \)
Marginal PMF of X = \( P_X(x) = \sum_{y \in \{H,T\}} P_{X,Y}(x,y) = 1/12 + 1/12 = 1/6 \)
=> Marginally, X is uniform over 1-6, i.e., a fair die.
Marginal PMF of Y = \( P_Y(y) = \sum_{x=1}^{6} P_{X,Y}(x,y) = 6 \cdot (1/12) = 1/2 \)
=> Marginally, Y is uniform over {H,T}, i.e., a fair coin.
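A minimal sketch of this marginalization in Python (exact fractions again):
```python
# Marginal PMFs from the die + coin joint PMF: every joint entry is 1/12,
# and each marginal is the sum of the joint over the other variable.
from fractions import Fraction

dice, coin = range(1, 7), ["H", "T"]
joint = {(x, y): Fraction(1, 12) for x in dice for y in coin}

P_X = {x: sum(joint[(x, y)] for y in coin) for x in dice}  # 1/6 each: fair die
P_Y = {y: sum(joint[(x, y)] for x in dice) for y in coin}  # 1/2 each: fair coin
print(P_X)
print(P_Y)
```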
💡 Setup: X and Y are two independent continuous uniform random variables.
\( X \sim U(0,1) \)
\( Y \sim U(0,1) \)
Recall: Marginal PDF = \( f_X(x) = \int_{-\infty}^{\infty} f_{X,Y}(x,y)\, dy \)
Joint PDF (by independence, the product of the two uniform densities) =
$$
f_{X,Y}(x,y) =
\begin{cases}
1 & \text{if } x \in [0,1], y \in [0,1] \\
0 & \text{otherwise }
\end{cases}
$$
Marginal PDF =
\[
\begin{aligned}
f_X(x) &= \int_{0}^{1} f_{X,Y}(x,y) dy \\
&= \int_{0}^{1} 1 dy \\
&= 1 \\
f_X(x) &=
\begin{cases}
1 & \text{if } x \in [0,1] \\
0 & \text{otherwise }
\end{cases}
\end{aligned}
\]
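The same marginalization can be checked numerically. A minimal sketch (assuming SciPy is available):
```python
# Numerically marginalize the joint density of two independent U(0,1)
# variables over y to recover the marginal PDF of X.
from scipy import integrate

def f_joint(x, y):
    return 1.0 if (0 <= x <= 1 and 0 <= y <= 1) else 0.0

def f_X(x):
    # The joint is 0 outside y in [0,1], so integrating over [0,1] suffices
    val, _ = integrate.quad(lambda y: f_joint(x, y), 0, 1)
    return val

print(f_X(0.3), f_X(0.8))  # ~1.0 inside the support [0,1]
print(f_X(1.5))            # 0.0 outside the support
```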
💡 Let’s re-visit the ball drawing example.
There are 2 bags; bag_1 has 2 red balls & 3 blue balls, bag_2 has 3 red balls & 2 blue balls.
A ball is picked at random from each bag, such that both draws are independent of each other.
Let’s use this example to understand marginal probability.

Let A & B be discrete random variables associated with the outcome of the ball drawn from first and second bags
respectively.
| | A = Red | A = Blue | P(B) (Marginal) |
|---|---|---|---|
| B = Red | 2/5 * 3/5 = 6/25 | 3/5 * 3/5 = 9/25 | 6/25 + 9/25 = 15/25 = 3/5 |
| B = Blue | 2/5 * 2/5 = 4/25 | 3/5 * 2/5 = 6/25 | 4/25 + 6/25 = 10/25 = 2/5 |
| P(A) (Marginal) | 6/25 + 4/25 = 10/25 = 2/5 | 9/25 + 6/25 = 15/25 = 3/5 | |
We can see from the table above that P(A = Red) is the sum of the joint probabilities over all possible values of B, i.e., Red & Blue.
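The row and column sums in the table correspond directly to axis sums on the joint matrix; a minimal NumPy sketch:
```python
# Recover both marginals from the joint table by summing over the other axis.
import numpy as np

# Rows: B = Red, Blue; columns: A = Red, Blue (values from the table above)
joint = np.array([[6, 9],
                  [4, 6]]) / 25

P_A = joint.sum(axis=0)  # sum over B -> [0.4, 0.6] = [2/5, 3/5]
P_B = joint.sum(axis=1)  # sum over A -> [0.6, 0.4] = [3/5, 2/5]
print("P(A):", P_A)
print("P(B):", P_B)
```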
📘
Conditional Probability:
It measures the probability of an event occurring given that another event has already happened.
- It provides a way to update our belief about the likelihood based on new information.
\[
P(A \mid B) = \frac{P(A \cap B)}{P(B)}
\]
P(A ∩ B) = joint probability of A and B
P(B) = marginal probability of B
=> Conditional Probability = Joint Probability / Marginal Probability
Conditional CDF:
\[
F_{X \mid Y}(x \mid y) = P(X \le x \mid Y = y)
\]
Discrete Case:
\[
F_{X \mid Y}(x \mid y) = P(X \le x \mid Y = y) = \sum_{x_i \le x} P(X = x_i \mid Y = y)
\]
Continuous Case:
\[
F_{X \mid Y}(x \mid y) = \int_{-\infty}^{x} f_{X \mid Y}(t \mid y)\, dt = \int_{-\infty}^{x} \frac{f_{X,Y}(t, y)}{f_Y(y)}\, dt,~ f_Y(y) > 0
\]
Conditional PMF:
\[
P(X = x \mid Y = y) = \frac{P(X = x, Y = y)} {P(Y = y)} \\
P(Y = y) > 0
\]
Conditional PDF:
\[
f_{X \mid Y}(x \mid y) = \frac{f_{X,Y}(x,y)}{f_Y(y)} \\
f_Y(y) > 0
\]
Application:
- Conditional generative models, such as conditional GANs, learn the conditional distribution of output pixels, given a conditioning input (e.g., the style of an input image).
💡 Let’s re-visit the ball drawing example.
Note: We only have information about the joint and marginal probabilities.
What is the conditional probability of drawing a red ball in the first draw, given that a blue ball is drawn in second draw?

Let A & B be discrete random variables associated with the outcome of the ball drawn from first and second bags
respectively.
Events of interest: A = Red (red ball in the first draw); B = Blue (blue ball in the second draw).
| | A = Red | A = Blue | P(B) (Marginal) |
|---|---|---|---|
| B = Red | 6/25 | 9/25 | 3/5 |
| B = Blue | 4/25 | 6/25 | 2/5 |
| P(A) (Marginal) | 2/5 | 3/5 | |
\[
\begin{aligned}
P(A \mid B) &= \frac{P(A \cap B)}{P(B)} \\
&= \frac{4/25}{2/5} \\
&= 2/5
\end{aligned}
\]
Therefore, probability of drawing a red ball in the first draw, given that a blue ball is drawn in second draw = 2/5.
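The same calculation as a minimal Python sketch, using only the joint and marginal values from the table:
```python
# Conditional probability = joint probability / marginal probability.
from fractions import Fraction

joint_A_red_B_blue = Fraction(4, 25)          # P(A=Red, B=Blue)
P_B_blue = Fraction(4, 25) + Fraction(6, 25)  # marginal P(B=Blue) = 2/5

print(joint_A_red_B_blue / P_B_blue)  # 2/5
```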
📘
Conditional Expectation:
This gives us the expected value of a random variable X, given that another random variable Y takes the value y.
Discrete Case:
\[
E[X \mid Y = y] = \sum_{x} x \cdot P(X = x \mid Y = y) = \sum_{x} x \cdot P_{X \mid Y}(x \mid y)
\]
Continuous Case:
\[
E[X \mid Y = y] = \int_{-\infty}^{\infty} x \cdot f_{X \mid Y}(x \mid y)\, dx
\]
For example:
- The conditional expectation of a person’s weight, given his/her height = 165 cm, gives us the average weight
of all people with height = 165 cm.
Applications:
- The linear regression algorithm models the conditional expectation of the target variable ‘Y’, given the input feature variable ‘X’.
- The Expectation Maximisation (EM) algorithm is built on conditional expectation.
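A minimal sketch of the discrete formula, using a small hypothetical joint PMF (the values below are made up for illustration):
```python
# Conditional expectation E[X | Y=y] from a discrete joint PMF.
from fractions import Fraction

# Hypothetical joint PMF over X in {0, 1, 2} and Y in {0, 1}; sums to 1
joint = {
    (0, 0): Fraction(1, 8), (0, 1): Fraction(1, 8),
    (1, 0): Fraction(1, 4), (1, 1): Fraction(1, 8),
    (2, 0): Fraction(1, 8), (2, 1): Fraction(1, 4),
}

def cond_expectation(y):
    p_y = sum(p for (_, yy), p in joint.items() if yy == y)  # marginal P(Y=y)
    return sum(x * p / p_y for (x, yy), p in joint.items() if yy == y)

print(cond_expectation(0))  # E[X | Y=0] = 1
print(cond_expectation(1))  # E[X | Y=1] = 5/4
```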
📘
Conditional Variance:
This gives us the variance of a random variable X, given that another random variable Y takes a particular value.
\[
\begin{aligned}
Var[X \mid Y = y] &= E[(X - E[X \mid Y = y])^2 \mid Y = y] \\
&= E[X^2 \mid Y=y] - (E[X \mid Y=y])^2 \\
\end{aligned}
\]
For example:
- The variance of a car’s mileage for city-only driving might be small, but it will be larger for a mix of city and highway driving.
Note: Models that account for changing variance (heteroscedasticity) tend to be more accurate.
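A minimal sketch of the second formula, reusing the same hypothetical joint PMF as in the conditional-expectation example:
```python
# Conditional variance via Var[X | Y=y] = E[X^2 | Y=y] - (E[X | Y=y])^2.
from fractions import Fraction

# Same hypothetical joint PMF over X in {0, 1, 2} and Y in {0, 1}
joint = {
    (0, 0): Fraction(1, 8), (0, 1): Fraction(1, 8),
    (1, 0): Fraction(1, 4), (1, 1): Fraction(1, 8),
    (2, 0): Fraction(1, 8), (2, 1): Fraction(1, 4),
}

def cond_moment(y, k):
    # k-th conditional moment E[X^k | Y=y]
    p_y = sum(p for (_, yy), p in joint.items() if yy == y)  # marginal P(Y=y)
    return sum((x ** k) * p / p_y for (x, yy), p in joint.items() if yy == y)

def cond_variance(y):
    return cond_moment(y, 2) - cond_moment(y, 1) ** 2

print(cond_variance(0), cond_variance(1))  # 1/2 and 11/16
```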
End of Section