Central Limit Theorem

Central Limit Theorem & Confidence Interval


In this section, we will look at the Central Limit Theorem and confidence intervals.

Before we understand the Central Limit Theorem, let’s understand a few related concepts.

📘 Population Mean:
It is the true average of the entire group.
It describes the central tendency of the entire population.

\( \mu = \frac{1}{N}\sum_{i=1}^{N}x_i \)
N: Number of data points

📘 Sample Mean:
It is the average of a smaller representative subset (a sample) of the entire population.
It provides an estimate of the population mean.

\( \bar{x} = \frac{1}{n}\sum_{i=1}^{n}x_i \)
n: size of sample

📘 Law of Large Numbers:
This law states that as the number of I.I.D samples drawn from a population increases,
the sample mean converges to the true population mean.
In other words, the long-run average of repeated draws of a random variable approaches its expected value.
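
Let's see the law in action with a small simulation. The sketch below assumes a fair six-sided die as the population (expected value 3.5); the die and the function name `sample_mean_of_die_rolls` are illustrative choices, not part of the statement above.

```python
import random

random.seed(0)  # fixed seed so the run is repeatable

EXPECTED_VALUE = 3.5  # (1 + 2 + ... + 6) / 6 for a fair die


def sample_mean_of_die_rolls(num_rolls):
    """Return the sample mean of `num_rolls` fair six-sided die rolls."""
    total = sum(random.randint(1, 6) for _ in range(num_rolls))
    return total / num_rolls


# The gap to the expected value tends to shrink as the sample grows.
for n in (10, 100, 10_000, 100_000):
    mean_n = sample_mean_of_die_rolls(n)
    gap = abs(mean_n - EXPECTED_VALUE)
    print(f"n = {n:>7}: sample mean = {mean_n:.4f}, gap = {gap:.4f}")
```

The gap does not have to shrink at every single step, since each run is random, but it becomes small for large n.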

📘 Central Limit Theorem:
This theorem states that for a sequence of I.I.D random variables \( X_1, X_2, \dots, X_n \)
with finite mean \( \mu \) and finite variance \( \sigma^2 \), the distribution of the sample mean \( \bar{X} \) approaches a normal distribution as \( n \rightarrow \infty \), regardless of the original population distribution.
For large \( n \), the sample mean is approximately distributed as \( \bar{X} \sim N(\mu, \sigma^2/n)\).

Let \( X_1, X_2, \dots, X_n \) be I.I.D random variables.

Population mean = \(E[X_i] = \mu < \infty\)
Population Variance = \(Var[X_i] = \sigma^2 < \infty \)
Sample mean = \( \bar{X_n} = \frac{1}{n}\sum_{i=1}^{n}X_i = \frac{1}{n}(X_1 + X_2+ \dots +X_n) \)
Variance of sample means = \( Var[\bar{X_n}] = Var[\frac{1}{n}(X_1+ X_2+ \dots+ X_n)]\)

Now, let’s calculate the variance of sample means.
We know that:

\(Var[X+Y] = Var[X] + Var[Y] \), for independent random variables X and Y.
\(Var[cX] = c^2Var[X] \), for constant ‘c’.

Let’s apply the above two rules to the variance of the sample mean:

\[ \begin{aligned} Var[\bar{X_n}] &= Var\left[\frac{1}{n}(X_1+ X_2+ \dots+ X_n)\right] \\ &= \frac{1}{n^2}Var[X_1+ X_2+ \dots+ X_n] \\ &= \frac{1}{n^2}\left(Var[X_1] + Var[X_2] + \dots + Var[X_n]\right) \\ &= \frac{1}{n^2}(\sigma^2 + \sigma^2 + \dots + \sigma^2) && \text{since } Var[X_i] = \sigma^2 \\ &= \frac{n\sigma^2}{n^2} \\ \implies Var[\bar{X_n}] &= \frac{\sigma^2}{n} \end{aligned} \]

Since standard deviation = \(\sqrt{Variance}\),
therefore, Standard Deviation\([\bar{X_n}] = \sqrt{Var[\bar{X_n}]} = \sqrt{\frac{\sigma^2}{n}} = \frac{\sigma}{\sqrt{n}}\)
The standard deviation of the sample mean is also known as the “Standard Error”.
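
The \(\sigma/\sqrt{n}\) result can be checked numerically. The sketch below assumes a fair six-sided die as the population, for which \(\sigma^2 = 35/12\); the population, the repetition count and the function name are illustrative assumptions.

```python
import math
import random
import statistics

random.seed(1)  # fixed seed so the run is repeatable

SIGMA = math.sqrt(35 / 12)  # standard deviation of one fair die roll


def empirical_standard_error(n, repetitions=5000):
    """Standard deviation of `repetitions` sample means, each from n die rolls."""
    means = [
        statistics.fmean([random.randint(1, 6) for _ in range(n)])
        for _ in range(repetitions)
    ]
    return statistics.pstdev(means)


# Quadrupling n should roughly halve the standard error.
for n in (4, 16, 64):
    simulated = empirical_standard_error(n)
    theoretical = SIGMA / math.sqrt(n)
    print(f"n = {n:>2}: simulated = {simulated:.3f}, sigma/sqrt(n) = {theoretical:.3f}")
```

The two columns should agree closely, and each fourfold increase in n should cut the standard error roughly in half.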

Note: We can also standardize the sample mean, i.e., center it by its mean and scale it by its standard deviation.
Standardization lets us use the Z-tables of the standard normal distribution.

We know that a standardized random variable is \(Y_i = \frac{X_i - \mu}{\sigma}\).
Similarly, the standardized sample mean is:

\[ \begin{aligned} Z_n &= \frac{\bar{X_n} - \mu}{\sqrt{Var[\bar{X_n}]}} = \frac{ \frac{1}{n}\sum_{i=1}^{n}X_i - \mu}{\frac{\sigma}{\sqrt{n}}} \\ &= \frac{\sum_{i=1}^{n}X_i - n\mu}{\sigma\sqrt{n}} \xrightarrow{d} N(0,1), \text{ as } n \rightarrow \infty \end{aligned} \]
i.e., \(Z_n\) converges in distribution to \(N(0,1)\) as \(n \rightarrow \infty\).

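Note: As a rule of thumb, \(n \ge 30\) is often treated as a large enough sample size for the normal approximation from the CLT to be reasonable; strongly skewed or heavy-tailed populations may need a larger \(n\).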
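
A small simulation can make this convergence concrete. The sketch below assumes an Exponential(1) population (mean 1, standard deviation 1), chosen only because it is clearly skewed; the population and the function names are illustrative assumptions. It estimates \(P(Z_n \le -1.645)\), which is about 0.05 for a standard normal variable.

```python
import math
import random

random.seed(2)  # fixed seed so the run is repeatable

MU = 1.0     # mean of the assumed Exponential(1) population
SIGMA = 1.0  # standard deviation of the same population


def standardized_sample_mean(n):
    """Draw n exponential values and return Z_n = (mean - mu) / (sigma / sqrt(n))."""
    sample_mean = sum(random.expovariate(1.0) for _ in range(n)) / n
    return (sample_mean - MU) / (SIGMA / math.sqrt(n))


def lower_tail_fraction(n, z=-1.645, repetitions=5000):
    """Share of simulated Z_n values that fall at or below z."""
    hits = sum(standardized_sample_mean(n) <= z for _ in range(repetitions))
    return hits / repetitions


# For a standard normal variable this probability is about 0.05.
for n in (2, 30, 200):
    print(f"n = {n:>3}: P(Z_n <= -1.645) is roughly {lower_tail_fraction(n):.3f}")
```

For n = 2 the estimate is exactly 0, because \(Z_n\) can never fall below \(-\sqrt{2} \approx -1.41\) for this population. The lower tail only fills in towards 0.05 as n grows, which is why a fixed threshold such as 30 is a guideline rather than a guarantee.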

For example:

Suppose we collect data on the heights of people in a city to estimate the average height in that city.

Sample size (n) = 100
We then repeat this data collection process k = 1000 times.
For each of these k = 1000 samples, we calculate the sample mean: \(\bar{X}^{(1)}, \bar{X}^{(2)}, \dots, \bar{X}^{(k)} \)
When we plot these 1000 sample means, the resulting distribution will be very close to a normal/Gaussian distribution:
\(\bar{X_n} \sim N(\mu, \sigma^2/n)\) approximately, for large n, typically \(n \ge 30\).

Note:

‘k’ = the number of repetitions; a large k lets us observe the distribution of the sample means when we plot them.
‘n’ = the number of observations in each sample; it is fixed for any given calculation of the sample mean \(\bar{X_n}\).
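
The height experiment can be sketched in code. Since no real height data is given here, the block below assumes a hypothetical, deliberately non-normal population: a triangular distribution between 150 cm and 200 cm with mode 165 cm. All constants and names are assumptions made for illustration. It draws k = 1000 samples of size n = 100 and prints a rough text histogram of the sample means.

```python
import math
import random
import statistics

random.seed(3)  # fixed seed so the run is repeatable

# Hypothetical skewed population of heights in cm (an assumption, not real data).
LOW, HIGH, MODE = 150.0, 200.0, 165.0
POP_MEAN = (LOW + HIGH + MODE) / 3
POP_SD = math.sqrt((LOW**2 + HIGH**2 + MODE**2
                    - LOW * HIGH - LOW * MODE - HIGH * MODE) / 18)


def simulate_sample_means(n=100, k=1000):
    """Return k sample means, each computed from n simulated heights."""
    return [
        statistics.fmean([random.triangular(LOW, HIGH, MODE) for _ in range(n)])
        for _ in range(k)
    ]


sample_means = simulate_sample_means()
standard_error = POP_SD / math.sqrt(100)

# Share of sample means within 1 and 2 standard errors of the population mean.
within_1 = sum(abs(m - POP_MEAN) <= standard_error for m in sample_means) / len(sample_means)
within_2 = sum(abs(m - POP_MEAN) <= 2 * standard_error for m in sample_means) / len(sample_means)
print(f"within 1 SE: {within_1:.3f}, within 2 SE: {within_2:.3f}")

# Rough text histogram with 0.5 cm bins; one '#' stands for 5 sample means.
bin_width = 0.5
counts = {}
for m in sample_means:
    bin_start = math.floor(m / bin_width) * bin_width
    counts[bin_start] = counts.get(bin_start, 0) + 1
for bin_start in sorted(counts):
    print(f"{bin_start:6.1f} | {'#' * (counts[bin_start] // 5)}")
```

For a normal distribution, roughly 68% of values lie within one standard deviation and roughly 95% within two, so the two printed shares should land near those figures even though the assumed population itself is skewed.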

💡 Why must the variance be finite?

If the variance is not finite, the distribution of the sample mean will NOT converge to a normal distribution.
For a heavy-tailed distribution, the integrals that define the mean or the variance can diverge.
e.g.:

The Cauchy distribution has an undefined mean and an undefined variance.
The Pareto distribution with shape \(\alpha \le 2\) has infinite variance (and infinite mean when \(\alpha \le 1\)); it is often used to model the distribution of wealth.
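
To see what goes wrong without finite variance, the sketch below compares sample means from a standard normal population with sample means from a standard Cauchy population. The helper names, the sample sizes and the use of the interquartile range (IQR) as the measure of spread are illustrative choices.

```python
import math
import random
import statistics

random.seed(4)  # fixed seed so the run is repeatable


def cauchy_draw():
    """One draw from the standard Cauchy distribution via the inverse CDF."""
    return math.tan(math.pi * (random.random() - 0.5))


def normal_draw():
    """One draw from the standard normal distribution."""
    return random.gauss(0, 1)


def iqr_of_sample_means(draw, n, repetitions=400):
    """Interquartile range of `repetitions` sample means, each of size n."""
    means = [statistics.fmean([draw() for _ in range(n)]) for _ in range(repetitions)]
    q1, _, q3 = statistics.quantiles(means, n=4)
    return q3 - q1


# The spread of normal sample means shrinks with n; the Cauchy spread does not.
for n in (1, 25, 400):
    normal_iqr = iqr_of_sample_means(normal_draw, n)
    cauchy_iqr = iqr_of_sample_means(cauchy_draw, n)
    print(f"n = {n:>3}: IQR of normal means = {normal_iqr:.3f}, "
          f"IQR of Cauchy means = {cauchy_iqr:.3f}")
```

The mean of n independent standard Cauchy draws is itself standard Cauchy, so its IQR stays near 2 however large n becomes, while the normal column shrinks like \(1/\sqrt{n}\). Averaging more data simply does not help in the Cauchy case.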

📘 Confidence Interval:
It is a range of values that is likely to contain the true population mean, based on a sample.
Instead of giving a single point estimate, it gives a range of values together with a confidence level.

When the sample mean is normally distributed and the population standard deviation \( \sigma \) is known, the confidence interval is:

\[ CI = \bar{X} \pm Z\frac{\sigma}{\sqrt{n}} \]

\(\bar{X}\): Sample mean
\(Z\): Z-score corresponding to confidence level
\(n\): Sample size
\( \sigma \): Population Standard Deviation
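
The formula translates directly into code. The sketch below looks up the Z-score with `statistics.NormalDist` from the Python standard library instead of a Z-table; the function name and the example inputs (sample mean 50, \(\sigma\) = 10, n = 25) are hypothetical values chosen for illustration.

```python
import math
from statistics import NormalDist


def z_confidence_interval(sample_mean, sigma, n, confidence=0.95):
    """Return (lower, upper) for sample_mean +/- z * sigma / sqrt(n)."""
    # Two-sided interval: leave (1 - confidence) / 2 probability in each tail.
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    margin = z * sigma / math.sqrt(n)
    return sample_mean - margin, sample_mean + margin


# Hypothetical inputs: a higher confidence level gives a wider interval.
for level in (0.90, 0.95, 0.99):
    low, high = z_confidence_interval(sample_mean=50.0, sigma=10.0, n=25, confidence=level)
    print(f"{level:.0%} CI: [{low:.2f}, {high:.2f}]")
```

Note the trade-off: asking for more confidence widens the interval, while a larger sample size n narrows it.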

Applications:

A/B testing, i.e., comparing 2 or more versions of a product.
ML model performance evaluation, i.e., instead of reporting a single performance score of, say, 85%,
it is better to provide a 95% confidence interval, such as [82.5%, 87.8%].

A 95% confidence interval does NOT mean there is a 95% chance that the true mean lies in the specific calculated interval.

It means that if we repeat the sampling process many times, then about 95% of those calculated intervals will capture or contain the true population mean \(\mu\).
We cannot say there is a 95% probability that the true mean is within one specific range, because the true population mean is a fixed constant, NOT a random variable.

For example:
Let’s suppose we want to measure the average weight of a certain species of dog.
We want to estimate the true population mean \(\mu\) using a confidence interval.
Note: True average weight = 30 kg, but this is NOT known to us.

Sample Number	Sample Mean	95% Confidence Interval	Did it capture \(\mu\) ?
1	29.8 kg	(28.5, 31.1)	Yes
2	30.4 kg	(29.1, 31.7)	Yes
3	31.5 kg	(30.2, 32.8)	No
4	28.1 kg	(26.7, 29.3)	No
-	-	-	-
-	-	-	-
-	-	-	-
100	29.9 kg	(28.6, 31.2)	Yes

We generated 100 confidence intervals (CIs), each based on a different sample.
A 95% confidence level means that, in the long run, about 95 out of every 100 such CIs will include the true average weight, i.e., \(\mu = 30\) kg, and about 5 out of 100 will miss it.
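
This long-run reading can be simulated. The sketch below uses the true mean of 30 kg from the example; the population standard deviation (4 kg), the sample size (36) and the normal population are hypothetical assumptions, picked so that the margin of error is about 1.3 kg, similar to the intervals in the table.

```python
import math
import random
import statistics

random.seed(5)  # fixed seed so the run is repeatable

TRUE_MEAN = 30.0  # kg, from the example; unknown to the experimenter in practice
SIGMA = 4.0       # kg, hypothetical population standard deviation
N = 36            # hypothetical sample size
Z = 1.96          # Z-score for a 95% confidence level


def interval_captures_mean():
    """Draw one sample, build its 95% CI, and report whether it contains TRUE_MEAN."""
    sample = [random.gauss(TRUE_MEAN, SIGMA) for _ in range(N)]
    center = statistics.fmean(sample)
    margin = Z * SIGMA / math.sqrt(N)
    return center - margin <= TRUE_MEAN <= center + margin


def coverage(num_intervals):
    """Fraction of `num_intervals` simulated CIs that contain TRUE_MEAN."""
    return sum(interval_captures_mean() for _ in range(num_intervals)) / num_intervals


print(f"coverage over 100 intervals:    {coverage(100):.2f}")
print(f"coverage over 10,000 intervals: {coverage(10_000):.4f}")
```

With only 100 intervals the count of hits can easily be 93 or 97 rather than exactly 95; the 95% figure is a long-run average, which the larger run approximates more closely.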

💡 Which company is offering a better salary?

Below are the salary details for each company, based on a survey of 50 employees from each company.

Company	Average Salary(INR)	Standard Deviation
A	36 lpa	7 lpa
B	40 lpa	14 lpa

For comparison, let’s calculate the 95% confidence interval for the average salaries of both companies A and B.
We know that:
\( CI = \bar{X} \pm Z\frac{\sigma}{\sqrt{n}} \)
Margin of Error(MoE) \( = Z\frac{\sigma}{\sqrt{n}} \)
Z-Score for 95% CI = 1.96

\(MoE_A = 1.96*\frac{7}{\sqrt{50}} \approx 1.94 \)
=> 95% CI for A = \(36 \pm 1.94 \) = [34.06, 37.94]

\(MoE_B = 1.96*\frac{14}{\sqrt{50}} \approx 3.88\)
=> 95% CI for B = \(40 \pm 3.88 \) = [36.12, 43.88]

At first glance, company B’s average salary looks clearly better,
but after calculating the 95% CIs, we can see that the intervals of the two companies overlap,
i.e., over [36.12, 37.94].
So, based on this survey alone, we cannot confidently conclude that company B pays a better average salary.
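
The arithmetic for the two companies can be reproduced with a few lines of Python. This is a minimal sketch that, like the calculation above, treats the surveyed standard deviations as \(\sigma\) and uses Z = 1.96; the function and variable names are illustrative.

```python
import math

Z_95 = 1.96  # Z-score for a 95% confidence level


def confidence_interval(mean, sd, n, z=Z_95):
    """Return (lower, upper) for mean +/- z * sd / sqrt(n)."""
    margin = z * sd / math.sqrt(n)
    return mean - margin, mean + margin


# Survey figures from the table above, in lpa, with n = 50.
ci_a = confidence_interval(mean=36, sd=7, n=50)
ci_b = confidence_interval(mean=40, sd=14, n=50)

# The intervals overlap when the larger lower bound is below the smaller upper bound.
overlap_low = max(ci_a[0], ci_b[0])
overlap_high = min(ci_a[1], ci_b[1])

print(f"95% CI for A: [{ci_a[0]:.2f}, {ci_a[1]:.2f}]")
print(f"95% CI for B: [{ci_b[0]:.2f}, {ci_b[1]:.2f}]")
if overlap_low <= overlap_high:
    print(f"Overlap: [{overlap_low:.2f}, {overlap_high:.2f}]")
else:
    print("The intervals do not overlap.")
```

Company B's interval is twice as wide as company A's because its standard deviation is twice as large, which is what pulls its lower bound down into company A's range.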
