6 minute read
In this section, we will understand about Central Limit Theorem & Confidence Interval.
Before we understand the Central Limit Theorem, let’s understand a few related concepts.
Population Mean:
It is the true average of the entire group.
It describe the central tendency of the entire population.
\( \mu = \frac{1}{N}\sum_{i=1}^{N}x_i \)
N: Number of data points
Sample Mean:
It is the average of a smaller representative subset (a sample) of the entire population.
It provides an estimate of the population mean.
\( \bar{x} = \frac{1}{n}\sum_{i=1}^{n}x_i \)
n: size of sample
Central Limit Theorem:
This law states that for a sequence of of I.I.D random variables \( X_1, X_2, \dots, X_n \),
with finite mean and variance, the distribution of the sample mean \( \bar{X} \) approaches a normal distribution
as \( n \rightarrow \infty \), regardless of its original population distribution.
The distribution of the sample mean is : \( \bar{X} \sim N(\mu, \sigma^2/n)\)
Let, \( X_1, X_2, \dots, X_n \) are I.I.D random variables.
Now, let’s calculate the variance of sample means.
We know that:
Let’s apply above 2 rules on the variance of sample means equation above:
\[ \begin{aligned} Var[\bar{X_n}] &= Var[\frac{1}{n}(X_1+ X_2+ \dots+ X_n)] \\ &= \frac{1}{n^2}[Var[X_1+ X_2+ \dots+ X_n]] \\ &= \frac{1}{n^2}[Var[X_1] + Var[X_2] + \dots + Var[X_n]] \\ \text{We know that: } Var[X_i] = \sigma^2 \\ &= \frac{1}{n^2}[\sigma^2 + \sigma^2 + \dots + \sigma^2] \\ &= \frac{n\sigma^2}{n^2} \\ => Var[\bar{X_n}] &= \frac{\sigma^2}{n} \end{aligned} \]Since, standard deviation = \(\sigma = \sqrt{Variance}\)
Therefore, Standard Deviation\([\bar{X_n}] = \sqrt{\frac{\sigma^2}{n}} = \frac{\sigma}{\sqrt{n}}\)
The standard deviation of the sample means is also known as “Standard Error”.
Note: We can also standardize the sample mean, i.e, mean centering and variance scaling.
Standardisation helps us to use the Z-tables of normal distribution.
We know that, a standardized random variable \(Y_i = \frac{X_i - \mu}{\sigma}\)
Similarly, standardized sample mean:
Note: For practical purposes, \(n \ge 30\) is considered as a sufficient sample size for the CLT to hold.
Note:
The variance must be finite, else, the sample mean will NOT converge to a normal distribution.
If a distribution has a heavy tail, then the expected value calculation diverges.
e.g:
Confidence Interval:
It is a range of values that is likely to contain the true population mean, based on a sample.
Instead of giving a point estimate, it gives a range of values with confidence level.
For normal distribution, confidence interval :
\(\bar{X}\): Sample mean
\(Z\): Z-score corresponding to confidence level
\(n\): Sample size
\( \sigma \): Population Standard Deviation
Applications:
95% confidence interval does NOT mean there is a 95% chance that the true mean lies in the specific calculated interval.
For example:
Let’s suppose we want to measure the average weight of a certain species of dog.
We want to estimate the true population
mean \(\mu\) using confidence interval.
Note: True average weight = 30 kg, but this is NOT known to us.
Sample Number | Sample Mean | 95% Confidence Interval | Did it capture \(\mu\) ? |
---|---|---|---|
1 | 29.8 kg | (28.5, 31.1) | Yes |
2 | 30.4 kg | (29.1, 31.7) | Yes |
3 | 31.5 kg | (30.2, 32.8) | No |
4 | 28.1 kg | (26.7, 29.3) | No |
- | - | - | - |
- | - | - | - |
- | - | - | - |
100 | 29.9 kg | (28.6, 31.2) | Yes |
Suggest which company is offering a better salary?
Below is the details of the salaries based on a survey of 50 employees.
Company | Average Salary(INR) | Standard Deviation |
---|---|---|
A | 36 lpa | 7 lpa |
B | 40 lpa | 14 lpa |
For comparison, let’s calculate the 95% confidence interval for the average salaries of both companies A and B.
We know that:
\( CI = \bar{X} \pm Z\frac{\sigma}{\sqrt{n}} \)
Margin of Error(MoE) \( = Z\frac{\sigma}{\sqrt{n}} \)
Z-Score for 95% CI = 1.96
\(MoE_A = 1.96*\frac{7}{\sqrt{50}} \approx 1.94 \)
=> 95% CI for A = \(36 \pm 1.94 \) = [34.06, 37.94]
\(MoE_B = 1.96*\frac{14}{\sqrt{50}} \approx 3.88\)
=> 95% CI for B = \(40 \pm 3.88 \) = [36.12, 43.88]
We can see that initially company B’s salary looked obviously better,
but after calculating the 95% CI, we can see that there is a significant overlap in the salaries of two companies,
i.e [36.12, 37.94].
End of Section