Hypothesis Testing

Hypothesis Testing

In this section, we will understand about Hypothesis Testing, along with a framework for Hypothesis Testing.

💡 Why do we need Hypothesis Testing?

Hypothesis Testing is used to determine whether a claim or theory about a population is supported by a sample data,
by assessing whether observed difference or patterns are likely due to chance or represent a true effect.

  • It allows companies to test marketing campaigns or new strategies on a small scale before committing to larger investments.
  • Based on the results of hypothesis testing, we can make reliable inferences about the whole group based on a representative sample.
  • It helps us determine whether an observed result is statistically significant finding, or if it could have just happened by random chance.

📘

Hypothesis Testing:
It is a statistical inference framework used to make decisions about a population parameter, such as, the mean, variance, distribution, correlation, etc., based on a sample of data. It provides a formal method to evaluate competing claims.

Null Hypothesis (\(H_0\)):
Status quo or no-effect or no difference statement; almost always contains a statement of equality.

Alternative Hypothesis (\(H_1 ~or~ H_a\)):
The statement representing an effect, a difference, or a relationship.
It must be true if the null hypothesis is rejected.

💡 Perform a hypothesis test to compare the mean recovery time of 2 medicines.

Say, the data, D: <patient_id, med_1/med_2, recovery_time(in days)>
We need some metric to compare the recovery times of 2 medicines.
We can use the mean recovery time as the metric, because we know that we can use following techniques for comparison:

  1. 2-Sample T-Test; if \(n < 30\) and population standard deviation \(\sigma\) is unknown.
  2. 2-Sample Z-Test; if \(n \ge 30\) and population standard deviation \(\sigma\) is known.

Note: Let’s assume the sample size \(n < 30\), because medical tests usually have small sample sizes.
=> We will use the 2-Sample T-Test; we will continue using T-Test throughout the discussion.

Step 1: Define the null and alternative hypotheses.
Null Hypothesis \(H_0\): The mean recovery time of 2 medicines is the same i.e \(Mean_{m1} = Mean_{m2}\) or \(m_{m1} = m_{m2}\).
Alternate Hypothesis \(H_a\): \(m_{m1} < m_{m2}\) (1-Sided T-Test) or \(m_{m1} ⍯ m_{m2}\) (2-Sided T-Test).

Step 2: Select a relevant statistical test for the task with associated test statistic.
Let’s do a 2 sample T-Test, i.e, \(m_{m1} < m_{m2}\)

Step 3: Calculate the test statistic under null hypothesis.
Test Statistic:
For 2 sample T-Test:

\[t_{obs} = \frac{m_{m_1} - m_{m_2}}{\sqrt{\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}}}\]


s: Standard Deviation
n: Sample Size

Note: If the 2 means are very close then \(t_{obs} \approx 0\).

Step 4: Suppose significance level (\(\alpha\)) = 5% or 0.05.

Step 5: Compute the p-value from the observed value of test-statistic.
P-Value:

\[p_{value} = \mathbb{P}(t \geq t_{obs} | H_0)\]


p-value = area under curve = probability of observing test statistic \( \ge t_{obs} \) if the null hypothesis is true.

Step 6: Accept or reject the null hypothesis, based on the significance level (\(\alpha\)).
If \(p_{value} < \alpha\), we reject the null hypothesis and accept the alternative hypothesis and vice versa.
Note: In the above example \(p_{value} < \alpha\), so we reject the null hypothesis.




📘

Significance Level (\(\alpha\)):
It is the probability of wrongly rejecting a true null hypothesis, known as a Type I error or false +ve rate.

  • Tolerance level of wrongly accepting alternate hypothesis.
  • If the p-value < \(\alpha\), we reject the null hypothesis and conclude that the finding is statistically NOT so significant..

📘

Critical Value:
It is a specific point on the test-statistic distribution that defines the boundaries of the null hypothesis acceptance/rejection region.

  • It tells us that at what value (\(t_{\alpha}\)) of test statistic will the area under curve be equal to the significance level \(\alpha\).
  • For a right tailed/sided test:
    • if \(t_{obs} > t_{\alpha} => p_{value} < \alpha\); therefore, reject null hypothesis.
    • if \(t_{obs} < t_{\alpha} => p_{value} \ge \alpha\); therefore, failed to reject null hypothesis.



📘

Power of Test:
It is the probability that a hypothesis test will correctly reject a false null hypothesis (\(H_{0}\)) when the alternative hypothesis (\(H_{a}\)) is true.

  • Power of test = \(1 - \beta\)
  • Probability of correctly accepting alternate hypothesis (\(H_{a}\))
  • \(\alpha\): Probability of wrongly accepting alternate hypothesis \(H_{a}\)
  • \(\beta\): Probability of wrongly rejecting alternate hypothesis \(H_{a}\)


💡 Does having a large sample size make a hypothesis test more powerful?

Yes, having a large sample size makes a hypothesis test more powerful.

  • As n increases, sample mean \(\bar{x}\) approaches the population mean \(\mu\).
  • Also, as n increases, t-distribution approaches normal distribution.

📘

Effect Size:
It is a standardized objective measure that complements p-value by clarifying whether a statistically significant finding has any real world relevance.
It quantifies the magnitude of relationship between two variables.

  • Larger effect size => more impactful effect.
  • Standardized (mean centered + variance scaled) measure allows us to compare the imporance of effect across various studies or groups, even with different sample sizes.

Effect size is measured using Cohen’s d formula:

\[ d = \frac{\bar{X_1} - \bar{X_2}}{s_p} \\[10pt] s_p = \sqrt{\frac{(n_1 - 1)s_1^2 + (n_2 - 1)s_2^2}{n_1 + n_2 - 2}} \]

\(\bar{X}\): Sample mean
\(s_p\): Pooled Standard deviation
\(n\): Sample size
\(s\): Standard deviation

Note: Theoretically, Cohen’s d value can range from negative infinity to positive infinity.
but for practical purposes, we use the following value:
small effect (\(d=0.2\)), medium effect (\(d=0.5\)), and large effect (\(d\ge 0.8\)).

  • More overlap => less effect i.e low Cohen’s d value.
For example:

  • A study on drug trials finds that patients taking a new drug had statistically significant
    improvement (p-value<0.05), compared to a placebo group.
  1. Small effect size: Cohen’s d = 0.1 => drug had minimal effect.
  2. Large effect size: Cohen’s d = 0.8 => drug produced substantial improvement.



End of Section