Correlation

Covariance & Correlation

In this section, we will learn about Covariance and Correlation.

📘

Covariance:
It measures the direction of the linear relationship between two variables \(X\) and \(Y\).

\[\text{Population Covariance}(X,Y) = \sigma_{xy} = \frac{1}{N}\sum_{i=1}^{N}(x_i - \mu_{x})(y_i - \mu_{y})\]


\(N\) = size of population
\(\mu_{x}\) = population mean of \(X\)
\(\mu_{y}\) = population mean of \(Y\)

\[\text{Sample Covariance}(X,Y) = s_{xy} = \frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y})\]


\(n\) = size of sample
\(\bar{x}\) = sample mean of \(X\)
\(\bar{y}\) = sample mean of \(Y\)

Note: We use \(n-1\) instead of \(n\) in the denominator to make the sample covariance an unbiased estimate of the population covariance; this adjustment is called Bessel’s Correction.

If \((x_i - \bar{x})\) and \((y_i - \bar{y})\) have the same sign, then the product is positive (+ve).
If \((x_i - \bar{x})\) and \((y_i - \bar{y})\) have opposite signs, then the product is negative (-ve).
The final value of the covariance depends on the sum of these individual products.

\( \begin{aligned} \text{Cov}(X, Y) &> 0 &&\Rightarrow \text{ } X \text{ and } Y \text{ increase or decrease together} \\ \text{Cov}(X, Y) &= 0 &&\Rightarrow \text{ } \text{No linear relationship} \\ \text{Cov}(X, Y) &< 0 &&\Rightarrow \text{ } \text{If } X \text{ increases, } Y \text{ decreases (and vice versa)} \end{aligned} \)
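To make the two formulas concrete, here is a minimal Python sketch of both estimators (the function names are illustrative, not from any library):

```python
# Minimal sketch: population vs. sample covariance for plain Python lists.
def population_covariance(x, y):
    n = len(x)
    mu_x, mu_y = sum(x) / n, sum(y) / n
    return sum((xi - mu_x) * (yi - mu_y) for xi, yi in zip(x, y)) / n

def sample_covariance(x, y):
    n = len(x)
    x_bar, y_bar = sum(x) / n, sum(y) / n
    # n - 1 in the denominator is Bessel's Correction
    return sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / (n - 1)

print(population_covariance([1, 2, 3], [2, 4, 6]))  # 1.3333...
print(sample_covariance([1, 2, 3], [2, 4, 6]))      # 2.0
```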

Limitation:
Covariance is scale-dependent, i.e., the units of \(X\) and \(Y\) affect its magnitude.
This makes it hard to compare covariances across different datasets.
E.g., the covariance between age and height will NOT be on the same scale as the covariance between years of experience and salary.

Note: Covariance only measures the direction of the relationship, but does NOT give any information about the strength of the relationship.

For example:

  1. \(X = [1, 2, 3] \) and \(Y = [2, 4, 6] \)
    Let’s calculate the covariance:
    \(\text{Cov}(X, Y) = \frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y})\)
    \(\bar{x} = 2\) and \(\bar{y} = 4\)
    \(\text{Cov}(X, Y) = \frac{1}{3-1}[(1-2)(2-4) + (2-2)(4-4) + (3-2)(6-4)]\)
    \( = \frac{1}{2}[2+0+2] = 2\)
    => \(\text{Cov}(X,Y) > 0\), i.e., if X increases, Y increases and vice versa.
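We can verify this with NumPy; `np.cov` with `ddof=1` uses the same \(n-1\) denominator:

```python
import numpy as np

X = np.array([1, 2, 3])
Y = np.array([2, 4, 6])

# np.cov returns the 2x2 sample covariance matrix; the off-diagonal entry is Cov(X, Y).
print(np.cov(X, Y, ddof=1)[0, 1])  # 2.0
```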



📘

Correlation:
It measures both the strength and direction of the linear relationship between two variables \(X\) and \(Y\).
It is a standardized version of covariance that gives a dimensionless measure of linear relationship.

There are 2 popular ways to calculate correlation coefficient:

  1. Pearson Correlation Coefficient (r)
  2. Spearman Rank Correlation Coefficient (\(\rho\))

📘

Pearson Correlation Coefficient (r):
It is a standardized version of covariance and the most widely used measure of correlation.
Assumption: Data is normally distributed.

\[r_{xy} = \frac{Cov(X, Y)}{\sigma_{x} \sigma_{y}}\]


\(\sigma_{x}\) and \(\sigma_{y}\) are the standard deviations of \(X\) and \(Y\).

Range of \(r\) is between -1 and 1.
\(r = 1\) => perfect +ve linear relationship between X and Y
\(r = -1\) => perfect -ve linear relationship between X and Y
\(r = 0\) => NO linear relationship between X and Y.

Note: A correlation coefficient of 0.9 means that there is a strong linear relationship between X and Y, irrespective of their units.
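A quick sketch of this unit-invariance, using hypothetical experience/salary numbers (not from the text): rescaling the salary changes the covariance by the same factor, but leaves \(r\) untouched.

```python
import numpy as np

experience = np.array([1.0, 3.0, 5.0, 7.0, 9.0])      # years (hypothetical)
salary_usd = np.array([40_000.0, 55_000.0, 62_000.0, 78_000.0, 90_000.0])
salary_k = salary_usd / 1_000                          # same data, in thousands of dollars

print(np.cov(experience, salary_usd)[0, 1])       # 61500.0
print(np.cov(experience, salary_k)[0, 1])         # 61.5 (1000x smaller)
print(np.corrcoef(experience, salary_usd)[0, 1])  # ~0.995
print(np.corrcoef(experience, salary_k)[0, 1])    # ~0.995 (unchanged)
```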

For example:

  1. \(X = [1, 2, 3] \) and \(Y = [2, 4, 6] \)
    Let’s calculate the covariance:
    \(\text{Cov}(X, Y) = \frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y})\)
    \(\bar{x} = 2\) and \(\bar{y} = 4\)
    \(\text{Cov}(X, Y) = \frac{1}{3-1}[(1-2)(2-4) + (2-2)(4-4) + (3-2)(6-4)]\)
    \( => \text{Cov}(X, Y) = \frac{1}{2}[2+0+2] = 2\)

Let’s calculate the standard deviation of \(X\) and \(Y\):
\(\sigma_{x} = \sqrt{\frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})^2} \)
\(= \sqrt{\frac{1}{3-1}[(1-2)^2 + (2-2)^2 + (3-2)^2]}\)
\(= \sqrt{\frac{1+0+1}{2}} =\sqrt{\frac{2}{2}} = 1 \)

Similarly, we can calculate the standard deviation of \(Y\):
\(\sigma_{y} = \sqrt{\frac{1}{n-1}\sum_{i=1}^{n}(y_i - \bar{y})^2} \)
\(= \sqrt{\frac{1}{3-1}[(2-4)^2 + (4-4)^2 + (6-4)^2]}\)
\(= \sqrt{\frac{4+0+4}{2}} =\sqrt{\frac{8}{2}} = 2 \)

Now, we can calculate the Pearson correlation coefficient (r):
\(r_{xy} = \frac{Cov(X, Y)}{\sigma_{x} \sigma_{y}}\)
=> \(r_{xy} = \frac{2}{1 \times 2}\)
=> \(r_{xy} = 1\)
Therefore, we can say that there is a perfect +ve linear relationship between X and Y.
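The same calculation in code, first with the formula above and then with NumPy's built-in `corrcoef` as a cross-check:

```python
import numpy as np

X = np.array([1, 2, 3])
Y = np.array([2, 4, 6])

cov_xy = np.cov(X, Y, ddof=1)[0, 1]    # 2.0
sigma_x = np.std(X, ddof=1)            # 1.0
sigma_y = np.std(Y, ddof=1)            # 2.0

print(cov_xy / (sigma_x * sigma_y))    # 1.0
print(np.corrcoef(X, Y)[0, 1])         # 1.0
```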

📘

Spearman Rank Correlation Coefficient (\(\rho\)):
It is a measure of the strength and direction of the monotonic relationship between two ranked variables \(X\) and \(Y\).
It captures a monotonic relationship, meaning the variables move together in the same or opposite direction,
but not necessarily in a linear way.

  • It is used when Pearson’s correlation is not suitable, e.g., for ordinal data, or when continuous data does not meet the assumptions of linear methods such as Pearson’s correlation.
  • Non-parametric measure of correlation that uses ranks instead of raw data.
  • Quantifies how well the ranks of one variable predict the ranks of the other variable.
  • Range of \(\rho\) is between -1 and 1.
\[\rho_{xy} = 1 - \frac{6\sum_{i}d_i^2}{n(n^2-1)}\]

\(d_i\) = difference between the ranks of \(x_i\) and \(y_i\)
\(n\) = number of observations


For example:

  1. Compute the correlation of ranks awarded to a group of 5 students by 2 different teachers.
    | Student | Teacher A Rank | Teacher B Rank | \(d_i\) | \(d_i^2\) |
    |---------|----------------|----------------|---------|-----------|
    | S1      | 1              | 2              | -1      | 1         |
    | S2      | 2              | 1              | 1       | 1         |
    | S3      | 3              | 3              | 0       | 0         |
    | S4      | 4              | 5              | -1      | 1         |
    | S5      | 5              | 4              | 1       | 1         |

\(\sum_{i}d_i^2 = 4 \)
\( n = 5 \)
\(\rho_{xy} = 1 - \frac{6\sum_{i}d_i^2}{n(n^2-1)}\)
=> \(\rho_{xy} = 1 - \frac{6*4}{5(5^2-1)}\)
=> \(\rho_{xy} = 1 - \frac{24}{5*24}\)
=> \(\rho_{xy} = 1 - \frac{1}{5}\)
=> \(\rho_{xy} = 0.8\)
Therefore, we can say that there is a strong +ve correlation between the ranks given by teacher A and teacher B.
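Here is a short sketch that reproduces this result, both via the rank-difference formula and via SciPy's `spearmanr`:

```python
from scipy.stats import spearmanr

teacher_a = [1, 2, 3, 4, 5]   # ranks from Teacher A
teacher_b = [2, 1, 3, 5, 4]   # ranks from Teacher B

# Rank-difference formula: rho = 1 - 6 * sum(d_i^2) / (n * (n^2 - 1))
d_squared = [(a - b) ** 2 for a, b in zip(teacher_a, teacher_b)]
n = len(teacher_a)
print(1 - 6 * sum(d_squared) / (n * (n ** 2 - 1)))   # 0.8

rho, _ = spearmanr(teacher_a, teacher_b)
print(rho)                                           # 0.8
```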

  2. \(X = [1, 2, 3] \) and \(Y = [1, 8, 27] \)
    Here, Spearman’s rank correlation coefficient \(\rho\) will be a perfect 1, as there is a monotonic relationship, i.e., as X increases, Y increases and vice versa.
    But the Pearson correlation coefficient (r) will be slightly less than 1, i.e., r ≈ 0.9662.
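A small sketch comparing the two coefficients on this data:

```python
from scipy.stats import pearsonr, spearmanr

X = [1, 2, 3]
Y = [1, 8, 27]   # y = x**3: monotonic, but not linear

r, _ = pearsonr(X, Y)
rho, _ = spearmanr(X, Y)
print(round(r, 4))   # 0.9662  (slightly below 1)
print(rho)           # 1.0     (perfect monotonic relationship)
```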

End of Section