# Cosine Similarity VS Pearson Correlation Coefficient

## Introduction

I have seen people confuse the cosine similarity with the Pearson correlation coefficient, because their mathematical definitions look somewhat similar.

In this blog post, I would like to quickly discuss the definitions of the cosine similarity and the Pearson correlation coefficient, and the difference between them.

## Cosine Similarity

The cosine similarity computes the similarity between *two samples*. The two samples can be obtained from the same distribution or different distributions. The two samples should have the same number of features.

Given two sample feature vectors $\mathbf{x}, \mathbf{y} \in \mathbb{R}^n$, $\mathbf{x} = (x_1, x_2, \cdots, x_n)$, $\mathbf{y} = (y_1, y_2, \cdots, y_n)$, the cosine similarity is defined as

$$
\begin{align}
\cos(\theta) &= \frac{\mathbf{x} \cdot \mathbf{y}}{\left\Vert \mathbf{x} \right\Vert \left\Vert \mathbf{y} \right\Vert} \\
&= \frac{ \sum_{i=1}^{n} x_i y_i }{\sqrt{\sum_{i=1}^{n} x_i^2 } \sqrt{\sum_{i=1}^{n} y_i^2 }}
\end{align}
$$

The cosine similarity ranges from $-1$ to $1$. $1$ means the two samples are the most similar (they point in the same direction) and $-1$ means the two samples are the least similar (they point in opposite directions). If we additionally know that $\mathbf{x}$ and $\mathbf{y}$ are unit vectors, or more generally that $\left\Vert \mathbf{x} \right\Vert = \left\Vert \mathbf{y} \right\Vert$, then $1$ means the two samples are identical and $-1$ means one sample is the negation of the other.
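The definition above can be sketched in a few lines of NumPy. This is a minimal illustration rather than a reference implementation; the function name `cosine_similarity` and the example vectors are assumptions made here for demonstration.

```python
import numpy as np


def cosine_similarity(x: np.ndarray, y: np.ndarray) -> float:
    """Cosine of the angle between two feature vectors of equal length."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    if x.shape != y.shape:
        raise ValueError("x and y must have the same number of features")
    # Note: the result is undefined if either vector is all zeros.
    return float(np.dot(x, y) / (np.linalg.norm(x) * np.linalg.norm(y)))


# Hypothetical example vectors.
x = np.array([1.0, 2.0, 3.0])

# Same direction but different norm: close to 1, yet the samples differ.
print(cosine_similarity(x, 2.0 * x))
# Opposite direction: close to -1.
print(cosine_similarity(x, -x))
# Orthogonal vectors: close to 0.
print(cosine_similarity(np.array([1.0, 0.0]), np.array([0.0, 1.0])))
```

The first call shows why the norm condition matters: `x` and `2.0 * x` have a cosine similarity of about $1$ even though they are not identical.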

## Pearson Correlation

The Pearson correlation coefficient measures the linear correlation between *two jointly distributed random variables*. Suppose we draw $n$ samples from the joint distribution of a bivariate random variable $(X, Y)$.

Given $n$ samples, each consisting of two features, $\{ (x_1, y_1), (x_2, y_2), \cdots, (x_n, y_n) \}$, the Pearson correlation coefficient is defined as follows, where the last line is the estimate obtained by replacing each expectation with the corresponding sample mean.

$$
\begin{align}
\rho_{X, Y} &= \frac{\text{cov}(X, Y)}{ \sigma_{X} \sigma_{Y} } \\
&= \frac{\mathbb{E}[(X - \mu_X)(Y - \mu_Y)] }{ \sigma_{X} \sigma_{Y} } \\
&= \frac{\mathbb{E}[XY] - \mathbb{E}[X]\mathbb{E}[Y]}{ \sqrt{\mathbb{E}[X^2] - (\mathbb{E}[X])^2} \sqrt{\mathbb{E}[Y^2] - (\mathbb{E}[Y])^2} } \\
&= \frac{ \big(\sum_{i=1}^{n} x_i y_i\big) - \big(n \bar{x}\bar{y} \big) }{ \sqrt{ \sum_{i=1}^{n} x_i^2 - n \bar{x}^2} \sqrt{\sum_{i=1}^{n} y_i^2 - n \bar{y}^2} }
\end{align}
$$

The Pearson correlation coefficient ranges from $-1$ to $1$. $1$ means the two random variables are perfectly positively correlated, $-1$ means the two random variables are perfectly negatively correlated, and $0$ means the two random variables are not linearly correlated.
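As a sanity check, the sum form in the last line of the definition can be sketched in NumPy and compared against `np.corrcoef`. The function name `pearson_correlation` and the synthetic data below are assumptions chosen for illustration only.

```python
import numpy as np


def pearson_correlation(x: np.ndarray, y: np.ndarray) -> float:
    """Sample Pearson correlation, written in the sum form of the definition."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    n = x.shape[0]
    x_bar, y_bar = x.mean(), y.mean()
    numerator = np.sum(x * y) - n * x_bar * y_bar
    denominator = np.sqrt(np.sum(x**2) - n * x_bar**2) * np.sqrt(
        np.sum(y**2) - n * y_bar**2
    )
    return float(numerator / denominator)


# Hypothetical data: n = 1000 paired observations of two correlated variables.
rng = np.random.default_rng(0)
x_obs = rng.normal(size=1000)
y_obs = 0.5 * x_obs + rng.normal(size=1000)

rho = pearson_correlation(x_obs, y_obs)
# np.corrcoef returns the 2x2 correlation matrix; entry [0, 1] is rho_{X, Y}.
rho_numpy = np.corrcoef(x_obs, y_obs)[0, 1]
print(rho, rho_numpy)
```

The two printed values should agree up to floating-point error. Note that each input here is a column of $n$ observations of one variable, not a single sample.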

## Cosine Similarity VS Pearson Correlation

Someone might try to compare the cosine similarity and the Pearson correlation coefficient and ask what the difference between them is. If $\mathbb{E}[X] = \mathbb{E}[Y] = 0$ and $\bar{x} = \bar{y} = 0$, the Pearson correlation coefficient becomes

$$
\begin{align}
\rho_{X, Y} &= \frac{ \sum_{i=1}^{n} x_i y_i }{ \sqrt{ \sum_{i=1}^{n} x_i^2} \sqrt{\sum_{i=1}^{n} y_i^2 } }
\end{align}
$$

It seems that the Pearson correlation coefficient has reduced to the cosine similarity. Someone might even claim that the cosine similarity is a special case of the Pearson correlation coefficient.

This is incorrect, and the reason is simple: the two quantities describe two different things. The cosine similarity measures the similarity between *two samples*, each a vector of $n$ features, whereas the Pearson correlation coefficient measures the correlation between *two jointly distributed random variables*, estimated from $n$ paired observations.
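One way to make the distinction concrete is to look at a data matrix whose rows are samples and whose columns are features. The sketch below assumes a hypothetical $5 \times 3$ matrix of random values; the variable names are illustrative only. The cosine similarity is computed between two rows, the Pearson correlation coefficient between two columns, and the formulas only coincide numerically when the columns are mean-centered first.

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical data matrix: 5 samples (rows), each with 3 features (columns).
data = rng.normal(size=(5, 3))

# Cosine similarity compares two samples, i.e. two rows of length 3.
row_a, row_b = data[0], data[1]
cos_rows = np.dot(row_a, row_b) / (np.linalg.norm(row_a) * np.linalg.norm(row_b))

# Pearson correlation compares two variables, i.e. two columns of length 5.
col_x, col_y = data[:, 0], data[:, 1]
rho_cols = np.corrcoef(col_x, col_y)[0, 1]

# The zero-mean case: after centering, the cosine formula applied to the
# two columns gives the same number as the Pearson correlation.
cx, cy = col_x - col_x.mean(), col_y - col_y.mean()
cos_centered = np.dot(cx, cy) / (np.linalg.norm(cx) * np.linalg.norm(cy))

print(cos_rows)                 # similarity between sample 0 and sample 1
print(rho_cols, cos_centered)   # equal up to floating-point error
```

Even when the numbers match, the inputs play different roles: `row_a` and `row_b` are two samples, while `col_x` and `col_y` are observations of two variables.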
