Matthews Correlation Coefficient
Introduction
In more recent machine learning benchmarks, such as Linguistic Acceptability, I started to see the Matthews Correlation Coefficient (MCC) used as an evaluation metric instead of the more traditional accuracy or F1 score. The name sounded familiar to me because I previously studied crystallography, where there is a “Matthews Coefficient (Probabilities)” that estimates the probabilities of (protein) multimerization states occurring in a crystal. So I was very curious to see whether the two Matthews are the same person, what the relationship between the Matthews Correlation Coefficient and the Matthews Coefficient is, why a correlation coefficient would be used as a classification metric, and how the Matthews Correlation Coefficient was derived. Unfortunately, Wikipedia hardly mentions any of these.
In this blog post, I am going to present my shallow archaeological investigation into the Matthews Correlation Coefficient.
Confusion Matrix
The Matthews Correlation Coefficient is computed entirely from the confusion matrix, so it is necessary to understand the confusion matrix before we talk about the Matthews Correlation Coefficient.
This is the confusion matrix for binary classification. I will also restrict the discussion in this blog post to binary classification.
| | | $\boldsymbol{S}$ | |
|---|---|---|---|
| | | $0$ | $1$ |
| $\boldsymbol{P}$ | $0$ | $TN$ | $FN$ |
| | $1$ | $FP$ | $TP$ |
Here $\boldsymbol{S}$ is the label of the data, and $\boldsymbol{P}$ is the prediction for the data. $TN$ is “True Negatives”, $FN$ is “False Negatives”, $FP$ is “False Positives”, and $TP$ is “True Positives”.
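To make the four cells concrete, here is a minimal sketch that counts them from a list of labels and a list of predictions. The function name `confusion_counts` and the toy data are my own illustrative choices, not something from the paper.

```python
def confusion_counts(labels, predictions):
    """Count (TN, FN, FP, TP) for binary labels S and predictions P in {0, 1}."""
    tn = fn = fp = tp = 0
    for s, p in zip(labels, predictions):
        if p == 1 and s == 1:
            tp += 1  # predicted 1, label 1
        elif p == 1 and s == 0:
            fp += 1  # predicted 1, label 0
        elif p == 0 and s == 1:
            fn += 1  # predicted 0, label 1
        else:
            tn += 1  # predicted 0, label 0
    return tn, fn, fp, tp


# Hypothetical toy data, chosen only for illustration.
labels = [1, 1, 1, 0, 0, 0, 0, 1]
predictions = [1, 0, 1, 0, 1, 0, 0, 1]
print(confusion_counts(labels, predictions))  # (3, 1, 1, 3)
```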
Definition of Matthews Correlation Coefficient
According to Wikipedia, this magic correlation coefficient was introduced by Brian Matthews in the paper “Comparison of the predicted and observed secondary structure of T4 phage lysozyme”, published in 1975.
The original form of the expression from the paper, as described on Wikipedia, is
$$
\begin{gather}
n = TN + TP + FN + FP \\
\bar{S} = \frac{TP + FN}{n} \\
\bar{P} = \frac{TP + FP}{n} \\
\text{MCC} = \frac{TP/n - \bar{S}\bar{P}}{\sqrt{\bar{S}\bar{P}(1-\bar{S})(1-\bar{P})}} \\
\end{gather}
$$
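The definition translates almost line by line into code. The following is a minimal sketch, assuming the four counts are already available; the name `mcc_from_counts` is hypothetical, and returning NaN when the denominator is zero is my own assumption, since the formula itself is simply undefined in that case.

```python
import math


def mcc_from_counts(tn, fn, fp, tp):
    """MCC computed from confusion-matrix counts, following the definition above."""
    n = tn + tp + fn + fp
    s_bar = (tp + fn) / n  # fraction of samples whose label is 1
    p_bar = (tp + fp) / n  # fraction of samples predicted as 1
    denominator = math.sqrt(s_bar * p_bar * (1 - s_bar) * (1 - p_bar))
    if denominator == 0:
        # Assumption: report NaN when all labels or all predictions are identical.
        return float("nan")
    return (tp / n - s_bar * p_bar) / denominator


# With TN=3, FN=1, FP=1, TP=3: n=8, S_bar=P_bar=0.5, so MCC = (0.375 - 0.25) / 0.25.
print(mcc_from_counts(tn=3, fn=1, fp=1, tp=3))  # 0.5
```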
Derivation of Matthews Correlation Coefficient
Wikipedia does not say how this coefficient was derived or why the expression is called a “correlation coefficient”. So I looked into the paper.
In the paper, the author says that the (Matthews) correlation coefficient is a special case of the following correlation coefficient, given in the book “Statistical Methods for Research Workers” by Ronald Fisher.
$$
C = \frac{\sum_\limits{n}(S_n - \bar{S})(P_n - \bar{P})}{\sqrt{\big[\sum_\limits{n}(S_n - \bar{S})^2\big] \big[\sum_\limits{n}(P_n - \bar{P})^2\big]}}
$$
Where $S_n$ is the observation value (label) for sample $n$, $P_n$ is the predicted value (prediction) for sample $n$, $\bar{S}$ and $\bar{P}$ are the mean values of $S_n$ and $P_n$, respectively.
Wait, does this correlation coefficient look familiar? I immediately realized that it is the Pearson Correlation Coefficient, which is a measure of the linear correlation between two variables $X$ and $Y$! The Pearson Correlation Coefficient was developed in the 1800s, long before the Matthews Correlation Coefficient was born!
As a special case of the Pearson Correlation Coefficient, in the Matthews Correlation Coefficient $S_n$ and $P_n$ can only take the binary values $0$ and $1$, which is typical for classification problems. With this assumption, we can derive the original form of the Matthews Correlation Coefficient.
It might be a little bit brain-twisting, but the following facts hold by definition or by slight math tricks.
$$
\begin{gather}
\sum_\limits{n}S_n = n\bar{S} = TP + FN\\
\sum_\limits{n}P_n = n\bar{P} = TP + FP\\
S^2_n = S_n \\
P^2_n = P_n \\
\sum_\limits{n} S_n P_n = TP \\
\end{gather}
$$
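If these identities feel too abstract, they can be checked numerically on a small example. This is only a sanity check on hypothetical toy data, not a proof; the variable names are my own.

```python
# Hypothetical toy data: S_n are labels, P_n are predictions, both in {0, 1}.
labels = [1, 1, 1, 0, 0, 0, 0, 1]
predictions = [1, 0, 1, 0, 1, 0, 0, 1]

pairs = list(zip(labels, predictions))
tp = sum(1 for s, p in pairs if s == 1 and p == 1)
fn = sum(1 for s, p in pairs if s == 1 and p == 0)
fp = sum(1 for s, p in pairs if s == 0 and p == 1)

# Summing the labels counts the samples labeled 1, i.e. TP + FN.
assert sum(labels) == tp + fn
# Summing the predictions counts the samples predicted 1, i.e. TP + FP.
assert sum(predictions) == tp + fp
# For values in {0, 1}, squaring changes nothing.
assert all(s * s == s for s in labels)
assert all(p * p == p for p in predictions)
# S_n * P_n is 1 only when both are 1, so the sum counts true positives.
assert sum(s * p for s, p in pairs) == tp
```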
We use the above facts to simplify the numerator and the denominator.
For the numerator,
$$
\begin{aligned}
\sum_\limits{n}(S_n - \bar{S})(P_n - \bar{P}) &= \sum_\limits{n}(S_n P_n - S_n\bar{P} - \bar{S}P_n + \bar{S}\bar{P}) \\
&= \sum_\limits{n} S_n P_n - \sum_\limits{n} S_n\bar{P} - \sum_\limits{n} \bar{S}P_n + \sum_\limits{n} \bar{S}\bar{P} \\
&= TP - \bar{P} \sum_\limits{n} S_n - \bar{S} \sum_\limits{n} P_n + n \bar{S}\bar{P} \\
&= TP - n \bar{S}\bar{P} - n \bar{S}\bar{P} + n \bar{S}\bar{P} \\
&= TP - n \bar{S}\bar{P} \\
\end{aligned}
$$
For the denominator,
$$
\begin{aligned}
\sum_\limits{n}(S_n - \bar{S})^2 &= \sum_\limits{n} (S^2_n - 2S_n \bar{S} + \bar{S}^2) \\
&= \sum_\limits{n} S^2_n - \sum_\limits{n} 2S_n \bar{S} + \sum_\limits{n} \bar{S}^2\\
&= \sum_\limits{n} S_n - 2 \bar{S} \sum_\limits{n} S_n + n\bar{S}^2 \\
&= n\bar{S} - 2 n \bar{S}^2 + n\bar{S}^2 \\
&= n\bar{S} - n \bar{S}^2 \\
&= n\bar{S}(1 - \bar{S}) \\
\end{aligned}
$$
Similarly, we could derive
$$
\begin{aligned}
\sum_\limits{n}(P_n - \bar{P})^2 &= n\bar{P}(1 - \bar{P}) \\
\end{aligned}
$$
Thus, the whole denominator is
$$
\sqrt{\big[\sum_\limits{n}(S_n - \bar{S})^2\big] \big[\sum_\limits{n}(P_n - \bar{P})^2\big]} = n \sqrt{\bar{S}\bar{P}(1 - \bar{S})(1 - \bar{P})}
$$
Taken together, we have
$$
\begin{aligned}
\text{MCC} &= \frac{TP - n \bar{S}\bar{P}}{n \sqrt{\bar{S}\bar{P}(1 - \bar{S})(1 - \bar{P})}} \\
&= \frac{TP/n - \bar{S}\bar{P}}{\sqrt{\bar{S}\bar{P}(1-\bar{S})(1-\bar{P})}}
\end{aligned}
$$
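The derivation can also be checked numerically: computing the general correlation coefficient $C$ directly on binary labels and predictions should give the same value as the closed-form expression. Below is a minimal sketch of that check; the function names `pearson` and `mcc_closed_form` and the toy data are hypothetical.

```python
import math


def pearson(xs, ys):
    """General correlation coefficient C, computed directly from its definition."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    numerator = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    denominator = math.sqrt(
        sum((x - x_bar) ** 2 for x in xs) * sum((y - y_bar) ** 2 for y in ys)
    )
    return numerator / denominator


def mcc_closed_form(labels, predictions):
    """Closed-form MCC from the derivation, valid only for 0/1 values."""
    n = len(labels)
    tp = sum(s * p for s, p in zip(labels, predictions))
    s_bar = sum(labels) / n
    p_bar = sum(predictions) / n
    return (tp / n - s_bar * p_bar) / math.sqrt(
        s_bar * p_bar * (1 - s_bar) * (1 - p_bar)
    )


# Hypothetical toy data with TP = 3, S_bar = P_bar = 0.4, n = 10.
labels = [1, 1, 1, 0, 0, 0, 0, 1, 0, 0]
predictions = [1, 0, 1, 0, 1, 0, 0, 1, 0, 0]

r = pearson(labels, predictions)
m = mcc_closed_form(labels, predictions)
print(r, m)
assert math.isclose(r, m)  # the two agree up to floating-point error
```

By hand, this example gives $(0.3 - 0.16) / \sqrt{0.4 \cdot 0.4 \cdot 0.6 \cdot 0.6} = 0.14 / 0.24 = 7/12$.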
Anecdote
It turns out that Brian Matthews is the author of both the “Matthews Correlation Coefficient” and the “Matthews Coefficient (Probabilities)”. However, the physical meanings of the two are entirely different. People who do not study crystallography would probably never hear of or use the Matthews Coefficient (Probabilities).
Conclusion
The Matthews Correlation Coefficient is nothing more than the Pearson Correlation Coefficient applied to binary classification problems, where the two random variables are the prediction and the label. That is to say, the Matthews Correlation Coefficient is the binary special case of the Pearson Correlation Coefficient.
The closer the Matthews Correlation Coefficient is to 1.0, the more the predictions agree with the labels; the closer it is to -1.0, the more the predictions disagree with the labels; and when it is close to 0, the predictions and labels are essentially uncorrelated, i.e., the predictions seem to be random.
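The three regimes can be illustrated with a tiny example. This is a minimal sketch on hypothetical four-sample data; `mcc` is my own helper implementing the closed-form expression derived above, and it assumes both classes appear in the labels and in the predictions so that the denominator is non-zero.

```python
import math


def mcc(labels, predictions):
    """Closed-form MCC for 0/1 labels and predictions (non-degenerate inputs only)."""
    n = len(labels)
    tp = sum(s * p for s, p in zip(labels, predictions))
    s_bar = sum(labels) / n
    p_bar = sum(predictions) / n
    return (tp / n - s_bar * p_bar) / math.sqrt(
        s_bar * p_bar * (1 - s_bar) * (1 - p_bar)
    )


labels = [1, 1, 0, 0]

print(mcc(labels, [1, 1, 0, 0]))  # 1.0: every prediction matches its label
print(mcc(labels, [0, 0, 1, 1]))  # -1.0: every prediction is flipped
print(mcc(labels, [1, 0, 1, 0]))  # 0.0: predictions carry no information about labels
```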
Final Remarks
I am very surprised that, so many years after the Pearson Correlation Coefficient was published, it was apparently not used for binary classification problems until Brian Matthews’s work. If someone did use the Pearson Correlation Coefficient for binary classification problems earlier, the Matthews Correlation Coefficient would probably have to change its name.
References
Matthews Correlation Coefficient
https://leimao.github.io/blog/Matthews-Correlation-Coefficient/