Principal Component Analysis
Introduction
Principal component analysis (PCA) is one of a family of techniques for taking high-dimensional data and using the dependencies between the variables to represent it in a more tractable, lower-dimensional form, without losing too much information. It has been widely used for data compression and de-noising. However, its mathematical underpinnings often remain opaque to the user.
In this article, I would like to discuss the entire process of PCA mathematically, including PCA projection and reconstruction, with most of the derivations and proofs provided. At the end of the article, I implement PCA projection and reconstruction from scratch. After reading this article, PCA should no longer be a black box to you.
Prerequisites
Orthogonal Matrix
In linear algebra, an orthogonal matrix is a real square matrix whose columns and rows are orthogonal unit vectors (orthonormal vectors).
Equivalently, the mathematical expression is
$$
\begin{align}
A^{\top}A = AA^{\top} = I
\end{align}
$$
By the definition of invertible matrix, this means matrix $A$ is invertible and $A^{-1} = A^{\top}$.
We could also view this from the perspective of the determinant.
Because $A$ and $A^{\top}$ are square matrices, using the properties of the determinant,
$$
\begin{align}
\det(I) &= \det(A^{\top}A) \\
&= \det(AA^{\top}) \\
&= \det(A) \det(A^{\top}) \\
&= \det(A) \det(A) \\
&= \det(A)^2 \\
&= \det(A^{\top})^2 \\
&= 1 \\
\end{align}
$$
Since $\det(A) \neq 0$, matrix $A$ is invertible. We multiply $A^{-1}$ on the right of both sides of the orthogonal matrix definition.
$$
\begin{align}
A^{\top}A A^{-1} &= I A^{-1}\\
A^{\top} I &= A^{-1} \\
A^{\top} &= A^{-1} \\
\end{align}
$$
We have also derived the conclusion that $A^{-1} = A^{\top}$.
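As a quick numerical sanity check, here is a minimal NumPy sketch of these properties; using the Q factor of a QR decomposition as a convenient source of an orthogonal matrix is my own choice for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
# The Q factor of a QR decomposition of a random square matrix is orthogonal.
A, _ = np.linalg.qr(rng.normal(size=(5, 5)))

# A^T A = A A^T = I
assert np.allclose(A.T @ A, np.eye(5))
assert np.allclose(A @ A.T, np.eye(5))
# A^{-1} = A^T
assert np.allclose(np.linalg.inv(A), A.T)
# det(A) is +1 or -1
assert np.isclose(abs(np.linalg.det(A)), 1.0)
```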
Similarly, a complex square matrix $A$ is unitary if its conjugate transpose $A^{\dagger}$ is also its inverse.
Equivalently, the mathematical expression is
$$
\begin{align}
A^{\dagger}A = AA^{\dagger} = I
\end{align}
$$
Symmetric Matrix
A real symmetric matrix has the following useful property:
If $A$ is a real symmetric matrix, all of its eigenvalues are real numbers.
Let $\lambda \in \mathbb{C}$ be an eigenvalue of $A$ and $v$ a corresponding eigenvector, so that $Av = \lambda v$ and $v \neq 0$. Because of the properties of complex conjugation and $\overline{A} = A$ (since $A$ is a real-valued matrix),
$$
\begin{align}
A \overline{v} &= \overline{A} \overline{v} \\
&= \overline{Av} \\
&= \overline{\lambda v} \\
&= \overline{\lambda} \overline{v} \\
\end{align}
$$
We get $A \overline{v} = \overline{\lambda} \overline{v}$.
We multiply $v^{\dagger}$ ($v^{\dagger} = \overline{v}^{\top}$) on the left of both sides of $Av = \lambda v$, and because $A^{\top} = A$ and the property we have just derived, $A \overline{v} = \overline{\lambda} \overline{v}$,
$$
\begin{align}
\lambda v^{\dagger} v &= v^{\dagger} A v \\
&= \overline{v}^{\top} A^{\top} v \\
&= (A \overline{v})^{\top} v \\
&= (\overline{\lambda} \overline{v})^{\top} v \\
&= \overline{\lambda} \overline{v}^{\top} v \\
&= \overline{\lambda} v^{\dagger} v \\
\end{align}
$$
We have $\lambda v^{\dagger} v = \overline{\lambda} v^{\dagger} v$. Since $v \neq 0$, $v^{\dagger} v > 0$, so $\lambda = \overline{\lambda}$ and thus $\lambda$ is real.
This concludes the proof.
Positive Semi-Definite Matrix
An $n \times n$ symmetric matrix $A$ is defined to be positive semi-definite if $x^{\dagger} A x \geq 0$ for all $x \in \mathbb{C}^n$.
The positive semi-definite matrix has the following important property:
The eigenvalues of a positive semi-definite matrix are non-negative.
Because $x^{\dagger} A x \geq 0$ for all $x \in \mathbb{C}^n$, suppose $x$ is an eigenvector of $A$ with $Ax = \lambda x$ and $x \neq 0$. Then
$$
\begin{align}
x^{\dagger} A x &= x^{\dagger} \lambda x \\
&= \lambda x^{\dagger} x \\
&\geq 0 \\
\end{align}
$$
Because $x^{\dagger} x$ must be a real number and $x^{\dagger} x > 0$ (since $x \neq 0$), we have $\lambda \geq 0$.
This concludes the proof.
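Here is a minimal NumPy sketch of both properties, using $B^{\top} B$ as a convenient example of a real symmetric positive semi-definite matrix; the construction is my own and only for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
B = rng.normal(size=(6, 4))
A = B.T @ B  # 4 x 4 real symmetric positive semi-definite matrix

eigenvalues = np.linalg.eigvals(A)
# The eigenvalues are real (up to floating-point error) ...
assert np.allclose(eigenvalues.imag, 0.0)
# ... and non-negative.
assert np.all(eigenvalues.real >= -1e-12)
```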
Covariance Matrix
The covariance matrix has the following important property:
The covariance matrix is positive semi-definite. This means that the eigenvalues of the covariance matrix are non-negative.
The proof that the covariance matrix must be positive semi-definite can be found in my previous post on Multivariate Gaussian and Covariance Matrix.
Singular Values
The singular values, $\sigma_1$, $\sigma_2$, $\cdots$, $\sigma_r$, of an $m \times n$ matrix $A$ are the square roots, $\sigma_i = \sqrt{\lambda_i}$, of non-negative eigenvalues of the associated Gram matrix $K = A^{\dagger}A$. The corresponding eigenvectors of $K$ are known as singular vectors of $A$.
Note that the associated Gram matrix $K = A^{\dagger}A$ is Hermitian (real symmetric when $A$ is real), so the eigenvalues of $K$ are all real.
$K = A^{\dagger}A$ is also positive semi-definite.
For any vector $x$
$$
x^{\dagger} (A^{\dagger} A) x = (Ax)^{\dagger} Ax
$$
Because $(Ax)^{\dagger} (Ax)$ is the squared norm of the vector $Ax$,
$$
\begin{align}
x^{\dagger} (A^{\dagger} A) x \geq 0
\end{align}
$$
Therefore, all the eigenvalues of $K = A^{\dagger}A$ are non-negative, and each of them corresponds to a singular value of $A$.
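Here is a short NumPy sketch, assuming a real matrix $A$ for simplicity, checking that the singular values of $A$ are the square roots of the eigenvalues of the Gram matrix $K = A^{\top} A$.

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.normal(size=(5, 3))
K = A.T @ A

# Eigenvalues of the Gram matrix, reordered to descending order.
eigenvalues = np.linalg.eigvalsh(K)[::-1]
# Singular values of A, returned by NumPy in descending order.
singular_values = np.linalg.svd(A, compute_uv=False)

# sigma_i = sqrt(lambda_i); clip guards against tiny negative round-off.
assert np.allclose(singular_values, np.sqrt(np.clip(eigenvalues, 0.0, None)))
```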
Singular Value Decomposition
In linear algebra, the singular value decomposition (SVD) is a factorization of a real or complex matrix that generalizes the eigendecomposition of a square normal matrix to any $ m\times n$ matrix via an extension of the polar decomposition.
Specifically, the singular value decomposition of an $m \times n$ real or complex matrix $M$ is a factorization of the form $U \Sigma V^{\dagger}$, where $U$ is an $m \times m$ real or complex unitary matrix, $\Sigma$ is a $m \times n$ rectangular diagonal matrix with non-negative real numbers on the diagonal, and $V$ is an $n \times n$ real or complex unitary matrix.
The diagonal entries $\sigma_{i}=\Sigma_{ii}$ of $\Sigma$ are known as the singular values of $M$. The number of non-zero singular values is equal to the rank of $M$.
In particular, for any matrix $A \in \mathbb{C}^{m \times n}$,
$$
A_{m \times n} = U_{m\times m} \Sigma_{m \times n} V_{n \times n}^{\dagger}
$$
We will skip the proof for why every matrix has SVD and the algorithm for SVD.
Singular Value Decomposition for the Gram Matrix
For any matrix $A \in \mathbb{C}^{m \times n}$, $A^{\dagger} A$ can be expressed as
$$
\begin{align}
A^{\dagger} A &= (U \Sigma V^{\dagger})^{\dagger} U \Sigma V^{\dagger} \\
&= V \Sigma^{\dagger} U^{\dagger} U \Sigma V^{\dagger} \\
&= V \Sigma^{\dagger} I \Sigma V^{\dagger} \\
&= V \Sigma^{\dagger} \Sigma V^{\dagger}
\end{align}
$$
We multiply $V$ on the right of both sides of the equation.
$$
\begin{align}
A^{\dagger} A V &= V \Sigma^{\dagger} \Sigma V^{\dagger} V \\
&= V \Sigma^{\dagger} \Sigma I \\
&= V \Sigma^{\dagger} \Sigma \\
\end{align}
$$
Note that $\Sigma^{\dagger} \Sigma$ is a square diagonal matrix. Reading $A^{\dagger} A V = V \Sigma^{\dagger} \Sigma$ column by column gives $A^{\dagger} A v_j = (\Sigma^{\dagger} \Sigma)_{jj} v_j$. So, by the definition of eigenvalues and eigenvectors, the diagonal values of $\Sigma^{\dagger} \Sigma$, including the zeros, are the eigenvalues of $A^{\dagger} A$, and the columns of $V$ are the corresponding eigenvectors of $A^{\dagger} A$.
Similarly, from $A A^{\dagger} = U \Sigma V^{\dagger} V \Sigma^{\dagger} U^{\dagger} = U \Sigma \Sigma^{\dagger} U^{\dagger}$, multiplying $U$ on the right of both sides gives
$$
\begin{align}
A A^{\dagger} U &= U \Sigma \Sigma^{\dagger} \\
\end{align}
$$
Note that $\Sigma \Sigma^{\dagger}$ is a square diagonal matrix. Reading $A A^{\dagger} U = U \Sigma \Sigma^{\dagger}$ column by column, the diagonal values of $\Sigma \Sigma^{\dagger}$, including the zeros, are the eigenvalues of $A A^{\dagger}$, and the columns of $U$ are the corresponding eigenvectors of $A A^{\dagger}$.
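The following NumPy sketch, again assuming a real matrix $A$ for simplicity, checks both relationships numerically; the variable names are my own.

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.normal(size=(4, 3))
U, S, Vh = np.linalg.svd(A)  # A = U @ diag(S) @ Vh, with Vh = V^T
V = Vh.T

# Sigma^T Sigma is diagonal with the squared singular values.
eigenvalues_AtA = np.zeros(A.shape[1])
eigenvalues_AtA[: len(S)] = S**2
# A^T A V = V (Sigma^T Sigma): columns of V are eigenvectors of A^T A.
assert np.allclose((A.T @ A) @ V, V * eigenvalues_AtA)

# Sigma Sigma^T is diagonal with the squared singular values, padded with zeros.
eigenvalues_AAt = np.zeros(A.shape[0])
eigenvalues_AAt[: len(S)] = S**2
# A A^T U = U (Sigma Sigma^T): columns of U are eigenvectors of A A^T.
assert np.allclose((A @ A.T) @ U, U * eigenvalues_AAt)
```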
Mathematics of Principal Components Analysis
We start with $p$-dimensional vectors, and want to summarize them by projecting down into a $q$-dimensional subspace, where $q \leq p$. Our summary will be the projection of the original vectors on to $q$ directions, the principal axes, which span the subspace.
Minimizing Projection Residuals
Given a dataset $X \in \mathbb{R}^{n \times p}$ whose rows are the centered data vectors $x_i \in \mathbb{R}^p$ for $0 \leq i \leq n-1$ (so that $\sum_{i=0}^{n-1} x_{i} = 0$), suppose we have a unit vector $w \in \mathbb{R}^p$ ($|w| = 1$) and we project all the data vectors onto this unit vector $w$.
The length of projection for data vector $x_i$ on $w$, by definition, is
$$
\begin{align}
|x_i| \cos \theta &= \frac{\langle x_i, w \rangle}{|w|} \\
&= \langle x_i, w \rangle \\
\end{align}
$$
where $\langle x_i, w \rangle$ is the inner product of $x_i$ and $w$.
The projection vector for data vector $x_i$ on $w$ is $\langle x_i, w \rangle w$.
The residual, which is the distance from the data vector $x_i$ to the line spanned by $w$, is the length of the vector $x_i - \langle x_i, w \rangle w$.
Let us compute the squared residual $| x_i - \langle x_i, w \rangle w |^2$.
$$
\begin{align}
| x_i - \langle x_i, w \rangle w |^2 &= \langle x_i - \langle x_i, w \rangle w, x_i - \langle x_i, w \rangle w \rangle \\
&= \langle x_i, x_i \rangle - \langle x_i, \langle x_i, w \rangle w \rangle - \langle \langle x_i, w \rangle w, x_i \rangle + \langle \langle x_i, w \rangle w, \langle x_i, w \rangle w \rangle \\
&= \langle x_i, x_i \rangle - \langle x_i, w \rangle \langle x_i, w \rangle - \langle x_i, w \rangle \langle w, x_i \rangle + \langle x_i, w \rangle ^2 \langle w, w \rangle \\
&= \langle x_i, x_i \rangle - 2 \langle x_i, w \rangle ^2 + \langle x_i, w \rangle ^2 |w| ^2 \\
&= \langle x_i, x_i \rangle - 2 \langle x_i, w \rangle ^2 + \langle x_i, w \rangle ^2 \\
&= \langle x_i, x_i \rangle - \langle x_i, w \rangle ^2 \\
\end{align}
$$
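Here is a tiny NumPy check of this identity for a random data vector and a random unit vector; the example values are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=5)
w = rng.normal(size=5)
w = w / np.linalg.norm(w)  # unit vector

# |x - <x, w> w|^2 == <x, x> - <x, w>^2
residual = x - np.dot(x, w) * w
assert np.isclose(np.dot(residual, residual), np.dot(x, x) - np.dot(x, w) ** 2)
```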
The optimization goal of the projection is to minimize the mean squared error $\text{MSE}(w)$, which is the mean of the squared residuals.
$$
\begin{align}
\text{MSE}(w) &= \frac{1}{n} \sum_{i=0}^{n-1} | x_i - \langle x_i, w \rangle w |^2 \\
&= \frac{1}{n} \sum_{i=0}^{n-1} \big( \langle x_i, x_i \rangle - \langle x_i, w \rangle ^2 \big)\\
&= \frac{1}{n} \sum_{i=0}^{n-1} \langle x_i, x_i \rangle - \frac{1}{n} \sum_{i=0}^{n-1} \langle x_i, w \rangle ^2\\
\end{align}
$$
Recalling the relationship between variance and expectation, $\mathbb{V}(X) = \mathbb{E}(X^2) - \mathbb{E}(X)^2$, we have
$$
\begin{align}
\frac{1}{n} \sum_{i=0}^{n-1} \langle x_i, w \rangle ^2 &= \mathbb{E} \big(\langle x, w \rangle ^2 \big) \\
&= \mathbb{V}\big(\langle x, w \rangle\big) + \mathbb{E} \big(\langle x, w \rangle\big) ^2 \\
&= \mathbb{V}\big(\langle x, w \rangle\big) + \Big( \frac{1}{n} \sum_{i=0}^{n-1} \langle x_i, w \rangle \Big)^2 \\
&= \mathbb{V}\big(\langle x, w \rangle\big) + \Big\langle \frac{1}{n} \sum_{i=0}^{n-1} x_i, w \Big\rangle^2 \\
&= \mathbb{V}\big(\langle x, w \rangle\big) + \langle 0, w \rangle^2 \\
&= \mathbb{V}\big(\langle x, w \rangle\big) \\
\end{align}
$$
So $\frac{1}{n} \sum_{i=0}^{n-1} \langle x_i, w \rangle ^2$ is actually the variance of the inner product between the data vector and the projection unit vector, $\mathbb{V}\big(\langle x, w \rangle\big)$.
We further have another expression of $\text{MSE}(w)$.
$$
\begin{align}
\text{MSE}(w) &= \frac{1}{n} \sum_{i=0}^{n-1} \langle x_i, x_i \rangle - \frac{1}{n} \sum_{i=0}^{n-1} \langle x_i, w \rangle ^2\\
&= \mathbb{E}\big(\langle x, x \rangle\big) - \mathbb{V}\big(\langle x, w \rangle\big)\\
\end{align}
$$
Because $\mathbb{E}\big(\langle x, x \rangle\big)$ does not depend on $w$, minimizing the mean of the residual sum of squares $\text{MSE}(w)$ turns out to be equivalent to maximizing the variance of the inner product between the data vector and the projection unit vector.
The variance of the inner product between data vector and the projection unit vector could also be expressed using the dataset matrix $X$.
$$
\begin{align}
\mathbb{V}\big(\langle x, w \rangle\big) &= \frac{1}{n} \sum_{i=0}^{n-1} \langle x_i, w \rangle ^2 \\
&= \frac{1}{n} \langle X w, X w \rangle \\
&= \frac{1}{n} (X w)^{\top} X w \\
&= \frac{1}{n} w^{\top} X^{\top} X w \\
&= w^{\top} \frac{X^{\top} X}{n} w \\
\end{align}
$$
where $\frac{X^{\top} X}{n}$ is the covariance matrix of $X$.
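The following NumPy sketch checks this identity on randomly generated, centered data; the data itself is arbitrary and only for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 100, 3
X = rng.normal(size=(n, p))
X = X - X.mean(axis=0)  # center the data
w = rng.normal(size=p)
w = w / np.linalg.norm(w)  # unit projection direction

projections = X @ w
variance_direct = np.mean(projections**2)     # 1/n * sum <x_i, w>^2
variance_quadratic = w @ (X.T @ X / n) @ w    # w^T (X^T X / n) w
assert np.isclose(variance_direct, variance_quadratic)
```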
If we project the data vectors onto $q$ orthogonal unit vectors $w_0$, $w_1$, $\cdots$, $w_{q-1}$, assembled as the columns of a matrix $W \in \mathbb{R}^{p \times q}$, then our optimization goal becomes maximizing the sum of the projection variances over all these projection directions.
$$
\begin{align}
\mathscr{L}(W) &= \sum_{j=0}^{q-1} \mathbb{V}\big(\langle x, w_j \rangle\big) \\
&= \frac{1}{n} \sum_{j=0}^{q-1} \sum_{i=0}^{n-1} \langle x_i, w_j \rangle ^2 \\
&= \frac{1}{n} \sum_{j=0}^{q-1} \langle X w_j, X w_j \rangle \\
&= \frac{1}{n} \sum_{j=0}^{q-1} (X w_j)^{\top} X w_j \\
&= \frac{1}{n} \sum_{j=0}^{q-1} w_j^{\top} X^{\top} X w_j \\
&= \frac{1}{n} \operatorname{tr}(W^{\top} X^{\top} X W) \\
\end{align}
$$
subject to
$$
W^{\top} W = I
$$
Maximizing Variance
Suppose we are projecting the data onto $p$ orthonormal unit vectors $w_j \in \mathbb{R}^{p}$. These unit vectors are assembled as the columns of a matrix $W \in \mathbb{R}^{p \times p}$, and $W$ is an orthogonal matrix.
$$
\begin{align}
\mathscr{L}(W) &= \frac{1}{n} \sum_{j=0}^{p-1} w_j^{\top} X^{\top} X w_j \\
\end{align}
$$
subject to
$$
W^{\top} W = I
$$
Let’s maximize a single term $\mathscr{L}(w) = w^{\top} X^{\top} X w$ first, subject to the constraint $w^{\top} w = 1$ (dropping the factor $\frac{1}{n}$ does not change the maximizer). We use a Lagrange multiplier $\lambda$ to solve this constrained optimization problem.
$$
\begin{align}
\mathscr{L}(w, \lambda) &= \mathscr{L}(w) - \lambda(w^{\top} w - 1) \\
&= w^{\top} X^{\top} X w - \lambda(w^{\top} w - 1) \\
\end{align}
$$
$$
\begin{align}
\frac{\partial \mathscr{L}}{\partial \lambda} &= -(w^{\top} w - 1) \\
\frac{\partial \mathscr{L}}{\partial w} &= 2 X^{\top} X w - 2 \lambda w \\
\end{align}
$$
Setting $\frac{\partial \mathscr{L}}{\partial \lambda} = 0$ and $\frac{\partial \mathscr{L}}{\partial w} = 0$, we get
$$
\begin{align}
w^{\top} w &= 1 \\
X^{\top} X w &= \lambda w \\
\end{align}
$$
So by definition, $\lambda$ and $w$ are the eigenvalue and eigenvector of matrix $X^{\top} X$.
Because $X^{\top} X$ is a real symmetric and positive semi-definite matrix, all the eigenvalues of $X^{\top} X$ are non-negative.
$X^{\top} X$ has exactly $p$ eigenvalues $\lambda_0, \lambda_1, \cdots, \lambda_{p-1}$ (counted with multiplicity) and corresponding orthonormal eigenvectors $w_0, w_1, \cdots, w_{p-1}$. The matrix $W$ whose columns are the eigenvectors $w_0, w_1, \cdots, w_{p-1}$ is a global maximizer of $\mathscr{L}(W)$. The global maximum value of $\mathscr{L}(W)$ is $\frac{1}{n}$ times the sum of the non-negative eigenvalues.
$$
\begin{align}
\mathscr{L}(W) &= \frac{1}{n} \sum_{j=0}^{p-1} w_j^{\top} X^{\top} X w_j \\
&= \frac{1}{n} \sum_{j=0}^{p-1} w_j^{\top} \lambda_j w_j \\
&= \frac{1}{n} \sum_{j=0}^{p-1} \lambda_j w_j^{\top} w_j \\
&= \frac{1}{n} \sum_{j=0}^{p-1} \lambda_j \\
\end{align}
$$
Because these eigenvalues are non-negative, their contributions to the global maximum can be ranked by their values.
We call the eigenvector corresponding to the largest eigenvalue the first principal axis, the eigenvector corresponding to the second largest eigenvalue the second principal axis, and so on.
Now suppose we are projecting the data onto $q$ orthonormal unit vectors $w_j \in \mathbb{R}^{p}$, with $q \leq p$. These unit vectors are assembled as the columns of a matrix $W \in \mathbb{R}^{p \times q}$. The optimization problem becomes maximizing the following objective.
$$
\begin{align}
\mathscr{L}(W) &= \frac{1}{n} \sum_{j=0}^{q-1} w_j^{\top} X^{\top} X w_j \\
\end{align}
$$
subject to
$$
W^{\top} W = I
$$
The solution follows immediately: we just have to pick the $q$ largest eigenvalues of the matrix $X^{\top} X$ and their corresponding eigenvectors.
If we have computed the eigenvalues and eigenvectors of the matrix $X^{\top} X$, we can project onto any number of principal components without extensive additional computation.
In short, PCA is nothing more than computing the eigenvalues and eigenvectors of the matrix $X^{\top} X$. To project the data matrix $X \in \mathbb{R}^{n \times p}$ onto $q$ orthonormal unit vectors, simply take the eigenvectors corresponding to the $q$ largest eigenvalues, assemble them as the columns of a matrix $W \in \mathbb{R}^{p \times q}$, and $XW$ is the projected, dimensionality-reduced dataset. Given any new data matrix $A \in \mathbb{R}^{m \times p}$, the projected, dimensionality-reduced dataset is similarly $AW$.
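The following is a minimal sketch of this recipe in NumPy; the function name pca_projection_matrix and its arguments are my own for illustration, not part of any library.

```python
import numpy as np

def pca_projection_matrix(X, q):
    """Return the p x q matrix W whose columns are the top-q principal axes.

    X is assumed to be centered, with shape (n, p).
    """
    # Eigendecomposition of X^T X; eigh returns eigenvalues in ascending order.
    eigenvalues, eigenvectors = np.linalg.eigh(X.T @ X)
    # Take the eigenvectors corresponding to the q largest eigenvalues.
    order = np.argsort(eigenvalues)[::-1][:q]
    return eigenvectors[:, order]  # shape (p, q)

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
X = X - X.mean(axis=0)  # center the data
W = pca_projection_matrix(X, q=2)
X_projected = X @ W  # shape (100, 2), the dimensionality-reduced dataset
```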
Singular Value Decomposition Approach
Computing eigenvalues and eigenvectors directly from the matrix $X^{\top} X$ on a computer sometimes results in numerical instabilities, due to the limited precision of floating-point arithmetic; I will not give a concrete example here.
People have found that using the SVD to compute the eigenvalues and eigenvectors of the matrix $X^{\top} X$ is more numerically stable.
As I have shown in the Singular Value Decomposition for the Gram Matrix section in the Prerequisites, the eigenvectors are just the columns of the matrix $V$ in the SVD of the matrix $X$, and the corresponding eigenvalues are just the diagonal values of the matrix $\Sigma^{\top} \Sigma$.
Concretely, for any data matrix $X \in \mathbb{R}^{n \times p}$, we have SVD
$$
X_{n \times p} = U_{n\times n} \Sigma_{n \times p} V_{p \times p}^{\top}
$$
The eigenvalues are the diagonal values of $\Sigma_{n \times p}^{\top} \Sigma_{n \times p}$ and the eigenvectors are the columns of $V_{p \times p}$.
We will skip the SVD algorithms here.
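Here is a minimal NumPy sketch of the SVD-based computation of the principal axes, assuming a centered data matrix $X$; the variable names are my own.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
X = X - X.mean(axis=0)  # center the data

U, S, Vh = np.linalg.svd(X, full_matrices=False)
eigenvalues = S**2        # eigenvalues of X^T X, in descending order
W = Vh.T[:, :2]           # first q = 2 principal axes (columns of V)
X_projected = X @ W       # dimensionality-reduced dataset

# The result matches the eigendecomposition-based approach up to the signs
# of the principal axes.
```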
Reconstruction
PCA is a process of projecting the data matrix from a high dimension to a low dimension. PCA reconstruction, however, is a process of mapping the data from the low dimension back to the high dimension, which is the inverse process of PCA projection.
Suppose we have a data matrix $X \in \mathbb{R}^{n \times p}$, and the projected data matrix after PCA is $X^{\prime} \in \mathbb{R}^{n \times q}$, where $X^{\prime} = X W$ and the projection matrix is $W \in \mathbb{R}^{p \times q}$. Our optimization goal is to find the reconstruction matrix $Z \in \mathbb{R}^{q \times p}$ such that the reconstruction error between the reconstructed matrix $X^{\prime\prime} = X^{\prime} Z = X W Z$ and the original matrix $X$, $\left\Vert X - X^{\prime\prime} \right\Vert_2^{2}$ (the squared Frobenius norm), is minimized.
It turns out that the reconstruction error is minimized when $Z = W^{\top} = V_{:,:q}^{\top}$. To see this, write the SVD $X = U \Sigma V^{\top}$ and recall that $W = V_{:,:q}$, the first $q$ columns of $V$ (the first $q$ principal axes).
$$
\begin{align}
X - X^{\prime\prime} &= X - X W Z \\
&= X - X V_{:,:q} Z \\
&= U \Sigma V^{\top} - U \Sigma V^{\top} V_{:,:q} Z \\
&= U \Sigma V^{\top} - U \Sigma \begin{bmatrix} I_{q \times q} \\ 0_{(p-q) \times q} \end{bmatrix} Z \\
&= U \Sigma V^{\top} - U \Sigma \begin{bmatrix} Z_{q \times p} \\ 0_{(p-q) \times p} \end{bmatrix} \\
&= U \Sigma \begin{bmatrix} V_{:,:q} & V_{:,q:} \end{bmatrix}^{\top} - U \Sigma \begin{bmatrix} Z_{q \times p} \\ 0_{(p-q) \times p} \end{bmatrix} \\
&= U \Sigma \begin{bmatrix} V_{:,:q}^{\top} \\ V_{:,q:}^{\top} \end{bmatrix} - U \Sigma \begin{bmatrix} Z_{q \times p} \\ 0_{(p-q) \times p} \end{bmatrix} \\
&= U \Sigma \begin{bmatrix} V_{:,:q}^{\top} - Z_{q \times p} \\ V_{:,q:}^{\top} \end{bmatrix} \\
&= U \Sigma_{:,:q} \big( V_{:,:q}^{\top} - Z_{q \times p} \big) + U \Sigma_{:,q:} V_{:,q:}^{\top} \\
\end{align}
$$
Because the columns of $U$ are orthonormal and $\Sigma$ is a rectangular diagonal matrix, $\Sigma_{:,:q}^{\top} \Sigma_{:,q:} = 0$, so the two terms above are orthogonal under the Frobenius inner product. Therefore, for the reconstruction error, we have
$$
\begin{align}
\left\Vert X - X^{\prime\prime} \right\Vert_2^{2} &= \left\Vert U \Sigma_{:,:q} \big( V_{:,:q}^{\top} - Z_{q \times p} \big) \right\Vert_2^{2} + \left\Vert U \Sigma_{:,q:} V_{:,q:}^{\top} \right\Vert_2^{2} \\
\end{align}
$$
The second term does not depend on $Z$, and the first term vanishes exactly when $Z_{q \times p} = V_{:,:q}^{\top}$. Therefore, the reconstruction error is minimized when $Z = W^{\top} = V_{:,:q}^{\top}$, and the reconstructed matrix is $X^{\prime\prime} = X W W^{\top}$.
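The following NumPy sketch illustrates the projection and reconstruction, and checks that the reconstruction error equals the sum of the squared singular values that were discarded; the variable names are my own.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
X = X - X.mean(axis=0)  # center the data

_, S, Vh = np.linalg.svd(X, full_matrices=False)
q = 2
W = Vh.T[:, :q]                      # projection matrix, shape (p, q)
X_projected = X @ W                  # low-dimensional representation
X_reconstructed = X_projected @ W.T  # back to the original p dimensions

# Squared Frobenius norm of the reconstruction error.
reconstruction_error = np.linalg.norm(X - X_reconstructed) ** 2
# It equals the sum of the squared singular values that were discarded.
assert np.isclose(reconstruction_error, np.sum(S[q:] ** 2))
```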
Experiments
Since we have discussed so much about the fundamentals of PCA, let's implement PCA from scratch using NumPy and SciPy, and compare the results to the PCA from Scikit-Learn.
Implementations
```python
import numpy as np
```
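Below is a sketch of what such a from-scratch implementation might look like, compared against Scikit-Learn; the class MyPCA and its method names are my own and are not necessarily the original implementation.

```python
import numpy as np
from sklearn.decomposition import PCA

class MyPCA:
    """A from-scratch PCA sketch using the SVD of the centered data matrix."""

    def __init__(self, n_components):
        self.n_components = n_components
        self.mean_ = None
        self.components_ = None  # shape (n_components, p), rows are principal axes

    def fit(self, X):
        self.mean_ = X.mean(axis=0)
        X_centered = X - self.mean_
        _, _, Vh = np.linalg.svd(X_centered, full_matrices=False)
        self.components_ = Vh[: self.n_components]
        return self

    def transform(self, X):
        # Projection onto the principal axes.
        return (X - self.mean_) @ self.components_.T

    def inverse_transform(self, X_projected):
        # Reconstruction back to the original feature space.
        return X_projected @ self.components_ + self.mean_


rng = np.random.default_rng(0)
X = rng.normal(size=(200, 6))

my_pca = MyPCA(n_components=3).fit(X)
sk_pca = PCA(n_components=3).fit(X)

# The projections agree up to the signs of the principal axes.
assert np.allclose(np.abs(my_pca.transform(X)), np.abs(sk_pca.transform(X)))
# The reconstructions agree, since the sign ambiguity cancels out.
assert np.allclose(
    my_pca.inverse_transform(my_pca.transform(X)),
    sk_pca.inverse_transform(sk_pca.transform(X)),
)
```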
Caveats
Be aware that the dataset matrix $X$ always has to be centered feature-wise (each column must have zero mean); otherwise, the PCA derivation above is not valid.
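For example, a minimal centering step in NumPy might look like this; the data here is arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(loc=5.0, size=(100, 4))  # data with non-zero feature means
X_centered = X - X.mean(axis=0)         # each column now has (approximately) zero mean
assert np.allclose(X_centered.mean(axis=0), 0.0)
```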