
Lei Mao

Machine Learning, Artificial Intelligence, Computer Science.


Introduction

Principal components analysis (PCA) is one of a family of techniques for taking high-dimensional data, and using the dependencies between the variables to represent it in a more tractable, lower-dimensional form, without losing too much information. It has been widely used for data compression and de-noising. However, its entire mathematical process is sometimes ambiguous to the user.


In this article, I would like to discuss the entire process of PCA mathematically, including PCA projection and reconstruction, with most of the derivations and proofs provided. At the end of the article, I implement PCA projection and reconstruction from scratch. After reading this article, there should be no black box in PCA anymore.

Prerequisites

Orthogonal Matrix

In linear algebra, an orthogonal matrix is a real square matrix whose columns and rows are orthogonal unit vectors (orthonormal vectors).


Equivalently, the mathematical expression is

$$A^{\top} A = A A^{\top} = I$$

By the definition of an invertible matrix, this means matrix $A$ is invertible and $A^{-1} = A^{\top}$.


We could also view this from the perspective of determinant.


Because $A$ and $A^{\top}$ are square matrices, using the properties of the determinant,

$$\det(A^{\top} A) = \det(A^{\top}) \det(A) = \det(A)^2 = \det(I) = 1$$

so $\det(A) = \pm 1$. Since $\det(A) \neq 0$, matrix $A$ is invertible. We multiply $A^{-1}$ on the right of both sides of the orthogonal matrix definition $A^{\top} A = I$.

$$A^{\top} A A^{-1} = I A^{-1}$$

$$A^{\top} = A^{-1}$$

We have again derived the conclusion that $A^{-1} = A^{\top}$.
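A quick numerical check of these properties using NumPy (a sketch; the rotation angle is an arbitrary example):

```python
import numpy as np

# A 2D rotation matrix is a classic example of an orthogonal matrix.
theta = 0.3  # arbitrary rotation angle
A = np.array([[np.cos(theta), -np.sin(theta)],
              [np.sin(theta),  np.cos(theta)]])

assert np.allclose(A.T @ A, np.eye(2))         # A^T A = I
assert np.allclose(A @ A.T, np.eye(2))         # A A^T = I
assert np.allclose(np.linalg.inv(A), A.T)      # A^{-1} = A^T
assert np.isclose(abs(np.linalg.det(A)), 1.0)  # det(A) = +-1
```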


Similarly, a complex square matrix $A$ is unitary if its conjugate transpose $A^{\dagger}$ is also its inverse.


Equivalently, the mathematical expression is

$$A^{\dagger} A = A A^{\dagger} = I$$

Symmetric Matrix

A real symmetric matrix has the following useful property:


If $A$ is a real symmetric matrix, all of its eigenvalues are real numbers.


Let $\lambda \in \mathbb{C}$ be an eigenvalue of the symmetric matrix $A$ with eigenvector $v \neq 0$, i.e., $Av = \lambda v$. Taking the complex conjugate of both sides, and using $\overline{A} = A$ since $A$ is a real-valued matrix,

$$\overline{A v} = \overline{A} \overline{v} = A \overline{v} = \overline{\lambda} \overline{v}$$

We thus have $A \overline{v} = \overline{\lambda} \overline{v}$.


We multiply $v^{\dagger}$ ($v^{\dagger} = \overline{v}^{\top}$) on the left of both sides of $Av = \lambda v$, and because $A^{\top} = A$ and the property we have just derived, $A \overline{v} = \overline{\lambda} \overline{v}$,

$$\lambda v^{\dagger} v = v^{\dagger} A v = (A^{\top} \overline{v})^{\top} v = (A \overline{v})^{\top} v = \overline{\lambda} \overline{v}^{\top} v = \overline{\lambda} v^{\dagger} v$$

We have $\lambda v^{\dagger} v = \overline{\lambda} v^{\dagger} v$, and since $v \neq 0$ implies $v^{\dagger} v > 0$, it follows that $\lambda = \overline{\lambda}$, thus $\lambda$ is real.


This concludes the proof.
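A quick numerical illustration (a sketch; `B` is an arbitrary random matrix used only to construct a symmetric `A`):

```python
import numpy as np

rng = np.random.default_rng(0)
B = rng.standard_normal((4, 4))  # arbitrary random matrix
A = B + B.T                      # A is real symmetric by construction

# Use the general eigenvalue routine, which permits complex eigenvalues.
w = np.linalg.eigvals(A)
# All eigenvalues of a real symmetric matrix are nonetheless real.
assert np.allclose(np.asarray(w).imag, 0.0)
```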

Positive Semi-Definite Matrix

The $n \times n$ symmetric matrix $A$ is defined to be positive semi-definite if $x^{\dagger} A x \geq 0$ for all $x \in \mathbb{C}^n$.


The positive semi-definite matrix has the following important property:


The eigenvalues of positive semi-definite matrix are non-negative.


Because $x^{\dagger} A x \geq 0$ for all $x \in \mathbb{C}^n$, suppose $x$ is an eigenvector of $A$ with $Ax = \lambda x$ and $x \neq 0$,

$$x^{\dagger} A x = x^{\dagger} (\lambda x) = \lambda x^{\dagger} x \geq 0$$

Because $x^{\dagger} x$ must be a real number and $x^{\dagger} x > 0$, we have $\lambda \geq 0$.


This concludes the proof.
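For instance (a sketch; `B` is an arbitrary placeholder matrix, and Gram matrices of the form $B^{\top} B$ are positive semi-definite):

```python
import numpy as np

rng = np.random.default_rng(0)
B = rng.standard_normal((5, 3))  # arbitrary placeholder matrix
A = B.T @ B  # matrices of this form are symmetric positive semi-definite

# eigvalsh is the eigenvalue routine for symmetric/Hermitian matrices.
w = np.linalg.eigvalsh(A)
assert np.all(w >= -1e-10)  # non-negative up to floating-point round-off
```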

Covariance Matrix

The covariance matrix has the following important property:


The covariance matrix is positive semi-definite. This means that the eigenvalues of the covariance matrix are non-negative.


The proof that the covariance matrix must be positive semi-definite can be found in my previous post on Multivariate Gaussian and Covariance Matrix.

Singular Values

The singular values, $\sigma_1$, $\sigma_2$, $\cdots$, $\sigma_r$, of an $m \times n$ matrix $A$ are the square roots, $\sigma_i = \sqrt{\lambda_i}$, of non-negative eigenvalues of the associated Gram matrix $K = A^{\dagger}A$. The corresponding eigenvectors of $K$ are known as singular vectors of $A$.


Note that the associated Gram matrix $K = A^{\dagger}A$ is Hermitian ($K^{\dagger} = (A^{\dagger} A)^{\dagger} = A^{\dagger} A = K$; for real $A$ it is real and symmetric), so the eigenvalues of $K$ are all real.


$K = A^{\dagger}A$ is also positive semi-definite.


For any vector $x$,

$$x^{\dagger} K x = x^{\dagger} A^{\dagger} A x = (Ax)^{\dagger} (Ax)$$

Because $Ax$ is also a vector,

$$(Ax)^{\dagger} (Ax) = \lVert Ax \rVert^2 \geq 0$$

Therefore, all the eigenvalues of $K = A^{\dagger}A$ are non-negative, and they each correspond to a singular value of $A$.
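We can confirm this relationship numerically (a sketch; `A` is an arbitrary random matrix):

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((6, 4))  # arbitrary placeholder matrix

# Singular values of A, in descending order.
sigma = np.linalg.svd(A, compute_uv=False)
# Eigenvalues of the Gram matrix K = A^T A, reordered to descending.
lam = np.linalg.eigvalsh(A.T @ A)[::-1]

# sigma_i = sqrt(lambda_i); clip guards against tiny negative round-off.
assert np.allclose(sigma, np.sqrt(np.clip(lam, 0.0, None)))
```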

Singular Value Decomposition

In linear algebra, the singular value decomposition (SVD) is a factorization of a real or complex matrix that generalizes the eigendecomposition of a square normal matrix to any $ m\times n$ matrix via an extension of the polar decomposition.


Specifically, the singular value decomposition of an $m \times n$ real or complex matrix $M$ is a factorization of the form $M = U \Sigma V^{\dagger}$, where $U$ is an $m \times m$ real or complex unitary matrix, $\Sigma$ is an $m \times n$ rectangular diagonal matrix with non-negative real numbers on the diagonal, and $V$ is an $n \times n$ real or complex unitary matrix.


The diagonal entries $\sigma_{i}=\Sigma_{ii}$ of $\Sigma$ are known as the singular values of $M$. The number of non-zero singular values is equal to the rank of $M$.


In particular, for any matrix $A \in \mathbb{C}^{m \times n}$,

$$A = U \Sigma V^{\dagger}$$

We will skip the proof for why every matrix has SVD and the algorithm for SVD.

Singular Value Decomposition for Norm Matrix

For any matrix $A \in \mathbb{C}^{m \times n}$ with SVD $A = U \Sigma V^{\dagger}$, $A^{\dagger} A$ could be expressed as

$$A^{\dagger} A = (U \Sigma V^{\dagger})^{\dagger} (U \Sigma V^{\dagger}) = V \Sigma^{\dagger} U^{\dagger} U \Sigma V^{\dagger} = V \Sigma^{\dagger} \Sigma V^{\dagger}$$

We multiply $V$ on the right of both sides of the equation.

$$A^{\dagger} A V = V \Sigma^{\dagger} \Sigma V^{\dagger} V = V \Sigma^{\dagger} \Sigma$$

Note that $\Sigma^{\dagger} \Sigma$ is a square diagonal matrix. Based on the definition of eigenvalue and eigenvectors, the diagonal values of $\Sigma^{\dagger} \Sigma$, including the zeros, are the eigenvalues of $A^{\dagger} A$. All the columns of $V$ are the corresponding eigenvectors of $A^{\dagger} A$.


Similarly, $A A^{\dagger}$ could be expressed as

$$A A^{\dagger} = (U \Sigma V^{\dagger}) (U \Sigma V^{\dagger})^{\dagger} = U \Sigma V^{\dagger} V \Sigma^{\dagger} U^{\dagger} = U \Sigma \Sigma^{\dagger} U^{\dagger}$$

We multiply $U$ on the right of both sides of the equation.

$$A A^{\dagger} U = U \Sigma \Sigma^{\dagger}$$
Note that $\Sigma \Sigma^{\dagger}$ is a square diagonal matrix. Based on the definition of eigenvalue and eigenvectors, the diagonal values of $\Sigma \Sigma^{\dagger}$, including the zeros, are the eigenvalues of $A A^{\dagger}$. All the columns of $U$ are the corresponding eigenvectors of $A A^{\dagger}$.
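These identities can be verified numerically (a sketch; `A` is an arbitrary random matrix, and `np.diag(s**2)` plays the role of $\Sigma^{\dagger} \Sigma$ here because $m > n$):

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((6, 4))  # arbitrary placeholder matrix

U, s, Vh = np.linalg.svd(A)  # full SVD: U is 6x6, Vh is 4x4
V = Vh.T

# A^T A V = V (Sigma^T Sigma): the columns of V are eigenvectors of A^T A,
# with eigenvalues sigma_i^2 on the diagonal of Sigma^T Sigma.
assert np.allclose(A.T @ A @ V, V @ np.diag(s**2))
```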

Mathematics of Principal Components Analysis

We start with $p$-dimensional vectors, and want to summarize them by projecting down into a $q$-dimensional subspace, where $q \leq p$. Our summary will be the projection of the original vectors on to $q$ directions, the principal axes, which span the subspace.

Minimizing Projection Residuals

Given a dataset $X \in \mathbb{R}^{n \times p}$ whose rows are the centered data vectors $x_i \in \mathbb{R}^p$ for $0 \leq i \leq n-1$ ($\sum_{i=0}^{n-1} x_{i} = 0$), suppose we have a unit vector $w \in \mathbb{R}^p$ ($\lVert w \rVert = 1$) and we project all the data vectors onto this unit vector $w$.


The length of the projection of data vector $x_i$ on $w$, by definition, is

$$\frac{\langle x_i, w \rangle}{\lVert w \rVert} = \langle x_i, w \rangle$$

where $\langle x_i, w \rangle$ is the inner product of $x_i$ and $w$, and the simplification uses $\lVert w \rVert = 1$.


The projection vector for data vector $x_i$ on $w$ is $\langle x_i, w \rangle w$.


The residual, which is the distance from data vector $x_i$ to $w$, is the length of vector $x_i - \langle x_i, w \rangle w$.


Let’s check what the residual square $\lVert x_i - \langle x_i, w \rangle w \rVert^2$ is.

$$\lVert x_i - \langle x_i, w \rangle w \rVert^2 = \langle x_i, x_i \rangle - 2 \langle x_i, w \rangle^2 + \langle x_i, w \rangle^2 \langle w, w \rangle = \langle x_i, x_i \rangle - \langle x_i, w \rangle^2$$

The optimization goal of the projection is to minimize the mean squared error $\text{MSE}(w)$, which is the mean of the residual sum of squares.

$$\text{MSE}(w) = \frac{1}{n} \sum_{i=0}^{n-1} \big( \langle x_i, x_i \rangle - \langle x_i, w \rangle^2 \big)$$

Remembering the relationship between variance and expected value, $\mathbb{V}(X) = \mathbb{E}(X^2) - \mathbb{E}(X)^2$, and that the data are centered so $\frac{1}{n} \sum_{i=0}^{n-1} \langle x_i, w \rangle = \big\langle \frac{1}{n} \sum_{i=0}^{n-1} x_i, w \big\rangle = 0$, we have

$$\mathbb{V}\big(\langle x, w \rangle\big) = \frac{1}{n} \sum_{i=0}^{n-1} \langle x_i, w \rangle^2 - \bigg( \frac{1}{n} \sum_{i=0}^{n-1} \langle x_i, w \rangle \bigg)^2 = \frac{1}{n} \sum_{i=0}^{n-1} \langle x_i, w \rangle^2$$

So $\frac{1}{n} \sum_{i=0}^{n-1} \langle x_i, w \rangle ^2$ is actually the variance of the inner product between the data vector and the projection unit vector, $\mathbb{V}\big(\langle x, w \rangle\big)$.


We further have another expression for $\text{MSE}(w)$.

$$\text{MSE}(w) = \frac{1}{n} \sum_{i=0}^{n-1} \langle x_i, x_i \rangle - \mathbb{V}\big(\langle x, w \rangle\big) = \mathbb{E}\big(\langle x, x \rangle\big) - \mathbb{V}\big(\langle x, w \rangle\big)$$

Because $\mathbb{E}\big(\langle x, x \rangle\big)$ does not depend on $w$, minimizing the mean of the residual sum of squares $\text{MSE}(w)$ turns out to be equivalent to maximizing the variance of the inner product between the data vector and the projection unit vector.


The variance of the inner product between the data vector and the projection unit vector could also be expressed using the dataset matrix $X$.

$$\mathbb{V}\big(\langle x, w \rangle\big) = \frac{1}{n} \sum_{i=0}^{n-1} \langle x_i, w \rangle^2 = \frac{1}{n} \lVert X w \rVert^2 = \frac{w^{\top} X^{\top} X w}{n}$$

where $\frac{X^{\top} X}{n}$ is the covariance matrix of $X$.
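This identity is easy to verify numerically (a sketch with placeholder data; `w` is an arbitrary unit vector):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((100, 3))  # placeholder dataset
X = X - X.mean(axis=0)             # center the data

w = np.array([1.0, 2.0, -1.0])
w = w / np.linalg.norm(w)          # arbitrary unit projection direction
n = X.shape[0]

proj = X @ w  # the inner products <x_i, w>
# The variance of the projections equals w^T X^T X w / n for centered data.
assert np.isclose(proj.var(), w @ (X.T @ X) @ w / n)
```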


If we project the data vectors onto $q$ orthogonal unit vectors $w_0$, $w_1$, $\cdots$, $w_{q-1}$, assembled as the columns of a matrix $W \in \mathbb{R}^{p \times q}$, then our optimization goal becomes maximizing the sum of the projection variances over all these directions.

$$\mathop{\arg\max}_{W} \sum_{i=0}^{q-1} \frac{w_i^{\top} X^{\top} X w_i}{n}$$

subject to

$$W^{\top} W = I_{q}$$

Maximizing Variance

Suppose we are projecting the data onto $p$ orthonormal unit vectors $w_i \in \mathbb{R}^{p}$, assembled as the columns of a matrix $W \in \mathbb{R}^{p \times p}$, so that $W$ is an orthogonal matrix.

$$\mathop{\arg\max}_{W} \sum_{i=0}^{p-1} w_i^{\top} X^{\top} X w_i$$

subject to

$$W^{\top} W = I_{p}$$

(The constant factor $\frac{1}{n}$ does not change the maximizer and is dropped.)

Let’s try to maximize one $\mathscr{L}(w) = w^{\top} X^{\top} X w$ first, where the constraint is $w^{\top} w = 1$. We use a Lagrange multiplier $\lambda$ to help us solve the optimization problem.

$$\mathscr{L}(w, \lambda) = w^{\top} X^{\top} X w - \lambda (w^{\top} w - 1)$$

We set $\frac{\partial \mathscr{L}}{\partial \lambda} = 0$ and $\frac{\partial \mathscr{L}}{\partial w} = 0$, and get

$$\frac{\partial \mathscr{L}}{\partial \lambda} = -(w^{\top} w - 1) = 0$$

$$\frac{\partial \mathscr{L}}{\partial w} = 2 X^{\top} X w - 2 \lambda w = 0$$

$$X^{\top} X w = \lambda w$$

So by definition, $\lambda$ and $w$ are an eigenvalue and eigenvector of matrix $X^{\top} X$.


Because $X^{\top} X$ is a real symmetric matrix and positive semi-definite, all the eigenvalues of $X^{\top} X$ are non-negative.


$X^{\top} X$ has exactly $p$ eigenvalues $\lambda_0, \lambda_1, \cdots, \lambda_{p-1}$ (counted with multiplicity) and corresponding orthonormal eigenvectors $w_0, w_1, \cdots, w_{p-1}$. The matrix $W$, which consists of the eigenvectors $w_0, w_1, \cdots, w_{p-1}$, happens to be the global maximum solution for $\mathscr{L}(W)$. The global maximum value of $\mathscr{L}(W)$ is the sum of the non-negative eigenvalues.

$$\mathscr{L}(W) = \sum_{i=0}^{p-1} w_i^{\top} X^{\top} X w_i = \sum_{i=0}^{p-1} \lambda_i$$

Because these eigenvalues are non-negative, their contributions to the global maximum can be sorted by value, $\lambda_0 \geq \lambda_1 \geq \cdots \geq \lambda_{p-1} \geq 0$.


We call the eigenvector corresponding to the largest eigenvalue the first principal axis, the eigenvector corresponding to the second largest eigenvalue the second principal axis, and so on.


Suppose we are projecting the data onto $q$ orthonormal unit vectors $w_i \in \mathbb{R}^{p}$, with $q \leq p$, assembled as the columns of a matrix $W \in \mathbb{R}^{p \times q}$. The optimization problem becomes maximizing the following expression

$$\mathop{\arg\max}_{W} \sum_{i=0}^{q-1} w_i^{\top} X^{\top} X w_i$$

subject to

$$W^{\top} W = I_{q}$$

The solution is then obvious: we just have to pick the $q$ largest eigenvalues of matrix $X^{\top} X$ and their corresponding eigenvectors.


If we have computed the eigenvalues and eigenvectors of matrix $X^{\top} X$, we can project onto any number of principal components without further extensive computation.


In short, PCA is nothing more than computing the eigenvalues and eigenvectors of matrix $X^{\top} X$. To project the data matrix $X \in \mathbb{R}^{n \times p}$ onto $q$ orthonormal unit vectors, simply take the eigenvectors corresponding to the $q$ largest eigenvalues, assemble them as the columns of a matrix $W \in \mathbb{R}^{p \times q}$, and $XW$ is the projected, dimensionality-reduced dataset. Given any new data matrix $A \in \mathbb{R}^{m \times p}$, the projected, dimensionality-reduced dataset would similarly be $AW$.
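The whole recipe fits in a few lines (a minimal sketch with placeholder random data, not the full implementation given later in the Experiments section):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((50, 4))  # placeholder dataset
X = X - X.mean(axis=0)            # PCA requires centered data

q = 2
lam, V = np.linalg.eigh(X.T @ X)     # eigh: eigenvalues in ascending order
W = V[:, np.argsort(lam)[::-1][:q]]  # eigenvectors of the q largest eigenvalues
X_projected = X @ W                  # n x q dimensionality-reduced dataset

assert X_projected.shape == (50, q)
```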

Singular Value Decomposition Approach

Computing eigenvalues and eigenvectors directly from matrix $X^{\top} X$ on a computer sometimes results in numerical instabilities, due to the limited precision a computer has; I am not going to give an example here.


People found that using SVD to compute the eigenvalues and eigenvectors of matrix $X^{\top} X$ is more numerically stable.


As I have proved in the Singular Value Decomposition for Norm Matrix section in the Prerequisites, the eigenvectors are just the columns of matrix $V$ in the SVD of matrix $X$, and the corresponding eigenvalues are just the diagonal values of matrix $\Sigma^{\top} \Sigma$.


Concretely, for any data matrix $X \in \mathbb{R}^{n \times p}$, we have the SVD

$$X = U_{n \times n} \Sigma_{n \times p} V_{p \times p}^{\top}$$

The eigenvalues of $X^{\top} X$ are the diagonal values of $\Sigma_{n \times p}^{\top} \Sigma_{n \times p}$ and the eigenvectors are the columns of $V_{p \times p}$.
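A quick check of this equivalence (a sketch with placeholder random data):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((50, 4))  # placeholder dataset
X = X - X.mean(axis=0)

U, s, Vh = np.linalg.svd(X, full_matrices=False)
lam = np.linalg.eigvalsh(X.T @ X)[::-1]  # descending eigenvalues

# The squared singular values of X are the eigenvalues of X^T X.
assert np.allclose(s**2, lam)
```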


We will skip the SVD algorithms here.

Reconstruction

PCA is a process of projecting a data matrix from a high dimension to a low dimension. PCA reconstruction, however, is a process of projecting the data from the low dimension back to the high dimension, which is an inverse process of PCA.


Suppose we have a data matrix $X \in \mathbb{R}^{n \times p}$, and the projected data matrix after PCA is $X^{\prime} \in \mathbb{R}^{n \times q}$, where $X^{\prime} = X W$ and the projection matrix $W \in \mathbb{R}^{p \times q}$, our optimization goal is to find the projection matrix $Z \in \mathbb{R}^{q \times p}$ such that the reconstruction error between the reconstructed matrix $X^{\prime\prime} = X^{\prime} Z = X W Z$ and the original matrix $X$, $\lVert X - X^{\prime\prime} \rVert_2^{2}$, is minimized.


It turns out that when $Z = W^{\top} = V_{:,:q}^{\top}$, where $V_{:,:q}$ denotes the first $q$ columns of $V$ from the SVD of $X$, the reconstruction error is minimized.

With this choice, $X^{\prime\prime} = X W W^{\top}$ is the orthogonal projection of $X$ onto the subspace spanned by the first $q$ principal axes. Therefore, for the reconstruction error, using $X = U \Sigma V^{\top}$, we have

$$\lVert X - X^{\prime\prime} \rVert_2^{2} = \lVert X - X W W^{\top} \rVert_2^{2} = \sum_{i=q}^{p-1} \lambda_i$$

which is exactly the sum of the discarded eigenvalues. By the Eckart–Young theorem, no rank-$q$ reconstruction can do better, so $Z = W^{\top} = V_{:,:q}^{\top}$ minimizes the reconstruction error.
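The projection–reconstruction round trip can be sketched as follows (placeholder random data; with $q = p$ the reconstruction is exact, and with $q < p$ the squared error equals the discarded squared singular values):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((50, 4))  # placeholder dataset
mu = X.mean(axis=0)
Xc = X - mu  # center before PCA

U, s, Vh = np.linalg.svd(Xc, full_matrices=False)
q = 2
W = Vh[:q].T               # p x q projection matrix (top q principal axes)
X_proj = Xc @ W            # project down: n x q
X_rec = X_proj @ W.T + mu  # reconstruct back: n x p

# Keeping all p components reconstructs X exactly.
W_full = Vh.T
assert np.allclose(Xc @ W_full @ W_full.T + mu, X)
# Keeping q < p components loses exactly the discarded components' energy.
assert np.isclose(np.linalg.norm(X - X_rec)**2, np.sum(s[q:]**2))
```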

Experiments

Since we have discussed so much about the fundamentals of PCA, let’s implement PCA from scratch using NumPy and SciPy, and compare the results to the PCA from Scikit-Learn.

Implementations

import numpy as np
import sklearn.datasets, sklearn.decomposition
from scipy import linalg
import time

class ManualPCA(object):

    def __init__(self, n_components=None, backend="svd", verbose=False):

        self.n_components = n_components
        self.backend = backend
        if backend not in ("svd", "eig"):
            raise ValueError("Unsupported PCA backend {}!".format(backend))
        self.verbose = verbose

    def svd_fit(self, X):

        start_time = time.time()
        self.mu = np.mean(X, axis=0)
        X = X - self.mu
        # The columns of self.V are the eigenvectors of X^T X
        self.U, self.s, self.Vh = linalg.svd(X)
        self.V = self.Vh.transpose()
        end_time = time.time()
        if self.verbose:
            print("SVD fit time: {:.6f} second".format(end_time - start_time))

    def eig_fit(self, X):

        start_time = time.time()
        self.mu = np.mean(X, axis=0)
        X = X - self.mu
        # X^T X; dividing by n to get the covariance would only rescale the
        # eigenvalues and leave the eigenvectors (principal axes) unchanged.
        covariance = np.dot(X.transpose(), X)
        # May use eigh instead of eig for a symmetric matrix for faster speed
        self.w, self.V = linalg.eig(covariance)
        idx_order = np.argsort(self.w)[::-1]
        self.w = self.w[idx_order]
        self.V = self.V[:,idx_order]
        self.Vh = self.V.transpose()
        end_time = time.time()
        if self.verbose:
            print("Eigen fit time: {:.6f} second".format(end_time - start_time))

    def fit(self, X):

        if self.backend == "svd":
            self.svd_fit(X)
        elif self.backend == "eig":
            self.eig_fit(X)
        else:
            raise Exception("Unsupported PCA backend {}!".format(self.backend))

    def transform(self, X):

        X = X - self.mu

        if self.n_components is None:
            return np.dot(X, self.V)
        else:
            return np.dot(X, self.V[:,:self.n_components])

    def inverse_transform(self, X):

        if self.n_components is None:
            return np.dot(X, self.Vh) + self.mu
        else:
            return np.dot(X, self.Vh[:self.n_components]) + self.mu

def main():

    X = sklearn.datasets.load_iris().data

    n_components = None
    # n_components = 2

    scikit_pca = sklearn.decomposition.PCA(n_components=n_components)
    scikit_pca.fit(X)
    X_projected = scikit_pca.transform(X)
    X_projected_inversed = scikit_pca.inverse_transform(X_projected)

    assert np.allclose(X, X_projected_inversed), "Scikit PCA is problematic!"

    svd_pca = ManualPCA(n_components=n_components, backend="svd", verbose=True)
    svd_pca.fit(X)
    X_projected = svd_pca.transform(X)
    X_projected_inversed = svd_pca.inverse_transform(X_projected)

    assert np.allclose(X, X_projected_inversed), "Manual SVD PCA is problematic!"

    eig_pca = ManualPCA(n_components=n_components, backend="eig", verbose=True)
    eig_pca.fit(X)
    X_projected = eig_pca.transform(X)
    X_projected_inversed = eig_pca.inverse_transform(X_projected)

    assert np.allclose(X, X_projected_inversed), "Manual Eigen PCA is problematic!"

if __name__ == "__main__":
    
    main()

Caveats

Be aware that the dataset matrix $X$ always has to be centered for every feature; otherwise PCA will not be valid.
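To see why centering matters, here is a small sketch (the feature means are arbitrary placeholders): without centering, the leading eigenvector of $X^{\top} X$ chases the data mean rather than the direction of maximum variance.

```python
import numpy as np

rng = np.random.default_rng(0)
# Data whose features have large non-zero means.
X = rng.standard_normal((100, 3)) + np.array([100.0, -50.0, 10.0])

# "PCA" on the uncentered matrix.
lam_u, V_u = np.linalg.eigh(X.T @ X)
w_u = V_u[:, -1]  # eigh returns ascending eigenvalues; last column = top axis

mean_dir = X.mean(axis=0) / np.linalg.norm(X.mean(axis=0))
# The uncentered top "principal axis" is almost exactly the mean direction,
# which says nothing about the variance structure of the data.
assert abs(w_u @ mean_dir) > 0.99
```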
