SVD for PCA#

Supplementary material to the answer by amoeba to Relationship between SVD and PCA. How to use SVD to perform PCA? - Cross Validated.

import numpy as np
from numpy import linalg as la
np.random.seed(42)

Let the data matrix \(\mathbf X\) be of \(n \times p\) size, where n is the number of samples and p is the number of variables (see Tidy data, Xarray).

n, p = 5, 3
X = np.random.randint(0, 8, [n, p]).astype(float)
X
array([[6., 3., 4.],
       [6., 2., 7.],
       [4., 4., 6.],
       [1., 2., 6.],
       [2., 2., 7.]])

Let us assume that it is centered, i.e. column means have been subtracted and are now equal to zero.

X -= np.mean(X, axis=0)
X
array([[ 2.2,  0.4, -2. ],
       [ 2.2, -0.6,  1. ],
       [ 0.2,  1.4,  0. ],
       [-2.8, -0.6,  0. ],
       [-1.8, -0.6,  1. ]])

Then the \(p \times p\) covariance matrix \(\mathbf C\) is given by \(\mathbf C = \mathbf X^\top \mathbf X/(n-1)\):

C = np.cov(X, rowvar=False)
C
array([[ 5.2 ,  0.65, -1.  ],
       [ 0.65,  0.8 , -0.5 ],
       [-1.  , -0.5 ,  1.5 ]])

It is a symmetric matrix and so it can be diagonalized (see Eigendecomposition - Decomposition for special matrices):

\[ \mathbf C = \mathbf V \mathbf L \mathbf V^\top \]

The eigenvalues in decreasing order:

l, principal_axes = la.eig(C)
idx = l.argsort()[::-1]
l, principal_axes = l[idx], principal_axes[:, idx]
l
array([5.57232637, 1.39174803, 0.5359256 ])

Where \(\mathbf V\) is a matrix of eigenvectors (each column is an eigenvector):

principal_axes
array([[ 0.95454539,  0.29627488, -0.03262345],
       [ 0.15658321, -0.40530879,  0.90067002],
       [-0.25362333,  0.8648387 ,  0.4332773 ]])

The eigenvectors are called principal axes or principal directions of the data.

Projections of the data on the principal axes are called principal components, also known as PC scores; these can be seen as new, transformed, variables. The \(j\)-th principal component is given by \(j\)-th column of \(\mathbf {XV}\). The coordinates of the \(i\)-th data point in the new PC space are given by the \(i\)-th row of \(\mathbf{XV}\).

principal_components = X.dot(principal_axes)
principal_components
array([[ 2.66987981, -1.23999617, -0.5780582 ],
       [ 1.7524266 ,  1.75982872, -0.1788963 ],
       [ 0.41012557, -0.50817732,  1.25441334],
       [-2.76667702, -0.58638441, -0.44905635],
       [-2.06575496,  0.57472918, -0.04840249]])

If we now perform singular value decomposition of \(\mathbf X\), we obtain a decomposition:

\[ \mathbf X = \mathbf U \mathbf S \mathbf V^\top\]

where \(\mathbf U\) is a unitary matrix and \(\mathbf S\) is the diagonal matrix of singular values \(s_i\).

See Singular value decomposition - Thin SVD:

U, s, Vt = la.svd(X, full_matrices=False)
V = Vt.T
S = np.diag(s)

From here one can easily see that:

\[ \mathbf C = \mathbf V \mathbf S \mathbf U^\top \mathbf U \mathbf S \mathbf V^\top /(n-1) = \mathbf V \frac{\mathbf S^2}{n-1}\mathbf V^\top \]

meaning that right singular vectors \(\mathbf V\) are principal directions and that singular values are related to the eigenvalues of covariance matrix via \(\lambda_i = s_i^2/(n-1)\). Principal components are given by \(\mathbf X \mathbf V = \mathbf U \mathbf S \mathbf V^\top \mathbf V = \mathbf U \mathbf S\).

def flip_signs(A, B):
    """
    utility function for resolving the sign ambiguity in SVD
    http://stats.stackexchange.com/q/34396/115202
    """
    signs = np.sign(A) * np.sign(B)
    return A, B * signs

To summarize:

  • If \(\mathbf X = \mathbf U \mathbf S \mathbf V^\top\), then columns of \(\mathbf V\) are principal directions/axes.

assert np.allclose(*flip_signs(V, principal_axes))
  • Columns of \(\mathbf {US}\) are principal components (“scores”).

assert np.allclose(*flip_signs(U.dot(S), principal_components))
  • Singular values are related to the eigenvalues of covariance matrix via \(\lambda_i = s_i^2/(n-1)\). Eigenvalues \(\lambda_i\) show variances of the respective PCs.

assert np.allclose((s ** 2) / (n - 1), l)
  • Standardized scores are given by columns of \(\sqrt{n-1}\mathbf U\) and loadings are given by columns of \(\mathbf V \mathbf S/\sqrt{n-1}\). See e.g. here and here for why “loadings” should not be confused with principal directions.

  • The above is correct only if \(\mathbf X\) is centered. Only then is covariance matrix equal to \(\mathbf X^\top \mathbf X/(n-1)\).

  • The above is correct only for \(\mathbf X\) having samples in rows and variables in columns. If variables are in rows and samples in columns, then \(\mathbf U\) and \(\mathbf V\) exchange interpretations.

  • If one wants to perform PCA on a correlation matrix (instead of a covariance matrix), then columns of \(\mathbf X\) should not only be centered, but standardized as well, i.e. divided by their standard deviations.

  • To reduce the dimensionality of the data from \(p\) to \(k < p\), select \(k\) first columns of \(\mathbf U\), and \(k\times k\) upper-left part of \(\mathbf S\). Their product \(\mathbf U_k \mathbf S_k\) is the required \(n \times k\) matrix containing first \(k\) PCs.

k = 2
PC_k = principal_components[:, :k]
US_k = U[:, :k].dot(S[:k, :k])
assert np.allclose(*flip_signs(PC_k, US_k))
  • Further multiplying the first \(k\) PCs by the corresponding principal axes \(\mathbf V_k^\top\) yields \(\mathbf X_k = \mathbf U_k^\vphantom \top \mathbf S_k^\vphantom \top \mathbf V_k^\top\) matrix that has the original \(n \times p\) size but is of lower rank (of rank \(k\)). This matrix \(\mathbf X_k\) provides a reconstruction of the original data from the first \(k\) PCs. It has the lowest possible reconstruction error, see the answer here.

Xk = US_k.dot(Vt[:k, :])
Xk
array([[ 2.18114175,  0.92063969, -1.7495405 ],
       [ 2.19416378, -0.43887346,  1.07751171],
       [ 0.24092329,  0.27018751, -0.54350883],
       [-2.81464977, -0.19554841,  0.19456592],
       [-1.80157906, -0.55640533,  1.0209717 ]])
[la.matrix_rank(X), la.matrix_rank(Xk)]
[3, 2]
  • Strictly speaking, \(\mathbf U\) is of \(n\times n\) size and \(\mathbf V\) is of \(p \times p\) size. However, if \(n>p\) then the last \(n-p\) columns of \(\mathbf U\) are arbitrary (and corresponding rows of \(\mathbf S\) are constant zero); one should therefore use an economy size (or thin) SVD that returns \(\mathbf U\) of \(n\times p\) size, dropping the useless columns. For large \(n\gg p\) the matrix \(\mathbf U\) would otherwise be unnecessarily huge. The same applies for an opposite situation of \(n\ll p\).

assert U.shape == (n, p)
assert S.shape == (p, p)
assert V.shape == (p, p)

See also#