{ "cells": [ { "cell_type": "markdown", "id": "73a5fa40", "metadata": {}, "source": [ "# SVD for PCA\n", "\n", "Supplementary material to the answer by amoeba to [Relationship between SVD and PCA. How to use\n", "SVD to perform PCA? - Cross Validated](https://stats.stackexchange.com/a/134283/189415)." ] }, { "cell_type": "code", "execution_count": 1, "id": "56324790", "metadata": {}, "outputs": [], "source": [ "import numpy as np\n", "from numpy import linalg as la\n", "np.random.seed(42)" ] }, { "cell_type": "markdown", "id": "d3470ff5", "metadata": {}, "source": [ "[xarr]: https://tutorial.xarray.dev/overview/xarray-in-45-min.html\n", "[td]: https://tidyr.tidyverse.org/articles/tidy-data.html\n", "\n", "Let the data matrix $\\mathbf X$ be of $n \\times p$ size, where *n* is the number of samples and *p*\n", "is the number of variables (see [Tidy data][td], [Xarray][xarr])." ] }, { "cell_type": "code", "execution_count": 2, "id": "ac947b96", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "array([[6., 3., 4.],\n", " [6., 2., 7.],\n", " [4., 4., 6.],\n", " [1., 2., 6.],\n", " [2., 2., 7.]])" ] }, "execution_count": 2, "metadata": {}, "output_type": "execute_result" } ], "source": [ "n, p = 5, 3\n", "X = np.random.randint(0, 8, [n, p]).astype(float)\n", "X" ] }, { "cell_type": "markdown", "id": "839ae38c", "metadata": {}, "source": [ "Let us assume that it is *centered*, i.e. column means have been subtracted and are now equal to\n", "zero." ] }, { "cell_type": "code", "execution_count": 3, "id": "22c887f4", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "array([[ 2.2, 0.4, -2. ],\n", " [ 2.2, -0.6, 1. ],\n", " [ 0.2, 1.4, 0. ],\n", " [-2.8, -0.6, 0. ],\n", " [-1.8, -0.6, 1. ]])" ] }, "execution_count": 3, "metadata": {}, "output_type": "execute_result" } ], "source": [ "X -= np.mean(X, axis=0)\n", "X" ] }, { "cell_type": "markdown", "id": "d2b1cbbd", "metadata": {}, "source": [ "[cm]: https://en.wikipedia.org/wiki/Covariance_matrix\n", "\n", "Then the $p \\times p$ [covariance matrix][cm] $\\mathbf C$ is given by $\\mathbf C = \\mathbf X^\\top \\mathbf\n", "X/(n-1)$:" ] }, { "cell_type": "code", "execution_count": 4, "id": "890ed54b", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "array([[ 5.2 , 0.65, -1. ],\n", " [ 0.65, 0.8 , -0.5 ],\n", " [-1. , -0.5 , 1.5 ]])" ] }, "execution_count": 4, "metadata": {}, "output_type": "execute_result" } ], "source": [ "C = np.cov(X, rowvar=False)\n", "C" ] }, { "cell_type": "markdown", "id": "4e7e1827", "metadata": {}, "source": [ "[edsp]: https://en.wikipedia.org/wiki/Eigendecomposition_of_a_matrix#Decomposition_for_special_matrices\n", "\n", "It is a symmetric matrix and so it can be diagonalized (see [Eigendecomposition - Decomposition for\n", "special matrices][edsp]):\n", "\n", "$$\n", "\\mathbf C = \\mathbf V \\mathbf L \\mathbf V^\\top\n", "$$\n", "\n", "The eigenvalues in decreasing order:" ] }, { "cell_type": "code", "execution_count": 5, "id": "2bcdc294", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "array([5.57232637, 1.39174803, 0.5359256 ])" ] }, "execution_count": 5, "metadata": {}, "output_type": "execute_result" } ], "source": [ "l, principal_axes = la.eig(C)\n", "idx = l.argsort()[::-1]\n", "l, principal_axes = l[idx], principal_axes[:, idx]\n", "l" ] }, { "cell_type": "markdown", "id": "fd0f0272", "metadata": {}, "source": [ "Where $\\mathbf V$ is a matrix of eigenvectors (each column is an eigenvector):" ] }, { "cell_type": "code", "execution_count": 6, "id": "0fa6bc91", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "array([[ 0.95454539, 0.29627488, -0.03262345],\n", " [ 0.15658321, -0.40530879, 0.90067002],\n", " [-0.25362333, 0.8648387 , 0.4332773 ]])" ] }, "execution_count": 6, "metadata": {}, "output_type": "execute_result" } ], "source": [ "principal_axes" ] }, { "cell_type": "markdown", "id": "14820bf7", "metadata": {}, "source": [ "The eigenvectors are called *principal axes* or *principal directions* of the data.\n", "\n", "Projections of the data on the principal axes are called *principal components*, also known as *PC\n", "scores*; these can be seen as new, transformed, variables. The $j$-th principal component is given\n", "by $j$-th column of $\\mathbf {XV}$. The coordinates of the $i$-th data point in the new PC space are\n", "given by the $i$-th row of $\\mathbf{XV}$." ] }, { "cell_type": "code", "execution_count": 7, "id": "5dddaf32", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "array([[ 2.66987981, -1.23999617, -0.5780582 ],\n", " [ 1.7524266 , 1.75982872, -0.1788963 ],\n", " [ 0.41012557, -0.50817732, 1.25441334],\n", " [-2.76667702, -0.58638441, -0.44905635],\n", " [-2.06575496, 0.57472918, -0.04840249]])" ] }, "execution_count": 7, "metadata": {}, "output_type": "execute_result" } ], "source": [ "principal_components = X.dot(principal_axes)\n", "principal_components" ] }, { "cell_type": "markdown", "id": "9c3534f1", "metadata": {}, "source": [ "If we now perform singular value decomposition of $\\mathbf X$, we obtain a decomposition:\n", "\n", "$$\n", "\\mathbf X = \\mathbf U \\mathbf S \\mathbf V^\\top$$\n", "\n", "where $\\mathbf U$ is a unitary matrix and $\\mathbf S$ is the diagonal matrix of singular values\n", "$s_i$.\n", "\n", "See [Singular value decomposition - Thin SVD](\n", "https://en.wikipedia.org/wiki/Singular_value_decomposition#Thin_SVD):" ] }, { "cell_type": "code", "execution_count": 8, "id": "80fd109f", "metadata": {}, "outputs": [], "source": [ "U, s, Vt = la.svd(X, full_matrices=False)\n", "V = Vt.T\n", "S = np.diag(s)" ] }, { "cell_type": "markdown", "id": "c7a5e882", "metadata": {}, "source": [ "From here one can easily see that:\n", "\n", "$$\n", "\\mathbf C = \\mathbf V \\mathbf S \\mathbf U^\\top \\mathbf U \\mathbf S \\mathbf V^\\top /(n-1) = \\mathbf V\n", "\\frac{\\mathbf S^2}{n-1}\\mathbf V^\\top\n", "$$\n", "\n", "meaning that right singular vectors $\\mathbf V$ are principal directions and that singular values\n", "are related to the eigenvalues of covariance matrix via $\\lambda_i = s_i^2/(n-1)$. Principal\n", "components are given by $\\mathbf X \\mathbf V = \\mathbf U \\mathbf S \\mathbf V^\\top \\mathbf V =\n", "\\mathbf U \\mathbf S$." ] }, { "cell_type": "code", "execution_count": 9, "id": "14c27a3f", "metadata": {}, "outputs": [], "source": [ "def flip_signs(A, B):\n", " \"\"\"\n", " utility function for resolving the sign ambiguity in SVD\n", " http://stats.stackexchange.com/q/34396/115202\n", " \"\"\"\n", " signs = np.sign(A) * np.sign(B)\n", " return A, B * signs" ] }, { "cell_type": "markdown", "id": "3b720925", "metadata": {}, "source": [ "To summarize:\n", "\n", "- If $\\mathbf X = \\mathbf U \\mathbf S \\mathbf V^\\top$, then columns of $\\mathbf V$ are principal\n", " directions/axes." ] }, { "cell_type": "code", "execution_count": 10, "id": "0ffa1135", "metadata": {}, "outputs": [], "source": [ "assert np.allclose(*flip_signs(V, principal_axes))" ] }, { "cell_type": "markdown", "id": "174f6c69", "metadata": {}, "source": [ "- Columns of $\\mathbf {US}$ are principal components (\"scores\")." ] }, { "cell_type": "code", "execution_count": 11, "id": "38320bb7", "metadata": {}, "outputs": [], "source": [ "assert np.allclose(*flip_signs(U.dot(S), principal_components))" ] }, { "cell_type": "markdown", "id": "dc993274", "metadata": {}, "source": [ "- Singular values are related to the eigenvalues of covariance matrix via $\\lambda_i =\n", " s_i^2/(n-1)$. Eigenvalues $\\lambda_i$ show variances of the respective PCs." ] }, { "cell_type": "code", "execution_count": 12, "id": "6269d3ed", "metadata": {}, "outputs": [], "source": [ "assert np.allclose((s ** 2) / (n - 1), l)" ] }, { "cell_type": "markdown", "id": "f5cfb1b7", "metadata": {}, "source": [ "- Standardized scores are given by columns of $\\sqrt{n-1}\\mathbf U$ and loadings are given by\n", " columns of $\\mathbf V \\mathbf S/\\sqrt{n-1}$. See e.g.\n", "[here](https://stats.stackexchange.com/questions/125684) and\n", "[here](https://stats.stackexchange.com/questions/143905) for why \"loadings\" should not be confused\n", "with principal directions.\n", "- **The above is correct only if $\\mathbf X$ is centered.** Only then is covariance matrix equal to\n", " $\\mathbf X^\\top \\mathbf X/(n-1)$.\n", "- The above is correct only for $\\mathbf X$ having samples in rows and variables in columns. If\n", " variables are in rows and samples in columns, then $\\mathbf U$ and $\\mathbf V$ exchange\n", "interpretations.\n", "- If one wants to perform PCA on a correlation matrix (instead of a covariance matrix), then\n", " columns of $\\mathbf X$ should not only be centered, but standardized as well, i.e. divided by\n", "their standard deviations.\n", "\n", "- To reduce the dimensionality of the data from $p$ to $k < p$, select $k$ first columns of\n", " $\\mathbf U$, and $k\\times k$ upper-left part of $\\mathbf S$. Their product $\\mathbf U_k \\mathbf\n", "S_k$ is the required $n \\times k$ matrix containing first $k$ PCs." ] }, { "cell_type": "code", "execution_count": 13, "id": "ca0c99c7", "metadata": {}, "outputs": [], "source": [ "k = 2\n", "PC_k = principal_components[:, :k]\n", "US_k = U[:, :k].dot(S[:k, :k])\n", "assert np.allclose(*flip_signs(PC_k, US_k))" ] }, { "cell_type": "markdown", "id": "a31103a0", "metadata": {}, "source": [ "- Further multiplying the first $k$ PCs by the corresponding principal axes $\\mathbf V_k^\\top$\n", " yields $\\mathbf X_k = \\mathbf U_k^\\vphantom \\top \\mathbf S_k^\\vphantom \\top \\mathbf V_k^\\top$\n", "matrix that has the original $n \\times p$ size but is *of lower rank* (of rank $k$). This matrix\n", "$\\mathbf X_k$ provides a *reconstruction* of the original data from the first $k$ PCs. It has the\n", "lowest possible reconstruction error, [see the answer\n", "here](https://stats.stackexchange.com/questions/130721)." ] }, { "cell_type": "code", "execution_count": 14, "id": "88740dad", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "array([[ 2.18114175, 0.92063969, -1.7495405 ],\n", " [ 2.19416378, -0.43887346, 1.07751171],\n", " [ 0.24092329, 0.27018751, -0.54350883],\n", " [-2.81464977, -0.19554841, 0.19456592],\n", " [-1.80157906, -0.55640533, 1.0209717 ]])" ] }, "execution_count": 14, "metadata": {}, "output_type": "execute_result" } ], "source": [ "Xk = US_k.dot(Vt[:k, :])\n", "Xk" ] }, { "cell_type": "code", "execution_count": 15, "id": "d5e0ea95", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "[3, 2]" ] }, "execution_count": 15, "metadata": {}, "output_type": "execute_result" } ], "source": [ "[la.matrix_rank(X), la.matrix_rank(Xk)]" ] }, { "cell_type": "markdown", "id": "eb492325", "metadata": {}, "source": [ "- Strictly speaking, $\\mathbf U$ is of $n\\times n$ size and $\\mathbf V$ is of $p \\times p$ size.\n", " However, if $n>p$ then the last $n-p$ columns of $\\mathbf U$ are arbitrary (and corresponding\n", "rows of $\\mathbf S$ are constant zero); one should therefore use an *economy size* (or *thin*) SVD\n", "that returns $\\mathbf U$ of $n\\times p$ size, dropping the useless columns. For large $n\\gg p$ the\n", "matrix $\\mathbf U$ would otherwise be unnecessarily huge. The same applies for an opposite situation\n", "of $n\\ll p$." ] }, { "cell_type": "code", "execution_count": 16, "id": "af262cd9", "metadata": {}, "outputs": [], "source": [ "assert U.shape == (n, p)\n", "assert S.shape == (p, p)\n", "assert V.shape == (p, p)" ] }, { "cell_type": "markdown", "id": "40734ce5", "metadata": {}, "source": [ "# See also\n", "\n", "- [Principal component analysis - Singular value decomposition - Wikipedia](\n", "https://en.wikipedia.org/wiki/Principal_component_analysis#Singular_value_decomposition)\n", "- [Singular value decomposition - Relation to eigenvalue decomposition](https://en.wikipedia.org/wiki/Singular_value_decomposition#Relation_to_eigenvalue_decomposition)\n", "- [Singular value](https://en.wikipedia.org/wiki/Singular_value)\n", "\n", "% See TODO-rbsap" ] } ], "metadata": { "jupytext": { "cell_metadata_filter": "-all", "formats": "md:myst", "text_representation": { "extension": ".md", "format_name": "myst", "format_version": 0.13, "jupytext_version": "1.14.1" } }, "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.9.10" }, "source_map": [ 14, 21, 25, 33, 37, 42, 45, 52, 55, 68, 73, 77, 79, 88, 91, 104, 108, 122, 130, 137, 139, 143, 145, 150, 152, 172, 177, 186, 191, 193, 202, 206 ] }, "nbformat": 4, "nbformat_minor": 5 }