BD Brain Drip
Mathematical Foundations

Matrix Decompositions

Eigendecomposition, SVD, and Cholesky – factoring matrices to reveal structure, compress data, and solve systems efficiently.

Prerequisites | Vectors and Matrices basic linear algebra (rank inverse transpose).

What Are Matrix Decompositions?

Think of factoring the integer 60 into 22×3×52^2 \times 3 \times 5. The factorization reveals structure – which primes compose it and in what proportions. Matrix decompositions do the same for matrices: they break a matrix into a product of simpler, interpretable factors. These factors expose the principal directions of variation in data, enable dramatic compression, and transform hard computational problems into easy ones.

Formally, a matrix decomposition expresses AA as a product of structured matrices. The three workhorses for ML are eigendecomposition (A=QΛQ1A = Q\Lambda Q^{-1}), singular value decomposition (A=UΣVTA = U\Sigma V^T), and Cholesky factorization (A=LLTA = LL^T).

How It Works

Eigendecomposition

An eigenvector v\mathbf{v} of a square matrix AA satisfies:

Av=λvA\mathbf{v} = \lambda \mathbf{v}

where λ\lambda is the corresponding eigenvalue. Geometrically, AA stretches v\mathbf{v} by a factor of λ\lambda without changing its direction. If AA is a covariance matrix, eigenvectors point along the axes of maximum variance and eigenvalues quantify the variance along each axis.

For a diagonalizable matrix with nn linearly independent eigenvectors:

A=QΛQ1A = Q\Lambda Q^{-1}

where Q=[v1vn]Q = [\mathbf{v}_1 | \cdots | \mathbf{v}_n] and Λ=diag(λ1,,λn)\Lambda = \text{diag}(\lambda_1, \ldots, \lambda_n). When AA is symmetric (real), QQ is orthogonal (Q1=QTQ^{-1} = Q^T), which simplifies computation and guarantees all eigenvalues are real.

Singular Value Decomposition (SVD)

SVD generalizes eigendecomposition to any matrix ARm×nA \in \mathbb{R}^{m \times n}, not just square ones:

A=UΣVTA = U \Sigma V^T

where URm×mU \in \mathbb{R}^{m \times m} (left singular vectors), ΣRm×n\Sigma \in \mathbb{R}^{m \times n} (diagonal matrix of singular values σ1σ20\sigma_1 \geq \sigma_2 \geq \cdots \geq 0), and VRn×nV \in \mathbb{R}^{n \times n} (right singular vectors).

Geometric interpretation: Any linear transformation can be decomposed into three steps: (1) rotate/reflect by VTV^T, (2) scale along coordinate axes by Σ\Sigma, (3) rotate/reflect by UU.

Key relationships:

  • Singular values: σi=λi(ATA)\sigma_i = \sqrt{\lambda_i(A^TA)}
  • rank(A)\text{rank}(A) = number of nonzero singular values
  • AF=iσi2\|A\|_F = \sqrt{\sum_i \sigma_i^2} (Frobenius norm)

Truncated SVD for Compression

The Eckart-Young theorem states that the best rank-kk approximation (in both Frobenius and spectral norm) is obtained by keeping only the top kk singular values:

Ak=UkΣkVkTA_k = U_k \Sigma_k V_k^T

where UkU_k has kk columns, Σk\Sigma_k is k×kk \times k, and VkV_k has kk columns. The approximation error is:

AAkF=i=k+1rσi2\|A - A_k\|_F = \sqrt{\sum_{i=k+1}^{r} \sigma_i^2}

This is the mathematical foundation of latent semantic analysis in NLP, image compression, and recommendation systems (matrix factorization).

Cholesky Decomposition

If AA is symmetric positive definite (all eigenvalues >0> 0), then it has a unique factorization:

A=LLTA = LL^T

where LL is lower triangular with positive diagonal entries. Cholesky is roughly twice as fast as LU decomposition and numerically more stable. It is used to:

  • Solve Ax=bA\mathbf{x} = \mathbf{b} via forward/backward substitution
  • Sample from multivariate Gaussians: if zN(0,I)\mathbf{z} \sim \mathcal{N}(\mathbf{0}, I), then Lz+μN(μ,A)L\mathbf{z} + \boldsymbol{\mu} \sim \mathcal{N}(\boldsymbol{\mu}, A)
  • Compute log-determinants efficiently: logdet(A)=2ilogLii\log\det(A) = 2\sum_i \log L_{ii}

Connection to PCA

Principal Component Analysis is eigendecomposition of the covariance matrix C=1n1XTXC = \frac{1}{n-1}X^TX (for centered data XX). Equivalently, PCA can be computed via SVD of XX itself. If X=UΣVTX = U\Sigma V^T, then the principal components are the columns of VV, and the variance explained by each component is σi2/(n1)\sigma_i^2 / (n-1).

Why It Matters

Matrix decompositions are the computational backbone of ML. PCA (via eigendecomposition or SVD) is the most widely used dimensionality reduction technique. SVD powers recommender systems and low-rank matrix completion. Cholesky enables efficient Gaussian process inference and sampling. Without these tools, many models would be computationally intractable.

Key Technical Details

  • Eigendecomposition applies only to square matrices; SVD applies to any matrix.
  • Not all matrices are diagonalizable. Defective matrices require the Jordan normal form, but this rarely arises in ML where matrices of interest (covariance, kernel) are symmetric.
  • Computing full SVD of ARm×nA \in \mathbb{R}^{m \times n} costs O(min(m2n,mn2))O(\min(m^2 n, mn^2)). Randomized SVD algorithms reduce this for the truncated case.
  • The condition number κ(A)=σmax/σmin\kappa(A) = \sigma_{\max}/\sigma_{\min} measures sensitivity to perturbation. Ill-conditioned matrices (κ1\kappa \gg 1) cause numerical instability.
  • For positive semi-definite matrices (eigenvalues 0\geq 0, not strictly >0> 0), Cholesky can be applied with pivoting or a small diagonal shift (Tikhonov regularization: A+ϵIA + \epsilon I).
  • Eigenvalue decomposition of symmetric matrices is guaranteed to yield real eigenvalues and orthogonal eigenvectors – a fact that underpins PCA, spectral clustering, and Laplacian-based methods.

Common Misconceptions

  • “SVD and eigendecomposition are the same thing.” Eigendecomposition requires a square matrix and may not exist. SVD exists for every matrix and decomposes it into orthogonal bases for both the domain and codomain.
  • “Keeping more singular values is always better.” Truncated SVD acts as a regularizer. Retaining noise-dominated components can hurt generalization – this is the bias-variance tradeoff applied to representation.
  • “PCA always works.” PCA finds linear directions of maximum variance. If the data lies on a nonlinear manifold, PCA may miss its structure entirely. Kernel PCA or autoencoders may be more appropriate.

Connections to Other Concepts

  • Vectors And Matrices: Decompositions require fluency with rank, column space, and matrix operations.
  • Cost Latency Optimization: The condition number from SVD determines how fast gradient descent converges. Preconditioning uses decompositions to accelerate optimization.
  • Probability Fundamentals: The covariance matrix is the object being decomposed in PCA; its eigenvalues are variances along principal axes.
  • Norms And Distance Metrics: The spectral norm A2=σmax\|A\|_2 = \sigma_{\max} and Frobenius norm AF=σi2\|A\|_F = \sqrt{\sum \sigma_i^2} are defined through singular values.
  • Statistical Inference: Decompositions enable efficient computation of likelihoods involving multivariate Gaussians.

Further Reading

  • Strang, Linear Algebra and Its Applications (2006) – Thorough treatment of eigendecomposition and SVD with geometric intuition.
  • Trefethen & Bau, Numerical Linear Algebra (1997) – The definitive reference on the numerical aspects and stability of matrix decompositions.
  • Halko, Martinsson & Tropp, “Finding Structure with Randomness” (2011) – Landmark paper on randomized algorithms for low-rank approximation. [Scholar]