Mathematical Foundations

Vectors and Matrices

The fundamental data structures of ML – representing data as points in high-dimensional space and transformations as matrices.

Prerequisites: Basic algebra, coordinate geometry.

What Are Vectors and Matrices?

Imagine you are describing a house to a buyer. You might list its square footage, number of bedrooms, age, and price. Each of these numbers is a feature, and together they form a vector – an ordered list of numbers that locates the house as a single point in a four-dimensional “feature space.” Now imagine describing ten thousand houses: you stack their feature vectors into rows and get a matrix, a rectangular grid of numbers that encodes an entire dataset in one object.

Formally, a vector $\mathbf{x} \in \mathbb{R}^n$ is an element of an $n$-dimensional real vector space. A matrix $A \in \mathbb{R}^{m \times n}$ is a rectangular array with $m$ rows and $n$ columns. In ML the convention is almost universal: each row of a data matrix $X \in \mathbb{R}^{m \times n}$ is one sample and each column is one feature.
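The house example can be sketched in NumPy. The three houses and their feature values below are made up for illustration; only the row-per-sample, column-per-feature layout comes from the text.

```python
import numpy as np

# Hypothetical houses: [square footage, bedrooms, age in years, price in USD].
house_a = np.array([1500.0, 3.0, 20.0, 300_000.0])
house_b = np.array([2200.0, 4.0, 5.0, 450_000.0])
house_c = np.array([900.0, 2.0, 40.0, 180_000.0])

# Stack the feature vectors as rows: one row per sample, one column per feature.
X = np.vstack([house_a, house_b, house_c])

m, n = X.shape          # m samples, n features
first_sample = X[0]     # a row is one house (a point in R^4)
bedrooms = X[:, 1]      # a column is one feature across all houses
print(X.shape)
```

Indexing a row recovers one house's feature vector, while indexing a column collects one feature across the whole dataset.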

How It Works

Vector Spaces and Operations

A vector space over $\mathbb{R}$ is a set $V$ equipped with vector addition and scalar multiplication that satisfy closure, associativity, commutativity of addition, distributivity of scalar multiplication over addition, and the existence of an additive identity and additive inverses. The canonical example is $\mathbb{R}^n$.

Key operations on vectors:

  • Addition: $\mathbf{x} + \mathbf{y} = (x_1 + y_1, \ldots, x_n + y_n)$
  • Scalar multiplication: $c\mathbf{x} = (cx_1, \ldots, cx_n)$
  • Dot product: $\mathbf{x} \cdot \mathbf{y} = \sum_{i=1}^{n} x_i y_i = \|\mathbf{x}\| \|\mathbf{y}\| \cos\theta$, where $\theta$ is the angle between $\mathbf{x}$ and $\mathbf{y}$

The dot product deserves special attention. It simultaneously measures (a) the projection of one vector onto another, scaled by the length of the second, and (b) how “aligned” two vectors are. When $\mathbf{x} \cdot \mathbf{y} = 0$ the vectors are orthogonal – perpendicular directions that share no common component. This idea powers everything from cosine similarity in NLP to the normal equations in linear regression.
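A minimal sketch of the dot product's two readings, alignment and projection, assuming NumPy. The helper name `cosine_similarity` and the example vectors are chosen here for illustration.

```python
import numpy as np

def cosine_similarity(x, y):
    """Return cos(theta) = (x . y) / (||x|| ||y||) for nonzero vectors."""
    return float(np.dot(x, y) / (np.linalg.norm(x) * np.linalg.norm(y)))

x = np.array([3.0, 4.0])
y = np.array([4.0, -3.0])   # perpendicular to x
z = np.array([6.0, 8.0])    # same direction as x, twice as long

dot_xy = float(np.dot(x, y))        # 3*4 + 4*(-3) = 0, so x and y are orthogonal
cos_xz = cosine_similarity(x, z)    # 50 / (5 * 10) = 1, perfectly aligned

# Scalar projection of z onto the direction of x: (z . x) / ||x||.
proj_len = float(np.dot(z, x) / np.linalg.norm(x))   # 50 / 5 = 10
```

Cosine similarity discards vector length and keeps only direction, which is why it suits comparing embeddings of different magnitudes.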

Matrix Multiplication as Linear Transformation

Matrix multiplication is not just a computational recipe; it is the algebraic encoding of a linear transformation. If $A \in \mathbb{R}^{m \times n}$, then the map $\mathbf{x} \mapsto A\mathbf{x}$ sends vectors in $\mathbb{R}^n$ to vectors in $\mathbb{R}^m$. This single idea unifies:

  • Rotation and scaling (geometric transformations)
  • Projection (dimensionality reduction via PCA)
  • Neural network layers (a dense layer computes $\mathbf{h} = \sigma(W\mathbf{x} + \mathbf{b})$)
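Two of these cases can be sketched with NumPy: a 2D rotation and a single dense layer. The weights `W`, bias `b`, and the choice of ReLU for $\sigma$ are assumptions made for illustration.

```python
import numpy as np

# Rotation by 90 degrees counterclockwise in the plane.
theta = np.pi / 2
R = np.array([[np.cos(theta), -np.sin(theta)],
              [np.sin(theta),  np.cos(theta)]])
e1 = np.array([1.0, 0.0])
rotated = R @ e1            # approximately [0, 1]

def relu(v):
    """Elementwise max(v, 0), used here as the nonlinearity sigma."""
    return np.maximum(v, 0.0)

# Dense layer h = sigma(W x + b) with hypothetical weights mapping R^2 -> R^3.
W = np.array([[ 1.0, -1.0],
              [ 0.5,  0.5],
              [-2.0,  0.0]])
b = np.array([0.0, 1.0, 0.5])
x = np.array([2.0, 1.0])
h = relu(W @ x + b)         # W x + b = [1, 2.5, -3.5]; ReLU zeroes the negative entry
```

The shape of `W` (3 by 2) fixes the input and output dimensions of the layer, in the same way that $A \in \mathbb{R}^{m \times n}$ fixes the domain and codomain of the map.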

The product $C = AB$ where $A \in \mathbb{R}^{m \times p}$ and $B \in \mathbb{R}^{p \times n}$ is defined element-wise as:

$$C_{ij} = \sum_{k=1}^{p} A_{ik} B_{kj}$$

This requires the inner dimensions to match and yields $C \in \mathbb{R}^{m \times n}$.
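The element-wise definition translates directly into three nested loops. The function name `matmul_naive` is chosen here for illustration; in practice NumPy's `@` operator does the same job far faster.

```python
import numpy as np

def matmul_naive(A, B):
    """Compute C = AB from the definition C_ij = sum_k A_ik * B_kj."""
    m, p = A.shape
    p2, n = B.shape
    if p != p2:
        raise ValueError(f"inner dimensions differ: {p} vs {p2}")
    C = np.zeros((m, n))
    for i in range(m):
        for j in range(n):
            for k in range(p):
                C[i, j] += A[i, k] * B[k, j]
    return C

A = np.array([[1.0, 2.0, 3.0],
              [4.0, 5.0, 6.0]])     # 2 x 3
B = np.array([[1.0, 0.0],
              [0.0, 1.0],
              [2.0, 2.0]])          # 3 x 2
C = matmul_naive(A, B)              # 2 x 2, equal to A @ B
```

The three loops over `m`, `n`, and `p` make the cubic cost of the naive algorithm visible for square inputs.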

Transpose and Symmetry

The transpose $A^T$ is obtained by swapping rows and columns: $(A^T)_{ij} = A_{ji}$. A matrix is symmetric if $A = A^T$. Covariance matrices, Hessians, and kernel matrices are all symmetric, which grants computational advantages such as guaranteed real eigenvalues.
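As a hedged illustration, the sketch below builds a sample covariance matrix from random data (the data and seed are arbitrary) and checks its symmetry. `np.linalg.eigvalsh` is NumPy's eigenvalue routine for symmetric matrices and returns real eigenvalues in ascending order.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))            # hypothetical data: 50 samples, 3 features

Xc = X - X.mean(axis=0)                 # center each feature (column)
cov = Xc.T @ Xc / (X.shape[0] - 1)      # sample covariance matrix, 3 x 3

is_symmetric = np.allclose(cov, cov.T)  # cov equals its own transpose
eigvals = np.linalg.eigvalsh(cov)       # real eigenvalues, ascending order
```

Routines specialized for symmetric input, such as `eigvalsh`, can exploit the structure instead of treating the matrix as a general one.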

Inverse and Rank

A square matrix $A$ is invertible if there exists $A^{-1}$ such that $AA^{-1} = A^{-1}A = I$. The inverse exists if and only if $\det(A) \neq 0$, equivalently when $A$ has full rank.
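A small sketch contrasting an invertible matrix with a singular one, assuming NumPy. The matrices are chosen by hand so the determinants are easy to verify.

```python
import numpy as np

A = np.array([[2.0, 1.0],
              [1.0, 3.0]])      # det = 2*3 - 1*1 = 5, so A is invertible
S = np.array([[1.0, 2.0],
              [2.0, 4.0]])      # second row = 2 * first row, so det = 0

A_inv = np.linalg.inv(A)
identity_ok = (np.allclose(A @ A_inv, np.eye(2))
               and np.allclose(A_inv @ A, np.eye(2)))

# NumPy raises LinAlgError when asked to invert a singular matrix.
try:
    np.linalg.inv(S)
    singular_detected = False
except np.linalg.LinAlgError:
    singular_detected = True

# To solve A x = b, np.linalg.solve avoids forming the inverse explicitly.
b = np.array([3.0, 5.0])
x = np.linalg.solve(A, b)       # solves 2x + y = 3, x + 3y = 5
```

Solving the system directly is generally preferred over computing $A^{-1}$ and multiplying, since it is cheaper and numerically more stable.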

The rank of a matrix is the dimension of its column space (equivalently, its row space). For $A \in \mathbb{R}^{m \times n}$:

$$\text{rank}(A) \leq \min(m, n)$$

When the rank is less than $\min(m, n)$, the matrix is rank-deficient – some features are linearly dependent. This signals multicollinearity in regression and motivates regularization techniques.
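The sketch below constructs a small data matrix whose third feature is the sum of the first two, then shows one way a ridge-style term can restore invertibility. The data and the value `lam = 0.1` are assumptions for illustration.

```python
import numpy as np

# Hypothetical data matrix: column 3 = column 1 + column 2 (linearly dependent).
X = np.array([[1.0, 2.0, 3.0],
              [2.0, 0.0, 2.0],
              [0.0, 1.0, 1.0],
              [3.0, 1.0, 4.0]])
rank_X = np.linalg.matrix_rank(X)            # 2, less than min(4, 3) = 3

X_full = X.copy()
X_full[0, 2] = 10.0                          # break the dependency
rank_full = np.linalg.matrix_rank(X_full)    # 3, full column rank

# X^T X inherits the deficiency: its smallest eigenvalue is (numerically) zero.
gram = X.T @ X
min_eig = np.linalg.eigvalsh(gram)[0]

# Adding lam * I shifts every eigenvalue up by lam, making the matrix invertible.
lam = 0.1
min_eig_ridge = np.linalg.eigvalsh(gram + lam * np.eye(3))[0]
```

This is the linear-algebra reason ridge regression remains well-posed even when features are collinear.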

Column Space and Null Space

The column space $\text{Col}(A)$ is the span of $A$'s columns – the set of all vectors $\mathbf{b}$ for which $A\mathbf{x} = \mathbf{b}$ has a solution. The null space $\text{Null}(A)$ is the set of all $\mathbf{x}$ satisfying $A\mathbf{x} = \mathbf{0}$. Together they satisfy the rank-nullity theorem:

$$\text{rank}(A) + \dim(\text{Null}(A)) = n$$
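One common way to check rank-nullity numerically is to read a null-space basis off the SVD: the right singular vectors beyond the rank span $\text{Null}(A)$. This is a sketch on a hand-picked matrix, not the only way to compute a null space.

```python
import numpy as np

A = np.array([[1.0, 2.0, 3.0],
              [2.0, 4.0, 6.0]])     # 2 x 3; second row = 2 * first row, so rank 1

m, n = A.shape
rank = np.linalg.matrix_rank(A)

# Full SVD: Vt is n x n. Rows of Vt past the rank correspond to zero singular values.
_, s, Vt = np.linalg.svd(A)
null_basis = Vt[rank:].T            # n x (n - rank), columns span Null(A)
nullity = null_basis.shape[1]

residual = A @ null_basis           # should be (numerically) all zeros
```

Here `rank + nullity` equals `n = 3`, the number of columns, as the theorem states.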

Why It Matters

Nearly every ML algorithm begins by organizing data into a matrix. Linear regression solves $X\mathbf{w} = \mathbf{y}$ in the least-squares sense. PCA finds eigenvectors of $X^TX$ for mean-centered $X$. Neural networks chain matrix multiplications with nonlinearities. Understanding how matrices encode transformations, when systems are solvable, and what rank reveals about data redundancy is prerequisite knowledge for almost everything that follows in ML.
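As one concrete case, the regression problem can be sketched on synthetic data. The weights `w_true`, the noise level, and the seed are assumptions for illustration. With more samples than features the system generally has no exact solution, so the code finds the least-squares fit in two ways.

```python
import numpy as np

rng = np.random.default_rng(0)
m, n = 100, 3
X = rng.normal(size=(m, n))                  # hypothetical data matrix
w_true = np.array([2.0, -1.0, 0.5])          # hypothetical ground-truth weights
y = X @ w_true + 0.01 * rng.normal(size=m)   # targets with small noise

# Least-squares solution via NumPy's dedicated routine.
w_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)

# Same solution via the normal equations X^T X w = X^T y
# (valid here because X has full column rank).
w_normal = np.linalg.solve(X.T @ X, X.T @ y)
```

Both routes agree on well-conditioned data; `lstsq` is usually the safer choice when $X^TX$ is close to singular.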

Key Technical Details

  • Matrix multiplication is not commutative: $AB \neq BA$ in general.
  • $(AB)^T = B^T A^T$ – the transpose reverses the order of multiplication.
  • The Gram matrix $X^TX \in \mathbb{R}^{n \times n}$ encodes pairwise dot products between features; $XX^T \in \mathbb{R}^{m \times m}$ encodes pairwise dot products between samples.
  • Computational cost of naive matrix multiplication of two $n \times n$ matrices is $O(n^3)$; Strassen’s algorithm achieves $O(n^{2.81})$.
  • Sparse matrices (most entries zero) arise in NLP bag-of-words and graph adjacency matrices, enabling specialized storage formats (CSR, CSC) that reduce memory from $O(mn)$ to $O(\text{nnz})$.
  • An orthogonal matrix $Q$ satisfies $Q^TQ = I$, meaning its columns are orthonormal. Orthogonal matrices preserve lengths and angles, which is why they appear in SVD and QR decomposition.
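Several of these properties can be checked numerically. The sketch below uses small hand-picked and random matrices; obtaining an orthogonal `Q` from a QR factorization is just one convenient way to get an example.

```python
import numpy as np

rng = np.random.default_rng(0)

A = np.array([[1.0, 2.0], [3.0, 4.0]])
B = np.array([[0.0, 1.0], [1.0, 0.0]])
commutes = np.allclose(A @ B, B @ A)                  # False for this pair
transpose_rule = np.allclose((A @ B).T, B.T @ A.T)    # (AB)^T = B^T A^T

X = rng.normal(size=(5, 3))     # 5 samples, 3 features
gram_features = X.T @ X         # 3 x 3: dot products between feature columns
gram_samples = X @ X.T          # 5 x 5: dot products between sample rows

# Q from a QR factorization of a square matrix is orthogonal.
Q, _ = np.linalg.qr(rng.normal(size=(3, 3)))
v = rng.normal(size=3)
orthogonal = np.allclose(Q.T @ Q, np.eye(3))
length_preserved = np.isclose(np.linalg.norm(Q @ v), np.linalg.norm(v))
```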

Common Misconceptions

  • “A matrix is just a table of numbers.” A matrix is often an operator, not just storage. The same grid of numbers can represent a dataset, a linear map, a covariance structure, or a graph adjacency. Interpreting it correctly depends on context.
  • “Inverse always exists for square matrices.” Only if the determinant is nonzero. Singular (rank-deficient) matrices have no inverse, which is precisely when the system $A\mathbf{x} = \mathbf{b}$ may have no solution or infinitely many solutions.
  • “Higher-dimensional vectors can’t be visualized, so intuition fails.” Many properties – orthogonality, projection, span – generalize perfectly from 2D/3D. Building geometric intuition in low dimensions transfers reliably.
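The singular case can be made concrete by comparing the rank of $A$ with the rank of the augmented matrix $[A \mid \mathbf{b}]$. The helper `classify_system` below is a hypothetical name introduced for this sketch.

```python
import numpy as np

def classify_system(A, b):
    """Classify A x = b by comparing rank(A) with rank([A | b])."""
    rank_A = np.linalg.matrix_rank(A)
    rank_aug = np.linalg.matrix_rank(np.column_stack([A, b]))
    if rank_aug > rank_A:
        return "no solution"                 # b lies outside the column space
    if rank_A == A.shape[1]:
        return "unique solution"             # full column rank
    return "infinitely many solutions"       # nontrivial null space

S = np.array([[1.0, 2.0],
              [2.0, 4.0]])                   # singular: second row = 2 * first row

in_col_space = classify_system(S, np.array([1.0, 2.0]))
out_of_col_space = classify_system(S, np.array([1.0, 0.0]))
invertible_case = classify_system(np.eye(2), np.array([1.0, 0.0]))
print(in_col_space, out_of_col_space, invertible_case, sep=" | ")
```

Which of the two singular outcomes occurs depends entirely on whether $\mathbf{b}$ lies in the column space of $A$.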

Connections to Other Concepts

  • Matrix Decompositions: Eigendecomposition and SVD factor matrices to expose latent structure and rank, and to enable compression.
  • Derivatives And Gradients: Gradients are vectors; Jacobians and Hessians are matrices. Backpropagation is a sequence of matrix-vector products.
  • Norms And Distance Metrics: The L2 norm $\|\mathbf{x}\|_2 = \sqrt{\mathbf{x} \cdot \mathbf{x}}$ is defined via the dot product; the Mahalanobis distance uses the inverse covariance matrix.
  • Probability Fundamentals: Covariance matrices encode the joint variability of random variables.
  • Cost Latency Optimization: The Hessian matrix determines the curvature of the loss surface and the conditioning of optimization.

Further Reading

  • Strang, Introduction to Linear Algebra (2016) – The gold-standard textbook for building geometric intuition about vector spaces.
  • Boyd & Vandenberghe, Introduction to Applied Linear Algebra (2018) – Focused on applications in data science and ML, freely available online.
  • Goodfellow et al., Deep Learning, Chapter 2 (2016) – A concise review of the linear algebra needed specifically for deep learning.