Foundational Architecture

Core transformer components — self-attention, multi-head attention, feed-forward networks, residual connections, and architectural variants like MoE and sparse attention.

The Transformer Architecture

The Transformer is a neural network architecture built around attention mechanisms rather than recurrence. It processes all input tokens in parallel instead of one step at a time, and it has become the foundation of nearly all modern large language models.

Self-Attention Mechanism

Self-attention lets every token in a sequence compute a dynamically weighted combination of the representations of all tokens in the sequence, its own included, enabling the model to capture contextual relationships regardless of distance.
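
The weighted combination described above can be sketched as scaled dot-product attention for a single head. This is a minimal illustration in plain Python, not a production implementation: the function names and the toy input are assumptions, and the learned query, key, and value projections of a real layer are omitted.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(queries, keys, values):
    """Scaled dot-product attention for one head; inputs are lists of vectors."""
    d_k = len(keys[0])
    outputs, weights = [], []
    for q in queries:
        # Similarity of this query to every key (its own included), scaled by sqrt(d_k).
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k) for k in keys]
        w = softmax(scores)
        # The output is the attention-weighted average of the value vectors.
        out = [sum(wj * v[i] for wj, v in zip(w, values)) for i in range(len(values[0]))]
        weights.append(w)
        outputs.append(out)
    return outputs, weights

# Toy input (hypothetical): three tokens with 2-dimensional representations.
# Q, K and V are the raw vectors here; a real layer applies learned projections first.
x = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
outputs, weights = self_attention(x, x, x)
```

Each row of `weights` sums to one, and each row of `outputs` is a blend of all three input vectors.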

Multi-Head Attention

Multi-head attention runs several self-attention operations in parallel, each with its own learned projections, so the model can attend to different types of relationships (syntactic, semantic, positional) at the same time. The head outputs are then concatenated and passed through a final learned projection.
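
A rough sketch of the split, attend, and concatenate pattern follows. It is deliberately simplified: slicing the feature dimension stands in for the learned per-head projections, the final output projection is omitted, and all names are illustrative assumptions.

```python
import math

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def attend(vectors):
    """Single-head self-attention with Q = K = V = `vectors` (no learned projections)."""
    d = len(vectors[0])
    out = []
    for q in vectors:
        w = softmax([sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in vectors])
        out.append([sum(wj * v[i] for wj, v in zip(w, vectors)) for i in range(d)])
    return out

def multi_head_attention(x, n_heads):
    d_head = len(x[0]) // n_heads
    # Each head sees its own slice of the feature dimension.
    head_inputs = [[tok[h * d_head:(h + 1) * d_head] for tok in x] for h in range(n_heads)]
    # Heads are independent of one another, so they could run in parallel.
    head_outputs = [attend(h) for h in head_inputs]
    # Concatenate the per-head outputs back to the model dimension.
    return [[v for head in head_outputs for v in head[t]] for t in range(len(x))]

# Toy input (hypothetical): three tokens, model dimension 4, two heads of size 2.
x = [[1.0, 0.0, 0.5, 0.5], [0.0, 1.0, 0.5, -0.5], [1.0, 1.0, 0.0, 0.0]]
y = multi_head_attention(x, n_heads=2)
```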

Causal (Masked) Attention

Causal attention restricts each token to attend only to itself and preceding tokens by applying a triangular mask to the attention matrix, enforcing the left-to-right autoregressive property required for text generation.
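
The triangular mask can be sketched as follows, assuming a small sequence and arbitrary illustrative scores. Implementations typically add negative infinity to the disallowed scores before the softmax; here the disallowed positions are simply given zero weight, which has the same effect.

```python
import math

def causal_mask(n):
    """mask[i][j] is True when position i may attend to position j (j <= i)."""
    return [[j <= i for j in range(n)] for i in range(n)]

def masked_softmax(scores, allowed):
    """Softmax over the allowed positions only; disallowed positions get weight 0."""
    m = max(s for s, ok in zip(scores, allowed) if ok)
    exps = [math.exp(s - m) if ok else 0.0 for s, ok in zip(scores, allowed)]
    total = sum(exps)
    return [e / total for e in exps]

mask = causal_mask(4)
raw_scores = [[0.5, 1.0, -0.3, 2.0] for _ in range(4)]  # arbitrary illustrative scores
weights = [masked_softmax(row, allowed) for row, allowed in zip(raw_scores, mask)]
```

The first token can only attend to itself, so its weight row is `[1.0, 0.0, 0.0, 0.0]`; every entry above the diagonal is zero.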

Grouped Query Attention (GQA)

Grouped Query Attention reduces the memory footprint of the key-value cache by sharing each key-value head across a group of query heads. It achieves quality close to standard multi-head attention at a fraction of the cache memory, which has made it a widely adopted choice for production LLM deployment.
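
The memory saving is simple arithmetic, sketched below. The model configuration (32 layers, 32 query heads, head dimension 128, 4096 tokens, 2 bytes per value) is a hypothetical example chosen for round numbers, not a description of any particular model.

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len, bytes_per_value=2):
    """Bytes needed to cache keys and values for one sequence (the factor 2 covers K and V)."""
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * bytes_per_value

def kv_head_for_query_head(query_head, n_query_heads, n_kv_heads):
    """Index of the shared key-value head used by a given query head."""
    group_size = n_query_heads // n_kv_heads
    return query_head // group_size

# Hypothetical configuration, assuming 16-bit values.
n_layers, n_query_heads, head_dim, seq_len = 32, 32, 128, 4096
mha_bytes = kv_cache_bytes(n_layers, n_kv_heads=32, head_dim=head_dim, seq_len=seq_len)
gqa_bytes = kv_cache_bytes(n_layers, n_kv_heads=8, head_dim=head_dim, seq_len=seq_len)
```

With 8 key-value heads instead of 32, the cache for this configuration shrinks from 2 GiB to 512 MiB per sequence, and query heads 0 to 3 all read from key-value head 0.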

Sliding Window Attention

Sliding window attention restricts each token’s attention to a fixed-size local window of W neighboring tokens, reducing the cost of attention from O(n^2) to O(n·W), which is linear in sequence length for a fixed W. Long-range information still flows through layer stacking, because each additional layer extends the effective receptive field by up to W tokens.
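
The growth of the receptive field with depth can be checked on a small example. The sketch below assumes a causal variant in which each token attends to itself and the W tokens before it; the helper names are illustrative.

```python
def sliding_window_mask(n, window):
    """mask[i][j] is True when token i may attend to token j: itself or one of the `window` preceding tokens."""
    return [[0 <= i - j <= window for j in range(n)] for i in range(n)]

def earliest_visible(position, n_layers, mask):
    """Earliest input position that can influence `position` after n_layers of masked attention."""
    frontier = {position}
    for _ in range(n_layers):
        # One layer lets every reachable position pull information from its own window.
        frontier = {j for i in frontier for j, ok in enumerate(mask[i]) if ok}
    return min(frontier)

mask = sliding_window_mask(n=16, window=3)
reach = [earliest_visible(15, layers, mask) for layers in (1, 2, 3)]
```

For the last token (position 15) with W = 3, one layer reaches back to position 12, two layers to position 9, and three layers to position 6, so each layer adds W tokens of reach.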

Sparse Attention

Sparse attention mechanisms restrict each token to attending to only a subset of other tokens rather than the full sequence, reducing attention’s O(n^2) cost to O(n log n) or O(n) depending on the sparsity pattern, which makes processing very long sequences practical.

Attention Sinks

Attention sinks are the phenomenon where the first few tokens in a sequence accumulate disproportionately large attention scores regardless of their semantic content. This is a side effect of softmax, which forces attention weights to sum to one, so attention that has no relevant target is deposited on the initial tokens that every later token can see. StreamingLLM exploits this property by keeping the key-value entries of those initial tokens alongside a sliding window of recent tokens, which is reported to keep inference stable over millions of tokens with fixed memory.
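
The cache policy behind this idea can be sketched as "keep the first few tokens plus a recent window". The function name and the default sizes below are illustrative assumptions, and the sketch only tracks which positions stay cached; it does not model how positions are re-indexed inside the cache.

```python
def streaming_cache_positions(n_seen, n_sink=4, window=8):
    """Positions kept in the KV cache after n_seen tokens: the sink tokens plus the most recent tokens."""
    sinks = list(range(min(n_sink, n_seen)))
    # The recent window never overlaps the sink tokens.
    recent_start = max(n_sink, n_seen - window)
    recent = list(range(recent_start, n_seen))
    return sinks + recent

short = streaming_cache_positions(3)
medium = streaming_cache_positions(100)
long_run = streaming_cache_positions(1_000_000)
```

However many tokens have been processed, the cache never holds more than `n_sink + window` entries, which is what keeps memory fixed.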

Differential Transformer

The Differential Transformer computes attention as the difference between two separate softmax attention maps, A_{\text{diff}} = A_1 - \lambda A_2, where \lambda is a learnable scalar. The subtraction cancels noise and irrelevant attention patterns that both maps share, much like a differential amplifier in electrical engineering filters out common-mode noise to isolate the true signal.
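
For a single query, the subtraction can be sketched as below. The scores and the value of `lam` are made-up illustrative numbers; in the actual architecture the two score vectors come from separate learned query and key projections, and \lambda is learned.

```python
import math

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def differential_attention_weights(scores_1, scores_2, lam):
    """A_diff = softmax(scores_1) - lam * softmax(scores_2) for one query row."""
    a1 = softmax(scores_1)
    a2 = softmax(scores_2)
    return [x - lam * y for x, y in zip(a1, a2)]

# Hypothetical scores: the first map favors token 0, the second map captures background attention.
lam = 0.5
weights = differential_attention_weights([2.0, 0.5, 0.5], [0.0, 0.5, 0.5], lam)
```

Unlike ordinary attention weights, the result sums to `1 - lam` rather than one, and individual entries may be negative.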

Feed-Forward Networks (FFN / MLP Layers)

The feed-forward network in each Transformer layer is a two-layer fully connected network applied independently to each token position. It is often described as the model’s primary knowledge store, and with the common hidden size of four times the model dimension it accounts for roughly two-thirds of the parameters in each layer.
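
The two-thirds figure follows from a quick parameter count, sketched here under stated assumptions: a feed-forward hidden size of four times the model dimension, standard multi-head attention with four square projection matrices, and biases, normalization parameters, and embeddings ignored. The model dimension of 4096 is an arbitrary example.

```python
def block_param_counts(d_model, d_ff=None):
    """Approximate weight counts for one Transformer block, ignoring biases and norms."""
    if d_ff is None:
        d_ff = 4 * d_model  # common choice; an assumption, not a rule
    attention = 4 * d_model * d_model  # query, key, value and output projections
    ffn = 2 * d_model * d_ff           # up-projection and down-projection
    return attention, ffn

attention, ffn = block_param_counts(4096)
ffn_share = ffn / (attention + ffn)
```

Here the attention weights total 4·d^2 and the feed-forward weights total 8·d^2, so `ffn_share` comes out at two-thirds.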

Activation Functions in LLMs

Activation functions introduce non-linearity into neural networks, enabling them to learn complex patterns, and the evolution from ReLU to GELU to SwiGLU represents a progression toward smoother, gated functions that improve large language model training dynamics and performance.
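
The three functions can be written out on scalars to make the progression concrete. This is a sketch using the exact (erf-based) form of GELU; in a real SwiGLU layer the `gate` and `value` inputs are two separate learned linear projections of the same token, which are taken as given here.

```python
import math

def relu(x):
    """Hard threshold at zero."""
    return max(0.0, x)

def gelu(x):
    """Exact GELU: x times the standard normal CDF at x."""
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))

def silu(x):
    """SiLU (Swish with beta = 1): x times sigmoid(x)."""
    return x / (1.0 + math.exp(-x))

def swiglu(gate, value):
    """Gated unit: the SiLU-activated gate scales the value elementwise."""
    return silu(gate) * value

samples = [-2.0, -1.0, 0.0, 1.0, 2.0]
table = [(x, relu(x), gelu(x), silu(x)) for x in samples]
```

ReLU is exactly zero for negative inputs, while GELU and SiLU are smooth and let small negative values through.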

Residual Connections & The Residual Stream

Residual connections (skip connections) add each layer’s input directly to its output, creating a “residual stream” that flows through the entire model and enables effective training of networks with dozens to hundreds of layers.
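
In code, the residual stream is a running sum that each block adds to rather than replaces. The sketch below uses toy blocks (simple scaling functions) as stand-ins for attention and feed-forward sublayers; all names are illustrative.

```python
def apply_blocks(x, blocks):
    """Pass x through a stack of blocks, adding each block's output back onto the stream."""
    stream = list(x)
    for block in blocks:
        update = block(stream)
        # Residual connection: the block's input is carried forward unchanged and the update is added.
        stream = [s + u for s, u in zip(stream, update)]
    return stream

def small_update(v):
    """Toy block that writes a small update to the stream."""
    return [0.1 * a for a in v]

def zero_update(v):
    """Toy block that contributes nothing."""
    return [0.0 for _ in v]

out = apply_blocks([1.0, 2.0], [small_update, small_update])
unchanged = apply_blocks([1.0, 2.0], [zero_update] * 100)
```

A block that outputs zeros leaves the stream untouched, which is part of why very deep stacks remain trainable: every layer has an identity path through it.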

Layer Normalization

Layer normalization standardizes activations across the feature dimension at each position independently, stabilizing training of deep Transformer networks and enabling the use of higher learning rates.
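
A minimal sketch of the computation for one position's feature vector follows. The optional `gain` and `bias` arguments stand in for the learned scale and shift parameters, and `eps` is a small constant for numerical stability; the default values are assumptions.

```python
import math

def layer_norm(x, gain=None, bias=None, eps=1e-5):
    """Normalize one feature vector to zero mean and unit variance, then scale and shift."""
    n = len(x)
    mean = sum(x) / n
    var = sum((v - mean) ** 2 for v in x) / n
    normed = [(v - mean) / math.sqrt(var + eps) for v in x]
    gain = gain or [1.0] * n
    bias = bias or [0.0] * n
    return [g * v + b for g, v, b in zip(gain, normed, bias)]

y = layer_norm([1.0, 2.0, 3.0, 4.0])
y_mean = sum(y) / len(y)
y_var = sum((v - y_mean) ** 2 for v in y) / len(y)
```

The statistics are computed over the features of a single position, so the result does not depend on other tokens or on the batch.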

Logits and Softmax

Logits are the raw, unnormalized output scores of a language model for each token in the vocabulary, and the softmax function converts them into a valid probability distribution from which the next token is selected.
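
As a small worked example, the snippet below converts three logits into probabilities and picks the most likely token. The three-word vocabulary and the logit values are hypothetical.

```python
import math

def softmax(logits):
    """Convert raw scores into probabilities that sum to one."""
    m = max(logits)  # subtracting the max avoids overflow without changing the result
    exps = [math.exp(l - m) for l in logits]
    total = sum(exps)
    return [e / total for e in exps]

vocab = ["the", "cat", "sat"]  # hypothetical vocabulary
logits = [2.0, 1.0, 0.1]       # hypothetical model outputs
probs = softmax(logits)
greedy_token = vocab[probs.index(max(probs))]
```

Here greedy selection picks "the"; sampling from `probs` instead would pick it only about two-thirds of the time.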

Encoder-Decoder vs Decoder-Only vs Encoder-Only

The three Transformer paradigms – encoder-only (BERT), decoder-only (GPT), and encoder-decoder (T5) – represent fundamentally different choices about how the model processes context, with decoder-only emerging as the dominant architecture for generative AI.

Autoregressive Generation

Autoregressive generation is the process by which LLMs produce text one token at a time, feeding each newly generated token back as input for predicting the next, creating a sequential feedback loop that is both the source of their generative power and their primary inference bottleneck.
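
The feedback loop can be sketched with a toy stand-in for the model. `toy_next_token` is a hypothetical lookup table, not a real language model; the point is the loop structure, in which each step must wait for the previous token.

```python
def toy_next_token(tokens):
    """Hypothetical stand-in for a model: the next token depends only on the last one."""
    table = {"<bos>": "the", "the": "cat", "cat": "sat", "sat": "<eos>"}
    return table.get(tokens[-1], "<eos>")

def generate(prompt, max_new_tokens=10):
    tokens = list(prompt)
    for _ in range(max_new_tokens):
        nxt = toy_next_token(tokens)  # one model call per generated token
        tokens.append(nxt)            # the new token becomes part of the next input
        if nxt == "<eos>":
            break
    return tokens

result = generate(["<bos>"])
```

Because each iteration depends on the token produced by the previous one, the steps cannot be run in parallel, which is the inference bottleneck mentioned above.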

Next-Token Prediction

Next-token prediction is the deceptively simple training objective at the heart of all decoder-based LLMs: given all preceding tokens, the model is trained to assign high probability to the token that actually comes next. Applied at sufficient scale, this single objective gives rise to emergent capabilities including grammar, factual knowledge, and reasoning.
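
In practice the objective is usually implemented as a cross-entropy loss: the negative log-probability the model assigns to the true next token. A minimal sketch for a single position, with made-up logits:

```python
import math

def next_token_loss(logits, target_index):
    """Cross-entropy for one position: -log softmax(logits)[target_index]."""
    m = max(logits)
    log_sum_exp = m + math.log(sum(math.exp(l - m) for l in logits))
    return log_sum_exp - logits[target_index]

uniform_loss = next_token_loss([0.0, 0.0, 0.0, 0.0], target_index=2)
confident_right = next_token_loss([5.0, 0.0, 0.0], target_index=0)
confident_wrong = next_token_loss([5.0, 0.0, 0.0], target_index=1)
```

A model that spreads probability evenly over four tokens pays a loss of log 4; a confident correct prediction costs little, and a confident wrong one costs a lot.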

Mixture of Experts (MoE)

Mixture of Experts is an architecture that replaces the dense feed-forward network with multiple parallel “expert” networks and a learned router that selects only a small subset of experts for each token, enabling models with vastly more parameters while keeping per-token computation roughly constant.
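
Top-k routing for one token can be sketched as follows. The experts are toy scalar functions, the router logits are given rather than computed by a learned layer, and normalizing the gate weights with a softmax over the selected experts is one common choice among several; all of these are assumptions for illustration.

```python
import math

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def top_k_route(router_logits, k):
    """Return (expert_index, gate_weight) pairs for the k highest-scoring experts."""
    top = sorted(range(len(router_logits)), key=lambda i: router_logits[i], reverse=True)[:k]
    gates = softmax([router_logits[i] for i in top])
    return list(zip(top, gates))

def moe_forward(x, experts, router_logits, k=2):
    # Only the selected experts are evaluated; the others cost nothing for this token.
    return sum(gate * experts[i](x) for i, gate in top_k_route(router_logits, k))

# Four toy experts that scale their input by 1, 2, 3 and 4.
experts = [lambda x, s=s: s * x for s in (1.0, 2.0, 3.0, 4.0)]
router_logits = [0.1, 2.0, -1.0, 1.0]  # hypothetical router output for one token
routing = top_k_route(router_logits, k=2)
y = moe_forward(1.0, experts, router_logits, k=2)
```

Adding more experts increases the total parameter count, but each token still runs through only `k` of them.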

Mixture of Depths

Mixture of Depths (MoD) dynamically routes each token at each layer through either the full transformer block or a skip connection, using a lightweight router to select only the top-k most important tokens for computation. It is reported to reduce FLOPs by up to 50% while matching or exceeding standard transformer performance.
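
The per-layer routing decision can be sketched like this. Tokens are scalars, `block` is a toy function, and the router scores are given; the sketch also leaves out the scaling of the block output by the router weight, so it should be read as an illustration of the selection step only.

```python
def mixture_of_depths_layer(tokens, router_scores, block, capacity):
    """Run `block` on the `capacity` highest-scoring tokens; all other tokens skip it."""
    ranked = sorted(range(len(tokens)), key=lambda i: router_scores[i], reverse=True)
    chosen = set(ranked[:capacity])
    out = []
    for i, x in enumerate(tokens):
        if i in chosen:
            out.append(x + block(x))  # full block, added onto the residual stream
        else:
            out.append(x)             # skip connection only: no block compute spent
    return out, chosen

tokens = [1.0, 2.0, 3.0, 4.0]
router_scores = [0.9, 0.1, 0.8, 0.2]  # hypothetical router outputs
out, chosen = mixture_of_depths_layer(tokens, router_scores, block=lambda x: 10.0 * x, capacity=2)
```

With a capacity of half the sequence, the block runs on two of the four tokens, and the other two pass through unchanged.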

Byte Latent Transformers

Byte Latent Transformers (BLT) are a tokenizer-free architecture that operates directly on raw bytes, grouping them into dynamically sized patches. This eliminates tokenization artifacts, and the approach is reported to match the performance of token-based models at equivalent compute budgets.
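
One way to picture dynamic patching is to start a new patch wherever the next byte is hard to predict. The sketch below assumes per-byte entropy estimates are already available; the entropy values and the threshold are invented for illustration, whereas in the real architecture they would come from a small byte-level model.

```python
def bytes_of(text):
    """Raw UTF-8 byte values of a string; no tokenizer or vocabulary is involved."""
    return list(text.encode("utf-8"))

def patch_by_entropy(byte_values, entropies, threshold):
    """Start a new patch at the first byte and wherever the entropy exceeds the threshold."""
    patches = []
    for b, h in zip(byte_values, entropies):
        if not patches or h > threshold:
            patches.append([b])
        else:
            patches[-1].append(b)
    return patches

data = bytes_of("cat sat")
entropies = [3.0, 0.5, 0.4, 2.5, 2.8, 0.6, 0.3]  # hypothetical next-byte entropies
patches = patch_by_entropy(data, entropies, threshold=2.0)
```

Predictable stretches end up in longer patches, while uncertain positions open new ones, so compute is concentrated where the data is least predictable.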