Architectural Innovation Threads

How LLM architectures have evolved: attention variants, mixture of experts (MoE), state space models, and the serving optimizations built around them.

Attention Mechanism Evolution

The journey from every attention head having its own memory to groups sharing compressed memory — a relentless drive to make attention cheaper without making it dumber.
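
As a rough illustration of why sharing key/value heads matters, the sketch below estimates per-sequence KV cache size for multi-head, grouped-query, and multi-query attention. The model shape (32 layers, 32 query heads, head dimension 128, 16-bit values) is a hypothetical chosen for the arithmetic, and `kv_cache_bytes` is an illustrative helper, not any library's API.

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len, bytes_per_value=2):
    """Bytes needed to cache keys and values for one sequence."""
    # The leading factor of 2 covers the separate key and value tensors.
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * bytes_per_value


# Hypothetical model shape, chosen only for illustration.
layers, query_heads, head_dim, seq_len = 32, 32, 128, 4096

mha = kv_cache_bytes(layers, query_heads, head_dim, seq_len)  # one KV head per query head
gqa = kv_cache_bytes(layers, 8, head_dim, seq_len)            # 8 KV heads, each shared by 4 query heads
mqa = kv_cache_bytes(layers, 1, head_dim, seq_len)            # a single KV head shared by all query heads

for name, size in [("MHA", mha), ("GQA", gqa), ("MQA", mqa)]:
    print(f"{name}: {size / 2**20:.0f} MiB per sequence")
```

Under these assumptions the cache shrinks in direct proportion to the number of KV heads; whether quality holds up at a given grouping is an empirical question this arithmetic does not address.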

Positional Encoding Evolution

How Transformers went from rigid, pre-set notions of word order to flexible, rotatable representations that let models generalize to sequences far longer than anything seen during training.
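
One way to see what "rotatable" means is the minimal rotary-embedding sketch below: each pair of vector components is rotated by an angle proportional to the token position, and the query-key dot product then depends only on the offset between positions. The function names and the base of 10000 are assumptions for illustration; the sketch shows the relative-offset property only and says nothing by itself about how well a trained model extrapolates.

```python
import math


def rotate(vec, position, base=10000.0):
    """Apply a rotary position embedding to an even-length vector."""
    dim = len(vec)
    out = []
    for i in range(0, dim, 2):
        # Each component pair gets its own rotation frequency.
        theta = position * base ** (-i / dim)
        c, s = math.cos(theta), math.sin(theta)
        x, y = vec[i], vec[i + 1]
        out.extend([x * c - y * s, x * s + y * c])
    return out


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


q = [0.3, -1.2, 0.7, 0.5]
k = [1.1, 0.4, -0.6, 0.9]

# Same offset (3 positions) at two very different absolute positions.
near = dot(rotate(q, 5), rotate(k, 2))
far = dot(rotate(q, 1005), rotate(k, 1002))
print(abs(near - far) < 1e-9)
```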

Flash Attention and Hardware-Aware Computing

The realization that attention’s bottleneck was not arithmetic but memory bandwidth, and the tiling algorithm that turned that insight into a 2-4x speedup with zero approximation.
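
The "zero approximation" part rests on an online softmax: scores can be processed block by block while a running maximum and running sum are rescaled, so the full score row never has to be materialized. The sketch below shows that idea for a single query row with scalar values. It is a simplified illustration under those assumptions, not the actual GPU kernel, and the function names are hypothetical.

```python
import math


def attention_row_direct(scores, values):
    """Softmax-weighted sum computed with the whole row in memory."""
    m = max(scores)
    weights = [math.exp(s - m) for s in scores]
    return sum(w * v for w, v in zip(weights, values)) / sum(weights)


def attention_row_tiled(scores, values, block=2):
    """Same result, visiting the row one block at a time."""
    running_max = float("-inf")
    running_sum = 0.0
    acc = 0.0
    for start in range(0, len(scores), block):
        s_blk = scores[start:start + block]
        v_blk = values[start:start + block]
        new_max = max(running_max, max(s_blk))
        # Rescale what was accumulated under the old maximum.
        # On the first block this is exp(-inf), which is 0.0.
        scale = math.exp(running_max - new_max)
        running_sum *= scale
        acc *= scale
        for s, v in zip(s_blk, v_blk):
            w = math.exp(s - new_max)
            running_sum += w
            acc += w * v
        running_max = new_max
    return acc / running_sum


scores = [0.5, 2.0, -1.0, 3.5, 0.0]
values = [1.0, 2.0, 3.0, 4.0, 5.0]
direct = attention_row_direct(scores, values)
tiled = attention_row_tiled(scores, values, block=2)
print(abs(direct - tiled) < 1e-12)
```

The speedup itself comes from keeping each block in fast on-chip memory, which a pure-Python sketch cannot demonstrate; only the exactness of the blockwise computation is shown here.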

Mixture of Experts Evolution

The three-decade journey from a theoretical gating idea to the dominant architecture for frontier models — getting more parameters without paying for all of them at inference time.
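
A minimal sketch of top-k routing makes the "not paying for all of them" point concrete: a router scores every expert, but only the k best are actually evaluated. The experts here are toy scalar functions and the router logits are passed in directly, both assumptions for illustration; in a real layer the logits would come from a learned gate applied to the token.

```python
import math


def top_k_route(router_logits, k=2):
    """Return {expert_index: weight} for the k highest-scoring experts."""
    order = sorted(range(len(router_logits)),
                   key=lambda i: router_logits[i], reverse=True)
    chosen = order[:k]
    # Softmax over the selected logits only, so the weights sum to 1.
    m = max(router_logits[i] for i in chosen)
    exps = {i: math.exp(router_logits[i] - m) for i in chosen}
    total = sum(exps.values())
    return {i: e / total for i, e in exps.items()}


def moe_forward(x, experts, router_logits, k=2):
    """Evaluate only the routed experts and mix their outputs."""
    weights = top_k_route(router_logits, k)
    calls = []
    out = 0.0
    for i, w in weights.items():
        calls.append(i)          # record which experts actually ran
        out += w * experts[i](x)
    return out, calls


# Four toy experts that just scale their input.
experts = [lambda x, s=s: s * x for s in (1.0, 2.0, 3.0, 4.0)]
out, calls = moe_forward(10.0, experts, router_logits=[0.1, 2.0, -1.0, 2.0], k=2)
print(out, sorted(calls))
```

All four experts contribute parameters to the model, but only two run for this input.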

State Space Models and Mamba

The bet that linear-time sequence models can challenge the Transformer’s quadratic attention — and the selective state space mechanism that made that bet credible.

KV Cache and Serving Optimization

How the field borrowed operating system concepts — virtual memory, paging, demand allocation — to solve the memory crisis of storing every token’s past for every concurrent request.
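
The paging analogy can be sketched as a block table: each request maps logical token positions to fixed-size physical blocks that are allocated only when needed and returned to a free pool when the request finishes. The class below is a toy bookkeeping model with hypothetical names; it tracks block ownership only and stores no actual key or value tensors.

```python
class PagedKVCache:
    """Toy block-table allocator for per-request KV storage."""

    def __init__(self, num_blocks, block_size):
        self.block_size = block_size
        self.free_blocks = list(range(num_blocks))
        self.block_tables = {}  # request id -> list of physical block ids
        self.lengths = {}       # request id -> number of tokens stored

    def append_token(self, request_id):
        """Reserve a slot for one more token; return (block id, offset)."""
        table = self.block_tables.setdefault(request_id, [])
        length = self.lengths.get(request_id, 0)
        if length == len(table) * self.block_size:
            # Current blocks are full: allocate a new one on demand.
            if not self.free_blocks:
                raise MemoryError("no free KV blocks")
            table.append(self.free_blocks.pop())
        self.lengths[request_id] = length + 1
        return table[length // self.block_size], length % self.block_size

    def release(self, request_id):
        """Return all of a finished request's blocks to the free pool."""
        self.free_blocks.extend(self.block_tables.pop(request_id, []))
        self.lengths.pop(request_id, None)


cache = PagedKVCache(num_blocks=4, block_size=2)
for _ in range(3):
    cache.append_token("a")          # 3 tokens need 2 blocks
cache.append_token("b")              # 1 token needs 1 block
free_while_running = len(cache.free_blocks)
cache.release("a")
free_after_release = len(cache.free_blocks)
print(free_while_running, free_after_release)
```

Because blocks need not be contiguous, a request never reserves memory for tokens it has not produced yet, which is the demand-allocation idea in miniature.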

Long-Context Techniques

The roughly 20,000-fold expansion of context windows from 512 tokens to 10 million, achieved through positional encoding tricks and memory-efficient attention, along with the hard-won realization that nominal context length and effective context length are not the same thing.

Normalization and Activation Evolution

The quiet shift from LayerNorm to RMSNorm and from ReLU to SwiGLU in Transformers: incremental refinements that each yield small gains but together define the “modern LLM recipe.”
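
For concreteness, the sketch below contrasts the two normalizations and shows the SwiGLU gating step. Learned gain and bias parameters are omitted, and `swiglu` takes the already-projected gate and up vectors as inputs rather than applying weight matrices; both are simplifying assumptions for illustration.

```python
import math


def layer_norm(x, eps=1e-6):
    """Center to zero mean, then scale to unit variance."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]


def rms_norm(x, eps=1e-6):
    """Scale by the root mean square only; no mean subtraction."""
    rms = math.sqrt(sum(v * v for v in x) / len(x) + eps)
    return [v / rms for v in x]


def silu(v):
    """SiLU (swish) activation: v * sigmoid(v)."""
    return v / (1.0 + math.exp(-v))


def swiglu(gate, up):
    """Elementwise SiLU(gate) * up, the gated part of a SwiGLU feed-forward."""
    return [silu(g) * u for g, u in zip(gate, up)]


x = [1.0, 2.0, 3.0, 4.0]
ln = layer_norm(x)
rn = rms_norm(x)
print(sum(ln) / len(ln))                 # close to 0: LayerNorm centers
print(sum(v * v for v in rn) / len(rn))  # close to 1: RMSNorm only rescales
print(swiglu([0.0, 2.0], [5.0, 1.0]))
```

RMSNorm skips the mean computation entirely, which is the kind of small saving that adds up across every layer.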

Speculative Decoding and Inference Speedups

How speculative decoding and related inference optimizations work around the autoregressive bottleneck of generating one token at a time, with reported speedups of 2-10x in production LLM serving and no loss in output quality.
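
The core loop can be sketched in its greedy form: a cheap draft model proposes several tokens, the target model checks them, and the first disagreement is replaced by the target's own choice. The two "models" below are toy next-token functions invented for illustration, and the per-token verification loop stands in for what would be a single batched forward pass in a real system. Sampling-based variants use a rejection-sampling rule instead of exact matching.

```python
def greedy(target_next, prompt, num_new):
    """Baseline: one target call per generated token."""
    tokens = list(prompt)
    for _ in range(num_new):
        tokens.append(target_next(tokens))
    return tokens


def speculative_greedy(target_next, draft_next, prompt, num_new, k=4):
    """Greedy speculative decoding; returns (tokens, verification passes)."""
    tokens = list(prompt)
    passes = 0
    while len(tokens) - len(prompt) < num_new:
        # 1. The draft model proposes k tokens autoregressively.
        proposal = []
        for _ in range(k):
            proposal.append(draft_next(tokens + proposal))
        # 2. The target verifies the proposal (counted as one pass).
        passes += 1
        accepted = []
        for tok in proposal:
            expected = target_next(tokens + accepted)
            if tok != expected:
                accepted.append(expected)  # replace the first mismatch
                break
            accepted.append(tok)
        else:
            # Every draft token matched: the target adds one bonus token.
            accepted.append(target_next(tokens + accepted))
        tokens.extend(accepted)
    return tokens[:len(prompt) + num_new], passes


# Toy models: the target counts upward mod 10; the draft errs after a 2.
target_next = lambda seq: (seq[-1] + 1) % 10
draft_next = lambda seq: 0 if seq[-1] == 2 else (seq[-1] + 1) % 10

baseline = greedy(target_next, [0], 12)
spec_tokens, passes = speculative_greedy(target_next, draft_next, [0], 12, k=4)
print(spec_tokens == baseline, passes)
```

The output matches plain greedy decoding token for token, while the number of verification passes is much smaller than the number of tokens produced. Any real speedup depends on how often the draft agrees with the target and on the relative cost of the two models.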