Architectural Innovation Threads
MoE, state space models, and architectural evolution.
Attention Mechanism Evolution
The journey from every attention head having its own memory to groups sharing compressed memory — a relentless drive to make attention cheaper without making it dumber.
Positional Encoding Evolution
How Transformers went from rigid, pre-set notions of word order to flexible, rotatable representations that let models generalize to sequences far longer than anything seen during training.
Flash Attention and Hardware-Aware Computing
The realization that attention’s bottleneck was not arithmetic but memory bandwidth, and the tiling algorithm that turned that insight into a 2-4x speedup with zero approximation.
Mixture of Experts Evolution
The three-decade journey from a theoretical gating idea to the dominant architecture for frontier models — getting more parameters without paying for all of them at inference time.
State Space Models and Mamba
The bet that linear-time sequence models can challenge the Transformer’s quadratic attention — and the selective state space mechanism that made that bet credible.
KV Cache and Serving Optimization
How the field borrowed operating system concepts — virtual memory, paging, demand allocation — to solve the memory crisis of storing every token’s past for every concurrent request.
Long-Context Techniques
The roughly 20,000-fold expansion of context windows from 512 tokens to 10 million, achieved through positional encoding tricks and memory-efficient attention, and the hard-won realization that nominal context length and effective context length are not the same thing.
Normalization and Activation Evolution
The quiet evolution of normalization (LayerNorm to RMSNorm) and activation functions (ReLU to SwiGLU) in Transformers: incremental refinements that each yield small gains but together define the “modern LLM recipe.”
Speculative Decoding and Inference Speedups
How speculative decoding and related techniques work around the autoregressive bottleneck of generating one token at a time, achieving 2-10x speedups in production LLM serving without sacrificing output quality.