The Transformer Revolution

“Attention Is All You Need” and the Transformer breakthrough.

Attention Is All You Need

Vaswani et al. (2017) introduced the Transformer, an architecture built entirely on self-attention that dropped recurrence and convolution, so all positions in a sequence can be processed in parallel during training. The large model reached 28.4 BLEU on WMT 2014 English-German translation after 3.5 days of training on 8 GPUs, and the design became the foundation for nearly every major language model that followed.
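To make the self-attention idea concrete, here is a minimal sketch of scaled dot-product attention, softmax(QK^T / sqrt(d_k)) V, in plain Python. It makes simplifying assumptions: the learned query, key, and value projections and the multi-head split are omitted, and the toy input values are purely illustrative.

```python
import math

def softmax(row):
    """Numerically stable softmax over a list of floats."""
    peak = max(row)
    exps = [math.exp(x - peak) for x in row]
    total = sum(exps)
    return [e / total for e in exps]

def scaled_dot_product_attention(queries, keys, values):
    """Compute softmax(Q K^T / sqrt(d_k)) V for lists of row vectors."""
    d_k = len(keys[0])
    outputs, weights = [], []
    for q in queries:
        # One score per key: dot product scaled by sqrt(d_k).
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k) for k in keys]
        w = softmax(scores)
        weights.append(w)
        # Each output row is a weighted average of the value rows.
        outputs.append([sum(wj * v[c] for wj, v in zip(w, values))
                        for c in range(len(values[0]))])
    return outputs, weights

# Toy input (assumed values): 3 tokens with 2-dimensional embeddings.
# Using x as queries, keys, and values skips the learned projections.
x = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
attn_out, attn_weights = scaled_dot_product_attention(x, x, x)
```

Every query row is scored against every key independently of the others, which is why the computation can run in parallel across positions rather than stepping through the sequence as a recurrent network does.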

GPT-1: Generative Pre-Training

GPT-1 (Radford et al., 2018) combined a decoder-only Transformer with unsupervised generative pre-training followed by supervised fine-tuning, establishing the paradigm that decoder-only models trained on next-token prediction could develop broad language understanding.
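The next-token prediction objective can be sketched in a few lines. The example below is an illustration, not GPT-1's training code: it shows how a token sequence is turned into shifted input/target pairs and how the average negative log-likelihood is computed from the probabilities a model assigns to the correct next tokens. The helper names and the example probabilities are assumptions made for this sketch.

```python
import math

def next_token_pairs(tokens):
    """Return (inputs, targets) where targets are the inputs shifted left by one."""
    return tokens[:-1], tokens[1:]

def causal_contexts(tokens):
    """List each (prefix, next_token) example an autoregressive model is trained on."""
    return [(tokens[:i], tokens[i]) for i in range(1, len(tokens))]

def mean_negative_log_likelihood(target_probs):
    """Average of -log p over the probabilities assigned to the correct next tokens."""
    return -sum(math.log(p) for p in target_probs) / len(target_probs)

tokens = ["the", "cat", "sat", "down"]
inputs, targets = next_token_pairs(tokens)
examples = causal_contexts(tokens)

# Hypothetical probabilities a model might assign to "cat", "sat", "down".
loss = mean_negative_log_likelihood([0.5, 0.25, 0.5])
```

Because the targets come from the text itself, no labels are needed, which is what makes pre-training on large unlabeled corpora possible.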

BERT: Bidirectional Encoder Representations from Transformers

BERT (Devlin et al., 2018) introduced masked language modeling and bidirectional pre-training with an encoder-only Transformer, achieving state-of-the-art results on 11 NLP tasks and triggering the “BERT-ification” of the field. It is arguably the most influential NLP paper since the Transformer itself.
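A rough sketch of the masked language modeling corruption step follows. It uses the split described in the BERT paper: about 15% of tokens are selected, and of those 80% become [MASK], 10% become a random token, and 10% are left unchanged. As a simplification, this sketch samples each position independently rather than selecting an exact 15% of positions, and the function name, vocabulary, and seed handling are assumptions for illustration.

```python
import random

MASK = "[MASK]"

def mask_tokens(tokens, vocab, mask_rate=0.15, seed=0):
    """Return (corrupted, labels); labels are None where no prediction is required."""
    rng = random.Random(seed)
    corrupted, labels = [], []
    for tok in tokens:
        if rng.random() < mask_rate:
            labels.append(tok)  # the model must recover the original token here
            roll = rng.random()
            if roll < 0.8:
                corrupted.append(MASK)               # 80%: replace with [MASK]
            elif roll < 0.9:
                corrupted.append(rng.choice(vocab))  # 10%: replace with a random token
            else:
                corrupted.append(tok)                # 10%: keep the original token
        else:
            corrupted.append(tok)
            labels.append(None)
    return corrupted, labels

sentence = "the cat sat on the mat".split()
vocabulary = sorted(set(sentence))
corrupted, labels = mask_tokens(sentence, vocabulary, mask_rate=0.5, seed=1)
```

Since the model sees the uncorrupted tokens on both sides of each selected position, it can use left and right context together, which is the sense in which the pre-training is bidirectional.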

GPT-2: Language Models Are Unsupervised Multitask Learners

GPT-2 (Radford et al., 2019) scaled the GPT-1 architecture to 1.5 billion parameters and demonstrated zero-shot task performance without any fine-tuning. Its staged, “too dangerous to release” rollout sparked one of the first major AI safety debates, and its results lent early support to the scaling hypothesis that larger models develop qualitatively new capabilities.

T5: The Text-to-Text Transfer Transformer

T5 (Raffel et al., 2019) cast every NLP task in a single text-to-text format, conducted one of the most systematic empirical studies of transfer learning design choices, and introduced the C4 dataset. Its results showed that encoder-decoder models could match or exceed decoder-only approaches when all tasks are treated as text generation.
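The text-to-text idea can be illustrated with a small formatting helper. The prefixes below follow the examples shown in the T5 paper ("translate English to German:", "summarize:", "cola sentence:", "stsb sentence1: ... sentence2: ..."); the function name and the task keys are hypothetical and exist only for this sketch.

```python
def to_text_to_text(task, **fields):
    """Render one task instance as a single input string with a task prefix."""
    if task == "translate_en_de":
        return "translate English to German: " + fields["text"]
    if task == "summarize":
        return "summarize: " + fields["text"]
    if task == "cola":
        return "cola sentence: " + fields["sentence"]
    if task == "stsb":
        return ("stsb sentence1: " + fields["sentence1"]
                + " sentence2: " + fields["sentence2"])
    raise ValueError(f"unknown task: {task}")

# Every (input, target) pair is plain text, including classification labels
# and regression scores, which are written out as strings.
pairs = [
    (to_text_to_text("translate_en_de", text="That is good."), "Das ist gut."),
    (to_text_to_text("cola", sentence="The course is jumping well."), "not acceptable"),
    (to_text_to_text("stsb",
                     sentence1="The rhino grazed on the grass.",
                     sentence2="A rhino is grazing in a field."), "3.8"),
]
```

With this framing, one model, one loss, and one decoding procedure serve translation, classification, and regression alike.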

XLNet: Permutation Language Modeling

XLNet (Yang et al., 2019) introduced permutation language modeling to capture bidirectional context without BERT’s [MASK] token corruption, combining the strengths of autoregressive and autoencoding approaches. It also integrated Transformer-XL’s segment-level recurrence for longer-range dependencies and outperformed BERT on 20 tasks before being eclipsed by simpler alternatives.
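One way to picture permutation language modeling is as an attention mask derived from a sampled factorization order: each position may attend only to positions that come earlier in that order, regardless of where they sit in the sentence. The sketch below builds such a mask. It is a simplified illustration that leaves out XLNet's two-stream attention and the Transformer-XL recurrence, and the function name and example order are assumptions.

```python
def permutation_mask(order):
    """mask[i][j] is True when position j precedes position i in the factorization order."""
    # rank[pos] is the step at which position pos is predicted.
    rank = {pos: step for step, pos in enumerate(order)}
    n = len(order)
    # A position never sees itself here, because its own token is the prediction target.
    return [[rank[j] < rank[i] for j in range(n)] for i in range(n)]

# Hypothetical factorization order over a 4-token sequence:
# predict position 2 first, then 0, then 3, then 1.
order = [2, 0, 3, 1]
mask = permutation_mask(order)

# The left-to-right order recovers an ordinary autoregressive mask.
left_to_right = permutation_mask([0, 1, 2, 3])
```

In this example position 1 is predicted last, so it attends to positions 0, 2, and 3, which lie on both sides of it. Averaged over many sampled orders, every position is trained with context from both directions while the objective stays autoregressive.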

Encoder-Only vs Decoder-Only vs Encoder-Decoder: The Three Architecture Paradigms

The Transformer spawned three architectural families, each with distinct strengths: encoder-only (BERT), decoder-only (GPT), and encoder-decoder (T5). The surprising dominance of the decoder-only paradigm in the scaling era is one of the most consequential developments in modern AI, though the story is more nuanced than “decoder-only won.”
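Much of the difference between the three families comes down to which positions each token is allowed to attend to. The sketch below builds the corresponding boolean attention masks in plain Python; the function names are hypothetical, and real implementations also handle padding, batching, and multiple heads.

```python
def bidirectional_mask(n):
    """Encoder-only style: every position attends to every position."""
    return [[True] * n for _ in range(n)]

def causal_mask(n):
    """Decoder-only style: position i attends to positions 0..i only."""
    return [[j <= i for j in range(n)] for i in range(n)]

def encoder_decoder_masks(src_len, tgt_len):
    """Encoder-decoder style: three attention patterns in one model."""
    return {
        # The encoder reads the whole source in both directions.
        "encoder_self": bidirectional_mask(src_len),
        # The decoder generates left to right.
        "decoder_self": causal_mask(tgt_len),
        # Each target position can look at every source position.
        "cross": [[True] * src_len for _ in range(tgt_len)],
    }

masks = encoder_decoder_masks(src_len=3, tgt_len=2)
```

The causal mask is what lets a decoder-only model be trained on next-token prediction and then generate text one token at a time, while the fully visible mask suits tasks where the whole input is available up front.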