Pre-Transformer Foundations

RNNs, LSTMs, seq2seq, and the path to attention.

Word Embeddings: Word2Vec and GloVe

Word2Vec, GloVe, and FastText represent each word as a dense vector learned from large text corpora, so that words used in similar contexts end up with similar vectors. This distributional approach underlies most of modern NLP.
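
To make "words as vectors" concrete, the sketch below compares vectors with cosine similarity, a common way to measure how close two embeddings are. The three-dimensional vectors are hand-picked for illustration only; they are assumptions, not output from Word2Vec, GloVe, or FastText, whose vectors typically have a few hundred dimensions.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Hypothetical toy vectors, chosen by hand (NOT learned embeddings).
toy_vectors = {
    "king":   [0.9, 0.8, 0.1],
    "queen":  [0.9, 0.7, 0.2],
    "banana": [0.1, 0.0, 0.9],
}

sim_related = cosine_similarity(toy_vectors["king"], toy_vectors["queen"])
sim_unrelated = cosine_similarity(toy_vectors["king"], toy_vectors["banana"])
print(sim_related > sim_unrelated)
```

With real embeddings the same comparison is what lets related words score higher than unrelated ones.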

Recurrent Neural Networks and LSTMs

RNNs processed language one token at a time, like reading left to right, and LSTMs mitigated their crippling memory problem with learned gates. LSTM-based models dominated NLP from roughly 2014 to 2017, until the Transformer removed their sequential bottleneck.
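
As a rough illustration of the learned gates, here is a minimal sketch of one LSTM step with scalar input, hidden state, and cell state. The `params` layout (one `(input weight, recurrent weight, bias)` triple per gate) and the function names are assumptions made for this example; real implementations use weight matrices over vectors.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def lstm_step(x, h_prev, c_prev, params):
    """One step of a scalar LSTM cell; returns the new (h, c)."""
    def pre_activation(name):
        w_x, w_h, b = params[name]  # hypothetical parameter layout
        return w_x * x + w_h * h_prev + b

    f = sigmoid(pre_activation("forget"))   # how much old cell state to keep
    i = sigmoid(pre_activation("input"))    # how much of the candidate to write
    o = sigmoid(pre_activation("output"))   # how much cell state to expose
    g = math.tanh(pre_activation("cand"))   # candidate cell value

    c = f * c_prev + i * g                  # additive cell-state update
    h = o * math.tanh(c)
    return h, c

def run_lstm(xs, params):
    """Process a sequence strictly in order; step t needs step t-1."""
    h, c = 0.0, 0.0
    hidden_states = []
    for x in xs:
        h, c = lstm_step(x, h, c, params)
        hidden_states.append(h)
    return hidden_states
```

When the forget gate is close to 1 and the input gate close to 0, the cell state passes through a step almost unchanged, which is how the cell can carry information across many tokens. The loop in `run_lstm` also shows the sequential dependency: each step must wait for the previous one.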

Sequence-to-Sequence Models

The Seq2Seq framework (Sutskever et al., 2014) established the encoder-decoder paradigm for mapping variable-length inputs to variable-length outputs, achieving breakthrough machine translation results while revealing the fixed-length bottleneck that would drive the invention of attention.

Attention Mechanism Origins

Bahdanau attention (2014) let decoders dynamically focus on different parts of the input sequence, solving the fixed-length bottleneck of Seq2Seq and laying the conceptual foundation for the Transformer’s self-attention.
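
The "dynamic focus" can be sketched as additive attention: score each encoder state against the current decoder state, normalize the scores with a softmax, and take the weighted sum as the context vector. This is a minimal pure-Python sketch; the names `W_s`, `W_h`, and `v` are labels chosen here for the learned parameters, and the values used with them would be placeholders rather than trained weights.

```python
import math

def matvec(M, vec):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * x for m, x in zip(row, vec)) for row in M]

def softmax(scores):
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def additive_attention(dec_state, enc_states, W_s, W_h, v):
    """Return (attention weights, context vector) for one decoder step."""
    proj_s = matvec(W_s, dec_state)
    scores = []
    for h in enc_states:
        proj_h = matvec(W_h, h)
        hidden = [math.tanh(a + b) for a, b in zip(proj_s, proj_h)]
        scores.append(sum(v_k * t_k for v_k, t_k in zip(v, hidden)))
    weights = softmax(scores)
    dim = len(enc_states[0])
    # Context is a weighted average of ALL encoder states, recomputed
    # at every decoder step instead of one fixed summary vector.
    context = [sum(w * h[d] for w, h in zip(weights, enc_states))
               for d in range(dim)]
    return weights, context
```

Because the weights are recomputed for every output token, the decoder is no longer limited to a single fixed-length summary of the input.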

ELMo and Contextual Embeddings

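ELMo (Peters et al., 2018) demonstrated that deep bidirectional LSTMs pre-trained on language modeling could generate context-dependent word representations, breaking the static embedding paradigm. Its representations were supplied to task-specific models as features, and its gains helped establish large-scale language-model pre-training as the starting point for downstream tasks.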

ULMFiT and Transfer Learning for NLP

ULMFiT (Howard & Ruder, 2018) demonstrated that a three-stage transfer learning recipe (pre-train a language model on a general corpus, fine-tune it on domain text, then fine-tune on the task) could match or beat state-of-the-art text classification systems trained from scratch, establishing the methodology that GPT and BERT would scale to transformative effect.

The Bottlenecks That Motivated Transformers

Three fundamental limitations of RNN-based NLP — sequential computation preventing parallelism, vanishing gradients limiting memory, and fixed-length bottleneck vectors losing information — created an urgent need for a fully parallel architecture, setting the stage for the Transformer.