Pre-trained Models for NLP

BERT, GPT, T5, and transfer learning for NLP.

BERT

BERT (Bidirectional Encoder Representations from Transformers) pre-trains a deep transformer encoder with two objectives: masked language modeling and next sentence prediction. The resulting bidirectional contextualized representations set new state-of-the-art results on 11 NLP tasks at release and led to a family of variants, including RoBERTa, ALBERT, and DistilBERT.
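To make the masked language modeling objective concrete, here is a minimal sketch of the token corruption step. It uses the 15% selection rate and the 80/10/10 split (mask, random token, unchanged) described in the BERT paper. The function name, the toy vocabulary, and the per-token coin flip are simplifying assumptions; a real implementation works on subword IDs and skips special tokens.

```python
import random

MASK_TOKEN = "[MASK]"


def mask_tokens(tokens, vocab, mask_prob=0.15, rng=None):
    """Return (corrupted_tokens, labels) for masked language modeling.

    labels[i] is the original token at positions the model must predict
    and None everywhere else. Hypothetical helper, for illustration only.
    """
    if rng is None:
        rng = random.Random()
    corrupted, labels = [], []
    for token in tokens:
        # Most positions are left alone and carry no prediction target.
        if rng.random() >= mask_prob:
            corrupted.append(token)
            labels.append(None)
            continue
        labels.append(token)
        roll = rng.random()
        if roll < 0.8:
            corrupted.append(MASK_TOKEN)         # 80%: replace with [MASK]
        elif roll < 0.9:
            corrupted.append(rng.choice(vocab))  # 10%: replace with a random token
        else:
            corrupted.append(token)              # 10%: keep the original token
    return corrupted, labels


if __name__ == "__main__":
    sentence = "the cat sat on the mat".split()
    toy_vocab = ["dog", "ran", "tree", "blue"]
    print(mask_tokens(sentence, toy_vocab, rng=random.Random(0)))
```

The model is trained to recover the original token only at positions where the label is not None, which is what forces it to use context on both sides of the gap.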

Cross-Lingual Transfer

Cross-lingual transfer uses multilingual pre-trained models to carry NLP capabilities from high-resource languages (primarily English) to low-resource languages without labeled data in the target language. Because these models share representations across languages, a model fine-tuned on one language can be applied zero-shot to 100+ others.

Domain Adaptation

Domain adaptation extends general-purpose pre-trained models to specialized domains such as biomedical, scientific, financial, legal, and clinical text. The usual recipe is continued pre-training on a domain corpus (some models are instead pre-trained from scratch on one), producing models like BioBERT, SciBERT, and FinBERT that typically outperform their general counterparts by 2-10% on in-domain tasks.

ELMo

ELMo (Embeddings from Language Models) produces deep contextualized word representations from a two-layer bidirectional LSTM language model, so the same word receives a different vector depending on its surrounding context. It was one of the first widely adopted pre-trained models and marks the step between static word embeddings and transformer-based models.
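The "deep" in ELMo's representations refers to combining every layer of the language model rather than using only the top one. A minimal sketch of that combination for a single token follows, assuming three layer vectors (the token embedding plus the two LSTM layers), softmax-normalized layer scores, and a scalar scale gamma as in the ELMo paper. The function names and toy vectors are illustrative assumptions.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def mix_layers(layer_vectors, layer_scores, gamma=1.0):
    """Collapse the per-layer vectors for one token into a single vector.

    layer_vectors: one vector per layer, all of the same dimension.
    layer_scores:  one learned scalar per layer (softmax-normalized here).
    gamma:         learned task-specific scale.
    """
    weights = softmax(layer_scores)
    dim = len(layer_vectors[0])
    return [
        gamma * sum(w * vec[i] for w, vec in zip(weights, layer_vectors))
        for i in range(dim)
    ]


if __name__ == "__main__":
    # Toy 2-dimensional vectors for one token at three layers.
    layers = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
    # Equal scores give a plain average of the three layers.
    print(mix_layers(layers, [0.0, 0.0, 0.0]))
```

In a downstream model the scores and gamma are trained together with the task, so each task can weight the layers differently.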

GPT for NLP Tasks

The GPT series showed how far autoregressive decoder-only transformers can go: GPT-1 combined generative pre-training with discriminative fine-tuning, GPT-2 showed unexpected zero-shot abilities, and GPT-3 introduced in-context learning from a few examples placed in the prompt. Together they demonstrated that a single model can perform many NLP tasks through prompting alone, without task-specific fine-tuning.
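In-context learning comes down to how the prompt is assembled: a task description, a few solved examples, and the unsolved query, all as one string that the model continues. The sketch below builds such a prompt for sentiment classification. The function name and the "Review:"/"Sentiment:" field labels are hypothetical choices, and no model is called.

```python
def build_few_shot_prompt(instruction, examples, query):
    """Assemble a few-shot prompt for an autoregressive language model.

    examples: list of (text, label) pairs shown to the model as demonstrations.
    query:    the new input whose label the model should generate.
    """
    lines = [instruction, ""]
    for text, label in examples:
        lines.append(f"Review: {text}")
        lines.append(f"Sentiment: {label}")
        lines.append("")
    # The prompt ends right where the model is expected to write the answer.
    lines.append(f"Review: {query}")
    lines.append("Sentiment:")
    return "\n".join(lines)


if __name__ == "__main__":
    demos = [
        ("A moving, beautifully shot film.", "positive"),
        ("Two hours I will never get back.", "negative"),
    ]
    print(build_few_shot_prompt(
        "Classify the sentiment of each review.",
        demos,
        "The plot dragged but the acting was superb.",
    ))
```

Passing an empty list of examples yields a zero-shot prompt; adding demonstrations changes the model's behavior without any update to its weights.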

Prompt-Based NLP

Prompt-based NLP reformulates traditional NLP tasks as cloze-style (fill-in-the-blank) or text generation problems. Classification, for example, becomes predicting the next or masked word, which lets a pre-trained language model apply the knowledge it already has with little or no labeled data.
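A cloze-style classifier needs two pieces: a template that wraps the input around a blank, and a verbalizer that maps each class label to a word that could fill the blank. The sketch below shows that structure. The template, the verbalizer words, and especially `toy_score_mask_word` are assumptions for illustration; the toy scorer only counts cue words and stands in for a real masked language model's score for a candidate word.

```python
TEMPLATE = "{text} Overall, it was [MASK]."
VERBALIZER = {"positive": "great", "negative": "terrible"}


def classify_with_cloze(text, template, verbalizer, score_mask_word):
    """Pick the label whose verbalizer word scores highest in the blank.

    score_mask_word(prompt, word) should return how plausible `word` is
    at the [MASK] position of `prompt`.
    """
    prompt = template.format(text=text)
    scores = {
        label: score_mask_word(prompt, word)
        for label, word in verbalizer.items()
    }
    return max(scores, key=scores.get)


def toy_score_mask_word(prompt, word):
    """Hypothetical stand-in for a masked LM: counts hand-picked cue words."""
    cues = {
        "great": ("loved", "excellent"),
        "terrible": ("hated", "boring"),
    }
    lowered = prompt.lower()
    return sum(lowered.count(cue) for cue in cues.get(word, ()))


if __name__ == "__main__":
    review = "I loved it, excellent acting."
    print(classify_with_cloze(review, TEMPLATE, VERBALIZER, toy_score_mask_word))
```

Swapping the toy scorer for a real model's probability at the masked position leaves the rest of the code unchanged, which is the point of the reformulation: the task is expressed in the form the model was pre-trained on.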

T5 and Text-to-Text

T5 (Text-to-Text Transfer Transformer) casts every NLP task, including classification, translation, summarization, and question answering, into a single text-to-text framework in which both inputs and outputs are text strings. This uniform format enabled a systematic comparison of pre-training objectives, architectures, and datasets at scales from 60M to 11B parameters.
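The text-to-text idea can be sketched as a function that turns any task example into an (input string, target string) pair, with a short prefix telling the model which task to perform. The prefixes below follow the style used in the T5 paper, but treat the exact strings, the function name, and the dictionary keys as illustrative assumptions.

```python
def to_text_pair(task, example):
    """Convert a task-specific example into (input_text, target_text)."""
    if task == "translation":
        return "translate English to German: " + example["en"], example["de"]
    if task == "summarization":
        return "summarize: " + example["document"], example["summary"]
    if task == "sentiment":
        # Even a class label becomes a plain word the model must generate.
        label_text = "positive" if example["label"] == 1 else "negative"
        return "sst2 sentence: " + example["sentence"], label_text
    raise ValueError(f"unknown task: {task}")


if __name__ == "__main__":
    print(to_text_pair("translation", {"en": "That is good.", "de": "Das ist gut."}))
    print(to_text_pair("sentiment", {"sentence": "A joy to watch.", "label": 1}))
```

Because every task ends up in the same shape, one model, one loss function, and one decoding procedure can be shared across all of them.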

Transfer Learning in NLP

Transfer learning changed NLP by replacing task-specific training from scratch with a two-stage paradigm: pre-train on massive unlabeled corpora, then fine-tune on a small task-specific dataset. This can reduce the labeled data needed by 10-100x and has set new state-of-the-art results on most standard benchmarks.