Input Representation

Tokenization, positional encoding, embeddings, and how text becomes numbers.

Tokenization

Tokenization is the process of breaking raw text into discrete units (tokens) that a language model can process numerically, and the choices made here affect sequence length, vocabulary size, and how well the model handles rare words, numbers, and non-English text.

Byte-Pair Encoding (BPE)

Byte-Pair Encoding is a data compression algorithm repurposed for tokenization that iteratively merges the most frequent pair of adjacent symbols to build a subword vocabulary from the bottom up.
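The merge loop can be sketched in a few lines. This is a minimal illustration on a toy word-frequency table, not a production tokenizer: the function names (most_frequent_pair, merge_pair, learn_bpe) and the toy counts are made up for the example, and ties are broken by first occurrence, which real implementations may handle differently.

```python
from collections import Counter


def most_frequent_pair(words):
    """Return the most frequent adjacent symbol pair, weighted by word frequency."""
    pairs = Counter()
    for symbols, freq in words.items():
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    # max() keeps the first pair seen when counts tie (an arbitrary choice here).
    return max(pairs, key=pairs.get) if pairs else None


def merge_pair(words, pair):
    """Replace every occurrence of `pair` with a single merged symbol."""
    merged = {}
    for symbols, freq in words.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        key = tuple(out)
        merged[key] = merged.get(key, 0) + freq
    return merged


def learn_bpe(corpus_counts, num_merges):
    """Learn up to `num_merges` merge rules from a {word: count} table."""
    # Start from single characters; byte-level variants start from bytes instead.
    words = {tuple(word): freq for word, freq in corpus_counts.items()}
    merges = []
    for _ in range(num_merges):
        pair = most_frequent_pair(words)
        if pair is None:
            break
        merges.append(pair)
        words = merge_pair(words, pair)
    return merges, words


# Hypothetical toy corpus: word -> number of occurrences.
toy_counts = {"low": 5, "lower": 2, "newest": 6, "widest": 3}
merges, segmented = learn_bpe(toy_counts, num_merges=3)
```

Each learned merge adds one entry to the vocabulary, so the number of merges directly controls the final vocabulary size.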

Vocabulary Design

Vocabulary design is the process of choosing how many and which tokens a language model should know, balancing compression efficiency against embedding size, multilingual coverage, and tokenization fairness across languages.
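The embedding-size side of this trade-off is simple arithmetic: the embedding table has one row per token, so its parameter count grows linearly with vocabulary size. The sketch below assumes a hypothetical helper name and illustrative sizes (32,000 and 128,000 tokens at width 4,096) that are not taken from any particular model.

```python
def embedding_parameter_count(vocab_size, d_model, tied_output=True):
    """Parameters spent on the vocabulary: one d_model-sized row per token.

    If the output projection does not share weights with the input table,
    the vocabulary costs roughly twice as many parameters.
    """
    input_table = vocab_size * d_model
    return input_table if tied_output else 2 * input_table


# Illustrative sizes only (assumptions, not measurements of a real model).
small = embedding_parameter_count(32_000, 4_096)
large = embedding_parameter_count(128_000, 4_096)
print(f"32k vocab:  {small:,} parameters")
print(f"128k vocab: {large:,} parameters")
```

A larger vocabulary usually shortens tokenized sequences, so the extra embedding parameters are paid for with fewer tokens per document.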

Special Tokens

Special tokens are reserved vocabulary entries that carry control signals rather than linguistic content, directing model behavior for tasks like indicating sequence boundaries, separating segments, and managing chat turn-taking.
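As a rough illustration, special tokens are often given fixed low IDs and spliced around the ordinary token IDs. The token strings and ID assignments below ("<pad>", "<bos>", "<eos>", "<sep>") are hypothetical; every tokenizer defines its own set.

```python
# Hypothetical reserved entries; real tokenizers choose their own names and IDs.
SPECIAL_TOKENS = {"<pad>": 0, "<bos>": 1, "<eos>": 2, "<sep>": 3}


def build_pair_input(segment_a, segment_b):
    """Wrap two token-ID segments with boundary and separator tokens."""
    return [
        SPECIAL_TOKENS["<bos>"],
        *segment_a,
        SPECIAL_TOKENS["<sep>"],
        *segment_b,
        SPECIAL_TOKENS["<eos>"],
    ]


def pad_to_length(ids, length):
    """Right-pad with <pad>; sequences already at or over `length` are returned unchanged."""
    return ids + [SPECIAL_TOKENS["<pad>"]] * (length - len(ids))


pair_ids = build_pair_input([10, 11], [12])
padded = pad_to_length(pair_ids, 8)
```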

Token Embeddings

Token embeddings convert discrete token IDs, which are arbitrary integers with no inherent meaning, into dense, continuous vectors in a high-dimensional space where, after training, geometric relationships such as distance and direction reflect semantic relationships.
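At its core an embedding layer is a lookup table indexed by token ID. The sketch below uses plain Python lists and a random initialization (standard deviation 0.02 is an assumed, commonly seen value); the helper names are made up for the example. A freshly initialized table carries no semantics, which only emerge through training.

```python
import math
import random


def init_embeddings(vocab_size, dim, seed=0):
    """Create a vocab_size x dim table of small random values (untrained)."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 0.02) for _ in range(dim)] for _ in range(vocab_size)]


def embed(token_ids, table):
    """Look up one vector per token ID; the same ID always maps to the same row."""
    return [table[t] for t in token_ids]


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors, a common embedding similarity measure."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm


table = init_embeddings(vocab_size=10, dim=4)
vectors = embed([3, 7, 3], table)
similarity = cosine_similarity(vectors[0], vectors[1])
```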

Positional Encoding

Positional encoding injects information about token order into the transformer architecture, whose self-attention layers would otherwise produce the same result for any reordering of the input tokens.
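One widely used scheme is the fixed sinusoidal encoding from the original transformer paper, in which each position maps to a vector of sines and cosines at geometrically spaced frequencies. It is shown here as one example among several approaches; the function name and the default base of 10000 follow that paper's formulation, and the vector is typically added to the token embedding.

```python
import math


def sinusoidal_encoding(position, d_model, base=10000.0):
    """Return the d_model-dimensional sinusoidal encoding for one position.

    Even dimensions hold sin(position / base^(i / d_model)) and the following
    odd dimension holds the matching cosine, where i is the even index.
    """
    encoding = []
    for i in range(0, d_model, 2):
        angle = position / (base ** (i / d_model))
        encoding.append(math.sin(angle))
        encoding.append(math.cos(angle))
    return encoding[:d_model]  # trims the extra value when d_model is odd


# Encodings for the first few positions of a sequence.
encodings = [sinusoidal_encoding(pos, d_model=8) for pos in range(4)]
```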

Rotary Position Embedding (RoPE)

Rotary Position Embedding encodes token positions by rotating query and key vectors in the attention mechanism, so that their dot product naturally depends on the relative distance between tokens rather than their absolute positions.
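The relative-position property can be checked numerically. The sketch below rotates adjacent pairs of dimensions by an angle proportional to the position, using the commonly cited base of 10000; some implementations pair dimensions differently (for example first half with second half), so treat the layout and the helper names as assumptions.

```python
import math


def rope_rotate(vec, position, base=10000.0):
    """Rotate each adjacent pair (vec[i], vec[i+1]) by position * base^(-i / d)."""
    d = len(vec)
    assert d % 2 == 0, "RoPE needs an even number of dimensions"
    out = []
    for i in range(0, d, 2):
        theta = position * base ** (-i / d)
        c, s = math.cos(theta), math.sin(theta)
        x, y = vec[i], vec[i + 1]
        out.extend([x * c - y * s, x * s + y * c])
    return out


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


# Arbitrary example query and key vectors.
q = [0.5, -1.0, 0.25, 2.0]
k = [1.5, 0.3, -0.7, 0.9]

# Same relative offset (2) at different absolute positions gives the same score.
score_near = dot(rope_rotate(q, 3), rope_rotate(k, 1))
score_far = dot(rope_rotate(q, 10), rope_rotate(k, 8))

# A different relative offset (1) changes the score.
score_other = dot(rope_rotate(q, 3), rope_rotate(k, 2))
```

Because rotations preserve length, the vectors keep their norms; only the angle between query and key changes with their relative offset.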

ALiBi (Attention with Linear Biases)

ALiBi replaces positional embeddings with fixed linear biases added directly to attention scores, penalizing each query-key pair in proportion to their distance; this adds no learned parameters and is designed to let models extrapolate to sequences longer than those seen in training without fine-tuning.
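A small sketch of the bias computation follows. The slopes use the geometric sequence described in the ALiBi paper for a power-of-two number of heads; folding the causal mask into the same matrix as negative infinity is a choice made for this example, and the function names are hypothetical.

```python
def alibi_slopes(num_heads):
    """Per-head slopes: a geometric sequence starting at 2^(-8 / num_heads).

    Assumes num_heads is a power of two; other head counts need extra handling.
    """
    start = 2 ** (-8 / num_heads)
    return [start ** (h + 1) for h in range(num_heads)]


def alibi_bias(seq_len, slope):
    """Bias added to the attention scores of one head.

    Query i attending to key j (j <= i) is penalized by slope * (i - j).
    Future positions get -inf here to fold in the causal mask (an assumption
    of this sketch; the mask is often applied separately).
    """
    return [
        [-slope * (i - j) if j <= i else float("-inf") for j in range(seq_len)]
        for i in range(seq_len)
    ]


slopes = alibi_slopes(8)
bias = alibi_bias(seq_len=4, slope=slopes[0])
```

The bias depends only on the distance i - j, so the same rule applies unchanged to positions beyond the training length.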

Context Window

The context window is the maximum number of tokens a transformer model can attend to in a single forward pass: the model’s “working memory” that determines how much text it can consider at once.
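When an input exceeds the window, something has to be dropped. One simple strategy, sketched below with a hypothetical helper name, keeps only the most recent tokens; real systems may instead summarize, chunk, or retrieve, so this is only one option.

```python
def fit_to_context(token_ids, max_tokens):
    """Keep the most recent `max_tokens` tokens, dropping the oldest ones."""
    if max_tokens <= 0:
        raise ValueError("max_tokens must be positive")
    return token_ids[-max_tokens:]


history = list(range(10))                      # stand-in for 10 token IDs
window = fit_to_context(history, max_tokens=4)  # only the last 4 survive
```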