Text Representation

Bag of words, TF-IDF, word embeddings, and contextual representations.

Bag of Words

Representing text as unordered word frequency vectors – simple, interpretable, and surprisingly effective for many classification and retrieval tasks.
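
As a minimal sketch of the idea, the snippet below builds count vectors over a shared vocabulary. The function name `bag_of_words` and the lowercase whitespace tokenization are illustrative assumptions, not a reference implementation.

```python
from collections import Counter

def bag_of_words(docs):
    """Return (vocabulary, count vectors) for a list of documents.

    Assumption: tokens are lowercase, whitespace-separated words.
    """
    tokenized = [doc.lower().split() for doc in docs]
    # Shared, sorted vocabulary so every vector uses the same column order.
    vocab = sorted({tok for toks in tokenized for tok in toks})
    vectors = []
    for toks in tokenized:
        counts = Counter(toks)
        vectors.append([counts[word] for word in vocab])
    return vocab, vectors

vocab, vectors = bag_of_words(["the cat sat", "the cat saw the dog"])
print(vocab)    # ['cat', 'dog', 'sat', 'saw', 'the']
print(vectors)  # [[1, 0, 1, 0, 1], [1, 1, 0, 1, 2]]
```

Because only counts are kept, "dog bites man" and "man bites dog" map to the same vector, which is the "unordered" part of the description.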

Contextual Embeddings

Word representations that change based on surrounding context – the same word gets different vectors in different sentences, resolving polysemy and capturing nuance.

Document Embeddings

Representing documents as dense vectors for retrieval, clustering, and classification at scale – from TF-IDF with dimensionality reduction to neural encoders for long text.

FastText

Subword-aware embeddings that represent each word as the sum of its character n-gram vectors, gracefully handling morphology and out-of-vocabulary words.
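
The subword idea can be sketched as follows: wrap the word in boundary markers, extract its character n-grams, and sum one vector per n-gram. The helper names, the randomly initialized `table`, its size, and the use of `zlib.crc32` as the bucket hash are all assumptions made for illustration; the real FastText library uses its own hashing scheme and trained vectors.

```python
import random
import zlib

def char_ngrams(word, n_min=3, n_max=6):
    """Character n-grams of a word wrapped in '<' and '>' boundary markers."""
    marked = f"<{word}>"
    return [
        marked[i:i + n]
        for n in range(n_min, n_max + 1)
        for i in range(len(marked) - n + 1)
    ]

def word_vector(word, table, n_min=3, n_max=6):
    """Sum the vectors of a word's n-grams, looked up by hashing into `table`."""
    dim = len(table[0])
    vec = [0.0] * dim
    for gram in char_ngrams(word, n_min, n_max):
        # crc32 is a deterministic stand-in hash (an assumption of this sketch).
        row = table[zlib.crc32(gram.encode("utf-8")) % len(table)]
        for k in range(dim):
            vec[k] += row[k]
    return vec

# Hypothetical, untrained n-gram table: 1000 buckets of 4-dimensional vectors.
rng = random.Random(0)
table = [[rng.uniform(-0.5, 0.5) for _ in range(4)] for _ in range(1000)]

print(char_ngrams("where", 3, 3))  # ['<wh', 'whe', 'her', 'ere', 're>']
oov_vec = word_vector("unseenword", table)  # works without a word-level entry
```

Any string yields n-grams, so a word never seen in training still receives a vector, which is how out-of-vocabulary words are handled.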

GloVe

Weighted least-squares factorization of global log co-occurrence counts, producing word vectors with linear substructures – bridging count-based and prediction-based embedding methods.

N-Gram Language Models

Predicting the next word from the previous N-1 words using maximum likelihood estimation – the statistical foundation of language modeling.
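
For N = 2, the maximum likelihood estimate reduces to a ratio of counts: P(word | prev) = count(prev, word) / count(prev). The sketch below assumes lowercase whitespace tokenization and hypothetical `<s>` / `</s>` sentence boundary tokens, and applies no smoothing, so unseen bigrams get probability zero.

```python
from collections import Counter, defaultdict

def train_bigram(sentences):
    """Return a function prob(word, prev) with unsmoothed MLE bigram estimates."""
    context_counts = Counter()
    bigram_counts = defaultdict(Counter)
    for sentence in sentences:
        # Boundary tokens are an assumption of this sketch.
        tokens = ["<s>"] + sentence.lower().split() + ["</s>"]
        for prev, word in zip(tokens, tokens[1:]):
            bigram_counts[prev][word] += 1
            context_counts[prev] += 1

    def prob(word, prev):
        if context_counts[prev] == 0:
            return 0.0  # context never observed
        return bigram_counts[prev][word] / context_counts[prev]

    return prob

prob = train_bigram(["the cat sat", "the cat ran", "the dog sat"])
print(prob("cat", "the"))  # 2 of the 3 occurrences of "the" are followed by "cat"
print(prob("dog", "cat"))  # 0.0, since this bigram never occurs
```

The zero for unseen bigrams is why practical n-gram models add smoothing or backoff on top of the raw estimate.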

Sentence Embeddings

Fixed-length vector representations of entire sentences – from simple word vector averaging to dedicated neural encoders trained for semantic similarity.
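
The simplest end of that range, word vector averaging, can be sketched in a few lines. The two-dimensional `toy_vectors` below are made up for illustration; in practice they would come from a trained embedding model.

```python
import math

def average_embedding(sentence, word_vectors):
    """Mean of the vectors of known words; None if no word is known."""
    tokens = [t for t in sentence.lower().split() if t in word_vectors]
    if not tokens:
        return None
    dim = len(next(iter(word_vectors.values())))
    return [sum(word_vectors[t][k] for t in tokens) / len(tokens) for k in range(dim)]

def cosine(u, v):
    """Cosine similarity between two vectors (0.0 if either has zero norm)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

# Hypothetical toy vectors, not trained values.
toy_vectors = {
    "cat": [1.0, 0.0], "dog": [0.8, 0.2],
    "car": [0.0, 1.0], "truck": [0.1, 0.9],
}

pets = average_embedding("cat dog", toy_vectors)
vehicles = average_embedding("car truck", toy_vectors)
similarity = cosine(pets, vehicles)
```

Averaging always yields a vector of the same length regardless of sentence length, but it discards word order, which is one motivation for dedicated sentence encoders.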

TF-IDF

Weighting words by term frequency times inverse document frequency to surface discriminative terms and suppress ubiquitous ones.
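
A minimal sketch of the weighting, assuming length-normalized term frequency and the unsmoothed variant idf(t) = log(N / df(t)); libraries commonly use smoothed or otherwise adjusted formulas, so exact values will differ. The function name `tf_idf` and the tokenization are illustrative choices.

```python
import math
from collections import Counter

def tf_idf(docs):
    """Return one {term: weight} dict per document."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: number of documents containing each term.
    df = Counter(tok for toks in tokenized for tok in set(toks))
    weights = []
    for toks in tokenized:
        tf = Counter(toks)
        weights.append({
            term: (count / len(toks)) * math.log(n_docs / df[term])
            for term, count in tf.items()
        })
    return weights

docs = ["the cat sat", "the dog barked", "the cat purred"]
weights = tf_idf(docs)
print(weights[0]["the"])  # 0.0, because "the" appears in every document
```

In the first document, "sat" (found in one document) outweighs "cat" (found in two), while the ubiquitous "the" is suppressed to zero.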

Word2Vec

Learning dense word vectors by predicting words within local context windows, via Skip-gram and CBOW – the embedding revolution that showed words with similar meanings occupy nearby points in vector space.
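
To make the training signal concrete, the sketch below generates the (center, context) pairs that Skip-gram learns from; CBOW uses the same windows but predicts the center word from its context. The function name and the fixed symmetric window are assumptions of this sketch, and the vector training itself is omitted.

```python
def skipgram_pairs(tokens, window=2):
    """Return (center, context) pairs for every token within `window` positions."""
    pairs = []
    for i, center in enumerate(tokens):
        start = max(0, i - window)
        end = min(len(tokens), i + window + 1)
        for j in range(start, end):
            if j != i:  # a word is not its own context
                pairs.append((center, tokens[j]))
    return pairs

pairs = skipgram_pairs(["the", "quick", "brown", "fox"], window=1)
print(pairs)
# [('the', 'quick'), ('quick', 'the'), ('quick', 'brown'),
#  ('brown', 'quick'), ('brown', 'fox'), ('fox', 'brown')]
```

Words that appear in similar contexts produce similar sets of pairs, which pushes their learned vectors toward each other.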