Text Representation
Bag of words, TF-IDF, word embeddings, and contextual representations.
Bag of Words
Representing text as vectors of word counts that ignore word order – simple, interpretable, and surprisingly effective for many classification and retrieval tasks.
Contextual Embeddings
Word representations that change based on surrounding context – the same word gets different vectors in different sentences, resolving polysemy and capturing nuance.
Document Embeddings
Representing documents as dense vectors for retrieval, clustering, and classification at scale – from TF-IDF with dimensionality reduction to neural encoders for long text.
FastText
Subword-aware embeddings that represent each word as the sum of its character n-gram vectors, gracefully handling morphology and out-of-vocabulary words.
GloVe
Global matrix factorization of word co-occurrence statistics producing word vectors with linear substructures – bridging count-based and prediction-based embedding methods.
N-Gram Language Models
Predicting the next word from the previous N-1 words, with probabilities estimated from corpus counts by maximum likelihood – the statistical foundation of language modeling.
Sentence Embeddings
Fixed-length vector representations of entire sentences – from simple word vector averaging to dedicated neural encoders trained for semantic similarity.
TF-IDF
Weighting words by term frequency times inverse document frequency to surface discriminative terms and suppress ubiquitous ones.
Word2Vec
Learning dense word vectors by predicting words from their local context with Skip-gram and CBOW – the approach that popularized the idea that words with similar meanings occupy nearby points in vector space.
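
To make the two count-based entries in this list (Bag of Words and TF-IDF) concrete, here is a minimal sketch in plain Python. The function names `bag_of_words` and `tf_idf`, the whitespace tokenizer, the three toy documents, and the unsmoothed `log(N / df)` form of inverse document frequency are all assumptions chosen for illustration; libraries typically use smoothed or normalized variants.

```python
import math
from collections import Counter


def bag_of_words(docs):
    """Return (vocabulary, count vectors) for lowercased, whitespace-split docs."""
    tokenized = [doc.lower().split() for doc in docs]
    # Sorted vocabulary gives every term a fixed column index.
    vocab = sorted({tok for toks in tokenized for tok in toks})
    vectors = []
    for toks in tokenized:
        counts = Counter(toks)
        vectors.append([counts[term] for term in vocab])
    return vocab, vectors


def tf_idf(vectors):
    """Weight raw counts by idf = log(N / df), one common unsmoothed variant."""
    n_docs = len(vectors)
    n_terms = len(vectors[0])
    # Document frequency: number of documents containing each term.
    df = [sum(1 for vec in vectors if vec[j] > 0) for j in range(n_terms)]
    idf = [math.log(n_docs / d) for d in df]
    return [[vec[j] * idf[j] for j in range(n_terms)] for vec in vectors]


# Hypothetical toy corpus, for illustration only.
docs = ["the cat sat", "the dog sat", "the cat ran"]
vocab, counts = bag_of_words(docs)
weights = tf_idf(counts)
```

In this toy corpus "the" occurs in every document, so its weight is zero everywhere, while "dog" and "ran" each occur in a single document and receive the largest weights. This is the suppress-the-ubiquitous, surface-the-discriminative behaviour the TF-IDF entry describes.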