Text Preprocessing
Tokenization, normalization, stemming, and text cleaning.
Data Annotation and Labeling
Creating labeled NLP datasets through systematic annotation schemes, measuring inter-annotator agreement, managing crowdsourced labor, and applying active learning to minimize the high cost of human labeling.
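As a rough illustration of measuring inter-annotator agreement, the sketch below computes Cohen's kappa for two annotators. The function name `cohen_kappa` and the sample labels are made up for this example; kappa is one common agreement statistic, not necessarily the one a given project would choose.

```python
from collections import Counter

def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators who labeled the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two equal-length, non-empty label lists")
    n = len(labels_a)
    # Observed agreement: fraction of items given identical labels.
    p_o = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement expected from each annotator's label frequencies.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    p_e = sum(freq_a[k] * freq_b[k] for k in freq_a) / (n * n)
    if p_e == 1.0:
        return 1.0  # both annotators used one identical label throughout
    return (p_o - p_e) / (1 - p_e)

# Illustrative labels only.
annotator_1 = ["pos", "pos", "neg", "neg"]
annotator_2 = ["pos", "neg", "neg", "neg"]
kappa = cohen_kappa(annotator_1, annotator_2)
print(round(kappa, 3))
```

A value near 1 suggests strong agreement beyond chance, while a value near 0 suggests the annotators agree about as often as chance alone would predict.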
Regular Expressions for NLP
Pattern matching as the workhorse of text preprocessing: defining formal string patterns with a concise syntax to search, extract, validate, and transform text in NLP pipelines.
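A minimal sketch of the extract, validate, and transform uses with Python's standard `re` module. The sample text and the patterns are assumptions chosen for illustration; the email pattern in particular is deliberately simple and far from RFC-complete.

```python
import re

text = "Contact: alice@example.com, bob@example.org. Released 2023-05-17."

# Extract: a simple (not RFC-complete) email pattern.
EMAIL = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")
emails = EMAIL.findall(text)

# Validate: the whole string must look like an ISO-style date.
is_date = re.fullmatch(r"\d{4}-\d{2}-\d{2}", "2023-05-17") is not None

# Transform: collapse any run of whitespace into a single space.
tidy = re.sub(r"\s+", " ", "too   many\t spaces").strip()

print(emails, is_date, tidy)
```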
Sentence Segmentation
Detecting sentence boundaries in running text despite the ambiguity of periods, which serve triple duty as sentence terminators, abbreviation markers, and decimal points.
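The period ambiguity can be sketched with a small rule-based splitter. The abbreviation list and the function `split_sentences` are hypothetical and intentionally tiny; production segmenters rely on much larger lexicons or trained models.

```python
import re

# Hypothetical, deliberately short abbreviation list.
ABBREVIATIONS = {"dr", "mr", "mrs", "ms", "prof", "etc", "vs"}

def split_sentences(text):
    sentences, start = [], 0
    # Candidate boundary: ., ! or ? followed by whitespace and a capital.
    # A period inside a number such as 3.50 never qualifies, because the
    # next character is a digit rather than whitespace.
    for match in re.finditer(r"[.!?](?=\s+[A-Z])", text):
        end = match.end()
        last_token = text[start:end].split()[-1]
        # Skip periods that terminate a known abbreviation.
        if match.group() == "." and last_token[:-1].lower() in ABBREVIATIONS:
            continue
        sentences.append(text[start:end].strip())
        start = end
    tail = text[start:].strip()
    if tail:
        sentences.append(tail)
    return sentences

result = split_sentences("Dr. Smith paid $3.50 for coffee. It was cold.")
print(result)
```

This heuristic still fails on cases such as an abbreviation that genuinely ends a sentence, which is part of why statistical methods are often preferred.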
Stemming and Lemmatization
Reducing words to base forms (stemming by crude affix removal, lemmatization by linguistically informed morphological analysis) to collapse inflectional variants into shared representations.
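The contrast can be sketched with two toy functions. Both are assumptions for illustration: `crude_stem` is a naive suffix stripper (not the Porter algorithm or any published stemmer), and `lookup_lemma` stands in for real morphological analysis with a tiny hand-written dictionary.

```python
# Suffixes are tried in this order; the first match wins.
SUFFIXES = ("ing", "ed", "es", "s")

def crude_stem(word):
    """Strip one suffix if at least three characters would remain."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

# Hypothetical mini-lexicon mapping inflected forms to dictionary forms.
LEMMAS = {"ran": "run", "running": "run", "mice": "mouse", "studies": "study"}

def lookup_lemma(word):
    """Return the dictionary form if known, else the word unchanged."""
    return LEMMAS.get(word, word)

for w in ["running", "studies", "ran", "mice"]:
    print(w, crude_stem(w), lookup_lemma(w))
```

The stemmer yields non-words such as "runn" and "studi" and misses irregular forms like "ran", whereas the lookup returns valid base forms but only for words it knows.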
Stopword Removal
Filtering high-frequency function words (the, is, at, which) that carry little semantic content to reduce noise and dimensionality in frequency-based text representations, though modern neural models often benefit from retaining them.
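A minimal sketch of the filtering step, assuming a small hand-picked stopword set; the names `STOPWORDS` and `remove_stopwords` are illustrative, and real lists are longer and task-dependent.

```python
# Illustrative stopword set, not a standard list.
STOPWORDS = {"the", "is", "at", "which", "a", "an", "of", "on", "and"}

def remove_stopwords(tokens):
    """Drop tokens whose lowercase form is in the stopword set."""
    return [t for t in tokens if t.lower() not in STOPWORDS]

tokens = "The cat is on the mat".split()
filtered = remove_stopwords(tokens)
print(filtered)
```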
Text Cleaning and Noise Removal
Handling the messy reality of real-world text (stripping HTML, fixing encoding errors, correcting OCR artifacts, normalizing social media conventions, deduplicating, and detecting language) before any NLP model can be reliably applied.
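Two of these steps, HTML stripping and exact deduplication, can be sketched with the standard library alone. The helpers `clean` and `dedupe` are hypothetical names, and the regex-based tag removal is a naive stand-in for a proper HTML parser.

```python
import html
import re

TAG = re.compile(r"<[^>]+>")

def clean(raw):
    """Remove tags, decode entities, and collapse whitespace."""
    text = TAG.sub(" ", raw)          # naive tag removal, not a full parser
    text = html.unescape(text)        # e.g. &amp; becomes &
    text = re.sub(r"\s+", " ", text)  # collapse runs of whitespace
    return text.strip()

def dedupe(docs):
    """Keep the first occurrence of each exact string, preserving order."""
    seen, unique = set(), []
    for doc in docs:
        if doc not in seen:
            seen.add(doc)
            unique.append(doc)
    return unique

raw_docs = ["<p>Fish &amp; chips</p>", "Fish &amp;  chips", "<b>Tea</b>"]
cleaned = dedupe([clean(d) for d in raw_docs])
print(cleaned)
```

Cleaning before deduplicating matters here: the first two inputs differ as raw strings but become identical once markup and spacing are normalized.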
Text Normalization
Standardizing text through case folding, Unicode normalization, accent removal, and format unification so that superficially different strings map to a single canonical form before downstream processing.
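One possible ordering of these steps, sketched with Python's `unicodedata` module. The function `normalize` and its `strip_accents` flag are assumptions for this example; whether to use NFKC and whether to remove accents depends on the language and task.

```python
import unicodedata

def normalize(text, strip_accents=True):
    # NFKC unifies compatibility forms (ligatures, full-width letters)
    # and composes base letters with their combining marks.
    text = unicodedata.normalize("NFKC", text)
    # casefold() is a more aggressive, Unicode-aware lower().
    text = text.casefold()
    if strip_accents:
        # Decompose, then drop the combining marks.
        decomposed = unicodedata.normalize("NFD", text)
        text = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return text

# Precomposed and decomposed spellings map to the same string.
a = normalize("Caf\u00e9")
b = normalize("CAFE\u0301")
print(a, b, a == b)
```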
Tokenization in NLP
Splitting raw text into discrete units (words and sentences) using rule-based, statistical, or hybrid methods, with strategies that vary dramatically across languages and domains.
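A rule-based word tokenizer can be sketched as a single regular expression. The pattern below is an assumption tuned to simple English text (decimal numbers, words with an internal apostrophe, standalone punctuation) and would not suit languages written without spaces between words.

```python
import re

TOKEN = re.compile(
    r"\d+(?:\.\d+)?"    # integers and decimals such as 3.14
    r"|\w+(?:'\w+)?"    # words, optionally with one internal apostrophe
    r"|[^\w\s]"         # any single punctuation mark or symbol
)

def tokenize(text):
    """Return the tokens matched left to right; whitespace is discarded."""
    return TOKEN.findall(text)

tokens = tokenize("Don't panic, it's 3.14!")
print(tokens)
```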