Speech & Multimodal NLP
Speech recognition, synthesis, and multimodal language processing.
Automatic Speech Recognition
Converting spoken language into written text by mapping acoustic signals through feature extraction, acoustic modeling, and language decoding – progressing from HMM-GMM pipelines to end-to-end neural systems like Whisper.
Document Understanding
Extracting and understanding information from visually rich documents (forms, invoices, reports, tables) by jointly modeling text content, visual appearance, and spatial layout – using models such as the LayoutLM family that learn multimodal document representations.
Image Captioning
Generating natural language descriptions of images by bridging visual perception and language generation – from CNN-LSTM pipelines to attention-based and transformer models, now increasingly subsumed by vision-language foundation models.
Multimodal NLP
Combining language with vision, audio, and other modalities to build systems that perceive and reason across multiple information channels – from contrastive pre-training (CLIP) to multimodal large language models (GPT-4V, Gemini).
Speech Language Models
Unified models that process both text and speech as token sequences, enabling zero-shot voice cloning and speech generation, and pointing toward universal language models that handle any modality.
Text-to-Speech
Generating natural-sounding human speech from written text, progressing from concatenative and parametric methods to neural systems (Tacotron, WaveNet, FastSpeech) that approach human-level naturalness.
Visual Question Answering
Answering natural language questions about images by jointly reasoning over visual and textual information – a fundamental test of multimodal understanding that exposes the tension between genuine reasoning and superficial language bias.