Speech & Multimodal NLP

Speech recognition, synthesis, and multimodal language processing.

Automatic Speech Recognition

Converting spoken language into written text by mapping acoustic signals through feature extraction, acoustic modeling, and language decoding – progressing from HMM-GMM pipelines to end-to-end neural systems like Whisper.
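
The feature-extraction stage mentioned above can be sketched in a few lines. This is only an illustration under stated assumptions: the 16 kHz sample rate, the 25 ms frame with a 10 ms hop, and the helper names `frame_signal` and `log_energy` are all chosen here, and per-frame log energy stands in for the mel filterbank or MFCC features a real recognizer would compute.

```python
import math

def frame_signal(samples, frame_len, hop):
    """Split a 1-D signal into overlapping frames; a trailing partial frame is dropped."""
    return [samples[i:i + frame_len]
            for i in range(0, len(samples) - frame_len + 1, hop)]

def log_energy(frame, eps=1e-10):
    """Log of the summed squared amplitude; eps avoids log(0) on silent frames."""
    return math.log(sum(x * x for x in frame) + eps)

# Assumed setup: 16 kHz audio, 0.1 s of a 440 Hz tone followed by 0.1 s of silence.
sample_rate = 16000
tone = [math.sin(2 * math.pi * 440 * n / sample_rate) for n in range(1600)]
silence = [0.0] * 1600

# 400 samples = 25 ms frame, 160 samples = 10 ms hop at 16 kHz.
frames = frame_signal(tone + silence, frame_len=400, hop=160)
features = [log_energy(f) for f in frames]
```

Each entry in `features` is one number per frame; an acoustic model would consume a sequence of such per-frame vectors rather than raw samples.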

Document Understanding

Extracting and understanding information from visually rich documents (forms, invoices, reports, tables) by jointly modeling text content, visual appearance, and spatial layout – powered by the LayoutLM family and multimodal document representations.
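
To make "spatial layout" concrete: LayoutLM-style models take a bounding box per token, with coordinates rescaled to a 0-1000 grid so that pages of different sizes share one coordinate space. The sketch below shows that rescaling only; the helper name `normalize_bbox`, the sample words, and the 612 x 792 page size are assumptions made for illustration.

```python
def normalize_bbox(bbox, page_width, page_height, scale=1000):
    """Rescale an (x0, y0, x1, y1) box in page units to a 0..scale integer grid."""
    x0, y0, x1, y1 = bbox
    return [
        int(scale * x0 / page_width),
        int(scale * y0 / page_height),
        int(scale * x1 / page_width),
        int(scale * y1 / page_height),
    ]

# Hypothetical OCR output: (word, box in page units) on an assumed 612 x 792 page.
words = [("Invoice", (50, 40, 150, 60)), ("Total:", (400, 700, 460, 720))]
page_w, page_h = 612, 792

# Pair each word with its normalized box, the form a layout-aware model would consume.
layout_inputs = [(w, normalize_bbox(b, page_w, page_h)) for w, b in words]
```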

Image Captioning

Generating natural language descriptions of images by bridging visual perception and language generation – from CNN-LSTM pipelines to attention-based and transformer models, a task now increasingly subsumed by vision-language foundation models.

Multimodal NLP

Combining language with vision, audio, and other modalities to build systems that perceive and reason across multiple information channels – from contrastive pre-training (CLIP) to multimodal large language models (GPT-4V, Gemini).
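
The contrastive pre-training objective behind CLIP can be sketched as a symmetric cross-entropy over image-text similarities: each image should score highest against its own caption, and each caption against its own image. The version below is a minimal pure-Python illustration, not the reference implementation; the function names are chosen here, and the fixed temperature of 0.07 is an assumed value (in CLIP the temperature is a learned parameter).

```python
import math

def l2_normalize(v):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def cross_entropy_row(logits, target):
    """Negative log-softmax of the target entry, computed stably."""
    m = max(logits)
    log_sum = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum - logits[target]

def contrastive_loss(image_embs, text_embs, temperature=0.07):
    """Symmetric contrastive loss; pair k of images and texts is the positive."""
    imgs = [l2_normalize(v) for v in image_embs]
    txts = [l2_normalize(v) for v in text_embs]
    n = len(imgs)
    # logits[r][c] = scaled cosine similarity between image r and text c.
    logits = [[sum(a * b for a, b in zip(i, t)) / temperature for t in txts]
              for i in imgs]
    # Image-to-text: each row should peak on its diagonal entry.
    loss_i2t = sum(cross_entropy_row(logits[k], k) for k in range(n)) / n
    # Text-to-image: same check over the columns.
    cols = [[logits[r][c] for r in range(n)] for c in range(n)]
    loss_t2i = sum(cross_entropy_row(cols[k], k) for k in range(n)) / n
    return (loss_i2t + loss_t2i) / 2

# Toy 2-D embeddings: matched pairs give a low loss, swapped pairs a high one.
aligned = contrastive_loss([[1, 0], [0, 1]], [[1, 0], [0, 1]])
shuffled = contrastive_loss([[1, 0], [0, 1]], [[0, 1], [1, 0]])
```

Training pushes the loss toward the `aligned` case by moving matching image and text embeddings together and mismatched ones apart.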

Speech Language Models

Unified models that process both text and speech as token sequences, enabling speech generation and zero-shot voice cloning and pointing toward universal language models that handle any modality.

Text-to-Speech

Generating natural-sounding human speech from written text, progressing from concatenative and parametric methods to neural systems (Tacotron, WaveNet, FastSpeech) that approach human-level naturalness.

Visual Question Answering

Answering natural language questions about images by jointly reasoning over visual and textual information – a fundamental test of multimodal understanding that exposes the tension between genuine reasoning and superficial language bias.