🔬

Module 11 · 27 concepts

Advanced & Emerging

Cutting-edge research and emerging techniques.

In-Context Learning

In-context learning (ICL) is the emergent ability of large language models to learn new tasks from examples provided in the prompt at inference time, without any gradient updates or parameter changes.

Multimodal Models

Multimodal models extend LLMs beyond text by connecting vision encoders, audio processors, and other modality-specific modules to a language model backbone, enabling AI systems that can see, hear, and reason across different types of input simultaneously.

Vision-Language Models (VLMs)

Vision-Language Models integrate visual perception with language understanding in a single system, enabling AI to see, reason about, and describe the visual world – and increasingly, to act on it through Vision-Language-Action architectures.

State Space Models & Mamba

State Space Models offer a fundamentally different approach to sequence modeling that processes tokens in linear time through learned recurrent state updates, with Mamba’s selective mechanism making them the most credible alternative to Transformers.
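As a minimal sketch of the recurrent state update described above, the following scalar toy runs a linear state space recurrence (h_t = a * h_{t-1} + b * x_t, y_t = c * h_t) in a single pass over the sequence. The name `ssm_scan` and the fixed scalar parameters are illustrative assumptions. Mamba's selective mechanism, which makes the parameters depend on the input, is not modelled here.

```python
from typing import List


def ssm_scan(xs: List[float], a: float, b: float, c: float) -> List[float]:
    """Run a scalar linear state space recurrence over xs in O(len(xs)) time."""
    h = 0.0  # hidden state carried from token to token
    ys = []
    for x in xs:
        h = a * h + b * x   # state update
        ys.append(c * h)    # readout
    return ys
```

Each token costs a constant amount of work because only the fixed-size state `h` is carried forward, which is where the linear-time behaviour comes from.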

Mechanistic Interpretability

Mechanistic interpretability is the scientific effort to reverse-engineer neural networks at the level of individual computations, identifying the specific features models represent, the circuits that connect them, and how these give rise to complex behaviors like reasoning, factual recall, and potentially deception.

Representation Engineering and Activation Steering

Representation engineering controls LLM behavior at inference time by identifying interpretable directions in the model’s internal activation space (e.g., an “honesty direction” or “refusal direction”) and adding or subtracting these steering vectors from the model’s hidden states during forward passes – modifying behavior without any weight updates or fine-tuning.
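The add-or-subtract step can be sketched on plain lists of floats. This example assumes the steering direction is estimated as a difference of mean activations between two contrasting prompt sets, which is one common choice and not the only one. The names `steering_direction` and `steer` are hypothetical, and no real model is involved.

```python
from typing import List

Vector = List[float]


def mean_vector(rows: List[Vector]) -> Vector:
    """Element-wise mean of equally sized activation vectors."""
    return [sum(col) / len(rows) for col in zip(*rows)]


def steering_direction(positive: List[Vector], negative: List[Vector]) -> Vector:
    """Assumed estimator: mean(positive activations) - mean(negative activations)."""
    pos, neg = mean_vector(positive), mean_vector(negative)
    return [p - n for p, n in zip(pos, neg)]


def steer(hidden: Vector, direction: Vector, alpha: float) -> Vector:
    """Add alpha * direction to a hidden state; a negative alpha subtracts it."""
    return [h + alpha * d for h, d in zip(hidden, direction)]
```

In a real model the same addition would be applied to a chosen layer's hidden states during the forward pass, while the weights stay untouched.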

Model Merging

Model merging combines the weights of two or more separately trained models into a single model without any additional training, exploiting the surprising geometric structure of neural network loss landscapes to blend capabilities from different fine-tuned variants.
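The simplest form of merging is linear interpolation of matching parameters, sketched below. This is an assumption-laden toy: weights are plain dictionaries of float lists, the helper name `merge_linear` is hypothetical, and both models are assumed to share an architecture and parameter names. More elaborate merging schemes are not shown.

```python
from typing import Dict, List

Weights = Dict[str, List[float]]


def merge_linear(a: Weights, b: Weights, t: float = 0.5) -> Weights:
    """Return (1 - t) * a + t * b for every parameter; no training involved."""
    if a.keys() != b.keys():
        raise ValueError("models must have identical parameter names")
    merged = {}
    for name in a:
        if len(a[name]) != len(b[name]):
            raise ValueError(f"shape mismatch for parameter {name!r}")
        merged[name] = [(1 - t) * x + t * y for x, y in zip(a[name], b[name])]
    return merged
```

With `t = 0.5` this is plain weight averaging; moving `t` toward 0 or 1 biases the result toward one parent model.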

Multi-Token Prediction

Multi-token prediction trains language models to predict several future tokens simultaneously from each position, producing richer internal representations and enabling faster inference through speculative self-decoding.

Context Window Extension

Context window extension encompasses the techniques that have stretched LLM context lengths from 512 tokens to over 1 million, working around both the quadratic cost of attention and the limited position range seen in training through positional encoding manipulation, architectural modifications, and distributed computation strategies.

Test-Time Compute & Inference-Time Scaling

Test-time compute is the paradigm shift from making models bigger to making models think harder, allocating additional computation at inference to explore reasoning paths, verify answers, and dramatically improve performance on complex problems.

Inference-Time Scaling Laws

Performance on reasoning tasks improves predictably as you spend more compute at inference time – through repeated sampling, extended chain-of-thought, tree search, and verifier-guided selection – enabling smaller models to match larger ones on hard problems.
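Two of the strategies named above, repeated sampling and verifier-guided selection, reduce to very small aggregation steps once the samples exist. The sketch below assumes the candidate answers have already been generated; `majority_vote` and `best_of_n` are hypothetical helper names, and the verifier is any caller-supplied scoring function.

```python
from collections import Counter
from typing import Callable, List


def majority_vote(answers: List[str]) -> str:
    """Repeated sampling: return the most frequent final answer."""
    return Counter(answers).most_common(1)[0][0]


def best_of_n(candidates: List[str], verifier: Callable[[str], float]) -> str:
    """Verifier-guided selection: return the candidate the verifier scores highest."""
    return max(candidates, key=verifier)
```

Drawing more samples raises the inference cost linearly, which is the compute being traded for accuracy.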

Reasoning Models (o1/R1 Paradigm)

Reasoning models perform extended internal deliberation before answering, trading additional inference-time compute for dramatically improved accuracy on math, code, and science tasks.

Tree-of-Thought (ToT)

Tree-of-Thought extends chain-of-thought reasoning by exploring multiple reasoning paths simultaneously in a branching tree structure, enabling backtracking from dead ends and systematic search for the best solution – treating reasoning as a search problem rather than a linear narrative.

Neurosymbolic AI

Neurosymbolic AI combines the pattern recognition and fluency of neural networks with the precision, verifiability, and logical consistency of symbolic systems, aiming to create AI that can both understand natural language and reason with formal guarantees.

Compound AI Systems

Compound AI systems combine LLMs with retrievers, tools, code execution, verifiers, and other models into integrated architectures that exceed the capabilities of any single model, representing the shift from “better models” to “better systems” as the primary path to improved AI performance.

Mixture of Agents (MoA)

Mixture of Agents uses multiple LLMs collaboratively in layered rounds – each model refining the outputs of others – to achieve aggregate quality that exceeds any individual model, including frontier systems.

Agentic RAG

Agentic RAG replaces the rigid “retrieve then generate” pipeline with an AI agent that dynamically reasons about what to retrieve, when to retrieve, whether the retrieved information is sufficient, and how to synthesize multi-step retrieval results – transforming RAG from a fixed pipeline into an adaptive, iterative reasoning process.

Corrective RAG (CRAG)

Corrective RAG adds a critical evaluation step after retrieval to assess whether retrieved documents are actually relevant to the query, then takes corrective actions – query rewriting, web search fallback, or knowledge refinement – when retrieval quality is insufficient, preventing the generation phase from hallucinating over irrelevant context.

Self-RAG (Self-Reflective Retrieval-Augmented Generation)

Self-RAG trains a single language model to adaptively decide when to retrieve external knowledge, evaluate whether retrieved passages are relevant, assess whether its own generation is supported by the evidence, and judge the overall utility of its response – all through special reflection tokens learned during training, eliminating the need for separate retriever and critic components.

GraphRAG (Graph-Based Retrieval-Augmented Generation)

GraphRAG augments standard RAG by constructing a knowledge graph of entities and relationships from the document corpus, applying hierarchical community detection, and generating community summaries at multiple levels of abstraction – enabling both precise local retrieval and global sensemaking queries that standard vector-based RAG fundamentally cannot answer.

RAPTOR (Recursive Abstractive Processing for Tree-Organized Retrieval)

RAPTOR builds a hierarchical tree index over a document corpus by recursively clustering text chunks using UMAP and Gaussian mixture models, then summarizing each cluster with an LLM – creating a multi-resolution representation where leaf nodes are original text chunks and higher nodes are increasingly abstract summaries, enabling retrieval at any level of detail from granular facts to high-level themes.

HyDE (Hypothetical Document Embeddings)

HyDE bridges the semantic gap between queries and documents by using an LLM to generate a hypothetical answer document, then embedding that hypothetical document (instead of the original query) as the retrieval vector – leveraging the insight that a fabricated-but-plausible answer is closer in embedding space to real answers than the question itself is.
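The flow described above can be sketched with caller-supplied components. Everything here is a stand-in: `generate` represents an LLM call, `embed` represents an embedding model, and `index` is a plain list of (document, vector) pairs, not a real vector store. The function name `hyde_search` is hypothetical.

```python
from typing import Callable, List, Tuple

Vector = List[float]


def dot(u: Vector, v: Vector) -> float:
    return sum(a * b for a, b in zip(u, v))


def hyde_search(
    query: str,
    generate: Callable[[str], str],    # assumed LLM: query -> hypothetical answer
    embed: Callable[[str], Vector],    # assumed embedding model
    index: List[Tuple[str, Vector]],   # (document text, document vector) pairs
    top_k: int = 1,
) -> List[str]:
    """Retrieve using the embedding of a hypothetical answer, not of the query."""
    hypothetical = generate(query)
    query_vec = embed(hypothetical)
    ranked = sorted(index, key=lambda item: dot(query_vec, item[1]), reverse=True)
    return [doc for doc, _ in ranked[:top_k]]
```

The only change from ordinary dense retrieval is the line that embeds `hypothetical` where the raw query would normally go.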

ColBERT and Late Interaction Retrieval

ColBERT (Contextualized Late Interaction over BERT) replaces the standard single-vector representation of queries and documents with multi-vector representations – one embedding per token – and computes relevance through a “MaxSim” operation that finds the best-matching document token for each query token and sums those maxima, approaching cross-encoder accuracy while keeping much of a bi-encoder’s speed because document embeddings can be precomputed.
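The MaxSim operation is compact enough to write out directly. This sketch assumes token embeddings are already computed and uses the dot product as the similarity; the function name `maxsim_score` is illustrative.

```python
from typing import List

Vector = List[float]


def dot(u: Vector, v: Vector) -> float:
    return sum(a * b for a, b in zip(u, v))


def maxsim_score(query_vecs: List[Vector], doc_vecs: List[Vector]) -> float:
    """For each query token take its best-matching document token, then sum."""
    return sum(max(dot(q, d) for d in doc_vecs) for q in query_vecs)
```

For example, with query tokens `[[1, 0], [0, 1]]` and document tokens `[[1, 0], [0.5, 0.5]]`, the per-token maxima are 1.0 and 0.5, so the score is 1.5.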

Reranking and Cross-Encoders

Reranking is a second-stage retrieval technique where a more powerful model (typically a cross-encoder) re-scores and reorders the initial retrieval results from a fast first-stage retriever (bi-encoder or BM25), dramatically improving precision by jointly processing each query-document pair rather than comparing independent embeddings – making two-stage “retrieve then rerank” the standard architecture for production retrieval systems.
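The two-stage shape can be sketched independently of any particular model. Both scorers below are caller-supplied placeholders: `first_stage` stands in for a cheap bi-encoder or BM25 score, and `cross_scorer` stands in for a cross-encoder that reads the query and document together. The function name and the `k_first` / `k_final` parameters are assumptions for illustration.

```python
from typing import Callable, List

Scorer = Callable[[str, str], float]  # (query, document) -> relevance score


def retrieve_then_rerank(
    query: str,
    docs: List[str],
    first_stage: Scorer,
    cross_scorer: Scorer,
    k_first: int,
    k_final: int,
) -> List[str]:
    """Shortlist with the cheap scorer, then reorder the shortlist with the costly one."""
    shortlist = sorted(docs, key=lambda d: first_stage(query, d), reverse=True)[:k_first]
    return sorted(shortlist, key=lambda d: cross_scorer(query, d), reverse=True)[:k_final]
```

The expensive scorer runs on only `k_first` documents, not the whole corpus, which is what keeps the pipeline practical.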

Late Chunking

Late chunking reverses the traditional “chunk then embed” pipeline by first passing the entire document through the embedding model’s transformer layers to produce contextualized token representations, then chunking those rich token embeddings into segment-level vectors – preserving cross-chunk context that traditional chunking destroys.
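The pooling step that happens after the full-document forward pass can be sketched as follows. It assumes the contextualized token embeddings are already available and that chunk boundaries are given as (start, end) token index pairs; mean pooling is assumed as the aggregation, and `late_chunk` is a hypothetical name.

```python
from typing import List, Tuple

Vector = List[float]


def late_chunk(token_embeddings: List[Vector], spans: List[Tuple[int, int]]) -> List[Vector]:
    """Mean-pool contextualized token embeddings over each [start, end) span."""
    chunks = []
    for start, end in spans:
        window = token_embeddings[start:end]
        if not window:
            raise ValueError(f"empty span ({start}, {end})")
        chunks.append([sum(col) / len(window) for col in zip(*window)])
    return chunks
```

Because every token vector was computed with the whole document in view, each pooled chunk vector carries context from outside its own span.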

Matryoshka Representation Learning (MRL)

Matryoshka Representation Learning trains embedding models so that any prefix of an embedding vector is itself a valid, useful embedding, enabling a single model to produce embeddings at multiple dimensionalities with graceful quality degradation – like Russian nesting dolls where each inner doll is a complete, functional representation.
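On the consumer side, using a shorter embedding amounts to keeping a prefix and re-normalizing it, as in the sketch below. This only yields useful vectors when the model was trained with MRL; the helper name `truncate_embedding` is an assumption.

```python
import math
from typing import List


def truncate_embedding(vec: List[float], dim: int) -> List[float]:
    """Keep the first `dim` coordinates and rescale the result to unit length."""
    prefix = vec[:dim]
    norm = math.sqrt(sum(x * x for x in prefix))
    if norm == 0.0:
        return prefix  # nothing to normalize
    return [x / norm for x in prefix]
```

Shorter prefixes cut storage and similarity-search cost roughly in proportion to the dimension, at the price of some retrieval quality.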

Query Decomposition and Multi-Step Retrieval

Query decomposition breaks complex user queries into simpler sub-queries that can each be answered through targeted retrieval, while multi-step retrieval iteratively retrieves information where each step’s findings inform the next – together enabling RAG systems to answer complex, multi-faceted, and multi-hop questions that single-shot retrieval fundamentally cannot handle.