Inference & Deployment

Serving, decoding strategies, caching, and quantization.

KV Cache

The KV cache stores the key and value tensors already computed by the attention mechanism so that each decoding step reuses them instead of recomputing them, cutting the attention cost of generating each new token from O(n^2) to O(n) – at the cost of memory that grows linearly with sequence length.
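
As a rough illustration of the reuse described above, the following sketch compares single-head decoding with and without a cache. The `project` function is a hypothetical stand-in for learned key/value projections, and each token's own vector doubles as its query; neither detail comes from any particular model.

```python
import math

def attend(q, keys, values):
    """Softmax attention of one query over a list of keys/values."""
    scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(len(q)) for k in keys]
    m = max(scores)
    weights = [math.exp(s - m) for s in scores]
    total = sum(weights)
    return [sum(w * v[d] for w, v in zip(weights, values)) / total
            for d in range(len(values[0]))]

def project(x, scale):
    """Hypothetical stand-in for a learned K or V projection."""
    return [scale * xi for xi in x]

def decode_with_cache(tokens):
    keys, values, outputs, projections = [], [], [], 0
    for x in tokens:
        # Only the newest token is projected; older K/V come from the cache.
        keys.append(project(x, 0.5))
        values.append(project(x, 2.0))
        projections += 1
        outputs.append(attend(x, keys, values))
    return outputs, projections

def decode_without_cache(tokens):
    outputs, projections = [], 0
    for t in range(1, len(tokens) + 1):
        # Every step re-projects the whole prefix.
        keys = [project(x, 0.5) for x in tokens[:t]]
        values = [project(x, 2.0) for x in tokens[:t]]
        projections += t
        outputs.append(attend(tokens[t - 1], keys, values))
    return outputs, projections
```

Both functions return the same attention outputs; only the number of key/value projections differs (n with the cache versus n(n+1)/2 without), which is the saving the cache buys in exchange for holding every past key and value in memory.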

Flash Attention

Flash Attention is an IO-aware attention algorithm that restructures the computation to keep data in the GPU’s fast on-chip SRAM rather than repeatedly reading and writing to slow high-bandwidth memory (HBM), reducing memory usage from O(N^2) to O(N) and delivering 2-4x wall-clock speedups – while computing exact attention, not an approximation.
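
The reason tiling can stay exact is the online-softmax rescaling trick. The sketch below shows that trick for a single query in plain Python; it is only an illustration of the arithmetic, not the GPU kernel, and it omits the usual 1/sqrt(d) scaling. The function names and `block_size` parameter are assumptions made for this example.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def attention_reference(q, keys, values):
    """Standard softmax attention: materializes every score at once."""
    scores = [dot(q, k) for k in keys]
    m = max(scores)
    weights = [math.exp(s - m) for s in scores]
    total = sum(weights)
    return [sum(w * v[d] for w, v in zip(weights, values)) / total
            for d in range(len(values[0]))]

def attention_tiled(q, keys, values, block_size=2):
    """Processes keys/values block by block, keeping only O(1) running state."""
    running_max = float("-inf")
    normalizer = 0.0
    acc = [0.0] * len(values[0])
    for start in range(0, len(keys), block_size):
        k_blk = keys[start:start + block_size]
        v_blk = values[start:start + block_size]
        scores = [dot(q, k) for k in k_blk]
        new_max = max(running_max, max(scores))
        # Rescale what was accumulated under the old maximum.
        correction = math.exp(running_max - new_max)
        normalizer *= correction
        acc = [a * correction for a in acc]
        for s, v in zip(scores, v_blk):
            w = math.exp(s - new_max)
            normalizer += w
            acc = [a + w * vd for a, vd in zip(acc, v)]
        running_max = new_max
    return [a / normalizer for a in acc]
```

For any block size, `attention_tiled` matches `attention_reference` up to floating-point rounding, which is the sense in which the tiled computation is exact rather than approximate.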

PagedAttention

PagedAttention applies OS-style virtual memory paging to the KV cache, breaking each sequence’s key-value data into fixed-size blocks that are dynamically allocated and mapped through per-sequence block tables, eliminating most of the 60-80% of KV cache memory that earlier serving systems lose to fragmentation and over-reservation and enabling 2-4x higher serving throughput.
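
The block-table idea can be sketched with a toy allocator. The `BlockAllocator` class and its method names are hypothetical and chosen for this example; a real serving engine would also store the tensors themselves and handle preemption and sharing.

```python
class BlockAllocator:
    """Toy paged KV allocator: fixed-size blocks plus per-sequence block tables."""

    def __init__(self, num_blocks, block_size):
        self.block_size = block_size
        self.free = list(range(num_blocks))   # physical block ids
        self.block_tables = {}                # seq_id -> [physical block ids]
        self.lengths = {}                     # seq_id -> tokens stored

    def append_token(self, seq_id):
        """Reserve a slot for one more token; returns (physical_block, offset)."""
        table = self.block_tables.setdefault(seq_id, [])
        length = self.lengths.get(seq_id, 0)
        if length == len(table) * self.block_size:
            # All blocks for this sequence are full: allocate one more on demand.
            if not self.free:
                raise MemoryError("no free KV blocks")
            table.append(self.free.pop())
        self.lengths[seq_id] = length + 1
        return table[length // self.block_size], length % self.block_size

    def release(self, seq_id):
        """Return all of a finished sequence's blocks to the free pool."""
        self.free.extend(self.block_tables.pop(seq_id, []))
        self.lengths.pop(seq_id, None)
```

Because blocks are handed out only when a sequence actually grows into them, the unused space per sequence is bounded by one partially filled block, rather than by a worst-case reservation for the maximum output length.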

Throughput vs. Latency Trade-offs

Throughput (how many total tokens the system produces per second) and latency (how quickly an individual user receives their response) are fundamentally competing objectives in LLM serving, and every deployment architecture involves conscious decisions about where to sit on this trade-off curve.
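
A toy cost model makes the tension concrete. The constants below (`fixed_ms`, `per_seq_ms`) are invented purely for illustration and do not describe any real hardware; the only assumption that matters is that a decode step has a fixed cost plus a cost that grows with batch size.

```python
def decode_step_ms(batch_size, fixed_ms=20.0, per_seq_ms=0.5):
    """Hypothetical time for one decode step over a batch (made-up constants)."""
    return fixed_ms + per_seq_ms * batch_size

def throughput_tokens_per_s(batch_size):
    """Each step emits one token per sequence in the batch."""
    return batch_size * 1000.0 / decode_step_ms(batch_size)

def per_token_latency_ms(batch_size):
    """Each user waits one full step for their next token."""
    return decode_step_ms(batch_size)
```

Under this model, growing the batch from 1 to 32 raises total tokens per second substantially because the fixed cost is amortized, while every individual user waits longer for each token.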

Continuous Batching

Continuous batching (also called iteration-level or in-flight batching) inserts new requests and retires completed sequences at every decoding step rather than waiting for an entire batch to finish, eliminating idle GPU cycles and achieving 10-23x higher throughput than static batching.
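
The difference between the two scheduling policies can be seen in a small simulation that counts decode steps. This is a sketch under simplifying assumptions: every step costs the same, each request is described only by its output length, and the function names are made up for this example.

```python
from collections import deque

def static_batching_steps(lengths, batch_size):
    """Each batch runs until its longest sequence finishes."""
    steps = 0
    for i in range(0, len(lengths), batch_size):
        steps += max(lengths[i:i + batch_size])
    return steps

def continuous_batching_steps(lengths, batch_size):
    """Free slots are refilled from the queue before every decode step."""
    waiting = deque(lengths)
    running = []  # tokens still to generate for each active sequence
    steps = 0
    while waiting or running:
        while waiting and len(running) < batch_size:
            running.append(waiting.popleft())
        steps += 1
        # Every active sequence emits one token; finished ones leave the batch.
        running = [r - 1 for r in running if r > 1]
    return steps
```

With output lengths `[8, 1, 1, 1, 1, 1]` and two slots, static batching needs 10 steps while continuous batching needs 8, because short requests flow through the slot next to the long one instead of waiting for it.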

Model Serving Frameworks

Model serving frameworks handle the complex orchestration of loading LLM weights onto GPUs, managing memory, batching requests, and delivering generated tokens to users – and the choice of framework can mean a 10-23x difference in throughput for the same hardware.

KV Cache Compression

KV cache compression encompasses quantization, eviction, and token merging techniques that reduce the memory footprint of stored key-value states by 2-8x, making long-context inference (128K+ tokens) practically deployable on existing GPU hardware.
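
Of the three families, eviction is the simplest to sketch. The function below keeps the first few cache entries plus a sliding window of recent ones, in the spirit of attention-sink plus sliding-window schemes; the name `evict` and its parameters are hypothetical, and real methods typically choose what to drop using attention statistics.

```python
def evict(cache, num_sink=2, window=4):
    """Keep the first num_sink entries and the most recent window entries."""
    if len(cache) <= num_sink + window:
        return list(cache)
    return list(cache[:num_sink]) + list(cache[len(cache) - window:])
```

For a ten-entry cache with the defaults, positions 0, 1, 6, 7, 8, and 9 survive, so memory stays bounded at `num_sink + window` entries however long the sequence grows.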

Prefix Caching

Prefix caching stores the computed KV cache states for shared prompt prefixes (system prompts, few-shot examples, RAG context) so that subsequent requests sharing the same prefix skip recomputation entirely, delivering up to 90% cost savings and 85% reduction in time-to-first-token.
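
One way to picture the lookup is a table keyed by token prefixes at block granularity. The `PrefixCache` class below is a hypothetical sketch: it stores placeholder strings where a real system would keep KV tensors, and it keys each block by the full prefix up to that block so that a block is reused only when everything before it matches too.

```python
class PrefixCache:
    """Toy block-level prefix cache; values are placeholders for KV blocks."""

    def __init__(self, block_size=4):
        self.block_size = block_size
        self.store = {}  # tuple(prefix tokens) -> placeholder for that block's KV

    def prefill(self, tokens):
        """Return (reused_tokens, computed_tokens) for one prompt."""
        n_full = len(tokens) - len(tokens) % self.block_size
        pos = 0
        # Walk forward while the prefix ending at the next block is cached.
        while pos < n_full and tuple(tokens[:pos + self.block_size]) in self.store:
            pos += self.block_size
        reused = pos
        # "Compute" and cache the remaining full blocks.
        for end in range(pos + self.block_size, n_full + 1, self.block_size):
            self.store[tuple(tokens[:end])] = f"kv[{end - self.block_size}:{end}]"
        return reused, len(tokens) - reused
```

If two prompts share an eight-token system prompt, the first call computes everything and the second reuses those eight tokens and computes only its own suffix.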

Prefill-Decode Disaggregation

Prefill-decode disaggregation separates the compute-bound prefill phase (processing input tokens in parallel) and the memory-bandwidth-bound decode phase (generating tokens one at a time) onto different, independently optimized hardware pools, improving cost-efficiency by 1.5-2x and eliminating cross-phase interference.

Speculative Decoding

Speculative decoding uses a small, fast “draft” model to guess multiple tokens ahead, then verifies all guesses in a single forward pass of the large “target” model, achieving 2-3x faster generation while provably preserving the target model’s output distribution (under greedy decoding, the tokens are identical to standard decoding).
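
The guarantee rests on an accept-or-resample rule applied to each drafted token. The sketch below shows that rule for a single position, with `p` as the target model's distribution and `q` as the draft model's; both are plain lists here, and the function names are chosen for this example. A full implementation would draft several tokens and verify them in one batched pass.

```python
import random

def residual(p, q):
    """Normalized max(0, p - q): where to resample from after a rejection."""
    diff = [max(0.0, pi - qi) for pi, qi in zip(p, q)]
    total = sum(diff)
    return [d / total for d in diff] if total > 0 else list(p)

def speculative_token(p, q, rng):
    """Draft proposes from q; the target accepts it or substitutes a correction."""
    draft = rng.choices(range(len(q)), weights=q)[0]
    if rng.random() < min(1.0, p[draft] / q[draft]):
        return draft, True
    return rng.choices(range(len(p)), weights=residual(p, q))[0], False

def emitted_distribution(p, q):
    """Exact probability of each token being emitted by speculative_token."""
    accept = [min(pi, qi) for pi, qi in zip(p, q)]  # q(x) * min(1, p(x)/q(x))
    reject_mass = 1.0 - sum(accept)
    return [a + reject_mass * r for a, r in zip(accept, residual(p, q))]
```

`emitted_distribution(p, q)` works out to `p` itself for any draft distribution `q`, which is why a poor draft model costs speed (more rejections) but never changes what the target model would have produced.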

Medusa and Parallel Decoding

Medusa adds multiple lightweight prediction heads to a base LLM, enabling parallel token generation and tree-structured verification to achieve 2-3x speedups without a separate draft model.

Temperature, Top-K, and Top-P Sampling

Sampling strategies control how an LLM selects the next token from its predicted probability distribution, ranging from deterministic (always pick the most likely) to highly creative (sample from a broad set of candidates), with each method offering a different trade-off between coherence and diversity.
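
The three knobs compose naturally: temperature reshapes the distribution, then top-k and top-p trim its tail before sampling. The following is a minimal sketch over a plain list of logits; the function names and the convention that `temperature == 0` means greedy decoding are assumptions of this example rather than any specific library's API.

```python
import math
import random

def softmax(logits, temperature=1.0):
    """Lower temperature sharpens the distribution; higher flattens it."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def filter_probs(probs, top_k=None, top_p=None):
    """Keep the top_k most likely tokens and/or the smallest set with mass >= top_p."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    if top_k is not None:
        order = order[:top_k]
    if top_p is not None:
        kept, cumulative = [], 0.0
        for i in order:
            kept.append(i)
            cumulative += probs[i]
            if cumulative >= top_p:
                break
        order = kept
    total = sum(probs[i] for i in order)
    return {i: probs[i] / total for i in order}  # renormalized survivors

def sample(logits, temperature=1.0, top_k=None, top_p=None, rng=random):
    if temperature == 0:
        # Treated here as greedy decoding: always the most likely token.
        return max(range(len(logits)), key=lambda i: logits[i])
    dist = filter_probs(softmax(logits, temperature), top_k, top_p)
    ids = list(dist)
    return rng.choices(ids, weights=[dist[i] for i in ids])[0]
```

Calling `sample(logits, temperature=0)` is deterministic, while something like `sample(logits, temperature=0.7, top_p=0.9)` draws from a trimmed, renormalized distribution.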

Constrained Decoding

Constrained decoding forces LLM output to conform to formal grammars (JSON schemas, regex patterns, context-free grammars) by masking invalid tokens at each decoding step, providing a 100% structural validity guarantee and eliminating retry loops for malformed output.
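
Token masking can be sketched with a tiny state machine. The vocabulary, the `GRAMMAR` table (which accepts only `{"ok":true}` or `{"ok":false}`), and the `score_fn` stand-in for a model are all hypothetical; production systems compile a schema or grammar into a much larger automaton over the real tokenizer vocabulary.

```python
VOCAB = ["{", "}", '"ok"', ":", "true", "false", "hello"]

# Hypothetical grammar: state -> {allowed token: next state}
GRAMMAR = {
    "start": {"{": "key"},
    "key": {'"ok"': "colon"},
    "colon": {":": "value"},
    "value": {"true": "close", "false": "close"},
    "close": {"}": "done"},
}

def constrained_greedy(score_fn):
    """Greedy decoding where tokens outside the grammar are never candidates."""
    state, out = "start", []
    while state != "done":
        allowed = GRAMMAR[state]
        scores = score_fn(out)  # stand-in for model logits, keyed by token
        # Masking: the argmax runs over allowed tokens only.
        best = max(allowed, key=lambda tok: scores[tok])
        out.append(best)
        state = allowed[best]
    return "".join(out)
```

Even if the scoring function strongly prefers an invalid token such as `hello` at every step, the result still parses as JSON, because that token is never eligible.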

Quantization

Quantization reduces the numerical precision of a model’s weights (and sometimes activations) from 16-bit floating point to 8-bit or 4-bit integers, shrinking memory footprint by 2-4x and accelerating inference, with surprisingly small losses in quality because neural networks are remarkably tolerant of reduced precision.
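
The simplest scheme, symmetric absmax quantization with one scale per tensor, is sketched below. This is one of several possible schemes, chosen here only because it is short; practical methods usually use per-channel or per-group scales and more careful calibration.

```python
def quantize_int8(weights):
    """Map floats to integers in [-127, 127] using a single absmax scale."""
    scale = max(abs(w) for w in weights) / 127.0 or 1.0  # avoid a zero scale
    q = [max(-127, min(127, round(w / scale))) for w in weights]
    return q, scale

def dequantize(q, scale):
    """Recover approximate floats from the integers and the scale."""
    return [qi * scale for qi in q]
```

Each weight then needs one byte instead of two, and the round-trip error per weight is at most about half of `scale`, which hints at why quality often degrades only slightly.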

Knowledge Distillation

Knowledge distillation trains a smaller “student” model to mimic the behavior of a larger “teacher” model by learning from the teacher’s soft probability distributions rather than just hard labels, transferring rich knowledge about inter-class relationships that the raw training data alone cannot convey.
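
A common way to write the training objective blends a softened teacher-matching term with the usual hard-label loss. The sketch below follows that classic formulation for a single example; the `temperature` and `alpha` defaults are arbitrary choices for illustration, and real training would compute this over batches with an autodiff framework.

```python
import math

def softmax(logits, temperature=1.0):
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def distillation_loss(student_logits, teacher_logits, label,
                      temperature=2.0, alpha=0.5):
    """alpha * T^2 * KL(teacher || student at temperature T) + (1 - alpha) * CE."""
    p_teacher = softmax(teacher_logits, temperature)
    p_student = softmax(student_logits, temperature)
    kl = sum(pt * math.log(pt / ps)
             for pt, ps in zip(p_teacher, p_student) if pt > 0)
    hard = -math.log(softmax(student_logits)[label])  # cross-entropy on the label
    return alpha * temperature ** 2 * kl + (1 - alpha) * hard
```

The soft term vanishes only when the student reproduces the teacher's whole distribution, including the relative probabilities of the wrong classes, which is the extra signal hard labels alone do not carry.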

Distillation for Reasoning

Distillation for reasoning transfers chain-of-thought reasoning capabilities from large teacher models to smaller student models by training on the teacher’s detailed reasoning traces – enabling results like DeepSeek-R1-Distill-Qwen-7B scoring 55.5% on AIME 2024 and DeepSeek-R1-Distill-Qwen-14B achieving 93.9% on MATH-500, with the critical finding that distillation outperforms direct RL training at small model scales.

Prompt Compression / LLMLingua

Prompt compression reduces input token count while preserving semantic meaning, using perplexity-based importance scoring or trained classifiers to cut costs by up to 75% and accelerate prefill by 2-4x.
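
The perplexity-based variant can be sketched as "keep the tokens a small language model found most surprising". In the example below the per-token log-probabilities are assumed to be supplied by such a model; the `compress` function and its `keep_ratio` parameter are hypothetical, and actual systems such as LLMLingua are considerably more elaborate.

```python
def compress(tokens, logprobs, keep_ratio=0.5):
    """Keep the lowest-log-probability (most informative) tokens, in original order."""
    n_keep = max(1, round(len(tokens) * keep_ratio))
    # Most surprising tokens first: lowest log-probability.
    ranked = sorted(range(len(tokens)), key=lambda i: logprobs[i])
    keep = sorted(ranked[:n_keep])  # restore original order
    return [tokens[i] for i in keep]
```

Predictable filler words tend to have high log-probability and are dropped first, so the shortened prompt retains the content-bearing tokens.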

Model Routing / LLM Routers

Model routing dynamically selects which LLM to use for each query based on estimated complexity and cost, achieving 40-60% cost reduction while maintaining quality by sending only hard queries to expensive frontier models.
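
At its core a router is a difficulty estimate plus a threshold. The sketch below uses a deliberately crude keyword-and-length heuristic; the signal words, the model names `small-model` and `frontier-model`, and the threshold are all placeholders, and real routers typically rely on a trained classifier instead.

```python
def estimate_difficulty(query):
    """Hypothetical heuristic score in [0, 1]; a trained classifier would go here."""
    signals = ("prove", "derive", "debug", "step by step", "optimize")
    score = min(len(query.split()) / 100.0, 0.5)            # longer is harder
    score += 0.5 * any(s in query.lower() for s in signals)  # reasoning cues
    return score

def route(query, threshold=0.5, cheap="small-model", strong="frontier-model"):
    """Send only queries estimated as hard to the expensive model."""
    return strong if estimate_difficulty(query) >= threshold else cheap
```

Raising `threshold` sends more traffic to the cheap model and lowers cost, at the risk of under-serving hard queries; tuning that single number is where the cost-quality trade-off is made.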