Parameter-Efficient Fine-Tuning
LoRA, adapters, and methods for efficient model adaptation.
Full Fine-Tuning vs PEFT: When to Use What
Full fine-tuning updates every parameter in a model, which maximizes adaptability but carries a high compute and memory cost. PEFT methods train only a small fraction of the parameters yet reach competitive quality, and the gap between the two largely closes as model scale grows.
LoRA (Low-Rank Adaptation)
LoRA freezes the pretrained model weights and adds a trainable low-rank update, the product of two small matrices, alongside selected weight matrices in each layer, reaching quality close to full fine-tuning with a small fraction of the trainable parameters.
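The low-rank update can be illustrated with a minimal NumPy sketch. The layer sizes, the rank r, and the scaling factor alpha below are illustrative assumptions, not values taken from this page, and the function name lora_forward is hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative sizes (assumptions): one 512x512 layer, rank 8, alpha 16.
d_in, d_out, r, alpha = 512, 512, 8, 16

W = rng.standard_normal((d_out, d_in))      # frozen pretrained weight
A = rng.standard_normal((r, d_in)) * 0.01   # trainable, small random init
B = np.zeros((d_out, r))                    # trainable, zero init

def lora_forward(x):
    """Frozen base projection plus the scaled low-rank update B @ A."""
    return x @ W.T + (alpha / r) * (x @ A.T) @ B.T

x = rng.standard_normal((4, d_in))
y = lora_forward(x)

full_params = W.size                 # parameters updated by full fine-tuning
lora_params = A.size + B.size        # parameters updated by LoRA
trainable_fraction = lora_params / full_params
```

Because B starts at zero, the adapted layer initially reproduces the frozen layer exactly, and only A and B would receive gradients during training. For this single layer the trainable fraction is r * (d_in + d_out) / (d_in * d_out), which shrinks further as the layer gets wider.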
Adapters, Prefix Tuning & Prompt Tuning
Beyond LoRA, other parameter-efficient fine-tuning methods – including bottleneck adapters, prefix tuning, prompt tuning, (IA)^3, and DoRA – each offer distinct trade-offs in where and how they inject trainable parameters into a frozen pretrained model.
QLoRA (Quantized LoRA)
QLoRA combines 4-bit quantization of the frozen base model with LoRA adapters trained in higher precision, enabling fine-tuning of a 65B-parameter model on a single 48 GB GPU with quality close to that of 16-bit fine-tuning.
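A rough back-of-the-envelope calculation shows why 4-bit storage matters at this scale. The sketch below counts base-model weights only; it deliberately ignores quantization constants, adapter parameters, optimizer state, and activations, so the figures are lower bounds rather than measured memory usage. The helper name weight_memory_gb is hypothetical.

```python
def weight_memory_gb(n_params, bits_per_param):
    """Storage for the weights alone, in GB (1 GB = 1e9 bytes)."""
    return n_params * bits_per_param / 8 / 1e9

n_params = 65e9  # a 65B-parameter base model

fp16_gb = weight_memory_gb(n_params, 16)   # 16-bit weights
nf4_gb = weight_memory_gb(n_params, 4)     # 4-bit quantized weights

print(f"16-bit weights: {fp16_gb:.1f} GB")
print(f"4-bit weights:  {nf4_gb:.1f} GB")
```

Under these assumptions the 16-bit weights alone exceed a 48 GB card several times over, while the 4-bit weights leave headroom for the adapters and the training state.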
S-LoRA / Multi-LoRA Serving
Multi-LoRA serving systems like S-LoRA enable thousands of LoRA adapters to be served simultaneously from a single shared base model, using a unified memory pool for adapter weights and KV cache, together with custom CUDA kernels, to maintain near-baseline throughput.
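The core idea of sharing one base model across many adapters can be sketched in NumPy. This is only an illustration of the batched computation, not S-LoRA's actual kernels or memory manager; the sizes, the adapter_ids array, and the function name multi_lora_forward are assumptions made for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative sizes (assumptions): hidden size 64, rank 4, 3 adapters.
d, r, n_adapters = 64, 4, 3

W = rng.standard_normal((d, d))               # shared frozen base weight
A = rng.standard_normal((n_adapters, r, d))   # per-adapter down-projections
B = rng.standard_normal((n_adapters, d, r))   # per-adapter up-projections

def multi_lora_forward(x, adapter_ids):
    """One shared base matmul, plus a per-request low-rank delta."""
    base = x @ W.T                                # (batch, d), computed once
    A_sel = A[adapter_ids]                        # (batch, r, d)
    B_sel = B[adapter_ids]                        # (batch, d, r)
    low = np.einsum("bd,brd->br", x, A_sel)       # project down to rank r
    delta = np.einsum("br,bdr->bd", low, B_sel)   # project back up to d
    return base + delta

# Five requests in one batch, each routed to its own adapter.
x = rng.standard_normal((5, d))
adapter_ids = np.array([0, 2, 1, 0, 2])
y = multi_lora_forward(x, adapter_ids)
```

The expensive base projection runs once for the whole batch, and only the small rank-r products differ per request, which is why adapter weights are cheap enough to keep many of them resident at once.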