Parameter-Efficient Fine-Tuning

LoRA, adapters, and methods for efficient model adaptation.

Full Fine-Tuning vs PEFT: When to Use What

Full fine-tuning updates every parameter in a model, which gives maximum adaptability at a high cost in compute and memory. PEFT methods train only a small fraction of the parameters yet often reach competitive quality, and for sufficiently large models the gap between the two can become negligible.
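The memory side of this trade-off can be made concrete with rough arithmetic. The sketch below is an estimate under stated assumptions, not a measurement: it assumes 16-bit frozen weights (2 bytes per parameter) and about 16 bytes per trainable parameter for mixed-precision Adam (weights, gradients, and optimizer state), and it ignores activation memory entirely. The 7B model size and the 0.5% trainable fraction are hypothetical values chosen for illustration.

```python
def training_memory_gb(total_params, trainable_params,
                       frozen_bytes=2, trainable_bytes=16):
    """Rough memory for weights, gradients, and optimizer state, in GB.

    Assumptions (illustrative only): frozen weights are stored in 16-bit
    precision; each trainable parameter costs about 16 bytes under
    mixed-precision Adam. Activation memory is not counted.
    """
    frozen_params = total_params - trainable_params
    total_bytes = frozen_params * frozen_bytes + trainable_params * trainable_bytes
    return total_bytes / 1e9


TOTAL = 7_000_000_000          # hypothetical 7B-parameter model
PEFT_TRAINABLE = 35_000_000    # hypothetical 0.5% of parameters trained

full_gb = training_memory_gb(TOTAL, TOTAL)           # every parameter trainable
peft_gb = training_memory_gb(TOTAL, PEFT_TRAINABLE)  # most parameters frozen

print(f"full fine-tuning: ~{full_gb:.1f} GB")
print(f"PEFT:             ~{peft_gb:.1f} GB")
```

Under these assumptions the frozen weights dominate the PEFT footprint, which is why quantizing the base model is the natural next step for reducing memory further.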

LoRA (Low-Rank Adaptation)

LoRA freezes the pretrained model weights and adds a pair of small, trainable low-rank matrices alongside selected weight matrices (commonly the attention projections), approaching full fine-tuning quality while training only a small fraction of the parameters.
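A minimal sketch of the idea in plain Python follows. It computes y = W x + (alpha / r) * B (A x), with W frozen, A initialized with small random values, and B initialized to zero so that the adapted layer starts out identical to the base layer. The class name LoRALinear, the layer size, the rank, and alpha are hypothetical choices for illustration; a real implementation would use a tensor library and an optimizer.

```python
import random


def matvec(M, x):
    """Multiply matrix M (list of rows) by vector x."""
    return [sum(m * v for m, v in zip(row, x)) for row in M]


class LoRALinear:
    """Frozen weight W plus a trainable low-rank update B @ A (illustrative)."""

    def __init__(self, W, rank, alpha, seed=0):
        rng = random.Random(seed)
        d_out, d_in = len(W), len(W[0])
        self.W = W                      # frozen: never updated during training
        self.scale = alpha / rank       # scaling applied to the low-rank update
        # A (rank x d_in) starts small and random; B (d_out x rank) starts at
        # zero, so the initial update B @ A is exactly zero.
        self.A = [[rng.gauss(0.0, 0.01) for _ in range(d_in)] for _ in range(rank)]
        self.B = [[0.0] * rank for _ in range(d_out)]

    def forward(self, x):
        base = matvec(self.W, x)
        update = matvec(self.B, matvec(self.A, x))
        return [b + self.scale * u for b, u in zip(base, update)]

    def trainable_params(self):
        return len(self.A) * len(self.A[0]) + len(self.B) * len(self.B[0])


# Hypothetical 64 x 64 layer (identity weights) adapted with rank 4.
W = [[1.0 if i == j else 0.0 for j in range(64)] for i in range(64)]
layer = LoRALinear(W, rank=4, alpha=8)
x = [1.0] * 64

full_params = len(W) * len(W[0])         # 4096 parameters in the frozen matrix
lora_params = layer.trainable_params()   # 4*64 + 64*4 = 512 trainable parameters
y = layer.forward(x)                     # equals W x until B is trained
```

Because B starts at zero, training begins from the pretrained behavior, and after training the product B @ A can be merged into W so that inference adds no extra latency.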

Adapters, Prefix Tuning & Prompt Tuning

Beyond LoRA, other parameter-efficient fine-tuning methods, including bottleneck adapters, prefix tuning, prompt tuning, (IA)^3, and DoRA, each offer distinct trade-offs in where and how they inject trainable parameters into a frozen pretrained model.
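To illustrate the different injection points, here is a simplified sketch of three of these mechanisms in plain Python: a bottleneck adapter (down-project, nonlinearity, up-project, residual add), prompt tuning (trainable vectors prepended to the input embeddings), and (IA)^3-style elementwise rescaling of an activation. Function names, shapes, and values are hypothetical, biases are omitted, and prefix tuning and DoRA are not shown.

```python
def matvec(M, x):
    """Multiply matrix M (list of rows) by vector x."""
    return [sum(m * v for m, v in zip(row, x)) for row in M]


def bottleneck_adapter(h, W_down, W_up):
    """Adapter inserted inside a layer: h + W_up @ relu(W_down @ h).

    W_down (m x d) and W_up (d x m) are the only trainable weights;
    m is much smaller than d in practice.
    """
    z = [max(0.0, v) for v in matvec(W_down, h)]
    delta = matvec(W_up, z)
    return [a + b for a, b in zip(h, delta)]


def prepend_soft_prompt(soft_prompt, token_embeddings):
    """Prompt tuning: trainable vectors are prepended at the input only."""
    return soft_prompt + token_embeddings


def ia3_rescale(activation, scale):
    """(IA)^3-style: a trainable vector rescales an activation elementwise."""
    return [a * s for a, s in zip(activation, scale)]


# Bottleneck adapter on a 4-dimensional hidden state with bottleneck size 2.
h = [1.0, -2.0, 0.5, 3.0]
W_down = [[1.0, 0.0, 0.0, 0.0],
          [0.0, 1.0, 0.0, 0.0]]
W_up = [[0.5, 0.0],
        [0.0, 0.5],
        [0.0, 0.0],
        [1.0, 0.0]]
adapted = bottleneck_adapter(h, W_down, W_up)   # [1.5, -2.0, 0.5, 4.0]

# Prompt tuning: 2 soft-prompt vectors in front of 3 token embeddings.
soft_prompt = [[0.1, 0.2], [0.3, 0.4]]
tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
sequence = prepend_soft_prompt(soft_prompt, tokens)   # 5 vectors in total

# (IA)^3-style rescaling of a 3-dimensional activation.
rescaled = ia3_rescale([1.0, 2.0, 3.0], [1.0, 0.5, 2.0])   # [1.0, 1.0, 6.0]
```

The three functions differ mainly in where the trainable parameters sit: inside each layer, at the input sequence, or as per-dimension scaling vectors, which is the axis along which these methods trade capacity against overhead.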

QLoRA (Quantized LoRA)

QLoRA combines 4-bit quantization of the frozen base model with LoRA adapters trained in higher precision, enabling fine-tuning of models as large as 65B parameters on a single 48GB GPU with little reported quality loss relative to 16-bit fine-tuning.
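The following sketch shows the basic mechanics of blockwise 4-bit quantization and why it makes a 65B model fit. It is a deliberate simplification: it uses uniform symmetric 4-bit levels with one absmax scale per block, whereas QLoRA itself uses the NF4 data type and additionally quantizes the scales. The block size, sample weights, and function names are hypothetical.

```python
def quantize_block(block):
    """Map a block of floats to integer codes in [-7, 7] plus one scale."""
    absmax = max(abs(w) for w in block) or 1.0   # avoid dividing by zero
    scale = absmax / 7
    codes = [max(-7, min(7, round(w / scale))) for w in block]
    return codes, scale


def quantize(weights, block_size=4):
    """Quantize a flat weight list block by block (tiny block for the demo)."""
    return [quantize_block(weights[i:i + block_size])
            for i in range(0, len(weights), block_size)]


def dequantize(blocks):
    """Recover approximate weights; done on the fly during the forward pass."""
    out = []
    for codes, scale in blocks:
        out.extend(c * scale for c in codes)
    return out


weights = [0.7, -0.35, 0.1, 0.0, 0.02, -0.014, 0.007, 0.0]
blocks = quantize(weights)
restored = dequantize(blocks)

# Per-block scales keep the error proportional to each block's own range.
max_error = max(abs(w - r) for w, r in zip(weights, restored))

# Weight storage for 65B parameters, ignoring the small per-block scales.
PARAMS = 65_000_000_000
gb_16bit = PARAMS * 16 / 8 / 1e9   # 130.0 GB
gb_4bit = PARAMS * 4 / 8 / 1e9     # 32.5 GB
```

At 4 bits the frozen weights take roughly 32.5 GB, leaving room on a 48GB GPU for the LoRA parameters, their optimizer state, and activations, while the same weights in 16-bit precision would need about 130 GB.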

S-LoRA / Multi-LoRA Serving

Multi-LoRA serving systems like S-LoRA enable thousands of LoRA adapters to be served simultaneously from a single shared base model, using unified memory management and custom CUDA kernels to maintain near-baseline throughput.
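The core idea can be sketched without any GPU machinery: every request in a batch uses the same base weights, and each request adds only the low-rank update of the adapter it selected. The code below is a conceptual illustration in plain Python, not how S-LoRA is implemented; real systems batch these computations in custom kernels and page adapter weights in and out of GPU memory. The adapter names, matrices, and the serve_batch function are hypothetical.

```python
def matvec(M, x):
    """Multiply matrix M (list of rows) by vector x."""
    return [sum(m * v for m, v in zip(row, x)) for row in M]


def serve_batch(W, adapters, requests):
    """Apply one shared base weight W and a per-request LoRA adapter.

    adapters: dict mapping adapter id -> (A, B), the low-rank factors.
    requests: list of (adapter_id, x); adapter_id None means base model only.
    """
    outputs = []
    for adapter_id, x in requests:
        y = matvec(W, x)                      # shared base computation
        if adapter_id is not None:
            A, B = adapters[adapter_id]       # small per-adapter factors
            delta = matvec(B, matvec(A, x))
            y = [a + b for a, b in zip(y, delta)]
        outputs.append(y)
    return outputs


# One 2 x 2 base weight shared by all requests (identity, for readability).
W = [[1.0, 0.0],
     [0.0, 1.0]]

# Two hypothetical rank-1 adapters.
adapters = {
    "customer_a": ([[1.0, 0.0]], [[1.0], [0.0]]),   # adds x[0] to output 0
    "customer_b": ([[0.0, 1.0]], [[0.0], [2.0]]),   # adds 2 * x[1] to output 1
}

requests = [
    ("customer_a", [1.0, 2.0]),
    ("customer_b", [1.0, 2.0]),
    (None, [1.0, 2.0]),
]
results = serve_batch(W, adapters, requests)
# results == [[2.0, 2.0], [1.0, 6.0], [1.0, 2.0]]
```

Only the small A and B factors differ between tenants, so the memory cost of adding another adapter is tiny compared with loading another copy of the base model.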