The Small Model Revolution

Phi, Mistral, and efficient small language models.

Phi Series

Microsoft Research’s Phi models made the case that training data quality can matter more than model size: models as small as 1.3 billion parameters matched far larger models on targeted benchmarks.

Gemma

Google DeepMind’s Gemma series brought the research and technology behind Gemini to the open-weight ecosystem, evolving from text-only models to multimodal, multilingual systems designed for edge deployment.

Knowledge Distillation for LLMs

Knowledge distillation began as a way to compress BERT-era models by training a small “student” to mimic a larger model’s output probabilities. In the modern paradigm, large “teacher” models instead generate entire synthetic training datasets, including reasoning traces, so that capability is transferred through data rather than by matching output distributions.
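The older, probability-mimicking form of distillation can be sketched in a few lines. The following is a minimal illustration, not any particular library's implementation: it assumes the commonly used temperature-softened KL divergence between teacher and student outputs, and the function names and example logits are made up for this sketch.

```python
import math

def softmax(logits, temperature=1.0):
    """Convert logits to probabilities, softened by the temperature."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(z - peak) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def distillation_loss(teacher_logits, student_logits, temperature=2.0):
    """KL(teacher || student) on temperature-softened distributions.

    The T^2 factor is the usual convention that keeps gradient
    magnitudes comparable across temperatures.
    """
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    kl = sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)
    return kl * temperature ** 2

# Hypothetical logits over three classes.
teacher = [4.0, 1.0, 0.5]
close_student = [3.8, 1.1, 0.4]   # roughly agrees with the teacher
far_student = [0.5, 1.0, 4.0]     # disagrees with the teacher

print(distillation_loss(teacher, close_student))
print(distillation_loss(teacher, far_student))
```

A student whose outputs track the teacher's gets a loss near zero, while one that disagrees is penalized heavily. The data-centric approach described above replaces this loss with ordinary training on teacher-generated text, which is why it needs no access to the teacher's probabilities.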

Quantization and Compression

Quantization techniques evolved from a niche optimization into the critical bridge that brought frontier-class language models from data center clusters to consumer laptops. Storing weights in 4 bits instead of 16 shrinks memory requirements by about 4x, in favorable cases with less than 1% quality loss.
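The 4x figure follows from simple arithmetic on bits per weight, and the core idea of quantization fits in a short sketch. The code below is illustrative only: it assumes a hypothetical 7-billion-parameter model and a single symmetric scale for the whole weight list, whereas practical schemes typically use per-group scales and more elaborate formats.

```python
def weight_memory_gb(num_params, bits_per_weight):
    """Memory needed for the weights alone, in gigabytes (10^9 bytes)."""
    return num_params * bits_per_weight / 8 / 1e9

def quantize_int4(weights):
    """Symmetric 4-bit quantization with one shared scale (simplified)."""
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / 7 if max_abs > 0 else 1.0
    # Signed 4-bit integers cover the range -8..7.
    codes = [max(-8, min(7, round(w / scale))) for w in weights]
    return codes, scale

def dequantize(codes, scale):
    """Recover approximate float weights from the integer codes."""
    return [c * scale for c in codes]

# Hypothetical 7B-parameter model: 16-bit versus 4-bit weights.
fp16_gb = weight_memory_gb(7e9, 16)
int4_gb = weight_memory_gb(7e9, 4)
print(fp16_gb, int4_gb, fp16_gb / int4_gb)

weights = [0.72, -0.31, 0.05, -0.66]
codes, scale = quantize_int4(weights)
print(codes, dequantize(codes, scale))
```

The rounding error per weight is bounded by half the scale, which is why finer-grained scales (one per small group of weights) tend to preserve quality better than a single global scale.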

LoRA and Fine-Tuning Democratization

Low-Rank Adaptation (LoRA) transformed LLM fine-tuning from a privilege of well-funded labs into something any developer with a single GPU could do, by training only 0.1-1% of a model’s parameters through injected low-rank matrices.
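To see where a figure in that range can come from, consider a frozen weight matrix W of shape d_out x d_in with a trainable update B·A of rank r. The sketch below is a plain-Python illustration under assumed dimensions (4096 x 4096, rank 8) and the commonly used alpha/r scaling; the helper names are hypothetical and the whole-model fraction depends on which matrices are adapted.

```python
def lora_trainable_fraction(d_out, d_in, rank):
    """Trainable LoRA parameters as a fraction of one full weight matrix."""
    full = d_out * d_in
    lora = rank * (d_out + d_in)  # B is d_out x r, A is r x d_in
    return lora / full

def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]

def lora_forward(W, A, B, x, alpha, rank):
    """Compute W x + (alpha / rank) * B (A x); W stays frozen."""
    base = matvec(W, x)
    update = matvec(B, matvec(A, x))
    scale = alpha / rank
    return [b + scale * u for b, u in zip(base, update)]

# Assumed dimensions for a single projection matrix.
print(lora_trainable_fraction(4096, 4096, 8))  # 0.00390625, about 0.39%

# Tiny example: rank-1 adapter on a 2x2 matrix.
W = [[1.0, 2.0], [3.0, 4.0]]
A = [[1.0, 0.0]]          # r x d_in
B_zero = [[0.0], [0.0]]   # d_out x r, zero-initialised
x = [1.0, 1.0]
print(lora_forward(W, A, B_zero, x, alpha=2, rank=1))  # [3.0, 7.0]
```

Initialising B to zero means the adapted model starts out identical to the base model, and only the small A and B matrices receive gradient updates during fine-tuning.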

llama.cpp and Local Inference

Georgi Gerganov’s llama.cpp project, started in March 2023 as a C/C++ port of LLaMA inference, sparked a revolution in local AI by proving that large language models could run on ordinary laptops and even phones without a GPU.

The SLM Revolution

The Small Language Model revolution showed that for the majority of real-world tasks, right-sized models, built on quality data, efficient architecture, and targeted deployment, can outperform the brute-force scaling approach on the metrics that matter in practice.