The Small Model Revolution
Phi, Mistral, and efficient small language models.
Phi Series
Microsoft Research’s Phi models showed that training data quality can matter as much as model size, with models as small as 1.3 billion parameters matching far larger models on targeted benchmarks.
Gemma
Google DeepMind’s Gemma series brought the research and technology behind Gemini to the open-weight ecosystem, evolving from text-only models to multimodal, multilingual systems designed for edge deployment.
Knowledge Distillation for LLMs
Knowledge distillation began as a way to compress BERT-era models by training a small student to mimic a teacher’s output probabilities. In the modern paradigm, large “teacher” models instead generate entire synthetic training datasets – including reasoning traces – that transfer capability through data rather than by matching outputs.
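The older, probability-matching form of distillation can be sketched in a few lines. This is a minimal illustration and not any particular library's implementation: the function names, the default temperature of 2.0, and the T^2 scaling convention are assumptions chosen for the example.

```python
import math

def softmax(logits, temperature=1.0):
    """Convert logits to probabilities, softened by a temperature."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def distillation_loss(teacher_logits, student_logits, temperature=2.0):
    """KL(teacher || student) on temperature-softened distributions.

    The T^2 factor is a common convention (assumed here) that keeps the
    loss magnitude comparable across temperatures.
    """
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    kl = sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)
    return kl * temperature ** 2
```

The loss is zero when the student reproduces the teacher's distribution and grows as the two diverge. The data-centric approach replaces this signal with ordinary training on teacher-generated text.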
Quantization and Compression
Quantization techniques evolved from a niche optimization into the critical bridge that brought frontier-class language models from data center clusters to consumer laptops, shrinking memory requirements by roughly 4x (for example, 16-bit weights stored in 4 bits), often with only a small loss in quality.
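The 4x figure follows from simple arithmetic on bits per weight, which the sketch below works through alongside a toy symmetric quantizer. It is an illustration under stated assumptions: the 7-billion-parameter model size is hypothetical, and the single per-list scale is a simplification (practical formats store a scale per small block of weights, so the real saving is slightly under 4x).

```python
def weight_memory_gb(num_params, bits_per_weight):
    """Memory needed for the weights alone, in gigabytes (10^9 bytes)."""
    return num_params * bits_per_weight / 8 / 1e9

def quantize_symmetric(values, bits=4):
    """Map floats to signed integer codes using one shared scale."""
    qmax = 2 ** (bits - 1) - 1  # 7 for 4-bit codes
    scale = max(abs(v) for v in values) / qmax or 1.0  # avoid a zero scale
    codes = [max(-qmax, min(qmax, round(v / scale))) for v in values]
    return codes, scale

def dequantize(codes, scale):
    """Recover approximate floats from integer codes."""
    return [c * scale for c in codes]

# Hypothetical 7B-parameter model: 16-bit versus 4-bit weights.
fp16_gb = weight_memory_gb(7e9, 16)
int4_gb = weight_memory_gb(7e9, 4)

# Round-trip a few example weights through 4-bit codes.
weights = [0.7, -0.35, 0.1, 0.0]
codes, scale = quantize_symmetric(weights, bits=4)
restored = dequantize(codes, scale)
```

Each restored weight differs from the original by at most half the scale, which is the source of the quality loss that better quantization schemes work to minimize.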
LoRA and Fine-Tuning Democratization
Low-Rank Adaptation (LoRA) transformed LLM fine-tuning from a privilege of well-funded labs into something any developer with a single GPU could do, by training only 0.1-1% of a model’s parameters through injected low-rank matrices.
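To see where a figure in the 0.1-1% range can come from, consider a single weight matrix adapted with rank-r factors. The sketch below is a plain-Python illustration, not a library API: the function names, the 4096 x 4096 layer size, and rank 8 are assumptions, and the fraction for a whole model depends on which matrices are adapted.

```python
def lora_trainable_fraction(d_in, d_out, rank):
    """Trainable LoRA parameters relative to the frozen weight matrix."""
    full = d_in * d_out            # frozen weight W: d_out x d_in
    lora = rank * (d_in + d_out)   # A: rank x d_in, B: d_out x rank
    return lora / full

def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]

def lora_forward(W, A, B, x, alpha, rank):
    """Compute W x + (alpha / rank) * B (A x); only A and B are trained."""
    base = matvec(W, x)
    update = matvec(B, matvec(A, x))
    scale = alpha / rank
    return [b + scale * u for b, u in zip(base, update)]

# Assumed example: one 4096 x 4096 projection adapted at rank 8.
fraction = lora_trainable_fraction(4096, 4096, 8)  # 65536 / 16777216
```

For this assumed layer the fraction is about 0.39%. Initializing B to zeros, as is standard for LoRA, makes the adapted layer start out identical to the frozen one.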
llama.cpp and Local Inference
Georgi Gerganov’s llama.cpp project, started in March 2023 as a C/C++ port of LLaMA inference, sparked a revolution in local AI by proving that large language models could run on ordinary laptops and even phones without a GPU.
The SLM Revolution
The Small Language Model revolution made the case that for the majority of real-world tasks, right-sized models – optimized for quality data, efficient architecture, and targeted deployment – often outperform the brute-force scaling approach on practical metrics such as cost, latency, and ease of deployment.