Training & Optimization

Data augmentation, transfer learning, and training strategies.

Batch Normalization

Batch normalization normalizes each layer's activations using the mean and variance of the current mini-batch, then applies a learned scale and shift. This enables higher learning rates and faster convergence, and acts as a mild regularizer.
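
A minimal sketch of the training-mode computation for a single feature, assuming a learnable scale (gamma) and shift (beta) and a small eps for numerical stability. The function name, parameter names, and default values are illustrative, not taken from any particular library.

```python
import math

def batch_norm(batch, gamma=1.0, beta=0.0, eps=1e-5):
    """Normalize one feature across a mini-batch using batch statistics."""
    n = len(batch)
    mean = sum(batch) / n
    # Biased (population) variance over the mini-batch.
    var = sum((x - mean) ** 2 for x in batch) / n
    # Standardize, then apply the (assumed) learnable scale and shift.
    return [gamma * (x - mean) / math.sqrt(var + eps) + beta for x in batch]

activations = [2.0, 4.0, 6.0, 8.0]
normalized = batch_norm(activations)
```

With gamma = 1 and beta = 0 the output has approximately zero mean and unit variance. At inference time, implementations typically switch to running averages of the statistics, which this sketch omits.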

Data Augmentation

Data augmentation artificially expands the training set by applying random, label-preserving transformations to images, and is often one of the cheapest and most effective regularizers available.
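
As an illustration, two common transformations (random horizontal flip and random crop) can be sketched on an image stored as a list of rows. The helper name and the 50% flip probability are assumptions made for this example.

```python
import random

def random_flip_and_crop(image, crop_size, rng):
    """Return a randomly flipped, randomly cropped square view of `image`.

    `image` is a list of rows (H x W); `rng` is a random.Random instance.
    """
    # Flip left-right with probability 0.5 (an assumed, commonly used value).
    if rng.random() < 0.5:
        image = [row[::-1] for row in image]
    h, w = len(image), len(image[0])
    # Pick the top-left corner so the crop stays inside the image.
    top = rng.randint(0, h - crop_size)
    left = rng.randint(0, w - crop_size)
    return [row[left:left + crop_size] for row in image[top:top + crop_size]]

rng = random.Random(0)
image = [[r * 4 + c for c in range(4)] for r in range(4)]
augmented = random_flip_and_crop(image, 3, rng)
```

Each call yields a different 3 x 3 view of the same 4 x 4 image, so the model rarely sees exactly the same input twice.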

Dropout and Regularization

Dropout randomly zeroes neuron activations during training to prevent co-adaptation, while L2 regularization and its variants penalize large weights – together they are the primary tools for controlling overfitting in deep networks.
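
A minimal sketch of both ideas, assuming the "inverted dropout" convention (surviving activations are rescaled during training so nothing changes at inference) and an L2 penalty written as weight_decay times the sum of squared weights. Some formulations include a factor of 1/2; the names here are illustrative.

```python
import random

def dropout(activations, p, rng, training=True):
    """Zero each activation with probability p; rescale survivors by 1/(1-p)."""
    assert 0.0 <= p < 1.0
    if not training or p == 0.0:
        return list(activations)  # identity at inference time
    keep = 1.0 - p
    return [a / keep if rng.random() < keep else 0.0 for a in activations]

def l2_penalty(weights, weight_decay):
    """Penalty added to the loss: weight_decay * sum of squared weights."""
    return weight_decay * sum(w * w for w in weights)

dropped = dropout([1.0] * 8, p=0.5, rng=random.Random(0))
penalty = l2_penalty([1.0, 2.0], weight_decay=0.5)
```

Rescaling by 1/(1-p) keeps the expected activation the same in training and inference, which is why the inference path can simply return its input.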

Knowledge Distillation

Knowledge distillation transfers the learned behavior of a large teacher network into a smaller student network by training the student to match the teacher’s soft output probabilities, capturing inter-class relationships that hard labels miss.
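
One common way to match soft outputs is to soften both sets of logits with a temperature and minimize the KL divergence between them. The sketch below assumes that formulation; the temperature of 4.0 and the T-squared scaling are conventional choices rather than requirements, and in practice this term is usually combined with a standard loss on the hard labels.

```python
import math

def softmax(logits, temperature=1.0):
    """Numerically stable softmax of logits / temperature."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def distillation_loss(student_logits, teacher_logits, temperature=4.0):
    """KL(teacher || student) on temperature-softened probabilities."""
    p_teacher = softmax(teacher_logits, temperature)
    p_student = softmax(student_logits, temperature)
    kl = sum(t * (math.log(t) - math.log(s))
             for t, s in zip(p_teacher, p_student))
    # Scaling by T^2 keeps gradient magnitudes comparable across temperatures.
    return temperature ** 2 * kl

matched = distillation_loss([1.0, 2.0, 3.0], [1.0, 2.0, 3.0])
mismatched = distillation_loss([3.0, 2.0, 1.0], [1.0, 2.0, 3.0])
```

The loss is zero when the student reproduces the teacher's distribution and grows as the two diverge.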

Label Smoothing

Label smoothing replaces hard one-hot target vectors with soft distributions, preventing the model from becoming overconfident and improving generalization and calibration.
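
A minimal sketch, assuming the common convention of mixing the one-hot target with a uniform distribution over all classes. The function name and the default epsilon of 0.1 are illustrative.

```python
def smooth_labels(target_index, num_classes, epsilon=0.1):
    """Return (1 - epsilon) * one_hot + epsilon * uniform."""
    off_value = epsilon / num_classes
    on_value = 1.0 - epsilon + off_value
    return [on_value if k == target_index else off_value
            for k in range(num_classes)]

soft_target = smooth_labels(target_index=1, num_classes=4, epsilon=0.1)
```

For four classes and epsilon = 0.1, the true class receives 0.925 and every other class 0.025, so the targets still sum to 1 but never demand infinite logits.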

Learning Rate Scheduling

Learning rate scheduling systematically varies the learning rate during training – typically warming up, then decaying – to achieve faster convergence and, in most cases, better final accuracy than a single fixed rate.
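
The warm-up-then-decay pattern can be sketched as linear warmup followed by cosine decay. This is one popular choice among many (step decay and one-cycle are others); all names and values below are assumptions for illustration.

```python
import math

def lr_at_step(step, total_steps, base_lr, warmup_steps):
    """Linear warmup to base_lr, then cosine decay towards zero."""
    if step < warmup_steps:
        # Ramp up linearly, reaching base_lr on the last warmup step.
        return base_lr * (step + 1) / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return 0.5 * base_lr * (1.0 + math.cos(math.pi * progress))

schedule = [lr_at_step(s, total_steps=100, base_lr=0.1, warmup_steps=10)
            for s in range(100)]
```

The rate climbs during the first 10 steps, peaks at base_lr, and then falls smoothly for the remaining steps.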

Mixup and CutMix

Mixup linearly blends pairs of images and their labels, while CutMix cuts and pastes rectangular regions between images, both producing soft training targets that improve generalization, calibration, and robustness.
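
Both operations can be sketched on toy data. The Mixup coefficient is assumed to be drawn from a Beta(alpha, alpha) distribution, as is conventional. For CutMix the pasted box is passed in explicitly to keep the example deterministic, whereas in practice it is sampled at random. Function names are illustrative.

```python
import random

def mixup(x_a, y_a, x_b, y_b, alpha, rng):
    """Blend two flattened inputs and their label vectors with one weight."""
    lam = rng.betavariate(alpha, alpha)
    x = [lam * a + (1 - lam) * b for a, b in zip(x_a, x_b)]
    y = [lam * a + (1 - lam) * b for a, b in zip(y_a, y_b)]
    return x, y, lam

def cutmix(img_a, y_a, img_b, y_b, box):
    """Paste a rectangle of img_b into img_a; mix labels by area."""
    top, left, height, width = box
    mixed = [row[:] for row in img_a]  # copy so img_a is not modified
    for r in range(top, top + height):
        mixed[r][left:left + width] = img_b[r][left:left + width]
    # Label weight of image A = fraction of pixels still coming from A.
    lam = 1.0 - (height * width) / (len(img_a) * len(img_a[0]))
    y = [lam * a + (1 - lam) * b for a, b in zip(y_a, y_b)]
    return mixed, y, lam

zeros = [[0] * 4 for _ in range(4)]
ones = [[1] * 4 for _ in range(4)]
cut_img, cut_label, cut_lam = cutmix(zeros, [1.0, 0.0], ones, [0.0, 1.0],
                                     box=(0, 0, 2, 2))
mix_x, mix_y, mix_lam = mixup([0.0, 0.0], [1.0, 0.0], [1.0, 1.0], [0.0, 1.0],
                              alpha=0.2, rng=random.Random(0))
```

In the CutMix call a 2 x 2 patch covers a quarter of the 4 x 4 image, so the label becomes 0.75 for the first class and 0.25 for the second.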

Progressive Resizing

Progressive resizing starts training on small images and gradually increases resolution, achieving faster convergence and often better accuracy by providing a natural curriculum from coarse to fine features.
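
A resolution schedule of this kind might look like the sketch below, which interpolates linearly between a start and end size and rounds to a multiple of 32. The specific sizes, the linear ramp, and the rounding multiple are assumptions for illustration.

```python
def resolution_for_epoch(epoch, total_epochs, start_size=128, end_size=224,
                         multiple=32):
    """Image side length to use at a given (0-indexed) epoch."""
    if total_epochs <= 1:
        return end_size
    frac = epoch / (total_epochs - 1)
    size = start_size + frac * (end_size - start_size)
    # Round to a multiple that is friendly to typical network strides.
    return int(round(size / multiple)) * multiple

sizes = [resolution_for_epoch(e, total_epochs=4) for e in range(4)]
```

The data loader would then resize each batch to the returned side length before it is fed to the network.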

Self-Supervised Pretraining

Self-supervised pretraining learns visual representations from unlabeled images by solving pretext tasks – such as predicting masked patches or matching augmented views – producing features that can rival or exceed supervised ImageNet pretraining.
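
For the "matching augmented views" family, one widely used objective is a contrastive (InfoNCE-style) loss: the embedding of one view should be closer to the embedding of another view of the same image than to embeddings of other images. The sketch below assumes that objective with cosine similarity and a temperature of 0.1. It is only one of several self-supervised losses, and the names are illustrative.

```python
import math

def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def info_nce(anchor, positive, negatives, temperature=0.1):
    """Cross-entropy of picking the positive among positive + negatives."""
    logits = [cosine(anchor, positive) / temperature]
    logits += [cosine(anchor, n) / temperature for n in negatives]
    m = max(logits)
    log_sum_exp = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum_exp - logits[0]

aligned = info_nce([1.0, 0.0], [1.0, 0.0], [[0.0, 1.0], [-1.0, 0.0]])
misaligned = info_nce([1.0, 0.0], [-1.0, 0.0], [[0.0, 1.0], [1.0, 0.0]])
```

The loss is small when the two views of the same image agree and large when a negative is closer to the anchor than the positive.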

Transfer Learning

Transfer learning reuses features learned on a large source dataset (typically ImageNet) to solve a different target task, eliminating the need to train from scratch and dramatically reducing data and compute requirements.
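
The simplest form of this idea, often called feature extraction or linear probing, keeps the pretrained backbone frozen and trains only a new head on the target task. The toy sketch below uses a fixed function as a stand-in for a pretrained backbone and fits a linear head with plain SGD. Everything here (the stand-in features, data, and hyperparameters) is hypothetical and only meant to show which parameters are updated.

```python
def pretrained_features(x):
    """Stand-in for a frozen backbone: fixed mapping, never updated."""
    return [x[0] + x[1], x[0] - x[1]]

def train_head(data, lr=0.1, epochs=200):
    """Fit a linear head (w, b) on top of frozen features with SGD."""
    w, b = [0.0, 0.0], 0.0
    for _ in range(epochs):
        for x, y in data:
            f = pretrained_features(x)       # backbone is only evaluated
            pred = w[0] * f[0] + w[1] * f[1] + b
            err = pred - y                   # gradient of 0.5 * err^2 w.r.t. pred
            w = [wi - lr * err * fi for wi, fi in zip(w, f)]
            b -= lr * err                    # only head parameters change
    return w, b

# Toy target task: y = x0 + x1, which is exactly the first frozen feature.
target_data = [([0.0, 0.0], 0.0), ([1.0, 0.0], 1.0),
               ([0.0, 1.0], 1.0), ([1.0, 1.0], 2.0)]
head_w, head_b = train_head(target_data)
```

Because the frozen features already encode what the target task needs, the head converges with very little data. Fine-tuning, where the backbone is also updated at a small learning rate, is the usual next step when more target data is available.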