Training & Optimization
Data augmentation, transfer learning, and training strategies.
Batch Normalization
Batch normalization normalizes activations across the mini-batch at each layer, enabling higher learning rates and faster convergence while also acting as a mild regularizer.
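As a rough illustration of the training-time computation, the sketch below normalizes each feature column of a mini-batch to zero mean and unit variance, then applies a scale and shift. The function name `batch_norm` and the scalar `gamma`, `beta`, and `eps` parameters are assumptions made for this example; real layers learn a separate scale and shift per feature and track running statistics for inference, which is omitted here.

```python
import math

def batch_norm(batch, gamma=1.0, beta=0.0, eps=1e-5):
    """Normalize each feature column of `batch` (a list of rows) over the mini-batch.

    Hypothetical helper: gamma and beta are plain scalars here for brevity.
    """
    n = len(batch)
    num_features = len(batch[0])
    out = [[0.0] * num_features for _ in range(n)]
    for j in range(num_features):
        column = [row[j] for row in batch]
        mean = sum(column) / n
        # Biased (divide-by-n) variance, computed over the mini-batch.
        var = sum((x - mean) ** 2 for x in column) / n
        inv_std = 1.0 / math.sqrt(var + eps)
        for i in range(n):
            out[i][j] = gamma * (column[i] - mean) * inv_std + beta
    return out

# Two features on very different scales end up on a comparable scale.
activations = [[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]]
normalized = batch_norm(activations)
```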
Data Augmentation
Data augmentation artificially expands the training set by applying random, label-preserving transformations to images, and is often among the cheapest and most effective regularizers available.
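A minimal sketch of two common random transformations, a horizontal flip and a random crop, might look like the following. Images are represented as plain nested lists of pixel values, and the helper names `random_horizontal_flip` and `random_crop` are assumptions for this example rather than any particular library's API.

```python
import random

def random_horizontal_flip(image, p=0.5, rng=random):
    """Mirror the image left-to-right with probability p; always returns a copy."""
    if rng.random() < p:
        return [row[::-1] for row in image]
    return [row[:] for row in image]

def random_crop(image, crop_h, crop_w, rng=random):
    """Cut a crop_h x crop_w window at a uniformly random position.

    Assumes the crop is no larger than the image.
    """
    h, w = len(image), len(image[0])
    top = rng.randint(0, h - crop_h)    # randint is inclusive on both ends
    left = rng.randint(0, w - crop_w)
    return [row[left:left + crop_w] for row in image[top:top + crop_h]]

# A 4x4 toy "image"; each call can yield a different training view.
image = [[r * 4 + c for c in range(4)] for r in range(4)]
view = random_crop(random_horizontal_flip(image), 3, 3)
```

Because the transformations are sampled fresh on every pass, the model rarely sees exactly the same pixels twice even though the underlying dataset is unchanged.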
Dropout and Regularization
Dropout randomly zeroes neuron activations during training to prevent co-adaptation, while L2 regularization and its variants penalize large weights; together they are the primary tools for controlling overfitting in deep networks.
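The two mechanisms can be sketched in a few lines. This example assumes the common "inverted" form of dropout, where surviving activations are scaled by 1/(1 - p) during training so that nothing needs to change at inference time. The names `dropout` and `l2_penalty` and the `weight_decay` coefficient are illustrative assumptions.

```python
import random

def dropout(activations, p=0.5, training=True, rng=random):
    """Inverted dropout: zero each activation with probability p (0 <= p < 1)
    and scale the survivors by 1 / (1 - p). A no-op when not training."""
    if not training or p == 0.0:
        return list(activations)
    keep = 1.0 - p
    return [a / keep if rng.random() < keep else 0.0 for a in activations]

def l2_penalty(weights, weight_decay=1e-4):
    """L2 term added to the loss: 0.5 * weight_decay * sum of squared weights."""
    return 0.5 * weight_decay * sum(w * w for w in weights)

hidden = [1.0] * 8
train_out = dropout(hidden, p=0.5)                  # entries are 0.0 or 2.0
eval_out = dropout(hidden, p=0.5, training=False)   # unchanged at inference
penalty = l2_penalty([3.0, 4.0], weight_decay=0.1)
```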
Knowledge Distillation
Knowledge distillation transfers the learned behavior of a large teacher network into a smaller student network by training the student to match the teacher's soft output probabilities, capturing inter-class relationships that hard labels miss.
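One common way to express "matching soft output probabilities" is a KL divergence between temperature-softened teacher and student distributions, sketched below for a single example. The temperature value, the T² scaling, and the function names are assumptions for illustration; in practice this term is usually combined with the ordinary hard-label loss.

```python
import math

def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax; higher temperatures give softer distributions."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)                      # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def distillation_loss(student_logits, teacher_logits, temperature=4.0):
    """KL(teacher || student) on softened outputs, scaled by temperature squared."""
    p_teacher = softmax(teacher_logits, temperature)
    p_student = softmax(student_logits, temperature)
    kl = sum(
        pt * (math.log(pt) - math.log(ps))
        for pt, ps in zip(p_teacher, p_student)
        if pt > 0.0
    )
    return temperature ** 2 * kl

teacher = [1.0, 2.0, 3.0]
loss_match = distillation_loss([1.0, 2.0, 3.0], teacher)     # student agrees
loss_mismatch = distillation_loss([3.0, 2.0, 1.0], teacher)  # student disagrees
```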
Label Smoothing
Label smoothing replaces hard one-hot target vectors with soft distributions, preventing the model from becoming overconfident and improving generalization and calibration.
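As a small sketch of the target construction, the function below spreads a fraction `epsilon` of the probability mass uniformly over all classes and keeps the rest on the true class. The name `smooth_labels` and the default `epsilon=0.1` are assumptions chosen for the example.

```python
def smooth_labels(target_index, num_classes, epsilon=0.1):
    """Return a soft target: (1 - epsilon) on the true class plus
    epsilon / num_classes spread uniformly over every class."""
    off_value = epsilon / num_classes
    targets = [off_value] * num_classes
    targets[target_index] += 1.0 - epsilon
    return targets

# With 4 classes and epsilon = 0.1, the true class gets roughly 0.925
# and each other class roughly 0.025.
soft_target = smooth_labels(target_index=2, num_classes=4, epsilon=0.1)
```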
Learning Rate Scheduling
Learning rate scheduling systematically varies the learning rate during training (typically warming up, then decaying) to achieve faster convergence and, in most cases, better final accuracy than a single fixed rate.
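One widely used shape for such a schedule is linear warmup followed by cosine decay. The sketch below is one possible formulation; the function name `lr_at_step` and all default values are assumptions rather than recommended settings.

```python
import math

def lr_at_step(step, total_steps, base_lr=0.1, warmup_steps=500, min_lr=0.0):
    """Linear warmup to base_lr, then cosine decay down to min_lr."""
    if step < warmup_steps:
        # Ramp linearly so the final warmup step reaches base_lr.
        return base_lr * (step + 1) / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    progress = min(progress, 1.0)
    return min_lr + 0.5 * (base_lr - min_lr) * (1.0 + math.cos(math.pi * progress))

# Sample the schedule at a few points of a hypothetical 1000-step run.
schedule = [lr_at_step(s, 1000, base_lr=0.1, warmup_steps=100) for s in (0, 99, 100, 550, 1000)]
```

In this example the rate rises from 0.001 to 0.1 over the first 100 steps, passes through roughly 0.05 at the midpoint of the decay phase, and reaches 0 at the final step.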
Mixup and CutMix
Mixup linearly blends pairs of images and their labels, while CutMix cuts and pastes rectangular regions between images, both producing soft training targets that improve generalization, calibration, and robustness.
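Both operations can be sketched on toy grayscale images stored as nested lists with one-hot label vectors. The function names and explicit box arguments are assumptions for illustration; in practice the mixing coefficient is commonly drawn from a Beta distribution and the CutMix box is sampled at random.

```python
import random

def blend_labels(label_a, label_b, lam):
    """Weighted average of two label vectors."""
    return [lam * a + (1.0 - lam) * b for a, b in zip(label_a, label_b)]

def mixup(image_a, label_a, image_b, label_b, lam):
    """Blend two images pixel-wise with weight lam, and blend labels the same way."""
    mixed = [
        [lam * a + (1.0 - lam) * b for a, b in zip(row_a, row_b)]
        for row_a, row_b in zip(image_a, image_b)
    ]
    return mixed, blend_labels(label_a, label_b, lam)

def cutmix(image_a, label_a, image_b, label_b, top, left, height, width):
    """Paste a rectangle from image_b into image_a; weight labels by area.

    Assumes the box origin (top, left) lies inside the image.
    """
    h, w = len(image_a), len(image_a[0])
    bottom, right = min(top + height, h), min(left + width, w)
    mixed = [row[:] for row in image_a]
    for y in range(top, bottom):
        mixed[y][left:right] = image_b[y][left:right]
    lam = 1.0 - ((bottom - top) * (right - left)) / (h * w)
    return mixed, blend_labels(label_a, label_b, lam)

zeros = [[0.0] * 4 for _ in range(4)]
ones = [[1.0] * 4 for _ in range(4)]
lam = random.betavariate(1.0, 1.0)   # one common way to sample the mixing weight
mix_img, mix_lbl = mixup(zeros, [1.0, 0.0], ones, [0.0, 1.0], lam)
cut_img, cut_lbl = cutmix(zeros, [1.0, 0.0], ones, [0.0, 1.0], top=0, left=0, height=2, width=2)
```

In the CutMix call, a 2x2 patch covers a quarter of the 4x4 image, so the resulting label gives weight 0.75 to the first class and 0.25 to the second.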
Progressive Resizing
Progressive resizing starts training on small images and gradually increases resolution, achieving faster convergence and often better accuracy by providing a natural curriculum from coarse to fine features.
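A simple way to picture the schedule is a function that maps the current epoch to an image size. The sketch below assumes a linear ramp rounded to a multiple of 32; the name `resolution_for_epoch`, the 128 and 224 endpoints, and the rounding rule are all illustrative assumptions, and stepwise schedules are equally plausible.

```python
def resolution_for_epoch(epoch, total_epochs, start_size=128, end_size=224, multiple=32):
    """Linearly interpolate the training resolution over the run,
    rounded to the nearest multiple of `multiple`."""
    if total_epochs <= 1:
        return end_size
    fraction = min(max(epoch / (total_epochs - 1), 0.0), 1.0)
    size = start_size + fraction * (end_size - start_size)
    return int(round(size / multiple)) * multiple

# Resolution per epoch for a hypothetical 10-epoch run.
sizes = [resolution_for_epoch(e, 10) for e in range(10)]
```

Each epoch's data loader would then resize images to the returned size, so early epochs are cheap and later epochs see full detail.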
Self-Supervised Pretraining
Self-supervised pretraining learns visual representations from unlabeled images by solving pretext tasks, such as predicting masked patches or matching augmented views, producing features that can rival or exceed those from supervised ImageNet pretraining.
Transfer Learning
Transfer learning reuses features learned on a large source dataset (typically ImageNet) to solve a different target task, eliminating the need to train from scratch and dramatically reducing data and compute requirements.