Neural Network Foundations

Perceptrons, backpropagation, and deep learning basics.

Nonlinear transforms between layers – ReLU, sigmoid, tanh, and why the choice matters for gradient flow and expressivity.
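
To make the gradient-flow point concrete, here is a minimal sketch that compares the derivatives of the three activations at a small input and at a saturated one. The function names are illustrative and not taken from any particular library.

```python
import math

def relu(x):
    return max(0.0, x)

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def relu_grad(x):
    # Common convention: derivative is 0 for x <= 0 and 1 otherwise.
    return 1.0 if x > 0 else 0.0

def sigmoid_grad(x):
    s = sigmoid(x)
    return s * (1.0 - s)

def tanh_grad(x):
    return 1.0 - math.tanh(x) ** 2

# Derivative near zero (x = 0.5) versus at a saturated input (x = 4.0).
for name, grad in [("relu", relu_grad), ("sigmoid", sigmoid_grad), ("tanh", tanh_grad)]:
    print(name, grad(0.5), grad(4.0))
```

The sigmoid derivative never exceeds 0.25, and both sigmoid and tanh derivatives shrink toward zero for large inputs, so gradients multiplied through many such layers can vanish. ReLU passes a gradient of 1 for any positive input.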

Computing gradients layer by layer via the chain rule – the algorithm that makes deep learning computationally feasible.
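
The chain-rule bookkeeping can be sketched on a tiny scalar network, y = w2 * tanh(w1 * x + b1) + b2, with a squared-error loss. The network, the parameter values, and the helper names below are assumptions chosen for illustration. A finite-difference check is included as a sanity test of the analytic gradients.

```python
import math

def forward(params, x):
    w1, b1, w2, b2 = params
    z = w1 * x + b1        # hidden pre-activation
    h = math.tanh(z)       # hidden activation
    y = w2 * h + b2        # network output
    return z, h, y

def loss(params, x, target):
    _, _, y = forward(params, x)
    return 0.5 * (y - target) ** 2

def backward(params, x, target):
    """Gradients of the loss w.r.t. [w1, b1, w2, b2], output layer first."""
    w1, b1, w2, b2 = params
    z, h, y = forward(params, x)
    dy = y - target              # dL/dy
    dw2 = dy * h                 # dL/dw2
    db2 = dy                     # dL/db2
    dh = dy * w2                 # dL/dh, passed back to the hidden layer
    dz = dh * (1.0 - h ** 2)     # dL/dz, using tanh'(z) = 1 - tanh(z)^2
    dw1 = dz * x                 # dL/dw1
    db1 = dz                     # dL/db1
    return [dw1, db1, dw2, db2]

def numeric_grad(params, x, target, eps=1e-6):
    """Central finite differences, one parameter at a time."""
    grads = []
    for i in range(len(params)):
        up = list(params)
        up[i] += eps
        down = list(params)
        down[i] -= eps
        grads.append((loss(up, x, target) - loss(down, x, target)) / (2 * eps))
    return grads

params = [0.5, -0.1, 1.5, 0.2]
analytic = backward(params, x=0.8, target=1.0)
numeric = numeric_grad(params, x=0.8, target=1.0)
```

The backward pass reuses the intermediate values from the forward pass and costs about as much as one forward pass, whereas the finite-difference check needs two loss evaluations per parameter. That gap is what makes backpropagation practical for large networks.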

Normalizing layer inputs within each mini-batch – stabilizing training, enabling higher learning rates, and acting as regularization.
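
As a rough sketch of the training-time computation, the function below normalizes a single feature across a mini-batch and then applies a scale and shift. In a real layer `gamma` and `beta` would be learned per feature; here they are plain arguments, and the function name is illustrative.

```python
import math

def batch_norm(batch, gamma=1.0, beta=0.0, eps=1e-5):
    """Normalize one feature across a mini-batch using batch statistics."""
    n = len(batch)
    mean = sum(batch) / n
    var = sum((v - mean) ** 2 for v in batch) / n   # biased (population) variance
    return [gamma * (v - mean) / math.sqrt(var + eps) + beta for v in batch]

activations = [2.0, 4.0, 6.0, 8.0]
normalized = batch_norm(activations)
shifted = batch_norm(activations, gamma=2.0, beta=3.0)
```

With the defaults, the output has mean close to 0 and variance close to 1. At inference time, implementations typically replace the batch statistics with running averages collected during training, which this sketch omits.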

Randomly zeroing activations during training – an implicit ensemble that prevents co-adaptation of neurons.
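
A minimal sketch of the "inverted" dropout variant, which rescales the surviving activations during training so that nothing needs to change at inference time. The function name, signature, and the choice of `random.Random` as the noise source are assumptions for illustration.

```python
import random

def dropout(activations, p_drop, training=True, rng=random):
    """Zero each activation with probability p_drop and rescale the survivors."""
    if not training or p_drop == 0.0:
        return list(activations)          # identity at inference time
    keep = 1.0 - p_drop
    # Dividing by keep preserves the expected value of each activation.
    return [a / keep if rng.random() < keep else 0.0 for a in activations]

rng = random.Random(0)                    # fixed seed for reproducibility
out = dropout([1.0] * 10000, p_drop=0.5, rng=rng)
mean_out = sum(out) / len(out)            # stays close to the input mean of 1.0
```

Each training step samples a different mask, so each step effectively trains a different thinned sub-network. This is the sense in which dropout behaves like an implicit ensemble.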

SGD, momentum, RMSProp, Adam, and AdamW – from plain stochastic updates to momentum and per-parameter adaptive step sizes, which often navigate loss landscapes faster than vanilla gradient descent.
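
The update rules can be sketched side by side on plain Python lists. This is an illustrative sketch, not any library's implementation: the function names and defaults are assumptions, and the momentum form shown (velocity accumulates raw gradients) is one of several conventions in use. Setting `weight_decay` above zero in `adam_step` gives an AdamW-style decoupled decay.

```python
import math

def sgd_step(w, g, lr):
    return [wi - lr * gi for wi, gi in zip(w, g)]

def momentum_step(w, g, v, lr, mu=0.9):
    v = [mu * vi + gi for vi, gi in zip(v, g)]      # running velocity
    w = [wi - lr * vi for wi, vi in zip(w, v)]
    return w, v

def adam_step(w, g, m, s, t, lr=0.001, b1=0.9, b2=0.999, eps=1e-8, weight_decay=0.0):
    """One Adam update; t is the 1-based step count used for bias correction."""
    m = [b1 * mi + (1 - b1) * gi for mi, gi in zip(m, g)]        # first moment
    s = [b2 * si + (1 - b2) * gi * gi for si, gi in zip(s, g)]   # second moment
    new_w = []
    for wi, mi, si in zip(w, m, s):
        m_hat = mi / (1 - b1 ** t)
        s_hat = si / (1 - b2 ** t)
        wi = wi - lr * weight_decay * wi   # decoupled decay, applied to the weight itself
        new_w.append(wi - lr * m_hat / (math.sqrt(s_hat) + eps))
    return new_w, m, s

grad = [100.0, 0.01]   # two coordinates with very different gradient scales
w_sgd = sgd_step([1.0, 1.0], grad, lr=0.1)
w_mom, v = momentum_step([1.0, 1.0], grad, [0.0, 0.0], lr=0.1)
w_adam, m, s = adam_step([1.0, 1.0], grad, [0.0, 0.0], [0.0, 0.0], t=1, lr=0.1)
```

With these inputs, SGD overshoots on the steep coordinate and barely moves the flat one, while Adam's first step moves both coordinates by roughly `lr`, because each gradient is divided by an estimate of its own magnitude.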

From single linear classifiers to universal function approximators – stacking layers with nonlinearities between them creates representational power.
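
XOR is the standard example of what a second layer buys: no single linear threshold separates its classes, but two ReLU hidden units are enough. The weights below are set by hand for illustration, not learned, and `xor_net` is a hypothetical name.

```python
def relu(x):
    return max(0.0, x)

def xor_net(x1, x2):
    # Hidden layer: two ReLU units reading the same sum with different biases.
    h1 = relu(x1 + x2)
    h2 = relu(x1 + x2 - 1.0)
    # Output layer: a linear combination of the hidden units.
    return h1 - 2.0 * h2

inputs = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (1.0, 1.0)]
outputs = [xor_net(a, b) for a, b in inputs]
```

If the ReLUs were removed, the two layers would collapse into a single linear map and the network could no longer represent XOR.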

A single hidden layer with enough neurons and a suitable nonlinearity can approximate any continuous function on a bounded domain to arbitrary accuracy – but finding those weights is the hard part.
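
In one dimension the approximation claim can be illustrated constructively. The sketch below hand-builds a one-hidden-layer ReLU network that linearly interpolates a target function, then measures how the error shrinks as hidden units are added. It sidesteps training entirely, so it says nothing about how hard such weights are to find by gradient descent; all names are illustrative.

```python
import math

def relu(x):
    return max(0.0, x)

def build_relu_net(f, lo, hi, n_hidden):
    """One hidden ReLU layer that interpolates f at n_hidden + 1 evenly spaced knots."""
    knots = [lo + (hi - lo) * k / n_hidden for k in range(n_hidden + 1)]
    slopes = [(f(knots[k + 1]) - f(knots[k])) / (knots[k + 1] - knots[k])
              for k in range(n_hidden)]
    # Hidden unit k computes relu(x - knots[k]); its output weight is the
    # change in slope that the interpolant needs at that knot.
    out_w = [slopes[0]] + [slopes[k] - slopes[k - 1] for k in range(1, n_hidden)]
    bias = f(lo)

    def net(x):
        return bias + sum(w * relu(x - knots[k]) for k, w in enumerate(out_w))

    return net

def max_error(f, net, lo, hi, samples=1000):
    xs = [lo + (hi - lo) * i / samples for i in range(samples + 1)]
    return max(abs(f(x) - net(x)) for x in xs)

err_small = max_error(math.sin, build_relu_net(math.sin, 0.0, math.pi, 4), 0.0, math.pi)
err_large = max_error(math.sin, build_relu_net(math.sin, 0.0, math.pi, 64), 0.0, math.pi)
```

Going from 4 to 64 hidden units reduces the worst-case error on [0, π] substantially, which is the "enough neurons" part of the statement in miniature.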

Xavier, He, and orthogonal initialization – breaking symmetry and controlling signal magnitude at the start of training.
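
A minimal sketch of the Gaussian variants of Xavier (Glorot) and He initialization, which set the weight standard deviation from the layer's fan-in and fan-out. The helper names and layer sizes are assumptions for illustration; orthogonal initialization needs a matrix factorization and is left out.

```python
import math
import random

def xavier_std(fan_in, fan_out):
    # Xavier/Glorot: balances forward and backward signal variance.
    return math.sqrt(2.0 / (fan_in + fan_out))

def he_std(fan_in):
    # He: compensates for ReLU zeroing roughly half of the activations.
    return math.sqrt(2.0 / fan_in)

def init_weights(fan_in, fan_out, std, rng):
    """Return a fan_out x fan_in matrix of independent Gaussian weights."""
    return [[rng.gauss(0.0, std) for _ in range(fan_in)] for _ in range(fan_out)]

rng = random.Random(0)                    # fixed seed for reproducibility
fan_in, fan_out = 256, 128
W = init_weights(fan_in, fan_out, he_std(fan_in), rng)

flat = [w for row in W for w in row]
empirical_std = math.sqrt(sum(w * w for w in flat) / len(flat))
```

Random draws break symmetry because no two hidden units start with identical weights, and tying the scale to the fan-in keeps activation magnitudes from growing or shrinking systematically as depth increases.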