Lab 11: Gradient Descent Optimisers

Objective

Implement and compare gradient descent optimisers from scratch: vanilla SGD, mini-batch SGD, SGD with momentum, RMSProp, and the Adam optimiser, measuring convergence speed, final loss, and behaviour on different loss landscapes including saddle points and narrow valleys.
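The saddle points and narrow valleys mentioned above are usually modelled with standard test functions. A minimal sketch, assuming the classic quadratic saddle and the Rosenbrock valley (the lab's actual landscapes may differ):

```python
import numpy as np

def saddle(p):
    # f(x, y) = x^2 - y^2 has a saddle point at the origin.
    x, y = p
    return x**2 - y**2

def saddle_grad(p):
    x, y = p
    return np.array([2 * x, -2 * y])

def rosenbrock(p, a=1.0, b=100.0):
    # Narrow curved valley; global minimum at (a, a^2) = (1, 1).
    x, y = p
    return (a - x)**2 + b * (y - x**2)**2

def rosenbrock_grad(p, a=1.0, b=100.0):
    x, y = p
    return np.array([
        -2 * (a - x) - 4 * b * x * (y - x**2),
        2 * b * (y - x**2),
    ])
```

Plugging these gradient functions into each optimiser makes the differences visible: vanilla SGD stalls near the saddle and zigzags in the valley, while adaptive methods make steadier progress.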

Background

The choice of optimiser dramatically affects how fast and how well a neural network trains. Vanilla SGD often oscillates or gets stuck. Momentum accumulates a velocity vector, accelerating in consistent directions. RMSProp adapts the learning rate per-parameter based on recent gradient magnitudes. Adam (Adaptive Moment Estimation) combines both momentum and adaptive learning rates, and is currently the most popular optimiser in deep learning.
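The four update rules described above can be sketched as single-step functions. This is a minimal reference implementation; the hyperparameter defaults are common choices, not values mandated by the lab:

```python
import numpy as np

def sgd_step(w, grad, lr=0.01):
    # Vanilla SGD: step directly against the gradient.
    return w - lr * grad

def momentum_step(w, grad, v, lr=0.01, beta=0.9):
    # Momentum: accumulate a velocity that smooths and accelerates updates.
    v = beta * v + grad
    return w - lr * v, v

def rmsprop_step(w, grad, s, lr=0.01, beta=0.9, eps=1e-8):
    # RMSProp: divide by a running RMS of recent gradients (per-parameter LR).
    s = beta * s + (1 - beta) * grad**2
    return w - lr * grad / (np.sqrt(s) + eps), s

def adam_step(w, grad, m, v, t, lr=0.01, b1=0.9, b2=0.999, eps=1e-8):
    # Adam: first moment (momentum) + second moment (adaptive LR),
    # both bias-corrected for the zero initialisation (t starts at 1).
    m = b1 * m + (1 - b1) * grad
    v = b2 * v + (1 - b2) * grad**2
    m_hat = m / (1 - b1**t)
    v_hat = v / (1 - b2**t)
    return w - lr * m_hat / (np.sqrt(v_hat) + eps), m, v
```

Note that SGD and momentum carry O(d) state (parameters, plus one velocity buffer), RMSProp carries one running-average buffer, and Adam carries two, which is where the O(2d) memory cost in the summary table comes from.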

Time

30 minutes

Prerequisites

  • Lab 01 (Linear Regression): gradient descent basics

  • Lab 03 (Neural Network): understanding training loops

Tools

  • Docker: zchencow/innozverse-python:latest


Lab Instructions

💡 Adam's bias correction is critical in early training. At step t=1, the first moment m is initialised to 0, then updated to 0.9·0 + 0.1·grad = 0.1·grad. Without bias correction, Adam thinks the gradient is 10× smaller than it is, giving tiny updates. The correction m_hat = m / (1 − β₁ᵗ) at t=1 gives 0.1·grad / 0.1 = grad. This warms up correctly and avoids the "cold start" issue that plagued early adaptive optimisers.
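The arithmetic in the tip above can be checked numerically. The gradient value 4.0 here is illustrative:

```python
beta1 = 0.9
g = 4.0        # example gradient value (assumed for illustration)
m = 0.0        # first moment initialised to zero
t = 1

m = beta1 * m + (1 - beta1) * g   # 0.9*0 + 0.1*4.0 = 0.4 -- 10x too small
m_hat = m / (1 - beta1**t)        # 0.4 / 0.1 = 4.0 -- matches the true gradient
```

The correction factor 1 − β₁ᵗ approaches 1 as t grows, so it only matters during the first few dozen steps.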

📸 Verified Output:


Summary

Optimiser    | Memory | Adapts LR? | Best for
-------------|--------|------------|------------------------
SGD          | O(d)   | No         | Simple, convex problems
SGD+Momentum | O(d)   | No         | Faster convergence
RMSProp      | O(d)   | Per-param  | RNNs
Adam         | O(2d)  | Per-param  | Default choice

Last updated