Lab 5: Neural Networks Demystified

Objective

Build intuition for how neural networks actually work — from a single neuron to GPT-scale models. By the end you will understand:

  • What a neuron computes mathematically

  • How layers build increasingly abstract representations

  • What activation functions, loss functions, and optimisers do

  • The key architectural families: CNNs, RNNs, Transformers


The Neuron: One Unit of Computation

A biological neuron receives signals from other neurons, sums them, and fires if the sum exceeds a threshold. The artificial neuron does the same — but with mathematics:

                    w₁
x₁ ───────────────▶ ×  ⎤
                    w₂  ⎥
x₂ ───────────────▶ ×  ⎥──▶ Σ ──▶ f(Σ) ──▶ output
                    w₃  ⎥
x₃ ───────────────▶ ×  ⎦  + bias (b)

Mathematically:

output = f(w₁x₁ + w₂x₂ + w₃x₃ + b)

Where:

  • xᵢ = inputs (features)

  • wᵢ = weights (learned parameters)

  • b = bias (learned offset)

  • f = activation function (introduces non-linearity)
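A minimal sketch of this computation in plain Python (the inputs, weights, and bias below are arbitrary illustration values, with ReLU as the activation):

```python
def relu(z):
    return max(0.0, z)

def neuron(xs, ws, b, f):
    # weighted sum of inputs plus bias, passed through the activation f
    z = sum(w * x for w, x in zip(ws, xs)) + b
    return f(z)

out = neuron(xs=[1.0, 2.0, 3.0], ws=[0.5, -0.25, 0.1], b=0.2, f=relu)
# 0.5*1.0 - 0.25*2.0 + 0.1*3.0 + 0.2 = 0.5, and relu(0.5) = 0.5
```

Everything a large network does is this operation, repeated millions of times with learned values for `ws` and `b`.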


Activation Functions: The Non-Linearity Source

Without activation functions, stacking layers is useless — multiple linear transformations compose into one linear transformation.

Function   Formula                Shape                      Used In
--------   -------                -----                      -------
Sigmoid    1/(1+e⁻ˣ)              S-curve, (0,1)             Binary output
Tanh       (eˣ-e⁻ˣ)/(eˣ+e⁻ˣ)      S-curve, (-1,1)            RNNs
ReLU       max(0, x)              Flat then linear           Hidden layers (default)
GELU       x·Φ(x)                 Smooth ReLU                Transformers, BERT, GPT
Softmax    eˣⁱ/Σⱼeˣʲ              Probability distribution   Multiclass output

Layers: Building Abstraction

Each layer transforms the representation from the previous layer into something more useful: early layers pick up simple, local patterns, and deeper layers combine them into increasingly abstract features.
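A two-layer forward pass makes this concrete. The sketch below stacks two fully connected ReLU layers with randomly initialised weights (all sizes and values here are arbitrary illustrations):

```python
import random

def layer(xs, weights, biases):
    # one fully connected layer: each output neuron is a ReLU of its own weighted sum
    return [max(0.0, sum(w * x for w, x in zip(ws, xs)) + b)
            for ws, b in zip(weights, biases)]

random.seed(0)
x = [0.5, -1.2, 3.0]                                     # input features (3 values)
W1 = [[random.uniform(-1, 1) for _ in x] for _ in range(4)]
b1 = [0.0] * 4
W2 = [[random.uniform(-1, 1) for _ in range(4)] for _ in range(2)]
b2 = [0.0] * 2

h = layer(x, W1, b1)   # hidden representation (4 values)
y = layer(h, W2, b2)   # output representation (2 values)
```

The input never reaches the output directly: it is re-expressed by each layer in turn, and training shapes `W1`, `W2` so that the intermediate representation `h` is useful for the task.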


The Loss Function: Measuring Wrongness

The loss quantifies how far the model's predictions are from the truth. Training minimises this.

Loss Function               Use Case
-------------               --------
Binary Cross-Entropy        Binary classification (spam/not)
Categorical Cross-Entropy   Multiclass (digit 0–9)
Mean Squared Error          Regression (house prices)
Huber Loss                  Regression, outlier-robust
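Two of these can be sketched directly (the clipping constant `eps` below is a common convention to keep `log()` finite, not part of the mathematical definition):

```python
import math

def mse(y_true, y_pred):
    # mean squared error: average of squared differences
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)

def binary_cross_entropy(y_true, y_pred, eps=1e-12):
    # clip predictions away from exactly 0 and 1 so log() stays finite
    total = 0.0
    for t, p in zip(y_true, y_pred):
        p = min(max(p, eps), 1.0 - eps)
        total += -(t * math.log(p) + (1 - t) * math.log(1 - p))
    return total / len(y_true)
```

A perfect prediction gives (near-)zero loss; a maximally uncertain prediction of 0.5 on a positive example gives a cross-entropy of ln 2 ≈ 0.693.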


Backpropagation: Learning by Attribution

Backpropagation applies the chain rule from calculus to answer one question: how much did each weight contribute to the total loss?

PyTorch handles this automatically with autograd — every operation records itself during the forward pass, and the chain rule is applied backwards through that record.
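To see the chain rule at work without autograd, here is a hand-derived gradient for a one-weight model with squared-error loss, checked against a finite-difference estimate (the values of `w`, `x`, `y` are arbitrary illustrations):

```python
def loss(w, x, y):
    # forward pass: prediction = w * x, loss = (prediction - y)^2
    return (w * x - y) ** 2

def grad_loss(w, x, y):
    # chain rule: dL/dw = dL/dpred * dpred/dw = 2 * (w*x - y) * x
    return 2.0 * (w * x - y) * x

w, x, y = 0.3, 2.0, 1.0
analytic = grad_loss(w, x, y)

# sanity check: perturb w slightly and measure the change in loss directly
eps = 1e-6
numeric = (loss(w + eps, x, y) - loss(w - eps, x, y)) / (2 * eps)
```

Autograd does exactly this analytic computation, for millions of weights at once, by replaying the recorded operations in reverse.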


Optimisers: How Weights Update

Gradient descent: nudge every weight in the direction that reduces loss.

Learning rate: the most important hyperparameter.

  • Too high → overshoots minimum, loss oscillates or diverges

  • Too low → training is extremely slow

  • Learning rate scheduling reduces LR over time
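The three regimes above can be demonstrated on a toy loss. The sketch below runs plain gradient descent on L(w) = (w − 3)², whose minimum is at w = 3 (the specific learning rates are arbitrary illustrations):

```python
def grad(w):
    # gradient of the toy loss L(w) = (w - 3)^2, minimised at w = 3
    return 2.0 * (w - 3.0)

def descend(lr, steps=100, w=0.0):
    for _ in range(steps):
        w -= lr * grad(w)   # nudge w against the gradient
    return w

good = descend(lr=0.1)    # converges very close to 3
slow = descend(lr=0.001)  # barely moves in 100 steps
bad  = descend(lr=1.1)    # overshoots further each step and diverges
```

With lr = 1.1 the error is multiplied by |1 − 2·1.1| = 1.2 at every step, so the iterate blows up instead of converging — exactly the "too high" failure mode described above.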


Architectural Families

Convolutional Neural Networks (CNNs) — for Images

Instead of connecting every input to every neuron in the next layer (expensive), CNNs use small filters that slide across the image — detecting the same feature wherever it appears (weight sharing gives translation equivariance, often loosely called invariance).

Recurrent Neural Networks (RNNs/LSTMs) — for Sequences

Process sequences step-by-step, maintaining a hidden state (memory).
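A single-unit RNN makes the "hidden state as memory" idea concrete (the weights `w_h`, `w_x` are arbitrary illustrations; a real RNN uses learned weight matrices):

```python
import math

def rnn_step(h, x, w_h, w_x, b):
    # the new hidden state mixes the previous state (memory) with the current input
    return math.tanh(w_h * h + w_x * x + b)

def run_rnn(xs, w_h=0.5, w_x=1.0, b=0.0):
    h = 0.0
    for x in xs:            # process the sequence one step at a time
        h = rnn_step(h, x, w_h, w_x, b)
    return h                # the final hidden state summarises the whole sequence
```

Because each step folds the previous state back in, the same inputs in a different order give a different final state — the network is genuinely order-sensitive, which is the point of the architecture.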

Transformers — for Everything

Process the entire sequence at once using self-attention: each position attends to all other positions simultaneously.
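A minimal sketch of scaled dot-product self-attention in NumPy (simplified: queries, keys, and values are the token vectors themselves, with none of the learned projection matrices a real Transformer would use):

```python
import numpy as np

def self_attention(x):
    # x: (seq_len, d) matrix of token vectors
    d = x.shape[-1]
    scores = x @ x.T / np.sqrt(d)                        # every position scores every position
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)       # softmax over positions
    return weights @ x                                   # each output is a weighted mix of all positions

x = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
out = self_attention(x)    # same shape as x; every row has attended to every row
```

Note there is no loop over time steps: all positions are processed in a single matrix product, which is why Transformers parallelise so much better than RNNs.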

The Transformer replaced RNNs as the dominant architecture for NLP (2017) and is now being applied to images (ViT), audio, and multimodal data.


Overfitting vs Underfitting

Regularisation techniques to combat overfitting:

Technique             How It Works
---------             ------------
Dropout               Randomly disable neurons during training
Weight Decay (L2)     Penalise large weights
Data Augmentation     Artificially expand training set
Early Stopping        Stop when validation loss stops improving
Batch Normalisation   Normalise layer outputs → smoother training
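As one example, dropout fits in a few lines. The sketch below uses the common "inverted dropout" convention: survivors are scaled by 1/(1 − p) during training so the expected activation is unchanged, and inference applies no scaling at all:

```python
import random

def dropout(xs, p_drop, training=True):
    # inverted dropout: zero each value with probability p_drop during training,
    # scaling survivors by 1/(1 - p_drop) so the expected value is unchanged
    if not training:
        return list(xs)
    keep = 1.0 - p_drop
    return [x / keep if random.random() < keep else 0.0 for x in xs]

random.seed(0)
out = dropout([1.0, 1.0, 1.0, 1.0], p_drop=0.5)   # each value is now 0.0 or 2.0
```

Because a different random subset of neurons is disabled on every batch, no single neuron can be relied upon, which pushes the network toward redundant, more robust features.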


Summary

Concept       One-line Description
-------       --------------------
Neuron        Weighted sum + activation function
Layer         Group of neurons transforming one representation to another
Activation    Non-linearity that lets deep networks learn complex patterns
Loss          Numerical measure of how wrong the model is
Backprop      Chain rule applied backwards to compute gradients
Optimiser     Algorithm that uses gradients to update weights
CNN           Spatial feature detection via sliding filters
Transformer   Parallel sequence processing via self-attention


Further Reading
