Lab 6: The Transformer Revolution — Attention is All You Need
Objective
Understand the architecture that powers every major AI system today. By the end you will be able to explain:
Why RNNs failed at long-range dependencies
What self-attention computes and why it's powerful
The full Transformer encoder/decoder architecture
How scaling Transformers produced the LLM revolution
The Problem with RNNs
Before Transformers, sequences (text, audio, time series) were processed with Recurrent Neural Networks — reading one token at a time, updating a hidden state:
"The cat sat on the mat"
  ↓     ↓    ↓    ↓    ↓    ↓
 h₁  → h₂ → h₃ → h₄ → h₅ → h₆ → output

Three fatal problems:
Sequential — you can't process token 5 until you've processed tokens 1–4. No parallelism → slow training.
Vanishing gradients — in long sequences, the gradient signal fades as it propagates backwards through hundreds of time steps. The model forgets early tokens.
Fixed-size bottleneck — in encoder-decoder RNNs (for translation), the entire source sentence must be compressed into one fixed-size vector. Information loss is inevitable.
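A toy NumPy sketch makes the sequential bottleneck concrete (sizes and weights are made up for illustration). Each hidden state depends on the previous one, so the loop cannot be parallelised across tokens:

```python
import numpy as np

rng = np.random.default_rng(0)
d = 8                                  # hidden size (toy)
W_h = rng.normal(size=(d, d)) * 0.1    # hidden-to-hidden weights
W_x = rng.normal(size=(d, d)) * 0.1    # input-to-hidden weights

tokens = rng.normal(size=(6, d))       # "The cat sat on the mat" as 6 toy embeddings
h = np.zeros(d)
for x in tokens:                       # token 5 must wait for tokens 1-4
    h = np.tanh(W_h @ h + W_x @ x)     # hidden state updated one step at a time
print(h.shape)                         # (8,)
```

In an encoder-decoder RNN, this final `h` is the fixed-size vector the whole sentence must squeeze through.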
The Attention Mechanism (2014)
Bahdanau et al. (2014) added attention to RNNs: instead of compressing everything into one vector, let the decoder look back at ALL encoder hidden states — weighted by relevance.
This was the idea that unlocked everything. The 2017 paper took it further: what if attention is the only mechanism you need?
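The core of the 2014 idea fits in a few lines. This sketch scores encoder states with plain dot products for brevity; Bahdanau's actual scoring function was a small learned MLP, and all values here are random stand-ins:

```python
import numpy as np

rng = np.random.default_rng(1)
enc_states = rng.normal(size=(6, 8))   # one hidden state per source token
dec_state = rng.normal(size=(8,))      # current decoder state

scores = enc_states @ dec_state        # relevance of each source token
weights = np.exp(scores) / np.exp(scores).sum()   # softmax over source positions
context = weights @ enc_states         # attention-weighted context vector
print(weights.sum())                   # 1.0 (weights form a distribution)
```

The decoder gets a fresh, relevance-weighted view of the whole source at every output step, instead of one compressed summary.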
Attention is All You Need (2017)
Google Brain researchers Vaswani, Shazeer, Parmar et al. proposed the Transformer: discard the RNN entirely. Use only attention. Process all tokens simultaneously.
The paper's title was deliberately provocative. It was correct.
The Query-Key-Value Framework
Self-attention is computed from three learned projections of the input: Q (queries), K (keys), V (values). Think of it as a soft database lookup: each query is compared against every key, and the resulting weights mix the values:

Attention(Q, K, V) = softmax(QKᵀ / √dₖ) V
Concrete example with 4 tokens:
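A minimal NumPy sketch with four random token embeddings (weights and sizes are made up for illustration):

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)   # stabilise before exponentiating
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

rng = np.random.default_rng(2)
d_k = 8
X = rng.normal(size=(4, d_k))          # 4 toy token embeddings
W_q, W_k, W_v = (rng.normal(size=(d_k, d_k)) for _ in range(3))

Q, K, V = X @ W_q, X @ W_k, X @ W_v    # queries, keys, values
scores = Q @ K.T / np.sqrt(d_k)        # every token scores every other token
A = softmax(scores)                    # 4x4 attention weights; rows sum to 1
out = A @ V                            # each output is a weighted mix of values
print(A.shape, out.shape)              # (4, 4) (4, 8)
```

Note the 4×4 score matrix: token 1 and token 4 interact directly, in one matrix multiply, regardless of how far apart they sit.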
This is why Transformers handle long-range dependencies so well — every token can attend to every other token in a single operation.
Multi-Head Attention
Rather than one attention computation, run h parallel attention heads, each learning to attend to different aspects:
Each head learns different relationships:
Head 1: syntactic dependencies (subject-verb agreement)
Head 2: coreference (pronouns to their referents)
Head 3: positional proximity
Head 4–8: semantic relationships, etc.
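The head-splitting bookkeeping can be sketched as a reshape (toy sizes; real implementations also apply learned per-head projections and a final output projection):

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(3)
n, d_model, h = 4, 16, 4               # 4 tokens, model dim 16, 4 heads
d_head = d_model // h
Q = rng.normal(size=(n, d_model))
K = rng.normal(size=(n, d_model))
V = rng.normal(size=(n, d_model))

# (n, d_model) -> (h, n, d_head): each head attends over its own feature slice
split = lambda M: M.reshape(n, h, d_head).transpose(1, 0, 2)
Qh, Kh, Vh = split(Q), split(K), split(V)

A = softmax(Qh @ Kh.transpose(0, 2, 1) / np.sqrt(d_head))  # (h, n, n)
heads = A @ Vh                                             # (h, n, d_head)
out = heads.transpose(1, 0, 2).reshape(n, d_model)         # concatenate heads
print(out.shape)                       # (4, 16)
```

Each head produces its own n×n attention pattern, which is what lets different heads specialise in different relationships.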
Positional Encoding
Self-attention has no notion of order — "cat sat mat" and "mat sat cat" would produce the same attention scores. Positional encoding adds position information:
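The original paper's sinusoidal scheme gives each position a unique pattern of sines and cosines at geometrically spaced frequencies, added to the token embeddings:

```python
import numpy as np

def positional_encoding(seq_len, d_model):
    pos = np.arange(seq_len)[:, None]              # (seq_len, 1)
    i = np.arange(0, d_model, 2)[None, :]          # even dimension indices
    angles = pos / (10000 ** (i / d_model))
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)                   # even dims: sine
    pe[:, 1::2] = np.cos(angles)                   # odd dims: cosine
    return pe

pe = positional_encoding(seq_len=6, d_model=8)
print(pe.shape)        # (6, 8)
print(pe[0])           # position 0: sines are 0, cosines are 1
```

Because the encoding is a fixed function of position, it generalises to any sequence length without learned parameters.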
The Full Transformer Architecture
Three variants:
Encoder-only (BERT): good at understanding. Used for classification, NER, embeddings.
Decoder-only (GPT, Llama, Claude): good at generation. Predicts the next token, autoregressively.
Encoder-Decoder (T5, BART): translation, summarisation.
Why Scaling Works: The Scaling Laws
Kaplan et al. (OpenAI, 2020) showed that Transformer performance follows predictable power laws:
The implication: given 10× more compute, you can predict how much better your model will be before training it. And the relationship is smooth, with no sharp diminishing returns observed at the scales tested.
Chinchilla (DeepMind, 2022) refined this: for a given compute budget, existing models were undertrained, with too many parameters for too few tokens. Compute-optimal training scales parameters and training data roughly equally.
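A back-of-envelope sizing sketch, assuming the commonly cited Chinchilla rule of thumb of roughly 20 training tokens per parameter and the standard compute estimate C ≈ 6·N·D FLOPs (N parameters, D tokens):

```python
def chinchilla_optimal(compute_flops, tokens_per_param=20):
    # C = 6 * N * D and D = tokens_per_param * N  =>  N = sqrt(C / (6 * r))
    # (tokens_per_param=20 is a widely quoted approximation, not the paper's fit)
    n_params = (compute_flops / (6 * tokens_per_param)) ** 0.5
    return n_params, tokens_per_param * n_params

N, D = chinchilla_optimal(5.76e23)     # a budget near Chinchilla's training run
print(f"{N:.2e} params, {D:.2e} tokens")   # ~7e10 params, ~1.4e12 tokens
```

The output lands close to Chinchilla's actual configuration (70B parameters, 1.4T tokens), which is the point of the law: sizing becomes arithmetic.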
From Transformer to LLMs: The Key Innovations
Transformer (2017): Base architecture; replaced RNNs
GPT (2018): Decoder-only; next-token prediction at scale
BERT (2018): Encoder-only; masked language modelling
Scaling Laws (2020): Showed predictable improvement with scale
RLHF (2022): Aligned LLMs to human preferences via feedback
FlashAttention (2022): 2–4× faster, far more memory-efficient attention
RoPE / ALiBi (2021): Positional encodings that extrapolate to longer contexts
Mixture of Experts (2023–24): Route tokens to specialised sub-networks → more efficient
Summary
The Transformer succeeded because:
Parallel computation — every token processed simultaneously (vs. RNN's sequential)
Global context — every token attends to every other token (no bottleneck)
Scalability — more parameters + more data = better, predictably
Transfer learning — pre-train once on internet text; fine-tune on any task
Everything after 2017 — BERT, GPT, Claude, Gemini, Llama, Stable Diffusion — is built on Transformers.