Lab 6: The Transformer Revolution — Attention is All You Need

Objective

Understand the architecture that powers every major AI system today. By the end you will be able to explain:

  • Why RNNs failed at long-range dependencies

  • What self-attention computes and why it's powerful

  • The full Transformer encoder/decoder architecture

  • How scaling Transformers produced the LLM revolution


The Problem with RNNs

Before Transformers, sequences (text, audio, time series) were processed with Recurrent Neural Networks — reading one token at a time, updating a hidden state:

"The  cat  sat  on  the  mat"
   ↓    ↓    ↓    ↓    ↓    ↓
  h₁ → h₂ → h₃ → h₄ → h₅ → h₆ → output

Three fatal problems:

  1. Sequential — you can't process token 5 until you've processed tokens 1–4. No parallelism → slow training.

  2. Vanishing gradients — in long sequences, the gradient signal fades as it propagates backwards through hundreds of time steps. The model forgets early tokens.

  3. Fixed-size bottleneck — in encoder-decoder RNNs (for translation), the entire source sentence must be compressed into one fixed-size vector. Information loss is inevitable.
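The sequential loop in problem 1 can be sketched directly. A minimal vanilla-RNN forward pass in plain numpy (the weights and token embeddings below are random, purely for illustration):

```python
import numpy as np

def rnn_forward(xs, Wx, Wh, b):
    """Vanilla RNN: h_t = tanh(Wx·x_t + Wh·h_{t-1} + b).
    Each step depends on the previous hidden state, so the loop cannot be parallelised."""
    h = np.zeros(Wh.shape[0])
    states = []
    for x in xs:                      # strictly sequential: token t waits for token t-1
        h = np.tanh(Wx @ x + Wh @ h + b)
        states.append(h)
    return states

rng = np.random.default_rng(0)
d_in, d_hidden = 4, 8
Wx = rng.normal(size=(d_hidden, d_in))
Wh = rng.normal(size=(d_hidden, d_hidden))
b = np.zeros(d_hidden)
tokens = rng.normal(size=(6, d_in))   # "The cat sat on the mat" as 6 embedding vectors
states = rnn_forward(tokens, Wx, Wh, b)
print(len(states), states[-1].shape)  # 6 (8,)
```

The gradient of the loss with respect to h₁ has to flow back through every tanh in this loop, which is exactly where the vanishing-gradient problem (problem 2) comes from.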


The Attention Mechanism (2014)

Bahdanau et al. (2014) added attention to RNNs: instead of compressing everything into one vector, let the decoder look back at ALL encoder hidden states — weighted by relevance.

This was the idea that unlocked everything. The 2017 paper took it further: what if attention is the only mechanism you need?


Attention is All You Need (2017)

Google Brain researchers Vaswani, Shazeer, Parmar et al. proposed the Transformer: discard the RNN entirely. Use only attention. Process all tokens simultaneously.

The paper's title was deliberately provocative. It was correct.

The Query-Key-Value Framework

Self-attention is computed from three learned projection matrices that produce Q (queries), K (keys), and V (values) for every token. Think of it as a soft database lookup: each token's query is compared against every token's key, and the resulting weights blend together the values:

    Attention(Q, K, V) = softmax(QKᵀ / √d_k) · V

The √d_k scaling stops the dot products from growing with dimension and saturating the softmax.

Concrete example with 4 tokens:
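A minimal numpy sketch of scaled dot-product attention over 4 tokens (the dimensions, weights, and tokens below are random and illustrative, not from the paper):

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    """Scaled dot-product attention: softmax(QK^T / sqrt(d_k)) V."""
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)      # (4, 4): similarity of each query to each key
    weights = softmax(scores, axis=-1)   # each row is a probability distribution
    return weights @ V, weights

rng = np.random.default_rng(0)
d_model = 8
X = rng.normal(size=(4, d_model))        # 4 token embeddings
Wq, Wk, Wv = (rng.normal(size=(d_model, d_model)) for _ in range(3))
out, weights = attention(X @ Wq, X @ Wk, X @ Wv)
print(out.shape)       # (4, 8): one context-mixed vector per token
print(weights.shape)   # (4, 4): every token attends to every other token
```

Note that the whole computation is a handful of matrix multiplications over all tokens at once: nothing is sequential.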

This is why Transformers handle long-range dependencies so well — every token can attend to every other token in a single operation.


Multi-Head Attention

Rather than one attention computation, run h parallel attention heads. Each head has its own Q/K/V projections over a d_model/h-dimensional slice, and the head outputs are concatenated and projected back to d_model.

Each head learns different relationships:

  • Head 1: syntactic dependencies (subject-verb agreement)

  • Head 2: coreference (pronouns to their referents)

  • Head 3: positional proximity

  • Head 4–8: semantic relationships, etc.
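The split-attend-concatenate pattern can be sketched in a few lines of numpy (random weights, illustrative dimensions):

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_attention(X, Wq, Wk, Wv, Wo, h):
    """Split d_model into h heads, attend in each head independently,
    concatenate the results, then apply the output projection Wo."""
    n, d_model = X.shape
    d_head = d_model // h
    # (n, d_model) -> (h, n, d_head): each head sees its own slice
    split = lambda M: M.reshape(n, h, d_head).transpose(1, 0, 2)
    Q, K, V = split(X @ Wq), split(X @ Wk), split(X @ Wv)
    scores = Q @ K.transpose(0, 2, 1) / np.sqrt(d_head)    # (h, n, n)
    heads = softmax(scores) @ V                            # (h, n, d_head)
    concat = heads.transpose(1, 0, 2).reshape(n, d_model)  # re-join the heads
    return concat @ Wo

rng = np.random.default_rng(0)
n, d_model, h = 4, 16, 4
X = rng.normal(size=(n, d_model))
Wq, Wk, Wv, Wo = (rng.normal(size=(d_model, d_model)) for _ in range(4))
out = multi_head_attention(X, Wq, Wk, Wv, Wo, h)
print(out.shape)  # (4, 16): same shape as the input X
```

Because each head attends over a smaller d_head slice, h heads cost about the same as one full-width head, but can specialise in the ways listed above.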


Positional Encoding

Self-attention by itself has no notion of order: "cat sat mat" and "mat sat cat" would produce the same attention scores. Positional encoding injects position information by adding a position-dependent vector to each token embedding before the first layer. The original paper used fixed sine and cosine waves of geometrically spaced frequencies, giving every position a unique, smoothly varying signature.
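The sinusoidal scheme from the 2017 paper, sketched in numpy (sequence length and model width here are arbitrary):

```python
import numpy as np

def sinusoidal_positional_encoding(seq_len, d_model):
    """PE[pos, 2i]   = sin(pos / 10000^(2i / d_model))
       PE[pos, 2i+1] = cos(pos / 10000^(2i / d_model))"""
    pos = np.arange(seq_len)[:, None]        # (seq_len, 1)
    i = np.arange(0, d_model, 2)[None, :]    # (1, d_model/2): even dimensions
    angles = pos / np.power(10000.0, i / d_model)
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)             # even dims get sine
    pe[:, 1::2] = np.cos(angles)             # odd dims get cosine
    return pe

pe = sinusoidal_positional_encoding(seq_len=6, d_model=8)
# The encoding is simply added to the embeddings: X = embeddings + pe
print(pe.shape)  # (6, 8)
```

Because the frequencies are fixed rather than learned, nearby positions get similar vectors and the scheme generalises to sequence lengths not seen during training.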


The Full Transformer Architecture

Three variants:

  • Encoder-only (BERT): good at understanding. Used for classification, NER, embeddings.

  • Decoder-only (GPT, Llama, Claude): good at generation. Predicts the next token, autoregressively.

  • Encoder-Decoder (T5, BART): translation, summarisation.
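Decoder-only models achieve autoregressive generation with a causal mask: before the softmax, attention scores to future positions are set to -∞, so token t can only attend to tokens ≤ t. A small illustrative sketch:

```python
import numpy as np

def causal_mask(n):
    """True above the diagonal marks the future positions to be masked out
    (their scores are set to -inf before the softmax)."""
    return np.triu(np.ones((n, n), dtype=bool), k=1)

mask = causal_mask(4)
print(mask.astype(int))
# [[0 1 1 1]
#  [0 0 1 1]
#  [0 0 0 1]
#  [0 0 0 0]]
```

This is the only architectural change separating GPT-style models from the bidirectional encoder stack, yet it is what makes next-token prediction training possible.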


Why Scaling Works: The Scaling Laws

Kaplan et al. (OpenAI, 2020) showed that Transformer performance follows predictable power laws:

The implication: given 10× more compute, you can predict in advance how much better your model's loss will be. The relationship is a smooth power law, and performance kept improving predictably at every scale tested.

Chinchilla (DeepMind, 2022) refined this: for a given compute budget, existing models were too large and had been trained on too few tokens. Compute-optimal training scales parameters and training tokens roughly in proportion.
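A back-of-the-envelope version of the compute-optimal split, using two widely quoted approximations (training FLOPs ≈ 6·N·D, and roughly 20 training tokens per parameter). These are rules of thumb, not the paper's fitted constants:

```python
def chinchilla_optimal(compute_flops):
    """Rough compute-optimal split: C ≈ 6 * N * D with D ≈ 20 * N,
    so C ≈ 120 * N^2  =>  N = sqrt(C / 120), D = 20 * N.
    (Heuristic approximation, not the paper's fitted coefficients.)"""
    n_params = (compute_flops / 120) ** 0.5
    n_tokens = 20 * n_params
    return n_params, n_tokens

# A budget on the order of Chinchilla's own training run (~6e23 FLOPs)
n, d = chinchilla_optimal(5.76e23)
print(f"params ≈ {n:.2e}, tokens ≈ {d:.2e}")
```

Under these heuristics the answer lands near 70B parameters and 1.4T tokens, which matches Chinchilla's actual configuration and explains why later models train far longer on far more data than the raw parameter count alone would suggest.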


From Transformer to LLMs: The Key Innovations

  • Transformer (2017): base architecture; replaced RNNs

  • GPT (2018): decoder-only; next-token prediction at scale

  • BERT (2018): encoder-only; masked language modelling

  • Scaling Laws (2020): showed predictable improvement with scale

  • RLHF (2022): aligned LLMs to human preferences via feedback

  • Flash Attention (2022): 2–4× faster, roughly 10× more memory-efficient attention

  • RoPE / ALiBi (2022): positional encodings that extrapolate to longer contexts

  • Mixture of Experts (2024): routes tokens to specialised sub-networks for efficiency


Summary

The Transformer succeeded because:

  1. Parallel computation — every token processed simultaneously (vs. RNN's sequential)

  2. Global context — every token attends to every other token (no bottleneck)

  3. Scalability — more parameters + more data = better, predictably

  4. Transfer learning — pre-train once on internet text; fine-tune on any task

Everything after 2017 — BERT, GPT, Claude, Gemini, Llama, Stable Diffusion — is built on Transformers.

