Lab 7: Large Language Models Explained — GPT, Claude, Gemini, Llama
Objective
Understand how modern LLMs are built, trained, and differentiated. By the end you will be able to:
Explain the three stages of LLM training (pre-training, SFT, RLHF)
Compare the major LLM families and their key design choices
Describe what makes each model distinctive
Understand context windows, tokenisation, and inference
What is a Large Language Model?
A Large Language Model is a Transformer-based neural network trained to predict the next token in a sequence, at massive scale.
Input: "The capital of France is"
LLM: [probability for every token in the vocabulary]
Paris: 0.94
Lyon: 0.02
Rome: 0.01
...
Output: "Paris"
That's it. Next-token prediction, trained on trillions of tokens of text. The emergent result: a system that can write code, reason about logic, translate languages, summarise documents, and pass professional exams.
Stage 1: Pre-Training
What happens: The model trains on a massive dataset of internet text, books, code, and scientific papers — learning to predict the next token.
Cost: Estimated $4M–$100M per major model in compute alone.
Result: A base model that can complete any text — but is not yet useful as an assistant. It might complete "How do I make a bomb?" with "Here are the steps..." because the training data contains such text.
Stage 2: Supervised Fine-Tuning (SFT)
Take the pre-trained base model and fine-tune it on high-quality (prompt, response) pairs written by humans (example pairs are shown below).
This teaches the model to follow instructions and respond helpfully. But it doesn't teach it to be safe or aligned — that requires stage 3.
Stage 3: RLHF — Reinforcement Learning from Human Feedback
The alignment stage. Humans rate model outputs; those ratings train a reward model; the LLM is then optimised to maximise that reward using PPO (Proximal Policy Optimisation).
RLHF is why ChatGPT felt so different from GPT-3 — same underlying capability, vastly more aligned behaviour.
Modern variants:
DPO (Direct Preference Optimisation) — simpler than RLHF; optimises directly on preference pairs, with no separate reward model or RL loop
Constitutional AI (Anthropic) — uses a set of principles; the model critiques its own outputs
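The DPO variant above can be sketched as a single per-pair loss. The β value and the log-probabilities below are illustrative numbers, not outputs of a real model; each log-probability stands for the sum over a response's tokens.

```python
import math

def dpo_loss(logp_chosen, logp_rejected, ref_logp_chosen, ref_logp_rejected,
             beta=0.1):
    """Per-pair DPO loss (illustrative; beta and log-probs are made up)."""
    margin = beta * ((logp_chosen - ref_logp_chosen)
                     - (logp_rejected - ref_logp_rejected))
    # -log sigmoid(margin): low when the policy prefers the chosen response
    # more strongly than the frozen reference model does
    return -math.log(1 / (1 + math.exp(-margin)))

print(dpo_loss(-10.0, -12.0, -11.0, -11.0))  # policy favours chosen → lower loss
print(dpo_loss(-12.0, -10.0, -11.0, -11.0))  # policy favours rejected → higher loss
```

The loss only depends on how much more the policy prefers the chosen response than the reference model does, which is what replaces the explicit reward model.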
Tokenisation
Text is converted to integers before entering the model:
Why it matters:
LLMs think in tokens, not characters or words
Token count ≠ word count (~0.75 words per token for English)
Non-English languages are often less efficient (more tokens per word)
Context window limits are measured in tokens, not words
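The text → integer-IDs mapping can be sketched with a toy greedy longest-match tokeniser over a hypothetical hand-made vocabulary. Real tokenisers (BPE, SentencePiece) learn their vocabulary from data; this only shows the shape of the operation.

```python
# Hypothetical subword vocabulary (real vocabularies have ~50k-200k entries)
VOCAB = {"The": 0, " cap": 1, "ital": 2, " of": 3, " France": 4, " is": 5}

def tokenise(text):
    ids = []
    while text:
        # take the longest vocabulary entry that prefixes the remaining text
        match = max((t for t in VOCAB if text.startswith(t)), key=len)
        ids.append(VOCAB[match])
        text = text[len(match):]
    return ids

print(tokenise("The capital of France is"))  # → [0, 1, 2, 3, 4, 5]
```

Note that "capital" splits across two tokens: subword tokenisers routinely break words apart, which is why token count and word count diverge.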
# Conceptual pre-training loop
for batch in internet_text_corpus:            # ~15 trillion tokens
    tokens = tokenise(batch)                  # split text into integer IDs
    # Input: all tokens except last. Target: all tokens except first (shifted by 1)
    input_ids = tokens[:-1]
    target_ids = tokens[1:]
    logits = model(input_ids)                 # forward pass
    loss = cross_entropy(logits, target_ids)  # how wrong was each prediction?
    loss.backward()                           # compute gradients
    optimiser.step()                          # update 70B parameters
# Repeat ~10M times
Example SFT (prompt, response) pairs:
Prompt: "Explain quantum entanglement to a 10-year-old"
Response: "Imagine you and your friend each have a magic coin..."
Prompt: "Write a Python function to sort a list"
Response: "def sort_list(lst):\n return sorted(lst)\n..."
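The SFT objective is the same next-token cross-entropy as pre-training, except the loss is usually computed only on the response tokens: the prompt is context, not a training target. A sketch with made-up token IDs:

```python
# Illustrative token IDs, not from a real tokeniser
prompt_ids   = [101, 57, 892]      # e.g. "Write a Python function to sort a list"
response_ids = [14, 660, 31, 2]    # e.g. "def sort_list(lst): ..."

tokens  = prompt_ids + response_ids
inputs  = tokens[:-1]              # model sees everything up to position t
targets = tokens[1:]               # ...and predicts the token at position t+1
# mask: 0 where the target is still inside the prompt, 1 for response targets
mask = [0] * (len(prompt_ids) - 1) + [1] * len(response_ids)

# only (input, target) pairs whose target is a response token enter the loss
trained_pairs = [(i, t) for i, t, m in zip(inputs, targets, mask) if m]
print(trained_pairs)
```

Every target that survives the mask is a response token, so the model is rewarded for producing the assistant's reply, never for reproducing the user's prompt.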
The RLHF pipeline in detail:
Step 1: Sample many responses to the same prompt
"Tell me about sharks"
Response A: "Sharks are fascinating marine predators..."
Response B: "Sharks will kill you. They're terrifying."
Step 2: Human raters rank A > B
(more helpful, more accurate, less alarmist)
Step 3: Train a reward model R(prompt, response) → score
Step 4: Fine-tune LLM with RL to maximise R(prompt, response)
while staying close to the SFT model (KL divergence penalty)
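Step 3 hinges on the reward model, which is fit to the human rankings with a pairwise (Bradley-Terry) loss that pushes the preferred response's score above the other. The scores below are illustrative floats, not real reward-model outputs.

```python
import math

def reward_model_loss(score_preferred, score_rejected):
    """Pairwise ranking loss: -log sigmoid(r_preferred - r_rejected)."""
    return -math.log(1 / (1 + math.exp(-(score_preferred - score_rejected))))

print(reward_model_loss(2.0, -1.0))  # ranking respected → small loss
print(reward_model_loss(-1.0, 2.0))  # ranking violated → large loss
```

Once trained, this scalar-valued scorer is what the RL stage in Step 4 maximises.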
Context window sizes by model (approximate):
GPT-3: 4,096 tokens ≈ 3,000 words
GPT-4: 128,000 tokens ≈ 96,000 words (a full novel)
Claude 3: 200,000 tokens ≈ 150,000 words
Gemini: 1,000,000 tokens ≈ 750,000 words (the entire Harry Potter series)
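The word estimates above follow from the ~0.75 words-per-token rule of thumb from the tokenisation section. A quick sketch of the conversion (the constant is an English-language approximation; real counts vary by tokeniser):

```python
WORDS_PER_TOKEN = 0.75  # rough rule of thumb for English text

def tokens_to_words(n_tokens):
    return int(n_tokens * WORDS_PER_TOKEN)

print(tokens_to_words(128_000))    # → 96000 words (GPT-4's window)
print(tokens_to_words(1_000_000))  # → 750000 words (Gemini's window)
```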
Inference is autoregressive: the model generates one token at a time, appending each to the input for the next step.
Prompt: "The"
Step 1: model → "cat" → "The cat"
Step 2: model → "sat" → "The cat sat"
Step 3: model → "on" → "The cat sat on"
Step 4: model → "the" → "The cat sat on the"
Step 5: model → "mat" → "The cat sat on the mat"
Step 6: model → "<|endoftext|>" → stop
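The steps above as a loop. A toy lookup table stands in for the model here; a real LLM would run a full forward pass and sample at each step.

```python
# Hypothetical "model": maps the current context to its next token
NEXT = {
    "The": "cat", "The cat": "sat", "The cat sat": "on",
    "The cat sat on": "the", "The cat sat on the": "mat",
    "The cat sat on the mat": "<|endoftext|>",
}

def generate(prompt, max_steps=10):
    text = prompt
    for _ in range(max_steps):
        token = NEXT.get(text, "<|endoftext|>")  # stand-in for model + sampling
        if token == "<|endoftext|>":
            break                                # stop token ends generation
        text = text + " " + token                # feed the output back in
    return text

print(generate("The"))  # → "The cat sat on the mat"
```

The key property is the feedback: each generated token becomes part of the input for the next step, which is why generating n tokens costs n forward passes.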
# High temperature → more creative, more random
# Low temperature → more deterministic, more precise
logits = model(input_ids)
probs = softmax(logits / temperature)  # lower temperature sharpens the distribution
next_token = sample(probs)             # temperature=1.0 → sample the raw distribution
# temperature=0.0 is special-cased as greedy argmax (you can't divide by zero)
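A runnable version of this sketch, with a tiny made-up logit vector standing in for a real model's output:

```python
import math
import random

def sample_token(logits, temperature, rng=random.Random(0)):
    if temperature == 0.0:
        # greedy decoding is special-cased: dividing logits by zero is undefined
        return max(range(len(logits)), key=lambda i: logits[i])
    scaled = [l / temperature for l in logits]
    top = max(scaled)
    exps = [math.exp(s - top) for s in scaled]  # numerically stable softmax
    probs = [e / sum(exps) for e in exps]
    return rng.choices(range(len(logits)), weights=probs, k=1)[0]

vocab = ["Paris", "Lyon", "Rome"]
logits = [4.0, 1.0, 0.5]           # illustrative logits, not model output
print(vocab[sample_token(logits, 0.0)])  # greedy → always "Paris"
print(vocab[sample_token(logits, 1.0)])  # sampled: usually "Paris"
```

Raising the temperature flattens `probs`, so lower-ranked tokens like "Lyon" and "Rome" get picked more often; lowering it concentrates mass on "Paris".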