# Lab 4: Data is Everything — Datasets, Bias, and Garbage-In

## Objective
Understand why data quality determines AI quality. By the end you will be able to:

- Explain why "more data" is not always better
- Identify the main types of dataset bias and their real-world consequences
- Describe how major training datasets are built
- Apply basic data quality checks before training any model
## The Fundamental Truth

> "Garbage in, garbage out." — IBM, 1957
Every ML model is a compressed reflection of its training data. It cannot know things that weren't in the data. It will repeat every bias, error, and gap in the data — often amplified.
A 99% accurate model trained on biased data is a precise, reliable bias-generator.
## What Makes a Good Dataset?

Five dimensions of data quality:
| Dimension | Question | Failure example |
|---|---|---|
| Volume | Enough examples? | 50 images to train a face detector |
| Variety | Covers all real-world cases? | Only light-skinned faces in the training set |
| Accuracy | Labels are correct? | Mislabelled medical scans |
| Recency | Reflects current reality? | 2010 data for 2024 fraud detection |
| Representativeness | Matches deployment distribution? | English-only training for a global product |
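One way to probe variety and representativeness is to compare the label distribution of the training set against the mix expected in deployment. A minimal sketch — the language codes and proportions below are invented for illustration:

```python
from collections import Counter

def distribution(labels):
    """Return each label's share of the dataset."""
    counts = Counter(labels)
    total = len(labels)
    return {label: count / total for label, count in counts.items()}

# Hypothetical training labels vs. an assumed deployment mix
train_labels = ["en"] * 90 + ["es"] * 8 + ["fr"] * 2
deployment_share = {"en": 0.5, "es": 0.3, "fr": 0.2}

train_share = distribution(train_labels)
for lang, expected in deployment_share.items():
    actual = train_share.get(lang, 0.0)
    if actual < 0.5 * expected:  # flag severe under-representation
        print(f"{lang}: {actual:.0%} of training data vs {expected:.0%} expected")
```

Here the check flags `es` and `fr` as badly under-represented relative to where the model will actually run.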
## Famous Training Datasets
| Dataset | Size | Modality | Used by |
|---|---|---|---|
| ImageNet | 14M images, 1,000 classes | Vision | AlexNet, ResNet, all vision benchmarks |
| Common Crawl | ~250B web pages | Text | GPT-3, LLaMA, most LLMs |
| Wikipedia | ~6.7M articles (English) | Text | BERT pre-training |
| LAION-5B | 5.85B image-text pairs | Vision-language | Stable Diffusion, OpenCLIP |
| The Pile | 825 GB curated text | Text | EleutherAI open models |
| MS COCO | 330K images + captions | Vision | Object detection, captioning |
💡 Scale insight: GPT-4 was estimated to train on ~13 trillion tokens — roughly as much text as every book ever published.
## Types of Dataset Bias

### 1. Historical Bias
The data reflects historical inequalities, which the model then perpetuates.
Example: Amazon's AI recruiting tool (2018) trained on 10 years of résumés — which came mostly from men because the tech industry is male-dominated. The model learned to penalise résumés that contained the word "women's" (as in "women's chess club"). Amazon scrapped it.
### 2. Representation Bias
Some groups are underrepresented in the training data.
Example: Facial recognition systems (2018 MIT study by Joy Buolamwini):
- Error rate for light-skinned men: 0.8%
- Error rate for dark-skinned women: 34.7%
The training datasets (Adience, IJB-A) were overwhelmingly light-skinned. The models were accurate — on the data they were trained on.
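Representation bias hides easily inside a single aggregate accuracy number. Breaking the error rate out by subgroup makes the gap visible — a minimal sketch with invented toy predictions:

```python
def error_rate_by_group(y_true, y_pred, groups):
    """Compute the error rate separately for each subgroup."""
    stats = {}
    for truth, pred, group in zip(y_true, y_pred, groups):
        wrong, total = stats.get(group, (0, 0))
        stats[group] = (wrong + (truth != pred), total + 1)
    return {g: wrong / total for g, (wrong, total) in stats.items()}

# Toy data: 90% accurate overall, but every error falls on group "B"
y_true = [1] * 10
y_pred = [1] * 9 + [0]
groups = ["A"] * 8 + ["B"] * 2

print(error_rate_by_group(y_true, y_pred, groups))
# → {'A': 0.0, 'B': 0.5}
```

An aggregate report would call this model "90% accurate"; the per-group breakdown shows it fails half the time on the minority group.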
### 3. Measurement Bias
The way data is collected or labelled introduces systematic errors.
Example: Predicting hospital readmission. "Healthcare cost" is used as a proxy for "health needs." But Black patients, on average, have lower healthcare costs because they receive less healthcare — not because they're healthier. The model learned to predict cost, not need, and systematically underestimated the health needs of Black patients. (Obermeyer et al., Science, 2019)
### 4. Aggregation Bias
Using one model for multiple subgroups when different subgroups have different underlying patterns.
Example: Blood glucose prediction models trained on a mixed population may perform well on average but poorly for diabetic patients with atypical presentations.
### 5. Deployment Bias
The model is used in a context different from where it was trained.
Example: A depression screening model trained on Twitter data from the USA deployed in the UK — cultural expressions of distress differ. Performance degrades. Errors aren't caught because the deployers don't have ground truth.
## The Data Pipeline

### Data Splits: The Cardinal Rule
Data leakage is the #1 source of misleadingly good model performance:
- Train/test split performed after normalisation (test statistics leak into training)
- Future information in features (tomorrow's stock price used to predict today's)
- Duplicate records spanning train and test sets
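The first and third failure modes can be avoided mechanically: split first, compute any preprocessing statistics on the training portion only, and check for records that span the splits. A minimal standard-library sketch:

```python
import random

def split(rows, test_fraction=0.2, seed=0):
    """Shuffle and split BEFORE computing any statistics."""
    rows = rows[:]
    random.Random(seed).shuffle(rows)
    cut = int(len(rows) * (1 - test_fraction))
    return rows[:cut], rows[cut:]

data = [(float(i), i % 2) for i in range(100)]
train, test = split(data)

# Normalise with statistics from the TRAINING set only;
# reuse this mean for the test set rather than recomputing it
train_values = [x for x, _ in train]
mean = sum(train_values) / len(train_values)

# Duplicate records spanning both splits are a leakage red flag
overlap = set(train) & set(test)
assert not overlap, f"{len(overlap)} records appear in both splits"
```

The key design choice is order: the split happens before any statistic (mean, vocabulary, scaler) is computed, so nothing about the test set can leak into training.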
### Data Preprocessing Basics
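Typical preprocessing covers missing-value imputation, categorical encoding, and feature scaling. A minimal pure-Python sketch over a toy table — the column names and records are invented for illustration:

```python
# Toy records with a missing numeric value and a categorical field
rows = [
    {"age": 34, "country": "UK"},
    {"age": None, "country": "US"},
    {"age": 58, "country": "UK"},
]

# 1. Impute missing ages with the mean of the observed values
observed = [r["age"] for r in rows if r["age"] is not None]
mean_age = sum(observed) / len(observed)
for r in rows:
    if r["age"] is None:
        r["age"] = mean_age

# 2. One-hot encode the categorical column
countries = sorted({r["country"] for r in rows})
for r in rows:
    for c in countries:
        r[f"country_{c}"] = int(r["country"] == c)

# 3. Min-max scale age into [0, 1]
ages = [r["age"] for r in rows]
lo, hi = min(ages), max(ages)
for r in rows:
    r["age_scaled"] = (r["age"] - lo) / (hi - lo)
```

Per the cardinal rule above, in a real pipeline every statistic used here (the mean, the category set, the min and max) must be computed from the training split only.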
## Synthetic Data — The New Frontier
When real data is scarce, biased, or private, synthetic data can help:
- GANs generate realistic medical images where patient data is restricted
- Simulation (games, physics engines) generates effectively unlimited labelled data for robotics
- LLMs generate synthetic training data for other LLMs — a practice that, done naively, risks "model collapse", where models trained on model output progressively degrade
💡 2024 trend: Meta's Llama 3 used synthetically generated instruction data. OpenAI's GPT-4 was reportedly used to generate training data for smaller models. The line between "real" and "synthetic" training data is blurring rapidly.
## Practical Data Quality Checklist

Before training any model:

- Volume: enough examples per class and per subgroup?
- Variety and representativeness: does the data cover the distribution the model will see in deployment?
- Accuracy: hand-check a random sample of labels
- Recency: does the data still reflect current reality?
- Leakage: no feature contains information unavailable at prediction time, and no records are shared between train and test
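Several of these checks can be automated before any training run. A minimal sketch, assuming the dataset is a list of dicts, that flags missing values, exact duplicate records, and class imbalance:

```python
from collections import Counter

def quality_report(rows, label_key):
    """Basic pre-training checks: missing values, duplicates, class balance."""
    missing = sum(1 for r in rows if any(v is None for v in r.values()))
    seen = [tuple(sorted(r.items())) for r in rows]
    duplicates = len(seen) - len(set(seen))
    labels = Counter(r[label_key] for r in rows)
    rarest = min(labels.values()) / len(rows)
    return {"missing_rows": missing, "duplicate_rows": duplicates,
            "rarest_class_share": rarest}

# Toy dataset with one duplicate and one missing feature value
rows = [
    {"x": 1.0, "y": "cat"},
    {"x": 1.0, "y": "cat"},   # exact duplicate
    {"x": None, "y": "dog"},  # missing feature
    {"x": 3.0, "y": "cat"},
]
print(quality_report(rows, "y"))
# → {'missing_rows': 1, 'duplicate_rows': 1, 'rarest_class_share': 0.25}
```

None of these checks catch label errors or historical bias — those still require looking at the data by hand.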
## Summary
The model is only as good as the data. No algorithm, however sophisticated, can overcome:
- Missing subgroups in training data
- Mislabelled examples
- Historical bias baked into labels
- Data leakage between splits
The best AI practitioners spend 80% of their time on data and 20% on modelling.