Lab 11: Vision AI — How Machines See the World

Objective

Understand how AI systems process and generate images. By the end you will be able to:

  • Explain how CNNs learn to "see" features in images

  • Describe CLIP — the model that connected images and language

  • Understand how diffusion models generate images from text

  • Identify real-world applications of vision AI


How Machines See

A digital image is a 3D array of numbers: height × width × 3 (RGB channels). A 1080p image is roughly 1920 × 1080 × 3 = 6.2 million numbers. Vision AI learns patterns in these numbers.

import numpy as np
from PIL import Image

# Load image as numpy array
img = Image.open("cat.jpg").convert("RGB")
pixels = np.array(img)

print(pixels.shape)   # (480, 640, 3)  — height, width, RGB
print(pixels[0, 0])   # [134, 201, 89]  — first pixel: R=134, G=201, B=89
print(pixels.dtype)   # uint8  — values 0-255

# Neural networks normalise to 0-1 or -1 to +1
pixels_norm = pixels / 255.0            # 0 to 1
pixels_signed = pixels / 127.5 - 1.0    # -1 to +1

Convolutional Neural Networks: Learning to See

CNNs learn filters — small matrices that detect specific visual patterns when slid across an image.
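The sliding operation can be sketched in a few lines of NumPy. The 3×3 vertical-edge filter below is a classic hand-designed example; a real CNN learns its filter values from data rather than having them written by hand.

```python
import numpy as np

def convolve2d(image, kernel):
    """Slide a small kernel across a 2D image (no padding, stride 1)."""
    kh, kw = kernel.shape
    h, w = image.shape
    out = np.zeros((h - kh + 1, w - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            # Response = elementwise product of kernel and image patch
            out[i, j] = np.sum(image[i:i+kh, j:j+kw] * kernel)
    return out

# Hand-designed vertical-edge filter (CNNs learn filters like this)
edge_filter = np.array([[-1, 0, 1],
                        [-1, 0, 1],
                        [-1, 0, 1]])

# Toy image: dark left half, bright right half -> one vertical edge
img = np.zeros((6, 6))
img[:, 3:] = 1.0

response = convolve2d(img, edge_filter)
print(response)  # strongest responses in the columns covering the edge
```

The output feature map is near zero over the flat regions and large where the window straddles the dark-to-bright boundary, which is exactly what "detecting a vertical edge" means.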

What different layers learn:

Layer     Detects
Layer 1   Edges, gradients, colour blobs
Layer 2   Corners, curves, textures
Layer 3   Object parts: eyes, wheels, fur
Layer 4   Full objects: face, car, cat


The ImageNet Moment (2012)

AlexNet's 2012 ImageNet victory (15.3% top-5 error vs 26.2% for the runner-up) launched modern vision AI. Key innovations:

  • ReLU activations instead of sigmoid — faster training

  • GPU training — reduced training time from weeks to days

  • Dropout — prevented overfitting on 1.2M images

  • Data augmentation — random flips, crops, colour jitter

Since 2012, top-5 error rates on ImageNet have fallen below 2%, better than estimated human performance (~5%). The benchmark is now considered effectively solved.


Transfer Learning: Don't Train from Scratch

The most practical computer vision technique: take a model pre-trained on ImageNet and fine-tune it on your specific task.

Why it works: Features learned from millions of natural images (edges, textures, shapes) transfer remarkably well to medical imaging, satellite imagery, quality control — almost any visual domain.
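The idea in miniature, as a NumPy sketch (all names hypothetical): keep a pre-trained feature extractor frozen and train only a small classification head on your own labels. Here a fixed random projection stands in for the frozen ImageNet backbone; in practice you would load a pre-trained CNN and freeze its weights.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in for a frozen pre-trained backbone: a fixed pixels->features
# mapping that is never updated during fine-tuning.
W_backbone = rng.normal(size=(64, 16))

def extract_features(images):                    # images: (n, 64) flat pixels
    return np.maximum(images @ W_backbone, 0)    # frozen weights, ReLU

# Tiny synthetic dataset: two classes with different pixel statistics
n = 200
X = np.concatenate([rng.normal(0.2, 0.1, (n // 2, 64)),
                    rng.normal(0.8, 0.1, (n // 2, 64))])
y = np.array([0] * (n // 2) + [1] * (n // 2))

feats = extract_features(X)
feats = (feats - feats.mean(axis=0)) / (feats.std(axis=0) + 1e-8)

# Trainable head: one linear layer + sigmoid (logistic regression).
# Only w and b are updated -- the backbone stays frozen.
w, b = np.zeros(16), 0.0
for _ in range(300):
    p = 1 / (1 + np.exp(-(feats @ w + b)))
    w -= 0.1 * (feats.T @ (p - y) / n)
    b -= 0.1 * np.mean(p - y)

acc = np.mean((p > 0.5) == y)
print(f"training accuracy: {acc:.2f}")
```

Because the backbone already maps pixels to useful features, the only thing left to learn is a tiny linear head, which needs far less data and compute than training a CNN from scratch.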


CLIP: Connecting Images and Language

CLIP (Contrastive Language-Image Pre-training, OpenAI 2021) is the model that unified vision and language. It was trained on 400 million (image, text caption) pairs from the internet using contrastive learning: an image encoder and a text encoder are trained jointly so that matching image-caption pairs land close together in a shared embedding space, while mismatched pairs are pushed apart.

CLIP enabled zero-shot classification: classifying images into any set of categories, described only as text, without task-specific training examples. Its text encoder also powers the text conditioning in image generation models.
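With a shared embedding space, zero-shot classification reduces to a nearest-neighbour search: embed the image, embed one caption per candidate class, and pick the caption with the highest cosine similarity. The 4-dimensional vectors below are made-up stand-ins (real CLIP embeddings are 512-dimensional and come from the trained encoders).

```python
import numpy as np

def cosine(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

# Toy stand-ins for CLIP embeddings -- not real model outputs
image_embedding = np.array([0.9, 0.1, 0.0, 0.2])   # embedding of a cat photo
text_embeddings = {
    "a photo of a cat": np.array([0.8, 0.2, 0.1, 0.1]),
    "a photo of a dog": np.array([0.1, 0.9, 0.2, 0.0]),
    "a photo of a car": np.array([0.0, 0.1, 0.9, 0.3]),
}

# Zero-shot classification: the "classifier" is just the text prompts;
# no cat/dog/car training examples are needed.
scores = {caption: cosine(image_embedding, emb)
          for caption, emb in text_embeddings.items()}
best = max(scores, key=scores.get)
print(best)   # a photo of a cat
```

Changing the label set is as simple as editing the list of captions, which is why zero-shot classification needs no retraining.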


Diffusion Models: Generating Images from Text

Stable Diffusion, DALL-E, and Midjourney all use diffusion models. The process: during training the model learns to undo small amounts of noise added to real images; at generation time it starts from pure random noise and denoises it step by step, with a text embedding steering each step toward the prompt.

The guidance_scale parameter controls how strongly each denoising step is pulled toward the prompt:

  • Low (1–3): largely ignores the prompt; creative but unconstrained

  • Medium (7–8): balanced; follows the prompt while allowing some variation

  • High (15+): rigidly follows the prompt; often less natural-looking
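Under the hood, guidance_scale is applied via classifier-free guidance: at each denoising step the model predicts the noise twice, once with the prompt and once without, and the scale controls how far the final prediction is pushed from the unconditional estimate toward the conditional one. A minimal sketch of that blend, with toy arrays standing in for real noise predictions:

```python
import numpy as np

def apply_guidance(noise_uncond, noise_cond, guidance_scale):
    """Classifier-free guidance: extrapolate from the unconditional
    prediction in the direction of the prompt-conditioned one."""
    return noise_uncond + guidance_scale * (noise_cond - noise_uncond)

# Toy noise predictions for a single denoising step
uncond = np.array([0.0, 0.0, 0.0])   # model run without the prompt
cond   = np.array([1.0, 0.5, -0.5])  # model run with the prompt

print(apply_guidance(uncond, cond, 1.0))   # scale 1: just the conditional prediction
print(apply_guidance(uncond, cond, 7.5))   # typical medium-high setting
```

At scale 1 the formula returns the conditional prediction unchanged; larger scales amplify the prompt direction, which is why high settings follow the prompt rigidly at the cost of naturalness.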


Key Vision AI Models (2024–2025)

Model               Organisation        Capability
GPT-4V / GPT-4o     OpenAI              Understands any image; generates text descriptions
Claude 3.5          Anthropic           Strong image analysis; chart/document reading
Gemini Vision       Google              Natively multimodal; long video understanding
DALL-E 3            OpenAI              Text-to-image; very prompt-faithful
Midjourney v6       Midjourney          Artistic quality; best aesthetics
Stable Diffusion 3  Stability AI        Open weights; runs locally
Flux                Black Forest Labs   State-of-the-art open text-to-image (2024)
Sora                OpenAI              Text-to-video; up to 1-minute HD video
Runway Gen-3        Runway              Professional video generation
SAM 2               Meta                Segment anything in images and video


Real-World Vision AI Applications

Application          Technology                 Impact
Medical imaging      CNN (tumour detection)     Radiologist-level accuracy for lung cancer detection
Autonomous vehicles  CNN + LiDAR fusion         Tesla Autopilot, Waymo One
Quality control      Anomaly-detection CNN      Semiconductor defect detection: 99.97% accuracy
Security cameras     Object detection (YOLO)    Real-time person/vehicle/weapon detection
Agriculture          Satellite + CNN            Crop disease detection, yield prediction
Retail               Computer vision            Amazon Go: checkout-free shopping
Content moderation   CLIP + classifiers         Facebook: 95%+ harmful content removed before user reports


The Multimodal Future

The boundaries between vision, language, and audio are dissolving: frontier models now accept images, text, and audio in a single input stream and, increasingly, can generate any of them as output.

