Lab 19: The AI Landscape in 2025–2026

Objective

Survey the rapidly evolving AI landscape: frontier models (GPT-4o, Claude 4, Gemini 2, Llama 3, Grok), benchmark comparisons, the open vs closed model ecosystem, emerging capabilities (long context, vision, agents, reasoning), and what the next 12 months will bring.

Time: 35 minutes | Level: Foundations | No coding required


Background: Why the Landscape Changes So Fast

AI capabilities are advancing at an unusually fast pace. Models that were state-of-the-art 18 months ago are now matched or surpassed on many tasks by open-source models that run on a laptop.

2020: GPT-3 (175B params) — text only, API only, expensive
2022: ChatGPT launches → mainstream AI awareness
2023: GPT-4 (multimodal), Llama 1/2 (open-source), Claude 2
2024: GPT-4o, Claude 3 (Haiku/Sonnet/Opus), Gemini 1.5 Pro (1M context)
      Llama 3 (open, competitive with closed models)
2025: Claude 4, Gemini 2, GPT-4.5/o3, Grok 3, Llama 4
      Reasoning models (o1, o3, R1), long context (10M+ tokens)
2026: Multi-agent, real-time voice, autonomous coding agents emerging

Step 1: Frontier Model Comparison

Closed Models (API-only)

| Model | Provider | Context | Strengths | Pricing (approx) |
|---|---|---|---|---|
| Claude Sonnet 4.6 | Anthropic | 200K | Coding, safety, instruction following | $3/MTok in |
| Claude Opus 4 | Anthropic | 200K | Most capable, complex reasoning | $15/MTok in |
| GPT-4o | OpenAI | 128K | Multimodal, fast, ecosystem | $5/MTok in |
| GPT-o3 | OpenAI | 200K | Chain-of-thought reasoning | $10/MTok in |
| Gemini 2.0 Pro | Google | 2M | Huge context, multimodal, Search grounding | $3.5/MTok in |
| Grok 3 | xAI | 128K | Real-time web, Twitter data | Available via API |
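
The pricing column above is quoted per million input tokens (MTok), so the cost of a request is a single multiplication. The sketch below is only an illustration: the prices are the approximate figures from the table, output-token pricing is ignored, and the 100,000-token prompt is a hypothetical workload.

```python
# Approximate input prices in USD per million tokens, copied from the table above.
# Output tokens are billed separately and are not modelled here.
INPUT_PRICE_PER_MTOK = {
    "Claude Sonnet 4.6": 3.0,
    "Claude Opus 4": 15.0,
    "GPT-4o": 5.0,
    "GPT-o3": 10.0,
    "Gemini 2.0 Pro": 3.5,
}


def input_cost_usd(model: str, input_tokens: int) -> float:
    """Return the approximate input-side cost of one request in USD."""
    return INPUT_PRICE_PER_MTOK[model] * input_tokens / 1_000_000


if __name__ == "__main__":
    # Hypothetical workload: one 100,000-token prompt sent to each model.
    for name in INPUT_PRICE_PER_MTOK:
        print(f"{name}: ${input_cost_usd(name, 100_000):.2f}")
```

At these prices the same prompt costs about five times more on Claude Opus 4 than on Claude Sonnet 4.6, which is why model choice matters for high-volume workloads.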

💡 The context window arms race: Gemini 2.0 Pro supports 2 million tokens, roughly 1.5 million words of English text, which is enough to hold several long novels or a mid-sized codebase in a single prompt. The Claude and GPT models listed above offer 128K to 200K tokens.
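
To get a feel for what these context sizes mean in practice, the sketch below checks which of the models in the table could hold a document of a given length. It assumes about 4 characters per token, a common rough heuristic for English prose; real tokenizers vary by model and language, so treat the result as an estimate only.

```python
import math

# Context windows in tokens, copied from the table above (K = 1,000; M = 1,000,000).
CONTEXT_WINDOW_TOKENS = {
    "Claude Sonnet 4.6": 200_000,
    "Claude Opus 4": 200_000,
    "GPT-4o": 128_000,
    "GPT-o3": 200_000,
    "Gemini 2.0 Pro": 2_000_000,
    "Grok 3": 128_000,
}

# Assumption: roughly 4 characters per token for English prose.
CHARS_PER_TOKEN = 4


def estimate_tokens(num_chars: int) -> int:
    """Estimate the token count of a text from its character count."""
    return math.ceil(num_chars / CHARS_PER_TOKEN)


def models_that_fit(num_chars: int) -> list[str]:
    """List the models whose context window can hold the whole text."""
    needed = estimate_tokens(num_chars)
    return [name for name, window in CONTEXT_WINDOW_TOKENS.items() if needed <= window]


if __name__ == "__main__":
    # A 1,000,000-character document is about 250,000 tokens under this heuristic.
    print(models_that_fit(1_000_000))  # ['Gemini 2.0 Pro']
```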

Open-Source Models (run locally)

| Model | Params | Context | Quantised Size | Who Uses It |
|---|---|---|---|---|
| Llama 3.3 70B | 70B | 128K | ~140GB (FP16), ~40GB (INT4) | Enterprises, researchers |
| Mistral 7B | 7B | 32K | ~4GB (INT4) | Edge, privacy-first |
| Qwen 2.5 72B | 72B | 128K | ~40GB (INT4) | Coding tasks |
| DeepSeek R1 | 671B | 128K | Too large for most local hardware | Research |
| Phi-3 Mini | 3.8B | 128K | ~2GB (INT4) | Mobile, IoT |
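
A rough way to sanity-check model sizes like these is to multiply the parameter count by the number of bits stored per weight. The sketch below does only that: it ignores the KV cache, activations and the small per-block overhead that quantised formats add, so real downloads and memory use are somewhat larger. The function name is illustrative.

```python
def weights_size_gb(params_billions: float, bits_per_weight: int) -> float:
    """Approximate storage for the model weights in GB (1 GB = 1e9 bytes)."""
    total_bytes = params_billions * 1e9 * bits_per_weight / 8
    return total_bytes / 1e9


if __name__ == "__main__":
    # Parameter counts taken from the table above.
    for name, params in [("Llama 3.3 70B", 70), ("Mistral 7B", 7), ("Phi-3 Mini", 3.8)]:
        fp16 = weights_size_gb(params, 16)
        int8 = weights_size_gb(params, 8)
        int4 = weights_size_gb(params, 4)
        print(f"{name}: FP16 {fp16:.1f} GB, INT8 {int8:.1f} GB, INT4 {int4:.1f} GB")
```

For a 70B model this gives about 140 GB at FP16, 70 GB at INT8 and 35 GB at INT4, which is why 4-bit quantisation is what makes large open models fit on a single workstation.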


Step 2: Benchmark Comparison

Standard Benchmarks (as of early 2026)

💡 Benchmark saturation: Models are reaching 90% and above on many benchmarks designed for humans. The field is moving to harder benchmarks: SWE-bench (real GitHub issues), ARC-AGI (novel reasoning), Humanity's Last Exam.

Real-World Performance Matters More

Benchmark scores don't tell the whole story:

  • Instruction following: Claude is generally considered best

  • Code quality: Claude Sonnet and GPT-4o are neck-and-neck

  • Safety: Claude (trained with RLHF and Constitutional AI) is generally considered the leader

  • Speed: Gemini Flash, Claude Haiku are fastest at low cost

  • Long context accuracy: Gemini 2.0 handles 1M+ tokens best


Step 3: Emerging Capabilities in 2025–2026

1. Reasoning Models (o1, o3, R1, QwQ)

2. Agent Frameworks Maturing

3. Multimodal Expansion

4. Context Window Explosion


Step 4: Open-Source vs Closed — 2026 State

The Capability Gap is Closing

Choosing Open vs Closed

| Factor | Closed (API) | Open-Source |
|---|---|---|
| Capability ceiling | Highest (today) | Catching up fast |
| Data privacy | Data sent to provider | 100% local |
| Cost at scale | Per-token (can be high) | Hardware only |
| Customisation | Limited fine-tuning | Full control |
| Compliance (HIPAA/GDPR) | Requires BAA/DPA | Easier |
| Setup complexity | Minutes (API key) | Hours–days |
| Latency | Network round-trip | Local inference |
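
The "Cost at scale" row can be made concrete with a back-of-the-envelope break-even calculation. The sketch below is illustrative only: the $3,000 monthly self-hosting cost and the $3/MTok API price are hypothetical inputs, and the model ignores output tokens, engineering time and hardware utilisation.

```python
def monthly_api_cost_usd(tokens_per_month: float, price_per_mtok: float) -> float:
    """Per-token API spend for a given monthly token volume."""
    return tokens_per_month / 1_000_000 * price_per_mtok


def break_even_tokens_per_month(fixed_monthly_cost_usd: float, price_per_mtok: float) -> float:
    """Monthly token volume at which API spend equals a fixed self-hosting cost."""
    return fixed_monthly_cost_usd / price_per_mtok * 1_000_000


if __name__ == "__main__":
    # Hypothetical inputs: $3,000/month for self-hosted hardware, $3 per million tokens via API.
    threshold = break_even_tokens_per_month(3_000, 3.0)
    print(f"Break-even volume: {threshold:,.0f} tokens per month")
    print(f"API cost at that volume: ${monthly_api_cost_usd(threshold, 3.0):,.2f}")
```

Below the break-even volume the per-token API is cheaper; above it, fixed-cost local hardware starts to pay off, provided the open model is capable enough for the task.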


Step 5: What to Expect in the Next 12 Months


Key Takeaways

  1. Models are commoditising — the API layer matters less; the application layer is where value is created

  2. Open-source is viable for most enterprise use cases (data privacy, cost, customisation)

  3. Reasoning models are a new paradigm — not just bigger, but thinking differently

  4. Context windows now reach millions of tokens, but long context supplements retrieval rather than replacing it

  5. Agents are real — not sci-fi; companies are deploying them in production today

  6. Security implications are growing — AI is both a defensive tool and an attack vector


