Lab 11: AI Cost Optimization

Time: 50 minutes | Level: Architect | Docker: docker run -it --rm zchencow/innozverse-ai:latest bash

Overview

AI infrastructure costs can spiral out of control without intentional FinOps practices. This lab covers the full cost landscape, GPU utilization optimization, spot instances, model distillation, query caching, and building a comprehensive ROI model.

Architecture

┌──────────────────────────────────────────────────────────────┐
│                    AI Cost Landscape                         │
├───────────────┬──────────────────┬───────────────────────────┤
│  COMPUTE      │    DATA          │    PEOPLE                 │
│  ───────────  │  ─────────────── │  ─────────────────        │
│  GPU training │  Storage (hot)   │  ML engineers             │
│  GPU inference│  Storage (cold)  │  Data engineers           │
│  CPU serving  │  Data transfer   │  MLOps engineers          │
│  Spot savings │  Feature store   │  AI product managers      │
├───────────────┴──────────────────┴───────────────────────────┤
│  OPTIMIZATION LEVERS                                         │
│  Spot (70% off) | Distillation (60% off) | Cache (30% off)  │
│  Quantization (50% off) | Batching | Right-sizing            │
└──────────────────────────────────────────────────────────────┘

Step 1: AI Cost Components

Complete Cost Breakdown:

| Category | Component | Typical % of Total | Optimization Potential |
| --- | --- | --- | --- |
| Compute | GPU training | 15-30% | Spot instances (70% savings) |
| Compute | GPU inference | 10-20% | Quantization, distillation |
| Data | Storage (model/data) | 3-8% | Tiered storage, compression |
| Data | Transfer costs | 1-3% | CDN, regional serving |
| API | External LLM APIs | 5-15% | Caching, model cascade |
| People | ML/Data engineers | 30-50% | Platform automation |
| Tooling | MLOps tools/licenses | 2-5% | Open-source alternatives |
| Cloud | Networking, K8s | 5-10% | Reserved instances |

People Costs Are Often Underestimated:
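As a rough sketch of why: fully-loaded engineering cost usually dwarfs the cloud bill. The salaries, overhead rate, and cloud spend below are illustrative assumptions, not benchmarks:

```python
# Hypothetical fully-loaded cost of an ML team vs. its cloud bill.
# All salary, overhead, and cloud numbers are illustrative assumptions.
def fully_loaded_cost(base_salary: float, overhead_rate: float = 0.35) -> float:
    """Base salary plus benefits/equipment/office overhead."""
    return base_salary * (1 + overhead_rate)

team = {
    "ml_engineer": (3, 180_000),      # (headcount, assumed base salary)
    "data_engineer": (2, 160_000),
    "mlops_engineer": (1, 170_000),
}

people_annual = sum(n * fully_loaded_cost(s) for n, s in team.values())
cloud_annual = 400_000                 # assumed annual compute + data spend

total = people_annual + cloud_annual
print(f"People: ${people_annual:,.0f} ({people_annual / total:.0%} of total)")
```

Even with a substantial cloud bill, people can be three quarters of total AI spend under these assumptions.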


Step 2: GPU Utilization Optimization

GPU idle time is money on fire: you pay the full hourly rate whether the GPU is computing or waiting on data. Target > 80% GPU utilization.

Why GPUs Are Underutilized:

Optimization Techniques:

| Technique | GPU Utilization Impact | Complexity |
| --- | --- | --- |
| Mixed precision (FP16/BF16) | +20-30% | Low (1 line of code) |
| Larger batch sizes | +15-25% | Low |
| DataLoader prefetching | +10-20% | Low |
| Gradient accumulation | Enables larger effective batch | Low |
| Flash Attention | +30-50% (memory + speed) | Medium |
| Continuous batching | +40-60% (inference) | Medium (use vLLM) |
| Multi-GPU tensor parallel | Near-linear scaling | High |
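A quick sanity check on why utilization matters: the effective price of useful GPU work is the list price divided by utilization. A sketch with an assumed on-demand A100 rate:

```python
# Effective cost of GPU compute rises in inverse proportion to utilization:
# at 40% utilization you pay 2x per useful hour versus 80%.
HOURLY_RATE = 3.20          # assumed on-demand A100 price, $/hr

def effective_hourly_cost(utilization: float) -> float:
    """Cost per hour of *useful* GPU work at a given utilization."""
    return HOURLY_RATE / utilization

for util in (0.40, 0.60, 0.80, 0.95):
    print(f"{util:.0%} utilized -> ${effective_hourly_cost(util):.2f}/useful-hr")
```

Raising utilization from 40% to 80% halves the effective cost without touching the cloud bill.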

Right-sizing GPUs:


Step 3: Spot and Preemptible Instances

Training workloads are ideal for spot instances — they can checkpoint and resume.

Spot Instance Economics:

| Cloud | On-demand (A100 80GB) | Spot (A100 80GB) | Savings |
| --- | --- | --- | --- |
| AWS p4d | $3.20/hr | $0.96/hr | 70% |
| GCP A100 | $3.47/hr | $1.04/hr | 70% |
| Azure NC A100 | $3.40/hr | $0.85/hr | 75% |

Spot Instance Best Practices:

Spot-Safe Training Architecture:

💡 For LLM fine-tuning (hours to days), spot instances can save $50K+ per training run. The engineering investment in checkpoint/resume pays back in the first training job.
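A minimal sketch of the checkpoint/resume loop that makes a training job spot-safe, here with pickle and a simulated preemption; a real job would save model and optimizer state with its framework's own utilities:

```python
import os
import pickle
import tempfile

CKPT = os.path.join(tempfile.gettempdir(), "spot_demo.ckpt")
if os.path.exists(CKPT):
    os.remove(CKPT)                    # fresh start for the demo

def save_checkpoint(state):
    # Write atomically so a preemption mid-write can't corrupt the file.
    tmp = CKPT + ".tmp"
    with open(tmp, "wb") as f:
        pickle.dump(state, f)
    os.replace(tmp, CKPT)

def load_checkpoint():
    if os.path.exists(CKPT):
        with open(CKPT, "rb") as f:
            return pickle.load(f)
    return {"step": 0, "loss": None}

def train(total_steps, preempt_at=None):
    state = load_checkpoint()          # resume where the last instance died
    for step in range(state["step"], total_steps):
        if preempt_at is not None and step == preempt_at:
            raise SystemExit("spot instance reclaimed")  # simulated notice
        state = {"step": step + 1, "loss": 1.0 / (step + 1)}
        if state["step"] % 10 == 0:    # checkpoint cadence: every 10 steps
            save_checkpoint(state)
    save_checkpoint(state)
    return state

try:
    train(50, preempt_at=25)           # first instance is reclaimed mid-run
except SystemExit:
    pass
state = train(50)                      # replacement resumes from last checkpoint
print(state["step"])
```

Only the steps since the last checkpoint are lost, so the checkpoint cadence bounds the cost of each interruption.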


Step 4: Model Distillation

Train a small "student" model to mimic a large "teacher" model. Often 3-10x smaller with <10% quality loss.

Distillation Process:

Distillation Variants:

| Variant | Method | Best For |
| --- | --- | --- |
| Response distillation | Student mimics teacher outputs | Classification, extraction |
| Feature distillation | Student mimics intermediate activations | Complex reasoning |
| Attention distillation | Student mimics attention patterns | Sequence tasks |
| Data augmentation | Teacher generates training data | Low-data domains |
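The core of response distillation is a soft-target loss. A minimal sketch of the temperature-scaled KL loss from Hinton-style distillation, in plain Python:

```python
import math

def softmax(logits, T=1.0):
    """Temperature-scaled softmax; higher T exposes the teacher's
    'dark knowledge' about near-miss classes."""
    exps = [math.exp(z / T) for z in logits]
    s = sum(exps)
    return [e / s for e in exps]

def kd_loss(student_logits, teacher_logits, T=2.0):
    """KL(teacher || student) on softened distributions, scaled by T^2
    to keep gradient magnitudes comparable across temperatures."""
    p = softmax(teacher_logits, T)     # soft teacher targets
    q = softmax(student_logits, T)     # student predictions
    kl = sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))
    return (T ** 2) * kl

teacher = [4.0, 1.0, 0.2]              # confident teacher logits
student = [3.5, 1.2, 0.3]              # student is close, so the loss is small
print(round(kd_loss(student, teacher), 4))
```

In practice this term is mixed with the ordinary cross-entropy on hard labels.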

Cost Savings Example:
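A back-of-envelope model: assume a cascade routes 90% of traffic to the distilled student and the per-request prices below (all numbers are illustrative assumptions):

```python
# Hypothetical monthly savings from a teacher/student cascade.
requests = 10_000_000                  # assumed monthly request volume
teacher_cost = 0.002                   # assumed $ per request, large model
student_cost = 0.0002                  # assumed $ per request, distilled model
student_share = 0.9                    # fraction of traffic the student handles

baseline = requests * teacher_cost
cascade = requests * (student_share * student_cost
                      + (1 - student_share) * teacher_cost)
savings = baseline - cascade
pct = savings / baseline
print(f"baseline ${baseline:,.0f} -> cascade ${cascade:,.0f} ({pct:.0%} saved)")
```

Under these assumptions the cascade cuts serving cost by roughly 80%, consistent with the 60%+ savings figure quoted above once routing overhead and fallback traffic are accounted for.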


Step 5: Query Caching

Cache identical or semantically similar queries. LLM calls are expensive; cache hits are nearly free.

Caching Layers:

| Layer | Type | Hit Rate | Implementation |
| --- | --- | --- | --- |
| Exact match | Redis | 5-15% | Hash query → cache result |
| Semantic match | Vector similarity | 20-40% | Embed query → find similar cached |
| KV cache (LLM) | Prefix reuse | 60-80% | Cache system prompt KV |
| CDN/Edge | HTTP | 10-30% | Cache API responses at edge |
Semantic Cache Architecture:
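A minimal two-tier sketch combining the exact and semantic layers from the table. The bag-of-characters `_embed()` is a toy stand-in; a real deployment would use a sentence-embedding model and a vector index:

```python
import hashlib

class TwoTierCache:
    """Exact-match tier (hash lookup) backed by a semantic tier
    (cosine similarity over embeddings)."""

    def __init__(self, threshold=0.95):
        self.exact = {}                 # sha256(query) -> response
        self.semantic = []              # list of (embedding, response)
        self.threshold = threshold

    @staticmethod
    def _key(query):
        return hashlib.sha256(query.strip().lower().encode()).hexdigest()

    @staticmethod
    def _embed(text):
        # Toy bag-of-characters embedding -- an assumption for the demo.
        vec = [0.0] * 26
        for ch in text.lower():
            if ch.isalpha():
                vec[ord(ch) - 97] += 1
        return vec

    @staticmethod
    def _cosine(a, b):
        dot = sum(x * y for x, y in zip(a, b))
        na = sum(x * x for x in a) ** 0.5
        nb = sum(x * x for x in b) ** 0.5
        return dot / (na * nb) if na and nb else 0.0

    def get(self, query):
        hit = self.exact.get(self._key(query))
        if hit is not None:
            return hit                  # tier 1: exact match
        emb = self._embed(query)
        for cached_emb, response in self.semantic:
            if self._cosine(emb, cached_emb) >= self.threshold:
                return response         # tier 2: semantic match
        return None                     # miss -> caller pays for an LLM call

    def put(self, query, response):
        self.exact[self._key(query)] = response
        self.semantic.append((self._embed(query), response))

cache = TwoTierCache()
cache.put("What is our refund policy?", "30-day refunds.")
print(cache.get("what is our refund policy?"))   # exact hit after normalization
```

The threshold is the key tuning knob: too low and the cache returns wrong answers for merely related queries, too high and the semantic tier never fires.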

KV Cache Prefix Sharing:
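Prefix sharing pays off because the shared system prompt is prefilled once instead of per request. A rough calculation under assumed token counts:

```python
# Prefill savings from reusing the KV cache of a shared prefix.
# Token counts below are illustrative assumptions.
system_prompt_tokens = 1_500       # assumed shared prefix length
user_tokens = 200                  # assumed per-request suffix
requests = 100_000

without_sharing = requests * (system_prompt_tokens + user_tokens)
with_sharing = system_prompt_tokens + requests * user_tokens
saved = 1 - with_sharing / without_sharing
print(f"prefill tokens saved: {saved:.1%}")
```

The longer the shared prefix relative to the per-request suffix, the closer the savings approach the 60-80% hit rate in the table.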


Step 6: Batching for Cost Optimization

Batching amortizes fixed costs across many requests.

Batch Processing Economics:

Asynchronous Batch Pattern:
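One way to implement the pattern: a micro-batcher that flushes when a batch fills or a short timer expires, whichever comes first. This is a sketch with a stubbed model function, not a production server:

```python
import asyncio

class MicroBatcher:
    """Collects individual requests and runs them as one batched call
    when max_batch_size is reached or max_wait_s elapses."""

    def __init__(self, batch_fn, max_batch_size=8, max_wait_s=0.01):
        self.batch_fn = batch_fn
        self.max_batch_size = max_batch_size
        self.max_wait_s = max_wait_s
        self._pending = []              # list of (item, Future)
        self._flush_task = None

    async def submit(self, item):
        loop = asyncio.get_running_loop()
        fut = loop.create_future()
        self._pending.append((item, fut))
        if len(self._pending) >= self.max_batch_size:
            self._flush()               # batch is full: run it now
        elif self._flush_task is None:
            self._flush_task = loop.call_later(self.max_wait_s, self._flush)
        return await fut

    def _flush(self):
        if self._flush_task is not None:
            self._flush_task.cancel()
            self._flush_task = None
        if not self._pending:
            return
        batch, self._pending = self._pending, []
        items = [i for i, _ in batch]
        results = self.batch_fn(items)  # one model call for the whole batch
        for (_, fut), r in zip(batch, results):
            fut.set_result(r)

async def main():
    calls = []
    def fake_model(batch):              # stand-in for a batched GPU call
        calls.append(len(batch))
        return [x * 2 for x in batch]
    b = MicroBatcher(fake_model, max_batch_size=4)
    out = await asyncio.gather(*(b.submit(i) for i in range(10)))
    return out, calls

out, calls = asyncio.run(main())
print(out, calls)
```

Ten individual requests become three model invocations (two full batches plus a timer flush), which is exactly how the fixed per-call GPU cost gets amortized.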

When to Use Batch vs Online:


Step 7: FinOps Practices for AI

FinOps = Cloud Financial Operations applied to AI

Show-back / Charge-back Model:
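A minimal show-back allocator sketch: split a shared cluster bill across teams in proportion to tagged GPU-hours. Team names and the bill are illustrative:

```python
from collections import defaultdict

# (team tag, gpu_hours) -- in practice these come from cluster usage tags.
usage_records = [
    ("search", 120.0),
    ("recs", 300.0),
    ("search", 80.0),
    ("fraud", 100.0),
]
monthly_gpu_bill = 18_000.0             # assumed shared cluster bill, $

hours = defaultdict(float)
for team, h in usage_records:
    hours[team] += h
total_hours = sum(hours.values())

showback = {team: monthly_gpu_bill * h / total_hours
            for team, h in hours.items()}
for team, cost in sorted(showback.items()):
    print(f"{team:>8}: ${cost:,.2f}")
```

Show-back surfaces the numbers for visibility; charge-back goes one step further and actually bills each team's budget.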

Cost Anomaly Detection:
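One simple approach, a stand-in for managed anomaly detectors: flag any day whose spend deviates by more than a few standard deviations from the trailing week's mean:

```python
import statistics

def cost_anomalies(daily_costs, window=7, z_threshold=3.0):
    """Flag days whose spend is > z_threshold sigmas from the
    trailing window's mean."""
    alerts = []
    for i in range(window, len(daily_costs)):
        baseline = daily_costs[i - window:i]
        mu = statistics.mean(baseline)
        sigma = statistics.stdev(baseline)
        if sigma > 0 and abs(daily_costs[i] - mu) / sigma > z_threshold:
            alerts.append((i, daily_costs[i]))
    return alerts

# 13 normal days, then someone leaves a multi-GPU node running overnight.
spend = [410, 395, 420, 405, 398, 415, 402,
         408, 399, 412, 400, 406, 403, 1650]
print(cost_anomalies(spend))
```

A rolling z-score catches sudden spikes like forgotten instances, but not slow drift; pair it with a monthly budget threshold for the latter.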

Reserved Instance Planning:
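A breakeven sketch for the reserved-vs-spot split: reserve capacity for the steady baseline load and buy bursts on spot. All prices are assumptions:

```python
# Reserved-instance planning: reserve the baseline, spot the bursts.
ON_DEMAND = 3.20     # assumed $/GPU-hr
RESERVED = 1.90      # assumed $/GPU-hr with a 1-year commitment (~40% off)
SPOT = 0.96          # assumed $/GPU-hr, interruptible

def blended_cost(baseline_gpus, burst_gpu_hours, hours=730):
    """Monthly cost: reserved capacity for the baseline, spot for bursts."""
    return baseline_gpus * hours * RESERVED + burst_gpu_hours * SPOT

all_on_demand = (8 * 730 + 2_000) * ON_DEMAND      # 8 steady GPUs + bursts
mixed = blended_cost(baseline_gpus=8, burst_gpu_hours=2_000)
print(f"on-demand: ${all_on_demand:,.0f}  mixed: ${mixed:,.0f}")
```

The rule of thumb this encodes: commit only to load you are confident will run around the clock, and keep everything bursty on spot.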


Step 8: Capstone — AI Cost Model with ROI
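A starting-point sketch for the capstone cost model; every input is an illustrative assumption to replace with your own numbers:

```python
# Minimal AI ROI model: net value vs. fully-loaded cost over a horizon.
def roi_model(
    monthly_value=120_000.0,        # assumed revenue lift + hours saved, $/mo
    compute_cost=25_000.0,          # assumed GPU + API spend, $/mo
    people_cost=60_000.0,           # assumed fully-loaded team cost, $/mo
    one_time_build=200_000.0,       # assumed initial development, $
    months=12,
):
    monthly_net = monthly_value - compute_cost - people_cost
    total_net = monthly_net * months - one_time_build
    total_cost = one_time_build + (compute_cost + people_cost) * months
    roi = total_net / total_cost
    payback = one_time_build / monthly_net if monthly_net > 0 else float("inf")
    return {"monthly_net": monthly_net, "roi": roi, "payback_months": payback}

result = roi_model()
print({k: round(v, 2) for k, v in result.items()})
```

Run it with each optimization lever from this lab (spot, distillation, caching) folded into `compute_cost` to see how the payback period moves.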

📸 Verified Output:


Summary

| Concept | Key Points |
| --- | --- |
| Cost Components | Compute (45%), People (40%), Data (8%), Tools (7%) |
| GPU Utilization | Target > 80%; use mixed precision, larger batches, Flash Attention |
| Spot Instances | 70% savings on training; checkpoint every 30 min |
| Model Distillation | 3-10x smaller student model, <10% quality loss, 60% cost reduction |
| Query Caching | Exact (Redis) + Semantic (vector) + KV prefix = 30-80% cost reduction |
| Batching | Amortize GPU costs; 3-5x throughput improvement for async workloads |
| FinOps | Show-back/charge-back, cost anomaly alerts, reserved vs spot planning |

Next Lab: Lab 12: Enterprise AI Platform →
