Lab 15: LLM Fine-Tuning Infrastructure

Time: 50 minutes | Level: Architect | Docker: docker run -it --rm zchencow/innozverse-ai:latest bash

Overview

Fine-tuning LLMs for enterprise use cases requires careful infrastructure decisions. This lab covers fine-tuning approaches (full, LoRA, QLoRA), data preparation, GPU memory planning, alignment techniques (DPO vs RLHF), and evaluation metrics.

Architecture

┌──────────────────────────────────────────────────────────────┐
│              LLM Fine-Tuning Infrastructure                  │
├──────────────────────────────────────────────────────────────┤
│  DATA PIPELINE          │  TRAINING INFRASTRUCTURE           │
│  Raw data → JSONL       │  Base Model (HuggingFace)         │
│  Instruction format     │  PEFT/LoRA adapter init           │
│  Train/val split        │  Mixed precision (BF16)           │
│  Quality filtering      │  Gradient checkpointing           │
│                         │  DeepSpeed ZeRO / FSDP            │
├──────────────────────────┴──────────────────────────────────┤
│  EVALUATION             │  DEPLOYMENT                        │
│  BLEU, ROUGE, BERTScore │  Merge LoRA → Base model          │
│  Human evaluation       │  Quantize to INT4/INT8            │
│  Benchmark suites       │  Serve with vLLM                  │
└──────────────────────────────────────────────────────────────┘

Step 1: Fine-Tuning Approaches

When to Fine-Tune vs Prompt Engineering:

  - Start with prompt engineering (and RAG for knowledge injection): no training data, no GPUs, instant iteration.
  - Fine-tune when you need consistent output format or style, deep domain terminology, or shorter prompts for latency/cost reasons. Fine-tuning teaches behavior, not facts.

Fine-Tuning Methods Comparison:

Method            Trainable Params   GPU Memory (7B)   Quality     Use Case
Full fine-tuning  100%               ~112 GB           Best        Max quality, ample GPU
LoRA              0.1-1%             ~18 GB            Excellent   Most enterprise use cases
QLoRA             0.1-1%             ~8 GB             Very good   Single consumer GPU
Prefix tuning     <0.1%              ~14 GB            Good        Task-specific, less storage
Prompt tuning     <0.01%             ~14 GB            Moderate    Many tasks, limited storage


Step 2: LoRA Architecture

LoRA (Low-Rank Adaptation):

Instead of updating the full weight matrix W, LoRA freezes W and trains two small matrices: A (r × d_in) and B (d_out × r). The adapted weight is W' = W + (α/r)·BA, so only r·(d_in + d_out) parameters are trained per adapted matrix instead of d_out·d_in.

Why LoRA Works:

The weight updates learned during fine-tuning have low intrinsic rank: a small r (typically 4-64) captures most of the task-specific change, which is why training 0.1-1% of the parameters approaches full fine-tuning quality.

LoRA Target Modules:

Adapters are most commonly attached to the attention projections (q_proj, k_proj, v_proj, o_proj); also adapting the MLP projections (gate_proj, up_proj, down_proj) tends to improve quality at the cost of more trainable parameters.
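The W' = W + BA update can be sketched in plain Python. This is a toy 2×2 example to show the mechanics, not a training setup; in practice the HuggingFace peft library (LoraConfig / get_peft_model) attaches these adapters for you.

```python
# Minimal sketch of a LoRA-adapted linear layer: y = Wx + (alpha/r) * B(Ax).
# B is initialized to zero, so at the start of training the adapter is a
# no-op and the layer behaves exactly like the frozen pretrained weight.

def matvec(M, x):
    return [sum(m_ij * x_j for m_ij, x_j in zip(row, x)) for row in M]

class LoRALinear:
    def __init__(self, W, A, B, alpha):
        self.W = W            # frozen pretrained weight, shape (d_out, d_in)
        self.A = A            # trainable, shape (r, d_in)
        self.B = B            # trainable, shape (d_out, r), zero-initialized
        self.r = len(A)
        self.alpha = alpha

    def forward(self, x):
        base = matvec(self.W, x)                      # frozen path: Wx
        delta = matvec(self.B, matvec(self.A, x))     # rank-r update: B(Ax)
        scale = self.alpha / self.r
        return [b + scale * d for b, d in zip(base, delta)]

W = [[1.0, 0.0], [0.0, 1.0]]          # 2x2 identity as the "pretrained" weight
A = [[0.5, 0.5]]                      # r = 1
layer = LoRALinear(W, A, [[0.0], [0.0]], alpha=2)
print(layer.forward([1.0, 2.0]))      # [1.0, 2.0] — identical to Wx at init

layer.B = [[1.0], [0.0]]              # after "training", B is nonzero
print(layer.forward([1.0, 2.0]))      # [4.0, 2.0] — shifted by (alpha/r)*B(Ax)
```

Only A and B would receive gradients; W stays frozen, which is where the memory savings come from.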


Step 3: QLoRA

QLoRA = Quantized LoRA. Fine-tune on 4-bit quantized base model.

QLoRA Innovations:

  1. 4-bit NormalFloat (NF4): a quantization data type that is information-theoretically optimal for normally distributed weights

  2. Double quantization: quantize the quantization constants themselves (saves ~0.37 bits per parameter, roughly 3 GB for a 65B model)

  3. Paged optimizers: NVIDIA unified memory pages optimizer states between GPU and CPU RAM to absorb memory spikes

QLoRA Memory Breakdown (LLaMA-7B):

The 4-bit base weights take ~3.5 GB (7B params × 0.5 bytes) plus quantization constants; the FP16 LoRA adapters and their optimizer states add only a few hundred MB; activations, adapter gradients, and CUDA overhead account for the rest of the ~8 GB total.

QLoRA vs LoRA:

Quality is comparable (the QLoRA paper reports 4-bit fine-tuning matching 16-bit baselines), but training is slower because base weights are dequantized on the fly in every forward and backward pass.
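The numbers above can be sanity-checked with back-of-envelope arithmetic. The block sizes and constant widths below follow the QLoRA paper's defaults (blockwise absmax with one FP32 constant per 64 weights; double quantization re-quantizes those constants to 8-bit in outer blocks of 256); the helper names are illustrative.

```python
# Back-of-envelope QLoRA memory arithmetic. Real usage goes through
# transformers' BitsAndBytesConfig (load_in_4bit=True, bnb_4bit_quant_type="nf4").

def qlora_base_gb(n_params, bits=4, block_size=64, const_bytes=4):
    """Packed 4-bit weights plus one FP32 absmax constant per block."""
    weights = n_params * bits / 8                     # 0.5 bytes per weight
    consts = (n_params / block_size) * const_bytes    # 0.5 bits/param overhead
    return (weights + consts) / 1e9

def double_quant_saving_gb(n_params, block_size=64, outer_block=256):
    """Double quantization: FP32 constants -> 8-bit, ~0.373 bits/param saved."""
    n_consts = n_params / block_size
    before = n_consts * 4                                  # FP32 constants
    after = n_consts * 1 + (n_consts / outer_block) * 4    # 8-bit + outer FP32
    return (before - after) / 1e9

print(f"4-bit base weights, 7B:  {qlora_base_gb(7e9):.2f} GB")
print(f"double-quant saving, 7B: {double_quant_saving_gb(7e9):.3f} GB")
print(f"double-quant saving, 65B: {double_quant_saving_gb(65e9):.2f} GB")
```

The 65B saving comes out to ~3 GB, matching the figure quoted above.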


Step 4: Data Preparation

Instruction Following Format (Alpaca/ChatML):

Alpaca format: flat JSON records with instruction, optional input, and output fields, rendered into a fixed prompt template at training time.

ChatML-style format (Llama-3 and Mistral use analogous role-tagged templates): multi-turn messages wrapped in special role tokens for system, user, and assistant turns; use the tokenizer's chat template so training and inference prompts match exactly.

Data Quality for Fine-Tuning: deduplicate, filter malformed and low-quality responses, and decontaminate against your evaluation data; 1K-10K carefully curated examples typically beat 100K noisy ones.
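A sketch of turning raw records into Alpaca-style training text. The template below is a simplified version of the Stanford Alpaca layout (the original uses a slightly different header when an input is present); on disk, records are stored one JSON object per line (JSONL).

```python
import json

# Simplified Alpaca-style prompt template (illustrative, not the exact original).
ALPACA_TEMPLATE = (
    "Below is an instruction that describes a task. "
    "Write a response that appropriately completes the request.\n\n"
    "### Instruction:\n{instruction}\n\n"
    "{input_block}### Response:\n{output}"
)

def to_alpaca_text(record):
    # The input block is optional: many instructions need no extra context.
    input_block = f"### Input:\n{record['input']}\n\n" if record.get("input") else ""
    return ALPACA_TEMPLATE.format(
        instruction=record["instruction"],
        input_block=input_block,
        output=record["output"],
    )

record = {
    "instruction": "Summarize the ticket in one sentence.",
    "input": "Customer reports login failures since the last deploy.",
    "output": "A customer cannot log in following the most recent deployment.",
}
jsonl_line = json.dumps(record)   # one JSON object per line in the dataset file
print(to_alpaca_text(record))
```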


Step 5: Alignment Techniques (RLHF vs DPO)

RLHF (Reinforcement Learning from Human Feedback):

Three stages: (1) supervised fine-tuning (SFT) on demonstrations, (2) training a reward model on human preference rankings, (3) optimizing the SFT model against the reward model with PPO. Powerful, but operationally complex: it keeps several model copies in GPU memory and RL training is notoriously unstable.

DPO (Direct Preference Optimization):

Rewrites the RLHF objective so the policy is trained directly on preference pairs with a simple classification-style loss: no reward model, no PPO, and comparable quality in practice.

DPO Data Format:

Each example is a prompt plus a preferred ("chosen") response and a dispreferred ("rejected") response.
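The per-pair DPO loss is simple enough to write out directly. A sketch, assuming the sequence log-probabilities under the trained policy and the frozen reference model have already been computed (the function name and defaults are illustrative):

```python
import math

def dpo_loss(policy_chosen_lp, policy_rejected_lp,
             ref_chosen_lp, ref_rejected_lp, beta=0.1):
    """-log sigmoid(beta * (chosen log-ratio - rejected log-ratio)).

    Each argument is a full-sequence log-probability log p(y|x).
    beta controls how far the policy may drift from the reference.
    """
    chosen_ratio = policy_chosen_lp - ref_chosen_lp
    rejected_ratio = policy_rejected_lp - ref_rejected_lp
    logits = beta * (chosen_ratio - rejected_ratio)
    return -math.log(1 / (1 + math.exp(-logits)))

# If the policy equals the reference, both ratios are 0 and the loss is ln 2.
print(dpo_loss(-12.0, -15.0, -12.0, -15.0))   # 0.6931...
# Shifting probability toward the chosen answer lowers the loss.
print(dpo_loss(-10.0, -16.0, -12.0, -15.0))
```

In a real run these log-probabilities come from summing token log-probs over the response; libraries such as TRL's DPOTrainer wrap this loop.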


Step 6: Compute Requirements

GPU Memory Calculator:

Full fine-tuning memory ≈ weights + gradients + optimizer states (+ activations). With FP16 weights and Adam's FP32 states that is roughly 2 + 2 + 12 = 16 bytes per parameter, i.e. ~112 GB for a 7B model before activations.

Compute Budget Estimation:

Training FLOPs ≈ 6 × parameters × tokens (forward plus backward); dividing by sustained GPU throughput (peak FLOPS × utilization) gives a wall-clock estimate.
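The two estimates above can be sketched as a small calculator. This deliberately ignores activation memory (which depends on batch size, sequence length, and gradient checkpointing), and the GPU throughput figure is an illustrative A100-class number.

```python
# Rough training-memory and compute-budget estimator (a sketch).
# FP16 weights + FP16 gradients + Adam states in FP32 (m, v, master weights).
BYTES = {"weights_fp16": 2, "grads_fp16": 2, "adam_fp32_states": 12}

def full_finetune_gb(n_params):
    """16 bytes/param, before activations."""
    return n_params * sum(BYTES.values()) / 1e9

def lora_gb(n_params, trainable_frac=0.005):
    """Frozen FP16 base; only adapters need gradients + optimizer states."""
    base = n_params * 2
    adapters = n_params * trainable_frac * sum(BYTES.values())
    return (base + adapters) / 1e9

def train_days(n_params, n_tokens, gpu_flops=3e14, n_gpus=1, mfu=0.4):
    """FLOPs ~= 6 * params * tokens; mfu = fraction of peak actually sustained."""
    flops = 6 * n_params * n_tokens
    return flops / (gpu_flops * mfu * n_gpus) / 86400

print(f"full FT, 7B:  {full_finetune_gb(7e9):.0f} GB")       # ~112 GB
print(f"LoRA, 7B:     {lora_gb(7e9):.1f} GB before activations")
print(f"7B on 1B tokens, 1 GPU: {train_days(7e9, 1e9):.2f} days")
```

The full fine-tuning figure reproduces the ~112 GB in the Step 1 table; the LoRA figure plus activations lands near the table's ~18 GB.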


Step 7: Evaluation (BLEU, ROUGE, BERTScore)

Automated Metrics:

Metric      Measures                                Range         Use Case
BLEU        N-gram precision vs reference           0-100         Translation, generation
ROUGE-L     Longest common subsequence              0-1           Summarization
BERTScore   Semantic similarity (BERT embeddings)   0-1           Open-ended generation
Perplexity  Model confidence on holdout set         Lower=better  Language modeling quality
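ROUGE-L from the table reduces to a longest-common-subsequence computation over tokens. A self-contained sketch (real pipelines typically use the rouge-score or evaluate packages, but the underlying math is just this):

```python
# ROUGE-L F-measure over the LCS of whitespace tokens.

def lcs_len(a, b):
    # Classic dynamic-programming LCS table.
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, x in enumerate(a, 1):
        for j, y in enumerate(b, 1):
            dp[i][j] = dp[i-1][j-1] + 1 if x == y else max(dp[i-1][j], dp[i][j-1])
    return dp[len(a)][len(b)]

def rouge_l(candidate, reference):
    c, r = candidate.split(), reference.split()
    lcs = lcs_len(c, r)
    if lcs == 0:
        return 0.0
    precision, recall = lcs / len(c), lcs / len(r)
    return 2 * precision * recall / (precision + recall)   # harmonic mean

print(rouge_l("the model was fine tuned", "the model was fine tuned"))  # 1.0
print(round(rouge_l("the cat sat", "the cat sat on the mat"), 3))       # 0.667
```

Production implementations add stemming and sentence-level aggregation, but the score's behavior (order-sensitive overlap) is captured here.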

Human Evaluation (Gold Standard):

Human raters score outputs on dimensions such as helpfulness, correctness, and tone, usually via pairwise comparison; expensive and slow, but it is the reference against which all automated metrics are validated.

LLM-as-Judge (Scalable Human-Quality Eval):

A strong LLM grades or ranks candidate outputs against a rubric or a reference answer; this correlates well with human preference at a fraction of the cost, but watch for position bias and self-preference bias in the judge.


Step 8: Capstone — LoRA Parameter Calculator
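One possible sketch of such a calculator. The layer count and projection shapes below assume the LLaMA-7B architecture (32 decoder layers, hidden size 4096, MLP size 11008), and 6.74e9 is its approximate total parameter count.

```python
# LoRA adds, per adapted weight of shape (d_out, d_in), two matrices:
# A (r x d_in) and B (d_out x r), i.e. r * (d_in + d_out) extra parameters.

HIDDEN, MLP, LAYERS = 4096, 11008, 32   # LLaMA-7B dimensions
SHAPES = {  # (d_out, d_in) for each adaptable projection in one decoder layer
    "q_proj": (HIDDEN, HIDDEN),
    "k_proj": (HIDDEN, HIDDEN),
    "v_proj": (HIDDEN, HIDDEN),
    "o_proj": (HIDDEN, HIDDEN),
    "gate_proj": (MLP, HIDDEN),
    "up_proj": (MLP, HIDDEN),
    "down_proj": (HIDDEN, MLP),
}

def lora_params(r, targets):
    per_layer = sum(r * (SHAPES[t][0] + SHAPES[t][1]) for t in targets)
    return per_layer * LAYERS

for r in (8, 16, 64):
    n = lora_params(r, ["q_proj", "v_proj"])
    print(f"r={r:>2}, q+v only: {n/1e6:6.2f}M params "
          f"({n / 6.74e9:.3%} of the ~6.74B base)")
```

At r=8 on q_proj and v_proj this gives ~4.19M trainable parameters, about 0.06% of the base model, consistent with the 0.1-1% range in the Step 1 table.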

📸 Verified Output:


Summary

Concept           Key Points
Full Fine-tuning  Best quality; needs 112GB+ for 7B; use for mission-critical
LoRA              Train ~0.1% of params; W' = W + BA; ~18 GB for 7B; standard approach
QLoRA             4-bit base + FP16 LoRA adapters; ~8 GB for 7B; single consumer GPU
Data Format       Instruction format (Alpaca/ChatML); 1K-10K quality > 100K quantity
RLHF              SFT → Reward Model → PPO; complex but powerful
DPO               Preference pairs only; no reward model; simpler, comparable quality
Evaluation        BLEU/ROUGE (automated) + BERTScore (semantic) + LLM-judge (scalable)

Next Lab: Lab 16: Knowledge Graph + LLM →
