Lab 13: Fine-Tuning with LoRA

Objective

Understand and implement Low-Rank Adaptation (LoRA) — the dominant technique for fine-tuning large language models efficiently. Learn why LoRA works mathematically, implement it from scratch, and understand how it enables adapting billion-parameter models on consumer hardware.

Time: 50 minutes | Level: Practitioner | Docker Image: zchencow/innozverse-ai:latest


Background

Fine-tuning GPT-4 (reportedly ~1.8 trillion parameters) at full scale would require:

  • Several terabytes of GPU memory — roughly 3.6 TB for the weights alone at 16-bit precision, and several times more once gradients and optimizer states are included

  • Weeks of training time

  • ~$1 million+ in compute

LoRA (Hu et al., 2021) cuts the trainable parameter count to a fraction of a percent by exploiting a key insight: the weight changes needed to adapt a model are low-rank. Instead of modifying the weight matrix W (d×d), freeze it and add two small trainable matrices B (d×r) and A (r×d), where r ≪ d, so the update is ΔW = BA.

Original forward:  h = Wx
LoRA forward:      h = Wx + (α/r) * BAx

Parameters to train:
  Original W:   d × d = 4,096 × 4,096 = 16,777,216
  LoRA A+B:     d×r + r×d = 4096×8 + 8×4096 = 65,536  (0.4% of original!)
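These shapes and counts can be checked in a few lines of NumPy (a sketch; the dimensions follow the example above, and the random input x is illustrative):

```python
import numpy as np

d, r, alpha = 4096, 8, 16               # layer width, LoRA rank, scaling α
rng = np.random.default_rng(0)

W = rng.standard_normal((d, d))         # frozen pretrained weight, d×d
A = rng.standard_normal((r, d)) * 0.01  # trainable down-projection, r×d
B = np.zeros((d, r))                    # trainable up-projection, d×r (zero init)
x = rng.standard_normal(d)

h = W @ x + (alpha / r) * (B @ (A @ x)) # LoRA forward: h = Wx + (α/r)BAx
print(h.shape)                          # (4096,)
```

Because B starts at zero, this forward pass is identical to the base layer's — a point Step 3 returns to.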

Step 1: Environment Setup

📸 Verified Output:


Step 2: LoRA Mathematics

📸 Verified Output:

💡 For a 4096-dim layer with rank 8, LoRA uses 0.39% of the original parameters. For a 70B parameter model, that means fine-tuning only ~280M parameters instead of 70 billion.
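The 0.39% figure is simple arithmetic, and scaling it to 70B parameters recovers the estimate above (pure Python; the ~280M quoted is this result rounded up):

```python
d, r = 4096, 8
full = d * d                          # one 4096×4096 weight matrix
lora = d * r + r * d                  # its LoRA factors A and B
ratio = lora / full

print(f"{ratio:.2%}")                 # 0.39%
print(f"{ratio * 70e9 / 1e6:.0f}M")   # 273M — trainable params at this ratio on 70B
```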


Step 3: LoRA Layer Implementation

📸 Verified Output:

💡 Perfect zero difference at initialisation. This is crucial — LoRA fine-tuning starts from exactly where the pretrained model is, not from random noise.
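A from-scratch layer consistent with this step might look like the following (a NumPy sketch — the class name and interface are illustrative, not the lab's actual code). The key detail is the zero-initialised B:

```python
import numpy as np

class LoRALinear:
    """Frozen weight W plus a trainable low-rank update (α/r)·BA."""

    def __init__(self, W, r=8, alpha=16, seed=0):
        rng = np.random.default_rng(seed)
        self.W = W                                            # frozen, never updated
        self.A = rng.standard_normal((r, W.shape[1])) * 0.01  # small random init
        self.B = np.zeros((W.shape[0], r))                    # zero init: ΔW = BA = 0
        self.scale = alpha / r

    def __call__(self, x):
        return self.W @ x + self.scale * (self.B @ (self.A @ x))

rng = np.random.default_rng(1)
W = rng.standard_normal((64, 64))
x = rng.standard_normal(64)
layer = LoRALinear(W)
print(np.max(np.abs(layer(x) - W @ x)))  # 0.0 — exactly the pretrained behaviour
```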


Step 4: Training a LoRA-Adapted Model

📸 Verified Output:
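Training touches only A and B while W stays frozen. A self-contained toy version of this step (a NumPy sketch with assumed sizes and learning rate, fitting a synthetic low-rank weight shift by plain gradient descent — not the lab's actual training code):

```python
import numpy as np

rng = np.random.default_rng(0)
d, r, n = 32, 4, 256
alpha, lr, steps = 8, 0.1, 2000
s = alpha / r

W = rng.standard_normal((d, d)) / np.sqrt(d)                        # frozen base weight
dW = rng.standard_normal((d, r)) @ rng.standard_normal((r, d)) / d  # true low-rank shift
X = rng.standard_normal((d, n))                                     # toy input batch
Y = (W + dW) @ X                                                    # task targets

A = rng.standard_normal((r, d)) * 0.1   # trainable
B = np.zeros((d, r))                    # trainable, zero init

losses = []
for _ in range(steps):
    E = (W @ X + s * B @ (A @ X)) - Y   # prediction error
    losses.append(0.5 * np.sum(E ** 2) / n)
    grad_B = s * E @ (A @ X).T / n      # gradients flow only into
    grad_A = s * B.T @ E @ X.T / n      # A and B; W is never touched
    B -= lr * grad_B
    A -= lr * grad_A

print(f"loss: {losses[0]:.4f} -> {losses[-1]:.4f}")
```

Only 2·d·r of the d² + 2·d·r parameters ever receive a gradient, which is the whole point.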


Step 5: Rank Selection and the Intrinsic Dimensionality Hypothesis

📸 Verified Output:

💡 Quality improves with rank but with diminishing returns. For most tasks, rank=8 or rank=16 is the sweet spot — good quality at minimal parameter overhead.
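The diminishing-returns pattern can be illustrated with truncated SVD, since singular values determine how much of a matrix a rank-r approximation can capture (a NumPy sketch; the fast-decaying spectrum is an assumption standing in for real fine-tuning updates):

```python
import numpy as np

rng = np.random.default_rng(0)
# A synthetic weight update whose energy concentrates in a few directions,
# mimicking the low intrinsic dimensionality of fine-tuning updates.
U = rng.standard_normal((64, 64))
V = rng.standard_normal((64, 64))
decay = np.exp(-np.arange(64) / 4.0)   # assumed fast-decaying spectrum
dW = (U * decay) @ V

s = np.linalg.svd(dW, compute_uv=False)
total = np.sum(s ** 2)
for k in (1, 2, 4, 8, 16, 32):
    captured = np.sum(s[:k] ** 2) / total   # energy captured by a rank-k approximation
    print(f"rank {k:2d}: {captured:.1%} of update energy")
```

Each doubling of the rank buys less than the previous one — the same curve the lab's rank sweep shows.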


Step 6: LoRA Merging — Zero Inference Overhead

📸 Verified Output:
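Merging is just folding the scaled product into the base weight, after which inference is a single matmul (a NumPy sketch consistent with the shapes used earlier):

```python
import numpy as np

rng = np.random.default_rng(0)
d, r, alpha = 64, 8, 16
W = rng.standard_normal((d, d))   # frozen base weight
A = rng.standard_normal((r, d))   # trained LoRA factors
B = rng.standard_normal((d, r))
x = rng.standard_normal(d)

W_merged = W + (alpha / r) * (B @ A)   # fold the adaptor into the base weight

h_lora   = W @ x + (alpha / r) * (B @ (A @ x))
h_merged = W_merged @ x                # one matmul: zero extra inference cost
print(np.allclose(h_lora, h_merged))   # True
```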


Step 7: Multi-Task LoRA — Different Adaptors for Different Tasks

📸 Verified Output:

💡 Each task has its own small LoRA adaptor (~2K parameters vs 16K base). At serving time, swap adaptors per request — one base model, many specialisations.
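Adaptor swapping amounts to a dictionary lookup over (A, B) pairs while W stays fixed (a sketch with illustrative task names and sizes, not the lab's serving code):

```python
import numpy as np

rng = np.random.default_rng(0)
d, r, s = 64, 4, 2.0
W = rng.standard_normal((d, d))   # one shared frozen base weight (d·d params)

# One tiny (A, B) pair per task: 2·d·r params each, a fraction of W's d·d.
adaptors = {
    task: (rng.standard_normal((r, d)), rng.standard_normal((d, r)))
    for task in ("summarise", "classify", "translate")
}

def forward(x, task):
    A, B = adaptors[task]         # swap adaptor per request
    return W @ x + s * (B @ (A @ x))

x = rng.standard_normal(d)
outs = {task: forward(x, task) for task in adaptors}
for task, out in outs.items():
    print(task, round(float(out[0]), 3))
```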


Step 8: Real-World Capstone — Security QA Fine-Tuning Simulation

📸 Verified Output:

💡 This is exactly what happens when you fine-tune a general LLM (GPT, Claude, Llama) with LoRA on domain-specific data: target-domain performance goes from near-zero to near-perfect, at a tiny fraction of the cost of full fine-tuning.


Summary

  Concept               Key Insight
  -------------------   ------------------------------------------------------------------
  Low-rank hypothesis   Weight updates during fine-tuning are inherently low-rank
  B=0 initialisation    Ensures no change to pretrained behaviour at start
  Rank selection        rank=8–16 covers most tasks; higher = more capacity, more params
  Merging               BA can be merged into W post-training for zero inference overhead
  Multi-task            One base model + multiple tiny LoRA adaptors = flexible serving

Production workflow (with Hugging Face):
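In practice this is handled by the `peft` library. The sketch below assumes `transformers` and `peft` are installed and a Llama-style model whose attention projections are named `q_proj`/`v_proj`; the model name and hyperparameters are illustrative, not a recommendation:

```python
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model, TaskType

base = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf")

config = LoraConfig(
    task_type=TaskType.CAUSAL_LM,
    r=8,                                  # LoRA rank
    lora_alpha=16,                        # α scaling (effective scale α/r)
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],  # which weight matrices get adaptors
)

model = get_peft_model(base, config)
model.print_trainable_parameters()        # reports the tiny trainable fraction

# ...train with the usual Trainer / training loop on your dataset...

# Post-training: fold the adaptor into the base weights for zero-overhead serving
merged = model.merge_and_unload()
```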
