Lab 05: Sharding Concepts

Time: 45 minutes | Level: Advanced | DB: MySQL 8.0 (application-level sharding)

Overview

Sharding distributes data across multiple independent database instances (shards). Unlike partitioning (one logical table, multiple physical segments), sharding splits data across entirely separate databases — each shard is a complete, independent MySQL instance.


Step 1: Shard Key Selection Theory

Before writing any code, understand shard key selection criteria:

# No Docker needed for this step — conceptual analysis

cat << 'EOF'
=== SHARD KEY SELECTION CRITERIA ===

BAD Shard Keys:
  ❌ Sequential IDs     → All new rows hit the same shard (hotspot)
  ❌ Timestamps         → Same hotspot problem for time-series writes
  ❌ Low-cardinality    → "gender" (M/F) can only create 2 shards
  ❌ Mutable values     → If shard key changes, row must move shards

GOOD Shard Keys:
  ✅ user_id (hashed)   → Even distribution, high cardinality
  ✅ tenant_id          → Natural isolation for multi-tenant apps
  ✅ product_id         → Products don't change their ID
  ✅ Geographic region  → Locality, often maps to data residency

RANGE vs HASH SHARDING:
  Range: users 1-1M → shard1, 1M-2M → shard2
    + Range queries are efficient (all 2022 orders on shard3)
    - Risk of hotspots if IDs are sequential

  Hash: hash(user_id) % num_shards → shard N
    + Even distribution, no hotspots
    - Range queries must hit ALL shards
    - Resharding remaps most keys; even doubling the shard count moves half the rows
EOF
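
The hotspot claim above can be checked in a few lines of Python; the shard count, range size, and ID range below are illustrative, not prescribed by the lab:

```python
# Compare where sequential user IDs land under range vs hash routing.
import hashlib

NUM_SHARDS = 3

def range_shard(user_id, size=1_000_000):
    # Range: users 1..1M -> shard 0, 1M..2M -> shard 1, ...
    return min((user_id - 1) // size, NUM_SHARDS - 1)

def hash_shard(user_id):
    # Hash: stable digest of the key, modulo shard count.
    digest = hashlib.md5(str(user_id).encode()).hexdigest()
    return int(digest, 16) % NUM_SHARDS

new_signups = range(1, 10_001)  # 10k sequential IDs, as from AUTO_INCREMENT

range_counts = [0] * NUM_SHARDS
hash_counts = [0] * NUM_SHARDS
for uid in new_signups:
    range_counts[range_shard(uid)] += 1
    hash_counts[hash_shard(uid)] += 1

print("range:", range_counts)  # all 10k new rows hit shard 0 -> hotspot
print("hash: ", hash_counts)   # roughly even three-way split
```

Sequential inserts pile onto the last range shard, while the hashed key spreads the same inserts evenly.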

Step 2: Launch Three MySQL Shards
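
Each shard is a fully independent MySQL 8.0 instance. A minimal compose file for three of them might look like this (service names, ports, and credentials are assumptions for illustration, not taken from the lab):

```yaml
# docker-compose.yml — three independent MySQL instances acting as shards.
# Names, host ports, and the password are illustrative placeholders.
services:
  shard1:
    image: mysql:8.0
    environment: {MYSQL_ROOT_PASSWORD: secret, MYSQL_DATABASE: appdb}
    ports: ["3311:3306"]
  shard2:
    image: mysql:8.0
    environment: {MYSQL_ROOT_PASSWORD: secret, MYSQL_DATABASE: appdb}
    ports: ["3312:3306"]
  shard3:
    image: mysql:8.0
    environment: {MYSQL_ROOT_PASSWORD: secret, MYSQL_DATABASE: appdb}
    ports: ["3313:3306"]
```

`docker compose up -d` starts all three; nothing links them — any coordination must come from the application.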

📸 Verified Output:


Step 3: Application-Level Sharding — Python Router
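
The core of an application-level router is a pure function from shard key to shard. A minimal sketch, assuming the shards listen on localhost ports 3311-3313 (as in the compose example); opening real connections (e.g. with `mysql-connector-python`) is left as a comment:

```python
# Minimal application-level shard router: map user_id -> shard config.
import hashlib

SHARDS = [
    {"host": "127.0.0.1", "port": 3311},  # shard1
    {"host": "127.0.0.1", "port": 3312},  # shard2
    {"host": "127.0.0.1", "port": 3313},  # shard3
]

def shard_for(user_id):
    """Return the shard responsible for this user_id (stable hash, mod N)."""
    digest = hashlib.md5(str(user_id).encode()).hexdigest()
    return SHARDS[int(digest, 16) % len(SHARDS)]

# Every query for a given user goes to the same instance, e.g.:
# conn = mysql.connector.connect(**shard_for(42), user="root", password=...)
print(shard_for(42))
```

Using a digest instead of Python's built-in `hash()` matters: `hash()` is salted per process, so routing would change between runs.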

📸 Verified Output:


Step 4: Demonstrate the Cross-Shard Query Problem
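
The scatter-gather pattern can be demonstrated without a live cluster; the in-memory "shards", rows, and modulo routing below are made up for illustration:

```python
# Simulate a 3-shard cluster sharded by user_id (user_id % 3 -> shard index).
shards = [
    {"orders": [(3, 7.25), (6, 80.0)]},   # shard 0: (user_id, total)
    {"orders": [(1, 99.0), (4, 15.5)]},   # shard 1
    {"orders": [(2, 42.0), (5, 60.0)]},   # shard 2
]

def shard_of(user_id):
    return user_id % len(shards)  # toy routing for the simulation

# Query WITH the shard key: touches exactly one shard.
def orders_for_user(user_id):
    shard = shards[shard_of(user_id)]
    return [o for o in shard["orders"] if o[0] == user_id]

# Query WITHOUT the shard key: must scatter to every shard and merge.
def orders_over(amount):
    results = []
    for shard in shards:  # N round-trips, N x load
        results.extend(o for o in shard["orders"] if o[1] > amount)
    return results

print(orders_for_user(4))  # -> [(4, 15.5)]                    one shard queried
print(orders_over(50.0))   # -> [(6, 80.0), (1, 99.0), (5, 60.0)]  all shards queried
```

Against real MySQL shards, each loop iteration is a network round-trip plus query load on that instance, and the application must also handle sorting, limits, and partial failures across the merged results.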

📸 Verified Output:

💡 This is why shard key selection is critical — queries that filter by shard key hit 1 shard; queries that don't hit ALL shards (scatter-gather), multiplying latency and load.


Step 5: Consistent Hashing Concept
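
Consistent hashing places each shard at many points ("virtual nodes") on a hash ring; a key belongs to the first shard point at or after its own hash, wrapping around. A compact sketch (shard names and vnode count are illustrative):

```python
# Consistent hashing ring with virtual nodes.
import bisect
import hashlib

def h(value):
    return int(hashlib.md5(str(value).encode()).hexdigest(), 16)

class Ring:
    def __init__(self, shards, vnodes=100):
        # Each shard owns `vnodes` points on the ring.
        self._points = sorted(
            (h(f"{s}#{v}"), s) for s in shards for v in range(vnodes)
        )
        self._keys = [p for p, _ in self._points]

    def shard_for(self, key):
        # First point at or after the key's hash, wrapping past the end.
        i = bisect.bisect(self._keys, h(key)) % len(self._points)
        return self._points[i][1]

ring3 = Ring(["shard1", "shard2", "shard3"])
ring4 = Ring(["shard1", "shard2", "shard3", "shard4"])

moved = sum(ring3.shard_for(k) != ring4.shard_for(k) for k in range(10_000))
print(f"keys remapped after adding a shard: {moved / 10_000:.0%}")  # ~25%
```

Adding a fourth shard remaps only the keys the new shard takes over (roughly 1/N+1 of them), versus the majority of keys under plain `hash % N` routing.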

📸 Verified Output:


Step 6: Verify Data Isolation per Shard
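
The isolation property to verify: every row on a shard routes back to that shard, and no shard holds another shard's rows. Simulated in memory below; against the lab's MySQL instances the same check would be a `SELECT` of the shard key column on each instance (routing and row counts here are illustrative):

```python
# Insert rows through the router, then verify per-shard isolation.
NUM_SHARDS = 3

def shard_of(user_id):
    return user_id % NUM_SHARDS  # toy routing

shards = [set() for _ in range(NUM_SHARDS)]
for uid in range(1, 101):  # 100 users inserted via the router
    shards[shard_of(uid)].add(uid)

for i, rows in enumerate(shards):
    # Every row stored on shard i must route to shard i.
    assert all(shard_of(uid) == i for uid in rows)
    print(f"shard{i + 1}: {len(rows)} rows, isolation OK")
```

If any row fails this check, either the router changed (e.g. a different hash function) or a write bypassed it; both corrupt routing silently.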

📸 Verified Output:


Step 7: Resharding Challenge Demo
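
The cost of resharding under modulo routing can be counted directly; the shard counts below are illustrative:

```python
# Why resharding hurts: with hash(key) % N routing, changing N remaps
# most keys, and every remapped row must physically move between instances.
total = 100_000
moved_3_to_4 = sum((uid % 3) != (uid % 4) for uid in range(total))
moved_3_to_6 = sum((uid % 3) != (uid % 6) for uid in range(total))

print(f"3 -> 4 shards: {moved_3_to_4 / total:.0%} of rows move")  # ~75%
print(f"3 -> 6 shards: {moved_3_to_6 / total:.0%} of rows move")  # 50% (doubling)
```

Even the best case for modulo routing (doubling the shard count) moves half the data, which is why production resharding relies on consistent hashing, directory-based routing, or a double-write migration window.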

📸 Verified Output:


Step 8: Capstone — Sharding Decision Framework
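
One way to make a decision framework explicit is as a checklist function. The criteria and wording below are a sketch drawn from this lab's takeaways, not an official rubric:

```python
# Sharding decision checklist: shard only when cheaper options are exhausted.
# Criteria are illustrative; adapt them to your own workload.
def should_shard(db):
    exhausted_first = [
        db["indexes_tuned"],
        db["queries_optimized"],
        db["read_replicas_in_use"],
        db["vertical_scaling_maxed"],
    ]
    if not all(exhausted_first):
        return "Not yet: exhaust single-server optimizations first."
    if not db["has_stable_high_cardinality_key"]:
        return "Risky: no good shard key; expect scatter-gather queries."
    return "Shard: consider managed options (Vitess, Citus, TiDB) before DIY."

print(should_shard({
    "indexes_tuned": True,
    "queries_optimized": True,
    "read_replicas_in_use": False,   # cheaper lever still available
    "vertical_scaling_maxed": False,
    "has_stable_high_cardinality_key": True,
}))
```

The ordering encodes the lab's main point: sharding is a last resort, and a missing shard key is a reason to stop, not to improvise.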

📸 Verified Output:


Summary

Concept            | Description                     | Tradeoff
-------------------|---------------------------------|----------------------------------------------------
Range sharding     | Route by value ranges           | Risk of hotspots
Hash sharding      | Route by hash(key) % N          | Even distribution, hard range queries
Consistent hashing | Virtual ring, minimal remapping | Complex implementation
Shard key          | Column used for routing         | Must match query patterns
Cross-shard query  | Hit all shards, merge results   | N× latency, application complexity
Resharding         | Split/merge shards              | High complexity, requires downtime or double-write

Key Takeaways

  • Sharding is a last resort — exhaust all single-server optimizations first

  • Shard key choice is effectively irreversible — picking wrong means a full resharding later

  • Cross-shard queries are expensive — design data model to minimize them

  • Consistent hashing minimizes remapping when adding/removing shards

  • Managed solutions (Vitess, Citus, TiDB) handle resharding automatically
