Lab 19: Global Data Platform

Time: 50 minutes | Level: Architect | DB: PostgreSQL, ClickHouse, Elasticsearch, Redis, MongoDB, Kafka

Modern applications need more than a single database. This lab designs a polyglot data platform where each technology plays to its strengths: PostgreSQL for OLTP, ClickHouse for analytics, Elasticsearch for search, Redis for caching, and Kafka for event streaming.


Step 1: Platform Overview & Design Principles

┌─────────────────────────────────────────────────────────────────────────────┐
│                       GLOBAL DATA PLATFORM                                   │
│                                                                              │
│   ┌──────────────┐    ┌──────────────┐    ┌──────────────────────────────┐  │
│   │   Web / API  │    │   Mobile     │    │    Internal Services         │  │
│   └──────┬───────┘    └──────┬───────┘    └───────────┬──────────────────┘  │
│          └─────────────────────────────────────────────┘                     │
│                              │                                               │
│                    ┌─────────▼──────────┐                                   │
│                    │   API Gateway /    │                                    │
│                    │   Load Balancer    │                                    │
│                    └────────┬───────────┘                                   │
│                             │                                                │
│          ┌──────────────────┼──────────────────────┐                        │
│          │                  │                       │                        │
│   ┌──────▼──────┐   ┌───────▼──────┐   ┌──────────▼────────┐               │
│   │  Redis 7    │   │ PostgreSQL 15│   │  Elasticsearch 8  │               │
│   │  Cluster    │   │  Primary +   │   │   3-node cluster  │               │
│   │  (Cache)    │   │  2 Replicas  │   │   (Search/Audit)  │               │
│   └─────────────┘   └──────┬───────┘   └───────────────────┘               │
│                             │  CDC                                           │
│                    ┌────────▼───────────┐                                   │
│                    │   Apache Kafka 3   │                                    │
│                    │   3 brokers        │                                    │
│                    │   (Event Hub)      │                                    │
│                    └────────┬───────────┘                                   │
│                             │ Consumers                                      │
│          ┌──────────────────┼──────────────────────┐                        │
│          │                  │                       │                        │
│   ┌──────▼──────┐   ┌───────▼──────┐   ┌──────────▼────────┐               │
│   │ ClickHouse  │   │   MongoDB    │   │     Grafana /      │               │
│   │  Cluster    │   │  Atlas       │   │     Dashboards     │               │
│   │  (OLAP)     │   │  (Profiles)  │   │                    │               │
│   └─────────────┘   └──────────────┘   └───────────────────┘               │
└─────────────────────────────────────────────────────────────────────────────┘

Data Flow:
  Read path:  API → Redis (cache hit) → PostgreSQL (cache miss)
  Write path: API → PostgreSQL → Kafka CDC → ClickHouse/ES/MongoDB
  Analytics:  Kafka → ClickHouse → Grafana
  Search:     Kafka → Elasticsearch → API search endpoint
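The read path above is the classic cache-aside pattern. A minimal sketch, using plain dicts as stand-ins for Redis and PostgreSQL — the `get_user` function, the sample row, and the 300-second TTL are illustrative, not part of the lab:

```python
import time

redis_cache = {}                              # stand-in for Redis: key -> (value, expires_at)
postgres = {42: {"id": 42, "name": "Ada"}}    # stand-in for the OLTP store
CACHE_TTL = 300                               # seconds; tune per workload

def get_user(user_id: int):
    key = f"user:{user_id}"
    hit = redis_cache.get(key)
    if hit and hit[1] > time.time():          # cache hit: serve from Redis
        return hit[0]
    row = postgres.get(user_id)               # cache miss: fall back to PostgreSQL
    if row is not None:
        redis_cache[key] = (row, time.time() + CACHE_TTL)   # populate cache for next read
    return row

first = get_user(42)     # miss -> reads PostgreSQL, populates cache
second = get_user(42)    # hit  -> served from cache
```

With a real Redis client the cache read/write becomes `GET`/`SETEX`, but the control flow is identical.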

Design principles:

  1. Polyglot persistence — Use the right database for each job

  2. Event-driven — Kafka as the central nervous system

  3. Cache-first reads — Redis absorbs most read traffic; cache hit rates of 80%+ are common for read-heavy workloads

  4. CQRS — Separate read and write paths

  5. Eventual consistency — Accept lag in analytics, require freshness in OLTP
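Principles 2, 4, and 5 combine into one loop: writes append events, consumers project them into read models that lag slightly behind. A minimal sketch with a list standing in for a Kafka topic and a dict for the analytics projection; all names are illustrative:

```python
event_log = []     # stand-in for a Kafka topic (the write side's output)
read_model = {}    # stand-in for a ClickHouse/Elasticsearch projection (the read side)

def handle_write(order_id: str, amount: float):
    """Write side (CQRS): record what happened as an immutable event."""
    event_log.append({"type": "order_placed", "order_id": order_id, "amount": amount})

def project():
    """Read side: fold the event log into a queryable view. Runs
    asynchronously in the real platform, hence eventual consistency."""
    for ev in event_log:
        if ev["type"] == "order_placed":
            read_model[ev["order_id"]] = ev["amount"]

handle_write("o-1", 99.0)   # the read model is stale until project() runs
project()
```

The projection is a pure replay of the log, so it can be rebuilt from scratch at any time — one of the main operational payoffs of the event-driven design.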


Step 2: OLTP Layer — PostgreSQL with PgBouncer

💡 PgBouncer in transaction mode allows 1000+ application connections to share just 25 database connections. This eliminates the "too many clients" error that kills unoptimized PostgreSQL setups.
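A pgbouncer.ini fragment matching those numbers — database name, host, and auth file path are placeholders to adapt to your environment:

```ini
[databases]
appdb = host=127.0.0.1 port=5432 dbname=appdb

[pgbouncer]
listen_addr = 0.0.0.0
listen_port = 6432
pool_mode = transaction      ; server connections are shared per-transaction
max_client_conn = 1000       ; app-side connections PgBouncer will accept
default_pool_size = 25       ; actual PostgreSQL connections per db/user pair
auth_type = scram-sha-256
auth_file = /etc/pgbouncer/userlist.txt
```

Applications connect to port 6432 instead of 5432; no other code changes are needed, though transaction mode does rule out session-level features such as prepared statements held across transactions.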


Step 3: Analytics Layer — ClickHouse

ClickHouse is a columnar OLAP database that can scan billions of rows per second.
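A sketch of what the analytics table might look like — the table name, columns, and partitioning scheme are assumptions for illustration, not the lab's actual schema:

```sql
-- Events landing from Kafka, laid out for columnar scans
CREATE TABLE events
(
    event_time  DateTime,
    user_id     UInt64,
    event_type  LowCardinality(String),
    amount      Float64
)
ENGINE = MergeTree
PARTITION BY toYYYYMM(event_time)
ORDER BY (event_type, event_time);

-- Typical OLAP query: only the referenced columns are read from disk
SELECT event_type, count(), sum(amount)
FROM events
WHERE event_time >= now() - INTERVAL 1 DAY
GROUP BY event_type;
```

The `ORDER BY` key doubles as the primary index: queries filtering on `event_type` and a time range skip most granules entirely, which is where the billions-of-rows-per-second figure comes from.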


Step 4: Search Layer — Elasticsearch

💡 Elasticsearch is eventually consistent. Index updates after a Kafka consume may take 1-10 seconds to appear in search. Design your UI to handle this — show "results may take a moment to update" after writes.
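One common mitigation is read-your-writes: merge the author's just-written rows from PostgreSQL into the search response so they always see their own documents. A sketch with dicts standing in for both stores; function and document names are illustrative:

```python
postgres = {}        # source of truth, updated synchronously on write
search_index = {}    # stand-in for Elasticsearch, populated asynchronously via Kafka

def write_doc(doc_id: str, body: str):
    postgres[doc_id] = body
    # ...CDC -> Kafka -> Elasticsearch indexing happens later, asynchronously...

def search(query: str, author_recent_ids=()):
    # Normal path: substring match stands in for full-text scoring
    hits = {i: d for i, d in search_index.items() if query in d}
    # Read-your-writes patch-up: fill in the caller's own fresh documents
    for doc_id in author_recent_ids:
        if doc_id not in hits and query in postgres.get(doc_id, ""):
            hits[doc_id] = postgres[doc_id]
    return hits

write_doc("d1", "kafka pipeline notes")
results = search("kafka", author_recent_ids=["d1"])   # found despite indexing lag
```

Other users still see the document only after Elasticsearch catches up, which is the behavior the UI notice above covers.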


Step 5: Cache Layer — Redis Cluster
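Beyond caching, Redis is the natural home for rate limiting. A sketch of a fixed-window limiter built on Redis INCR/EXPIRE semantics, with a dict standing in for Redis; the limit and window size are illustrative:

```python
import time

store = {}   # stand-in for Redis: key -> (count, window_expires_at)

def allow_request(client_id: str, limit: int = 100, window: int = 60) -> bool:
    # One counter key per client per time window, e.g. "rl:c1:28374655"
    key = f"rl:{client_id}:{int(time.time() // window)}"
    count, _ = store.get(key, (0, None))
    if count >= limit:
        return False                                   # over budget this window
    store[key] = (count + 1, time.time() + window)     # INCR + EXPIRE in real Redis
    return True

ok = [allow_request("c1", limit=3) for _ in range(5)]  # [True, True, True, False, False]
```

In a Redis Cluster deployment the counter key hashes to one shard, so the check stays a single round trip; atomicity of the check-and-increment would come from a small Lua script or `INCR` with `EXPIRE NX`.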


Step 6: Event Streaming — Apache Kafka

💡 Use Debezium for Change Data Capture (CDC) from PostgreSQL to Kafka. It reads the WAL (Write-Ahead Log) and produces events for every INSERT/UPDATE/DELETE — no code changes needed in your application.
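Downstream consumers see Debezium's event envelope with `before`/`after` images and an `op` code. A sketch of routing one such event into a sink — the table, topic payload, and dict-based "sink" are illustrative, but the envelope shape follows Debezium's format:

```python
import json

# One CDC message as it might arrive from the users-table topic
raw_event = json.dumps({
    "payload": {
        "before": None,
        "after": {"id": 7, "email": "a@example.com"},
        "op": "c",                        # c=create, u=update, d=delete, r=snapshot read
        "source": {"table": "users"},
    }
})

replica = {}   # stand-in for the ClickHouse/Elasticsearch/MongoDB sink

def apply_cdc(raw: str):
    p = json.loads(raw)["payload"]
    if p["op"] in ("c", "u", "r"):
        replica[p["after"]["id"]] = p["after"]    # upsert the new row image
    elif p["op"] == "d":
        replica.pop(p["before"]["id"], None)      # remove the deleted row

apply_cdc(raw_event)
```

Because every consumer applies the same upsert/delete logic, the three sinks stay in sync with PostgreSQL without the application ever writing to them directly.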


Step 7: Data Flow Patterns


Step 8: Capstone — Platform Validation

Run verification:
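A minimal smoke test is simply checking that each service's TCP port accepts connections. The host/port values below are the common defaults and may differ in your deployment:

```python
import socket

SERVICES = {
    "PostgreSQL":    ("127.0.0.1", 5432),
    "PgBouncer":     ("127.0.0.1", 6432),
    "Redis":         ("127.0.0.1", 6379),
    "Elasticsearch": ("127.0.0.1", 9200),
    "ClickHouse":    ("127.0.0.1", 8123),
    "Kafka":         ("127.0.0.1", 9092),
}

def port_open(host: str, port: int, timeout: float = 2.0) -> bool:
    """True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

def verify() -> dict:
    """Map each platform component to its reachability."""
    return {name: port_open(h, p) for name, (h, p) in SERVICES.items()}
```

This only proves the listeners are up; a fuller check would issue one trivial query per system (e.g. `SELECT 1` via PgBouncer, `PING` to Redis).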

📸 Verified Output:


Summary

Component        Technology             Primary Use               Throughput            Latency
---------------  ---------------------  ------------------------  --------------------  -------
OLTP             PostgreSQL 15          Transactions, ACID        10K TPS               <5ms
Connection Pool  PgBouncer              Connection multiplexing   1000 clients→25 PG    +0.1ms
Analytics        ClickHouse             OLAP, aggregations        1B rows/sec scan      <100ms
Search           Elasticsearch          Full-text, log search     50K docs/sec          <50ms
Cache            Redis Cluster          Read cache, rate limits   1M ops/sec            <1ms
Streaming        Apache Kafka           CDC, event bus            500MB/s               <10ms
Profiles         MongoDB                Document/JSON data        50K writes/sec        <10ms
Monitoring       Grafana + Prometheus   Metrics, dashboards       —                     —
