Master Kafka architecture: brokers, topics, partitions, consumer groups, and offsets. Run Kafka with Docker, create topics, produce and consume messages, measure consumer lag, and implement the CDC (change data capture) pattern.
📚 Background
Kafka Architecture
```
Producers → [Topic: orders] → Consumers
            ├── Partition 0: [msg1][msg2][msg5]...  ← Consumer Group A
            ├── Partition 1: [msg3][msg6]...        ← Consumer Group A
            └── Partition 2: [msg4][msg7]...        ← Consumer Group A

(Consumer Group B reads the same topic independently, with its own offsets)
```
Key Concepts:

| Concept | Description |
|---|---|
| Broker | Kafka server; stores and serves partitions |
| Topic | Named stream of records (like a database table) |
| Partition | Ordered, immutable log; enables parallelism |
| Offset | Sequential ID for each message in a partition |
| Consumer Group | Logical subscriber; each partition → one consumer |
| Retention | Messages kept for N days/bytes; not deleted on consume |
Kafka as Database Changelog (CDC)

In change data capture, every row-level change (insert, update, delete) is read from the database's transaction log and published to a Kafka topic, turning the database itself into a replayable event stream.
Step 1: Start Kafka with Docker
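One way to bring up a single-node broker, assuming the official `apache/kafka` image (KRaft mode, so no ZooKeeper is needed); the image tag and container name are illustrative:

```shell
# Start a single-node Kafka broker in KRaft mode.
# Tag and container name are assumptions; adjust to your environment.
docker run -d --name kafka -p 9092:9092 apache/kafka:3.7.0

# Confirm the broker finished starting before moving on.
docker logs kafka 2>&1 | grep -i "started"
```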
Step 2: Create Topics
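A sketch of topic creation with the stock CLI tools (inside the container they live under `/opt/kafka/bin/`); the topic name and partition count are examples:

```shell
# Create an 'orders' topic with 3 partitions (single-broker, so RF=1).
kafka-topics.sh --bootstrap-server localhost:9092 \
  --create --topic orders --partitions 3 --replication-factor 1

# List all topics, then inspect partition layout and leadership.
kafka-topics.sh --bootstrap-server localhost:9092 --list
kafka-topics.sh --bootstrap-server localhost:9092 --describe --topic orders
```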
Step 3: Produce Messages
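A possible console-producer session using keyed messages, so the same key consistently hashes to the same partition; the key separator choice is arbitrary:

```shell
# Produce keyed messages interactively; key and value split on ':'.
kafka-console-producer.sh --bootstrap-server localhost:9092 \
  --topic orders \
  --property parse.key=true --property key.separator=:
# Then type lines such as (example payloads):
#   order-1:{"id": 1, "status": "created"}
#   order-2:{"id": 2, "status": "created"}
#   order-1:{"id": 1, "status": "shipped"}   # same key → same partition
```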
Step 4: Consume Messages
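A console-consumer invocation that reads from offset 0 and prints each record's key and partition, which makes the key → partition mapping from the previous step visible:

```shell
# Replay the topic from the beginning, showing key and partition per record.
kafka-console-consumer.sh --bootstrap-server localhost:9092 \
  --topic orders --from-beginning \
  --property print.key=true --property print.partition=true
```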
Step 5: Consumer Groups & Lag
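A sketch of running a consumer inside a group and then inspecting that group's per-partition position; the group name `order-processors` is an example:

```shell
# Consume as part of a group; committed offsets are tracked broker-side.
kafka-console-consumer.sh --bootstrap-server localhost:9092 \
  --topic orders --group order-processors

# In another terminal: per-partition CURRENT-OFFSET, LOG-END-OFFSET, LAG.
kafka-consumer-groups.sh --bootstrap-server localhost:9092 \
  --describe --group order-processors
```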
💡 Consumer lag is the most important Kafka operational metric: lag = LOG-END-OFFSET - CURRENT-OFFSET. A steadily increasing lag means consumers can't keep up with producers.
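The formula is plain per-partition arithmetic; the offset values below are made up for illustration:

```shell
# Hypothetical offsets for one partition of 'orders'.
LOG_END_OFFSET=1500    # newest offset written by producers
CURRENT_OFFSET=1342    # last offset committed by the consumer group
echo "lag=$((LOG_END_OFFSET - CURRENT_OFFSET))"   # prints: lag=158
```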
Step 6: Kafka as Database Changelog (CDC)
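One way to wire this up, assuming Kafka Connect with the Debezium MySQL connector (Debezium 2.x property names) is listening on port 8083; every connection value below is a placeholder for your environment:

```shell
# Register a Debezium MySQL source connector with Kafka Connect.
curl -X POST http://localhost:8083/connectors \
  -H "Content-Type: application/json" \
  -d '{
    "name": "orders-cdc",
    "config": {
      "connector.class": "io.debezium.connector.mysql.MySqlConnector",
      "database.hostname": "mysql",
      "database.port": "3306",
      "database.user": "debezium",
      "database.password": "dbz",
      "database.server.id": "184054",
      "topic.prefix": "shop",
      "table.include.list": "shop.orders",
      "schema.history.internal.kafka.bootstrap.servers": "kafka:9092",
      "schema.history.internal.kafka.topic": "schema-changes.shop"
    }
  }'
# Row changes to shop.orders then appear as events on topic shop.shop.orders.
```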
Step 7: Kafka Configuration Best Practices
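A sketch of the durability-related knobs; the values shown are common recommendations, not requirements, and the topic name is the example from above:

```shell
# Topic-level: require at least 2 in-sync replicas for acks=all writes
# (meaningful on a multi-broker cluster).
kafka-configs.sh --bootstrap-server localhost:9092 \
  --alter --entity-type topics --entity-name orders \
  --add-config min.insync.replicas=2

# Producer-side client config (e.g. producer.properties):
#   acks=all                  # wait for all in-sync replicas
#   enable.idempotence=true   # exactly-once at the producer level
#   compression.type=lz4      # batch compression for throughput
```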
Step 8: Capstone — Kafka Architecture Review
Summary

| Concept | Key Takeaway |
|---|---|
| Topic | Named append-only log; messages retained for N days |
| Partition | Ordered unit of parallelism; key → partition (consistent) |
| Offset | Sequential message ID within a partition; consumers track position |
| Consumer Group | Group of consumers sharing a topic; 1 partition → 1 consumer |
| Consumer Lag | LOG-END-OFFSET - CURRENT-OFFSET; key health metric |
| kafka-topics.sh | Create, list, describe, alter topics |
| kafka-consumer-groups.sh | Inspect and reset consumer group offsets |
| CDC with Debezium | MySQL binlog → Kafka topic; every DB change = event |
| acks=all | Wait for all in-sync replicas (ISR); required for durability |
| idempotent producer | Exactly-once at the producer level; prevents duplicates |
💡 Architect's insight: Kafka is not just a message queue — it's a durable, replayable event log. The ability to replay from offset 0 makes it the backbone of event sourcing, CDC pipelines, and CQRS architectures.
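That replay is one command away, assuming the example group and topic names used earlier (the group must have no active members while its offsets are reset):

```shell
# Rewind the consumer group to offset 0 and reprocess the entire log.
kafka-consumer-groups.sh --bootstrap-server localhost:9092 \
  --group order-processors --topic orders \
  --reset-offsets --to-earliest --execute
```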
```bash
cat > /tmp/kafka_architecture.py << 'EOF'
"""
Kafka as a database architecture component.
"""
print("Kafka in Modern Data Architecture")
print("="*65)
print("""
┌──────────────────────────────────────────────────────────┐
│                  Event-Driven Data Flow                  │
│                                                          │
│  ┌─────────┐    CDC     ┌──────────┐                     │
│  │ MySQL   │──────────► │          │───► Elasticsearch   │
│  │ Postgres│  Debezium  │  Kafka   │───► Data Warehouse  │
│  │ MongoDB │            │  Topics  │───► Analytics DB    │
│  └─────────┘            │          │───► Microservices   │
│                         │          │───► ML Pipeline     │
│  ┌─────────┐ App events │          │───► Audit Store     │
│  │ Services│──────────► │          │                     │
│  └─────────┘            └──────────┘                     │
└──────────────────────────────────────────────────────────┘
""")

use_cases = [
    ("Event bus", "Decouple microservices; pub/sub messaging"),
    ("CDC pipeline", "Capture DB changes; sync to search/warehouse"),
    ("Stream processing", "Kafka Streams / Flink: real-time analytics"),
    ("Activity tracking", "User clicks, page views, audit logs"),
    ("ML feature store", "Real-time features for ML inference"),
    ("Command sourcing", "Commands as events; CQRS write side"),
    ("Log aggregation", "Centralized logs from all services"),
    ("Metrics pipeline", "Metrics → Prometheus / TimescaleDB"),
]
print("Common Use Cases:")
for uc, desc in use_cases:
    print(f"  {uc:<22}: {desc}")

print("\nKafka vs Alternative Messaging Systems:")
alternatives = [
    ("Kafka", "High throughput, durability, replay, large-scale CDC"),
    ("RabbitMQ", "Complex routing, low latency, message acknowledgement"),
    ("AWS SQS/SNS", "Serverless, managed, less control, AWS ecosystem"),
    ("Redis Streams", "Low latency, simple, co-located with cache layer"),
    ("Pulsar", "Kafka alternative, multi-tenancy, geo-replication"),
    ("EventBridge", "AWS-native serverless events, SaaS integrations"),
]
print(f"\n{'System':<15} {'Best For'}")
print("-"*65)
for system, best in alternatives:
    print(f"{system:<15} {best}")
EOF
python3 /tmp/kafka_architecture.py
```
python3 /tmp/kafka_architecture.py
Output (truncated):

```
Kafka in Modern Data Architecture
[event-driven data flow diagram, as printed by the script]

Common Use Cases:
  Event bus             : Decouple microservices; pub/sub messaging
  CDC pipeline          : Capture DB changes; sync to search/warehouse
  Stream processing     : Kafka Streams / Flink: real-time analytics
  ...

System          Best For
-----------------------------------------------------------------
Kafka           High throughput, durability, replay, large-scale CDC
RabbitMQ        Complex routing, low latency, message acknowledgement
Redis Streams   Low latency, simple, co-located with cache layer
...
```