Lab 18: Cassandra Wide Column

Time: 45 minutes | Level: Advanced | DB: Apache Cassandra

Overview

Cassandra is a distributed wide-column database designed for massive write throughput and linear scalability. Its data model (partition keys and clustering columns) plus tunable consistency levels make it a strong fit for time-series, IoT, and other high-write workloads.


Step 1: Launch Cassandra

docker run -d --name cassandra-lab \
  -e MAX_HEAP_SIZE=512M \
  -e HEAP_NEWSIZE=128M \
  cassandra:4.1

echo "Waiting for Cassandra to start (60-90 seconds)..."
for i in $(seq 1 60); do
  # Rely on cqlsh's exit status rather than grepping output text,
  # which varies between cqlsh versions.
  if docker exec cassandra-lab cqlsh -e "SELECT now() FROM system.local;" >/dev/null 2>&1; then
    break
  fi
  sleep 3
done

echo "Cassandra ready!"
docker exec cassandra-lab nodetool status


💡 A single-node Cassandra cluster is useful for development but not fault-tolerant. Production clusters need at least 3 nodes so QUORUM reads and writes can still succeed when one node is down.


Step 2: Create Keyspace with Replication Strategy

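The original commands for this step did not survive. A minimal sketch, assuming a keyspace named lab (an illustrative name, not from the original): SimpleStrategy with replication factor 1 suits the single-node lab container, while production clusters use NetworkTopologyStrategy with a per-datacenter replication factor.

```shell
# Create a keyspace on the single-node lab cluster.
# Production example instead of SimpleStrategy:
#   {'class': 'NetworkTopologyStrategy', 'dc1': 3}
docker exec cassandra-lab cqlsh -e "
  CREATE KEYSPACE IF NOT EXISTS lab
    WITH REPLICATION = {'class': 'SimpleStrategy', 'replication_factor': 1};
  DESCRIBE KEYSPACES;"
```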


Step 3: Create Tables — Partition Key and Clustering Columns


💡 Cassandra design rule: Design tables around queries, not normalized data. Each query gets its own table. Duplicate data is expected and encouraged.
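A sketch of a query-driven table, assuming the lab keyspace and a hypothetical sensor_readings table: the composite partition key (sensor_id, day) keeps partitions bounded in size, and the clustering column orders rows within each partition.

```shell
# Time-series table: partitioned by (sensor_id, day) so one partition never
# grows unbounded; CLUSTERING ORDER BY (reading_time DESC) stores newest first.
docker exec cassandra-lab cqlsh -e "
  CREATE TABLE IF NOT EXISTS lab.sensor_readings (
    sensor_id    text,
    day          date,
    reading_time timestamp,
    temperature  double,
    humidity     double,
    PRIMARY KEY ((sensor_id, day), reading_time)
  ) WITH CLUSTERING ORDER BY (reading_time DESC);"
```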


Step 4: INSERT, SELECT, UPDATE Operations

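A sketch of basic CQL operations against the hypothetical lab.sensor_readings table from Step 3. Note that every read must supply the full partition key, and UPDATE in Cassandra is an upsert.

```shell
docker exec cassandra-lab cqlsh -e "
  INSERT INTO lab.sensor_readings (sensor_id, day, reading_time, temperature, humidity)
  VALUES ('s-001', '2024-05-01', '2024-05-01 10:00:00+0000', 21.5, 48.0);

  -- Reads must restrict on the full partition key (sensor_id AND day).
  SELECT * FROM lab.sensor_readings
   WHERE sensor_id = 's-001' AND day = '2024-05-01';

  -- UPDATE is an upsert: it writes the row whether or not it already exists.
  UPDATE lab.sensor_readings
     SET temperature = 22.1
   WHERE sensor_id = 's-001' AND day = '2024-05-01'
     AND reading_time = '2024-05-01 10:00:00+0000';"
```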


Step 5: CQL Data Types and TTL

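A sketch of row expiration with TTL, again assuming the lab.sensor_readings table: rows written USING TTL are deleted automatically once the given number of seconds elapses, and the ttl() function reports the remaining lifetime.

```shell
# Write a row that expires after one hour.
docker exec cassandra-lab cqlsh -e "
  INSERT INTO lab.sensor_readings (sensor_id, day, reading_time, temperature, humidity)
  VALUES ('s-002', '2024-05-01', '2024-05-01 11:00:00+0000', 19.8, 52.0)
  USING TTL 3600;

  -- ttl() shows the remaining lifetime of a column, in seconds.
  SELECT sensor_id, temperature, ttl(temperature)
    FROM lab.sensor_readings
   WHERE sensor_id = 's-002' AND day = '2024-05-01';"
```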


Step 6: Consistency Levels


💡 Golden rule: reads are strongly consistent whenever read and write replica sets must overlap (R + W > RF). QUORUM writes + QUORUM reads satisfy this. ONE writes + ONE reads = eventual consistency (may read stale data).
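A sketch of setting the consistency level, using the same hypothetical table. CONSISTENCY is a cqlsh session setting, so set it and run the query in the same cqlsh invocation.

```shell
docker exec cassandra-lab cqlsh -e "
  CONSISTENCY QUORUM;
  SELECT * FROM lab.sensor_readings
   WHERE sensor_id = 's-001' AND day = '2024-05-01';"
```

On this single-node lab cluster with replication factor 1, QUORUM and ONE behave identically — the difference only becomes observable on a multi-node cluster with RF ≥ 2.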


Step 7: nodetool — Cluster Operations

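A sketch of common nodetool inspection commands for the lab container (tablestats against the hypothetical lab keyspace from earlier steps):

```shell
docker exec cassandra-lab nodetool status           # node state, load, token ownership
docker exec cassandra-lab nodetool info             # uptime, heap usage, cache stats
docker exec cassandra-lab nodetool tablestats lab   # per-table latencies, SSTable counts
docker exec cassandra-lab nodetool compactionstats  # pending/active compaction work
```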


Step 8: Capstone — Compaction Strategies

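A sketch of switching the hypothetical time-series table to TimeWindowCompactionStrategy (TWCS), which groups SSTables into time windows so fully expired (TTL'd) windows can be dropped whole instead of being rewritten during compaction.

```shell
# Switch compaction to TWCS with 1-day windows; a window should roughly
# match the table's write/expiry cadence.
docker exec cassandra-lab cqlsh -e "
  ALTER TABLE lab.sensor_readings
    WITH compaction = {
      'class': 'TimeWindowCompactionStrategy',
      'compaction_window_unit': 'DAYS',
      'compaction_window_size': '1'
    };
  DESCRIBE TABLE lab.sensor_readings;"
```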


Summary

Concept            | Key Detail                        | Command/Setting
-------------------|-----------------------------------|------------------------------------------------
Keyspace           | Namespace + replication config    | CREATE KEYSPACE ... WITH REPLICATION
Partition key      | Determines which node stores data | PRIMARY KEY ((pk))
Clustering column  | Row order within partition        | PRIMARY KEY ((pk), cc) ... CLUSTERING ORDER BY
Consistency ONE    | 1 replica responds                | Fast, eventual consistency
Consistency QUORUM | Majority responds                 | Balanced consistency/availability
Consistency ALL    | All replicas respond              | Strongest, lowest availability
TTL                | Automatic row expiration          | INSERT ... USING TTL seconds
TWCS               | Time-window compaction            | Optimal for time-series
nodetool status    | Node health + ownership           | nodetool status
compactionstats    | Pending compaction work           | nodetool compactionstats

Key Takeaways

  • Design tables around queries — Cassandra has no JOINs, so duplicate data is expected

  • Partition key drives distribution — high cardinality, uniform access pattern = balanced nodes

  • Clustering columns provide efficient range queries within a partition

  • QUORUM writes + QUORUM reads = strong consistency without sacrificing availability

  • TWCS compaction is essential for time-series data — dramatically reduces write amplification
