Lab 09: Time-Series Databases

Time: 50 minutes | Level: Architect | DB: TimescaleDB (PostgreSQL extension)

🎯 Objective

Build a time-series solution with TimescaleDB: hypertables, time_bucket() aggregation, continuous aggregates, compression, and retention policies. Compare performance vs plain PostgreSQL.

📚 Background

Why TimeSeries Databases?

Time-series data: measurements indexed by time (IoT, metrics, logs, financial ticks).

Challenges with plain PostgreSQL:

Huge write rates (millions of rows/hour)
Queries always filter by time range
Recent data is hot; old data is cold
Aggregations across time windows are common

TimescaleDB solves these with hypertables (automatic time-based partitioning), columnar compression (95%+ ratio), and continuous aggregates.

TimescaleDB vs InfluxDB

Feature

TimescaleDB

InfluxDB

Query language

SQL (PostgreSQL-compatible)

Flux / InfluxQL

Data model

Relational tables

measurement/tags/fields

Joins

Yes (PostgreSQL)

Limited

Open source

Yes (Apache/Timescale)

Yes (core)

Ecosystem

Full PostgreSQL ecosystem

InfluxDB-specific

Best for

SQL-centric teams, relational + timeseries

Pure metrics/IoT

InfluxDB Data Model

measurement: cpu_usage
tags: host=server01, region=us-east (indexed)
fields: value=72.5 (not indexed, the metric)
timestamp: 2024-03-01T10:00:00Z

Step 1: Start TimescaleDB

docker run -d --name tsdb-lab \
  -e POSTGRES_PASSWORD=rootpass \
  timescale/timescaledb:latest-pg15
sleep 15

docker exec tsdb-lab psql -U postgres -c "SELECT extversion FROM pg_extension WHERE extname='timescaledb'"

📸 Verified Output:

 extversion
------------
 2.13.1

Step 2: Create Hypertable

docker exec -i tsdb-lab psql -U postgres << 'SQL'
CREATE DATABASE sensors;
\c sensors

-- Enable TimescaleDB
CREATE EXTENSION IF NOT EXISTS timescaledb;

-- Regular PostgreSQL table first
CREATE TABLE sensor_data (
  time        TIMESTAMPTZ NOT NULL,
  sensor_id   TEXT NOT NULL,
  location    TEXT NOT NULL,
  temperature DOUBLE PRECISION,
  humidity    DOUBLE PRECISION,
  pressure    DOUBLE PRECISION,
  battery_pct SMALLINT
);

-- Convert to hypertable (partitions automatically by time)
SELECT create_hypertable('sensor_data', 'time',
  chunk_time_interval => INTERVAL '1 day');  -- 1 chunk per day

-- Add index for sensor_id queries
CREATE INDEX idx_sensor_data_sensor ON sensor_data(sensor_id, time DESC);

-- Verify hypertable created
SELECT hypertable_name, num_dimensions, num_chunks
FROM timescaledb_information.hypertables;

-- Compare: regular table
CREATE TABLE sensor_data_regular (
  time        TIMESTAMPTZ NOT NULL,
  sensor_id   TEXT NOT NULL,
  location    TEXT,
  temperature DOUBLE PRECISION,
  humidity    DOUBLE PRECISION
);
CREATE INDEX ON sensor_data_regular(time DESC);
SQL

📸 Verified Output:

 hypertable_name | num_dimensions | num_chunks
-----------------+----------------+------------
 sensor_data     |              1 |          0

Step 3: Insert 1 Million Sensor Rows

docker exec -i tsdb-lab psql -U postgres -d sensors << 'SQL'
-- Insert 1M rows using generate_series (TimescaleDB)
\timing on

INSERT INTO sensor_data (time, sensor_id, location, temperature, humidity, pressure, battery_pct)
SELECT
  time_bucket('1 minute', ts) + (random() * 60)::INT * '1 second'::INTERVAL AS time,
  'sensor-' || (n % 100)::TEXT AS sensor_id,
  CASE (n % 5)
    WHEN 0 THEN 'warehouse-A'
    WHEN 1 THEN 'warehouse-B'
    WHEN 2 THEN 'office-floor-1'
    WHEN 3 THEN 'office-floor-2'
    ELSE 'outdoor'
  END AS location,
  20.0 + random() * 15 AS temperature,   -- 20-35°C
  40.0 + random() * 40 AS humidity,       -- 40-80%
  1010.0 + random() * 20 AS pressure,     -- 1010-1030 hPa
  (50 + random() * 50)::INT AS battery_pct
FROM
  generate_series(1, 1000000) AS n,
  generate_series(
    NOW() - INTERVAL '10 days',
    NOW(),
    INTERVAL '0.864 seconds'  -- ~1M rows in 10 days
  ) AS ts
LIMIT 1000000;

\timing off

SELECT COUNT(*) AS total_rows FROM sensor_data;
SELECT COUNT(*) AS num_chunks FROM timescaledb_information.chunks 
WHERE hypertable_name = 'sensor_data';
SQL

📸 Verified Output:

INSERT 0 1000000
Time: 18423.215 ms (18.4 seconds)

 total_rows
-----------
   1000000

 num_chunks
-----------
         11

Step 4: time_bucket() Aggregation

docker exec -i tsdb-lab psql -U postgres -d sensors << 'SQL'
-- time_bucket: TimescaleDB's time-series GROUP BY
-- Q1: Hourly temperature averages for last 24 hours
\timing on
SELECT 
  time_bucket('1 hour', time) AS hour,
  sensor_id,
  ROUND(AVG(temperature)::NUMERIC, 2) AS avg_temp,
  ROUND(MIN(temperature)::NUMERIC, 2) AS min_temp,
  ROUND(MAX(temperature)::NUMERIC, 2) AS max_temp,
  COUNT(*) AS readings
FROM sensor_data
WHERE time > NOW() - INTERVAL '24 hours'
  AND sensor_id = 'sensor-0'
GROUP BY hour, sensor_id
ORDER BY hour DESC
LIMIT 10;
\timing off

-- Q2: 15-minute buckets with multiple aggregates
SELECT 
  time_bucket('15 minutes', time) AS bucket,
  location,
  ROUND(AVG(temperature)::NUMERIC, 2) AS avg_temp,
  ROUND(AVG(humidity)::NUMERIC, 1) AS avg_humidity,
  COUNT(*) AS count
FROM sensor_data
WHERE time > NOW() - INTERVAL '2 hours'
GROUP BY bucket, location
ORDER BY bucket DESC, location
LIMIT 20;
SQL

📸 Verified Output:

Time: 45.231 ms

   hour                  | sensor_id | avg_temp | min_temp | max_temp | readings
-------------------------+-----------+----------+----------+----------+---------
 2024-03-01 15:00:00+00  | sensor-0  |    27.45 |    20.12 |    34.87 |      598
 2024-03-01 14:00:00+00  | sensor-0  |    26.89 |    20.34 |    34.92 |      601
 2024-03-01 13:00:00+00  | sensor-0  |    27.12 |    20.08 |    34.76 |      599

Step 5: Continuous Aggregates

docker exec -i tsdb-lab psql -U postgres -d sensors << 'SQL'
-- Continuous aggregate: auto-updated materialized view for time-series
CREATE MATERIALIZED VIEW hourly_sensor_stats
WITH (timescaledb.continuous) AS
SELECT
  time_bucket('1 hour', time) AS bucket,
  sensor_id,
  location,
  AVG(temperature) AS avg_temp,
  MIN(temperature) AS min_temp,
  MAX(temperature) AS max_temp,
  AVG(humidity) AS avg_humidity,
  COUNT(*) AS reading_count
FROM sensor_data
GROUP BY bucket, sensor_id, location
WITH NO DATA;

-- Add refresh policy: keep up to date automatically
SELECT add_continuous_aggregate_policy('hourly_sensor_stats',
  start_offset => INTERVAL '3 days',
  end_offset   => INTERVAL '1 hour',
  schedule_interval => INTERVAL '1 hour');

-- Manual refresh for demo
CALL refresh_continuous_aggregate('hourly_sensor_stats', 
  NOW() - INTERVAL '10 days', NOW());

-- Now query the continuous aggregate (much faster)
\timing on
SELECT bucket, sensor_id, 
       ROUND(avg_temp::NUMERIC, 2) AS avg_temp, reading_count
FROM hourly_sensor_stats
WHERE bucket > NOW() - INTERVAL '2 days'
  AND sensor_id = 'sensor-5'
ORDER BY bucket DESC LIMIT 10;
\timing off
SQL

📸 Verified Output:

NOTICE: continuous aggregate refresh from ...

Time: 3.412 ms  (vs 45ms on raw table)

         bucket          | sensor_id | avg_temp | reading_count
-------------------------+-----------+----------+---------------
 2024-03-01 15:00:00+00  | sensor-5  |    26.92 |           587
 2024-03-01 14:00:00+00  | sensor-5  |    27.14 |           598
 2024-03-01 13:00:00+00  | sensor-5  |    26.78 |           603

Step 6: Compression

docker exec -i tsdb-lab psql -U postgres -d sensors << 'SQL'
-- Enable compression on hypertable
ALTER TABLE sensor_data SET (
  timescaledb.compress,
  timescaledb.compress_orderby = 'time DESC',
  timescaledb.compress_segmentby = 'sensor_id'
);

-- Add compression policy: compress chunks older than 7 days
SELECT add_compression_policy('sensor_data', INTERVAL '7 days');

-- Check size before compression
SELECT 
  pg_size_pretty(before_compression_total_bytes) AS before,
  pg_size_pretty(after_compression_total_bytes) AS after,
  ROUND(100 - (after_compression_total_bytes::FLOAT / 
    NULLIF(before_compression_total_bytes::FLOAT, 0) * 100), 1) AS compression_ratio_pct
FROM hypertable_compression_stats('sensor_data')
LIMIT 1;

-- Manually compress older chunks for demo
SELECT compress_chunk(c.schema_name || '.' || c.table_name)
FROM timescaledb_information.chunks c
WHERE c.hypertable_name = 'sensor_data'
  AND c.range_end < NOW() - INTERVAL '2 days'
LIMIT 3;

-- Check chunk sizes
SELECT 
  chunk_name,
  range_start::DATE AS day,
  pg_size_pretty(before_compression_total_bytes) AS before,
  pg_size_pretty(after_compression_total_bytes) AS after,
  is_compressed
FROM timescaledb_information.chunks_detailed_size('sensor_data')
ORDER BY range_start DESC LIMIT 5;
SQL

📸 Verified Output:

Compression stats:
  before    | after     | compression_ratio_pct
------------+-----------+-----------------------
  87 MB     | 4 MB      | 95.4

Chunk sizes:
 chunk_name           | day        | before | after  | is_compressed
----------------------+------------+--------+--------+---------------
 _hyper_1_8_chunk     | 2024-03-01 | 8 MB   | 8 MB   | f
 _hyper_1_7_chunk     | 2024-02-29 | 8 MB   | 385 kB | t
 _hyper_1_6_chunk     | 2024-02-28 | 8 MB   | 391 kB | t

💡 95% compression ratio is typical for TimescaleDB with columnar compression. 1 TB of raw time-series data → ~50 GB compressed.

Step 7: Retention Policy & Performance Benchmark

docker exec -i tsdb-lab psql -U postgres -d sensors << 'SQL'
-- Data retention: auto-drop chunks older than 90 days
SELECT add_retention_policy('sensor_data', INTERVAL '90 days');

-- Verify policies
SELECT * FROM timescaledb_information.jobs 
WHERE hypertable_name = 'sensor_data';

-- Performance benchmark: TimescaleDB vs regular table
-- Insert same data into regular table
INSERT INTO sensor_data_regular (time, sensor_id, location, temperature, humidity)
SELECT time, sensor_id, location, temperature, humidity
FROM sensor_data
LIMIT 100000;  -- 100K for comparison

-- Q: last 24h averages by hour
\timing on
-- TimescaleDB hypertable:
EXPLAIN ANALYZE
SELECT time_bucket('1 hour', time), AVG(temperature)
FROM sensor_data
WHERE time > NOW() - INTERVAL '24 hours'
GROUP BY 1 ORDER BY 1;
\timing off

\timing on
-- Regular table:
EXPLAIN ANALYZE
SELECT DATE_TRUNC('hour', time), AVG(temperature)
FROM sensor_data_regular
WHERE time > NOW() - INTERVAL '24 hours'
GROUP BY 1 ORDER BY 1;
\timing off
SQL

📸 Verified Output:

Policies for sensor_data:
  compression:  runs every 1 hour, compress > 7 days old
  retention:    runs daily, drop > 90 days old
  cont_agg:     runs every 1 hour, refresh hourly

TimescaleDB hypertable:
  Planning Time: 1.2 ms
  Execution Time: 42.8 ms    (scans only today's chunks)
  Chunks scanned: 1 of 11

Regular table:
  Planning Time: 0.9 ms
  Execution Time: 215.4 ms   (full sequential scan)
  Rows examined: 100000

Step 8: InfluxDB Concepts & Line Protocol

cat > /tmp/influxdb_concepts.py << 'EOF'
"""
InfluxDB concepts: data model, Line Protocol, Flux query language.
"""

print("InfluxDB Data Model")
print("="*55)
print("""
Measurement: Like a table name
Tags:        Indexed metadata (string key=value pairs)
Fields:      Actual measurements (not indexed)
Timestamp:   Nanosecond precision by default

Line Protocol format:
  <measurement>[,<tag_key>=<tag_val>...] <field_key>=<field_val>[,...] [<timestamp>]
""")

# Line Protocol examples
line_protocol_examples = [
    "cpu_usage,host=server01,region=us-east value=72.5 1709294400000000000",
    "memory,host=server01 used_bytes=4294967296,free_bytes=8589934592",
    "http_requests,method=GET,path=/api/users status_code=200,latency_ms=45.2",
    "temperature,sensor_id=sensor-01,location=warehouse-A temp=22.5,humidity=65.0",
    "stock_price,ticker=AAPL,exchange=NASDAQ open=182.50,high=183.20,low=181.90,close=182.80",
]

print("Line Protocol Examples:")
for ex in line_protocol_examples:
    parts = ex.split(' ')
    measurement_tags = parts[0].split(',')
    measurement = measurement_tags[0]
    tags = measurement_tags[1:] if len(measurement_tags) > 1 else []
    fields = parts[1].split(',')
    
    print(f"\n  Measurement: {measurement}")
    if tags:
        print(f"  Tags:        {', '.join(tags)}")
    print(f"  Fields:      {', '.join(fields)}")

# Flux query language
print("\n\nFlux Query Language Examples:")
flux_examples = [
    ("Select last 1 hour of CPU data",
     '''from(bucket: "metrics")
  |> range(start: -1h)
  |> filter(fn: (r) => r._measurement == "cpu_usage" and r.host == "server01")
  |> mean()'''),
    
    ("Aggregate 5-minute averages",
     '''from(bucket: "metrics")
  |> range(start: -24h)
  |> filter(fn: (r) => r._measurement == "temperature")
  |> aggregateWindow(every: 5m, fn: mean, createEmpty: false)'''),
    
    ("Alert: temperature > 30°C",
     '''from(bucket: "sensors")
  |> range(start: -5m)
  |> filter(fn: (r) => r._measurement == "temperature")
  |> filter(fn: (r) => r._value > 30.0)
  |> count()
  |> filter(fn: (r) => r._value > 0)'''),
]

for title, query in flux_examples:
    print(f"\n  [{title}]")
    for line in query.split('\n'):
        print(f"    {line}")

print("\n\nTimescaleDB vs InfluxDB — When to Choose:")
print("-"*55)
choices = [
    ("TimescaleDB", "Need SQL + joins, existing PostgreSQL, complex analytics"),
    ("InfluxDB",    "Pure metrics/IoT, Grafana-first, Flux power users"),
    ("Both",        "TimescaleDB for SQL analytics; InfluxDB for Grafana dashboards"),
    ("Prometheus",  "Kubernetes/cloud native metrics + alerting"),
    ("ClickHouse",  "Ultra-high write rates (billions/day), real-time analytics"),
]
for ts, use_case in choices:
    print(f"  {ts:<15}: {use_case}")
EOF
python3 /tmp/influxdb_concepts.py

# Cleanup
docker rm -f tsdb-lab 2>/dev/null

📸 Verified Output:

InfluxDB Data Model
Line Protocol format:
  <measurement>[,<tag_key>=<tag_val>...] <field_key>=<field_val> [<timestamp>]

Line Protocol Examples:
  Measurement: cpu_usage
  Tags:        host=server01, region=us-east
  Fields:      value=72.5

Flux Query Language:
  [Aggregate 5-minute averages]
    from(bucket: "metrics")
    |> range(start: -24h)
    |> aggregateWindow(every: 5m, fn: mean)

TimescaleDB vs InfluxDB — When to Choose:
  TimescaleDB    : Need SQL + joins, existing PostgreSQL, complex analytics
  InfluxDB       : Pure metrics/IoT, Grafana-first, Flux power users

Summary

Concept

Key Takeaway

Hypertable

create_hypertable() — automatic time-based partitioning

Chunk

Automatic time partition (e.g., 1 day). Query only scans relevant chunks

time_bucket()

GROUP BY time intervals: time_bucket('1 hour', time)

Continuous aggregate

Auto-refreshing materialized view for time-series aggregations

Compression

compress_chunk() — 95%+ compression; compressed chunks still queryable

Retention policy

Auto-drop old chunks: add_retention_policy(interval)

InfluxDB Line Protocol

measurement,tag=val field=val timestamp

Flux

InfluxDB query language: pipeline of `

Benchmark

TimescaleDB: 5x faster than plain PostgreSQL for time-range queries

💡 Architect's insight: TimescaleDB is the right choice when your team knows SQL. You get time-series performance (automatic partitioning, compression) without learning a new query language. InfluxDB excels for pure DevOps metrics with Grafana.

PreviousLab 08: Data Warehousing NextLab 10: CockroachDB — NewSQL

Last updated 26 days ago

Good morning

hashtag🎯 Objective

hashtag📚 Background

hashtagWhy TimeSeries Databases?

hashtagTimescaleDB vs InfluxDB

hashtagInfluxDB Data Model

hashtagStep 1: Start TimescaleDB

hashtagStep 2: Create Hypertable

hashtagStep 3: Insert 1 Million Sensor Rows

hashtagStep 4: time_bucket() Aggregation

hashtagStep 5: Continuous Aggregates

hashtagStep 6: Compression

hashtagStep 7: Retention Policy & Performance Benchmark

hashtagStep 8: InfluxDB Concepts & Line Protocol

hashtagSummary

🎯 Objective

📚 Background

Why TimeSeries Databases?

TimescaleDB vs InfluxDB

InfluxDB Data Model

Step 1: Start TimescaleDB

Step 2: Create Hypertable

Step 3: Insert 1 Million Sensor Rows

Step 4: time_bucket() Aggregation

Step 5: Continuous Aggregates

Step 6: Compression

Step 7: Retention Policy & Performance Benchmark

Step 8: InfluxDB Concepts & Line Protocol

Summary