Lab 12: Database Observability

Time: 50 minutes | Level: Architect | DB: MySQL 8.0, PostgreSQL 15


🎯 Objective

Implement database observability: MySQL performance_schema, PostgreSQL pg_stat_statements, key metrics (QPS/latency/connections/buffer hit rate), and Prometheus exporters.


📚 Background

Four Golden Signals for Databases

  1. Latency — Query execution time (p50, p95, p99)

  2. Traffic — Queries per second (QPS), connections

  3. Errors — Failed queries, deadlocks, replication errors

  4. Saturation — CPU, memory, I/O, connection pool

Key Metrics by Database

Metric
MySQL
PostgreSQL

QPS

Com_select + Com_insert + ...

pg_stat_statements.calls

Slow queries

Slow_queries, slow log

pg_stat_statements.mean_exec_time

Buffer hit rate

Innodb_buffer_pool_read_requests / reads

blks_hit / (blks_hit + blks_read)

Connections

Threads_connected / max_connections

pg_stat_activity

Replication lag

Seconds_Behind_Source

pg_replication_slots


Step 1: MySQL performance_schema Setup

📸 Verified Output:


Step 2: Generate Load & Find Top Queries

📸 Verified Output:


Step 3: MySQL Key Health Metrics

📸 Verified Output:


Step 4: PostgreSQL pg_stat_statements

📸 Verified Output:


Step 5: PostgreSQL Key Metrics

📸 Verified Output:


Step 6: Prometheus Exporters Setup

📸 Verified Output:


Step 7: Alerting Rules & SLOs

📸 Verified Output:


Step 8: Capstone — Observability Dashboard Design

📸 Verified Output:


Summary

Concept
Key Takeaway

performance_schema

MySQL: events_statements_summary_by_digest → top queries by time

pg_stat_statements

PostgreSQL: calls, mean_exec_time, total_exec_time, stddev

Buffer hit rate

Target >99%; below 95% = disk reads causing latency

Consumer lag

Key metric: current_offset vs log_end_offset

Replication lag

Seconds_Behind_Source (MySQL), pg_stat_replication (PG)

mysqld_exporter

Prometheus exporter: all MySQL global status variables

postgres_exporter

Prometheus exporter: pg_stat_* views as metrics

Grafana dashboards

Pre-built at grafana.com/grafana/dashboards

SLO

p99 < 100ms OLTP; availability 99.99%; buffer hit > 99%

💡 Architect's insight: The most important metric is query latency distribution (p50/p95/p99), not averages. A 1-second average can hide 5% of queries taking 20 seconds. Use pg_stat_statements histogram or slow query log for percentiles.

Last updated