Lab 15: Capstone — Full Observability Stack

Time: 45 minutes | Level: Architect | Docker: docker run -it --rm ubuntu:22.04 bash

Overview

This capstone lab designs and validates a complete observability stack from scratch. You will combine all concepts from Labs 11–14: Prometheus metrics collection, alert rules, Alertmanager routing, Grafana dashboards, Filebeat → Logstash → Elasticsearch log pipeline, Kibana search, and SLI/SLO definitions with error budget calculations.

Complete Stack Architecture

┌─────────────────────────────────────────────────────────────────────────────┐
│                    Production Observability Stack                            │
│                                                                             │
│  ┌──────────────────────────────────────────────────────────────────────┐  │
│  │                         METRICS LAYER                                │  │
│  │                                                                      │  │
│  │  App Servers ──► node_exporter:9100 ◄── Prometheus:9090             │  │
│  │  Kubernetes  ──► kube-state-metrics  ◄── (scrape every 15s)         │  │
│  │  Databases   ──► mysql_exporter:9104 ──► TSDB (30d retention)       │  │
│  │                                         │                            │  │
│  │  Alert Rules ──► Alertmanager:9093 ──► Slack / PagerDuty / Email    │  │
│  │                                         │                            │  │
│  │  Prometheus ──► Grafana:3000 ──────────► Dashboards / Alerts        │  │
│  └──────────────────────────────────────────────────────────────────────┘  │
│                                                                             │
│  ┌──────────────────────────────────────────────────────────────────────┐  │
│  │                           LOGS LAYER                                 │  │
│  │                                                                      │  │
│  │  /var/log/* ──► Filebeat:5066 ──► Logstash:5044 ──► ES:9200         │  │
│  │  journald   ──► (beats proto)     (grok/mutate)    (indices)         │  │
│  │                                                     │                │  │
│  │  Kibana:5601 ◄──────────────────────────────────── │                │  │
│  │  (Discover / Dashboards / Alerts / Saved Searches)                   │  │
│  └──────────────────────────────────────────────────────────────────────┘  │
│                                                                             │
│  ┌──────────────────────────────────────────────────────────────────────┐  │
│  │                       SLI/SLO LAYER                                  │  │
│  │                                                                      │  │
│  │  SLI Metrics (Prometheus) ──► SLO Windows (28/7d) ──► Error Budget  │  │
│  │  Burn Rate Alerts ──────────► Runbooks ──────────────► Incident Mgmt│  │
│  └──────────────────────────────────────────────────────────────────────┘  │
└─────────────────────────────────────────────────────────────────────────────┘

Step 1: Prometheus + node_exporter — Verify Binary Stack

📸 Verified Output:

💡 In production, run node_exporter as a systemd service with --collector.systemd --collector.processes for extended metrics. Start Prometheus with --storage.tsdb.retention.time=30d --web.enable-lifecycle to enable hot-reload of config via POST /-/reload.
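A minimal `prometheus.yml` matching this setup might look like the following sketch — the target hostnames are placeholders, and the 15s interval comes from the architecture diagram above:

```yaml
# prometheus.yml — minimal sketch; adjust targets for your environment
global:
  scrape_interval: 15s          # matches the "scrape every 15s" in the diagram
  evaluation_interval: 15s

scrape_configs:
  - job_name: "prometheus"
    static_configs:
      - targets: ["localhost:9090"]

  - job_name: "node"
    static_configs:
      - targets: ["localhost:9100"]   # node_exporter
```

With `--web.enable-lifecycle` set, config edits take effect without a restart: `curl -X POST http://localhost:9090/-/reload`.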


Step 2: Alert Rules for CPU, Disk, and Memory Thresholds

📸 Verified Output:

💡 Burn rate alerts are the Google SRE approach to SLO alerting. Instead of alerting at a fixed error rate threshold, they alert based on how fast you're consuming your error budget. A 36x burn rate means you'd exhaust a monthly 0.1% budget in about 20 hours (30 days ÷ 36) — page immediately.
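A rules file covering the three thresholds in this step might look like the sketch below. The metric names are standard node_exporter series; the thresholds (85% CPU, 10% free disk/memory) and `for:` durations are illustrative assumptions, not prescribed values:

```yaml
# alerts.yml — threshold sketch; tune thresholds and durations to your fleet
groups:
  - name: host-resources
    rules:
      - alert: HighCPUUsage
        expr: 100 - (avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 85
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "CPU above 85% on {{ $labels.instance }}"

      - alert: DiskAlmostFull
        expr: (node_filesystem_avail_bytes{fstype!~"tmpfs|overlay"} / node_filesystem_size_bytes) * 100 < 10
        for: 15m
        labels:
          severity: critical
        annotations:
          summary: "Under 10% disk space left on {{ $labels.instance }}"

      - alert: LowMemory
        expr: (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes) * 100 < 10
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "Under 10% memory available on {{ $labels.instance }}"
```

Validate with `promtool check rules alerts.yml` before reloading Prometheus.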


Step 3: Alertmanager Routing to Email and Slack

📸 Verified Output:

💡 The inhibit_rules section prevents alert storms. When NodeDown fires, all other alerts (CPU, memory, disk) for the same instance are suppressed — they're meaningless if the node is unreachable. The equal: [instance] ensures suppression is scoped to the specific down node, not all instances.
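An `alertmanager.yml` implementing this routing and inhibition could be sketched as follows — receiver names, the Slack channel, and credentials are placeholders, and the `matchers` list syntax assumes Alertmanager ≥ 0.22:

```yaml
# alertmanager.yml — routing sketch; replace keys/URLs with real integration values
route:
  receiver: default-email
  group_by: [alertname, instance]
  routes:
    - matchers: [severity="critical"]
      receiver: pagerduty
    - matchers: [severity="warning"]
      receiver: slack

receivers:
  - name: default-email
    email_configs:
      - to: "oncall@example.com"          # requires global smtp_* settings
  - name: pagerduty
    pagerduty_configs:
      - routing_key: "<pagerduty-integration-key>"
  - name: slack
    slack_configs:
      - api_url: "https://hooks.slack.com/services/..."
        channel: "#alerts"

# NodeDown suppresses per-host CPU/memory/disk alerts for the same instance
inhibit_rules:
  - source_matchers: [alertname="NodeDown"]
    target_matchers: [severity=~"warning|critical"]
    equal: [instance]
```

An alert matching both source and target matchers cannot inhibit itself, so NodeDown still fires even though it matches the target side.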


Step 4: Grafana Dashboard Provisioning

📸 Verified Output:

💡 For GitOps-driven Grafana, store all dashboard JSON files in a Git repo and mount them as a volume. The allowUiUpdates: false setting ensures dashboards can only be changed via Git — any UI edits are discarded on Grafana restart. This prevents configuration drift.
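A dashboard provider file for this GitOps flow might look like the sketch below; the provider name, folder, and mount path are assumptions:

```yaml
# provisioning/dashboards/default.yaml — Grafana dashboard provider sketch
apiVersion: 1
providers:
  - name: "git-dashboards"
    folder: "Production"
    type: file
    allowUiUpdates: false        # UI edits are discarded; Git is the source of truth
    options:
      path: /var/lib/grafana/dashboards   # mount your Git repo's JSON files here
```

Each `.json` file under `path` becomes a dashboard; commits to the repo plus a restart (or the provider's update interval) roll out changes.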


Step 5: Filebeat → Logstash → Elasticsearch Pipeline

📸 Verified Output:

💡 All Elastic Stack components (Elasticsearch, Logstash, Kibana, Filebeat) must be the same major.minor version. Mixing versions causes compatibility issues. The entire stack was at 8.19.12 as of 2025. Use the Elastic Compatibility Matrix at https://www.elastic.co/support/matrix to verify cross-component compatibility.
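A minimal `filebeat.yml` for the pipeline in this step might look like this sketch — the paths and the `env` field are placeholder assumptions:

```yaml
# filebeat.yml — shipper sketch; pair with a Logstash beats input on 5044
filebeat.inputs:
  - type: filestream            # 8.x replacement for the deprecated "log" input
    id: system-logs
    paths:
      - /var/log/*.log
    fields:
      env: production
    fields_under_root: true     # put custom fields at the event top level

output.logstash:
  hosts: ["localhost:5044"]
```

On the Logstash side, a matching `input { beats { port => 5044 } }` stanza receives these events before the grok/mutate filters shown in the architecture diagram.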


Step 6: Kibana Data Views and KQL Search

📸 Verified Output:

💡 KQL (Kibana Query Language) is simpler than Elasticsearch Query DSL for interactive exploration. Use it in Discover, dashboards, and alerting. For programmatic access (CI/CD, automation), use the Saved Objects API to import/export searches and dashboards as NDJSON files.
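A few representative KQL queries for the Discover search bar — the field names assume standard Filebeat/ECS mappings and are illustrative:

```
log.level: "error" and host.name: "web-01"
message: *timeout* and not http.response.status_code: 200
http.response.status_code >= 500
```

KQL supports field:value matching, `and`/`or`/`not`, wildcards, and numeric range operators; anything more complex (scripted queries, aggregation filters) still needs Query DSL.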


Step 7: SLI/SLO Definitions and Error Budget Calculation

Service                  SLO        Budget %   Budget min
API Availability         99.900%    0.1000%    40.3
API Latency p99 <500ms   95.000%    5.0000%    2016.0
Search Availability      99.950%    0.0500%    20.2
Checkout Success Rate    99.990%    0.0100%    4.0

28d error budget for 99.9% SLO: 40.3 minutes
36x burn rate exhausts budget in: 18.7 hours (28d ÷ 36)
6x burn rate exhausts budget in: 4.7 days (28d ÷ 6)
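The error-budget arithmetic can be checked with a quick shell sketch: the budget is the window times the allowed error fraction, and at burn rate B the full budget is consumed in window ÷ B.

```shell
# Error-budget arithmetic for a 99.9% SLO over a 28-day window
window_min=$((28 * 24 * 60))                        # 40320 minutes
budget_min=$(awk "BEGIN { printf \"%.1f\", $window_min * 0.001 }")
echo "error budget: ${budget_min} minutes"          # 0.1% of the window ≈ 40.3

# At burn rate B, the full budget is consumed in window / B
awk 'BEGIN { printf "36x burn rate exhausts budget in %.1f hours\n", 28*24/36 }'
awk 'BEGIN { printf " 6x burn rate exhausts budget in %.1f days\n",  28*24/(6*24) }'
```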

📸 Verified Output:

💡 A good runbook answers: "What is this alert? What does it mean for users? How do I fix it?" in under 15 minutes for an on-call engineer who has never seen this system before. Link runbooks in alert annotations.runbook_url so engineers land directly on the relevant page from Slack/PagerDuty.
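Wiring the runbook into an alert might look like the fragment below — the alert name, burn-rate expression, and runbook URL are placeholder assumptions:

```yaml
# Fragment of a rules file: fast-burn page with a runbook link
- alert: ErrorBudgetFastBurn
  expr: |
    (
      sum(rate(http_requests_total{status=~"5.."}[5m]))
      / sum(rate(http_requests_total[5m]))
    ) > (36 * 0.001)          # 36x the 0.1% budget rate
  for: 2m
  labels:
    severity: critical
  annotations:
    summary: "API is burning error budget at >36x — page now"
    runbook_url: "https://runbooks.example.com/api/fast-burn"
```

Alertmanager templates can surface `runbook_url` directly in the Slack/PagerDuty notification body, so the on-call engineer lands on the runbook in one click.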


Summary

Layer          | Component        | Version    | Key Config
Metrics        | Prometheus       | 2.45.0     | scrape_interval: 15s, retention: 30d
Metrics        | node_exporter    | 1.6.1      | --collector.systemd --collector.processes
Metrics        | Alertmanager     | 0.26.x     | Routes: critical→PagerDuty, warning→Slack
Visualization  | Grafana          | 10.1.2     | Provisioned datasources + dashboards via YAML/JSON
Logs shipper   | Filebeat         | 8.19.12    | output.logstash, fields_under_root: true
Log pipeline   | Logstash         | 8.19.12    | queue.type: persisted, dead_letter_queue
Log storage    | Elasticsearch    | 8.19.12    | ILM: hot→warm→cold→delete, strict mappings
Log UI         | Kibana           | 8.19.12    | Data Views, KQL, Saved Searches, Dashboards
SLI            | Availability     | PromQL     | sum(rate(http_requests_total{status!~'5..'}[5m])) / sum(rate(...[5m]))
SLO            | API Availability | 99.9%      | 28-day rolling window, 40.3 min error budget
Burn Rate      | Fast alert       | 36x        | Page: budget exhausted in ~18.7 h at this rate
Burn Rate      | Slow alert       | 6x         | Ticket: budget exhausted in ~4.7 d at this rate
Runbook        | Response time    | P0: 5 min  | Triage → Investigate → Remediate → Verify → Post-mortem