Lab 11: Prometheus — Metrics & Alerting

Time: 45 minutes | Level: Architect | Docker: docker run -it --rm ubuntu:22.04 bash

Overview

Prometheus is a pull-based monitoring system with a time-series database (TSDB), PromQL query language, and a built-in alerting pipeline. In this lab you will install and configure Prometheus and node_exporter from official binaries, write PromQL queries, define recording rules and alert rules, and configure Alertmanager routing — all verified inside Docker.

Architecture

┌─────────────────────────────────────────────────────────────┐
│                   Prometheus Architecture                   │
│                                                             │
│  ┌───────────────┐   scrape    ┌─────────────────────────┐  │
│  │ node_exporter │◄────────────│    Prometheus Server    │  │
│  │ :9100/metrics │             │  ┌───────────────────┐  │  │
│  └───────────────┘             │  │   TSDB (chunks)   │  │  │
│                                │  │   /prometheus     │  │  │
│  ┌───────────────┐   scrape    │  └───────────────────┘  │  │
│  │ app_exporter  │◄────────────│  ┌───────────────────┐  │  │
│  │ :8080/metrics │             │  │   PromQL Engine   │  │  │
│  └───────────────┘             │  └───────────────────┘  │  │
│                                └────────────┬────────────┘  │
│  ┌──────────────────────────────────────────▼────────────┐  │
│  │                  Alertmanager :9093                   │  │
│  │     routes → email / slack / pagerduty / webhook      │  │
│  └───────────────────────────────────────────────────────┘  │
└─────────────────────────────────────────────────────────────┘

Step 1: Install Prometheus Binary

📸 Verified Output:

💡 The netgo and builtinassets tags mean Prometheus is statically compiled — no external library dependencies. The web UI assets are embedded in the binary.
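A sketch of the install flow this step describes — the version is the v2.45.0 pinned in the Summary table; the download commands are shown as comments because they need network access inside the lab container, and the install path is an assumption:

```shell
#!/bin/sh
# Sketch: fetch and install the official Prometheus release binary.
# Version pinned to v2.45.0 (per the Summary table); paths are illustrative.
set -eu
PROM_VERSION="2.45.0"
TARBALL="prometheus-${PROM_VERSION}.linux-amd64.tar.gz"
URL="https://github.com/prometheus/prometheus/releases/download/v${PROM_VERSION}/${TARBALL}"

# Inside the lab container (requires curl and network):
#   apt-get update && apt-get install -y curl
#   curl -fsSLO "$URL"
#   tar xzf "$TARBALL"
#   install "prometheus-${PROM_VERSION}.linux-amd64/prometheus" /usr/local/bin/
#   prometheus --version
echo "$URL"
```

Because the binary is statically compiled (the netgo/builtinassets tags from the tip), a single `install` of the extracted file is the whole installation.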


Step 2: Install node_exporter

📸 Verified Output:

💡 node_exporter exposes 40+ metric families by default: CPU, memory, disk I/O, filesystem, network, systemd units, and more. Run with --collector.systemd to include systemd service states.
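The node_exporter install follows the same pattern; a sketch using the v1.6.1 release from the Summary table (download commands commented for the same network reason, verification step assumed):

```shell
#!/bin/sh
# Sketch: fetch and install node_exporter v1.6.1 (per the Summary table).
set -eu
NE_VERSION="1.6.1"
TARBALL="node_exporter-${NE_VERSION}.linux-amd64.tar.gz"
URL="https://github.com/prometheus/node_exporter/releases/download/v${NE_VERSION}/${TARBALL}"

# Inside the lab container:
#   curl -fsSLO "$URL"
#   tar xzf "$TARBALL"
#   install "node_exporter-${NE_VERSION}.linux-amd64/node_exporter" /usr/local/bin/
#   node_exporter --collector.systemd &   # optional: include systemd unit states
#   curl -s localhost:9100/metrics | head  # verify the scrape endpoint
echo "$URL"
```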


Step 3: Write prometheus.yml Configuration

📸 Verified Output:

💡 Use file_sd_configs for dynamic environments (Kubernetes, auto-scaling groups). Files are JSON arrays of {targets: [...], labels: {...}}. Prometheus reloads them automatically on change without restart.
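A minimal configuration matching this lab's layout — a sketch only: the working directory, job names, and label values are assumptions, not taken from the lab text. It shows both a static scrape job and the file_sd_configs pattern from the tip, with a target file that is a JSON array of {targets, labels} objects:

```shell
#!/bin/sh
# Sketch: generate a minimal prometheus.yml plus a file_sd target file.
set -eu
mkdir -p /tmp/prom-lab/targets && cd /tmp/prom-lab

cat > prometheus.yml <<'EOF'
global:
  scrape_interval: 15s
  evaluation_interval: 15s

alerting:
  alertmanagers:
    - static_configs:
        - targets: ['localhost:9093']

rule_files:
  - 'rules/*.yml'

scrape_configs:
  - job_name: 'prometheus'
    static_configs:
      - targets: ['localhost:9090']
  - job_name: 'node'
    file_sd_configs:
      - files: ['targets/node-*.json']
EOF

# file_sd target file: JSON array of {targets, labels}; Prometheus
# re-reads it on change without a restart.
cat > targets/node-prod.json <<'EOF'
[
  {
    "targets": ["localhost:9100"],
    "labels": {"env": "prod", "team": "platform"}
  }
]
EOF
```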


Step 4: PromQL Query Reference

📸 Verified Output:

💡 rate() vs increase(): Use rate() for per-second calculations (CPU, requests/s). Use increase() for total count over a time window (e.g., errors in last hour). Both only work on counters.
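Illustrative queries for the rate()/increase() distinction — the CPU metric is a standard node_exporter family; the HTTP metrics are hypothetical application metrics, not exported by anything in this lab:

```promql
# Per-instance CPU usage %, non-idle, as a per-second rate over 5m:
100 - (avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)

# Total 5xx responses over the last hour (hypothetical app metric):
increase(http_requests_total{status=~"5.."}[1h])

# 95th-percentile latency from histogram _bucket series:
histogram_quantile(0.95, sum by (le) (rate(http_request_duration_seconds_bucket[5m])))
```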


Step 5: Define Recording Rules

📸 Verified Output:

💡 Recording rules follow the naming convention level:metric:operation. This makes it easy to identify the aggregation level (instance:), the metric (node_cpu_utilisation), and the operation (rate5m). Dashboards referencing pre-computed metrics load significantly faster.
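A recording-rule file following that naming convention — a sketch: the group name, interval, and file path are assumptions:

```yaml
# rules/recording.yml — pre-compute per-instance CPU utilisation
groups:
  - name: node_recording
    interval: 30s
    rules:
      # level:metric:operation  →  instance : node_cpu_utilisation : rate5m
      - record: instance:node_cpu_utilisation:rate5m
        expr: >
          1 - avg by (instance) (
            rate(node_cpu_seconds_total{mode="idle"}[5m])
          )
```

Dashboards then query the short pre-computed series instead of re-running the rate()/avg() aggregation on every panel refresh.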


Step 6: Define Alert Rules

📸 Verified Output:

💡 The for: field sets a pending duration — the condition must be true continuously for this period before firing. This prevents flapping on brief spikes. for: 0m fires immediately. Use {{$labels.instance}} in annotations to include the target name dynamically.
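An alert-rule file showing the expr/for/labels/annotations fields from the tip — the thresholds and alert names are illustrative choices, not prescribed by the lab (the expr reuses the instance:node_cpu_utilisation:rate5m recording rule named above):

```yaml
# rules/alerts.yml — illustrative thresholds
groups:
  - name: node_alerts
    rules:
      - alert: HighCPUUsage
        expr: instance:node_cpu_utilisation:rate5m > 0.9
        for: 10m                       # must hold for 10m before firing
        labels:
          severity: warning
        annotations:
          summary: "High CPU on {{ $labels.instance }}"

      - alert: NodeDown
        expr: up{job="node"} == 0
        for: 1m                        # short pending window for outages
        labels:
          severity: critical
        annotations:
          summary: "{{ $labels.instance }} is unreachable"
```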


Step 7: Configure Alertmanager

📸 Verified Output:

💡 inhibit_rules suppress child alerts when a parent fires. If NodeDown (critical) fires, it silences HighCPUUsage (warning) for the same instance — avoiding alert storms. continue: false (default) stops routing to sibling receivers once matched.
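A minimal alertmanager.yml wiring up the routing and inhibition described above — a sketch: receiver names and webhook URLs are placeholders:

```yaml
# alertmanager.yml — placeholder receivers
route:
  receiver: default
  group_by: ['alertname', 'instance']
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 4h
  routes:
    - matchers: ['severity="critical"']
      receiver: pager              # continue: false (default) stops here

receivers:
  - name: default
    webhook_configs:
      - url: 'http://localhost:5001/hook'
  - name: pager
    webhook_configs:
      - url: 'http://localhost:5002/page'

# Suppress warning alerts on an instance while a critical alert fires there,
# e.g. NodeDown silences HighCPUUsage for the same target.
inhibit_rules:
  - source_matchers: ['severity="critical"']
    target_matchers: ['severity="warning"']
    equal: ['instance']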


Step 8: Capstone — Production Monitoring Rollout

Scenario: You are deploying Prometheus monitoring for a 3-tier production application (load balancer, 4 app servers, 2 database servers). Requirements: 5-minute alert response, 30-day metric retention, HA setup.

📸 Verified Output:

💡 For HA Prometheus, run two identical Prometheus instances and configure both to point at the same Alertmanager cluster. Use Thanos or Cortex for long-term storage and global query federation. Enable --web.enable-lifecycle to allow config reload via POST /-/reload.
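One way to sketch the capstone inventory and launch flags — the hostnames are invented for the 3-tier topology (1 LB, 4 app, 2 DB), and the HA launch commands are shown as comments since the binaries run inside the lab container:

```shell
#!/bin/sh
# Sketch: file_sd inventory for the capstone topology (hostnames invented).
set -eu
mkdir -p /tmp/prom-capstone/targets && cd /tmp/prom-capstone

cat > targets/app.json <<'EOF'
[
  {"targets": ["lb-1:9100"],                "labels": {"tier": "lb"}},
  {"targets": ["app-1:9100", "app-2:9100",
               "app-3:9100", "app-4:9100"], "labels": {"tier": "app"}},
  {"targets": ["db-1:9100", "db-2:9100"],   "labels": {"tier": "db"}}
]
EOF

# HA: run two identical instances (same config, same rules) with:
#   prometheus \
#     --config.file=/etc/prometheus/prometheus.yml \
#     --storage.tsdb.retention.time=30d \
#     --web.enable-lifecycle
# Both point at the same Alertmanager cluster, which deduplicates alerts.
echo "wrote $(pwd)/targets/app.json"
```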


Summary

Concept               Key Details
────────────────────  ──────────────────────────────────────────────────────────
Scrape model          Pull-based; Prometheus scrapes /metrics endpoints every scrape_interval
TSDB                  Local time-series DB in chunks; default 2h blocks, compacted to 2h→2d→...
prometheus.yml        global, alerting, rule_files, scrape_configs sections
PromQL rate()         Per-second average of counter increase over a range vector
histogram_quantile    Calculates a percentile from histogram _bucket metrics
Recording rules       Pre-compute expensive queries; naming: level:metric:operation
Alert rules           expr, for (pending duration), labels, annotations fields
Alertmanager          Routes by label matchers; groups, inhibits, silences notifications
Retention             --storage.tsdb.retention.time=30d or --storage.tsdb.retention.size=50GB
Prometheus v2.45.0    Released 2023-06-23, Go 1.20.5, static binary 121MB
node_exporter v1.6.1  Released 2023-07-17, Go 1.20.6, static binary 19MB

Last updated