Lab 18: System Monitoring & Performance

Time: 30 minutes | Level: Practitioner | Docker: docker run -it --rm ubuntu:22.04 bash


Overview

Performance analysis is a critical sysadmin skill. In this lab you will use vmstat, iostat, sar, top, uptime, and /proc files to measure CPU, memory, disk I/O, and identify system bottlenecks with real data from a running system.

Prerequisites: Docker installed, Labs 01–15 completed.


Step 1: uptime & Load Averages

Load average is usually the first metric to check: it summarizes how much work is running or waiting to run.

docker run -it --rm ubuntu:22.04 bash
uptime
cat /proc/loadavg

📸 Verified Output:

 05:50:18 up 6 days,  7:20,  0 users,  load average: 3.51, 2.26, 1.72
3.51 2.26 1.72 3/789 1031

Interpreting load average:

Load average = average number of runnable + uninterruptible-sleep processes over 1/5/15 minutes.

| System | CPUs | Load 1.00 | Load 2.00 | Concern? |
|---|---|---|---|---|
| Single CPU | 1 | 100% busy | 200% (queued) | ✅ Yes |
| Quad core | 4 | 25% busy | 50% busy | ❌ Fine |
| 32 CPUs | 32 | 3% busy | 6% busy | ❌ Fine |

📸 Verified Output:

💡 The 15-minute load average tells the trend. If 1min > 15min, load is increasing. If 1min < 15min, load is decreasing. A 1min spike might be a cron job; sustained high 15min average means a real problem.
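
The two rules above (divide load by the CPU count, then compare the 1-minute and 15-minute values) can be sketched as a small helper. The function name load_report is hypothetical; it only assumes the five-field layout of /proc/loadavg shown in the output earlier in this step.

```shell
# load_report NCPU  (reads one /proc/loadavg line on stdin)
# Fields: 1min 5min 15min runnable/total last_pid
load_report() {
  awk -v n="$1" '{
    printf "1-min load per CPU: %.2f\n", $1 / n
    if      ($1 > $3) print "trend: rising (1 min above 15 min)"
    else if ($1 < $3) print "trend: falling (1 min below 15 min)"
    else              print "trend: flat"
    printf "runnable/total tasks: %s, last PID: %s\n", $4, $5
  }'
}

# Run against the live system when /proc and nproc are available
if [ -r /proc/loadavg ] && command -v nproc >/dev/null 2>&1; then
  load_report "$(nproc)" < /proc/loadavg
fi
```

A per-CPU value that stays above 1.00 means work is queuing. On the sample line from this step, a 4-CPU host would report 0.88 per CPU with a rising trend.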


Step 2: top — Real-time Process Monitoring

📸 Verified Output:

CPU line breakdown (%Cpu(s)):

| Field | Meaning | High = Problem? |
|---|---|---|
| us | User space CPU | Normal workload |
| sy | System/kernel CPU | Too many syscalls? |
| ni | Nice (low-priority) processes | Usually fine |
| id | Idle | Low idle = busy |
| wa | I/O wait | Disk bottleneck! |
| hi | Hardware interrupts | Network/device overload |
| si | Software interrupts | Usually network |
| st | Steal time | VM CPU being taken by hypervisor |

top interactive keys:

| Key | Action |
|---|---|
| P | Sort by CPU usage |
| M | Sort by memory usage |
| k | Kill a process (enter PID) |
| r | Renice a process |
| 1 | Show per-CPU stats |
| H | Show threads |
| q | Quit |

💡 %st (steal time) above 5% in a VM usually means the hypervisor host is oversubscribed: your VM is waiting for CPU time it was promised. Consider a larger instance type or a dedicated host.
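
For scripting, the same CPU line can be read from a batch-mode snapshot instead of the interactive screen. The sketch below is one way to pull a single field out of the %Cpu(s) line. The helper name cpu_field is hypothetical, and it assumes the C locale so that decimals use a dot.

```shell
# cpu_field LABEL  (reads top batch output on stdin)
# LABEL is one of: us sy ni id wa hi si st
cpu_field() {
  awk -v want="$1" '
    /^%Cpu/ {
      sub(/^%Cpu\(s\):/, "")   # drop the label; it can be fused to the first value
      gsub(/,/, " ")           # commas separate the value/label pairs
      for (i = 2; i <= NF; i++)
        if ($i == want) { print $(i - 1); exit }
    }'
}

# One snapshot (-b batch, -n1 single iteration), then extract iowait and steal
if command -v top >/dev/null 2>&1; then
  snapshot=$(LC_ALL=C top -bn1)
  printf 'wa=%s st=%s\n' \
    "$(printf '%s\n' "$snapshot" | cpu_field wa)" \
    "$(printf '%s\n' "$snapshot" | cpu_field st)"
fi
```

Taking the snapshot once and parsing it twice keeps both values from the same sample.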


Step 3: vmstat — Virtual Memory Statistics

vmstat gives a compact view of processes, memory, I/O, and CPU.

📸 Verified Output:

vmstat column guide:

| Section | Column | Meaning |
|---|---|---|
| procs | r | Runnable processes (running or waiting for CPU) |
| procs | b | Blocked (uninterruptible sleep, usually waiting for I/O) |
| memory | swpd | Swap space in use (KiB) |
| memory | free | Idle memory (KiB) |
| memory | buff | Memory used as buffers (KiB) |
| memory | cache | Memory used as cache (KiB) |
| swap | si | Swapped in from disk (KiB/s) |
| swap | so | Swapped out to disk (KiB/s) |
| io | bi | Blocks read from block devices (blocks/s) |
| io | bo | Blocks written to block devices (blocks/s) |
| system | in | Interrupts per second |
| system | cs | Context switches per second |
| cpu | us/sy/id/wa/st | CPU percentages |

Red flags in vmstat:

  • r consistently > number of CPUs → CPU bottleneck

  • b > 0 regularly → I/O bottleneck

  • si/so > 0 → Swapping (memory pressure!)

  • wa > 20% → Disk I/O wait

💡 The first vmstat line shows averages since boot. Start reading from the second line for current activity. Use vmstat -s for a full memory summary, and vmstat -d for disk statistics.
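
The red-flag rules above can be applied mechanically to vmstat output. The following is a minimal sketch, not a tuned alerting tool. The helper name vmstat_flags is hypothetical, column positions are taken from the header row rather than hard-coded, and every sample is checked individually, so judging whether a flag is "consistent" is still left to the reader.

```shell
# vmstat_flags NCPU  (reads vmstat output on stdin)
vmstat_flags() {
  awk -v ncpu="$1" '
    $1 == "procs" { next }                                         # banner row
    $1 == "r"     { for (i = 1; i <= NF; i++) col[$i] = i; next }  # header row: map names to positions
    !("r" in col) { next }
    ++seen == 1   { next }                                         # first sample is the since-boot average
    {
      r = $(col["r"]); b = $(col["b"]); wa = $(col["wa"])
      si = $(col["si"]); so = $(col["so"])
      if (r > ncpu)         printf "CPU: r=%s exceeds %s CPUs\n", r, ncpu
      if (b > 0)            printf "I/O: b=%s blocked\n", b
      if (si > 0 || so > 0) printf "swap: si=%s so=%s\n", si, so
      if (wa > 20)          printf "iowait: wa=%s%%\n", wa
    }'
}

# Five one-second samples, checked against the CPU count
if command -v vmstat >/dev/null 2>&1 && command -v nproc >/dev/null 2>&1; then
  vmstat 1 5 | vmstat_flags "$(nproc)"
fi
```

No output means none of the four thresholds was crossed in any sample after the first.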


Step 4: free — Memory Analysis

📸 Verified Output:

Memory concepts:

| Metric | Meaning | Action if High |
|---|---|---|
| used | Allocated by processes | Normal; monitor trend |
| buff/cache | Kernel disk cache | Normal; kernel reclaims when needed |
| available | What apps can actually use | Low available → add RAM |
| Swap used | Overflow to disk | Investigate memory leak! |

📸 Verified Output:

💡 MemAvailable is more accurate than MemFree for determining actual free memory. MemFree excludes cache, but the kernel will reclaim cache when needed. MemAvailable accounts for this and shows what's truly available for new processes.
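
To see the gap between the two numbers on a live system, both can be read straight from /proc/meminfo. This is a small sketch under the assumption that the kernel exposes MemAvailable (Linux 3.14 and later). The helper name mem_summary is hypothetical.

```shell
# mem_summary  (reads /proc/meminfo-style text on stdin; values are in kB)
mem_summary() {
  awk '
    $1 == "MemTotal:"     { total = $2 }
    $1 == "MemFree:"      { free  = $2 }
    $1 == "MemAvailable:" { avail = $2 }
    END {
      if (total > 0)
        printf "free=%d MiB available=%d MiB (%.0f%% of total)\n",
               free / 1024, avail / 1024, 100 * avail / total
    }'
}

if [ -r /proc/meminfo ]; then
  mem_summary < /proc/meminfo
fi
```

On a system with a warm page cache, available is typically much larger than free, which is the point of the tip above.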


Step 5: iostat — Disk I/O Analysis

📸 Verified Output:

Key iostat columns:

| Column | Meaning | Concern if... |
|---|---|---|
| r/s / w/s | Reads/writes per second | Very high = busy disk |
| rkB/s / wkB/s | Throughput (KB/s) | Near disk max = saturated |
| r_await / w_await | Average I/O latency (ms) | > 20 ms for HDD, > 1 ms for SSD |
| %util | Disk utilization | > 80% = potential bottleneck |
| %iowait (CPU) | CPU waiting for I/O | > 20% = disk bottleneck |

💡 %util near 100% means the disk is saturated. For SSDs, %util can be misleading since they handle parallel I/O — look at await latency instead. High r_await with low %util can indicate a slow SAN or NFS mount.
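
The thresholds from the table can be checked per device with a short filter over iostat -x output. This is a sketch, with the hypothetical helper name iostat_flags. It looks columns up by header name because the column set differs between sysstat versions, and its defaults (80% util, 20 ms await) are the HDD figures from the table, so pass a lower await limit for SSDs.

```shell
# iostat_flags [UTIL_MAX] [AWAIT_MAX_MS]  (reads iostat -x output on stdin)
iostat_flags() {
  awk -v util_max="${1:-80}" -v await_max="${2:-20}" '
    NF == 0           { split("", col); next }                        # blank line ends a report section
    $1 ~ /^Device/    { for (i = 1; i <= NF; i++) col[$i] = i; next }  # header row: map names to positions
    !("%util" in col) { next }                                         # not inside a device table
    {
      util = $(col["%util"])
      if (util > util_max) printf "%s: %%util=%s\n", $1, util
      if ("r_await" in col) { ra = $(col["r_await"]); if (ra > await_max) printf "%s: r_await=%s ms\n", $1, ra }
      if ("w_await" in col) { wa = $(col["w_await"]); if (wa > await_max) printf "%s: w_await=%s ms\n", $1, wa }
    }'
}

# -y skips the since-boot report; requires the sysstat package
if command -v iostat >/dev/null 2>&1; then
  iostat -x -y 1 3 | iostat_flags 80 20
fi
```

For an SSD, something like iostat_flags 80 1 matches the 1 ms guideline above.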


Step 6: sar — System Activity Reporter

sar reports system activity, either live or from saved history. History is only available if the sadc collector is scheduled to run, for example by the sysstat service.

📸 Verified Output:

sar command flags:

| Command | Reports |
|---|---|
| sar -u 1 5 | CPU utilization (5 samples, 1s interval) |
| sar -r 1 5 | Memory utilization |
| sar -d 1 5 | Disk I/O activity |
| sar -n DEV 1 5 | Network interface statistics |
| sar -q 1 5 | Load average and queue |
| sar -b 1 5 | I/O and transfer rate |
| sar -f /var/log/sysstat/sa15 | Read saved data (15th of month); the directory is /var/log/sa/ on RHEL-family systems |

📸 Verified Output:

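💡 Enable data collection for history. On Ubuntu: systemctl enable --now sysstat (you may also need to set ENABLED="true" in /etc/default/sysstat). The sadc collector then runs every 10 minutes and stores data in /var/log/sysstat/ (/var/log/sa/ on RHEL-family systems). Once samples have been collected, run sar -u without arguments to see today's history, or sar -u -f /var/log/sysstat/sa$(date +%d).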
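
Whether the data is live or read from a saved file, every sar -u report ends with an Average: row, which makes a one-line summary easy to script. The sketch below uses a hypothetical helper, sar_busy, and assumes %idle is the last column of that row, as it is in sar -u output.

```shell
# sar_busy  (reads sar -u output on stdin)
sar_busy() {
  # The Average row closes the report; busy time is 100 minus %idle (last column)
  awk '$1 == "Average:" && $2 == "all" {
    printf "average CPU busy: %.1f%%\n", 100 - $NF
  }'
}

# Five one-second live samples; requires the sysstat package
if command -v sar >/dev/null 2>&1; then
  LC_ALL=C sar -u 1 5 | sar_busy
fi
```

The same filter works on history by piping sar -u -f with a saved data file into sar_busy.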


Step 7: Bottleneck Identification Methodology

📸 Verified Output:

💡 Performance tuning order: Always check in this sequence: CPU → Memory → Disk I/O → Network. A disk bottleneck often masquerades as a CPU problem: processes stuck in uninterruptible sleep push the load average up while the CPUs sit idle waiting for I/O. When load looks high, check %iowait before blaming the CPU.
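
The checking order can be written down as a small decision function. This is an illustrative sketch only: the name triage is hypothetical, the 20% iowait and load-above-CPU-count thresholds come from earlier steps, the 10% available-memory threshold is an assumption, and network checks are left out.

```shell
# triage LOAD1 NCPU IOWAIT_PCT MEM_AVAILABLE_PCT
# Prints the first suspected bottleneck: disk, cpu, memory, or none
triage() {
  awk -v load="$1" -v ncpu="$2" -v wa="$3" -v avail="$4" 'BEGIN {
    if (load > ncpu && wa > 20) print "disk"    # high load, but CPUs are idle waiting on I/O
    else if (load > ncpu)       print "cpu"
    else if (avail < 10)        print "memory"  # 10% is an assumed threshold
    else if (wa > 20)           print "disk"
    else                        print "none"
  }'
}

triage 6.0 4 35 50   # prints: disk
triage 6.0 4 2 50    # prints: cpu
```

The first call shows the masquerade case: the load is above the CPU count, but the high iowait points at the disk rather than the CPU.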


Step 8: Capstone — Comprehensive Performance Dashboard

Scenario: Your manager asks for a 1-page performance snapshot to baseline a new server.

📸 Verified Output:

💡 Schedule this dashboard with cron for shift handover reports. Run every 8 hours and email the output: 0 */8 * * * /usr/local/bin/perf-dashboard.sh | mail -s "Server Status $(hostname)" [email protected]. Over time, you build a performance baseline that makes anomalies obvious.
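
A minimal sketch of what a script such as /usr/local/bin/perf-dashboard.sh might contain is shown below. The section names and layout are illustrative assumptions, not the lab's original script. It relies on /proc and ps, and adds disk statistics only when iostat (sysstat package) is installed.

```shell
#!/bin/sh
# Hypothetical perf-dashboard.sh: one-page performance snapshot

section() { printf '\n== %s ==\n' "$1"; }

dashboard() {
  printf 'Performance snapshot: %s  %s\n' "$(uname -n)" "$(date '+%Y-%m-%d %H:%M:%S')"

  section "Load"
  if [ -r /proc/loadavg ]; then cat /proc/loadavg; fi
  if command -v nproc >/dev/null 2>&1; then printf 'CPUs: %s\n' "$(nproc)"; fi

  section "Memory"
  if [ -r /proc/meminfo ]; then
    grep -E '^(MemTotal|MemAvailable|SwapTotal|SwapFree):' /proc/meminfo
  fi

  section "Top CPU consumers"
  # Header line plus the five busiest processes
  ps aux --sort=-%cpu 2>/dev/null | head -n 6

  # Disk statistics are optional: skip quietly when sysstat is absent
  if command -v iostat >/dev/null 2>&1; then
    section "Disk I/O"
    iostat -x 1 2
  fi
  return 0
}

dashboard
```

Saving each run to a dated file, in addition to mailing it, makes it easier to compare snapshots later.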


Summary

| Tool | Purpose | Key Flags |
|---|---|---|
| uptime | Load average overview | |
| /proc/loadavg | Raw load average data | |
| top -bn1 | Process list snapshot | -b batch, -n iterations |
| free -h | Memory overview | -h human-readable |
| /proc/meminfo | Detailed memory stats | |
| vmstat 1 5 | System-wide stats | 1 interval, 5 count |
| iostat -x 1 3 | Disk I/O detail | -x extended stats |
| sar -u 1 5 | CPU utilization (live samples; history with -f) | -r memory, -d disk |
| ps aux --sort=-%cpu | Process CPU ranking | --sort=-%mem for memory |
| nproc | CPU count | |
