Lab 10: System Resource Monitoring

Time: 30 minutes | Level: Practitioner | Docker: docker run -it --rm ubuntu:22.04 bash


Overview

Understanding system resource usage is essential for performance tuning, capacity planning, and troubleshooting. In this lab you'll use uptime, free, ps, vmstat, top, sar, iostat, and the /proc filesystem to monitor CPU, memory, disk I/O, and load average in real time. You'll then combine them into a health-check script.
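
The stock ubuntu:22.04 image is minimal and may not include every tool used below. The following sketch checks which ones are missing. It assumes the usual Debian/Ubuntu packaging, where procps provides uptime, ps, top, free and vmstat, and sysstat provides sar and iostat. missing_tools is a hypothetical helper name.

```shell
#!/usr/bin/env bash
# missing_tools (hypothetical helper): print each argument that is not a command on PATH.
missing_tools() {
  local tool
  for tool in "$@"; do
    command -v "$tool" >/dev/null 2>&1 || printf '%s\n' "$tool"
  done
}

missing=$(missing_tools uptime ps top free vmstat sar iostat)
if [ -n "$missing" ]; then
  echo "Missing: $(printf '%s' "$missing" | tr '\n' ' ')"
  # Assumption: Debian/Ubuntu package names.
  echo "Install with: apt-get update && apt-get install -y procps sysstat"
fi
```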


Step 1: uptime and Load Average

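Load average is the quickest single health check. On Linux it is the average number of tasks that are either runnable (running or waiting for a CPU) or in uninterruptible sleep (usually waiting on disk I/O), measured over the last 1, 5, and 15 minutes.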

```shell
# Check system uptime and load
uptime

# Read /proc/loadavg directly
cat /proc/loadavg
```

📸 Verified Output:

```
 05:49:15 up 6 days,  7:19,  0 users,  load average: 3.23, 1.94, 1.59

3.23 1.94 1.59 3/813 9
```
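
/proc/loadavg has five fields. The first three are the 1-, 5- and 15-minute load averages. The fourth is the number of currently runnable scheduling entities over the total number that exist. The fifth is the PID most recently assigned. So 3/813 9 means 3 runnable out of 813, with 9 as the last PID. The small sketch below splits the fields with the shell's read builtin. parse_loadavg is a hypothetical helper name, and its optional argument lets it read a saved copy of the file.

```shell
#!/usr/bin/env bash
# parse_loadavg (hypothetical helper): split /proc/loadavg, or a file in the
# same format, into named fields.
parse_loadavg() {
  local load1 load5 load15 sched last_pid
  read -r load1 load5 load15 sched last_pid < "${1:-/proc/loadavg}"
  echo "1min=$load1 5min=$load5 15min=$load15"
  # sched looks like "runnable/total"
  echo "runnable=${sched%/*} total=${sched#*/} last_pid=$last_pid"
}

parse_loadavg
```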

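Understanding load average: the three numbers are the 1-, 5-, and 15-minute averages. Compare them with the number of CPU cores (nproc). A load equal to the core count means the CPUs are fully busy, and a higher load means tasks are queuing. Tasks blocked on disk I/O also count on Linux, so a high load with mostly idle CPUs usually points to storage rather than CPU.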

💡 Trend matters more than snapshots: compare the 1-minute and 15-minute load. If 1-min >> 15-min, you have a sudden spike (investigate now). If 1-min ≈ 15-min and both are high, the overload is sustained (you need more capacity). If 1-min << 15-min, load is decreasing (the situation is resolving).
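
To put the numbers in context, divide by the core count from nproc and compare the 1- and 15-minute values. This is a sketch under two assumptions. First, the factor of 2 used to call a trend a spike or a decrease is arbitrary; it stands in for the >> and << above. Second, load_trend is a hypothetical helper name.

```shell
#!/usr/bin/env bash
# load_trend (hypothetical helper): load per core and trend.
# Arguments: [loadavg file] [core count]
load_trend() {
  local file=${1:-/proc/loadavg} cores=${2:-$(nproc)}
  awk -v cores="$cores" '{
    printf "1-min load per core: %.2f\n", $1 / cores
    # Assumed factor of 2 stands in for ">>" and "<<".
    if ($1 > 2 * $3)      print "trend: spike (1-min >> 15-min)"
    else if (2 * $1 < $3) print "trend: decreasing (1-min << 15-min)"
    else                  print "trend: steady (1-min ~ 15-min)"
  }' "$file"
}

load_trend
```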


Step 2: Memory Monitoring with free

💡 "available" is the real metric: ignore the free column. Linux aggressively uses RAM for disk cache (buff/cache), which it releases on demand. The available column estimates how much RAM new processes can actually use without swapping. If available drops below 10% of total, you have a real memory problem.
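
This step's commands might look like the sketch below. free -h gives the overview. The available-memory percentage comes from MemAvailable and MemTotal in /proc/meminfo, both reported in kB. The 10% threshold follows the rule of thumb above, and mem_available_pct is a hypothetical helper name.

```shell
#!/usr/bin/env bash
# Human-readable totals; -m shows MiB, -t adds a Total row.
free -h

# mem_available_pct (hypothetical helper): % of RAM available for new processes.
# Optional argument: a file in /proc/meminfo format.
mem_available_pct() {
  awk '/^MemTotal:/ { total = $2 } /^MemAvailable:/ { avail = $2 }
       END { if (total > 0) printf "%.1f\n", avail * 100 / total }' "${1:-/proc/meminfo}"
}

pct=$(mem_available_pct)
if awk -v p="$pct" 'BEGIN { exit !(p + 0 < 10) }'; then
  echo "WARNING: only ${pct}% of RAM available"
else
  echo "OK: ${pct}% of RAM available"
fi
```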


Step 3: ps aux with Sorting (Finding Resource Hogs)

💡 RSS vs VSZ: RSS (Resident Set Size) = actual physical RAM in use right now. VSZ (Virtual Size) = all virtual memory, including mapped files and shared libraries. RSS is what you care about for memory pressure. A process with high VSZ but low RSS is fine.
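
The commands for this step, as a sketch, are ps aux sorted by CPU or memory and ps -eo for custom columns. RSS and VSZ are reported in KiB. The last part totals RSS per command name. Shared pages are counted once for every process that maps them, so these sums overstate real usage. sum_rss_by_name is a hypothetical helper name.

```shell
#!/usr/bin/env bash
# Header plus the 5 busiest processes by CPU, then by memory.
ps aux --sort=-%cpu | head -n 6
ps aux --sort=-%mem | head -n 6

# Custom columns, largest resident memory first.
ps -eo pid,user,%cpu,%mem,rss,vsz,comm --sort=-rss | head -n 6

# sum_rss_by_name (hypothetical helper): read "rss command" lines on stdin,
# sum RSS (KiB) per command name, print MiB, largest first.
sum_rss_by_name() {
  awk '{ rss = $1; $1 = ""; sub(/^ +/, ""); total[$0] += rss }
       END { for (name in total) printf "%.1f MiB %s\n", total[name] / 1024, name }' |
    sort -rn
}

# "rss=" and "comm=" suppress the header line.
ps -eo rss=,comm= | sum_rss_by_name | head -n 5
```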


Step 4: vmstat (Virtual Memory Statistics)

vmstat gives a system-wide snapshot of processes, memory, swap, I/O, and CPU.

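vmstat column guide:

- procs: r = runnable tasks (running or waiting for a CPU), b = tasks in uninterruptible sleep (usually waiting on I/O)
- memory (KiB by default): swpd = swap in use, free = idle RAM, buff and cache = buffer and page cache
- swap: si = memory swapped in from disk per second, so = memory swapped out to disk per second
- io: bi = blocks read from block devices per second, bo = blocks written per second
- system: in = interrupts per second, cs = context switches per second
- cpu (% of CPU time): us = user code, sy = kernel code, id = idle, wa = waiting for I/O, st = stolen by the hypervisor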

💡 Key warning signs in vmstat: r > CPU count = CPU bound. b > 0 persistently = I/O bottleneck. si/so > 0 = swapping (RAM pressure). wa > 20% = disk I/O bottleneck. st > 5% = noisy neighbor on VM host.
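
The sketch below covers this step's commands. vmstat 1 5 prints 5 samples one second apart. The first data line is an average since boot, so judge the later lines. vmstat_check is a hypothetical helper that applies the warning signs above to the last sample. It looks columns up by header name, so an extra column in newer vmstat releases does not shift anything. A single sample cannot show whether b > 0 is persistent, so for that case it only prints a note.

```shell
#!/usr/bin/env bash
vmstat 1 5

# vmstat_check (hypothetical helper): read vmstat output on stdin and check
# the last sample. Argument: CPU count (defaults to nproc).
vmstat_check() {
  awk -v cores="${1:-$(nproc)}" '
    NR == 2 { for (i = 1; i <= NF; i++) col[$i] = i; next }   # header row
    NR > 2  { split($0, v, " ") }                              # keep last sample
    END {
      r = v[col["r"]] + 0; b = v[col["b"]] + 0
      swap = v[col["si"]] + v[col["so"]]
      wa = v[col["wa"]] + 0; st = v[col["st"]] + 0
      if (r > cores + 0) print "WARNING: r=" r " exceeds " cores " CPUs (CPU bound)"
      if (b > 0)         print "NOTE: b=" b " blocked on I/O (check if persistent)"
      if (swap > 0)      print "WARNING: si+so=" swap " (swapping)"
      if (wa > 20)       print "WARNING: wa=" wa "% (disk I/O bottleneck)"
      if (st > 5)        print "WARNING: st=" st "% (steal from hypervisor)"
    }'
}

vmstat 1 2 | vmstat_check
```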


Step 5: top (Interactive Process Monitor)

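top is a live, updating process monitor. Run top for the interactive view, or top -bn1 for a single snapshot in batch mode.

top interactive key reference:

- P: sort by %CPU
- M: sort by %MEM
- T: sort by cumulative CPU time (TIME+)
- 1: toggle per-CPU lines in the summary area
- c: toggle the full command line
- k: kill a process (prompts for PID and signal)
- r: renice a process
- u: show only one user's processes
- q: quit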

💡 htop is better: Install htop (apt-get install htop) for a much more user-friendly alternative with color coding, mouse support, and easy tree view. In production environments where htop isn't available, top is your fallback.
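
Scripts and non-interactive SSH sessions need top in batch mode. The sketch below assumes procps-ng top, the version in Ubuntu, where -o overrides the sort field. The %Cpu(s) line of the first frame can be skewed because top has no earlier sample to compare against. The sketch therefore takes two frames one second apart and keeps the last. LC_ALL=C keeps the decimal separator a dot, and cpu_idle is a hypothetical helper name.

```shell
#!/usr/bin/env bash
# One batch-mode (-b) iteration (-n 1): summary area plus the busiest processes.
top -bn1 | head -n 15

# Same snapshot, sorted by memory instead of CPU.
top -bn1 -o %MEM | head -n 15

# cpu_idle (hypothetical helper): print the idle figure from each
# "%Cpu(s): ... NN.N id, ..." summary line on stdin.
cpu_idle() {
  awk -F',' '/^%Cpu/ {
    for (i = 1; i <= NF; i++)
      if ($i ~ / id$/) { gsub(/[^0-9.]/, "", $i); print $i }
  }'
}

# Two frames, 1 second apart; the last line reflects the most recent interval.
LC_ALL=C top -b -n 2 -d 1 | cpu_idle | tail -n 1
```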


Step 6: /proc/meminfo Deep Dive

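💡 Dirty pages: Dirty in /proc/meminfo shows data waiting to be written to disk. Many dirty pages on a slow disk means more unsaved data is lost if the machine crashes. The kernel starts background writeback once dirty data exceeds vm.dirty_background_ratio. It throttles writing processes when vm.dirty_ratio is reached, and it also flushes on a timer (vm.dirty_writeback_centisecs, vm.dirty_expire_centisecs). All four settings live under /proc/sys/vm/.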
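
A sketch for this step follows. meminfo_mib, a hypothetical helper, prints selected /proc/meminfo fields converted from kB to MiB; the kernel's "kB" actually means KiB. The last line prints the current writeback thresholds, which are percentages.

```shell
#!/usr/bin/env bash
# meminfo_mib (hypothetical helper): print selected fields in MiB, in file order.
# Usage: meminfo_mib FILE FIELD...
meminfo_mib() {
  local file=$1
  shift
  awk -v want="$*" '
    BEGIN { n = split(want, keys, " "); for (i = 1; i <= n; i++) sel[keys[i] ":"] = 1 }
    ($1 in sel) { printf "%-14s %10.1f MiB\n", $1, $2 / 1024 }' "$file"
}

meminfo_mib /proc/meminfo MemTotal MemAvailable Buffers Cached SwapTotal SwapFree Dirty Writeback

# Background writeback threshold, then the threshold at which writers are throttled.
grep -H . /proc/sys/vm/dirty_background_ratio /proc/sys/vm/dirty_ratio
```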


Step 7: sar and iostat (I/O and Historical Stats)

💡 Enabling historical sar: On Ubuntu/Debian, edit /etc/default/sysstat, set ENABLED="true", then restart the sysstat service. Data is then collected every 10 minutes and stored in /var/log/sysstat/, which is invaluable for "what happened last Tuesday at 3 AM" forensics.
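
The commands for this step are sketched below. iostat and sar come from the sysstat package, so the live commands are guarded with command -v. The commented history line assumes two things: collection has been enabled as described above, and the daily files follow the Debian/Ubuntu saDD naming (DD = day of month). disk_totals is a hypothetical helper that reads the raw /proc/diskstats counters, where sectors are always 512 bytes.

```shell
#!/usr/bin/env bash
if command -v iostat >/dev/null 2>&1 && command -v sar >/dev/null 2>&1; then
  iostat -x 1 2        # extended per-device stats; the first report covers time since boot
  sar -u 2 5           # CPU utilization: 5 samples, 2 seconds apart
  sar -r 1 3           # memory utilization
  sar -n DEV 1 3       # per-interface network traffic
  # History, once collection is enabled: today's file, limited to 03:00-04:00.
  # sar -u -f /var/log/sysstat/sa$(date +%d) -s 03:00:00 -e 04:00:00
fi

# disk_totals (hypothetical helper): cumulative MiB read and written per device
# since boot, skipping loop and ram devices.
# /proc/diskstats fields: 3 = device name, 6 = sectors read, 10 = sectors written.
disk_totals() {
  awk '$3 !~ /^(loop|ram)/ && ($6 + $10) > 0 {
    printf "%-10s read %10.1f MiB  written %10.1f MiB\n", $3, $6 * 512 / 1048576, $10 * 512 / 1048576
  }' "${1:-/proc/diskstats}"
}

disk_totals
```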


Step 8: Capstone (System Health Dashboard Script)

Scenario: Create a comprehensive system health check script that monitors CPU, memory, disk, and processes. It is useful for cron-based alerting or quick triage.
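
Here is a minimal sketch of such a script. Save it as /usr/local/bin/system_health.sh and make it executable with chmod +x. Every threshold below is an illustrative assumption. level, check_load, check_memory, check_disks and top_processes are hypothetical helper names. Each check prints a line starting with OK, WARNING or CRITICAL, so the output can be filtered with grep.

```shell
#!/usr/bin/env bash
# system_health.sh: minimal health check sketch; all thresholds are assumptions.
LOAD_WARN=1.0; LOAD_CRIT=2.0    # 1-minute load per CPU core
MEM_WARN=20;   MEM_CRIT=10      # % of RAM available (lower is worse)
DISK_WARN=80;  DISK_CRIT=90     # % of filesystem used

# level VALUE WARN CRIT [low] (hypothetical helper): print OK, WARNING or CRITICAL.
# With "low", smaller values are worse (used for available memory).
level() {
  awk -v v="$1" -v w="$2" -v c="$3" -v low="${4:-}" 'BEGIN {
    v += 0; w += 0; c += 0
    if (low != "") { v = -v; w = -w; c = -c }
    if (v >= c) print "CRITICAL"; else if (v >= w) print "WARNING"; else print "OK"
  }'
}

check_load() {
  local load1 rest cores per
  read -r load1 rest < /proc/loadavg
  cores=$(nproc)
  per=$(awk -v l="$load1" -v c="$cores" 'BEGIN { printf "%.2f", l / c }')
  echo "$(level "$per" "$LOAD_WARN" "$LOAD_CRIT") load: $load1 on $cores CPU(s), $per per core"
}

check_memory() {
  local pct
  pct=$(awk '/^MemTotal:/ { t = $2 } /^MemAvailable:/ { a = $2 }
             END { printf "%.0f", a * 100 / t }' /proc/meminfo)
  echo "$(level "$pct" "$MEM_WARN" "$MEM_CRIT" low) memory: ${pct}% available"
}

check_disks() {
  local used mount
  # -P gives one POSIX-format line per filesystem; skip RAM-backed ones.
  df -P -x tmpfs -x devtmpfs 2>/dev/null |
    awk 'NR > 1 { sub(/%/, "", $5); print $5, $6 }' |
    while read -r used mount; do
      echo "$(level "$used" "$DISK_WARN" "$DISK_CRIT") disk: $mount ${used}% used"
    done
}

top_processes() {
  echo "INFO top 5 processes by CPU:"
  ps -eo pid,%cpu,%mem,comm --sort=-%cpu | head -n 6
}

echo "=== System health: $(uname -n) at $(date '+%F %T') ==="
check_load
check_memory
check_disks
top_processes
```

Run it by hand first to see which lines print WARNING or CRITICAL on your system before wiring it into cron.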

💡 Schedule it with cron: for proactive alerting, add a crontab entry such as */15 * * * * out=$(/usr/local/bin/system_health.sh | grep -E 'WARNING|CRITICAL') && echo "$out" | mail -s "Alert: $(hostname)" [email protected]. grep exits non-zero when nothing matches, so mail only runs when a threshold is breached. There is no noise when everything is healthy.


Summary

| Tool | Purpose | Key Options |
| --- | --- | --- |
| uptime | Load average + uptime | (none) |
| /proc/loadavg | Raw load average data | cat /proc/loadavg |
| free -h | Memory & swap overview | -h human, -m MiB, -t totals |
| /proc/meminfo | Granular memory stats | awk to parse specific fields |
| ps aux --sort=-%cpu | Processes sorted by CPU | --sort=-%mem for memory |
| ps -eo pid,%cpu,rss,comm | Custom process columns | Mix any ps fields |
| vmstat 1 5 | System-wide I/O + CPU stats | 1 5 = 5 samples, 1 s apart |
| top -bn1 | One-shot top snapshot | Interactive keys: P=CPU, M=MEM, 1=per-CPU |
| iostat -x 1 2 | Disk I/O extended stats | Requires sysstat package |
| sar -u 2 5 | CPU stats (live; historical with -f) | -r memory, -n DEV network |
| /proc/diskstats | Raw disk I/O counters | cat /proc/diskstats |
| nproc | Number of CPU cores | Compare against load average |
| vmstat r column | CPU run queue length | r > CPUs = CPU bound |
| vmstat b column | Blocked on I/O | b > 0 persistently = I/O bottleneck |
| vmstat si/so | Swap in/out activity | Non-zero = memory pressure |
