Lab 02: Performance Profiling with perf

Time: 40 minutes | Level: Advanced | Docker: docker run -it --rm --privileged ubuntu:22.04 bash


Overview

Performance profiling helps identify CPU bottlenecks, cache misses, and hot code paths. The Linux perf tool is the gold standard for CPU-level performance analysis. This lab covers perf installation, CPU statistics, /proc/cpuinfo analysis, and alternative profiling tools for environments where perf is unavailable.


Step 1: Install perf Tools

The perf tool must match your running kernel version:

apt-get update -qq && apt-get install -y linux-tools-generic linux-tools-common
perf --version 2>&1 || echo "perf not available for this kernel"

📸 Verified Output:

WARNING: perf not found for kernel 6.14.0-37

  You may need to install the following packages for this specific kernel:
    linux-tools-6.14.0-37-generic
    linux-cloud-tools-6.14.0-37-generic

  You may also want to install one of the following packages to keep up to date:
    linux-tools-generic
    linux-cloud-tools-generic
perf not available for this kernel

💡 In Docker containers, the kernel is the host kernel, not Ubuntu 22.04's. You need linux-tools-$(uname -r) matching the host, and the container's package repositories may not carry that version. On a native Ubuntu install running the stock generic kernel, linux-tools-generic pulls in the matching version.

On a native Ubuntu system where perf is available:
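A minimal sketch of this check, assuming Ubuntu/Debian package naming (linux-tools-<kernel release>). The helper name perf_usable is hypothetical, and the install command is only printed, not run:

```shell
# Sketch: check whether perf works for the running kernel and name the matching package.
# Assumption: Ubuntu/Debian naming, where the package is linux-tools-<kernel release>.
kernel_release="$(uname -r)"
perf_pkg="linux-tools-${kernel_release}"

# perf_usable is a hypothetical helper: true only if perf actually runs for this kernel.
perf_usable() {
  command -v perf >/dev/null 2>&1 && perf --version >/dev/null 2>&1
}

if perf_usable; then
  perf --version
else
  echo "perf not usable; on a native system try: apt-get install -y ${perf_pkg}"
fi
```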


Step 2: Inspect CPU Hardware with /proc/cpuinfo

Understanding your hardware is the foundation of performance analysis:
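A minimal sketch of such an inspection, assuming a Linux system with a readable /proc/cpuinfo. The flag list mirrors the flags discussed in this step, and the variable names are illustrative:

```shell
# Sketch: summarize the CPU from /proc/cpuinfo (Linux only).
cpuinfo=/proc/cpuinfo
if [ -r "$cpuinfo" ]; then
  # Model name of the first logical CPU (absent on some architectures).
  grep -m1 'model name' "$cpuinfo"
  # Number of logical CPUs: one "processor" entry each.
  logical_cpus="$(grep -c '^processor' "$cpuinfo")"
  echo "logical CPUs: ${logical_cpus}"
  # Report which of the feature flags of interest are present.
  for flag in sse4_2 avx avx2 aes vmx svm; do
    if grep -m1 '^flags' "$cpuinfo" | grep -qw "$flag"; then
      echo "flag present: $flag"
    fi
  done
else
  logical_cpus=0
  echo "no readable $cpuinfo on this system"
fi
```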

📸 Verified Output:

💡 Key flags to look for: sse4_2 (SIMD), avx/avx2 (wide vector ops), aes (hardware AES instructions), vmx (Intel) or svm (AMD) for hardware virtualization support.


Step 3: perf stat — Count Hardware Events

perf stat measures CPU performance counters for a command's execution:
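A minimal sketch of a perf stat invocation, assuming perf is usable on the machine. The workload "sleep 1" is a placeholder, and the event list is one reasonable selection rather than a required set:

```shell
# Sketch: count hardware events for one command; prints a message when perf is unusable.
events="cycles,instructions,cache-references,cache-misses,branches,branch-misses"
if command -v perf >/dev/null 2>&1 && perf --version >/dev/null 2>&1; then
  # "sleep 1" is a placeholder workload; replace it with the command under test.
  perf stat -e "$events" -- sleep 1 \
    || echo "perf stat failed (check kernel.perf_event_paranoid or container privileges)"
else
  echo "perf not usable here; skipping perf stat"
fi
```

In virtual machines some counters may be reported as "<not supported>" because the hypervisor does not expose them.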

Example output (native system):

Key metrics explained:

  • cycles: Total CPU clock cycles consumed

  • instructions: Machine instructions executed

  • IPC (insn per cycle): Higher = more efficient execution (ideal > 1.0)

  • cache-misses: LLC (Last Level Cache) misses — expensive memory accesses

  • branch-misses: Mispredicted branches causing pipeline flushes
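The derived metrics above are simple ratios of the raw counters. A small worked sketch follows; the four counts are invented for illustration and are not measurements:

```shell
# Sketch: derive IPC and cache miss rate from raw counts.
# The four numbers below are invented for illustration, not measured.
cycles=2000000000
instructions=3000000000
cache_refs=50000000
cache_misses=5000000

# IPC = instructions / cycles
ipc="$(LC_ALL=C awk -v i="$instructions" -v c="$cycles" 'BEGIN { printf "%.2f", i / c }')"
# Miss rate = cache-misses / cache-references, as a percentage
miss_pct="$(LC_ALL=C awk -v m="$cache_misses" -v r="$cache_refs" 'BEGIN { printf "%.1f", 100 * m / r }')"
echo "IPC: ${ipc}  cache miss rate: ${miss_pct}%"
```

With these sample counts the script reports an IPC of 1.50 and a miss rate of 10.0%.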


Step 4: Measure with /usr/bin/time -v (Available in Docker)

When perf isn't available, /usr/bin/time -v provides detailed resource usage:
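A minimal sketch, assuming GNU time is installed at /usr/bin/time (on Ubuntu it comes from the "time" package and may need installing first). The workload "ls -R /etc" is a placeholder:

```shell
# Sketch: report resource usage of a command with GNU time.
# Assumption: GNU time is at /usr/bin/time; this is not the shell's built-in "time".
report="$(mktemp)"
if /usr/bin/time -v true >/dev/null 2>&1; then
  # -v prints the verbose report; -o writes it to a file instead of stderr.
  /usr/bin/time -v -o "$report" ls -R /etc >/dev/null 2>&1
  grep -E 'Elapsed|Maximum resident|page faults|context switches' "$report"
else
  echo "GNU time not available; try: apt-get install -y time"
fi
```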

📸 Verified Output:

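💡 Maximum resident set size shows peak RAM usage. Minor page faults are resolved without disk I/O (the page was already in memory or only needed to be mapped or zero-filled). Major page faults require reading the page from disk, so they are far more expensive: minimize these!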


Step 5: Monitor System-Wide CPU with vmstat

vmstat provides real-time system performance statistics:
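A minimal sketch, assuming vmstat from the procps package. The interval and count match the "vmstat 1 3" form used in the summary of this lab:

```shell
# Sketch: take three samples, one second apart.
interval=1
count=3
if command -v vmstat >/dev/null 2>&1; then
  # The first data row reports averages since boot; later rows cover each interval.
  vmstat "$interval" "$count"
else
  echo "vmstat not found; try: apt-get install -y procps"
fi
```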

📸 Verified Output:

Column meanings:

Column   Description
r        Processes in run queue (>nproc = CPU-bound)
b        Processes blocked (I/O wait)
si/so    Swap in/out (non-zero = memory pressure)
us       User CPU %
sy       System/kernel CPU %
id       Idle CPU %
wa       Wait for I/O %
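The column rules above can be applied mechanically. In this sketch, sample_line is an invented vmstat data row and cpus is an assumed CPU count, both chosen only to illustrate the comparison:

```shell
# Sketch: flag CPU saturation by comparing the run queue (column r) with the CPU count.
# sample_line is an invented vmstat data row, not captured output.
sample_line=" 9  0      0 812345  10240 204800    0    0     5    12  300  600 85 10  5  0  0"
cpus=4   # assumed CPU count for the sample; use "$(nproc)" on a live system

# Column order: r b swpd free buff cache si so bi bo in cs us sy id wa st
verdict="$(echo "$sample_line" | awk -v n="$cpus" '{
  r = $1; si = $7; so = $8
  if (r > n) msg = "run queue " r " > " n " CPUs: CPU-bound"
  else       msg = "run queue " r " <= " n " CPUs: not saturated"
  if (si + so > 0) msg = msg ", swapping"
  print msg
}')"
echo "$verdict"
```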


Step 6: perf record and report — Sampling Profiler

perf record samples the running program at a fixed frequency (and, with -g, captures the call stack of each sample) to find hot functions:
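A minimal sketch, assuming perf is usable and the current directory is writable. The workload "sleep 1" is a placeholder, and 99 Hz follows the frequency used in the summary of this lab:

```shell
# Sketch: sample call stacks at 99 Hz, then print the hottest symbols.
sample_hz=99
if command -v perf >/dev/null 2>&1 && perf --version >/dev/null 2>&1; then
  # -g records call graphs, -F sets the sampling frequency, -o names the output file.
  perf record -g -F "$sample_hz" -o perf.data -- sleep 1 \
    && perf report -i perf.data --stdio | head -n 40
else
  echo "perf not usable here; skipping perf record"
fi
```

An odd frequency such as 99 Hz is commonly chosen so that sampling does not run in lockstep with periodic activity that fires at round-number rates.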

Example output (native system):

💡 The [.] prefix = user space, [k] = kernel space. High kernel time (sy in vmstat) often points to excessive syscalls.


Step 7: Flame Graphs Concept

Flame graphs visualize CPU profiling data as a stacked bar chart:

To generate a flame graph (native system):
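A minimal sketch using the FlameGraph scripts from Brendan Gregg's repository. It assumes a perf.data file already exists in the current directory and that git and perl are installed. The function name make_flamegraph and the output file name are hypothetical:

```shell
# Sketch: fold perf samples and render an SVG with the FlameGraph scripts.
# Assumptions: perf.data exists here; git and perl are installed.
flamegraph_dir="./FlameGraph"

# make_flamegraph is a hypothetical helper name.
make_flamegraph() {
  [ -d "$flamegraph_dir" ] \
    || git clone --depth 1 https://github.com/brendangregg/FlameGraph "$flamegraph_dir"
  # perf script dumps samples; stackcollapse-perf.pl folds each stack onto one line.
  perf script -i perf.data \
    | "$flamegraph_dir/stackcollapse-perf.pl" \
    | "$flamegraph_dir/flamegraph.pl" > flamegraph.svg
}

if command -v perf >/dev/null 2>&1 && [ -f perf.data ]; then
  make_flamegraph && echo "wrote flamegraph.svg"
else
  echo "need a working perf and a perf.data file first"
fi
```

The resulting SVG is interactive when opened in a browser: clicking a frame zooms into that subtree.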

Reading flame graphs:

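  • X-axis: Share of samples, not the passage of time (wider = more CPU time); left-to-right order is alphabetical

  • Y-axis: Call stack depth (bottom = entry point of the stack, top = the function that was on-CPU when sampled)

  • Hot spot: Wide plateaus along the top edge = functions to optimize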

💡 Flame graphs were invented by Brendan Gregg in 2011, before he joined Netflix. They remain one of the most effective ways to visualize where CPU time is spent across deep call stacks.


Step 8: Capstone — Profile a CPU-Bound Workload

Scenario: A Python script is running slowly. Profile it to find the bottleneck.
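A minimal sketch of the scenario, assuming python3 and GNU time are installed. The script slow.py and its contents are hypothetical stand-ins for the slow script:

```shell
# Sketch: create a deliberately CPU-bound Python script and measure it.
# slow.py is a hypothetical stand-in for the script under investigation.
workdir="$(mktemp -d)"
cat > "$workdir/slow.py" <<'EOF'
# Sum of squares in a pure-Python loop: CPU-bound, no I/O.
total = 0
for i in range(2000000):
    total += i * i
print(total)
EOF

if command -v python3 >/dev/null 2>&1 && /usr/bin/time -v true >/dev/null 2>&1; then
  /usr/bin/time -v -o "$workdir/usage.txt" python3 "$workdir/slow.py"
  grep -E 'User time|System time|Percent of CPU|Maximum resident' "$workdir/usage.txt"
else
  echo "python3 or GNU time missing; script left at $workdir/slow.py"
fi
```

A CPU-bound script like this one should show user time dominating system time and CPU usage close to 100% of one core.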

📸 Verified Output:

Fields: user nice system idle iowait irq softirq steal guest guest_nice
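These counters are cumulative clock ticks since boot, so utilization comes from the difference between two samples. A minimal sketch, assuming Linux with a readable /proc/stat; the helper name read_cpu is hypothetical:

```shell
# Sketch: compute CPU utilization from two /proc/stat samples taken 1 s apart.
# read_cpu is a hypothetical helper that prints "busy total" in clock ticks.
read_cpu() {
  LC_ALL=C awk '/^cpu / {
    total = 0
    for (i = 2; i <= 9; i++) total += $i            # user..steal; guest is already inside user
    printf "%.0f %.0f\n", total - ($5 + $6), total  # busy excludes idle and iowait
  }' /proc/stat
}

if [ -r /proc/stat ]; then
  first="$(read_cpu)"; sleep 1; second="$(read_cpu)"
  usage="$(echo "$first $second" | LC_ALL=C awk '{
    dt = $4 - $2
    if (dt > 0) printf "%.1f", 100 * ($3 - $1) / dt; else printf "0.0"
  }')"
  echo "CPU busy over the last second: ${usage}%"
else
  usage="n/a"
  echo "/proc/stat not readable on this system"
fi
```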


Summary

Tool                         Purpose
perf stat <cmd>              Count CPU events (cycles, instructions, cache misses)
perf record -g -F 99 <cmd>   Sample call stacks at 99Hz
perf report --stdio          View hot functions from recorded data
perf top                     Live view of hot functions (like top but per-function)
/usr/bin/time -v <cmd>       Detailed resource usage (memory, page faults, switches)
vmstat 1 3                   Real-time CPU/memory/IO overview
/proc/cpuinfo                CPU hardware details and feature flags
/proc/stat                   Raw CPU time counters
Flame graphs                 Visualize stack profiling as weighted call tree
