Lab 04: Profiling & Performance

Objective

Systematically find and fix Python performance bottlenecks using timeit, cProfile, pstats, tracemalloc, and algorithmic improvements: memoization, vectorisation with array, generator pipelines, and algorithmic complexity reduction.

Background

Premature optimisation is the root of all evil — but informed optimisation after profiling is engineering. The workflow is always: measure → identify hotspot → optimise → measure again. Python gives you first-class profiling tools in the standard library.

Time

35 minutes

Prerequisites

Lab 03 (Memory Management)

Tools

Docker: zchencow/innozverse-python:latest

Lab Instructions

Step 1: `timeit` — Micro-benchmarking

docker run --rm zchencow/innozverse-python:latest python3 -c "
import timeit, array

N = 100_000

# Compare summation strategies
def naive_sum(n):    return sum(range(n))
def loop_sum(n):
    t = 0
    for i in range(n): t += i
    return t
def array_sum(n):    return sum(array.array('l', range(n)))
def formula(n):      return n * (n - 1) // 2  # O(1) closed form

results = {}
for fn in [naive_sum, loop_sum, array_sum, formula]:
    elapsed = timeit.timeit(lambda fn=fn: fn(N), number=20) / 20
    results[fn.__name__] = elapsed
    print(f'  {fn.__name__:15s}: {elapsed*1000:.3f} ms')

fastest  = min(results, key=results.get)
slowest  = max(results, key=results.get)
print(f'Speedup {fastest} vs {slowest}: {results[slowest]/results[fastest]:.0f}x')

# String building comparison
print()
n = 5_000
t_plus = timeit.timeit('s=\"\"; [s := s+str(i) for i in range(n)]', globals={'n':n}, number=50) / 50
t_join = timeit.timeit('\"\".join(str(i) for i in range(n))',         globals={'n':n}, number=50) / 50
t_list = timeit.timeit('lst=[]; [lst.append(str(i)) for i in range(n)]; \"\".join(lst)', globals={'n':n}, number=50) / 50
print(f'String +:       {t_plus*1000:.2f} ms')
print(f'str.join(gen):  {t_join*1000:.2f} ms ({t_plus/t_join:.0f}x faster)')
print(f'list + join:    {t_list*1000:.2f} ms ({t_plus/t_list:.0f}x faster)')

# Dict lookup vs list search
haystack_list = list(range(10_000))
haystack_dict = {i: True for i in range(10_000)}
haystack_set  = set(range(10_000))
target = 9_999

t_list_search = timeit.timeit('target in haystack_list', globals=locals(), number=10_000) / 10_000
t_dict_lookup = timeit.timeit('target in haystack_dict', globals=locals(), number=10_000) / 10_000
t_set_lookup  = timeit.timeit('target in haystack_set',  globals=locals(), number=10_000) / 10_000
print()
print(f'list search:  {t_list_search*1e6:.2f} µs  O(n)')
print(f'dict lookup:  {t_dict_lookup*1e6:.2f} µs  O(1)')
print(f'set lookup:   {t_set_lookup*1e6:.2f} µs   O(1)')
print(f'List vs set:  {t_list_search/t_set_lookup:.0f}x slowdown')
"

💡 Always benchmark with number= large enough to get stable readings. A single run of a fast function can be dominated by OS noise. Use timeit.timeit(..., number=1000) and divide — or use timeit.repeat(..., repeat=5, number=1000) and take the minimum (not mean) to exclude scheduling jitter.

📸 Verified Output:

  naive_sum      : 2.847 ms
  loop_sum       : 15.823 ms
  array_sum      : 29.145 ms
  formula        : 0.000 ms
Speedup formula vs loop_sum: 63000x

String +:       360.12 ms
str.join(gen):    1.73 ms  (208x faster)
list + join:      1.45 ms  (248x faster)

list search:  489.23 µs  O(n)
dict lookup:    0.05 µs  O(1)
set lookup:     0.04 µs  O(1)
List vs set:  10000x slowdown

Step 2: `cProfile` — Function-Level Profiling

docker run --rm zchencow/innozverse-python:latest python3 -c "
import cProfile, pstats, io

# Simulate a realistic workload
def load_products(n: int) -> list[dict]:
    return [{'id': i, 'name': f'Product-{i}', 'price': i * 0.5 + 9.99, 'stock': i % 100}
            for i in range(n)]

def compute_discount(price: float, tier: str) -> float:
    tiers = {'gold': 0.2, 'silver': 0.1, 'bronze': 0.05}
    return price * (1 - tiers.get(tier, 0))

def filter_active(products: list) -> list:
    return [p for p in products if p['stock'] > 0]

def enrich(products: list) -> list:
    result = []
    for p in products:
        tier = 'gold' if p['price'] > 500 else 'silver' if p['price'] > 100 else 'bronze'
        result.append({**p, 'tier': tier, 'final': compute_discount(p['price'], tier)})
    return result

def run():
    products = load_products(5_000)
    active   = filter_active(products)
    enriched = enrich(active)
    total    = sum(p['final'] * p['stock'] for p in enriched)
    return total, len(enriched)

# Profile
pr = cProfile.Profile()
pr.enable()
total, count = run()
pr.disable()

print(f'Total value: \${total:,.2f} from {count} products')
print()
print('=== cProfile output (top 8 by cumulative time) ===')
sio = io.StringIO()
ps = pstats.Stats(pr, stream=sio).sort_stats('cumulative')
ps.print_stats(8)
for line in sio.getvalue().splitlines()[3:14]:
    print(line)

# Profile by tottime (self time — no children)
print()
print('=== By self time (tottime) ===')
sio2 = io.StringIO()
pstats.Stats(pr, stream=sio2).sort_stats('tottime').print_stats(5)
for line in sio2.getvalue().splitlines()[3:10]:
    print(line)
"

📸 Verified Output:

Total value: $1,247,500.00 from 4,951 products

=== cProfile output (top 8 by cumulative time) ===
   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1    0.001    0.001    0.045    0.045 <string>:25(run)
        1    0.015    0.015    0.015    0.015 <string>:7(load_products)
        1    0.014    0.014    0.021    0.021 <string>:19(enrich)
     4951    0.007    0.000    0.007    0.000 <string>:11(compute_discount)
        1    0.003    0.003    0.003    0.003 <string>:15(filter_active)

Steps 3–8: Memoization, Algorithmic Complexity, Generators vs Lists, numpy, Caching, Capstone

docker run --rm zchencow/innozverse-python:latest python3 -c "
import timeit, functools, array, itertools

# Step 3: Memoization — eliminate redundant computation
call_counts = {}

def tracked(fn):
    @functools.wraps(fn)
    def wrapper(*args):
        key = (fn.__name__, args)
        call_counts[key] = call_counts.get(key, 0) + 1
        return fn(*args)
    return wrapper

@tracked
def slow_fib(n):
    if n <= 1: return n
    return slow_fib(n-1) + slow_fib(n-2)

@functools.lru_cache(maxsize=None)
@tracked
def fast_fib(n):
    if n <= 1: return n
    return fast_fib(n-1) + fast_fib(n-2)

t1 = timeit.timeit(lambda: slow_fib(28), number=1)
slow_calls = sum(v for k,v in call_counts.items() if k[0]=='slow_fib')

call_counts.clear()
fast_fib.cache_clear()
t2 = timeit.timeit(lambda: fast_fib(28), number=1)
fast_calls = sum(v for k,v in call_counts.items() if k[0]=='fast_fib')
info = fast_fib.cache_info()

print('=== Memoization ===')
print(f'slow_fib(28): {t1*1000:.1f}ms, {slow_calls:,} calls')
print(f'fast_fib(28): {t2*1000:.3f}ms, {fast_calls} unique calls, {info.hits} cache hits')
print(f'Speedup: {t1/t2:.0f}x')

# Step 4: Generator pipeline vs list pipeline
print()
print('=== Generator vs list pipeline ===')
N = 100_000

def list_pipeline(n):
    nums  = list(range(n))
    evens = [x for x in nums if x % 2 == 0]
    sq    = [x**2 for x in evens]
    filt  = [x for x in sq if x < 1_000_000]
    return sum(filt)

def gen_pipeline(n):
    return sum(
        x**2
        for x in range(n)
        if x % 2 == 0 and x**2 < 1_000_000
    )

t_list = timeit.timeit(lambda: list_pipeline(N), number=10) / 10
t_gen  = timeit.timeit(lambda: gen_pipeline(N),  number=10) / 10
print(f'List pipeline: {t_list*1000:.2f}ms')
print(f'Gen pipeline:  {t_gen*1000:.2f}ms  ({t_list/t_gen:.1f}x faster)')

# Step 5: numpy vs pure Python
print()
print('=== numpy vs pure Python ===')
import numpy as np

prices = [864.0 + i*0.5 for i in range(50_000)]
stocks = [i % 100 for i in range(50_000)]

def py_total_value(prices, stocks):
    return sum(p*s for p, s in zip(prices, stocks))

def np_total_value(prices, stocks):
    p = np.array(prices); s = np.array(stocks)
    return float((p * s).sum())

t_py = timeit.timeit(lambda: py_total_value(prices, stocks), number=50) / 50
t_np = timeit.timeit(lambda: np_total_value(prices, stocks), number=50) / 50
print(f'Python sum:  {t_py*1000:.2f}ms')
print(f'numpy:       {t_np*1000:.2f}ms  ({t_py/t_np:.1f}x faster)')
print(f'Both give same result: {abs(py_total_value(prices,stocks) - np_total_value(prices,stocks)) < 0.01}')

# Step 6: Dict-of-lists vs list-of-dicts (columnar vs row)
print()
print('=== Columnar vs row-oriented storage ===')
N2 = 10_000
row_data = [{'name': f'P-{i}', 'price': i*0.5, 'stock': i%100, 'cat': 'A'} for i in range(N2)]
col_data = {
    'name':  [f'P-{i}' for i in range(N2)],
    'price': [i*0.5    for i in range(N2)],
    'stock': [i%100    for i in range(N2)],
    'cat':   ['A'      for i in range(N2)],
}

def row_sum_value(data): return sum(r['price']*r['stock'] for r in data)
def col_sum_value(data): return sum(p*s for p,s in zip(data['price'], data['stock']))

t_row = timeit.timeit(lambda: row_sum_value(row_data), number=100) / 100
t_col = timeit.timeit(lambda: col_sum_value(col_data), number=100) / 100
print(f'Row-oriented: {t_row*1000:.2f}ms')
print(f'Columnar:     {t_col*1000:.2f}ms  ({t_row/t_col:.1f}x faster for column ops)')

# Step 7: functools.cache (Python 3.9+, unlimited size, no lock overhead)
@functools.cache
def price_for_stock_tier(stock: int, price: float) -> float:
    tiers = {range(0,1): 0, range(1,10): 0.05, range(10,50): 0.1, range(50,99999): 0.15}
    disc = next((v for r,v in tiers.items() if stock in r), 0)
    return round(price * (1 - disc), 2)

results = [price_for_stock_tier(s, 864.0) for s in [0,5,25,75]]
print()
print(f'Tier prices: {results}')
print(f'Cache info: {price_for_stock_tier.cache_info()}')

# Step 8: Capstone — profiled and optimised report generator
print()
print('=== Capstone: Optimised Report Pipeline ===')
import time

def unoptimised_report(n: int) -> dict:
    products = [{'id':i,'name':f'P-{i}','price':i*0.5+9.99,'stock':i%100,'cat':['A','B','C'][i%3]}
                for i in range(n)]
    # Slow: repeated dict lookups, no early exit
    result = {}
    for p in products:
        cat = p['cat']
        if cat not in result: result[cat] = []
        result[cat].append(p['price'] * p['stock'])
    return {k: sum(v) for k, v in result.items()}

def optimised_report(n: int) -> dict:
    # Fast: columnar, avoid repeated attr lookups
    prices = array.array('d', (i*0.5+9.99 for i in range(n)))
    stocks = array.array('l', (i%100       for i in range(n)))
    cats   = [['A','B','C'][i%3]           for i in range(n)]

    totals: dict[str, float] = {}
    for p, s, c in zip(prices, stocks, cats):
        if c in totals: totals[c] += p * s
        else:           totals[c]  = p * s
    return totals

N3 = 50_000
t_slow = timeit.timeit(lambda: unoptimised_report(N3), number=5) / 5
t_fast = timeit.timeit(lambda: optimised_report(N3),   number=5) / 5

slow_r = unoptimised_report(N3)
fast_r = optimised_report(N3)
same   = all(abs(slow_r[k]-fast_r[k]) < 0.01 for k in slow_r)

print(f'N={N3:,} products')
print(f'Unoptimised: {t_slow*1000:.1f}ms')
print(f'Optimised:   {t_fast*1000:.1f}ms  ({t_slow/t_fast:.1f}x faster)')
print(f'Same results: {same}')
print(f'Category totals: { {k: f\"\${v:,.0f}\" for k,v in sorted(fast_r.items())} }')
"

📸 Verified Output:

=== Memoization ===
slow_fib(28): 731.4ms, 1,028,457 calls
fast_fib(28): 0.023ms, 29 unique calls, 27 cache hits
Speedup: 31800x

=== Generator vs list pipeline ===
List pipeline: 18.45ms
Gen pipeline:  15.23ms  (1.2x faster)

=== numpy vs pure Python ===
Python sum:  12.34ms
numpy:        0.52ms  (23.7x faster)
Both give same result: True

=== Capstone: Optimised Report Pipeline ===
N=50,000 products
Unoptimised: 45.3ms
Optimised:   18.7ms  (2.4x faster)
Same results: True

Summary

Technique

Typical speedup

When to apply

lru_cache / cache

10x–∞

Repeated calls with same args

set/dict for lookup

100–10000x

O(n) search → O(1)

str.join

100–200x

String concatenation in loop

Generator pipeline

1.2–2x

Memory + moderate speed

numpy vectorisation

10–100x

Numeric array operations

Columnar storage

1.5–3x

Aggregate queries on one field

timeit

—

Always measure before/after

cProfile

—

Find hotspots first

Good morning

Lab 04: Profiling & Performance

Objective

Background

Time

Prerequisites

Tools

Lab Instructions

Step 1: `timeit` — Micro-benchmarking

Step 2: `cProfile` — Function-Level Profiling

Steps 3–8: Memoization, Algorithmic Complexity, Generators vs Lists, numpy, Caching, Capstone

Summary

Further Reading

Good morning

hashtagObjective

hashtagBackground

hashtagTime

hashtagPrerequisites

hashtagTools

hashtagLab Instructions

hashtagStep 1: timeit — Micro-benchmarking

hashtagStep 2: cProfile — Function-Level Profiling

hashtagSteps 3–8: Memoization, Algorithmic Complexity, Generators vs Lists, numpy, Caching, Capstone

hashtagSummary

hashtagFurther Reading

Objective

Background

Time

Prerequisites

Tools

Lab Instructions

Step 1: `timeit` — Micro-benchmarking

Step 2: `cProfile` — Function-Level Profiling

Steps 3–8: Memoization, Algorithmic Complexity, Generators vs Lists, numpy, Caching, Capstone

Summary

Further Reading