Master pandas beyond basics: MultiIndex hierarchical data, groupby with custom aggregations, time series resampling with pd.Grouper, method chaining with pipe(), apply() with complex functions, pd.eval() for fast expressions, and building a full ETL pipeline.
Background
pandas 2.x introduces Copy-on-Write semantics (opt-in via pd.options.mode.copy_on_write = True, and the default behavior from 3.0) — operations on slices no longer silently modify the original. 2.x also adds an optional Arrow-backed dtype system (dtype_backend='pyarrow') alongside the classic numpy backend. Understanding method chaining with pipe() and replacing explicit Python loops with groupby/transform/apply is the difference between 10-line and 100-line pandas code.
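A minimal sketch of what Copy-on-Write changes in practice (the frame and values here are illustrative):

```python
import pandas as pd

pd.options.mode.copy_on_write = True  # opt-in on 2.x; default in 3.0

df = pd.DataFrame({'a': [1, 2, 3], 'b': [4, 5, 6]})
subset = df[df['a'] > 1]   # a derived slice of df
subset.loc[:, 'b'] = 0     # under CoW this writes to a private copy...

print(df['b'].tolist())    # ...so the parent frame is untouched
print(subset['b'].tolist())
```

Without CoW, the same assignment could either mutate `df` or raise the infamous SettingWithCopyWarning depending on internals; with it, the behavior is deterministic.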
💡 df.loc[(slice(None), 'Laptop'), :] selects all regions (first level = slice(None)) with category 'Laptop' (second level). MultiIndex enables truly hierarchical data with efficient cross-level aggregations — far faster than repeated filtering. Use pd.IndexSlice for cleaner syntax: idx = pd.IndexSlice; df.loc[idx[:, 'Laptop'], :].
=== MultiIndex DataFrame ===
Shape: (12, 3)
units revenue returns
region category
North Laptop 134 23451.28 ...
=== Unstack (pivot region → columns) ===
region East North South West
category
Accessory 12345.67 23456.78 ...
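The selection and unstack patterns shown above can be reproduced on a tiny hand-built frame (the index levels and unit counts here are illustrative, not the article's data):

```python
import pandas as pd

idx = pd.MultiIndex.from_product(
    [['North', 'South'], ['Laptop', 'Accessory']],
    names=['region', 'category'],
)
df = pd.DataFrame({'units': [134, 12, 98, 7]}, index=idx)

# All regions, category 'Laptop' only — tuple-of-slices form:
laptops = df.loc[(slice(None), 'Laptop'), :]

# The same selection with pd.IndexSlice:
ix = pd.IndexSlice
laptops2 = df.loc[ix[:, 'Laptop'], :]

# Pivot the 'region' level into columns, as in the unstack output above:
wide = df['units'].unstack('region')
print(laptops['units'].tolist())
```

Both spellings return the same rows; `pd.IndexSlice` simply avoids writing `slice(None)` by hand.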
docker run --rm zchencow/innozverse-python:latest python3 -c "
import pandas as pd
import numpy as np
np.random.seed(42)
N = 500
df = pd.DataFrame({
    'date': pd.date_range('2026-01-01', periods=N, freq='D'),
    'product': np.random.choice(['Surface Pro','Surface Pen','Office 365','USB-C Hub','Surface Book'], N),
    'category': np.random.choice(['Laptop','Accessory','Software','Hardware'], N),
    'region': np.random.choice(['North','South','East','West'], N),
    'qty': np.random.randint(1, 20, N),
    'price': np.random.choice([864.0, 49.99, 99.99, 29.99, 1299.0], N),
})
df['revenue'] = df['qty'] * df['price']
df['month'] = df['date'].dt.to_period('M')
# 1. Named aggregations
print('=== Named Aggregations (agg) ===')
summary = df.groupby('category').agg(
    orders=('qty', 'count'),
    units=('qty', 'sum'),
    revenue=('revenue', 'sum'),
    avg_price=('price', 'mean'),
    max_order=('revenue', 'max'),
    top_region=('region', lambda x: x.value_counts().index[0]),
).round(2).sort_values('revenue', ascending=False)
print(summary.to_string())
# 2. transform — add group statistics back to original rows
print()
print('=== Transform (group stats on rows) ===')
df['cat_revenue_total'] = df.groupby('category')['revenue'].transform('sum')
df['pct_of_category'] = (df['revenue'] / df['cat_revenue_total'] * 100).round(2)
df['rank_in_category'] = df.groupby('category')['revenue'].rank(ascending=False, method='dense')
sample = df[df['category']=='Laptop'][['product','revenue','pct_of_category','rank_in_category']].head(5)
print(sample.to_string(index=False))
# 3. apply — arbitrary function on groups
def top_products(group: pd.DataFrame, n: int = 2) -> pd.DataFrame:
    return group.nlargest(n, 'revenue')[['product','revenue','qty']]
print()
print('=== Top 2 per Category (apply) ===')
top = df.groupby('category', group_keys=True).apply(top_products, n=2, include_groups=False)
print(top.to_string())
# 4. Pivot table
print()
print('=== Pivot Table ===')
pivot = df.pivot_table(
    values='revenue', index='category', columns='region',
    aggfunc='sum', margins=True, margins_name='Total'
).round(2)
print(pivot.to_string())
"
=== Named Aggregations (agg) ===
orders units revenue avg_price max_order top_region
category
Laptop ... ... xxx,xxx.xx xxx.xx xxxxx.xx North
...
=== Pivot Table ===
region East North South West Total
category
Accessory xxxx.xx xxxx.xx ...
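The intro also promises pipe() chaining, pd.Grouper resampling, and pd.eval(), which the snippet above doesn't reach. A minimal hedged sketch of all three on a small synthetic sales frame (all names, sizes, and the in_regions helper are illustrative assumptions, not the article's pipeline):

```python
import numpy as np
import pandas as pd

np.random.seed(0)
n = 120
sales = pd.DataFrame({
    'date': pd.date_range('2026-01-01', periods=n, freq='D'),
    'region': np.random.choice(['North', 'South'], n),
    'revenue': np.random.uniform(10, 500, n).round(2),
})

def in_regions(df: pd.DataFrame, regions: list) -> pd.DataFrame:
    # A pipeline step: takes a frame, returns a filtered frame.
    return df[df['region'].isin(regions)]

# Method chaining: pipe() keeps custom steps inside the fluent chain,
# and pd.Grouper buckets rows by calendar month without setting an index.
monthly = (
    sales
    .pipe(in_regions, ['North', 'South'])
    .groupby(pd.Grouper(key='date', freq='MS'))['revenue']
    .sum()
    .round(2)
)
print(monthly.to_string())

# df.eval() evaluates column arithmetic as a string expression:
sales = sales.eval('revenue_k = revenue / 1000')
```

Each step in the chain takes a DataFrame and returns one, so intermediate variables disappear; `freq='MS'` labels every bucket by its month start.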