Lab 10: pandas Advanced

Objective

Master pandas beyond basics: MultiIndex hierarchical data, groupby with custom aggregations, time series resampling with pd.Grouper, method chaining with pipe(), apply() with complex functions, pd.eval() for fast expressions, and building a full ETL pipeline.

Background

pandas 2.x introduced Copy-on-Write (CoW) as an opt-in mode, and pandas 3.0 makes it the default: writing to a slice or derived object no longer silently modifies the original. pandas 2.x also added optional PyArrow-backed dtypes alongside the NumPy-backed defaults. Chaining steps with pipe() and replacing explicit Python loops with apply() or vectorized operations is often the difference between 10-line and 100-line pandas code.
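
The Copy-on-Write behavior described above can be sketched as follows. This is a minimal illustration, not part of the lab's own code: the `price` column and its values are made up, and the version check assumes CoW still has to be switched on explicitly before pandas 3.

```python
import pandas as pd

# Assumption: on pandas 2.x Copy-on-Write is opt-in; on 3.x it is always on,
# so the option is only set for older major versions.
if int(pd.__version__.split(".")[0]) < 3:
    pd.set_option("mode.copy_on_write", True)

df = pd.DataFrame({"price": [10.0, 20.0, 30.0]})

# A Series derived from the frame. Under CoW it shares data lazily
# and is copied only when one side is written to.
derived = df["price"]
derived.iloc[0] = 99.0

# The original frame is untouched by the write to the derived object.
original_after = df.loc[0, "price"]

# To change the original, write through the frame itself.
df.loc[1, "price"] = 55.0

print(original_after, derived.iloc[0], df.loc[1, "price"])
```

The practical consequence is that updates should go through `df.loc[...] = value` on the frame you intend to change, rather than through an intermediate slice.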

Time

35 minutes

Prerequisites

  • Practitioner Lab 10 (pandas basics)

Tools

  • Docker: zchencow/innozverse-python:latest (pandas 3.x)


Lab Instructions

Step 1: MultiIndex — Hierarchical Indexing

💡 df.loc[(slice(None), 'Laptop'), :] selects all regions (first level = slice(None)) with category 'Laptop' (second level). A MultiIndex models hierarchical data directly and supports aggregation by level, which is usually cleaner and faster than repeated boolean filtering. Slicing across levels requires a sorted index, so call sort_index() first. Use pd.IndexSlice for cleaner syntax: idx = pd.IndexSlice; df.loc[idx[:, 'Laptop'], :].
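
A minimal sketch of the selection patterns in the tip above. The `sales` frame, its `region`/`category` level names, and all numbers are hypothetical and stand in for the lab's dataset.

```python
import pandas as pd

# Hypothetical two-level index: every region crossed with every category.
index = pd.MultiIndex.from_product(
    [["East", "West"], ["Laptop", "Phone"]],
    names=["region", "category"],
)
sales = pd.DataFrame(
    {"units": [5, 8, 3, 12], "revenue": [5000.0, 4000.0, 3000.0, 6500.0]},
    index=index,
).sort_index()  # cross-level slicing needs a lexsorted index

# All regions, category 'Laptop', written with an explicit slice(None).
laptops = sales.loc[(slice(None), "Laptop"), :]

# The same selection with pd.IndexSlice.
idx = pd.IndexSlice
laptops_idx = sales.loc[idx[:, "Laptop"], :]

# Aggregate across one level of the hierarchy.
by_region = sales.groupby(level="region")["revenue"].sum()

print(laptops)
print(by_region)
```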



Step 2: Advanced GroupBy — Custom Aggregations & Transform
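
The lab's original code for this step is not reproduced here. As a stand-in, the sketch below shows named aggregation with a custom function, plus transform() to broadcast a group statistic back to each row. The `orders` frame, its columns, and the `value_range` helper are assumptions made for illustration.

```python
import pandas as pd

# Hypothetical order data.
orders = pd.DataFrame(
    {
        "region": ["East", "East", "West", "West", "West"],
        "amount": [100.0, 300.0, 50.0, 150.0, 200.0],
    }
)


def value_range(s: pd.Series) -> float:
    """Custom aggregation: spread between the largest and smallest value."""
    return s.max() - s.min()


# Named aggregation: each keyword becomes an output column,
# defined as (source column, function or function name).
summary = orders.groupby("region").agg(
    total=("amount", "sum"),
    mean_amount=("amount", "mean"),
    spread=("amount", value_range),
)

# transform() returns a result aligned to the original rows,
# so a group total can be used in row-level arithmetic.
orders["region_total"] = orders.groupby("region")["amount"].transform("sum")
orders["share"] = orders["amount"] / orders["region_total"]

print(summary)
print(orders)
```

agg() collapses each group to one row, while transform() keeps the original shape, which is why the share-of-group calculation uses the latter.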



Steps 3–8: Time Series, pipe() chaining, pd.eval, Data Quality, MultiIndex merges, Capstone
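
The code for these steps is likewise not reproduced here. The sketch below combines several of the listed techniques in one small pipeline: a basic data-quality filter, df.eval() for a derived column, pd.Grouper for weekly grouping, pipe() for chaining, and resample()/rolling() on a time index. The data, column names, and helper functions are hypothetical and are not the lab's capstone.

```python
import pandas as pd

# Hypothetical daily sales for two stores over four full weeks
# (2024-01-01 is a Monday).
dates = pd.date_range("2024-01-01", periods=28, freq="D")
raw = pd.DataFrame(
    {
        "date": dates.repeat(2),
        "store": ["A", "B"] * 28,
        "units": [1, 3] * 28,
        "price": 10.0,
    }
)


def drop_bad_rows(df: pd.DataFrame) -> pd.DataFrame:
    """Data quality step: drop rows with missing or non-positive values."""
    return df.dropna(subset=["units", "price"]).query("units > 0")


def add_revenue(df: pd.DataFrame) -> pd.DataFrame:
    """Derive a column with an eval expression; returns a new frame."""
    return df.eval("revenue = units * price")


def weekly_by_store(df: pd.DataFrame, freq: str = "W") -> pd.DataFrame:
    """Group by store and calendar bucket using pd.Grouper."""
    grouped = df.groupby(["store", pd.Grouper(key="date", freq=freq)])
    return grouped["revenue"].sum().reset_index()


# Each step takes and returns a DataFrame, so no temp variables are needed.
weekly = (
    raw.pipe(drop_bad_rows)
    .pipe(add_revenue)
    .pipe(weekly_by_store, freq="W")
)

# Resampling and a 7-day moving average on a DatetimeIndex.
daily = raw.pipe(add_revenue).set_index("date")["revenue"].resample("D").sum()
rolling7 = daily.rolling(7).mean()

print(weekly)
print(rolling7.tail(3))
```

Extra arguments to a pipeline step are passed through pipe() itself, as with `freq="W"` here, which keeps each step a plain function that can be tested in isolation.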



Summary

| Feature | API | Use case |
| --- | --- | --- |
| MultiIndex | pd.MultiIndex.from_product | Hierarchical dimensions |
| Named agg | .agg(name=('col', 'func')) | Multiple aggregations cleanly |
| Transform | .transform('sum') | Group stat back to row |
| Method chain | .pipe(fn) | ETL steps without temp vars |
| Resample | .resample('W').sum() | Time series aggregation |
| Rolling | .rolling(7).mean() | Moving average |
| pd.eval | df.eval('c = a * b') | Fast column expressions |
| Category dtype | .astype('category') | 8x memory savings |
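
The category dtype entry can be checked with a quick measurement. This is a rough sketch on made-up data: the labels and row count are assumptions, and the ratio you see depends on the number of distinct values and on how your pandas version stores strings.

```python
import pandas as pd

# Hypothetical low-cardinality column: 3 distinct labels over 90,000 rows.
labels = pd.Series(["Laptop", "Phone", "Tablet"] * 30_000, name="category")

# Categoricals store each distinct label once plus a small integer code per row.
as_category = labels.astype("category")

before = labels.memory_usage(deep=True)
after = as_category.memory_usage(deep=True)
ratio = before / after

print(f"{before:,} bytes -> {after:,} bytes ({ratio:.1f}x smaller)")
```

The savings shrink as cardinality grows, so the 8x figure in the table will not hold for every column.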
