Lab 10: Data Processing — pandas & numpy

Objective

Process real-world datasets using pandas and numpy: DataFrames, groupby, merge, pivot tables, time series, and data cleaning pipelines.

Time

35 minutes

Prerequisites

  • Lab 03 (Generators), Lab 07 (Type Hints)

Tools

  • Docker image: zchencow/innozverse-python:latest (pandas 2.x, numpy 2.x)


Lab Instructions

Step 1: numpy Fundamentals

docker run --rm zchencow/innozverse-python:latest python3 -c "
import numpy as np

# Array creation
a = np.array([1, 2, 3, 4, 5])
b = np.arange(0, 10, 2)
c = np.linspace(0, 1, 5)
zeros = np.zeros((3, 3))
ones  = np.ones((2, 4))
eye   = np.eye(3)

print('a:', a)
print('arange(0,10,2):', b)
print('linspace(0,1,5):', c)
print('eye(3):\n', eye)

# Vectorized operations (no loops!)
prices = np.array([864.0, 49.99, 99.99, 29.99, 1299.0])
stocks = np.array([15, 80, 999, 0, 5])

values = prices * stocks
print('Values:', values)
print('Total:', values.sum())
print('Mean price:', prices.mean())
print('Std  price:', prices.std().round(2))

# Boolean indexing
in_stock_mask = stocks > 0
print('In stock prices:', prices[in_stock_mask])
print('Top 3 prices:', np.sort(prices)[-3:][::-1])

# Broadcasting: the scalar 1 is stretched across the discount array, then multiplied element-wise
discount = np.array([0.1, 0.05, 0.0, 0.15, 0.2])
final_prices = prices * (1 - discount)
print('Final prices:', np.round(final_prices, 2))

# Statistical operations
sales_data = np.random.default_rng(42).integers(1, 50, size=(7, 5))
print('Weekly sales (7 days x 5 products):\n', sales_data)
print('Daily totals:', sales_data.sum(axis=1))
print('Product totals:', sales_data.sum(axis=0))
print('Best day (0-based index):', sales_data.sum(axis=1).argmax())
"

💡 numpy vectorized operations run as compiled C loops, which on large arrays is often one to two orders of magnitude faster than an equivalent Python loop. prices * stocks multiplies the two arrays element by element in a single call, with no Python-level iteration. For data science and ML workloads, prefer numpy array operations over Python for loops.
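To make the tip concrete, here is a minimal sketch that compares a Python loop with the vectorized form on the same prices and stocks arrays from Step 1. The 100,000-element arrays used for the timing are an assumption added for illustration, and the measured times will vary by machine, so no particular speedup is claimed.

```python
import timeit

import numpy as np

prices = np.array([864.0, 49.99, 99.99, 29.99, 1299.0])
stocks = np.array([15, 80, 999, 0, 5])

# Pure-Python loop: one multiplication per iteration
loop_values = [float(p) * int(s) for p, s in zip(prices, stocks)]

# Vectorized: one call, the loop runs inside numpy
vec_values = prices * stocks

same = bool(np.allclose(loop_values, vec_values))
print('Same result:', same)
print('Total:', round(float(vec_values.sum()), 2))

# Rough timing on larger, hypothetical arrays (sizes chosen for illustration)
rng = np.random.default_rng(42)
big_prices = rng.uniform(1, 1000, size=100_000)
big_stocks = rng.integers(0, 100, size=100_000)

loop_time = timeit.timeit(
    lambda: [p * s for p, s in zip(big_prices, big_stocks)], number=5
)
vec_time = timeit.timeit(lambda: big_prices * big_stocks, number=5)
print(f'loop: {loop_time:.4f}s  vectorized: {vec_time:.4f}s')
```

Both forms produce the same values; only the place where the loop runs differs.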

📸 Verified Output:


Step 2: pandas DataFrame Basics

📸 Verified Output:


Steps 3–8: GroupBy, Merge, Pivot, Time Series, Data Cleaning, Capstone Pipeline

📸 Verified Output:


Summary

Operation    | pandas                       | numpy
-------------|------------------------------|--------------------
Create       | pd.DataFrame(dict)           | np.array([...])
Filter       | df[df['col'] > val]          | arr[arr > val]
GroupBy      | df.groupby('col').agg(...)   | np.unique / manual
Merge        | df1.merge(df2, on='col')     |
Time series  | df.resample('W').sum()       |
Stats        | df.describe()                | np.mean, np.std
Clean        | dropna, fillna, query        | np.nan, np.isnan
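The pandas entries in the summary can be sketched together in one short script. This is an illustration under stated assumptions, not the lab's own solution: the products and orders tables, their column names, and the dates are all hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical data, invented for this sketch
products = pd.DataFrame({
    'product_id': [1, 2, 3, 4],
    'category': ['laptop', 'accessory', 'accessory', 'laptop'],
    'price': [864.0, 49.99, np.nan, 1299.0],
})
orders = pd.DataFrame({
    'product_id': [1, 2, 2, 4],
    'qty': [1, 3, 2, 1],
    'date': pd.to_datetime(
        ['2024-01-01', '2024-01-02', '2024-01-09', '2024-01-10']
    ),
})

# Clean: drop rows with a missing price
clean = products.dropna(subset=['price'])

# Filter: boolean mask on a column
expensive = clean[clean['price'] > 100]

# GroupBy: named aggregations per category
by_category = clean.groupby('category').agg(
    mean_price=('price', 'mean'),
    n_products=('product_id', 'count'),
)

# Merge: inner join of orders with product details
merged = orders.merge(clean, on='product_id')
merged['revenue'] = merged['price'] * merged['qty']

# Time series: weekly revenue (resample needs a datetime index)
weekly = merged.set_index('date')['revenue'].resample('W').sum()

# Stats: summary statistics of the numeric columns
print(clean.describe())
print(by_category)
print(weekly)
```

Note that resample('W') groups into weeks ending on Sunday by default, so the four hypothetical orders fall into two weekly bins.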
