Build a star schema data warehouse in PostgreSQL. Implement fact/dimension tables, slowly changing dimensions (SCD Type 1/2/3), GROUP BY ROLLUP/CUBE for OLAP queries, and MATERIALIZED VIEWs for performance.
💡 ROLLUP vs CUBE: ROLLUP creates subtotals along a hierarchy (year → category → total). CUBE creates subtotals for ALL combinations of dimensions.
Step 5: Slowly Changing Dimensions (SCD)
📸 Verified Output:
Step 6: Materialized View for Aggregates
📸 Verified Output:
Step 7: ETL vs ELT
📸 Verified Output:
Step 8: Capstone — Warehouse Design Review
📸 Verified Output:
Summary
Concept
Key Takeaway
Star schema
Fact table at center; dimensions as spokes; simple joins
Snowflake schema
Normalized dimensions; more joins but less storage
Fact table
Contains metrics (revenue, qty) + FK references to dimensions
Dimension table
Descriptive context (customer, product, date)
SCD Type 1
Overwrite — simple, no history
SCD Type 2
New row per change — full history; use surrogate keys
SCD Type 3
Previous value column — one version back only
GROUP BY ROLLUP
Hierarchical subtotals (year → month → day)
GROUP BY CUBE
All combinations of dimensions
Materialized View
Pre-computed aggregates; REFRESH to update
ETL
Transform outside warehouse (Spark, Glue)
ELT
Transform inside warehouse (dbt, SQL)
💡 Architect's insight: Star schemas intentionally violate 3NF. Denormalization is a performance feature in analytics: fewer joins = faster aggregations. Use MATERIALIZED VIEW for your most expensive recurring queries.
docker exec -i pg-lab psql -U postgres -d warehouse << 'SQL'
-- ROLLUP: subtotals and grand total (hierarchical)
SELECT
COALESCE(d.year_actual::TEXT, 'ALL YEARS') AS year,
COALESCE(p.category, 'ALL CATEGORIES') AS category,
ROUND(SUM(f.net_revenue), 2) AS revenue,
COUNT(*) AS sales
FROM fact_sales f
JOIN dim_date d ON f.date_key = d.date_key
JOIN dim_product p ON f.product_key = p.product_key
GROUP BY ROLLUP(d.year_actual, p.category)
ORDER BY d.year_actual NULLS LAST, p.category NULLS LAST
LIMIT 10;
-- CUBE: all combinations (cross-dimensional analysis)
SELECT
COALESCE(p.category, 'ALL') AS category,
COALESCE(c.country, 'ALL') AS country,
ROUND(SUM(f.net_revenue), 2) AS revenue
FROM fact_sales f
JOIN dim_product p ON f.product_key = p.product_key
JOIN dim_customer c ON f.customer_key = c.customer_key
GROUP BY CUBE(p.category, c.country)
ORDER BY p.category NULLS LAST, c.country NULLS LAST
LIMIT 15;
SQL
ROLLUP — hierarchical subtotals:
year | category | revenue | sales
------+-----------------+------------+------
2024 | Accessories | 21560.20 | 85
2024 | Books | 12050.00 | 102
2024 | Electronics | 452318.00 | 313
2024 | ALL CATEGORIES | 485928.20 | 500
ALL YEARS | ALL CATEGORIES | 485928.20 | 500
CUBE — all dimension combinations:
category | country | revenue
-------------+-----------+---------
Electronics | China | 125431.20
Electronics | Korea | 89021.50
Electronics | ALL | 452318.00
ALL | ALL | 485928.20
docker exec -i pg-lab psql -U postgres -d warehouse << 'SQL'
-- SCD Type 1: Overwrite (no history)
-- Alice moves to San Francisco (old city LOST)
UPDATE dim_customer
SET city = 'San Francisco'
WHERE customer_id = 1 AND is_current = TRUE;
-- SCD Type 2: New row (full history preserved)
-- Bob gets upgraded to gold tier → insert new version, expire old
-- Step 1: Expire current version
UPDATE dim_customer
SET effective_to = CURRENT_DATE - 1, is_current = FALSE
WHERE customer_id = 2 AND is_current = TRUE;
-- Step 2: Insert new version
INSERT INTO dim_customer (customer_id, first_name, last_name, email, city, country, tier, effective_from, is_current)
SELECT customer_id, first_name, last_name, email, city, country,
'gold' AS tier,
CURRENT_DATE AS effective_from, TRUE
FROM dim_customer WHERE customer_id = 2 AND is_current = FALSE
ORDER BY customer_key DESC LIMIT 1;
-- View SCD2 history for Bob
SELECT customer_key, customer_id, first_name, tier,
effective_from, effective_to, is_current
FROM dim_customer WHERE customer_id = 2
ORDER BY effective_from;
-- SCD Type 3: Add previous_value column
ALTER TABLE dim_customer ADD COLUMN IF NOT EXISTS previous_tier VARCHAR(20);
-- Carol upgrades: keep old tier in previous_tier
UPDATE dim_customer
SET previous_tier = tier, tier = 'gold'
WHERE customer_id = 4 AND is_current = TRUE;
SELECT customer_id, first_name, tier, previous_tier FROM dim_customer WHERE customer_id = 4;
SQL
SCD2 History for Bob:
customer_key | customer_id | first_name | tier | effective_from | effective_to | is_current
--------------+-------------+------------+--------+----------------+--------------+------------
2 | 2 | Bob | silver | 2023-03-15 | 2024-03-01 | f
6 | 2 | Bob | gold | 2024-03-01 | | t
SCD3 (one version back):
customer_id | first_name | tier | previous_tier
-------------+------------+------+---------------
4 | David | gold | bronze
docker exec -i pg-lab psql -U postgres -d warehouse << 'SQL'
-- Create materialized view for monthly sales summary
-- This pre-computes expensive aggregations
CREATE MATERIALIZED VIEW mv_monthly_sales AS
SELECT
d.year_actual,
d.month_actual,
d.month_name,
d.quarter,
p.category,
ch.channel_type,
c.country,
COUNT(*) AS transaction_count,
SUM(f.quantity) AS units_sold,
ROUND(SUM(f.gross_revenue)::NUMERIC, 2) AS gross_revenue,
ROUND(SUM(f.net_revenue)::NUMERIC, 2) AS net_revenue,
ROUND(SUM(f.gross_profit)::NUMERIC, 2) AS gross_profit,
ROUND(AVG(f.net_revenue)::NUMERIC, 2) AS avg_order_value
FROM fact_sales f
JOIN dim_date d ON f.date_key = d.date_key
JOIN dim_product p ON f.product_key = p.product_key
JOIN dim_channel ch ON f.channel_key = ch.channel_key
JOIN dim_customer c ON f.customer_key = c.customer_key
GROUP BY d.year_actual, d.month_actual, d.month_name, d.quarter,
p.category, ch.channel_type, c.country
WITH DATA;
CREATE INDEX idx_mv_monthly_year ON mv_monthly_sales(year_actual, month_actual);
CREATE INDEX idx_mv_monthly_category ON mv_monthly_sales(category);
-- Query materialized view (much faster than fact table join)
\timing on
SELECT year_actual, quarter, category,
SUM(net_revenue) AS revenue,
SUM(transaction_count) AS transactions
FROM mv_monthly_sales
WHERE year_actual = 2024
GROUP BY year_actual, quarter, category
ORDER BY quarter, revenue DESC;
\timing off
-- Refresh when new data arrives (production: schedule via pg_cron)
REFRESH MATERIALIZED VIEW CONCURRENTLY mv_monthly_sales;
SQL
docker exec -i pg-lab psql -U postgres -d warehouse << 'SQL'
-- Final analysis: compare star schema query performance
-- Q1: YoY comparison by country
WITH yearly AS (
SELECT
d.year_actual,
c.country,
ROUND(SUM(f.net_revenue)::NUMERIC, 2) AS revenue
FROM fact_sales f
JOIN dim_date d ON f.date_key = d.date_key
JOIN dim_customer c ON f.customer_key = c.customer_key
GROUP BY d.year_actual, c.country
)
SELECT
country,
MAX(CASE WHEN year_actual = 2024 THEN revenue END) AS revenue_2024,
MAX(CASE WHEN year_actual = 2023 THEN revenue END) AS revenue_2023
FROM yearly
GROUP BY country
ORDER BY revenue_2024 DESC NULLS LAST;
-- Q2: Top products by profit margin
SELECT
p.product_name,
p.category,
COUNT(*) AS sales_count,
ROUND(SUM(f.gross_profit)::NUMERIC, 2) AS total_profit,
ROUND(100.0 * SUM(f.gross_profit) / NULLIF(SUM(f.gross_revenue), 0), 1) AS margin_pct
FROM fact_sales f
JOIN dim_product p ON f.product_key = p.product_key
GROUP BY p.product_name, p.category
ORDER BY margin_pct DESC NULLS LAST;
SQL
# Cleanup
docker rm -f pg-lab 2>/dev/null