Reference

Star Schema Basics

Use this page as an implementation reference for building a datamart with fact and dimension tables.

Goal:

define the correct grain
model dimensions and facts cleanly
load data in a safe order
validate metric correctness before dashboard use

The shape that gives the pattern its name: one fact table in the middle, dimensions around it. Every fact row carries one key per dimension; joins fan out from the center, never dimension-to-dimension.

When to Use a Star Schema #

Choose star schema when:

analysts need fast, readable SQL
KPIs are mostly aggregate reporting metrics
you want reusable dimensions across many facts
BI tools are primary consumers

Core Components #

Fact table: measurable events or snapshots
Dimension tables: descriptive context (customer, product, date, channel)
Surrogate keys: stable integer keys for joins
Conformed dimensions: shared dimensions used by multiple facts

Fact Table Types #

Transaction fact: one row per event (order line, claim)
Periodic snapshot fact: one row per entity per time period
Accumulating snapshot fact: one row tracking multi-step lifecycle

Build Workflow (Recommended) #

Choose one business process (for example: orders).
Declare exact fact grain in one sentence.
List dimensions needed for slicing/filtering.
Define additive and non-additive metrics.
Choose SCD policy per dimension.
Build dimension tables first, then fact.
Add data tests and reconciliation checks.
Expose as mart/semantic layer for BI.

Step 1: Declare Grain Explicitly #

Example grain statement:
One row in fact_order_item represents one product line on one completed order.

If this sentence is ambiguous, stop and fix it before modeling.

Step 2: Dimension Design #

For each dimension, define:

business key (natural key)
surrogate key
attributes used by analysis
SCD behavior (Type 1 or Type 2)
null/default handling rules

Step 3: Fact Design #

For fact table columns:

foreign keys to dimensions
event timestamp/date key
numeric metrics
optional degenerate dimensions (order_number, invoice_id)

Classify metrics:

additive (safe to sum across all dims)
semi-additive (safe across some dims, not all)
non-additive (ratios/percentages; compute from base measures)

SCD Considerations #

Dimension history strategy affects metric interpretation:

Type 1: overwrite (no history)
Type 2: row versioning (history preserved)
Type 3: limited historical columns

Pick SCD policy per business question, not per engineering convenience.

Starter SQL Templates #

sql

-- Dimension templates
CREATE TABLE dim_customer (
  customer_key INTEGER PRIMARY KEY,
  customer_id VARCHAR,              -- business key
  customer_name VARCHAR,
  segment VARCHAR,
  country VARCHAR,
  effective_start_ts TIMESTAMP,     -- for SCD2
  effective_end_ts TIMESTAMP,       -- for SCD2
  is_current BOOLEAN
);

CREATE TABLE dim_product (
  product_key INTEGER PRIMARY KEY,
  product_id VARCHAR,
  product_name VARCHAR,
  category VARCHAR,
  brand VARCHAR
);

CREATE TABLE dim_date (
  date_key INTEGER PRIMARY KEY,     -- yyyymmdd
  calendar_date DATE,
  year INTEGER,
  quarter INTEGER,
  month INTEGER,
  week INTEGER,
  day_of_month INTEGER
);

sql

-- Fact template
-- Grain: one row = one product line on one order -> (order_id, order_item_id)
CREATE TABLE fact_order_item (
  order_item_key INTEGER PRIMARY KEY,
  order_id VARCHAR,                 -- degenerate dimension
  order_item_id VARCHAR,            -- degenerate line key (completes the grain)
  order_date_key INTEGER,           -- FK -> dim_date
  customer_key INTEGER,             -- FK -> dim_customer
  product_key INTEGER,              -- FK -> dim_product
  quantity INTEGER,
  gross_amount DECIMAL(18,2),
  discount_amount DECIMAL(18,2),
  net_amount DECIMAL(18,2),
  cost_amount DECIMAL(18,2)
);

sql

-- Query pattern
SELECT
  DATE_TRUNC('month', d.calendar_date) AS calendar_month,
  c.segment,
  SUM(f.net_amount) AS revenue,
  SUM(f.net_amount - f.cost_amount) AS gross_margin
FROM fact_order_item AS f
JOIN dim_date AS d ON d.date_key = f.order_date_key
JOIN dim_customer AS c ON c.customer_key = f.customer_key
GROUP BY DATE_TRUNC('month', d.calendar_date), c.segment;

Load Order and Data Contracts #

Recommended load sequence:

stage source tables
load/update dimensions
resolve surrogate keys
load fact rows
run quality + reconciliation tests

Contract examples:

no duplicate fact grain keys
no orphan fact foreign keys
metric totals reconcile to trusted source aggregates

Validation SQL You Should Always Run #

sql

-- 1) Grain uniqueness on the true grain key
-- (order_id, product_key) is NOT the grain: an order can legitimately
-- repeat the same product on separate lines
SELECT order_id, order_item_id, COUNT(*) AS row_count
FROM fact_order_item
GROUP BY order_id, order_item_id
HAVING COUNT(*) > 1;

-- 2) Orphan keys
SELECT COUNT(*) AS missing_customer_keys
FROM fact_order_item f
LEFT JOIN dim_customer c ON c.customer_key = f.customer_key
WHERE c.customer_key IS NULL;

-- 3) Reconciliation
SELECT
  ROUND(SUM(net_amount), 2) AS fact_total
FROM fact_order_item;

Performance and Maintenance Notes #

Keep fact tables narrow; move verbose text to dimensions.
Partition/cluster fact tables by common time filters where supported.
Pre-aggregate common dashboard grains into summary marts.
Use conformed dimensions to avoid duplicated business logic.
Version metric definitions when formulas change.

Common Modeling Mistakes #

Mixed grain in one fact table.
Ratios stored as base fact metrics (instead of deriving from numerators/denominators).
SCD policy chosen without business sign-off.
Facts loaded before dimensions are stable.
Hidden duplicate joins caused by non-unique dimension business keys.

Interactive SQL Check #

Run this query to practice a classic star-style aggregate across orders, items, and product dimensions.

Loading SQL editor...

Tip

A datamart is production-ready when:
1) grain is explicit and tested,
2) metrics reconcile,
3) dimensions are trustworthy,
4) analysts can query without custom join logic.