Reference

SCD Type 1, 2, and 3

Use this page as a practical implementation guide for Slowly Changing Dimensions (SCDs).

Goal:

  • choose the correct SCD pattern for each attribute
  • load dimension history correctly and repeatably
  • avoid duplicate current rows and broken historical joins
  • preserve reporting trust when attributes change over time

When to Use Type 1 vs Type 2 vs Type 3 #

  • Type 1: overwrite old value. Use for corrected attributes where history is not analytically meaningful.
  • Type 2: add new row version. Use when historical point-in-time analysis must be preserved.
  • Type 3: keep current + one prior value. Use sparingly for narrow comparison use cases.

Dimension Contract (Required) #

For each dimension, define:

  • business key (natural key)
  • surrogate key
  • SCD policy per attribute
  • effective start/end timestamps
  • current-row flag
  • late-arriving update policy

Reference Table Design (Type 2) #

sql
CREATE TABLE dim_customer (
  customer_sk BIGINT,
  customer_id VARCHAR,                 -- business key
  customer_name VARCHAR,
  segment VARCHAR,
  country VARCHAR,
  email VARCHAR,
  effective_start_ts TIMESTAMP,
  effective_end_ts TIMESTAMP,
  is_current BOOLEAN,
  record_source VARCHAR,
  load_ts TIMESTAMP
);

Type 1 Load Pattern (Overwrite) #

sql
UPDATE dim_customer d
SET
  customer_name = s.customer_name,
  email = s.email,
  load_ts = CURRENT_TIMESTAMP
FROM stg_customer s
WHERE d.customer_id = s.customer_id
  AND (
    d.customer_name IS DISTINCT FROM s.customer_name
    OR d.email IS DISTINCT FROM s.email
  );

In a hybrid Type 1/Type 2 dimension, a Type 1 overwrite must update every version of the member, not just the is_current row. Otherwise historical versions keep the stale value and point-in-time reports disagree with current ones. Note the change detection uses IS DISTINCT FROM rather than <>, so NULL-to-value and value-to-NULL transitions are detected too.

Type 2 Load Pattern (Versioned History) #

sql
-- Step 1: close existing current rows when tracked attributes changed
UPDATE dim_customer d
SET
  effective_end_ts = s.change_ts,
  is_current = FALSE,
  load_ts = CURRENT_TIMESTAMP
FROM stg_customer_changes s
WHERE d.customer_id = s.customer_id
  AND d.is_current = TRUE
  AND (
    d.segment IS DISTINCT FROM s.segment
    OR d.country IS DISTINCT FROM s.country
  );

-- Step 2: insert new current version
INSERT INTO dim_customer (
  customer_sk,
  customer_id,
  customer_name,
  segment,
  country,
  email,
  effective_start_ts,
  effective_end_ts,
  is_current,
  record_source,
  load_ts
)
SELECT
  s.customer_sk,
  s.customer_id,
  s.customer_name,
  s.segment,
  s.country,
  s.email,
  s.change_ts,
  TIMESTAMP '9999-12-31 00:00:00',
  TRUE,
  s.record_source,
  CURRENT_TIMESTAMP
FROM stg_customer_changes s
LEFT JOIN dim_customer d
  ON d.customer_id = s.customer_id
 AND d.is_current = TRUE
WHERE d.customer_id IS NULL
   OR d.segment IS DISTINCT FROM s.segment
   OR d.country IS DISTINCT FROM s.country;
A dim_customer table with two rows for customer C001: version 1 (segment SMB) end-dated on 2024-03-01 with is_current FALSE, and version 2 (segment Enterprise) open to 9999-12-31 with is_current TRUE
Type 2 keeps history: when a tracked attribute changes, the current row is end-dated and is_current set FALSE, and a new open-ended version is inserted under the same business key.

Type 3 Pattern (Current + Previous) #

sql
ALTER TABLE dim_customer ADD COLUMN prev_segment VARCHAR;

UPDATE dim_customer d
SET
  prev_segment = d.segment,
  segment = s.segment,
  load_ts = CURRENT_TIMESTAMP
FROM stg_customer s
WHERE d.customer_id = s.customer_id
  AND d.is_current = TRUE
  AND d.segment IS DISTINCT FROM s.segment;

Fact Join Rule for Type 2 #

For Type 2 dimensions, join facts to dimensions using business key plus event timestamp window:

  • fact.event_ts >= dim.effective_start_ts
  • fact.event_ts < dim.effective_end_ts

Do not join only on business key for historical reporting.

sql
SELECT
  f.order_id,
  f.order_ts,
  d.segment,
  f.net_amount
FROM fact_order_item f
JOIN dim_customer d
  ON d.customer_id = f.customer_id
 AND f.order_ts >= d.effective_start_ts
 AND f.order_ts < d.effective_end_ts;

Validation SQL You Should Always Run #

sql
-- 1) exactly one current row per business key
SELECT customer_id, COUNT(*) AS current_rows
FROM dim_customer
WHERE is_current = TRUE
GROUP BY customer_id
HAVING COUNT(*) <> 1;

-- 2) overlapping validity windows
SELECT a.customer_id
FROM dim_customer a
JOIN dim_customer b
  ON a.customer_id = b.customer_id
 AND a.customer_sk <> b.customer_sk
 AND a.effective_start_ts < b.effective_end_ts
 AND b.effective_start_ts < a.effective_end_ts
GROUP BY a.customer_id;

-- 3) orphan fact joins (Type 2 time window)
SELECT COUNT(*) AS orphan_fact_rows
FROM fact_order_item f
LEFT JOIN dim_customer d
  ON d.customer_id = f.customer_id
 AND f.order_ts >= d.effective_start_ts
 AND f.order_ts < d.effective_end_ts
WHERE d.customer_sk IS NULL;

Advanced Engineer Notes #

  • Separate tracked vs non-tracked attributes to avoid unnecessary Type 2 churn.
  • Use deterministic change detection (hashdiff or explicit column compare).
  • Define restatement policy for late-arriving corrections.
  • Treat SCD policy changes as breaking contract changes and version carefully.

Senior Engineer Playbook #

  • Temporal join audit: assert the fact row count is identical before and after the Type 2 join: any delta means the validity windows drop or duplicate facts (the inclusive-BETWEEN trap).
  • Failure postmortem pattern: track when duplicate current rows were introduced and which pipeline run caused overlap.
  • Migration playbook: when moving Type 1 -> Type 2, run dual-publish windows and reconcile metric deltas before cutover.

Interactive SQL Check #

Run this temporal join check to simulate a validity-window style join between premium events and policy records.

Apply the half-open rule from above, adapted to month grain: premium fact rows sit at the first of the month while policies can start mid-month, so compare month_date against the policy start truncated to its month. The seed end_date is an inclusive last covered day, so the exclusive upper bound is end_date plus one day. An inclusive BETWEEN on the raw dates silently drops every first-month premium row for mid-month policy starts.

Loading SQL editor...
Warning

Most historical metric defects come from dimension versioning errors, not from aggregation SQL.