Reference

Data Quality for Models

Use this page to establish a production-quality test framework for modeled datasets.

Goal:

detect correctness defects before BI users do
standardize model-level data quality contracts
operationalize quality with clear severity and ownership

Quality Layers #

Recommended checks by layer:

Schema checks: required columns, expected data types
Model checks: grain uniqueness, referential integrity
Metric checks: reconciliation to trusted source totals
Operational checks: freshness, completeness, anomaly thresholds

Mandatory Model Tests #

Every published model should include:

primary grain uniqueness test
non-null test for required keys/metrics
foreign-key integrity test
basic numeric sanity (no impossible negatives where not allowed)
row-count or aggregate reconciliation test

Reference SQL Test Pack #

sql

-- 1) grain uniqueness
SELECT order_id, order_item_id, COUNT(*) AS row_count
FROM fact_order_item
GROUP BY order_id, order_item_id
HAVING COUNT(*) > 1;

-- 2) required not-null checks
SELECT COUNT(*) AS null_key_rows
FROM fact_order_item
WHERE customer_key IS NULL
   OR product_key IS NULL
   OR order_date_key IS NULL;

-- 3) referential integrity
SELECT COUNT(*) AS orphan_customer_rows
FROM fact_order_item f
LEFT JOIN dim_customer d ON d.customer_key = f.customer_key
WHERE d.customer_key IS NULL;

-- 4) numeric sanity checks
SELECT COUNT(*) AS invalid_metric_rows
FROM fact_order_item
WHERE quantity < 0
   OR net_amount < 0
   OR cost_amount < 0;

-- 5) reconciliation (example against source)
WITH source_totals AS (
  SELECT ROUND(SUM(net_amount), 2) AS source_total
  FROM stg_order_item
),
mart_totals AS (
  SELECT ROUND(SUM(net_amount), 2) AS mart_total
  FROM fact_order_item
)
SELECT source_total, mart_total
FROM source_totals, mart_totals
WHERE source_total <> mart_total;

Severity Model #

Classify failures so alerting is meaningful:

P0: metric correctness broken (block publish)
P1: key integrity failures beyond threshold
P2: freshness/completeness drift
P3: non-critical formatting/optional attribute issues

Operational Runbook Minimums #

For each certified model, publish:

owner and on-call path
expected refresh schedule and SLO
failure thresholds
rollback and restatement procedure
known edge cases and accepted exceptions

Senior Engineer Playbook #

Query-plan diagnostics: baseline heavy data-quality queries and optimize anti-join checks on large facts.
Failure postmortem pattern: when a check misses a defect, audit its join direction first. An INNER JOIN reconciliation silently hides rows that are entirely absent from one side.
Migration playbook: move from ad hoc checks to CI-gated model contracts with severity-based release policies.

Interactive SQL Check #

Run this reconciliation check to detect completed orders where header totals do not match item totals, including completed orders with no item rows at all. That is why it must be a LEFT JOIN with a COALESCE'd item sum: an INNER JOIN makes a zero-item order invisible, which is exactly the defect class this check exists to catch. On the current seed both joins return zero mismatches. The seed is clean; the join choice is what makes the check trustworthy.

Loading SQL editor...

Warning

Model quality is not a one-time test suite. It is an operating discipline with ownership and incident workflows.