Data Quality for Models
Use this page to establish a production-quality test framework for modeled datasets.
Goal:
- detect correctness defects before BI users do
- standardize model-level data quality contracts
- operationalize quality with clear severity and ownership
Quality Layers #
Recommended checks by layer:
- Schema checks: required columns, expected data types
- Model checks: grain uniqueness, referential integrity
- Metric checks: reconciliation to trusted source totals
- Operational checks: freshness, completeness, anomaly thresholds
Mandatory Model Tests #
Every published model should include:
- primary grain uniqueness test
- non-null test for required keys/metrics
- foreign-key integrity test
- basic numeric sanity (no impossible negatives where not allowed)
- row-count or aggregate reconciliation test
Reference SQL Test Pack #
-- 1) grain uniqueness
SELECT order_id, order_item_id, COUNT(*) AS row_count
FROM fact_order_item
GROUP BY order_id, order_item_id
HAVING COUNT(*) > 1;
-- 2) required not-null checks
SELECT COUNT(*) AS null_key_rows
FROM fact_order_item
WHERE customer_key IS NULL
OR product_key IS NULL
OR order_date_key IS NULL;
-- 3) referential integrity
SELECT COUNT(*) AS orphan_customer_rows
FROM fact_order_item f
LEFT JOIN dim_customer d ON d.customer_key = f.customer_key
WHERE d.customer_key IS NULL;
-- 4) numeric sanity checks
SELECT COUNT(*) AS invalid_metric_rows
FROM fact_order_item
WHERE quantity < 0
OR net_amount < 0
OR cost_amount < 0;
-- 5) reconciliation (example against source)
WITH source_totals AS (
SELECT ROUND(SUM(net_amount), 2) AS source_total
FROM stg_order_item
),
mart_totals AS (
SELECT ROUND(SUM(net_amount), 2) AS mart_total
FROM fact_order_item
)
SELECT source_total, mart_total
FROM source_totals, mart_totals
WHERE source_total <> mart_total;Severity Model #
Classify failures so alerting is meaningful:
- P0: metric correctness broken (block publish)
- P1: key integrity failures beyond threshold
- P2: freshness/completeness drift
- P3: non-critical formatting/optional attribute issues
Operational Runbook Minimums #
For each certified model, publish:
- owner and on-call path
- expected refresh schedule and SLO
- failure thresholds
- rollback and restatement procedure
- known edge cases and accepted exceptions
Senior Engineer Playbook #
- Query-plan diagnostics: baseline heavy data-quality queries and optimize anti-join checks on large facts.
- Failure postmortem pattern: when a check misses a defect, audit its join direction first. An INNER JOIN reconciliation silently hides rows that are entirely absent from one side.
- Migration playbook: move from ad hoc checks to CI-gated model contracts with severity-based release policies.
Interactive SQL Check #
Run this reconciliation check to detect completed orders where header totals do not match item totals, including completed orders with no item rows at all. That is why it must be a LEFT JOIN with a COALESCE'd item sum: an INNER JOIN makes a zero-item order invisible, which is exactly the defect class this check exists to catch. On the current seed both joins return zero mismatches. The seed is clean; the join choice is what makes the check trustworthy.
Related Lessons #
Model quality is not a one-time test suite. It is an operating discipline with ownership and incident workflows.