Data Quality Audit
This dataset is clean, and your job is to prove it. Each audit below returns counts of violations, and the contract is that every count is zero. That framing is exactly how production data tests work: the query encodes the rule, an empty (or all-zero) result is the pass, and the same query keeps guarding the warehouse as new rows arrive.
Three audits:
- order headers reconcile with their line items
- returns reference real order items and never exceed the line's value
- no orphaned facts (items without orders, orders without customers)
Audit 1: Headers vs Line-Item Sums #
Task: count orders whose header order_total disagrees with the sum of their line items by more than 0.01 (a tolerance for decimal rounding). Include every order, even cancelled ones, and treat an order with no line items at all as a mismatch (its line sum is 0, not NULL). Output a single row with one column, mismatched_orders.
The skill: aggregate lines to order grain first, then LEFT JOIN from headers so an order with zero lines cannot silently disappear from the audit.
Audit 2: Returns Reference Real Lines and Never Exceed Them #
Task: validate ecom_returns against ecom_order_items in one pass. Output a single row with two columns:
orphan_returns: returns whoseorder_item_iddoes not exist inecom_order_itemsover_refunded_returns: returns whosereturn_amountexceeds the matched line'sline_totalby more than 0.01
Both must be 0. A refund larger than the line it refunds is how negative net revenue sneaks into dashboards.
Audit 3: Orphan Facts #
Task: check both ends of the order chain in one row:
items_without_orders: line items whoseorder_iddoes not exist inecom_ordersorders_without_customers: orders whosecustomer_iddoes not exist inecom_customers
Both must be 0. The anti-join shape (LEFT JOIN ... WHERE right_key IS NULL) is the same one you eyeballed on the exploration page. Now it carries a contract.
Keep these three queries. In production they become scheduled tests: same SQL, run after every load, alert when any count leaves zero. An audit that only ran once is an anecdote — an audit that runs on every refresh is a contract.
Next Step #
The data is proven clean, so now put it to work: