Data Quality Audit
Auditing before analyzing is the job. Every loss ratio, severity, and retention number you ship later inherits whatever is wrong in the tables today, and "the dashboard was wrong because the data was wrong" is not a defense anyone accepts.
A professional audit is not "look around and see if anything seems off." It is a set of named invariants the data must satisfy, each one written as a query whose output proves the invariant holds:
- Every claim lands in a month where its policy actually earned premium.
- No claim has paid more than it has incurred.
- The premium booked on each policy reconciles with the premium fact table.
All three tasks are graded. For the first two, the dataset is clean: your job is to prove it with a count that comes back zero. The third returns a full reconciliation report that would expose drift the moment anyone touches the numbers.
Audit 1: Claims Without an Earned-Premium Month #
A claim dated in a month where its policy has no row in ins_premium_facts is a red flag: either premium accrual is missing, or the claim is booked against the wrong policy or date. Loss-ratio math silently breaks on these rows.
Write the anti-join: for each claim, look for a premium-fact row with the same policy_id and a month_date equal to the claim's calendar month (date_trunc the claim date and cast it back to DATE). Count the claims that find no match.
Output contract: a single row with one column, orphan_claim_months. On this dataset the count must be 0: that zero is the audit passing, not a boring result.
Audit 2: Paid Exceeding Incurred #
incurred_amount is the insurer's total estimated cost of a claim; paid_amount is the cash that has actually gone out. Paid can lag incurred for months, but it can never legitimately exceed it. A claim with paid > incurred means a reserving error or a data-entry bug, and it understates every loss metric downstream.
Output contract: a single row with one column, overpaid_claims. The seed is clean, so prove it: the count must be 0.
Audit 3: Written Premium Reconciliation #
Two systems claim to know each policy's written premium: ins_policies.annual_premium (the amount booked on the policy) and SUM(written_premium) in ins_premium_facts (the amount that hit the fact table). Finance calls comparing them a reconciliation, and it runs before every reporting cycle.
Build the reconciliation report at policy grain:
- aggregate
ins_premium_factsto onebooked_writtentotal per policy first (never join raw fact rows to policies and sum, because that repeatsannual_premiumonce per month and fabricates mismatches), - start from
ins_policieswith a LEFT JOIN andCOALESCEthe total to0, so a policy missing from the fact table entirely surfaces as a full-premium mismatch instead of silently dropping out, - compute
premium_diff = annual_premium - booked_written, rounded to 2 decimals.
Output contract: one row per policy with policy_id, annual_premium, booked_written, premium_diff. Today every premium_diff is 0.00 within the 0.01 tolerance, but this exact query is what catches the drift the day someone reloads the fact table wrong.
Keep these three queries. Real analytics teams run exactly this kind of audit as scheduled checks: every query encodes one invariant, and a non-zero count (or non-zero diff) pages someone before a bad number reaches a dashboard. "Prove it's clean" is a deliverable, not a formality.
Next Step #
The data is certified clean, so now you can compute on it with confidence: