Meet the Insurance Dataset
Before you write a single KPI query, meet the data. Analysts who skip this step ship wrong numbers with confidence.
The routine is always the same:
- Grain first: how many rows, and what does one row mean in each table?
- Ranges next: what time window does the data actually cover?
- Referential sanity: do the facts actually point at real dimension rows?
Every editor on this page is a sandbox. Nothing is graded, so run the queries, change them, break them.
1. Row Counts and Grain #
Start with raw row counts, then confirm the declared grain:
ins_customers: one row per customerins_policies: one row per policyins_premium_facts: one row per policy per monthins_claims: one row per claim
A count alone does not prove grain. Compare COUNT(*) against COUNT(DISTINCT key) to prove there are no duplicates.
2. Date Ranges #
Next question: what period does this book of business cover? Premium accrual, claim activity, and policy starts each have their own window — and they rarely line up perfectly.
3. Policy Status Mix #
Status mix tells you how much of the portfolio is still in force. It also warns you which filters matter later: a metric "for active policies" excludes a different population than a metric for the whole book.
4. Trace One Policy End-to-End #
Pick one policy and follow it through every table. Policy 1001 is an Auto policy that starts in January: it should show its full written premium in month one, a steady earned amount every month after, and one claim landing in February.
Tracing a single entity end-to-end is the fastest way to internalize how the fact tables relate to the dimension.
5. Referential Sanity #
Facts are only trustworthy if every fact row points at a real dimension row. The classic test is an anti-join: keep fact rows whose policy does not exist. An empty result is the pass.
6. Find the Oddity #
One policy in ins_premium_facts behaves differently from every other policy that started in the same month. Profile months-present per policy and find it — then explain why using what you learned in the status-mix section.
This 15-minute routine (counts, grain proof, date ranges, one end-to-end trace, referential anti-joins) works on any unfamiliar dataset, not just this one. Do it before every new project and you will catch fanout bugs and missing-data surprises before they reach a dashboard.
Next Step #
Now formalize what you just eyeballed:
- Data Quality Audit — graded checks that prove the dataset is clean.