Practice

Capstone: clean & reconcile

Business brief. You have been handed a dirty export from two systems - messy_customers and messy_orders - and asked to report numbers leadership can actually trust. The export is a mess: the same person appears several times under different casing and spacing, countries are free text ('US', 'usa', 'United States', 'U.S.A.'), signup dates come in three formats plus blanks, amounts carry currency noise, one order is entered twice, and a couple of orders point at customers who do not exist (or no customer at all).

Your job: clean it, then reconcile it. First collapse the export to one trustworthy row per real person. Then join orders to that clean list and produce a data-quality report so the reader sees the rows a naive JOIN would silently swallow or double-count.

Tables: messy_customers, messy_orders.

Task 1: Full clean pipeline - one row per real person #

Task: collapse messy_customers to exactly one clean row per real person and standardize their attributes. The same person appears under casing/spacing variants (' Ana Costa ', 'ANA COSTA'), so the de-duplication key is the normalized name TRIM(LOWER(full_name)). Keep the earliest record for each person (smallest id) as the canonical row, and report that canonical row's standardized attributes.

Output one row per person with:

  • id - the smallest id for that normalized name (the canonical record).
  • clean_name - the normalized name: trimmed and lowercased (TRIM(LOWER(full_name)) -> 'ana costa'). Most teams title-case for display, but DuckDB has no INITCAP, so the canonical normalized form is lowercase (and it matches the dedup key exactly).
  • country_code - standardized to 'US', 'CA', or NULL (map 'US'/'usa'/'United States'/'U.S.A.' -> 'US'; 'CA'/'ca'/'canada'/'Canada' -> 'CA'; NULL stays NULL) from the canonical row.
  • signup_month - the canonical row's signup date parsed and truncated to the first of its month (NULL if that row's date was blank/unknown).

Order by id. There are 5 real people, so you must return exactly 5 rows.

The skill: standardize per row first (normalize the name, map the country, parse the date), then dedupe by MIN(id) per normalized name and join back to pick the canonical row's clean attributes. The country mapping is robust if you strip punctuation and spaces before comparing - 'U.S.A.' and 'United States' both fold to a token your CASE can match.

Loading SQL editor...

Task 2: Reconcile orders against the clean customer list #

Task: before anyone trusts a revenue join, surface the rows that would corrupt it. Reconcile messy_orders against the customer list and emit a small data-quality report - one row per problem - so the reader sees exactly what a naive JOIN orders ... customers would either silently drop or double-count.

Flag three issue types:

  • 'orphan_customer' - the order's customer_id is present but does not exist in messy_customers (the seed has one pointing at id 99).
  • 'null_customer' - the order has a NULL customer_id (no customer at all).
  • 'duplicate_order' - an exact re-entry of an earlier order (same customer_id, amount, status, order_ts); flag the later row, keep the first.

Output issue_type and order_id, ordered by issue_type, then order_id. There are exactly 3 violations, so the report has 3 rows.

The skill: an anti-join (NOT IN / NOT EXISTS) finds orphans; an explicit IS NULL branch catches the NULL key (NOT IN would not flag a NULL key on its own - mind the three-valued logic); and ROW_NUMBER() OVER (PARTITION BY the de-noised columns) lets you keep the first copy of a duplicate and flag the rest. A plain inner join would just drop the orphan and the NULL key without telling you, and double-count the duplicate - this report makes all three visible.

Loading SQL editor...

Findings note #

Once both tasks pass, write two or three sentences a reviewer could act on:

  • Customer count. The raw export has 10 rows but only 5 real people - any per-customer metric built on the raw table is inflated 2x. State the dedup key you trusted (TRIM(LOWER(full_name))) and one risk it carries (two genuinely different people with the same name would wrongly merge).
  • Order integrity. Of 8 orders, 3 are unsafe to join naively: one orphan (customer_id 99, no such customer), one with a NULL customer, and one exact duplicate that would double-count revenue. Note which numbers you would not publish until these are resolved with the source system.
Warning

A naive JOIN orders TO customers would have silently dropped the orphan and the NULL-key order (so revenue looks lower than reality) and double-counted the duplicate (so one customer's January looks twice as big). Clean and reconcile before you aggregate - the violations a reconciliation report makes visible are exactly the ones an inner join hides.