Practice

Text & categories: five drills

Five cleaning drills on the messy tables, ordered easy to hard. Each isolates one real-world cleaning skill. The default editor shows a working solution. Try clearing it and writing your own first; the hidden checks grade what's there when you press Run.

Tables: messy_customers, messy_orders. If you have not yet, run the Meet the Messy Data tour first; it documents every mess these drills target.

1. Normalize names: collapse the fuzzy duplicates (Easy) #

Task: messy_customers.full_name holds the same five people entered twice each, differing only in letter case and stray leading/trailing spaces. Return the distinct normalized names, one row per real person. Output a single column clean_name. Order by clean_name.

The skill: TRIM strips outer whitespace and LOWER flattens case, so ' Ana Costa ' and 'ANA COSTA' both become 'ana costa' and DISTINCT collapses them. Ten raw rows become five.

Loading SQL editor...

2. Count customers with no real email (Easy) #

Task: count how many customers truly have no email. An email counts as missing when it is any of: a real NULL, an empty string '', or the sentinel text 'n/a' (case-insensitive). Everything else is a real email. Return one row, one column missing_emails.

The skill: "missing" is a business definition, not a single SQL state. A plain email IS NULL undercounts here because '' and 'n/a' are non-NULL strings. Fold all three into one condition with a CASE (or with NULLIF/COALESCE). Of the ten rows, four are missing.

Loading SQL editor...

3. Dedupe to one row per person (Medium) #

Task: reduce messy_customers to one row per real person. For each normalized name (the TRIM(LOWER(full_name)) key from drill 1), keep the row with the lowest id. Output id, cleaned_name. Order by id.

The skill: ROW_NUMBER() OVER (PARTITION BY <key> ORDER BY id) numbers the rows within each duplicate group; keeping rn = 1 picks one survivor per group. This is the canonical dedupe pattern: the partition key defines "same entity," the ORDER BY decides which copy wins. Expect five rows: ids 1, 3, 5, 7, 9.

Loading SQL editor...

4. Standardize country to a canonical code (Medium) #

Task: country is free text. The same country appears as 'US', 'usa', 'United States', 'U.S.A.', and 'CA', 'canada', plus a NULL. Map each value to a canonical code with a CASE over LOWER(TRIM(country)), then count customers per canonical country. Use these codes:

US for us, usa, united states, u.s.a.
CA for ca, canada
UNKNOWN for anything else, including NULL

Output country_code, customers. Order by country_code. Expect CA = 4, UNKNOWN = 1, US = 5.

The skill: a CASE mapping turns dirty free-text categories into a controlled vocabulary. Normalizing with LOWER(TRIM(...)) first keeps the WHEN list short, and the ELSE branch is your safety net for unmapped values and NULL (which never matches an equality, so it falls through to ELSE).

Loading SQL editor...

5. Clean amounts and roll up revenue per status (Hard) #

Task: in messy_orders, three messes stack on top of each other:

status has case and trailing-space variants ('Paid', 'paid ', 'PAID', 'Refunded '). Normalize with TRIM(LOWER(status)).
amount is text with currency noise ('$1,299.00') and genuinely unparseable values ('free', ''). Strip $ and ,, then TRY_CAST to DECIMAL so unparseable values become NULL and drop out of the sum.
Rows 201 and 207 are the same order entered twice — count it once.

Return the revenue per canonical status, summing only parseable amounts and excluding the duplicate. Output status, revenue. Order by status.

Expect exactly two rows: paid = 1698.97 (1299.00 + 59.98 + 240.00 + 99.99) and refunded = 15.50. The pending and NULL-status rows have no parseable amount, so they contribute nothing and do not appear.

The skill: stacking cleanups in the right order. Dedupe first (so 207 cannot double-count), parse the amount with TRY_CAST (so bad values vanish instead of erroring), then filter to parseable amounts and group by the normalized status.

Loading SQL editor...

Tip

The cleaning patterns to remember:

TRIM(LOWER(col)) is the workhorse normalizer: it collapses casing and whitespace so fuzzy duplicates and category variants line up.
"Missing" is a definition, not a state. Decide whether NULL, '', and sentinels like 'n/a' all count, then handle them together with a CASE (or NULLIF/COALESCE).
ROW_NUMBER() OVER (PARTITION BY key ORDER BY tiebreak) + WHERE rn = 1 is the canonical dedupe: the partition defines "same entity," the ORDER BY picks the survivor.
A CASE mapping turns dirty free-text categories into a controlled vocabulary; let the ELSE branch absorb unmapped values and NULL.
TRY_CAST after REPLACE parses money safely — strip $ and ,, then TRY_CAST so unparseable text becomes NULL and drops out of a SUM instead of erroring.
Order your cleanups. Dedupe before you aggregate, parse before you filter. The sequence is part of the correctness.

Text & categories: five drills

1. Normalize names: collapse the fuzzy duplicates (Easy) #

2. Count customers with no real email (Easy) #

3. Dedupe to one row per person (Medium) #

4. Standardize country to a canonical code (Medium) #

5. Clean amounts and roll up revenue per status (Hard) #

Related In Real-World Data Cleaning