Text & categories: five drills
Five cleaning drills on the messy tables, ordered easy to hard. Each isolates one real-world cleaning skill. The default editor shows a working solution. Try clearing it and writing your own first; the hidden checks grade what's there when you press Run.
Tables: messy_customers, messy_orders. If you have not yet, run the Meet the Messy Data tour first; it documents every mess these drills target.
1. Normalize names: collapse the fuzzy duplicates (Easy) #
Task: messy_customers.full_name holds the same five people entered twice each, differing only in letter case and stray leading/trailing spaces. Return the distinct normalized names, one row per real person. Output a single column clean_name. Order by clean_name.
The skill: TRIM strips outer whitespace and LOWER flattens case, so ' Ana Costa ' and 'ANA COSTA' both become 'ana costa' and DISTINCT collapses them. Ten raw rows become five.
2. Count customers with no real email (Easy) #
Task: count how many customers truly have no email. An email counts as missing when it is any of: a real NULL, an empty string '', or the sentinel text 'n/a' (case-insensitive). Everything else is a real email. Return one row, one column missing_emails.
The skill: "missing" is a business definition, not a single SQL state. A plain email IS NULL undercounts here because '' and 'n/a' are non-NULL strings. Fold all three into one condition with a CASE (or with NULLIF/COALESCE). Of the ten rows, four are missing.
3. Dedupe to one row per person (Medium) #
Task: reduce messy_customers to one row per real person. For each normalized name (the TRIM(LOWER(full_name)) key from drill 1), keep the row with the lowest id. Output id, cleaned_name. Order by id.
The skill: ROW_NUMBER() OVER (PARTITION BY <key> ORDER BY id) numbers the rows within each duplicate group; keeping rn = 1 picks one survivor per group. This is the canonical dedupe pattern: the partition key defines "same entity," the ORDER BY decides which copy wins. Expect five rows: ids 1, 3, 5, 7, 9.
4. Standardize country to a canonical code (Medium) #
Task: country is free text. The same country appears as 'US', 'usa', 'United States', 'U.S.A.', and 'CA', 'canada', plus a NULL. Map each value to a canonical code with a CASE over LOWER(TRIM(country)), then count customers per canonical country. Use these codes:
USforus,usa,united states,u.s.a.CAforca,canadaUNKNOWNfor anything else, includingNULL
Output country_code, customers. Order by country_code. Expect CA = 4, UNKNOWN = 1, US = 5.
The skill: a CASE mapping turns dirty free-text categories into a controlled vocabulary. Normalizing with LOWER(TRIM(...)) first keeps the WHEN list short, and the ELSE branch is your safety net for unmapped values and NULL (which never matches an equality, so it falls through to ELSE).
5. Clean amounts and roll up revenue per status (Hard) #
Task: in messy_orders, three messes stack on top of each other:
statushas case and trailing-space variants ('Paid','paid ','PAID','Refunded '). Normalize withTRIM(LOWER(status)).amountis text with currency noise ('$1,299.00') and genuinely unparseable values ('free',''). Strip$and,, thenTRY_CASTtoDECIMALso unparseable values become NULL and drop out of the sum.- Rows 201 and 207 are the same order entered twice — count it once.
Return the revenue per canonical status, summing only parseable amounts and excluding the duplicate. Output status, revenue. Order by status.
Expect exactly two rows: paid = 1698.97 (1299.00 + 59.98 + 240.00 + 99.99) and refunded = 15.50. The pending and NULL-status rows have no parseable amount, so they contribute nothing and do not appear.
The skill: stacking cleanups in the right order. Dedupe first (so 207 cannot double-count), parse the amount with TRY_CAST (so bad values vanish instead of erroring), then filter to parseable amounts and group by the normalized status.
The cleaning patterns to remember:
TRIM(LOWER(col))is the workhorse normalizer: it collapses casing and whitespace so fuzzy duplicates and category variants line up.- "Missing" is a definition, not a state. Decide whether
NULL,'', and sentinels like'n/a'all count, then handle them together with aCASE(orNULLIF/COALESCE). ROW_NUMBER() OVER (PARTITION BY key ORDER BY tiebreak)+WHERE rn = 1is the canonical dedupe: the partition defines "same entity," the ORDER BY picks the survivor.- A
CASEmapping turns dirty free-text categories into a controlled vocabulary; let theELSEbranch absorb unmapped values andNULL. TRY_CASTafterREPLACEparses money safely — strip$and,, thenTRY_CASTso unparseable text becomes NULL and drops out of aSUMinstead of erroring.- Order your cleanups. Dedupe before you aggregate, parse before you filter. The sequence is part of the correctness.