Real-World Data Cleaning
Tutorials teach SQL on spotless toy tables. Real tables are not spotless. The single most common complaint from people moving from tutorials to a real job is the same: "I can write the query, but the data is a mess and nothing matches." This track is that mess, on purpose.
Every table here is deliberately dirty, and the dirt has a defined correct answer, so you can practice cleaning it and check yourself.
The messes you will meet #
- Inconsistent text — the same value typed five ways:
' Ana Costa ','ANA COSTA','ana costa'. Fuzzy duplicates that only collapse afterTRIM(LOWER(...)). - Missing-but-not-NULL — real
NULL, empty string'', and sentinels like'n/a'all meaning "no value". A naiveWHERE email IS NULLmisses two of the three. - Mixed date formats in a text column —
'2024-01-05','01/05/2024','2024/04/10', plus''and'unknown'. A plainCASTthrows; you needTRY_STRPTIMEwith fallbacks. - Numbers stored as strings —
'$1,299.00','free',''. Strip the noise,TRY_CAST, decide what to do with the unparseable. - Duplicates — the same row entered twice; sum it and your total is wrong.
- Orphans & NULL keys — an order pointing at a customer that does not exist, and an order with no customer at all.
The one habit this track drills #
Look before you clean, and validate after. Every messy task has the same shape: inspect the raw values, decide the rule (what counts as missing? which date formats exist? what is the canonical form?), clean to that rule, then prove the cleaned output is right. The cleaning functions are easy; the judgment of what clean means is the skill interviewers and jobs actually test.
DuckDB cleaning toolkit #
You will lean on these throughout:
TRIM,LOWER/UPPER: normalize whitespace and case before comparing or grouping.NULLIF(x, '')andCOALESCE: fold empty strings and sentinels into realNULL.REPLACE/regexp_replace: strip currency symbols, commas, stray characters.TRY_CAST(x AS DECIMAL): convert without erroring; bad values becomeNULL.TRY_STRPTIME(x, '%Y-%m-%d'): parse a date in a specific format,NULLif it does not match; chain several withCOALESCEto handle mixed formats.ROW_NUMBER() OVER (PARTITION BY ...): keep one row per de-duplicated key.
Track order #
- Meet the Messy Data — look at the dirt first (ungraded).
- Text & categories: casing, whitespace, missing-value rules, dedup, category standardization.
- Dates & types: parse mixed date formats safely, clean numbers-as-strings.
- Capstone: clean & reconcile — run the full pipeline and produce trustworthy numbers from a dirty export.