Practice

Real-World Data Cleaning

Tutorials teach SQL on spotless toy tables. Real tables are not spotless. The single most common complaint from people moving from tutorials to a real job is the same: "I can write the query, but the data is a mess and nothing matches." This track is that mess, on purpose.

Every table here is deliberately dirty, and the dirt has a defined correct answer, so you can practice cleaning it and check yourself.

The messes you will meet #

  • Inconsistent text — the same value typed five ways: ' Ana Costa ', 'ANA COSTA', 'ana costa'. Fuzzy duplicates that only collapse after TRIM(LOWER(...)).
  • Missing-but-not-NULL — real NULL, empty string '', and sentinels like 'n/a' all meaning "no value". A naive WHERE email IS NULL misses two of the three.
  • Mixed date formats in a text column'2024-01-05', '01/05/2024', '2024/04/10', plus '' and 'unknown'. A plain CAST throws; you need TRY_STRPTIME with fallbacks.
  • Numbers stored as strings'$1,299.00', 'free', ''. Strip the noise, TRY_CAST, decide what to do with the unparseable.
  • Duplicates — the same row entered twice; sum it and your total is wrong.
  • Orphans & NULL keys — an order pointing at a customer that does not exist, and an order with no customer at all.

The one habit this track drills #

Look before you clean, and validate after. Every messy task has the same shape: inspect the raw values, decide the rule (what counts as missing? which date formats exist? what is the canonical form?), clean to that rule, then prove the cleaned output is right. The cleaning functions are easy; the judgment of what clean means is the skill interviewers and jobs actually test.

DuckDB cleaning toolkit #

You will lean on these throughout:

  • TRIM, LOWER / UPPER: normalize whitespace and case before comparing or grouping.
  • NULLIF(x, '') and COALESCE: fold empty strings and sentinels into real NULL.
  • REPLACE / regexp_replace: strip currency symbols, commas, stray characters.
  • TRY_CAST(x AS DECIMAL): convert without erroring; bad values become NULL.
  • TRY_STRPTIME(x, '%Y-%m-%d'): parse a date in a specific format, NULL if it does not match; chain several with COALESCE to handle mixed formats.
  • ROW_NUMBER() OVER (PARTITION BY ...): keep one row per de-duplicated key.

Track order #

  1. Meet the Messy Data — look at the dirt first (ungraded).
  2. Text & categories: casing, whitespace, missing-value rules, dedup, category standardization.
  3. Dates & types: parse mixed date formats safely, clean numbers-as-strings.
  4. Capstone: clean & reconcile — run the full pipeline and produce trustworthy numbers from a dirty export.