Dates & types — five drills
Five data-cleaning drills, ordered easy to hard. The data is deliberately dirty: dates live in three different string formats, amounts carry currency noise and the word 'free', statuses have stray case and trailing spaces, and one order is entered twice. The default editor shows a working solution - try clearing it and writing your own first; the hidden checks will grade what's there when you press Run.
Tables: messy_customers, messy_orders.
The core DuckDB tools here: TRY_STRPTIME(str, '%fmt') (returns NULL instead of erroring when the string does not match the format - chain several inside COALESCE to accept multiple formats), TRY_CAST (NULL instead of error on a bad cast), REPLACE (strip $ and ,), and TRIM / LOWER to fold messy text before grouping.
1. Safe-parse a multi-format date column (Easy) #
Task: messy_customers.signup_date is a VARCHAR holding three different formats - ISO '2024-01-05', US '01/05/2024', and slash '2024/03/01' - plus '' and 'unknown'. Parse it into a real DATE. Return id and the parsed signup_date (a true DATE, NULL for the unparseable '' and 'unknown' rows). Order by id.
The skill: TRY_STRPTIME(text, format) parses a string with a strptime format and returns NULL (instead of raising) when the string does not match. To accept several formats, wrap them in COALESCE - the first one that parses wins, and a value that matches none falls through to NULL. Cast the result to DATE to drop the time component.
2. Signups per month - watch the denominator (Easy) #
Task: count signups per month using the parsed date, dropping the rows whose date could not be parsed. Output signup_month (a DATE, first of the month) and signups. Order by signup_month.
The skill: parse first (as in drill 1), then DATE_TRUNC('month', ...) and GROUP BY. The lesson hiding in the denominator: of 10 customers, only 8 have a usable date, so a per-month total built on COUNT(*) over the raw table would silently mis-attribute the two bad rows. Filtering WHERE parsed IS NOT NULL makes the dropped rows explicit - your monthly counts add to 8, not 10.
3. Strip currency noise and cast to DECIMAL (Easy) #
Task: messy_orders.amount is a VARCHAR mixing '$1,299.00', plain '59.98', the word 'free', and ''. Clean it to a numeric DECIMAL. Return id and the cleaned amount (NULL for the unparseable 'free' and '' rows). Order by id.
The skill: strip the noise, then try the cast. REPLACE(REPLACE(amount, '$', ''), ',', '') removes the dollar sign and thousands comma; TRY_CAST(... AS DECIMAL(12, 2)) turns the cleaned string into a number and returns NULL (instead of erroring) when the string still is not numeric - which is exactly what you want for 'free' and ''. A plain CAST would abort the whole query on the first bad row.
4. Classify parseable vs sentinel values (Medium) #
Task: build a data-quality report that labels each order's amount as 'numeric' or 'unparseable'. Output id and amount_kind. Order by id.
The skill: TRY_CAST(... IS NULL) is a clean boolean probe for "can this value be coerced to a number?" Strip the currency noise first (same as drill 3), then a CASE on whether the TRY_CAST came back NULL tells you which rows are genuinely numeric versus sentinels like 'free' or ''. This is the report you would run before trusting a column, so you know how many rows you will lose to coercion.
5. Monthly clean revenue, end to end (Hard) #
Task: produce trustworthy monthly paid revenue from messy_orders. Combine everything: parse amount to a number, fold status with TRIM(LOWER(...)) and keep only 'paid', drop the exact-duplicate order (row 207 is a re-entry of row 201 - same customer, amount, status, timestamp), and bucket by the order month (cast order_ts to a DATE, then DATE_TRUNC). Output month_start (DATE, first of the month) and paid_revenue (rounded to 2 decimals). Order by month_start.
The skill: real revenue numbers need every cleaning step in the right order. The trap is the duplicate - a naive SUM counts order 201 twice and overstates January. SELECT DISTINCT over the de-noised columns (without id) collapses the re-entry to one row before you aggregate. Status needs TRIM(LOWER(...)) because the seed has 'Paid', 'paid ', and 'PAID' that must all count as paid.
The cleaning patterns to remember:
COALESCE(TRY_STRPTIME(d, fmt1), TRY_STRPTIME(d, fmt2), ...)parses a column that holds several date formats - the first format that matches wins, unmatched values fall through toNULL.TRY_CASToverCASTon dirty columns - one bad row aborts a plainCAST;TRY_CASTreturnsNULLand keeps the query alive.- Strip before you cast -
REPLACE(REPLACE(amount, '$', ''), ',', '')removes currency noise so the cast has a chance. TRIM(LOWER(...))folds case and trailing-space variants ('Paid','paid ','PAID') into one comparable value before filtering or grouping.SELECT DISTINCTover the de-noised columns removes exact re-entries before aggregation - a duplicate row silently doubles aSUM.- Watch the denominator - dropping unparseable rows changes how many rows feed an average or a per-month count. Make the drop explicit.