Meet the Messy Data
Real data arrives dirty. Names are typed five different ways, "no email" hides behind blanks and the word n/a, dates come in whatever format the source system felt like, and money shows up as text with dollar signs and commas. Before you write a single UPDATE or transform, you profile the mess so your cleaning has a target.
Two tables drive this whole track:
messy_customers: people, but the same person entered several times with different casing and spacing.messy_orders: purchases, with amounts stored as text and a couple of rows that should not be trusted.
Every editor below is a sandbox. Nothing is graded, so run the queries, change them, break them. The one rule of data cleaning: always look before you clean.
1. Look at Everything First #
Before profiling individual columns, just look. SELECT * on a small table is the single most useful first move: you see the shape, the column types, and the obvious problems in one glance.
Read every cell in both tables below. You will already start spotting trouble: full_name values with stray spaces, an email that just says n/a, dates that do not match each other, and amounts wearing dollar signs.
2. Casing and Whitespace Hide Duplicates #
messy_customers looks like ten customers. It is actually five people, each entered twice. The duplicates do not match exactly: they differ only in letter case and in leading/trailing spaces, so a naive SELECT DISTINCT full_name still returns all ten.
The fix that collapses fuzzy duplicates is TRIM (strips outer whitespace) combined with LOWER (normalizes case). Run the query and compare the raw names against the normalized ones side by side.
3. "No Email" Wears Three Disguises #
If your boss asks "how many customers have no email," the honest answer depends on agreeing what missing means. In this table a missing email shows up three different ways:
- a real
NULL, - an empty string
'', - the sentinel text
'n/a'that some upstream form wrote instead of leaving the field blank.
A plain WHERE email IS NULL catches only one of the three and undercounts. List every email next to a flag that treats all three as missing.
4. Three Date Formats in One Column #
signup_date is stored as text (VARCHAR), and the source systems disagreed on format. You will find all of these in the same column:
- ISO
yyyy-mm-dd(e.g.2024-01-05) - US
mm/dd/yyyy(e.g.01/05/2024) - slash
yyyy/mm/dd(e.g.2024/03/01)
plus an empty string and the literal word unknown. A single CAST(... AS DATE) would error or return NULL on most of these. The DuckDB tool for "try each format, give NULL if none fit" is TRY_STRPTIME(string, '%format'), which returns NULL instead of throwing. Chain a few with COALESCE and look at which rows resolve.
5. Money Stored as Text, Plus Rows You Cannot Trust #
Now messy_orders. Two separate problems live here.
Amount is text with currency noise. Values like $1,299.00 carry a dollar sign and a thousands comma, so they will not CAST to a number directly. Strip the $ and , with REPLACE, then TRY_CAST to DECIMAL, and notice that free and the empty string '' are genuinely unparseable and land as NULL.
Some rows should not be trusted. Row 205 has a NULL customer_id; row 206 points at customer 99, who does not exist in messy_customers (an orphan); and rows 201 and 207 are the same order entered twice (identical customer, amount, status, and timestamp). Profile the amounts first.
Before you aggregate anything, hunt the untrustworthy rows directly. The query below finds the orphan, the NULL link, and the exact duplicate in one pass: the rows you will have to exclude or dedupe in the graded drills.
The look-before-you-clean checklist:
- **
SELECT* first: eyeball the shape and the obvious problems before writing transforms. - Casing + whitespace hide duplicates;
TRIM(LOWER(col))is the great collapser. - "Missing" has many disguises:
NULL,'', and sentinels like'n/a'. Agree on what missing means, then handle all of them with aCASE. - Dates as text rarely share one format;
TRY_STRPTIME(col, '%fmt')+COALESCEparses each row by trying formats in order. - Money as text carries
$and,;REPLACEthem away, thenTRY_CASTso unparseable values become NULL instead of errors. - Not every row is trustworthy — find duplicates with
GROUP BY ... HAVING COUNT() > 1and orphans with an anti-join, before* you aggregate.
Next Step #
Now turn the eyeballing into graded transforms:
- Text & categories: five drills: normalize names, define "missing email," dedupe to one row per person, standardize country codes, and clean order amounts.