Deduping
Before you dedupe, answer the harder question: what counts as a duplicate?
- Exact duplicates: every column the same.
- Key-duplicates: same primary key, different values in other columns (a version history).
- Fuzzy duplicates: effectively the same after normalization, like
" Alice "and"alice".
The right tool depends on which one you have.
Three Tools, One Decision #
| Tool | Best for | Limitation |
|---|---|---|
SELECT DISTINCT | Exact duplicates on all selected columns | Can't pick which duplicate to keep; can't see non-key columns |
GROUP BY + aggregate | Key-duplicates where you want a summary (sum, max, latest) | You lose row-level detail; only group-level results come out |
ROW_NUMBER() + WHERE rn = 1 | Key-duplicates where you want one specific row to survive | More syntax, but the only option when you need full row detail |
Start with DISTINCT or GROUP BY. Reach for ROW_NUMBER when those can't express the "keep exactly this one" logic.
DISTINCT #
-- Distinct list of user_ids that have any paid order
SELECT DISTINCT user_id
FROM orders
WHERE status = 'paid';DISTINCT operates on the entire selected row, not a subset. SELECT DISTINCT user_id, total dedupes on both columns, not just user_id. That's the classic beginner trap: you added a column "just to see it" and suddenly your distinct count changed.
GROUP BY + Aggregate #
When you have duplicates on a key and want one summary row per key, GROUP BY is usually cleanest.
-- One row per user: latest order timestamp and total count
SELECT
user_id,
MAX(created_at) AS last_order_at,
COUNT(*) AS order_count
FROM orders
GROUP BY user_id;The catch: you only get columns you aggregated or grouped. If you want the row itself where created_at is max, with all its other columns too, you need ROW_NUMBER.
ROW_NUMBER: Latest-Row Pattern #
The canonical "keep the latest row per key" query. Partition by the key, order by your tie-breaker, keep rn = 1.
-- Keep the most recent order per user, full row detail
WITH ranked AS (
SELECT
*,
ROW_NUMBER() OVER (
PARTITION BY user_id
ORDER BY created_at DESC, id DESC
) AS rn
FROM orders
)
SELECT id, user_id, total, status, created_at
FROM ranked
WHERE rn = 1;The , id DESC tie-breaker is deliberate: if two orders for a user share the exact same created_at, the database would otherwise pick arbitrarily. Always make the sort deterministic.
DuckDB and Snowflake support QUALIFY — the one-line version.
SELECT id, user_id, total, status, created_at
FROM orders
QUALIFY ROW_NUMBER() OVER (PARTITION BY user_id ORDER BY created_at DESC, id DESC) = 1;Same result as the CTE version. If you're writing portable SQL, stick with the CTE; if the engine supports QUALIFY, use it, since it's much easier to read.
Fuzzy Duplicates #
"Alice" and " alice " are different strings but probably the same person. Dedupe on the normalized value, not the raw one:
-- Distinct customers after normalizing name
SELECT DISTINCT TRIM(LOWER(name)) AS normalized_name
FROM users;**SELECT DISTINCT doesn't see through whitespace or case.*
If your "duplicate" rows differ by a trailing space or a capital letter, DISTINCT treats them as separate. Normalize first (covered in the String Cleanup lesson), then dedupe. This is why cleanup and dedupe are paired lessons: one doesn't work without the other.
Try It #
A raw_events input with exact duplicates and fuzzy duplicates. Dedupe on a normalized email and keep the latest event per user.
Practice #
From the same raw_events input, keep exactly one row per normalized email — the latest by ts, tie-broken by highest id. Return id, email (the normalized value), event, ts. No ordering requirement.
Mistakes to Watch For #
- Deduping raw values without normalizing first. Whitespace and case differences hide duplicates from
DISTINCT. - Adding a column to
SELECT DISTINCT"to see it" and silently changing the dedupe semantics;DISTINCTlooks at every selected column. - Forgetting a tie-breaker in
ROW_NUMBER'sORDER BY, so which "latest" row wins becomes non-deterministic. - Using
ROW_NUMBERwhenGROUP BY+ aggregate would be simpler and faster. If you only needMAX(ts)per key, you don't need the full row. - Deduping before joining, then finding the join re-introduces duplicates via fan-out. Dedupe after the join when the join is the source of the duplication.
Knowledge check #
3 questions
For "find duplicate emails" the canonical pattern is:
For "keep one row per email, deleting the rest," the modern pattern is:
For tables that should never have duplicates, the long-term fix is:
Next Step #
You've finished the Data Cleaning section. Next up: Analytics Patterns, covering top-N per group, cohorts, funnels, and rolling metrics, built on everything you've learned so far.