Lesson Intermediate

Deduping

Before you dedupe, answer the harder question: what counts as a duplicate?

  • Exact duplicates: every column the same.
  • Key-duplicates: same primary key, different values in other columns (a version history).
  • Fuzzy duplicates: effectively the same after normalization, like " Alice " and "alice".

The right tool depends on which one you have.

Three Tools, One Decision #

ToolBest forLimitation
SELECT DISTINCTExact duplicates on all selected columnsCan't pick which duplicate to keep; can't see non-key columns
GROUP BY + aggregateKey-duplicates where you want a summary (sum, max, latest)You lose row-level detail; only group-level results come out
ROW_NUMBER() + WHERE rn = 1Key-duplicates where you want one specific row to surviveMore syntax, but the only option when you need full row detail

Start with DISTINCT or GROUP BY. Reach for ROW_NUMBER when those can't express the "keep exactly this one" logic.

DISTINCT #

sql
-- Distinct list of user_ids that have any paid order
SELECT DISTINCT user_id
FROM orders
WHERE status = 'paid';

DISTINCT operates on the entire selected row, not a subset. SELECT DISTINCT user_id, total dedupes on both columns, not just user_id. That's the classic beginner trap: you added a column "just to see it" and suddenly your distinct count changed.

GROUP BY + Aggregate #

When you have duplicates on a key and want one summary row per key, GROUP BY is usually cleanest.

sql
-- One row per user: latest order timestamp and total count
SELECT
  user_id,
  MAX(created_at) AS last_order_at,
  COUNT(*) AS order_count
FROM orders
GROUP BY user_id;

The catch: you only get columns you aggregated or grouped. If you want the row itself where created_at is max, with all its other columns too, you need ROW_NUMBER.

ROW_NUMBER: Latest-Row Pattern #

The canonical "keep the latest row per key" query. Partition by the key, order by your tie-breaker, keep rn = 1.

sql
-- Keep the most recent order per user, full row detail
WITH ranked AS (
  SELECT
    *,
    ROW_NUMBER() OVER (
      PARTITION BY user_id
      ORDER BY created_at DESC, id DESC
    ) AS rn
  FROM orders
)
SELECT id, user_id, total, status, created_at
FROM ranked
WHERE rn = 1;

The , id DESC tie-breaker is deliberate: if two orders for a user share the exact same created_at, the database would otherwise pick arbitrarily. Always make the sort deterministic.

Info

DuckDB and Snowflake support QUALIFY — the one-line version.

sql
SELECT id, user_id, total, status, created_at
FROM orders
QUALIFY ROW_NUMBER() OVER (PARTITION BY user_id ORDER BY created_at DESC, id DESC) = 1;

Same result as the CTE version. If you're writing portable SQL, stick with the CTE; if the engine supports QUALIFY, use it, since it's much easier to read.

Fuzzy Duplicates #

"Alice" and " alice " are different strings but probably the same person. Dedupe on the normalized value, not the raw one:

sql
-- Distinct customers after normalizing name
SELECT DISTINCT TRIM(LOWER(name)) AS normalized_name
FROM users;
Warning

**SELECT DISTINCT doesn't see through whitespace or case.*
If your "duplicate" rows differ by a trailing space or a capital letter, DISTINCT treats them as separate. Normalize first (covered in the String Cleanup lesson), then dedupe. This is why cleanup and dedupe are paired lessons: one doesn't work without the other.

Try It #

A raw_events input with exact duplicates and fuzzy duplicates. Dedupe on a normalized email and keep the latest event per user.

Loading SQL editor...

Practice #

From the same raw_events input, keep exactly one row per normalized email — the latest by ts, tie-broken by highest id. Return id, email (the normalized value), event, ts. No ordering requirement.

Loading SQL editor...

Mistakes to Watch For #

  • Deduping raw values without normalizing first. Whitespace and case differences hide duplicates from DISTINCT.
  • Adding a column to SELECT DISTINCT "to see it" and silently changing the dedupe semantics; DISTINCT looks at every selected column.
  • Forgetting a tie-breaker in ROW_NUMBER's ORDER BY, so which "latest" row wins becomes non-deterministic.
  • Using ROW_NUMBER when GROUP BY + aggregate would be simpler and faster. If you only need MAX(ts) per key, you don't need the full row.
  • Deduping before joining, then finding the join re-introduces duplicates via fan-out. Dedupe after the join when the join is the source of the duplication.

Knowledge check #

3 questions

0 / 3 answered
  1. For "find duplicate emails" the canonical pattern is:

  2. For "keep one row per email, deleting the rest," the modern pattern is:

  3. For tables that should never have duplicates, the long-term fix is:

Next Step #

You've finished the Data Cleaning section. Next up: Analytics Patterns, covering top-N per group, cohorts, funnels, and rolling metrics, built on everything you've learned so far.