Recipe

Delete duplicate rows

Short answer: rank the rows with ROW_NUMBER() OVER (PARTITION BY <dup_columns> ORDER BY <preferred_first>), then delete every row with rank > 1. This pattern works on every major engine, gives you control over which row to keep (oldest, newest, by ID), and is safer than the older MIN(id)-by-grouping approach. Always run it inside a transaction the first time.

Canonical SQL #

sql

-- The portable pattern — works on every engine that supports window functions.
DELETE FROM signups
WHERE id IN (
  SELECT id FROM (
    SELECT id,
           ROW_NUMBER() OVER (
             PARTITION BY email
             ORDER BY signed_up_at, id        -- keep the earliest; tie-break by id
           ) AS rn
    FROM signups
  ) ranked
  WHERE rn > 1
);

-- Postgres / DuckDB shorter form using ctid / rowid (no helper column needed).
DELETE FROM signups
WHERE ctid IN (
  SELECT ctid FROM (
    SELECT ctid,
           ROW_NUMBER() OVER (PARTITION BY email ORDER BY signed_up_at) AS rn
    FROM signups
  ) ranked
  WHERE rn > 1
);

-- DuckDB / Snowflake / BigQuery / Databricks: with QUALIFY (cleaner).
-- Generally easier to rewrite as INSERT INTO new_table ... QUALIFY rn = 1, then swap.

Example #

Loading SQL editor...

Common variants #

sql

-- Keep the most recent row per email instead of the earliest.
DELETE FROM signups
WHERE id IN (
  SELECT id FROM (
    SELECT id,
           ROW_NUMBER() OVER (
             PARTITION BY email
             ORDER BY signed_up_at DESC
           ) AS rn
    FROM signups
  ) ranked
  WHERE rn > 1
);

-- Multi-column dedup key (keep one row per lowercased name + email).
DELETE FROM users
WHERE id IN (
  SELECT id FROM (
    SELECT id,
           ROW_NUMBER() OVER (
             PARTITION BY LOWER(name), LOWER(email)
             ORDER BY id
           ) AS rn
    FROM users
  ) ranked
  WHERE rn > 1
);

-- Safer "rebuild" pattern for huge tables — no DELETE at all.
CREATE TABLE signups_dedup AS
SELECT * FROM (
  SELECT *, ROW_NUMBER() OVER (PARTITION BY email ORDER BY signed_up_at) AS rn
  FROM signups
) WHERE rn = 1;

-- Inspect signups_dedup, then:
-- BEGIN; ALTER TABLE signups RENAME TO signups_old;
--        ALTER TABLE signups_dedup RENAME TO signups; COMMIT;

When to dedup vs prevent dupes upstream #

Dedup as a one-time cleanup: fine. The patterns above are the right tool. Run inside a transaction, verify with SELECT before DELETE, document what you did.

Dedup as a recurring scheduled job: a sign that the data model is wrong. Add a UNIQUE constraint on the column that should be unique (UNIQUE(email) on users) and let the database reject duplicates at insert time. The constraint is one line; the dedup job is a quarterly debugging session. Always prefer prevention.

Dedup at insert via ON CONFLICT: the modern compromise. INSERT INTO users (email) VALUES (...) ON CONFLICT (email) DO NOTHING (Postgres / SQLite / DuckDB) or INSERT IGNORE (MySQL) handles the "what if a duplicate arrives?" question without needing a periodic cleanup. See INSERT and MERGE / UPSERT.

Dialect comparison #

Engine	Syntax	Gotcha
PostgreSQL	`DELETE … WHERE ctid IN (…)`	`ctid` is Postgres-specific
MySQL	`DELETE … WHERE id IN (ROW_NUMBER subquery)`	needs MySQL 8+ for window functions
SQL Server	CTE + `ROW_NUMBER()`, then `DELETE FROM cte WHERE rn > 1`	none
BigQuery	`CREATE OR REPLACE TABLE … AS SELECT … QUALIFY rn = 1`	no row id, so rebuild instead of `DELETE`
Snowflake	rebuild with `QUALIFY ROW_NUMBER() OVER (…) = 1`	none
Oracle	`DELETE … WHERE ROWID IN (…)`	none
SQLite	`DELETE … WHERE rowid IN (…)`	none
DuckDB	`DELETE … WHERE rowid IN (…)`	none

Warning

Always run dedup inside a transaction the first time on production data. BEGIN; DELETE FROM signups WHERE ...; SELECT COUNT() FROM signups GROUP BY email HAVING COUNT() > 1; If any duplicates remain, you know the partition key was wrong; if too many rows are gone, you know the rank logic was wrong. ROLLBACK; and try again. Once the SELECT confirms exactly the right state, COMMIT;. Dedup queries that look obviously correct fail silently more often than almost any other class of DML, so the transaction is the seatbelt. Reference: BEGIN/COMMIT/ROLLBACK.