Delete duplicate rows
Short answer: rank the rows with ROW_NUMBER() OVER (PARTITION BY <dup_columns> ORDER BY <preferred_first>), then delete every row with rank > 1. This pattern works on every major engine, gives you control over which row to keep (oldest, newest, by ID), and is safer than the older MIN(id)-by-grouping approach. Always run it inside a transaction the first time.
Canonical SQL #
-- The portable pattern — works on every engine that supports window functions.
DELETE FROM signups
WHERE id IN (
SELECT id FROM (
SELECT id,
ROW_NUMBER() OVER (
PARTITION BY email
ORDER BY signed_up_at, id -- keep the earliest; tie-break by id
) AS rn
FROM signups
) ranked
WHERE rn > 1
);
-- Postgres / DuckDB shorter form using ctid / rowid (no helper column needed).
DELETE FROM signups
WHERE ctid IN (
SELECT ctid FROM (
SELECT ctid,
ROW_NUMBER() OVER (PARTITION BY email ORDER BY signed_up_at) AS rn
FROM signups
) ranked
WHERE rn > 1
);
-- DuckDB / Snowflake / BigQuery / Databricks: with QUALIFY (cleaner).
-- Generally easier to rewrite as INSERT INTO new_table ... QUALIFY rn = 1, then swap.Example #
Common variants #
-- Keep the most recent row per email instead of the earliest.
DELETE FROM signups
WHERE id IN (
SELECT id FROM (
SELECT id,
ROW_NUMBER() OVER (
PARTITION BY email
ORDER BY signed_up_at DESC
) AS rn
FROM signups
) ranked
WHERE rn > 1
);
-- Multi-column dedup key (keep one row per lowercased name + email).
DELETE FROM users
WHERE id IN (
SELECT id FROM (
SELECT id,
ROW_NUMBER() OVER (
PARTITION BY LOWER(name), LOWER(email)
ORDER BY id
) AS rn
FROM users
) ranked
WHERE rn > 1
);
-- Safer "rebuild" pattern for huge tables — no DELETE at all.
CREATE TABLE signups_dedup AS
SELECT * FROM (
SELECT *, ROW_NUMBER() OVER (PARTITION BY email ORDER BY signed_up_at) AS rn
FROM signups
) WHERE rn = 1;
-- Inspect signups_dedup, then:
-- BEGIN; ALTER TABLE signups RENAME TO signups_old;
-- ALTER TABLE signups_dedup RENAME TO signups; COMMIT;When to dedup vs prevent dupes upstream #
Dedup as a one-time cleanup: fine. The patterns above are the right tool. Run inside a transaction, verify with SELECT before DELETE, document what you did.
Dedup as a recurring scheduled job: a sign that the data model is wrong. Add a UNIQUE constraint on the column that should be unique (UNIQUE(email) on users) and let the database reject duplicates at insert time. The constraint is one line; the dedup job is a quarterly debugging session. Always prefer prevention.
Dedup at insert via ON CONFLICT: the modern compromise. INSERT INTO users (email) VALUES (...) ON CONFLICT (email) DO NOTHING (Postgres / SQLite / DuckDB) or INSERT IGNORE (MySQL) handles the "what if a duplicate arrives?" question without needing a periodic cleanup. See INSERT and MERGE / UPSERT.
Dialect comparison #
| Engine | Syntax | Gotcha |
|---|---|---|
| PostgreSQL | DELETE … WHERE ctid IN (…) | ctid is Postgres-specific |
| MySQL | DELETE … WHERE id IN (ROW_NUMBER subquery) | needs MySQL 8+ for window functions |
| SQL Server | CTE + ROW_NUMBER(), then DELETE FROM cte WHERE rn > 1 | none |
| BigQuery | CREATE OR REPLACE TABLE … AS SELECT … QUALIFY rn = 1 | no row id, so rebuild instead of DELETE |
| Snowflake | rebuild with QUALIFY ROW_NUMBER() OVER (…) = 1 | none |
| Oracle | DELETE … WHERE ROWID IN (…) | none |
| SQLite | DELETE … WHERE rowid IN (…) | none |
| DuckDB | DELETE … WHERE rowid IN (…) | none |
Always run dedup inside a transaction the first time on production data. BEGIN; DELETE FROM signups WHERE ...; SELECT COUNT() FROM signups GROUP BY email HAVING COUNT() > 1; If any duplicates remain, you know the partition key was wrong; if too many rows are gone, you know the rank logic was wrong. ROLLBACK; and try again. Once the SELECT confirms exactly the right state, COMMIT;. Dedup queries that look obviously correct fail silently more often than almost any other class of DML, so the transaction is the seatbelt. Reference: BEGIN/COMMIT/ROLLBACK.