Join Debugging: Duplicates & Missing Rows
Almost every join bug in production shows up as one of two symptoms:
- The row count ballooned. Your
SUMis suddenly double, yourCOUNTis 4x what it should be. That's fan-out from joining on a non-unique key. - Rows you expected are gone: customers are missing, dates are skipped. That's usually a filter in
WHEREsilently converting aLEFT JOINinto anINNER JOIN.
This lesson covers how to diagnose both, and the checklist to run before trusting any join.
Symptom 1: Duplicates from Fan-Out #
A join multiplies rows whenever the join key is not unique on one side. Join users (one row per user) to orders (many rows per user) on user_id, and each user appears once per order. That's usually fine, until you SUM a user-level column over the joined rows.
-- BUG: users.created_at is counted once per order, not once per user.
-- If a user has 5 orders, their signup counts 5 times.
SELECT COUNT(DISTINCT u.id) AS user_count
FROM users AS u
INNER JOIN orders AS o ON o.user_id = u.id;
-- ^ This is the fix: COUNT(DISTINCT) ignores the fan-out.
-- The silent bug is COUNT(u.id) or COUNT(*) over the same join.How to Diagnose #
Check whether your join key is unique on the side you think it is:
-- Does orders.user_id have duplicates? (It should — one user, many orders.)
SELECT user_id, COUNT(*) AS rows
FROM orders
GROUP BY user_id
HAVING COUNT(*) > 1
ORDER BY rows DESC;If the side you expected to be "1" in the 1-to-many has duplicates (say, users.id), you've got a data quality issue or you're joining on the wrong column. If the "many" side has duplicates, that's by design; now check whether your aggregates handle it.
How to Fix #
Three fixes, in order of preference:
- Aggregate first, then join. Collapse the many-side to one row per key in a subquery or CTE, then join that.
SELECT DISTINCT/COUNT(DISTINCT)when you only need uniqueness, not sums.- Narrow the join with a stricter
ONcondition (e.g., add a status or date filter).
(The fix below uses WITH. That's a CTE, a named subquery covered properly in a later lesson. For now, read user_orders as a named intermediate result.)
-- Fix by aggregating first: one row per user before joining
WITH user_orders AS (
SELECT user_id, COUNT(*) AS order_count, SUM(total) AS revenue
FROM orders
GROUP BY user_id
)
SELECT u.id, u.email, uo.order_count, uo.revenue
FROM users AS u
LEFT JOIN user_orders AS uo ON uo.user_id = u.id;Symptom 2: Missing Rows from WHERE on a LEFT JOIN #
A filter on a right-side column in WHERE silently converts LEFT JOIN into INNER JOIN.
Unmatched rows have NULL for every right-side column. NULL = 'paid' is UNKNOWN, which WHERE treats as false — so those rows disappear. This is the single most common join bug in analytics.
-- BUG: users with zero paid orders are silently dropped.
-- The LEFT JOIN did its job; the WHERE undid it.
SELECT u.id, u.email, COUNT(o.id) AS paid_orders
FROM users AS u
LEFT JOIN orders AS o ON o.user_id = u.id
WHERE o.status = 'paid'
GROUP BY u.id, u.email;
-- FIX: move the condition into ON so it filters the join, not the result.
SELECT u.id, u.email, COUNT(o.id) AS paid_orders
FROM users AS u
LEFT JOIN orders AS o
ON o.user_id = u.id
AND o.status = 'paid'
GROUP BY u.id, u.email;ON vs WHERE: The Rule #
| Join type | ON and WHERE behave... |
|---|---|
INNER JOIN | The same. Put predicates wherever reads better. |
LEFT JOIN / RIGHT JOIN | Different. ON filters the join; WHERE filters the result. For right-side conditions, use ON to preserve unmatched left rows. |
FULL OUTER JOIN | Different on both sides. Only use WHERE for predicates that are NULL-safe or are about the result, not the join. |
CROSS JOIN | No ON exists. All filters go in WHERE. |
Debugging Checklist #
When a joined query looks wrong, run these in order:
- Unjoined baseline.
SELECT COUNT(*) FROM left_tableand the same for the right table. Know your inputs. - Key uniqueness.
GROUP BY key HAVING COUNT(*) > 1on both sides. Understand which side is "one" and which is "many". - Row count after join. Run the join without aggregation first and
COUNT(*). Compare to what fan-out would predict. - Find duplicates. If a key appears more than expected,
GROUP BYthat key andHAVING COUNT(*) > 1on the joined result. - Check
WHEREon nullable columns. If the query usesLEFT JOIN, look for anyWHEREcondition on the right table; that's almost always the bug.
Try It #
The query below looks reasonable but drops users with no paid orders. Move the status filter into the ON clause and compare row counts.
Practice #
Return one row per user: user_id, email, and paid_orders (the count of that user's paid orders, 0 if they have none). Every user must appear, including users with no paid orders. No ordering requirement.
Mistakes to Watch For #
- Trusting a join's row count without comparing against the unjoined baseline.
- Using
COUNT(*)when you wanted distinct rows. Always reach forCOUNT(DISTINCT key)or aggregate-before-join. - Putting every predicate in
WHEREout of habit; for outer joins,ONvsWHEREchanges the result. - Debugging a join by staring at the SQL instead of running cardinality checks on each side.
- Assuming
NOT IN (subquery)behaves likeNOT EXISTSwhenNULLis possible. It doesn't.
Knowledge check #
5 questions
Your row count increased unexpectedly after adding a join. The most likely cause:
Your row count decreased unexpectedly after a join. Most likely cause:
The classic
LEFT JOINtrap is:For "users who have never placed an order," the safest pattern is:
Before writing a complex multi-join query, the most useful sanity check is:
Next Step #
Joins combine columns from related tables; set operations stack rows from similar queries. Continue to UNION vs UNION ALL.