Lesson Beginner

Join Debugging: Duplicates & Missing Rows

Almost every join bug in production shows up as one of two symptoms:

The row count ballooned. Your SUM is suddenly double, your COUNT is 4x what it should be. That's fan-out from joining on a non-unique key.
Rows you expected are gone: customers are missing, dates are skipped. That's usually a filter in WHERE silently converting a LEFT JOIN into an INNER JOIN.

This lesson covers how to diagnose both, and the checklist to run before trusting any join.

Symptom 1: Duplicates from Fan-Out #

A join multiplies rows whenever the join key is not unique on one side. Join users (one row per user) to orders (many rows per user) on user_id, and each user appears once per order. That's usually fine, until you SUM a user-level column over the joined rows.

sql

-- BUG: users.created_at is counted once per order, not once per user.
-- If a user has 5 orders, their signup counts 5 times.
SELECT COUNT(DISTINCT u.id) AS user_count
FROM users AS u
INNER JOIN orders AS o ON o.user_id = u.id;
-- ^ This is the fix: COUNT(DISTINCT) ignores the fan-out.
-- The silent bug is COUNT(u.id) or COUNT(*) over the same join.

How to Diagnose #

Check whether your join key is unique on the side you think it is:

sql

-- Does orders.user_id have duplicates? (It should — one user, many orders.)
SELECT user_id, COUNT(*) AS rows
FROM orders
GROUP BY user_id
HAVING COUNT(*) > 1
ORDER BY rows DESC;

If the side you expected to be "1" in the 1-to-many has duplicates (say, users.id), you've got a data quality issue or you're joining on the wrong column. If the "many" side has duplicates, that's by design; now check whether your aggregates handle it.

How to Fix #

Three fixes, in order of preference:

Aggregate first, then join. Collapse the many-side to one row per key in a subquery or CTE, then join that.
SELECT DISTINCT / COUNT(DISTINCT) when you only need uniqueness, not sums.
Narrow the join with a stricter ON condition (e.g., add a status or date filter).

(The fix below uses WITH. That's a CTE, a named subquery covered properly in a later lesson. For now, read user_orders as a named intermediate result.)

sql

-- Fix by aggregating first: one row per user before joining
WITH user_orders AS (
  SELECT user_id, COUNT(*) AS order_count, SUM(total) AS revenue
  FROM orders
  GROUP BY user_id
)
SELECT u.id, u.email, uo.order_count, uo.revenue
FROM users AS u
LEFT JOIN user_orders AS uo ON uo.user_id = u.id;

Symptom 2: Missing Rows from WHERE on a LEFT JOIN #

Warning

A filter on a right-side column in WHERE silently converts LEFT JOIN into INNER JOIN.
Unmatched rows have NULL for every right-side column. NULL = 'paid' is UNKNOWN, which WHERE treats as false — so those rows disappear. This is the single most common join bug in analytics.

sql

-- BUG: users with zero paid orders are silently dropped.
-- The LEFT JOIN did its job; the WHERE undid it.
SELECT u.id, u.email, COUNT(o.id) AS paid_orders
FROM users AS u
LEFT JOIN orders AS o ON o.user_id = u.id
WHERE o.status = 'paid'
GROUP BY u.id, u.email;

-- FIX: move the condition into ON so it filters the join, not the result.
SELECT u.id, u.email, COUNT(o.id) AS paid_orders
FROM users AS u
LEFT JOIN orders AS o
  ON o.user_id = u.id
  AND o.status = 'paid'
GROUP BY u.id, u.email;

ON vs WHERE: The Rule #

Join type	`ON` and `WHERE` behave...
`INNER JOIN`	The same. Put predicates wherever reads better.
`LEFT JOIN` / `RIGHT JOIN`	Different. `ON` filters the join; `WHERE` filters the result. For right-side conditions, use `ON` to preserve unmatched left rows.
`FULL OUTER JOIN`	Different on both sides. Only use `WHERE` for predicates that are NULL-safe or are about the result, not the join.
`CROSS JOIN`	No `ON` exists. All filters go in `WHERE`.

Debugging Checklist #

When a joined query looks wrong, run these in order:

Unjoined baseline. SELECT COUNT(*) FROM left_table and the same for the right table. Know your inputs.
Key uniqueness. GROUP BY key HAVING COUNT(*) > 1 on both sides. Understand which side is "one" and which is "many".
Row count after join. Run the join without aggregation first and COUNT(*). Compare to what fan-out would predict.
Find duplicates. If a key appears more than expected, GROUP BY that key and HAVING COUNT(*) > 1 on the joined result.
Check WHERE on nullable columns. If the query uses LEFT JOIN, look for any WHERE condition on the right table; that's almost always the bug.

Try It #

The query below looks reasonable but drops users with no paid orders. Move the status filter into the ON clause and compare row counts.

Loading SQL editor...

Practice #

Return one row per user: user_id, email, and paid_orders (the count of that user's paid orders, 0 if they have none). Every user must appear, including users with no paid orders. No ordering requirement.

Loading SQL editor...

Mistakes to Watch For #

Trusting a join's row count without comparing against the unjoined baseline.
Using COUNT(*) when you wanted distinct rows. Always reach for COUNT(DISTINCT key) or aggregate-before-join.
Putting every predicate in WHERE out of habit; for outer joins, ON vs WHERE changes the result.
Debugging a join by staring at the SQL instead of running cardinality checks on each side.
Assuming NOT IN (subquery) behaves like NOT EXISTS when NULL is possible. It doesn't.

Knowledge check #

5 questions

0 / 5 answered

Your row count increased unexpectedly after adding a join. The most likely cause:
Your row count decreased unexpectedly after a join. Most likely cause:
The classic LEFT JOIN trap is:
For "users who have never placed an order," the safest pattern is:
Before writing a complex multi-join query, the most useful sanity check is:

Next Step #

Joins combine columns from related tables; set operations stack rows from similar queries. Continue to UNION vs UNION ALL.