Lesson Beginner

Join Debugging: Duplicates & Missing Rows

Almost every join bug in production shows up as one of two symptoms:

  1. The row count ballooned. Your SUM is suddenly double, your COUNT is 4x what it should be. That's fan-out from joining on a non-unique key.
  2. Rows you expected are gone: customers are missing, dates are skipped. That's usually a filter in WHERE silently converting a LEFT JOIN into an INNER JOIN.

This lesson covers how to diagnose both, and the checklist to run before trusting any join.

Symptom 1: Duplicates from Fan-Out #

A join multiplies rows whenever the join key is not unique on one side. Join users (one row per user) to orders (many rows per user) on user_id, and each user appears once per order. That's usually fine, until you SUM a user-level column over the joined rows.

sql
-- BUG: users.created_at is counted once per order, not once per user.
-- If a user has 5 orders, their signup counts 5 times.
SELECT COUNT(DISTINCT u.id) AS user_count
FROM users AS u
INNER JOIN orders AS o ON o.user_id = u.id;
-- ^ This is the fix: COUNT(DISTINCT) ignores the fan-out.
-- The silent bug is COUNT(u.id) or COUNT(*) over the same join.

How to Diagnose #

Check whether your join key is unique on the side you think it is:

sql
-- Does orders.user_id have duplicates? (It should — one user, many orders.)
SELECT user_id, COUNT(*) AS rows
FROM orders
GROUP BY user_id
HAVING COUNT(*) > 1
ORDER BY rows DESC;

If the side you expected to be "1" in the 1-to-many has duplicates (say, users.id), you've got a data quality issue or you're joining on the wrong column. If the "many" side has duplicates, that's by design; now check whether your aggregates handle it.

How to Fix #

Three fixes, in order of preference:

  1. Aggregate first, then join. Collapse the many-side to one row per key in a subquery or CTE, then join that.
  2. SELECT DISTINCT / COUNT(DISTINCT) when you only need uniqueness, not sums.
  3. Narrow the join with a stricter ON condition (e.g., add a status or date filter).

(The fix below uses WITH. That's a CTE, a named subquery covered properly in a later lesson. For now, read user_orders as a named intermediate result.)

sql
-- Fix by aggregating first: one row per user before joining
WITH user_orders AS (
  SELECT user_id, COUNT(*) AS order_count, SUM(total) AS revenue
  FROM orders
  GROUP BY user_id
)
SELECT u.id, u.email, uo.order_count, uo.revenue
FROM users AS u
LEFT JOIN user_orders AS uo ON uo.user_id = u.id;

Symptom 2: Missing Rows from WHERE on a LEFT JOIN #

Warning

A filter on a right-side column in WHERE silently converts LEFT JOIN into INNER JOIN.
Unmatched rows have NULL for every right-side column. NULL = 'paid' is UNKNOWN, which WHERE treats as false — so those rows disappear. This is the single most common join bug in analytics.

sql
-- BUG: users with zero paid orders are silently dropped.
-- The LEFT JOIN did its job; the WHERE undid it.
SELECT u.id, u.email, COUNT(o.id) AS paid_orders
FROM users AS u
LEFT JOIN orders AS o ON o.user_id = u.id
WHERE o.status = 'paid'
GROUP BY u.id, u.email;

-- FIX: move the condition into ON so it filters the join, not the result.
SELECT u.id, u.email, COUNT(o.id) AS paid_orders
FROM users AS u
LEFT JOIN orders AS o
  ON o.user_id = u.id
  AND o.status = 'paid'
GROUP BY u.id, u.email;

ON vs WHERE: The Rule #

Join typeON and WHERE behave...
INNER JOINThe same. Put predicates wherever reads better.
LEFT JOIN / RIGHT JOINDifferent. ON filters the join; WHERE filters the result. For right-side conditions, use ON to preserve unmatched left rows.
FULL OUTER JOINDifferent on both sides. Only use WHERE for predicates that are NULL-safe or are about the result, not the join.
CROSS JOINNo ON exists. All filters go in WHERE.

Debugging Checklist #

When a joined query looks wrong, run these in order:

  1. Unjoined baseline. SELECT COUNT(*) FROM left_table and the same for the right table. Know your inputs.
  2. Key uniqueness. GROUP BY key HAVING COUNT(*) > 1 on both sides. Understand which side is "one" and which is "many".
  3. Row count after join. Run the join without aggregation first and COUNT(*). Compare to what fan-out would predict.
  4. Find duplicates. If a key appears more than expected, GROUP BY that key and HAVING COUNT(*) > 1 on the joined result.
  5. Check WHERE on nullable columns. If the query uses LEFT JOIN, look for any WHERE condition on the right table; that's almost always the bug.

Try It #

The query below looks reasonable but drops users with no paid orders. Move the status filter into the ON clause and compare row counts.

Loading SQL editor...

Practice #

Return one row per user: user_id, email, and paid_orders (the count of that user's paid orders, 0 if they have none). Every user must appear, including users with no paid orders. No ordering requirement.

Loading SQL editor...

Mistakes to Watch For #

  • Trusting a join's row count without comparing against the unjoined baseline.
  • Using COUNT(*) when you wanted distinct rows. Always reach for COUNT(DISTINCT key) or aggregate-before-join.
  • Putting every predicate in WHERE out of habit; for outer joins, ON vs WHERE changes the result.
  • Debugging a join by staring at the SQL instead of running cardinality checks on each side.
  • Assuming NOT IN (subquery) behaves like NOT EXISTS when NULL is possible. It doesn't.

Knowledge check #

5 questions

0 / 5 answered
  1. Your row count increased unexpectedly after adding a join. The most likely cause:

  2. Your row count decreased unexpectedly after a join. Most likely cause:

  3. The classic LEFT JOIN trap is:

  4. For "users who have never placed an order," the safest pattern is:

  5. Before writing a complex multi-join query, the most useful sanity check is:

Next Step #

Joins combine columns from related tables; set operations stack rows from similar queries. Continue to UNION vs UNION ALL.