Lesson Intermediate

Correlated Subqueries

A correlated subquery is an inner query that references a column from the outer row. Logically, it's evaluated once per outer row: the inner query is parameterized by the outer row's values.
That per-row model is both its power (answer "this row vs the rest of its group" in pure SQL) and its cost.

Correlated vs Uncorrelated #

sql

-- Uncorrelated: inner query runs once, returns a list
SELECT id, email
FROM users
WHERE id IN (SELECT user_id FROM orders WHERE status = 'paid');

-- Correlated: inner query references u.id — runs per outer row
SELECT id, email
FROM users AS u
WHERE EXISTS (
  SELECT 1
  FROM orders AS o
  WHERE o.user_id = u.id        -- <-- correlation on outer u
    AND o.status = 'paid'
);

The tell is a reference to the outer alias inside the subquery. Remove that reference and the subquery becomes independent: it can be evaluated once and cached.

Per-Row Aggregates #

The most useful correlated pattern: compute an aggregate relative to the current row's group and compare. Classic case, "each order compared to that user's average order":

sql

-- Each order alongside its user's average order total
SELECT
  o1.id AS order_id,
  o1.user_id,
  o1.total,
  ROUND(
    (SELECT AVG(o2.total) FROM orders AS o2 WHERE o2.user_id = o1.user_id),
    2
  ) AS user_avg_total
FROM orders AS o1;

The scalar subquery is correlated on o2.user_id = o1.user_id, so it computes one average per outer row.

When to Rewrite #

Warning

Correlated subqueries are conceptually per-row.
A modern optimizer often transforms them into joins or hash lookups, so the cost isn't always catastrophic. But on some engines, or when the correlation is complex, performance is O(n × m). When the same answer is expressible with a window function or a JOIN + GROUP BY, those are usually faster and often clearer.

sql

-- Same answer, window function — one pass over orders
SELECT
  id AS order_id,
  user_id,
  total,
  ROUND(AVG(total) OVER (PARTITION BY user_id), 2) AS user_avg_total
FROM orders;

-- Same answer, JOIN + aggregate CTE — also one pass per side
WITH user_avg AS (
  SELECT user_id, AVG(total) AS user_avg_total
  FROM orders
  GROUP BY user_id
)
SELECT o.id AS order_id, o.user_id, o.total, ROUND(ua.user_avg_total, 2) AS user_avg_total
FROM orders AS o
INNER JOIN user_avg AS ua ON ua.user_id = o.user_id;

All three return the same rows. Use the correlated subquery when it's the most readable; switch to the window function or JOIN+CTE when readability is equal and performance matters.

Try It #

For each order, show the user's average order total. Then rewrite using a window function and compare.

Loading SQL editor...

Practice #

Return every order whose total is strictly greater than that user's average order total. Output columns: order_id, user_id, total, user_avg_total (rounded to 2). No ordering requirement.

Loading SQL editor...

Mistakes to Watch For #

Forgetting the correlation clause. An uncorrelated subquery returns a global aggregate, not a per-row one, and silently gives the wrong answer.
Reaching for correlated subqueries when a window function would be clearer and faster. If your correlation is "per group", OVER (PARTITION BY ...) is usually the better tool.
Ambiguous aliases: always give the outer and inner tables distinct aliases (o1, o2).
Expecting NOT EXISTS and NOT IN to be interchangeable when NULL is possible. Only NOT EXISTS is NULL-safe.

Knowledge check #

3 questions

0 / 3 answered

Performance-wise, correlated subqueries:
EXISTS (subquery) and IN (subquery) differ how on NULL handling?
For "the row with the maximum X per group," a correlated subquery is probably not the best tool. The better choice is:

Next Step #

Continue to WITH (CTEs): named subqueries that make multi-step pipelines readable and let you reference the same intermediate result more than once.