Correlated Subqueries
A correlated subquery is an inner query that references a column from the outer row. Logically, it's evaluated once per outer row: the inner query is parameterized by the outer row's values.
That per-row model is both its power (answer "this row vs the rest of its group" in pure SQL) and its cost.
Correlated vs Uncorrelated #
-- Uncorrelated: inner query runs once, returns a list
SELECT id, email
FROM users
WHERE id IN (SELECT user_id FROM orders WHERE status = 'paid');
-- Correlated: inner query references u.id — runs per outer row
SELECT id, email
FROM users AS u
WHERE EXISTS (
SELECT 1
FROM orders AS o
WHERE o.user_id = u.id -- <-- correlation on outer u
AND o.status = 'paid'
);The tell is a reference to the outer alias inside the subquery. Remove that reference and the subquery becomes independent: it can be evaluated once and cached.
Per-Row Aggregates #
The most useful correlated pattern: compute an aggregate relative to the current row's group and compare. Classic case, "each order compared to that user's average order":
-- Each order alongside its user's average order total
SELECT
o1.id AS order_id,
o1.user_id,
o1.total,
ROUND(
(SELECT AVG(o2.total) FROM orders AS o2 WHERE o2.user_id = o1.user_id),
2
) AS user_avg_total
FROM orders AS o1;The scalar subquery is correlated on o2.user_id = o1.user_id, so it computes one average per outer row.
When to Rewrite #
Correlated subqueries are conceptually per-row.
A modern optimizer often transforms them into joins or hash lookups, so the cost isn't always catastrophic. But on some engines, or when the correlation is complex, performance is O(n × m). When the same answer is expressible with a window function or a JOIN + GROUP BY, those are usually faster and often clearer.
-- Same answer, window function — one pass over orders
SELECT
id AS order_id,
user_id,
total,
ROUND(AVG(total) OVER (PARTITION BY user_id), 2) AS user_avg_total
FROM orders;
-- Same answer, JOIN + aggregate CTE — also one pass per side
WITH user_avg AS (
SELECT user_id, AVG(total) AS user_avg_total
FROM orders
GROUP BY user_id
)
SELECT o.id AS order_id, o.user_id, o.total, ROUND(ua.user_avg_total, 2) AS user_avg_total
FROM orders AS o
INNER JOIN user_avg AS ua ON ua.user_id = o.user_id;All three return the same rows. Use the correlated subquery when it's the most readable; switch to the window function or JOIN+CTE when readability is equal and performance matters.
Try It #
For each order, show the user's average order total. Then rewrite using a window function and compare.
Practice #
Return every order whose total is strictly greater than that user's average order total. Output columns: order_id, user_id, total, user_avg_total (rounded to 2). No ordering requirement.
Mistakes to Watch For #
- Forgetting the correlation clause. An uncorrelated subquery returns a global aggregate, not a per-row one, and silently gives the wrong answer.
- Reaching for correlated subqueries when a window function would be clearer and faster. If your correlation is "per group",
OVER (PARTITION BY ...)is usually the better tool. - Ambiguous aliases: always give the outer and inner tables distinct aliases (
o1,o2). - Expecting
NOT EXISTSandNOT INto be interchangeable whenNULLis possible. OnlyNOT EXISTSis NULL-safe.
Knowledge check #
3 questions
Performance-wise, correlated subqueries:
EXISTS (subquery)andIN (subquery)differ how onNULLhandling?For "the row with the maximum X per group," a correlated subquery is probably not the best tool. The better choice is:
Next Step #
Continue to WITH (CTEs): named subqueries that make multi-step pipelines readable and let you reference the same intermediate result more than once.