Lesson Advanced

Capstone Part 2: Cohorts & Retention

Part 1 showed the business has growing paid volume. But is that growth healthy? Two growth patterns look identical on a top-line chart:

  • A: customers sign up, buy once, never return. Growth requires constant acquisition.
  • B: customers sign up, buy, come back, buy again. Growth compounds.

The difference is retention, and you can't see it in a daily revenue line. You see it by grouping users into cohorts (usually by signup month) and tracking what fraction come back.

Step 1: Define the Cohort #

A cohort is a group of users who share a first event. The usual choice is signup month (DATE_TRUNC('month', users.created_at)). You could also use first-order month if signup and purchase decouple, but pick one and be explicit.

sql
-- Cohort sizes: how many users signed up each month?
SELECT
  DATE_TRUNC('month', created_at) AS cohort_month,
  COUNT(*) AS cohort_size
FROM users
GROUP BY cohort_month
ORDER BY cohort_month;

Step 2: Connect Cohorts to Activity #

The retention question is: of each cohort, how many ever placed a paid order? That's a LEFT JOIN from users to their paid orders: left, not inner, because a cohort user who never bought must still count toward the denominator.

sql
WITH cohort_users AS (
  SELECT
    id AS user_id,
    DATE_TRUNC('month', created_at) AS cohort_month
  FROM users
),
buyer_flags AS (
  SELECT
    c.cohort_month,
    c.user_id,
    MAX(CASE WHEN o.status = 'paid' THEN 1 ELSE 0 END) AS is_buyer
  FROM cohort_users AS c
  LEFT JOIN orders AS o ON o.user_id = c.user_id
  GROUP BY c.cohort_month, c.user_id
)
SELECT
  cohort_month,
  COUNT(*) AS cohort_size,
  SUM(is_buyer) AS buyers,
  ROUND(SUM(is_buyer) * 1.0 / COUNT(*), 3) AS conversion_rate
FROM buyer_flags
GROUP BY cohort_month
ORDER BY cohort_month;

Read it top-down: cohort users → flag each as a buyer-or-not → aggregate to cohort level. The MAX(CASE WHEN ...) collapses multiple orders per user into a single 0/1 buyer flag so no user is double-counted.

Warning

"Retained" is a definition, not a metric. Is a user retained if they ever bought after signup? If they bought in month 2? If they bought in months 1 AND 2? Each answers a different question and will give different numbers. Pick the definition that matches what the business actually wants to know, and write it down. Drifting definitions across a quarter will destroy stakeholder trust.

Step 3: Cohort Revenue (The Money View) #

Conversion rate tells you whether cohorts buy. Revenue-per-signup tells you how much they're worth. On real data, early cohorts have had more time to spend (bigger numbers), so always compare cohorts at the same age, not absolute total.

sql
-- Paid revenue contributed by each cohort
SELECT
  DATE_TRUNC('month', u.created_at) AS cohort_month,
  COUNT(DISTINCT u.id) AS cohort_size,
  ROUND(COALESCE(SUM(o.total), 0), 2) AS paid_revenue,
  ROUND(COALESCE(SUM(o.total), 0) / COUNT(DISTINCT u.id), 2) AS revenue_per_signup
FROM users AS u
LEFT JOIN orders AS o
  ON o.user_id = u.id
  AND o.status = 'paid'
GROUP BY cohort_month
ORDER BY cohort_month;

The AND o.status = 'paid' sits inside the ON clause, not WHERE. Otherwise the LEFT JOIN becomes effectively an INNER JOIN and cohorts with zero paid orders disappear. Classic pitfall.

Try It #

Run the cohort conversion query. With the seeded 10 users across 5 signup months, the cohorts are tiny. The pattern is the takeaway, not the specific numbers.

Loading SQL editor...

Practice #

Build the cohort revenue report. Output: cohort_month (first of month from users.created_at), cohort_size (distinct users in cohort), paid_revenue (sum of total across their paid orders, 0 if none, rounded to 2 decimals), revenue_per_signup (paid_revenue / cohort_size, rounded to 2 decimals). Order by cohort_month ASC.

Loading SQL editor...

Mistakes to Watch For #

  • Filtering WHERE o.status = 'paid' on a LEFT JOIN. It demotes the join to an inner join and cohorts with zero buyers vanish. The filter belongs in the ON clause.
  • Comparing young and old cohorts at the same wall-clock moment. A cohort from 12 months ago has had 12x the time to spend. Always normalize to "months since signup" before comparing.
  • Reporting conversion rate without cohort size. "Conversion was 100%" is meaningless if the cohort had two users. Always show the denominator.
  • Double-counting users with multiple orders. COUNT(*) counts rows; COUNT(DISTINCT user_id) counts users. The first inflates cohort size by anyone with repeat orders.
  • Letting "retained" mean different things across reports. Write the definition down once ("made ≥1 paid order within 30 days of signup") and use it everywhere.

Next Step #

Continue to Capstone Part 3: Segments & Recommendation: rank products within categories, tier customers by lifetime value, and turn the numbers into a one-paragraph recommendation.