Recipe

Percentile and median

Short answer: PERCENTILE_CONT(0.5) WITHIN GROUP (ORDER BY value) returns the median (50th percentile, interpolated). PERCENTILE_DISC(0.5) returns the median as an actual row value (no interpolation). For p95, p99, and other thresholds, swap the argument. Useful for "typical" values that are robust to outliers: AVG averages get pulled by extremes, but PERCENTILE_CONT(0.5) doesn't.

Canonical SQL #

sql
-- Median (interpolated between adjacent values).
SELECT PERCENTILE_CONT(0.5) WITHIN GROUP (ORDER BY value) AS median
FROM   measurements;

-- Median (discrete — actual row value).
SELECT PERCENTILE_DISC(0.5) WITHIN GROUP (ORDER BY value) AS median
FROM   measurements;

-- 95th percentile.
SELECT PERCENTILE_CONT(0.95) WITHIN GROUP (ORDER BY response_time_ms) AS p95
FROM   request_logs;

-- Multiple percentiles in one query.
SELECT
  PERCENTILE_CONT(0.50) WITHIN GROUP (ORDER BY value) AS p50,
  PERCENTILE_CONT(0.75) WITHIN GROUP (ORDER BY value) AS p75,
  PERCENTILE_CONT(0.95) WITHIN GROUP (ORDER BY value) AS p95,
  PERCENTILE_CONT(0.99) WITHIN GROUP (ORDER BY value) AS p99
FROM   request_logs;

-- Per group.
SELECT
  region,
  PERCENTILE_CONT(0.5) WITHIN GROUP (ORDER BY revenue) AS median_revenue
FROM   user_revenue
GROUP  BY region;

Example #

Loading SQL editor...

PERCENTILE_CONT vs PERCENTILE_DISC #

PERCENTILE_CONT (continuous) interpolates between adjacent values. For an even-numbered set [1, 2, 3, 4], the median is 2.5 (halfway between the two middle values). The standard median definition; what most people mean by "median."

PERCENTILE_DISC (discrete) returns an actual value from the dataset, picking the row where the cumulative distribution reaches the percentile. For [1, 2, 3, 4] at 0.5, returns 2 (the lower of the two middle values). Use when you need a real data point (for example, "the median customer's email") rather than an interpolated number.

For numeric metrics, PERCENTILE_CONT is almost always the right call. For categorical or "give me a real row" use cases, PERCENTILE_DISC.

Common variants #

sql
-- Approximate percentile (faster on huge datasets, small accuracy loss).
SELECT
  APPROX_PERCENTILE(response_time_ms, 0.95) AS p95   -- BigQuery / Snowflake / Databricks
FROM request_logs;

SELECT
  APPROX_QUANTILES(response_time_ms, 100)[OFFSET(95)] AS p95   -- BigQuery
FROM request_logs;

-- IQR (interquartile range = p75 - p25), a robust spread measure.
SELECT
  PERCENTILE_CONT(0.75) WITHIN GROUP (ORDER BY value) -
  PERCENTILE_CONT(0.25) WITHIN GROUP (ORDER BY value) AS iqr
FROM measurements;

-- Per-row percentile rank (where this row falls in the distribution).
SELECT
  value,
  PERCENT_RANK() OVER (ORDER BY value) AS pct_rank,
  CUME_DIST()    OVER (ORDER BY value) AS cum_dist
FROM measurements;

-- "Median per group" without PERCENTILE — older engines, MySQL pre-8.0.
SELECT
  region,
  AVG(value) AS approx_median
FROM (
  SELECT
    region,
    value,
    ROW_NUMBER() OVER (PARTITION BY region ORDER BY value) AS rn,
    COUNT(*)     OVER (PARTITION BY region)                AS cnt
  FROM measurements
) ranked
WHERE rn IN (FLOOR((cnt + 1) / 2), CEIL((cnt + 1) / 2))
GROUP BY region;

Dialect comparison #

EngineSyntaxGotcha
PostgreSQLPERCENTILE_CONT(0.5) WITHIN GROUP (ORDER BY x)ordered-set aggregate; no window form
MySQLROW_NUMBER() / COUNT(*) emulationno aggregate PERCENTILE_CONT
SQL ServerPERCENTILE_CONT(0.5) WITHIN GROUP (ORDER BY x) OVER (...)analytic only, repeats per row; dedupe yourself
BigQueryPERCENTILE_CONT(x, 0.5) OVER ()window function, not an aggregate; value is the first argument
SnowflakePERCENTILE_CONT(0.5) WITHIN GROUP (ORDER BY x)MEDIAN(x) shortcut; APPROX_PERCENTILE at scale
OraclePERCENTILE_CONT(0.5) WITHIN GROUP (ORDER BY x)MEDIAN(x) shortcut
DuckDBPERCENTILE_CONT(0.5) WITHIN GROUP (ORDER BY x)MEDIAN(x) and QUANTILE_CONT(x, 0.5) shortcuts

Dialect notes #

  • PERCENTILE_CONT, PERCENTILE_DISC with WITHIN GROUP (ORDER BY ...): ANSI standard, supported on Postgres, SQL Server, Oracle, Snowflake, DuckDB, Redshift.
  • MySQL: no PERCENTILE_CONT until 8.0, and even then it's a window function variant only. Emulate via the ROW_NUMBER trick above.
  • BigQuery: PERCENTILE_CONT(value, 0.5) is a window function, not an aggregate. Use it as PERCENTILE_CONT(value, 0.5) OVER () or OVER (PARTITION BY group). Confusing if you're coming from Postgres syntax.
  • Approximate variants for big-data scale: APPROX_PERCENTILE (Snowflake / Databricks), APPROX_QUANTILES (BigQuery), PERCENTILE_APPROX (Hive, Spark). Use a t-digest or HLL-style structure under the hood; trade ~1% accuracy for big speedups.
  • As a window function: most engines also support PERCENTILE_CONT(0.5) OVER (PARTITION BY ... ORDER BY ...) for per-row percentile, useful for "show each row alongside the group's median."
Tip

For latency, performance, and cost metrics, percentiles beat averages. AVG(response_time_ms) says the system is "fast on average" while ignoring that 5% of users are having a terrible experience. p50, p95, p99 reveal the shape of the distribution: where the typical user lives (p50), where the slow tail starts (p95), where the worst experiences happen (p99). The same logic applies to revenue per customer, time-on-site, and query duration. Anywhere the distribution is skewed by a few extreme values, the median and percentiles are the more honest summary.