Percentile and median
Short answer: PERCENTILE_CONT(0.5) WITHIN GROUP (ORDER BY value) returns the median (50th percentile, interpolated). PERCENTILE_DISC(0.5) returns the median as an actual row value (no interpolation). For p95, p99, and other thresholds, swap the argument. Useful for "typical" values that are robust to outliers: AVG averages get pulled by extremes, but PERCENTILE_CONT(0.5) doesn't.
Canonical SQL #
-- Median (interpolated between adjacent values).
SELECT PERCENTILE_CONT(0.5) WITHIN GROUP (ORDER BY value) AS median
FROM measurements;
-- Median (discrete — actual row value).
SELECT PERCENTILE_DISC(0.5) WITHIN GROUP (ORDER BY value) AS median
FROM measurements;
-- 95th percentile.
SELECT PERCENTILE_CONT(0.95) WITHIN GROUP (ORDER BY response_time_ms) AS p95
FROM request_logs;
-- Multiple percentiles in one query.
SELECT
PERCENTILE_CONT(0.50) WITHIN GROUP (ORDER BY value) AS p50,
PERCENTILE_CONT(0.75) WITHIN GROUP (ORDER BY value) AS p75,
PERCENTILE_CONT(0.95) WITHIN GROUP (ORDER BY value) AS p95,
PERCENTILE_CONT(0.99) WITHIN GROUP (ORDER BY value) AS p99
FROM request_logs;
-- Per group.
SELECT
region,
PERCENTILE_CONT(0.5) WITHIN GROUP (ORDER BY revenue) AS median_revenue
FROM user_revenue
GROUP BY region;Example #
PERCENTILE_CONT vs PERCENTILE_DISC #
PERCENTILE_CONT (continuous) interpolates between adjacent values. For an even-numbered set [1, 2, 3, 4], the median is 2.5 (halfway between the two middle values). The standard median definition; what most people mean by "median."
PERCENTILE_DISC (discrete) returns an actual value from the dataset, picking the row where the cumulative distribution reaches the percentile. For [1, 2, 3, 4] at 0.5, returns 2 (the lower of the two middle values). Use when you need a real data point (for example, "the median customer's email") rather than an interpolated number.
For numeric metrics, PERCENTILE_CONT is almost always the right call. For categorical or "give me a real row" use cases, PERCENTILE_DISC.
Common variants #
-- Approximate percentile (faster on huge datasets, small accuracy loss).
SELECT
APPROX_PERCENTILE(response_time_ms, 0.95) AS p95 -- BigQuery / Snowflake / Databricks
FROM request_logs;
SELECT
APPROX_QUANTILES(response_time_ms, 100)[OFFSET(95)] AS p95 -- BigQuery
FROM request_logs;
-- IQR (interquartile range = p75 - p25), a robust spread measure.
SELECT
PERCENTILE_CONT(0.75) WITHIN GROUP (ORDER BY value) -
PERCENTILE_CONT(0.25) WITHIN GROUP (ORDER BY value) AS iqr
FROM measurements;
-- Per-row percentile rank (where this row falls in the distribution).
SELECT
value,
PERCENT_RANK() OVER (ORDER BY value) AS pct_rank,
CUME_DIST() OVER (ORDER BY value) AS cum_dist
FROM measurements;
-- "Median per group" without PERCENTILE — older engines, MySQL pre-8.0.
SELECT
region,
AVG(value) AS approx_median
FROM (
SELECT
region,
value,
ROW_NUMBER() OVER (PARTITION BY region ORDER BY value) AS rn,
COUNT(*) OVER (PARTITION BY region) AS cnt
FROM measurements
) ranked
WHERE rn IN (FLOOR((cnt + 1) / 2), CEIL((cnt + 1) / 2))
GROUP BY region;Dialect comparison #
| Engine | Syntax | Gotcha |
|---|---|---|
| PostgreSQL | PERCENTILE_CONT(0.5) WITHIN GROUP (ORDER BY x) | ordered-set aggregate; no window form |
| MySQL | ROW_NUMBER() / COUNT(*) emulation | no aggregate PERCENTILE_CONT |
| SQL Server | PERCENTILE_CONT(0.5) WITHIN GROUP (ORDER BY x) OVER (...) | analytic only, repeats per row; dedupe yourself |
| BigQuery | PERCENTILE_CONT(x, 0.5) OVER () | window function, not an aggregate; value is the first argument |
| Snowflake | PERCENTILE_CONT(0.5) WITHIN GROUP (ORDER BY x) | MEDIAN(x) shortcut; APPROX_PERCENTILE at scale |
| Oracle | PERCENTILE_CONT(0.5) WITHIN GROUP (ORDER BY x) | MEDIAN(x) shortcut |
| DuckDB | PERCENTILE_CONT(0.5) WITHIN GROUP (ORDER BY x) | MEDIAN(x) and QUANTILE_CONT(x, 0.5) shortcuts |
Dialect notes #
PERCENTILE_CONT,PERCENTILE_DISCwithWITHIN GROUP (ORDER BY ...): ANSI standard, supported on Postgres, SQL Server, Oracle, Snowflake, DuckDB, Redshift.- MySQL: no
PERCENTILE_CONTuntil 8.0, and even then it's a window function variant only. Emulate via theROW_NUMBERtrick above. - BigQuery:
PERCENTILE_CONT(value, 0.5)is a window function, not an aggregate. Use it asPERCENTILE_CONT(value, 0.5) OVER ()orOVER (PARTITION BY group). Confusing if you're coming from Postgres syntax. - Approximate variants for big-data scale:
APPROX_PERCENTILE(Snowflake / Databricks),APPROX_QUANTILES(BigQuery),PERCENTILE_APPROX(Hive, Spark). Use a t-digest or HLL-style structure under the hood; trade ~1% accuracy for big speedups. - As a window function: most engines also support
PERCENTILE_CONT(0.5) OVER (PARTITION BY ... ORDER BY ...)for per-row percentile, useful for "show each row alongside the group's median."
For latency, performance, and cost metrics, percentiles beat averages. AVG(response_time_ms) says the system is "fast on average" while ignoring that 5% of users are having a terrible experience. p50, p95, p99 reveal the shape of the distribution: where the typical user lives (p50), where the slow tail starts (p95), where the worst experiences happen (p99). The same logic applies to revenue per customer, time-on-site, and query duration. Anywhere the distribution is skewed by a few extreme values, the median and percentiles are the more honest summary.