Calculating percentiles in SQL

22.4k Views Asked by At

This should be very straightforward, but as a newbie to SQL I am really struggling. I've been recommended to use PERCENTILE_CONT with continuous (non-discrete) data.

The data in question concerns two columns: (1) the IDs for a list of patients and (2) the average number of events per year.

Working from some code I found online, this is where I am up to

SELECT ID,
percentile_cont (0.25) WITHIN GROUP
(ORDER BY PPPY ASC) OVER(PARTITION BY ID) as percentile_25,
percentile_cont (0.50) WITHIN GROUP
(ORDER BY PPPY ASC) OVER(PARTITION BY ID) as percentile_50,
percentile_cont (0.75) WITHIN GROUP
(ORDER BY PPPY ASC) OVER(PARTITION BY ID) as percentile_75
FROM AE_COUNT;

This just seems to report thee columns each with identical values of PPPY.

Any idea where I'm going wrong?

2

There are 2 best solutions below

1
On

Assuming that you want to get the percentiles for the whole table, try this:

SELECT Distinct
percentile_cont (0.25) WITHIN GROUP
(ORDER BY PPPY ASC) OVER() as percentile_25,
percentile_cont (0.50) WITHIN GROUP
(ORDER BY PPPY ASC) OVER() as percentile_50,
percentile_cont (0.75) WITHIN GROUP
(ORDER BY PPPY ASC) OVER() as percentile_75
FROM AE_COUNT;

Removing the partition statement will run it against the whole table. I also removed the Id column form the select statement and made it distinct.

I would also like to point out that you said the second column is the average number of events per year. I have no idea what you need the percentiles for, but be aware that calculating a percentile of the averages of a group of sets will not yield the same result as calculating the percentiles of a union of the sets.

0
On

PERCENTILE_CONT() is either a window function or an aggregate function. If you want a single row summarized for all the data, use it as an aggregate function:

SELECT percentile_cont(0.25) WITHIN GROUP (ORDER BY PPPY ASC)  as percentile_25,
       percentile_cont(0.50) WITHIN GROUP (ORDER BY PPPY ASC) as percentile_50,
       percentile_cont(0.75) WITHIN GROUP (ORDER BY PPPY ASC) as percentile_75
FROM AE_COUNT;

If you wanted the value per patient, you would do:

SELECT id, percentile_cont(0.25) WITHIN GROUP (ORDER BY PPPY ASC)  as percentile_25,
       percentile_cont(0.50) WITHIN GROUP (ORDER BY PPPY ASC) as percentile_50,
       percentile_cont(0.75) WITHIN GROUP (ORDER BY PPPY ASC) as percentile_75
FROM AE_COUNT
GROUP BY id;

However, patients probably have very few rows, so the values are likely to be the same for any given patient.