How can I achieve a windowed running median in Redshift?

2.1k Views Asked by At

I am pulling my hair out trying to create a running / cumulative median of a partitioned value, in a chronological ordering. Basically I have a table:

create table "SomeData"
(
    ClientId INT,
    SomeData DECIMAL(10,2),
    SomeDate TIMESTAMP
);

With some data:

INSERT INTO "SomeData" (ClientId, SomeData, SomeDate) VALUES
(1, 1, '1 Jan 2000'),
(1, 2, '2 Jan 2000'),
(1, 3, '3 Jan 2000'),
(1, 4, '4 Jan 2000'),
(2, 100, '1 Jan 2000'),
(2, 100, '2 Jan 2000'),
(2, 100, '3 Jan 2000'),
(2, 200, '4 Jan 2000'),
(2, 200, '5 Jan 2000'),
(2, 200, '6 Jan 2000'),
(2, 200, '7 Jan 2000');

And I need a running median partitioned by ClientId, ordered by SomeDate.

Basically, what I need to produce is this:

ClientId    SomeDate      Median of SomeData
1           "2000-01-01"  1.000
1           "2000-01-02"  1.500
1           "2000-01-03"  2.000
1           "2000-01-04"  2.500
2           "2000-01-01"  100.0
2           "2000-01-02"  100.0
2           "2000-01-03"  100.0
2           "2000-01-04"  100.0
2           "2000-01-05"  100.0
2           "2000-01-06"  150.0
2           "2000-01-07"  200.0

I am able to do this in a number of ways in in PostgresSql 9.x with the Aggregate_median function, however this is proving to be difficult in Redshift, which only has an aggregate median

SELECT ClientId, SomeDate, median(SomeData) OVER (PARTITION BY ClientId ORDER BY SomeDate)
FROM "SomeData" xout
ORDER BY ClientId, SomeDate;

Running the above on Redshift however gives the error:

ERROR: window specification should not contain frame clause and order-by for window function median

The median can be replaced with a manual correlated subquery back to the original table, however RedShift doesn't appear to support these either.

ERROR: This type of correlated subquery pattern is not supported due to internal error

Here are a bunch of fiddles which work in PostGres, none of which work in Redshift

At this point it looks like I'll need to pull the data into memory and do this in code, but would be appreciative if this can be done in Redshift directly.

3

There are 3 best solutions below

6
On BEST ANSWER

I wonder if you can do this with nth_value():

SELECT ClientId, SomeDate,
       NTH_VALUE(seqnum / 2) OVER (PARTITION BY ClientId ORDER BY SomeDate)
FROM (SELECT s.*,
             COUNT(*) OVER (PARTITION BY ClientId ORDER BY SomeDate) as seqnum
      FROM SomeData s
     ) s
ORDER BY ClientId, SomeDate;

As a note: that use of COUNT(*) instead of ROW_NUMBER() takes some getting used to.

0
On

This is an exact computation of the quantity that you are looking for.

Not sexy per se but it properly handles medians for odd vs. even length.

with row_numbers as (
    SELECT d.partitionField -- the field (or fields) you are partitioning the window function by
         , d.orderField  -- your sort field for the window functions
         , d.medianField -- quantity your are computing the median of

         , ROW_NUMBER() 
           OVER (PARTITION BY partitionField ORDER BY orderField) as seqnum

    FROM data d
)

, medians as (    
    SELECT nth_value(medianField, CASE 
                                  WHEN mod(seqnum, 2) = 0 THEN (seqnum/2)::int 
                                  ELSE ((seqnum/2)::int + 1) 
                                  END) 
           OVER (PARTITION BY partitionField ORDER BY orderField ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) as median1

         , nth_value(medianField, (seqnum/2)::int + 1) OVER (PARTITION BY partitionField ORDER BY orderField ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) as median2

         , mod(seqnum, 2) as mod1
    FROM row_numbers
    ORDER BY partitionField, orderField
)

select CASE
       when mod(mod1,2) = 0
       then ((median1 + median2)/2)::FLOAT
       else median1
       end as median
from medians
1
On

I think the solution presented by @GordonLinoff is not correct because it does not order the rows with the value you are trying to find median of. The correct way inspired by :

Moving Median, Mode in T-SQL

works on redshift:

WITH CTE
AS
(
SELECT  ClientId,
        ROW_NUMBER() OVER (PARTITION BY ClientId ORDER BY SomeDate ASC) row_num,
        SomeDate,
        SomeData
FROM "SomeData" 
)
SELECT A.SomeDate,
       A.SomeData,
                (SELECT  MEDIAN(B.SomeData)
                FROM CTE B 
                WHERE B.row_num BETWEEN 1 AND A.row_num 
                GROUP BY A.ClientId) AS median
FROM CTE A