I am trying to figure out how to query a table (Hive 3) of dates and values with a moving temporal window aggregation. In the example below, I want to collect every two-day window possible (in which non-terminal dates would each be used twice).
Sample data
first_date | value |
---|---|
2020-01-01 | 3 |
2020-01-02 | 4 |
2020-01-03 | 5 |
2020-01-04 | 6 |
Desired output (combining every two-day window)
first_date | total |
---|---|
2020-01-01 | 7 |
2020-01-02 | 9 |
2020-01-03 | 11 |
2020-01-04 | 6 |
I have tried something like this with no luck
select
first_date,
sum(value) over(
partition by first_date
range between first_date and first_date + interval '1' day
) as total
Apparently I am not allowed to use the partition column (dates) in the range clause, which is somewhat inconvenient. I may be able to duplicate the date column to get around this restriction, but there is probably a better way. What else could I try to get this working?
(Also in practice we may have many instances of any single date, so trying something like counting adjacent rows would not be reliable)
The range syntax is really screwy. You can't use a column in the range. You specify the column in your order by, then you define the range to apply to that column.
Since you have dates, convert your dates into unix timestamps, and add 86,400 to it for your 1 day range. It's ugly, but I don't think there's an alternative. (Running this with the CTE is brutally slow, at least in our environment)