Presto / AWS Athena query, historicized table (last value in aggregation)

778 Views Asked by At

I've got a table split in a static part and a history one. I have to create a query which groups by a series of dimensions, including year and month, and do some aggregations. One of the values that I need to project is a value of the last tuple of the history table matching the given year / month couple.

History table have validity_date_start and validity_date_end, and the latter is NULL if it's up-to-date.

This is the query I've done so far (using temporary tables for ease of reproduction):

SELECT
  time.year,
  time.month,
  t1.name,
  FIRST_VALUE(t2.value1) OVER(ORDER BY t2.validity_date_start DESC) AS value, -- take the last valid t2 part for the month
  (CASE WHEN t1.id = 1 AND time.date >= timestamp '2020-07-01 00:00:00' THEN 27
     ELSE CASE WHEN t1.id = 1 AND time.date >= timestamp '2020-03-01 00:00:00' THEN 1
       ELSE CASE WHEN t1.id = 2 AND time.date >= timestamp '2020-05-01 00:00:00' THEN 42 END
     END
   END) AS expected_value
FROM
  (SELECT year(ts.date) year, month(ts.date) month, ts.date FROM (
    (VALUES (SEQUENCE(date '2020-01-01', current_date, INTERVAL '1' MONTH))) AS ts(ts_array)
    CROSS JOIN UNNEST(ts_array) AS ts(date)
  ) GROUP BY ts.date) time

  CROSS JOIN (VALUES (1, 'Hal'), (2, 'John'), (3, 'Jack')) AS t1 (id, name)
  
  LEFT JOIN (VALUES
    (1, 1, timestamp '2020-01-03 10:22:33', timestamp '2020-07-03 23:59:59'),
    (1, 27, timestamp '2020-07-04 00:00:00', NULL),
    (2, 42, timestamp '2020-05-29 10:22:31', NULL)
  ) AS t2 (id, value1, validity_date_start, validity_date_end)
    ON t1.id = t2.id
    AND t2.validity_date_start <= (CAST(time.date as timestamp) + interval '1' month - interval '1' second)
    AND (t2.validity_date_end IS NULL OR t2.validity_date_end >= (CAST(time.date as timestamp) + interval '1' month - interval '1' second)) -- last_day_of_month (Athena doesn't have the fn)
  
  GROUP BY time.date, time.year, time.month, t1.id, t1.name, t2.value1, t2.validity_date_start
  ORDER BY time.year, time.month, t1.id

value and expected_value should match, but they don't (value is always empty). I've evidently misunderstood how FIRST_VALUE(...) OVER(...) works.

May you please help me?

Thank you very much!

1

There are 1 best solutions below

0
On

I've eventually found out what I was doing wrong here.

In the documents it is written:

The partition specification, which separates the input rows into different partitions. This is analogous to how the GROUP BY clause separates rows into different groups for aggregate functions

This led me to think that if I already had a GROUP BY statement, this was useless. It is not: generally if you want to get the datum for the given group, you have to specify it in the PARTITION BY statement, too (or better the dimensions that you're projecting in the SELECT part).

SELECT
  time.year,
  time.month,
  t1.name,
  FIRST_VALUE(t2.value1) OVER(PARTITION BY (time.year, time.month, t1.name) ORDER BY t2.validity_date_start DESC) AS value, -- take the last valid t2 part for the month
  (CASE WHEN time.date >= timestamp '2020-07-01 00:00:00' AND t1.id = 1 THEN 27
     ELSE CASE WHEN time.date >= timestamp '2020-05-01 00:00:00' AND t1.id = 2 THEN 42
       ELSE CASE WHEN time.date >= timestamp '2020-03-01 00:00:00' AND t1.id = 1 THEN 1 END
     END
   END) AS expected_value
FROM
  (SELECT year(ts.date) year, month(ts.date) month, ts.date FROM (
    (VALUES (SEQUENCE(date '2020-01-01', current_date, INTERVAL '1' MONTH))) AS ts(ts_array)
    CROSS JOIN UNNEST(ts_array) AS ts(date)
  ) GROUP BY ts.date) time

  CROSS JOIN (VALUES (1, 'Hal'), (2, 'John'), (3, 'Jack')) AS t1 (id, name)
  
  LEFT JOIN (VALUES
    (1, 1, timestamp '2020-03-01 10:22:33', timestamp '2020-07-03 23:59:59'),
    (1, 27, timestamp '2020-07-04 00:00:00', NULL),
    (2, 42, timestamp '2020-05-29 10:22:31', NULL)
  ) AS t2 (id, value1, validity_date_start, validity_date_end)
    ON t1.id = t2.id
    AND t2.validity_date_start <= (CAST(time.date as timestamp) + interval '1' month - interval '1' second)
    AND (t2.validity_date_end IS NULL OR t2.validity_date_end >= (CAST(time.date as timestamp) + interval '1' month - interval '1' second)) -- last_day_of_month (Athena doesn't have the fn)
  
  GROUP BY time.date, time.year, time.month, t1.id, t1.name, t2.value1, t2.validity_date_start
  ORDER BY time.year, time.month, t1.id