I've got a table split in a static part and a history one. I have to create a query which groups by a series of dimensions, including year and month, and do some aggregations. One of the values that I need to project is a value of the last tuple of the history table matching the given year / month couple.
History table have validity_date_start and validity_date_end, and the latter is NULL if it's up-to-date.
This is the query I've done so far (using temporary tables for ease of reproduction):
SELECT
time.year,
time.month,
t1.name,
FIRST_VALUE(t2.value1) OVER(ORDER BY t2.validity_date_start DESC) AS value, -- take the last valid t2 part for the month
(CASE WHEN t1.id = 1 AND time.date >= timestamp '2020-07-01 00:00:00' THEN 27
ELSE CASE WHEN t1.id = 1 AND time.date >= timestamp '2020-03-01 00:00:00' THEN 1
ELSE CASE WHEN t1.id = 2 AND time.date >= timestamp '2020-05-01 00:00:00' THEN 42 END
END
END) AS expected_value
FROM
(SELECT year(ts.date) year, month(ts.date) month, ts.date FROM (
(VALUES (SEQUENCE(date '2020-01-01', current_date, INTERVAL '1' MONTH))) AS ts(ts_array)
CROSS JOIN UNNEST(ts_array) AS ts(date)
) GROUP BY ts.date) time
CROSS JOIN (VALUES (1, 'Hal'), (2, 'John'), (3, 'Jack')) AS t1 (id, name)
LEFT JOIN (VALUES
(1, 1, timestamp '2020-01-03 10:22:33', timestamp '2020-07-03 23:59:59'),
(1, 27, timestamp '2020-07-04 00:00:00', NULL),
(2, 42, timestamp '2020-05-29 10:22:31', NULL)
) AS t2 (id, value1, validity_date_start, validity_date_end)
ON t1.id = t2.id
AND t2.validity_date_start <= (CAST(time.date as timestamp) + interval '1' month - interval '1' second)
AND (t2.validity_date_end IS NULL OR t2.validity_date_end >= (CAST(time.date as timestamp) + interval '1' month - interval '1' second)) -- last_day_of_month (Athena doesn't have the fn)
GROUP BY time.date, time.year, time.month, t1.id, t1.name, t2.value1, t2.validity_date_start
ORDER BY time.year, time.month, t1.id
value and expected_value should match, but they don't (value is always empty). I've evidently misunderstood how FIRST_VALUE(...) OVER(...) works.
May you please help me?
Thank you very much!
I've eventually found out what I was doing wrong here.
In the documents it is written:
This led me to think that if I already had a
GROUP BYstatement, this was useless. It is not: generally if you want to get the datum for the given group, you have to specify it in thePARTITION BYstatement, too (or better the dimensions that you're projecting in theSELECTpart).