In BigQuery, I have a table with a date field called date, and another row called hit. Many rows can share the same date, but date + hit is unique.
| Row | Date | Source | Hit |
|---|---|---|---|
| 1 | 1/1/24 | 1 | |
| 2 | 1/1/24 | NULL | 2 |
| 3 | 1/2/24 | 3 | |
| 4 | 1/2/24 | NULL | 4 |
| 5 | 1/2/24 | 5 | |
| 6 | 1/2/24 | NULL | 6 |
| 7 | 1/2/24 | bing | 7 |
| 8 | 1/31/24 | NULL | 8 |
| 9 | 2/10/24 | 9 | |
| 10 | 2/10/24 | NULL | 10 |
I want to create a window that looks back 30 days but ends on the current row when ordered by date and hit. With a RANGE BETWEEN clause you can only order by one value.
By design, it seems like ORDER BY date RANGE BETWEEN 30 AND CURRENT ROW will create a window that ends on all rows that match the current row's date, including those in the future.
For instance, observe the results of lastNonDirect for rows 5 and 6. I expect it to be 'twitter', however, it returns 'bing', since the date for row 5 and 6 is the same as row 7, and it chooses 7 as the LAST_VALUE.
WITH t1 AS (
SELECT
1 as origOrder, DATE '2024-01-01' as Date, 'google' as Source, 1 as Hit, 'google' as lastNonDirectClick, 'google' as currentCode UNION ALL
SELECT
2, DATE '2024-01-01', NULL, 2, 'google', 'google' UNION ALL
SELECT
3, DATE '2024-01-02', 'facebook', 3, 'facebook', 'bing' UNION ALL
SELECT
4, DATE '2024-01-02', NULL, 4,'facebook','bing' UNION ALL
SELECT
5 ,DATE '2024-01-02','twitter',5,'twitter','bing' UNION ALL
SELECT
6 ,DATE '2024-01-02',NULL ,6,'twitter','bing' UNION ALL
SELECT
7 ,DATE '2024-01-02','bing',7,'bing','bing' UNION ALL
SELECT
8 ,DATE '2024-01-31',NULL ,8,'bing','bing' UNION ALL
SELECT
9 ,DATE '2024-02-10' ,'instagram',9,'instagram' ,'instagram' UNION ALL
SELECT
10 ,DATE '2024-02-10',NULL ,10,'instagram' ,'instagram')
########################################################################################
SELECT
origOrder,
Date,
SOURCE,
Hit,
LAST_VALUE(SOURCE IGNORE NULLS) OVER(
ORDER BY
UNIX_DATE(date)
RANGE BETWEEN
30 PRECEDING
AND CURRENT ROW) AS lastNonDirect
FROM
t1
ORDER BY
origOrder
Results:
| origOrder | Date | SOURCE | Hit | lastNonDirect |
|---|---|---|---|---|
| 1 | 2024-01-01 | 1 | ||
| 2 | 2024-01-01 | 2 | ||
| 3 | 2024-01-02 | 3 | bing | |
| 4 | 2024-01-02 | 4 | bing | |
| 5 | 2024-01-02 | 5 | bing | |
| 6 | 2024-01-02 | 6 | bing | |
| 7 | 2024-01-02 | bing | 7 | bing |
| 8 | 2024-01-31 | 8 | bing | |
| 9 | 2024-02-10 | 9 | ||
| 10 | 2024-02-10 | 10 |
I attempted to be creative with the ORDER BY:
LAST_VALUE(source) OVER(ORDER BY UNIX_DATE(date) + hit / 100000)
But this won't work if my current hit number is greater than the hit I expect to get 30 days ago.
An option can be convert the date to number using the format
%Y%m%dthat will keep the expected order and then add with the hit:Output: