In BIgQuery, how do I force a window to end at the current row when using a RANGE BETWEEN?

34 Views Asked by At

In BigQuery, I have a table with a date field called date, and another row called hit. Many rows can share the same date, but date + hit is unique.

Row Date Source Hit
1 1/1/24 google 1
2 1/1/24 NULL 2
3 1/2/24 facebook 3
4 1/2/24 NULL 4
5 1/2/24 twitter 5
6 1/2/24 NULL 6
7 1/2/24 bing 7
8 1/31/24 NULL 8
9 2/10/24 instagram 9
10 2/10/24 NULL 10

I want to create a window that looks back 30 days but ends on the current row when ordered by date and hit. With a RANGE BETWEEN clause you can only order by one value.

By design, it seems like ORDER BY date RANGE BETWEEN 30 AND CURRENT ROW will create a window that ends on all rows that match the current row's date, including those in the future.

For instance, observe the results of lastNonDirect for rows 5 and 6. I expect it to be 'twitter', however, it returns 'bing', since the date for row 5 and 6 is the same as row 7, and it chooses 7 as the LAST_VALUE.

WITH t1 AS (
SELECT 
  1 as origOrder, DATE '2024-01-01' as Date, 'google' as Source, 1 as Hit, 'google' as lastNonDirectClick, 'google' as currentCode UNION ALL
SELECT 
  2, DATE '2024-01-01', NULL, 2, 'google', 'google' UNION ALL
SELECT 
  3, DATE '2024-01-02', 'facebook', 3, 'facebook', 'bing' UNION ALL
SELECT 
  4, DATE '2024-01-02', NULL, 4,'facebook','bing' UNION ALL
SELECT 
5 ,DATE '2024-01-02','twitter',5,'twitter','bing' UNION ALL
SELECT 
6 ,DATE '2024-01-02',NULL ,6,'twitter','bing' UNION ALL
SELECT 
7 ,DATE '2024-01-02','bing',7,'bing','bing' UNION ALL
SELECT 
8 ,DATE '2024-01-31',NULL ,8,'bing','bing' UNION ALL
SELECT 
9 ,DATE '2024-02-10' ,'instagram',9,'instagram' ,'instagram' UNION ALL
SELECT 
10 ,DATE '2024-02-10',NULL ,10,'instagram' ,'instagram')
  ########################################################################################
SELECT
  origOrder,
  Date,
  SOURCE,
  Hit,
  LAST_VALUE(SOURCE IGNORE NULLS) OVER(
    ORDER BY
      UNIX_DATE(date) 
    RANGE BETWEEN
       30 PRECEDING
       AND CURRENT ROW) AS lastNonDirect
FROM
  t1
ORDER BY
  origOrder

Results:

origOrder Date SOURCE Hit lastNonDirect
1 2024-01-01 google 1 google
2 2024-01-01 2 google
3 2024-01-02 facebook 3 bing
4 2024-01-02 4 bing
5 2024-01-02 twitter 5 bing
6 2024-01-02 6 bing
7 2024-01-02 bing 7 bing
8 2024-01-31 8 bing
9 2024-02-10 instagram 9 instagram
10 2024-02-10 10 instagram

I attempted to be creative with the ORDER BY:

LAST_VALUE(source) OVER(ORDER BY UNIX_DATE(date) + hit / 100000)

But this won't work if my current hit number is greater than the hit I expect to get 30 days ago.

1

There are 1 best solutions below

1
Damião Martins On

An option can be convert the date to number using the format %Y%m%d that will keep the expected order and then add with the hit:

SELECT
  origOrder,
  Date,
  SOURCE,
  Hit,
  LAST_VALUE(SOURCE IGNORE NULLS) OVER(
    ORDER BY
      (CAST(FORMAT_DATE('%Y%m%d', date) AS INT64) * 10000) + hit
    RANGE BETWEEN
       30 PRECEDING
       AND CURRENT ROW) AS lastNonDirect,
  (CAST(FORMAT_DATE('%Y%m%d', date) AS INT64) * 10000) + hit
FROM
  t1
ORDER BY
  origOrder

Output: Output