JOIN data in event table with most recent data in changelog table before the time of the event

I'm looking to JOIN data from a change-log style table of users into a table of events that have matching user ids.

The tables are as follows:

project_events

schema

timestamp       TIMESTAMP
event_id        STRING  
user_id         STRING
data            STRING  

example data

| timestamp                   | event_id  | user_id | data             |
|-----------------------------|-----------|---------|------------------|
| 2020-08-22 17:01:18.807 UTC | hHZuTE8Y= | ABC123  | {"some":"json" } |
| 2020-08-20 16:57:28.022 UTC | tF5Gky8Q= | ZXY432  | {"foo":"item" }  |
| 2020-08-15 16:44:25.607 UTC | 1dOU8pOo= | ABC123  | {"bar":"val" }   |

users_changelog

schema

timestamp       TIMESTAMP
event_id        STRING  
operation       STRING  
user_id         STRING
data            STRING  

example data

| timestamp                   | event_id  | operation | user_id | data                |
|-----------------------------|-----------|-----------|---------|---------------------|
| 2020-08-30 12:50:59.036 UTC | mGdNKy+o= | DELETE    | ABC123  | {"name":"removed" } |
| 2020-08-20 16:50:59.036 UTC | mGdNKy+o= | UPDATE    | ABC123  | {"name":"final" }   |
| 2020-08-05 20:45:36.936 UTC | mIICo9LY= | UPDATE    | ZXY432  | {"name":"asdf" }    |
| 2020-08-05 20:45:21.023 UTC | nEDKyCks= | UPDATE    | ABC123  | {"name":"other" }   |
| 2020-08-03 12:40:49.036 UTC | GxnbUqQ0= | CREATE    | ABC123  | {"name":"initial" } |
| 1970-01-01 00:00:00 UTC     | 1y+6fVWo= | IMPORT    | ZXY432  | {"name":"test" }    |

NOTE: operation can be "CREATE", "UPDATE", "DELETE", or "IMPORT". Since users can be updated multiple times, there can be multiple rows with the same user_id.

The goal is to display each event's event_id and data alongside the data from the most recent row in users_changelog for the matching user_id at or before the event's timestamp. Using the example data, the intended result would be:

| event_id  | event_data       | user_id | user_data         |
|-----------|------------------|---------|-------------------|
| hHZuTE8Y= | {"some":"json" } | ABC123  | {"name":"final" } |
| tF5Gky8Q= | {"foo":"item" }  | ZXY432  | {"name":"asdf" }  |
| 1dOU8pOo= | {"bar":"val" }   | ABC123  | {"name":"other" } |

I have tried the following, but it produces duplicate rows (one for each row in the changelog table with a matching id):

SELECT
  events.event_id as event_id,
  events.data as event_data,
  users.user_id as user_id,
  users.data as user_data
FROM my_project.my_dataset.project_events as events
LEFT JOIN my_project.my_dataset.users_changelog as users
ON events.user_id = users.user_id AND users.timestamp <= events.timestamp
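
One direct way to collapse those duplicates, assuming BigQuery Standard SQL (which the first answer below targets), is to keep only the newest matching changelog row per event with QUALIFY; a minimal, untested sketch against the same tables:

SELECT
  events.event_id as event_id,
  events.data as event_data,
  users.user_id as user_id,
  users.data as user_data
FROM `my_project.my_dataset.project_events` as events
LEFT JOIN `my_project.my_dataset.users_changelog` as users
ON events.user_id = users.user_id AND users.timestamp <= events.timestamp
-- WHERE TRUE satisfies BigQuery's rule that QUALIFY accompany a
-- WHERE, GROUP BY, or HAVING clause
WHERE TRUE
-- keep only the newest changelog row per event
QUALIFY ROW_NUMBER() OVER (PARTITION BY events.event_id ORDER BY users.timestamp DESC) = 1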

There are 2 answers below.


Below is for BigQuery Standard SQL

#standardSQL
SELECT event_id, data AS event_data, user_id,
  -- pick the newest changelog entry recorded before the event
  ( SELECT data
    FROM UNNEST(arr) rec
    WHERE rec.timestamp < t.timestamp
    ORDER BY rec.timestamp DESC
    LIMIT 1
  ) AS user_data
FROM (
  SELECT
    ANY_VALUE(events).*,
    -- gather the matching user's changelog history next to each event
    ARRAY_AGG(STRUCT(users.data, users.timestamp)) arr
  FROM `my_project.my_dataset.project_events` AS events
  LEFT JOIN `my_project.my_dataset.users_changelog` AS users
  ON events.user_id = users.user_id
  -- FORMAT('%t', events) renders the whole event row as text,
  -- effectively grouping by the entire row
  GROUP BY FORMAT('%t', events)
) t
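
The GROUP BY FORMAT('%t', events) trick groups on the text rendering of the entire event row, so ARRAY_AGG collects every matching changelog entry per event and the correlated subquery then keeps only the newest one older than the event. Note the strict rec.timestamp < t.timestamp here, whereas the attempt in the question used <=; a changelog row sharing an event's exact timestamp would be excluded by this version.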

Applied to the sample data from your question, the output is:

| Row | event_id  | event_data       | user_id | user_data         |
|-----|-----------|------------------|---------|-------------------|
| 1   | hHZuTE8Y= | {"some":"json" } | ABC123  | {"name":"final" } |
| 2   | tF5Gky8Q= | {"foo":"item" }  | ZXY432  | {"name":"asdf" }  |
| 3   | 1dOU8pOo= | {"bar":"val" }   | ABC123  | {"name":"other" } |

Using SQL Server, I took the ROW_NUMBER() route to reproduce your goal output:

SELECT event_id,
       event_data,
       user_id,
       user_data
FROM (
      SELECT
        events.event_id as event_id,
        events.data as event_data,
        users.user_id as user_id,
        users.data as user_data,
        -- rank each event's matching changelog rows, newest first
        ROW_NUMBER() OVER (PARTITION BY users.user_id, events.event_id ORDER BY users.timestamp DESC) AS Count_by_User
      FROM #TEMP1 as events
      LEFT JOIN #TEMP2 as users
            ON events.user_id = users.user_id AND users.timestamp <= events.timestamp
) as a
-- keep only the newest changelog row per event
WHERE Count_by_User = 1
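
Window functions such as ROW_NUMBER() can't be referenced in a WHERE clause, which is why the ranking happens inside a derived table and the Count_by_User = 1 filter is applied one level up.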

Output:

| event_id  | event_data       | user_id | user_data         |
|-----------|------------------|---------|-------------------|
| 1dOU8pOo= | {"bar":"val" }   | ABC123  | {"name":"other" } |
| hHZuTE8Y= | {"some":"json" } | ABC123  | {"name":"final" } |
| tF5Gky8Q= | {"foo":"item" }  | ZXY432  | {"name":"asdf" }  |

Here's the code I used to generate the test tables (if others would like to validate):

-- timestamps are stored as VARCHAR here; for this sample data the
-- ISO-style strings still order correctly under the <= comparison above
create table #TEMP1
(timestamp  VARCHAR(max), event_id  VARCHAR(max), user_id VARCHAR(max) , data VARCHAR(max))
INSERT INTO #TEMP1 (timestamp, event_id, user_id , data)
VALUES
('2020-08-22 17:01:18.807 UTC' , 'hHZuTE8Y=' , 'ABC123'  , '{"some":"json" }' ),
('2020-08-20 16:57:28.022 UTC' , 'tF5Gky8Q=' , 'ZXY432'  , '{"foo":"item" } ' ),
('2020-08-15 16:44:25.607 UTC' , '1dOU8pOo=' , 'ABC123'  , '{"bar":"val" }  ' )


create table #TEMP2
(timestamp  VARCHAR(max), event_id  VARCHAR(max), operation VARCHAR(MAX), user_id VARCHAR(max) , data VARCHAR(max))

INSERT INTO #TEMP2 (timestamp, event_id, operation, user_id , data)
VALUES
('2020-08-30 12:50:59.036 UTC' , 'mGdNKy+o=' , 'DELETE'    , 'ABC123'  , '{"name":"removed" }'),
('2020-08-20 16:50:59.036 UTC' , 'mGdNKy+o=' , 'UPDATE'    , 'ABC123'  , '{"name":"final" }  '),
('2020-08-05 20:45:36.936 UTC' , 'mIICo9LY=' , 'UPDATE'    , 'ZXY432'  , '{"name":"asdf" }   '),
('2020-08-05 20:45:21.023 UTC' , 'nEDKyCks=' , 'UPDATE'    , 'ABC123'  , '{"name":"other" }  '),
('2020-08-03 12:40:49.036 UTC' , 'GxnbUqQ0=' , 'CREATE'    , 'ABC123'  , '{"name":"initial" }'),
('1970-01-01 00:00:00 UTC'     , '1y+6fVWo=' , 'IMPORT'    , 'ZXY432'  , '{"name":"test" }   ')
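
An equivalent SQL Server formulation swaps the window function for OUTER APPLY with TOP (1); a minimal, untested sketch against the same #TEMP tables:

SELECT events.event_id,
       events.data AS event_data,
       latest.user_id,
       latest.data AS user_data
FROM #TEMP1 AS events
OUTER APPLY (
    -- newest changelog row at or before this event's timestamp
    SELECT TOP (1) users.user_id, users.data
    FROM #TEMP2 AS users
    WHERE users.user_id = events.user_id
      AND users.timestamp <= events.timestamp
    ORDER BY users.timestamp DESC
) AS latest

OUTER APPLY (rather than CROSS APPLY) preserves events with no earlier changelog row, matching the LEFT JOIN semantics of the original.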