JOIN data in event table with most recent data in changelog table before the time of the event

I'm looking to JOIN data from a change-log style table of users into a table of events that have matching user ids.

The tables are as follows:

project_events

schema

timestamp       TIMESTAMP
event_id        STRING  
user_id         STRING
data            STRING  

example data

| timestamp                   | event_id  | user_id | data             |
|-----------------------------|-----------|---------|------------------|
| 2020-08-22 17:01:18.807 UTC | hHZuTE8Y= | ABC123  | {"some":"json" } |
| 2020-08-20 16:57:28.022 UTC | tF5Gky8Q= | ZXY432  | {"foo":"item" }  |
| 2020-08-15 16:44:25.607 UTC | 1dOU8pOo= | ABC123  | {"bar":"val" }   |

users_changelog

schema

timestamp       TIMESTAMP
event_id        STRING  
operation       STRING  
user_id         STRING
data            STRING  

example data

| timestamp                   | event_id  | operation | user_id | data                |
|-----------------------------|-----------|-----------|---------|---------------------|
| 2020-08-30 12:50:59.036 UTC | mGdNKy+o= | DELETE    | ABC123  | {"name":"removed" } |
| 2020-08-20 16:50:59.036 UTC | mGdNKy+o= | UPDATE    | ABC123  | {"name":"final" }   |
| 2020-08-05 20:45:36.936 UTC | mIICo9LY= | UPDATE    | ZXY432  | {"name":"asdf" }    |
| 2020-08-05 20:45:21.023 UTC | nEDKyCks= | UPDATE    | ABC123  | {"name":"other" }   |
| 2020-08-03 12:40:49.036 UTC | GxnbUqQ0= | CREATE    | ABC123  | {"name":"initial" } |
| 1970-01-01 00:00:00 UTC     | 1y+6fVWo= | IMPORT    | ZXY432  | {"name":"test" }    |

NOTE: operation can be "CREATE", "UPDATE", "DELETE", or "IMPORT". Since users can be updated multiple times, there can be multiple rows with the same user_id.

The goal is to display each event's event_id and data alongside the data from the most recent row in users_changelog for the matching user_id at or before the event's timestamp. Using the example data, the intended result would be:

| event_id  | event_data       | user_id | user_data         |
|-----------|------------------|---------|-------------------|
| hHZuTE8Y= | {"some":"json" } | ABC123  | {"name":"final" } |
| tF5Gky8Q= | {"foo":"item" }  | ZXY432  | {"name":"asdf" }  |
| 1dOU8pOo= | {"bar":"val" }   | ABC123  | {"name":"other" } |

I have tried the following, but it produces duplicate rows (one for each row in the changelog table with a matching id):

SELECT
  events.event_id as event_id,
  events.data as event_data,
  users.user_id as user_id,
  users.data as user_data
FROM my_project.my_dataset.project_events as events
LEFT JOIN my_project.my_dataset.users_changelog as users
ON events.user_id = users.user_id AND users.timestamp <= events.timestamp
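
One direct way to collapse those duplicates, assuming BigQuery Standard SQL (which the first answer below targets), is to keep only the newest matching changelog row per event with QUALIFY; a minimal, untested sketch against the same tables:

SELECT
  events.event_id as event_id,
  events.data as event_data,
  users.user_id as user_id,
  users.data as user_data
FROM `my_project.my_dataset.project_events` as events
LEFT JOIN `my_project.my_dataset.users_changelog` as users
ON events.user_id = users.user_id AND users.timestamp <= events.timestamp
-- WHERE TRUE satisfies BigQuery's rule that QUALIFY accompany a
-- WHERE, GROUP BY, or HAVING clause
WHERE TRUE
-- keep only the newest changelog row per event
QUALIFY ROW_NUMBER() OVER (PARTITION BY events.event_id ORDER BY users.timestamp DESC) = 1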

There are 2 answers below.


Below is for BigQuery Standard SQL

#standardSQL
SELECT event_id, data AS event_data, user_id,
  -- pick the newest changelog entry recorded before the event
  ( SELECT data
    FROM UNNEST(arr) rec
    WHERE rec.timestamp < t.timestamp
    ORDER BY rec.timestamp DESC
    LIMIT 1
  ) AS user_data
FROM (
  SELECT
    ANY_VALUE(events).*,
    -- gather the matching user's changelog history next to each event
    ARRAY_AGG(STRUCT(users.data, users.timestamp)) arr
  FROM `my_project.my_dataset.project_events` AS events
  LEFT JOIN `my_project.my_dataset.users_changelog` AS users
  ON events.user_id = users.user_id
  -- FORMAT('%t', events) renders the whole event row as text,
  -- effectively grouping by the entire row
  GROUP BY FORMAT('%t', events)
) t
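
The GROUP BY FORMAT('%t', events) trick groups on the text rendering of the entire event row, so ARRAY_AGG collects every matching changelog entry per event and the correlated subquery then keeps only the newest one older than the event. Note the strict rec.timestamp < t.timestamp here, whereas the attempt in the question used <=; a changelog row sharing an event's exact timestamp would be excluded by this version.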

Applied to the sample data from your question, the output is:

| Row | event_id  | event_data       | user_id | user_data         |
|-----|-----------|------------------|---------|-------------------|
| 1   | hHZuTE8Y= | {"some":"json" } | ABC123  | {"name":"final" } |
| 2   | tF5Gky8Q= | {"foo":"item" }  | ZXY432  | {"name":"asdf" }  |
| 3   | 1dOU8pOo= | {"bar":"val" }   | ABC123  | {"name":"other" } |

Using SQL Server, I took the ROW_NUMBER() route to reproduce your goal output:

SELECT event_id,
       event_data,
       user_id,
       user_data
FROM (
      SELECT
        events.event_id as event_id,
        events.data as event_data,
        users.user_id as user_id,
        users.data as user_data,
        -- rank each event's matching changelog rows, newest first
        ROW_NUMBER() OVER (PARTITION BY users.user_id, events.event_id ORDER BY users.timestamp DESC) AS Count_by_User
      FROM #TEMP1 as events
      LEFT JOIN #TEMP2 as users
            ON events.user_id = users.user_id AND users.timestamp <= events.timestamp
) as a
-- keep only the newest changelog row per event
WHERE Count_by_User = 1
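
Window functions such as ROW_NUMBER() can't be referenced in a WHERE clause, which is why the ranking happens inside a derived table and the Count_by_User = 1 filter is applied one level up.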

Output:

| event_id  | event_data       | user_id | user_data         |
|-----------|------------------|---------|-------------------|
| 1dOU8pOo= | {"bar":"val" }   | ABC123  | {"name":"other" } |
| hHZuTE8Y= | {"some":"json" } | ABC123  | {"name":"final" } |
| tF5Gky8Q= | {"foo":"item" }  | ZXY432  | {"name":"asdf" }  |

Here's the code I used to generate the test tables (if others would like to validate):

-- timestamps are stored as VARCHAR here; for this sample data the
-- ISO-style strings still order correctly under the <= comparison above
create table #TEMP1
(timestamp  VARCHAR(max), event_id  VARCHAR(max), user_id VARCHAR(max) , data VARCHAR(max))
INSERT INTO #TEMP1 (timestamp, event_id, user_id , data)
VALUES
('2020-08-22 17:01:18.807 UTC' , 'hHZuTE8Y=' , 'ABC123'  , '{"some":"json" }' ),
('2020-08-20 16:57:28.022 UTC' , 'tF5Gky8Q=' , 'ZXY432'  , '{"foo":"item" } ' ),
('2020-08-15 16:44:25.607 UTC' , '1dOU8pOo=' , 'ABC123'  , '{"bar":"val" }  ' )


create table #TEMP2
(timestamp  VARCHAR(max), event_id  VARCHAR(max), operation VARCHAR(MAX), user_id VARCHAR(max) , data VARCHAR(max))

INSERT INTO #TEMP2 (timestamp, event_id, operation, user_id , data)
VALUES
('2020-08-30 12:50:59.036 UTC' , 'mGdNKy+o=' , 'DELETE'    , 'ABC123'  , '{"name":"removed" }'),
('2020-08-20 16:50:59.036 UTC' , 'mGdNKy+o=' , 'UPDATE'    , 'ABC123'  , '{"name":"final" }  '),
('2020-08-05 20:45:36.936 UTC' , 'mIICo9LY=' , 'UPDATE'    , 'ZXY432'  , '{"name":"asdf" }   '),
('2020-08-05 20:45:21.023 UTC' , 'nEDKyCks=' , 'UPDATE'    , 'ABC123'  , '{"name":"other" }  '),
('2020-08-03 12:40:49.036 UTC' , 'GxnbUqQ0=' , 'CREATE'    , 'ABC123'  , '{"name":"initial" }'),
('1970-01-01 00:00:00 UTC'     , '1y+6fVWo=' , 'IMPORT'    , 'ZXY432'  , '{"name":"test" }   ')
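
An equivalent SQL Server formulation swaps the window function for OUTER APPLY with TOP (1); a minimal, untested sketch against the same #TEMP tables:

SELECT events.event_id,
       events.data AS event_data,
       latest.user_id,
       latest.data AS user_data
FROM #TEMP1 AS events
OUTER APPLY (
    -- newest changelog row at or before this event's timestamp
    SELECT TOP (1) users.user_id, users.data
    FROM #TEMP2 AS users
    WHERE users.user_id = events.user_id
      AND users.timestamp <= events.timestamp
    ORDER BY users.timestamp DESC
) AS latest

OUTER APPLY (rather than CROSS APPLY) preserves events with no earlier changelog row, matching the LEFT JOIN semantics of the original.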