I have two tables containing different events, Table A and Table B; both are partitioned tables. I need to join them, but if I filter on a timestamp I lose some events because of the partition filter.
Example - Table A:

| event_a_id | timestamp  | event_b_id |
|------------|------------|------------|
| a1         | 2023-10-01 | b1         |
| a2         | 2023-10-01 | b2         |
Table B:

| event_b_id | timestamp  | text  |
|------------|------------|-------|
| b1         | 2023-01-01 | lorem |
| b2         | 2023-10-01 | ipsum |
Result:

| event_a_id | timestamp  | event_b_id | text  |
|------------|------------|------------|-------|
| a1         | 2023-10-01 | b1         | null  |
| a2         | 2023-10-01 | b2         | ipsum |
If I filter both tables on timestamp = "2023-10-01", I get event b2 but not event b1. How can I avoid this? I can't simply select the whole table, because it is huge, but I also can't publish the table with missing data.
I have tried filtering only one of the tables; this reduces the amount of data processed, but it does not solve the problem of missing information in the rows.
There is a good write-up of your options by Tristan Handy, CEO of dbt Labs, here.
To summarize, you can design for correctness or performance, but you may need to accept some tradeoffs.
Max performance would be filtering both tables on the current date, as you describe:
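For example (using the table and column names from your example as `table_a` and `table_b`; exact syntax depends on your warehouse):

```sql
-- Filter both sides on the partition column: cheapest scan,
-- but B rows in other partitions (like b1) come back as NULL.
SELECT
  a.event_a_id,
  a.timestamp,
  a.event_b_id,
  b.text
FROM table_a AS a
LEFT JOIN table_b AS b
  ON  a.event_b_id = b.event_b_id
  AND b.timestamp = '2023-10-01'  -- partition filter on B
WHERE a.timestamp = '2023-10-01'; -- partition filter on A
```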
Max correctness would be only filtering table A:
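A sketch, with the same assumed table names as the example data:

```sql
-- Filter only A: B is scanned in full, so no rows lose their text,
-- but you pay for reading the whole (huge) table B.
SELECT
  a.event_a_id,
  a.timestamp,
  a.event_b_id,
  b.text
FROM table_a AS a
LEFT JOIN table_b AS b
  ON a.event_b_id = b.event_b_id
WHERE a.timestamp = '2023-10-01';
```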
A compromise solution might include creating a window for late-arriving data in table A. For example, join rows in B if they were recorded less than 30 days before A:
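A sketch of the windowed join, assuming a DATE-typed partition column and BigQuery-flavoured `DATE_SUB` (other engines spell the interval arithmetic differently):

```sql
-- Scan only a 30-day window of B: bounded cost, and only rows
-- older than the window (like b1) can still come back NULL.
SELECT
  a.event_a_id,
  a.timestamp,
  a.event_b_id,
  b.text
FROM table_a AS a
LEFT JOIN table_b AS b
  ON  a.event_b_id = b.event_b_id
  AND b.timestamp BETWEEN DATE_SUB(DATE '2023-10-01', INTERVAL 30 DAY)
                      AND DATE '2023-10-01'
WHERE a.timestamp = '2023-10-01';
```

Keeping the window bounds constant (rather than correlated with `a.timestamp`) can matter: on some engines, including BigQuery, partition pruning generally requires constant predicates on the partition column.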
(If your data in B arrives late, you would flip this logic around; you could also include a range on both tables A and B).