How can I join fact tables on incremental models without loosing data?

965 Views Asked by At

I have two tables containing different events, Table A and Table B, both of them are partitioned tables. I need to join these two tables, however, if I filter using a timestamp, I'll lose some events due to partitioning filter.

Example - Table A:

event_a_id timestamp event_b_id
a1 2023-10-01 b1
a2 2023-10-01 b2

Table B:

event_b_id timestamp text
b1 2023-01-01 lorem
b2 2023-10-01 ipsum

Result:

event_a_id timestamp event_b_id text
a1 2023-10-01 b1 null
a2 2023-10-01 b2 ipsum

If I filter both tables on timestamp = "2023-10-01" I will get event b2, but not event b1. How can I avoid this? I can't simple select the whole table, because it is huge, however, I can't publish the table with missing data.

I have tried filtering only 1 of tables, it reduces the amount of data processed, but does not solve the problem of missing information on the rows.

1

There are 1 best solutions below

0
On

There is a good write-up of your options by Tristan Handy, CEO of dbt Labs, here.

To summarize, you can design for correctness or performance, but you may need to accept some tradeoffs.

Max performance would be filtering both tables on the current date, as you describe:

select a.id, a.timestamp, a.event_b_id, b.text
from a
left join b 
    on a.event_b_id = b.id
    {% if is_incremental() %}
    and b.timestamp >= (select max(timestamp) from {{ this }})
    {% endif %}
{% if is_incremental() %}
where a.timestamp >= (select max(timestamp) from {{ this }})
{% endif %}

Max correctness would be only filtering table A:

select a.id, a.timestamp, a.event_b_id, b.text
from a
left join b 
    on a.event_b_id = b.id
{% if is_incremental() %}
where a.timestamp >= (select max(timestamp) from {{ this }})
{% endif %}

A compromise solution might include creating a window for late-arriving data in table A. For example, join rows in B if they were recorded less than 30 days before A:

select a.id, a.timestamp, a.event_b_id, b.text
from a
left join b 
    on a.event_b_id = b.id
    {% if is_incremental() %}
    and b.timestamp >= (
        select date_sub(max(timestamp), interval '30 days')
        from {{ this }}
    )
    {% endif %}
{% if is_incremental() %}
where a.timestamp >= (select max(timestamp) from {{ this }})
{% endif %}

(If your data in B arrives late, you would flip this logic around; you could also include a range on both tables A and B).