In https://www.linkedin.com/pulse/data-lt-mysteries-effectivity-satellite-driving-key-patrick-cuba/?trk=read_related_article-card_title Patrick Cuba showed how insert-only logic works when populating an effectivity satellite.
However, when applying this to an event stream where we have out-of-order events, the ingestion and querying logic seems to explode in complexity.
## Example
Consider this series of change events depicting changes in a person's employment relationship.
person_id | employee_id | src_event_time | change_type |
---|---|---|---|
A | X | t1 | INSERT |
A | X | t2 | DELETE |
A | X | t3 | INSERT |
Consider the following scenario where the events arrive in the order t3, t1, t2, and we're ingesting them into the effectivity link satellite table:
1. Processing event t3
person_id | employee_id | effectivity_start_date | effectivity_end_date |
---|---|---|---|
A | X | t3 | t9999 |
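To make this walkthrough concrete, here is a minimal Python sketch of the insert-only logic against an in-memory table. All the names (`EffectivityRow`, `satellite`, `process_first_event`) are mine, purely for illustration; a real satellite would also carry hash keys and a load date, and use real timestamps rather than the string labels t1..t9999 (which happen to sort correctly here).

```python
from dataclasses import dataclass

HIGH_DATE = "t9999"  # stand-in for the open-ended high date

@dataclass
class EffectivityRow:
    person_id: str
    employee_id: str
    start: str  # effectivity_start_date
    end: str    # effectivity_end_date

# In-memory stand-in for the effectivity satellite; rows are only ever
# appended, and list order doubles as load order.
satellite: list[EffectivityRow] = []

def process_first_event(person_id: str, employee_id: str, event_time: str) -> None:
    """First event seen for this key: open a period ending at the high date."""
    satellite.append(EffectivityRow(person_id, employee_id, event_time, HIGH_DATE))

process_first_event("A", "X", "t3")
# satellite now holds a single row: (A, X, t3, t9999)
```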
2. Processing event t1
Here our ingestion logic can identify that the new event time t1 is before the existing event t3, so we set the effectivity period to run from t1 to t3:
person_id | employee_id | effectivity_start_date | effectivity_end_date |
---|---|---|---|
A | X | t3 | t9999 |
A | X | t1 | t3 |
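Continuing the sketch above, the out-of-order case might look like this: when an event arrives that is older than every existing period, the new period is bounded by the next-known start date.

```python
def process_earlier_event(person_id: str, employee_id: str, event_time: str) -> None:
    """Out-of-order event older than the existing periods: insert a new
    period running from this event up to the next-known start date.
    (Assumes a later period exists, as it does in this scenario.)"""
    existing = [r for r in satellite
                if r.person_id == person_id and r.employee_id == employee_id]
    next_start = min(r.start for r in existing if r.start > event_time)
    satellite.append(EffectivityRow(person_id, employee_id, event_time, next_start))

process_earlier_event("A", "X", "t1")
# satellite: (A, X, t3, t9999), (A, X, t1, t3)
```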
3. Processing event t2
This is where things get a bit weird. Our logic has to detect that there is an existing effectivity period (from t1 to t3) and then "correct" it by adding a (t1, t2) effectivity period. Since our constraint is that we only ever INSERT and never UPDATE existing rows, we now have two effectivity rows with the same start time t1:
person_id | employee_id | effectivity_start_date | effectivity_end_date |
---|---|---|---|
A | X | t3 | t9999 |
A | X | t1 | t3 |
A | X | t1 | t2 |
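In the same sketch, the DELETE at t2 is handled by appending a corrected row that reuses the start date t1 but ends at t2; the original (t1, t3) row is never touched, which is exactly how the duplicate start time arises.

```python
def process_delete_event(person_id: str, employee_id: str, event_time: str) -> None:
    """A DELETE closes whichever period covers it. Insert-only, so we append
    a corrected row with the same start and the new, earlier end; the row
    loaded later is meant to supersede the one loaded earlier."""
    covering = [r for r in satellite
                if r.person_id == person_id and r.employee_id == employee_id
                and r.start <= event_time < r.end]
    for row in covering:
        satellite.append(EffectivityRow(person_id, employee_id, row.start, event_time))

process_delete_event("A", "X", "t2")
# satellite: (A, X, t3, t9999), (A, X, t1, t3), (A, X, t1, t2)
```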
## Question
Is this the correct approach? How would I go about querying the state of a relationship at a given point in time? (For example, how would I query the relationship at t1.5, or t2.5?)
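The only approach I can think of, continuing the sketch above: resolve the duplicate starts by keeping only the latest-loaded row per start date (the later load supersedes the earlier one), then check whether any surviving period covers the query time. This is just my guess at a resolve-then-filter query, not anyone's reference query code.

```python
def relationship_effective_at(person_id: str, employee_id: str, as_of: str) -> bool:
    """Keep the latest-loaded row per start date, then test coverage."""
    current: dict[str, EffectivityRow] = {}
    for r in satellite:  # list order == load order, so later rows win
        if r.person_id == person_id and r.employee_id == employee_id:
            current[r.start] = r
    return any(r.start <= as_of < r.end for r in current.values())

print(relationship_effective_at("A", "X", "t1.5"))  # True:  (t1, t2) covers t1.5
print(relationship_effective_at("A", "X", "t2.5"))  # False: (t1, t3) is superseded
```

Note that a naive covering check over all rows would wrongly report the relationship as active at t2.5 via the superseded (t1, t3) row; resolving per start date first is what makes the duplicate-start rows queryable.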
It’s not quite correct. *Data Vault Guru* contains better examples, load code and query code.