Converting EAVT table into SCD type 2

244 Views Asked by At

After a lot of research and head picking, I'm still unable to find a good/clean solution to convert an entity-attribute-value-timestamp table to an scd type 2 dimension.

Here's the issue: I have a CRM source that stores all history in a EAVT model (Entity/Attribute/Value of the attribute/valid_from/valid_to). So for every object (Company, product...etc) I have a table with the current state that is in a relational model, and another history table that contains all value changes to all attributes with a valid_from/valid_to column for validity of the values themselves.

I want to be able to merge these two tables into an SCD table with a Valid_To/Valid_From and a column per attribute.

To give an example: Company has two tables:

  • Current state of the Companies:
company_id name number_of_employees city
1 Company 1 500 Paris
2 Company 2 500 Paris
  • History Table:
company_id attribute value valid_from valid_to
1 city New York 01/01/2020 01/05/2022
1 city Paris 01/05/2022 12/31/9999
1 number_of_employees 50 01/01/2021 01/01/2022
1 number_of_employees 100 01/01/2022 12/31/9999

What I want to have as a result is the following:

company_id name city number_of_employees valid_from valid_to is_active
1 Company 1 New York null 01/01/2020 01/01/2021 false
1 Company 1 New York 50 01/01/2021 01/01/2022 false
1 Company 1 New York 100 01/01/2022 01/01/2022 false
1 Company 1 Paris 100 01/05/2022 12/31/9999 true

So based on this example, we have a company that started on 01/01/2020 with New York as city and number of employees wasn't populated at that time. We then modified our company to add 50 as the number of employees, this happened on 01/01/2021. We modified our company again on 01/01/2022 to change the number of employees to 100, only to change the city of the company from New York to Paris on 01/05/2021.

This gives us 4 states for the company, so our SCD should contain a row per state or 4 rows. The dates should be calculated to overlap and valid_from should be set to the valid_to of the attribute that changed from the "history" table, and valid_to should be set to the valid_from of the attribute that changed from the "history" table.

To add more complexity to the task, imagine we have about 120 attributes but also if a company was never changed (just created and still has the same state from creation) then it won't exist in the "Current State" table. So in our example, Company 2 will not exist in the history table at all and will have to be read from the first table into the SCD (union between current table and history result table). Fun right! :)

To give you a sense of the technical environment, the CRM is hubspot, data is replicated from hubspot to BigQuery and the reporting tool is Power BI.

I have tried to use pivoting in both Power BI and BigQuery, which is the standard solution when it comes to EAV model tables, but I'm stuck at the calculation of the valid/from valid/to in the result SCD. ( example of using the pivoting here: https://dba.stackexchange.com/questions/20275/solutions-for-reporting-off-of-an-eav-structured-database )

I need one process that can be applied to multiple tables (because this example is only for company, but I have also other objects that I need to convert into SCD). So what is the best way to convert this EAVT data into an SCD without falling into a labyrinth of hard code and performance issues? And how to calculate the valid_from/valid_to dynamically<

Whether it's BigQuery or Power Query or just theoretical, any solutions, tips, ideas or just plain opinion is highly appreciated as this is the last step into the adoption of a whole data culture in the company I work for, and if I cannot make this, well... my credibility will be hit! so please help a fellow lost IT professional! :D

1

There are 1 best solutions below

3
On BEST ANSWER

Too broad question - but anyway, below is just to give you an idea. Obviously it does not cover all cases - but hope you can work it further out

select company_id, city, number_of_employees, min(day) valid_from, max(day) valid_to
from (
  select * from (
    select company_id, attribute, value, day
    from history, 
    unnest(generate_date_array(date(valid_from), if(valid_to = '9999-12-31', date('2222-12-31'), date(valid_to)))) day
  )
  pivot (any_value(value) for attribute in ('city', 'number_of_employees'))
) 
group by company_id, city, number_of_employees     

if applied to sample data as in your question

with history as (
  select 1 company_id, 'city' attribute, 'New York' value,  '2020-01-01' valid_from,    '2022-01-05' valid_to union all
  select 1, 'city', 'Paris', '2022-01-05', '2222-12-31' union all
  select 1, 'number_of_employees', '50', '2021-01-01', '2022-01-01' union all
  select 1, 'number_of_employees', '100', '2022-01-01', '2222-12-31' 
)

output is

enter image description here