Different SCD types for different columns

864 Views Asked by At

Does it make sense to have different SCD types for different columns at single dimension? Or it's always better to split a dimension table into two tables?

For example, the operational systems sends me the following data:

ID | CHANNEL_CODE | NAME | TYPE
1  |      A       |  X   |  0
2  |      B       |  Y   |  1

Where CHANNEL_CODE = natural key, TYPE should be SCD type 2, while NAME could be SCD type 1. Let say NAME is frequently changed.

3

There are 3 best solutions below

0
On

I always keep Type 0, 1 and 2 in one table.

I also treat Type 0-s and Type 1-s the same, and Type 2-s the other way.

  • Type 1/0 changes lead to updating all versions in the SCD table for the same business/natural ID to reflect the new info.
  • Type 2 changes lead to updating the newest row in the SCD table to contain the new change date as the valid-to date, and the current-indicator to contain FALSE, and a new row with the new data,valid-from as the change date, valid-to to '9999-12-31'.

With this type of staging data:

CREATE TABLE stg_customer (
     cust_id            INT NOT NULL -- natural identifier
    ,cust_chg_dt        DATE         -- change date
    ,cust_first_name    VARCHAR(30)  -- type 1
    ,cust_last_name     VARCHAR(30)  -- type 1
    ,cust_phoneno       VARCHAR(30)  -- type 1
    ,cust_loyalty_lvl   INT          -- type 2
    ,cust_org_id        INT          -- type 2
) ;

... my SCD table looks like so:

CREATE TABLE dim_customer_scd (
  cust_key         BIGINT 
      DEFAULT dim_customer_scd_seq.NEXTVAL NOT NULL -- surrogate key, primary key
, cust_id          BIGINT       NOT NULL  -- natural identifier
, cust_from_dt     DATE         NOT NULL  -- effective begin date
, cust_to_dt       DATE         NOT NULL  -- effective end date
, cust_is_current  BOOLEAN 
    DEFAULT(cust_to_dt='9999-12-31') NOT NULL -- current indicator
, cust_cre_ts      TIMESTAMP(0) NOT NULL  -- created timestamp
, cust_udt_ts      TIMESTAMP(0) NOT NULL  -- updated timestamp
, cust_first_name  VARCHAR(30)            -- Type 1 column
, cust_last_name   VARCHAR(30)            -- Type 1 column
, cust_phoneno     VARCHAR(30)            -- Type 1 column
, cust_loyalty_lvl INT                    -- Type 2 column
, cust_org_id      INT                    -- Type 2 column
)
;
0
On

The answer, IMO, is it depends...

If you need the current and historic values of fields in a Dimension, and the volume of data + frequency of change makes it practicable, then implement SCD Type 7.

If the data for a small number of fields in the dimension is volatile, and the number of records in the dimension is large, then implement SCD Type 4 or SCD Type 5

0
On

As is typically the case, it depends.

First, make sure you understand the distinction between the OLAP model and the RDBMS storage. The RDBMS storage strategy does not have to reflect the OLAP model.

You can easily split your SCD types into separate tables and than join them in a view to present a single source to the OLAP model. This allows the underlying tables to have completely different ETL strategies.

If most of your columns are type 2, separate tables doesn't make much sense as there will be little gain. However, if most of your columns are not type 2, you may see some performance gains in the ETL and processing tasks. Skinner tables are quicker to read and update in OLAP processing.

On a separate note, if you use a view, I recommend investigating using left joins to combine the tables. In many OLAP load processing models, the load processing queries for a single attribute at a time. With multiple tables as left joins, the query processor can discard the table references that are not needed to provide the results, rather than filtering the data through inner joins. This can improve overall OLAP processing speed.