OLTP-STAGGING-DW-DUPLICATES

38 Views Asked by At

I have a project to extract data from an OLTP db to a Stagging star schema and then Datawarehouse db. I have a relationship : playlist<---(1-N)PlaylistTRack----->(1-N)Track--->(N-1)Sales. There is a concern in the group that duplicates will be created if I put ttables Playlist/playlistTrack/Track in 1 dimtable in Stagging bd, but i dont get how that comes about or if there really is a problem.

As we have searched, we found that we have to do a bridge table. But does that solve the problem if there is an actual one?

1

There are 1 best solutions below

0
On BEST ANSWER

There is a concern in the group that duplicates will be created if I put ttables Playlist/playlistTrack/Track in 1 dimtable

This is a valid concern. Your playlist and your track are two different conceptual things that relate to each other in a many-to-many way (a playlist can have multiple tracks and a track can appear in more than one playlist). You most certainly would not want to collapse this into a single table or you will have duplicates.

The bridge table that solves this is your playlisttrack table. The grain of that table is one track on one playlist, and the key is a combination of the keys to both the playlist and track dimensions.

playlist <--- playlisttrack ---> track

Now, if you have a different table playing the true role of a fact table (sales), then this doesn't change what's needed. Your sales fact table would have a single key to playlist:

sales (fact) ---> playlist <--- playlisttrack ---> track

Your product dimension has been split into three tables in a pattern known as "snowflaking" but that's unavoidable. In this situation, while your sales table still acts as the real fact table, the bridge table (playlisttrack) is in a way a fact table of its own with its own granularity. It's important to realize this because you have to be careful when joining from sales through the bridge table to track, as that changes the granularity of your query. Unless your query filters on a single track, you would need to use careful GROUP BY logic with appropriate aggregation (MAX vs SUM on sales facts until grouped at the sales fact grain, then SUM only above that grouping, etc..) in any such query to avoid overcounting your sales facts.