In Data Vault 2.0 one hashes the business key and uses this hash as the primary key of the table. Link tables also use the hash primary keys to create the relationships.
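To make this concrete, here is roughly what such a schema looks like (a minimal sketch, not my actual schema; table and column names are invented, and I use MD5-width BINARY(16) keys as an example):

```sql
-- Hypothetical Data Vault 2.0 style hub: the PK is a hash of the business key.
CREATE TABLE dbo.HubCustomer (
    CustomerHashKey  BINARY(16)    NOT NULL PRIMARY KEY,  -- e.g. MD5 of the BK
    CustomerBK       NVARCHAR(50)  NOT NULL,
    LoadDate         DATETIME2     NOT NULL,
    RecordSource     NVARCHAR(50)  NOT NULL
);

CREATE TABLE dbo.HubOrder (
    OrderHashKey     BINARY(16)    NOT NULL PRIMARY KEY,
    OrderBK          NVARCHAR(50)  NOT NULL,
    LoadDate         DATETIME2     NOT NULL,
    RecordSource     NVARCHAR(50)  NOT NULL
);

-- The link's own key is a hash over the combined business keys;
-- it references the hubs by their hash keys.
CREATE TABLE dbo.LinkCustomerOrder (
    LinkHashKey      BINARY(16)    NOT NULL PRIMARY KEY,
    CustomerHashKey  BINARY(16)    NOT NULL REFERENCES dbo.HubCustomer,
    OrderHashKey     BINARY(16)    NOT NULL REFERENCES dbo.HubOrder,
    LoadDate         DATETIME2     NOT NULL,
    RecordSource     NVARCHAR(50)  NOT NULL
);
```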
My problem: because the hash values are essentially random, the query optimizer cannot produce any good estimates; the column statistics are, of course, of little use for randomly distributed data. So the optimizer picks weird plans in which it wants to sort frequently (because it thinks there are only 4 rows to sort). Since I am surely not the first one to deal with Data Vault on SQL Server: how is this fixable?
Whenever the query optimizer uses an index seek or a join operator, it completely misses the row estimates and hence chooses ridiculous plans.
I have to patch the queries with join hints and query hints such as OPTION (FORCE ORDER) to get anything usable out of them.
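For illustration, the kind of hinting I end up with looks like this (the query and names are made up, following the sketch above):

```sql
-- Illustrative only: forcing the join order and join type by hand
-- because the estimates on the hash-key join are so far off.
SELECT h.CustomerBK,
       l.OrderHashKey
FROM   dbo.HubCustomer        AS h
JOIN   dbo.LinkCustomerOrder  AS l
       ON l.CustomerHashKey = h.CustomerHashKey
OPTION (FORCE ORDER, HASH JOIN);
```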
What's the common approach for this?
Personally, I would not hash the BK; if it is a compound key, I would include all the fields in the HUB. Why would the LINK table be using the hash values? It should be using the HUB_ID, which I would always set up as an incrementing value.
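A minimal sketch of what I mean, with invented names and a compound BK stored directly on the hub:

```sql
-- Hub keyed by an incrementing surrogate; the compound business key
-- is stored as-is and protected by a unique constraint.
CREATE TABLE dbo.HUB_CUSTOMER (
    HUB_CUSTOMER_ID  INT IDENTITY(1,1) NOT NULL PRIMARY KEY,
    CUSTOMER_NO      NVARCHAR(20)      NOT NULL,
    COMPANY_CODE     NVARCHAR(10)      NOT NULL,
    LOAD_DATE        DATETIME2         NOT NULL,
    RECORD_SOURCE    NVARCHAR(50)      NOT NULL,
    CONSTRAINT UQ_HUB_CUSTOMER_BK UNIQUE (CUSTOMER_NO, COMPANY_CODE)
);

-- Link joins on the hubs' surrogate IDs, not on hashes.
CREATE TABLE dbo.LINK_CUSTOMER_ORDER (
    LINK_CUSTOMER_ORDER_ID INT IDENTITY(1,1) NOT NULL PRIMARY KEY,
    HUB_CUSTOMER_ID        INT NOT NULL REFERENCES dbo.HUB_CUSTOMER,
    HUB_ORDER_ID           INT NOT NULL,  -- would reference a HUB_ORDER table
    LOAD_DATE              DATETIME2 NOT NULL,
    RECORD_SOURCE          NVARCHAR(50) NOT NULL
);
```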
I would, however, create and store a HASH in the SAT table, as this is the fastest way to check for changes in the ETL process: hash the values coming in and compare to the hash on the current SAT record; there is no point in recomputing the hash on the existing record.
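A sketch of that comparison in the load, assuming tables like the ones above plus a SAT_CUSTOMER with a HASH_DIFF column and an IS_CURRENT flag (all names and the MD5 choice are just for illustration):

```sql
-- In staging, compute the row hash once, e.g. (SQL Server 2017+):
--   s.HASH_DIFF = HASHBYTES('MD5', CONCAT_WS('|', s.NAME, s.CITY))
-- Then insert a new SAT row only when the key is new or the hash differs;
-- the stored hash on the current SAT row is reused, never recomputed.
-- (Flipping IS_CURRENT on the superseded row is omitted here.)
INSERT INTO dbo.SAT_CUSTOMER (HUB_CUSTOMER_ID, LOAD_DATE, HASH_DIFF, NAME, CITY)
SELECT s.HUB_CUSTOMER_ID,
       SYSUTCDATETIME(),
       s.HASH_DIFF,
       s.NAME,
       s.CITY
FROM   stg.CUSTOMER AS s
LEFT JOIN dbo.SAT_CUSTOMER AS cur
       ON cur.HUB_CUSTOMER_ID = s.HUB_CUSTOMER_ID
      AND cur.IS_CURRENT = 1
WHERE  cur.HUB_CUSTOMER_ID IS NULL        -- brand-new key
   OR  cur.HASH_DIFF <> s.HASH_DIFF;      -- attributes changed
```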