How surrogate keys are handles in hive

1.9k Views Asked by At

I know that hive cannot create surrogate keys or is rather difficult. I want to understand how companies have implemented dimensional modeling in their warehouse.

One way I can think of is leaving the dimension details as is in fact. Then move the distinct of dimension to a different table. But then how are scd1 and scd2 handled. I have checked talks by Kimball on cloudera and I still don't understand how this works.

2

There are 2 best solutions below

2
On

There are two ways of handling this problem in Hive.

The first does not directly answer your question, and that is to use natural keys instead of surrogates. While surrogates are more convenient and performant, since you're using Hive I'm guessing that performance isn't one of your major criteria, so the cost of using natural keys will just be in the extra lines of code you have to write to cater for compound keys.

The second way is to use Hive's windowing functions to calculate the surrogate. I don't have a Hive environment handy to test this query, but the surrogate would look something like:

(select max(surrogate_key_column) from dimension_table)
+ row_number() over (order by 1)
2
On

As far as I know, In version 3.0, Hive supports the surrogate keys on ACID tables

https://docs.cloudera.com/HDPDocuments/HDP3/HDP-3.1.4/using-hiveql/content/hive_surrogate_keys.html

Summarised from the link:


The SURROGATE_KEY UDF generates a unique Id for every row that you insert into a table.

Example usage:

-Create a table

CREATE TABLE students_v2 
(`ID` BIGINT DEFAULT SURROGATE_KEY(),
 row_id INT,
 name VARCHAR(64), 
 dorm INT, 
 PRIMARY KEY (ID) DISABLE NOVALIDATE);

-Insert data, which automatically generates surrogate keys for the primary keys.

INSERT INTO students_v2 (row_id, name, dorm) SELECT * FROM students;

-Take a look at the surrogate keys.

SELECT * FROM students_v2;