I am implementing a Data Vault hub with a number of columns, two of them being:
    Company_ID, md5(col("Company_ID")).alias("HK_Company_ID")
The hash is created at the stage level, and I want to load my data from stage into the Data Vault. I select these two columns (Company_ID and the precalculated hash) and then load the new companies from stage into the hub. I want the hub to be partitioned by HK_Company_ID. How do I achieve that, so that I can later perform efficient co-partitioned joins?
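For reference, this is roughly how the hash key is built at the stage level (a minimal sketch, where stage_df stands for my staging DataFrame; md5 and col are the built-ins from pyspark.sql.functions, and the cast just forces a string input):

    from pyspark.sql import functions as F

    # stage level: derive the hub hash key from the business key
    stage_df = stage_df.select(
        "Company_ID",
        F.md5(F.col("Company_ID").cast("string")).alias("HK_Company_ID"),
    )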
The only thing I've found is repartition, which is supposed to use hash partitioning under the hood.
- But if it does the hash partitioning under the hood, should I use it on HK_Company_ID or just on Company_ID? Maybe there isn't even a need for creating the hash key at the staging level? I'm not that familiar with Data Vault.
- Will it even work on HK_Company_ID then? Would it hash the hashes, or apply a modulo to the already precalculated hashes?
- Regardless of which column I repartition by, how do I implement it in my script? Right now I do this:
In simplified PySpark (stage_location and hub_location are placeholders):

    from pyspark.sql.utils import AnalysisException

    df1 = spark.read.parquet(stage_location)    # stage
    try:
        df2 = spark.read.parquet(hub_location)  # hub
    except AnalysisException:
        df2 = None  # first load: the hub doesn't exist yet
    if df2 is None:
        df_diff = df1
    else:
        # keep only the companies that aren't in the hub yet
        df_diff = df1.join(df2, on="HK_Company_ID", how="left_anti")
    df_diff.write.mode("append").parquet(hub_location)
Where do I fit the repartitioning within this code? (See the first sketch after this list for my guess.)
- Maybe there's some other option, like partitioning by a modulo or a range on the hash column? (The second sketch below shows what I mean.)
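To make the repartitioning question concrete, here is where I would guess the call goes, right before the append (just a sketch: write_hub is a helper name I made up, and the partition count 200 is a placeholder):

    from pyspark.sql import DataFrame

    def write_hub(df_diff: DataFrame, hub_location: str, num_partitions: int = 200) -> None:
        # repartition by the hash key right before the append, hoping that
        # later joins on HK_Company_ID end up co-partitioned
        (df_diff
            .repartition(num_partitions, "HK_Company_ID")
            .write.mode("append")
            .parquet(hub_location))

Or does the repartition rather belong on both DataFrames right before a join?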
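And for the last bullet, this is the kind of thing I mean by a modulo on the hash column (also just a sketch; hk_bucket and NUM_BUCKETS are names I made up, while conv and substring are the built-ins from pyspark.sql.functions):

    from pyspark.sql import DataFrame, functions as F

    NUM_BUCKETS = 64  # placeholder bucket count

    def with_hash_bucket(df: DataFrame) -> DataFrame:
        # first 8 hex chars of the md5 -> bigint -> modulo into a fixed bucket
        bucket = F.conv(F.substring("HK_Company_ID", 1, 8), 16, 10).cast("long") % NUM_BUCKETS
        return df.withColumn("hk_bucket", bucket)

I could then write with .partitionBy("hk_bucket"), but I don't know whether that actually helps the joins.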
I'd like to stick to the DataFrame API if possible.
Thanks a lot