I am loading data on a daily basis into a data vault model on a Snowflake data warehouse.
I have split the ingest script (a JavaScript stored procedure) into three main parts for logging purposes:
- Getting data into temporary table
- Metadata part, where I add data into several hubs and links that hold the metadata of each row (location, data entry name…)
- Loading the main data, holding indicators and their values, into specific hubs and satellites.
Here are the average loading times of each part for a data file of around 2,000 rows with ~300k indicator values:
- 3 to 5 seconds for adding data from the stage into the temporary table
- 19 to 25 seconds for adding metadata into 9 hubs, satellites and links
- 3 to 5 seconds for adding 2,000 rows into a hub and then 300k values into a satellite and a link.
For part 2, because I am using MERGE statements, the run takes the same time whether or not there is anything new to insert.
What puzzles me is that loading hundreds of thousands of records takes a few seconds, while merging into a handful of hubs (tables), where values are only inserted when they are not already present, takes far longer.
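For illustration, each of the metadata hubs is currently loaded with a MERGE roughly like the following (table and column names are simplified placeholders, not my exact schema):

    -- Simplified example of one of the 9 metadata MERGE statements
    MERGE INTO hub_location AS h
    USING (
        SELECT DISTINCT location_hash_key, location_code, load_date, record_source
        FROM tmp_stage_rows
    ) AS src
        ON h.location_hash_key = src.location_hash_key
    WHEN NOT MATCHED THEN
        INSERT (location_hash_key, location_code, load_date, record_source)
        VALUES (src.location_hash_key, src.location_code, src.load_date, src.record_source);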
Can I replace all the MERGE statements for the tables in part 2 with one conditional
insert all
into table1 … where…
into table2 … where …
Could an INSERT with conditions similar to the WHEN NOT MATCHED clause of the MERGE statement reduce the ingest time, taking into consideration that the condition for each table will contain a SELECT subquery to ensure existing data is not added again? A sketch of what I have in mind follows below.
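Snowflake's multi-table insert expresses the per-table condition as a WHEN clause rather than a per-table WHERE, so one way I could write the "not already present" check is to compute it in the source SELECT and test a flag in each branch. A rough sketch with placeholder names (only two of the nine targets shown):

    -- Single conditional multi-table insert replacing several MERGEs (placeholder names)
    INSERT ALL
        WHEN loc_is_new THEN
            INTO hub_location (location_hash_key, location_code, load_date, record_source)
            VALUES (location_hash_key, location_code, load_date, record_source)
        WHEN entry_is_new THEN
            INTO hub_data_entry (data_entry_hash_key, data_entry_name, load_date, record_source)
            VALUES (data_entry_hash_key, data_entry_name, load_date, record_source)
    SELECT DISTINCT
        t.location_hash_key,
        t.location_code,
        t.data_entry_hash_key,
        t.data_entry_name,
        t.load_date,
        t.record_source,
        (hl.location_hash_key   IS NULL) AS loc_is_new,
        (he.data_entry_hash_key IS NULL) AS entry_is_new
    FROM tmp_stage_rows t
    LEFT JOIN hub_location   hl ON hl.location_hash_key   = t.location_hash_key
    LEFT JOIN hub_data_entry he ON he.data_entry_hash_key = t.data_entry_hash_key;

The "is new" checks only see the hubs as they were before the statement ran, so I am not sure whether repeated keys within the same file would slip through.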
I was reading this article on optimizing loads into a data vault model, with its related scripts on GitHub, but I am still concerned about whether the ingest time can be reduced in an efficient way.
Admirable as Galavan's article is, it comes with some fatal flaws around loading to the same hub in the case of same-as links or hierarchical links: you will load duplicates. I would discourage you from using multi-table inserts (MTI) to load hubs, links and satellites; for analysis and testing on this please visit here: https://patrickcuba.medium.com/data-vault-test-automation-52c0316e8e3a
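To see why, consider a hypothetical hierarchical-link load where two branches of one MTI feed the same hub (placeholder names, not from the article):

    -- Hypothetical: employee and manager keys both feed hub_employee in one statement
    INSERT ALL
        WHEN emp_is_new THEN
            INTO hub_employee (employee_hash_key, employee_id, load_date, record_source)
            VALUES (employee_hash_key, employee_id, load_date, record_source)
        WHEN mgr_is_new THEN
            INTO hub_employee (employee_hash_key, employee_id, load_date, record_source)
            VALUES (manager_hash_key, manager_id, load_date, record_source)
    SELECT
        s.employee_hash_key, s.employee_id,
        s.manager_hash_key,  s.manager_id,
        s.load_date, s.record_source,
        (he1.employee_hash_key IS NULL) AS emp_is_new,
        (he2.employee_hash_key IS NULL) AS mgr_is_new
    FROM staged_employees s
    LEFT JOIN hub_employee he1 ON he1.employee_hash_key = s.employee_hash_key
    LEFT JOIN hub_employee he2 ON he2.employee_hash_key = s.manager_hash_key;
    -- If the same person appears as an employee on one row and as a manager on another,
    -- both branches pass their "is new" check (the checks only see the hub as it was
    -- before the statement ran), and hub_employee ends up with duplicate hash keys.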
That's not to say MTIs don't have a place in DV; they do! In the case of loading logarithmic PIT structures, absolutely. An in-depth article on this is published here: https://patrickcuba.medium.com/data-vault-pit-flow-manifold-e2b68df26628
Now, the MERGE vs INSERT conversation in particular should not be part of a Data Vault 2.0 vocabulary at all, because DV 2.0 is INSERT-ONLY. I did another piece on that, focusing on hashing, but there is a segment discussing what happens at the micro-partition level in Snowflake that should discourage you from using MERGE INTO; visit here: https://patrickcuba.medium.com/data-vault-2-0-on-snowflake-5b25bb50ed9e
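As a rough idea of the insert-only shape (placeholder names, not your exact tables): each hub load becomes a plain INSERT … SELECT that filters out business keys already present, so Snowflake only appends new micro-partitions instead of rewriting existing ones the way MERGE INTO can:

    -- Insert-only hub load: new keys are appended, existing micro-partitions are untouched
    INSERT INTO hub_location (location_hash_key, location_code, load_date, record_source)
    SELECT DISTINCT
        s.location_hash_key,
        s.location_code,
        s.load_date,
        s.record_source
    FROM tmp_stage_rows s
    LEFT JOIN hub_location h
           ON h.location_hash_key = s.location_hash_key
    WHERE h.location_hash_key IS NULL;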
Seeing as you are building out your own DV automation tool, these two blogs are worth a read too:
https://patrickcuba.medium.com/you-might-be-doing-datavault-wrong-888e9b0fa07d
https://medium.com/snowflake/decided-to-build-your-own-data-vault-automation-tool-a9a6273b9f9b