I am loading data on a daily basis into a data vault model on a Snowflake data warehouse.
I have split the ingest script (a JavaScript stored procedure) into three main parts for logging purposes:
- Getting data into temporary table
- Metadata part, where I add data into several hubs and links that hold the metadata of each row (location, data entry name…)
- Loading the main data, holding indicators and their values, into specific hubs and satellites.
Here are the average loading times of each part for a data file of around 2,000 rows with ~300k indicator values:
- 3 to 5 seconds for adding data from the stage into the temporary table
- 19 to 25 seconds for adding metadata into 9 hubs, satellites and links
- 3 to 5 seconds for adding 2,000 rows into a hub and then 300k values into a satellite and a link.
For part 2, because I am using MERGE statements, the run takes the same time whether or not there is anything new to insert.
What puzzles me is that loading hundreds of thousands of records takes a few seconds, while merging into a handful of hubs (tables), where values are only inserted when they are not already present, takes far longer.
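For illustration, each of the metadata hubs is currently loaded with a MERGE roughly like the following (table and column names are simplified placeholders, not my exact schema):

    -- Simplified example of one of the 9 metadata MERGE statements
    MERGE INTO hub_location AS h
    USING (
        SELECT DISTINCT location_hash_key, location_code, load_date, record_source
        FROM tmp_stage_rows
    ) AS src
        ON h.location_hash_key = src.location_hash_key
    WHEN NOT MATCHED THEN
        INSERT (location_hash_key, location_code, load_date, record_source)
        VALUES (src.location_hash_key, src.location_code, src.load_date, src.record_source);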
Can I replace all the MERGE statements for the tables in part 2 with one conditional
insert all
into table1 … where…
into table2 … where …
Could an INSERT with conditions similar to the WHEN NOT MATCHED clause of the MERGE statement reduce the ingest time, taking into consideration that the condition for each table will contain a SELECT subquery to ensure existing data is not added again? A sketch of what I have in mind follows below.
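Snowflake's multi-table insert expresses the per-table condition as a WHEN clause rather than a per-table WHERE, so one way I could write the "not already present" check is to compute it in the source SELECT and test a flag in each branch. A rough sketch with placeholder names (only two of the nine targets shown):

    -- Single conditional multi-table insert replacing several MERGEs (placeholder names)
    INSERT ALL
        WHEN loc_is_new THEN
            INTO hub_location (location_hash_key, location_code, load_date, record_source)
            VALUES (location_hash_key, location_code, load_date, record_source)
        WHEN entry_is_new THEN
            INTO hub_data_entry (data_entry_hash_key, data_entry_name, load_date, record_source)
            VALUES (data_entry_hash_key, data_entry_name, load_date, record_source)
    SELECT DISTINCT
        t.location_hash_key,
        t.location_code,
        t.data_entry_hash_key,
        t.data_entry_name,
        t.load_date,
        t.record_source,
        (hl.location_hash_key   IS NULL) AS loc_is_new,
        (he.data_entry_hash_key IS NULL) AS entry_is_new
    FROM tmp_stage_rows t
    LEFT JOIN hub_location   hl ON hl.location_hash_key   = t.location_hash_key
    LEFT JOIN hub_data_entry he ON he.data_entry_hash_key = t.data_entry_hash_key;

The "is new" checks only see the hubs as they were before the statement ran, so I am not sure whether repeated keys within the same file would slip through.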
I was reading this article on optimizing loads into a data vault model, with its related scripts on GitHub, but I am still concerned about whether the ingest time can be reduced in an efficient way.
Admirable as Galavan's article is, it comes with some fatal flaws around loading to the same hub in the case of same-as links or hierarchical links: you will load duplicates. I would discourage you from using multi-table inserts (MTI) to load hubs, links and satellites; for analysis and testing on this please visit here: https://patrickcuba.medium.com/data-vault-test-automation-52c0316e8e3a
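To see why, consider a hypothetical hierarchical-link load where two branches of one MTI feed the same hub (placeholder names, not from the article):

    -- Hypothetical: employee and manager keys both feed hub_employee in one statement
    INSERT ALL
        WHEN emp_is_new THEN
            INTO hub_employee (employee_hash_key, employee_id, load_date, record_source)
            VALUES (employee_hash_key, employee_id, load_date, record_source)
        WHEN mgr_is_new THEN
            INTO hub_employee (employee_hash_key, employee_id, load_date, record_source)
            VALUES (manager_hash_key, manager_id, load_date, record_source)
    SELECT
        s.employee_hash_key, s.employee_id,
        s.manager_hash_key,  s.manager_id,
        s.load_date, s.record_source,
        (he1.employee_hash_key IS NULL) AS emp_is_new,
        (he2.employee_hash_key IS NULL) AS mgr_is_new
    FROM staged_employees s
    LEFT JOIN hub_employee he1 ON he1.employee_hash_key = s.employee_hash_key
    LEFT JOIN hub_employee he2 ON he2.employee_hash_key = s.manager_hash_key;
    -- If the same person appears as an employee on one row and as a manager on another,
    -- both branches pass their "is new" check (the checks only see the hub as it was
    -- before the statement ran), and hub_employee ends up with duplicate hash keys.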
That's not to say MTIs don't have a place in DV; they do! In the case of loading logarithmic PIT structures, absolutely. An in-depth article on this is published here: https://patrickcuba.medium.com/data-vault-pit-flow-manifold-e2b68df26628
Now, the MERGE vs INSERT conversation in particular should not be part of a Data Vault 2.0 vocabulary at all, because DV 2.0 is INSERT-ONLY. I did another piece on that, focusing on hashing, but there is a segment discussing what happens at the micro-partition level in Snowflake that should discourage you from using MERGE INTO; visit here: https://patrickcuba.medium.com/data-vault-2-0-on-snowflake-5b25bb50ed9e
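As a rough idea of the insert-only shape (placeholder names, not your exact tables): each hub load becomes a plain INSERT … SELECT that filters out business keys already present, so Snowflake only appends new micro-partitions instead of rewriting existing ones the way MERGE INTO can:

    -- Insert-only hub load: new keys are appended, existing micro-partitions are untouched
    INSERT INTO hub_location (location_hash_key, location_code, load_date, record_source)
    SELECT DISTINCT
        s.location_hash_key,
        s.location_code,
        s.load_date,
        s.record_source
    FROM tmp_stage_rows s
    LEFT JOIN hub_location h
           ON h.location_hash_key = s.location_hash_key
    WHERE h.location_hash_key IS NULL;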
Seeing as you are building out your own DV automation tool, these two blogs are worth a read too:
https://patrickcuba.medium.com/you-might-be-doing-datavault-wrong-888e9b0fa07d
https://medium.com/snowflake/decided-to-build-your-own-data-vault-automation-tool-a9a6273b9f9b