We are trying to set up `dev` and `qa` environments using data from a `prod` environment. We are not using `CREATE DATABASE dev CLONE prod` because we want to avoid cloning database-specific objects like stages and pipes: we use per-environment Terraform to manage pipe-related objects and want to avoid out-of-band changes to them. On top of that, there are some tables that should not be cloned from prod -> dev. I'm trying to design a cleaner solution than the cobbled-together mess that we have.
We have a scheduled script that does the following:

- Connect to the `prod` and `dev` databases and fetch the right `src` and `dst` schemas
- Run `SELECT TABLE_NAME FROM INFORMATION_SCHEMA.TABLES WHERE TABLE_SCHEMA = '<>' AND TABLE_TYPE = 'BASE TABLE'` to get the tables to clone
- Cloning tables across databases results in dangling references to constraints and sequences, so those have to be manually cloned: https://docs.snowflake.com/en/user-guide/database-replication-considerations.html?#references-to-objects-in-another-database
- For each table (see the worked example after this list):
  - If it shouldn't be cloned, skip it
  - Run `CREATE OR REPLACE TABLE <dev> CLONE <prod> COPY GRANTS;`
  - Run `GET_DDL(<dev>)` to see if the table has sequences/constraints to update
  - Run `CREATE OR REPLACE SEQUENCE <dev> CLONE <prod>` to update the `nextval` of the sequence, since the cloned table references the sequence from the source database (and it has the wrong value anyway)
  - Run `ALTER TABLE <dev> ALTER COLUMN <> SET DEFAULT <new seq>.nextval`
  - Check if there are constraints:
    - Run `ALTER TABLE <dev> DROP CONSTRAINT <>`, since the cloned constraints reference the source database
    - Run `ALTER TABLE <dev> ADD CONSTRAINT <>` to rebuild them against the destination database
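To make the fix-up concrete, here is what the script ends up running for a single table. All of the object names (`orders`, `orders_seq`, `fk_orders_customer`, `customers`) are invented for illustration:

```sql
-- After the clone, dev.public.orders still defaults its id column to prod's
-- sequence, and its foreign key still references prod.

-- Re-clone the sequence so dev gets its own counter with the right NEXTVAL.
CREATE OR REPLACE SEQUENCE dev.public.orders_seq CLONE prod.public.orders_seq;

-- Point the column default at the dev sequence instead of the prod one.
ALTER TABLE dev.public.orders
  ALTER COLUMN id SET DEFAULT dev.public.orders_seq.NEXTVAL;

-- Constraints on the clone still reference the source database, so rebuild them.
ALTER TABLE dev.public.orders DROP CONSTRAINT fk_orders_customer;
ALTER TABLE dev.public.orders
  ADD CONSTRAINT fk_orders_customer
  FOREIGN KEY (customer_id) REFERENCES dev.public.customers (id);
```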
So... it works, but it's hacky, fragile, and needs constant updating because of all the custom rules. We currently run this on an AWS Lambda, but a first step would be to migrate it to pure Snowflake.
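Something like this is what I have in mind for the pure-Snowflake version. It's only a sketch: the `prod`/`dev` database names and the `AUDIT_LOG` exclusion are placeholders, and the per-table sequence/constraint fix-up above still has to be wired in after each clone:

```sql
CREATE OR REPLACE PROCEDURE dev.public.clone_schema_tables(src_schema VARCHAR, dst_schema VARCHAR)
RETURNS VARCHAR
LANGUAGE SQL
AS
$$
DECLARE
  tables RESULTSET;
  cloned INTEGER DEFAULT 0;
BEGIN
  -- Same table discovery as the current script: base tables only.
  tables := (SELECT table_name
             FROM prod.information_schema.tables
             WHERE table_schema = :src_schema
               AND table_type = 'BASE TABLE');
  FOR t IN tables DO
    IF (t.table_name = 'AUDIT_LOG') THEN  -- stand-in for the "should not be cloned" rules
      CONTINUE;
    END IF;
    -- Clone across databases, keeping grants, exactly as the Lambda does today.
    LET stmt VARCHAR := 'CREATE OR REPLACE TABLE dev.' || dst_schema || '."' || t.table_name ||
                        '" CLONE prod.' || src_schema || '."' || t.table_name || '" COPY GRANTS';
    EXECUTE IMMEDIATE :stmt;
    cloned := cloned + 1;
  END FOR;
  RETURN cloned || ' tables cloned';
END;
$$;
```

The schedule could then move off Lambda onto a task (warehouse and cron are placeholders too):

```sql
CREATE OR REPLACE TASK dev.public.refresh_dev_from_prod
  WAREHOUSE = dev_wh
  SCHEDULE = 'USING CRON 0 5 * * * UTC'
AS
  CALL dev.public.clone_schema_tables('PUBLIC', 'PUBLIC');

ALTER TASK dev.public.refresh_dev_from_prod RESUME;  -- tasks are created suspended
```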
Does anyone have any suggestions to improve this process? Or at least recommendations on Snowflake tools that could help simplify it?
I realise this is not really an answer to your question, but I would absolutely not do what you are proposing - it's not the way to manage your SDLC (in my opinion), and, especially if your data contains any PII, copying data from a Prod to a non-Prod database runs the risk of all sorts of regulatory and audit issues.
I would do the following:

- Hold the DDL for every database object as a script in your code repository

You can then use these scripts to build any environment, and you would change them by going through the standard Dev, Test, Prod SDLC.
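For example (table and column names invented), each object's DDL lives in the repo as a script that the deploy pipeline can run against whichever database it targets:

```sql
-- customers.sql: repo-managed DDL, applied to dev, test, or prod by the deploy pipeline.
CREATE TABLE IF NOT EXISTS public.customers (
    id          NUMBER IDENTITY,
    username    VARCHAR,
    email       VARCHAR,
    created_at  TIMESTAMP_NTZ DEFAULT CURRENT_TIMESTAMP()
);
```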
As far as populating these environments with data goes: if you really need Production-like data (and Production volumes of data), then you should build routines that copy the data from Prod to the chosen target environment and, where necessary, anonymise it. These scripts should be managed in your code repository, and your SDLC should include a requirement to build/update the relevant script for any new or changed table.
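A minimal sketch of one such routine, assuming an invented `customers` table with PII in `username` and `email`:

```sql
-- Rebuild dev's copy from prod, anonymising the PII columns on the way.
CREATE OR REPLACE TABLE dev.public.customers AS
SELECT
    id,
    'user_' || id AS username,   -- replace the real username with a synthetic one
    SHA2(email)   AS email,      -- hash rather than copy the real address
    created_at
FROM prod.public.customers;
```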