Google Dataprep copy flows from one project to another


I have two Google Cloud projects: dev and prod. I also import data from different storage buckets located in these projects: dev-bucket and prod-bucket.

After I have made and tested changes in the dev environment, how can I smoothly apply (deploy/copy) the changes to prod as well?

What I do now is export the flow from dev and then re-import it into prod. However, each time I then need to manually do the following in the prod flows:

  • Change the datasets that serve as inputs to the flow
  • Replace the manual and scheduled destinations with the right BigQuery dataset (dev-dataset-bigquery vs. prod-dataset-bigquery)

How can this be done more smoothly?

There are 2 solutions below.


Follow the procedure below to move a plan from one environment to the other using the API, and to update the input dataset and the output for the new environment.

1) Export the plan:

GET https://api.clouddataprep.com/v4/plans/<plan_id>/package
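For example, with Python's requests library (a minimal sketch; the DATAPREP_TOKEN environment variable and the plan id are placeholder assumptions, not part of the original answer):

import os
import requests

# Export the plan package (a zip file) from the dev environment.
# DATAPREP_TOKEN is assumed to hold a Dataprep API access token.
token = os.environ["DATAPREP_TOKEN"]
plan_id = 12345  # hypothetical plan id

resp = requests.get(
    f"https://api.clouddataprep.com/v4/plans/{plan_id}/package",
    headers={"Authorization": f"Bearer {token}"},
)
resp.raise_for_status()
with open("plan_package.zip", "wb") as f:
    f.write(resp.content)  # save the package for re-import into prod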

2) Import the plan:

POST https://api.clouddataprep.com/v4/plans/package
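A matching sketch for the import, assuming the package is uploaded as a multipart file (the form field name "file" is an assumption):

import requests

# Import the exported package into the prod environment; the token here
# must belong to the prod project.
with open("plan_package.zip", "rb") as f:
    resp = requests.post(
        "https://api.clouddataprep.com/v4/plans/package",
        headers={"Authorization": f"Bearer {token}"},
        files={"file": f},  # assumed multipart field name
    )
resp.raise_for_status()
print(resp.json())  # inspect the response for the ids of the imported objects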

3) Update the input dataset:

PUT https://api.clouddataprep.com/v4/importedDatasets/<dataset_id>

with the request body:
{
  "name": "<new_dataset_name>",
  "bucket": "<bucket_name>",
  "path": "<bucket_file_name>"
}
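In the same vein, the dataset update as a sketch (the id and the prod names are placeholders):

# Point the imported dataset at the prod bucket.
dataset_id = 67890  # hypothetical imported-dataset id
resp = requests.put(
    f"https://api.clouddataprep.com/v4/importedDatasets/{dataset_id}",
    headers={"Authorization": f"Bearer {token}"},
    json={
        "name": "prod-input",      # placeholder dataset name
        "bucket": "prod-bucket",
        "path": "input/data.csv",  # placeholder file path
    },
)
resp.raise_for_status()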

4) Update the output:

PATCH https://api.clouddataprep.com/v4/outputObjects/<output_id>

with the request body:
{
  "publications": [
    {
      "path": [
        "<project_name>",
        "<dataset_name>"
      ],
      "tableName": "<table_name>",
      "targetType": "bigquery",
      "action": "create"
    }
  ]
}
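And the output update (again a sketch; the id and the BigQuery names are placeholders):

# Repoint the flow's BigQuery publication at the prod dataset.
output_id = 13579  # hypothetical output-object id
resp = requests.patch(
    f"https://api.clouddataprep.com/v4/outputObjects/{output_id}",
    headers={"Authorization": f"Bearer {token}"},
    json={
        "publications": [
            {
                "path": ["prod-project", "prod-dataset-bigquery"],
                "tableName": "my_table",  # placeholder table name
                "targetType": "bigquery",
                "action": "create",
            }
        ]
    },
)
resp.raise_for_status()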

If you want to copy data between the Google Cloud Storage (GCS) buckets dev-bucket and prod-bucket, Google provides the Storage Transfer Service for exactly this: https://cloud.google.com/storage-transfer/docs/create-manage-transfer-console. You can either trigger the copy manually or have it run on a schedule.
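If a scripted one-off copy is enough, here is a minimal sketch with the google-cloud-storage client (the bucket names come from the question; everything else is assumed):

from google.cloud import storage

# Copy every object from dev-bucket to prod-bucket under the same name.
client = storage.Client()
src_bucket = client.bucket("dev-bucket")
dst_bucket = client.bucket("prod-bucket")

for blob in client.list_blobs("dev-bucket"):
    src_bucket.copy_blob(blob, dst_bucket, blob.name)
    print(f"copied gs://dev-bucket/{blob.name} -> gs://prod-bucket/{blob.name}")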

For the second part, it sounds like both dev-dataset-bigquery and prod-dataset-bigquery are loaded from files in GCS? If so, the BigQuery Transfer Service may be of use: https://cloud.google.com/bigquery/docs/cloud-storage-transfer. You can trigger a transfer job manually, or have it run on a schedule.
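For a manual load, a minimal sketch with the google-cloud-bigquery client; the CSV format, the gs:// path, and the table name are assumptions (note that real BigQuery dataset ids cannot contain hyphens, so prod_dataset_bigquery stands in for the question's prod-dataset-bigquery):

from google.cloud import bigquery

client = bigquery.Client(project="prod-project")  # hypothetical project id
job = client.load_table_from_uri(
    "gs://prod-bucket/data/*.csv",     # assumed source files
    "prod_dataset_bigquery.my_table",  # assumed destination table
    job_config=bigquery.LoadJobConfig(
        source_format=bigquery.SourceFormat.CSV,
        autodetect=True,
        write_disposition="WRITE_TRUNCATE",
    ),
)
job.result()  # block until the load job finishes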

As others have said in the comments, if you need to verify data before initiating transfers from dev to prod, a CI system such as Spinnaker may help. If the verification can be automated, a system such as Apache Airflow (running on Cloud Composer, if you want a hosted version) provides more flexibility than the transfer services.
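As an illustration of the Airflow route, a minimal DAG sketch (Airflow 2.4+ with the Google provider installed; every name below is a placeholder). A verification task could then be chained in front of the load:

from datetime import datetime

from airflow import DAG
from airflow.providers.google.cloud.transfers.gcs_to_bigquery import GCSToBigQueryOperator

# One-task DAG that loads GCS files into the prod BigQuery table.
with DAG(
    dag_id="promote_dev_to_prod",
    start_date=datetime(2024, 1, 1),
    schedule=None,  # trigger manually, or set a cron expression
    catchup=False,
) as dag:
    load_prod = GCSToBigQueryOperator(
        task_id="load_prod_table",
        bucket="prod-bucket",           # assumed source bucket
        source_objects=["data/*.csv"],  # assumed object pattern
        destination_project_dataset_table="prod-project.prod_dataset_bigquery.my_table",
        write_disposition="WRITE_TRUNCATE",
    )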