Clone/Backups for a BigQuery Project

How can one back up a project containing datasets in Google BigQuery to a GCS bucket? Is it feasible to use a Python script for this task, and how can the process be automated?

I'm just trying to gather ideas to implement for disaster recovery: cloning the project into another GCS bucket.

The data structure is as below:

 Project -> dataset -> table
                    -> table
                    -> ...

         -> dataset -> table
                    -> table
                    -> ...
         -> ...

Answered by ShaiD:
  1. Considering disaster recovery is great, but it's important to know that Google BigQuery provides a disaster recovery system out of the box. Any data you store in a region is automatically replicated to another zone within that region, so if one location goes down, the other should keep providing the required service. This is even stronger with their "Multi-Region" approach, where you basically select just "US", so data can be copied to more distant locations (a small sketch follows this list). See more info here
  2. It is however recommended by Google themselves to create some sort of disaster recovery plan. One of the suggested options is actually exporting the data to a GCS bucket in another region - see here
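
As a small illustration of the multi-region point in item 1, this is roughly how a dataset could be pinned to the "US" multi-region with the client library (the project and dataset IDs below are placeholders, not from the question):

from google.cloud import bigquery

client = bigquery.Client()

# Placeholder IDs; a dataset in the "US" multi-region is replicated across
# geographically separate locations within the United States.
dataset = bigquery.Dataset("my-project.my_dataset")
dataset.location = "US"
dataset = client.create_dataset(dataset, exists_ok=True)

print("Created dataset {} in location {}".format(dataset.dataset_id, dataset.location))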

Now for the next part: it is definitely possible to write a Python script that performs this backup. You could even run it as a Google Cloud Function on a schedule, using a Google Cloud service account, and it will automatically copy your entire dataset or project to whatever location you choose.
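
For the scheduling piece, a minimal sketch of an HTTP-triggered Cloud Function entry point could look like the following (the function name and the placeholder comment are assumptions, not part of the original answer); Cloud Scheduler can then call its trigger URL on a cron schedule, using a service account that has BigQuery and Cloud Storage permissions:

from google.cloud import bigquery

def backup_bigquery(request):
    """HTTP entry point; Cloud Scheduler invokes its URL on a cron schedule."""
    client = bigquery.Client()
    # ... submit the extract job(s) here, e.g. the extract_table call shown below ...
    return "Backup started", 200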

You should use the google-cloud-bigquery Python package for that, which lets you connect to BigQuery and create an export (extract) job.

You can refer to this example from Google's documentation to get started:

from google.cloud import bigquery

client = bigquery.Client()

bucket_name = "my-bucket"  # replace with your destination GCS bucket
project = "bigquery-public-data"
dataset_id = "samples"
table_id = "shakespeare"

destination_uri = "gs://{}/{}".format(bucket_name, "shakespeare.csv")
dataset_ref = bigquery.DatasetReference(project, dataset_id)
table_ref = dataset_ref.table(table_id)

extract_job = client.extract_table(
    table_ref,
    destination_uri,
    # Location must match that of the source table.
    location="US",
)  # API request
extract_job.result()  # Waits for job to complete.

print(
    "Exported {}:{}.{} to {}".format(project, dataset_id, table_id, destination_uri)
)
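
The sample above exports a single table. To back up an entire project, as asked, one possible extension (a sketch, assuming an existing destination bucket named my-backup-bucket and credentials that can read every dataset) is to list all datasets and tables with the same client and submit an extract job for each:

from google.cloud import bigquery

client = bigquery.Client()
bucket_name = "my-backup-bucket"  # assumption: an existing GCS bucket

for dataset_item in client.list_datasets():  # all datasets in the client's project
    dataset = client.get_dataset(dataset_item.reference)
    for table_item in client.list_tables(dataset):
        if table_item.table_type != "TABLE":
            continue  # skip views and other non-table objects, which can't be extracted
        # One (possibly sharded) CSV per table: gs://bucket/dataset/table-*.csv
        destination_uri = "gs://{}/{}/{}-*.csv".format(
            bucket_name, dataset.dataset_id, table_item.table_id
        )
        extract_job = client.extract_table(
            table_item.reference,
            destination_uri,
            location=dataset.location,  # job location must match the dataset
        )
        extract_job.result()  # wait for each export to complete
        print("Exported {} to {}".format(table_item.table_id, destination_uri))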