load data from vm instances without external ip to big query

749 Views Asked by At

i have a goal to connect and load data to big query from vm instances (this instances installed maria db server). i can access the database using tcp ip forwarding with this command on my terminal (the instances not having external ip for security reason):

gcloud beta compute start-iap-tunnel my-instances 3306 --local-host-port=localhost:3309

after that i am open mysql workbench type user root and password, then the database show, doing query, save into csv, the last is upload the csv to big query.

any suggestion to automate this process ?

2

There are 2 best solutions below

3
On

Devops style using Cloud Workflows.

You can automate this using Cloud Workflows, and the steps with Cloud Build.

There is a full guide posted how to run a shell command on a VM with Cloud Workflows, also another one how to automate files from Cloud Storage to BigQuery using Cloud Workflows.

Instead of running the shell command, you can have the chain of operations to

  1. connect to IAP
  2. run the SELECT INTO FILE as csv command (escaping what needs to be done)
  3. move the file to Cloud Storage
  4. import into BigQuery using the load API

This way you incur no loading costs to BigQuery as loading data to BQ is free. You will incur costs for Cloud Storage, but only for the time that the file exists and once imported you can delete it.

Using External data sources.

You may want to setup a Cloud SQL connection with your database. See if works for your setup by doing the steps defined here.

You need to follow the guide and in the External data source pane, enter the required connection information such as cloud SQL instance id, user, password.

If the connection is established, you can use EXTERNAL_QUERY syntax to read data from Cloud SQL instance, and write a BigQuery table. You will incur costs for doing this, as the query is billed for the data that is read from Cloud SQL as well.

SELECT * FROM EXTERNAL_QUERY("connection_id",
"select * from db.table;");

you can automate this step with Scheduled Queries.

https://cloud.google.com/bigquery/docs/cloud-sql-federated-queries#setting-up-cloud-sql-database-connections

2
On

If your VM don't have public IP, the VM can't access the public BigQuery api on the public DNS. So, use the private API!!

For this, you need to go to your VPC, and select the subnet where is deployed your Compute Engine VM.

Click on Edit and turn on the private google access to ON enter image description here

Now you can call the BigQuery API, through the internal network and without public IP. use the CLI or script to achieve the load job.