Google BigQuery DML - Slow performance when executing updates & deletes


I have been running a few BigQuery DML tests to get a better idea of the performance of BigQuery's DML capabilities. So far, here are some initial observations:

1) Slow performance when updating only a few records in a very small table (30K+ records)

UPDATE babynames.names_2014
SET name = 'Emma B' 
WHERE name = 'Emma';

Output:

- 2 rows affected (# of records in the table: 33,176)
- Query complete (4.5s elapsed, 621 KB processed)


2) Very slow performance when deleting only a few records from a small table

SQL:

DELETE FROM babynames.names_2014_copy
WHERE gender <> 'U';

Output:

- 2 rows affected
- Query complete (162.6s elapsed, 1.21 MB processed) - ~3 minutes

QUESTIONS:

1) Is this known behavior?
2) Any suggestions on how to improve the performance?


There are 3 solutions below.

Answer 1:

I have also noticed that UPDATE and DELETE operations can be very slow in BigQuery.

Interestingly, overwriting the table with a "create or replace table" statement usually has significantly better performance.

So instead of:

DELETE FROM babynames.names_2014_copy
WHERE gender <> 'U';

consider just using:

CREATE OR REPLACE TABLE babynames.names_2014_copy AS
SELECT *
FROM babynames.names_2014_copy
WHERE NOT (gender <> 'U');
-- i.e. keep the rows the DELETE would have kept (caveat: rows where
-- gender IS NULL survive the DELETE but are dropped here)

A similar technique also works for UPDATE; you just need to write a CASE statement to modify your values, as in the sketch below.
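For example, here is a minimal sketch of the first test's UPDATE rewritten this way, assuming the names_2014 table from the question and BigQuery's SELECT * REPLACE syntax:

CREATE OR REPLACE TABLE babynames.names_2014 AS
SELECT
  -- rewrite the matching values in place; all other columns pass through unchanged
  * REPLACE (
    CASE WHEN name = 'Emma' THEN 'Emma B' ELSE name END AS name
  )
FROM babynames.names_2014;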

Answer 2:

(1) is approximately expected - the primary DML scenarios are large updates/deletes affecting many rows (millions or billions of rows), so latency is less important than throughput.

(2) doesn't look normal - could you try one more time? Is there anything unusual about the table you are trying to update?

Any suggestions on how to improve the performance?

Optimize towards having a few DML statements, with each statement updating many rows. For example, you may use joins/semijoins to specify large sets of affected rows, as in the sketch below.
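As an illustration, here is a minimal sketch of a single semijoin-style UPDATE that modifies a large set of rows in one statement; the babynames.verified_names staging table is hypothetical:

UPDATE babynames.names_2014
SET name = CONCAT(name, ' (verified)')
WHERE name IN (
  -- semijoin against a staging table, rather than one UPDATE per name
  SELECT name
  FROM babynames.verified_names
);

One statement like this processes the whole affected set in a single pass, which amortizes the per-statement overhead seen in the tests above.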

Answer 3:

"Interestingly, overwriting the table with "create or replace table" statement usually has significantly better performance." - I'm wondering how much data processing quota is being using with this method :/

My suspicion is - google is trying to limit and slow down these operations on purpose, since big query is strictly non-production database to store data for analytical purposes. I also found that in the morning, when I start some operation it works just fine for a while, then it throttles down. This really makes me think, maybe postgreSQL is a way to go, since bigquery suppose to be easier and more straight-forward, but apparently it's not always a case.