Best JOIN performance on Google BigQuery

6.1k Views Asked by At

I have two large tables, a base table and a table I'm using to filter. I want the keys where 'my_filter' is true.

Which one gives me the best performance Google BigQuery?

Standard Join

select
[fields]
from base_table
join filter_table
on filter_table.my_key = base_table.my_key
where filter_table.my_filter

Join on a subquery

select
[fields]
from base_table
join (
  select my_key from filter_table where my_filter
) filter_table
on filter_table.my_key = base_table.my_key

Semi-join

select
[fields]
from base_table
where my_key in (
  select my_key from filter_table where my_filter
)

For context: the Google BQ docs discuss semi-joins but don't explain why you would use them.

2

There are 2 best solutions below

0
On BEST ANSWER

I got curious with your question and decided to run a test in our BQ to see how results would change.

I've simulated a query, hopefully it does a similar job to what you proposed. It processed 169 GBs of data with no caching enabled. Here's the results I found:

Standard Join: 10.2s

enter image description here

Join on Subquery: 10.6s

enter image description here

Semi Join: 10.2s

enter image description here

Conclusion

As Elliott said, try using Standard SQL. Other than that, BigQuery will, for the most part, regardless of your query, process everything in a few seconds.

0
On

The first one will have the worst performance, when using legacy SQL with BigQuery, since filters are not pushed through joins. Use standard SQL instead, which has automatic filter pushdown through joins. The third query is only equivalent if each key from the left appears at most one in the right table; otherwise it will return fewer rows than the first two queries do.