I have two large tables, a base table and a table I'm using to filter. I want the keys where 'my_filter' is true.
Which one gives me the best performance Google BigQuery?
Standard Join
select
[fields]
from base_table
join filter_table
on filter_table.my_key = base_table.my_key
where filter_table.my_filter
Join on a subquery
select
[fields]
from base_table
join (
select my_key from filter_table where my_filter
) filter_table
on filter_table.my_key = base_table.my_key
Semi-join
select
[fields]
from base_table
where my_key in (
select my_key from filter_table where my_filter
)
For context: the Google BQ docs discuss semi-joins but don't explain why you would use them.
I got curious with your question and decided to run a test in our BQ to see how results would change.
I've simulated a query, hopefully it does a similar job to what you proposed. It processed 169 GBs of data with no caching enabled. Here's the results I found:
Standard Join: 10.2s
Join on Subquery: 10.6s
Semi Join: 10.2s
Conclusion
As Elliott said, try using Standard SQL. Other than that, BigQuery will, for the most part, regardless of your query, process everything in a few seconds.