I'm trying to understand why the date-range restriction implied by the first table in a LEFT JOIN does not limit the Bytes Processed the way an explicit WHERE clause with the same restriction does.
Query 1: Bytes Processed = 3.9 GB
SELECT *
FROM `project.dataset.calendar_halfhour` as cal
LEFT JOIN `project.dataset.big_table` as t
ON cal.tstamp = t.tstamp
Query 2: Bytes Processed = 1.07 GB
SELECT *
FROM `project.dataset.calendar_halfhour` as cal
LEFT JOIN
(SELECT *
FROM `project.dataset.big_table`
WHERE tstamp BETWEEN "2022-09-01 00:00:00 UTC" AND "2022-11-24 00:00:00 UTC"
) as t
ON cal.tstamp = t.tstamp
In the above queries, project.dataset.calendar_halfhour is a table of half-hourly incrementing timestamps, created using the query below. In other words, it already contains the date-range restriction that Query 2 applies explicitly. Although Query 1 returns the same result as Query 2, the bytes processed differ significantly, and I am not sure why.
SELECT tstamp
FROM UNNEST(GENERATE_TIMESTAMP_ARRAY("2022-09-01 00:00:00 UTC",
"2022-11-24 00:00:00 UTC",
INTERVAL 30 MINUTE)) AS tstamp
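For completeness, the table can be materialised from that generator along these lines (the CREATE OR REPLACE TABLE wrapper here is illustrative, not necessarily the exact statement used):
-- Illustrative DDL: persist the half-hourly timestamps as a table.
CREATE OR REPLACE TABLE `project.dataset.calendar_halfhour` AS
SELECT tstamp
FROM UNNEST(GENERATE_TIMESTAMP_ARRAY("2022-09-01 00:00:00 UTC",
"2022-11-24 00:00:00 UTC",
INTERVAL 30 MINUTE)) AS tstamp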
Clarification:
- big_table is partitioned on tstamp (by DAY); I am not sure whether that makes a difference. A way to inspect the partitions is sketched just after this list.
- The left-hand table, project.dataset.calendar_halfhour, is much smaller than the right-hand table.
- I realise that my query might not be best practice; that is not my concern here. I created it solely to better understand the processing happening in the background, so that I can optimise queries in future.
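To confirm the partition layout, BigQuery's INFORMATION_SCHEMA.PARTITIONS view can be queried; a minimal sketch (one row per partition, with its row count and logical size):
SELECT partition_id, total_rows, total_logical_bytes
FROM `project.dataset.INFORMATION_SCHEMA.PARTITIONS`
WHERE table_name = 'big_table'  -- daily partition_id values look like '20220901'
ORDER BY partition_id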
My assumption was that a LEFT JOIN would first look at the small table on the left and then fetch only the required rows from the big table on the right (as identified by the tstamp field it must join on). On that assumption, adding a WHERE clause with the same tstamp limits shouldn't change the bytes processed. But it does, and I am not sure why.
In both queries, the set of tstamp values supplied by the calendar table is identical. But in Query 2 the WHERE clause filters big_table before the join, so less data enters the join and the total bytes processed are smaller than in Query 1. Because big_table is partitioned by DAY on tstamp, the constant BETWEEN predicate lets BigQuery prune partitions at planning time; a join condition alone cannot prune them, since the matching keys are only known while the query runs, so Query 1 has to scan the whole table.
You can quickly verify this by comparing the size of big_table with the size of the filtered subquery.
In other words, the date range baked into the calendar table is not a hard limit on data processing: BigQuery still has to read every value on the other side of the join before it can match them.
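A minimal sketch of that size comparison using partition metadata alone, so nothing is scanned (this assumes DAY partitions, whose partition_id is formatted YYYYMMDD):
-- Bytes in the whole table vs the slice the Query 2 subquery touches.
SELECT
  SUM(total_logical_bytes) AS full_table_bytes,
  SUM(IF(partition_id BETWEEN '20220901' AND '20221124',
         total_logical_bytes, 0)) AS filtered_slice_bytes
FROM `project.dataset.INFORMATION_SCHEMA.PARTITIONS`
WHERE table_name = 'big_table'
A dry run of each query in the console (or with the bq CLI's --dry_run flag) shows the same estimates without executing anything.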
Listing the large table first is also recommended for performance, according to the docs:
it is still recommended to order your joined tables appropriately. As a best practice, place the table with the largest number of rows first, followed by the table with the fewest rows, and then place the remaining tables by decreasing size.
When you have a large table as the left side of the JOIN and a small one on the right side of the JOIN, a broadcast join is created. A broadcast join sends all the data in the smaller table to each slot that processes the larger table. It is advisable to perform the broadcast join first.
https://cloud.google.com/bigquery/docs/best-practices-performance-compute#optimize_your_join_patterns
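Combining both points, here is a sketch that keeps Query 2's constant pruning filter, lists the larger input first, and preserves the original LEFT JOIN result by flipping the join direction:
-- The constant filter prunes big_table's day partitions; the RIGHT JOIN
-- keeps every calendar row, so results match the original LEFT JOIN.
SELECT *
FROM (
  SELECT *
  FROM `project.dataset.big_table`
  WHERE tstamp BETWEEN "2022-09-01 00:00:00 UTC" AND "2022-11-24 00:00:00 UTC"
) AS t
RIGHT JOIN `project.dataset.calendar_halfhour` AS cal
ON cal.tstamp = t.tstamp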