I'm trying to understand why the date-range restriction implied by the first table in a LEFT JOIN does not limit the Bytes Processed the way an explicit WHERE clause with the same restriction does.
Query 1: Bytes Processed = 3.9 GB
SELECT *
FROM `project.dataset.calendar_halfhour` as cal
LEFT JOIN `project.dataset.big_table` as t
ON cal.tstamp = t.tstamp
Query 2: Bytes Processed = 1.07 GB
SELECT *
FROM `project.dataset.calendar_halfhour` as cal
LEFT JOIN
(SELECT *
FROM `project.dataset.big_table`
WHERE tstamp BETWEEN "2022-09-01 00:00:00 UTC" AND "2022-11-24 00:00:00 UTC"
) as t
ON cal.tstamp = t.tstamp
In the above queries, project.dataset.calendar_halfhour is a table of half-hourly incrementing timestamps, created using the query below. In other words, it already contains the date-range restriction that Query 2 applies explicitly. Although Query 1 returns the same result as Query 2, the bytes processed differ significantly, and I am not sure why.
SELECT tstamp
FROM UNNEST(GENERATE_TIMESTAMP_ARRAY("2022-09-01 00:00:00 UTC",
"2022-11-24 00:00:00 UTC",
INTERVAL 30 MINUTE)) AS tstamp
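For completeness, the table can be materialised from that generator along these lines (the CREATE OR REPLACE TABLE wrapper here is illustrative, not necessarily the exact statement used):
-- Illustrative DDL: persist the half-hourly timestamps as a table.
CREATE OR REPLACE TABLE `project.dataset.calendar_halfhour` AS
SELECT tstamp
FROM UNNEST(GENERATE_TIMESTAMP_ARRAY("2022-09-01 00:00:00 UTC",
"2022-11-24 00:00:00 UTC",
INTERVAL 30 MINUTE)) AS tstamp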
Clarification:
- big_table is partitioned on tstamp (by DAY); I am not sure whether that makes a difference. A way to inspect the partitions is sketched just after this list.
- The left-hand table, project.dataset.calendar_halfhour, is much smaller than the right-hand table.
- I realise that my query might not be best practice; that is not my concern here. I created it solely to better understand the processing happening in the background, so that I can optimise queries in future.
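To confirm the partition layout, BigQuery's INFORMATION_SCHEMA.PARTITIONS view can be queried; a minimal sketch (one row per partition, with its row count and logical size):
SELECT partition_id, total_rows, total_logical_bytes
FROM `project.dataset.INFORMATION_SCHEMA.PARTITIONS`
WHERE table_name = 'big_table'  -- daily partition_id values look like '20220901'
ORDER BY partition_id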
My assumption was that a LEFT JOIN would first look at the small table on the left and then fetch only the required rows from the big table on the right (as identified by the tstamp field it must join on). On that assumption, adding a WHERE clause with the same tstamp limits shouldn't change the bytes processed. But it does, and I am not sure why.
In both queries, the set of tstamp values supplied by the calendar table is identical. But in Query 2 the WHERE clause filters big_table before the join, so less data enters the join and the total bytes processed are smaller than in Query 1. Because big_table is partitioned by DAY on tstamp, the constant BETWEEN predicate lets BigQuery prune partitions at planning time; a join condition alone cannot prune them, since the matching keys are only known while the query runs, so Query 1 has to scan the whole table.
You can quickly verify this by comparing the size of big_table with the size of the filtered subquery.
In other words, the date range baked into the calendar table is not a hard limit on data processing: BigQuery still has to read every value on the other side of the join before it can match them.
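A minimal sketch of that size comparison using partition metadata alone, so nothing is scanned (this assumes DAY partitions, whose partition_id is formatted YYYYMMDD):
-- Bytes in the whole table vs the slice the Query 2 subquery touches.
SELECT
  SUM(total_logical_bytes) AS full_table_bytes,
  SUM(IF(partition_id BETWEEN '20220901' AND '20221124',
         total_logical_bytes, 0)) AS filtered_slice_bytes
FROM `project.dataset.INFORMATION_SCHEMA.PARTITIONS`
WHERE table_name = 'big_table'
A dry run of each query in the console (or with the bq CLI's --dry_run flag) shows the same estimates without executing anything.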
Listing the large table first is also recommended for performance, according to the docs:
it is still recommended to order your joined tables appropriately. As a best practice, place the table with the largest number of rows first, followed by the table with the fewest rows, and then place the remaining tables by decreasing size.
When you have a large table as the left side of the JOIN and a small one on the right side of the JOIN, a broadcast join is created. A broadcast join sends all the data in the smaller table to each slot that processes the larger table. It is advisable to perform the broadcast join first.
https://cloud.google.com/bigquery/docs/best-practices-performance-compute#optimize_your_join_patterns
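Combining both points, here is a sketch that keeps Query 2's constant pruning filter, lists the larger input first, and preserves the original LEFT JOIN result by flipping the join direction:
-- The constant filter prunes big_table's day partitions; the RIGHT JOIN
-- keeps every calendar row, so results match the original LEFT JOIN.
SELECT *
FROM (
  SELECT *
  FROM `project.dataset.big_table`
  WHERE tstamp BETWEEN "2022-09-01 00:00:00 UTC" AND "2022-11-24 00:00:00 UTC"
) AS t
RIGHT JOIN `project.dataset.calendar_halfhour` AS cal
ON cal.tstamp = t.tstamp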