I have a series of ~30 datasets that all need to be joined together to produce a wide final table. This final table takes ~5 years of individual tables (one table per year) and unions them together, then joins this full history with the full histories of other tables (similarly unioned) to make a big, historical, wide table.
The per-year tables of the first type are laid out as follows:
table_type_1:
| primary_key | year |
|-------------|------|
| key_1 | 0 |
| key_2 | 0 |
| key_3 | 0 |
The tables for other years look like this:
table_type_1:
| primary_key | year |
|-------------|------|
| key_1 | 1 |
| key_2 | 1 |
These are then unioned together to create:
table_type_1:
| primary_key | year |
|-------------|------|
| key_1 | 0 |
| key_2 | 0 |
| key_3 | 0 |
| key_1 | 1 |
| key_2 | 1 |
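Roughly, the union step in my pipeline looks like the following sketch (PySpark, with hypothetical per-year DataFrame names):

```python
from functools import reduce
from pyspark.sql import DataFrame

# Hypothetical handles to the per-year datasets (~5 of them in practice).
yearly_tables = [table_type_1_year_0, table_type_1_year_1]

# Stack the per-year tables into one full-history table.
table_type_1 = reduce(DataFrame.unionByName, yearly_tables)
```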
Similarly, unioning a second type of table results in the following:
table_type_2:
| primary_key | year |
|-------------|------|
| key_1 | 0 |
| key_2 | 0 |
| key_3 | 0 |
| key_1 | 1 |
| key_2 | 1 |
I now want to join `table_type_1` with `table_type_2` on `primary_key` and `year` to yield a much wider table. I notice that this final join takes a very long time and shuffles a lot of data.
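The final join itself is just a sketch like this (continuing the hypothetical names above):

```python
# Join the two full histories on the composite key. With no co-partitioning,
# Spark must first exchange (shuffle) both sides by (primary_key, year),
# which is the expensive step I'm seeing.
all_tables = table_type_1.join(table_type_2, ["primary_key", "year"])
```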
How can I make this faster?
You can use bucketing on the per-year tables over the `primary_key` and `year` columns, into the exact same number of buckets, to avoid an expensive exchange when computing the final join.

Note: when picking the `BUCKET_COUNT` value, it should be optimized for the final `all_tables` output, not for the intermediate tables. This means the files for the intermediate tables will likely be quite small, which is likely to be inconsequential compared to the efficiency gains on the `all_tables` output: you won't have to compute a massive exchange when joining everything up, since your buckets will be pre-computed and you can simply `SortMergeJoin` on the input files.

For an explicit example of how to write a transform that outputs a specified number of buckets, my answer over here is probably useful.
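As a minimal sketch of what that looks like in plain PySpark (the `_bucketed` table names and the `BUCKET_COUNT` value are placeholders; `table_type_1`/`table_type_2` are your unioned full-history DataFrames):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Placeholder value: tune this for the final all_tables output,
# not for the intermediate tables (see the note above).
BUCKET_COUNT = 8

# Write both full-history tables bucketed (and sorted) by the join keys,
# into the exact same number of buckets. bucketBy requires saveAsTable.
for df, name in [(table_type_1, "table_type_1_bucketed"),
                 (table_type_2, "table_type_2_bucketed")]:
    (df.write
       .format("parquet")
       .bucketBy(BUCKET_COUNT, "primary_key", "year")
       .sortBy("primary_key", "year")
       .mode("overwrite")
       .saveAsTable(name))

# Because both sides are identically bucketed on the join keys, Spark can
# plan a SortMergeJoin with no exchange on either side.
all_tables = spark.table("table_type_1_bucketed").join(
    spark.table("table_type_2_bucketed"),
    ["primary_key", "year"],
)
```

You can check the shuffle is actually gone with `all_tables.explain()`: neither input of the `SortMergeJoin` should show an `Exchange hashpartitioning(...)` node.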