Join two large tables (50 GB and 1 billion records)

265 Views Asked by Red Maple At 16 June 2025 at 16:50

I have two very large tables stored in Parquet format, which I load as DataFrames and join on a single key. These are the issues I need help with:

I need to tune the job, because I am getting OOM errors (Java heap space).
I have to use a left join.

The join key will not contain any null values, so filtering out nulls before the join is unlikely to improve performance.

What should I do to get this join to run without running out of memory?

FYI: while loading the Parquet data I have already repartitioned it by a column.

I have loaded both df1 and df2.
When I tried caching them, it failed. Both DataFrames are used multiple times, so caching is required; persisting is not an option.
I applied repartitioning to both DataFrames to distribute the data evenly.
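
For reference, here is a rough sketch of the sizing arithmetic behind the repartitioning step. It assumes a table of about 50 GB (the figure from the title) and a target of roughly 128 MB per partition. That target is a common rule of thumb, not something measured on this job, and the helper name `partitions_for` is made up for illustration.

```python
import math

# Assumption: ~128 MB per partition is a reasonable target; tune for your cluster.
DEFAULT_TARGET_BYTES = 128 * 1024**2


def partitions_for(total_bytes: int, target_partition_bytes: int = DEFAULT_TARGET_BYTES) -> int:
    """Return how many partitions keep each one at or below the target size."""
    if total_bytes <= 0 or target_partition_bytes <= 0:
        raise ValueError("sizes must be positive")
    # Round up so no partition exceeds the target; always use at least one.
    return max(1, math.ceil(total_bytes / target_partition_bytes))


# Assumption: one table is about 50 GB, as stated in the title.
table_bytes = 50 * 1024**3
num_partitions = partitions_for(table_bytes)
print(num_partitions)  # 400
```

A number like this could be passed to `repartition(num_partitions, "join_key")` on both DataFrames (where `join_key` is a placeholder for the real join column), or used as the value of `spark.sql.shuffle.partitions`. If the key is skewed, the partitions will still be uneven even with a sensible count.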
