Join 2 large tables (50 GB and 1 billion records)


I have 2 very large tables that I am loading as DataFrames from parquet files, with one join key. I need help with the following issues:

  1. I need to tune the job, as I am getting OOM errors due to Java heap space.
  2. I have to apply a left join.
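Since the OOM is in Java heap space, the usual first knobs are executor memory and shuffle parallelism. A sketch of the relevant `spark-submit` settings is below; every value is an illustrative assumption and must be sized to the actual cluster, and disabling auto-broadcast is only a guess that Spark may be trying to broadcast one side of the join:

```shell
# Illustrative values only; size these to your cluster.
spark-submit \
  --driver-memory 4g \
  --executor-memory 8g \
  --executor-cores 4 \
  --conf spark.sql.shuffle.partitions=2000 \
  --conf spark.sql.autoBroadcastJoinThreshold=-1 \
  your_join_job.py
```

More shuffle partitions mean smaller per-task partitions, which directly reduces heap pressure during the shuffle; disabling the broadcast threshold prevents Spark from attempting to hold either of these large tables in memory on every executor.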

There will not be any null values in the join key, so null filtering will not improve performance.

  1. What should I do to make this join succeed?

FYI: while loading this parquet data I have already repartitioned on a column.

  1. I have loaded both df1 and df2.
  2. When I tried caching, it failed; but since the DataFrame needs to be used multiple times, caching is required, and persisting is not an option.
  3. Applied repartitioning on both DataFrames to distribute the data evenly.
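Step 3 only distributes the data evenly if the join key itself is not skewed; if a few hot keys dominate, one common remedy is key salting. The idea is sketched below in plain Python (no Spark needed) as an assumption on my part, not something the question states: appending a random salt spreads a hot key over several shuffle partitions, at the cost of replicating the other side of the join across all salt values (not shown):

```python
import random
from collections import Counter

# Skewed key set: key 0 dominates with 9000 of ~10000 rows.
keys = [0] * 9000 + list(range(1, 1001))

NUM_SALTS = 8

# Salted key: (key, salt) pairs. One hot key now hashes to
# NUM_SALTS different shuffle partitions instead of one.
salted = [(k, random.randrange(NUM_SALTS)) for k in keys]

# Without salting, one partition receives all 9000 rows for key 0;
# with salting they spread roughly evenly over the 8 salt buckets.
hot_buckets = Counter(salt for key, salt in salted if key == 0)
print(max(hot_buckets.values()))  # far below 9000, roughly 9000/8
```

On recent Spark versions, enabling Adaptive Query Execution (`spark.sql.adaptive.enabled` together with `spark.sql.adaptive.skewJoin.enabled`) lets Spark split skewed partitions automatically, which may remove the need for manual salting.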