I have two very large tables that I load from Parquet files as DataFrames, and they share a single join key. These are the issues I need help with:
- I need to tune the job, as I am getting OOM errors (Java heap space).
- I have to apply a left join.
There are no null values in the data, so filtering out nulls is unlikely to improve performance.
- What should I do to make this work?
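
A minimal sketch of the left join described above, for reference only. It assumes `df1` and `df2` are PySpark DataFrames; the helper name `left_join_on_key` and the column name used in the usage note are hypothetical, not taken from the actual job.

```python
def left_join_on_key(df1, df2, key):
    """Left-join df1 with df2 on a single shared column.

    df1 and df2 are assumed to be pyspark.sql.DataFrame objects.
    Passing the column name as a string (instead of a column expression)
    keeps one copy of the key column in the result.
    """
    # how="left" keeps every row of df1; unmatched rows get nulls
    # in the columns that come from df2.
    return df1.join(df2, on=key, how="left")
```

With real DataFrames this would be called as `left_join_on_key(df1, df2, "join_key")`, where `"join_key"` stands in for the actual column name. The function only builds the query plan; nothing is executed until an action such as a write or count runs.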
FYI: when loading the Parquet data, I already repartitioned it on a column.
- I have loaded both df1 and df2.
- Caching failed when I tried it, but since the data needs to be used multiple times, caching is required; persisting is not an option.
- I repartitioned both DataFrames to distribute the data evenly.
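
The repartitioning step above could be sketched roughly as follows. This is only an illustration under stated assumptions: the 128 MiB target partition size is a common rule of thumb rather than a value from this job, and the helper names `suggest_num_partitions` and `repartition_on_key` are hypothetical. Note that repartitioning on a column only spreads rows evenly when the key values themselves are not heavily skewed.

```python
def suggest_num_partitions(total_bytes, target_partition_bytes=128 * 1024 * 1024):
    """Return a partition count so each partition holds roughly the target size.

    target_partition_bytes defaults to 128 MiB, an assumed rule of thumb.
    """
    if total_bytes <= 0 or target_partition_bytes <= 0:
        raise ValueError("sizes must be positive")
    # Integer ceiling division, so a partial last partition still counts.
    return (total_bytes + target_partition_bytes - 1) // target_partition_bytes


def repartition_on_key(df, key, num_partitions):
    """Hash-partition a PySpark DataFrame on `key` into `num_partitions` parts."""
    # DataFrame.repartition(numPartitions, *cols) triggers a full shuffle.
    return df.repartition(num_partitions, key)
```

For example, `suggest_num_partitions(1024 ** 4)` gives the count for 1 TiB of input, and the result can be passed as `num_partitions` when repartitioning both DataFrames on the same join key.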