I am trying to compare two very large dataframes, each with around 10 petabytes of data, in Spark. The execution is throwing out-of-memory issues even after increasing the memory configuration. Could anyone suggest a better alternative to solve this?
Approach I am using (a runnable sketch follows the list):
- Generate row_hashes for each dataframe
- diff = df.select('row_hash').subtract(df1.select('row_hash'))
- diff.join(df, on='row_hash', how='inner')
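For reference, here is a minimal PySpark sketch of that approach; the input paths and the sha2-over-concat_ws hashing are assumptions for illustration:

```python
# A minimal sketch of the row-hash diff approach; paths and the exact
# hash construction are assumptions, not the original code.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("row-hash-diff").getOrCreate()

df = spark.read.parquet("/data/table_a")    # hypothetical input paths
df1 = spark.read.parquet("/data/table_b")

# Hash every full row into a single comparable column on each side.
df_h = df.withColumn("row_hash", F.sha2(F.concat_ws("||", *df.columns), 256))
df1_h = df1.withColumn("row_hash", F.sha2(F.concat_ws("||", *df1.columns), 256))

# Hashes present in df but missing from df1.
diff = df_h.select("row_hash").subtract(df1_h.select("row_hash"))

# Recover the full rows behind those hashes.
result = diff.join(df_h, on="row_hash", how="inner")
```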
You can use Spark's LSH implementation to hash both dataframes into lower dimensions. After hashing, you can perform an approxSimilarityJoin with a very strict similarity threshold, so that only (near-)identical rows match.
Some basic code on how to do this:
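Below is a hedged sketch of that idea, assuming both dataframes share the same numeric columns; the column names, bucket length, and threshold are placeholders you would need to adjust:

```python
# Sketch only: assumes df and df1 share the same numeric columns;
# num_cols, bucketLength and the join threshold are illustrative values.
from pyspark.ml.feature import VectorAssembler, BucketedRandomProjectionLSH

num_cols = ["col1", "col2", "col3"]          # assumed numeric columns to compare

# Assemble the columns into a single vector column on both sides.
assembler = VectorAssembler(inputCols=num_cols, outputCol="features")
df_vec = assembler.transform(df)
df1_vec = assembler.transform(df1)

# Hash the vectors into lower-dimensional buckets; more hash tables
# improve recall at the cost of extra computation.
lsh = BucketedRandomProjectionLSH(inputCol="features", outputCol="hashes",
                                  bucketLength=2.0, numHashTables=3)
model = lsh.fit(df_vec)

# Join rows whose approximate Euclidean distance is below a very tight
# threshold, which effectively acts as a near-exact match.
matches = model.approxSimilarityJoin(df_vec, df1_vec, 1e-6, distCol="dist")
```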
This method should be considerably better than a straight comparison. However, at the scale you are talking about you will almost certainly also need to tune the configuration settings, for example along the lines of the sketch below.
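The values here are illustrative only; the right numbers depend entirely on your cluster size and data layout:

```python
from pyspark.sql import SparkSession

# Illustrative settings only; tune to your cluster and data layout.
spark = (SparkSession.builder
         .appName("large-dataframe-diff")
         .config("spark.sql.shuffle.partitions", "20000")  # many partitions for huge shuffles
         .config("spark.sql.adaptive.enabled", "true")     # let AQE coalesce and split skewed partitions
         .config("spark.memory.fraction", "0.8")           # give more of the heap to execution/storage
         .getOrCreate())
```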