PySpark BucketedRandomProjectionLSH - count() after approxSimilarityJoin gives different results when I persist the output


I am using pyspark.ml.feature.BucketedRandomProjectionLSH to identify similar items.

I have two datasets that have been vectorized. I hashed both with LSH and stored each transformed dataset in its own location. The model used to transform both datasets is stored on HDFS as well. However, when I run approxSimilarityJoin against these two datasets and write the result to Parquet, I get different results than when I don't write it to Parquet.

This is how I create my datasets:

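    from pyspark.ml.feature import (
        BucketedRandomProjectionLSH,
        BucketedRandomProjectionLSHModel,
    )

    # Configure the LSH estimator; a fixed seed keeps the projections reproducible.
    brp = BucketedRandomProjectionLSH()
    brp.setInputCol(output_col)
    brp.setOutputCol("hashes")
    brp.setSeed(12345)
    brp.setBucketLength(buck_len)
    brp.setNumHashTables(num_hshtbls)

    # Fit on the left dataset, save the model, and reload it from storage.
    model = brp.fit(dfLeft)
    model.write().overwrite().save('LSH_model_test')
    model = BucketedRandomProjectionLSHModel.load('LSH_model_test')

    # Hash both datasets with the same model and persist the results.
    dfLeft_T = model.transform(dfLeft)
    dfLeft_T.write.mode('overwrite').parquet('dfLeft_transformed_test')

    dfRight_T = model.transform(dfRight)
    dfRight_T.write.mode('overwrite').parquet('dfRight_transformed_test')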

To find similar items I use this:

    dfLeft_T = spark.read.parquet('dfLeft_transformed_test')
    dfRight_T = spark.read.parquet('dfRight_transformed_test')
    pairs1_ = model.approxSimilarityJoin(dfLeft_T, dfRight_T, cut_off, distCol="EuclideanDistance")
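
For context, here is a minimal pure-Python sketch of the idea behind a bucketed random projection join. It is not Spark's implementation: the helper names (`make_projections`, `lsh_hashes`, `approx_similarity_join`) and the toy data are made up for illustration, and it assumes the usual formulation where each hash is floor(dot(x, v) / bucketLength) for a random unit vector v, and a pair is kept when it shares a bucket in at least one hash table and its Euclidean distance is below the threshold.

```python
import math
import random
from itertools import product


def make_projections(dim, num_tables, seed):
    """Draw one random unit vector per hash table (hypothetical helper)."""
    rng = random.Random(seed)
    projections = []
    for _ in range(num_tables):
        v = [rng.gauss(0.0, 1.0) for _ in range(dim)]
        norm = math.sqrt(sum(c * c for c in v))
        projections.append([c / norm for c in v])
    return projections


def lsh_hashes(x, projections, bucket_length):
    """Return one bucket index per hash table: floor(dot(x, v) / bucket_length)."""
    return [
        math.floor(sum(a * b for a, b in zip(x, v)) / bucket_length)
        for v in projections
    ]


def approx_similarity_join(left, right, projections, bucket_length, threshold):
    """Keep pairs that share a bucket in some table and lie closer than threshold."""
    left_h = {k: lsh_hashes(x, projections, bucket_length) for k, x in left.items()}
    right_h = {k: lsh_hashes(x, projections, bucket_length) for k, x in right.items()}
    pairs = []
    for (lk, lx), (rk, rx) in product(left.items(), right.items()):
        # Candidate pairs must collide in at least one hash table.
        if not any(a == b for a, b in zip(left_h[lk], right_h[rk])):
            continue
        dist = math.dist(lx, rx)  # exact Euclidean distance on the candidates
        if dist < threshold:
            pairs.append((lk, rk, dist))
    return pairs


# Toy data (illustrative only): "a" and "x" are identical vectors.
left = {"a": [1.0, 2.0], "b": [10.0, -3.0]}
right = {"x": [1.0, 2.0], "y": [50.0, 50.0]}

projections = make_projections(dim=2, num_tables=3, seed=12345)
pairs = approx_similarity_join(left, right, projections, bucket_length=2.0, threshold=0.5)
zero_distance = [p for p in pairs if p[2] == 0]
```

In this sketch, identical vectors always land in the same buckets under the same projections, so a pair at distance 0 is always a candidate. Pairs that are merely close can be missed when they fall into different buckets in every table.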

To get the number of pairs with distance 0, I use this command:

    from pyspark.sql.functions import col

    pairs1_.filter(col('EuclideanDistance') == 0).count()

which returns 924.

However, when I write pairs1_ to a Parquet file, read it back, and run the same count like this:

    pairs1_.write.mode('overwrite').parquet('pairs1_test')
    pairs1A_ = spark.read.parquet('pairs1_test')
    pairs1A_.filter(col('EuclideanDistance') == 0).count()

the output is 200.

Can you help me understand why writing to Parquet might change the result of this count?

I have run the above multiple times, and the count is always lower when I write to Parquet than when I don't.
