Why is JaccardDistance always 0 for different docs from Spark MinHashLSHModel approxSimilarityJoin?


I am new to Spark ML. Spark ML has a MinHash implementation for Jaccard distance; see the docs: https://spark.apache.org/docs/latest/ml-features#minhash-for-jaccard-distance. In the sample code there, the inputs for comparison are vectors, and I have no questions about that sample. But when I use text docs as input and convert them to vectors via Word2Vec, I get a Jaccard distance of 0. I don't know what's wrong with my code; there is something I am not understanding. Thanks in advance for any help.

import static org.apache.spark.sql.functions.col;

import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

import org.apache.spark.ml.feature.MinHashLSH;
import org.apache.spark.ml.feature.MinHashLSHModel;
import org.apache.spark.ml.feature.Word2Vec;
import org.apache.spark.ml.feature.Word2VecModel;
import org.apache.spark.ml.linalg.Vector;
import org.apache.spark.ml.linalg.VectorUDT;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.RowFactory;
import org.apache.spark.sql.SparkSession;
import org.apache.spark.sql.types.ArrayType;
import org.apache.spark.sql.types.DataTypes;
import org.apache.spark.sql.types.Metadata;
import org.apache.spark.sql.types.StructField;
import org.apache.spark.sql.types.StructType;

SparkSession spark = SparkSession.builder().appName("TestMinHashLSH").config("spark.master", "local").getOrCreate();

List<Row> data1 = Arrays.asList(RowFactory.create(Arrays.asList("Hi I heard about Spark".split(" "))),
            RowFactory.create(Arrays.asList("I wish Java could use case classes".split(" "))),
            RowFactory.create(Arrays.asList("Logistic regression models are neat".split(" "))));

List<Row> data2 = Arrays.asList(RowFactory.create(Arrays.asList("Hi I heard about Scala".split(" "))),
            RowFactory.create(Arrays.asList("I wish python could also use case classes".split(" "))));

StructType schema4word = new StructType(new StructField[] {
            new StructField("text", new ArrayType(DataTypes.StringType, true), false, Metadata.empty()) });
Dataset<Row> documentDF1 = spark.createDataFrame(data1, schema4word);

// Learn a mapping from words to Vectors.
Word2Vec word2Vec = new Word2Vec().setInputCol("text").setOutputCol("result").setVectorSize(30).setMinCount(0);

Word2VecModel w2vModel1 = word2Vec.fit(documentDF1);
Dataset<Row> result1 = w2vModel1.transform(documentDF1);

// Re-key each embedded document with an integer id for the LSH join.
List<Row> myDataList1 = new ArrayList<>();
int id = 0;
for (Row row : result1.collectAsList()) {
    Vector vector = (Vector) row.get(1);
    myDataList1.add(RowFactory.create(id++, vector));
}
StructType schema1 = new StructType(new StructField[] {
        new StructField("id", DataTypes.IntegerType, false, Metadata.empty()),
        new StructField("features", new VectorUDT(), false, Metadata.empty()) });

Dataset<Row> df1 = spark.createDataFrame(myDataList1, schema1);

Dataset<Row> documentDF2 = spark.createDataFrame(data2, schema4word);

Word2VecModel w2vModel2 = word2Vec.fit(documentDF2);
Dataset<Row> result2 = w2vModel2.transform(documentDF2);

List<Row> myDataList2 = new ArrayList<>();      
id = 10;
for (Row row : result2.collectAsList()) {
    List<String> text = row.getList(0);
    Vector vector = (Vector) row.get(1);
    System.out.println("Text: " + text + " => \nVector: " + vector + "\n");
    myDataList2.add(RowFactory.create(id++, vector));
}

Dataset<Row> df2 = spark.createDataFrame(myDataList2, schema1);

MinHashLSH mh = new MinHashLSH().setNumHashTables(5).setInputCol("features").setOutputCol("hashes");

MinHashLSHModel model = mh.fit(df1);

// Feature Transformation
System.out.println("The hashed dataset where hashed values are stored in the column 'hashes':");
model.transform(df1).show();

// Compute the locality-sensitive hashes for the input rows, then perform an
// approximate similarity join. We could avoid recomputing the hashes by
// passing in the already-transformed datasets, e.g.
// `model.approxSimilarityJoin(transformedA, transformedB, 0.6)`.
System.out.println("Approximately joining df1 and df2 on Jaccard distance smaller than 1.6:");
model.approxSimilarityJoin(df1, df2, 1.6, "JaccardDistance")
        .select(col("datasetA.id").alias("id1"), col("datasetB.id").alias("id2"), col("JaccardDistance"))
        .show();


spark.stop();

From Word2Vec I got different vectors for the different docs, so I would expect some non-zero values for JaccardDistance when comparing two different docs. Instead, I got all 0s. The following shows what I got when I run the program:

Text: [Hi, I, heard, about, Scala] => Vector: [0.005808539432473481,-0.001387741044163704,0.007890049391426146,... ,04969391227]

Text: [I, wish, python, could, also, use, case, classes] => Vector: [-0.0022146602132124826,0.0032128597667906433,-0.00658524181926623,...,-3.716901264851913E-4]

Approximately joining df1 and df2 on Jaccard distance smaller than 1.6:

+---+---+---------------+
|id1|id2|JaccardDistance|
+---+---+---------------+
|  1| 11|            0.0|
|  0| 10|            0.0|
|  2| 11|            0.0|
|  0| 11|            0.0|
|  1| 10|            0.0|
|  2| 10|            0.0|
+---+---+---------------+

1 Answer

Jaccard distance, both by definition and in the Spark implementation, is computed between two sets.

As the Spark documentation states:

Jaccard distance of two sets is defined by the cardinality of their intersection and union:

d(A,B)=1−|A∩B|/|A∪B|
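
For example, taking the token sets of "Hi I heard about Spark" and "Hi I heard about Scala" from the data above: the intersection has 4 tokens and the union has 6, so d = 1 − 4/6 ≈ 0.33.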

When you apply Word2Vec to a document, it produces a dense embedding that captures the semantics of the text. MinHash, however, reads a vector as a set: every index holding a non-zero value counts as a member. Because a Word2Vec embedding is dense, essentially all 30 elements are non-zero, so every document maps to the same set of indices {0, 1, ..., 29}; the intersection equals the union, and the Jaccard distance is 1 − 30/30 = 0 for every pair, which is exactly the output you saw. If you still want to use Word2Vec embeddings, switch to a distance suited to dense vectors, such as cosine distance.
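
Spark ML does not ship an LSH family for cosine distance, but you can compute cosine similarity on the Word2Vec vectors directly. A minimal sketch; the cosineSimilarity helper below is our own illustration, not a Spark API:

import org.apache.spark.ml.linalg.Vector;

// Illustrative helper (not a Spark API): cosine similarity of two ML vectors.
static double cosineSimilarity(Vector a, Vector b) {
    double dot = 0.0, normA = 0.0, normB = 0.0;
    double[] x = a.toArray();
    double[] y = b.toArray();
    for (int i = 0; i < x.length; i++) {
        dot += x[i] * y[i];
        normA += x[i] * x[i];
        normB += y[i] * y[i];
    }
    return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

Cosine distance is then simply 1 - cosineSimilarity(a, b).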

The correct preprocessing step for Jaccard distance would be something like:

  1. CountVectorizer
  2. Or hash the tokens yourself and assemble the result with a VectorAssembler

MinHash expects binary vectors: any non-zero value is treated as a binary "1". A sketch of the CountVectorizer route follows.
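
Here is a minimal sketch of option 1, reusing documentDF1 and documentDF2 from the question. Fitting a single vocabulary over the union of both datasets is our assumption, made so that both sides share the same term indices:

import org.apache.spark.ml.feature.CountVectorizer;
import org.apache.spark.ml.feature.CountVectorizerModel;

// Fit one vocabulary over both datasets so the term indices line up.
CountVectorizerModel cvModel = new CountVectorizer()
        .setInputCol("text")
        .setOutputCol("features")
        .setBinary(true) // 0/1 set membership; MinHash treats any non-zero as 1 anyway
        .fit(documentDF1.union(documentDF2));

// Sparse 0/1 vectors: only the terms that occur in a document are non-zero.
Dataset<Row> features1 = cvModel.transform(documentDF1);
Dataset<Row> features2 = cvModel.transform(documentDF2);

MinHashLSH mh = new MinHashLSH()
        .setNumHashTables(5)
        .setInputCol("features")
        .setOutputCol("hashes");

MinHashLSHModel model = mh.fit(features1);

// Distances now reflect token-set overlap instead of collapsing to 0.
model.approxSimilarityJoin(features1, features2, 0.6, "JaccardDistance").show();

With the sample sentences, the pair "Hi I heard about Spark" / "Hi I heard about Scala" should now come out at 1 − 4/6 ≈ 0.33 rather than 0.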

For a working example, please refer to this write-up from Uber: https://eng.uber.com/lsh/