Applying the LSH approach with a sparse matrix instead of a dense matrix


I am trying to apply LSH (https://github.com/soundcloud/cosine-lsh-join-spark) to calculate cosine similarities between vectors. My real data has 2M rows (documents) and 30K features, and the matrix is highly sparse. As a sample, say my data looks like this:

D1 1 1 0 1 0 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
D2 0 0 1 1 0 1 0 0 0 0 0 0 0 1 1 1 0 0 0 0 0 0 0
D3 0 0 0 0 0 1 0 1 0 0 0 0 0 1 0 0 0 0 0 1 1 1 1
D4 ... 
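For instance, D1 above has non-zero entries only at indices 0, 1, 3, 5 and 6, so a sparse representation needs to store just those five index/value pairs rather than all 23 columns.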

In the library's example code, the features are put into a dense vector as below:

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.storage.StorageLevel
    import org.apache.spark.mllib.linalg.Vectors
    import org.apache.spark.mllib.linalg.distributed.IndexedRow

    val input = "text.txt"
    val numPartitions = 4 // not defined in the original snippet; choose to suit your cluster
    val conf = new SparkConf()
      .setAppName("LSH-Cosine")
      .setMaster("local[4]")
    val storageLevel = StorageLevel.MEMORY_AND_DISK
    val sc = new SparkContext(conf)

    // read in an example data set of word embeddings
    val data = sc.textFile(input, numPartitions).map {
      line =>
        val split = line.split(" ")
        val word = split.head
        val features = split.tail.map(_.toDouble)
        (word, features)
    }

    // create a unique id for each word by zipping with the RDD index
    val indexed = data.zipWithIndex.persist(storageLevel)

    // create an indexed row matrix where every row represents one word
    val rows = indexed.map {
      case ((word, features), index) =>
        IndexedRow(index, Vectors.dense(features))
    }

What I want to do is use a sparse matrix instead of a dense one. How can I adjust Vectors.dense(features) accordingly?

1 Answer

The equivalent factory method for sparse vectors is Vectors.sparse, which takes the vector size, an array of the indices of the non-zero entries, and a corresponding array of their values. The method signatures in the cosine-lsh-join-spark library are written against the general Vector class, so the library should accept either sparse or dense vectors.
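As a minimal sketch, assuming the same space-separated input format as above, the mapping could build sparse rows like this (only the non-zero entries are stored; features.length gives the vector size):

    val rows = indexed.map {
      case ((word, features), index) =>
        // keep only the (index, value) pairs of the non-zero entries;
        // e.g. row D1 above yields indices [0, 1, 3, 5, 6] with values of 1.0
        val (indices, values) = features.zipWithIndex
          .collect { case (value, i) if value != 0.0 => (i, value) }
          .unzip
        // Vectors.sparse(size, indices, values) stores only the non-zeros;
        // zipWithIndex preserves order, so the indices are already ascending
        IndexedRow(index, Vectors.sparse(features.length, indices, values))
    }

With 30K features and mostly zeros per row, this should cut memory use substantially. You could also parse the non-zero entries straight from each line and skip the intermediate dense array, but the sketch above keeps the original structure for clarity.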