Applying LSH approach by using sparse matrix instead of dense matrix

Question

479 Views Asked by mlee_jordan At 29 July 2025 at 06:01

I am trying to apply LSH (https://github.com/soundcloud/cosine-lsh-join-spark) to compute cosine similarity between vectors. My real data has 2M rows (documents) and 30K features, and the matrix is highly sparse. As a sample, suppose my data looks like this:

D1 1 1 0 1 0 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
D2 0 0 1 1 0 1 0 0 0 0 0 0 0 1 1 1 0 0 0 0 0 0 0
D3 0 0 0 0 0 1 0 1 0 0 0 0 0 1 0 0 0 0 0 1 1 1 1
D4 ...

In the related code the features are put in a dense vector as below:

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.mllib.linalg.Vectors
    import org.apache.spark.mllib.linalg.distributed.IndexedRow
    import org.apache.spark.storage.StorageLevel

    val input = "text.txt"
    val numPartitions = 4 // assumed value; the original snippet does not define it
    val conf = new SparkConf()
      .setAppName("LSH-Cosine")
      .setMaster("local[4]")
    val storageLevel = StorageLevel.MEMORY_AND_DISK
    val sc = new SparkContext(conf)

    // read in an example data set of word embeddings
    val data = sc.textFile(input, numPartitions).map {
      line =>
        val split = line.split(" ")
        val word = split.head
        val features = split.tail.map(_.toDouble)
        (word, features)
    }

    // create a unique id for each word by zipping with the RDD index
    val indexed = data.zipWithIndex.persist(storageLevel)

    // create indexed row matrix where every row represents one word
    val rows = indexed.map {
      case ((word, features), index) =>
        IndexedRow(index, Vectors.dense(features))
    }

What I want is to use a sparse representation instead of a dense one. How should I change 'Vectors.dense(features)'?

**Karl Higley** · Answer 1

The equivalent factory method for sparse vectors is Vectors.sparse, which takes the vector size, an array of the indices, and a corresponding array of the values for the non-zero entries. The method signatures in the cosine-lsh-join-spark library are based on the general Vector type, so it appears that the library will accept either sparse or dense vectors.
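
As a minimal sketch of that change, assuming each row is still parsed into a full Array[Double] as in the question's code: the helper below keeps only the non-zero entries and passes them to Vectors.sparse. The names SparseRows, toSparse and toIndexedRow are hypothetical and are not part of Spark or cosine-lsh-join-spark.

```scala
import org.apache.spark.mllib.linalg.{Vector, Vectors}
import org.apache.spark.mllib.linalg.distributed.IndexedRow

// Hypothetical helper object, for illustration only.
object SparseRows {

  // Build a sparse vector from a fully materialised row by keeping
  // only the non-zero entries and their positions.
  def toSparse(features: Array[Double]): Vector = {
    val nonZero = features.zipWithIndex.filter { case (value, _) => value != 0.0 }
    Vectors.sparse(
      features.length,    // total number of features (vector size)
      nonZero.map(_._2),  // indices of the non-zero entries, in ascending order
      nonZero.map(_._1)   // the corresponding non-zero values
    )
  }

  // Intended as a replacement for IndexedRow(index, Vectors.dense(features)).
  def toIndexedRow(index: Long, features: Array[Double]): IndexedRow =
    IndexedRow(index, toSparse(features))
}
```

In the question's code, the last step would then read `case ((word, features), index) => SparseRows.toIndexedRow(index, features)`. Calling `Vectors.dense(features).toSparse` should give an equivalent vector. Note that this sketch still allocates a dense array per line while parsing; with 30K features it may be worth storing the input as index/value pairs and passing those to Vectors.sparse directly.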
