How can I access values outside of Spark GraphX .map loop?

259 Views Asked by At

Brand new to Apache Spark and I'm a little confused how to make updates to a value that sits outside of a .mapTriplets iteration in GraphX. See below:

def mapTripletsMethod(edgeWeights: Graph[Int, Double], stationaryDistribution: Graph[Double, Double]) = {
  val tempMatrix: SparseDoubleMatrix2D = graphToSparseMatrix(edgeWeights)

  stationaryDistribution.mapTriplets{ e =>
      val row = e.srcId.toInt
      val column = e.dstId.toInt
      var cellValue = -1 * tempMatrix.get(row, column) + e.dstAttr
      tempMatrix.set(row, column, cellValue) // this doesn't do anything to tempMatrix
      e
    }
}

I'm guessing this is due to the design of an RDD and there's no simple way to update the tempMatrix value. When I run the above code the tempMatrix.set method does nothing. It was rather difficult to try to follow the problem in the debugger.

Does anyone have an easy solution? Thank you!

Edit

I've made an update above to show that stationaryDistribution is a graph RDD.

1

There are 1 best solutions below

1
On BEST ANSWER

You could make tempMatrix be of type RDD[((Int,Int), Double)] -- that is, each entry is a pair where the first element is in turn a (row,col) pair. Then use the PairRDDFunctions class to combine that with ((row,col),weight) triplets generated by your mapTriplets call. (So, don't think of it as updating the tempMatrix, but rather combining two RDDs to get a third.)

If you need to support stationary distribution graphs where there is more than one edge per vertex pair it gets a little tricky: you'll probably need to combine those edges in a reduction pass to create an RDD with one entry per pair, with a list of weights, and then apply all the weights to a given (row,col) pair at the same time. Otherwise it's very simple.

Notice that `PairRDDFunctions' on the one hand give you ways to combine multiple RDDs into one, or on the other hand to pull the values out into a Map on the master. Assuming that the distribution matrix is large enough to merit an RDD in the first place, I think you should do the whole thing on RDDs.

Another approach is to make the tempMatrix be a GraphRDD too, which may or may not make sense depending on what you're going to do with it next.