Spark Scala: Iterable to individual key-value pairs

I have a problem in Spark with Scala: I need to convert an Iterable (CompactBuffer) into individual pairs. I want to create a new RDD whose key-value pairs are built from the elements of each CompactBuffer.

The data looks like this:

CompactBuffer(Person2, Person5)
CompactBuffer(Person2, Person5, Person7)
CompactBuffer(Person1, Person5, Person11)

The CompactBuffers can contain more than three persons. What I want is a new RDD that holds the individual pairwise combinations of the elements in each CompactBuffer, in both orders, like this (I also want to avoid duplicate key-value pairs):

Array[
<Person2, Person5>
<Person5, Person2>
<Person2, Person7>
<Person7, Person2>
<Person5, Person7>
<Person7, Person5>
<Person1, Person5>
<Person5, Person1>
<Person1, Person11>
<Person11, Person1>
<Person5, Person11>
<Person11, Person5>]

Can someone help me?

Thank you in advance

There is 1 best solution below

Here's something that produces the pairs and removes repeated ones. I couldn't work out how to construct a CompactBuffer directly, so the example uses ArrayBuffer instead; the source for CompactBuffer describes it as a more efficient ArrayBuffer. You may need to convert your CompactBuffer inside the flatMap to something that supports .combinations.

import org.apache.spark.{SparkConf, SparkContext}
import scala.collection.mutable.ArrayBuffer

object sparkapp extends App {

  // Sample data standing in for the CompactBuffers from the question
  val data = List(
    ArrayBuffer("Person2", "Person5"),
    ArrayBuffer("Person2", "Person5", "Person7"),
    ArrayBuffer("Person1", "Person5", "Person11"))

  val conf = new SparkConf().setAppName("spark-scratch").setMaster("local")
  val sc = new SparkContext(conf)

  val dataRDD = sc.makeRDD(data, 1)

  // For each buffer, take every 2-element combination and emit it in both
  // orders; distinct then drops pairs that occur in more than one buffer
  val pairs = dataRDD.flatMap(
    ab => ab.combinations(2)
            .flatMap { case ArrayBuffer(x, y) => List((x, y), (y, x)) }
  ).distinct

  pairs.foreach(println)

  // Release the local Spark resources once the job is done
  sc.stop()
}

Output

(Person7,Person2)
(Person7,Person5)
(Person5,Person2)
(Person11,Person1)
(Person11,Person5)
(Person2,Person7)
(Person5,Person7)
(Person1,Person11)
(Person2,Person5)
(Person5,Person11)
(Person1,Person5)
(Person5,Person1)
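
If the buffers come out of something like groupByKey, each value is typed as a plain Iterable, which has no .combinations method. The following is a minimal sketch of the conversion mentioned above, written in plain Scala so it runs without a Spark cluster. The names PairExpansion and orderedPairs are hypothetical and not part of Spark or of the code above; the sketch assumes each group is small enough to materialise in memory.

```scala
// Hypothetical helper, not part of Spark: expands one group into ordered pairs.
object PairExpansion {
  def orderedPairs[A](group: Iterable[A]): List[(A, A)] =
    group.toSeq                 // Iterable has no .combinations, so turn it into a Seq first
      .distinct                 // drop repeats inside the group so no (x, x) pair can appear
      .combinations(2)          // every unordered 2-element selection
      .flatMap { case Seq(x, y) => List((x, y), (y, x)) } // emit both orders
      .toList
}

object PairExpansionDemo extends App {
  val groups: List[Iterable[String]] = List(
    List("Person2", "Person5"),
    List("Person2", "Person5", "Person7"),
    List("Person1", "Person5", "Person11"))

  // Same shape as the RDD version: flatMap over the groups, then distinct
  val pairs = groups.flatMap(PairExpansion.orderedPairs).distinct
  pairs.foreach(println)
}
```

With an RDD[Iterable[String]], the same function could presumably be passed straight to flatMap, followed by distinct, in place of the ArrayBuffer pattern match.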