This question is unrelated to Spark. I'm working on Scala 2.12, so I don't have access to groupMap or groupMapReduce.
I currently have code like the following:
// map from tripId to the customer rows on that trip
val customerTrips = customerData.toList.par.groupBy(c => c.tripId)

// build a row for every pair of customers on the same trip,
// keeping each pair only once (customer1Id < customer2Id)
val res = customerData
  .toList.par
  .flatMap { c1 =>
    customerTrips(c1.tripId)
      .filter(c2 => c1.customerId < c2.customerId)
      .map(c2 => intermediateSchema(customer1Id = c1.customerId, customer2Id = c2.customerId, date = c1.date))
  }
println(res.size) // size is about 5,000,000
val res2 = res.groupBy(row => (row.customer1Id, row.customer2Id)) // GC overhead limit exceeded here
println(res2.size)
val res3 = res2.mapValues(_.size) // trying to get the number of trips shared between customers
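For what it's worth, I think what I ultimately want from res2/res3 is what groupMapReduce would give me in one pass on 2.13 (on a plain, non-parallel collection), roughly like this, but as I said that isn't available to me on 2.12:

// what I believe the 2.13 equivalent of res2/res3 would look like
// (counts how many rows, i.e. shared trips, each customer pair has)
val sharedTripCounts = res.seq.toList
  .groupMapReduce(row => (row.customer1Id, row.customer2Id))(_ => 1)(_ + _)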
I've annotated above where I'm getting the error (at the groupBy). I'm pretty new to Scala and don't know if there is a preferred/better way of doing this. I'd like to keep the aggregation parallel, so I don't think something like foldLeft is what I'm looking for. If the only option is to increase the heap size (I have it set to the default 1.33 GB) or change the garbage collector, let me know.
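To make it clearer what I mean by keeping the aggregation parallel, this is the kind of thing I've been sketching with aggregate instead of foldLeft. It's only a sketch: I'm assuming here that the customer ids are Longs, and I'm not sure it's correct or actually any better on memory.

// sketch: count shared trips per customer pair directly with aggregate,
// without materializing res2's map of full row lists
// (assumes customerId is a Long; adjust the key type to match intermediateSchema)
val pairCounts = res.aggregate(Map.empty[(Long, Long), Int])(
  // seqop: fold one row into a per-partition count map
  (acc, row) => {
    val key = (row.customer1Id, row.customer2Id)
    acc.updated(key, acc.getOrElse(key, 0) + 1)
  },
  // combop: merge two per-partition count maps
  (m1, m2) => m2.foldLeft(m1) { case (acc, (key, count)) =>
    acc.updated(key, acc.getOrElse(key, 0) + count)
  }
)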