More efficient groupBy in Scala, avoiding GC overhead limit exceeded


This question is unrelated to Spark, and I'm working on Scala 2.12, so I don't have access to groupMap or groupMapReduce.
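For reference, what I mean is the 2.13 one-liner that counts occurrences per key; roughly this (toy data, not my real schema):

    // Scala 2.13+ only, toy data: count how many times each key appears.
    val pairs = List(("a", "b"), ("a", "b"), ("c", "d"))
    val counts = pairs.groupMapReduce(identity)(_ => 1L)(_ + _)
    // counts: Map[(String, String), Long] = Map(("a","b") -> 2, ("c","d") -> 1)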

I currently have code like the following

    val customerTrips = customerData.toList.par.groupBy(c => c.tripId)  // map from tripId to the customers on that trip

    val res = customerData
      .toList.par
      .flatMap { c1 =>
        customerTrips(c1.tripId)
          .filter(c2 => c1.customerId < c2.customerId)  // keep each pair only once
          .map(c2 => intermediateSchema(customer1Id = c1.customerId, customer2Id = c2.customerId, date = c1.date))
      }  // create a list of two customers on the same trip
    println(res.size)  // size is about 5,000,000

    val res2 = res.groupBy(row => (row.customer1Id, row.customer2Id))  // GC overhead limit exceeded here
    println(res2.size) 

    val res3 = res2.mapValues(_.size)  // trying to get the number of trips shared between customers

I tried to annotate above where I'm getting the error, at the groupBy. I'm pretty new to Scala and don't know if there is a preferred/better way of doing this. I'd like to keep the aggregation parallel, so I don't think something like foldLeft is what I'm looking for. If the only option is to increase the heap size (I have it set to the default 1.33 GB) or change the garbage collector, let me know.
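For what it's worth, the kind of alternative I'm imagining is folding the pairs straight into a count map instead of materializing the groups, while still using the parallel collection. A rough sketch of what I mean (assuming the customer ids are Longs, which may not match my real schema):

    // Rough sketch only: count pairs directly with aggregate, skipping the big groupBy.
    // Assumes customer ids are Long; adjust the key type to the real schema.
    val pairCounts: Map[(Long, Long), Long] =
      res.aggregate(Map.empty[(Long, Long), Long])(
        (acc, row) => {
          val key = (row.customer1Id, row.customer2Id)
          acc.updated(key, acc.getOrElse(key, 0L) + 1L)
        },
        (m1, m2) => (m1.keySet ++ m2.keySet).iterator
          .map(k => k -> (m1.getOrElse(k, 0L) + m2.getOrElse(k, 0L)))
          .toMap
      )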
