More efficient groupBy in Scala, avoiding GC overhead limit exceeded


This question is unrelated to Spark, and I'm working on Scala 2.12, so I don't have access to groupMap or groupMapReduce.
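For reference, what I mean is the 2.13 one-liner that counts occurrences per key; roughly this (toy data, not my real schema):

    // Scala 2.13+ only, toy data: count how many times each key appears.
    val pairs = List(("a", "b"), ("a", "b"), ("c", "d"))
    val counts = pairs.groupMapReduce(identity)(_ => 1L)(_ + _)
    // counts: Map[(String, String), Long] = Map(("a","b") -> 2, ("c","d") -> 1)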

I currently have code like the following

    val customerTrips = customerData.toList.par.groupBy(c => c.tripId)  // map from tripId to the customers on that trip

    val res = customerData
      .toList.par
      .flatMap { c1 =>
        customerTrips(c1.tripId)
          .filter(c2 => c1.customerId < c2.customerId)  // keep each pair only once
          .map(c2 => intermediateSchema(customer1Id = c1.customerId, customer2Id = c2.customerId, date = c1.date))
      }  // create a list of two customers on the same trip
    println(res.size)  // size is about 5,000,000

    val res2 = res.groupBy(row => (row.customer1Id, row.customer2Id))  // GC overhead limit exceeded here
    println(res2.size) 

    val res3 = res2.mapValues(_.size)  // trying to get the number of trips shared between customers

I tried to annotate above where I'm getting the error, at the groupBy. I'm pretty new to Scala and don't know if there is a preferred/better way of doing this. I'd like to keep the aggregation parallel, so I don't think something like foldLeft is what I'm looking for. If the only option is to increase the heap size (I have it set to the default 1.33 GB) or change the garbage collector, let me know.
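For what it's worth, the kind of alternative I'm imagining is folding the pairs straight into a count map instead of materializing the groups, while still using the parallel collection. A rough sketch of what I mean (assuming the customer ids are Longs, which may not match my real schema):

    // Rough sketch only: count pairs directly with aggregate, skipping the big groupBy.
    // Assumes customer ids are Long; adjust the key type to the real schema.
    val pairCounts: Map[(Long, Long), Long] =
      res.aggregate(Map.empty[(Long, Long), Long])(
        (acc, row) => {
          val key = (row.customer1Id, row.customer2Id)
          acc.updated(key, acc.getOrElse(key, 0L) + 1L)
        },
        (m1, m2) => (m1.keySet ++ m2.keySet).iterator
          .map(k => k -> (m1.getOrElse(k, 0L) + m2.getOrElse(k, 0L)))
          .toMap
      )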
