Spark NullPointerException: Cannot invoke invalidateSerializedMapOutputStatusCache() because "shuffleStatus" is null

I'm running a simple little Spark 3.3.0 pipeline on Windows 10 using Java 17 and UDFs. The pipeline hardly does anything interesting, yet when I run it on only 30,000 records I get this:

[ERROR] Error in removing shuffle 2
java.lang.NullPointerException: Cannot invoke "org.apache.spark.ShuffleStatus.invalidateSerializedMapOutputStatusCache()" because "shuffleStatus" is null
        at org.apache.spark.MapOutputTrackerMaster.$anonfun$unregisterShuffle$1(MapOutputTracker.scala:882)
        at org.apache.spark.MapOutputTrackerMaster.$anonfun$unregisterShuffle$1$adapted(MapOutputTracker.scala:881)
        at scala.Option.foreach(Option.scala:437)
        at org.apache.spark.MapOutputTrackerMaster.unregisterShuffle(MapOutputTracker.scala:881)
        at org.apache.spark.storage.BlockManagerStorageEndpoint$$anonfun$receiveAndReply$1.$anonfun$applyOrElse$3(BlockManagerStorageEndpoint.scala:59)
        at scala.runtime.java8.JFunction0$mcZ$sp.apply(JFunction0$mcZ$sp.scala:17)
        at org.apache.spark.storage.BlockManagerStorageEndpoint.$anonfun$doAsync$1(BlockManagerStorageEndpoint.scala:89)
        at scala.concurrent.Future$.$anonfun$apply$1(Future.scala:678)
        at scala.concurrent.impl.Promise$Transformation.run(Promise.scala:467)
        at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136)
        at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635)
        at java.base/java.lang.Thread.run(Thread.java:833)

I searched for the principal terms in the error message and couldn't find anything.

It's disconcerting that Spark is breaking at what seems to be a fundamental part of processing, and with a NullPointerException at that.

I filed ticket SPARK-40582.

There are 2 answers below.

BEST ANSWER

I filed SPARK-40582, and they told me that this is a known Scala 2.13.8 issue (#12613). They are adding a fix in SPARK-39553, scheduled for release in v3.3.1.
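
For anyone who hits this before 3.3.1: the failure mode visible in the stack trace is an Option that wraps null. Option.foreach runs its closure even for Some(null), so the method call inside throws. A minimal Scala demonstration of the pattern (not Spark's actual code, just the shape of the bug):

    // Minimal demonstration of the pattern in the stack trace: an Option
    // wrapping null. foreach still runs the closure, and the method call on
    // the null element throws, just as with shuffleStatus above.
    val wrapped: Option[String] = Some(null)
    wrapped.foreach(s => println(s.length)) // NullPointerException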

ANSWER

OK, I don't know Spark, but I referred to the two pages below.

https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/MapOutputTracker.scala (the unregisterShuffle method)

https://www.hadoopinrealworld.com/how-does-shuffle-sort-merge-join-work-in-spark/

Spark shuffles the data, and during that shuffle an entry that was expected to be there for a given shuffle ID is not found.
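
Matching the stack trace against that file, the failing pattern is roughly this (a paraphrase reconstructed from the trace, not a verbatim quote of the Spark source):

    // Reconstructed from the stack trace, not a verbatim quote:
    // shuffleStatuses maps shuffleId -> ShuffleStatus. If the lookup yields
    // Some(null) instead of None, the closure still runs and the call below
    // throws the NullPointerException.
    def unregisterShuffle(shuffleId: Int): Unit = {
      shuffleStatuses.remove(shuffleId).foreach { shuffleStatus =>
        shuffleStatus.invalidateSerializedMapOutputStatusCache()
      }
    }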

Option 1: Try increasing the memory allocated to your application to see if that solves this (see the configuration sketch after this list).

Option 2: Unit test with various use cases to see if you can find the use case in which the problem occurs.

Option 3: Try an earlier version of Spark.
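
For option 1, a sketch of the usual memory knobs (values are illustrative, and note that spark.driver.memory generally has to be set before the driver JVM starts, e.g. via spark-submit --driver-memory 4g or spark-defaults.conf, rather than from code):

    import org.apache.spark.sql.SparkSession

    // Illustrative values only; tune to your workload.
    val spark = SparkSession.builder()
      .appName("MyPipeline") // hypothetical application name
      .master("local[*]")
      // Executor memory; in local mode everything runs in the driver JVM,
      // so there prefer spark-submit --driver-memory instead.
      .config("spark.executor.memory", "4g")
      .getOrCreate()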

Updated reasons for my suggestions, as per the OP's comment: I agree with your viewpoint, and my answer may read like generic "restart your computer" advice. Here are the specific reasons behind the three suggestions above.

  1. I looked up the Spark source code (linked above) where the NullPointerException occurs. There is a ConcurrentHashMap, shuffleStatuses, in which a value for the shuffleId is expected but not found. Your question (since edited) mentioned that you were getting the error non-deterministically, i.e. for the same data you got the error once but not the next time. That points to a non-code cause, such as memory availability; hence suggestion #1.

  2. A general approach to such non-deterministic errors is to unit test your code with different possible use cases; that helps you pinpoint the case in which the error appears (see the harness sketch after this list). Hence suggestion #2.

  3. A runtime exception like NullPointerException occurring inside library code indicates that some code path is unhandled and the library is buggy. Newer versions of libraries can carry such errors, so it is prudent to report them to the community on GitHub, and to use an earlier, more stable version to unblock ourselves in the meantime. Hence suggestion #3. (I think you had not yet reported the bug when I wrote my original answer.)
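
For suggestion #2, a sketch of the kind of harness I mean: re-run the same stage over the same input several times, since a failure that appears only on some attempts points at a race or resource issue rather than the data. runPipeline here is a hypothetical stand-in for your actual job:

    // Sketch only: runPipeline stands in for your actual Spark job.
    def runPipeline(records: Seq[String]): Unit = {
      // ... your transformations and UDFs here ...
    }

    val sample: Seq[String] = Seq.fill(30000)("record")

    // A failure that shows up on some attempts but not others suggests a
    // race or resource problem rather than bad input.
    for (attempt <- 1 to 10) {
      try { runPipeline(sample); println(s"attempt $attempt: ok") }
      catch { case e: Exception => println(s"attempt $attempt: $e") }
    }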