I am processing a file via a Google Data Fusion pipeline, but as the pipeline runs I am getting the warnings and errors below:
09/25/2020 12:31:31 WARN org.apache.spark.storage.memory.MemoryStore#66-Executor task launch worker for task 6 Not enough space to cache rdd_10_6 in memory! (computed 238.5 MB so far)
09/25/2020 12:45:05 ERROR org.apache.spark.scheduler.cluster.YarnClusterScheduler#70-dispatcher-event-loop-1
Lost executor 2 on cdap-soco-crea-99b67b97-fefb-11ea-8ee6-daceb18eb3cf-w-0.c.datalake-dev-rotw-36b8.internal: Container marked as failed: container_1601016787667_0001_01_000003 on host: cdap-soco-crea-99b67b97-fefb-11ea-8ee6-daceb18eb3cf-w-0.c.datalake-dev-rotw-36b8.internal. Exit status: 3. Diagnostics: [2020-09-25 07:15:05.226]Exception from container-launch. Container id: container_1601016787667_0001_01_000003 Exit code: 3
Help, please!
Sudhir, navigate to Data Fusion > SYSTEM ADMIN > Configuration > System Compute Profiles, then increase the memory of your Dataproc compute profile.
By default, a Data Fusion ENTERPRISE instance allocates 8192 MB of memory per worker. You can start by doubling that amount and keep increasing it until the pipeline runs successfully.
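For context, the compute-profile memory setting ultimately maps onto standard Spark memory properties. Data Fusion generates the Spark job for you, so you normally change this through the profile, but if you were launching an equivalent job yourself, the knobs would look roughly like this (a sketch only; the app name and values are illustrative, not a recommendation):

```python
from pyspark.sql import SparkSession

# Illustrative values: the same per-executor memory the Dataproc compute
# profile controls, expressed as standard Spark properties.
spark = (
    SparkSession.builder
    .appName("memory-tuning-sketch")                 # hypothetical app name
    .config("spark.executor.memory", "16g")          # per-executor JVM heap
    .config("spark.executor.memoryOverhead", "2g")   # off-heap headroom; YARN kills containers that exceed heap + overhead
    .getOrCreate()
)
```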
Note that Spark executes transformations on RDDs in memory. As the error message [1] shows, one of your workers failed to cache an RDD in memory due to OOM conditions.
Spark needs to cache RDDs in memory before it can deliver its in-memory processing performance, so undersized executors will hit exactly this warning.
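If you ever hit this outside Data Fusion, in Spark code you control, a common fallback is to cache with a storage level that spills to local disk rather than silently skipping partitions that don't fit. A minimal PySpark sketch (the RDD here is a stand-in for your pipeline's data):

```python
from pyspark import StorageLevel
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("cache-sketch").getOrCreate()

# Stand-in for the RDD your pipeline produces.
rdd = spark.sparkContext.parallelize(range(1_000_000))

# cache() defaults to MEMORY_ONLY: partitions that don't fit are not cached,
# which is what produces the "Not enough space to cache rdd_10_6 in memory"
# warning. MEMORY_AND_DISK spills those partitions to local disk instead.
rdd.persist(StorageLevel.MEMORY_AND_DISK)
print(rdd.count())
```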
Hope this helps!
[1] worker for task 6 Not enough space to cache rdd_10_6 in memory