Spark standalone cluster tuning


We have a Spark 2.1.0 standalone cluster running on a single node with 8 cores and 50 GB of memory (single worker).

We run Spark applications in cluster mode with the following memory settings -

--driver-memory = 7GB (default - 1 core is used)
--worker-memory = 43GB (all remaining cores - 7 cores)
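
For context, a rough sketch of how such a job might be submitted (the master URL, class, and jar names are placeholders; the flag spark-submit itself accepts for executor memory is --executor-memory):

spark-submit \
  --master spark://<master-host>:7077 \
  --deploy-mode cluster \
  --driver-memory 7g \
  --executor-memory 43g \
  --total-executor-cores 7 \
  --class com.example.MyApp \
  /path/to/my-app.jar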

Recently, we observed executors getting killed and restarted by the driver/master frequently. I found the logs below on the driver -

17/12/14 03:29:39 WARN HeartbeatReceiver: Removing executor 2 with no recent heartbeats: 3658237 ms exceeds timeout 3600000 ms  
17/12/14 03:29:39 ERROR TaskSchedulerImpl: Lost executor 2 on 10.150.143.81: Executor heartbeat timed out after 3658237 ms  
17/12/14 03:29:39 WARN TaskSetManager: Lost task 23.0 in stage 316.0 (TID 9449, 10.150.143.81, executor 2): ExecutorLostFailure (executor 2 exited caused by one of the running tasks) Reason: Executor heartbeat timed out after 3658237 ms  
17/12/14 03:29:39 WARN TaskSetManager: Lost task 9.0 in stage 318.0 (TID 9459, 10.150.143.81, executor 2): ExecutorLostFailure (executor 2 exited caused by one of the running tasks) Reason: Executor heartbeat timed out after 3658237 ms  
17/12/14 03:29:39 WARN TaskSetManager: Lost task 8.0 in stage 318.0 (TID 9458, 10.150.143.81, executor 2): ExecutorLostFailure (executor 2 exited caused by one of the running tasks) Reason: Executor heartbeat timed out after 3658237 ms  
17/12/14 03:29:39 WARN TaskSetManager: Lost task 5.0 in stage 318.0 (TID 9455, 10.150.143.81, executor 2): ExecutorLostFailure (executor 2 exited caused by one of the running tasks) Reason: Executor heartbeat timed out after 3658237 ms  
17/12/14 03:29:39 WARN TaskSetManager: Lost task 7.0 in stage 318.0 (TID 9457, 10.150.143.81, executor 2): ExecutorLostFailure (executor 2 exited caused by one of the running tasks) Reason: Executor heartbeat timed out after 3658237 ms

The application is not especially memory intensive; there are a couple of joins and a dataset write to a directory. The same code runs on spark-shell without any failure.

I am looking for cluster tuning or any configuration settings which will reduce the executors getting killed.

3 Answers

Answer 1 (score 2)

There might be a memory issue with the executor, so you should configure cores and executor memory in the spark-env.sh file, which can be found at ~/spark/conf/spark-env.sh. Since your total memory is 50 GB:

export SPARK_WORKER_CORES=8
export SPARK_WORKER_INSTANCES=5
export SPARK_WORKER_MEMORY=8G
export SPARK_EXECUTOR_INSTANCES=2

And if your data is not too large to process, you can set driver memory in spark-defaults.conf. Also give some overhead memory to the executor in the same file, ~/spark/conf/spark-defaults.conf, as:

spark.executor.memoryOverhead 1G
spark.driver.memory  1G
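
Note that spark-env.sh is only read when the master and worker daemons start, so changes there require restarting the daemons (e.g. sbin/stop-all.sh followed by sbin/start-all.sh), while spark-defaults.conf is read by each application at submit time. As a sketch, the same properties could also be passed per application with --conf:

spark-submit --conf spark.executor.memoryOverhead=1g --conf spark.driver.memory=1g ...
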
Answer 2 (score 0)

With spark-shell, the driver is also the executor. It looks like the driver killed the executor because it received no heartbeats for 1 hour. Heartbeats are typically configured at 10s.

  1. Have you modified the default heartbeat settings? (The relevant properties are sketched at the end of this answer.)
  2. Check for GC on the executor. Long GC pauses are a frequent cause of missed heartbeats. If so, improve the memory per core in your executor; this typically means increasing memory or decreasing cores.
  3. Is there anything in your network which could cause heartbeats to drop?

The logs clearly show that the driver killed the executor because it received no heartbeat for 1 hour, and also that the executor was running tasks when it was killed.
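
The 3600000 ms figure in your log suggests that the timeout the heartbeat receiver uses (spark.network.timeout, 120s by default, or a related timeout property) has been raised to 1 hour. Here is a sketch of the properties worth checking in spark-defaults.conf; the values shown are the usual defaults, and the GC flags assume a Java 8 executor JVM:

# how often each executor sends a heartbeat to the driver (default 10s)
spark.executor.heartbeatInterval  10s
# the timeout after which the driver considers an executor lost (default 120s)
spark.network.timeout             120s
# enable GC logging on executors to see whether long pauses line up with missed heartbeats
spark.executor.extraJavaOptions   -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps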

Answer 3 (score 0)

Firstly, I would advise never allocating a total of 50 GB of RAM to any application if your instance has exactly 50 GB of RAM. The rest of the system needs some RAM to work too, and RAM not used by applications is used by the system to cache files and reduce the amount of disk reads. The JVM itself also has a small memory overhead on top of the configured heap.

If your Spark job uses all the memory, then your instance will inevitably swap, and once it swaps it will start to misbehave. You can easily check your memory usage and see whether your server is swapping by running htop. You should also make sure that the swappiness is reduced to 0, so that the system doesn't swap unless it really has to.
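
For example, on a typical Linux box (a sketch; adjust for your distribution):

free -h                                                 # memory and swap usage at a glance
sudo sysctl -w vm.swappiness=0                          # takes effect immediately, lasts until reboot
echo 'vm.swappiness=0' | sudo tee -a /etc/sysctl.conf   # persist the setting across reboots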

That's all I can say given the information you provided. If this does not help, you should consider providing more details, such as the complete, exact parameters of your Spark job.