"Socket RecvAll Error:Connection reset by peer" in XGBoost


We are trying to use XGBoost-Spark for our project, and we are facing issues when training the model on large data; the same code works well for small data. The training stage runs for around 2 hours, and all tasks complete at almost the same time. After around 1200 tasks finish, all remaining executors begin to fail, and we see the same error from all of them. Note: we are data engineers who are new to machine learning, trying to create a production version of a prototype built by data scientists; our exposure to machine-learning concepts is very limited.

Jars used: xgboost4j-spark-0.72-criteo-20180518_2.11.jar and xgboost4j-0.72-criteo-20180518_2.10-linux.jar

Error from the logs of one of the executors:

Container id: container_e109_1529510504264_41133_01_000223
Exit code: 255
Shell output: main : command provided 1
main : run as user is svccaddv
main : requested yarn user is svccaddv
Getting exit code file...
Creating script paths...
Writing pid file...
Writing to tmp file /u/applic/data/hdfs1/hadoop/yarn/local/nmPrivate/application_1529510504264_41133/container_e109_1529510504264_41133_01_000223/container_e109_1529510504264_41133_01_000223.pid.tmp
Writing to cgroup task files...
Creating local dirs...
Launching container...
Getting exit code file...
Creating script paths...


Container exited with a non-zero exit code 255. Last 4096 bytes of stderr 
[23:49:45] /xgboost-jars/xgboost/src/tree/updater_prune.cc:74: tree pruning end, 1 roots, 126 extra nodes, 0 pruned nodes, max_depth=6
[23:49:47] /xgboost-jars/xgboost/src/tree/updater_prune.cc:74: tree pruning end, 1 roots, 126 extra nodes, 0 pruned nodes, max_depth=6
[23:49:47] /xgboost-jars/xgboost/src/tree/updater_prune.cc:74: tree pruning end, 1 roots, 126 extra nodes, 0 pruned nodes, max_depth=6
[23:49:47] /xgboost-jars/xgboost/src/tree/updater_prune.cc:74: tree pruning end, 1 roots, 126 extra nodes, 0 pruned nodes, max_depth=6
[23:49:47] /xgboost-jars/xgboost/src/tree/updater_prune.cc:74: tree pruning end, 1 roots, 126 extra nodes, 0 pruned nodes, max_depth=6
[23:49:50] /xgboost-jars/xgboost/src/tree/updater_prune.cc:74: tree pruning end, 1 roots, 126 extra nodes, 0 pruned nodes, max_depth=6
[23:49:50] /xgboost-jars/xgboost/src/tree/updater_prune.cc:74: tree pruning end, 1 roots, 126 extra nodes, 0 pruned nodes, max_depth=6
[23:49:50] /xgboost-jars/xgboost/src/tree/updater_prune.cc:74: tree pruning end, 1 roots, 126 extra nodes, 0 pruned nodes, max_depth=6
[23:49:50] /xgboost-jars/xgboost/src/tree/updater_prune.cc:74: tree pruning end, 1 roots, 126 extra nodes, 0 pruned nodes, max_depth=6
[23:49:52] /xgboost-jars/xgboost/src/tree/updater_prune.cc:74: tree pruning end, 1 roots, 126 extra nodes, 0 pruned nodes, max_depth=6
[23:49:52] /xgboost-jars/xgboost/src/tree/updater_prune.cc:74: tree pruning end, 1 roots, 126 extra nodes, 0 pruned nodes, max_depth=6
[23:49:52] /xgboost-jars/xgboost/src/tree/updater_prune.cc:74: tree pruning end, 1 roots, 126 extra nodes, 0 pruned nodes, max_depth=6
[23:49:52] /xgboost-jars/xgboost/src/tree/updater_prune.cc:74: tree pruning end, 1 roots, 126 extra nodes, 0 pruned nodes, max_depth=6
[23:49:55] /xgboost-jars/xgboost/src/tree/updater_prune.cc:74: tree pruning end, 1 roots, 126 extra nodes, 0 pruned nodes, max_depth=6
[23:49:55] /xgboost-jars/xgboost/src/tree/updater_prune.cc:74: tree pruning end, 1 roots, 126 extra nodes, 0 pruned nodes, max_depth=6
[23:49:55] /xgboost-jars/xgboost/src/tree/updater_prune.cc:74: tree pruning end, 1 roots, 126 extra nodes, 0 pruned nodes, max_depth=6
[23:49:55] /xgboost-jars/xgboost/src/tree/updater_prune.cc:74: tree pruning end, 1 roots, 126 extra nodes, 0 pruned nodes, max_depth=6
[23:49:57] /xgboost-jars/xgboost/src/tree/updater_prune.cc:74: tree pruning end, 1 roots, 126 extra nodes, 0 pruned nodes, max_depth=6
[23:49:57] /xgboost-jars/xgboost/src/tree/updater_prune.cc:74: tree pruning end, 1 roots, 126 extra nodes, 0 pruned nodes, max_depth=6
[23:49:57] /xgboost-jars/xgboost/src/tree/updater_prune.cc:74: tree pruning end, 1 roots, 126 extra nodes, 0 pruned nodes, max_depth=6
[23:49:57] /xgboost-jars/xgboost/src/tree/updater_prune.cc:74: tree pruning end, 1 roots, 126 extra nodes, 0 pruned nodes, max_depth=6
[23:49:59] /xgboost-jars/xgboost/src/tree/updater_prune.cc:74: tree pruning end, 1 roots, 126 extra nodes, 0 pruned nodes, max_depth=6
[23:49:59] /xgboost-jars/xgboost/src/tree/updater_prune.cc:74: tree pruning end, 1 roots, 126 extra nodes, 0 pruned nodes, max_depth=6
[23:49:59] /xgboost-jars/xgboost/src/tree/updater_prune.cc:74: tree pruning end, 1 roots, 126 extra nodes, 0 pruned nodes, max_depth=6
[23:49:59] /xgboost-jars/xgboost/src/tree/updater_prune.cc:74: tree pruning end, 1 roots, 126 extra nodes, 0 pruned nodes, max_depth=6
[23:50:02] /xgboost-jars/xgboost/src/tree/updater_prune.cc:74: tree pruning end, 1 roots, 126 extra nodes, 0 pruned nodes, max_depth=6
[23:50:02] /xgboost-jars/xgboost/src/tree/updater_prune.cc:74: tree pruning end, 1 roots, 126 extra nodes, 0 pruned nodes, max_depth=6
[23:50:02] /xgboost-jars/xgboost/src/tree/updater_prune.cc:74: tree pruning end, 1 roots, 126 extra nodes, 0 pruned nodes, max_depth=6
[23:50:02] /xgboost-jars/xgboost/src/tree/updater_prune.cc:74: tree pruning end, 1 roots, 126 extra nodes, 0 pruned nodes, max_depth=6
Socket RecvAll Error:Connection reset by peer
Socket RecvAll Error:Connection reset by peer

The code snippet we are using:

// Write the training data out as LibSVM and read it back as a DataFrame
MLUtils.saveAsLibSVMFile(newtrainingData.rdd, inputTrainPath)
val trainSess = spark.sqlContext.read.format("libsvm")
  .option("numFeatures", "10")
  .load(inputTrainPath)

val paramMap = List(
  "eta" -> 0.003,
  "max_depth" -> 6,
  "subsample" -> 0.8,
  "colsample_bytree" -> 0.8,
  "silent" -> 0,
  "numEarlyStoppingRounds" -> 100,
  "objective" -> "reg:linear").toMap

val numRound = 1500
// One XGBoost worker per input partition
val xgboostModel = XGBoost.trainWithDataFrame(
  trainSess, paramMap, numRound,
  nWorkers = trainSess.rdd.getNumPartitions,
  useExternalMemory = false)

Size of table: ~21 GB (stored as ORC with Snappy compression)
Size of SVM files: ~160 GB
Size of input in the Spark training stage: ~460 GB
Tasks spawned during the training stage: 4044
Executors: ~515 (we use dynamic allocation)
Executor cores: 4
Executor memory: 4 GB
Executor memory overhead: 1200 MB
Driver memory: 10 GB
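For reference, the resource settings above roughly correspond to a spark-submit invocation along these lines. This is a sketch: the class name and application jar are placeholders we made up, and only the resource flags reflect the settings listed above.

```shell
# Sketch of a submit command matching the resources above.
# MainClass and app.jar are hypothetical placeholders.
spark-submit \
  --master yarn \
  --deploy-mode cluster \
  --conf spark.dynamicAllocation.enabled=true \
  --executor-cores 4 \
  --executor-memory 4g \
  --conf spark.yarn.executor.memoryOverhead=1200 \
  --driver-memory 10g \
  --jars xgboost4j-spark-0.72-criteo-20180518_2.11.jar,xgboost4j-0.72-criteo-20180518_2.10-linux.jar \
  --class MainClass app.jar
```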

1 Answer


We found a workaround: we reduced the number of partitions (and therefore the number of tasks) using coalesce(). Previously we had used repartition() to reduce the partition count, but we still got the error. Even with coalesce(), the jobs fail if the number of partitions is more than about 1000. For some medium-sized data sets the job runs fine with 1200 or 1500 partitions, but we stuck to 1000 partitions and the jobs run fine. We had originally increased the partition count to 3000 or 4000 to increase parallelism and thereby improve performance, but with 1000 partitions the performance is not bad.
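As a sketch, the workaround amounts to coalescing the training DataFrame before handing it to XGBoost, and sizing nWorkers to match the reduced partition count. This assumes the trainSess, paramMap, and numRound values from the question; maxPartitions is a name we introduce here for illustration, and the cap of 1000 is specific to our cluster.

```scala
// Sketch of the workaround (maxPartitions is our own name, not any API's):
// cap the partition count before training so XGBoost spawns fewer workers.
val maxPartitions = 1000  // the largest count that ran reliably for us

// coalesce() merges partitions without a full shuffle, unlike repartition()
val coalesced =
  trainSess.coalesce(math.min(maxPartitions, trainSess.rdd.getNumPartitions))

val xgboostModel = XGBoost.trainWithDataFrame(
  coalesced, paramMap, numRound,
  nWorkers = coalesced.rdd.getNumPartitions,  // one worker per partition
  useExternalMemory = false)
```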

For people looking for another workaround, please refer to the suggestion given by the XGBoost team: https://github.com/dmlc/xgboost/issues/3462 (we did not try this).