How to do YARN role graceful shutdown on a Cloudera Manager datanode CDH 6.3.2


I cannot find an answer to this question.

How do I gracefully stop the YARN role on a data node and wait until all jobs running on that node finish successfully?

I know that in Cloudera Manager you can decommission the YARN role before stopping it. But when I decommission the YARN role, the running jobs fail with a killed or crashed status.

Is this a safe way to stop the YARN role on a data node?

Is this a graceful YARN role shutdown, or is there another way to do it? All jobs end up with a killed status after the YARN role decommission.


user2784340 (BEST ANSWER)

YARN graceful decommission will wait for running jobs to complete. You can set a timeout value so that YARN forcibly completes the decommission after x seconds. If no jobs are running on the node, YARN decommissions it immediately without waiting for the timeout to expire.

CM -> Clusters -> YARN -> Configuration -> search for yarn.resourcemanager.nodemanager-graceful-decommission-timeout-secs. Set the value, save the configuration, and restart to deploy the configs (a yarn-site.xml sketch of this property follows the steps below). To decommission one or more specific hosts:

CM -> Clusters -> YARN -> Instances (select the hosts that you want to decommission)

Click Actions for Selected Hosts -> Decommission. If you want to decommission all the roles on a host, follow this doc: https://docs.cloudera.com/documentation/enterprise/6/6.3/topics/cm_mc_host_maint.html#decomm_host
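
For reference, the timeout set above maps to a single property in yarn-site.xml. A minimal sketch, assuming an hour-long grace period (on a CM-managed cluster you would normally set this through the Configuration page or a yarn-site.xml safety valve rather than editing the file by hand):

<property>
  <!-- maximum time the ResourceManager waits for running applications before it forces the decommission -->
  <name>yarn.resourcemanager.nodemanager-graceful-decommission-timeout-secs</name>
  <!-- example value; 3600 seconds is also the upstream default -->
  <value>3600</value>
</property>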

Matt Andruff

This is documented (somewhat poorly) on the Apache website for Hadoop 3.3:

Create an XML file with NodeManagers you wish to decommission:

<?xml version="1.0"?>
<hosts>
  <host><name>host1</name></host> <!-- normal 'kill' --> 
  <host><name>host2</name><timeout>123</timeout></host> <!-- allows jobs 123 seconds to finish --> 
  <host><name>host3</name><timeout>-1</timeout></host> <!-- allows jobs unlimited time to finish -->
</hosts>

Update your config (yarn-site.xml) to point to this file (no restart required):

yarn.resourcemanager.nodes.exclude-path=[path/to/exclude/file]
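
As a sketch, the corresponding yarn-site.xml entry might look like the following; the path is only a placeholder, point it at the hosts XML you created above:

<property>
  <name>yarn.resourcemanager.nodes.exclude-path</name>
  <!-- placeholder path; use wherever you saved the decommission hosts file -->
  <value>/etc/hadoop/conf/yarn-decommission-hosts.xml</value>
</property>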

Run the update (this initiates the decommission):

yarn rmadmin -refreshNodes 
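
To check that the nodes are actually draining rather than being killed, you can list node states from the ResourceManager; gracefully decommissioning nodes should show DECOMMISSIONING before ending up DECOMMISSIONED (a quick sketch):

# list every node the ResourceManager knows about, including decommissioned ones
yarn node -list -all

# or filter to only the states you care about
yarn node -list -states DECOMMISSIONING,DECOMMISSIONED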

Alternatively, you could set a graceful timeout for all nodes:

yarn.resourcemanager.nodemanager-graceful-decommission-timeout-secs

Alternatively, you can manually set a graceful timeout when refreshing nodes:

yarn rmadmin -refreshNodes -g [timeout in seconds] -client
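
For example, a concrete invocation with an hour-long grace period (the 3600 seconds is just an illustration; -client tracks the timeout on the client side and blocks, while -server hands tracking to the ResourceManager so the command returns immediately):

# gracefully decommission the excluded nodes, waiting up to 3600 seconds for running jobs
yarn rmadmin -refreshNodes -g 3600 -client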