Failover is not triggered when the active NameNode crashes


I am using Apache Hadoop 2.7.1 on a cluster that consists of three nodes:

nn1 (master NameNode)

nn2 (second NameNode)

dn1 (DataNode)

I have configured high availability with a nameservice, and ZooKeeper is running on all three nodes, with the instance on nn2 acting as leader.
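For context, a minimal automatic-failover configuration of this kind typically combines entries like these in hdfs-site.xml and core-site.xml (the nameservice name mycluster and port 2181 here are placeholders for illustration, not my actual values):

<!-- hdfs-site.xml: define the nameservice and its two NameNodes -->
<property>
  <name>dfs.nameservices</name>
  <value>mycluster</value>
</property>
<property>
  <name>dfs.ha.namenodes.mycluster</name>
  <value>nn1,nn2</value>
</property>
<!-- hdfs-site.xml: let the ZKFC perform failover automatically -->
<property>
  <name>dfs.ha.automatic-failover.enabled</name>
  <value>true</value>
</property>
<!-- core-site.xml: ZooKeeper quorum running on all three nodes -->
<property>
  <name>ha.zookeeper.quorum</name>
  <value>nn1:2181,nn2:2181,dn1:2181</value>
</property>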

First of all, I have to mention that nn1 is active and nn2 is standby.

When I kill the NameNode process on nn1, nn2 becomes active, so automatic failover works.

But in the following scenario (which I apply when nn1 is active and nn2 is standby):

when I turn off nn1 entirely (the whole machine crashes),

nn2 stays standby and doesn't become active, so automatic failover does not happen,

with a noticeable error in the log:

Unable to trigger a roll of the active NN (which was nn1 and is of course now down)

Shouldn't automatic failover happen with the two existing journal nodes on nn2 and dn1?

And what could be the possible reasons?


There are 2 best solutions below

BEST ANSWER

My problem was solved by altering dfs.ha.fencing.methods in hdfs-site.xml to include not only SSH fencing but also a shell fencing method that always returns true:

<property>
  <name>dfs.ha.fencing.methods</name>
  <value>sshfence
         shell(/bin/true)</value>
</property>

Automatic failover will fail if fencing fails. I specified two methods; the second one, shell(/bin/true), always returns success. This works around the case where the primary NameNode machine goes down completely: the host is unreachable, so the sshfence method fails, and no failover is performed. We want to avoid that, so the second method allows the failover to proceed anyway.
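Note that for the sshfence method itself to work, the ZKFC also needs the SSH private key it should use to reach the other NameNode's machine; a typical entry looks like this (the key path is only an example, adjust it to your environment):

<property>
  <name>dfs.ha.fencing.ssh.private-key-files</name>
  <value>/home/hadoop/.ssh/id_rsa</value>
</property>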

You can find details here: https://www.packtpub.com/books/content/setting-namenode-ha

ANOTHER ANSWER

This appears to be due to a bug in the sshfence fencing method, identified as HADOOP-15684, fixed in 3.0.4, 3.1.2, and 3.2.0 as well as backported to 2.10.0 via HDFS-14397.