# HDFS HA fails to fail over after a NameNode failure in a two-NameNode cluster

2018-09-20 20:00:36

## Background

Two NameNodes:

- namenode1, active
- namenode2, standby

With both `DFSZKFailoverController` and the `zookeeper` service running normally, the namenode1 process was killed with `kill -9`. namenode2 did not automatically switch to the active state, which left HDFS readable but not writable.

## Analysis

I went through the zkfc logs on both NameNode nodes and found anomalies in the zkfc log on namenode2:

```
2032-09-21 14:25:42,258 INFO org.apache.hadoop.ha.SshFenceByTcpPort.jsch: SSH_MSG_SERVICE_ACCEPT received
2032-09-21 14:25:42,258 INFO org.apache.hadoop.ha.SshFenceByTcpPort.jsch: Authentications that can continue: gssapi-with-mic,publickey,keyboard-interactive,password
2032-09-21 14:25:42,258 INFO org.apache.hadoop.ha.SshFenceByTcpPort.jsch: Next authentication method: gssapi-with-mic
2032-09-21 14:25:42,262 INFO org.apache.hadoop.ha.SshFenceByTcpPort.jsch: Authentications that can continue: publickey,keyboard-interactive,password
2032-09-21 14:25:42,262 INFO org.apache.hadoop.ha.SshFenceByTcpPort.jsch: Next authentication method: publickey
2032-09-21 14:25:42,322 INFO org.apache.hadoop.ha.SshFenceByTcpPort.jsch: Authentication succeeded (publickey).
2032-09-21 14:25:42,322 INFO org.apache.hadoop.ha.SshFenceByTcpPort: Connected to bdp-namenode2
2032-09-21 14:25:42,322 INFO org.apache.hadoop.ha.SshFenceByTcpPort: Looking for process running on port 53310
2032-09-21 14:25:42,393 WARN org.apache.hadoop.ha.SshFenceByTcpPort: PATH=$PATH:/sbin:/usr/sbin fuser -v -k -n tcp 53310 via ssh: bash: fuser: 未找到命令
2032-09-21 14:25:42,394 INFO org.apache.hadoop.ha.SshFenceByTcpPort: rc: 127
2032-09-21 14:25:42,394 INFO org.apache.hadoop.ha.SshFenceByTcpPort.jsch: Disconnecting from bdp-namenode2 port 22
2032-09-21 14:25:42,396 WARN org.apache.hadoop.ha.NodeFencer: Fencing method org.apache.hadoop.ha.SshFenceByTcpPort(null) was unsuccessful.
2032-09-21 14:25:42,396 ERROR org.apache.hadoop.ha.NodeFencer: Unable to fence service by any configured method.
2032-09-21 14:25:42,396 INFO org.apache.hadoop.ha.SshFenceByTcpPort.jsch: Caught an exception, leaving main loop due to Socket closed
2032-09-21 14:25:42,397 WARN org.apache.hadoop.ha.ActiveStandbyElector: Exception handling the winning of election
java.lang.RuntimeException: Unable to fence NameNode at namenode2/192.168.123.4:53310
    at org.apache.hadoop.ha.ZKFailoverController.doFence(ZKFailoverController.java:533)
    at org.apache.hadoop.ha.ZKFailoverController.fenceOldActive(ZKFailoverController.java:505)
    at org.apache.hadoop.ha.ZKFailoverController.access$1100(ZKFailoverController.java:61)
    at org.apache.hadoop.ha.ZKFailoverController$ElectorCallbacks.fenceOldActive(ZKFailoverController.java:892)
    at org.apache.hadoop.ha.ActiveStandbyElector.fenceOldActive(ActiveStandbyElector.java:902)
    at org.apache.hadoop.ha.ActiveStandbyElector.becomeActive(ActiveStandbyElector.java:801)
    at org.apache.hadoop.ha.ActiveStandbyElector.processResult(ActiveStandbyElector.java:416)
    at org.apache.zookeeper.ClientCnxn$EventThread.processEvent(ClientCnxn.java:599)
    at org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:498)
```

The key line is `SshFenceByTcpPort: PATH=$PATH:/sbin:/usr/sbin fuser -v -k -n tcp 53310 via ssh: bash: fuser: 未找到命令`. The shell reports that `fuser` was not found ("未找到命令" is bash's localized "command not found", which matches `rc: 127`), so fencing failed. Because the old active NameNode could not be fenced, the failover that should have followed failed as well.

## Fix

`dfs.ha.fencing.methods` is configured as `sshfence`, and `SshFenceByTcpPort` relies on the `fuser` command to kill the process listening on the NameNode port.

Installing `fuser` on both NameNode nodes fixes the problem.