# Longhorn Volumes Stuck in the Attaching State After an Abnormal Restart

2020-03-25 20:20:49

## Background

I enabled Istio on a Rancher cluster. The next day I found that the installation had not fully completed, and that the dns and canal services under kube-system had restarted abnormally, which in turn caused Longhorn (version v0.8.0) to restart abnormally as well.

After the abnormal restart, some services failed to come back up. I restarted the Longhorn-related Workloads. All of the services eventually started, but they were not healthy: the Volumes stayed in the Attaching state.

![](/api/file/getImage?fileId=5e7b4c7f66f3b358d5000793)

## Symptoms

### longhorn-engine error

Redeploying Longhorn failed (the longhornio/longhorn-engine restart failed) with `cp: cannot create regular file '/data/longhorn': Text file busy`.

Following <https://github.com/longhorn/longhorn/issues/1116>, I ran `rm -rf /var/lib/longhorn/engine-binaries`. After that longhorn-engine started normally, but the Volumes in Longhorn still stayed in the Attaching state.

### longhorn-manager errors

longhornio/longhorn-manager reported:

    E0325 06:33:23.659547 1 replica_controller.go:177] fail to sync replica for longhorn-system/pvc-8b14c13c-28a4-4a9e-92d6-83709733e599-r-e2611f08: failed to cleanup the related replica process before deleting replica pvc-8b14c13c-28a4-4a9e-92d6-83709733e599-r-e2611f08: failed to delete process pvc-8b14c13c-28a4-4a9e-92d6-83709733e599-r-e2611f08: rpc error: code = DeadlineExceeded desc = context deadline exceeded

    time="2020-03-25T06:33:23Z" level=warning msg="Dropping Longhorn replica longhorn-system/pvc-8b14c13c-28a4-4a9e-92d6-83709733e599-r-e2611f08 out of the queue: fail to sync replica for longhorn-system/pvc-8b14c13c-28a4-4a9e-92d6-83709733e599-r-e2611f08: failed to cleanup the related replica process before deleting replica pvc-8b14c13c-28a4-4a9e-92d6-83709733e599-r-e2611f08: failed to delete process pvc-8b14c13c-28a4-4a9e-92d6-83709733e599-r-e2611f08: rpc error: code = DeadlineExceeded desc = context deadline exceeded"

    time="2020-03-25T08:00:42Z" level=error msg="failed to poll instance info to update instance manager instance-manager-r-f1a932c6: failed to list processes: rpc error: code = DeadlineExceeded desc = context deadline exceeded"

### Restarting engine-image-ei fails

The engine-image-ei pod failed with `cp: cannot create regular file '/data/longhorn': Text file busy`.

Running `lsof /var/lib/longhorn/engine-binaries/longhornio-longhorn-engine-v0.8.0/longhorn` showed that the file was in use by the longhorn-instance-manager process, which had spawned a number of child processes like this:

```
/host/var/lib/longhorn/engine-binaries/longhornio-longhorn-engine-v0.8.0/longhorn replica /host/var/lib/rancher/longhorn/replicas/pvc-xxx
```

## Solution

In the end, restarting Docker on the machine fixed it (I have to say, after a whole day of analysis, a restart is still the best cure!).

The affected PVCs were recovered by recreating the Workloads and mounting the existing PVCs again.

That evening, after everything was working, I looked at the issue <https://github.com/longhorn/longhorn/issues/1116> again and found that a Longhorn committer had updated the problem description. The gist is that in the current v0.8.0 release, restarting any engine-image-ei pod triggers `cp: cannot create regular file '/data/longhorn': Text file busy`. It is a bug.

It looks like we can only hope that a later release fixes this for good. Until then, try to avoid restarting Longhorn-related Workloads as much as possible.
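
The `lsof` check used above to find out who was keeping the engine binary busy can be approximated on hosts where `lsof` is not installed. The following is only a minimal sketch, not something from the original troubleshooting session: the function name `find_exe_holders` and its optional second argument are hypothetical, and it only looks at each process's `/proc/<pid>/exe` link, so it may report fewer holders than `lsof` would. It compares files with `-ef` (same device and inode), which most shells support although it is not part of POSIX, so that a process seeing the binary under a different mount path such as `/host/var/lib/...` should still be matched.

```shell
# Print the PIDs whose running executable is the given file. A running
# executable is what makes overwriting the file fail with "Text file busy".
# Usage: find_exe_holders <binary> [proc_root]
# proc_root defaults to /proc; the argument is a hypothetical convenience
# so the function can be tried against a fake directory tree.
find_exe_holders() {
    target=$1
    proc_root=${2:-/proc}
    for exe in "$proc_root"/[0-9]*/exe; do
        # -ef is true when both paths resolve to the same device and inode.
        # Unreadable or vanished entries simply compare false and are skipped.
        if [ "$exe" -ef "$target" ]; then
            pid=${exe%/exe}
            echo "${pid##*/}"
        fi
    done
}
```

On the affected node this would be called as `find_exe_holders /var/lib/longhorn/engine-binaries/longhornio-longhorn-engine-v0.8.0/longhorn`, most likely as root, since the `exe` links of other users' processes are not readable otherwise.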