I recently used LCM to update the firmware on my 9-node cluster. One of the hosts did not come back up. I was able to get it to reboot back into the host OS, but it had no network connectivity: the interfaces were up and an IP address was assigned, but no traffic was passing through the bridge.
After fighting with it for some time, I decided to eject the node from the cluster and just re-foundation a new one. I initiated the removal process, but the node status remains in MARKED_FOR_REMOVAL_BUT_NOT_DETACHABLE:
nutanix@NTNX-<REDACTED>-A-CVM:10.1.153.30:~$ ncli host ls id=21
Id : 0005adca-5f30-1bf1-0000-000000008d15::21
Uuid : 705089f7-2435-4a35-83fe-603470bd36d6
Name : 10.1.153.15
IPMI Address : 10.1.151.249
Controller VM Address : 10.1.153.34
Controller VM NAT Address :
Controller VM NAT PORT :
Hypervisor Address : 10.1.153.15
Host Status : MARKED_FOR_REMOVAL_BUT_NOT_DETACHABLE
Oplog Disk Size : 400 GiB (429,496,729,600 bytes) (2.4%)
Under Maintenance Mode : false (life_cycle_management)
Metadata store status : Node is removed from metadata store
Node Position : Node physical position can't be displayed for this model. Please refer to Prism UI for this information.
Node Serial (UUID) : <REDACTED>
Block Serial (Model) : 0123456789 (NX-8035-G4)
I followed the AHV | Node removal stuck after successfully entering maintenance mode article, but it only got me as far as confirming that the node status was indeed still MARKED_FOR_REMOVAL_BUT_NOT_DETACHABLE and that the host had already been removed from the metadata ring:
nutanix@NTNX-<REDACTED>-A-CVM:10.1.153.30:~$ nodetool -h 0 ring
Address Status State Load Owns Token
10.1.153.146 Up Normal 805.96 MB 11.11% 00000000yyjS51szT1lE5ujXJWrV7DDGq59Q9OxirILNaz33w8uXUaiiI1J6
10.1.153.32 Up Normal 789.13 MB 11.11% 6t6t6t6t4iHBhbGwpTcLD3zM1GCApKbfOg6WvxTaH1ktlJbPt6oVS0ONKMrF
10.1.153.31 Up Normal 894.45 MB 11.11% DmDmDmDm0000000000000000000000000000000000000000000000000000
10.1.153.35 Up Normal 818.87 MB 11.11% KfKfKfKfYFHjTJhIdvv6IQHmWivFVXXFa5bdGrQrAvJJ77TPMETArIYLOLY1
10.1.153.37 Up Normal 1.45 GB 22.22% YRYRYRYRZ0QUD39sU9U4hCgMOl0VoI5vidgiHiZhWjWrEwcvNBxsFUNj0EuH
10.1.153.33 Up Normal 1.41 GB 11.11% fKfKfKfK2TH6RDGnNnXunDuzuM3tGRzWPbCu4tHWX48e7uRxbW6pkMYZS1X4
10.1.153.30 Up Normal 1.5 GB 11.11% mDmDmDmD0000000000000000000000000000000000000000000000000000
10.1.153.36 Up Normal 812.69 MB 11.11% t6t6t6t6bP8XaoghFmTbXpFEVor1FjqbRz0Zwdw7LuMRx9guA7GHBpinihff
It has been sitting in this state for over a week now. I also tried manually removing the node using:
ncli host remove-start id=21 skip-space-check=True
The command reports that node removal was successfully initiated, but I see no sign of the operation actually progressing. Any help is greatly appreciated.
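In case it helps with diagnosis: I believe ncli also has get-remove-status subcommands that report host and disk removal progress (I'm going from memory on the exact names, so treat these as approximate rather than verified on this AOS build):
ncli host get-remove-status
ncli disk get-remove-status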
Best answer by matthearn
Further updates: I had to do a bunch of zeus-hacking. Apparently my attempt to remove the node simply timed out, and left the node half-removed; it had copied all the necessary data off the disks, but hadn’t marked them as removable.
In Production: Incorrect use of the edit-zeus command could lead to data loss or other cluster complications; it should not be used except under the guidance of Nutanix Support.
The procedure I found described using “edit-zeus” to manually set the disk-removal status, but the actual code it specified (changing 17 to 273) may be out of date. I was seeing codes like 4369, 4096, and 4113. Disks that were not scheduled to be removed all showed the same status, which you can check with:
$ zeus_config_printer | grep data_migration_status
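If you want to see which disk each of those lines belongs to, widening the grep helps; the disk_id field name here is what I remember from zeus_config_printer output, so double-check it against your own dump:
$ zeus_config_printer | grep -E 'disk_id:|data_migration_status:'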
The “edit-zeus” command essentially allows you to replace those values, but it’s smart enough to discard your changes if you pick values that don’t work. I tried 273 and it wouldn’t save. I tried setting them to 0, 4096, 4113, etc., and then noticed that if I set one to 4369, it generally stayed that way, and the disk also became grayed out in Prism. So I set them all to 4369. Immediately they all went gray, and the host I was trying to remove began to show up in Prism with only an IP address and no CPU/disk/memory statistics. It still wouldn’t quite disappear, though.
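For reference, the invocation I used was along these lines (the --editor flag is what I remember from the article, so verify it on your release before running, and keep the production warning above in mind):
$ edit-zeus --editor=vim
Inside the editor, the edit for each disk on the node being removed was just changing the value on its data_migration_status line (4096, 4113, or whatever it showed) to 4369 and saving; edit-zeus then validates the result and either commits it or throws the change away.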
That same procedure said to run edit-zeus again and look for the “node_status” of the node being removed:
$ zeus_config_printer | grep node_status
I changed “kToBeRemoved” to “kOkToBeRemoved” and the host immediately became grayed out in Prism. A few minutes later it was gone, and the cluster was back down to 3 nodes, healthy and clean.
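For completeness, the edit itself was a one-word change on that line, from
node_status: kToBeRemoved
to
node_status: kOkToBeRemoved
and, if memory serves, re-running the same grep after the host disappeared showed only the remaining nodes, all reporting kNormal.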
Hopefully you can do the same, although if you have a 9-node cluster I’m guessing you’re *not* running CE and should probably just call support. :)