I recently used LCM to update the firmware on my 9-node cluster. One of the hosts did not come back up. I was able to get it to boot back into the host OS, but it had no network connectivity: the interfaces were up and an IP address was assigned, but no traffic was passing through the bridge.
After fighting with it for some time, I decided to eject the node from the cluster and just re-foundation a new one. I initiated the removal process, but the node status is stuck at MARKED_FOR_REMOVAL_BUT_NOT_DETACHABLE:
nutanix@NTNX-<REDACTED>-A-CVM:10.1.153.30:~$ ncli host ls id=21
Id : 0005adca-5f30-1bf1-0000-000000008d15::21
Uuid : 705089f7-2435-4a35-83fe-603470bd36d6
Name : 10.1.153.15
IPMI Address : 10.1.151.249
Controller VM Address : 10.1.153.34
Controller VM NAT Address :
Controller VM NAT PORT :
Hypervisor Address : 10.1.153.15
Host Status : MARKED_FOR_REMOVAL_BUT_NOT_DETACHABLE
Oplog Disk Size : 400 GiB (429,496,729,600 bytes) (2.4%)
Under Maintenance Mode : false (life_cycle_management)
Metadata store status : Node is removed from metadata store
Node Position : Node physical position can't be displayed for this model. Please refer to Prism UI for this information.
Node Serial (UUID) : <REDACTED>
Block Serial (Model) : 0123456789 (NX-8035-G4)
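In case anyone wants to check the same fields without reading the whole listing, here is a rough Python sketch that turns the `Key : Value` lines above into a dictionary. The function name `parse_ncli_host` and the `SAMPLE` text are my own for illustration (not part of any Nutanix tooling), and it assumes the output keeps the one-field-per-line layout shown here.

```python
def parse_ncli_host(text):
    """Parse 'Key : Value' lines from `ncli host ls` output into a dict."""
    fields = {}
    for line in text.splitlines():
        # Skip the shell prompt line and anything without a separator.
        if ":" not in line or line.lstrip().startswith("nutanix@"):
            continue
        # Split on the first colon only, so values such as the Id
        # (which itself contains "::") stay intact.
        key, _, value = line.partition(":")
        fields[key.strip()] = value.strip()
    return fields


# A few lines copied from the output above (hypothetical sample input).
SAMPLE = """\
Id : 0005adca-5f30-1bf1-0000-000000008d15::21
Controller VM Address : 10.1.153.34
Controller VM NAT Address :
Host Status : MARKED_FOR_REMOVAL_BUT_NOT_DETACHABLE
Metadata store status : Node is removed from metadata store
"""

host = parse_ncli_host(SAMPLE)
stuck = host["Host Status"] == "MARKED_FOR_REMOVAL_BUT_NOT_DETACHABLE"
print(host["Controller VM Address"], "stuck:", stuck)
```

Splitting on the first colon is a deliberate choice here, since several values contain colons of their own.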
I followed this article (https://support-portal.nutanix.com/page/documents/kbs/details?targetId=kA00e0000009D6CCAU) and confirmed that the node status was indeed still MARKED_FOR_REMOVAL_BUT_NOT_DETACHABLE and that the host had been removed from the metadata ring:
nutanix@NTNX-<REDACTED>-A-CVM:10.1.153.30:~$ nodetool -h 0 ring
Address Status State Load Owns Token
t6t6t6t6bP8XaoghFmTbXpFEVor1FjqbRz0Zwdw7LuMRx9guA7GHBpinihff
10.1.153.146 Up Normal 805.96 MB 11.11% 00000000yyjS51szT1lE5ujXJWrV7DDGq59Q9OxirILNaz33w8uXUaiiI1J6
10.1.153.32 Up Normal 789.13 MB 11.11% 6t6t6t6t4iHBhbGwpTcLD3zM1GCApKbfOg6WvxTaH1ktlJbPt6oVS0ONKMrF
10.1.153.31 Up Normal 894.45 MB 11.11% DmDmDmDm0000000000000000000000000000000000000000000000000000
10.1.153.35 Up Normal 818.87 MB 11.11% KfKfKfKfYFHjTJhIdvv6IQHmWivFVXXFa5bdGrQrAvJJ77TPMETArIYLOLY1
10.1.153.37 Up Normal 1.45 GB 22.22% YRYRYRYRZ0QUD39sU9U4hCgMOl0VoI5vidgiHiZhWjWrEwcvNBxsFUNj0EuH
10.1.153.33 Up Normal 1.41 GB 11.11% fKfKfKfK2TH6RDGnNnXunDuzuM3tGRzWPbCu4tHWX48e7uRxbW6pkMYZS1X4
10.1.153.30 Up Normal 1.5 GB 11.11% mDmDmDmD0000000000000000000000000000000000000000000000000000
10.1.153.36 Up Normal 812.69 MB 11.11% t6t6t6t6bP8XaoghFmTbXpFEVor1FjqbRz0Zwdw7LuMRx9guA7GHBpinihff
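The same check can be done in a script. This is only a sketch, assuming the ring output keeps the layout shown above (one node per row, starting with its IPv4 address); `ring_addresses` and `SAMPLE_RING` are names I made up for the example, and the token column is replaced with a placeholder to keep it short.

```python
import re

# Matches a dotted-quad IPv4 address (no range checking; good enough here).
IPV4 = re.compile(r"^\d{1,3}(\.\d{1,3}){3}$")


def ring_addresses(text):
    """Return the addresses listed in `nodetool -h 0 ring` output."""
    addresses = []
    for line in text.splitlines():
        parts = line.split()
        # Node rows start with an IPv4 address; the header row and the
        # lone token line at the top do not, so they are skipped.
        if parts and IPV4.match(parts[0]):
            addresses.append(parts[0])
    return addresses


# Rows from the output above, with tokens shortened to "<token>".
SAMPLE_RING = """\
Address Status State Load Owns Token
<token>
10.1.153.146 Up Normal 805.96 MB 11.11% <token>
10.1.153.32 Up Normal 789.13 MB 11.11% <token>
10.1.153.31 Up Normal 894.45 MB 11.11% <token>
10.1.153.35 Up Normal 818.87 MB 11.11% <token>
10.1.153.37 Up Normal 1.45 GB 22.22% <token>
10.1.153.33 Up Normal 1.41 GB 11.11% <token>
10.1.153.30 Up Normal 1.5 GB 11.11% <token>
10.1.153.36 Up Normal 812.69 MB 11.11% <token>
"""

ring = ring_addresses(SAMPLE_RING)
removed_cvm = "10.1.153.34"  # Controller VM address of host id=21
print(len(ring), "nodes in ring;", removed_cvm, "present:", removed_cvm in ring)
```

With the rows above, the ring contains eight CVMs and 10.1.153.34 is not among them, which matches the "Node is removed from metadata store" status reported by ncli.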
It has been sitting in this state for over a week now. I also tried manually removing the node using:
ncli host remove-start id=21 skip-space-check=True
It reports that the node removal was successfully initiated, but the host status does not change.
Any help is greatly appreciated.