How to remove a dead node from a cluster?


Badge +5
One of three nodes is completely dead. How do I remove it from the cluster?
ncli cluster delete id= skip-space-check didn't help; the removal process hangs at 0%.

This topic has been closed for comments

22 replies

Badge +5
We reinstalled all nodes.
Badge +5
Has anyone finally found a solution to the node-removal issue (stuck at 0%)?
Badge +7
Please check this topic; it may help you solve the issue:

http://next.nutanix.com/t5/Discussion-Forum/re-install-1-of-3-nodes/m-p/9504
Badge +5
Nothing happens
Badge +7
After initiating the host remove command, the task will show the progress and the result; you can see the error code there if it failed.
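As a side note, that status check can be scripted. A minimal sketch, assuming the `ncli host get-remove-status` command that appears later in this thread; the `ncli` shell function below is only a mock replaying sample output from this thread, so the snippet runs anywhere — on a real CVM you would delete the mock and let the actual ncli binary answer:

```shell
#!/bin/sh
# Mock of ncli for illustration only: replays the get-remove-status
# output posted in this thread. Remove this function on a real CVM.
ncli() {
    cat <<'EOF'
    Host Id     : b26de4fe-6e55-46fd-8f67-14249e116aba
    Host Status : MARKED_FOR_REMOVAL_BUT_NOT_DETACHABLE
EOF
}

# Pull out the Host Status field to see whether the removal is stuck.
status=$(ncli host get-remove-status | awk -F': ' '/Host Status/ {print $2}')
echo "removal status: $status"
if [ "$status" = "MARKED_FOR_REMOVAL_BUT_NOT_DETACHABLE" ]; then
    echo "node is still not detachable"
fi
```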
Badge +5
host list
Id : 00055525-0597-1318-2adc-ac1f6b054d36::6
Uuid : b26de4fe-6e55-46fd-8f67-14249e116aba
Name : 172.19.5.13
IPMI Address : 192.168.5.234
Controller VM Address : 172.19.5.23
Hypervisor Address : 172.19.5.13
Host Status : MARKED_FOR_REMOVAL_BUT_NOT_DETACHABLE
Oplog Disk Size : 190.64 GiB (204,693,376,000 bytes) (2.1%)
Under Maintenance Mode : null (-)
Metadata store status : Node is removed from metadata store
Node Position : Node physical position can't be displayed for this model. Please refer to Prism UI for this information.
Node Serial (UUID) : b26de4fe-6e55-46fd-8f67-14249e116aba
Block Serial (Model) : 52420081 (CommunityEdition)


host remove-start id=00055525-0597-1318-2adc-ac1f6b054d36::6 skip-space-check=true force=true
Host removal successfully initiated

Nothing happens
Badge +7
I have been successful with the commands below.

find the node id to remove
ncli host list

force remove dead node
ncli host remove-start id= skip-space-check=true force=true
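Those two steps can be chained: parse the Id field out of the `ncli host list` output and feed it to `remove-start`. A rough sketch — the `ncli` function here is again only a mock replaying the listing posted earlier in this thread, and the actual remove command is left commented out:

```shell
#!/bin/sh
# Mock ncli replaying the 'host list' output from this thread.
# On a real CVM, delete this function so the real ncli is used.
ncli() {
    cat <<'EOF'
    Id          : 00055525-0597-1318-2adc-ac1f6b054d36::6
    Uuid        : b26de4fe-6e55-46fd-8f67-14249e116aba
    Host Status : MARKED_FOR_REMOVAL_BUT_NOT_DETACHABLE
EOF
}

# Step 1: find the node id to remove (the first Id field in the listing).
node_id=$(ncli host list | awk -F': ' '/^ *Id /{print $2; exit}')
echo "would remove node $node_id"

# Step 2: force-remove the dead node (commented out in this sketch):
# ncli host remove-start id="$node_id" skip-space-check=true force=true
```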
Badge +5
How do I delete the node?
cluster remove-start id=00055525-0597-1318-2adc-ac1f6b054d36::6 force=true skip-space-check=true
It does not work.
Userlevel 7
Badge +24
It's true that Nutanix uses a global blacklist for offline disks. Trying to use the same disks as a "new" node will not work if the old node hasn't been removed yet.
Badge +5
And we have a new problem.

The 4th node (physically the same machine as the 3rd, dead node) has a disk status of "Offline".
I think it may be because its disks have the same IDs (serials) as the ones that were marked for detaching (because these are the same disks).
Can anyone help with this?
Badge +5
The host was not deleted.
After running the command, nothing happens:

cluster remove-start id=00055525-0597-1318-2adc-ac1f6b054d36::6 force=true skip-space-check=true
Host removal successfully initiated

Id : 00055525-0597-1318-2adc-ac1f6b054d36::6
Uuid : b26de4fe-6e55-46fd-8f67-14249e116aba
Name : 172.19.5.13
IPMI Address : 192.168.5.234
Controller VM Address : 172.19.5.23
Hypervisor Address : 172.19.5.13
Host Status : MARKED_FOR_REMOVAL_BUT_NOT_DETACHABLE
Oplog Disk Size : 190.64 GiB (204,693,376,000 bytes) (2.1%)
Under Maintenance Mode : null (-)
Metadata store status : Node is removed from metadata store
Node Position : Node physical position can't be displayed for this model. Please refer to Prism UI for this information.
Node Serial (UUID) : b26de4fe-6e55-46fd-8f67-14249e116aba
Block Serial (Model) : 52420081 (CommunityEdition)

nodetool -h localhost ring
Address      Status  State   Load     Owns    Token
                                              zzzzzzzzcZHutv3SPnqiM6UMNSTQsFcByvyEWKKoc3kkR9f1ybAPqq6BZ2n4
172.19.5.24  Up      Normal  1.37 GB  16.67%  AKfKfKfK0000000000000000000000000000000000000000000000000000
172.19.5.22  Up      Normal  1.34 GB  16.67%  KfKfKfKfFqg3Q4qPGEgxdbsxbs8NhFFrpPxfY063k26pcEKaqVl2ollFkXm1
172.19.5.21  Up      Normal  1.35 GB  66.67%  zzzzzzzzcZHutv3SPnqiM6UMNSTQsFcByvyEWKKoc3kkR9f1ybAPqq6BZ2n4
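As an aside, that ring output already tells part of the story: the dead node's CVM (172.19.5.23) is gone from the ring, and 172.19.5.21 now owns 66.67% of the token space instead of the roughly 33% each node would own in a balanced 3-node ring. A rough, runnable sketch of spotting that imbalance mechanically (sample data copied from the output above):

```shell
#!/bin/sh
# Address / Owns% pairs taken from the nodetool ring output above.
ring="172.19.5.24 16.67
172.19.5.22 16.67
172.19.5.21 66.67"

# In a balanced 3-node ring each node owns ~33%; a node owning far more
# has likely absorbed the dead node's token range.
unbalanced=$(echo "$ring" | awk '$2 > 50 {print $1}')
echo "unbalanced node(s): $unbalanced"
```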
Badge +5
Yes
Userlevel 3
Badge +20
Did you use the force remove flag?
ncli cluster remove-start id={node-id} force=true skip-space-check=true
Thanks
mswasif
Badge +5
The node was reinstalled with other IPs and joined into the cluster.
In ncli I initiated the removal process for the dead node; in Prism Tasks I can see that the disk and node removal tasks are complete at 100%, but the node is still present in the cluster.

in ncli:
host get-remove-status
Host Id : b26de4fe-6e55-46fd-8f67-14249e116aba
Host Status : MARKED_FOR_REMOVAL_BUT_NOT_DETACHABLE
Ring Changer Host Address : 172.19.5.22
Ring Changer Host Id : da970d73-68e2-4a6d-a558-009e0066b1ab
Userlevel 7
Badge +25
Pull it and reinstall that node. Your important node data is lost. Commercial Nutanix mirrors that drive for this situation. You will be down to 2 nodes, so the system won't function, but no data is lost. I think you will need to use the console and the discover-nodes process, as Prism might be down in a 2-node situation.
Badge +5
3 nodes.
How can I properly replace that SSD?
Userlevel 7
Badge +25
3-node or 4-node cluster? The SSD has all the important stuff, so that node is toast. You will need to replace your SSD and reinstall CE on that node. Then, depending on the size of your cluster, you can use the expand feature in Prism.
Badge +5
The cluster was reinstalled, and I have the same trouble.

1 SSD on one node fails with an I/O error, and the CVM on that node didn't load. If I reboot the node, it falls into a dracut timeout error :(
Can someone describe how to repair the dead CVM and replace the SSD without rebuilding the whole cluster?

Thanks
Badge +6
cluster status | grep -i zeus -B1
cluster --migrate_from=192.168.20.104 --migrate_to=192.168.20.124 --genesis_rpc_timeout_secs=300 migrate_zeus
ncli host remove-start id=000523d9-0b9f-cddc-6e9f-003048f81ba2::5 (if you get an error, see below)
ncli host list
ncli host rm-start id=5 skip-space-check=true

Haven't tried it yet, but it may help.
Badge +5
Yep, the 4th node was added to the cluster, but the dead node is still present.
The host status is MARKED_FOR_REMOVAL_BUT_NOT_DETACHABLE.
All disks of that node have the status Detachable.
Userlevel 3
Badge +20
The best option is to shut down the node, rebuild it with different IP addresses from the dead node, and add it to the cluster as the 4th node. After the cluster is up and running, remove the dead node.
Thanks
mswasif
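That sequence can be summed up as a checklist. A sketch only — the node id below is the example from earlier in this thread, and the ncli commands are left commented out because the rebuild and Prism expand steps are manual:

```shell
#!/bin/sh
# Outline of the rebuild-then-remove sequence described above.
# All values are examples from this thread; adjust for your cluster.

dead_node_id="00055525-0597-1318-2adc-ac1f6b054d36::6"

# 1. Shut down the dead node and rebuild it with *different* IPs.
# 2. Expand the cluster in Prism (or use discover nodes) to add it
#    back as a 4th node.
# 3. Once the cluster is up and running, force-remove the old entry:
#    ncli host remove-start id="$dead_node_id" skip-space-check=true force=true
# 4. Poll the removal until the host disappears from 'ncli host list':
#    ncli host get-remove-status

echo "would remove dead node $dead_node_id"
```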