Solved

AHV node stuck in MARKED_FOR_REMOVAL_BUT_NOT_DETACHABLE

  • 26 February 2021
  • 4 replies
  • 147 views


I recently used LCM to update the firmware on my 9-node cluster. One of the hosts did not come back up. I was able to get it to reboot back into the host OS, but it had no network connectivity: the interfaces were up and an IP address was assigned, but no traffic was passing through the bridge.
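
For anyone who hits the same symptom, the basic checks on the AHV host would look something like the following. This is a rough sketch from memory, assuming the default bridge br0 and bond br0-up; adjust the names for your layout:

# Show the OVS bridge/port layout
ovs-vsctl show

# Check the bond state and which uplink is active (bond name assumed to be br0-up)
ovs-appctl bond/show br0-up

# Confirm the uplinks are actually passing packets
ovs-ofctl dump-ports br0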

After fighting with it for some time, I decided to eject the node from the cluster and just re-foundation a new one. I initiated the removal process, but the node status remains in MARKED_FOR_REMOVAL_BUT_NOT_DETACHABLE:

nutanix@NTNX-<REDACTED>-A-CVM:10.1.153.30:~$ ncli host ls id=21

    Id                        : 0005adca-5f30-1bf1-0000-000000008d15::21
    Uuid                      : 705089f7-2435-4a35-83fe-603470bd36d6
    Name                      : 10.1.153.15
    IPMI Address              : 10.1.151.249
    Controller VM Address     : 10.1.153.34
    Controller VM NAT Address :
    Controller VM NAT PORT    :
    Hypervisor Address        : 10.1.153.15
    Host Status               : MARKED_FOR_REMOVAL_BUT_NOT_DETACHABLE
    Oplog Disk Size           : 400 GiB (429,496,729,600 bytes) (2.4%)
    Under Maintenance Mode    : false (life_cycle_management)
    Metadata store status     : Node is removed from metadata store
    Node Position             : Node physical position can't be displayed for this model. Please refer to Prism UI for this information.
    Node Serial (UUID)        : <REDACTED>
    Block Serial (Model)      : 0123456789 (NX-8035-G4)

I followed the AHV | Node removal stuck after successfully entering maintenance mode article, confirming that the node status was indeed still MARKED_FOR_REMOVAL_BUT_NOT_DETACHABLE and that the host had been removed from the metadata ring:

nutanix@NTNX-<REDACTED>-A-CVM:10.1.153.30:~$ nodetool -h 0 ring
Address         Status State      Load            Owns    Token
                                                          t6t6t6t6bP8XaoghFmTbXpFEVor1FjqbRz0Zwdw7LuMRx9guA7GHBpinihff
10.1.153.146    Up     Normal     805.96 MB       11.11%  00000000yyjS51szT1lE5ujXJWrV7DDGq59Q9OxirILNaz33w8uXUaiiI1J6
10.1.153.32     Up     Normal     789.13 MB       11.11%  6t6t6t6t4iHBhbGwpTcLD3zM1GCApKbfOg6WvxTaH1ktlJbPt6oVS0ONKMrF
10.1.153.31     Up     Normal     894.45 MB       11.11%  DmDmDmDm0000000000000000000000000000000000000000000000000000
10.1.153.35     Up     Normal     818.87 MB       11.11%  KfKfKfKfYFHjTJhIdvv6IQHmWivFVXXFa5bdGrQrAvJJ77TPMETArIYLOLY1
10.1.153.37     Up     Normal     1.45 GB         22.22%  YRYRYRYRZ0QUD39sU9U4hCgMOl0VoI5vidgiHiZhWjWrEwcvNBxsFUNj0EuH
10.1.153.33     Up     Normal     1.41 GB         11.11%  fKfKfKfK2TH6RDGnNnXunDuzuM3tGRzWPbCu4tHWX48e7uRxbW6pkMYZS1X4
10.1.153.30     Up     Normal     1.5 GB          11.11%  mDmDmDmD0000000000000000000000000000000000000000000000000000
10.1.153.36     Up     Normal     812.69 MB       11.11%  t6t6t6t6bP8XaoghFmTbXpFEVor1FjqbRz0Zwdw7LuMRx9guA7GHBpinihff

It has been sitting in this state for over a week now. I also tried manually removing the node using:

ncli host remove-start id=21 skip-space-check=True

It reports that the node removal was successfully initiated, but nothing actually changes. Any help is greatly appreciated.
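
For completeness, these are the other places I know of to check removal progress; I'm quoting the command names from memory, so treat them as approximate:

# Ask ncli for the status of the pending node removal
ncli host get-remove-status

# The removal should also appear as a progress-monitor task
progress_monitor_cli -fetchall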


Best answer by matthearn 1 March 2021, 23:09


This topic has been closed for comments

4 replies

Userlevel 1
Badge +5

Funny, I’m having the same problem, and this was the first thing that popped up when I searched for “MARKED_FOR_REMOVAL_BUT_NOT_DETACHABLE”.  In my case, I *think* the issue is that the host in question somehow kept becoming “unschedulable” but was still hosting VMs.  When I tried to remove the host, it threw some initial errors about not being able to go into maintenance mode:

Failed to evacuate 6/7 VMs: - 3: HypervisorConnectionError: Could not connect to hypervisor on host eebb3eed-c906-4e47-98c6-4116b28f0ca2 - 3: InvalidVmState: Cannot complete request in state Off

But it still started the removal process.  At this point it seems to have migrated all the metadata and storage to the other hosts in the cluster (it's only using 176 MB of storage), but it still won't actually remove.  Also, the VMs that were “stuck” on it had completely disappeared from the cluster.  I was able to use “acli host.exit_maintenance_mode” to make the host schedulable again even though it was mid-removal, and then migrated *some* of the VMs off of it, but there seems to be one VM I can't move even though it is powered off; I haven't been able to determine which VM it actually is.

In your case, I'd recommend using “virsh list” on the host to see if there are still VMs running on it, and also checking prism to see whether all of your VMs actually show up.  I'm guessing that in both our cases the cluster won't remove the host until there are no VMs attached to it (even if they're powered off).
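
Something like this, run as root on the AHV host; the --all flag matters because it also lists powered-off domains:

# List every domain libvirt knows about on this host, including shut-off ones
virsh list --all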

If I make any progress with my cluster, I’ll update you.  I’m hoping I don’t have to rebuild it for the second time in 3 months...

Userlevel 1
Badge +5

Looks like you can use acli to try and identify “lingering” VMs; I ran this:

acli vm.list | awk '{print $1}' | while read VM; do echo ${VM} $(acli vm.get ${VM} | grep host_uuid); done | grep eeb

“eeb” being the first few characters of the uuid of the wonky host.  It gave me:

deadvm01 removed_from_host_uuid: "eebb3eed-c906-4e47-98c6-4116b28f0ca2"

I then powered that VM up on another host:

<acropolis> vm.on deadvm01 host=10.5.38.4

The VM came up happily, but I still can’t get the host to remove or stay in maintenance mode.

Userlevel 1
Badge +5

Further updates: I had to do a bunch of zeus-hacking.  Apparently my attempt to remove the node simply timed out, and left the node half-removed; it had copied all the necessary data off the disks, but hadn’t marked them as removable. 

The procedure I found came with this warning: “In Production: Incorrect use of the edit-zeus command could lead to data loss or other cluster complications and should not be used unless under the guidance of Nutanix Support.”

It described using “edit-zeus” to manually set the disk-removal status, but the actual code it specified (changing 17 to 273) may be out of date.  I was seeing codes like 4369, 4096, and 4113.  Disks that were not scheduled for removal all showed a status of 0:

$ zeus_config_printer | grep data_migration_status
data_migration_status: 0
data_migration_status: 0
data_migration_status: 0
data_migration_status: 0
data_migration_status: 0
data_migration_status: 0
data_migration_status: 0
data_migration_status: 0
data_migration_status: 4113
data_migration_status: 4369
data_migration_status: 4096
data_migration_status: 4113
data_migration_status: 4369
data_migration_status: 4113
data_migration_status: 4369
data_migration_status: 4113
data_migration_status: 0
data_migration_status: 0
data_migration_status: 0
data_migration_status: 0
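
If you want to see which physical disk each of those lines belongs to, pulling a little more context out of the config helps; field names here are from memory and may differ slightly between AOS versions:

# Print each disk's serial next to its migration status for easier matching
zeus_config_printer | grep -E 'disk_serial_id|data_migration_status'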

The “edit-zeus” command essentially allows you to replace those values, but it's smart enough to discard your changes if you pick values that don't work.  I tried 273 and it wouldn't save.  I tried setting them to 0, 4096, 4113, etc., and then noticed that if I set one to 4369, it generally stayed that way and also became grayed out in prism.  So I set them all to 4369.  Immediately they all went gray, and the host I was trying to remove began to show up in prism with only an IP address and no CPU/disk/memory statistics.  It still wouldn't quite disappear, though.

The same procedure then said to run edit-zeus again and look for the “node_status” of the node being removed:

$ zeus_config_printer | grep node_status
node_status: kNormal
node_status: kNormal
node_status: kToBeRemoved
node_status: kNormal
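
To figure out which of those lines is which node, the same trick works (exact field name from memory):

# Show each node's CVM IP alongside its status so you can spot the kToBeRemoved entry
zeus_config_printer | grep -E 'service_vm_external_ip|node_status'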

I changed “kToBeRemoved” to “kOkToBeRemoved” and the host immediately became grayed out in prism.  A few minutes later it was gone, and the cluster was back down to 3 nodes, healthy and clean.

Hopefully you can do the same, although if you have a 9-node cluster I’m guessing you’re *not* running CE and should probably just call support. :)

Badge

@matthearn this is exactly the missing piece I needed. The node was completely offline, but there were a couple of disks that were stuck being removed. I had attempted to remove the disks with the command below, but it just reported that the disks were already in the being-removed state:

ncli disk rm-start id=<diskid> force=true
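
For anyone following along later, the <diskid> values for that command come out of the disk list, something along these lines:

# List disks with their IDs and current status
ncli disk ls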

I fired up edit-zeus and located the disks that were stuck. Just like you, I observed that some had their data_migration_status stuck at 4113. I updated these to 4369, and after a few minutes the disks were all greyed out in the UI. I followed that up by setting the node_status of the down node from kToBeRemoved to kOkToBeRemoved. The node was gone from the UI shortly afterwards! Also, yes, it is a 9-node cluster; it's an old Nutanix cluster that I retired from production and is no longer under support. You rock, thanks for the help!