Solved

AHV Node failed during LCM

  • 24 February 2021


I recently used LCM to update the firmware on my 9-node cluster. One of the hosts did not come back up. I was able to get it to reboot back into the host OS, but it had no network connectivity. The interfaces were up and an IP address was assigned, but no traffic was passing through the bridge.

 

After fighting with it for some time, I decided to eject the node from the cluster and just re-foundation a new one. I initiated the removal process, but the node status remains in MARKED_FOR_REMOVAL_BUT_NOT_DETACHABLE:

nutanix@NTNX-<REDACTED>-A-CVM:10.1.153.30:~$ ncli host ls id=21

    Id                        : 0005adca-5f30-1bf1-0000-000000008d15::21
    Uuid                      : 705089f7-2435-4a35-83fe-603470bd36d6
    Name                      : 10.1.153.15
    IPMI Address              : 10.1.151.249
    Controller VM Address     : 10.1.153.34
    Controller VM NAT Address :
    Controller VM NAT PORT    :
    Hypervisor Address        : 10.1.153.15
    Host Status               : MARKED_FOR_REMOVAL_BUT_NOT_DETACHABLE
    Oplog Disk Size           : 400 GiB (429,496,729,600 bytes) (2.4%)
    Under Maintenance Mode    : false (life_cycle_management)
    Metadata store status     : Node is removed from metadata store
    Node Position             : Node physical position can't be displayed for this model. Please refer to Prism UI for this information.
    Node Serial (UUID)        : <REDACTED>
    Block Serial (Model)      : 0123456789 (NX-8035-G4)
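
For anyone who wants to track this field over time rather than read the output by eye, here is a minimal Python sketch that pulls `Host Status` out of text shaped like the output above. It assumes the `Key : Value` layout shown here; the function name `parse_host_fields` and the `SAMPLE` string are purely illustrative and are not part of any Nutanix tooling.

```python
def parse_host_fields(ncli_output: str) -> dict:
    """Parse 'Key : Value' lines (as printed by `ncli host ls`) into a dict."""
    fields = {}
    for line in ncli_output.splitlines():
        # Split on the first colon only, because values such as the Id contain '::'.
        key, sep, value = line.partition(":")
        if sep and key.strip():
            fields[key.strip()] = value.strip()
    return fields


# Sample text copied from the output above (hypothetical variable name).
SAMPLE = """\
    Id                        : 0005adca-5f30-1bf1-0000-000000008d15::21
    Host Status               : MARKED_FOR_REMOVAL_BUT_NOT_DETACHABLE
    Metadata store status     : Node is removed from metadata store
"""

fields = parse_host_fields(SAMPLE)
print(fields["Host Status"])
```

Splitting on the first colon keeps the full `Id` value intact, including its `::21` suffix.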

 

I followed this article (https://support-portal.nutanix.com/page/documents/kbs/details?targetId=kA00e0000009D6CCAU). I confirmed that the node status was indeed still MARKED_FOR_REMOVAL_BUT_NOT_DETACHABLE, and that the host had been removed from the metadata ring:

nutanix@NTNX-<REDACTED>-A-CVM:10.1.153.30:~$ nodetool -h 0 ring
Address         Status State      Load            Owns    Token
                                                          t6t6t6t6bP8XaoghFmTbXpFEVor1FjqbRz0Zwdw7LuMRx9guA7GHBpinihff
10.1.153.146    Up     Normal     805.96 MB       11.11%  00000000yyjS51szT1lE5ujXJWrV7DDGq59Q9OxirILNaz33w8uXUaiiI1J6
10.1.153.32     Up     Normal     789.13 MB       11.11%  6t6t6t6t4iHBhbGwpTcLD3zM1GCApKbfOg6WvxTaH1ktlJbPt6oVS0ONKMrF
10.1.153.31     Up     Normal     894.45 MB       11.11%  DmDmDmDm0000000000000000000000000000000000000000000000000000
10.1.153.35     Up     Normal     818.87 MB       11.11%  KfKfKfKfYFHjTJhIdvv6IQHmWivFVXXFa5bdGrQrAvJJ77TPMETArIYLOLY1
10.1.153.37     Up     Normal     1.45 GB         22.22%  YRYRYRYRZ0QUD39sU9U4hCgMOl0VoI5vidgiHiZhWjWrEwcvNBxsFUNj0EuH
10.1.153.33     Up     Normal     1.41 GB         11.11%  fKfKfKfK2TH6RDGnNnXunDuzuM3tGRzWPbCu4tHWX48e7uRxbW6pkMYZS1X4
10.1.153.30     Up     Normal     1.5 GB          11.11%  mDmDmDmD0000000000000000000000000000000000000000000000000000
10.1.153.36     Up     Normal     812.69 MB       11.11%  t6t6t6t6bP8XaoghFmTbXpFEVor1FjqbRz0Zwdw7LuMRx9guA7GHBpinihff
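
The check above (the removed host's Controller VM address, 10.1.153.34, no longer appears in the ring) can also be done programmatically. This is only a sketch: it assumes each ring row starts with an IPv4 address as in the output above, and `ring_addresses` and `RING_SAMPLE` are hypothetical names used for illustration.

```python
import re


def ring_addresses(ring_output: str) -> set:
    """Return the IPv4 addresses that begin a row of `nodetool ring` text."""
    addresses = set()
    for line in ring_output.splitlines():
        # A data row starts with a dotted-quad address followed by whitespace;
        # the header row and the lone wrap-around token row do not match.
        match = re.match(r"\s*(\d{1,3}(?:\.\d{1,3}){3})\s", line)
        if match:
            addresses.add(match.group(1))
    return addresses


# Two rows copied from the ring output above, plus the header.
RING_SAMPLE = """\
Address         Status State      Load            Owns    Token
10.1.153.32     Up     Normal     789.13 MB       11.11%  6t6t6t6t4iHBhbGwpTcLD3zM1GCApKbfOg6WvxTaH1ktlJbPt6oVS0ONKMrF
10.1.153.37     Up     Normal     1.45 GB         22.22%  YRYRYRYRZ0QUD39sU9U4hCgMOl0VoI5vidgiHiZhWjWrEwcvNBxsFUNj0EuH
"""

# Controller VM address of the host being removed, taken from `ncli host ls id=21`.
removed_cvm = "10.1.153.34"
print(removed_cvm in ring_addresses(RING_SAMPLE))
```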

 

It has been sitting in this state for over a week now. I also tried manually removing the node using:

ncli host remove-start id=21 skip-space-check=True

 

It reports that the node removal was successfully initiated, but I see no change in operation.

 

Any help is greatly appreciated.


Best answer by runningahv 9 March 2021, 16:38

