AHV Node failed during LCM

  • 24 February 2021


I recently used LCM to update the firmware on my 9-node cluster. One of the hosts did not come back up. I was able to get it to reboot into the host OS, but it had no network connectivity: the interfaces were up and an IP address was assigned, but no traffic was passing through the bridge.


After fighting with it for some time, I decided to eject the node from the cluster and just re-foundation a new one. I initiated the removal process, but the node status remains in MARKED_FOR_REMOVAL_BUT_NOT_DETACHABLE:

nutanix@NTNX-<REDACTED>-A-CVM:$ ncli host ls id=21

    Id                        : 0005adca-5f30-1bf1-0000-000000008d15::21
    Uuid                      : 705089f7-2435-4a35-83fe-603470bd36d6
    Name                      :
    IPMI Address              :
    Controller VM Address     :
    Controller VM NAT Address :
    Controller VM NAT PORT    :
    Hypervisor Address        :
    Host Status               : MARKED_FOR_REMOVAL_BUT_NOT_DETACHABLE
    Oplog Disk Size           : 400 GiB (429,496,729,600 bytes) (2.4%)
    Under Maintenance Mode    : false (life_cycle_management)
    Metadata store status     : Node is removed from metadata store
    Node Position             : Node physical position can't be displayed for this model. Please refer to Prism UI for this information.
    Node Serial (UUID)        : <REDACTED>
    Block Serial (Model)      : 0123456789 (NX-8035-G4)
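
For anyone who wants to watch this field without reading the whole listing each time, here is a minimal sketch that pulls just the Host Status value out of output like the above. The function name host_status is my own (hypothetical), and it assumes the "Field : value" layout shown in this post.

```shell
# Hypothetical helper: print the "Host Status" value from `ncli host ls` output on stdin.
# Assumes the "Field : value" layout shown in the listing above.
host_status() {
  awk -F':' '/Host Status/ {
    sub(/^[ \t]+/, "", $2)   # trim leading whitespace from the value
    sub(/[ \t]+$/, "", $2)   # trim trailing whitespace from the value
    print $2
    exit
  }'
}
```

It could be used as `ncli host ls id=21 | host_status`, for example from a loop that sleeps between checks, to see whether the status ever moves past MARKED_FOR_REMOVAL_BUT_NOT_DETACHABLE.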


I followed this article and confirmed that the node status was indeed still MARKED_FOR_REMOVAL_BUT_NOT_DETACHABLE and that the host had been removed from the metadata ring:

nutanix@NTNX-<REDACTED>-A-CVM:$ nodetool -h 0 ring
Address         Status State      Load            Owns    Token
                                                          t6t6t6t6bP8XaoghFmTbXpFEVor1FjqbRz0Zwdw7LuMRx9guA7GHBpinihff
                Up     Normal     805.96 MB       11.11%  00000000yyjS51szT1lE5ujXJWrV7DDGq59Q9OxirILNaz33w8uXUaiiI1J6
                Up     Normal     789.13 MB       11.11%  6t6t6t6t4iHBhbGwpTcLD3zM1GCApKbfOg6WvxTaH1ktlJbPt6oVS0ONKMrF
                Up     Normal     894.45 MB       11.11%  DmDmDmDm0000000000000000000000000000000000000000000000000000
                Up     Normal     818.87 MB       11.11%  KfKfKfKfYFHjTJhIdvv6IQHmWivFVXXFa5bdGrQrAvJJ77TPMETArIYLOLY1
                Up     Normal     1.45 GB         22.22%  YRYRYRYRZ0QUD39sU9U4hCgMOl0VoI5vidgiHiZhWjWrEwcvNBxsFUNj0EuH
                Up     Normal     1.41 GB         11.11%  fKfKfKfK2TH6RDGnNnXunDuzuM3tGRzWPbCu4tHWX48e7uRxbW6pkMYZS1X4
                Up     Normal     1.5 GB          11.11%  mDmDmDmD0000000000000000000000000000000000000000000000000000
                Up     Normal     812.69 MB       11.11%  t6t6t6t6bP8XaoghFmTbXpFEVor1FjqbRz0Zwdw7LuMRx9guA7GHBpinihff
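
A quick way to double-check the ring membership is to count the entries reported as Up and Normal. The sketch below is only an illustration: ring_up_normal_count is a hypothetical name, and it simply counts "Up ... Normal" pairs in whatever text it is given rather than parsing the columns.

```shell
# Hypothetical helper: count ring members reported as "Up" + "Normal"
# in `nodetool -h 0 ring` output read from stdin.
ring_up_normal_count() {
  grep -oE 'Up[[:space:]]+Normal' | wc -l | tr -d '[:space:]'
}
```

Run as `nodetool -h 0 ring | ring_up_normal_count`. For the output above it should come to 8, which is what I would expect for a 9-node cluster with one node already out of the ring.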


It has been sitting in this state for over a week now. I also tried manually removing the node using:

ncli host remove-start id=21 skip-space-check=True


It reports that the node removal was successfully initiated, but I see no change in the node's status.


Any help is greatly appreciated.


Best answer by runningahv 9 March 2021, 16:38
