Hi,
I have a 3-node cluster (call them node1, node2 and node3 - node3 is the node where the LCM framework has not updated).
After running an LCM inventory for some updates, the framework update seems to be stuck at 0 percent.
I can see the stuck tasks via the CLI:
ecli task.list include_completed=false
Task UUID Parent Task UUID Component Sequence-id Type Status
2743e560-1980-4aee-9e80-29ce76043953 08434196-8a53-44fd-aca3-7da4a23a2aae Genesis 3 Hypervisor reboot kRunning
08434196-8a53-44fd-aca3-7da4a23a2aae Genesis 1 Hypervisor rolling restart kRunning
d2878702-bb49-493b-b916-9e78eecb16fd lcm 275 kLcmRootTask kRunning
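For anyone wanting more detail on these, each task can also be dumped individually with ecli - a quick example using the rolling-restart UUID from the list above:

```
# Shows full task detail (create/last-updated times, percentage, subtasks)
ecli task.get 08434196-8a53-44fd-aca3-7da4a23a2aae
```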
Note that there is also a stuck reboot task (post-LCM update). I had tried to use the GUI to reboot the node where the LCM framework had not updated (Prism > cog > Reboot) - the reboot itself seems to have worked, but something in the task list hasn't updated properly:
Hypervisor reboot: Hypervisor reboot completed - 100% - succeeded
Hypervisor reboot: Hypervisor is in maintenance mode, Rebooting ... - 70% - running - 4 days 32 minutes
Hypervisor rolling reboot initiated successfully - 100% - succeeded

The second entry is the one stuck at 70 percent.
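Since that stuck task claims the hypervisor is still in maintenance mode, I also checked whether the host is actually schedulable again - a quick check from any CVM (this is AHV, so acli):

```
# 'Schedulable' should show True once a host is out of maintenance mode
acli host.list
```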
Since then I've rebooted every AHV host and CVM in the cluster one by one using the Nutanix CLI process (roughly the steps sketched below), but the tasks remain stuck. Cluster health is good.
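For reference, the per-node process I used was roughly this (a sketch - the host IP is a placeholder):

```
# Put the AHV host into maintenance mode so VMs migrate off
acli host.enter_maintenance_mode <host-ip>

# Gracefully shut down that node's CVM (run on the CVM itself)
cvm_shutdown -P now

# ...reboot the AHV host, wait for the CVM and services to come back up...

# Take the host out of maintenance mode again
acli host.exit_maintenance_mode <host-ip>
```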
From the logs:
~/data/logs$ tail -50 lcm_op.trace | grep lcm
lcm_upgrade_status is attempting to connect to Zookeeper
That line appears over and over, but the connection never seems to complete.
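Given that message, I also sanity-checked that Zookeeper itself is answering - a basic liveness probe, assuming the standard Nutanix Zookeeper client port of 9876:

```
# Zookeeper's four-letter-word health check; a healthy server replies 'imok'
echo ruok | nc localhost 9876
```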
~/data/logs$ tail -50 zookeeper.out
ERROR genesis_utils.py:571 Unable to find node_uuid for **node3**
ERROR lcm_genesis.py:455 Failed to get svm id
INFO catalog_staging.py:757 Removing /home/nutanix/tmp/lcm_staging/94eb35b1-b557-4de3-91bc-4dff182022f9 after extraction with rm -rf /home/nutanix/tmp/lcm_staging/94eb35b1-b557-4de3-91bc-4dff182022f9
INFO command_execute.py:55 Attempt 0 to execute rm -rf /home/nutanix/tmp/lcm_staging/94eb35b1-b557-4de3-91bc-4dff182022f9 on **node3**
INFO framework_updater.py:581 Successfully staged LCM update module for LCM Version 2.5.32269 at /home/nutanix/tmp/lcm_staging
INFO framework_updater.py:585 Staging LCM image module for version 2.5.32269 on cvm **node3**
INFO catalog_staging.py:610 Prep remote staging area /home/nutanix/tmp/lcm_staging/framework_image
ERROR genesis_utils.py:571 Unable to find node_uuid for **node3**
ERROR lcm_genesis.py:455 Failed to get svm id
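Those "Unable to find node_uuid" / "Failed to get svm id" errors make me suspect a mismatch between the node UUID genesis has for node3 and what's registered in Zeus. Two standard CVM utilities I used to compare them:

```
# Host UUIDs as ncli reports them
ncli host list

# Node UUIDs as recorded in the Zeus config
zeus_config_printer | grep -i uuid
```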
~/cluster/lib/py$ lcm_auto_upgrade_status
Intended update version: 2.5.32269
2.5.32269: **node1**,**node2**
Nodes not on Intended update version
2.4.4.1.28720: **node3**
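Since node3 is the one stuck on the old framework version, and genesis is what stages the framework update, one thing I'm considering is restarting genesis so LCM re-attempts the staging - either on node3's CVM alone or cluster-wide:

```
# On node3's CVM only
genesis restart

# Or on every CVM in the cluster
allssh genesis restart
```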
cluster status | egrep -i "zeus|CVM:"
2022-09-30 10:53:28,694Z INFO zookeeper_session.py:176 cluster is attempting to connect to Zookeeper
2022-09-30 10:53:28,699Z INFO cluster:2729 Executing action status on SVMs **node1**,**node2**, **node3**
CVM: **node1** Up, ZeusLeader
Zeus UP [29344, 29389, 29390, 29391, 29411, 29429]
CVM: **node2** Up
Zeus UP [10640, 10675, 10676, 13368, 13384, 13401]
2022-09-30 10:53:31,316Z INFO cluster:2888 Success!
CVM: **node3** Up
Zeus UP [3600, 3638, 3639, 3640, 3649, 3667]
As it's Community Edition I'll need to solve this myself - I'm wondering if anyone has run into this issue with LCM before, and whether it is safe to use the following to bypass or end the tasks:
ergon_update_task --task_uuid='XXXXXXXXXXXX' --task_status=succeeded (or aborted)
Are you sure you want to continue? (y/n)
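If it is safe, I assume the right order would be leaf tasks first and the kLcmRootTask last, so a parent never gets marked done while a child is still kRunning - a sketch using the UUIDs from my task list above:

```
# Child hypervisor-reboot task first
ergon_update_task --task_uuid='2743e560-1980-4aee-9e80-29ce76043953' --task_status=succeeded
# Then its parent, the rolling-restart task
ergon_update_task --task_uuid='08434196-8a53-44fd-aca3-7da4a23a2aae' --task_status=succeeded
# Finally the LCM root task
ergon_update_task --task_uuid='d2878702-bb49-493b-b916-9e78eecb16fd' --task_status=succeeded
```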