Question

LCM Stuck With Framework Update


Hi,

I have a 3-node cluster (call them nodes 1, 2 and 3; node3 is the node where the LCM framework has not updated).

After running an LCM inventory for some updates, it seems to be stuck at 0 percent on the framework update.

I can see the stuck tasks here via cli:

ecli task.list include_completed=false
Task UUID                             Parent Task UUID                      Component  Sequence-id  Type                        Status
2743e560-1980-4aee-9e80-29ce76043953  08434196-8a53-44fd-aca3-7da4a23a2aae  Genesis    3            Hypervisor reboot           kRunning
08434196-8a53-44fd-aca3-7da4a23a2aae                                        Genesis    1            Hypervisor rolling restart  kRunning
d2878702-bb49-493b-b916-9e78eecb16fd                                        lcm        275          kLcmRootTask                kRunning
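
To make the parent/child relationship in that listing easier to see, here is a small Python sketch that works purely on the pasted text. It assumes columns are separated by two or more spaces and that a row with only five fields has an empty Parent Task UUID column. The helper name `parse_task_list` is hypothetical and not part of any Nutanix tooling.

```python
import re

# Pasted output of `ecli task.list include_completed=false` from this post.
TASK_LIST = """\
Task UUID                             Parent Task UUID                      Component  Sequence-id  Type                        Status
2743e560-1980-4aee-9e80-29ce76043953  08434196-8a53-44fd-aca3-7da4a23a2aae  Genesis    3            Hypervisor reboot           kRunning
08434196-8a53-44fd-aca3-7da4a23a2aae                                        Genesis    1            Hypervisor rolling restart  kRunning
d2878702-bb49-493b-b916-9e78eecb16fd                                        lcm        275          kLcmRootTask                kRunning
"""


def parse_task_list(text):
    """Hypothetical helper: turn the pasted table into a list of dicts."""
    tasks = []
    for line in text.splitlines()[1:]:  # skip the header row
        fields = re.split(r"\s{2,}", line.strip())
        if len(fields) == 5:
            # Assumption: five fields means the parent column was blank.
            fields.insert(1, "")
        if len(fields) != 6:
            continue  # ignore anything that does not look like a task row
        uuid, parent, component, seq, task_type, status = fields
        tasks.append({
            "uuid": uuid,
            "parent": parent,
            "component": component,
            "sequence_id": int(seq),
            "type": task_type,
            "status": status,
        })
    return tasks


tasks = parse_task_list(TASK_LIST)
roots = [t for t in tasks if not t["parent"]]
children = [t for t in tasks if t["parent"]]

for t in roots:
    print("root:", t["component"], t["type"], t["status"])
for t in children:
    print("child of", t["parent"], ":", t["type"], t["status"])
```

Read this way, the "Hypervisor reboot" task is a child of the "Hypervisor rolling restart" task, and the LCM root task is a separate top-level task.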

Note that there is also a reboot task (post LCM update) stuck, because I tried to use the GUI (Prism > cog > Reboot) to restart the node where the LCM framework had not updated. It seems the reboot worked, but something in the task list hasn't updated properly:

Hypervisor reboot: Hypervisor reboot completed - 100% - succeeded
Hypervisor reboot: Hypervisor is in maintenance mode, Rebooting ... - 70% - running - 4 days 32 minutes
Hypervisor rolling reboot initiated successfully - 100% - succeeded

The 'Hypervisor is in maintenance mode, Rebooting ...' task is stuck at 70 percent.

Since then I've rebooted every AHV host and CVM in the cluster one by one using the Nutanix CLI process, but the tasks remain stuck. Cluster health is good.

 

From the logs:

~/data/logs$ tail -50 lcm_op.trace | grep lcm

lcm_upgrade_status is attempting to connect to Zookeeper

is mentioned a lot, but the connection doesn't seem to complete.

 

:~/data/logs$ tail -50 zookeeper.out

ERROR genesis_utils.py:571 Unable to find node_uuid for **node3**
ERROR lcm_genesis.py:455 Failed to get svm id
INFO catalog_staging.py:757 Removing /home/nutanix/tmp/lcm_staging/94eb35b1-b557-4de3-91bc-4dff182022f9 after extraction with rm -rf /home/nutanix/tmp/lcm_staging/94eb35b1-b557-4de3-91bc-4dff182022f9
INFO command_execute.py:55 Attempt 0 to execute rm -rf /home/nutanix/tmp/lcm_staging/94eb35b1-b557-4de3-91bc-4dff182022f9 on **node3**
INFO framework_updater.py:581 Successfully staged LCM update module for LCM Version 2.5.32269 at /home/nutanix/tmp/lcm_staging
INFO framework_updater.py:585 Staging LCM image module for version 2.5.32269 on cvm **node3**
INFO catalog_staging.py:610 Prep remote staging area /home/nutanix/tmp/lcm_staging/framework_image
ERROR genesis_utils.py:571 Unable to find node_uuid for **node3**
ERROR lcm_genesis.py:455 Failed to get svm id

 

~/cluster/lib/py$ lcm_auto_upgrade_status
Intended update version: 2.5.32269
2.5.32269: **node1**,**node2**
Nodes not on Intended update version
2.4.4.1.28720: **node3**
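
As a sketch of how that output can be read programmatically, the Python below groups nodes by LCM version and lists the ones that are not on the intended version. It only parses the text pasted above (with the node names written without the `**` redaction markers). The function name `nodes_by_version` is hypothetical, and the format is assumed from this one sample rather than from any documented spec.

```python
# Pasted output of lcm_auto_upgrade_status, node names un-bolded.
STATUS_OUTPUT = """\
Intended update version: 2.5.32269
2.5.32269: node1,node2
Nodes not on Intended update version
2.4.4.1.28720: node3
"""


def nodes_by_version(text):
    """Hypothetical helper: return (intended_version, {version: [nodes]})."""
    intended = None
    versions = {}
    for raw in text.splitlines():
        line = raw.strip()
        if line.startswith("Intended update version:"):
            intended = line.split(":", 1)[1].strip()
        elif ":" in line:
            # Assumed row format: "<version>: <node>,<node>,..."
            version, nodes = line.split(":", 1)
            versions[version.strip()] = [
                n.strip() for n in nodes.split(",") if n.strip()
            ]
    return intended, versions


intended, versions = nodes_by_version(STATUS_OUTPUT)
lagging = {v: n for v, n in versions.items() if v != intended}
print("intended:", intended)
print("nodes not on intended version:", lagging)
```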

 

cluster status | egrep -i "zeus|CVM:"
2022-09-30 10:53:28,694Z INFO zookeeper_session.py:176 cluster is attempting to connect to Zookeeper
2022-09-30 10:53:28,699Z INFO cluster:2729 Executing action status on SVMs **node1**,**node2**, **node3**
        CVM: **node1** Up, ZeusLeader
                                Zeus   UP       [29344, 29389, 29390, 29391, 29411, 29429]
        CVM: **node2** Up
                                Zeus   UP       [10640, 10675, 10676, 13368, 13384, 13401]
2022-09-30 10:53:31,316Z INFO cluster:2888 Success!
        CVM: **node3** Up
                                Zeus   UP       [3600, 3638, 3639, 3640, 3649, 3667]

 

As it's Community Edition I'll need to try and solve this myself. I'm wondering if anyone has run into this issue with LCM before, and whether using the following command to bypass or end the stuck tasks is safe to do with LCM?

ergon_update_task --task_uuid='XXXXXXXXXXXX' --task_status=succeeded or (aborted)

WARNING: Using this command can cause database corruption and complete system failure, if used improperly.
Are you sure you want to continue? (y/n)
 
I'd like to keep interruption of the cluster workload to a minimum.
 

14 replies

Userlevel 5
Badge +7

Hi Rob,

I hope you're well. I'm pretty sure I've used this same ecli to mark LCM tasks aborted (LCM inventory tasks that just sat forever). I'm not sure if it'll help with the issue, but I do think it'd make the logs more interesting from a fresh run.

Userlevel 5
Badge +7

Might also be interesting to see how Steve fixed something similar.

https://next.nutanix.com/how-it-works-22/lcm-framework-stuck-updating-3927

Though yours is talking about not being able to find the uuid, I wonder if you can find the uuid from the earlier logs?

Badge +1

Thanks for the reply - that's exactly one of the previous discussions I was looking at to help fix this. Not many are the same issue, I think, but they are similar.

Do you know what might be missing the UUID? A config file used by a service, or some other field or list?

I'm not clear from the logs exactly which component reports 'Unable to find node_uuid for **node3**'.

I'll see about ending the tasks – may take some time to organize to make sure interested parties are all ok with my changes.

Userlevel 5
Badge +7

Hey,

I’m not sure but I see the uuid from ncli host ls in my zookeeper.out log.

Might be worth looking in ncli host ls and confirming you have a UUID for each host first, then having a grep through the zookeeper logs to see if any of those show up, or better, what format the UUIDs of the other hosts take in there.

Badge +1

Interesting update

nutanix@NTNX-0fcb0e3e-A-CVM:X.Y.Z.238:~$ acli host.list
#Hypervisor IP  Hypervisor DNS Name  Host UUID                             Compute Only  Schedulable  Hypervisor Type  Hypervisor Name  CVM IP
X.Y.Z.247   X.Y.Z.247         d4c6a6b9-ad84-429f-8ac7-d4b671e5c1e7  False         True         kKvm             AHV              X.Y.Z.237
X.Y.Z.248   X.Y.Z.248         1f918e7b-b20d-4608-aa9d-e282ada101bd  False         True         kKvm             AHV              X.Y.Z.238
X.Y.Z.249   X.Y.Z.249         050e9777-d7b3-4957-a1a0-3f6a060611ac  False         True         kKvm             AHV              X.Y.Z.239   ****problem node****

nutanix@NTNX-7e11bd65-A-CVM:X.Y.Z.239:~$ zkcli ls /appliance/logical/lcm/mercury_config
1f918e7b-b20d-4608-aa9d-e282ada101bd
d4c6a6b9-ad84-429f-8ac7-d4b671e5c1e7

Seems that uuid is missing from here - 050e9777-d7b3-4957-a1a0-3f6a060611ac
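
The comparison behind that observation is just a set difference between the host UUIDs from `acli host.list` and the child names under `/appliance/logical/lcm/mercury_config`. Here is a minimal Python sketch using the values pasted above. The variable names are hypothetical, and the data is copied in by hand rather than fetched from the cluster.

```python
# CVM IP -> Host UUID, copied from the `acli host.list` output above.
HOST_UUIDS = {
    "X.Y.Z.237": "d4c6a6b9-ad84-429f-8ac7-d4b671e5c1e7",
    "X.Y.Z.238": "1f918e7b-b20d-4608-aa9d-e282ada101bd",
    "X.Y.Z.239": "050e9777-d7b3-4957-a1a0-3f6a060611ac",
}

# Children listed by `zkcli ls /appliance/logical/lcm/mercury_config` above.
ZK_CHILDREN = [
    "1f918e7b-b20d-4608-aa9d-e282ada101bd",
    "d4c6a6b9-ad84-429f-8ac7-d4b671e5c1e7",
]

present = set(ZK_CHILDREN)

# Hosts whose UUID has no matching entry under mercury_config.
missing = {cvm: uuid for cvm, uuid in HOST_UUIDS.items() if uuid not in present}

# Entries under mercury_config that match no known host (none expected here).
unknown = sorted(present - set(HOST_UUIDS.values()))

for cvm, uuid in missing.items():
    print(f"CVM {cvm}: host UUID {uuid} has no mercury_config entry")
```

With the values from this thread, the only host without an entry is the one behind CVM X.Y.Z.239, which is the problem node.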

It's not clear what these files actually do:
nutanix@NTNX-7e11bd65-A-CVM:X.Y.Z.239:~$ zkcli cat /appliance/logical/lcm/mercury_config/1f918e7b-b20d-4608-aa9d-e282ada101bd
1.3
nutanix@NTNX-7e11bd65-A-CVM:X.Y.Z.239:~$ zkcli cat /appliance/logical/lcm/mercury_config/d4c6a6b9-ad84-429f-8ac7-d4b671e5c1e7
1.3

I suspect I need to run: zkcli write /appliance/logical/lcm/mercury_config/050e9777-d7b3-4957-a1a0-3f6a060611ac

Badge +1

Also, does anyone know if running 'genesis restart' on the current leader will cause issues with the workload? I presume I may need to do this to trigger the leader to switch. If there is any doubt I can do a manual node restart from the CLI.

Userlevel 5
Badge +7

I’d agree on wanting to write that in similar to Steve’s. Spinning genesis should be fine, it is one of the first things I do whenever I’m troubleshooting something because Genesis knows all ;)

Badge +1

Update.

 

I created the missing file using zkcli write /appliance/logical/lcm/mercury_config/050e9777-d7b3-4957a0-3f6a060611ac

I then rebooted every node sequentially using maintenance mode and the CLI.

No change: the LCM and reboot tasks were still stuck.

Tried genesis restart command - no effect

I then ran

ergon_update_task --task_uuid='XXXXXXXXXXXX' --task_status=succeeded or (aborted)

against the affected tasks. The tasks are now cleared, but the LCM framework page is still stuck with the message:

‘ LCM Framework Update in Progress, please check back when the update process is completed. ‘

‘ Error calling Entities : Error: Not Found ‘

I then ran another set of restarts on nodes

 

The 2 nodes with lcm framework up to date have this output for their uuids
nutanix@NTNX-e0c1425c-A-CVM:X.Y.Z.237:~/data/logs$ allssh zkcli cat /appliance/logical/lcm/mercury_config/d4c6a6b9-ad84-429f-8ac7-d4b671e5c1e7
================== X.Y.Z.238 =================
1.3================== X.Y.Z.239 =================
1.3================== X.Y.Z.237 =================
1.3

The new one I created using zkcli write has no value:
nutanix@NTNX-e0c1425c-A-CVM:X.Y.Z.237:~/data/logs$ allssh zkcli cat /appliance/logical/lcm/mercury_config/050e9777-d7b3-4957a0-3f6a060611ac
================== X.Y.Z.238 =================
================== X.Y.Z.239 =================
================== X.Y.Z.237 =================

Does anyone know the significance of the 1.3 value?

I checked it against a different, fully supported Nutanix cluster running on NX hardware, and there these files have the value true.
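
One thing worth checking in the commands above: the path used for the new entry ends in `050e9777-d7b3-4957a0-3f6a060611ac`, while `acli host.list` shows the host UUID as `050e9777-d7b3-4957-a1a0-3f6a060611ac`. That may only be a slip when pasting into this post, but if the entry really was created under the shorter name it would not match the host. A quick Python sketch for checking that a string has the canonical 8-4-4-4-12 UUID shape before using it in a path (the function name `is_canonical_uuid` is hypothetical):

```python
import re

# Canonical textual UUID form: 8-4-4-4-12 hexadecimal digits.
UUID_RE = re.compile(
    r"^[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}$"
)


def is_canonical_uuid(value):
    """Hypothetical helper: True if value looks like a canonical UUID."""
    return bool(UUID_RE.match(value.strip().lower()))


from_acli = "050e9777-d7b3-4957-a1a0-3f6a060611ac"  # from acli host.list
from_zkcli_path = "050e9777-d7b3-4957a0-3f6a060611ac"  # as typed in the zkcli commands

for label, value in (("acli", from_acli), ("zkcli path", from_zkcli_path)):
    print(label, value, "ok" if is_canonical_uuid(value) else "NOT a canonical UUID")

print("same string:", from_acli == from_zkcli_path)
```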

Userlevel 5
Badge +7

Hmm, I'm not sure sorry.

I wonder if it's worth writing 1.3 on the end for consistency anyway and see if lcm changes at all.

Totally guessing now though I'm afraid!

Badge +1

Support asked if I could remove the 'ergon_update_task' commands from my post.

I don't think I can do this (I can't edit anything), so I'm leaving a note for others: you shouldn't do this without direct interaction with Nutanix Support.

I think they want to avoid users doing this as a normal operation or troubleshooting process.

Badge +1

As an update: no further progress on fixing LCM, and there are still issues with updating it.

The cluster seems to be running with OK health (it recently recovered from a failed state due to hosting issues).

Prism reports a slightly different error now:

Operation failed. Reason: Prechecks failed: Found remote version None while 2.5.0.3.33256 was expected on node 10.15.17.239. This may be a caching issue. Please ensure all local caches are cleared and wait a few minutes for any remote caches to get invalidated before retrying. Please check KB 7784

This seems related to the LCM version mismatch between nodes that the CVM commands showed.

I think it might be an idea to remove the node from the cluster completely and re-add it. From what I can tell, I need 4 nodes to do this via this process: https://portal.nutanix.com/page/documents/details?targetId=Web-Console-Guide-Prism-v6_5:wc-removing-node-pc-c.html

Did you ever resolve this?


Badge +1

Hi,

No, sorry, I couldn't fix the LCM error directly from the CLI. I still plan to add a new node temporarily, then delete and re-add the (broken LCM) node, but I haven't got around to this yet.

 

 

I've just got off a call with Nutanix support, and we ended up fixing it by deploying the new version of the LCM Framework via CLI on the LCM_Leader CVM.
