Solved

LCM Framework Stuck Updating



Hi,

I can’t access LCM in my CE cluster. It just says “LCM Framework Update is in progress. Please check back when the update process is completed.” It’s been like this for days. Updating Foundation and NCC worked fine the old way.

I can’t see any active tasks in Prism or with ecli task.list.

Any ideas?


Best answer by SteveCoops 10 March 2021, 17:26


12 replies


I’ve noticed that if I run an NCC check, I get a load of these:

INFO: Cluster/node reports it is currently undergoing maintenance/upgrade. This health check plugin is disabled during this workflow to avoid inaccurate results or alerts but will run again when the workflow completes. See KB4999 for more details.

But ncli host ls shows false for maintenance mode on all three nodes.

I found some other scripts, but again they say nothing is happening:

nutanix@NTNX-afcd9aa8-A-CVM:192.168.250.236:/usr/local/nutanix/cluster/bin/lcm$ ./lcm_upgrade_status
Ongoing upgrades:
No upgrade is in progress

Finished upgrades:
Up to 5 previously finished upgrade batches listed in descending order of upgrade start time:
nutanix@NTNX-afcd9aa8-A-CVM:192.168.250.236:/usr/local/nutanix/cluster/bin/lcm$

nutanix@NTNX-afcd9aa8-A-CVM:192.168.250.236:/usr/local/nutanix/cluster/bin/lcm$ lcm_auto_upgrade_status
No autoupdate in progress


I just tried moving the lcm_leader to another node (no idea how to do this so I just rebooted the CVM the lcm_leader was on!) and it now just hangs on “Loading...” when going to the LCM page in Prism.


Take a look at this article: LCM: (Life Cycle Manager) Troubleshooting Guide


Unfortunately there’s nothing there that helps with the state it’s in. These don’t exist:

~/data/logs/lcm_ops.out
~/data/logs/lcm_wget.log

There are no stuck tasks in ergon.out and nothing interesting in genesis.out.

The ….lcm/lcm path doesn’t exist, so no issue there.

I think I need to clear the upgrade status that’s set somewhere for NCC to report this against most of its checks:

Detailed information for ahv_version_check:
Node 192.168.250.237: 
INFO: Cluster/node reports it is currently undergoing maintenance/upgrade. This health check plugin is disabled during this workflow to avoid inaccurate results or alerts but will run again when the workflow completes. See KB4999 for more details.


Yeah so these both say no upgrade is taking place:

nutanix@NTNX-5de7c188-A-CVM:192.168.250.235:~/cluster/bin/lcm$ upgrade_status
2021-03-09 14:47:36,142Z INFO zookeeper_session.py:176 upgrade_status is attempting to connect to Zookeeper
2021-03-09 14:47:36,149Z INFO upgrade_status:38 Target release version: el7.3-release-ce-2020.09.16-stable-d4fc219b73b4181935a3a19465eb922313fc735f
2021-03-09 14:47:36,152Z INFO upgrade_status:103 SVM 192.168.250.235 is up to date
2021-03-09 14:47:36,152Z INFO upgrade_status:103 SVM 192.168.250.236 is up to date
2021-03-09 14:47:36,153Z INFO upgrade_status:103 SVM 192.168.250.237 is up to date

nutanix@NTNX-5de7c188-A-CVM:192.168.250.235:~/cluster/bin/lcm$ host_upgrade_status
2021-03-09 14:47:56,572Z INFO zookeeper_session.py:176 host_upgrade_status is attempting to connect to Zookeeper
Automatic Hypervisor upgrade: Enabled
Target host version: None

But NCC says:

Detailed information for cluster_active_upgrade_check:
Node 192.168.250.235:
INFO: ['NOS', 'Hypervisor', 'Firmware'] being upgraded
Refer to KB 5277 (http://portal.nutanix.com/kb/5277) for details on cluster_active_upgrade_check or Recheck with: ncc health_checks system_checks cluster_active_upgrade_check
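The contradiction above (upgrade_status says every CVM is up to date, while NCC still lists components as being upgraded) can be checked mechanically. Here is a minimal Python sketch that only parses pasted text copied from the outputs in this post. The regular expressions, function names, and the `cvm_ips` set are my own assumptions for illustration and are not part of any Nutanix tool.

```python
import ast
import re

# Assumed patterns, based only on the output lines quoted in this thread.
UP_TO_DATE_RE = re.compile(r"SVM (\d{1,3}(?:\.\d{1,3}){3}) is up to date")
NCC_UPGRADE_RE = re.compile(r"INFO: (\[[^\]]*\]) being upgraded")


def svms_up_to_date(status_output):
    """Return the CVM IPs that upgrade_status reports as up to date."""
    return set(UP_TO_DATE_RE.findall(status_output))


def ncc_components_being_upgraded(ncc_output):
    """Return the component list from the NCC message, or [] if absent."""
    match = NCC_UPGRADE_RE.search(ncc_output)
    return ast.literal_eval(match.group(1)) if match else []


status_sample = """\
2021-03-09 14:47:36,152Z INFO upgrade_status:103 SVM 192.168.250.235 is up to date
2021-03-09 14:47:36,152Z INFO upgrade_status:103 SVM 192.168.250.236 is up to date
2021-03-09 14:47:36,153Z INFO upgrade_status:103 SVM 192.168.250.237 is up to date
"""
ncc_sample = "INFO: ['NOS', 'Hypervisor', 'Firmware'] being upgraded"

# The three CVM addresses that appear in this thread.
cvm_ips = {"192.168.250.235", "192.168.250.236", "192.168.250.237"}

pending = cvm_ips - svms_up_to_date(status_sample)
components = ncc_components_being_upgraded(ncc_sample)
# If no CVM is pending but NCC still lists components, a stale flag is likely.
stale_flag_suspected = not pending and bool(components)

print("CVMs not confirmed up to date:", sorted(pending))
print("NCC says being upgraded:", components)
print("Stale upgrade flag suspected:", stale_flag_suspected)
```

This does not touch the cluster at all. It just makes the mismatch explicit, which is what points towards a leftover state flag rather than a real upgrade.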


Shot in the dark, can you upgrade NCC? It will effectively restart the process.


No, I’m already on the latest :(

After leaving it overnight, today I’ve got to a different stage. When I go to the LCM page it just says “Waiting for LCM Framework to start...”

Edit: after leaving it for a few minutes, it’s back to “LCM Framework Update in Progress, please check back when the update process is completed.”

I notice this repeats over and over in the genesis log:

2021-03-10 10:23:17,117Z INFO ergon_utils.py:825 No LCM operation running
2021-03-10 10:23:17,118Z INFO ergon_utils.py:1595 Cannot find root task uuid
2021-03-10 10:23:17,121Z INFO zeus_utils.py:413 Zk node /appliance/logical/lcm/mercury_config/f58690da-9d18-400c-8a6a-f7d3a4e09b28 doesn't exist
2021-03-10 10:23:17,121Z INFO framework.py:2196 Mercury config is in progress. Returning Autoupdate

 

Edit 2: I think I’m on to something. “f58690da-9d18-400c-8a6a-f7d3a4e09b28” is the ID of my third node, and it’s missing from the list:

nutanix@NTNX-5de7c188-A-CVM:192.168.250.235:~/data/logs$ zkls /appliance/logical/lcm/mercury_config
5d0b2664-9ace-4dcf-9a0e-35ca8ee0910e
c555b5dd-6841-4ec0-a87d-54f3454c213d
nutanix@NTNX-5de7c188-A-CVM:192.168.250.235:~/data/logs$

The other two IDs there are the other two nodes.
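The comparison I did by eye here (the UUID genesis complains about versus the children that zkls lists) can be sketched in a few lines of Python. This is only a sketch working on pasted text. The regex is an assumption based on the single genesis.out line quoted in this thread, and the function names are made up for illustration.

```python
import re

# Assumed pattern, modelled on the genesis.out line quoted in this thread.
MISSING_RE = re.compile(
    r"Zk node /appliance/logical/lcm/mercury_config/([0-9a-f-]{36}) doesn['’]t exist"
)


def missing_uuids_from_log(log_text):
    """Return UUIDs that genesis reports as missing, in first-seen order."""
    seen = []
    for match in MISSING_RE.finditer(log_text):
        node_id = match.group(1)
        if node_id not in seen:
            seen.append(node_id)
    return seen


def parse_zkls(listing_text):
    """Return the set of child names printed by zkls, one per line."""
    return {line.strip() for line in listing_text.splitlines() if line.strip()}


genesis_sample = """\
2021-03-10 10:23:17,117Z INFO ergon_utils.py:825 No LCM operation running
2021-03-10 10:23:17,121Z INFO zeus_utils.py:413 Zk node /appliance/logical/lcm/mercury_config/f58690da-9d18-400c-8a6a-f7d3a4e09b28 doesn't exist
2021-03-10 10:23:17,121Z INFO framework.py:2196 Mercury config is in progress. Returning Autoupdate
"""
zkls_sample = """\
5d0b2664-9ace-4dcf-9a0e-35ca8ee0910e
c555b5dd-6841-4ec0-a87d-54f3454c213d
"""

reported = missing_uuids_from_log(genesis_sample)
present = parse_zkls(zkls_sample)
# Keep only the UUIDs genesis asks for that zkls does not list.
still_missing = [node_id for node_id in reported if node_id not in present]
print(still_missing)
```

With the samples from this post it prints the one UUID that genesis wants but that is not under mercury_config.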

 

How do I create this “thing”? (Is it a folder?)

Cheers,

Steve


I just had a play and I fixed it! It’s now doing a framework update.

I *think* the command that did it was

zkcli write /appliance/logical/lcm/mercury_config/f58690da-9d18-400c-8a6a-f7d3a4e09b28

as that ID now shows up when I list the path. I ran so many different commands that I’m not 100% sure it was that one though :)

Cheers,

Steve


Oh, I had to do it again for the other hosts as well, since I had a similar error in the genesis logs. Then I restarted genesis, the LCM leader moved to another host, and now the LCM prechecks have completed and inventory is running!
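Since I ended up repeating the same step for more than one host, here is a small dry-run sketch in Python that builds the command lines for a list of missing UUIDs and only prints them for review. It runs nothing against the cluster. The command form is copied from my earlier post, where I was not 100% sure it was the one that worked, so treat it as an assumption. The function name and constant are hypothetical.

```python
import uuid

# Path taken from the genesis.out message quoted earlier in this thread.
MERCURY_CONFIG_PATH = "/appliance/logical/lcm/mercury_config"


def build_write_commands(missing_uuids):
    """Return one 'zkcli write' command string per missing node UUID.

    The command form is an assumption copied from the earlier post.
    Nothing is executed here; the strings are for manual review only.
    """
    commands = []
    for value in missing_uuids:
        # uuid.UUID raises ValueError on a malformed ID, so typos fail early.
        node_id = str(uuid.UUID(value))
        commands.append(f"zkcli write {MERCURY_CONFIG_PATH}/{node_id}")
    return commands


missing = ["f58690da-9d18-400c-8a6a-f7d3a4e09b28"]
commands = build_write_commands(missing)
for command in commands:
    print(command)
```

Printing instead of executing is deliberate: on anything other than a CE or lab cluster I would check with support before writing to Zookeeper by hand.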


Woohoo, all done and updates performed!


Also, NCC is clear now; it no longer says an upgrade is in progress.


Wow! Great work!