
Disk Firmware Update Stuck

4 September 2019

Hi guys,

I can't seem to perform an inventory on LCM any more.

It fails on the check:
Check 'test_upgrade_in_progress' failed with 'Failure reason: Another Upgrade operation is in progress. Please wait for that operation to complete before starting an LCM operation.'

However, running

progress_monitor_cli --fetchall

does not show anything in progress.
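(I believe ecli can also list outstanding tasks on this AOS version; the syntax below is from memory, so treat it as a sketch and double-check before relying on it.)

# list tasks that have not yet completed (flag spelling from memory)
ecli task.list include_completed=false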

A host list shows false (life_cycle_management) for all 3 hosts, which is as expected.

~/data/logs$ upgrade_status
2019-09-04 14:47:32 INFO zookeeper_session.py:131 upgrade_status is attempting to connect to Zookeeper
2019-09-04 14:47:32 INFO upgrade_status:38 Target release version: el7.3-release-euphrates-5.10.6-stable-294f5f671ba8982a0199e18b756e8ef3a453af9a
2019-09-04 14:47:32 INFO upgrade_status:43 Cluster upgrade method is set to: automatic rolling upgrade
2019-09-04 14:47:32 INFO upgrade_status:96 SVM 10.x.x.x is up to date
2019-09-04 14:47:32 INFO upgrade_status:96 SVM 10.x.x.x is up to date
2019-09-04 14:47:32 INFO upgrade_status:96 SVM 10.x.x.x is up to date

I also noticed that lcm_ops.out is not reporting any output whatsoever.

If I try to stop the cluster and shut down the CVM, I get the following:

CRITICAL cvm_shutdown:152 An upgrade was found to be in progress on the cluster. Not proceeding with shutdown as it can cause several issues including possible downtime (see ENG-173549, ENG-191016). Please wait for the upgrade to finish
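(For reference, that message came from the standard cvm_shutdown wrapper; I believe the usual invocation is something like the below, though the exact flags may vary by AOS version.)

cvm_shutdown -P now   # graceful CVM shutdown; refuses to proceed while the cluster upgrade flag is set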


4 replies

Hi @Hugo1

Can you also provide the output of the following commands:

$ host_upgrade_status

$ firmware_upgrade_status

$ ncc health_checks system_checks cluster_active_upgrade_check

Also, after logging in to any CVM, locate the LCM leader using the command "lcm_leader", jump to the LCM leader, and grep the genesis.out log for firmware_upgrade, like so:

$ cat /home/nutanix/data/logs/genesis.out | grep firmware_upgrade
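For example, the full sequence might look like this (the leader IP is just a placeholder):

$ lcm_leader                                      # prints the LCM leader CVM
$ ssh <leader-ip>                                 # jump to the leader
$ grep firmware_upgrade ~/data/logs/genesis.out   # same grep, run on the leader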
Hi Richardson,
Thanks for your reply.

See output below:
host_upgrade_status
2019-09-04 21:12:08 INFO zookeeper_session.py:131 host_upgrade_status is attempting to connect to Zookeeper
Automatic Hypervisor upgrade: Enabled
Target host version: NoData
Upgrade Method: Automatic

firmware_upgrade_status
2019-09-04 21:13:05 INFO zookeeper_session.py:131 firmware_upgrade_status is attempting to connect to Zookeeper

Firmware information for disks:
-------------------------------
Slot  Disk  Boot  Model                Version  Hades-cur  Hades-tar  Plan-Out
------------------------------------------------------------------------------
1     sda   Yes   INTEL SSDSC2BX480G4  0140     0140       0140       No
2     sdb   Yes   INTEL SSDSC2BX480G4  0140     0140       0140       No
3     sdc   No    ST2000NX0253         SN05     SN05       SN02       No
4     sdd   No    ST2000NX0253         SN05     SN05       SN02       No
5     sde   No    ST2000NX0253         SN05     SN05       SN02       No
6     sdf   No    ST2000NX0253         SN05     SN05       SN02       No

Status of disk firmware upgrade:
--------------------------------
Successful/skipped for disks : /dev/sda, /dev/sdb
In progress for disks : None
Failed for disk : ['/dev/sdc', '/dev/sdd', '/dev/sde', '/dev/sdf']

Current status of disk firmware upgrade:
----------------------------------------
Firmware upgrade failed
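(Side note for anyone reading along: the firmware actually on each drive can be confirmed directly with smartctl; a sketch only, using the device names from the table above.)

sudo smartctl -i /dev/sdc | grep -i firmware   # prints the drive's reported firmware revision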

ncc health_checks system_checks cluster_active_upgrade_check

####################################
# TIMESTAMP : 09/04/2019 9:17:52 PM
####################################
ncc_version: 3.8.0-2ed69d02
cluster id: xxx
cluster name: xxx
node with service vm id 3
service vm external ip: xxx
hypervisor address list: [u'xxx']
hypervisor version: 6.5.0 build - 14320405 update - 3
ipmi address list: [u'xxx']
software version: euphrates-5.10.6-stable
software changeset ID: 294f5f671ba8982a0199e18b756e8ef3a453af9a
node serial: xxx
rackable unit: NX-3060-G4
node with service vm id 4
service vm external ip: xxx
hypervisor address list: [u'xxx']
hypervisor version: 6.5.0 build - 14320405 update - 3
ipmi address list: [u'xxx']
software version: euphrates-5.10.6-stable
software changeset ID: 294f5f671ba8982a0199e18b756e8ef3a453af9a
node serial: xxx
rackable unit: NX-3060-G4
node with service vm id 5
service vm external ip: xxx
hypervisor address list: [u'xxx']
hypervisor version: 6.5.0 build - 14320405 update - 3
ipmi address list: [u'xxx']
software version: euphrates-5.10.6-stable
software changeset ID: 294f5f671ba8982a0199e18b756e8ef3a453af9a
node serial: xxx
rackable unit: NX-3060-G4

Running : health_checks system_checks cluster_active_upgrade_check
[==================================================] 100%
/health_checks/system_checks/cluster_active_upgrade_check                [ PASS ]
--------------------------------------------------------------------------------
+-----------------------+
| State         | Count |
+-----------------------+
| Pass          |     1 |
+-----------------------+
| Total Plugins |     1 |
+-----------------------+
Plugin output written to /home/nutanix/data/logs/ncc-output-latest.log

cat /home/nutanix/data/logs/genesis.out | grep firmware_upgrade
2019-09-04 14:15:18 INFO firmware_upgrade_helper.py:855 Disk firmware upgrade is still in progress
2019-09-04 14:28:15 INFO firmware_upgrade_helper.py:855 Disk firmware upgrade is still in progress
2019-09-04 14:38:58 INFO firmware_upgrade_helper.py:855 Disk firmware upgrade is still in progress
2019-09-04 15:47:53 INFO firmware_upgrade_helper.py:855 Disk firmware upgrade is still in progress
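(Those lines just keep repeating with no actual progress. For anyone watching a similar stuck upgrade, the marker can be followed live with something like:)

tail -f ~/data/logs/genesis.out | grep firmware_upgrade   # watch for new 'still in progress' entries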
Thank you for the output. Please open a support ticket, and an available SRE will request a remote session to fix the issue. Unfortunately I cannot give you the commands, since they may lead to problems if not executed correctly; even the SRE will be monitored by another SRE during the change.

Also, to speed up the resolution, you may suggest that the SRE check the internal KB6021.
Thanks Richardson,

To be honest, support for this hardware expired a few months back and we migrated everything to another solution for the time being. We are already engaged to purchase some G7 systems, but I wanted to keep these G4s as a test lab.

Can you provide the commands with a disclaimer? The kit does not have anything running on it and has already been decommissioned from production.