Question

AHV upgrade failed with 'ahv_wait_for_host' action

  • 17 July 2021
  • 3 replies

Hi, everyone

AOS: 5.15.6

AHV: 20190916.410 / 20190916.564

hardware: Lenovo HX5510

After upgrading my cluster to AOS 5.15.6, I used LCM to upgrade the hosts’ AHV from version 20190916.410 to 20190916.564. Several hosts upgraded successfully, but one host failed with the following error:

Operation failed. Reason: LCM failed performing action ahv_wait_for_host in phase PreActions on ip address 10.248.51.32. Failed with error 'Host 10.248.51.12 did not complete upgrade stage one in 7200 seconds.' Logs have been collected and are available to download on 10.248.51.36 at /home/nutanix/data/log_collector/lcm_logs__10.248.51.36__2021-07-17_00-31-25.780130.tar.gz

I logged on to the IMM, connected to the remote console, and saw the following error on screen:

I downloaded the log bundle mentioned above but did not find anything useful. The ‘10.248.51.32’ directory (the CVM on host 10.248.51.12) is empty, and in ‘lcm_logger.out.2021-07-17_00-31-25.780130’ I only found records of failures to connect to host 10.248.51.12 or CVM 10.248.51.32.

Should I just power cycle the host?


3 replies

Hi Welsper,

Was the host (on which the upgrade failed) stuck in phoenix?

  • Connect to the CVM holding the LCM leader role (run the command lcm_leader from any CVM to find the leader).
  • Check the /home/nutanix/data/logs/lcm_ops.out log file and search for the keyword LcmActionsError.
  • Connect to the AHV host where the upgrade failed and check the /var/log/ahv_upgrade_627_firstboot.log file.
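
The keyword search in the second step can be sketched in Python as follows. This is only an illustration, not an LCM tool: the function name find_keyword and the context parameter are hypothetical, and the log path is passed in by the caller rather than assumed.

```python
from pathlib import Path


def find_keyword(log_path, keyword="LcmActionsError", context=2):
    """Return (line_number, snippet) pairs for every line containing keyword.

    Each snippet includes up to `context` lines before and after the match,
    which is usually enough to see the surrounding traceback.
    """
    # errors="replace" keeps the scan going if the log has odd bytes in it.
    lines = Path(log_path).read_text(errors="replace").splitlines()
    hits = []
    for index, line in enumerate(lines):
        if keyword in line:
            start = max(0, index - context)
            end = min(len(lines), index + context + 1)
            hits.append((index + 1, "\n".join(lines[start:end])))
    return hits
```

To use it, copy the log file off the LCM leader CVM and call find_keyword with its local path. Each returned tuple holds the 1-based line number of the match and the surrounding lines.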

We need to determine which stage the host is at and how far the upgrade process got.

Please contact Nutanix Support for a better resolution path on this.

 

Thanks

Hi Raaji, 

Thank you for your reply.

I’m not sure whether the host was stuck in Phoenix. The host’s screen is still printing error messages similar to those shown in my last post. When I try to SSH to the host and enter the username ‘root’, it replies with the banner message:

| Nutanix AHV

But then when I entered the password for root, the SSH window just closed without any prompt or error message.

I logged on to the LCM leader CVM, downloaded the lcm_ops.out log file, and found the following error messages:

2021-07-17 00:30:28 WARNING lcm_actions_helper.py:886 (10.248.51.12, update, 4343a246-939e-46c9-8d81-b678c2986a5f, upgrade stage [2/2]) Skipping recovery process because the recovery action list is empty.
2021-07-17 00:30:28 ERROR lcm_actions_helper.py:698 (10.248.51.12, update, 4343a246-939e-46c9-8d81-b678c2986a5f, upgrade stage [2/2]) 
Traceback (most recent call last):
  File "/usr/local/nutanix/cluster/bin/lcm/lcm_actions_helper.py", line 677, in execute_actions_with_wal
    raise LcmActionsError(err_msg, action_name, phase, ip_addr)
LcmActionsError: Host 10.248.51.12 did not complete upgrade stage one in 7200 seconds.

2021-07-17 00:30:28 ERROR lcm_actions_helper.py:365 (10.248.51.12, update, 4343a246-939e-46c9-8d81-b678c2986a5f, upgrade stage [2/2]) 
Traceback (most recent call last):
  File "/usr/local/nutanix/cluster/bin/lcm/lcm_actions_helper.py", line 352, in execute_actions
    metric_entity_proto=metric_entity_proto
  File "/usr/local/nutanix/cluster/bin/lcm/lcm_actions_helper.py", line 677, in execute_actions_with_wal
    raise LcmActionsError(err_msg, action_name, phase, ip_addr)
LcmActionsError: Host 10.248.51.12 did not complete upgrade stage one in 7200 seconds.

2021-07-17 00:30:28 INFO metric_entity.py:1557 (10.248.51.12, update, 4343a246-939e-46c9-8d81-b678c2986a5f, upgrade stage [2/2]) Exception report: {'error_type': 'LcmActionsError', 'kwargs': {'action': u'ahv_wait_for_host', 'phase': 'PreActions', 'err_msg': u'Host 10.248.51.12 did not complete upgrade stage one in 7200 seconds.', 'ip_addr': u'10.248.51.32'}}
2021-07-17 00:30:28 INFO cpdb.py:463 (10.248.51.12, update, 4343a246-939e-46c9-8d81-b678c2986a5f, upgrade stage [2/2]) Exception: Mark Update In Progress Failed (kIncorrectCasValue)
2021-07-17 00:30:28 ERROR lcm_cpdb.py:208 (10.248.51.12, update, 4343a246-939e-46c9-8d81-b678c2986a5f, upgrade stage [2/2]) Failed to update: 9d54bfe5-d43a-4f22-bd82-728a83190bd9
2021-07-17 00:30:28 INFO metric_entity.py:1047 (10.248.51.12, update, 4343a246-939e-46c9-8d81-b678c2986a5f, upgrade stage [2/2]) metric_entity 9d54bfe5-d43a-4f22-bd82-728a83190bd9 got CAS value 20L after automatic conflict resolution
2021-07-17 00:30:28 ERROR lcm_ops_by_host:1351 (10.248.51.12, update, 4343a246-939e-46c9-8d81-b678c2986a5f, upgrade stage [2/2]) lcm_ops_by_host encountered exception Host 10.248.51.12 did not complete upgrade stage one in 7200 seconds.. Traceback (most recent call last):
  File "/home/nutanix/cluster/bin/lcm/lcm_ops_by_host", line 1346, in _perform_multistage_operation_by_host
    parent_task_index=parent_task_index)
  File "/home/nutanix/cluster/bin/lcm/lcm_ops_by_host", line 437, in _run_operation_state_machine
    task_index, **extra_args)
  File "/usr/local/nutanix/cluster/bin/lcm/lcm_ops_utils.py", line 110, in __call__
    ret = self._execution(updater, args, kwargs)
  File "/usr/local/nutanix/cluster/bin/lcm/lcm_ops_utils.py", line 220, in _execution
    return self.__task_handler(*handler_args, **handler_kwargs)
  File "/home/nutanix/cluster/bin/lcm/lcm_ops_by_host", line 487, in _run_operation_state_machine_step
    metric_entity_proto=metric_entity_proto)
  File "/home/nutanix/cluster/bin/lcm/lcm_ops_by_host", line 553, in _execute_pre_actions
    metric_entity_proto=metric_entity_proto)
  File "/usr/local/nutanix/cluster/bin/lcm/lcm_actions_helper.py", line 352, in execute_actions
    metric_entity_proto=metric_entity_proto
  File "/usr/local/nutanix/cluster/bin/lcm/lcm_actions_helper.py", line 677, in execute_actions_with_wal
    raise LcmActionsError(err_msg, action_name, phase, ip_addr)
LcmActionsError: Host 10.248.51.12 did not complete upgrade stage one in 7200 seconds.

2021-07-17 00:30:28 ERROR lcm_ops_by_host:1358 (10.248.51.12, update, c272304c-42f3-40a8-9922-fd26cce49e84) Failed to perform upgrade stage 1/1
2021-07-17 00:30:28 INFO command_execute.py:55 (10.248.51.12, update, c272304c-42f3-40a8-9922-fd26cce49e84) Attempt 0 to execute sudo rm -rf /dev/shm/lcm_staging on 10.248.51.12
2021-07-17 00:30:29 ERROR catalog_staging.py:947 (10.248.51.12, update, c272304c-42f3-40a8-9922-fd26cce49e84) Failed to run sudo rm -rf /dev/shm/lcm_staging on 10.248.51.12 with ret: 254, out: , err: 
2021-07-17 00:30:29 INFO metric_entity.py:1557 (10.248.51.12, update, c272304c-42f3-40a8-9922-fd26cce49e84) Exception report: {'error_type': 'LcmActionsError', 'kwargs': {'action': u'ahv_wait_for_host', 'phase': 'PreActions', 'err_msg': u'Host 10.248.51.12 did not complete upgrade stage one in 7200 seconds.', 'ip_addr': u'10.248.51.32'}}
2021-07-17 00:30:29 INFO cpdb.py:463 (10.248.51.12, update, c272304c-42f3-40a8-9922-fd26cce49e84) Exception: Mark Update In Progress Failed (kIncorrectCasValue)
2021-07-17 00:30:29 ERROR lcm_cpdb.py:208 (10.248.51.12, update, c272304c-42f3-40a8-9922-fd26cce49e84) Failed to update: 252c8919-ced0-4fbf-b36a-475c64ce8e69
2021-07-17 00:30:29 INFO metric_entity.py:1047 (10.248.51.12, update, c272304c-42f3-40a8-9922-fd26cce49e84) metric_entity 252c8919-ced0-4fbf-b36a-475c64ce8e69 got CAS value 20L after automatic conflict resolution
2021-07-17 00:30:29 ERROR lcm_ops_by_host:362 (update) Failed to perform operation.
2021-07-17 00:30:29 INFO command_execute.py:55 (update) Attempt 0 to execute sudo rm -rf /dev/shm/lcm_staging on 10.248.51.12
2021-07-17 00:30:29 ERROR catalog_staging.py:947 (update) Failed to run sudo rm -rf /dev/shm/lcm_staging on 10.248.51.12 with ret: 254, out: , err: 
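
The "Exception report:" lines in the excerpt above carry the failing action, phase and IP address as a Python-style dictionary. The following is a minimal sketch for pulling those dictionaries out of a saved log, assuming the payload is always a plain literal as in the lines shown here. The function name parse_exception_reports is hypothetical and not part of LCM.

```python
import ast

MARKER = "Exception report: "


def parse_exception_reports(log_text):
    """Return the dict that follows each 'Exception report: ' marker."""
    reports = []
    for line in log_text.splitlines():
        _, found, payload = line.partition(MARKER)
        if not found:
            continue  # this line has no exception report
        try:
            # literal_eval only accepts literals, so it never executes code;
            # the u'...' prefixes seen in the log are valid in Python 3.
            report = ast.literal_eval(payload.strip())
        except (ValueError, SyntaxError):
            continue  # payload was not a plain literal; skip it
        if isinstance(report, dict):
            reports.append(report)
    return reports
```

For the log above, each returned dictionary would have an 'error_type' key and a 'kwargs' dictionary with 'action', 'phase', 'err_msg' and 'ip_addr'. That gives a quick view of which action failed on which address.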
 

I may not be able to use Nutanix Support right now since my host’s support period has already ended, so any help would be appreciated!

Hi, everyone

I decided to reboot the host. During the reboot, I noticed the following messages on the screen:

 

The boot procedure then continued as a normal boot, and this time the host successfully booted into AHV. I was then able to re-add the host to my cluster.

Problem solved.

 

My theory: about two months ago I upgraded the host’s IMM2 firmware to 5.30. I did not reboot the host afterwards because I did not think it was necessary. I now suspect that the firmware upgrade needed some post-upgrade processing during the next reboot, and that this interfered with the AHV upgrade’s reboot process. Maybe Nutanix can take a look into this.
