AHV upgrade failed with 'ahv_wait_for_host' action
Hi, everyone
AOS: 5.15.6
AHV: 20190916.410 / 20190916.564
hardware: Lenovo HX5510
After upgrading my cluster to AOS 5.15.6, I used LCM to upgrade the hosts’ AHV from version 20190916.410 to 20190916.564. Several hosts upgraded successfully, but one host failed with the following error:
Operation failed. Reason: LCM failed performing action ahv_wait_for_host in phase PreActions on ip address 10.248.51.32. Failed with error 'Host 10.248.51.12 did not complete upgrade stage one in 7200 seconds.' Logs have been collected and are available to download on 10.248.51.36 at /home/nutanix/data/log_collector/lcm_logs__10.248.51.36__2021-07-17_00-31-25.780130.tar.gz
I logged on to the IMM, connected to the remote control, and saw the following error on screen:
I downloaded the log bundle mentioned above but did not find anything useful. The ‘10.248.51.32’ directory (the CVM on host 10.248.51.12) is empty, and in ‘lcm_logger.out.2021-07-17_00-31-25.780130’ I only found records of failed attempts to connect to host 10.248.51.12 or CVM 10.248.51.32.
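For anyone inspecting a similar bundle, here is a rough Python sketch that lists which directories inside a .tar.gz archive contain no files at all, which is how the empty CVM directory above would show up. The function names `file_counts` and `empty_directories` are made up for this illustration, and nothing is assumed about the bundle layout beyond it being a tar archive.

```python
import tarfile
from collections import Counter
from pathlib import PurePosixPath


def file_counts(bundle_path):
    """Map each directory in the archive to the number of files beneath it."""
    counts = Counter()
    # "r:*" lets tarfile detect the compression (gzip for .tar.gz).
    with tarfile.open(bundle_path, "r:*") as bundle:
        for member in bundle:
            path = PurePosixPath(member.name)
            if str(path) == ".":
                continue  # ignore the archive root entry itself
            if member.isdir():
                counts[str(path)] += 0  # register the directory even if empty
            elif member.isfile():
                # Credit the file to every ancestor directory.
                for parent in path.parents:
                    if str(parent) != ".":
                        counts[str(parent)] += 1
    return counts


def empty_directories(bundle_path):
    """Return the sorted directories that have no files anywhere below them."""
    return sorted(d for d, n in file_counts(bundle_path).items() if n == 0)
```

Calling `empty_directories()` with the path of the downloaded bundle returns the directories that hold no files, so a node whose logs were never collected stands out immediately.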
Should I just power-cycle the host?
Hi Welsper,
Was the host (on which the upgrade failed) stuck in phoenix?
Please connect to the CVM holding the LCM leader role (run the command lcm_leader from any CVM to find the leader).
Check the /home/nutanix/data/logs/lcm_ops.out log file and search for the keyword LcmActionsError.
Connect to the AHV host where the upgrade failed and check the /var/log/ahv_upgrade_627_firstboot.log file.
We need to determine what stage the host is at and how far the upgrade process got.
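The keyword search described above can be sketched in a few lines of Python. This is only an illustration, not a Nutanix tool: `find_keyword` is a hypothetical helper name, and the log path is passed in by the caller rather than assumed.

```python
from pathlib import Path


def find_keyword(log_path, keyword="LcmActionsError"):
    """Return (line_number, text) pairs for every line containing keyword."""
    hits = []
    # errors="replace" keeps the scan going if the log contains odd bytes.
    with Path(log_path).open(encoding="utf-8", errors="replace") as fh:
        for number, line in enumerate(fh, start=1):
            if keyword in line:
                hits.append((number, line.rstrip("\n")))
    return hits


# Example usage (the path is whatever log file you copied off the CVM):
# for number, text in find_keyword("lcm_ops.out"):
#     print(f"{number}: {text}")
```

The line numbers make it easier to jump to the surrounding entries, since the lines just before the first match usually show which action and phase were running.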
Please contact Nutanix Support for a better resolution path on this
Thanks
Hi Raaji,
Thank you for your reply.
I’m not sure whether the host was stuck in Phoenix. The host’s screen is still printing error messages similar to those shown in my last post. When I tried to SSH to the host and entered the username ‘root’, it replied with the banner message:
| Nutanix AHV
But then when I entered the password for root, the SSH window just closed without any prompt or error message.
I logged on to the LCM leader CVM, downloaded the lcm_ops.out log file, and found the following error messages:
2021-07-17 00:30:28 WARNING lcm_actions_helper.py:886 (10.248.51.12, update, 4343a246-939e-46c9-8d81-b678c2986a5f, upgrade stage [2/2]) Skipping recovery process because the recovery action list is empty.
2021-07-17 00:30:28 ERROR lcm_actions_helper.py:698 (10.248.51.12, update, 4343a246-939e-46c9-8d81-b678c2986a5f, upgrade stage [2/2]) Traceback (most recent call last):
  File "/usr/local/nutanix/cluster/bin/lcm/lcm_actions_helper.py", line 677, in execute_actions_with_wal
    raise LcmActionsError(err_msg, action_name, phase, ip_addr)
LcmActionsError: Host 10.248.51.12 did not complete upgrade stage one in 7200 seconds.
2021-07-17 00:30:28 ERROR lcm_actions_helper.py:365 (10.248.51.12, update, 4343a246-939e-46c9-8d81-b678c2986a5f, upgrade stage [2/2]) Traceback (most recent call last):
  File "/usr/local/nutanix/cluster/bin/lcm/lcm_actions_helper.py", line 352, in execute_actions
    metric_entity_proto=metric_entity_proto
  File "/usr/local/nutanix/cluster/bin/lcm/lcm_actions_helper.py", line 677, in execute_actions_with_wal
    raise LcmActionsError(err_msg, action_name, phase, ip_addr)
LcmActionsError: Host 10.248.51.12 did not complete upgrade stage one in 7200 seconds.
2021-07-17 00:30:28 INFO metric_entity.py:1557 (10.248.51.12, update, 4343a246-939e-46c9-8d81-b678c2986a5f, upgrade stage [2/2]) Exception report: {'error_type': 'LcmActionsError', 'kwargs': {'action': u'ahv_wait_for_host', 'phase': 'PreActions', 'err_msg': u'Host 10.248.51.12 did not complete upgrade stage one in 7200 seconds.', 'ip_addr': u'10.248.51.32'}}
2021-07-17 00:30:28 INFO cpdb.py:463 (10.248.51.12, update, 4343a246-939e-46c9-8d81-b678c2986a5f, upgrade stage [2/2]) Exception: Mark Update In Progress Failed (kIncorrectCasValue)
2021-07-17 00:30:28 ERROR lcm_cpdb.py:208 (10.248.51.12, update, 4343a246-939e-46c9-8d81-b678c2986a5f, upgrade stage [2/2]) Failed to update: 9d54bfe5-d43a-4f22-bd82-728a83190bd9
2021-07-17 00:30:28 INFO metric_entity.py:1047 (10.248.51.12, update, 4343a246-939e-46c9-8d81-b678c2986a5f, upgrade stage [2/2]) metric_entity 9d54bfe5-d43a-4f22-bd82-728a83190bd9 got CAS value 20L after automatic conflict resolution
2021-07-17 00:30:28 ERROR lcm_ops_by_host:1351 (10.248.51.12, update, 4343a246-939e-46c9-8d81-b678c2986a5f, upgrade stage [2/2]) lcm_ops_by_host encountered exception Host 10.248.51.12 did not complete upgrade stage one in 7200 seconds..
Traceback (most recent call last):
  File "/home/nutanix/cluster/bin/lcm/lcm_ops_by_host", line 1346, in _perform_multistage_operation_by_host
    parent_task_index=parent_task_index)
  File "/home/nutanix/cluster/bin/lcm/lcm_ops_by_host", line 437, in _run_operation_state_machine
    task_index, **extra_args)
  File "/usr/local/nutanix/cluster/bin/lcm/lcm_ops_utils.py", line 110, in __call__
    ret = self._execution(updater, args, kwargs)
  File "/usr/local/nutanix/cluster/bin/lcm/lcm_ops_utils.py", line 220, in _execution
    return self.__task_handler(*handler_args, **handler_kwargs)
  File "/home/nutanix/cluster/bin/lcm/lcm_ops_by_host", line 487, in _run_operation_state_machine_step
    metric_entity_proto=metric_entity_proto)
  File "/home/nutanix/cluster/bin/lcm/lcm_ops_by_host", line 553, in _execute_pre_actions
    metric_entity_proto=metric_entity_proto)
  File "/usr/local/nutanix/cluster/bin/lcm/lcm_actions_helper.py", line 352, in execute_actions
    metric_entity_proto=metric_entity_proto
  File "/usr/local/nutanix/cluster/bin/lcm/lcm_actions_helper.py", line 677, in execute_actions_with_wal
    raise LcmActionsError(err_msg, action_name, phase, ip_addr)
LcmActionsError: Host 10.248.51.12 did not complete upgrade stage one in 7200 seconds.
2021-07-17 00:30:28 ERROR lcm_ops_by_host:1358 (10.248.51.12, update, c272304c-42f3-40a8-9922-fd26cce49e84) Failed to perform upgrade stage 1/1
2021-07-17 00:30:28 INFO command_execute.py:55 (10.248.51.12, update, c272304c-42f3-40a8-9922-fd26cce49e84) Attempt 0 to execute sudo rm -rf /dev/shm/lcm_staging on 10.248.51.12
2021-07-17 00:30:29 ERROR catalog_staging.py:947 (10.248.51.12, update, c272304c-42f3-40a8-9922-fd26cce49e84) Failed to run sudo rm -rf /dev/shm/lcm_staging on 10.248.51.12 with ret: 254, out: , err:
2021-07-17 00:30:29 INFO metric_entity.py:1557 (10.248.51.12, update, c272304c-42f3-40a8-9922-fd26cce49e84) Exception report: {'error_type': 'LcmActionsError', 'kwargs': {'action': u'ahv_wait_for_host', 'phase': 'PreActions', 'err_msg': u'Host 10.248.51.12 did not complete upgrade stage one in 7200 seconds.', 'ip_addr': u'10.248.51.32'}}
2021-07-17 00:30:29 INFO cpdb.py:463 (10.248.51.12, update, c272304c-42f3-40a8-9922-fd26cce49e84) Exception: Mark Update In Progress Failed (kIncorrectCasValue)
2021-07-17 00:30:29 ERROR lcm_cpdb.py:208 (10.248.51.12, update, c272304c-42f3-40a8-9922-fd26cce49e84) Failed to update: 252c8919-ced0-4fbf-b36a-475c64ce8e69
2021-07-17 00:30:29 INFO metric_entity.py:1047 (10.248.51.12, update, c272304c-42f3-40a8-9922-fd26cce49e84) metric_entity 252c8919-ced0-4fbf-b36a-475c64ce8e69 got CAS value 20L after automatic conflict resolution
2021-07-17 00:30:29 ERROR lcm_ops_by_host:362 (update) Failed to perform operation.
2021-07-17 00:30:29 INFO command_execute.py:55 (update) Attempt 0 to execute sudo rm -rf /dev/shm/lcm_staging on 10.248.51.12
2021-07-17 00:30:29 ERROR catalog_staging.py:947 (update) Failed to run sudo rm -rf /dev/shm/lcm_staging on 10.248.51.12 with ret: 254, out: , err:
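In case it helps others reading logs like this, the "Exception report" entries carry the failing action, phase, and IP address in a Python-style dict, so they can be pulled out programmatically. Below is a minimal sketch; `parse_exception_reports` is a name made up for this post, and the regular expression assumes the report always has the shape seen above (an outer dict whose last value is the nested 'kwargs' dict, with no braces inside the message text).

```python
import ast
import re

# Non-greedy match from the opening brace to the first "}}", which closes the
# nested 'kwargs' dict and the outer dict in the report format shown above.
REPORT_RE = re.compile(r"Exception report: (\{.*?\}\})")


def parse_exception_reports(text):
    """Return every exception report found in text as a list of dicts."""
    # literal_eval safely parses the dict literal, including u'' strings.
    return [ast.literal_eval(raw) for raw in REPORT_RE.findall(text)]


# One entry copied from the log above.
sample = (
    "2021-07-17 00:30:29 INFO metric_entity.py:1557 "
    "(10.248.51.12, update, c272304c-42f3-40a8-9922-fd26cce49e84) "
    "Exception report: {'error_type': 'LcmActionsError', 'kwargs': "
    "{'action': u'ahv_wait_for_host', 'phase': 'PreActions', 'err_msg': "
    "u'Host 10.248.51.12 did not complete upgrade stage one in 7200 seconds.', "
    "'ip_addr': u'10.248.51.32'}}"
)

for report in parse_exception_reports(sample):
    kwargs = report["kwargs"]
    print(report["error_type"], kwargs["action"], kwargs["phase"], kwargs["ip_addr"])
```

Note that the report's 'ip_addr' is the CVM address (10.248.51.32), while the error message names the host (10.248.51.12), which matches the error shown in LCM.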
I may not be able to use Nutanix Support right now since my host’s support period has already ended, so any help would be appreciated!
Hi, everyone
I decided to reboot the host. During the reboot, I noticed the following messages on the screen:
The boot procedure then continued with the normal boot process, and this time the host booted into AHV successfully. I was then able to re-add the host to my cluster.
Problem solved.
My theory: about two months ago I upgraded the host’s IMM2 firmware to 5.30. I did not reboot the host after the firmware upgrade because I didn’t think it was necessary. Now I’m guessing that the firmware upgrade needed some post-upgrade processing during the next reboot, and that this interfered with the AHV upgrade’s reboot process. Maybe Nutanix can look into this.