Solved

Error Genesis not starting

  • 11 December 2019
  • 7 replies
  • 529 views


Can someone help me? I was running the iDRAC update on a node (Hyper-V hypervisor) and it failed. The node is now out of the cluster and I cannot bring its services back up. When I try to restart it from the console, I get this error message:

2019-12-11 14:04:41 INFO zookeeper_session.py:131 cvm_shutdown is attempting to connect to Zookeeper
2019-12-11 14:04:41 WARNING lcm_genesis.py:219 Failed to reach a [localhost] where LCM [LcmFramework.is_lcm_operation_in_progress] is up. Retrying...
2019-12-11 14:04:46 WARNING lcm_genesis.py:219 Failed to reach a [localhost] where LCM [LcmFramework.is_lcm_operation_in_progress] is up. Retrying...
2019-12-11 14:04:51 WARNING lcm_genesis.py:219 Failed to reach a [localhost] where LCM [LcmFramework.is_lcm_operation_in_progress] is up. Retrying...
2019-12-11 14:04:56 WARNING lcm_genesis.py:219 Failed to reach a [localhost] where LCM [LcmFramework.is_lcm_operation_in_progress] is up. Retrying...
2019-12-11 14:05:01 WARNING lcm_genesis.py:219 Failed to reach a [localhost] where LCM [LcmFramework.is_lcm_operation_in_progress] is up. Retrying...
2019-12-11 14:05:06 WARNING lcm_genesis.py:219 Failed to reach a [localhost] where LCM [LcmFramework.is_lcm_operation_in_progress] is up. Retrying...
2019-12-11 14:05:11 ERROR lcm_genesis.py:221 Failed to perform RPC method is_lcm_operation_in_progress on localhost
2019-12-11 14:05:11 ERROR lcm_helper.py:33 Failed to check if LCM operation is running.
2019-12-11 14:05:11 INFO cvm_shutdown:157 No upgrade was found to be in progress on the cluster
2019-12-11 14:05:11 ERROR cvm_shutdown:82 Error acquiring the shutdown token
2019-12-11 14:05:11 ERROR cvm_shutdown:175 Failed to shutdown. Error (Failed to acquire shutdown token) occurred, exiting ...

Best answer by sbarab 12 December 2019, 17:18





@Kike2020 It definitely looks like an LCM operation was in progress when the iDRAC update ran. When you say iDRAC update, was this a firmware update? If so, was it initiated through LCM or was it a manual update?

You may be able to get more information by finding the LCM leader (lcm_leader) in the cluster and checking the LCM files in the Nutanix “data/logs” folder (the files include “lcm” in their names, and their contents can help get to the bottom of this). Try to post them here, or at least the lines that indicate an “Error” in these files.
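
A minimal sketch of the filtering described above, assuming the LCM logs are plain-text files directly under the logs folder, that their names contain "lcm", and that each entry carries a level token such as ERROR (as in the excerpts quoted in this thread). The function name `find_error_lines` and the default directory are illustrative and not part of any Nutanix tooling; it also assumes Python 3 is available where the logs are read.

```python
from pathlib import Path


def find_error_lines(log_dir, name_part="lcm", level="ERROR"):
    """Return (file name, line number, line) for every line tagged `level`.

    Hypothetical helper: only files directly inside `log_dir` whose names
    contain `name_part` are scanned.
    """
    hits = []
    for path in sorted(Path(log_dir).iterdir()):
        if not path.is_file() or name_part not in path.name:
            continue
        # errors="replace" keeps the scan going if a log contains odd bytes.
        with path.open(encoding="utf-8", errors="replace") as handle:
            for number, line in enumerate(handle, start=1):
                # Match the level as a separate token, e.g. "... ERROR file.py:12 ..."
                if f" {level} " in line:
                    hits.append((path.name, number, line.rstrip("\n")))
    return hits


if __name__ == "__main__":
    # Assumed location of the CVM logs mentioned in this thread.
    log_dir = Path.home() / "data" / "logs"
    if log_dir.is_dir():
        for name, number, line in find_error_lines(log_dir):
            print(f"{name}:{number}: {line}")
```

Pasting only the lines this prints keeps a forum post short while still showing where each error came from.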

Regards,

 

-Said

 


Hi @sbarab 

The process was as follows: I ran the inventory through LCM and from there selected the iDRAC firmware update. In the middle of the update it threw an error, and since then the services have not come back up.

Here is what the log collected:

nutanix@NTNX-59TT382-A-CVM:IP ADDRESS:~/data/logs$ tail -F lcm_ops.out
2019-12-11 12:43:34 INFO command_execute.py:86 (IP ADDRESS NODE) Waiting 6 seconds before next attempt
2019-12-11 12:43:40 INFO command_execute.py:52 (IP ADDRESS NODE) Attempt 3 to execute if (Test-Path "$\Nutanix\Tmp\lcm_staging") \Nutanix\Tmp\lcm_staging"} on IP ADDRESS NODE
2019-12-11 12:43:40 WARNING command_execute.py:83 (IP ADDRESS NODE) Failed to execute command if (Test-Path "$\Nutanix\Tmp\lcm_staging") \Nutanix\Tmp\lcm_staging"} on IP ADDRESS NODE. ret: -1  out:  err: Cannot remove item C:\Program Files\Nutanix\Tmp\lcm_staging\c879a59c-3710-404a-abec-66102899db01: The process cannot access the file 'c879a59c-3710-404a-abec-66102899db01' because it is being used by another process.Cannot remove item C:\Program Files\Nutanix\Tmp\lcm_staging: The directory is not empty.
2019-12-11 12:43:40 INFO command_execute.py:86 (IP ADDRESS NODE) Waiting 8 seconds before next attempt
2019-12-11 12:43:48 INFO command_execute.py:52 (IP ADDRESS NODE) Attempt 4 to execute if (Test-Path "$\Nutanix\Tmp\lcm_staging") \Nutanix\Tmp\lcm_staging"} on IP ADDRESS NODE
2019-12-11 12:43:49 ERROR catalog_staging_utils.py:820 (IP ADDRESS NODE) Failed to run if (Test-Path "$\Nutanix\Tmp\lcm_staging") \Nutanix\Tmp\lcm_staging"} on IP ADDRESS NODE with ret: -1, out: , err: Cannot remove item C:\Program Files\Nutanix\Tmp\lcm_staging\c879a59c-3710-404a-abec-66102899db01: The process cannot access the file 'c879a59c-3710-404a-abec-66102899db01' because it is being used by another process.Cannot remove item C:\Program Files\Nutanix\Tmp\lcm_staging: The directory is not empty.
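
The excerpt shows a command being retried with growing waits until a final ERROR. A rough sketch of how such an excerpt could be summarized, assuming every entry starts with "date time LEVEL source:line message" and that the phrases "Attempt N to execute" and "Waiting N seconds before next attempt" are worded as in the lines quoted here. The `summarize` function and its result keys are hypothetical names chosen for this example.

```python
import re
from collections import Counter

# Layout assumed from the excerpt: "2019-12-11 12:43:34 INFO file.py:86 message"
ENTRY = re.compile(
    r"^(?P<time>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}) "
    r"(?P<level>[A-Z]+) "
    r"(?P<source>\S+):(?P<line>\d+) "
    r"(?P<message>.*)$"
)
ATTEMPT = re.compile(r"Attempt (\d+) to execute")
WAIT = re.compile(r"Waiting (\d+) seconds before next attempt")


def summarize(lines):
    """Count entries per level and collect retry attempts, waits and error sources."""
    levels = Counter()
    attempts, waits, errors = [], [], []
    for raw in lines:
        entry = ENTRY.match(raw.strip())
        if entry is None:
            continue  # shell prompts and other non-entry lines are skipped
        levels[entry["level"]] += 1
        attempt = ATTEMPT.search(entry["message"])
        if attempt:
            attempts.append(int(attempt.group(1)))
        wait = WAIT.search(entry["message"])
        if wait:
            waits.append(int(wait.group(1)))
        if entry["level"] == "ERROR":
            errors.append(f'{entry["source"]}:{entry["line"]}')
    return {
        "levels": dict(levels),
        "attempts": attempts,
        "waits": waits,
        "error_sources": errors,
    }
```

Run over the lines above, a summary like this makes it easy to see that the retries were exhausted before the ERROR entry, which points at the locked staging folder rather than at a one-off timeout.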

 

 

I ran a cluster status and got the following:

 

 

All the other nodes show as "UP", but the node with the error shows:
CVM: IP ADDRESS NODE Maintenance
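
A small sketch for picking out CVMs that are not up from saved status output. It assumes each CVM is reported on a line of the form "CVM: <address> <state>", which is inferred only from the single line quoted above; the real output may carry more fields. The function name `cvms_not_up` and the sample addresses are made up for illustration.

```python
import re

# Assumed line shape, inferred from the one line quoted in this thread:
#   CVM: <address> <state>
CVM_LINE = re.compile(r"^\s*CVM:\s+(?P<address>\S+)\s+(?P<state>[A-Za-z]+)")


def cvms_not_up(status_text):
    """Return (address, state) for every CVM line whose state is not 'Up'."""
    result = []
    for line in status_text.splitlines():
        found = CVM_LINE.match(line)
        if found and found["state"].lower() != "up":
            result.append((found["address"], found["state"]))
    return result


if __name__ == "__main__":
    # Placeholder addresses from the documentation range, not real cluster data.
    sample = (
        "CVM: 192.0.2.11 Up\n"
        "CVM: 192.0.2.12 Up\n"
        "CVM: 192.0.2.13 Maintenance\n"
    )
    for address, state in cvms_not_up(sample):
        print(f"{address} is in state {state}")
```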

  

@Kike2020 I am investigating this. In the meantime, it would be great if you could let me know which AOS version you are running and which LCM release. You might also want to run these commands:

acli task.list include_completed=no

and 

progress_monitor_cli --fetchall

 

Regards,

Said


@Kike2020 In addition to the above, the error condition may be further up in the LCM logs; you may want to add those logs here (if possible).

 


Hi @sbarab 

I have AOS version 5.10.7, NCC 3.9.2.1, Foundation 4.5, and LCM version 2.2.11203.
The first command came back as invalid (I don't know if I am doing something wrong). The second one showed me this:
2019-12-12 07:31:59,446:3209(0x7fe073913c80):ZOO_INFO@log_env@951: Client environment:zookeeper.version=zookeeper C client 3.4.3
2019-12-12 07:31:59,447:3209(0x7fe073913c80):ZOO_INFO@log_env@955: Client environment:host.name=ntnx-59tt382-a-cvm
2019-12-12 07:31:59,447:3209(0x7fe073913c80):ZOO_INFO@log_env@962: Client environment:os.name=Linux
2019-12-12 07:31:59,447:3209(0x7fe073913c80):ZOO_INFO@log_env@963: Client environment:os.arch=3.10.0-957.21.3.el7.nutanix.20190619.cvm.x86_64
2019-12-12 07:31:59,447:3209(0x7fe073913c80):ZOO_INFO@log_env@964: Client environment:os.version=#1 SMP Wed Jun 19 05:38:02 UTC 2019
2019-12-12 07:31:59,447:3209(0x7fe073913c80):ZOO_INFO@zookeeper_init@999: Initiating client connection, host=zk3:9876,zk2:9876,zk1:9876 sessionTimeout=20000 watcher=0x561d688e4580 sessionId=0 sessionPasswd=<null> context=0x561d69818040 flags=0
2019-12-12 07:31:59,451:3209(0x7fe06b788700):ZOO_INFO@zookeeper_interest@1942: Connecting to server IP ADDRESS:PORT
2019-12-12 07:31:59,451:3209(0x7fe06b788700):ZOO_INFO@zookeeper_interest@1979: Zookeeper handle state changed to ZOO_CONNECTING_STATE for socket [IP ADDRESS:PORT]
2019-12-12 07:31:59,452:3209(0x7fe06b788700):ZOO_INFO@check_events@2161: initiated connection to server [IP ADDRESS:PORT]
2019-12-12 07:31:59,454:3209(0x7fe06b788700):ZOO_INFO@check_events@2208: session establishment complete on server [IP ADDRESS:PORT], sessionId=0x36eadf2847ec73e, negotiated timeout=20000

@Kike2020 OK,

1- Run the command “lcm_leader” on any CVM.

2- SSH to that CVM using the “nutanix” credentials.

3- cd data/logs

4- ls -lart lcm*

5- Copy the results.

6- Check the logs “lcm_wget.log”, “lcm_op.trace” and “lcm_ops.out” and examine their output. There should be lines in the logs with the timestamp of the day you ran LCM for the firmware upgrade.

7- Zip the logs above and upload them here.

8- I will review them and, based on my findings, either provide you with a response or ask you to open a case with our support line to dig deeper into this issue.
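
The listing and zipping steps above can also be scripted. This is only a sketch under assumptions: it expects Python 3 wherever it runs (for example on a workstation after copying the logs off the CVM), and the function names `list_lcm_files` and `bundle_lcm_logs` and the archive name are invented for this example. The three log names are the ones listed in the steps.

```python
import zipfile
from pathlib import Path

# Log names taken from the steps in this thread.
LOG_NAMES = ("lcm_wget.log", "lcm_op.trace", "lcm_ops.out")


def list_lcm_files(log_dir):
    """List files starting with 'lcm', oldest first (similar to `ls -lart lcm*`)."""
    files = [p for p in Path(log_dir).glob("lcm*") if p.is_file()]
    return sorted(files, key=lambda p: p.stat().st_mtime)


def bundle_lcm_logs(log_dir, archive_path, names=LOG_NAMES):
    """Zip whichever of `names` exist in `log_dir`; return the names added."""
    log_dir = Path(log_dir)
    added = []
    with zipfile.ZipFile(archive_path, "w", compression=zipfile.ZIP_DEFLATED) as archive:
        for name in names:
            path = log_dir / name
            if path.is_file():
                # arcname stores the file without its directory prefix.
                archive.write(path, arcname=name)
                added.append(name)
    return added


if __name__ == "__main__":
    # Assumed location of the logs; adjust to wherever they were copied.
    log_dir = Path.home() / "data" / "logs"
    if log_dir.is_dir():
        for path in list_lcm_files(log_dir):
            print(path.name)
        print("zipped:", bundle_lcm_logs(log_dir, Path.home() / "lcm_logs.zip"))
```

Missing files are skipped rather than treated as errors, so the returned list shows which of the three logs actually made it into the archive.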

Regards,

 

-Said

Hello, sorry for the delay. The server is back UP: they removed it from the cluster and added it back.