Need help on an old cluster, Zookeeper won't start | Nutanix Community

Hi All

I have recently inherited an old 4-node Nutanix cluster (the version may be el7.nutanix.20201105.30281) and, very soon after I first looked at it and raised myself some tasks, it decided to stop working! I do not have support (it turns out that expired a couple of months ago), and while I do intend to resolve that situation, I need to get the cluster back online ASAP as it runs some critical services.

The initial problem was caused by one of the nodes running out of inodes on the root disk. I cleared this down (it was just a LOT of emails in the spool), but the Prism console was not available afterwards. I had a quick look online and restarted genesis (allssh 'genesis restart') off the back of that, but it did not help.
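For anyone hitting the same symptom: inode exhaustion is easy to miss because free disk space can still look fine. Below is a minimal Python sketch of the check, using the standard `os.statvfs` call. The helper name `inode_usage_percent` is hypothetical, and I am assuming a Unix-like host with Python available; `df -i` reports the same numbers.

```python
import os

def inode_usage_percent(path="/"):
    """Return the percentage of inodes in use on the filesystem holding path.

    Returns None when the filesystem does not report an inode count.
    """
    st = os.statvfs(path)
    total = st.f_files            # total inodes on the filesystem
    if total == 0:                # some filesystems do not expose a count
        return None
    used = total - st.f_ffree     # f_ffree is the number of free inodes
    return 100.0 * used / total

if __name__ == "__main__":
    pct = inode_usage_percent("/")
    print("inode usage on /:", "unknown" if pct is None else f"{pct:.1f}%")
```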

The problem seems to be with zookeeper. In zookeeper.out I just see a lot of this:

2024-06-24 15:51:33,093Z:26630(0x7f4343398700):ZOO_INFO@zookeeper_interest@1926: Connecting to server 10.6.232.32:9876

2024-06-24 15:51:33,093Z:26630(0x7f4343398700):ZOO_INFO@zookeeper_interest@1963: Zookeeper handle state changed to ZOO_CONNECTING_STATE for socket 10.6.232.32:9876]

2024-06-24 15:51:33,093Z:26630(0x7f4343398700):ZOO_ERROR@handle_socket_error_msg@2165: Socket 10.6.232.32:9876] zk retcode=-4, errno=111(Connection refused): server refused to accept the client

If I look on the individual servers in the cluster, none of them are even listening on port 9876 and if I try to do anything with the cluster it just fails to connect to zookeeper
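A quick way to confirm the "nothing is listening" observation from any machine is a plain TCP connect test. This is only a sketch: the function name `is_port_open` is hypothetical, and the host names and port in the usage section are simply the ones from the error output in this post.

```python
import socket

def is_port_open(host, port, timeout=2.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers connection refused, timeouts and name resolution failures.
        return False

if __name__ == "__main__":
    # zk1..zk3 and 9876 are taken from the log lines in this post.
    for host in ("zk1", "zk2", "zk3"):
        state = "open" if is_port_open(host, 9876) else "closed/unreachable"
        print(f"{host}:9876 {state}")
```

A "connection refused" here only says that no process is bound to the port; it does not say why zookeeper failed to start.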

nutanix@CVM:~/data/logs$ cluster start

2024-06-24 15:56:27,483Z INFO MainThread zookeeper_session.py:191 cluster is attempting to connect to Zookeeper

2024-06-24 15:56:37,484Z ERROR MainThread configuration.py:158 Could not get Zookeeper connection with host_port_list: zk1:9876,zk2:9876,zk3:9876

2024-06-24 15:56:37,485Z INFO MainThread cluster:2919 Executing action start on SVMs ...

So my question is really, how do I diagnose the issues with zookeeper? What logs should I look in to get more information? Am I just missing something stupid?

I did notice that one of the nodes has a different address for zk1 in /etc/hosts (it points to an IP that does not even exist), but I do not know the implications of this, so I have left it alone:

192.0.2.1 zk1 # DON'T TOUCH THIS LINE
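To spot this kind of drift across nodes, the zk aliases in each node's /etc/hosts can be compared mechanically. The sketch below works on the file contents as strings, so how you collect them from each CVM is up to you. The function names and the node labels are hypothetical, and I am assuming the aliases always look like `zk` followed by a number, as they do in the output above.

```python
import re

def zk_entries(hosts_text):
    """Map zk aliases (zk1, zk2, ...) to the IP on their /etc/hosts line."""
    entries = {}
    for line in hosts_text.splitlines():
        line = line.split("#", 1)[0].strip()   # drop trailing comments
        if not line:
            continue
        fields = line.split()
        ip, names = fields[0], fields[1:]
        for name in names:
            if re.fullmatch(r"zk\d+", name):
                entries[name] = ip
    return entries

def mismatched_aliases(hosts_by_node):
    """Return {alias: {node: ip}} for aliases whose IP differs between nodes."""
    per_node = {node: zk_entries(text) for node, text in hosts_by_node.items()}
    aliases = set().union(*per_node.values())
    result = {}
    for alias in sorted(aliases):
        seen = {node: mapping.get(alias) for node, mapping in per_node.items()}
        if len(set(seen.values())) > 1:        # includes "missing on one node"
            result[alias] = seen
    return result
```

Calling `mismatched_aliases({"node-a": text_a, "node-b": text_b})` returns an empty dict when every node agrees on every zk alias.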

Hi Josh,

As you said, is the node with the different zk entries the same node where the inode issue was identified?

 

F>P


It did, no idea how it worked before, but I fixed that (in /home/nutanix/data/zookeeper_monitor/zk_server_config_file and /etc/hosts) so that all four cluster members match. It is now getting further, but still trying to connect to zkX:9876 … but nothing is listening on that port so the connection fails

The zookeeper.out eventually just shows the zookeeper monitor sitting there doing this

*** Check failure stack trace: ***

First FATAL tid: 16200

Installed ExitTimer with timeout 30 secs and interval 5 secs

Leak checks complete

Flushed log files

Initialized FiberPool BacktraceGenerator

Collected stack-frames for threads

Collected stack-frames for all Fibers

Symbolized all thread stack-frames

Symbolized all fiber stack-frames

Obtained stack traces of threads responding to SIGPROF

Collected stack traces from /proc for unresponsive threads

Stack traces are generated at /home/nutanix/data/cores/zookeeper_monit.16200.20240625-074827.stack_trace.txt

Stacktrace collection complete

E20240625 07:48:27.132045Z 13331 zookeeper_monitor.cc:815] Zookeeper monitor exited too many times - Delaying starting the child for 45 second


Sorry, missed an error from the zookeeper logs

F20240625 07:48:25.034206Z 16182 zookeeper_monitor.cc:1831] Check failed: iter != zookeeper_mapping_local_.end() 
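Since the one useful line was easy to miss among the stack-trace noise, here is a small sketch for pulling only the FATAL (and optionally ERROR) lines out of a log like zookeeper.out. The regular expression is inferred purely from the two lines quoted in this thread (severity letter, date, time, thread id, `file:line]`, message), so treat the format as an assumption; the function name `severe_lines` is hypothetical.

```python
import re

# Pattern inferred from lines such as:
# F20240625 07:48:25.034206Z 16182 zookeeper_monitor.cc:1831] Check failed: ...
LINE_RE = re.compile(
    r"^([IWEF])(\d{8}) (\d{2}:\d{2}:\d{2}\.\d+)Z?\s+(\d+) ([^:\s]+):(\d+)\] (.*)$"
)

def severe_lines(log_text, levels="F"):
    """Return (level, source_file, line_number, message) for matching lines."""
    found = []
    for raw in log_text.splitlines():
        match = LINE_RE.match(raw.strip())
        if match and match.group(1) in levels:
            level, _date, _time, _tid, src, lineno, msg = match.groups()
            found.append((level, src, int(lineno), msg))
    return found
```

Passing `levels="FE"` includes ERROR lines as well, such as the "exited too many times" message above.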
 


So after more investigation the problem I have is definitely with zookeeper_monitor not bringing up zookeeper. It must be failing some consistency checks but can anyone tell me exactly what zookeeper_monitor checks?

I know that it checks

  • that /home/nutanix/data/zookeeper_monitor/zk_server_config_file matches across hosts
  • that /etc/hosts matches across hosts
  • that the zookeeper config version matches across hosts

What else does it do that I am missing?


Hi Josh,

I believe some files were corrupted because of the inode issue. If you can get Nutanix support on an exception basis, you can involve your local accounts team.

Look through genesis.out to see whether there are any missing-file errors.

F>P


Thanks Sl

I actually managed to resurrect it, although I am really not sure how I managed it under the hood. As you say, I need to get it back supported ASAP

For reference, the problem was that zookeeper_monitor would not bring up zookeeper on some nodes because of some unknown inconsistency in the configuration. I found that unless at least 2 of the 3 nodes in the quorum were trying to come up and hold an election, nothing would come up.
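The "2 of 3" behaviour is just ZooKeeper's majority rule: an ensemble of N servers needs more than half of them participating before a leader can be elected. A tiny sketch of the arithmetic (the function names are hypothetical, and this only illustrates the rule, not anything zookeeper_monitor does internally):

```python
def majority(ensemble_size):
    """Smallest number of servers that forms a strict majority."""
    return ensemble_size // 2 + 1

def can_elect(participants, ensemble_size=3):
    """True if enough servers are up to hold a leader election."""
    return participants >= majority(ensemble_size)

if __name__ == "__main__":
    for up in range(4):
        print(f"{up} of 3 up -> election possible: {can_elect(up)}")
```

With one member up out of three, that member waits in the election state indefinitely, which matches what I saw.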

I had managed to get zookeeper to start on one node by restarting a CVM (it was just sitting there trying to hold an election). I then ran zookeeper myself (using the same java command shown in ps on the working node) on a node that had not run out of inodes. This triggered the election and everything sprang to life. It also allowed me to run the health checks (ncc health_checks run_all), which told me what was actually wrong: in this case, the zookeeper config version was incorrect.

I did not need to go back and correct anything; the self-healing properties of the cluster kicked in and restarted zookeeper using zookeeper_monitor. It also moved the third quorum member off the host that had filled up and onto the fourth cluster member (which I thought was pretty cool).