Solved

Need help on an old cluster, Zookeeper won't start

  • 24 June 2024
  • 6 replies
  • 408 views

Hi All

I have recently inherited an old 4-node Nutanix cluster (el7.nutanix.20201105.30281 is the version, I think) and very soon after first looking at it and raising myself some tasks it has decided to stop working! I do not have support (it turns out that expired a couple of months ago) and while I do intend to resolve that situation, I need to get the thing back online ASAP as it runs some critical services.

The initial problem was caused by one of the nodes running out of inodes on the root disk. I cleared this down (it was just a LOT of emails in the spool) but the Prism console was not available afterwards. I had a quick look online and restarted genesis (allssh 'genesis restart') off the back of that, but it did not help.
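For anyone hitting the same thing, this is roughly how I confirmed the inode exhaustion and tracked down the culprit (standard coreutils only; the spool was where my files had piled up, but yours may differ):

nutanix@CVM:~$ df -i /     # IUse% was at 100%
nutanix@CVM:~$ sudo sh -c 'for d in /var/spool/*; do echo "$(find "$d" | wc -l) $d"; done | sort -rn'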

The problem seems to be with zookeeper; in zookeeper.out I just see a lot of this:

2024-06-24 15:51:33,093Z:26630(0x7f4343398700):ZOO_INFO@zookeeper_interest@1926: Connecting to server 10.6.232.32:9876

2024-06-24 15:51:33,093Z:26630(0x7f4343398700):ZOO_INFO@zookeeper_interest@1963: Zookeeper handle state changed to ZOO_CONNECTING_STATE for socket [10.6.232.32:9876]

2024-06-24 15:51:33,093Z:26630(0x7f4343398700):ZOO_ERROR@handle_socket_error_msg@2165: Socket [10.6.232.32:9876] zk retcode=-4, errno=111(Connection refused): server refused to accept the client

If I look at the individual servers in the cluster, none of them are even listening on port 9876, and if I try to do anything with the cluster it just fails to connect to zookeeper:

nutanix@CVM:~/data/logs$ cluster start

2024-06-24 15:56:27,483Z INFO MainThread zookeeper_session.py:191 cluster is attempting to connect to Zookeeper

2024-06-24 15:56:37,484Z ERROR MainThread configuration.py:158 Could not get Zookeeper connection with host_port_list: zk1:9876,zk2:9876,zk3:9876

2024-06-24 15:56:37,485Z INFO MainThread cluster:2919 Executing action start on SVMs ...
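For reference, this is roughly how I checked for listeners across all the CVMs (assuming ss is available on these CVMs; netstat -tlnp would work the same way). It printed nothing on any of the four nodes:

nutanix@CVM:~$ allssh 'sudo ss -tlnp | grep 9876'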

So my question is really: how do I diagnose the issues with zookeeper? What logs should I look in to get more information? Am I just missing something stupid?

I did notice that one of the nodes has a different address for zk1 in /etc/hosts (it is pointing to an IP that does not even exist), but I do not know the implications of this so I have left it:

192.0.2.1 zk1 # DON'T TOUCH THIS LINE
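For what it's worth, this is how I compared that entry across the nodes (allssh just runs the command on every CVM):

nutanix@CVM:~$ allssh 'grep zk /etc/hosts'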


6 replies

Userlevel 4
Badge +5

Hi Josh,

As you said, is the node where the inode issue was identified the same one that has the different zk entries?

 

F>P

Badge

It did, and I have no idea how it worked before, but I fixed that (in /home/nutanix/data/zookeeper_monitor/zk_server_config_file and /etc/hosts) so that all four cluster members match. It is now getting further, but it is still trying to connect to zkX:9876, and nothing is listening on that port so the connection fails.
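To confirm the two files now match on all four nodes, I compared them roughly like this (checksums only, so it is a blunt check):

nutanix@CVM:~$ allssh 'md5sum /home/nutanix/data/zookeeper_monitor/zk_server_config_file'
nutanix@CVM:~$ allssh 'grep zk /etc/hosts'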

The zookeeper.out eventually just shows the zookeeper monitor sitting there doing this:

*** Check failure stack trace: ***

First FATAL tid: 16200

Installed ExitTimer with timeout 30 secs and interval 5 secs

Leak checks complete

Flushed log files

Initialized FiberPool BacktraceGenerator

Collected stack-frames for threads

Collected stack-frames for all Fibers

Symbolized all thread stack-frames

Symbolized all fiber stack-frames

Obtained stack traces of threads responding to SIGPROF

Collected stack traces from /proc for unresponsive threads

Stack traces are generated at /home/nutanix/data/cores/zookeeper_monit.16200.20240625-074827.stack_trace.txt

Stacktrace collection complete

E20240625 07:48:27.132045Z 13331 zookeeper_monitor.cc:815] Zookeeper monitor exited too many times - Delaying starting the child for 45 second

Badge

Sorry, I missed an error from the zookeeper logs:

F20240625 07:48:25.034206Z 16182 zookeeper_monitor.cc:1831] Check failed: iter != zookeeper_mapping_local_.end() 
 

Badge

So after more investigation, the problem I have is definitely with zookeeper_monitor not bringing up zookeeper. It must be failing some consistency check, but can anyone tell me exactly what zookeeper_monitor checks?

I know that it checks:

  • that /home/nutanix/data/zookeeper_monitor/zk_server_config_file matches across hosts
  • that /etc/hosts matches across hosts
  • that the zookeeper config version matches across hosts

What else does it do that I am missing?

Userlevel 4
Badge +5

Hi Josh,

I believe some files got corrupted due to the inode issue. If you can get Nutanix support to make an exception, you could involve your local accounts team.

Look through genesis.out to see if there are any missing-file errors.
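Something along these lines should surface them (genesis.out lives under ~/data/logs on the CVM):

nutanix@CVM:~$ grep -iE 'no such file|missing|error' ~/data/logs/genesis.out | tail -50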

F>P

Badge

Thanks Sl

I actually managed to resurrect it, although I am really not sure what happened under the hood. As you say, I need to get it back under support ASAP.

For reference, the problem was that zookeeper_monitor would not bring up zookeeper on some nodes because of some unknown inconsistency in the configuration. I found that unless at least 2 of the 3 nodes in the quorum were trying to come up and hold an election, nothing would come up at all.

I had managed to get zookeeper to start on one node by restarting a CVM (it was just sitting there trying to hold an election). I then ran zookeeper myself (the same java command taken from a working node's ps output) on a node that had not run out of inodes. This triggered the election and everything sprang to life. It also allowed me to run the health checks (ncc health_checks run_all), which told me what was actually wrong: in this case, the zookeeper config version was incorrect.
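Roughly what that looked like, for anyone in the same hole (this copies the exact java command line from the node where zookeeper was already up; very much a last resort):

nutanix@CVM:~$ ps -ef | grep [z]ookeeper     # on the working node; copy the full java command line it shows
# ...paste and run that command on the healthy node, wait for the election, then:
nutanix@CVM:~$ ncc health_checks run_all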

I did not need to go back and correct anything; the self-healing properties of the cluster kicked in and restarted zookeeper using zookeeper_monitor. It also moved the third quorum member off the host that had filled up and onto the fourth cluster member (which I thought was pretty cool).