Hi All
I have recently inherited an old 4-node Nutanix cluster (I think the version is el7.nutanix.20201105.30281). Very soon after I first looked at it and raised myself some tasks, it stopped working. I do not have support (it turns out that expired a couple of months ago), and while I do intend to resolve that, I need to get the cluster back online ASAP because it runs some critical services.
The initial problem was caused by one of the nodes running out of inodes on the root disk. I cleared this down (it was just a LOT of emails in the spool), but the Prism console was not available afterwards. I had a quick look online and restarted genesis (allssh 'genesis restart') off the back of that, but it did not help.
The problem seems to be with zookeeper. In zookeeper.out I just see a lot of this:
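For reference, this is roughly how the inode situation on a filesystem can be checked from Python. It is only a sketch of my own using the standard library's os.statvfs, not a Nutanix tool, and the function name inode_usage is made up for illustration.

```python
import os


def inode_usage(path="/"):
    """Return (total, used, percent_used) inodes for the filesystem holding path."""
    st = os.statvfs(path)
    total = st.f_files          # total inodes on the filesystem
    used = total - st.f_ffree   # f_ffree is the number of free inodes
    # Some filesystems report 0 total inodes, so avoid dividing by zero.
    percent = (100.0 * used / total) if total else 0.0
    return total, used, percent


if __name__ == "__main__":
    total, used, percent = inode_usage("/")
    print("/: {0}/{1} inodes used ({2:.1f}%)".format(used, total, percent))
```

A value close to 100% on the root filesystem would match the symptom I hit before clearing the mail spool.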
2024-06-24 15:51:33,093Z:26630(0x7f4343398700):ZOO_INFO@zookeeper_interest@1926: Connecting to server 10.6.232.32:9876
2024-06-24 15:51:33,093Z:26630(0x7f4343398700):ZOO_INFO@zookeeper_interest@1963: Zookeeper handle state changed to ZOO_CONNECTING_STATE for socket 10.6.232.32:9876]
2024-06-24 15:51:33,093Z:26630(0x7f4343398700):ZOO_ERROR@handle_socket_error_msg@2165: Socket 10.6.232.32:9876] zk retcode=-4, errno=111(Connection refused): server refused to accept the client
If I look at the individual servers in the cluster, none of them is even listening on port 9876, and if I try to do anything with the cluster it just fails to connect to zookeeper:
nutanix@CVM:~/data/logs$ cluster start
2024-06-24 15:56:27,483Z INFO MainThread zookeeper_session.py:191 cluster is attempting to connect to Zookeeper
2024-06-24 15:56:37,484Z ERROR MainThread configuration.py:158 Could not get Zookeeper connection with host_port_list: zk1:9876,zk2:9876,zk3:9876
2024-06-24 15:56:37,485Z INFO MainThread cluster:2919 Executing action start on SVMs ...
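In case it helps anyone reproduce the "nothing is listening on 9876" check from another machine, here is a minimal sketch of a TCP connect test. The helper names port_open and check_hosts are my own invention. The only things taken from the output above are the port number and the zk1/zk2/zk3 names, and I am assuming those names resolve on whichever host runs it.

```python
import socket


def port_open(host, port=9876, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        sock = socket.create_connection((host, port), timeout=timeout)
    except OSError:
        # Covers connection refused, timeouts and name-resolution failures.
        return False
    sock.close()
    return True


def check_hosts(hosts, port=9876, timeout=3.0):
    """Map each host to whether its port accepted a TCP connection."""
    return {host: port_open(host, port, timeout) for host in hosts}
```

Calling check_hosts(["zk1", "zk2", "zk3"]) returns a dict of host to True/False. A False for every host would be consistent with the "Connection refused" errors in zookeeper.out, but it says nothing about why the zookeeper process is not listening.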
So my question is really: how do I diagnose the issues with zookeeper? Which logs should I look in to get more information? Am I just missing something stupid?
I did notice that one of the nodes has a different address for zk1 in /etc/hosts. It points to an IP that does not even exist, but I do not know the implications of changing it, so I have left it alone:
192.0.2.1 zk1 # DON'T TOUCH THIS LINE
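To compare the zk entries across all four nodes without eyeballing each file, something like the sketch below could work. It only parses /etc/hosts-style text that you paste or collect yourself. The function names zk_entries and differing_entries and the node labels are hypothetical, and I am assuming the entries are always named zk followed by a number, as in the host_port_list above.

```python
def zk_entries(hosts_text):
    """Map each zkN hostname found in /etc/hosts-style text to its IP address."""
    entries = {}
    for line in hosts_text.splitlines():
        # Drop trailing comments such as "# DON'T TOUCH THIS LINE".
        line = line.split("#", 1)[0].strip()
        if not line:
            continue
        fields = line.split()
        ip, names = fields[0], fields[1:]
        for name in names:
            if name.startswith("zk") and name[2:].isdigit():
                entries[name] = ip
    return entries


def differing_entries(per_node):
    """Given {node_label: hosts_text}, return {zk_name: {node_label: ip}}
    for every zk name whose IP is not identical on all nodes."""
    parsed = {node: zk_entries(text) for node, text in per_node.items()}
    names = set()
    for entries in parsed.values():
        names.update(entries)
    diffs = {}
    for name in sorted(names):
        ips = {node: entries.get(name) for node, entries in parsed.items()}
        if len(set(ips.values())) > 1:
            diffs[name] = ips
    return diffs
```

Feeding it the contents of /etc/hosts from each CVM, keyed by whatever label you like, shows which zk names disagree and on which node. It does not tell you which value is the correct one.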