Question

Unable to reach server. Please check your network connection.

Badge +6
Hi everyone,
I have a 3-node cluster that worked well for a few weeks, and then all of a sudden I keep getting the following error from Prism:

Unable to reach server. Please check your network connection.

All cluster services are up on all hosts,
but I keep getting disconnected from Prism, my VMs have crashed, and I can't figure out where to look. All hosts have been rebooted and the network has been checked.
I have no issues with SSH to the CVMs and AHV hosts, and no loss of ping.
But the cluster is essentially failing at anything it tries to do...

Please help...


Userlevel 7
Badge +25
Any FATAL logs in the /home/nutanix/data/logs folder?

Results of "ncc health_checks run_all"?
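For reference, something like this from any one CVM should surface both (assuming the standard log layout; allssh fans the command out to every CVM):

# list the most recent FATAL logs on every CVM
allssh "ls -lt /home/nutanix/data/logs/*.FATAL 2>/dev/null | head -5"

# full health-check sweep; output also lands in ncc-output-latest.log
ncc health_checks run_all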
Badge +6
Hi, I just spotted something in the logs that might help:
this is from the ssl_terminator.out log file.
I did actually request the cluster to generate a new self-signed certificate after trying to apply an official one. Could this be the reason for the failures? And if it is certificate related, how can I fix it?


2018-11-28 17:38:00 INFO zookeeper_session.py:110 ssl_terminator is attempting to connect to Zookeeper
2018-11-28 17:38:00 INFO ssl_terminator_server.py:768 Adding custom server certificate and private key
2018-11-28 17:38:04 INFO ssl_terminator_server.py:768 Adding custom server certificate and private key
2018-11-28 17:38:08 WARNING ssl_terminator_server.py:365 Checksum mismatch 0e4de9ddde8551324042ecc974078b542c556912cc4aac8d8887a6b929ccf96d !=
2018-11-28 17:38:08 WARNING ssl_terminator_server.py:365 Checksum mismatch e3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934ca495991b7852b855 !=
2018-11-28 17:38:09 WARNING ssl_terminator_server.py:365 Checksum mismatch e3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934ca495991b7852b855 !=
2018-11-28 17:38:10 WARNING ssl_terminator_server.py:365 Checksum mismatch e3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934ca495991b7852b855 !=
2018-11-28 17:38:10 CRITICAL decorators.py:47 Traceback (most recent call last):
File "/home/afg/src/main/builds/build-ce-2018.05.01-stable-release/python-tree/bdist.linux-x86_64/egg/util/misc/decorators.py", line 41, in wrapper
File "/home/afg/src/main/builds/build-ce-2018.05.01-stable-release/python-tree/bdist.linux-x86_64/egg/ssl_terminator/ssl_terminator_server.py", line 881, in _secure_key_repo_watch_thr
File "/home/afg/src/main/builds/build-ce-2018.05.01-stable-release/python-tree/bdist.linux-x86_64/egg/ssl_terminator/ssl_terminator_server.py", line 713, in _configure_pem
File "/home/afg/src/main/builds/build-ce-2018.05.01-stable-release/python-tree/bdist.linux-x86_64/egg/ssl_terminator/ssl_terminator_server.py", line 441, in _download_secure_key
SSLTerminatorServerException: Failed to download private key

2018-11-28 17:38:10 INFO 5697 ../../../../../infrastructure/cluster/service_monitor/service_monitor.c:189 StartServiceMonitor: Child 880 exited with status: 256
2018-11-28 17:38:11 INFO 5697 ../../../../../infrastructure/cluster/service_monitor/service_monitor.c:180 StartServiceMonitor: Launched child with pid: 1569
2018-11-28 17:38:11 INFO 1569 ../../../../../infrastructure/cluster/service_monitor/service_monitor.c:210 StartService: Starting service with cmd: /home/nutanix/bin/ssl_terminator
2018-11-28 17:38:11 INFO 1569 ../../../../../infrastructure/cluster/service_monitor/service_monitor.c:134 RefreshZkHostPortList: Setting ZOOKEEPER_HOST_PORT_LIST=zk3:9876,zk2:9876,zk1:9876;
Userlevel 7
Badge +25
Yeah, looks like mTLS kicked in and is failing all the network connections.

So you added an external CA's certificate and then removed it and tried to regenerate a self-signed one?

I know you mentioned Prism is down, but using https://portal.nutanix.com/#/page/docs/details?targetId=Web_Console_Guide-Prism_v4_7:wc_security_ssl_certificate_wc_t.html as an example, what did you try to do before the whole thing fell over?
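Side note: if I'm reading those log lines right, the repeated e3b0c442... checksum is just the SHA-256 of empty input, which would mean the key download is coming back with zero bytes rather than a wrong key. Easy to verify on any Linux box:

# the SHA-256 of nothing at all matches the value in the log
printf '' | sha256sum
# e3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934ca495991b7852b855  -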
Userlevel 7
Badge +25
Oh, and I think "ncli ssl-certificate generate" on a CVM will force things back to self-signed w/o Prism.
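i.e., roughly this from an SSH session on any CVM (I'm going from memory on the exact subcommand, so double-check it against ncli's help output):

# SSH to any CVM first, then regenerate a self-signed certificate without Prism
ssh nutanix@<cvm-ip>
ncli ssl-certificate generate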
Badge +6
Yes, essentially. I added a public cert and things started going a bit wonky, I suppose. I didn't think much of it and figured I would just generate a self-signed one, but from the looks of it the hosts don't receive it...
and it keeps saying:
SSLTerminatorServerException: Failed to download private key
2018-11-28 17:51:28 WARNING ssl_terminator_server.py:365 Checksum mismatch 0e4de9ddde8551324042ecc974078b542c556912cc4aac8d8887a6b929ccf96d !=
2018-11-28 17:51:29 WARNING ssl_terminator_server.py:365 Checksum mismatch e3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934ca495991b7852b855 !=
2018-11-28 17:51:29 WARNING ssl_terminator_server.py:365 Checksum mismatch e3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934ca495991b7852b855 !=
2018-11-28 17:51:30 WARNING ssl_terminator_server.py:365 Checksum mismatch e3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934ca495991b7852b855 !=


For the fun of it, I just regenerated a new one, but the error keeps coming...
Badge +6
Detailed information for cluster_services_status:
Node 172.31.10.91:
FAIL: Components core dumped in last 24 hours: ['pithos', 'stargate']
Refer to KB 3378 (http://portal.nutanix.com/kb/3378) for details on cluster_services_status or Recheck with: ncc health_checks system_checks cluster_services_status


Detailed information for vm_checks:
Node 172.31.10.93:
FAIL: CVM 'NTNX-8e658690-A-CVM' has cpu utilization (100 %) above threshold (90 %)

Refer to KB 2733 (http://portal.nutanix.com/kb/2733) for details on vm_checks or Recheck with: ncc health_checks hypervisor_checks vm_checks --cvm_list=172.31.10.93

Detailed information for cvm_reboot_check:
Node 172.31.10.92:
FAIL: CVM has rebooted recently. Last reboot was at Wed Nov 28 17:10:00 2018
Node 172.31.10.93:
FAIL: CVM has rebooted recently. Last reboot was at Wed Nov 28 16:58:00 2018
Refer to KB 2474 (http://portal.nutanix.com/kb/2474) for details on cvm_reboot_check or Recheck with: ncc health_checks system_checks cvm_reboot_check --cvm_list=172.31.10.92,172.31.10.93

Detailed information for host_cpu_frequency_check:
Node 172.31.10.91:
ERR : Error while getting host CPU frequency range. bash: cpupower: command not found

Node 172.31.10.92:
ERR : Error while getting host CPU frequency range. bash: cpupower: command not found

Node 172.31.10.93:
ERR : Error while getting host CPU frequency range. bash: cpupower: command not found

Refer to KB 5542 (http://portal.nutanix.com/kb/5542) for details on host_cpu_frequency_check or Recheck with: ncc health_checks system_checks host_cpu_frequency_check --cvm_list=172.31.10.91,172.31.10.92,172.31.10.93

Detailed information for mellanox_nic_status_check:
Node 172.31.10.91:
ERR : node (service_vm_id: 4) : Error while trying to get NIC information
Node 172.31.10.92:
ERR : node (service_vm_id: 5) : Error while trying to get NIC information
Node 172.31.10.93:
ERR : node (service_vm_id: 6) : Error while trying to get NIC information
Refer to KB 4114 (http://portal.nutanix.com/kb/4114) for details on mellanox_nic_status_check or Recheck with: ncc health_checks network_checks mellanox_nic_status_check --cvm_list=172.31.10.91,172.31.10.92,172.31.10.93
+---------------+-------+
| State         | Count |
+---------------+-------+
| Pass          | 165   |
| Info          | 1     |
| Fail          | 5     |
| Error         | 2     |
| Total Plugins | 173   |
+---------------+-------+
Plugin output written to /home/nutanix/data/logs/ncc-output-latest.log
Userlevel 7
Badge +25
So you ran the generate on a CVM and no change? Wonder if the whole chain got mucked up with the original custom cert and chain.
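If you can get at the cert and key files somewhere, the standard openssl sanity check for a cert/key mismatch is to compare their moduli (assumes an RSA key; the file names below are placeholders):

# the two digests must match for the cert and key to belong together
openssl x509 -noout -modulus -in server.crt | openssl md5
openssl rsa -noout -modulus -in server.key | openssl md5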
Badge +6
Yes, I still have the error, and the cluster is still having a range of issues; SSH is failing between nodes, from what I can tell...
Userlevel 7
Badge +25
So I don't know offhand where Nutanix stores its certs, and the Python code can target any location.

@Primzy any thoughts on why ncli couldn't pave over a bad SSL config?
Badge +6
I have tried to replace certs, but have reverted back to using an official certificate.

But I still get the following in my /data/logs/ssl_terminator.out.
And I have no idea how to find the ssl_terminator_server.py file it keeps referring to... I thought it might give a clue as to what it was looking for...

Is there a way to extract virtual disks from VMs? I have a few disks I would like to recover if it turns out to be impossible to turn the ship around...

2018-11-28 23:03:11 INFO 5795 ../../../../../infrastructure/cluster/service_monitor/service_monitor.c:189 StartServiceMonitor: Child 17335 exited with status: 256
2018-11-28 23:03:12 INFO 5795 ../../../../../infrastructure/cluster/service_monitor/service_monitor.c:180 StartServiceMonitor: Launched child with pid: 17899
2018-11-28 23:03:12 INFO 17899 ../../../../../infrastructure/cluster/service_monitor/service_monitor.c:210 StartService: Starting service with cmd: /home/nutanix/bin/ssl_terminator
2018-11-28 23:03:12 INFO 17899 ../../../../../infrastructure/cluster/service_monitor/service_monitor.c:134 RefreshZkHostPortList: Setting ZOOKEEPER_HOST_PORT_LIST=zk3:9876,zk2:9876,zk1:9876;
2018-11-28 23:03:14 INFO zookeeper_session.py:110 ssl_terminator is attempting to connect to Zookeeper
2018-11-28 23:03:14 INFO ssl_terminator_server.py:768 Adding custom server certificate and private key
2018-11-28 23:03:17 INFO ssl_terminator_server.py:768 Adding custom server certificate and private key
2018-11-28 23:03:20 WARNING ssl_terminator_server.py:365 Checksum mismatch 0e4de9ddde8551324042ecc974078b542c556912cc4aac8d8887a6b929ccf96d !=
2018-11-28 23:03:20 WARNING ssl_terminator_server.py:365 Checksum mismatch e3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934ca495991b7852b855 !=
2018-11-28 23:03:21 WARNING ssl_terminator_server.py:365 Checksum mismatch e3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934ca495991b7852b855 !=
2018-11-28 23:03:22 WARNING ssl_terminator_server.py:365 Checksum mismatch e3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934ca495991b7852b855 !=
2018-11-28 23:03:22 CRITICAL decorators.py:47 Traceback (most recent call last):
File "/home/afg/src/main/builds/build-ce-2018.05.01-stable-release/python-tree/bdist.linux-x86_64/egg/util/misc/decorators.py", line 41, in wrapper
File "/home/afg/src/main/builds/build-ce-2018.05.01-stable-release/python-tree/bdist.linux-x86_64/egg/ssl_terminator/ssl_terminator_server.py", line 881, in _secure_key_repo_watch_thr
File "/home/afg/src/main/builds/build-ce-2018.05.01-stable-release/python-tree/bdist.linux-x86_64/egg/ssl_terminator/ssl_terminator_server.py", line 713, in _configure_pem
File "/home/afg/src/main/builds/build-ce-2018.05.01-stable-release/python-tree/bdist.linux-x86_64/egg/ssl_terminator/ssl_terminator_server.py", line 441, in _download_secure_key
SSLTerminatorServerException: Failed to download private key
Userlevel 7
Badge +25
If Stargate is functional then maybe, but it sounds like that is questionable. You would need to see if 2222 is listening and then scp to that port.

You may be able to repair one server at a time without data loss.
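Something along these lines, if Stargate will serve it (the container name and the .acropolis vmdisk path are placeholders from memory, so adjust to what you actually see when browsing):

# Stargate exposes containers over SFTP on port 2222 of a CVM
sftp -P 2222 nutanix@<cvm-ip>

# then browse for the vmdisk files and pull them down
sftp> ls /<container-name>/.acropolis/vmdisk/
sftp> get /<container-name>/.acropolis/vmdisk/<vmdisk-uuid> ./recovered-disk.raw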
Badge +6
I can connect to 2222, but browsing images stalls... so Stargate seems to be somewhat down.
Should I try to run the installer from AHV?
Userlevel 7
Badge +25
Yuck... gray failures are a pain

Yeah, log in as install from AHV and it should detect the existing install and prompt for repair. Go one at a time and hopefully it will put SSL back the way it should be.
Badge +6
Alright, I have started the repair of the CVM (destroy all SSD data)...
on one host...
I am guessing that doing that to all hosts could be a problem, as all data on SSD would be lost...
Repair host (all data preserved) is, I am guessing, probably not enough to fix the CVM?
Userlevel 7
Badge +25
Don't destroy the data?! Preserve would have just reset the CVM image and hopefully enough of the config to fix SSL.

You have 2 of your 3, so hopefully you can preserve the rest of the nodes and then join that scrubbed node back to the cluster.
Badge +6
Yeah, unfortunately the issue seems to be with the CVM, so repairing the host could be hit or miss...
But I will see what happens when the first one comes up again.
Userlevel 7
Badge +25
Well, the CVM is basically an ISO with a directory mounted. You nuked all of the Cassandra data along with everything else staged on the SSD. I would have done a repair first, which has been shown to fix other CVM issues and doesn't touch the Cassandra and other hot-tier partitions.
Badge +6
Ok, well after the redeploy of the CVM, it is not coming up. All data is preserved in
/home/nutanix/data/stargate-storage/disks/XXXX
and for that reason I get the following:
2018-11-29 13:03:59 ERROR node_manager.py:5095 Disk mounted on /home/nutanix/data/stargate-storage/disks/ZA1ALJ3H0000C832/WAL_alt is not empty
2018-11-29 13:03:59 CRITICAL node_manager.py:5146 Data disks are not clean.

from genesis.out
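For anyone hitting the same thing, a quick bit of shell shows which disk directories node_manager is objecting to (path per the log above):

# count the entries under each stargate-storage disk directory;
# anything non-empty is what gets flagged as "not clean"
for d in /home/nutanix/data/stargate-storage/disks/*/; do
  echo "$d: $(ls -A "$d" | wc -l) entries"
done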
Userlevel 7
Badge +25
The delete-data one, or did you start a repair?

In both cases you are redeploying the CVM, but in the preserve case it is not wiping all the data partitions on the SSD.
Badge +6
Hi Jrack,
I only have 3 options:
  • Completely destroy all data
  • CVM repair, destroy SSD data
  • Host repair redeploy (keep all data)
I am trying the CVM redeploy once again on the same host...
The two remaining hosts are untouched... but still not running as they should...
Userlevel 7
Badge +25
Well, didn't you do #2 on that node already? If so, that node is a lost cause, as you can't do #3 after nuking the data.

You have 2 untouched nodes, and the question is whether a #3 on one of those would do anything.
Badge +6
Yes, and the result on node 1 is that it is not rejoining the cluster.
node 2 and 3 remain untouched...
Userlevel 7
Badge +25
Yeah, node 1 is going to be a problem now because all of its data is hosed. I would still suggest trying a repair (#3) on one of the 2 remaining nodes to see if the ssl_terminator errors go away.
Badge +6
Should I try a repair through the existing root console, or reimage the Nutanix drive? (It runs off the SATADOM disk.)
Badge +6
Since crashing the first node, ssl_terminator_server.py gives a slight variation,
indicating that the first checksum is what it expects, and the others are what it receives from the nodes... the nodes have been in sync with each other, but not with the expected value...

2018-11-29 17:08:07 WARNING ssl_terminator_server.py:365 Checksum mismatch 0e4de9ddde8551324042ecc974078b542c556912cc4aac8d8887a6b929ccf96d !=
2018-11-29 17:08:07 WARNING ssl_terminator_server.py:393 Failed to download from 172.31.10.91: 255 (, )
2018-11-29 17:08:08 WARNING ssl_terminator_server.py:365 Checksum mismatch e3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934ca495991b7852b855 !=
2018-11-29 17:08:08 WARNING ssl_terminator_server.py:365 Checksum mismatch e3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934ca495991b7852b855 !=