Checklist to verify cluster health status prior to restarting a CVM

There are instances when you have to perform a rolling restart of the CVMs (Controller VMs), a rolling restart of the hypervisor hosts, or a restart of just one of the CVMs.

This is a list of health checks to execute prior to the restart to verify cluster health.

 

  • Verify if any nodes or services are in a 'down' state. For smaller clusters, run the following command:

 

nutanix@cvm$ cluster status

 

  • If the cluster contains many nodes, it may be more convenient to run the following command, which excludes services that are UP from the output:

nutanix@cvm$ cluster status | grep -v UP

  • Nodes or services that are unexpectedly in a 'down' state need to be fixed before proceeding with the restart. 

 

  • Verify if any nodes are missing or are in a 'down' state in the Cassandra ring. There should be the same number of nodes as the number of IPs in the svmips output (four nodes in the example below). If a node is missing, it means it was removed from the Cassandra ring:

nutanix@cvm$ nodetool -h 0 ring

Address      Status  State   Load       Owns     Token
                                                  kV000000Msfgt0tSk22HNmeoLEMT9hDKoNj90Tfc1JpRHn0pRzgU6vJkCwYW
X.X.X.44     Up      Normal  19.54 GB   25.00%   00000000NUjWKYp94sEGXJfIESzM6uY1nEVSEnkeZd0Dk4FMDYI1JFmYskpL
X.X.X.41     Up      Normal  15.11 GB   25.00%   FV000000jZyBpvdRUdTMjOVYIhBRLlq1hNDrXIGAqzO8bYBeceSieWOQ6NdK
X.X.X.42     Up      Normal  23.17 GB   25.00%   V00000001XCXAHdrXjVlkQHxCX2XJ8oAtUX21dPZfC46JQeltUpSL9WgZKmX
X.X.X.43     Up      Normal  21.34 GB   25.00%   kV000000Msfgt0tSk22HNmeoLEMT9hDKoNj90Tfc1JpRHn0pRzgU6vJkCwYW

 

nutanix@cvm$ svmips

X.X.X.41 X.X.X.42 X.X.X.43 X.X.X.44
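
As a quick cross-check, the node counts from both outputs can be compared directly. This is a minimal sketch assuming standard grep and wc are available on the CVM; the two counts should match:

nutanix@cvm$ nodetool -h 0 ring | grep -c Normal
4
nutanix@cvm$ svmips | wc -w
4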

  • Run the following command to check Cassandra status:

nutanix@cvm$ ncc health_checks cassandra_checks cassandra_status_check

  • Verify if there are any recent FATAL files in the ~nutanix/data/logs directory:
    nutanix@cvm$ ls -ltr ~/data/logs/*FATAL*
    Review any service FATALs from the past hour, then validate that the affected service is in the 'up' state and stable before you proceed with the restart.
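    To narrow this down to FATALs created within the last hour on every CVM, something like the following can be used (a minimal sketch, assuming the standard allssh wrapper and GNU find are available on the CVMs):
    nutanix@cvm$ allssh 'find ~/data/logs -maxdepth 1 -name "*FATAL*" -mmin -60'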
     
  • Verify if any Stargate node is down or if ha.py is enabled.
    nutanix@cvm$ ncc health_checks network_checks ha_py_rerouting_check
  • Verify if the cluster can tolerate a single node failure.

nutanix@cvm$ ncli cluster get-domain-fault-tolerance-status type=node

  • Review any unacknowledged alerts and their creation time.
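
Alerts can also be reviewed from the command line. This is a hedged example; the exact ncli alert options may vary by AOS version:

nutanix@cvm$ ncli alert ls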

For more details and commands, please review the KB: https://portal.nutanix.com/page/documents/kbs/details?targetId=kA032000000982pCAA