You may need to perform a rolling restart of the CVMs (Controller VMs), a rolling restart of the hypervisor hosts, or a restart of a single CVM.
Run the following health checks before the restart to verify that the cluster is healthy.
- Verify whether any nodes or services are in a 'down' state. On smaller clusters, run the following command:
nutanix@cvm$ cluster status
- On clusters with many nodes, the following command may be more convenient because it excludes services that are UP from the output:
nutanix@cvm$ cluster status | grep -v UP
- Nodes or services that are unexpectedly in a 'down' state need to be fixed before proceeding with the restart.
- Verify whether any nodes are missing or in a 'down' state in the Cassandra ring. The ring should list the same number of nodes as there are IPs in the svmips output (four nodes in the example below). A node that is missing from the output has been removed from the Cassandra ring:
nutanix@cvm$ nodetool -h 0 ring
Address    Status  State   Load      Owns     Token
                                              kV000000Msfgt0tSk22HNmeoLEMT9hDKoNj90Tfc1JpRHn0pRzgU6vJkCwYW
X.X.X.44   Up      Normal  19.54 GB  25.00%   00000000NUjWKYp94sEGXJfIESzM6uY1nEVSEnkeZd0Dk4FMDYI1JFmYskpL
X.X.X.41   Up      Normal  15.11 GB  25.00%   FV000000jZyBpvdRUdTMjOVYIhBRLlq1hNDrXIGAqzO8bYBeceSieWOQ6NdK
X.X.X.42   Up      Normal  23.17 GB  25.00%   V00000001XCXAHdrXjVlkQHxCX2XJ8oAtUX21dPZfC46JQeltUpSL9WgZKmX
X.X.X.43   Up      Normal  21.34 GB  25.00%   kV000000Msfgt0tSk22HNmeoLEMT9hDKoNj90Tfc1JpRHn0pRzgU6vJkCwYW
nutanix@cvm$ svmips
X.X.X.41 X.X.X.42 X.X.X.43 X.X.X.44
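As a rough sketch of the comparison above, the two outputs can be counted and compared with a small shell helper. The function names count_ring_up_normal and ring_matches_svmips are hypothetical, and the parsing assumes the column layout shown in the example output (Status in the second column, State in the third); adjust it if your output differs.

```shell
# Hypothetical helper: count ring members reported as Up/Normal.
# Reads `nodetool -h 0 ring` output on stdin. Assumes Status is the
# second whitespace-separated field and State is the third.
count_ring_up_normal() {
    awk '$2 == "Up" && $3 == "Normal" { n++ } END { print n + 0 }'
}

# Hypothetical helper: compare the ring count with the number of CVM IPs.
# $1: text of the `nodetool -h 0 ring` output
# $2: text of the `svmips` output (whitespace-separated IPs)
# Returns 0 when both counts match, non-zero otherwise.
ring_matches_svmips() {
    rms_ring_count=$(printf '%s\n' "$1" | count_ring_up_normal)
    rms_ip_count=$(printf '%s\n' "$2" | wc -w | tr -d ' ')
    echo "ring nodes Up/Normal: $rms_ring_count, CVM IPs: $rms_ip_count"
    [ "$rms_ring_count" -eq "$rms_ip_count" ]
}
```

On a CVM this could be called as: ring_matches_svmips "$(nodetool -h 0 ring)" "$(svmips)" && echo "ring is complete". A mismatch only signals that the ring needs a closer look; it does not say which node is missing.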
- Run the following command to check Cassandra status:
nutanix@cvm$ ncc health_checks cassandra_checks cassandra_status_check
- Verify if there are any recent FATAL files in the ~nutanix/data/logs directory:
nutanix@cvm$ ls -ltr ~/data/logs/*FATAL*
Review any service FATALs from the past hour, and confirm that the affected service is in the 'up' state and stable before you proceed with the restart.
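To narrow the listing to the past hour, one option is find with -mmin. This is a minimal sketch, assuming GNU find is available on the CVM; recent_fatals is a hypothetical helper name, and the 60-minute default simply mirrors the one-hour window mentioned above.

```shell
# Hypothetical helper: list FATAL files modified within the last N minutes.
# $1: log directory (defaults to ~/data/logs, as used in this article)
# $2: window in minutes (defaults to 60)
# -maxdepth 1 keeps the search to the directory itself;
# -type f restricts the result to regular files.
recent_fatals() {
    find "${1:-$HOME/data/logs}" -maxdepth 1 -type f -name '*FATAL*' -mmin "-${2:-60}"
}
```

For example, recent_fatals ~/data/logs 60 prints one path per matching file. Empty output means no FATAL file was modified within the window.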
- Verify if any Stargate node is down or if ha.py is enabled.
nutanix@cvm$ ncc health_checks network_checks ha_py_rerouting_check
- Verify if the cluster can tolerate a single node failure.
nutanix@cvm$ ncli cluster get-domain-fault-tolerance-status type=node
- Review any unacknowledged alerts and their creation time.
For more details and commands, review the KB: https://portal.nutanix.com/page/documents/kbs/details?targetId=kA032000000982pCAA