Checklist to verify cluster health status prior to restarting a CVM


There are instances when you have to perform a rolling restart of the CVMs (Controller VMs), a rolling restart of the hypervisor hosts, or a restart of a single CVM.

The following is a list of health checks to run before the restart to verify cluster health.

 

  • Verify whether any nodes or services are in a 'down' state. For smaller clusters, run the following command:

 

nutanix@cvm$ cluster status
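
Illustrative, abbreviated output (the service list is shortened here, PIDs are placeholders, and the exact formatting varies by AOS version); every service on every CVM should report UP:

CVM: X.X.X.41 Up
                                Zeus   UP   [3148, 3175, 3176]
                           Scavenger   UP   [3334, 3361, 3362]
...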

 

  • If the cluster contains many nodes, it may be more convenient to run the following command, which excludes services that are UP from the output:

nutanix@cvm$ cluster status | grep -v UP
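
As a hypothetical illustration (exact formatting varies by AOS version), a down service would remain visible in the filtered output, similar to:

CVM: X.X.X.42 Up
                            Stargate   DOWN   []

Note that the 'CVM: ... Up' header lines also remain in the filtered output, because grep is case-sensitive and 'Up' does not match 'UP'.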

  • Nodes or services that are unexpectedly in a 'down' state need to be fixed before proceeding with the restart. 

 

  • Verify whether any nodes are missing or in a 'down' state in the Cassandra ring. The ring should contain the same number of nodes as there are IPs in the svmips output (four nodes in the example below). If a node is missing, it was removed from the Cassandra ring:

nutanix@cvm$ nodetool -h 0 ring

Address   Status State   Load      Owns    Token
                                           kV000000Msfgt0tSk22HNmeoLEMT9hDKoNj90Tfc1JpRHn0pRzgU6vJkCwYW
X.X.X.44  Up     Normal  19.54 GB  25.00%  00000000NUjWKYp94sEGXJfIESzM6uY1nEVSEnkeZd0Dk4FMDYI1JFmYskpL
X.X.X.41  Up     Normal  15.11 GB  25.00%  FV000000jZyBpvdRUdTMjOVYIhBRLlq1hNDrXIGAqzO8bYBeceSieWOQ6NdK
X.X.X.42  Up     Normal  23.17 GB  25.00%  V00000001XCXAHdrXjVlkQHxCX2XJ8oAtUX21dPZfC46JQeltUpSL9WgZKmX
X.X.X.43  Up     Normal  21.34 GB  25.00%  kV000000Msfgt0tSk22HNmeoLEMT9hDKoNj90Tfc1JpRHn0pRzgU6vJkCwYW

 

nutanix@cvm$ svmips

X.X.X.41 X.X.X.42 X.X.X.43 X.X.X.44
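
As a quick sanity check, the two outputs can be compared with standard shell tools. This is a sketch that assumes each ring member appears as exactly one line containing 'Normal' in the nodetool output; both counts should match:

nutanix@cvm$ nodetool -h 0 ring | grep -c Normal
nutanix@cvm$ svmips | wc -w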

  • Run the following command to check Cassandra status:

nutanix@cvm$ ncc health_checks cassandra_checks cassandra_status_check
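
The check should return a PASS result for every node. Abbreviated, illustrative output (format varies by NCC version):

Running : health_checks cassandra_checks cassandra_status_check
...
[ PASS ]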

  • Verify if there are any recent FATAL files in the ~nutanix/data/logs directory:
    nutanix@cvm$ ls -ltr ~/data/logs/*FATAL*
    Review any service FATALs from the past hour, and validate that the affected service is in the 'up' state and stable before you proceed with the restart. A quick way to list only recent FATALs is sketched below.
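    To list only FATAL files created within the last hour, a standard find invocation can be used (a sketch; adjust -mmin to widen the window):
    nutanix@cvm$ find ~/data/logs -name "*FATAL*" -mmin -60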
     
  • Verify if any Stargate node is down or if ha.py is enabled.
    nutanix@cvm$ ncc health_checks network_checks ha_py_rerouting_check
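    As an additional, unofficial spot check, each Stargate's monitoring page on port 2009 can be probed from a CVM. This sketch assumes HTTP access to port 2009 is open between CVMs; each line should print a 200 status code:
    nutanix@cvm$ for ip in $(svmips); do echo -n "$ip: "; curl -s -o /dev/null -w "%{http_code}\n" "http://$ip:2009"; done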
  • Verify if the cluster can tolerate a single node failure.

nutanix@cvm$ ncli cluster get-domain-fault-tolerance-status type=node
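
Every component should report a current fault tolerance of at least 1 before a node is restarted. Hypothetical, abbreviated output (field names may differ across AOS versions):

    Domain Type               : NODE
    Component Type            : STATIC_CONFIGURATION
    Current Fault Tolerance   : 1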

  • Review any unacknowledged alerts and their creation times.
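
Unacknowledged alerts can be reviewed in the Prism web console, or from the CLI. The entity name below is an assumption and may vary by AOS version:

nutanix@cvm$ ncli alert ls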

For more details and commands, please review the KB: https://portal.nutanix.com/page/documents/kbs/details?targetId=kA032000000982pCAA

 

