Cluster Sanity | Nutanix Community

Do you need to upgrade cluster components or restart CVMs or hosts, or do you simply want to confirm that your cluster is healthy?

 

The quick checks listed below help evaluate cluster stability. Run the commands from an SSH session to any CVM in the cluster, or to the PCVM where applicable.

 

| No. | Check | Purpose | CLI Command | GUI method | Expected output |
|---|---|---|---|---|---|
| 1 | Cluster services (Cluster and PC) | To ensure all services are running on all CVMs | cluster status | N/A | No service should be in the 'DOWN' state |
| 2 | Metadata ring structure | To ensure that the metadata ring structure is stable | nodetool -h 0 ring | N/A | All nodes must be listed, each in the 'UP' and 'NORMAL' states |
| 3 | Data Resiliency | To ensure that the cluster can tolerate a CVM/host being put into maintenance for the upgrade activity | ncli cluster get-domain-fault-tolerance-status type=node | Data Resiliency tab in the bottom right section of the Prism Home page of the cluster | It should be 'OK' in green |
| 4 | CVM maintenance mode | To ensure that no CVM is already in maintenance mode | ncli host ls | N/A | Look for the 'Under Maintenance Mode' parameter and ensure no CVM has the value 'true' |
| 5 | Host maintenance mode (AHV) | To ensure that no host is already in maintenance mode | For AHV: acli host.list | N/A | Look for the 'Schedulable' parameter and ensure no host has the value 'false' |
| 6 | NCC health check (Cluster) | To ensure there are no critical failures in the cluster that might interrupt the upgrades | ncc health_checks run_all | Prism > Health > Run all NCC Checks | Refer to the KBs listed after each failing check and fix critical issues accordingly |
| 7 | NCC health check (Prism Central) | To ensure there are no critical failures in the Prism Central VM(s) that might interrupt the upgrades | ncc health_checks run_all | N/A | Refer to the KBs listed after each failing check and fix critical issues accordingly |
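
Checks 1, 4 and 5 each come down to scanning command output for a keyword, so they can be wrapped in small shell filters. The following is only a rough sketch: the function names are hypothetical, the exact output layout of each command is not assumed, and the filters match nothing more than the keywords named in the checks. In particular, the check 5 filter assumes the word 'false' appears in `acli host.list` output only as a 'Schedulable' value, which you should confirm on your own cluster.

```shell
#!/usr/bin/env bash
# Hypothetical helper filters: each reads command output on stdin and
# exits 0 (success) when the "bad" keyword from the checklist is absent.

# Check 1: no line may contain the whole word DOWN.
no_services_down() {
  ! grep -qw 'DOWN'
}

# Check 4: no 'Under Maintenance Mode' line may carry the value true.
no_cvm_in_maintenance() {
  ! grep -i 'Under Maintenance Mode' | grep -qiw 'true'
}

# Check 5: ASSUMPTION - the word "false" only ever appears as a
# 'Schedulable' value in the output, so any match means a host is
# not schedulable.
no_host_unschedulable() {
  ! grep -qiw 'false'
}

# Intended use on a CVM (not executed here):
#   cluster status | no_services_down      || echo "Check 1: a service is DOWN"
#   ncli host ls   | no_cvm_in_maintenance || echo "Check 4: a CVM is in maintenance mode"
#   acli host.list | no_host_unschedulable || echo "Check 5: a host is not schedulable"
```

Treat a passing filter as a hint rather than proof: it only confirms that the keyword was not found, so it is still worth reading the full output, especially for checks 2 and 3, which this sketch does not cover.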

 

Note: For non-AHV setups, follow the equivalent steps from your hypervisor vendor for check 5 (Host Maintenance Mode).

 

For more details on putting a cluster in and out of maintenance mode, refer to KB-4639.

 

The checks listed above are a basic set for a generic health check. See KB-2852 for more such checks.
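
If you want to keep the output of each check for later review, for example to compare one run with another, a small wrapper can save each command's output to its own file. This is a minimal sketch, not an official tool: `capture_check` and the directory naming are hypothetical, and the wrapper only records output without judging whether a check passed.

```shell
#!/usr/bin/env bash
# Hypothetical helper: run one check and save its output for later review.
# Usage: capture_check <output-dir> <label> <command> [args...]
capture_check() {
  local outdir=$1 name=$2
  shift 2
  mkdir -p "$outdir"
  # Store stdout and stderr together; carry on even if the command fails.
  if "$@" >"$outdir/$name.txt" 2>&1; then
    echo "captured: $name"
  else
    echo "command failed or not found: $name"
  fi
}

# Intended use on a CVM (not executed here):
#   outdir="sanity_$(date +%Y%m%d_%H%M%S)"
#   capture_check "$outdir" cluster_status cluster status
#   capture_check "$outdir" ring           nodetool -h 0 ring
#   capture_check "$outdir" resiliency     ncli cluster get-domain-fault-tolerance-status type=node
#   capture_check "$outdir" cvm_hosts      ncli host ls
#   capture_check "$outdir" ahv_hosts      acli host.list
```

The "captured" message only means the command exited with status 0. It says nothing about cluster health, so the saved files still need to be read against the expected output for each check.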

Nice one @Nashma :smiley: