Do you need to upgrade cluster components, restart CVMs or hosts, or simply confirm that your cluster is healthy?
The quick checks listed below help evaluate cluster stability. Run the commands from an SSH session to any one of the CVMs in the cluster, or to the PCVM where applicable.
No. | Check | Purpose | CLI Command | GUI method | Expected output |
1 | Cluster services (Cluster and PC) | To ensure all services are running fine on all CVMs | cluster status | -N/A- | No services should be in 'DOWN' state |
2 | Metadata ring structure | To ensure that the metadata ring structure is stable | nodetool -h 0 ring | -N/A- | All the nodes must be listed and should have 'UP' and 'NORMAL' states |
3 | Data Resiliency | To ensure that the cluster can tolerate a CVM/host being put into maintenance for the upgrade activity | ncli cluster get-domain-fault-tolerance-status type=node | Data Resiliency tab on the bottom right section of the Prism Home page of the cluster | It should be 'OK' in green |
4 | CVM maintenance mode | To ensure that no CVM is already in maintenance mode | ncli host ls | -N/A- | Look for the 'Under Maintenance Mode' parameter and ensure no CVM has the value 'true' |
5 | Host Maintenance mode (AHV) | To ensure that no host is already in maintenance mode | For AHV: acli host.list | -N/A- | Look for 'Schedulable' parameter and ensure no host has 'false' value |
6 | NCC health check (Cluster) | To ensure there are no critical failures in the cluster that might interrupt the upgrades | ncc health_checks run_all | Prism > Health > Run all NCC Checks | Refer to the KBs listed after every failing check and fix critical issues accordingly |
7 | NCC health check (Prism Central) | To ensure there are no critical failures in the Prism Central VM(s) that might interrupt the upgrades | ncc health_checks run_all | -N/A- | Refer to the KBs listed after every failing check and fix critical issues accordingly |
Note: For non-AHV setups, follow the hypervisor vendor's equivalent steps for check 5 (Host Maintenance mode).
For more details on putting a cluster in and out of maintenance mode, refer to KB-4639.
The checks listed above are a basic set for a generic health check. Refer to KB-2852 for additional checks.
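
As a rough illustration of checks 1 and 4, the two shell helpers below scan command output for the keywords named in the "Expected output" column. They are a minimal sketch and not part of any Nutanix tooling: the function names `no_services_down` and `no_cvm_in_maintenance` are hypothetical, and the helpers assume only that a down service is reported with the word DOWN and that the maintenance flag appears on the same line as 'Under Maintenance Mode'. Verify both assumptions against the real output of your AOS version before relying on them.

```shell
#!/usr/bin/env bash
# Sketch only: each helper reads command output on stdin and matches the
# keywords from the table, so no particular column layout is assumed.

# Check 1: succeeds (exit 0) when no line contains the whole word DOWN.
no_services_down() {
  ! grep -qw 'DOWN'
}

# Check 4: succeeds when no 'Under Maintenance Mode' line contains 'true'.
# Assumption: the parameter name and its value share one line.
no_cvm_in_maintenance() {
  ! grep -i 'Under Maintenance Mode' | grep -qiw 'true'
}
```

On a CVM, the helpers could be fed directly from the commands in the table, for example `cluster status | no_services_down || echo "A service is DOWN"` and `ncli host ls | no_cvm_in_maintenance || echo "A CVM is in maintenance mode"`. A passing result from these helpers does not replace reading the full command output or running the NCC checks.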