Troubleshooting drive failure on the nodes

  • 2 October 2020

  • Nutanix Employee

When a drive on a host (SSD or HDD) experiences recoverable errors, warnings, or a complete hardware failure, the Stargate service will mark the disk as bad.
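If you want to see Stargate's view of the disk directly, its log on the CVM is a reasonable place to look. The log path below is the standard CVM log location, but the grep pattern is only an illustrative guess at the wording, not an exact log message:

# Search Stargate's log for disk offline/failure events (pattern is illustrative)
grep -i "offline" /home/nutanix/data/logs/stargate.INFO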

 

The following can be observed when a disk failure occurs.

  1. The disk is shown as red or plain grey in Prism.

  2. A critical alert appears in Prism stating that the disk went bad.

 

Troubleshooting steps

 

  1. Identify the problematic disk in Prism.

    1. Check the Prism web console for the failed disk. In the Diagram view, a failed or missing disk is shown in red or grey.

    2. Check the Prism web console for disk alerts, or run the following command from any working CVM in the cluster to find disks that have generated failure messages.

ncli alert ls
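If the alert list is long, you can narrow it to disk-related entries with a plain shell filter; the grep below is standard shell, not an ncli option:

# Filter the alert list down to disk-related messages
ncli alert ls | grep -i disk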

  2. Check to see if the disk is being recognized by the backplane. Execute the following command from the CVM of the node that shows the disk failure.

list_disks
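As a cross-check, you can also list the block device nodes the CVM kernel currently sees; a disk present in the list_disks output but missing here (or vice versa) helps narrow down where the failure sits:

# List the block devices registered with the CVM kernel
ls -l /dev/sd*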

  3. Check to see if the disk is mounted on the node.

df -h
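Since the mount list can be long, a quick filter on the physical device entries makes a missing disk easier to spot; a failed disk's /dev/sdX mount point is typically absent:

# Show only the physical disk mounts; compare against the list_disks output
df -h | grep /dev/sd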

  4. Check for offline disks using the NCC check disk_online_check.

ncc health_checks hardware_checks disk_checks disk_online_check
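If you want broader coverage than this single check, NCC can also run every check in the disk_checks module; verify the exact syntax against your NCC version:

# Run the entire disk_checks module (module-level run_all)
ncc health_checks hardware_checks disk_checks run_all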

  5. Check the health status of the disk using the following command.

sudo smartctl -H /dev/sdX  (replace X with the device letter of the disk from step 3)
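The -H flag prints only the overall health verdict (PASSED or FAILED). For more detail, smartctl can dump the full SMART attribute table, which helps judge whether the errors are recoverable; /dev/sdb below is just an example device:

# Full SMART report: look at attributes such as Reallocated_Sector_Ct
# and Current_Pending_Sector to gauge how far gone the disk is
sudo smartctl -a /dev/sdb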

 

Before replacing the drive, a CVM reboot is needed so that the drive is properly shown as failed, or, if the drive comes back online after the reboot, so that you can troubleshoot further and try to bring it back into service.
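A minimal sketch of that reboot, assuming the rest of the cluster is healthy and can tolerate one CVM going down; cvm_shutdown is the Nutanix wrapper for a graceful CVM shutdown/restart, but confirm the flags for your AOS version before running it:

# Gracefully reboot this CVM; check cluster resiliency in Prism first
cvm_shutdown -r now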

 

KB Article: 

https://portal.nutanix.com/page/documents/kbs/details?targetId=kA0600000008USrCAM



1 reply


Is a CVM reboot (or any of these commands/tests) or a disk reseat still needed if the disk is listed as "tombstoned"?

ncli disk ls-tombstone-entries