Troubleshooting drive failure on the nodes

  • 2 October 2020

  • Nutanix Employee

When a drive on a host (SSD or HDD) experiences recoverable errors, warnings, or a complete hardware failure, the Stargate service will mark the disk as bad.
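If you want to see Stargate's view of the disk directly, its log on the CVM is a reasonable place to look. The log path below is the standard CVM log location, but the grep pattern is only an illustrative guess at the wording, not an exact log message:

# Search Stargate's log for disk offline/failure events (pattern is illustrative)
grep -i "offline" /home/nutanix/data/logs/stargate.INFO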

 

The following can be observed when a disk failure occurs.

  1. The disk is shown as red or plain grey in Prism.

  2. A critical alert appears in Prism stating that the disk went bad.

 

Troubleshooting steps

 

  1. Identify the problematic disk in Prism.

    1. Check the Prism web console for the failed disk. In the Diagram view, a failed or missing disk is shown in red or grey.

    2. Check the Prism web console for disk alerts, or run the following command from any working CVM in the cluster to find disks that have generated failure messages.

ncli alert ls
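If the alert list is long, you can narrow it to disk-related entries with a plain shell filter; the grep below is standard shell, not an ncli option:

# Filter the alert list down to disk-related messages
ncli alert ls | grep -i disk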

  2. Check to see if the disk is being recognized by the backplane. Execute the following command from the CVM of the node that shows the disk failure.

list_disks
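As a cross-check, you can also list the block device nodes the CVM kernel currently sees; a disk present in the list_disks output but missing here (or vice versa) helps narrow down where the failure sits:

# List the block devices registered with the CVM kernel
ls -l /dev/sd*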

  3. Check to see if the disk is mounted on the node.

df -h
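Since the mount list can be long, a quick filter on the physical device entries makes a missing disk easier to spot; a failed disk's /dev/sdX mount point is typically absent:

# Show only the physical disk mounts; compare against the list_disks output
df -h | grep /dev/sd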

  4. Check for offline disks using the NCC check disk_online_check.

ncc health_checks hardware_checks disk_checks disk_online_check
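If you want broader coverage than this single check, NCC can also run every check in the disk_checks module; verify the exact syntax against your NCC version:

# Run the entire disk_checks module (module-level run_all)
ncc health_checks hardware_checks disk_checks run_all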

  5. Check the health status of the disk using the following command.

sudo smartctl -H /dev/sdX  (replace X with the device letter of the disk from step 3)
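The -H flag prints only the overall health verdict (PASSED or FAILED). For more detail, smartctl can dump the full SMART attribute table, which helps judge whether the errors are recoverable; /dev/sdb below is just an example device:

# Full SMART report: look at attributes such as Reallocated_Sector_Ct
# and Current_Pending_Sector to gauge how far gone the disk is
sudo smartctl -a /dev/sdb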

 

Before replacing the drive, a CVM reboot is needed so that the drive is properly shown as failed, or, if the drive comes back online after the reboot, so that you can troubleshoot further and try to bring it back into service.
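A minimal sketch of that reboot, assuming the rest of the cluster is healthy and can tolerate one CVM going down; cvm_shutdown is the Nutanix wrapper for a graceful CVM shutdown/restart, but confirm the flags for your AOS version before running it:

# Gracefully reboot this CVM; check cluster resiliency in Prism first
cvm_shutdown -r now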

 

KB Article: 

https://portal.nutanix.com/page/documents/kbs/details?targetId=kA0600000008USrCAM



1 reply


Is a CVM reboot (or any of these commands/tests) or a disk reseat still needed if the disk is listed as "tombstoned"?

ncli disk ls-tombstone-entries