Nutanix fault tolerance question | Nutanix Community

Hi Guys,

I have a Nutanix block with four nodes, i.e. a four-node cluster.

Supermicro NX30-60

I have a query about fault tolerance. It is currently set to RF-2.

Each node has 6 disks: 4 HDDs and 2 SSDs.

4 x 800 GB HDD

2 x 200 GB SSD

OS is AHV. 

1) Assume one host fails completely: the VMs will be restarted on another host and will continue to run without any issues.

2) If one disk fails on the first AHV host (either an HDD or an SSD, or both at the same time), there will be no impact.

3) If two disks fail on two different hosts (1st AHV and 2nd AHV) at the same time (assume one HDD fails on each node), what will be the impact on the running VMs in the cluster?

4) If two disks fail on two different hosts (1st AHV and 2nd AHV) at different times, e.g. 30 minutes apart (assume one HDD fails at 6:00 AM on the first node and another HDD fails at 6:30 AM on the second node), what will be the impact on the running VMs in the cluster?

 

Could someone please advise? 

 

Thank you, 

Manivel RR

 

 

 

 

Hi @Rajamanivel 

  1. Yes, provided HA is enabled for that VM and the cluster has enough spare capacity (CPU and RAM).
  2. Single disk failure: Nutanix AOS proactively monitors disks and, in the majority of cases, will alert before the disk actually fails. For example, if a disk is nearing the end of its life, or a read or write operation is not answered within a few milliseconds, AOS will mark the disk as bad and you will see NCC alerts for either the disk or the component responsible for data management. The second copy of your data is hosted on another disk on another node in the cluster, so the impact will be none. You can read more under “Data Path Resiliency” at nutanixbible.com. Also note that because Nutanix does not use RAID, AOS rebuilds only the data actually stored on the disk, not the full size of the disk.

 

3 & 4 :

Dual Disk Failures :

With Replication Factor = 2, there are two copies of each data block. Rebuilds are based on the capacity utilised rather than the disk size: with a 4 TB disk that is only 1 TB utilised, only 1 TB worth of blocks has to be re-replicated across the cluster, which shortens rebuild times. Add AOS proactive disk alerting to that, and two drives failing at exactly the same time should be a very rare occurrence, unless two disks are physically pulled at the same time.

Now, if that were to happen, the impact will depend on how many blocks had both of their copies on these two disks; the VMs that own those blocks could be affected. You can always have a separate container with Replication Factor = 3 for critical VMs, which will ensure you have 3 copies of the data spread out across the cluster.
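To get a feel for "how many blocks had both copies on these two disks", here is a rough Python sketch. It is a toy model only: it assumes each block's two copies land on uniformly random disks in two different nodes, which is a simplification and not how AOS actually chooses replica locations. The function names and the block count are made up for illustration.

```python
import random

def place_replicas(num_extents, nodes, disks_per_node, seed=0):
    """Toy RF2 placement: two copies per extent, always on different nodes.

    Assumption: nodes and disks are picked uniformly at random.
    Real AOS placement takes capacity, tiering and load into account.
    """
    rng = random.Random(seed)
    placement = []
    for _ in range(num_extents):
        node_a, node_b = rng.sample(range(nodes), 2)  # two distinct nodes
        placement.append((
            (node_a, rng.randrange(disks_per_node)),
            (node_b, rng.randrange(disks_per_node)),
        ))
    return placement

def extents_lost(placement, failed_disks):
    """Count extents whose every replica sits on a failed disk."""
    failed = set(failed_disks)
    return sum(1 for replicas in placement
               if all(replica in failed for replica in replicas))

# 4 nodes with 6 disks each, as in the question.
placement = place_replicas(10_000, nodes=4, disks_per_node=6)
one_disk = extents_lost(placement, [(0, 0)])             # single failure
two_hosts = extents_lost(placement, [(0, 0), (1, 0)])    # one disk on each of two hosts
print(one_disk, two_hosts, len(placement))
```

In this toy model a single disk failure never loses data, and neither do two disks in the same node, because the two copies are always on different nodes. With two disks on different hosts failing together, roughly 1 block in 216 would have both copies affected (1/6 chance of that node pair times 1/36 for the two disks). That ratio only holds under the uniform assumption above.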

If the two drives on different hosts fail within a short window (30 minutes in your example), you will need to factor in the capacity utilised on these drives as well, since that determines how much data has to be re-replicated; the rebuild itself is triggered immediately after the first failure.
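As a back-of-the-envelope way to think about the 30-minute case, the sketch below estimates how long re-replicating the utilised data would take and compares it with the gap between the two failures. The throughput figures are purely hypothetical placeholders, not Nutanix numbers; the real rate depends on cluster size, load and hardware.

```python
def rebuild_minutes(used_gb, rebuild_mb_per_s):
    """Minutes needed to re-replicate `used_gb` of data at a given rate.

    `rebuild_mb_per_s` is an assumed aggregate cluster-wide rate.
    """
    return used_gb * 1024 / rebuild_mb_per_s / 60

def still_rebuilding(used_gb, rebuild_mb_per_s, gap_minutes):
    """True if the first rebuild is still running when the second disk fails."""
    return rebuild_minutes(used_gb, rebuild_mb_per_s) > gap_minutes

# An 800 GB HDD holding 200 GB of data, with an assumed 500 MB/s rebuild rate.
print(round(rebuild_minutes(200, 500), 1))   # about 6.8 minutes
print(still_rebuilding(200, 500, 30))        # False: rebuild done before 6:30 AM
print(still_rebuilding(800, 100, 30))        # True: a full disk at an assumed slower rate
```

If the first rebuild has finished before the second disk fails, every block is back to two healthy copies and the second failure is just another single disk failure. If it has not, only the blocks not yet re-replicated are at risk.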

In the event of a disk failure, a Curator scan (MapReduce Framework) will occur immediately.  It will scan the metadata (Cassandra) to find the data previously hosted on the failed disk and the nodes / disks hosting the replicas.

Once it has found the data that needs to be “re-replicated”, it will distribute the replication tasks to the nodes throughout the cluster.
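The scan-and-distribute idea can be pictured with a small Python sketch. This is not Curator's code or API; the metadata layout, the function name and the round-robin task assignment are all assumptions made for illustration.

```python
from itertools import cycle

def plan_rereplication(metadata, failed_disk, nodes):
    """Build a list of (extent, source replica, target node) tasks.

    metadata: dict mapping extent id -> list of (node, disk) replica locations
              (a stand-in for the metadata store; hypothetical layout).
    Targets are handed out round-robin, skipping nodes that already hold
    a surviving copy of the extent.
    """
    tasks = []
    targets = cycle(nodes)
    for extent_id, replicas in sorted(metadata.items()):
        if failed_disk not in replicas:
            continue  # extent was not on the failed disk
        survivors = [r for r in replicas if r != failed_disk]
        if not survivors:
            continue  # no healthy copy left to replicate from
        survivor_nodes = {node for node, _ in survivors}
        for _ in range(len(nodes)):
            target = next(targets)
            if target not in survivor_nodes:
                tasks.append((extent_id, survivors[0], target))
                break
    return tasks

metadata = {
    "e1": [("A", 0), ("B", 1)],
    "e2": [("A", 0), ("C", 2)],
    "e3": [("B", 0), ("D", 1)],
}
for task in plan_rereplication(metadata, ("A", 0), ["A", "B", "C", "D"]):
    print(task)
```

Here only e1 and e2 had a copy on the failed disk, so only those two get a task, and the work is spread over different nodes rather than funnelled to a single spare. Node A can still be a target in this sketch because only one of its disks failed.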

During this process a Drive Self Test (DST) is started for the bad disk and SMART logs are monitored for errors.

The “Data Path Resiliency” section at nutanixbible.com explains a bit more about how Nutanix protects data and metadata.

You can also read the following thread for some more info:
https://next.nutanix.com/how-it-works-22/disk-fault-tollerance-8822

Hope that helps. I have tried to answer your points; please feel free to discuss for further clarity. Thanks

 

BR


Thanks BR for your prompt response. Much appreciated.

 

 

Thanks,

Manivel RR