Hi @Rajamanivel
- Yes - provided HA is enabled for that VM and the cluster has spare capacity (CPU + RAM) available.
- Single disk failure: Nutanix AOS proactively monitors disks and, in the majority of cases, will alert even before the actual disk failure. For example, if a disk is nearing the end of its life, or a read or write operation is not responded to within a few milliseconds, AOS will mark the disk bad and you will see NCC alerts for either the disk or for the component responsible for data management. The second copy of your data is hosted on another disk in another node in the cluster, so the impact will be none. You can read more about "Data Path Resiliency" at nutanixbible.com. Also note that Nutanix doesn't use RAID, so AOS will only rebuild the actual amount of data stored, not the full size of the disk - a rough sketch of that arithmetic is below.
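To put numbers on that, here is a quick back-of-the-envelope sketch (plain Python; the disk size, utilisation, node count and rebuild rate are all invented example figures, not Nutanix specifications):

```python
# Back-of-the-envelope: how much data actually needs re-replication when
# a disk fails, and roughly how long that takes. AOS has no RAID-style
# full-disk rebuild, so only the stored data is re-replicated.

disk_size_tb = 4.0   # physical size of the failed disk (example)
utilised_tb = 1.0    # data actually stored on it (example)

data_to_rebuild_tb = utilised_tb  # not disk_size_tb

# The rebuild is distributed, so every node contributes bandwidth.
nodes = 4                # assumed cluster size
per_node_mb_s = 200      # assumed per-node rebuild rate
aggregate_mb_s = nodes * per_node_mb_s

seconds = data_to_rebuild_tb * 1_000_000 / aggregate_mb_s  # TB -> MB
print(f"Data to rebuild : {data_to_rebuild_tb} TB (not {disk_size_tb} TB)")
print(f"Rough duration  : {seconds / 60:.0f} min at {aggregate_mb_s} MB/s aggregate")
```

The point of the sketch: because every node participates and only utilised capacity is copied, the rebuild window shrinks as the cluster grows.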
3 & 4:
Dual Disk Failures:
With Replication Factor = 2 (RF2) we have two copies of each data block. With a 4 TB disk that is only 1 TB utilised, there is only 1 TB worth of blocks/data to re-protect across the cluster (shorter rebuild times, as we factor in only the capacity utilised). Add AOS's proactive disk alerting, and, unless two disks are pulled at the same time, two drives failing simultaneously is a very rare occurrence.
Now, if that were to happen, the impact would depend on how many blocks had both of their copies on those two disks; only those blocks, and the VMs owning them, would be affected. You can always have a separate container with Replication Factor = 3 (RF3) for critical VMs, which ensures you have three copies of data spread out across the cluster - see the sketch after the next paragraph.
If the two drives fail on different hosts within, as per your example, 30 minutes, you will again need to factor in the capacity utilised on those drives, but the rebuild of the first drive's data will have been triggered immediately.
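To illustrate the RF2-vs-RF3 difference, here is a small probability sketch (illustrative Python only; it assumes each block's replicas land on distinct, uniformly random disks, which ignores AOS's node-aware placement and so slightly overstates the RF2 exposure):

```python
from math import comb

def expected_fraction_lost(total_disks: int, failed_disks: int, rf: int) -> float:
    """Fraction of blocks expected to lose ALL replicas when
    failed_disks fail at once, assuming the rf replicas of each
    block sit on rf distinct, uniformly random disks (simplified)."""
    if failed_disks < rf:
        return 0.0  # at least one replica is on a surviving disk
    # probability that all rf replicas fall inside the failed set
    return comb(failed_disks, rf) / comb(total_disks, rf)

disks = 24  # e.g. 4 nodes x 6 disks, an assumed example layout
for rf in (2, 3):
    frac = expected_fraction_lost(disks, failed_disks=2, rf=rf)
    print(f"RF{rf}: fraction of blocks losing every copy = {frac:.4%}")
```

With two simultaneous disk failures, RF2 exposes only the small fraction of blocks whose two copies happened to sit on exactly those two disks, while RF3 loses nothing until a third failure.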
In the event of a disk failure, a Curator scan (MapReduce framework) will occur immediately. It will scan the metadata (Cassandra) to find the data previously hosted on the failed disk and the nodes/disks hosting the replicas.
Once it has found the data that needs to be "re-replicated", it will distribute the replication tasks to nodes throughout the cluster.
During this process, a Drive Self Test (DST) is started on the bad disk and its SMART logs are monitored for errors.
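As a mental model of that scan-and-distribute flow, here is a purely illustrative toy in Python (the metadata layout, disk names and scheduling are all made up and bear no relation to the real Curator/Cassandra internals):

```python
from collections import defaultdict
from itertools import cycle

# Toy metadata: extent id -> disks holding its replicas (invented data).
extent_replicas = {
    "e1": ["diskA", "diskB"],
    "e2": ["diskA", "diskC"],
    "e3": ["diskB", "diskD"],
}

failed_disk = "diskA"
healthy_nodes = ["node1", "node2", "node3"]

# "Scan" phase: find extents that lost a replica on the failed disk,
# keeping track of a surviving disk that still holds a good copy.
under_replicated = {
    extent: [d for d in disks if d != failed_disk]
    for extent, disks in extent_replicas.items()
    if failed_disk in disks
}

# "Distribute" phase: fan the re-replication tasks out round-robin
# across the cluster instead of funnelling them through one node.
tasks = defaultdict(list)
for (extent, survivors), node in zip(under_replicated.items(), cycle(healthy_nodes)):
    tasks[node].append((extent, survivors[0]))

for node, work in sorted(tasks.items()):
    print(node, "re-replicates:", work)
```

The key property this mimics is that no single node becomes the rebuild bottleneck; the work scales out with the cluster.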
The "Data Path Resiliency" section at nutanixbible.com explains in more detail how Nutanix protects data and metadata.
You can also read the following thread for some more info:
https://next.nutanix.com/how-it-works-22/disk-fault-tollerance-8822
Hope that helps - I have tried to answer your points. Please feel free to reply if anything needs further clarity. Thanks
BR
Thanks BR for your prompt response. Much appreciated.
Thanks,
Manivel RR