Disk fault tolerance

  • 25 March 2016
  • 3 replies

Hi there, I'm very impressed by Nutanix technology!

But I have a question. Say I have a block with three nodes (the basic starting setup), configured with RF2. I understand that I can lose 1 node and my whole system is still up and running.

If I lose a disk inside a single node, I can also keep running without issue, BUT what happens if I lose two disks at the same time on different nodes?

As I understand it, in this case I lose the data that was stored on both of the lost disks, so does my whole infrastructure go down?

Probably I'm missing something...

Best answer by Jon 31 March 2016, 10:47

True, but unlike traditional storage systems, this is rarely, if ever, a problem, and here's the high-level "why":

Nutanix does not use RAID to protect data. Instead, we protect data with a "Replication Factor" (RF), which stores individual blocks of data redundantly across two or more nodes in the cluster (i.e. RF2 keeps two copies, RF3 keeps three copies).

Say you have a drive fail, and let's say it was a 1TB drive that was only 200GB full.

For the sake of easy math, let's stick with a three-node cluster.

That means (roughly) 200GB of information was on that disk, and the second copies of that 200GB are spread across all of the disks in the other two nodes, roughly 100GB per node.
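To make the math above concrete, here is a tiny Python sketch. It assumes RF2 and a perfectly even spread of the second copies across the surviving nodes, which is a simplification for illustration; the function name is made up for this example and is not a Nutanix API.

```python
def replica_share_per_node(used_gb: float, total_nodes: int) -> float:
    """GB of second-copy data each surviving node holds for one failed disk.

    Assumes RF2 and a perfectly even spread (a simplification).
    """
    surviving_nodes = total_nodes - 1
    if surviving_nodes < 1:
        raise ValueError("need at least two nodes to hold a second copy")
    return used_gb / surviving_nodes


# The example from this thread: 200GB used on the failed disk, 3-node cluster.
print(replica_share_per_node(200, 3))  # 100.0
```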

In a traditional storage system, you'd have to:
  • Rebuild the entire 1TB drive map onto a hot spare (idle drive) within the system, regardless of how much data was actually on it
  • Rebuild that data from parity in the "RAID Pack" the drive failed from, which trashes performance of that RAID pack and the other workloads on it, and the operation takes forever

With Nutanix there is no RAID, so you only have to rebuild/re-protect 200GB worth of information instead of 1TB. Also, that 200GB is spread out across the entire cluster, so all disks and nodes participate in the rebuild, which spreads out the rebuild task and keeps the impact on cluster performance very low (if there is any at all).
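As a rough back-of-the-envelope comparison of the two approaches, the sketch below uses the numbers from this thread. The 4 disks per node is a made-up figure just to show how the work gets divided, 1TB is treated as 1000GB, and the even split is an assumption, not a measurement of a real cluster.

```python
def rebuild_comparison(drive_capacity_gb, used_gb, surviving_nodes, disks_per_node):
    """Compare how much data each approach has to move after one drive failure."""
    # Traditional RAID: the whole drive is rebuilt onto a single hot spare.
    raid_rebuild_gb = drive_capacity_gb
    # Replication Factor: only the data that was actually on the drive is re-protected.
    rf_rebuild_gb = used_gb
    # Assumption: surviving copies are spread evenly over every disk on the other nodes.
    participating_disks = surviving_nodes * disks_per_node
    return {
        "raid_rebuild_gb": raid_rebuild_gb,
        "rf_rebuild_gb": rf_rebuild_gb,
        "rf_gb_per_disk": rf_rebuild_gb / participating_disks,
    }


# 1TB drive, 200GB used, 2 surviving nodes, 4 disks per node (hypothetical).
result = rebuild_comparison(1000, 200, 2, 4)
print(result)
```

With these inputs each participating disk only has to supply about 25GB, versus a single hot spare absorbing the full 1TB rebuild in the RAID case.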

The end result?
Drives fail, and rebuilds happen very quickly, as the rebuild eats into the free capacity of the cluster. This means no idle/wasted hot spares.
It also means that the data is re-protected faster, so the likelihood of that second drive failure taking out data is minimized (not zero, but minimized).

If you are concerned with dual disk failure, which some customers are for business critical operations, you'd want to go with an RF3 setup, which is basically N+2, so you can have any two components fail without worry.
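To illustrate the RF2 vs RF3 difference for the "two disks on different nodes at the same time" case, here is a toy Python model. It is only a sketch: the random placement below is an assumption for illustration and is not how Nutanix actually places data, and it assumes both disks fail before any re-protection has happened. All names are hypothetical.

```python
import random


def place_blocks(num_blocks, nodes, disks_per_node, rf, seed=0):
    """Toy placement: each block gets `rf` copies on `rf` distinct nodes,
    on a randomly chosen disk within each of those nodes."""
    rng = random.Random(seed)
    placement = []
    for _ in range(num_blocks):
        chosen_nodes = rng.sample(range(nodes), rf)
        placement.append({(n, rng.randrange(disks_per_node)) for n in chosen_nodes})
    return placement


def blocks_lost(placement, failed_disks):
    """A block is lost only if every one of its copies sat on a failed disk."""
    failed = set(failed_disks)
    return sum(1 for copies in placement if copies <= failed)


# Two disks fail at once on different nodes: disk 0 on node 0 and disk 0 on node 1.
failed = [(0, 0), (1, 0)]
rf2 = place_blocks(10_000, nodes=3, disks_per_node=4, rf=2)
rf3 = place_blocks(10_000, nodes=3, disks_per_node=4, rf=3)

print("RF2 blocks lost:", blocks_lost(rf2, failed))
print("RF3 blocks lost:", blocks_lost(rf3, failed))
```

In this toy model only the RF2 blocks that happened to have both copies on exactly those two disks are lost (about 1 in 48 with these made-up numbers), while RF3 loses nothing because a third copy always lives on another disk.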

Anyhow, you can read more about cluster resiliency in the Nutanix Bible.
Badge +3
Thank you very much for your exhaustive response.

This is all I need to know about that.

Thank you!

What will happen to VMs in a cluster with RF2 when 1 drive fails on each of 2 different hosts? Will there be any data loss? Will the VMs reboot?