Node failure | Nutanix Community

Hi team, I have a question. As I understand it, Nutanix can tolerate 1 node failure with RF2, where the data is kept as 2 copies. What happens if I have 10 nodes: how many node failures can Nutanix tolerate at the same time? Does it still tolerate only 1 node failure, or is there some other mechanism? I am confused after reading this KB: https://portal.nutanix.com/page/documents/details/?targetId=Web-Console-Guide-Prism-v5_17%3Aarc-host-failure-c.html

There are 2 options of fault tolerance - RF2 and RF3. 

RF2 means there are 2 copies of all data. With RF2 one node can go down at a given time.

RF3 means there are 3 copies of all data. With RF3 two nodes can go down at a given time.

If you have 10 nodes and RF2 configuration, one node can go down and the cluster will stay up. When the node goes down, the data starts rebuilding and the cluster recreates the copies of data that went missing. If another node goes down while the data rebuild is not finished, the cluster will go down. 

If you have enough free space in the cluster, after some time, when the rebuild is complete, one more node can go down and so on.
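To make the sequence described above concrete, here is a minimal sketch in Python. It is a toy model, not Nutanix code: the `Cluster` class and its methods are hypothetical names invented for illustration, and it assumes the simple rule from this answer (RF2 absorbs one un-rebuilt node failure, RF3 absorbs two) while ignoring free-space limits and minimum cluster size.

```python
from dataclasses import dataclass


@dataclass
class Cluster:
    """Toy model of the rule described in this thread (hypothetical, not a Nutanix API)."""
    nodes: int                  # nodes currently up
    replication_factor: int     # 2 for RF2, 3 for RF3
    pending_rebuilds: int = 0   # node failures whose data has not been rebuilt yet

    def tolerated_now(self) -> int:
        """Further node failures the cluster can absorb right now."""
        return (self.replication_factor - 1) - self.pending_rebuilds

    def fail_node(self) -> bool:
        """Take one node down. Returns True if the cluster stays up."""
        if self.tolerated_now() <= 0:
            return False  # a copy of some data may be gone: cluster goes down
        self.nodes -= 1
        self.pending_rebuilds += 1
        return True

    def finish_rebuild(self) -> None:
        """Missing copies have been recreated (assumes enough free space)."""
        self.pending_rebuilds = 0


# 10 nodes, RF2: second failure arrives before the rebuild finishes.
a = Cluster(nodes=10, replication_factor=2)
first_a = a.fail_node()
second_a = a.fail_node()

# 10 nodes, RF2: second failure arrives after the rebuild finishes.
b = Cluster(nodes=10, replication_factor=2)
first_b = b.fail_node()
b.finish_rebuild()
second_b = b.fail_node()

print(first_a, second_a)            # True False
print(first_b, second_b, b.nodes)   # True True 8
```

In this model the number of nodes (10) does not change how many failures are tolerated at the same moment; only the replication factor and whether the previous rebuild has finished do.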



 

Hi Sergei, thanks for your clear answer. One more question 😃: do you have an estimated time for rebuilding the data? Could you give me an example based on used space?

I also need to clarify this: “If another node goes down while the data rebuild is not finished, the cluster will go down.” Does that mean the whole cluster goes down, or only the VMs on the failed nodes, while the VMs on the other nodes stay up?

 

 


It’s impossible to say, because it depends on too many factors, such as how much data needs to be rebuilt, whether it’s an all-flash or a hybrid cluster, and the overall I/O load on the cluster. If the cluster is all-flash and the nodes are not heavily utilised space-wise, it can take around 15-20 minutes. In a cluster with a lot of HDDs and heavy space usage it can take several hours. Don’t treat those numbers as accurate, because it depends on the situation and will be different each time. But, in general, it doesn’t take awfully long.

And to answer your second question - if two nodes are down at the same time (in RF2) before the rebuild from the first node going down is finished, the whole cluster will go down, meaning all VMs in the cluster will be down.
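On the rebuild-time question, the dependence on data volume and hardware can be illustrated with a back-of-envelope calculation. This is only a sketch: the function name and both parameters are hypothetical, the throughput values below are made-up illustrative inputs rather than measurements, and real rebuild speed also varies with the I/O load on the cluster.

```python
def estimate_rebuild_minutes(data_to_rebuild_gib: float,
                             rebuild_throughput_gib_per_s: float) -> float:
    """Rough rebuild time: data that lost a copy divided by aggregate rebuild speed.

    Both inputs are assumptions the reader has to supply for their own cluster.
    """
    if rebuild_throughput_gib_per_s <= 0:
        raise ValueError("rebuild throughput must be positive")
    seconds = data_to_rebuild_gib / rebuild_throughput_gib_per_s
    return seconds / 60


# Illustrative inputs only: 2 TiB lost a copy, two different aggregate speeds.
fast = estimate_rebuild_minutes(2048, 2.0)   # faster cluster, in minutes
slow = estimate_rebuild_minutes(2048, 0.2)   # ten times slower cluster, in minutes
print(round(fast, 1), round(slow, 1))
```

With the same amount of data, a tenfold difference in rebuild throughput moves the estimate from minutes to hours, which is why no single number fits every cluster.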


OK, thanks Sergei for your awesome answer. I still need to clarify this: “If another node goes down while the data rebuild is not finished, the cluster will go down.” Does that mean the whole cluster goes down, or only the VMs on the failed nodes, while the VMs on the other nodes stay up?


I’ve edited my previous post with the answer to that question :)


Oh my god 😃, so if one node fails (RF2), I need to repair it as soon as possible 😃. Thanks for the clarification, Sergei :)


It’s not as bad as it sounds. The chances that 2 nodes will go down within an hour are incredibly low.


Haha, that's true, Sergei. Thanks for your answer :)