How many node failures can a 5-node RF3 cluster tolerate?

Hello!

We have a 5-node RF3 cluster. I tried shutting down 2 of the 5 nodes at the same time. The cluster stays online, but a rebuild never happens. Is this normal? Should a 5-node RF3 cluster rebuild after losing 2 nodes? Can it survive losing 1 more node after already losing 2 (with only 2 of the 5 nodes still online)?

A 5-node RF3 cluster can survive 2 nodes being down at the same time. Five nodes is the minimum for RF3, so while 2 nodes are down it cannot fully rebuild back to 2-node resiliency. If there is enough capacity, it should rebuild the data to tolerate 1 more node failure. You can check this in the "Data Resiliency" widget on the home page of Prism.


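If you prefer to check this from a script instead of clicking through Prism, something like the sketch below should work against the Prism v2.0 REST API. This is a minimal sketch: the endpoint path and the cluster_redundancy_state field names are from my memory of the v2.0 API, so verify them in your Prism REST API Explorer, and replace the placeholder host and credentials.

```python
# Minimal sketch: read the cluster redundancy state via the Prism v2.0 REST API.
# The endpoint path and JSON field names below are assumptions; confirm them
# in your Prism REST API Explorer before relying on this.
import requests

PRISM = "https://prism.example.local:9440"   # placeholder hostname
AUTH = ("admin", "password")                 # placeholder credentials

resp = requests.get(
    f"{PRISM}/PrismGateway/services/rest/v2.0/cluster/",
    auth=AUTH,
    verify=False,  # only if Prism uses a self-signed certificate
)
resp.raise_for_status()
state = resp.json().get("cluster_redundancy_state", {})

# desired_redundancy_factor is the configured RF (3 in your case);
# current_redundancy_factor drops while the cluster is degraded.
print("desired RF:", state.get("desired_redundancy_factor"))
print("current RF:", state.get("current_redundancy_factor"))
```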

How long can this rebuild take? In our environment only about 350 GB of 47 TiB is used, but after 3 hours the Data Resiliency status on the home page still said "Auto rebuild in progress: Yes". Does that mean the rebuild is still running? Is there a way to check the status of the rebuild?

 


It will keep showing Critical because you are on RF3 and fewer than 5 nodes are available. It can never completely rebuild to tolerate 2 more node failures, because that is impossible with only 3 nodes left. If you click on the word "Critical", what does it show you?


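For the per-component breakdown behind that Critical status (the same "Failures Tolerable" numbers Prism shows), there is a domain fault tolerance endpoint in the v2.0 API. Again, this is only a sketch under the assumption that the endpoint and field names below match your AOS version; check the API Explorer on your cluster.

```python
# Sketch: fetch per-component fault tolerance (Metadata, Zookeeper,
# Stargate health, ...) via the Prism v2.0 REST API. The endpoint path
# and JSON structure are assumptions; confirm them in the API Explorer.
import requests

PRISM = "https://prism.example.local:9440"   # placeholder hostname
AUTH = ("admin", "password")                 # placeholder credentials

resp = requests.get(
    f"{PRISM}/PrismGateway/services/rest/v2.0/cluster/domain_fault_tolerance_status/",
    auth=AUTH,
    verify=False,  # only if Prism uses a self-signed certificate
)
resp.raise_for_status()

# Assumed shape: a list of domains (NODE, DISK, ...), each with a dict of
# component statuses keyed by component name.
for domain in resp.json():
    comps = domain.get("component_fault_tolerance_status", {})
    for name, status in comps.items():
        # number_of_failures_tolerable is the "Failures Tolerable" value
        # shown in Prism; 0 means that component cannot absorb any
        # further failure.
        print(domain.get("domain_type"), name,
              status.get("number_of_failures_tolerable"))
```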

It showed “Failures Tolerable 0” for Metadata, Zookeeper and Stargate health.