How many node failures can you withstand? (4-Node Cluster - RF2) | Nutanix Community

 

Hello. I have a question about the RF2 function.

Currently, the cluster is composed of 4 nodes with RF2.

As far as I know, RF2 tolerates the failure of 1 node.

If so, shouldn't the Data Resiliency Status drop to Fail?

However, although one node has failed, the Data Resiliency Status is still OK.

In fact, the CPU, memory, and storage usage in the current cluster is low.

Currently only 3 nodes are active: CPU is at 20%, memory at 40%, and storage at 18.5%.

Why is the Data Resiliency Status OK?

Would the service be affected if one more node were turned off now?

Or will the cluster go down if 1 more node fails?

If it stays up, I want to know why.

Hi @cubensys 

The explanation is fairly simple.
Nutanix bases resiliency not only on the minimum number of nodes (which for a cluster is 3) but also on the availability of resources.

What does this mean in your case?

You have 4 nodes with RF2. When one of the nodes fails, AOS immediately starts re-replicating the data to return to a stable RF2 state. Since you have 4 nodes and probably enough resources to keep the workloads active, the system returns to a stable state with only 3 nodes.

This obviously would not have happened if your cluster had only 3 nodes, or if the resources left on the 3 active nodes were not enough to keep the workloads fully operational.

Now, with your data resiliency in the OK state, you can lose another node, and only then will you face a critical status.
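The two conditions above (enough surviving nodes, enough free resources) can be sketched roughly in Python. This is only a simplified model and not how AOS actually computes the status: it assumes all nodes have equal capacity, looks at storage only, and the function name `can_return_to_rf2` and the example usage figures are hypothetical.

```python
MIN_NODES = 3  # minimum cluster size mentioned in the answer above


def can_return_to_rf2(total_nodes, failed_nodes, used_fraction):
    """Rough check whether the cluster can rebuild back to a stable RF2 state.

    used_fraction is the share of raw capacity in use (all copies included)
    before the failure, assuming equally sized nodes (an assumption).
    """
    surviving = total_nodes - failed_nodes
    if surviving < MIN_NODES:
        # Too few nodes left, regardless of free space.
        return False
    # The same amount of data now has to fit on fewer nodes.
    used_after_rebuild = used_fraction * total_nodes / surviving
    return used_after_rebuild < 1.0


# Hypothetical usage figures, for illustration only.
print(can_return_to_rf2(4, 1, 0.15))  # 4 nodes, lightly used
print(can_return_to_rf2(3, 1, 0.15))  # 3 nodes: too few survivors
print(can_return_to_rf2(4, 1, 0.80))  # 4 nodes, but not enough free space
```

With these made-up figures, only the first case can heal itself: the second falls below the 3-node minimum, and the third has no room left to hold the re-created copies.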

Hope this helps


Thank you for the answer. So I understand that the cluster is in a stable state with 3 nodes. If one of those three nodes fails, the Data Resiliency Status will become critical, but can I assume that the service is maintained as long as resources are available?


Not sure if I understand your question, but usually, if you size a 3-node cluster for N+1, then when a node fails the critical state simply means you cannot lose any other component, while all the workloads stay up and running. Otherwise the system keeps running only the workloads it has resources for; the others are stopped.
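As a back-of-the-envelope N+1 check, you can estimate what the utilization would look like if the same workload had to run on one node fewer. The Python sketch below assumes identical nodes and an evenly spread load, and it only covers CPU and memory headroom for restarting VMs; the function name `utilization_after_failure` is hypothetical. The input figures are the ones from the question (3 active nodes, CPU 20%, memory 40%).

```python
def utilization_after_failure(active_nodes, utilization, failures=1):
    """Estimate utilization if the same load runs on fewer identical nodes."""
    remaining = active_nodes - failures
    if remaining <= 0:
        raise ValueError("no nodes left to run the workload")
    return utilization * active_nodes / remaining


cpu = utilization_after_failure(3, 0.20)
memory = utilization_after_failure(3, 0.40)
print(f"CPU: {cpu:.0%}, memory: {memory:.0%}")  # prints "CPU: 30%, memory: 60%"
```

Both estimates stay well below 100%, which is what "sized for N+1" means here: the remaining nodes have room for every workload.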

You can control the type of HA your cluster uses with 3 different settings:

Best Effort (the system decides automatically based on resource consumption)

HA (the system reserves a dedicated amount of memory per node for HA purposes)

Dedicated Node (an entire node is used by the system as a spare); this setting is deprecated

You will find more useful information about how HA works here:

https://portal.nutanix.com/page/documents/kbs/details?targetId=kA00e000000LIQUCA4

 

 

 


Hi cubensys,

 

Instead of looking at this in terms of the number of nodes, look at it in terms of the number of copies of the data.

You have 4 nodes. In an RF2 cluster you have 2 copies of the data. You lose one node, which means you lose part of the second copy of the data. The expected initial Data Resiliency state is Fail, because if you lose another node it is possible that with that node you lose the other copy of the same data.

However, the second copy of the data can still be recovered by replicating the existing data. Once the copying process is complete, the cluster's Data Resiliency status returns to OK.
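The "count the copies" view can be illustrated with a toy simulation in Python. This is a sketch under simplifying assumptions, not the real AOS placement logic: data is split into numbered extents, copies are placed round-robin, and the function names (`place_rf2`, `fail_node`, `under_replicated`, `rebuild`, `lost`) are all hypothetical.

```python
def place_rf2(num_extents, nodes):
    """Place two copies of each extent on two different nodes (round-robin)."""
    n = len(nodes)
    return {e: {nodes[e % n], nodes[(e + 1) % n]} for e in range(num_extents)}


def fail_node(placement, node):
    """Drop every copy that lived on the failed node."""
    return {e: copies - {node} for e, copies in placement.items()}


def under_replicated(placement, rf=2):
    """Extents that currently have fewer than rf copies."""
    return [e for e, copies in placement.items() if len(copies) < rf]


def lost(placement):
    """Extents with no remaining copy at all."""
    return [e for e, copies in placement.items() if not copies]


def rebuild(placement, surviving, rf=2):
    """Re-create missing copies on surviving nodes from the remaining copy."""
    healed = {}
    for e, copies in placement.items():
        copies = set(copies)
        if copies:  # an extent with no copy left cannot be rebuilt
            for candidate in surviving:
                if len(copies) >= rf:
                    break
                copies.add(candidate)
        healed[e] = copies
    return healed


nodes = ["A", "B", "C", "D"]
after_a = fail_node(place_rf2(8, nodes), "A")
print(under_replicated(after_a))     # extents down to a single copy
healed = rebuild(after_a, ["B", "C", "D"])
print(under_replicated(healed))      # empty once the rebuild has finished
print(lost(fail_node(after_a, "B")))  # second failure before the rebuild
print(lost(fail_node(healed, "B")))   # second failure after the rebuild
```

In this toy model, a second failure right after the first one leaves some extents with no copy at all, while the same failure after the rebuild only reduces extents to a single copy again.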