Solved

Data Resiliency Status critical - what happens when another node goes down?

  • 17 April 2019
  • 5 replies
  • 664 views

Badge +1
Hi all.
I have read the document about the different failure scenarios. It is still not clear to me what really happens if the data resiliency status is still critical and another node fails.

In the doc I read this: With two sets of physical disks that are inaccessible, there would be a chance that some VM data extents are missing completely.

When we started using Nutanix (version 3.x), I remember our Nutanix partner telling me that in such a case the cluster stops to make sure the data stays consistent.

Will I just have some UVMs that no longer work correctly, or does the cluster stop so that all UVMs are stopped?

Thank you very much.

Kind regards, Stefan

Best answer by jrack 17 April 2019, 20:48




Userlevel 7
Badge +25
How big is your cluster? With RF2, data is spread across two nodes in the cluster, on one of the devices in each of those nodes.

  • 4 nodes can tolerate the loss of a whole node w/o impact, and any copies that went missing are re-replicated onto the remaining 3.
  • 3 nodes can tolerate the loss of a node w/o data loss, but as you then have only 2 nodes left, zookeeper and other quorum systems have a single master elected to avoid split brain. So the cluster is impaired and can't tolerate another node loss as you would lose access to your 2 copies. This happens during an upgrade of a 3 node cluster, but it is temporary and, yeah, not a place you want to be for long.
  • 2 nodes... not really a thing in CE or really in the DFS, and would look like a weird raid1.
  • 1 node... no additional copies, so when that node is gone it is gone.
So yes, unless you move to RF3 on 5 nodes, there is a chance that losing 2 nodes in an RF2 pool (or just 2 specific block devices if you are really unlucky) puts you in a data loss situation. Since CE maxes out at 4 nodes you really can't do RF3.
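
To make the RF2 point above concrete, here is a toy sketch in Python. It assumes every extent gets exactly two replicas on two distinct nodes chosen uniformly at random, which is a simplification and not how the real DFS places data; `place_extents` and `lost_extents` are made-up names for illustration only.

```python
import random

def place_extents(num_nodes, num_extents, rf=2, seed=0):
    """Toy placement: each extent lands on `rf` distinct, randomly chosen nodes."""
    rng = random.Random(seed)
    nodes = list(range(1, num_nodes + 1))
    return [frozenset(rng.sample(nodes, rf)) for _ in range(num_extents)]

def lost_extents(placement, failed_nodes):
    """Indices of extents whose replicas ALL sit on failed nodes."""
    failed = set(failed_nodes)
    return [i for i, replicas in enumerate(placement) if replicas <= failed]

placement = place_extents(num_nodes=4, num_extents=6000)

# One node down: every extent still has a copy on another node.
one_down = lost_extents(placement, {4})

# Two nodes down at once: extents that lived exactly on that pair are gone.
two_down = lost_extents(placement, {3, 4})

print("inaccessible extents with node 4 down:", len(one_down))
print("fraction inaccessible with nodes 3 and 4 down:",
      len(two_down) / len(placement))
```

In this model a single node failure never makes an extent inaccessible, while losing two nodes together hits the extents that happened to live on exactly that pair (about one in six with 4 nodes, since there are 6 possible node pairs).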

Is that what you were looking for? Maybe you have a specific scenario in mind, as I always find an example more useful for understanding a concept than the abstract stuff.
Badge +1
My scenario is as follows: I have set up a new 4 node cluster with RF2. At the moment I have only test UVMs on it. Today I did some tests. First I shut down one node. After HA had moved all UVMs from that node over to the other 3 nodes and the shut-down node was detached after 30 minutes, data resilience was rebuilt and came back to OK status. Then I shut down another node. Everything worked as expected. No problems.
But the question is what happens to the cluster when I don't wait until data resilience is OK before shutting down the second node. I understand that I lose some UVM data, as some replicas could be missing. But will the cluster still run, or will the cluster stop and all UVMs stop with it to keep the data consistent, as I was told some years ago?
Or will the cluster keep running, and I only have problems with some UVMs because these are missing some data blocks?
As I have only test VMs on it so far, I can easily test this scenario, but I do not want to put all the config work I have done on the cluster so far at risk.
Userlevel 7
Badge +25
So technically you could put the file system in an odd state, lose extent access and likely crash the VM.

It could happen that Stargate shuts down and tries to protect the data at rest from corruption or incomplete copies. When you went from 4 to 3, all the VMs that were restarted on one of the other 3 potentially lack data locality (33% chance their new node had their 2nd extent copy). So once the VMs are "happy", the DFS figures out there are numerous extents in the remaining 3 with only one copy, so it starts establishing the 2nd copy in the rest. Now if you killed the 3rd node before it was stabilized (aka Prism OK), not only are you breaking the quorum, but you may kill off the only copy of a given extent. Note this could not only be a node outage, but also a block device failure on one of the remaining nodes.

Example...
  1. VM is on 4 and has some extent copies on 3 (as well as on 2 and 1).
  2. You kill 4 and the VM starts on 2 where it has some local extents, but not all.
  3. It will start making copies to 2 for locality, and all other VMs get their 2nd copies that were on 4 re-replicated.
  4. You kill 3 before the 2nd copies for the VM now on 2 have been localized to 2, so the only devices with that extent (3 and 4) are inaccessible and you will get filesystem issues for that VM.
A 4 node cluster can tolerate a single node outage w/o being impaired. Tolerating 2 nodes is classically an RF3 situation (5+ nodes) so you have 3 copies in the rare event that you have a cascading failure.
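
The example above can be sketched as a small simulation. This is only a toy model under stated assumptions: replicas are placed uniformly at random, `rebuild_fraction` is a made-up knob for how far re-replication got before the second node went down, and nothing here models Stargate or quorum behaviour.

```python
import random

def simulate(rebuild_fraction, num_extents=6000, seed=1):
    """Return how many extents have no copy left after node 4, then node 3, fail.

    rebuild_fraction: share of under-replicated extents that received a new
    second copy before node 3 failed (0.0 = no rebuild, 1.0 = rebuild finished).
    """
    rng = random.Random(seed)
    nodes = [1, 2, 3, 4]
    placement = [set(rng.sample(nodes, 2)) for _ in range(num_extents)]

    # Step 1: node 4 fails, so its replicas become inaccessible.
    alive = {1, 2, 3}
    for replicas in placement:
        replicas.discard(4)

    # Step 2: re-replication restores a second copy, but only partially.
    under = [r for r in placement if len(r) == 1]
    to_fix = int(len(under) * rebuild_fraction)
    for replicas in under[:to_fix]:
        replicas.add(rng.choice(sorted(alive - replicas)))

    # Step 3: node 3 fails before (or after) the rebuild completed.
    for replicas in placement:
        replicas.discard(3)

    # Extents with an empty replica set are inaccessible.
    return sum(1 for r in placement if not r)

for fraction in (0.0, 0.5, 1.0):
    print(fraction, simulate(fraction))
```

In this model, waiting for the rebuild to finish (`rebuild_fraction=1.0`) leaves no extent without a copy after the second failure, which matches the test that went fine once data resilience was back to OK. Any value below 1.0 leaves some extents whose only copies were on nodes 3 and 4.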

That help?
Badge +1
Thank you very much for that super explanation.
With that, I come to the conclusion that only the VMs that have lost some of their extents will get into trouble. These VMs will probably crash. But all the other VMs, which have all their extents on either node 1 or 2, will keep running.
And the most important point for me: the Nutanix cluster with its CVMs can handle this without any problems. As soon as I bring nodes 3 and 4 up again, the whole cluster will get back to a healthy state, except for the VMs I have to repair (if possible) or recover from a snapshot of the protection domain, or from the external backup if the PD is also broken.

If this is all correct, I will do the test with my new cluster and the test VMs tomorrow.

Thank you very much again.
Userlevel 7
Badge +25
So to be clear, Stargate will not be happy with a 50% loss of capacity in the cluster, and it may bring down the storage interface and limit access for all consumers. Honestly not sure when it shifts into a defensive mode, so test well, but just know that on paper a 4 node RF2 cluster can tolerate the loss of 1 node w/o issues. Beyond that you are treading on thin ice, and corruption and availability will be very situational.

And be aware that it will need to reestablish a quorum when going to that 3rd node and sometimes that has been a bit wonky. Not sure what the goal is here, but just know "here be dragons".
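
As a rough illustration of the quorum remark, here is the generic majority rule that quorum systems such as zookeeper rely on. The ensemble sizes in the loop are assumptions chosen for illustration, not a statement about how a given Nutanix cluster sizes its quorum, and the function names are hypothetical.

```python
def majority(ensemble_size):
    """Smallest number of members that forms a strict majority."""
    return ensemble_size // 2 + 1

def has_quorum(ensemble_size, members_up):
    """True if the surviving members can still form a majority."""
    return members_up >= majority(ensemble_size)

def tolerated_failures(ensemble_size):
    """How many members can be lost while a majority remains."""
    return ensemble_size - majority(ensemble_size)

# Assumed ensemble sizes, for illustration only.
for size in (3, 5):
    print(size, "members: majority", majority(size),
          "tolerates", tolerated_failures(size), "failure(s)")
```

Under this rule a 3-member ensemble survives one member going down but not two, which is one reason a second node loss can hurt availability even where a copy of the data still exists.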