Dealing with disaster…how does Nutanix respond?

Things won’t always go to plan, and sooner or later the worst will happen. In this case, we lose 2 nodes from our 9-node RF2 cluster (e.g. one is down for maintenance and someone forgets and reboots another node at the same time).

Can someone point me at some documentation that outlines how we can recover from this and get the cluster back up and running? Is there any permanent fallout?

Thanks.


As a rule of thumb, with RF2 a Nutanix cluster can tolerate the loss of only one node at a time, so I suggest opening a support ticket to check the cluster and get your system back.

Are all hosts and CVMs available?

Thanks. Sorry, I should have made clear this is a ‘what if’ scenario (nothing wrong now). I’d like to know what the ultimate outcome is if we were ever to face a 2 host failure. I understand the first step is a support call to Nutanix but I want to know how things go from there and whether we stand to lose data.

Thanks.

It mainly depends on what happens after the failure, but the chance of losing data exists, and you will certainly need to check the exact state of the nodes and CVMs with support.

There is no single right answer or workaround here.
For example, a few days ago an NTC fellow found himself in exactly the situation you described: while one node was in maintenance mode, another one was rebooted. One of the two nodes ended up faulted; the other came back online after the reboot, but with about 200 VMs down.
With the help of support staff he rebuilt the failed node using a Phoenix ISO with AOS and the hypervisor embedded, and everything went fine, but... I really don't want to find myself in a situation like that, with about 200 VMs down… you know…

The best rule I can suggest is:
for any cluster with more than 5 nodes, set RF3 (FT2) at the cluster level and use two containers: one RF3 container for critical VMs and one RF2 container for the rest.
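The rule of thumb above can be written down as a tiny helper. This is only a sketch of the poster's suggestion, not an official Nutanix sizing rule; the function name `suggest_layout` and the dictionary keys are hypothetical and made up for illustration.

```python
def suggest_layout(node_count: int) -> dict:
    """Encode the rule of thumb from this thread (an opinion, not vendor guidance).

    More than 5 nodes: FT2 at the cluster level, with an RF3 container for
    critical VMs and an RF2 container for everything else.
    Otherwise: stay on FT1 with a single RF2 container.
    """
    if node_count < 1:
        raise ValueError("node_count must be a positive integer")

    if node_count > 5:
        return {
            "cluster_fault_tolerance": "FT2",
            "containers": {"critical": "RF3", "general": "RF2"},
        }
    return {
        "cluster_fault_tolerance": "FT1",
        "containers": {"general": "RF2"},
    }


if __name__ == "__main__":
    # The 9-node cluster from the original question.
    print(suggest_layout(9))
```

For the 9-node cluster in the question, this returns FT2 with a `critical` RF3 container and a `general` RF2 container.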

UPX - you’re right on… hope you don’t mind if I add on to your comments.

If 2 or more nodes fail at once, or a second node fails before the cluster has become resilient again after the loss of the first, the hypervisor should take the storage offline (an APD, or ‘All Paths Down’, event). At this point the hypervisor (assuming it is configured according to Nutanix best practices) should halt or power off the VMs to prevent data loss, i.e. a VM trying to ‘commit’ a write that the hardware cannot actually guarantee could lead to data loss.

Possible exceptions: with BLOCK or RACK resiliency designed in, a cluster can sustain the loss of more than one node in certain situations, namely 2 nodes in the SAME block or 2 nodes in the SAME rack.
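To make the timing point concrete, here is a minimal worst-case model of the behaviour described above. It assumes only that data with replication factor N stays available while fewer than N node failures are still "unhealed" (not yet rebuilt elsewhere in the cluster). It deliberately ignores block/rack awareness, free capacity for the rebuild, and everything else support would actually check; the function name and event strings are hypothetical.

```python
def stays_available(replication_factor: int, events: list[str]) -> bool:
    """Worst-case check: does all data stay available through this event sequence?

    events is a time-ordered list of:
      "fail"    - a node goes down (failure, reboot, maintenance)
      "rebuilt" - the cluster has restored full redundancy for one lost node
    Returns False as soon as the unhealed failures reach the replication
    factor, because some data could then have every copy on a down node.
    """
    unhealed = 0
    for event in events:
        if event == "fail":
            unhealed += 1
        elif event == "rebuilt":
            unhealed = max(0, unhealed - 1)
        else:
            raise ValueError(f"unknown event: {event!r}")
        if unhealed >= replication_factor:
            return False
    return True


if __name__ == "__main__":
    # RF2: second node lost before the first loss has been healed.
    print(stays_available(2, ["fail", "fail"]))
    # RF2: second node lost only after the cluster became resilient again.
    print(stays_available(2, ["fail", "rebuilt", "fail"]))
    # RF3: two overlapping failures.
    print(stays_available(3, ["fail", "fail"]))
```

In this model the scenario from the question (RF2, one node in maintenance, another rebooted at the same time) is the first case, which is why the ordering of failures and rebuilds matters as much as the node count.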

(Is there ANYTHING in life with 0% risk???) - with that in mind, I think anyone working in storage/virtualization should at minimum perform (or review) a risk analysis and disaster planning:

What kind of outage can the organization that relies on the data absorb? 1 minute, 1 hour, 1 day? The shortest acceptable time should drive decisions such as setting the resiliency factor to RF3 or spreading the risk across more than one cluster, along with designing recovery steps that use replication (NearSync/Async) etc.
Hardware resiliency: invest in redundant components, switches, power etc.?
Coordinate all changes so that only ONE admin actually touches or makes changes to the cluster at a time, preferably in a maintenance window.
Have a well-kept disaster recovery plan (stored off the cluster and accessible to key persons) with approved tiers, documented recovery steps AND TESTING of the plan. It should be part of any business continuity plan (i.e. not just addressing the VMs/storage, but people, logistics, communications etc.).
And so on… mitigating risk requires constant work and vigilance.
