Solved

Dealing with disaster…how does Nutanix respond?

  • 25 September 2021
  • 4 replies
  • 54 views

Things won’t always go to plan, and sooner or later the worst will happen. In this case, we lose 2 nodes from our 9-node RF2 cluster (e.g. one is down for maintenance, and someone forgets and reboots another node at the same time).
 

Can someone point me at some documentation that outlines how we can recover from this and get the cluster back up and running? Is there any permanent fallout?

Thanks. 


Best answer by UPX 26 September 2021, 10:53





As a rule of thumb, with RF2 a Nutanix cluster can tolerate the loss of only a single node at a time, so I suggest opening a support ticket to check the cluster and get your system back.

Are all hosts and cvms available? 

Thanks. Sorry, I should have made clear this is a ‘what if’ scenario (nothing wrong now). I’d like to know what the ultimate outcome is if we were ever to face a 2 host failure. I understand the first step is a support call to Nutanix but I want to know how things go from there and whether we stand to lose data. 
 

Thanks. 


It mainly depends on what happens after the failure, but the chance of losing data exists, and you definitely have to check the exact state of the nodes and CVMs with the support team.

There is no right answer or workaround here.
For example, a few days ago an NTC fellow found himself in the exact situation you described: while one node was in maintenance mode, another one was rebooted. One of the nodes ended up failed; the other came back online after the reboot, but with about 200 VMs down.
With the help of support staff he rebuilt the failed node using a Phoenix ISO with AOS and the hypervisor embedded, and everything went fine, but... I really don’t want to find myself in a situation like that, with about 200 VMs down... you know...

The best rule I can suggest is:
for any cluster with more than 5 nodes, set RF3 (FT2) at the cluster level and use two containers: one RF3 container for critical VMs and one RF2 container for the rest.

 


UPX - you’re right on… hope you don’t mind if I add on to your comments.

 

If 2 or more nodes fail suddenly, or a second node fails before the cluster has become resilient again after the loss of the 1st node, the hypervisor should take the storage offline (aka an APD, ‘All Paths Down’, event). At this point the hypervisor (assuming it’s configured per Nutanix best practices) should halt/power off VMs to prevent any data loss, i.e. a VM trying to ‘commit’ a write when the hardware cannot actually honor that request WOULD/could lead to data loss.
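To put a rough number on why a second failure hurts with RF2 but not with RF3, here is a back-of-the-envelope sketch. It is a toy model only: it assumes every piece of data puts its copies on distinct nodes chosen uniformly at random, which is NOT how AOS actually places replicas, and the function name `fraction_fully_lost` is made up for illustration.

```python
from fractions import Fraction
from math import comb


def fraction_fully_lost(nodes: int, rf: int, failed: int) -> Fraction:
    """Share of data whose every replica sits on a failed node (toy model)."""
    # Assumption: each piece of data keeps `rf` copies on `rf` distinct nodes
    # picked uniformly at random. Real AOS replica placement is not uniform.
    # comb(failed, rf) counts placements that fit entirely inside the failed set.
    return Fraction(comb(failed, rf), comb(nodes, rf))


if __name__ == "__main__":
    print(fraction_fully_lost(9, 2, 2))  # RF2, two nodes down together -> 1/36
    print(fraction_fully_lost(9, 3, 2))  # RF3, two nodes down together -> 0
```

Under those assumptions, roughly 1 in 36 pieces of data (about 2.8%) on a 9-node RF2 cluster would have both copies on the two missing nodes, so that data is unreachable until at least one of them comes back. With RF3 a third copy always survives a 2-node failure.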

 

Possible exceptions: designs that use BLOCK or RACK resiliency can sustain the loss of more than one node in certain situations, e.g. 2 nodes in the SAME block or 2 nodes in the SAME rack.
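The same toy model can illustrate why it matters WHERE the two failed nodes sit. This sketch assumes a hypothetical layout of 9 nodes in 3 blocks of 3, and assumes the two RF2 copies are always kept on nodes in different blocks; the node names, the block layout and the helper `block_aware_loss` are all invented for illustration and do not describe actual AOS behavior.

```python
from fractions import Fraction
from itertools import combinations


def block_aware_loss(blocks: dict, failed: set) -> Fraction:
    """Share of RF2 placements with both copies on failed nodes (toy model)."""
    # Map each node name to the block it lives in.
    node_block = {n: b for b, members in blocks.items() for n in members}
    # Assumption: the two copies always land on nodes in different blocks.
    placements = [
        pair for pair in combinations(sorted(node_block), 2)
        if node_block[pair[0]] != node_block[pair[1]]
    ]
    lost = sum(1 for pair in placements if set(pair) <= failed)
    return Fraction(lost, len(placements))


# Hypothetical layout: 3 blocks with 3 nodes each.
BLOCKS = {"A": ["a1", "a2", "a3"], "B": ["b1", "b2", "b3"], "C": ["c1", "c2", "c3"]}

if __name__ == "__main__":
    print(block_aware_loss(BLOCKS, {"a1", "a2"}))  # same block -> 0
    print(block_aware_loss(BLOCKS, {"a1", "b1"}))  # different blocks -> 1/27
```

In this model, losing two nodes from the same block never takes out both copies of anything, while losing one node from each of two blocks still does.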


(Is there ANYTHING in life with 0% risk???) - with that in mind, I think anyone working in storage/virtualization should at minimum perform (or review) a risk analysis and disaster planning:

  • What kind of outage can the organization that relies on the data absorb? 1 minute, 1 hour, 1 day, etc.? The shortest tolerable outage needs to drive decisions such as setting the resiliency factor to RF3 or spreading the risk across more than one cluster, along with designing recovery steps that use NearSync/Async replication, etc.
  • Hardware resiliency: invest in redundant components, switches, power, etc.
  • Coordinate all changes so that only ONE admin actually touches/makes changes to the cluster, preferably in a maintenance window.
  • Keep a well-maintained disaster recovery plan (stored off the cluster, where key people can reach it) with approved tiers, documented recovery steps AND TESTING of the plan. It should be part of overall business continuity planning (i.e. not just addressing the VMs/storage, but also people/logistics/communications, etc.)
  • etc etc… mitigating risk requires constant work & vigilance.

Recommended reads & References:

https://www.joshodgers.com/tag/all-paths-down/

http://www.joshodgers.com/2020/06/22/i-o-path-resiliency-comparison-nutanix-aos-vmware-vsan-dellemc-vxrail/

 

https://portal.nutanix.com/page/documents/details?targetId=Web-Console-Guide-Prism-v6_0:arc-failure-modes-c.html

 

https://portal.nutanix.com/page/documents/solutions/details?targetId=TN-2068-Infrastructure-Resiliency:TN-2068-Infrastructure-Resiliency

 

Reviewing the KBs below to understand the implications of AOS fixes is highly recommended (and they are still good reading for understanding possible issues!)

https://portal.nutanix.com/page/documents/kbs/details?targetId=kA0600000008fb7CAA


ESXi: VMware KB 2032940