Solved

SATADOM Failure


Userlevel 1
Badge +5
I assume that the hypervisor (consider ESXi) and the CVM configuration files reside on the SATADOM.

1) What happens if the SATADOM fails?
2) Is there any redundancy mechanism available to withstand a failure?

Best answer by patrbng 11 April 2017, 17:35


10 replies

Userlevel 4
Badge +20
1) There is only one SATADOM per node, and if it fails you’ll have to re-image the node using the Foundation/Phoenix process to re-install the hypervisor and the Nutanix CVM virtual machine. After that’s done, the host can be added back to the cluster. Support can/will walk you through the process.

2) I don't think any of the NX/XC/HX solutions support redundant SATADOMs.
Userlevel 1
Badge +5
Thanks for the quick reply. What would happen if the ESXi node hit by the SATADOM failure has user VMs running on it? I suspect the user VMs will fail over and restart on a healthy neighbouring ESXi node once the failure is declared (PDL).
Userlevel 4
Badge +20
If the SATADOM fails, the hypervisor fails and it is treated as a node failure. If vSphere HA is set up, the VMs should be restarted on the other nodes in the cluster, as long as enough compute and memory capacity exists to run them. It's recommended to configure vSphere HA with admission control enabled to ensure that capacity is available. When the VMs restart, they read whatever RF data is already local on the node they land on; if the RF data isn't on that node, the reads go over the network, and data that is accessed often is replicated back to the VM's new host node for data locality. If the failed node stays offline long enough, it is removed from the metadata ring and the protection data is re-created on the remaining nodes.
Make sense?
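
To make the admission control point concrete, here's a rough back-of-the-envelope check of whether the surviving nodes could absorb the VMs from one failed host. It's just a sketch with made-up numbers, not pulled from any real cluster or API:

```python
# Rough N+1 sizing sketch for vSphere HA admission control.
# All capacities and demands below are made-up example numbers;
# substitute your own per-node capacity and total VM reservations.

nodes = {                      # usable capacity per node: (CPU GHz, RAM GB)
    "node-a": (48.0, 512),
    "node-b": (48.0, 512),
    "node-c": (48.0, 512),
    "node-d": (48.0, 512),
}

vm_cpu_demand = 110.0          # total CPU the VMs need across the cluster (GHz)
vm_mem_demand = 1200           # total memory the VMs need across the cluster (GB)

def survives_one_node_failure(nodes, cpu_needed, mem_needed):
    """True if the VMs still fit after losing any single node."""
    for failed in nodes:
        cpu_left = sum(cpu for name, (cpu, _) in nodes.items() if name != failed)
        mem_left = sum(mem for name, (_, mem) in nodes.items() if name != failed)
        if cpu_left < cpu_needed or mem_left < mem_needed:
            return False
    return True

if survives_one_node_failure(nodes, vm_cpu_demand, vm_mem_demand):
    print("N+1 capacity available: HA can restart the VMs on the remaining nodes.")
else:
    print("Not enough headroom: HA may not be able to restart everything.")
```

Admission control in vSphere does essentially this kind of reservation accounting for you; the sketch is just to illustrate why the headroom has to exist before a SATADOM takes a node down.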
Userlevel 1
Badge +5
Thanks a lot. Clear enough.
Userlevel 1
Badge +3
THIS IS A VERY PAINFUL PROCESS.

I have had 2 SATADOM failures in one of my customers' clusters.
I was told these things last 1-2 years. There are newer replacement models out that are supposed to last longer, but they are a single point of failure with a very short life.

In both instances we never received any alerts or predictive failure warnings, the NCC health checks didn't find any problems, and another utility that is supposed to show the remaining life of these devices didn't report the correct information either.

In both instances this was a very painful process. Basically the node was running from memory and we lost management of the host. The VMs were running but could not be migrated to any other host. We incurred outages on our VMs to get the environment working again. I spent a few sleepless nights and many hours troubleshooting this with Nutanix support, and had to respond in emergency fashion to get things working.

I have 3 more hosts with the old SATADOMs and don't really have a clear plan to prevent this from happening again. I have been requesting that these be replaced, but no word on this yet.

This is a very bad design, and I expect better when claims are made about invisible infrastructure and self-healing. That is not the case here, and if you have the old SATADOMs you should expect this to happen to you within 1-2 years.
I do have one question for chadgiardina: how do you check what the SATADOM model is, and whether it is the improved version or not?

Many Thanks


UPDATE - I have this model: SATADOM 2DSL_3ME

How do I find out whether that is the new version or not?
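
(For anyone else looking for the model string: this is roughly how I pulled it. A quick sketch only - it assumes you've saved the output of "esxcli storage core device list" from the ESXi host into a file called devices.txt, and the device names and layout may differ on your hosts.)

```python
# Quick parse of saved "esxcli storage core device list" output to spot
# the SATADOM by its model string. Assumes the usual layout where the
# device identifier is unindented and attribute lines like "Model:" are
# indented underneath it. devices.txt is just where I saved the output.

def device_models(path="devices.txt"):
    models = []
    current = None
    with open(path) as fh:
        for line in fh:
            if line.strip() and not line[0].isspace():
                current = line.strip()                 # device identifier line
            elif line.strip().startswith("Model:"):
                model = line.split(":", 1)[1].strip()
                models.append((current, model))
    return models

for device, model in device_models():
    flag = "  <-- SATADOM" if "SATADOM" in model.upper() else ""
    print(f"{device}: {model}{flag}")
```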
Userlevel 3
Badge +8
@patrbng Is @nutanix considering a design that allows mirroring of the SATADOM/hypervisor boot device? I'm not looking forward to repeating @chadgiardina's experience.
Userlevel 4
Badge +20
The G6s are using M.2 drives instead of SATADOMs, so there is "some" improvement, but there is no capability to mirror to a second M.2 yet, even though a second one is installed. I would work with your Nutanix SE to see if mirroring is on the roadmap.
I hope the CVM will take care of restarting the VMs on another node in case of a node failure. But in the case of a SATADOM failure, how will the CVM behave, since the SATADOM holds the CVM configuration too?
We have just deployed some G6s with this new M.2 drive, and before it even went into production we had already lost one host due to this drive failing...
