Solved

Failover Testing Failed

  • 26 May 2016
  • 4 replies
  • 2195 views

Badge +5
Hello, we have a brand new cluster (4 nodes) and did some failover testing over the weekend before placing the new gear into production. We have two Cisco 4500-X 10Gb switches, with the nodes split between the two switches for redundancy. We simulated a switch failure by pulling the plug on one of them and experienced some very bad results: some CVMs became unresponsive and we lost connectivity to most of our VMs. Has anyone else experienced a similar issue? Or can anyone point me in the right direction on which configurations to double-check? Would appreciate any insight!

Best answer by charlie_chuhak 2 June 2016, 18:07

Worked with support to get this squared away. We ended up upgrading the hypervisor, which solved the issue.

4 replies

Userlevel 7
Badge +30
First thing - have you put in a support ticket with Nutanix yet?

That's the best first step here, as we can help you validate your configuration on the Nutanix and hypervisor side. We can also help guide the conversation on the network side (a bunch of our support staff are ex-Cisco, and some are even CCIEs).

Past that, it would be key to know how you have your vSwitches set up and how you have your Cisco switches set up.

When you open the support ticket, if you could attach the "show run" output from both switches, appropriately censored (we don't need SSH keys or anything like that), that would really help.

Feel free to CC me on the ticket: Jon at Nutanix dot com
Badge +5
Thanks for the reply, it is appreciated! I have a support ticket open but have had a hard time connecting with the engineer, as we are in different time zones. I just copied you in on the ticket.

We are using a distributed switch with each host's uplinks split between the two Cisco 4500-X switches. I've confirmed via CDP that all uplinks are properly connected to each switch.

I'll be sure to add the "sh run" output from the switches to the ticket. Thanks again for chiming in. Looking forward to getting the cluster into production.
Badge +5
Worked with support to get this squared away. We ended up upgrading the hypervisor, which solved the issue.
Userlevel 7
Badge +30
Thanks for the update here.

For everyone else: as far as I know, the VDS and Cisco configurations were fine. The cluster was hitting a VMware bug, and upgrading to 6.0u2 resolved it.
