Host services restart on network changes, resulting in VMs going down

  • 27 November 2019
  • 3 replies


We have recently had a couple of incidents in which network changes affected our clusters. The changes caused a restart on the lead host because it detected a network loss, which then resulted in system outages. The cluster is configured with dual network ports in active-passive mode, and our understanding was that it would fail over on any detected change or failure without producing error events or taking systems down.


Best answer by sbarab 28 November 2019, 19:24


Userlevel 3
Badge +3

@roberthwl This should definitely have been the case "IF" the switches involved were configured correctly. I recently dealt with an issue where the core switch had some problems that caused this behaviour; the cause was found only after involving the switch vendor.

One other thing: the AHV version should be relatively recent (for example AHV-20170830.300 or above), as many issues were addressed in the later AHV releases.

You can involve Nutanix support to review any logs or re-test with them on the line (if possible), but I would probably review the switch logs and configuration before going that route.

Badge +10

We updated the clusters to AHV version 20170830.200 earlier this year and would look to do so again soon (is the .300 in your reply a typo?).

I ran diagnostics on the cluster and a separate set on the network components, and neither flagged any faults or errors in any configuration or associated components.

The network team would now need to verify that the switch configuration is OK.

As far as I know, nothing has been changed since the clusters were set up by the hardware provider's technical support team.

Userlevel 3
Badge +3

@roberthwl Not a typo: the last number of the release should be .300 or above. If you are in a position to test this on one node, that would help a lot in looking at both the physical switch issues and the Nutanix cluster; you can open a case with our support and have them look at some logs. If not, the logs are all we have, and they can be reviewed to see why things did not happen as expected.
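
For anyone checking several hosts against the ".300 or above" guidance, here is a minimal Python sketch of that comparison. It assumes AHV version strings have the form seen in this thread (an eight-digit stamp, a dot, then a build number) and that a later stamp or a higher build number means a newer release; that ordering is an assumption, not something confirmed here. The function names `parse_ahv_version` and `meets_minimum` are hypothetical helpers, not part of any Nutanix tool.

```python
import re


def parse_ahv_version(text):
    """Extract (stamp, build) from a string such as 'AHV-20170830.300'.

    Assumes the '<8 digits>.<build>' shape seen in this thread.
    """
    match = re.search(r"(\d{8})\.(\d+)", text)
    if match is None:
        raise ValueError(f"no AHV-style version found in {text!r}")
    return int(match.group(1)), int(match.group(2))


def meets_minimum(installed, minimum="AHV-20170830.300"):
    """Return True if `installed` is at or above `minimum`.

    Assumption: versions order by stamp first, then by build number.
    """
    return parse_ahv_version(installed) >= parse_ahv_version(minimum)


if __name__ == "__main__":
    for version in ("20170830.200", "AHV-20170830.300"):
        status = "ok" if meets_minimum(version) else "below minimum"
        print(f"{version}: {status}")
```

Under these assumptions the 20170830.200 release mentioned above falls below the suggested minimum, while 20170830.300 meets it.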