Question

Intermittent network issue between VMs on different NX hosts

  • 3 December 2023
  • 5 replies
  • 110 views


Hi

I have an intermittent issue where VMs can’t contact VMs on other hosts.  I’m trying to find a pattern with timings and the hosts involved but haven’t found one yet.
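One way to look for a pattern in timings and hosts is to log timestamped reachability checks from each VM. The following is a minimal sketch, not a Nutanix tool: the function names (`probe`, `log_line`, `run_once`), the CSV layout, and any addresses or ports passed in are assumptions for illustration. It tests a TCP connection rather than ICMP, so it only works against a port that is known to be listening on the target VM.

```python
import socket
from datetime import datetime, timezone


def probe(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, unreachable networks and timeouts.
        return False


def log_line(source_host: str, target: str, port: int, ok: bool) -> str:
    """Format one CSV row: UTC timestamp, hypervisor host, target, port, result."""
    stamp = datetime.now(timezone.utc).isoformat(timespec="seconds")
    result = "ok" if ok else "FAIL"
    return f"{stamp},{source_host},{target},{port},{result}"


def run_once(source_host: str, targets: list[tuple[str, int]]) -> list[str]:
    """Probe every (address, port) target once and return the CSV rows."""
    return [
        log_line(source_host, address, port, probe(address, port))
        for address, port in targets
    ]
```

Calling `run_once("Host1", [...])` every minute from a scheduler on each VM, with the name of the physical host it currently runs on, gives rows that can later be grouped by source host, target and time of day.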

When a VM can’t reach a VM on another host, the problem can be resolved by migrating the VMs to the same physical NX host.

For example, if VM1 on Host1 can’t reach VM2 on Host2, the workaround is to migrate VM2 to Host1.

This leads me to believe there is a problem with the network between hosts. Each host has 2 uplinks going to 2 switches with active-active load balancing, and my current theory is that the problem lies with the aggregation. Perhaps packets are leaving on eth2 and returning on eth3?
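On the eth2/eth3 question: in a link aggregation group, each end picks the outgoing member link for a flow on its own, so a reply coming back on a different member than the request left on is expected and is not a fault by itself. The toy model below only illustrates that idea. The hash (`zlib.crc32` over a string), the `seed` values and the function names are assumptions made up for this sketch, not the real hash used by OVS Balance-TCP or by the Nexus switches.

```python
import zlib

UPLINKS = ["eth2", "eth3"]


def pick_uplink(src_ip: str, dst_ip: str, src_port: int, dst_port: int,
                seed: int = 0) -> str:
    """Toy flow hash: map a flow's addresses and ports onto one bond member.

    The seed stands in for the fact that the host and the switch run
    different hash implementations.
    """
    key = f"{seed}|{src_ip}|{dst_ip}|{src_port}|{dst_port}".encode()
    return UPLINKS[zlib.crc32(key) % len(UPLINKS)]


def path_for_flow(src_ip: str, dst_ip: str, src_port: int, dst_port: int,
                  host_seed: int = 1, switch_seed: int = 2) -> tuple[str, str]:
    """Return (member used for the request, member used for the reply)."""
    # The host hashes the outbound packet.
    outbound = pick_uplink(src_ip, dst_ip, src_port, dst_port, seed=host_seed)
    # The switch side hashes the reply independently, with swapped fields.
    inbound = pick_uplink(dst_ip, src_ip, dst_port, src_port, seed=switch_seed)
    return outbound, inbound
```

Running `path_for_flow` over a range of source ports shows some flows using the same member in both directions and others not. In a correctly configured LACP bond both cases work, which suggests looking at the switch-side configuration rather than the asymmetry itself.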

I’m going to change the OVS bond mode from Balance-TCP to Active-Backup and see what happens. I don’t need the throughput of 2 active links, so perhaps this should have been the config from the start.

Question: has anyone else seen this behaviour on their cluster?

 

Environment:

4 node cluster

Uplinks to 2 Cisco Nexus switches aggregated using Balance-TCP with LACP fast enabled

AOS 6.5

 

Thanks!



5 replies


I've seen many issues with wrongly configured switches.

As you are using LACP, did you configure LACP on the switch ports as well? And are the switches aware of the split LACP across multiple switches? (I'm not a Cisco network guy, so I don't know the name for this. In Mellanox terms it is MLAG with an IPL between switches.)

Or are there VLANs missing on specific switch ports?

If you can switch to active-backup, that would be the first step to test. Don't forget to disable LACP on the switch ports as well.


Hi JeroenTielen, thanks for the reply. Yes, we have LACP configured on the switch ports and the correct VLANs added to the switch ports/interfaces. I had planned to reconfigure the hosts and switches to use Active-Backup, but after a very informative NX support call the SRE helped identify the switches as the most likely fault. As a first step, we are going to try upgrading the Nexus switches, which are 4 years out of date. If we reach a conclusion I’ll post more info here.


We haven’t upgraded the Nexus switches yet but did give them a reboot. We thought the Nexus might be part of the problem due to 1000+ days of uptime and a couple of odd behaviours, such as the MAC address table being blank when it should have 100s of entries. The reboot fixed the empty MAC table issue, but the overall host-to-host issue persisted.

 

Our current theory is that the issue is caused by an upstream router that our VMs use as their default gateway. We gave that a reboot and since then the problem has not recurred! VMs on different hosts can reach each other fine. How long it will last is unclear, but I think we know enough now to say that it’s not the NX cluster itself at fault.
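A quick check that may help test the gateway theory: a VM only sends traffic to its default gateway when the destination is outside its own subnet. The sketch below uses Python's standard `ipaddress` module. The function name `same_subnet` and the example addresses are hypothetical, not taken from our environment.

```python
import ipaddress


def same_subnet(ip_a: str, ip_b: str, prefix_len: int) -> bool:
    """Return True if ip_b falls inside the subnet that ip_a/prefix_len belongs to."""
    network_a = ipaddress.ip_interface(f"{ip_a}/{prefix_len}").network
    return ipaddress.ip_address(ip_b) in network_a


# Hypothetical addresses for two VMs that failed to reach each other.
print(same_subnet("10.0.10.5", "10.0.10.200", 24))  # True: same /24
print(same_subnet("10.0.10.5", "10.0.20.5", 24))    # False: routed via the gateway
```

If an affected pair is on different subnets, their traffic is routed through the gateway, so a faulty router could plausibly break it. If they share a subnet, the traffic is normally switched directly and would not be expected to pass through the router.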


I had a similar issue. Our network team escalated it to Cisco, who upgraded the version, and things started working normally.

Same here. After a reboot and firmware upgrade of the Cisco switches, everything went back to normal.