Hi
I have an intermittent issue where VMs can’t contact VMs on other hosts. I’m trying to find a pattern with timings and the hosts involved but haven’t found one yet.
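To look for a pattern, I'm thinking of logging every probe between VMs and then tallying the failures by host pair and by hour of day. Below is a rough sketch of that tally only. The event format, the `tally_failures` helper, the VM-to-host mapping and the sample data are all made up for illustration; none of it comes from a Nutanix tool.

```python
from collections import Counter
from datetime import datetime


def tally_failures(events, vm_to_host):
    """Count failed probes per (source host, destination host) pair and per hour of day.

    events: iterable of (iso_timestamp, src_vm, dst_vm, ok) tuples (hypothetical format).
    vm_to_host: dict mapping a VM name to the host it was running on.
    """
    by_host_pair = Counter()
    by_hour = Counter()
    for timestamp, src_vm, dst_vm, ok in events:
        if ok:
            continue  # only failed probes are interesting here
        by_host_pair[(vm_to_host[src_vm], vm_to_host[dst_vm])] += 1
        by_hour[datetime.fromisoformat(timestamp).hour] += 1
    return by_host_pair, by_hour


# Made-up sample data, purely to show the shape of the input.
vm_to_host = {"VM1": "Host1", "VM2": "Host2", "VM3": "Host3"}
events = [
    ("2024-01-10T09:15:00", "VM1", "VM2", False),
    ("2024-01-10T09:20:00", "VM1", "VM3", True),
    ("2024-01-10T14:05:00", "VM1", "VM2", False),
    ("2024-01-10T14:30:00", "VM3", "VM2", False),
]

pairs, hours = tally_failures(events, vm_to_host)
print(pairs.most_common())    # host pairs with the most failures first
print(sorted(hours.items()))  # failures per hour of day
```

If one host pair or one time window dominates the counts, that would at least narrow down where to look.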
When a VM can’t reach a VM on another host, the problem can be resolved by migrating both VMs to the same physical NX host.
For example, if VM1 on Host1 can’t reach VM2 on Host2, the workaround is to migrate VM2 to Host1.
This leads me to believe there is a problem with the network between hosts. Each host has 2 uplinks going to 2 switches with active-active load balancing, and my current theory is that the problem lies with the link aggregation. Perhaps packets are leaving on eth2 and returning on eth3?
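To illustrate what I mean by packets leaving on one uplink and returning on the other: as far as I understand it, each end of a link aggregate picks its own transmit link by hashing the flow, so the two directions of one connection are not guaranteed to use the same link. The toy sketch below shows the idea only. The CRC32 hash, the `pick_uplink` helper and the IP addresses and ports are my own assumptions for illustration; this is not the hash that OVS or the Nexus switches actually use.

```python
import zlib

UPLINKS = ["eth2", "eth3"]


def pick_uplink(src_ip, dst_ip, src_port, dst_port, uplinks=UPLINKS):
    """Toy flow hash: map a flow's addresses and ports onto one uplink.

    Illustrative only; real bond implementations use their own hash functions.
    """
    key = f"{src_ip}:{src_port}->{dst_ip}:{dst_port}".encode()
    return uplinks[zlib.crc32(key) % len(uplinks)]


# Outbound direction, hashed by the sending side (made-up addresses and ports).
outbound = pick_uplink("10.0.0.11", "10.0.0.22", 40000, 443)

# Reply direction, hashed independently with source and destination swapped.
inbound = pick_uplink("10.0.0.22", "10.0.0.11", 443, 40000)

print("outbound uses", outbound, "and the reply uses", inbound)
```

The same flow always hashes to the same link, but nothing ties the reply direction to that link, which is the behaviour I suspect is involved here.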
I’m going to change the OVS bond mode from Balance-TCP to Active-Backup and see what happens. I don’t need the throughput of 2 active links, so perhaps this should have been the config from the start.
Question: has anyone else seen this behaviour on their cluster?
Environment:
4 node cluster
Uplinks to 2 Cisco Nexus switches, aggregated using Balance-TCP with LACP fast enabled
AOS 6.5
Thanks!