
We have a 6-node Nutanix cluster on VMware ESXi spread across two racks. Each rack has a ToR switch, and both ToRs are in an IRF stack. All nodes are dual-uplinked (one NIC to each ToR).
Core switch handles L3 routing, and VLANs are trunked to the IRF stack.

We're planning to replace both ToR switches with new IRF switches — without downtime.

Could the community please advise on:

  • Best practices for live switch replacement

  • How to ensure CVM and ESXi connectivity during the change

  • Key health checks from Prism

  • Any Nutanix KB or doc on this

Great question. Replacing both ToR switches in a live Nutanix on ESXi cluster without downtime is possible with careful planning, because each dual-uplinked node can keep running on one ToR while the other is swapped. Here's a step-by-step approach based on best practices.

You should always validate the steps here and consult Nutanix support before performing any work on production. 

 

@Mariappan Let me know if this is a good starting point for your planning.

 

✅ 1. Pre-Change Validation

  • Health checks:

    • Run NCC (Nutanix Cluster Check) from any CVM: ncc health_checks run_all

    • Confirm in Prism that all CVMs are up and that Data Resiliency Status shows OK.

    • Check for any recent alerts or degraded components.

  • Redundancy check:

    • Confirm that each node is dual-uplinked with both links up, and note the teaming mode in use (LACP, active/active, or active/standby).

    • Ensure both ToRs are in the IRF stack and passing traffic redundantly.
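
The redundancy check above can be expressed as a simple go/no-go gate. This is only a minimal sketch, assuming you have recorded each host's uplink state by hand (for example from the vSphere client). The host names, NIC names, and the Uplink structure are hypothetical and not part of any Nutanix or VMware API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Uplink:
    nic: str        # hypothetical label, e.g. "vmnic0"
    tor: str        # which ToR the NIC is cabled to, e.g. "ToR-A"
    link_up: bool   # link state as observed on the host


def hosts_blocking_drain(inventory, tor_to_drain):
    """Return hosts that would lose all connectivity if tor_to_drain went away.

    A host is safe only if at least one of its uplinks is up on a ToR
    other than the one being drained.
    """
    blocked = []
    for host, uplinks in inventory.items():
        survivors = [u for u in uplinks if u.link_up and u.tor != tor_to_drain]
        if not survivors:
            blocked.append(host)
    return sorted(blocked)


# Hand-collected example data (hypothetical hosts and NIC names).
inventory = {
    "esx-01": [Uplink("vmnic0", "ToR-A", True), Uplink("vmnic1", "ToR-B", True)],
    "esx-02": [Uplink("vmnic0", "ToR-A", True), Uplink("vmnic1", "ToR-B", False)],
}
```

An empty result for the ToR you are about to drain means every host has a surviving uplink. Any host that is listed should have its cabling or teaming fixed before you start.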

🔄 2. Switch Replacement Workflow (No Downtime Strategy)

You’ll want to replace one ToR at a time while ensuring traffic is always flowing through the other.

Step-by-Step:

  1. Drain one ToR

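    • On ToR-A, administratively shut down the host-facing interfaces so that failover happens in a controlled way before the switch is removed.

    • Verify that all host traffic has shifted to the ToR-B uplinks.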

    • Confirm CVMs, Prism, and ESXi hosts remain reachable.

  2. Replace ToR-A

    • Physically replace and reconfigure it into the new IRF stack.

    • Match VLANs, LACP, and trunking config from the original.

    • Reconnect cables to ToR-A, re-enable ports, and verify connectivity.

  3. Repeat for ToR-B

    • Once every host shows both uplinks active again, follow the same process for ToR-B: drain, replace, reconnect.

  4. Verify full redundancy

    • Confirm both switches are in the IRF stack and passing traffic.

    • Check vSwitch uplink status on each ESXi host (for example with esxcli network nic list).

    • Re-run NCC to ensure cluster health.
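
To confirm that CVMs, Prism, and ESXi hosts stay reachable during each drain, a small TCP probe can be run from a workstation whose own network path does not depend on the ToR being replaced. The following is a rough sketch rather than a Nutanix-provided tool: the target names and IP addresses are placeholders, and the choice of ports (9440 for the Prism web console, 22 for SSH, 443 for the ESXi host client) is an assumption about what is open in your environment.

```python
import socket


def is_reachable(host, port, timeout=2.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unreachable networks.
        return False


def unreachable_targets(targets, timeout=2.0):
    """Return the (name, host, port) entries that fail the TCP check."""
    return [t for t in targets if not is_reachable(t[1], t[2], timeout)]


# Placeholder target list; replace the addresses with your own.
TARGETS = [
    ("prism-vip", "10.0.0.10", 9440),
    ("cvm-01", "10.0.0.11", 22),
    ("esx-01", "10.0.0.21", 443),
]

# Example use during a drain (not executed here):
#   failed = unreachable_targets(TARGETS)
#   if failed: stop and re-enable the ToR ports before continuing
```

Calling unreachable_targets(TARGETS) every few seconds while ports are shut down gives an early signal to stop and roll back if anything drops off the network.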

🛠️ 3. Important Tips

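  • Place individual CVMs in maintenance mode from Prism only if needed; ideally, avoid it altogether.

  • ESXi NIC teaming (vSwitch or DVS) must match the switch-side port configuration: LACP requires a LAG on a DVS, a static link aggregation requires "Route based on IP hash", and ports with no aggregation should use an active/active or active/standby policy that is not IP hash.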

  • Validate vMotion and management network functionality during each ToR replacement.

  • If available, use out-of-band management on switches for configuration verification.
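
The teaming tip above comes down to a compatibility rule between the switch-side port mode and the ESXi policy. A minimal sketch of that rule follows. The policy strings are the names shown in the vSphere teaming and failover settings, while the switch-mode labels ("lacp", "static_lag", "no_lag") and the "lag" policy label are hypothetical names chosen for this example.

```python
# Policy names as they appear in the vSphere teaming and failover settings.
ORIGINATING_PORT = "Route based on originating virtual port"
IP_HASH = "Route based on IP hash"
SOURCE_MAC = "Route based on source MAC hash"
NIC_LOAD = "Route based on physical NIC load"   # distributed switch only
EXPLICIT_FAILOVER = "Use explicit failover order"

# Hypothetical labels for how the ToR host-facing ports are configured.
ALLOWED = {
    "static_lag": {IP_HASH},
    "no_lag": {ORIGINATING_PORT, SOURCE_MAC, NIC_LOAD, EXPLICIT_FAILOVER},
}


def teaming_matches_switch(switch_mode, esxi_policy, on_dvs):
    """Return True if the ESXi teaming policy fits the switch-side port mode."""
    if switch_mode == "lacp":
        # LACP is negotiated by a LAG on a distributed switch, not by one
        # of the load-balancing policies above.
        return on_dvs and esxi_policy == "lag"
    if esxi_policy == NIC_LOAD and not on_dvs:
        return False
    return esxi_policy in ALLOWED.get(switch_mode, set())
```

For example, teaming_matches_switch("no_lag", IP_HASH, on_dvs=True) is False, which reflects the common mistake of enabling IP hash on hosts whose switch ports are not aggregated.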

📚 Nutanix & VMware Resources

  • Nutanix KB: Search for “Network Redundancy” and “ToR Replacement” (examples: KB 1937, KB 7076).

  • VMware KB: Refer to NIC teaming and failover policy best practices.

  • Consider opening a Nutanix Support case to get a verified checklist for your environment.

 

