How to Replace ToR Switches Without Downtime?

Question

We have a 6-node Nutanix cluster on VMware ESXi spread across two racks. Each rack has a ToR switch, and both ToRs are in an IRF stack. All nodes are dual-uplinked (one NIC to each ToR). The core switch handles L3 routing, and VLANs are trunked to the IRF stack.

We're planning to replace both ToR switches with new IRF switches without downtime. Can the community please advise on:
- Best practices for live switch replacement
- How to ensure CVM and ESXi connectivity during the change
- Key health checks from Prism
- Any Nutanix KB or doc on this

ntnx_angelo · Accepted Answer

Great question — replacing both ToR switches in a live Nutanix on ESXi cluster without downtime is possible with careful planning. Here's a step-by-step approach based on best practices.

You should always validate the steps here and consult Nutanix support before performing any work on production.

@Mariappan Let me know if this is a good start in your planning process.

✅ 1. Pre-Change Validation

Health checks:
- Run NCC (Nutanix Cluster Check): ncc health_checks run_all
- Validate all CVMs are in good health from Prism.
- Check for any recent alerts or degraded components.
Redundancy check:
- Confirm that each node is dual-uplinked and using LACP or active/standby NIC teaming.
- Ensure both ToRs are in the IRF stack and passing traffic redundantly.
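
For the dual-uplink confirmation above, a small script can flag miscabled nodes before the change window. This is only a sketch: the inventory list, node names, and NIC names are hypothetical, and in practice the data might come from LLDP neighbor tables or a cabling sheet.

```python
from collections import defaultdict

# Hypothetical inventory: (node, host NIC, switch the NIC is cabled to).
UPLINKS = [
    ("node-1", "vmnic0", "ToR-A"),
    ("node-1", "vmnic1", "ToR-B"),
    ("node-2", "vmnic0", "ToR-A"),
    ("node-2", "vmnic1", "ToR-A"),  # miscabled: both uplinks land on ToR-A
]


def nodes_missing_redundancy(uplinks, required_switches):
    """Return {node: sorted list of required switches it has no uplink to}."""
    seen = defaultdict(set)
    for node, _nic, switch in uplinks:
        seen[node].add(switch)
    required = set(required_switches)
    return {
        node: sorted(required - switches)
        for node, switches in seen.items()
        if required - switches
    }


if __name__ == "__main__":
    gaps = nodes_missing_redundancy(UPLINKS, ["ToR-A", "ToR-B"])
    for node, missing in sorted(gaps.items()):
        print(f"{node}: no uplink to {', '.join(missing)}")
```

Any node reported here would lose connectivity when the switch it depends on is drained, so it is worth fixing before starting.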

🔄 2. Switch Replacement Workflow (No Downtime Strategy)

You’ll want to replace one ToR at a time while ensuring traffic is always flowing through the other.

Step-by-Step:

Drain one ToR
- On ToR-A, shut down the interfaces that face the hosts.
- Verify that all traffic fails over to ToR-B.
- Confirm CVMs, Prism, and ESXi hosts remain reachable.
Replace ToR-A
- Physically replace the switch and configure the replacement as a member of the new IRF stack.
- Match VLANs, LACP, and trunking config from the original.
- Reconnect cables to ToR-A, re-enable ports, and verify connectivity.
Repeat for ToR-B
- Follow the same process: drain, replace, reconnect.
Verify full redundancy
- Confirm both switches are in the IRF stack and passing traffic.
- Check vSwitch uplink status in ESXi.
- Re-run NCC to ensure cluster health.
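
To help with the "match VLANs" step in the workflow above, the permitted VLAN lists on the old and new host-facing ports can be compared programmatically. The sketch below assumes Comware-style `port trunk permit vlan ...` lines copied from each switch config. The sample lines are made up, and `vlan all` is deliberately not handled.

```python
def parse_permit_vlans(line):
    """Expand 'port trunk permit vlan 10 20 30 to 32' into {10, 20, 30, 31, 32}."""
    prefix = "port trunk permit vlan"
    line = line.strip()
    if not line.startswith(prefix):
        raise ValueError(f"not a trunk permit line: {line!r}")
    tokens = line[len(prefix):].split()
    vlans = set()
    i = 0
    while i < len(tokens):
        start = int(tokens[i])
        # A range is written as '<start> to <end>'.
        if i + 2 < len(tokens) and tokens[i + 1] == "to":
            end = int(tokens[i + 2])
            vlans.update(range(start, end + 1))
            i += 3
        else:
            vlans.add(start)
            i += 1
    return vlans


def vlan_diff(old_line, new_line):
    """Report VLANs permitted on the old port but not the new one, and vice versa."""
    old = parse_permit_vlans(old_line)
    new = parse_permit_vlans(new_line)
    return {"missing_on_new": sorted(old - new), "extra_on_new": sorted(new - old)}


# Hypothetical lines taken from the old and new port configs.
OLD = "port trunk permit vlan 10 20 30 to 32"
NEW = "port trunk permit vlan 10 30 to 32 40"

if __name__ == "__main__":
    print(vlan_diff(OLD, NEW))
```

A non-empty `missing_on_new` list is the case to worry about, since hosts would lose those VLANs as soon as their uplink moves to the new switch.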

🛠 3. Important Tips

- Put individual CVMs into maintenance mode in Prism only if needed; ideally, avoid it altogether.
- ESXi NIC teaming (vSwitch or DVS) should use active/active or active/passive, depending on ToR behavior.
- Validate vMotion and management network functionality during each ToR replacement.
- If available, use out-of-band management on the switches for configuration verification.
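
One way to keep an eye on management reachability during each replacement is a simple TCP probe run from a workstation on the management network. This is a rough sketch, not a Nutanix tool: the labels and addresses you pass in are your own, and the ports are assumptions to confirm for your environment (9440 is the usual Prism port, 443 the ESXi host client, 22 SSH). A successful TCP connect only shows that the management path is up; it says nothing about storage traffic between CVMs.

```python
import socket
import time


def tcp_reachable(host, port, timeout=2.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def probe(targets, timeout=2.0):
    """Return the (label, host, port) targets that did not accept a connection."""
    return [t for t in targets if not tcp_reachable(t[1], t[2], timeout)]


def watch(targets, rounds, interval=5.0):
    """Probe the targets `rounds` times, printing one status line per round."""
    for _ in range(rounds):
        failed = probe(targets)
        stamp = time.strftime("%H:%M:%S")
        if failed:
            status = "UNREACHABLE: " + ", ".join(t[0] for t in failed)
        else:
            status = "all reachable"
        print(stamp, status)
        time.sleep(interval)
```

For example, `watch([("prism-vip", "10.0.0.50", 9440), ("esxi-1", "10.0.0.21", 443)], rounds=120)` would print a status line every five seconds for ten minutes; the addresses here are placeholders.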

📚 Nutanix & VMware Resources

- Nutanix KB: Search for “Network Redundancy” and “ToR Replacement” (examples: KB 1937, KB 7076).
- VMware KB: Refer to the NIC teaming and failover policy best practices.
- Consider opening a Nutanix Support case to get a verified checklist for your environment.
