We have a 6-node Nutanix cluster on VMware ESXi spread across two racks. Each rack has a ToR switch, and both ToRs are in an IRF stack. All nodes are dual-uplinked (one NIC to each ToR). The core switch handles L3 routing, and VLANs are trunked to the IRF stack.
We're planning to replace both ToR switches with new IRF switches, without downtime.
Could the community please advise on:
Best practices for live switch replacement
How to ensure CVM and ESXi connectivity during the change
Key health checks from Prism
Any Nutanix KB or doc on this
Best answer by ntnx_angelo
Great question. Replacing both ToR switches in a live Nutanix-on-ESXi cluster without downtime is possible with careful planning. Here's a step-by-step approach based on best practices.
You should always validate the steps here and consult Nutanix support before performing any work on production.
@Mariappan Let me know if this is a good start in your planning process.
✅ 1. Pre-Change Validation
Health checks:
Run NCC (Nutanix Cluster Check) from any CVM: ncc health_checks run_all
Validate all CVMs are in good health from Prism.
Check for any recent alerts or degraded components.
Redundancy check:
Confirm that each node is dual-uplinked and uses either LACP (which on ESXi requires a vSphere Distributed Switch) or active/standby NIC teaming.
Ensure both ToRs are in the IRF stack and passing traffic redundantly.
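The redundancy check above can be scripted against a cabling inventory. Here is a minimal sketch in Python, assuming a hypothetical inventory that maps each node to the ToR behind each of its uplinks. The node names, vmnic labels, and the `find_non_redundant_nodes` helper are illustrative and do not come from any Nutanix or VMware tool.

```python
# Hypothetical cabling inventory: node -> {uplink NIC: ToR it connects to}.
# All names are placeholders; populate this from your own cabling records.
UPLINKS = {
    "node-1": {"vmnic0": "ToR-A", "vmnic1": "ToR-B"},
    "node-2": {"vmnic0": "ToR-A", "vmnic1": "ToR-B"},
    "node-3": {"vmnic0": "ToR-A", "vmnic1": "ToR-A"},  # miscabled on purpose
}

REQUIRED_TORS = {"ToR-A", "ToR-B"}


def find_non_redundant_nodes(uplinks, required_tors=REQUIRED_TORS):
    """Return {node: missing ToRs} for nodes not cabled to every required ToR."""
    problems = {}
    for node, nics in uplinks.items():
        missing = set(required_tors) - set(nics.values())
        if missing:
            problems[node] = sorted(missing)
    return problems


if __name__ == "__main__":
    for node, missing in find_non_redundant_nodes(UPLINKS).items():
        print(f"{node}: no uplink to {', '.join(missing)}")
```

Any node this reports would lose connectivity when the ToR it depends on is drained, so it is worth fixing before the change window.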
🔄 2. Switch Replacement Workflow (No Downtime Strategy)
You’ll want to replace one ToR at a time while ensuring traffic is always flowing through the other.
Step-by-Step:
Drain one ToR
On ToR-A, shut down the interfaces facing the hosts.
Verify that all traffic fails over to ToR-B.
Confirm CVMs, Prism, and ESXi hosts remain reachable.
Replace ToR-A
Physically swap in the new switch and configure it as a member of the new IRF stack.
Match VLANs, LACP, and trunking config from the original.
Reconnect cables to ToR-A, re-enable ports, and verify connectivity.
Repeat for ToR-B
Follow the same process: drain, replace, reconnect.
Verify full redundancy
Confirm both switches are in the IRF stack and passing traffic.
Check vSwitch uplink status in ESXi.
Re-run NCC to ensure cluster health.
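The "confirm CVMs, Prism, and ESXi hosts remain reachable" checks in the steps above can be repeated quickly after each drain and reconnect with a small probe script. This is a rough sketch only: the addresses are placeholders from the documentation range, the ports assume Prism on 9440, SSH on the CVMs, and HTTPS on ESXi, and `tcp_probe` and `unreachable_targets` are hypothetical helper names.

```python
import socket

# Hypothetical targets: (label, address, TCP port). Replace with your own.
TARGETS = [
    ("prism-vip", "192.0.2.10", 9440),  # assumes Prism listens on 9440
    ("cvm-1", "192.0.2.11", 22),        # assumes SSH is enabled on the CVM
    ("esxi-1", "192.0.2.21", 443),      # assumes the ESXi host UI/API on 443
]


def tcp_probe(address, port, timeout=1.0):
    """Return True if a TCP connection to address:port succeeds."""
    try:
        with socket.create_connection((address, port), timeout=timeout):
            return True
    except OSError:
        return False


def unreachable_targets(targets, probe=tcp_probe):
    """Return the labels of targets that the probe could not reach."""
    return [label for label, address, port in targets
            if not probe(address, port)]


if __name__ == "__main__":
    down = unreachable_targets(TARGETS)
    print("all reachable" if not down else f"unreachable: {', '.join(down)}")
```

A TCP connect only shows that a service answers on the management path. It does not replace NCC or the Prism health views, so treat it as a fast first signal between steps.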
🛠 3. Important Tips
Use Maintenance Mode in Prism for individual CVMs only if needed; ideally, avoid it altogether.
ESXi NIC teaming (vSwitch or DVS) should use active/active or active/standby, depending on how the ToR ports are configured.
Validate vMotion and management network functionality during each ToR replacement.
If available, use out-of-band management on switches for configuration verification.
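For the configuration verification mentioned above, and for matching VLANs and trunking during the replacement, it can help to diff the host-facing port settings of the old and new switch before re-enabling ports. The sketch below assumes you have already extracted the per-port settings into plain dictionaries. The data layout and the `diff_port_configs` helper are hypothetical, and parsing the actual switch configuration is out of scope.

```python
# Hypothetical per-port trunk settings, keyed by the host each port faces.
OLD_TOR_A = {
    "node-1": {"vlans": {10, 20, 30}, "native_vlan": 10},
    "node-2": {"vlans": {10, 20, 30}, "native_vlan": 10},
}
NEW_TOR_A = {
    "node-1": {"vlans": {10, 20, 30}, "native_vlan": 10},
    "node-2": {"vlans": {10, 20}, "native_vlan": 10},  # VLAN 30 missing on purpose
}


def diff_port_configs(old, new):
    """Return human-readable differences between two port maps."""
    findings = []
    for port in sorted(old):
        if port not in new:
            findings.append(f"{port}: not configured on new switch")
            continue
        missing = old[port]["vlans"] - new[port]["vlans"]
        if missing:
            findings.append(f"{port}: missing VLANs {sorted(missing)}")
        if old[port]["native_vlan"] != new[port]["native_vlan"]:
            findings.append(f"{port}: native VLAN differs")
    return findings


if __name__ == "__main__":
    for line in diff_port_configs(OLD_TOR_A, NEW_TOR_A):
        print(line)
```

An empty result means every host-facing port on the new switch carries at least the VLANs of the old one with the same native VLAN. It does not check link aggregation settings, which still need a manual review.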
📚 Nutanix & VMware Resources
Nutanix KB: Search for “Network Redundancy” and “ToR Replacement” (examples: KB 1937, KB 7076).