
When you see an alert about one of the NICs being down, the first thought is “oh-oh!”. In most cases, though, the host has more than one NIC, so traffic either switches over to the other interface or continues with reduced bandwidth (when LACP is enabled, for example). If the environment is not running at the brink of its capacity, the loss of one network interface should not cause an immediate impact.

The issue is detected by the NCC health check nic_link_down_check, which runs every hour and generates the alert if any issues are found. In the majority of cases, ordinary network troubleshooting is enough: make sure the interface is physically connected and that the devices on both ends are configured correctly and functioning properly.
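As a rough illustration of the "is the link really down?" step, here is a minimal sketch that assumes a Linux-based host, where the kernel exposes each interface's link state under /sys/class/net/<name>/operstate. It is not how the NCC check itself works, and it does not apply to hosts without sysfs; the function names are made up for this example.

```python
from pathlib import Path


def link_states(sysfs_root="/sys/class/net"):
    """Return {interface_name: operstate} for each interface under sysfs_root.

    Assumes the Linux sysfs layout; returns an empty dict if the
    directory does not exist (for example, on a non-Linux system).
    """
    root = Path(sysfs_root)
    if not root.is_dir():
        return {}
    states = {}
    for entry in sorted(root.iterdir()):
        operstate = entry / "operstate"
        # Skip entries that are not interfaces (no operstate file).
        if operstate.is_file():
            states[entry.name] = operstate.read_text().strip()
    return states


def down_interfaces(states):
    """Return the names of interfaces whose operstate is exactly 'down'."""
    return [name for name, state in states.items() if state == "down"]


if __name__ == "__main__":
    current = link_states()
    for name, state in current.items():
        print(f"{name}: {state}")
    print("down:", down_interfaces(current))
```

An interface that reports "down" here while the cable and switch port look fine is worth a closer look on both ends before assuming a false positive.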

In some cases, however, the alert may be a false positive. This can happen if an interface used to be connected but no longer is or, in some instances, on a new node after cluster expansion. Eliminate the possibility of the alert being true first: verify the physical connectivity of the host to the switch. If you are certain the alert cannot possibly be right, there is a file that the NCC check uses to compare against the expected state of the interfaces. If the problematic interface is listed in that file in a “down” state, you may be onto something. The file can be removed and will be re-created automatically on the next run of the health check.
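Before removing the state file, it can help to see exactly where the interface is mentioned in it. The sketch below assumes only that the file is valid JSON; its schema is not described here, so the code makes no assumptions about key names and simply walks the whole structure. The function name and the command-line arguments are hypothetical, and the file path must be supplied by you.

```python
import json
import sys


def find_mentions(node, needle, path="$"):
    """Yield (path, value) for every key or string value containing needle.

    Walks arbitrary JSON data; no particular schema is assumed.
    """
    if isinstance(node, dict):
        for key, value in node.items():
            child = f"{path}.{key}"
            if needle in str(key):
                yield child, value
            yield from find_mentions(value, needle, child)
    elif isinstance(node, list):
        for index, value in enumerate(node):
            yield from find_mentions(value, needle, f"{path}[{index}]")
    elif isinstance(node, str) and needle in node:
        yield path, node


if __name__ == "__main__" and len(sys.argv) == 3:
    # Usage (hypothetical): python find_nic.py <state-file.json> <interface>
    state_file, interface = sys.argv[1], sys.argv[2]
    with open(state_file) as handle:
        data = json.load(handle)
    for location, value in find_mentions(data, interface):
        print(location, "->", json.dumps(value))
```

If the output shows the interface recorded alongside a stale state, that supports the false-positive theory; if the interface does not appear at all, the cause is probably elsewhere.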

For NIC troubleshooting, refer to KB-2480, NCC Health Check: nic_link_down_check.

To look at the check_cvm_health_job_state.json file, go to KB-2556, Alert "Link on NIC vmnic[x] of host [x.x.x.x] is down" being raised if an interface was used previously.