How can we determine whether a node is experiencing NIC issues, and if so, what should we do next?
Host NIC errors have many possible causes, and troubleshooting usually involves analysis to narrow the issue down to a specific part of the network topology. Common causes include:
- Link flapping (interface continually goes up and down)
- Cable disconnect/connect
- Faulty external switch port
- Misconfiguration of the external switch port
- Faulty NIC port
- Faulty cable
- Faulty SFP+ module
The two main error counters of concern are rx_crc_errors and rx_over_errors (along with the related rx_missed_errors and rx_fifo_errors).
Run the following command on your host depending on the hypervisor:
AHV:
ethtool -S <eth> | egrep "rx_errors|rx_crc_errors|rx_missed_errors"
ESXi:
esxcli network nic stats get -n <vmnic> | egrep "Total receive errors|Receive CRC errors|Receive missed errors"
Hyper-V:
Get-NetAdapterStatistics -Name Ethernet*<interface number> | fl *
rx_crc_errors:
The sending host computes a cyclic redundancy check (CRC) over the entire Ethernet frame and places the value in the frame check sequence (FCS) field at the end of the frame, after the user payload. Each intermediate switch and the destination host then recompute this value and compare it with the FCS to determine whether the frame was corrupted in transit.
rx_crc_errors are caused either by layer 1 faults or by jumbo frame issues on the network. If a packet is larger than the MTU configured on an interface, the interface cuts it off at the configured MTU, so the server receives a malformed packet, which registers as a CRC error.
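As a rough illustration of the check described above, the following Python sketch appends a CRC-32 to a byte string on the "sender" side and verifies it on the "receiver" side. It assumes `zlib.crc32` as a software stand-in for the FCS logic that a real NIC performs in hardware. The helper names (`append_fcs`, `fcs_ok`, `flip_bit`) and the sample frame contents are hypothetical and exist only for this example.

```python
import zlib


def append_fcs(frame: bytes) -> bytes:
    """Sender side: append a 4-byte CRC-32 computed over the frame."""
    fcs = zlib.crc32(frame) & 0xFFFFFFFF
    return frame + fcs.to_bytes(4, "little")


def fcs_ok(frame_with_fcs: bytes) -> bool:
    """Receiver side: recompute the CRC-32 and compare it with the trailing 4 bytes."""
    body, fcs = frame_with_fcs[:-4], frame_with_fcs[-4:]
    return (zlib.crc32(body) & 0xFFFFFFFF).to_bytes(4, "little") == fcs


def flip_bit(data: bytes, bit_index: int) -> bytes:
    """Simulate a layer 1 fault by inverting a single bit."""
    corrupted = bytearray(data)
    corrupted[bit_index // 8] ^= 1 << (bit_index % 8)
    return bytes(corrupted)


# Hypothetical frame: 12 address bytes, a 2-byte type field, then the payload.
sent = append_fcs(b"\x00" * 12 + b"\x08\x00" + b"example payload")
damaged = flip_bit(sent, 100)   # one bit corrupted in transit
truncated = sent[:20]           # frame cut short, as with an MTU mismatch

print("intact frame passes check:", fcs_ok(sent))
print("damaged frame passes check:", fcs_ok(damaged))
print("truncated frame passes check:", fcs_ok(truncated))
```

A frame that fails this comparison is what the receiving interface would count as a CRC error. Note that the check only shows that the frame arrived damaged, not which component damaged it.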
Faulty cables and/or SFP+ modules are the most common cause of these errors. If the problem recurs on a particular interface, test in a controlled fashion during a change window, swapping cables, modules, and switch ports as needed to isolate the faulty component.
Physical troubleshooting is a cumulative effort because, with CRC errors, there is no effective way to determine directly whether the host's NIC, the cable, or the switch port is the source of the bad data transmission.
rx_over_errors:
These errors occur when the hardware receive buffer on the physical NIC is full and some received packets must be dropped at the physical NIC layer. In most cases, the value reported by this counter equals rx_missed_errors and rx_fifo_errors. These packet drops can happen during high bursts of traffic.
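To make the effect of bursts concrete, here is a toy Python model of a fixed-size receive buffer. It is only a sketch: the function name `simulate_rx_ring`, the ring size, and the per-tick numbers are assumptions chosen for illustration and do not reflect any particular NIC or driver.

```python
def simulate_rx_ring(arrivals_per_tick, ring_size, drained_per_tick):
    """Return how many packets are dropped because the receive ring is full.

    arrivals_per_tick: packets arriving in each time slice
    ring_size:         number of packets the buffer can hold
    drained_per_tick:  packets the host removes from the buffer per time slice
    """
    queued = 0
    dropped = 0
    for arriving in arrivals_per_tick:
        free_slots = ring_size - queued
        accepted = min(arriving, free_slots)
        dropped += arriving - accepted          # no room left: packet is lost
        queued += accepted
        queued -= min(queued, drained_per_tick)  # host processes some packets
    return dropped


# Same total traffic (16 packets), delivered steadily versus in one burst.
steady = simulate_rx_ring([4, 4, 4, 4], ring_size=8, drained_per_tick=4)
bursty = simulate_rx_ring([0, 16, 0, 0], ring_size=8, drained_per_tick=4)
print("drops with steady traffic:", steady)
print("drops with bursty traffic:", bursty)
```

In this model the steady stream never fills the buffer, while the burst overruns it even though the average load is identical, which is the situation the counter reflects.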
The NIC errors described above can be triggered by a number of scenarios, and occasional errors can usually be ignored. However, continuously increasing NIC errors usually point to a physical layer component that is failing. The Nutanix alert is triggered when the error rate is considered excessive. Investigate this alert whenever it is raised and correct the source.
Pro-Tip: Engage your networking team to check the statistics of the switch ports that these hosts are connected to, in order to capture errors on the switch side.
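One way to tell a one-off error count from a continuously increasing one is to sample the counters twice and compare them. The Python sketch below does this for an AHV (Linux) host by parsing `ethtool -S` output. The function names, the list of watched counters, and the 60-second default interval are assumptions made for this example. It deliberately does not encode a threshold for what counts as excessive, because that value is not stated here.

```python
import subprocess
import time

# Counters discussed in this article; adjust to what the NIC driver exposes.
WATCHED = ("rx_errors", "rx_crc_errors", "rx_missed_errors",
           "rx_over_errors", "rx_fifo_errors")


def parse_ethtool_stats(text):
    """Extract the watched counters from `ethtool -S <eth>` output."""
    counters = {}
    for line in text.splitlines():
        name, sep, value = line.strip().rpartition(":")
        value = value.strip()
        if sep and name in WATCHED and value.isdigit():
            counters[name] = int(value)
    return counters


def counter_deltas(before, after):
    """Return how much each counter grew between two samples."""
    return {name: after[name] - before[name] for name in after if name in before}


def read_counters(interface):
    """Run ethtool on the host and parse the result (requires ethtool)."""
    result = subprocess.run(["ethtool", "-S", interface],
                            capture_output=True, text=True, check=True)
    return parse_ethtool_stats(result.stdout)


def sample_twice(interface, interval_s=60):
    """Take two samples interval_s seconds apart and return the deltas."""
    first = read_counters(interface)
    time.sleep(interval_s)
    return counter_deltas(first, read_counters(interface))
```

For example, `sample_twice("eth0")` (the interface name is a placeholder) returns a dictionary of per-counter increases. Deltas that stay at zero across repeated runs suggest historical errors, while deltas that keep growing suggest an ongoing fault worth tracing with the networking team.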
For more information:
KB - 1381 - NCC Health Check: host_nic_error_check