NIC Ethernet many ReceivePacketErrors (URGENT)

  • 29 October 2015
  • 6 replies

Badge +3
I have 5 nodes in a single cluster with NIC teaming on each node. I am continuously getting "NIC Ethernet in host -- has encountered many ReceivePacketErrors" alerts on all nodes. Each node is trunked with LACP to an HP 5406zl core switch.
Prism gives continuous warning alerts that "Cluster performance may be degraded", and when I check the CVM the error count keeps increasing. Details below:

nutanix-CVM:~$ date; allssh 'winsh "Get-NetAdapterStatistics | select *" | grep -i err | grep -v " : 0"'
Thu Oct 29 04:23:18 PDT 2015
Executing winsh "Get-NetAdapterStatistics | select *" | grep -i err | grep -v " : 0" on the cluster
================== =================
ReceivedPacketErrors : 1569
================== =================
ReceivedPacketErrors : 13
ReceivedPacketErrors : 31
================== =================
ReceivedPacketErrors : 69906
================== =================
ReceivedPacketErrors : 3
ReceivedPacketErrors : 1244
================== =================
ReceivedPacketErrors : 150
ReceivedPacketErrors : 8855
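If it helps anyone watching these counters, the aggregated output can be totalled with a short Python helper. This is just an illustrative sketch, not part of any Nutanix tooling; the function name and sample text are my own.

```python
import re

def total_receive_errors(allssh_output: str) -> int:
    """Sum every ReceivedPacketErrors counter found in the aggregated
    `allssh ... Get-NetAdapterStatistics` output."""
    return sum(int(n) for n in
               re.findall(r"ReceivedPacketErrors\s*:\s*(\d+)", allssh_output))

# Illustrative sample trimmed from the output above.
sample = """\
ReceivedPacketErrors : 1569
ReceivedPacketErrors : 13
ReceivedPacketErrors : 31
"""
print(total_receive_errors(sample))  # 1613
```

Running this periodically and comparing totals makes it easy to see how fast the errors are accumulating per day rather than eyeballing raw counters.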

I would highly appreciate any solution to this. I have contacted Nutanix support, but the issue could not be resolved after many tries...

Best Regards

6 replies

Userlevel 7
Badge +35

I suggest contacting support to help get to the root of the issue.
Badge +3
Hi aluciani

Thanks for your reply. Nutanix support tried multiple options but is still unsure what is causing this error. I thought that if anyone had already experienced and resolved such an issue, it might help me too.

Userlevel 3
Badge +17
I'm not sure this will help completely, but here's a perspective on RX errors at the host:
* Receive errors indicate the NIC sees an error, not that it has generated the error itself.
* That implies looking at the switch ports associated with those NICs for TX errors at about the same rate.
* If TX errors are not being shown by the switch, it's likely a cabling problem or some such.
* If TX errors are shown at the switch, this could well be due to packet "cut-through" mode at the switch, typically used these days for efficiency.
 - This could well be due to a "rogue" NIC on the network issuing errored packets.
 - Packet captures would be necessary to identify the source of the "bad" packets.
 - The capture would have to be done at the switch, or via "promiscuous mode" at the host.
 - Review the capture for "ip.checksum_bad == true" (Wireshark) and the offending IP/MAC address should be seen.
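As a concrete illustration of that last step, a capture taken on a switch SPAN/mirror port can be filtered for bad IP checksums with tshark (Wireshark's CLI). The interface name `eth0` is an assumption; also note that newer Wireshark releases use `ip.checksum.status` where older ones used `ip.checksum_bad`.

```shell
# Capture on the mirrored interface (assumed to be eth0) and show only
# frames whose IPv4 header checksum fails validation, printing the source
# MAC and IP so the offending NIC can be identified.
# Checksum validation must be switched on first (ip.check_checksum);
# older Wireshark builds would use -Y 'ip.checksum_bad == 1' instead.
tshark -i eth0 -o ip.check_checksum:TRUE \
    -Y 'ip.checksum.status == "Bad"' \
    -T fields -e eth.src -e ip.src
```

This is a sketch of the approach, not a definitive recipe; adjust the interface and filter to your environment.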
Hi yakoob,

I know this is old, but did you ever find out what the problem was? I have the exact same issue.

Userlevel 7
Badge +35
Hi @Yakoob

Were you able to find a solution? It would benefit the community to share it here, as @buddyalex is having a similar issue. Thanks
Badge +5
My team has experienced the same issue with CRC errors and worked with our TAM and Support to implement an RX buffer fix that increases the default value from 256 to 4096. There was a significant improvement, from an initial error value of > 40K drops to < 10K drops after implementing the change.

To be honest, these CRC errors also depend on how you design the networking on Nutanix and the workloads that are running on it. If you have high-compute and high-I/O workloads, it is best to dedicate a pair of ToR switches and dedicated Nutanix hardware; do not mix the heavy workloads and general ones together, as that is a recipe for disaster. I have made such design improvements and seen CRC errors drop to single-digit values. However, it can never be perfect to the point of absolute zero.

Please note that the value depends on the ToR switches that your nodes connect to; in my case I am using Cisco Nexus 5000 series switches, and the port buffer can go as high as 4096, hence my setting. We tried progressive values, from 256 to 1024, then 2048, and lastly 4096. With each step we saw a drop in the error packet count, so we pushed to the limit to bring the error packets to a minimum.
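For anyone wanting to check or apply a receive-buffer change like this on a Hyper-V host, here is a hedged PowerShell sketch. The adapter name "Ethernet" and the `*ReceiveBuffers` keyword are assumptions; keyword names and maximum values vary by NIC driver, so confirm with Support before changing anything.

```shell
# List the advanced properties to find the receive-buffer keyword
# (names vary by driver; "*ReceiveBuffers" is the common standard one).
Get-NetAdapterAdvancedProperty -Name "Ethernet" |
    Where-Object RegistryKeyword -like "*Buffers*"

# Raise the RX buffer from the default (often 256) toward the NIC's
# maximum -- 4096 here, matching the value that worked for us.
# Adapter name "Ethernet" is an assumed example.
Set-NetAdapterAdvancedProperty -Name "Ethernet" `
    -RegistryKeyword "*ReceiveBuffers" -RegistryValue 4096

# Re-check the error counter afterwards to confirm the trend improves.
Get-NetAdapterStatistics -Name "Ethernet" |
    Select-Object ReceivedPacketErrors
```

Changing the property typically resets the adapter briefly, so apply it during a maintenance window or after migrating VMs off the host.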