Node Nic Error Rate High

  • 18 February 2015
  • 13 replies
  • 14772 views

Userlevel 2
Badge +14
It looks like ever since I upgraded from 4.0.1.2 to 4.1.1.2, my Node Nic Error Rate High healthcheck has been intermittently going off for all of my hosts. In the worst case, the check has been failing about 20% of the time. I haven't observed any errors in my VMs, and my cluster latency is relatively low. I'm using LACP instead of load-based teaming, but I would assume that shouldn't have an impact on the healthcheck itself.

Has anyone else seen similar behavior? I upgraded my nodes on Saturday evening and I have consistently seen this behavior every day since then.

EDIT: Our networking group has confirmed that they're not seeing errors on the physical switch ports. Some resets are observed, but that's to be expected.

13 replies

Userlevel 3
Badge +14
Hi Tjagoda,

Rx_missed errors are raised by the NIC when it runs out of hardware descriptors to store incoming packets. They are usually seen on 10Gbps interfaces due to the limit on the number of interrupts that can be served by a single CPU core. Here is a good document from Intel that explains the issue: http://www.intel.com/content/dam/doc/white-paper/improving-network-performance-in-multi-core-systems-paper.pdf

You can tune ESXi to reduce or stop the occurrence of rx_missed errors. VMware recommends enabling NetQueue on the ixgbe driver; the process is outlined here: http://kb.vmware.com/selfservice/microsites/search.do?language=en_US&cmd=displayKC&externalId=1004278

If NetQueue does not resolve the error, another option is to make sure flow control is enabled upstream so that ESXi can send pause frames.

In 4.1.1.2 we increased the NIC error monitoring level, which is what is causing the alert. The same errors were occurring prior to the upgrade; it's just alerting now due to the increased monitoring.
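As a quick way to see what the driver currently exposes (a sketch only - exact commands vary by ESXi release), you can list the ixgbe module's available parameters and any options currently set from the ESXi shell:

~ # vmkload_mod -s ixgbe
~ # esxcfg-module -g ixgbe

vmkload_mod -s prints the parameters the driver accepts, and esxcfg-module -g shows the options it was loaded with.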
Userlevel 2
Badge +14
Awesome response - thanks for the insight! It looks like NetQueue is on by default in ESXi 4.0 and later, and it appears to be enabled when I take a look at the 10GbE vmnics using ethtool within ESXi. I see 8 rx queues, though (and only three have counters greater than zero) - would you recommend bumping that up to the maximum of 16, or pursuing flow control upstream? It's possible flow control is already on, as I do see the following counters:

tx_flow_control_xon: 1918
rx_flow_control_xon: 1003669
tx_flow_control_xoff: 1970
rx_flow_control_xoff: 1003669

There is a decent number of rx_missed_errors in comparison to the overall rx stats:

rx_packets: 882184972
rx_missed_errors: 72716
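For scale, 72,716 missed out of 882,184,972 received works out to roughly 0.008% of packets. To check whether pause frames are actually negotiated on the link (rather than inferring it from the xon/xoff counters), ethtool can report the per-vmnic pause settings - a sketch, assuming the driver supports the query:

~ # ethtool -a vmnic2

This prints the Autonegotiate/RX/TX pause state for the interface.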
Userlevel 2
Badge +14
Actually, upon further digging it looks like NetQueue is enabled, but the ixgbe module has no parameters set for the number of queues - I'll try setting this later and we'll see if it makes a dent.
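If your ixgbe build does expose a queue-count option, the usual pattern is to set it with esxcfg-module and reload the driver. The parameter name below (VMDQ, one value per 10GbE port) is an assumption - check vmkload_mod -s ixgbe for what your build actually accepts:

~ # esxcfg-module -s "VMDQ=16,16" ixgbe

A reboot (or driver reload) is needed for the change to take effect.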
Badge +2
Hi,

Did setting parameters on the ixgbe module stop the NIC errors from occurring? I am currently having this same issue on one of my ESXi hosts.

Thanks,
Bryan
Userlevel 3
Badge +14
Refer to this KB from the portal.
Userlevel 2
Badge +15
Did anyone ever get a solution to this? I have a customer seeing those rx_missed_errors.

Thanks
Tony
Badge +7
We found there was an STP issue on our switch. We disabled STP on the ports connected to our Nutanix hosts and the errors went away. We are using Extreme Networks 670 10Gb switches.
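For anyone wanting to do the same on EXOS, the edge-port form is roughly the following - a sketch only, since syntax varies by EXOS release and STP domain name, so verify against your switch documentation:

configure stpd s0 ports link-type edge 1:1

(Cisco-style switches accomplish the same with spanning-tree portfast on the host-facing ports.)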
Userlevel 2
Badge +15
Thanks for the response - what exact error were you seeing? I am seeing the rx_missed_errors.

Thanks,
Tony
Badge +7
Yeah, you are right. I looked back at my support case and they were rx_errors.

Cheers!
Userlevel 2
Badge +15
jomebrew wrote: Yeah, you are right. I looked back at my support case and they were rx_errors.

Cheers!

Hi @jomebrew

Were they the rx_missed_errors or just plain rx_errors?

Thanks,
Tony
Badge +7
Hi Tony,
This goes back about a year. The logs I have showed only rx_errors. I can't be certain about rx_missed_errors; however, I would recommend disabling STP on server ports anyway, since STP is not applicable to host ports.
I could not pinpoint a problem associated with the error. I just don't like systems spewing errors.
/Joe

tonyholland00 wrote:

Hi @jomebrew

Were they the rx_missed_errors or just plain rx_errors?

Thanks,
Tony
Badge +1
Hi,

I've got the same rx_errors count on both 10G NICs on each of my 3 nodes. How can I fix this? Thanks.

~ # ethtool -S vmnic2 | grep errors
rx_errors: 54
tx_errors: 0
rx_over_errors: 0
rx_crc_errors: 0
rx_frame_errors: 0
rx_fifo_errors: 0
rx_missed_errors: 0
tx_aborted_errors: 0
tx_carrier_errors: 0
tx_fifo_errors: 0
tx_heartbeat_errors: 0
rx_long_length_errors: 0
rx_short_length_errors: 0
rx_csum_offload_errors: 0
fcoe_last_errors: 0

~ # ethtool -S vmnic3 | grep errors
rx_errors: 54
tx_errors: 0
rx_over_errors: 0
rx_crc_errors: 0
rx_frame_errors: 0
rx_fifo_errors: 0
rx_missed_errors: 0
tx_aborted_errors: 0
tx_carrier_errors: 0
tx_fifo_errors: 0
tx_heartbeat_errors: 0
rx_long_length_errors: 0
rx_short_length_errors: 0
rx_csum_offload_errors: 0
fcoe_last_errors: 0
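One way to tell whether those 54 rx_errors are historical or still accruing is to sample the counter over time from the ESXi shell - a minimal sketch:

~ # while true; do date; ethtool -S vmnic2 | grep 'rx_errors:'; sleep 60; done

If the count never moves, the errors likely trace back to a one-off event (e.g. a link flap) rather than an ongoing problem.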
Userlevel 3
Badge +11
Hi,

Did anybody manage to resolve the issue? We have been seeing this warning for quite some time.

We are running NOS/AOS 4.6.1.1 and hypervisor AHV - Nutanix 20160217.2.
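On AHV the host is standard KVM on Linux, so the same counters can be read directly with ethtool - a sketch; the interface name below is an example, substitute your own uplink:

# ethtool -S eth2 | egrep 'rx_missed_errors|rx_errors:'

If the counters are flat over time, the alert is more likely driven by the check's sensitivity (increased in newer releases, per the earlier reply) than by an active network problem.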
