What to do when there is a DIMM failure alert?

  • 11 June 2020


What is a DIMM error?

 

A memory error is an event in which one or more bits are read back in a different logical state from the one last written. For example, a 1 is written to a memory cell, but reading the same cell returns 0.

 

Memory errors can be classified into two types:

 

  1. Soft errors, which randomly corrupt bits but do not leave physical damage. Soft errors are transient and not repeatable. They can be caused by electrical or magnetic interference (e.g. cosmic rays, alpha particles, leakage, random noise).

 

  2. Hard errors, which corrupt bits in a repeatable manner because of a physical/hardware defect or an environmental problem. A hard error can also occur if the DIMM is not seated properly.

  

Server memory today is typically protected by error-correcting codes (ECC), which allow the detection and correction of single-bit or, depending on the scheme, multi-bit errors.

 

Additional ECC check bits are sent along with each 64 bits of data over the data bus so that memory errors can be corrected where possible. ECC schemes can typically be categorised into two types:

 

  1. Single error correct, double error detect (SECDED): This can reliably detect and correct any single-bit error, but it can only detect, not correct, double-bit errors.

 

  2. Chip-kill: This can correct errors in up to 4 adjacent bits on a nibble boundary (one nibble) at once in a single memory word.
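
To make the SECDED behaviour concrete, here is a minimal sketch in Python. It is an illustration only, not how any particular memory controller is implemented: it uses a scaled-down extended Hamming code that protects 4 data bits with 4 check bits, whereas the ECC described above protects 64 data bits. The function names encode and decode are made up for this example.

```python
# Scaled-down SECDED sketch: extended Hamming(8,4).
# Assumption: real ECC DIMMs protect 64 data bits per word; 4 data bits
# are used here only to keep the example short.

def encode(nibble: int) -> list:
    """Return an 8-bit codeword (list of 0/1) for a 4-bit value."""
    data = [(nibble >> i) & 1 for i in range(4)]
    code = [0] * 8
    # Data bits live at positions 3, 5, 6, 7; check bits at 1, 2, 4.
    code[3], code[5], code[6], code[7] = data
    for p in (1, 2, 4):
        # Check bit p covers every position whose index has bit p set.
        code[p] = sum(code[i] for i in range(1, 8) if i & p) % 2
    # Position 0 holds an overall parity bit over the other 7 positions.
    code[0] = sum(code[1:]) % 2
    return code


def decode(code: list) -> tuple:
    """Return (status, nibble); nibble is None if uncorrectable."""
    syndrome = 0
    for i in range(1, 8):
        if code[i]:
            syndrome ^= i          # XOR of the indices of all set bits
    overall = sum(code) % 2        # 0 when overall parity is consistent
    fixed = list(code)
    if syndrome == 0 and overall == 0:
        status = "ok"
    elif overall == 1:
        # Odd number of flipped bits: treat as a single-bit error.
        # The syndrome is the index of the bad bit (0 = parity bit itself).
        fixed[syndrome] ^= 1
        status = "corrected"
    else:
        # Non-zero syndrome with consistent overall parity: two bits flipped.
        return "uncorrectable", None
    nibble = fixed[3] | fixed[5] << 1 | fixed[6] << 2 | fixed[7] << 3
    return status, nibble


if __name__ == "__main__":
    word = encode(0b1011)
    one_flip = list(word)
    one_flip[5] ^= 1
    two_flips = list(one_flip)
    two_flips[6] ^= 1
    print(decode(word))
    print(decode(one_flip))
    print(decode(two_flips))
```

In this sketch, any single flipped bit is located by the syndrome and repaired, while any two flipped bits are reported as uncorrectable instead of being silently miscorrected.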

 

Any memory error that the ECC can correct is known as a correctable error (CECC). A memory error that the ECC cannot correct is known as an uncorrectable error (UECC).

 

An important point to note is that even hard errors can be corrected as long as they fall within the correctable range of the ECC. For example, with SECDED, a single hard-failed bit can always be corrected until a second bit fails in the same data word.
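
The hard-error example above can be sketched as a toy model in Python. The model is an assumption made for illustration: a hard fault is represented as a bit stuck at 0, and SECDED is reduced to counting how many bits differ between the written and the read word. The helper names read_with_stuck_bits and classify_read are hypothetical and do not come from any Nutanix tool.

```python
# Toy model of hard (stuck-at-0) faults under SECDED.
# Assumption: the ECC outcome depends only on how many bits of the word
# are wrong, which is the guarantee SECDED gives for 1 and 2 bad bits.

def read_with_stuck_bits(written: int, stuck_at_zero_mask: int) -> int:
    """Return what a read sees when the masked bits always read as 0."""
    return written & ~stuck_at_zero_mask


def classify_read(written: int, read: int) -> str:
    """Classify a read under SECDED by the number of wrong bits."""
    bad_bits = bin(written ^ read).count("1")
    if bad_bits == 0:
        return "no error"
    if bad_bits == 1:
        return "CECC"              # single-bit error: corrected
    if bad_bits == 2:
        return "UECC"              # double-bit error: detected only
    return "beyond SECDED guarantee"


if __name__ == "__main__":
    value = 0xFF
    one_stuck = 0b00000100         # one hard-failed bit
    two_stuck = 0b00010100         # a second bit fails in the same word
    print(classify_read(value, read_with_stuck_bits(value, one_stuck)))
    print(classify_read(value, read_with_stuck_bits(value, two_stuck)))
    # A stuck-at-0 bit causes no error when a 0 was written to it.
    print(classify_read(0x00, read_with_stuck_bits(0x00, two_stuck)))
```

With one stuck bit, every affected read is a CECC. Once a second bit in the same word is stuck, reads of that word become UECC.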

 

DIMM Error Handling in Nutanix G6 and G7 nodes

 

Please refer to KB 7503

 

DIMM Error Handling in Nutanix G5 nodes

 

Please refer to KB-3357

