Jun 262014
 

Came across a hint today about reporting on ECC memory errors. For those who do not know, ECC memory detects memory errors and corrects correctable errors. Normal memory (as found in almost all laptops and desktops) simply ignores the errors and lets them accumulate and cause problems either with data corruption or by causing software errors.

As I happen to have ECC memory in my desktop machine I thought I would have a look into the hint. Turns out that Linux does not report on ECC events automatically; you need to install the relevant EDAC (Error Detection and Correction) tools. Which for Debian, turns out to be pretty simple :-

# apt-get install edac-utils

As part of the installation process, a daemon process is started. But for whatever reason, it didn’t automatically detect what driver to load. So I edited /etc/default/edac and added :-

EDAC_DRIVER=amd64_edac_mod

Once that is done, a simple /etc/init.d/edac restart loads the driver and starts monitoring. Messages should appear in your log files (/var/log/messages) and reports can be displayed with edac-util :-

# edac-util --report=full 
mc0:csrow0:mc#0csrow#0channel#0:CE:0
mc0:csrow0:mc#0csrow#0channel#1:CE:0
mc0:csrow1:mc#0csrow#1channel#0:CE:0
mc0:csrow1:mc#0csrow#1channel#1:CE:0
mc0:csrow2:mc#0csrow#2channel#0:CE:0
mc0:csrow2:mc#0csrow#2channel#1:CE:0
mc0:csrow3:mc#0csrow#3channel#0:CE:0
mc0:csrow3:mc#0csrow#3channel#1:CE:0
mc0:noinfo:all:UE:0
mc0:noinfo:all:CE:0

Of course memory errors are relatively rare (or at least should be) so it may take months before any error is reported.