[Linux] kernel:[Hardware Error]: MC4 Error (node 1): DRAM ECC error detected on the NB.

Sometimes on Linux systems you’ll get an error message similar to this:

Message from syslogd@server at May  1 13:09:44 …

kernel:[Hardware Error]: MC4 Error (node 1): DRAM ECC error detected on the NB.

Message from syslogd@server at May  1 13:09:44 …

kernel:[Hardware Error]: Error Status: Corrected error, no action required.

Message from syslogd@server at May  1 13:09:44 …

kernel:[Hardware Error]: CPU:4 (15:2:0) MC4_STATUS[Over|CE|MiscV|-|AddrV|-|-|CECC]: 0xdc40400061080813

Message from syslogd@server at May  1 13:09:44 …

kernel:[Hardware Error]: MC4_ADDR: 0x000000103900a030

Message from syslogd@server at May  1 13:09:44 …

kernel:[Hardware Error]: cache level: L3/GEN, mem/io: MEM, mem-tx: RD, part-proc: SRC (no timeout)
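
To gauge how often these messages appear, and on which node, the log text can be tallied per node. The following is a minimal sketch, not a standard tool: count_mce_by_node is a hypothetical helper name, and it assumes the log lines carry the same "MC4 Error (node N)" wording shown above.

```shell
# Tally "MC4 Error (node N)" lines per node from log text read on stdin.
# count_mce_by_node is a hypothetical helper name, not part of any package.
count_mce_by_node() {
    awk '
        match($0, /MC4 Error \(node [0-9]+\)/) {
            s = substr($0, RSTART, RLENGTH)   # e.g. "MC4 Error (node 1)"
            sub(/^.*node /, "", s)            # leaves "1)"
            sub(/\)$/, "", s)                 # leaves "1"
            count[s]++
        }
        END {
            for (n in count) printf "node %s: %d\n", n, count[n]
        }
    ' | sort
}
```

It could be fed from the kernel ring buffer with dmesg | count_mce_by_node, or from a log file such as /var/log/messages (that path is an assumption and varies by distribution).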

These typically aren’t a problem: they are just notifications that ECC memory detected an error and corrected it.  Sometimes, however, it becomes obvious that ECC isn’t fully coping with the fault, because the box starts having a lot of problems.  In this particular situation the CPU that uses this memory bank is locking up:

Message from syslogd@server at May  1 01:27:25 …

kernel:BUG: soft lockup - CPU#4 stuck for 67s!
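
One way to check that the locked-up CPU really sits on the node reporting ECC errors is to look up the CPU's NUMA node in sysfs. This is only a sketch under stated assumptions: cpu_numa_node is a hypothetical helper, it assumes the kernel exposes a nodeN entry under /sys/devices/system/cpu/cpuN (the case on NUMA-enabled kernels), and it assumes the node number in the MC4 message matches the NUMA node number, which is typical but not guaranteed.

```shell
# Print the NUMA node number that a given CPU belongs to.
# cpu_numa_node is a hypothetical helper; the optional second argument
# overrides the sysfs CPU directory (default: /sys/devices/system/cpu).
cpu_numa_node() {
    cpu=$1
    root=${2:-/sys/devices/system/cpu}
    for d in "$root/cpu$cpu"/node[0-9]*; do
        [ -e "$d" ] || continue      # glob did not match anything
        echo "${d##*/node}"          # strip everything up to "node"
        return 0
    done
    return 1                         # no node entry found for this CPU
}
```

For the lockup above, cpu_numa_node 4 would print the node that CPU 4 belongs to, which can then be compared with the node named in the MC4 error.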

So there’s a bad DIMM that needs to be replaced.  The problem is how to find it.  First install edac-utils:

yum install edac-utils

Once it is installed, run the following command:

[root@server ~]# edac-util -v

mc0: 0 Uncorrected Errors with no DIMM info

mc0: 0 Corrected Errors with no DIMM info

mc0: csrow0: 0 Uncorrected Errors

mc0: csrow0: ch0: 0 Corrected Errors

mc0: csrow0: ch1: 0 Corrected Errors

mc0: csrow1: 0 Uncorrected Errors

mc0: csrow1: ch0: 0 Corrected Errors

mc0: csrow1: ch1: 0 Corrected Errors

mc0: csrow2: 0 Uncorrected Errors

mc0: csrow2: ch0: 0 Corrected Errors

mc0: csrow2: ch1: 0 Corrected Errors

mc0: csrow3: 0 Uncorrected Errors

mc0: csrow3: ch0: 0 Corrected Errors

mc0: csrow3: ch1: 0 Corrected Errors

mc1: 0 Uncorrected Errors with no DIMM info

mc1: 40 Corrected Errors with no DIMM info

mc1: csrow0: 0 Uncorrected Errors

mc1: csrow0: ch0: 801 Corrected Errors

mc1: csrow0: ch1: 0 Corrected Errors

mc1: csrow1: 0 Uncorrected Errors

mc1: csrow1: ch0: 980 Corrected Errors

mc1: csrow1: ch1: 0 Corrected Errors

mc1: csrow2: 0 Uncorrected Errors

mc1: csrow2: ch0: 0 Corrected Errors

mc1: csrow2: ch1: 0 Corrected Errors

mc1: csrow3: 0 Uncorrected Errors

mc1: csrow3: ch0: 0 Corrected Errors

mc1: csrow3: ch1: 0 Corrected Errors

mc2: 0 Uncorrected Errors with no DIMM info

mc2: 0 Corrected Errors with no DIMM info

mc2: csrow0: 0 Uncorrected Errors

mc2: csrow0: ch0: 0 Corrected Errors

mc2: csrow0: ch1: 0 Corrected Errors

mc2: csrow1: 0 Uncorrected Errors

mc2: csrow1: ch0: 0 Corrected Errors

mc2: csrow1: ch1: 0 Corrected Errors

mc2: csrow2: 0 Uncorrected Errors

mc2: csrow2: ch0: 0 Corrected Errors

mc2: csrow2: ch1: 0 Corrected Errors

mc2: csrow3: 0 Uncorrected Errors

mc2: csrow3: ch0: 0 Corrected Errors

mc2: csrow3: ch1: 0 Corrected Errors

mc3: 0 Uncorrected Errors with no DIMM info

mc3: 0 Corrected Errors with no DIMM info

mc3: csrow0: 0 Uncorrected Errors

mc3: csrow0: ch0: 0 Corrected Errors

mc3: csrow0: ch1: 0 Corrected Errors

mc3: csrow1: 0 Uncorrected Errors

mc3: csrow1: ch0: 0 Corrected Errors

mc3: csrow1: ch1: 0 Corrected Errors

mc3: csrow2: 0 Uncorrected Errors

mc3: csrow2: ch0: 0 Corrected Errors

mc3: csrow2: ch1: 0 Corrected Errors

mc3: csrow3: 0 Uncorrected Errors

mc3: csrow3: ch0: 0 Corrected Errors

mc3: csrow3: ch1: 0 Corrected Errors

Since these errors are popping up nearly constantly, we can simply look for the high correction counts.  In this case mc1 shows 801 corrected errors on csrow0 ch0 and 980 on csrow1 ch0 (plus 40 with no DIMM info), while every other counter is zero.  That lines up with the “node 1” named in the MC4 error.  Using this info we can find and replace the failing DIMM(s).
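
On a box with many memory controllers the zero counters drown out the interesting ones. As a rough sketch, the edac-util -v output can be filtered down to the nonzero counters, largest first. edac_nonzero is a hypothetical helper name, and it assumes the output format shown above, where the count comes immediately before the word "Corrected" or "Uncorrected".

```shell
# Keep only counters with a nonzero count and print the largest first.
# edac_nonzero is a hypothetical helper; it reads `edac-util -v` output on stdin.
edac_nonzero() {
    awk '
        {
            # The count is the field just before "Corrected" or "Uncorrected".
            for (i = 2; i <= NF; i++)
                if (($i == "Corrected" || $i == "Uncorrected") && $(i - 1) + 0 > 0)
                    print $(i - 1), $0    # prefix the line with its count for sorting
        }
    ' | sort -rn | cut -d" " -f2-         # sort by count, then drop the prefix
}
```

Run as edac-util -v | edac_nonzero. Mapping a csrow and channel to a physical slot depends on the motherboard, so the board manual or the slot locators listed by dmidecode -t memory are still needed to pick out the DIMM to pull.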