Sometimes on Linux systems you’ll get an error message similar to this:
Message from syslogd@server at May 1 13:09:44 …
kernel:[Hardware Error]: MC4 Error (node 1): DRAM ECC error detected on the NB.
Message from syslogd@server at May 1 13:09:44 …
kernel:[Hardware Error]: Error Status: Corrected error, no action required.
Message from syslogd@server at May 1 13:09:44 …
kernel:[Hardware Error]: CPU:4 (15:2:0) MC4_STATUS[Over|CE|MiscV|-|AddrV|-|-|CECC]: 0xdc40400061080813
Message from syslogd@server at May 1 13:09:44 …
kernel:[Hardware Error]: MC4_ADDR: 0x000000103900a030
Message from syslogd@server at May 1 13:09:44 …
kernel:[Hardware Error]: cache level: L3/GEN, mem/io: MEM, mem-tx: RD, part-proc: SRC (no timeout)
These typically aren’t a problem, as they’re just notifications that ECC memory detected an error and corrected it. Sometimes, however, it becomes obvious that the errors are a symptom of failing memory, because the box starts having a lot of problems. In this particular situation the CPU that uses this memory bank is locking up:
Message from syslogd@server at May 1 01:27:25 …
kernel:BUG: soft lockup - CPU#4 stuck for 67s!
So there’s a bad DIMM that needs to be replaced. The problem is how to find it. First install edac-utils:
yum install edac-utils
Once it is installed, run the following command:
[root@server ~]# edac-util -v
mc0: 0 Uncorrected Errors with no DIMM info
mc0: 0 Corrected Errors with no DIMM info
mc0: csrow0: 0 Uncorrected Errors
mc0: csrow0: ch0: 0 Corrected Errors
mc0: csrow0: ch1: 0 Corrected Errors
mc0: csrow1: 0 Uncorrected Errors
mc0: csrow1: ch0: 0 Corrected Errors
mc0: csrow1: ch1: 0 Corrected Errors
mc0: csrow2: 0 Uncorrected Errors
mc0: csrow2: ch0: 0 Corrected Errors
mc0: csrow2: ch1: 0 Corrected Errors
mc0: csrow3: 0 Uncorrected Errors
mc0: csrow3: ch0: 0 Corrected Errors
mc0: csrow3: ch1: 0 Corrected Errors
mc1: 0 Uncorrected Errors with no DIMM info
mc1: 40 Corrected Errors with no DIMM info
mc1: csrow0: 0 Uncorrected Errors
mc1: csrow0: ch0: 801 Corrected Errors
mc1: csrow0: ch1: 0 Corrected Errors
mc1: csrow1: 0 Uncorrected Errors
mc1: csrow1: ch0: 980 Corrected Errors
mc1: csrow1: ch1: 0 Corrected Errors
mc1: csrow2: 0 Uncorrected Errors
mc1: csrow2: ch0: 0 Corrected Errors
mc1: csrow2: ch1: 0 Corrected Errors
mc1: csrow3: 0 Uncorrected Errors
mc1: csrow3: ch0: 0 Corrected Errors
mc1: csrow3: ch1: 0 Corrected Errors
mc2: 0 Uncorrected Errors with no DIMM info
mc2: 0 Corrected Errors with no DIMM info
mc2: csrow0: 0 Uncorrected Errors
mc2: csrow0: ch0: 0 Corrected Errors
mc2: csrow0: ch1: 0 Corrected Errors
mc2: csrow1: 0 Uncorrected Errors
mc2: csrow1: ch0: 0 Corrected Errors
mc2: csrow1: ch1: 0 Corrected Errors
mc2: csrow2: 0 Uncorrected Errors
mc2: csrow2: ch0: 0 Corrected Errors
mc2: csrow2: ch1: 0 Corrected Errors
mc2: csrow3: 0 Uncorrected Errors
mc2: csrow3: ch0: 0 Corrected Errors
mc2: csrow3: ch1: 0 Corrected Errors
mc3: 0 Uncorrected Errors with no DIMM info
mc3: 0 Corrected Errors with no DIMM info
mc3: csrow0: 0 Uncorrected Errors
mc3: csrow0: ch0: 0 Corrected Errors
mc3: csrow0: ch1: 0 Corrected Errors
mc3: csrow1: 0 Uncorrected Errors
mc3: csrow1: ch0: 0 Corrected Errors
mc3: csrow1: ch1: 0 Corrected Errors
mc3: csrow2: 0 Uncorrected Errors
mc3: csrow2: ch0: 0 Corrected Errors
mc3: csrow2: ch1: 0 Corrected Errors
mc3: csrow3: 0 Uncorrected Errors
mc3: csrow3: ch0: 0 Corrected Errors
mc3: csrow3: ch1: 0 Corrected Errors
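On a box with several memory controllers this listing gets long, so it can help to hide the lines that report zero errors. The following is a minimal sketch, not part of edac-utils: `nonzero_edac` is a hypothetical helper name, and it assumes the output keeps the `name: count description` layout shown above.

```shell
# Hypothetical helper: keep only the lines whose error count is non-zero.
# Reads `edac-util -v` style output on stdin.
nonzero_edac() {
    # Split each line on ": "; the last field starts with the count,
    # e.g. "801 Corrected Errors". Drop lines where that count is 0.
    awk -F': ' '$NF !~ /^0 /'
}

# Usage (needs edac-utils installed and an EDAC driver loaded):
#   edac-util -v | nonzero_edac
```

Run against the output above, this would leave only the three mc1 lines with counts of 40, 801 and 980.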
Considering these errors are popping up nearly constantly, we can just look for the high correction counts. In this case channel 0 of csrow0 and csrow1 on mc1 shows 801 and 980 corrections, while every other memory channel is at zero. That appears to line up with the “node 1” in the original MC4 error. Using this info we can find and replace the failing DIMM(s).
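To go from a controller/csrow/channel triple to a physical slot, the kernel's legacy EDAC sysfs interface exposes per-channel files such as `ch0_dimm_label` and `ch0_ce_count` under `/sys/devices/system/edac/mc/`. Here is a rough sketch that reads them. `edac_dimm_info` is a hypothetical helper name, and whether the label is populated depends on the driver and on labels having been registered for your motherboard (for example with `edac-ctl`), so treat an empty label as "check the board manual".

```shell
# Hypothetical helper: print the DIMM label and corrected-error count
# for one EDAC channel.
# Usage: edac_dimm_info MC CSROW CHANNEL [EDAC_ROOT]
# EDAC_ROOT defaults to the legacy sysfs location and is only a
# parameter so the function can be pointed at a different directory.
edac_dimm_info() {
    root="${4:-/sys/devices/system/edac/mc}"
    dir="$root/mc${1}/csrow${2}"
    # Either file may be missing or empty on some systems.
    label=$(cat "$dir/ch${3}_dimm_label" 2>/dev/null)
    count=$(cat "$dir/ch${3}_ce_count" 2>/dev/null)
    printf 'mc%s csrow%s ch%s: label=%s ce_count=%s\n' \
        "$1" "$2" "$3" "${label:-unknown}" "${count:-unknown}"
}

# Usage for the two channels that stand out above:
#   edac_dimm_info 1 0 0
#   edac_dimm_info 1 1 0
```

If the label comes back as unknown, the csrow and channel numbers still narrow things down, but the mapping to a slot has to come from the motherboard documentation.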