How to find DIMM errors and replace the faulty RAM?

Published on: December 20, 2018 by Smith Nevil

Scenario:

As part of the remote data center management we do, one of the (rare) requests we get from clients who have purchased managed services is to replace faulty RAM. One indicator of faulty RAM is random freezing of the server during normal operation. You can check for “Machine Check Exception” (mce for short) messages in kern.log or messages, depending on your OS.
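
The log check described above can be sketched as a small shell function. This is only a rough illustration: count_mce_lines is a hypothetical helper name, and the search patterns are my assumption about how these messages usually look, so adjust them to what your kernel actually logs.

```shell
# Sketch: count log lines that mention machine check exceptions or EDAC.
# count_mce_lines is a hypothetical helper; pass the log file your
# distribution uses (for example /var/log/kern.log or /var/log/messages).
count_mce_lines() {
    local logfile="$1"
    if [ ! -r "$logfile" ]; then
        echo "cannot read $logfile" >&2
        return 1
    fi
    # -i: ignore case, -c: print the number of matching lines, -E: extended
    # regex. grep exits with 1 when nothing matches, so that status is masked.
    grep -icE 'machine check|mce:|edac' "$logfile" || true
}
```

For example, count_mce_lines /var/log/messages prints a single number. A value above zero only tells you that the log deserves a closer look.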

A completely failed DIMM brings the entire server to a standstill until you remove the culprit. It is much harder, however, to find a DIMM that is only producing corrected (correctable) errors.

Error sample

EDAC MC1: 1 CE error on CPU#1Channel#0_DIMM#0 (channel:0 slot:0 page:0x0 offset:0x0 grain:8 syndrome:0x0)

EDAC MC1: 1 CE error on CPU#1Channel#0_DIMM#0 (channel:0 slot:0 page:0x0 offset:0x0 grain:8 syndrome:0x0)

EDAC MC1: 1 CE error on CPU#1Channel#0_DIMM#0 (channel:0 slot:0 page:0x0 offset:0x0 grain:8 syndrome:0x0)

EDAC MC1: 1 CE error on CPU#1Channel#0_DIMM#0 (channel:0 slot:0 page:0x0 offset:0x0 grain:8 syndrome:0x0)
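
When there are many such lines, it helps to add them up per location. The sketch below assumes every message follows the layout of the sample (EDAC MC<n>: <count> CE error on <location> ...). tally_ce_errors is a hypothetical helper that reads log text on standard input.

```shell
# Sketch: sum correctable-error counts per DIMM location from EDAC log lines.
# Assumed message layout: EDAC MC<n>: <count> CE error on <location> (...)
# tally_ce_errors is a hypothetical helper that reads log text on stdin.
tally_ce_errors() {
    awk '
        /EDAC MC[0-9]+: [0-9]+ CE error on / {
            for (i = 1; i <= NF; i++) {
                if ($i == "EDAC") {
                    mc = $(i + 1)
                    sub(/:$/, "", mc)          # "MC1:" becomes "MC1"
                    count = $(i + 2)           # errors reported by this line
                    location = $(i + 6)        # token that follows "on"
                    total[mc " " location] += count
                    break
                }
            }
        }
        END { for (key in total) print key, total[key] }
    ' | sort
}
```

For example, grep EDAC /var/log/messages | tally_ce_errors prints one line per controller and location, followed by the summed count.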

The first step is to look at the EDAC counters. On newer systems (kernel 2.6.18+) with sysfs, the error counts are exposed under /sys/devices/system/edac/mc/mc0. The file to check is ce_count. On the server I checked it is 2; anything above 24 is dangerous for a single DIMM bank. ue_count should be 0, because an uncorrected error means the DIMM is faulty and should be replaced.

[root@server ~]# ls -s /sys/devices/system/edac/mc/mc0

total 0

0 ce_count 0 max_location 0 rank3 0 seconds_since_reset

0 ce_noinfo_count 0 mc_name 0 rank4 0 size_mb

0 csrow0 0 power 0 rank5 0 subsystem

0 csrow1 0 rank0 0 rank6 0 ue_count

0 csrow2 0 rank1 0 rank7 0 ue_noinfo_count

0 csrow3 0 rank2 0 reset_counters 0 uevent

ce_count : The total count of correctable errors that have occurred on this memory controller (attribute file).

ce_noinfo_count : The total count of correctable errors on this memory controller, but with no information as to which DIMM slot is experiencing errors (attribute file).

mc_name : The type of memory controller being utilized (attribute file).

reset_counters : A write-only control file that zeroes out all of the statistical counters for correctable and uncorrectable errors on this memory controller and resets the timer indicating how long it has been since the last reset (counter zero). The basic command is echo <anything> > /sys/devices/system/edac/mc/mc0/reset_counters, where <anything> is literally anything (just use a 0 to make things easy).

sdram_scrub_rate : An attribute file that controls memory scrubbing. The scrubbing rate is set by writing a minimum bandwidth in bytes per second to the attribute file. The rate will be translated to an internal value that gives at least the specified rate. If the configuration fails or memory scrubbing is not implemented, the value of the attribute file will be -1.

seconds_since_reset : An attribute file that displays how many seconds have elapsed since the last counter reset. This can be used with the error counters to measure error rates.

size_mb : An attribute file that contains the size (MB) of memory that this memory controller manages.

ue_count : An attribute file that contains the total number of uncorrectable errors that have occurred on this memory controller.

ue_noinfo_count : The total count of uncorrectable errors on this memory controller, but with no information as to which DIMM slot is experiencing errors (attribute file).
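
As noted for seconds_since_reset, the counters can be turned into an error rate. A minimal sketch, assuming both ce_count and seconds_since_reset exist in the controller directory; ce_rate_per_hour is a hypothetical helper, and the default path is simply the mc0 directory used in this post.

```shell
# Sketch: correctable errors per hour for one memory controller directory.
# ce_rate_per_hour is a hypothetical helper; the default path is the mc0
# directory used in this post.
ce_rate_per_hour() {
    local mc_dir="${1:-/sys/devices/system/edac/mc/mc0}"
    local ce secs
    ce=$(cat "$mc_dir/ce_count") || return 1
    secs=$(cat "$mc_dir/seconds_since_reset") || return 1
    # Avoid dividing by zero right after a counter reset.
    if [ "$secs" -le 0 ]; then
        echo "0.00"
        return 0
    fi
    awk -v ce="$ce" -v secs="$secs" 'BEGIN { printf "%.2f\n", ce * 3600 / secs }'
}
```

Writing to reset_counters restarts both the counters and the timer, so a rate measured afterwards covers only the new interval.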

[root@server ~]# ls -s /sys/devices/system/edac/mc/mc0/csrow0

total 0

0 ce_count 0 ch1_ce_count 0 edac_mode 0 size_mb 0 uevent

0 ch0_ce_count 0 ch1_dimm_label 0 mem_type 0 subsystem

0 ch0_dimm_label 0 dev_type 0 power 0 ue_count

ce_count : The total count of correctable errors that have occurred on this csrow (attribute file).

ch0_ce_count : The total count of correctable errors on this DIMM in channel 0 (attribute file).

ch0_dimm_label : The control file that labels this DIMM. This can be very useful for panic events to isolate the cause of the uncorrectable error. Note that DIMM labels must be assigned after booting, with information that correctly identifies the physical slot with its silk screen label on the board itself.

dev_type : An attribute file that will display the type of DRAM device being used on this DIMM. Typically this is x1, x2, x4, or x8.

edac_mode : An attribute file that displays the type of error detection and correction being utilized.

mem_type : An attribute file that displays the type of memory currently on a csrow.

size_mb : An attribute file that contains the size (MB) of memory a csrow contains.

ue_count : An attribute file that contains the total number of uncorrectable errors that have occurred on a csrow.
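
To read the descriptive attributes of every csrow in one go, a loop like the following can help. It is a sketch under the assumption that each csrow directory contains size_mb, mem_type and edac_mode as listed above; describe_csrows is a hypothetical helper.

```shell
# Sketch: one line per csrow with its size, memory type and EDAC mode.
# describe_csrows is a hypothetical helper; the default path is the mc0
# directory used in this post.
describe_csrows() {
    local mc_dir="${1:-/sys/devices/system/edac/mc/mc0}"
    local row
    for row in "$mc_dir"/csrow*; do
        [ -d "$row" ] || continue   # the glob matched nothing
        printf '%s size_mb=%s mem_type=%s edac_mode=%s\n' \
            "$(basename "$row")" \
            "$(cat "$row/size_mb")" \
            "$(cat "$row/mem_type")" \
            "$(cat "$row/edac_mode")"
    done
}
```

Calling describe_csrows with no argument reads mc0; pass another controller directory to inspect it instead.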

[root@server ~]# cat /sys/devices/system/edac/mc/mc0/ce_count 

2 

[root@server ~]# cat /sys/devices/system/edac/mc/mc0/ue_count 

0
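
The two cat commands above can be repeated for every memory controller with a short loop. This is a sketch, not a monitoring tool: check_edac_counts is a hypothetical helper, and it assumes every mc directory exposes ce_count and ue_count.

```shell
# Sketch: print ce_count and ue_count for every memory controller and
# return non-zero if any controller reports uncorrectable errors.
# check_edac_counts is a hypothetical helper.
check_edac_counts() {
    local base="${1:-/sys/devices/system/edac/mc}"
    local mc ce ue status=0
    for mc in "$base"/mc[0-9]*; do
        [ -d "$mc" ] || continue   # the glob matched nothing
        ce=$(cat "$mc/ce_count")
        ue=$(cat "$mc/ue_count")
        echo "$(basename "$mc"): ce_count=$ce ue_count=$ue"
        if [ "$ue" -gt 0 ]; then
            status=1
        fi
    done
    return "$status"
}
```

The exit status makes it easy to call from cron or a monitoring check, for example check_edac_counts || echo "uncorrectable errors found".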

If ue_count is greater than 0, you need to dig deeper to find out which slot is faulty. The commands below list, for a memory controller (mc), the label of each row (DIMM) and its error count. There can be two or more memory controllers, identified as mc0, mc1, and so on.

[root@server ~]# cat /sys/devices/system/edac/mc/mc0/csrow*/ch0_dimm_label

mc#0csrow#0channel#0

mc#0csrow#1channel#0 

mc#0csrow#2channel#0 

mc#0csrow#3channel#0

This means I have 4 csrows (chip select rows) and 1 channel in each row.

[root@server ~]# cat /sys/devices/system/edac/mc/mc0/csrow*/ch0_ce_count 

0 

0 

0 

0
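
Since the labels and the counts come from two separate commands, it is easy to lose track of which count belongs to which DIMM. The sketch below prints them side by side. list_dimm_ce_counts is a hypothetical helper, and it assumes each ch<N>_ce_count file has a ch<N>_dimm_label file next to it, as in the csrow listing earlier.

```shell
# Sketch: pair each DIMM label with its correctable-error count.
# list_dimm_ce_counts is a hypothetical helper; it walks every
# ch<N>_ce_count file under the csrow directories of one controller.
list_dimm_ce_counts() {
    local mc_dir="${1:-/sys/devices/system/edac/mc/mc0}"
    local count_file label_file label
    for count_file in "$mc_dir"/csrow*/ch*_ce_count; do
        [ -f "$count_file" ] || continue   # the glob matched nothing
        # ch0_ce_count sits next to ch0_dimm_label in the same directory.
        label_file="${count_file%_ce_count}_dimm_label"
        label=$(cat "$label_file" 2>/dev/null || true)
        [ -n "$label" ] || label="(no label)"
        echo "$label $(cat "$count_file")"
    done
}
```

Any line whose last column is not 0 points at the csrow and channel to examine first.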

With all that said, you can also use edac-util, a program that reports EDAC (Error Detection and Correction) errors. It reads the information from the EDAC drivers in the kernel through the files they export in sysfs. You may need to install it separately.
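
A small wrapper can guard against edac-util not being installed. Treat this as a sketch: report_with_edac_util is a hypothetical name, and the --report option with values such as default, simple, full, ue and ce is how I recall the edac-util man page, so confirm it with edac-util --help on your system.

```shell
# Sketch: run edac-util if it is installed, otherwise say so.
# report_with_edac_util is a hypothetical wrapper. The --report option is
# an assumption based on the edac-util man page; verify it locally.
# The second argument only exists so the command name can be swapped out.
report_with_edac_util() {
    local report="${1:-default}"
    local tool="${2:-edac-util}"
    if ! command -v "$tool" >/dev/null 2>&1; then
        echo "$tool not found; install it with your package manager" >&2
        return 127
    fi
    "$tool" --report="$report"
}
```

For example, report_with_edac_util ce would ask only for the correctable-error report.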

The dmidecode output gives information about each DIMM slot and the size of the RAM installed in it. Another command that helps is lshw. You may need to install them if they are not present. If you run dmidecode as a normal user, you may get output like the one below, so switch to root first.

dmidecode

# dmidecode 3.0

Scanning /dev/mem for entry point.

/dev/mem: Permission denied

You can run dmidecode without arguments to show all the hardware information, or use the -t option to select a specific type of hardware. For memory devices, the type is 17.

[root@server ~]# dmidecode -t 17

# dmidecode 3.0

Scanning /dev/mem for entry point.

SMBIOS 2.6 present.

Handle 0x0056, DMI type 17, 28 bytes

Memory Device

Array Handle: 0x0057

Error Information Handle: 0x005A

Total Width: 128 bits

Data Width: 64 bits

Size: 8192 MB

Form Factor: DIMM

Set: None

Locator: ChannelA-DIMM0

Bank Locator: BANK 0

Type: DDR3

Type Detail: Synchronous

Speed: 1333 MHz

Manufacturer: Kingston

Serial Number: D104061E

Asset Tag: 9876543210

Part Number: 9965525-058.A00LF

Rank: 2

Handle 0x005B, DMI type 17, 28 bytes

Memory Device

Array Handle: 0x0057

Error Information Handle: No Error

Total Width: 128 bits

Data Width: 64 bits

Size: 8192 MB

Form Factor: DIMM

Set: None

Locator: ChannelA-DIMM1

Bank Locator: BANK 1

Type: DDR3

Type Detail: Synchronous

Speed: 1333 MHz

Manufacturer: Kingston

Serial Number: CB040A1E

Asset Tag: 9876543210

Part Number: 9965525-058.A00LF

Rank: 2

Handle 0x005C, DMI type 17, 28 bytes

Memory Device

Array Handle: 0x0057

Error Information Handle: 0x005F

Total Width: 128 bits

Data Width: 64 bits

Size: 8192 MB

Form Factor: DIMM

Set: None

Locator: ChannelB-DIMM0

Bank Locator: BANK 2

Type: DDR3

Type Detail: Synchronous

Speed: 1333 MHz

Manufacturer: Kingston

Serial Number: CE040A1E

Asset Tag: 9876543210

Part Number: 9965525-058.A00LF

Rank: 2

Handle 0x0061, DMI type 17, 28 bytes

Memory Device

Array Handle: 0x0057

Error Information Handle: No Error

Total Width: 128 bits

Data Width: 64 bits

Size: 8192 MB

Form Factor: DIMM

Set: None

Locator: ChannelB-DIMM1

Bank Locator: BANK 3

Type: DDR3

Type Detail: Synchronous

Speed: 1333 MHz

Manufacturer: Kingston

Serial Number: CF04DC1D

Asset Tag: 9876543210

Part Number: 9965525-058.A00LF

Rank: 2

In the above output there are 4 DIMM slots, each filled with 8 GB of memory. Important information is highlighted in the output for one of the RAM slots above. More details can be found at the URLs below. Even though the docs are a bit old, they are classics!
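
The full dmidecode output is long, so it can be handy to condense it to one line per module before walking to the rack. The sketch below keeps only Locator, Bank Locator, Size and Serial Number. summarize_dimms is a hypothetical helper that reads dmidecode -t 17 output on standard input, and the choice of fields is my own.

```shell
# Sketch: condense "dmidecode -t 17" output to one line per memory device.
# summarize_dimms is a hypothetical helper that reads the output on stdin.
summarize_dimms() {
    awk -F': *' '
        function emit() {
            if (locator != "")
                print locator " | " bank " | " size " | " serial
            size = locator = bank = serial = ""
        }
        { sub(/^[ \t]+/, "") }               # drop leading indentation
        $0 == "Memory Device" { emit() }     # a new record starts here
        $1 == "Size"          { size = $2 }
        $1 == "Locator"       { locator = $2 }
        $1 == "Bank Locator"  { bank = $2 }
        $1 == "Serial Number" { serial = $2 }
        END { emit() }                       # print the last record
    '
}
```

Run it as root with dmidecode -t 17 | summarize_dimms. The serial number is what lets you confirm that the module you pull from the slot is the one you meant to replace.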

https://docs.oracle.com/cd/E19121-01/sf.x4440/820-3067-14/dimms.html

https://docs.oracle.com/cd/E19150-01/820-4213-11/dimms.html

System monitoring and administration are critical to the successful operation of most businesses. But don’t worry: SupportSages is always happy to help.

Category : server

Smith Nevil

Smith is always ready to learn new technologies and explore new territories. His never-ending passion towards technological advancements, unyielding affinity to perfection and excitement in the exploration of new areas, help him to be on the top of everything he is involved with. He is currently working as System Engineer at SupportSages.
