Diagnosing Memory Errors on Linux


2024-07-10 20:38 | Source: compiled from the web

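Some background concepts

DRAM (Dynamic Random Access Memory) is the most common type of system memory. ECC stands for "Error Checking and Correcting" (also expanded as Error-Correcting Code). ECC memory is a memory module that implements this error checking and correcting technology. EDAC stands for Error Detection And Correction, which is also the name of the Linux kernel subsystem that reports these errors.

Memory errors come in two types, CE and UE. A CE (Correctable Error) is corrected by the hardware and does not immediately affect the running system, so the DIMM can be replaced at a convenient maintenance window. A UE (Uncorrectable Error) cannot be corrected and usually brings the machine down.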

System messages log

[root@my-host mg4a]# grep kernel /var/log/messages
Jan 14 19:01:11 my-host kernel: mce: [Hardware Error]: Machine check events logged
Jan 14 19:01:12 my-host kernel: EDAC MC0: 1 CE memory read error on CPU_SrcID#0_Ha#1_Chan#1_DIMM#0 (channel:5 slot:0 page:0x554c02 offset:0x3c0 grain:32 syndrome:0x0 - area:DRAM err_code:0001:0091 socket:0 ha:1 channel_mask:2 rank:0)

[root@my-host mg4a]# grep "[0-9]" /sys/devices/system/edac/mc/mc*/csrow*/ch*_ce_count
/sys/devices/system/edac/mc/mc0/csrow0/ch1_ce_count:0
/sys/devices/system/edac/mc/mc0/csrow0/ch5_ce_count:1
/sys/devices/system/edac/mc/mc1/csrow0/ch1_ce_count:0
/sys/devices/system/edac/mc/mc1/csrow0/ch5_ce_count:0
/sys/devices/system/edac/mc/mc2/csrow0/ch1_ce_count:0
/sys/devices/system/edac/mc/mc2/csrow0/ch5_ce_count:0
/sys/devices/system/edac/mc/mc3/csrow0/ch1_ce_count:0
/sys/devices/system/edac/mc/mc3/csrow0/ch5_ce_count:0

[root@my-host mg4a]# dmidecode -t 1
# dmidecode 3.0
Getting SMBIOS data from sysfs.
SMBIOS 2.7 present.

Handle 0x0044, DMI type 1, 27 bytes
System Information
    Manufacturer: LENOVO
    Product Name: Lenovo System x3750 M4 -[8753IH5]-
    Version: 03
    Serial Number: 06FF367
    UUID: C4EF8080-7926-11E5-8B14-6C0B849B418E
    Wake-up Type: Other
    SKU Number: XxXxXxX
    Family: System X
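To pull the useful fields out of an EDAC line like the one in the log above, a small parser can help. The following is a minimal sketch in Python. The function name `parse_edac_line` and the regular expression are illustrative assumptions written against the single sb_edac-style line shown in this article, and other EDAC drivers format their messages differently.

```python
import re

# Pattern written against the sb_edac-style line quoted in this article.
# Other EDAC drivers (for example amd64_edac) use a different layout.
EDAC_RE = re.compile(
    r"EDAC MC(?P<mc>\d+): (?P<count>\d+) (?P<kind>CE|UE) .*? on "
    r"(?P<label>\S+) \(channel:(?P<channel>\d+) slot:(?P<slot>\d+) "
    r"page:(?P<page>0x[0-9a-fA-F]+) offset:(?P<offset>0x[0-9a-fA-F]+)"
)


def parse_edac_line(line):
    """Return a dict of fields from one EDAC log line, or None if it does not match."""
    m = EDAC_RE.search(line)
    if m is None:
        return None
    return {
        "mc": int(m["mc"]),
        "count": int(m["count"]),
        "kind": m["kind"],          # "CE" or "UE"
        "label": m["label"],        # DIMM label reported by the driver
        "channel": int(m["channel"]),
        "slot": int(m["slot"]),
        "page": int(m["page"], 16),
        "offset": int(m["offset"], 16),
    }


sample = (
    "Jan 14 19:01:12 my-host kernel: EDAC MC0: 1 CE memory read error on "
    "CPU_SrcID#0_Ha#1_Chan#1_DIMM#0 (channel:5 slot:0 page:0x554c02 "
    "offset:0x3c0 grain:32 syndrome:0x0 - area:DRAM err_code:0001:0091 "
    "socket:0 ha:1 channel_mask:2 rank:0)"
)
event = parse_edac_line(sample)
print(event["kind"], "on", event["label"], "channel", event["channel"])
```

The parsed channel (5) matches the `ch5_ce_count` file that shows a count of 1 in the grep output above.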

Messages log from another machine

Jun 27 13:53:25 irora30 kernel: [Hardware Error]: MC4 Error (node 2): DRAM ECC error detected on the NB.
Jun 27 13:53:25 irora30 kernel: EDAC amd64 MC2: CE ERROR_ADDRESS= 0x8de3b1960
Jun 27 13:53:25 irora30 kernel: EDAC MC2: CE page 0x8de3b1, offset 0x960, grain 0, syndrome 0xab40, row 5, channel 0, label "": amd64_edac
Jun 27 13:53:25 irora30 kernel: [Hardware Error]: Error Status: Corrected error, no action required.
Jun 27 13:53:25 irora30 kernel: [Hardware Error]: CPU:1 (15:2:0) MC4_STATUS[-|CE|MiscV|-|AddrV|-|-|CECC]: 0x8c204000ab080a13
Jun 27 13:53:25 irora30 kernel: [Hardware Error]: MC4_ADDR: 0x00000008de3b1960
Jun 27 13:53:25 irora30 kernel: [Hardware Error]: cache level: L3/GEN, mem/io: MEM, mem-tx: RD, part-proc: RES (no timeout)
Jun 27 14:19:27 irora30 auditd[5571]: Audit daemon rotating log files
Jun 27 19:09:23 irora30 auditd[5571]: Audit daemon rotating log files
Jun 27 23:59:21 irora30 auditd[5571]: Audit daemon rotating log files
Jun 28 02:15:55 irora30 kernel: [Hardware Error]: MC4 Error (node 2): DRAM ECC error detected on the NB.
Jun 28 02:15:55 irora30 kernel: EDAC amd64 MC2: CE ERROR_ADDRESS= 0x8d9ea5960
Jun 28 02:15:55 irora30 kernel: EDAC MC2: CE page 0x8d9ea5, offset 0x960, grain 0, syndrome 0xab40, row 5, channel 0, label "": amd64_edac
Jun 28 02:15:55 irora30 kernel: [Hardware Error]: Error Status: Corrected error, no action required.
Jun 28 02:15:55 irora30 kernel: [Hardware Error]: CPU:1 (15:2:0) MC4_STATUS[-|CE|MiscV|-|AddrV|-|-|CECC]: 0x8c204000ab080813
Jun 28 02:15:55 irora30 kernel: [Hardware Error]: MC4_ADDR: 0x00000008d9ea5960
Jun 28 02:15:55 irora30 kernel: [Hardware Error]: cache level: L3/GEN, mem/io: MEM, mem-tx: RD, part-proc: SRC (no timeout)
Jun 28 03:08:25 irora30 kernel: [Hardware Error]: MC4 Error (node 2): DRAM ECC error detected on the NB.
Jun 28 03:08:25 irora30 kernel: EDAC amd64 MC2: CE ERROR_ADDRESS= 0x8ded39960
Jun 28 03:08:25 irora30 kernel: EDAC MC2: CE page 0x8ded39, offset 0x960, grain 0, syndrome 0xab40, row 5, channel 0, label "": amd64_edac
Jun 28 03:08:25 irora30 kernel: [Hardware Error]: Error Status: Corrected error, no action required.
Jun 28 03:08:25 irora30 kernel: [Hardware Error]: CPU:1 (15:2:0) MC4_STATUS[-|CE|MiscV|-|AddrV|-|-|CECC]: 0x8c204000ab080813
Jun 28 03:08:25 irora30 kernel: [Hardware Error]: MC4_ADDR: 0x00000008ded39960
Jun 28 03:08:25 irora30 kernel: [Hardware Error]: cache level: L3/GEN, mem/io: MEM, mem-tx: RD, part-proc: SRC (no timeout)
Jun 28 03:45:13 irora30 rhsmd: In order for Subscription Manager to provide your system with updates, your system must be registered with the Customer Portal. Please enter your Red Hat login to ensure your system is up-to-date.
Jun 28 04:44:25 irora30 auditd[5571]: Audit daemon rotating log files
Jun 28 09:34:22 irora30 auditd[5571]: Audit daemon rotating log files
Jun 28 10:02:30 irora30 ansible-command: Invoked with warn=True executable=None _uses_shell=True _raw_params=df -hl /var|awk 'NR>1 && int($5) > 80' removes=None creates=None chdir=None
Jun 28 14:23:49 irora30 auditd[5571]: Audit daemon rotating log files
Jun 28 19:09:25 irora30 auditd[5571]: Audit daemon rotating log files

Confirming the fault and locating the faulty memory slot

[root@irora30 ~]# grep "[0-9]" /sys/devices/system/edac/mc/mc*/csrow*/ch*_ce_count
/sys/devices/system/edac/mc/mc0/csrow4/ch0_ce_count:0
/sys/devices/system/edac/mc/mc0/csrow5/ch0_ce_count:0
/sys/devices/system/edac/mc/mc1/csrow4/ch0_ce_count:0
/sys/devices/system/edac/mc/mc1/csrow5/ch0_ce_count:0
/sys/devices/system/edac/mc/mc2/csrow4/ch0_ce_count:0
/sys/devices/system/edac/mc/mc2/csrow5/ch0_ce_count:294
/sys/devices/system/edac/mc/mc3/csrow4/ch0_ce_count:0
/sys/devices/system/edac/mc/mc3/csrow5/ch0_ce_count:0
/sys/devices/system/edac/mc/mc4/csrow4/ch0_ce_count:0
/sys/devices/system/edac/mc/mc4/csrow5/ch0_ce_count:0
/sys/devices/system/edac/mc/mc5/csrow4/ch0_ce_count:0
/sys/devices/system/edac/mc/mc5/csrow5/ch0_ce_count:0
/sys/devices/system/edac/mc/mc6/csrow4/ch0_ce_count:0
/sys/devices/system/edac/mc/mc6/csrow5/ch0_ce_count:0
/sys/devices/system/edac/mc/mc7/csrow4/ch0_ce_count:0
/sys/devices/system/edac/mc/mc7/csrow5/ch0_ce_count:0
[root@irora30 ~]#
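In the amd64_edac messages from this machine, each ERROR_ADDRESS is followed by a page and an offset. The sketch below shows how the two relate, assuming 4 KiB pages (a 12-bit page shift, the usual x86-64 page size). The helper name `split_error_address` is made up for illustration.

```python
PAGE_SHIFT = 12  # assumption: 4 KiB pages, the usual x86-64 page size


def split_error_address(addr, page_shift=PAGE_SHIFT):
    """Split a physical address into (page frame number, offset within the page)."""
    page = addr >> page_shift
    offset = addr & ((1 << page_shift) - 1)
    return page, offset


# The three ERROR_ADDRESS values from the log above.
for addr in (0x8de3b1960, 0x8d9ea5960, 0x8ded39960):
    page, offset = split_error_address(addr)
    print(f"ERROR_ADDRESS={addr:#x} -> page {page:#x}, offset {offset:#x}")
```

All three events report row 5, channel 0, and the same syndrome 0xab40, which is consistent with the single non-zero counter at mc2/csrow5/ch0_ce_count.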

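count: a line with a non-zero value means corrected memory errors have been recorded. mc: the memory controller index, which on these machines corresponds to the CPU (or NUMA node). csrow: the chip-select row within that controller. ch*: the channel within that row. How csrow and channel map onto physical slots depends on the EDAC driver and the motherboard, so confirm the mapping for your platform before pulling a DIMM.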
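The grep check can also be scripted. The following is a rough Python sketch that walks the same sysfs files and returns only the non-zero counters. It assumes the legacy `mc*/csrow*/ch*_ce_count` layout used in the grep commands above, and the function name `nonzero_ce_counts` is an illustrative choice.

```python
import glob
import os
import re


def nonzero_ce_counts(edac_root="/sys/devices/system/edac/mc"):
    """Return {(mc, csrow, channel): count} for every non-zero ch*_ce_count file."""
    pattern = os.path.join(edac_root, "mc*", "csrow*", "ch*_ce_count")
    name_re = re.compile(r"mc(\d+)[/\\]csrow(\d+)[/\\]ch(\d+)_ce_count$")
    result = {}
    for path in sorted(glob.glob(pattern)):
        m = name_re.search(path)
        if m is None:
            continue
        with open(path) as fh:
            count = int(fh.read().strip())
        if count:
            # Key is (mc index, csrow index, channel index).
            result[tuple(int(g) for g in m.groups())] = count
    return result


if __name__ == "__main__":
    # On a machine without EDAC loaded this simply prints nothing.
    for (mc, csrow, ch), count in nonzero_ce_counts().items():
        print(f"mc{mc} csrow{csrow} ch{ch}: {count} corrected errors")
```

Passing a different `edac_root` makes it possible to run the function against a copied directory tree, for example one collected from another host.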

Installed memory

Memory Component     Status
Proc 1 DIMM 1A       16384 MB        1333 MHz
Proc 1 DIMM 2I       Not installed   Not installed
Proc 1 DIMM 3E       Not installed   Not installed
Proc 1 DIMM 4C       Not installed   Not installed
Proc 1 DIMM 5K       Not installed   Not installed
Proc 1 DIMM 6G       Not installed   Not installed
Proc 1 DIMM 7B       16384 MB        1333 MHz
Proc 1 DIMM 8J       Not installed   Not installed
Proc 1 DIMM 9F       Not installed   Not installed
Proc 1 DIMM 10D      Not installed   Not installed
Proc 1 DIMM 11L      Not installed   Not installed
Proc 1 DIMM 12H      Not installed   Not installed
Proc 2 DIMM 1A       16384 MB        1333 MHz
Proc 2 DIMM 2I       Not installed   Not installed
Proc 2 DIMM 3E       Not installed   Not installed
Proc 2 DIMM 4C       Not installed   Not installed
Proc 2 DIMM 5K       Not installed   Not installed
Proc 2 DIMM 6G       Not installed   Not installed
Proc 2 DIMM 7B       16384 MB        1333 MHz
Proc 2 DIMM 8J       Not installed   Not installed
Proc 2 DIMM 9F       Not installed   Not installed
Proc 2 DIMM 10D      Not installed   Not installed
Proc 2 DIMM 11L      Not installed   Not installed
Proc 2 DIMM 12H      Not installed   Not installed
Proc 3 DIMM 1A       16384 MB        1333 MHz
Proc 3 DIMM 2I       Not installed   Not installed
Proc 3 DIMM 3E       Not installed   Not installed
Proc 3 DIMM 4C       Not installed   Not installed
Proc 3 DIMM 5K       Not installed   Not installed
Proc 3 DIMM 6G       Not installed   Not installed
Proc 3 DIMM 7B       16384 MB        1333 MHz
Proc 3 DIMM 8J       Not installed   Not installed
Proc 3 DIMM 9F       Not installed   Not installed
Proc 3 DIMM 10D      Not installed   Not installed
Proc 3 DIMM 11L      Not installed   Not installed
Proc 3 DIMM 12H      Not installed   Not installed
Proc 4 DIMM 1A       16384 MB        1333 MHz
Proc 4 DIMM 2I       Not installed   Not installed
Proc 4 DIMM 3E       Not installed   Not installed
Proc 4 DIMM 4C       Not installed   Not installed
Proc 4 DIMM 5K       Not installed   Not installed
Proc 4 DIMM 6G       Not installed   Not installed
Proc 4 DIMM 7B       16384 MB        1333 MHz
Proc 4 DIMM 8J       Not installed   Not installed
Proc 4 DIMM 9F       Not installed   Not installed
Proc 4 DIMM 10D      Not installed   Not installed
Proc 4 DIMM 11L      Not installed   Not installed
Proc 4 DIMM 12H      Not installed   Not installed

Using the EDAC tools to detect server memory faults

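With the spread of virtualization and in-memory data stores such as Redis and BDB, more and more servers are configured with large amounts of memory. The DELL R620, for example, has 24 DIMM slots in a dual-CPU configuration and supports as much as 960 GB of memory. Detecting faults in ECC and registered memory with error correction is a headache: a faulty DIMM can keep running for months or even years, but with bad luck it can fail at any moment. Fortunately, Linux provides edac-utils, a diagnostic tool for memory error correction that can be used to check a server's memory for latent faults.

The following uses CentOS as an example to show how to use edac-utils. Before using the tool you need to understand the server's hardware architecture. Taking the DELL R620 as an example (other models such as the HP DL360p G8 and IBM x3650 M4 also use E5-2600 series CPUs and the C600 series chipset and are broadly similar), the relationship between the CPU memory controllers, channels, and DIMM slots is as follows.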

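Processor 0 (one memory controller)
Channel 0: DIMM slots A1, A5, and A9
Channel 1: DIMM slots A2, A6, and A10
Channel 2: DIMM slots A3, A7, and A11
Channel 3: DIMM slots A4, A8, and A12

Processor 1 (one memory controller)
Channel 0: DIMM slots B1, B5, and B9
Channel 1: DIMM slots B2, B6, and B10
Channel 2: DIMM slots B3, B7, and B11
Channel 3: DIMM slots B4, B8, and B12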
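The channel-to-slot layout above follows a simple rule: within each processor, slots are numbered channel-first, so the slot number is `dimm * 4 + channel + 1`. A small sketch of that rule, assuming the R620 layout listed here (the function name `r620_slot_name` is hypothetical, and other server models need their own mapping):

```python
def r620_slot_name(cpu, channel, dimm):
    """Map (cpu, channel, dimm) to a slot label such as 'A5', per the layout above."""
    if cpu not in (0, 1) or not 0 <= channel <= 3 or not 0 <= dimm <= 2:
        raise ValueError("expected cpu 0-1, channel 0-3, dimm 0-2")
    letter = "AB"[cpu]
    # A1-A4 are DIMM 0 of channels 0-3, A5-A8 are DIMM 1, A9-A12 are DIMM 2.
    return f"{letter}{dimm * 4 + channel + 1}"


print(r620_slot_name(0, 0, 0))  # A1
print(r620_slot_name(0, 1, 1))  # A6
print(r620_slot_name(1, 3, 2))  # B12
```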

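1. Install the edac-utils tool

yum install -y libsysfs edac-utils

2. Run the check command. It lists each DIMM as shown below (the slot name at the end of each line is an added annotation).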

edac-util -v
mc0: csrow0: CPU_SrcID#0_Ha#0_Chan#0_DIMM#0: A1
mc0: csrow0: CPU_SrcID#0_Ha#0_Chan#1_DIMM#0: A2
mc0: csrow0: CPU_SrcID#0_Ha#0_Chan#2_DIMM#0: A3
mc0: csrow0: CPU_SrcID#0_Ha#0_Chan#3_DIMM#0: A4
mc0: csrow1: CPU_SrcID#0_Ha#0_Chan#0_DIMM#1: A5
mc0: csrow1: CPU_SrcID#0_Ha#0_Chan#1_DIMM#1: A6
mc0: csrow1: CPU_SrcID#0_Ha#0_Chan#2_DIMM#1: A7
mc0: csrow1: CPU_SrcID#0_Ha#0_Chan#3_DIMM#1: A8
mc0: csrow2: CPU_SrcID#0_Ha#0_Chan#0_DIMM#2: A9
mc0: csrow2: CPU_SrcID#0_Ha#0_Chan#1_DIMM#2: A10
mc0: csrow2: CPU_SrcID#0_Ha#0_Chan#2_DIMM#2: A11
mc0: csrow2: CPU_SrcID#0_Ha#0_Chan#3_DIMM#2: A12

mc1: csrow0: CPU_SrcID#1_Ha#0_Chan#0_DIMM#0: B1
mc1: csrow0: CPU_SrcID#1_Ha#0_Chan#1_DIMM#0: B2
mc1: csrow0: CPU_SrcID#1_Ha#0_Chan#2_DIMM#0: B3
mc1: csrow0: CPU_SrcID#1_Ha#0_Chan#3_DIMM#0: B4
mc1: csrow1: CPU_SrcID#1_Ha#0_Chan#0_DIMM#1: B5
mc1: csrow1: CPU_SrcID#1_Ha#0_Chan#1_DIMM#1: B6
mc1: csrow1: CPU_SrcID#1_Ha#0_Chan#2_DIMM#1: B7
mc1: csrow1: CPU_SrcID#1_Ha#0_Chan#3_DIMM#1: B8
mc1: csrow2: CPU_SrcID#1_Ha#0_Chan#0_DIMM#1: B9
mc1: csrow2: CPU_SrcID#1_Ha#0_Chan#1_DIMM#1: B10
mc1: csrow2: CPU_SrcID#1_Ha#0_Chan#2_DIMM#1: B11
mc1: csrow2: CPU_SrcID#1_Ha#0_Chan#3_DIMM#1: B12

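Where:

mc0 denotes memory controller 0; CPU_SrcID#0 denotes source CPU 0; Chan#0 denotes channel 0; DIMM#0 denotes DIMM slot 0 on that channel; Corrected Errors is the number of errors corrected so far.

Using the CPU channel and slot mapping listed earlier, each line returned by edac-util can be labeled with its slot name. In this case slot A1 showed 6312 corrected errors, slot B1 showed 6459, and slot B3 showed 535. These three DIMMs are potentially failing, and the next step is to contact the vendor to have them replaced.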

Mapping for 12 DIMMs

mc0: csrow0: CPU#0Channel#0_DIMM#0: A1
mc0: csrow0: CPU#0Channel#1_DIMM#0: A2
mc0: csrow0: CPU#0Channel#2_DIMM#0: A3
mc0: csrow1: CPU#0Channel#0_DIMM#1: A4
mc0: csrow1: CPU#0Channel#1_DIMM#1: A5
mc0: csrow1: CPU#0Channel#2_DIMM#1: A6

mc1: csrow0: CPU#1Channel#0_DIMM#0: B1
mc1: csrow0: CPU#1Channel#1_DIMM#0: B2
mc1: csrow0: CPU#1Channel#2_DIMM#0: B3
mc1: csrow1: CPU#1Channel#0_DIMM#1: B4
mc1: csrow1: CPU#1Channel#1_DIMM#1: B5
mc1: csrow1: CPU#1Channel#2_DIMM#1: B6

Mapping for 20 DIMMs

mc0: 0 Uncorrected Errors with no DIMM info
mc0: 0 Corrected Errors with no DIMM info
mc0: csrow0: 0 Uncorrected Errors
mc0: csrow0: CPU_SrcID#0_Ha#0_Chan#0_DIMM#0: 0 Corrected Errors A1
mc0: csrow0: CPU_SrcID#0_Ha#0_Chan#1_DIMM#0: 0 Corrected Errors B1
mc0: csrow0: CPU_SrcID#0_Ha#0_Chan#2_DIMM#0: 0 Corrected Errors C1
mc0: csrow0: CPU_SrcID#0_Ha#0_Chan#3_DIMM#0: 0 Corrected Errors D1
mc0: csrow1: 0 Uncorrected Errors
mc0: csrow1: CPU_SrcID#0_Ha#0_Chan#0_DIMM#1: 0 Corrected Errors A2
mc0: csrow1: CPU_SrcID#0_Ha#0_Chan#1_DIMM#1: 0 Corrected Errors B2
mc0: csrow1: CPU_SrcID#0_Ha#0_Chan#2_DIMM#1: 0 Corrected Errors C2
mc0: csrow1: CPU_SrcID#0_Ha#0_Chan#3_DIMM#1: 0 Corrected Errors D2
mc0: csrow2: 0 Uncorrected Errors
mc0: csrow2: CPU_SrcID#0_Ha#0_Chan#0_DIMM#2: 0 Corrected Errors A3
mc0: csrow2: CPU_SrcID#0_Ha#0_Chan#1_DIMM#2: 11 Corrected Errors B3
mc0: csrow2: CPU_SrcID#0_Ha#0_Chan#2_DIMM#2: 0 Corrected Errors C3
mc0: csrow2: CPU_SrcID#0_Ha#0_Chan#3_DIMM#2: 0 Corrected Errors D3
mc1: 0 Uncorrected Errors with no DIMM info
mc1: 0 Corrected Errors with no DIMM info
mc1: csrow0: 0 Uncorrected Errors
mc1: csrow0: CPU_SrcID#1_Ha#0_Chan#0_DIMM#0: 0 Corrected Errors
mc1: csrow0: CPU_SrcID#1_Ha#0_Chan#1_DIMM#0: 0 Corrected Errors
mc1: csrow0: CPU_SrcID#1_Ha#0_Chan#2_DIMM#0: 0 Corrected Errors
mc1: csrow0: CPU_SrcID#1_Ha#0_Chan#3_DIMM#0: 0 Corrected Errors
mc1: csrow1: 0 Uncorrected Errors
mc1: csrow1: CPU_SrcID#1_Ha#0_Chan#0_DIMM#1: 0 Corrected Errors
mc1: csrow1: CPU_SrcID#1_Ha#0_Chan#1_DIMM#1: 0 Corrected Errors
mc1: csrow1: CPU_SrcID#1_Ha#0_Chan#2_DIMM#1: 0 Corrected Errors
mc1: csrow1: CPU_SrcID#1_Ha#0_Chan#3_DIMM#1: 0 Corrected Errors

4x16 mapping

mc0: csrow0: CPU#0Channel#0_DIMM#0: 0 Corrected Errors 8a
mc0: csrow0: CPU#0Channel#1_DIMM#0: 0 Corrected Errors 5b
mc0: csrow0: CPU#0Channel#2_DIMM#0: 0 Corrected Errors 2c
mc0: csrow1: 0 Uncorrected Errors
mc0: csrow1: CPU#0Channel#0_DIMM#1: 1 Corrected Errors 7d
mc0: csrow1: CPU#0Channel#1_DIMM#1: 0 Corrected Errors 4e
mc0: csrow1: CPU#0Channel#2_DIMM#1: 0 Corrected Errors 1f
mc0: csrow2: 0 Uncorrected Errors
mc0: csrow2: CPU#0Channel#0_DIMM#2: 0 Corrected Errors 6G
mc0: csrow2: CPU#0Channel#1_DIMM#2: 0 Corrected Errors 3h
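Scanning a long report by eye for non-zero counts is error-prone. The sketch below filters `edac-util -v` style text down to the DIMMs that have corrected errors, worst first. It assumes the per-DIMM line format shown in the listings above, and the regular expression and the function name `corrected_errors` are illustrative, not part of edac-utils.

```python
import re

# Matches per-DIMM lines such as:
#   mc0: csrow2: CPU_SrcID#0_Ha#0_Chan#1_DIMM#2: 11 Corrected Errors
LINE_RE = re.compile(
    r"^(?P<mc>mc\d+): (?P<csrow>csrow\d+): (?P<label>\S+): "
    r"(?P<count>\d+) Corrected Errors"
)


def corrected_errors(report):
    """Return [(mc, csrow, label, count)] for DIMMs with a non-zero corrected count."""
    hits = []
    for line in report.splitlines():
        m = LINE_RE.match(line.strip())
        if m and int(m["count"]) > 0:
            hits.append((m["mc"], m["csrow"], m["label"], int(m["count"])))
    # Highest corrected-error count first.
    return sorted(hits, key=lambda h: h[3], reverse=True)


sample = """\
mc0: 0 Corrected Errors with no DIMM info
mc0: csrow2: 0 Uncorrected Errors
mc0: csrow2: CPU_SrcID#0_Ha#0_Chan#0_DIMM#2: 0 Corrected Errors
mc0: csrow2: CPU_SrcID#0_Ha#0_Chan#1_DIMM#2: 11 Corrected Errors
"""
for mc, csrow, label, count in corrected_errors(sample):
    print(mc, csrow, label, count)
```

In practice the report text would come from running `edac-util -v` on the host, and the resulting labels still have to be matched to physical slots using the mapping for that server model.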

References:
https://www.cnblogs.com/luckyall/p/11225772.html
http://www.voidcn.com/article/p-gvfvakvy-btw.html


