Hardware Error: Memory Error Reports

2023-10-21 03:26 | Source: compiled from the web

Running dmesg | grep -i error on 192.168.219.90 showed that this machine has a memory problem, as shown below:

    [Hardware Error]: MC4 Error (node 1): L3 cache tag error.
    [Hardware Error]: Error Status: Corrected error, no action required.
    [Hardware Error]: CPU:6 (15:2:0) MC4_STATUS[-|CE|MiscV|-|AddrV|-|Poison|CECC]: 0x9c11cc60001d018b
    [Hardware Error]: MC4_ADDR: 0x00000018edfd9100
    [Hardware Error]: cache level: L3/GEN, tx: GEN, mem-tx: SNP
    [Hardware Error]: MC4 Error (node 2): DRAM ECC error detected on the NB.
    EDAC amd64 MC2: CE ERROR_ADDRESS= 0x8cf6cb900
    [Hardware Error]: Error Status: Corrected error, no action required.
    [Hardware Error]: CPU:12 (15:2:0) MC4_STATUS[-|CE|MiscV|-|AddrV|-|-|CECC]: 0x8c40400010080a13
    [Hardware Error]: MC4_ADDR: 0x00000008cf6cb900
    [Hardware Error]: cache level: L3/GEN, mem/io: MEM, mem-tx: RD, part-proc: RES (no timeout)
    [Hardware Error]: MC4 Error (node 2): DRAM ECC error detected on the NB.
    EDAC amd64 MC2: CE ERROR_ADDRESS= 0x8cf6cb900
    [Hardware Error]: Error Status: Corrected error, no action required.
    [Hardware Error]: CPU:12 (15:2:0) MC4_STATUS[-|CE|MiscV|-|AddrV|-|-|CECC]: 0x8c40400010080a13
    [Hardware Error]: MC4_ADDR: 0x00000008cf6cb900
    [Hardware Error]: cache level: L3/GEN, mem/io: MEM, mem-tx: RD, part-proc: RES (no timeout)
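When the log is long, it can help to tally the decoded error headers per node before reading individual entries. The following is a minimal Python sketch of that idea, not part of the original procedure. The function name summarize_mce and the regular expression are illustrative assumptions, written against the line format seen in the output above; other kernels or CPU families may print these lines differently.

```python
import re
from collections import Counter

# Matches decoded MCE header lines such as:
#   [Hardware Error]: MC4 Error (node 2): DRAM ECC error detected on the NB.
# The pattern is an assumption based on the log format shown in this article.
MC_ERROR_RE = re.compile(
    r"\[Hardware Error\]: MC(?P<bank>\d+) Error \(node (?P<node>\d+)\): "
    r"(?P<desc>.+?)\.?\s*$"
)


def summarize_mce(lines):
    """Count decoded MCE header lines per (bank, node, description)."""
    counts = Counter()
    for line in lines:
        match = MC_ERROR_RE.search(line)
        if match:
            key = (int(match["bank"]), int(match["node"]), match["desc"])
            counts[key] += 1
    return counts


# A shortened excerpt of the dmesg output from this article.
SAMPLE = """\
[Hardware Error]: MC4 Error (node 1): L3 cache tag error.
[Hardware Error]: Error Status: Corrected error, no action required.
[Hardware Error]: MC4 Error (node 2): DRAM ECC error detected on the NB.
[Hardware Error]: MC4 Error (node 2): DRAM ECC error detected on the NB.
"""

for (bank, node, desc), n in sorted(summarize_mce(SAMPLE.splitlines()).items()):
    print(f"MC{bank} node {node}: {desc} (seen {n} times)")
```

In practice the lines would come from saved dmesg output rather than an inline string. A node that keeps reporting DRAM ECC errors is the one to examine first.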

Further investigation showed that the fifth memory module is faulty, so we need to contact the private cloud team to file a repair request.

    grep [0-9] /sys/devices/system/edac/mc/mc*/csrow*/ch*_ce_count
    /sys/devices/system/edac/mc/mc0/csrow2/ch0_ce_count:0
    /sys/devices/system/edac/mc/mc0/csrow2/ch1_ce_count:0
    /sys/devices/system/edac/mc/mc1/csrow2/ch0_ce_count:0
    /sys/devices/system/edac/mc/mc1/csrow2/ch1_ce_count:0
    /sys/devices/system/edac/mc/mc2/csrow2/ch0_ce_count:146
    /sys/devices/system/edac/mc/mc2/csrow2/ch1_ce_count:0
    /sys/devices/system/edac/mc/mc3/csrow2/ch0_ce_count:0
    /sys/devices/system/edac/mc/mc3/csrow2/ch1_ce_count:0
    /sys/devices/system/edac/mc/mc4/csrow2/ch0_ce_count:0
    /sys/devices/system/edac/mc/mc4/csrow2/ch1_ce_count:0
    /sys/devices/system/edac/mc/mc5/csrow2/ch0_ce_count:0
    /sys/devices/system/edac/mc/mc5/csrow2/ch1_ce_count:0
    /sys/devices/system/edac/mc/mc6/csrow2/ch0_ce_count:0
    /sys/devices/system/edac/mc/mc6/csrow2/ch1_ce_count:0
    /sys/devices/system/edac/mc/mc7/csrow2/ch0_ce_count:0
    /sys/devices/system/edac/mc/mc7/csrow2/ch1_ce_count:0

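Any line with a nonzero count means corrected memory errors have been recorded there. mc*: the memory controller (on this machine, one per CPU node). csrow*: the chip-select row, which roughly corresponds to a rank on a DIMM. ch*: the memory channel within that controller.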
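Picking the nonzero counters out of the grep output by eye is easy to get wrong on machines with many controllers. As a rough sketch, the filtering can be done in Python. The function name nonzero_ce_counts and the regular expression are assumptions made for illustration, and they expect the path:count layout shown in the grep output above.

```python
import re

# Matches grep output lines such as:
#   /sys/devices/system/edac/mc/mc2/csrow2/ch0_ce_count:146
CE_LINE_RE = re.compile(r"/mc(\d+)/csrow(\d+)/ch(\d+)_ce_count:(\d+)\s*$")


def nonzero_ce_counts(lines):
    """Return (mc, csrow, ch, count) tuples for counters greater than zero."""
    hits = []
    for line in lines:
        match = CE_LINE_RE.search(line)
        if match:
            mc, csrow, ch, count = map(int, match.groups())
            if count > 0:
                hits.append((mc, csrow, ch, count))
    return hits


# A shortened excerpt of the grep output from this article.
SAMPLE = """\
/sys/devices/system/edac/mc/mc1/csrow2/ch1_ce_count:0
/sys/devices/system/edac/mc/mc2/csrow2/ch0_ce_count:146
/sys/devices/system/edac/mc/mc2/csrow2/ch1_ce_count:0
"""

for mc, csrow, ch, count in nonzero_ce_counts(SAMPLE.splitlines()):
    print(f"mc{mc} csrow{csrow} ch{ch}: {count} corrected errors")
```

Running this periodically and comparing the counts shows whether the errors are still accumulating or were a one-off.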

Then check with dmidecode:

    [root@customer log]# dmidecode -t memory |grep 'Locator: DIMM'
    Locator: DIMM01
    Locator: DIMM02
    Locator: DIMM03
    Locator: DIMM04
    Locator: DIMM05
    Locator: DIMM06
    Locator: DIMM07
    Locator: DIMM08
    Locator: DIMM09
    Locator: DIMM10
    Locator: DIMM11
    Locator: DIMM12
    Locator: DIMM13
    Locator: DIMM14
    Locator: DIMM15
    Locator: DIMM16
    Locator: DIMM17
    Locator: DIMM18
    Locator: DIMM19
    Locator: DIMM20
    Locator: DIMM21
    Locator: DIMM22
    Locator: DIMM23
    Locator: DIMM24
    Locator: DIMM25
    Locator: DIMM26
    Locator: DIMM27
    Locator: DIMM28
    Locator: DIMM29
    Locator: DIMM30
    Locator: DIMM31
    Locator: DIMM32

[Figure: memory view in the server management console]

[Figure: layout of the memory slots on the motherboard]
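The dmidecode listing above only shows slot names. When filing a repair request it is often useful to also know which slots are populated and with what size module. The sketch below parses the full text of dmidecode -t memory into one dictionary per "Memory Device" record. The function names are hypothetical, and the sample text is a made-up illustration of the usual dmidecode layout, not output captured from the machine in this article.

```python
def parse_memory_devices(text):
    """Split `dmidecode -t memory` text into one dict per Memory Device."""
    devices = []
    current = None
    for raw in text.splitlines():
        line = raw.strip()
        if line == "Memory Device":
            # Start of a new DMI type 17 record.
            current = {}
            devices.append(current)
        elif not line or line.startswith("Handle "):
            # Blank line or a new handle header ends the current record.
            current = None
        elif current is not None and ": " in line:
            key, value = line.split(": ", 1)
            current[key] = value
    return devices


def installed_modules(devices):
    """Keep only the slots that actually hold a module."""
    return [d for d in devices if d.get("Size") != "No Module Installed"]


# Hypothetical sample, shaped like typical dmidecode output.
SAMPLE = """\
Handle 0x1100, DMI type 17, 34 bytes
Memory Device
\tSize: 16384 MB
\tLocator: DIMM05
\tBank Locator: Not Specified

Handle 0x1101, DMI type 17, 34 bytes
Memory Device
\tSize: No Module Installed
\tLocator: DIMM06
\tBank Locator: Not Specified
"""

for dev in installed_modules(parse_memory_devices(SAMPLE)):
    print(dev["Locator"], dev["Size"])
```

Mapping a locator such as DIMM05 to a physical slot still depends on the board's documentation or the management console.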

Cross-referencing this with the error log:

    kernel: EDAC MC1: 16107 CE error on CPU#1Channel#2_DIMM#1 (channel:2slot:1

the problem should be in memory slot DIMM_F1.
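The fields in that EDAC message (controller, error count, channel, slot) are what get matched against the board's slot layout. As a hedged illustration, they can be extracted with a small parser. The name parse_edac_line and the pattern are assumptions fitted to the single log line quoted above. The mapping from channel and slot to a label such as DIMM_F1 is board-specific and is deliberately left out.

```python
import re

# Pattern fitted to the log line quoted in this article; other EDAC drivers
# may word the message differently.
EDAC_RE = re.compile(
    r"EDAC MC(?P<mc>\d+): (?P<count>\d+) (?P<kind>CE|UE) .*?"
    r"\(channel:(?P<channel>\d+)\s*slot:(?P<slot>\d+)"
)


def parse_edac_line(line):
    """Return the numeric fields of an EDAC error line, or None if no match."""
    match = EDAC_RE.search(line)
    if match is None:
        return None
    fields = match.groupdict()
    kind = fields.pop("kind")
    result = {name: int(value) for name, value in fields.items()}
    result["kind"] = kind  # "CE" = corrected error, "UE" = uncorrected error
    return result


# The log line as it appears in this article (truncated in the source).
LINE = "kernel: EDAC MC1: 16107 CE error on CPU#1Channel#2_DIMM#1 (channel:2slot:1"

print(parse_edac_line(LINE))
```

With these fields in hand, the slot is looked up in the motherboard's slot diagram or the vendor's documentation.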

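Solution: the final step is to pull the memory module out of the faulty F1 slot, or move it to a different slot. After that, the system started up without reporting the error again.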


