Hardware Error: Memory Error Reports
Running dmesg | grep -i error on host 192.168.219.90 showed that the machine has a memory problem, as shown below:

[Hardware Error]: MC4 Error (node 1): L3 cache tag error.
[Hardware Error]: Error Status: Corrected error, no action required.
[Hardware Error]: CPU:6 (15:2:0) MC4_STATUS[-|CE|MiscV|-|AddrV|-|Poison|CECC]: 0x9c11cc60001d018b
[Hardware Error]: MC4_ADDR: 0x00000018edfd9100
[Hardware Error]: cache level: L3/GEN, tx: GEN, mem-tx: SNP
[Hardware Error]: MC4 Error (node 2): DRAM ECC error detected on the NB.
EDAC amd64 MC2: CE ERROR_ADDRESS= 0x8cf6cb900
[Hardware Error]: Error Status: Corrected error, no action required.
[Hardware Error]: CPU:12 (15:2:0) MC4_STATUS[-|CE|MiscV|-|AddrV|-|-|CECC]: 0x8c40400010080a13
[Hardware Error]: MC4_ADDR: 0x00000008cf6cb900
[Hardware Error]: cache level: L3/GEN, mem/io: MEM, mem-tx: RD, part-proc: RES (no timeout)
[Hardware Error]: MC4 Error (node 2): DRAM ECC error detected on the NB.
EDAC amd64 MC2: CE ERROR_ADDRESS= 0x8cf6cb900
[Hardware Error]: Error Status: Corrected error, no action required.
[Hardware Error]: CPU:12 (15:2:0) MC4_STATUS[-|CE|MiscV|-|AddrV|-|-|CECC]: 0x8c40400010080a13
[Hardware Error]: MC4_ADDR: 0x00000008cf6cb900
[Hardware Error]: cache level: L3/GEN, mem/io: MEM, mem-tx: RD, part-proc: RES (no timeout)

Further checking showed that the fifth memory module is the faulty one, so a repair request has to be filed with the private cloud team.

grep [0-9] /sys/devices/system/edac/mc/mc*/csrow*/ch*_ce_count
/sys/devices/system/edac/mc/mc0/csrow2/ch0_ce_count:0
/sys/devices/system/edac/mc/mc0/csrow2/ch1_ce_count:0
/sys/devices/system/edac/mc/mc1/csrow2/ch0_ce_count:0
/sys/devices/system/edac/mc/mc1/csrow2/ch1_ce_count:0
/sys/devices/system/edac/mc/mc2/csrow2/ch0_ce_count:146
/sys/devices/system/edac/mc/mc2/csrow2/ch1_ce_count:0
/sys/devices/system/edac/mc/mc3/csrow2/ch0_ce_count:0
/sys/devices/system/edac/mc/mc3/csrow2/ch1_ce_count:0
/sys/devices/system/edac/mc/mc4/csrow2/ch0_ce_count:0
/sys/devices/system/edac/mc/mc4/csrow2/ch1_ce_count:0
/sys/devices/system/edac/mc/mc5/csrow2/ch0_ce_count:0
/sys/devices/system/edac/mc/mc5/csrow2/ch1_ce_count:0
/sys/devices/system/edac/mc/mc6/csrow2/ch0_ce_count:0
/sys/devices/system/edac/mc/mc6/csrow2/ch1_ce_count:0
/sys/devices/system/edac/mc/mc7/csrow2/ch0_ce_count:0
/sys/devices/system/edac/mc/mc7/csrow2/ch1_ce_count:0

Any line whose count is not 0 indicates corrected memory errors on that path. Here only mc2/csrow2/ch0 is non-zero (146), which matches the "EDAC amd64 MC2" lines in the dmesg output.

mc*: the memory controller. With the amd64 EDAC driver there is one per node, so this roughly answers "which CPU".
csrow*: the chip-select row on that controller (a rank of a DIMM).
ch*: the memory channel within that controller.

Which physical slot a given csrow/channel pair corresponds to depends on the motherboard.

Next, list the DIMM slots with dmidecode:

[root@customer log]# dmidecode -t memory |grep 'Locator: DIMM'
Locator: DIMM01
Locator: DIMM02
Locator: DIMM03
Locator: DIMM04
Locator: DIMM05
Locator: DIMM06
Locator: DIMM07
Locator: DIMM08
Locator: DIMM09
Locator: DIMM10
Locator: DIMM11
Locator: DIMM12
Locator: DIMM13
Locator: DIMM14
Locator: DIMM15
Locator: DIMM16
Locator: DIMM17
Locator: DIMM18
Locator: DIMM19
Locator: DIMM20
Locator: DIMM21
Locator: DIMM22
Locator: DIMM23
Locator: DIMM24
Locator: DIMM25
Locator: DIMM26
Locator: DIMM27
Locator: DIMM28
Locator: DIMM29
Locator: DIMM30
Locator: DIMM31
Locator: DIMM32
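The grep over the ce_count files above can also be done in a small script, which is handy when the check needs to run periodically. The following is only a minimal sketch: it assumes the csrow-style sysfs layout shown above is present on the machine, and the function names read_ce_counts and nonzero are made up for this illustration.

```python
import re
from pathlib import Path

# Matches paths such as /sys/devices/system/edac/mc/mc2/csrow2/ch0_ce_count
CE_PATH = re.compile(r"mc(\d+)/csrow(\d+)/ch(\d+)_ce_count$")


def read_ce_counts(root="/sys/devices/system/edac/mc"):
    """Return {(mc, csrow, channel): count} for every per-channel CE counter.

    The default root is the path used in the grep command above; the
    function name and return shape are illustrative assumptions.
    """
    counts = {}
    for path in Path(root).glob("mc*/csrow*/ch*_ce_count"):
        match = CE_PATH.search(path.as_posix())
        if match is None:
            continue
        try:
            value = int(path.read_text().strip())
        except (OSError, ValueError):
            # Skip counters that cannot be read or are not plain integers.
            continue
        key = tuple(int(group) for group in match.groups())
        counts[key] = value
    return counts


def nonzero(counts):
    """Keep only the counters that have recorded at least one corrected error."""
    return {key: value for key, value in sorted(counts.items()) if value > 0}


if __name__ == "__main__":
    for (mc, csrow, ch), n in nonzero(read_ce_counts()).items():
        print(f"mc{mc} csrow{csrow} ch{ch}: {n} corrected errors")
```

On the machine described here, this would be expected to report only the mc2/csrow2/ch0 counter, since that is the only non-zero entry in the grep output.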
Memory view in the server management console: [figure]
Layout of the memory slots on the motherboard: [figure]
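Combined with the error log:

kernel: EDAC MC1: 16107 CE error on CPU#1Channel#2_DIMM#1 (channel:2slot:1

the fault should be in memory slot DIMM_F1.

Fix: remove the DIMM from the faulty F1 slot, or move it to a different slot. After that, the system booted without reporting any further errors.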
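When many such kernel messages have to be triaged, the fields in the EDAC line quoted above can be pulled out with a regular expression. This is a hedged sketch written against that single sample line only: the exact message format varies between EDAC drivers and kernel versions, and the name parse_edac_line is hypothetical. Mapping the resulting CPU/channel/DIMM numbers to a silkscreen label such as DIMM_F1 is board-specific and is deliberately left out.

```python
import re

# Pattern modelled on the one sample line quoted in the text; other drivers
# or kernel versions may format the message differently.
EDAC_LINE = re.compile(
    r"EDAC MC(?P<mc>\d+): (?P<count>\d+) (?P<kind>CE|UE) .*? on (?P<label>\S+)"
)
# Label layout as it appears in the sample: CPU#<n>Channel#<n>_DIMM#<n>
LABEL = re.compile(r"CPU#(?P<cpu>\d+)Channel#(?P<channel>\d+)_DIMM#(?P<dimm>\d+)")


def parse_edac_line(line):
    """Return a dict of fields from an EDAC error line, or None if it does not match."""
    match = EDAC_LINE.search(line)
    if match is None:
        return None
    result = {
        "mc": int(match["mc"]),
        "count": int(match["count"]),
        "kind": match["kind"],  # "CE" = corrected, "UE" = uncorrected
        "label": match["label"],
    }
    location = LABEL.fullmatch(match["label"])
    if location is not None:
        # Numeric location only; translating it to a physical slot name
        # requires the motherboard's slot map.
        result.update({name: int(value) for name, value in location.groupdict().items()})
    return result


if __name__ == "__main__":
    sample = "kernel: EDAC MC1: 16107 CE error on CPU#1Channel#2_DIMM#1 (channel:2slot:1"
    print(parse_edac_line(sample))
```

Keeping the label parsing separate from the line parsing means a line with an unfamiliar label still yields the controller number and error count.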