Bug 202005
| Summary: | Ryzen 5 PRO 2500U CPU errors reported by MCE on boot and every 311 seconds | | |
|---|---|---|---|
| Product: | Platform Specific/Hardware | Reporter: | Rafał Miłecki (zajec5) |
| Component: | x86-64 | Assignee: | platform_x86_64 (platform_x86_64) |
| Status: | NEEDINFO --- | | |
| Severity: | low | CC: | amit.prakash.ambasta, bp, clemej |
| Priority: | P1 | | |
| Hardware: | x86-64 | | |
| OS: | Linux | | |
| Kernel Version: | 4.20-rc6 | Subsystem: | |
| Regression: | No | Bisected commit-id: | |
| Attachments: | config, dmesg | | |
Description
Rafał Miłecki
2018-12-16 20:40:11 UTC
Created attachment 280041 [details]
config
Created attachment 280043 [details]
dmesg

I'm attaching a dmesg with many CPU errors logged. It's kernel 4.20-rc6 plus a simple diff from [0] (it switches MCE to core_initcall() and core_initcall_sync()). I compiled that kernel using the attached config.

[0] Re: [PATCH] x86/mce/AMD: Make sure banks were initialized before accessing them
    https://marc.info/?l=linux-edac&m=154348063802065&w=2

For anyone interested: this issue is expected to get solved, as there are people working on it. Please check the following threads for the source.

[0] Re: [PATCH] x86/mce/AMD: Make sure banks were initialized before accessing them
    https://marc.info/?l=linux-edac&m=154357559728875&w=2
[1] Re: Problem with late AMD microcode reload/feedback
    https://marc.info/?l=linux-kernel&m=154495709911678&w=2

This is the last info I have to share for now. A hint for solving or narrowing down this problem may be hidden in bug 201213.

First of all, Linus's tree kernels didn't boot on HP EliteBooks 7x5 G5 starting from 4.10 up to 4.20-rc4 (see bug 201291 for details). Those kernels simply couldn't handle MCE errors happening so early.

If you take a look at bug 201213, however, Amit came up with a kernel config that allowed him to boot kernel 4.18.13. I see two explanations for this:
1) That specific config stopped the CPU errors
2) That specific config made the CPU errors appear (or get detected) later

I tried reproducing that "success" with kernel 4.18.13 and the provided config but didn't manage to.

It's possible that the above half-success is related to CONFIG_AMDGPU. That would also match bug 201727, which was about different "Hardware Error"s introduced by an amdgpu change.

I can replicate this issue on 4.19.8

Dec 17 13:56:24 antizen kernel: mce: [Hardware Error]: Machine check events logged
Dec 17 13:56:24 antizen kernel: [Hardware Error]: Corrected error, no action required.
Dec 17 13:56:24 antizen kernel: [Hardware Error]: CPU:5 (17:11:0) MC1_STATUS[Over|CE|MiscV|-|-|-|-|SyndV|-]: 0xd8200000000a0151
Dec 17 13:56:24 antizen kernel: [Hardware Error]: IPID: 0x000100b000000000, Syndrome: 0x000000004a000000
Dec 17 13:56:24 antizen kernel: [Hardware Error]: Instruction Fetch Unit Extended Error Code: 10
Dec 17 13:56:24 antizen kernel: [Hardware Error]: Instruction Fetch Unit Error: L1 BTB multi-match error.
Dec 17 13:56:24 antizen kernel: [Hardware Error]: cache level: L1, tx: INSN, mem-tx: IRD
Dec 17 13:56:24 antizen kernel: mce: [Hardware Error]: Machine check events logged
Dec 17 13:56:24 antizen kernel: [Hardware Error]: Corrected error, no action required.
Dec 17 13:56:24 antizen kernel: [Hardware Error]: CPU:4 (17:11:0) MC1_STATUS[Over|CE|MiscV|-|-|-|-|SyndV|-]: 0xd8200000000a0151
Dec 17 13:56:24 antizen kernel: [Hardware Error]: IPID: 0x000100b000000000, Syndrome: 0x000000004a000000
Dec 17 13:56:24 antizen kernel: [Hardware Error]: Instruction Fetch Unit Extended Error Code: 10
Dec 17 13:56:24 antizen kernel: [Hardware Error]: Instruction Fetch Unit Error: L1 BTB multi-match error.
Dec 17 13:56:24 antizen kernel: [Hardware Error]: cache level: L1, tx: INSN, mem-tx: IRD
Dec 17 13:56:24 antizen kernel: [Hardware Error]: Corrected error, no action required.
Dec 17 13:56:24 antizen kernel: [Hardware Error]: CPU:3 (17:11:0) MC1_STATUS[Over|CE|MiscV|-|-|-|-|SyndV|-]: 0xd8200000000a0151
Dec 17 13:56:24 antizen kernel: [Hardware Error]: IPID: 0x000100b000000000, Syndrome: 0x000000004a000000
Dec 17 13:56:24 antizen kernel: [Hardware Error]: Instruction Fetch Unit Extended Error Code: 10
Dec 17 13:56:24 antizen kernel: [Hardware Error]: Instruction Fetch Unit Error: L1 BTB multi-match error.
Dec 17 13:56:24 antizen kernel: [Hardware Error]: cache level: L1, tx: INSN, mem-tx: IRD
Dec 17 13:56:24 antizen kernel: [Hardware Error]: Corrected error, no action required.
Dec 17 13:56:24 antizen kernel: [Hardware Error]: CPU:0 (17:11:0) MC1_STATUS[Over|CE|MiscV|-|-|-|-|SyndV|-]: 0xd8200000000a0151
Dec 17 13:56:24 antizen kernel: [Hardware Error]: IPID: 0x000100b000000000, Syndrome: 0x000000004a000000
Dec 17 13:56:24 antizen kernel: [Hardware Error]: Instruction Fetch Unit Extended Error Code: 10
Dec 17 13:56:24 antizen kernel: [Hardware Error]: Instruction Fetch Unit Error: L1 BTB multi-match error.
Dec 17 13:56:24 antizen kernel: [Hardware Error]: cache level: L1, tx: INSN, mem-tx: IRD

Rafał, are we done here? We did this AFAIR:

71a84402b93e ("x86/MCE/AMD: Don't report L1 BTB MCA errors on some family 17h models")

Thx.
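
For anyone who wants to cross-check the decoded fields by hand, the MC1_STATUS value from the log above (0xd8200000000a0151) can be picked apart with a few shifts and masks. The following is only a rough sketch: the bit positions and the LL/TT/RRRR name tables are assumptions taken from my reading of the kernel's MCA definitions and EDAC decoder, and `decode_status` is a hypothetical helper name, not something that exists in the kernel.

```python
# Sketch of a manual MCA status decode. Bit positions below are assumed
# from the Linux kernel's MCA status definitions, not from a datasheet.
STATUS_BITS = {
    "Val": 63, "Over": 62, "UC": 61, "En": 60, "MiscV": 59,
    "AddrV": 58, "PCC": 57, "SyndV": 53,
    "Deferred": 44, "Poison": 43, "Scrub": 40,
}

# Name tables for the low 16-bit error code (assumed to match the
# strings printed by the kernel's EDAC decoder).
LL = ["RESV", "L1", "L2", "L3/GEN"]
TT = ["INSN", "DATA", "GEN", "RESV"]
RRRR = ["GEN", "RD", "WR", "DRD", "DWR", "IRD", "PRF", "EV", "SNP"]


def decode_status(status):
    """Hypothetical helper: split a 64-bit MCA status value into fields."""
    flags = {name: bool((status >> bit) & 1) for name, bit in STATUS_BITS.items()}
    error_code = status & 0xFFFF
    result = {
        "flags": flags,
        # Corrected error: neither uncorrected nor deferred.
        "corrected": not flags["UC"] and not flags["Deferred"],
        # Extended error code, assumed to live in bits 21:16.
        "xec": (status >> 16) & 0x3F,
        "error_code": error_code,
    }
    # Memory-error format: 0000 0001 RRRR TTLL
    if (error_code & 0xFF00) == 0x0100:
        r4 = (error_code >> 4) & 0xF
        result["memory"] = {
            "cache_level": LL[error_code & 0x3],
            "tx": TT[(error_code >> 2) & 0x3],
            "mem_tx": RRRR[r4] if r4 < len(RRRR) else "RESV",
        }
    return result


if __name__ == "__main__":
    decoded = decode_status(0xD8200000000A0151)
    print(decoded["xec"], decoded["memory"])
```

With the value from the log this should agree with what the kernel printed: extended error code 10, cache level L1, tx INSN, mem-tx IRD, with Over, MiscV and SyndV set and UC clear.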