Some notebooks suffer from CPU errors like:

[ 1.070317] mce: [Hardware Error]: Machine check events logged
[ 1.070324] [Hardware Error]: Corrected error, no action required.
[ 1.070330] [Hardware Error]: CPU:0 (17:11:0) MC1_STATUS[Over|CE|MiscV|-|-|-|-|SyndV|-]: 0xd8200000000a0151
[ 1.070337] [Hardware Error]: IPID: 0x000100b000000000, Syndrome: 0x000000004a000000
[ 1.070343] [Hardware Error]: Instruction Fetch Unit Extended Error Code: 10
[ 1.070346] [Hardware Error]: Instruction Fetch Unit Error: L1 BTB multi-match error.
[ 1.070351] [Hardware Error]: cache level: L1, tx: INSN, mem-tx: IRD

This has been reported and is perfectly reproducible on HP EliteBook 735 G5 and 745 G5 with Ryzen 5 PRO 2500U using BIOS 1.03.01 (CPU microcode 0x0810100b). It happens with the latest kernel, 4.20-rc6, as well as with really old ones like 4.9.

There is no newer microcode for this CPU in microcode_amd_fam17h.bin from linux-firmware.git at the time of writing. The CPU's signature (family/model/stepping) is 0x810f10.

This is definitely not a case of a single faulty unit. It has been reported by 3 different owners (including me), see e.g. bug 201213 and bug 201291.

This doesn't happen on all 2500U CPUs. I also own a MateBook D with Ryzen 5 2500U using BIOS 1.12 (CPU microcode 0x08101007) and I've never seen this error there.
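As a sanity check, the signature 0x810f10 and the "(17:11:0)" triple in the log should describe the same CPU. Below is a minimal Python sketch of the usual CPUID leaf 1 EAX layout as used on AMD parts (extended family and model are only applied when the base family is 0xF). The function name decode_signature is made up for this illustration.

```python
def decode_signature(eax):
    """Split a CPUID leaf 1 EAX signature into (family, model, stepping).

    Assumes the AMD convention: the extended family/model fields only
    count when the base family field is 0xF.
    """
    stepping = eax & 0xF
    base_model = (eax >> 4) & 0xF
    base_family = (eax >> 8) & 0xF
    ext_model = (eax >> 16) & 0xF
    ext_family = (eax >> 20) & 0xFF

    if base_family == 0xF:
        family = base_family + ext_family
        model = (ext_model << 4) | base_model
    else:
        family = base_family
        model = base_model
    return family, model, stepping


family, model, stepping = decode_signature(0x810F10)
# Same family:model:stepping notation as in the mce log line
print(f"{family:x}:{model:x}:{stepping:x}")  # prints 17:11:0
```

So the reported signature is consistent with family 17h, model 11h, stepping 0.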
Created attachment 280041: config
Created attachment 280043: dmesg

I'm attaching a dmesg with many CPU errors logged. It's from kernel 4.20-rc6 plus a simple diff from [0] (it switches MCE to core_initcall() and core_initcall_sync()). I compiled that kernel using the attached config.

[0] Re: [PATCH] x86/mce/AMD: Make sure banks were initialized before accessing them
    https://marc.info/?l=linux-edac&m=154348063802065&w=2
For anyone interested: this issue is expected to get solved, as there are people working on it. For the source, please check:

[0] Re: [PATCH] x86/mce/AMD: Make sure banks were initialized before accessing them
    https://marc.info/?l=linux-edac&m=154357559728875&w=2
[1] Re: Problem with late AMD microcode reload/feedback
    https://marc.info/?l=linux-kernel&m=154495709911678&w=2
The last info I have to share for now. Some hint for solving or narrowing down this problem may be hidden in bug 201213.

First of all, Linus's tree kernels didn't boot on HP EliteBooks 7x5 G5 starting from 4.10 up to 4.20-rc4 (see bug 201291 for details). Those kernels simply couldn't handle MCE errors happening so early.

If you take a look at bug 201213, however, Amit came up with a kernel config that allowed him to boot kernel 4.18.13. I see two explanations for this:
1) That specific config stopped the CPU errors
2) That specific config made the CPU errors appear (or get detected) later

I've tried reproducing that "success" with kernel 4.18.13 and the provided config but didn't manage to.

It's possible that the above half-success is related to CONFIG_AMDGPU. That would also match bug 201727, which was about different "Hardware Error"s introduced by an amdgpu change.
I can replicate this issue on 4.19.8

Dec 17 13:56:24 antizen kernel: mce: [Hardware Error]: Machine check events logged
Dec 17 13:56:24 antizen kernel: [Hardware Error]: Corrected error, no action required.
Dec 17 13:56:24 antizen kernel: [Hardware Error]: CPU:5 (17:11:0) MC1_STATUS[Over|CE|MiscV|-|-|-|-|SyndV|-]: 0xd8200000000a0151
Dec 17 13:56:24 antizen kernel: [Hardware Error]: IPID: 0x000100b000000000, Syndrome: 0x000000004a000000
Dec 17 13:56:24 antizen kernel: [Hardware Error]: Instruction Fetch Unit Extended Error Code: 10
Dec 17 13:56:24 antizen kernel: [Hardware Error]: Instruction Fetch Unit Error: L1 BTB multi-match error.
Dec 17 13:56:24 antizen kernel: [Hardware Error]: cache level: L1, tx: INSN, mem-tx: IRD
Dec 17 13:56:24 antizen kernel: mce: [Hardware Error]: Machine check events logged
Dec 17 13:56:24 antizen kernel: [Hardware Error]: Corrected error, no action required.
Dec 17 13:56:24 antizen kernel: [Hardware Error]: CPU:4 (17:11:0) MC1_STATUS[Over|CE|MiscV|-|-|-|-|SyndV|-]: 0xd8200000000a0151
Dec 17 13:56:24 antizen kernel: [Hardware Error]: IPID: 0x000100b000000000, Syndrome: 0x000000004a000000
Dec 17 13:56:24 antizen kernel: [Hardware Error]: Instruction Fetch Unit Extended Error Code: 10
Dec 17 13:56:24 antizen kernel: [Hardware Error]: Instruction Fetch Unit Error: L1 BTB multi-match error.
Dec 17 13:56:24 antizen kernel: [Hardware Error]: cache level: L1, tx: INSN, mem-tx: IRD
Dec 17 13:56:24 antizen kernel: [Hardware Error]: Corrected error, no action required.
Dec 17 13:56:24 antizen kernel: [Hardware Error]: CPU:3 (17:11:0) MC1_STATUS[Over|CE|MiscV|-|-|-|-|SyndV|-]: 0xd8200000000a0151
Dec 17 13:56:24 antizen kernel: [Hardware Error]: IPID: 0x000100b000000000, Syndrome: 0x000000004a000000
Dec 17 13:56:24 antizen kernel: [Hardware Error]: Instruction Fetch Unit Extended Error Code: 10
Dec 17 13:56:24 antizen kernel: [Hardware Error]: Instruction Fetch Unit Error: L1 BTB multi-match error.
Dec 17 13:56:24 antizen kernel: [Hardware Error]: cache level: L1, tx: INSN, mem-tx: IRD
Dec 17 13:56:24 antizen kernel: [Hardware Error]: Corrected error, no action required.
Dec 17 13:56:24 antizen kernel: [Hardware Error]: CPU:0 (17:11:0) MC1_STATUS[Over|CE|MiscV|-|-|-|-|SyndV|-]: 0xd8200000000a0151
Dec 17 13:56:24 antizen kernel: [Hardware Error]: IPID: 0x000100b000000000, Syndrome: 0x000000004a000000
Dec 17 13:56:24 antizen kernel: [Hardware Error]: Instruction Fetch Unit Extended Error Code: 10
Dec 17 13:56:24 antizen kernel: [Hardware Error]: Instruction Fetch Unit Error: L1 BTB multi-match error.
Dec 17 13:56:24 antizen kernel: [Hardware Error]: cache level: L1, tx: INSN, mem-tx: IRD
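Every report in this thread carries the same MC1_STATUS value, 0xd8200000000a0151. For readers who want to check the decoded lines by hand, here is a rough Python sketch. The flag bit positions follow the MCI_STATUS_* definitions in arch/x86/include/asm/mce.h. Taking the extended error code as a 6-bit field at bits 21:16 and reading the low 16 bits as a "0000 0001 RRRR TTLL" memory error code are assumptions of this sketch, although both agree with what the log prints. The function and table names are invented for the example, and it does not try to reproduce the kernel's own flag print order.

```python
# Flag name -> bit position in MCA_STATUS (per MCI_STATUS_* in asm/mce.h)
STATUS_FLAGS = {
    "Val": 63, "Over": 62, "UC": 61, "En": 60, "MiscV": 59,
    "AddrV": 58, "PCC": 57, "TCC": 55, "SyndV": 53,
    "Deferred": 44, "Poison": 43,
}

# Field names for a memory-type error code (0000 0001 RRRR TTLL)
RRRR = ["GEN", "RD", "WR", "DRD", "DWR", "IRD", "PRF", "EV", "SNP"]
TT = ["INSN", "DATA", "GEN", "RESV"]
LL = ["RESV", "L1", "L2", "L3/GEN"]


def decode_status(status):
    """Return the set flags, extended error code and memory error fields."""
    flags = [name for name, bit in STATUS_FLAGS.items() if (status >> bit) & 1]
    xec = (status >> 16) & 0x3F   # assumed 6-bit extended error code
    ec = status & 0xFFFF          # generic error code
    info = {"flags": flags, "xec": xec, "error_code": ec}

    if (ec & 0xFF00) == 0x0100:   # memory error pattern
        r4 = (ec >> 4) & 0xF
        info["mem_tx"] = RRRR[r4] if r4 < len(RRRR) else "-"
        info["tx"] = TT[(ec >> 2) & 0x3]
        info["level"] = LL[ec & 0x3]
    return info


info = decode_status(0xD8200000000A0151)
print(info["flags"])  # ['Val', 'Over', 'En', 'MiscV', 'SyndV']
print(info["xec"])    # 10
print(info["level"], info["tx"], info["mem_tx"])  # L1 INSN IRD
```

UC is clear, which matches "Corrected error" (CE) in the log, and Over means at least one earlier error was overwritten before it could be read.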
Rafał, are we done here? We did this AFAIR:

71a84402b93e ("x86/MCE/AMD: Don't report L1 BTB MCA errors on some family 17h models")

Thx.
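For illustration only, the kind of check such a filter needs can be sketched in Python. This is not the code from that commit. The commit title only says "some family 17h models", so the model range used as a default below is an assumption, as is identifying the Instruction Fetch unit by bank number 1 (taken from "MC1_STATUS" in the logs; the kernel may well identify it differently). The name is_l1_btb_error is hypothetical.

```python
def is_l1_btb_error(family, model, bank, status,
                    models=range(0x10, 0x30),  # ASSUMED model range
                    if_bank=1):                # ASSUMED Instruction Fetch bank
    """Return True if an MCE looks like the L1 BTB multi-match report above."""
    xec = (status >> 16) & 0x3F  # extended error code, 10 in the logs
    return (family == 0x17
            and model in models
            and bank == if_bank
            and xec == 10)


# The error from this thread: family 17h, model 11h, bank 1
print(is_l1_btb_error(0x17, 0x11, 1, 0xD8200000000A0151))  # True
# Same status on a different bank would not be filtered
print(is_l1_btb_error(0x17, 0x11, 0, 0xD8200000000A0151))  # False
```

Keeping the match this narrow (family, model, bank and extended error code together) would leave other corrected errors reported as usual.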