Bug 202005 - Ryzen 5 PRO 2500U CPU errors reported by MCE on boot and every 311 seconds
Status: NEW
Alias: None
Product: Platform Specific/Hardware
Classification: Unclassified
Component: x86-64
Hardware: x86-64 Linux
Importance: P1 low
Assignee: platform_x86_64@kernel-bugs.osdl.org
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2018-12-16 20:40 UTC by Rafał Miłecki
Modified: 2018-12-17 08:27 UTC (History)
CC List: 2 users

See Also:
Kernel Version: 4.20-rc6
Tree: Mainline
Regression: No


Attachments
config (202.50 KB, text/plain)
2018-12-16 20:55 UTC, Rafał Miłecki
dmesg (174.84 KB, text/plain)
2018-12-16 20:59 UTC, Rafał Miłecki

Description Rafał Miłecki 2018-12-16 20:40:11 UTC
Some notebooks suffer from CPU errors like:
[    1.070317] mce: [Hardware Error]: Machine check events logged
[    1.070324] [Hardware Error]: Corrected error, no action required.
[    1.070330] [Hardware Error]: CPU:0 (17:11:0) MC1_STATUS[Over|CE|MiscV|-|-|-|-|SyndV|-]: 0xd8200000000a0151
[    1.070337] [Hardware Error]: IPID: 0x000100b000000000, Syndrome: 0x000000004a000000
[    1.070343] [Hardware Error]: Instruction Fetch Unit Extended Error Code: 10
[    1.070346] [Hardware Error]: Instruction Fetch Unit Error: L1 BTB multi-match error.
[    1.070351] [Hardware Error]: cache level: L1, tx: INSN, mem-tx: IRD
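The raw register values in the log above can be decoded by hand. The following is a minimal sketch, assuming the MCI_STATUS_* bit positions and the memory-error code layout used by the Linux kernel's MCE/EDAC code; the helper names (decode_status, decode_ipid) are hypothetical and only illustrate how the printed text maps onto 0xd8200000000a0151 and 0x000100b000000000.

```python
# Bit positions follow the MCI_STATUS_* definitions in the Linux kernel's
# arch/x86/include/asm/mce.h. The helper names are hypothetical.
STATUS_BITS = {
    "Val": 63, "Over": 62, "UC": 61, "En": 60, "MiscV": 59, "AddrV": 58,
    "PCC": 57, "TCC": 55, "SyndV": 53, "Deferred": 44, "Poison": 43,
}
# String tables as used by the kernel's AMD MCE decoder for memory errors.
RRRR = ["GEN", "RD", "WR", "DRD", "DWR", "IRD", "PRF", "EV", "SNP"]
TT = ["INSN", "DATA", "GEN", "RESV"]
LL = ["RESV", "L1", "L2", "L3/GEN"]


def decode_status(status):
    """Split an MCA status value into flags, extended code and error code."""
    flags = [name for name, bit in STATUS_BITS.items() if (status >> bit) & 1]
    if "UC" in flags:
        severity = "UE"
    elif "Deferred" in flags:
        severity = "-"
    else:
        severity = "CE"
    result = {
        "flags": flags,
        "severity": severity,
        # Extended error code: bits 21:16 (6-bit mask, as on SMCA systems).
        "xec": (status >> 16) & 0x3F,
    }
    code = status & 0xFFFF
    if (code & 0xFF00) == 0x0100:  # memory error: 0000 0001 RRRR TTLL
        r4 = (code >> 4) & 0xF
        result["mem_tx"] = RRRR[r4] if r4 < len(RRRR) else "?"
        result["tx"] = TT[(code >> 2) & 0x3]
        result["level"] = LL[code & 0x3]
    return result


def decode_ipid(ipid):
    """Extract the hardware ID (bits 43:32) and MCA type (bits 63:48)."""
    return {"hwid": (ipid >> 32) & 0xFFF, "mcatype": (ipid >> 48) & 0xFFFF}


if __name__ == "__main__":
    print(decode_status(0xD8200000000A0151))
    print(decode_ipid(0x000100B000000000))
```

Under these assumptions the status value decodes to a corrected error with the overflow bit set, extended error code 10, and an L1 instruction-fetch memory transaction, which agrees with the text the kernel printed. The IPID yields hardware ID 0xb0 with MCA type 1, the pair the kernel associates with the Instruction Fetch unit.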

It has been reported and is perfectly reproducible on HP EliteBook 735 G5 and 745 G5 with Ryzen 5 PRO 2500U using BIOS 1.03.01 (0x0810100b CPU microcode).

It happens with the latest kernel 4.20-rc6 as well as with some really old ones like 4.9.

At the time of writing, microcode_amd_fam17h.bin from linux-firmware.git contains no newer microcode for this CPU. The CPU's signature (family/model/stepping) is 0x810f10.
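For reference, the signature 0x810f10 corresponds to the "(17:11:0)" family:model:stepping triple shown in the error log. A minimal sketch of that decoding, assuming the standard CPUID leaf 1 EAX layout and the AMD rule that the extended fields apply when the base family is 0xF (decode_signature is a hypothetical helper name):

```python
def decode_signature(eax):
    """Return (family, model, stepping) from a CPUID leaf 1 EAX value."""
    stepping = eax & 0xF
    base_model = (eax >> 4) & 0xF
    base_family = (eax >> 8) & 0xF
    ext_model = (eax >> 16) & 0xF
    ext_family = (eax >> 20) & 0xFF
    family, model = base_family, base_model
    if base_family == 0xF:
        # Extended fields only take effect when the base family is 0xF.
        family += ext_family
        model |= ext_model << 4
    return family, model, stepping


if __name__ == "__main__":
    family, model, stepping = decode_signature(0x810F10)
    print(f"{family:x}:{model:x}:{stepping:x}")
```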

This is definitely not a faulty unit. It has been reported by 3 different owners (including me), see e.g. bug 201213 and bug 201291.

This doesn't happen on all 2500U CPUs. I also own a MateBook D with a Ryzen 5 2500U using BIOS 1.12 (0x08101007 CPU microcode) and I've never seen this error there.
Comment 1 Rafał Miłecki 2018-12-16 20:55:44 UTC
Created attachment 280041
config
Comment 2 Rafał Miłecki 2018-12-16 20:59:51 UTC
Created attachment 280043
dmesg

I'm attaching a dmesg with many CPU errors logged. It comes from kernel 4.20-rc6 plus a simple diff from
[0] Re: [PATCH] x86/mce/AMD: Make sure banks were initialized before accessing them
(the diff switches MCE to core_initcall() and core_initcall_sync()).

I've compiled that kernel using the attached config.

[0] https://marc.info/?l=linux-edac&m=154348063802065&w=2
Comment 3 Rafał Miłecki 2018-12-16 21:43:43 UTC
For anyone interested: this issue is expected to get solved, as there are people working on it.

Please check
[0] Re: [PATCH] x86/mce/AMD: Make sure banks were initialized before accessing them
[1] Re: Problem with late AMD microcode reload/feedback
for the source.

[0] https://marc.info/?l=linux-edac&m=154357559728875&w=2
[1] https://marc.info/?l=linux-kernel&m=154495709911678&w=2
Comment 4 Rafał Miłecki 2018-12-16 22:03:52 UTC
The last info I have to share for now.

A hint for solving or narrowing down this problem may be hidden in bug 201213.

First of all, kernels from Linus's tree didn't boot on HP EliteBooks 7x5 G5 from 4.10 up to 4.20-rc4 (see bug 201291 for details). Those kernels simply couldn't handle MCE errors happening so early.

However, if you take a look at bug 201213, Amit came up with a kernel config that allowed him to boot kernel 4.18.13. I see two possible explanations for this:
1) That specific config stopped CPU errors
2) That specific config made CPU errors appear (or get detected) later

I've tried reproducing that "success" with kernel 4.18.13 and the provided config, but didn't manage to. It's possible that the above half-success is related to CONFIG_AMDGPU. That would also match bug 201727, which was about different "Hardware Error"s introduced by an amdgpu change.
Comment 5 Amit Prakash Ambasta 2018-12-17 08:27:39 UTC
I can reproduce this issue on 4.19.8:


Dec 17 13:56:24 antizen kernel: mce: [Hardware Error]: Machine check events logged
Dec 17 13:56:24 antizen kernel: [Hardware Error]: Corrected error, no action required.
Dec 17 13:56:24 antizen kernel: [Hardware Error]: CPU:5 (17:11:0) MC1_STATUS[Over|CE|MiscV|-|-|-|-|SyndV|-]: 0xd8200000000a0151
Dec 17 13:56:24 antizen kernel: [Hardware Error]: IPID: 0x000100b000000000, Syndrome: 0x000000004a000000
Dec 17 13:56:24 antizen kernel: [Hardware Error]: Instruction Fetch Unit Extended Error Code: 10
Dec 17 13:56:24 antizen kernel: [Hardware Error]: Instruction Fetch Unit Error: L1 BTB multi-match error.
Dec 17 13:56:24 antizen kernel: [Hardware Error]: cache level: L1, tx: INSN, mem-tx: IRD
Dec 17 13:56:24 antizen kernel: mce: [Hardware Error]: Machine check events logged
Dec 17 13:56:24 antizen kernel: [Hardware Error]: Corrected error, no action required.
Dec 17 13:56:24 antizen kernel: [Hardware Error]: CPU:4 (17:11:0) MC1_STATUS[Over|CE|MiscV|-|-|-|-|SyndV|-]: 0xd8200000000a0151
Dec 17 13:56:24 antizen kernel: [Hardware Error]: IPID: 0x000100b000000000, Syndrome: 0x000000004a000000
Dec 17 13:56:24 antizen kernel: [Hardware Error]: Instruction Fetch Unit Extended Error Code: 10
Dec 17 13:56:24 antizen kernel: [Hardware Error]: Instruction Fetch Unit Error: L1 BTB multi-match error.
Dec 17 13:56:24 antizen kernel: [Hardware Error]: cache level: L1, tx: INSN, mem-tx: IRD
Dec 17 13:56:24 antizen kernel: [Hardware Error]: Corrected error, no action required.
Dec 17 13:56:24 antizen kernel: [Hardware Error]: CPU:3 (17:11:0) MC1_STATUS[Over|CE|MiscV|-|-|-|-|SyndV|-]: 0xd8200000000a0151
Dec 17 13:56:24 antizen kernel: [Hardware Error]: IPID: 0x000100b000000000, Syndrome: 0x000000004a000000
Dec 17 13:56:24 antizen kernel: [Hardware Error]: Instruction Fetch Unit Extended Error Code: 10
Dec 17 13:56:24 antizen kernel: [Hardware Error]: Instruction Fetch Unit Error: L1 BTB multi-match error.
Dec 17 13:56:24 antizen kernel: [Hardware Error]: cache level: L1, tx: INSN, mem-tx: IRD
Dec 17 13:56:24 antizen kernel: [Hardware Error]: Corrected error, no action required.
Dec 17 13:56:24 antizen kernel: [Hardware Error]: CPU:0 (17:11:0) MC1_STATUS[Over|CE|MiscV|-|-|-|-|SyndV|-]: 0xd8200000000a0151
Dec 17 13:56:24 antizen kernel: [Hardware Error]: IPID: 0x000100b000000000, Syndrome: 0x000000004a000000
Dec 17 13:56:24 antizen kernel: [Hardware Error]: Instruction Fetch Unit Extended Error Code: 10
Dec 17 13:56:24 antizen kernel: [Hardware Error]: Instruction Fetch Unit Error: L1 BTB multi-match error.
Dec 17 13:56:24 antizen kernel: [Hardware Error]: cache level: L1, tx: INSN, mem-tx: IRD
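When many such records are logged, it can help to condense them per CPU. The following is a minimal sketch, assuming the MCn_STATUS line format shown in the log above; the names MCE_LINE and summarize are hypothetical, and the log text is expected to come from dmesg or journal output saved as a string.

```python
import re
from collections import Counter

# Matches lines such as:
#   [Hardware Error]: CPU:5 (17:11:0) MC1_STATUS[Over|CE|...]: 0xd8200000000a0151
MCE_LINE = re.compile(
    r"CPU:(\d+) \([0-9a-fA-F]+:[0-9a-fA-F]+:[0-9a-fA-F]+\) "
    r"MC(\d+)_STATUS\[[^\]]*\]: (0x[0-9a-fA-F]+)"
)


def summarize(log_text):
    """Count occurrences of each (cpu, bank, status) triple in the log text."""
    counts = Counter()
    for match in MCE_LINE.finditer(log_text):
        cpu = int(match.group(1))
        bank = int(match.group(2))
        status = int(match.group(3), 16)
        counts[(cpu, bank, status)] += 1
    return counts


if __name__ == "__main__":
    sample = (
        "kernel: [Hardware Error]: CPU:5 (17:11:0) "
        "MC1_STATUS[Over|CE|MiscV|-|-|-|-|SyndV|-]: 0xd8200000000a0151\n"
        "kernel: [Hardware Error]: CPU:4 (17:11:0) "
        "MC1_STATUS[Over|CE|MiscV|-|-|-|-|SyndV|-]: 0xd8200000000a0151\n"
    )
    for (cpu, bank, status), n in sorted(summarize(sample).items()):
        print(f"CPU {cpu} bank {bank} status {status:#018x}: {n}")
```

Applied to the log above, this would show the same bank 1 status value on CPUs 0, 3, 4 and 5.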
