Bug 212087 - Random reboots with 5.11 and earlier 5.10.x versions, 5.10.18+ stable (Ryzen 5000)
Status: RESOLVED INVALID
Alias: None
Product: Platform Specific/Hardware
Classification: Unclassified
Component: x86-64
Hardware: Other Linux
Importance: P1 high
Assignee: platform_x86_64@kernel-bugs.osdl.org
 
Reported: 2021-03-06 15:42 UTC by alan.loewe
Modified: 2021-04-17 22:01 UTC

See Also:
Kernel Version: 5.10.13-5.10.16(17?), 5.11
Tree: Mainline
Regression: No



Description alan.loewe 2021-03-06 15:42:15 UTC
Like many others with a new Ryzen 5000, I had issues with random reboots. There certainly were BIOS issues, but with the latest BIOS version dating from Jan 29, and after leaving the BIOS settings at almost their defaults* for a while, I'm sure there are also kernel issues.

I used kernels as released by Arch Linux. After eventually having no reboots for a week after a BIOS update (pre 3204 beta), occasional reboots started occurring again with 5.10.13 and became unbearable (hourly) with the release of 5.11.1 (upgrading from 5.10.16). Then I installed linux-lts at version 5.10.18**, which has been stable since and still is at 5.10.20. In between I tried 5.11.2 when it was released and got three reboots that day.

** Thus I didn't use 5.10.17, which has relevant changes.

The unstable versions sometimes logged hardware errors:

Feb 20 19:11:02 xxxxxx kernel: mce: [Hardware Error]: Machine check events logged
Feb 20 19:11:02 xxxxxx kernel: mce: [Hardware Error]: CPU 15: Machine Check: 0 Bank 1: bc800800060c0859
Feb 20 19:11:02 xxxxxx kernel: mce: [Hardware Error]: TSC 0 ADDR 3b1f3a280 MISC d012000000000000 IPID 100b000000000 
Feb 20 19:11:02 xxxxxx kernel: mce: [Hardware Error]: PROCESSOR 2:a20f10 TIME 1613844658 SOCKET 0 APIC 1e microcode a201009

Feb 24 06:31:18 xxxxxx kernel: mce: [Hardware Error]: Machine check events logged
Feb 24 06:31:18 xxxxxx kernel: mce: [Hardware Error]: CPU 30: Machine Check: 0 Bank 1: fc800800060c0859
Feb 24 06:31:18 xxxxxx kernel: mce: [Hardware Error]: TSC 0 ADDR e1775b600 MISC d012000000000000 IPID 100b000000000 
Feb 24 06:31:18 xxxxxx kernel: mce: [Hardware Error]: PROCESSOR 2:a20f10 TIME 1614144675 SOCKET 0 APIC 1d microcode a201009
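
For anyone comparing their own logs against the ones above, here is a minimal Python sketch that pulls the CPU number, bank, status word and TIME field out of such lines. The regular expressions and the `parse_mce` helper are my own assumptions, fitted only to the excerpts in this report; other kernels or log setups may format these lines differently.

```python
import re
from datetime import datetime, timezone

# Patterns fitted to the excerpts in this report only (an assumption,
# not a stable kernel log format).
CPU_RE = re.compile(r"CPU (\d+): Machine Check: \d+ Bank (\d+): ([0-9a-f]{16})")
TIME_RE = re.compile(r"TIME (\d+) ")


def parse_mce(lines):
    """Collect CPU, bank, status word and UTC time from one MCE record."""
    record = {}
    for line in lines:
        m = CPU_RE.search(line)
        if m:
            record["cpu"] = int(m.group(1))
            record["bank"] = int(m.group(2))
            record["status"] = int(m.group(3), 16)
        m = TIME_RE.search(line)
        if m:
            # TIME is a Unix timestamp; interpret it as UTC.
            record["time_utc"] = datetime.fromtimestamp(
                int(m.group(1)), tz=timezone.utc
            )
    return record


sample = [
    "Feb 20 19:11:02 xxxxxx kernel: mce: [Hardware Error]: "
    "CPU 15: Machine Check: 0 Bank 1: bc800800060c0859",
    "Feb 20 19:11:02 xxxxxx kernel: mce: [Hardware Error]: "
    "PROCESSOR 2:a20f10 TIME 1613844658 SOCKET 0 APIC 1e microcode a201009",
]
rec = parse_mce(sample)
print(rec["cpu"], rec["bank"], hex(rec["status"]), rec["time_utc"].isoformat())
```

The TIME field can differ from the syslog timestamp on the same line, so it may be the more useful value when correlating a record with the reboot itself.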


OS: Arch Linux
CPU: AMD Ryzen 5950X
Mainboard: ASUS Crosshair VIII Hero (Wi-Fi)
RAM: G.Skill F4-3600C16D-64GTZN
GPU: Nvidia GTX 1070
PSU: be quiet! Straight Power 11 Platinum 750W

* Non-default BIOS settings:
Power supply idle control: typical current
Memory current capability: 110%
RAM set to 3200 MHz, 16-18-18-38, 1.35V (below its specs)
SVM enabled
Comment 1 alan.loewe 2021-03-06 16:34:11 UTC
I should mention that before downgrading to 5.10.18 I had the BIOS setting "Power supply idle control" at its default. The reasoning behind "Memory current capability" is that its help text says the system will be halted when the current is above the configured current. And the reported current actually is always about 0.015V above the configured current. I want to exclude that as a cause.

Now I will "Load optimized defaults" and activate the DOCP profile to see whether 5.10.20 is stable with that, too. Haven't really tried that yet.

In before "C states": no, makes no difference.
Comment 2 alan.loewe 2021-03-08 06:32:48 UTC
With default BIOS settings plus DOCP it rebooted after 27h uptime, a few minutes after beginning to play music.

So back to the settings mentioned above. I already tested over-volting as a mitigation by setting the curve optimizer to +1, and then verified by setting it to -1 (under-volting). It makes no difference, -1 is stable. But I had the impression that 5.11 crashed sooner with -1.

So I tested the limit. My CPU doesn't boot with -5, can't take load with -4, CAN compile the linux kernel and pass Cinebench with -3, but it's not stable in the long run. So -2 is probably the limit. That's rather poor.

With -3, 5.11 doesn't even boot, while 5.10 and Windows are stable enough for some benchmarks.

So I guess, 5.11 somehow drives the CPU harder, too hard, and whether it's stable depends on sample quality. I'll call that a regression.
Comment 3 alan.loewe 2021-03-08 08:10:59 UTC
Uhm, never mind, it has probably already been fixed in 5.11.3 by the cpufreq/schedutil stuff that's also in 5.10.17. I thought it had been in 5.11 from the beginning. Testing 5.11.4 right now, which I have conveniently built as a stress test without noticing. (:

I'll close this issue if it proves stable, which I assume.
Comment 4 alan.loewe 2021-03-08 11:17:58 UTC
Unfortunately not, 5.11.4 is slightly better, but not stable.

Mar 08 12:11:57 xxxxxx kernel: mce: [Hardware Error]: Machine check events logged
Mar 08 12:11:57 xxxxxx kernel: mce: [Hardware Error]: CPU 31: Machine Check: 0 Bank 1: bc800800060c0859
Mar 08 12:11:57 xxxxxx kernel: mce: [Hardware Error]: TSC 0 ADDR 6a1cce800 MISC d012000000000000 IPID 100b000000000 
Mar 08 12:11:57 xxxxxx kernel: mce: [Hardware Error]: PROCESSOR 2:a20f10 TIME 1615201914 SOCKET 0 APIC 1f microcode a201009
Comment 5 sean.mcauliffe 2021-03-15 14:30:16 UTC
I too am having this issue with 5.11.2 and similar specs (AMD 5950X/Crosshair VIII Dark Hero). Dual boot Windows and have not had any issues on that side.

Mar 15 02:36:11 xxxxxx kernel: mce: [Hardware Error]: CPU 30: Machine Check: 0 Bank 1: bc800800060c0859
Mar 15 02:36:11 xxxxxx kernel: mce: [Hardware Error]: TSC 0 ADDR 13e05c600 MISC d012000000000000 IPID 100b000000000 
Mar 15 02:36:11 xxxxxx kernel: mce: [Hardware Error]: PROCESSOR 2:a20f10 TIME 1615797369 SOCKET 0 APIC 1d microcode a201009

I'm attempting to downgrade to 5.10.19-1-MANJARO to see if it's more stable.
Comment 6 sean.mcauliffe 2021-03-17 17:10:37 UTC
To follow up, I did get an MCE on 5.10.19-1-MANJARO once I pushed the memory overclock to 3800 and an FCLK of 1900. Dropping this down to 3600/1800 respectively seems to be stable.

Mar 16 16:44:17 xxxxxx kernel: mce: [Hardware Error]: CPU 30: Machine Check: 0 Bank 1: bc800800060c0859
Mar 16 16:44:17 xxxxxx kernel: mce: [Hardware Error]: TSC 0 ADDR e9c27e840 MISC d0120ffe00000000 IPID 100b000000000 
Mar 16 16:44:17 xxxxxx kernel: mce: [Hardware Error]: PROCESSOR 2:a20f10 TIME 1615934655 SOCKET 0 APIC 1d microcode a201009
Comment 7 Chromer 2021-04-04 18:00:00 UTC
I have this issue on Haswell CPU with 5.11 series.  

Log from 5.11.11:

Apr  4 15:47:20 xxx kernel: mce: [Hardware Error]: CPU 0: Machine Check: 0 Bank 3: be00000000800400
Apr  4 15:47:20 xxx kernel: mce: [Hardware Error]: TSC 0 ADDR ffffffffc07e06cc MISC ffffffffc07e06cc 
Apr  4 15:47:20 xxx kernel: mce: [Hardware Error]: PROCESSOR 0:306c3 TIME 1617535033 SOCKET 0 APIC 0 microcode 28

Apr  4 15:47:20 xxx kernel: mce: [Hardware Error]: CPU 1: Machine Check: 0 Bank 3: be00000000800400
Apr  4 15:47:20 xxx kernel: mce: [Hardware Error]: TSC 0 ADDR ffffffffc0e4b269 MISC ffffffffc0e4b269 
Apr  4 15:47:20 xxx kernel: mce: [Hardware Error]: PROCESSOR 0:306c3 TIME 1617535033 SOCKET 0 APIC 2 microcode 28
Comment 8 Misha Nasledov 2021-04-08 00:58:10 UTC
I've had this same issue on similar hardware. Ryzen 5900X, ASRock X570 Taichi. I was running BIOS 4.00 but I have just updated to the beta BIOS 4.15 with the AGESA 1.2.0.1 Patch A update to see if it helps.

I've tried all the usual Ryzen stuff as well. Global C-state control disabled, Power supply control "Typical Current Idle", and even disabling C-states with ZenStates.py at boot.

Apr  6 21:22:50 titan kernel: mce: [Hardware Error]: Machine check events logged
Apr  6 21:22:50 titan kernel: mce: [Hardware Error]: CPU 16: Machine Check: 0 Bank 1: bc800800060c0859
Apr  6 21:22:50 titan kernel: mce: [Hardware Error]: TSC 0 ADDR 169d89280 MISC d012000000000000 IPID 100b000000000 
Apr  6 21:22:50 titan kernel: mce: [Hardware Error]: PROCESSOR 2:a20f10 TIME 1617769364 SOCKET 0 APIC 9 microcode a201009

$ ./run.py 1 bc800800060c0859 
Bank: Instruction Fetch Unit (IF)
Error: L2 Cache Response Poison Error. Error is the result of consuming poison data (L2RespPoison 0xc)
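
The decoder output above can be partly cross-checked by hand. Below is a rough Python sketch, not the run.py used above, that splits the status word into a few flag bits and the extended error code. The bit positions are assumptions on my part, taken from the MCI_STATUS_* definitions in the Linux kernel as I understand them; TCC and Poison are AMD-specific, and the extended error code is assumed to sit in bits 21:16. The `decode_status` name is hypothetical.

```python
# Assumed bit positions (Linux MCI_STATUS_* naming); TCC and Poison are
# AMD-specific, the rest are the generic architectural flags.
FLAGS = {
    "Val": 63,
    "Overflow": 62,
    "UC": 61,
    "En": 60,
    "MiscV": 59,
    "AddrV": 58,
    "PCC": 57,
    "TCC": 55,
    "Poison": 43,
}


def decode_status(status):
    """Return the names of the set flags and the extended error code."""
    set_flags = [name for name, bit in FLAGS.items() if (status >> bit) & 1]
    # Extended error code, assumed to occupy bits 21:16.
    xec = (status >> 16) & 0x3F
    return set_flags, xec


flags, xec = decode_status(0xBC800800060C0859)
print(flags, hex(xec))

# The fc... variant reported earlier in this bug differs only in the Overflow bit.
flags_fc, xec_fc = decode_status(0xFC800800060C0859)
print(flags_fc, hex(xec_fc))
```

Under these assumptions the extended error code comes out as 0xc with the Poison bit set, which agrees with the "L2RespPoison 0xc" line above.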
Comment 9 meedalexa 2021-04-17 09:34:30 UTC
I'm using an Arch kernel so I don't know if this will be of any help, but I'm having a similar issue on 5.11.13-arch1-1 (the same mce symptoms and bc800800060c0859 error code, and the same PROCESSOR etc. line except for the Unix timestamp). My system has been stable so far after turning off Core Performance Boost in my BIOS, but I've also not had it under as heavy a load (though still pretty heavy relative to what was making it crash earlier). Ryzen 7 5800X, Gigabyte X570 Aorus Elite Wifi.
Comment 10 alan.loewe 2021-04-17 20:05:01 UTC
Looks like I eventually got my system stable. The solution was to slightly undervolt DRAM. I guess this kind of issue depends on the silicon quality of the data fabric, and if you're unlucky, you need to find stable non-default settings.

RAM specs are 3600 MHz, 16-22-22-42, 1.45V. With that I got random reboots without MCE about once a day. With 3200 MHz and 1.35V, 5.11 crashed every few hours with MCE, but 5.10 seemed stable. Well, almost, it still suddenly crashed multiple times a day after a few days without crashes.

Long story short, these settings are stable for my system, everything else at default:
* DOCP, but with 3200 MHz and 1.345V.
* DRAM power phase control: optimized (default is extreme).

This setting may or may not be relevant:
* DRAM current capability: 110%

Irrelevant settings:
* Virtualization stuff is enabled.

Note: when DRAM voltage is set to 1.35V, sensors report about 1.36-1.365V. When it's set to 1.345V, almost exactly 1.35V is reported.

To verify, I increased voltage to 1.355V, which yielded MCEs pretty quickly.

This also means that it's not a kernel issue.
Comment 11 Misha Nasledov 2021-04-17 22:01:56 UTC
I have some DDR4-3200 ECC RAM that runs DDR4-3200 with JEDEC timings, so it's 1.20V here. I don't have crashes as frequently as you have reported. My rig was pretty stable until a recent kernel, so I am going to keep digging. It's difficult as it only happens once every few days.
