Bug 215577

Summary: AsRock B550 Taichi - reboots with AMD Ryzen 9 5900X (Machine Check: 0 Bank 5: bea0000000000108)
Product: Platform Specific/Hardware
Reporter: Alias Fakanami (abyomi0)
Component: x86-64
Assignee: platform_x86_64 (platform_x86_64)
Status: REOPENED ---
Severity: normal
CC: bp, cousinmarc, gabriele.svelto, pmenzel+bugzilla.kernel.org
Priority: P1
Hardware: All
OS: Linux
Kernel Version: 5.16.3-arch1-1
Subsystem:
Regression: No
Bisected commit-id:
Attachments: acpidump, dmesg, lspci, BIOS Screenshots 1, BIOS Version

Description Alias Fakanami 2022-02-07 18:54:04 UTC
Created attachment 300406 [details]
acpidump

Creating this as requested by Paul here: https://bugzilla.kernel.org/show_bug.cgi?id=206903#c281

Motherboard: AsRock B550 Taichi
BIOS Information
        Vendor: American Megatrends International, LLC.
        Version: L2.05
        Release Date: 01/06/2022

CPU: AMD Ryzen 9 5900X
GPU: nVidia RTX 2060 Super
Kernel: 5.16.3-arch1-1
OS: Arch Linux

Running Folding at Home (Full) on both the CPU and GPU. 
So far, this is the only situation where I've had the reboots occur.

Computer reboots at random with the following in dmesg upon rebooting:

[Fri Feb  4 08:37:18 2022] mce: [Hardware Error]: Machine check events logged
[Fri Feb  4 08:37:18 2022] mce: [Hardware Error]: CPU 4: Machine Check: 0 Bank 5: bea0000000000108
[Fri Feb  4 08:37:18 2022] mce: [Hardware Error]: TSC 0 ADDR 8a3a3a MISC d012000100000000 SYND 4d000000 IPID 500b000000000
[Fri Feb  4 08:37:18 2022] mce: [Hardware Error]: PROCESSOR 2:a20f10 TIME 1643981838 SOCKET 0 APIC 8 microcode a201016
[Fri Feb  4 08:37:18 2022] mce: [Hardware Error]: Machine check events logged
[Fri Feb  4 08:37:18 2022] mce: [Hardware Error]: CPU 22: Machine Check: 0 Bank 5: bea0000000000108
[Fri Feb  4 08:37:18 2022] mce: [Hardware Error]: TSC 0 ADDR 8a3a8a MISC d012000100000000 SYND 4d000000 IPID 500b000000000
[Fri Feb  4 08:37:18 2022] mce: [Hardware Error]: PROCESSOR 2:a20f10 TIME 1643981838 SOCKET 0 APIC 19 microcode a201016
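
The status word in these lines can be split into its flag bits. Below is a minimal sketch, assuming the bit positions defined in the Linux kernel's `asm/mce.h` (TCC and SYNDV are AMD-specific, the others are architectural x86 MCA bits). The names `STATUS_FLAGS` and `decode_status` are made up for this illustration, and the sketch does not interpret the error code itself.

```python
# Bit positions as defined in the Linux kernel's asm/mce.h (assumption:
# only the flags listed here are of interest for this report).
STATUS_FLAGS = {
    63: "VAL",    # register holds a valid error
    62: "OVER",   # an earlier error was overwritten
    61: "UC",     # uncorrected error
    60: "EN",     # error reporting was enabled
    59: "MISCV",  # MISC register is valid
    58: "ADDRV",  # ADDR register is valid
    57: "PCC",    # processor context corrupt
    55: "TCC",    # task context corrupt (AMD-specific)
    53: "SYNDV",  # SYND register is valid (AMD-specific)
}


def decode_status(status: int) -> dict:
    """Return the set flag names (highest bit first) and the low 16-bit error code."""
    flags = [
        name
        for bit, name in sorted(STATUS_FLAGS.items(), reverse=True)
        if (status >> bit) & 1
    ]
    return {"flags": flags, "error_code": status & 0xFFFF}


decoded = decode_status(0xBEA0000000000108)
print(decoded["flags"], hex(decoded["error_code"]))
```

For this status value, UC and PCC are both set, which would be consistent with the machine resetting instead of recovering from the error.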

After another MCE this morning, I'm currently running the system with the following kernel parameter: processor.max_cstate=5 (something I found here: https://github.com/DimitriFourny/MCE-Ryzen-Decoder)

The requested output of acpidump, dmesg, and lspci -nn is attached.
Comment 1 Alias Fakanami 2022-02-07 18:54:27 UTC
Created attachment 300407 [details]
dmesg
Comment 2 Alias Fakanami 2022-02-07 18:55:02 UTC
Created attachment 300408 [details]
lspci
Comment 3 Paul Menzel 2022-02-07 19:09:32 UTC
Thanks, but I requested the additional information for the system where `acpi_osi=Linux` was supposedly helping. (Which turned out to be false, as far as I understood.)

1.  What does `zenstates.py --list` [1] report?
2.  Does the firmware have any option to configure the C-States?

[1]: https://github.com/r4m0n/ZenStates-Linux
Comment 4 Alias Fakanami 2022-02-07 20:16:35 UTC
(In reply to Paul Menzel from comment #3)
> Thanks, but I requested the additional information for the system where
> `acpi_osi=Linux` was supposedly helping. (Which turned out to be false, as
> far as I understood.)
> 
> 1.  What does `zenstates.py --list` [1] report?
> 2.  Does the firmware have any option to configure the C-States?
> 
> [1]: https://github.com/r4m0n/ZenStates-Linux

With processor.max_cstate=5 (zenstates --list):
(Oddly, the output is the same with or without processor.max_cstate=5 passed at boot time.)

P0 - Enabled - FID = 94 - DID = 8 - VID = 48 - Ratio = 37.00 - vCore = 1.10000
P1 - Enabled - FID = 8C - DID = A - VID = 58 - Ratio = 28.00 - vCore = 1.00000
P2 - Enabled - FID = 84 - DID = C - VID = 68 - Ratio = 22.00 - vCore = 0.90000
P3 - Disabled
P4 - Disabled
P5 - Disabled
P6 - Disabled
P7 - Disabled
C6 State - Package - Enabled
C6 State - Core - Enabled
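
The Ratio and vCore columns can be reproduced from the hexadecimal FID, DID and VID fields. This is a minimal sketch, assuming the formulas ratio = 2 * FID / DID and vCore = 1.55 - 0.00625 * VID, which match the values listed above. The helper name `pstate` and the regular expression are only for illustration.

```python
import re

# Matches the hexadecimal FID/DID/VID fields of one "Enabled" P-state line.
PSTATE_FIELDS = re.compile(r"FID = ([0-9A-F]+) - DID = ([0-9A-F]+) - VID = ([0-9A-F]+)")


def pstate(line: str) -> tuple:
    """Return (ratio, vcore) computed from one zenstates P-state line."""
    fid, did, vid = (int(field, 16) for field in PSTATE_FIELDS.search(line).groups())
    ratio = 2 * fid / did          # assumed formula: core multiplier
    vcore = 1.55 - 0.00625 * vid   # assumed formula: core voltage in volts
    return ratio, vcore


ratio, vcore = pstate(
    "P0 - Enabled - FID = 94 - DID = 8 - VID = 48 - Ratio = 37.00 - vCore = 1.10000"
)
print(f"{ratio:.2f} {vcore:.5f}")
```

With a 100 MHz reference clock (an assumption), a ratio of 37 corresponds to 3.7 GHz for P0.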

As for C-state options in the BIOS, the only thing I see is under:
	Advanced\AMD CBS\CPU Common Options

	Global C-State Control: Auto
	Description: Controls IO based C-state generation and DF C-states.

	It has options for Disable, Enable and Auto.
	Setting Enable doesn't show any extra options.
	It also doesn't say whether Auto means Enabled or Disabled.
Comment 5 Paul Menzel 2022-02-07 20:47:54 UTC
Thank you.

>     C6 State - Package - Enabled
>     C6 State - Core - Enabled

Please try to toggle the (UEFI) firmware settings, and check if the output of `zenstates.py --list` changes.

Also, please try to contact ASRock support. The chances are low that it's going to help, but maybe you'll get lucky.
Comment 6 Alias Fakanami 2022-02-07 21:09:49 UTC
Thought I should mention it, Global C-State Control is set to Auto by default.
Setting it to Disabled results in zenstates listing C6 State - Core as disabled.

The remaining combinations (Enabled or Auto, with or without processor.max_cstate=5) result in C6 State - Core showing as Enabled in zenstates.

UEFI Global C-State Control: Disabled (processor.max_cstate=5)
P0 - Enabled - FID = 94 - DID = 8 - VID = 48 - Ratio = 37.00 - vCore = 1.10000
P1 - Enabled - FID = 8C - DID = A - VID = 58 - Ratio = 28.00 - vCore = 1.00000
P2 - Enabled - FID = 84 - DID = C - VID = 68 - Ratio = 22.00 - vCore = 0.90000
P3 - Disabled
P4 - Disabled
P5 - Disabled
P6 - Disabled
P7 - Disabled
C6 State - Package - Enabled
C6 State - Core - Disabled

UEFI Global C-State Control: Disabled (processor.max_cstate not set)
P0 - Enabled - FID = 94 - DID = 8 - VID = 48 - Ratio = 37.00 - vCore = 1.10000
P1 - Enabled - FID = 8C - DID = A - VID = 58 - Ratio = 28.00 - vCore = 1.00000
P2 - Enabled - FID = 84 - DID = C - VID = 68 - Ratio = 22.00 - vCore = 0.90000
P3 - Disabled
P4 - Disabled
P5 - Disabled
P6 - Disabled
P7 - Disabled
C6 State - Package - Enabled
C6 State - Core - Disabled

I can reach out to AsRock support, do I just tell them there's a bug in the firmware related to C States, causing the MCE under load?
Comment 7 Paul Menzel 2022-02-07 22:34:09 UTC
(In reply to Alias Fakanami from comment #6)
> Thought I should mention it, Global C-State Control is set to Auto by
> default.
> Setting it to Disabled results in zenstates listing C6 State - Core as
> disabled.

That is good. Please test whether that setting alone (with no other workarounds, meaning *no* additional parameters related to the problem on the Linux command line) gives you a stable system.

> I can reach out to AsRock support, do I just tell them there's a bug in the
> firmware related to C States, causing the MCE under load?

Good question. But yes: tell them that you want to use C-state C6 under GNU/Linux, and that it crashes your system.

If you have the time, you could also join a firmware/BIOS modding/hacking forum and ask whether someone there can reverse engineer what the different options of *Global C-State Control* actually do, that is, which registers are set.
Comment 8 Alias Fakanami 2022-02-08 01:32:35 UTC
Okay. I'll test it with just Global C-State set to disabled in UEFI and with no additional parameters on the command line and see what happens.
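
Before starting such a run, it can help to confirm that none of the earlier workaround parameters are still on the kernel command line. This is a minimal sketch that parses a command-line string such as the contents of `/proc/cmdline`. The `WORKAROUNDS` tuple and both function names are hypothetical, and quoted values containing spaces are not handled.

```python
# Parameter names mentioned earlier in this thread (illustrative list only).
WORKAROUNDS = ("processor.max_cstate", "acpi_osi")


def parse_cmdline(text: str) -> dict:
    """Split a kernel command line into {name: value}; bare flags map to ''."""
    params = {}
    for token in text.split():
        key, _, value = token.partition("=")
        params[key] = value
    return params


def active_workarounds(text: str, names=WORKAROUNDS) -> dict:
    """Return the workaround parameters that are present on the command line."""
    params = parse_cmdline(text)
    return {name: params[name] for name in names if name in params}


# Example input; on a live system, read the string from /proc/cmdline instead.
example = "BOOT_IMAGE=/vmlinuz-linux root=UUID=abcd rw processor.max_cstate=5 quiet"
print(active_workarounds(example))
```

An empty result means the test run depends only on the firmware setting.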

Reached out to ASRock Support.

I'll try with Level1Techs and see if they can help with the BIOS modding/hacking.

Thread: https://forum.level1techs.com/t/asrock-b550-taichi-bios-reverse-engineering/181739
Comment 9 Paul Menzel 2022-02-08 07:17:24 UTC
To avoid too many comments being posted to this bug, as happened in bug #206903 [1], could you please tag the subject/title with *AsRock B550 Taichi*?

[1]: https://bugzilla.kernel.org/show_bug.cgi?id=206903#c281
Comment 10 Paul Menzel 2022-02-08 07:21:27 UTC
(In reply to Alias Fakanami from comment #8)

[…]

> Reached out to ASRock Support.

Awesome.

> I'll try with Level1Techs and see if they can help with the BIOS
> modding/hacking.
> 
> Thread:
>
> https://forum.level1techs.com/t/asrock-b550-taichi-bios-reverse-engineering/181739

Fingers crossed that someone with the knowledge takes the time to look at it. I have also heard of the Win-Raid Forum [1], but I have no idea if the chances are higher there. If you have time, you could also take a stab at disassembling the firmware binary yourself. UEFITool and radare2 [2] might already be enough to find out what is happening.

[1]: https://www.win-raid.com/
[2]: https://rada.re/n/radare2.html
Comment 11 Alias Fakanami 2022-02-08 12:51:42 UTC
(In reply to Paul Menzel from comment #9)
> To avoid too many comments being posted to this bug, as happened in bug
> #206903 [1], could you please tag the subject/title with *AsRock B550
> Taichi*?
> 
> [1]: https://bugzilla.kernel.org/show_bug.cgi?id=206903#c281

Changed title from reboots with AMD Ryzen 9 5900X (Machine Check: 0 Bank 5: bea0000000000108) to AsRock B550 Taichi - reboots with AMD Ryzen 9 5900X (Machine Check: 0 Bank 5: bea0000000000108).
Comment 12 Alias Fakanami 2022-02-08 12:54:24 UTC
(In reply to Paul Menzel from comment #10)
> (In reply to Alias Fakanami from comment #8)
> 
> […]
> 
> > Reached out to ASRock Support.
> 
> Awesome.
> 
> > I'll try with Level1Techs and see if they can help with the BIOS
> > modding/hacking.
> > 
> > Thread:
> >
> >
> https://forum.level1techs.com/t/asrock-b550-taichi-bios-reverse-engineering/181739
> 
> Fingers crossed that someone with the knowledge takes the time to look at
> it. I have also heard of the Win-Raid Forum [1], but I have no idea if the
> chances are higher there. If you have time, you could also take a stab at
> disassembling the firmware binary yourself. UEFITool and radare2 [2] might
> already be enough to find out what is happening.
> 
> [1]: https://www.win-raid.com/
> [2]: https://rada.re/n/radare2.html

So, ASRock Support isn't able to help. Their response is below.

"Unfortunately ASRock does not have any solution , drivers nor support for any Linux OS. ASRock support Windows version only."

I can try radare2. I took a look at UEFITool when I was on Level1Techs, though I didn't get very far with it.
Comment 13 Alias Fakanami 2022-02-14 13:52:28 UTC
Hello Paul,

I left the computer running Folding At Home. It rebooted a few minutes ago, after about 3, maybe 4 days.
Here's the output from journalctl --list-boots:

-1 33e2c2e4298545daabdaa0b1067a6bd4 Thu 2022-02-10 10:52:11 EST—Sun 2022-02-13 21:27:56 EST
 0 3cc6ca7c3faa47e28f78ac15ba555668 Mon 2022-02-14 08:32:37 EST—Mon 2022-02-14 08:36:01 EST
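
The length of the previous boot follows directly from the two timestamps in the `-1` line. This is a minimal sketch using Python's `datetime`; the helper name `boot_duration` is only for illustration, and both timestamps are assumed to be in the same time zone (EST, as printed).

```python
from datetime import datetime

TIMESTAMP_FORMAT = "%Y-%m-%d %H:%M:%S"


def boot_duration(first: str, last: str):
    """Return the time between the first and last journal entry of one boot."""
    start = datetime.strptime(first, TIMESTAMP_FORMAT)
    end = datetime.strptime(last, TIMESTAMP_FORMAT)
    return end - start


# Timestamps taken from the "-1" line above.
duration = boot_duration("2022-02-10 10:52:11", "2022-02-13 21:27:56")
print(duration)  # 3 days, 10:35:45
```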

Same messages in dmesg:

[Mon Feb 14 08:32:30 2022] mce: [Hardware Error]: Machine check events logged
[Mon Feb 14 08:32:30 2022] mce: [Hardware Error]: CPU 3: Machine Check: 0 Bank 5: bea0000000000108
[Mon Feb 14 08:32:30 2022] mce: [Hardware Error]: TSC 0 ADDR 8a3a8a MISC d012000100000000 SYND 4d000000 IPID 500b000000000
[Mon Feb 14 08:32:30 2022] mce: [Hardware Error]: PROCESSOR 2:a20f10 TIME 1644845550 SOCKET 0 APIC 6 microcode a201016
[Mon Feb 14 08:32:30 2022] mce: [Hardware Error]: Machine check events logged
[Mon Feb 14 08:32:30 2022] mce: [Hardware Error]: CPU 23: Machine Check: 0 Bank 5: bea0000000000108
[Mon Feb 14 08:32:30 2022] mce: [Hardware Error]: TSC 0 ADDR 8a3aa2 MISC d012000100000000 SYND 4d000000 IPID 500b000000000
[Mon Feb 14 08:32:30 2022] mce: [Hardware Error]: PROCESSOR 2:a20f10 TIME 1644845550 SOCKET 0 APIC 1b microcode a201016
[Mon Feb 14 08:32:35 2022] MCE: In-kernel MCE decoding enabled.
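
To compare events across reboots, the CPU, bank and status fields can be pulled out of the dmesg text. This is a minimal sketch using a regular expression written against the line format shown above. The names `MCE_LINE` and `summarize` are made up for this example.

```python
import re

# Matches lines such as "CPU 3: Machine Check: 0 Bank 5: bea0000000000108".
MCE_LINE = re.compile(r"CPU (\d+): Machine Check: \d+ Bank (\d+): ([0-9a-f]+)")


def summarize(log_text: str) -> list:
    """Return a list of (cpu, bank, status) tuples, one per machine check line."""
    return [
        (int(cpu), int(bank), int(status, 16))
        for cpu, bank, status in MCE_LINE.findall(log_text)
    ]


sample = """\
[Mon Feb 14 08:32:30 2022] mce: [Hardware Error]: CPU 3: Machine Check: 0 Bank 5: bea0000000000108
[Mon Feb 14 08:32:30 2022] mce: [Hardware Error]: CPU 23: Machine Check: 0 Bank 5: bea0000000000108
"""
events = summarize(sample)
print(events)
```

In every event posted in this report so far, the bank (5) and the status value are identical; only the CPU number and ADDR differ.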

I made a post on win-raid for the AsRock BIOS (it does seem like chances might be better there):
https://www.win-raid.com/t10182f54-Request-ASrock-B-Taichi-BIOS-Bugfix.html

The Win-Raid forum had some steps on getting started with firmware modding, but I wasn't able to make any headway with decompiling the firmware binary.
Comment 14 Alias Fakanami 2022-02-14 14:23:57 UTC
Thought I should mention it, 4 days appears to be the maximum.
Comment 15 Alias Fakanami 2022-02-22 13:52:15 UTC
Hi Paul. Good news! I've had my system running for the past 7 days now without a reboot. 

uptime: 08:50:44 up 7 days,  1:00,  2 users,  load average: 25.28, 25.41, 25.33

The last change I made was to adjust the RAM speed from 3600 MHz to 2400 MHz, after reading a thread on unRAID [1] [2].

I'll be retesting with the RAM at 3200 MHz (max speed supported by Ryzen [3]) to see what happens, but for now, it seems the RAM running at 3600 MHz was causing the Machine Check Exceptions and reboots.

1: https://forums.unraid.net/topic/104115-solved-unraid-keeps-freezing-or-restarting-and-i-cant-figure-out-why/

2: https://forums.unraid.net/topic/46802-faq-for-unraid-v6/page/2/?tab=comments#comment-819173

3: https://www.amd.com/en/products/cpu/amd-ryzen-9-5900x#product-specs
Comment 16 Paul Menzel 2022-02-22 14:00:56 UTC
Thank you for keeping us posted, and great job on finding a workaround.

What RAM modules and configuration do you have exactly?

Maybe that is another data point, you can contact the ASRock support with. Is your RAM on the compatibility(?) list?
Comment 17 Alias Fakanami 2022-02-22 14:22:48 UTC
RAM Manufacturer: Kingston
Part Number: KHX3600C17D4/16GX
Last I checked, it is not on the QVL list.
2x16 GB, for a total of 32 GB.

Here's a spec sheet: https://www.kingston.com/dataSheets/HX436C17PB3AK2_32.pdf 

I think this is the right one, but the Part Numbers differ for some reason. I may pull a stick of RAM out just to be sure.
Comment 18 Alias Fakanami 2022-03-29 23:50:41 UTC
Well, I thought it was stable, but it would appear I was wrong.

I've had two MCEs. One this Sunday, the 27th (after 11 days of uptime) and again a few minutes ago after two days of uptime. In both cases, it was under load from FoldingAtHome.

After the MCE on Sunday, I thought maybe having the RAM at 3200 MHz was no good and turned it down to 2400 MHz to confirm, but it just rebooted at that speed, too.
Comment 19 Alias Fakanami 2022-11-19 11:18:22 UTC
Paul,

I found this [1] and left F@H running. The system made it 21 days and 5 hours, approximately. Booted with idle=nomwait.

It is a different kernel version, however: 5.19.12.arch1-1

1: https://community.amd.com/t5/archives-discussions/epyc-7551-spontaneously-resets-after-10mins-rendering/m-p/162407#M191
Comment 20 Paul Menzel 2022-11-19 13:50:10 UTC
Nice, that this workaround worked for you. It didn’t on other boards.

By the way, your original report was with system firmware L2.05. The current version seems to be 2.30 [1], containing several updates to the general AMD AGESA platform initialization code.

1.  2.10: Update AMD AM4 AGESA Combo V2 PI 1.2.0.6b
2.  2.20: Update AMD AM4 AGESA Combo V2 PI 1.2.0.7
Comment 21 Alias Fakanami 2022-11-19 15:09:19 UTC
(In reply to Paul Menzel from comment #20)
> Nice, that this workaround worked for you. It didn’t on other boards.
> 
> By the way, your original report was with system firmware L2.05. The current
> version seems to be 2.30 [1], containing several updates to the general AMD
> AGESA platform initialization code.
> 
> 1.  2.10: Update AMD AM4 AGESA Combo V2 PI 1.2.0.6b
> 2.  2.20: Update AMD AM4 AGESA Combo V2 PI 1.2.0.7

While I wouldn't exactly call it a success, it's not quite a failure, either.

Yes, I noticed the updates available and I updated after the reboot, so I'm running the latest system firmware now (see attached). I'll probably test again to see if anything's changed with the update.
Comment 22 Alias Fakanami 2022-11-19 15:11:42 UTC
Created attachment 303234 [details]
BIOS Screenshots 1
Comment 23 Alias Fakanami 2022-11-19 15:12:51 UTC
Created attachment 303235 [details]
BIOS Version
Comment 24 Alias Fakanami 2022-12-09 13:48:57 UTC
Rebooted again, but failed much sooner this time.
Ran from the 3rd to the 8th, when it rebooted at around 6 am.
Comment 25 Alias Fakanami 2022-12-13 18:46:03 UTC
I'm not sure what's changed now. The kernel version is the same (upgrading to 6.0 breaks virtualization, I haven't found a solution yet), but...the system rebooted on its own again a little while ago. 

Only made it 3 days and 11 hours, approximately. The uptime is pretty inconsistent now, for whatever reason.

The only other thing to try is to change out the motherboard. That worked for someone in the Folding Forum link below. I figure it's worth a shot. I've already RMA'd the CPU once.

I've only been able to find a couple things on this:

https://foldingforum.org/viewtopic.php?t=37535&sid=5179d7e794321212f2ba0f21511ef8e0&start=75

https://community.amd.com/t5/archives-discussions/epyc-7551-spontaneously-resets-after-10mins-rendering/m-p/162407#M191
Comment 26 Alias Fakanami 2022-12-27 16:30:52 UTC
Well, that didn't work. Though, thinking back on it, I'm not sure why I expected anything different.

I did find this, though.
https://wiki.gentoo.org/wiki/Ryzen#Random_reboots_with_mce_events

See erratum 1109: https://www.amd.com/system/files/TechDocs/55449_Fam_17h_M_00h-0Fh_Rev_Guide.pdf 

I can't seem to find anything newer related to Zen 3 / Ryzen 5000, though.