Bug 215577
Summary: | AsRock B550 Taichi - reboots with AMD Ryzen 9 5900X (Machine Check: 0 Bank 5: bea0000000000108) | ||
---|---|---|---|
Product: | Platform Specific/Hardware | Reporter: | Alias Fakanami (abyomi0) |
Component: | x86-64 | Assignee: | platform_x86_64 (platform_x86_64) |
Status: | REOPENED --- | ||
Severity: | normal | CC: | bp, cousinmarc, gabriele.svelto, pmenzel+bugzilla.kernel.org |
Priority: | P1 | ||
Hardware: | All | ||
OS: | Linux | ||
Kernel Version: | 5.16.3-arch1-1 | Subsystem: | |
Regression: | No | Bisected commit-id: | |
Attachments: | acpidump, dmesg, lspci, BIOS Screenshots 1, BIOS Version |
Description
Alias Fakanami
2022-02-07 18:54:04 UTC
Created attachment 300407 [details]
dmesg
Created attachment 300408 [details]
lspci
Thanks, but I requested the additional information for the system where `acpi_osi=Linux` was supposedly helping. (Which turned out to be false, as far as I understood.)

1. What does `zenstates.py --list` [1] report?
2. Does the firmware have any option to configure the C-States?

[1]: https://github.com/r4m0n/ZenStates-Linux

(In reply to Paul Menzel from comment #3)
> Thanks, but I requested the additional information for the system where
> `acpi_osi=Linux` was supposedly helping. (Which turned out to be false, as
> far as I understood.)
>
> 1. What does `zenstates.py --list` [1] report?
> 2. Does the firmware have any option to configure the C-States?
>
> [1]: https://github.com/r4m0n/ZenStates-Linux

With processor.max_cstates=5 (zenstates --list):
(Oddly, the output is the same with or without processor.max_cstates=5 passed at boot time.)

P0 - Enabled - FID = 94 - DID = 8 - VID = 48 - Ratio = 37.00 - vCore = 1.10000
P1 - Enabled - FID = 8C - DID = A - VID = 58 - Ratio = 28.00 - vCore = 1.00000
P2 - Enabled - FID = 84 - DID = C - VID = 68 - Ratio = 22.00 - vCore = 0.90000
P3 - Disabled
P4 - Disabled
P5 - Disabled
P6 - Disabled
P7 - Disabled
C6 State - Package - Enabled
C6 State - Core - Enabled

As for C-State options in the BIOS, the only thing I see is under Advanced\AMD CBS\CPU Common Options:

Global C-State Control: Auto
Description: Controls IO based C-state generation and DF C-states.

It has options for Disable, Enable and Auto. Setting Enable doesn't show any extra options. It also doesn't say whether Auto means Enabled or Disabled.

Thank you.

> C6 State - Package - Enabled
> C6 State - Core - Enabled
Please try toggling the (UEFI) firmware settings, and check whether the output of `zenstates.py --list` changes.
Also, please try contacting ASRock support. The chances are low that it’s going to help, but maybe you'll get lucky.
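
One way to compare two `zenstates.py --list` runs after toggling a firmware setting is sketched below. It assumes the output keeps the `C6 State - <scope> - <state>` line format shown in this thread; the helper names `c6_states` and `c6_changes` are hypothetical and not part of ZenStates-Linux.

```python
def c6_states(zenstates_output: str) -> dict:
    """Collect the 'C6 State - <scope> - <state>' lines as {scope: enabled?}."""
    states = {}
    for line in zenstates_output.splitlines():
        parts = [p.strip() for p in line.split(" - ")]
        # P-state lines have either 2 fields (disabled) or many fields (enabled),
        # so only the C6 lines match this three-field shape.
        if len(parts) == 3 and parts[0] == "C6 State":
            states[parts[1]] = parts[2] == "Enabled"
    return states


def c6_changes(before: str, after: str) -> dict:
    """Return {scope: (old, new)} for every C6 scope whose state differs."""
    old, new = c6_states(before), c6_states(after)
    return {
        scope: (old.get(scope), new.get(scope))
        for scope in old.keys() | new.keys()
        if old.get(scope) != new.get(scope)
    }
```

Saving the output of each run to a file and passing the two file contents to `c6_changes` would show at a glance whether a firmware option had any effect on the reported C6 states.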

Thought I should mention it, Global C-State Control is set to Auto by default. Setting it to Disabled results in zenstates listing C6 State - Core as disabled. The remaining combinations (Enabled or Auto, with or without processor.max_cstate=5) result in C6 State - Core showing as Enabled in zenstates.

UEFI Global C-State Control: Disabled (processor.max_cstates=5)

P0 - Enabled - FID = 94 - DID = 8 - VID = 48 - Ratio = 37.00 - vCore = 1.10000
P1 - Enabled - FID = 8C - DID = A - VID = 58 - Ratio = 28.00 - vCore = 1.00000
P2 - Enabled - FID = 84 - DID = C - VID = 68 - Ratio = 22.00 - vCore = 0.90000
P3 - Disabled
P4 - Disabled
P5 - Disabled
P6 - Disabled
P7 - Disabled
C6 State - Package - Enabled
C6 State - Core - Disabled

UEFI Global C-State Control: Disabled (processor.max_cstates not set)

P0 - Enabled - FID = 94 - DID = 8 - VID = 48 - Ratio = 37.00 - vCore = 1.10000
P1 - Enabled - FID = 8C - DID = A - VID = 58 - Ratio = 28.00 - vCore = 1.00000
P2 - Enabled - FID = 84 - DID = C - VID = 68 - Ratio = 22.00 - vCore = 0.90000
P3 - Disabled
P4 - Disabled
P5 - Disabled
P6 - Disabled
P7 - Disabled
C6 State - Package - Enabled
C6 State - Core - Disabled

I can reach out to AsRock support. Do I just tell them there's a bug in the firmware related to C-States, causing the MCE under load?

(In reply to Alias Fakanami from comment #6)
> Thought I should mention it, Global C-State Control is set to Auto by
> default.
> Setting it to Disabled results in zenstates listing C6 State - Core as
> disabled.

That is good. Please check whether using that (and no other workarounds, meaning *no* additional parameters related to the problem on the Linux command line) gives you a stable system.

> I can reach out to AsRock support, do I just tell them there's a bug in the
> firmware related to C States, causing the MCE under load?

Good question. But yes, tell them that you want to use C-State C6 in GNU/Linux, and that it crashes your system.

If you have the time, you could also join a firmware/BIOS modding/hacking forum and ask there whether someone can reverse engineer what the different options of *Global C-State Control* actually do, that is, which registers are set.

Okay. I'll test it with just Global C-State Control set to Disabled in UEFI, with no additional parameters on the command line, and see what happens.

Reached out to ASRock Support.

I'll try with Level1Techs and see if they can help with the BIOS modding/hacking.

Thread: https://forum.level1techs.com/t/asrock-b550-taichi-bios-reverse-engineering/181739

To avoid too many comments being posted to this bug/issue, as happened in bug #206903 [1], could you please tag the subject/title with *AsRock B550 Taichi*?

[1]: https://bugzilla.kernel.org/show_bug.cgi?id=206903#c281

(In reply to Alias Fakanami from comment #8)

[…]

> Reached out to ASRock Support.

Awesome.

> I'll try with Level1Techs and see if they can help with the BIOS
> modding/hacking.
>
> Thread:
>
> https://forum.level1techs.com/t/asrock-b550-taichi-bios-reverse-engineering/181739

Fingers crossed that someone with the knowledge takes the time to look at it. I also heard of the Win-Raid Forum [1], but I have no idea if chances are higher there. If you have time, you could also take a stab at disassembling the firmware binary. UEFITool and radare2 [2] might already be enough to find out what is happening.

[1]: https://www.win-raid.com/
[2]: https://rada.re/n/radare2.html

(In reply to Paul Menzel from comment #9)
> Just to avoid, that too many comments are posted to this bug/issue as in bug
> #206903 [1], could you please tag the subject/title with *AsRock B550
> Taichi*?
>
> [1]: https://bugzilla.kernel.org/show_bug.cgi?id=206903#c281

Changed title from "reboots with AMD Ryzen 9 5900X (Machine Check: 0 Bank 5: bea0000000000108)" to "AsRock B550 Taichi - reboots with AMD Ryzen 9 5900X (Machine Check: 0 Bank 5: bea0000000000108)".

(In reply to Paul Menzel from comment #10)
> (In reply to Alias Fakanami from comment #8)
>
> […]
>
> > Reached out to ASRock Support.
>
> Awesome.
>
> > I'll try with Level1Techs and see if they can help with the BIOS
> > modding/hacking.
> >
> > Thread:
> >
> > https://forum.level1techs.com/t/asrock-b550-taichi-bios-reverse-engineering/181739
>
> Fingers crossed that someone with the knowledge takes the time to look at it. I
> also heard of the Win-Raid Forum [1], but I have no idea if chances are higher
> there. If you have time, you could also take a stab at disassembling the
> firmware binary. UEFITool and radare2 [2] might already be enough to find out
> what is happening.
>
> [1]: https://www.win-raid.com/
> [2]: https://rada.re/n/radare2.html

So, ASRock Support isn't able to help. Their response is below.

"Unfortunately ASRock does not have any solution , drivers nor support for any Linux OS. ASRock support Windows version only."

I can try radare2. I took a look at UEFITool when I was on Level1Techs, though I didn't get very far with it.

Hello Paul,

I left the computer running Folding At Home. It rebooted a few minutes ago, after about 3, maybe 4 days.

Here's the output from journalctl --list-boots:

-1 33e2c2e4298545daabdaa0b1067a6bd4 Thu 2022-02-10 10:52:11 EST—Sun 2022-02-13 21:27:56 EST
0 3cc6ca7c3faa47e28f78ac15ba555668 Mon 2022-02-14 08:32:37 EST—Mon 2022-02-14 08:36:01 EST

Same messages in dmesg:

[Mon Feb 14 08:32:30 2022] mce: [Hardware Error]: Machine check events logged
[Mon Feb 14 08:32:30 2022] mce: [Hardware Error]: CPU 3: Machine Check: 0 Bank 5: bea0000000000108
[Mon Feb 14 08:32:30 2022] mce: [Hardware Error]: TSC 0 ADDR 8a3a8a MISC d012000100000000 SYND 4d000000 IPID 500b000000000
[Mon Feb 14 08:32:30 2022] mce: [Hardware Error]: PROCESSOR 2:a20f10 TIME 1644845550 SOCKET 0 APIC 6 microcode a201016
[Mon Feb 14 08:32:30 2022] mce: [Hardware Error]: Machine check events logged
[Mon Feb 14 08:32:30 2022] mce: [Hardware Error]: CPU 23: Machine Check: 0 Bank 5: bea0000000000108
[Mon Feb 14 08:32:30 2022] mce: [Hardware Error]: TSC 0 ADDR 8a3aa2 MISC d012000100000000 SYND 4d000000 IPID 500b000000000
[Mon Feb 14 08:32:30 2022] mce: [Hardware Error]: PROCESSOR 2:a20f10 TIME 1644845550 SOCKET 0 APIC 1b microcode a201016
[Mon Feb 14 08:32:35 2022] MCE: In-kernel MCE decoding enabled.

I made a post on win-raid for the AsRock BIOS (it does seem like chances might be better there):
https://www.win-raid.com/t10182f54-Request-ASrock-B-Taichi-BIOS-Bugfix.html

The Win-Raid forum had some steps on getting started with firmware modding, but I wasn't able to make any headway with decompiling the firmware binary.

Thought I should mention it, 4 days appears to be the maximum uptime.

Hi Paul. Good news! I've had my system running for the past 7 days now without a reboot.

uptime:
08:50:44 up 7 days, 1:00, 2 users, load average: 25.28, 25.41, 25.33

The last change I made was to adjust the RAM speed from 3600 MHz to 2400 MHz, after reading a thread on unRAID [1] [2].
I'll be retesting with the RAM at 3200 MHz (max speed supported by Ryzen [3]) to see what happens, but for now, it seems the RAM running at 3600 MHz was causing the Machine Check Exception and reboots.

[1]: https://forums.unraid.net/topic/104115-solved-unraid-keeps-freezing-or-restarting-and-i-cant-figure-out-why/
[2]: https://forums.unraid.net/topic/46802-faq-for-unraid-v6/page/2/?tab=comments#comment-819173
[3]: https://www.amd.com/en/products/cpu/amd-ryzen-9-5900x#product-specs

Thank you for keeping us posted, and great job on finding a workaround. What RAM modules and configuration do you have exactly? Maybe that is another data point you can contact ASRock support with. Is your RAM on the compatibility(?) list?

RAM Manufacturer: Kingston
Part Number: KHX3600C17D4/16GX

Last I checked, it is not on the QVL. 2x16 GB, for a total of 32 GB.

Here's a spec sheet: https://www.kingston.com/dataSheets/HX436C17PB3AK2_32.pdf

I think this is the right one, but the part numbers differ for some reason. I may pull a stick of RAM out just to be sure.

Well, I thought it was stable, but it would appear I was wrong. I've had two MCEs: one this Sunday, the 27th (after 11 days of uptime), and another a few minutes ago after two days of uptime. In both cases, it was under load from FoldingAtHome. After the MCE on Sunday, I thought maybe having the RAM at 3200 MHz was no good and turned it down to 2400 MHz to confirm, but it just rebooted at that speed, too.

Paul, I found this [1] and left F@H running. The system made it approximately 21 days and 5 hours. Booted with idle=nomwait. It is a different kernel version, however: 5.19.12.arch1-1

[1]: https://community.amd.com/t5/archives-discussions/epyc-7551-spontaneously-resets-after-10mins-rendering/m-p/162407#M191

Nice that this workaround worked for you. It didn’t on other boards.

By the way, your original report was with system firmware L2.05.
The current version seems to be 2.30 [1], containing several updates to the general AMD AGESA platform initialization code.

1. 2.10: Update AMD AM4 AGESA Combo V2 PI 1.2.0.6b
2. 2.20: Update AMD AM4 AGESA Combo V2 PI 1.2.0.7

(In reply to Paul Menzel from comment #20)
> Nice, that this workaround worked for you. It didn’t on other boards.
>
> By the way, you original report was with system firmware L2.05. The current
> version seems to be 2.30 [1] containing several updates to the general AMD
> AGESA platform initialization code.
>
> 1. 2.10: Update AMD AM4 AGESA Combo V2 PI 1.2.0.6b
> 2. 2.20: Update AMD AM4 AGESA Combo V2 PI 1.2.0.7

While I wouldn't exactly call it a success, it's not quite a failure, either. Yes, I noticed the updates available and I updated after the reboot, so I'm running the latest system firmware now (see attached). I'll probably test again to see if anything's changed with the update.

Created attachment 303234 [details]
BIOS Screenshots 1
Created attachment 303235 [details]
BIOS Version
Rebooted again, but it failed much sooner this time. It ran from the 3rd to the 8th before it rebooted at around 6am.

I'm not sure what's changed now. The kernel version is the same (upgrading to 6.0 breaks virtualization, and I haven't found a solution yet), but the system rebooted on its own again a little while ago. It only made it approximately 3 days and 11 hours. The uptime is pretty inconsistent now, for whatever reason.

The only other thing to try is to change out the motherboard. That worked for someone in the foldingforum link. I figure it's worth a shot. I've already RMA'd the CPU once. I've only been able to find a couple of things on this:

https://foldingforum.org/viewtopic.php?t=37535&sid=5179d7e794321212f2ba0f21511ef8e0&start=75
https://community.amd.com/t5/archives-discussions/epyc-7551-spontaneously-resets-after-10mins-rendering/m-p/162407#M191

Well, that didn't work. Though, thinking back on it, I'm not sure why I expected anything different. I did find this, though:

https://wiki.gentoo.org/wiki/Ryzen#Random_reboots_with_mce_events

See erratum 1109: https://www.amd.com/system/files/TechDocs/55449_Fam_17h_M_00h-0Fh_Rev_Guide.pdf

I can't seem to find anything newer related to Zen 3 / Ryzen 5000, though.
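
For anyone reading along, the status word from the bug title and the dmesg excerpt (bea0000000000108) can be unpacked by hand. Below is a minimal Python sketch, not a replacement for the kernel's own MCE decoding. Bits 57 to 63 follow the architectural x86 MCi_STATUS layout; the TCC (bit 55) and SyndV (bit 53) positions are the AMD-specific bits as I understand them from the Linux kernel's `mce.h`, so treat those two as an assumption. The names `FLAGS` and `decode_status` are made up for this illustration.

```python
# Flag bits of the machine-check status register (MCi_STATUS / MCA_STATUS).
# 63..57 are architectural; 55 and 53 are assumed AMD-specific positions.
FLAGS = {
    63: "VAL",    # register contents are valid
    62: "OVER",   # an earlier error was overwritten
    61: "UC",     # uncorrected error
    60: "EN",     # error reporting was enabled
    59: "MISCV",  # MISC register holds valid data
    58: "ADDRV",  # ADDR register holds valid data
    57: "PCC",    # processor context corrupt
    55: "TCC",    # task context corrupt (AMD)
    53: "SYNDV",  # syndrome register valid (AMD)
}


def decode_status(status: int) -> dict:
    """Return the set flag names (highest bit first) and the low 16-bit error code."""
    flags = [
        name
        for bit, name in sorted(FLAGS.items(), reverse=True)
        if (status >> bit) & 1
    ]
    return {"flags": flags, "error_code": status & 0xFFFF}


if __name__ == "__main__":
    print(decode_status(int("bea0000000000108", 16)))
```

For the value in this report, the set flags are VAL, UC, EN, MISCV, ADDRV, PCC, TCC and SYNDV, with error code 0x108. An uncorrected error with PCC set is consistent with the machine going down rather than logging the event and continuing.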