Bug 206903

Summary: Spontaneous reboots with Ryzen-3700x (Machine Check: 0 Bank 5: bea0000000000108)
Product: Platform Specific/Hardware
Reporter: Clemens Eisserer (linuxhippy)
Component: x86-64
Assignee: platform_x86_64 (platform_x86_64)
Status: NEW ---
Severity: high
CC: agurenko, alan.loewe, alexdeucher, am7kimbkv, amadejkastelic7, andviic, andy, bmogilefsky, bp, c1.kernel, captain_rage, chihero1982, cousinmarc, ctron, dev, dion, eleonorelerp, emomuffinsz, evvke, foulques, geoffrey.vandenberge, hampus.linander, info, jaakko, jacob, joel_damiano, jonas.v, josh, karunadheera, kbugs, kernel.org, kernel, kernel, kernel, kernelorg, labadens.pierre, leonardodearaujoaugusto, makiftasova, mclark, nilsdev, njlmerchant, patrickjholloway, pgnet.dev, pmenzel+bugzilla.kernel.org, richard.tattoli, samuel.alexander.cowley, t.clastres, tfiosilva, thomas.langkamp, tsweet64, vkrevs, vvb, xasafam914, yk749
Priority: P1    
Hardware: x86-64   
OS: Linux   
Kernel Version: 5.5.9
Tree: Mainline
Regression: No
Attachments: dmesg
Xorg log
Logs for debugging Machine Check crash
kernel and X11 logs for a boot after crash when reading registers
possible fix
dmesg log while running at high CPU load
pci devices
dmesg of latest crash
dmesg after 3rd reboot

Description Clemens Eisserer 2020-03-21 09:45:28 UTC
Ever since building my new PC I experience spontaneous (every week or so) reboots caused by machine check exceptions (always same bank and code - please see below). The reboots tend to happen in low-load situations (e.g. right after loading the desktop, or when playing youtube videos) - high load doesn't seem to make it worse.

The system consists of:
Asrock Phantom Gaming 4 X570 (latest BIOS: 2.30)
Ryzen 3700x
MSI RX570 4GB
Crucial Ballistix 4x8GB, DDR4, 3000MHz

I first suspected a hardware fault, but the system has been rock solid running Windows-10 for months (not a single crash / reboot) and runs memtest-86+ without error for days in single- & multicore-mode.
Temps are low, PSU is of high quality.

I tried the following workarounds without success (suggested for Zen 1 chips with the same error code):
- Disabled RC6 power state
- Disabled mwait for core-signalling
- limited GPU power saving states

Others have experienced exactly the same issue: https://www.reddit.com/r/archlinux/comments/e33nyg/hard_reboots_with_ryzen_3600x/fgtj09u/

... where in some cases it seems changing to a different GPU helps.
However my RX570 is rather new, so I am not so keen on replacing it after 6 months of use.


[    0.707393] mce: [Hardware Error]: Machine check events logged
[    0.707395] mce: [Hardware Error]: CPU 10: Machine Check: 0 Bank 5: bea0000000000108
[    0.707464] mce: [Hardware Error]: TSC 0 ADDR 1ffffbb03343c MISC d012000100000000 SYND 4d000000 IPID 500b000000000
[    0.707540] mce: [Hardware Error]: PROCESSOR 2:870f10 TIME 1583508288 SOCKET 0 APIC 5 microcode 8701013
[    0.709397] mce: [Hardware Error]: Machine check events logged
[    0.709398] mce: [Hardware Error]: CPU 12: Machine Check: 0 Bank 5: bea0000000000108
[    0.709468] mce: [Hardware Error]: TSC 0 ADDR 1ffffbba3a05a MISC d012000100000000 SYND 4d000000 IPID 500b000000000
[    0.709543] mce: [Hardware Error]: PROCESSOR 2:870f10 TIME 1583508288 SOCKET 0 APIC 9 microcode 870101
Comment 1 Alex Deucher 2020-03-23 13:29:50 UTC
does setting amdgpu.ppfeaturemask=0xffffbffb on the kernel command line in grub help?  Please attach your dmesg output and xorg log (if using X).
Comment 2 Clemens Eisserer 2020-03-24 15:53:08 UTC
Created attachment 288035 [details]
dmesg
Comment 3 Clemens Eisserer 2020-03-24 15:55:11 UTC
Created attachment 288037 [details]
Xorg log
Comment 4 Clemens Eisserer 2020-03-24 15:59:46 UTC
logs are attached, thanks for the hint regarding the feature mask, I'll give it a try and report back as soon as the next reboot occurs. 

I discovered a feature mask was already set, stemming from my experiments with reducing power management. The crashes already happened before I set the feature mask to 0xfffd7fff.
Comment 5 Clemens Eisserer 2020-03-26 08:09:18 UTC
the modified feature mask didn't seem to improve things - just had another reboot:

[    0.105123] mce: [Hardware Error]: Machine check events logged
[    0.105124] mce: [Hardware Error]: CPU 0: Machine Check: 0 Bank 5: bea0000000000108
[    0.105191] mce: [Hardware Error]: TSC 0 ADDR 7f6b3dbdfe9e MISC d012000100000000 SYND 4d000000 IPID 500b000000000 
[    0.105267] mce: [Hardware Error]: PROCESSOR 2:870f10 TIME 1585208779 SOCKET 0 APIC 0 microcode 8701013
Comment 6 Borislav Petkov 2020-03-27 11:40:41 UTC
I see in your dmesg:

amdgpu.ppfeaturemask=0xfffd7fff

Alex asked you to try:

amdgpu.ppfeaturemask=0xffffbffb

Are you saying that the MCE in comment #5 happened with the 0xffffbffb mask?

If so, then you probably should RMA your CPU.

HTH.
Comment 7 Clemens Eisserer 2020-03-27 11:49:11 UTC
@Borislav: the dmesg dump was uploaded before the feature mask was adjusted; the last crash happened with amdgpu.ppfeaturemask=0xffffbffb.

I've filed this report because the machine is rock solid running Windows-10 and crashes don't seem to be load-related (e.g. encoding VP9 videos on all cores for 24h on Linux didn't cause any problems). But who knows, maybe it is another "linux performance marginality problem" ;)
Comment 8 Borislav Petkov 2020-03-27 11:59:07 UTC
Comparing it to windoze doesn't mean a whole lot. If you want to debug it, then I guess the only thing I can think of is for you to try to rule out hw components. Like, for example, if you have another GPU handy to try with it and see if the MCEs still happen. And so on.

Then perhaps try to figure out what you do exactly before it reboots - maybe you'll be able to spot a pattern there.

You get the idea.
Comment 9 Clemens Eisserer 2020-03-27 12:05:43 UTC
| Comparing it to windoze doesn't mean a whole lot.

It means either:
* the hw is not faulty or has quirks which the windows drivers handle properly
* the hw causing the MCE is not used / used in a different way when running windows
Comment 10 Borislav Petkov 2020-03-27 12:18:09 UTC
And how do you suggest we figure that out?
Comment 11 Clemens Eisserer 2020-03-31 14:12:36 UTC
and another one:

[    0.696908] mce: [Hardware Error]: Machine check events logged
[    0.696909] mce: [Hardware Error]: CPU 3: Machine Check: 0 Bank 5: bea0000000000108
[    0.696977] mce: [Hardware Error]: TSC 0 ADDR 1ffffafa3b2aa MISC d012000100000000 SYND 4d000000 IPID 500b000000000 
[    0.697053] mce: [Hardware Error]: PROCESSOR 2:870f10 TIME 1585663601 SOCKET 0 APIC 6 microcode 8701013
Comment 12 someguy108 2020-04-03 00:48:13 UTC
Hello! Like Clemens, I've been having a similar issue with spontaneous reboots.
This is my configuration:
-Ryzen 3900x + Noctua D15
-MSI X570 Unify (latest agesa as of writing)
-DDR4 3200mhz 32GB kit
-Sapphire Pulse 5700 XT
-Corsair RMX 850 Watt
-Arch Linux with kernel 5.5.13
-Mesa 20.0.3
-Early KMS enabled

I've had this system up and running since November 2019 but initially with a Nvidia 1060 and Windows 10. Everything was running smoothly. About a month ago I switched back over to Linux after purchasing my 5700 XT as my initial plan was to go back to Linux. Since returning I've experienced multiple spontaneous MCE reboots. All happened while I was playing one particular game, Warcraft 3 Reforged. The MCE event is the following:

kernel: mce: [Hardware Error]: Machine check events logged
kernel: mce: [Hardware Error]: CPU 1: Machine Check: 0 Bank 5: bea0000000000108
kernel: mce: [Hardware Error]: TSC 0 ADDR 1ffffad66d6fe MISC d012000100000000 SYND 4d000000 IPID 500b000000000
kernel: mce: [Hardware Error]: PROCESSOR 2:870f10 TIME 1585120217 SOCKET 0 APIC 2 microcode 8701013
kernel: #2 #3 #4 #5 #6 #7 #8 #9 #10 #11 #12 #13 #14 #15
kernel: mce: [Hardware Error]: Machine check events logged
kernel: mce: [Hardware Error]: CPU 15: Machine Check: 0 Bank 5: bea0000000000108
kernel: mce: [Hardware Error]: TSC 0 ADDR 1ffffc1196eb6 MISC d012000100000000 SYND 4d000000 IPID 500b000000000
kernel: mce: [Hardware Error]: PROCESSOR 2:870f10 TIME 1585120217 SOCKET 0 APIC 9 microcode 8701013
kernel: #16 #17 #18 #19 #20 #21 #22 #23

Initially I figured it could be ram so I performed the usual test with no problems. Also tested with standard JEDEC as well and eventually received a MCE during Warcraft 3 reforged. After consulting with a few friends I decided to try a different power supply to no avail. I then bit the bullet and bought a brand new 3900x. I also cleared CMOS before getting my new 3900x and after. All CPU values are on auto with no PBO or manual overclocking. The only fancy is the ram in regards to XMP. Yesterday, after owning the new 3900x for three days, I had a MCE while I was playing Warcraft 3 Reforged. I have tested other games but none of them caused a MCE or any crashes / freezes for that matter. World of Warcraft, The Outer Worlds, Stellaris, and Counter-Strike: Global Offensive.

One thing to note is I haven't received it during desktop usage. Only in Warcraft 3. I do have desktop compositing in both Xfce and KDE disabled and always have. I used and tested both of them, and received the MCEs during those sessions.

I have noticed a pattern with the MCE crashes with Warcraft 3. They always happen during a GPU load drop off or increase transition. By that I mean when exiting a match to return to the lobby, or loading a map and when it switches from the loading screen to the match itself is when these MCE's happen. 

The entire screen quickly turns black, everything is hard locked, and then after about a minute or so the machine reboots on its own. It hasn't happened yet while in the middle of a match session, sitting in the lobby or at the main menu screen. It's consistently been during a transition.

My theory is that this could possibly be a GPU hang from switching from one power state to another. The GPU hanging causes the CPU to stall, and thus an MCE. The GPU hanging could also explain the quick solid black screen, as all output is stopped. But I'm really just assuming here from my own observations and my very limited understanding. A possible reason why this triggers in Warcraft is that the other games have few moments of heavy power state switching. The Outer Worlds, World of Warcraft, Stellaris, and Counter-Strike: Global Offensive all keep a constant high load on the GPU and the match sessions are long.

For what it's worth, I've had no major issues in Windows 10. The only quirks were initially a few TDRs that recovered from alt-tabbing out of most games with Google Chrome running. Disabling hardware acceleration in Chrome fixed those TDRs while alt-tabbing out of games.

I've also used both 3900x's to compile things like chromium and other large projects that last hours, perfectly fine. On the Windows side of things I've also done extensive stress testing with Prime95 and Aida64, along with long gaming sessions with Battlefield V, which utilizes AVX instructions and puts a load on all 24 threads.

From searching, I've found quite a few reports of people receiving MCEs that aren't the typical first-generation MCE reports from 2017 involving Ryzen. Those were fixed by disabling c-states, ram, and changing power supply current from low to typical. These ones within the past year appear to all have an AMD GPU in common. I did notice a few with Intel CPUs as well, paired with an AMD GPU.

Any feedback would be greatly appreciated.
Comment 13 Clemens Eisserer 2020-04-13 09:20:22 UTC
As mentioned by other users / in the reddit thread this issue really seems to be somehow GPU / PCIe related.

I've disabled GPU acceleration by switching to Xorg AccelMethod=none and the llvmpipe opengl rasterizer and despite using the machine more frequently I haven't had a single crash in the past 10 days.

It is still unclear to me how the GPU could trigger processor MCEs. Maybe the Windows drivers know about some CPU quirks and have workarounds implemented, while the Linux drivers still lack those?
Comment 14 Clemens Eisserer 2020-04-13 09:23:50 UTC
sorry, wrong conclusion - two minutes after writing this post, during shutting down the system there it was:

[    0.123020] mce: [Hardware Error]: Machine check events logged
[    0.123022] mce: [Hardware Error]: CPU 15: Machine Check: 0 Bank 5: bea0000000000108
[    0.123090] mce: [Hardware Error]: TSC 0 ADDR 1ffffb8a3b5ce MISC d012000100000000 SYND 4d000000 IPID 500b000000000 
[    0.123166] mce: [Hardware Error]: PROCESSOR 2:870f10 TIME 1586769715 SOCKET 0 APIC f microcode 8701013
Comment 15 Nicholas H. 2020-04-20 00:41:41 UTC
I have the same issue with my 3900X and 5700 XT but (in my case) it is not specific to AMD graphics cards.

On my machine the resets are most common during suspend to RAM, or immediately after my monitors go to sleep. I set my machine to suspend after one minute and have another machine sending WoL packets in a loop so I can reproduce the issue easily. I always get a reset before the 100th suspend.

I tested this setup with both a 5700 XT (amdgpu) and an Nvidia 780 Ti (nouveau) and the resets happen with both cards.
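
For anyone wanting to set up a similar suspend/wake loop, here is a minimal sketch of the waking side. It assumes the standard Wake-on-LAN magic packet layout (six 0xFF bytes followed by the target MAC repeated sixteen times) sent as a UDP broadcast. The port, the broadcast address and the function names are illustrative choices, not details taken from the setup described above.

```python
import socket


def build_magic_packet(mac: str) -> bytes:
    """Return the 102-byte Wake-on-LAN magic packet for a MAC address."""
    hex_digits = mac.replace(":", "").replace("-", "")
    if len(hex_digits) != 12:
        raise ValueError(f"not a 48-bit MAC address: {mac!r}")
    mac_bytes = bytes.fromhex(hex_digits)
    # Six 0xFF bytes followed by the MAC repeated sixteen times.
    return b"\xff" * 6 + mac_bytes * 16


def send_wol(mac: str, broadcast: str = "255.255.255.255", port: int = 9) -> None:
    """Broadcast one magic packet over UDP (port 9 is a common choice)."""
    packet = build_magic_packet(mac)
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.setsockopt(socket.SOL_SOCKET, socket.SO_BROADCAST, 1)
        sock.sendto(packet, (broadcast, port))
```

Calling send_wol() with the test machine's MAC every couple of minutes from a second machine would wake it after each automatic suspend. The interval has to be longer than the suspend timeout, and Wake-on-LAN has to be enabled in the firmware and on the NIC.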

Like the other reports, I've never had this happen in Windows.

In addition to the bea0000000000108 mce, I've had two others:

kernel: mce: [Hardware Error]: Machine check events logged
kernel: mce: [Hardware Error]: CPU 0: Machine Check: 0 Bank 22: baa000000002010b
kernel: mce: [Hardware Error]: TSC 0 MISC d012000100000000 SYND 4d000000 IPID 1813e17000 
kernel: mce: [Hardware Error]: PROCESSOR 2:870f10 TIME 1581274016 SOCKET 0 APIC 0 microcode 8701013

kernel: mce: [Hardware Error]: Machine check events logged
kernel: [Hardware Error]: System Fatal error.
kernel: [Hardware Error]: CPU:12 (17:71:0) MC3_STATUS[Over|UE|MiscV|-|PCC|TCC|SyndV|-|-|-]: 0xfaa0000000070118
kernel: [Hardware Error]: IPID: 0x000300b000000000, Syndrome: 0x000000004d000030
kernel: [Hardware Error]: Decode Unit Ext. Error Code: 7, Patch RAM sequencer parity error.
kernel: [Hardware Error]: cache level: RESV, tx: GEN, mem-tx: RD
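
The raw status words in this thread can be unpacked by hand. The sketch below is a rough decoder, not the kernel's implementation. The bit positions are an assumption inferred from the flag order the kernel prints above (Over, UE, MiscV, AddrV, PCC, TCC, SyndV) and should be checked against AMD's documentation for the CPU family. The function and table names are made up for illustration.

```python
# Assumed bit positions in the MCA status word; names are illustrative.
STATUS_BITS = {
    "Val": 63, "Over": 62, "UC": 61, "En": 60,
    "MiscV": 59, "AddrV": 58, "PCC": 57, "TCC": 55, "SyndV": 53,
}


def decode_mca_status(status: int) -> dict:
    """Split a 64-bit MCA status word into flags and error codes."""
    flags = [name for name, bit in STATUS_BITS.items() if (status >> bit) & 1]
    return {
        "flags": flags,
        # Assumed layout: extended error code in bits 21:16.
        "ext_error_code": (status >> 16) & 0x3F,
        # Assumed layout: base error code in bits 15:0.
        "error_code": status & 0xFFFF,
    }


for raw in (0xBEA0000000000108, 0xBAA000000002010B, 0xFAA0000000070118):
    decoded = decode_mca_status(raw)
    print(f"{raw:016x}", "|".join(decoded["flags"]),
          "ext", decoded["ext_error_code"], "code", hex(decoded["error_code"]))
```

Under these assumptions, 0xfaa0000000070118 yields extended error code 7 with AddrV clear, which agrees with the kernel's decoded line above. The recurring bea0000000000108 has AddrV set, consistent with an ADDR field being logged for it.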
Comment 16 Borislav Petkov 2020-04-20 17:06:12 UTC
(In reply to Nicholas H. from comment #15)
> In addition to the bea0000000000108 mce, I've had two others:

Looks like a different issue to me.

You should update your BIOS to the latest, if you haven't done so already.

Then, there was a recent issue with AMD GPUs which could trigger MCEs
too, see

https://bugzilla.kernel.org/show_bug.cgi?id=207331

The fix will be in stable kernels, if that has not happened yet, and it
might be worth a try. That's something which the other people affected by this should try too.

If none of that helps, you should return your CPU for replacement.

HTH.
Comment 17 Rich 2020-04-21 16:38:09 UTC
>> dmesg log shows: microcode: CPU0: patch_level=0x08701013
Ensure the microcode is updated to 08701021 after updating the BIOS.

>> Asrock Phantom Gaming 4 X570 (latest BIOS: 2.30)
Update the BIOS to the latest on the Asrock Phantom Gaming 4 X570. Note: some vendor BIOSes lag AMD code releases, so you may have to take the next BIOS update as well to ensure all known fixes are in.
BIOS upgrade page: https://www.asrock.com/mb/AMD/X570%20phantom%20Gaming%204/index.asp#BIOS
Latest: Version 2.60, 2020/4/16, 13.84MB, Instant Flash
How to update: https://www.asrock.com/support/BIOSIG.asp?cat=BIOS1
Comment 18 Clemens Eisserer 2020-04-22 10:52:58 UTC
Hi Rich,

Will 0x08701021 be published to the linux firmware git anytime soon, so I don't have to rely on my motherboard manufacturer?

I've updated to BIOS 2.6 a few days ago and the microcode patch level is still at 0x08701013.
Comment 19 Rich 2020-04-23 18:30:31 UTC
Hi Eisserer,

Are you still seeing the failure (light load, dmesg shows the same MCE)? By the way, that MCE is a catch-all and can have lots of possible causes.

It's better to go with the motherboard vendor BIOS as it includes all of AMD's updates, including the security and power management micro-controller and microcode for the cores.

Please post any new failure data and we will figure it out.
Monitoring CPU voltage, temperature, frequency, and activity level may provide a clue if this is on the hardware side.

Rich
Comment 20 Clemens Eisserer 2020-04-29 22:19:15 UTC
just experienced the same crash again - this time with BIOS 2.6 (still on microcode 8701013):

Quite often I've seen this crash when actually launching LibreOffice:


[    0.105648] .... node  #0, CPUs:        #1  #2  #3  #4  #5  #6  #7  #8  #9
[    0.116018] mce: [Hardware Error]: Machine check events logged
[    0.116019] mce: [Hardware Error]: CPU 9: Machine Check: 0 Bank 5: bea0000000000108
[    0.116087] mce: [Hardware Error]: TSC 0 ADDR 1ffffc066c3c6 MISC d012000100000000 SYND 4d000000 IPID 500b000000000 
[    0.116163] mce: [Hardware Error]: PROCESSOR 2:870f10 TIME 1588198590 SOCKET 0 APIC 3 microcode 8701013
[    0.116237]  #10 #11 #12 #13 #14
[    0.122019] mce: [Hardware Error]: Machine check events logged
[    0.122021] mce: [Hardware Error]: CPU 14: Machine Check: 0 Bank 5: bea0000000000108
[    0.122089] mce: [Hardware Error]: TSC 0 ADDR 7f40ca005e9e MISC d012000100000000 SYND 4d000000 IPID 500b000000000 
[    0.122164] mce: [Hardware Error]: PROCESSOR 2:870f10 TIME 1588198590 SOCKET 0 APIC d microcode 8701013
Comment 21 Clemens Eisserer 2020-04-30 05:38:05 UTC
to make sure this isn't a hardware fault, which simply is triggered more likely when running linux, I've swapped my ryzen-3700x (early one from 7/2019) with a new one ordered a few days ago.
Comment 22 Rich 2020-04-30 15:05:24 UTC
trying another CPU is a good idea....

bea0000000000108 means the thread has stopped executing. This is the longest timeout; all other hardware fault timers would/should fire before this.

It occurs on 2 threads, but one of them goes first. Both are thread 1, one in kernel mode code (ADDR 1ffffc066c3c6) and the other in user space code (ADDR 7f40ca005e9e).

This case has lots of possible causes: OS, app, voltage, temperature, board hardware (power delivery cases), memory (are you running ECC memory?).

What OS/version? And what version of LibreOffice? I can try launching LibreOffice repeatedly as well.
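
The kernel/user split mentioned above can be approximated from the logged ADDR values. The sketch below is only a heuristic. It assumes 48-bit virtual addressing on x86-64 Linux, where bit 47 set marks the upper (kernel) half, and it ignores higher bits because the logged values do not look like full 64-bit addresses. The function name is made up for illustration.

```python
def classify_mce_addr(addr: int) -> str:
    """Guess whether a logged ADDR belongs to kernel or user space.

    Assumption: 48-bit virtual addressing, where bit 47 set marks the
    upper (kernel) half on x86-64 Linux. Bits above 47 are ignored.
    """
    return "kernel" if (addr >> 47) & 1 else "user"


# ADDR values copied from the MCE logs in this report.
for addr in (0x1FFFFC066C3C6, 0x7F40CA005E9E, 0x7F6B3DBDFE9E):
    print(f"{addr:x}: {classify_mce_addr(addr)}")
```

With that assumption, 1ffffc066c3c6 is classified as kernel and 7f40ca005e9e as user space, matching the reading given above.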
Comment 23 Rich 2020-04-30 18:25:04 UTC
Tried 15 cycles with a random delay between app startups; no issues seen. I'm looking for a way to automate this case for continuous testing.

This is my setup. It is not identical to yours, just a reference point.

CPU0: AMD Ryzen 9 3950X 16-Core Processor (family: 0x17, model: 0x71, stepping: 0x0)
microcode: CPU0: patch_level=0x08701021

I'm running 	Ubuntu 18.04.1 LTS

LibreOffice Version:  Version: 6.0.6.2
Build ID: 1:6.0.6-0ubuntu0.18.04.1
CPU threads: 32; OS: Linux 4.15; UI render: default; VCL: gtk3; 
Locale: en-US (en_US.UTF-8); Calc: group
Comment 24 Clemens Eisserer 2020-04-30 19:38:44 UTC
Hi Rich,

I observed most crashes during cold LibreOffice (calc) startups, but those are not reproducible. However, my setup is a bit unusual: btrfs (-> high fragmentation) on dmcrypt, with LibreOffice installed via flatpak, so maybe IO plays a role here. However, I also saw crashes with firefox playing youtube in the background.

I am now using the system with the "fresh" 3700x, everything else unchanged, and will report back in 2-3 weeks (until now the longest period without an MCE was 10 days). Maybe it is really faulty hw after all...
Comment 25 Clemens Eisserer 2020-05-13 11:35:58 UTC
it seems the processor was fine - the new 3700x crashed today the same way the old one did:

[    0.292661] .... node  #0, CPUs:        #1  #2  #3  #4  #5  #6  #7  #8  #9 #10
[    0.303677] mce: [Hardware Error]: Machine check events logged
[    0.303679] mce: [Hardware Error]: CPU 10: Machine Check: 0 Bank 5: bea0000000000108
[    0.303747] mce: [Hardware Error]: TSC 0 ADDR 1ffffc0a9e3c6 MISC d012000100000000 SYND 4d000000 IPID 500b000000000 
[    0.304662] mce: [Hardware Error]: PROCESSOR 2:870f10 TIME 1589369644 SOCKET 0 APIC 5 microcode 8701013
[    0.304736]  #11 #12 #13 #14 #15
Comment 26 joel_damiano 2020-05-22 04:57:21 UTC
I also am having this problem. Initially, I couldn't even boot using the Fedora live CD. I found that adding 'nomodeset' to the grub boot line made the system unconditionally stable. I have tried all of the recommendations on the web to fix this problem. The only one which has helped was to replace 'nomodeset' with 'nowatchdog', which leaves the system somewhat stable. However, if I try to use a program which uses graphics acceleration (glmark2), the system promptly crashes. I will attach a copy of the output from journalctl, lspci, etc... for your review.
Comment 27 joel_damiano 2020-05-22 04:58:22 UTC
Created attachment 289219 [details]
Logs for debugging Machine Check crash
Comment 28 Roman C. 2020-05-22 11:53:25 UTC
I can confirm the reboots with a similar hardware setup:

AMD Ryzen 9 3950X
Gigabyte X570 Aorus Elite (latest BIOS F12f, microcode 0x08701013)
PowerColor Radeon RX 5700 XT Red Devil 8GB
4x Samsung M378A4G43MB1-CTD DDR4-2666

I use kernel 5.6.11 and have tried different kernels hoping for improvement. I also tried different BIOS setups, with no effect.

On my system the error happens on different cores, but I don't think this matters.

My observation is that the reboots happen in situations with low load on the system, but I can't reproduce the error with any particular behaviour.
Sometimes I can work 8 hours without a reboot and sometimes it reboots within the first 30 minutes.
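
To check whether the events really spread across cores, the "Machine Check" lines can be tallied from saved dmesg or journal output. The sketch below assumes the line format shown in the logs in this report. The regular expression and function names are illustrative and may need adjusting for other kernel versions.

```python
import re
from collections import Counter

# Matches e.g. "CPU 10: Machine Check: 0 Bank 5: bea0000000000108"
CPU_LINE = re.compile(
    r"CPU (?P<cpu>\d+): Machine Check: \d+ Bank (?P<bank>\d+): (?P<status>[0-9a-f]+)"
)


def parse_mce_events(log_text: str) -> list:
    """Extract CPU, bank and status from each 'Machine Check' log line."""
    events = []
    for line in log_text.splitlines():
        match = CPU_LINE.search(line)
        if match:
            events.append({
                "cpu": int(match["cpu"]),
                "bank": int(match["bank"]),
                "status": int(match["status"], 16),
            })
    return events


def count_by_cpu(events: list) -> Counter:
    """Count how many logged events each CPU number accounts for."""
    return Counter(event["cpu"] for event in events)


# Two lines copied from the original description of this bug.
SAMPLE = """\
[    0.707395] mce: [Hardware Error]: CPU 10: Machine Check: 0 Bank 5: bea0000000000108
[    0.709398] mce: [Hardware Error]: CPU 12: Machine Check: 0 Bank 5: bea0000000000108
"""

if __name__ == "__main__":
    print(count_by_cpu(parse_mce_events(SAMPLE)))
```

Feeding it the concatenated logs from several crashes gives a per-CPU tally, which makes it easier to see whether one core stands out.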
Comment 29 Rich 2020-05-22 15:43:03 UTC
(In reply to Clemens Eisserer from comment #25)
> it seems the processor was fine - the new 3700x crashed today the same way
> the old one did:
> 
> [    0.292661] .... node  #0, CPUs:        #1  #2  #3  #4  #5  #6  #7  #8 
> #9 #10
> [    0.303677] mce: [Hardware Error]: Machine check events logged
> [    0.303679] mce: [Hardware Error]: CPU 10: Machine Check: 0 Bank 5:
> bea0000000000108
> [    0.303747] mce: [Hardware Error]: TSC 0 ADDR 1ffffc0a9e3c6 MISC
> d012000100000000 SYND 4d000000 IPID 500b000000000 
> [    0.304662] mce: [Hardware Error]: PROCESSOR 2:870f10 TIME 1589369644
> SOCKET 0 APIC 5 microcode 8701013
> [    0.304736]  #11 #12 #13 #14 #15


Hi Clemens, 
Yes, this is the same failure: the thread on Linux's CPU10 is no longer executing.

Since this is the 2nd CPU in the same setup, I would suspect power delivery. Power management state changes are typically what bring out power delivery issues.

My suggestion would be to turn off all power management in the OS (force P0 as the only state), the CPU (in BIOS setup options; this is vendor dependent) and the GPU (in Linux parameters like amdgpu.ppfeaturemask=0xffffbffb and amdgpu.dpm=0).

Also, clamping the behavior of the CPU VR can be accomplished through the AMD OC (overclocking) BIOS setup option. I would peg the frequency and voltage of the part.
Choose the base frequency of the CPU installed and set the voltage to 1400 (for 1.4V). This is more than adequate to run at base frequency on all cores.
Also, logging CPU voltages and temperature through a failure event might show whether the CPU VR/power delivery is going out of regulation.
Comment 30 Rich 2020-05-22 16:21:07 UTC
(In reply to joel_damiano from comment #27)
> Created attachment 289219 [details]
> Logs for debugging Machine Check crash

MCE Bank 5 Status: bea0000000000108 means the thread stopped executing and hung.
3 threads hung in the kernel
3 threads hung in a user space app.

Fails using glmark2.

This doesn't present like a CPU power delivery or CPU power management problem.
If a read out to the video card doesn't return data within the timeout period, this MCE will be present. This is the last error catch-all: there are usually other faults in the system that occur, but this one always gets logged.

>> The only one which has helped was to replace 'nomodeset' with 'nowatchdog',
>> which leaves the system somewhat stable.
>> However, if I try to use a program which uses graphics acceleration
>> (glmark2), the system promptly crashes.

I would try another video card or put the video card in another PCIe slot (one closer to the system power supply) and see if that modulates the failure rate.
Comment 31 joel_damiano 2020-05-23 05:10:31 UTC
I've already tried using the second PCIe slot, with no change in the system's behavior. Unfortunately, I don't have another video card to try. I did try appending amdgpu.ppfeaturemask=0xffffbffb and amdgpu.dpm=0 to the boot line as suggested by Rich, and that seems to have done the trick. I was able to run glmark2 without any problem. Not totally conclusive given the limited testing, but it is certainly a big step forward. Since I'm not an expert on this subject, I'd be curious to know what these parameters do and if there are any downsides to using them.
Comment 32 Vitalii 2020-05-23 12:08:23 UTC
Hi, I have the same problem on
AMD Ryzen 9 3900X
Video: Radeon HD 7850
PCI-E: Intel 82574L NIC
SATA disks only

PPT on CPU is limited to 85W and it's running Folding@Home almost all the time while system is on. Normally my desktop is running Openbox (no fancy desktop effects), and it's stable unless it's running some game (with or without Folding@Home in background).

AoW3 quite reliably crashed the system with the radeon GPU driver, but seems to be fine with the amdgpu driver. Then Euro Truck Simulator 2 crashed with amdgpu. I didn't try changing amdgpu.ppfeaturemask yet.

The above was happening on latest BIOS with AGESA 1.0.0.4 B. There are reports that some PCI-E cards don't work properly (Creative sound cards, Ethernet cards have issues with link somehow). So I downgraded to an older BIOS with AGESA 1.0.0.3 ABBA, can't tell if it helped yet, but output from lspci is a bit different for most devices in areas related to error reporting (and handling?). In either case CPU microcode is the same 0x08701013. If it'll crash again, I'll try changing something else...

Thanks
Comment 33 Rich 2020-05-23 15:35:38 UTC
(In reply to joel_damiano from comment #31)
> I've already tried using the second PCIe slot, with no change in the
> system's behavior. Unfortunately, I don't have another video card to try. I
> did try appending amdgpu.ppfeaturemask=0xffffbffb and amdgpu.dpm=0 to the
> boot line as suggested by Rich, and that seems to have done the trick. I was
> able to run glmark2 without any problem. Not totally conclusive given the
> limited testing, but it is certainly a big step forward. Since I'm not an
> expert on this subject, I'd be curious to know what these parameters do and
> if there are any downsides to using them.

amdgpu.ppfeaturemask and amdgpu.dpm turn on and off various features of the AMD GPU

amdgpu.dpm = 1 enables the override for dynamic power management

There is a lot of detail to review here: https://dri.freedesktop.org/docs/drm/gpu/amdgpu.html

The settings i gave are the recommended settings for RX 480 and RX 550 video cards to turn off the power management features of the video card.



for your OS installation find the enum PP_FEATURE_MASK  which (i think because i don't have a fedora install) lives in the following places:
drivers/gpu/drm/amd/powerplay/inc/hwmgr.h
drivers/gpu/drm/amd/include/amd_shared.h

my take is you'll find this enum

enum PP_FEATURE_MASK {
	PP_SCLK_DPM_MASK = 0x1,
	PP_MCLK_DPM_MASK = 0x2,
	PP_PCIE_DPM_MASK = 0x4,
	PP_SCLK_DEEP_SLEEP_MASK = 0x8,
	PP_POWER_CONTAINMENT_MASK = 0x10,
	PP_UVD_HANDSHAKE_MASK = 0x20,
	PP_SMC_VOLTAGE_CONTROL_MASK = 0x40,
	PP_VBI_TIME_SUPPORT_MASK = 0x80,
	PP_ULV_MASK = 0x100,
	PP_ENABLE_GFX_CG_THRU_SMU = 0x200,
	PP_CLOCK_STRETCH_MASK = 0x400,
	PP_OD_FUZZY_FAN_CONTROL_MASK = 0x800,
	PP_SOCCLK_DPM_MASK = 0x1000,
	PP_DCEFCLK_DPM_MASK = 0x2000,
	PP_OVERDRIVE_MASK = 0x4000,           
	PP_GFXOFF_MASK = 0x8000,
	PP_ACG_MASK = 0x10000,
	PP_STUTTER_MODE = 0x20000,
	PP_AVFS_MASK = 0x40000,
};

going with  amdgpu.ppfeaturemask=0xffffbffb  sets the following

PP_SCLK_DPM_MASK             = 1
PP_MCLK_DPM_MASK             = 1
PP_PCIE_DPM_MASK             = 0   This is PCIe Dynamic Power Management...which we override to off with the other parameter
PP_SCLK_DEEP_SLEEP_MASK      = 1
PP_POWER_CONTAINMENT_MASK    = 1
PP_UVD_HANDSHAKE_MASK        = 1
PP_SMC_VOLTAGE_CONTROL_MASK  = 1
PP_VBI_TIME_SUPPORT_MASK     = 1
PP_ULV_MASK                  = 1
PP_ENABLE_GFX_CG_THRU_SMU    = 1
PP_CLOCK_STRETCH_MASK        = 1
PP_OD_FUZZY_FAN_CONTROL_MASK = 1
PP_SOCCLK_DPM_MASK           = 1
PP_DCEFCLK_DPM_MASK          = 1
PP_OVERDRIVE_MASK            = 0   for higher frequency operation/overclocking
PP_GFXOFF_MASK               = 1
PP_ACG_MASK                  = 1
PP_STUTTER_MODE              = 1
PP_AVFS_MASK                 = 1
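
The table above can be reproduced mechanically for any mask value. The sketch below copies the bit values from the enum quoted earlier in this comment. Other kernel versions may define more or different bits, so the dictionary is an assumption tied to that listing, and the function names are illustrative.

```python
# Bit values copied from the PP_FEATURE_MASK enum quoted above.
PP_FEATURE_MASK = {
    "PP_SCLK_DPM_MASK": 0x1,
    "PP_MCLK_DPM_MASK": 0x2,
    "PP_PCIE_DPM_MASK": 0x4,
    "PP_SCLK_DEEP_SLEEP_MASK": 0x8,
    "PP_POWER_CONTAINMENT_MASK": 0x10,
    "PP_UVD_HANDSHAKE_MASK": 0x20,
    "PP_SMC_VOLTAGE_CONTROL_MASK": 0x40,
    "PP_VBI_TIME_SUPPORT_MASK": 0x80,
    "PP_ULV_MASK": 0x100,
    "PP_ENABLE_GFX_CG_THRU_SMU": 0x200,
    "PP_CLOCK_STRETCH_MASK": 0x400,
    "PP_OD_FUZZY_FAN_CONTROL_MASK": 0x800,
    "PP_SOCCLK_DPM_MASK": 0x1000,
    "PP_DCEFCLK_DPM_MASK": 0x2000,
    "PP_OVERDRIVE_MASK": 0x4000,
    "PP_GFXOFF_MASK": 0x8000,
    "PP_ACG_MASK": 0x10000,
    "PP_STUTTER_MODE": 0x20000,
    "PP_AVFS_MASK": 0x40000,
}


def decode_ppfeaturemask(mask: int) -> dict:
    """Map each known feature name to True (bit set) or False (bit clear)."""
    return {name: bool(mask & bit) for name, bit in PP_FEATURE_MASK.items()}


def disabled_features(mask: int) -> list:
    """List the known features whose bit is cleared in the mask."""
    return [name for name, bit in PP_FEATURE_MASK.items() if not mask & bit]


if __name__ == "__main__":
    for value in (0xFFFFBFFB, 0xFFFD7FFF):
        print(hex(value), disabled_features(value))
```

Against this enum, 0xffffbffb clears only PP_PCIE_DPM_MASK and PP_OVERDRIVE_MASK, matching the table. The earlier mask from comment 4, 0xfffd7fff, clears PP_GFXOFF_MASK and PP_STUTTER_MODE instead.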


some explanations are here:
https://www.kernel.org/doc/html/v4.20/gpu/drivers.html
https://wiki.archlinux.org/index.php/Kernel_parameters


I turn power management off on my productivity systems mostly because I don't want the entry/exit latency and the added stress on voltage regulators/caps/inductors/system power supply that comes with power management.
The AC power usage measured at the AC outlet for my entire system rarely exceeds 100W. Ryzen 3000 series 105W products are incredibly power efficient.
I favor performance over power savings on the machines i use to do work.
Comment 34 Rich 2020-05-24 17:45:26 UTC
(In reply to someguy108 from comment #12)
> Hello! Like Clemens, I've been having a similar issue with spontaneous
> reboots.
> This is my configuration:
> -Ryzen 3900x + Noctua D15
> -MSI X570 Unify (latest agesa as of writing)
> -DDR4 3200mhz 32GB kit
> -Sapphire Pulse 5700 XT
> -Corsair RMX 850 Watt
> -Arch Linux with kernel 5.5.13
> -Mesa 20.0.3
> -Early KMS enabled
> 
> I've had this system up and running since November 2019 but initially with a
> Nvidia 1060 and Windows 10. Everything was running smoothly. About a month
> ago I switched back over to Linux after purchasing my 5700 XT as my initial
> plan was to go back to Linux. Since returning I've experienced multiple
> spontaneous MCE reboots. All happened while I was playing one particular
> game, Warcraft 3 Reforged. The MCE event is the following:
> 
> kernel: mce: [Hardware Error]: Machine check events logged
> kernel: mce: [Hardware Error]: CPU 1: Machine Check: 0 Bank 5:
> bea0000000000108
> kernel: mce: [Hardware Error]: TSC 0 ADDR 1ffffad66d6fe MISC
> d012000100000000 SYND 4d000000 IPID 500b000000000
> kernel: mce: [Hardware Error]: PROCESSOR 2:870f10 TIME 1585120217 SOCKET 0
> APIC 2 microcode 8701013
> kernel: #2 #3 #4 #5 #6 #7 #8 #9 #10 #11 #12 #13 #14 #15
> kernel: mce: [Hardware Error]: Machine check events logged
> kernel: mce: [Hardware Error]: CPU 15: Machine Check: 0 Bank 5:
> bea0000000000108
> kernel: mce: [Hardware Error]: TSC 0 ADDR 1ffffc1196eb6 MISC
> d012000100000000 SYND 4d000000 IPID 500b000000000
> kernel: mce: [Hardware Error]: PROCESSOR 2:870f10 TIME 1585120217 SOCKET 0
> APIC 9 microcode 8701013
> kernel: #16 #17 #18 #19 #20 #21 #22 #23
> 
> Initially I figured it could be ram so I performed the usual test with no
> problems. Also tested with standard JEDEC as well and eventually received a
> MCE during Warcraft 3 reforged. After consulting with a few friends I
> decided to try a different power supply to no avail. I then bit the bullet
> and bought a brand new 3900x. I also cleared CMOS before getting my new
> 3900x and after. All CPU values are on auto with no PBO or manual
> overclocking. The only fancy is the ram in regards to XMP. Yesterday, after
> owning the new 3900x for three days, I had a MCE while I was playing
> Warcraft 3 Reforged. I have tested other games but none of them caused a MCE
> or any crashes / freezes for that matter. World of Warcraft, The Outer
> Worlds, Stellaris, and Counter-Strike: Global Offensive.
> 
> One thing to note is I haven't received it during desktop usage. Only in
> Warcraft 3. I do have desktop compositing in both Xfce and KDE disabled and
> always have. I used and tested both of them, and received the MCEs during
> those sessions.
> 
> I have noticed a pattern with the MCE crashes with Warcraft 3. They always
> happen during a GPU load drop off or increase transition. By that I mean
> when exiting a match to return to the lobby, or loading a map and when it
> switches from the loading screen to the match itself is when these MCE's
> happen. 
> 
> The entire screen quickly turns black, everything is hard locked, and then
> after about a minute or so the machine reboots on its own. It hasn't
> happened yet while in a middle of a match session, sitting in the lobby or
> at the main menu screen. Its consistently been during a transition. 
> 
> My theory is that this could possibly be a GPU hang from switching from one
> power state to another power state. With the GPU hanging, causes the CPU to
> stall, and thus a MCE. The GPU hanging could explain the quick solid black
> screen as well as all output is stopped. But I'm really just assuming here
> form my own observations from my very limited understanding. Possible reason
> why this triggers in Warcraft is because the other games have few moments of
> switching power states heavily. The Outer Worlds, World of Warcraft,
> Stellaris, and Counter-Strike Global Offensive all keep a constant high load
> on the GPU and the match sessions are long.
> 
> From what its worth, I've had no major issues in Windows 10. The only quirks
> where initially a few TDR's that recovered from alt tabing out of most games
> with Google Chrome running. Disabling hardware acceleration in Chrome fixed
> those TDR's while alt-tabing out of games. 
> 
> I've also used both 3900x's to compile things like chromium and other large
> projects that's last hours perfectly fine. On Windows side of things I've
> also done extensive stress testing with Prime95 and Aida64. Along with long
> gaming sessions with Battlefield V that utilizes AVX instructions and puts a
> load on all 24 threads.
> 
> From searching, I've found quite a few reports of people talking about
> receiving MCE's that isn't the typical first generation MCE's reports from
> 2017 involving Ryzen. Where those where fixed by disabling c-states, ram,
> and changing power supply current from low to typical. These ones within the
> past year appear to all have a AMD GPU in common. I did notice a few with
> Intel CPU's as well paired up with a AMD GPU.
> 
> Any feedback would be greatly appreciated.

Hi Someguy,

Usually when a system goes from stable to unstable, it's the last change made that induced the problem.
Changing the video card brings with it the problem of matching video driver and OS revisions.
I would suggest updating Windows to its latest update level and updating the video driver to its latest revision.

Turning off the video card's power management features is another thing to try.

https://community.amd.com/external-link.jspa?url=https%3A%2F%2Fwww.amd.com%2Fen%2Fsupport%2Fdriverhelp

Installing the Radeon software on Windows : https://www.amd.com/en/support/kb/faq/rsx-install


>> Yesterday, after owning the new 3900x for three days, I had a MCE while I
>> was playing Warcraft 3 Reforged. 
Same MCE? Bank 5: bea0000000000108
>> They always happen during a GPU load drop off or increase transition. 


This points to the video card's power management features and possibly its PCIe power management (ASPM, L1, L1 substates (L1ss), etc.). Turn them off.


>>Disabling hardware acceleration in Chrome fixed those TDR's while alt-tabing
>>out of games. 

Thanks for mentioning this one. I've been battling the TDR BSOD 0x116 on my laptop.
Comment 35 Clemens Eisserer 2020-05-25 06:15:42 UTC
I've now also plugged the GPU into a different PCIe slot + set nodpm/feature mask - only time will tell.

What intrigues me is that the system is rock solid running Windows 10; I haven't had a reboot/BSOD in months. Maybe the Windows drivers contain workarounds/quirk handling not present in their Linux counterparts...
Comment 36 joel_damiano 2020-05-28 04:36:08 UTC
I did some experimentation with kernel boot parameters. With both amdgpu.ppfeaturemask=0xffffbffb and amdgpu.dpm=0, the system was stable. With only amdgpu.dpm=0, the system was also stable. amdgpu.ppfeaturemask=0xffffbffb without amdgpu.dpm=0 would cause an immediate crash when running glmark2. I don't know if this is of any help in tracking down the problem, but I thought you might find it interesting.
Comment 37 Rich 2020-05-28 09:41:41 UTC
(In reply to joel_damiano from comment #36)
> I did some experimentation with kernel boot parameters. With both
> amdgpu.ppfeaturemask=0xffffbffb and amdgpu.dpm=0, the system was stable.
> With only amdgpu.dpm=0, the system was also stable.
> amdgpu.ppfeaturemask=0xffffbffb without amdgpu.dpm=0 would cause an
> immediate crash when running glmark2. I don't know if this is of any help in
> tracking down the problem, but I thought you might find it interesting.

Hi Joel,

My take is that one more experiment could narrow this down to a single change: disable PCIe DPM only.

Going with amdgpu.ppfeaturemask=0xffffbfff (without amdgpu.dpm=0) sets the following:

PP_SCLK_DPM_MASK             = 1
PP_MCLK_DPM_MASK             = 1
PP_PCIE_DPM_MASK             = 0   This is PCIe Dynamic Power Management.
PP_SCLK_DEEP_SLEEP_MASK      = 1
PP_POWER_CONTAINMENT_MASK    = 1
PP_UVD_HANDSHAKE_MASK        = 1
PP_SMC_VOLTAGE_CONTROL_MASK  = 1
PP_VBI_TIME_SUPPORT_MASK     = 1
PP_ULV_MASK                  = 1
PP_ENABLE_GFX_CG_THRU_SMU    = 1
PP_CLOCK_STRETCH_MASK        = 1
PP_OD_FUZZY_FAN_CONTROL_MASK = 1
PP_SOCCLK_DPM_MASK           = 1
PP_DCEFCLK_DPM_MASK          = 1
PP_OVERDRIVE_MASK            = 1    
PP_GFXOFF_MASK               = 1
PP_ACG_MASK                  = 1
PP_STUTTER_MODE              = 1
PP_AVFS_MASK                 = 1

Thanks for going through this,
Rich
Comment 38 Rich 2020-05-28 10:06:33 UTC
(In reply to Roman C. from comment #28)
> I can confirm the reboots with a similar hardware setup:
> 
> AMD Ryzen 9 3950X
> Gigabyte X570 Aorus Elite (latest BIOS F12f, microrcode 0x08701013)
> PowerColor Radeon RX 5700 XT Red Devil 8GB
> 4x Samsung M378A4G43MB1-CTD DDR4-2666
> 
> I use kernel 5.6.11 and used different kernels to hope for improvement. Also
> try different BIOS setups, with no effect.
> 
> On my system the error happens on different cores, but I don't think this
> matters.
> 
> My observation is that the reboots happen in a situation with low load on
> the system, but I can`t reproduce the error with an behaviour.
> Sometimes I can work 8 hours without a reboot and sometimes its reboots
> within the first 30 minutes.

Hi Roman,
Are there any files in /var/log/ we can look at?
Comment 39 Vitalii 2020-05-28 16:26:57 UTC
(In reply to Vitalii from comment #32)
> AMD Ryzen 9 3900X
> Video: Radeon HD 7850
> PCI-E: Intel 82574L NIC
> SATA disks only
Forgot to add: Debian 10, stock kernel: 4.19.0-9-amd64 #1 SMP Debian 4.19.118-2 (2020-04-29) x86_64

I tried a few things, with no particular success so far.

1) Downgrading to BIOS with AGESA 1.0.0.3 ABBA changes one thing. Reboots are still there, but there are no more MCEs logged during the boot.

2) Disabling IOMMU doesn't help in my case.

3) Setting amdgpu.ppfeaturemask doesn't help in my case either. Will try with amdgpu.dpm=0, but it forces some fixed noisy FAN profile on my video card and it's a bit annoying.

4) Interestingly, doing "cat /sys/kernel/debug/dri/0/amdgpu_regs" behaves way too similar to this issue. System locks for a few seconds, screen frozen, audio loops, and then reboot follows. To confirm, I went back to the latest BIOS, and now I get MCEs too:

May 19 21:38:25 vb kernel: mce: [Hardware Error]: Machine check events logged
May 19 21:38:25 vb kernel: mce: [Hardware Error]: CPU 16: Machine Check: 0 Bank 5: bea0000000000108
May 19 21:38:25 vb kernel: mce: [Hardware Error]: TSC 0 ADDR 7fdcd4481af8 MISC d012000100000000 SYND 4d000000 IPID 500b000000000 
May 19 21:38:25 vb kernel: mce: [Hardware Error]: PROCESSOR 2:870f10 TIME 1589913499 SOCKET 0 APIC b microcode 8701013
May 19 21:38:25 vb kernel: mce: [Hardware Error]: Machine check events logged
May 19 21:38:25 vb kernel: mce: [Hardware Error]: CPU 23: Machine Check: 0 Bank 5: bea0000000000108
May 19 21:38:25 vb kernel: mce: [Hardware Error]: TSC 0 ADDR 7fdcd42c16e6 MISC d012000100000000 SYND 4d000000 IPID 500b000000000 
May 19 21:38:25 vb kernel: mce: [Hardware Error]: PROCESSOR 2:870f10 TIME 1589913499 SOCKET 0 APIC 1d microcode 8701013
    [ no MCEs on the old BIOS, but crashes were present ]
May 28 19:05:17 vb kernel: mce: [Hardware Error]: Machine check events logged
May 28 19:05:17 vb kernel: mce: [Hardware Error]: CPU 11: Machine Check: 0 Bank 5: bea0000000000108
May 28 19:05:17 vb kernel: mce: [Hardware Error]: TSC 0 ADDR 7f9500f73c68 MISC d012000100000000 SYND 4d000000 IPID 500b000000000 
May 28 19:05:17 vb kernel: mce: [Hardware Error]: PROCESSOR 2:870f10 TIME 1590681911 SOCKET 0 APIC 1c microcode 8701013
May 28 19:09:17 vb kernel: mce: [Hardware Error]: Machine check events logged
May 28 19:09:17 vb kernel: mce: [Hardware Error]: CPU 14: Machine Check: 0 Bank 5: bea0000000000108
May 28 19:09:17 vb kernel: mce: [Hardware Error]: TSC 0 ADDR 7f78c01ee226 MISC d012000100000000 SYND 4d000000 IPID 500b000000000 
May 28 19:09:17 vb kernel: mce: [Hardware Error]: PROCESSOR 2:870f10 TIME 1590682152 SOCKET 0 APIC 5 microcode 8701013

The last two MCEs were caused by "cat /sys/kernel/debug/dri/0/amdgpu_regs". Some registers are dumped (there's a garbage on a screen), then it freezes. Don't know what this means.

Thanks
Comment 40 Rich 2020-05-28 17:02:03 UTC
(In reply to Vitalii from comment #39)
> (In reply to Vitalii from comment #32)
> > AMD Ryzen 9 3900X
> > Video: Radeon HD 7850
> > PCI-E: Intel 82574L NIC
> > SATA disks only
> Forgot to add: Debian 10, stock kernel: 4.19.0-9-amd64 #1 SMP Debian
> 4.19.118-2 (2020-04-29) x86_64
> 
> I tried a few things, with no particular success so far.
> 
> 1) Downgrading to BIOS with AGESA 1.0.0.3 ABBA changes one thing. Reboots
> are still there, but there are no more MCEs logged during the boot.
> 
> 2) Disabling IOMMU doesn't help in my case.
> 
> 3) Setting amdgpu.ppfeaturemask doesn't help in my case either. Will try
> with amdgpu.dpm=0, but it forces some fixed noisy FAN profile on my video
> card and it's a bit annoying.
> 
> 4) Interestingly, doing "cat /sys/kernel/debug/dri/0/amdgpu_regs" behaves
> way too similar to this issue. System locks for a few seconds, screen
> frozen, audio loops, and then reboot follows. To confirm, I went back to the
> latest BIOS, and now I get MCEs too:
> 
> May 19 21:38:25 vb kernel: mce: [Hardware Error]: Machine check events logged
> May 19 21:38:25 vb kernel: mce: [Hardware Error]: CPU 16: Machine Check: 0
> Bank 5: bea0000000000108
> May 19 21:38:25 vb kernel: mce: [Hardware Error]: TSC 0 ADDR 7fdcd4481af8
> MISC d012000100000000 SYND 4d000000 IPID 500b000000000 
> May 19 21:38:25 vb kernel: mce: [Hardware Error]: PROCESSOR 2:870f10 TIME
> 1589913499 SOCKET 0 APIC b microcode 8701013
> May 19 21:38:25 vb kernel: mce: [Hardware Error]: Machine check events logged
> May 19 21:38:25 vb kernel: mce: [Hardware Error]: CPU 23: Machine Check: 0
> Bank 5: bea0000000000108
> May 19 21:38:25 vb kernel: mce: [Hardware Error]: TSC 0 ADDR 7fdcd42c16e6
> MISC d012000100000000 SYND 4d000000 IPID 500b000000000 
> May 19 21:38:25 vb kernel: mce: [Hardware Error]: PROCESSOR 2:870f10 TIME
> 1589913499 SOCKET 0 APIC 1d microcode 8701013
>     [ no MCEs on the old BIOS, but crashes were present ]
> May 28 19:05:17 vb kernel: mce: [Hardware Error]: Machine check events logged
> May 28 19:05:17 vb kernel: mce: [Hardware Error]: CPU 11: Machine Check: 0
> Bank 5: bea0000000000108
> May 28 19:05:17 vb kernel: mce: [Hardware Error]: TSC 0 ADDR 7f9500f73c68
> MISC d012000100000000 SYND 4d000000 IPID 500b000000000 
> May 28 19:05:17 vb kernel: mce: [Hardware Error]: PROCESSOR 2:870f10 TIME
> 1590681911 SOCKET 0 APIC 1c microcode 8701013
> May 28 19:09:17 vb kernel: mce: [Hardware Error]: Machine check events logged
> May 28 19:09:17 vb kernel: mce: [Hardware Error]: CPU 14: Machine Check: 0
> Bank 5: bea0000000000108
> May 28 19:09:17 vb kernel: mce: [Hardware Error]: TSC 0 ADDR 7f78c01ee226
> MISC d012000100000000 SYND 4d000000 IPID 500b000000000 
> May 28 19:09:17 vb kernel: mce: [Hardware Error]: PROCESSOR 2:870f10 TIME
> 1590682152 SOCKET 0 APIC 5 microcode 8701013
> 
> The last two MCEs were caused by "cat /sys/kernel/debug/dri/0/amdgpu_regs".
> Some registers are dumped (there's a garbage on a screen), then it freezes.
> Don't know what this means.
> 
> Thanks



Garbage on the screen usually means the video controller got bad data or took a power/thermal event.

Machine Check: 0 Bank 5: bea0000000000108 means the thread is no longer executing.
4 threads hung.

Can you collect logs from /var/log on the next reboot?

>>  "cat /sys/kernel/debug/dri/0/amdgpu_regs" behaves way too similar to this
>>  issue. System locks for a few seconds, screen frozen, audio loops, and then
>>  reboot follows.
Just reading registers does this?
The system is in bad shape. Have you tried another OS install? Another SATA drive? Another system power supply? Another motherboard? Something fundamental is wrong.

What motherboard is this?
Comment 41 Alex Deucher 2020-05-28 17:20:21 UTC
(In reply to Vitalii from comment #39)
> 
> 4) Interestingly, doing "cat /sys/kernel/debug/dri/0/amdgpu_regs" behaves
> way too similar to this issue. System locks for a few seconds, screen
> frozen, audio loops, and then reboot follows. To confirm, I went back to the
> latest BIOS, and now I get MCEs too:
> 
<snip>

> The last two MCEs were caused by "cat /sys/kernel/debug/dri/0/amdgpu_regs".
> Some registers are dumped (there's a garbage on a screen), then it freezes.
> Don't know what this means.

This is expected.  The amdgpu_regs file just provides access to the GPU's MMIO registers for debugging.  You should not access it unless you know what you are doing.
Comment 42 Vitalii 2020-05-28 17:28:42 UTC
Hi Rich,

(In reply to Rich from comment #40)
> (In reply to Vitalii from comment #39)
> > (In reply to Vitalii from comment #32)
> Machine Check: 0 Bank 5: bea0000000000108 is thread is no longer
> executing.....
> 4 threads hung
> 
> can you collect logs from \var\logs on next reboot 
Sure, I'll attach some logs.

> >>  "cat /sys/kernel/debug/dri/0/amdgpu_regs" behaves way too similar to this
> >>  issue. System locks for a few seconds, screen frozen, audio loops, and
> then
> >>  reboot follows.
> just reading registers does this??  :
> the system is in bad shape...have you tried another OS install ? another
> sata drive ? another system power supply ? another motherboard.....something
> fundamental is wrong  
Yes, just reading registers does this. And by "garbage on a screen" I meant "binary data in a terminal", sorry for the confusion. I understand it's not supposed to work like this, and that dumping the whole register region can sometimes be a bad idea, but it's interesting that it somehow causes an MCE.

Regarding the other OS, I too can say that Windows 10 works more reliably (I had 1 or 2 crashes a long time ago, which I did not investigate), and there were no recent crashes in games. The PSU, video card, and disks are old and were perfectly stable on an old Phenom X4 system (125W TDP). I understand that these components are not necessarily 100% compatible with this system, though.

> what motherboard is this?
Gigabyte X570 Gaming X

Other than that, I tried amdgpu.dpm=0 and it affects the performance a lot; GL is about 8 times slower.

Thanks
Comment 43 Alex Deucher 2020-05-28 17:34:16 UTC
(In reply to Rich from comment #37)
> 
> going with  amdgpu.ppfeaturemask=0xffffbfff  (without amdgpu.dpm=0)  sets
> the following
> 
> PP_SCLK_DPM_MASK             = 1
> PP_MCLK_DPM_MASK             = 1
> PP_PCIE_DPM_MASK             = 0   This is PCIe Dynamic Power Managment..
> PP_SCLK_DEEP_SLEEP_MASK      = 1
> PP_POWER_CONTAINMENT_MASK    = 1
> PP_UVD_HANDSHAKE_MASK        = 1
> PP_SMC_VOLTAGE_CONTROL_MASK  = 1
> PP_VBI_TIME_SUPPORT_MASK     = 1
> PP_ULV_MASK                  = 1
> PP_ENABLE_GFX_CG_THRU_SMU    = 1
> PP_CLOCK_STRETCH_MASK        = 1
> PP_OD_FUZZY_FAN_CONTROL_MASK = 1
> PP_SOCCLK_DPM_MASK           = 1
> PP_DCEFCLK_DPM_MASK          = 1
> PP_OVERDRIVE_MASK            = 1    
> PP_GFXOFF_MASK               = 1
> PP_ACG_MASK                  = 1
> PP_STUTTER_MODE              = 1
> PP_AVFS_MASK                 = 1

Can you try and narrow down which feature(s) cause the problem by setting different bits in amdgpu.ppfeaturemask to disable different GPU power features?
Comment 44 Vitalii 2020-05-28 17:42:48 UTC
Created attachment 289387 [details]
kernel and X11 logs for a boot after crash when reading registers

Adding kernel and X11 logs for a boot after crash when reading registers
Comment 45 Alex Deucher 2020-05-28 17:52:03 UTC
(In reply to Vitalii from comment #42)
> 
> Other than that, I tried amdgpu.dpm=0 and it affects the performance a lot,
> GL is about 8 time slower.

Do you still get MCEs in that case?
Comment 46 Vitalii 2020-05-28 18:25:37 UTC
Hi Alex,

(In reply to Alex Deucher from comment #45)
> (In reply to Vitalii from comment #42)
> > 
> > Other than that, I tried amdgpu.dpm=0 and it affects the performance a lot,
> > GL is about 8 time slower.
> 
> Do you still get MCEs in that case?

I don't know yet. It's difficult to test because the GPU is slow; my normal test case right now is Euro Truck Simulator 2 (it usually takes 1-2 hours to trigger), which is now unusable. I'll try to test something else, but it'll take time.

I can test dumping the registers, and an MCE is still logged. As you said, this probably has little in common with normal usage. Just out of curiosity I tried "dd if=amdgpu_regs bs=4 | hexdump", and the last lines in the X11 terminal are (transcribed from a video, if I typed it correctly):
*
0012fa0 03ff 0002 0000 0000 0000 0000 0000 0000
0012fb0 0000 0000 0000 0000 0000 0000 0000 0000
*
0012ff0 0000 0000 cccc cccc 0000 0000 0000 0000
0013000 0000 0000 0000 0000 0000 0000 0000 0000
*
[freeze]

I'll get back to experiments.
Thanks
Comment 47 MrZomg 2020-05-28 20:58:16 UTC
(In reply to Rich from comment #37)
> going with  amdgpu.ppfeaturemask=0xffffbfff  (without amdgpu.dpm=0)  sets
> the following
> 
> PP_SCLK_DPM_MASK             = 1
> PP_MCLK_DPM_MASK             = 1
> PP_PCIE_DPM_MASK             = 0   This is PCIe Dynamic Power Managment..
> PP_SCLK_DEEP_SLEEP_MASK      = 1
> PP_POWER_CONTAINMENT_MASK    = 1
> PP_UVD_HANDSHAKE_MASK        = 1
> PP_SMC_VOLTAGE_CONTROL_MASK  = 1
> PP_VBI_TIME_SUPPORT_MASK     = 1
> PP_ULV_MASK                  = 1
> PP_ENABLE_GFX_CG_THRU_SMU    = 1
> PP_CLOCK_STRETCH_MASK        = 1
> PP_OD_FUZZY_FAN_CONTROL_MASK = 1
> PP_SOCCLK_DPM_MASK           = 1
> PP_DCEFCLK_DPM_MASK          = 1
> PP_OVERDRIVE_MASK            = 1    
> PP_GFXOFF_MASK               = 1
> PP_ACG_MASK                  = 1
> PP_STUTTER_MODE              = 1
> PP_AVFS_MASK                 = 1

Maybe I understand it wrong, but shouldn't the feature mask be 0xfffffffb to turn PCIe Power Management off? I am currently trying that mask.

Additionally, I want to strengthen the evidence that this problem is GPU related. My 3700X system with a B450 chipset was running fine with a 1080TI and an old R9 280X GPU. The problem surfaced the day after I replaced the R9 with a 5500XT. Because both cards are supported by the AMDGPU driver, absolutely no software change was needed. It may also be load related, as it happened to me when I closed VLC media player (I was playing around with enabling hardware acceleration for it, as it is not used by default on these cards).

Regards
Comment 48 Alex Deucher 2020-05-28 21:19:02 UTC
(In reply to MrZomg from comment #47)
> 
> Maybe i understand it wrong but shouldn't the feature mask be 0xfffffffb to
> turn PCIe Power Management off? I am currently trying that mask.

correct.
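
For anyone checking a mask by hand, here is a minimal Python sketch of the arithmetic. It assumes the feature bits are assigned in the order of the list quoted from comment 37, starting at bit 0 (which is consistent with PCIe DPM being 0x4); the helper name `disabled_features` is made up for this example and is not part of the driver.

```python
# Feature names in the order listed in comment 37.
# Assumption: list position == bit number (bit 0 first).
PP_FEATURES = [
    "PP_SCLK_DPM_MASK",
    "PP_MCLK_DPM_MASK",
    "PP_PCIE_DPM_MASK",
    "PP_SCLK_DEEP_SLEEP_MASK",
    "PP_POWER_CONTAINMENT_MASK",
    "PP_UVD_HANDSHAKE_MASK",
    "PP_SMC_VOLTAGE_CONTROL_MASK",
    "PP_VBI_TIME_SUPPORT_MASK",
    "PP_ULV_MASK",
    "PP_ENABLE_GFX_CG_THRU_SMU",
    "PP_CLOCK_STRETCH_MASK",
    "PP_OD_FUZZY_FAN_CONTROL_MASK",
    "PP_SOCCLK_DPM_MASK",
    "PP_DCEFCLK_DPM_MASK",
    "PP_OVERDRIVE_MASK",
    "PP_GFXOFF_MASK",
    "PP_ACG_MASK",
    "PP_STUTTER_MODE",
    "PP_AVFS_MASK",
]


def disabled_features(mask):
    """Return the listed features whose bit is cleared in mask."""
    return [name for bit, name in enumerate(PP_FEATURES)
            if not (mask >> bit) & 1]


# The three masks discussed in this thread.
for mask in (0xffffbfff, 0xfffffffb, 0xffffbffb):
    print(f"{mask:#010x}: disabled = {disabled_features(mask)}")
```

Under that assumption, 0xffffbfff clears only bit 14, 0xfffffffb clears only bit 2, and 0xffffbffb clears both.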
Comment 49 joel_damiano 2020-05-29 06:02:22 UTC
Hi Rich,

I ran more experiments, all without amdgpu.dpm on the boot line.

First I tried amdgpu.ppfeaturemask=0xffffbfff, then amdgpu.ppfeaturemask=0xfffffffb. In both cases, the system crashed when I tried to run glmark2.

I then ran through all values of the last nibble of the ppfeaturemask from 0xffffbff0 through 0xffffbfff and found the following pattern: If the lower two bits were both 1, the system would crash running glmark2. If one was 1 and the other 0, the system was stable and would run glmark2 without a problem. If both were 0, the system wouldn't boot. The screen would shut off. In this case, an error would be logged in the journal:

kernel: amdgpu: probe of 0000:0e:00.0 failed with error -110

So it seems like there is some sort of interaction between the PP_SCLK_DPM_MASK and PP_MCLK_DPM_MASK features. They can't both be enabled if the system is to be stable.
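
The observed pattern can be tabulated with a short Python sketch. It only restates the results reported above for this one card; the function name `classify` and the outcome strings are labels chosen for illustration, not a prediction for other hardware.

```python
BASE = 0xffffbff0  # upper bits as used in the experiment above


def classify(mask):
    """Map the two lowest mask bits to the outcome observed on this card."""
    sclk = mask & 0x1          # bit 0: PP_SCLK_DPM_MASK
    mclk = (mask >> 1) & 0x1   # bit 1: PP_MCLK_DPM_MASK
    if sclk and mclk:
        return "crash in glmark2"
    if sclk or mclk:
        return "stable"
    return "no boot (probe error -110)"


# Walk every value of the last nibble, 0xffffbff0 through 0xffffbfff.
for nibble in range(16):
    mask = BASE | nibble
    print(f"amdgpu.ppfeaturemask={mask:#010x}  {classify(mask)}")
```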

Joel
Comment 50 Rich 2020-05-30 14:17:53 UTC
(In reply to joel_damiano from comment #49)
> Hi Rich,
> 
> I ran more experiments, all without amdgpu.dpm on the boot line.
> 
> First I tried amdgpu.ppfeaturemask=0xffffbfff, then
> amdgpu.ppfeaturemask=0xfffffffb. In both cases, the system crashed when I
> tried to run glmark2.
> 
> I then ran through all values of the last nibble of the ppfeaturemask from
> 0xffffbff0 through 0xffffbfff and found the following pattern: If the lower
> two bits were both 1, the system would crash running glmark2. If one was 1
> and the other 0, the system was stable and would run glmark2 without a
> problem. If both were 0, the system wouldn't boot. The screen would shut
> off. In this case, an error would be logged in the journal:
> 
> kernel: amdgpu: probe of 0000:0e:00.0 failed with error -110
> 
> So it seems like there is some sort of interaction between the
> PP_SCLK_DPM_MASK and PP_MCLK_DPM_MASK features. They can't both be enabled
> if the system is to be stable.
> 
> Joel


Nice work narrowing this down!

OK, so the two working settings are:
	bit 0 PP_SCLK_DPM_MASK  = 0
	bit 1 PP_MCLK_DPM_MASK  = 1 
    amdgpu.ppfeaturemask=0xffffbffe
or 
	bit 0 PP_SCLK_DPM_MASK  = 1 
	bit 1 PP_MCLK_DPM_MASK  = 0
    amdgpu.ppfeaturemask=0xffffbffd


from prior boot
>> journalctl dump for most recent boot:
>> -- Logs begin at Sun 2020-01-19 23:24:02 PST, end at Thu 2020-05-21 21:32:25
>> PDT. --
>> ...
>> May 21 14:19:04 joel kernel: pci 0000:0e:00.0: [1002:67ef] type 00 class
>> 0x030000

This PCIe node:Bus:Device:Function [VID:DID] = [1002:67ef]
VID = 1002 is the ATI video group in AMD
DID = 67ef identifies the card as Baffin [Radeon RX 460/560D / Pro 450/455/460/555/555X/560/560X]
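
As a small illustration of how that log line breaks down, here is a Python sketch that splits it into its fields. The regular expression and the name `parse_pci_line` are assumptions made for this example (the line is the journal entry quoted above, with the wrapped part rejoined); it is not a general PCI log parser.

```python
import re

# The journal line quoted above, rejoined onto one line.
LINE = "pci 0000:0e:00.0: [1002:67ef] type 00 class 0x030000"

# node (domain) : bus : device . function, then [vendor:device] IDs.
PATTERN = re.compile(
    r"pci (?P<node>[0-9a-f]{4}):(?P<bus>[0-9a-f]{2}):"
    r"(?P<device>[0-9a-f]{2})\.(?P<function>[0-7]): "
    r"\[(?P<vid>[0-9a-f]{4}):(?P<did>[0-9a-f]{4})\]"
)


def parse_pci_line(line):
    """Return the address and ID fields as a dict, or None if no match."""
    match = PATTERN.search(line)
    return match.groupdict() if match else None


print(parse_pci_line(LINE))
```

The `vid`/`did` pair is what gets looked up in the PCI ID database to arrive at the card name.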

>> kernel: amdgpu: probe of 0000:0e:00.0 failed with error -110
My take on this is that the video card cannot be read; the card is hung.


>> So it seems like there is some sort of interaction between the
>> PP_SCLK_DPM_MASK and PP_MCLK_DPM_MASK features. 

Turning them both off turns off power management on both clocks, which may not be a valid config as enforced by the power management controller software.
SCLK is the System Clock
MCLK is the Memory Clock
Basically, the power management controller will downshift the clock frequency to reduce power consumption when there is no work to do.

My take is to benchmark both settings under your usage model and then go with the higher-performing option.

	bit 0 PP_SCLK_DPM_MASK  = 0
	bit 1 PP_MCLK_DPM_MASK  = 1 
    amdgpu.ppfeaturemask=0xffffbffe
or 
	bit 0 PP_SCLK_DPM_MASK  = 1 
	bit 1 PP_MCLK_DPM_MASK  = 0
    amdgpu.ppfeaturemask=0xffffbffd
    
In both cases, while the heavy stress workload is running, it would be good to monitor the amount of heat coming off the video card, or take an IR thermometer reading of the air coming out of the card.
If there is over 10C of temperature difference, I would give up the performance and choose the cooler option. Silicon that runs cooler lasts much longer.
Comment 51 Josh 2020-06-06 20:32:56 UTC
I'm having what I believe is the same problem. A newly built:

AMD 3950X 
Sapphire Nitro+ 5700 XT 8GB
MSI Unify X570 (BIOS updated to A3, latest stable)
850W Gold PSU
D15 Cooler
64GB LPX DDR4 memory

Kernel: 5.6.15-300.fc32.x86_64

It happens to me when the system resumes (at least tries to resume) from sleep. Sometimes it works and sometimes it doesn't. I'm running 3 screens from it and the two over HDMI go green and a DP one stays black.

Shortly after, the machine reboots, I see the MCE and then see it again when I log into Fedora (32, updates installed).

baa000000002010b isn't always there, but bea0000000000108 always is. 

Extract from my log:

[    0.003275] Speculative Store Bypass: Mitigation: Speculative Store Bypass disabled via prctl and seccomp
[    0.003454] Freeing SMP alternatives memory: 36K
[    0.107430] smpboot: CPU0: AMD Ryzen 9 3950X 16-Core Processor (family: 0x17, model: 0x71, stepping: 0x0)
[    0.107477] mce: [Hardware Error]: Machine check events logged
[    0.107478] mce: [Hardware Error]: CPU 0: Machine Check: 0 Bank 22: baa000000002010b
[    0.107479] mce: [Hardware Error]: TSC 0 MISC d012000100000000 SYND 4d000000 IPID 1813e17000 
[    0.107482] mce: [Hardware Error]: PROCESSOR 2:870f10 TIME 1591477649 SOCKET 0 APIC 0 microcode 8701013
[    0.107502] Performance Events: Fam17h+ core perfctr, AMD PMU driver.
[    0.107503] ... version:                0
[    0.107504] ... bit width:              48
[    0.107504] ... generic registers:      6
[    0.107504] ... value mask:             0000ffffffffffff
[    0.107504] ... max period:             00007fffffffffff
[    0.107504] ... fixed-purpose events:   0
[    0.107505] ... event mask:             000000000000003f
[    0.107531] rcu: Hierarchical SRCU implementation.
[    0.107813] NMI watchdog: Enabled. Permanently consumes one hw-PMU counter.
[    0.107943] smp: Bringing up secondary CPUs ...
[    0.107983] x86: Booting SMP configuration:
[    0.107983] .... node  #0, CPUs:        #1  #2  #3  #4  #5  #6  #7  #8  #9 #10 #11 #12 #13 #14 #15 #16 #17 #18 #19 #20 #21
[    0.134021] mce: [Hardware Error]: Machine check events logged
[    0.134022] mce: [Hardware Error]: CPU 21: Machine Check: 0 Bank 5: bea0000000000108
[    0.134025] mce: [Hardware Error]: TSC 0 ADDR 1ffffc069b1c8 MISC d012000100000000 SYND 4d000000 IPID 500b000000000 
[    0.134028] mce: [Hardware Error]: PROCESSOR 2:870f10 TIME 1591477649 SOCKET 0 APIC b microcode 8701013
[    0.134046]  #22 #23 #24 #25 #26 #27 #28 #29 #30 #31
[    0.146035] smp: Brought up 1 node, 32 CPUs
[    0.146035] smpboot: Max logical packages: 1
[    0.146036] smpboot: Total of 32 processors activated (224012.41 BogoMIPS)
[    0.149658] devtmpfs: initialized
[    0.149658] x86/mm: Memory block size: 128MB
Comment 52 Rich 2020-06-07 13:56:41 UTC
(In reply to Josh from comment #51)
> I'm having what I believe is the same problem. A newly built:
> 
> AMD 3950X 
> Sapphire Nitro+ 5700 XT 8GB
> MSI Unify X570 (BIOS updated to A3, latest stable)
> 850W Gold PSU
> D15 Cooler
> 64GB LPX DDR4 memory
> 
> Kernel: 5.6.15-300.fc32.x86_64
> 
> It happens to me when the system resumes (at least tries to resume) from
> sleep. Sometimes it works and sometimes it doesn't. I'm running 3 screens
> from it and the two over HDMI go green and a DP one stays black.
> 
> Shortly after, the machine reboots, I see the MCE and then see it again when
> I log into Fedora (32, updates installed).
> 
> baa000000002010b isn't always there, but bea0000000000108 always is. 
> 
> Extract from my log:
> 
> [    0.003275] Speculative Store Bypass: Mitigation: Speculative Store
> Bypass disabled via prctl and seccomp
> [    0.003454] Freeing SMP alternatives memory: 36K
> [    0.107430] smpboot: CPU0: AMD Ryzen 9 3950X 16-Core Processor (family:
> 0x17, model: 0x71, stepping: 0x0)
> [    0.107477] mce: [Hardware Error]: Machine check events logged
> [    0.107478] mce: [Hardware Error]: CPU 0: Machine Check: 0 Bank 22:
> baa000000002010b
> [    0.107479] mce: [Hardware Error]: TSC 0 MISC d012000100000000 SYND
> 4d000000 IPID 1813e17000 
> [    0.107482] mce: [Hardware Error]: PROCESSOR 2:870f10 TIME 1591477649
> SOCKET 0 APIC 0 microcode 8701013
> [    0.107502] Performance Events: Fam17h+ core perfctr, AMD PMU driver.
> [    0.107503] ... version:                0
> [    0.107504] ... bit width:              48
> [    0.107504] ... generic registers:      6
> [    0.107504] ... value mask:             0000ffffffffffff
> [    0.107504] ... max period:             00007fffffffffff
> [    0.107504] ... fixed-purpose events:   0
> [    0.107505] ... event mask:             000000000000003f
> [    0.107531] rcu: Hierarchical SRCU implementation.
> [    0.107813] NMI watchdog: Enabled. Permanently consumes one hw-PMU
> counter.
> [    0.107943] smp: Bringing up secondary CPUs ...
> [    0.107983] x86: Booting SMP configuration:
> [    0.107983] .... node  #0, CPUs:        #1  #2  #3  #4  #5  #6  #7  #8 
> #9 #10 #11 #12 #13 #14 #15 #16 #17 #18 #19 #20 #21
> [    0.134021] mce: [Hardware Error]: Machine check events logged
> [    0.134022] mce: [Hardware Error]: CPU 21: Machine Check: 0 Bank 5:
> bea0000000000108
> [    0.134025] mce: [Hardware Error]: TSC 0 ADDR 1ffffc069b1c8 MISC
> d012000100000000 SYND 4d000000 IPID 500b000000000 
> [    0.134028] mce: [Hardware Error]: PROCESSOR 2:870f10 TIME 1591477649
> SOCKET 0 APIC b microcode 8701013
> [    0.134046]  #22 #23 #24 #25 #26 #27 #28 #29 #30 #31
> [    0.146035] smp: Brought up 1 node, 32 CPUs
> [    0.146035] smpboot: Max logical packages: 1
> [    0.146036] smpboot: Total of 32 processors activated (224012.41 BogoMIPS)
> [    0.149658] devtmpfs: initialized
> [    0.149658] x86/mm: Memory block size: 128MB

>> [    0.107478] mce: [Hardware Error]: CPU 0: Machine Check: 0 Bank 22: baa000000002010b
Bank 22 is NBIO; MCA_STATUS_NBIO[21:16] = 0x2: 'ErrEvent'.
This is the logic block above the PCIe interface, and it indicates an error occurred in a transaction to/from a PCIe link. Are there other PCIe cards installed? (You may try removing them and attempting to induce the failure with successive sleep/wake testing.)
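
To show where that [21:16] field sits, here is a minimal Python sketch that splits the two status values from this log into fields. The flag positions used are the commonly documented MCA status bits (63 down to 57), and applying them to these banks is an assumption; `decode_status` is a name invented for this example. It does not interpret the bank-specific meaning of the codes.

```python
# Commonly documented MCA status flag bits (assumed to apply here).
FLAGS = {
    63: "VAL",    # register holds a valid error
    62: "OVER",   # an earlier error was overwritten
    61: "UC",     # uncorrected error
    60: "EN",     # error reporting was enabled
    59: "MISCV",  # MISC register is valid
    58: "ADDRV",  # ADDR register is valid
    57: "PCC",    # processor context corrupt
}


def decode_status(status):
    """Split a 64-bit MCA status value into flags and code fields."""
    flags = [name for bit, name in sorted(FLAGS.items(), reverse=True)
             if (status >> bit) & 1]
    return {
        "flags": flags,
        "ext_code": (status >> 16) & 0x3f,  # bits [21:16]
        "error_code": status & 0xffff,      # bits [15:0]
    }


# Bank number and status value as they appear in the log above.
for bank, status in ((5, 0xbea0000000000108), (22, 0xbaa000000002010b)):
    print(f"Bank {bank}: {decode_status(status)}")
```

With these assumptions the Bank 22 value has bits [21:16] = 0x2 and ADDRV clear, which fits the Bank 22 log line carrying no ADDR field, while the Bank 5 value has ADDRV set.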

>> It happens to me when the system resumes (at least tries to resume) from
>> sleep
This can have lots of causes: BIOS not updated, a power management issue, a power delivery issue, a PCIe card that doesn't resume properly...

Have you tried updating the video card's driver and VBIOS?
Have you tried the various amdgpu.ppfeaturemask settings just to narrow this down?
Have you tried S4 (hibernate) instead of S3 (sleep)?
Comment 53 Alex Deucher 2020-06-10 15:46:09 UTC
hmmm... I vaguely recall the core kernel pci function pcie_bandwidth_available() causing problems on some platforms.  Does avoiding that call in the driver help?  You can force the pcie gen and lanes via module parameters.  E.g., append
amdgpu.pcie_gen_cap=0x00070007 amdgpu.pcie_lane_cap=0x00ff0000
to the kernel command line in grub which will force pcie gen3 and 16 lanes.
Comment 54 MrZomg 2020-06-11 08:06:16 UTC
Just wanted to give you guys a status update from my side:

a) I haven't had a crash with amdgpu.ppfeaturemask=0xffffbfff yet, so this seems to work for me.
b) Since the weekend I've upgraded to kernel 5.7.0.

I'll now try without the ppfeaturemask again to see if the problem reappears.

@Alex Deucher: Maybe it's of interest that I've been running the card in a chipset PCIe 2.0 x4 slot the whole time.
Comment 55 Alex Deucher 2020-06-11 13:13:58 UTC
(In reply to MrZomg from comment #54)
> Just wanted to give you guys a status update from my side:
> 
> a) i didn't have crash with amdgpu.ppfeaturemask=0xffffbfff yet, so this
> seems to work for me
>

Are you sure?  0xffffbfff is the default setting.
Comment 56 MrZomg 2020-06-11 14:15:30 UTC
(In reply to Alex Deucher from comment #55)
> (In reply to MrZomg from comment #54)
> > Just wanted to give you guys a status update from my side:
> > 
> > a) i didn't have crash with amdgpu.ppfeaturemask=0xffffbfff yet, so this
> > seems to work for me
> >
> 
> Are you sure?  0xffffbfff is the default setting.

Nope. Sorry. I've been using amdgpu.ppfeaturemask=0xfffffffb.
Comment 57 Roman C. 2020-06-15 20:30:26 UTC
As short feedback my Radeon RX 5700 works stable with amdgpu.ppfeaturemask=0xffffbffd.
I couldn't test many different settings, because I don't have an easy reproducible scenario to cause the error. But the setting works now for many days and hours.

It also worked with amdgpu.ppfeaturemask=0xffffbffb and amdgpu.dpm=0 but used 95 to 60 watt.

Thanks for your support! This behaviour was really annoying.
Comment 58 Alex Deucher 2020-06-15 20:34:42 UTC
(In reply to Roman C. from comment #57)
> As short feedback my Radeon RX 5700 works stable with
> amdgpu.ppfeaturemask=0xffffbffd.
> I couldn't test many different settings, because I don't have an easy
> reproducible scenario to cause the error. But the setting works now for many
> days and hours.
> 
> It also worked with amdgpu.ppfeaturemask=0xffffbffb and amdgpu.dpm=0 but
> used 95 to 60 watt.

Setting dpm=0 disables all GPU power management so the ppfeaturemask is ignored in that case.
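
To double-check which of the two options is actually in effect on a given boot, a small sketch like this can inspect the kernel command line. It only looks at `amdgpu.*` tokens on the command line (options set through modprobe configuration are not covered), and the helper names are hypothetical.

```python
def amdgpu_params(cmdline):
    """Collect amdgpu.<name>=<value> tokens from a kernel command line string."""
    params = {}
    for token in cmdline.split():
        if token.startswith("amdgpu.") and "=" in token:
            key, value = token[len("amdgpu."):].split("=", 1)
            params[key] = value
    return params


def describe(cmdline):
    """Summarise the power management override, applying the rule that dpm=0 wins."""
    params = amdgpu_params(cmdline)
    if params.get("dpm") == "0":
        return "dpm=0: GPU power management disabled, ppfeaturemask is ignored"
    mask = params.get("ppfeaturemask")
    if mask is not None:
        # int(..., 0) accepts both the 0x-prefixed and the decimal spelling.
        return "ppfeaturemask=%#010x in effect" % int(mask, 0)
    return "no amdgpu power management override on the command line"
```

On a running system the input would be the contents of /proc/cmdline, e.g. `describe(open("/proc/cmdline").read())`.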
Comment 59 Paul Menzel 2020-06-15 20:44:55 UTC
For the record, bug 206487 (AMD Ryzen: Random freezes/crashes with enabled C-State C6) is about the same problems, and it happens with all Dell OptiPlex 5055 we have here, which run GNU/Linux. No problems are reported when run with Microsoft Windows 10. Unfortunately, it’s hard to reproduce. We are going to try the suggestions, and report back in the other bug report.

[1]: https://bugzilla.kernel.org/show_bug.cgi?id=206487
Comment 60 Paul Menzel 2020-06-15 20:47:48 UTC
(In reply to Clemens Eisserer from comment #35)
> I've now also plugged the GPU into a different PCIe slot + set nodpm/feature
> mask - only time will tell.

Clemens, what is the status after testing this for three weeks? We are anxious to know.
Comment 61 Josh 2020-06-16 06:36:01 UTC
(In reply to Rich from comment #52)

Sorry for the delay in getting back to you. I switched from Fedora 32 to Pop OS (unrelated to this problem). I was getting lots of display issues on the default kernel (5.3 I think). I updated manually to 5.7.1, which stopped 99% of the issues. I've been running that for just over a week or so and essentially:

Sleeping is hit and miss. Sometimes it works, sometimes it doesn't.
Sometimes it goes to sleep fine and wakes up fine, sometimes it sleeps fine but reboots on resume. Sometimes it failed to go to sleep at all and just seemed unresponsive and required a reset.

It also struggled with locking the screens, which would sometimes cause it to reboot. I also had an issue where the screens would just sit on black (powered on). That was solved by turning off auto detect input I think.

Another time, the fans on the GPU kept spooling up and down and then eventually the PC reset itself.

I also got green flashes on the two HDMI output monitors (but not on the DP one) immediately after logging into Gnome. That hasn't happened for a while though.

Yesterday my entire display output froze, but audio was still playing fine. I presume if I was able to SSH into the box and restart the window server that might have fixed it, but I couldn't and had to reset.

The reassuring thing is they all seem related to graphics now.

> (In reply to Josh from comment #51)

> This is the logic block above the PCIe interface...and indicates an error
> occurred in a transaction to/from a PCIe link.  Are there other PCIe cards
> installed? (may try to remove them and attempt to induce the failure with
> successive sleep - wake testing)

There is nothing else plugged into PCI slots, but an EVO Plus 1TB NVMe in the top M.2 slot.

> 
> >> It happens to me when the system resumes (at least tries to resume) from
> >> sleep
> this can have lots of causes....BIOS not updated, power management issue,
> power delivery issue, PCie card doesn't resume properly.....
BIOS is latest stable, PSU is good.

> Have you tried updating the video card's driver and VBIOS ?
No, I can't seem to find out how for this card.

> Have you tried the various amdgpu.ppfeaturemask settings just to narrow this
> down ?
I'm now running 5.7.1 with amdgpu.ppfeaturemask=0xffffbffd and a few test sleeps  and screen locks, nothing bad has happened so far. I'll keep you updated.

> Have you tried S4 (hibernate) instead of S3 (sleep) ?
When I tried this in Fedora 32 they had the same intermittent results.

I thought I could reproduce it in Fedora by sleeping while there was a VirtualBox machine running, but I could only trigger it once. VMs don't seem to make a difference.

One way I can improve the likelihood the machine will resume after sleep now, is to lock it manually but keep the mouse moving so the screens don't shutdown, and then sleep it. Might be placebo but it felt like it was improving things slightly.

Thank you for your help and hard work! :)
Comment 62 Clemens Eisserer 2020-06-17 14:52:20 UTC
Paul: Until now the system has been stable with dpm=0. This option leads to low gpu/memory clocks, however it is acceptable because I actually bought the RX570 only to drive my 4k display at 60Hz.

However I haven't used the system a lot recently with Linux (mostly Windows-10 for coding on a specific project - where it was, as always, rock solid). So maybe the next reboot is just hours away ;)
Comment 63 Alex Deucher 2020-06-17 16:10:45 UTC
(In reply to Paul Menzel from comment #59)
> For the record, bug 206487 (AMD Ryzen: Random freezes/crashes with enabled
> C-State C6) is about the same problems, and it happens with all Dell
> OptiPlex 5055 we have here, which run GNU/Linux. No problems are reported
> when run with Microsoft Windows 10. Unfortunately, it’s hard to reproduce.
> We are going to try the suggestions, and report back in the other bug report.
> 
> [1]: https://bugzilla.kernel.org/show_bug.cgi?id=206487

What makes you think this bug is related?  It's a different processor, and according to your comments, messing with the GPU driver power options has no effect.
Comment 64 Jens Reimann 2020-06-17 21:00:08 UTC
I am having the same issue:

---
[    0.107886] mce: [Hardware Error]: Machine check events logged
[    0.107887] mce: [Hardware Error]: CPU 0: Machine Check: 0 Bank 22: baa000000002010b
[    0.107888] mce: [Hardware Error]: TSC 0 MISC d012000100000000 SYND 4d000000 IPID 1813e17000 
[    0.107890] mce: [Hardware Error]: PROCESSOR 2:870f10 TIME 1592426039 SOCKET 0 APIC 0 microcode 8701013
---

* Fedora 32
* AMD Ryzen 9 3900X 12-Core Processor (family: 0x17, model: 0x71, stepping: 0x0)
* AMD Radeon 570
* Samsung NVME Evo Plus
* Asus ROG X570 motherboard

Everything was stable for around two weeks. Sleep mode working fine. I also put the system under load (except for the GPU).

Once I tried out Minecraft, putting some load on the GPU for the first time, the system reset.
Comment 65 Josh 2020-06-18 05:57:32 UTC
(In reply to Josh from comment #61)
> (In reply to Rich from comment #52)
> > Have you tried the various amdgpu.ppfeaturemask settings just to narrow
> this
> > down ?
> I'm now running 5.7.1 with amdgpu.ppfeaturemask=0xffffbffd and a few test
> sleeps  and screen locks, nothing bad has happened so far. I'll keep you
> updated.

Good news! My system seems stable on 5.7.1 with amdgpu.ppfeaturemask=0xffffbffd

It's able to sleep, sleep displays, logout and login again.

Only a couple of minor things: when it shuts down, my primary display (3440x1440) has a flash of what I assume is corrupt video signal on the right hand side. I've not measured it but I assume the GPU is only sending out a 16:9 signal (it is displaying terminal/console output at this point during the shutdown). Not an issue for me, but it might help.

I also booted Minecraft last night, it lagged in the menu a little bit and the graphics aren't as good as I remember them in Fedora 32. But more than playable.
Comment 66 Paul Menzel 2020-06-18 14:24:08 UTC
Vitalii, if I am not mistaken, you are the only one reporting this for Linux 4.19.x, and the only one having some kind of reproducer (games). Is it possible, you are having a separate issue?
Comment 67 Vitalii 2020-06-18 15:43:38 UTC
Hi Paul, I don't know, maybe. It doesn't seem to be clear what exactly the problem is, except that it may be related to GPU power management. I have a somewhat old video card (HD 7850 - southern islands), which has an independent DPM implementation in amdgpu driver and ppfeaturemask does not affect its behavior, as far as I can see, so it's more difficult to check if it's the same issue. Other than that, I didn't have much time to investigate yet. I also never use suspend.
Thanks
Comment 68 Josh 2020-06-22 08:36:01 UTC
(In reply to Josh from comment #65)
> (In reply to Josh from comment #61)
> > (In reply to Rich from comment #52)
> > > Have you tried the various amdgpu.ppfeaturemask settings just to narrow
> > this
> > > down ?
> > I'm now running 5.7.1 with amdgpu.ppfeaturemask=0xffffbffd and a few test
> > sleeps  and screen locks, nothing bad has happened so far. I'll keep you
> > updated.
> 
> Good news! My system seems stable on 5.7.1 with
> amdgpu.ppfeaturemask=0xffffbffd
> 
> It's able to sleep, sleep displays, logout and login again.
> 
> Only a couple of minor things, when it shuts down, my primary display
> (3440x1440) has a flash of what I assume is corrupt video signal on the
> right hand side. I've not measured it but I assume the GPU is only sending
> out a 16:9 signal (it is displaying terminal/console output as this point
> during the shutdown). Not an issue for me, but it might help.
> 
> I also booted Minecraft last night, it lagged in the menu a little bit and
> the graphics aren't as good as I remember them in Fedora 32. But more than
> playable.

I may have spoken too soon. It still seems to hang when entering sleep now and again. The screens go off but the system never sleeps. Audio stops.

Upon pressing the keyboard to wake it, audio resumes and all the screens power back up, but on solid black output. When 5.8.1 is out I'll try that out. Would I still need the feature mask?
Comment 69 Paul Menzel 2020-06-30 08:22:42 UTC
I am still trying to wrap my head around the issue, and am missing some details. Clemens, it’d be great if you answered the questions below.

(In reply to Clemens Eisserer from comment #0)
> Ever since building my new PC I experience spontaneous (every week or so)
> reboots caused by machine check exceptions (always same bank and code -
> please see below). The reboots tend to happen in low-load situations (e.g.
> right after loading the desktop, or when playing youtube videos) - high load
> doesn't seem to make it worse.

Clemens, one more question. Was Linux 5.5.9, taken from the bug meta data, the earliest version you experienced this with? In your report you write, it happened for six months already.

[…]

> [    0.707393] mce: [Hardware Error]: Machine check events logged
> [    0.707395] mce: [Hardware Error]: CPU 10: Machine Check: 0 Bank 5:
> bea0000000000108
> [    0.707464] mce: [Hardware Error]: TSC 0 ADDR 1ffffbb03343c MISC
> d012000100000000 SYND 4d000000 IPID 500b000000000
> [    0.707540] mce: [Hardware Error]: PROCESSOR 2:870f10 TIME 1583508288
> SOCKET 0 APIC 5 microcode 8701013
> [    0.709397] mce: [Hardware Error]: Machine check events logged
> [    0.709398] mce: [Hardware Error]: CPU 12: Machine Check: 0 Bank 5:
> bea0000000000108
> [    0.709468] mce: [Hardware Error]: TSC 0 ADDR 1ffffbba3a05a MISC
> d012000100000000 SYND 4d000000 IPID 500b000000000
> [    0.709543] mce: [Hardware Error]: PROCESSOR 2:870f10 TIME 1583508288
> SOCKET 0 APIC 9 microcode 870101

I am still not sure how this MCE is related at all. The MCE is visible on the *next* reboot, right, and right before the crash no MCE is logged, right? Or are you seeing every time after a crash/freeze?
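
One data point that may help with this question: the TIME field in the MCE lines looks like Unix epoch seconds, so it can be compared against the time of the crash and of the following boot. A minimal sketch, assuming that interpretation is right (the function name is made up):

```python
from datetime import datetime, timezone


def mce_time_to_utc(seconds):
    """Interpret an MCE TIME value as seconds since the Unix epoch, in UTC."""
    return datetime.fromtimestamp(seconds, tz=timezone.utc)


# TIME value taken from the log quoted above.
print(mce_time_to_utc(1583508288).isoformat())
```

For the value quoted above this gives a time on 6 March 2020. The dmesg timestamps of well under a second after boot suggest the records are being printed during the boot that follows the crash.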
Comment 70 Jens Reimann 2020-06-30 09:03:56 UTC
I am not sure if this is related, but I found the following when doing a `dmesg`:

```
[ 8579.583454] ata7: SATA link down (SStatus 0 SControl 300)
[ 8579.583705] ata6: SATA link down (SStatus 0 SControl 300)
[ 8579.583718] ata5: SATA link down (SStatus 0 SControl 300)
[ 8579.744431] [drm] PCIE GART of 256M enabled (table at 0x000000F400000000).
[ 8579.869852] ------------[ cut here ]------------
[ 8579.869903] WARNING: CPU: 1 PID: 17174 at drivers/gpu/drm/amd/amdgpu/../display/amdgpu_dm/amdgpu_dm.c:1761 dm_resume+0x31b/0x370 [amdgpu]
[ 8579.869903] Modules linked in: snd_seq_dummy snd_hrtimer rfcomm xt_CHECKSUM xt_MASQUERADE xt_conntrack ipt_REJECT nf_nat_tftp nft_objref nf_conntrack_tftp tun nft_fib_inet nft_fib_ipv4 nft_fib_ipv6 nft_fib nft_reject_inet nf_reject_ipv4 nf_reject_ipv6 nft_reject nft_ct nf_tables_set nft_chain_nat ip6table_nat ip6table_mangle ip6table_raw ip6table_security iptable_nat nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 libcrc32c iptable_mangle iptable_raw iptable_security ip_set nf_tables nfnetlink ip6table_filter ip6_tables iptable_filter cmac bnep sunrpc vfat fat edac_mce_amd uvcvideo kvm_amd videobuf2_vmalloc videobuf2_memops joydev videobuf2_v4l2 kvm videobuf2_common videodev iwlmvm btusb irqbypass btrtl eeepc_wmi btbcm btintel asus_wmi snd_usb_audio sparse_keymap mac80211 bluetooth video mxm_wmi wmi_bmof snd_usbmidi_lib snd_hda_codec_realtek snd_rawmidi mc cp210x snd_hda_codec_generic ledtrig_audio pcspkr snd_hda_codec_hdmi libarc4 ecdh_generic ecc snd_hda_intel sp5100_tco k10temp
[ 8579.869917]  i2c_piix4 iwlwifi snd_intel_dspcfg snd_hda_codec snd_hda_core cfg80211 snd_hwdep snd_seq snd_seq_device snd_pcm snd_timer rfkill snd soundcore acpi_cpufreq ip_tables dm_crypt amdgpu amd_iommu_v2 gpu_sched ccp ttm igb drm_kms_helper crct10dif_pclmul crc32_pclmul crc32c_intel dca r8169 i2c_algo_bit ghash_clmulni_intel drm nvme nvme_core wmi pinctrl_amd br_netfilter bridge stp llc fuse
[ 8579.869924] CPU: 1 PID: 17174 Comm: kworker/u64:13 Not tainted 5.6.19-300.fc32.x86_64 #1
[ 8579.869925] Hardware name: System manufacturer System Product Name/ROG STRIX X570-E GAMING, BIOS 1409 05/12/2020
[ 8579.869928] Workqueue: events_unbound async_run_entry_fn
[ 8579.869970] RIP: 0010:dm_resume+0x31b/0x370 [amdgpu]
[ 8579.869971] Code: 8b 83 d4 66 00 00 83 e0 03 83 f8 01 74 36 48 89 ef e8 99 ce 3d c4 31 c0 48 83 c4 18 5b 5d 41 5c 41 5d c3 0f 0b e9 40 ff ff ff <0f> 0b e9 d7 fe ff ff 89 c6 48 c7 c7 80 97 84 c0 e8 d0 0c c1 ff e9
[ 8579.869971] RSP: 0018:ffffa153ca813d38 EFLAGS: 00010202
[ 8579.869972] RAX: 0000000000000002 RBX: ffff948c831e0000 RCX: 0000000000000006
[ 8579.869972] RDX: ffff948c3dfa1800 RSI: ffff9487f8081980 RDI: ffff948c83337000
[ 8579.869973] RBP: 0000000000000000 R08: ffff948c969de278 R09: 0000000000000000
[ 8579.869973] R10: ffff948c92f27b40 R11: 00000000000000f0 R12: ffff948c969de000
[ 8579.869973] R13: ffff9488fcf1e400 R14: ffffffff853dfb1f R15: 0000000000000010
[ 8579.869974] FS:  0000000000000000(0000) GS:ffff948c9ea40000(0000) knlGS:0000000000000000
[ 8579.869974] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 8579.869975] CR2: 0000000000000000 CR3: 000000076c80a000 CR4: 0000000000340ee0
[ 8579.869975] Call Trace:
[ 8579.870006]  amdgpu_device_ip_resume_phase2+0x52/0xb0 [amdgpu]
[ 8579.870034]  ? amdgpu_device_fw_loading+0xa0/0x110 [amdgpu]
[ 8579.870061]  amdgpu_device_resume+0x80/0x2e0 [amdgpu]
[ 8579.870064]  ? pm_runtime_enable+0x59/0xb0
[ 8579.870065]  ? pci_pm_restore+0xe0/0xe0
[ 8579.870066]  dpm_run_callback+0x4f/0x140
[ 8579.870067]  device_resume+0x136/0x200
[ 8579.870067]  async_resume+0x19/0x50
[ 8579.870068]  async_run_entry_fn+0x39/0x160
[ 8579.870069]  process_one_work+0x1b4/0x380
[ 8579.870070]  worker_thread+0x53/0x3e0
[ 8579.870070]  ? process_one_work+0x380/0x380
[ 8579.870071]  kthread+0x115/0x140
[ 8579.870072]  ? __kthread_bind_mask+0x60/0x60
[ 8579.870074]  ret_from_fork+0x22/0x40
[ 8579.870076] ---[ end trace 527992a575e73b9e ]---
[ 8580.378356] [drm] Fence fallback timer expired on ring sdma0
[ 8580.882361] [drm] Fence fallback timer expired on ring sdma0
[ 8581.386352] [drm] Fence fallback timer expired on ring sdma0
[ 8581.890351] [drm] Fence fallback timer expired on ring sdma0
[ 8581.921965] [drm] UVD and UVD ENC initialized successfully.
[ 8582.022979] [drm] VCE initialized successfully.
[ 8582.031721] PM: resume devices took 2.757 seconds
[ 8582.031729] OOM killer enabled.
[ 8582.031729] Restarting tasks ... done.
[ 8582.033372] thermal thermal_zone0: failed to read out thermal zone (-61)
[ 8582.033374] PM: suspend exit
[ 8582.083510] RTL8125 2.5Gbps internal r8169-500:00: attached PHY driver [RTL8125 2.5Gbps internal] (mii_bus:phy_addr=r8169-500:00, irq=IGNORE)
[ 8582.183274] r8169 0000:05:00.0 enp5s0: Link is Down
[ 8584.727313] r8169 0000:05:00.0 enp5s0: Link is Up - 1Gbps/Full - flow control rx/tx
```
Comment 71 Rich 2020-06-30 12:50:36 UTC
(In reply to Jens Reimann from comment #64)
> I am having the same issue:
> 
> ---
> [    0.107886] mce: [Hardware Error]: Machine check events logged
> [    0.107887] mce: [Hardware Error]: CPU 0: Machine Check: 0 Bank 22:
> baa000000002010b
> [    0.107888] mce: [Hardware Error]: TSC 0 MISC d012000100000000 SYND
> 4d000000 IPID 1813e17000 
> [    0.107890] mce: [Hardware Error]: PROCESSOR 2:870f10 TIME 1592426039
> SOCKET 0 APIC 0 microcode 8701013
> ---
> 
> * Fedora 32
> * AMD Ryzen 9 3900X 12-Core Processor (family: 0x17, model: 0x71, stepping:
> 0x0)
> * AMD Radeon 570
> * Samsung NVME Evo Plus
> * Asus ROG X570 motherboard
> 
> Everything was stable for around two weeks. Sleep mode working fine. I also
> but the system under load (except for the GPU).
> 
> Once I tried out Minecraft, putting some load on the GPU for the first time,
> the system reset.

Bank 22: baa000000002010b is a stuck transaction in the path of a PCIe port off the CPU. It's hard to isolate an issue based on this alone.
Comment 72 Jens Reimann 2020-06-30 15:15:05 UTC
(In reply to Rich from comment #71)
> 
> Bank 22:baa000000002010b is a stuck transaction in the path of a PCIe port
> off the CPU......its hard to isolate an issue just based on this....

Is there any additional information I can provide?
Comment 73 Rich 2020-06-30 16:23:50 UTC
(In reply to Jens Reimann from comment #72)
> (In reply to Rich from comment #71)
> > 
> > Bank 22:baa000000002010b is a stuck transaction in the path of a PCIe port
> > off the CPU......its hard to isolate an issue just based on this....
> 
> Is there any additional information I can provide?

The complete dmesg log could have some more clues. Does Linux collect the PCIe AER (Advanced Error Reporting) error registers? Windows would collect them and deposit failure data in the event log.

Knowing what the system was doing when it crashed can help: was it idle at the desktop, or had it just launched an app that does lots of video or lots of compute? Did what was launched talk to a specific PCIe card connected to the CPU? Did the power stay up, or did it cycle DC power? Did the POST code LEDs change, and where are they at failure steady state?

If this happens a lot, then I'd look closely at the PCIe cards in the CPU PCIe slots. Try playing with the power management options of the PCIe bus (for example, turn off L1 and the L1 substates, which are things PCIe endpoints commonly have problems with), or toggling the GPU card's power management options to see if there are any dependencies on a particular power management feature on the card itself.
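
On the AER question: as far as I know, kernels built with PCIe AER support report such errors in the kernel log on lines that mention "AER", but treat that as an assumption to verify on your kernel. A rough filter that pulls those lines and the "[Hardware Error]" lines out of a saved dmesg or journal dump could look like this (names are illustrative only):

```python
import re

# Lines worth a closer look: machine check records and anything mentioning AER.
PATTERN = re.compile(r"\[Hardware Error\]|\bAER\b")


def interesting_lines(log_text):
    """Return the log lines that mention a hardware error or PCIe AER."""
    return [line for line in log_text.splitlines() if PATTERN.search(line)]
```

Feeding it the text of `dmesg` captured right after a crash-and-reboot keeps the attachment short while preserving the lines people here have been asking for.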
Comment 74 Jens Reimann 2020-07-01 10:33:22 UTC
(In reply to Rich from comment #73)
> (In reply to Jens Reimann from comment #72)
> > (In reply to Rich from comment #71)
> > > 
> > > Bank 22:baa000000002010b is a stuck transaction in the path of a PCIe port
> > > off the CPU......its hard to isolate an issue just based on this....
> > 
> > Is there any additional information I can provide?
> 
> the  complete dmesg log could have some more clues..Does linux collect the 
> PCIe AER (Advanced Error Recovery) error registers ? Windows would collect
> them  and deposit failure data in the event log...

I don't know. I am just using Linux. I will upload the dmesg log the next time the machine resets.

> 
> What was  the system doing when it crashed can help...was it idle at
> desktop, just launched an app that has lots of video or lots of compute? did
> what was launched talk to a specific PCie card connected to the CPU ? did
> the power stay up? or did it cycle DC power? did the postcode LEDs change
> and where are they at failure steady state?

First time I experienced this was, when I was playing Minecraft for the first time on that machine. And that was the first time I put the GPU under "load". Before that, I was only using the CPU under load. Since then, I am not playing Minecraft anymore (on this machine). All the other cases (still up to today) had been when using Zoom meetings. About 5-10 minutes into the call. I am using other video conferencing software (Bluejeans, Google Meeting) without issues.

I never had any problem with a high CPU load though.

> did what was launched talk to a specific PCie card connected to the CPU ?

Don't know, how can I find that out?

> did the power stay up? or did it cycle DC power?

Don't know either. The fans kept blowing as they always do. I guess that could mean the power wasn't lost.

> did the postcode LEDs change and where are they at failure steady state?

I have no idea what that means. Sorry.

> 
> if this happens alot, then i'd look closely at the PCIe cards in the CPU
> PCie slots....playing with the power management options of the PCie bus
> (like turn off L1 and L1-substates are usually things PCIe endpoints can
> have problems with) or toggling the GPU cards power management options to
> see if there is any dependencies on a particular power management feature on
> the card itself.

I am sorry, but I am not a Kernel developer. So I don't know anything about all of this.

It happens around 1-2 times a week. Always when using Zoom. Around 5-10 minutes into the call. Both screens go gray. The system reboots. And then it works again until the next time.

I see the MCE messages after that in the dmesg log. I also saw the other warnings in the dmesg log when returning from sleep.

If you have any specific commands I should run, before or after the crash, I am happy to do that and report the results.
Comment 75 Alex Deucher 2020-07-01 16:12:52 UTC
Created attachment 290035 [details]
possible fix

For those of you with polaris GPUs, does this patch fix the issue?
Comment 76 Jens Reimann 2020-07-04 12:48:07 UTC
Created attachment 290091 [details]
dmesg log while running at high CPU load

I am not sure this is the same issue, but as it points again to the GPU, maybe it is related.

Today I had the case again that the machine didn't come back from sleep. Keyboard powered, screens not, and also not the network. I couldn't find anything afterwards in the system log.

I ran some load on the CPU for >1 hour, and during that time one of the two screens went dark. I could re-enable it, using the KDE display config dialog. Please see the attached dmesg log.
Comment 77 Tiago Silva 2020-07-08 19:38:05 UTC
Hi,
I am new here!
I just created my account because I have been following the thread for a while already, and I probably have some useful information that is worth sharing.

My setup is a Ryzen 7 2700 in a B450M Aorus (firmware updated to the latest version). My video board is a RX550 PowerColor. Following this thread, I switched it to a GTX1030 Gigabyte to avoid the amdgpu device module. However, the issue of restarts when idling with MCE errors still persisted, and at the same frequency.

I can reproduce the error just playing some YouTube video with nothing else running. In about 20 minutes the system crashes.

The problem was over for me only when I turned off Cool'n'Quiet in the BIOS setup. I did not try the trick of changing the kernel parameter for amdgpu, so it is set to the default value.

Changing the C-state did not work out for me, but I have seen people on the internet claiming that using the "Typical currents" option was enough. Honestly, I found this option only in the Power Source Control parameter, but changing it did not work either. The point is that with Cool'n'Quiet turned off the processor is not able to lower its frequency to save power. On Windows I can reach frequencies of 4,1GHz, which I am not able to see in Linux.
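
For anyone who wants to compare the clocks Linux reports with and without this BIOS setting, a sketch along these lines reads the per-core frequency from sysfs. It assumes a cpufreq driver is loaded and exposes `scaling_cur_freq` (in kHz) for each core; the function name is made up.

```python
from pathlib import Path


def current_frequencies_mhz(root="/sys/devices/system/cpu"):
    """Map each cpuN directory name to its current frequency in MHz."""
    freqs = {}
    for cur in Path(root).glob("cpu[0-9]*/cpufreq/scaling_cur_freq"):
        cpu = cur.parent.parent.name          # e.g. "cpu0"
        freqs[cpu] = int(cur.read_text().strip()) / 1000.0  # kHz to MHz
    return freqs
```

Sampling `current_frequencies_mhz()` a few times while idle and under load should show whether the cores still change frequency at all.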

I am not an expert myself, but if there is some additional information I can provide, please let me know. The only thing is that probably detailed instructions on how to get it should be provided.

Sincerely,
Tiago
Comment 78 busdma 2020-08-01 14:58:49 UTC
Experiencing similar issues.
OS: Arch Linux
KERNEL: 5.7.11-arch1-1
CPU: AMD Ryzen 5 3600
GPU: AMD Radeon RX 5700 XT (PowerColor Red Devil)
GPU DRIVER: 4.6 Mesa 20.1.4
RAM: 32 GB
MOTHERBOARD: MSI B450 Tomahawk MAX

amd-ucode version 20200721.2b823fc-1

To me this is really easy to reproduce, just launching the Assassin's Creed Origins game with steamplay machine checks my PC.
I can provide more information if needed.
Comment 79 Rich 2020-08-01 16:44:42 UTC
(In reply to busdma from comment #78)
> Experiencing similar issues.
> OS: Arch Linux
> KERNEL: 5.7.11-arch1-1
> CPU: AMD Ryzen 5 3600
> GPU: AMD Radeon RX 5700 XT (PowerColor Red Devil)
> GPU DRIVER: 4.6 Mesa 20.1.4
> RAM: 32 GB
> MOTHERBOARD: MSI B450 Tomahawk MAX
> 
> amd-ucode version 20200721.2b823fc-1
> 
> To me this is really easy to reproduce, just launching the Assassin's Creed
> Origins game with steamplay machine checks my PC.
> I can provide more information if needed.

Hi,

Can you provide the machine check codes? I'll decode them.
Comment 80 busdma 2020-08-01 17:35:16 UTC
(In reply to Rich from comment #79)
> (In reply to busdma from comment #78)
> > Experiencing similar issues.
> > OS: Arch Linux
> > KERNEL: 5.7.11-arch1-1
> > CPU: AMD Ryzen 5 3600
> > GPU: AMD Radeon RX 5700 XT (PowerColor Red Devil)
> > GPU DRIVER: 4.6 Mesa 20.1.4
> > RAM: 32 GB
> > MOTHERBOARD: MSI B450 Tomahawk MAX
> > 
> > amd-ucode version 20200721.2b823fc-1
> > 
> > To me this is really easy to reproduce, just launching the Assassin's Creed
> > Origins game with steamplay machine checks my PC.
> > I can provide more information if needed.
> 
> Hi,
> 
> can you provide the Machine check codes..i'll decode them.

from journalctl:

...
mce: [Hardware Error]: Machine check events logged
mce: [Hardware Error]: CPU 0: Machine Check: 0 Bank 22: baa000000002010b
mce: [Hardware Error]: TSC 0 MISC d012000100000000 SYND 4d000000 IPID 1813e17000 
mce: [Hardware Error]: PROCESSOR 2:870f10 TIME 1595724426 SOCKET 0 APIC 0 microcode 8701021
...
smp: Bringing up secondary CPUs ...
x86: Booting SMP configuration:
.... node  #0, CPUs:        #1  #2  #3
mce: [Hardware Error]: Machine check events logged
mce: [Hardware Error]: CPU 3: Machine Check: 0 Bank 5: bea0000000000108
mce: [Hardware Error]: TSC 0 ADDR 1ffffc12a027c MISC d012000100000000 SYND 4d000000 IPID 500b000000000
mce: [Hardware Error]: PROCESSOR 2:870f10 TIME 1595724426 SOCKET 0 APIC 8 microcode 8701021
...
Comment 81 busdma 2020-08-01 21:31:10 UTC
Just got a similar crash in The Witcher 3. 

Seems like the same errors:
mce: [Hardware Error]: Machine check events logged
mce: [Hardware Error]: CPU 0: Machine Check: 0 Bank 5: bea0000000000108
mce: [Hardware Error]: TSC 0 ADDR 1ffffc0ea427c MISC d012000100000000 SYND 4d000000 IPID 500b000000000 
mce: [Hardware Error]: PROCESSOR 2:870f10 TIME 1596316891 SOCKET 0 APIC 0 microcode 8701021
mce: [Hardware Error]: Machine check events logged
mce: [Hardware Error]: CPU 0: Machine Check: 0 Bank 22: baa000000002010b
mce: [Hardware Error]: TSC 0 MISC d012000100000000 SYND 4d000000 IPID 1813e17000 
mce: [Hardware Error]: PROCESSOR 2:870f10 TIME 1596316891 SOCKET 0 APIC 0 microcode 8701021

The errors only seem to happen in certain scenes or settings in games. I never get any crashes for activities like watching Youtube, unlike some other people in this thread.
The issue I'm having might be a different one, hopefully I'm not hijacking this thread.
Comment 82 Rich 2020-08-03 23:40:58 UTC
Hi busdma, the errors in comments 80 and 81 look like the same problem to me.

mce: [Hardware Error]: CPU 0: Machine Check: 0 Bank 22: baa000000002010b    ==>  Bank 22 is NBIO  MCA_STATUS_NBIO[21:16] = 0x2:'SDP port ErrEvent',  This implicates a PCIe slot downstream
mce: [Hardware Error]: CPU 3: Machine Check: 0 Bank 5:  bea0000000000108    ==>  Bank 5  is EX  [21:16] = 0x0  CPU WDT (watchdog timeout) ...means thread is not retiring micro-ops  in the time-out period.

I think you're in the same place as the others. Most likely a video card or other PCIe card has power management issues.
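
For anyone who wants to pull the same fields out of their own status words, here is a minimal sketch. It only decodes the extended error code in bits [21:16] used above, the low 16-bit error code, and the top status flags (bits 63 to 57), which I am assuming follow the usual x86 MCA layout. It does not map codes to names, since those are bank specific.

```python
# Top status flags, assumed to follow the common x86 MCA_STATUS layout.
FLAGS = {63: "VAL", 62: "OVER", 61: "UC", 60: "EN",
         59: "MISCV", 58: "ADDRV", 57: "PCC"}


def decode_status(status):
    """Split a 64-bit MCA status word into flags and error code fields."""
    return {
        "flags": [name for bit, name in sorted(FLAGS.items(), reverse=True)
                  if (status >> bit) & 1],
        "extended_error_code": (status >> 16) & 0x3F,  # bits [21:16]
        "error_code": status & 0xFFFF,                 # bits [15:0]
    }


for word in (0xBAA000000002010B, 0xBEA0000000000108):
    print(hex(word), decode_status(word))
```

The Bank 22 word gives extended error code 2 and the Bank 5 word gives 0, matching the two lines above. ADDRV is only set in the Bank 5 word, which fits the logs, where only the Bank 5 records print an ADDR value.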

have you tried walking through various video card power management cases?
for AMD video cards:
amdgpu.ppfeaturemask=0xffffbfbf (disable voltage control)
bit 14 PP_OVERDRIVE_MASK             = 0 
bit 6  PP_SMC_VOLTAGE_CONTROL_MASK   = 0

amdgpu.ppfeaturemask=0xfffbbfff (disable AVFS)
bit 18 PP_AVFS_MASK         = 0
bit 14 PP_OVERDRIVE_MASK    = 0 

amdgpu.ppfeaturemask=0xffffbff8 (disable all DPMs)
bit 14 PP_OVERDRIVE_MASK             = 0 
bit 0  PP_SCLK_DPM_MASK              = 0
bit 1  PP_MCLK_DPM_MASK              = 0
bit 2  PP_PCIE_DPM_MASK              = 0 

amdgpu.ppfeaturemask=0xffffbffe (disable sclk dpm)
bit 14 PP_OVERDRIVE_MASK             = 0 
bit 0  PP_SCLK_DPM_MASK              = 0

amdgpu.ppfeaturemask=0xffffbffd (disable mclk dpm)
bit 14 PP_OVERDRIVE_MASK             = 0 
bit 1  PP_MCLK_DPM_MASK              = 0

For PCIe cards, we've also seen that not all of them correctly support the L1 or L1 substates PCIe link power management states.
Could you try disabling L1 on the PCIe slots in BIOS setup?
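
To see at a glance what a given ppfeaturemask value turns off, the bit list above can be put into a small decoder. This sketch only knows the six bit names mentioned in this comment; any other cleared bit is reported by number rather than guessed at.

```python
# Bit positions and names exactly as listed in this comment.
PP_BITS = {
    0: "PP_SCLK_DPM_MASK",
    1: "PP_MCLK_DPM_MASK",
    2: "PP_PCIE_DPM_MASK",
    6: "PP_SMC_VOLTAGE_CONTROL_MASK",
    14: "PP_OVERDRIVE_MASK",
    18: "PP_AVFS_MASK",
}


def cleared_features(mask):
    """List the features whose bit is 0 in a 32-bit ppfeaturemask value."""
    names = []
    for bit in range(32):
        if not (mask >> bit) & 1:
            names.append(PP_BITS.get(bit, "bit %d (not named in this thread)" % bit))
    return names


for value in (0xFFFFBFFD, 0xFFFFFFFB, 0xFFFFBFF8):
    print(hex(value), cleared_features(value))
```

For example, 0xffffbffd clears bits 1 and 14 (mclk dpm and overdrive), while the 0xfffffffb value reported earlier in the thread clears only bit 2 (PCIe dpm) and leaves the overdrive bit set.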
Comment 83 Alex Deucher 2020-08-04 17:01:56 UTC
For those of you with Polaris GPUs (e.g., Rx580/RX570/RX560, etc.), can you try the patch in comment 75 without any workarounds applied?
Comment 84 busdma 2020-08-09 21:37:09 UTC
(In reply to Rich from comment #82)
Hi, sorry for the late reply. I've tried your suggestions, unfortunately without good results. 

My testing method is to launch the AC Origins game.

> have you tried walking through various video card power management cases?
> for AMD video cards:
> amdgpu.ppfeaturemask=0xffffbfbf (disable voltage control)
> bit 14 PP_OVERDRIVE_MASK             = 0 
> bit 6  PP_SMC_VOLTAGE_CONTROL_MASK   = 0
No difference (machine check error)

> amdgpu.ppfeaturemask=0xfffbbfff (disable AVFS)
> bit 18 PP_AVFS_MASK         = 0
> bit 14 PP_OVERDRIVE_MASK    = 0 
No difference (machine check error)
 
> amdgpu.ppfeaturemask=0xffffbff8 (disable all DPMs)
> bit 14 PP_OVERDRIVE_MASK             = 0 
> bit 0  PP_SCLK_DPM_MASK              = 0
> bit 1  PP_MCLK_DPM_MASK              = 0
> bit 2  PP_PCIE_DPM_MASK              = 0 
Hangs during boot

> amdgpu.ppfeaturemask=0xffffbffe (disable sclk dpm)
> bit 14 PP_OVERDRIVE_MASK             = 0 
> bit 0  PP_SCLK_DPM_MASK              = 0
Hangs during boot

> amdgpu.ppfeaturemask=0xffffbffd (disable mclk dpm)
> bit 14 PP_OVERDRIVE_MASK             = 0 
> bit 1  PP_MCLK_DPM_MASK              = 0
No difference (machine check error)

> for PCIe cards we've also see not all correctly support the L1 or L1
> substates PCIe link power management state
> could try disabling L1 on PCie slots in BIOS setup ?
I found no such option in my BIOS. I have an MSI board, does it have a specific name i could search for?
Comment 85 Rich 2020-08-10 15:56:04 UTC
(In reply to busdma from comment #84)


>>disabling L1 on PCie slots in BIOS setup 
On my AMI-based BIOS system it's under
AMD PBS -> PM L1 SS -> Disabled    This will disable the PCIe slots' L1 substates


What PCIe card do you have installed?
Can you collect the output of lspci -vvv -xxxx > filesave.txt
Comment 86 busdma 2020-08-10 18:46:58 UTC
Created attachment 290823 [details]
pci devices

pci devices
sudo lspci -vvv -xxxx > pci_devs.txt
Comment 87 Thomas Langkamp 2020-08-10 21:14:50 UTC
*** Bug 208573 has been marked as a duplicate of this bug. ***
Comment 88 exeskull1 2020-08-13 10:29:37 UTC
(In reply to Tiago Silva from comment #77)
> Hi,
> I am new here!
> I just created my account because I was following the thread for a while
> already, and probably I have some useful information that worth sharing.
> 
> My setup is an Ryzen 7 2700 in a B450M Aorus (firmware updated to the last
> version). My video board is a RX550 PowerColor. Following this thread, I
> switched it to a GTX1030 Gigabyte to avoid the amdgpu device module.
> However, the issue of restarts when idling with MCE errors still persisted.
> Even at the same frequency.
> 
> I can reproduce the error just playing some YouTube video with nothing else
> running. In about 20 minutes the system crashes.
> 
> The problem was over for me only when I turned off the Cool'n'Quiet in the
> BIOS setup. I did not tried the trick of changing the kernel parameter for
> amdgpu, so it is set to the default value.
> 
> Changing the C-state did not work out for me, but I have seen people on the
> internet claiming that using the "Typical currents" option was enough.
> Honestly, I find this option only in the Power Source Control parameter, but
> changing it did not work either. The point is that with the Cool'n'Quiet
> turned off the processor is not able to slow down the frequency to save
> power. In Windows system I can reach frequencies of 4,1GHz, which I am not
> able to see in Linux.
> 
> I am not an expert myself, but if there is some additional information I can
> provide, please let me know. The only thing is that probably detailed
> instructions on how to get it should be provided.
> 
> Sincerely,
> Tiago

I just created my account to say thank you man!


>The problem was over for me only when I turned off the Cool'n'Quiet in the
>>BIOS setup. I did not tried the trick of changing the kernel parameter for
>>amdgpu, so it is set to the default value.

After I did this, all my pain was gone.

Thank you Tiago!
Sincerely,
Sasa
Comment 89 Vitalii 2020-08-13 19:16:17 UTC
Hi, the new BIOS for my MB (Gigabyte X570 Gaming X, F20, AGESA ComboV2
1.0.0.2; 3900X; Radeon HD 7850) has an option to disable "Core watchdog".
The system reboots anyway, but the screen looks slightly different
just before turning black, and MCE is different -

reboot 1
Linux version 5.8.1 (gcc (Debian 8.3.0-6) 8.3.0, GNU ld (GNU Binutils for Debian) 2.31.1) #1 SMP Thu Aug 13 13:59:53 EEST 2020
21:16:08 - kernel: mce: [Hardware Error]: Machine check events logged
21:16:08 - kernel: mce: [Hardware Error]: CPU 0: Machine Check: 0 Bank 27: baa000000000080b
21:16:08 - kernel: mce: [Hardware Error]: TSC 0 MISC d012000100000000 SYND 5d000000 IPID 1002e00000500 
21:16:08 - kernel: mce: [Hardware Error]: PROCESSOR 2:870f10 TIME 1597342518 SOCKET 0 APIC 0 microcode 8701021

reboot 2, same 5.8.1 kernel
21:18:04 - kernel: mce: [Hardware Error]: Machine check events logged
21:18:04 - kernel: mce: [Hardware Error]: CPU 0: Machine Check: 0 Bank 27: baa000000000080b
21:18:04 - kernel: mce: [Hardware Error]: TSC 0 MISC d012000200000000 SYND 5d000000 IPID 1002e00000500 
21:18:04 - kernel: mce: [Hardware Error]: PROCESSOR 2:870f10 TIME 1597342659 SOCKET 0 APIC 0 microcode 8701021
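
A note on reading these records: the TIME field appears to be a Unix timestamp (seconds since the epoch), so it can be converted to see when the error was recorded, as opposed to when it was reported on the next boot. A minimal sketch, assuming that interpretation of TIME; the helper name mce_time_utc is hypothetical.

```python
from datetime import datetime, timezone


def mce_time_utc(epoch_seconds):
    """Convert the TIME value of an MCE record to an aware UTC datetime.

    Assumes TIME is seconds since the Unix epoch.
    """
    return datetime.fromtimestamp(epoch_seconds, tz=timezone.utc)


# TIME values from the two reboots above.
for t in (1597342518, 1597342659):
    print(t, "->", mce_time_utc(t).isoformat())
```

If the log times above are EEST (UTC+3), as the kernel build timestamp suggests, each TIME value falls shortly before the corresponding boot-time log line, which would fit an error recorded before the reset and reported afterwards.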

If "Core watchdog" is enabled (default), I get the same "Bank 5: bea0000000000108" on random CPUs.

Rich, is there anything interesting about this MCE?

Other than that, I found a way to reproduce the problem quickly on my
system. Two things are needed with the "radeon" driver:
1) running Age of Empires III in a map mode (runs in windowed mode on
   different virtual desktop)
2) "glmark2 --run-forever -b build:use-vbo=true" (foreground task)
And it reboots in 2-15 seconds after glmark2 start. Looks like
it's important that glmark2 is using vbo for some reason.

However, it looks like it does not reboot (or a reboot is much less likely)
if I do
# echo low > /sys/class/drm/card0/device/power_dpm_force_performance_level

If AoE3 and glmark2 are running already and power_dpm_force_performance_level
is switched to "high", it reboots very quickly. In other cases (e.g.
glmark2 but no AoE3), switching between "high" and "low" every second
does not reproduce the problem reliably.

I tried disabling Cool'n'Quiet in BIOS - no significant difference.

Thanks
Comment 90 yk749 2020-09-03 01:45:13 UTC
I am getting reproducible reboots with Ubuntu 20.04/18.04, Fedora 32, etc. Basically every distro I tried keeps rebooting. I could run Ubuntu 20.04 without issues a few months ago, but since last month it reboots right at the login page whenever I boot into it. I deleted Ubuntu (I was dual-booting Ubuntu 20.04 and Windows 10 Pro; no issues with Windows) and tried many other distros (from a live stick); all suffer from the same issue.

my hw configuration is:
CPU : Ryzen 3950x
GPU : RTX 2080 super
Mobo: asus x570 crosshair viii hero, bios is 2206
RAM: Corsair DOMINATOR PLATINUM 4 * 16G 

The MCE error I got is:
mce [Hardware error]: CPU 1: Machine Check: 0 Bank 7:Fea040000002010b
mce [Hardware error]: TSC 0 ADDR b6100 MISC d012003f00000000 SYND 622d1f1103 IPID 700b020b50000
mce [Hardware error]: Processor 2:870f10 TIME 1598212995 SOCKET 0 APIC 2 microcode 8701021

Can someone help me or give me some suggestions on how to debug this ? Thanks.
Comment 91 Rich 2020-09-03 01:57:47 UTC
(In reply to yk749 from comment #90)
> I am getting reproducible, reboot using Ubuntu 20.04/18.04 Fedora 32, etc.
> Basically all distros I tried will keep rebooting. I can run Ubuntu 20.04
> without issues a few months ago, but last month whenever I boot into my
> Ubuntu 20.04, it will reboot right at login page. I deleted Ubuntu (I was
> dual booting Ubuntu 20.04 and windows 10 pro, no issues with Windows) and
> tried many other distros (using live stick), all suffer from the same issue. 
> 
> my hw configuration is:
> CPU : Ryzen 3950x
> GPU : RTX 2080 super
> Mobo: asus x570 crosshair viii hero, bios is 2206
> RAM: Corsair DOMINATOR PLATINUM 4 * 16G 
> 
> The MCE error I got is:
> mce [Hardware error]: CPU 1: Machine Check: 0 Bank 7:Fea040000002010b
> mce [Hardware error]: TSC 0 ADDR b6100 MISC d012003f00000000 SYND 622d1f1103
> IPID 700b020b50000
> mce [Hardware error]: Processor 2:870f10 TIME 1598212995 SOCKET 0 APIC 2
> microcode 8701021
> 
> Can someone help me or give me some suggestions on how to debug this ?
> Thanks.

CPU 1 Bank 7 is the L3 cache.
Fea040000002010b decodes to a Tag Parity error.

If this happens in the same bank with the same MCA_STATUS = Fea040000002010b on every failure, then it points at the processor.
Of course, if the board's power delivery or thermal solution is out of spec we could also get here, but then it would likely move to different banks.
If it's in the same place in code execution each time, it could be the CPU VR, because certain places in OS boot tax the VR more than any app will.

Do you happen to have another motherboard? If it fails there too, it's most likely DPM or an early-life failure.
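
For anyone who wants to compare status values between failures, here is a minimal sketch that splits an MCA_STATUS value into its flag bits and error-code fields. The bit positions are assumed to follow the definitions in the Linux kernel's x86 MCE header (TCC and SYNDV being AMD-specific); mapping the codes to a description such as "Tag Parity" needs the bank-specific tables and is not attempted here. The function name decode_status is hypothetical.

```python
# Flag bits of MCA_STATUS; positions assumed from the Linux x86 MCE header.
STATUS_FLAGS = {
    63: "VAL",    # record is valid
    62: "OVER",   # an earlier error was overwritten
    61: "UC",     # uncorrected error
    60: "EN",     # error reporting enabled
    59: "MISCV",  # MISC register valid
    58: "ADDRV",  # ADDR register valid
    57: "PCC",    # processor context corrupt
    55: "TCC",    # task context corrupt (AMD)
    53: "SYNDV",  # syndrome register valid (AMD)
}


def decode_status(status):
    """Split a 64-bit MCA_STATUS value into flags and error-code fields."""
    flags = [name for bit, name in STATUS_FLAGS.items() if (status >> bit) & 1]
    return {
        "flags": flags,
        "xec": (status >> 16) & 0x3F,   # extended error code
        "error_code": status & 0xFFFF,  # low 16 bits
    }


for value in (0xBEA0000000000108, 0xFEA040000002010B):
    print(format(value, "016x"), decode_status(value))
```

Two failures with the same bank and identical decoded fields are the "same place" case described above; a changing bank or error code points elsewhere.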
Comment 92 yk749 2020-09-03 02:15:17 UTC
(In reply to Rich from comment #91)
> (In reply to yk749 from comment #90)
> > I am getting reproducible, reboot using Ubuntu 20.04/18.04 Fedora 32, etc.
> > Basically all distros I tried will keep rebooting. I can run Ubuntu 20.04
> > without issues a few months ago, but last month whenever I boot into my
> > Ubuntu 20.04, it will reboot right at login page. I deleted Ubuntu (I was
> > dual booting Ubuntu 20.04 and windows 10 pro, no issues with Windows) and
> > tried many other distros (using live stick), all suffer from the same
> issue. 
> > 
> > my hw configuration is:
> > CPU : Ryzen 3950x
> > GPU : RTX 2080 super
> > Mobo: asus x570 crosshair viii hero, bios is 2206
> > RAM: Corsair DOMINATOR PLATINUM 4 * 16G 
> > 
> > The MCE error I got is:
> > mce [Hardware error]: CPU 1: Machine Check: 0 Bank 7:Fea040000002010b
> > mce [Hardware error]: TSC 0 ADDR b6100 MISC d012003f00000000 SYND
> 622d1f1103
> > IPID 700b020b50000
> > mce [Hardware error]: Processor 2:870f10 TIME 1598212995 SOCKET 0 APIC 2
> > microcode 8701021
> > 
> > Can someone help me or give me some suggestions on how to debug this ?
> > Thanks.
> 
> CPU1 Bank 7 is the L3 cache 
> Fea040000002010b decodes to Tag Parity error
> 
> if this happens in the same bank with same MCA_STATUS = Fea040000002010b
> every failure then it points at the processor....
> of course if the the board's power delivery or thermal solution are out of
> spec we could also get here...but it would likely move to different banks..
> if its in the same place in code execution each time it could be the CPU
> VR....because certain places in OS boot tax the VR more than any App will...
> 
> do you happen to have another motherboard? if it fails there too...its most
> likely DPM or an early life fail.

Hello Rich, 
Thanks for your prompt reply. I just tried to boot into Ubuntu 20.04 lts live stick again, and got the same mce error message:
mce [Hardware error]: CPU 1: Machine Check: 0 Bank 7:Fea040000002010b
mce [Hardware error]: TSC 0 ADDR d6100 MISC d012003f00000000 SYND 622d1f1103 IPID 700b020b50000
mce [Hardware error]: Processor 2:870f10 TIME 1599084556 SOCKET 0 APIC 2 microcode 8701021

But this time I also see something like:
do_IRQ No irq handler for vector

And:
Initramfs unpacking failed Decoding failed

I just built my new PC so I don't have a spare mobo for testing

Regards,
Yk
Comment 93 EllieTheCat 2020-09-03 06:50:06 UTC
(In reply to Rich from comment #50)

>       bit 0 PP_SCLK_DPM_MASK  = 1 
>       bit 1 PP_MCLK_DPM_MASK  = 0
>     amdgpu.ppfeaturemask=0xffffbffd


just wanted to say thank you so much for this, i have been having these seemingly random restarts since building my new computer, and this little guy right here seems to have made it stable. that said however it does significantly hurt my performance, and trying the other option you listed (amdgpu.ppfeaturemask=0xffffbffe) stopped my system from completing the boot process. got into the initial grub bootloader menu, but after trying to boot into manjaro it would hang. went back to the mclk_dpm_mask = 0 setting, but i was wondering if there's any way to like...mitigate that performance hit? i'm assuming what happens is that the VRAM runs at its base clock speed, which is going to be much lower than what it's actually capable of. however with this feature mask set i'm not entirely certain how to change what clock speed the VRAM is running at, or if that's even possible. very new to all this. any insight people can give would be absolutely lovely

first i thought about using something like corectrl, but that featuremask setting only allows for tinkering with the fan curve and power limits. i also noticed that the memory clock speed is no longer reported by mangohud. i have to assume this is all more or less intended behavior based on the featuremask being set to exclude power management for the MCLK, but that being the case i just don't know where to go from here
Comment 94 Rich 2020-09-03 12:20:05 UTC
(In reply to EllieTheCat from comment #93)
> (In reply to Rich from comment #50)
> 
> >       bit 0 PP_SCLK_DPM_MASK  = 1 
> >       bit 1 PP_MCLK_DPM_MASK  = 0
> >     amdgpu.ppfeaturemask=0xffffbffd
> 
> 
> just wanted to say thank you so much for this, i have been having these
> seemingly random restarts since building my new computer, and this little
> guy right here seems to have made it stable. that said however it does
> significantly hurt my performance, and trying the other option you listed
> (amdgpu.ppfeaturemask=0xffffbffe) stopped my system from completing the boot
> process. got into the initial grub bootloader menu, but after trying to boot
> into manjaro it would hang. went back to the mclk_dpm_mask = 0 setting, but
> i was wondering if there's any way to like...mitigate that performance hit?
> i'm assuming what happens is that the VRAM runs at its base clock speed,
> which is going to be much lower than what it's actually capable of. however
> with this feature mask set i'm not entirely certain how to change what clock
> speed the VRAM is running at, or if that's even possible. very new to all
> this. any insight people can give would be absolutely lovely
> 
> first i thought about using something like corectrl, but that featuremask
> setting only allows for tinkering with the fan curve and power limits. i
> also noticed that the memory clock speed is no longer reported by mangohud.
> i have to assume this is all more or less intended behavior based on the
> featuremask being set to exclude power management for the MCLK, but that
> being the case i just don't know where to go from here


Hi Ellie,

DPM is Dynamic Power Management, so playing with this option just disables some power management features, which ticks up power usage but would not affect performance. Power management typically costs performance, as one trades power savings for entry and exit latencies and slower clocks.

My preferred approach when I want to save power is to turn off the machine.

Rich
Comment 95 Paul Menzel 2020-09-03 12:28:23 UTC
(In reply to Rich from comment #94)

> DPM is Dynamic Power Management...so playing with  this  option just
> disables some power management features...which ticks up power usage but
> would not affect performance...Power management typically costs performance
> as one trades power savings for entry and exit latencies and slower
> clocks.....

That statement is incorrect with the Linux kernel driver. It will run at lowest speed.

[…]
Comment 96 Alex Deucher 2020-09-03 13:09:37 UTC
I'll repeat since no one has tried it: For those of you with Polaris GPUs (e.g., RX580/RX570/RX560, etc.), can you try the patch in comment 75 without any workarounds applied?  Does that fix the issue?
Comment 97 EllieTheCat 2020-09-03 14:21:35 UTC
(In reply to Paul Menzel from comment #95)

> 
> It will run at
> lowest speed.
> 
> […]

That being the case do I have any options for more or less forcing a higher base speed? 

Also I'd like to take a moment to thank you as well for replying without any condescending attitude.
Comment 98 patrickjholloway 2020-09-18 10:37:24 UTC
I am having similar issues. Recently Ubuntu 20.04 has become unusable on my machine. I did a clean install of 18.04 and have had some better stability, but even booting successfully is a crap shoot. My last boot succeeded, but did have a MCE during startup.

Ubuntu 18.04.5 LTS / Windows 10 Pro dual boot
5.4.0-47-generic
Ryzen 3600
X570 AORUS ELITE/X570 AORUS ELITE, BIOS F20b 07/02/2020
RTX2070 Super

mce: [Hardware Error]: Machine check events logged
mce: [Hardware Error]: CPU 9: Machine Check: 0 Bank 5: bea0000000000108
mce: [Hardware Error]: TSC 0 ADDR 1ffffc1d4a04c MISC d012000100000000 SYND 4d000000 IPID 500b000000000 
mce: [Hardware Error]: PROCESSOR 2:870f10 TIME 1600420000 SOCKET 0 APIC 9 microcode 8701021

Same MCE on the previous boot, but at CPU 9: Machine Check: 0 Bank 5: bea0000000000000. Everything else identical.


When I was trying to get 20.04 LTS running recently I was seeing stuff like the following (typed this in copying from pics I snapped on my phone before spontaneous reboot):

ata5.00: failed command: READ FPDMA QUEUED
ata5.00: Exception Emask 0x52 SAct 0x12004000 SErr 0xffffffff action 0xe frozen
ata5.00: cmd ... EMask 0x52 (ATA bus error)
ata6.00: failed command: READ FPDMA QUEUED

Some actual screen grabs here:
https://imgur.com/3Hy5enQ
https://imgur.com/mmXhM0h
https://imgur.com/2xgNQrQ

One weird detail is that installing Ubuntu 20 would make it so I would get unwanted reboots when selecting Windows once I got to the Windows login screen. It would do this 100% of the time and wouldn't matter if I selected Windows in grub or in the BIOS.

I haven't tried any of these types of things yet:

>       bit 0 PP_SCLK_DPM_MASK  = 1 
>       bit 1 PP_MCLK_DPM_MASK  = 0
>     amdgpu.ppfeaturemask=0xffffbffd

Sharing my experience here to maybe get some help and drill into this issue. This isn't my daily driver for work, and things seem to be okay-ish on the Windows side with the 18.04 LTS dual boot: enough that I have a stable environment to dig into logs and settings while still being able to reproduce some of the behavior. I could use some help in that respect, as my Linux experience is limited relative to some experts on here. I use Ubuntu every day at work as a software developer, but up until 6 months ago I was only using Macs.
Comment 99 Alex Deucher 2020-09-18 14:06:50 UTC
(In reply to patrickjholloway from comment #98)
> Ubuntu 18.04.5 LTS / Windows 10 Pro dual boot
> 5.4.0-47-generic
> Ryzen 3600
> X570 AORUS ELITE/X570 AORUS ELITE, BIOS F20b 07/02/2020
> RTX2070 Super

> I haven't tried any of these types of things yet:
> 
> >       bit 0 PP_SCLK_DPM_MASK  = 1 
> >       bit 1 PP_MCLK_DPM_MASK  = 0
> >     amdgpu.ppfeaturemask=0xffffbffd

These are not relevant if you are not using an AMD GPU.
Comment 100 Jaakko Kantojärvi 2020-09-20 17:34:51 UTC
Hello, I'm joining the club with the following hardware

    AMD Ryzen 9 3900X (microcode 0x08701013)
    Gigabyte X570 Aorus Elite (BIOS F30 / 2020-08-15)
    Asus Radeon R9 270X (connected via the x4 slot behind the X570 chipset)
    2x Kingston KHX3200C16D4/16GX DDR4 3200MHz

Module config:

    /proc/cmdline:
      BOOT_IMAGE=/vmlinuz-5.8.0-1-amd64 root=... ro quiet hugepagesz=1G hugepages=6 amdgpu.ppfeaturemask=0xffffbffd

    /etc/modprobe.d/*:
      blacklist edac_mce_amd
      blacklist radeon
      options amdgpu si_support=1
      options amdgpu cik_support=1
      softdep amdgpu pre: vfio vfio_pci
      options vfio_pci disable_vga=1
      # AMD RX 580 gpu+audio (NOTE: this GPU is currently removed)
      options vfio_pci ids=1002:67df,1002:aaf0

and the following MCE

    mce: [Hardware Error]: CPU 14: Machine Check: 0 Bank 5: bea0000000000108
    mce: [Hardware Error]: TSC 0 ADDR 1ffffc05bd88a MISC d010000000000000 SYND 4d000000 IPID 500b000000000 
    mce: [Hardware Error]: PROCESSOR 2:870f10 TIME 1600618873 SOCKET 0 APIC 5 microcode 8701021


In addition to that, I have encountered more MCEs during past weeks. For those, I had slightly different setup

    CPU microcode: 0x08701013
    BIOS: F11 / 2019-12-06
    
    Second video card with vfio-pci driver:
      Gigabyte Radeon RX580 Aorus (connected via the x16 slot to the CPU)
    
    /proc/cmdline:
      BOOT_IMAGE=/vmlinuz-5.8.0-1-amd64 root=... ro quiet hugepagesz=1G hugepages=6

Event 1) System was idle and on the desktop

    mce: [Hardware Error]: CPU 0: Machine Check: 0 Bank 5: b6a0000000000108
    mce: [Hardware Error]: TSC 0 ADDR 1ffff88616a28 SYND 4d000000 IPID 500b000000000 
    mce: [Hardware Error]: PROCESSOR 2:870f10 TIME 1600227628 SOCKET 0 APIC 0 microcode 8701013

    mce: [Hardware Error]: CPU 15: Machine Check: 0 Bank 5: bea0000000000108
    mce: [Hardware Error]: TSC 0 ADDR 7f2bb6d59152 MISC d012000100000000 SYND 4d000000 IPID 500b000000000 
    mce: [Hardware Error]: PROCESSOR 2:870f10 TIME 1600227628 SOCKET 0 APIC 9 microcode 8701013
    
    mce: [Hardware Error]: CPU 18: Machine Check: 0 Bank 5: bea0000000000108
    mce: [Hardware Error]: TSC 0 ADDR 1ffff87e621b6 MISC d012000100000000 SYND 4d000000 IPID 500b000000000 
    mce: [Hardware Error]: PROCESSOR 2:870f10 TIME 1600227628 SOCKET 0 APIC 11 microcode 8701013
    
    mce: [Hardware Error]: CPU 0: Machine Check: 0 Bank 22: f2a000000002010b
    mce: [Hardware Error]: TSC 0 SYND 4d000000 IPID 1813e17000 
    mce: [Hardware Error]: PROCESSOR 2:870f10 TIME 1600227628 SOCKET 0 APIC 0 microcode 8701013

Event 2) System was idle and displays had entered powersaving hours before

    mce: [Hardware Error]: CPU 2: Machine Check: 0 Bank 5: bea0000000000108
    mce: [Hardware Error]: TSC 0 ADDR 7f53cee29aa8 MISC d012000100000000 SYND 4d000000 IPID 500b000000000 
    mce: [Hardware Error]: PROCESSOR 2:870f10 TIME 1600475700 SOCKET 0 APIC 4 microcode 8701013
    
    mce: [Hardware Error]: CPU 0: Machine Check: 0 Bank 22: f2a000000002010b
    mce: [Hardware Error]: TSC 0 SYND 4d000000 IPID 1813e17000 
    mce: [Hardware Error]: PROCESSOR 2:870f10 TIME 1600475700 SOCKET 0 APIC 0 microcode 8701013


After next reboot, I'll move the R9 to the x16 slot for testing. Idea was to reserve x16 slot for the windows VM, but I guess I prefer a stable desktop more.
Also, I need to test different amdgpu power management masks, as the current one wasn't enough. Although, it's possible that it partially helped (less MCEs per event, assuming the rest were not just direct consequence of the main problem).
Comment 101 Rich 2020-10-04 14:03:47 UTC
>> (In reply to Jaakko Kantojärvi from comment #100)

Yes...put the video card in the PCie x16 slot closest to the CPU.
>> Asus Radeon R9 270X (connected via the x4 slot behind the X570 chipset) <==
>> the 2 slots closest the CPU are direct connect to the CPU.

Since you're failing at idle with a CPU WDT MCA, it's likely something is going to sleep and not waking up, so the usual suspects are in power management: you could turn off PC6 or CC6, PCIe ASPM, PCIe L1 and the L1 substates. Turn it all off and see if stability returns, then start turning things back on and see when it fails.

Some read is not completing, and that leads to the Bank 5: bea0000000000108 CPU watchdog timeout. Usually it's a PCIe endpoint, but it can be something in the path to the PCIe card that doesn't exit/wake from sleep states properly.
Comment 102 jns-v 2020-10-14 16:26:03 UTC
Hi folks,

big thanks for all the work you already put into this.
From what I could extract from the thread and comparing it to my situation I think I have the same issue as Clemens. I therefore hope, that it's okay if I chime in on this thread.

System:
CPU:    AMD Ryzen 7 3700X 8-Core Processor
MB:     Asrock X570M Pro4
GPU:    GeForce GTX 1050 Ti (should be fine, as it powered my old i5 based system with no issues)
RAM:    HyperX Predator HX429C15PB3AK2/16
PSU:    be quiet! System Power 9: 700W

OS:     Arcolinux
Kernel: 5.8.14
WM:     XMONAD
packages: amd-ucode, nvidia

So the issue pretty much started with the installation procedure which took me several attempts to accomplish due to reboots caused by mce events.
During the process of setting everything up I did some research and so far did the following:

-- Updated BIOS to latest version 
-- disabled C-States in BIOS
-- bootoption idle=nomwait (removed again because of no effect)

Both steps reduced the frequency of the reboots occurring (system basically wasn't usable at all before).
Apart from setting up the system, editing configs installing software and browsing I didn't do anything to stress the system yet. However while writing this I had 4 reboots...
I started taking logs (dmesg, Xorg) and it seems the mce codes are the same every time:

[    0.216553] mce: [Hardware Error]: Machine check events logged
[    0.216554] mce: [Hardware Error]: CPU 3: Machine Check: 0 Bank 5: bea0000000000108
[    0.216557] mce: [Hardware Error]: TSC 0 ADDR 1ffffc0c3e028 MISC d012000100000000 SYND 4d000000 IPID 500b000000000 
[    0.216560] mce: [Hardware Error]: PROCESSOR 2:870f10 TIME 1602691009 SOCKET 0 APIC 6 microcode 8701021

The only thing that differs is the core on which I get the bea0000000000108 code.
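
To check whether the bank and status really stay the same across crashes, the MCE lines from several boots can be tallied by (bank, status) while ignoring the CPU number. A minimal sketch, assuming the line format shown in this thread; the function name summarize is hypothetical.

```python
import re
from collections import Counter

# Matches lines such as:
#   mce: [Hardware Error]: CPU 3: Machine Check: 0 Bank 5: bea0000000000108
MCE_RE = re.compile(r"CPU (\d+): Machine Check: \d+ Bank (\d+): ?([0-9a-fA-F]{16})")


def summarize(lines):
    """Count MCE records per (bank, status), ignoring which CPU reported them."""
    seen = Counter()
    for line in lines:
        match = MCE_RE.search(line)
        if match:
            bank = int(match.group(2))
            status = match.group(3).lower()
            seen[(bank, status)] += 1
    return seen


# Sample lines copied from earlier comments in this thread.
sample = [
    "mce: [Hardware Error]: CPU 3: Machine Check: 0 Bank 5: bea0000000000108",
    "mce: [Hardware Error]: CPU 2: Machine Check: 0 Bank 5: bea0000000000108",
    "mce: [Hardware Error]: CPU 0: Machine Check: 0 Bank 22: f2a000000002010b",
]
for (bank, status), count in summarize(sample).items():
    print(f"bank {bank} status {status}: {count}x")
```

On a real system the input could be the saved dmesg or journal output of each boot after a crash.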

I figured that setting any of the amdgpu related masks makes no sense for my setup?
I still have the option of returning the components and getting an Intel-based setup again...
Any advice is much appreciated!

Also, in order to make some valuable contribution to this thread, I am willing to provide additional info and test some things out. However, I am just a user with very limited knowledge of these things, so I'm afraid I'd need some guidance in that regard.

Best regards and big thanks to the community
Jonas
Comment 103 jns-v 2020-10-14 16:27:27 UTC
Created attachment 292967 [details]
dmesg of latest crash

See Initial comment above
Comment 104 Jaakko Kantojärvi 2020-11-03 20:28:50 UTC
(In reply to Rich from comment #101)

> Yes...put the video card in the PCie x16 slot closest to the CPU.

I have not done that, yet.

> since your failing idle with CPU WDT MCA...its likely  something is going to
> sleep and not waking up.....so the usual suspects  in Power
> managemnet....could turn off PC6 or turn off CC6 , turn off PCIe ASPM , turn
> off PCIe L1 , L1 substates.....turn it all off...and see if stabilty
> returns...then start turning things back on and see when it fails....

As I don't have a good way to reproduce the crash, I started disabling one feature at a time. First thing I tried was setting "AMD CBS > Power Supply Idle Control" to "Typical current idle". However, that didn't seem to matter. Took a week for the crash to happen, but it did.

Interestingly, no MCEs have occurred since I disabled SVM (I had enabled it initially, as I'm going to need it later). It's now been about two weeks, so I'm fairly sure it has something to do with this.

I'm interested to hear whether other people with the same problem have SVM enabled or not.

> some read is not completing and that leads to Bank 5: bea0000000000108 CPU
> WatchDog  timeout....Usually its a  PCie endpoint, but can be something in
> the path to the PCIe card that doesn't exit/wake from sleep states properly.

Could it be that the watchdog timeout is just too strict? Say, 1 time in 10,000 the PCIe device doesn't respond in time, but the rest of the time it's fine?
Comment 105 Rich 2020-11-04 13:17:52 UTC
> Could it be that the watchdog timeout is just too strict? Like 1 in 10,000
> of the time PCIe device doesn't respond in time, but rest of the time it's
> fine?.

The CPU WDT timeout period on Ryzen 7 3700 is over 5 seconds. For an idle system this is a lot. On complex server systems executing heavy I/O workloads, with lots of retimers and PCIe-PCIe bridges in the read data path, I have seen that the read completion time needs to be higher. But for a machine at idle, a failure with the default timeout period means something is stuck; usually it's related to power delivery, noise, or sleep-to-wake protocols.
Comment 106 Foulques du Peloux de Praron 2020-11-04 16:23:52 UTC
> I'm interestetd to hear if other people having the same problem have the SVM
> enabled or not.

I have the problem too and SVM is disabled.

My specs :
Ryzen 5 3600
CORSAIR - Vengeance LPX 16 Go 2 x 8 Go, DDR4 2400 Mhz
B450-PLUS GAMING (BIOS is up to date)
RX 5700 XT
Be Quiet Straight Power E9 - 500W
Manjaro with kernel 5.9.1
Mesa 20.1.8

It happens only when gaming in VR with games like Half-Life: Alyx or Moss, for example. It is really random. Usually I play for one or two hours. Most of the time there is no problem, but sometimes it can happen twice in less than 30 minutes, and then it is stable again...
Comment 107 fav 2020-11-18 05:57:49 UTC
CPU: Ryzen 3 1200
GPU: RX 570
OS: Manjaro with kernel 5.9.3

mce: [Hardware Error]: Machine check events logged
mce: [Hardware Error]: CPU 1: Machine Check: 0 Bank 5: bea0000000000108
mce: [Hardware Error]: TSC 0 ADDR 1ffff956f0b10 MISC d012000100000000 SYND 4d000000 IPID 500b000000000 
mce: [Hardware Error]: PROCESSOR 2:800f11 TIME 1604849640 SOCKET 0 APIC 1 microcode 8001138

The problem appears only when the processor is running at stock frequencies. As soon as I raise the multiplier from 31 to 36, the problem disappears.
Comment 108 Jaakko Kantojärvi 2020-11-18 13:51:56 UTC
> I have the problem too and SVM is disabled.

It took me around 3 days to get a reboot after posting my previous message. Clearly 2 weeks wasn't enough. In any case, I also tested disabling IOMMU, and that didn't help either.

I disabled QT debug messages, which were flushing data to disk every second, and now the MCE crashes occur nearly every night.

After that, I re-enabled SVM and IOMMU (as they don't seem to be relevant), and I set processor.max_cstate=5, which wasn't enough. I then changed it to processor.max_cstate=1, which also wasn't enough. Finally, I tried to disable the L0 and L1 link power states with

> cat /sys/module/pcie_aspm/parameters/policy 
> default [performance] powersave powersupersave
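
The active policy is the entry shown in square brackets. A minimal sketch for checking it from a script, assuming the file format shown above; the function names are hypothetical.

```python
import re

ASPM_POLICY_PATH = "/sys/module/pcie_aspm/parameters/policy"


def active_aspm_policy(text):
    """Return the bracketed (active) entry from the policy file contents."""
    match = re.search(r"\[(\w+)\]", text)
    return match.group(1) if match else None


def read_active_aspm_policy(path=ASPM_POLICY_PATH):
    """Read the policy file from sysfs (only works on a Linux system with ASPM)."""
    with open(path) as handle:
        return active_aspm_policy(handle.read())


print(active_aspm_policy("default [performance] powersave powersupersave"))
```

Checking the value after boot confirms that the setting written from the initramfs is actually still in effect.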

This has NOT fixed the issue for me. I guess the next tests are:

1) set idle=nomwait
2) move the GPU to the x16 slot (and directly to CPU).
3) replace R9 270x with RX 580 and see how that works (I use amdgpu driver with amdgpu.si_support=1)

In addition, I would like to test with a different disk. I currently have a Samsung 1TB 970 EVO Plus (NVMe, M.2, PCIe 3.0 x4). Someone reported to have the problem with only SATA drives, so it shouldn't be related to this, but who knows. I just don't have any 1TB SATA SSDs :(

> could turn off PC6 or turn off CC6 , turn off PCIe ASPM , turn off PCIe L1 ,
> L1 substates.....turn it all off...

So far, I think I have tested most of the things. I still have some BIOS flags to try out, which might be relevant for this. Sadly, this is not my area of expertise. Well, I have learned a lot during these months.

I have to say this is a really frustrating problem. I really hate timeout issues. So,

* Are there any good ways to debug this more?
* Can we parse more details from the MCE data?
* What if I disable KASLR, would that make the memory address in the MCE info more useful?
* Finally, is there any definitive way to prove that this is something that I should just RMA?

And finally, list of latest MCE errors, just for reference:

> # Oct 21 00:47:07 --> Disable SVM
>
> Nov 07 18:04:46 iwana kernel: mce: [Hardware Error]: CPU 0: Machine Check: 0
> Bank 5: b6a0000000000108
> Nov 07 18:04:46 iwana kernel: mce: [Hardware Error]: TSC 0 ADDR 1ffff8b038cea
> SYND 4d000000 IPID 500b000000000 
> Nov 07 18:04:46 iwana kernel: mce: [Hardware Error]: PROCESSOR 2:870f10 TIME
> 1604752114 SOCKET 0 APIC 0 microcode 8701021
>
> Nov 07 18:04:46 iwana kernel: mce: [Hardware Error]: CPU 16: Machine Check: 0
> Bank 5: bea0000000000108
> Nov 07 18:04:46 iwana kernel: mce: [Hardware Error]: TSC 0 ADDR 7fcd3305e0a2
> MISC d012000100000000 SYND 4d000000 IPID 500b000000000 
> Nov 07 18:04:46 iwana kernel: mce: [Hardware Error]: PROCESSOR 2:870f10 TIME
> 1604752114 SOCKET 0 APIC b microcode 8701021
>
> # Nov 07 18:08:44 --> Disable IOMMU
> 
> Nov 10 02:14:36 iwana kernel: mce: [Hardware Error]: CPU 18: Machine Check: 0
> Bank 5: bea0000000000108
> Nov 10 02:14:36 iwana kernel: mce: [Hardware Error]: TSC 0 ADDR 7f9f432ec3ca
> MISC d012000100000000 SYND 4d000000 IPID 500b000000000 
> Nov 10 02:14:36 iwana kernel: mce: [Hardware Error]: PROCESSOR 2:870f10 TIME
> 1604962395 SOCKET 0 APIC 11 microcode 8701021
> 
> # Nov 10 02:27:24 --> Enable SVM, Enable IOMMU, set `processor.max_cstate=5`
> 
> Nov 12 07:55:48 iwana kernel: mce: [Hardware Error]: CPU 7: Machine Check: 0
> Bank 5: bea0000000000108
> Nov 12 07:55:48 iwana kernel: mce: [Hardware Error]: TSC 0 ADDR 7f133a726eb6
> MISC d012000100000000 SYND 4d000000 IPID 500b000000000 
> Nov 12 07:55:48 iwana kernel: mce: [Hardware Error]: PROCESSOR 2:870f10
> TIME 1605147009 SOCKET 0 APIC 12 microcode 8701021
>
> # Nov 12 07:55:48 --> add `echo performance >
> /sys/module/pcie_aspm/parameters/policy` to initramfs
> # Nov 12 07:55:48 --> set `processor.max_cstate=5`
> 
> Nov 14 05:18:47 iwana kernel: mce: [Hardware Error]: CPU 5: Machine Check: 0
> Bank 5: bea0000000000108
> Nov 14 05:18:47 iwana kernel: mce: [Hardware Error]: TSC 0 ADDR 1ffff90400e48
> MISC d012000100000000 SYND 4d000000 IPID 500b000000000 
> Nov 14 05:18:47 iwana kernel: mce: [Hardware Error]: PROCESSOR 2:870f10 TIME
> 1605313791 SOCKET 0 APIC c microcode 8701021
> 
> Nov 15 08:00:20 iwana kernel: mce: [Hardware Error]: CPU 5: Machine Check: 0
> Bank 5: bea0000000000108
> Nov 15 08:00:20 iwana kernel: mce: [Hardware Error]: TSC 0 ADDR 1ffffb9c00e48
> MISC d012000200000000 SYND 4d000000 IPID 500b000000000 
> Nov 15 08:00:20 iwana kernel: mce: [Hardware Error]: PROCESSOR 2:870f10 TIME
> 1605354267 SOCKET 0 APIC c microcode 8701021
>
> Nov 15 08:00:20 iwana kernel: mce: [Hardware Error]: CPU 10: Machine Check: 0
> Bank 5: bea0000000000108
> Nov 15 08:00:20 iwana kernel: mce: [Hardware Error]: TSC 0 ADDR 1ffffb945e146
> MISC d012000100000000 SYND 4d000000 IPID 500b000000000 
> Nov 15 08:00:20 iwana kernel: mce: [Hardware Error]: PROCESSOR 2:870f10 TIME
> 1605354267 SOCKET 0 APIC 1a microcode 8701021
>
> Nov 18 10:33:05 iwana kernel: mce: [Hardware Error]: CPU 5: Machine Check: 0
> Bank 5: bea0000000000108
> Nov 18 10:33:05 iwana kernel: mce: [Hardware Error]: TSC 0 ADDR 1ffffb640107a
> MISC d012000300000000 SYND 4d000000 IPID 500b000000000 
> Nov 18 10:33:05 iwana kernel: mce: [Hardware Error]: PROCESSOR 2:870f10 TIME
> 1605680663 SOCKET 0 APIC c microcode 8701021
>
> Nov 18 10:33:05 iwana kernel: mce: [Hardware Error]: CPU 22: Machine Check: 0
> Bank 5: bea0000000000108
> Nov 18 10:33:05 iwana kernel: mce: [Hardware Error]: TSC 0 ADDR 1ffffb5c5e146
> MISC d012000100000000 SYND 4d000000 IPID 500b000000000 
> Nov 18 10:33:05 iwana kernel: mce: [Hardware Error]: PROCESSOR 2:870f10 TIME
> 1605680663 SOCKET 0 APIC 1b microcode 8701021
Comment 109 Clemens Eisserer 2020-11-18 14:06:05 UTC
Hi Jaakko,

I don't have any good news or hints, but I can state what didn't work for me and save you some time. My system is no longer running Linux (with Windows it is rock solid), and a Renoir-based laptop behaves as expected "even" running Linux.

* set idle=nomwait: tried, didn't help
* replace R9 270x with RX 580: I was using a RX570, which uses the same (partly deactivated) silicon as the RX580

* Finally, is there any definitive way to prove that this is something that I should just RMA?

RMAing the CPU will most likely not help either: after ~6 months I gave another Ryzen 3700X a try and it behaved the same way. (Early Ryzen 1xxx parts were so buggy (SIGSEGVs under high load) that AMD had to physically replace them while playing the bug down as a "performance problem".)
Comment 110 Jaakko Kantojärvi 2020-11-18 15:28:23 UTC
(In reply to Clemens Eisserer from comment #109)

> My system is no longer running Linux (with Windows it is rock solid),
> and a renoir based laptop behaves as expected "even" running linux.

This is one of the most annoying things. What does Windows do differently? :(
Does Linux, for example, assume messages can't get lost and then wait forever for a message that will never arrive? I would have assumed this is something in the CPU's microcode.

> * set idle=nomwait: tried, didn't help
> * replace R9 270x with RX 580: I was using a RX570, which uses the same
> (partly deactivated) silicon as the RX580

Thanks for these. I added the first one (idle=nomwait) to my GRUB config, so it will be active after the next crash.

> RMA the CPU will most likely not help either, after ~6 months I gave another
> Ryzen 3700X a try and it behaved the same way (early Ryzen-1xxx were so
> buggy (SIGSEVS under high load) AMD had to physically replace it and play
> the bug down as "performance problem").

This is what I have read in other places too, so I'm not expecting much unless I can find out that my CPU is clearly broken. For now, I think I should be able to work around the issue by putting the computer to sleep when it is not in use. Sadly, that doesn't work during a vacation if I want to do remote work.

Thanks in any case :)

I really hope to get this working some day. I love the Ryzen architecture, so I really wanted to support them instead of Intel this time.
Comment 111 matthew clark 2020-11-21 14:39:58 UTC
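Disabling multithreading seems to solve it, e.g. run this (as root) on boot-up:

#!/bin/bash
#
# Disables SMT ("hyperthreading") by keeping only the first logical CPU of
# each physical core online, which stops the system from crashing.
#
declare -A core_seen    # core_id -> 1 once one thread of that core is kept

for CPU in /sys/devices/system/cpu/cpu[0-9]*; do
    CPUID=$(basename "$CPU")
    echo "CPU: $CPUID"

    # Bring the CPU online first so that its topology files are readable
    # (cpu0 usually has no "online" file and is always online).
    if test -e "$CPU/online"; then
        echo "1" > "$CPU/online"
    fi

    COREID="$(cat "$CPU/topology/core_id")"

    if [ -z "${core_seen[$COREID]}" ]; then
        # First logical CPU seen for this core: keep it online.
        echo "$CPU core=$COREID -> enable"
        core_seen[$COREID]=1
    else
        # Another thread of a core we already kept: take it offline.
        echo "$CPU core=$COREID -> disable"
        echo "0" > "$CPU/online"
    fi
done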
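
For comparison, here is a minimal sketch of a shorter route to the same effect. It assumes a kernel that exposes the SMT control file `smt/control` under `/sys/devices/system/cpu` (recent kernels do; older ones may not). The function names and the `SYSFS_CPU` override are made up for this illustration, and `secondary_threads` only lists what the script above would take offline, it changes nothing.

```shell
#!/bin/bash
# Sketch only. SYSFS_CPU can be pointed at a copy of the sysfs tree for dry runs.
SYSFS_CPU="${SYSFS_CPU:-/sys/devices/system/cpu}"

# Print the current SMT state as reported by the kernel (e.g. "on" or "off").
smt_state() {
    cat "$SYSFS_CPU/smt/control"
}

# Print the logical CPUs that are not the first thread of their core,
# i.e. the ones that would go offline when SMT is disabled.
secondary_threads() {
    local dir id first
    for dir in "$SYSFS_CPU"/cpu[0-9]*; do
        # Offline CPUs may not expose topology files; skip them.
        [ -r "$dir/topology/thread_siblings_list" ] || continue
        id="${dir##*/cpu}"
        # The list looks like "0,12" or "0-1"; its first entry is the primary thread.
        first="$(cut -d, -f1 "$dir/topology/thread_siblings_list" | cut -d- -f1)"
        if [ "$id" != "$first" ]; then
            echo "$id"
        fi
    done | sort -n
}

# Ask the kernel to take all secondary threads offline (needs root).
smt_off() {
    echo off > "$SYSFS_CPU/smt/control"
}
```

Calling `smt_state` and `secondary_threads` as a normal user is harmless; `smt_off` has to run as root and, like the script above, does not persist across reboots. Booting with the `nosmt` kernel parameter should give the same result from the start.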
Comment 112 Jaakko Kantojärvi 2020-12-03 22:45:49 UTC
Disabling SMT wasn't the fix. I did it from the BIOS and it seemed to lower the frequency of crashes.

I plan to set 'rcu_nocbs=0-<cpu_max_index>' and see if that changes something. Maybe it lowers the chance of hitting the bug, or maybe it doesn't do anything.
Comment 113 Cyrax 2020-12-04 00:15:37 UTC
(In reply to Jaakko Kantojärvi from comment #112)
> Disabling SMT wasn't the fix. I did it from the BIOS and it seemed to lower
> the frequency of crashes.
> 
> I plan to set 'rcu_nocbs=0-<cpu_max_index>' and see if that changes
> something. Maybe it lowers the change to hit the bug, or it doesn't do
> anything.

Make sure that your distribution's kernel already has the config option CONFIG_RCU_NOCB_CPU set; otherwise you need to rebuild your kernel with that option enabled.
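
One way to check this without rebuilding anything is to look at the config file shipped with the running kernel. The sketch below is only an illustration: `kconfig_enabled` is a made-up helper name, and where the config file lives is an assumption that varies by distribution (commonly `/boot/config-$(uname -r)`, or `/proc/config.gz` if the kernel was built to expose it).

```shell
#!/bin/bash
# Succeed if the given option is built in (=y) in the given kernel config file.
# Handles plain-text configs and gzip-compressed ones (such as /proc/config.gz).
kconfig_enabled() {
    local option="$1" file="$2"
    case "$file" in
        *.gz) gzip -dc "$file" ;;
        *)    cat "$file" ;;
    esac | grep -q "^${option}=y$"
}

# Example (the path is an assumption, adjust it for your distribution):
#   kconfig_enabled CONFIG_RCU_NOCB_CPU "/boot/config-$(uname -r)" \
#       && echo "rcu_nocbs= should be usable on this kernel"
```

If the function fails for both locations, either the option is not set or the config is stored somewhere else.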
Comment 114 Cyrax 2020-12-04 00:17:51 UTC
Also try whether setting the kernel boot option mce=off helps.
Comment 115 Borislav Petkov 2020-12-04 16:12:10 UTC
(In reply to Cyrax from comment #114)
> Also try if setting kernel boot option mce=off helps.

That's not a good idea. If you boot with mce=off and the machine experiences an uncorrectable MCE which raises a machine check exception, the machine will immediately shut down without the ability to even log the error, so you won't know what kind of error it was. Supplying "mce=off" on the kernel command line is almost never a good idea.
Comment 116 Martin Roth 2020-12-08 12:53:31 UTC
Hi,

I've been experiencing very similar, possibly identical, issues on my machine since I built it a year ago. The processor is a Ryzen 3600, the motherboard an ASUS TUF X470-PLUS GAMING (BIOS version 5406). It is running Arch Linux, kernel 5.9.11-arch2-1 and has the AMD microcode package amd-ucode 20201120.bc9cd0b-1 installed.
I've tried different BIOS settings and iterations of various kernel parameters but the sporadic and random freezes have persisted, sometimes occurring once a week, sometimes more than once a day, usually under low load (like watching a video in the web browser).

The last thing I tried was to change the CPU Ratio from Auto → 36.00, as suggested in https://forum-en.msi.com/index.php?threads/solved-msi-x570-a-pro-ryzen-5-3600-freeze.344085/ (the base frequency of my CPU is 3.6 GHz), and now a week has passed without any random freezes.

Perhaps you can also try changing the CPU Ratio to a fixed value in the BIOS and see if it helps.
Comment 117 Gurenko Alex 2020-12-15 22:57:56 UTC
Created attachment 294147 [details]
dmesg after 3rd reboot

A few days ago I built a new machine with a Ryzen 9 5900X and 5700XT graphics on an X570 motherboard (MSI Tomahawk Wifi), and I am experiencing similar issues. In just 2 days I've had 3 reboots while doing different things. On the next boot I see the same error as others, for various CPU IDs.

I'm running 5.9.14-200.fc33.x86_64 on Fedora 33 KDE. I saw some suggestions about IOMMU, so I've tried disabling it and disabling ResizableBAR, but still got my 3rd reboot today.
Comment 118 Jaakko Kantojärvi 2020-12-18 08:12:53 UTC
Hi,

I got some interesting events today. When I tried to wake the displays from sleep, they didn't wake up. I logged in to my desktop over SSH and it was working fine (the Jupyter Notebook server was also alive and well). The Xorg logs didn't have anything interesting, and xrandr and xset both hung when I tried to use them.

In any case, I tried to restart X with systemctl restart sddm.service, but after a few seconds (5 s, I guess) the computer restarted due to an MCE event.

However, after checking journalctl (see logs below), I noticed that the amdgpu DRM driver had timed out earlier, and after that my cron job, which uses X, started to fail. So I guess the CPU and the GPU (or its driver) got out of sync or something similar.

On another note, I have not yet enabled rcu_nocbs, as I have not had time to debug these problems.

For Alex: it's sad to hear that Zen 3 cores also experience the same bug. However, it's also good to know that upgrading the hardware wouldn't fix the issue for me. Also, if you read the history, you will see that there have been no real solutions so far. Setting the clock speed helped someone (not me), disabling SMT helped someone else (but not me), and disabling IOMMU shouldn't matter. I fear that the issue might be related to how many dies (and CCXes) one has, or to the topology of the CPU.

The logs:

> Dec 18 03:26:21 iwana kernel: [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring
> gfx timeout, signaled seq=15639736, emitted seq=15639738
> Dec 18 03:26:21 iwana kernel: [drm:amdgpu_job_timedout [amdgpu]] *ERROR*
> Process information: process Xorg pid 1086 thread Xorg:cs0 pid 1185
> Dec 18 03:26:21 iwana kernel: [drm] GPU recovery disabled.
> Dec 18 03:26:21 iwana kernel: [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring
> sdma1 timeout, signaled seq=8083976, emitted seq=8083978
> Dec 18 03:26:21 iwana kernel: [drm:amdgpu_job_timedout [amdgpu]] *ERROR*
> Process information: process Xorg pid 1086 thread Xorg:cs0 pid 1185
> Dec 18 03:26:21 iwana kernel: [drm] GPU recovery disabled.
> Dec 18 03:27:54 iwana libvirtd[1066]: internal error: connection closed due
> to keepalive timeout
> Dec 18 03:30:01 iwana systemd[1]: Starting [Cron] "*/15 * * * *
> $HOME/.local/bin/kde-safe-session"...
> Dec 18 03:30:28 iwana systemd[1]: cron-user-user-0.service: Main process
> exited, code=exited, status=1/FAILURE
> Dec 18 03:30:28 iwana systemd[1]: cron-user-user-0.service: Failed with
> result 'exit-code'.
> Dec 18 03:30:28 iwana systemd[1]: Failed to start [Cron] "*/15 * * * *
> $HOME/.local/bin/kde-safe-session".
>
> Dec 18 09:38:27 iwana kernel: mce: [Hardware Error]: CPU 2: Machine Check: 0
> Bank 5: bea0000000000108
> Dec 18 09:38:27 iwana kernel: mce: [Hardware Error]: TSC 0 ADDR 1ffff81a5e146
> MISC d012000100000000 SYND 4d000000 IPID 500b000000000 
> Dec 18 09:38:27 iwana kernel: mce: [Hardware Error]: PROCESSOR 2:870f10 TIME
> 1608277087 SOCKET 0 APIC 2 microcode 8701021
>
> Dec 18 09:38:27 iwana kernel: mce: [Hardware Error]: CPU 5: Machine Check: 0
> Bank 5: bea0000000000108
> Dec 18 09:38:27 iwana kernel: mce: [Hardware Error]: TSC 0 ADDR 1ffff82201498
> MISC d012000100000000 SYND 4d000000 IPID 500b000000000 
> Dec 18 09:38:27 iwana kernel: mce: [Hardware Error]: PROCESSOR 2:870f10 TIME
> 1608277087 SOCKET 0 APIC 6 microcode 8701021
>
> Dec 18 09:38:27 iwana kernel: mce: [Hardware Error]: CPU 7: Machine Check: 0
> Bank 5: bea0000000000108
> Dec 18 09:38:27 iwana kernel: mce: [Hardware Error]: TSC 0 ADDR 1ffffc11c0124
> MISC d012000100000000 SYND 4d000000 IPID 500b000000000 
> Dec 18 09:38:27 iwana kernel: mce: [Hardware Error]: PROCESSOR 2:870f10 TIME
> 1608277087 SOCKET 0 APIC 9 microcode 8701021
>
> Dec 18 09:38:27 iwana kernel: mce: [Hardware Error]: CPU 9: Machine Check: 0
> Bank 5: bea0000000000108
> Dec 18 09:38:27 iwana kernel: mce: [Hardware Error]: TSC 0 ADDR 1ffffc0adf550
> MISC d012000100000000 SYND 4d000000 IPID 500b000000000 
> Dec 18 09:38:27 iwana kernel: mce: [Hardware Error]: PROCESSOR 2:870f10 TIME
> 1608277087 SOCKET 0 APIC c microcode 8701021


I noticed that some (maybe 10%) of the MCE events are preceded by an amdgpu timeout.

> Sep 20 02:54:44 iwana kernel: [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring
> gfx timeout, signaled seq=625846, emitted seq=625848
> Sep 20 02:54:44 iwana kernel: [drm:amdgpu_job_timedout [amdgpu]] *ERROR*
> Process information: process gfxbench_gl pid 9805 thread gfxbench_g:cs0 pid
> 9809
> Sep 20 02:54:44 iwana kernel: [drm] GPU recovery disabled.
> Sep 20 02:57:40 iwana libvirtd[1208]: internal error: connection closed due
> to keepalive timeout
> Sep 20 02:58:01 iwana kernel: INFO: task kworker/15:1:10563 blocked for more
> than 120 seconds.
> Sep 20 02:58:01 iwana kernel:       Not tainted 5.8.0-1-amd64 #1 Debian
> 5.8.7-1
> Sep 20 02:58:01 iwana kernel: "echo 0 >
> /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> 
> Sep 20 19:46:40 iwana kernel: mce: [Hardware Error]: CPU 14: Machine Check: 0
> Bank 5: bea0000000000108
> Sep 20 19:46:40 iwana kernel: mce: [Hardware Error]: TSC 0 ADDR 1ffffc05bd88a
> MISC d010000000000000 SYND 4d000000 IPID 500b000000000 
> Sep 20 19:46:40 iwana kernel: mce: [Hardware Error]: PROCESSOR 2:870f10 TIME
> 1600618873 SOCKET 0 APIC 5 microcode 8701021

and

> Oct 16 22:38:18 iwana kernel: pcieport 0000:00:01.2: AER: Multiple Corrected
> error received: 0000:00:01.2
> Oct 16 22:38:18 iwana kernel: pcieport 0000:00:01.2: AER: PCIe Bus Error:
> severity=Corrected, type=Transaction Layer, (Receiver ID)
> Oct 16 22:38:18 iwana kernel: pcieport 0000:00:01.2: AER:   device
> [1022:1483] error status/mask=00002000/00004000
> Oct 16 22:38:18 iwana kernel: pcieport 0000:00:01.2: AER:    [13] NonFatalErr 
> Oct 16 22:38:18 iwana kernel: pcieport 0000:00:01.2: AER: Multiple Corrected
> error received: 0000:00:01.2
> Oct 16 22:38:18 iwana kernel: pcieport 0000:00:01.2: AER: can't find device
> of ID000a
> 
> 00:01.2 PCI bridge: Advanced Micro Devices, Inc. [AMD] Starship/Matisse GPP
> Bridge (prog-if 00 [Normal decode])
>   Flags: bus master, fast devsel, latency 0, IRQ 28, IOMMU group 2
>   Bus: primary=00, secondary=02, subordinate=08, sec-latency=0
>   I/O behind bridge: 0000e000-0000ffff [size=8K]
>   Memory behind bridge: fc600000-fcbfffff [size=6M]
>   Prefetchable memory behind bridge: 00000000e0000000-00000000efffffff
>   [size=256M]
>   Capabilities: <access denied>
>   Kernel driver in use: pcieport
> 
> Oct 16 22:38:20 iwana kernel: [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring
> sdma0 timeout, signaled seq=206376, emitted seq=206378
> Oct 16 22:38:20 iwana kernel: [drm:amdgpu_job_timedout [amdgpu]] *ERROR*
> Process information: process  pid 0 thread  pid 0
> Oct 16 22:38:20 iwana kernel: [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring
> gfx timeout, signaled seq=146458, emitted seq=146460
> Oct 16 22:38:20 iwana kernel: [drm:amdgpu_job_timedout [amdgpu]] *ERROR*
> Process information: process kscreenlocker_g pid 8351 thread kscreenloc:cs0
> pid 8355
> Oct 16 22:38:20 iwana kernel: [drm] GPU recovery disabled.
> Oct 16 22:38:20 iwana kernel: [drm] GPU recovery disabled.
> Oct 16 22:38:21 iwana kernel: [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring
> sdma1 timeout, signaled seq=15733, emitted seq=15735
> Oct 16 22:38:21 iwana kernel: [drm:amdgpu_job_timedout [amdgpu]] *ERROR*
> Process information: process Xorg pid 1218 thread Xorg:cs0 pid 1291
> Oct 16 22:38:21 iwana kernel: [drm] GPU recovery disabled.
> Oct 16 22:39:07 iwana libvirtd[1204]: internal error: connection closed due
> to keepalive timeout
>
> Oct 16 23:45:06 iwana kernel: mce: [Hardware Error]: CPU 11: Machine Check: 0
> Bank 5: bea0000000000108
> Oct 16 23:45:06 iwana kernel: mce: [Hardware Error]: TSC 0 ADDR 559925b8a998
> MISC d012000100000000 SYND 4d000000 IPID 500b000000000 
> Oct 16 23:45:06 iwana kernel: mce: [Hardware Error]: PROCESSOR 2:870f10 TIME
> 1602877420 SOCKET 0 APIC 1c microcode 8701021

but I also think there are times when that was not the case, like here (I'm pretty sure I used the computer between the following dates)

> Nov 17 00:00:16 iwana kernel: [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring
> gfx timeout, signaled seq=299292, emitted seq=299294
> Nov 17 00:00:16 iwana kernel: [drm:amdgpu_job_timedout [amdgpu]] *ERROR*
> Process information: process sddm-greeter pid 1533 thread sddm-greet:cs0 pid
> 1542

> Nov 18 10:33:05 iwana kernel: mce: [Hardware Error]: CPU 5: Machine Check: 0
> Bank 5: bea0000000000108
> Nov 18 10:33:05 iwana kernel: mce: [Hardware Error]: TSC 0 ADDR 1ffffb640107a
> MISC d012000300000000 SYND 4d000000 IPID 500b000000000 
> Nov 18 10:33:05 iwana kernel: mce: [Hardware Error]: PROCESSOR 2:870f10 TIME
> 1605680663 SOCKET 0 APIC c microcode 8701021
>
> Nov 18 10:33:05 iwana kernel: mce: [Hardware Error]: CPU 22: Machine Check: 0
> Bank 5: bea0000000000108
> Nov 18 10:33:05 iwana kernel: mce: [Hardware Error]: TSC 0 ADDR 1ffffb5c5e146
> MISC d012000100000000 SYND 4d000000 IPID 500b000000000 
> Nov 18 10:33:05 iwana kernel: mce: [Hardware Error]: PROCESSOR 2:870f10 TIME
> 1605680663 SOCKET 0 APIC 1b microcode 8701021

--
Jaakko
Comment 119 Gurenko Alex 2020-12-18 10:21:19 UTC
I tried various scenarios yesterday, and I have an interesting finding that I'm trying to confirm today and over the weekend:

I get a reboot only after a fresh power-on ... and, as I found out yesterday, after returning from sleep. But after that reboot I never get a second one, no matter how I stress the system (I literally used the stress utility).

So yesterday I started my morning by rebooting the system right after the first boot, and for the whole day there were no reboots and no errors. At some point at the end of the day I put the machine to sleep, woke it up ~20 min later, and after a few minutes got the familiar restart and errors in the logs.

I read about the amdgpu driver, but none of my restarts had amdgpu log entries, so I'm not sure that's the cause, at least for me.

From the looks of it, there is a problem with the initialization of the CPU which is "fixed" during the reboot sequence? I'm just speculating at this point, since so far I have only been able to replicate the system being stable after a reboot.

I've also opened a support ticket with AMD, but have not heard back just yet.
Comment 120 Clemens Eisserer 2020-12-18 10:38:57 UTC
>  I've also opened a support ticket with AMD, but have not heard back just
>  yet.

Typically those tickets are closed with a pointer to the linux kernel bugzilla, stating that they do not provide end-user support for linux directly - this was actually the reason I created this bugzilla ticket.
In my opinion this is actually a dumb decision, as this way AMD does not even have statistics.
Comment 121 Gurenko Alex 2020-12-18 10:42:28 UTC
(In reply to Clemens Eisserer from comment #120)
> >  I've also opened a support ticket with AMD, but have not heard back just
> >  yet.
> 
> Typically those tickets are closed with a pointer to the linux kernel
> bugzilla, stating that they do not provide end-user support for linux
> directly - this was actually the reason I created this bugzilla ticket.
> In my opinion this is actually a dumb decision, as this way AMD does not
> even have statics.

I was an early adopter of 1st gen Ryzen and had quite productive discussions about the 1st gen Ryzen segfault issue, which ended in an RMA and an acknowledgement from AMD's side that it was a hardware problem, so there is hope.

Also, I'm not sure it's a Linux-specific issue; there is a thread on the AMD community forum from the Windows side which looks very similar: https://community.amd.com/t5/processors/ryzen-5900x-system-constantly-crashing-restarting-whea-logger-id/td-p/423321
Comment 122 Vitalii 2020-12-18 11:23:48 UTC
(In reply to Gurenko Alex from comment #121)

> Also I'm not sure it's linux specific issue, there is a thread on AMD
> community forum from Windows side which looks very similar:
> https://community.amd.com/t5/processors/ryzen-5900x-system-constantly-
> crashing-restarting-whea-logger-id/td-p/423321

Hi. Not sure about 5000-series, but on 3900 I had a similar crash on Windows only once in many months. It looked and felt very similar to what happens in Linux, and it happened when I was fiddling with monitoring OSD overlays.

Reported by component: Processor Core
Error Source: Machine Check Exception
Error Type: Cache Hierarchy Error
Processor APIC ID: 2

The details view of this entry contains further information.
Event Xml:
<Event xmlns="http://schemas.microsoft.com/win/2004/08/events/event">
  <System>
    <Provider Name="Microsoft-Windows-WHEA-Logger" Guid="{c26c4f3c-3f66-4e99-8f8a-39405cfed220}" />
    <EventID>18</EventID>
    <Version>0</Version>
    <Level>2</Level>
    <Task>0</Task>
    <Opcode>0</Opcode>
    <Keywords>0x8000000000000000</Keywords>
    <TimeCreated SystemTime="2020-11-20T08:58:18.0572101Z" />
    <EventRecordID>13886</EventRecordID>
    <Correlation ActivityID="{b0404eb3-8e85-4bc6-a142-610f268a257b}" />
    <Execution ProcessID="3836" ThreadID="4240" />
    <Channel>System</Channel>
    <Computer>...</Computer>
    <Security UserID="..." />
  </System>
  <EventData>
    <Data Name="ErrorSource">3</Data>
    <Data Name="ApicId">2</Data>
    <Data Name="MCABank">5</Data>
    <Data Name="MciStat">0xbea0000000000108</Data>
    <Data Name="MciAddr">0x7ffec4f656b4</Data>
    <Data Name="MciMisc">0xd01a0ffe00000000</Data>
    <Data Name="ErrorType">9</Data>
    <Data Name="TransactionType">2</Data>
    <Data Name="Participation">256</Data>
    <Data Name="RequestType">0</Data>
    <Data Name="MemorIO">256</Data>
    <Data Name="MemHierarchyLvl">0</Data>
    <Data Name="Timeout">256</Data>
    <Data Name="OperationType">256</Data>
    <Data Name="Channel">256</Data>
    <Data Name="Length">936</Data>
    <Data Name="RawData">...</Data>
  </EventData>
</Event>

While not exactly the same, not so encouraging either.
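
The MciStat value in the event above is the same raw status word that Linux prints after "Bank 5:", so it can be decoded the same way on both systems. The sketch below is only an illustration: `decode_mci_status` is a made-up helper, and the bit positions are the architectural x86 MCi_STATUS flags as I understand them from the vendor manuals (VAL = bit 63 down to PCC = bit 57). It leaves the bank-specific low 16-bit error code undecoded.

```shell
#!/bin/bash
# Decode the architectural flag bits (63..57) of an x86 MCi_STATUS value.
# The value is split into two 32-bit halves to stay clear of signed 64-bit
# arithmetic in bash.
decode_mci_status() {
    local hex="${1#0x}"
    # Left-pad with zeros to 16 hex digits.
    hex="$(printf '%16s' "$hex" | tr ' ' '0')"
    local hi=$(( 16#${hex:0:8} ))
    local lo=$(( 16#${hex:8:8} ))
    # Flag names for bits 63, 62, 61, 60, 59, 58, 57 (assumed layout, see above).
    local names=(VAL OVER UC EN MISCV ADDRV PCC)
    local bit=31 flags="" name
    for name in "${names[@]}"; do
        if (( (hi >> bit) & 1 )); then
            flags+="${flags:+ }$name"
        fi
        bit=$(( bit - 1 ))
    done
    printf 'flags: %s\n' "$flags"
    printf 'error code: 0x%04x\n' $(( lo & 0xffff ))
}

decode_mci_status bea0000000000108
# prints:
#   flags: VAL UC EN MISCV ADDRV PCC
#   error code: 0x0108
```

Read that way, the status describes a valid, uncorrected error with the processor context marked as corrupt (PCC), which would be consistent with the machine resetting instead of recovering, on Windows as well as on Linux.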
Comment 123 Nicholas H. 2020-12-18 18:21:31 UTC
FWIW, I haven't had any MCEs since RMAing my 3900X in May. With the old CPU I had them daily. Maybe more people in this thread have bad Ryzens?
Comment 124 Aaran Lee 2020-12-19 02:12:44 UTC
Building
Comment 125 Jaakko Kantojärvi 2020-12-19 13:51:33 UTC
(In reply to Nicholas H. from comment #123)
> FWIW, I haven't had any MCEs since RMAing my 3900X in May. With the old CPU
> I had them daily. Maybe more people in this thread have bad Ryzens?

I guess I should set up a sleep + WoL system similar to what you had. I just don't feel comfortable RMAing a product that kind of works, especially when I can't say whether it's the CPU or the motherboard.

On other matters: based on the idea of adding +0.05V to the DRAM voltage, which was mentioned in the Windows-related forum thread, I set my memory to the second XMP profile, which is 3GHz at 1.35V. The memory default is ~2GHz at 1.2V and XMP 1 is 3.6GHz at 1.35V. I'm hoping that this configuration provides a bit more voltage headroom at a lower frequency. I also disabled Precision Boost Overdrive and raised the clock multiplier from 38 to 39. However, the CPU's low frequency is still the same 2.2GHz, which is used most of the time (e.g. when idle).

Finally, I raised Load Line Calibration from Auto (which seemed to match the standard / lowest setting) to Medium, which apparently raised the idle voltage from 0.9V to 1.1V. I thought this setting only controlled the voltage regulation PID or such.

Maybe these settings have a positive effect. At least I can see plausible explanations for how they would fit our observations of the problem and of RMA working or not working. We'll see in 2-30 days.

Happy holidays to everyone.
Comment 126 Gurenko Alex 2020-12-19 14:48:39 UTC
(In reply to Jaakko Kantojärvi from comment #125)
> (In reply to Nicholas H. from comment #123)
> > FWIW, I haven't had any MCEs since RMAing my 3900X in May. With the old CPU
> > I had them daily. Maybe more people in this thread have bad Ryzens?
> 
> I guess I should setup similar sleep + WoL system like what you had. I just
> don't feel confortable to RMA product that kind of works. Specifically, when
> I can't say if it's the CPU or motherboard.
> 
> On other things. Based on that idea of adding +0.05V to DRAM voltage, which
> was mentioned in the Windows related forum thread, I set my memory to second
> XMP profile, which is 3GHz at 1.35V. Memory default is ~2GHz at 1.2V and XMP
> 1 is 3.6GHz at 1.35V. I'm hoping that this configuration might have a bit
> more voltage for a lower frequency. I also disable the Precision Boost
> Overdrive and raised clock multiplier from 38 to 39. However, the cpu low
> frequency is the same 2.2GHz, which is used most of the time (e.g. on idle).
> 
> Finally, I raised Load Line Calibration from Auto (seemed to match standard
> / lowest setting) to Medium, which apparently raised idle voltage from 0.9V
> to 1.1V. I thought this setting only controlled the voltage regulation PID
> or such.
> 
> Maybe these configurations have positive effect. At least, I see plausible
> explanations how they would fit our observations of the problems and RMA
> working/not working. We'll see in 2-30 days..
> 
> Happy holidays for everyone.

I ran my memory at 1.40V as suggested for a few days and it had no effect. In my opinion disabling PBO is the wrong way to go; if PBO does not work, it means it's a faulty product. Another suggestion that makes sense is to play around with the PSU idle state setting. I never had such a feature before, but now it seems to be a thing. Based on some reading, it's a feature that tells the PSU to lower the current when idle, and some PSUs don't restore it fast enough, especially when PBO engages from the idle state, which obviously happens if the CPU has been idling for a while and stays cool.

I've flipped the setting in the BIOS to keep the current at the normal rate during idle, so we'll see if THAT has any effect. That's probably the best explanation I've found so far, in that it at least makes sense. I also don't want to go through an RMA, but compared to the issues of 1st gen Ryzen this is a big deal, since it happens randomly and I've already lost some work due to an unexpected reboot.
Comment 127 Gurenko Alex 2020-12-19 16:57:58 UTC
Just a quick update: that also had no effect. The machine rebooted ~90 seconds after waking from sleep mode. That seems to be a great trigger...
Comment 128 matthew clark 2020-12-21 18:37:49 UTC
(In reply to Gurenko Alex from comment #127)
>  Just a quick update, that also had no effect, rebooted ~90 seconds out of
> the sleep mode. That seems to be a great trigger...

If you turn off multi-threading on each core as I posted previously, the MCE error stops completely for me; it never occurs.
Comment 129 Gurenko Alex 2020-12-21 19:00:48 UTC
(In reply to matthew clark from comment #128)
> (In reply to Gurenko Alex from comment #127)
> >  Just a quick update, that also had no effect, rebooted ~90 seconds out of
> > the sleep mode. That seems to be a great trigger...
> 
> if you turn off the multi-threading on each core as I posted previously the
> MCE error totally stops for me; never occurs.

Okay, I'll try, but this is not a solution, SMT is one of the features of the CPU and it's not working as expected.
Comment 130 Jaakko Kantojärvi 2020-12-21 20:26:24 UTC
Hi,

(In reply to matthew clark from comment #128)
> (In reply to Gurenko Alex from comment #127)
> >  Just a quick update, that also had no effect, rebooted ~90 seconds out of
> > the sleep mode. That seems to be a great trigger...
> 
> if you turn off the multi-threading on each core as I posted previously the
> MCE error totally stops for me; never occurs.

This is interesting as disabling SMT in BIOS didn't help me at all.


(In reply to Gurenko Alex from comment #126)
> In my opinion disabling PBO is a wrong way to go, if that does not
> work, it means it's a faulty product.

I agree, but if that fixes the problem, then I have identified the issue and can RMA the product with more confidence.

> I've flipped setting in a BIOS to keep current at normal rate during idle,
> so we'll see if THAT has any effect.

That didn't have any effect for me, at least.
Comment 131 binarytamer 2020-12-23 21:26:38 UTC
Hi

I just want to jump on the mce-train and add my experience with this problem.
First, this is my syslog snippet of the mce error:

Dec 23 08:14:45 dp-pc kernel: [    0.533154] mce: [Hardware Error]: Machine check events logged
Dec 23 08:14:45 dp-pc kernel: [    0.533154] mce: [Hardware Error]: CPU 13: Machine Check: 0 Bank 5: bea0000000000108
Dec 23 08:14:45 dp-pc kernel: [    0.533154] mce: [Hardware Error]: TSC 0 ADDR 7fb1c9c9685a MISC d012000100000000 SYND 4d000000 IPID 500b000000000
Dec 23 08:14:45 dp-pc kernel: [    0.533154] mce: [Hardware Error]: PROCESSOR 2:870f10 TIME 1608707676 SOCKET 0 APIC b microcode 8701021

Second, this is my Hardware setup:

MB: MPG X570 GAMING PLUS (MS-7C37)
CPU: Ryzen 7 3800X
RAM: 32GB 2x16GB DDR4 3200MT/s
GPU: ASUS TUF 3-RX5700XT-O8G-GAMING 
GPU: NVIDIA RTX 2080 only for IOMMU

Third, when the crashes happen on my system:

Only under heavy load while gaming on the Linux host, with the VM deactivated and the Nvidia GPU sleeping (S3) [no, I do not believe this is PSU related]. Sometimes after 15 minutes, sometimes almost immediately after starting the game, sometimes after more than 2h in the game. The system just resets itself and reboots.

Fourth, what I already checked:

- 24h memory test
- CPU stress test to check temps
- tried all kernel options discussed in this thread
- tried different distributions (all rolling): Solus, Manjaro and now Gentoo --> same behavior on all systems

Here is the interesting part:

All this starts with kernel versions >= 5.5. When I install kernel 5.4.80 (currently marked stable in Gentoo), the system is rock solid, without a single crash. I have observed this on my rig since version 5.5 came out, and have sporadically tried updating to the most recent version. Maybe someone can confirm that 5.4 is also stable on other systems. I would really like to know if so.
Comment 132 Paul Menzel 2020-12-24 13:28:20 UTC
(In reply to binarytamer from comment #131)

> I just want to jump on the mce-train and add my experience with this problem.

[…]

> Third, when happens the crashes on my System:
> 
> Only on heavy Load while Gaming on the Linux host with the VM deactivated
> and the Nvidia GPU sleeping (S3)[no, I do not believe this is PSU related].
> Sometimes after 15 Minutes, sometimes almost after starting the Game. Some
> >2h in the Game. The System just resets itself and reboots.

[…]

> Here is the interesting part:
> 
> All this starts with kernel version >= 5.5 when I install kernel 5.4.80 (in
> Gentoo currently marked stable)the system is rock solid not a single crash.
> I observed this on my rig since version 5.5 is out and tried sporadically to
> update to the recent version. Maybe someone can confirm that 5.4 is also
> stable on other systems. I would really like to know if so.

Thank you for reporting the issue. As this report is already too long, and I haven’t read about the 5.4 to 5.5 regression before (in one of my cases it’s a regression from 4.19 to 5.4), please create a new issue for this. (Please also mention the firmware version, or attach the Linux messages.) As you can see from the reports, nobody really has a clue what is causing this. As it could be power management related, it’s also hard to reproduce on other systems.

As you can reproduce it, and are using Gentoo, so are able to build your own Linux kernel, it would be really great if you could bisect the issue. (From my experience, that is the only way to get the developers’ attention to fix it.) My guess is that some power management features for the graphics cards were enabled in 5.5.
Comment 133 Gurenko Alex 2020-12-26 16:09:18 UTC
(In reply to Jaakko Kantojärvi from comment #130)
> Hi,
> 
> (In reply to matthew clark from comment #128)
> > (In reply to Gurenko Alex from comment #127)
> > >  Just a quick update, that also had no effect, rebooted ~90 seconds out
> of
> > > the sleep mode. That seems to be a great trigger...
> > 
> > if you turn off the multi-threading on each core as I posted previously the
> > MCE error totally stops for me; never occurs.
> 
> This is interesting as disabling SMT in BIOS didn't help me at all.
> 
> 
> (In reply to Gurenko Alex from comment #126)
> > In my opinion disabling PBO is a wrong way to go, if that does not
> > work, it means it's a faulty product.
> 
> I agree, but if that fixes the problem, then I have identified the issue and
> can RMA the product with a better self-confidence.
> 
> > I've flipped setting in a BIOS to keep current at normal rate during idle,
> > so we'll see if THAT has any effect.
> 
> That didn't have effect for me at least.

I was wondering what configuration you are running right now? I decided to try running the system with SMT disabled and it was stable for 2 days straight. Then I read on the level1 forum that offsetting the voltage also helps, but first I decided to bring back SMT to see what the current status is and... it has now been running stable for another 2 days. After 10 straight days of resets several times a day, it has now worked for 4 days without any problem.

Can you try bringing back SMT and see if that's stable for you now?
Comment 134 Gurenko Alex 2020-12-29 14:37:25 UTC
(In reply to Gurenko Alex from comment #133)
> (In reply to Jaakko Kantojärvi from comment #130)
> > Hi,
> > 
> > (In reply to matthew clark from comment #128)
> > > (In reply to Gurenko Alex from comment #127)
> > > >  Just a quick update, that also had no effect, rebooted ~90 seconds out
> > of
> > > > the sleep mode. That seems to be a great trigger...
> > > 
> > > if you turn off the multi-threading on each core as I posted previously
> the
> > > MCE error totally stops for me; never occurs.
> > 
> > This is interesting as disabling SMT in BIOS didn't help me at all.
> > 
> > 
> > (In reply to Gurenko Alex from comment #126)
> > > In my opinion disabling PBO is a wrong way to go, if that does not
> > > work, it means it's a faulty product.
> > 
> > I agree, but if that fixes the problem, then I have identified the issue
> and
> > can RMA the product with a better self-confidence.
> > 
> > > I've flipped setting in a BIOS to keep current at normal rate during
> idle,
> > > so we'll see if THAT has any effect.
> > 
> > That didn't have effect for me at least.
> 
> I was wondering what configuration are you running right now? I decided to
> try running the system with SMT disabled and it was stable for 2 days
> straight. Then I read on level1 forum that offsetting the voltage also
> helps, but first I decided to bring back SMT to see what's the current
> status is and...it's still running stable now for another 2 days. After 10
> straight days of resets several times a day, this now works for 4 days
> without any problem.
> 
> Can you try bringing back SMT and see if that's stable for you now?

Okay, scratch that, after ~6 days even with SMT off it's rebooting, so I'm out of options.
Comment 135 Peter 2021-01-24 10:53:24 UTC
Rich and Joel,

can't say enough how much you made my day! The 

    amdgpu.ppfeaturemask=0xffffbffd

solved it after months of debugging.

For those coming here without having gone through the entire thread: 

If 'journalctl | grep -i "hardware err"' returns errors like bea0000000000108 and microcode 8701021 or 8701013, and the BIOS is updated to the last version, the kernel is up to date and several passes of memtest86+ have run without errors, then if you have an AMD GPU the problem might be related to that

GPU
===
Many report success by booting their kernels with amdgpu.ppfeaturemask=0xffffbffd. If that does not help, try amdgpu.dpm=0. If that works, either keep it as is or remove it again and experiment with other less invasive ppfeaturemask settings discussed above. If none of this helps, the problem might be related to the 

CPU
===
The first recommendation generally is to set in the BIOS "Cool'n Quiet" to Disabled. If that does not help also set "Power Idle Current" to Typical. This should already disable the problematic C states (checking and even disabling c6 can be done also with https://github.com/r4m0n/ZenStates-Linux). If none of this helps, then the recommendation is to also set "Global C State Control" to Disabled. The next step would be to also set "SMT" of the Overclocking settings to Disabled. 
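
To see what the ppfeaturemask value mentioned above actually changes, here is a minimal Python sketch that lists the bits 0xffffbffd clears relative to an all-ones mask. The bit names are an assumption based on the PP_FEATURE_MASK enum in the amdgpu driver's amd_shared.h; they are not taken from this thread, so please check them against the source of the kernel you run.

```python
# Sketch: list the feature bits that an amdgpu.ppfeaturemask value turns off.
# ASSUMPTION: bit names follow enum PP_FEATURE_MASK in
# drivers/gpu/drm/amd/include/amd_shared.h; verify against your kernel tree.
PP_FEATURE_BITS = {
    0: "PP_SCLK_DPM_MASK",
    1: "PP_MCLK_DPM_MASK",
    2: "PP_PCIE_DPM_MASK",
    3: "PP_SCLK_DEEP_SLEEP_MASK",
    4: "PP_POWER_CONTAINMENT_MASK",
    14: "PP_OVERDRIVE_MASK",
    15: "PP_GFXOFF_MASK",
}


def cleared_bits(mask, width=32):
    """Return the positions of all zero bits in the lowest `width` bits."""
    return [bit for bit in range(width) if not (mask >> bit) & 1]


def describe(mask):
    """Map cleared bit positions to (assumed) feature names."""
    return [PP_FEATURE_BITS.get(bit, f"bit {bit} (not named in this sketch)")
            for bit in cleared_bits(mask)]


for name in describe(0xFFFFBFFD):
    print(name)
```

If the assumed names are right, the mask only switches off memory clock DPM and overdrive, which would make it a much smaller change than amdgpu.dpm=0.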

Peter
Comment 136 Gurenko Alex 2021-01-25 12:22:16 UTC
(In reply to Peter from comment #135)
> Rich and Joel,
> 
> can't say enough how much you made my day! The 
> 
>     amdgpu.ppfeaturemask=0xffffbffd
> 
> solved it after months of debugging.
> 
> For those coming here without having gone through the entire thread: 
> 
> If 'journalctl | grep -i "hardware err"' returns errors like
> bea0000000000108 and microcode 8701021 or 8701013, and the BIOS is updated
> to the last version, the kernel is up to date and several passes of
> memtest86+ have run without errors, then if you have an AMD GPU the problem
> might be related to that
> 
> GPU
> ===
> Many report success by booting their kernels with
> amdgpu.ppfeaturemask=0xffffbffd. If that does not help, try amdgpu.dpm=0. If
> that works, either keep it as is or remove it again and experiment with
> other less invasive ppfeaturemask settings discussed above. If none of this
> helps, the problem might be related to the 
> 
> CPU
> ===
> The first recommendation generally is to set in the BIOS "Cool'n Quiet" to
> Disabled. If that does not help also set "Power Idle Current" to Typical.
> This should already disable the problematic C states (checking and even
> disabling c6 can be done also with
> https://github.com/r4m0n/ZenStates-Linux). If none of this helps, then the
> recommendation is to also set "Global C State Control" to Disabled. The next
> step would be to also set "SMT" of the Overclocking settings to Disabled. 
> 
> Peter

Thanks for the summary, Peter. My system has been running stable for weeks since I switched to the 5.10 kernel (on December 30). I've had only a single unexpected reboot, which I believe was related to a regression in the 5.10.5 kernel, if I'm not mistaken. I've been checking various BIOS settings and narrowed stability down to a single parameter: setting "Power Idle Current" to Typical.

However...this morning, with no changes to the settings and no package updates in the last few days, I got 3 reboots in a row, which is very disappointing to say the least. I've added the ppfeaturemask parameter to see if that makes a difference, although I fail to see why it would help with anything, given what this parameter does.
Comment 137 Paul Menzel 2021-01-25 12:32:26 UTC
(In reply to Gurenko Alex from comment #136)

[…]

> However...this morning with no changes to the settings or any package update
> in a last few days, I've got 3 reboots in a row which is very disappointing
> to say the least. I've added ppfeaturemask parameter to see if that would
> make a difference, I'm failing to see why would it help with anything, given
> what this parameter does.

Please always give details about your hardware setup, including firmware versions.
Comment 138 Gurenko Alex 2021-01-25 14:34:30 UTC
(In reply to Paul Menzel from comment #137)
> (In reply to Gurenko Alex from comment #136)
> 
> […]
> 
> > However...this morning with no changes to the settings or any package
> update
> > in a last few days, I've got 3 reboots in a row which is very disappointing
> > to say the least. I've added ppfeaturemask parameter to see if that would
> > make a difference, I'm failing to see why would it help with anything,
> given
> > what this parameter does.
> 
> Please always give details to your hardware setup and even firmware versions.

Fair enough, currently I'm running kernel 5.10.9-201.fc33.x86_64, mesa-dri-drivers-20.3.3-3.fc33.x86_64

CPU: Ryzen 5900X
MB: MSI X570 Tomahawk - BIOS 7C84v153 (Beta version) ComboAM4PIV2 1.1.9.0
GPU: Gigabyte Aorus RX 5700 XT
RAM: G.Skill 32 GB (2x16) 3600 MHz/CL16 (XMP)
Comment 139 Foulques du Peloux de Praron 2021-01-25 20:22:46 UTC
(In reply to Peter from comment #135)
> Rich and Joel,
> 
> can't say enough how much you made my day! The 
> 
>     amdgpu.ppfeaturemask=0xffffbffd
> 
> solved it after months of debugging.
> 
> For those coming here without having gone through the entire thread: 
> 
> If 'journalctl | grep -i "hardware err"' returns errors like
> bea0000000000108 and microcode 8701021 or 8701013, and the BIOS is updated
> to the last version, the kernel is up to date and several passes of
> memtest86+ have run without errors, then if you have an AMD GPU the problem
> might be related to that
> 
> GPU
> ===
> Many report success by booting their kernels with
> amdgpu.ppfeaturemask=0xffffbffd. If that does not help, try amdgpu.dpm=0. If
> that works, either keep it as is or remove it again and experiment with
> other less invasive ppfeaturemask settings discussed above. If none of this
> helps, the problem might be related to the 
> 
> CPU
> ===
> The first recommendation generally is to set in the BIOS "Cool'n Quiet" to
> Disabled. If that does not help also set "Power Idle Current" to Typical.
> This should already disable the problematic C states (checking and even
> disabling c6 can be done also with
> https://github.com/r4m0n/ZenStates-Linux). If none of this helps, then the
> recommendation is to also set "Global C State Control" to Disabled. The next
> step would be to also set "SMT" of the Overclocking settings to Disabled. 
> 
> Peter

I tried amdgpu.ppfeaturemask=0xffffbffd and amdgpu.dpm=0, but they do not help me: in-game performance is terrible with them, so I cannot reproduce the error.

My specs :
Ryzen 5 3600
CORSAIR - Vengeance LPX 16 Go 2 x 8 Go, DDR4 3200 Mhz
B450-PLUS GAMING (BIOS is up to date)
RX 5700 XT
Be Quiet Straight Power E9 - 500W
Manjaro with kernel 5.10.7
Mesa 20.3.3
amd-ucode 20210109.r1812.d528862-1
Comment 140 Gurenko Alex 2021-02-01 08:13:15 UTC
 I made an interesting observation over the weekend, after the worst week in terms of restarts so far: while the week was full of restarts, the weekend was completely stable. I even did a full 4h+ memory test with memtest86 and a few hours of burn-in testing for CPU/memory/GPU, and nothing.
 This morning, after only 15-20 minutes of uptime: restart.

 The only differences between work days and the weekend are the VPN (openvpn) and the Chrome browser. I don't use Chrome for personal use and I don't normally keep the work VPN on; however, sometimes I forget to disconnect at the end of the work day, so it may stay on late in the evenings. Hence pretty much everything points to Chrome usage.

Now, I don't really know how to confirm that since logs never show any errors per se before the reboot. I'll try disabling GPU acceleration in Chrome to see if it has any effect.
Comment 141 exeskull1 2021-02-01 09:48:29 UTC
(In reply to Gurenko Alex from comment #140)
>  I have an interesting observation over the weekend after the worst week in
> terms of restarts so far: While week was full of restarts, the weekend went
> completely stable. I even did a full 4h+ memory test with memtest86 and a
> few hours of burnin test for cpu/memory/gpu and nothing.
>  This morning, after only 15-20 minutes of uptime - restart.
> 
>  The only difference between work days and weekend are: VPN (openvpn) and
> Chrome browser. I don't use Chrome for personal use and I don't normally
> keep work VPN on, however sometimes I forget to disconnect at the end of
> work day, so it may stay on late in the evenings, hence pretty much
> everything points to the chrome usage.
> 
> Now, I don't really know how to confirm that since logs never show any
> errors per se before the reboot. I'll try disabling GPU acceleration in
> Chrome to see if it has any effect.

Most of the time when I had a restart, Chrome was running (actually, now that I think about it, I don't remember a restart ever happening when Chrome wasn't running). I turned off GPU acceleration, but it is still happening.
Setting BIOS "Cool'n Quiet" to Disabled solved my issues for some time.

Slack in Chrome seems to be one of the main triggers: sometimes restarts occur, sometimes everything freezes, and sometimes it just restarts.
Comment 142 Jens Reimann 2021-02-01 09:58:14 UTC
That rings a bell. I didn't have any restarts lately (hope that doesn't change in a second :) ). Even playing Minecraft works.

Initially I had issues in Zoom calls, using the Zoom app. About 15 minutes into a call, the machine would reset with this error. Assuming it was some GPU issue, I disabled GPU acceleration in Zoom, and never had any issues in Zoom again.

I rarely use Chrome. And Firefox seems to have GPU support disabled by default for me.

So maybe this has something to do with OpenCL, or whatever "GPU acceleration" means.
Comment 143 Gurenko Alex 2021-02-01 10:04:41 UTC
(In reply to exeskull1 from comment #141)
> (In reply to Gurenko Alex from comment #140)
> >  I have an interesting observation over the weekend after the worst week in
> > terms of restarts so far: While week was full of restarts, the weekend went
> > completely stable. I even did a full 4h+ memory test with memtest86 and a
> > few hours of burnin test for cpu/memory/gpu and nothing.
> >  This morning, after only 15-20 minutes of uptime - restart.
> > 
> >  The only difference between work days and weekend are: VPN (openvpn) and
> > Chrome browser. I don't use Chrome for personal use and I don't normally
> > keep work VPN on, however sometimes I forget to disconnect at the end of
> > work day, so it may stay on late in the evenings, hence pretty much
> > everything points to the chrome usage.
> > 
> > Now, I don't really know how to confirm that since logs never show any
> > errors per se before the reboot. I'll try disabling GPU acceleration in
> > Chrome to see if it has any effect.
> 
> Most of the time when I had restart - Chrome was on (actually now that I
> think, I don't remember that restart ever happened to me when Chrome wasn't
> turn on), I turned off GPU acceleration but still is happening.
> BIOS "Cool'n Quiet" is Disabled and this solved my issues for some time.
> 
> I see that Slack in Chrome is one of the main reasons why sometimes restarts
> occur, sometimes everything freeze, and sometimes just restart.

How would you tell that Slack is the reason? Do you see errors in any logs? I'm asking because I also have Slack in my Chrome.
AMD support also recommends disabling Cool'n Quiet, but...MSI removed that toggle a few BIOS updates ago.

(In reply to Jens Reimann from comment #142)
> That rings a bell. I didn't have any restarts lately (hope that doesn't
> change in a second :) ). Even playing Minecraft works.
> 
> Initially I had issues in Zoom calls, using the Zoom app. ~15 minutes into a
> call, and the machine reset with this error. Assuming it was some GPU issue,
> I disabled the GPU acceleration the Zoom, and never had any issues anymore
> in Zoom.
> 
> I rarely use Chrome. And Firefox seems to have GPU support disabled by
> default for me.
> 
> So maybe this has something to do with OpenCL, or whatever "GPU
> acceleration" means.

I have had GPU acceleration force-enabled in Firefox for a very long time on all of my machines (desktop, laptop, Pinebook Pro) and it works without any problems. And, just like I said, over the weekend (or during my quiet time over the Christmas break) I never had a reboot, which now looks quite obvious, because I never launched Chrome during those periods. I've now tried disabling hardware acceleration in Chrome, but just got another reboot, so it's definitely Chrome; the question is why and what to do about it.
Comment 144 exeskull1 2021-02-01 10:12:21 UTC
(In reply to Gurenko Alex from comment #143)
> (In reply to exeskull1 from comment #141)
> > (In reply to Gurenko Alex from comment #140)
> > >  I have an interesting observation over the weekend after the worst week
> in
> > > terms of restarts so far: While week was full of restarts, the weekend
> went
> > > completely stable. I even did a full 4h+ memory test with memtest86 and a
> > > few hours of burnin test for cpu/memory/gpu and nothing.
> > >  This morning, after only 15-20 minutes of uptime - restart.
> > > 
> > >  The only difference between work days and weekend are: VPN (openvpn) and
> > > Chrome browser. I don't use Chrome for personal use and I don't normally
> > > keep work VPN on, however sometimes I forget to disconnect at the end of
> > > work day, so it may stay on late in the evenings, hence pretty much
> > > everything points to the chrome usage.
> > > 
> > > Now, I don't really know how to confirm that since logs never show any
> > > errors per se before the reboot. I'll try disabling GPU acceleration in
> > > Chrome to see if it has any effect.
> > 
> > Most of the time when I had restart - Chrome was on (actually now that I
> > think, I don't remember that restart ever happened to me when Chrome wasn't
> > turn on), I turned off GPU acceleration but still is happening.
> > BIOS "Cool'n Quiet" is Disabled and this solved my issues for some time.
> > 
> > I see that Slack in Chrome is one of the main reasons why sometimes
> restarts
> > occur, sometimes everything freeze, and sometimes just restart.
> 
> How would you tell that Slack is the reason? Do you see errors in any logs?
> I'm asking because I also have Slack in my Chrome.
> AMD support also recommend to disable Cool'n Quiet, but...MSI removed that
> toggle few BIOS updates ago.
> 
> (In reply to Jens Reimann from comment #142)
> > That rings a bell. I didn't have any restarts lately (hope that doesn't
> > change in a second :) ). Even playing Minecraft works.
> > 
> > Initially I had issues in Zoom calls, using the Zoom app. ~15 minutes into
> a
> > call, and the machine reset with this error. Assuming it was some GPU
> issue,
> > I disabled the GPU acceleration the Zoom, and never had any issues anymore
> > in Zoom.
> > 
> > I rarely use Chrome. And Firefox seems to have GPU support disabled by
> > default for me.
> > 
> > So maybe this has something to do with OpenCL, or whatever "GPU
> > acceleration" means.
> 
> I have GPU acceleration force enabled for a very long time in Firefox on all
> of my machines (Desktop, laptop, Pinebook Pro) and it works without any
> problems and, just like I said, over weekend (or during my quiet time over
> Christmas break) I never had a reboot...now when it looks quite obvious,
> because I never launched Chrome over this period of time. I've tried now to
> disable Hardware acceleration in Chrome, but just got another reboot, so
> it's definitely Chrome, the question is why and what to do about it.

Unfortunately, nothing in the logs :/ my logs are the same as others here... It's just an observation that pretty much every time Slack was open in the browser there was a reboot (I suspect some service worker or some memory leak, etc.)... Something is wrong with Chrome, but I still can't find it.
Comment 145 Paul Menzel 2021-02-01 10:54:05 UTC
A userspace program should not be able to crash the system, so it’s not a bug in Chrome, but it’s good that you found some indicators to force the problem *for your configuration*. Could you two please create a separate bug report for this with all the details for the developers, and link to it from here? If you have the same graphics device configuration, I’d report it to the graphics developers. I am pretty sure nobody will look into this thread, as it has again become too convoluted.
Comment 146 Foulques du Peloux de Praron 2021-02-08 19:13:47 UTC
On my side, I found a way to reproduce the problem almost 100% of the time. The game Boneworks produces an MCE almost every time I reach the main menu.
Comment 147 Paul Menzel 2021-02-08 20:05:59 UTC
(In reply to Foulques du Peloux de Praron from comment #146)
> On my side, I found a way to reproduce the problem almost 100% of the time.
> The game Boneworks produce MCE almost every time I reach the main menu.

Nice. What is your hardware configuration? What graphics device do you use? If it’s an AMD graphics card, please report a separate issue to the DRM issue tracker [1].

[1]: https://gitlab.freedesktop.org/drm/amd.
Comment 148 Foulques du Peloux de Praron 2021-02-09 15:36:46 UTC
(In reply to Paul Menzel from comment #147)
> Nice. What is your hardware configuration? What graphics device do you use?
> If it’s an AMD graphics card, please report a separate issues to the DRM
> issue tracker [1].
> 
> [1]: https://gitlab.freedesktop.org/drm/amd.


The separate issue can be viewed here : https://gitlab.freedesktop.org/drm/amd/-/issues/1481
Comment 149 Devrandom 2021-02-10 10:38:41 UTC
I had the same issue (reboots every ~30 min)

The good news is that I have been running without incidents for 24 hours with amdgpu.ppfeaturemask=0xffffbffd .

Configuration:

- CPU: AMD 3900X
- Motherboard: X570 AORUS ELITE with F33a firmware
- RAM: 2x32GB M378A4G43MB1-CTD 2666 MT/s
- GPU: RX 5500 (4 GB Sapphire RX 5500 XT)
- kernel 5.8.0-43-generic (Ubuntu)
Comment 150 Alex Deucher 2021-02-10 14:31:55 UTC
(In reply to Devrandom from comment #149)
> I had the same issue (reboots every ~30 min)
> 
> The good news is that I have been running without incidents for 24 hours
> with amdgpu.ppfeaturemask=0xffffbffd .
> 
> Configuration:
> 
> - CPU: AMD 3900X
> - Motherboard: X570 AORUS ELITE with F33a firmware

What do you mean by F33a firmware?

> - RAM: 2x32GB M378A4G43MB1-CTD 2666 MT/s
> - GPU: RX 5500 (4 GB Sapphire RX 5500 XT)
> - kernel 5.8.0-43-generic (Ubuntu)

Since you have a polaris board, as per comment 75 et al., can you try the patches I suggested?  Better yet, they're already upstream, so if you could try a 5.10 or 5.11 kernel, that would work too.
Comment 151 Devrandom 2021-02-10 14:47:09 UTC
(In reply to Alex Deucher from comment #150)
> (In reply to Devrandom from comment #149)
> > I had the same issue (reboots every ~30 min)
> > 
> > The good news is that I have been running without incidents for 24 hours
> > with amdgpu.ppfeaturemask=0xffffbffd .
> > 
> > Configuration:
> > 
> > - CPU: AMD 3900X
> > - Motherboard: X570 AORUS ELITE with F33a firmware
> 
> What do you mean by F33a firmware?

Sorry, I meant BIOS version.

> 
> > - RAM: 2x32GB M378A4G43MB1-CTD 2666 MT/s
> > - GPU: RX 5500 (4 GB Sapphire RX 5500 XT)
> > - kernel 5.8.0-43-generic (Ubuntu)
> 
> Since you have a polaris board, as per comment 75 et al., can you try the
> patches I suggested?  Better yet, they're already upstream so if you could
> try an 5.10 or 5.11 kernel that would work too.

I'm installing from https://kernel.ubuntu.com/~kernel-ppa/mainline/v5.10/amd64/ and will update here soon.
Comment 152 Gurenko Alex 2021-02-10 14:55:44 UTC
I'm gonna post some updates here too (copy-paste from level1 forum discussion)

I guess it’s obvious that settings don’t do much. My experience turned to a complete gamble with not crashing at all one day and crashing every few minutes several days straight.

I’ve finally decided to start swapping components around since I have my wife’s PC with almost identical components.

The first experiment was to swap CPUs (my 5900X into her machine and her 5800X into mine): her PC worked for 3 days straight with zero issues (granted, she is running Windows, but I do not believe it’s an OS-level problem). On the other hand, after I swapped her CPU into my system and loaded Optimized Defaults…the system refused to boot. The POST LEDs kept getting stuck on CPU, then quickly passed everything else, followed by a reboot. Removing the battery helped, so I managed to get the system running, but after almost a day I got exactly the same reboot with the same MCE error. Since she had been running this CPU for almost 3 months without a single problem, and then ran my CPU, the one that rebooted on me on day one, without issues, I believe we can conclude that it’s not a CPU issue per se.

Yesterday, after spending another few hours putting the CPUs back, I swapped the memory modules. We have identical modules, so I’m testing whether mine are problematic. Although I did a 4h+ test run of memtest86 a few weeks ago, they could still be the problem.

Should that test fail, the last item to check would be to swap our GPUs: I’m running an Aorus RX 5700XT and she has a Gigabyte Vision RTX 3070 OC. If I’m not mistaken, everyone in here has an AMD GPU with open source drivers? While I don’t have a single log line that would indicate a problem with the GPU/driver around the time of the crash, people with 3000-series do report those.

If all else fails, I believe it’s a motherboard issue. The PSU is still in play, but I strongly believe PSU issues would present themselves more consistently.

Also, I have a call today with AMD support (I guess they are tired of exchanging emails for 2+ weeks now). I don’t expect anything useful from them, but who knows. I’ll post my findings after the call.
Comment 153 Devrandom 2021-02-11 11:10:53 UTC
(In reply to Devrandom from comment #151)
> (In reply to Alex Deucher from comment #150)
> > (In reply to Devrandom from comment #149)
> > > I had the same issue (reboots every ~30 min)
> > > 
> > > The good news is that I have been running without incidents for 24 hours
> > > with amdgpu.ppfeaturemask=0xffffbffd .
> > > 
> > > Configuration:
> > > 
> > > - CPU: AMD 3900X
> > > - Motherboard: X570 AORUS ELITE with F33a firmware
> > 
> > What do you mean by F33a firmware?
> 
> Sorry, I meant BIOS version.
> 
> > 
> > > - RAM: 2x32GB M378A4G43MB1-CTD 2666 MT/s
> > > - GPU: RX 5500 (4 GB Sapphire RX 5500 XT)
> > > - kernel 5.8.0-43-generic (Ubuntu)
> > 
> > Since you have a polaris board, as per comment 75 et al., can you try the
> > patches I suggested?  Better yet, they're already upstream so if you could
> > try an 5.10 or 5.11 kernel that would work too.
> 
> I'm installing from
> https://kernel.ubuntu.com/~kernel-ppa/mainline/v5.10/amd64/ and will update
> here soon.

This has been stable about 20 hours with kernel:

Linux version 5.10.0-051000-generic (kernel@kathleen) (gcc (Ubuntu 10.2.0-13ubuntu1) 10.2.0, GNU ld (GNU Binutils for Ubuntu) 2.35.1) #202012132330 SMP Sun Dec 13 23:33:36 UTC 2020

and cmdline:

BOOT_IMAGE=/vmlinuz-5.10.0-051000-generic root=/dev/mapper/vgubuntu-root ro quiet splash

(i.e. no amdgpu flags)
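
When comparing runs with and without workarounds, it can help to confirm programmatically which amdgpu options the running kernel was really booted with. The following is a minimal Python sketch; the function names are made up for illustration, and it only assumes the usual space-separated `name=value` format of /proc/cmdline.

```python
# Sketch: extract amdgpu.* module options from a kernel command line.
# The helper names below are hypothetical, not part of any existing tool.

def amdgpu_params(cmdline):
    """Return a dict of amdgpu.<name>=<value> options found in `cmdline`."""
    params = {}
    prefix = "amdgpu."
    for token in cmdline.split():
        if token.startswith(prefix):
            key, _, value = token[len(prefix):].partition("=")
            params[key] = value
    return params


def running_amdgpu_params(path="/proc/cmdline"):
    """Read the command line of the running kernel (Linux only)."""
    with open(path) as f:
        return amdgpu_params(f.read())


example = ("BOOT_IMAGE=/vmlinuz-5.10.0-051000-generic "
           "root=/dev/mapper/vgubuntu-root ro quiet splash")
print(amdgpu_params(example))
```

An empty result for the command line quoted above matches the "no amdgpu flags" statement; with a workaround active, the dict would contain entries such as `ppfeaturemask`.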
Comment 154 Jacob Tjørnholm 2021-02-11 22:43:39 UTC
For me, the solution was to set a fixed CPU ratio in the BIOS. I haven't had a single stability issue for two weeks now since fixing the CPU ratio at 3900 MHz. 

My system: 
CPU: AMD Ryzen 7 3800XT, 3.9 - 4.7 GHz, 8 Core
RAM: G.Skill Aegis 32GB (2-KIT) DDR4 3000MHz CL16 DIMM
GFX: Gigabyte GeForce GTX 1650 4 GB OC
MB: MSI B550-A PRO ATX

For months I had been experiencing spontaneous reboots while running Linux. Usually at least once a day, often on video calls. Also easy to reproduce by letting Youtube run in fullscreen for a few hours. It did not seem to be linked to high CPU load or temperature. I never experienced other stability issues (eg. freezes), just the reboots. 

After each reboot I would see one or more "Machine Check" (MCE) entries during the boot.  

I never had any issues running Windows 10. Tried running Youtube in fullscreen overnight several times, no problems. 

I've been following this thread and tried disabling SMT. This seemed to help a lot, but I did get a reboot after a couple of days. Also, it was quite annoying that I was no longer able to suspend the computer. 

Finally, not even sure where, I read the hint about setting a fixed CPU ratio, and that seems to have completely solved the problem. I'm sure I could set it much higher than 3900 but I care more about stability than performance (because the machine is more than fast enough for my work needs). 

So my system is now rock solid and I love it. Let me know if I left out any important details. This issue is very frustrating, so I would be more than happy to help others suffering from it.
Comment 155 Martin 2021-02-15 00:29:00 UTC
Unexpected reboots (within 30 minutes of booting) only started occurring for me since I moved from kernel 5.10.8-100.fc32.x86_64 to 5.10.12-100.fc32.x86_64. Booting back into 5.10.8 stopped the random reboots. The unexpected reboots continued to occur with 5.10.15-100.fc32.x86_64, but interestingly, I experienced an unexpected reboot after booting back into 5.10.8 after having installed 5.10.15, which leads me to think it may not be (at least entirely) a kernel issue.

The system had been stable for one year prior, and there have not been any changes in physical components.

Today, I disabled IOMMU, SR-IOV, and "Above 4G Decoding" in the BIOS and have not had any unexpected reboots or MCE events logged in several hours (though I still do have disk errors being logged). I disabled all three items at the same time, so I'm not sure which one(s) actually mitigated the issue.

---

Sample disk error retrieved with ras-mc-ctl (continues to occur):
error: dev=0:0, sector=-1, nr_sector=0, error='I/O error', rwbs='N', cmd=''

Sample MCE event retrieved with ras-mc-ctl (has not reoccurred since disabling the three BIOS options mentioned above):
error: Corrected error, no action required., CPU 2, mcg mcgstatus=0, mci CECC, mcgcap=0x0000011c, status=0x98004000003e0000, misc=0xd01a001a00000000, walltime=0x602962dd, cpuid=0x00870f10, bank=0x00000019

---

OS: Fedora 32
Mesa 20.2.3-1.fc32

CPU: AMD Ryzen 5 3600X
Graphics Card: AMD Radeon RX 5700
Motherboard: ASUS Pro WS X570-ACE (BIOS version 3204, AGESA 1.2.0.0)
Memory: Crucial Ballistix Elite 4x8GB DDR4 3600MHz
Internal Storage: Crucial P1 1TB 3D NAND NVMe PCIe M.2 SSD
PSU: Seasonic PRIME Ultra 1000 Titanium
Comment 156 Paul Menzel 2021-02-15 11:30:38 UTC
(In reply to Martin from comment #155)
> Unexpected reboots (within 30 minutes of booting) only started occurring for
> me since I moved from kernel 5.10.8-100.fc32.x86_64 to
> 5.10.12-100.fc32.x86_64. Booting back into 5.10.8 stopped the random
> reboots.

As you have an AMD graphics card (AMD Radeon RX 5700), I’d say, the issue is unrelated to this one here. As you seem to be able to reproduce it, bisection would be the quickest way to find the culprit. If you find the commit, please create a separate issue (probably at https://gitlab.freedesktop.org/drm/amd).

> The unexpected reboots continued to occur with
> 5.10.15-100.fc32.x86_64, but interestingly, I experienced an unexpected
> reboot after booting back into 5.10.8 after having installed 5.10.15, which
> leads me to think it may not be (at least entirely) a kernel issue.

Did this reboot have the same symptoms?

[…]
Comment 157 Martin 2021-02-15 16:20:58 UTC
(In reply to Paul Menzel from comment #156)
> (In reply to Martin from comment #155)
> > The unexpected reboots continued to occur with
> > 5.10.15-100.fc32.x86_64, but interestingly, I experienced an unexpected
> > reboot after booting back into 5.10.8 after having installed 5.10.15, which
> > leads me to think it may not be (at least entirely) a kernel issue.
> 
> Did this reboot have the same symptoms?
> 
> […]

I believe so. The logs from the older reboots are no longer available, but here's the last error in the journal from the reboot while using 5.10.15 (three of these same errors were logged):

mce: [Hardware Error]: Machine check events logged
[Hardware Error]: Corrected error, no action required.
[Hardware Error]: CPU:0 (17:71:0) MC25_STATUS[-|CE|MiscV|-|-|-|-|CECC|-|-|-]: 0x98004000003e0000
[Hardware Error]: IPID: 0x000100ff03830400
[Hardware Error]: Platform Security Processor Ext. Error Code: 62
[Hardware Error]: cache level: RESV, tx: INSN
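
To compare a raw status value like the one above with the signature this report is about (bea0000000000108), it can be decoded by hand. Below is a minimal Python sketch. The bit positions are an assumption based on the common x86 MCi_STATUS layout plus the AMD-specific CECC/UECC bits and a 6-bit extended error code in bits 21:16; please check them against the processor documentation for your CPU.

```python
# Sketch: decode a few fields of a machine check status register value.
# ASSUMPTION: bit positions as commonly documented for x86 MCi_STATUS and
# AMD's MCA_STATUS; this is not the kernel's decoder.
STATUS_FLAGS = {
    63: "Val",    # register contains a valid error
    62: "Over",   # overflow, an earlier error was overwritten
    61: "UC",     # uncorrected error
    60: "En",     # error reporting enabled
    59: "MiscV",  # MISC register valid
    58: "AddrV",  # ADDR register valid
    57: "PCC",    # processor context corrupt
    46: "CECC",   # corrected by ECC
    45: "UECC",   # uncorrectable ECC error
}


def decode_status(status):
    """Return set flags, extended error code and error code of `status`."""
    flags = [name for bit, name in sorted(STATUS_FLAGS.items(), reverse=True)
             if (status >> bit) & 1]
    return {
        "flags": flags,
        "ext_error_code": (status >> 16) & 0x3F,
        "error_code": status & 0xFFFF,
    }


for value in (0x98004000003E0000, 0xBEA0000000000108):
    print(hex(value), decode_status(value))
```

Under these assumptions, 0x98004000003e0000 decodes to a corrected error (CECC set, UC clear) with extended error code 62, in line with the kernel output above, while bea0000000000108 has UC and PCC set, i.e. an uncorrected error with corrupt processor context.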
Comment 158 Paul Menzel 2021-02-15 16:32:59 UTC
(In reply to Martin from comment #157)

[…]

> but here's the last error in the journal from the reboot while using 5.10.15
> (three of these same errors were logged):
> 
> mce: [Hardware Error]: Machine check events logged
> [Hardware Error]: Corrected error, no action required.
> [Hardware Error]: CPU:0 (17:71:0) MC25_STATUS[-|CE|MiscV|-|-|-|-|CECC|-|-|-]:
> 0x98004000003e0000
> [Hardware Error]: IPID: 0x000100ff03830400
> [Hardware Error]: Platform Security Processor Ext. Error Code: 62
> [Hardware Error]: cache level: RESV, tx: INSN

That’s different from the issue this bug report is about. Please create a separate issue.
Comment 159 Stefan de Konink 2021-02-28 10:37:00 UTC
I don't know if this is related, but I can now consistently crash a 5.11.2-gentoo system (AMD Ryzen 5 2500U with Radeon Vega Mobile) by just pulling out the HDMI cable within Xorg. I have not yet been able to get a trace.
Comment 160 T. Lindig 2021-02-28 20:25:25 UTC
I had exactly the same problems. Spontaneous reboots after performance throttling.


After such a reboot the error log said:


> 14:03:41 kernel: mce: [Hardware Error]: PROCESSOR 2:870f10 TIME 1614431013
> SOCKET 0 APIC 0 microcode 8701021
> 14:03:41 kernel: mce: [Hardware Error]: TSC 0 MISC d0120001000000 SYND
> 5d020002 IPID 1002e00000500 
> 14:03:41 kernel: mce: [Hardware Error]: CPU 0: Machine Check: 0 Bank 27:
> baa0000002080b
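
For anyone collecting these lines from several machines, here is a minimal Python sketch that pulls the CPU number, bank and status out of a "Machine Check" journal line like the one above. The regular expression is only an assumption derived from the log lines posted in this thread, and it expects the line unwrapped, as the journal prints it.

```python
# Sketch: extract CPU, bank and status from an mce journal line.
# ASSUMPTION: the line format matches the examples quoted in this thread.
import re

MCE_LINE = re.compile(
    r"CPU (\d+): Machine Check: \d+ Bank (\d+): ([0-9a-fA-F]+)"
)


def parse_mce_line(line):
    """Return (cpu, bank, status) as integers, or None if nothing matches."""
    match = MCE_LINE.search(line)
    if match is None:
        return None
    cpu, bank, status = match.groups()
    return int(cpu), int(bank), int(status, 16)


line = ("14:03:41 kernel: mce: [Hardware Error]: CPU 0: "
        "Machine Check: 0 Bank 27: baa0000002080b")
print(parse_mce_line(line))
```

Grouping the results by bank makes it easier to see whether a report shows the Bank 5 signature from the summary of this bug or a different one.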


In the last two days I have replaced all components one after the other without being able to solve the problem permanently.


But the solution was changing a BIOS setting.


I have MSI MAG B550M MORTAR WIFI


I switched the following option in expert mode:

> CPU Config / AMD CBS -> Power Supply Idle Control

from "Auto" to "Typical Current Idle".


After that, the system runs absolutely stable.
Comment 161 Paul Menzel 2021-03-01 07:32:52 UTC
(In reply to Stefan de Konink from comment #159)
> I don't know if this is related, but I can now consistently crash a
> 5.11.2-gentoo system (AMD Ryzen 5 2500U with Radeon Vega Mobile) by just
> pulling out the HDMI cable within Xorg. I have not yet been able to get a
> trace.

That sounds like a different problem. Please create a separate issue at the AMDGPU issue tracker [1] and mention it here. If it’s a regression, it’d be great if you could bisect it, as you seem to be able to reproduce it “easily”.


[1]: https://gitlab.freedesktop.org/drm/amd
Comment 162 Paul Menzel 2021-03-01 07:35:48 UTC
(In reply to kernel.org from comment #160)
> I had exactly the same problems. Spontaneous reboots after performance
> throttling.
> 
> After such a reboot the error log said:
> 
> > 14:03:41 kernel: mce: [Hardware Error]: PROCESSOR 2:870f10 TIME 1614431013
> SOCKET 0 APIC 0 microcode 8701021
> > 14:03:41 kernel: mce: [Hardware Error]: TSC 0 MISC d0120001000000 SYND
> 5d020002 IPID 1002e00000500 
> > 14:03:41 kernel: mce: [Hardware Error]: CPU 0: Machine Check: 0 Bank 27:
> baa0000002080b
> 
> In the last two days I have replaced all components one after the other
> without being able to solve the problem permanently.
> 
> But the solution was the change of a bios setting.
> 
> I have MSI MAG B550M MORTAR WIFI
> 
> I switched following options in expert mode:
> 
> > CPU Config / AMD CBS -> Power Supply Idle Control
> 
> from "Auto" to "Typical Current Idle
> 
> After that, system runs absolute stable.

Thank you for the report. Can you please add the firmware version of the mainboard, and if you use a dedicated graphics card?
Comment 163 Amadej Kastelic 2021-03-01 10:29:10 UTC
I've had the same issue for a few months now. Mostly happens when my pc is idle. My specs are:

- OS: Arch Linux
- Kernel: 5.11.2
- CPU: AMD Ryzen 9 3900x
- GPU: Gigabyte RX 5700 XT
- RAM: 4x16GiB G.Skill 3200Mhz
- Motherboard: Gigabyte X570 Aorus Master (F33c bios)

Disabling C-states using kernel parameters didn't do anything, but using the ZenStates script seemed to fix it (https://github.com/r4m0n/ZenStates-Linux). I disabled the C6 state. I've been running stable for over a week now. This is my systemd service file:

```
[Unit]
Description=Ryzen Disable C6
DefaultDependencies=no
After=sysinit.target local-fs.target suspend.target hibernate.target
Before=basic.target

[Service]
Type=oneshot
ExecStart=/usr/sbin/zenstates --c6-disable

[Install]
WantedBy=basic.target suspend.target hibernate.target
```

I have also installed a new kernel that might have some additional patches for AMD: https://aur.archlinux.org/packages/linux-amd-znver2/. Maybe it helped, not sure though.
Comment 164 Borislav Petkov 2021-03-01 10:48:11 UTC
(In reply to Amadej Kastelic from comment #163)
> Disabling states using kernel parameter options didn't do anything, but
> using ZenStates script seemed to fix it
> (https://github.com/r4m0n/ZenStates-Linux). I disabled the C6 state.

So I've been trying to find people with such boxes and to pinpoint which
MSR bits that script modifies in order to fix the issue in the kernel so
that there's no need to have an external script.

Can you do the following on your box:

Disable the zenstates services file so that it doesn't load, boot the
box and do as root:

# rdmsr -a 0xC0010292 | uniq
# rdmsr -a 0xC0010296 | uniq

and paste the output from both commands here.

This will tell us which bits --c6-disable turns off. I'm assuming all
4 but I'd like to make sure.

Also, "F33c bios" is the latest one from your vendor and there's no
newer one?

Also, can you send a full dmesg from the box? Privately's fine too.

Thx.
Comment 165 Amadej Kastelic 2021-03-01 11:02:10 UTC
(In reply to Borislav Petkov from comment #164)
> (In reply to Amadej Kastelic from comment #163)
> > Disabling states using kernel parameter options didn't do anything, but
> > using ZenStates script seemed to fix it
> > (https://github.com/r4m0n/ZenStates-Linux). I disabled the C6 state.
> 
> So I've been trying to find people with such boxes and to pinpoint which
> MSR bits that script modifies in order to fix the issue in the kernel so
> that there's no need to have an external script.
> 
> Can you do the following on your box:
> 
> Disable the zenstates services file so that it doesn't load, boot the
> box and do as root:
> 
> # rdmsr -a 0xC0010292 | uniq
> # rdmsr -a 0xC0010296 | uniq
> 
> and paste the output from both commands here.
> 
> This will tell us which bits --c6-disable turns off. I'm assuming all
> 4 but I'd like to make sure.
> 
> Also, "F33c bios" is the latest one from your vendor and there's no
> newer one?
> 
> Also, can you send a full dmesg from the box? Privately's fine too.
> 
> Thx.

C6 enabled:
```
sudo rdmsr -a 0xC0010292 | uniq
104000012
sudo rdmsr -a 0xC0010296 | uniq                          
484848
```

C6 Disabled:
```
sudo rdmsr -a 0xC0010292 | uniq
4000012
sudo rdmsr -a 0xC0010296 | uniq 
80808
```
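
For reference, the two sets of readings above can be compared bit by bit. The following is a minimal Python sketch, not part of the original report; the helper name `changed_bits` is made up for illustration. It XORs each value read with C6 enabled against the one read with C6 disabled and lists the bit positions that differ:

```python
def changed_bits(before: int, after: int) -> list[int]:
    """Return the positions of all bits that differ between two 64-bit MSR values."""
    diff = before ^ after
    return [bit for bit in range(64) if (diff >> bit) & 1]


# Values pasted above (rdmsr prints hexadecimal without the 0x prefix).
# Each entry maps an MSR address to (C6 enabled, C6 disabled).
readings = {
    0xC0010292: (0x104000012, 0x4000012),
    0xC0010296: (0x484848, 0x80808),
}

for msr, (enabled, disabled) in readings.items():
    print(f"MSR {msr:#x}: changed bits {changed_bits(enabled, disabled)}")
```

With these numbers the difference is bit 32 in 0xC0010292 and bits 6, 14 and 22 in 0xC0010296, four bits in total.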

F33c is the latest version.

Dmesg: https://pastebin.com/zpqkq8vj
Comment 166 Jaakko Kantojärvi 2021-03-01 12:45:20 UTC
For a comparison, with hardware

> CPU: AMD Ryzen 9 3900X (microcode 0x08701021)
> MB.: Gigabyte X570 Aorus Elite (BIOS F30 / 2020-08-15)
> GPU: Asus Radeon R9 270X (connected via the x4 slot behind the X570 chipset)
> RAM: Kingston KHX3200C16D4/16GX DDR4 3200MHz (2x)
> PSU: Seasonic 750W PRIME GX-750

and C6 disabled in bios

```
% sudo rdmsr -a 0xC0010292 | uniq
12
% sudo rdmsr -a 0xC0010296 | uniq
80808
```

and ZenState list

```
% sudo python3 zenstates.py -l
P0 - Enabled - FID = 9C - DID = 8 - VID = 48 - Ratio = 39.00 - vCore = 1.10000
P1 - Enabled - FID = C3 - DID = A - VID = 48 - Ratio = 39.00 - vCore = 1.10000
P2 - Enabled - FID = 84 - DID = C - VID = 68 - Ratio = 22.00 - vCore = 0.90000
P3 - Disabled
P4 - Disabled
P5 - Disabled
P6 - Disabled
P7 - Disabled
C6 State - Package - Disabled
C6 State - Core - Disabled
```

cpupower idle-info

```
% sudo cpupower idle-info
CPUidle driver: acpi_idle
CPUidle governor: menu
analyzing CPU 0:

Number of idle states: 2
Available idle states: POLL C1
POLL:
Flags/Description: CPUIDLE CORE POLL IDLE
Latency: 0
Usage: 0
Duration: 0
C1:
Flags/Description: ACPI HLT
Latency: 0
Usage: 30708884
Duration: 200450383887
```

And this has NOT fixed the idle crashes for me. Currently, I have an Erlang VM alive, and the last time I stopped it, the computer crashed when the displays went into power save mode. I think the Erlang VM uses a busy loop for actor scheduling.


p.s. I need to update my BIOS; there seems to be a version with "Improve system stability"
Comment 167 Borislav Petkov 2021-03-01 13:02:07 UTC
(In reply to Jaakko Kantojärvi from comment #166)
> And this has NOT fixed the idle crashes for me. Currently, I have an erlang
> VM alive, and last time I stopped that, the computer crashed when displays
> went to power save mode. I think the erlang VM uses busy loop for actor
> scheduling.
> 
> 
> p.s. I need to update my BIOS, there seems to be version with "Improve
> system stability"

Yah, try that first. Gigabyte boards have been known to f*ck stuff up in the BIOS in the past too. Not that it means anything for this current issue - just sayin'.
Comment 168 Borislav Petkov 2021-03-01 13:02:52 UTC
(In reply to Amadej Kastelic from comment #165)
> C6 enabled:

By "C6 enabled" you mean, the MSR contents after a fresh boot and
zenstates disabled, yes?

> ```
> sudo rdmsr -a 0xC0010292 | uniq
> 104000012
> sudo rdmsr -a 0xC0010296 | uniq
> 484848
> ```
>
> C6 Disabled:
> ```
> sudo rdmsr -a 0xC0010292 | uniq
> 4000012
> sudo rdmsr -a 0xC0010296 | uniq
> 80808
> ```

Ok, let's try this now: leave zenstates disabled, boot your machine and
do as root:

# wrmsr -a 0xC0010292 0x4000012

and confirm bit 32 is off:

# rdmsr -a 0xC0010292 | uniq
4000052

and now use the box normally and see if it still freezes. If it doesn't,
turning off that bit alone is a good indication that it fixes the issue.

Thx.
Comment 169 Gurenko Alex 2021-03-01 13:11:55 UTC
I forgot to provide an update here as well, for clarity's sake:

 My CPU and memory in my wife's motherboard (same model, same BIOS) worked perfectly fine. Her CPU and her memory caused exactly same errors in my motherboard.

AMD support: AMD said they are ready to RMA the CPU should I choose to do so, but they do not believe it’s the problem, which was also confirmed by swapping the CPU into my wife’s motherboard. The guy I talked to (located in Canada) tried to replicate the problem with Ubuntu 20.10, a 5900X and an Asus X570 Pro board, but had no luck; it ran stable for a week. He said it looks like the problem is with the particular motherboard, most likely with power delivery. In addition, he confirmed that this error is NOT in the CPU; it is simply an event that the CPU catches and reports back. The error points to a timeout in data reception, which is probably caused by a slow power state change due to poor power delivery. Current state: they are waiting for me to exchange the board and report the results.

Motherboard replacement: MSI said I have to send the motherboard back to the store where I purchased it. I sent my motherboard back to the store 2 weeks ago; they confirmed that there is no physical damage to the board and sent it to MSI for a detailed look, which is supposed to take ~14 days (counting from Friday the 26th).

I'm wondering if anyone with this problem has access to another motherboard that is proven to work, to validate the "it's a motherboard, not the CPU per se" theory.
Comment 170 Amadej Kastelic 2021-03-01 13:13:12 UTC
(In reply to Borislav Petkov from comment #168)
> (In reply to Amadej Kastelic from comment #165)
> > C6 enabled:
> 
> By "C6 enabled" you mean, the MSR contents after a fresh boot and
> zenstates disabled, yes?
> 

Yes.

> Ok, let's try this now: leave zenstates disabled, boot your machine and
> do as root:
> 
> # wrmsr -a 0xC0010292 0x4000012
> 
> and confirm bit 32 is off:
> 
> # rdmsr -a 0xC0010292 | uniq
> 4000052
> 

Is this ok?

```
# sudo wrmsr -a 0xC0010292 0x4000012

# sudo rdmsr -a 0xC0010292 | uniq                             
4000012
```
Comment 171 Amadej Kastelic 2021-03-01 13:16:52 UTC
(In reply to Amadej Kastelic from comment #170)
> (In reply to Borislav Petkov from comment #168)
> > (In reply to Amadej Kastelic from comment #165)
> > > C6 enabled:
> > 
> > By "C6 enabled" you mean, the MSR contents after a fresh boot and
> > zenstates disabled, yes?
> > 
> 
> Yes.
> 
> > Ok, let's try this now: leave zenstates disabled, boot your machine and
> > do as root:
> > 
> > # wrmsr -a 0xC0010292 0x4000012
> > 
> > and confirm bit 32 is off:
> > 
> > # rdmsr -a 0xC0010292 | uniq
> > 4000052
> > 
> 
> Is this ok?
> 
> ```
> # sudo wrmsr -a 0xC0010292 0x4000012
> 
> # sudo rdmsr -a 0xC0010292 | uniq                             
> 4000012
> ```

ZenStates now lists:

```
# sudo zenstates --list                    
P0 - Enabled - FID = 98 - DID = 8 - VID = 48 - Ratio = 38.00 - vCore = 1.10000
P1 - Enabled - FID = 8C - DID = A - VID = 58 - Ratio = 28.00 - vCore = 1.00000
P2 - Enabled - FID = 84 - DID = C - VID = 68 - Ratio = 22.00 - vCore = 0.90000
P3 - Disabled
P4 - Disabled
P5 - Disabled
P6 - Disabled
P7 - Disabled
C6 State - Package - Disabled
C6 State - Core - Enabled
```
Comment 172 Borislav Petkov 2021-03-01 13:23:21 UTC
(In reply to Amadej Kastelic from comment #170)
> Is this ok?
> 
> ```
> # sudo wrmsr -a 0xC0010292 0x4000012
> 
> # sudo rdmsr -a 0xC0010292 | uniq                             
> 4000012
> ```

Yap, looks good. Now watch the box pls and see if it freezes. Also, you should run these commands each time you boot/reboot it because BIOS will have reset bit 32 to 1.

Thx.
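
Since the write has to be repeated after every boot, it could be scripted. The following is only a rough Python sketch under stated assumptions, not a tested fix: it assumes the `msr` kernel module is loaded so that `/dev/cpu/N/msr` exists, it must run as root, and, unlike the `wrmsr` command above which writes a fixed value, it does a read-modify-write that clears only bit 32. The function names and the `--apply` flag are made up for illustration. Writing MSRs can hang a machine, so treat it with care.

```python
import glob
import os
import struct
import sys

MSR_ADDR = 0xC0010292  # register discussed in this thread
TARGET_BIT = 32        # the bit that differed in comment #165


def clear_bit(value: int, bit: int) -> int:
    """Return value with the given bit forced to 0."""
    return value & ~(1 << bit)


def read_msr(path: str, addr: int) -> int:
    """Read one 64-bit MSR through the msr character device."""
    fd = os.open(path, os.O_RDONLY)
    try:
        return struct.unpack("<Q", os.pread(fd, 8, addr))[0]
    finally:
        os.close(fd)


def write_msr(path: str, addr: int, value: int) -> None:
    """Write one 64-bit MSR through the msr character device."""
    fd = os.open(path, os.O_WRONLY)
    try:
        os.pwrite(fd, struct.pack("<Q", value), addr)
    finally:
        os.close(fd)


def main() -> None:
    # One device node per logical CPU, e.g. /dev/cpu/0/msr.
    for path in sorted(glob.glob("/dev/cpu/[0-9]*/msr")):
        old = read_msr(path, MSR_ADDR)
        new = clear_bit(old, TARGET_BIT)
        if new != old:
            write_msr(path, MSR_ADDR, new)
        print(f"{path}: {old:#x} -> {new:#x}")


# Only touch hardware when explicitly asked to.
if __name__ == "__main__" and "--apply" in sys.argv[1:]:
    main()
```

Applied to the value from comment #165, `clear_bit(0x104000012, 32)` gives `0x4000012`, the same value that the `wrmsr` command writes.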
Comment 173 Amadej Kastelic 2021-03-01 13:24:35 UTC
(In reply to Borislav Petkov from comment #172)
> Yap, looks good. Now watch the box pls and see if it freezes. Also, you
> should run these commands each time you boot/reboot it because BIOS will
> have reset bit 32 to 1.
> 
Will do. Thank you very much.
Comment 174 Paul Menzel 2021-03-01 13:52:58 UTC
(In reply to Gurenko Alex from comment #169)

[…]

> I'm wondering if anyone with this problem has access to another motherboard
> that is proved working to validate "It's a motherboard, not CPU per se"
> problem.

Sorry for the unhelpful reply, but this bug report has now been used for so many reports that I think all components (power, firmware, graphics card, Linux graphics driver, …) are responsible for some of them.
Comment 175 Paul Menzel 2021-03-01 13:53:32 UTC
Clemens, as the creator of this issue/bug report, I assume you are still experiencing this issue?
Comment 176 Sam Cowley 2021-03-01 17:43:59 UTC
(In reply to Paul Menzel from comment #174)
> (In reply to Gurenko Alex from comment #169)
> 
> […]
> 
> > I'm wondering if anyone with this problem has access to another motherboard
> > that is proved working to validate "It's a motherboard, not CPU per se"
> > problem.
> 
> Sorry, for the unhelpful reply, but this bug report has now been used for so
> many reports, that I think, all components (power, firmware, graphics card,
> Linux graphics driver, …) are responsible for some of them.

I think you're right; it seems like multiple problems are manifesting in the exact same way. My dmesg output had the exact error that was reported here, but it turns out that the key was my GPU (https://wiki.archlinux.org/index.php/AMDGPU#R9_390_series_poor_performance_and/or_instability) and I haven't had a spontaneous reboot since adding those kernel parameters.

For anyone else trying to determine if it's a GPU stability problem, this is what I see in my Xorg logs right before a crash:

> [    56.839] (II) modesetting: Driver for Modesetting Kernel Drivers: kms
> [    56.840] (EE) open /dev/dri/card0: No such file or directory
> [    56.840] (WW) Falling back to old probe method for modesetting
> [    56.840] (EE) open /dev/dri/card0: No such file or directory
> [    56.840] (EE) Screen 0 deleted because of no matching config section.
> [    56.840] (II) UnloadModule: "modesetting"
> [    56.840] (EE) Device(s) detected, but none match those in the config
> file.
> [    56.840] (EE)
> Fatal server error:
> [    56.840] (EE) no screens found(EE)
> [    56.840] (EE)

My specs for reference:

* Ryzen 7 3700X
* Asus PRIME X570-P (BIOS version 3001)
* Radeon R9 390
* EVGA 750W Gold

BIOS settings: Fixed CPU clock ratio to 3600 MHz
Kernel parameters: radeon.cik_support=0 radeon.si_support=0 amdgpu.cik_support=1 amdgpu.si_support=1 amdgpu.dc=1
Comment 177 T. Lindig 2021-03-01 20:25:17 UTC
(In reply to Paul Menzel from comment #162)
> (In reply to kernel.org from comment #160)
> > I had exactly the same problems. Spontaneous reboots after performance
> > throttling.
> > 
> > After such a reboot the error log said:
> > 
> > > 14:03:41 kernel: mce: [Hardware Error]: PROCESSOR 2:870f10 TIME
> 1614431013
> > SOCKET 0 APIC 0 microcode 8701021
> > > 14:03:41 kernel: mce: [Hardware Error]: TSC 0 MISC d0120001000000 SYND
> > 5d020002 IPID 1002e00000500 
> > > 14:03:41 kernel: mce: [Hardware Error]: CPU 0: Machine Check: 0 Bank 27:
> > baa0000002080b
> > 
> > In the last two days I have replaced all components one after the other
> > without being able to solve the problem permanently.
> > 
> > But the solution was the change of a bios setting.
> > 
> > I have MSI MAG B550M MORTAR WIFI
> > 
> > I switched following options in expert mode:
> > 
> > > CPU Config / AMD CBS -> Power Supply Idle Control
> > 
> > from "Auto" to "Typical Current Idle
> > 
> > After that, system runs absolute stable.
> 
> Thank you for the report. Can you please add the firmware version of the
> mainboard, and if you use a dedicated graphics card?

Of course, sorry.

I tested it with:
CPU: AMD Ryzen 5 3600 6-Core Processor
Graphics Card: Radeon RX 570 Series (POLARIS10, DRM 3.38.0, 5.8.0-44-generic, LLVM 11.0.0) and Radeon HD 5670 GPU (Gigabyte GV-R5670OC-1GI Rev1.0)
Main Board: MSI MAG B550M MORTAR WIFI
Bios firmware version: 7C94v15, 7C94v154(beta) and 7C94v163(beta)
Comment 178 T. Lindig 2021-03-02 08:31:28 UTC
(In reply to T. Lindig from comment #177)
> (In reply to Paul Menzel from comment #162)
> > (In reply to kernel.org from comment #160)
> > > I had exactly the same problems. Spontaneous reboots after performance
> > > throttling.
> > > 
> > > After such a reboot the error log said:
> > > 
> > > > 14:03:41 kernel: mce: [Hardware Error]: PROCESSOR 2:870f10 TIME
> > 1614431013
> > > SOCKET 0 APIC 0 microcode 8701021
> > > > 14:03:41 kernel: mce: [Hardware Error]: TSC 0 MISC d0120001000000 SYND
> > > 5d020002 IPID 1002e00000500 
> > > > 14:03:41 kernel: mce: [Hardware Error]: CPU 0: Machine Check: 0 Bank
> 27:
> > > baa0000002080b
> > > 
> > > In the last two days I have replaced all components one after the other
> > > without being able to solve the problem permanently.
> > > 
> > > But the solution was the change of a bios setting.
> > > 
> > > I have MSI MAG B550M MORTAR WIFI
> > > 
> > > I switched following options in expert mode:
> > > 
> > > > CPU Config / AMD CBS -> Power Supply Idle Control
> > > 
> > > from "Auto" to "Typical Current Idle
> > > 
> > > After that, system runs absolute stable.
> > 
> > Thank you for the report. Can you please add the firmware version of the
> > mainboard, and if you use a dedicated graphics card?
> 
> Of course, sorry.
> 
> I tested it with:
> CPU: AMD Ryzen 5 3600 6-Core Processor
> Graphics Card: Radeon RX 570 Series (POLARIS10, DRM 3.38.0,
> 5.8.0-44-generic, LLVM 11.0.0) and Radeon HD 5670 GPU (Gigabyte
> GV-R5670OC-1GI Rev1.0)
> Main Board: MSI MAG B550M MORTAR WIFI
> Bios firmware version: 7C94v15, 7C94v154(beta) and 7C94v163(beta)

update:
After sleeping on it, I have to correct myself: I think the assumption that it is the combination of graphics card, board and CPU is correct.

For me it happened the first time after I played a 3d game via steam for the first time on this system. After finishing the game, if I had played it for some time before (at least 10 minutes) the PC crashed after a short time. After finishing the game I shut down the computer quickly and was surprised that it booted again immediately. The reason was that it crashed before the shutdown was finished. 

This could then be reliably reproduced. The computer crashed regularly after the game ended, if it had been running for some time before. It was not enough to just open it, change a few settings and quit again. 

During the crashes the filesystem on my system ssd was damaged, because it needed several reboots or only worked from USB Live CD. As soon as I accessed the SSD, the PC crashed again. In this situation I also used the other graphics card, without improvement, only the removal of the SSD helped.
Comment 179 Amadej Kastelic 2021-03-02 09:18:16 UTC
(In reply to Borislav Petkov from comment #172)
> Yap, looks good. Now watch the box pls and see if it freezes. Also, you
> should run these commands each time you boot/reboot it because BIOS will
> have reset bit 32 to 1.
> 
> Thx.

Just got a freeze. Will test with the ZenStates script again...

```
[    0.307542] mce: [Hardware Error]: Machine check events logged
[    0.307542] mce: [Hardware Error]: CPU 2: Machine Check: 0 Bank 5: bea0000000000108
[    0.307542] mce: [Hardware Error]: TSC 0 ADDR 7f176cea357c MISC d012000100000000 SYND 4d000000 IPID 500b000000000 
[    0.307542] mce: [Hardware Error]: PROCESSOR 2:870f10 TIME 1614676352 SOCKET 0 APIC 4 microcode 8701021
```
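
To make the raw status word easier to read, here is a small, hedged Python sketch that decodes the top flag bits of an MCA status value. The flag positions (bits 63 to 57) follow the architectural x86 MCi_STATUS layout; treating bits 21:16 as the extended error code is an assumption that is at least consistent with the "Ext. Error Code: 62" line decoded by the kernel for 0x98004000003e0000 elsewhere in this thread. The names `STATUS_FLAGS` and `decode_status` are made up for illustration.

```python
# Architectural MCi_STATUS flag bits (bit position -> short name).
STATUS_FLAGS = {
    63: "Val",    # register contains a valid error
    62: "Over",   # an earlier error was overwritten
    61: "UC",     # uncorrected error
    60: "En",     # reporting was enabled for this error
    59: "MiscV",  # MISC register is valid
    58: "AddrV",  # ADDR register is valid
    57: "PCC",    # processor context corrupt
}


def decode_status(status: int) -> dict:
    """Split a 64-bit MCA status word into flags and error codes."""
    flags = [name for bit, name in STATUS_FLAGS.items() if (status >> bit) & 1]
    return {
        "flags": flags,
        "error_code": status & 0xFFFF,
        # Assumption: extended error code lives in bits 21:16.
        "ext_error_code": (status >> 16) & 0x3F,
    }


# The status from this comment and the corrected error quoted earlier.
for raw in (0xBEA0000000000108, 0x98004000003E0000):
    print(f"{raw:#018x}: {decode_status(raw)}")
```

For the value in this comment the sketch reports Val, UC, En, MiscV, AddrV and PCC set with error code 0x108, i.e. an uncorrected error with corrupt processor context. The corrected PSP error decodes to Val, En and MiscV with extended error code 62.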
Comment 180 Borislav Petkov 2021-03-02 10:46:23 UTC
Hmm, this is the usual watchdog timeout, AFAICT, which means some transaction timed out. You've got the MCE after you rebooted, correct?

Btw, does your BIOS have any settings regarding C-states and if so, what happens if you disable them there?

Thx.
Comment 181 Amadej Kastelic 2021-03-02 11:00:06 UTC
(In reply to Borislav Petkov from comment #180)
> Hmm, this is the usual watchdog timeout, AFAICT, which means some
> transaction timed out. You've got the MCE after you rebooted, correct?

Yes.

> Btw, does your BIOS has any settings regarding C-states and if so, what
> happens if you disable them there?

I've tried disabling them and setting "Power Supply Idle Control" to "Typical Current Idle". It didn't help.
Comment 182 Borislav Petkov 2021-03-02 11:44:59 UTC
It could be that disabling the C-states simply makes the freeze
harder to trigger but does not fix it, and this could really be some
undervolt/overcurrent or similar hardware issue during state switches,
which manifests itself in the transaction timing out.

And considering how setting some GPU power management cmdline args in
the amdgpu driver fixes the issue for some people here, it really could
be that the GPU is causing this power corner case.

Hmm, I wonder if you have a small, cheap GPU and you replace your big
RX 5700 with it, whether it'll trigger then too. If it doesn't because
- and I'm purely conjecturing here - that small GPU never causes the
platform to even get close to that power corner case, then that would be
starting to make sense.

But again, this is all speculation, and without specialized measurement
hardware to check the platform's power limits, who knows what happens.
Comment 183 Amadej Kastelic 2021-03-02 12:18:02 UTC
(In reply to Borislav Petkov from comment #182)
> Hmm, I wonder if you have a small, cheap GPU and you replace your big
> RX 5700 with it, whether it'll trigger then too. If it doesn't because
> - and I'm purely conjecturing here - that small GPU never causes the
> platform to even get close to that power corner case, then that would be
> starting to make sense.

All GPUs are expensive nowadays :)
I'll try to get my hands on an RX 580 and report back though.
Comment 184 Nils 2021-03-04 19:14:06 UTC
I am having the same issue: random switch-offs after anywhere from 20 seconds (before booting!) to 3 days of continuous running. Sometimes after switching off, the system will reboot after a few seconds. In these cases the OS is sometimes not loaded. I thought there might be a short in the power button (case's front panel), but disconnecting it from the MB did not help.

Board:  ASRock X570 Steel Legend AMD X570, BIOS 3.60 dated 2021/2/2.
CPU: AMD Ryzen 5 PRO 4650G with Radeon Graphics (family: 0x17, model: 0x60, stepping: 0x1)
RAM: 2x16GB Crucial, DDR4-3200 DIMM
GPU: Radeon RX 560 (simple PCIE card) -- but this was replaced by a completely different card with no change
PSU: Replaced, tried 3 in total - seems unrelated

The system state (idle/load) seems unimportant though the system was never under real load.  I have no games installed.

No overclocking of any sorts.

After one reboot I opened the BIOS setup and the machine rebooted again from within it, which means that my issue certainly isn't Linux related, right? Or could a bad state be induced by the previously running system, survive the power cycle, and then cause the problem while running the BIOS setup?

Likely unrelated: acceleration in Chromium and MPV does not work on the CPU's built-in graphics chip (AMD/ATI Renoir rev d9), I need to drag windows using 2D GPU accelerations to the other monitor connected by the PCIE graphics card.  Funny effect.
Comment 185 Nils 2021-03-06 13:57:01 UTC
Is this possibly related?

https://bugzilla.kernel.org/show_bug.cgi?id=206487
Random freezes/crashes with enabled C-State C6 - AMD Ryzen
Comment 186 Nils 2021-03-06 20:41:57 UTC
Sorry, that was already referenced above - apologies for the noise.
Comment 187 alan.loewe 2021-03-07 18:13:46 UTC
I reported an issue that could be the same as some of the issues here, especially Alex Gurenko's:
https://bugzilla.kernel.org/show_bug.cgi?id=212087

However, it somehow got fixed during the 5.10 cycle, but 5.11 is worse than ever before. Stable for over a week with 5.10.18+, right now 25h uptime with BIOS defaults plus DOCP (XMP). So, there definitely are some bugs in the kernel.

Generally, from browsing a lot of forums, I learned that for Ryzen 3000 users trouble began with the BIOS versions supporting Ryzen 5000. So maybe try downgrading to a version before that, or upgrade to the latest version, if it has at least Agesa 1.2.0.0, which should have resolved most problems at least for Ryzen 5000.
Comment 188 Clemens Eisserer 2021-03-07 18:35:25 UTC
> I learned that for Ryzen 3000 users trouble began with the BIOS versions
> supporting Ryzen 5000.

At least not for me - and I originally filed this bug report a year ago.
However, my system is no longer running linux - I built it for professional work where I cannot afford such bricolage.
Comment 189 Tim S 2021-03-10 22:30:30 UTC
I am also experiencing these sudden shutoffs followed by an MCE error on boot, such as:

> mce: [Hardware Error]: Machine check events logged
> mce: [Hardware Error]: CPU 10: Machine Check: 0 Bank 0: bc00080001010135
> mce: [Hardware Error]: TSC 0 ADDR 104c1b0b8 MISC d012000000000000 IPID
> 1000b000000000
> mce: [Hardware Error]: PROCESSOR 2:a20f10 TIME 1614904051 SOCKET 0 APIC 18
> microcode a201009

- OS: Arch Linux with kernel 5.11.2-arch1-1

- Board: ASUS TUF GAMING X570-PRO, firmware v. 3405

- CPU: AMD Ryzen 9 5900X

- GPU: Visiontek Radeon HD 5450

- PSU: EVGA 650 GT

Interestingly, I can reproduce the crash 100% reliably (12+ attempts) by selecting "Verify Local Data" on a torrent in Transmission QT. The crash usually occurs within about 3 minutes. Perhaps it has something to do with checksumming large amounts of data.

The crashes have not occurred since turning off Precision Boost Overdrive. Obviously that's not an ideal solution but it gets my PC working for now. Setting "Typical Current Idle" didn't solve it on my machine.
Comment 190 alan.loewe 2021-03-11 00:27:34 UTC
Can confirm. I tried "Verify integrity of game files" in Steam. Reboots after a minute. But sha256summing a few gigabytes of files is no problem. Weird.

However, I could reproduce this on Windows, too, so it's not a Linux issue.

I already had the impression that cryptographic workloads trigger the problem, and tried the LUKS benchmark, but without success.
Comment 191 Maxim Egorushkin 2021-03-12 12:34:34 UTC
I have a similar problem with Z390, Intel 9900KS and Nvidia 1080Ti.

When the display turns off due to inactivity the kernel logs one or two mce errors and freezes.

As a work-around I disabled turning off the screen.
Comment 192 Paul Menzel 2021-03-12 14:28:45 UTC
(In reply to Tim S from comment #189)
> I am also experiencing these sudden shutoffs followed by an MCE error on
> boot, such as:
> 
> > mce: [Hardware Error]: Machine check events logged
> > mce: [Hardware Error]: CPU 10: Machine Check: 0 Bank 0: bc00080001010135

That number looks different from the one the bug report is about.

> > mce: [Hardware Error]: TSC 0 ADDR 104c1b0b8 MISC d012000000000000 IPID
> > 1000b000000000
> > mce: [Hardware Error]: PROCESSOR 2:a20f10 TIME 1614904051 SOCKET 0 APIC 18
> > microcode a201009
> 
> - OS: Arch Linux with kernel 5.11.2-arch1-1
> 
> - Board: ASUS TUF GAMING X570-PRO, firmware v. 3405
> 
> - CPU: AMD Ryzen 9 5900X
> 
> - GPU: Visiontek Radeon HD 5450
> 
> - PSU: EVGA 650 GT
> 
> Interestingly, I can reproduce the crash 100% reliably (12+ attempts) by
> selecting "Verify Local Data" on a torrent in Transmission QT. The crash
> usually occurs within about 3 minutes. Something to do with checksumming
> large amounts of data perhaps
> 
> The crashes have not occurred since turning off Precision Boost Overdrive.
> Obviously that's not an ideal solution but it gets my PC working for now.
> Setting "Typical Current Idle" didn't solve it on my machine.

With the information from above, I recommend reporting this issue to your motherboard vendor and also creating a separate bug report.
Comment 193 Paul Menzel 2021-03-12 14:30:46 UTC
(In reply to alan.loewe from comment #190)
> Can confirm. I tried "Verify integrity of game files" in Steam. Reboots
> after a minute. But sha256summing a few gigabytes of files is no problem.
> Weird.
> 
> However, I could reproduce this on Windows, too, so it's not a Linux issue.
> 
> I already had the impression, that cryptographic work loads trigger the
> problem, and tried the LUKS benchmark, but without success.

I’d say this is a separate issue from the bug report at hand. Please create a separate issue, and give as much detail (including firmware version, microcode, …) there.
Comment 194 Paul Menzel 2021-03-12 14:32:13 UTC
(In reply to Maxim Egorushkin from comment #191)
> I have a similar problem with Z390, Intel 9900KS and Nvidia 1080Ti.
> 
> When the display turns off due to inactivity the kernel logs one or two mce
> errors and freezes.
> 
> As a work-around I disabled turning off the screen.

This issue at hand is definitely only related to AMD “chipsets”. Please create a separate issue for your problem with as much information as possible (including firmware version, microcode updates, logs, …).
Comment 195 Gurenko Alex 2021-03-13 15:00:43 UTC
So, I would like to give a quick update from my side in regards to motherboard repair:

Earlier this week I finally got my motherboard back from MSI. Unfortunately I don't have any conclusion other than that it's been repaired; the motherboard looks the same, with no visible signs of anything being "repaired".

 It's 3.5 days with absolutely default settings (only enabled XMP and SVM), idle current is set to Auto, c-state control enabled etc, and running completely stable. I'm also now running kernel 5.11.5 (I saw some people on BZ reporting it aggravates the problem) and it's also perfectly fine.
While I'd love to see at least 3-4 weeks of stability before a final conclusion, I'd say this "repair" worked and it was indeed the motherboard, as all other components remain the same and I don't have a single problem right now, while before the repair I had at least 5-6 reboots with mce errors a day.
Comment 196 Gurenko Alex 2021-03-13 21:58:37 UTC
(In reply to Gurenko Alex from comment #195)
> So, I would like to give a quick update from my side in regards to
> motherboard repair:
> 
>  Earlier this week I finally got my motherboard back from MSI. Unfortunately
> I don't have any conclusion other than it's been repaired, the motherboard
> seems the same and no visual signs of things being "repaired".
> 
>  It's 3.5 days with absolutely default settings (only enabled XMP and SVM),
> idle current is set to Auto, c-state control enabled etc, and running
> completely stable. I'm also now running kernel 5.11.5 (I saw some people on
> BZ reporting it aggravates the problem) and it's also perfectly fine.
>  While I'd love to see at least 3-4 weeks of stable before final conclusion,
> I'd say this "repair" worked and it was indeed the motherboard as all other
> components remains the same and I don't have a single problem right now,
> while before repair I had at least 5-6 reboots with mce errors a day.

Never mind, just got a system reset...
Comment 197 xasafam914@leonvero.com 2021-03-16 21:55:33 UTC
I have been watching this thread very closely. After an ASUS BIOS update containing AMD AM4 AGESA V2 PI 1.2.0.0, I started getting random reboots. At first I thought I was having issues with the PSU, so I replaced it, but the reboots persisted... BIOS rollback was not possible.
I do not recall ever having reboots before the BIOS update.

I am getting similar errors like everyone else:

kernel: [ 4674.931067] mce: [Hardware Error]: Machine check events logged
kernel: [ 4674.931070] [Hardware Error]: Corrected error, no action required.
kernel: [ 4674.931074] [Hardware Error]: CPU:0 (17:71:0) MC25_STATUS[-|CE|MiscV|-|-|-|-|CECC|-|-|-]: 0x98004000003e0000
kernel: [ 4674.931077] [Hardware Error]: IPID: 0x000100ff03830400
kernel: [ 4674.931079] [Hardware Error]: Platform Security Processor Ext. Error Code: 62
kernel: [ 4674.931080] [Hardware Error]: cache level: RESV, tx: INSN 
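
For anyone who wants to read the raw status value themselves, here is a minimal sketch of how the flags in the MC25_STATUS line map onto the 64-bit word. The bit positions are an assumption on my part (they follow the MCI_STATUS_* definitions in the Linux kernel as I remember them), and `decode_status` is a hypothetical helper name, so please verify against AMD's documentation before relying on it.

```python
# Assumed bit positions, taken from the Linux kernel's MCI_STATUS_* macros
# as recalled; verify against AMD's processor documentation.
FLAGS = {
    63: "Val", 62: "Over", 61: "UC", 60: "En", 59: "MiscV", 58: "AddrV",
    57: "PCC", 55: "TCC", 53: "SyndV", 46: "CECC", 45: "UECC",
    44: "Deferred", 43: "Poison",
}


def decode_status(status):
    """Split an MCA status word into set flags and error-code fields."""
    flags = [name for bit, name in sorted(FLAGS.items(), reverse=True)
             if (status >> bit) & 1]
    return {
        "flags": flags,
        # Extended error code: assumed to live in bits 21:16.
        "ext_error_code": (status >> 16) & 0x3F,
        # Low 16 bits carry the architectural error code.
        "error_code": status & 0xFFFF,
    }


if __name__ == "__main__":
    print(decode_status(0x98004000003E0000))
```

With the value above this reports CECC set, UC clear, and an extended error code of 62, which agrees with the "Corrected error" and "Ext. Error Code: 62" lines in the log.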

After multiple BIOS resets and configuration reloads, I have finally found two settings that worked.

VDDCR CPU Voltage [Manual]
VDDCR CPU Voltage Override [1.32500]

and/or setting the voltage to negative -Offset, with first step 0.00165. 

Which got me thinking; set CPU Voltage to auto and do some logging using "sensors".

The lowest vcore voltage I observed was 919 mV and the highest was 1.51 V, sampling every 100 ms.
I could not get the computer to reboot during my measurements - but voltage at 1.51v seemed a bit high. 
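
In case someone wants to repeat the measurement, this is a rough sketch of a min/max logger. It assumes the voltage is exposed as a hwmon sysfs file that reads in millivolts; the path in the example is hypothetical and differs per board and sensor driver, and `track_min_max` and `read_millivolts` are names I made up.

```python
import time
from pathlib import Path


def read_millivolts(path):
    """Read one hwmon voltage file (in*_input files report millivolts)."""
    return int(Path(path).read_text().strip())


def track_min_max(read, samples, interval_s=0.1, sleep=time.sleep):
    """Call read() `samples` times and return (lowest, highest) reading."""
    lo = hi = None
    for _ in range(samples):
        value = read()
        lo = value if lo is None else min(lo, value)
        hi = value if hi is None else max(hi, value)
        sleep(interval_s)
    return lo, hi


if __name__ == "__main__":
    # Hypothetical path: find the right hwmonN/inM_input for your board.
    vcore = "/sys/class/hwmon/hwmon2/in0_input"
    print(track_min_max(lambda: read_millivolts(vcore), samples=600))
```

Keep in mind that polling at 100 ms can miss short spikes, so the real extremes may be wider than what such a loop reports.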

With the negative voltage offset, the CPU still got as high as 1.50 V... which raises the question.

Is this just the result of a bad chip, or the result of a BIOS/AGESA/FIRMWARE/KERNEL condition that's messing with the vcore voltage and making the computer reboot because the voltage is getting too high?

Another interesting tidbit is the spacing between the MCE timestamps: within one boot, consecutive events are about 311 s apart, or a whole multiple of that.

 [22730.301571] mce: [Hardware Error]: Machine check events logged
 [23041.601517] mce: [Hardware Error]: Machine check events logged
 [23352.900448] mce: [Hardware Error]: Machine check events logged
 [47322.015640] mce: [Hardware Error]: Machine check events logged
 [48255.907934] mce: [Hardware Error]: Machine check events logged
 [49189.803993] mce: [Hardware Error]: Machine check events logged
 [49812.398710] mce: [Hardware Error]: Machine check events logged
 [52302.770599] mce: [Hardware Error]: Machine check events logged
 [62886.872149] mce: [Hardware Error]: Machine check events logged
 [  317.094926] mce: [Hardware Error]: Machine check events logged
 [  628.382479] mce: [Hardware Error]: Machine check events logged
 [ 4986.552369] mce: [Hardware Error]: Machine check events logged
 [ 8099.525574] mce: [Hardware Error]: Machine check events logged
 [ 9033.420023] mce: [Hardware Error]: Machine check events logged
 [10589.906729] mce: [Hardware Error]: Machine check events logged
 [12768.987541] mce: [Hardware Error]: Machine check events logged
 [13391.581006] mce: [Hardware Error]: Machine check events logged
 [13702.878346] mce: [Hardware Error]: Machine check events logged
 [ 1250.660902] mce: [Hardware Error]: Machine check events logged
 [ 1561.959065] mce: [Hardware Error]: Machine check events logged
 [ 2184.554495] mce: [Hardware Error]: Machine check events logged
 [ 2495.850929] mce: [Hardware Error]: Machine check events logged
 [ 4674.931067] mce: [Hardware Error]: Machine check events logged
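
To check the spacing without doing it by hand, something like the sketch below should work. It assumes the lines look exactly like the ones pasted above, treats a timestamp that goes backwards as a new boot, and uses 311.3 s as the base period (my estimate from the first two gaps). The function names are made up for illustration.

```python
import re

TS_RE = re.compile(
    r"\[\s*(\d+\.\d+)\]\s+mce: \[Hardware Error\]: Machine check events logged"
)


def parse_timestamps(text):
    """Extract dmesg timestamps (seconds since boot) of logged MCE events."""
    return [float(m.group(1)) for m in TS_RE.finditer(text)]


def split_boots(timestamps):
    """Start a new group whenever the timestamp decreases (a reboot)."""
    boots, current = [], []
    for ts in timestamps:
        if current and ts < current[-1]:
            boots.append(current)
            current = []
        current.append(ts)
    if current:
        boots.append(current)
    return boots


def intervals_in_periods(timestamps, period=311.3):
    """Return (gap, nearest multiple of period, residual) per gap in a boot."""
    out = []
    for boot in split_boots(timestamps):
        for a, b in zip(boot, boot[1:]):
            gap = b - a
            n = round(gap / period)
            out.append((gap, n, gap - n * period))
    return out
```

Feeding it the first few lines gives gaps of 1, 1 and 77 periods with residuals under a second, so the events look like they are found by something polling on a fixed interval rather than arriving at random times.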


Hopefully it helps someone.
Comment 198 Martin 2021-03-16 22:03:25 UTC
(In reply to xasafam914@leonvero.com from comment #197)
> I have been watching this thread very closely. After an ASUS BIOS update,
> containing AMD AM4 AGESA V2 PI 1.2.0.0, I started getting random reboots. At
> first I thought I was having issues with a PSU, so I replaced it, but
> reboots persisted... Bios rollback was not possible.
> Do not recall ever having reboots before the BIOS update. 
> 
> I am getting similar errors like everyone else:
> 
> kernel: [ 4674.931067] mce: [Hardware Error]: Machine check events logged
> kernel: [ 4674.931070] [Hardware Error]: Corrected error, no action required.
> kernel: [ 4674.931074] [Hardware Error]: CPU:0 (17:71:0)
> MC25_STATUS[-|CE|MiscV|-|-|-|-|CECC|-|-|-]: 0x98004000003e0000
> kernel: [ 4674.931077] [Hardware Error]: IPID: 0x000100ff03830400
> kernel: [ 4674.931079] [Hardware Error]: Platform Security Processor Ext.
> Error Code: 62
> kernel: [ 4674.931080] [Hardware Error]: cache level: RESV, tx: INSN 
> 
> After multiple BIOS resets, reloads of configuration I have finally found
> two settings that worked. 
> 
> VDDCR CPU Voltage [Manual]
> VDDCR CPU Voltage Override [1.32500]
> 
> and/or setting the voltage to negative -Offset, with first step 0.00165. 
> 
> Which got me thinking; set CPU Voltage to auto and do some logging using
> "sensors".
> 
> The lowest vcore voltage I observed was 919mv, and the highest was 1.51v, at
> 100ms between measuring intervals. 
> I could not get the computer to reboot during my measurements - but voltage
> at 1.51v seemed a bit high. 
> 
> With voltage negative offset adjustment, the CPU got as high as 1.50v...
> which brings the question. 
> 
> Is this just a result of a bad chip, or result of BIOS/AGESA/FIRMWARE/KERNL
> condition that's messing with the vcore voltage, and making the computer
> reboot because the voltage is getting too high? 
> 
> Another interesting tidbits, is the timestamps between mce errors.. Every
> 311s. 
> 
>  [22730.301571] mce: [Hardware Error]: Machine check events logged
>  [23041.601517] mce: [Hardware Error]: Machine check events logged
>  [23352.900448] mce: [Hardware Error]: Machine check events logged
>  [47322.015640] mce: [Hardware Error]: Machine check events logged
>  [48255.907934] mce: [Hardware Error]: Machine check events logged
>  [49189.803993] mce: [Hardware Error]: Machine check events logged
>  [49812.398710] mce: [Hardware Error]: Machine check events logged
>  [52302.770599] mce: [Hardware Error]: Machine check events logged
>  [62886.872149] mce: [Hardware Error]: Machine check events logged
>  [  317.094926] mce: [Hardware Error]: Machine check events logged
>  [  628.382479] mce: [Hardware Error]: Machine check events logged
>  [ 4986.552369] mce: [Hardware Error]: Machine check events logged
>  [ 8099.525574] mce: [Hardware Error]: Machine check events logged
>  [ 9033.420023] mce: [Hardware Error]: Machine check events logged
>  [10589.906729] mce: [Hardware Error]: Machine check events logged
>  [12768.987541] mce: [Hardware Error]: Machine check events logged
>  [13391.581006] mce: [Hardware Error]: Machine check events logged
>  [13702.878346] mce: [Hardware Error]: Machine check events logged
>  [ 1250.660902] mce: [Hardware Error]: Machine check events logged
>  [ 1561.959065] mce: [Hardware Error]: Machine check events logged
>  [ 2184.554495] mce: [Hardware Error]: Machine check events logged
>  [ 2495.850929] mce: [Hardware Error]: Machine check events logged
>  [ 4674.931067] mce: [Hardware Error]: Machine check events logged
> 
> 
> Hopefully it helps someone.

I was receiving the same error (reported at the Red Hat Bugzilla, https://bugzilla.redhat.com/show_bug.cgi?id=1929308) and it seemed that enabling dTPM in the BIOS settings made it go away for me. I'm curious if dTPM is disabled in your BIOS settings?
Comment 199 xasafam914@leonvero.com 2021-03-16 22:05:03 UTC
These are my TPM settings:

TPM Device Selection [Discrete TPM]
Erase fTPM NV for factory reset [Enabled]
Comment 200 xasafam914@leonvero.com 2021-03-16 22:15:43 UTC
Martin,

Thank you for the tip. I have switched TPM Device Selection from Discrete TPM to Firmware TPM, disabled "Erase fTPM NV for factory reset", and set the vCore voltage to Auto. We will see if this fixes my reboot issue. Thanks again.
Comment 201 xasafam914@leonvero.com 2021-03-17 22:23:50 UTC
Quick update - almost 24h later - Unfortunately, the reboot was back with Firmware enabled. I rolled back to previous BIOS setting - which is negative voltage offset, and I am back to stable.
Comment 202 Gurenko Alex 2021-03-17 23:58:07 UTC
I'm also continuing my quest to isolate the problem, and the current state is as follows:

In a recent reboot I saw, in addition to the original "Bank 5: bea0000000000108" code, b2a000000002010b in bank 22, which translates to

Bank: Power Management, Interrupts, Etc. (PIE)
Error: Link Error: An error occurred on a GMI or xGMI link (GMI 0x2)

which points to the GPU area... so I swapped my RX5700XT with my wife's RTX 3070. Two things so far:

1) My system is running completely stable for 3 days now and I'm pushing it as much as I can with sleep, hibernation, idle etc, everything that caused problem in the past
2) My wife actually had an MCE error on her Windows setup, although only once so far. This GPU shows other bizarre behavior on Windows as well, but I think that's outside the scope of this problem.

I'm planning to run those setups for another week or so. I wanted to get another AMD gpu to try, but...current stock situation.
Comment 203 Paul Menzel 2021-03-19 07:41:41 UTC
Thank you for your report. I recommend opening a separate issue with the AMD Linux driver project [1]. Please provide all the details, including logs and firmware and driver versions.

As you have also experienced this once with Microsoft Windows, that’s actually great news. Please contact the mainboard and graphics device manufacturer with that information.

[1]: https://gitlab.freedesktop.org/drm/amd
Comment 204 Gurenko Alex 2021-03-19 11:31:49 UTC
(In reply to Paul Menzel from comment #203)
> Thank you for your report. I recommend to open a separate issue with all the
> details to the AMD Linux driver project [1]. Please provide all the details
> including logs and firmware and driver versions.
> 
> As you have also experienced this once with Microsoft Windows, that’s
> actually great news. Please contact the mainboard and graphics device
> manufacturer with that information.
> 
> [1]: https://gitlab.freedesktop.org/drm/amd

Since I don't have direct contact with the motherboard vendor and the last repair attempt led to nothing, I'll try to stick with the GPU vendor; maybe I'll have some luck there. On the downside, the Windows system hasn't crashed since the first error. It seems to have a better recovery mechanism: it sometimes freezes the screen or goes black, but eventually recovers. Maybe that's exactly the spot where my system restarts and throws an MCE on reboot.

Do you want me to join this [0] open issue discussion or open a new one?

[0] https://gitlab.freedesktop.org/drm/amd/-/issues/1481
Comment 205 Paul Menzel 2021-03-21 08:05:09 UTC
Unless it’s exactly your system configuration, I’d create a new issue.
Comment 206 Martin 2021-03-22 22:39:14 UTC
(In reply to xasafam914@leonvero.com from comment #201)
> Quick update - almost 24h later - Unfortunately, the reboot was back with
> Firmware enabled. I rolled back to previous BIOS setting - which is negative
> voltage offset, and I am back to stable.

I performed a few more troubleshooting steps on my end, and it seems that what fixed it for me may actually have been a combination of enabling AMD CPU fTPM (which may not have been necessary, I'm not sure), disabling PCI SR-IOV support, and disabling "Above 4G Decoding".

I have opened a separate bug for this here: https://bugzilla.kernel.org/show_bug.cgi?id=212399
Comment 207 Gurenko Alex 2021-03-30 09:51:28 UTC
So, I would like to share some good results for my setup so far.

After 10-11 days with no problems whatsoever on nVidia RTX3070 card, I've switched back to my RX 5700XT and within 2 days, started to have similar problems as before, so I've filed this report: https://gitlab.freedesktop.org/drm/amd/-/issues/1551

However, I've decided to try switching my CPU PCIe generation to Gen 3, and so far the system is running as stable as it was with the nVidia card. Based on previous experience it's too soon to tell, but it has been running as expected for 5 days straight.

I'm wondering if the problem lies actually with the PCIe gen "mismatch"? RTX3xxx series are PCIe gen4 cards, hence probably no issues with default configuration, and RX5xxx cards are gen3. PCIe *should* be backwards compatible, but...

This idea came from AMD's official recommendation for the USB disconnect issues, which are only fixed in the upcoming ComboAM4PIV2 1.2.0.2; the suggested workaround there was also to switch the PCIe gen configuration back from 4 -> 3.

If people who still experience initial (bea0000000000108) mce can give that a try, that would be useful.
Comment 208 Amadej Kastelic 2021-04-15 08:51:32 UTC
(In reply to Gurenko Alex from comment #207)
> So, I would like to share a good results for my setup so far.
> 
> After 10-11 days with no problems whatsoever on nVidia RTX3070 card, I've
> switched back to my RX 5700XT and within 2 days, started to have similar
> problems as before, so I've filed this report:
> https://gitlab.freedesktop.org/drm/amd/-/issues/1551
> 
> However, I've decided to try and switch my CPU PCIe generation to the 3Gen
> and so far the system is running as stable as it was with the nVidia card.
> Based on previous experience it's too soon to tell, but it's running for 5
> days straight as expected.
> 
> I'm wondering if the problem lies actually with the PCIe gen "mismatch"?
> RTX3xxx series are PCIe gen4 cards, hence probably no issues with default
> configuration, and RX5xxx cards are gen3. PCIe *should* be backwards
> compatible, but...
> 
> This came as an idea based on AMD's official recommendation for USB
> disconnect issues which is only fixed in upcoming ComboAM4PIV2 1.2.0.2, but
> the suggestion to fix it was also switch back the PCIe gen configuration
> from 4 -> 3.
> 
> If people who still experience initial (bea0000000000108) mce can give that
> a try, that would be useful.

Tried it, but still have the same issue :(
Comment 209 Gurenko Alex 2021-04-15 10:17:58 UTC
(In reply to Amadej Kastelic from comment #208)
> (In reply to Gurenko Alex from comment #207)
> > So, I would like to share a good results for my setup so far.
> > 
> > ...
> > 
> > This came as an idea based on AMD's official recommendation for USB
> > disconnect issues which is only fixed in upcoming ComboAM4PIV2 1.2.0.2, but
> > the suggestion to fix it was also switch back the PCIe gen configuration
> > from 4 -> 3.
> > 
> > If people who still experience initial (bea0000000000108) mce can give that
> > a try, that would be useful.
> 
> Tried it, but still have the same issue :(

That's disappointing, but thanks for reporting.
Comment 210 Amadej Kastelic 2021-04-15 10:19:21 UTC
(In reply to Gurenko Alex from comment #209)
> (In reply to Amadej Kastelic from comment #208)
> > (In reply to Gurenko Alex from comment #207)
> > > So, I would like to share a good results for my setup so far.
> > > 
> > > ...
> > > 
> > > This came as an idea based on AMD's official recommendation for USB
> > > disconnect issues which is only fixed in upcoming ComboAM4PIV2 1.2.0.2, but
> > > the suggestion to fix it was also switch back the PCIe gen configuration
> > > from 4 -> 3.
> > > 
> > > If people who still experience initial (bea0000000000108) mce can give that
> > > a try, that would be useful.
> > 
> > Tried it, but still have the same issue :(
> 
> That's disappointing, but thanks for reporting.

You still running stable? There might be some differences in our bioses. I'll check if there's more settings connected to PCIe.
Comment 211 Gurenko Alex 2021-04-15 10:27:37 UTC
(In reply to Amadej Kastelic from comment #210)
> (In reply to Gurenko Alex from comment #209)
> > (In reply to Amadej Kastelic from comment #208)
> > > (In reply to Gurenko Alex from comment #207)
> > > > So, I would like to share a good results for my setup so far.
> > > > 
> > > > ...
> > > > 
> > > > This came as an idea based on AMD's official recommendation for USB
> > > > disconnect issues which is only fixed in upcoming ComboAM4PIV2 1.2.0.2, but
> > > > the suggestion to fix it was also switch back the PCIe gen configuration
> > > > from 4 -> 3.
> > > > 
> > > > If people who still experience initial (bea0000000000108) mce can give that
> > > > a try, that would be useful.
> > > 
> > > Tried it, but still have the same issue :(
> > 
> > That's disappointing, but thanks for reporting.
> 
> You still running stable? There might be some differences in our bioses.
> I'll check if there's more settings connected to PCIe.

I was running stable-ish with that setting for ~12 days. Then I updated the BIOS to AGESA 1.2.0.2, which reverted the settings to stock, and it has been running stable for another 3 days now. I have a feeling that it works fine after a "board reset" (you know, when some BIOS settings cut the power violently, with quite audible sounds) and then stays stable until using sleep mode seems to break it, after which the system starts to reboot more and more often. I don't have it in me to investigate any further. I have no idea what works and what doesn't; at this point it's completely random, and since there is no way to get a new GPU and prices for other motherboards are also insane, I don't know what else to do. It's working for now, but based on experience the problem will come back in a few days.
Comment 212 Amadej Kastelic 2021-04-15 10:52:19 UTC
(In reply to Gurenko Alex from comment #211)
> (In reply to Amadej Kastelic from comment #210)
> > (In reply to Gurenko Alex from comment #209)
> > > (In reply to Amadej Kastelic from comment #208)
> > > > (In reply to Gurenko Alex from comment #207)
> > > > > So, I would like to share a good results for my setup so far.
> > > > > 
> > > > > ...
> > > > > 
> > > > > This came as an idea based on AMD's official recommendation for USB
> > > > > disconnect issues which is only fixed in upcoming ComboAM4PIV2 1.2.0.2, but
> > > > > the suggestion to fix it was also switch back the PCIe gen configuration
> > > > > from 4 -> 3.
> > > > > 
> > > > > If people who still experience initial (bea0000000000108) mce can give that
> > > > > a try, that would be useful.
> > > > 
> > > > Tried it, but still have the same issue :(
> > > 
> > > That's disappointing, but thanks for reporting.
> > 
> > You still running stable? There might be some differences in our bioses.
> > I'll check if there's more settings connected to PCIe.
> 
> I was running stable-ish with that setting for ~12 days, then I've updated
> BIOS to AGEISA 1.2.0.2 which reverted settings back to stock and it's still
> running stable for another 3 days now, but I have a feeling that it's
> working fine after "board reset", you know when some settings in BIOS cut
> the power violently (with quite audible sounds), then it keeps stable, using
> sleep mode seems to break it and then system starts to reboot more and more
> often. I don't have it in me investigate anything further, I have no idea
> what works and what's not, at this point it's completely random and since
> there is no way to get a new GPU and prices for other motherboards are also
> insane, I don't know what else to do. It's working for now, but based on
> experience it will come back in a few days.

I have the exact same feeling. Every time I update the BIOS, it runs stable for about a week. Then the crashes come back. I'll try to get my hands on a new GPU.
Comment 213 Foulques du Peloux de Praron 2021-04-15 12:09:00 UTC
(In reply to Gurenko Alex from comment #211)
> (In reply to Amadej Kastelic from comment #210)
> > (In reply to Gurenko Alex from comment #209)
> > > (In reply to Amadej Kastelic from comment #208)
> > > > (In reply to Gurenko Alex from comment #207)
> > > > > So, I would like to share a good results for my setup so far.
> > > > > 
> > > > > ...
> > > > > 
> > > > > This came as an idea based on AMD's official recommendation for USB
> > > > > disconnect issues which is only fixed in upcoming ComboAM4PIV2 1.2.0.2, but
> > > > > the suggestion to fix it was also switch back the PCIe gen configuration
> > > > > from 4 -> 3.
> > > > > 
> > > > > If people who still experience initial (bea0000000000108) mce can give that
> > > > > a try, that would be useful.
> > > > 
> > > > Tried it, but still have the same issue :(
> > > 
> > > That's disappointing, but thanks for reporting.
> > 
> > You still running stable? There might be some differences in our bioses.
> > I'll check if there's more settings connected to PCIe.
> 
> I was running stable-ish with that setting for ~12 days, then I've updated
> BIOS to AGEISA 1.2.0.2 which reverted settings back to stock and it's still
> running stable for another 3 days now, but I have a feeling that it's
> working fine after "board reset", you know when some settings in BIOS cut
> the power violently (with quite audible sounds), then it keeps stable, using
> sleep mode seems to break it and then system starts to reboot more and more
> often. I don't have it in me investigate anything further, I have no idea
> what works and what's not, at this point it's completely random and since
> there is no way to get a new GPU and prices for other motherboards are also
> insane, I don't know what else to do. It's working for now, but based on
> experience it will come back in a few days.

I have the same feeling too. I changed the "Power Supply Idle Control" setting to "Typical current idle" in my BIOS, as suggested above, and my system seemed to be way more stable. Then I updated my BIOS, which reset this setting, and my computer is still stable at the moment (~ 2 weeks).
Comment 214 Gurenko Alex 2021-04-15 12:23:07 UTC
Yeah, I had several successful tweaks in 5 months:

- Power Supply Idle Control - 3.5 weeks stable straight
- BIOS reset with battery removed for 1 hour - 1.5 weeks stable
- SMT disable (which actually disables sleep state) - 11 days stable
- Motherboard sent for repair - 10 days after I got it back
- PCIe gen switching - 2 weeks and counting

Each BIOS update keeps system stable for 5-7 days... so something starts this behavior, I can say that there is no better trigger than a sleep mode and then it's going downhill after the first re-occurrence.

all of those would probably tell us something...but switching to RTX 3070 didn't cause ANY problem for 11 days with putting machine to sleep 3-4 times a day, stressing and leaving it to idle for hours, so...

At this point I'm trying to avoid putting machine to sleep for now, let's see how long it would take to get a first reset.
Comment 215 Jaakko Kantojärvi 2021-04-15 18:37:08 UTC
With the same BIOS settings, I have encountered resets after anything from a few hours to 4-5 weeks. However, the reset typically happens after 2-7 days of uninterrupted idle (I have used another computer for other things). I have kept my computer turned on (i.e., no sleep), so I could see how long it takes to crash.

To me, this seems to be random. I also feel like many bios settings only change the chance of the problem to occur (e.g. lower clock speed lowers the likelihood of the reset to occur).

I have one interesting observation, which is that I typically encounter the reset around 4am-6am. Not always, but it has happened at that time too many times. So, it might be related to time from when computer entered idle or to daily cron tasks for example. It's just the only common pattern I have noticed.
Comment 216 Leonardo de Araujo Augusto 2021-04-16 20:08:17 UTC
So I can consistently reproduce this playing World of Warcraft (retail, current ver.). Doesn't take long, I don't know exactly how to fetch logs from lutris/wine crashes, if they exist, but my journalctl is identical.

Specs: 

Debian sid
Motherboard: B450M DS3H (Gigabyte)
Kernel: 5.10.0-6-amd64
CPU:Ryzen 5 3600 
GPU:Radeon RX 580
mesa ver: 20.3.4
bios ver: f50
Comment 217 retromuzz 2021-04-17 04:16:03 UTC
Mainboard: TUF GAMING X570-PRO (WI-FI)
CPU: Ryzen 5 5600X
GPU0: AORUS Radeon™ RX 6800 MASTER 16G
GPU1: Asus ROG Strix Radeon RX 6800 OC 16GB

OS: Arch Linux x86_64
Kernel: 5.11.13-arch1-1

journalctl
kernel: mce: [Hardware Error]: Machine check events logged
kernel: mce: [Hardware Error]: CPU 6: Machine Check: 0 Bank 5: bea0000000000108
kernel: mce: [Hardware Error]: TSC 0 ADDR ffffffc112bbd4 MISC d012000100000000 SYND 4d000000 IPID 500b000000000 
kernel: mce: [Hardware Error]: PROCESSOR 2:a20f10 TIME 1618580526 SOCKET 0 APIC 1 microcode a201009
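
As a side note, the TIME field in the last line can be turned into a readable date. This small sketch assumes TIME is Unix epoch seconds (that is my understanding of how the kernel fills it in); `mce_time_utc` is just an illustrative helper name.

```python
from datetime import datetime, timezone


def mce_time_utc(epoch_seconds):
    """Convert the TIME field of an MCE record to an aware UTC datetime."""
    return datetime.fromtimestamp(epoch_seconds, tz=timezone.utc)


if __name__ == "__main__":
    print(mce_time_utc(1618580526).isoformat())  # 2021-04-16T13:42:06+00:00
```

That makes it easier to line the event up with the journal entries from the boot before the reset.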

Random reboot happens once every few hours, at least 4 times a week.

Things I have tried:
Switching from PCIE Gen 4.0 to 3.0 - no effect
Added amdgpu.ppfeaturemask=0xffffbffb to boot args - no effect
Disabled C6 power state using systemd service - no effect
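
For reference, a quick way to see what a ppfeaturemask value actually switches off is to list the bits that are zero in it. This is only a sketch: it compares against an all-ones 32-bit mask, which is an assumption (the driver's default mask differs between kernel versions), and it does not name the features because the bit-to-feature mapping lives in the amdgpu headers of your kernel.

```python
def cleared_bits(mask, width=32):
    """Return the positions of the zero bits in a feature mask."""
    return [bit for bit in range(width) if not (mask >> bit) & 1]


if __name__ == "__main__":
    print(cleared_bits(0xFFFFBFFB))  # [2, 14]
```

So 0xffffbffb leaves everything set except bits 2 and 14; which power features those correspond to should be looked up in the amdgpu source for the kernel in use.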

Do I have to RMA CPU? But how come windows 10 does not have any issue?
Comment 218 Paul Menzel 2021-04-17 07:30:47 UTC
(In reply to retromuzz from comment #217)
> Mainboard: TUF GAMING X570-PRO (WI-FI)
> CPU: Ryen 5 5600X

User Tim S has the same device and problem. What firmware do you use? Have you contacted the manufacturer already? It’s an expensive board and CPU, so they are probably interested in satisfying their customers.

> GPU9: AORUS Radeon™ RX 6800 MASTER 16G
> GPU1: Asus ROG Strix Radeon RX 6800 OC 16GB

Do you have by any chance access to a non-AMD graphics card just for testing?
Comment 219 retromuzz 2021-04-17 12:07:21 UTC
(In reply to Paul Menzel from comment #218)
> (In reply to retromuzz from comment #217)
> > Mainboard: TUF GAMING X570-PRO (WI-FI)
> > CPU: Ryen 5 5600X
> 
> User Tim S has the same device and problem. What firmware do you use? Have
> you contacted the manufacturer already? It’s an expensive board and CPU, so
> they are probably interested in satisfying their customers.
> 
> > GPU0: AORUS Radeon™ RX 6800 MASTER 16G
> > GPU1: Asus ROG Strix Radeon RX 6800 OC 16GB
> 
> Do you have by any chance access to a non-AMD graphics card just for testing?

My firmware:
linux-firmware 20210315.3568f96-1
I did not contact manufacturers yet. Do you recommend contacting CPU (AMD) or Mainboard (Asus) or both?
Sorry I don't have any other graphics card to try.
For the time being I have disabled Precision Boost Overdrive in the BIOS, as Tim S did. This is not a dealbreaker for me. So let's see how it goes. It has not crashed since then, but it is too early to say if the problem is gone.

Mainboard: TUF GAMING X570-PRO (WI-FI)
CPU:       AMD Ryzen 5 5600X
GPU0:      AORUS Radeon™ RX 6800 MASTER 16G
GPU1:      Asus ROG Strix Radeon RX 6800 OC 16GB
PSU:       SILVESTONE ST75F-PT 80 PLUS PLATINUM 750W
OS:        Arch Linux x86_64
Kernel:    5.11.13-arch1-1
Firmware:  linux-firmware 20210315.3568f96-1
Comment 220 retromuzz 2021-04-17 12:11:41 UTC
(In reply to retromuzz from comment #219)
> (In reply to Paul Menzel from comment #218)
> > (In reply to retromuzz from comment #217)
> > > Mainboard: TUF GAMING X570-PRO (WI-FI)
> > > CPU: Ryen 5 5600X
> > 
> > User Tim S has the same device and problem. What firmware do you use? Have
> > you contacted the manufacturer already? It’s an expensive board and CPU, so
> > they are probably interested in satisfying their customers.
> > 
> > > GPU0: AORUS Radeon™ RX 6800 MASTER 16G
> > > GPU1: Asus ROG Strix Radeon RX 6800 OC 16GB
> > 
> > Do you have by any chance access to a non-AMD graphics card just for testing?
> 
> My firmware:
> linux-firmware 20210315.3568f96-1
> I did not contact manufacturers yet. Do you recommend contacting CPU (AMD)
> or Mainboard (Asus) or both?
> Sorry I don't have any other graphics card to try.
> For the timebeing I disabled Precision Boost Overdrive in BIOS as Tim S did.
> This is not a dealbreaker for me. So let's see how it goes. It has not
> crashed after that but too early to say it the problem is gone.
> 
> Mainboard: TUF GAMING X570-PRO (WI-FI)
> CPU:       AMD Ryen 5 5600X
> GPU0:      AORUS Radeon™ RX 6800 MASTER 16G
> GPU1:      Asus ROG Strix Radeon RX 6800 OC 16GB
> PSU:       SILVESTONE ST75F-PT 80 PLUS PLATINUM 750W
> OS:        Arch Linux x86_64
> Kernel:    5.11.13-arch1-1
> Firmware:  linux-firmware 20210315.3568f96-1

Ahh when you asked firmware I think you meant BIOS. I upgraded to latest. Still the issue persists. Here you go: 
BIOS Information
        Vendor: American Megatrends Inc.
        Version: 3801
        Release Date: 04/07/2021


Mainboard: TUF GAMING X570-PRO (WI-FI) BIOS: Version: 3801
CPU:       AMD Ryzen 5 5600X
GPU0:      AORUS Radeon™ RX 6800 MASTER 16G
GPU1:      Asus ROG Strix Radeon RX 6800 OC 16GB
PSU:       SILVESTONE ST75F-PT 80 PLUS PLATINUM 750W
OS:        Arch Linux x86_64
Kernel:    5.11.13-arch1-1
Firmware:  linux-firmware 20210315.3568f96-1
Comment 221 Leonardo de Araujo Augusto 2021-04-17 22:28:14 UTC
(In reply to Leonardo de Araujo Augusto from comment #216)
> So I can consistently reproduce this playing World of Warcraft (retail,
> current ver.). Doesn't take long, I don't know exactly how to fetch logs
> from lutris/wine crashes, if they exist, but my journalctl is identical.
> 
> Specs: 
> 
> Debian sid
> Motherboard: B450M DS3H (Gigabyte)
> Kernel: 5.10.0-6-amd64
> CPU:Ryzen 5 3600 
> GPU:Radeon RX 580
> mesa ver: 20.3.4
> bios ver: f50

reporting that a bios update that implemented AMD AGESA ComboV2 1.2.0.1 PatchA did not help at all. But my BIOS now is F61a and I literally had the issue again.
Comment 222 Leonardo de Araujo Augusto 2021-04-20 00:44:28 UTC
This seems neither random nor OS-specific.
I've installed Windows 10 on the same machine and tried to run the same game (WoW retail) on it. Windows gracefully died with a TDR error (Timeout Detection and Recovery). The event log says: "The computer has rebooted from a bugcheck. The bugcheck was: 0x00000116 (0xffff988b61a2e460, 0xfffff801509a0a40, 0xffffffffc0000001, 0x0000000000000003)."

I could provide the logs if they're relevant. I don't really mind fixing Windows (lol) but perhaps those can provide some insight. The memory dump is 700 MB, so I'd need to upload it to some cloud service.
Comment 223 alan.loewe 2021-04-22 21:23:05 UTC
If switching from PCIe 4 to 3 helps (the RX5700 does support PCIe 4), maybe check your SoC voltage. Standard is 1.0V, but many gaming boards seem to use a higher voltage when set to auto. AFAIK it affects everything on the CPU except the cores, including the controller for the CPU PCIe lanes (i.e. those used by the GPU). If the actual voltage is higher than 1.0V, lowering it might be worth a try.
Comment 224 Amadej Kastelic 2021-05-03 07:51:39 UTC
(In reply to Borislav Petkov from comment #182)
> It could be that disabling the C-states is simply making the freeze
> harder to trigger but not fix it and this could really be some power
> undervolt/overcurrent or whatever during states switch hardware issue
> which manifests itself in the transaction timeoutting.
> 
> And considering how setting some GPU power management cmdline args in
> the amdgpu driver, fixes the issue for some people here, it really could
> be the GPU is causing this power corner case.
> 
> Hmm, I wonder if you have a small, cheap GPU and you replace your big
> RX 5700 with it, whether it'll trigger then too. If it doesn't because
> - and I'm purely conjecturing here - that small GPU never causes the
> platform to even get close to that power corner case, then that would be
> starting to make sense.
> 
> But again, this is all speculation and without a specialized measurement
> hardware to check the platform's power limits, who knows what happens.

I upgraded my pc 4 days ago, running stable for now. Replaced my cpu with a 5950x and GPU with RX 6900XT. Everything seems stable with C-states enabled for now. Will report back if mce crashes reappear.
Comment 225 Gurenko Alex 2021-05-04 11:24:24 UTC
(In reply to Amadej Kastelic from comment #224)
> 
> I upgraded my pc 4 days ago, running stable for now. Replaced my cpu with a
> 5950x and GPU with RX 6900XT. Everything seems stable with C-states enabled
> for now. Will report back if mce crashes reappear.

But you kept your motherboard, correct?

I've managed to rent an RTX 3070 (and installed it today) to replace my 5700XT. I'm still trying to get an RX 6000-series card, as nvidia with KDE Plasma is not a great experience.
Comment 226 Amadej Kastelic 2021-05-04 11:25:31 UTC
(In reply to Gurenko Alex from comment #225)
>
> But you kept your motherboard, correct?
> 
> I've managed to rent a RTX3070 (and installed it today) to replace my
> 5700XT, while I'm still trying to get RX6000-series as nvidia and KDE plasma
> does not result in great experience.

Yup, kept everything else. Still running stable. Would crash 1-2 times a day before.
Comment 227 Gurenko Alex 2021-05-04 11:39:38 UTC
(In reply to Amadej Kastelic from comment #226)
> (In reply to Gurenko Alex from comment #225)
> >
> > But you kept your motherboard, correct?
> > 
> > I've managed to rent a RTX3070 (and installed it today) to replace my
> > 5700XT, while I'm still trying to get RX6000-series as nvidia and KDE plasma
> > does not result in great experience.
> 
> Yup, kept everything else. Still running stable. Would crash 1-2 times a day
> before.

I wouldn't rush to conclusions, as the common experience is that there is some sort of reset and trigger for this behavior, for example:

- sleep mode (or low power state in general) - good trigger to start mce resets

- bios flashing, bios reset - good reset for mce errors

I would wait and probably "stress test" the new setup. Like I mentioned, putting the machine to sleep is a 100% hit :)
Comment 228 Amadej Kastelic 2021-05-04 11:42:10 UTC
(In reply to Gurenko Alex from comment #227)
> 
> I wouldn't rush into conclusion as the common experience is that there is
> some sort of reset and trigger for this behavior, for example:
> 
> - sleep mode (or low power state in general) - good trigger to start mce
> resets
> 
> - bios flashing, bios reset - good reset for mce errors
> 
> I would wait and probably "stress test" new setup. Like I mentioned, putting
> machine to sleep is 100% hit :)

Putting my machine to sleep wasn't a trigger before. I was flashing different bios versions every other day, which didn't make my system more stable. It usually just happened randomly during low/idle load.
Comment 229 fintara 2021-05-16 06:00:40 UTC
(Wow, I'm so glad (but also so sad) that I've found a thread full of people with the same problem...)

My build has the same problem, and before that it had been running stable for 2 months.

> CPU: AMD 5900X
> RAM: G.Skill Ripjaws V 128 GB (4 x 32 GB) DDR4-3600 CL18
> MB: Gigabyte X570 Aorus Pro (rev. 1.2)
> VGA: Sapphire 5500XT 4GB

For me this sudden reboot can happen in both idle and under load, but it happens more often when I leave it alone (Firefox+grafana and spotify in the background).

I've tried every fix here, but unfortunately it's all the same. Some fixes _seem_ to do the job, but then you get a restart again - this is the worst.

Observation: sometimes after a restart it won't POST, it won't be showing anything, and motherboard's LED for VGA will be glowing red.

I wonder, is this behavior specific to AMD+AMD cpu/vga configurations? It seems so...
Comment 230 Paul Menzel 2021-05-16 07:15:59 UTC
(In reply to fintara from comment #229)

[…]

> My build has the same problem, and before that it's been working stable for
> 2 months.

What distribution and Linux kernel version do you use? Any idea what changed in the two months ((package) updates)? Was the system continuously running for two months and the problem started after a reboot? Does the Linux kernel log MCE messages after the unwanted reboots/restarts?

> > CPU: AMD 5900X
> > RAM: G.Skill Ripjaws V 128 GB (4 x 32 GB) DDR4-3600 CL18
> > MB: Gigabyte X570 Aorus Pro (rev. 1.2)

What firmware version do you use?

> > VGA: Sapphire 5500XT 4GB

Do you have a chance to find out what firmware is running on the graphics device?

> For me this sudden reboot can happen in both idle and under load, but it
> happens more often when I leave it alone (Firefox+grafana and spotify in the
> background).
> 
> I've tried every fix here, but unfortunately it's all the same. Some fixes
> _seem_ to do the job, but then you get a restart again - this is the worst.
> 
> Observation: sometimes after a restart it won't POST, it won't be showing
> anything, and motherboard's LED for VGA will be glowing red.

Please report this to the mainboard vendor, so they fix the UEFI firmware.

> I wonder, is this behavior specific to AMD+AMD cpu/vga configurations? It
> seems so...

According to [2][3], you can use the integrated graphics device. It’d be great if you tried to reproduce it with that.

[1]: https://www.sapphiretech.com/en/consumer/pulse-radeon-rx-5500-xt-4g-gddr6
[2]: https://www.gigabyte.com/de/Motherboard/X570-AORUS-PRO-rev-11-12#kf
[3]: https://www.top2gadget.com/amd-ryzen-9-5900x-gaming-processor/
Comment 231 Amadej Kastelic 2021-05-16 07:22:44 UTC
(In reply to Amadej Kastelic from comment #224)
> 
> I upgraded my pc 4 days ago, running stable for now. Replaced my cpu with a
> 5950x and GPU with RX 6900XT. Everything seems stable with C-states enabled
> for now. Will report back if mce crashes reappear.

Another update. I'm still running stable and I'm 90% sure that it is the GPU that is causing the MCE errors. I've tried my old GPU in a new system with Windows, where it causes strange GPU driver timeouts and random reboots. I'd suggest that you all try a different GPU.
Comment 232 Gurenko Alex 2021-05-17 10:07:26 UTC
(In reply to Amadej Kastelic from comment #231)
> (In reply to Amadej Kastelic from comment #224)
> > 
> > I upgraded my pc 4 days ago, running stable for now. Replaced my cpu with a
> > 5950x and GPU with RX 6900XT. Everything seems stable with C-states enabled
> > for now. Will report back if mce crashes reappear.
> 
> Another update. I'm still running stable and I'm 90% sure that it is the GPU
> that is causing MCE errors. I've tried my old GPU in a new system with
> Windows, which is causing strange GPU driver timeouts and random reboots.
> I'd suggest you all, to try a different GPU.

I've also been running 100% stable for 16 days now after switching to the RTX 3070, so it was a GPU/driver issue for me, which at this point no one can really narrow down further.
Comment 233 fintara 2021-05-17 14:49:47 UTC
(In reply to Paul Menzel from comment #230)

> What distribution, Linux kernel version do you use? Any idea what changed in
> the two months ((package) updates)? Was the system continuously running for
> two months and the problem started after a reboot? Does the Linux kernel log
> MCE messages after the unwanted reboots/restarts?

Arch Linux, latest stable Linux kernel (but also happens with linux-lts and linux-zen). I do also have amd-ucode.
I updated the system on a Friday, then had no problems during the weekend, and on Monday the next week it started rebooting itself (after a full day of work). I shut down the PC at least once a day.
Yes, there are always MCE messages in the dmesg after such reboots, such as:

---
[    0.808978] mce: [Hardware Error]: Machine check events logged
[    0.808980] mce: [Hardware Error]: CPU 19: Machine Check: 0 Bank 5: bea0000001000108
[    0.808984] mce: [Hardware Error]: TSC 0 ADDR 7f0f4932d109 MISC d012000100000000 SYND 4d000000 IPID 500b000000000 
[    0.808988] mce: [Hardware Error]: PROCESSOR 2:a20f10 TIME 1620673385 SOCKET 0 APIC 13 microcode a201009
---

or

---
[    0.811609] mce: [Hardware Error]: Machine check events logged
[    0.811609] mce: [Hardware Error]: CPU 17: Machine Check: 0 Bank 5: bea0000001000108
[    0.811609] mce: [Hardware Error]: TSC 0 ADDR ffffff9fa2a6aa MISC d012000100000000 SYND 4d000000 IPID 500b000000000 
[    0.811609] mce: [Hardware Error]: PROCESSOR 2:a20f10 TIME 1620915516 SOCKET 0 APIC b microcode a201009
---
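
For anyone comparing status words across reports, the flag bits can be read off mechanically. Below is a minimal Python sketch, assuming only the architectural MCi_STATUS flag layout (bits 63 to 57) and the dmesg line format shown above. The helper name `decode_mce_line` is made up for illustration, and vendor-specific fields (extended error codes, SYND, IPID) are deliberately left undecoded.

```python
import re

# Architectural MCi_STATUS flag bits (bit positions 63..57).
STATUS_FLAGS = {
    "VAL": 63,    # register holds a valid error
    "OVER": 62,   # an earlier error was overwritten
    "UC": 61,     # error was not corrected by hardware
    "EN": 60,     # reporting was enabled for this error
    "MISCV": 59,  # MISC register contents are valid
    "ADDRV": 58,  # ADDR register contents are valid
    "PCC": 57,    # processor context may be corrupt
}

# Matches e.g. "CPU 19: Machine Check: 0 Bank 5: bea0000001000108"
MCE_LINE = re.compile(
    r"CPU (?P<cpu>\d+): Machine Check: \d+ "
    r"Bank (?P<bank>\d+): (?P<status>[0-9a-fA-F]{16})"
)


def decode_mce_line(line):
    """Return CPU, bank, set flags and low 16-bit error code, or None."""
    match = MCE_LINE.search(line)
    if match is None:
        return None
    status = int(match.group("status"), 16)
    flags = [name for name, bit in STATUS_FLAGS.items() if (status >> bit) & 1]
    return {
        "cpu": int(match.group("cpu")),
        "bank": int(match.group("bank")),
        "flags": flags,
        "error_code": status & 0xFFFF,  # low word only; not interpreted here
    }
```

Fed the first log above, this reports bank 5 on CPU 19 with VAL, UC, EN, MISCV, ADDRV and PCC set and a low-word error code of 0x0108.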


> What firmware version do you use?

I have flashed the latest BIOS firmware - F33h. When this problem started to happen, I had the F33g I believe.

> Do you have a chance, to find out what firmware is running on the graphics
> device?

I'm afraid I can't find this information.

> According to [2][3], you can use the integrated graphics device. It’d be
> great if you tried to reproduce it with that.

Unfortunately 5900X does not have integrated graphics (I did try to boot without VGA, but the VGA red LED would be ON and it won't POST; I also tried to move the card to another PCI slot, but the problem persisted).

Some more observations I forgot to share in the previous comment:
- it will reboot while I'm in BIOS
- it will reboot while I've booted memtest (I'm rather confident that it's not the memory, as I've tried different configurations of the 4 RAM sticks, also with and without XMP)
- it will reboot if I run `stress --cpu 24 --timeout 12h`, likewise if I put load on the video with `glmark2 --run-forever -b build:use-vbo=true` - so it doesn't matter idle/load
- if music is playing, at the moment of an unwanted reboot the display goes blank, the music stops, then 1 second the music plays again, and then it reboots.

I've tried disabling every option for the CPU (even "downgraded" it to 6 cores/6 threads), downclocked, overclocked, more or less voltage - nothing seems to make a difference.

The video card has some switch on it (Dual BIOS?) - but it doesn't matter which position it is in.

Also, is `cat /sys/kernel/debug/dri/0/amdgpu_regs` supposed to indeed freeze the system? It does for me, I have to hold the power button afterwards.

---

I have ordered an RX550, we'll see if that changes anything. The two previous commenters give me some hope...
Comment 234 Paul Menzel 2021-05-17 14:59:58 UTC
(In reply to fintara from comment #233)
> (In reply to Paul Menzel from comment #230)

[…]

> > What firmware version do you use?
> 
> I have flashed the latest BIOS firmware - F33h. When this problem started to
> happen, I had the F33g I believe.
> 
> > Do you have a chance, to find out what firmware is running on the graphics
> > device?
> 
> I'm afraid I can't find this information.

I don’t know a way to do that in GNU/Linux either. GPU-Z seems to be able to retrieve that with Microsoft Windows.

> > According to [2][3], you can use the integrated graphics device. It’d be
> > great if you tried to reproduce it with that.
> 
> Unfortunately 5900X does not have integrated graphics (I did try to boot
> without VGA, but the VGA red LED would be ON and it won't POST; I also tried
> to move the card to another PCI slot, but the problem persisted).

Sorry for the confusion. I thought so too, but the links claimed there is an integrated graphics device.

> Some more observations I forgot to share in the previous comment:
> - it will reboot while I'm in BIOS

That is actually good news, as it’s unrelated to GNU/Linux. I recommend contacting the mainboard vendor.

[…]

> Also, is `cat /sys/kernel/debug/dri/0/amdgpu_regs` supposed to indeed freeze
> the system? It does for me, I have to hold the power button afterwards.

No, nothing should crash the Linux kernel. Please report that to the AMDGPU developers [4].


[4]: https://gitlab.freedesktop.org/drm/amd/-/issues/
Comment 235 Paul Menzel 2021-05-17 15:02:18 UTC
Everyone with Asus RX5700XT card: In bug 210929 [1], @binarytamer was able to solve their issue:

> Then I read about a bad ASUS firmware on my 5700XT. I already tried to update
> the 5700XT in the past, but the ASUS Update Tool always reports that there is
> no update needed. So I flashed a ROM file from the ASUS Update with the AMD
> Flash tool this time.


[1]: https://bugzilla.kernel.org/show_bug.cgi?id=210929
     "MCE bea0000000000108 Crash on heavy/gaming workload since Kernel 5.5"
Comment 236 Alex Deucher 2021-05-17 15:06:08 UTC
You can access the GPU firmware versions from debugfs.  e.g.,
sudo cat /sys/kernel/debug/dri/0/amdgpu_firmware_info
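
To attach the versions to a report in a compact form, the file can be parsed into a dictionary. This is a minimal Python sketch, and it rests on an assumption about the file format: that most lines look like `<NAME> feature version: <n>, firmware version: 0x<hex>`. Lines in any other shape are skipped. The function names are hypothetical, and reading the file needs root and a mounted debugfs.

```python
import re
from pathlib import Path

# Path taken from the command above; card index 0 may differ on other systems.
FIRMWARE_INFO = Path("/sys/kernel/debug/dri/0/amdgpu_firmware_info")

# Assumed line shape: "<NAME> feature version: <n>, firmware version: 0x<hex>"
FW_LINE = re.compile(
    r"^(?P<name>.+?) feature version: (?P<feature>\d+), "
    r"firmware version: (?P<fw>0x[0-9a-fA-F]+)"
)


def parse_firmware_info(text):
    """Map firmware name -> (feature version, firmware version)."""
    versions = {}
    for line in text.splitlines():
        match = FW_LINE.match(line.strip())
        if match:  # lines in any other format are ignored
            versions[match.group("name")] = (
                int(match.group("feature")),
                int(match.group("fw"), 16),
            )
    return versions


def read_firmware_info(path=FIRMWARE_INFO):
    """Read and parse the debugfs file (requires root)."""
    return parse_firmware_info(Path(path).read_text())
```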
Comment 237 Arne Brücher 2021-05-21 11:00:26 UTC
I'm also experiencing the same sudden reboots followed by a kernel message MCE hardware error since October 2020. And it's always the same code "bea0000000000108" and "microcode 8701021". Here's my error log history: https://pastebin.com/NTbJb5ci

The error happens for me mostly when I play video games, e. g. Counter Strike: Global Offensive or Grand Theft Auto Online. Sometimes I play for just a few minutes and experience a crash, sometimes I play for hours without any problem. Also playing Assassins Creed Odyssey for over 12 hours in total was never a problem.

I've already tried disabling XMP and auto-overclocking in the BIOS to no avail. I did a full memtest86+ run without any error. The suggested Cool 'n' Quiet option is no longer available in the BIOS and I had crashes even with a fixed CPU ratio.

I eventually RMA'ed my CPU and received a new Ryzen 5 3600 yesterday, but after playing CS:GO for a couple of minutes, my PC crashed again. Since it almost always happens while gaming I suspect it's the GPU triggering the error. I have an RX 5700, which isn't working ideally on Linux. Using Manjaro I experienced GPU crashes showing a green screen before the system shut down spontaneously, as explained here:
https://forum.manjaro.org/t/system-crashes-and-reboots-because-either-the-cpu-or-gpu/33969

Also, I'm unable to use standby on my Arch system, because my GPU fails when waking up. This could be related. Here's the error log of a failed wake-up from standby: https://pastebin.com/CZPdfhJw

Right now I'm trying the amdgpu.ppfeaturemask=0xffffbffd kernel parameter, but the biggest problem is that this error isn't reproducible on purpose. I've run Basemark GPU, both Vulkan and OpenGL, a full GFXBench run and GPUtest for a couple of minutes, both with and without the ppfeaturemask kernel parameter, without any crash.
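
To see what this mask actually switches off, it helps to list the cleared bits. Here is a minimal Python sketch, assuming the parameter is treated as a 32-bit mask in which a cleared bit disables one feature. Which feature each bit position controls is defined by the amdgpu driver and is not asserted here. The function name is hypothetical.

```python
def cleared_bits(mask, width=32):
    """Return the bit positions that are 0 in a `width`-bit mask."""
    return [bit for bit in range(width) if not (mask >> bit) & 1]


# Mask from the kernel parameter above.
print(cleared_bits(0xFFFFBFFD))  # prints [1, 14]
```

So compared with an all-ones mask, this value turns off exactly two feature bits, at positions 1 and 14.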

Has anybody been successful in even pin-pointing where the error is coming from and what's causing it? Or was anybody able to reproduce it? It's really hard to troubleshoot when the error randomly occurs every now and then.

Finally here's my system (inxi): https://pastebin.com/Afn31RCG
Comment 238 Amadej Kastelic 2021-05-21 11:02:51 UTC
(In reply to kernel from comment #237)
> I'm also experiencing the same sudden reboots followed by a kernel message
> MCE hardware error since October 2020. And it's always the same code
> "bea0000000000108" and "microcode 8701021". Here's my error log history:
> https://pastebin.com/NTbJb5ci
> 
> The error happens for me mostly when I play video games, e. g. Counter
> Strike: Global Offensive or Grand Theft Auto Online. Sometimes I play for
> just a few minutes and experience a crash, sometimes I play for hours
> without any problem. Also playing Assassins Creed Odyssey for over 12 hours
> in total was never a problem.
> 
> I've already tried disabling XMP and auto-overclocking in the BIOS to no
> avail. I did a full memtest86+ run without any error. The suggested Cool 'n'
> Quiet option is no longer available in the BIOS and I had crashes even with
> a fixed CPU ratio.
> 
> I eventually RMA'ed my CPU and received a new Ryzen 5 3600 yesterday, but
> after playing CS:GO for a couple of minutes, my PC crashed again. Since it
> almost always happens while gaming I suspect it's the GPU triggering the
> error. I have an RX 5700 which isn't working ideally on Linux. Using Manjaro I
> experienced GPU crashes showing a green screen before the system shut down
> spontaneously, as explained here:
> https://forum.manjaro.org/t/system-crashes-and-reboots-because-either-the-cpu-or-gpu/33969
> 
> Also I'm unable to use standby on my Arch system, because when waking up my
> GPU fails. This could have something to do with this. Here's the error log
> of a failed wake-up from standby: https://pastebin.com/CZPdfhJw
> 
> Right now I'm trying the amdgpu.ppfeaturemask=0xffffbffd kernel parameter,
> but the biggest problem is that this error isn't reproducible on purpose.
> I've run Basemark GPU, both Vulkan and OpenGL, a full GFXBench run and
> GPUtest for a couple of minutes, both with and without the ppfeaturemask
> kernel parameter, without any crash.
> 
> Has anybody been successful in even pin-pointing where the error is coming
> from and what's causing it? Or was anybody able to reproduce it? It's really
> hard to troubleshoot when the error randomly occurs every now and then.
> 
> Finally here's my system (inxi): https://pastebin.com/Afn31RCG

It's probably the GPU, read my previous comments. The error doesn't have anything to do with Linux, it happens on Windows as well. Try a cheap GPU or RMA yours.
Comment 239 Paul Menzel 2021-05-21 11:22:10 UTC
Arne, thank you for sharing your findings. As you can see, the issue is already quite unwieldy. Please report the resume issue separately to Linux’ AMDGPU folks [1], and try to read all the comments here. ;-)

Any idea, what changed in October 2020?

As you have a XT5700, as per comment 235, please check out bug 210929 [2] (and please paste/attach the output of `sudo cat /sys/kernel/debug/dri/0/amdgpu_firmware_info` there, before upgrading the GPU firmware).


[1]: https://gitlab.freedesktop.org/drm/amd/-/issues/
[2]: https://bugzilla.kernel.org/show_bug.cgi?id=210929
Comment 240 Leonardo de Araujo Augusto 2021-05-21 11:49:10 UTC
I'm MCE-free for about two months after fixing the CPU frequency to 36.0 (from auto) at the BIOS. This fixed the resets happening both under Windows and Linux.
The suggestion came from the comment below, from Martin Roth.


(In reply to Martin Roth from comment #116)

...

> The last thing I tried was to change the CPU Ratio from Auto → 36.00, like
> suggested in
> https://forum-en.msi.com/index.php?threads/solved-msi-x570-a-pro-ryzen-5-3600-freeze.344085/
> (the base frequency of my CPU is 3.6 GHz) and now a week
> has passed without any random freezes.
> 
> Perhaps you can also try changing the CPU Ratio to a fixed value in the BIOS
> and see if it helps.
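
The ratio value maps to a core clock by multiplying it with the reference clock. Here is a small Python sketch, assuming the usual 100 MHz reference clock (an assumption, not something stated in this thread). The function name is hypothetical.

```python
def core_clock_mhz(ratio, bclk_mhz=100.0):
    """Core clock in MHz for a fixed CPU ratio.

    bclk_mhz is the reference clock; 100 MHz is assumed here.
    """
    return ratio * bclk_mhz


print(core_clock_mhz(36.00))  # prints 3600.0
```

Under that assumption, a ratio of 36.00 pins the CPU at its 3.6 GHz base frequency with boosting effectively disabled.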
Comment 241 Clemens Eisserer 2021-05-21 11:54:17 UTC
| The error doesn't have anything to do with Linux, it happens on Windows
| as well. Try a cheap GPU or RMA yours.

For me (the original bug reporter) it *never* happened when running Windows. I used the machine extensively for one year running Windows for developing proprietary stuff and it was rock solid. However, dual-booting into Linux for personal usage produced about one MCE per week. So I guess it is not valid to generalize from your findings to what others experience.

| I'm MCE-free for about two months after fixing the CPU frequency to 36.0
| (from auto) at the BIOS. This fixed the resets happening both under Windows
| and Linux. The suggestion came from the comment below, from Martin Roth.

Could it be that the issue is power-related, with PSUs not ramping up output fast enough when the CPU is changing frequencies? The timeouts causing the MCEs would then be brown-outs caused by too low a voltage. Or would there be more detailed error reporting in this case?
Comment 242 Foulques du Peloux de Praron 2021-05-21 12:25:48 UTC
(In reply to Clemens Eisserer from comment #241)
> Could it be the issue is power-related, with PSUs not ramping up output fast
> enough when the cpu is changing frequencies? The timeouts causing the MCEs
> would be brown-outs caused by too low voltage. Or would there be more
> detailed error reporting in this case?

Good suggestion. I use an old PSU (Be Quiet Straight Power E9 - 500W), but running benchmarks like Unigine Superposition is never a problem and most of the time I have no issue in games.
But when the MCE happens, it seems to be while playing a moderate or heavy game, and I hear the fans suddenly spinning up like crazy.
Comment 243 Arne Brücher 2021-05-21 12:31:31 UTC
(In reply to Paul Menzel from comment #239)
> Arne, thank you for sharing your findings. As you can see, the issue is
> already quite unwieldy. Please report the resume issue separately to Linux’
> AMDGPU folks [1], and try to read all the comments here. ;-)

Thanks, will do!

> Any idea, what changed in October 2020?

No, and I believe I had this issue before. This is just the earliest time I sought help online. I bought the computer in April 2020 and I'm sure I had this issue between April and October.
 
> As you have a XT5700, as per comment 235, please check out bug 210929 [2]
> (and please paste/attach the output of `sudo cat
> /sys/kernel/debug/dri/0/amdgpu_firmware_info` there, before upgrading the
> GPU firmware).

Thanks, done. It's an RX 5700 btw. :)

(In reply to Leonardo de Araujo Augusto from comment #240)
> I'm MCE-free for about two months after fixing the CPU frequency to 36.0
> (from auto) at the BIOS. This fixed the resets happening both under Windows
> and Linux.

Weird, I had the issue with the CPU ratio fixed at 42.00. Maybe it is a problem with power delivery?

> Good suggestion. I use an old PSU (Be Quiet Straight Power E9 - 500W), but
> running benchmarks like Unigine Superposition is never a problem and most of
> the time I have no issue in games.
> But, when the MCE happens, it seems to be while playing a moderate or heavy
> game, and I am hearing ventilators suddenly running crazy.

But on the other hand I also had no problems running heavy-load GPU benchmarks, and the crashes only happen when playing games. What's so specific to video games in comparison to synthetic benchmarks?
Comment 244 Tim S 2021-05-23 19:05:13 UTC
(In reply to Tim S from comment #189)

I was finally able to replace my Radeon HD 5450 with a NVIDIA GeForce RTX 3060. However, it seemed to have made no difference regarding the crashes. Just thought I should report this info since many people were discussing the effect of different GPUs on the problem. Of course YMMV