Bug 206903 - Spontaneous reboots with Ryzen-3700x (Machine Check: 0 Bank 5: bea0000000000108)
Summary: Spontaneous reboots with Ryzen-3700x (Machine Check: 0 Bank 5: bea0000000000108)
Status: NEW
Alias: None
Product: Platform Specific/Hardware
Classification: Unclassified
Component: x86-64
Hardware: x86-64 Linux
Importance: P1 high
Assignee: platform_x86_64@kernel-bugs.osdl.org
URL:
Keywords:
Duplicates: 208573
Depends on:
Blocks:
 
Reported: 2020-03-21 09:45 UTC by Clemens Eisserer
Modified: 2020-09-20 17:34 UTC (History)
CC List: 25 users

See Also:
Kernel Version: 5.5.9
Tree: Mainline
Regression: No


Attachments
dmesg (79.61 KB, text/plain)
2020-03-24 15:53 UTC, Clemens Eisserer
Xorg log (49.85 KB, text/plain)
2020-03-24 15:55 UTC, Clemens Eisserer
Logs for debugging Machine Check crash (390.21 KB, text/plain)
2020-05-22 04:58 UTC, joel_damiano
kernel and X11 logs for a boot after crash when reading registers (134.19 KB, text/plain)
2020-05-28 17:42 UTC, Vitalii
possible fix (2.02 KB, patch)
2020-07-01 16:12 UTC, Alex Deucher
dmesg log while running at high CPU load (5.90 KB, text/plain)
2020-07-04 12:48 UTC, Jens Reimann
pci devices (668.41 KB, text/plain)
2020-08-10 18:46 UTC, busdma

Description Clemens Eisserer 2020-03-21 09:45:28 UTC
Ever since building my new PC I have experienced spontaneous reboots (roughly once a week) caused by machine check exceptions (always the same bank and code - please see below). The reboots tend to happen in low-load situations (e.g. right after loading the desktop, or when playing YouTube videos) - high load doesn't seem to make it worse.

The system consists of:
Asrock Phantom Gaming 4 X570 (latest BIOS: 2.30)
Ryzen 3700x
MSI RX570 4GB
Crucial Ballistix 4x8GB, DDR4, 3000 MHz

I first suspected a hardware fault, but the system has been rock solid running Windows-10 for months (not a single crash / reboot) and runs memtest-86+ without error for days in single- & multicore-mode.
Temps are low, PSU is of high quality.

I tried the following workarounds without success (suggested for Zen 1 chips with the same error code):
- Disabled RC6 power state
- Disabled mwait for core-signalling
- limited GPU power saving states

Others have experienced exactly the same issue: https://www.reddit.com/r/archlinux/comments/e33nyg/hard_reboots_with_ryzen_3600x/fgtj09u/

... where in some cases it seems changing to a different GPU helps.
However my RX570 is rather new, so I am not so keen on replacing it after 6 months of use.


[    0.707393] mce: [Hardware Error]: Machine check events logged
[    0.707395] mce: [Hardware Error]: CPU 10: Machine Check: 0 Bank 5: bea0000000000108
[    0.707464] mce: [Hardware Error]: TSC 0 ADDR 1ffffbb03343c MISC d012000100000000 SYND 4d000000 IPID 500b000000000
[    0.707540] mce: [Hardware Error]: PROCESSOR 2:870f10 TIME 1583508288 SOCKET 0 APIC 5 microcode 8701013
[    0.709397] mce: [Hardware Error]: Machine check events logged
[    0.709398] mce: [Hardware Error]: CPU 12: Machine Check: 0 Bank 5: bea0000000000108
[    0.709468] mce: [Hardware Error]: TSC 0 ADDR 1ffffbba3a05a MISC d012000100000000 SYND 4d000000 IPID 500b000000000
[    0.709543] mce: [Hardware Error]: PROCESSOR 2:870f10 TIME 1583508288 SOCKET 0 APIC 9 microcode 870101
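
For comparing reports like the ones above, here is a minimal Python sketch that pulls the CPU, bank, status, ADDR and TIME fields out of "mce: [Hardware Error]" lines. The function name parse_mce_events and the regular expressions are illustrative assumptions based only on the log format shown in this report; TIME is treated as a Unix timestamp in seconds.

```python
import re
from datetime import datetime, timezone

# Patterns are assumptions derived from the log lines quoted in this report.
CPU_RE = re.compile(r"CPU (\d+): Machine Check: \d+ Bank (\d+): ([0-9a-f]+)")
ADDR_RE = re.compile(r"ADDR ([0-9a-f]+)")
TIME_RE = re.compile(r"TIME (\d+)")


def parse_mce_events(lines):
    """Group consecutive mce log lines into one dict per machine check."""
    events = []
    for line in lines:
        match = CPU_RE.search(line)
        if match:
            # A "CPU n: Machine Check" line starts a new event.
            events.append({
                "cpu": int(match.group(1)),
                "bank": int(match.group(2)),
                "status": int(match.group(3), 16),
                "addr": None,
                "time": None,
            })
            continue
        if not events:
            continue
        match = ADDR_RE.search(line)
        if match:
            events[-1]["addr"] = int(match.group(1), 16)
        match = TIME_RE.search(line)
        if match:
            # TIME is interpreted as seconds since the Unix epoch (UTC).
            events[-1]["time"] = datetime.fromtimestamp(
                int(match.group(1)), tz=timezone.utc)
    return events
```

Feeding it the output of "dmesg" or "journalctl -k" line by line should make it easier to see whether bank and status stay constant across crashes while CPU and ADDR vary.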
Comment 1 Alex Deucher 2020-03-23 13:29:50 UTC
Does setting amdgpu.ppfeaturemask=0xffffbffb on the kernel command line in GRUB help? Please attach your dmesg output and Xorg log (if using X).
Comment 2 Clemens Eisserer 2020-03-24 15:53:08 UTC
Created attachment 288035 [details]
dmesg
Comment 3 Clemens Eisserer 2020-03-24 15:55:11 UTC
Created attachment 288037 [details]
Xorg log
Comment 4 Clemens Eisserer 2020-03-24 15:59:46 UTC
Logs are attached. Thanks for the hint regarding the feature mask; I'll give it a try and report back as soon as the next reboot occurs.

I discovered a feature mask was already set, stemming from my experiments with reducing power management. The crashes were already happening before I set the feature mask to 0xfffd7fff.
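
To check which mask a given boot actually used, a small Python sketch like the following can parse the kernel command line. The helper name get_ppfeaturemask is hypothetical, and the "last occurrence wins" behaviour is an assumption about how repeated parameters are handled.

```python
def get_ppfeaturemask(cmdline):
    """Return the amdgpu.ppfeaturemask value from a kernel command line.

    Returns None when the parameter is absent. If it appears more than
    once, the last occurrence is used (assumed, not verified here).
    """
    value = None
    for token in cmdline.split():
        if token.startswith("amdgpu.ppfeaturemask="):
            # Base 0 accepts both "0xffffbffb" and plain decimal values.
            value = int(token.split("=", 1)[1], 0)
    return value


if __name__ == "__main__":
    with open("/proc/cmdline") as f:
        mask = get_ppfeaturemask(f.read())
    print("not set" if mask is None else f"{mask:#010x}")
```

Running it after each crash and keeping the result next to the MCE log avoids confusion about which mask was active at the time.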
Comment 5 Clemens Eisserer 2020-03-26 08:09:18 UTC
The modified feature mask didn't seem to improve things. I just had another reboot:

[    0.105123] mce: [Hardware Error]: Machine check events logged
[    0.105124] mce: [Hardware Error]: CPU 0: Machine Check: 0 Bank 5: bea0000000000108
[    0.105191] mce: [Hardware Error]: TSC 0 ADDR 7f6b3dbdfe9e MISC d012000100000000 SYND 4d000000 IPID 500b000000000 
[    0.105267] mce: [Hardware Error]: PROCESSOR 2:870f10 TIME 1585208779 SOCKET 0 APIC 0 microcode 8701013
Comment 6 Borislav Petkov 2020-03-27 11:40:41 UTC
I see in your dmesg:

amdgpu.ppfeaturemask=0xfffd7fff

Alex asked you to try:

amdgpu.ppfeaturemask=0xffffbffb

Are you saying that the MCE in comment #5 happened with the 0xffffbffb mask?

If so, then you probably should RMA your CPU.

HTH.
Comment 7 Clemens Eisserer 2020-03-27 11:49:11 UTC
@Borislav: the dmesg dump was uploaded before the feature mask was adjusted; the last crash happened with amdgpu.ppfeaturemask=0xffffbffb.

I've filed this report because the machine is rock solid running Windows 10 and the crashes don't seem to be load-related (e.g. encoding VP9 videos on all cores for 24h on Linux didn't cause any problems). But who knows, maybe it is another "Linux performance marginality problem" ;)
Comment 8 Borislav Petkov 2020-03-27 11:59:07 UTC
Comparing it to windoze doesn't mean a whole lot. If you want to debug it, then I guess the only thing I can think of is for you to try to rule out hw components. Like, for example, if you have another GPU handy to try with it and see if the MCEs still happen. And so on.

Then perhaps try to figure out what you do exactly before it reboots - maybe you'll be able to spot a pattern there.

You get the idea.
Comment 9 Clemens Eisserer 2020-03-27 12:05:43 UTC
| Comparing it to windoze doesn't mean a whole lot.

It means either:
* the hw is not faulty or has quirks which the windows drivers handle properly
* the hw causing the MCE is not used / used in a different way when running windows
Comment 10 Borislav Petkov 2020-03-27 12:18:09 UTC
And how do you suggest we figure that out?
Comment 11 Clemens Eisserer 2020-03-31 14:12:36 UTC
and another one:

[    0.696908] mce: [Hardware Error]: Machine check events logged
[    0.696909] mce: [Hardware Error]: CPU 3: Machine Check: 0 Bank 5: bea0000000000108
[    0.696977] mce: [Hardware Error]: TSC 0 ADDR 1ffffafa3b2aa MISC d012000100000000 SYND 4d000000 IPID 500b000000000 
[    0.697053] mce: [Hardware Error]: PROCESSOR 2:870f10 TIME 1585663601 SOCKET 0 APIC 6 microcode 8701013
Comment 12 someguy108 2020-04-03 00:48:13 UTC
Hello! Like Clemens, I've been having a similar issue with spontaneous reboots.
This is my configuration:
-Ryzen 3900x + Noctua D15
-MSI X570 Unify (latest agesa as of writing)
-DDR4 3200mhz 32GB kit
-Sapphire Pulse 5700 XT
-Corsair RMX 850 Watt
-Arch Linux with kernel 5.5.13
-Mesa 20.0.3
-Early KMS enabled

I've had this system up and running since November 2019 but initially with a Nvidia 1060 and Windows 10. Everything was running smoothly. About a month ago I switched back over to Linux after purchasing my 5700 XT as my initial plan was to go back to Linux. Since returning I've experienced multiple spontaneous MCE reboots. All happened while I was playing one particular game, Warcraft 3 Reforged. The MCE event is the following:

kernel: mce: [Hardware Error]: Machine check events logged
kernel: mce: [Hardware Error]: CPU 1: Machine Check: 0 Bank 5: bea0000000000108
kernel: mce: [Hardware Error]: TSC 0 ADDR 1ffffad66d6fe MISC d012000100000000 SYND 4d000000 IPID 500b000000000
kernel: mce: [Hardware Error]: PROCESSOR 2:870f10 TIME 1585120217 SOCKET 0 APIC 2 microcode 8701013
kernel: #2 #3 #4 #5 #6 #7 #8 #9 #10 #11 #12 #13 #14 #15
kernel: mce: [Hardware Error]: Machine check events logged
kernel: mce: [Hardware Error]: CPU 15: Machine Check: 0 Bank 5: bea0000000000108
kernel: mce: [Hardware Error]: TSC 0 ADDR 1ffffc1196eb6 MISC d012000100000000 SYND 4d000000 IPID 500b000000000
kernel: mce: [Hardware Error]: PROCESSOR 2:870f10 TIME 1585120217 SOCKET 0 APIC 9 microcode 8701013
kernel: #16 #17 #18 #19 #20 #21 #22 #23

Initially I figured it could be RAM, so I ran the usual tests with no problems. I also tested with standard JEDEC settings and eventually received an MCE during Warcraft 3 Reforged. After consulting with a few friends I decided to try a different power supply, to no avail. I then bit the bullet and bought a brand new 3900x. I also cleared CMOS before and after getting my new 3900x. All CPU values are on auto with no PBO or manual overclocking. The only non-default setting is XMP for the RAM. Yesterday, after owning the new 3900x for three days, I had an MCE while I was playing Warcraft 3 Reforged. I have tested other games (World of Warcraft, The Outer Worlds, Stellaris, and Counter-Strike: Global Offensive), but none of them caused an MCE or any crashes or freezes for that matter.

One thing to note is that I haven't received it during desktop usage, only in Warcraft 3. I have always had desktop compositing disabled in both Xfce and KDE. I used and tested both, and received the MCEs in sessions under each.

I have noticed a pattern with the MCE crashes in Warcraft 3. They always happen during a transition where GPU load drops off or increases, for example when exiting a match to return to the lobby, or when a map finishes loading and the game switches from the loading screen to the match itself.

The entire screen quickly turns black, everything is hard locked, and then after about a minute or so the machine reboots on its own. It hasn't happened yet in the middle of a match session, while sitting in the lobby, or at the main menu screen. It has consistently been during a transition.

My theory is that this could be a GPU hang caused by switching from one power state to another. The hanging GPU causes the CPU to stall, and thus an MCE. The GPU hanging could also explain the quick solid black screen, as all output is stopped. But I'm really just guessing here from my own observations and my very limited understanding. A possible reason why this triggers in Warcraft 3 is that the other games have few moments of heavy power state switching. The Outer Worlds, World of Warcraft, Stellaris, and Counter-Strike: Global Offensive all keep a constant high load on the GPU and the match sessions are long.

For what it's worth, I've had no major issues in Windows 10. The only quirks were initially a few TDRs that recovered when alt-tabbing out of most games with Google Chrome running. Disabling hardware acceleration in Chrome fixed those TDRs.

I've also used both 3900x's to compile things like Chromium and other large projects that take hours, perfectly fine. On the Windows side of things I've also done extensive stress testing with Prime95 and Aida64, along with long gaming sessions in Battlefield V, which utilizes AVX instructions and puts a load on all 24 threads.

From searching, I've found quite a few reports of people receiving MCEs that aren't the typical first-generation Ryzen MCE reports from 2017. Those were fixed by disabling C-states, changing RAM, and changing power supply current from low to typical. The ones within the past year all appear to have an AMD GPU in common. I did notice a few with Intel CPUs as well, paired with an AMD GPU.

Any feedback would be greatly appreciated.
Comment 13 Clemens Eisserer 2020-04-13 09:20:22 UTC
As mentioned by other users / in the reddit thread this issue really seems to be somehow GPU / PCIe related.

I've disabled GPU acceleration by switching to Xorg AccelMethod=none and the llvmpipe opengl rasterizer and despite using the machine more frequently I haven't had a single crash in the past 10 days.

It is still unclear to me how the GPU could trigger processor MCEs. Maybe the Windows drivers know about some CPU quirks and have workarounds implemented, while the Linux drivers still lack those?
Comment 14 Clemens Eisserer 2020-04-13 09:23:50 UTC
Sorry, wrong conclusion: two minutes after writing this post, while shutting down the system, there it was:

[    0.123020] mce: [Hardware Error]: Machine check events logged
[    0.123022] mce: [Hardware Error]: CPU 15: Machine Check: 0 Bank 5: bea0000000000108
[    0.123090] mce: [Hardware Error]: TSC 0 ADDR 1ffffb8a3b5ce MISC d012000100000000 SYND 4d000000 IPID 500b000000000 
[    0.123166] mce: [Hardware Error]: PROCESSOR 2:870f10 TIME 1586769715 SOCKET 0 APIC f microcode 8701013
Comment 15 Nicholas H. 2020-04-20 00:41:41 UTC
I have the same issue with my 3900X and 5700 XT but (in my case) it is not specific to AMD graphics cards.

On my machine the resets are most common during suspend to RAM, or immediately after my monitors go to sleep. I set my machine to suspend after one minute and have another machine sending WoL packets in a loop so I can reproduce the issue easily. I always get a reset before the 100th suspend.

I tested this setup with both a 5700 XT (amdgpu) and an Nvidia 780 Ti (nouveau) and the resets happen with both cards.
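
The wake-up side of such a suspend loop can be sketched in Python as follows. This is only an illustration of the standard Wake-on-LAN magic packet (6 bytes of 0xFF followed by the target MAC repeated 16 times); the MAC address, interval, broadcast address and port 9 in the usage note are placeholder assumptions, not values from this report.

```python
import socket


def build_magic_packet(mac):
    """Build a Wake-on-LAN magic packet for a MAC like '00:11:22:33:44:55'."""
    mac_bytes = bytes.fromhex(mac.replace(":", "").replace("-", ""))
    if len(mac_bytes) != 6:
        raise ValueError("expected a 6-byte MAC address")
    # 6 x 0xFF synchronisation bytes, then the MAC repeated 16 times.
    return b"\xff" * 6 + mac_bytes * 16


def send_wol(mac, broadcast="255.255.255.255", port=9):
    """Send one magic packet as a UDP broadcast."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.setsockopt(socket.SOL_SOCKET, socket.SO_BROADCAST, 1)
        sock.sendto(build_magic_packet(mac), (broadcast, port))
```

Calling send_wol("00:11:22:33:44:55") every few seconds from a second machine, while the machine under test is configured to suspend after one minute, would approximate the loop described above.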

Like the other reports, I've never had this happen in Windows.

In addition to the bea0000000000108 mce, I've had two others:

kernel: mce: [Hardware Error]: Machine check events logged
kernel: mce: [Hardware Error]: CPU 0: Machine Check: 0 Bank 22: baa000000002010b
kernel: mce: [Hardware Error]: TSC 0 MISC d012000100000000 SYND 4d000000 IPID 1813e17000 
kernel: mce: [Hardware Error]: PROCESSOR 2:870f10 TIME 1581274016 SOCKET 0 APIC 0 microcode 8701013

kernel: mce: [Hardware Error]: Machine check events logged
kernel: [Hardware Error]: System Fatal error.
kernel: [Hardware Error]: CPU:12 (17:71:0) MC3_STATUS[Over|UE|MiscV|-|PCC|TCC|SyndV|-|-|-]: 0xfaa0000000070118
kernel: [Hardware Error]: IPID: 0x000300b000000000, Syndrome: 0x000000004d000030
kernel: [Hardware Error]: Decode Unit Ext. Error Code: 7, Patch RAM sequencer parity error.
kernel: [Hardware Error]: cache level: RESV, tx: GEN, mem-tx: RD
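
The raw status words in these reports can be compared with a small Python decoder. This is a sketch only: the bit positions below are an assumption based on the MCA status flag definitions as I recall them from the Linux kernel's mce.h (Val 63, Over 62, UC 61, En 60, MiscV 59, AddrV 58, PCC 57, TCC 55, SyndV 53), with the extended error code taken from bits 21:16. The decoded MC3_STATUS line above serves as a cross-check.

```python
# Assumed flag layout of an AMD MCA status register (see lead-in).
STATUS_FLAGS = [
    (63, "Val"), (62, "Over"), (61, "UC"), (60, "En"),
    (59, "MiscV"), (58, "AddrV"), (57, "PCC"),
    (55, "TCC"), (53, "SyndV"),
]


def decode_mca_status(status):
    """Split a 64-bit MCA status word into flags and error codes."""
    flags = [name for bit, name in STATUS_FLAGS if (status >> bit) & 1]
    return {
        "flags": flags,
        "ext_error_code": (status >> 16) & 0x3F,  # bits 21:16
        "error_code": status & 0xFFFF,            # bits 15:0
    }


if __name__ == "__main__":
    for word in (0xbea0000000000108, 0xbaa000000002010b, 0xfaa0000000070118):
        print(f"{word:#018x}", decode_mca_status(word))
```

Under these assumptions, 0xfaa0000000070118 decodes to Over, UC, MiscV, PCC, TCC and SyndV with extended error code 7, which agrees with the kernel's own decode above, while bea0000000000108 additionally has AddrV set and no overflow.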
Comment 16 Borislav Petkov 2020-04-20 17:06:12 UTC
(In reply to Nicholas H. from comment #15)
> In addition to the bea0000000000108 mce, I've had two others:

Looks like a different issue to me.

You should update your BIOS to the latest version, if you haven't done so already.

Then, there was a recent issue with AMD GPUs which could trigger MCEs
too, see

https://bugzilla.kernel.org/show_bug.cgi?id=207331

The fix will be in stable kernels, if that has not happened yet, and it
might be worth a try. That's something which the other people affected by this should try too.

If none of that helps, you should return your CPU for replacement.

HTH.
Comment 17 Rich 2020-04-21 16:38:09 UTC
>> dmesg log shows: microcode: CPU0: patch_level=0x08701013
Ensure the microcode is updated to 08701021 after updating the BIOS.

>> Asrock Phantom Gaming 4 X570 (latest BIOS: 2.30)
Update the BIOS to the latest on the Asrock Phantom Gaming 4 X570. Note: some vendor BIOSes lag AMD code releases, so you may have to take the next BIOS update as well to ensure all known fixes are in.
BIOS upgrade: https://www.asrock.com/mb/AMD/X570%20phantom%20Gaming%204/index.asp#BIOS
Latest: Version 2.60, 2020/4/16, 13.84MB, Instant Flash
How to update: https://www.asrock.com/support/BIOSIG.asp?cat=BIOS1
Comment 18 Clemens Eisserer 2020-04-22 10:52:58 UTC
Hi Rich,

Will 0x08701021 be published to the linux-firmware git anytime soon, so I don't have to rely on my motherboard manufacturer?

I updated to BIOS 2.6 a few days ago and the microcode patch level is still at 0x08701013.
Comment 19 Rich 2020-04-23 18:30:31 UTC
Hi Eisserer,

Are you still seeing the failure (light load, dmesg shows the same MCE)? By the way, that MCE is a catch-all and can have lots of possible causes.

It's better to go with the motherboard vendor BIOS, as it includes all of AMD's updates, including those for the security and power management microcontroller and the microcode for the cores.

Please post any new failure data and we will figure it out.
Monitoring CPU voltage, temperature, frequency, and activity level may provide a clue if this is on the hardware side.

Rich
Comment 20 Clemens Eisserer 2020-04-29 22:19:15 UTC
Just experienced the same crash again, this time with BIOS 2.6 (still on microcode 8701013).

Quite often I've seen this crash when launching LibreOffice:


[    0.105648] .... node  #0, CPUs:        #1  #2  #3  #4  #5  #6  #7  #8  #9
[    0.116018] mce: [Hardware Error]: Machine check events logged
[    0.116019] mce: [Hardware Error]: CPU 9: Machine Check: 0 Bank 5: bea0000000000108
[    0.116087] mce: [Hardware Error]: TSC 0 ADDR 1ffffc066c3c6 MISC d012000100000000 SYND 4d000000 IPID 500b000000000 
[    0.116163] mce: [Hardware Error]: PROCESSOR 2:870f10 TIME 1588198590 SOCKET 0 APIC 3 microcode 8701013
[    0.116237]  #10 #11 #12 #13 #14
[    0.122019] mce: [Hardware Error]: Machine check events logged
[    0.122021] mce: [Hardware Error]: CPU 14: Machine Check: 0 Bank 5: bea0000000000108
[    0.122089] mce: [Hardware Error]: TSC 0 ADDR 7f40ca005e9e MISC d012000100000000 SYND 4d000000 IPID 500b000000000 
[    0.122164] mce: [Hardware Error]: PROCESSOR 2:870f10 TIME 1588198590 SOCKET 0 APIC d microcode 8701013
Comment 21 Clemens Eisserer 2020-04-30 05:38:05 UTC
To make sure this isn't a hardware fault that is simply more likely to be triggered when running Linux, I've swapped my Ryzen 3700x (an early one from 7/2019) for a new one ordered a few days ago.
Comment 22 Rich 2020-04-30 15:05:24 UTC
trying another CPU is a good idea....

bea0000000000108 means the thread has stopped executing. This is the longest timeout; all other hardware fault timers would/should fire before this.

It occurs on 2 threads, but one of them goes first. Both are Thread 1, one in kernel mode code (ADDR 1ffffc066c3c6) and the other in user space code (ADDR 7f40ca005e9e).
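
A rough way to sort logged ADDR values into kernel and user space is sketched below in Python. It rests on assumptions not stated in this thread: 48-bit virtual addressing with the kernel in the upper half of the address space, and the logged value preserving bit 47 of the virtual address. The helper name classify_mce_addr is hypothetical.

```python
def classify_mce_addr(addr):
    """Guess whether a logged MCE ADDR lies in kernel or user space.

    Heuristic (assumption): with 48-bit virtual addresses, kernel
    addresses have bit 47 set and user addresses have it clear.
    """
    return "kernel" if addr & (1 << 47) else "user"


if __name__ == "__main__":
    for addr in (0x1ffffc066c3c6, 0x7f40ca005e9e):
        print(f"{addr:#x}: {classify_mce_addr(addr)}")
```

With this heuristic, 1ffffc066c3c6 is classified as kernel and 7f40ca005e9e as user, matching the reading given above.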


This case has lots of possible causes: OS, app, voltage, temperature, board hardware (power delivery cases), memory (are you running ECC memory?).

What OS and version, and what version of LibreOffice? I can try launching LibreOffice repeatedly as well.
Comment 23 Rich 2020-04-30 18:25:04 UTC
Tried 15 cycles with a random delay between app startups; no issues seen. I'll look for a way to automate this case for continuous testing.

This is my setup. It is not identical to yours, just a reference point.

CPU0: AMD Ryzen 9 3950X 16-Core Processor (family: 0x17, model: 0x71, stepping: 0x0)
microcode: CPU0: patch_level=0x08701021

I'm running Ubuntu 18.04.1 LTS

LibreOffice Version:  Version: 6.0.6.2
Build ID: 1:6.0.6-0ubuntu0.18.04.1
CPU threads: 32; OS: Linux 4.15; UI render: default; VCL: gtk3; 
Locale: en-US (en_US.UTF-8); Calc: group
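
One possible way to automate repeated launches with a random delay is sketched below in Python. The function name launch_cycles, the timing defaults and the example command are illustrative assumptions; the sketch simply starts a program, lets it run for a while, terminates it, and waits a random interval before the next cycle.

```python
import random
import subprocess
import time


def launch_cycles(cmd, cycles=15, run_for=20.0,
                  min_delay=1.0, max_delay=10.0, sleep=time.sleep):
    """Start and stop `cmd` repeatedly with a random pause between cycles.

    `sleep` is injectable so the loop can be exercised without waiting.
    Returns the number of completed cycles.
    """
    completed = 0
    for _ in range(cycles):
        proc = subprocess.Popen(cmd)
        sleep(run_for)          # give the application time to start up
        proc.terminate()        # ask it to exit
        proc.wait()             # reap the process before the next cycle
        sleep(random.uniform(min_delay, max_delay))
        completed += 1
    return completed
```

A call such as launch_cycles(["libreoffice", "--calc"], cycles=100) is a hypothetical example; the actual command depends on how LibreOffice is installed (for instance via flatpak).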
Comment 24 Clemens Eisserer 2020-04-30 19:38:44 UTC
Hi Rich,

I observed most crashes during cold LibreOffice (Calc) startups, but those are not reproducible. However, my setup is a bit unusual: btrfs (-> high fragmentation) on dm-crypt, with LibreOffice installed via flatpak, so maybe IO plays a role here. I also saw crashes with Firefox playing YouTube in the background.

I am now using the system with the "fresh" 3700x, with everything else unchanged, and will report back in 2-3 weeks (until now the longest period without an MCE was 10 days). Maybe it is really faulty hw after all...
Comment 25 Clemens Eisserer 2020-05-13 11:35:58 UTC
it seems the processor was fine - the new 3700x crashed today the same way the old one did:

[    0.292661] .... node  #0, CPUs:        #1  #2  #3  #4  #5  #6  #7  #8  #9 #10
[    0.303677] mce: [Hardware Error]: Machine check events logged
[    0.303679] mce: [Hardware Error]: CPU 10: Machine Check: 0 Bank 5: bea0000000000108
[    0.303747] mce: [Hardware Error]: TSC 0 ADDR 1ffffc0a9e3c6 MISC d012000100000000 SYND 4d000000 IPID 500b000000000 
[    0.304662] mce: [Hardware Error]: PROCESSOR 2:870f10 TIME 1589369644 SOCKET 0 APIC 5 microcode 8701013
[    0.304736]  #11 #12 #13 #14 #15
Comment 26 joel_damiano 2020-05-22 04:57:21 UTC
I also am having this problem. Initially, I couldn't even boot using the Fedora live CD. I found that adding 'nomodeset' to the grub boot line made the system unconditionally stable. I have tried all of the recommendations on the web to fix this problem. The only one which has helped was to replace 'nomodeset' with 'nowatchdog', which leaves the system somewhat stable. However, if I try to use a program which uses graphics acceleration (glmark2), the system promptly crashes. I will attach a copy of the output from journalctl, lspci, etc... for your review.
Comment 27 joel_damiano 2020-05-22 04:58:22 UTC
Created attachment 289219 [details]
Logs for debugging Machine Check crash
Comment 28 Roman C. 2020-05-22 11:53:25 UTC
I can confirm the reboots with a similar hardware setup:

AMD Ryzen 9 3950X
Gigabyte X570 Aorus Elite (latest BIOS F12f, microcode 0x08701013)
PowerColor Radeon RX 5700 XT Red Devil 8GB
4x Samsung M378A4G43MB1-CTD DDR4-2666

I use kernel 5.6.11 and have tried different kernels hoping for improvement. I also tried different BIOS setups, with no effect.

On my system the error happens on different cores, but I don't think this matters.

My observation is that the reboots happen in situations with low load on the system, but I can't reproduce the error with any particular behaviour.
Sometimes I can work 8 hours without a reboot, and sometimes it reboots within the first 30 minutes.
Comment 29 Rich 2020-05-22 15:43:03 UTC
(In reply to Clemens Eisserer from comment #25)
> it seems the processor was fine - the new 3700x crashed today the same way
> the old one did:
> 
> [    0.292661] .... node  #0, CPUs:        #1  #2  #3  #4  #5  #6  #7  #8 
> #9 #10
> [    0.303677] mce: [Hardware Error]: Machine check events logged
> [    0.303679] mce: [Hardware Error]: CPU 10: Machine Check: 0 Bank 5:
> bea0000000000108
> [    0.303747] mce: [Hardware Error]: TSC 0 ADDR 1ffffc0a9e3c6 MISC
> d012000100000000 SYND 4d000000 IPID 500b000000000 
> [    0.304662] mce: [Hardware Error]: PROCESSOR 2:870f10 TIME 1589369644
> SOCKET 0 APIC 5 microcode 8701013
> [    0.304736]  #11 #12 #13 #14 #15


Hi Clemens, 
Yes, this is the same failure: the thread on Linux's CPU 10 is no longer executing.

Since this is the second CPU in the same setup, I would suspect power delivery. Power management state changes are typically what bring out power delivery issues.

My suggestion would be to turn off all power management in the OS (force P0 as the only state), the CPU (in BIOS setup options; this is vendor dependent), and the GPU (with Linux parameters like amdgpu.ppfeaturemask=0xffffbffb and amdgpu.dpm=0).

Also, the behavior of the CPU VR can be clamped through the AMD OC (overclocking) BIOS setup option. I would peg the frequency and voltage of the part:
choose the base frequency of the installed CPU and set the voltage to 1400 (for 1.4 V). This is more than adequate to run at base frequency on all cores.
Also, logging CPU voltages and temperature through a failure event might show whether the CPU VR/power delivery is going out of regulation.
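
A minimal Python sketch of such logging is shown below, assuming the sensors are exposed through the Linux hwmon sysfs interface (temp*_input files reporting millidegrees Celsius). Whether CPU voltages are available the same way depends on the board's sensor driver, so only temperatures are read here; the function names and the log path in the usage note are hypothetical.

```python
import glob
import os
import time


def read_hwmon_temps(root="/sys/class/hwmon"):
    """Return {sysfs path: temperature in degrees C} for all temp inputs."""
    readings = {}
    pattern = os.path.join(root, "hwmon*", "temp*_input")
    for path in sorted(glob.glob(pattern)):
        try:
            with open(path) as f:
                # hwmon reports temperatures in millidegrees Celsius.
                readings[path] = int(f.read().strip()) / 1000.0
        except (OSError, ValueError):
            continue  # sensor unreadable right now; skip it
    return readings


def log_temps(logfile, interval=1.0, samples=10, root="/sys/class/hwmon"):
    """Append timestamped readings, syncing to disk after every sample."""
    with open(logfile, "a") as out:
        for _ in range(samples):
            stamp = time.strftime("%Y-%m-%d %H:%M:%S")
            for path, temp in read_hwmon_temps(root).items():
                out.write(f"{stamp} {path} {temp:.3f}\n")
            out.flush()
            os.fsync(out.fileno())  # so the last lines survive a reset
            time.sleep(interval)
```

For example, log_temps("/var/tmp/temps.log", interval=1.0, samples=86400) would log once per second for a day; the fsync after each sample is there so the final readings before a spontaneous reboot are not lost in the page cache.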
Comment 30 Rich 2020-05-22 16:21:07 UTC
(In reply to joel_damiano from comment #27)
> Created attachment 289219 [details]
> Logs for debugging Machine Check crash

MCE Bank 5 Status: bea0000000000108 means the thread stopped executing and hung.
3 threads hung in the kernel
3 threads hung in a user space app.

Fails using glmark2 .

This doesn't present like a CPU power delivery or CPU power management problem.
If a read to the video card doesn't return data within the timeout period, this MCE will be present. This is the last-resort catch-all error: there are usually other faults in the system that occur, but this one always gets logged.

>> The only one which has helped was to replace 'nomodeset' with 'nowatchdog',
>> which leaves the system somewhat stable.
>> However, if I try to use a program which uses graphics acceleration
>> (glmark2), the system promptly crashes.

I would try another video card or put the video card in another PCIe slot (one closer to the system power supply) and see if that modulates the failure rate.
Comment 31 joel_damiano 2020-05-23 05:10:31 UTC
I've already tried using the second PCIe slot, with no change in the system's behavior. Unfortunately, I don't have another video card to try. I did try appending amdgpu.ppfeaturemask=0xffffbffb and amdgpu.dpm=0 to the boot line as suggested by Rich, and that seems to have done the trick. I was able to run glmark2 without any problem. Not totally conclusive given the limited testing, but it is certainly a big step forward. Since I'm not an expert on this subject, I'd be curious to know what these parameters do and if there are any downsides to using them.
Comment 32 Vitalii 2020-05-23 12:08:23 UTC
Hi, I have the same problem on
AMD Ryzen 9 3900X
Video: Radeon HD 7850
PCI-E: Intel 82574L NIC
SATA disks only

PPT on CPU is limited to 85W and it's running Folding@Home almost all the time while system is on. Normally my desktop is running Openbox (no fancy desktop effects), and it's stable unless it's running some game (with or without Folding@Home in background).

AoW3 quite reliably crashed the system with the radeon GPU driver but seems to be fine with the amdgpu driver; Euro Truck Simulator 2, however, crashed with amdgpu. I haven't tried changing amdgpu.ppfeaturemask yet.

The above was happening on latest BIOS with AGESA 1.0.0.4 B. There are reports that some PCI-E cards don't work properly (Creative sound cards, Ethernet cards have issues with link somehow). So I downgraded to an older BIOS with AGESA 1.0.0.3 ABBA, can't tell if it helped yet, but output from lspci is a bit different for most devices in areas related to error reporting (and handling?). In either case CPU microcode is the same 0x08701013. If it'll crash again, I'll try changing something else...

Thanks
Comment 33 Rich 2020-05-23 15:35:38 UTC
(In reply to joel_damiano from comment #31)
> I've already tried using the second PCIe slot, with no change in the
> system's behavior. Unfortunately, I don't have another video card to try. I
> did try appending amdgpu.ppfeaturemask=0xffffbffb and amdgpu.dpm=0 to the
> boot line as suggested by Rich, and that seems to have done the trick. I was
> able to run glmark2 without any problem. Not totally conclusive given the
> limited testing, but it is certainly a big step forward. Since I'm not an
> expert on this subject, I'd be curious to know what these parameters do and
> if there are any downsides to using them.

amdgpu.ppfeaturemask and amdgpu.dpm turn various features of the AMD GPU on and off.

amdgpu.dpm is the override for dynamic power management: 1 enables it and 0 disables it.

There is a lot of detail to review here: https://dri.freedesktop.org/docs/drm/gpu/amdgpu.html

The settings I gave are the recommended settings for RX 480 and RX 550 video cards to turn off the power management features of the video card.

For your OS installation, find the enum PP_FEATURE_MASK, which (I think, because I don't have a Fedora install) lives in one of the following places:
drivers/gpu/drm/amd/powerplay/inc/hwmgr.h
drivers/gpu/drm/amd/include/amd_shared.h

My take is you'll find this enum:

enum PP_FEATURE_MASK {
	PP_SCLK_DPM_MASK = 0x1,
	PP_MCLK_DPM_MASK = 0x2,
	PP_PCIE_DPM_MASK = 0x4,
	PP_SCLK_DEEP_SLEEP_MASK = 0x8,
	PP_POWER_CONTAINMENT_MASK = 0x10,
	PP_UVD_HANDSHAKE_MASK = 0x20,
	PP_SMC_VOLTAGE_CONTROL_MASK = 0x40,
	PP_VBI_TIME_SUPPORT_MASK = 0x80,
	PP_ULV_MASK = 0x100,
	PP_ENABLE_GFX_CG_THRU_SMU = 0x200,
	PP_CLOCK_STRETCH_MASK = 0x400,
	PP_OD_FUZZY_FAN_CONTROL_MASK = 0x800,
	PP_SOCCLK_DPM_MASK = 0x1000,
	PP_DCEFCLK_DPM_MASK = 0x2000,
	PP_OVERDRIVE_MASK = 0x4000,           
	PP_GFXOFF_MASK = 0x8000,
	PP_ACG_MASK = 0x10000,
	PP_STUTTER_MODE = 0x20000,
	PP_AVFS_MASK = 0x40000,
};

Going with amdgpu.ppfeaturemask=0xffffbffb sets the following:

PP_SCLK_DPM_MASK             = 1
PP_MCLK_DPM_MASK             = 1
PP_PCIE_DPM_MASK             = 0   This is PCIe Dynamic Power Management...which we override to off with the other parameter
PP_SCLK_DEEP_SLEEP_MASK      = 1
PP_POWER_CONTAINMENT_MASK    = 1
PP_UVD_HANDSHAKE_MASK        = 1
PP_SMC_VOLTAGE_CONTROL_MASK  = 1
PP_VBI_TIME_SUPPORT_MASK     = 1
PP_ULV_MASK                  = 1
PP_ENABLE_GFX_CG_THRU_SMU    = 1
PP_CLOCK_STRETCH_MASK        = 1
PP_OD_FUZZY_FAN_CONTROL_MASK = 1
PP_SOCCLK_DPM_MASK           = 1
PP_DCEFCLK_DPM_MASK          = 1
PP_OVERDRIVE_MASK            = 0   for higher frequency operation/overclocking
PP_GFXOFF_MASK               = 1
PP_ACG_MASK                  = 1
PP_STUTTER_MODE              = 1
PP_AVFS_MASK                 = 1
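
The bit table above can be reproduced for any mask with a short Python sketch. The bit names are copied from the enum quoted in this comment; the enum may differ between kernel versions, so treat the mapping as an assumption for the kernel in question. The helper name decode_ppfeaturemask is hypothetical.

```python
# Bit values copied from the PP_FEATURE_MASK enum quoted above.
PP_FEATURE_BITS = {
    0x1: "PP_SCLK_DPM_MASK",
    0x2: "PP_MCLK_DPM_MASK",
    0x4: "PP_PCIE_DPM_MASK",
    0x8: "PP_SCLK_DEEP_SLEEP_MASK",
    0x10: "PP_POWER_CONTAINMENT_MASK",
    0x20: "PP_UVD_HANDSHAKE_MASK",
    0x40: "PP_SMC_VOLTAGE_CONTROL_MASK",
    0x80: "PP_VBI_TIME_SUPPORT_MASK",
    0x100: "PP_ULV_MASK",
    0x200: "PP_ENABLE_GFX_CG_THRU_SMU",
    0x400: "PP_CLOCK_STRETCH_MASK",
    0x800: "PP_OD_FUZZY_FAN_CONTROL_MASK",
    0x1000: "PP_SOCCLK_DPM_MASK",
    0x2000: "PP_DCEFCLK_DPM_MASK",
    0x4000: "PP_OVERDRIVE_MASK",
    0x8000: "PP_GFXOFF_MASK",
    0x10000: "PP_ACG_MASK",
    0x20000: "PP_STUTTER_MODE",
    0x40000: "PP_AVFS_MASK",
}


def decode_ppfeaturemask(mask):
    """Return (enabled, disabled) feature-name lists for a mask value."""
    enabled, disabled = [], []
    for bit, name in sorted(PP_FEATURE_BITS.items()):
        (enabled if mask & bit else disabled).append(name)
    return enabled, disabled


if __name__ == "__main__":
    for mask in (0xffffbffb, 0xfffd7fff):
        _, off = decode_ppfeaturemask(mask)
        print(f"{mask:#010x}: disabled = {', '.join(off)}")
```

With this mapping, 0xffffbffb clears only PP_PCIE_DPM_MASK and PP_OVERDRIVE_MASK, as listed above, while the 0xfffd7fff mask mentioned earlier in this thread clears PP_GFXOFF_MASK and PP_STUTTER_MODE instead.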


some explanations are here:
https://www.kernel.org/doc/html/v4.20/gpu/drivers.html
https://wiki.archlinux.org/index.php/Kernel_parameters


I turn power management off on my productivity systems, mostly because I don't want the entry/exit latency and the added stress on voltage regulators/caps/inductors/system power supply that comes with power management.
The AC power usage measured at the AC outlet for my entire system rarely exceeds 100W. Ryzen 3000 series 105W products are incredibly power efficient.
I favor performance over power savings on the machines I use to do work.
Comment 34 Rich 2020-05-24 17:45:26 UTC
(In reply to someguy108 from comment #12)
> Hello! Like Clemens, I've been having a similar issue with spontaneous
> reboots.
> This is my configuration:
> -Ryzen 3900x + Noctua D15
> -MSI X570 Unify (latest agesa as of writing)
> -DDR4 3200mhz 32GB kit
> -Sapphire Pulse 5700 XT
> -Corsair RMX 850 Watt
> -Arch Linux with kernel 5.5.13
> -Mesa 20.0.3
> -Early KMS enabled
> 
> I've had this system up and running since November 2019 but initially with a
> Nvidia 1060 and Windows 10. Everything was running smoothly. About a month
> ago I switched back over to Linux after purchasing my 5700 XT as my initial
> plan was to go back to Linux. Since returning I've experienced multiple
> spontaneous MCE reboots. All happened while I was playing one particular
> game, Warcraft 3 Reforged. The MCE event is the following:
> 
> kernel: mce: [Hardware Error]: Machine check events logged
> kernel: mce: [Hardware Error]: CPU 1: Machine Check: 0 Bank 5:
> bea0000000000108
> kernel: mce: [Hardware Error]: TSC 0 ADDR 1ffffad66d6fe MISC
> d012000100000000 SYND 4d000000 IPID 500b000000000
> kernel: mce: [Hardware Error]: PROCESSOR 2:870f10 TIME 1585120217 SOCKET 0
> APIC 2 microcode 8701013
> kernel: #2 #3 #4 #5 #6 #7 #8 #9 #10 #11 #12 #13 #14 #15
> kernel: mce: [Hardware Error]: Machine check events logged
> kernel: mce: [Hardware Error]: CPU 15: Machine Check: 0 Bank 5:
> bea0000000000108
> kernel: mce: [Hardware Error]: TSC 0 ADDR 1ffffc1196eb6 MISC
> d012000100000000 SYND 4d000000 IPID 500b000000000
> kernel: mce: [Hardware Error]: PROCESSOR 2:870f10 TIME 1585120217 SOCKET 0
> APIC 9 microcode 8701013
> kernel: #16 #17 #18 #19 #20 #21 #22 #23
> 
> Initially I figured it could be RAM, so I ran the usual tests with no
> problems. I also tested with standard JEDEC settings and eventually received
> an MCE during Warcraft 3 Reforged. After consulting with a few friends I
> decided to try a different power supply, to no avail. I then bit the bullet
> and bought a brand new 3900x. I also cleared CMOS before and after getting
> my new 3900x. All CPU values are on auto with no PBO or manual
> overclocking. The only non-default setting is XMP for the RAM. Yesterday,
> after owning the new 3900x for three days, I had an MCE while I was playing
> Warcraft 3 Reforged. I have tested other games (World of Warcraft, The Outer
> Worlds, Stellaris, and Counter-Strike: Global Offensive), but none of them
> caused an MCE or any crashes or freezes for that matter.
> 
> One thing to note is that I haven't received it during desktop usage, only
> in Warcraft 3. I have always had desktop compositing disabled in both Xfce
> and KDE. I used and tested both, and received the MCEs in sessions under
> each.
> 
> I have noticed a pattern with the MCE crashes in Warcraft 3. They always
> happen during a transition where GPU load drops off or increases, for
> example when exiting a match to return to the lobby, or when a map finishes
> loading and the game switches from the loading screen to the match itself.
> 
> The entire screen quickly turns black, everything is hard locked, and then
> after about a minute or so the machine reboots on its own. It hasn't
> happened yet in the middle of a match session, while sitting in the lobby,
> or at the main menu screen. It has consistently been during a transition.
> 
> My theory is that this could be a GPU hang caused by switching from one
> power state to another. The hanging GPU causes the CPU to stall, and thus an
> MCE. The GPU hanging could also explain the quick solid black screen, as all
> output is stopped. But I'm really just guessing here from my own
> observations and my very limited understanding. A possible reason why this
> triggers in Warcraft 3 is that the other games have few moments of heavy
> power state switching. The Outer Worlds, World of Warcraft, Stellaris, and
> Counter-Strike: Global Offensive all keep a constant high load on the GPU
> and the match sessions are long.
> 
> For what it's worth, I've had no major issues in Windows 10. The only quirks
> were initially a few TDRs that recovered when alt-tabbing out of most games
> with Google Chrome running. Disabling hardware acceleration in Chrome fixed
> those TDRs.
> 
> I've also used both 3900x's to compile things like Chromium and other large
> projects that take hours, perfectly fine. On the Windows side of things I've
> also done extensive stress testing with Prime95 and Aida64, along with long
> gaming sessions in Battlefield V, which utilizes AVX instructions and puts a
> load on all 24 threads.
> 
> From searching, I've found quite a few reports of people talking about
> receiving MCE's that isn't the typical first generation MCE's reports from
> 2017 involving Ryzen. Where those where fixed by disabling c-states, ram,
> and changing power supply current from low to typical. These ones within the
> past year appear to all have a AMD GPU in common. I did notice a few with
> Intel CPU's as well paired up with a AMD GPU.
> 
> Any feedback would be greatly appreciated.

Hi Someguy,

Usually when a system goes from stable to unstable, it's the last change made that induced the problem.
Changing the video card brings with it the video driver/OS revision problem...
I would suggest updating Windows to its latest update level and updating the video driver to its latest revision.

Turning off the video cards power management features is another thing to try.

https://community.amd.com/external-link.jspa?url=https%3A%2F%2Fwww.amd.com%2Fen%2Fsupport%2Fdriverhelp

Installing the Radeon software on Windows : https://www.amd.com/en/support/kb/faq/rsx-install


>> Yesterday, after owning the new 3900x for three days, I had a MCE while I
>> was playing Warcraft 3 Reforged. 
Same MCE? Bank 5: bea0000000000108
>> They always happen during a GPU load drop off or increase transition. 


This points to the video card's power management features and possibly its PCIe link power management (ASPM, L1, L1 substates (L1ss), etc.). Turn them off.


>>Disabling hardware acceleration in Chrome fixed those TDR's while alt-tabing
>>out of games. 

Thanks for mentioning this one. I've been battling the TDR BSOD 0x116 on my laptop.
Comment 35 Clemens Eisserer 2020-05-25 06:15:42 UTC
I've now also plugged the GPU into a different PCIe slot + set nodpm/feature mask - only time will tell.

What intrigues me is the fact the system is rock solid running Windows-10, I haven't had a reboot/bsod in months. Maybe the windows drivers contain work-arounds/quirk-handling not present in their linux counterparts...
Comment 36 joel_damiano 2020-05-28 04:36:08 UTC
I did some experimentation with kernel boot parameters. With both amdgpu.ppfeaturemask=0xffffbffb and amdgpu.dpm=0, the system was stable. With only amdgpu.dpm=0, the system was also stable. amdgpu.ppfeaturemask=0xffffbffb without amdgpu.dpm=0 would cause an immediate crash when running glmark2. I don't know if this is of any help in tracking down the problem, but I thought you might find it interesting.
Comment 37 Rich 2020-05-28 09:41:41 UTC
(In reply to joel_damiano from comment #36)
> I did some experimentation with kernel boot parameters. With both
> amdgpu.ppfeaturemask=0xffffbffb and amdgpu.dpm=0, the system was stable.
> With only amdgpu.dpm=0, the system was also stable.
> amdgpu.ppfeaturemask=0xffffbffb without amdgpu.dpm=0 would cause an
> immediate crash when running glmark2. I don't know if this is of any help in
> tracking down the problem, but I thought you might find it interesting.

Hi Joel,

My take is that one more experiment would narrow this down to a single change: disable DPM for PCIe only.

Going with amdgpu.ppfeaturemask=0xffffbfff (without amdgpu.dpm=0) sets the following:

PP_SCLK_DPM_MASK             = 1
PP_MCLK_DPM_MASK             = 1
PP_PCIE_DPM_MASK             = 0   This is PCIe Dynamic Power Management.
PP_SCLK_DEEP_SLEEP_MASK      = 1
PP_POWER_CONTAINMENT_MASK    = 1
PP_UVD_HANDSHAKE_MASK        = 1
PP_SMC_VOLTAGE_CONTROL_MASK  = 1
PP_VBI_TIME_SUPPORT_MASK     = 1
PP_ULV_MASK                  = 1
PP_ENABLE_GFX_CG_THRU_SMU    = 1
PP_CLOCK_STRETCH_MASK        = 1
PP_OD_FUZZY_FAN_CONTROL_MASK = 1
PP_SOCCLK_DPM_MASK           = 1
PP_DCEFCLK_DPM_MASK          = 1
PP_OVERDRIVE_MASK            = 1    
PP_GFXOFF_MASK               = 1
PP_ACG_MASK                  = 1
PP_STUTTER_MODE              = 1
PP_AVFS_MASK                 = 1

Thanks for going through this,
Rich
Comment 38 Rich 2020-05-28 10:06:33 UTC
(In reply to Roman C. from comment #28)
> I can confirm the reboots with a similar hardware setup:
> 
> AMD Ryzen 9 3950X
> Gigabyte X570 Aorus Elite (latest BIOS F12f, microrcode 0x08701013)
> PowerColor Radeon RX 5700 XT Red Devil 8GB
> 4x Samsung M378A4G43MB1-CTD DDR4-2666
> 
> I use kernel 5.6.11 and used different kernels to hope for improvement. Also
> try different BIOS setups, with no effect.
> 
> On my system the error happens on different cores, but I don't think this
> matters.
> 
> My observation is that the reboots happen in a situation with low load on
> the system, but I can`t reproduce the error with an behaviour.
> Sometimes I can work 8 hours without a reboot and sometimes its reboots
> within the first 30 minutes.

Hi Roman,
Are there any files in /var/log/ we can look at?
Comment 39 Vitalii 2020-05-28 16:26:57 UTC
(In reply to Vitalii from comment #32)
> AMD Ryzen 9 3900X
> Video: Radeon HD 7850
> PCI-E: Intel 82574L NIC
> SATA disks only
Forgot to add: Debian 10, stock kernel: 4.19.0-9-amd64 #1 SMP Debian 4.19.118-2 (2020-04-29) x86_64

I tried a few things, with no particular success so far.

1) Downgrading to BIOS with AGESA 1.0.0.3 ABBA changes one thing. Reboots are still there, but there are no more MCEs logged during the boot.

2) Disabling IOMMU doesn't help in my case.

3) Setting amdgpu.ppfeaturemask doesn't help in my case either. Will try with amdgpu.dpm=0, but it forces some fixed noisy FAN profile on my video card and it's a bit annoying.

4) Interestingly, doing "cat /sys/kernel/debug/dri/0/amdgpu_regs" behaves way too similar to this issue. System locks for a few seconds, screen frozen, audio loops, and then reboot follows. To confirm, I went back to the latest BIOS, and now I get MCEs too:

May 19 21:38:25 vb kernel: mce: [Hardware Error]: Machine check events logged
May 19 21:38:25 vb kernel: mce: [Hardware Error]: CPU 16: Machine Check: 0 Bank 5: bea0000000000108
May 19 21:38:25 vb kernel: mce: [Hardware Error]: TSC 0 ADDR 7fdcd4481af8 MISC d012000100000000 SYND 4d000000 IPID 500b000000000 
May 19 21:38:25 vb kernel: mce: [Hardware Error]: PROCESSOR 2:870f10 TIME 1589913499 SOCKET 0 APIC b microcode 8701013
May 19 21:38:25 vb kernel: mce: [Hardware Error]: Machine check events logged
May 19 21:38:25 vb kernel: mce: [Hardware Error]: CPU 23: Machine Check: 0 Bank 5: bea0000000000108
May 19 21:38:25 vb kernel: mce: [Hardware Error]: TSC 0 ADDR 7fdcd42c16e6 MISC d012000100000000 SYND 4d000000 IPID 500b000000000 
May 19 21:38:25 vb kernel: mce: [Hardware Error]: PROCESSOR 2:870f10 TIME 1589913499 SOCKET 0 APIC 1d microcode 8701013
    [ no MCEs on the old BIOS, but crashes were present ]
May 28 19:05:17 vb kernel: mce: [Hardware Error]: Machine check events logged
May 28 19:05:17 vb kernel: mce: [Hardware Error]: CPU 11: Machine Check: 0 Bank 5: bea0000000000108
May 28 19:05:17 vb kernel: mce: [Hardware Error]: TSC 0 ADDR 7f9500f73c68 MISC d012000100000000 SYND 4d000000 IPID 500b000000000 
May 28 19:05:17 vb kernel: mce: [Hardware Error]: PROCESSOR 2:870f10 TIME 1590681911 SOCKET 0 APIC 1c microcode 8701013
May 28 19:09:17 vb kernel: mce: [Hardware Error]: Machine check events logged
May 28 19:09:17 vb kernel: mce: [Hardware Error]: CPU 14: Machine Check: 0 Bank 5: bea0000000000108
May 28 19:09:17 vb kernel: mce: [Hardware Error]: TSC 0 ADDR 7f78c01ee226 MISC d012000100000000 SYND 4d000000 IPID 500b000000000 
May 28 19:09:17 vb kernel: mce: [Hardware Error]: PROCESSOR 2:870f10 TIME 1590682152 SOCKET 0 APIC 5 microcode 8701013

The last two MCEs were caused by "cat /sys/kernel/debug/dri/0/amdgpu_regs". Some registers are dumped (there's a garbage on a screen), then it freezes. Don't know what this means.

Thanks
Comment 40 Rich 2020-05-28 17:02:03 UTC
(In reply to Vitalii from comment #39)
> (In reply to Vitalii from comment #32)
> > AMD Ryzen 9 3900X
> > Video: Radeon HD 7850
> > PCI-E: Intel 82574L NIC
> > SATA disks only
> Forgot to add: Debian 10, stock kernel: 4.19.0-9-amd64 #1 SMP Debian
> 4.19.118-2 (2020-04-29) x86_64
> 
> I tried a few things, with no particular success so far.
> 
> 1) Downgrading to BIOS with AGESA 1.0.0.3 ABBA changes one thing. Reboots
> are still there, but there are no more MCEs logged during the boot.
> 
> 2) Disabling IOMMU doesn't help in my case.
> 
> 3) Setting amdgpu.ppfeaturemask doesn't help in my case either. Will try
> with amdgpu.dpm=0, but it forces some fixed noisy FAN profile on my video
> card and it's a bit annoying.
> 
> 4) Interestingly, doing "cat /sys/kernel/debug/dri/0/amdgpu_regs" behaves
> way too similar to this issue. System locks for a few seconds, screen
> frozen, audio loops, and then reboot follows. To confirm, I went back to the
> latest BIOS, and now I get MCEs too:
> 
> May 19 21:38:25 vb kernel: mce: [Hardware Error]: Machine check events logged
> May 19 21:38:25 vb kernel: mce: [Hardware Error]: CPU 16: Machine Check: 0
> Bank 5: bea0000000000108
> May 19 21:38:25 vb kernel: mce: [Hardware Error]: TSC 0 ADDR 7fdcd4481af8
> MISC d012000100000000 SYND 4d000000 IPID 500b000000000 
> May 19 21:38:25 vb kernel: mce: [Hardware Error]: PROCESSOR 2:870f10 TIME
> 1589913499 SOCKET 0 APIC b microcode 8701013
> May 19 21:38:25 vb kernel: mce: [Hardware Error]: Machine check events logged
> May 19 21:38:25 vb kernel: mce: [Hardware Error]: CPU 23: Machine Check: 0
> Bank 5: bea0000000000108
> May 19 21:38:25 vb kernel: mce: [Hardware Error]: TSC 0 ADDR 7fdcd42c16e6
> MISC d012000100000000 SYND 4d000000 IPID 500b000000000 
> May 19 21:38:25 vb kernel: mce: [Hardware Error]: PROCESSOR 2:870f10 TIME
> 1589913499 SOCKET 0 APIC 1d microcode 8701013
>     [ no MCEs on the old BIOS, but crashes were present ]
> May 28 19:05:17 vb kernel: mce: [Hardware Error]: Machine check events logged
> May 28 19:05:17 vb kernel: mce: [Hardware Error]: CPU 11: Machine Check: 0
> Bank 5: bea0000000000108
> May 28 19:05:17 vb kernel: mce: [Hardware Error]: TSC 0 ADDR 7f9500f73c68
> MISC d012000100000000 SYND 4d000000 IPID 500b000000000 
> May 28 19:05:17 vb kernel: mce: [Hardware Error]: PROCESSOR 2:870f10 TIME
> 1590681911 SOCKET 0 APIC 1c microcode 8701013
> May 28 19:09:17 vb kernel: mce: [Hardware Error]: Machine check events logged
> May 28 19:09:17 vb kernel: mce: [Hardware Error]: CPU 14: Machine Check: 0
> Bank 5: bea0000000000108
> May 28 19:09:17 vb kernel: mce: [Hardware Error]: TSC 0 ADDR 7f78c01ee226
> MISC d012000100000000 SYND 4d000000 IPID 500b000000000 
> May 28 19:09:17 vb kernel: mce: [Hardware Error]: PROCESSOR 2:870f10 TIME
> 1590682152 SOCKET 0 APIC 5 microcode 8701013
> 
> The last two MCEs were caused by "cat /sys/kernel/debug/dri/0/amdgpu_regs".
> Some registers are dumped (there's a garbage on a screen), then it freezes.
> Don't know what this means.
> 
> Thanks



Garbage on the screen usually means the video controller got bad data or took a power/thermal event.

Machine Check: 0 Bank 5: bea0000000000108 means the thread is no longer executing.
Four threads hung.

Can you collect the logs from /var/log on the next reboot?

>>  "cat /sys/kernel/debug/dri/0/amdgpu_regs" behaves way too similar to this
>>  issue. System locks for a few seconds, screen frozen, audio loops, and then
>>  reboot follows.
Just reading registers does this?
The system is in bad shape. Have you tried another OS install? Another SATA drive? Another system power supply? Another motherboard? Something fundamental is wrong.

What motherboard is this?
Comment 41 Alex Deucher 2020-05-28 17:20:21 UTC
(In reply to Vitalii from comment #39)
> 
> 4) Interestingly, doing "cat /sys/kernel/debug/dri/0/amdgpu_regs" behaves
> way too similar to this issue. System locks for a few seconds, screen
> frozen, audio loops, and then reboot follows. To confirm, I went back to the
> latest BIOS, and now I get MCEs too:
> 
<snip>

> The last two MCEs were caused by "cat /sys/kernel/debug/dri/0/amdgpu_regs".
> Some registers are dumped (there's a garbage on a screen), then it freezes.
> Don't know what this means.

This is expected.  The amdgpu_regs file just provides access to the GPU's MMIO registers for debugging.  You should not access it unless you know what you are doing.
Comment 42 Vitalii 2020-05-28 17:28:42 UTC
Hi Rich,

(In reply to Rich from comment #40)
> (In reply to Vitalii from comment #39)
> > (In reply to Vitalii from comment #32)
> Machine Check: 0 Bank 5: bea0000000000108 is thread is no longer
> executing.....
> 4 threads hung
> 
> can you collect logs from \var\logs on next reboot 
Sure, I'll attach some logs.

> >>  "cat /sys/kernel/debug/dri/0/amdgpu_regs" behaves way too similar to this
> >>  issue. System locks for a few seconds, screen frozen, audio loops, and
> then
> >>  reboot follows.
> just reading registers does this??  :
> the system is in bad shape...have you tried another OS install ? another
> sata drive ? another system power supply ? another motherboard.....something
> fundamental is wrong  
Yes, just reading registers does this. And by "garbage on a screen" I meant "binary data in a terminal", sorry for the confusion. I understand it's not supposed to work like this, and that dumping the whole register region can sometimes be a bad idea, but it's interesting that it somehow causes an MCE.

Regarding the other OS, I too can say that Windows 10 works more reliably (I had 1 or 2 crashes a long time ago, which I did not investigate), and there were no recent crashes in games. The PSU, video card, and disks are old and were perfectly stable on an old Phenom X4 system (125W TDP). I understand that these components are not necessarily 100% compatible with this system, though.

> what motherboard is this?
Gigabyte X570 Gaming X

Other than that, I tried amdgpu.dpm=0 and it affects performance a lot: GL is about 8 times slower.

Thanks
Comment 43 Alex Deucher 2020-05-28 17:34:16 UTC
(In reply to Rich from comment #37)
> 
> going with  amdgpu.ppfeaturemask=0xffffbfff  (without amdgpu.dpm=0)  sets
> the following
> 
> PP_SCLK_DPM_MASK             = 1
> PP_MCLK_DPM_MASK             = 1
> PP_PCIE_DPM_MASK             = 0   This is PCIe Dynamic Power Managment..
> PP_SCLK_DEEP_SLEEP_MASK      = 1
> PP_POWER_CONTAINMENT_MASK    = 1
> PP_UVD_HANDSHAKE_MASK        = 1
> PP_SMC_VOLTAGE_CONTROL_MASK  = 1
> PP_VBI_TIME_SUPPORT_MASK     = 1
> PP_ULV_MASK                  = 1
> PP_ENABLE_GFX_CG_THRU_SMU    = 1
> PP_CLOCK_STRETCH_MASK        = 1
> PP_OD_FUZZY_FAN_CONTROL_MASK = 1
> PP_SOCCLK_DPM_MASK           = 1
> PP_DCEFCLK_DPM_MASK          = 1
> PP_OVERDRIVE_MASK            = 1    
> PP_GFXOFF_MASK               = 1
> PP_ACG_MASK                  = 1
> PP_STUTTER_MODE              = 1
> PP_AVFS_MASK                 = 1

Can you try and narrow down which feature(s) cause the problem by setting different bits in amdgpu.ppfeaturemask to disable different GPU power features?
Comment 44 Vitalii 2020-05-28 17:42:48 UTC
Created attachment 289387 [details]
kernel and X11 logs for a boot after crash when reading registers

Adding kernel and X11 logs for a boot after crash when reading registers
Comment 45 Alex Deucher 2020-05-28 17:52:03 UTC
(In reply to Vitalii from comment #42)
> 
> Other than that, I tried amdgpu.dpm=0 and it affects the performance a lot,
> GL is about 8 time slower.

Do you still get MCEs in that case?
Comment 46 Vitalii 2020-05-28 18:25:37 UTC
Hi Alex,

(In reply to Alex Deucher from comment #45)
> (In reply to Vitalii from comment #42)
> > 
> > Other than that, I tried amdgpu.dpm=0 and it affects the performance a lot,
> > GL is about 8 time slower.
> 
> Do you still get MCEs in that case?

I don't know yet. It's difficult to test because the GPU is slow; my normal test case right now is Euro Truck Simulator 2 (it usually takes 1-2 hours to trigger), and now it's unusable. I'll try to test something else, but it'll take time.

I can test dumping the registers, and an MCE is still logged. As you said, this probably has little in common with normal usage, but out of curiosity I tried "dd if=amdgpu_regs bs=4 | hexdump", and the last lines in the X11 terminal are (from a video, if I typed it correctly):
*
0012fa0 03ff 0002 0000 0000 0000 0000 0000 0000
0012fb0 0000 0000 0000 0000 0000 0000 0000 0000
*
0012ff0 0000 0000 cccc cccc 0000 0000 0000 0000
0013000 0000 0000 0000 0000 0000 0000 0000 0000
*
[freeze]

I'll get back to experiments.
Thanks
Comment 47 MrZomg 2020-05-28 20:58:16 UTC
(In reply to Rich from comment #37)
> going with  amdgpu.ppfeaturemask=0xffffbfff  (without amdgpu.dpm=0)  sets
> the following
> 
> PP_SCLK_DPM_MASK             = 1
> PP_MCLK_DPM_MASK             = 1
> PP_PCIE_DPM_MASK             = 0   This is PCIe Dynamic Power Managment..
> PP_SCLK_DEEP_SLEEP_MASK      = 1
> PP_POWER_CONTAINMENT_MASK    = 1
> PP_UVD_HANDSHAKE_MASK        = 1
> PP_SMC_VOLTAGE_CONTROL_MASK  = 1
> PP_VBI_TIME_SUPPORT_MASK     = 1
> PP_ULV_MASK                  = 1
> PP_ENABLE_GFX_CG_THRU_SMU    = 1
> PP_CLOCK_STRETCH_MASK        = 1
> PP_OD_FUZZY_FAN_CONTROL_MASK = 1
> PP_SOCCLK_DPM_MASK           = 1
> PP_DCEFCLK_DPM_MASK          = 1
> PP_OVERDRIVE_MASK            = 1    
> PP_GFXOFF_MASK               = 1
> PP_ACG_MASK                  = 1
> PP_STUTTER_MODE              = 1
> PP_AVFS_MASK                 = 1

Maybe I understand it wrong, but shouldn't the feature mask be 0xfffffffb to turn PCIe Power Management off? I am currently trying that mask.

Additionally, I want to strengthen the evidence that this problem is GPU related. My 3700X system with a B450 chipset was running fine with a 1080TI and an old R9 280X GPU. The problem surfaced the day after I replaced the R9 with a 5500XT. Because both cards are supported by the AMDGPU driver, absolutely no software change was needed. It may also be load related, as I had it happen when I closed VLC media player (I was playing around with enabling hardware acceleration for it, as it is not used by default on these cards).

Regards
Comment 48 Alex Deucher 2020-05-28 21:19:02 UTC
(In reply to MrZomg from comment #47)
> 
> Maybe i understand it wrong but shouldn't the feature mask be 0xfffffffb to
> turn PCIe Power Management off? I am currently trying that mask.

correct.
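
To make the mask discussion above easier to check, here is a minimal Python sketch that lists which features a given amdgpu.ppfeaturemask value turns off. It assumes that the n-th name in the list from comment 37 corresponds to bit n of the mask; that ordering is an assumption taken from this thread (it matches 0xfffffffb clearing the PCIe DPM bit), not something verified against a particular kernel tree.

```python
# Feature names in the order they are listed in comment 37.
# Assumption: the n-th entry corresponds to bit n of amdgpu.ppfeaturemask.
PP_FEATURES = [
    "PP_SCLK_DPM_MASK",
    "PP_MCLK_DPM_MASK",
    "PP_PCIE_DPM_MASK",
    "PP_SCLK_DEEP_SLEEP_MASK",
    "PP_POWER_CONTAINMENT_MASK",
    "PP_UVD_HANDSHAKE_MASK",
    "PP_SMC_VOLTAGE_CONTROL_MASK",
    "PP_VBI_TIME_SUPPORT_MASK",
    "PP_ULV_MASK",
    "PP_ENABLE_GFX_CG_THRU_SMU",
    "PP_CLOCK_STRETCH_MASK",
    "PP_OD_FUZZY_FAN_CONTROL_MASK",
    "PP_SOCCLK_DPM_MASK",
    "PP_DCEFCLK_DPM_MASK",
    "PP_OVERDRIVE_MASK",
    "PP_GFXOFF_MASK",
    "PP_ACG_MASK",
    "PP_STUTTER_MODE",
    "PP_AVFS_MASK",
]


def disabled_features(mask):
    """Return the listed features whose bit is cleared in the mask."""
    return [name for bit, name in enumerate(PP_FEATURES) if not (mask >> bit) & 1]


# The three masks that have come up in this thread so far.
for mask in (0xffffbfff, 0xfffffffb, 0xffffbffb):
    print(f"{mask:#010x}: disabled = {disabled_features(mask)}")
```

Under that assumption, 0xffffbfff clears only the PP_OVERDRIVE_MASK bit, 0xfffffffb clears only PP_PCIE_DPM_MASK, and 0xffffbffb clears both.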
Comment 49 joel_damiano 2020-05-29 06:02:22 UTC
Hi Rich,

I ran more experiments, all without amdgpu.dpm on the boot line.

First I tried amdgpu.ppfeaturemask=0xffffbfff, then amdgpu.ppfeaturemask=0xfffffffb. In both cases, the system crashed when I tried to run glmark2.

I then ran through all values of the last nibble of the ppfeaturemask from 0xffffbff0 through 0xffffbfff and found the following pattern: If the lower two bits were both 1, the system would crash running glmark2. If one was 1 and the other 0, the system was stable and would run glmark2 without a problem. If both were 0, the system wouldn't boot. The screen would shut off. In this case, an error would be logged in the journal:

kernel: amdgpu: probe of 0000:0e:00.0 failed with error -110

So it seems like there is some sort of interaction between the PP_SCLK_DPM_MASK and PP_MCLK_DPM_MASK features. They can't both be enabled if the system is to be stable.

Joel
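
For anyone who wants to repeat the sweep from comment 49, the following Python sketch enumerates the sixteen masks and labels each with the outcome Joel reported on his system. It only restates his observation (bit 0 as PP_SCLK_DPM_MASK, bit 1 as PP_MCLK_DPM_MASK); it is not a prediction for other hardware, and the `classify` helper is a hypothetical name used for illustration.

```python
BASE = 0xffffbff0  # upper bits as used in the sweep; only the last nibble varies


def classify(mask):
    """Outcome reported in comment 49, keyed on the two lowest mask bits."""
    sclk_dpm = mask & 0x1          # bit 0: PP_SCLK_DPM_MASK
    mclk_dpm = (mask >> 1) & 0x1   # bit 1: PP_MCLK_DPM_MASK
    if sclk_dpm and mclk_dpm:
        return "crash when running glmark2"
    if sclk_dpm or mclk_dpm:
        return "stable"
    return "no boot (amdgpu probe failed with error -110)"


for nibble in range(16):
    mask = BASE | nibble
    print(f"amdgpu.ppfeaturemask={mask:#010x}: {classify(mask)}")
```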
Comment 50 Rich 2020-05-30 14:17:53 UTC
(In reply to joel_damiano from comment #49)
> Hi Rich,
> 
> I ran more experiments, all without amdgpu.dpm on the boot line.
> 
> First I tried amdgpu.ppfeaturemask=0xffffbfff, then
> amdgpu.ppfeaturemask=0xfffffffb. In both cases, the system crashed when I
> tried to run glmark2.
> 
> I then ran through all values of the last nibble of the ppfeaturemask from
> 0xffffbff0 through 0xffffbfff and found the following pattern: If the lower
> two bits were both 1, the system would crash running glmark2. If one was 1
> and the other 0, the system was stable and would run glmark2 without a
> problem. If both were 0, the system wouldn't boot. The screen would shut
> off. In this case, an error would be logged in the journal:
> 
> kernel: amdgpu: probe of 0000:0e:00.0 failed with error -110
> 
> So it seems like there is some sort of interaction between the
> PP_SCLK_DPM_MASK and PP_MCLK_DPM_MASK features. They can't both be enabled
> if the system is to be stable.
> 
> Joel


Nice work narrowing this down!

ok so the 2 working settings are:
	bit 0 PP_SCLK_DPM_MASK  = 0
	bit 1 PP_MCLK_DPM_MASK  = 1 
    amdgpu.ppfeaturemask=0xffffbffe
or 
	bit 0 PP_SCLK_DPM_MASK  = 1 
	bit 1 PP_MCLK_DPM_MASK  = 0
    amdgpu.ppfeaturemask=0xffffbffd


from prior boot
>> journalctl dump for most recent boot:
>> -- Logs begin at Sun 2020-01-19 23:24:02 PST, end at Thu 2020-05-21 21:32:25
>> PDT. --
>> ...
>> May 21 14:19:04 joel kernel: pci 0000:0e:00.0: [1002:67ef] type 00 class
>> 0x030000

This PCIe device (bus:device.function 0e:00.0) has [VID:DID] = [1002:67ef].
VID = 1002 is the ATI video group in AMD.
DID = 67ef identifies the card as Baffin [Radeon RX 460/560D / Pro 450/455/460/555/555X/560/560X].

>> kernel: amdgpu: probe of 0000:0e:00.0 failed with error -110
Error -110 is a timeout (ETIMEDOUT), so my take on this is that the video card cannot be read: the card is hung.


>> So it seems like there is some sort of interaction between the
>> PP_SCLK_DPM_MASK and PP_MCLK_DPM_MASK features. 

Turning them both off turns off power management on both clocks, which may not be a valid configuration as enforced by the power management controller software.
SCLK is the System Clock.
MCLK is the Memory Clock.
Basically, the power management controller downshifts the clock frequency to reduce power consumption when there is no work to do.

My take is to benchmark performance between these two settings based on your usage model and then go with the higher-performance option.

	bit 0 PP_SCLK_DPM_MASK  = 0
	bit 1 PP_MCLK_DPM_MASK  = 1 
    amdgpu.ppfeaturemask=0xffffbffe
or 
	bit 0 PP_SCLK_DPM_MASK  = 1 
	bit 1 PP_MCLK_DPM_MASK  = 0
    amdgpu.ppfeaturemask=0xffffbffd
    
In both cases, while the heavy stress workload is running, it would be good to monitor the amount of heat coming off the video card, or take an IR thermometer reading of the air coming out of the card.
If there is more than a 10 °C temperature difference, I would give up the performance and choose the cooler option. Silicon that runs cooler lasts much longer.
Comment 51 Josh 2020-06-06 20:32:56 UTC
I'm having what I believe is the same problem. A newly built:

AMD 3950X 
Sapphire Nitro+ 5700 XT 8GB
MSI Unify X570 (BIOS updated to A3, latest stable)
850W Gold PSU
D15 Cooler
64GB LPX DDR4 memory

Kernel: 5.6.15-300.fc32.x86_64

It happens to me when the system resumes (at least tries to resume) from sleep. Sometimes it works and sometimes it doesn't. I'm running 3 screens from it and the two over HDMI go green and a DP one stays black.

Shortly after, the machine reboots, I see the MCE and then see it again when I log into Fedora (32, updates installed).

baa000000002010b isn't always there, but bea0000000000108 always is. 

Extract from my log:

[    0.003275] Speculative Store Bypass: Mitigation: Speculative Store Bypass disabled via prctl and seccomp
[    0.003454] Freeing SMP alternatives memory: 36K
[    0.107430] smpboot: CPU0: AMD Ryzen 9 3950X 16-Core Processor (family: 0x17, model: 0x71, stepping: 0x0)
[    0.107477] mce: [Hardware Error]: Machine check events logged
[    0.107478] mce: [Hardware Error]: CPU 0: Machine Check: 0 Bank 22: baa000000002010b
[    0.107479] mce: [Hardware Error]: TSC 0 MISC d012000100000000 SYND 4d000000 IPID 1813e17000 
[    0.107482] mce: [Hardware Error]: PROCESSOR 2:870f10 TIME 1591477649 SOCKET 0 APIC 0 microcode 8701013
[    0.107502] Performance Events: Fam17h+ core perfctr, AMD PMU driver.
[    0.107503] ... version:                0
[    0.107504] ... bit width:              48
[    0.107504] ... generic registers:      6
[    0.107504] ... value mask:             0000ffffffffffff
[    0.107504] ... max period:             00007fffffffffff
[    0.107504] ... fixed-purpose events:   0
[    0.107505] ... event mask:             000000000000003f
[    0.107531] rcu: Hierarchical SRCU implementation.
[    0.107813] NMI watchdog: Enabled. Permanently consumes one hw-PMU counter.
[    0.107943] smp: Bringing up secondary CPUs ...
[    0.107983] x86: Booting SMP configuration:
[    0.107983] .... node  #0, CPUs:        #1  #2  #3  #4  #5  #6  #7  #8  #9 #10 #11 #12 #13 #14 #15 #16 #17 #18 #19 #20 #21
[    0.134021] mce: [Hardware Error]: Machine check events logged
[    0.134022] mce: [Hardware Error]: CPU 21: Machine Check: 0 Bank 5: bea0000000000108
[    0.134025] mce: [Hardware Error]: TSC 0 ADDR 1ffffc069b1c8 MISC d012000100000000 SYND 4d000000 IPID 500b000000000 
[    0.134028] mce: [Hardware Error]: PROCESSOR 2:870f10 TIME 1591477649 SOCKET 0 APIC b microcode 8701013
[    0.134046]  #22 #23 #24 #25 #26 #27 #28 #29 #30 #31
[    0.146035] smp: Brought up 1 node, 32 CPUs
[    0.146035] smpboot: Max logical packages: 1
[    0.146036] smpboot: Total of 32 processors activated (224012.41 BogoMIPS)
[    0.149658] devtmpfs: initialized
[    0.149658] x86/mm: Memory block size: 128MB
Comment 52 Rich 2020-06-07 13:56:41 UTC
(In reply to Josh from comment #51)
> I'm having what I believe is the same problem. A newly built:
> 
> AMD 3950X 
> Sapphire Nitro+ 5700 XT 8GB
> MSI Unify X570 (BIOS updated to A3, latest stable)
> 850W Gold PSU
> D15 Cooler
> 64GB LPX DDR4 memory
> 
> Kernel: 5.6.15-300.fc32.x86_64
> 
> It happens to me when the system resumes (at least tries to resume) from
> sleep. Sometimes it works and sometimes it doesn't. I'm running 3 screens
> from it and the two over HDMI go green and a DP one stays black.
> 
> Shortly after, the machine reboots, I see the MCE and then see it again when
> I log into Fedora (32, updates installed).
> 
> baa000000002010b isn't always there, but bea0000000000108 always is. 
> 
> Extract from my log:
> 
> [    0.003275] Speculative Store Bypass: Mitigation: Speculative Store
> Bypass disabled via prctl and seccomp
> [    0.003454] Freeing SMP alternatives memory: 36K
> [    0.107430] smpboot: CPU0: AMD Ryzen 9 3950X 16-Core Processor (family:
> 0x17, model: 0x71, stepping: 0x0)
> [    0.107477] mce: [Hardware Error]: Machine check events logged
> [    0.107478] mce: [Hardware Error]: CPU 0: Machine Check: 0 Bank 22:
> baa000000002010b
> [    0.107479] mce: [Hardware Error]: TSC 0 MISC d012000100000000 SYND
> 4d000000 IPID 1813e17000 
> [    0.107482] mce: [Hardware Error]: PROCESSOR 2:870f10 TIME 1591477649
> SOCKET 0 APIC 0 microcode 8701013
> [    0.107502] Performance Events: Fam17h+ core perfctr, AMD PMU driver.
> [    0.107503] ... version:                0
> [    0.107504] ... bit width:              48
> [    0.107504] ... generic registers:      6
> [    0.107504] ... value mask:             0000ffffffffffff
> [    0.107504] ... max period:             00007fffffffffff
> [    0.107504] ... fixed-purpose events:   0
> [    0.107505] ... event mask:             000000000000003f
> [    0.107531] rcu: Hierarchical SRCU implementation.
> [    0.107813] NMI watchdog: Enabled. Permanently consumes one hw-PMU
> counter.
> [    0.107943] smp: Bringing up secondary CPUs ...
> [    0.107983] x86: Booting SMP configuration:
> [    0.107983] .... node  #0, CPUs:        #1  #2  #3  #4  #5  #6  #7  #8 
> #9 #10 #11 #12 #13 #14 #15 #16 #17 #18 #19 #20 #21
> [    0.134021] mce: [Hardware Error]: Machine check events logged
> [    0.134022] mce: [Hardware Error]: CPU 21: Machine Check: 0 Bank 5:
> bea0000000000108
> [    0.134025] mce: [Hardware Error]: TSC 0 ADDR 1ffffc069b1c8 MISC
> d012000100000000 SYND 4d000000 IPID 500b000000000 
> [    0.134028] mce: [Hardware Error]: PROCESSOR 2:870f10 TIME 1591477649
> SOCKET 0 APIC b microcode 8701013
> [    0.134046]  #22 #23 #24 #25 #26 #27 #28 #29 #30 #31
> [    0.146035] smp: Brought up 1 node, 32 CPUs
> [    0.146035] smpboot: Max logical packages: 1
> [    0.146036] smpboot: Total of 32 processors activated (224012.41 BogoMIPS)
> [    0.149658] devtmpfs: initialized
> [    0.149658] x86/mm: Memory block size: 128MB

>> [    0.107478] mce: [Hardware Error]: CPU 0: Machine Check: 0 Bank
>> 22:baa000000002010b    <== Bank 22 is NBIO  MCA_STATUS_NBIO[21:16] = 0x2:'
>> ErrEvent',
This is the logic block above the PCIe interface, and it indicates that an error occurred in a transaction to/from a PCIe link. Are there other PCIe cards installed? (You could try removing them and then attempt to induce the failure with successive sleep/wake testing.)

>> It happens to me when the system resumes (at least tries to resume) from
>> sleep
This can have lots of causes: BIOS not updated, a power management issue, a power delivery issue, a PCIe card that doesn't resume properly...

Have you tried updating the video card's driver and VBIOS?
Have you tried the various amdgpu.ppfeaturemask settings just to narrow this down?
Have you tried S4 (hibernate) instead of S3 (sleep)?
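
As a rough aid for reading the two status values in this comment, here is a small Python sketch that splits a machine-check status word into its top flag bits, the [21:16] field referred to above, and the low 16-bit error code. The flag names and positions (bits 63 down to 57) follow the architectural x86 machine-check status layout; what the extended code means for a given bank is not decoded here, and the `decode_status` helper is a hypothetical name.

```python
# Architectural flag bits of an x86 machine-check status register.
FLAGS = {
    63: "VAL",    # register contents are valid
    62: "OVER",   # an earlier error was overwritten
    61: "UC",     # uncorrected error
    60: "EN",     # error reporting was enabled
    59: "MISCV",  # MISC register is valid
    58: "ADDRV",  # ADDR register is valid
    57: "PCC",    # processor context corrupt
}


def decode_status(status):
    """Split a 64-bit status value into flags and error-code fields."""
    return {
        "flags": [name for bit, name in sorted(FLAGS.items(), reverse=True)
                  if (status >> bit) & 1],
        "ext_error_code": (status >> 16) & 0x3F,  # bits [21:16]
        "error_code": status & 0xFFFF,            # bits [15:0]
    }


for bank, status in ((5, 0xbea0000000000108), (22, 0xbaa000000002010b)):
    d = decode_status(status)
    print(f"Bank {bank}: flags={d['flags']} "
          f"ext={d['ext_error_code']:#x} code={d['error_code']:#06x}")
```

For the Bank 22 value this gives an extended code of 0x2 with ADDRV clear, which fits the log line above carrying no ADDR field; the Bank 5 value has ADDRV set and an extended code of 0.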
Comment 53 Alex Deucher 2020-06-10 15:46:09 UTC
hmmm... I vaguely recall the core kernel pci function pcie_bandwidth_available() causing problems on some platforms.  Does avoiding that call in the driver help?  You can force the pcie gen and lanes via module parameters.  E.g., append
amdgpu.pcie_gen_cap=0x00070007 amdgpu.pcie_lane_cap=0x00ff0000
to the kernel command line in grub which will force pcie gen3 and 16 lanes.
Comment 54 MrZomg 2020-06-11 08:06:16 UTC
Just wanted to give you guys a status update from my side:

a) I haven't had a crash with amdgpu.ppfeaturemask=0xffffbfff yet, so this seems to work for me
b) since the weekend I've upgraded to kernel 5.7.0

I'll now try without the ppfeaturemask again to see if the problem reappears.

@Alex Deucher: Maybe it's of interest that I've been running the card in a chipset PCIe 2.0 x4 slot the whole time.
Comment 55 Alex Deucher 2020-06-11 13:13:58 UTC
(In reply to MrZomg from comment #54)
> Just wanted to give you guys a status update from my side:
> 
> a) i didn't have crash with amdgpu.ppfeaturemask=0xffffbfff yet, so this
> seems to work for me
>

Are you sure?  0xffffbfff is the default setting.
Comment 56 MrZomg 2020-06-11 14:15:30 UTC
(In reply to Alex Deucher from comment #55)
> (In reply to MrZomg from comment #54)
> > Just wanted to give you guys a status update from my side:
> > 
> > a) i didn't have crash with amdgpu.ppfeaturemask=0xffffbfff yet, so this
> > seems to work for me
> >
> 
> Are you sure?  0xffffbfff is the default setting.

Nope. Sorry. I've been using amdgpu.ppfeaturemask=0xfffffffb.
Comment 57 Roman C. 2020-06-15 20:30:26 UTC
As brief feedback: my Radeon RX 5700 runs stably with amdgpu.ppfeaturemask=0xffffbffd.
I couldn't test many different settings, because I don't have an easily reproducible scenario that triggers the error. But the setting has now worked for many days and hours.

It also worked with amdgpu.ppfeaturemask=0xffffbffb and amdgpu.dpm=0 but used 95 to 60 watt.

Thanks for your support! This behaviour was really annoying.
Comment 58 Alex Deucher 2020-06-15 20:34:42 UTC
(In reply to Roman C. from comment #57)
> As short feedback my Radeon RX 5700 works stable with
> amdgpu.ppfeaturemask=0xffffbffd.
> I couldn't test many different settings, because I don't have an easy
> reproducible scenario to cause the error. But the setting works now for many
> days and hours.
> 
> It also worked with amdgpu.ppfeaturemask=0xffffbffb and amdgpu.dpm=0 but
> used 95 to 60 watt.

Setting dpm=0 disables all GPU power management so the ppfeaturemask is ignored in that case.
Comment 59 Paul Menzel 2020-06-15 20:44:55 UTC
For the record, bug 206487 [1] (AMD Ryzen: Random freezes/crashes with enabled C-State C6) is about the same problem, and it happens with all the Dell OptiPlex 5055 machines we have here that run GNU/Linux. No problems are reported when they run Microsoft Windows 10. Unfortunately, it’s hard to reproduce. We are going to try the suggestions, and report back in the other bug report.

[1]: https://bugzilla.kernel.org/show_bug.cgi?id=206487
Comment 60 Paul Menzel 2020-06-15 20:47:48 UTC
(In reply to Clemens Eisserer from comment #35)
> I've now also plugged the GPU into a different PCIe slot + set nodpm/feature
> mask - only time will tell.

Clemens, what is the status after testing this for three weeks? We are anxious to know.
Comment 61 Josh 2020-06-16 06:36:01 UTC
(In reply to Rich from comment #52)

Sorry for the delay in getting back to you. I switched from Fedora 32 to Pop OS (unrelated to this problem). I was getting lots of display issues on the default kernel (5.3 I think). I updated manually to 5.7.1, which stopped 99% of the issues. I've been running that for just over a week or so and essentially:

Sleeping is hit and miss. Sometimes it works, sometimes it doesn't.
Sometimes it goes to sleep fine and wakes up fine, sometimes it sleeps fine but reboots on resume. Sometimes it failed to go to sleep at all and just seemed unresponsive and required a reset.

It also struggled with locking the screens; that would sometimes cause it to reboot. I also had an issue where the screens would just sit on black (powered on). That was solved by turning off auto detect input, I think.

Another time, the fans on the GPU kept spooling up and down and then eventually the PC reset itself.

I also got green flashes on the two HDMI output monitors (but not on the DP one) immediately after logging into Gnome. That hasn't happened for a while though.

Yesterday my entire display output froze, but audio was still playing fine. I presume if I had been able to SSH into the box and restart the window server that might have fixed it, but I couldn't and had to reset.

The reassuring thing is they all seem related to graphics now.

> (In reply to Josh from comment #51)

> This is the logic block above the PCIe interface...and indicates an error
> occurred in a transaction to/from a PCIe link.  Are there other PCIe cards
> installed? (may try to remove them and attempt to induce the failure with
> successive sleep - wake testing)

There is nothing else plugged into PCI slots, but an EVO Plus 1TB NVMe in the top M.2 slot.

> 
> >> It happens to me when the system resumes (at least tries to resume) from
> >> sleep
> this can have lots of causes....BIOS not updated, power management issue,
> power delivery issue, PCie card doesn't resume properly.....
BIOS is latest stable, PSU is good.

> Have you tried updating the video card's driver and VBIOS ?
No, I can't seem to find out how for this card.

> Have you tried the various amdgpu.ppfeaturemask settings just to narrow this
> down ?
I'm now running 5.7.1 with amdgpu.ppfeaturemask=0xffffbffd, and after a few test sleeps and screen locks nothing bad has happened so far. I'll keep you updated.

> Have you tried S4 (hibernate) instead of S3 (sleep) ?
When I tried this in Fedora 32 they had the same intermittent results.

I thought I could reproduce it in Fedora by sleeping while there was a VirtualBox machine running, but I could only trigger it once. VMs don't seem to make a difference.

One way I can now improve the likelihood that the machine will resume after sleep is to lock it manually but keep the mouse moving so the screens don't shut down, and then sleep it. It might be placebo, but it felt like it was improving things slightly.

Thank you for your help and hard work! :)
Comment 62 Clemens Eisserer 2020-06-17 14:52:20 UTC
Paul: Until now the system has been stable with dpm=0. This option leads to low GPU/memory clocks; however, that is acceptable because I actually bought the RX570 only to drive my 4k display at 60Hz.

However I haven't used the system a lot recently with Linux (mostly Windows-10 for coding on a specific project - where it was, as always, rock solid). So maybe the next reboot is just hours away ;)
Comment 63 Alex Deucher 2020-06-17 16:10:45 UTC
(In reply to Paul Menzel from comment #59)
> For the record, bug 206487 (AMD Ryzen: Random freezes/crashes with enabled
> C-State C6) is about the same problems, and it happens with all Dell
> OptiPlex 5055 we have here, which run GNU/Linux. No problems are reported
> when run with Microsoft Windows 10. Unfortunately, it’s hard to reproduce.
> We are going to try the suggestions, and report back in the other bug report.
> 
> [1]: https://bugzilla.kernel.org/show_bug.cgi?id=206487

What makes you think this bug is related?  It's a different processor, and according to your comments, messing with the GPU driver power options has no effect.
Comment 64 Jens Reimann 2020-06-17 21:00:08 UTC
I am having the same issue:

---
[    0.107886] mce: [Hardware Error]: Machine check events logged
[    0.107887] mce: [Hardware Error]: CPU 0: Machine Check: 0 Bank 22: baa000000002010b
[    0.107888] mce: [Hardware Error]: TSC 0 MISC d012000100000000 SYND 4d000000 IPID 1813e17000 
[    0.107890] mce: [Hardware Error]: PROCESSOR 2:870f10 TIME 1592426039 SOCKET 0 APIC 0 microcode 8701013
---

* Fedora 32
* AMD Ryzen 9 3900X 12-Core Processor (family: 0x17, model: 0x71, stepping: 0x0)
* AMD Radeon 570
* Samsung NVME Evo Plus
* Asus ROG X570 motherboard

Everything was stable for around two weeks. Sleep mode was working fine. I also put the system under load (except for the GPU).

Once I tried out Minecraft, putting some load on the GPU for the first time, the system reset.
Comment 65 Josh 2020-06-18 05:57:32 UTC
(In reply to Josh from comment #61)
> (In reply to Rich from comment #52)
> > Have you tried the various amdgpu.ppfeaturemask settings just to narrow
> this
> > down ?
> I'm now running 5.7.1 with amdgpu.ppfeaturemask=0xffffbffd and a few test
> sleeps  and screen locks, nothing bad has happened so far. I'll keep you
> updated.

Good news! My system seems stable on 5.7.1 with amdgpu.ppfeaturemask=0xffffbffd

It's able to sleep, sleep displays, logout and login again.

Only a couple of minor things: when it shuts down, my primary display (3440x1440) shows a flash of what I assume is a corrupt video signal on the right-hand side. I haven't measured it, but I assume the GPU is only sending out a 16:9 signal (it is displaying terminal/console output at this point during the shutdown). Not an issue for me, but it might help.

I also booted Minecraft last night, it lagged in the menu a little bit and the graphics aren't as good as I remember them in Fedora 32. But more than playable.
Comment 66 Paul Menzel 2020-06-18 14:24:08 UTC
Vitalii, if I am not mistaken, you are the only one reporting this for Linux 4.19.x, and the only one having some kind of reproducer (games). Is it possible, you are having a separate issue?
Comment 67 Vitalii 2020-06-18 15:43:38 UTC
Hi Paul, I don't know, maybe. It doesn't seem to be clear what exactly the problem is, except that it may be related to GPU power management. I have a somewhat old video card (HD 7850 - southern islands), which has an independent DPM implementation in amdgpu driver and ppfeaturemask does not affect its behavior, as far as I can see, so it's more difficult to check if it's the same issue. Other than that, I didn't have much time to investigate yet. I also never use suspend.
Thanks
Comment 68 Josh 2020-06-22 08:36:01 UTC
(In reply to Josh from comment #65)
> (In reply to Josh from comment #61)
> > (In reply to Rich from comment #52)
> > > Have you tried the various amdgpu.ppfeaturemask settings just to narrow
> > this
> > > down ?
> > I'm now running 5.7.1 with amdgpu.ppfeaturemask=0xffffbffd and a few test
> > sleeps  and screen locks, nothing bad has happened so far. I'll keep you
> > updated.
> 
> Good news! My system seems stable on 5.7.1 with
> amdgpu.ppfeaturemask=0xffffbffd
> 
> It's able to sleep, sleep displays, logout and login again.
> 
> Only a couple of minor things, when it shuts down, my primary display
> (3440x1440) has a flash of what I assume is corrupt video signal on the
> right hand side. I've not measured it but I assume the GPU is only sending
> out a 16:9 signal (it is displaying terminal/console output as this point
> during the shutdown). Not an issue for me, but it might help.
> 
> I also booted Minecraft last night, it lagged in the menu a little bit and
> the graphics aren't as good as I remember them in Fedora 32. But more than
> playable.

I may have spoken too soon. It still seems to hang when entering sleep now and again. The screens go off but the system never sleeps. Audio stops.

Upon pressing the keyboard to wake it, audio resumes and all the screens power back up, but on solid black output. When 5.8.1 is out I'll try that out. Would I still need the feature mask?
Comment 69 Paul Menzel 2020-06-30 08:22:42 UTC
I am still trying to wrap my head around the issue, and am missing some details. Clemens, it’d be great if you answered the questions below.

(In reply to Clemens Eisserer from comment #0)
> Ever since building my new PC I experience spontaneous (every week or so)
> reboots caused by machine check exceptions (always same bank and code -
> please see below). The reboots tend to happen in low-load situations (e.g.
> right after loading the desktop, or when playing youtube videos) - high load
> doesn't seem to make it worse.

Clemens, one more question. Was Linux 5.5.9, taken from the bug metadata, the earliest version you experienced this with? In your report you write that it has been happening for six months already.

[…]

> [    0.707393] mce: [Hardware Error]: Machine check events logged
> [    0.707395] mce: [Hardware Error]: CPU 10: Machine Check: 0 Bank 5:
> bea0000000000108
> [    0.707464] mce: [Hardware Error]: TSC 0 ADDR 1ffffbb03343c MISC
> d012000100000000 SYND 4d000000 IPID 500b000000000
> [    0.707540] mce: [Hardware Error]: PROCESSOR 2:870f10 TIME 1583508288
> SOCKET 0 APIC 5 microcode 8701013
> [    0.709397] mce: [Hardware Error]: Machine check events logged
> [    0.709398] mce: [Hardware Error]: CPU 12: Machine Check: 0 Bank 5:
> bea0000000000108
> [    0.709468] mce: [Hardware Error]: TSC 0 ADDR 1ffffbba3a05a MISC
> d012000100000000 SYND 4d000000 IPID 500b000000000
> [    0.709543] mce: [Hardware Error]: PROCESSOR 2:870f10 TIME 1583508288
> SOCKET 0 APIC 9 microcode 870101

I am still not sure how this MCE is related at all. The MCE is visible on the *next* reboot, right, and right before the crash no MCE is logged, right? Or are you seeing it every time after a crash/freeze?
Comment 70 Jens Reimann 2020-06-30 09:03:56 UTC
I am not sure if this is related, but I found the following when doing a `dmesg`:

```
[ 8579.583454] ata7: SATA link down (SStatus 0 SControl 300)
[ 8579.583705] ata6: SATA link down (SStatus 0 SControl 300)
[ 8579.583718] ata5: SATA link down (SStatus 0 SControl 300)
[ 8579.744431] [drm] PCIE GART of 256M enabled (table at 0x000000F400000000).
[ 8579.869852] ------------[ cut here ]------------
[ 8579.869903] WARNING: CPU: 1 PID: 17174 at drivers/gpu/drm/amd/amdgpu/../display/amdgpu_dm/amdgpu_dm.c:1761 dm_resume+0x31b/0x370 [amdgpu]
[ 8579.869903] Modules linked in: snd_seq_dummy snd_hrtimer rfcomm xt_CHECKSUM xt_MASQUERADE xt_conntrack ipt_REJECT nf_nat_tftp nft_objref nf_conntrack_tftp tun nft_fib_inet nft_fib_ipv4 nft_fib_ipv6 nft_fib nft_reject_inet nf_reject_ipv4 nf_reject_ipv6 nft_reject nft_ct nf_tables_set nft_chain_nat ip6table_nat ip6table_mangle ip6table_raw ip6table_security iptable_nat nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 libcrc32c iptable_mangle iptable_raw iptable_security ip_set nf_tables nfnetlink ip6table_filter ip6_tables iptable_filter cmac bnep sunrpc vfat fat edac_mce_amd uvcvideo kvm_amd videobuf2_vmalloc videobuf2_memops joydev videobuf2_v4l2 kvm videobuf2_common videodev iwlmvm btusb irqbypass btrtl eeepc_wmi btbcm btintel asus_wmi snd_usb_audio sparse_keymap mac80211 bluetooth video mxm_wmi wmi_bmof snd_usbmidi_lib snd_hda_codec_realtek snd_rawmidi mc cp210x snd_hda_codec_generic ledtrig_audio pcspkr snd_hda_codec_hdmi libarc4 ecdh_generic ecc snd_hda_intel sp5100_tco k10temp
[ 8579.869917]  i2c_piix4 iwlwifi snd_intel_dspcfg snd_hda_codec snd_hda_core cfg80211 snd_hwdep snd_seq snd_seq_device snd_pcm snd_timer rfkill snd soundcore acpi_cpufreq ip_tables dm_crypt amdgpu amd_iommu_v2 gpu_sched ccp ttm igb drm_kms_helper crct10dif_pclmul crc32_pclmul crc32c_intel dca r8169 i2c_algo_bit ghash_clmulni_intel drm nvme nvme_core wmi pinctrl_amd br_netfilter bridge stp llc fuse
[ 8579.869924] CPU: 1 PID: 17174 Comm: kworker/u64:13 Not tainted 5.6.19-300.fc32.x86_64 #1
[ 8579.869925] Hardware name: System manufacturer System Product Name/ROG STRIX X570-E GAMING, BIOS 1409 05/12/2020
[ 8579.869928] Workqueue: events_unbound async_run_entry_fn
[ 8579.869970] RIP: 0010:dm_resume+0x31b/0x370 [amdgpu]
[ 8579.869971] Code: 8b 83 d4 66 00 00 83 e0 03 83 f8 01 74 36 48 89 ef e8 99 ce 3d c4 31 c0 48 83 c4 18 5b 5d 41 5c 41 5d c3 0f 0b e9 40 ff ff ff <0f> 0b e9 d7 fe ff ff 89 c6 48 c7 c7 80 97 84 c0 e8 d0 0c c1 ff e9
[ 8579.869971] RSP: 0018:ffffa153ca813d38 EFLAGS: 00010202
[ 8579.869972] RAX: 0000000000000002 RBX: ffff948c831e0000 RCX: 0000000000000006
[ 8579.869972] RDX: ffff948c3dfa1800 RSI: ffff9487f8081980 RDI: ffff948c83337000
[ 8579.869973] RBP: 0000000000000000 R08: ffff948c969de278 R09: 0000000000000000
[ 8579.869973] R10: ffff948c92f27b40 R11: 00000000000000f0 R12: ffff948c969de000
[ 8579.869973] R13: ffff9488fcf1e400 R14: ffffffff853dfb1f R15: 0000000000000010
[ 8579.869974] FS:  0000000000000000(0000) GS:ffff948c9ea40000(0000) knlGS:0000000000000000
[ 8579.869974] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 8579.869975] CR2: 0000000000000000 CR3: 000000076c80a000 CR4: 0000000000340ee0
[ 8579.869975] Call Trace:
[ 8579.870006]  amdgpu_device_ip_resume_phase2+0x52/0xb0 [amdgpu]
[ 8579.870034]  ? amdgpu_device_fw_loading+0xa0/0x110 [amdgpu]
[ 8579.870061]  amdgpu_device_resume+0x80/0x2e0 [amdgpu]
[ 8579.870064]  ? pm_runtime_enable+0x59/0xb0
[ 8579.870065]  ? pci_pm_restore+0xe0/0xe0
[ 8579.870066]  dpm_run_callback+0x4f/0x140
[ 8579.870067]  device_resume+0x136/0x200
[ 8579.870067]  async_resume+0x19/0x50
[ 8579.870068]  async_run_entry_fn+0x39/0x160
[ 8579.870069]  process_one_work+0x1b4/0x380
[ 8579.870070]  worker_thread+0x53/0x3e0
[ 8579.870070]  ? process_one_work+0x380/0x380
[ 8579.870071]  kthread+0x115/0x140
[ 8579.870072]  ? __kthread_bind_mask+0x60/0x60
[ 8579.870074]  ret_from_fork+0x22/0x40
[ 8579.870076] ---[ end trace 527992a575e73b9e ]---
[ 8580.378356] [drm] Fence fallback timer expired on ring sdma0
[ 8580.882361] [drm] Fence fallback timer expired on ring sdma0
[ 8581.386352] [drm] Fence fallback timer expired on ring sdma0
[ 8581.890351] [drm] Fence fallback timer expired on ring sdma0
[ 8581.921965] [drm] UVD and UVD ENC initialized successfully.
[ 8582.022979] [drm] VCE initialized successfully.
[ 8582.031721] PM: resume devices took 2.757 seconds
[ 8582.031729] OOM killer enabled.
[ 8582.031729] Restarting tasks ... done.
[ 8582.033372] thermal thermal_zone0: failed to read out thermal zone (-61)
[ 8582.033374] PM: suspend exit
[ 8582.083510] RTL8125 2.5Gbps internal r8169-500:00: attached PHY driver [RTL8125 2.5Gbps internal] (mii_bus:phy_addr=r8169-500:00, irq=IGNORE)
[ 8582.183274] r8169 0000:05:00.0 enp5s0: Link is Down
[ 8584.727313] r8169 0000:05:00.0 enp5s0: Link is Up - 1Gbps/Full - flow control rx/tx
```
Comment 71 Rich 2020-06-30 12:50:36 UTC
(In reply to Jens Reimann from comment #64)
> I am having the same issue:
> 
> ---
> [    0.107886] mce: [Hardware Error]: Machine check events logged
> [    0.107887] mce: [Hardware Error]: CPU 0: Machine Check: 0 Bank 22:
> baa000000002010b
> [    0.107888] mce: [Hardware Error]: TSC 0 MISC d012000100000000 SYND
> 4d000000 IPID 1813e17000 
> [    0.107890] mce: [Hardware Error]: PROCESSOR 2:870f10 TIME 1592426039
> SOCKET 0 APIC 0 microcode 8701013
> ---
> 
> * Fedora 32
> * AMD Ryzen 9 3900X 12-Core Processor (family: 0x17, model: 0x71, stepping:
> 0x0)
> * AMD Radeon 570
> * Samsung NVME Evo Plus
> * Asus ROG X570 motherboard
> 
> Everything was stable for around two weeks. Sleep mode working fine. I also
> but the system under load (except for the GPU).
> 
> Once I tried out Minecraft, putting some load on the GPU for the first time,
> the system reset.

Bank 22: baa000000002010b is a stuck transaction in the path of a PCIe port off the CPU. It's hard to isolate an issue based on this alone.
Comment 72 Jens Reimann 2020-06-30 15:15:05 UTC
(In reply to Rich from comment #71)
> 
> Bank 22:baa000000002010b is a stuck transaction in the path of a PCIe port
> off the CPU......its hard to isolate an issue just based on this....

Is there any additional information I can provide?
Comment 73 Rich 2020-06-30 16:23:50 UTC
(In reply to Jens Reimann from comment #72)
> (In reply to Rich from comment #71)
> > 
> > Bank 22:baa000000002010b is a stuck transaction in the path of a PCIe port
> > off the CPU......its hard to isolate an issue just based on this....
> 
> Is there any additional information I can provide?

The complete dmesg log could have some more clues. Does Linux collect the PCIe AER (Advanced Error Reporting) error registers? Windows would collect them and deposit failure data in the event log.

Knowing what the system was doing when it crashed can help: was it idle at the desktop, or had it just launched an app that does lots of video or lots of compute? Did what was launched talk to a specific PCIe card connected to the CPU? Did the power stay up, or did it cycle DC power? Did the POST code LEDs change, and where are they at failure steady state?

If this happens a lot, then I'd look closely at the PCIe cards in the CPU PCIe slots: try playing with the power management options of the PCIe bus (for example, turn off L1 and the L1 substates, which PCIe endpoints commonly have problems with), or toggle the GPU card's power management options to see if there are any dependencies on a particular power management feature on the card itself.
Comment 74 Jens Reimann 2020-07-01 10:33:22 UTC
(In reply to Rich from comment #73)
> (In reply to Jens Reimann from comment #72)
> > (In reply to Rich from comment #71)
> > > 
> > > Bank 22:baa000000002010b is a stuck transaction in the path of a PCIe port
> > > off the CPU......its hard to isolate an issue just based on this....
> > 
> > Is there any additional information I can provide?
> 
> the  complete dmesg log could have some more clues..Does linux collect the 
> PCIe AER (Advanced Error Recovery) error registers ? Windows would collect
> them  and deposit failure data in the event log...

I don't know. I am just using Linux. I will upload the dmesg log the next time the machine resets.

> 
> What was  the system doing when it crashed can help...was it idle at
> desktop, just launched an app that has lots of video or lots of compute? did
> what was launched talk to a specific PCie card connected to the CPU ? did
> the power stay up? or did it cycle DC power? did the postcode LEDs change
> and where are they at failure steady state?

The first time I experienced this was when I was playing Minecraft for the first time on that machine. That was also the first time I put the GPU under "load". Before that, I had only put the CPU under load. Since then, I have not been playing Minecraft anymore (on this machine). All the other cases (still up to today) happened during Zoom meetings, about 5-10 minutes into the call. I am using other video conferencing software (Bluejeans, Google Meeting) without issues.

I never had any problem with a high CPU load though.

> did what was launched talk to a specific PCie card connected to the CPU ?

Don't know, how can I find that out?

> did the power stay up? or did it cycle DC power?

Don't know either. The fans kept blowing as they always do. I guess that could mean the power wasn't lost.

> did the postcode LEDs change and where are they at failure steady state?

I have no idea what that means. Sorry.

> 
> if this happens alot, then i'd look closely at the PCIe cards in the CPU
> PCie slots....playing with the power management options of the PCie bus
> (like turn off L1 and L1-substates are usually things PCIe endpoints can
> have problems with) or toggling the GPU cards power management options to
> see if there is any dependencies on a particular power management feature on
> the card itself.

I am sorry, but I am not a Kernel developer. So I don't know anything about all of this.

It happens around 1-2 times a week. Always when using Zoom. Around 5-10 minutes into the call. Both screens go gray. The system reboots. And then it works again until the next time.

I see the MCE messages after that in the dmesg log. I also saw the other warnings in the dmesg log when returning from sleep.

If you have any specific commands I should run, before or after the crash, I am happy to do that and report the results.
Comment 75 Alex Deucher 2020-07-01 16:12:52 UTC
Created attachment 290035 [details]
possible fix

For those of you with polaris GPUs, does this patch fix the issue?
Comment 76 Jens Reimann 2020-07-04 12:48:07 UTC
Created attachment 290091 [details]
dmesg log while running at high CPU load

I am not sure this is the same issue, but as it points again to the GPU, maybe it is related.

Today I had the case again that the machine didn't come back from sleep. The keyboard was powered, the screens were not, and neither was the network. I couldn't find anything afterwards in the system log.

I ran some load on the CPU for >1 hour, and during that time one of the two screens went dark. I could re-enable it using the KDE display config dialog. Please see the attached dmesg log.
Comment 77 Tiago Silva 2020-07-08 19:38:05 UTC
Hi,
I am new here!
I just created my account because I have been following the thread for a while already, and I probably have some useful information that is worth sharing.

My setup is a Ryzen 7 2700 in a B450M Aorus (firmware updated to the latest version). My video board is a RX550 PowerColor. Following this thread, I switched it to a GTX1030 Gigabyte to avoid the amdgpu device module. However, the issue of restarts when idling with MCE errors still persisted. Even at the same frequency.

I can reproduce the error just by playing some YouTube video with nothing else running. In about 20 minutes the system crashes.

The problem went away for me only when I turned off Cool'n'Quiet in the BIOS setup. I did not try the trick of changing the kernel parameter for amdgpu, so it is set to the default value.

Changing the C-state did not work out for me, but I have seen people on the internet claiming that using the "Typical currents" option was enough. Honestly, I can find this option only in the Power Source Control parameter, but changing it did not work either. The point is that with Cool'n'Quiet turned off the processor is not able to lower its frequency to save power. On Windows I can reach frequencies of 4,1GHz, which I never see on Linux.

I am not an expert myself, but if there is some additional information I can provide, please let me know. I would probably need detailed instructions on how to collect it, though.

Sincerely,
Tiago
Comment 78 busdma 2020-08-01 14:58:49 UTC
Experiencing similar issues.
OS: Arch Linux
KERNEL: 5.7.11-arch1-1
CPU: AMD Ryzen 5 3600
GPU: AMD Radeon RX 5700 XT (PowerColor Red Devil)
GPU DRIVER: 4.6 Mesa 20.1.4
RAM: 32 GB
MOTHERBOARD: MSI B450 Tomahawk MAX

amd-ucode version 20200721.2b823fc-1

For me this is really easy to reproduce: just launching the Assassin's Creed Origins game with Steam Play triggers a machine check on my PC.
I can provide more information if needed.
Comment 79 Rich 2020-08-01 16:44:42 UTC
(In reply to busdma from comment #78)
> Experiencing similar issues.
> OS: Arch Linux
> KERNEL: 5.7.11-arch1-1
> CPU: AMD Ryzen 5 3600
> GPU: AMD Radeon RX 5700 XT (PowerColor Red Devil)
> GPU DRIVER: 4.6 Mesa 20.1.4
> RAM: 32 GB
> MOTHERBOARD: MSI B450 Tomahawk MAX
> 
> amd-ucode version 20200721.2b823fc-1
> 
> To me this is really easy to reproduce, just launching the Assassin's Creed
> Origins game with steamplay machine checks my PC.
> I can provide more information if needed.

Hi,

Can you provide the machine check codes? I'll decode them.
Comment 80 busdma 2020-08-01 17:35:16 UTC
(In reply to Rich from comment #79)
> (In reply to busdma from comment #78)
> > Experiencing similar issues.
> > OS: Arch Linux
> > KERNEL: 5.7.11-arch1-1
> > CPU: AMD Ryzen 5 3600
> > GPU: AMD Radeon RX 5700 XT (PowerColor Red Devil)
> > GPU DRIVER: 4.6 Mesa 20.1.4
> > RAM: 32 GB
> > MOTHERBOARD: MSI B450 Tomahawk MAX
> > 
> > amd-ucode version 20200721.2b823fc-1
> > 
> > To me this is really easy to reproduce, just launching the Assassin's Creed
> > Origins game with steamplay machine checks my PC.
> > I can provide more information if needed.
> 
> Hi,
> 
> can you provide the Machine check codes..i'll decode them.

from journalctl:

...
mce: [Hardware Error]: Machine check events logged
mce: [Hardware Error]: CPU 0: Machine Check: 0 Bank 22: baa000000002010b
mce: [Hardware Error]: TSC 0 MISC d012000100000000 SYND 4d000000 IPID 1813e17000 
mce: [Hardware Error]: PROCESSOR 2:870f10 TIME 1595724426 SOCKET 0 APIC 0 microcode 8701021
...
smp: Bringing up secondary CPUs ...
x86: Booting SMP configuration:
.... node  #0, CPUs:        #1  #2  #3
mce: [Hardware Error]: Machine check events logged
mce: [Hardware Error]: CPU 3: Machine Check: 0 Bank 5: bea0000000000108
mce: [Hardware Error]: TSC 0 ADDR 1ffffc12a027c MISC d012000100000000 SYND 4d000000 IPID 500b000000000
mce: [Hardware Error]: PROCESSOR 2:870f10 TIME 1595724426 SOCKET 0 APIC 8 microcode 8701021
...
Comment 81 busdma 2020-08-01 21:31:10 UTC
Just got a similar crash in The Witcher 3. 

Seems like the same errors:
mce: [Hardware Error]: Machine check events logged
mce: [Hardware Error]: CPU 0: Machine Check: 0 Bank 5: bea0000000000108
mce: [Hardware Error]: TSC 0 ADDR 1ffffc0ea427c MISC d012000100000000 SYND 4d000000 IPID 500b000000000 
mce: [Hardware Error]: PROCESSOR 2:870f10 TIME 1596316891 SOCKET 0 APIC 0 microcode 8701021
mce: [Hardware Error]: Machine check events logged
mce: [Hardware Error]: CPU 0: Machine Check: 0 Bank 22: baa000000002010b
mce: [Hardware Error]: TSC 0 MISC d012000100000000 SYND 4d000000 IPID 1813e17000 
mce: [Hardware Error]: PROCESSOR 2:870f10 TIME 1596316891 SOCKET 0 APIC 0 microcode 8701021

The errors only seem to happen in certain scenes or settings in games. I never get any crashes for activities like watching Youtube, unlike some other people in this thread.
The issue I'm having might be a different one, hopefully I'm not hijacking this thread.
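
The excerpts posted in this thread all share the same line format, so they are easy to compare mechanically. Below is a minimal sketch for pulling CPU, bank and status out of journal or dmesg text. The regular expression is an assumption based only on the lines quoted in this thread, and `parse_mce` is a hypothetical helper name.

```python
import re

# Matches lines such as:
#   mce: [Hardware Error]: CPU 0: Machine Check: 0 Bank 5: bea0000000000108
MCE_LINE = re.compile(
    r"CPU (?P<cpu>\d+): Machine Check: \d+ "
    r"Bank (?P<bank>\d+):\s*(?P<status>[0-9a-f]{16})"
)


def parse_mce(log_text):
    """Return a list of (cpu, bank, status) tuples found in log_text."""
    return [
        (int(m["cpu"]), int(m["bank"]), int(m["status"], 16))
        for m in MCE_LINE.finditer(log_text)
    ]


# Sample taken from the log lines posted in this comment.
sample = """\
mce: [Hardware Error]: Machine check events logged
mce: [Hardware Error]: CPU 0: Machine Check: 0 Bank 5: bea0000000000108
mce: [Hardware Error]: TSC 0 ADDR 1ffffc0ea427c MISC d012000100000000 SYND 4d000000 IPID 500b000000000
mce: [Hardware Error]: Machine check events logged
mce: [Hardware Error]: CPU 0: Machine Check: 0 Bank 22: baa000000002010b
"""

for cpu, bank, status in parse_mce(sample):
    print(f"CPU {cpu} bank {bank} status {status:#018x}")
```

Collecting these tuples across several crashes makes it easier to see whether the bank and status really stay the same from one reboot to the next.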
Comment 82 Rich 2020-08-03 23:40:58 UTC
Hi busdma, comments 80 and 81 look like the same problem to me.

mce: [Hardware Error]: CPU 0: Machine Check: 0 Bank 22: baa000000002010b    ==>  Bank 22 is NBIO  MCA_STATUS_NBIO[21:16] = 0x2:'SDP port ErrEvent',  This implicates a PCIe slot downstream
mce: [Hardware Error]: CPU 3: Machine Check: 0 Bank 5:  bea0000000000108    ==>  Bank 5  is EX  [21:16] = 0x0  CPU WDT (watchdog timeout) ...means thread is not retiring micro-ops  in the time-out period.
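
The field extraction above can be reproduced with a few shifts. This is a minimal sketch, assuming the architectural x86 MCA_STATUS flag positions (VAL, OVER, UC, EN, MISCV, ADDRV, PCC in bits 63 down to 57) together with the [21:16] field quoted above. It does not attempt to translate the field into a bank-specific error name, and `decode_status` is a hypothetical helper.

```python
# Architectural x86 MCA_STATUS flag positions (bit number -> name).
FLAGS = {
    63: "VAL",    # register contents are valid
    62: "OVER",   # an earlier error was overwritten
    61: "UC",     # uncorrected error
    60: "EN",     # error reporting was enabled
    59: "MISCV",  # MISC register is valid
    58: "ADDRV",  # ADDR register is valid
    57: "PCC",    # processor context corrupt
}


def decode_status(status):
    """Return (list of set flag names, value of bits [21:16])."""
    set_flags = [name for bit, name in FLAGS.items() if status >> bit & 1]
    ext_code = (status >> 16) & 0x3F  # bits [21:16], as quoted above
    return set_flags, ext_code


for bank, status in ((22, 0xBAA000000002010B), (5, 0xBEA0000000000108)):
    flags, ext = decode_status(status)
    print(f"Bank {bank}: flags={'|'.join(flags)} [21:16]={ext:#x}")
```

ADDRV comes out set only for the Bank 5 value, which is consistent with the logs in this thread: the Bank 5 records carry an ADDR field and the Bank 22 records do not.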

I think you're in the same place as the others. Most likely a video card or other PCIe card has power management issues.

Have you tried walking through the various video card power management cases?
For AMD video cards:
amdgpu.ppfeaturemask=0xffffbfbf (disable voltage control)
bit 14 PP_OVERDRIVE_MASK             = 0 
bit 6  PP_SMC_VOLTAGE_CONTROL_MASK   = 0

amdgpu.ppfeaturemask=0xfffbbfff (disable AVFS)
bit 18 PP_AVFS_MASK         = 0
bit 14 PP_OVERDRIVE_MASK    = 0 

amdgpu.ppfeaturemask=0xffffbff8 (disable all DPMs)
bit 14 PP_OVERDRIVE_MASK             = 0 
bit 0  PP_SCLK_DPM_MASK              = 0
bit 1  PP_MCLK_DPM_MASK              = 0
bit 2  PP_PCIE_DPM_MASK              = 0 

amdgpu.ppfeaturemask=0xffffbffe (disable sclk dpm)
bit 14 PP_OVERDRIVE_MASK             = 0 
bit 0  PP_SCLK_DPM_MASK              = 0

amdgpu.ppfeaturemask=0xffffbffd (disable mclk dpm)
bit 14 PP_OVERDRIVE_MASK             = 0 
bit 1  PP_MCLK_DPM_MASK              = 0
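
The masks listed above are simply 0xffffffff with the named bits cleared, which can be checked mechanically. This is a minimal sketch that uses only the bit positions given in this comment; `mask_without` and `cleared_features` are hypothetical helper names, and bits not listed here are ignored.

```python
# Bit positions as listed in this comment (name -> bit number).
PP_BITS = {
    "PP_SCLK_DPM_MASK": 0,
    "PP_MCLK_DPM_MASK": 1,
    "PP_PCIE_DPM_MASK": 2,
    "PP_SMC_VOLTAGE_CONTROL_MASK": 6,
    "PP_OVERDRIVE_MASK": 14,
    "PP_AVFS_MASK": 18,
}


def mask_without(*features):
    """Start from all bits set and clear the named feature bits."""
    mask = 0xFFFFFFFF
    for name in features:
        mask &= ~(1 << PP_BITS[name])
    return mask & 0xFFFFFFFF


def cleared_features(mask):
    """Return the names (from PP_BITS only) whose bit is 0 in mask."""
    return [name for name, bit in PP_BITS.items() if not mask >> bit & 1]


# "disable mclk dpm" from the list above
print(f"{mask_without('PP_OVERDRIVE_MASK', 'PP_MCLK_DPM_MASK'):#010x}")

# Reverse direction: which listed features does a reported mask switch off?
for reported in (0xFFFFFFFB, 0xFFFFBFFB):
    print(f"{reported:#010x} clears {cleared_features(reported)}")
```

The reverse lookup is handy for the masks other people reported earlier in the thread, since it shows at a glance which of the listed features each of them actually turned off.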

For PCIe cards, we've also seen that not all of them correctly support the L1 or L1 substates PCIe link power management states.
Could you try disabling L1 on the PCIe slots in BIOS setup?
Comment 83 Alex Deucher 2020-08-04 17:01:56 UTC
For those of you with Polaris GPUs (e.g., Rx580/RX570/RX560, etc.), can you try the patch in comment 75 without any workarounds applied?
Comment 84 busdma 2020-08-09 21:37:09 UTC
(In reply to Rich from comment #82)
Hi, sorry for the late reply. I've tried your suggestions, unfortunately without good results. 

My testing method is to launch the AC Origins game.

> have you tried walking through various video card power management cases?
> for AMD video cards:
> amdgpu.ppfeaturemask=0xffffbfbf (disable voltage control)
> bit 14 PP_OVERDRIVE_MASK             = 0 
> bit 6  PP_SMC_VOLTAGE_CONTROL_MASK   = 0
No difference (machine check error)

> amdgpu.ppfeaturemask=0xfffbbfff (disable AVFS)
> bit 18 PP_AVFS_MASK         = 0
> bit 14 PP_OVERDRIVE_MASK    = 0 
No difference (machine check error)
 
> amdgpu.ppfeaturemask=0xffffbff8 (disable all DPMs)
> bit 14 PP_OVERDRIVE_MASK             = 0 
> bit 0  PP_SCLK_DPM_MASK              = 0
> bit 1  PP_MCLK_DPM_MASK              = 0
> bit 2  PP_PCIE_DPM_MASK              = 0 
Hangs during boot

> amdgpu.ppfeaturemask=0xffffbffe (disable sclk dpm)
> bit 14 PP_OVERDRIVE_MASK             = 0 
> bit 0  PP_SCLK_DPM_MASK              = 0
Hangs during boot

> amdgpu.ppfeaturemask=0xffffbffd (disable mclk dpm)
> bit 14 PP_OVERDRIVE_MASK             = 0 
> bit 1  PP_MCLK_DPM_MASK              = 0
No difference (machine check error)

> for PCIe cards we've also see not all correctly support the L1 or L1
> substates PCIe link power management state
> could try disabling L1 on PCie slots in BIOS setup ?
I found no such option in my BIOS. I have an MSI board; does the option have a specific name I could search for?
Comment 85 Rich 2020-08-10 15:56:04 UTC
(In reply to busdma from comment #84)


>>disabling L1 on PCie slots in BIOS setup 
On my AMI-based BIOS system it's under
AMD PBS -> PM L1 SS -> Disabled    (this will disable the PCIe slots' L1 substates)


What PCIe card do you have installed?
Can you collect the output of: lspci -vvv -xxxx > filesave.txt
Comment 86 busdma 2020-08-10 18:46:58 UTC
Created attachment 290823 [details]
pci devices

pci devices
sudo lspci -vvv -xxxx > pci_devs.txt
Comment 87 Thomas Langkamp 2020-08-10 21:14:50 UTC
*** Bug 208573 has been marked as a duplicate of this bug. ***
Comment 88 exeskull1 2020-08-13 10:29:37 UTC
(In reply to Tiago Silva from comment #77)
> Hi,
> I am new here!
> I just created my account because I was following the thread for a while
> already, and probably I have some useful information that worth sharing.
> 
> My setup is an Ryzen 7 2700 in a B450M Aorus (firmware updated to the last
> version). My video board is a RX550 PowerColor. Following this thread, I
> switched it to a GTX1030 Gigabyte to avoid the amdgpu device module.
> However, the issue of restarts when idling with MCE errors still persisted.
> Even at the same frequency.
> 
> I can reproduce the error just playing some YouTube video with nothing else
> running. In about 20 minutes the system crashes.
> 
> The problem was over for me only when I turned off the Cool'n'Quiet in the
> BIOS setup. I did not tried the trick of changing the kernel parameter for
> amdgpu, so it is set to the default value.
> 
> Changing the C-state did not work out for me, but I have seen people on the
> internet claiming that using the "Typical currents" option was enough.
> Honestly, I find this option only in the Power Source Control parameter, but
> changing it did not work either. The point is that with the Cool'n'Quiet
> turned off the processor is not able to slow down the frequency to save
> power. In Windows system I can reach frequencies of 4,1GHz, which I am not
> able to see in Linux.
> 
> I am not an expert myself, but if there is some additional information I can
> provide, please let me know. The only thing is that probably detailed
> instructions on how to get it should be provided.
> 
> Sincerely,
> Tiago

I just created my account to say thank you man!


>The problem was over for me only when I turned off the Cool'n'Quiet in the
>>BIOS setup. I did not tried the trick of changing the kernel parameter for
>>amdgpu, so it is set to the default value.

After I did this, all my pain was gone.

Thank you Tiago!
Sincerely,
Sasa
Comment 89 Vitalii 2020-08-13 19:16:17 UTC
Hi, the new BIOS for my MB (Gigabyte X570 Gaming X, F20, AGESA ComboV2
1.0.0.2; 3900X; Radeon HD 7850) has an option to disable "Core watchdog".
The system reboots anyway, but the screen looks slightly different
just before turning black, and MCE is different -

reboot 1
Linux version 5.8.1 (gcc (Debian 8.3.0-6) 8.3.0, GNU ld (GNU Binutils for Debian) 2.31.1) #1 SMP Thu Aug 13 13:59:53 EEST 2020
21:16:08 - kernel: mce: [Hardware Error]: Machine check events logged
21:16:08 - kernel: mce: [Hardware Error]: CPU 0: Machine Check: 0 Bank 27: baa000000000080b
21:16:08 - kernel: mce: [Hardware Error]: TSC 0 MISC d012000100000000 SYND 5d000000 IPID 1002e00000500 
21:16:08 - kernel: mce: [Hardware Error]: PROCESSOR 2:870f10 TIME 1597342518 SOCKET 0 APIC 0 microcode 8701021

reboot 2, same 5.8.1 kernel
21:18:04 - kernel: mce: [Hardware Error]: Machine check events logged
21:18:04 - kernel: mce: [Hardware Error]: CPU 0: Machine Check: 0 Bank 27: baa000000000080b
21:18:04 - kernel: mce: [Hardware Error]: TSC 0 MISC d012000200000000 SYND 5d000000 IPID 1002e00000500 
21:18:04 - kernel: mce: [Hardware Error]: PROCESSOR 2:870f10 TIME 1597342659 SOCKET 0 APIC 0 microcode 8701021

If "Core watchdog" is enabled (default), I get the same "Bank 5: bea0000000000108" on random CPUs.

Rich, is there anything interesting about this MCE?

Other than that, I found a way to reproduce the problem quickly on my
system. Two things are needed for "radeon" driver:
1) running Age of Empires III in a map mode (runs in windowed mode on
   different virtual desktop)
2) "glmark2 --run-forever -b build:use-vbo=true" (foreground task)
And it reboots in 2-15 seconds after glmark2 start. Looks like
it's important that glmark2 is using vbo for some reason.

However, it looks like it does not reboot (or a reboot is much less probable)
if I do
# echo low > /sys/class/drm/card0/device/power_dpm_force_performance_level

If AoE3 and glmark2 are running already and power_dpm_force_performance_level
is switched to "high", it reboots very quickly. In other cases (e.g.
glmark2 but no AoE3), switching between "high" and "low" every second
does not reproduce the problem reliably.
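
The once-per-second switching described above can be scripted. This is only a sketch of that toggling loop, assuming the sysfs path from the echo command earlier in this comment (card0 may be a different index on other systems); the function name and its parameters are made up for illustration, and writing to the real file needs root:

```python
import time

# Path taken from the echo command in this comment.
DPM_LEVEL_PATH = "/sys/class/drm/card0/device/power_dpm_force_performance_level"


def toggle_dpm_level(path=DPM_LEVEL_PATH, levels=("high", "low"),
                     cycles=10, interval=1.0):
    """Write the levels in rotation, one write per cycle, and return what was written."""
    written = []
    for i in range(cycles):
        level = levels[i % len(levels)]
        # Open, write, close on every cycle, the same as a shell `echo ... > file`.
        with open(path, "w") as f:
            f.write(level + "\n")
        written.append(level)
        time.sleep(interval)
    return written
```

Calling toggle_dpm_level() with the defaults alternates between "high" and "low" for ten seconds. Passing a different path makes it possible to try the loop against an ordinary file first.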

I tried disabling Cool'n'Quiet in BIOS - no significant difference.

Thanks
Comment 90 yk749 2020-09-03 01:45:13 UTC
I am getting reproducible reboots using Ubuntu 20.04/18.04, Fedora 32, etc. Basically every distro I tried keeps rebooting. I could run Ubuntu 20.04 without issues a few months ago, but since last month, whenever I boot into my Ubuntu 20.04, it reboots right at the login page. I deleted Ubuntu (I was dual booting Ubuntu 20.04 and Windows 10 Pro, no issues with Windows) and tried many other distros (using a live stick); all suffer from the same issue.

my hw configuration is:
CPU : Ryzen 3950x
GPU : RTX 2080 super
Mobo: asus x570 crosshair viii hero, bios is 2206
RAM: Corsair DOMINATOR PLATINUM 4 * 16G 

The MCE error I got is:
mce [Hardware error]: CPU 1: Machine Check: 0 Bank 7:Fea040000002010b
mce [Hardware error]: TSC 0 ADDR b6100 MISC d012003f00000000 SYND 622d1f1103 IPID 700b020b50000
mce [Hardware error]: Processor 2:870f10 TIME 1598212995 SOCKET 0 APIC 2 microcode 8701021

Can someone help me or give me some suggestions on how to debug this ? Thanks.
Comment 91 Rich 2020-09-03 01:57:47 UTC
(In reply to yk749 from comment #90)
> I am getting reproducible, reboot using Ubuntu 20.04/18.04 Fedora 32, etc.
> Basically all distros I tried will keep rebooting. I can run Ubuntu 20.04
> without issues a few months ago, but last month whenever I boot into my
> Ubuntu 20.04, it will reboot right at login page. I deleted Ubuntu (I was
> dual booting Ubuntu 20.04 and windows 10 pro, no issues with Windows) and
> tried many other distros (using live stick), all suffer from the same issue. 
> 
> my hw configuration is:
> CPU : Ryzen 3950x
> GPU : RTX 2080 super
> Mobo: asus x570 crosshair viii hero, bios is 2206
> RAM: Corsair DOMINATOR PLATINUM 4 * 16G 
> 
> The MCE error I got is:
> mce [Hardware error]: CPU 1: Machine Check: 0 Bank 7:Fea040000002010b
> mce [Hardware error]: TSC 0 ADDR b6100 MISC d012003f00000000 SYND 622d1f1103
> IPID 700b020b50000
> mce [Hardware error]: Processor 2:870f10 TIME 1598212995 SOCKET 0 APIC 2
> microcode 8701021
> 
> Can someone help me or give me some suggestions on how to debug this ?
> Thanks.

CPU1 Bank 7 is the L3 cache 
Fea040000002010b decodes to Tag Parity error

if this happens in the same bank with the same MCA_STATUS = Fea040000002010b on every failure then it points at the processor....
of course if the board's power delivery or thermal solution is out of spec we could also get here...but it would likely move to different banks..
if it's in the same place in code execution each time it could be the CPU VR....because certain places in OS boot tax the VR more than any app will...

do you happen to have another motherboard? if it fails there too...it's most likely DPM or an early life fail.
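
For readers comparing their own MCA_STATUS values, the generic part of the register can be split out with a few lines of Python. This is a rough sketch, not the decode used above: the flag names for bits 63 to 57 follow the common x86 machine-check layout, the extended error code position (bits 21:16) is an assumption about the AMD layout, and turning the codes into a name such as "Tag Parity error" needs the per-bank tables from AMD's documentation, which are not reproduced here.

```python
# Flag bits in the top byte of MCA_STATUS (common x86 machine-check layout):
# VAL = entry valid, OVER = overflow, UC = uncorrected, EN = reporting enabled,
# MISCV = MISC register valid, ADDRV = ADDR register valid,
# PCC = processor context corrupt.
STATUS_FLAGS = {
    63: "VAL", 62: "OVER", 61: "UC", 60: "EN",
    59: "MISCV", 58: "ADDRV", 57: "PCC",
}


def decode_mca_status(hex_value):
    """Split an MCA_STATUS hex string into flags and raw error codes."""
    status = int(hex_value, 16)
    flags = [name for bit, name in sorted(STATUS_FLAGS.items(), reverse=True)
             if (status >> bit) & 1]
    return {
        "flags": flags,
        "error_code_ext": (status >> 16) & 0x3F,  # assumed position: bits 21:16
        "error_code": status & 0xFFFF,            # bits 15:0
    }
```

Run on the two values in this thread, the sketch shows that Fea040000002010b has the OVER flag set while bea0000000000108 does not, and that MISCV/ADDRV match whether the kernel printed MISC and ADDR fields for a given event.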
Comment 92 yk749 2020-09-03 02:15:17 UTC
(In reply to Rich from comment #91)
> (In reply to yk749 from comment #90)
> > I am getting reproducible, reboot using Ubuntu 20.04/18.04 Fedora 32, etc.
> > Basically all distros I tried will keep rebooting. I can run Ubuntu 20.04
> > without issues a few months ago, but last month whenever I boot into my
> > Ubuntu 20.04, it will reboot right at login page. I deleted Ubuntu (I was
> > dual booting Ubuntu 20.04 and windows 10 pro, no issues with Windows) and
> > tried many other distros (using live stick), all suffer from the same
> issue. 
> > 
> > my hw configuration is:
> > CPU : Ryzen 3950x
> > GPU : RTX 2080 super
> > Mobo: asus x570 crosshair viii hero, bios is 2206
> > RAM: Corsair DOMINATOR PLATINUM 4 * 16G 
> > 
> > The MCE error I got is:
> > mce [Hardware error]: CPU 1: Machine Check: 0 Bank 7:Fea040000002010b
> > mce [Hardware error]: TSC 0 ADDR b6100 MISC d012003f00000000 SYND
> 622d1f1103
> > IPID 700b020b50000
> > mce [Hardware error]: Processor 2:870f10 TIME 1598212995 SOCKET 0 APIC 2
> > microcode 8701021
> > 
> > Can someone help me or give me some suggestions on how to debug this ?
> > Thanks.
> 
> CPU1 Bank 7 is the L3 cache 
> Fea040000002010b decodes to Tag Parity error
> 
> if this happens in the same bank with same MCA_STATUS = Fea040000002010b
> every failure then it points at the processor....
> of course if the the board's power delivery or thermal solution are out of
> spec we could also get here...but it would likely move to different banks..
> if its in the same place in code execution each time it could be the CPU
> VR....because certain places in OS boot tax the VR more than any App will...
> 
> do you happen to have another motherboard? if it fails there too...its most
> likely DPM or an early life fail.

Hello Rich, 
Thanks for your prompt reply. I just tried to boot into the Ubuntu 20.04 LTS live stick again, and got the same MCE error message:
mce [Hardware error]: CPU 1: Machine Check: 0 Bank 7:Fea040000002010b
mce [Hardware error]: TSC 0 ADDR d6100 MISC d012003f00000000 SYND 622d1f1103 IPID 700b020b50000
mce [Hardware error]: Processor 2:870f10 TIME 1599084556 SOCKET 0 APIC 2 microcode 8701021

But this time I also see something like:
do_IRQ No irq handler for vector

And:
Initramfs unpacking failed Decoding failed

I just built my new PC so I don't have a spare mobo for testing

Regards,
Yk
Comment 93 EllieTheCat 2020-09-03 06:50:06 UTC
(In reply to Rich from comment #50)

>       bit 0 PP_SCLK_DPM_MASK  = 1 
>       bit 1 PP_MCLK_DPM_MASK  = 0
>     amdgpu.ppfeaturemask=0xffffbffd


just wanted to say thank you so much for this, i have been having these seemingly random restarts since building my new computer, and this little guy right here seems to have made it stable. that said however it does significantly hurt my performance, and trying the other option you listed (amdgpu.ppfeaturemask=0xffffbffe) stopped my system from completing the boot process. got into the initial grub bootloader menu, but after trying to boot into manjaro it would hang. went back to the mclk_dpm_mask = 0 setting, but i was wondering if there's any way to like...mitigate that performance hit? i'm assuming what happens is that the VRAM runs at its base clock speed, which is going to be much lower than what it's actually capable of. however with this feature mask set i'm not entirely certain how to change what clock speed the VRAM is running at, or if that's even possible. very new to all this. any insight people can give would be absolutely lovely

first i thought about using something like corectrl, but that featuremask setting only allows for tinkering with the fan curve and power limits. i also noticed that the memory clock speed is no longer reported by mangohud. i have to assume this is all more or less intended behavior based on the featuremask being set to exclude power management for the MCLK, but that being the case i just don't know where to go from here
Comment 94 Rich 2020-09-03 12:20:05 UTC
(In reply to EllieTheCat from comment #93)
> (In reply to Rich from comment #50)
> 
> >       bit 0 PP_SCLK_DPM_MASK  = 1 
> >       bit 1 PP_MCLK_DPM_MASK  = 0
> >     amdgpu.ppfeaturemask=0xffffbffd
> 
> 
> just wanted to say thank you so much for this, i have been having these
> seemingly random restarts since building my new computer, and this little
> guy right here seems to have made it stable. that said however it does
> significantly hurt my performance, and trying the other option you listed
> (amdgpu.ppfeaturemask=0xffffbffe) stopped my system from completing the boot
> process. got into the initial grub bootloader menu, but after trying to boot
> into manjaro it would hang. went back to the mclk_dpm_mask = 0 setting, but
> i was wondering if there's any way to like...mitigate that performance hit?
> i'm assuming what happens is that the VRAM runs at its base clock speed,
> which is going to be much lower than what it's actually capable of. however
> with this feature mask set i'm not entirely certain how to change what clock
> speed the VRAM is running at, or if that's even possible. very new to all
> this. any insight people can give would be absolutely lovely
> 
> first i thought about using something like corectrl, but that featuremask
> setting only allows for tinkering with the fan curve and power limits. i
> also noticed that the memory clock speed is no longer reported by mangohud.
> i have to assume this is all more or less intended behavior based on the
> featuremask being set to exclude power management for the MCLK, but that
> being the case i just don't know where to go from here


Hi Ellie,

DPM is Dynamic Power Management...so playing with  this  option just disables some power management features...which ticks up power usage but would not affect performance...Power management typically costs performance as one trades power savings for entry and exit latencies and slower clocks.....

my preferred approach when i want to save power is turn off the machine.

Rich
Comment 95 Paul Menzel 2020-09-03 12:28:23 UTC
(In reply to Rich from comment #94)

> DPM is Dynamic Power Management...so playing with  this  option just
> disables some power management features...which ticks up power usage but
> would not affect performance...Power management typically costs performance
> as one trades power savings for entry and exit latencies and slower
> clocks.....

That statement is incorrect with the Linux kernel driver. It will run at lowest speed.

[…]
Comment 96 Alex Deucher 2020-09-03 13:09:37 UTC
I'll repeat since no one has tried it: For those of you with Polaris GPUs (e.g., RX580/RX570/RX560, etc.), can you try the patch in comment 75 without any workarounds applied?  Does that fix the issue?
Comment 97 EllieTheCat 2020-09-03 14:21:35 UTC
(In reply to Paul Menzel from comment #95)

> 
> It will run at
> lowest speed.
> 
> […]

That being the case do I have any options for more or less forcing a higher base speed? 

Also I'd like to take a moment to thank you as well for replying without any condescending attitude.
Comment 98 patrickjholloway 2020-09-18 10:37:24 UTC
I am having similar issues. Recently Ubuntu 20.04 has become unusable on my machine. I did a clean install of 18.04 and have had somewhat better stability, but even booting successfully is a crapshoot. My last boot succeeded, but did have an MCE during startup.

Ubuntu 18.04.5 LTS / Windows 10 Pro dual boot
5.4.0-47-generic
Ryzen 3600
X570 AORUS ELITE/X570 AORUS ELITE, BIOS F20b 07/02/2020
RTX2070 Super

mce: [Hardware Error]: Machine check events logged
mce: [Hardware Error]: CPU 9: Machine Check: 0 Bank 5: bea0000000000108
mce: [Hardware Error]: TSC 0 ADDR 1ffffc1d4a04c MISC d012000100000000 SYND 4d000000 IPID 500b000000000 
mce: [Hardware Error]: PROCESSOR 2:870f10 TIME 1600420000 SOCKET 0 APIC 9 microcode 8701021

Same MCE on the previous boot, but at CPU 9: Machine Check: 0 Bank 5: bea0000000000000. Everything else identical.


When I was trying to get 20.04 LTS running recently I was seeing stuff like the following (typed this in copying from pics I snapped on my phone before spontaneous reboot):

ata5.00: failed command: READ FPDMA QUEUED
ata5.00: Exception Emask 0x52 SAct 0x12004000 SErr 0xffffffff action 0xe frozen
ata5.00: cmd ... EMask 0x52 (ATA bus error)
ata6.00: failed command: READ FPDMA QUEUED

Some actual screen grabs here:
https://imgur.com/3Hy5enQ
https://imgur.com/mmXhM0h
https://imgur.com/2xgNQrQ

One weird detail is that installing Ubuntu 20 would make it so I would get unwanted reboots when selecting Windows once I got to the Windows login screen. It would do this 100% of the time and wouldn't matter if I selected Windows in grub or in the BIOS.

I haven't tried any of these types of things yet:

>       bit 0 PP_SCLK_DPM_MASK  = 1 
>       bit 1 PP_MCLK_DPM_MASK  = 0
>     amdgpu.ppfeaturemask=0xffffbffd

Sharing my experience here to maybe get some help and drill into this issue. This isn't my daily driver for work and things seem to be okay-ish on the Windows side with the 18.04 LTS dual boot. Enough that I have a stable environment to dig into logs and settings but still reproduce some of the behavior. I could use some help in that respect as my Linux experience is limited relative to some experts on here. I use Ubuntu every day at work as a software developer, but up until 6 months ago I was only using Macs.
Comment 99 Alex Deucher 2020-09-18 14:06:50 UTC
(In reply to patrickjholloway from comment #98)
> Ubuntu 18.04.5 LTS / Windows 10 Pro dual boot
> 5.4.0-47-generic
> Ryzen 3600
> X570 AORUS ELITE/X570 AORUS ELITE, BIOS F20b 07/02/2020
> RTX2070 Super

> I haven't tried any of these types of things yet:
> 
> >       bit 0 PP_SCLK_DPM_MASK  = 1 
> >       bit 1 PP_MCLK_DPM_MASK  = 0
> >     amdgpu.ppfeaturemask=0xffffbffd

These are not relevant if you are not using an AMD GPU.
Comment 100 Jaakko Kantojärvi 2020-09-20 17:34:51 UTC
Hello, I'm joining the club with the following hardware

    AMD Ryzen 9 3900X (microcode 0x08701013)
    Gigabyte X570 Aorus Elite (BIOS F30 / 2020-08-15)
    Asus Radeon R9 270X (connected via the x4 slot behind the X570 chipset)
    2x Kingston KHX3200C16D4/16GX DDR4 3200MHz

Module config:

    /proc/cmdline:
      BOOT_IMAGE=/vmlinuz-5.8.0-1-amd64 root=... ro quiet hugepagesz=1G hugepages=6 amdgpu.ppfeaturemask=0xffffbffd

    /etc/modprobe.d/*:
      blacklist edac_mce_amd
      blacklist radeon
      options amdgpu si_support=1
      options amdgpu cik_support=1
      softdep amdgpu pre: vfio vfio_pci
      options vfio_pci disable_vga=1
      # AMD RX 580 gpu+audio (NOTE: this GPU is currently removed)
      options vfio_pci ids=1002:67df,1002:aaf0"

and the following MCE

    mce: [Hardware Error]: CPU 14: Machine Check: 0 Bank 5: bea0000000000108
    mce: [Hardware Error]: TSC 0 ADDR 1ffffc05bd88a MISC d010000000000000 SYND 4d000000 IPID 500b000000000 
    mce: [Hardware Error]: PROCESSOR 2:870f10 TIME 1600618873 SOCKET 0 APIC 5 microcode 8701021


In addition to that, I have encountered more MCEs during past weeks. For those, I had slightly different setup

    CPU microcode: 0x08701013
    BIOS: F11 / 2019-12-06
    
    Second video card with vfio-pci driver:
      Gigabyte Radeon RX580 Aorus (connected via the x16 slot to the CPU)
    
    /proc/cmdline:
      BOOT_IMAGE=/vmlinuz-5.8.0-1-amd64 root=... ro quiet hugepagesz=1G hugepages=6

Event 1) System was idle and on the desktop

    mce: [Hardware Error]: CPU 0: Machine Check: 0 Bank 5: b6a0000000000108
    mce: [Hardware Error]: TSC 0 ADDR 1ffff88616a28 SYND 4d000000 IPID 500b000000000 
    mce: [Hardware Error]: PROCESSOR 2:870f10 TIME 1600227628 SOCKET 0 APIC 0 microcode 8701013

    mce: [Hardware Error]: CPU 15: Machine Check: 0 Bank 5: bea0000000000108
    mce: [Hardware Error]: TSC 0 ADDR 7f2bb6d59152 MISC d012000100000000 SYND 4d000000 IPID 500b000000000 
    mce: [Hardware Error]: PROCESSOR 2:870f10 TIME 1600227628 SOCKET 0 APIC 9 microcode 8701013
    
    mce: [Hardware Error]: CPU 18: Machine Check: 0 Bank 5: bea0000000000108
    mce: [Hardware Error]: TSC 0 ADDR 1ffff87e621b6 MISC d012000100000000 SYND 4d000000 IPID 500b000000000 
    mce: [Hardware Error]: PROCESSOR 2:870f10 TIME 1600227628 SOCKET 0 APIC 11 microcode 8701013
    
    mce: [Hardware Error]: CPU 0: Machine Check: 0 Bank 22: f2a000000002010b
    mce: [Hardware Error]: TSC 0 SYND 4d000000 IPID 1813e17000 
    mce: [Hardware Error]: PROCESSOR 2:870f10 TIME 1600227628 SOCKET 0 APIC 0 microcode 8701013

Event 2) System was idle and displays had entered powersaving hours before

    mce: [Hardware Error]: CPU 2: Machine Check: 0 Bank 5: bea0000000000108
    mce: [Hardware Error]: TSC 0 ADDR 7f53cee29aa8 MISC d012000100000000 SYND 4d000000 IPID 500b000000000 
    mce: [Hardware Error]: PROCESSOR 2:870f10 TIME 1600475700 SOCKET 0 APIC 4 microcode 8701013
    
    mce: [Hardware Error]: CPU 0: Machine Check: 0 Bank 22: f2a000000002010b
    mce: [Hardware Error]: TSC 0 SYND 4d000000 IPID 1813e17000 
    mce: [Hardware Error]: PROCESSOR 2:870f10 TIME 1600475700 SOCKET 0 APIC 0 microcode 8701013
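
With several events like the two above, it can help to count how often each bank/status pair shows up. A small Python sketch for that, assuming the "CPU n: Machine Check: 0 Bank n: status" line format seen in the logs in this thread (the function name is made up for illustration):

```python
import re
from collections import Counter

# Matches lines such as
#   "CPU 14: Machine Check: 0 Bank 5: bea0000000000108"
# and also the variant without a space after the colon ("Bank 7:Fea0...").
MCE_RE = re.compile(
    r"CPU (\d+): Machine Check: \d+ Bank (\d+): ?([0-9a-fA-F]{16})"
)


def tally_mce(log_text):
    """Count MCE lines per (bank, status) pair; the CPU number is ignored."""
    counts = Counter()
    for _cpu, bank, status in MCE_RE.findall(log_text):
        counts[(int(bank), status.lower())] += 1
    return counts
```

Fed the text of Event 1, this reports Bank 5 / bea0000000000108 twice and the Bank 5 / b6a0000000000108 and Bank 22 / f2a000000002010b entries once each.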


After the next reboot, I'll move the R9 to the x16 slot for testing. The idea was to reserve the x16 slot for the Windows VM, but I guess I prefer a stable desktop more.
Also, I need to test different amdgpu power management masks, as the current one wasn't enough. Although, it's possible that it partially helped (fewer MCEs per event, assuming the rest were not just a direct consequence of the main problem).
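
To double-check which features a given boot actually had disabled, the mask can be read back from the kernel command line and compared against the bit positions quoted earlier in this thread. A minimal Python sketch, with only those six bits known to it (the function name is made up for illustration):

```python
import re

# Bit positions as quoted in earlier comments of this thread.
PP_BITS = {
    0: "PP_SCLK_DPM_MASK",
    1: "PP_MCLK_DPM_MASK",
    2: "PP_PCIE_DPM_MASK",
    6: "PP_SMC_VOLTAGE_CONTROL_MASK",
    14: "PP_OVERDRIVE_MASK",
    18: "PP_AVFS_MASK",
}


def disabled_features(cmdline):
    """Return the known features cleared in amdgpu.ppfeaturemask, or None if unset."""
    m = re.search(r"amdgpu\.ppfeaturemask=(0x[0-9a-fA-F]+)", cmdline)
    if m is None:
        return None
    mask = int(m.group(1), 16)
    return [name for bit, name in sorted(PP_BITS.items())
            if not (mask >> bit) & 1]
```

Passing the contents of /proc/cmdline from the first configuration above returns PP_MCLK_DPM_MASK and PP_OVERDRIVE_MASK; a command line without the parameter returns None.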
