Bug 212087 - Random reboots with 5.11 and earlier 5.10.x versions, 5.10.18+ stable (Ryzen 5000)
Status: REOPENED
Alias: None
Product: Platform Specific/Hardware
Classification: Unclassified
Component: x86-64
Hardware: Other Linux
Importance: P1 high
Assignee: Borislav Petkov
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2021-03-06 15:42 UTC by alan.loewe
Modified: 2021-07-01 18:43 UTC
CC: 7 users

See Also:
Kernel Version: 5.10.13-5.10.16(17?), 5.11
Tree: Mainline
Regression: No


Attachments

Description alan.loewe 2021-03-06 15:42:15 UTC
Like many others with a new Ryzen 5000, I had issues with random reboots. There certainly were BIOS issues, but with the latest BIOS version (from Jan 29) installed and the BIOS settings left at almost their defaults* for a while, I'm sure there are also kernel issues.

I used Kernels as released by Arch Linux. After eventually having no reboots for a week after a BIOS update (pre 3204 beta), occasional reboots started occurring again with 5.10.13, becoming unbearable (hourly) with the release of 5.11.1 (upgrading from 5.10.16). Then I installed linux-lts at version 5.10.18**, which has been stable since, and now at 5.10.20 still is. In between I tried 5.11.2 when released and got three reboots that day.

** Thus I didn't use 5.10.17, which has relevant changes.

The unstable versions sometimes logged hardware errors:

Feb 20 19:11:02 xxxxxx kernel: mce: [Hardware Error]: Machine check events logged
Feb 20 19:11:02 xxxxxx kernel: mce: [Hardware Error]: CPU 15: Machine Check: 0 Bank 1: bc800800060c0859
Feb 20 19:11:02 xxxxxx kernel: mce: [Hardware Error]: TSC 0 ADDR 3b1f3a280 MISC d012000000000000 IPID 100b000000000 
Feb 20 19:11:02 xxxxxx kernel: mce: [Hardware Error]: PROCESSOR 2:a20f10 TIME 1613844658 SOCKET 0 APIC 1e microcode a201009

Feb 24 06:31:18 xxxxxx kernel: mce: [Hardware Error]: Machine check events logged
Feb 24 06:31:18 xxxxxx kernel: mce: [Hardware Error]: CPU 30: Machine Check: 0 Bank 1: fc800800060c0859
Feb 24 06:31:18 xxxxxx kernel: mce: [Hardware Error]: TSC 0 ADDR e1775b600 MISC d012000000000000 IPID 100b000000000 
Feb 24 06:31:18 xxxxxx kernel: mce: [Hardware Error]: PROCESSOR 2:a20f10 TIME 1614144675 SOCKET 0 APIC 1d microcode a201009


OS: Arch Linux
CPU: AMD Ryzen 5950X
Mainboard: ASUS Crosshair VIII Hero (Wi-Fi)
RAM: G.Skill F4-3600C16D-64GTZN
GPU: Nvidia GTX 1070
PSU: be quiet! Straight Power 11 Platinum 750W

* Non-default BIOS settings:
Power supply idle control: typical current
Memory current capability: 110%
RAM set to 3200 MHz, 16-18-18-38, 1.35V (below its specs)
SVM enabled
Comment 1 alan.loewe 2021-03-06 16:34:11 UTC
I should mention that before downgrading to 5.10.18, I had the BIOS setting "Power supply idle control" at its default. The reasoning behind "Memory current capability" is that its help text says the system will be halted if the current is above the configured limit, and the reported value actually is always about 0.015V above the configured one. I want to exclude that as a cause.

Now I will "Load optimized defaults" and activate the DOCP profile to see whether 5.10.20 is stable with that, too. Haven't really tried that yet.

Preempting the "C-states" question: no, they make no difference.
Comment 2 alan.loewe 2021-03-08 06:32:48 UTC
With default BIOS settings plus DOCP it rebooted after 27h uptime, a few minutes after beginning to play music.

So back to the settings mentioned above. I already tested over-volting as a mitigation by setting the curve optimizer to +1, and then verified by setting it to -1 (under-volting). It makes no difference, -1 is stable. But I had the impression that 5.11 crashed sooner with -1.

So I tested the limit. My CPU doesn't boot with -5, can't take load with -4, CAN compile the linux kernel and pass Cinebench with -3, but it's not stable in the long run. So -2 is probably the limit. That's rather poor.

With -3, 5.11 doesn't even boot, while 5.10 and Windows are stable enough for some benchmarks.

So I guess 5.11 somehow drives the CPU harder, too hard, and whether it's stable depends on sample quality. I'll call that a regression.
Comment 3 alan.loewe 2021-03-08 08:10:59 UTC
Uhm, nevermind, it has probably already been fixed in 5.11.3 by the cpufreq/schedutil stuff that's also in 5.10.17. I thought it had been in 5.11 from the beginning. Testing 5.11.4 right now, which I have conveniently built as a stress test without noticing. (:

I'll close this issue if it proves stable, which I assume.
Comment 4 alan.loewe 2021-03-08 11:17:58 UTC
Unfortunately not, 5.11.4 is slightly better, but not stable.

Mar 08 12:11:57 xxxxxx kernel: mce: [Hardware Error]: Machine check events logged
Mar 08 12:11:57 xxxxxx kernel: mce: [Hardware Error]: CPU 31: Machine Check: 0 Bank 1: bc800800060c0859
Mar 08 12:11:57 xxxxxx kernel: mce: [Hardware Error]: TSC 0 ADDR 6a1cce800 MISC d012000000000000 IPID 100b000000000 
Mar 08 12:11:57 xxxxxx kernel: mce: [Hardware Error]: PROCESSOR 2:a20f10 TIME 1615201914 SOCKET 0 APIC 1f microcode a201009
Comment 5 sean.mcauliffe 2021-03-15 14:30:16 UTC
I too am having this issue with 5.11.2 and similar specs (AMD 5950X/Crosshair VIII Dark Hero). Dual boot Windows and have not had any issues on that side.

Mar 15 02:36:11 xxxxxx kernel: mce: [Hardware Error]: CPU 30: Machine Check: 0 Bank 1: bc800800060c0859
Mar 15 02:36:11 xxxxxx kernel: mce: [Hardware Error]: TSC 0 ADDR 13e05c600 MISC d012000000000000 IPID 100b000000000 
Mar 15 02:36:11 xxxxxx kernel: mce: [Hardware Error]: PROCESSOR 2:a20f10 TIME 1615797369 SOCKET 0 APIC 1d microcode a201009

I'm attempting to downgrade to 5.10.19-1-MANJARO to see if it's more stable.
Comment 6 sean.mcauliffe 2021-03-17 17:10:37 UTC
To follow up, I did get an MCE on 5.10.19-1-MANJARO once I pushed the memory overclock to 3800 with an FCLK of 1900. Dropping these down to 3600/1800 respectively seems to be stable.

Mar 16 16:44:17 xxxxxx kernel: mce: [Hardware Error]: CPU 30: Machine Check: 0 Bank 1: bc800800060c0859
Mar 16 16:44:17 xxxxxx kernel: mce: [Hardware Error]: TSC 0 ADDR e9c27e840 MISC d0120ffe00000000 IPID 100b000000000 
Mar 16 16:44:17 xxxxxx kernel: mce: [Hardware Error]: PROCESSOR 2:a20f10 TIME 1615934655 SOCKET 0 APIC 1d microcode a201009
Comment 7 Chromer 2021-04-04 18:00:00 UTC
I have this issue on a Haswell CPU with the 5.11 series.

Log from 5.11.11:

Apr  4 15:47:20 xxx kernel: mce: [Hardware Error]: CPU 0: Machine Check: 0 Bank 3: be00000000800400
Apr  4 15:47:20 xxx kernel: mce: [Hardware Error]: TSC 0 ADDR ffffffffc07e06cc MISC ffffffffc07e06cc 
Apr  4 15:47:20 xxx kernel: mce: [Hardware Error]: PROCESSOR 0:306c3 TIME 1617535033 SOCKET 0 APIC 0 microcode 28

Apr  4 15:47:20 xxx kernel: mce: [Hardware Error]: CPU 1: Machine Check: 0 Bank 3: be00000000800400
Apr  4 15:47:20 xxx kernel: mce: [Hardware Error]: TSC 0 ADDR ffffffffc0e4b269 MISC ffffffffc0e4b269 
Apr  4 15:47:20 xxx kernel: mce: [Hardware Error]: PROCESSOR 0:306c3 TIME 1617535033 SOCKET 0 APIC 2 microcode 28
Comment 8 Misha Nasledov 2021-04-08 00:58:10 UTC
I've had this same issue on similar hardware. Ryzen 5900X, ASRock X570 Taichi. I was running BIOS 4.00 but I have just updated to the beta BIOS 4.15 with the AGESA 1.2.0.1 Patch A update to see if it helps.

I've tried all the usual Ryzen stuff as well. Global C-state control disabled, Power supply control "Typical Current Idle", and even disabling C-states with ZenStates.py at boot.

Apr  6 21:22:50 titan kernel: mce: [Hardware Error]: Machine check events logged
Apr  6 21:22:50 titan kernel: mce: [Hardware Error]: CPU 16: Machine Check: 0 Bank 1: bc800800060c0859
Apr  6 21:22:50 titan kernel: mce: [Hardware Error]: TSC 0 ADDR 169d89280 MISC d012000000000000 IPID 100b000000000 
Apr  6 21:22:50 titan kernel: mce: [Hardware Error]: PROCESSOR 2:a20f10 TIME 1617769364 SOCKET 0 APIC 9 microcode a201009

$ ./run.py 1 bc800800060c0859 
Bank: Instruction Fetch Unit (IF)
Error: L2 Cache Response Poison Error. Error is the result of consuming poison data (L2RespPoison 0xc)
Comment 9 meedalexa 2021-04-17 09:34:30 UTC
I'm using an Arch kernel, so I don't know if this will be of any help, but I'm having a similar issue on 5.11.13-arch1-1: the same MCE symptoms and bc800800060c0859 error code, and the same PROCESSOR etc. line except for the Unix timestamp. The system has been stable thus far after turning off Core Performance Boost in my BIOS, but I also haven't had it under as-heavy load (though still pretty heavy relative to what was making it crash earlier). Ryzen 7 5800X, Gigabyte X570 Aorus Elite Wifi
Comment 10 alan.loewe 2021-04-17 20:05:01 UTC
Looks like I eventually got my system stable. The solution was to slightly undervolt DRAM. I guess this kind of issue depends on the silicon quality of the data fabric, and if you're unlucky, you need to find stable non-default settings.

RAM specs are 3600 MHz, 16-22-22-42, 1.45V. With that I got random reboots without MCE about once a day. With 3200 MHz and 1.35V, 5.11 crashed every few hours with MCE, but 5.10 seemed stable. Well, almost, it still suddenly crashed multiple times a day after a few days without crashes.

Long story short, these settings are stable for my system, everything else at default:
* DOCP, but with 3200 MHz and 1.345V.
* DRAM power phase control: optimized (default is extreme).

These settings may or may not be relevant:
* DRAM current capability: 110%

Irrelevant settings:
* Virtualization stuff is enabled.

Note: when DRAM voltage is set to 1.35V, sensors report about 1.36-1.365V. When it's set to 1.345V, almost exactly 1.35V is reported.

To verify, I increased voltage to 1.355V, which yielded MCEs pretty quickly.

This also means that it's not a kernel issue.
Comment 11 Misha Nasledov 2021-04-17 22:01:56 UTC
I have some DDR4-3200 ECC RAM that runs DDR4-3200 with JEDEC timings, so it's 1.20V here. I don't have crashes as frequently as you have reported. My rig was pretty stable until a recent kernel, so I am going to keep digging. It's difficult, as it only happens once every few days.
Comment 12 alan.loewe 2021-04-22 20:30:17 UTC
I dug further, too, because DRAM voltage causing L2 cache errors doesn't really make sense, and figured that the SoC voltage was weird.

Standard SoC voltage is 1.0V. It was set to auto, and somehow the mainboard decided that 1.1V is a good voltage. I set it to 1.0V manually. Uptime now 30h with DRAM at 3600MHz and 1.45V. I won't call it stable, yet, but it's looking pretty promising.
Comment 13 Borislav Petkov 2021-05-05 11:15:06 UTC
Yes, for all folks with the error status value:

[17299.027344] [Hardware Error]: CPU:15 (17:31:0) MC1_STATUS[-|UE|MiscV|AddrV|-|TCC|-|-|Poison|-]: 0xbc800800060c0859

we will have an improvement soon to avoid some of the reboots depending on where the error happens.

And yes, getting your DRAM voltage stable and otherwise not causing those bit flips to happen in DRAM - because that's what this is - bits in DRAM get flipped, hardware detects them and poisons the cacheline. Which is all fine and good until software consumes that cacheline - it ate poison so it goes boom.

The improvement should decrease the "goes boom" cases and only kill those user processes, *if* the poison is in user memory, but not bring the whole box down.

Anyway, if people are interested I'll post a branch to test soonish.

Thx.
Comment 14 Misha Nasledov 2021-05-05 17:16:14 UTC
Borislav,

It is a very curious error for me as I hadn't seen anything L2 cache related until the recent kernel. Unfortunately I also did some BIOS updates (and one was a beta BIOS) and can't totally isolate the cause. I was leaning toward the kernel after finding others with pretty much the same hardware and error.

I should also note I'm running ECC RAM. It has yet to report a single bit flip error via edac-util (and nothing interesting in dmesg either).
Comment 15 Borislav Petkov 2021-05-05 17:55:57 UTC
(In reply to Misha Nasledov from comment #14)
> It is a very curious error for me as I hadn't seen anything L2 cache
> related until the recent kernel.

This is where the error is detected, that's why it says L2.

> Unfortunately I also did some BIOS updates (and one was a beta BIOS)
> and can't totally isolate the cause. I was leaning toward the kernel
> after finding others with pretty much the same hardware and error.

I highly doubt it is the kernel. If it were, people would report this
issue left and right.

> I should also note I'm running ECC RAM. It has yet to report a single bit
> flip error via edac-util (and nothing interesting in dmesg either).

Do you have amd64_edac and edac_mce_amd modules loaded?
Comment 16 Misha Nasledov 2021-05-05 17:57:55 UTC
(In reply to Borislav Petkov from comment #15)
> Do you have amd64_edac and edac_mce_amd modules loaded?

Yes.

# lsmod | grep edac
amd64_edac_mod         36864  0
edac_mce_amd           32768  1 amd64_edac_mod

# edac-util -v
mc0: 0 Uncorrected Errors with no DIMM info
mc0: 0 Corrected Errors with no DIMM info
mc0: csrow2: 0 Uncorrected Errors
mc0: csrow2: mc#0csrow#2channel#0: 0 Corrected Errors
mc0: csrow2: mc#0csrow#2channel#1: 0 Corrected Errors
mc0: csrow3: 0 Uncorrected Errors
mc0: csrow3: mc#0csrow#3channel#0: 0 Corrected Errors
mc0: csrow3: mc#0csrow#3channel#1: 0 Corrected Errors
edac-util: No errors to report.
Comment 17 meedalexa 2021-05-05 18:14:50 UTC
Additionally, I ran memtest86 on a computer exhibiting these symptoms (0xbc800800060c0859; see comment #9) for a few hours and got no DRAM errors. I can run it overnight (UTC-05:00) if that would be useful. I don't have ECC RAM, unlike Misha.
Comment 18 Borislav Petkov 2021-05-05 18:22:45 UTC
(In reply to Misha Nasledov from comment #16)
> (In reply to Borislav Petkov from comment #15)
> > Do you have amd64_edac and edac_mce_amd modules loaded?
> 
> Yes.

Which could mean that the errors you get are not single bit flips but, well, multiple bit flips which are uncorrectable and get poisoned. Consuming them leads to a hard reset and so they don't get logged by EDAC.
Comment 19 Borislav Petkov 2021-05-05 18:32:08 UTC
(In reply to meedalexa from comment #17)
> Additionally, I ran memtest86 on a computer exhibiting these symptoms
> (0xbc800800060c0859; see comment #9) for a few hours and got no DRAM errors.
> I can run it overnight (UTC-05:00) if that would be useful. I don't have ECC
> RAM, unlike Misha.

Probably not worth it. In pretty much all the error reports I've seen so far - and those are a *lot* :) - people would report MCEs, then run memtest for hours on the box, and nothing would get caught.

If it is some high-utilization pattern that brings the box's power envelope to some corner case, leading to whatever sub-optimal operating conditions cause the DRAM bit flips, then I'm highly skeptical that memtest can ever achieve those same operating conditions with its reading and writing of bit patterns into DRAM.

I might be mistaken but experience so far shows that memtest hardly ever causes the MCEs to get repeated. Unless you really have a faulty DIMM chip with a stuck bit which would *always* do the bit flip. But that's not what we have here.

HTH.
Comment 20 alan.loewe 2021-05-05 18:49:35 UTC
Is it possible that the bits are flipped on their way to the L2 cache under certain load conditions? I mean, lowering my SoC voltage definitively helped. I could run RAM at 3600 MHz / 1.45V without issues...

...until I ntfscloned a partition from my old HDD to a new SSD. About 1.3 TB, which would last 3 hours. It crashed twice with that MCE after about 1 and 2 hours. It successfully completed after I lowered RAM to 3200 MHz / 1.345V again.

Another good stress test is using Steam to verify the integrity of local game files. On the other hand, just sha256sum-ing a lot of files never triggered that MCE.
Comment 21 alan.loewe 2021-05-05 19:01:57 UTC
On kernel versions, the most notable change in 5.11 is frequency invariance:
https://www.phoronix.com/scan.php?page=news_item&px=AMD-Freq-Invariance-Linux-5.11

Maybe that makes the MCE more likely to happen, but the underlying cause is a hardware issue.
Comment 22 Misha Nasledov 2021-05-05 19:28:42 UTC
Given that I've seen that people have some stability issues with the 5900X but have no issues if they downgrade to older Ryzen processors like the 3600X, I'm inclined to agree that there is some sort of hardware issue. Not so much a defective hardware issue as it seems to be an issue AMD needs to address with a microcode update.

When I first got the 5900X, I regularly had reboots that were the typical Linux + Ryzen C6 state crash. The usual BIOS settings didn't fix it, but running ZenStates.py to disable C6 at boot did resolve it.

I'm hoping there will be some kind of microcode update soon. I need to look and see if anyone else has reported such issues to AMD. Maybe I need to contact them.
Comment 23 alan.loewe 2021-05-05 19:38:01 UTC
I reported the MCEs to AMD support. They asked for a dxdiag report, then said I should update my graphics drivers. When pointing out that I'm using Linux, and its drivers are up-to-date, I was told to use a compatible operating system. (:


Another thing I wonder about: the MCE is reported by the Instruction Fetch Unit. That means it happens when code execution jumps to a memory address that's not in the L2 cache, doesn't it? Very unlikely to happen when only a small, bootable application like Memtest is running.

But what would an error look like when data, not code, is fetched? Would we notice them? Do they happen?
Comment 24 Borislav Petkov 2021-05-05 21:48:05 UTC
(In reply to alan.loewe from comment #20)
> Is it possible that the bits are flipped on their way to the L2 cache under
> certain load conditions?

The error type is described in the CPU doc this way:

"L2 Cache Response Poison Error. Error is the result of consuming poison
data."

And L2 because I presume there the poison bit is being checked because
it assumes that the cacheline is going to get consumed or when it pulls
it into L1 because it is going to get consumed - there it signals the
MCE.

All guesstimation, of course.

> I mean, lowering my SoC voltage definitively helped. I could run RAM
> at 3600 MHz / 1.45V without issues...

Why do you even fiddle with voltages? I leave my power settings to
default in the BIOS and have no issues whatsoever.

> ...until I ntfscloned a partition from my old HDD to a new SSD. About 1.3
> TB, which would last 3 hours. It crashed twice with that MCE after about 1
> and 2 hours. It successfully completed after I lowered RAM to 3200 MHz /
> 1.345V again.

Yah, sounds like the corner conditions I was talking about.

> Another good stress test is using Steam to verify the integrity of local
> game files. On the other hand, just sha256sum-ing a lot of files never
> triggered that MCE.

Aha.

> On kernel versions, the most notable change in 5.11 is frequency invariance:

If the above ntfs cloning reliably reproduces with tweaked voltages
on 5.11 and all you change to that setup is boot into 5.10 and the
same exercise doesn't reproduce anymore, then I can imagine schedutil
contributing in some fashion. Although the average power utilization we do with
the current setting:

https://lore.kernel.org/linux-acpi/20201112182614.10700-3-ggherdovich@suse.cz/

sugov-mid, is not the maximal one so there should be some power left, so
to speak.

> I reported the MCEs to AMD support. They asked for a dxdiag report,
> then said I should update my graphics drivers. When pointing out that
> I'm using Linux, and its drivers are up-to-date, I was told to use a
> compatible operating system. (:

The standard canned response of all customer-facing support of all those
tech companies. Ignore it, there are people at AMD who care a lot about
Linux.

> Another thing I wonder about: the MCE is reported by the Instruction
> Fetch Unit. That means it happens when code execution jumps to a memory
> address that's not in the L2 cache, doesn't it?

No, the instruction cache unit caches cachelines of 64 bytes of size
which contain instructions. If a memory address it jumps to happens to
not be in it - which can happen although the prefetchers are pretty
aggressive - then you get a cache miss and that cacheline is fetched
into L2 and then into L1 for executing.

> Very unlikely to happen when only a small, bootable application like
> Memtest is running.

If the prefetcher guesses the access pattern of the application, the
cacheline is pretty much in the L2 by the time instructions from it get
executed.

> But what would an error look like when data, not code, is fetched?

If you fetch bytes which are not valid instructions, you get an
Invalid-Opcode Exception and you land in the respective exception
handler. It practically looks like this:

./ud
Illegal instruction

That ud thing does:

	asm volatile(".byte 0x27");

where 0x27 is an invalid opcode in 64-bit x86.

> Would we notice them? Do they happen?

So those things don't have anything to do with MCEs - what you're seeing
is some cacheline in memory gets two or more bits changed. For whatever
reason. Unstable voltages, alpha particles going through them, and so on.

If your DRAM is ECC, then the ECC protection word is checked and signals
that the cacheline's contents have changed and cannot be repaired
anymore (only single bits can) so they're marked as poison data and
travel around the machine without anything else bad happening.

But if they get to go up into the cache because they're about to get
executed, that poison data mark is seen by the L2 machinery and it
raises a machine check exception, causing the reboot to prevent any
further data corruption.

Something like this - all this is a rough version of the reality of what
happens but the basic idea should be clear.

HTH.
Comment 25 meedalexa 2021-05-05 22:21:28 UTC
(In reply to Borislav Petkov from comment #13)
> Yes, for all folks with the error status value:
> 
> [17299.027344] [Hardware Error]: CPU:15 (17:31:0)
> MC1_STATUS[-|UE|MiscV|AddrV|-|TCC|-|-|Poison|-]: 0xbc800800060c0859
> 
> we will have an improvement soon to avoid some of the reboots depending on
> where the error happens.
> 
> And yes, getting your DRAM voltage stable and otherwise not causing those
> bit flips to happen in DRAM - because that's what this is - bits in DRAM get
> flipped, hardware detects them and poisons the cacheline. Which is all fine
> and good until software consumes that cacheline - it ate poison so it goes
> boom.
> 
> The improvement should decrease the "goes boom" cases and only kill those
> user processes, *if* the poison is in user memory, but not bring the whole
> box down.

To make sure my understanding is correct: this would isolate the MCEs to the process that experiences the issue, and kill that process instead of resetting the system? (I'm guessing it would still be a reset if this happens in kernel code?) I think that could be useful, but I would still be wondering what the root cause is.

(In reply to Misha Nasledov from comment #14)
> It is a very curious error for me as I hadn't seen anything L2 cache related
> until the recent kernel. Unfortunately I also did some BIOS updates (and one
> was a beta BIOS) and can't totally isolate the cause. I was leaning toward
> the kernel after finding others with pretty much the same hardware and error.

My machine is new, so I can't isolate the cause either; these symptoms appeared for the first time as I was transferring files over the network, as I recall. I Googled around for the specific error I'd been having, complete with the MCE code bc800800060c0859, and found a few other threads, but most of them seem to point here. It's been a while so I don't have links, unfortunately.

(In reply to alan.loewe from comment #20)
> Is it possible that the bits are flipped on their way to the L2 cache under
> certain load conditions? I mean, lowering my SoC voltage definitively
> helped. I could run RAM at 3600 MHz / 1.45V without issues...
> 
> ...until I ntfscloned a partition from my old HDD to a new SSD. About 1.3
> TB, which would last 3 hours. It crashed twice with that MCE after about 1
> and 2 hours. It successfully completed after I lowered RAM to 3200 MHz /
> 1.345V again.

My workaround of disabling Core Performance Boost (mentioned in comment #9) has been completely stable for me since then. I tried poking my head into the BIOS voltage settings, but I haven't had time to sit down and actually figure it out (plus my hardware expertise is limited), so they're still stock for me. alan.loewe mentioned in an earlier comment that the BIOS put the SoC at 1.1V for some reason; that's not a symptom I saw on my machine, if I read the BIOS voltage menus correctly.

> Another good stress test is using Steam to verify the integrity of local
> game files. On the other hand, just sha256sum-ing a lot of files never
> triggered that MCE.

My stress test has been playing multiplayer Minecraft, lol. I also had MCEs earlier when I was copying files from my previous computer over a network, as mentioned.

(In reply to Misha Nasledov from comment #22)
> Given that I've seen that that people have some stability issues with the
> 5900X but have no issues if they downgrade to older Ryzen processors like
> 3600X, I'm inclined to agree that there is some sort of hardware issue. Not
> so much a defective hardware issue as it seems to be an issue AMD needs to
> address with a microcode update.
> 
> When I first got the 5900X, I regularly had reboots that were the typical
> Linux + Ryzen C6 state crash. The usual BIOS settings didn't fix it, but
> running ZenStates.py to disable C6 at boot did resolve it.
> 
> I'm hoping there will be some kind of microcode update soon. I need to look
> and see if anyone else has reported such issues to AMD. Maybe I need to
> contact them.

I sure hope it's not defective hardware. This Arch user [1] saw symptoms disappear after replacing their processor (with one of the same model) and updating their BIOS:
> Last update.
> I returned CPU to the shop and I bought a new one (also new Ryzen). In the
> meantime the new bios was released.
> Since then everything is working fine. To be honest I'm not sure if this was a
> CPU or bios problem. Maybe both. On AMD forum the topic is still active even
> after bios update.

[1]: https://bbs.archlinux.org/viewtopic.php?pid=1954703#p1954703

However, I've seen symptoms on a BIOS version that was released after that post was made (F33g, the latest being F33h), which I flashed before I started using the computer.

I can contact AMD support as well, if we think it would be useful, even if they end up telling me to use a "supported operating system" again :P I think the RMA period on my CPU is up (blame the graphics-card market taking a billion years to supply me with the last part I needed), so I'm hoping this is solvable at the kernel or microcode level.
Comment 26 Misha Nasledov 2021-05-05 22:34:32 UTC
(In reply to meedalexa from comment #25)
> I can contact AMD support as well, if we think it would be useful, even if
> they end up telling me to use a "supported operating system" again :P I
> think the RMA period on my CPU is up (blame the graphics-card market taking
> a billion years to supply me with the last part I needed), so I'm hoping
> this is solvable at the kernel or microcode level.

I think it would be good if more people brought it to their attention. I don't think the RMA period is up; the warranty period should be 3 years, IIRC.
Comment 27 Borislav Petkov 2021-05-05 22:38:49 UTC
(In reply to Borislav Petkov from comment #24)
> > Another thing I wonder about: the MCE is reported by the Instruction
> > Fetch Unit. That means it happens when code execution jumps to a memory
> > address that's not in the L2 cache, doesn't it?
> 
> No, the instruction cache unit caches cachelines of 64 bytes of size
> which contain instructions. If a memory address it jumps to happens to
> not be in it - which can happen although the prefetchers are pretty
> aggressive - then you get a cache miss and that cacheline is fetched
> into L2 and then into L1 for executing.

That was of course the instruction *cache*; the instruction fetch unit (IFU) steers which cachelines go into the instruction cache. Sorry for the confusion. And the MCE is reported as an IFU MCE probably because that is where the poison check is done or reported. It all depends on how the microarchitecture has been designed, but all in all, it doesn't matter in this case.
Comment 28 Borislav Petkov 2021-05-05 22:46:52 UTC
(In reply to meedalexa from comment #25)
> To make sure my understanding is correct: this would isolate the MCEs to the
> process that experiences the issue, and kill that process instead of
> resetting the system? (I'm guessing it would still be a reset if this
> happens in kernel code?)

Exactly. 

> I think that could be useful, but I would still be wondering what the
> root cause is.

Well, is the box stable if you leave your BIOS settings to default and
don't fiddle with DRAM voltages?
Comment 29 meedalexa 2021-05-05 22:51:26 UTC
(In reply to Misha Nasledov from comment #26)
> (In reply to meedalexa from comment #25)
> > I can contact AMD support as well, if we think it would be useful, even if
> > they end up telling me to use a "supported operating system" again :P I
> > think the RMA period on my CPU is up (blame the graphics-card market taking
> > a billion years to supply me with the last part I needed), so I'm hoping
> > this is solvable at the kernel or microcode level.
> 
> I think it would be good if more people brought it to their attention. I
> don't think the RMA period is up. The warranty period should be 3 years IIRC

You're probably right about the RMA period. I should probably check into that and see if I win the silicon lottery the second time around. Though of course it'll be rough to go for however long it takes without a CPU...

And yes, I'll contact AMD support this week.

(In reply to Borislav Petkov from comment #28)
> (In reply to meedalexa from comment #25)
> > I think that could be useful, but I would still be wondering what the
> > root cause is.
> 
> Well, is the box stable if you leave your BIOS settings to default and
> don't fiddle with DRAM voltages?

No, but it is stable if I disable Core Performance Boost in my BIOS and leave voltages untouched. (I also have my RAM's XMP profile turned on; I haven't checked if my system is stable if I disable XMP and re-enable Core Performance Boost.)
Comment 30 Misha Nasledov 2021-05-05 22:56:13 UTC
(In reply to meedalexa from comment #29)
> You're probably right about the RMA period. I should probably check into
> that and see if I win the silicon lottery the second time around. Though of
> course it'll be rough to go for however long it takes without a CPU...

I RMA'd an 1800X years ago and asked them to do an advanced RMA. They sent me the replacement in advance.
Comment 31 Borislav Petkov 2021-05-05 23:22:28 UTC
(In reply to meedalexa from comment #29)
> No, but it is stable if I disable Core Performance Boost in my BIOS and
> leave voltages untouched. (I also have my RAM's XMP profile turned on; I
> haven't checked if my system is stable if I disable XMP and re-enable Core
> Performance Boost.)

This thing here:

https://yourbusiness.azcentral.com/enable-xmp-amd-board-9989.html

talks about XMP being incompatible with AMD and AMD boards having their own memory profiles called AMP. If your BIOS is enabling Intel XMPs on an AMD board, I wouldn't be surprised if the DRAM chips are running at an incompatible setting.

Regardless, you could turn off your XMP profile and all the other overclocking settings in the BIOS and see if it still causes MCEs with the default setting and  CPB enabled.
Comment 32 alan.loewe 2021-05-06 02:52:02 UTC
Nowadays AMD boards just read the XMP profile, which apparently wasn't possible before due to the proprietary nature of XMP. They just name it differently, e.g. D.O.C.P. on ASUS boards.

As someone mentioned, there are very few RAM modules with a 3200 MHz JEDEC profile, which is the standard frequency, so enabling XMP is basically required. Everything above 3200 MHz is overclocking, though.

With default settings and DOCP disabled, resulting in 2666 MHz @ 1.2V and very slow timings, my system is less stable than with DOCP enabled.

I noticed that enabling DOCP causes the SoC voltage to be set to 1.1V. When decreasing the DRAM frequency to 3200 MHz and the voltage to the usual 1.35V, the SoC voltage stays at 1.1V. After setting it manually and then changing it back to auto, it's 1.0V. Using offset mode yields totally weird results. So the logic implemented by ASUS seems to be a bit buggy.

Core Performance Boost is the AMD equivalent of Turbo Boost. Turning it off costs a lot of performance, because the CPU frequency is then fixed at the base frequency. But yes, the system is stable without it, which is consistent with the frequency invariance introduced in 5.11 making things worse.

Disabling Precision Boost Overdrive, AMD's automatic overclocking that's enabled by default, doesn't help either.

I guess what we're experiencing are the downsides of a new approach: instead of advertising and guaranteeing a base level of performance, with some overclocking headroom if you're lucky, AMD now tries to get the optimum out of each individual processor with smart firmware, and sometimes that doesn't work very well. Then manually tweaking settings is your only option.
Comment 33 meedalexa 2021-05-06 03:27:12 UTC
(In reply to alan.loewe from comment #32)
> Nowadays AMD boards just read the XMP profile, which apparently wasn't
> possible before, due to the proprietary nature of XMP. They just call it
> differently, e.g. D.O.C.P. on ASUS boards.
> 
> As someone mentioned, there are very few RAM modules with a 3200 MHz JEDEC
> profile, which is the standard frequency, so enabling XMP is basically
> required. Everything above 3200 MHz is overclocking, though.
> 
> With default settings and DOCP disabled, resulting in 2666 MHz @ 1.2V and
> very slow timings, my system is less stable than with DOCP enabled.

Interesting. I've been playing Minecraft for a couple of hours with XMP/DOCP off and CPB on and it's working fine. (I forget what my board calls it, if it uses XMP or something anodyne like memory profiles.) Under prior conditions it would have crashed by now, though I'm still not calling it stable just yet. My RAM is also 3200 MHz with XMP, 2666 MHz without.

> I noticed that enabling DOCP causes the SoC voltage to be set to 1.1V. When
> decreasing DRAM frequency to 3200 MHz and voltage to the usual 1.35V, SoC
> voltage stays at that. After setting it manually and then changing it back
> to auto, it's 1.0V. Using offset mode yields totally weird results. So the
> logic implemented by ASUS seems to be a bit buggy.

My board is a Gigabyte. My SoC Vcore was 1.0V with XMP on or off.

So it seems there are a few potential workarounds right now:
- Disable XMP, though it might not work on some boards
- Fix your voltages if they're weird, though they might not be weird
- Disable Core Performance Boost, though goodbye performance
- RMA and get a new processor, based on the Arch Forums thread and my vague memory of earlier web searches on this problem
Comment 34 alan.loewe 2021-05-06 04:35:10 UTC
(In reply to meedalexa from comment #33)
> My RAM is also 3200 MHz with XMP, 2666 MHz without.
The XMP profile of mine is 3600 MHz @ 1.45V. Increasing the SoC voltage is actually recommended in that case, and the BIOS seems to do that automatically, but by too much. With the standard SoC voltage the XMP profile was stable until I cloned my disk. Both before and since, I use DOCP for the timings and manually adjust the frequency to the non-overclocking 3200 MHz and the voltage to a bit less than the usual 1.35V, but then the BIOS doesn't automatically revert to the standard SoC voltage.

So this hint is for people with overclocking RAM. 3600 MHz is often recommended as the sweet spot, and thus quite common, but many are unaware that AMD only guarantees 3200 MHz.

The difference is quite noticeable in some benchmarks, so running it at 2666 MHz is not a satisfactory solution either.
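For a rough sense of what's at stake, here is a back-of-the-envelope sketch of the theoretical peak bandwidth at the transfer rates discussed above (assumptions: dual-channel DDR4 with a 64-bit bus per channel, so 8 bytes per transfer per channel; real benchmark deltas are smaller than these peak numbers):

```python
# Theoretical peak DDR4 bandwidth for the transfer rates in this thread.
# Assumption: dual-channel, 64-bit (8-byte) bus per channel.
def peak_gbs(mts, channels=2):
    """Peak bandwidth in GB/s for a given transfer rate in MT/s."""
    return mts * 8 * channels / 1000

for mts in (2666, 3200, 3600):
    print(f"DDR4-{mts}: {peak_gbs(mts):.1f} GB/s peak")
```

This shows why dropping from 3600 or 3200 MT/s down to the 2666 MT/s fallback is a real loss on paper, even before latency timings are considered.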

By the way, I tried that only once, a few months ago with a much older BIOS, and got a non-MCE crash really quickly. It could be stable now.
Comment 35 meedalexa 2021-05-08 20:57:15 UTC
With XMP off and Core Performance Boost on, it seemed like it was working... until today, when it crashed in the middle of a Minecraft game. I've switched back to the known-good settings for now: XMP on, CPB off.
Comment 36 Misha Nasledov 2021-07-01 18:43:44 UTC
I don't think XMP, CPB, etc. are related. These can cause crashes, but in my experience they cause different MCEs.

I'm still experiencing the L2 cache poisoning with my 5900X. I'm running the 5.10 kernel from Debian, currently 5.10.46. The other day I actually experienced this MCE without a hard crash: the process was killed and the system kept running. See the logs at https://pastebin.com/74pybLyT
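For anyone comparing their own MCE logs against this report, here is a minimal Python sketch that decodes the architectural bits of an MCi_STATUS value, using the Bank 1 status from the original report (bc800800060c0859) as the example. Only the architectural bits common to the x86 machine-check layout are decoded; the AMD bank-specific fields (syndrome, extended error code, etc.) are deliberately left out, so this is a starting point, not a full decoder:

```python
# Decode the architectural MCi_STATUS bits of the value logged in this
# report (Bank 1: bc800800060c0859). Bit positions follow the x86
# architectural machine-check layout; vendor-specific bits are skipped.

STATUS = 0xBC800800060C0859

ARCH_BITS = {
    63: "VAL   (valid error)",
    62: "OVER  (error overflow)",
    61: "UC    (uncorrected error)",
    60: "EN    (error reporting enabled)",
    59: "MISCV (MCi_MISC register valid)",
    58: "ADDRV (MCi_ADDR register valid)",
    57: "PCC   (processor context corrupt)",
}

def decode(status):
    """Return the set architectural flags and the low 16-bit MCA error code."""
    flags = [name for bit, name in ARCH_BITS.items() if status >> bit & 1]
    mca_error_code = status & 0xFFFF
    return flags, mca_error_code

flags, code = decode(STATUS)
for flag in flags:
    print(flag)
print(f"MCA error code: {code:#06x}")
```

For this particular value, VAL, UC, EN, MISCV, and ADDRV are set and PCC is clear, which matches the observation above that the machine can sometimes survive the event (the error is uncorrected, but the processor context is not reported as corrupt).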

I'm at the point where I want to try swapping another CPU into this rig. I have a spare Ryzen 3600 that should work, and I will leave everything else the same. If I don't see this crash anymore (I will give it up to two weeks), it seems like a reasonable conclusion that the problem has to do with the 5900X. I don't have reason to believe it's a Linux-related issue either. I suspect this may be an RMA case for AMD.