Bug 212087

| Summary: | Random reboots with 5.11 and earlier 5.10.x versions, 5.10.18+ stable (Ryzen 5000) | | |
|---|---|---|---|
| Product: | Platform Specific/Hardware | Reporter: | alan.loewe |
| Component: | x86-64 | Assignee: | Borislav Petkov (bp) |
| Status: | REOPENED --- | | |
| Severity: | high | CC: | alexoundos, bp, downloader030, gabriele.svelto, jelenczv, klangga, meedalexa, misha, nitrooo, njlmerchant, nyanpur, sean.mcauliffe, ucelsanicin |
| Priority: | P1 | | |
| Hardware: | Other | | |
| OS: | Linux | | |
| Kernel Version: | 5.10.13-5.10.16(17?), 5.11 | Subsystem: | |
| Regression: | No | Bisected commit-id: | |
Description
alan.loewe
2021-03-06 15:42:15 UTC
I should mention that before downgrading to 5.10.18 I had the BIOS setting "Power supply idle control" at its default. The reasoning behind "Memory current capability" is that its help text says that the system will be halted when the current is above the configured current. And the reported current actually is always about 0.015V above the configured current. I want to exclude that as a cause. Now I will "Load optimized defaults" and activate the DOCP profile to see whether 5.10.20 is stable with that, too. Haven't really tried that yet. In before "C states": no, makes no difference.

With default BIOS settings plus DOCP it rebooted after 27h uptime, a few minutes after beginning to play music. So back to the settings mentioned above.

I already tested over-volting as a mitigation by setting the curve optimizer to +1, and then verified by setting it to -1 (under-volting). It makes no difference, -1 is stable. But I had the impression that 5.11 crashed sooner with -1. So I tested the limit. My CPU doesn't boot with -5, can't take load with -4, CAN compile the Linux kernel and pass Cinebench with -3, but it's not stable in the long run. So -2 is probably the limit. That's rather poor. With -3, 5.11 doesn't even boot, while 5.10 and Windows are stable enough for some benchmarks. So I guess 5.11 somehow drives the CPU harder, too hard, and whether it's stable depends on sample quality. I'll call that a regression.

Uhm, never mind, it has probably already been fixed in 5.11.3 by the cpufreq/schedutil stuff that's also in 5.10.17. I thought it had been in 5.11 from the beginning. Testing 5.11.4 right now, which I have conveniently built as a stress test without noticing. (: I'll close this issue if it proves stable, which I assume.

Unfortunately not, 5.11.4 is slightly better, but not stable.
Mar 08 12:11:57 xxxxxx kernel: mce: [Hardware Error]: Machine check events logged
Mar 08 12:11:57 xxxxxx kernel: mce: [Hardware Error]: CPU 31: Machine Check: 0 Bank 1: bc800800060c0859
Mar 08 12:11:57 xxxxxx kernel: mce: [Hardware Error]: TSC 0 ADDR 6a1cce800 MISC d012000000000000 IPID 100b000000000
Mar 08 12:11:57 xxxxxx kernel: mce: [Hardware Error]: PROCESSOR 2:a20f10 TIME 1615201914 SOCKET 0 APIC 1f microcode a201009

I too am having this issue with 5.11.2 and similar specs (AMD 5950X/Crosshair VIII Dark Hero). Dual boot Windows and have not had any issues on that side.

Mar 15 02:36:11 xxxxxx kernel: mce: [Hardware Error]: CPU 30: Machine Check: 0 Bank 1: bc800800060c0859
Mar 15 02:36:11 xxxxxx kernel: mce: [Hardware Error]: TSC 0 ADDR 13e05c600 MISC d012000000000000 IPID 100b000000000
Mar 15 02:36:11 xxxxxx kernel: mce: [Hardware Error]: PROCESSOR 2:a20f10 TIME 1615797369 SOCKET 0 APIC 1d microcode a201009

I'm attempting to downgrade to 5.10.19-1-MANJARO to see if it's more stable.

To follow up, I did get an MCE on 5.10.19-1-MANJARO once I pushed the memory overclock to 3800 and an FCLK of 1900. Dropping this down to 3600/1800 respectively seems to be stable.

Mar 16 16:44:17 xxxxxx kernel: mce: [Hardware Error]: CPU 30: Machine Check: 0 Bank 1: bc800800060c0859
Mar 16 16:44:17 xxxxxx kernel: mce: [Hardware Error]: TSC 0 ADDR e9c27e840 MISC d0120ffe00000000 IPID 100b000000000
Mar 16 16:44:17 xxxxxx kernel: mce: [Hardware Error]: PROCESSOR 2:a20f10 TIME 1615934655 SOCKET 0 APIC 1d microcode a201009

I have this issue on Haswell CPU with 5.11 series.
Log from 5.11.11:
Apr 4 15:47:20 xxx kernel: mce: [Hardware Error]: CPU 0: Machine Check: 0 Bank 3: be00000000800400
Apr 4 15:47:20 xxx kernel: mce: [Hardware Error]: TSC 0 ADDR ffffffffc07e06cc MISC ffffffffc07e06cc
Apr 4 15:47:20 xxx kernel: mce: [Hardware Error]: PROCESSOR 0:306c3 TIME 1617535033 SOCKET 0 APIC 0 microcode 28
Apr 4 15:47:20 xxx kernel: mce: [Hardware Error]: CPU 1: Machine Check: 0 Bank 3: be00000000800400
Apr 4 15:47:20 xxx kernel: mce: [Hardware Error]: TSC 0 ADDR ffffffffc0e4b269 MISC ffffffffc0e4b269
Apr 4 15:47:20 xxx kernel: mce: [Hardware Error]: PROCESSOR 0:306c3 TIME 1617535033 SOCKET 0 APIC 2 microcode 28

I've had this same issue on similar hardware. Ryzen 5900X, ASRock X570 Taichi. I was running BIOS 4.00 but I have just updated to the beta BIOS 4.15 with the AGESA 1.2.0.1 Patch A update to see if it helps. I've tried all the usual Ryzen stuff as well. Global C-state control disabled, Power supply control "Typical Current Idle", and even disabling C-states with ZenStates.py at boot.

Apr 6 21:22:50 titan kernel: mce: [Hardware Error]: Machine check events logged
Apr 6 21:22:50 titan kernel: mce: [Hardware Error]: CPU 16: Machine Check: 0 Bank 1: bc800800060c0859
Apr 6 21:22:50 titan kernel: mce: [Hardware Error]: TSC 0 ADDR 169d89280 MISC d012000000000000 IPID 100b000000000
Apr 6 21:22:50 titan kernel: mce: [Hardware Error]: PROCESSOR 2:a20f10 TIME 1617769364 SOCKET 0 APIC 9 microcode a201009

$ ./run.py 1 bc800800060c0859
Bank: Instruction Fetch Unit (IF)
Error: L2 Cache Response Poison Error. Error is the result of consuming poison data (L2RespPoison 0xc)

I'm using an Arch kernel so I don't know if this will be of any help, but I'm having a similar issue on 5.11.13-arch1-1 (the same mce symptoms and bc800800060c0859 error code, and same PROCESSOR etc. line except for the Unix timestamp), and I've had a stable system thus far after turning off Core Performance Boost in my BIOS.
But I've also not had my system under as-heavy load (though still pretty heavy load relative to what was making my system crash earlier). Ryzen 7 5800X, Gigabyte X570 Aorus Elite Wifi

Looks like I eventually got my system stable. The solution was to slightly undervolt DRAM. I guess this kind of issue depends on the silicon quality of the data fabric, and if you're unlucky, you need to find stable non-default settings.

RAM specs are 3600 MHz, 16-22-22-42, 1.45V. With that I got random reboots without MCE about once a day. With 3200 MHz and 1.35V, 5.11 crashed every few hours with MCE, but 5.10 seemed stable. Well, almost, it still suddenly crashed multiple times a day after a few days without crashes.

Long story short, these settings are stable for my system, everything else at default:
* DOCP, but with 3200 MHz and 1.345V.
* DRAM power phase control: optimized (default is extreme).

These settings may or may not be relevant:
* DRAM current capability: 110%

Irrelevant settings:
* Virtualization stuff is enabled.

Note: when DRAM voltage is set to 1.35V, sensors reported about 1.36-1.365V. When it's set to 1.345V, quite exactly 1.35V are reported. To verify, I increased voltage to 1.355V, which yielded MCEs pretty quickly. This also means that it's not a kernel issue.

I have some DDR4-3200 ECC RAM that runs DDR4-3200 with JEDEC timings, so it's 1.2V here. I don't have crashes as frequently as you have reported. My rig was pretty stable until a recent kernel, so I am going to keep digging. It's difficult as it only happens once every few days.

I dug further, too, because DRAM voltage causing L2 cache errors doesn't really make sense, and figured that the SoC voltage was weird. Standard SoC voltage is 1.0V. It was set to auto, and somehow the mainboard decided that 1.1V is a good voltage. I set it to 1.0V manually. Uptime now 30h with DRAM at 3600MHz and 1.45V. I won't call it stable, yet, but it's looking pretty promising.
Yes, for all folks with the error status value:

[17299.027344] [Hardware Error]: CPU:15 (17:31:0) MC1_STATUS[-|UE|MiscV|AddrV|-|TCC|-|-|Poison|-]: 0xbc800800060c0859

we will have an improvement soon to avoid some of the reboots depending on where the error happens.

And yes, getting your DRAM voltage stable and otherwise not causing those bit flips to happen in DRAM - because that's what this is - bits in DRAM get flipped, hardware detects them and poisons the cacheline. Which is all fine and good until software consumes that cacheline - it ate poison so it goes boom.

The improvement should decrease the "goes boom" cases and only kill those user processes, *if* the poison is in user memory, but not bring the whole box down.

Anyway, if people are interested I'll post a branch to test soonish.

Thx.

Borislav,

It is a very curious error for me as I hadn't seen anything L2 cache related until the recent kernel. Unfortunately I also did some BIOS updates (and one was a beta BIOS) and can't totally isolate the cause. I was leaning toward the kernel after finding others with pretty much the same hardware and error.

I should also note I'm running ECC RAM. It has yet to report a single bit flip error via edac-util (and nothing interesting in dmesg either).

(In reply to Misha Nasledov from comment #14)
> It is a very curious error for me as I hadn't seen anything L2 cache
> related until the recent kernel.

This is where the error is detected, that's why it says L2.

> Unfortunately I also did some BIOS updates (and one was a beta BIOS)
> and can't totally isolate the cause. I was leaning toward the kernel
> after finding others with pretty much the same hardware and error.

I highly doubt it is the kernel. If it were, people would report this issue left and right.

> I should also note I'm running ECC RAM. It has yet to report a single bit
> flip error via edac-util (and nothing interesting in dmesg either).

Do you have amd64_edac and edac_mce_amd modules loaded?
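For readers who want to see how the status value 0xbc800800060c0859 maps onto the flags in the MC1_STATUS line quoted above, here is a minimal Python sketch. It is not the `run.py` decoder used earlier in this thread. The bit positions are taken from the `MCI_STATUS_*` definitions in the kernel's `arch/x86/include/asm/mce.h`; the 6-bit width of the extended error code is an assumption that applies to AMD scalable-MCA banks, and the function name `decode_mca_status` is made up for illustration.

```python
# Bit positions as defined by MCI_STATUS_* in arch/x86/include/asm/mce.h.
# TCC, SyndV, Deferred and Poison are AMD-specific; on Intel parts those
# bit positions have different meanings.
FLAGS = [
    (63, "Val"),       # register contents are valid
    (62, "Over"),      # an earlier error was overwritten
    (61, "UC"),        # uncorrected error (printed as "UE" by the kernel)
    (60, "En"),        # error reporting enabled for this bank
    (59, "MiscV"),     # MISC register is valid
    (58, "AddrV"),     # ADDR register is valid
    (57, "PCC"),       # processor context corrupt
    (55, "TCC"),       # task context corrupt
    (53, "SyndV"),     # syndrome register is valid
    (44, "Deferred"),  # deferred error
    (43, "Poison"),    # error came from consuming poisoned data
]


def decode_mca_status(status: int) -> dict:
    """Split an MCA status value into flag names and error-code fields."""
    flags = [name for bit, name in FLAGS if (status >> bit) & 1]
    return {
        "flags": flags,
        # Extended error code; 6 bits wide is an assumption (scalable MCA).
        "xec": (status >> 16) & 0x3F,
        "error_code": status & 0xFFFF,
    }


if __name__ == "__main__":
    decoded = decode_mca_status(0xBC800800060C0859)
    print("|".join(decoded["flags"]))
    print(hex(decoded["xec"]), hex(decoded["error_code"]))
```

The extended error code extracted this way is 0xc, which matches the "L2RespPoison 0xc" that the decoder output earlier in the thread reports for the Instruction Fetch bank.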
(In reply to Borislav Petkov from comment #15)
> Do you have amd64_edac and edac_mce_amd modules loaded?

Yes.

# lsmod | grep edac
amd64_edac_mod 36864 0
edac_mce_amd 32768 1 amd64_edac_mod
# edac-util -v
mc0: 0 Uncorrected Errors with no DIMM info
mc0: 0 Corrected Errors with no DIMM info
mc0: csrow2: 0 Uncorrected Errors
mc0: csrow2: mc#0csrow#2channel#0: 0 Corrected Errors
mc0: csrow2: mc#0csrow#2channel#1: 0 Corrected Errors
mc0: csrow3: 0 Uncorrected Errors
mc0: csrow3: mc#0csrow#3channel#0: 0 Corrected Errors
mc0: csrow3: mc#0csrow#3channel#1: 0 Corrected Errors
edac-util: No errors to report.

Additionally, I ran memtest86 on a computer exhibiting these symptoms (0xbc800800060c0859; see comment #9) for a few hours and got no DRAM errors. I can run it overnight (UTC-05:00) if that would be useful. I don't have ECC RAM, unlike Misha.

(In reply to Misha Nasledov from comment #16)
> (In reply to Borislav Petkov from comment #15)
> > Do you have amd64_edac and edac_mce_amd modules loaded?
>
> Yes.

Which could mean that the errors you get are not single bit flips but, well, multiple bit flips which are uncorrectable and get poisoned. Consuming them leads to a hard reset and so they don't get logged by EDAC.

(In reply to meedalexa from comment #17)
> Additionally, I ran memtest86 on a computer exhibiting these symptoms
> (0xbc800800060c0859; see comment #9) for a few hours and got no DRAM errors.
> I can run it overnight (UTC-05:00) if that would be useful. I don't have ECC
> RAM, unlike Misha.

Probably not worth it. In pretty much all the error reports I've seen so far - and those are a *lot* :) - people would report MCEs and then run memtest for hours on the box and nothing will get caught.
If it is some high utilization pattern which would bring the box power envelope to some corner case which could lead to whatever sub-optimal operating conditions causing DRAM bit flips then I'm highly skeptical memtest can ever achieve those same operating conditions with its reading and writing of bit patterns into DRAM.

I might be mistaken but experience so far shows that memtest hardly ever causes the MCEs to get repeated. Unless you really have a faulty DIMM chip with a stuck bit which would *always* do the bit flip. But that's not what we have here.

HTH.

Is it possible that the bits are flipped on their way to the L2 cache under certain load conditions? I mean, lowering my SoC voltage definitively helped. I could run RAM at 3600 MHz / 1.45V without issues...

...until I ntfscloned a partition from my old HDD to a new SSD. About 1.3 TB, which would last 3 hours. It crashed twice with that MCE after about 1 and 2 hours. It successfully completed after I lowered RAM to 3200 MHz / 1.345V again.

Another good stress test is using Steam to verify the integrity of local game files. On the other hand, just sha256sum-ing a lot of files never triggered that MCE.

On kernel versions, the most notable change in 5.11 is frequency invariance:
https://www.phoronix.com/scan.php?page=news_item&px=AMD-Freq-Invariance-Linux-5.11
Maybe that makes the MCE more likely to happen, but the underlying cause is a hardware issue.

Given that I've seen that people have some stability issues with the 5900X but have no issues if they downgrade to older Ryzen processors like 3600X, I'm inclined to agree that there is some sort of hardware issue. Not so much a defective hardware issue as it seems to be an issue AMD needs to address with a microcode update.

When I first got the 5900X, I regularly had reboots that were the typical Linux + Ryzen C6 state crash. The usual BIOS settings didn't fix it, but running ZenStates.py to disable C6 at boot did resolve it.
I'm hoping there will be some kind of microcode update soon. I need to look and see if anyone else has reported such issues to AMD. Maybe I need to contact them.

I reported the MCEs to AMD support. They asked for a dxdiag report, then said I should update my graphics drivers. When pointing out that I'm using Linux, and its drivers are up-to-date, I was told to use a compatible operating system. (:

Another thing I wonder about: the MCE is reported by the Instruction Fetch Unit. That means it happens when code execution jumps to a memory address that's not in the L2 cache, doesn't it? Very unlikely to happen when only a small, bootable application like Memtest is running. But what would an error look like when data, not code, is fetched? Would we notice them? Do they happen?

(In reply to alan.loewe from comment #20)
> Is it possible that the bits are flipped on their way to the L2 cache under
> certain load conditions?

The error type is described in the CPU doc this way:

"L2 Cache Response Poison Error. Error is the result of consuming poison data."

And L2 because I presume there the poison bit is being checked because it assumes that the cacheline is going to get consumed or when it pulls it into L1 because it is going to get consumed - there it signals the MCE. All guesstimation of course.

> I mean, lowering my SoC voltage definitively helped. I could run RAM
> at 3600 MHz / 1.45V without issues...

Why do you even fiddle with voltages? I leave my power settings to default in the BIOS and have no issues whatsoever.

> ...until I ntfscloned a partition from my old HDD to a new SSD. About 1.3
> TB, which would last 3 hours. It crashed twice with that MCE after about 1
> and 2 hours. It successfully completed after I lowered RAM to 3200 MHz /
> 1.345V again.

Yah, sounds like the corner conditions I was talking about.

> Another good stress test is using Steam to verify the integrity of local
> game files.
> On the other hand, just sha256sum-ing a lot of files never
> triggered that MCE.

Aha.

> On kernel versions, the most notable change in 5.11 is frequency invariance:

If the above ntfs cloning reliably reproduces with tweaked voltages on 5.11 and all you change to that setup is boot into 5.10 and the same exercise doesn't reproduce anymore, then I can imagine schedutil contributing in some fashion. Although the average power utilization we do with the current setting:

https://lore.kernel.org/linux-acpi/20201112182614.10700-3-ggherdovich@suse.cz/

sugov-mid, is not the maximal one so there should be some power left, so to speak.

> I reported the MCEs to AMD support. They asked for a dxdiag report,
> then said I should update my graphics drivers. When pointing out that
> I'm using Linux, and its drivers are up-to-date, I was told to use a
> compatible operating system. (:

The standard canned response of all customer-facing support of all those tech companies. Ignore it, there are people at AMD who care a lot about Linux.

> Another thing I wonder about: the MCE is reported by the Instruction
> Fetch Unit. That means it happens when code execution jumps to a memory
> address that's not in the L2 cache, doesn't it?

No, the instruction cache unit caches cachelines of 64 bytes of size which contain instructions. If a memory address it jumps to happens to not be in it - which can happen although the prefetchers are pretty aggressive - then you get a cache miss and that cacheline is fetched into L2 and then into L1 for executing.

> Very unlikely to happen when only a small, bootable application like
> Memtest is running.

If the prefetcher guesses the access pattern of the application, the cacheline is pretty much in the L2 by the time instructions from it get to get executed.

> But what would an error look like when data, not code, is fetched?
If you fetch bytes which are not valid instructions, you get an Invalid-Opcode Exception and you land in the respective exception handler. It practically looks like this:

./ud
Illegal instruction

That ud thing does:

asm volatile(".byte 0x27");

where 0x27 is an invalid opcode in 64-bit x86.

> Would we notice them? Do they happen?

So those things don't have anything to do with MCEs - what you're seeing is some cacheline in memory gets two or more bits changed. For whatever reason. Unstable voltages, alpha particles going through them, and so on. If your DRAM is ECC, then the ECC protection word is checked and signals that the cacheline's contents have changed and cannot be repaired anymore (only single bits can) so they're marked as poison data and travel around the machine without anything else bad happening.

But if they get to go up into the cache because they're about to get executed, that poison data mark is seen by the L2 machinery and it raises a machine check exception, causing the reboot to prevent any further data corruption.

Something like this - all this is a rough version of the reality of what happens but the basic idea should be clear.

HTH.

(In reply to Borislav Petkov from comment #13)
> Yes, for all folks with the error status value:
>
> [17299.027344] [Hardware Error]: CPU:15 (17:31:0)
> MC1_STATUS[-|UE|MiscV|AddrV|-|TCC|-|-|Poison|-]: 0xbc800800060c0859
>
> we will have an improvement soon to avoid some of the reboots depending on
> where the error happens.
>
> And yes, getting your DRAM voltage stable and otherwise not causing those
> bit flips to happen in DRAM - because that's what this is - bits in DRAM get
> flipped, hardware detects them and poisons the cacheline. Which is all fine
> and good until software consumes that cacheline - it ate poison so it goes
> boom.
>
> The improvement should decrease the "goes boom" cases and only kill those
> user processes, *if* the poison is in user memory, but not bring the whole
> box down.
To make sure my understanding is correct: this would isolate the MCEs to the process that experiences the issue, and kill that process instead of resetting the system? (I'm guessing it would still be a reset if this happens in kernel code?) I think that could be useful, but I would still be wondering what the root cause is.

(In reply to Misha Nasledov from comment #14)
> It is a very curious error for me as I hadn't seen anything L2 cache related
> until the recent kernel. Unfortunately I also did some BIOS updates (and one
> was a beta BIOS) and can't totally isolate the cause. I was leaning toward
> the kernel after finding others with pretty much the same hardware and error.

My machine is new, so I can't isolate the cause either; these symptoms appeared for the first time as I was transferring files over the network, as I recall. I Googled around for the specific error I'd been having, complete with the MCE code bc800800060c0859, and found a few other threads, but most of them seem to point here. It's been a while so I don't have links, unfortunately.

(In reply to alan.loewe from comment #20)
> Is it possible that the bits are flipped on their way to the L2 cache under
> certain load conditions? I mean, lowering my SoC voltage definitively
> helped. I could run RAM at 3600 MHz / 1.45V without issues...
>
> ...until I ntfscloned a partition from my old HDD to a new SSD. About 1.3
> TB, which would last 3 hours. It crashed twice with that MCE after about 1
> and 2 hours. It successfully completed after I lowered RAM to 3200 MHz /
> 1.345V again.

My workaround of disabling Core Performance Boost (mentioned in comment #9) has been completely stable for me since then. I tried poking my head into the BIOS voltage settings, but I haven't had time to sit down and actually figure it out (plus my hardware expertise is limited), so they're still stock for me.
alan.loewe mentioned in an earlier comment that the BIOS put the SoC at 1.1V for some reason; that's not a symptom I saw on my machine, if I read the BIOS voltage menus correctly.

> Another good stress test is using Steam to verify the integrity of local
> game files. On the other hand, just sha256sum-ing a lot of files never
> triggered that MCE.

My stress test has been playing multiplayer Minecraft, lol. I also had MCEs earlier when I was copying files from my previous computer over a network, as mentioned.

(In reply to Misha Nasledov from comment #22)
> Given that I've seen that people have some stability issues with the
> 5900X but have no issues if they downgrade to older Ryzen processors like
> 3600X, I'm inclined to agree that there is some sort of hardware issue. Not
> so much a defective hardware issue as it seems to be an issue AMD needs to
> address with a microcode update.
>
> When I first got the 5900X, I regularly had reboots that were the typical
> Linux + Ryzen C6 state crash. The usual BIOS settings didn't fix it, but
> running ZenStates.py to disable C6 at boot did resolve it.
>
> I'm hoping there will be some kind of microcode update soon. I need to look
> and see if anyone else has reported such issues to AMD. Maybe I need to
> contact them.

I sure hope it's not defective hardware. This Arch user [1] saw symptoms disappear after replacing their processor (with one of the same model) and updating their BIOS:

> Last update.
> I returned CPU to the shop and I bought a new one (also new Ryzen). In the
> meantime the new bios was released.
> Since then everything is working fine. To be honest I'm not sure if this was a
> CPU or bios problem. Maybe both. On AMD forum the topic is still active even
> after bios update.
[1]: https://bbs.archlinux.org/viewtopic.php?pid=1954703#p1954703

However, I've seen symptoms on a BIOS version that was released after that post was made (F33g, the latest being F33h), which I flashed before I started using the computer.

I can contact AMD support as well, if we think it would be useful, even if they end up telling me to use a "supported operating system" again :P I think the RMA period on my CPU is up (blame the graphics-card market taking a billion years to supply me with the last part I needed), so I'm hoping this is solvable at the kernel or microcode level.

(In reply to meedalexa from comment #25)
> I can contact AMD support as well, if we think it would be useful, even if
> they end up telling me to use a "supported operating system" again :P I
> think the RMA period on my CPU is up (blame the graphics-card market taking
> a billion years to supply me with the last part I needed), so I'm hoping
> this is solvable at the kernel or microcode level.

I think it would be good if more people brought it to their attention. I don't think the RMA period is up. The warranty period should be 3 years IIRC.

(In reply to Borislav Petkov from comment #24)
> > Another thing I wonder about: the MCE is reported by the Instruction
> > Fetch Unit. That means it happens when code execution jumps to a memory
> > address that's not in the L2 cache, doesn't it?
>
> No, the instruction cache unit caches cachelines of 64 bytes of size
> which contain instructions. If a memory address it jumps to happens to
> not be in it - which can happen although the prefetchers are pretty
> aggressive - then you get a cache miss and that cacheline is fetched
> into L2 and then into L1 for executing.

That was of course the instruction *cache* - the instruction fetch unit (IFU) steers which cachelines go into the instruction cache. Sorry for the confusion.

And the MCE is reported as an IFU MCE probably because there the poison check is done or reported.
It all depends on how the microarchitecture has been designed but all in all, it doesn't matter in this case.

(In reply to meedalexa from comment #25)
> To make sure my understanding is correct: this would isolate the MCEs to the
> process that experiences the issue, and kill that process instead of
> resetting the system? (I'm guessing it would still be a reset if this
> happens in kernel code?)

Exactly.

> I think that could be useful, but I would still be wondering what the
> root cause is.

Well, is the box stable if you leave your BIOS settings to default and don't fiddle with DRAM voltages?

(In reply to Misha Nasledov from comment #26)
> (In reply to meedalexa from comment #25)
> > I can contact AMD support as well, if we think it would be useful, even if
> > they end up telling me to use a "supported operating system" again :P I
> > think the RMA period on my CPU is up (blame the graphics-card market taking
> > a billion years to supply me with the last part I needed), so I'm hoping
> > this is solvable at the kernel or microcode level.
>
> I think it would be good if more people brought it to their attention. I
> don't think the RMA period is up. The warranty period should be 3 years IIRC

You're probably right about the RMA period. I should probably check into that and see if I win the silicon lottery the second time around. Though of course it'll be rough to go for however long it takes without a CPU...

And yes, I'll contact AMD support this week.

(In reply to Borislav Petkov from comment #28)
> (In reply to meedalexa from comment #25)
> > I think that could be useful, but I would still be wondering what the
> > root cause is.
>
> Well, is the box stable if you leave your BIOS settings to default and
> don't fiddle with DRAM voltages?

No, but it is stable if I disable Core Performance Boost in my BIOS and leave voltages untouched.
(I also have my RAM's XMP profile turned on; I haven't checked if my system is stable if I disable XMP and re-enable Core Performance Boost.)

(In reply to meedalexa from comment #29)
> You're probably right about the RMA period. I should probably check into
> that and see if I win the silicon lottery the second time around. Though of
> course it'll be rough to go for however long it takes without a CPU...

I RMA'd an 1800X years ago and asked them to do an advanced RMA. They sent me the replacement in advance.

(In reply to meedalexa from comment #29)
> No, but it is stable if I disable Core Performance Boost in my BIOS and
> leave voltages untouched. (I also have my RAM's XMP profile turned on; I
> haven't checked if my system is stable if I disable XMP and re-enable Core
> Performance Boost.)

This thing here:

https://yourbusiness.azcentral.com/enable-xmp-amd-board-9989.html

talks about XMP being incompatible with AMD and AMD boards having their own memory profiles called AMP. If your BIOS is enabling Intel XMPs on an AMD board, I wouldn't be surprised if the DRAM chips are running at an incompatible setting.

Regardless, you could turn off your XMP profile and all the other overclocking settings in the BIOS and see if it still causes MCEs with the default setting and CPB enabled.

Nowadays AMD boards just read the XMP profile, which apparently wasn't possible before, due to the proprietary nature of XMP. They just call it differently, e.g. D.O.C.P. on ASUS boards.

As someone mentioned, there are very few RAM modules with a 3200 MHz JEDEC profile, which is the standard frequency, so enabling XMP is basically required. Everything above 3200 MHz is overclocking, though.

With default settings and DOCP disabled, resulting in 2666 MHz @ 1.2V and very slow timings, my system is less stable than with DOCP enabled.

I noticed that enabling DOCP causes the SoC voltage to be set to 1.1V.
When decreasing DRAM frequency to 3200 MHz and voltage to the usual 1.35V, SoC voltage stays at that. After setting it manually and then changing it back to auto, it's 1.0V. Using offset mode yields totally weird results. So the logic implemented by ASUS seems to be a bit buggy.

Core Performance Boost is the AMD equivalent of Turbo Boost. Turning it off costs a lot of performance, because CPU frequency is fixed at the base frequency. But yes, it's stable without it, which is consistent with the frequency invariance introduced in 5.11 making it worse. Disabling Precision Boost Overdrive, which is AMD's automatic overclocking that's enabled by default, doesn't help either.

I guess what we're experiencing are the downsides of a new approach: instead of advertising and guaranteeing a base performance, where you have some overclocking potential if you're lucky, AMD now tries to get the optimum out of each individual processor with smart firmware, and sometimes it doesn't work very well. Then manually tweaking settings is your only option.

(In reply to alan.loewe from comment #32)
> Nowadays AMD boards just read the XMP profile, which apparently wasn't
> possible before, due to the proprietary nature of XMP. They just call it
> differently, e.g. D.O.C.P. on ASUS boards.
>
> As someone mentioned, there are very few RAM modules with a 3200 MHz JEDEC
> profile, which is the standard frequency, so enabling XMP is basically
> required. Everything above 3200 MHz is overclocking, though.
>
> With default settings and DOCP disabled, resulting in 2666 MHz @ 1.2V and
> very slow timings, my system is less stable than with DOCP enabled.

Interesting. I've been playing Minecraft for a couple of hours with XMP/DOCP off and CPB on and it's working fine. (I forget what my board calls it, if it uses XMP or something anodyne like memory profiles.) Under prior conditions it would have crashed by now, though I'm still not calling it stable just yet.

My RAM is also 3200 MHz with XMP, 2666 MHz without.
> I noticed that enabling DOCP causes the SoC voltage to be set to 1.1V. When
> decreasing DRAM frequency to 3200 MHz and voltage to the usual 1.35V, SoC
> voltage stays at that. After setting it manually and then changing it back
> to auto, it's 1.0V. Using offset mode yields totally weird results. So the
> logic implemented by ASUS seems to be a bit buggy.

My board is Gigabyte. My SOC Vcore was 1.0 with XMP on or off.

So it seems there are a few potential workarounds right now:
- Disable XMP, though it might not work on some boards
- Fix your voltages if they're weird, though they might not be weird
- Disable Core Performance Boost, though goodbye performance
- RMA and get a new processor, based on the Arch Forums thread and my vague memory of earlier web searching of this problem

(In reply to meedalexa from comment #33)
> My RAM is also 3200 MHz with XMP, 2666 MHz without.

The XMP profile of mine is 3600 MHz @ 1.45V. Increasing the SoC voltage is actually recommended in that case, and the BIOS seems to do that automatically, but too much. With standard SoC voltage the XMP profile was stable until cloning my disk. Before and now I still use DOCP for the timings and manually adjust the frequency to non-overclocking 3200 MHz and voltage to a bit less than the usual 1.35V, but then the BIOS doesn't automatically revert to the standard SoC voltage. So this hint is for people with overclocking RAM. 3600 MHz is often recommended as the sweet spot, and thus quite common, but many are unaware that AMD only guarantees 3200 MHz.

The difference is quite noticeable in some benchmarks, so running it at 2666 MHz is not a satisfactory solution, either. By the way, I tried that only once a few months ago with a much older BIOS, and got a non-MCE crash really quick. It could be stable now.

With XMP off and Core Performance Boost on, it seemed like it was working... until today, when it crashed in the middle of a Minecraft game.
Switched back to the known-good settings for now: XMP on, CPB off.

I don't think that XMP, CPB, etc. are related. These can cause crashes, but in my experience they cause different MCEs.

I'm still experiencing the L2 cache poisoning with my 5900X. I'm running the 5.10 kernel from Debian, currently 5.10.46. The other day I actually experienced this MCE without a hard crash. The process was killed and the system kept running. See logs https://pastebin.com/74pybLyT

I'm at the point where I want to try swapping another CPU into this rig. I have a spare Ryzen 3600 that should work. I will leave everything else the same. If I don't see this crash anymore (I will give it up to 2 weeks), then it seems like a reasonable conclusion that it has to do with the 5900X.

I don't have reason to believe it's a Linux-related issue, either. I suspect that this may be an RMA case for AMD.

I have the same issue with an AMD 5900X (BIOS with AMD AGESA ComboAM4v2 1.2.0.2), Fedora Server 34 with kernel 5.11.x+. Currently running on kernel 5.10.20 and it is still unstable, but better: it crashes about once every 14 days on average (it is running 24/7). With kernel 5.11+ it crashed every 2-4 hours on average. I am running with RAM on default settings without XMP: 1.2V, 2400 MHz.

On our other system with CentOS, kernel 4.18.0 and an AMD 3900X, same motherboard and same RAM settings, it can run without crashes for 170+ days.

Error on 5900X before crash:
mce: [Hardware Error]: PROCESSOR 2:a20f10 TIME 1629710153 SOCKET 0 APIC 1b microcode a201009
mce: [Hardware Error]: TSC 0 ADDR 7c6b87b80 MISC d012000000000000 IPID 100b000000000
mce: [Hardware Error]: CPU 23: Machine Check: 0 Bank 1: bc800800060c0859

If you have some suggestions on how to fix this, please send them.

I finally resolved my issue!
I went by the recommendation of the first comment on this page: https://community.amd.com/t5/processors/ryzen-5900x-system-constantly-crashing-restarting-whea-logger-id/td-p/423321/page/84

I set my Curve Optimizer to +8 on all cores and set the EDC limit to 200A. I haven't had a crash in almost a month.

I noticed that in my case it's always the same two cores that crash. Check it with:

journalctl | grep 'mce.*CPU'

Curve Optimizer looks like a promising approach, but positive values cost performance (more voltage per frequency -> more heat -> less boost time), so it might be worth trying to limit it to the affected cores.

As already said, I had limited success with manually adjusting the SoC voltage, which reduced crashes to about one every 1-2 weeks.

After a BIOS update a few days ago I'm trying default settings again with XMP enabled. Some releases promised "improved stability", and the automatic SoC voltage is now lower. No crashes so far. Older BIOS versions crashed 2-3 times a day. If it's still unstable in the long run, I'll try the Curve Optimizer approach, too.

Sorry, maybe this is not relevant, but it might be useful for people who end up here as I did. I have a laptop with an AMD Ryzen 5 5600H and an NVIDIA external GPU and experienced similar (though not that random) reboots right until I blacklisted the "nouveau" kernel module. Diagnosing the issue is greatly complicated by the fact that nothing useful gets logged to dmesg or the console before the spontaneous reboot. So initially it looks like the OP's bug, but without the relevant dmesg logs.

Hardware: HP Victus by HP Laptop 16-e0xxx with AMD Ryzen 7 5800H and 64 GB of RAM.

Random shutdowns on low load with any 6.x kernel (and 5.x, but not widely tested) on any distribution with any CPU frequency driver, with or without a graphical environment. System stable on high load.

I tried lots of things, and finally fixed it by disabling the C6 state with the zenstates utility.
With C6 disabled, the system is stable on any power profile (amd p-states).

Before, it was stable only on powersave (power) and unstable on any other (balance power, balance performance, performance); the more performant the profile, the more unstable the system.

(In reply to Kira from comment #42)
> Hardware: HP Victus by HP Laptop 16-e0xxx with AMD Ryzen 7 5800H 64G of ram.
>
> Random shutdowns on low load with any 6.x kernels (and 5.x but not wide
> tested) on any distributions with any cpu freq drivers with or without
> graphical environment.
> System stable on high load.
>
> I tried lots of thing, and finally fixed it by disabling C6 state with
> zenstates utility.
>
> With C6 disabled system stable on any power profile (amd p-states)
>
> Before, it was stable only on powersave(power) and unstable on any other
> (balance power, balance performance, performance) the more performant
> profile used the more unstable system is.

It seems that the instability was resolved in the 6.9 kernel, but it appeared again in the 6.10 kernel.

One thing: with the 6.9 kernel the laptop was stable only when the 165Hz screen refresh rate was used. It was unstable at 60Hz (shutdown within a few minutes when playing YouTube on AC power).

I'm curious whether using Windows also results in random reboots for the affected people. Could it be a hardware issue?

(In reply to Artem S. Tashkinov from comment #44)
> I'm curious whether using Windows for the affected people also results in
> random reboots. Could it be a hardware issue?

I tested Windows 10 and 11: completely stable. Also, just judging by fan speed behavior, Windows uses a different approach to power management.

I believe it's a hardware quality issue that has been mitigated by firmware. There was a BIOS update which included AGESA 1.2.0.6b, if I remember correctly, which was a game changer. The SoC voltage adjustment I used to apply as a remedy became totally unstable, but default settings (DOCP enabled) turned out to be (almost) stable.
Generally, I figured that after changing relevant BIOS settings or updating the BIOS, the system crashes two to three times relatively soon, but then is stable for relatively long... until it crashes two times again within a relatively short time period... i.e. the cycle repeats. After the mentioned update the cycle was 6-10 weeks (system running 10+ hours/day). That some of the MCEs became software containable and go unnoticed probably helps, too.

In the beginning I had the same MCEs while using Windows. Now I can't tell, because I'm using it for only a few hours/month.

But recently, the crashes made a comeback. Below is my system journal, which starts at May 29, grepped for "Hardware Error". However, I don't really care anymore, because I'll get a new PC in autumn. On the other hand, Intel currently has issues with degrading CPUs, and AMD just delayed their Ryzen 9000 launch due to quality issues. Damn.

```
Jun 17 00:23:17 myredactedhost kernel: mce: [Hardware Error]: Machine check events logged
Jun 17 00:23:17 myredactedhost kernel: [Hardware Error]: Corrected error, no action required.
Jun 17 00:23:17 myredactedhost kernel: [Hardware Error]: CPU:1 (19:21:0) MC23_STATUS[Over|CE|-|AddrV|PCC|-|-|Poison|Scrub]: 0xc7498b708320ed04
Jun 17 00:23:17 myredactedhost kernel: [Hardware Error]: Error Addr: 0x0000000000000000
Jun 17 00:23:17 myredactedhost kernel: [Hardware Error]: IPID: 0x0000000000000000
Jun 17 00:23:17 myredactedhost kernel: [Hardware Error]: Bank 23 is reserved.
Jun 17 00:23:17 myredactedhost kernel: [Hardware Error]: cache level: RESV, tx: DATA
Jun 30 18:18:07 myredactedhost kernel: mce: [Hardware Error]: Machine check events logged
Jun 30 18:18:07 myredactedhost kernel: [Hardware Error]: Corrected error, no action required.
Jun 30 18:18:07 myredactedhost kernel: [Hardware Error]: CPU:1 (19:21:0) MC19_STATUS[Over|CE|MiscV|-|PCC|SyndV|-|-|-]: 0xdb3100e9db1de95b
Jun 30 18:18:07 myredactedhost kernel: [Hardware Error]: IPID: 0x0000000000000000, Syndrome: 0x0000000000000000
Jun 30 18:18:07 myredactedhost kernel: [Hardware Error]: Bank 19 is reserved.
Jun 30 18:18:07 myredactedhost kernel: [Hardware Error]: cache level: L3/GEN, tx: GEN
Jul 04 03:41:23 myredactedhost kernel: mce: [Hardware Error]: Machine check events logged
Jul 04 03:41:23 myredactedhost kernel: [Hardware Error]: Corrected error, no action required.
Jul 04 03:41:23 myredactedhost kernel: [Hardware Error]: CPU:1 (19:21:0) MC24_STATUS[Over|CE|MiscV|AddrV|PCC|-|CECC|-|Poison|-]: 0xdf8948b4eb24048b
Jul 04 03:41:23 myredactedhost kernel: [Hardware Error]: Error Addr: 0x0000000000000000
Jul 04 03:41:23 myredactedhost kernel: [Hardware Error]: IPID: 0x0000000000000000
Jul 04 03:41:23 myredactedhost kernel: [Hardware Error]: Bank 24 is reserved.
Jul 04 03:41:23 myredactedhost kernel: [Hardware Error]: cache level: L3/GEN, tx: GEN
Jul 12 07:43:59 myredactedhost kernel: mce: [Hardware Error]: Machine check events logged
Jul 12 07:43:59 myredactedhost kernel: [Hardware Error]: Uncorrected, software containable error.
Jul 12 07:43:59 myredactedhost kernel: [Hardware Error]: CPU:15 (19:21:0) MC1_STATUS[-|UE|MiscV|AddrV|-|TCC|-|-|Poison|-]: 0xbc800800060c0859
Jul 12 07:43:59 myredactedhost kernel: [Hardware Error]: Error Addr: 0x00000006a76d3080
Jul 12 07:44:00 myredactedhost kernel: [Hardware Error]: IPID: 0x000100b000000000
Jul 12 07:44:00 myredactedhost kernel: [Hardware Error]: Instruction Fetch Unit Ext. Error Code: 12
Jul 12 07:44:00 myredactedhost kernel: [Hardware Error]: cache level: L1, mem/io: IO, mem-tx: IRD, part-proc: SRC (no timeout)
Jul 12 14:52:49 myredactedhost kernel: mce: [Hardware Error]: Machine check events logged
Jul 12 14:52:49 myredactedhost kernel: [Hardware Error]: Uncorrected, software containable error.
Jul 12 14:52:49 myredactedhost kernel: [Hardware Error]: CPU:14 (19:21:0) MC1_STATUS[-|UE|MiscV|AddrV|-|TCC|-|-|Poison|-]: 0xbc800800060c0859
Jul 12 14:52:49 myredactedhost kernel: [Hardware Error]: Error Addr: 0x00000009b31e8d00
Jul 12 14:52:49 myredactedhost kernel: [Hardware Error]: IPID: 0x000100b000000000
Jul 12 14:52:49 myredactedhost kernel: [Hardware Error]: Instruction Fetch Unit Ext. Error Code: 12
Jul 12 14:52:49 myredactedhost kernel: [Hardware Error]: cache level: L1, mem/io: IO, mem-tx: IRD, part-proc: SRC (no timeout)
Jul 13 10:39:11 myredactedhost kernel: mce: [Hardware Error]: Machine check events logged
Jul 13 10:39:11 myredactedhost kernel: [Hardware Error]: Uncorrected, software containable error.
Jul 13 10:39:11 myredactedhost kernel: [Hardware Error]: CPU:14 (19:21:0) MC1_STATUS[-|UE|MiscV|AddrV|-|TCC|-|-|Poison|-]: 0xbc800800060c0859
Jul 13 10:39:11 myredactedhost kernel: [Hardware Error]: Error Addr: 0x000000017bc2aa00
Jul 13 10:39:11 myredactedhost kernel: [Hardware Error]: IPID: 0x000100b000000000
Jul 13 10:39:11 myredactedhost kernel: [Hardware Error]: Instruction Fetch Unit Ext. Error Code: 12
Jul 13 10:39:11 myredactedhost kernel: [Hardware Error]: cache level: L1, mem/io: IO, mem-tx: IRD, part-proc: SRC (no timeout)
Jul 17 14:22:01 myredactedhost kernel: mce: [Hardware Error]: Machine check events logged
Jul 17 14:22:01 myredactedhost kernel: [Hardware Error]: Uncorrected, software containable error.
Jul 17 14:22:01 myredactedhost kernel: [Hardware Error]: CPU:13 (19:21:0) MC1_STATUS[-|UE|MiscV|AddrV|-|TCC|-|-|Poison|-]: 0xbc800800060c0859
Jul 17 14:22:01 myredactedhost kernel: [Hardware Error]: Error Addr: 0x000000026c136880
Jul 17 14:22:01 myredactedhost kernel: [Hardware Error]: IPID: 0x000100b000000000
Jul 17 14:22:01 myredactedhost kernel: [Hardware Error]: Instruction Fetch Unit Ext. Error Code: 12
Jul 17 14:22:01 myredactedhost kernel: [Hardware Error]: cache level: L1, mem/io: IO, mem-tx: IRD, part-proc: SRC (no timeout)
Jul 18 02:01:39 myredactedhost kernel: mce: [Hardware Error]: Machine check events logged
Jul 18 02:01:39 myredactedhost kernel: [Hardware Error]: Uncorrected, software containable error.
Jul 18 02:01:39 myredactedhost kernel: [Hardware Error]: CPU:14 (19:21:0) MC1_STATUS[-|UE|MiscV|AddrV|-|TCC|-|-|Poison|-]: 0xbc800800060c0859
Jul 18 02:01:39 myredactedhost kernel: [Hardware Error]: Error Addr: 0x000000090b190f00
Jul 18 02:01:39 myredactedhost kernel: [Hardware Error]: IPID: 0x000100b000000000
Jul 18 02:01:39 myredactedhost kernel: [Hardware Error]: Instruction Fetch Unit Ext. Error Code: 12
Jul 18 02:01:39 myredactedhost kernel: [Hardware Error]: cache level: L1, mem/io: IO, mem-tx: IRD, part-proc: SRC (no timeout)
Jul 21 22:43:24 myredactedhost kernel: mce: [Hardware Error]: Machine check events logged
Jul 21 22:43:24 myredactedhost kernel: [Hardware Error]: Deferred error, no action required.
Jul 21 22:43:24 myredactedhost kernel: [Hardware Error]: CPU:1 (19:21:0) MC23_STATUS[-|-|-|-|-|-|Deferred|-|-]: 0x9090909090909090
Jul 21 22:43:24 myredactedhost kernel: [Hardware Error]: IPID: 0x0000000000000000
Jul 21 22:43:24 myredactedhost kernel: [Hardware Error]: Bank 23 is reserved.
Jul 21 22:43:24 myredactedhost kernel: [Hardware Error]: cache level: RESV, tx: INSN
Jul 22 02:42:20 myredactedhost kernel: mce: [Hardware Error]: Machine check events logged
Jul 22 02:42:20 myredactedhost kernel: [Hardware Error]: Corrected error, no action required.
Jul 22 02:42:20 myredactedhost kernel: [Hardware Error]: CPU:1 (19:21:0) MC23_STATUS[Over|CE|-|-|-|-|CECC|-|Poison|-]: 0xc0c748010a879ee9
Jul 22 02:42:20 myredactedhost kernel: [Hardware Error]: IPID: 0x0000000000000000
Jul 22 02:42:20 myredactedhost kernel: [Hardware Error]: Bank 23 is reserved.
Jul 22 02:42:20 myredactedhost kernel: [Hardware Error]: cache level: L1, tx: GEN
Jul 24 00:55:15 myredactedhost kernel: mce: [Hardware Error]: Machine check events logged
Jul 24 00:55:15 myredactedhost kernel: [Hardware Error]: Uncorrected, software containable error.
Jul 24 00:55:15 myredactedhost kernel: [Hardware Error]: CPU:30 (19:21:0) MC1_STATUS[-|UE|MiscV|AddrV|-|TCC|-|-|Poison|-]: 0xbc800800060c0859
Jul 24 00:55:15 myredactedhost kernel: [Hardware Error]: Error Addr: 0x000000029795bc00
Jul 24 00:55:15 myredactedhost kernel: [Hardware Error]: IPID: 0x000100b000000000
Jul 24 00:55:15 myredactedhost kernel: [Hardware Error]: Instruction Fetch Unit Ext. Error Code: 12
Jul 24 00:55:15 myredactedhost kernel: [Hardware Error]: cache level: L1, mem/io: IO, mem-tx: IRD, part-proc: SRC (no timeout)
Jul 26 06:09:11 myredactedhost kernel: mce: [Hardware Error]: Machine check events logged
Jul 26 06:09:11 myredactedhost kernel: [Hardware Error]: Uncorrected, software restartable error.
Jul 26 06:09:11 myredactedhost kernel: [Hardware Error]: CPU:30 (19:21:0) MC0_STATUS[-|UE|MiscV|AddrV|-|-|-|-|Poison|-]: 0xbc00080001010135
Jul 26 06:09:11 myredactedhost kernel: [Hardware Error]: Error Addr: 0x00000005c5e38360
Jul 26 06:09:11 myredactedhost kernel: [Hardware Error]: IPID: 0x001000b000000000
Jul 26 06:09:11 myredactedhost kernel: [Hardware Error]: Load Store Unit Ext. Error Code: 1
Jul 26 06:09:11 myredactedhost kernel: [Hardware Error]: cache level: L1, tx: DATA, mem-tx: DRD
Jul 26 09:44:06 myredactedhost kernel: mce: [Hardware Error]: Machine check events logged
Jul 26 09:44:06 myredactedhost kernel: [Hardware Error]: Uncorrected, software containable error.
Jul 26 09:44:06 myredactedhost kernel: [Hardware Error]: CPU:30 (19:21:0) MC1_STATUS[-|UE|MiscV|AddrV|-|TCC|-|-|Poison|-]: 0xbc800800060c0859
Jul 26 09:44:06 myredactedhost kernel: [Hardware Error]: Error Addr: 0x000000008d85b400
Jul 26 09:44:06 myredactedhost kernel: [Hardware Error]: IPID: 0x000100b000000000
Jul 26 09:44:06 myredactedhost kernel: [Hardware Error]: Instruction Fetch Unit Ext. Error Code: 12
Jul 26 09:44:06 myredactedhost kernel: [Hardware Error]: cache level: L1, mem/io: IO, mem-tx: IRD, part-proc: SRC (no timeout)
```
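For anyone who wants to check whether their errors cluster on particular cores (as suggested earlier with `journalctl | grep 'mce.*CPU'`), here is a minimal Python sketch that tallies the `MCn_STATUS` lines from a journal excerpt like the one above by CPU and bank. The function name `tally_mce` and the `SAMPLE` list are made up for illustration, and the regex only assumes the decoded line format shown in this journal; the raw `mce: ... CPU 23: Machine Check: 0 Bank 1: ...` format posted earlier in the thread would need a different pattern.

```python
import re
from collections import Counter

# Matches decoded kernel lines such as:
#   ... [Hardware Error]: CPU:14 (19:21:0) MC1_STATUS[-|UE|...]: 0xbc800800060c0859
# (format assumed from the journal excerpt in this thread)
STATUS_RE = re.compile(
    r"\[Hardware Error\]: CPU:(?P<cpu>\d+) \(.*?\) "
    r"MC(?P<bank>\d+)_STATUS\[(?P<flags>[^\]]*)\]: (?P<status>0x[0-9a-fA-F]+)"
)


def tally_mce(lines):
    """Count status lines per (cpu, bank, is_uncorrected) tuple."""
    counts = Counter()
    for line in lines:
        m = STATUS_RE.search(line)
        if m is None:
            continue  # not a status line (e.g. "Error Addr", "IPID")
        uncorrected = "UE" in m.group("flags").split("|")
        counts[(int(m.group("cpu")), int(m.group("bank")), uncorrected)] += 1
    return counts


# Three status lines copied from the journal excerpt, for illustration only.
SAMPLE = [
    "Jul 12 14:52:49 myredactedhost kernel: [Hardware Error]: CPU:14 (19:21:0) "
    "MC1_STATUS[-|UE|MiscV|AddrV|-|TCC|-|-|Poison|-]: 0xbc800800060c0859",
    "Jul 13 10:39:11 myredactedhost kernel: [Hardware Error]: CPU:14 (19:21:0) "
    "MC1_STATUS[-|UE|MiscV|AddrV|-|TCC|-|-|Poison|-]: 0xbc800800060c0859",
    "Jul 22 02:42:20 myredactedhost kernel: [Hardware Error]: CPU:1 (19:21:0) "
    "MC23_STATUS[Over|CE|-|-|-|-|CECC|-|Poison|-]: 0xc0c748010a879ee9",
]

for (cpu, bank, ue), n in sorted(tally_mce(SAMPLE).items()):
    kind = "uncorrected" if ue else "corrected/deferred"
    print(f"CPU {cpu:>3}  bank {bank:>2}  {kind:<18}  {n}")
```

To run it on a real system, you could replace `SAMPLE` with the lines of your own journal output (for example, saved to a file and read with `open(...)`). If the uncorrected bank 1 errors keep landing on the same few CPU numbers, that would support limiting a positive Curve Optimizer offset to the affected cores.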