Created attachment 296003 [details]
2021-03-22_journalctl_mce_hardware_error_platform_security_processor.txt

My system unexpectedly reboots and produces hardware errors in the system journal:

mcelog[1080]: Hardware event. This is not a software error.
mcelog[1080]: MCE 0
mcelog[1080]: CPU 0 BANK 25
mcelog[1080]: MISC d01a005d00000000
mcelog[1080]: TIME 1616276961 Sat Mar 20 17:49:21 2021
mcelog[1080]: STATUS 98004000003e0000 MCGSTATUS 0
mcelog[1080]: MCGCAP 11c APICID 0 SOCKETID 0
mcelog[1080]: MICROCODE 8701021
mcelog[1080]: CPUID Vendor AMD Family 23 Model 1 Step 0
kernel: mce: [Hardware Error]: Machine check events logged
kernel: [Hardware Error]: Corrected error, no action required.
kernel: [Hardware Error]: CPU:0 (17:71:0) MC25_STATUS[-|CE|MiscV|-|-|-|-|CECC|-|-|-]: 0x98004000003e0000
kernel: [Hardware Error]: IPID: 0x000100ff03830400
kernel: [Hardware Error]: Platform Security Processor Ext. Error Code: 62
kernel: [Hardware Error]: cache level: RESV, tx: INSN

Troubleshooting steps taken in the system BIOS settings:
- Advanced > AMD fTPM configuration > AMD CPU fTPM: Disabled -> Enabled
  - Unexpected reboots continued to occur
- Advanced > PCI Subsystem Settings > SR-IOV Support: Enabled -> Disabled
  - Unexpected reboots continued to occur
- Advanced > PCI Subsystem Settings > Above 4G Decoding: Enabled -> Disabled
  - System seemed to have stabilized, no unexpected reboots

OS: openSUSE Tumbleweed 20210318; also observed using Fedora 33
CPU: AMD Ryzen 5 3600X
Graphics Card: AMD Radeon RX 5700
Motherboard: ASUS Pro WS X570-ACE (BIOS version 3302)
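For readers who want to check the kernel's decoding by hand, here is a minimal sketch that unpacks the STATUS value from the report. The bit positions are an assumption based on the MCA_STATUS layout used by the Linux MCE code (Val=63, Overflow=62, UC=61, En=60, MiscV=59, AddrV=58, PCC=57, CECC=46, UECC=45, extended error code in bits 21:16); verify them against the AMD documentation for your CPU family. The function name decode_mca_status is hypothetical.

```python
# Assumed MCA_STATUS bit positions (not taken from this report; check the
# AMD PPR for the CPU family before relying on them).
STATUS_FLAGS = {
    63: "Val",
    62: "Overflow",
    61: "UC",
    60: "En",
    59: "MiscV",
    58: "AddrV",
    57: "PCC",
    46: "CECC",
    45: "UECC",
}


def decode_mca_status(status):
    """Return the set flags and error-code fields of an MCA STATUS value."""
    flags = [name for bit, name in STATUS_FLAGS.items() if (status >> bit) & 1]
    return {
        "flags": flags,
        # UC clear means the error was corrected ("CE" in the kernel output).
        "corrected": "UC" not in flags,
        # Assumed extended error code field: bits 21:16.
        "ext_error_code": (status >> 16) & 0x3F,
        # Low 16 bits: architectural error code.
        "error_code": status & 0xFFFF,
    }


if __name__ == "__main__":
    print(decode_mca_status(0x98004000003E0000))
```

Under these assumptions the value decodes to a valid, enabled, corrected error with MiscV and CECC set and an extended error code of 62, which agrees with the "MC25_STATUS[-|CE|MiscV|-|-|-|-|CECC|-|-|-]" and "Ext. Error Code: 62" lines in the kernel output.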
That should tell you: "Corrected error, no action required." I.e., nothing has been corrupted and you can continue using your CPU merrily. If you start seeing a lot of those, though, you could RMA your CPU. HTH.
(In reply to Borislav Petkov from comment #1)
> That should tell you: "Corrected error, no action required." I.e., nothing
> has been corrupted and you can continue using your CPU merrily.
>
> If you start seeing a lot of those, though, you could RMA your CPU.
>
> HTH.

I don't think this should be marked as resolved, as the system unexpectedly reboots shortly after this error is logged. Something is wrong.

This just happened to me again a few minutes ago.

New OS version: openSUSE Tumbleweed 20210514
New motherboard BIOS version: 3402

Something that might be worth noting: this seems to happen some time after I deliberately reboot the system myself, at most about three times, and then it stops happening for the rest of the week (I reboot the system weekly).

May 16 20:48:04 localhost.localdomain mcelog[1160]: Hardware event. This is not a software error.
May 16 20:48:04 localhost.localdomain mcelog[1160]: MCE 0
May 16 20:48:04 localhost.localdomain mcelog[1160]: CPU 0 BANK 25
May 16 20:48:04 localhost.localdomain mcelog[1160]: MISC d01a000200000000
May 16 20:48:04 localhost.localdomain mcelog[1160]: TIME 1621212484 Sun May 16 20:48:04 2021
May 16 20:48:04 localhost.localdomain mcelog[1160]: STATUS 98004000003e0000 MCGSTATUS 0
May 16 20:48:04 localhost.localdomain mcelog[1160]: MCGCAP 11c APICID 0 SOCKETID 0
May 16 20:48:04 localhost.localdomain mcelog[1160]: MICROCODE 8701021
May 16 20:48:04 localhost.localdomain mcelog[1160]: CPUID Vendor AMD Family 23 Model 1 Step 0
May 16 20:48:04 localhost.localdomain kernel: mce: [Hardware Error]: Machine check events logged
May 16 20:48:04 localhost.localdomain kernel: [Hardware Error]: Corrected error, no action required.
May 16 20:48:04 localhost.localdomain kernel: [Hardware Error]: CPU:0 (17:71:0) MC25_STATUS[-|CE|MiscV|-|-|-|-|CECC|-|-|-]: 0x98004000003e0000
May 16 20:48:04 localhost.localdomain kernel: [Hardware Error]: IPID: 0x000100ff03830400
May 16 20:48:04 localhost.localdomain kernel: [Hardware Error]: Platform Security Processor Ext. Error Code: 62
May 16 20:48:04 localhost.localdomain kernel: [Hardware Error]: cache level: RESV, tx: INSN
-- Reboot --
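To compare events across reports, it can help to pull the numeric fields out of the mcelog lines. The following is a rough sketch, not part of mcelog itself: the function name parse_mcelog and the regular expression are hypothetical and only assume the "mcelog[PID]: FIELD value" line shape visible in the journal excerpts in this thread.

```python
import re
from datetime import datetime, timezone

# Matches the start of a MISC, TIME or STATUS line as printed in this
# thread. The line shape is assumed from the excerpts here, not from an
# mcelog format specification.
FIELD_RE = re.compile(r"mcelog\[\d+\]: (MISC|TIME|STATUS) ([0-9a-fA-F]+)")


def parse_mcelog(lines):
    """Collect MISC, STATUS and TIME from the lines of one mcelog record."""
    record = {}
    for line in lines:
        match = FIELD_RE.search(line)
        if not match:
            continue
        key, value = match.groups()
        if key == "TIME":
            # TIME is a Unix epoch in seconds; keep it timezone-aware in UTC.
            record["time_utc"] = datetime.fromtimestamp(int(value), tz=timezone.utc)
        else:
            # MISC and STATUS are printed as bare hexadecimal.
            record[key.lower()] = int(value, 16)
    return record


if __name__ == "__main__":
    sample = [
        "May 16 20:48:04 localhost.localdomain mcelog[1160]: MISC d01a000200000000",
        "May 16 20:48:04 localhost.localdomain mcelog[1160]: TIME 1621212484 Sun May 16 20:48:04 2021",
        "May 16 20:48:04 localhost.localdomain mcelog[1160]: STATUS 98004000003e0000 MCGSTATUS 0",
    ]
    print(parse_mcelog(sample))
```

Run on the two records posted so far, this shows the same STATUS (0x98004000003e0000) both times, while MISC differs (d01a005d00000000 in March, d01a000200000000 in May). The epoch in the May record corresponds to 00:48:04 UTC on May 17, so the journal timestamps are local time.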
Forgot to mention in my reply that today's error occurred on kernel 5.12.3.
(In reply to Martin from comment #2)
> I don't think this should be marked as resolved, as the system unexpectedly
> reboots shortly after this error is logged. Something is wrong.

Probably a follow-up hw error which causes the system to triple-fault.
Can you upload dmesg from *right* *after* the reboot?

Looking at your BIOS:

DMI: ASUS System Product Name/Pro WS X570-ACE, BIOS 3302 03/05/2021

it looks pretty new but you could still check whether there's newer one.

Also, try leaving your BIOS settings to their defaults and only enable
those which you really need. It is very possible that some of the
settings are not even properly validated.

> Forgot to mention in my reply that today's error occurred on kernel
> 5.12.3.

Hardware resets are unlikely to be caused by the kernel.

HTH.
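When looking through a dmesg dump saved right after the reboot, a small filter can make any machine-check lines stand out. This is only a sketch: the function name hardware_error_lines is hypothetical, and the marker strings are assumed from the kernel messages quoted earlier in this thread ("[Hardware Error]" and "mce:"), so other error sources may need additional markers.

```python
# Marker substrings assumed from the kernel output quoted in this thread.
MARKERS = ("[Hardware Error]", "mce:")


def hardware_error_lines(dmesg_text):
    """Return the lines of a saved dmesg dump that mention hardware errors."""
    return [
        line.strip()
        for line in dmesg_text.splitlines()
        if any(marker in line for marker in MARKERS)
    ]


if __name__ == "__main__":
    # Illustrative input built from lines quoted in this thread; it is not
    # the content of any attachment.
    sample = "\n".join([
        "DMI: ASUS System Product Name/Pro WS X570-ACE, BIOS 3302 03/05/2021",
        "mce: [Hardware Error]: Machine check events logged",
        "[Hardware Error]: Corrected error, no action required.",
    ])
    for line in hardware_error_lines(sample):
        print(line)
```

An empty result does not prove that nothing happened; it only means no line with these markers was present in the dump.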
Created attachment 296809 [details] dmesg output after an unexpected reboot with hardware error reported in journalctl before the reboot
(In reply to Borislav Petkov from comment #4)
> (In reply to Martin from comment #2)
> > I don't think this should be marked as resolved, as the system unexpectedly
> > reboots shortly after this error is logged. Something is wrong.
>
> Probably a follow-up hw error which causes the system to triple-fault.
> Can you upload dmesg from *right* *after* the reboot?
>
> Looking at your BIOS:
>
> DMI: ASUS System Product Name/Pro WS X570-ACE, BIOS 3302 03/05/2021
>
> it looks pretty new but you could still check whether there's newer one.
>
> Also, try leaving your BIOS settings to their defaults and only enable
> those which you really need. It is very possible that some of the
> settings are not even properly validated.
>
> > Forgot to mention in my reply that today's error occurred on kernel
> > 5.12.3.
>
> Hardware resets are unlikely to be caused by the kernel.
>
> HTH.

dmesg output is attached as attachment 296809 [details].

Also note that I upgraded my BIOS since the time I created this issue. It is currently the most recent non-beta BIOS available. New version:

DMI: ASUS System Product Name/Pro WS X570-ACE, BIOS 3402 03/22/2021
The vendor BIOS site says 3402 has received a new AGESA update. It might be fixing something, but it's not like anyone is telling you. ;-\

I don't see anything out of the ordinary in dmesg. I guess you can run the box and see what happens.
Looks forgotten. Feel free to reopen if still of interest.
Apologies, neglected to update the bug as the issue seemed to have gone away at some point (perhaps by setting the BIOS "Power Supply Idle Control" setting to "Typical Current Idle" - I reverted my other attempts at a workaround) and the system had been stable for many, many months.

Today, I upgraded from kernel 5.16.1 to 5.16.2 and noticed that these error messages have reappeared. I have received two of them in 8.5 hours of uptime. I have booted back into kernel 5.16.1 as I expect random reboots if I stay in 5.16.2, where I have started seeing the error messages again (below).

There have been no changes in the hardware configuration of my system since my original message on this bug.

OS: openSUSE Tumbleweed 20220128
Motherboard: ASUS Pro WS X570-ACE (BIOS version 3904 12/14/2021)

kernel: mce: [Hardware Error]: Machine check events logged
kernel: [Hardware Error]: Corrected error, no action required.
kernel: [Hardware Error]: CPU:0 (17:71:0) MC25_STATUS[-|CE|MiscV|-|-|-|-|CECC|-|-|-]: 0x98004000003e0000
kernel: [Hardware Error]: IPID: 0x000100ff03830400
kernel: [Hardware Error]: Platform Security Processor Ext. Error Code: 62
kernel: [Hardware Error]: cache level: RESV, tx: INSN
mcelog[1179]: Hardware event. This is not a software error.
mcelog[1179]: MCE 0
mcelog[1179]: CPU 0 BANK 25
mcelog[1179]: MISC d01a000e00000000
mcelog[1179]: TIME 1643667603 Mon Jan 31 17:20:03 2022
mcelog[1179]: STATUS 98004000003e0000 MCGSTATUS 0
mcelog[1179]: MCGCAP 11c APICID 0 SOCKETID 0
mcelog[1179]: MICROCODE 8701021
mcelog[1179]: CPUID Vendor AMD Family 23 Model 1 Step 0
Do you see any reliable reproducer for those MCEs? I.e., when doing something specific? If so, you could try to bisect between 5.16.1 and 5.16.2 and maybe pinpoint a kernel commit which is potentially causing those... Thx.
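Without a reliable reproducer, each bisect step between 5.16.1 and 5.16.2 needs some soak time before a kernel can be called good. One possible way to keep the decision consistent is sketched below. The function name bisect_verdict and the 24-hour default are hypothetical; the only data point from this thread is two events in 8.5 hours of uptime on 5.16.2, so the threshold is a guess and should be chosen by whoever runs the bisection.

```python
def bisect_verdict(mce_count, uptime_hours, min_clean_hours=24.0):
    """Suggest what to tell `git bisect` for the kernel under test.

    mce_count: number of MCE events seen on this boot.
    uptime_hours: how long this kernel has been running.
    min_clean_hours: hypothetical soak time required before trusting a
        clean run; not a value taken from this thread.
    """
    if mce_count > 0:
        # Any event marks the kernel as affected, regardless of uptime.
        return "bad"
    if uptime_hours >= min_clean_hours:
        return "good"
    # Too early to tell: keep the kernel running before deciding.
    return "undecided"


if __name__ == "__main__":
    # The 5.16.2 observation reported above: two events in 8.5 hours.
    print(bisect_verdict(mce_count=2, uptime_hours=8.5))
```

A "good" verdict from an intermittent problem is never certain, so a wrong answer can send the bisection down the wrong half; a longer soak time lowers that risk at the cost of slower steps.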
Nothing, unfortunately, and the problem again went away after a few more kernel updates...