Bug 212399

Summary: [Hardware Error]: Platform Security Processor Ext. Error Code: 62
Product: Platform Specific/Hardware
Reporter: Martin (am7kimbkv)
Component: x86-64
Assignee: platform_x86_64 (platform_x86_64)
Status: RESOLVED UNREPRODUCIBLE
Severity: normal
CC: bp, pmenzel+bugzilla.kernel.org, richard.tattoli, vkrevs
Priority: P1
Hardware: All
OS: Linux
Kernel Version: 5.12.3, reappeared on 5.16.2
Subsystem:
Regression: No
Bisected commit-id:
Attachments:
- 2021-03-22_journalctl_mce_hardware_error_platform_security_processor.txt
- dmesg output after an unexpected reboot with hardware error reported in journalctl before the reboot

Description Martin 2021-03-22 22:37:16 UTC
Created attachment 296003 [details]
2021-03-22_journalctl_mce_hardware_error_platform_security_processor.txt

My system unexpectedly reboots and produces hardware errors in the system journal:

mcelog[1080]: Hardware event. This is not a software error.
mcelog[1080]: MCE 0
mcelog[1080]: CPU 0 BANK 25
mcelog[1080]: MISC d01a005d00000000
mcelog[1080]: TIME 1616276961 Sat Mar 20 17:49:21 2021
mcelog[1080]: STATUS 98004000003e0000 MCGSTATUS 0
mcelog[1080]: MCGCAP 11c APICID 0 SOCKETID 0
mcelog[1080]: MICROCODE 8701021
mcelog[1080]: CPUID Vendor AMD Family 23 Model 1 Step 0
kernel: mce: [Hardware Error]: Machine check events logged
kernel: [Hardware Error]: Corrected error, no action required.
kernel: [Hardware Error]: CPU:0 (17:71:0) MC25_STATUS[-|CE|MiscV|-|-|-|-|CECC|-|-|-]: 0x98004000003e0000
kernel: [Hardware Error]: IPID: 0x000100ff03830400
kernel: [Hardware Error]: Platform Security Processor Ext. Error Code: 62
kernel: [Hardware Error]: cache level: RESV, tx: INSN
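
For readers who want to check the kernel's decoding by hand, here is a minimal sketch that pulls the flags and the extended error code out of the STATUS value logged above. The bit positions are an assumption based on the usual x86 MCA_STATUS layout with AMD's Scalable MCA extensions (extended error code in bits 21:16); the function name `decode_status` is made up for illustration and is not part of any tool.

```python
# Sketch only: bit positions are assumed from the common MCA_STATUS layout
# (with AMD Scalable MCA extensions). Check the processor documentation
# before relying on them.
STATUS_FLAGS = {
    63: "Val",    # register holds a valid error
    62: "Over",   # an earlier error was overwritten
    61: "UC",     # uncorrected error
    60: "En",     # error reporting enabled for this bank
    59: "MiscV",  # MISC register holds valid data
    58: "AddrV",  # ADDR register holds valid data
    57: "PCC",    # processor context corrupt
    46: "CECC",   # error was corrected by ECC
    45: "UECC",   # error could not be corrected by ECC
}


def decode_status(status: int) -> dict:
    """Split a 64-bit MCA_STATUS value into flags and error codes (hypothetical helper)."""
    flags = [
        name
        for bit, name in sorted(STATUS_FLAGS.items(), reverse=True)
        if (status >> bit) & 1
    ]
    return {
        "flags": flags,
        # Valid and not marked uncorrected -> a corrected error.
        "corrected": "Val" in flags and "UC" not in flags,
        # Assumed: 6-bit extended error code in bits 21:16.
        "ext_error_code": (status >> 16) & 0x3F,
        "error_code": status & 0xFFFF,
    }


info = decode_status(0x98004000003E0000)
print(info["flags"], info["ext_error_code"], info["corrected"])
```

Under these assumptions the set flags match the `CE|MiscV|CECC` markers in the MC25_STATUS line, and the extended error code comes out as 62, the same number the kernel prints.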


Troubleshooting steps taken in the system BIOS settings:
- Advanced > AMD fTPM configuration > AMD CPU fTPM: Disabled -> Enabled
  - Unexpected reboots continued to occur
- Advanced > PCI Subsystem Settings > SR-IOV Support: Enabled -> Disabled
  - Unexpected reboots continued to occur
- Advanced > PCI Subsystem Settings > Above 4G Decoding: Enabled -> Disabled
  - System seemed to have stabilized, no unexpected reboots


OS: openSUSE Tumbleweed 20210318; also observed using Fedora 33
CPU: AMD Ryzen 5 3600X
Graphics Card: AMD Radeon RX 5700
Motherboard: ASUS Pro WS X570-ACE (BIOS version 3302)
Comment 1 Borislav Petkov 2021-05-05 11:00:55 UTC
That should tell you: "Corrected error, no action required." I.e., nothing has been corrupted and you can continue using your CPU merrily.

If you start seeing a lot of those, though, you could RMA your CPU.

HTH.
Comment 2 Martin 2021-05-17 01:04:08 UTC
(In reply to Borislav Petkov from comment #1)
> That should tell you: "Corrected error, no action required." I.e., nothing
> has been corrupted and you can continue using your CPU merrily.
> 
> If you start seeing a lot of those, though, you could RMA your CPU.
> 
> HTH.

I don't think this should be marked as resolved, as the system unexpectedly reboots shortly after this error is logged. Something is wrong.

This just happened to me again a few minutes ago.
New OS version: openSUSE Tumbleweed 20210514
New motherboard BIOS version: 3402
One thing that might be worth noting: this seems to happen some time after I deliberately reboot the system myself, at most about three times, and then it stops for the rest of the week (I reboot the system weekly).


May 16 20:48:04 localhost.localdomain mcelog[1160]: Hardware event. This is not a software error.
May 16 20:48:04 localhost.localdomain mcelog[1160]: MCE 0
May 16 20:48:04 localhost.localdomain mcelog[1160]: CPU 0 BANK 25
May 16 20:48:04 localhost.localdomain mcelog[1160]: MISC d01a000200000000
May 16 20:48:04 localhost.localdomain mcelog[1160]: TIME 1621212484 Sun May 16 20:48:04 2021
May 16 20:48:04 localhost.localdomain mcelog[1160]: STATUS 98004000003e0000 MCGSTATUS 0
May 16 20:48:04 localhost.localdomain mcelog[1160]: MCGCAP 11c APICID 0 SOCKETID 0
May 16 20:48:04 localhost.localdomain mcelog[1160]: MICROCODE 8701021
May 16 20:48:04 localhost.localdomain mcelog[1160]: CPUID Vendor AMD Family 23 Model 1 Step 0
May 16 20:48:04 localhost.localdomain kernel: mce: [Hardware Error]: Machine check events logged
May 16 20:48:04 localhost.localdomain kernel: [Hardware Error]: Corrected error, no action required.
May 16 20:48:04 localhost.localdomain kernel: [Hardware Error]: CPU:0 (17:71:0) MC25_STATUS[-|CE|MiscV|-|-|-|-|CECC|-|-|-]: 0x98004000003e0000
May 16 20:48:04 localhost.localdomain kernel: [Hardware Error]: IPID: 0x000100ff03830400
May 16 20:48:04 localhost.localdomain kernel: [Hardware Error]: Platform Security Processor Ext. Error Code: 62
May 16 20:48:04 localhost.localdomain kernel: [Hardware Error]: cache level: RESV, tx: INSN
-- Reboot --
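
To check how often a hardware error is the last thing logged before a reboot, a saved journal can be scanned for the pattern seen in the excerpt above. This is a rough sketch: the function name `mce_before_reboot` and the `window` parameter are invented for illustration, and the `-- Reboot --` marker text is copied from this excerpt (other journalctl versions may print the boot separator differently).

```python
def mce_before_reboot(lines, window=20):
    """Return indexes of reboot markers that follow a hardware error.

    A marker counts as a hit if any of the `window` lines before it
    contains "[Hardware Error]". Both strings are taken from the log
    excerpt in this report, not from any specification.
    """
    hits = []
    for i, line in enumerate(lines):
        if line.strip() != "-- Reboot --":
            continue
        recent = lines[max(0, i - window):i]
        if any("[Hardware Error]" in entry for entry in recent):
            hits.append(i)
    return hits


# Three lines shortened from the journal excerpt above.
sample = [
    "May 16 20:48:04 localhost.localdomain kernel: [Hardware Error]: Corrected error, no action required.",
    "May 16 20:48:04 localhost.localdomain kernel: [Hardware Error]: Platform Security Processor Ext. Error Code: 62",
    "-- Reboot --",
]
hits = mce_before_reboot(sample)
print(hits)
```

A hit only shows that the error was logged shortly before the reboot marker. It does not show that the corrected error caused the reset.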
Comment 3 Martin 2021-05-17 01:06:40 UTC
Forgot to mention in my reply that today's error occurred on kernel 5.12.3.
Comment 4 Borislav Petkov 2021-05-17 07:00:19 UTC
(In reply to Martin from comment #2)
> I don't think this should be marked as resolved, as the system unexpectedly
> reboots shortly after this error is logged. Something is wrong.

Probably a follow-up hw error which causes the system to triple-fault.
Can you upload dmesg from *right* *after* the reboot?

Looking at your BIOS:

DMI: ASUS System Product Name/Pro WS X570-ACE, BIOS 3302 03/05/2021

it looks pretty new but you could still check whether there's a newer one.

Also, try leaving your BIOS settings at their defaults and only enable
those which you really need. It is very possible that some of the
settings are not even properly validated.

> Forgot to mention in my reply that today's error occurred on kernel
> 5.12.3.

Hardware resets are unlikely to be caused by the kernel.

HTH.
Comment 5 Martin 2021-05-17 10:54:38 UTC
Created attachment 296809 [details]
dmesg output after an unexpected reboot with hardware error reported in journalctl before the reboot
Comment 6 Martin 2021-05-17 10:58:29 UTC
(In reply to Borislav Petkov from comment #4)


dmesg output is attached as attachment 296809 [details]. Also note that I upgraded my BIOS since the time I created this issue. It is currently the most recent non-beta BIOS available.
New version: DMI: ASUS System Product Name/Pro WS X570-ACE, BIOS 3402 03/22/2021
Comment 7 Borislav Petkov 2021-05-17 12:38:14 UTC
The vendor BIOS site says 3402 received a new AGESA update. It might be fixing something, but it's not like anyone is telling you. ;-\

I don't see anything out of the ordinary in dmesg. I guess you can run the box and see what happens.
Comment 8 Borislav Petkov 2022-01-06 18:40:40 UTC
Looks forgotten. Feel free to reopen if still of interest.
Comment 9 Martin 2022-01-31 23:27:20 UTC
Apologies, I neglected to update the bug because the issue seemed to have gone away at some point, perhaps after I set the BIOS "Power Supply Idle Control" setting to "Typical Current Idle" (I reverted my other attempts at a workaround), and the system had been stable for many, many months.

Today, I upgraded from kernel 5.16.1 to 5.16.2 and noticed that these error messages have reappeared. I have received two of them in 8.5 hours of uptime. I have booted back into kernel 5.16.1 as I expect random reboots if I stay in 5.16.2, where I have started seeing the error messages again (below).

There have been no changes in the hardware configuration of my system since my original message on this bug.


OS: openSUSE Tumbleweed 20220128
Motherboard: ASUS Pro WS X570-ACE (BIOS version 3904 12/14/2021)


kernel: mce: [Hardware Error]: Machine check events logged
kernel: [Hardware Error]: Corrected error, no action required.
kernel: [Hardware Error]: CPU:0 (17:71:0) MC25_STATUS[-|CE|MiscV|-|-|-|-|CECC|-|-|-]: 0x98004000003e0000
kernel: [Hardware Error]: IPID: 0x000100ff03830400
kernel: [Hardware Error]: Platform Security Processor Ext. Error Code: 62
kernel: [Hardware Error]: cache level: RESV, tx: INSN
mcelog[1179]: Hardware event. This is not a software error.
mcelog[1179]: MCE 0
mcelog[1179]: CPU 0 BANK 25
mcelog[1179]: MISC d01a000e00000000
mcelog[1179]: TIME 1643667603 Mon Jan 31 17:20:03 2022
mcelog[1179]: STATUS 98004000003e0000 MCGSTATUS 0
mcelog[1179]: MCGCAP 11c APICID 0 SOCKETID 0
mcelog[1179]: MICROCODE 8701021
mcelog[1179]: CPUID Vendor AMD Family 23 Model 1 Step 0
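
Comparing the three reports in this thread side by side shows which register values actually change between occurrences. The sketch below uses the STATUS, IPID and MISC values copied from the logs above. The helper `differing_bits` is a hypothetical name, and reading bits 43:32 of MISC as an error counter is an assumption based on AMD's MCA_MISC0 threshold-counter layout.

```python
# Values copied from the three log excerpts in this report.
reports = {
    "2021-03-20": {"STATUS": 0x98004000003E0000, "IPID": 0x000100FF03830400,
                   "MISC": 0xD01A005D00000000},
    "2021-05-16": {"STATUS": 0x98004000003E0000, "IPID": 0x000100FF03830400,
                   "MISC": 0xD01A000200000000},
    "2022-01-31": {"STATUS": 0x98004000003E0000, "IPID": 0x000100FF03830400,
                   "MISC": 0xD01A000E00000000},
}

# Assumption: error counter occupies bits 43:32 of MCA_MISC0.
ERR_CNT_SHIFT = 32
ERR_CNT_MASK = 0xFFF


def differing_bits(values):
    """Return a mask of every bit that differs from the first value (hypothetical helper)."""
    values = list(values)
    mask = 0
    for value in values[1:]:
        mask |= value ^ values[0]
    return mask


diffs = {
    field: differing_bits(r[field] for r in reports.values())
    for field in ("STATUS", "IPID", "MISC")
}
counts = {
    day: (r["MISC"] >> ERR_CNT_SHIFT) & ERR_CNT_MASK
    for day, r in reports.items()
}

for field, mask in diffs.items():
    print(f"{field}: differing bits {mask:#018x}")
print(counts)
```

STATUS and IPID are bit-for-bit identical in all three reports. Only MISC varies, and the differing bits all fall inside the assumed counter field.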
Comment 10 Borislav Petkov 2022-04-06 20:52:09 UTC
Do you see any reliable reproducer for those MCEs? I.e., when doing something specific?

If so, you could try to bisect between 5.16.1 and 5.16.2 and maybe pinpoint a kernel commit which is potentially causing those...

Thx.
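
For anyone unfamiliar with bisecting, the idea suggested above is a binary search over the commits between a known-good and a known-bad kernel. The following is only a sketch of that search in plain Python, not what `git bisect` runs internally. The commit names are placeholders, and the `is_bad` callback stands in for "build this kernel, boot it, and wait to see whether the MCE shows up".

```python
def bisect_first_bad(commits, is_bad):
    """Find the first bad commit by binary search.

    `commits` is ordered oldest to newest; commits[0] is assumed good
    and commits[-1] is assumed bad. Returns (first_bad_commit, tests_run).
    """
    good, bad = 0, len(commits) - 1
    tests_run = 0
    while bad - good > 1:
        mid = (good + bad) // 2
        tests_run += 1
        if is_bad(commits[mid]):
            bad = mid    # culprit is at mid or earlier
        else:
            good = mid   # culprit is after mid
    return commits[bad], tests_run


# Hypothetical history of 200 commits; "c137" plays the culprit.
commits = [f"c{i:03d}" for i in range(200)]
first_bad, tests_run = bisect_first_bad(commits, lambda c: c >= "c137")
print(first_bad, tests_run)
```

The search needs only about log2(N) tests, but every test must give a trustworthy good/bad answer. With an error that appears only a couple of times in several hours of uptime and has no known trigger, each step would need a long soak, and a single wrong "good" verdict sends the search to the wrong commit.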
Comment 11 Martin 2022-04-11 01:09:25 UTC
Nothing, unfortunately, and the problem again went away after a few more kernel updates...