Bug 206487

Summary: Random freezes/crashes with enabled C-State C6 - AMD Ryzen
Product: Power Management
Component: Other
Reporter: Paul Menzel (pmenzel+bugzilla.kernel.org)
Assignee: Borislav Petkov (bp)
Status: NEW
Severity: high
Priority: P1
Hardware: x86-64
OS: Linux
Kernel Version: 5.5
Regression: No
CC: alexdeucher, bmogilefsky, bp, d, dion, gabriele.svelto, ilkka.prusi, kernelbugzilla, mario.limonciello, pmenzel+bugzilla.kernel.org, richard.tattoli, rui.zhang, sbmichael, superm1, vkrevs, zounp
Attachments: Linux 5.4.14 messages (dmesg)

Description Paul Menzel 2020-02-10 18:10:26 UTC
Created attachment 287279: Linux 5.4.14 messages (dmesg)

[I have no idea whether *Power Management* is the right component, or whether it should be Platform Specific/Hardware, x86_64.]

As commented in the long bug report [1], AMD Ryzen processors seem to have a bug related to C-State C6 [2]. The problem happens at least with Linux 4.19.57, Linux 5.4.14, and Linux 5.5.

In our case at least three of twenty Dell OptiPlex 5055 with AMD Ryzen 5 PRO 1500 with GNU/Linux are affected. Unfortunately, the system just freezes: the mouse cursor does not move, the Caps Lock key no longer toggles the LED on the USB keyboard, and if the monitor has turned off, it does not come back on. Sometimes the network card LED is still blinking, and sometimes not. In one case, the system even rebooted.

There is nothing in the logs (even over serial console), and nothing in the Dell event log.

`processor.max_cstate=5` seems to fix it, but it’s hard to be sure, as I have not found a way to reproduce this. Sometimes it happens twice a day, and sometimes only after three or more days. Strangely, on one system, everything worked fine for four months, and now the problem suddenly started two weeks ago.
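
For reference, one way to see which idle states the kernel actually exposes, and how often each has been entered, is the cpuidle sysfs directory. Below is a minimal sketch, assuming the standard `/sys/devices/system/cpu/cpuN/cpuidle/stateM/` layout with the `name`, `usage`, `time` and `disable` files; the function name and tuple layout are made up for illustration. How the listed ACPI state names map onto the hardware C6 state is platform-specific, and this sketch cannot tell.

```python
from pathlib import Path


def read_idle_states(cpu=0, sysfs_root="/sys/devices/system/cpu"):
    """Return (name, usage, time_us, disabled) for each cpuidle state of one CPU."""
    base = Path(sysfs_root) / f"cpu{cpu}" / "cpuidle"
    states = []
    # Sort state0, state1, ... numerically so state10 does not land before state2.
    for state_dir in sorted(base.glob("state[0-9]*"), key=lambda p: int(p.name[5:])):
        name = (state_dir / "name").read_text().strip()
        usage = int((state_dir / "usage").read_text())    # number of entries
        time_us = int((state_dir / "time").read_text())   # residency in microseconds
        disabled = (state_dir / "disable").read_text().strip() != "0"
        states.append((name, usage, time_us, disabled))
    return states


if __name__ == "__main__":
    # Prints nothing if the cpuidle directory does not exist on this machine.
    for name, usage, time_us, disabled in read_idle_states(0):
        print(f"{name:8s} entered {usage} times, {time_us} us total, disabled={disabled}")
```

Comparing the `usage` counters before and after a change such as `processor.max_cstate=5` would at least show whether the deeper states are still being entered.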

The issue happens with the latest firmware 1.1.20, and microcode updates are applied, which means 0x08001137 is used.

```
$ lscpu
Architecture:                    x86_64
CPU op-mode(s):                  32-bit, 64-bit
Byte Order:                      Little Endian
Address sizes:                   43 bits physical, 48 bits virtual
CPU(s):                          8
On-line CPU(s) list:             0-7
Thread(s) per core:              2
Core(s) per socket:              4
Socket(s):                       1
NUMA node(s):                    1
Vendor ID:                       AuthenticAMD
CPU family:                      23
Model:                           1
Model name:                      AMD Ryzen 5 PRO 1500 Quad-Core Processor
Stepping:                        1
Frequency boost:                 enabled
CPU MHz:                         3592.550
CPU max MHz:                     3500.0000
CPU min MHz:                     1550.0000
BogoMIPS:                        6986.81
Virtualization:                  AMD-V
L1d cache:                       128 KiB
L1i cache:                       256 KiB
L2 cache:                        2 MiB
L3 cache:                        16 MiB
NUMA node0 CPU(s):               0-7
Vulnerability Itlb multihit:     Not affected
Vulnerability L1tf:              Not affected
Vulnerability Mds:               Not affected
Vulnerability Meltdown:          Not affected
Vulnerability Spec store bypass: Mitigation; Speculative Store Bypass disabled 
                                 via prctl and seccomp
Vulnerability Spectre v1:        Mitigation; usercopy/swapgs barriers and __use
                                 r pointer sanitization
Vulnerability Spectre v2:        Mitigation; Full AMD retpoline, IBPB condition
                                 al, STIBP disabled, RSB filling
Vulnerability Tsx async abort:   Not affected
Flags:                           fpu vme de pse tsc msr pae mce cx8 apic sep mt
                                 rr pge mca cmov pat pse36 clflush mmx fxsr sse
                                  sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rd
                                 tscp lm constant_tsc rep_good nopl nonstop_tsc
                                  cpuid extd_apicid aperfmperf pni pclmulqdq mo
                                 nitor ssse3 fma cx16 sse4_1 sse4_2 movbe popcn
                                 t aes xsave avx f16c rdrand lahf_lm cmp_legacy
                                  svm extapic cr8_legacy abm sse4a misalignsse 
                                 3dnowprefetch osvw skinit wdt tce topoext perf
                                 ctr_core perfctr_nb bpext perfctr_llc mwaitx c
                                 pb hw_pstate sme ssbd sev ibpb vmmcall fsgsbas
                                 e bmi1 avx2 smep bmi2 rdseed adx smap clflusho
                                 pt sha_ni xsaveopt xsavec xgetbv1 xsaves clzer
                                 o irperf xsaveerptr arat npt lbrv svm_lock nri
                                 p_save tsc_scale vmcb_clean flushbyasid decode
                                 assists pausefilter pfthreshold avic v_vmsave_
                                 vmload vgif overflow_recov succor smca
```

Several of the same systems run with Microsoft Windows, and I am unaware of freezes there. Can you find out, if C-State C6 is used by the Microsoft Windows driver?

Can you recommend ways to debug this further?

[1]: https://bugzilla.kernel.org/show_bug.cgi?id=196683
[2]: https://forum.manjaro.org/t/ryzen-freezes-possible-solution-related-to-c6-state/37870
Comment 1 Borislav Petkov 2020-02-10 18:28:36 UTC
> In our case at least three of twenty Dell OptiPlex 5055 with AMD Ryzen 5 PRO
> 1500 with GNU/Linux are affected.

Are you saying, you have 20 identical systems with the same sw and only 3 freeze?
Comment 2 Paul Menzel 2020-02-10 23:07:12 UTC
(In reply to Borislav Petkov from comment #1)
> > In our case at least three of twenty Dell OptiPlex 5055 with AMD Ryzen 5 PRO
> > 1500 with GNU/Linux are affected.
> 
> Are you saying, you have 20 identical systems with the same sw and only 3
> freeze?

Basically, yes. One of the three (and of the 20) has a different configuration: it has 32 GB instead of 16 GB of memory, an NVMe SSD instead of an AHCI HDD, and a different AMD graphics card. So the problem should be unrelated to these components.

Most common configuration:

```
$ lspci -nn
00:00.0 Host bridge [0600]: Advanced Micro Devices, Inc. [AMD] Family 17h (Models 00h-0fh) Root Complex [1022:1450]
00:00.2 IOMMU [0806]: Advanced Micro Devices, Inc. [AMD] Family 17h (Models 00h-0fh) I/O Memory Management Unit [1022:1451]
00:01.0 Host bridge [0600]: Advanced Micro Devices, Inc. [AMD] Family 17h (Models 00h-1fh) PCIe Dummy Host Bridge [1022:1452]
00:01.3 PCI bridge [0604]: Advanced Micro Devices, Inc. [AMD] Family 17h (Models 00h-0fh) PCIe GPP Bridge [1022:1453]
00:02.0 Host bridge [0600]: Advanced Micro Devices, Inc. [AMD] Family 17h (Models 00h-1fh) PCIe Dummy Host Bridge [1022:1452]
00:03.0 Host bridge [0600]: Advanced Micro Devices, Inc. [AMD] Family 17h (Models 00h-1fh) PCIe Dummy Host Bridge [1022:1452]
00:03.1 PCI bridge [0604]: Advanced Micro Devices, Inc. [AMD] Family 17h (Models 00h-0fh) PCIe GPP Bridge [1022:1453]
00:04.0 Host bridge [0600]: Advanced Micro Devices, Inc. [AMD] Family 17h (Models 00h-1fh) PCIe Dummy Host Bridge [1022:1452]
00:07.0 Host bridge [0600]: Advanced Micro Devices, Inc. [AMD] Family 17h (Models 00h-1fh) PCIe Dummy Host Bridge [1022:1452]
00:07.1 PCI bridge [0604]: Advanced Micro Devices, Inc. [AMD] Family 17h (Models 00h-0fh) Internal PCIe GPP Bridge 0 to Bus B [1022:1454]
00:08.0 Host bridge [0600]: Advanced Micro Devices, Inc. [AMD] Family 17h (Models 00h-1fh) PCIe Dummy Host Bridge [1022:1452]
00:08.1 PCI bridge [0604]: Advanced Micro Devices, Inc. [AMD] Family 17h (Models 00h-0fh) Internal PCIe GPP Bridge 0 to Bus B [1022:1454]
00:14.0 SMBus [0c05]: Advanced Micro Devices, Inc. [AMD] FCH SMBus Controller [1022:790b] (rev 59)
00:14.3 ISA bridge [0601]: Advanced Micro Devices, Inc. [AMD] FCH LPC Bridge [1022:790e] (rev 51)
00:18.0 Host bridge [0600]: Advanced Micro Devices, Inc. [AMD] Family 17h (Models 00h-0fh) Data Fabric: Device 18h; Function 0 [1022:1460]
00:18.1 Host bridge [0600]: Advanced Micro Devices, Inc. [AMD] Family 17h (Models 00h-0fh) Data Fabric: Device 18h; Function 1 [1022:1461]
00:18.2 Host bridge [0600]: Advanced Micro Devices, Inc. [AMD] Family 17h (Models 00h-0fh) Data Fabric: Device 18h; Function 2 [1022:1462]
00:18.3 Host bridge [0600]: Advanced Micro Devices, Inc. [AMD] Family 17h (Models 00h-0fh) Data Fabric: Device 18h; Function 3 [1022:1463]
00:18.4 Host bridge [0600]: Advanced Micro Devices, Inc. [AMD] Family 17h (Models 00h-0fh) Data Fabric: Device 18h; Function 4 [1022:1464]
00:18.5 Host bridge [0600]: Advanced Micro Devices, Inc. [AMD] Family 17h (Models 00h-0fh) Data Fabric: Device 18h; Function 5 [1022:1465]
00:18.6 Host bridge [0600]: Advanced Micro Devices, Inc. [AMD] Family 17h (Models 00h-0fh) Data Fabric: Device 18h; Function 6 [1022:1466]
00:18.7 Host bridge [0600]: Advanced Micro Devices, Inc. [AMD] Family 17h (Models 00h-0fh) Data Fabric: Device 18h; Function 7 [1022:1467]
01:00.0 USB controller [0c03]: Advanced Micro Devices, Inc. [AMD] 300 Series Chipset USB 3.1 xHCI Controller [1022:43bb] (rev 02)
01:00.1 SATA controller [0106]: Advanced Micro Devices, Inc. [AMD] 300 Series Chipset SATA Controller [1022:43b7] (rev 02)
01:00.2 PCI bridge [0604]: Advanced Micro Devices, Inc. [AMD] Device [1022:43b2] (rev 02)
02:00.0 PCI bridge [0604]: Advanced Micro Devices, Inc. [AMD] 300 Series Chipset PCIe Port [1022:43b4] (rev 02)
02:01.0 PCI bridge [0604]: Advanced Micro Devices, Inc. [AMD] 300 Series Chipset PCIe Port [1022:43b4] (rev 02)
03:00.0 Ethernet controller [0200]: Broadcom Inc. and subsidiaries NetXtreme BCM5762 Gigabit Ethernet PCIe [14e4:1687] (rev 10)
05:00.0 VGA compatible controller [0300]: Advanced Micro Devices, Inc. [AMD/ATI] Oland [Radeon HD 8570 / R7 240/340 OEM] [1002:6611] (rev 87)
05:00.1 Audio device [0403]: Advanced Micro Devices, Inc. [AMD/ATI] Oland/Hainan/Cape Verde/Pitcairn HDMI Audio [Radeon HD 7000 Series] [1002:aab0]
06:00.0 Non-Essential Instrumentation [1300]: Advanced Micro Devices, Inc. [AMD] Zeppelin/Raven/Raven2 PCIe Dummy Function [1022:145a]
06:00.2 Encryption controller [1080]: Advanced Micro Devices, Inc. [AMD] Family 17h (Models 00h-0fh) Platform Security Processor [1022:1456]
06:00.3 USB controller [0c03]: Advanced Micro Devices, Inc. [AMD] Family 17h (Models 00h-0fh) USB 3.0 Host Controller [1022:145c]
07:00.0 Non-Essential Instrumentation [1300]: Advanced Micro Devices, Inc. [AMD] Zeppelin/Renoir PCIe Dummy Function [1022:1455]
07:00.2 SATA controller [0106]: Advanced Micro Devices, Inc. [AMD] FCH SATA Controller [AHCI mode] [1022:7901] (rev 51)
07:00.3 Audio device [0403]: Advanced Micro Devices, Inc. [AMD] Family 17h (Models 00h-0fh) HD Audio Controller [1022:1457]
```

Exception:

```
$ lspci -nn
00:00.0 Host bridge [0600]: Advanced Micro Devices, Inc. [AMD] Family 17h (Models 00h-0fh) Root Complex [1022:1450]
00:00.2 IOMMU [0806]: Advanced Micro Devices, Inc. [AMD] Family 17h (Models 00h-0fh) I/O Memory Management Unit [1022:1451]
00:01.0 Host bridge [0600]: Advanced Micro Devices, Inc. [AMD] Family 17h (Models 00h-1fh) PCIe Dummy Host Bridge [1022:1452]
00:01.1 PCI bridge [0604]: Advanced Micro Devices, Inc. [AMD] Family 17h (Models 00h-0fh) PCIe GPP Bridge [1022:1453]
00:01.3 PCI bridge [0604]: Advanced Micro Devices, Inc. [AMD] Family 17h (Models 00h-0fh) PCIe GPP Bridge [1022:1453]
00:02.0 Host bridge [0600]: Advanced Micro Devices, Inc. [AMD] Family 17h (Models 00h-1fh) PCIe Dummy Host Bridge [1022:1452]
00:03.0 Host bridge [0600]: Advanced Micro Devices, Inc. [AMD] Family 17h (Models 00h-1fh) PCIe Dummy Host Bridge [1022:1452]
00:03.1 PCI bridge [0604]: Advanced Micro Devices, Inc. [AMD] Family 17h (Models 00h-0fh) PCIe GPP Bridge [1022:1453]
00:04.0 Host bridge [0600]: Advanced Micro Devices, Inc. [AMD] Family 17h (Models 00h-1fh) PCIe Dummy Host Bridge [1022:1452]
00:07.0 Host bridge [0600]: Advanced Micro Devices, Inc. [AMD] Family 17h (Models 00h-1fh) PCIe Dummy Host Bridge [1022:1452]
00:07.1 PCI bridge [0604]: Advanced Micro Devices, Inc. [AMD] Family 17h (Models 00h-0fh) Internal PCIe GPP Bridge 0 to Bus B [1022:1454]
00:08.0 Host bridge [0600]: Advanced Micro Devices, Inc. [AMD] Family 17h (Models 00h-1fh) PCIe Dummy Host Bridge [1022:1452]
00:08.1 PCI bridge [0604]: Advanced Micro Devices, Inc. [AMD] Family 17h (Models 00h-0fh) Internal PCIe GPP Bridge 0 to Bus B [1022:1454]
00:14.0 SMBus [0c05]: Advanced Micro Devices, Inc. [AMD] FCH SMBus Controller [1022:790b] (rev 59)
00:14.3 ISA bridge [0601]: Advanced Micro Devices, Inc. [AMD] FCH LPC Bridge [1022:790e] (rev 51)
00:18.0 Host bridge [0600]: Advanced Micro Devices, Inc. [AMD] Family 17h (Models 00h-0fh) Data Fabric: Device 18h; Function 0 [1022:1460]
00:18.1 Host bridge [0600]: Advanced Micro Devices, Inc. [AMD] Family 17h (Models 00h-0fh) Data Fabric: Device 18h; Function 1 [1022:1461]
00:18.2 Host bridge [0600]: Advanced Micro Devices, Inc. [AMD] Family 17h (Models 00h-0fh) Data Fabric: Device 18h; Function 2 [1022:1462]
00:18.3 Host bridge [0600]: Advanced Micro Devices, Inc. [AMD] Family 17h (Models 00h-0fh) Data Fabric: Device 18h; Function 3 [1022:1463]
00:18.4 Host bridge [0600]: Advanced Micro Devices, Inc. [AMD] Family 17h (Models 00h-0fh) Data Fabric: Device 18h; Function 4 [1022:1464]
00:18.5 Host bridge [0600]: Advanced Micro Devices, Inc. [AMD] Family 17h (Models 00h-0fh) Data Fabric: Device 18h; Function 5 [1022:1465]
00:18.6 Host bridge [0600]: Advanced Micro Devices, Inc. [AMD] Family 17h (Models 00h-0fh) Data Fabric: Device 18h; Function 6 [1022:1466]
00:18.7 Host bridge [0600]: Advanced Micro Devices, Inc. [AMD] Family 17h (Models 00h-0fh) Data Fabric: Device 18h; Function 7 [1022:1467]
01:00.0 Non-Volatile memory controller [0108]: Intel Corporation SSD Pro 7600p/760p/E 6100p Series [8086:f1a6] (rev 03)
02:00.0 USB controller [0c03]: Advanced Micro Devices, Inc. [AMD] 300 Series Chipset USB 3.1 xHCI Controller [1022:43bb] (rev 02)
02:00.1 SATA controller [0106]: Advanced Micro Devices, Inc. [AMD] 300 Series Chipset SATA Controller [1022:43b7] (rev 02)
02:00.2 PCI bridge [0604]: Advanced Micro Devices, Inc. [AMD] Device [1022:43b2] (rev 02)
03:00.0 PCI bridge [0604]: Advanced Micro Devices, Inc. [AMD] 300 Series Chipset PCIe Port [1022:43b4] (rev 02)
03:01.0 PCI bridge [0604]: Advanced Micro Devices, Inc. [AMD] 300 Series Chipset PCIe Port [1022:43b4] (rev 02)
04:00.0 Ethernet controller [0200]: Broadcom Inc. and subsidiaries NetXtreme BCM5762 Gigabit Ethernet PCIe [14e4:1687] (rev 10)
06:00.0 VGA compatible controller [0300]: Advanced Micro Devices, Inc. [AMD/ATI] Cape Verde PRO / Venus LE / Tropo PRO-L [Radeon HD 8830M / R7 250 / R7 M465X] [1002:6... (rev 87)
06:00.1 Audio device [0403]: Advanced Micro Devices, Inc. [AMD/ATI] Oland/Hainan/Cape Verde/Pitcairn HDMI Audio [Radeon HD 7000 Series] [1002:aab0]
07:00.0 Non-Essential Instrumentation [1300]: Advanced Micro Devices, Inc. [AMD] Zeppelin/Raven/Raven2 PCIe Dummy Function [1022:145a]
07:00.2 Encryption controller [1080]: Advanced Micro Devices, Inc. [AMD] Family 17h (Models 00h-0fh) Platform Security Processor [1022:1456]
07:00.3 USB controller [0c03]: Advanced Micro Devices, Inc. [AMD] Family 17h (Models 00h-0fh) USB 3.0 Host Controller [1022:145c]
08:00.0 Non-Essential Instrumentation [1300]: Advanced Micro Devices, Inc. [AMD] Zeppelin/Renoir PCIe Dummy Function [1022:1455]
08:00.2 SATA controller [0106]: Advanced Micro Devices, Inc. [AMD] FCH SATA Controller [AHCI mode] [1022:7901] (rev 51)
08:00.3 Audio device [0403]: Advanced Micro Devices, Inc. [AMD] Family 17h (Models 00h-0fh) HD Audio Controller [1022:1457]
```
Comment 3 Borislav Petkov 2020-02-12 18:14:41 UTC
Do the other boxes have this MCE in dmesg too?

[    0.384815] [Hardware Error]: System Fatal error.
[    0.385095] [Hardware Error]: CPU:2 (17:1:1) MC0_STATUS[-|UE|MiscV|-|PCC|TCC|SyndV|-|-|-]: 0xbaa0000000060145
[    0.385180] [Hardware Error]: IPID: 0x000000b000000000, Syndrome: 0x000000002d030100
[    0.386180] [Hardware Error]: Load Store Unit Ext. Error Code: 6, DC Tag error type 1.
[    0.386645] [Hardware Error]: cache level: L1, tx: DATA, mem-tx: DWR

Because this could explain the freezes. If you see it in dmesg every time you reboot after a freeze - i.e., reproducible - you probably should talk to your hardware supplier about replacing the CPU...

HTH.
Comment 4 Paul Menzel 2020-02-12 23:02:13 UTC
(In reply to Borislav Petkov from comment #3)
> Do the other boxes have this MCE in dmesg too?
> 
> [    0.384815] [Hardware Error]: System Fatal error.
> [    0.385095] [Hardware Error]: CPU:2 (17:1:1)
> MC0_STATUS[-|UE|MiscV|-|PCC|TCC|SyndV|-|-|-]: 0xbaa0000000060145
> [    0.385180] [Hardware Error]: IPID: 0x000000b000000000, Syndrome:
> 0x000000002d030100
> [    0.386180] [Hardware Error]: Load Store Unit Ext. Error Code: 6, DC Tag
> error type 1.
> [    0.386645] [Hardware Error]: cache level: L1, tx: DATA, mem-tx: DWR

One (*deinhandtuch*) of the two other machines logged an MCE (even same values) once, but after a *normal* reboot. In this run it crashed after two days.

```
[    0.000000] Linux version 5.4.14.mx64.317 (root@thebiglebowserver.molgen.mpg.de) (gcc version 7.5.0 (GCC)) #1 SMP Thu Jan 23 14:24:25 CET 2020
[…]
[    0.377235] smpboot: CPU0: AMD Ryzen 5 PRO 1500 Quad-Core Processor (family: 0x17, model: 0x1, stepping: 0x1)
[    0.377905] Performance Events: Fam17h core perfctr, AMD PMU driver.
[    0.378181] ... version:                0
[    0.378420] ... bit width:              48
[    0.378664] ... generic registers:      6
[    0.378904] ... value mask:             0000ffffffffffff
[    0.379180] ... max period:             00007fffffffffff
[    0.379494] ... fixed-purpose events:   0
[    0.379733] ... event mask:             000000000000003f
[    0.380069] rcu: Hierarchical SRCU implementation.
[    0.380533] MCE: In-kernel MCE decoding enabled.
[    0.380907] smp: Bringing up secondary CPUs ...
[    0.381245] x86: Booting SMP configuration:
[    0.381494] .... node  #0, CPUs:        #1  #2
[    0.384203] mce: [Hardware Error]: Machine check events logged
[    0.384815] [Hardware Error]: System Fatal error.
[    0.385095] [Hardware Error]: CPU:2 (17:1:1) MC0_STATUS[-|UE|MiscV|-|PCC|TCC|SyndV|-|-|-]: 0xbaa0000000060145
[    0.385180] [Hardware Error]: IPID: 0x000000b000000000, Syndrome: 0x000000002d030100
[    0.386180] [Hardware Error]: Load Store Unit Ext. Error Code: 6, DC Tag error type 1.
[    0.386645] [Hardware Error]: cache level: L1, tx: DATA, mem-tx: DWR
[    0.387024]   #3  #4  #5  #6  #7
[    0.393204] smp: Brought up 1 node, 8 CPUs
[    0.393623] smpboot: Max logical packages: 2
[    0.393878] smpboot: Total of 8 processors activated (55894.52 BogoMIPS)
```

The other affected machine (*fenchurch*) does not have it.

> Because this could explain the freezes. If you see it in dmesg every time you
> reboot after a freeze - i.e., reproducible - you probably should talk to
> your hardware supplier about replacing the CPU...

As it’s not happening every time, do you still recommend it?

How can I decode the MCEs?

Here are all the MCEs from *hypnotoad*:

Linux 5.4.10:

Jan 23 13:34:36 hypnotoad.molgen.mpg.de kernel: mce: [Hardware Error]: Machine check events logged
Jan 23 13:34:36 hypnotoad.molgen.mpg.de kernel: [Hardware Error]: System Fatal error.
Jan 23 13:34:36 hypnotoad.molgen.mpg.de kernel: [Hardware Error]: CPU:1 (17:1:1) MC5_STATUS[-|UE|MiscV|AddrV|PCC|TCC|SyndV|-|-|-]: 0xbea0000000000108
Jan 23 13:34:36 hypnotoad.molgen.mpg.de kernel: [Hardware Error]: Error Addr: 0x0001ffff81ac64f4
Jan 23 13:34:36 hypnotoad.molgen.mpg.de kernel: [Hardware Error]: IPID: 0x000500b000000000, Syndrome: 0x000000004d000000
Jan 23 13:34:36 hypnotoad.molgen.mpg.de kernel: [Hardware Error]: Execution Unit Ext. Error Code: 0, Watchdog Timeout error.
Jan 23 13:34:36 hypnotoad.molgen.mpg.de kernel: [Hardware Error]: cache level: RESV, tx: GEN, mem-tx: GEN
Jan 23 13:34:36 hypnotoad.molgen.mpg.de kernel: mce: [Hardware Error]: Machine check events logged
Jan 23 13:34:36 hypnotoad.molgen.mpg.de kernel: [Hardware Error]: System Fatal error.
Jan 23 13:34:36 hypnotoad.molgen.mpg.de kernel: [Hardware Error]: CPU:7 (17:1:1) MC5_STATUS[-|UE|MiscV|AddrV|PCC|TCC|SyndV|-|-|-]: 0xbea0000000000108
Jan 23 13:34:36 hypnotoad.molgen.mpg.de kernel: [Hardware Error]: Error Addr: 0x0001ffff81e01060
Jan 23 13:34:36 hypnotoad.molgen.mpg.de kernel: [Hardware Error]: IPID: 0x000500b000000000, Syndrome: 0x000000004d000000
Jan 23 13:34:36 hypnotoad.molgen.mpg.de kernel: [Hardware Error]: Execution Unit Ext. Error Code: 0, Watchdog Timeout error.
Jan 23 13:34:36 hypnotoad.molgen.mpg.de kernel: [Hardware Error]: cache level: RESV, tx: GEN, mem-tx: GEN

Linux 5.4.10:

Jan 24 05:46:11 hypnotoad.molgen.mpg.de kernel: mce: [Hardware Error]: Machine check events logged
Jan 24 05:46:11 hypnotoad.molgen.mpg.de kernel: mce: [Hardware Error]: CPU 0: Machine Check: 0 Bank 5: bea0000000000108
Jan 24 05:46:11 hypnotoad.molgen.mpg.de kernel: mce: [Hardware Error]: TSC 0 ADDR 1ffff81ac64fc MISC d012000101000000 SYND 4d000000 IPID 500b000000000 
Jan 24 05:46:11 hypnotoad.molgen.mpg.de kernel: mce: [Hardware Error]: PROCESSOR 2:800f11 TIME 1579841168 SOCKET 0 APIC 0 microcode 8001137
Jan 24 05:46:11 hypnotoad.molgen.mpg.de kernel: mce: [Hardware Error]: Machine check events logged
Jan 24 05:46:11 hypnotoad.molgen.mpg.de kernel: [Hardware Error]: System Fatal error.
Jan 24 05:46:11 hypnotoad.molgen.mpg.de kernel: [Hardware Error]: CPU:7 (17:1:1) MC5_STATUS[-|UE|MiscV|AddrV|PCC|TCC|SyndV|-|-|-]: 0xbea0000000000108
Jan 24 05:46:11 hypnotoad.molgen.mpg.de kernel: [Hardware Error]: Error Addr: 0x0001ffff81c00ad6
Jan 24 05:46:11 hypnotoad.molgen.mpg.de kernel: [Hardware Error]: IPID: 0x000500b000000000, Syndrome: 0x000000004d000000
Jan 24 05:46:11 hypnotoad.molgen.mpg.de kernel: [Hardware Error]: Execution Unit Ext. Error Code: 0, Watchdog Timeout error.
Jan 24 05:46:11 hypnotoad.molgen.mpg.de kernel: [Hardware Error]: cache level: RESV, tx: GEN, mem-tx: GEN

Linux 5.4.10:

Jan 24 14:16:43 hypnotoad.molgen.mpg.de kernel: mce: [Hardware Error]: Machine check events logged
Jan 24 14:16:43 hypnotoad.molgen.mpg.de kernel: mce: [Hardware Error]: CPU 0: Machine Check: 0 Bank 22: baa000000000080b
Jan 24 14:16:43 hypnotoad.molgen.mpg.de kernel: mce: [Hardware Error]: TSC 0 MISC d012000101000000 SYND 5d000000 IPID 1002e00000002 
Jan 24 14:16:43 hypnotoad.molgen.mpg.de kernel: mce: [Hardware Error]: PROCESSOR 2:800f11 TIME 1579871799 SOCKET 0 APIC 0 microcode 8001137

Linux 5.5-rc7:

Jan 24 15:54:33 hypnotoad.molgen.mpg.de kernel: mce: [Hardware Error]: Machine check events logged
Jan 24 15:54:33 hypnotoad.molgen.mpg.de kernel: [Hardware Error]: System Fatal error.
Jan 24 15:54:33 hypnotoad.molgen.mpg.de kernel: [Hardware Error]: CPU:1 (17:1:1) MC5_STATUS[-|UE|MiscV|AddrV|PCC|TCC|SyndV|-|-|-]: 0xbea0000000000108
Jan 24 15:54:33 hypnotoad.molgen.mpg.de kernel: [Hardware Error]: Error Addr: 0x0001ffff81ae0f38
Jan 24 15:54:33 hypnotoad.molgen.mpg.de kernel: [Hardware Error]: IPID: 0x000500b000000000, Syndrome: 0x000000004d000000
Jan 24 15:54:33 hypnotoad.molgen.mpg.de kernel: [Hardware Error]: Execution Unit Ext. Error Code: 0, Watchdog Timeout error.
Jan 24 15:54:33 hypnotoad.molgen.mpg.de kernel: [Hardware Error]: cache level: RESV, tx: GEN, mem-tx: GEN
Jan 24 15:54:33 hypnotoad.molgen.mpg.de kernel: mce: [Hardware Error]: Machine check events logged
Jan 24 15:54:33 hypnotoad.molgen.mpg.de kernel: [Hardware Error]: System Fatal error.
Jan 24 15:54:33 hypnotoad.molgen.mpg.de kernel: [Hardware Error]: CPU:5 (17:1:1) MC5_STATUS[-|UE|MiscV|AddrV|PCC|TCC|SyndV|-|-|-]: 0xbea0000000000108
Jan 24 15:54:33 hypnotoad.molgen.mpg.de kernel: [Hardware Error]: Error Addr: 0x0001ffff81223a66
Jan 24 15:54:33 hypnotoad.molgen.mpg.de kernel: [Hardware Error]: IPID: 0x000500b000000000, Syndrome: 0x000000004d000000
Jan 24 15:54:33 hypnotoad.molgen.mpg.de kernel: [Hardware Error]: Execution Unit Ext. Error Code: 0, Watchdog Timeout error.
Jan 24 15:54:33 hypnotoad.molgen.mpg.de kernel: [Hardware Error]: cache level: RESV, tx: GEN, mem-tx: GEN
Comment 5 Paul Menzel 2020-02-12 23:07:51 UTC
Searching for 0xbea0000000000108, this is documented on the Gentoo Wiki page *Ryzen* [1]. So, we are back to the common processor error, and the references to C-State 6.

[1]: https://wiki.gentoo.org/wiki/Ryzen#Random_reboots_with_mce_events
Comment 6 Borislav Petkov 2020-02-13 06:25:28 UTC
(In reply to Paul Menzel from comment #4)
> One (*deinhandtuch*) of the two other machines logged an MCE (even same
> values) once, but after a *normal* reboot.

Can't be after a normal reboot - those errors get read out from the MCA
banks when the machine gets warm-reset and cleared after that. So the
MCE must have happened in the last boot of the machine.

And since it is the same signature, the deinhandtuch box's CPU - no
matter how funny its name is :) - might need to be replaced too.

> The other affected machine (*fenchurch*) does not have it.

It is possible that it doesn't log it successfully or that box has a
different problem.

> As it’s not happening every time, do you still recommend it?

That's your decision I guess. If it doesn't hurt the work you're doing
and occasional reboots are ok, then sure, it would be less hassle. :)

> How can I decode the MCEs?

They're decoded already:

[    0.386180] [Hardware Error]: Load Store Unit Ext. Error Code: 6, DC Tag error type 1.

That's all the decode we can get.
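
For reference, the bracketed flag list and the extended error code can be reproduced from the raw status value. Below is a minimal sketch, assuming the MCA_STATUS bit positions used by the Linux kernel's x86 MCE code (Val 63, Over 62, UC 61, En 60, MiscV 59, AddrV 58, PCC 57, TCC 55, SyndV 53, Deferred 44, Poison 43) and a 6-bit extended error code in bits 21:16; the function and dictionary names are made up for illustration.

```python
# Bit positions assumed from the Linux kernel's MCA_STATUS definitions.
STATUS_BITS = {
    "Val": 63, "Over": 62, "UC": 61, "En": 60, "MiscV": 59,
    "AddrV": 58, "PCC": 57, "TCC": 55, "SyndV": 53,
    "Deferred": 44, "Poison": 43,
}


def decode_status(status):
    """Split a 64-bit MCA status value into set flags and error code fields."""
    flags = [name for name, bit in STATUS_BITS.items() if (status >> bit) & 1]
    return {
        "flags": flags,
        "ext_error_code": (status >> 16) & 0x3F,  # "Ext. Error Code" in the log
        "error_code": status & 0xFFFF,            # low 16 bits
    }


if __name__ == "__main__":
    for value in (0xBAA0000000060145, 0xBEA0000000000108):
        print(hex(value), decode_status(value))
```

For `0xbaa0000000060145` this yields UC, MiscV, PCC, TCC and SyndV (plus Val and En) with extended error code 6, which agrees with the `MC0_STATUS[-|UE|MiscV|-|PCC|TCC|SyndV|-|-|-]` line. What "DC Tag error type 1" means beyond that is not something the bits reveal.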

> Here are all the MCEs from *hypnotoad*:

Those are different. They're all

[Hardware Error]: Execution Unit Ext. Error Code: 0, Watchdog Timeout error.

in bank 5...

> Searching for 0xbea0000000000108, this is documented on the Gentoo
> Wiki page *Ryzen* [1]. So, we are back to the common processor error,
> and the references to C-State 6.

... and do those errors go away if you boot with "idle=nomwait" or
disable C states in the BIOS?
Comment 7 Paul Menzel 2020-02-13 08:27:58 UTC
(In reply to Borislav Petkov from comment #6)
> (In reply to Paul Menzel from comment #4)
> > One (*deinhandtuch*) of the two other machines logged an MCE (even same
> > values) once, but after a *normal* reboot.
> 
> Can't be after a normal reboot - those errors get read out from the MCA
> banks when the machine gets warm-reset and cleared after that. So the
> MCE must have happened in the last boot of the machine.
> 
> And since it is the same signature, the deinhandtuch box's CPU - no
> matter how funny its name is :) - might need to be replaced too.

Sometimes we reboot using kexec. Maybe that’s why the MCE still got read out? Can I manually force a readout, or would the driver always notice it in real time?

> > The other affected machine (*fenchurch*) does not have it.
> 
> It is possible that it doesn't log it successfully or that box has a
> different problem.
> 
> > As it’s not happening every time, do you still recommend it?
> 
> That's your decision I guess. If it doesn't hurt the work you're doing
> and occasional reboots are ok, then sure, it would be less hassle. :)
> 
> > How can I decode the MCEs?
> 
> They're decoded already:
> 
> [    0.386180] [Hardware Error]: Load Store Unit Ext. Error Code: 6, DC Tag
> error type 1.
> 
> That's all the decode we can get.

I guess we need a decode for the decode. ;-) What internal(?) AMD data sheet would contain this information?

> > Here are all the MCEs from *hypnotoad*:
> 
> Those are different. They're all
> 
> [Hardware Error]: Execution Unit Ext. Error Code: 0, Watchdog Timeout error.
> 
> in bank 5...

They are not, are they? Two of the pasted four have *bank 5* in them. They all contain the value `0xbea0000000000108` though.

> > Searching for 0xbea0000000000108, this is documented on the Gentoo
> > Wiki page *Ryzen* [1]. So, we are back to the common processor error,
> > and the references to C-State 6.
> 
> ... and do those errors go away if you boot with "idle=nomwait" or
> disable C states in the BIOS?

The Dell firmware does not have an option to disable the C-States. I have to use  ZenStates-Linux.

It looks like my problem report is still not clear enough. Even on the machines that log MCEs, they are logged after only a fraction of the freezes. As there is no reproducer for the problem, I cannot say whether the errors go away.

Currently, all three machines (*fenchurch* with a replaced motherboard, *deinhandtuch* with disabled C-State C6) run fine since Monday.

Do you know of ways how to debug this better? How can I check, if C-State C6 has been reached? (turbostat does not show it?)

Can I log more over the serial console?

Also, judging from all the problem reports on the Web, and documented solutions, can’t we assume, that there is a hardware or Linux kernel error?

Did AMD look at this internally? Is their solution to disable C-State C6?

What does the Windows driver do?
Comment 8 Borislav Petkov 2020-02-13 10:59:14 UTC
(In reply to Paul Menzel from comment #7)
> Sometimes we reboot using kexec. Maybe that’s why the MCE still got
> read out? Can I manually force a readout, or would the driver always notice
> it in real time?

No, fatal MCEs cause a thing called "syncflood" which prevents the
machine from making any forward progress. That's why it would freeze.
If the firmware is properly coded and configured, it would detect such
a condition and trigger a warm reset. And the error will be logged in
the MCA MSRs which the kernel dumps during boot if it finds a valid
signature in there. The driver doesn't even get to execute when the
error happens.

And yes, kexec-ing might be able to log it, if the hardware plays along.
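
Since the error only shows up in the boot after the freeze, it can help to scan each boot log for MCE records mechanically. Below is a minimal sketch, assuming the two line formats that appear in this report (the decoded `MCn_STATUS[...]` form and the raw `Machine Check: 0 Bank n:` form); the regular expressions and the function name are illustrative only and may miss other formats.

```python
import re

# Decoded form, e.g. "CPU:2 (17:1:1) MC0_STATUS[...]: 0xbaa0000000060145"
DECODED = re.compile(
    r"CPU:(?P<cpu>\d+) \([0-9a-f:]+\) MC(?P<bank>\d+)_STATUS\[[^\]]*\]: "
    r"0x(?P<status>[0-9a-f]{16})"
)
# Raw form, e.g. "CPU 0: Machine Check: 0 Bank 22: baa000000000080b"
RAW = re.compile(
    r"CPU (?P<cpu>\d+): Machine Check: \d+ Bank (?P<bank>\d+): (?P<status>[0-9a-f]+)"
)


def find_mce_records(log_text):
    """Return a list of (cpu, bank, status) tuples found in kernel log text."""
    records = []
    for line in log_text.splitlines():
        match = DECODED.search(line) or RAW.search(line)
        if match:
            records.append(
                (int(match["cpu"]), int(match["bank"]), int(match["status"], 16))
            )
    return records
```

Feeding it the saved output of `dmesg` after every reboot would give a per-boot list of banks and status values to compare across machines.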

> I guess we need a decode for the decode. ;-) What internal(?) AMD data sheet
> would contain this information?

Good luck with that.

> They are not, are they? Two of the pasted four have *bank 5* in them. They
> all contain the value `0xbea0000000000108` though.

The only one which is not bank 5 is the one in the "Linux 5.4.10"
snippet which is bank 22 but it has not run through the decoding module.
That one is fatal too, though, so it could be related as in follow-up
MCE or maybe it is something else.
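
For a raw record that has not gone through the decoding module, two of its fields can still be split by hand. Below is a minimal sketch, assuming the IPID layout used by the kernel's SMCA code (hardware ID in bits 43:32, MCA type in bits 63:48) and the standard CPUID leaf 1 signature layout for the `PROCESSOR 2:800f11` field. The lookup table contains only the two ID pairs whose names are confirmed by the decoded messages in this report; all names are made up for illustration.

```python
# (hwid, mca_type) pairs confirmed by decoded lines in this report only.
KNOWN_BLOCKS = {
    (0xB0, 0x0): "Load Store Unit",
    (0xB0, 0x5): "Execution Unit",
}


def decode_ipid(ipid):
    """Return (hwid, mca_type, block name) from an MCA_IPID value."""
    hwid = (ipid >> 32) & 0xFFF
    mca_type = (ipid >> 48) & 0xFFFF
    name = KNOWN_BLOCKS.get((hwid, mca_type), "not in this sketch's table")
    return hwid, mca_type, name


def decode_cpuid_signature(sig):
    """Return (family, model, stepping) from a CPUID leaf 1 EAX value."""
    stepping = sig & 0xF
    base_model = (sig >> 4) & 0xF
    base_family = (sig >> 8) & 0xF
    ext_model = (sig >> 16) & 0xF
    ext_family = (sig >> 20) & 0xFF
    if base_family == 0xF:
        # Extended fields only apply when the base family is 0xF.
        return base_family + ext_family, (ext_model << 4) | base_model, stepping
    return base_family, base_model, stepping


if __name__ == "__main__":
    print([hex(v) if isinstance(v, int) else v for v in decode_ipid(0x1002E00000002)])
    print([hex(v) for v in decode_cpuid_signature(0x800F11)])
```

The bank 22 record's IPID `1002e00000002` splits into hardware ID 0x2e and MCA type 1, which is a different block from the bank 5 records (0xb0, type 5); which block that is would have to be looked up elsewhere.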

> The Dell firmware does not have an option to disable the C-States. I have to
> use  ZenStates-Linux.

And "idle=nomwait" doesn't work?

> Do you know of ways how to debug this better? How can I check, if C-State C6
> has been reached? (turbostat does not show it?)
>
> Can I log more over the serial console?

Before you do anything, let's see if idle=nomwait or CC6 disable works.
If the DC tag error no longer happens with either of them, then it
is related. If it still happens, then you should consider returning the
box or replacing the CPU or so, AFAICT.
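
When running such a test across several boxes, it is easy to lose track of which one was booted with which parameter. Below is a minimal sketch that pulls the two idle-related parameters mentioned in this report out of a kernel command line; it assumes a plain whitespace-separated command line (quoted values containing spaces are not handled), and the function names are made up for illustration.

```python
from pathlib import Path


def parse_cmdline(cmdline):
    """Split a kernel command line into {key: value}; bare flags map to None."""
    params = {}
    for token in cmdline.split():
        key, sep, value = token.partition("=")
        params[key] = value if sep else None
    return params


def idle_workarounds(cmdline):
    """Report the idle-related parameters discussed in this bug."""
    params = parse_cmdline(cmdline)
    return {
        "idle": params.get("idle"),
        "processor.max_cstate": params.get("processor.max_cstate"),
    }


if __name__ == "__main__":
    proc_cmdline = Path("/proc/cmdline")
    if proc_cmdline.exists():
        print(idle_workarounds(proc_cmdline.read_text()))
```

Logging this next to the uptime at each crash makes it clearer which configuration a given freeze belongs to.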

> Also, judging from all the problem reports on the Web, and documented
> solutions, can’t we assume, that there is a hardware or Linux kernel
> error?

I can't assume anything at the moment. Let's see if the CC6
disable/idle=nomwait does anything first.

Thx.
Comment 9 Paul Menzel 2020-02-14 11:20:00 UTC
OK, during three days with `./zenstates.py --c6-disable`, *deinhandtuch* did not crash. After rebooting the system with C-State C6 left enabled, it crashed after five hours (not pingable). I had to hold the power button for ten seconds to turn it off. After starting it again, no MCEs were logged.
Comment 10 Borislav Petkov 2020-02-14 13:27:22 UTC
What about "idle=nomwait"? Does that fix it too?
Comment 11 Paul Menzel 2020-02-14 13:36:36 UTC
(In reply to Borislav Petkov from comment #10)
> What about "idle=nomwait"? Does that fix it too?

I have to check that, but I thought that was solved in bug #196683 [1].

[1]: https://bugzilla.kernel.org/show_bug.cgi?id=196683
     "Random Soft Lockup on new Ryzen build"
Comment 12 Borislav Petkov 2020-02-14 14:02:05 UTC
How?

I don't think anything was solved there except a couple of people showing that idle=halt fixes the issue for them.
Comment 13 Paul Menzel 2020-02-15 12:33:18 UTC
(In reply to Borislav Petkov from comment #10)
> What about "idle=nomwait"? Does that fix it too?

*deinhandtuch* crashed some minutes ago, so `idle=nomwait` does not help in my case.
Comment 14 Borislav Petkov 2020-02-15 18:38:53 UTC
Ok, thanks. So it is not erratum 1109. Let's see what I can find out about the C-states aspect.
Comment 15 Paul Menzel 2020-02-17 11:31:18 UTC
(In reply to Paul Menzel from comment #7)

[…]

> Currently, all three machines (*fenchurch* with a replaced motherboard,
> *deinhandtuch* with disabled C-State C6) run fine since Monday.

For the record, *fenchurch* with the replaced motherboard rebooted tonight after running for three days, from Thu Feb 13 23:58 to Sun Feb 16 23:10, but Linux found no MCEs. So, it’s not the motherboard. I’ll ask for a CPU replacement.
Comment 16 Borislav Petkov 2020-02-17 12:04:34 UTC
How does that box behave with disabled C6? Still rebooting?
Comment 17 Paul Menzel 2020-02-17 12:53:04 UTC
One note: the crashes/freezes also happen when utilizing the full CPU, for example by running the script `kill-ryzen.sh` [1], which builds GCC separately on each thread.

[1]: https://github.com/Oxalin/ryzen-test
Comment 18 Paul Menzel 2020-02-18 18:12:20 UTC
(In reply to Paul Menzel from comment #7)
> (In reply to Borislav Petkov from comment #6)
> > (In reply to Paul Menzel from comment #4)

[…]

> > > Searching for 0xbea0000000000108, this is documented on the Gentoo
> > > Wiki page *Ryzen* [1]. So, we are back to the common processor error,
> > > and the references to C-State 6.
> > 
> > ... and do those errors go away of you boot with "idle=nomwait" or
> > disable C states in the BIOS?
> 
> The Dell firmware does not have an option to disable the C-States. I have to
> use  ZenStates-Linux.

I have to correct that statement. After unchecking the firmware option *C States Control* [1]

> Allows you to enable or disable additional processor sleep states. This
> option is enabled by default.

`zenstates.py` reports C-State C6 for the core (not package) as disabled.

```
$ sudo ~/src/ZenStates-Linux/zenstates.py -l
P0 - Enabled - FID = 8C - DID = 8 - VID = 32 - Ratio = 35.00 - vCore = 1.23750
P1 - Enabled - FID = 78 - DID = 8 - VID = 40 - Ratio = 30.00 - vCore = 1.15000
P2 - Enabled - FID = 7C - DID = 10 - VID = 6A - Ratio = 15.50 - vCore = 0.88750
P3 - Disabled
P4 - Disabled
P5 - Disabled
P6 - Disabled
P7 - Disabled
C6 State - Package - Enabled
C6 State - Core - Disabled
```
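
For reference, the numbers in that listing can be cross-checked with a short sketch. The ratio and vCore formulas below reproduce the P0 to P2 lines exactly from their FID/DID/VID fields. The MSR addresses and bit positions for the C6 flags are an assumption recalled from the ZenStates-Linux script, not taken from AMD documentation, so treat them as something to verify before relying on them.

```python
# Sketch only: cross-check the zenstates.py listing.
# MSR addresses and bit positions are ASSUMPTIONS recalled from the
# ZenStates-Linux script; verify them against the script or AMD docs.
MSR_PKG_C6 = 0xC0010292    # assumed: bit 32 set means package C6 enabled
MSR_CORE_C6 = 0xC0010296   # assumed: bits 22, 14 and 6 set means core C6 enabled
CORE_C6_MASK = (1 << 22) | (1 << 14) | (1 << 6)


def pstate_ratio(fid: int, did: int) -> float:
    """Core multiplier from the FID and DID fields of a P-state."""
    return 25 * fid / (12.5 * did)


def pstate_vcore(vid: int) -> float:
    """Core voltage in volts from the VID field of a P-state."""
    return 1.55 - 0.00625 * vid


def package_c6_enabled(msr_value: int) -> bool:
    """True if the (assumed) package C6 enable bit is set."""
    return bool(msr_value & (1 << 32))


def core_c6_enabled(msr_value: int) -> bool:
    """True only if all three (assumed) core C6 enable bits are set."""
    return (msr_value & CORE_C6_MASK) == CORE_C6_MASK
```

With the values from the listing, `pstate_ratio(0x8C, 0x8)` gives 35.0 and `pstate_vcore(0x32)` gives 1.2375, matching the P0 line. The raw MSR values would have to be read as root, for example from `/dev/cpu/0/msr` with the `msr` module loaded.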

Linux should log the C-State information more verbosely. I think I already tested that it still crashes after unchecking that option.

[1]: https://www.dell.com/support/manuals/de/de/debsdt1/optiplex-5055-r-desktop/optiplex_5055r_tower_om/system-setup-options?guid=guid-9704849c-2bf7-4f8a-bb5a-fcb0aa5e24c7&lang=en-us
Comment 19 Paul Menzel 2020-03-04 16:17:13 UTC
Just to confirm, the crashes still occur with *C States Control* disabled in the Dell firmware.

Borislav, were you able to find out more?
Comment 20 Borislav Petkov 2020-03-07 17:14:35 UTC
(In reply to Paul Menzel from comment #19)
> Just to confirm, the crashes still occur with *C States Control* disabled in
> the Dell firmware.

What does that option even do? I'm guessing it does not disable CC6...

> Borislav, were you able to find out more?

Well, the only thing we can do is add a cmdline option to disable CC6 on those machines and do what zenstates.py does.

So, how exactly do you run that script? Like this:

./zenstates.py --c6-disable

And does that fix the issue on those boxes?

Thx.
Comment 21 Paul Menzel 2020-03-09 16:33:09 UTC
(In reply to Borislav Petkov from comment #20)
> (In reply to Paul Menzel from comment #19)
> > Just to confirm, the crashes still occur with *C States Control* disabled in
> > the Dell firmware.
> 
> What does that option even do? I'm guessing it does not disable CC6...

As I do not have the firmware source code, I cannot say for sure. But, see comment #18:

    C6 State - Package - Enabled
    C6 State - Core - Disabled

opposed to showing both as *Enabled*.

> > Borislav, were you able to find out more?
> 
> Well, the only thing we can do is add a cmdline option to disable CC6 on
> those machines and do what zenstates.py does.

An update, all machines running GNU/Linux seem to be affected here. The users just didn’t report it, and we didn’t check yet. First inquiries show, that devices with Microsoft Windows do not seem to be affected.

A Linux kernel option is a first step, but no solution, as you cannot expect “normal” users to add that, and it should work out of the box.

Are you in contact with AMD?

> So, how exactly do you run that script? Like this:
> 
> ./zenstates.py --c6-disable

Yes, like that.

> And does that fix the issue on those boxes?

At least on one box it still crashed. We need a reproducer for this issue.
Comment 22 Borislav Petkov 2020-03-09 19:41:37 UTC
(In reply to Paul Menzel from comment #21)
> As I do not have the firmware source code, I cannot say for sure. But, see
> comment #18:
> 
>     C6 State - Package - Enabled
>     C6 State - Core - Disabled
> 
> opposed to showing both as *Enabled*.

Ok, looks like this option controls the core C6 state.

> An update, all machines running GNU/Linux seem to be affected here.

All your boxes or all boxes in general?

> users just didn’t report it, and we didn’t check yet. First inquiries show,
> that devices with Microsoft Windows do not seem to be affected.
> 
> A Linux kernel option is a first step, but no solution, as you cannot expect
> “normal” users to add that, and it should work out of the box.

Until there is a way to reliably detect those boxes, this is the best we
can do. If at all.

> At least on one box it still crashed. We need a reproducer for this issue.

After you ran the script? Which would mean that that CC6 disable is not
really fixing it. Unless the crash was something else.

IOW, so far we have this:

- a couple of dell boxes throw MCEs under load
- disabling CC6 helps but not always

So until there's another report on !dell machine, this looks like a dell
box problem to me.

I'm sure you'll correct me if I missed anything.
Comment 23 Paul Menzel 2020-03-10 22:08:16 UTC
(In reply to Borislav Petkov from comment #22)
> (In reply to Paul Menzel from comment #21)
> > As I do not have the firmware source code, I cannot say for sure. But, see
> > comment #18:
> > 
> >     C6 State - Package - Enabled
> >     C6 State - Core - Disabled
> > 
> > opposed to showing both as *Enabled*.
> 
> Ok, looks like this option controls the core C6 state.
> 
> > An update, all machines running GNU/Linux seem to be affected here.
> 
> All your boxes or all boxes in general?

All our Dell OptiPlex 5055 with GNU/Linux. It does not happen with Microsoft Windows.

> > users just didn’t report it, and we didn’t check yet. First inquiries show,
> > that devices with Microsoft Windows do not seem to be affected.
> > 
> > A Linux kernel option is a first step, but no solution, as you cannot expect
> > “normal” users to add that, and it should work out of the box.
> 
> Until there is a way to reliably detect those boxes, this is the best we
> can do. If at all.
> 
> > At least on one box it still crashed. We need a reproducer for this issue.
> 
> After you ran the script? Which would mean that that CC6 disable is not
> really fixing it. Unless the crash was something else.

I agree. But how do I find out, what the crash reason was?

> IOW, so far we have this:
> 
> - a couple of dell boxes throw MCEs under load
> - disabling CC6 helps but not always

Apparently, but nobody knows, as there is no way to check that (no logs) and there is no reproducer.

> So until there's another report on !dell machine, this looks like a dell
> box problem to me.

You choose to ignore all the posts on the Web for reasons I do not understand. It’s such a big problem, that it is documented in the Gentoo Wiki and Arch Wiki.

https://wiki.archlinux.org/index.php/Ryzen#Random_reboots

> I'm sure you'll correct me if I missed anything.

I really would like to have some guidance on how to further debug this. Do you have more ideas or should I bring it up on the mailing list?

Also, you always seem to miss my questions of having contacted AMD.
Comment 24 Borislav Petkov 2020-03-10 22:44:35 UTC
(In reply to Paul Menzel from comment #23)
> I agree. But how do I find out, what the crash reason was?

Maybe look at serial console output, or use kdump.

> Apparently, but nobody knows, as there is no way to check that (no logs) and
> there is no reproducer.

That sucks.

> You choose to ignore all the posts on the Web for reasons I do not
> understand. It’s such a big problem, that it is documented in the Gentoo
> Wiki and Arch Wiki.

So far it looks to me like a bunch of problems and people tend to lump
them all together. So "the Web" has a lot of random bug reports, a lot
of them unreliable.

For example, the idle=nomwait is supposed to help with erratum 1109. It
doesn't help on your boxes. So yours must be something else.

See what I mean?

> I really would like to have some guidance on how to further debug this. Do
> you have more ideas or should I bring it up on the mailing list?

The only thing you can do is send the box to Dell and let them deal with
it. This is the official way. Especially if the other boxes which are
identical don't get the MCEs, then this very well sounds like a silicon
issue with this particular sample.

> Also, you always seem to miss my questions of having contacted AMD.

Why does that matter for the issue at hand?

Hypothetically, if I had contacted them, they would have probably said
the same thing: send the boxes back to Dell. Dell would then talk to
their AMD contact and sort out the CPU issue. This is how it is done in
general.

HTH.
Comment 25 Paul Menzel 2020-03-11 00:17:04 UTC
(In reply to Borislav Petkov from comment #24)
> (In reply to Paul Menzel from comment #23)
> > I agree. But how do I find out, what the crash reason was?
> 
> Maybe look at serial console output, or use kdump.

There was never anything on the console. I’ll look into kdump. Thanks. (We have a crashkernel set up, but as the system just freezes and can only be turned off by pressing the power button, the RAM contents are likely lost.)

> > Apparently, but nobody knows, as there is no way to check that (no logs) and
> > there is no reproducer.
> 
> That sucks.
> 
> > You choose to ignore all the posts on the Web for reasons I do not
> > understand. It’s such a big problem, that it is documented in the Gentoo
> > Wiki and Arch Wiki.
> 
> So far it looks to me like a bunch of problems and people tend to lump
> them all together. So "the Web" has a lot of random bug reports, a lot
> of them unreliable.
> 
> For example, the idle=nomwait is supposed to help with erratum 1109. It
> doesn't help on your boxes. So yours must be something else.
> 
> See what I mean?

Yes, but at least the MCEs should point in some direction.

> > I really would like to have some guidance on how to further debug this. Do
> > you have more ideas or should I bring it up on the mailing list?
> 
> The only thing you can do is send the box to Dell and let them deal with
> it. This is the official way. Especially if the other boxes which are
> identical don't get the MCEs, then this very well sounds like a silicon
> issue with this particular sample.

That it doesn’t happen with Microsoft Windows makes me hopeful that a fix in the OS (Linux kernel) is possible.

> > Also, you always seem to miss my questions of having contacted AMD.
> 
> Why does that matter for the issue at hand?
> 
> Hypothetically, if I had contacted them, they would have probably said
> the same thing: send the boxes back to Dell. Dell would then talk to
> their AMD contact and sort out the CPU issue. This is how it is done in
> general.

Sorry, it looks like you never had to deal with Dell support (lucky you). If you say that you use GNU/Linux (even the Ubuntu from Dell), they tell you they do not support that, and that support is given by the community (and point to some Ubuntu forum).

The alternative would be, they say, you are right, we had the same problem in the past, see this erratum and this fix for Microsoft Windows.
Comment 26 Paul Menzel 2020-03-11 09:51:50 UTC
(In reply to Paul Menzel from comment #25)

[…]

> The alternative would be, they say, you are right, we had the same problem
> in the past, see this erratum and this fix for Microsoft Windows.

s/they say/AMD says/
Comment 27 Paul Menzel 2020-06-15 20:30:50 UTC
Today, we have this issue with another system of the same model.

    2020-06-15T15:19:57.779069+02:00 serotimor kernel: [    0.000000] Linux version 5.4.39.mx64.334 (root@lol.molgen.mpg.de) (gcc version 7.5.0 (GCC)) #1 SMP Thu May 7 14:27:50 CEST 2020

After the crash, rebooting, Linux displays the MCE messages.

```
2020-06-15T16:36:38.939468+02:00 serotimor kernel: [    0.439355] smpboot: CPU0: AMD Ryzen 5 PRO 1500 Quad-Core Processor (family: 0x17, model: 0x1, stepping: 0x1)
2020-06-15T16:36:38.939469+02:00 serotimor kernel: [    0.439976] Performance Events: Fam17h+ core perfctr, AMD PMU driver.
2020-06-15T16:36:38.939469+02:00 serotimor kernel: [    0.439977] ... version:                0
2020-06-15T16:36:38.939470+02:00 serotimor kernel: [    0.440217] ... bit width:              48
2020-06-15T16:36:38.939471+02:00 serotimor kernel: [    0.440461] ... generic registers:      6
2020-06-15T16:36:38.939471+02:00 serotimor kernel: [    0.440700] ... value mask:             0000ffffffffffff
2020-06-15T16:36:38.939473+02:00 serotimor kernel: [    0.440976] ... max period:             00007fffffffffff
2020-06-15T16:36:38.939473+02:00 serotimor kernel: [    0.441291] ... fixed-purpose events:   0
2020-06-15T16:36:38.939474+02:00 serotimor kernel: [    0.441530] ... event mask:             000000000000003f
2020-06-15T16:36:38.939475+02:00 serotimor kernel: [    0.441866] rcu: Hierarchical SRCU implementation.
2020-06-15T16:36:38.939475+02:00 serotimor kernel: [    0.442427] smp: Bringing up secondary CPUs ...
2020-06-15T16:36:38.939476+02:00 serotimor kernel: [    0.443041] x86: Booting SMP configuration:
2020-06-15T16:36:38.939476+02:00 serotimor kernel: [    0.443291] .... node  #0, CPUs:        #1  #2
2020-06-15T16:36:38.939477+02:00 serotimor kernel: [    0.445000] mce: [Hardware Error]: Machine check events logged
2020-06-15T16:36:38.939478+02:00 serotimor kernel: [    0.445978] mce: [Hardware Error]: CPU 2: Machine Check: 0 Bank 5: bea0000000000108
2020-06-15T16:36:38.939502+02:00 serotimor kernel: [    0.446427] mce: [Hardware Error]: TSC 0 ADDR 1ffff81ac466c MISC d012000101000000 SYND 4d000000 IPID 500b000000000 
2020-06-15T16:36:38.939503+02:00 serotimor kernel: [    0.446977] mce: [Hardware Error]: PROCESSOR 2:800f11 TIME 1592231781 SOCKET 0 APIC 2 microcode 8001137
2020-06-15T16:36:38.939505+02:00 serotimor kernel: [    0.451003] mce: [Hardware Error]: Machine check events logged
2020-06-15T16:36:38.939505+02:00 serotimor kernel: [    0.451979] mce: [Hardware Error]: CPU 5: Machine Check: 0 Bank 5: bea0000000000108
2020-06-15T16:36:38.939506+02:00 serotimor kernel: [    0.452429] mce: [Hardware Error]: TSC 0 ADDR 1ffff81ac4664 MISC d012000101000000 SYND 4d000000 IPID 500b000000000 
2020-06-15T16:36:38.939508+02:00 serotimor kernel: [    0.452977] mce: [Hardware Error]: PROCESSOR 2:800f11 TIME 1592231781 SOCKET 0 APIC 9 microcode 8001137
```
Comment 28 Paul Menzel 2020-06-15 20:32:08 UTC
Looks like others still have the same issue: bug 206903 (Spontaneous reboots with Ryzen-3700x (Machine Check: 0 Bank 5: bea0000000000108)) [1].

[1]: https://bugzilla.kernel.org/show_bug.cgi?id=206903
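
As a reading aid for the status value `bea0000000000108` that keeps showing up in Bank 5, here is a minimal sketch that decodes only the architectural x86 MCi_STATUS flag bits (63 down to 57) and the low 16-bit error code. It deliberately does not interpret the vendor-specific bits or the error code itself; the function name `decode_mca_status` is made up for this example.

```python
# Architectural MCi_STATUS flag bits shared by x86 machine-check banks.
STATUS_FLAGS = {
    63: "VAL",    # register holds a valid error
    62: "OVER",   # an earlier error was overwritten
    61: "UC",     # uncorrected error
    60: "EN",     # error reporting was enabled
    59: "MISCV",  # MISC register contents are valid
    58: "ADDRV",  # ADDR register contents are valid
    57: "PCC",    # processor context corrupt
}


def decode_mca_status(status: int) -> dict:
    """Return the set architectural flags and the low 16-bit error code."""
    flags = [
        name
        for bit, name in sorted(STATUS_FLAGS.items(), reverse=True)
        if (status >> bit) & 1
    ]
    return {"flags": flags, "error_code": status & 0xFFFF}


result = decode_mca_status(0xBEA0000000000108)
print(result["flags"], hex(result["error_code"]))
```

For this value the set flags are VAL, UC, EN, MISCV, ADDRV and PCC, with error code 0x108, so it is an uncorrected error with corrupted processor context, which fits a machine that freezes or resets without logging anything.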
Comment 29 Paul Menzel 2020-06-16 12:28:57 UTC
    [    1.185038] microcode: CPU0: patch_level=0x08001137

Rich, Mario, could you please poke Dell to release firmware updates with the latest microcode updates (0x08701021) included [1]? The Linux Firmware repository also does not contain more recent μcode updates [2].


[1]: https://bugzilla.kernel.org/show_bug.cgi?id=206903#c17
[2]: https://git.kernel.org/pub/scm/linux/kernel/git/firmware/linux-firmware.git/tree/amd-ucode
Comment 30 Paul Menzel 2020-06-16 13:18:10 UTC
It still crashed with `amdgpu.ppfeaturemask=0xfffffffb`.
Comment 31 Paul Menzel 2020-06-16 17:29:54 UTC
Another system (*walle*) also froze with `amdgpu.ppfeaturemask=0xffffbffd`.
Comment 32 Paul Menzel 2020-06-18 20:49:49 UTC
*donut* seems to have crashed with Linux 5.4.39 and `amdgpu.dpm=0`.
Comment 33 Paul Menzel 2020-07-03 11:45:15 UTC
*serotimor* also crashed with Linux 5.4.39 and `amdgpu.ppfeaturemask=fffd3fff` (4294787071), which is used by default with Linux 4.19.57.
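
To compare the `amdgpu.ppfeaturemask` values tried so far, a small sketch can list which bits each mask clears relative to all 32 bits set. The helper name `cleared_bits` is made up for this example, and the meaning of the individual bits (defined by the amdgpu driver) is intentionally not decoded here.

```python
from typing import List


def cleared_bits(mask: int, width: int = 32) -> List[int]:
    """Return the positions of the bits that are 0 in a feature mask."""
    return [bit for bit in range(width) if not (mask >> bit) & 1]


# Masks tried in the comments above.
for mask in (0xFFFFFFFB, 0xFFFFBFFD, 0xFFFD3FFF):
    print(f"{mask:#010x} ({mask}): cleared bits {cleared_bits(mask)}")
```

This gives bit 2 for `0xfffffffb`, bits 1 and 14 for `0xffffbffd`, and bits 14, 15 and 17 for `0xfffd3fff`, and it confirms that `0xfffd3fff` equals 4294787071 in decimal.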
Comment 34 Paul Menzel 2020-07-04 08:05:37 UTC
*serotimor* crashed with Linux 5.4.39 and `amdgpu.dpm=0` after 15 hours.
Comment 35 Zounp 2020-07-14 16:41:30 UTC
Comment 22
You wrote: "So until there's another report on !dell machine, this looks like a dell box problem to me."


I maintain two Ryzen 5 1600 computers with ASrock motherboards which have freezes.
The Ryzen 5 1600's have been RMAed by AMD but the freezes stay.
Comment 36 Paul Menzel 2020-07-14 16:46:41 UTC
(In reply to Zounp from comment #35)
> Comment 22
> You wrote: "So until there's another report on !dell machine, this looks
> like a dell box problem to me."
> 
> I maintain two Ryzen 5 1600 computers with ASrock motherboards which have
> freezes.
> The Ryzen 5 1600's have been RMAed by AMD but the freezes stay.

As the issue is very convoluted, I suggest that you open a separate issue and reference it here. Maybe you are even able to note down the CPU serial number, so production dates might give a clue.
Comment 37 Quack66 2021-03-19 03:06:03 UTC
I'm also experiencing the same thing with a Ryzen 3600x paired with a Gigabyte ab350m-ds3h and a rx580. Ubuntu is unusable for me since my whole PC locks up every time within the first 30 minutes after boot. Since I can reproduce the issue pretty often, let me know if you need more info.


I've tried the following to no avail:

- Disabling C6 states
- Selecting "Typical Current Idle" in the BIOS
- Setting the following boot options: `idle=nomwait processor.max_cstate=5 rcu_nocbs=0-11`


I've enabled kdump, but the files in /var/crash don't seem to include anything relating to the crash.


Here's some info:


```Linux charles-ubuntu 5.8.0-45-generic #51~20.04.1-Ubuntu SMP Tue Feb 23 13:46:31 UTC 2021 x86_64 x86_64 x86_64 GNU/Linux```


```00:00.0 Host bridge: Advanced Micro Devices, Inc. [AMD] Starship/Matisse Root Complex
00:00.2 IOMMU: Advanced Micro Devices, Inc. [AMD] Starship/Matisse IOMMU
00:01.0 Host bridge: Advanced Micro Devices, Inc. [AMD] Starship/Matisse PCIe Dummy Host Bridge
00:01.3 PCI bridge: Advanced Micro Devices, Inc. [AMD] Starship/Matisse GPP Bridge
00:02.0 Host bridge: Advanced Micro Devices, Inc. [AMD] Starship/Matisse PCIe Dummy Host Bridge
00:03.0 Host bridge: Advanced Micro Devices, Inc. [AMD] Starship/Matisse PCIe Dummy Host Bridge
00:03.1 PCI bridge: Advanced Micro Devices, Inc. [AMD] Starship/Matisse GPP Bridge
00:04.0 Host bridge: Advanced Micro Devices, Inc. [AMD] Starship/Matisse PCIe Dummy Host Bridge
00:05.0 Host bridge: Advanced Micro Devices, Inc. [AMD] Starship/Matisse PCIe Dummy Host Bridge
00:07.0 Host bridge: Advanced Micro Devices, Inc. [AMD] Starship/Matisse PCIe Dummy Host Bridge
00:07.1 PCI bridge: Advanced Micro Devices, Inc. [AMD] Starship/Matisse Internal PCIe GPP Bridge 0 to bus[E:B]
00:08.0 Host bridge: Advanced Micro Devices, Inc. [AMD] Starship/Matisse PCIe Dummy Host Bridge
00:08.1 PCI bridge: Advanced Micro Devices, Inc. [AMD] Starship/Matisse Internal PCIe GPP Bridge 0 to bus[E:B]
00:08.2 PCI bridge: Advanced Micro Devices, Inc. [AMD] Starship/Matisse Internal PCIe GPP Bridge 0 to bus[E:B]
00:08.3 PCI bridge: Advanced Micro Devices, Inc. [AMD] Starship/Matisse Internal PCIe GPP Bridge 0 to bus[E:B]
00:14.0 SMBus: Advanced Micro Devices, Inc. [AMD] FCH SMBus Controller (rev 61)
00:14.3 ISA bridge: Advanced Micro Devices, Inc. [AMD] FCH LPC Bridge (rev 51)
00:18.0 Host bridge: Advanced Micro Devices, Inc. [AMD] Matisse Device 24: Function 0
00:18.1 Host bridge: Advanced Micro Devices, Inc. [AMD] Matisse Device 24: Function 1
00:18.2 Host bridge: Advanced Micro Devices, Inc. [AMD] Matisse Device 24: Function 2
00:18.3 Host bridge: Advanced Micro Devices, Inc. [AMD] Matisse Device 24: Function 3
00:18.4 Host bridge: Advanced Micro Devices, Inc. [AMD] Matisse Device 24: Function 4
00:18.5 Host bridge: Advanced Micro Devices, Inc. [AMD] Matisse Device 24: Function 5
00:18.6 Host bridge: Advanced Micro Devices, Inc. [AMD] Matisse Device 24: Function 6
00:18.7 Host bridge: Advanced Micro Devices, Inc. [AMD] Matisse Device 24: Function 7
01:00.0 USB controller: Advanced Micro Devices, Inc. [AMD] X370 Series Chipset USB 3.1 xHCI Controller (rev 02)
01:00.1 SATA controller: Advanced Micro Devices, Inc. [AMD] X370 Series Chipset SATA Controller (rev 02)
01:00.2 PCI bridge: Advanced Micro Devices, Inc. [AMD] X370 Series Chipset PCIe Upstream Port (rev 02)
02:00.0 PCI bridge: Advanced Micro Devices, Inc. [AMD] 300 Series Chipset PCIe Port (rev 02)
02:01.0 PCI bridge: Advanced Micro Devices, Inc. [AMD] 300 Series Chipset PCIe Port (rev 02)
02:02.0 PCI bridge: Advanced Micro Devices, Inc. [AMD] 300 Series Chipset PCIe Port (rev 02)
02:03.0 PCI bridge: Advanced Micro Devices, Inc. [AMD] 300 Series Chipset PCIe Port (rev 02)
02:04.0 PCI bridge: Advanced Micro Devices, Inc. [AMD] 300 Series Chipset PCIe Port (rev 02)
04:00.0 Ethernet controller: Realtek Semiconductor Co., Ltd. RTL8111/8168/8411 PCI Express Gigabit Ethernet Controller (rev 0c)
08:00.0 VGA compatible controller: Advanced Micro Devices, Inc. [AMD/ATI] Ellesmere [Radeon RX 470/480/570/570X/580/580X/590] (rev e7)
08:00.1 Audio device: Advanced Micro Devices, Inc. [AMD/ATI] Ellesmere HDMI Audio [Radeon RX 470/480 / 570/580/590]
09:00.0 Non-Essential Instrumentation [1300]: Advanced Micro Devices, Inc. [AMD] Starship/Matisse PCIe Dummy Function
0a:00.0 Non-Essential Instrumentation [1300]: Advanced Micro Devices, Inc. [AMD] Starship/Matisse Reserved SPP
0a:00.1 Encryption controller: Advanced Micro Devices, Inc. [AMD] Starship/Matisse Cryptographic Coprocessor PSPCPP
0a:00.3 USB controller: Advanced Micro Devices, Inc. [AMD] Matisse USB 3.0 Host Controller
0a:00.4 Audio device: Advanced Micro Devices, Inc. [AMD] Starship/Matisse HD Audio Controller
0b:00.0 SATA controller: Advanced Micro Devices, Inc. [AMD] FCH SATA Controller [AHCI mode] (rev 51)
0c:00.0 SATA controller: Advanced Micro Devices, Inc. [AMD] FCH SATA Controller [AHCI mode] (rev 51)```

```Mar 18 22:28:16 charles-ubuntu kernel: [  276.223281] rq->clock_update_flags < RQCF_ACT_SKIP
Mar 18 22:28:16 charles-ubuntu kernel: [  276.223287] WARNING: CPU: 1 PID: 0 at kernel/sched/sched.h:1115 assert_clock_updated.isra.0.part.0+0x17/0x20
Mar 18 22:28:16 charles-ubuntu kernel: [  276.223287] Modules linked in: rfcomm cmac algif_hash algif_skcipher af_alg bnep edac_mce_amd kvm_amd amdgpu kvm snd_hda_codec_realtek snd_hda_codec_generic crct10dif_pclmul ledtrig_audio ghash_clmulni_intel aesni_intel snd_hda_codec_hdmi crypto_simd cryptd glue_helper snd_hda_intel snd_intel_dspcfg snd_usb_audio snd_hda_codec snd_usbmidi_lib snd_hda_core mc snd_hwdep snd_pcm snd_seq_midi snd_seq_midi_event snd_rawmidi rapl iommu_v2 gpu_sched btusb snd_seq ttm 88x2bu(OE) btrtl snd_seq_device btbcm drm_kms_helper snd_timer btintel cec bluetooth rc_core snd i2c_algo_bit hid_sony fb_sys_fops syscopyarea ecdh_generic cfg80211 wmi_bmof joydev sysfillrect input_leds ff_memless ecc sysimgblt k10temp ccp soundcore mac_hid sch_fq_codel msr parport_pc ppdev lp drm parport ip_tables x_tables autofs4 hid_generic usbhid hid crc32_pclmul r8169 i2c_piix4 xhci_pci realtek ahci xhci_pci_renesas libahci wmi gpio_amdpt gpio_generic
Mar 18 22:28:16 charles-ubuntu kernel: [  276.223315] CPU: 1 PID: 0 Comm: swapper/1 Kdump: loaded Tainted: G           OE     5.8.0-45-generic #51~20.04.1-Ubuntu
Mar 18 22:28:16 charles-ubuntu kernel: [  276.223316] Hardware name: Gigabyte Technology Co., Ltd. AB350M-DS3H/AB350M-DS3H-CF, BIOS F50d 07/02/2020
Mar 18 22:28:16 charles-ubuntu kernel: [  276.223317] RIP: 0010:assert_clock_updated.isra.0.part.0+0x17/0x20
Mar 18 22:28:16 charles-ubuntu kernel: [  276.223318] Code: af 79 0e 00 84 c0 75 c1 e9 40 ff ff ff e8 21 1f aa 00 90 55 48 c7 c7 28 ca db 85 c6 05 10 fb 75 01 01 48 89 e5 e8 2f a6 a4 00 <0f> 0b 5d c3 0f 1f 44 00 00 0f 1f 44 00 00 55 bf 17 00 00 00 48 89
Mar 18 22:28:16 charles-ubuntu kernel: [  276.223319] RSP: 0018:ffffa04800264e58 EFLAGS: 00010082
Mar 18 22:28:16 charles-ubuntu kernel: [  276.223320] RAX: 0000000000000000 RBX: ffff8a44ce86c740 RCX: 0000000000000000
Mar 18 22:28:16 charles-ubuntu kernel: [  276.223320] RDX: 0000000000000006 RSI: ffffffff865b0da6 RDI: 0000000000000046
Mar 18 22:28:16 charles-ubuntu kernel: [  276.223321] RBP: ffffa04800264e58 R08: ffffffff865b0d80 R09: 0000000000000026
Mar 18 22:28:16 charles-ubuntu kernel: [  276.223321] R10: 0000000000000000 R11: 0000000000000001 R12: ffff8a44ce86c740
Mar 18 22:28:16 charles-ubuntu kernel: [  276.223322] R13: 0000000000000001 R14: ffff8a44ccfbaf00 R15: ffff8a44ce85fa80
Mar 18 22:28:16 charles-ubuntu kernel: [  276.223322] FS:  0000000000000000(0000) GS:ffff8a44ce840000(0000) knlGS:0000000000000000
Mar 18 22:28:16 charles-ubuntu kernel: [  276.223323] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Mar 18 22:28:16 charles-ubuntu kernel: [  276.223323] CR2: 00007f1414077d10 CR3: 00000002eda0a000 CR4: 0000000000340ee0
Mar 18 22:28:16 charles-ubuntu kernel: [  276.223324] Call Trace:
Mar 18 22:28:16 charles-ubuntu kernel: [  276.223325]  <IRQ>
Mar 18 22:28:16 charles-ubuntu kernel: [  276.223328]  update_rq_clock+0xe8/0x100
Mar 18 22:28:16 charles-ubuntu kernel: [  276.223329]  scheduler_tick+0x58/0x130
Mar 18 22:28:16 charles-ubuntu kernel: [  276.223331]  update_process_times+0x52/0x60
Mar 18 22:28:16 charles-ubuntu kernel: [  276.223332]  tick_sched_handle.isra.0+0x25/0x60
Mar 18 22:28:16 charles-ubuntu kernel: [  276.223333]  tick_sched_timer+0x40/0x80
Mar 18 22:28:16 charles-ubuntu kernel: [  276.223334]  __hrtimer_run_queues+0xf7/0x270
Mar 18 22:28:16 charles-ubuntu kernel: [  276.223334]  ? tick_sched_do_timer+0x60/0x60
Mar 18 22:28:16 charles-ubuntu kernel: [  276.223336]  hrtimer_interrupt+0x109/0x220
Mar 18 22:28:16 charles-ubuntu kernel: [  276.223337]  __sysvec_apic_timer_interrupt+0x64/0x100
Mar 18 22:28:16 charles-ubuntu kernel: [  276.223339]  asm_call_irq_on_stack+0x12/0x20
Mar 18 22:28:16 charles-ubuntu kernel: [  276.223340]  </IRQ>
Mar 18 22:28:16 charles-ubuntu kernel: [  276.223341]  sysvec_apic_timer_interrupt+0x7e/0x90
Mar 18 22:28:16 charles-ubuntu kernel: [  276.223342]  asm_sysvec_apic_timer_interrupt+0x12/0x20
Mar 18 22:28:16 charles-ubuntu kernel: [  276.223344] RIP: 0010:cpuidle_enter_state+0xca/0x3e0
Mar 18 22:28:16 charles-ubuntu kernel: [  276.223344] Code: ff e8 da f0 7c ff 80 7d c7 00 74 17 9c 58 0f 1f 44 00 00 f6 c4 02 0f 85 e4 02 00 00 31 ff e8 4d 3d 83 ff fb 66 0f 1f 44 00 00 <45> 85 e4 0f 88 39 02 00 00 49 63 d4 4c 8b 7d d0 4c 2b 7d c8 48 8d
Mar 18 22:28:16 charles-ubuntu kernel: [  276.223345] RSP: 0018:ffffa04800137e38 EFLAGS: 00000246
Mar 18 22:28:16 charles-ubuntu kernel: [  276.223345] RAX: ffff8a44ce86c740 RBX: ffff8a44bfd93c00 RCX: 000000000000001f
Mar 18 22:28:16 charles-ubuntu kernel: [  276.223346] RDX: 0000000000000000 RSI: 0000000021bf5c7a RDI: 0000000000000000
Mar 18 22:28:16 charles-ubuntu kernel: [  276.223346] RBP: ffffa04800137e78 R08: 000000405030ba58 R09: 0000000000000a68
Mar 18 22:28:16 charles-ubuntu kernel: [  276.223347] R10: ffff8a44ce86b404 R11: ffff8a44ce86b3e4 R12: 0000000000000002
Mar 18 22:28:16 charles-ubuntu kernel: [  276.223347] R13: ffffffff86177e80 R14: 0000000000000002 R15: ffff8a44bfd93c00
Mar 18 22:28:16 charles-ubuntu kernel: [  276.223349]  ? cpuidle_enter_state+0xa6/0x3e0
Mar 18 22:28:16 charles-ubuntu kernel: [  276.223350]  cpuidle_enter+0x2e/0x40
Mar 18 22:28:16 charles-ubuntu kernel: [  276.223351]  call_cpuidle+0x23/0x40
Mar 18 22:28:16 charles-ubuntu kernel: [  276.223352]  do_idle+0x1e7/0x280
Mar 18 22:28:16 charles-ubuntu kernel: [  276.223353]  cpu_startup_entry+0x20/0x30
Mar 18 22:28:16 charles-ubuntu kernel: [  276.223354]  start_secondary+0x159/0x1a0
Mar 18 22:28:16 charles-ubuntu kernel: [  276.223356]  secondary_startup_64+0xb6/0xc0
Mar 18 22:28:16 charles-ubuntu kernel: [  276.223357] ---[ end trace 2122214e7ae2577c ]---
Mar 18 22:30:54 charles-ubuntu kernel: [  434.401855] Web Content[4621]: segfault at 0 ip 0000000000000000 sp 00007ffd0daf5070 error 14```
Comment 38 Paul Menzel 2021-03-23 09:27:28 UTC
No idea, if the warning is related to the crash. You should report it to Ubuntu (Launchpad bug tracker).

    [  276.223287] WARNING: CPU: 1 PID: 0 at kernel/sched/sched.h:1115 assert_clock_updated.isra.0.part.0+0x17/0x20

Lastly, please also try the current Linux kernel from the Ubuntu Linux Kernel PPA [1].


[1]: https://kernel.ubuntu.com/~kernel-ppa/
Comment 39 Paul Menzel 2021-05-17 15:40:31 UTC
As an update, with *C States Control* unchecked in the system firmware, the crashes reduced noticeably on all systems running GNU/Linux.

There are still a few unexplainable crashes, so there is still something else at play, but as it’s not reproducible and impossible to debug, it’s an improvement.
Comment 40 Paul Menzel 2021-05-17 15:41:22 UTC
For the record, with Linux 5.13-rc2:

```
$ sudo cat /sys/kernel/debug/dri/0/amdgpu_firmware_info
VCE feature version: 0, firmware version: 0x00000000
UVD feature version: 0, firmware version: 0x40000d00
MC feature version: 0, firmware version: 0x00a17730
ME feature version: 29, firmware version: 0x00000091
PFP feature version: 29, firmware version: 0x00000054
CE feature version: 29, firmware version: 0x0000003d
RLC feature version: 1, firmware version: 0x00000001
RLC SRLC feature version: 0, firmware version: 0x00000000
RLC SRLG feature version: 0, firmware version: 0x00000000
RLC SRLS feature version: 0, firmware version: 0x00000000
MEC feature version: 0, firmware version: 0x00000000
SOS feature version: 0, firmware version: 0x00000000
ASD feature version: 0, firmware version: 0x00000000
TA XGMI feature version: 0x00000000, firmware version: 0x00000000
TA RAS feature version: 0x00000000, firmware version: 0x00000000
TA HDCP feature version: 0x00000000, firmware version: 0x00000000
TA DTM feature version: 0x00000000, firmware version: 0x00000000
TA RAP feature version: 0x00000000, firmware version: 0x00000000
TA SECUREDISPLAY feature version: 0x00000000, firmware version: 0x00000000
SMC feature version: 0, firmware version: 0x10020000
SDMA0 feature version: 0, firmware version: 0x00000000
SDMA1 feature version: 0, firmware version: 0x00000000
VCN feature version: 0, firmware version: 0x00000000
DMCU feature version: 0, firmware version: 0x00000000
DMCUB feature version: 0, firmware version: 0x00000000
TOC feature version: 0, firmware version: 0x00000000
VBIOS version: 113-C8690301-102
```
Comment 41 Alex Deucher 2021-05-17 17:09:13 UTC
Does this patch help?
https://www.spinics.net/lists/linux-acpi/msg101022.html
Comment 42 Mario Limonciello (AMD) 2021-05-17 18:50:07 UTC
(FYI; that patch is targeted for 5.14: https://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm.git/commit/?h=bleeding-edge&id=65ea8f2c6e230bdf71fed0137cf9e9d1b307db32)

When you test with it, it would be really good to note whether you see
"FW issue: working around C-state latencies out of order" emitted to know if it's doing anything or it's a no-op for you.
Comment 43 Paul Menzel 2021-05-18 04:43:16 UTC
Thank you for the suggestion. I am going to build a test kernel, but looking at the C-State latencies, there does not seem to be anything out of order:

    $ grep . /sys/devices/system/cpu/cpu0/cpuidle/*/latency
    /sys/devices/system/cpu/cpu0/cpuidle/state0/latency:0
    /sys/devices/system/cpu/cpu0/cpuidle/state1/latency:0
    /sys/devices/system/cpu/cpu0/cpuidle/state2/latency:100

    $ sudo ~/src/ZenStates-Linux/zenstates.py -l
    P0 - Enabled - FID = 8C - DID = 8 - VID = 32 - Ratio = 35.00 - vCore = 1.23750
    P1 - Enabled - FID = 78 - DID = 8 - VID = 40 - Ratio = 30.00 - vCore = 1.15000
    P2 - Enabled - FID = 7C - DID = 10 - VID = 6A - Ratio = 15.50 - vCore = 0.88750
    P3 - Disabled
    P4 - Disabled
    P5 - Disabled
    P6 - Disabled
    P7 - Disabled
    C6 State - Package - Enabled
    C6 State - Core - Disabled

    $ sudo ./turbostat
    turbostat version 20.09.30 - Len Brown <lenb@kernel.org>
    CPUID(0): AuthenticAMD 0xd CPUID levels; 0x8000001f xlevels; family:model:stepping 0x17:1:1 (23:1:1)
    CPUID(1): SSE3 MONITOR - - - TSC MSR - HT -
    CPUID(6): APERF, No-TURBO, No-DTS, No-PTM, No-HWP, No-HWPnotify, No-HWPwindow, No-HWPepp, No-HWPpkg, No-EPB
    CPUID(7): No-SGX
    RAPL: 234 sec. Joule Counter Range, at 280 Watts
    /dev/cpu_dma_latency: 2000000000 usec (default)
    current_driver: acpi_idle
    current_governor: menu
    current_governor_ro: menu
    cpu2: POLL: CPUIDLE CORE POLL IDLE
    cpu2: C1: ACPI HLT
    cpu2: C2: ACPI P_LVL2 IOPORT 0x414
    cpu2: cpufreq driver: acpi-cpufreq
    cpu2: cpufreq governor: schedutil
    cpufreq boost: 1
    cpu0: MSR_RAPL_PWR_UNIT: 0x000a1003 (0.125000 Watts, 0.000015 Joules, 0.000977 sec.)
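
The by-eye check of the cpuidle latencies at the top of this comment could be automated roughly as follows. This is only a sketch: it assumes "out of order" means a deeper state reporting a lower exit latency than the state before it, which may not be exactly what the kernel patch tests, and the helper names `read_latencies` and `out_of_order` are made up for illustration.

```python
from pathlib import Path


def read_latencies(cpuidle_dir):
    """Return [(state_name, latency_us), ...] ordered by state index."""
    states = sorted(
        Path(cpuidle_dir).glob("state[0-9]*"),
        key=lambda p: int(p.name[len("state"):]),
    )
    return [(s.name, int((s / "latency").read_text().strip())) for s in states]


def out_of_order(latencies):
    """Return adjacent pairs where the deeper state has the lower latency.

    Assumption: this is a plain monotonicity check, not the kernel's logic.
    """
    return [(a, b) for a, b in zip(latencies, latencies[1:]) if b[1] < a[1]]


if __name__ == "__main__":
    # Path taken from the grep command earlier in this comment.
    found = read_latencies("/sys/devices/system/cpu/cpu0/cpuidle")
    for name, latency in found:
        print(f"{name}: {latency} us")
    print("out of order:", out_of_order(found))
```

For the values shown above (0, 0, 100) the list of out-of-order pairs is empty, which matches the reading that the workaround would probably have nothing to do on this machine.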
Comment 44 Paul Menzel 2021-05-18 05:20:21 UTC
Building commit 31696a0a4ca5 (Merge branch 'acpi-x86' into bleeding-edge) [1], the log message does not show up.

    $ dmesg | grep FW
    $

[1]: https://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm.git/commit/?h=bleeding-edge&id=31696a0a4ca5
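
For checking several machines, the `dmesg | grep FW` test above could be scripted along these lines. It is a sketch, not a tested tool: the message text is copied from comment 42, the function names are hypothetical, and reading `dmesg` may require root on systems that restrict kernel log access.

```python
import subprocess

# Message text as quoted in comment 42.
FW_MESSAGE = "FW issue: working around C-state latencies out of order"


def workaround_triggered(log_text):
    """Return True if any line of the kernel log contains the workaround message."""
    return any(FW_MESSAGE in line for line in log_text.splitlines())


def read_kernel_log():
    """Return the dmesg output as text (empty if dmesg printed nothing)."""
    result = subprocess.run(["dmesg"], capture_output=True, text=True, check=False)
    return result.stdout


if __name__ == "__main__":
    try:
        print("workaround active:", workaround_triggered(read_kernel_log()))
    except FileNotFoundError:
        print("dmesg not found on this system")
```

Matching on the full message rather than just `FW` avoids false positives from unrelated firmware lines in the log.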
Comment 45 Michael 2021-10-03 01:13:22 UTC
Just to inform that as of today, I am reproducing this issue frequently on my laptop with Ryzen 5 2500U CPU, using kernel 5.14.6-1-default in openSUSE Tumbleweed.
Comment 47 Paul Menzel 2021-10-03 07:27:52 UTC
(In reply to Michael from comment #45)
> Just to inform that as of today, I am reproducing this issue frequently on
> my laptop with Ryzen 5 2500U CPU, using kernel 5.14.6-1-default in openSUSE
> Tumbleweed.

So what changed on your system? Please paste the MCE messages, if there are any.
Comment 48 ilkka.prusi 2023-07-19 17:46:21 UTC
This "mysterious freezing and crashing" sounds exactly like poor RAM compatibility.
RAM testing does not really find anything wrong; the RAM just isn't compatible with the CPU for some reason.

Try removing half of the RAM (one DIMM if you have two), or try running the computer with just one DIMM if there are more. If it stops crashing, you've found the reason.

If you can't change the RAM, try to upgrade the CPU for comparison.
Comment 49 ilkka.prusi 2023-07-19 18:43:11 UTC
From reading the above, the only thing that hasn't been changed is the RAM used in the machines? And it wasn't checked against the QVL associated with the CPUs?