Created attachment 287279
Linux 5.4.14 messages (dmesg)

[I have no idea if *Power Management* is the right component, or if it should be Platform Specific/Hardware, x86_64.]

As commented in the long bug report [1], AMD Ryzen processors seem to have a bug related to C-State C6 [2]. The problem happens at least with Linux 4.19.57, Linux 5.4.14, and Linux 5.5.

In our case, at least three of twenty Dell OptiPlex 5055 systems with AMD Ryzen 5 PRO 1500 running GNU/Linux are affected.

Unfortunately, the system just freezes. The mouse cursor does not move, the Caps Lock key does not toggle the LED on the USB keyboard, and if the monitor had turned off, it does not come back on. Sometimes the network card LED is still blinking, and sometimes not. In one case, the system even rebooted. There is nothing in the logs (even over the serial console), and nothing in the Dell event log.

`processor.max_cstate=5` seems to fix it, but it’s hard to be sure, as I have not found a way to reproduce this. Sometimes it happens twice a day, and sometimes only after three or more days. Strangely, on one system, everything worked fine for four months, and now the problem suddenly started two weeks ago.

The issue happens with the latest firmware 1.1.20 and with microcode updates applied, that is, with microcode 0x08001137.
```
$ lscpu
Architecture:                    x86_64
CPU op-mode(s):                  32-bit, 64-bit
Byte Order:                      Little Endian
Address sizes:                   43 bits physical, 48 bits virtual
CPU(s):                          8
On-line CPU(s) list:             0-7
Thread(s) per core:              2
Core(s) per socket:              4
Socket(s):                       1
NUMA node(s):                    1
Vendor ID:                       AuthenticAMD
CPU family:                      23
Model:                           1
Model name:                      AMD Ryzen 5 PRO 1500 Quad-Core Processor
Stepping:                        1
Frequency boost:                 enabled
CPU MHz:                         3592.550
CPU max MHz:                     3500.0000
CPU min MHz:                     1550.0000
BogoMIPS:                        6986.81
Virtualization:                  AMD-V
L1d cache:                       128 KiB
L1i cache:                       256 KiB
L2 cache:                        2 MiB
L3 cache:                        16 MiB
NUMA node0 CPU(s):               0-7
Vulnerability Itlb multihit:     Not affected
Vulnerability L1tf:              Not affected
Vulnerability Mds:               Not affected
Vulnerability Meltdown:          Not affected
Vulnerability Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl and seccomp
Vulnerability Spectre v1:        Mitigation; usercopy/swapgs barriers and __user pointer sanitization
Vulnerability Spectre v2:        Mitigation; Full AMD retpoline, IBPB conditional, STIBP disabled, RSB filling
Vulnerability Tsx async abort:   Not affected
Flags:                           fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl nonstop_tsc cpuid extd_apicid aperfmperf pni pclmulqdq monitor ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand lahf_lm cmp_legacy svm extapic cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw skinit wdt tce topoext perfctr_core perfctr_nb bpext perfctr_llc mwaitx cpb hw_pstate sme ssbd sev ibpb vmmcall fsgsbase bmi1 avx2 smep bmi2 rdseed adx smap clflushopt sha_ni xsaveopt xsavec xgetbv1 xsaves clzero irperf xsaveerptr arat npt lbrv svm_lock nrip_save tsc_scale vmcb_clean flushbyasid decodeassists pausefilter pfthreshold avic v_vmsave_vmload vgif overflow_recov succor smca
```

Several of the same systems run Microsoft Windows, and I am unaware of freezes there.
Can you find out whether C-State C6 is used by the Microsoft Windows driver? Can you recommend ways to debug this further?

[1]: https://bugzilla.kernel.org/show_bug.cgi?id=196683
[2]: https://forum.manjaro.org/t/ryzen-freezes-possible-solution-related-to-c6-state/37870
> In our case at least three of twenty Dell OptiPlex 5055 with AMD Ryzen 5 PRO
> 1500 with GNU/Linux are affected.

Are you saying you have 20 identical systems with the same sw and only 3 freeze?
(In reply to Borislav Petkov from comment #1)
> > In our case at least three of twenty Dell OptiPlex 5055 with AMD Ryzen 5 PRO
> > 1500 with GNU/Linux are affected.
>
> Are you saying, you have 20 identical systems with the same sw and only 3
> freeze?

Basically, yes. One of the three (and of the 20) has a different configuration: it has 32 GB instead of 16 GB of memory, an NVMe SSD instead of an AHCI HDD, and a different AMD graphics card. So the problem should be unrelated to these components.

Most common configuration:

```
$ lspci -nn
00:00.0 Host bridge [0600]: Advanced Micro Devices, Inc. [AMD] Family 17h (Models 00h-0fh) Root Complex [1022:1450]
00:00.2 IOMMU [0806]: Advanced Micro Devices, Inc. [AMD] Family 17h (Models 00h-0fh) I/O Memory Management Unit [1022:1451]
00:01.0 Host bridge [0600]: Advanced Micro Devices, Inc. [AMD] Family 17h (Models 00h-1fh) PCIe Dummy Host Bridge [1022:1452]
00:01.3 PCI bridge [0604]: Advanced Micro Devices, Inc. [AMD] Family 17h (Models 00h-0fh) PCIe GPP Bridge [1022:1453]
00:02.0 Host bridge [0600]: Advanced Micro Devices, Inc. [AMD] Family 17h (Models 00h-1fh) PCIe Dummy Host Bridge [1022:1452]
00:03.0 Host bridge [0600]: Advanced Micro Devices, Inc. [AMD] Family 17h (Models 00h-1fh) PCIe Dummy Host Bridge [1022:1452]
00:03.1 PCI bridge [0604]: Advanced Micro Devices, Inc. [AMD] Family 17h (Models 00h-0fh) PCIe GPP Bridge [1022:1453]
00:04.0 Host bridge [0600]: Advanced Micro Devices, Inc. [AMD] Family 17h (Models 00h-1fh) PCIe Dummy Host Bridge [1022:1452]
00:07.0 Host bridge [0600]: Advanced Micro Devices, Inc. [AMD] Family 17h (Models 00h-1fh) PCIe Dummy Host Bridge [1022:1452]
00:07.1 PCI bridge [0604]: Advanced Micro Devices, Inc. [AMD] Family 17h (Models 00h-0fh) Internal PCIe GPP Bridge 0 to Bus B [1022:1454]
00:08.0 Host bridge [0600]: Advanced Micro Devices, Inc. [AMD] Family 17h (Models 00h-1fh) PCIe Dummy Host Bridge [1022:1452]
00:08.1 PCI bridge [0604]: Advanced Micro Devices, Inc. [AMD] Family 17h (Models 00h-0fh) Internal PCIe GPP Bridge 0 to Bus B [1022:1454]
00:14.0 SMBus [0c05]: Advanced Micro Devices, Inc. [AMD] FCH SMBus Controller [1022:790b] (rev 59)
00:14.3 ISA bridge [0601]: Advanced Micro Devices, Inc. [AMD] FCH LPC Bridge [1022:790e] (rev 51)
00:18.0 Host bridge [0600]: Advanced Micro Devices, Inc. [AMD] Family 17h (Models 00h-0fh) Data Fabric: Device 18h; Function 0 [1022:1460]
00:18.1 Host bridge [0600]: Advanced Micro Devices, Inc. [AMD] Family 17h (Models 00h-0fh) Data Fabric: Device 18h; Function 1 [1022:1461]
00:18.2 Host bridge [0600]: Advanced Micro Devices, Inc. [AMD] Family 17h (Models 00h-0fh) Data Fabric: Device 18h; Function 2 [1022:1462]
00:18.3 Host bridge [0600]: Advanced Micro Devices, Inc. [AMD] Family 17h (Models 00h-0fh) Data Fabric: Device 18h; Function 3 [1022:1463]
00:18.4 Host bridge [0600]: Advanced Micro Devices, Inc. [AMD] Family 17h (Models 00h-0fh) Data Fabric: Device 18h; Function 4 [1022:1464]
00:18.5 Host bridge [0600]: Advanced Micro Devices, Inc. [AMD] Family 17h (Models 00h-0fh) Data Fabric: Device 18h; Function 5 [1022:1465]
00:18.6 Host bridge [0600]: Advanced Micro Devices, Inc. [AMD] Family 17h (Models 00h-0fh) Data Fabric: Device 18h; Function 6 [1022:1466]
00:18.7 Host bridge [0600]: Advanced Micro Devices, Inc. [AMD] Family 17h (Models 00h-0fh) Data Fabric: Device 18h; Function 7 [1022:1467]
01:00.0 USB controller [0c03]: Advanced Micro Devices, Inc. [AMD] 300 Series Chipset USB 3.1 xHCI Controller [1022:43bb] (rev 02)
01:00.1 SATA controller [0106]: Advanced Micro Devices, Inc. [AMD] 300 Series Chipset SATA Controller [1022:43b7] (rev 02)
01:00.2 PCI bridge [0604]: Advanced Micro Devices, Inc. [AMD] Device [1022:43b2] (rev 02)
02:00.0 PCI bridge [0604]: Advanced Micro Devices, Inc. [AMD] 300 Series Chipset PCIe Port [1022:43b4] (rev 02)
02:01.0 PCI bridge [0604]: Advanced Micro Devices, Inc. [AMD] 300 Series Chipset PCIe Port [1022:43b4] (rev 02)
03:00.0 Ethernet controller [0200]: Broadcom Inc. and subsidiaries NetXtreme BCM5762 Gigabit Ethernet PCIe [14e4:1687] (rev 10)
05:00.0 VGA compatible controller [0300]: Advanced Micro Devices, Inc. [AMD/ATI] Oland [Radeon HD 8570 / R7 240/340 OEM] [1002:6611] (rev 87)
05:00.1 Audio device [0403]: Advanced Micro Devices, Inc. [AMD/ATI] Oland/Hainan/Cape Verde/Pitcairn HDMI Audio [Radeon HD 7000 Series] [1002:aab0]
06:00.0 Non-Essential Instrumentation [1300]: Advanced Micro Devices, Inc. [AMD] Zeppelin/Raven/Raven2 PCIe Dummy Function [1022:145a]
06:00.2 Encryption controller [1080]: Advanced Micro Devices, Inc. [AMD] Family 17h (Models 00h-0fh) Platform Security Processor [1022:1456]
06:00.3 USB controller [0c03]: Advanced Micro Devices, Inc. [AMD] Family 17h (Models 00h-0fh) USB 3.0 Host Controller [1022:145c]
07:00.0 Non-Essential Instrumentation [1300]: Advanced Micro Devices, Inc. [AMD] Zeppelin/Renoir PCIe Dummy Function [1022:1455]
07:00.2 SATA controller [0106]: Advanced Micro Devices, Inc. [AMD] FCH SATA Controller [AHCI mode] [1022:7901] (rev 51)
07:00.3 Audio device [0403]: Advanced Micro Devices, Inc. [AMD] Family 17h (Models 00h-0fh) HD Audio Controller [1022:1457]
```

Exception:

```
$ lspci -nn
00:00.0 Host bridge [0600]: Advanced Micro Devices, Inc. [AMD] Family 17h (Models 00h-0fh) Root Complex [1022:1450]
00:00.2 IOMMU [0806]: Advanced Micro Devices, Inc. [AMD] Family 17h (Models 00h-0fh) I/O Memory Management Unit [1022:1451]
00:01.0 Host bridge [0600]: Advanced Micro Devices, Inc. [AMD] Family 17h (Models 00h-1fh) PCIe Dummy Host Bridge [1022:1452]
00:01.1 PCI bridge [0604]: Advanced Micro Devices, Inc. [AMD] Family 17h (Models 00h-0fh) PCIe GPP Bridge [1022:1453]
00:01.3 PCI bridge [0604]: Advanced Micro Devices, Inc. [AMD] Family 17h (Models 00h-0fh) PCIe GPP Bridge [1022:1453]
00:02.0 Host bridge [0600]: Advanced Micro Devices, Inc. [AMD] Family 17h (Models 00h-1fh) PCIe Dummy Host Bridge [1022:1452]
00:03.0 Host bridge [0600]: Advanced Micro Devices, Inc. [AMD] Family 17h (Models 00h-1fh) PCIe Dummy Host Bridge [1022:1452]
00:03.1 PCI bridge [0604]: Advanced Micro Devices, Inc. [AMD] Family 17h (Models 00h-0fh) PCIe GPP Bridge [1022:1453]
00:04.0 Host bridge [0600]: Advanced Micro Devices, Inc. [AMD] Family 17h (Models 00h-1fh) PCIe Dummy Host Bridge [1022:1452]
00:07.0 Host bridge [0600]: Advanced Micro Devices, Inc. [AMD] Family 17h (Models 00h-1fh) PCIe Dummy Host Bridge [1022:1452]
00:07.1 PCI bridge [0604]: Advanced Micro Devices, Inc. [AMD] Family 17h (Models 00h-0fh) Internal PCIe GPP Bridge 0 to Bus B [1022:1454]
00:08.0 Host bridge [0600]: Advanced Micro Devices, Inc. [AMD] Family 17h (Models 00h-1fh) PCIe Dummy Host Bridge [1022:1452]
00:08.1 PCI bridge [0604]: Advanced Micro Devices, Inc. [AMD] Family 17h (Models 00h-0fh) Internal PCIe GPP Bridge 0 to Bus B [1022:1454]
00:14.0 SMBus [0c05]: Advanced Micro Devices, Inc. [AMD] FCH SMBus Controller [1022:790b] (rev 59)
00:14.3 ISA bridge [0601]: Advanced Micro Devices, Inc. [AMD] FCH LPC Bridge [1022:790e] (rev 51)
00:18.0 Host bridge [0600]: Advanced Micro Devices, Inc. [AMD] Family 17h (Models 00h-0fh) Data Fabric: Device 18h; Function 0 [1022:1460]
00:18.1 Host bridge [0600]: Advanced Micro Devices, Inc. [AMD] Family 17h (Models 00h-0fh) Data Fabric: Device 18h; Function 1 [1022:1461]
00:18.2 Host bridge [0600]: Advanced Micro Devices, Inc. [AMD] Family 17h (Models 00h-0fh) Data Fabric: Device 18h; Function 2 [1022:1462]
00:18.3 Host bridge [0600]: Advanced Micro Devices, Inc. [AMD] Family 17h (Models 00h-0fh) Data Fabric: Device 18h; Function 3 [1022:1463]
00:18.4 Host bridge [0600]: Advanced Micro Devices, Inc. [AMD] Family 17h (Models 00h-0fh) Data Fabric: Device 18h; Function 4 [1022:1464]
00:18.5 Host bridge [0600]: Advanced Micro Devices, Inc. [AMD] Family 17h (Models 00h-0fh) Data Fabric: Device 18h; Function 5 [1022:1465]
00:18.6 Host bridge [0600]: Advanced Micro Devices, Inc. [AMD] Family 17h (Models 00h-0fh) Data Fabric: Device 18h; Function 6 [1022:1466]
00:18.7 Host bridge [0600]: Advanced Micro Devices, Inc. [AMD] Family 17h (Models 00h-0fh) Data Fabric: Device 18h; Function 7 [1022:1467]
01:00.0 Non-Volatile memory controller [0108]: Intel Corporation SSD Pro 7600p/760p/E 6100p Series [8086:f1a6] (rev 03)
02:00.0 USB controller [0c03]: Advanced Micro Devices, Inc. [AMD] 300 Series Chipset USB 3.1 xHCI Controller [1022:43bb] (rev 02)
02:00.1 SATA controller [0106]: Advanced Micro Devices, Inc. [AMD] 300 Series Chipset SATA Controller [1022:43b7] (rev 02)
02:00.2 PCI bridge [0604]: Advanced Micro Devices, Inc. [AMD] Device [1022:43b2] (rev 02)
03:00.0 PCI bridge [0604]: Advanced Micro Devices, Inc. [AMD] 300 Series Chipset PCIe Port [1022:43b4] (rev 02)
03:01.0 PCI bridge [0604]: Advanced Micro Devices, Inc. [AMD] 300 Series Chipset PCIe Port [1022:43b4] (rev 02)
04:00.0 Ethernet controller [0200]: Broadcom Inc. and subsidiaries NetXtreme BCM5762 Gigabit Ethernet PCIe [14e4:1687] (rev 10)
06:00.0 VGA compatible controller [0300]: Advanced Micro Devices, Inc. [AMD/ATI] Cape Verde PRO / Venus LE / Tropo PRO-L [Radeon HD 8830M / R7 250 / R7 M465X] [1002:6... (rev 87)
06:00.1 Audio device [0403]: Advanced Micro Devices, Inc. [AMD/ATI] Oland/Hainan/Cape Verde/Pitcairn HDMI Audio [Radeon HD 7000 Series] [1002:aab0]
07:00.0 Non-Essential Instrumentation [1300]: Advanced Micro Devices, Inc. [AMD] Zeppelin/Raven/Raven2 PCIe Dummy Function [1022:145a]
07:00.2 Encryption controller [1080]: Advanced Micro Devices, Inc. [AMD] Family 17h (Models 00h-0fh) Platform Security Processor [1022:1456]
07:00.3 USB controller [0c03]: Advanced Micro Devices, Inc. [AMD] Family 17h (Models 00h-0fh) USB 3.0 Host Controller [1022:145c]
08:00.0 Non-Essential Instrumentation [1300]: Advanced Micro Devices, Inc. [AMD] Zeppelin/Renoir PCIe Dummy Function [1022:1455]
08:00.2 SATA controller [0106]: Advanced Micro Devices, Inc. [AMD] FCH SATA Controller [AHCI mode] [1022:7901] (rev 51)
08:00.3 Audio device [0403]: Advanced Micro Devices, Inc. [AMD] Family 17h (Models 00h-0fh) HD Audio Controller [1022:1457]
```
Do the other boxes have this MCE in dmesg too?

[ 0.384815] [Hardware Error]: System Fatal error.
[ 0.385095] [Hardware Error]: CPU:2 (17:1:1) MC0_STATUS[-|UE|MiscV|-|PCC|TCC|SyndV|-|-|-]: 0xbaa0000000060145
[ 0.385180] [Hardware Error]: IPID: 0x000000b000000000, Syndrome: 0x000000002d030100
[ 0.386180] [Hardware Error]: Load Store Unit Ext. Error Code: 6, DC Tag error type 1.
[ 0.386645] [Hardware Error]: cache level: L1, tx: DATA, mem-tx: DWR

Because this could explain the freezes. If you see it in dmesg every time you reboot after a freeze - i.e., reproducible - you probably should talk to your hardware supplier about replacing the CPU...

HTH.
(In reply to Borislav Petkov from comment #3)
> Do the other boxes have this MCE in dmesg too?
>
> [ 0.384815] [Hardware Error]: System Fatal error.
> [ 0.385095] [Hardware Error]: CPU:2 (17:1:1)
> MC0_STATUS[-|UE|MiscV|-|PCC|TCC|SyndV|-|-|-]: 0xbaa0000000060145
> [ 0.385180] [Hardware Error]: IPID: 0x000000b000000000, Syndrome:
> 0x000000002d030100
> [ 0.386180] [Hardware Error]: Load Store Unit Ext. Error Code: 6, DC Tag
> error type 1.
> [ 0.386645] [Hardware Error]: cache level: L1, tx: DATA, mem-tx: DWR

One (*deinhandtuch*) of the two other machines logged an MCE (even with the same values) once, but after a *normal* reboot. In this run it crashed after two days.

```
[ 0.000000] Linux version 5.4.14.mx64.317 (root@thebiglebowserver.molgen.mpg.de) (gcc version 7.5.0 (GCC)) #1 SMP Thu Jan 23 14:24:25 CET 2020
[…]
[ 0.377235] smpboot: CPU0: AMD Ryzen 5 PRO 1500 Quad-Core Processor (family: 0x17, model: 0x1, stepping: 0x1)
[ 0.377905] Performance Events: Fam17h core perfctr, AMD PMU driver.
[ 0.378181] ... version: 0
[ 0.378420] ... bit width: 48
[ 0.378664] ... generic registers: 6
[ 0.378904] ... value mask: 0000ffffffffffff
[ 0.379180] ... max period: 00007fffffffffff
[ 0.379494] ... fixed-purpose events: 0
[ 0.379733] ... event mask: 000000000000003f
[ 0.380069] rcu: Hierarchical SRCU implementation.
[ 0.380533] MCE: In-kernel MCE decoding enabled.
[ 0.380907] smp: Bringing up secondary CPUs ...
[ 0.381245] x86: Booting SMP configuration:
[ 0.381494] .... node #0, CPUs: #1 #2
[ 0.384203] mce: [Hardware Error]: Machine check events logged
[ 0.384815] [Hardware Error]: System Fatal error.
[ 0.385095] [Hardware Error]: CPU:2 (17:1:1) MC0_STATUS[-|UE|MiscV|-|PCC|TCC|SyndV|-|-|-]: 0xbaa0000000060145
[ 0.385180] [Hardware Error]: IPID: 0x000000b000000000, Syndrome: 0x000000002d030100
[ 0.386180] [Hardware Error]: Load Store Unit Ext. Error Code: 6, DC Tag error type 1.
[ 0.386645] [Hardware Error]: cache level: L1, tx: DATA, mem-tx: DWR
[ 0.387024] #3 #4 #5 #6 #7
[ 0.393204] smp: Brought up 1 node, 8 CPUs
[ 0.393623] smpboot: Max logical packages: 2
[ 0.393878] smpboot: Total of 8 processors activated (55894.52 BogoMIPS)
```

The other affected machine (*fenchurch*) does not have it.

> Because this could explain the freezes. If you see it in dmesg everytime you
> reboot after a freeze - i.e., reproducible - you probably should talk to
> your hardware supplier about replacing the CPU...

As it’s not happening every time, do you still recommend it?

How can I decode the MCEs?

Here are all the MCEs from *hypnotoad*:

Linux 5.4.10:

Jan 23 13:34:36 hypnotoad.molgen.mpg.de kernel: mce: [Hardware Error]: Machine check events logged
Jan 23 13:34:36 hypnotoad.molgen.mpg.de kernel: [Hardware Error]: System Fatal error.
Jan 23 13:34:36 hypnotoad.molgen.mpg.de kernel: [Hardware Error]: CPU:1 (17:1:1) MC5_STATUS[-|UE|MiscV|AddrV|PCC|TCC|SyndV|-|-|-]: 0xbea0000000000108
Jan 23 13:34:36 hypnotoad.molgen.mpg.de kernel: [Hardware Error]: Error Addr: 0x0001ffff81ac64f4
Jan 23 13:34:36 hypnotoad.molgen.mpg.de kernel: [Hardware Error]: IPID: 0x000500b000000000, Syndrome: 0x000000004d000000
Jan 23 13:34:36 hypnotoad.molgen.mpg.de kernel: [Hardware Error]: Execution Unit Ext. Error Code: 0, Watchdog Timeout error.
Jan 23 13:34:36 hypnotoad.molgen.mpg.de kernel: [Hardware Error]: cache level: RESV, tx: GEN, mem-tx: GEN
Jan 23 13:34:36 hypnotoad.molgen.mpg.de kernel: mce: [Hardware Error]: Machine check events logged
Jan 23 13:34:36 hypnotoad.molgen.mpg.de kernel: [Hardware Error]: System Fatal error.
Jan 23 13:34:36 hypnotoad.molgen.mpg.de kernel: [Hardware Error]: CPU:7 (17:1:1) MC5_STATUS[-|UE|MiscV|AddrV|PCC|TCC|SyndV|-|-|-]: 0xbea0000000000108
Jan 23 13:34:36 hypnotoad.molgen.mpg.de kernel: [Hardware Error]: Error Addr: 0x0001ffff81e01060
Jan 23 13:34:36 hypnotoad.molgen.mpg.de kernel: [Hardware Error]: IPID: 0x000500b000000000, Syndrome: 0x000000004d000000
Jan 23 13:34:36 hypnotoad.molgen.mpg.de kernel: [Hardware Error]: Execution Unit Ext. Error Code: 0, Watchdog Timeout error.
Jan 23 13:34:36 hypnotoad.molgen.mpg.de kernel: [Hardware Error]: cache level: RESV, tx: GEN, mem-tx: GEN

Linux 5.4.10:

Jan 24 05:46:11 hypnotoad.molgen.mpg.de kernel: mce: [Hardware Error]: Machine check events logged
Jan 24 05:46:11 hypnotoad.molgen.mpg.de kernel: mce: [Hardware Error]: CPU 0: Machine Check: 0 Bank 5: bea0000000000108
Jan 24 05:46:11 hypnotoad.molgen.mpg.de kernel: mce: [Hardware Error]: TSC 0 ADDR 1ffff81ac64fc MISC d012000101000000 SYND 4d000000 IPID 500b000000000
Jan 24 05:46:11 hypnotoad.molgen.mpg.de kernel: mce: [Hardware Error]: PROCESSOR 2:800f11 TIME 1579841168 SOCKET 0 APIC 0 microcode 8001137
Jan 24 05:46:11 hypnotoad.molgen.mpg.de kernel: mce: [Hardware Error]: Machine check events logged
Jan 24 05:46:11 hypnotoad.molgen.mpg.de kernel: [Hardware Error]: System Fatal error.
Jan 24 05:46:11 hypnotoad.molgen.mpg.de kernel: [Hardware Error]: CPU:7 (17:1:1) MC5_STATUS[-|UE|MiscV|AddrV|PCC|TCC|SyndV|-|-|-]: 0xbea0000000000108
Jan 24 05:46:11 hypnotoad.molgen.mpg.de kernel: [Hardware Error]: Error Addr: 0x0001ffff81c00ad6
Jan 24 05:46:11 hypnotoad.molgen.mpg.de kernel: [Hardware Error]: IPID: 0x000500b000000000, Syndrome: 0x000000004d000000
Jan 24 05:46:11 hypnotoad.molgen.mpg.de kernel: [Hardware Error]: Execution Unit Ext. Error Code: 0, Watchdog Timeout error.
Jan 24 05:46:11 hypnotoad.molgen.mpg.de kernel: [Hardware Error]: cache level: RESV, tx: GEN, mem-tx: GEN

Linux 5.4.10:

Jan 24 14:16:43 hypnotoad.molgen.mpg.de kernel: mce: [Hardware Error]: Machine check events logged
Jan 24 14:16:43 hypnotoad.molgen.mpg.de kernel: mce: [Hardware Error]: CPU 0: Machine Check: 0 Bank 22: baa000000000080b
Jan 24 14:16:43 hypnotoad.molgen.mpg.de kernel: mce: [Hardware Error]: TSC 0 MISC d012000101000000 SYND 5d000000 IPID 1002e00000002
Jan 24 14:16:43 hypnotoad.molgen.mpg.de kernel: mce: [Hardware Error]: PROCESSOR 2:800f11 TIME 1579871799 SOCKET 0 APIC 0 microcode 8001137

Linux 5.5-rc7:

Jan 24 15:54:33 hypnotoad.molgen.mpg.de kernel: mce: [Hardware Error]: Machine check events logged
Jan 24 15:54:33 hypnotoad.molgen.mpg.de kernel: [Hardware Error]: System Fatal error.
Jan 24 15:54:33 hypnotoad.molgen.mpg.de kernel: [Hardware Error]: CPU:1 (17:1:1) MC5_STATUS[-|UE|MiscV|AddrV|PCC|TCC|SyndV|-|-|-]: 0xbea0000000000108
Jan 24 15:54:33 hypnotoad.molgen.mpg.de kernel: [Hardware Error]: Error Addr: 0x0001ffff81ae0f38
Jan 24 15:54:33 hypnotoad.molgen.mpg.de kernel: [Hardware Error]: IPID: 0x000500b000000000, Syndrome: 0x000000004d000000
Jan 24 15:54:33 hypnotoad.molgen.mpg.de kernel: [Hardware Error]: Execution Unit Ext. Error Code: 0, Watchdog Timeout error.
Jan 24 15:54:33 hypnotoad.molgen.mpg.de kernel: [Hardware Error]: cache level: RESV, tx: GEN, mem-tx: GEN
Jan 24 15:54:33 hypnotoad.molgen.mpg.de kernel: mce: [Hardware Error]: Machine check events logged
Jan 24 15:54:33 hypnotoad.molgen.mpg.de kernel: [Hardware Error]: System Fatal error.
Jan 24 15:54:33 hypnotoad.molgen.mpg.de kernel: [Hardware Error]: CPU:5 (17:1:1) MC5_STATUS[-|UE|MiscV|AddrV|PCC|TCC|SyndV|-|-|-]: 0xbea0000000000108
Jan 24 15:54:33 hypnotoad.molgen.mpg.de kernel: [Hardware Error]: Error Addr: 0x0001ffff81223a66
Jan 24 15:54:33 hypnotoad.molgen.mpg.de kernel: [Hardware Error]: IPID: 0x000500b000000000, Syndrome: 0x000000004d000000
Jan 24 15:54:33 hypnotoad.molgen.mpg.de kernel: [Hardware Error]: Execution Unit Ext. Error Code: 0, Watchdog Timeout error.
Jan 24 15:54:33 hypnotoad.molgen.mpg.de kernel: [Hardware Error]: cache level: RESV, tx: GEN, mem-tx: GEN
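To compare MCE signatures across hosts and boots, log lines like the ones pasted here can be grouped by bank and status value. The following is only a minimal Python sketch, assuming the two line formats seen in this report (decoded `MCn_STATUS[...]` lines and raw `Bank n:` lines); the function name `summarize_mce` and both regular expressions are illustrative and not part of any kernel tool.

```python
import re
from collections import Counter

# Decoded form, e.g. "CPU:1 (17:1:1) MC5_STATUS[...]: 0xbea0000000000108"
DECODED = re.compile(
    r"CPU:(\d+) \([0-9a-f:]+\) MC(\d+)_STATUS\[[^\]]*\]: (0x[0-9a-f]+)"
)
# Raw form, e.g. "CPU 0: Machine Check: 0 Bank 5: bea0000000000108"
RAW = re.compile(r"CPU (\d+): Machine Check: \d+ Bank (\d+): ([0-9a-f]+)")


def summarize_mce(lines):
    """Count MCE records per (bank, status) pair found in the given log lines."""
    counts = Counter()
    for line in lines:
        match = DECODED.search(line) or RAW.search(line)
        if match:
            _cpu, bank, status = match.groups()
            counts[(int(bank), int(status, 16))] += 1
    return counts


# Three status lines copied from the logs in this report.
sample = [
    "kernel: [Hardware Error]: CPU:1 (17:1:1) "
    "MC5_STATUS[-|UE|MiscV|AddrV|PCC|TCC|SyndV|-|-|-]: 0xbea0000000000108",
    "kernel: mce: [Hardware Error]: CPU 0: Machine Check: 0 Bank 5: bea0000000000108",
    "kernel: mce: [Hardware Error]: CPU 0: Machine Check: 0 Bank 22: baa000000000080b",
]

for (bank, status), count in sorted(summarize_mce(sample).items()):
    print(f"bank {bank:2d}  status {status:#018x}  seen {count}x")
```

Feeding it the saved journal output of each affected machine would show at a glance whether the bank 5 watchdog signature dominates everywhere or only on *hypnotoad*.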
Searching for 0xbea0000000000108 shows that this is documented on the Gentoo Wiki page *Ryzen* [1]. So, we are back to the common processor error, and the references to C-State 6.

[1]: https://wiki.gentoo.org/wiki/Ryzen#Random_reboots_with_mce_events
(In reply to Paul Menzel from comment #4)
> One (*deinhandtuch*) of the two other machines logged an MCE (even same
> values) once, but after a *normal* reboot.

Can't be after a normal reboot - those errors get read out from the MCA banks when the machine gets warm-reset and cleared after that. So the MCE must have happened in the last boot of the machine.

And since it is the same signature, the deinhandtuch box's CPU - no matter how funny its name is :) - might need to be replaced too.

> The other affected machine (*fenchurch*) does not have it.

It is possible that it doesn't log it successfully or that box has a different problem.

> As it’s not happening every time, do you still recommend it?

That's your decision I guess. If it doesn't hurt the work you're doing and occasional reboots are ok, then sure, it would be less hassle. :)

> How can I decode the MCEs?

They're decoded already:

[ 0.386180] [Hardware Error]: Load Store Unit Ext. Error Code: 6, DC Tag error type 1.

That's all the decode we can get.

> Here are all the MCEs from *hypnotoad*:

Those are different. They're all

[Hardware Error]: Execution Unit Ext. Error Code: 0, Watchdog Timeout error.

in bank 5...

> Searching for 0xbea0000000000108, this is documented on the Gentoo
> Wiki page *Ryzen* [1]. So, we are back to the common processor error,
> and the references to C-State 6.

... and do those errors go away if you boot with "idle=nomwait" or disable C states in the BIOS?
(In reply to Borislav Petkov from comment #6)
> (In reply to Paul Menzel from comment #4)
> > One (*deinhandtuch*) of the two other machines logged an MCE (even same
> > values) once, but after a *normal* reboot.
>
> Can't be after a normal reboot - those errors get read out from the MCA
> banks when the machine gets warm-reset and cleared after that. So the
> MCE must have happened in the last boot of the machine.
>
> And since it is the same signature, the deinhandtuch box's CPU- no
> matter how funny its name is :) - might need to be replaced too.

Sometimes we do a reboot using Kexec. Maybe that’s why the MCE still got read out? Can I manually force a readout, or would the driver always notice it in real time?

> > The other affected machine (*fenchurch*) does not have it.
>
> It is possible that it doesn't log it successfully or that box has a
> different problem.
>
> > As it’s not happening every time, do you still recommend it?
>
> That's your decision I guess. If it doesn't hurt the work you're doing
> and occasional reboots are ok, then sure, it would be less hassle. :)
>
> > How can I decode the MCEs?
>
> They're decoded already:
>
> [ 0.386180] [Hardware Error]: Load Store Unit Ext. Error Code: 6, DC Tag
> error type 1.
>
> That's all the decode we can get.

I guess we need a decode for the decode. ;-) What internal(?) AMD data sheet would contain this information?

> > Here are all the MCEs from *hypnotoad*:
>
> Those are different. They're all
>
> [Hardware Error]: Execution Unit Ext. Error Code: 0, Watchdog Timeout error.
>
> in bank 5...

They are not, are they? Two of the pasted four have *bank 5* in them. They all contain the value `0xbea0000000000108` though.

> > Searching for 0xbea0000000000108, this is documented on the Gentoo
> > Wiki page *Ryzen* [1]. So, we are back to the common processor error,
> > and the references to C-State 6.
>
> ... and do those errors go away of you boot with "idle=nomwait" or
> disable C states in the BIOS?

The Dell firmware does not have an option to disable the C-States. I have to use ZenStates-Linux.

It looks like my problem report is still not clear enough. MCEs are logged after only a fraction of the freezes, even on the machines that get them. As there is no reproducer for the problem, I cannot say.

Currently, all three machines (*fenchurch* with a replaced motherboard, *deinhandtuch* with disabled C-State C6) have been running fine since Monday.

Do you know of ways to debug this better? How can I check whether C-State C6 has been reached? (turbostat does not show it?)

Can I log more over the serial console?

Also, judging from all the problem reports on the Web, and the documented solutions, can’t we assume that there is a hardware or Linux kernel error? Did AMD look at this internally? Is their solution to disable C-State C6? What does the Windows driver do?
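On the question of how to see which idle states are entered: the kernel's cpuidle framework exposes per-CPU counters under `/sys/devices/system/cpu/cpuN/cpuidle/stateM/` (files `name`, `usage`, `time`). Below is a minimal Python sketch that reads them; the helper name `read_cpuidle` and the `sysfs_root` parameter are made up for illustration. Two caveats, both assumptions on my side: the state names depend on the cpuidle driver and may appear as generic ACPI names such as `C1`/`C2` rather than `C6`, and the counters reflect what the OS requested, which does not prove the hardware actually reached core C6.

```python
from pathlib import Path


def read_cpuidle(cpu=0, sysfs_root="/sys/devices/system/cpu"):
    """Return [(state_dir, name, usage_count, time_us), ...] for one CPU.

    Returns an empty list if the CPU has no cpuidle directory.
    """
    base = Path(sysfs_root) / f"cpu{cpu}" / "cpuidle"
    if not base.is_dir():
        return []
    states = []
    # Sort numerically so that state10 comes after state9.
    for state in sorted(base.glob("state[0-9]*"), key=lambda p: int(p.name[5:])):
        name = (state / "name").read_text().strip()
        usage = int((state / "usage").read_text())    # number of entries
        time_us = int((state / "time").read_text())   # residency in microseconds
        states.append((state.name, name, usage, time_us))
    return states
```

Calling `read_cpuidle(0)` twice, a few minutes apart, and comparing the `usage` and `time` values shows which states the governor picked in between.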
(In reply to Paul Menzel from comment #7)
> Sometimes we do a reboot using Kexec. Maybe that’s why the MCE still got
> read out? Can I manually force a read out, or would the driver always notice
> it real time?

No, fatal MCEs cause a thing called "syncflood" which prevents the machine from making any forward progress. That's why it would freeze. If the firmware is properly coded and configured, it would detect such a condition and trigger a warm reset. And the error will be logged in the MCA MSRs which the kernel dumps during boot if it finds a valid signature in there. The driver doesn't even get to execute when the error happens.

And yes, kexec-ing might be able to log it, if the hardware plays along.

> I guess we need a decode for the decode. ;-) What internal(?) AMD data sheet
> would contain this information?

Good luck with that.

> They are not, are they? Two of the pasted four have *bank 5* in them. They
> all contain the value `0xbea0000000000108` though.

The only one which is not bank 5 is the one in the "Linux 5.4.10" snippet which is bank 22 but it has not run through the decoding module. That one is fatal too, though, so it could be related as in follow-up MCE or maybe it is something else.

> The Dell firmware does not have an option to disable the C-States. I have to
> use ZenStates-Linux.

And "idle=nomwait" doesn't work?

> Do you know of ways how to debug this better? How can I check, if C-State C6
> has been reached? (turbostat does not show it?)
>
> Can I log more over the serial console?

Before you do anything, let's see if idle=nomwait or CC6 disable works. If the DC tag error doesn't happen as a result of that either, then it is related. If it still happens, then you should consider returning the box or replacing the CPU or so, AFAICT.

> Also, judging from all the problem reports on the Web, and documented
> solutions, can’t we assume, that there is a hardware or Linux kernel
> error?

I can't assume anything at the moment. Let's see if the CC6 disable/idle=nomwait does anything first.

Thx.
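For the raw records that did not go through the decoding module (such as the bank 22 one), the flag part of the status word can still be read by hand. The sketch below is a minimal Python illustration, not the kernel's decoder: the bit positions are the `MCI_STATUS_*` definitions from the kernel's `arch/x86/include/asm/mce.h` as I understand them, the 6-bit extended error code at bits 21:16 is an assumption based on how the decoded lines in this report line up, and the names `FLAG_BITS` and `decode_mca_status` are made up. It does not translate the error codes into text.

```python
# (bit, name) pairs; positions taken from the kernel's MCI_STATUS_* macros.
FLAG_BITS = [
    (63, "Val"), (62, "Over"), (61, "UC"), (60, "En"),
    (59, "MiscV"), (58, "AddrV"), (57, "PCC"),
    (55, "TCC"), (53, "SyndV"), (44, "Deferred"), (43, "Poison"),
]


def decode_mca_status(status):
    """Split a 64-bit MCA status value into flags and error-code fields."""
    return {
        "flags": [name for bit, name in FLAG_BITS if (status >> bit) & 1],
        # Assumed layout: extended error code in bits 21:16, error code in 15:0.
        "ext_error_code": (status >> 16) & 0x3F,
        "error_code": status & 0xFFFF,
    }


# The three status values that appear in this report.
for value in (0xBAA0000000060145, 0xBEA0000000000108, 0xBAA000000000080B):
    print(f"{value:#018x}", decode_mca_status(value))
```

For `0xbaa0000000060145` this yields the same set of flags as the decoded `MC0_STATUS[-|UE|MiscV|-|PCC|TCC|SyndV|-|-|-]` line plus extended error code 6, and for the bank 22 value `baa000000000080b` it yields the same flags with extended error code 0, which matches the remark that this one is fatal too.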
Ok, after three days with `./zenstates.py --c6-disable`, *deinhandtuch* did not crash. Then, after rebooting the system and leaving C-State C6 enabled, it crashed after five hours (not pingable). I had to press the power button for ten seconds to turn it off. After starting it again, there were no MCEs logged.
What about "idle=nomwait"? Does that fix it too?
(In reply to Borislav Petkov from comment #10)
> What about "idle=nomwait"? Does that fix it too?

I have to check that, but I thought that was solved in bug #196683 [1].

[1]: https://bugzilla.kernel.org/show_bug.cgi?id=196683 "Random Soft Lockup on new Ryzen build"
How? I don't think anything was solved there except a couple of people showing that idle=halt fixes the issue for them.
(In reply to Borislav Petkov from comment #10)
> What about "idle=nomwait"? Does that fix it too?

*deinhandtuch* crashed some minutes ago, so `idle=nomwait` does not help in my case.
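When testing options such as `idle=nomwait` or `processor.max_cstate=5` across several machines, it helps to record that the running kernel was really booted with them. A minimal Python sketch, assuming a plain space-separated command line as found in `/proc/cmdline` (quoted values containing spaces are not handled); `parse_cmdline` is a made-up helper name.

```python
def parse_cmdline(cmdline):
    """Split a kernel command line into {key: value}; bare flags map to None."""
    params = {}
    for token in cmdline.split():
        key, sep, value = token.partition("=")
        params[key] = value if sep else None
    return params


# Example string; on a live system read it from /proc/cmdline instead.
example = "BOOT_IMAGE=/boot/vmlinuz ro idle=nomwait processor.max_cstate=5"
params = parse_cmdline(example)
print("idle =", params.get("idle"))
print("processor.max_cstate =", params.get("processor.max_cstate"))
```

On an affected box, `parse_cmdline(open("/proc/cmdline").read())` could be logged next to each crash report so that results from different boots are not mixed up.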
Ok, thanks. So it is not erratum 1109. Let's see what I can find out about the C-states aspect.
(In reply to Paul Menzel from comment #7)
[…]
> Currently, all three machines (*fenchurch* with a replaced motherboard,
> *deinhandtuch* with disabled C-State C6) run fine since Monday.

For the record, *fenchurch* with the replaced motherboard rebooted tonight after running for three days, from Thu Feb 13 23:58 to Sun Feb 16 23:10, but no MCEs were found by Linux. So, it’s not the motherboard. I’ll ask for a CPU replacement.
How does that box behave with disabled C6? Still rebooting?
One note: the crashes/freezes also happen when fully utilizing the CPU, for example, by running the script `kill-ryzen.sh` [1], which builds GCC separately on each thread.

[1]: https://github.com/Oxalin/ryzen-test
(In reply to Paul Menzel from comment #7)
> (In reply to Borislav Petkov from comment #6)
> > (In reply to Paul Menzel from comment #4)
[…]
> > > Searching for 0xbea0000000000108, this is documented on the Gentoo
> > > Wiki page *Ryzen* [1]. So, we are back to the common processor error,
> > > and the references to C-State 6.
> >
> > ... and do those errors go away of you boot with "idle=nomwait" or
> > disable C states in the BIOS?
>
> The Dell firmware does not have an option to disable the C-States. I have to
> use ZenStates-Linux.

I have to correct that statement. After unchecking the firmware option *C States Control* [1]

> Allows you to enable or disable additional processor sleep states. This
> option is enabled by default.

`zenstates.py` reports C-State C6 for the core (not package) as disabled.

```
$ sudo ~/src/ZenStates-Linux/zenstates.py -l
P0 - Enabled - FID = 8C - DID = 8 - VID = 32 - Ratio = 35.00 - vCore = 1.23750
P1 - Enabled - FID = 78 - DID = 8 - VID = 40 - Ratio = 30.00 - vCore = 1.15000
P2 - Enabled - FID = 7C - DID = 10 - VID = 6A - Ratio = 15.50 - vCore = 0.88750
P3 - Disabled
P4 - Disabled
P5 - Disabled
P6 - Disabled
P7 - Disabled
C6 State - Package - Enabled
C6 State - Core - Disabled
```

Linux should log the C-State information more verbosely.

I think I already tested that it still crashes after unchecking that option.

[1]: https://www.dell.com/support/manuals/de/de/debsdt1/optiplex-5055-r-desktop/optiplex_5055r_tower_om/system-setup-options?guid=guid-9704849c-2bf7-4f8a-bb5a-fcb0aa5e24c7&lang=en-us
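As a sanity check on the P-state table, the Ratio and vCore columns can be recomputed from the hexadecimal FID, DID and VID fields. The formulas below are what I believe ZenStates-Linux uses (ratio = 25 · FID / (12.5 · DID), vCore = 1.55 V − 0.00625 V · VID); treat them as an assumption, although they do reproduce the values printed by `zenstates.py -l` in this report. The function names are made up.

```python
def pstate_ratio(fid, did):
    """Core multiplier (Ratio column) from the FID and DID fields."""
    return 25 * fid / (12.5 * did)


def pstate_vcore(vid):
    """Core voltage in volts (vCore column) from the VID field."""
    return 1.55 - 0.00625 * vid


# (FID, DID, VID) for P0..P2, copied from the zenstates.py listing (hex values).
pstates = {"P0": (0x8C, 0x8, 0x32), "P1": (0x78, 0x8, 0x40), "P2": (0x7C, 0x10, 0x6A)}

for name, (fid, did, vid) in pstates.items():
    print(f"{name}: Ratio = {pstate_ratio(fid, did):.2f} - vCore = {pstate_vcore(vid):.5f}")
```

With a 100 MHz reference clock, the P0 ratio of 35 corresponds to the 3500 MHz maximum and the P2 ratio of 15.5 to the 1550 MHz minimum that `lscpu` reports.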
Just to confirm, the crashes still occur with *C States Control* disabled in the Dell firmware. Borislav, were you able to find out more?
(In reply to Paul Menzel from comment #19)
> Just to confirm, the crashes still occur with *C States Control* disabled in
> the Dell firmware.

What does that option even do? I'm guessing it does not disable CC6...

> Borislav, were you able to find out more?

Well, the only thing we can do is add a cmdline option to disable CC6 on those machines and do what zenstates.py does.

So, how exactly do you run that script? Like this:

    ./zenstates.py --c6-disable

And does that fix the issue on those boxes?

Thx.
(In reply to Borislav Petkov from comment #20)
> (In reply to Paul Menzel from comment #19)
> > Just to confirm, the crashes still occur with *C States Control* disabled in
> > the Dell firmware.
>
> What does that option even do? I'm guessing it does not disable CC6...

As I do not have the firmware source code, I cannot say for sure. But see comment #18:

    C6 State - Package - Enabled
    C6 State - Core - Disabled

as opposed to showing both as *Enabled*.

> > Borislav, were you able to find out more?
>
> Well, the only thing we can do is add a cmdline option to disable CC6 on
> those machines and do what zenstates.py does.

An update: all machines running GNU/Linux seem to be affected here. The users just didn’t report it, and we didn’t check yet. First inquiries show that devices with Microsoft Windows do not seem to be affected.

A Linux kernel option is a first step, but no solution, as you cannot expect “normal” users to add that, and it should work out of the box.

Are you in contact with AMD?

> So, how exactly do you run that script? Like this:
>
> ./zenstates.py --c6-disable

Yes, like that.

> And does that fix the issue on those boxes?

At least on one box it still crashed. We need a reproducer for this issue.
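For reference, here is a minimal sketch of the bit manipulation that `zenstates.py -l` and `--c6-disable` appear to perform. The MSR addresses (0xC0010292, 0xC0010296) and bit positions are assumptions recalled from the ZenStates-Linux source and should be checked against the script before relying on them; the function names are made up for illustration.

```python
import os
import struct

# ASSUMPTION: addresses and bits as recalled from ZenStates-Linux, not
# taken from AMD documentation. Verify against zenstates.py.
MSR_PKG_C6 = 0xC0010292                          # package C6 control (assumed)
MSR_CORE_C6 = 0xC0010296                         # core C6 control (assumed)
PKG_C6_BIT = 1 << 32
CORE_C6_BITS = (1 << 22) | (1 << 14) | (1 << 6)


def read_msr(msr, cpu=0):
    """Read a 64-bit MSR via the msr driver (needs root and `modprobe msr`)."""
    fd = os.open(f"/dev/cpu/{cpu}/msr", os.O_RDONLY)
    try:
        return struct.unpack("<Q", os.pread(fd, 8, msr))[0]
    finally:
        os.close(fd)


def c6_status(pkg_val, core_val):
    """Return (package_enabled, core_enabled) for the given MSR values."""
    package_enabled = bool(pkg_val & PKG_C6_BIT)
    core_enabled = (core_val & CORE_C6_BITS) == CORE_C6_BITS
    return package_enabled, core_enabled


def c6_disable(pkg_val, core_val):
    """Return the MSR values with the assumed C6 enable bits cleared."""
    return pkg_val & ~PKG_C6_BIT, core_val & ~CORE_C6_BITS
```

With the firmware option unchecked, the reported state (package enabled, core disabled) would correspond to `c6_status(read_msr(MSR_PKG_C6), read_msr(MSR_CORE_C6))` returning `(True, False)`. Note that such a setting would presumably have to be applied per CPU and again after every boot or resume.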
(In reply to Paul Menzel from comment #21)
> As I do not have the firmware source code, I cannot say for sure. But, see
> comment #18:
>
> C6 State - Package - Enabled
> C6 State - Core - Disabled
>
> opposed to showing both as *Enabled*.

Ok, looks like this option controls the core C6 state.

> An update, all machines running GNU/Linux seem to be affected here.

All your boxes or all boxes in general?

> users just didn’t report it, and we didn’t check yet. First inquiries show,
> that devices with Microsoft Windows do not seem to be affected.
>
> A Linux kernel option is a first step, but no solution, as you cannot expect
> “normal” users to add that, and it should work out of the box.

Until there is a way to reliably detect those boxes, this is the best we
can do. If at all.

> At least on one box it still crashed. We need a reproducer for this issue.

After you ran the script? Which would mean that that CC6 disable is not
really fixing it. Unless the crash was something else.

IOW, so far we have this:

- a couple of dell boxes throw MCEs under load
- disabling CC6 helps but not always

So until there's another report on !dell machine, this looks like a dell
box problem to me.

I'm sure you'll correct me if I missed anything.
(In reply to Borislav Petkov from comment #22)
> (In reply to Paul Menzel from comment #21)
> > As I do not have the firmware source code, I cannot say for sure. But, see
> > comment #18:
> >
> > C6 State - Package - Enabled
> > C6 State - Core - Disabled
> >
> > opposed to showing both as *Enabled*.
>
> Ok, looks like this option controls the core C6 state.
>
> > An update, all machines running GNU/Linux seem to be affected here.
>
> All your boxes or all boxes in general?

All our Dell OptiPlex 5055 with GNU/Linux. It does not happen with Microsoft Windows.

> > users just didn’t report it, and we didn’t check yet. First inquiries show,
> > that devices with Microsoft Windows do not seem to be affected.
> >
> > A Linux kernel option is a first step, but no solution, as you cannot expect
> > “normal” users to add that, and it should work out of the box.
>
> Until there is a way to reliably detect those boxes, this is the best we
> can do. If at all.
>
> > At least on one box it still crashed. We need a reproducer for this issue.
>
> After you ran the script? Which would mean that that CC6 disable is not
> really fixing it. Unless the crash was something else.

I agree. But how do I find out what the crash reason was?

> IOW, so far we have this:
>
> - a couple of dell boxes throw MCEs under load
> - disabling CC6 helps but not always

Apparently, but nobody knows, as there is no way to check that (no logs) and there is no reproducer.

> So until there's another report on !dell machine, this looks like a dell
> box problem to me.

You choose to ignore all the posts on the Web for reasons I do not understand. It’s such a big problem that it is documented in the Gentoo Wiki and Arch Wiki.

https://wiki.archlinux.org/index.php/Ryzen#Random_reboots

> I'm sure you'll correct me if I missed anything.

I really would like to have some guidance on how to further debug this. Do you have more ideas or should I bring it up on the mailing list?
Also, you keep skipping my question of whether you have contacted AMD.
(In reply to Paul Menzel from comment #23)
> I agree. But how do I find out, what the crash reason was?

Maybe look at serial console output, or use kdump.

> Apparently, but nobody knows, as there is no way to check that (no logs) and
> there is no reproducer.

That sucks.

> You choose to ignore all the posts on the Web for reasons I do not
> understand. It’s such a big problem, that it is documented in the Gentoo
> Wiki and Arch Wiki.

So far it looks to me like a bunch of problems and people tend to lump
them all together. So "the Web" has a lot of random bug reports, a lot
of them unreliable.

For example, the idle=nomwait is supposed to help with erratum 1109. It
doesn't help on your boxes. So yours must be something else.

See what I mean?

> I really would like to have some guidance on how to further debug this. Do
> you have more ideas or should I bring it up on the mailing list?

The only thing you can do is send the box to Dell and let them deal with
it. This is the official way. Especially if the other boxes which are
identical don't get the MCEs, then this very well sounds like a silicon
issue with this particular sample.

> Also, you always seem to miss my questions of having contacted AMD.

Why does that matter for the issue at hand?

Hypothetically, if I had contacted them, they would have probably said
the same thing: send the boxes back to Dell. Dell would then talk to
their AMD contact and sort out the CPU issue. This is how it is done in
general.

HTH.
(In reply to Borislav Petkov from comment #24)
> (In reply to Paul Menzel from comment #23)
> > I agree. But how do I find out, what the crash reason was?
>
> Maybe look at serial console output, or use kdump.

There was never anything on the console. I’ll look into kdump. (We have a crashkernel set up, but as the system just freezes and can only be turned off by pressing the power button, the RAM contents are likely lost.) Thanks.

> > Apparently, but nobody knows, as there is no way to check that (no logs) and
> > there is no reproducer.
>
> That sucks.
>
> > You choose to ignore all the posts on the Web for reasons I do not
> > understand. It’s such a big problem, that it is documented in the Gentoo
> > Wiki and Arch Wiki.
>
> So far it looks to me like a bunch of problems and people tend to lump
> them all together. So "the Web" has a lot of random bug reports, a lot
> of them unreliable.
>
> For example, the idle=nomwait is supposed to help with erratum 1109. It
> doesn't help on your boxes. So yours must be something else.
>
> See what I mean?

Yes, but at least the MCEs should point in some direction.

> > I really would like to have some guidance on how to further debug this. Do
> > you have more ideas or should I bring it up on the mailing list?
>
> The only thing you can do is send the box to Dell and let them deal with
> it. This is the official way. Especially if the other boxes which are
> identical don't get the MCEs, then this very well sounds like a silicon
> issue with this particular sample.

That it doesn’t happen with Microsoft Windows makes me hopeful that a fix in the OS (Linux kernel) is possible.

> > Also, you always seem to miss my questions of having contacted AMD.
>
> Why does that matter for the issue at hand?
>
> Hypothetically, if I had contacted them, they would have probably said
> the same thing: send the boxes back to Dell. Dell would then talk to
> their AMD contact and sort out the CPU issue. This is how it is done in
> general.

Sorry, it looks like you never had to deal with the Dell support (lucky you). If you say you use GNU/Linux (even the Ubuntu from Dell), they tell you they do not support that, and that support is given by the community (and point to some Ubuntu forum).

The alternative would be, they say, you are right, we had the same problem in the past, see this erratum and this fix for Microsoft Windows.
(In reply to Paul Menzel from comment #25) […] > The alternative would be, they say, you are right, we had the same problem > in the past, see this erratum and this fix for Microsoft Windows. s/they say/AMD says/
Today, we had this issue with another system of the same model.

2020-06-15T15:19:57.779069+02:00 serotimor kernel: [ 0.000000] Linux version 5.4.39.mx64.334 (root@lol.molgen.mpg.de) (gcc version 7.5.0 (GCC)) #1 SMP Thu May 7 14:27:50 CEST 2020

After the crash, on reboot, Linux displays the MCE messages.

```
2020-06-15T16:36:38.939468+02:00 serotimor kernel: [ 0.439355] smpboot: CPU0: AMD Ryzen 5 PRO 1500 Quad-Core Processor (family: 0x17, model: 0x1, stepping: 0x1)
2020-06-15T16:36:38.939469+02:00 serotimor kernel: [ 0.439976] Performance Events: Fam17h+ core perfctr, AMD PMU driver.
2020-06-15T16:36:38.939469+02:00 serotimor kernel: [ 0.439977] ... version: 0
2020-06-15T16:36:38.939470+02:00 serotimor kernel: [ 0.440217] ... bit width: 48
2020-06-15T16:36:38.939471+02:00 serotimor kernel: [ 0.440461] ... generic registers: 6
2020-06-15T16:36:38.939471+02:00 serotimor kernel: [ 0.440700] ... value mask: 0000ffffffffffff
2020-06-15T16:36:38.939473+02:00 serotimor kernel: [ 0.440976] ... max period: 00007fffffffffff
2020-06-15T16:36:38.939473+02:00 serotimor kernel: [ 0.441291] ... fixed-purpose events: 0
2020-06-15T16:36:38.939474+02:00 serotimor kernel: [ 0.441530] ... event mask: 000000000000003f
2020-06-15T16:36:38.939475+02:00 serotimor kernel: [ 0.441866] rcu: Hierarchical SRCU implementation.
2020-06-15T16:36:38.939475+02:00 serotimor kernel: [ 0.442427] smp: Bringing up secondary CPUs ...
2020-06-15T16:36:38.939476+02:00 serotimor kernel: [ 0.443041] x86: Booting SMP configuration:
2020-06-15T16:36:38.939476+02:00 serotimor kernel: [ 0.443291] .... node #0, CPUs: #1 #2
2020-06-15T16:36:38.939477+02:00 serotimor kernel: [ 0.445000] mce: [Hardware Error]: Machine check events logged
2020-06-15T16:36:38.939478+02:00 serotimor kernel: [ 0.445978] mce: [Hardware Error]: CPU 2: Machine Check: 0 Bank 5: bea0000000000108
2020-06-15T16:36:38.939502+02:00 serotimor kernel: [ 0.446427] mce: [Hardware Error]: TSC 0 ADDR 1ffff81ac466c MISC d012000101000000 SYND 4d000000 IPID 500b000000000
2020-06-15T16:36:38.939503+02:00 serotimor kernel: [ 0.446977] mce: [Hardware Error]: PROCESSOR 2:800f11 TIME 1592231781 SOCKET 0 APIC 2 microcode 8001137
2020-06-15T16:36:38.939505+02:00 serotimor kernel: [ 0.451003] mce: [Hardware Error]: Machine check events logged
2020-06-15T16:36:38.939505+02:00 serotimor kernel: [ 0.451979] mce: [Hardware Error]: CPU 5: Machine Check: 0 Bank 5: bea0000000000108
2020-06-15T16:36:38.939506+02:00 serotimor kernel: [ 0.452429] mce: [Hardware Error]: TSC 0 ADDR 1ffff81ac4664 MISC d012000101000000 SYND 4d000000 IPID 500b000000000
2020-06-15T16:36:38.939508+02:00 serotimor kernel: [ 0.452977] mce: [Hardware Error]: PROCESSOR 2:800f11 TIME 1592231781 SOCKET 0 APIC 9 microcode 8001137
```
Looks like others still have the same issue: bug 206903 (Spontaneous reboots with Ryzen-3700x (Machine Check: 0 Bank 5: bea0000000000108)) [1]. [1]: https://bugzilla.kernel.org/show_bug.cgi?id=206903
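Since the same status value 0xbea0000000000108 keeps coming up, here is a small sketch that splits it into its flag bits and error code. The bit positions are the MCA_STATUS fields as I understand them for AMD family 17h (VAL, OVER, UC, EN, MISCV, ADDRV, PCC, TCC, SYNDV, DEFERRED, POISON); treat that table as an assumption and check it against the processor documentation. It only decodes the register layout and says nothing about the root cause.

```python
# ASSUMPTION: MCA_STATUS bit layout as commonly documented for AMD
# family 17h; verify against the PPR for the exact CPU model.
FLAGS = {
    63: "VAL",       # register contains a valid error
    62: "OVER",      # an earlier error was overwritten
    61: "UC",        # uncorrected error
    60: "EN",        # error reporting was enabled
    59: "MISCV",     # MISC register valid
    58: "ADDRV",     # ADDR register valid
    57: "PCC",       # processor context corrupt
    55: "TCC",       # task context corrupt
    53: "SYNDV",     # SYND register valid
    44: "DEFERRED",  # deferred error
    43: "POISON",    # poisoned data consumed
}


def decode_mca_status(status):
    """Split a 64-bit MCA status value into flag names and error codes."""
    flags = [name for bit, name in sorted(FLAGS.items(), reverse=True)
             if (status >> bit) & 1]
    return {
        "flags": flags,
        "error_code": status & 0xFFFF,             # bits 15:0
        "ext_error_code": (status >> 16) & 0x3F,   # bits 21:16
    }


if __name__ == "__main__":
    print(decode_mca_status(0xBEA0000000000108))
```

For the value from the logs, this reports an uncorrected error with valid ADDR, MISC, and SYND registers and both context-corrupt flags set, which matches the ADDR/MISC/SYND fields printed by the kernel.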
[ 1.185038] microcode: CPU0: patch_level=0x08001137

Rich, Mario, could you please poke Dell to release firmware updates with the latest microcode updates (0x08701021) included [1]? The Linux Firmware repository also does not contain more recent μcode updates [2].

[1]: https://bugzilla.kernel.org/show_bug.cgi?id=206903#c17
[2]: https://git.kernel.org/pub/scm/linux/kernel/git/firmware/linux-firmware.git/tree/amd-ucode
It still crashed with `amdgpu.ppfeaturemask=0xfffffffb`.
Another system (*walle*) also froze with `amdgpu.ppfeaturemask=0xffffbffd`.
*donut* seems to have crashed with Linux 5.4.39 and `amdgpu.dpm=0`.
*serotimor* also crashed with Linux 5.4.39 and `amdgpu.ppfeaturemask=0xfffd3fff` (4294787071), which is used by default with Linux 4.19.57.
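To keep track of which features each tested `amdgpu.ppfeaturemask` value turns off, a small helper that lists the cleared bit positions may be useful. This is only bit arithmetic; which feature each bit stands for is defined by the kernel (I believe in the `PP_FEATURE_MASK` enum of the amdgpu headers, but please verify for the kernel version in use).

```python
def cleared_bits(mask, width=32):
    """Return the positions of the bits that are 0 in a feature mask."""
    return [bit for bit in range(width) if not (mask >> bit) & 1]


if __name__ == "__main__":
    # The three masks tried in the comments above.
    for mask in (0xFFFFFFFB, 0xFFFFBFFD, 0xFFFD3FFF):
        print(f"{mask:#010x} ({mask}): cleared bits {cleared_bits(mask)}")
```

So the three masks tried so far disable bit 2, bits 1 and 14, and bits 14, 15, and 17 respectively.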
*serotimor* crashed with Linux 5.4.39 and `amdgpu.dpm=0` after 15 hours.
Comment 22 You wrote: "So until there's another report on !dell machine, this looks like a dell box problem to me." I maintain two Ryzen 5 1600 computers with ASrock motherboards which have freezes. The Ryzen 5 1600's have been RMAed by AMD but the freezes stay.
(In reply to Zounp from comment #35) > Comment 22 > You wrote: "So until there's another report on !dell machine, this looks > like a dell box problem to me." > > I maintain two Ryzen 5 1600 computers with ASrock motherboards which have > freezes. > The Ryzen 5 1600's have been RMAed by AMD but the freezes stay. As the issue is very convoluted, I suggest that you open a separate issue and reference it here. Maybe you are even able to note down the CPU serial number, so production dates might give a clue.
I'm also experiencing the same thing with a Ryzen 3600x paired with a Gigabyte ab350m-ds3h and a rx580. Ubuntu is unusable for me since my whole PC locks up every time in the first 30min after boot. Since I can reproduce the issue pretty often let me know if you need more info.

I've tried the following to no avail:

- Disable C6 states
- Select "Typical Current Idle" in bios
- Setting the following boot options: idle=nomwait processor.max_cstate=5 rcu_nocbs=0-11

I've enabled kdump but the files in /var/crash don't seem to include anything relating to the crash.

Here's some info:

```
Linux charles-ubuntu 5.8.0-45-generic #51~20.04.1-Ubuntu SMP Tue Feb 23 13:46:31 UTC 2021 x86_64 x86_64 x86_64 GNU/Linux
```

```
00:00.0 Host bridge: Advanced Micro Devices, Inc. [AMD] Starship/Matisse Root Complex
00:00.2 IOMMU: Advanced Micro Devices, Inc. [AMD] Starship/Matisse IOMMU
00:01.0 Host bridge: Advanced Micro Devices, Inc. [AMD] Starship/Matisse PCIe Dummy Host Bridge
00:01.3 PCI bridge: Advanced Micro Devices, Inc. [AMD] Starship/Matisse GPP Bridge
00:02.0 Host bridge: Advanced Micro Devices, Inc. [AMD] Starship/Matisse PCIe Dummy Host Bridge
00:03.0 Host bridge: Advanced Micro Devices, Inc. [AMD] Starship/Matisse PCIe Dummy Host Bridge
00:03.1 PCI bridge: Advanced Micro Devices, Inc. [AMD] Starship/Matisse GPP Bridge
00:04.0 Host bridge: Advanced Micro Devices, Inc. [AMD] Starship/Matisse PCIe Dummy Host Bridge
00:05.0 Host bridge: Advanced Micro Devices, Inc. [AMD] Starship/Matisse PCIe Dummy Host Bridge
00:07.0 Host bridge: Advanced Micro Devices, Inc. [AMD] Starship/Matisse PCIe Dummy Host Bridge
00:07.1 PCI bridge: Advanced Micro Devices, Inc. [AMD] Starship/Matisse Internal PCIe GPP Bridge 0 to bus[E:B]
00:08.0 Host bridge: Advanced Micro Devices, Inc. [AMD] Starship/Matisse PCIe Dummy Host Bridge
00:08.1 PCI bridge: Advanced Micro Devices, Inc. [AMD] Starship/Matisse Internal PCIe GPP Bridge 0 to bus[E:B]
00:08.2 PCI bridge: Advanced Micro Devices, Inc. [AMD] Starship/Matisse Internal PCIe GPP Bridge 0 to bus[E:B]
00:08.3 PCI bridge: Advanced Micro Devices, Inc. [AMD] Starship/Matisse Internal PCIe GPP Bridge 0 to bus[E:B]
00:14.0 SMBus: Advanced Micro Devices, Inc. [AMD] FCH SMBus Controller (rev 61)
00:14.3 ISA bridge: Advanced Micro Devices, Inc. [AMD] FCH LPC Bridge (rev 51)
00:18.0 Host bridge: Advanced Micro Devices, Inc. [AMD] Matisse Device 24: Function 0
00:18.1 Host bridge: Advanced Micro Devices, Inc. [AMD] Matisse Device 24: Function 1
00:18.2 Host bridge: Advanced Micro Devices, Inc. [AMD] Matisse Device 24: Function 2
00:18.3 Host bridge: Advanced Micro Devices, Inc. [AMD] Matisse Device 24: Function 3
00:18.4 Host bridge: Advanced Micro Devices, Inc. [AMD] Matisse Device 24: Function 4
00:18.5 Host bridge: Advanced Micro Devices, Inc. [AMD] Matisse Device 24: Function 5
00:18.6 Host bridge: Advanced Micro Devices, Inc. [AMD] Matisse Device 24: Function 6
00:18.7 Host bridge: Advanced Micro Devices, Inc. [AMD] Matisse Device 24: Function 7
01:00.0 USB controller: Advanced Micro Devices, Inc. [AMD] X370 Series Chipset USB 3.1 xHCI Controller (rev 02)
01:00.1 SATA controller: Advanced Micro Devices, Inc. [AMD] X370 Series Chipset SATA Controller (rev 02)
01:00.2 PCI bridge: Advanced Micro Devices, Inc. [AMD] X370 Series Chipset PCIe Upstream Port (rev 02)
02:00.0 PCI bridge: Advanced Micro Devices, Inc. [AMD] 300 Series Chipset PCIe Port (rev 02)
02:01.0 PCI bridge: Advanced Micro Devices, Inc. [AMD] 300 Series Chipset PCIe Port (rev 02)
02:02.0 PCI bridge: Advanced Micro Devices, Inc. [AMD] 300 Series Chipset PCIe Port (rev 02)
02:03.0 PCI bridge: Advanced Micro Devices, Inc. [AMD] 300 Series Chipset PCIe Port (rev 02)
02:04.0 PCI bridge: Advanced Micro Devices, Inc. [AMD] 300 Series Chipset PCIe Port (rev 02)
04:00.0 Ethernet controller: Realtek Semiconductor Co., Ltd. RTL8111/8168/8411 PCI Express Gigabit Ethernet Controller (rev 0c)
08:00.0 VGA compatible controller: Advanced Micro Devices, Inc. [AMD/ATI] Ellesmere [Radeon RX 470/480/570/570X/580/580X/590] (rev e7)
08:00.1 Audio device: Advanced Micro Devices, Inc. [AMD/ATI] Ellesmere HDMI Audio [Radeon RX 470/480 / 570/580/590]
09:00.0 Non-Essential Instrumentation [1300]: Advanced Micro Devices, Inc. [AMD] Starship/Matisse PCIe Dummy Function
0a:00.0 Non-Essential Instrumentation [1300]: Advanced Micro Devices, Inc. [AMD] Starship/Matisse Reserved SPP
0a:00.1 Encryption controller: Advanced Micro Devices, Inc. [AMD] Starship/Matisse Cryptographic Coprocessor PSPCPP
0a:00.3 USB controller: Advanced Micro Devices, Inc. [AMD] Matisse USB 3.0 Host Controller
0a:00.4 Audio device: Advanced Micro Devices, Inc. [AMD] Starship/Matisse HD Audio Controller
0b:00.0 SATA controller: Advanced Micro Devices, Inc. [AMD] FCH SATA Controller [AHCI mode] (rev 51)
0c:00.0 SATA controller: Advanced Micro Devices, Inc. [AMD] FCH SATA Controller [AHCI mode] (rev 51)
```

```
Mar 18 22:28:16 charles-ubuntu kernel: [ 276.223281] rq->clock_update_flags < RQCF_ACT_SKIP
Mar 18 22:28:16 charles-ubuntu kernel: [ 276.223287] WARNING: CPU: 1 PID: 0 at kernel/sched/sched.h:1115 assert_clock_updated.isra.0.part.0+0x17/0x20
Mar 18 22:28:16 charles-ubuntu kernel: [ 276.223287] Modules linked in: rfcomm cmac algif_hash algif_skcipher af_alg bnep edac_mce_amd kvm_amd amdgpu kvm snd_hda_codec_realtek snd_hda_codec_generic crct10dif_pclmul ledtrig_audio ghash_clmulni_intel aesni_intel snd_hda_codec_hdmi crypto_simd cryptd glue_helper snd_hda_intel snd_intel_dspcfg snd_usb_audio snd_hda_codec snd_usbmidi_lib snd_hda_core mc snd_hwdep snd_pcm snd_seq_midi snd_seq_midi_event snd_rawmidi rapl iommu_v2 gpu_sched btusb snd_seq ttm 88x2bu(OE) btrtl snd_seq_device btbcm drm_kms_helper snd_timer btintel cec bluetooth rc_core snd i2c_algo_bit hid_sony fb_sys_fops syscopyarea ecdh_generic cfg80211 wmi_bmof joydev sysfillrect input_leds ff_memless ecc sysimgblt k10temp ccp soundcore mac_hid sch_fq_codel msr parport_pc ppdev lp drm parport ip_tables x_tables autofs4 hid_generic usbhid hid crc32_pclmul r8169 i2c_piix4 xhci_pci realtek ahci xhci_pci_renesas libahci wmi gpio_amdpt gpio_generic
Mar 18 22:28:16 charles-ubuntu kernel: [ 276.223315] CPU: 1 PID: 0 Comm: swapper/1 Kdump: loaded Tainted: G OE 5.8.0-45-generic #51~20.04.1-Ubuntu
Mar 18 22:28:16 charles-ubuntu kernel: [ 276.223316] Hardware name: Gigabyte Technology Co., Ltd. AB350M-DS3H/AB350M-DS3H-CF, BIOS F50d 07/02/2020
Mar 18 22:28:16 charles-ubuntu kernel: [ 276.223317] RIP: 0010:assert_clock_updated.isra.0.part.0+0x17/0x20
Mar 18 22:28:16 charles-ubuntu kernel: [ 276.223318] Code: af 79 0e 00 84 c0 75 c1 e9 40 ff ff ff e8 21 1f aa 00 90 55 48 c7 c7 28 ca db 85 c6 05 10 fb 75 01 01 48 89 e5 e8 2f a6 a4 00 <0f> 0b 5d c3 0f 1f 44 00 00 0f 1f 44 00 00 55 bf 17 00 00 00 48 89
Mar 18 22:28:16 charles-ubuntu kernel: [ 276.223319] RSP: 0018:ffffa04800264e58 EFLAGS: 00010082
Mar 18 22:28:16 charles-ubuntu kernel: [ 276.223320] RAX: 0000000000000000 RBX: ffff8a44ce86c740 RCX: 0000000000000000
Mar 18 22:28:16 charles-ubuntu kernel: [ 276.223320] RDX: 0000000000000006 RSI: ffffffff865b0da6 RDI: 0000000000000046
Mar 18 22:28:16 charles-ubuntu kernel: [ 276.223321] RBP: ffffa04800264e58 R08: ffffffff865b0d80 R09: 0000000000000026
Mar 18 22:28:16 charles-ubuntu kernel: [ 276.223321] R10: 0000000000000000 R11: 0000000000000001 R12: ffff8a44ce86c740
Mar 18 22:28:16 charles-ubuntu kernel: [ 276.223322] R13: 0000000000000001 R14: ffff8a44ccfbaf00 R15: ffff8a44ce85fa80
Mar 18 22:28:16 charles-ubuntu kernel: [ 276.223322] FS: 0000000000000000(0000) GS:ffff8a44ce840000(0000) knlGS:0000000000000000
Mar 18 22:28:16 charles-ubuntu kernel: [ 276.223323] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Mar 18 22:28:16 charles-ubuntu kernel: [ 276.223323] CR2: 00007f1414077d10 CR3: 00000002eda0a000 CR4: 0000000000340ee0
Mar 18 22:28:16 charles-ubuntu kernel: [ 276.223324] Call Trace:
Mar 18 22:28:16 charles-ubuntu kernel: [ 276.223325] <IRQ>
Mar 18 22:28:16 charles-ubuntu kernel: [ 276.223328] update_rq_clock+0xe8/0x100
Mar 18 22:28:16 charles-ubuntu kernel: [ 276.223329] scheduler_tick+0x58/0x130
Mar 18 22:28:16 charles-ubuntu kernel: [ 276.223331] update_process_times+0x52/0x60
Mar 18 22:28:16 charles-ubuntu kernel: [ 276.223332] tick_sched_handle.isra.0+0x25/0x60
Mar 18 22:28:16 charles-ubuntu kernel: [ 276.223333] tick_sched_timer+0x40/0x80
Mar 18 22:28:16 charles-ubuntu kernel: [ 276.223334] __hrtimer_run_queues+0xf7/0x270
Mar 18 22:28:16 charles-ubuntu kernel: [ 276.223334] ? tick_sched_do_timer+0x60/0x60
Mar 18 22:28:16 charles-ubuntu kernel: [ 276.223336] hrtimer_interrupt+0x109/0x220
Mar 18 22:28:16 charles-ubuntu kernel: [ 276.223337] __sysvec_apic_timer_interrupt+0x64/0x100
Mar 18 22:28:16 charles-ubuntu kernel: [ 276.223339] asm_call_irq_on_stack+0x12/0x20
Mar 18 22:28:16 charles-ubuntu kernel: [ 276.223340] </IRQ>
Mar 18 22:28:16 charles-ubuntu kernel: [ 276.223341] sysvec_apic_timer_interrupt+0x7e/0x90
Mar 18 22:28:16 charles-ubuntu kernel: [ 276.223342] asm_sysvec_apic_timer_interrupt+0x12/0x20
Mar 18 22:28:16 charles-ubuntu kernel: [ 276.223344] RIP: 0010:cpuidle_enter_state+0xca/0x3e0
Mar 18 22:28:16 charles-ubuntu kernel: [ 276.223344] Code: ff e8 da f0 7c ff 80 7d c7 00 74 17 9c 58 0f 1f 44 00 00 f6 c4 02 0f 85 e4 02 00 00 31 ff e8 4d 3d 83 ff fb 66 0f 1f 44 00 00 <45> 85 e4 0f 88 39 02 00 00 49 63 d4 4c 8b 7d d0 4c 2b 7d c8 48 8d
Mar 18 22:28:16 charles-ubuntu kernel: [ 276.223345] RSP: 0018:ffffa04800137e38 EFLAGS: 00000246
Mar 18 22:28:16 charles-ubuntu kernel: [ 276.223345] RAX: ffff8a44ce86c740 RBX: ffff8a44bfd93c00 RCX: 000000000000001f
Mar 18 22:28:16 charles-ubuntu kernel: [ 276.223346] RDX: 0000000000000000 RSI: 0000000021bf5c7a RDI: 0000000000000000
Mar 18 22:28:16 charles-ubuntu kernel: [ 276.223346] RBP: ffffa04800137e78 R08: 000000405030ba58 R09: 0000000000000a68
Mar 18 22:28:16 charles-ubuntu kernel: [ 276.223347] R10: ffff8a44ce86b404 R11: ffff8a44ce86b3e4 R12: 0000000000000002
Mar 18 22:28:16 charles-ubuntu kernel: [ 276.223347] R13: ffffffff86177e80 R14: 0000000000000002 R15: ffff8a44bfd93c00
Mar 18 22:28:16 charles-ubuntu kernel: [ 276.223349] ? cpuidle_enter_state+0xa6/0x3e0
Mar 18 22:28:16 charles-ubuntu kernel: [ 276.223350] cpuidle_enter+0x2e/0x40
Mar 18 22:28:16 charles-ubuntu kernel: [ 276.223351] call_cpuidle+0x23/0x40
Mar 18 22:28:16 charles-ubuntu kernel: [ 276.223352] do_idle+0x1e7/0x280
Mar 18 22:28:16 charles-ubuntu kernel: [ 276.223353] cpu_startup_entry+0x20/0x30
Mar 18 22:28:16 charles-ubuntu kernel: [ 276.223354] start_secondary+0x159/0x1a0
Mar 18 22:28:16 charles-ubuntu kernel: [ 276.223356] secondary_startup_64+0xb6/0xc0
Mar 18 22:28:16 charles-ubuntu kernel: [ 276.223357] ---[ end trace 2122214e7ae2577c ]---
Mar 18 22:30:54 charles-ubuntu kernel: [ 434.401855] Web Content[4621]: segfault at 0 ip 0000000000000000 sp 00007ffd0daf5070 error 14
```
No idea if the warning is related to the crash. You should report it to Ubuntu (Launchpad bug tracker).

[ 276.223287] WARNING: CPU: 1 PID: 0 at kernel/sched/sched.h:1115 assert_clock_updated.isra.0.part.0+0x17/0x20

Lastly, please also try the current Linux kernel from the Ubuntu Linux Kernel PPA [1].

[1]: https://kernel.ubuntu.com/~kernel-ppa/
As an update, with *C States Control* unchecked in the system firmware, the crashes reduced noticeably on all systems running GNU/Linux. There are still a few unexplainable crashes, so there is still something else at play, but as it’s not reproducible and impossible to debug, it’s an improvement.
For the record, with Linux 5.13-rc2:

```
$ sudo cat /sys/kernel/debug/dri/0/amdgpu_firmware_info
VCE feature version: 0, firmware version: 0x00000000
UVD feature version: 0, firmware version: 0x40000d00
MC feature version: 0, firmware version: 0x00a17730
ME feature version: 29, firmware version: 0x00000091
PFP feature version: 29, firmware version: 0x00000054
CE feature version: 29, firmware version: 0x0000003d
RLC feature version: 1, firmware version: 0x00000001
RLC SRLC feature version: 0, firmware version: 0x00000000
RLC SRLG feature version: 0, firmware version: 0x00000000
RLC SRLS feature version: 0, firmware version: 0x00000000
MEC feature version: 0, firmware version: 0x00000000
SOS feature version: 0, firmware version: 0x00000000
ASD feature version: 0, firmware version: 0x00000000
TA XGMI feature version: 0x00000000, firmware version: 0x00000000
TA RAS feature version: 0x00000000, firmware version: 0x00000000
TA HDCP feature version: 0x00000000, firmware version: 0x00000000
TA DTM feature version: 0x00000000, firmware version: 0x00000000
TA RAP feature version: 0x00000000, firmware version: 0x00000000
TA SECUREDISPLAY feature version: 0x00000000, firmware version: 0x00000000
SMC feature version: 0, firmware version: 0x10020000
SDMA0 feature version: 0, firmware version: 0x00000000
SDMA1 feature version: 0, firmware version: 0x00000000
VCN feature version: 0, firmware version: 0x00000000
DMCU feature version: 0, firmware version: 0x00000000
DMCUB feature version: 0, firmware version: 0x00000000
TOC feature version: 0, firmware version: 0x00000000
VBIOS version: 113-C8690301-102
```
Does this patch help? https://www.spinics.net/lists/linux-acpi/msg101022.html
(FYI; that patch is targeted for 5.14: https://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm.git/commit/?h=bleeding-edge&id=65ea8f2c6e230bdf71fed0137cf9e9d1b307db32) When you test with it, it would be really good to note whether you see "FW issue: working around C-state latencies out of order" emitted to know if it's doing anything or it's a no-op for you.
Thank you for the suggestion. I am going to build a test kernel, but looking at the C-State latencies, they do not appear to be out of order:

$ grep . /sys/devices/system/cpu/cpu0/cpuidle/*/latency
/sys/devices/system/cpu/cpu0/cpuidle/state0/latency:0
/sys/devices/system/cpu/cpu0/cpuidle/state1/latency:0
/sys/devices/system/cpu/cpu0/cpuidle/state2/latency:100

$ sudo ~/src/ZenStates-Linux/zenstates.py -l
P0 - Enabled - FID = 8C - DID = 8 - VID = 32 - Ratio = 35.00 - vCore = 1.23750
P1 - Enabled - FID = 78 - DID = 8 - VID = 40 - Ratio = 30.00 - vCore = 1.15000
P2 - Enabled - FID = 7C - DID = 10 - VID = 6A - Ratio = 15.50 - vCore = 0.88750
P3 - Disabled
P4 - Disabled
P5 - Disabled
P6 - Disabled
P7 - Disabled
C6 State - Package - Enabled
C6 State - Core - Disabled

$ sudo ./turbostat
turbostat version 20.09.30 - Len Brown <lenb@kernel.org>
CPUID(0): AuthenticAMD 0xd CPUID levels; 0x8000001f xlevels; family:model:stepping 0x17:1:1 (23:1:1)
CPUID(1): SSE3 MONITOR - - - TSC MSR - HT -
CPUID(6): APERF, No-TURBO, No-DTS, No-PTM, No-HWP, No-HWPnotify, No-HWPwindow, No-HWPepp, No-HWPpkg, No-EPB
CPUID(7): No-SGX
RAPL: 234 sec. Joule Counter Range, at 280 Watts
/dev/cpu_dma_latency: 2000000000 usec (default)
current_driver: acpi_idle
current_governor: menu
current_governor_ro: menu
cpu2: POLL: CPUIDLE CORE POLL IDLE
cpu2: C1: ACPI HLT
cpu2: C2: ACPI P_LVL2 IOPORT 0x414
cpu2: cpufreq driver: acpi-cpufreq
cpu2: cpufreq governor: schedutil
cpufreq boost: 1
cpu0: MSR_RAPL_PWR_UNIT: 0x000a1003 (0.125000 Watts, 0.000015 Joules, 0.000977 sec.)
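For others who want to check quickly whether the proposed patch could apply to their machine, here is a rough user-space sketch that reads the cpuidle latencies from sysfs and reports whether any deeper state has a lower latency than a shallower one. It is only an approximation of the condition the patch works around, not the kernel's actual logic, and the function names are made up for illustration.

```python
import glob
import re


def read_latencies(cpu=0):
    """Read the exit latencies (in microseconds) of all cpuidle states, in state order."""
    pattern = f"/sys/devices/system/cpu/cpu{cpu}/cpuidle/state*/latency"

    def state_index(path):
        # Sort numerically so that state10 comes after state9.
        return int(re.search(r"state(\d+)", path).group(1))

    latencies = []
    for path in sorted(glob.glob(pattern), key=state_index):
        with open(path) as f:
            latencies.append(int(f.read()))
    return latencies


def out_of_order(latencies):
    """True if any state has a lower latency than the state before it."""
    return any(a > b for a, b in zip(latencies, latencies[1:]))


if __name__ == "__main__":
    values = read_latencies()
    print(values, "out of order" if out_of_order(values) else "in order")
```

With the values 0, 0, 100 from the output above, `out_of_order` returns `False`, so the workaround would presumably be a no-op on this system.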
With a kernel built from commit 31696a0a4ca5 (Merge branch 'acpi-x86' into bleeding-edge) [1], the log message does not show up.

$ dmesg | grep FW
$

[1]: https://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm.git/commit/?h=bleeding-edge&id=31696a0a4ca5
Just to inform that as of today, I am reproducing this issue frequently on my laptop with Ryzen 5 2500U CPU, using kernel 5.14.6-1-default in openSUSE Tumbleweed.
(In reply to Michael from comment #45)
> Just to inform that as of today, I am reproducing this issue frequently on
> my laptop with Ryzen 5 2500U CPU, using kernel 5.14.6-1-default in openSUSE
> Tumbleweed.

So what changed on your system? Please paste the MCE messages, if there are any.
This "mysterious freezing and crashing" sounds exactly like a RAM compatibility problem. RAM testing does not necessarily find anything wrong; the RAM just isn't compatible with the CPU for some reason. Try removing half of the RAM (one DIMM if you have two), or try to run the computer with just one DIMM if there are more. If it stops crashing, you've found the reason. If you can't change the RAM, try swapping the CPU for comparison.
From reading the above, the only thing that hasn't been changed is the RAM used in the machines? And that wasn't checked against the QVL associated with the CPUs?