Bug 208201 - AMD Ryzen 7 1800X soft lockup
Summary: AMD Ryzen 7 1800X soft lockup
Status: RESOLVED DOCUMENTED
Alias: None
Product: Platform Specific/Hardware
Classification: Unclassified
Component: x86-64
Hardware: x86-64 Linux
Importance: P1 normal
Assignee: platform_x86_64@kernel-bugs.osdl.org
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2020-06-16 02:27 UTC by raul
Modified: 2022-02-08 00:18 UTC
CC List: 3 users

See Also:
Kernel Version: 5.4.0
Subsystem:
Regression: No
Bisected commit-id:


Attachments
/var/log/kern.log file. (1011.85 KB, text/plain)
2020-06-16 02:27 UTC, raul
/var/log/kern.log file. (2.60 MB, text/plain)
2020-06-16 02:35 UTC, raul
Modules. (5.46 KB, text/plain)
2020-06-16 02:36 UTC, raul
kern.log showing garbage after PM: suspend entry (376.76 KB, application/gzip)
2020-07-08 14:43 UTC, raul
dmesg with soft lockup (478.00 KB, application/gzip)
2020-07-19 01:06 UTC, raul
C program which generates NMI interrupts. (2.61 KB, text/x-csrc)
2020-10-13 12:06 UTC, raul
Sleeping multithreaded loop with some extra instructions to put some load. (2.97 KB, text/x-csrc)
2020-10-13 20:27 UTC, raul
10-day dmesg output. (293.12 KB, application/gzip)
2020-10-17 15:20 UTC, raul
dmesg with GPU and SATA crash on an ASUS C7H. (146.26 KB, text/plain)
2020-10-23 14:23 UTC, raul
PCI bus tree of the C7H. (3.50 KB, text/plain)
2020-10-23 14:57 UTC, raul
dmesg with call traces, RIP smp_call_function_many (149.96 KB, text/plain)
2020-10-24 02:16 UTC, raul
kernel ring buffer showing an MCE on bank 5. It's always the same. (118.21 KB, text/plain)
2020-10-25 10:52 UTC, raul
dmesg plus lspci -vnnt output. (128.20 KB, text/plain)
2020-10-25 19:33 UTC, raul
dmesg with soft lockup after xHCI controller assumed dead. (157.32 KB, text/plain)
2020-11-01 00:52 UTC, raul
MSI X470 GAMING PRO (112.31 KB, text/plain)
2020-11-02 09:12 UTC, raul
lspci -vnnt msi x470 gaming plus, lsusb -tv (5.28 KB, text/plain)
2020-11-02 09:32 UTC, raul
Dmesg with typical power idle enabled on MSI X470 Gaming Pro Max. (101.35 KB, text/plain)
2020-11-15 00:22 UTC, raul
2700X first boot, all defaults. MSI X470 Gaming Pro MAX. (105.97 KB, text/plain)
2020-11-17 15:41 UTC, raul
7 day dmesg with 2700X and MSI X470 Gaming Pro Max. (113.97 KB, text/plain)
2020-11-24 18:04 UTC, raul
Ryzen 7 2700X 17 days uptime. (113.03 KB, text/plain)
2020-12-04 15:24 UTC, raul
Linux 5.10 rc6 dmesg crash 1800X (107.06 KB, text/plain)
2020-12-05 13:37 UTC, raul
PCIe AER errors since 2020/08/31 (49.66 KB, text/plain)
2021-02-23 03:07 UTC, raul

Description raul 2020-06-16 02:27:48 UTC
Created attachment 289689
/var/log/kern.log file.

The machine freezes, sometimes leaving a soft lockup message in /var/log/kern.log, as shown in the attached file.

To reproduce: use the computer at low load, for example by browsing the web or leaving it idle. It will soft lockup within hours. With GNOME, having the GNOME Vitals extension and indicator-cpufreq installed makes it freeze sooner.

What I have tried to avoid freezing:
* Change motherboard. Tried Gigabyte Aorus Ultra Gaming and ASUS Crosshair VI Hero.
* Change PSU.
* Enable Typical power supply option in UEFI.
* Remove GNOME Vitals extension. https://extensions.gnome.org/extension/1460/vitals/
* Remove indicator-cpufreq extension. https://launchpad.net/indicator-cpufreq

What has worked to some extent:
* Enable the Typical power supply option + remove the GNOME Vitals extension + remove the indicator-cpufreq extension. The system still had problems resuming from suspend.

The latest BIOS version for this motherboard doesn't offer the Typical power supply option. Without it the system hangs, even though the GNOME extensions are not installed.

In any case, idle voltage is higher with that option, which I don't find optimal. This is not a fix, and I still can't use some software, because with it the system keeps freezing.

Operating system: Ubuntu 20.04 LTS. This was also happening with Ubuntu 19.10 and Ubuntu 19.04.

Base Board Information
	Manufacturer: ASUSTeK COMPUTER INC.
	Product Name: CROSSHAIR VI HERO
BIOS Information
	Vendor: American Megatrends Inc.
	Version: 7704
	Release Date: 12/16/2019
Processor Information
	Socket Designation: AM4
	Type: Central Processor
	Family: Zen
	Manufacturer: Advanced Micro Devices, Inc.
	Version: AMD Ryzen 7 1800X Eight-Core Processor         
	Voltage: 1.4 V
	External Clock: 100 MHz
	Max Speed: 4100 MHz
	Current Speed: 3600 MHz
Memory Device
	Size: 8192 MB
	Form Factor: DIMM
	Set: None
	Locator: DIMM_A2
	Bank Locator: BANK 1
	Speed: 2133 MT/s
	Manufacturer: Corsair
	Serial Number: 00000000
	Asset Tag: Not Specified
	Part Number: CMK16GX4M2B3200C16  
	Rank: 1
	Configured Memory Speed: 2133 MT/s
	Minimum Voltage: 1.2 V
	Maximum Voltage: 1.2 V
	Configured Voltage: 1.2 V
Memory Device
	Size: 8192 MB
	Form Factor: DIMM
	Set: None
	Locator: DIMM_B2
	Bank Locator: BANK 3
	Speed: 2133 MT/s
	Manufacturer: Corsair
	Serial Number: 00000000
	Asset Tag: Not Specified
	Part Number: CMK16GX4M2B3200C16  
	Rank: 1
	Configured Memory Speed: 2133 MT/s
	Minimum Voltage: 1.2 V
	Maximum Voltage: 1.2 V
	Configured Voltage: 1.2 V

Corsair modules are ver 4.32.

Kernel module not included in Ubuntu: https://github.com/electrified/asus-wmi-sensors
Comment 1 raul 2020-06-16 02:35:41 UTC
Created attachment 289691
/var/log/kern.log file.

Both logs show the "JS Helper" process at the moment the CPU gets stuck.
Comment 2 raul 2020-06-16 02:36:02 UTC
Created attachment 289693
Modules.
Comment 3 Borislav Petkov 2020-06-17 08:58:48 UTC
Yes, do not use virtual box. It is horrible.
Comment 4 raul 2020-07-08 14:34:56 UTC
Removed virtual box.

I'm using "iommu=pt idle=nomwait processor.max_cstate=1" and stability has clearly improved. However, suspend can still trigger the bug. Once every few wake-ups from system sleep state S3 the processor freezes, either immediately after resuming or a few seconds later. In the former case, the kernel ring buffer shows only garbage (raw binary data) right after the message announcing the suspend entry:
[40278.633465] PM: suspend entry (deep)
<garbage>
[    0.000000] Linux version 5.4.0-39-generic (buildd@lcy01-amd64-016) (gcc version 9.3.0 (Ubuntu 9.3.0-10ubuntu2)) #43-Ubuntu SMP Fri Jun 19 10:28:31 UTC 2020 (Ubuntu 5.4.0-39.43-generic 5.4.41)

In the latter case, the system freezes as usual: static screen, keyboard and mouse not responding.

Now I want to add the option rcu_nocbs=all, but before that I have to compile a custom Ubuntu kernel with CONFIG_RCU_NOCB_CPU.

For reference: the BIOS no longer has the power idle option. Instead it provides an option to choose whether or not to report the ACPI _CST to the OS. I have it disabled. Now the sixteen lines "[Firmware Bug]: ACPI MWAIT C-state 0x0 not supported by HW (0x0)" no longer appear. The parameter idle=nomwait did not make them disappear (I read on another bug that idle=halt stopped those messages). Without the max_cstate parameter, the system continued to freeze every day.
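
A minimal Python sketch for confirming that the parameters quoted above actually reached the running kernel. It assumes a Linux system exposing /proc/cmdline; the function name and the WANTED tuple are illustrative, not taken from any tool in this report.

```python
from pathlib import Path

# Parameters quoted in this comment; edit to match your own boot configuration.
WANTED = ("iommu=pt", "idle=nomwait", "processor.max_cstate=1")


def missing_params(cmdline, wanted=WANTED):
    """Return the wanted parameters that are absent from a kernel command line."""
    present = set(cmdline.split())
    return [p for p in wanted if p not in present]


if __name__ == "__main__":
    path = Path("/proc/cmdline")  # assumption: running on Linux
    if path.exists():
        print("missing:", missing_params(path.read_text()))
```

An empty list means every parameter was passed to the kernel; it does not prove the kernel honoured it.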
Comment 5 raul 2020-07-08 14:43:56 UTC
Created attachment 290169
kern.log showing garbage after PM: suspend entry

Kernel log showing the garbage at Jun 30 01:36:49.
Comment 6 raul 2020-07-10 14:23:17 UTC
I just had a crash. Nothing was left in dmesg. Screen, mouse and keyboard turned off, and the system stopped responding.
Comment 7 raul 2020-07-19 01:06:23 UTC
Created attachment 290343
dmesg with soft lockup

To test again after updating the BIOS to the latest version, I removed all kernel parameters and loaded BIOS defaults, with RAM at D.O.C.P. 2666 MHz and 16-18-18-18-35 latencies (as in https://www.corsair.com/es/es/Categor%C3%ADas/Productos/Memoria/VENGEANCE-LPX/p/CMK16GX4M2A2666C16#tab-tech-specs).

Sample output:
Jul 16 17:20:56 localhost kernel: [56837.266681] rcu: INFO: rcu_sched detected stalls on CPUs/tasks:
Jul 16 17:20:56 localhost kernel: [56837.266688] rcu: 	0-...!: (1 ticks this GP) idle=7dc/0/0x0 softirq=1286444/1286444 fqs=0 
Jul 16 17:20:56 localhost kernel: [56837.266692] rcu: 	5-...!: (2 ticks this GP) idle=f88/0/0x0 softirq=1145056/1145057 fqs=0 
Jul 16 17:20:56 localhost kernel: [56837.266696] rcu: 	6-...!: (1 ticks this GP) idle=4bc/0/0x0 softirq=1831043/1831043 fqs=0 
Jul 16 17:20:56 localhost kernel: [56837.266699] rcu: 	8-...!: (2 GPs behind) idle=15c/0/0x0 softirq=1367591/1367592 fqs=0 
Jul 16 17:20:56 localhost kernel: [56837.266702] rcu: 	13-...!: (6 GPs behind) idle=f0c/0/0x0 softirq=1102440/1102440 fqs=0 
Jul 16 17:20:56 localhost kernel: [56837.266705] rcu: 	14-...!: (2 GPs behind) idle=730/0/0x0 softirq=1094469/1094469 fqs=0 
Jul 16 17:20:56 localhost kernel: [56837.266709] 	(detected by 9, t=15012 jiffies, g=7967709, q=260)
Jul 16 17:20:56 localhost kernel: [56837.266712] Sending NMI from CPU 9 to CPUs 0:
Jul 16 17:20:56 localhost kernel: [56847.192690] Sending NMI from CPU 9 to CPUs 5:
Jul 16 17:20:56 localhost kernel: [56857.118668] Sending NMI from CPU 9 to CPUs 6:
Jul 16 17:20:56 localhost kernel: [56867.044645] Sending NMI from CPU 9 to CPUs 8:
Jul 16 17:20:56 localhost kernel: [56876.970622] Sending NMI from CPU 9 to CPUs 13:
Jul 16 17:20:56 localhost kernel: [56886.895645] Sending NMI from CPU 9 to CPUs 14:
Jul 16 17:21:05 localhost kernel: [56905.255579] watchdog: BUG: soft lockup - CPU#1 stuck for 23s! [pool-gnome-shel:143226]
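
A minimal Python sketch for pulling the stalled CPU numbers out of an RCU stall report like the one above. The regular expression is an assumption based only on the line layout shown in this log and may need adjusting for other kernel versions.

```python
import re

# Matches per-CPU lines such as "rcu: \t13-...!: (6 GPs behind) idle=...".
# The header line ("rcu: INFO: ...") does not match because no digit follows.
STALL_RE = re.compile(r"rcu:\s+(\d+)-\S+:\s+\(([^)]*)\)")


def stalled_cpus(log_text):
    """Return (cpu, detail) pairs, in log order, from an RCU stall report."""
    return [(int(m.group(1)), m.group(2)) for m in STALL_RE.finditer(log_text)]
```

Applied to the sample output above, this lists CPUs 0, 5, 6, 8, 13 and 14, the same CPUs that CPU 9 then sends NMIs to.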
Comment 8 raul 2020-08-01 20:00:10 UTC
I changed the RAM modules. I'm now using 32GB (2x16GB, dual-rank modules) of Crucial Ballistix. No kernel parameters and no BIOS option changes, except for memory frequency (D.O.C.P. @ 2933 MHz). Yes, dual rank and a higher frequency. Latencies: 16-18-18-18-36. The rest is on auto, trained by the IMC in the CPU.

Memory Device
	Array Handle: 0x002D
	Error Information Handle: 0x003A
	Total Width: 64 bits
	Data Width: 64 bits
	Size: 16384 MB
	Form Factor: DIMM
	Set: None
	Locator: DIMM_B2
	Bank Locator: BANK 3
	Type: DDR4
	Type Detail: Synchronous Unbuffered (Unregistered)
	Speed: 2933 MT/s
	Manufacturer: Crucial Technology
	Serial Number: 
	Asset Tag: Not Specified
	Part Number: BL16G32C16U4R.M16FE1
	Rank: 2
	Configured Memory Speed: 2933 MT/s
	Minimum Voltage: 1.2 V
	Maximum Voltage: 1.2 V
	Configured Voltage: 1.2 V
	Memory Technology: DRAM
	Memory Operating Mode Capability: Volatile memory
	Firmware Version: Unknown
	Module Manufacturer ID: Bank 6, Hex 0x9B
	Module Product ID: Unknown
	Memory Subsystem Controller Manufacturer ID: Unknown
	Memory Subsystem Controller Product ID: Unknown
	Non-Volatile Size: None
	Volatile Size: 16 GB
	Cache Size: None
	Logical Size: None

The system has been stable for 29 hours. 1h of stressapptest and 1h of memtester passed OK.
With the Corsair RAM kit the computer would crash in less than 12 hours, even at standard frequencies (2133-2666 MHz), although with the latest BIOS 2133 MHz was feasible and stable for several days. With only one RAM stick there was apparently no problem, though: 3200 MHz was rock solid during three days of testing (before that, the computer always crashed within hours). Maybe there was a bad stick, but I tried three problematic kits of the same model, and that is quite uncommon.

So maybe that's why disabling C6 states worked for some. The core state saved when entering C6 is no longer being corrupted by bad behaviour of the RAM or motherboard, but the system can still crash due to incompatibility with the RAM sticks.

If all remains good I'll close this.
Comment 9 raul 2020-08-13 12:48:27 UTC
After testing different memory configurations, the only thing that seems to work is not polling voltages with the asus-wmi-sensors kernel module. Currently the memory kit is running at its XMP profile without any hangs. There is no need for the UEFI option "Typical power idle", nor to disable "Global C-States Control".

64h of uptime without a hang at 2666 MHz JEDEC.
22h of uptime at XMP profile (3200 MHz 16-18-18-18-36).

If all remains stable I'll close this.
Comment 10 Borislav Petkov 2020-10-01 12:53:58 UTC
Lemme do that and chalk this one up to a hardware issue. Feel free to reopen if there's anything else.

Thx.
Comment 11 raul 2020-10-13 12:06:00 UTC
Created attachment 292941
C program which generates NMI interrupts.

I RMA'd that processor, but the new processor still hangs, even under default UEFI settings. On top of that, I triggered an unknown NMI by calling sleep() in an infinite loop at 2-second intervals, after 5000 seconds or so. Grepping the last 11 kernel logs doesn't show any unknown NMI message. I left the computer running with a 7-second interval all night but did not get another NMI. The C program's source code is attached to this message.

[41717.839567] Uhhuh. NMI received for unknown reason 2d on CPU 11.
[41717.839568] Do you have a strange power saving mode enabled?
[41717.839568] Dazed and confused, but trying to continue
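
A rough Python sketch of the kind of wake-then-sleep loop described above. It is not the attached C program: the short arithmetic burst, the function name and every parameter are assumptions added for illustration, and there is no claim that it reproduces the NMI.

```python
import time


def intermittent_load(iterations, interval_s=2.0, spin=100_000):
    """Alternate a short burst of arithmetic with a sleep, `iterations` times.

    With spin=0 this degenerates to a plain sleep loop.
    """
    acc = 0
    for _ in range(iterations):
        for i in range(spin):
            # Brief burst so the core wakes up and retires some instructions.
            acc = (acc + i * i) & 0xFFFFFFFF
        # Sleep long enough for the core to drop back to an idle state.
        time.sleep(interval_s)
    return acc


if __name__ == "__main__":
    # Roughly 5000 seconds at the 2-second interval mentioned above.
    intermittent_load(iterations=2500, interval_s=2.0)
```

Watching `dmesg -w` in another terminal while this runs would show whether any NMI or lockup message appears.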

This processor also generates MCEs:
[    0.116337] mce: [Hardware Error]: Machine check events logged
[    0.116339] mce: [Hardware Error]: CPU 0: Machine Check: 0 Bank 5: bea0000000000108
[    0.116341] mce: [Hardware Error]: TSC 0 ADDR 1ffff944d7df8 MISC d012000100000000 SYND 4d000000 IPID 500b000000000 
[    0.116344] mce: [Hardware Error]: PROCESSOR 2:800f11 TIME 1599281052 SOCKET 0 APIC 0 microcode 8001138

I've done some research, and this exception is related to the watchdog timer in the execution block of the processor: the time limit to retire instructions is exceeded.

This same MCE also appeared in the previous processor. The kernel NMI watchdog also catches both soft and hard lockup stalls:

d’oct. 12 12:20:29 host kernel: watchdog: BUG: soft lockup - CPU#2 stuck for 23s! [pool-gnome-shel:1437284]
d’oct. 12 12:20:29 host kernel: Modules linked in: vboxnetadp(OE) vboxnetflt(OE) vboxdrv(OE) k10temp xt_CHECKSUM xt_MASQUERADE ip6table_mangle ip6table_nat iptable_mangle iptable_nat nf_nat aufs nf_tables nfnetlink bridge stp llc overlay edac_mce_amd binfmt_misc rc_it913x_v2 kvm snd_hda_codec_realtek snd_hda_codec_generic ledtrig_audio crct10dif_pclmul it913x snd_hda_codec_hdmi ghash_clmulni_intel aesni_intel snd_hda_intel nls_iso8859_1 crypto_simd af9033 cryptd snd_intel_dspcfg glue_helper snd_hda_codec dvb_usb_af9035 amdgpu dvb_usb_v2 snd_hda_core dvb_core snd_hwdep rc_core snd_pcm mc snd_seq_midi snd_seq_midi_event snd_rawmidi joydev input_leds snd_seq amd_iommu_v2 eeepc_wmi gpu_sched asus_wmi ttm snd_seq_device sparse_keymap snd_timer wmi_bmof video mxm_wmi snd drm_kms_helper fb_sys_fops syscopyarea sysfillrect ccp sysimgblt soundcore mac_hid nf_log_ipv6 ip6t_REJECT nf_reject_ipv6 xt_hl ip6t_rt nf_log_ipv4 nf_log_common xt_LOG xt_multiport xt_addrtype xt_recent ipt_SYNPROXY sch_fq_codel nf_synproxy_core
d’oct. 12 12:20:29 host kernel:  xt_limit ipt_REJECT nf_reject_ipv4 xt_connlimit nf_conncount xt_tcpmss xt_CT xt_tcpudp iptable_raw xt_conntrack nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 libcrc32c ip6table_filter ip6_tables asus_wmi_sensors(OE) parport_pc iptable_filter ppdev bpfilter lp parport drm ip_tables x_tables autofs4 bcache crc64 hid_generic usbhid hid crc32_pclmul nvme igb i2c_piix4 i2c_algo_bit ahci dca libahci nvme_core gpio_amdpt wmi gpio_generic [last unloaded: zenpower]
d’oct. 12 12:20:29 host kernel: CPU: 2 PID: 1437284 Comm: pool-gnome-shel Kdump: loaded Tainted: G           OEL    5.4.0-48-generic #52-Ubuntu
d’oct. 12 12:20:29 host kernel: Hardware name: System manufacturer System Product Name/CROSSHAIR VI HERO, BIOS 7901 07/31/2020
d’oct. 12 12:20:29 host kernel: RIP: 0010:smp_call_function_single+0x9b/0x110
d’oct. 12 12:20:29 host kernel: Code: 65 8b 05 00 85 ec 56 a9 00 01 1f 00 75 79 85 c9 75 40 48 c7 c6 00 bc 02 00 65 48 03 35 b6 1c ec 56 8b 46 18 a8 01 74 09 f3 90 <8b> 46 18 a8 01 75 f7 83 4e 18 01 4c 89 c9 4c 89 c2 e8 7f fe ff ff
d’oct. 12 12:20:29 host kernel: RSP: 0018:ffffabfae4d27ba0 EFLAGS: 00000202 ORIG_RAX: ffffffffffffff13
d’oct. 12 12:20:29 host kernel: RAX: 0000000000000001 RBX: 000000011e22073c RCX: 0000000000000000
d’oct. 12 12:20:29 host kernel: RDX: 0000000000000000 RSI: ffff9f99be8abc00 RDI: 0000000000000004
d’oct. 12 12:20:29 host kernel: RBP: ffffabfae4d27be8 R08: ffffffffa9048950 R09: 0000000000000000
d’oct. 12 12:20:29 host kernel: R10: 0000000000000001 R11: 006f666e69757063 R12: 0000000000000004
d’oct. 12 12:20:29 host kernel: R13: 00020ead2f96cb1c R14: 0000000000000001 R15: ffff9f99412dca00
d’oct. 12 12:20:29 host kernel: FS:  00007f5cdee02700(0000) GS:ffff9f99be880000(0000) knlGS:0000000000000000
d’oct. 12 12:20:29 host kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
d’oct. 12 12:20:29 host kernel: CR2: 00007f5d3597fff8 CR3: 00000007957b6000 CR4: 00000000003406e0
d’oct. 12 12:20:29 host kernel: Call Trace:
d’oct. 12 12:20:29 host kernel:  ? __switch_to_asm+0x40/0x70
d’oct. 12 12:20:29 host kernel:  ? ktime_get+0x3e/0xa0
d’oct. 12 12:20:29 host kernel:  aperfmperf_snapshot_cpu+0x42/0x50
d’oct. 12 12:20:29 host kernel:  arch_freq_prepare_all+0x67/0xa0
d’oct. 12 12:20:29 host kernel:  cpuinfo_open+0x13/0x30
d’oct. 12 12:20:29 host kernel:  proc_reg_open+0x77/0x130
d’oct. 12 12:20:29 host kernel:  ? proc_put_link+0x10/0x10
d’oct. 12 12:20:29 host kernel:  do_dentry_open+0x143/0x3a0
d’oct. 12 12:20:29 host kernel:  vfs_open+0x2d/0x30
d’oct. 12 12:20:29 host kernel:  do_last+0x194/0x900
d’oct. 12 12:20:29 host kernel:  path_openat+0x8d/0x290
d’oct. 12 12:20:29 host kernel:  ? hrtimer_cancel+0x15/0x20
d’oct. 12 12:20:29 host kernel:  do_filp_open+0x91/0x100
d’oct. 12 12:20:29 host kernel:  ? __alloc_fd+0x46/0x150
d’oct. 12 12:20:29 host kernel:  do_sys_open+0x17e/0x290
d’oct. 12 12:20:29 host kernel:  ? __audit_syscall_exit+0x233/0x290
d’oct. 12 12:20:29 host kernel:  __x64_sys_openat+0x20/0x30
d’oct. 12 12:20:29 host kernel:  do_syscall_64+0x57/0x190
d’oct. 12 12:20:29 host kernel:  entry_SYSCALL_64_after_hwframe+0x44/0xa9
d’oct. 12 12:20:29 host kernel: RIP: 0033:0x7f5de422ff24
d’oct. 12 12:20:29 host kernel: Code: 24 20 eb 8f 66 90 44 89 54 24 0c e8 56 68 f8 ff 44 8b 54 24 0c 44 89 e2 48 89 ee 41 89 c0 bf 9c ff ff ff b8 01 01 00 00 0f 05 <48> 3d 00 f0 ff ff 77 32 44 89 c7 89 44 24 0c e8 88 68 f8 ff 8b 44
d’oct. 12 12:20:29 host kernel: RSP: 002b:00007f5cdee017b0 EFLAGS: 00000293 ORIG_RAX: 0000000000000101
d’oct. 12 12:20:29 host kernel: RAX: ffffffffffffffda RBX: 0000000000003a98 RCX: 00007f5de422ff24
d’oct. 12 12:20:29 host kernel: RDX: 0000000000000000 RSI: 000055a0e21ffce0 RDI: 00000000ffffff9c
d’oct. 12 12:20:29 host kernel: RBP: 000055a0e21ffce0 R08: 0000000000000000 R09: 000055a0da061df0
d’oct. 12 12:20:29 host kernel: R10: 0000000000000000 R11: 0000000000000293 R12: 0000000000000000
d’oct. 12 12:20:29 host kernel: R13: 00007f5cdee01900 R14: 0000000000000000 R15: 00007f5cdee01a40
d’oct. 12 12:05:17 host kernel: NMI watchdog: Watchdog detected hard LOCKUP on cpu 2
d’oct. 12 12:05:17 host kernel: Modules linked in: vboxnetadp(OE) vboxnetflt(OE) vboxdrv(OE) k10temp xt_CHECKSUM xt_MASQUERADE ip6table_mangle ip6table_nat iptable_mangle iptable_nat nf_nat aufs nf_tables nfnetlink bridge stp llc overlay edac_mce_amd binfmt_misc rc_it913x_v2 kvm snd_hda_codec_realtek snd_hda_codec_generic ledtrig_audio crct10dif_pclmul it913x snd_hda_codec_hdmi ghash_clmulni_intel aesni_intel snd_hda_intel nls_iso8859_1 crypto_simd af9033 cryptd snd_intel_dspcfg glue_helper snd_hda_codec dvb_usb_af9035 amdgpu dvb_usb_v2 snd_hda_core dvb_core snd_hwdep rc_core snd_pcm mc snd_seq_midi snd_seq_midi_event snd_rawmidi joydev input_leds snd_seq amd_iommu_v2 eeepc_wmi gpu_sched asus_wmi ttm snd_seq_device sparse_keymap snd_timer wmi_bmof video mxm_wmi snd drm_kms_helper fb_sys_fops syscopyarea sysfillrect ccp sysimgblt soundcore mac_hid nf_log_ipv6 ip6t_REJECT nf_reject_ipv6 xt_hl ip6t_rt nf_log_ipv4 nf_log_common xt_LOG xt_multiport xt_addrtype xt_recent ipt_SYNPROXY sch_fq_codel nf_synproxy_core
d’oct. 12 12:05:17 host kernel:  xt_limit ipt_REJECT nf_reject_ipv4 xt_connlimit nf_conncount xt_tcpmss xt_CT xt_tcpudp iptable_raw xt_conntrack nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 libcrc32c ip6table_filter ip6_tables asus_wmi_sensors(OE) parport_pc iptable_filter ppdev bpfilter lp parport drm ip_tables x_tables autofs4 bcache crc64 hid_generic usbhid hid crc32_pclmul nvme igb i2c_piix4 i2c_algo_bit ahci dca libahci nvme_core gpio_amdpt wmi gpio_generic [last unloaded: zenpower]
d’oct. 12 12:05:17 host kernel: CPU: 2 PID: 1437284 Comm: pool-gnome-shel Kdump: loaded Tainted: G           OEL    5.4.0-48-generic #52-Ubuntu
d’oct. 12 12:05:17 host kernel: Hardware name: System manufacturer System Product Name/CROSSHAIR VI HERO, BIOS 7901 07/31/2020
d’oct. 12 12:05:17 host kernel: RIP: 0010:smp_call_function_single+0x9b/0x110
d’oct. 12 12:05:17 host kernel: Code: 65 8b 05 00 85 ec 56 a9 00 01 1f 00 75 79 85 c9 75 40 48 c7 c6 00 bc 02 00 65 48 03 35 b6 1c ec 56 8b 46 18 a8 01 74 09 f3 90 <8b> 46 18 a8 01 75 f7 83 4e 18 01 4c 89 c9 4c 89 c2 e8 7f fe ff ff
d’oct. 12 12:05:17 host kernel: RSP: 0018:ffffabfae4d27ba0 EFLAGS: 00000202
d’oct. 12 12:05:17 host kernel: RAX: 0000000000000001 RBX: 000000011e22073c RCX: 0000000000000000
d’oct. 12 12:05:17 host kernel: RDX: 0000000000000000 RSI: ffff9f99be8abc00 RDI: 0000000000000004
d’oct. 12 12:05:17 host kernel: RBP: ffffabfae4d27be8 R08: ffffffffa9048950 R09: 0000000000000000
d’oct. 12 12:05:17 host kernel: R10: 0000000000000001 R11: 006f666e69757063 R12: 0000000000000004
d’oct. 12 12:05:17 host kernel: R13: 00020ead2f96cb1c R14: 0000000000000001 R15: ffff9f99412dca00
d’oct. 12 12:05:17 host kernel: FS:  00007f5cdee02700(0000) GS:ffff9f99be880000(0000) knlGS:0000000000000000
d’oct. 12 12:05:17 host kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
d’oct. 12 12:05:17 host kernel: CR2: 00007f5d3597fff8 CR3: 00000007957b6000 CR4: 00000000003406e0
d’oct. 12 12:05:17 host kernel: Call Trace:
d’oct. 12 12:05:17 host kernel:  ? __switch_to_asm+0x40/0x70
d’oct. 12 12:05:17 host kernel:  ? ktime_get+0x3e/0xa0
d’oct. 12 12:05:17 host kernel:  aperfmperf_snapshot_cpu+0x42/0x50
d’oct. 12 12:05:17 host kernel:  arch_freq_prepare_all+0x67/0xa0
d’oct. 12 12:05:17 host kernel:  cpuinfo_open+0x13/0x30
d’oct. 12 12:05:17 host kernel:  proc_reg_open+0x77/0x130
d’oct. 12 12:05:17 host kernel:  ? proc_put_link+0x10/0x10
d’oct. 12 12:05:17 host kernel:  do_dentry_open+0x143/0x3a0
d’oct. 12 12:05:17 host kernel:  vfs_open+0x2d/0x30
d’oct. 12 12:05:17 host kernel:  do_last+0x194/0x900
d’oct. 12 12:05:17 host kernel:  path_openat+0x8d/0x290
d’oct. 12 12:05:17 host kernel:  ? hrtimer_cancel+0x15/0x20
d’oct. 12 12:05:17 host kernel:  do_filp_open+0x91/0x100
d’oct. 12 12:05:17 host kernel:  ? __alloc_fd+0x46/0x150
d’oct. 12 12:05:17 host kernel:  do_sys_open+0x17e/0x290
d’oct. 12 12:05:17 host kernel:  ? __audit_syscall_exit+0x233/0x290
d’oct. 12 12:05:17 host kernel:  __x64_sys_openat+0x20/0x30
d’oct. 12 12:05:17 host kernel:  do_syscall_64+0x57/0x190
d’oct. 12 12:05:17 host kernel:  entry_SYSCALL_64_after_hwframe+0x44/0xa9
d’oct. 12 12:05:17 host kernel: RIP: 0033:0x7f5de422ff24
d’oct. 12 12:05:17 host kernel: Code: 24 20 eb 8f 66 90 44 89 54 24 0c e8 56 68 f8 ff 44 8b 54 24 0c 44 89 e2 48 89 ee 41 89 c0 bf 9c ff ff ff b8 01 01 00 00 0f 05 <48> 3d 00 f0 ff ff 77 32 44 89 c7 89 44 24 0c e8 88 68 f8 ff 8b 44
d’oct. 12 12:05:17 host kernel: RSP: 002b:00007f5cdee017b0 EFLAGS: 00000293 ORIG_RAX: 0000000000000101
d’oct. 12 12:05:17 host kernel: RAX: ffffffffffffffda RBX: 0000000000003a98 RCX: 00007f5de422ff24
d’oct. 12 12:05:17 host kernel: RDX: 0000000000000000 RSI: 000055a0e21ffce0 RDI: 00000000ffffff9c
d’oct. 12 12:05:17 host kernel: RBP: 000055a0e21ffce0 R08: 0000000000000000 R09: 000055a0da061df0
d’oct. 12 12:05:17 host kernel: R10: 0000000000000000 R11: 0000000000000293 R12: 0000000000000000
d’oct. 12 12:05:17 host kernel: R13: 00007f5cdee01900 R14: 0000000000000000 R15: 00007f5cdee01a40

With CPU auto overclock enabled on the motherboard, the processor lasted 14 days without crashes. Auto overclock disables Core C6 states. The Global C-States option didn't work before. On this motherboard, that option disables the package C6 state.

The previous processor was RMA'd due to MCEs while executing the script kill-ryzen.sh (https://github.com/suaefar/ryzen-test/blob/master/kill-ryzen.sh) modified to compile GCC 9.3.0, needed for Ubuntu 20.04 LTS:

[68489.367433] mce: [Hardware Error]: Machine check events logged
[68489.367438] [Hardware Error]: Corrected error, no action required.
[68489.367445] [Hardware Error]: CPU:0 (17:1:1) MC9_STATUS[-|CE|MiscV|AddrV|-|-|SyndV|CECC|-|-|-]: 0x9c2040000000010b
[68489.367449] [Hardware Error]: Error Addr: 0x0000000000003880
[68489.367452] [Hardware Error]: IPID: 0x000700b018090300, Syndrome: 0x000000292a1f0803
[68489.367454] [Hardware Error]: L3 Cache Ext. Error Code: 0, Shadow Tag Macro ECC Error.
[68489.367457] [Hardware Error]: cache level: L3/GEN, tx: GEN, mem-tx: GEN
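
A minimal Python sketch that decodes the flag bits of the two MCA status words quoted in this comment (bank 5: bea0000000000108, bank 9: 0x9c2040000000010b). Bits 63 to 57 are the architectural x86 MCA status flags. The AMD-specific positions (TCC, SyndV, CECC, UECC, Deferred, Poison) are an assumption taken from a reading of the Linux MCE headers and should be checked against AMD's documentation before relying on them.

```python
# (bit position, flag name). The AMD-specific entries are assumptions.
STATUS_BITS = [
    (63, "Val"), (62, "Over"), (61, "UC"), (60, "En"),
    (59, "MiscV"), (58, "AddrV"), (57, "PCC"),
    (55, "TCC"), (53, "SyndV"), (46, "CECC"), (45, "UECC"),
    (44, "Deferred"), (43, "Poison"),
]


def decode_mca_status(status):
    """Return the names of the flag bits set in an MCA status word."""
    return [name for bit, name in STATUS_BITS if (status >> bit) & 1]


def error_code(status):
    """Return the low 16 bits, which hold the MCA error code."""
    return status & 0xFFFF


if __name__ == "__main__":
    for value in (0xBEA0000000000108, 0x9C2040000000010B):
        print(hex(value), decode_mca_status(value), hex(error_code(value)))
```

For the bank 9 value the decoded flags agree with the kernel's own MC9_STATUS line above (MiscV, AddrV, SyndV, CECC, and no UC). The bank 5 value has UC and PCC set, so it was an uncorrected error.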


It looks as if there's a power delivery problem in the processor while executing transient loads. The GNOME Vitals extension is such a transient load. So is the loop I wrote. I found a thread in AMD's community forums in which a user found NMI exceptions in his EPYC machine and stopped them by disabling C6: https://community.amd.com/thread/233764

I don't consider it a solution, as it removes processor functionality. Given the unknown NMI, I now don't know whether the fault also lies with Linux, through some lack of support for C6 states in these processors. I'm reopening this.
Comment 12 raul 2020-10-13 20:27:16 UTC
Created attachment 292949
Sleeping multithreaded loop with some extra instructions to put some load.

The following points to a hardware error, though it may be unrelated. Both the unknown NMI and the PCIe bus error appeared while running this little program, with nothing else going on besides web browsing.

pcieport 0000:00:01.3: AER: Corrected error received: 0000:00:00.0
pcieport 0000:00:01.3: AER: PCIe Bus Error: severity=Corrected, type=Data Link Layer, (Transmitter ID)
pcieport 0000:00:01.3: AER:   device [1022:1453] error status/mask=00001000/00006000
pcieport 0000:00:01.3: AER:    [12] Timeout 

The device [1022:1453] is the PCIe GPP Bridge. This AER error does not appear in any of the kernel logs currently available, which go back one month.
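
A minimal Python sketch that decodes the "error status/mask=00001000/00006000" pair from the AER message above. The bit positions follow the PCIe AER Correctable Error Status register layout, and the short names mirror the labels the Linux kernel prints, as far as I recall them; treat both as assumptions to verify against the PCIe specification.

```python
# Correctable Error Status bits (assumed positions and kernel-style labels).
AER_CORRECTABLE = {
    0: "RxErr",
    6: "BadTLP",
    7: "BadDLLP",
    8: "Rollover",
    12: "Timeout",
    13: "NonFatalErr",
    14: "CorrIntErr",
    15: "HeaderOF",
}


def decode_aer_correctable(value):
    """Return labels for the bits set in a 32-bit AER correctable status or mask."""
    return [
        AER_CORRECTABLE.get(bit, f"bit{bit}")
        for bit in range(32)
        if (value >> bit) & 1
    ]


if __name__ == "__main__":
    print("status:", decode_aer_correctable(0x00001000))
    print("mask:  ", decode_aer_correctable(0x00006000))
```

The status word decodes to the single "Timeout" bit, matching the "[12] Timeout" line in the log, while the mask covers bits 13 and 14 only, so the timeout was not masked.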

lspci -nn:

00:00.0 Host bridge [0600]: Advanced Micro Devices, Inc. [AMD] Family 17h (Models 00h-0fh) Root Complex [1022:1450]
00:00.2 IOMMU [0806]: Advanced Micro Devices, Inc. [AMD] Family 17h (Models 00h-0fh) I/O Memory Management Unit [1022:1451]
00:01.0 Host bridge [0600]: Advanced Micro Devices, Inc. [AMD] Family 17h (Models 00h-1fh) PCIe Dummy Host Bridge [1022:1452]
00:01.1 PCI bridge [0604]: Advanced Micro Devices, Inc. [AMD] Family 17h (Models 00h-0fh) PCIe GPP Bridge [1022:1453]
00:01.3 PCI bridge [0604]: Advanced Micro Devices, Inc. [AMD] Family 17h (Models 00h-0fh) PCIe GPP Bridge [1022:1453]
00:02.0 Host bridge [0600]: Advanced Micro Devices, Inc. [AMD] Family 17h (Models 00h-1fh) PCIe Dummy Host Bridge [1022:1452]
00:03.0 Host bridge [0600]: Advanced Micro Devices, Inc. [AMD] Family 17h (Models 00h-1fh) PCIe Dummy Host Bridge [1022:1452]
00:03.1 PCI bridge [0604]: Advanced Micro Devices, Inc. [AMD] Family 17h (Models 00h-0fh) PCIe GPP Bridge [1022:1453]
00:04.0 Host bridge [0600]: Advanced Micro Devices, Inc. [AMD] Family 17h (Models 00h-1fh) PCIe Dummy Host Bridge [1022:1452]
00:07.0 Host bridge [0600]: Advanced Micro Devices, Inc. [AMD] Family 17h (Models 00h-1fh) PCIe Dummy Host Bridge [1022:1452]
00:07.1 PCI bridge [0604]: Advanced Micro Devices, Inc. [AMD] Family 17h (Models 00h-0fh) Internal PCIe GPP Bridge 0 to Bus B [1022:1454]
00:08.0 Host bridge [0600]: Advanced Micro Devices, Inc. [AMD] Family 17h (Models 00h-1fh) PCIe Dummy Host Bridge [1022:1452]
00:08.1 PCI bridge [0604]: Advanced Micro Devices, Inc. [AMD] Family 17h (Models 00h-0fh) Internal PCIe GPP Bridge 0 to Bus B [1022:1454]
00:14.0 SMBus [0c05]: Advanced Micro Devices, Inc. [AMD] FCH SMBus Controller [1022:790b] (rev 59)
00:14.3 ISA bridge [0601]: Advanced Micro Devices, Inc. [AMD] FCH LPC Bridge [1022:790e] (rev 51)
00:18.0 Host bridge [0600]: Advanced Micro Devices, Inc. [AMD] Family 17h (Models 00h-0fh) Data Fabric: Device 18h; Function 0 [1022:1460]
00:18.1 Host bridge [0600]: Advanced Micro Devices, Inc. [AMD] Family 17h (Models 00h-0fh) Data Fabric: Device 18h; Function 1 [1022:1461]
00:18.2 Host bridge [0600]: Advanced Micro Devices, Inc. [AMD] Family 17h (Models 00h-0fh) Data Fabric: Device 18h; Function 2 [1022:1462]
00:18.3 Host bridge [0600]: Advanced Micro Devices, Inc. [AMD] Family 17h (Models 00h-0fh) Data Fabric: Device 18h; Function 3 [1022:1463]
00:18.4 Host bridge [0600]: Advanced Micro Devices, Inc. [AMD] Family 17h (Models 00h-0fh) Data Fabric: Device 18h; Function 4 [1022:1464]
00:18.5 Host bridge [0600]: Advanced Micro Devices, Inc. [AMD] Family 17h (Models 00h-0fh) Data Fabric: Device 18h; Function 5 [1022:1465]
00:18.6 Host bridge [0600]: Advanced Micro Devices, Inc. [AMD] Family 17h (Models 00h-0fh) Data Fabric: Device 18h; Function 6 [1022:1466]
00:18.7 Host bridge [0600]: Advanced Micro Devices, Inc. [AMD] Family 17h (Models 00h-0fh) Data Fabric: Device 18h; Function 7 [1022:1467]
01:00.0 Non-Volatile memory controller [0108]: Samsung Electronics Co Ltd NVMe SSD Controller SM981/PM981/PM983 [144d:a808]
02:00.0 USB controller [0c03]: Advanced Micro Devices, Inc. [AMD] X370 Series Chipset USB 3.1 xHCI Controller [1022:43b9] (rev 02)
02:00.1 SATA controller [0106]: Advanced Micro Devices, Inc. [AMD] X370 Series Chipset SATA Controller [1022:43b5] (rev 02)
02:00.2 PCI bridge [0604]: Advanced Micro Devices, Inc. [AMD] X370 Series Chipset PCIe Upstream Port [1022:43b0] (rev 02)
03:00.0 PCI bridge [0604]: Advanced Micro Devices, Inc. [AMD] 300 Series Chipset PCIe Port [1022:43b4] (rev 02)
03:02.0 PCI bridge [0604]: Advanced Micro Devices, Inc. [AMD] 300 Series Chipset PCIe Port [1022:43b4] (rev 02)
03:03.0 PCI bridge [0604]: Advanced Micro Devices, Inc. [AMD] 300 Series Chipset PCIe Port [1022:43b4] (rev 02)
03:04.0 PCI bridge [0604]: Advanced Micro Devices, Inc. [AMD] 300 Series Chipset PCIe Port [1022:43b4] (rev 02)
03:05.0 PCI bridge [0604]: Advanced Micro Devices, Inc. [AMD] 300 Series Chipset PCIe Port [1022:43b4] (rev 02)
03:06.0 PCI bridge [0604]: Advanced Micro Devices, Inc. [AMD] 300 Series Chipset PCIe Port [1022:43b4] (rev 02)
03:07.0 PCI bridge [0604]: Advanced Micro Devices, Inc. [AMD] 300 Series Chipset PCIe Port [1022:43b4] (rev 02)
04:00.0 USB controller [0c03]: ASMedia Technology Inc. ASM1143 USB 3.1 Host Controller [1b21:1343]
05:00.0 Ethernet controller [0200]: Intel Corporation I211 Gigabit Network Connection [8086:1539] (rev 03)
0b:00.0 VGA compatible controller [0300]: Advanced Micro Devices, Inc. [AMD/ATI] Ellesmere [Radeon RX 470/480/570/570X/580/580X/590] [1002:67df] (rev e7)
0b:00.1 Audio device [0403]: Advanced Micro Devices, Inc. [AMD/ATI] Ellesmere HDMI Audio [Radeon RX 470/480 / 570/580/590] [1002:aaf0]
0c:00.0 Non-Essential Instrumentation [1300]: Advanced Micro Devices, Inc. [AMD] Zeppelin/Raven/Raven2 PCIe Dummy Function [1022:145a]
0c:00.2 Encryption controller [1080]: Advanced Micro Devices, Inc. [AMD] Family 17h (Models 00h-0fh) Platform Security Processor [1022:1456]
0c:00.3 USB controller [0c03]: Advanced Micro Devices, Inc. [AMD] Family 17h (Models 00h-0fh) USB 3.0 Host Controller [1022:145c]
0d:00.0 Non-Essential Instrumentation [1300]: Advanced Micro Devices, Inc. [AMD] Zeppelin/Renoir PCIe Dummy Function [1022:1455]
0d:00.2 SATA controller [0106]: Advanced Micro Devices, Inc. [AMD] FCH SATA Controller [AHCI mode] [1022:7901] (rev 51)
0d:00.3 Audio device [0403]: Advanced Micro Devices, Inc. [AMD] Family 17h (Models 00h-0fh) HD Audio Controller [1022:1457]
Comment 13 raul 2020-10-13 21:50:25 UTC
(In reply to raulvior.bcn from comment #12)
As I understand it, the error came from the link to the 300 Series Chipset, from which some USB ports hang. It might be related to a problematic boot after clearing the CMOS. Linux printed initialization errors for USB ports; I don't recall whether it was a busy device or not. It was booting really slowly, with no keyboard or mouse available from the UEFI screen onwards. I rebooted the computer and didn't give it much importance. This motherboard has known cold bug problems and I thought it was just another bug.

lspci -t output:

-[0000:00]-+-00.0
           +-00.2
           +-01.0
           +-01.1-[01]----00.0
           +-01.3-[02-0a]--+-00.0
           |               +-00.1
           |               \-00.2-[03-0a]--+-00.0-[04]----00.0
           |                               +-02.0-[05]----00.0
           |                               +-03.0-[06]--
           |                               +-04.0-[07]--
           |                               +-05.0-[08]--
           |                               +-06.0-[09]--
           |                               \-07.0-[0a]--
           +-02.0
           +-03.0
           +-03.1-[0b]--+-00.0
           |            \-00.1
           +-04.0
           +-07.0
           +-07.1-[0c]--+-00.0
           |            +-00.2
           |            \-00.3
           +-08.0
           +-08.1-[0d]--+-00.0
           |            +-00.2
           |            \-00.3
           +-14.0
           +-14.3
           +-18.0
           +-18.1
           +-18.2
           +-18.3
           +-18.4
           +-18.5
           +-18.6
           \-18.7


Tested CPUs had the following date codes: 1718PGS, 1743SUS and the current one, 1852PGS. The first and second ones were tested on the current motherboard and on a Gigabyte X470 Aorus Ultra Gaming, which was having freezes and reboots too. The rest of the hardware (disks, GPUs, PSU) remains the same. Just in case, I tested with a different PSU too, and the system still had lockups and reboots. The OS installation is the same, Ubuntu with timely upgrades. The OS installation and the hardware have been used in two previous machines: an Intel Conroe and an AMD Bulldozer.

Successive RMAs showed fewer errors and a greater time interval between each one. Lowering the RAM frequency increases the time interval too. All the errors come from hardware integrated in the CPU and occur while executing intermittent load spikes.

Maybe this bug can be closed if the new evidence points to faulty hardware. I believe it is, and I tend to think it's the SoC being fed bad power due to low quality silicon.

PS: didn't post the time reference of the AER error, just to give an idea of the message frequency in my case: [114629.467280] pcieport 0000:00:01.3: AER: Corrected error received: 0000:00:00.0. The NMI was at [41717.839567]. It takes hours to print an error, if it ever does.
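
To put a number on that gap, here is a minimal sketch that extracts the seconds-since-boot stamps from the two messages quoted above and computes the time between them. The helper name `boot_seconds` is hypothetical and only for illustration.

```python
import re

# dmesg prefixes each line with seconds since boot in square brackets.
STAMP = re.compile(r"\[\s*(\d+\.\d+)\]")

def boot_seconds(line: str) -> float:
    """Return the seconds-since-boot timestamp found in a dmesg line."""
    match = STAMP.search(line)
    if match is None:
        raise ValueError(f"no dmesg timestamp in: {line!r}")
    return float(match.group(1))

nmi = boot_seconds("[41717.839567]")
aer = boot_seconds(
    "[114629.467280] pcieport 0000:00:01.3: AER: "
    "Corrected error received: 0000:00:00.0"
)

gap_hours = (aer - nmi) / 3600
print(f"{gap_hours:.2f} hours between the NMI and the AER message")
```

With the two stamps from this comment the gap comes out at roughly 20 hours.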
Comment 14 raul 2020-10-13 21:53:11 UTC
(In reply to raulvior.bcn from comment #13)
> and greater time interval between each
> one. Lowering the RAM frequency increases the time interval too.

Sorry, this is not true. I have recorded some short sessions (minutes) with all processors.
Comment 15 raul 2020-10-17 15:20:19 UTC
Created attachment 293045 [details]
10-day dmesg output.

I got the same MCE again (15/10/2020):
[    0.150233] mce: [Hardware Error]: Machine check events logged
[    0.150233] mce: [Hardware Error]: CPU 15: Machine Check: 0 Bank 5: bea0000000000108
[    0.150233] mce: [Hardware Error]: TSC 0 ADDR 1ffff83143c10 MISC d012000100000000 SYND 4d000000 IPID 500b000000000 
[    0.150233] mce: [Hardware Error]: PROCESSOR 2:800f11 TIME 1602772969 SOCKET 0 APIC f microcode 8001138

Sometimes I get this MCE, and sometimes it is the kernel that catches the CPU stall. Sometimes nothing is logged at all (but with the REISUB SysRq sequence I can see the lockup message). I've attached the output of the command journalctl -k -b all. You'll see the lockups and this MCE since October 7th. Previous MCEs do not appear.
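
For reference, the status word and TIME field of the MCE quoted above can be unpacked with a few lines. This is only a sketch: it assumes the architectural x86 MCi_STATUS layout for the top flag bits (63 down to 57) and bits 15:0 for the MCA error code, and it does not try to interpret the AMD-specific fields or what bank 5 means on this CPU. The function name `decode_status` is made up for the example.

```python
from datetime import datetime, timezone

# Architectural MCi_STATUS flag bits (bit position -> name).
STATUS_BITS = {
    63: "VAL",    # register holds a valid error
    62: "OVER",   # a previous error was overwritten
    61: "UC",     # uncorrected error
    60: "EN",     # error reporting was enabled
    59: "MISCV",  # MISC register contents are valid
    58: "ADDRV",  # ADDR register contents are valid
    57: "PCC",    # processor context corrupt
}

def decode_status(status: int) -> dict:
    """Split an MCi_STATUS value into flag names and the 16-bit error code."""
    flags = [
        name
        for bit, name in sorted(STATUS_BITS.items(), reverse=True)
        if (status >> bit) & 1
    ]
    return {"flags": flags, "error_code": status & 0xFFFF}

decoded = decode_status(0xBEA0000000000108)
# TIME is seconds since the Unix epoch, as recorded by the kernel.
logged_at = datetime.fromtimestamp(1602772969, tz=timezone.utc)

print(decoded["flags"], hex(decoded["error_code"]), logged_at.isoformat())
```

Under those assumptions the word decodes as a valid, uncorrected error with both ADDR and MISC valid and the processor context flagged as corrupt, which is consistent with the machine resetting rather than carrying on.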

I'm starting to think it's the motherboards that can't provide stable power. Faulty design/specs? The Crosshair VI Hero has been tested with two PSUs: a Corsair TX750v2 and a Corsair TX750M from 2019. The board crashes with both. The Gigabyte Aorus Ultra Gaming was only used with the TX750v2. I no longer have that motherboard.

I have configured the C6H with LLC4 on CPU and SoC, disabled spread spectrum and disabled XHCI hand-off and USB legacy. Increasing VRM switching frequency didn't work, nor LLC3 on CPU.

And I think this bug can be closed, as it's not related to Linux at all. It's a hardware problem. But I can only close it as RESOLVED, not DOCUMENTED as Borislav did.
Comment 16 raul 2020-10-17 15:25:25 UTC
>>I have configured the C6H with LLC4 on CPU and SoC, disabled spread spectrum
>>and disabled XHCI hand-off and USB legacy. Increasing VRM switching frequency
>>didn't work, nor LLC3 on CPU.

That configuration is from today. The dmesg I uploaded was with UEFI defaults.
Comment 17 raul 2020-10-23 14:23:32 UTC
Created attachment 293149 [details]
dmesg with GPU and SATA crash on an ASUS C7H.

Okay, so I'm testing a Crosshair VII now. It crashed. Then I added to the kernel cmdline the parameters "pcie_aspm=off nvme_core.default_ps_max_latency_us=0".
Full cmdline: BOOT_IMAGE=/vmlinuz-5.4.0-52-generic root=UUID=blah-blah ro iommu=pt pcie_aspm=off nvme_core.default_ps_max_latency_us=0 crashkernel=512M-:192M
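
To double-check that the parameters really reached the kernel, the command line can be split into key/value pairs. A small sketch, using the cmdline quoted above as input; `parse_cmdline` is a hypothetical helper, and on a live system the string would come from /proc/cmdline instead.

```python
def parse_cmdline(cmdline: str) -> dict:
    """Split a kernel command line into {name: value}; bare flags map to True."""
    params = {}
    for token in cmdline.split():
        # Split on the first '=' only, so values such as UUID=... stay intact.
        key, sep, value = token.partition("=")
        params[key] = value if sep else True
    return params

cmdline = (
    "BOOT_IMAGE=/vmlinuz-5.4.0-52-generic root=UUID=blah-blah ro iommu=pt "
    "pcie_aspm=off nvme_core.default_ps_max_latency_us=0 "
    "crashkernel=512M-:192M"
)
params = parse_cmdline(cmdline)

for name in ("pcie_aspm", "nvme_core.default_ps_max_latency_us"):
    print(name, "=", params.get(name, "<not set>"))
```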

The system crashed again. This time I pressed sysrq-l to dump a backtrace. journalctl shows these errors from amdgpu and the SATA port:
d’oct. 23 15:43:37 host kernel: [drm:amdgpu_dm_commit_planes.constprop.0 [amdgpu]] *ERROR* Waiting for fences timed out!
d’oct. 23 15:43:37 host kernel: [drm:drm_atomic_helper_wait_for_flip_done [drm_kms_helper]] *ERROR* [CRTC:47:crtc-0] flip_done timed out
d’oct. 23 15:44:19 host kernel: rcu: INFO: rcu_sched detected stalls on CPUs/tasks:
d’oct. 23 15:44:19 host kernel: rcu:         12-...!: (16 ticks this GP) idle=400/0/0x0 softirq=214135/214137 fqs=1 
d’oct. 23 15:44:19 host kernel:         (detected by 9, t=15018 jiffies, g=879297, q=51)
d’oct. 23 15:44:19 host kernel: Sending NMI from CPU 9 to CPUs 12:
d’oct. 23 15:44:19 host kernel: rcu: rcu_sched kthread starved for 17417 jiffies! g879297 f0x0 RCU_GP_WAIT_FQS(5) ->state=0x402 ->cpu=5
d’oct. 23 15:44:19 host kernel: rcu: RCU grace-period kthread stack dump:
d’oct. 23 15:44:19 host kernel: rcu_sched       I    0    11      2 0x80004000
d’oct. 23 15:44:19 host kernel: Call Trace:
d’oct. 23 15:44:19 host kernel:  __schedule+0x2e3/0x740
d’oct. 23 15:44:19 host kernel:  ? __internal_add_timer+0x2d/0x40
d’oct. 23 15:44:19 host kernel:  schedule+0x42/0xb0
d’oct. 23 15:44:19 host kernel:  schedule_timeout+0x8a/0x160
d’oct. 23 15:44:19 host kernel:  ? __next_timer_interrupt+0xe0/0xe0
d’oct. 23 15:44:19 host kernel:  rcu_gp_kthread+0x48d/0x990
d’oct. 23 15:44:19 host kernel:  kthread+0x104/0x140
d’oct. 23 15:44:19 host kernel:  ? kfree_call_rcu+0x20/0x20
d’oct. 23 15:44:19 host kernel:  ? kthread_park+0x90/0x90
d’oct. 23 15:44:19 host kernel:  ret_from_fork+0x22/0x40
d’oct. 23 15:44:19 host kernel: [drm] Fence fallback timer expired on ring sdma0
d’oct. 23 15:44:45 host kernel: ata1: SATA link up 6.0 Gbps (SStatus 133 SControl 300)
d’oct. 23 15:44:50 host kernel: ata1.00: qc timeout (cmd 0xec)
d’oct. 23 15:44:50 host kernel: ata1.00: failed to IDENTIFY (I/O error, err_mask=0x4)
d’oct. 23 15:44:50 host kernel: ata1.00: revalidation failed (errno=-5)
d’oct. 23 15:44:51 host kernel: ata1: SATA link up 6.0 Gbps (SStatus 133 SControl 300)
d’oct. 23 15:45:01 host kernel: ata1.00: qc timeout (cmd 0xec)
d’oct. 23 15:45:01 host kernel: ata1.00: failed to IDENTIFY (I/O error, err_mask=0x4)
d’oct. 23 15:45:01 host kernel: ata1.00: revalidation failed (errno=-5)
d’oct. 23 15:45:01 host kernel: ata1: limiting SATA link speed to 3.0 Gbps
d’oct. 23 15:45:01 host kernel: ata1: SATA link up 3.0 Gbps (SStatus 123 SControl 320)
d’oct. 23 15:45:32 host kernel: ata1.00: qc timeout (cmd 0xec)
d’oct. 23 15:45:32 host kernel: ata1.00: failed to IDENTIFY (I/O error, err_mask=0x4)
d’oct. 23 15:45:32 host kernel: ata1.00: revalidation failed (errno=-5)
d’oct. 23 15:45:32 host kernel: ata1.00: disabled

It also shows this message with a stack trace:
INFO: task pool-gnome-shel:37969 blocked for more than 120 seconds.
d’oct. 23 15:46:03 host kernel:       Tainted: G           OE     5.4.0-52-generic #57-Ubuntu
d’oct. 23 15:46:03 host kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
d’oct. 23 15:46:03 host kernel: pool-gnome-shel D    0 37969   2643 0x00000080

The task is blocked reading asus WMI sensors (asus_wmi_sensors module).

The backtrace printed by sysrq-l:
d’oct. 23 15:54:38 host kernel: sd 0:0:0:0: [sda] tag#24 FAILED Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK
d’oct. 23 15:54:38 host kernel: sd 0:0:0:0: [sda] tag#24 CDB: ATA command pass through(16) 85 06 20 00 00 00 00 00 00 00 00 00 00 00 e5 00
d’oct. 23 15:58:12 host kernel: sysrq: Show backtrace of all active CPUs
d’oct. 23 15:58:12 host kernel: NMI backtrace for cpu 6
d’oct. 23 15:58:12 host kernel: CPU: 6 PID: 0 Comm: swapper/6 Kdump: loaded Tainted: G           OE     5.4.0-52-generic #57-Ubuntu
d’oct. 23 15:58:12 host kernel: Hardware name: System manufacturer System Product Name/ROG CROSSHAIR VII HERO, BIOS 3004 12/16/2019
d’oct. 23 15:58:12 host kernel: Call Trace:
d’oct. 23 15:58:12 host kernel:  <IRQ>
d’oct. 23 15:58:12 host kernel:  dump_stack+0x6d/0x9a
d’oct. 23 15:58:12 host kernel:  ? lapic_can_unplug_cpu.cold+0x40/0x40
d’oct. 23 15:58:12 host kernel:  nmi_cpu_backtrace.cold+0x14/0x53


I still have the Crosshair VI... I can return the VII. Seems like I'm dealing with faulty hardware. The GPU was running nicely in Bulldozer and Conroe systems with the same OS install. And the NVMe was used in the Bulldozer system without problems. On the SATA port there's an HDD being used as backing storage for the NVMe, which is used by bcache.
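
As a side note on the RCU messages above, the jiffies figures can be turned into seconds. This is a rough sketch that assumes HZ=250, which is a guess for this Ubuntu kernel and not something taken from the log; the real value is CONFIG_HZ in the kernel config.

```python
HZ = 250  # assumed tick rate; check CONFIG_HZ for the running kernel

def jiffies_to_seconds(jiffies: int, hz: int = HZ) -> float:
    """Convert a jiffies count from an RCU stall report into seconds."""
    return jiffies / hz

stall = jiffies_to_seconds(15018)    # "detected by 9, t=15018 jiffies"
starved = jiffies_to_seconds(17417)  # "kthread starved for 17417 jiffies"

print(f"stall detected after {stall:.1f} s, kthread starved for {starved:.1f} s")
```

Under that assumption CPU 12 had been unresponsive for about a minute when the stall was reported.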
Comment 18 raul 2020-10-23 14:57:01 UTC
Created attachment 293151 [details]
PCI bus tree of the C7H.

Ah... the NVMe SSD is sharing the bus with the GPU. Maybe there's some incompatibility between the NVMe SSD (Samsung 970 EVO) and the ASUS UEFI. The SSD has the latest firmware version (2B2QEXE7). And I already disabled all power saving states. Is there anything more to disable?


NVME Identify Controller:
vid       : 0x144d
ssvid     : 0x144d
sn        : xxxxxxxxxxxxxxxxxxxxx
mn        : Samsung SSD 970 EVO 500GB               
fr        : 2B2QEXE7
rab       : 2
ieee      : 002538
cmic      : 0
mdts      : 9
cntlid    : 0x4
ver       : 0x10300
rtd3r     : 0x30d40
rtd3e     : 0x7a1200
oaes      : 0
ctratt    : 0
rrls      : 0
crdt1     : 0
crdt2     : 0
crdt3     : 0
oacs      : 0x17
acl       : 7
aerl      : 3
frmw      : 0x16
lpa       : 0x3
elpe      : 63
npss      : 4
avscc     : 0x1
apsta     : 0x1
wctemp    : 358
cctemp    : 358
mtfa      : 0
hmpre     : 0
hmmin     : 0
tnvmcap   : 500107862016
unvmcap   : 0
rpmbs     : 0
edstt     : 35
dsto      : 0
fwug      : 0
kas       : 0
hctma     : 0x1
mntmt     : 356
mxtmt     : 358
sanicap   : 0
hmminds   : 0
hmmaxd    : 0
nsetidmax : 0
anatt     : 0
anacap    : 0
anagrpmax : 0
nanagrpid : 0
sqes      : 0x66
cqes      : 0x44
maxcmd    : 0
nn        : 1
oncs      : 0x5f
fuses     : 0
fna       : 0x5
vwc       : 0x1
awun      : 1023
awupf     : 0
nvscc     : 1
nwpc      : 0
acwu      : 0
sgls      : 0
mnan      : 0
subnqn    : 
ioccsz    : 0
iorcsz    : 0
icdoff    : 0
ctrattr   : 0
msdbd     : 0
ps    0 : mp:6.20W operational enlat:0 exlat:0 rrt:0 rrl:0
          rwt:0 rwl:0 idle_power:- active_power:-
ps    1 : mp:4.30W operational enlat:0 exlat:0 rrt:1 rrl:1
          rwt:1 rwl:1 idle_power:- active_power:-
ps    2 : mp:2.10W operational enlat:0 exlat:0 rrt:2 rrl:2
          rwt:2 rwl:2 idle_power:- active_power:-
ps    3 : mp:0.0400W non-operational enlat:210 exlat:1200 rrt:3 rrl:3
          rwt:3 rwl:3 idle_power:- active_power:-
ps    4 : mp:0.0050W non-operational enlat:2000 exlat:8000 rrt:4 rrl:4
          rwt:4 rwl:4 idle_power:- active_power:-
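
To see what nvme_core.default_ps_max_latency_us actually changes for this drive, here is a sketch that picks the power states eligible for autonomous transitions out of the table above. It assumes the rule, as I understand the driver, that only non-operational states whose entry plus exit latency fits within the limit are used, and that the driver default limit is 100000 us. `idle_states` is a made-up name.

```python
import re

# One "ps N :" line per power state; the wrapped continuation lines are ignored.
PS_LINE = re.compile(
    r"ps\s+(\d+)\s*:\s*mp:\S+\s+(non-operational|operational)"
    r"\s+enlat:(\d+)\s+exlat:(\d+)"
)

ID_CTRL = """\
ps    0 : mp:6.20W operational enlat:0 exlat:0 rrt:0 rrl:0
ps    1 : mp:4.30W operational enlat:0 exlat:0 rrt:1 rrl:1
ps    2 : mp:2.10W operational enlat:0 exlat:0 rrt:2 rrl:2
ps    3 : mp:0.0400W non-operational enlat:210 exlat:1200 rrt:3 rrl:3
ps    4 : mp:0.0050W non-operational enlat:2000 exlat:8000 rrt:4 rrl:4
"""

def idle_states(text: str, max_latency_us: int) -> list:
    """Non-operational states with enlat + exlat (in us) within the limit."""
    eligible = []
    for m in PS_LINE.finditer(text):
        state, kind = int(m[1]), m[2]
        total_latency = int(m[3]) + int(m[4])
        if kind == "non-operational" and total_latency <= max_latency_us:
            eligible.append(state)
    return eligible

for limit in (100000, 5000, 0):
    print(limit, idle_states(ID_CTRL, limit))
```

With the limit at 0 no state qualifies, so the drive should stay in its operational states, which is the intent of the parameter used in comment #17.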
Comment 19 raul 2020-10-24 01:31:57 UTC
On two different motherboards, errors are being reported by devices on the bus shared with the SSD. The PCIe controller sits in the northbridge, which is part of the SoC, that is, the CPU. Huh... mainboard power delivery affecting it? Broken CPU?

On the C7H I tested the kernel parameters "pcie_aspm=off nvme_core.default_ps_max_latency_us=0 noapic acpi=strict pci=nocrs" but it didn't work. I should try a manual CPU ratio, either 36.00 or the auto overclock (TPU I), which sets a ratio of 38.50. Doing that disables Core C6. With that setting the C6H was stable 24/7 for 14 consecutive days, at the cost of losing power saving features.
Comment 20 raul 2020-10-24 02:16:52 UTC
Created attachment 293165 [details]
dmesg with call traces, RIP smp_call_function_many

I got another crash. Default UEFI values, latest UEFI version, without special kernel parameters. With this board I'm getting more crashes. The call traces are stuck at smp_call_function_many, described in a post from a different thread (https://bugzilla.kernel.org/show_bug.cgi?id=196683#c71).
Comment 21 raul 2020-10-25 10:52:46 UTC
Created attachment 293181 [details]
kernel ring buffer showing an MCE on bank 5. It's always the same.

The system keeps throwing this MCE error, no matter what motherboard.
Default UEFI settings. No special kernel parameters. This time I'm using Linux 5.8 mainline from the Ubuntu mainline kernel repositories. But it still locks up or throws this MCE error.

Oct 25 10:21:09 host kernel: [    0.132030] mce: [Hardware Error]: Machine check events logged
Oct 25 10:21:09 host kernel: [    0.132036] mce: [Hardware Error]: CPU 7: Machine Check: 0 Bank 5: bea0000000000108
Oct 25 10:21:09 host kernel: [    0.132038] mce: [Hardware Error]: TSC 0 ADDR 1ffff9a2f5c7e MISC d012000100000000 SYND 4d000000 IPID 500b000000000 
Oct 25 10:21:09 host kernel: [    0.132041] mce: [Hardware Error]: PROCESSOR 2:800f11 TIME 1603621263 SOCKET 0 APIC e microcode 8001137
Comment 22 raul 2020-10-25 11:11:29 UTC
Ah, this time it is with UEFI version 2203 of the C7H, the last one before support for the Ryzen 3000 series was added.

(In reply to raul from comment #21)
> Created attachment 293181 [details]
> kernel ring buffer showing an MCE on bank 5. It's always the same.
Comment 23 raul 2020-10-25 19:33:59 UTC
Created attachment 293187 [details]
dmesg plus lspci -vnnt output.

I got another crash. This time the PCIe bus detected errors coming from the SSD, the GPU and the bus connected to the network, USB and SATA ports. Due to these errors, the GPU timed out, like in file https://bugzilla.kernel.org/attachment.cgi?id=293149&action=edit

The board is another C7H I could test. Nothing else changed.
Comment 24 raul 2020-11-01 00:52:33 UTC
Created attachment 293339 [details]
dmesg with soft lockup after xHCI controller assumed dead.

Another soft lockup. This time after the xHCI host controller stopped responding. Crosshair VII Hero.

>> de nov. 01 01:35:48 kiwi64 kernel: xhci_hcd 0000:0b:00.3: xHCI host
>> controller not responding, assume dead
>> de nov. 01 01:35:48 kiwi64 kernel: xhci_hcd 0000:0b:00.3: HC died; cleaning
>> up
>> de nov. 01 01:35:48 kiwi64 kernel: usb 5-4: dvb_usb_v2: usb_bulk_msg()
>> failed=-110
>> de nov. 01 01:35:48 kiwi64 kernel: usb 5-4: dvb_usb_v2: rc.query()
>> failed=-110
>> de nov. 01 01:35:48 kiwi64 kernel: usb 5-4: USB disconnect, device number 3
>> de nov. 01 01:36:14 kiwi64 kernel: watchdog: BUG: soft lockup - CPU#0 stuck
>> for 22s! [pool-gnome-shel:38695]

This soft lockup happened with manual CPU ratio set to 36.00, but with C6 enabled via zenstates.py script.
Comment 25 raul 2020-11-02 09:12:26 UTC
Created attachment 293377 [details]
MSI X470 GAMING PRO

Running the same system on an MSI X470 Gaming Pro motherboard, a soft lockup occurred.

>> de nov. 01 21:27:06 host kernel: usb 2-4: USB disconnect, device number 2
>> de nov. 02 07:15:24 host kernel: rcu: INFO: rcu_sched detected stalls on
>> CPUs/tasks:
>> de nov. 02 07:15:24 host kernel:         (detected by 13, t=15059 jiffies,
>> g=8955149, q=441)
>> de nov. 02 07:15:24 host kernel: rcu: All QSes seen, last rcu_sched kthread
>> activity 15057 (4309295609-4309280552), jiffies_till_next_fqs=1, root
>> ->qsmask 0x0
>> de nov. 02 07:15:24 host kernel: rcu: rcu_sched kthread starved for 15057
>> jiffies! g8955149 f0x2 RCU_GP_WAIT_FQS(5) ->state=0x200 ->cpu=4
>> de nov. 02 07:15:24 host kernel: rcu:         Unless rcu_sched kthread gets
>> sufficient CPU time, OOM is now expected behavior.
>> de nov. 02 07:15:24 host kernel: rcu: RCU grace-period kthread stack dump:
>> de nov. 02 07:15:24 host kernel: rcu_sched       R    0    11      2
>> 0x00004000
>> de nov. 02 07:15:24 host kernel: Call Trace:
>> de nov. 02 07:15:24 host kernel:  __schedule+0x212/0x5d0
>> de nov. 02 07:15:24 host kernel:  schedule+0x55/0xc0
>> de nov. 02 07:15:24 host kernel:  schedule_timeout+0x8b/0x160
>> de nov. 02 07:15:24 host kernel:  ? __next_timer_interrupt+0xe0/0xe0
>> de nov. 02 07:15:24 host kernel:  rcu_gp_fqs_loop+0xe9/0x2c0
>> de nov. 02 07:15:24 host kernel:  rcu_gp_kthread+0x8d/0xf0
>> de nov. 02 07:15:24 host kernel:  kthread+0x12f/0x150
>> de nov. 02 07:15:24 host kernel:  ? rcu_gp_init+0x470/0x470
>> de nov. 02 07:15:24 host kernel:  ? __kthread_bind_mask+0x70/0x70
>> de nov. 02 07:15:24 host kernel:  ret_from_fork+0x22/0x30
>> de nov. 02 07:17:27 host kernel: watchdog: BUG: soft lockup - CPU#15 stuck
>> for 22s! [(resolved):138309]
>> de nov. 02 07:17:27 host kernel: Modules linked in: uas usb_storage nct6775
>> hwmon_vid msr xt_CHECKSUM xt_MASQUERADE ip6table_mangle ip6table_nat
>> iptable_mangle iptable_nat nf_nat nf_tables nfnetlink bridge stp llc overlay
>> edac_mce_amd snd_hda_codec_realtek rc_it913x_v2 snd_hda_codec_generic
>> ledtrig_audio binfmt_misc snd_hda_codec_hdmi snd_hda_intel snd_intel_dspcfg
>> it913x snd_hda_codec af9033 kvm snd_hda_core nls_iso8859_1 snd_hwdep snd_pcm
>> crct10dif_pclmul ghash_clmulni_intel aesni_intel crypto_simd cryptd
>> snd_seq_midi glue_helper snd_seq_midi_event amdgpu rapl joydev snd_rawmidi
>> input_leds iommu_v2 gpu_sched snd_seq dvb_usb_af9035 ttm dvb_usb_v2 dvb_core
>> snd_seq_device drm_kms_helper snd_timer mc cec rc_core wmi_bmof i2c_algo_bit
>> snd fb_sys_fops syscopyarea sysfillrect k10temp sysimgblt efi_pstore
>> soundcore ccp mac_hid sch_fq_codel nf_log_ipv6 ip6t_REJECT nf_reject_ipv6
>> xt_hl ip6t_rt nf_log_ipv4 nf_log_common xt_LOG xt_multiport xt_addrtype
>> xt_recent ipt_SYNPROXY nf_synproxy_core xt_limit ipt_REJECT
>> de nov. 02 07:17:27 host kernel:  nf_reject_ipv4 xt_connlimit nf_conncount
>> xt_tcpmss xt_CT xt_tcpudp iptable_raw xt_conntrack nf_conntrack
>> nf_defrag_ipv6 nf_defrag_ipv4 libcrc32c ip6table_filter ip6_tables
>> iptable_filter parport_pc bpfilter ppdev lp parport drm ip_tables x_tables
>> autofs4 bcache crc64 hid_generic usbhid hid crc32_pclmul i2c_piix4 nvme
>> r8169 ahci xhci_pci realtek nvme_core libahci xhci_pci_renesas wmi
>> gpio_amdpt gpio_generic [last unloaded: cpuid]
>> de nov. 02 07:17:27 host kernel: CPU: 15 PID: 138309 Comm: (resolved) Kdump:
>> loaded Tainted: G           OE     5.8.16-050816-generic #202010170731
>> de nov. 02 07:17:27 host kernel: Hardware name: Micro-Star International
>> Co., Ltd. MS-7B79/X470 GAMING PRO MAX (MS-7B79), BIOS M.30 10/29/2019
>> de nov. 02 07:17:27 host kernel: RIP:
>> 0010:smp_call_function_many_cond+0x279/0x2c0
Comment 26 raul 2020-11-02 09:32:22 UTC
Created attachment 293379 [details]
lspci -vnnt msi x470 gaming plus, lsusb -tv
Comment 27 raul 2020-11-10 00:57:49 UTC
(In reply to raul from comment #26)

I enabled Typical power supply idle and the current uptime of the system is 7 days and 4h. So it seems that this motherboard has a working idle power setting, or a non-faulty power delivery circuit. CC6 and PC6 states are enabled. Let's see how long this can go. I'll have to buy a new power supply again and test the default (low) idle power setting.

> P0 - Enabled - FID = 90 - DID = 8 - VID = 20 - Ratio = 36.00 - vCore =
> 1.35000
> P1 - Enabled - FID = 80 - DID = 8 - VID = 2C - Ratio = 32.00 - vCore =
> 1.27500
> P2 - Enabled - FID = 84 - DID = C - VID = 68 - Ratio = 22.00 - vCore =
> 0.90000
> P3 - Disabled
> P4 - Disabled
> P5 - Disabled
> P6 - Disabled
> P7 - Disabled
> C6 State - Package - Disabled
> C6 State - Core - Enabled
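
The Ratio and vCore columns in this listing follow from the raw FID, DID and VID fields. A small sketch, assuming the usual Zen P-state encoding (ratio = FID / DID * 2 and vCore = 1.55 V minus 6.25 mV per VID step), which does reproduce every value printed above. The function names are only for illustration.

```python
def ratio(fid: int, did: int) -> float:
    """Core multiplier from the frequency ID and divisor ID fields."""
    return fid / did * 2

def vcore(vid: int) -> float:
    """Core voltage in volts from the voltage ID field."""
    return 1.55 - 0.00625 * vid

# (FID, DID, VID) as listed for P0..P2 on the 1800X.
pstates = [(0x90, 0x8, 0x20), (0x80, 0x8, 0x2C), (0x84, 0xC, 0x68)]

for index, (fid, did, vid) in enumerate(pstates):
    print(f"P{index}: ratio = {ratio(fid, did):.2f}, vCore = {vcore(vid):.5f}")
```

The base clock is not part of this calculation; the core frequency is the ratio times the reference clock, nominally 100 MHz.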
Comment 28 raul 2020-11-15 00:22:59 UTC
Created attachment 293671 [details]
Dmesg with typical power idle enabled on MSI X470 Gaming Pro Max.

5.8.16-050816-generic #202010170731
Typical power idle enabled.
The system is still up and running, but I got a warning in the kernel log. It happened while servicing an IRQ that arrived during cpuidle_enter_state.
>de nov. 12 15:05:42 host kernel: ------------[ cut here ]------------
>de nov. 12 15:05:42 host kernel: WARNING: CPU: 9 PID: 0 at
>drivers/gpu/drm/drm_vblank.c:349 drm_update_vblank_count+0x17d/0x240 [drm]
>de nov. 12 15:05:42 host kernel: Modules linked in: btrfs blake2b_generic xor
>raid6_pq ufs qnx4 hfsplus hfs minix ntfs msdos jfs xfs cpuid msr xt_CHECKSUM
>xt_MASQUERADE ip6table_mangle ip6table_nat iptable_mangle iptable_nat nf_nat
>nf_tables nfnetlink bridge stp llc overlay edac_mce_amd snd_hda_codec_realtek
>binfmt_misc snd_hda_codec_generic ledtrig_audio kvm snd_hda_codec_hdmi
>crct10dif_pclmul ghash_clmulni_intel snd_hda_intel snd_intel_dspcfg
>aesni_intel crypto_simd snd_hda_codec cryptd glue_helper amdgpu snd_hda_core
>rc_it913x_v2 snd_hwdep rapl snd_pcm iommu_v2 snd_seq_midi it913x
>snd_seq_midi_event gpu_sched nls_iso8859_1 af9033 snd_rawmidi ttm snd_seq
>dvb_usb_af9035 dvb_usb_v2 snd_seq_device drm_kms_helper snd_timer dvb_core mc
>cec snd rc_core i2c_algo_bit fb_sys_fops syscopyarea sysfillrect joydev
>input_leds efi_pstore wmi_bmof k10temp sysimgblt ccp soundcore nf_log_ipv6
>ip6t_REJECT nf_reject_ipv6 mac_hid sch_fq_codel xt_hl ip6t_rt nct6775
>hwmon_vid nf_log_ipv4 nf_log_common xt_LOG xt_multiport
>de nov. 12 15:05:42 host kernel:  xt_addrtype xt_recent ipt_SYNPROXY
>nf_synproxy_core xt_limit ipt_REJECT nf_reject_ipv4 xt_connlimit nf_conncount
>xt_tcpmss xt_CT xt_tcpudp iptable_raw xt_conntrack nf_conntrack nf_defrag_ipv6
>nf_defrag_ipv4 libcrc32c ip6table_filter ip6_tables parport_pc iptable_filter
>ppdev bpfilter lp parport drm ip_tables x_tables autofs4 bcache crc64
>hid_generic usbhid hid crc32_pclmul i2c_piix4 r8169 ahci nvme xhci_pci realtek
>libahci xhci_pci_renesas nvme_core wmi gpio_amdpt gpio_generic
>de nov. 12 15:05:42 host kernel: CPU: 9 PID: 0 Comm: swapper/9 Kdump: loaded
>Not tainted 5.8.16-050816-generic #202010170731
>de nov. 12 15:05:42 host kernel: Hardware name: Micro-Star International Co.,
>Ltd. MS-7B79/X470 GAMING PRO MAX (MS-7B79), BIOS M.30 10/29/2019
>de nov. 12 15:05:42 host kernel: RIP: 0010:drm_update_vblank_count+0x17d/0x240
>[drm]
>de nov. 12 15:05:42 host kernel: Code: 89 ea 45 89 f9 45 89 e0 48 c7 c6 c8 ca
>40 c0 bf 20 00 00 00 50 e8 e3 de ff ff 5a 45 85 e4 0f 85 76 ff ff ff 44 39 7b
>64 74 9f <0f> 0b eb 9b 48 8b 4d c8 eb 84 44 89 e1 44 89 ea bf 20 00 00 00 48
>de nov. 12 15:05:42 host kernel: RSP: 0018:ffffb83b803f4d50 EFLAGS: 00010006
>de nov. 12 15:05:42 host kernel: RAX: 0000000000000000 RBX: ffff8f12b09d1028
>RCX: 000000000300026c
>de nov. 12 15:05:42 host kernel: RDX: 0000000001000000 RSI: ffffffffc040cac8
>RDI: 0000000000000000
>de nov. 12 15:05:42 host kernel: RBP: ffffb83b803f4da0 R08: 0000000000000000
>R09: 0000000000000000
>de nov. 12 15:05:42 host kernel: R10: ffff8f11c57c0000 R11: 0000000000000000
>R12: 0000000000000000
>de nov. 12 15:05:42 host kernel: R13: 0000000000000000 R14: ffff8f12b57d3800
>R15: 0000000000000000
>de nov. 12 15:05:42 host kernel: FS:  0000000000000000(0000)
>GS:ffff8f12bea40000(0000) knlGS:0000000000000000
>de nov. 12 15:05:42 host kernel: CS:  0010 DS: 0000 ES: 0000 CR0:
>0000000080050033
>de nov. 12 15:05:42 host kernel: CR2: 00007f1f4ecba000 CR3: 00000007f4d84000
>CR4: 00000000003406e0
>de nov. 12 15:05:42 host kernel: Call Trace:
>de nov. 12 15:05:42 host kernel:  <IRQ>
>de nov. 12 15:05:42 host kernel:  drm_crtc_accurate_vblank_count+0x42/0xc0
>[drm]
>de nov. 12 15:05:42 host kernel:  dm_pflip_high_irq+0xdf/0x2b0 [amdgpu]
>de nov. 12 15:05:42 host kernel:  amdgpu_dm_irq_handler+0x7f/0x110 [amdgpu]
>de nov. 12 15:05:42 host kernel:  amdgpu_irq_dispatch+0xae/0x1e0 [amdgpu]
>de nov. 12 15:05:42 host kernel:  amdgpu_ih_process+0x8c/0x110 [amdgpu]
>de nov. 12 15:05:42 host kernel:  amdgpu_irq_handler+0x24/0xa0 [amdgpu]
>de nov. 12 15:05:42 host kernel:  __handle_irq_event_percpu+0x42/0x180
>de nov. 12 15:05:42 host kernel:  handle_irq_event+0x59/0xb1
>de nov. 12 15:05:42 host kernel:  handle_edge_irq+0x8c/0x220
>de nov. 12 15:05:42 host kernel:  asm_call_irq_on_stack+0x12/0x20
>de nov. 12 15:05:42 host kernel:  </IRQ>
>de nov. 12 15:05:42 host kernel:  common_interrupt+0xbc/0x150
>de nov. 12 15:05:42 host kernel:  asm_common_interrupt+0x1e/0x40
>de nov. 12 15:05:42 host kernel: RIP: 0010:cpuidle_enter_state+0xb7/0x3f0
>de nov. 12 15:05:42 host kernel: Code: 9f 4d 47 7e e8 ea ae 74 ff 48 89 45 d0
>0f 1f 44 00 00 31 ff e8 9a ba 74 ff 80 7d c7 00 0f 85 d3 01 00 00 fb 66 0f 1f
>44 00 00 <45> 85 e4 0f 88 df 01 00 00 49 63 d4 48 8d 04 52 48 8d 0c d5 00 00
>de nov. 12 15:05:42 host kernel: RSP: 0018:ffffb83b80167e48 EFLAGS: 00000246
>de nov. 12 15:05:42 host kernel: RAX: ffff8f12bea6c6c0 RBX: ffff8f12aef7ec00
>RCX: 000000000000001f
>de nov. 12 15:05:42 host kernel: RDX: 0000000000000000 RSI: 00000000238e3f5c
>RDI: 0000000000000000
>de nov. 12 15:05:42 host kernel: RBP: ffffb83b80167e88 R08: 0002fc77075f5ee8
>R09: 0000000000000400
>de nov. 12 15:05:42 host kernel: R10: 000000000000218d R11: ffff8f12bea6b364
>R12: 0000000000000002
>de nov. 12 15:05:42 host kernel: R13: ffffffff82d7e440 R14: 0000000000000002
>R15: 0000000000000000
>de nov. 12 15:05:42 host kernel:  ? cpuidle_enter_state+0xa6/0x3f0
>de nov. 12 15:05:42 host kernel:  cpuidle_enter+0x2e/0x40
>de nov. 12 15:05:42 host kernel:  cpuidle_idle_call+0x145/0x200
>de nov. 12 15:05:42 host kernel:  do_idle+0x7a/0xe0
>de nov. 12 15:05:42 host kernel:  cpu_startup_entry+0x20/0x30
>de nov. 12 15:05:42 host kernel:  start_secondary+0xe6/0x100
>de nov. 12 15:05:42 host kernel:  secondary_startup_64+0xb6/0xc0
>de nov. 12 15:05:42 host kernel: ---[ end trace 3913f260a1d14db9 ]---
Comment 29 raul 2020-11-15 00:39:55 UTC
(In reply to raul from comment #28)

Does not seem to be related. Someone reported the same warning but with an Intel Core i7 6700: https://bugzilla.redhat.com/show_bug.cgi?id=1734450
Comment 30 raul 2020-11-17 15:41:19 UTC
Created attachment 293703 [details]
2700X first boot, all defaults. MSI X470 Gaming Pro MAX.

Well, after 15 days of uptime, I decided to install a Ryzen 7 2700X I got. All defaults, BIOS cleared, no typical power supply idle turned on.

It's interesting to note that dmesg no longer shows these messages (grep 'Firmware Bug'):

> ACPI: [Firmware Bug]: BIOS _OSI(Linux) query ignored
> [Firmware Bug]: ACPI MWAIT C-state 0x0 not supported by HW (0x0)
> [Firmware Bug]: ACPI MWAIT C-state 0x0 not supported by HW (0x0)
> [Firmware Bug]: ACPI MWAIT C-state 0x0 not supported by HW (0x0)
> [Firmware Bug]: ACPI MWAIT C-state 0x0 not supported by HW (0x0)
> [Firmware Bug]: ACPI MWAIT C-state 0x0 not supported by HW (0x0)
> [Firmware Bug]: ACPI MWAIT C-state 0x0 not supported by HW (0x0)
> [Firmware Bug]: ACPI MWAIT C-state 0x0 not supported by HW (0x0)
> [Firmware Bug]: ACPI MWAIT C-state 0x0 not supported by HW (0x0)
> [Firmware Bug]: ACPI MWAIT C-state 0x0 not supported by HW (0x0)
> [Firmware Bug]: ACPI MWAIT C-state 0x0 not supported by HW (0x0)
> [Firmware Bug]: ACPI MWAIT C-state 0x0 not supported by HW (0x0)
> [Firmware Bug]: ACPI MWAIT C-state 0x0 not supported by HW (0x0)
> [Firmware Bug]: ACPI MWAIT C-state 0x0 not supported by HW (0x0)
> [Firmware Bug]: ACPI MWAIT C-state 0x0 not supported by HW (0x0)
> [Firmware Bug]: ACPI MWAIT C-state 0x0 not supported by HW (0x0)
> [Firmware Bug]: ACPI MWAIT C-state 0x0 not supported by HW (0x0)

Only this is found now:
> [Firmware Bug]: BIOS _OSI(Linux) query ignored

Same motherboard, same BIOS version, same memory, same GPU, etc.
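
Comparing two boots is easier with the messages tallied. A minimal sketch that counts each distinct "[Firmware Bug]" message in a dmesg dump; the sample text is rebuilt from the lines quoted above and `firmware_bug_counts` is a hypothetical helper.

```python
from collections import Counter

MARKER = "[Firmware Bug]:"

def firmware_bug_counts(dmesg_text: str) -> Counter:
    """Count each distinct message that follows the Firmware Bug marker."""
    counts = Counter()
    for line in dmesg_text.splitlines():
        if MARKER in line:
            counts[line.split(MARKER, 1)[1].strip()] += 1
    return counts

# The 1800X boot: one _OSI line plus sixteen MWAIT lines.
sample = "\n".join(
    ["ACPI: [Firmware Bug]: BIOS _OSI(Linux) query ignored"]
    + ["[Firmware Bug]: ACPI MWAIT C-state 0x0 not supported by HW (0x0)"] * 16
)

for message, count in firmware_bug_counts(sample).items():
    print(f"{count:3d}  {message}")
```

The MWAIT message appears 16 times in the 1800X log, which matches the number of logical CPUs, so it is presumably printed once per CPU.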
Comment 31 raul 2020-11-24 18:04:55 UTC
Created attachment 293805 [details]
7 day dmesg with 2700X and MSI X470 Gaming Pro Max.

After 7 days with the 2700X, still no crash.
Comment 32 raul 2020-11-24 19:47:49 UTC
(In reply to raul from comment #31)


ZenStates.py output for the 2700X:

> P0 - Enabled - FID = 94 - DID = 8 - VID = 36 - Ratio = 37.00 - vCore =
> 1.21250
> P1 - Enabled - FID = 80 - DID = 8 - VID = 59 - Ratio = 32.00 - vCore =
> 0.99375
> P2 - Enabled - FID = 84 - DID = C - VID = 76 - Ratio = 22.00 - vCore =
> 0.81250
> P3 - Disabled
> P4 - Disabled
> P5 - Disabled
> P6 - Disabled
> P7 - Disabled
> C6 State - Package - Enabled
> C6 State - Core - Enabled
Comment 33 raul 2020-12-04 15:24:48 UTC
Created attachment 293931 [details]
Ryzen 7 2700X 17 days uptime.

The system is still on. No hangs. 17 days of uptime with default BIOS settings. No typical power idle, no C-states disabled and no overclock.

Today I received the 4th Ryzen 7 1800X processor from AMD.
Again, a 1852PGS from batch 8817063000. It seems that's the last batch they produced of this processor. I have received three 1852PGS from the same batch. The first RMA was a 1743SUS (although the invoice listed a 1741SUS from batch 8899500000), which exhibited MCE errors while executing the kill-ryzen.sh script.

Are you sure that this is a problem with the processor and not with Linux? AMD is not sending newer CPUs.

Are all the workarounds needed for the first Ryzen generation implemented as explained in the revision guide? https://www.amd.com/system/files/TechDocs/55449_Fam_17h_M_00h-0Fh_Rev_Guide.pdf
Comment 34 raul 2020-12-04 15:26:46 UTC
(In reply to raul from comment #33)
The AMD representative told me that if after the several RMAs the system is still crashing at idle, it is a Linux problem.
Comment 35 raul 2020-12-05 13:37:39 UTC
Created attachment 293945 [details]
Linux 5.10 rc6 dmesg crash 1800X

So, I have installed the 1800X and the system has crashed in less than 12 hours. Nothing appears in dmesg except a corrected AER error.
BIOS defaults. Linux 5.10 rc6.
Comment 36 raul 2020-12-05 14:08:31 UTC
I have enabled typical power idle. No C6 states are disabled with that option. I already posted this output, but there PC6 was disabled. That's because I had disabled Global C-State control in UEFI to see what it disables, and forgot to enable it again before posting.

> P0 - Enabled - FID = 90 - DID = 8 - VID = 20 - Ratio = 36.00 - vCore =
> 1.35000
> P1 - Enabled - FID = 80 - DID = 8 - VID = 2C - Ratio = 32.00 - vCore =
> 1.27500
> P2 - Enabled - FID = 84 - DID = C - VID = 68 - Ratio = 22.00 - vCore =
> 0.90000
> P3 - Disabled
> P4 - Disabled
> P5 - Disabled
> P6 - Disabled
> P7 - Disabled
> C6 State - Package - Enabled
> C6 State - Core - Enabled
Comment 37 raul 2020-12-05 14:16:31 UTC
(In reply to raul from comment #36)
Oh... that means PC6 was disabled the whole time, given that the dmesg in comment #28 has 12 days recorded.
Comment 38 raul 2020-12-06 17:43:35 UTC
(In reply to raul from comment #36)
This is a PITA. It does disable PC6, but only after a cold boot. If you enable Typical power idle and keep doing warm reboots, PC6 remains enabled. It only gets disabled after a cold boot.

And Global C-State control manages CC6. Disabling Global C-State disables CC6. Having it Enabled or Auto enables CC6, without the need for a cold boot.

That's what zenstates.py lists after warm reboots and cold boots.

MSI X470 Gaming Pro Max.
Comment 39 raul 2020-12-07 01:56:11 UTC
The system crashed again, with Typical power idle enabled without cold boot, that is, PC6 disabled. All devices powered off: mouse, keyboard and screen.

Using the hardware reset button also allows the motherboard to disable PC6. Zenstates.py now shows PC6 disabled.

It's not a PSU problem.
Comment 40 raul 2021-01-03 15:14:26 UTC
I installed the 2700X again and it crashed after 48h. After that crash it stayed up for 10 days, after which I replaced it with a 2700X received through the AMD RMA process. This other 2700X has been up for 11d 15h and counting since installation.

There's a clear difference between the 1800X and the 2700X. The former usually crashes within 48 hours, while the 2700X can stay up for days. The only way to avoid crashes completely is to enable typical power idle, which disables PC6 and keeps the idle voltage around 0.85V as per sensor readings, or to disable CC6, losing XFR and Precision Boost.

The system configuration is the same. Whether crashes are due to the PSU or not, there's a way to avoid them and I think they're not caused by a bug in Linux, as it was already stated by others. I'm changing the status to RESOLVED DOCUMENTED.
Comment 41 ilhan 2021-01-19 21:19:56 UTC
I have been getting similar errors and have been struggling with them for some time. I tried updating to the latest BIOS and kernel to see if that would change anything.

It didn't take long before it errored. You can read the logs here: https://paste.ubuntu.com/p/SJxWkK4Zv2/

I started doing this again because it had been running fine for some time with the following kernel parameters: GRUB_CMDLINE_LINUX="pci=assign-busses apicmaintimer idle=poll reboot=cold,hard"

I will try out some new memory as most of the cases online were related to this.
Comment 42 raul 2021-02-22 18:36:32 UTC
(In reply to ilhan from comment #41)

That's an X570 chipset. AMD has admitted USB problems with X570 chipsets: https://videocardz.com/newz/amd-admits-there-are-problems-with-usb-devices-on-500-series-motherboards

Maybe a new AGESA update will solve your problem. As I said, USB and the screen power off when my system freezes. If the screen turns off, it means the graphics card has stopped sending a signal because it is no longer being driven, and that may mean PCIe hangs. Following that logic, the CPU might not be servicing interrupts from I/O controllers, hence the unknown NMI messages in comment #11, and that somehow causes the CPU to stop executing instructions.
Comment 43 ilhan 2021-02-22 19:12:19 UTC
(In reply to raul from comment #42)

In my case it was a faulty processor, and AMD accepted the RMA. I installed a new Ryzen 7 5800X and since then I have had an unprecedented uptime of 2 weeks, with everything running very stably.
Comment 44 raul 2021-02-23 03:07:53 UTC
Created attachment 295407 [details]
PCIe AER errors since 2020/08/31

I have attached the log of PCIe errors recorded by rasdaemon since I installed it on 2020/08/31. The first error is the one I wrote about in comment #12, 2020-10-13.
After that I changed to X470 motherboards and the errors now occur regularly. Those errors are thrown by the AMDGPU driver.

During the month and a half that I had the Crosshair VI with rasdaemon installed, no PCIe AER error was logged, except when I tried the naive program with sleeping threads. I have also changed kernel versions, from 5.4 to 5.8 and then 5.10.

I'm now on an X470 Gaming Plus Max motherboard, essentially the same as the X470 Gaming Pro Max.
Comment 45 raul 2022-02-08 00:18:13 UTC
I have had the computer running with Typical power idle enabled and had no hangs. Last week I had to clear CMOS and forgot to re-enable that option. The computer froze again (USB powers down for a second and the system hangs) without logging anything. During the last freeze the system had time to dump a stack trace showing an error I had not seen before.

>BUG: scheduling while atomic: swapper/15/0/0x00000008
>Modules linked in: btrfs blake2b_generic xor zstd_compress raid6_pq ufs qnx4
>hfsplus hfs minix ntfs msdos jfs xfs cpuid nf_conntrack_netlink xfrm_user
>xfrm_algo br_netfilter vboxnetadp(OE) vboxnetflt(OE)>
> sysfillrect snd sysimgblt ccp soundcore mac_hid sch_fq_codel ip6t_REJECT
> nf_reject_ipv6 xt_hl ip6t_rt nct6775 hwmon_vid xt_LOG nf_log_syslog msr
> xt_multiport xt_addrtype xt_recent ipt_SYNPROXY nf_synpro>
>CPU: 15 PID: 0 Comm: swapper/15 Kdump: loaded Tainted: G           OE     5.13.0-27-generic #29~20.04.1-Ubuntu
>Hardware name: Micro-Star International Co., Ltd. MS-7B79/X470 GAMING PLUS MAX (MS-7B79), BIOS H.C0 05/18/2021
>Call Trace:
> dump_stack+0x7d/0x9c
> __schedule_bug.cold+0x4a/0x5b
> __schedule+0x6f1/0x900
> schedule_idle+0x2c/0x40
> do_idle+0x172/0x260
> cpu_startup_entry+0x20/0x30
> start_secondary+0x11f/0x160
> secondary_startup_64_no_verify+0xc2/0xcb

After this I re-enabled Typical power idle, which keeps the package C6 state disabled.
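
For anyone collecting these traces, the first line of the report above packs three fields into one string: the task name (comm), its PID, and the preempt count in hexadecimal, printed by the kernel as comm/pid/0xcount. A small sketch for pulling them apart, for example when grepping several dmesg dumps. The helper name parse_sched_atomic is hypothetical:

```python
import re

# The kernel prints: "BUG: scheduling while atomic: <comm>/<pid>/0x<preempt_count>"
# The comm itself may contain a slash (e.g. "swapper/15"), so the task name
# is matched greedily and the last two fields are anchored on the right.
PATTERN = re.compile(
    r"BUG: scheduling while atomic: (.+)/(-?\d+)/0x([0-9a-fA-F]+)"
)


def parse_sched_atomic(line):
    """Return comm, pid and preempt_count from a log line, or None."""
    match = PATTERN.search(line)
    if match is None:
        return None
    return {
        "comm": match.group(1),
        "pid": int(match.group(2)),
        "preempt_count": int(match.group(3), 16),
    }


line = ">BUG: scheduling while atomic: swapper/15/0/0x00000008"
print(parse_sched_atomic(line))
```

Here the task is the idle thread of CPU 15 (swapper/15, PID 0), and the non-zero preempt count is what makes the scheduler complain.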
