Bug 210261

Summary: Random freezes and reboots AMD Ryzen
Product: Platform Specific/Hardware
Reporter: David Maseda (david.maseda)
Component: x86-64
Assignee: other_other
Status: NEW
Severity: blocking
CC: captain_rage, gabriele.svelto, zawertun
Priority: P1
Hardware: All
OS: Linux
Kernel Version: 5.8
Subsystem:
Regression: No
Bisected commit-id:

Description David Maseda 2020-11-18 20:15:41 UTC

I have an AMD Ryzen 7 3800X with an MSI X570 motherboard.

I suffer from random reboots and freezes on Linux. Windows works totally fine though.

I've searched on several forums and stackexchange threads, and none of the solutions work.

The problems happen both on USB live media and on my existing installation. If I boot without any kernel parameters set, I get some "no irq handler for vector" messages and a reboot. If the system manages to keep working for a few more seconds, I get a "tsc marked as unstable" message followed by a hard reboot.

So far, I've tried these approaches:

pci=noaer, iommu=soft, pcie_aspm=off, processor.max_cstate=1, processor.max_cstate=5 (in all possible arrangements)
Disabling C-state management in BIOS
Disabling PSU idle power management in BIOS
Lowering RAM speeds
Marking tsc=unstable

Memtest shows no errors, and I've already cleared my CMOS and changed the battery. I also tried removing RAM sticks one at a time.
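
When testing many combinations of kernel parameters, it can help to confirm which ones the running kernel actually received. The following is a minimal sketch, assuming a Linux system where the boot command line is readable at /proc/cmdline. The helper names `parse_cmdline`, `active_params` and `read_cmdline` are hypothetical, and the parser splits on whitespace only, so quoted values containing spaces are not handled. When a parameter appears more than once, this sketch simply keeps the last occurrence.

```python
from typing import Dict, Iterable, Optional


def parse_cmdline(cmdline: str) -> Dict[str, Optional[str]]:
    """Split a kernel command line into {name: value}; bare flags map to None."""
    params: Dict[str, Optional[str]] = {}
    for token in cmdline.split():
        name, sep, value = token.partition("=")
        # Later occurrences overwrite earlier ones in this sketch.
        params[name] = value if sep else None
    return params


def active_params(cmdline: str, wanted: Iterable[str]) -> Dict[str, Optional[str]]:
    """Return only the parameters from `wanted` that are present in `cmdline`."""
    params = parse_cmdline(cmdline)
    return {name: params[name] for name in wanted if name in params}


def read_cmdline(path: str = "/proc/cmdline") -> str:
    """Read the command line of the running kernel (Linux only)."""
    with open(path, "r", encoding="utf-8") as handle:
        return handle.read().strip()


# Parameters mentioned in this report.
WANTED = ["pci", "iommu", "pcie_aspm", "processor.max_cstate", "tsc"]

# Example with a literal string instead of the live /proc/cmdline:
sample = "quiet pci=noaer iommu=soft processor.max_cstate=1"
print(active_params(sample, WANTED))
```

On an affected machine, `active_params(read_cmdline(), WANTED)` would show which of the listed workarounds are in effect for the current boot.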

Sometimes a kernel exception shows up, as if the kernel were trying to dereference a NULL pointer. Another error that I get at random is: kernel: mce: [Hardware Error]: CPU 10: Machine Check: 0 Bank 0: baa0000000060185
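
The 64-bit value at the end of the MCE line is the bank's status register, and its top bits are flags. Below is a minimal sketch of decoding them, assuming the common x86 MCi_STATUS layout for bits 63 to 57 and treating the low 16 bits as the MCA error code. Vendor-specific bits are deliberately not interpreted here, and the function name `decode_mci_status` is hypothetical.

```python
from typing import Dict, List, Union

# Flag bits 63..57 of an x86 machine-check bank status register.
STATUS_FLAGS = {
    63: "VAL",    # register contains a valid error
    62: "OVER",   # an earlier error was overwritten
    61: "UC",     # error was not corrected by hardware
    60: "EN",     # error reporting was enabled for this error
    59: "MISCV",  # the MISC register holds valid data
    58: "ADDRV",  # the ADDR register holds a valid address
    57: "PCC",    # processor context may be corrupt
}


def decode_mci_status(status: int) -> Dict[str, Union[List[str], int]]:
    """Return the set flags (highest bit first) and the low 16-bit error code."""
    flags = [
        name
        for bit, name in sorted(STATUS_FLAGS.items(), reverse=True)
        if (status >> bit) & 1
    ]
    return {"flags": flags, "mca_error_code": status & 0xFFFF}


# Status value quoted in the description above.
decoded = decode_mci_status(0xBAA0000000060185)
print(decoded["flags"])                 # ['VAL', 'UC', 'EN', 'MISCV', 'PCC']
print(hex(decoded["mca_error_code"]))   # 0x185
```

Under these assumptions the quoted value has UC and PCC set, meaning the hardware reported the error as uncorrected with possibly corrupt processor context. What the error code itself means depends on the bank and CPU family, and this sketch does not attempt to interpret it.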

I have no more ideas to test. I'll gladly provide any more information needed.
Comment 1 Martin Roth 2020-12-03 16:35:53 UTC

I built a PC with a Ryzen 3600 and an ASUS TUF X470-PLUS GAMING motherboard (current BIOS version 5602 from 2020/07/17) in 2019 and have been experiencing presumably identical problems. It runs Arch Linux, currently with kernel 5.9.11, and upgrading the kernel and AMD microcode throughout the year (currently package amd-ucode 20201120.bc9cd0b-1) doesn't seem to have made any difference. The computer freezes suddenly at random, sporadic intervals, most often under low load. Sometimes it happens once a week, sometimes more than once a day.

I have also tried different combinations of the kernel parameters that you mention without any luck; the freezes persist.

The last error message read:

[    0.585711] mce: [Hardware Error]: Machine check events logged
[    0.585713] mce: [Hardware Error]: CPU 3: Machine Check: 0 Bank 5: bea0000000000108
[    0.585717] mce: [Hardware Error]: TSC 0 ADDR 1ffffbaec512e MISC d012000100000000 SYND 4d000000 IPID 500b000000000 
[    0.585721] mce: [Hardware Error]: PROCESSOR 2:870f10 TIME 1606727817 SOCKET 0 APIC 8 microcode 8701021
[    0.772978] RAS: Correctable Errors collector initialized.

The last thing I tried was a tip from an online forum, where a user claimed that changing the CPU Ratio setting in the BIOS from Auto to 36.00 helped. After making that change I left the computer running for about 50 hours, using a script to move the cursor pseudo-randomly at low load, and it hasn't frozen yet. Source: https://forum-en.msi.com/index.php?threads/solved-msi-x570-a-pro-ryzen-5-3600-freeze.344085/#post-1993854.

Maybe you can try this as well? It is still too early to say whether changing the CPU Ratio in the BIOS has helped my machine or not.
Comment 2 Yaroslav Sidlovsky 2021-10-29 12:10:18 UTC

I've got exactly the same problem with an AMD Ryzen 5 3600 CPU and an ASUS ROG CROSSHAIR VII HERO motherboard.

Memory checks show no errors and Windows is totally stable.

I just tried the advice above about changing the CPU ratio to 36; I will see what happens.

Thanks for the advice!
Comment 3 Yaroslav Sidlovsky 2021-11-04 11:17:21 UTC

Looks like it really helps - almost 1 week on Ryzen + Linux without freezes.