Bug 196683

Summary: Random Soft Lockup on new Ryzen build
Product: Other Reporter: Account Deleted (account_deleted_20210115)
Component: Spam    Assignee: Other/Spam Assignee (other_spam)
Status: CLOSED OBSOLETE    
Severity: normal CC: 0x7f494e, 2rb0alex, aagaande, aaron, agmondroid, ajlopies, albertogomezmarin, alexandre9099, alexjuvion, alfie, amirthalingam013, angel.forest.public, anjalitiwari123321, ankitvarma604, arun123seo, arup.chowdhury, ashesh.ambasta, awwit, b-o-s-s, Baka.Kafka, bradley.g.smith, brauliobo, breitenstein.alexander, brody.m.dyck, bugzilla.kernel.org, c.buddeweg, Carolinewebb78, caroljames972022, castro8583bennett, chandru.dezineguru, chandrunan90, cjsheets, colemangroup05, ct.lrn, darkbasic, deewhydy250, deewhydy253, desouzalauren, diego.viola, dion, dopey, drcouzelis, dushistov, dyson8, edigitalb, emadahmed1995, emmajustine72, eric.c.morgan, ewerton.urias, exander77, excieve, fin4478, forerunsoftwaresolutions, geoff, getnaijamusic2, ghanazennit, ghostjax35, glueckspilz85, Grace_Cooper406, grz3chu, henriquesarmentogoulartaguiar, hi-angel, hoper, i, ison, iswaryarezolve, it, itanium_de, jaap, jackwrightma, jaisuntour22, jjacocks, jnonon+kernel, joh82875, johanesalala, johannes.hirte, johnpaul7, joypetter14, kallisti5, karthik.gana12, katiyar.pradeep97, kernel.org, kernel, kernel, kernel, kernelbugs, kernelbugzilla, kernelorg, kmueller, ledesillusionniste, leonard, Leonowenz.lo, livubuntu20, ljennia, lvargas, maciek.borzecki, malikovicboris85, marksandra3060, massivebaze, mawexdiallo4, mclark, mricon, Naijaflashng, nancybrown6803, nelsoneci, nethershaw, nickbryda, njlmerchant, nucrap, okaymedia, omarandemad, onlylys, onox, owenswerk, oyvinds, pghjvanblokland, philip.rosvall, pjssilva, pmenzel+bugzilla.kernel.org, qqcoded, quentinj, raginibnb, raulvior.bcn, retronaija2020, rj8410744, rjw, robert, rtguille, russianneuromancer, ryan, safedryoutcom, sam, schweinefilet, scotth3, seo.ssnutracare, seowork.ssnutracare, sergio, sineways, skillindiadevelopment24, skoolbeepseo, spunkwap, stellicht, tanyakalraicfa, thunderbird2k, tomm, transglobalaarav, transglobalshubham, transglobaltanya, varmaseema459, victorcqueirozg, weirdconstructor, wes, Wilbur310, willemdreyer, wim, xiyou.wangcong, yungease, z.wood23, zankuro
Priority: P1    
Hardware: All   
OS: Linux   
Kernel Version: 4.13.4-3 Subsystem:
Regression: No Bisected commit-id:
Attachments: Example Log
cpuinfo
iomem
ioports
lspci -vvv
modules
scsi
softlock-dmesg.log
panic-dmesg.log
panic2-dmesg.log
Kernel softlockup Fedora 27 serial log
ryzen kernel config
Workaround init script (derived from zenstates.py)
attachment-6763-0.html
disable c6
zenstates.py small patch to allow disabling only C6 package
attachment-17307-0.html
My /proc/cpuinfo
picture of panic on screen before reboot
dmesg of freeze with 4.20.3 kernel and nomwait, rcu_nocbs, max_cstate applied
acpi-dump-cstates.diff
dmesg log after rebooting with acpi-dump-cstates patch
dmesg w/o boot parameter idle=halt
dmesg with boot parameter idle=halt
dmesg w/o boot parameter idle=halt, patched acpi-dump-cstates
Don't do mwait on B1 and earlier
Testcase in docker
Add rifw kernel parameter to test a couple of patch to workaround ryzen freezes
Threadripper 2920x soft lockup

Description Account Deleted 2017-08-16 19:05:04 UTC
Created attachment 257955 [details]
Example Log

Full description of the problem/report:

A Ryzen 1700X (8 cores) system randomly hard-freezes with a soft lockup bug, and a hard reset is needed. The system no longer responds to mouse or keyboard input. The problem typically occurs when the load is low, e.g. while web surfing - stress testing with e.g. mprime is no problem. Example log: please see the attachment. 

NMI watchdog: BUG: soft lockup - CPU#12 stuck for 23s! [DOM Worker:1364]

Kernel version:
Linux version 4.12.4-2-MANJARO (builduser@ste74-i5) (gcc version 7.1.1 20170630 (GCC) ) #1 SMP PREEMPT Sun Jul 30 17:04:04 UTC 2017

Environment - Software:
Manjaro Linux - Stable Update 2017-08-10; KDE Plasma; Graphics driver: linux-nvidia 1:375.82-1-any

Environment - Additional information:
Attachments follow.
Comment 1 Account Deleted 2017-08-16 19:05:56 UTC
Created attachment 257957 [details]
cpuinfo
Comment 2 Account Deleted 2017-08-16 19:06:23 UTC
Created attachment 257959 [details]
iomem
Comment 3 Account Deleted 2017-08-16 19:06:46 UTC
Created attachment 257961 [details]
ioports
Comment 4 Account Deleted 2017-08-16 19:07:07 UTC
Created attachment 257963 [details]
lspci -vvv
Comment 5 Account Deleted 2017-08-16 19:07:30 UTC
Created attachment 257965 [details]
modules
Comment 6 Account Deleted 2017-08-16 19:08:11 UTC
Created attachment 257967 [details]
scsi
Comment 7 Riccardo Gori 2017-08-29 09:36:01 UTC
I have the exact same problem with the same setup (Ryzen 1700X).

I have experienced the problem with kernels 4.4, 4.10 and now 4.12.8. The system also freezes without X and without the proprietary NVIDIA drivers.
Comment 8 Wim Van Sebroeck 2017-09-09 15:58:16 UTC
Bug confirmed. Will revert patch 1fccb73011ea8a5fa0c6d357c33fa29c695139ea.
Comment 9 James Le Cuirot 2017-10-05 20:23:40 UTC
(In reply to Wim Van Sebroeck from comment #8)
> Bug confirmed. Will revert patch 1fccb73011ea8a5fa0c6d357c33fa29c695139ea.

Are you sure? That commit concerns iTCO_wdt, which is an Intel driver?

I have seen similar freezes many times. I captured the last time with netconsole and admittedly the call trace was different but I've seen many reports of Ryzen freezes on a variety of motherboards and disabling C6 or enabling CONFIG_RCU_NOCB_CPU is always said to work around it. Both of these certainly work here.

I must stress that this issue must not be confused with the segfault issue commonly reported with early Ryzens. I was facing that issue too and had my CPU replaced, but while the segfaults are gone, both my original CPU and the replacement exhibit these freezes.

I believe we have not seen more reports of this issue partly because most users will not know how to get kernel output in this situation, and partly because Fedora's kernel already has CONFIG_RCU_NOCB_CPU enabled. I am using Gentoo but I have confirmed that this also affects Debian.
Comment 10 Wim Van Sebroeck 2017-10-06 08:37:11 UTC
Oops, sorry. That comment was meant for bug 196509.
Comment 11 fin4478 2017-10-06 10:14:43 UTC
Thanks to James Le Cuirot for comment #9 (2017-10-05 20:23:40 UTC). My Ryzen 5 1600 build hung randomly with the AMD drm-next-4.15-wip kernel, with mainline 4.14-rc1 and rc2, and with custom non-debug 1000 Hz timer 4.13.5 kernels. It took 3 weeks to find this solution. 

X froze randomly when scrolling web content in the Chrome Beta browser. I thought it was a USB 3.0 problem at first and moved my mouse to a USB 2.0 port. Then I found advice for a Chrome bug:
https://askubuntu.com/questions/765974/chrome-freeze-very-frequently-with-ubuntu-16-04
It did not help.

Now uptime is 7 hours and 19 minutes and the system looks stable. I re-enabled GPU support in Chrome and the system is not freezing while surfing.

xfce@ryzen5pc:~$ screenfetch
         _,met$$$$$gg.           xfce@ryzen5pc
      ,g$$$$$$$$$$$$$$$P.        OS: Debian unstable sid
    ,g$$P""       """Y$$.".      Kernel: x86_64 Linux 4.14.0-rc2
   ,$$P'              `$$$.      Uptime: 7h 19m
  ',$$P       ,ggs.     `$$b:    Packages: 1942
  `d$$'     ,$P"'   .    $$$     Shell: bash 4.4.12
   $$P      d$'     ,    $$P     Resolution: 1920x1080
   $$:      $$.   -    ,d$$'     DE: XFCE
   $$\;      Y$b._   _,d$P'      WM: Xfwm4
   Y$$.    `.`"Y$$$$P"'          WM Theme: Default
   `$$b      "-.__               GTK Theme: Xfce [GTK2]
    `Y$$                         Icon Theme: Tango
     `Y$$.                       Font: Sans 10
       `$$b.                     CPU: AMD Ryzen 5 1600 Six-Core @ 12x 3.194GHz [39.6°C]
         `Y$$b.                  GPU: AMD/ATI Baffin [Polaris11]
            `"Y$b._              RAM: 1150MiB / 7989MiB
Comment 12 fin4478 2017-10-06 10:30:17 UTC
With RCU_NOCB_CPU you need to have the following in the kernel command line: 
rcu_nocbs=0-11

Ryzen 5 1600 has 12 threads so change the upper limit according to your cpu thread count.
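For reference, this is roughly how the parameter is applied on a GRUB-based distro (file locations below are just the usual defaults, and the 0-11 range should be adjusted to your thread count):

# in /etc/default/grub (Debian/Ubuntu) or /etc/sysconfig/grub (Fedora), append to the command line:
GRUB_CMDLINE_LINUX="<existing options> rcu_nocbs=0-11"
# then regenerate the config and reboot:
sudo update-grub                               # Debian/Ubuntu
sudo grub2-mkconfig -o /boot/grub2/grub.cfg    # Fedora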
Comment 13 James Le Cuirot 2017-10-06 10:43:30 UTC
(In reply to fin4478 from comment #12)
> With RCU_NOCB_CPU you need to have the following in the kernel command line: 
> rcu_nocbs=0-11

Correct, there used to be CONFIG_RCU_NOCB_CPU_ALL until very recently. I don't know why it was removed. I still see this as a workaround though as I gather it effectively avoids C6.

ptheis, could you please confirm whether these workarounds work for you and reassign the bug to someone more appropriate? This isn't a watchdog bug, the watchdog just gets triggered by the problem.
Comment 14 James Le Cuirot 2017-10-06 11:07:53 UTC
I wonder if this is going to start affecting Fedora 27 now that CONFIG_RCU_NOCB_CPU_ALL has gone. I grabbed a rawhide kernel RPM and that entry is indeed missing from the config. It seems unlikely they would add rcu_nocbs to everyone's command line.
Comment 15 fin4478 2017-10-07 07:37:43 UTC
(In reply to James Le Cuirot from comment #13)
> (In reply to fin4478 from comment #12)
> > With RCU_NOCB_CPU you need to have the following in the kernel command
> line: 
> > rcu_nocbs=0-11
> 
> Correct, there used to be CONFIG_RCU_NOCB_CPU_ALL until very recently. I
> don't know why it was removed. I still see this as a workaround though as I
> gather it effectively avoids C6.
> 

RCU_NOCB_CPU_ALL was removed 2017-06-08 18:52:43 -0700. "The CONFIG_RCU_NOCB_CPU_ALL, CONFIG_RCU_NOCB_CPU_NONE, and
CONFIG_RCU_NOCB_CPU_ZERO Kconfig options are used only in testing and
are redundant with the rcu_nocbs= boot parameter.  This commit therefore
removes these three Kconfig options and adjusts the rcutorture scripts
to use the boot parameter instead."

https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/init/Kconfig?h=v4.14-rc3&id=44c65ff2e3b0b48250a970183ab53b0602c25764

This is a typical non-end-user-friendly solution from tired programmers. It is as if other companies have a conspiracy against AMD, want to make AMD look bad, and write code like this. Another example: the Ryzen k10temp and Zen CPU select patches have been ready for several months but are still not in the mainline kernel, while Coffee Lake CPUs are coming.
Comment 16 Account Deleted 2017-10-07 08:55:54 UTC
Alright, many thanks for the workarounds. Manjaro has seen a few kernel updates recently. The last time I saw the error was on Kernel 4.13.4-1 (current stable is 4.13.4-3). This is what I changed on my system (not sure both are needed at the same time):

- added rcu_nocbs=0-15 to kernel command line (1700x has 16 threads)
- disabled C6 power management in BIOS

I also have an eye on HW acceleration in Firefox which was enabled all the time. If the workarounds help, I will need 1-3 weeks to confirm as the system doesn't run very long on weekdays.

@James Le Cuirot: I also changed the category of this bug (I hope this is correct - this is the first time I filed a Kernel bug).
Comment 17 fin4478 2017-10-07 09:16:27 UTC
(In reply to ptheis from comment #16)

> - added rcu_nocbs=0-15 to kernel command line (1700x has 16 threads)
> - disabled C6 power management in BIOS

You need to build a custom kernel and enable RCU_NOCB_CPU in the kernel configuration. Otherwise the rcu_nocbs parameter has no effect.

You want to save power when your CPU cores are not in use, so keep the C6 option enabled in the BIOS.
Comment 18 James Le Cuirot 2017-10-07 10:22:04 UTC
(In reply to ptheis from comment #16)
> This is what I changed on my system (not sure
> both are needed at the same time):
Only one is needed, I am not enabling these kernel options at the moment.

> @James Le Cuirot: I also changed the category of this bug (I hope this is
> correct - this is the first time I filed a Kernel bug).
The category/component you've chosen is better but the bug is still assigned to the watchdog team. Unfortunately the default assignee for this component works for Intel and I doubt they'd be able to spend time on this. Maybe you should change it to "x86-64|Platform Specific/Hardware" instead. Make sure you tick the "Reset Assignee to default" box when it appears.

(In reply to fin4478 from comment #17)
> You need to make a custom kernel and enable  RCU_NOCB_CPU in the kernel
> configuration. Otherwise the rcu_nocbs parameter have no effect.
Yes, Manjaro's kernel config doesn't appear to have this enabled.

> You want to save power when your CPU cores are not used so keep the C6
> option enabled in the Bios.
Building your own kernel can be daunting for some so the BIOS option is far easier. I don't think it makes any difference power-wise as I gather RCU_NOCB_CPU effectively prevents the CPU from entering C6 anyway.

(In reply to fin4478 from comment #15)
> This is typical not end user friendly solution from tired programmers. Other
> companies have a conspiracy against Amd, they do want to make Amd look bad
> and do code like this.
Let's just stick to the facts. I think the change was justified and it shouldn't be necessary to do this for Ryzen. I doubt the guy was even aware of this Ryzen issue as I discovered the RCU_NOCB_CPU workaround myself after spending several days flipping kernel options.
Comment 19 fin4478 2017-10-07 11:01:11 UTC
(In reply to James Le Cuirot from comment #18)

> Let's just stick to the facts. I think the change was justified and it
> shouldn't be necessary to do this for Ryzen.

IBM does not have the money to build one Ryzen PC to test their code ;-)
Think of all the newbies who have a new Ryzen PC and for whom even using the BIOS is difficult. The Ryzen & C6 documentation is hidden here. Thousands of PCs with stock kernels are freezing randomly now. Great for AMD ;-)
Comment 20 fin4478 2017-10-07 11:15:17 UTC
(In reply to James Le Cuirot from comment #18)
> easier. I don't think it makes any difference power-wise as I gather
> RCU_NOCB_CPU effectively prevents the CPU from entering C6 anyway.


"
Depending on the CPU / APU model, the highest boosted frequency PState usually has a C6-state requirement.
For example on FX-8370 which has 4.0GHz (P0, base), 4.1GHz (Pb1, boost) and 4.3GHz (Pb0, boost) states, the C6-state requirement is set to four. This means that the highest PState (4.3GHz, Pb0) will not activate unless half of the cores are currently in C6 active state.

If you disable C6-state this condition can obviously never be met, and the highest boosted state will never activate.
In this case the CPU will operate at the highest frequency PState which doesn't have the C6-state requirement (4.1GHz Pb1).
"

http://www.overclock.net/t/1328938/what-is-core-c6-state-exactly
Comment 21 Account Deleted 2017-10-08 18:35:46 UTC
C6 activated in BIOS ... testing custom kernel with rcu_nocbs=0-15 ...
Comment 22 James Le Cuirot 2017-10-08 18:41:33 UTC
Thanks for changing the category/component again but it still needs reassigning. Please set this to platform_x86_64@kernel-bugs.osdl.org.
Comment 23 Account Deleted 2017-10-09 16:50:44 UTC
There is no option for setting ...
Comment 24 fin4478 2017-10-10 06:24:50 UTC
(In reply to ptheis from comment #23)
> There is no option for setting ...

If you mean the kernel configuration, you need to enable the RCU_EXPERT option to see the RCU_NOCB_CPU setting.

https://cateee.net/lkddb/web-lkddb/RCU_NOCB_CPU.html
"
depends on: ( CONFIG_TREE_RCU || CONFIG_PREEMPT_RCU ) && ( CONFIG_RCU_EXPERT || CONFIG_NO_HZ_FULL )

"
Comment 25 Account Deleted 2017-10-10 18:15:46 UTC
(In reply to fin4478 from comment #24)
> (In reply to ptheis from comment #23)
> > There is no option for setting ...
> 
> If you mean kernel configuration, You need to to enable the RCU_EXPERT
> option to see the RCU_NOCB_CPU setting.
> 
> https://cateee.net/lkddb/web-lkddb/RCU_NOCB_CPU.html
> "
> depends on: ( CONFIG_TREE_RCU || CONFIG_PREEMPT_RCU ) && ( CONFIG_RCU_EXPERT
> || CONFIG_NO_HZ_FULL )
> 
> "

Sorry, my comment was on changing the assignee. I found no option to do so.
Comment 26 oyvinds 2017-10-13 14:52:58 UTC
I got hit by this bug on Fedora. My Ryzen 1600X system would randomly hang a short while after boot after upgrading to kernel 4.13.4. I looked at various things that could be the cause and thought it was fixed but then I rebooted and it happened again. And again. I quickly figured out that there's no problem with kernel 4.12.3 on Fedora. Tried kernel 4.13.5, same problem, went back to 4.12.3 until I wasted too much time looking into this today.

config-4.12.14-300.fc26.x86_64 on Fedora has
CONFIG_RCU_NOCB_CPU=y
CONFIG_RCU_NOCB_CPU_ALL=y

and there's no problem. config-4.13.5-300.fc27.x86_64 only has 
CONFIG_RCU_NOCB_CPU=y

and with that kernel there's a problem _unless_ I add rcu_nocbs=0-11 to the kernel command line - which I only figured out after looking at this bug.

Thank you James Le Cuirot.

This bug is listed as "Regression: No". It should be Yes in the case of Fedora; 4.12.x kernels work, 4.13.x do not work without a kernel boot parameter fix.

The commit that removed CONFIG_RCU_NOCB_CPU_ALL should please be reverted. The statement "The CONFIG_RCU_NOCB_CPU_ALL, CONFIG_RCU_NOCB_CPU_NONE, and
CONFIG_RCU_NOCB_CPU_ZERO Kconfig options are used only in testing" is clearly false since these options are/were used by distributions like Fedora and removing CONFIG_RCU_NOCB_CPU_ALL *breaks* kernel 4.13.5. You can't really expect distributions to ship with/add the rcu_nocbs= parameter as an alternative.

Just to repeat this point: my first conclusion was simply that 4.13.x kernels are broken, which made me stick with 4.12.x (which doesn't have this new problem with Ryzen CPUs) until I wasted time looking into this.
Comment 27 James Le Cuirot 2017-10-13 15:03:17 UTC
(In reply to oyvinds from comment #26)
> I got hit by this bug on Fedora. My Ryzen 1600X system would randomly hang a
> short while after boot after upgrading to kernel 4.13.4.

Welcome to the club. Hopefully now that Fedora is affected, this will get more attention.

> This bug is listed as "Regression: No". It should be Yes in the case of
> Fedora; 4.12.x kernels work, 4.13.x do not work without a kernel boot
> parameter fix.

I wouldn't call it a regression, as the kernel options involved are merely a workaround for the underlying issue. Fedora just happened to be enabling them anyway until one of them went away.
Comment 28 oyvinds 2017-10-14 03:26:54 UTC
(In reply to James Le Cuirot from comment #27)
> I wouldn't call it as a regression as the kernel options involved are merely
> a workaround for the underlying issue.

That's true and we shouldn't have to apply this workaround. Still, on Fedora it's a regression in my opinion, because kernel 4.12.x (happens to) work fine and 4.13.x doesn't.

As an end user my conclusion was simply that 4.13.x is broken, and I kept using 4.12.x until I had time to look into why 4.13.x is broken. And it still is for many: my 68-year-old mother isn't going to be adding an rcu_nocbs= kernel command line parameter on her laptop, or doing much beyond using it to check the news and weather. I am fairly sure I wouldn't be able to guide her through it even if I spent 10 hours on the phone; I'd have to go there. Luckily it's got some old Intel CPU. 

Yes, these are kernel options that some enable and some don't - but in the case of Fedora it's: you update, you get a new kernel, you're screwed, and now your machine hangs randomly - which it previously did not. Isn't that the definition of a regression?
Comment 29 Account Deleted 2017-10-17 18:10:22 UTC
My results after testing for some days: there was one freeze without errors in the journal. The second was also not visible in the logs, but the system rebooted and logged an MCE hardware error. The system is not stable yet.

Okt 16 19:32:22 pc1 kernel: x86: Booting SMP configuration:
Okt 16 19:32:22 pc1 kernel: .... node  #0, CPUs:        #1  #2
Okt 16 19:32:22 pc1 kernel: mce: [Hardware Error]: Machine check events logged
Okt 16 19:32:22 pc1 kernel: mce: [Hardware Error]: CPU 2: Machine Check: 0 Bank 3: b2a00020003f0000
Okt 16 19:32:22 pc1 kernel: mce: [Hardware Error]: TSC 0 IPID 300b000000000 
Okt 16 19:32:22 pc1 kernel: mce: [Hardware Error]: PROCESSOR 2:800f11 TIME 1508175138 SOCKET 0 APIC 2 microcode 8001126
Okt 16 19:32:22 pc1 kernel:   #3  #4  #5  #6  #7  #8  #9 #10 #11 #12 #13 #14 #15
Okt 16 19:32:22 pc1 kernel: smp: Brought up 1 node, 16 CPUs
Comment 30 James Le Cuirot 2017-10-17 21:18:09 UTC
(In reply to ptheis from comment #29)
> My results, testing some days: There was one freeze without errors in
> journal. The second was also not visible in the logs, but the system
> rebooted and threw a mce hardware error. The system is not stable yet.

How old is your Ryzen? The older ones with the random segfault issue often emitted MCE messages. I don't recall seeing any with my newly replaced Ryzen. Although segfaults are the most common symptom, I wouldn't be surprised if the fault could also trigger a freeze in some way.
Comment 31 eric.c.morgan 2017-10-26 23:31:55 UTC
Week 33 Ryzen 1700 checking in here. I compiled my 4.13.4 kernel with CONFIG_RCU_NOCB_CPU=Y, and still have crashes as described. 

I'm hoping CONFIG_RCU_NOCB_CPU_ALL will be added back as an option as kernel command params seem like a hack.

Please do know I appreciate those that offer their time to the linux kernel.
Comment 32 eric.c.morgan 2017-10-27 00:13:22 UTC
Update, added kernel boot param and enabled C6 and cool and quiet in the BIOS. I'll report back... I guess in a few days or a few weeks.
Comment 33 hoper 2017-10-27 09:44:05 UTC
Just a small message to say "me too". I spent the last 4 months spending a lot of time and money changing the motherboard, the CPU, the power supply... before doing research and finding that these freezes are software related :(

I tried (and managed) to compile my own kernel with CONFIG_RCU_NOCB_CPU=Y and so on. Before that, my server always crashed after 2 or 3 days. With this custom kernel, yes, it's better... the freezes only appear after 8 or 9 days. But the freezes are still here. And I guess I will just sell all this stuff and go back to Intel :(

I can't understand why the information "LINUX + RYZEN = NOT STABLE" is not spread everywhere. Lots of people out there are losing lots of time and money, I'm sure of that. Of course I'm grateful to all open source developers (I also share what I can :) and I really hope that the root cause of this bug will be found and corrected in the next few months... We need to be able to use Linux on Ryzen! (a perfect CPU for servers).

If someone manages to make this bug disappear (and has an uptime > 30 days), please share how you did it, with enough detail for a beginner like me :)
Comment 34 eric.c.morgan 2017-10-27 12:50:15 UTC
Hoper,

If you're still running your compiled kernel, add the kernel parameter "rcu_nocbs=0-15", where 0-15 is for a 16-thread CPU (use 0-11 for a 12-thread CPU, for example).

Here is a good link on how to add it.

https://askubuntu.com/questions/19486/how-do-i-add-a-kernel-boot-parameter

Remember that you're still using a new architecture. It takes time to make Linux play nice. We've been in an intel dominated world for a very long time. ;-)
Comment 35 hoper 2017-10-30 09:44:22 UTC
Thanks. I did it Sunday... Waiting for the freeze :o)
Comment 36 Jon 2017-11-04 03:44:20 UTC
I'm experiencing this bug as well, locking up every 24-48 hours roughly.  I've tried adding "rcu_nocbs=0-15" to the grub boot to see if that resolves it.  

Fedora 26 (kernel 4.13.9-200.fc26.x86_64)
Ryzen 1700X

Nov 03 19:54:30 pc003 kernel: watchdog: BUG: soft lockup - CPU#14 stuck for 23s! [kworker/14:1:217]
Nov 03 19:54:30 pc003 kernel: Modules linked in: ip6t_rpfilter ip6t_REJECT nf_reject_ipv6 xt_conntrack ip_set nfnetlink ebtable_nat ebtable_broute bridge stp llc ip6table_nat nf_co
Nov 03 19:54:30 pc003 kernel:  crc32_pclmul ccp snd_timer drm ghash_clmulni_intel snd soundcore tpm_tis sp5100_tco tpm_tis_core wmi_bmof i2c_piix4 shpchp tpm parport_pc parport acp 
Nov 03 19:54:30 pc003 kernel: CPU: 14 PID: 217 Comm: kworker/14:1 Tainted: P           OEL  4.13.9-200.fc26.x86_64 #1
Nov 03 19:54:30 pc003 kernel: Hardware name: Micro-Star International Co., Ltd MS-7A34/B350 TOMAHAWK ARCTIC (MS-7A34), BIOS H.50 06/22/2017
Nov 03 19:54:30 pc003 kernel: Workqueue: events netstamp_clear
Nov 03 19:54:30 pc003 kernel: task: ffff8ac634d62640 task.stack: ffffaea343fdc000
Nov 03 19:54:30 pc003 kernel: RIP: 0010:smp_call_function_many+0x24a/0x270
Nov 03 19:54:30 pc003 kernel: RSP: 0018:ffffaea343fdfce0 EFLAGS: 00000202 ORIG_RAX: ffffffffffffff10
Nov 03 19:54:30 pc003 kernel: RAX: ffff8ac63e61f398 RBX: ffff8ac63e99b580 RCX: 0000000000000000
Nov 03 19:54:30 pc003 kernel: RDX: 0000000000000001 RSI: 0000000000000000 RDI: ffff8ac63e022b88
Nov 03 19:54:30 pc003 kernel: RBP: ffffaea343fdfd18 R08: ffffffffffffffff R09: 000000000000bfff
Nov 03 19:54:30 pc003 kernel: R10: fffffa649fe76140 R11: ffff8ac63e007c00 R12: 0000000000000010
Nov 03 19:54:30 pc003 kernel: R13: 0000000000000010 R14: ffffffff8b02d6a0 R15: 0000000000000000
Nov 03 19:54:30 pc003 kernel: FS:  0000000000000000(0000) GS:ffff8ac63e980000(0000) knlGS:0000000000000000
Nov 03 19:54:30 pc003 kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Nov 03 19:54:30 pc003 kernel: CR2: 00007fc490f23000 CR3: 000000011fe09000 CR4: 00000000003406e0
Nov 03 19:54:30 pc003 kernel: Call Trace:
Nov 03 19:54:30 pc003 kernel:  ? netif_receive_skb_internal+0x28/0x410
Nov 03 19:54:30 pc003 kernel:  ? setup_data_read+0xa0/0xa0
Nov 03 19:54:30 pc003 kernel:  ? netif_receive_skb_internal+0x29/0x410
Nov 03 19:54:30 pc003 kernel:  on_each_cpu+0x2d/0x60
Nov 03 19:54:30 pc003 kernel:  ? netif_receive_skb_internal+0x28/0x410
Nov 03 19:54:30 pc003 kernel:  text_poke_bp+0x6a/0xf0
Nov 03 19:54:30 pc003 kernel:  __jump_label_transform.isra.0+0x10b/0x120
Nov 03 19:54:30 pc003 kernel:  arch_jump_label_transform+0x32/0x50
Nov 03 19:54:30 pc003 kernel:  __jump_label_update+0x68/0x80
Nov 03 19:54:30 pc003 kernel:  jump_label_update+0xae/0xc0
Nov 03 19:54:30 pc003 kernel:  static_key_slow_inc+0x95/0xa0
Nov 03 19:54:30 pc003 kernel:  static_key_enable+0x1d/0x30
Nov 03 19:54:30 pc003 kernel:  netstamp_clear+0x2d/0x40
Nov 03 19:54:30 pc003 kernel:  process_one_work+0x193/0x3c0
Nov 03 19:54:30 pc003 kernel:  worker_thread+0x4a/0x3a0
Nov 03 19:54:30 pc003 kernel:  kthread+0x125/0x140
Nov 03 19:54:30 pc003 kernel:  ? process_one_work+0x3c0/0x3c0
Nov 03 19:54:30 pc003 kernel:  ? kthread_park+0x60/0x60
Nov 03 19:54:30 pc003 kernel:  ret_from_fork+0x25/0x30
Nov 03 19:54:30 pc003 kernel: Code: 35 30 00 39 05 fc 92 f1 00 89 c1 0f 8e 3d fe ff ff 48 98 48 8b 13 48 03 14 c5 e0 e3 d3 8b 48 89 d0 8b 52 18 83 e2 01 74 0a f3 90 <8b> 50 18 83 e
Comment 37 Seth Jennings 2017-11-06 18:56:06 UTC
Another hang on 4.13.5 with Ryzen 1600X using Fedora 26.  Works fine on 4.12.9.
Comment 38 eric.c.morgan 2017-11-06 19:02:14 UTC
I'm at 11 days uptime with the custom 4.13.4 kernel and kernel boot params.

Jon, did you also compile the RCU settings or only apply the boot params?

Seth, I'm on 4.13.4 FWIW and now doing well. How long was your uptime?
Comment 39 Seth Jennings 2017-11-06 19:14:55 UTC
Eric, Fedora has CONFIG_RCU_NOCB_CPU=y for 4.13 so I just set rcu_nocbs=0-11 on the kernel command line.  I'll report how it goes.  Without the params, I would hang within a few hours.
Comment 40 Seth Jennings 2017-11-06 20:27:11 UTC
Eric, been running for a few hours now, longer than the hang normally takes to surface, and it hasn't.  Seems that the rcu_nocbs=0-11 masks the issue.

I sent an email off to Paul McKenney, RCU maintainer, to see if he couldn't take a look.  I'm out of my depth on this one.  RCU is voodoo of the highest order!
Comment 41 eric.c.morgan 2017-11-06 20:31:26 UTC
Seth, fantastic! Agreed on RCU complexity. I started reading up on it.. and then went back to funny cat pictures.
Comment 42 John-Paul Herold 2017-11-06 22:25:31 UTC
Eric and Seth, I'm debugging and testing myself now after having very similar issues. Hardware no longer an issue after early-week RMA and confirmed stability with 48hr stressapptest run. Curious if you wouldn't mind sharing what distro and DE/WM you are running as well? (Seth looks like Fedora 26 for you).

I also assume that at this point the only Ryzen specific tweaks you are running are `CONFIG_RCU_NOCB_CPU` and `rcu_nocbs`? No more Ryzen-specific bios tweaks?
Comment 43 Seth Jennings 2017-11-06 22:35:46 UTC
John,

No custom kernel or BIOS settings.  Running stock clocks and voltages for memory and CPU.  Just added rcu_nocbs to cmdline to (re)mask whatever is going on.

Also, I'm not sure that stress tests are the best way to recreate this issue.  In my experience, it happens when the system is mostly idle.
Comment 44 Jon 2017-11-06 22:38:12 UTC
(In reply to eric.c.morgan from comment #38)
> I'm at 11 days uptime with the custom 4.13.4 kernel and kernel boot params.
> 
> Jon, did you also compile the RCU settings or only apply the boot params?
> 
> Seth, I'm on 4.13.4 FWIW and now doing well. How long was your uptime?

I've only specified the boot parameter.  Too soon to say it's resolved for sure, but uptime is over 2 days at this point, which seems to be an improvement.
Comment 45 Jon 2017-11-06 22:39:59 UTC
(In reply to John Paul Herold from comment #42)
> Eric and Seth, I'm debugging and testing myself now after having very
> similar issues. Hardware no longer an issue after early-week RMA and
> confirmed stability with 48hr stressapptest run. Curious if you wouldn't
> mind sharing what distro and DE/WM you are running as well? (Seth looks like
> Fedora 26 for you).
> 
> I also assume that at this point the only Ryzen specific tweaks you are
> running are `CONFIG_RCU_NOCB_CPU` and `rcu_nocbs`? No more Ryzen-specific
> bios tweaks?

I'm also running Fedora 26, DE/WM is dwm.  

I also want to say that my CPU is confirmed to be affected by the Ryzen bug.  I'm currently discussing RMA with AMD but I do not believe these issues are related.
Comment 46 John-Paul Herold 2017-11-06 22:53:05 UTC
(In reply to Seth Jennings from comment #43)
> Also, I'm not sure that stress tests are the best way to recreate this
> issue.  In my experience, it happens when the system is mostly idle.

Seth, correct on the stress test; I just wanted to clarify that my hardware is stable, no bad OC, etc. I too get the issue when I return to work the next day after the system has been idle overnight.
Comment 47 Seth Jennings 2017-11-06 23:03:23 UTC
Just got the hang even with rcu_nocbs=0-11.  Back to 4.12 then.
Comment 48 eric.c.morgan 2017-11-06 23:12:08 UTC
John,

- week 33 R7 1700 CPU no segfault issues
- ubuntu 16.04 with 4.13.4 kernel pulled from git, copied config, applied RCU_NOCBS, the RCU_NOCBS_ALL option was NOT available.
- XFCE/Xubuntu to be more exact
- memtest 4 passes OK
- all c states, cool and quiet enabled in BIOS, no changes really
- boot params used as discussed


Seth,

My system crashed (and crashes) under very low load, almost idle. Before I tried any of these fixes I would run a video on loop to keep one processor more active to stave off crashes. It seemed to help. Maybe this would help you until a fix is official?


Jon,

Segfault bug is indeed a different issue. I had to RMA to get my week 33 CPU. It takes some time. Hang in there!
Comment 49 James Le Cuirot 2017-11-06 23:13:43 UTC
(In reply to Seth Jennings from comment #47)
> Just got the hang even with rcu_nocbs=0-11.  Back to 4.12 then.

Are you able to disable C6 in the BIOS? I still don't know whether the RCU workaround avoids the issue entirely or just makes it less likely. I haven't had a single freeze since disabling C6.
Comment 50 eric.c.morgan 2017-11-06 23:17:13 UTC
Seth/James,

Also consider disabling ASLR.

# disable in the current session (needs root)
echo 0 | sudo tee /proc/sys/kernel/randomize_va_space

# make the change permanent (across reboots)
echo "kernel.randomize_va_space = 0" | sudo tee /etc/sysctl.d/01-disable-aslr.conf
Comment 51 James Le Cuirot 2017-11-06 23:24:23 UTC
(In reply to eric.c.morgan from comment #50)
> Also consider disabling ASLR.

That only helps with the segfault issue. I now have a new Ryzen and no longer encounter that.

I discovered the RCU workaround after receiving the new Ryzen and trying to work out why Gentoo froze but Fedora (at the time) didn't. I compared the kernel configurations and manually converged one towards the other, focusing on the things that seemed most likely to make a difference. It took several days. I was certain by that point that ASLR was unrelated here.
Comment 52 John-Paul Herold 2017-11-07 00:04:57 UTC
(In reply to James Le Cuirot from comment #51)
> That only helps with the segfault issue. I now have a new Ryzen and no
> longer encounter that.

James, regarding ASLR, that is helpful, as I've seen it recommended as a general Ryzen stability tweak - which we definitely want to keep separate from this bug.
Comment 53 Panagiotis Malakoudis 2017-11-07 10:42:41 UTC
I want to confirm this issue on my system (AMD Ryzen 1700X, segfault free chip after RMA) and offer some more info:

- First of all, let's define what CONFIG_RCU_NOCB_CPU does. It enables support for the rcuo kernel threads, which handle RCU callback processing. The option is automatically enabled if you select CONFIG_NO_HZ_FULL=y, and this is the reason why it is automatically enabled in Fedora kernels. Fedora also chose to set CONFIG_RCU_NOCB_CPU_ALL=y in their 4.12 series kernels, while the default is CONFIG_RCU_NOCB_CPU_NONE=y, which only enables RCU callback offloading to rcuo kernel threads for those CPUs listed in the rcu_nocbs boot parameter. This is also the default for 4.13 kernels, where the CONFIG_RCU_NOCB_CPU_* options have been dropped. So, in order to get the behaviour of CONFIG_RCU_NOCB_CPU_ALL=y on a 4.13 kernel, you need to supply the boot parameter rcu_nocbs=0-XX, where XX is your number of CPUs minus 1. In short, by setting CONFIG_RCU_NOCB_CPU=y and CONFIG_RCU_NOCB_CPU_ALL=y (or rcu_nocbs=0-XX), we are telling the kernel to offload RCU callbacks to separate kernel threads called rcuo (RCU offload), one for each CPU. The affinity of those kernel threads is set to ffff, so they can run on any CPU (a quick way to verify they are active is sketched after this list).

- C6 is not effectively disabled by using rcuo kernel threads for RCU callback processing. When C6 is disabled, the CPU voltage never drops under the voltage defined for the P2 state (0.9V for my CPU). With C6 enabled and rcuo kernel threads enabled, my voltage drops to 0.4V frequently. Also, disabling C6 prevents a single core from hitting XFR turbo speeds. The max CPU frequency for my 1700X is 3500 MHz when C6 is disabled. With C6 enabled, I can run single-threaded processes at 3900 MHz (the XFR turbo speed of the 1700X). Enabling rcuo kernel threads for RCU callback processing does not disable XFR turbo speeds. So the claim that this feature effectively disables C6 is simply not true.

- C6 can be disabled manually with zenstates.py script, even if your BIOS does not offer this option. It can be found at https://github.com/r4m0n/ZenStates-Linux

- In my case, idle freezes are prevented with either C6 disabled or with rcuo kernel threads for RCU callback processing enabled and C6 enabled. I had 14 days uptime with rcuo kernel threads enabled, while my system will usually freeze overnight without it.

- And finally, I have no idle freezes in Windows 10 with C6 enabled.
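As referenced in the first point above, a quick way to verify that the rcuo offload kthreads are actually running (a sketch; the exact thread names, such as rcuob/N, rcuop/N and rcuos/N, depend on the kernel configuration):

ps -eo pid,comm | grep rcuo       # should list one set of rcuo*/N threads per CPU when offloading is active
taskset -cp <pid-of-an-rcuo-thread>   # the affinity should span all CPUs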

My opinion is that this idle freeze is a hardware issue of the AMD Zen processor, and the rcuo kernel threads for RCU callbacks just hide this hardware issue or make it less probable to trigger. Probably something similar happens in the Windows 10 kernel.

I think AMD should be contacted.
Comment 54 James Le Cuirot 2017-11-07 11:11:25 UTC
Thank you very much for this extra detail, Panagiotis. I'll turn C6 back on! I wanted to contact AMD but I wasn't sure how best to do so.
Comment 55 Panagiotis Malakoudis 2017-11-07 12:28:55 UTC
I think that an experienced kernel developer who understands how RCU callback processing could affect the triggering of a probable hardware issue should contact AMD. There are AMD kernel developers, so this bug report should be brought to their attention.
Comment 56 Account Deleted 2017-11-07 18:22:49 UTC
(In reply to Panagiotis Malakoudis from comment #55)
> I think that an experienced kernel developer that understands how RCU
> callback could affect the triggering of a probable hardware issue should
> contact AMD. There are AMD kernel developers so they should get the
> attention of this bug report.

How can we contact them? For me, the situation is still unclear ... I didn't have problems for some weeks while heavily using my PC to configure a new firewall. After a custom kernel I used standard Manjaro kernels from 4.13.9 to 4.13.11, keeping the rcu_nocbs option on my standard kernel command line. RCU_NOCB_CPU is not visible in the Manjaro kernel config. Maybe it's triggered by something else?

https://github.com/manjaro/packages-core/blob/master/linux413/config.x86_64
Comment 57 Jon 2017-11-08 03:03:25 UTC
I've now got over 4 days of uptime with ryzen 1700x on fedora 26 (4.13.9-200) using the rcu_nocbs=0-XX boot option.  I feel pretty confident in saying this resolved my issue.  Should we be filing a bug report with Fedora as well?  Clearly the current kernel configuration isn't compatible with (at least some) Ryzen CPUs.
Comment 58 Seth Jennings 2017-11-08 04:06:16 UTC
I'm running 4.13.11 now on F26.  I enabled kdump and kernel.softlockup_panic = 1 so maybe I can get a kernel core on the next hang if/when it occurs.
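For reference, the sysctl part of that looks roughly like this (kdump setup itself is distro-specific and omitted here):

# panic on a detected soft lockup so kdump can capture a vmcore
sudo sysctl -w kernel.softlockup_panic=1
# persist across reboots
echo "kernel.softlockup_panic = 1" | sudo tee /etc/sysctl.d/99-softlockup.conf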
Comment 59 oyvinds 2017-11-08 05:33:49 UTC
Using Fedora 27 I've had zero problems with random hangs since I first commented on this on 2017-10-13, I've had rcu_nocbs=0-11 as part of GRUB_CMDLINE_LINUX= in /etc/sysconfig/grub since then. Currently using 4.13.11-300.fc27.x86_64, been using various 4.13.x kernels.

Like already mentioned, this was something I noticed very quickly after going from kernel 4.12.x to 4.13.x, everything was working fine and then Fedora "upgraded" and the system started randomly hanging (mostly right after boot if I didn't start stuff immediately). My first reaction (before finding this bug) was to simply go back to 4.12.x.

I have not filed a Fedora bug but I'm not sure how useful it would be. Their kernels used to ship with CONFIG_RCU_NOCB_CPU=y and CONFIG_RCU_NOCB_CPU_ALL=y and they probably still would if CONFIG_RCU_NOCB_CPU_ALL wasn't removed.

The kernel developers could fix this (for Fedora at minimum; I do not know how other distributions' kernels are configured) by reverting this:

https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/init/Kconfig?h=v4.14-rc3&id=44c65ff2e3b0b48250a970183ab53b0602c25764

I'm not sure why this is not happening. Perhaps it will in the far future when Ubuntu and RHEL/CentOS finally switch to a 4.13+ kernel, AMD seems to only care about Ubuntu 16 and CentOS 7 for some reason (look at their binary blob GPU drivers, for example). It would be excellent if someone with access could please revert that git commit before 2020. I don't really see any compelling arguments as to why that change was done and a whole lot of Ryzen systems randomly hanging is a good argument as to why it should be reverted.
Comment 60 Panagiotis Malakoudis 2017-11-08 08:13:16 UTC
Reverting the deletion of the CONFIG_RCU_NOCB_CPU_{ALL|ZERO|NONE} options is not a solution. That option is just a workaround for what I think is a hardware issue.

There is at least one Ryzen 1700X processor (a friend of mine has it) that doesn't freeze on idle with stock kernel settings. Also, my original CPU, which had segfault issues, didn't freeze on idle either. Actually, it seems most CPUs from RMA that don't have the segfault issue are more affected by the idle freeze issue than others. 

AMD should be informed about the freeze-on-idle issue, and an experienced kernel developer should work with them to find out why a kernel with the rcuo kernel threads for RCU callbacks enabled hides the issue (or makes it less probable to happen).
Comment 61 James Le Cuirot 2017-11-08 11:00:05 UTC
(In reply to Panagiotis Malakoudis from comment #60)
> Reverting the CONFIG_RCU_NOCB_CPU_{ALL|ZERO|NONE} options deletion is not a
> solution. This option is just a workaround for what I think is a hardware
> issue.
Very much agreed.

> There is at least one Ryzen 1700X processor (friend of mine has it) that
> doesn't freeze on idle with stock kernel settings. Also, my original CPU
> that had segfault issues, didn't freeze on idle too. Actually, it seems most
> CPUs from RMA that don't have the segfault issue are more affected with the
> idle freeze issue than others.
My original CPU definitely had both issues and I wouldn't say that one froze more than the other. Perhaps there's another factor involved.
Comment 62 Tyler 2017-11-08 18:05:57 UTC
Another affected user here. Got my RMA week 33 segfault-free Ryzen 7 1700 a few weeks ago, running Gentoo on a custom vanilla kernel 4.13.10 atm. After 10 days of uptime, I now consider it stable. I did have freezes before, also with my pre-RMA CPU, but rcu_nocbs boot parameter seems to have fixed it, together with CONFIG_RCU_NOCB_CPU of course.

What I don't understand is why this should be another hardware bug. OK, we don't know, maybe it is, but even then, this time there should be some software workaround. And I don't necessarily mean this RCU offloading thing, which probably just masks the issue, but more in terms of some underlying RCU (?) kernel issue triggering on this platform...?

Also consider that there are no freeze issues on Windows as far as I know. OK, AMD quickly put out their AMD Ryzen Balanced power plan, but I think it's mostly about SMT scheduling, although it does disable C6 (I think, not sure). But anyhow, even on the default Windows power plans, there are no freezes. So... either Microsoft quickly and silently incorporated some patch fixing this issue a long time ago, or it never was an issue for Windows in the first place. It doesn't matter, however; the point is that something can be done in software to keep this "bug" from triggering.

Forgive my ignorance, but why is it so difficult to assign this bug to the correct kernel dev? I'm just a user really and wondering... thanks.
Comment 63 eric.c.morgan 2017-11-08 20:24:20 UTC
I'm at 13 days of uptime with the kernel and boot params applied. This ties the number of stable days my system managed before the boot params were added. I'll keep reporting back.

I am used to months of uptime with my previous Intel i5 2400 server. I hope to reach the same with my Ryzen.
Comment 64 Seth Jennings 2017-11-09 14:29:13 UTC
Created attachment 260577 [details]
softlock-dmesg.log

Was finally able to get a BUG on the soft lockup after attaching a serial console.  Hoping to get a few more to create a good sampling of cases in which this happens.

tl;dr, two cores seem to be soft locked in native_flush_tlb_others().
Comment 65 Seth Jennings 2017-11-09 15:20:06 UTC
Created attachment 260579 [details]
panic-dmesg.log

Here is a soft lock followed by watchdog timer panic.
Comment 66 Seth Jennings 2017-11-09 15:30:11 UTC
Created attachment 260581 [details]
panic2-dmesg.log

Last one for good measure.

smp_call_function_many() is the common denominator for all traces in this report.
Comment 67 Panagiotis Malakoudis 2017-11-09 15:32:32 UTC
@Seth: How did you capture these logs? Do you do serial console logging to another computer? Is enabling the serial console enough, or is something else needed too? I would also like to capture my idle freeze logs.
Comment 68 Seth Jennings 2017-11-09 15:35:32 UTC
@Panagiotis, yes, I attached a serial port header to my motherboard and am using a DB9 null modem cable and minicom to get the dmesg.  Added console=ttyS0,38400n8 to the kernel boot line.
Comment 69 James Le Cuirot 2017-11-09 15:36:03 UTC
I was able to capture logs using netconsole, which may be easier for some people.
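For anyone trying that route, netconsole can be loaded at runtime roughly like this (the IPs, interface name and MAC below are placeholders; see Documentation/networking/netconsole.txt for the exact parameter format):

# on the Ryzen box: stream kernel messages over UDP to another machine
sudo modprobe netconsole netconsole=6665@192.168.1.50/enp4s0,6666@192.168.1.10/aa:bb:cc:dd:ee:ff
# on the receiving machine: listen for the UDP stream and log it
nc -u -l 6666 | tee netconsole.log    # or "nc -u -l -p 6666", depending on the netcat variant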
Comment 70 Derrick Aguren 2017-11-10 23:44:47 UTC
Disabling C6 with zenstates.py has worked so far for me (6 hours without a freeze vs a freeze every one to two hours).

Ryzen 1800x (I haven't checked what week)
Ubuntu 16.04 LTS 4.11.0-kfd-compute-rocm-rel-1.6-180
Gigabyte AX370 - Gaming 5 BIOS version F8 (default params)
MSI Radeon RX 580
amdgpu-pro version 17.40-492261
Comment 71 Seth Jennings 2017-11-11 03:59:52 UTC
Doing some code investigation

http://elixir.free-electrons.com/linux/v4.13.12/source/kernel/smp.c#L401

The two RIP addresses in my 3 dmesg logs go here

smp_call_function_many
csd_lock_wait
smp_cond_load_acquire
cpu_relax (called in loop from smp_cond_load_acquire)
rep_nop
asm volatile("rep; nop" ::: "memory")  <-- smp_call_function_many+0x248

smp_call_function_many
csd_lock_wait
smp_cond_load_acquire
READ_ONCE(x) (called in loop from smp_cond_load_acquire)
__READ_ONCE(x, 1)
__read_once_size
__READ_ONCE_SIZE  <-- smp_call_function_many+0x24a

Both are within the tight loop in smp_cond_load_acquire waiting on the per-cpu csd locks

Basically, smp_call_function_many() executes a function on each cpu via IPI.
When wait=true, it runs synchronously, with the cpu that runs smp_call_function_many() waiting for each of the other cpus to report they have run the function as indicated by releasing their per-cpu csd lock.

The CSD_FLAG_SYNCHRONOUS flag determines the order of the func and unlock in flush_smp_call_function_queue()

http://elixir.free-electrons.com/linux/v4.13.11/source/kernel/smp.c#L242

This leads me to believe the issue here involves IPIs being dropped: the CPU calling smp_call_function_many() deadlocks waiting on a CPU that has dropped the IPI and therefore will never unlock its csd lock.
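For anyone repeating this kind of mapping from a RIP symbol+offset to a source line, the kernel tree has a helper (a sketch; it assumes a vmlinux built with debug info that matches the running kernel):

# map an offset from the oops to a file:line
./scripts/faddr2line vmlinux smp_call_function_many+0x24a/0x270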
Comment 72 Seth Jennings 2017-11-11 04:19:43 UTC
I found this commit that went into 4.12-rc2

https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=3fc5b3b6a80b2e08a0fec0056208c5dff757e547

While the reasoning in the commit message makes sense, it does modify the exact code area where this is happening.  It would be interesting to revert this and see if the problem goes away.  There could be some edge case where an enqueued csd gets dropped.
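If anyone wants to try that, roughly (a sketch; it assumes a checkout of the stable tree and your usual kernel build/install flow, and the revert may need minor conflict resolution):

# in a linux-stable checkout
git checkout v4.13.12
git revert 3fc5b3b6a80b2e08a0fec0056208c5dff757e547
# then rebuild and install the kernel as usual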
Comment 73 Jon 2017-11-11 13:05:06 UTC
This will be my last post unless the behavior changes, but I've gone from 1-2 lock ups per day to an uptime of: 7 days, 11:57 so I'm considering this resolved for me.  

By adding the following boot option: rcu_nocbs=0-15

And disabling ASLR.  I don't know which one solved the problem, but I'd start with the boot option then try disabling ASLR if you still have issues.
Comment 74 Tyler 2017-11-11 20:00:12 UTC
(In reply to Seth Jennings from comment #72)
> I found this commit that went into 4.12-rc2
> 
> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/
> ?id=3fc5b3b6a80b2e08a0fec0056208c5dff757e547

Great find, thanks!
I'm going to test 4.13.12 with this reverted. C6 is enabled in UEFI, no special boot params, just booted. Now, let's wait... :D
Comment 75 Roderick Colenbrander 2017-11-11 20:19:57 UTC
Created attachment 260619 [details]
Kernel softlockup Fedora 27 serial log

Hi,

I'm seeing the same issue as others reported here. If my memory serves me well I started seeing it on some 4.13.x kernel on fedora 26 and have had it ever since even now on fedora 27.

The issue usually happens every few hours and my system is fairly idle, just a few gnome-terminals and a browser open, no compilation or heavy 3d content.

I'm about to try the workarounds suggested on this ticket, but I want to capture a few more logs just in case.

Thanks,
Roderick
Comment 76 Panagiotis Malakoudis 2017-11-12 16:05:50 UTC
@Seth: I tried with the commit you mentioned reverted and got an idle freeze again. If I remember correctly, I have had idle freezes with the Debian 9.0 4.9 kernel as well.

I don't know if it helps, but running without GNOME/X, I didn't have idle freezes. The system was running only some network services (ssh, nfs, samba, etc.).
Comment 77 James Le Cuirot 2017-11-12 17:00:53 UTC
Yeah, I really want to believe it was that commit but while I haven't tried reverting it, I'm pretty sure I started with a kernel older than 4.12.
Comment 78 Seth Jennings 2017-11-13 03:13:24 UTC
@Panagiotis yes, I also built a kernel with a revert for that commit and it still locked up so... not that.
Comment 79 Seth Jennings 2017-11-14 04:02:51 UTC
@James, I am disabling C6 now (w/o the kernel options).  See how this goes.  If that alone can fix it, that is a pretty strong indicator in my mind that this is a hardware issue :-/
Comment 80 James Le Cuirot 2017-11-14 10:16:40 UTC
Derrick, evidently you work for AMD, is there anybody you can speak to about this?
Comment 81 Seth Jennings 2017-11-14 15:08:57 UTC
I'm stable with C6 disabled (both package and core) and no other modifications.  

In my mind, this proves that this is a hardware issue.  Gotta find some way to get AMDs attention on this...
Comment 82 Tom Hughes 2017-11-14 15:14:21 UTC
We have a machine (with a Ryzen Threadripper 1950X) where this is happening even with rcu_nocbs=0-31, and I'd love to test the disabling-C6 hypothesis, except that as far as I can tell there's no such option in the BIOS (ASUS PRIME X399-A motherboard).

In fact it looks to me like there are only C1 and C2, and hence no C6 to disable - that's going both by powertop and by looking at /sys/devices/system/cpu/cpu0/cpuidle/state*/name to see what states exist.
Comment 83 eric.c.morgan 2017-11-14 15:20:12 UTC
@Seth I crashed with C states disabled.

I haven't crashed in 20 days with C states ENABLED and RCU kernel and boot params.
Comment 84 eric.c.morgan 2017-11-14 15:22:17 UTC
@Tom, did you compile your kernel with the RCU option/s set?
Comment 85 Tom Hughes 2017-11-14 15:23:34 UTC
No this is a stock Fedora kernel, but my understanding was that rcu_nocbs=0-31 was equivalent to recompiling with the RCU configuration change?
Comment 86 eric.c.morgan 2017-11-14 15:30:12 UTC
@Tom, unfortunately not. You'll need to verify CONFIG_RCU_NOCB_CPU=Y and, if possible, CONFIG_RCU_NOCB_CPU_ALL=Y. The newer kernels don't have CONFIG_RCU_NOCB_CPU_ALL, so the boot param you reference does the same thing.

Here is an article I contributed to that discusses this. While Ubuntu-focused, it can still guide you.

http://blog.programster.org/ubuntu-16-04-compile-custom-kernel-for-ryzen

Like I stated before, with these options set I'm at 20 days crash free.
Comment 87 Tom Hughes 2017-11-14 15:32:11 UTC
Turns out it is actually set:

rover [~] % fgrep CONFIG_RCU_NOCB_CPU /boot/config-$(uname -r)
CONFIG_RCU_NOCB_CPU=y
Comment 88 Seth Jennings 2017-11-14 15:32:27 UTC
@Tom that is correct.  You do NOT have to recompile your kernel to use the rcu_nocbs option.  The fedora kernel already has CONFIG_RCU_NOCB_CPU=y.
Comment 89 eric.c.morgan 2017-11-14 15:39:06 UTC
@Tom, then life is easier for you! ;-)

Perhaps consider disabling ASLR? http://blog.programster.org/how-to-disable-aslr

Maybe up the SOC voltage a tad?
Comment 90 eric.c.morgan 2017-11-14 15:45:09 UTC
I found this project a while ago that claims to set C states for ryzen chips on some motherboards. 

https://github.com/r4m0n/ZenStates-Linux
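Typical usage looks roughly like this (a sketch; the --c6-disable option name and the need for the msr module are taken from my reading of the script, so double-check against the repository):

sudo modprobe msr                   # the script reads/writes MSRs via /dev/cpu/*/msr
git clone https://github.com/r4m0n/ZenStates-Linux
cd ZenStates-Linux
sudo ./zenstates.py -l              # list P-states and the C6 package/core status
sudo ./zenstates.py --c6-disable    # disable C6 until the next reboot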
Comment 91 Tom Hughes 2017-11-14 15:48:08 UTC
So that does show it as enabled at code level:

rover [~/ZenStates-Linux] % sudo ./zenstates.py -l
P0 - Enabled - FID = 88 - DID = 8 - VID = 44 - Ratio = 34.00 - vCore = 1.12500
P1 - Enabled - FID = 8C - DID = A - VID = 5A - Ratio = 28.00 - vCore = 0.98750
P2 - Enabled - FID = 84 - DID = C - VID = 6A - Ratio = 22.00 - vCore = 0.88750
P3 - Disabled
P4 - Disabled
P5 - Disabled
P6 - Disabled
P7 - Disabled
C6 State - Package - Disabled
C6 State - Core - Enabled

I've turned it off now so we'll see what happens with that...
Comment 92 eric.c.morgan 2017-11-14 15:50:55 UTC
@Tom, Good luck!

As dumb as it sounds, at one point I had a video running on loop on my server to keep it from going too idle. If your C state attempt is fruitless then maybe that would be a bandaid until a resolution for all this is found.
Comment 93 Seth Jennings 2017-11-14 16:02:19 UTC
@Tom the BIOS option on Gigabyte boards that disables C6 is called "Global C-state Control".  I think ASUS calls it the same thing.

@Eric interesting.  IIRC I hung with just the rcu_nocbs kernel options set and C6 enabled, similar to Tom's experience.

All of this does not change the fact that one should not have to resort to hacks to prevent the hardware from going "too idle".
Comment 94 Derrick Aguren 2017-11-14 16:42:22 UTC
(In reply to James Le Cuirot from comment #80)
> Derrick, evidently you work for AMD, is there anybody you can speak to about
> this?

Hi James. Unfortunately this sort of thing is outside of my wheelhouse. Let's let it progress through normal channels.
Comment 95 Seth Jennings 2017-11-14 16:58:52 UTC
@Derrick these "normal channels" of which you speak... please, do tell! :D  What are they?

How do users report issues that will likely require an AGESA update to fix?
Comment 96 Derrick Aguren 2017-11-14 17:19:28 UTC
@Seth I would start here, which mentions this site, but also other paths: https://www.kernel.org/doc/html/v4.10/admin-guide/reporting-bugs.html
Comment 97 Lucio Crusca 2017-11-19 09:25:56 UTC
(In reply to Derrick Aguren from comment #96)
> @Seth I would start here, which mentions this site, but also other paths:
> https://www.kernel.org/doc/html/v4.10/admin-guide/reporting-bugs.html



Maybe there's a shorter path, maybe Bridgman@Phoronix knows how to get in touch with the right AMD person. I've just asked him to have a look just in case he knows: 


https://www.phoronix.com/forums/forum/phoronix/latest-phoronix-articles/990202-linux-4-15-is-a-huge-update-for-both-amd-cpu-radeon-gpu-owners?p=990258#post990258
Comment 98 James Le Cuirot 2017-11-19 09:41:51 UTC
I already tried to contact Bridgman through a Phoronix PM but his inbox was full so I figured that he's probably pestered about random AMD issues enough already.

I then sent a message to AMD Tech Support. I haven't heard anything back yet.
Comment 99 Lorenzo Bona 2017-11-20 18:14:39 UTC
Hi guys.
I've been running a Ryzen build since late May/early June from Linus git and I never faced this lockup.
I've just checked and in my configuration that option is disabled.

Attached you can find my config file.

Can I help you by testing something?

R5-1600 with a Gigabyte Gaming 3.
Comment 100 Panagiotis Malakoudis 2017-11-20 18:50:00 UTC
@Lorenzo: Do you have C6 states enabled? You can check with the zenstates.py script mentioned earlier in this thread.
Comment 101 Lorenzo Bona 2017-11-20 19:04:33 UTC
(In reply to Panagiotis Malakoudis from comment #100)
> @Lorenzo: Do you have C6 states enabled? You can check with the zenstates.py
> script mentioned earlier in this thread.

Yes, C6 enabled.

./zenstates.py -l                                                                                                                                                                                                                 
P0 - Enabled - FID = 94 - DID = 8 - VID = 32 - Ratio = 37.00 - vCore = 1.23750                                                                                                                                                                                                 
P1 - Enabled - FID = B9 - DID = A - VID = 32 - Ratio = 37.00 - vCore = 1.23750                                                                                                                                                                                                 
P2 - Enabled - FID = 7C - DID = 10 - VID = 68 - Ratio = 15.50 - vCore = 0.90000                                                                                                                                                                                                
P3 - Disabled                                                                                                                                                                                                                                                                  
P4 - Disabled                                                                                                                                                                                                                                                                  
P5 - Disabled                                                                                                                                                                                                                                                                  
P6 - Disabled                                                                                                                                                                                                                                                                  
P7 - Disabled                                                                                                                                                                                                                                                                  
C6 State - Package - Enabled                                                                                                                                                                                                                                                   
C6 State - Core - Enabled
Comment 102 Lorenzo Bona 2017-11-20 19:07:37 UTC
Created attachment 260745 [details]
ryzen kernel config
Comment 103 Klaus Mueller 2017-11-20 20:24:22 UTC
I've been facing those system hangs, too, until I switched on the "daily computing" optimization in the BIOS (ASUS PRIME X370-PRO BIOS 0902 09/08/2017) running 4.13.x.

Switching to "daily computing" raises the maximum CPU speed to 3600 MHz (instead of 3400) on my Ryzen 7 1700X. Maybe it changes some more things I don't know of. But C states are definitely enabled:

# zenstates.py -l 
P0 - Enabled - FID = 90 - DID = 8 - VID = 20 - Ratio = 36.00 - vCore = 1.35000
P1 - Enabled - FID = 90 - DID = 8 - VID = 20 - Ratio = 36.00 - vCore = 1.35000
P2 - Enabled - FID = 84 - DID = C - VID = 68 - Ratio = 22.00 - vCore = 0.90000
P3 - Disabled
P4 - Disabled
P5 - Disabled
P6 - Disabled
P7 - Disabled
C6 State - Package - Enabled
C6 State - Core - Enabled
Comment 104 Johannes Hirte 2017-11-20 22:02:02 UTC
I don't think Bridgman can help here. You should ask Borislav Petkov <bp@alien8.de>, CC-ing linux-kernel@vger.kernel.org. He doesn't work for AMD anymore, but he is still involved in kernel development, so if he can't say what's going on, he can point you to the right developer.
Comment 105 Scott Farrell 2017-11-21 17:20:12 UTC
Just to add more, as I've been following along ever since this first happened to me about 6 days ago. To preface, I am running Debian 9 with the 4.13 kernel from backports. I went several days after upgrading from 4.12 with a very idle Ryzen 1700 and didn't have any issues. I did have 2 qemu-kvm VMs running, and they are also mostly idle. I encountered the CPU soft lockup bug out of the blue on Nov 15th while not doing anything interactive with the system.

I'm now running 6 VM's on this system as it was a replacement for an old Intel system.  I simply cannot have this thing crashing on me unattended, especially while I am remote.

Here are the measures I took, and frankly I don't really know if it's going to prevent this as the system simply hasn't been up long enough.

Board: Gigabyte GA-AB350-GAMING 3
Bios version: F7

Disabled Global C-State Control in the BIOS. The manual also says there should be a C6 Mode option in the BIOS; however, it simply doesn't exist.

Disabled AMD Cool&Quiet in bios.

I also disabled C6 via the zenstates.py script at boot, from a systemd unit, as I wasn't certain what the difference is between Package and Core.

# zenstates.py -l
P0 - Enabled - FID = 78 - DID = 8 - VID = 3A - Ratio = 30.00 - vCore = 1.18750
P1 - Enabled - FID = 87 - DID = A - VID = 50 - Ratio = 27.00 - vCore = 1.05000
P2 - Enabled - FID = 7C - DID = 10 - VID = 6C - Ratio = 15.50 - vCore = 0.87500
P3 - Disabled
P4 - Disabled
P5 - Disabled
P6 - Disabled
P7 - Disabled
C6 State - Package - Disabled
C6 State - Core - Disabled 

Question for clarification: does CONFIG_RCU_NOCB_CPU need to be set to y in the kernel for the boot parameter to be usable? I get the impression that the boot parameter doesn't actually do anything unless the kernel option is configured.
Comment 106 James Le Cuirot 2017-11-21 17:39:11 UTC
(In reply to Scott Farrell from comment #105)
> Disabled Global C-State Control in bios.  The manual also says there should
> be a C6 Mode in bios, however it simply doesn't exist.
These are the same thing.

> Disabled AMD Cool&Quiet in bios.
You don't need to disable this. I would leave it on.

> Question for clarification.  Does CONFIG_RCU_NOCB_CPU need to be set to yes
> in kernel to be able to use the boot parameter?  I get the impression that
> the boot parameter doesn't actually do anything unless kernel option is
> configured.
Yes, the boot parameter alone won't do anything. You only need to do this or disable C6, not both. At this point, the overall consensus seems to be that the RCU workaround is better as it still allows some power saving and doesn't prevent boost.
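
For anyone unsure how to check both halves of the workaround, a rough sketch (assuming a distro that exposes the running config via /proc/config.gz or a /boot/config-* file; the 0-15 range is just an example for a 16-thread part):

# zgrep RCU_NOCB_CPU /proc/config.gz        (or: grep RCU_NOCB_CPU /boot/config-$(uname -r))
CONFIG_RCU_NOCB_CPU=y
# cat /proc/cmdline
BOOT_IMAGE=... rcu_nocbs=0-15
# nproc
16

The rcu_nocbs range should cover every thread, i.e. 0 through (nproc - 1).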
Comment 107 Tom Hughes 2017-11-23 18:29:52 UTC
Well our machine has now gone about nine days without any soft lockups with C6 disabled - previously (even with rcu_nocbs=0-31) it was lucky to go more than a day or two.
Comment 108 eric.c.morgan 2017-11-24 17:18:00 UTC
I'm at 30 days with C-states ENABLED, Cool & Quiet ENABLED, the kernel config option set, and the boot parameter set. I consider my server stable now.

Now if only the out-of-the-box Linux kernel would be this stable.
Comment 109 James Le Cuirot 2017-11-24 17:31:48 UTC
I have had a response from AMD Tech Support. They told me to disable C6 even though I'd already told them that this has been found to work around the issue. Not very helpful but they also added that the issue "should be addressed in an up and coming BIOS release." This would imply that they already know about the issue.

I didn't say so but I am slightly dubious about the suggestion of a BIOS fix as they said the same thing about the segfault issue and that never materialised.

However I did implore them to be more transparent with their customers. Bad press is bad but how they handle the situation is what really counts. While they never fully admitted that the segfault issue was a large scale hardware fault in the face of overwhelming evidence, they did at least honour all RMA requests with little resistance. I get the impression that most customers were grateful for this. I certainly was.

They then told me my feedback had been passed onto the Ryzen team.
Comment 110 Tom Hughes 2017-11-24 17:39:07 UTC
Don't forget that a BIOS update can include new CPU microcode.
Comment 111 Mathias Tillman 2017-11-24 17:59:19 UTC
(In reply to James Le Cuirot from comment #109)
> I have had a response from AMD Tech Support. They told me to disable C6 even
> though I'd already told them that this has been found to work around the
> issue. Not very helpful but they also added that the issue "should be
> addressed in an up and coming BIOS release." This would imply that they
> already know about the issue.
> 
> I didn't say so but I am slightly dubious about the suggestion of a BIOS fix
> as they said the same thing about the segfault issue and that never
> materialised.
> 
> However I did implore them to be more transparent with their customers. Bad
> press is bad but how they handle the situation is what really counts. While
> they never fully admitted that the segfault issue was a large scale hardware
> fault in the face of overwhelming evidence, they did at least honour all RMA
> requests with little resistance. I get the impression that most customers
> were grateful for this. I certainly was.
> 
> They then told me my feedback had been passed onto the Ryzen team.

Good to know that they are working on it!

I've had this problem myself, with multiple random lockups per day when the system was mostly idle. But like others, a custom-compiled kernel (latest 4.15) with the CONFIG_RCU_NOCB_CPU option set and the rcu kernel command line parameter seems to have fixed it.
Comment 112 Tyler 2017-11-24 20:02:51 UTC
As a test, I've been running the configuration below, with an uptime of 12 days without any freeze or strange syslog entries. Note that this is with the patch Seth mentioned far above _reverted_, so at least for me this seems to have fixed the issue. I've read that for some it hadn't, which is strange. I have to countercheck for another few weeks.

# uname -a
Linux donald 4.13.12-x64 #14 SMP Sun Nov 12 17:23:57 CET 2017 x86_64 AMD Ryzen 7 1700 Eight-Core Processor AuthenticAMD GNU/Linux
# zcat /proc/config.gz |grep RCU_NO
# CONFIG_RCU_NOCB_CPU is not set
# cat /proc/cmdline 
BOOT_IMAGE=/boot/vmlinuz-4.13.12 root=/dev/nvme0n1p3 ro
# ./zenstates.py -l
P0 - Enabled - FID = 8C - DID = 8 - VID = 3A - Ratio = 35.00 - vCore = 1.18750
P1 - Enabled - FID = 87 - DID = A - VID = 50 - Ratio = 27.00 - vCore = 1.05000
P2 - Enabled - FID = 7C - DID = 10 - VID = 6C - Ratio = 15.50 - vCore = 0.87500
P3 - Disabled
P4 - Disabled
P5 - Disabled
P6 - Disabled
P7 - Disabled
C6 State - Package - Enabled
C6 State - Core - Enabled

Some relevant dmidecode output:

Base Board Information
        Manufacturer: ASRock
        Product Name: AB350 Gaming-ITX/ac

BIOS Information
        Version: P3.10
        Release Date: 08/28/2017
        
Memory Device (x2)
        Data Width: 64 bits
        Size: 8192 MB
        Type: DDR4
        Type Detail: Synchronous Unbuffered (Unregistered)
        Speed: 3200 MT/s
        Part Number: G.Skill F4-3200C14-8GFX
        Rank: 1
        Configured Clock Speed: 1600 MT/s

I'm running the RAM on its specified XMP profile, 3200-14-14-14-34 at 1.35V, which is quite good actually. UEFI settings are mostly default; only P-state overclocking is used, and two memory-related options are turned off: GearDown mode and Bank Swapping. Voltages are default except for a -100 mV CPU offset. All turbo and power saving stuff is working as intended using the ondemand governor.

So... are you guys sure your freezes aren't caused by a UEFI setting and/or kernel config and/or hardware issue, most probably RAM? RAM is especially critical on this platform, and maybe Memtest cannot detect a marginal kit...?
(But as noted, I have to countercheck for at least 2 weeks.)
Comment 113 James Le Cuirot 2017-11-24 21:26:45 UTC
(In reply to Tyler from comment #112)
> As a test, I've been running like below with an uptime of 12 days without
> any freeze or strange syslog entries. Note that this is with the patch Seth
> mentioned far above _reverted_, so at least for me this seems to have fixed
> the issue.

I'll have to try it myself then.

> So... are you guys sure your freezes are no UEFI setting and/or kernel
> config and/or hardware, most probably RAM, issue? Especially RAM is really
> critical on this platform and maybe Memtest cannot detect it...? 

I am booting in legacy mode so I doubt it's UEFI related. As someone who actually does have faulty RAM but has worked around it (for now), I can say that while it might have caused the odd freeze earlier, it was more apparent in other ways like random disk corruption (particularly when using git) and the browser constantly crashing until I rebooted.
Comment 114 Mathias Tillman 2017-11-27 07:38:04 UTC
(In reply to James Le Cuirot from comment #113)
> (In reply to Tyler from comment #112)
> > As a test, I've been running like below with an uptime of 12 days without
> > any freeze or strange syslog entries. Note that this is with the patch Seth
> > mentioned far above _reverted_, so at least for me this seems to have fixed
> > the issue.
> 
> I'll have to try it myself then.
I tried this myself over the weekend - I compiled a kernel with CONFIG_RCU_NOCB_CPU set to n and that commit reverted; unfortunately, after a few hours of being left idle it had locked up. After that I also did a test with that commit applied again and CONFIG_RCU_NOCB_CPU set to y (with the kernel command line also being set), and it ran well for 12+ hours. So it definitely feels like that config option fixes it.
Comment 115 LimeTech 2017-11-29 19:42:35 UTC
We have been in "indirect" contact with an engineer at AMD who offers this explanation:


> We have been investigating the issue where systems are reportedly locking up
> when idling or running small workloads.
>
> This issue is related to the power supply.  Most PC power supplies (PSUs) are
> designed to handle a wide range of power consumption from your PC components,
> but not all PSUs are created equal. 
>
> Because of this, there are some rare conditions where the power draw of an
> efficient PC does not meet the minimum power consumption requirements for one
> or more circuits inside some PSUs. 
>
> This scenario (called "minimal loading supply") can cause such PSUs to output
> poor quality power, or shut off entirely. 
>
> To prevent this issue from happening, it is important to ensure that the
> power supply supports 0A minimum load on the +12V circuit. These PSUs became
> commonplace starting in 2013 for the Intel "Haswell" platform.
>
> This specification can be found printed on the sticker affixed to most PSUs,
> or it may be available on the manufacturer’s website.
>
> However, AMD understands that not everyone is in a position to replace their
> PSU with a contemporary 0A-rated unit. To help with that, AMD is also
> developing a firmware workaround for these power supplies, and will make it
> available through motherboard partners as a BIOS update in the future.


You guys buy this?
Comment 116 Seth Jennings 2017-11-29 20:16:19 UTC
Nope, my PSU is this:
https://www.newegg.com/Product/Product.aspx?Item=N82E16817139146

Advertised "Haswell Ready" (0A min load).  I _do_ believe this is a problem with power delivery, just not from the PSU.  I think it is the on-chip power management.

Sounds like they might try to spin it.  I don't really care though as long as the firmware update fixes the issue :)
Comment 117 James Le Cuirot 2017-11-29 20:37:26 UTC
It sounds plausible but my Seasonic G-550, bought last year, was also declared to be Haswell ready.

https://www.pcper.com/news/General-Tech/Seasonic-Releases-Information-Its-Haswell-Ready-Power-Supplies

Maybe Ryzens use even less power in this state!?
Comment 118 Panagiotis Malakoudis 2017-11-29 20:52:09 UTC
I don't "buy" this explanation. My PSU (Coolermaster G650M) specifically states: Haswell C6/C7 support & zero load operation

@Klaus Mueller: What this option changes is that it disables XFR and turbo (single-core max is 3600, same as all-core) and sets the P1 state to 3600 MHz at 1.35V, as can be seen in the zenstates.py output. This configuration makes the CPU spend more time at 1.35V and less time at 0.9V or below. It does seem to make the problem less frequent.

@Seth Jennings: I agree with you, I believe there is an issue with the on-chip power management. Let's hope it can be fixed in firmware.
Comment 119 oyvinds 2017-11-30 08:00:16 UTC
>> This issue is related to the power supply.
> You guys buy this?

No. This is hogwash.

The EVGA SuperNOVA 750 G3 is a rather new PSU. The Corsair RMi Series RM650i is also fairly new. Both were released after Haswell launched in 2013. These are the ones I (ab)use for my Ryzen 1600X's. AMD's "engineer" can forget about selling me a story about how both of these PSUs are somehow flawed in a way that makes these CPUs hang shortly after boot if I leave the system idle (when not using the right kernel options/parameter).

If everyone with this problem was re-using an ancient PSU from an earlier build then there would have been a chance that this would hold water. That's just not the case, and that makes it obvious that AMD's story is hogwash.

It is still the case that the kernel configuration CONFIG_RCU_NOCB_CPU=y and the kernel option rcu_nocbs=0-11 fix this issue completely on both my Ryzen systems. This tells me that a kernel patch can fix this (or permanently mask it), because it's plainly clear that software does fix it.
Comment 120 Mathias Tillman 2017-11-30 12:29:52 UTC
Yeah, I don't buy it either. My PSU is a Fractal Design R3 1000W and is rated for a minimum current of 0A on the 12V output.
Plus, Fractal Design's support page states that:
"All power supplies that use the DC-DC method are able to output their full 3.3V/5V ratings even with no load on the 12V rail, so Tesla R2 and Newton R3 power supplies will easily support the new sleep states introduced with Intel’s Haswell platform."
Comment 121 Klaus Mueller 2017-12-01 07:12:07 UTC
(In reply to Panagiotis Malakoudis from comment #118)
> @Klaus Mueller: What this option changes is that it disables XFR and turbo
> (single core is 3600 same as all core) and it sets P1 state to 3600, 1.35V
> as can be seen in zenstates.py output. This configuration makes the CPU to
> spend longer time in 1.35V and less time to 0.9V and less. It does seem to
> make the problem less frequent.

Well, happily I couldn't measure any increased power consumption. It's unchanged compared to standard configuration.

I've been running this config for about 3 weeks now and haven't faced any problem so far - there would have been plenty of chances for it to hang.
Comment 122 Panagiotis Malakoudis 2017-12-01 07:22:37 UTC
(In reply to Klaus Mueller from comment #121)
> 
> Well, happily I couldn't measure any increased power consumption. It's
> unchanged compared to standard configuration.
> 
> I'm running this config now since about 3 weeks and I didn't face any
> problem so far - there would have been a lot of chances to hang.

I have tried it on my system, and although it didn't freeze in the first 30-60 minutes of idle like it usually does if I run without rcu threads, it did freeze when idle all night - I found it frozen in the morning. But even if it worked, I wouldn't like to use it, since such a setting limits potential single-core performance. The 1700X can go up to 3900 MHz on a single core.
Comment 123 Dimitris Tsitsipis 2017-12-01 08:46:45 UTC
Just to chime in with my case... My Ryzen 1600 (w/ Gigabyte AB350M Gaming 3) also used to lock up on stock Ubuntu 17.10 (4.13), installed on a HDD, with C6 enabled in the BIOS. The PSU is a 750W one from 2011-2012; I doubt it's 0A min load.
As in everyone's (?) case here, this was fixed with the CONFIG_RCU_NOCB_CPU workaround.

A few days ago though, I did a clean install on an SSD and thought I'd check again. It didn't lock up for the 3-4 days that I've left it on. Could one say that it has some relation to the SSD not using 12V? (It seems opposite to what one would expect? Perhaps it has something to do with the 12V power draw alternating between zero and non-zero? Are the SATA 12V rails even related to this?)
Comment 124 Klaus Mueller 2017-12-01 10:20:32 UTC
(In reply to Panagiotis Malakoudis from comment #122)
> I have tried it in my system and although it didn't freeze in the first
> 30-60 minutes of idle like it usually does if I run it without rcu threads,
> it did freeze when idle all night - found it frozen in the morning.

Thanks for testing. It obviously isn't a workaround suitable for everybody.

> But even
> if it worked, I wouldn't like to use it since such setting limits single
> core potential performance. 1700X can go up to 3900 MHz for single core
> performance.

It depends on the characteristics of the workload.
Comment 125 Francesco Biscani 2017-12-01 13:06:56 UTC
Another me too.

Originally I had a Ryzen 1700 which was suffering from both the segfault and the idle lockup issues. I sent it back via RMA, got a replacement which does not segfault anymore but which still exhibits the idle lockup.

For me, messing around with RCU kernel settings/boot params does not fix the idle lockup behaviour. Disabling C6/Cool'n'Quiet at the BIOS level does not fix it either (ASRock X370 Gaming K4 motherboard). What DOES fix the problem is disabling the C6/P-states via the zenstates.py script: with it, the system has been rock solid for weeks.

At this point I am waiting for a BIOS update before possibly going through RMA again.
Comment 126 James Le Cuirot 2017-12-01 13:24:07 UTC
(In reply to Francesco Biscani from comment #125)
> At this point I am waiting for a BIOS update before possibly going through
> RMA again.
I wouldn't go through RMA again as there's no indicator that this has been fixed in hardware.
Comment 127 Francesco Biscani 2017-12-01 13:54:52 UTC
(In reply to James Le Cuirot from comment #126)
> (In reply to Francesco Biscani from comment #125)
> > At this point I am waiting for a BIOS update before possibly going through
> > RMA again.
> I wouldn't go through RMA again as there's no indicator that this has been
> fixed in hardware.

Yeah you are probably right. Let's hope they fix this in the next AGESA update.

Running with C6 disabled is not that much of a problem: thermally I did not notice any difference, and while the power savings would be nice, it's not a deal breaker for me personally.

Still, according to some quick tests I ran, it seems like there's a ~10% performance decrease in single-threaded workloads due to the fact that, without C6, turbo mode is disabled. Would be nice to get that back.
Comment 128 Jonathan 2017-12-02 20:15:50 UTC
I've been interested in getting a new AMD CPU, but being new to Linux, this bug has scared me off so far. Compiling my own kernel is still too complex for me. So I created an account to follow this bug here and see if it gets fixed, maybe with the new Ryzens coming next February, as rumours have it.

However, I would like to ask if anyone here knows whether this bug has been happening on Threadrippers as well. AMD says the TRs are the top 5% of sorted Zeppelin dies. Therefore, if this bug is a general design issue, it should happen on those high-end workstation CPUs as well, but I could not find any complaints. If it does not happen on these TRs, then is it a manufacturing error?
Comment 129 Francesco Biscani 2017-12-02 20:35:46 UTC
(In reply to Jonathan from comment #128)
> I've been interested in getting a new AMD CPU, but being new to Linux this
> bug has scared me off sofar. Too complex for me to compile my own kernel
> yet. So I created an account to follow this bug here, and see if it gets
> fixed, maybe with the new Ryzen's coming next February, so rumours have it.
> 
> However, I would like to ask if anyone here knows if this bug has been
> happening on Threadrippers as well. AMD says the TR's are the sorted top 5
> Zeppelin dies. Therefor if this bug is a general design issue, it should
> happen on those high end workstation CPU's as well, but I could not find any
> complaints. If it does not happen on these TR's, then it's a manufacturing
> error?

It's difficult to say, as AMD has been horribly tight-lipped. Most information on these issues comes from various online resources (forums, bugzilla, etc.), and it's often anecdotal and sometimes contradictory.

My understanding is that the segfault problem (the only one publicly admitted by AMD so far) was a manufacturing issue, which was solved somewhere around the introduction of the TR. See also here:

https://www.phoronix.com/scan.php?page=news_item&px=Ryzen-Segv-Response

The other issue, the hang/reboot when idle, seems to be persisting also on newer processors, and, going by memory (again, anecdotal) I do remember at one point someone reporting this very problem on a TR.

The current speculation/wishful thinking is that the idle bug may be solved by a future AGESA/microcode update, which should be imminent:

https://www.vortez.net/news_story/its_looking_like_november_for_amds_agesa_1_7.html
Comment 130 fin4478 2017-12-03 15:56:46 UTC
(In reply to Francesco Biscani from comment #129)
> My understanding is that the segfault problem (the only one publicly
> admitted by AMD so far) was a manufacturing issue, which was solved
> somewhere around the introduction of the TR. See also here:
> 
Ryzen CPUs manufactured before week 20 or so have the segfault problem under the kind of high-stress compilation that average users are not doing. You get a new CPU from AMD after submitting an RMA.

> 
> The other issue, the hang/reboot when idle, seems to be persisting also on
> newer processors, and, going by memory (again, anecdotal) I do remember at
> one point someone reporting this very problem on a TR.
> 
> The current speculation/wishful thinking is that the idle bug may be solved
> by a future AGESA/microcode update, which should be imminent:

This bug is made by the Intel developers, who do not test their code with other hardware, and older Intel hardware is affected by this problem too. An easy workaround is found in this bug report, but for those who cannot follow the discussion, here it is again:
To prevent random kernel lockups, enable RCU_NOCB_CPU and boot the kernel with the rcu_nocbs=0-X command line parameter, where X is the CPU thread count minus 1.
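
On a GRUB-based distro that usually means something along these lines (a sketch only; the 0-15 range assumes a 16-thread CPU, and CONFIG_RCU_NOCB_CPU=y must already be set in the kernel config):

# /etc/default/grub - append the parameter to the existing line
GRUB_CMDLINE_LINUX_DEFAULT="quiet rcu_nocbs=0-15"
# then regenerate the config and reboot
update-grub                                   (Debian/Ubuntu)
grub2-mkconfig -o /boot/grub2/grub.cfg        (Fedora and friends)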
Comment 131 Panagiotis Malakoudis 2017-12-03 18:59:05 UTC
(In reply to fin4478 from comment #130)
> Ryzen cpus manufactured before week 20 or so have the segfault problem with
> high stress compilation that average users are not doing. You get a new cpu
> Amd after submitting RMA.
> 

This is not true; processors from week 28 also have the segfault problem.

> This bug is made by the intel developers who does not test their code with
> other hardware and older intel hardware is affected this problem too. Easy
> workaround is found in this bug report, put for those who can not follow
> discussions, here it is again:

This is not true either. Show what change Intel submitted to the kernel that made AMD CPUs freeze on idle. The previously mentioned commit from Seth (comment 72) has been reverted and doesn't fix the issue. And there are people for whom only disabling C6 fixes the issue; rcuo threads don't fix it for them. For me, rcuo fixes the issue - although I would prefer to say it "hides" the issue. To me, as a Computer Engineer, this is a hardware issue, much like the segfault issue. It doesn't seem to occur under Windows, and it hides itself with rcuo threads enabled.

Also, there is at least one Threadripper user who has this problem and for whom rcuo threads didn't help. Setting C6 to disabled helped him (comments 82, 91 and 107).
Comment 132 eric.c.morgan 2017-12-03 19:42:20 UTC
I made it 38 days uptime until system crash. I had RCU fixes applied, C6 enabled, cool and quiet enabled. RCU changes appear to help but not fix my issues.

I just turned off C6.

Frustrating. I want to rely on my server.
Comment 133 eric.c.morgan 2017-12-03 19:43:08 UTC
Why doesn't this issue manifest in Windows?
Comment 134 James Le Cuirot 2017-12-03 20:35:26 UTC
(In reply to eric.c.morgan from comment #133)
> Why doesn't this issue manifest in Windows?

I had previously said half jokingly that Windows never gets that idle. I have since heard that AMD have said the same thing themselves.
Comment 135 eric.c.morgan 2017-12-03 23:09:29 UTC
(In reply to James Le Cuirot from comment #134)
> (In reply to eric.c.morgan from comment #133)
> > Why doesn't this issue manifest in Windows?
> 
> I had previously said half jokingly that Windows never gets that idle. I
> have since heard that AMD have said the same thing themselves.

That's funny ;-)

While it sucks overall for us, creating a whole new CPU must be maddening to validate. You'll never hit all corner cases the first try.

I'll probably pay for a Zen 2 or whatever is next.
Comment 136 fin4478 2017-12-04 08:15:16 UTC
(In reply to Panagiotis Malakoudis from comment #131)
> (In reply to fin4478 from comment #130)
> > Ryzen cpus manufactured before week 20 or so have the segfault problem with
> > high stress compilation that average users are not doing. You get a new cpu
> > Amd after submitting RMA.
> > 
> 
> This is not true, processors from week 28 have also segfault problem.
>

I did not check what phoronix wrote. You can get a new Ryzen CPU anyway if you are compiling Mesa and hitting the segfault problem.

> To me, as a Computer Engineer, this is a hardware issue, much
> like the segfault issue. It doesn't seem to occur under Windows, and it
> hides itself with rcuo threads enabled.

I am an MSc software engineer, and when you say the Windows "virus hoover" does not have random lockups, then it is a software problem made by Wintel kernel developers. Intel develops the Linux core and tests only with their latest hardware.
Comment 137 Panagiotis Malakoudis 2017-12-04 08:30:45 UTC
(In reply to fin4478 from comment #136)
> I did not check what phoronix wrote. You can have a new Ryzen cpu anyway if
> you are compiling Mesa and having the segfault problem.

When did phoronix become a respectable place for info? When they published the "performance marginality problem exclusive to certain workloads in Linux"? phoronix wrote whatever AMD told them in order to get some Threadripper and Epyc systems for review. I don't trust phoronix. First they escalate the segfault issue, then they buy the "marginality" excuse from AMD.

It has been reported in the original thread in the AMD forums (https://community.amd.com/thread/215773) that CPUs from week 28 bought from stores still exhibit segfault issues. CPUs from week 33 and after seem to have this fixed, but maybe still not 100%.

> I am Msc software engineer and when you say win virus hoover does not have
> random lockups, then it is a software problem made by wintel kernel
> developers. Intel develops Linux core and test only with their latest
> hardware.

This "Intel develops Linux core" is a statement of no value. Unless you present some specific commits from Intel that specifically affect Ryzen (AMD's Opteron for example is not affected by this idle freeze issue), you are just talking nonsense. We should better stay with facts. Fact is that even rcuo threads enabled don't fix the issue for everyone, some CPUs require C6 disabled completely. Fact is that some CPUs don't idle freeze even with rcuo threads disabled and C6 enabled. To me this is a clear indication about a hardware and not a software issue.
Comment 138 fin4478 2017-12-04 08:40:30 UTC
(In reply to Panagiotis Malakoudis from comment #137)

> have this fixed, but maybe still not 100%.

Devices less complex than Ryzen CPUs have hardware errors too, and you can get your money back or a new product.
> 
> > I am Msc software engineer and when you say win virus hoover does not have
> > random lockups, then it is a software problem made by wintel kernel
> > developers. Intel develops Linux core and test only with their latest
> > hardware.
> 
> This "Intel develops Linux core" is a statement of no value. Unless you
> present some specific commits from Intel that specifically affect Ryzen

https://bugzilla.kernel.org/show_bug.cgi?id=197177
Comment 139 Panagiotis Malakoudis 2017-12-04 08:48:44 UTC
(In reply to fin4478 from comment #138)

> https://bugzilla.kernel.org/show_bug.cgi?id=197177

And this proves what? This commit doesn't affect Ryzen; it affects some specific motherboard with some specific BIOS/ACPI implementation. It is just some incorrect ACPI interpretation - and more probably it is what you have been told: an incorrect ACPI table compilation based on older tools.
Comment 140 Konstantin Kharlamov 2017-12-04 08:55:56 UTC
(In reply to fin4478 from comment #138)
> (In reply to Panagiotis Malakoudis from comment #137)
> > This "Intel develops Linux core" is a statement of no value. Unless you
> > present some specific commits from Intel that specifically affect Ryzen
> 
> https://bugzilla.kernel.org/show_bug.cgi?id=197177

Besides being irrelevant, you've been told your firmware is buggy, for that matter up to the point of not even working on Windows. You've also been told that inhibiting firmware errors is a work-in-progress; and I bet many users do want to know if their firmware is buggy — I do.

You've straight ignored everything, and started cussing the dev. Please stop being an asshole.
Comment 141 fin4478 2017-12-04 09:03:04 UTC
(In reply to Panagiotis Malakoudis from comment #139)
> (In reply to fin4478 from comment #138)
> 
> > https://bugzilla.kernel.org/show_bug.cgi?id=197177
> 
> And this proves that? This commit doesn't affect Ryzen, it affects some
> specific motherboard with some specific BIOS/ACPI implementation. It is just
> some false ACPI interpretation - and more probably it is what you have been
> told - an incorrect ACPI table compilation based on older tools.

Asus surely knows better than Intel what tools to use when creating a motherboard. Nobody buys Intel motherboards. Intel CPUs have had and will have hardware errors too, so let us stop here.

https://en.wikipedia.org/wiki/Pentium_FDIV_bug
Comment 142 Panagiotis Malakoudis 2017-12-04 10:50:39 UTC
(In reply to fin4478 from comment #141)
> 
> Asus sure knows better than intel what tools to use when creating a
> motherboard. Nobody buys intel motherboards. Intel cpus have had and will
> have hardware errors too, so let us stop here.
> 
> https://en.wikipedia.org/wiki/Pentium_FDIV_bug

I don't understand what you are trying to say. Sure, Intel has bugs, including ones more recent than the FDIV bug; there was a hyperthreading bug in Skylake and Kaby Lake CPUs, which was fixed by a microcode update.

The problem discussed here is probably hardware related, and we hope it can be fixed with a future microcode (AGESA) update. Working around it in software (the rcuo threads fix) would be nice, but since it doesn't fix 100% of the cases, it should be investigated further. The Intel Skylake hyperthreading issue took around 18 months to be fixed in microcode, so I can wait for a fix from AMD. But we need confirmation that a) they have reproduced the issue and are looking at it, and b) it can be fixed with a microcode update.
Comment 143 Lucio Crusca 2017-12-04 11:11:29 UTC
(In reply to Panagiotis Malakoudis from comment #142)
> I don't understand what you are trying to say, 

fin4478 (a.k.a. debianxfce@phoronix) is trolling as usual, please don't feed it.
Comment 144 Jonathan 2017-12-04 11:22:24 UTC
Is there a reliable way to test and be 100% sure a Ryzen/TR is OK or has the problem? Like there was a script to test for the segfault issue IIRC?
Comment 145 Panagiotis Malakoudis 2017-12-04 13:54:06 UTC
(In reply to Jonathan from comment #144)
> Is there a reliable way to test and be 100% sure a Ryzen/TR is OK or has the
> problem? Like there was a script to test for the segfault issue IIRC?

Leaving your computer idle overnight for a couple of nights will either trigger the issue, in which case you will find your computer frozen, or it will not. The idle freeze usually happens anywhere from 15-30 minutes up to some hours.

It should be noted here that there are two kinds of idle freeze. In the first one (discussed here) the computer locks up, but if you press the reset button it reboots fine. The second one is related to memory/CPU overclocking; the computer freezes completely (on overclock.net it is called "black screen freeze on idle") and pressing the reset button does nothing - you have to power off completely. This second one also happens under Windows. So, in order to test reliably, don't overclock the CPU or RAM - meaning RAM should run at 2400 MHz max.
Comment 146 Tom 2017-12-05 18:33:17 UTC
I just wanted to chime in as well. I have an AMD Ryzen 1600 and an ASRock X370 Taichi acting as an NFS file server. It is running Ubuntu 17.10 server with, I believe, kernel 4.13 or whatever the latest kernel is after updates. Recently I have been experiencing the exact same issue as you all. Within 24 hours of the system being idle or under low load, the system just freezes and I am unable to access it. It actually happened today as I am typing this; I am unable to remote into it, but I have it hooked up to a keyboard, mouse, and monitor. When I return home I want to see if there are errors of any kind on screen. I initially thought it may have been due to some hardware changes, but it just didn't make sense, and then I found this thread.

I am going to try disabling the C6 State and see if it helps at all and will report back. Hopefully this gets fixed with a BIOS update or something. This is a major inconvenience for server use with frequent idle loads. Thanks.
Comment 147 kernel 2017-12-06 07:16:51 UTC
I have two identical Ryzen systems, running Debian 9 "stretch" as servers.

Specs:
Ryzen 5 1600
ASRock AB350 Pro4
Crucial memory with ECC

These servers have been running fine since early August. I had one random Apache crash (segfault) on kernel 4.9 (too old for Ryzen, really), but both had been running kernel 4.12 (from stretch-backports) just fine for about 3.5 months now. Suddenly, in the last week, I have seen 3 crashes across both machines, with similar symptoms as described above. Total system freeze in all cases, sometimes with some gibberish on screen:
https://i.imgur.com/J8M1zUS.jpg
https://i.imgur.com/Qb7Z8YW.jpg
In all cases, crashes happened between 03.00 and 06.00, when no users are on and backup jobs have finished.

Since yesterday I have been running stress-ng on both servers, keeping 3 cores at 100% 24/7, to see if that stabilises the systems over the course of a week or so. This could add to the evidence that prolonged full idle is the trigger for this issue.
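
For anyone who wants to try the same keep-busy experiment, a sketch of the invocation (assuming stress-ng is installed from the distro repos; the worker count and load simply mirror what is described above):

# run 3 CPU stressor workers at 100% load; with no --timeout, stress-ng keeps running until stopped
stress-ng --cpu 3 --cpu-load 100 &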
Comment 148 kernel 2017-12-06 07:19:00 UTC
What I forgot to mention is that I find it rather peculiar that both machines were rock solid for 3.5 months and then suddenly started crashing a number of times in a week. Perhaps that is worth investigating too.
Comment 149 Panagiotis Malakoudis 2017-12-06 13:08:30 UTC
(In reply to kernel from comment #147)
> in the last week, I have seen 3 crashes across
> both machines. Similar symptoms as described above. Total system freeze in
> all cases, sometimes with some jibberish on screen:
> https://i.imgur.com/J8M1zUS.jpg
> https://i.imgur.com/Qb7Z8YW.jpg
> In all cases, crashes happened between 03.00 and 06.00, when no users are on
> and backup jobs have finished.

Your first screen shows sda disk error. Your second screen indeed shows the problem discussed here.
Comment 150 kernel 2017-12-06 13:12:12 UTC
(In reply to Panagiotis Malakoudis from comment #149)
> Your first screen shows sda disk error. Your second screen indeed shows the
> problem discussed here.

Agreed, but the machine completely locked up in the process. That disk is part of an SSD RAID1 array for root, and one disk going down or not responding should not crash the entire machine; I find it hard to believe it is unrelated. Both crashes are from the same machine, by the way.
Comment 151 eric.c.morgan 2017-12-06 13:41:00 UTC
(In reply to kernel from comment #147)
> I have two identical Ryzen systems, running Debian 9 "stretch" as servers.
> 
> Specs:
> Ryzen 5 1600
> ASRock AB350 Pro4
> Crucial memory with ECC
> 
> These servers have been running fine since early August. I had one random
> Apache crash (segfault) on kernel 4.9 (too old for Ryzen really), but both
> had been running kernel 4.12 (from stretch-backports) just fine for about
> 3.5 months now. Suddenly, in the last week, I have seen 3 crashes across
> both machines. Similar symptoms as described above. Total system freeze in
> all cases, sometimes with some jibberish on screen:
> https://i.imgur.com/J8M1zUS.jpg
> https://i.imgur.com/Qb7Z8YW.jpg
> In all cases, crashes happened between 03.00 and 06.00, when no users are on
> and backup jobs have finished.
> 
> I am since yesterday running stress-ng on both servers which keeps 3 cores
> at 100% 24/7, to see if that stabilises the systems for about a week. This
> could add to the evidence that prolonged full idle is the trigger for this
> issue.

FWIW before my RCU and C6 changes I would run a video on loop and that improved things for me. I'm not sure I would keep 50% of my CPU on full tilt like you!
Comment 152 Scott Farrell 2017-12-16 02:39:02 UTC
Specs:

AB350-Gaming 3/AB350-Gaming 3-CF, BIOS F7 06/16/2017
Ryzen 1700 week 33 
G.SKILL DDR4-2400 32GB 15-15-15-39 1.2v
4.13.13-custom #1 SMP Mon Nov 27 20:17:00 CST 2017 x86_64 GNU/Linux

Debian 9; the kernel is compiled with CONFIG_RCU_NOCB_CPU and booted with the parameter set. ASLR disabled. C-states enabled in the BIOS.

It runs samba and 6 active VMs.  The VMs are fairly idle most of the time.  The crash occurred at 6:40AM after 10 days of uptime.

"watchdog: BUG: soft lookup - CPU#x stuck for 23s! [worker:1659]"

https://photos.app.goo.gl/3ZIsi9aWXltZcixn1
https://photos.app.goo.gl/wfJFlVChftMHyqYQ2

So is the general consensus now that we simply must disable C-states completely in the BIOS if the kernel workaround is not working? I just need this system to not crash on me at this point.
Comment 153 Tom 2017-12-16 03:24:32 UTC
Hello all,

Since I posted my last comment about ten days ago, I have disabled the C6 state in the BIOS and have not had a single lockup. It has been stable. Just wanted to report back with my results. Thanks.

(In reply to Tom from comment #146)
> I just wanted to chime in as well. I have an AMD Ryzen 1600, ASRock X370
> Taichi acting as a NFS File Server. It is running Ubuntu 17.10 server with I
> believe kernel 4.13 or whatever the latest kernel is after updates. Recently
> I have been experiencing the same exact issue as you all. Within 24 of the
> system being idle or low load the system just freezes. I am unable to access
> it. It actually happened today as I am typing this, I am unable to remote
> into it but I have it hooked up to a keyboard, mouse, and monitor. When I
> return home I want to see if there are errors of anykind on screen. I
> initally thought it may have been due to some hardware changes but it just
> didnt make sense and then I found this thread.
> 
> I am going to try disabling the C6 State and see if it helps at all and will
> report back. Hopefully this gets fixed with a BIOS update or something. This
> is a major inconvenience for server use with frequent idle loads. Thanks.
Comment 154 Scott Farrell 2017-12-16 03:35:25 UTC
I forgot to mention the PSU is Corsair CX 550M.


(In reply to Scott Farrell from comment #152)
> Specs:
> 
> AB350-Gaming 3/AB350-Gaming 3-CF, BIOS F7 06/16/2017
> Ryzen 1700 week 33 
> G.SKILL DDR4-2400 32GB 15-15-15-39 1.2v
> 4.13.13-custom #1 SMP Mon Nov 27 20:17:00 CST 2017 x86_64 GNU/Linux
> 
> Debian 9 and kernel is compiled with CONFIG_RCU_NOCB_CPU and booting with
> parameter set.  ASLR disabled.  C-State enabled bios.
> 
> It runs samba and 6 active VMs.  The VMs are fairly idle most of the time. 
> The crash occurred at 6:40AM after 10 days of uptime.
> 
> "watchdog: BUG: soft lookup - CPU#x stuck for 23s! [worker:1659]"
> 
> https://photos.app.goo.gl/3ZIsi9aWXltZcixn1
> https://photos.app.goo.gl/wfJFlVChftMHyqYQ2
> 
> So is the general consensus now that we simply must disable C-State
> completely in bios if kernel work-around is not working?  I just need this
> system to not crash on me at this point.
Comment 155 Panagiotis Malakoudis 2017-12-16 09:11:56 UTC
I built a new Ryzen system with an MSI B350 motherboard and a Ryzen 1700 CPU (1733SUS, segfault-free, direct from retail). It is running the same Debian 9 system with kernel 4.13 from backports (I also tried a self-built 4.14.5). This system does not freeze on idle. Believing it is a CPU issue, I installed this CPU in my first system (Asus Prime X370, 1700X 1725SUS from RMA).

Guess what? My first system still freezes on idle with the new CPU.

Differences: the first system has a GTX 1060, the second a GT 1030. They also have different PSUs and different disks. The RAM is different too, but I have already swapped it between the systems and it made no difference.

Will continue trying hardware combinations to find the root cause, at least for my case.
Comment 156 Klaus Mueller 2017-12-16 09:46:31 UTC
(In reply to Panagiotis Malakoudis from comment #155)
> Believing it is a CPU issue, I installed this CPU to my
> first system (Asus Prime X370, 1700X 1725SUS from RMA). 
> 
> Guess what? My first system still freezes on idle with the new CPU.

Could you please tell, which Bios version you are using w/ Asus Prime X370?
Comment 157 Panagiotis Malakoudis 2017-12-16 10:33:41 UTC
BIOS is 3401, but this was happening with previous BIOS versions as well.
Comment 158 Jonathan 2017-12-20 17:35:16 UTC
Did you do further tests with your hardware combinations? Any news? It's frustrating not knowing what the cause is, and AMD not being interested in acknowledging, let alone fixing, the issue.
Comment 159 eric.c.morgan 2017-12-20 18:41:10 UTC
C6 disabled and lockup after about a week. This is getting infuriating.
Comment 160 Francesco Biscani 2017-12-20 18:44:33 UTC
(In reply to eric.c.morgan from comment #159)
> C6 disabled and lockup after about a week. This is getting infuriating.

How did you disable C6? From BIOS or using the python script?
Comment 161 Panagiotis Malakoudis 2017-12-20 18:54:42 UTC
(In reply to Jonathan from comment #158)
> Did you do further tests with your hardware combinations? Any news? It's
> frustrating not knowing what is the cause and AMD not interested in
> recognising let alone fixing the issue.

Yes I did.
I changed the graphics card; the system still froze on idle.
Now I have changed the disk. The original system was running from a USB 3.0 adapter with an mSATA SSD in it; it is now running with a SATA SSD (the one from the second system that didn't freeze).
If it freezes again, the last thing to change is the PSU. But I can't accept that it is the PSU. Most probably it is the power delivery system on the 2nd motherboard that made the 2nd system not freeze - at least not as quickly as my 1st system.

@eric.c.morgan: Freezing on idle with C6 disabled should be something completely different. Are you running with memory overclocked? I keep testing my different setups with memory always running at 2400 or 2133.
Comment 162 eric.c.morgan 2017-12-20 19:02:20 UTC
C6 disabled via bios, I'll verify with that python script I posted earlier.

Mem is stock, memtested for 24 hours too.
Comment 163 eric.c.morgan 2017-12-20 19:14:20 UTC
Crap, the .py said C6 was still enabled! I disabled via said script. Fingers crossed.
Comment 164 Francesco Biscani 2017-12-20 20:03:15 UTC
(In reply to eric.c.morgan from comment #163)
> Crap, the .py said C6 was still enabled! I disabled via said script. Fingers
> crossed.

I had the same experience: disabling C6 via the BIOS did not actually disable C6 (at least according to the script), and it did not fix the idle bug. Since I started disabling C6 via the script, I haven't experienced the bug once (in 2 months or so of usage - in normal circumstances the bug happens once every 1-2 days for me).
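
In other words, don't trust the BIOS menu: verify the effective setting from Linux. A quick sketch, assuming the stock zenstates.py and the msr kernel module (which the script needs in order to read the MSRs):

# modprobe msr
# ./zenstates.py -l | grep 'C6 State'
C6 State - Package - Disabled
C6 State - Core - Disabled

If either line still says Enabled, the BIOS toggle did not take effect and the script (or a boot-time hook) has to do the disabling.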
Comment 165 Panagiotis Malakoudis 2017-12-21 08:21:23 UTC
As expected, system froze on idle overnight again.
My last test is with the PSU from the 2nd system.

To summarize:
1st system Asus Prime X370, Ryzen 1700X, Flare X 2*8GB DDR4, mSATA SSD on USB 3.0, nvme disk (Windows install), GTX 1060 6GB, Coolermaster G650M PSU freezes on idle. It doesn't freeze with rcuo threads or C6 disabled.
2nd system MSI B350M Bazooka, Ryzen 1700, Corsair Vengeance 2*8GB DDR4, SATA SSD, GT 1030, Antec Earthwatts EA-380D PSU did not freeze on idle for 3 days, running same kernel with C6 enabled without rcuo threads. I didn't test much longer than 3 days because 1st system usually freezes overnight.

I installed the CPU, RAM, graphics card and disk from the 2nd system in the 1st system, and it still freezes on idle. Now I am testing with the PSU. The only thing left after that is the motherboard. There is also a small difference in peripheral devices (different USB keyboard, a USB webcam), and the other difference is the water cooling system on the 1st system.
Comment 166 Panagiotis Malakoudis 2017-12-21 16:34:42 UTC
The system froze with the PSU from the 2nd system. As it is, I am out of ideas.
Comment 167 Tobias Klausmann 2017-12-22 10:07:50 UTC
(In reply to Panagiotis Malakoudis from comment #166)
> System froze with the PSU from 2nd system. As it is I am out of ideas.

FWIW, at GHz speeds all circuits are analog. A propensity to crash at low load may well require (or be hastened by) a particular combination of PSU, CPU and mainboard, possibly more components. Plus, in light of Win10 being unaffected, the way the system is exercised likely matters as well.

In this case, there are at least three variables (PSU/MB/CPU). Combined with the elusive nature of an easy way to reproduce, it may well be a fool's errand to try and figure out which component is at fault.

This may also be the reason why AMD is so mum about this whole matter: there are no clear steps to make a crashing system better that aren't papering over the problem (C6 disabling, RCUO tweaking). If I had to make a decision at AMD, I'd be *very* careful to not claim a source of the error (and possibly solution) too quickly. I would, however, be more transparent about the whole ordeal.



Relatedly, I'm on an Asus Prime B350 Plus with a 1700X and 2x16GB of Kingston RAM. I used to have an AMD GPU (an ancient 6450, IIRC) and now have a GTX 1050, but crashiness didn't really vary between the two GPUs.

My first CPU had both the lockup and the segv problem, so I got it RMA'd. The new CPU has no segv problem, but it is still crashy, usually within a day of light computing, or at the very least overnight.

I haven't tried the RCUO approach yet, but with C6 disabled via script, I've gotten to two days of uptime, with very long stretches of complete idle (sitting at login prompt, not doing anything in the background).

As others have said, while this is far from perfect (I don't particularly care about the power saving or turbo, but I understand those who do), it at least reduces the risk of data loss.
Comment 168 Jonathan 2017-12-22 10:38:14 UTC
(In reply to Panagiotis Malakoudis from comment #166)
> System froze with the PSU from 2nd system. As it is I am out of ideas.

For the hell of it and completion's sake, could you try switching the USB devices as well (or remove them from system 1)?

If that still makes no difference, then perhaps we could start listing which mainboards have issues with what bios versions and see if we find a pattern there.
Comment 169 Panagiotis Malakoudis 2017-12-22 12:25:36 UTC
@Jonathan: A friend of mine has the Asus Prime X370 as I do, and he doesn't have the idle freeze issue. And I don't think a USB webcam is to blame.

I think the problem is a complex PSU/motherboard/CPU + specific-workload puzzle that I am not willing to solve anymore. I have already lost many days trying combinations. After all, I didn't pay AMD to debug their problems; I paid to use my CPU.

Since my issue is resolved with rcuo threads, I will use it like that and test every now and then whether a BIOS update solves this. And I will certainly replace my motherboard/CPU combination as soon as X470 and the new Ryzen CPUs are out.
Comment 170 Arup 2017-12-24 06:25:49 UTC
I have the same issue with my ASUS 350M and Ryzen 1700 on the latest BIOS with G.Skill RAM; all the freezes happen during idle and never under load. Sometimes it goes days without a freeze, other days it's random. I am on Arch with GNOME, using the Nvidia drivers for my 1050Ti.
Comment 171 Yibin Lin 2017-12-27 05:59:37 UTC
I would just like to report that I have almost exactly the same problem (the log shows "soft lockup - CPU#0 stuck for 22s!"), with a TP-Link TL-WDN4800 WiFi adapter. Apparently someone has exactly the same issue as me at https://forum.level1techs.com/t/solved-linux-is-unstable-ever-since-i-upgraded-to-ryzen/117541/152 .

Not sure if I share any other components with the OP in the level1techs forum. But for me the soft lockup definitely comes from the WiFi driver process (it mentions ath9k_hw_wait; ath9k is the in-kernel WiFi driver, I think).

It seems to me that a TP-Link TL-WDN4800, with a recent kernel (I tried 4.10, 4.11, 4.13), can reliably reproduce the soft lockup issue within an hour on Ubuntu 16.04. (Of course you will need to turn on WiFi.)

I have given my Ryzen CPU to a friend who is a Windows user. But I kept a copy of the log - let me know if you guys need me to upload it.
Comment 172 Yibin Lin 2017-12-27 06:03:35 UTC
(In reply to Yibin Lin from comment #171)
> Just would like to report that I have almost exactly the same problem (log
> shows "soft lockup - CPU#0 stuck for 22s!"), with TP-Link TL-WDN4800 WiFi
> adapter. Apparently someone has exactly the same issue with me at
> https://forum.level1techs.com/t/solved-linux-is-unstable-ever-since-i-
> upgraded-to-ryzen/117541/152 .
> 
> Not sure if I have other same components with the OP in the level1techs
> forum. But for me the soft lockup definitely comes from the WiFi driver
> process (it mentions about ath9k_hw_wait, which is the WiFi driver in kernel
> I think).
> 
> It seems to me that TP-Link TL-WDN4800, with a recent kernel (I tried 4.10,
> 4.11, 4.13) can reliably re-produce the soft lockup issue within an hour in
> Ubuntu 16.04. (Of course you will need to turn on WiFi)
> 
> I have given my Ryzen CPU to a friend who is a windows user. But I kept a
> copy of the log - let me know if you guys need me to upload it.

Some additional information: I tried disabling C6 and with the WiFi card it still freezes (that's why I gave away my Ryzen). I have Ryzen 5 1600, EVGA 550 G3 PSU, Asus ROG Strix B350-f motherboard.
Comment 173 Francesco Biscani 2017-12-27 10:38:54 UTC
(In reply to Yibin Lin from comment #172)
> Some additional information: I tried disabling C6 and with the WiFi card it
> still freezes (that's why I gave away my Ryzen). I have Ryzen 5 1600, EVGA
> 550 G3 PSU, Asus ROG Strix B350-f motherboard.

How did you disable C6? Via BIOS or the Python script posted earlier?
Comment 174 Yibin Lin 2017-12-28 04:03:22 UTC
(In reply to Francesco Biscani from comment #173)
> (In reply to Yibin Lin from comment #172)
> > Some additional information: I tried disabling C6 and with the WiFi card it
> > still freezes (that's why I gave away my Ryzen). I have Ryzen 5 1600, EVGA
> > 550 G3 PSU, Asus ROG Strix B350-f motherboard.
> 
> How did you disable C6? Via BIOS or the Python script posted earlier?

I think I did both(?). After I disabled C6 in the BIOS, I ran the Python script and it showed two entries for C6 states, one disabled (I guess that's the effect of the BIOS setting) and the other enabled. I then wrote a simple systemd unit to disable the other C6 state at Ubuntu startup time. But normally within half an hour my computer would still freeze.
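
For anyone wanting to replicate this, a minimal sketch of such a startup unit (the path /usr/local/bin/zenstates.py and the unit name are just examples, not something standard):

# /etc/systemd/system/zenstates.service
[Unit]
Description=Disable Ryzen C6 via zenstates.py

[Service]
Type=oneshot
ExecStartPre=/sbin/modprobe msr
ExecStart=/usr/bin/python /usr/local/bin/zenstates.py --c6-disable

[Install]
WantedBy=multi-user.target

# systemctl daemon-reload && systemctl enable --now zenstates.service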
Comment 175 Francesco Biscani 2017-12-28 20:39:42 UTC
(In reply to Yibin Lin from comment #174)
> I think I did both..(?) After I disabled C6 in the BIOS, I ran the Python
> script and it shows two things about C6 states, one disabled (I guess that's
> the effect of BIOS setup) and the other enabled. I then wrote a simple
> systemd script to disable the other C6 state at the Ubuntu startup time. But
> normally within half an hour my computer would freeze.

Thanks for the info.

That's a bummer, I was becoming convinced that C6 was at the core of the issue.

For reference, for the last 2 months I have been executing these commands at every startup:

python zenstates.py --c6-disable
python zenstates.py --disable -p0
python zenstates.py --disable -p1
python zenstates.py --disable -p2

And I haven't had a single issue so far. I will start experimenting with leaving C6 on and disabling only the P-states, just to see what happens. Still waiting for the next BIOS update for my motherboard, and then it's almost time for the release of Ryzen 2...
Comment 176 kernel 2017-12-28 22:29:53 UTC
I've seen something new on one of my two affected machines. Note I have C6 *on*, but I'm running stress-ng 24/7 which keeps one core at 100%. The machines have been rock-solid since I started doing that, about 3-4 weeks uptime now.

Anyway, this message came up on one of them:
[1905495.738522] Uhhuh. NMI received for unknown reason 3d on CPU 8.
[1905495.738523] Do you have a strange power saving mode enabled?
[1905495.738524] Dazed and confused, but trying to continue

I have no idea what to make of this, perhaps it can help.
Comment 177 eric.c.morgan 2017-12-29 01:32:58 UTC
(In reply to kernel from comment #176)
> I've seen something new on one of my two affected machines. Note I have C6
> *on*, but I'm running stress-ng 24/7 which keeps one core at 100%. The
> machines have been rock-solid since I started doing that, about 3-4 weeks
> uptime now.
> 
> Anyway, this message came up on one of them:
> [1905495.738522] Uhhuh. NMI received for unknown reason 3d on CPU 8.
> [1905495.738523] Do you have a strange power saving mode enabled?
> [1905495.738524] Dazed and confused, but trying to continue
> 
> I have no idea what to make of this, perhaps it can help.

I can't even with those errors. Wow.

Regarding the loaded CPU: I did the same with about 15% CPU load 24/7 and had similar results. I'm now testing with C6 turned off via just the python script.

This whole mess is getting really annoying though. I want to support AMD, but man, I might have to go back to Intel for a totally reliable machine. Segfault first, RMA pains, and now random crashes at low load? FFS...
Comment 178 Sergio C. 2018-01-08 19:27:55 UTC
(In reply to kernel from comment #176)
> I've seen something new on one of my two affected machines. Note I have C6
> *on*, but I'm running stress-ng 24/7 which keeps one core at 100%. The
> machines have been rock-solid since I started doing that, about 3-4 weeks
> uptime now.
> 
> Anyway, this message came up on one of them:
> [1905495.738522] Uhhuh. NMI received for unknown reason 3d on CPU 8.
> [1905495.738523] Do you have a strange power saving mode enabled?
> [1905495.738524] Dazed and confused, but trying to continue
> 
> I have no idea what to make of this, perhaps it can help.

As per this thread, this message could be due to your GPU, for instance: https://bbs.archlinux.org/viewtopic.php?id=121291
Comment 179 Moritz Naumann 2018-01-12 07:24:09 UTC
Several Ubuntu users seem to be able to work around this bug as discussed in
https://bugs.launchpad.net/linux/+bug/1690085 , more specifically in https://bugs.launchpad.net/linux/+bug/1690085/comments/69
Comment 180 darkbasic 2018-01-12 17:18:57 UTC
Didn't know about it, that's really bad! I'm subscribing to make sure I don't buy a Ryzen until this gets fixed.
Comment 181 James Le Cuirot 2018-01-12 17:32:59 UTC
(In reply to Moritz Naumann from comment #179)
> Several Ubuntu users seem to be able to work around this bug as discussed in
> https://bugs.launchpad.net/linux/+bug/1690085 , more specifically in
> https://bugs.launchpad.net/linux/+bug/1690085/comments/69

That's just the same workaround we've been discussing the whole time. It works but it's not 100% effective. I've maybe had one or two freezes with it over the course of a few months.
Comment 182 Darren Salt 2018-01-12 19:36:28 UTC
Created attachment 273573 [details]
Workaround init script (derived from zenstates.py)

I'm using the attached init script to work around the problem. I can be fairly sure of it hanging (or even rebooting) within a day or two without this; no issues with it in place.

CONFIG_RCU_NOCB_CPU is not set here.
Comment 183 Ashley Bone 2018-01-19 13:49:48 UTC
Apologies if this has been mentioned or is completely obvious, but if you're using zenstates or ryzen-stabilizator or some other software means to disable C6, make sure you run it after resuming from S3.  At least on my system, C6 is re-enabled after I resume from sleep.
Comment 184 Chris Siebenmann 2018-01-26 18:11:04 UTC
I'm also experiencing what appears to be this same issue, on a ASUS
Prime X370-PRO motherboard and a Ryzen 1800X running Fedora 27 (with no
overclocking and Kingston server RAM with ECC). I've been able to capture
kernel logs through netconsole and, as with other people, they're reported
to run through smp_call_function_many and/or smp_call_function_single
(usually called from TLB flushing code).

Booting my system with 'rcu_nocbs=0-15 processor.max_cstate=5' appears
to have made it stable (again, as with other people).
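For reference, a sketch of how such parameters are usually made persistent on a GRUB-based distro; file locations and the regeneration command differ between distros, so treat the exact paths below as assumptions:

  # /etc/default/grub -- append the parameters to the existing line
  GRUB_CMDLINE_LINUX_DEFAULT="quiet rcu_nocbs=0-15 processor.max_cstate=5"

  # then regenerate the GRUB configuration, e.g.
  sudo update-grub                                  # Debian/Ubuntu
  sudo grub2-mkconfig -o /boot/grub2/grub.cfg       # Fedora (BIOS boot; the EFI path differs)

The rcu_nocbs range should cover all threads of the CPU (0-15 for an 8-core/16-thread part, 0-11 for a 6-core/12-thread part).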
Comment 185 Chris Hall 2018-02-17 16:04:48 UTC
I have Ryzen 7 1800X, ASUS Prime X370-PRO, with CPU replaced by AMD.

The damn thing has not worked properly since I bought it.

I have: kernel 4.14.18-300.fc27 
        CONFIG_RCU_NOCB_CPU=y
        rcu-nocbs=0-15

I have seen streams of "watchdog: BUG: soft lockup", but mostly the system just stops and I can find no logging that tells me why.

Currently I find the machine has stopped over night (when it is idle), every two or three days.

I have just installed kernel 4.15.3-300.fc27.  But I can find nothing to suggest that this fault is understood, let alone fixed.

I have just added processor.max_cstate=5 -- though I have failed to find any way to check what difference that has made.

So: is the "work-around" still the only available solution ?

[FWIW, while the AMD CPU went through the RMA process, I bought an Intel Coffee Lake i7-8700K machine.  Will I buy another AMD ?  I doubt it.]
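Regarding the question above about checking what processor.max_cstate actually changed: one generic way is to look at the cpuidle sysfs interface.

  cat /proc/cmdline                                          # confirm the parameter really made it onto the command line
  grep . /sys/devices/system/cpu/cpu0/cpuidle/state*/name    # C-states the kernel idle driver currently offers
  cpupower idle-info                                         # same information, if the cpupower tool is installed

Note this only shows what the kernel's idle driver uses; it does not reflect the MSR-level C6 enable bits that zenstates.py toggles, which is likely why the parameter appears to change nothing here.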
Comment 186 John-Paul Herold 2018-02-17 19:57:17 UTC
There really has to be something else at play here. With my system everything has been stable since finding this thread and setting rcu-nocbs. But my co-worker who built an almost identical system (only differences being 1800X and non QVL memory) is experiencing the problem and close to giving up. My system specs:

Fedora 26
kernel 4.14.18
CONFIG_RCU_NOCB_CPU=y
rcu-nocbs=0-15

Ryzen 7 1700x (week 37)
ASrock Fatal1ty AB350 Gaming-ITX/ac (4.40 bios)
32gb QVL memory running at rated 2933 via profile
Other stuff: AMD GPU, Corsair PSU, Samsung SSD

Bios settings are pretty much stock apart from memory profile, enabling virtualization stuff, and tweaking fan profiles. Nothing tweaked related to C6 (bios or zenstates script).


To help aggregate all our experiences and try to find some patterns, what are everyone's thoughts on starting a Google Form spreadsheet with columns for:
- CPU
- CPU week: yes, not related really but good to know in general
- Mobo: helpful for finding others in "your" situation
- Mobo Chipset: curious if 350 vs 370 shows anything significant. Mobo name would probably contain this, but having a dedicated column to sort is helpful
- Mobo BIOS version
- (boolean) QVL memory: we know Ryzen is sensitive to memory, so would be good data point to collect
- (boolean) CONFIG_RCU_NOCB_CPU
- (boolean) rcu-nocbs: bool, since the value depends on the processor and isn't something I think we'd misalign with the number of threads
- C6 tweaks: "false" or what you did (e.g. bios, zenstates.py, etc)
- Distro
- Kernel
- (boolean) Passes stress test: to help eliminate hardware instability from OC'ing or faulty memory sticks, we can define what qualifies as "true" here
- (linear scale) Stable: from "my life sucks and I can't use the system" to "my life on Ryzen is awesome and I forget the last time I rebooted" (while staying conscious that bugs and crashes exist outside of Ryzen-related issues).


I want to try and keep it as simple/straightforward as possible so it complements this thread and helps us see the data in a more structured format. Hopefully we can find a pattern! I also can't help but wonder why more people are not reporting issues. I'd be curious whether any people in the levelonetechs community can get Wendell & Co. to draw more attention to this. AMD is making great strides for open source and Linux-friendly hardware, but this problem really hurts that effort.


**EDIT** v1 of form is live: https://goo.gl/forms/oGCTPnNK0vJtNntj2

If you want to be a collaborator on the form, message me directly. Also share feedback on how to make the form better suited for our goal!
Comment 187 eric.c.morgan 2018-02-17 20:26:13 UTC
(In reply to Chris Hall from comment #185)
> I have Ryzen 7 1800X, ASUS Prime X370-PRO, with CPU replaced by AMD.
> 
> The damn thing has not worked properly since I bought it.
> 
> I have: kernel 4.14.18-300.fc27 
>         CONFIG_RCU_NOCB_CPU=y
>         rcu-nocbs=0-15
> 
> I have seen streams of "watchdog: BUG: soft lockup", but mostly the system
> just stops and I can find no logging that tells me why.
> 
> Currently I find the machine has stopped over night (when it is idle), every
> two or three days.
> 
> I have just installed kernel 4.15.3-300.fc27.  But I can find nothing to
> suggest that this fault is understood, let alone fixed.
> 
> I have just added processor.max_cstate=5 -- though I have failed to find any
> way to check what difference that has made.
> 
> So: is the "work-around" still the only available solution ?
> 
> [FWIW, while the AMD CPU went through the RMA process, I bought an Intel
> Coffee Lake i7-8700K machine.  Will I buy another AMD ?  I doubt it.]

Run the python script mentioned earlier. 

For me new kernel/params with RCU setting enabled & python script to turn off C states has had my machine running for 60 days now.
Comment 188 Francesco Biscani 2018-02-18 14:21:08 UTC
(In reply to eric.c.morgan from comment #187)
> Run the python script mentioned earlier. 

+1, disabling C6 has resulted in a rock-solid experience over here, no idle reboots in months of constant use of the machine.

Just wanted to mention a couple of points for those still following this thread.

I recently updated my motherboard's BIOS to the latest version, which contains a CPU microcode update from AMD (I am now on version "0x08001136", whereas I was stuck on version "0x08001129" for - I believe - 6 months or so). I had high hopes a microcode update would fix the idle reboots, but, alas, they still persist with the new microcode and C6 enabled.

The other interesting tidbit is this article from 1 year ago I found:

https://www.gamersnexus.net/news-pc/2870-ryzen-power-plan-update-min-frequency-90-pct

Apparently, soon after the Ryzen launch last year, AMD issued an updated version of the Windows "Balanced" preset power plan which disables "core parking". Doing a bit of research, it seems like core parking is synonymous with (or at least strongly related to) the C6 power state. Might this explain why people on Windows are not experiencing random reboots? I am not 100% sure how these Windows preset power plans work, but the page linked above seems to indicate that:

1) after the update, the "Balanced" power plan disables core parking,
2) core parking was always disabled anyway in the "Performance" power plan.

See also this thread on the AMD community forum:

https://community.amd.com/thread/220563

Maybe AMD knew all along there are issues with the C6 power state...
Comment 189 kernel 2018-02-18 15:11:28 UTC
Nice work John-Paul. I've just put in my entry. I'm curious to see whether *anyone* has experienced these freezes when C6 is disabled. If not then I think we can be fairly sure the culprit has been found.

My own stress-ng approach is also still holding strong. Both servers have been rock solid since early December, with C6 *on* and no kernel options. This adds to the evidence that prolonged full idle is the trigger for this issue.

This was brought up before, but can we get onto an AMD engineer with regards to this? It'd be particularly interesting to ask about Francesco's hypothesis.
Comment 190 John-Paul Herold 2018-02-19 15:03:42 UTC
I enabled the setting so you can see charts and summary of data after submission, but for those who missed that or just want a direct link to the spreadsheet, here it is: https://docs.google.com/spreadsheets/d/1PuIkhxbxdE2H7fjbmfWYKBJI_iw8A6W1h-SZnQUXBSI/edit?usp=sharing

Pretty interesting stuff, only 7 responses total so far so sample size way too small to draw conclusions. However, initial notes I find interesting:

- 2 of 7 still have stability issues, one has a >1725 chip (1744). Again, this issue is not related to the segfault stuff, but it's a good data point that newer steppings still show the behavior.
- The only two responses that aren't using QVL-listed memory are the two still having issues. Both happen to be ECC memory as well. May or may not be relevant, because these two do not have any RCU tweaks in place.
- Of the 5 who have stable systems, the CPU models vary, the chipsets vary, BIOS and AGESA vary from new to old.
- The only constant from 5 "stable" systems is "CONFIG_RCU_NOCB_CPU=y". Whether C-States was tweaked, or rcu-nocbs, or ASLR, there are mixed responses.
- Good on everyone for testing system/hardware stability!

Thank you to those who shared! Let's get more data!
Comment 191 kernel 2018-02-19 16:40:48 UTC
John-Paul, I'd like to note that I entered that I still have issues because with stress-ng not running to create artificial load, the problem persists. As in my previous comment, with a single thread at 100% 24/7 I do *not* have issues.

I did this originally so as to provide more data; keeping the system from entering full idle will also solve the problem, without BIOS tweaks or custom kernel options. n=2 for this result.
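If someone wants to replicate this "never fully idle" workaround in a less manual way, here is a minimal sketch of a service that keeps one worker busy permanently. The unit name and load level are assumptions; one stress-ng CPU worker matches what was described above.

  # /etc/systemd/system/ryzen-anti-idle.service  (hypothetical)
  [Unit]
  Description=Keep one CPU worker busy to prevent deep package idle

  [Service]
  ExecStart=/usr/bin/stress-ng --cpu 1
  Restart=always

  [Install]
  WantedBy=multi-user.target

Obviously this trades idle power consumption for stability, so it is more of a data point than a recommendation.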
Comment 192 Chris Hall 2018-02-19 18:02:32 UTC
OK...

  []# cat /proc/cmdline
  BOOT_IMAGE=/vmlinuz-4.15.3-300.fc27.x86_64 root=... ro rhgb quiet rcu_nocbs=0-15 processor.max_cstate=5

  []# python zenstates.py -l
  P0 - Enabled - FID = 90 - DID = 8 - VID = 20 - Ratio = 36.00 - vCore = 1.35000
  P1 - Enabled - FID = 80 - DID = 8 - VID = 2C - Ratio = 32.00 - vCore = 1.27500
  P2 - Enabled - FID = 84 - DID = C - VID = 68 - Ratio = 22.00 - vCore = 0.90000
  P3 - Disabled
  P4 - Disabled
  P5 - Disabled
  P6 - Disabled
  P7 - Disabled
  C6 State - Package - Enabled
  C6 State - Core - Enabled

So... kernel command line option "processor.max_cstate=5" does not appear to disable C6.  [And removing the option did not seem to change anything, either.]

But...

  []# python zenstates.py --c6-disable
  Disabling C6 state

  []# python zenstates.py -l
  P0 - Enabled - FID = 90 - DID = 8 - VID = 20 - Ratio = 36.00 - vCore = 1.35000
  P1 - Enabled - FID = 80 - DID = 8 - VID = 2C - Ratio = 32.00 - vCore = 1.27500
  P2 - Enabled - FID = 84 - DID = C - VID = 68 - Ratio = 22.00 - vCore = 0.90000
  P3 - Disabled
  P4 - Disabled
  P5 - Disabled
  P6 - Disabled
  P7 - Disabled
  C6 State - Package - Disabled
  C6 State - Core - Disabled

So... having added "python zenstates.py --c6-disable" to rc.local (eventually), I now appear to have C6 disabled when the system is rebooted.

Do I have to do anything else to make sure that C6 *stays* disabled ?

Chris
Comment 193 Ashley Bone 2018-02-19 19:02:38 UTC
(In reply to Chris Hall from comment #192)
> OK...
> 
>   []# cat /proc/cmdline
>   BOOT_IMAGE=/vmlinuz-4.15.3-300.fc27.x86_64 root=... ro rhgb quiet
> rcu_nocbs=0-15 processor.max_cstate=5
> 
>   []# python zenstates.py -l
>   P0 - Enabled - FID = 90 - DID = 8 - VID = 20 - Ratio = 36.00 - vCore =
> 1.35000
>   P1 - Enabled - FID = 80 - DID = 8 - VID = 2C - Ratio = 32.00 - vCore =
> 1.27500
>   P2 - Enabled - FID = 84 - DID = C - VID = 68 - Ratio = 22.00 - vCore =
> 0.90000
>   P3 - Disabled
>   P4 - Disabled
>   P5 - Disabled
>   P6 - Disabled
>   P7 - Disabled
>   C6 State - Package - Enabled
>   C6 State - Core - Enabled
> 
> So... kernel command line option "processor.max_cstate=5" does not appear to
> disable C6.  [And removing the option did not seem to change anything,
> either.]
> 
> But...
> 
>   []# python zenstates.py --c6-disable
>   Disabling C6 state
> 
>   []# python zenstates.py -l
>   P0 - Enabled - FID = 90 - DID = 8 - VID = 20 - Ratio = 36.00 - vCore =
> 1.35000
>   P1 - Enabled - FID = 80 - DID = 8 - VID = 2C - Ratio = 32.00 - vCore =
> 1.27500
>   P2 - Enabled - FID = 84 - DID = C - VID = 68 - Ratio = 22.00 - vCore =
> 0.90000
>   P3 - Disabled
>   P4 - Disabled
>   P5 - Disabled
>   P6 - Disabled
>   P7 - Disabled
>   C6 State - Package - Disabled
>   C6 State - Core - Disabled
> 
> So... having added "python zenstates.py --c6-disable" to rc.local
> (eventually), I now appear to have C6 disabled when the system is rebooted.
> 
> Do I have to do anything else to make sure that C6 *stays* disabled ?
> 
> Chris

If you suspend your system, make sure you run zenstates again upon resume.  Otherwise you should be ok.
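A minimal sketch of automating that re-apply step: systemd runs executables placed in /usr/lib/systemd/system-sleep/ with "pre"/"post" arguments around suspend, so a small hook can re-run the script on resume. The file name and the zenstates.py path are assumptions.

  # /usr/lib/systemd/system-sleep/zenstates-resume  (hypothetical, must be executable)
  #!/bin/sh
  # Re-apply the C6 workaround after resume, since firmware re-enables it.
  if [ "$1" = "post" ]; then
      /usr/bin/python /opt/ZenStates-Linux/zenstates.py --c6-disable
  fi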
Comment 194 it 2018-02-23 18:28:19 UTC
I want to share my findings. Today I googled for "Ryzen C6 state" and found this reddit thread: https://www.reddit.com/r/Amd/comments/7ita4h/why_its_suggested_to_disable_c6_and_cstate_global/

The last comment regarding DRAM "Power Down Enable" and instabilities combined with C6 caught my interest. So I searched the UEFI of my ASRock AB350 Pro4 and found that option, but it was already disabled - still, it may be worth looking into if you experience idle freezes.

But I found another interesting option called "Power Supply Idle Control". Didn't find much documentation about it except: https://lime-technology.com/forums/topic/61767-amd-ryzen-update/?tab=comments#comment-607258

As was mentioned before, AMD claims the idle freezes are related to the PSU. I doubt my PSU is unable to support that low-power mode, but here http://forum.asrock.com/forum_posts.asp?TID=6832&title=ramocvoltagecantlowerthanxmpvoltageb350pro4-420
someone states that the "Typical current Idle" setting changes the idle voltage of the cores. Perhaps that might make a difference?
Anyone tried that?

I've set it to "Typical current Idle" for now and we'll see. 3h of idle uptime so far - not bad :-)

Oh, and I was a little astonished to see the AGESA version shown in my UEFI as 1.0.0.1a. In case anyone else wonders: it seems that AMD has reset the versioning for whatever reason.
Comment 195 ChrisB 2018-02-25 09:17:22 UTC
(In reply to John-Paul Herold from comment #186)

Linux version 4.15.5-gentoo (gcc version 7.3.0 (Gentoo 7.3.0 p1.0)) #1 SMP Sat Feb 24 08:35:43 CET 2018
Command line: BOOT_IMAGE=/kernel-4.15.5-prod rcu_nocbs=0-15 root=/dev/sdb5 ro

Boot Mode is legacy - no tweaks, C6 is on

Hardware R7 1700 (RMA'ed because of segfault problem)
GA-AX370-Gaming K7 Bios F22b
32 GB RAM no QVL

The lockup is reproducible after a long idle/sleep (more than 30 min). I didn't find anything in the logs.

With CONFIG_RCU_NOCB_CPU=n I had the lockups. Now I use CONFIG_RCU_NOCB_CPU=y with the command line above, and it has been stable for 3 days.

I use the "ondemand" CPU governor and the frequency range is 1400 MHz up to 3100 MHz. That's wider than the "AMD balanced power plan for Ryzen" that you can see on Win10.
Comment 196 Jonathan 2018-02-25 17:34:35 UTC
Question, as I'm curious what the impact is: how much more power would, say, a Ryzen 1800X use at idle or under light use (like desktop work or web browsing) with these 'fixes' enabled?
Comment 197 it 2018-03-02 18:06:01 UTC
FYI: I just passed 1 week uptime without any freeze. I never before managed to have such a long uptime. So the "Power Supply Idle Control" option in BIOS seems to really make a difference.
Comment 198 rtux 2018-03-02 18:29:29 UTC
Having an uptime of 5 days now. Seems to be stable.
Was down to ~4-6h between lockups.

Running successfully with:
	rcu_nocbs=0-11
	zenstates.py --c6-disable
On
	4.13.0-36-generic (HWE kernel)
	Ubuntu 16.04.4 LTS
	AMD Ryzen 5 1600X
	Asus PRIME X370-PRO (Bios 3803)

I'm not sure if I would have needed the RCU. C6 disable was a must.
Comment 199 eric.c.morgan 2018-03-02 18:39:22 UTC
(In reply to rtux from comment #198)
> Having an uptime of 5 days now. Seems to be stable.
> Was down to ~4-6h between lockups.
> 
> Running successfully with:
>       rcu_nocbs=0-11
>       zenstates.py --c6-disable
> On
>       4.13.0-36-generic (HWE kernel)
>       Ubuntu 16.04.4 LTS
>       AMD Ryzen 5 1600X
>       Asus PRIME X370-PRO (Bios 3803)
> 
> I'm not sure if I would have needed the RCU. C6 disable was a must.

RCU did not fix my issue, C6 disabling script did. I'm at 75 days uptime.
Comment 200 Artem Hluvchynskyi 2018-03-03 21:01:05 UTC
I think it might be solved for me by the latest ASRock X370 Gaming K4 firmware.

So, I have an RMA'd Ryzen 1700X, which had this freezing problem since receiving it. The workaround I applied was disabling the C6 state in BIOS, which stopped this. After installing BIOS 4.50 (currently latest) I noticed there's no such option any more so I started using the zenstates script on boot to do this. However, a week ago I was reading more on this issue and the "Power Supply Idle Control" option in the BIOS. So I decided to stop disabling C6 while keeping the option at default (where it should be adaptive) and see how it goes. During this week I tried leaving idle over night, whole day, using as usual, causing sustained high load, etc. and there's been no freezing so far.

Might be just luck of course, but before, with C6 enabled, it would freeze quite reliably even with normal daily usage. Or perhaps it's just a lot less common now.
Comment 201 AMD Linux User 2018-03-04 20:17:24 UTC
Good evening,

I was facing the same issues most people are facing here: the so-called AMD Ryzen soft-lock bug, happening when the CPU was idling. Digging around the Internet, I found some interesting information that avoids having to disable C-states, which is good for CPU life expectancy and energy saving.
It is all about tweaking the component voltages and frequencies. For interested people, this is described here: http://www.silence.host/node/1

To Artem Hluvchynskyi, please let us know more about this. As guessed by some, this definitely looks to be a power issue, and not a kernel one.
Comment 202 Artem Hluvchynskyi 2018-03-07 21:00:26 UTC
Unfortunately just got a freeze again after 1.5 weeks. So I ended up switching "Power Supply Idle Control" to "Typical" and apparently what it does is actually disabling the C6 state on the CPU package. At least that's what MSR shows (through zenstates script) and I can't see Vcore going lower than 0.8V so it does indeed look like C6 is not being reached. This should make it stable but it really sucks. Will later check if the most fresh firmware (released these days) changes anything and how RCU offload workaround affects stability.
Comment 203 Alexander von Gluck 2018-03-11 16:41:13 UTC
Same issue here for a while on Fedora 27.  System seems to hard-lock when idle almost daily.  (I've never had a lockup while "in-use")


Ryzen X1800
PRIME X370-PRO + Latest BIOS
32GB Ram
amd_iommu=on
I noticed the issue completely stopped ~4.15.3, but definitely has returned on 4.15.6

I'm trying the "rcu_nocbs=0-15" fix since it has been driving me nuts.
Comment 204 Jonathan 2018-03-11 17:03:13 UTC
Is anyone here planning on getting a Ryzen 2 a month or so from now? With or without a new x470/b450 mainboard? I'm curious if AMD, even if they haven't really acknowledged the issue, has actually worked on the problem and fixed it in the new CPU or mainboards.
Comment 205 oyvinds 2018-03-12 07:02:27 UTC
> I'm trying the "rcu_nocbs=0-15" fix since it has been driving me nuts.
This may not solve it for you. I use rcu_nocbs=0-15 and I made systemd services, run on boot and on return from suspend, which execute the script attached to this bug https://bugzilla.kernel.org/attachment.cgi?id=273573 to disable C6. The combination of those two solves it on both my boxes; the one with an ASUS board hung regularly with just rcu_nocbs=0-15 but works fine with both.

Unrelated, but it's interesting that AMD is very silent about this total scandal of a bug in their system. I won't be buying a new Ryzen 2 when it's released (not that I would anyway, the current one is fast enough). Intel does not have this problem. It is further a total outrage that AMD has been trying to blame this on PSUs. If two brand-new high-end PSUs from two different brands result in the exact same problem then it's clearly AMD's fault.
Comment 206 Alexander von Gluck 2018-03-12 11:30:40 UTC
Are we sure this isn't an Asus BIOS bug? I see a *lot* of Asus boards in this thread.
Comment 207 James Le Cuirot 2018-03-12 11:43:25 UTC
(In reply to Alexander von Gluck from comment #206)
> Are we sure this isn't an Asus BIOS bug? I see a *lot* of Asus boards in
> this thread.

I don't think the rest of us without ASUS boards are imagining this issue.
Comment 208 oyvinds 2018-03-12 21:17:17 UTC
(In reply to Alexander von Gluck from comment #206)
> Are we sure this isn't an Asus BIOS bug?
Yes. I have two 1600X's, one on an Asus board and the other on a Gigabyte board. Switching PSU/RAM/GPU doesn't change anything; both have the idle bug in every combination. This total scandal is all AMD's fault.
Comment 209 AMD Linux User 2018-03-13 12:15:57 UTC
Asus is AMD's main partner for Ryzen chipsets. AMD works with them
first, and when they decide a fix is stable, the BIOS updates are
propagated to the other vendors, e.g. MSI or Gigabyte.

The issue _is_ a power issue (not a PSU issue). Giving a bit more power
to the chipset, CPU and memory generally solves the bug. The bug also
existed in a different form on Windows, and exists in the exact same
form on BSD.

Please consider this (it is something different from disabling
C-states). Google it, or consider reading this: http://www.silence.host/node/1


Comment 210 OptionalRealName 2018-03-14 00:09:57 UTC
Is anyone from AMD aware of this at all?

I don't have a Ryzen system but I support someone who does and this appears to be going on for them too.

Does anyone here, or anyone you know, have contacts within AMD to make them aware of the issue?
Comment 211 James Le Cuirot 2018-03-14 09:46:52 UTC
(In reply to OptionalRealName from comment #210)
> Is anyone from AMD aware of this at all?

They know, they're just not admitting it publicly.
Comment 212 Chris Hall 2018-03-14 12:37:18 UTC
(In reply to OptionalRealName from comment #210)
> Is anyone from AMD aware of this at all?

I opened a ticket with them on 25-Feb-2018.  

I received a response on 28-Feb-2018, to which I replied as follows:

[start_______________________________________________________________________
On 28/02/18 10:51, TECH.SUPPORT@AMD.COM wrote:
> Dear Chris,
>
> Your service request : SR #{ticketno:[8200794428]} has been reviewed and 
> updated.
>
> Response and Service Request History:
>
> Thank you for the response.
>
> I understand you are experiencing an issue on your PC with Ryzen processor
> when C6 state is enabled on the BIOS.

Yes, my PC freezes at random when it is idle.  Typically it will freeze when left overnight, roughly every 2 or 3 days.

Having disabled C6 -- using the 'zenstates.py' script, from https://github.com/r4m0n/ZenStates-Linux -- my machine has not frozen for 7 days.

> This issue has been fixed with the latest BIOS updates, but the option to fix
> it may not be available in all BIOS.

What is the root cause of the issue ?

In what way has it been fixed ?

> I request you to update to the latest BIOS and see if you have the Power 
> Supply Control option in the MB BIOS. Try toggling this option between 
> the different settings to see if it fixes it. If the specific option is 
> not available I would suggest you keep C6 off for now.

I have the latest available "PRIME X370-PRO BIOS 3803" from ASUS.  That apparently includes:

  2.Update to AGESA 1000a for new upcoming processors

I understand that means AGESA 1.0.0.0a (?) -- I have no idea what that means, since AMD seems to keep the release notes for AGESA a deep, dark secret.  Previous BIOSes had (according to ASUS) "AGESA 1071", and before that "AGESA 1.0.0.6B" and "AGESA 1.0.0.6a"... so I admit to being baffled.

What does this new BIOS "Power Supply Option" do ?

Are you telling me that this is a problem with my power supply ?

If so, does this mean I need a better power supply ?

Disabling C6 is not really a long term solution... since that disables both (a) the maximum single core performance, and (b) the minimum power consumption state.  While these are arguably marginal, I have wasted a lot of time and energy trying to get my machine to work reliably.

I am seriously disappointed that the only information available is buried in kernel bug report(s) and in various support forums.

Having (eventually) found Linux Kernel Bug 196683, I have been hoping that AMD would leap into action to: (a) inject some clarity into the discussion, and (b) provide a proper solution.

<sigh>

For completeness, let me repeat my questions:

  1) What is the root cause of the issue ?

  2) In what way has it been fixed ?

  3) What does this new BIOS "Power Supply Option" do ?

  4) Are you telling me that this is a problem with my power supply ?

  5) If so, does this mean I need a better power supply ?
__________________________________________________________________________end]

I prompted them on 12-Mar-2018, and today (14-Mar-2018) have received:

[start_______________________________________________________________________

Your service request : SR #{ticketno:[8200794428]} has been reviewed and updated.

Response and Service Request History:

Thanks for getting back to me, i really appreciate your patience. 

This issue is related to the power supply.  Most PC power supplies (PSUs) are designed to handle a wide range of power consumption from your PC components, but not all PSUs are created equal. 

Because of this, there are some rare conditions where the power draw of an efficient PC does not meet the minimum power consumption requirements for one or more circuits inside some PSUs. 

This scenario (called “minimal loading supply”) can cause such PSUs to output poor quality power, or shut off entirely. 

To prevent this issue from happening, it is important to ensure that the power supply supports 0A minimum load on the +12V circuit. These PSUs became commonplace starting in 2013 for the Intel “Haswell” platform.

This specification can be found printed on the sticker affixed to most PSUs, or it may be available on the manufacturer’s website. If you cannot locate this information related to your PSU, you will need to contact the manufacturer directly. 

If you have not already, please download and update to the latest Motherboard bios and check to see if you have this new bios option and let me know the outcome of toggling this option on and off, and between the various preset settings. 
__________________________________________________________________________end]

I admit that the PSU I was using was (a) old and (b) cheap.  I have now replaced that by a brand-new, more efficient PSU which supports 0A loads across all voltages.  I am running that with C6 enabled, and await developments.
Comment 213 James Le Cuirot 2018-03-14 13:24:33 UTC
(In reply to Chris Hall from comment #212)
> I admit that the PSU I was using was (a) old and (b) cheap.  I have now
> replaced that by a brand-new, more efficient PSU which supports 0A loads
> across all voltages.  I am running that with C6 enabled, and await
> developments.

Thanks for putting the pressure on them. I also contacted support but didn't press them any further. Your new PSU probably won't help though; we discussed the various PSUs that people have further up this page and most, like mine, are good quality and not that old.
Comment 214 Klaus Mueller 2018-03-14 16:21:38 UTC
(In reply to Chris Hall from comment #212)
> I admit that the PSU I was using was (a) old and (b) cheap.  I have now
> replaced that by a brand-new, more efficient PSU which supports 0A loads
> across all voltages.  I am running that with C6 enabled, and await
> developments.

Could you please tell which PSU you're now testing?
Comment 215 Chris Siebenmann 2018-03-14 16:53:37 UTC
My Ryzen 1800X / Asus Prime X370-Pro based system that experiences this
has an EVGA SuperNova G3 550 watt PSU. Measured power usage is well
under its maximum rating, and while I can't find an explicit statement
in its documentation about 0A loads, I would be stunned if it doesn't
fully support that, since it's a quite modern and well regarded model
(and other G3 models seem to be reported as Haswell-ready in this way).

(More broadly, if AMD Ryzen CPUs and motherboards don't work reliably
with modern high quality PSUs, there is a problem in practice for people
who want to have reliable Ryzen-based Linux systems.)
Comment 216 Alexander von Gluck 2018-03-14 17:21:32 UTC
There's a lot of AMD bashing in this thread; I'm still inclined to believe these are just bugs to work out with a major new CPU changeup. The same kind of bugs bubbled up when Athlon 64 and Opteron came out.

Beyond the idle bug (which does indeed seem fixed so far by adding the rcu_nocbs boot flag), performance seems fine under Fedora 27.

Let me ping my AMD graphics contacts and see if they're aware of this ticket.
Comment 217 Chris Hall 2018-03-14 17:27:49 UTC
(In reply to Klaus Mueller from comment #214)
> (In reply to Chris Hall from comment #212)
> > I admit that the PSU I was using was (a) old and (b) cheap.  I have now
> > replaced that by a brand-new, more efficient PSU which supports 0A loads
> > across all voltages.  I am running that with C6 enabled, and await
> > developments.

> Could you please tell which PSU you're now testing?

It's a "be quiet! Straight Power 11", Model E11-450W (BN280).
Comment 218 Klaus Mueller 2018-03-14 18:23:33 UTC
(In reply to Chris Hall from comment #217)
> (In reply to Klaus Mueller from comment #214)
> > (In reply to Chris Hall from comment #212)
> > > I admit that the PSU I was using was (a) old and (b) cheap.  I have now
> > > replaced that by a brand-new, more efficient PSU which supports 0A loads
> > > across all voltages.  I am running that with C6 enabled, and await
> > > developments.
> 
> > Could you please tell which PSU you're now testing?
> 
> It's a "be quiet! Straight Power 11", Model E11-450W (BN280).

Actually, I'm running a "be quiet! Straight Power E8 400W". It's from 2011. I'm facing problems, which I can work around here by using the optimized mode for daily computing (Asus Prime X370-Pro) and nothing else. I'm curious about your test result. According to be quiet!, your PSU should be 0A-stable (compliant with Intel C6/C7).
Comment 219 Chris Hall 2018-03-14 19:28:10 UTC
(In reply to Klaus Mueller from comment #218)
>
> Actually, I'm running a "be quiet! Straight Power E8 400W". It's from 2011.
> I'm facing problems, which can be worked around here by using the optimized
> mode for daily computing (Asus Prime X370 pro) and nothing else more. I'm
> curious about your test result. According be quiet!, your PSU should be 0A -
> stable (compliant to intel C6/C7).

FWIW, I have looked at a number of reputable PSU suppliers, and found absolutely nothing else which explicitly covers the minimum current.

However, I find that ATX12V v2.4 is included in Revision 1.31 of Intel's 'Design Guide for Desktop Platform Form Factors', published in April 2013.  [https://www.intel.com/content/dam/www/public/us/en/documents/guides/power-supply-design-guide.pdf]  This appears to be the latest revision, and I note:

  * revision history says Revision 1.31 includes:

     * Changed 3.2.10 12 V2DC Minimum Loading to REQUIRED

     * Updated ... ATX12V Specific Guidelines to version 2.4

  * section 3.2.10 seems to:

     * REQUIRE +12 V2DC to operate down to 0.05A

     * and RECOMMEND that it operates down to 0A

So... I guess that a PSU which claims ATX12V v2.4 compliance should be OK ?
Comment 220 Alex 2018-03-14 21:06:03 UTC
Hi,
ASRock AB350 Pro4 + Be quiet! Pure Power 10 CM owner here.
Still having issues (got hit by this bug during "pacman -Syu" update last week).
The only thing that works for me is disabling C6.
Could AMD provide a reference PC spec or PSU QVL or something?
I've found no PSU recommendations from AMD except this:
https://support.amd.com/en-us/recommended/power-supplies
Comment 221 rtux 2018-03-14 21:25:34 UTC
(In reply to Klaus Mueller from comment #218)
> Actually, I'm running a "be quiet! Straight Power E8 400W". It's from 2011.
> I'm facing problems, which can be worked around here by using the optimized
> mode for daily computing (Asus Prime X370 pro) and nothing else more. I'm
> curious about your test result. According be quiet!, your PSU should be 0A -
> stable (compliant to intel C6/C7).

I'm running a new beQuiet! System Power 8 with 400W (Jan 2018) which supports latest processor generations from Intel and AMD.

I had to disable C6 to get it stable - see comment #198 for details.
Current uptime is 17 days now.

Still waiting for a Asus BIOS fix, so that I can re-enable C6 again.
Comment 222 John-Paul Herold 2018-03-15 16:24:08 UTC
I'm running a [Corsair SF450](https://www.corsair.com/us/en/Power/Plug-Type/sf-series-psu-config/p/CP-9020104-NA), which under "Tech Specs" lists the ATX12V version as v2.4. I had the idle freeze issue until I added `rcu_nocbs=0-15`. Haven't had to do anything else specific in BIOS or other tweaks.

I have modified the google form to include a question for validating the ATX12V version, just to see if anything trends from that.
Comment 223 Panagiotis Malakoudis 2018-03-15 20:22:54 UTC
Blaming the PSU on AMD's part is wrong, since idle freezes have been documented with various PSUs, new and old. If you read my older posts, I did a complete swap of parts between two systems: one exhibiting the freeze-on-idle issue and one that didn't. The system not exhibiting the freeze was using a really old (pre-Haswell) PSU. Putting this PSU in the other system didn't avoid the idle freeze.

The problem is related to power, and I have actually "hidden" the problem by overclocking my CPU. Hitting certain voltages/frequencies completely avoids the issue (at least in my case with an ASUS X370 Prime), but even these voltages/frequencies differ from one case to another. Another user here reported that he "hides" the problem by using the motherboard's automatic overclocking (the "optimized mode for daily computing" option), which overclocks to 3600 if I remember correctly. For me this didn't work; I still had idle freezes. But the idle freezes disappeared when I overclocked to 3900.

The rcu option doesn't fix the problem for everyone. It changes the way cores go idle, so some see the problem as fixed and some don't. It is more than obvious that this is not a kernel problem, so I would close this report as invalid. It is not a kernel bug. It is a hardware issue that AMD needs to fix, and it seems it has been addressed with this new "Power Supply Option". Unfortunately my motherboard doesn't have this option yet.
Comment 224 Alexander von Gluck 2018-03-16 01:39:35 UTC
I opened a support case with AMD, and buried in several basic steps (what GPU, etc.) was "disable C6 in BIOS".

So it sounds like AMD is definitely aware of the problem, since disabling C6 was in their first support communication.


I have a 1000 Watt Corsair HX1000i and see the idle hang-ups, so definitely *NOT* a power supply issue.

I wonder if windows users see this issue?

(In reply to Panagiotis Malakoudis from comment #223)
> rcu option doesn't fix the problem for all.

For the RCU fix to work, your distro also has to have kernel support for it (CONFIG_RCU_NOCB_CPU).  See this article:

http://textandhubris.com/2017/12/09/ryzen-issues-on-fedora-27/
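A quick way to verify both halves of that requirement on a running system (a sketch; the config file location varies by distro):

  grep CONFIG_RCU_NOCB_CPU /boot/config-$(uname -r)    # or: zgrep CONFIG_RCU_NOCB_CPU /proc/config.gz
  grep -o 'rcu_nocbs=[^ ]*' /proc/cmdline              # confirm the boot parameter is actually active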
Comment 225 OptionalRealName 2018-03-16 01:50:17 UTC
I suspect this solves the Windows issues.


https://www.eteknix.com/amd-releases-balanced-power-plan-patch-for-windows-10/
Comment 226 OptionalRealName 2018-03-16 01:50:34 UTC
Also does anyone know if the Epyc series has this issue?
Comment 227 Klaus Mueller 2018-03-16 05:35:46 UTC
(In reply to Panagiotis Malakoudis from comment #223)
> Blaming the PSU from AMD's part is wrong, since idle freezes have been
> documented with various PSUs, new and old. If you read my older posts, I did
> a complete replacement of parts between two systems: one excibiting the
> freeze on idle issue and one that didn't. The system not excibiting the
> freeze was using a really old (pre Hasswell) PSU. Putting this PSU on the
> other system didn't avoid idle freeze.

To have one more documented check by Chris Hall can't be bad. But I fear I know how this test will end - even though I hope it works for him.

> The problem is related to power and actually I have "hide" the problem by
> overclocking my CPU.

This is a new interesting information.

> Hitting certain voltages/frequencies avoids completely
> the issue (at least on my case with an ASUS X370 Prime). But even these
> voltages/frequencies are different from one case to another. Another user

Me!

> here reported he "hides" the problem by using the automatic overclocking of
> motherboard ("optimized mode for daily computing" option) which actually
> overclocks to 3600 if I remember well.

You do! 3600 probably is enough for me, as I'm always running 4 VMs - which are mostly idle, too, but they add some additional minimal load, which could prevent the system from often entering very low current levels. Moderate overclocking additionally helps to circumvent the problem.
The next point is that I seldom run this machine for more than 15 hours - therefore I can't say whether the problem has disappeared completely. But before overclocking, I could see freezes even after an hour or less.

> For me this didn't work, still had
> idle freezes. But idle freezes disappeared when I overclocked at 3900.

This is good to know! There are a lot of posts elsewhere about overclocking Ryzen (and RAM - that seems to be much more difficult) - mostly by Windows users. Overclocking not only doesn't cause difficulties, it could even resolve problems! That's a cool new idea!
Could it be related to the relationship between the RAM and CPU clocks?

[...]

> It is a hardware issue that AMD needs to fix and it
> seems it has been fixed in with this new option "Power Supply Option".

As I read elsewhere [1], this option disables C6 states among other things. So, using zenstates.py to switch off C6 states would be the same "solution".

[1] https://www.gamingonlinux.com/forum/topic/3207/post_id=14714
Comment 228 Paulo J. S. Silva 2018-03-16 15:19:58 UTC
> 
> As I could read elsewhere [1], this option would disable C6 states among
> others. so, using zenstates.py to switch off c6 states would be the same
> "solution".
> 
> [1] https://www.gamingonlinux.com/forum/topic/3207/post_id=14714

This does not seem to be true for all motherboards. I have a B350 Tomahawk from MSI and I updated the BIOS today. Its latest BIOS has this new Power Supply Option that can be set to Auto, Low current idle, or Typical current idle. The default is Auto.

I changed it to both Low and Typical current and checked the possible states using zenstates.py -l; in both cases I get:


P0 - Enabled - FID = 88 - DID = 8 - VID = 20 - Ratio = 34.00 - vCore = 1.35000
P1 - Enabled - FID = 78 - DID = 8 - VID = 2C - Ratio = 30.00 - vCore = 1.27500
P2 - Enabled - FID = 84 - DID = C - VID = 68 - Ratio = 22.00 - vCore = 0.90000
P3 - Disabled
P4 - Disabled
P5 - Disabled
P6 - Disabled
P7 - Disabled
C6 State - Package - Enabled
C6 State - Core - Enabled

So the C6 states look enabled. I am now running with Typical current idle to see if it improves stability. Next Monday, when I get back to work, I will see whether the machine locked up over the weekend, as usual, or not.
Comment 229 eric.c.morgan 2018-03-16 15:58:35 UTC
(In reply to Paulo J. S. Silva from comment #228)
> > 
> > As I could read elsewhere [1], this option would disable C6 states among
> > others. so, using zenstates.py to switch off c6 states would be the same
> > "solution".
> > 
> > [1] https://www.gamingonlinux.com/forum/topic/3207/post_id=14714
> 
> This does not seem to be true for all motherboards. I have a B350 Tomahawk
> from MSI and I updated the BIOS today. In its last BIOS it has this new
> Power Supply Option that can be set to Auto, Low current idle, Typical
> current idle. The default is auto. 
> 
> I have changed it to both Low and Typical current and checked the possible
> states using zenstates.py -l and in both cases I get:
> 
> 
> P0 - Enabled - FID = 88 - DID = 8 - VID = 20 - Ratio = 34.00 - vCore =
> 1.35000
> P1 - Enabled - FID = 78 - DID = 8 - VID = 2C - Ratio = 30.00 - vCore =
> 1.27500
> P2 - Enabled - FID = 84 - DID = C - VID = 68 - Ratio = 22.00 - vCore =
> 0.90000
> P3 - Disabled
> P4 - Disabled
> P5 - Disabled
> P6 - Disabled
> P7 - Disabled
> C6 State - Package - Enabled
> C6 State - Core - Enabled
> 
> So the C6 states looks enabled. I am now running it with Typical current
> idle to see if it improves stability. Next monday, when I get back to work I
> will see if the machine locked over the weekend, as usual, or not.

I look forward to your results! I've done new kernel, params, and .py to get a stable system. I'd like to get back to stock distro kernels and if your setting works I'll be there!
Comment 230 AMD Linux User 2018-03-16 18:17:50 UTC
Created attachment 274783 [details]
attachment-6763-0.html

> This does not seem to be true for all motherboards. I have a B350 Tomahawk
> from
> MSI and I updated the BIOS today. In its last BIOS it has this new Power
> Supply
> Option that can be set to Auto, Low current idle, Typical current idle. The
> default is auto.
>
> So the C6 states looks enabled. I am now running it with Typical current idle
> to see if it improves stability. Next monday, when I get back to work I will
> see if the machine locked over the weekend, as usual, or not.
>

Good to hear that the new BIOS start to be spread over other constructors.
For my motherboard (MSI X370 Gaming Plus), there is an explicit comment 
for the new BIOS version:

1. Do not update this BIOS if you`re currently using windows7

This suggests they are focusing on these Unix soft lockups, even
if nothing official is written anywhere about it. But with that kind of
comment we can expect a big mess on the BIOS side in the coming months.
On my side, I won't test it, since tweaking the voltages made my computer
very stable (20 days for now; 14 days last time, until a reboot due
to a new kernel installation).
Comment 231 Moritz Naumann 2018-03-16 23:36:43 UTC
For me, Ryzen 7 1800X (acquired in late April 2017) on ASRock X370 Taichi with 2x 16GB (non-QVL) Kingston 9965669-019.A00G unbuffered ECC RAM (banks 2+4) on be quiet! Dark Power Pro 11 80plus 650W, is stable out of the box since BIOS 4.60 (2018/3/6) with default BIOS settings (except for SVM = enabled) on Ubuntu 16.04 and linux-image-hwe-edge 4.13.0.36.37. The previous BIOS version 4.40 (2018/2/9) also worked with a default Linux kernel image and without kernel parameters after setting the new BIOS 'current' option to the non-default 'full current' value.
So it just took a year and the follow-up generation to be released, yeay.
Comment 232 Moritz Naumann 2018-03-18 10:57:12 UTC
It turns out my previous statement was premature. The system still freezes unless "Power Supply Idle Control" is set to the non-default value of "Typical Current".
Comment 233 Klaus Mueller 2018-03-18 16:24:49 UTC
(In reply to Moritz Naumann from comment #232)
> It turns out my previous statement was premature. The system still freezes
> unless "Power Supply Idle Control" is set to the non-default value of
> "Typical Current".

http://download.gigabyte.cn/FileList/Manual/mb_manual_ga-ax370m-ds3h_e.pdf

Power Supply Idle Control
Enables or disables Package C6 State.

Typical Current Idle 
Disables this function.

Low Current Idle 
Enables this function.

Auto 
The BIOS automatically configures this setting (Default)
Comment 234 Paulo J. S. Silva 2018-03-19 16:33:38 UTC
> 
> Power Supply Idle Control
> Enables or disables Package C6 State.

Interesting; according to zenstates.py, on my motherboard (an MSI B350 Tomahawk) the Package C6 State is enabled in both cases. 

Actually, I have good news. Setting this option to Typical current idle kept my machine stable over the weekend (it almost always freezes during weekends if left idle). It is still too soon to assert anything, but it is a good sign.

I may also bring my kill-a-watt to compare the idle consumption from the outlet with this option enabled and disabled to see if I can find any difference. But, I have to find the kill-a-watt at my place first.
Comment 235 Douglas S 2018-03-20 22:36:34 UTC
Ryzen 3 1200 Quad-Core (Made: 17/28) with an ASRock AB350 Pro4 running BIOS 3.20 (AGESA 1.0.0.6b)

System gets hardlocks (anywhere from 1-3, 10-30 days) that cause even SysRq to not work. The chip is in a server that varies between idle and active, but the hang seems to happen regardless of CPU usage state.

Only once did I get a "BUG: soft lockup". And in that situation I was able to REISUB sysrq my way to a clean reboot.

I've been following this bug for a while, but I'm not sure of what a proper solution might be (if any) for those that have found one at this point.

Below is some possibly helpful information:
Kernel: 4.14.0-0.bpo.3-amd64 #1 SMP Debian 4.14.13-1~bpo9+1 (2018-01-14) x86_64 GNU/Linux
Grub: rcu_nocbs=0-3

-
python zenstates.py --list
P0 - Enabled - FID = 7C - DID = 8 - VID = 3A - Ratio = 31.00 - vCore = 1.18750
P1 - Enabled - FID = 8C - DID = A - VID = 50 - Ratio = 28.00 - vCore = 1.05000
P2 - Enabled - FID = 7C - DID = 10 - VID = 6A - Ratio = 15.50 - vCore = 0.88750
P3 - Disabled
P4 - Disabled
P5 - Disabled
P6 - Disabled
P7 - Disabled
C6 State - Package - Enabled
C6 State - Core - Enabled

At a loss for where to go now. I did not enable CONFIG_RCU_NOCB_CPU=y as I thought that disabled C6 states, and that wasn't desirable, but perhaps I'm mistaken.

Appreciate any guidance
Doug
Comment 236 eric.c.morgan 2018-03-20 22:40:08 UTC
(In reply to Douglas S from comment #235)
> Ryzen 3 1200 Quad-Core (Made: 17/28) with an ASRock AB350 Pro4 running BIOS
> 3.20 (AGESA 1.0.0.6b)
> 
> System gets hardlocks (anywhere from 1-3, 10-30 days) that cause even SysRq
> to not work. The chip is in a server that varies between idle and active,
> but the hang seems to happen regardless of CPU usage state.
> 
> Only once did I get a "BUG: soft lockup". And in that situation I was able
> to REISUB sysrq my way to a clean reboot.
> 
> I've been following this bug for a while, but I'm not sure of what a proper
> solution might be (if any) for those that have found one at this point.
> 
> Below is some possibly helpful information:
> Kernel: 4.14.0-0.bpo.3-amd64 #1 SMP Debian 4.14.13-1~bpo9+1 (2018-01-14)
> x86_64 GNU/Linux
> Grub: rcu_nocbs=0-3
> 
> -
> python zenstates.py --list
> P0 - Enabled - FID = 7C - DID = 8 - VID = 3A - Ratio = 31.00 - vCore =
> 1.18750
> P1 - Enabled - FID = 8C - DID = A - VID = 50 - Ratio = 28.00 - vCore =
> 1.05000
> P2 - Enabled - FID = 7C - DID = 10 - VID = 6A - Ratio = 15.50 - vCore =
> 0.88750
> P3 - Disabled
> P4 - Disabled
> P5 - Disabled
> P6 - Disabled
> P7 - Disabled
> C6 State - Package - Enabled
> C6 State - Core - Enabled
> 
> At a loss for where to go now. I did not enable CONFIG_RCU_NOCB_CPU=y as I
> thought that disabled C6 states, and that wasn't desirable, but perhaps I'm
> mistaken.
> 
> Appreciate any guidance
> Doug

Run the mentioned python script and disable C6. I have 90 days uptime on my server after doing this.
Comment 237 James Le Cuirot 2018-03-20 22:47:50 UTC
(In reply to Douglas S from comment #235)
> At a loss for where to go now. I did not enable CONFIG_RCU_NOCB_CPU=y as I
> thought that disabled C6 states, and that wasn't desirable, but perhaps I'm
> mistaken.

I thought that at one point but that's not true. It's an effective workaround for most (but evidently not all) people with fewer downsides.

On my side, I have updated to my latest BIOS and tried setting the new "Power Supply Idle Control" option to "Typical Current" while removing the rcu_nocbs parameter. So far, so good. zenstates.py reports the following, which makes sense given the option's description.

C6 State - Package - Disabled
C6 State - Core - Enabled

If you don't know what this means, I've read that "core" just powers down the CPU cores while "package" powers down most of the rest of the CPU after the cores. Sounds like an appropriate fix. I'd like to measure the power usage but haven't gotten round to it yet.
Comment 238 OptionalRealName 2018-03-21 01:51:37 UTC
Still no response from AMD?  Look forward to the brave soul who tests this on the new 2xxx series in a month or two.
Comment 239 Kai-Heng Feng 2018-03-21 16:42:41 UTC
Created attachment 274853 [details]
disable c6

So maybe disable package c6 within kernel?
Comment 240 Klaus Mueller 2018-03-21 17:17:16 UTC
(In reply to Kai-Heng Feng from comment #239)
> So maybe disable package c6 within kernel?

Good idea, and OK for testing - but please make it a kernel command line option if it goes to production, because not everybody has this problem, and some have a better solution such as mild CPU overclocking.
Comment 241 Kai-Heng Feng 2018-03-22 04:03:25 UTC
(In reply to Klaus Mueller from comment #240)
> Good idea and for testing ok - but please as kernel commandline option if it
> should go to production because not everybody does have this problem
> respectively does have better solution like more or less CPU overclocking.

It's not necessary because we can already write the MSR from userspace.

Is the RMA'd CPU also stepping 1?
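Regarding writing the MSR from userspace: for reference, a read-only sketch of inspecting the relevant register with msr-tools. The register number below is the one zenstates.py appears to use for package C6 control; treat it as an assumption and check the script before writing anything back.

  sudo modprobe msr                 # exposes /dev/cpu/<n>/msr
  sudo rdmsr -a 0xC0010292          # dump the (assumed) package C6 control MSR on every CPU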
Comment 242 Panagiotis Malakoudis 2018-03-22 08:37:45 UTC
Created attachment 274863 [details]
zenstates.py small patch to allow disabling only C6 package

I made a small patch to zenstates.py to allow disabling only the C6 package state, for those who don't have the new power idle control options in their BIOS. Test and report.
Comment 243 Sergio C. 2018-03-22 17:12:58 UTC
(In reply to Kai-Heng Feng from comment #241)
> (In reply to Klaus Mueller from comment #240)
> > Good idea and for testing ok - but please as kernel commandline option if
> it
> > should go to production because not everybody does have this problem
> > respectively does have better solution like more or less CPU overclocking.
> 
> It's not necessary because we can already write the MSR from userspace.
> 
> Is the RMA'd CPU also stepping 1?

Mine is stepping 1.


For people using the "Power Supply Idle Control" setting in the BIOS/UEFI, does it also get reset after resuming from suspend/sleep?
Comment 244 James Le Cuirot 2018-03-22 17:57:37 UTC
(In reply to Sergio C. from comment #243)
> (In reply to Kai-Heng Feng from comment #241)
> > Is the RMA'd CPU also stepping 1?
> 
> Mine is stepping 1.

So is mine, which has been RMA'd.

> For people using the "Power Supply Idle Control" setting in the BIOS/UEFI,
> does it also get reset after resuming from suspend/sleep?

Just tested, C6 Package stays disabled.
Comment 245 Kai-Heng Feng 2018-03-23 03:43:04 UTC
If we are sure the issue is gone when package C6 is disabled, I'll send a patch which disables package C6 on model == 1 && stepping == 1.
Comment 246 Paulo J. S. Silva 2018-03-23 11:34:30 UTC
(In reply to Kai-Heng Feng from comment #245)
> If we are sure the issue is gone when package C6 is disabled, I'll send a
> patch which disables package on model == 1 && stepping == 1.

Doesn't this have consequences for power consumption and/or performance? I am not sure this is the best approach now that the new BIOS option is in place. As I said, on my system (MSI B350 motherboard) setting the option to "Typical" solves the problem. My machine has been running for a week now without any freeze. 

Moreover, in my case the BIOS option does not seem to disable the C6 Package, as far as zenstates.py can tell:

pjssilva@quorra:~/local/administracao/ZenStates-Linux$ sudo ./zenstates.py -l
[sudo] senha para pjssilva: 
P0 - Enabled - FID = 88 - DID = 8 - VID = 20 - Ratio = 34.00 - vCore = 1.35000
P1 - Enabled - FID = 78 - DID = 8 - VID = 2C - Ratio = 30.00 - vCore = 1.27500
P2 - Enabled - FID = 84 - DID = C - VID = 68 - Ratio = 22.00 - vCore = 0.90000
P3 - Disabled
P4 - Disabled
P5 - Disabled
P6 - Disabled
P7 - Disabled
C6 State - Package - Enabled
C6 State - Core - Enabled
pjssilva@quorra:~/local/administracao/ZenStates-Linux$
Comment 247 Kai-Heng Feng 2018-03-23 11:38:51 UTC
(In reply to Paulo J. S. Silva from comment #246)
> (In reply to Kai-Heng Feng from comment #245)
> > If we are sure the issue is gone when package C6 is disabled, I'll send a
> > patch which disables package on model == 1 && stepping == 1.
> 
> Doesn't this have consequences for power consumption and/or
> performance? 

I don't know. Need someone who has power factor meter to do the test.
Comment 248 Tobias Klausmann 2018-03-23 11:52:49 UTC
I for one have a stable system *without* disabling C6, but just rcu_nocbs=0-15 on the kernel command line. If C6 was hard-disabled, it wouldn't help me, but likely make the power consumption a lot worse.

Hardware:
AMD Ryzen 7 1700X (model 1, stepping 1) that was RMA'd for segfaults
ASUS PRIME B350-Plus, BIOS 3401
2x Kingston 9965669-017.A00G

No overclocking and core boost disabled.

I regularly get uptimes well over two weeks and haven't had a lockup since I added the RCU blacklist.
Comment 249 James Le Cuirot 2018-03-23 11:59:19 UTC
I haven't measured it yet but I wouldn't panic about power usage prematurely. I think just disabling C6 package is quite different to disabling C6 entirely.

It also seems like BIOS settings aren't always proving reliable and you need to check zenstates.py to see whether it has changed what you expect or even anything at all.
Comment 250 Kai-Heng Feng 2018-03-23 12:11:11 UTC
(In reply to Tobias Klausmann from comment #248)
> I for one have a stable system *without* disabling C6, but just
> rcu_nocbs=0-15 on the kernel command line. If C6 was hard-disabled, it
> wouldn't help me, but likely make the power consumption a lot worse.

I guess RCU_NOCBS only papers over the bug by making it harder for the CPU to enter package C6.

Regarding power consumption, we need real-world test results before making any judgment.
Comment 251 Klaus Mueller 2018-03-23 12:40:14 UTC
(In reply to Tobias Klausmann from comment #248)
> I for one have a stable system *without* disabling C6, but just
> rcu_nocbs=0-15 on the kernel command line. If C6 was hard-disabled, it
> wouldn't help me, but likely make the power consumption a lot worse.

This option must be switchable of course - I don't need it either! It should be handled like the BIOS switch: the default is no change, and machines which need it opt in, e.g. via a c6package.disable parameter or, even better, additionally via sysctl (always opt in).

About power consumption: I can test it on the basis of an already overclocked system (+200 MHz). I can do it once I can switch off the system to put the power meter in. The precision isn't very high, but a difference of 1 or 2 watts should be noticeable.
Comment 252 Artem Hluvchynskyi 2018-03-23 12:42:19 UTC
I think both the RCU workaround and disabling C6 (either through the MSR or the BIOS option) just mask the problem, which is probably at the hardware level. It is likely also masked on Windows by the way its power management works. I strongly believe AMD should properly investigate this issue instead of waving it off onto PSUs. It happens on all kinds of PSUs, including ones that definitely support C6/C7.

One thing I noticed: with C6 disabled through the script, or with the "Typical Current" power supply option in the BIOS, the motherboard sensors in my case never report a Vcore below 0.8V, which indicates the CPU indeed doesn't reach the lower idle states even when only "package C6" is disabled. On default settings it frequently reports the 0.4-0.5V range when idle.

Since Zen only applies boosting, and especially XFR, when certain power and thermal conditions are met, disabling the deep idle states may affect this adversely. So IMO all of that should be well tested before applying a workaround like this at the kernel level.
Comment 253 it 2018-03-23 13:08:24 UTC
Since I enabled "Typical Current" my system has been absolutely stable. On my system that disables package C6.

I did a short test: I ran stress-ng -c1, pinned the process affinity to the first core, and watched the output of cpufreq-aperf. I didn't see a difference between package C6 enabled and disabled.

I didn't run a very long test, nor do I know whether the cpufreq-aperf values are really meaningful in this scenario (sometimes it showed frequencies above the 3.9 GHz my 1700X should be able to reach), but at least there was no difference.
Comment 254 Klaus Mueller 2018-03-23 14:58:05 UTC
(In reply to Artem Hluvchynskyi from comment #252)
> I think both RCU workaround and disabling C6 (either through MSR or using
> the BIOS option) just masks the problem, which is probably on hardware
> level. 

Maybe true.

[...]

> One thing I noticed is that with C6 disabled through the script or using the
> "Typical Current" power supply option in BIOS in my case MB sensors never
> report Vcore less than 0.8V, which indicates that indeed it doesn't reach
> lower idle states even if only "package C6" is disabled. On default settings
> it will frequently report 0.4-0.5V range when idle.

I'm reaching exactly this 0.4-0.5V range with overclocking (+200 MHz) and C6 completely enabled - and I haven't seen the problem for many weeks now (before, it wouldn't even last an hour). My PSU is officially not C6/C7 capable because it's too old.

> Since Zen would only apply boosting and especially XFR when certain power
> and thermal conditions are met, disabling the deep idle states may affect
> this adversely. So IMO all that should be well tested before applying a
> workaround like this on the kernel level.

Fine with me if it's switchable (opt in).

Nevertheless, it would indisputably be much better to have a real fix that addresses the root cause instead of a workaround - but I fear it won't come anytime soon.
Comment 255 Paulo J. S. Silva 2018-03-23 15:16:20 UTC
> One thing I noticed is that with C6 disabled through the script or using the
> "Typical Current" power supply option in BIOS in my case MB sensors never
> report Vcore less than 0.8V, which indicates that indeed it doesn't reach
> lower idle states even if only "package C6" is disabled. On default settings
> it will frequently report 0.4-0.5V range when idle.

The same here (Tomahawk B350 from MSI). With the "Typical current" option in BIOS the voltage has a minimum of 0.82V (that's what sensors report). If I change the setting to "Auto" or "Low current" it goes down to 0.4-0.5V.

With "Typical" my system is stable, with "Auto" it freezes, usually during the night. I haven't tried "Low" yet. Maybe next week.
Comment 256 Harald Welte 2018-03-24 09:09:48 UTC
FYI: I'm experiencing a similar or the same issue on a Ryzen 1700X.

vendor_id       : AuthenticAMD
cpu family      : 23
model           : 1
model name      : AMD Ryzen 7 1700X Eight-Core Processor
stepping        : 1
microcode       : 0x8001129

Base Board Information
        Manufacturer: ASUSTeK COMPUTER INC.
        Product Name: PRIME B350M-A
        Version: Rev X.0x
BIOS Information
        Vendor: American Megatrends Inc.
        Version: 3803
        Release Date: 01/22/2018

where at least once per day the machine would hang with
<pre>
Message from syslogd@host2 at Mar 23 21:35:56 ...
 kernel:[31606.963222] NMI watchdog: BUG: soft lockup - CPU#15 stuck for 23s! [(md-udevd):2623]
</pre>

being the last message.  C6 was enabled for both the package and the core by the BIOS.

The machine would hang quite reliably at least once every 24 hours, typically during the idle night times.  The load at that point was 0.01.

OS/Kernel is Debian 9 / 4.9.0-6-amd64, which has "CONFIG_RCU_EXPERT is not set" and hence none of the [old or new] more detailed RCU settings are enabled.

I have decided not to upgrade the kernel but simply disable C6 and wait...
Comment 257 Klaus Mueller 2018-03-24 11:17:36 UTC
(In reply to Tobias Klausmann from comment #248)
> I for one have a stable system *without* disabling C6, but just
> rcu_nocbs=0-15 on the kernel command line. If C6 was hard-disabled, it
> wouldn't help me, but likely make the power consumption a lot worse.

I tested power consumption with the package C6 state enabled (the default) and disabled. I disabled it via the patch for zenstates.py.

I ran two tests of 2 hours each with the VMs shut down and no KDE session (only the login screen was started, but the active console was text console 0).

Within the measuring tolerance of the power meter, I couldn't find any difference between package C6 enabled and disabled.

The voltages I saw in sensors were the same (most of the time 0.39 V) - no difference. The load of the machine was 0 0 0.

Maybe the patch for zenstates.py to disable the package C6 state doesn't work? It would have been nice to have the ability to re-enable it.

Another interesting finding: the consumption of my 4 idle(!) VMs is constantly about 5 watts!
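(One way to verify whether such a zenstates.py-style change actually took effect is to read the MSR back afterwards. Below is a minimal Python sketch of that check; it assumes the bit layout zenstates.py uses - package C6 enable in bit 32 of MSR 0xC0010292 - plus the msr module loaded and root privileges.)

#!/usr/bin/env python3
# Sketch: read (and optionally clear) the package-C6 enable bit the way
# zenstates.py does, so a --c6-package-disable style change can be verified.
# Assumption: MSR 0xC0010292 bit 32 = package C6 enable (per zenstates.py).
# Requires "modprobe msr" and root.
import glob
import os
import struct

MSR_PKG_C6 = 0xC0010292
PKG_C6_BIT = 1 << 32              # assumed bit, as toggled by zenstates.py

def rdmsr(path, reg):
    fd = os.open(path, os.O_RDONLY)
    try:
        return struct.unpack('<Q', os.pread(fd, 8, reg))[0]
    finally:
        os.close(fd)

def wrmsr(path, reg, value):
    fd = os.open(path, os.O_WRONLY)
    try:
        os.pwrite(fd, struct.pack('<Q', value), reg)
    finally:
        os.close(fd)

for dev in sorted(glob.glob('/dev/cpu/[0-9]*/msr')):
    val = rdmsr(dev, MSR_PKG_C6)
    print(dev, 'package C6:', 'enabled' if val & PKG_C6_BIT else 'disabled')
    # To clear the bit on every core and verify the write, uncomment:
    # wrmsr(dev, MSR_PKG_C6, val & ~PKG_C6_BIT)
    # assert not (rdmsr(dev, MSR_PKG_C6) & PKG_C6_BIT)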
Comment 258 Panagiotis Malakoudis 2018-03-24 19:17:01 UTC
Those of you who have the new power idle options, can you report the values of MSR registers 0xC0010292 and 0xC0010296 after booting?
It can be done with:
modprobe msr
rdmsr -x 0xC0010292
rdmsr -x 0xC0010296
Comment 259 Panagiotis Malakoudis 2018-03-25 09:21:40 UTC
Disabling only package C6 does not fix the issue for me. So whatever "Power Supply Idle Control" does, it is not just disabling package C6.
Comment 260 James Le Cuirot 2018-03-25 09:49:06 UTC
(In reply to Panagiotis Malakoudis from comment #258)
> rdmsr -x 0xC0010292

52

> rdmsr -x 0xC0010296

484848

I've taken some basic power measurements. Not particularly scientific or accurate, just watching the numbers fluctuate on my power plug meter. I've not looked into this before so I really didn't know whether the difference would be tens of watts or milliwatts. I booted with just C6 package disabled and it hovered around 64.5W. With C6 entirely disabled, it went up to 70W. With C6 entirely enabled, it was around 63.5W. So C6 Package hardly makes any difference. I made these changes with zenstates.py rather than the BIOS for this test.
Comment 261 Robert Hoffmann 2018-03-25 09:50:15 UTC
Hello guys, (new here, but not to unix/linux)

I have at the moment a new Epyc-based server, and also a new Ryzen PC.
No soft lockups on the Epyc, only a very evil error under heavy load that I have been hunting down for a while (https://forums.fedoraforum.org/showthread.php?317537-first-server-error-reboot-what-is-this-UUID), until I decided to move away from FC27 (not certified on this server) and load CentOS 7. So far no errors, but I have yet to put more load on it.

Specs, Epyc:
Kernel: 3.10.0-693.21.1.el7.x86_64 x86_64 bits: 64 gcc: 4.8.5
           Desktop: Openbox Distro: CentOS Linux release 7.4.1708 (Core)                            
Machine:   Device: kvm System: Supermicro product: AS -2023US-TR4 v: 0123456789 serial: <filter>    
           Mobo: Supermicro model: H11DSU-iN v: 1.02A serial: <filter>                              
           UEFI [Legacy]: American Megatrends v: 1.1 date: 02/07/2018                               
CPU(s):    2 16 core AMD EPYC 7351s (-MCP-SMP-) arch: Zen rev.2 cache: 16384 KB 

Ryzen:
Kernel: 4.15.10-300.fc27.x86_64 x86_64 bits: 64 gcc: 7.3.1 Console: tty 0
           Distro: Fedora release 27 (Twenty Seven)                                                 
Machine:   Device: desktop Mobo: ASUSTeK model: PRIME B350M-A v: Rev X.0x serial: <filter>          
           UEFI [Legacy]: American Megatrends v: 3402 date: 12/11/2017                              
CPU:       6 core AMD Ryzen 5 1600X Six-Core (-MT-MCP-) arch: Zen rev.1 cache: 3072 KB     


I can say that I have seen the soft lockup on the Ryzen several times in the past 3 months, maybe once a week or so (but then again the machine is always doing stuff in my home, e.g. downloading backups from servers).

The boot options on both are now:
Epyc: GRUB_CMDLINE_LINUX="rhgb selinux=0 rcu_nocbs=0-63"
Ryzen: GRUB_CMDLINE_LINUX="rd.driver.blacklist=nouveau modprobe.blacklist=nouveau nvidia-drm.modeset=1 rhgb quiet selinux=0 nmi_watchdog=0 nohpet pci=biosirq rcu_nocbs=0-11"

I have not been home yet to upgrade the BIOS on the B350 mobo, but I will be there in 7 days, do it, and post my findings. In the meantime, I will leave the Ryzen completely idle, waiting for a crash.
I have not (ever) touched the C6 settings, although I do remember setting my mobo BIOS to "Standard" (not Performance, and not Power Saving).
Comment 262 Robert Hoffmann 2018-03-25 10:49:05 UTC
Oh, I forgot:

Epyc (Centos7):
fgrep CONFIG_RCU_NOCB_CPU /boot/config-$(uname -r)
CONFIG_RCU_NOCB_CPU=y
CONFIG_RCU_NOCB_CPU_NONE=y
# CONFIG_RCU_NOCB_CPU_ZERO is not set
# CONFIG_RCU_NOCB_CPU_ALL is not set


Ryzen (FC27):
fgrep CONFIG_RCU_NOCB_CPU /boot/config-$(uname -r)
CONFIG_RCU_NOCB_CPU=y
Comment 263 Kai-Heng Feng 2018-03-25 12:37:10 UTC
If there's no noticeable power consumption increase when package C6 gets disabled, and we are sure that it really works around the hard lockup, I'll send the patch upstream.
Comment 264 Panagiotis Malakoudis 2018-03-25 12:42:46 UTC
It doesn't work around the hard lock, as I reported a while back (comment 259).
Comment 265 Jonathan 2018-03-25 13:41:18 UTC
(In reply to Kai-Heng Feng from comment #263)
> If there's no noticeable power consumption increase when Package C6 gets
> disabled, and we are sure that it really workarounds the hard lockup, I'll
> send the patch to upstream.

What exactly would this mean (Linux newbie here)? Would this mean this would be active by default on Linux (C6 disabled for Ryzen/TR/Epyc)? Because electricity is bloody expensive here and the 5-10% power consumption difference would matter to me.
Comment 266 Kai-Heng Feng 2018-03-25 13:50:48 UTC
(In reply to Panagiotis Malakoudis from comment #264)
> It doesn't workarounds the hard lock, as I reported a while back (comment
> 259)

So does disabling core C6 do the trick?
Comment 267 Chris Hall 2018-03-25 13:53:47 UTC
(In reply to Chris Hall from comment #217)
> (In reply to Klaus Mueller from comment #214)
> > (In reply to Chris Hall from comment #212)
> > > I admit that the PSU I was using was (a) old and (b) cheap.  I have now
> > > replaced that by a brand-new, more efficient PSU which supports 0A loads
> > > across all voltages.  I am running that with C6 enabled, and await
> > > developments.

> > Could you please tell which PSU you're now testing?
 
> It's a "be quiet! Straight Power 11", Model E11-450W (BN280).

I have been running with the new PSU with C6 enabled (package and core), and with CONFIG_RCU_NOCB_CPU=y and rcu_nocbs=0-15, for 10 days.

This morning the machine was frozen.  So, the new PSU has not cured the problem.

With the old PSU the machine was freezing every 2-3 days.  So, things have improved... which could be the PSU, but the kernel (now 4.15.10) and other software have been updated, and the idle load on the machine depends on how frequently somebody tries to ssh in :-(

The BIOS for my ASUS Prime X370-PRO does not have the magic "Power Supply Idle Control".  I will now try "zenstates.py --c6-package-disable".

FWIW: I last wrote to AMD "TECH.SUPPORT" on 14-Mar... so far, no reply.
Comment 268 Panagiotis Malakoudis 2018-03-25 13:56:05 UTC
Disabling both package C6 and core C6 prevents the hard lock, but also disables single-core turbo/XFR speeds. With C6 disabled my 1700X only goes up to 3500 MHz, while a single core can go up to 3900 MHz with C6 enabled.

Disabling both C6 package and C6 core should not be considered a fix.
Comment 269 Harald Welte 2018-03-25 14:27:34 UTC
I can report that on my Ryzen 1700X machine disabling core+package C6 seems to
be working around the problem.  Where previously the machine would fail quite
reliably at least once per 24 hours, it's now running longer without any trouble
so far.
Comment 270 Klaus Mueller 2018-03-25 15:22:15 UTC
(In reply to Jonathan from comment #265)
> (In reply to Kai-Heng Feng from comment #263)
> > If there's no noticeable power consumption increase when Package C6 gets
> > disabled, and we are sure that it really workarounds the hard lockup, I'll
> > send the patch to upstream.
> 
> What exactly would this mean (Linux newbie here)? Would this mean this would
> be active by default on Linux (C6 disabled for Ryzen/TR/Epyc)? Because
> electricity is bloody expensive here and the 5-10% power consumption
> difference would matter to me.

Kai-Heng Feng seems to stubbornly ignore the fact that not *all people* need this workaround. I have repeatedly written that this *must be switchable*, e.g. as opt in.

But before anything is done, it must first be proven that the fix is correct at all - meaning that it really addresses package C6 and nothing else.

My tests here have shown that there isn't any difference in voltages between package C6 enabled and disabled using zenstates.py. Did it really address package C6? I'm also not sure whether it does the same thing as the kernel patch would.
Comment 271 LimeTech 2018-03-25 15:28:17 UTC
(In reply to Klaus Mueller from comment #240)
> (In reply to Kai-Heng Feng from comment #239)
> > So maybe disable package c6 within kernel?
> 
> Good idea and for testing ok - but please as kernel commandline option if it
> should go to production because not everybody does have this problem
> respectively does have better solution like more or less CPU overclocking.

Good luck getting Linus to accept a kernel change for a defect that AMD won't even acknowledge.
Comment 272 Klaus Mueller 2018-03-25 15:42:56 UTC
(In reply to Chris Hall from comment #267)
> (In reply to Chris Hall from comment #217)
> > (In reply to Klaus Mueller from comment #214)
> > > (In reply to Chris Hall from comment #212)
> > > > I admit that the PSU I was using was (a) old and (b) cheap.  I have now
> > > > replaced that by a brand-new, more efficient PSU which supports 0A
> loads
> > > > across all voltages.  I am running that with C6 enabled, and await
> > > > developments.
> 
> > > Could you please tell which PSU you're now testing?
>  
> > It's a "be quiet! Straight Power 11", Model E11-450W (BN280).
> 
> I have been running with the new PSU with C6 enabled (package and core), and
> with CONFIG_RCU_NOCB_CPU=y and rcu_nocbs=0-15, for 10 days.
> 
> This morning the machine was frozen.  So, new PSU has not cured the problem.

Thanks for your reply!

> With the old PSU the machine was freezing every 2-3 days.  So, things have
> improved... which could be the PSU, but the kernel (now 4.15.10) and other
> software has been updated and the idle load on the machine depends on how
> frequently somebody tries to ssh in :-(
> 
> The BIOS for my ASUS Prime X370-PRO does not have the magic "Power Supply
> Idle Control".  I will now try zenstates.py --c6-package-disable".

I've got the same board. My workaround is to overclock the CPU a bit: I'm using the PC scenario "Daily computing" (+200 MHz) - all other configurations are default and no RCU or C6 - workaround. Panagiotis Malakoudis had to use even higher overclocking (see comment 223) to work around the problem - but he has another CPU if I remember correctly - I have a AMD Ryzen 7 1700X and my clock for DRAM is 2400 MHz (default - it isn't changed). Overclocking doesn't affect power consumption here in idle mode (I tested it)!

Maybe you want to try it?
Comment 273 Kai-Heng Feng 2018-03-25 17:28:47 UTC
(In reply to Klaus Mueller from comment #270)
> (In reply to Jonathan from comment #265)
> > (In reply to Kai-Heng Feng from comment #263)
> > > If there's no noticeable power consumption increase when Package C6 gets
> > > disabled, and we are sure that it really workarounds the hard lockup,
> I'll
> > > send the patch to upstream.
> > 
> > What exactly would this mean (Linux newbie here)? Would this mean this
> would
> > be active by default on Linux (C6 disabled for Ryzen/TR/Epyc)? Because
> > electricity is bloody expensive here and the 5-10% power consumption
> > difference would matter to me.
> 
> Kai-Heng Feng seems to mulishly ignore the fact, that not *all people* need
> this workaround. I repeatedly wrote, that this *must be switchable*, as opt
> in e.g.

But lots of systems are affected by this, no? A functional kernel should work out of the box.
So a fix like this needs to be opt-out, like how all other workarounds are handled in the kernel. You could still easily opt out via the MSR.

But apparently people are not happy with this fix, so let's just wait for AMD's fix, if there is one.
Comment 274 Robert Hoffmann 2018-03-25 18:50:52 UTC
Well, judging by what I have read and what I have seen, this is obviously a fault of AMD or the Ryzen processor, either:
1. by allowing the power to the CPU to get too low, so it starves and stutters, or
2. by specifying the wrong CPU power requirements to the motherboard manufacturers.

I have read above that there is a site where they "play with the voltages". That seems to be the most obvious real fix to me. However, it's not for "granny", and it just should not happen to a company like AMD, or to a hyped product like Ryzen.

All said and done, when I leave my comp at home transcoding 8TB of videos to mobile format, the beast just rocks so good, and it never crashes then ...
Comment 275 Klaus Mueller 2018-03-26 02:17:03 UTC
(In reply to Kai-Heng Feng from comment #273)
> But lots of systems are affected by this, no? A functional kernel should
> work out of the box.
> So fix like this needs to be opted out, like how all other workarounds get
> handled in the kernel. You can still easily opt out via MSR.

Opt-out is fine with me, too. That's ok - the important thing is that it's switchable. But zenstates.py can't switch it off as of now, so it would have to be done via the kernel.

> But apparently people are not happy about this fix, let's just wait for
> AMD's fix, if there's one.

Because it wouldn't be a fix, but a workaround. That's the problem. AMD should fix it - but they seem to be lazy and just ignore it!
Comment 276 OptionalRealName 2018-03-26 03:28:57 UTC
I find it hard to believe this can't be fixed with a bios / microcode update.  It really is crazy.

Someone earlier here was going to contact AMD's graphics division to see if they could shed some light on it; I'm going to assume that went nowhere.


There is an AMD staff member who regularly posts on reddit; perhaps if this were posted on the AMD subreddit, succinctly, with evidence, it might get upvotes (though some AMD fans tend to be extremely defensive).

It's certainly frustrating; this bug has now been open for quite a while.
Comment 277 Sergio C. 2018-03-26 03:37:43 UTC
(In reply to OptionalRealName from comment #276)
> I find it hard to believe this can't be fixed with a bios / microcode
> update.  It really is crazy.
> 
> Someone earlier here was going to contact AMD graphics division to see if
> they could shed some light on it, I'm going to assume that went nowhere.
> 
> 


They probably got the same "I apologise for the delay in response, unfortunately the information you are requesting is private and confidential and not publicly available at this time." that I got when I asked about the power supply idle control setting.

On a side note, I am currently trying to debug an issue with a wireless card (Intel 7260, mobo ASRock AB350 Pro4, Ryzen 5 1600). Today I went back to an older BIOS that does not have the power supply workaround, and simply disabling package C6 did not help. I got a reboot with the infamous "mce: [Hardware Error]: CPU 4: Machine Check: 0 Bank 5: bea0000000000108" message; I hadn't seen that one in months :)
Comment 278 it 2018-03-26 06:57:04 UTC
(In reply to Panagiotis Malakoudis from comment #258)
> rdmsr -x 0xC0010292


52

> rdmsr -x 0xC0010296


484848
Comment 279 Panagiotis Malakoudis 2018-03-26 11:57:37 UTC
It seems my motherboard, the ASUS Prime X370, will be getting the new "Power Supply Idle Control" option in a new BIOS, version 3907 with AGESA 1002, as reported in https://www.hardwareluxx.de/community/f219/asus-prime-x370-pro-am4-1156996-289.html#post26231965

So I will be able to test this workaround soon.
Comment 280 Simon 2018-03-26 17:44:47 UTC
Hi there,

I have been reading this list for about 3 months, trying to get 4 hardware/software-identical computers running stably:

AMD Ryzen 7 1700
ASRock AB350M
openSUSE Tumbleweed
32 GB memory (the maximum for this board)
no overclocking; the memory (apart from the color) is correct for this board
(there are different part numbers for the memory, but they only relate to
the color of the cover, as my computer shop told me)

1) I exchanged all the processors twice; now the high-load bug is gone.
   UA 1714 PGT (the first processors, which had this bug)
   UA 1733 PGS (the second processors, which also had this bug)
   Now I have some others (no idea which week, but they are running
   with a modified "kill ryzen" script and a BIOS version LOWER than 3.2!!!!)
   -> so there ARE processors out there from week > 30 which still
      had the high-load bug!!

2) Still, with those BIOS versions the systems crashed randomly at low load
   about every 5 days. The systems run about 12 hours a day, are then shut
   down, and are started again the next morning; in this situation there was
   a crash roughly every 5 days.
   -> the crash(es) did not appear if the system ran continuously for
      one week without being touched!!!
   It crashes when starting a browser, or changing windows in KDE, or
   typing in LibreOffice after a while of doing nothing, so it seems
   there has to be a small load rising from nearly no load to trigger the problem.

3) I updated to BIOS 4.4, later to 4.5.
   The crashes are still present.
   But this BIOS has this feature:
   /advanced/amdcbs/zen-common-options/"power supply idle control"
   I changed this from "auto" to "typical" (as I have seen here in this list).
   Up to now, for about 3 weeks, no low-load crashes have appeared.
   BUT:

4) With BIOS 4.4 and 4.5 under high load I run into this problem:
   with the modified "kill ryzen" script, of the 16 loops I wrote 4 loops
   to the nvme0 (because for 16 loops 32 GB of memory is too small)
   and 12 loops to a ramdrive, so there is heavy load on the bus as well.
   kernel: pcieport 0000:00:03.1: AER: Multiple Corrected error received: id=0000
   kernel: pcieport 0000:00:03.1: PCIe Bus Error: severity=Corrected, type=Data Link Layer, id=0019(Transmitter ID)
   These errors occur every couple of seconds.
   With BIOS versions below 3.2 this was all fine; this error did not show up.

5) To solve problem 4), I added pcie_aspm=off to the kernel command line.

Now it SEEMS that these systems are stable under low and high load :-)))
... it cost me nearly half a year to get to this point.

Thanks to everyone here who reported their findings and gave me some hints.

simoN
Comment 281 Chris Hall 2018-03-28 14:43:02 UTC
FWIW, after 14 days of silence, I have heard back from "TECH.SUPPORT@AMD.COM".

When I last wrote I sent them the URL for this discussion.

"TECH.SUPPORT@AMD.COM" tells me:
__________________________________________________________________________
  Thank you for your patience, I understand that the idle hang freezing
  issue is frustrating.

  Thanks for sharing the Bugzilla link, I have glanced through the 
  discussion thread and saw that users that have the Power Supply 
  Control option available in the bios confirmed that changing the 
  setting has resolved their problem.

  This option is the AMD solution and has been provided to motherboard
  vendor for validation and inclusion in a future BIOS release.

  According to comment 279, a poster with the same motherboard as you,
  has pointed to a discussion where it states that an updated BIOS will
  be made available for your motherboard which contains the Power Supply
  Control Option. https://www.hardwareluxx.de/community/f219/
                   asus-prime-x370-pro-am4-1156996-289.html#post26231965

  Although I cannot confirm the accuracy of this post, you may want to
  check with ASUS directly.
__________________________________________________________________________

I did ask them what the "Power Supply Control" option actually does, and why.

I have asked again.

Given that the "Power Supply Control" option is "the AMD solution", I assume that AMD are the right people to ask?

FWIW: ASUS PRIME X370-PRO BIOS 3805 has appeared, with the release note "Improve system performance" -- but that does not have a "Power Supply Control" setting (or at least not in the "Advanced/AMD CBS" menu).
Comment 282 Panagiotis Malakoudis 2018-03-28 15:14:24 UTC
For ASUS X370 Prime, beta BIOS 3907 has this option available. 3805 does not have it.
I think I will test 3907 in the next days, although I would prefer not to use beta BIOS.
Comment 283 Panagiotis Malakoudis 2018-03-29 17:32:37 UTC
Has anyone tested with 4.15.x kernels without rcuo threads enabled? I've seen some reports from users that they no longer have the issue with the 4.15 kernel. I updated to 4.15 and have already survived two nights of idle usage at stock settings (no CPU overclock etc.). Of course this has happened a few times in the past as well, so I have to wait a few more days before drawing conclusions, but I wonder whether anyone else has tested with 4.15.
Comment 284 Artem Hluvchynskyi 2018-03-29 17:43:47 UTC
(In reply to Panagiotis Malakoudis from comment #283)
> Has anyone tested with 4.15.x kernels without rcuo threads enabled? I've
> seen some reports from users that they no longer have the issue with 4.15
> kernel. I updated to 4.15 and already survived two nights of idle usage at
> stock settings (no cpu overclock etc). Of course this has happened in the
> past as well some times so I have to wait a few more days to make
> conclusions, but I wonder if anyone else tested with 4.15

For me it froze in a bit more than a week of mixed usage on all defaults on 4.15.x.
Comment 285 Simon 2018-03-29 20:01:08 UTC
(In reply to Panagiotis Malakoudis from comment #283)
> Has anyone tested with 4.15.x kernels without rcuo threads enabled? I've

Hi, I'm using openSUSE's 4.15.10-1-default kernel; see my comment 280 for the system configuration (stable).
How do I check whether the rcuo threads are enabled or not?
Comment 286 Simon 2018-03-29 20:08:52 UTC
(In reply to Simon from comment #285)
> (In reply to Panagiotis Malakoudis from comment #283)
> > Has anyone tested with 4.15.x kernels without rcuo threads enabled? I've
> 
> Hi, i using opensuse's 4.15.10-1-default kernel see my comment 280 for
> system configuration (stable).
> how to check if rcuo treads are enabled or not?

maybe this is the info you need?:

zcat /proc/config.gz  | grep RCU

# RCU Subsystem
CONFIG_PREEMPT_RCU=y
# CONFIG_RCU_EXPERT is not set
CONFIG_SRCU=y
CONFIG_TREE_SRCU=y
CONFIG_TASKS_RCU=y
CONFIG_RCU_STALL_COMMON=y
CONFIG_RCU_NEED_SEGCBLIST=y
CONFIG_HAVE_RCU_TABLE_FREE=y
# RCU Debugging
# CONFIG_PROVE_RCU is not set
CONFIG_RCU_PERF_TEST=m
# CONFIG_RCU_TORTURE_TEST is not set
CONFIG_RCU_CPU_STALL_TIMEOUT=60
# CONFIG_RCU_TRACE is not set
# CONFIG_RCU_EQS_DEBUG is not set
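(Beyond the build config, a quick runtime check - assuming the usual naming of the RCU offload kernel threads - is to look for the rcuo* threads and for the boot parameter itself:)

pgrep -l '^rcuo'     # lists rcuob/N, rcuos/N, ... if callback offloading is active
cat /proc/cmdline    # confirms whether rcu_nocbs=... was actually passed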
Comment 287 Account Deleted 2018-04-01 07:59:23 UTC
I don't get it ... my Manjaro system was stable for 3 months, lately with 4.14 LTS kernels and constantly upgraded NVIDIA drivers. Now the system is unstable again, even when decoding HD video streams from the web ... I will give kernel 4.16 a try.
Comment 288 Chris Siebenmann 2018-04-06 16:17:22 UTC
We've now seen this idle lockup with a Ryzen Pro 1700X based system running current Fedora 27. As with my Fedora 27 system, it is avoided by using 'rcu_nocbs=0-15 processor.max_cstate=5'. The hardware is an Asus Prime Z370-Pro motherboard, EVGA SuperNova G3 650W PSU, 2x 8GB Kingston ECC RAM, and an nVidia graphics card.

(It's disappointing but not surprising that the Ryzen Pro doesn't seem to fix this.)
Comment 289 Chris Siebenmann 2018-04-06 16:42:37 UTC
Whoops, I need to correct my comment slightly, because Asus has so many motherboards: the Ryzen Pro is in an Asus Prime X370-Pro motherboard.

(There is no Asus 'Prime Z370-Pro', but there is an Asus 'Prime Z370-A' for current Intel CPUs so you can easily be confused just by changing a few letters.)
Comment 290 JerryD 2018-04-07 02:17:02 UTC
See also https://bugzilla.redhat.com/show_bug.cgi?id=1562530

I was thinking this was an amdgpu driver bug, but not sure now.
Comment 291 OptionalRealName 2018-04-10 02:00:12 UTC
Does the AMD Epyc embedded series have this bug?
Comment 292 Robert Hoffmann 2018-04-10 03:27:01 UTC
(In reply to OptionalRealName from comment #291)
> Does the AMD Epyc embedded series have this bug?

I am running a dual-socket standard Epyc (7351s - Zen rev. 2) server and I can say that the issue has not manifested itself in any way. The server sat completely idle for one week waiting for software installations, and there were no errors (CentOS 7.4).
I cannot say anything about the Embedded version, though. But I guess it may depend on the "successful" marriage with its motherboard in practice.
Comment 293 Michaël Colignon 2018-04-12 16:20:33 UTC
Hi.
I am testing it under Windows.
First, you should know that on Windows only the power-saver (economy) plan makes use of core parking. The normal plan doesn't, as shipped with the Ryzen chipset driver package - from what I can see via the ParkControl app.

I'm now using core parking to try to reproduce it under this other OS. It has been running for 4 hours now in power-saver mode with core parking enabled, and still no crash.

I use several OSes, and I had a crash (freeze) just after I installed the chipset driver package, which touches the scheduler settings.
Since then I have never had another freeze. I had to use the reset button to get out of that freeze.

From what I read in this thread, if I want to reproduce it on Linux, the best bet is an updated Fedora 27.

I'll be back with further tests.
Comment 294 Bráulio Bhavamitra 2018-04-13 22:49:34 UTC
(In reply to JerryD from comment #290)
> See also https://bugzilla.redhat.com/show_bug.cgi?id=1562530
> 
> I was thinking this was an amdgpu driver bug, but not sure now.

See my replies there; I think with Raven Ridge this really is related to the amdgpu driver.
Comment 295 Dennis Schridde 2018-04-14 07:25:33 UTC
I am using an AMD Ryzen 5 2400G on an Asus ROG Strix B350-F (firmware version 3805 and 3803 before that) and was also experiencing freezes, e.g. after about one hour into `rsync -a /home ...`, with Gentoo Linux (Linux 4.15.10), Arch Linux 2018.03.1, Fedora 27, Fedora 28 nightly, Fedora 28 beta.  Setting rcu_nocbs=0-7, processor.max_cstate=5 or intel_idle.max_cstate=5 did not seem to have an effect -- the freezes still occurred.  The hardware itself (mainboard, CPU, RAM) was cross-checked by the supplier, who found no fault in it.  Since I disabled CPU C-states in the mainboard firmware about a day ago, I am no longer able to reproduce the freeze.
Comment 296 AMD Linux User 2018-04-16 13:20:31 UTC
(In reply to Dennis Schridde from comment #295)
> I am using an AMD Ryzen 5 2400G on an Asus ROG Strix B350-F (firmware
> version 3805 and 3803 before that) and was also experiencing freezes, e.g.
> after about one hour into `rsync -a /home ...`, with Gentoo Linux (Linux
> 4.15.10), Arch Linux 2018.03.1, Fedora 27, Fedora 28 nightly, Fedora 28
> beta.  Setting rcu_nocbs=0-7, processor.max_cstate=5 or
> intel_idle.max_cstate=5 did not seem to have an effect -- the freezes still
> occurred.  The hardware itself (mainboard, CPU, RAM) was cross-checked by
> the supplier, who found no fault in it.  Since I disabled CPU C-states in
> the mainboard firmware about a day ago, I am no longer able to reproduce the
> freeze.

Ryzen 2xxx are almost Ryzen 1xxx but with a lower TDP, meaning lower CPU frequency and thus lower power.
And since all available solutions and workarounds amount either to not letting the platform give less power to the system (CPU, memory...) or to explicitly giving it more power, it is not surprising (in my opinion) that the bug is still here.
What will be more interesting is to see whether the bug is still there in the next generation of Ryzen...

Also, since it is now crystal clear that this is not a software bug (thus not a kernel bug), but clearly a bug from AMD (most probably in the overall design of Ryzen, both CPU and chipsets), please let me propose that you all consider again tweaking your motherboard voltage. Just give it a very small bump and you'll have a fully working system. The CPU will still be able to lower its power and scale its frequency... This is totally a win-win.
Comment 297 Panagiotis Malakoudis 2018-04-19 09:49:36 UTC
I flashed my motherboard (ASUS Prime X370 Pro) with the new 4008 BIOS that finally offers the "Power Supply Control" option. I set it to typical idle, disabled the CPU overclock that had made the problem less frequent, and am now waiting to see whether it will freeze.

What is already obvious, though, is that the CPU voltage never goes below 0.8V. Without the Power Supply Control option set to Typical, the voltage goes down to 0.39V.

XFR and single-core turbo work fine even though the voltage no longer drops to 0.39V.

Will report back in a few days.
Comment 298 Michaël Colignon 2018-04-19 12:24:47 UTC
I'm doing the same on my side.
Comment 299 Michaël Colignon 2018-04-19 12:26:23 UTC
(In reply to Panagiotis Malakoudis from comment #297)
> I flashed my motherboard (ASUS Prime X370 Pro) with the new 4008 BIOS that
> finally offers the "Power Supply Control" option. I set it to typical idle,
> disabled my CPU overclock that made the problem less frequent and now
> waiting to see if it will freeze.
> 
> What is already obvious though is that CPU power never goes under 0.8V.
> Without the Power Supply Control option set to Typical, voltage goes down to
> 0.39V
> 
> XFR and single core turbo works fine even if voltage is not dropping to 0.39V
> 
> Will report back in a few days.

How do you monitor XFR/single core turbo under Linux?
Thank you
Comment 300 Panagiotis Malakoudis 2018-04-19 15:19:47 UTC
I open a terminal and run as root:
watch -n 0.5 cpupower monitor

cpupower is in package linux-cpupower in Debian 9 that I am using.
apt install linux-cpupower
Comment 301 Panagiotis Malakoudis 2018-04-19 16:31:31 UTC
And if you want to run a single-core benchmark pinned to a specific core to check what frequency can be reached, it is easy with the following command:
taskset -c 10 openssl speed rsa2048

On my 1700X, cpupower monitor shows frequency up to 3880 for core 10 in the above benchmark.

If you want to stress two cores, you can either run two commands, or:
taskset -c 6,10 openssl speed -multi 2 rsa2048

Two cores go up to 3800 on my 1700X. Three and above go only to 3500. Two cores plus multithreading (taskset -c 6,7,10,11 for example) go up to 3650. This is where the new Ryzen 2xxx are better: they can use higher turbo when using more than 2 cores. It seems they can do 4000 MHz on up to 4 cores/8 threads with XFR2.
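(If cpupower isn't installed, the same thing can be watched with a few lines of Python reading /proc/cpuinfo - a rough sketch with a hypothetical file name; pass the core numbers you pinned with taskset:)

#!/usr/bin/env python3
# watch_mhz.py (example name): print the reported clock of selected cores
# twice a second, to see whether a pinned benchmark reaches its boost clock.
# Usage example matching the taskset command above:  ./watch_mhz.py 6 10
import sys
import time

cores = {int(arg) for arg in sys.argv[1:]} or {0}

def sample():
    freqs = {}
    core = None
    with open('/proc/cpuinfo') as cpuinfo:
        for line in cpuinfo:
            if line.startswith('processor'):
                core = int(line.split(':')[1])
            elif line.startswith('cpu MHz') and core in cores:
                freqs[core] = float(line.split(':')[1])
    return freqs

while True:
    print('  '.join('cpu%-2d %4.0f MHz' % pair for pair in sorted(sample().items())))
    time.sleep(0.5)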
Comment 302 Michaël Colignon 2018-04-19 18:35:39 UTC
(In reply to Panagiotis Malakoudis from comment #301)
> And if you want to test a single core benchmark pinned to a specific core to
> check what frequency can be reached, it is easy with following command:
> taskset -c 10 openssl speed rsa2048
> 
> On my 1700X, cpupower monitor shows frequency up to 3880 for core 10 in the
> above benchmark.
> 
> If you want to stress two cores, you can either run two commands, or:
> taskset -c 6,10 openssl speed -multi 2 rsa2048
> 
> Two cores go up to 3800 on my 1700X. Three and above go only to 3500. Two
> cores+multithread (taskset -c 6,7,10,11 for example) go up to 3650. This is
> the point where the new ryzen 2XXX are better, they can use higher turbo
> when using >2 cores. It seems they can do 4000 MHz up to 4 cores/8 threads
> with XFR2.
No problem for that; I use -mmt1, 2, 3, 4, etc. with 7za b.
And as you said, XFR and turbo are working.
Now I see that my performance under Windows is better; I think it's because the frequency-switching interval is shorter on Windows. I'm going to tune it a bit and see - on Windows, a better timing improves performance by ~10% IIRC.

Yes they can:
between 4.05 and 4.075 GHz over eight cores.
Comment 303 Michaël Colignon 2018-04-19 19:33:47 UTC
So far, 
[philectro@linux ~]$ uptime 
 21:32:34 up  7:14

The "typical current" setting seems to fix the problem.
The zenstates script says my C6 is still enabled, so I don't know what this parameter actually does.
Comment 304 Daniel Phillips 2018-04-22 21:14:34 UTC
For my Ryzen 1700 system, setting bios "Power Supply Idle Control" to "Typical Current" increases power consumption at idle by approximately 4 watts, from 39 watts to 43. Still impressively low, but I hope that AMD will give me my 4 watts back eventually. Uptime is currently 2 days, but has been as high as 6 weeks in the past before locking up, so I cannot yet report that my stability issue is resolved.

As others have reported, the "Typical Current" option disables package C6 state, but not core C6 state. It is high time for AMD to make an official statement on this issue.

Motherboard: GA-AB360-GAMING, stock voltages and clocks
Processor: Ryzen 7 1700, week 38
Power supply: EVGA 650 GQ
Video: Radeon 6450, radeon driver
Kernel: various, including 4.15.0
Comment 305 Daniel Phillips 2018-04-22 21:18:57 UTC
Correction: GA-AB350-GAMING
Comment 306 Panagiotis Malakoudis 2018-04-22 21:54:11 UTC
The package C6 state is not disabled with "Power Supply Idle Control" set to "Typical Current" on the ASUS Prime X370 with the 4008 BIOS. Also, disabling the package C6 state with zenstates.py didn't have any effect on previous BIOS versions.

I don't care about the 4 watts, but I would definitely like a technical explanation of the issue.
Comment 307 Michaël Colignon 2018-04-23 11:16:52 UTC
Same for me.
I confirm that on my X370-Pro "typical current" doesn't disable C6 at all.

As for the script, it disabled both C6 states on my motherboard with the 3805 UEFI version.
What do you mean by "didn't have any effect"?
My system was stable when I disabled C6.
Comment 308 Panagiotis Malakoudis 2018-04-23 11:19:15 UTC
If I only disabled package C6, I still had idle freezes. Completely disabling C6 was not an option, since you lose XFR and turbo speeds above 3500 MHz (on my 1700X).
Comment 309 Michaël Colignon 2018-04-23 11:22:23 UTC
I contacted AMD support about my Ryzen 1600 and asked for an explanation. I got an answer, but so far no explanation.
Comment 310 Klaus Mueller 2018-04-29 16:10:45 UTC
Today I tested BIOS 4008 (Prime X370-PRO; the previous version was 3404). The update wasn't a problem. I measured power consumption with "typical current" enabled and couldn't find any difference from the default setting (or from my current overclocking solution with BIOS 3404).

Sensors showed that with "typical current" enabled the minimum voltage is about 0.8 V, whereas the default ranges between 0.4 V and 0.8 V. The C6 (package) states remain enabled.

Nevertheless, I went back to 3404 (and overclocking) via DOS/afudos because of the completely broken fan regulation (the CPU fan minimum duty cycle is 60%, which is completely unusable) - I didn't want to use the modded BIOS because I'm not sure how to install it.
Comment 311 Michaël Colignon 2018-04-30 16:55:10 UTC
Now that my freezes are solved, I'm getting this kind of flood on 4.15.17 today (and maybe yesterday, as my SMB shares were oddly slow):

pcieport 0000:00:01.3: PCIe Bus Error: severity=Corrected, type=Data Link Layer, id=000b(Transmitter ID)
I rebooted several times to try different kernel versions,
4.16.4 and 4.15.16.
The same error appeared.

I shut down the computer, cut the PSU completely with the switch and waited.
I pushed the Radeon 7970 back into its slot a bit in case it had moved.
After powering it back on, the error doesn't appear anymore.
I tried to reproduce it with 4.15.17 but can't.

I have seen several users on Level1 with Threadripper, and some with Ryzen, who had this error.
Comment 312 Panagiotis Malakoudis 2018-05-02 10:57:17 UTC
Unfortunately I have to report that the idle freeze issue hasn't been fixed for me by the new BIOS and the "Typical Power Idle" option. My system has frozen twice since 19/4, when I installed the new BIOS.
Comment 313 Klaus Mueller 2018-05-02 15:42:14 UTC
(In reply to Panagiotis Malakoudis from comment #312)
> Unfortunately I have to report that the idle freeze issue hasn't been fixed
> for me with new BIOS and the "Typical Power idle" option.

What's your kernel version?

I'm testing at the moment with BIOS 4008 in the default configuration - i.e. without the "Typical Power Idle" option - and Linux 4.16.6. I haven't seen any problem so far, but this isn't a final result yet - it may change anytime.
Comment 314 Klaus Mueller 2018-05-02 16:28:17 UTC
I forgot to mention that I applied this patch (1) to 4.16.6; it is now part of 4.16.7. See the comments (2).

BTW, I couldn't measure any power consumption change with this patch.

(1) https://git.kernel.org/pub/scm/linux/kernel/git/tip/tip.git/commit/?h=x86-urgent-for-linus&id=da6fa7ef67f07108a1b0cb9fd9e7fcaabd39c051&utm_source=anz

(2) https://www.phoronix.com/scan.php?page=news_item&px=Linux-4.17-AMD-Power-Fix
Comment 315 Jonathan 2018-05-02 16:29:27 UTC
Has anyone tried a Ryzen 2000 series CPU / X470 mainboard yet?
There's no reason to wait for a TR 2000 series apparently, since it won't be using Zen+ cores.
Comment 316 OptionalRealName 2018-05-02 22:58:24 UTC
(In reply to Jonathan from comment #315)
> Has anyone tried a Ryzen 2000 series / x470 mainboard yet? 
> no reason to wait for a TR 2000 series apparently, since it won't be using
> Zen+ cores.

I literally just came here to ask the same question, curious whether they've actually fixed this properly.
Comment 317 Michaël Colignon 2018-05-03 11:28:58 UTC
(In reply to Panagiotis Malakoudis from comment #312)
> Unfortunately I have to report that the idle freeze issue hasn't been fixed
> for me with new BIOS and the "Typical Power idle" option. My system froze 2
> times since 19/4 when I installed the new BIOS.

Hmm, time to talk with AMD about an RMA then, because it does work for me.
But I still consider it a dirty workaround to hide the fact that the CPUs are defective.
Comment 318 Matthew Vaughn 2018-05-03 23:20:14 UTC
To answer the question re: the new Ryzen CPUs:

I'm running an AMD Ryzen 7 2700X on an ASUS Crosshair VII Hero X470 board, and I can positively confirm that the random soft-lockups are very much still a thing. Nothing appears in dmesg, and nothing appears in the system journal.

My kernel is 4.16.6 (gentoo-sources patchset) and so far, the only troubleshooting I've done before landing on this bug report is to set 'processor.max_cstate=5' in my kernel boot options. The machine soft-locked about two hours later whilst I was using it (a few minutes ago).

I'll go back through the comments and suggestions to attempt the deeper workarounds, including kernel 4.16.7, but I thought this information might be valuable to someone first.

If anyone has anything they'd like me to test in particular, this is a Gentoo system and I am more than happy to subject it to trials in service of finding a solution to this bug.
Comment 319 Michaël Colignon 2018-05-03 23:41:43 UTC
You haven't tested the "typical current" option in the UEFI?
It's the official workaround.

If you still have freezes with the B2 stepping, ouch.
I thought they had at least fixed it there...

Do you mean it freezes completely and you need to press reset?
Comment 320 OptionalRealName 2018-05-03 23:55:01 UTC
(In reply to Matthew Vaughn from comment #318)
> To answer the question re: the new Ryzen CPUs:
> 
> I'm running an AMD Ryzen 7 2700X on an ASUS Crosshair VII Hero X470 board,
> and I can positively confirm that the random soft-lockups are very much
> still a thing. 


This is truly appalling.  Thanks for the information.
Comment 321 Matthew Vaughn 2018-05-04 00:02:50 UTC
(In reply to Michaël Colignon from comment #319)
> You don't have test the "typical current" option in the uefi?
> It's the official workaround.

I only found this bug report in the last hour, so I hadn't yet switched on the 'Typical Current Idle' option in UEFI. I just switched that on during my upgrade to kernel 4.16.7 a moment ago. I'll return here and report my results.
 
> You means it freeze completely and you need to press reset?

That's correct; up until now, it would freeze completely and I needed to press reset to reboot the machine. This had happened several times in the last 24 hours before I started looking around for reports of the issue and landed here. It's a brand-new installation.
Comment 322 oyvinds 2018-05-04 02:03:00 UTC
Gigabyte's GA-AX370-Gaming-5 motherboard finally got the Typical Idle option in BIOS version F23f with AGESA 1.0.0.2a + SMU FW 43.18 dated 2018/05/01. This is over ONE YEAR since the first release of Ryzen CPUs.

Can ya'all verify for me that this actually works - long term - and that I can remove both the rcu_nocbs=0-7 kernel boot option and my custom disable-c6-on-boot.service & disable-c6-on-suspend.service (which disables both C6 package and core)?

Or do I need to keep either rcu_nocbs or zenstates --c6-disable?

I would very much like to know if Typical Idle would be enough. I realize I could simply test this but I don't want to experiment and have random hangs. I'm on kernel 4.17rc3 btw.

If Typical Idle (I also see an "auto" option) actually fixes this, then it's still a TOTAL SCANDAL since it's been a year now. But hey, better late than never.
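(For reference, the kind of boot-time unit being described might look roughly like the sketch below - assuming zenstates.py is installed at /usr/local/sbin/zenstates.py and using its --c6-disable option; paths and option names depend on your copy of the script.)

# /etc/systemd/system/disable-c6-on-boot.service (sketch)
[Unit]
Description=Disable C6 on Ryzen (idle-freeze workaround)

[Service]
Type=oneshot
RemainAfterExit=yes
ExecStartPre=/sbin/modprobe msr
ExecStart=/usr/local/sbin/zenstates.py --c6-disable

[Install]
WantedBy=multi-user.target

# enable with: systemctl daemon-reload && systemctl enable --now disable-c6-on-boot.service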
Comment 323 Matthew Vaughn 2018-05-04 04:51:28 UTC
I'm testing with just the 'Typical Current Idle' option and no other countermeasures enabled right now. I'll give it 24 hours before I'm convinced, given the frequency of previously observed lockups. So far, there have been none since my last comment, which is longer than any previous uptime interval I've had on this hardware since installing it.

For what it's worth, the 'auto' option for the current idle setting is no good. That's the default, and was the setting in effect when I was observing lockups.
Comment 324 it 2018-05-04 07:05:47 UTC
(In reply to oyvinds from comment #322)
> Gigabyte's GA-AX370-Gaming-5 motherboard finally got the Typical Idle option
> in BIOS version F23f with AGESA 1.0.0.2a + SMU FW 43.18 dated 2018/05/01.
> This is over ONE YEAR since the first release of Ryzen CPUs.
> 
> Can ya'all verify for me that this actually works - long term - and that I
> can remove both the rcu_nocbs=0-7 kernel boot option and my custom
> disable-c6-on-boot.service & disable-c6-on-suspend.service (which disables
> both C6 package and core)?
> 
> Or do I need to keep either rcu_nocbs or zenstates --c6-disable?
> 
> I would very much like to know if Typical Idle would be enough. I realize I
> could simply test this but I don't want to experiment and have random hangs.
> I'm on kernel 4.17rc3 btw.


I haven't experienced any lockups for more than 2 months now. Only Typical Idle enabled, nothing else changed.
Running Ubuntu 17.10 with the stock 4.13.0-39-generic kernel.
Comment 325 James Le Cuirot 2018-05-04 08:52:43 UTC
(In reply to oyvinds from comment #322)
> Gigabyte's GA-AX370-Gaming-5 motherboard finally got the Typical Idle option
> in BIOS version F23f with AGESA 1.0.0.2a + SMU FW 43.18 dated 2018/05/01.
> This is over ONE YEAR since the first release of Ryzen CPUs.

I have this board and it was available in F22 over a month ago. Still not great but I honestly don't think AMD were even aware of this issue before last August or so. The other segfault issue was getting far more attention and many wrongly assumed the freezes were related.

> I would very much like to know if Typical Idle would be enough.

It is enough. I have disabled all other workarounds and I've maybe had one freeze since though I can't remember for sure and that may well have been down to something else.
Comment 326 Jonathan 2018-05-04 09:32:34 UTC
(In reply to Matthew Vaughn from comment #318)
> I'm running an AMD Ryzen 7 2700X on an ASUS Crosshair VII Hero X470 board,
> and I can positively confirm that the random soft-lockups are very much
> still a thing. Nothing appears in dmesg, and nothing appears in the system
> journal.

Very disappointing. AMD really isn't taking this seriously. Since power consumption is really important to me, and my PC is often idling (so either live with freezes or accept even higher power consumption), this really means no AMD. Especially since their next-gen Threadripper is supposed to use an older Epyc core, not even Zen+. And since they're not taking this seriously, my guess is Zen 2 next year will still be buggy. Way to go, AMD.

And with Intel just getting hit with another Spectre round, I guess their fixes will slow their CPUs down to even slower than my old unpatched Sandy Bridge.

*grumble*

Oh well, this is one way to get upgraditis cured I guess.
Comment 327 Chris Hall 2018-05-04 13:12:12 UTC
I lost interest in my Ryzen 7 1800X machine... but this morning I got round to upgrading the BIOS on the ASUS Prime X370-Pro (to 4011, hot off the press: "Update AGESA 1.0.0.2a + SMU 43.18" whatever that means).

I was running with rcu_nocbs and zenstates --c6-package-disable.  That ran for 11 days and then froze.

I last heard from "TECH.SUPPORT@AMD.COM":

  Thank you for the update and confirming that your BeQuite Straight Power 11
  supports 0A minimum load. 

  Because your system still freezes, it could be due to cross loading problems 
  which can result in the power supply turning off when a load changes or 
  result in voltages becoming out of specification causing system crashes and
  hangs.
 
  What the Power Supply Idle Control option does is disables the lowest power 
  state for the CPU.  This is where ALL cores are not in operation and the 
  entire CCX or Core complex is taken down.

  There are many levels of power states that a core can be in from C1 to C6, 
  CC6 and finally PC6.  The Power Supply Idle Control option is designed to 
  keep enough current on the rail so that power supply does not go out of 
  regulation.

  The Power Supply Idle Control option is part of an AGESA update from AMD 
  provided to the motherboard vendors for validation and implementation in 
  their BIOS updates.  However, it is motherboard vendors decision as this 
  which BIOS version will contain the Power Supply Idle Control option.

So now I have the options "low current idle", "typical current idle" and "auto".  Neither AMD nor ASUS seems to think it necessary to document what those mean.

I have set "typical current idle".  I note that zenstates shows that both C6 States Package and Core are Enabled.

I guess I am back to waiting and seeing.
Comment 328 JerryD 2018-05-04 15:46:52 UTC
(In reply to Chris Hall from comment #327)
--- snip ---
> So now I have options: "low current idle", "typical current idle" and
> "auto".  Neither AMD nor ASUS seem to think it necessary to document what
> those mean.
> 
> I have set "typical current idle".  I note that zenstates shows that both C6
> States Package and Core are Enabled.
> 
> I guess I am back to waiting and seeing.

So have you tried the "auto" mode, or "low current idle"?  It can't hurt to try them and just let the machine run doing something.
Comment 329 Michaël Colignon 2018-05-04 15:50:14 UTC
Auto must make use of low current idle, the setting that causes the freezes.
I can say that auto gives freezes, as it did for Matthew.
Comment 330 Matthew Vaughn 2018-05-05 02:25:59 UTC
24 hours of uptime without a lockup after only setting "typical current idle." Considering my rig had been locking up after very short uptimes without this setting, I'd say that's a significant difference.
Comment 331 tcl_de 2018-05-05 13:46:32 UTC
(In reply to Chris Hall from comment #327)

--snip--
> I last heard from "TECH.SUPPORT@AMD.COM":
> 
>   Thank you for the update and confirming that your BeQuite Straight Power 11
>   supports 0A minimum load. 
> 
>   Because your system still freezes, it could be due to cross loading
>   problems 
>   which can result in the power supply turning off when a load changes or 
>   result in voltages becoming out of specification causing system crashes and
>   hangs.
>  
>   What the Power Supply Idle Control option does is disables the lowest power 
>   state for the CPU.  This is where ALL cores are not in operation and the 
>   entire CCX or Core complex is taken down.
> 
>   There are many levels of power states that a core can be in from C1 to C6, 
>   CC6 and finally PC6.  The Power Supply Idle Control option is designed to 
>   keep enough current on the rail so that power supply does not go out of 
>   regulation.
> 
>   The Power Supply Idle Control option is part of an AGESA update from AMD 
>   provided to the motherboard vendors for validation and implementation in 
>   their BIOS updates.  However, it is motherboard vendors decision as this 
>   which BIOS version will contain the Power Supply Idle Control option.
> 
--snip--
Actually, there should not be any cross-loading problems by design here, as this PSU uses DC/DC converters to regulate +5V and +3.3V independently of +12V.

I suspect the problem is more likely related to the voltage regulation on the mainboard.
It's good to know, however, that the Power Supply Idle Control option is part of an AGESA update.
Comment 332 Michaël Colignon 2018-05-05 13:53:57 UTC
I suggest we all fill in a spreadsheet on the web with our hardware, PSU included, to get the final word on this story.
I'll let you decide on the right tool for it. I can host it if needed.
Comment 333 Klaus Mueller 2018-05-14 15:54:19 UTC
(In reply to Klaus Mueller from comment #313)

> I'm testing at the moment w/ Bios 4008 / default configuration - means w/o
> using "Typical Power idle" option and Linux 4.16.6. I didn't see any problem
> so far - but this isn't a final result until now - it may change anytime.

=> Got a hang yesterday. 
watchdog: BUG: soft lockup - CPU#8 stuck for 23s! [worker:11105]

I went back to BIOS 3404 / Linux 4.14.x and overclocking, which is known to be stable for me.
Comment 334 OptionalRealName 2018-05-16 01:17:29 UTC
Someone needs to make Tomshardware, HardOCP, Anandtech aware of this thread.
Comment 335 kernel 2018-05-16 07:09:36 UTC
I think that is a great idea. I have time to waste at airports soon, so I can draft an email message.

Recipients:
Tomshardware, HardOCP, Anandtech, Ars Technica, Tweakers.net, Phoronix, ... anyone else?

Contents:
- Total system freezes on full idle
- Seemingly/definitely(?) a hardware bug
- AMD has been contacted multiple times but refuses to acknowledge the issue
- AMD first blamed PSUs
- AMD then pushed mobo manufacturers to add an undocumented BIOS option, which in its default setting does not solve the problem and with other settings doesn't always solve the problem
- Cause is still unknown
- Issue is also present in Ryzen 2xxx CPUs

Other suggestions welcome before I start writing it up!
Comment 336 Simon 2018-05-16 07:18:31 UTC
I would suggest changing one of your lines:

> - AMD then pushed mobo manufacturers to add an undocumented BIOS option,
> which in its default setting does not solve the problem and with other
> settings MOST TIMES (NOT ALWAYS) solve the problem

and adding one sentence:
here is the setting: /advanced/amdcbs/zen-common-options/"power supply idle control" -
change it from "auto" to "typical"

simoN
Comment 337 AMD Linux User 2018-05-16 13:37:52 UTC
(In reply to kernel from comment #335)
> I think that is a great idea. I have time to waste at airports soon, so can
> draft an email message.
> 

Just to tell part of my story: I bought the well-known Asus laptop with a Ryzen CPU. It was exhibiting the segfault bug. But before I could be fully sure of that, it burnt out just two weeks after I bought it... I sent it to Asus, hoping for a new laptop. Asus officially told me that they replaced the motherboard and the keyboard (they did not say whether the CPU was replaced, and I was not able to check this).
The repaired laptop no longer exhibited the segfault bug (same kernel version). But it exhibited the soft-lockup bug. I didn't know it was this bug at the time. So I returned the laptop to Asus again, asking to be reimbursed. Asus repaired it again by replacing the motherboard again. The bug was still present. (And for those interested, I did finally get reimbursed.)

Then I decided to do as usual: build my own desktop. And I was facing the soft-lockup bug until I tweaked the voltages...

To come back to the story: if we can trust Asus, a motherboard change made the segfault bug disappear (AMD officially 'replaced' such 'faulty' CPUs) but made the soft-lockup bug appear. A second motherboard change did not improve anything.

What I guess (since this cannot really be verified) is that AMD does not know where all these bugs come from, does not know how to resolve them, and does not even know whether they can be solved.

What we do know is that the guy responsible for the Ryzen architecture was hired back by AMD (after having moved to Intel or Apple), but he left AMD 3 years ago (so before Ryzen came out). He moved to Tesla and is now working at Intel again.
Link is here: https://en.wikipedia.org/wiki/Jim_Keller_(engineer)
Comment 338 AMD Linux User 2018-05-16 13:44:04 UTC
And before my last paragraph gets misinterpreted: I just meant that this is the person to contact if we want some concrete information.
Comment 339 kernel 2018-05-18 19:31:53 UTC
I've sent the aforementioned outlets an email. For completeness, here's the entire thing:

Recipients:
Ars Technica: https://arstechnica.wufoo.com/forms/z7p8x7/
Tomshardware: http://www.purch.com/about/#contact-general
HardOCP: hardnews@hardocp.com
Anandtech: http://www.purch.com/about/#contact-general
Tweakers.net: redactie@tweakers.net
Phoronix: https://www.phoronix-media.com/?

Subject: Publication of an AMD Ryzen hardware issue

Dear editor,

Since their introduction the AMD Ryzen processors have been plagued by several issues, most notably the segfault issue that occurred under high (compilation) loads. To that particular issue AMD has responded by replacing affected chips.

However, there is another significant issue that affects both Ryzen 1xxx and Threadripper CPUs, as well as the newer Ryzen 2xxx processors. It appears Epyc is not affected (although sample size is one in this case).

The most complete storyline on this issue can be found in the link below [1]; however, for an overview, the rest of my email attempts to summarise the issue.

This issue results in a complete system freeze, occurring under full idle conditions, and requires a hard reset. Evidence suggests this is a hardware problem, since several workarounds have been found that mitigate/solve the issue, such as disabling C6 entirely (inefficient), overclocking and tweaking voltages (not for everyone), or running processes that keep the CPU active at all times (again, inefficient and pointless).

AMD has been contacted multiple times but refuses to acknowledge the issue. At some point in one reply AMD blamed users' PSUs; this is obviously nonsensical, as the issue occurs on a wide variety of PSUs including brand new models; the only constant is the Ryzen platform.

In response, AMD has pushed motherboard manufacturers to add an otherwise undocumented BIOS option, "Power Supply Idle Control", as part of their AGESA update. However, in its default value this setting does *not* solve the problem, and with other settings doesn't *always* solve the problem. To be precise, this setting needs to be changed from "auto" to "typical":
/advanced/amdcbs/zen-common-options/"power supply idle control"
However it is unknown what this option actually does.

The issue is also present on laptop platforms. For example, one user sent his laptop back to Asus repeatedly for motherboard replacements, but the issue remained.

The root cause of this issue is still unknown, and moreover, the issue is still present in the latest Ryzen 2xxx CPUs. AMD's refusal to acknowledge, let alone resolve, the issue has led to this email, in the hope that media attention to this problem will inspire AMD to take action. Hence, I propose that your outlet publish an article about the issue.

Kind regards,
kernel@sesmo.eu

[1] https://bugzilla.kernel.org/show_bug.cgi?id=196683
Comment 340 kernel 2018-06-01 12:05:36 UTC
Sadly I haven't received anything back apart from the following from HardOCP:

"We touched on this issue in one of our recent reviews and found the issue to be with the Power settings implemented by Microsoft.  The bug has been reported to MS.  If you cycle through one of two of the other power settings, the issue will go away from our experience."

To which I pointed out that this issue occurs under Linux in particular (due to full idle). Haven't heard back yet.

Does anyone here have direct contacts with any of the major outlets?
Comment 341 Dennis Schridde 2018-06-01 13:18:25 UTC
(In reply to kernel from comment #340)
> Sadly I haven't received back apart from the following from HardOCP:

I got a reply from c't / Heise.  They say that during their experiments with Linux on Ryzen they could not find any evidence of this bug so far.
Comment 342 Jonathan 2018-06-04 13:44:57 UTC
So basically, the press is not interested. Big surprise. Anyway, c't reported today on the new 4.17 kernel, with the following about low-power states for AMD Ryzen. Anyone care to see if this has an effect on this bug?

"Auch Systeme mit AMDs aktuellen Prozessoren dürften mit 4.17 etwas sparsamer laufen. Das ist einer kleinen Änderung am Code zu verdanken, der den Prozessor im Leerlauf schlafen schickt. Er nutzte auf AMD-CPUs bislang den MWAIT-Aufruf, durch den AMDs aktuelle Prozessoren aber lediglich in den Schlafmodus C1 wechseln; jetzt verwendet der Kernel CPUIDLE oder HALT, durch die Ryzen, Epyc & Co. auch in tiefere und daher effizientere Schlafzustände wechseln."

Translation:
Systems with AMD's current processors should also run somewhat more efficiently with 4.17. This is thanks to a small change to the code that puts the processor to sleep when idle. On AMD CPUs it previously used the MWAIT instruction, through which AMD's current processors only reach sleep state C1; now the kernel uses CPUIDLE or HALT, through which Ryzen, Epyc & Co. can also reach deeper and therefore more efficient sleep states.
Comment 343 Freihut 2018-06-04 20:09:13 UTC
(In reply to Jonathan from comment #342)
> Anyone care to see if this has an effect on this bug?


Better to save your time: on my Ryzen 2500U, neither kernel 4.15 (Ubuntu's), nor 4.16 (Fedora's), nor 4.17-rc6/7 (mainline) had any effect on this.

Ubuntu & mainline kernels needed "pci=noacpi" or "acpi=noirq" to even boot. 
Fedora's just needed "nomodeset", if you wanted to boot with XFCE, and nothing at all with Gnome.

Doing a backup with fsarchiver

  fsarchiver savefs -o -j8 -A -a -v -Z3 $BackupfileDestination $Partition

always leads to a freeze within 1-5 runs. Fsarchiver causes a lot of up- and downclocking (and boosting above base clock) while producing only partial CPU load.
Full idle never caused a freeze, and neither did full load (handbrake, kill-ryzen.sh).

On kernel 4.13 the fsarchiver runs never caused a freeze (max CPU clock 2 GHz).
Also, setting scaling_governor to powersave on all other tested kernels prevented the freezes (the CPU clock was locked to 1.6 GHz in that case).

Things I tried (unsuccessfully) to get rid of the freezes:
*blacklisting amdgpu
*kernel @ CONFIG_RCU_NOCB_CPU and rcu_nocbs=0-7
*rmmod ath9k (rmmod ath10k_pci ath10k_core)
*30.5.2018 Ubuntu's microcode update for amd64
*disabled ASLR
*new mesa (18.0rc5 > 18.1.1. > 18.2 dev)
*c6-disabled via service
via grub:
*acpi=strict
*libata.noacpi
*amdgpu.dpm=0
*.dc=0
*.audio=0
*.bapm=0
*noibrs noibpb nopti
*libata.force=noncq
*pcie_aspm=off & the others
*amd_iommu=off & the other options (including iommu + options)
*pci=nocrs 
*nomodeset
*all acpi= parameters
*processor.max_cstate=5
Comment 344 James Le Cuirot 2018-06-04 20:22:30 UTC
(In reply to Freihut from comment #343)
> 
> Better save your time, on my Ryzen 2500U; Kernel 4.15 (Ubuntu's), nor 4.16
> (Fedora's) nor 4.17rc6+7 (Mainline) had an effect on this.

My earlier posts above relate to my desktop Ryzen 5 1600X but I've also been seeing daily freezes with a 2700U-powered laptop. I don't believe this is the same issue though as ZenStates.py did not help. I think it's more likely to be graphics-related, especially given what another user with the same hardware reported in bug #199653. Maybe your situation is different though.
Comment 345 Lars Viklund 2018-06-05 12:45:24 UTC
(In reply to Freihut from comment #343)
> Doing a backup with fsarchiver
> ```fsarchiver savefs -o -j8 -A -a -v -Z3 $BackupfileDestination
> $Partition``` 
> __always lead to a freeze within 1 - 5 runs__). Fsarchiver causes a lot of
> up- and downclocking (and boosting above baseclock), while it produces only
> partial cpu-load). 
> Full idle never caused a freeze neither did full load (handbrake,
> kill-ryzen.sh).

This is starting to sound very similar to my total system freeze with ZFS on FreeBSD on all three of my Ryzen 1700 chips on two different ASUS motherboards (B350M-A and X370-PRIME).

The underlying storage I had there was a pile of four HDDs, or a SSD, which I made a snapshot of and then:

  while true; do zfs send -R stuff@foo | pipemeter | cat >/dev/null; done

This would lock the system up after about 2-7 terabytes of HDD data, if memory serves. It would not happen on USB HDDs, but happened regardless of whether the disks were onboard or on a separate disk controller. I've also managed to reproduce it by receiving a stream over the network and writing it out to said disks, but this one is easier to run as it doesn't need another machine to feed it.

Doing the same under Linux at the time did not produce any similar hangs, but that may be due to a different implementation of ZFS and not using pipemeter.

I discussed it with AMD support over several months (Aug-Dec 2017) and they gave me all sorts of power-twiddling suggestions for the firmware, but the only thing that ever really worked was to disable hyperthreading (SMT), bringing the machine down to 8C/8T.

Have you tried disabling SMT in your firmware to see if it changes anything?
Please note that if you do this and then upgrade your firmware, you may need to do a full wipe of settings to even get the option back. Support (rightfully) recommended doing full clears of firmware settings between upgrades; it seems the settings migration path isn't quite solid.
Comment 346 Freihut 2018-06-05 20:45:25 UTC
(In reply to Lars Viklund from comment #345)
> Have you tried disabling SMT in your firmware and seen if it changes
> anything?

I can't do that, because it's a notebook with only 2 BIOS options (something virtualization something). I tried disabling both, but that had no effect at all.
As far as I can remember, I tried disabling SMT via GRUB, but only to test whether it would boot (it wouldn't).

But I wouldn't say it's SMT-related, because the CPU runs utterly fine while it's below its base clock or at full load.
The freezes also occurred on single-threaded workloads: during playback of random YouTube videos (autoplay, no user interaction) with Pale Moon (a Firefox fork). That usually took 30-120 minutes to freeze and was also tested with amdgpu disabled, so I'm pretty sure it's CPU-related.

I have to send it back to the vendor now, because I'm running out of time (for the right of return).
Comment 347 Dennis Schridde 2018-06-07 03:38:37 UTC
(In reply to Dennis Schridde from comment #295)

As an update from my earlier post, here is what I wrote to AMD support recently:

I can confirm the effectiveness of the "Power Supply Idle Control = Typical Current Idle" non-default firmware-option introduced with AGESA 1.0.0.2a only partially.  After setting this option it often still takes several attempts until the machine boots -- during the failing attempts I either get 3 long beeps from the mainboard followed by an automatic reboot, or I see varying Linux kernel stack-traces during the early boot, seemingly related to firmware / EFI and CPU / idle.  Sometimes the system can be restarted from this state using soft-reboot (ctrl+alt+del).  But sometimes the situation requires a hard-reset, e.g. because the system completely froze (similar to the original problem reported here), or because the init process dies.  I have the feeling that in such situations the system, even if it does not freeze completely, works with corrupt data and then writes this to the hard disk [3,4].  I destroyed my installation several times in the last month due to these problems, e.g. because the system had destroyed all (?!) superblocks of the file system, or lately the LVM cache.  Surprisingly, once the system has booted properly, it will run stably for days without any further issues.

I could not reproduce the problem synthetically using either PassMark MemTest86 [5], TU Dresden FIRESTARTER [6] or the Google Stressful Application Test [7], the latter including CPU, RAM, hard drive and file-system tests.  My hardware supplier also tested all components (CPU, RAM, mainboard) again using Windows with Prime95 and Furmark, as well as with Memtest, and assured me that they could not detect an issue either.  During regular use I can reliably reproduce it on Gentoo (with Linux 4.16), Fedora 27 (Linux 4.14), Fedora 28 Beta (Linux 4.15), Fedora 28 (Linux 4.16) and Arch 2018.03 (Linux 4.15).  Before the introduction of the "Power Supply Idle Control = Typical Current Idle" firmware option, I first noticed the issue while compiling large amounts of software on Gentoo, but was later able to reliably reproduce the freeze with simple `rsync` operations on all other operating systems, too.  This even happened when no X server was running and the GPU was not utilised in any other way (as far as I know), so I do not see a connection to possibly incomplete Vega GPU support in Linux.  After setting "Power Supply Idle Control = Typical Current Idle" these freezes stopped happening and I only see the problem during early boot.

[3]: https://www.redhat.com/archives/linux-lvm/2018-May/msg00006.html
[4]: https://bugzilla.redhat.com/show_bug.cgi?id=1585670
[5]: https://www.memtest86.com/
[6]: https://tu-dresden.de/zih/forschung/projekte/firestarter
[7]: https://github.com/stressapptest/stressapptest
Comment 348 rainer 2018-06-07 05:07:51 UTC
Hi Dennis, from my point of view your case is not the same as the idle freezes reported in this bug. As the owner of a Ryzen 7 1700 that was replaced because of the other segfault-during-compilation bug, the bug mentioned here is known to me, but my feeling is that it is quite rare now. Comparing that to my Ryzen 3 2200G would be unfair, as it is a completely different beast. That said, I can assure you I have trouble with that system as well; it's a known issue and Phoronix posted about it, saying the 2400G might be stable now, but not the 2200G.

If you have issues with this APU, which seems quite likely reading your post, then from my own experience you need the latest kernel, latest UEFI, latest firmware and latest Mesa. This should give you a working system; at least my 2200G still often fails to boot, but once booted it works OK. But as this is an APU it is far more complex, and you may be seeing GPU issues, which is why I think your post doesn't belong in the original bug report.

From my personal experience, I once got a corrupted btrfs file system on my Ryzen 7 1700, but in the end one of my 16 GB RAM modules was defective (detected by Memtest86+) and it corrupted the file system during a crash. Again, I have the feeling we shouldn't mix too many different issues in this bug report, and I think my issues with the 2200G do not belong here.
Comment 349 Daniel Phillips 2018-06-09 01:41:03 UTC
I am happy to report 45 days of continuous uptime with "typical current idle" selected in the BIOS, and otherwise vanilla everything, including a 4.15.0 kernel. My system configuration is as posted above.

I find myself willing to believe the AMD "indirectly" statement: this is related to the power supply. My power supply is an EVGA 650GQ, with the eco switch set to on to reduce fan noise. I suspect that if I swapped it for some other power supply, perhaps one not quite so efficient, or flipped the eco switch off, I would find myself among those lucky users able to run with default BIOS settings and 4 watts less power consumption at idle. Unless anybody can refute this theory, I would consider it advisable for anybody still hitting problems to post their power supply details.

I suspect the truth is that some power supplies just do not expect such a low full-system power draw, and freak out about it. Without "typical current", mine draws 38 watts at the wall including 32GB of memory, but unfortunately it only lives a few days at that setting.
Comment 350 Chris Hall 2018-06-09 12:18:25 UTC
(In reply to Daniel Phillips from comment #349)
> I am happy to report 45 days of continuous uptime with "typical current
> idle" selected in bios, and otherwise vanilla everything including 4.15.0
> kernel. My system configuration is as posted above.

With my Ryzen 7 1800X and Asus X370 Pro: I can report that with the magic "typical" BIOS setting I have today 30 days uptime.  This is more than twice the previous record.

FWIW: I am so fed up with this machine that I haven't used it since I updated the BIOS and applied the setting.  It is running 4.16.5 (Fedora 27), with CONFIG_RCU_NOCB_CPU and rcu_nocbs=0-15.  I don't know if the rcu_nocbs=0-15 is still required.

Also FWIW: zenstates.py -l tells me that C6 Package is Disabled, but C6 Core is Enabled.  Before the BIOS update I used zenstates.py to set C6 the same way, but the machine froze after some 12 days.  After the BIOS update I no longer use zenstates.py to set anything.  So I guess the BIOS "typical" option disables C6 Package, but also does some other magic.
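For reference, here is a rough sketch of the zenstates.py invocations being discussed. The flag names are those of the commonly circulated ZenStates-Linux script and may differ in patched copies (check zenstates.py --help); the script talks to MSRs, so it needs root and the msr kernel module:

  modprobe msr                  # expose /dev/cpu/*/msr for the script
  ./zenstates.py --list         # print P-states plus the C6 package/core status
  ./zenstates.py --c6-disable   # disable C6; this does not survive a reboot

Whatever the script sets is lost on reboot, which is why people run it from an init script or boot-time service.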

> I suspect that the truth is, some power supplies just do not expect such low
> full system power draw, and freak out about it. Without "typical current",
> mine is 38 watts at the wall including 32GB of memory, but unfortunately,
> only lives a few days at that setting.

Mr BeQuiet! are adamant that the Straight Power 11 I have is perfectly happy to supply 0A at all voltages.

Of course, there's a lot of stuff between the PSU and the CPU... so it could be a motherboard issue.  Who can tell ?

Possibly, some day, I will go back to using my AMD machine, but I doubt I shall come to be fond of it :-(  Certainly I am livid with AMD's abject failure to address the issue promptly, and their continuing inability to discuss or document the issue.  Bugs happen.  It's how they are dealt with that separates the sheep from the goats.  <sigh>
Comment 351 AMD Linux User 2018-06-09 14:28:49 UTC
Since the problem has not been (and it seems will not be) solved, may I ask people here to send me the full specs and BIOS settings of working configurations, so that I can publish them on a web page (with the hope of it being useful to other people)?
So basically I am looking for the original spec list (hardware, BIOS version and settings, boot options) and what you did to make your system stable (let's say stability starts at 14 days of uptime or more).

Just email me if you are interested, and if I get enough entries to put in that DB, I will post the link to the web page here later on.

Since the power supply (and hard drives) still seem to be an important piece of the puzzle, send me that information too.
Comment 352 Kai-Heng Feng 2018-06-09 14:33:53 UTC
Is there anyone not affected by this when Package C6 is enabled?
Otherwise I intend to send a patch which has the same effect as "typical current".
Comment 353 Bráulio Bhavamitra 2018-06-09 19:02:32 UTC
(In reply to Kai-Heng Feng from comment #352)
> Is there anyone not affected by this when Package C6 is enabled?
> Otherwise I intent to send the patch, which has the same effect as "typical
> current".

That would be perfect! Can you send us that patch for testing?
Comment 354 daniel 2018-06-10 20:46:37 UTC
(In reply to Kai-Heng Feng from comment #352)
> Is there anyone not affected by this when Package C6 is enabled?

There are multiple reports above that disabling package C6 is not the same as setting typical current in the BIOS. ISTR there is even a report that setting typical current did not disable C6 for that particular BIOS. The theory that this power bug is related to the motherboard is gaining prominence, because different motherboards seem to have different tweaks for "typical current". If only AMD would just tell us what is going on.

Meanwhile, I am more than content with my Ryzen setup now that it is apparently stable, but discontented with AMD's lack of disclosure. Not to the point of swearing off AMD, but it got close.
Comment 355 hoper 2018-06-11 09:05:04 UTC
How did I make my system ultra stable after months of despair?
Easy, I never let it idle anymore :p
Mining cryptocurrency on at least some of the cores, and I never had trouble again :)


Comment 356 oyvinds 2018-06-11 19:06:30 UTC
> I find myself willing to believe the AMD "indirectly" statement: this is
> related to power supply. My power supply is an EVGA 650GQ, with eco switch
> set on to reduce fan noise. I suspect that if I swapped it for some other
Some want to believe that, and that's fine, but AMD's story is and always was nonsense, and here's why: my EVGA SuperNOVA 750 G3 and Corsair RM650i are modern, C6/"Haswell"-compliant PSUs. There are plenty of other people with a variety of new and old, cheap and expensive PSUs who are affected by this scandal of a bug.

The new Power Supply Idle option in the BIOS set to Typical does fix it on all combinations of Ryzen motherboards/CPUs/PSUs I've tried.

The root cause could be a combination of motherboard/CPU/PSU, but my guess is that it has more to do with the motherboard voltage regulators and less to do with the PSUs. Buying a new PSU isn't a viable option anyway; what are you going to do, buy every single PSU on the market and hope you stumble upon one that works with the "low current" power supply option in the BIOS after trying 20 that don't? If you sell a computer monitor that does not work properly on 19 out of 20 graphics cards, then it's not the graphics cards that are to blame; you're selling a defective product.

The Typical Idle setting is a fine solution if you know about it, so there's that. Works for me. I do feel sorry for the poor newbies who don't know about it and pull their hair out for hours before discovering the shocking truth, though.
Comment 357 Paulo J. S. Silva 2018-06-14 13:09:33 UTC
I would just like to pass on some evidence that this affects Threadrippers. I have two identical systems with the 1950X on an Asus X399-A Prime motherboard. This motherboard does not have the Power Supply option in the BIOS. Both freeze from time to time.

But more interestingly, both systems freeze when running code developed by one of our students that uses Matlab to compute the SVD of a large matrix, followed by a stretch of near-idle time writing to an NFS mount, many times in a loop. The code takes around an hour to run. I made a small script that loops between running the code and calling sleep for one extra hour to put the machine in an idle state. It always makes the machines freeze in less than one day. In some cases the freeze happens even while the code is running, not only when idle, which seems odd. It is very reproducible.

The solution is to disable C6 fully using zenstates.py. Only disabling the C6 package state is not enough. Another possibility, which I am inclined to try in order to recover some of the performance lost by losing turbo boost and XFR, is a light overclock. I am testing one of the machines now at 3.75 GHz.
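Since zenstates.py settings are lost on every reboot, one way to re-apply them automatically is a small boot-time oneshot service. This is only a sketch: the unit name, the script path /usr/local/bin/zenstates.py and the --c6-disable flag are all assumptions here, so adjust them to your setup:

  # /etc/systemd/system/ryzen-disable-c6.service (hypothetical unit name):
  #   [Unit]
  #   Description=Disable C6 as a Ryzen idle-lockup workaround
  #   [Service]
  #   Type=oneshot
  #   ExecStartPre=/sbin/modprobe msr
  #   ExecStart=/usr/local/bin/zenstates.py --c6-disable
  #   [Install]
  #   WantedBy=multi-user.target
  # then activate it:
  systemctl daemon-reload
  systemctl enable --now ryzen-disable-c6.service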

Has anyone seen the "power supply" option on an X399 motherboard?
Comment 358 Ryan Phillips 2018-06-14 15:56:18 UTC
@Paulo: Are you running the latest BIOS? There is a newer BIOS from ~April or so, but I do not have the motherboard to try it out.

To add some more information to this thread, I have the following machine:

 * AMD Ryzen 2700X
 * SuperNOVA 650 G3 650W (Eco switch set to On)
 * Asus ROG Crosshair VII Hero Wifi (BIOS 0601 04/19/2018, Typical Current set in the BIOS)

This setup is currently working better with the Typical Current setting, though I have experienced a soft lockup with these settings after about ~6 hours of uptime. I am going to run the same setup with the eco switch to off on the PSU to see if anything changes.
Comment 359 Ryan Phillips 2018-06-14 16:10:19 UTC
Eco mode supposedly only controls the PSU fan. I'm guessing that will not change anything on the CPU soft lock side.
Comment 360 Paulo J. S. Silva 2018-06-14 16:47:39 UTC
@Ryan, yes. According to dmidecode I am using the 0601 BIOS, which is the latest one available on the Asus website.

Interesting that you are having soft lockups even with Typical Current. It completely solved the problems on my old 1700X system and it seems to have worked well for others. Are you overclocking? If yes, try first with default settings. Good luck!
Comment 361 Eduardo Reyes 2018-06-15 02:11:14 UTC
@Paulo:
I downloaded the manual for your motherboard and found this

http://dlcdnet.asus.com/pub/ASUS/mb/socketTR4/PRIME_X399-A/E13557_PRIME_X399-A_BIOS_EM_WEB_20171030.pdf



EPU Power Saving Mode
The ASUS EPU (Energy Processing Unit) sets the CPU in its minimum power consumption
settings. Enabling this item will apply lower CPU Core/Cache Voltage and help save energy
consumption. Set this item to disabled if you are over clocking the system. Configuration
options: [Disabled] [Enabled]


Have you tried Disabling this?
Comment 362 daniel 2018-06-15 02:16:05 UTC
(In reply to Ryan Phillips from comment #359)
> Eco mode supposedly only controls the PSU fan.

How about this theory: with Eco mode on, the power supply prevents the current from increasing until the fan speeds up, causing the voltage to drop ever so slightly, and pandemonium ensues. Note that EVGA's site does not explicitly say that Eco mode only controls the fan. I am not pointing the finger at EVGA, I am just exploring a theory; please refute it if possible.

I am willing to believe that this system is stable now that I am at 51 days of uptime, so I will try some experiments. The first one will be to run with both Typical Power and Eco Mode off, and all stock BIOS and kernel settings.
Comment 363 Paulo J. S. Silva 2018-06-15 13:01:21 UTC
@Eduardo. Yes, I found that option before. It is disabled by default (so the default is not to attempt the power saving). I even tried enabling it, hoping that the power-saving system would be smart and avoid the lockup, but it didn't work.

What I am trying right now is the "Overclocking Enhancement" option, which is supposed to "enable SenseMi Skew (artificially reading lower temperatures in order to trick XFR to boost longer and higher), Performance Bias and tweak VRM settings to improve overclockability". I am especially interested in the tweaked VRM settings. (https://rog.asus.com/forum/showthread.php?97680-Overclocking-Enhancement)

Interestingly enough, if I only set this to Enabled (no overclock, just default settings plus this option enabled) it seems to make things better. As I said, I have a piece of code that consistently triggers a lockup if I run it in a loop followed by one hour of inactivity. I have been running this test on one of my systems with "Overclocking Enhancement" set to Enabled and it has not triggered the bug for the last 24 hours. I have just changed the other machine to test this as well. I'll keep the test running today and over the whole weekend on both machines and report back.

If anyone else is having the problem with an Asus X399 motherboard it may be worth a try; please report back too.
Comment 364 Paulo J. S. Silva 2018-06-15 13:03:19 UTC
Oh, and I forgot to mention: both machines are using the "Overclocking Enhancement" option and I have explicitly set the C6 state to enabled using zenstates.py.
Comment 365 Paulo J. S. Silva 2018-06-19 18:02:42 UTC
Hi, a follow-up. My two Threadripper machines have been working flawlessly since Friday. I tried both running a test that alternates between high load and idle (which would normally trigger the bug) and leaving the machines idle for many hours.

It seems a possible workaround for someone with an ASUS X399-A Prime motherboard is to set the BIOS option "Overclocking Enhancement". Beware that this option has the weird side effect of skewing the temperature readings to report a lower value, so you should have a good cooling solution in place.
Comment 366 Artem Hluvchynskyi 2018-06-20 11:03:38 UTC
So AMD has finally published the revision guide for 17h family, which includes errata:
https://developer.amd.com/wp-content/resources/55449_1.12.pdf

Would be cool if someone with more knowledge about that kind of stuff could have a quick look through and see if this issue could be related to anything in there. So far "1109 MWAIT Instruction May Hang a Thread" sounds somewhat related.
Comment 367 Bráulio Bhavamitra 2018-06-20 11:25:04 UTC
(In reply to Artem Hluvchynskyi from comment #366)
> So AMD has finally published the revision guide for 17h family, which
> includes errata:
> https://developer.amd.com/wp-content/resources/55449_1.12.pdf
> 
> Would be cool if someone with more knowledge about that kind of stuff could
> have a quick look through and see if this issue could be related to anything
> in there. So far "1109 MWAIT Instruction May Hang a Thread" sounds somewhat
> related.

Wow, a lot of errata to be worked around...
Comment 368 James Le Cuirot 2018-06-20 11:35:41 UTC
(In reply to Artem Hluvchynskyi from comment #366)
> So far "1109 MWAIT Instruction May Hang a Thread" sounds somewhat related.

I remember seeing the kernel fix/workaround for that issue when I was trying to diagnose our issue. I don't think they are related.
Comment 369 Klaus Mueller 2018-06-20 15:43:57 UTC
(In reply to James Le Cuirot from comment #368)
> I remember seeing the kernel fix/workaround for that issue when I was trying
> to diagnose our issue. I don't think they are related.

https://www.phoronix.com/scan.php?page=news_item&px=Linux-4.17-AMD-Power-Fix

But don't know if it's the same MWAIT problem as mentioned in the errata pdf above.
Comment 370 James Le Cuirot 2018-06-20 15:52:50 UTC
I was thinking of this but with fresh eyes, maybe that is yet another issue. I'll shut up now and leave this to the experts. :)

https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=88d879d29f9cc0de2d930b584285638cdada6625
Comment 371 Klaus Mueller 2018-06-20 16:56:46 UTC
@James Le Cuirot:
Linux 4.14 already came with the mentioned patch from the beginning, and it doesn't prevent freezing on idle at all. The same goes for the MWAIT patch mentioned at Phoronix (see link above).
Comment 372 AMD Linux User 2018-06-21 12:43:27 UTC
(In reply to Artem Hluvchynskyi from comment #366)
> So AMD has finally published the revision guide for 17h family, which
> includes errata:
> https://developer.amd.com/wp-content/resources/55449_1.12.pdf
> 
> Would be cool if someone with more knowledge about that kind of stuff could
> have a quick look through and see if this issue could be related to anything
> in there. So far "1109 MWAIT Instruction May Hang a Thread" sounds somewhat
> related.

"1033 A Lock Operation May Cause the System to Hang" also seems related, since some logs referred to locking issues.
Comment 374 Cong Wang 2018-06-22 21:41:55 UTC
Hey guys,

We have been seeing this SMP function call soft lockup since the 3.x kernels. Unfortunately we don't know how to reproduce it either.

I suspect there is something on another CPU (not shown without sysrq-l) blocking the SMP call function execution; we have to figure out what it is. So next time you see a live instance of this bug, please collect the stack traces on all the CPUs at that time, using sysrq-l.
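For anyone who has not used it before, a quick sketch of collecting those traces from a root shell (if the keyboard still responds, Alt+SysRq+l does the same thing):

  sysctl -w kernel.sysrq=1      # allow all SysRq functions for the keyboard combo
  echo l > /proc/sysrq-trigger  # dump backtraces of all active CPUs to the kernel log
  echo w > /proc/sysrq-trigger  # also dump blocked (uninterruptible) tasks
  dmesg > sysrq-traces.txt      # collect the output for attaching here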

Thanks!
Comment 375 OptionalRealName 2018-06-25 00:21:08 UTC
Thread still ongoing with no response from AMD.

You really shouldn't need to fiddle with custom BIOS options.

What is big business / big iron or whatever it's called doing with Epyc and Linux servers? Are those totally fine?
Comment 376 Robert Hoffmann 2018-06-26 08:50:29 UTC
(In reply to OptionalRealName from comment #375)
> Thread still ongoing with no response from AMD.
> 
> Really shouldn't need to fiddle with custom bios options.
> 
> What is big business / big iron or whatever it's called doing with Epyc and
> linux servers? Are those totally fine?

I am running an Epyc-based server, and as posted in this thread here:
https://forums.fedoraforum.org/showthread.php?317537-first-server-error-reboot-what-is-this-UUID
(worth a full read)
everything in there was solved by setting the correct Virtualization flags in BIOS, as specified by a Supermicro Tech as follows:
========
It looks like you want to enable the virtualization feature.
Please go into BIOS,
Advanced -> NB Configuration -> IOMMU (change to Enabled).
Advanced -> PCIe/PCI/PnP Configuration -> SR-IOV Support (change to Enabled).
========

After this, the server worked without a hitch, and I have now upgraded it from Centos7, to Fedora 28, and it runs like a charm.

Of course the server is not IDLE at all, and maybe that saves the server from the low power issue discussed in this thread. However, I have had it at least 7 days just sitting there not doing anything, and it never had any problem.
Comment 377 OptionalRealName 2018-06-26 09:13:56 UTC
I'm praying intensely that the new AMD Epyc 3000 Embedded series (low-power E3000) doesn't have the same issues. Hoping to buy one in the next few months when they finally put the damn things on some ITX boards.
Comment 378 Ryan Phillips 2018-06-26 22:11:45 UTC
I was able to capture a soft-lock with sysrq-l [1]. Networking and mouse were still functional, but focusing windows and window activity in Plasma were locked.

[1] https://gist.github.com/rphillips/fae9540a8e7a3a83731c9a9809a98df4

Notes:
 * AMD Ryzen 2700X
 * SuperNOVA 650 G3 650W (Eco switch set to Off)
 * Asus ROG Crosshair VII Hero Wifi (BIOS 0601 04/19/2018, Typical Current set in the BIOS)
Comment 379 Alexandre Badalo 2018-07-02 09:12:26 UTC
Is there any workaround/solution for this bug? With this problem I can't reliably have the PC running as a server :(

The bug is almost 1 year old
Comment 380 Chris Siebenmann 2018-07-02 16:15:50 UTC
There are two solutions or workarounds that have been reliable for me, one of which I now trust more than the other.

First, if your BIOS has the option for 'Power Supply Idle Control' hiding somewhere in its settings (generally in a sub-menu off an 'advanced' menu), setting it to 'Typical current idle' seems to work. I trust this workaround more than the next one, but it requires a BIOS that has been updated to include AGESA 1.0.0.2a (AGESA is apparently a magic blob that AMD supplies to vendors).

Before I had the BIOS option available, I also had a stable system by using the kernel command line parameters 'rcu_nocbs=1-15 processor.max_cstate=5' (some people use 1 as the maximum cstate). This requires a kernel that supports rcu_nocbs, which not all kernels are built to do, and is more magical than the BIOS setting; it's clear that these settings are stabilizing the system through some side effects, not their direct operation.

(I experimentally determined that on my hardware and setup, merely using 'processor.max_cstate=5' wasn't enough; my machine still locked up.)
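For reference, a minimal sketch of making such parameters permanent on a Fedora-style setup (the grub.cfg path below is for BIOS installs; EFI installs keep it under /boot/efi/EFI/fedora/, and Debian/Ubuntu use update-grub instead):

  # in /etc/default/grub, append the parameters to GRUB_CMDLINE_LINUX, e.g.:
  #   GRUB_CMDLINE_LINUX="... rcu_nocbs=1-15 processor.max_cstate=5"
  grub2-mkconfig -o /boot/grub2/grub.cfg
  reboot
  cat /proc/cmdline             # afterwards, check the running kernel picked them up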

My machine runs Fedora 27, using Fedora 4.16.x and 4.17.x kernels on a Ryzen 1800X on an ASUS Prime X370-PRO motherboard with ECC RAM, currently using BIOS 4011.
Comment 381 Michaël Colignon 2018-07-02 21:37:52 UTC
Created attachment 277137 [details]
attachment-17307-0.html

Yep, the typical current parameter in your UEFI.
If it's not there, update the UEFI.



Comment 382 OptionalRealName 2018-07-02 23:17:27 UTC
(In reply to Alexandre Badalo from comment #379)
> Is there any workaround/solution for this bug? With this problem i can't
> reliably have the PC running as a server :(
> 
> The bug is almost 1 year old

People need to contact AMD; it's ridiculous that this is still ongoing.
Comment 383 Ryan Phillips 2018-07-03 00:06:02 UTC
I have finally gotten to a stable state (no soft lockups). Asus released the 0702 BIOS, which includes AGESA update 1.0.0.2c. Typical Current is set to enabled within the BIOS. No other special kernel command-line options are enabled.

Machine:
 
 * Kernel 4.7.13
 * AMD Ryzen 2700X
 * SuperNOVA 650 G3 650W (Eco switch set to Off)
 * Asus ROG Crosshair VII Hero Wifi (BIOS 0702 06/22/2018, Typical Current set in the BIOS)
Comment 384 Toni Spets 2018-07-04 07:29:39 UTC
Adding to the confusion here:

I'm running a Ryzen 1700 on an Asus PRIME B350-PLUS motherboard. On BIOS version 4011 I have the "Power Supply Idle Control" option, and I changed it from "Auto" to "Low Current Idle" a week or so ago; since then I have had zero lockups.

According to this bug report, disabling C6 states *or* setting that option to "Typical Current Idle" should work, and "Auto" should already mean "Low Current Idle", so the change should not have made any difference. zenstates.py shows C6 enabled for both package and core.

Unless the Ubuntu 18.04 kernel 4.15.0-*-generic got an out-of-tree patch applied somewhere in between my reboots, I'm confused and relieved at the same time.
Comment 385 Freihut 2018-07-04 20:10:50 UTC
Did the people with the new BIOSes double-check the 'Power Supply Idle Control' option by temporarily disabling it? Because it could also be the AGESA update itself that fixed it, as some of you mentioned (also mentioned in [1]).

tl;dr:
Had a soft lockup at full load; a BIOS update seems to help, even though this BIOS doesn't have the 'Power Supply Idle Control' option.

More details:
My HP 17-ca0202ng (Ryzen 2500U) was stable for 2-3 weeks when booting with the "idle=nomwait" kernel parameter, until *something*[2] happened and from one day to the next it was hard-locking again; on another test run it threw the soft-lockup error (NMI watchdog: BUG: soft lockup - CPU#0 stuck for 23s!) while transcoding x264 material to x265 (roughly full load; 95% on all 8 threads). It was booted with "nomodeset" at the time the errors occurred, so I'm quite sure it can't be amdgpu-related.
When the soft lockup occurred, the system was still somewhat responsive, freezing for a number of seconds and then unfreezing again for a while. I was able to run dmesg and saw the NMI watchdog: BUG: soft lockup - CPU#0 stuck for 23s! message for all threads, before the freezes came at shorter and shorter intervals and the system froze completely. Strange!

But: HP provided a BIOS update (F4 > F10) some days ago. I was able to install it yesterday and have since done 8 hours of full-load testing, 8 hours of idle testing and 7.5 hours of partial-load testing. No lockups so far and no kernel parameters needed (running Ubuntu's mainline-build kernel 4.17.2, the packages from the Padoka stable PPA and the Raven Ridge firmware from [3] to prevent amdgpu freezes). Since lockups previously always occurred within 3 hours, the laptop seems to be stable (or at least "more stable" than before).
I can't say anything about the changes HP made in that update, because they don't provide a changelog. But it's a laptop, so it has 3 or 4 BIOS options (something virtualisation something, something secure boot something and some boot order options) and certainly no 'Power Supply Idle Control'. Allow me to point out that laptops usually have power adapters specially designed for the device, so in my opinion it can't be power supply related, as AMD claimed.


[1] https://community.amd.com/thread/225795 , https://community.amd.com/thread/224000
[2] May be microcode-update related provided by Ubuntu, but I can't remember for sure.
[3] https://git.kernel.org/pub/scm/linux/kernel/git/firmware/linux-firmware.git/tree/amdgpu
Comment 386 Robert Hoffmann 2018-07-05 11:50:17 UTC
Interestingly,

I had a nicely running system, uptime about 30 days, and I just decided to upgrade the Kernel to the latest (Fedora 28), and "what the heck" I will just upgrade the motherboard BIOS too.

Mobo: ASUSTeK model: PRIME B350M-A
6 core AMD Ryzen 5 1600X Six-Core (-MT-MCP-) arch: Zen rev.1

So, now I have BIOS version 4104  (previous was 3801 I think)

And after like 10 minutes of doing stuff (i.e. not idle), LOCKED UP !!!

So I went into the BIOS and changed the Global C-State to Disabled, and right under that is the power profile, changed it to Typical.

We shall see how long it lasts !

I guess "don't fix what ain't broken" applied to my activity today lol.
Comment 387 Alexandre Badalo 2018-07-05 13:47:50 UTC
(In reply to Robert Hoffmann from comment #386)
> Interestingly,
> 
> I had a nicely running system, uptime about 30 days, and I just decided to
> upgrade the Kernel to the latest (Fedora 28), and "what the heck" I will
> just upgrade the motherboard BIOS too.
> 
> Mobo: ASUSTeK model: PRIME B350M-A
> 6 core AMD Ryzen 5 1600X Six-Core (-MT-MCP-) arch: Zen rev.1
> 
> So, now I have BIOS version 4104  (previous was 3801 I think)
> 
> And after like 10 minutes of doing stuff (i.e. not idle), LOCKED UP !!!
> 
> So I went into the BIOS and changed the Global C-State to Disabled, and
> right under that is the power profile, changed it to Typical.
> 
> We shall see how long it lasts !
> 
> I guess "don't fix what ain't broken" applied to my activity today lol.

My case is similar, but I don't recall updating the BIOS/UEFI firmware; it was after a kernel upgrade that the system started locking up :/
Comment 388 Arup 2018-07-05 19:08:08 UTC
I have the ASUS B350M board and I updated to the latest BIOS 4014, but my lockups started with the latest kernel 4.17 in Arch, where the lockup bug reappeared. I have set the idle current setting to Typical and so far no lockups in two days; let's see what happens in a few days. Keeping my fingers crossed.
Comment 389 Bráulio Bhavamitra 2018-07-05 19:17:23 UTC
idle=nomwait fixed all hangs (from https://community.amd.com/thread/224000).
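For anyone trying this, a small sketch for double-checking that the parameter is actually in effect after a reboot (cpupower usually comes from the distribution's linux-tools or cpupower package):

  cat /proc/cmdline             # should show idle=nomwait among the parameters
  cpupower idle-info            # shows the cpuidle driver and the C-states offered to the kernel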
Comment 390 ChrisB 2018-07-07 10:38:30 UTC
I've been watching this thread since February. My first stable workaround, with legacy BIOS mode, was the kernel parameter rcu_nocbs=0-15, because the BIOS didn't have the "Typical Current Idle" option.

System info:
R7 1700, GA AX370-Gaming K7 with BIOS F23f in legacy mode, PSU Corsair HX750, OS Gentoo Linux, kernel 4.17.4, desktop kde-plasma-5.13.2

No overclocking or tweaking, only using the "Typical Current Idle" option (possible now with F23f) instead of "Auto".

So I no longer need rcu_nocbs=0-15, and everything works fine: no lockups.
Comment 391 ValdikSS 2018-07-17 10:26:40 UTC
These lockups are probably not related to this bug. I've updated my Intel Sandy Bridge laptop to 4.17.5 from the Fedora 28 repository and now I have random CPU lockups, too.
4.17.3 worked fine.
Comment 392 Willem Dreyer 2018-07-23 15:28:16 UTC
(In reply to ValdikSS from comment #391)
> These lockups are probably not related to this bug. I've updated by Intel
> Sandy Bridge laptop to 4.17.5 from Fedora 28 repository and now I have
> random CPU lockups, too.
> 4.17.3 worked fine.

Please take a look at this report, it matches your description: https://bugzilla.redhat.com/show_bug.cgi?id=1598989 The user who reported it also has an Intel CPU (Acer Aspire V3-771, from his attached screenshot). I have experienced this issue once on one of my Ryzen systems running 4.17.6; could it be an unrelated problem? Another one: https://bugzilla.redhat.com/show_bug.cgi?id=1598462

With regards to the original Ryzen random soft lockup issue: I am seeing it on every kernel that I have tested so far, ranging from kernel 4.10 through to 4.17.6. The frequency of the Ryzen soft lockups has increased since kernel >= 4.15 in my experience. It could just be a random effect, as I am unaware of the cause of this problem.

I am running my machines at stock clocks and have the latest stable BIOS updates installed as of today on a Ryzen 1700 w/ MSI PRIME X370-PRO, a Ryzen 1800X w/ ASRock Fatal1ty X370 Professional Gaming, and finally my personal desktop, a Ryzen 1800X w/ X470 Taichi Ultimate and a Seasonic 1000W Platinum PSU (Haswell ready). All of my CPUs are running microcode patch level 0x8001137.

I can't find any correlation in workload; the issue occurs on web servers, VFIO gaming, even live USB sessions. I used the following to (try to) provide some insight:
journalctl -t kernel --no-pager | grep "soft lockup" | awk -F"!" '{print $2}' | sort -u
 [Compositor:4167]
 [kworker/10:1:18249]
 [kworker/1:3:418]
 [libvirtd:1226]
 [systemd:1]
 [Web Content:3505]

I have attempted CPU pinning and disabling ASLR completely. I have also tried isolating workloads in virtual machines that do not cross CCX units, with hugepages (THP off). I am not an expert on CCX, NUMA, Infinity Fabric, etc., so I could have made a mistake in my tests. That said, I am currently testing the "idle=nomwait" parameter in the hope of getting better results; who knows, perhaps even a stable system.
Comment 393 OptionalRealName 2018-07-24 03:04:49 UTC
There still does not appear to be a single consistent solution to this problem and it exists in Ryzen 1xxx and 2xxx series.


Apparently not in the Epyc series, is that correct?

This thread is utterly embarrassing for AMD, truly embarrassing.
Comment 394 Alexander von Gluck 2018-07-24 13:33:29 UTC
(In reply to OptionalRealName from comment #393)
> There still does not appear to be a single consistent solution to this
> problem and it exists in Ryzen 1xxx and 2xxx series.

Actually, there are several workarounds known to work listed in this thread.

I've been running a stable x1800 using the rcu_nocbs=0-15 fixes noted above for almost a year now.

> This thread is utterly embarrassing for AMD, truly embarrassing.

It's a nuanced, hard-to-troubleshoot bug in a processor + support chipsets, with known workarounds, which only seems to happen on Linux... not really sure how that is "embarrassing".  It sucks; it's new silicon. Welcome to being an early adopter.

Look at Intel's heart-bleed and all the other major security catastrophes they've had in recent history.  Stuff happens.  I do wish AMD would address it, but they're likely reluctant to bring it up until they have a reliable fix (if it even is a fix on their end)
Comment 395 OptionalRealName 2018-07-24 13:40:46 UTC
> It's a nuanced hard to troubleshoot bug in a processor + support chipsets
> with a known workaround which only seems to happen on Linux... not really
> sure how that is "embarrassing".  It sucks, it's new silicon. Welcome to
> being an early adopter.


Thread Reported:	2017-08-16 19:05 UTC


https://www.anandtech.com/show/11170/the-amd-zen-and-ryzen-7-review-a-deep-dive-on-1800x-1700x-and-1700
16 month old silicon.



Nah, this has gone on too long.
Firstly, nothing should need to be done in Linux in any capacity; there should be a BIOS flag that fixes it.
Second, the BIOS flag should be on at all times.
Third, based on my second point, as long as someone has the latest firmware, it should 100% solve the problem.

Users should not need to mess around for stability, that is not acceptable.
Comment 396 Chris Hall 2018-07-30 18:25:27 UTC
I have Ryzen 7 1800X on Asus Prime X370-Pro.  I upgraded the BIOS to v4011(Update AGESA 1.0.0.2a + SMU 43.18) and:

  1) turned on the "Typical Current Idle" option.

  2) stopped using zenstates.py -- which I had been using to enable "C6 Core"
     but disable "C6 Package" (to no avail).

  3) did *not* change Linux -- which was 4.16.5 -- Fedora 27.

  4) continued to use CONFIG_RCU_NOCB_CPU and rcu_nocbs=0-15

After 67 days uptime (leaving the system completely idle and changing nothing), I became convinced that the "Typical Current Idle" option has dealt with the "freezing when idle" problem.

When I say "freezing when idle", what I mean is: if the machine is left idle (typically over night) it simply stops responding.  Nothing at all is logged -- no application, driver or kernel errors or warnings are logged -- the machine is still powered up, but frozen solid.  The only way to restart the machine is to power down and up again.

Reviewing this thread, it seems to be mostly concerned with the "freezing while idle" issue.

The symptoms of the original "Random Soft Lockup" include log messages of the form:

     NMI watchdog: BUG: soft lockup - CPU#12 stuck for 23s!

is that related to "freezing while idle", or is it a separate issue ?

I get the impression that CONFIG_RCU_NOCB_CPU and rcu_nocbs=0-15 may be related to the "Random Soft Lockup"... but not to "freezing while idle" ???

It seems that other crashes/lockups are trying to attach themselves to this thread.

I note that this bug is assigned to platform_x86_64@kernel-bugs.osdl.org.  This bug is very nearly 1 year old.  Is this a good moment for the assignee to address this thread and say:

  * what, if any, Kernel issues have been identified

  * what, if any, Kernel fixes have been applied

related to this thread.

If the root cause of (some or all of) the issues in this thread is fixed or worked around by the "Typical Current Idle" BIOS option, does the assignee think that this "bug" can now be closed, or are there actual Kernel issues that remain, waiting to be fixed ?

Is it significant that W*nd*rs does not seem to suffer ?
Comment 397 Daniel Phillips 2018-08-11 20:46:56 UTC
This bug should be closed as "FIXED" in my humble opinion. "NEW" is grossly inaccurate.

My workstation uptime is now 109 days, continuous operation without suspend, mainly idle, mixed with episodes of heavy load on all cores and everything in between. The technical term for this is "spectacularly stable". Credit belongs to all involved, including AMD, kernel devs, motherboard maker, Debian maintainers, and many other groups who made this possible. Again in my humble opinion, the only issue remaining is AMD's failure to explain what went wrong and exactly how they worked around it. That is a separate bug, this one is done.

I suspect that Windows users did get this bug, but they didn't notice it because, well, use your imagination. Then it was quietly fixed in a Windows update supplied by AMD that went out around the same time as the Typical Power BIOS updates.
Comment 398 Antoine Pitrou 2018-08-18 13:39:43 UTC
@Daniel what makes you think this bug is fixed?  It certainly isn't for me.  I have a ASRock Fatal1ty X370 Gaming X motherboard, a Ryzen 7 1700 CPU, have updated the motherboard BIOS to its latest version (4.80 with PinnaclePI-AM4_1.0.0.2 Patch A) and am still getting idle lockups with kernel 4.15.0-32-generic (Ubuntu 18.04.1 kernel).  The workaround for me is to disable C6 states using the Zenstates.py script (which has the unfortunate side effect of lowering the max CPU frequency in single-threaded mode).

(I also don't see any "Typical Current Idle" option in my Setup options, though I might have overlooked it)
Comment 399 Nelson Castillo 2018-08-18 14:11:05 UTC
Hello there. I read most of the thread a week ago.

My machine was freezing when idle and when not idle. I read most of this thread and thanks to it got a stable setup.

@Daniel: When you say that you think the problem is fixed, do you know what change/Linux version fixes the issue for most people?

-------------

Now my report.

I have the following setup:

Motherboard: AB350M-DS3H (F23d BIOS).
CPU: Ryzen 2700X.
No overclock.

I still wonder whether my CPU is doing too much extra work (I don't know what cpufreq would report if a C6 state was reached).

cpufreq stats:

CPU 0: 3.70 GHz:6.64%, 3.20 GHz:1.15%, 2.20 GHz:92.21%
CPU15: 3.70 GHz:4.86%, 3.20 GHz:0.80%, 2.20 GHz:94.34%
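If you want to see whether deep idle states are ever entered, the cpuidle counters in sysfs are more telling than cpufreq statistics. A sketch, assuming the cpuidle framework is active on the machine; state names and numbering vary, and note that these counters reflect what the kernel requested, not necessarily what the hardware actually reached:

  grep . /sys/devices/system/cpu/cpu0/cpuidle/state*/name
  grep . /sys/devices/system/cpu/cpu0/cpuidle/state*/usage   # number of entries into each state
  grep . /sys/devices/system/cpu/cpu0/cpuidle/state*/time    # cumulative time in each state, in microseconds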

Anyway, to make things work I had to use three tweaks; I tried individual tweaks to no avail.

- Select in BIOS: "Typical Current Idle"
- Start Linux with: idle=nomwait
- Disable C6 states (both core and package) with Zenstates.py

Before doing the last step Zenstates.py reports Core enabled and Package disabled.

I'm using Ubuntu 18.04.01 LTS. I didn't compile Linux with CONFIG_RCU_NOCB_CPU / CONFIG_RCU_NOCB_CPU_ALL.

So, things are working for me. But if you think I should test a new Linux version that is supposed to fix the issue please let me know.
Comment 400 Michaël Colignon 2018-08-18 14:19:59 UTC
(In reply to Antoine Pitrou from comment #398)
> @Daniel what makes you think this bug is fixed?  It certainly isn't for me. 
> I have a ASRock Fatal1ty X370 Gaming X motherboard, a Ryzen 7 1700 CPU, have
> updated the motherboard BIOS to its latest version (4.80 with
> PinnaclePI-AM4_1.0.0.2 Patch A) and am still getting idle lockups with
> kernel 4.15.0-32-generic (Ubuntu 18.04.1 kernel).  The workaround for me is
> to disable C6 states using the Zenstates.py script (which has the
> unfortunate side effect of lowering the max CPU frequency in single-threaded
> mode).
> 
> (I also don't see any "Typical Current Idle" option in my Setup options,
> though I might have overlooked it)

Hi Antoine, I quickly checked the manual of your ASRock motherboard.
For your machine the "typical idle current" setting is named "C6 Mode".
Disable it and you get the equivalent of "typical idle current", I think,
while keeping the single-threaded boost.
Comment 401 Chris Hall 2018-08-18 14:21:38 UTC
The "freezing while idle" problem does appear to be (at least) worked around by the "Typical Current Idle" option.  If that counts as a fix, then the bug is fixed -- or rather, was never a Kernel Bug.

The "Random Soft Lockup" -- the nominal subject of the bug -- on the other hand... who can tell ?

It would be churlish not to be happy that my machine is now reliable and I am a believer in credit where credit is due.

So, as you say, all credit goes to AMD for their total silence up to and after the release of the work around (or fix) -- except for the advice that some (older) power supplies might not work.  I don't really need to know why or how the new option does the trick, but I remain curious whether it has a material effect on power consumption.

But I also give credit to this bug's assignee for their total silence -- in particular:

  * it's not clear whether the "freezing while idle" and "Random Soft Lockup"
    problems are separate, related or the same.

  * hence, the CONFIG_RCU_NOCB_CPU and rcu_nocbs=0-15 voodoo may or may
    not still be recommended ?

    For all I know, with the "Typical Current Idle" option, I may be better
    off returning to the default configuration ?

  * and there is talk of kernel changes which reduce power consumption
    (significantly ?) -- which may or may not be related to these issues.

Sadly, the absence of solid information allows half-truths and ill-informed speculation to take its place and to live on in the undamped echo chamber of the interweb :-(  [I acknowledge my small part in that noise :-(]
Comment 402 Michaël Colignon 2018-08-18 14:28:29 UTC
(In reply to Daniel Phillips from comment #397)
> This bug should be closed as "FIXED" in my humble opinion. "NEW" is grossly
> inaccurate.
> 
> My workstation uptime is now 109 days, continuous operation without suspend,
> mainly idle, mixed with episodes of heavy load on all cores and everything
> in between. The technical term for this is "spectacularly stable". Credit
> belongs to all involved, including AMD, kernel devs, motherboard maker,
> Debian maintainers, and many other groups who made this possible. Again in
> my humble opinion, the only issue remaining is AMD's failure to explain what
> went wrong and exactly how they worked around it. That is a separate bug,
> this one is done.
> 
> I suspect that Windows users did get this bug, but they didn't notice it
> because, well use your imagination. Then it was quietly fixed in a Windows
> update supplied by AMD that went out around the same time as the Typical
> Power bios updates.

It seems that the default power mode under that OS disables core parking (C6).
I say that because I checked it myself, and core parking is not active by default under Windows 10.
That could explain why this doesn't show up as a Windows problem.
Comment 403 James Le Cuirot 2018-08-18 19:40:45 UTC
To anyone applying the BIOS/CONFIG_RCU_NOCB_CPU/ZenStates solution together with  the idle=nomwait solution, I believe this is unnecessary as these are different problems relating to different CPUs. The former worked for my Ryzen 5 1600X, while only the latter worked for my much newer Ryzen 7 2700U (Raven Ridge). I've heard the idle=nomwait issue applies to all Raven Ridge CPUs. I don't know if that's true but I'm fairly certain it doesn't apply to any earlier hardware.
Comment 404 Paul Menzel 2018-08-19 07:02:36 UTC
The title of the bug should be changed to include Ryzen 7 1700X, and the reporter should follow up, if the problem still occurs with Linux 4.18.

For the other issues, separate bug reports should be submitted. Though the mailing lists might be the better forum for this, as the Linux kernel developers for the subsystem do not seem to use the Kernel.org Bugzilla.
Comment 405 Paul Menzel 2018-08-19 07:07:18 UTC
(In reply to Antoine Pitrou from comment #398)
> @Daniel what makes you think this bug is fixed?  It certainly isn't for me. 
> I have a ASRock Fatal1ty X370 Gaming X motherboard, a Ryzen 7 1700 CPU, have
> updated the motherboard BIOS to its latest version (4.80 with
> PinnaclePI-AM4_1.0.0.2 Patch A) and am still getting idle lockups with
> kernel 4.15.0-32-generic (Ubuntu 18.04.1 kernel).  The workaround for me is
> to disable C6 states using the Zenstates.py script (which has the
> unfortunate side effect of lowering the max CPU frequency in single-threaded
> mode).

Please remember, this is the upstream Linux kernel bug tracker. So, reports should only be against the latest stable release, which is currently 4.18.3 [1]. Some distributions make it easy to install those kernels; packages are available for Ubuntu [2].

Please report issues with the heavily patched Ubuntu default Linux kernel to the Ubuntu bug tracker [3].


[1]: https://www.kernel.org/
[2]: http://kernel.ubuntu.com/~kernel-ppa/mainline/v4.18.3/
[3]: https://bugs.launchpad.net/ubuntu/+source/linux/+bugs
Comment 406 Antoine Pitrou 2018-08-19 14:22:38 UTC
@Paul I understand this. I'm also monitoring the corresponding Ubuntu issue.  Still, I was asking whether there was a specific reason to believe this bug was fixed upstream (as opposed to the fact that a single person doesn't experience the issue anymore).
Comment 407 Antoine Pitrou 2018-08-19 16:50:00 UTC
(In reply to Michaël Colignon from comment #400)
> 
> Hi Antoine, i checked the manual of your Asrock motherboard quickly.
> For your machine the "typical idle current" is named "c6 mode"
> Disable it and you get the "typical idle current", i think.
> While keeping the boost monothread.

Hi Michaël, that option was removed in the latest BIOS updates. Probably because it didn't work at all.
Comment 408 Michaël Colignon 2018-08-19 16:58:13 UTC
(In reply to Antoine Pitrou from comment #407)
> (In reply to Michaël Colignon from comment #400)
> > 
> > Hi Antoine, i checked the manual of your Asrock motherboard quickly.
> > For your machine the "typical idle current" is named "c6 mode"
> > Disable it and you get the "typical idle current", i think.
> > While keeping the boost monothread.
> 
> Hi Michaël, that option was removed in the latest BIOS updates. Probably
> because it didn't work at all.

Oh yes, it works. On other boards it fixes the crashes at idle.
Contact ASRock support and ask for the "Typical Current Idle" option you need.
Comment 409 Antoine Pitrou 2018-08-19 17:19:11 UTC
I don't think there is any reason to think that the "c6 mode" option had anything to do with "typical idle current". Its name implied that it disabled C6 states, which it actually didn't (according to the Zenstates script).
Comment 410 ZaNKuRO 2018-08-24 20:02:46 UTC
I have an Asus ROG Strix GL702ZC notebook with a Ryzen 7 1700 + Radeon RX 580, and I get soft lockups too. The computer completely freezes when idle. My system is Ubuntu 18.04.1. Kernel parameters like "idle=nomwait" and "rcu_nocbs=0-15" don't fix my problem. For now I'm using my computer "without lockups" with the CPU governor set to "performance". My BIOS, updated to 305, does not have a "Typical Current Idle" option or anything similar.
Comment 411 Antoine Pitrou 2018-08-25 08:20:11 UTC
Unfortunately, setting the governor to performance doesn't fix the issue here (Ryzen 7 1700, ASRock Fatal1ty X370 Gaming X).
Comment 412 Chris Hall 2018-09-06 17:34:17 UTC
It appears that selecting the "Typical Current Idle" BIOS option (where available) eliminates the "freeze when idle" problem (somehow).

That said, it also appears that this is *not* the default.

The need for the option appears to be Linux specific, or at least it is not needed for W*nd*rs.

That said, I am damned if I can find any documentation for the option, or any authoritative recommendation for its use with Linux.

So, anyone bitten by this problem may struggle to find their way to selecting "Typical Current Idle".  Particularly as the symptoms have all the appearance of a hardware issue -- possibly an overclocking step too far.

Would it be possible for the Kernel to set whatever the "Typical Current Idle" BIOS option sets ?  Or invoke some BIOS function to select the option ?  That way, at least an up to date version of Linux would avoid this problem "out of the box".

I have been clapping my hands vigorously for some time, but I am coming to the conclusion that the platform_x86_64@kernel-bugs.osdl.org fairy does not, in fact, exist (or perhaps not in this universe).

While I am practising believing the regulation six impossible things before breakfast, I note that the platform_x86_64@kernel-bugs.osdl.org fairy would have to be able to communicate with the AMD documentation fairy... perhaps using a 56k ouija board ?

Finally and FWIW, I tried turning off "rcu_nocbs=0-15" and "Typical Current Idle", and my Ryzen 7 1800X/ASUS X370-Pro/Linux 4.18/Fedora froze within hours.  Before the "Typical Current Idle" option became available, I was running "rcu_nocbs=0-15" (following the advice in this thread), and my machine froze every week or so.  I have now been running for some days with "Typical Current Idle" turned back on, but without the "rcu_nocbs=0-15" voodoo -- so far, so good (as the man said as he fell past the 7th floor).
Comment 413 OptionalRealName 2018-09-06 22:36:39 UTC
So still no concrete 100% sure fire solution?

Incredible.
Comment 414 Chris Hall 2018-09-07 12:12:20 UTC
Who knows ?

When I say that "Typical Current Idle" appears to eliminate the "freeze when idle" problem, I mean that (a) that is my experience (to date), and (b) I have seen others reporting similar experience.

I'm not aware of any public, definitive information from AMD, any motherboard vendor or the relevant kernel folk -- certainly nothing which says what the "Typical Current Idle" BIOS option actually does, and whether or how it solves the problem, in part or completely.

Sadly, this long thread has almost nothing to do with identifying any actual Kernel bug or bugs.  It's more a sort of support group.  [Hello, I'm Chris and I'm an AMD Ryzen user... since I found the "Typical Current Idle" BIOS option, I haven't had a "freeze while idle" for two weeks -- praise be.]

If you are looking for useful information, look away now: angry, bewildered person about to howl at the moon...

--------------

...do I find the situation "Incredible" ?  hmmm...

[Final warning: reduced signal/noise ratio ahead.]

I bought my Ryzen machine the moment I could order it.  

When I came to build it, I found there was some confusion about the "standard" fitting for the CPU cooler -- the plate glued to the motherboard and the cooler fitting were incompatible.  I found myself in a three-way stand-off between AMD, the motherboard vendor and the cooler vendor.  If there was a specification for the standard AM4 socket fitting for the cooler, it was Top Secret.  In the end I voided all warranties, ripped the plate from the motherboard and replaced it by the one supplied with the cooler.

When I set up my new machine, I found the standard configuration for the memory was not as fast as I expected, noting <https://www.amd.com/en/products/cpu/amd-ryzen-7-1800x>.  But, I stumbled across this <https://community.amd.com/community/gaming/blog/2017/03/14/tips-for-building-a-better-amd-ryzen-system> where I discovered that the 2667 MT/s I was expecting was for 2 DIMMs; with my 4 DIMMs, 2133 MT/s is what I should have expected.  Silly Me !  I have seen BIOSes since then which are supposed to improve memory support... but I think that's more options for overclocking... who knows ?  Call me old fashioned, but I am disappointed (but no longer surprised) that I cannot find an AMD data sheet that specifies memory support (or much of anything else, for that matter).

Can I find any documentation for AGESA ?  No.  Its various versions (and changes in version numbering) ?  No.

Does my motherboard vendor provide release notes for each BIOS version ?  No.  How did I discover which BIOS version had the "Typical Current Idle" option ?  By experiment.

If the platform_x86_64@kernel-bugs.osdl.org fairy has died because I stopped clapping, then I'm sorry.

... so, Incredible ?  Nah.  Infuriating ?  You bet.
Comment 415 Arup 2018-09-07 20:16:07 UTC
Haven't had a single lockup after updating to kernel 4.15 and setting the BIOS to the Typical Current Idle setting. I have run the PC for a few days with no issues, nor any lockup under heavy CPU use. This is an ASUS B350M board with the latest BIOS on a Ryzen 1700.
Comment 416 daniel 2018-09-07 21:08:07 UTC
I erred earlier in suggesting that this bug should be closed as 
"fixed". Rather, this bug should be closed as "notabug", because 
there appears to be no evidence to suggest that it is actually a kernel 
bug, compared to considerable evidence suggesting it is not.

Meanwhile, I personally appreciate the slow trickle of comments along 
the lines of confirming that the typical current bios fix is effective, 
compared to the absence of comments that the bug persists even with this 
option enabled. My current uptime with typical current enabled is 136 
days, compared to only a few days without it.
Comment 417 OptionalRealName 2018-09-08 01:32:15 UTC
The situation is incredible, in that there doesn't seem to be a concrete, sure-fire solution to this.

If there's a BIOS option which can fix it, it should damn well be made MANDATORY in every BIOS - not an option, but locked in.  System stability should not be this unreliable, especially now that they're on the 2xxx series processors.

This huge bug log is nearly a year old now; it's crazy.  It should be fixed in the processor itself or in the BIOS.
Comment 418 Michael 2018-09-08 23:30:43 UTC
Sorry guys, Typical Current Idle doesn't fix it for me.

I can't believe this is not yet nailed down. I've kind of lost belief in the AMD comeback.

I do have this issue.

My setup is a 2700X on a Gigabyte B450 AORUS PRO.

I tracked this freeze issue down to the behaviour described here: the system suddenly gets stuck during normal office use. Everything stays where it is. No restart, no oops, no panic, no log entries of any kind.

Only reset-button helps.

I did try disabling Global C-State Control in UEFI - that didn't help.
After finding this thread today I set Power Supply Idle Control to "Typical Current Idle" and set Global C-State Control back to Auto (since I thought it wasn't the cause). Sadly, my system froze after 3h.

That is with the first and stock UEFI version F1 (AGESA 1.0.0.4).
I've just seen that there is an updated UEFI version F2. I've just updated and set global c-states to disabled AND Power Supply Idle Control to "Typical Current Idle".

"sudo ./zenstates.py  -l" is now:
C6 State - Package - Disabled
C6 State - Core - Disabled

I'll come back with the results.

This makes the whole AMD 1xxx and 2xxx line-up seem unstable.

Questions:
- Why does AMD not care? Is this maybe not a general issue, only affecting a small percentage of users?

- If Windows doesn't suffer from this issue, can we find out why and do an "if AMD then power-save like Windows"?
Or is it just the lack of information/support from the vendor, so that we don't know what exactly the issue is, what is causing it, and how to work around it?
Comment 419 Nelson Castillo 2018-09-09 00:18:50 UTC
Michael, with all of that enabled, also try the "idle=nomwait" kernel option if the upgrade to F2 doesn't work.

I have a similar BIOS and the same CPU. I had to enable everything you did, but it wouldn't work without idle=nomwait. See my post above.
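
For anyone wanting to make idle=nomwait persistent rather than typing it at the boot prompt, a minimal sketch assuming an Ubuntu-style GRUB setup (the existing "quiet splash" options are only an example):

  # /etc/default/grub
  GRUB_CMDLINE_LINUX_DEFAULT="quiet splash idle=nomwait"

  sudo update-grub      # on other distributions: grub-mkconfig -o /boot/grub/grub.cfg
  cat /proc/cmdline     # after the next reboot, confirms the parameter is active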
Comment 420 Michaël Colignon 2018-09-09 15:00:34 UTC
Hi, another Michael here.
"Typical Current Idle" fixed the problem. If you still get freezes under low load after enabling it, then either:
- the UEFI has a bug and doesn't really apply "Typical Current Idle" (i.e. it doesn't partially disable core parking), or
- you are hitting a different problem than the one in this thread.


I want to add another thing, about fixing this directly from the kernel.
"Typical Current Idle" seems not to be needed under the other OS (I am testing this again to confirm), because core parking is partially disabled there by DEFAULT, as I said earlier in this thread.

So it seems possible to fix this via the kernel, without relying on a UEFI parameter that is not enabled by default.
That would improve the experience under Linux for people who don't know about this problem.

So why not simply do the same on Linux? IIRC from my reading of a hardware.fr article, the same applies to Intel CPUs; it's a tweak for increasing performance, AFAIK. Ryzen gains ~10% (in games, where latencies matter more) when "deep sleep" of the cores is not enabled.

My wattmeter is not a good one, but I see essentially no difference with this core parking partially disabled or not. It must save very little power.

You can all easily test it with core-parking control software on the other OS.
Comment 421 nancy 2018-09-13 14:01:24 UTC
(In reply to Michaël Colignon from comment #420)
> ...
> I want to add another thing, about fixing it directly from the kernel.
> As typical idle current seems (i am testing it again for see) not needed
> under the other OS, that's because the core parking under this other OS is
> partially disabled by DEFAULT on this OS as i said earlier in this thread.
> 
> So it seems possible to fix it via the kernel, without make use of an uefi
> parameters not enabled by default...

I can perhaps confirm that. But I have to say first that I'm not a developer and I don't understand much about Linux internals.

I use an AMD Ryzen 5 1600 Six-Core @ 12x 3.2GHz with Gentoo as the operating system. So it's very easy for me to build a kernel just for my machine. (Without the whole Intel, ARM64, Android and Big-Data stuff.)

I test the kernels with the standard BIOS setup, without kernel parameters like "idle=nomwait", and without overclocking. My BIOS only has a Global C-states setting (Auto, which is the default, plus enable/disable); there is no "Power Supply Idle Control" / "Typical Current Idle" option. Only CONFIG_RCU_NOCB_CPU=y is set in the .config.

BOOT_IMAGE=/boot/kernel-x86_64-4.19.0-rc3 root=UUID=465f9c93-09d0-441b-9440-8ec7799b557c ro real_root=/dev/sdd4 resume=/dev/sdd3 init=/usr/lib/systemd/systemd
rootfstype=ext4

./zenstates.py -l
...
C6 State - Package - Enabled
C6 State - Core - Enabled
...
On this basis, I can confirm that kernel 4.17.14 works really well. But the 4.18.x series is terrible, including the currently declared "stable" 4.18.7: freeze or reboot every 2-3 hours.

To my surprise, the 4.19.0-rcX kernels work well. So far no soft lockups.

I do not know why, but these are the subjective results of my playing with the kernels.

I can also understand that the few kernel developers involved are working on priority issues; think of Spectre and Meltdown.

But some clarifying information would be very welcome.
Comment 422 Michael 2018-09-17 00:11:10 UTC
Hi,

update from me, as promised. First my system for reference:
2700X
B450 Chipset - Gigabyte B450 AORUS PRO with F2 BIOS
PSU from 2018 which is "Haswell compatible".

Typical Current Idle AND disabling global c-states didn't help.
Running Ubuntu 18.04 LTS with its stock kernel 4.15.0-34 right now.

I am having some success (4 days now) with all of that AND the idle=nomwait kernel parameter - although I had only read about that being related to resets, not freezes.

Anyway, I will keep my system on.

I would gladly trade the LEDs on the GPU and mainboard for some vendor effort to get their systems stable.

Any update from someone who really knows what is going on would be really appreciated. We're playing around with settings which are not explained anywhere.
Comment 423 Michael 2018-09-21 12:40:34 UTC
So, I'm running Ubuntu 18.04 stock kernel with 
- Typical Current Idle
- disabling global c-states
- idle=nomwait kernel parameter

and have had no freezes for 8 days now - a new record. So it looks like I can confirm what Nelson Castillo mentioned in comment #419.

This seems ok now. Now I need to verify if all changes are needed or if only one or two are necessary.

Will come back with an update when I have the results.

Michael
Comment 424 OptionalRealName 2018-09-22 00:31:08 UTC
So what are the implications of using idle=nomwait, and why isn't it standard behaviour for all Ryzen CPUs going forward?
Comment 425 Liviu Lalescu 2018-09-22 15:10:43 UTC
I had this same bug in June 2018 and in the beginning of July 2018. It seems to be solved for me by " Look in BIOS settings for "PSU Idle Control" and set it to "Typical Current Idle" "

The bug was appearing every few days (2-7) under low usage.

I was using openSUSE Leap 15.0 before this and no bug. Then I switched to Leap 15.1 and the bug appeared for the first time (it had a newer kernel). Then I had about 3 weeks before finding this fix.

Now I have openSUSE Tumbleweed with kernel 4.18.8. It was OK since kernel 4.14 and now it keeps on updating the kernel as usual and everything is OK.

Thank you for providing this fix!
Comment 426 JerryD 2018-09-30 04:50:30 UTC
I am on Ryzen 2500U Laptop, HP. I am using kernel 4.18.9-200.fc28.x86_64.

The zenstates.py script fails when I try to disable C6. Oh well.

I have kernel parameters: idle=nomwait processor.max_cstate=5

I still get lockups. No relevant BIOS settings are available on this machine.

Feel pretty hopeless at the moment.
Comment 427 Klaus Mueller 2018-09-30 05:39:34 UTC
(In reply to JerryD from comment #426)
> I am on Ryzen 2500U Laptop, HP. I am using kernel 4.18.9-200.fc28.x86_64.
> 
> The zenstates.py script fails when I try to disable C6. Oh well.

Did you load msr kernel module before (modprobe msr)?

Do you have the option to slightly overclock? Isn't there a BIOS switch like "Typical Current Idle"? Is there a way to switch off C-states completely in the BIOS (just for testing, to be sure your hangs have the same cause as the ones here)?
Comment 428 JerryD 2018-09-30 21:51:43 UTC
(In reply to Klaus Mueller from comment #427)
> (In reply to JerryD from comment #426)
> > I am on Ryzen 2500U Laptop, HP. I am using kernel 4.18.9-200.fc28.x86_64.
> > 
> > The zenstates.py script fails when I try to disable C6. Oh well.
> 
> Did you load msr kernel module before (modprobe msr)?
> 
> Do you have the possibility to slightly overclock? Isn't there a Bios switch
> like "Typical Current Idle"? Is there a possibility to switch of C-states
> completely in the Bios (just for testing to be sure your hangs are the same
> reason as here)?

As follows:

[root@amdr jerry]# modprobe msr
[root@amdr jerry]# ./bin/zenstates.py --list
P0 - Enabled - FID = 64 - DID = A - VID = 35 - Ratio = 20.00 - vCore = 1.21875
P1 - Enabled - FID = 66 - DID = C - VID = 60 - Ratio = 17.00 - vCore = 0.95000
P2 - Enabled - FID = 60 - DID = C - VID = 66 - Ratio = 16.00 - vCore = 0.91250
P3 - Disabled
P4 - Disabled
P5 - Disabled
P6 - Disabled
P7 - Disabled
C6 State - Package - Enabled
C6 State - Core - Enabled

[root@amdr jerry]# ./bin/zenstates.py --c6-disable
Traceback (most recent call last):
  File "./bin/zenstates.py", line 112, in <module>
    writemsr(0xC0010292, readmsr(0xC0010292) & ~(1 << 32))
  File "./bin/zenstates.py", line 23, in writemsr
    raise OSError("msr module not loaded (run modprobe msr)")
OSError: msr module not loaded (run modprobe msr)

As you can see, msr is loaded and listing the states works fine, but the disable option fails.

The BIOS on this laptop has no power-related options that I can see.
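
The traceback above shows zenstates.py raising its generic "msr module not loaded" message even though the msr module clearly is loaded, so whatever error the MSR write actually hit is being masked; one possible cause is a kernel that refuses MSR writes, e.g. under Secure Boot lockdown on some distribution kernels. A small sketch (Python, run as root) that attempts the same package-C6 write directly but surfaces the real OS error - the register address and bit are taken from the traceback, everything else is an assumption:

  import os, struct

  MSR_PKG_C6 = 0xC0010292            # register used by zenstates.py --c6-disable

  def rdmsr(msr, cpu=0):
      fd = os.open("/dev/cpu/%d/msr" % cpu, os.O_RDONLY)
      try:
          return struct.unpack("<Q", os.pread(fd, 8, msr))[0]
      finally:
          os.close(fd)

  def wrmsr(msr, value, cpu=0):
      fd = os.open("/dev/cpu/%d/msr" % cpu, os.O_WRONLY)
      try:
          os.pwrite(fd, struct.pack("<Q", value), msr)
      finally:
          os.close(fd)

  val = rdmsr(MSR_PKG_C6)
  print("package C6:", "enabled" if val & (1 << 32) else "disabled")
  try:
      wrmsr(MSR_PKG_C6, val & ~(1 << 32))   # clear bit 32, as the failing call above tries to do
  except OSError as e:
      # the real errno (EPERM, EIO, ...) says why the write was rejected,
      # which the script's blanket error message hides
      print("MSR write failed:", e)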
Comment 429 Chris Hall 2018-10-01 11:24:37 UTC
The root cause of the "freeze when idle" appears to be a fault in the mechanism which wakes the CPU up once it has gone into the deepest of deep sleeps.

Mr AMD has pointed the finger at PSUs which fail to maintain the correct voltage when the current draw approaches 0A.  Between the PSU and the CPU there is circuitry on the motherboard, which I guess could also be involved ?

Disabling C-States, in particular C6, obviously addresses the apparent root cause.

However, for my machine, setting C6 Package Disabled (using zenstates.py) was not enough.  Nor was it enough to set both C6 Package and C6 Core Disabled.

After setting "Typical Current Idle", I have had ~5 months without a "freeze when idle" -- with various kernels up to and including 4.18.9, and (latterly) with no other settings.

So I assume that "Typical Current Idle" does something more than disabling C6... something I have yet to discover. 

------------

What the "Typical Current Idle" option does is secret :-(

On my machine (Ryzen 7 1800X, Asus X370-Pro, BIOS 4012), zenstates.py tells me:

   Low Current Idle     : C6 Package Enabled  : C6 Core Enabled
   Typical Current Idle : C6 Package Disabled : C6 Core Enabled
   Auto                 : C6 Package Enabled  : C6 Core Enabled

and all three have the same three P-States:

   P0: FID=90  DID=8  VID=20  Ratio=36.00  vCore=1.35000
   P1: FID=80  DID=8  VID=2C  Ratio=32.00  vCore=1.27500
   P2: FID=84  DID=C  VID=68  Ratio=22.00  vCore=0.90000

[I thought that "Typical Current Idle" might be fiddling with P2, but that does not seem to be the case.]

I imagine there are many more parameters I could look at, if only I knew more.  For completeness, I leave all other BIOS options in their default state.

While I was checking the effect of the "Current Idle" options, I noticed something peculiar about setting "Typical Current Idle".  I started with:

 0)                       "Typical Current Idle"   was: C6 Package Disabled

and then:

 1) reboot into BIOS, set "Low Current Idle",     gave: C6 Package Enabled
 2) reboot into BIOS, set "Typical Current Idle", gave: C6 Package Enabled  !
 3) shutdown and restart,                         gave: C6 Package Disabled !
 
To get the (full) effect of "Typical Current Idle" I have to do a cold boot, apparently !!  [I tried this three times, just to be sure.]

------------

I'm curious about "idle=nomwait".  I find this will "disable mwait for CPU C-states".  AFAIKS, the MWAIT instruction halts the current thread and sets a given C-State to drop into.  So not using MWAIT looks like another way of disabling C6 for the core ?  Or is there something else going on here ?

Given that I found that disabling C6 is not enough to eliminate "freeze when idle", I would not expect "idle=nomwait" to be enough.  Unless "idle=nomwait" does something more... avoiding a Kernel bug, for example ?
Comment 430 Michael 2018-10-18 20:25:10 UTC
(In reply to Michael from comment #423)
> So, I'm running Ubuntu 18.04 stock kernel with 
> - Typical Current Idle
> - disabling global c-states
> - idle=nomwait kernel parameter
> 
> and have no freezes for 8 Days now - new record. So it looks like I can
> confirm what Nelson Castillo mentioned in comment #419
> 
> This seems ok now. Now I need to verify if all changes are needed or if only
> one or two are necessary.
> 
> Will come back with an update when I have the results.
> 
> Michael

Hi!

First, I've changed my mainboard (see below). AGESA-wise this was a downgrade from 1.0.0.4 to 1.0.0.2.
This does change the chipset from B450 to X470.

Second, I did enable the global c-states in UEFI and am still running without freezes for over 14 days now. I consider that long enough to call it stable.




That means only the following changes are needed:

- typical current idle UEFI setting 
- the idle=nomwait kernel parameter


For reference:
Asrock X470 Master SLI with BIOS 1.10
2700x
no overclocking, but ram on XMP profile.

# ./zenstates.py -l
P0 - Enabled - FID = 94 - DID = 8 - VID = 36 - Ratio = 37.00 - vCore = 1.21250
P1 - Enabled - FID = 80 - DID = 8 - VID = 59 - Ratio = 32.00 - vCore = 0.99375
P2 - Enabled - FID = 84 - DID = C - VID = 76 - Ratio = 22.00 - vCore = 0.81250
P3 - Disabled
P4 - Disabled
P5 - Disabled
P6 - Disabled
P7 - Disabled
C6 State - Package - Disabled
C6 State - Core - Enabled

I'll stop playing with parameters since I think I reached the best I can. I'll only update if something changes like freezes keep returning etc.

Regards,
Michael
Comment 431 Vladyslav Yamkovyi 2018-10-21 09:46:55 UTC
(In reply to JerryD from comment #426)
> I am on Ryzen 2500U Laptop, HP. I am using kernel 4.18.9-200.fc28.x86_64.
> 
> The zenstates.py script fails when I try to disable C6. Oh well.
> 
> I have kernel parameters: idle=nomwait processor.max_cstate=5
> 
> I still get lockup. No BIOS settings available on this mavhine.
> 
> Feel pretty hopeless at the moment.

I'm on a Ryzen 2500U laptop as well, an HP 15-db0229ur with the latest BIOS (F.11 Rev.1 in my case) and the latest kernel (what used to be called 4.19-rc7 on kernel.org). I also have no relevant BIOS settings available on this machine. I tried both kernel parameters AND *successfully* disabling C6 via ZenStates.py. I've already had massive data loss at least twice. I really, really hope this issue won't turn into one of those ten-year-old issues.

AMD did publish an errata report, pointing out that this may be fixed at the software level (page 63).
https://www.amd.com/system/files/TechDocs/55449_Fam_17h_M_00h-0Fh_Rev_Guide.pdf
Comment 432 simonmcquire 2018-10-22 08:35:15 UTC
Want to share this in case it helps anyone else.  I stumbled across this issue after putting together a new Ryzen 2200G HTPC with AsRock AB350 M-ITX mainboard a few weeks ago.  Would encounter random freezes every couple of hours.

After updating UEFI to latest version, updating kernel to latest 4.19-rc8, and Mesa to 18.2.2, I was still having stability problems (freezing with error messages in dmesg output).  Made the changes in BIOS to typical current IDLE etc. and still was having issues.

But then realised I had not updated the amdgpu firmware, so grabbed the latest version from https://git.kernel.org/pub/scm/linux/kernel/git/firmware/linux-firmware.git and everything seems to have been stable since then *fingers crossed*.
Comment 433 Vladyslav Yamkovyi 2018-10-22 13:37:02 UTC
(In reply to simonmcquire from comment #432)
> But then realised I had not updated the amdgpu firmware...

Does that mean that the issue was already resolved and it's just distros being outdated? I haven't tried this solution yet, but you're not the only one saying that the problem can be solved by grabbing the latest AMDGPU drivers.
Comment 434 simonmcquire 2018-10-22 13:39:22 UTC
(In reply to Vladyslav Yamkovyi from comment #433)
> (In reply to simonmcquire from comment #432)
> > But then realised I had not updated the amdgpu firmware...
> 
> Does that means that the issue was already resolved and it's just distros
> being outdated? I haven't tried this solution yet but you're not the only
> one who is saying that the problem can be solved by grabbing latest AMDGPU
> drivers.

No idea, but I would say if you have outdated firmware, it is definitely worth trying updating to latest version.
Comment 435 Eduardo Reyes 2018-10-22 13:59:02 UTC
@simonmcquire

Can you please clarify what this amdgpu firmware you are using is and what for?
Are you flashing the video card?
Are you flashing the motherboard?

Thanks in Advance
Comment 436 simonmcquire 2018-10-22 14:18:08 UTC
(In reply to Eduardo Reyes from comment #435)
> @simonmcquire
> 
> Can you please clarify what this amdgpu firmware you are using is and what
> for?
> Are you flashing the video card?
> Are you flashing the motherboard?
> 
> Thanks in Advance

No flashing of video card or motherboard here.  I'm not using a separate GPU - I'm using the GPU built into the Ryzen 2200G APU.  This is just firmware loaded by the amdgpu kernel driver.  

The firmware is loaded and used by the amdgpu driver as far as I understand.  I have just replaced all of the files in /lib/firmware/amdgpu with the latest version.
Comment 437 Eduardo Reyes 2018-10-22 14:24:45 UTC
(In reply to simonmcquire from comment #436)
> (In reply to Eduardo Reyes from comment #435)
> > @simonmcquire
> > 
> > Can you please clarify what this amdgpu firmware you are using is and what
> > for?
> > Are you flashing the video card?
> > Are you flashing the motherboard?
> > 
> > Thanks in Advance
> 
> No flashing of video card or motherboard here.  I'm not using a separate GPU
> - I'm using the GPU built into the Ryzen 2200G APU.  This is just firmware
> loaded by the amdgpu kernel driver.  
> 
> The firmware is loaded and used by the amdgpu driver as far as I understand.
> I have just replaced all of the files in /lib/firmware/amdgpu with the
> latest version.

I'm using an RX 580 with stable Mesa and Kubuntu 18.04.1. I don't have this issue anymore after a motherboard firmware update on another ASRock board, but I'm just curious: would this firmware help with performance or bug fixes on all AMD GPUs?
Comment 438 simonmcquire 2018-10-22 14:46:22 UTC
(In reply to Eduardo Reyes from comment #437)
> (In reply to simonmcquire from comment #436)
> > (In reply to Eduardo Reyes from comment #435)
> > > @simonmcquire
> > > 
> > > Can you please clarify what this amdgpu firmware you are using is and
> what
> > > for?
> > > Are you flashing the video card?
> > > Are you flashing the motherboard?
> > > 
> > > Thanks in Advance
> > 
> > No flashing of video card or motherboard here.  I'm not using a separate
> GPU
> > - I'm using the GPU built into the Ryzen 2200G APU.  This is just firmware
> > loaded by the amdgpu kernel driver.  
> > 
> > The firmware is loaded and used by the amdgpu driver as far as I
> understand.
> > I have just replaced all of the files in /lib/firmware/amdgpu with the
> > latest version.
> 
> I'm using the rx580 with mesa stable and kubuntu 18.04.1.... I dont have
> this issue anymore after motherboard firmware update on another asrock board
> but just curious would this firmware help with performance or bug fixes on
> all AMD gpu?

I don't know, but maybe if it's stable now you should stick with your current configuration.

Since the 2200G is an APU with built-in graphics, I think I may have experienced both amdgpu and CPU-related issues (the latter being what this bug seems to be about), so I needed the combination (new UEFI and new amdgpu firmware) to get it stable.
Comment 439 Owen Swerkstrom 2018-10-23 15:56:15 UTC
I'm one of the users James mentioned early on; I don't know how to capture kernel output when I'm greeted with a completely-locked-up or already-rebooted machine.  If it's hardware, I don't see what the kernel could do anyway.

I've tried all the tricks mentioned here, when applicable.  (My BIOS doesn't expose anything like "typical current idle", etc.)

The only way I've been able to keep this system stable is to keep it busy.  I run BOINC and donate some of this shark-style swim-or-die CPU to medical research.  There are some nice settings for how the cores get used, and according to htop, tasks get shuffled around to all cores, if that matters.  I went with using at most 50% of 25% of the cores (figures I pulled out of ...thin air).  That keeps my load average around 2 when I'm doing nothing, but whatever - it's for a good cause.

My CPU is an AMD Ryzen 7 1800X with 8 or 16 cores depending on how one counts, and keeping them from getting bored has, so far, let me get back to rebooting on my terms (upgrades) rather than at random.

This comes up as a Linux problem, but it sure smells like a hardware defect to me.  In my mind, all the self-monitoring and user-spying and background-indexing and auto-downloading that Windows does happens to mask this problem for most people.  (Plus who would think twice about a randomly-hanging or -rebooting Windows box?)  Has anyone run *BSD or FreeDOS or something else which would allow a Ryzen to get bored for hours/days?
Comment 440 Konstantin Kharlamov 2018-10-23 17:16:08 UTC
> Has anyone run *BSD or FreeDOS or something else which would allow a Ryzen to
> get bored for hours/days?

Yes, FreeBSD is affected too https://www.phoronix.com/scan.php?page=news_item&px=Ryzen-BSD-Lock-Ups-2018
Comment 441 Vladyslav Yamkovyi 2018-10-23 19:49:23 UTC
(In reply to Owen Swerkstrom from comment #439)
> This comes up as a Linux problem, but it sure smells like a hardware defect
> to me.
Any system that does not ship a software workaround must be affected, going by the published errata; it's not even a question. I linked one of their revision guides in a previous comment. They have no plans to fix this and suggest using software workarounds. We're left on our own.
Comment 442 Chris Hall 2018-10-24 12:53:32 UTC
(In reply to Vladyslav Yamkovyi from comment #441)
> (In reply to Owen Swerkstrom from comment #439)
> > This comes up as a Linux problem, but it sure smells like a hardware defect
> > to me.
> Any system that does not provides a software workaround must be affected
> according to published errata, not even a question. I've published one of
> their revision guides in my previous comments. They have no plans on fixing
> this and suggest using software workarounds. We're left on our own.

What is the "this" that AMD have no plans to fix ?

I had a look at <https://www.amd.com/system/files/TechDocs/55449_Fam_17h_M_00h-0Fh_Rev_Guide.pdf> which you referenced earlier.  I found 3 MWAIT issues:

  1057 MWAIT or MWAITX Instructions May Fail to Correctly Exit From
       the Monitor Event Pending State

  1059 In Real Mode or Virtual-8086 Mode MWAIT or MWAITX Instructions May
       Fail to Correctly Exit From the Monitor Event Pending State

  1109 MWAIT Instruction May Hang a Thread

but I could not find anything else that might be related to the "freeze when idle" problem.

FWIW, here's the full text for the Erratum 1109:

  1109 MWAIT Instruction May Hang a Thread

       Description: Under a highly specific and detailed set of internal timing
                    conditions, the MWAIT instruction may cause a thread to
                    hang in SMT (Simultaneous Multithreading) Mode.

       Potential Effect on System: The system may hang or reset.

       Suggested Workaround: System software may contain the workaround for
                             this erratum.

       Fix Planned: No fix planned

so there !  I guess "idle=nomwait" is "the workaround" ?

"Typical Current Idle" appears to work for some (including me) but not for everyone.  If one or more of these MWAIT errata is the root cause of the "freeze when idle" problem, I wonder why AMD introduced "Typical Current Idle" and how that relates to these MWAIT issues ??
Comment 443 Owen Swerkstrom 2018-10-24 16:09:13 UTC
...
> so there !  I guess "idle=nomwait" is "the workaround" ?
> 
> "Typical Current Idle" appears to work for some (including me) but not for
> everyone.  If one or more of these MWAIT errata is the root cause of the
> "freeze when idle" problem, I wonder why AMD introduced "Typical Current
> Idle" and how that relates to these MWAIT issues ??

Good questions, and beyond my ken.  "idle=nomwait" did not prevent my system from rebooting while I was away.  (I don't know if reboots and freezes are separate problems?)  My BIOS doesn't expose anything like "typical current idle", and I have no way to upgrade it since the upgrade tools require Windows.

Is there a guide out there for dummies like me to be able to collect kernel output from these reboots?  (Does the kernel even get a chance to do/report anything?  Would a report even be helpful to kernel hackers in figuring out more workarounds?)  If I can contribute anything plausibly-constructive, I'd love to.

Otherwise, I use and need this machine, so I'll probably just keep treating the symptoms by keeping it busy.  (Which I suppose is another software workaround. ;^)
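
On the question of capturing kernel output from a machine that freezes or spontaneously reboots: if there is no usable serial port, netconsole can stream kernel messages over UDP to a second machine. A minimal sketch - all addresses, ports, interface names and the MAC below are placeholders, and it only helps if the network path is still alive when things go wrong, so a hard hang may still leave nothing:

  # on the affected box
  dmesg -n 8        # send all message levels to the console
  modprobe netconsole netconsole=6665@192.168.1.10/enp4s0,6666@192.168.1.20/aa:bb:cc:dd:ee:ff
                    # local-port@local-ip/interface,remote-port@remote-ip/remote-mac

  # on the receiving machine
  nc -u -l 6666 | tee netconsole.log     # some netcat variants need: nc -u -l -p 6666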
Comment 444 ZaNKuRO 2018-10-24 17:42:45 UTC
I have an Asus ROG Strix GL702ZC notebook with a Ryzen 7 1700 + Radeon RX 580 on Ubuntu 18.04. The kernel parameter "idle=nomwait" made no difference for me; the idle freezes persist. The only thing that solved my problem was disabling the CPU C6 state through zenstates.py.
Comment 445 fin4478 2018-11-27 13:55:44 UTC
(In reply to Owen Swerkstrom from comment #443)
 I have no way to upgrade it since the upgrade tools require Windows.

Win10 is free and runs without activation indefinitely. Download the ISO file from the Microsoft site, mount it virtually, and copy the files to a bootable FAT32-formatted USB stick (prepared with GParted). Make about 30GB of space for Win10 on your main drive with GParted, boot from the USB stick, and install Win10, skipping the registration key dialog etc.
Comment 446 Owen Swerkstrom 2018-11-27 17:20:41 UTC
(In reply to fin4478 from comment #445)
> (In reply to Owen Swerkstrom from comment #443)
>  I have no way to upgrade it since the upgrade tools require Windows.
> 
> Win10 is free and runs freely forever. Download the iso file from ms site.
> Mount the iso file virtually and copy files to a gparted fat32 formatted and
> bootable usb memory stick.  Make space for win10 about 30GB with gparted in
> your main drive. Boot from the usb memory and install win10, skip
> registration key dialog etc.

Thanks, I didn't know that.  But, since figuring out zenstates.py, I appear to be set.  Fingers crossed!  (Anyway, I'd rather keep the cpu busy than install windows. ;^)

<soapbox>
BIOS updates should have their own boot images; how on Earth they can get away with requiring a specific OS is all kinds of backwards.
</soapbox>
Comment 447 fin4478 2018-11-29 11:32:36 UTC
(In reply to Owen Swerkstrom from comment #446)
 <soapbox>
> BIOS updates should have their own boot images; how on Earth they can get
> away with requiring a specific OS is all kinds of backwards.
> </soapbox>

Asus motherboards are the best, for Linux users especially. You can update the bios via the Ethernet connection or via a small fat32 partition with the Ez flash tool in the bios.
Comment 448 eric.c.morgan 2018-11-29 13:38:58 UTC
(In reply to fin4478 from comment #447)
> (In reply to Owen Swerkstrom from comment #446)
>  <soapbox>
> > BIOS updates should have their own boot images; how on Earth they can get
> > away with requiring a specific OS is all kinds of backwards.
> > </soapbox>
> 
> Asus motherboards are the best, for Linux users especially. You can update
> the bios via the Ethernet connection or via a small fat32 partition with the
> Ez flash tool in the bios.

My Asrock has all those features as well I believe.
Comment 449 C0rn3j 2018-12-02 15:40:42 UTC
*** Bug 201719 has been marked as a duplicate of this bug. ***
Comment 450 C0rn3j 2018-12-02 15:49:05 UTC
From the dupe bug report -

Gigabyte GA-AX370-Gaming K5 (latest UEFI - 2018/08/08)
AMD Ryzen 5 2400G

Running Arch Linux with 2018-10-26 linux-firmware and I still get soft lockups and one other issue capable of freezing the system.

I'll try the 'Power Supply Idle Control' setting and see if that does anything (if my mobo even has it...).
Comment 451 Sami Farin 2018-12-04 16:06:26 UTC
With Ryzen 1600X + Radeon RX 550 + ASRock Taichi X370 I didn't have this bug until 4.18.20.  4.18.18 had 12 days uptime and 4.18.20
(and 4.19.6) maybe 6 hours.  4.18.19 now has 56 hours uptime.

X just freezes (keyboard+mouse dead) and I have to press reset button.  Likewise, if I am in console, freeze happens the same way; cursor stops blinking and I don't get any messages.

I am booting with nosmt=force rcu_nocbs=0-5 mem_encrypt=off 
 (also CONFIG_RCU_NOCB_CPU=Y ).

Now, I don't feel like doing git-bisect (commits v4.18.19..v4.18.20),...
Does someone have ideas as to what to try next?  Anything suspicious in v4.18.20 commits?

Some differences in dmesg 4.18.19..4.18.20:

-smpboot: Allowing 16 CPUs, 10 hotplug CPUs
+smpboot: Allowing 16 CPUs, 4 hotplug CPUs
-.... node  #0, CPUs:        #1  #2  #3  #4  #5
+.... node  #0, CPUs:        #1  #2  #3  #4  #5  #6  #7  #8  #9 #10 #11
 smp: Brought up 1 node, 6 CPUs
 smpboot: Max logical packages: 3

-ACPI: (supports S0 S5)
+ACPI: (supports S0 S3 S5)
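
For anyone who does feel like attempting the bisect mentioned above, the workflow itself is short; the slow part is that each step has to run long enough to be trusted before it can honestly be marked good. A minimal sketch:

  git bisect start v4.18.20 v4.18.19   # bad revision first, then the last known good one
  # build and boot the commit git checks out, run it until it either freezes
  # or has stayed up long enough, then:
  git bisect bad       # if it froze
  git bisect good      # if it stayed up
  # repeat until git names the first bad commit; "git bisect log" records progress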
Comment 452 Sami Farin 2018-12-08 10:44:43 UTC
Well, 4.18.19 froze after four days.

4.19.7 froze in 3½ hours.

rcu_nocb_poll mem_encrypt=off nosmt=force
CONFIG_RCU_NOCB_CPU=y
CONFIG_PREEMPT_RCU=y

[12572.931476] watchdog: BUG: soft lockup - CPU#1 stuck for 22s! [(journald):18688]
[12572.931509] watchdog: BUG: soft lockup - CPU#4 stuck for 22s! [amdgpu_cs:0:6535]
[12600.931702] watchdog: BUG: soft lockup - CPU#1 stuck for 22s! [(journald):18688]
[12600.931736] watchdog: BUG: soft lockup - CPU#4 stuck for 22s! [amdgpu_cs:0:6535]
Comment 453 Vladyslav Yamkovyi 2018-12-12 00:15:59 UTC
Let's face it: I really, really don't think measuring time before a freeze makes any sense. It just occurs under certain circumstances or even randomly due to some hardware bug - 4.19 will work 4 days for the first time, 3 hours for the second time and 23 hours for the third. That's for sure - it's not related to a specific version, it just persists across all kernel versions.
Comment 454 edigitalb 2018-12-22 04:06:17 UTC
(In reply to Vladyslav Yamkovyi from comment #453)
> Let's face it: I really, really don't think measuring time before a freeze
> makes any sense. It just occurs under certain circumstances or even randomly
> due to some hardware bug - 4.19 will work 4 days for the first time, 3 hours
> for the second time and 23 hours for the third. That's for sure - it's not
> related to a specific version, it just persists across all kernel versions.

I confirm that, tested on kernel 3.x, 4.12, 4.15, 4.19
Comment 455 edigitalb 2018-12-22 04:07:18 UTC
This is not an AMD-specific problem; I'm getting the same issue on an Intel i7-8700.
Comment 456 Weasel 2018-12-22 14:19:08 UTC
You can test your Ryzen with this program (it is no longer being developed by its author): https://github.com/suaefar/ryzen-test
My 2200G shows no segfaults, but "TIME TO FAIL: 885 s".

Under Windows 10, one Ryzen 2200G often freezes under load for 20-30 seconds.
Another Ryzen 2200G often freezes under Debian Sid and crashes KDE Plasma all the time, while GNOME on Wayland runs more stably.

C-states, C6, Cool'n'Quiet, Low Current Idle, Typical Current Idle - nothing changed the fact that the CPU freezes.

I don't have the two systems anymore. It was all too much with the Ryzen 2200G for me and I'd had enough of it.
Comment 457 Antoine Pitrou 2018-12-22 14:22:18 UTC
@Weasel that's a different problem, that doesn't necessarily occur with the same CPUs.  This specific bug entry is about lockups that (to quote the original message) "typically occur when the load is low".

For example, I've had several low-load lockups on my Ryzen 1700, I haven't had a single high-load lockup.  Please let's not conflate the two.
Comment 458 Tolga Cakir 2018-12-30 13:31:20 UTC
Hi,

I have an ASUS A320M-K with latest BIOS 4027 (AGESA 1.0.0.6), Ryzen 7 2700, Patriot Viper RGB 16GB RAM 3200CL16, Samsung 830 SSD 128GB, Corsair AX860 860W PSU (80+ Platinum) and ASUS ROG Strix RX480 8GB. Running Arch Linux w/ Kernel 4.19.12 and GNOME 3.30 on Wayland. BIOS configured to Defaults + XMP Profile activated + CSM disabled (-> UEFI-only mode).

I have experienced 2 hangs with this system, both within 1 hour uptime. However, I couldn't spot anything in the logs. There were no watchdog entries, unlike previously posted in this bug report.

The "system lockup" expresses itself by blanking my screen to dark and showing a blinking cursor in the top left corner. No other symbols / information. I can't interact in any way with my system - it ignores all inputs; can't switch console, doesn't react to single power button press, nothing. Only way to shutdown is holding power button for 4s.

I entered the BIOS after the lockups and noticed *VERY* slow performance within BIOS. Keypresses took very long to register (like 5+ seconds) until BIOS eventually hung. After another powercycle, all was good again. I wonder, if rebooting using the reset button could give us more clues, if the system continues to run "buggy" after the reset.

After setting "Typical Current Idle", I haven't had a single lockup since, even after 48h of uptime, mostly idling. The system runs perfectly well. This fix is fine for desktops, but laptops suffer higher power draw - if they are even able to set this.
Comment 459 Michael 2019-01-02 19:58:24 UTC
I think this bug is mostly about the official AMD Errata 1109 "MWAIT Instruction May Hang a Thread".

This is described here ( https://developer.amd.com/wp-content/resources/55449_1.12.pdf ) as:

"Under a highly specific and detailed set of internal timing conditions, the MWAIT instruction may cause a
thread to hang in SMT (Simultaneous Multithreading) Mode."

Since this is a nofix for AMD, this must be fixed by the OS. The AMD document states:

"Suggested Workaround
System software may contain the workaround for this erratum."

So, unfortunately AMD will never solve that hardware error. I suggest they change the first two letters of their company name ;)

This in fact leaves every Linux system with a Ryzen CPU unstable, so it should be considered serious.

I haven't seen a comment from any kernel developer in the history of this bug report. Did I overlook something?

Question: will the Linux kernel ever implement a workaround (as suggested by AMD)?
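
For anyone wondering whether that revision guide even covers their part, a small sketch (Python) that reads family/model/stepping from /proc/cpuinfo and compares them against the Family 17h, Models 00h-0Fh range in the document's title; "not covered" here only means "not this particular document" (Raven Ridge APUs, for example, fall under a different revision guide):

  fields = {}
  with open("/proc/cpuinfo") as f:
      for line in f:
          if ":" in line:
              key, _, val = line.partition(":")
              fields.setdefault(key.strip(), val.strip())

  family, model, stepping = (int(fields[k]) for k in ("cpu family", "model", "stepping"))
  covered = family == 0x17 and model <= 0x0F
  print("family %#x, model %#x, stepping %d -> %s this revision guide"
        % (family, model, stepping, "covered by" if covered else "not covered by"))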
Comment 460 Bráulio Bhavamitra 2019-01-02 20:00:49 UTC
(In reply to Michael from comment #459)
> I think this bug is mostly about the official AMD Errata 1109 "MWAIT
> Instruction May Hang a Thread".
> 
> This is described here (
> https://developer.amd.com/wp-content/resources/55449_1.12.pdf ) as:
> 
> "Under a highly specific and detailed set of internal timing conditions, the
> MWAIT instruction may cause a
> thread to hang in SMT (Simultaneous Multithreading) Mode."
> 
> Since this is a nofix for AMD, this must be fixed by the OS. The AMD
> document states:
> 
> "Suggested Workaround
> System software may contain the workaround for this erratum."
> 
> So, unfortunately AMD will never solve that hardware error. I suggest they
> change the first two letters of their company name ;)
> 
> This in fact leaves every linux system with a Ryzen CPU instable, so this
> should be considered serious.
> 
> I didn't read a comment from any kernel developer in the history of this bug
> report. Did I overlook something?
> 
> Question: Will the Linux kernel ever implement a workaround (as suggested by
> AMD??

idle=nomwait kernel option fixed all hangs for me
Comment 461 Michael 2019-01-02 20:03:36 UTC
(In reply to Bráulio Bhavamitra from comment #460)
> (In reply to Michael from comment #459)
> > I think this bug is mostly about the official AMD Errata 1109 "MWAIT
> > (...)
> > 
> > Question: Will the Linux kernel ever implement a workaround (as suggested
> by
> > AMD??
> 
> idle=nomwait kernel option fixed all hangs for me

True, same for me. 
But is every future Zen user doomed to go through a valley of pain and hopefully stumble on this workaround themselves, or is it going to become the default at some point?
Comment 462 Francesco Biscani 2019-01-02 20:48:49 UTC
My understanding is that idle=nomwait disables the use of the MWAIT instruction in the kernel implementation of the CPU idle state.

I also *think* that, under the hood, disabling C6 and the "Typical Power idle" option also have a similar (or perhaps the same?) effect of reducing the use of MWAIT. E.g., since I started enabling the "Typical Power idle" option in my BIOS, at every boot I have the following message showing up in the kernel logs:

[Firmware Bug]: ACPI MWAIT C-state 0x0 not supported by HW (0x0)

However, because the MWAIT instruction is part of SSE3 [1], it might end up being executed by any software which was compiled with support for SSE3. So I believe that disabling MWAIT for idle states greatly reduces the risk of getting these MWAIT hangs (simply because the instruction will end up being executed much less), but it will not completely eliminate them. Thus, it seems like the only bulletproof solution is either to upgrade to Ryzen+/Ryzen2, or to disable hyperthreading.

[1] https://en.wikipedia.org/wiki/SSE3
Comment 463 Another User 2019-01-08 08:18:48 UTC
I have random freezes with a Ryzen 2200U laptop. The display image just freezes and the system stops responding. The only way to get out of this state is to hold the power button (even the magic SysRq key does not work). No clues in the logs.

With Ubuntu 18.04 it happened nearly once per week (not intense use). After installing 18.10 things became more stable - I got the first lockup after 6 weeks and another after 3. The hangs are completely random and at different CPU loads - once I got one while gaming (and the ambient sound kept playing...).

I've tried idle=nomwait but this did not help on 18.04 - and the line "monitor-mwait will be used for idle loop" was still present in dmesg with this parameter!

The idle=halt parameter completely disables the mwait instruction in the kernel and also forces use of only the C1 state. On my laptop it does not cause higher battery drain or any suspend/resume issues. But I haven't tested it thoroughly and have only used it for a few days... so it may not even be the solution for my problem.
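
A few things worth checking when experimenting with idle=nomwait or idle=halt, to see what the kernel actually ended up using (a quick sketch; the exact dmesg wording varies between kernel versions):

  cat /proc/cmdline                                      # confirm the parameter really reached the kernel
  dmesg | grep -iE 'mwait|idle'                          # e.g. "monitor-mwait will be used for idle loop"
  cat /sys/devices/system/cpu/cpuidle/current_driver     # which cpuidle driver, if any, is active
  cat /sys/devices/system/cpu/cpu0/cpuidle/state*/name   # the C-states the kernel will actually enter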
Comment 464 Bráulio Bhavamitra 2019-01-08 10:07:15 UTC
(In reply to Another User from comment #463)
> I have random freezes with Ryzen 2200u laptop. Display image just freezes
> and system stops responding. Only way to get out from this state is to
> holding power button (even magic SysRq key does not work). No any clues in
> logs.
> 
> With Ubuntu 18.04 it was happen nearly once per week (not intense use).
> After installing 18.10 things become more stable - I've got first lockup
> after 6 weeks and another after 3. Hangs are completely random and on
> different cpu load - once I've got this while gaming (and ambient noise
> continue playing...)
> 
> I've tried idle=nomwait but this did not help on 18.04 - and line
> "monitor-mwait will be used for idle loop" was still present in dmesg with
> this parameter!
> 
> idle=halt parameter completely disables mwait instruction in kernel and also
> forces use of only C1 state. In my laptop it does not cause higher battery
> drain or any suspend-resume issues. But I've not tested it completely an
> used it for few days... So this may be even not the solution for my problem.

Your freeze seems GPU-related; try a newer kernel/firmware.
Comment 465 Another User 2019-01-08 15:06:38 UTC
(In reply to Bráulio Bhavamitra from comment #464)
> You freeze seems GPU related, try a new kernel/firmware

I have also had an amdgpu crash with the same kind of screen freeze, but the magic SysRq key worked that time and the error was left in the kernel log. And I was able to reproduce that one with a certain game in the Dolphin emulator.

My random freezes happen without any special program running, sometimes on a bare GNOME Shell desktop. But this never happens right after power-on - only after some running time.
Comment 466 ricki 2019-01-08 17:42:25 UTC
I have these lockups with the Fedora 29 default kernel and an AMD Ryzen 7 1700X eight-core processor.

The kernel parameter "idle=nomwait" does not help, but the disable-C6 script does...
Comment 467 kernel 2019-01-10 08:02:12 UTC
So it would appear that, for me too, the issue is not completely resolved by setting "Typical Power idle". One of my two (identical) servers reported this morning that a bunch of processes had blocked for more than 120s:

(I cut out a lot of other info like call traces, also these are just 4 examples but there's way more):
> [1094614.681444] INFO: task md1_raid1:301 blocked for more than 120 seconds.
> [1094614.681902] INFO: task fail2ban-server:1095 blocked for more than 120
> seconds.
> [1094614.682226] INFO: task mysqld:1319 blocked for more than 120 seconds.
> [1094614.682447] INFO: task ntpd:1653 blocked for more than 120 seconds.
All the output occurred in the same 3 ms.

Interestingly, the server had not crashed because of this. I did reboot it though, just in case. Meanwhile, still no word from AMD..
Comment 468 Account Deleted 2019-01-12 16:38:00 UTC
I just asked the helpdesk what we can do to get help and more attention. Disabling C6 is still a workaround for me, but that's not how I intended to run my system.
Comment 469 eric.c.morgan 2019-01-12 18:48:02 UTC
(In reply to dl9px from comment #468)
> Just asked the helpdesk what we can do to get help and more attention.
> Disabling c6 still is a workaround for me, but that's not how I intended to
> run my system.

Thanks.

I'm in the same boat with C6. I know I'm a small fish but my company won't be using amd hardware until this is worked out.
Comment 471 nucrap 2019-01-16 23:57:00 UTC
Hey guys, I finally found this bug report, which I guess explains the soft lockups on my Ryzen 7 1700 based system. I had temporarily switched to Win10 hoping this bug would be resolved after some time, but now, 1.5 years later, it is still there on the latest kernel. At least I can finally use my system again, knowing that I need to disable the C6 state.

Can we finally bring some attention to this bug; at least set it to confirmed?
Comment 472 RussianNeuroMancer 2019-01-17 05:02:02 UTC
Maybe we need to contact AMD support and ask them to implement the workaround they suggested (see the document in comment 459) in Linux?
Comment 473 Borislav Petkov 2019-01-17 10:30:28 UTC
Sorry but this bugzilla entry is humongous with 110 people on CC and all reporting some different aspects of what they're seeing and what they're trying and and...

In order to debug this properly, I'd like for someone to test the latest upstream kernel 4.20 and try to reproduce the issue there. Then to explain how exactly one has reproduced it. Once that happens we'll take it from there with further questions.

Thx.
Comment 474 Aaron Muir Hamilton 2019-01-19 22:49:47 UTC
I just experienced this issue with a Threadripper 2950X on kernel 4.20.3
Comment 475 Borislav Petkov 2019-01-19 23:28:21 UTC
(In reply to Aaron Muir Hamilton from comment #474)
> I just experienced this issue with a Threadripper 2950X on kernel 4.20.3

Anything in dmesg?

Please upload full dmesg, /proc/cpuinfo and kernel .config.

Anything particular you did to reproduce it? Any correlation between what the box does and occurrence of the lockup?

Thx.
Comment 476 Aaron Muir Hamilton 2019-01-19 23:53:36 UTC
(In reply to Borislav Petkov from comment #475)
> (In reply to Aaron Muir Hamilton from comment #474)
> > I just experienced this issue with a Threadripper 2950X on kernel 4.20.3
> 
> Anything in dmesg?
> 
> Please upload full dmesg, /proc/cpuinfo and kernel .config.
There is no header soldered on the board's UART, I do not have a normal serial port, and the logs would not write to disk (so I can't copy the dmesg from the incident). It amounted to:

BUG: soft lockup - CPU#13 stuck for 22s!

The config is here https://git.archlinux.org/svntogit/packages.git/tree/trunk/config?h=packages/linux

> 
> Anything particular you did to reproduce it? Any correlation between what
> the box does and occurrence of the lockup?
Sadly no. The core happened to be running pulseaudio and upower, so my audio buffer started looping and eventually the watchdog timer timed out. I don't think any of that is particularly special though.

> 
> Thx.

No, thank you! I don't know what I'm doing, and maybe you do. :- )

For the time being, I've now added idle=nomwait to my kernel params, as it seems at least one of these issues is related to MWAIT (according to AMD's errata).
Comment 477 Aaron Muir Hamilton 2019-01-20 00:02:05 UTC
Created attachment 280601 [details]
My /proc/cpuinfo

This is my ThreadRipper 2950X /proc/cpuinfo, as requested.
Comment 478 Borislav Petkov 2019-01-20 10:05:35 UTC
(In reply to Aaron Muir Hamilton from comment #476)
> For the time being, I've now added idle=nomwait to my kernel params, as it
> seems at least one of these issues is related to MWAIT (according to AMD's
> errata).

Ok, that's good. Please give your box hell to check whether that really
is fixing the issue.

Thx.
Comment 480 Aaron Muir Hamilton 2019-01-21 14:21:02 UTC
(In reply to Borislav Petkov from comment #478)
> (In reply to Aaron Muir Hamilton from comment #476)
> > For the time being, I've now added idle=nomwait to my kernel params, as it
> > seems at least one of these issues is related to MWAIT (according to AMD's
> > errata).
> 
> Ok, that's good. Please give your box hell to check whether that really
> is fixing the issue.
> 
> Thx.

So my box was mostly idle for the last day or so, and had locked up while the monitor was asleep, so I had to reset it.
Comment 481 Another User 2019-01-21 15:29:43 UTC
(In reply to Aaron Muir Hamilton from comment #480)
> So my box was mostly idle for the last day or so, and had locked up while
> the monitor was asleep, so I had to reset it.

Can you test your system with idle=halt parameter instead of idle=nomwait?
Comment 482 Borislav Petkov 2019-01-21 16:28:59 UTC
(In reply to Aaron Muir Hamilton from comment #480)
> So my box was mostly idle for the last day or so, and had locked up while
> the monitor was asleep, so I had to reset it.

Let me make sure I understand it correctly: you had "idle=nomwait" on
the kernel command line and it became unresponsive?

If so, it would be really helpful if you could connect that box over
serial to another one and collect logs in realtime with the hope of
maybe catching something relevant.
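
If hooking up serial is a hassle, netconsole can sometimes serve as a
substitute for getting the splat off the box. A sketch only - the exact
syntax is in Documentation/networking/netconsole.txt, adjust ports,
addresses, interface and MAC to your setup:

  netconsole=6665@192.168.1.5/eth0,6666@192.168.1.10/aa:bb:cc:dd:ee:ff

on the affected box's kernel command line, and something like

  $ nc -u -l 6666

(or "nc -u -l -p 6666", depending on the netcat variant) on the
receiving box.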

The other thing I could suggest is to disable SMT: I know of another
threadripper machine where disabling SMT helped with the lockups. (Not
saying this is the final solution but it would be a good data point to
know).

Thx.
Comment 483 Aaron Muir Hamilton 2019-01-21 16:51:45 UTC
(In reply to Borislav Petkov from comment #482)
> (In reply to Aaron Muir Hamilton from comment #480)
> > So my box was mostly idle for the last day or so, and had locked up while
> > the monitor was asleep, so I had to reset it.
> 
> Let me make sure I understand it correctly: you had "idle=nomwait" on
> the kernel command line and it became unresponsive?

Yes

> If so, it would be really helpful if you could connect that box over
> serial to another one and collect logs in realtime with the hope of
> maybe catching something relevant.
Yeah, I'm looking at that, I don't want to fry the (~500 CAD) board with the wrong serial voltage though (RS-232 v. other serial), so I'm being a bit careful. 

> The other thing I could suggest is to disable SMT: I know of another
> threadripper machine where disabling SMT helped with the lockups. (Not
> saying this is the final solution but it would be a good data point to
> know).

That is maybe an option, for now I disabled C6 from my board firmware (in ASRock's case, by selecting "Typical current idle" [sometimes called "Common current idle" I think, on some Ryzen boards] in Advanced > AMD CBS > Zen Common Options > Power Supply Idle Control); and I'll see if that does the trick.

I've heard that this resolved some soft lockup issues on Ryzen.

> 
> Thx.

Thank you.
Comment 484 Borislav Petkov 2019-01-21 17:01:48 UTC
(In reply to Aaron Muir Hamilton from comment #483)
> That is maybe an option, for now I disabled C6 from my board firmware (in
> ASRock's case, by selecting "Typical current idle" [sometimes called "Common
> current idle" I think, on some Ryzen boards] in Advanced > AMD CBS > Zen
> Common Options > Power Supply Idle Control); and I'll see if that does the
> trick.

Yes, that's also worth a try.

Thx.
Comment 485 Trevor Cordes 2019-01-22 07:08:08 UTC
Hi, I just set up a brand new Ryzen 5 2600 on an ASRock AB350 Pro4 with ECC RAM, running Fedora 29, and the box froze after only 3 hours of uptime, right after we went home for the night, so it was fairly idle (< 0.2% CPU 99% of the time).

Luckily I had read this thread before buying, so I wasn't too shocked.

I wanted to say "me too" because some were wondering if the 2nd gen Ryzens were affected: they are.  Also, we were able to get a stack trace / panic output that was on the frozen screen in a phone capture jpg.  If anyone wants that, I can attach it.  It looks similar to other ones I've seen for this bug elsewhere.  No NMI errors logged on our box though.

We did the "idle: typical" bios tweak and the idle=nomwait tweak and the system has been 100% stable for the 2 days since.

Not sure if it's relevant, but the box also has an NVMe M.2 SSD, a spinning rust drive, a high-quality Japanese-cap 620W PSU (it states a minimum load of 0.6A on +12V, but the drives and mobo are drawing at least that), and a very cheap PCIe video card for server use.
Comment 486 Borislav Petkov 2019-01-22 10:03:02 UTC
(In reply to Trevor Cordes from comment #485)
> Also, we were able to get a stack trace / panic output that was on
> the frozen screen in a phone capture jpg. If anyone wants that, I can
> attach it.

Please do.

> We did the "idle: typical" bios tweak

What is that?

> and the idle=nomwait tweak and the system has been 100% stable for the
> 2 days since.

Ok, it would be good to test those settings separately:

1. once with the "idle: typical" BIOS thing
2. idle=nomwait

and say which fixes the issue for ya. Or both?

Thx.
Comment 487 Trevor Cordes 2019-01-22 12:26:02 UTC
Created attachment 280661 [details]
picture of panic on screen before reboot
Comment 488 Trevor Cordes 2019-01-22 12:29:32 UTC
Sorry the pic is so horrible in attachment 280661 [details], but at least you can see most of the stack traces.  I have a wider pic of it I can use to transcribe the missing right-hand-side bits if needed.

By "idle: typical" bios tweak I meant the <<selecting "Typical current idle">> tweak everyone else has done.

As for which fix (1 or 2) really fixed it, I'll get back to you in about a month because this box is now in production and we can't afford downtime until our next scheduled period.  I have a feeling just the bios tweak is required (from all I've read in dozens of forums), but I did both for good measure.
Comment 489 Borislav Petkov 2019-01-22 12:37:28 UTC
(In reply to Trevor Cordes from comment #488)
> Sorry the pic is so horrible in attachment 280661 [details], but at least
> you can see most of the stack traces.  I have a wider pic of it I can use to
> transcribe the missing right-hand-side bits if needed.

Thanks, that's good enough.

Btw, is there any particular reason why you're running a 32-bit kernel?
If not, I'd consider switching to a 64-bit kernel which is a lot more
and widely tested.

> By "idle: typical" bios tweak I meant the <<selecting "Typical current
> idle">> tweak everyone else has done.

I need to figure out what's behind that setting. What kind of BIOS do
you have?

dmesg | grep DMI:

should have it.

> As for which fix (1 or 2) really fixed it, I'll get back to you in about a
> month because this box is now in production and we can't afford downtime
> until our next scheduled period.  I have a feeling just the bios tweak is
> required (from all I've read in dozens of forums), but I did both for good
> measure.

Ok, whenever you can.

Thx.
Comment 490 Paulo J. S. Silva 2019-01-22 14:02:23 UTC
In my experience (a Ryzen 1700X on an MSI B350 motherboard), the freezes on idle, like the ones described by Trevor Cordes, went away once I changed the "Idle: typical" setting in the BIOS. 

I had the bug before the BIOS setting appeared, and tried many things with more or less success. But the best solution by far is just to change the BIOS setting. I have had many months of uptime without any lockup since then. *It should be the first thing to test if your computer freezes by itself on idle*. 

On the other hand, I have a Threadripper that freezes when doing hard computations (which alternate between high and low CPU usage, because it needs to write temporary results to disk). This cannot be solved using the "Idle: typical" BIOS option. This other freeze happens especially when using Matlab. I am starting to think that the problem is that Matlab (or the MKL BLAS library it uses) might make heavy use of the SSE3 MONITOR/MWAIT instructions, as cited by Francesco Buscani above. I should try the kernel option "idle=nomwait" to see if I can get rid of these freezes. Unfortunately I am travelling until the end of the month and away from the Threadrippers, so I can't test it.
Comment 491 Klaus Mueller 2019-01-22 15:07:51 UTC
(In reply to Borislav Petkov from comment #486)
> (In reply to Trevor Cordes from comment #485)
> > Also, we were able to get a stack trace / panic output that was on
> > the frozen screen in a phone capture jpg. If anyone wants that, I can
> > attach it.
> 
> Please do.
> 
> > We did the "idle: typical" bios tweak
> 
> What is that?

You may take a look here for some more information about this problem and what AMD says: https://community.amd.com/thread/225795 - start reading at the 7th post.

In short: AMD says that this would be a problem with the PSU - but that does not seem to be true, because it happens even w/ PSUs supporting 0A at 12V. 

The "Typical current idle" switch seems to disable the C6 package state, which is normally enabled (see post 14 at the URL given above). The tool used to check this is zenstates.py: https://github.com/r4m0n/ZenStates-Linux/blob/master/zenstates.py
Comment 492 Maxim Bakulin 2019-01-22 16:24:22 UTC
Created attachment 280669 [details]
dmesg of freeze with 4.20.3 kernel and nomwait, rcu_nocbs, max_cstate applied

some older info here: https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1690085/comments/566

I have three machines with new 2700x CPUs, and all three of them experience freezes in xubuntu 18.04 after some time of work. I use compiling QEMU with make -j16 in a loop to test for stability.
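
A minimal sketch of that stress loop (assuming an already configured QEMU build tree):

  $ while true; do make clean >/dev/null && make -j16 || break; done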

I'm not sure it is the same bug, because I observe different behaviour: one machine, that was compiling QEMU, froze during the night, and the one left idle worked for 1 day. Another thing that SEEM to help is disabling SMT: machine with disabled SMT is currently at 7 hours of uptime, which is much more, than usual 1-3 hours before freeze. Windows 10 seem to work fine: I used WSL with ubuntu 18.04 for the same test compilations for 3 days, and no freezes with the same bios settings.

Things I've tried that didn't help:
idle=nomwait
rcu_nocbs=0-15 with new kernel (4.20.3)
disabled cool n quiet and c6 states
Typical Current Idle in uefi
Set SoC voltage to 1.1v
Set DRAM voltage to 1.3v
Update to latest BIOS
High-end PSU
Comment 493 Trevor Cordes 2019-01-23 06:59:44 UTC
(In reply to Borislav Petkov from comment #489)
> Btw, is there any particular reason why you're running a 32-bit kernel?
> If not, I'd consider switching to a 64-bit kernel which is a lot more
> and widely tested.

Yes, we know.  This box has been running forever, always upgraded to the latest Fedora, and we recently upgraded the hardware to the Ryzen from a non-64-capable P4.  That's why it's 32b.  We will upgrade it to 64-bit during next scheduled downtime in a month.

Maxim: I think your problem is different from this bug.  From everything I've read everywhere, the mwait and/or idle:typical tweaks always solve this bug.  You've done even the esoteric stuff like nocbs and voltage override and your problem remains.  Might (must?) be something else.
Comment 494 Aaron Muir Hamilton 2019-01-23 23:51:14 UTC
(In reply to Maxim Bakulin from comment #492)
> Created attachment 280669 [details]
> dmesg of freeze with 4.20.3 kernel and nomwait, rcu_nocbs, max_cstate applied
> 
> some older info here:
> https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1690085/comments/566
> 
> I have three machines with new 2700x CPUs, and all three of them experience
> freezes in xubuntu 18.04 after some time of work. I use compiling QEMU with
> make -j16 in a loop to test for stability.
> 
> I'm not sure it is the same bug, because I observe different behaviour: one
> machine, that was compiling QEMU, froze during the night, and the one left
> idle worked for 1 day. Another thing that SEEM to help is disabling SMT:
> machine with disabled SMT is currently at 7 hours of uptime, which is much
> more, than usual 1-3 hours before freeze. Windows 10 seem to work fine: I
> used WSL with ubuntu 18.04 for the same test compilations for 3 days, and no
> freezes with the same bios settings.
> 
> Things I've tried that didn't help:
> idle=nomwait
> rcu_nocbs=0-15 with new kernel (4.20.3)
> disabled cool n quiet and c6 states
> Typical Current Idle in uefi
> Set SoC voltage to 1.1v
> Set DRAM voltage to 1.3v
> Update to latest BIOS
> High-end PSU

With my Threadripper 2950X I can confirm that none of these were sufficient. I have an excellent PSU which I have confirmed is operating well better than spec, the voltage is definitely sufficient, the power states people are saying cause this are disabled, mwait is disabled, AGESA is current, etc.
Comment 495 Aaron Muir Hamilton 2019-01-23 23:59:31 UTC
(In reply to Another User from comment #481)
> (In reply to Aaron Muir Hamilton from comment #480)
> > So my box was mostly idle for the last day or so, and had locked up while
> > the monitor was asleep, so I had to reset it.
> 
> Can you test your system with idle=halt parameter instead of idle=nomwait?

Now trying idle=halt, as you suggested. We'll see.
Comment 496 Łukasz Żarnowiecki 2019-01-24 14:06:11 UTC
I did power consumption tests with the idle=nomwait option on my Huawei Matebook D 14 (Ryzen 2500U) using a power meter.

With idle=nomwait, idle power consumption is about 4W less!!  I did a few reboots to be sure about it.
Comment 497 Borislav Petkov 2019-01-24 20:19:45 UTC
Folks, just a quick thing: please check whether you have the latest BIOS and if not, do upgrade it and check if it makes any difference.

Thx.
Comment 498 T X 2019-01-25 02:45:01 UTC
OS: Antergos Linux 4.20.3-arch1-1-ARCH #1 SMP PREEMPT Wed Jan 16 22:38:58 UTC 2019 x86_64 GNU/Linux.
CPU: AMD Ryzen 7 1700 Eight-Core Processor.
CPU flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl nonstop_tsc cpuid extd_apicid aperfmperf pni pclmulqdq monitor ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand lahf_lm cmp_legacy svm extapic cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw skinit wdt tce topoext perfctr_core perfctr_nb bpext perfctr_llc mwaitx cpb hw_pstate sme ssbd sev ibpb vmmcall fsgsbase bmi1 avx2 smep bmi2 rdseed adx smap clflushopt sha_ni xsaveopt xsavec xgetbv1 xsaves clzero irperf xsaveerptr arat npt lbrv svm_lock nrip_save tsc_scale vmcb_clean flushbyasid decodeassists pausefilter pfthreshold avic v_vmsave_vmload vgif overflow_recov succor smca
MB: ASRock AB350 Gaming-ITX/ac; Version: P4.60; Release Date: 04/19/2018
RAM: F4-2400C15D-16GFXR, 32 GB DDR4.

Attempt #1:
a) Set BIOS / Advanced / AMD CBS / Zen Common Options / Power Supply Idle Control to "Typical Current Idle".

Result: Hangs

Attempt #2:
a) Set /etc/default/grub to: GRUB_CMDLINE_LINUX_DEFAULT="quiet idle=nomwait resume=UUID=9d2d2002-406a-4eb5-bd21-fdb889831991"
b) Ran update-grub
c) Rebooted
d) BIOS remains at "Typical Current Idle"

Result: Hangs

Journal Log:

Jan 24 17:21:00 kernel: BUG: unable to handle kernel paging request at ffffffffffffffff
Jan 24 17:21:00 kernel: PGD 3a0a0e067 P4D 3a0a0e067 PUD 3a0a10067 PMD 0 
Jan 24 17:21:00 kernel: Oops: 0000 [#1] PREEMPT SMP NOPTI
Jan 24 17:21:00 kernel: CPU: 1 PID: 994 Comm: xfwm4 Not tainted 4.20.3-arch1-1-ARCH #1
Jan 24 17:21:00 kernel: Hardware name: To Be Filled By O.E.M. To Be Filled By O.E.M./AB350 Gaming-ITX/ac, BIOS P4.60 04/19/2018
Jan 24 17:55:01 kernel: rcu: INFO: rcu_preempt detected stalls on CPUs/tasks:
Jan 24 17:55:01 kernel: NMI watchdog: Watchdog detected hard LOCKUP on cpu 4
Jan 24 17:55:27 kernel: rcu: INFO: rcu_preempt detected expedited stalls on CPUs/tasks:
Jan 24 17:59:39 kernel: ACPI BIOS Warning (bug): Optional FADT field Pm2ControlBlock has valid Length but zero Address: 0x0000000000000000/0x1 (20181003/tbfadt-624)
Jan 24 17:59:39 kernel: [Firmware Bug]: ACPI MWAIT C-state 0x0 not supported by HW (0x0)

$ sudo cat /boot/grub/grub.cfg  | grep -i idle
	linux	/vmlinuz-linux root=UUID=16a8e892-ec00-4316-b62d-5b60da84a7fb rw  quiet idle=nomwait resume=UUID=9d2d2002-406a-4eb5-bd21-fdb889831991

Have not tried:

1) idle=halt
2) Disable C6 state with zenstates.py
3) BIOS upgrade
Comment 499 Borislav Petkov 2019-01-25 17:51:04 UTC
(In reply to T X from comment #498)
> Jan 24 17:59:39 kernel: [Firmware Bug]: ACPI MWAIT C-state 0x0 not supported
> by HW (0x0)

So this looks like a broken BIOS.

> Jan 24 17:21:00 kernel: Hardware name: To Be Filled By O.E.M. To Be
> Filled By O.E.M./AB350 Gaming-ITX/ac, BIOS P4.60 04/19/2018

Looking at Asrock's website, there's a newer BIOS for your board, AFAICT,
so can you update it and rerun the same test? Or redo whatever you were
doing to cause the splat.

Also, please upload full dmesg after you've updated the BIOS.

Thx.
Comment 500 Maxim Bakulin 2019-01-25 18:01:52 UTC
(In reply to Borislav Petkov from comment #497)
> Folks, just a quick thing: please check whether you have the latest BIOS and
> if not, do upgrade it and check if it makes any difference.
> 
> Thx.

Already running latest BIOS (2018/12/19 for ASRock B450 Pro4 and 2018/12/17 for ASUS X470 Pro).

I found out that openSUSE Tumbleweed with stock 4.20.0 kernel doesn't appear to freeze. I tried copying suse's kernel and /lib/modules to Xubuntu 18.04LTS, but it didn't seem to help, xubuntu still freezes. Maybe, xubuntu's firmware packages are old? But I could've done something wrong with the transplant procedure :) I'm going to test Xubuntu 18.10 during the weekends, maybe it has newer packages and does not freeze.
Comment 501 Borislav Petkov 2019-01-25 18:09:09 UTC
(In reply to Maxim Bakulin from comment #500)
> I found out that openSUSE Tumbleweed with stock 4.20.0 kernel doesn't appear
> to freeze. I tried copying suse's kernel and /lib/modules to Xubuntu
> 18.04LTS, but it didn't seem to help, xubuntu still freezes. Maybe,
> xubuntu's firmware packages are old? But I could've done something wrong
> with the transplant procedure :)

Yeah, I wouldn't do that.

What's stopping you from building a kernel on your machine and
installing it?

The web is full of tutorials like this one:

https://www.linux.com/learn/how-build-latest-linux-kernel-debian-linus-git-repo

for example, which is for debian-based systems.
Comment 502 T X 2019-01-25 18:18:04 UTC
Upgraded BIOS as follows:

1. https://www.asrock.com/mb/AMD/Fatal1ty%20AB350%20Gaming-ITXac/index.asp
2. Click Support > BIOS
3. Download Version 5.30 through Global site (ignore "AMD all in 1 VGA driver")
4. Formatted USB as FAT32
5. Copied and unzipped AB350 Gaming-ITXac(5.30)ROM.zip to USB
6. Flashed BIOS as per https://www.asrock.com/support/BIOSIG.asp?cat=BIOS8

$ dmidecode -t bios -q
  Version: P5.30
  Release Date: 12/18/2018

BIOS installed correctly.

$ journalctl -b

Jan 25 09:38:33 kernel: Command line: BOOT_IMAGE=/vmlinuz-linux root=UUID=16a8e892-ec00-4316-b62d-5b60da84a7fb rw quiet idle=nomwait resume=...
Jan 25 09:38:33 kernel: ACPI BIOS Warning (bug): Optional FADT field Pm2ControlBlock has valid Length but zero Address: 0x0000000000000000/0x1>
Jan 25 09:38:33 kernel: mtrr: your CPUs had inconsistent variable MTRR settings
Jan 25 09:38:33 kernel: ACPI: [Firmware Bug]: BIOS _OSI(Linux) query ignored
Jan 25 09:38:33 kernel: [Firmware Bug]: ACPI MWAIT C-state 0x0 not supported by HW (0x0)
Jan 25 09:38:33 kernel: sp5100-tco sp5100-tco: Watchdog hardware is disabled
Jan 25 09:38:33 kernel: ccp 0000:27:00.2: ccp enabled
Jan 25 09:38:33 kernel: ccp 0000:27:00.2: psp initialization failed
Jan 25 09:38:33 kernel: ccp 0000:27:00.2: enabled
Jan 25 09:38:38 kernel: kauditd_printk_skb: 23 callbacks suppressed

Since the BIOS was flashed, I assume that "Typical Current Idle" was not set.

Splat was caused by idling for ~20 minutes. Will try again with "Typical Current Idle" and, if it hangs again, will provide the dmesg log.

Additional hardware:

* Chassis: https://www.hdplex.com/hdplex-h5-fanless-computer-case.html
* PSU: https://www.hdplex.com/hdplex-internal-400w-ac-dc-adapter-with-active-pfc-and-19vdc-output.html and https://www.hdplex.com/hdplex-400w-hi-fi-dc-atx-power-supply-16v-24v-wide-range-voltage-input.html
Comment 503 Maxim Bakulin 2019-01-25 18:19:13 UTC
(In reply to Borislav Petkov from comment #501)
> What's stopping you from building a kernel on your machine and
> installing it?

I tried and it didn't help. After freezes started happening at Xubuntu 18.04LTS with stock 4.15, I compiled latest 4.20.3 kernel and added CONFIG_RCU_NOCB_CPU=y, as it was suggested here, but freezes didn't stop. There's my dmesg at comment 492 with latest kernel.
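
For reference, a sketch of how that option can be flipped in the kernel source tree before building (the rcu_nocbs=... boot parameter still has to be passed for it to have any effect):

  $ scripts/config --enable RCU_NOCB_CPU
  $ make olddefconfig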
Comment 504 Borislav Petkov 2019-01-25 18:35:59 UTC
(In reply to Maxim Bakulin from comment #503)
> I tried and it didn't help. After freezes started happening at Xubuntu
> 18.04LTS with stock 4.15, I compiled latest 4.20.3 kernel and added
> CONFIG_RCU_NOCB_CPU=y, as it was suggested here, but freezes didn't stop.
> There's my dmesg at comment 492 with latest kernel.

Lemme get this straight:

opensuse 4.20.0 kernel doesn't freeze the box but 4.20.3 stable kernel does?!
Comment 505 Borislav Petkov 2019-01-25 18:44:27 UTC
(In reply to T X from comment #502)
> Jan 25 09:38:33 kernel: ACPI: [Firmware Bug]: BIOS _OSI(Linux) query ignored
> Jan 25 09:38:33 kernel: [Firmware Bug]: ACPI MWAIT C-state 0x0 not supported
> by HW (0x0)

Yah, even with the new BIOS that's still there. Doesn't look like it has been
fixed.

If I send you a debugging patch, would you be able to apply it, build a
kernel, boot into it and get me dmesg from the box?

Thx.
Comment 506 Aaron Muir Hamilton 2019-01-25 19:40:03 UTC
I'm running with idle=halt, and I have not experienced a freeze for a couple days now, FWIW.
Comment 507 T X 2019-01-25 19:57:14 UTC
Using:

* BIOS Version: P5.30, Release Date: 12/18/2018
* idle=nomwait
* Power Supply Idle Control set to "Typical Current Idle"

The ~20-minute idle lock-up issue appears fixed, despite:

Jan 25 10:21:39 kernel: [Firmware Bug]: ACPI MWAIT C-state 0x0 not support...

If it locks up overnight, I'll report back with the dmesg.
Comment 508 Borislav Petkov 2019-01-25 20:00:26 UTC
Can people confirm that idle=halt fixes the issue, like Aaron says in comment #506?
Comment 509 T X 2019-01-25 20:08:44 UTC
> If I send you a debugging patch, would you be able to apply it, build a
kernel, boot into it and get me dmesg from the box?

I haven't built a kernel in about a decade, but if you provide step-by-step instructions for how to do so, I'll attempt to build the patched kernel and then post the dmesg.

$ uname -a
Linux 4.20.3-arch1-1-ARCH #1 SMP PREEMPT Wed Jan 16 22:38:58 UTC 2019 x86_64 GNU/Linux

$ cat /etc/issue
Antergos Linux
Comment 510 Sami Farin 2019-01-25 20:37:33 UTC
(In reply to Maxim Bakulin from comment #492)
> Created attachment 280669 [details]
> dmesg of freeze with 4.20.3 kernel and nomwait, rcu_nocbs, max_cstate applied
> 
> some older info here:
> https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1690085/comments/566
> 
> I have three machines with new 2700x CPUs, and all three of them experience
> freezes in xubuntu 18.04 after some time of work. I use compiling QEMU with
> make -j16 in a loop to test for stability.
> 
> I'm not sure it is the same bug, because I observe different behaviour: one
> machine, that was compiling QEMU, froze during the night, and the one left
> idle worked for 1 day. Another thing that SEEM to help is disabling SMT:

The Processor errata lists two bugs (SMT-related) for Ryzen 1 and Ryzen 2 (1095 and 1109) with status "no fix planned".  If you disable MWAIT but enable SMT, you are left with bug 1095: "Potential Violation of Read Ordering In Lock Operation In SMT (Simultaneous Multithreading) Mode".  This can cause crashes. Not necessarily the cause of your crashes :-D

# lsmsr -r 0xc0011020
warning: unknown MSR c0011020
unknown = 0x0006800000000010

On my Ryzen 1600X bit 57 (no idea what it does) is 0.  (But I have nosmt=force.)  Linux kernel doesn't seem to touch that bit.

Also, if you get "ACPI MWAIT C-state 0x0 not supported by HW (0x0)", MWAIT is not used by the kernel.
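
For anyone without lsmsr: the same register can be read per-CPU with rdmsr from msr-tools - a sketch, assuming the msr module is available:

  # modprobe msr
  # rdmsr -p 0 0xc0011020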
Comment 511 Borislav Petkov 2019-01-28 10:53:46 UTC
(In reply to T X from comment #509)
> I haven't built a kernel in about a decade, but if you provide step-by-step
> instructions for how to do so, I'll attempt to build the patched kernel and
> then post the dmesg.

Ok, here we go:

1. Clone stable kernel from here if you haven't done so yet:

$ git clone git://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git

2. checkout 4.20 branch

$ git checkout -b 4.20-stable origin/linux-4.20.y

3. apply the attached patch

$ patch -p1 -i acpi-dump-cstates.diff

4. copy your 4.20 config

$ cp /boot/config-4.20.... .config

5. do

$ make oldconfig

6. build

$ make -j16

7. install kernel, needs root

# make modules_install install

# reboot

Then, select this new kernel in grub, boot it and upload full dmesg.

Thx.
Comment 512 Borislav Petkov 2019-01-28 10:54:30 UTC
Created attachment 280825 [details]
acpi-dump-cstates.diff
Comment 513 T X 2019-01-29 04:29:47 UTC
(In reply to Borislav Petkov from comment #511)
> Ok, here we go:
> 1. Clone stable kernel from here if you haven't done so yet:
> $ git clone git://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git
> 2. checkout 4.20 branch
> $ git checkout -b 4.20-stable origin/linux-4.20.y
> 3. apply the attached patch
> $ patch -p1 -i acpi-dump-cstates.diff
> 4. copy your 4.20 config
> $ cp /boot/config-4.20.... .config

    $ find /boot | grep -i config
    /boot/grub/i386-pc/configfile.mod

I don't think that that's the right file.

    $ sudo updatedb
    $ locate -i config-4
    $ uname -a
    Linux 4.20.4-arch1-1-ARCH #1 SMP PREEMPT Wed Jan 23 00:12:22 UTC 2019 x86_64 GNU/Linux

Where is the config file?
Comment 514 Celti Burroughs 2019-01-29 08:40:16 UTC
(In reply to T X from comment #513)
>     $ sudo updatedb
>     $ locate -i config-4
>     $ uname -a
>     Linux 4.20.4-arch1-1-ARCH #1 SMP PREEMPT Wed Jan 23 00:12:22 UTC 2019
> x86_64 GNU/Linux
> 
> Where is the config file?

Arch Linux doesn't package the kernel's config with the kernel. For the current 4.20-arch1 config, see: https://git.archlinux.org/svntogit/packages.git/plain/trunk/config?h=packages/linux — or just use /proc/config.gz.
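
For example, to reuse the running kernel's configuration (assuming it was built with CONFIG_IKCONFIG_PROC, as Arch's is):

    $ zcat /proc/config.gz > .config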
Comment 516 T X 2019-01-30 07:07:41 UTC
Created attachment 280867 [details]
dmesg log after rebooting with acpi-dump-cstates patch
Comment 517 T X 2019-01-30 07:08:25 UTC
(In reply to Borislav Petkov from comment #511)
> 7. install kernel, needs root
> # make modules_install install

    $ sudo make modules_install install
      ...
      INSTALL virt/lib/irqbypass.ko
      DEPMOD  4.20.5-dirty
    sh ./arch/x86/boot/install.sh 4.20.5-dirty arch/x86/boot/bzImage \
        System.map "/boot"
    Cannot find LILO.

That's a misleading error message as LILO isn't installed on the system, but the new kernel was copied:

    $ ll /boot
    -rw-r--r--  1 root root  5851008 Jan 29 22:40 vmlinuz

GRUB seems to have loaded it fine.

See attached dmesg-acpi-dump-cstates.txt for details.
Comment 518 Philip Rosvall 2019-02-01 17:37:34 UTC
I can confirm that "idle=halt" makes both my Ryzen desktop and Ryzen notebook more stable than anything else I've tried. I have not experienced freezes for some days now.
Comment 519 Borislav Petkov 2019-02-01 17:52:35 UTC
(In reply to Philip Rosvall from comment #518)
> I can confirm that "idle=halt" makes both my Ryzen desktop and Ryzen
> notebook more stable than anything else I've tried. I have not experienced
> freezes for some days now.

Can you upload dmesg from the working and non-working kernels pls?

Thx.
Comment 520 Philip Rosvall 2019-02-01 18:53:38 UTC
Created attachment 280915 [details]
dmesg w/o boot parameter idle=halt
Comment 521 Philip Rosvall 2019-02-01 18:54:33 UTC
Created attachment 280917 [details]
dmesg with boot parameter idle=halt
Comment 522 Philip Rosvall 2019-02-01 19:02:19 UTC
(In reply to Borislav Petkov from comment #519)
> (In reply to Philip Rosvall from comment #518)
> > I can confirm that "idle=halt" makes both my Ryzen desktop and Ryzen
> > notebook more stable than anything else I've tried. I have not experienced
> > freezes for some days now.
> 
> Can you upload dmesg from the working and non-working kernels pls?
> 
> Thx.

I have only used the workaround since kernel 4.20.4, so I just booted the kernel now without and with the parameter. As can be seen in line 704 in the log without "idle=halt", we see the error "[Firmware Bug]: ACPI MWAIT C-state 0x0 not supported by HW (0x0)". This error does not occur when "idle=halt" is set.
Comment 523 Borislav Petkov 2019-02-02 00:35:40 UTC
(In reply to Philip Rosvall from comment #522)
> I have only used the workaround since kernel 4.20.4, so I just booted the
> kernel now without and with the parameter. As can be seen in line 704 in the
> log without "idle=halt", we see the error "[Firmware Bug]: ACPI MWAIT
> C-state 0x0 not supported by HW (0x0)". This error does not occur when
> "idle=halt" is set.

Yeah, this is starting to look more and more like a BIOS issue on
certain mobos.

Since your machine is issuing that ACPI MWAIT error, can you apply the
debugging patch from comment #512, build a kernel, boot into it
without "idle=halt" and upload dmesg again?

Thx.
Comment 524 Philip Rosvall 2019-02-02 02:39:38 UTC
Created attachment 280921 [details]
dmesg w/o boot parameter idle=halt, patched acpi-dump-cstates
Comment 525 Philip Rosvall 2019-02-02 02:40:53 UTC
(In reply to Borislav Petkov from comment #523)
> (In reply to Philip Rosvall from comment #522)
> > I have only used the workaround since kernel 4.20.4, so I just booted the
> > kernel now without and with the parameter. As can be seen in line 704 in
> the
> > log without "idle=halt", we see the error "[Firmware Bug]: ACPI MWAIT
> > C-state 0x0 not supported by HW (0x0)". This error does not occur when
> > "idle=halt" is set.
> 
> Yeah, this is starting to look more and more like a BIOS issue on
> certain mobos.
> 
> Since your machine is issuing that ACPI MWAIT error, can you apply the
> debugging patch from comment #512, build a kernel, boot into it
> without "idle=halt" and upload dmesg again?
> 
> Thx.

Here you go!
Comment 526 Borislav Petkov 2019-02-05 05:48:57 UTC
Created attachment 280961 [details]
Don't do mwait on B1 and earlier
Comment 527 Borislav Petkov 2019-02-05 05:57:50 UTC
Ok, here's a test patch ontop of 4.20-stable.

It should practically make idle=halt the default on revisions before B2. You check which revision you have by doing

$ grep stepping /proc/cpuinfo

The number must be < 2.

For the folks with B2 machines we need to keep debugging.

Thx.
Comment 528 Klaus Mueller 2019-02-05 17:06:48 UTC
(In reply to Borislav Petkov from comment #526)
> Created attachment 280961 [details]
> Don't do mwait on B1 and earlier

What's the downside of generally disabling mwait? I'm using a Ryzen 7 1700X and don't have any problem ("fixed" it w/ 200 MHz of CPU Overclocking). The only thing I see, is, that the ACPI firmware bug messages disappear. That's all.
Comment 529 Borislav Petkov 2019-02-05 21:02:00 UTC
(In reply to Klaus Mueller from comment #528)
> What's the downside of generally disabling mwait?

So in your case, you can't do MWAIT to enter C1 anyway because your
revision doesn't support it. This is why you're seeing those firmware
messages.

> I'm using a Ryzen 7 1700X and don't have any problem ("fixed" it w/
> 200 MHz of CPU Overclocking). The only thing I see, is, that the ACPI
> firmware bug messages disappear. That's all.

That's an "interesting" way to fix it but if it works ... :)

The intention of the fix is to make idle=halt the default for obvious reasons.

HTH.
Comment 530 Klaus Mueller 2019-02-05 21:24:44 UTC
(In reply to Borislav Petkov from comment #529)
> (In reply to Klaus Mueller from comment #528)
> > What's the downside of generally disabling mwait?
> 
> So in your case, you can't do MWAIT to enter C1 anyway because your
> revision doesn't support it. This is why you're seeing those firmware
> messages.

Hmm, if rev. 1 doesn't support MWAIT - how can it be a problem at the same time, one that must be fixed by disabling the use of MWAIT? Am I missing something?

> 
> > I'm using a Ryzen 7 1700X and don't have any problem ("fixed" it w/
> > 200 MHz of CPU Overclocking). The only thing I see, is, that the ACPI
> > firmware bug messages disappear. That's all.
> 
> That's an "interesting" way to fix it but if it works ... :)
> 
> The intention of the fix is to make idle=halt the default for obvious
> reasons.

Isn't this already done automatically if MWAIT isn't supported?
Comment 531 Borislav Petkov 2019-02-05 21:49:16 UTC
(In reply to Klaus Mueller from comment #530)
> Hmm, if rev. 1 doesn't support MWAIT - how can it be a problem at the
> same time, one that must be fixed by disabling the use of MWAIT? Am I
> missing something?

That's a good question but, frankly, I don't have a very exact answer to
it right now.

In order to understand what's *really* going on in the
cstate detection code one would need to instrument
at least acpi_processor_get_power_info_cst() and
acpi_processor_ffh_cstate_probe() to figure out what exactly does the
kernel parse from those CST objects and what it uses to try to enter
idle.

And do all that instrumentation on an affected system.

My current suspicion is that it tries to enter idle with
misconfigured states and under certain conditions, it misses the wakeup,
leading to the stall.

This is all a conjecture anyway.

Now my patch simply falls back to the good old idle entry on AMD where
we simply do HLT and we won't even attempt to enter idle the ACPI way.

Makes sense?
Comment 532 Maxim Bakulin 2019-03-01 16:24:51 UTC
Created attachment 281445 [details]
Testcase in docker

I have created a Dockerfile and a run script that allow me to reproduce the issue. It usually still takes quite a lot of time: this container freezes the system within 30 minutes to 5-6 hours. I've been able to reproduce it with Docker on Windows 10, Xubuntu 18.04, Xubuntu 18.10, openSUSE Tumbleweed and CentOS 7.

I know that Docker can have bugs too, and reproducing in Docker is not exactly reproducing natively, but since these freezes also happen on 4 of my machines on native 18.04 and in VirtualBox-hosted 18.04 as well, it's possible that the bug is not in Docker or VirtualBox.
Comment 533 OptionalRealName 2019-03-02 09:33:18 UTC
This bug was filed on Wednesday, August 16, 2017. As of today (Saturday, March 2, 2019), that is 1 year 6 months 17 days ago - 563 days.

Embarrassing AMD, just a disgrace.
Comment 534 CodingEagle02 2019-03-10 00:02:01 UTC
Does anyone know whether the kernel version 5.0 solves the issue? From the changelog, it seems to me like it might.
Comment 535 ison 2019-03-10 10:23:39 UTC
I have tested it on 5.0 and still experienced the soft lockups just as frequently as with 4.20.
However, I am currently testing kernel 5.0 with PAGE_TABLE_ISOLATION disabled and it seems that it may be a fix.

Although maybe it's a bit premature to say that, since my experience with this issue (on 2700X) has been extremely inconsistent. I get the soft lockups, but in addition to the lockups (and usually preceding them) it seems like my whole system becomes unstable. Applications start segfaulting, or turning into zombies that can't be killed, then the lockup happens if I just let it go.
What's strange is that once this sort of "instability" occurs it seems to stay unstable even across reboots, and I can only have my machine running 20-30min before things start segfaulting or locking up again. Usually it takes me about 5 or 6 restarts before it just seems to hit some sweet spot and become "stable". Once it's stable it stays stable for days, even if I restart the machine.

So based on the above description I can see that narrowing down the problem could be very difficult since something might seem to be working for a while, until it doesn't.

At any rate, the solution I'm currently testing was proposed in this Gentoo thread:
https://forums.gentoo.org/viewtopic-t-1074860-start-0.html
which I stumbled upon after receiving a kernel panic and noticing this error
>Unexpected reschedule of offline CPU#0!
From what I can tell "offline CPU" sounds very relevant to our issue.

Their solution was to disable PAGE_TABLE_ISOLATION in the kernel. This also intrigues me, as I remember seeing other errors in dmesg relating to "page" writing.
Apparently that kernel option isn't even necessary for AMD CPUs anyway. It was meant to mitigate a vulnerability (Meltdown) in Intel CPUs, so it can be safely disabled.
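
(For reference, PTI can also be switched off at boot with the pti=off or nopti kernel parameter instead of rebuilding, and whether it is active can be checked with, for example:

  $ dmesg | grep -i isolation
  $ cat /sys/devices/system/cpu/vulnerabilities/meltdown

On AMD parts the latter should normally report "Not affected", i.e. PTI is not auto-enabled there in the first place.)
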

NOTE: I am testing this with some of the BIOS modifications recommended by others here (such as disabling c6 state and "typical current idle", etc..)
If this solution works for another week or so I'll try restoring the BIOS settings to their defaults and testing again.
Comment 536 CodingEagle02 2019-03-10 12:07:46 UTC
Well, it's a pity to know 5.0 might not fix it. I was really hoping it would. On the other hand, it's good to know so many other people have this issue, which means I'm not alone in hoping it'll be at some point fixed.

I'll just throw my two cents in though, to hopefully help.

Firstly, technical info. My computer is (I'm pretty sure) a HP 15-DB0125, with Ryzen 5 2500U and Vega 8 graphics. I've applied a bunch of fixed I've found online (mainly the Zen script and grub options), and while I still get freezes, they happen much more rarely, particularly if I shut down my computer when I'm not using it and after heavy activities. I haven't tried idle=halt yet, though

Anyhow. To the part I don't think has been mentioned.


I've noticed games running on Wine/Proton make my computer very much more likely to freeze. Neverwinter (which should be the lightest, I assume), for example, will crash after a few minutes. Warframe will crash after maybe half an hour. Counterstrike: Global Offensive? That game, which runs natively, can be played for a comfortable amount of time (maybe 1-2+ hours?) without risking any freezes. Which is a shame, really, because I'm being locked out of playing Wine-enabled games until Linux and Ryzen fix their dysfunctional relationship.
Comment 537 CodingEagle02 2019-03-10 12:10:45 UTC
fixes*

To be clear, I'm not sure whether CS:GO does freeze my computer after a given amount of time. I don't think I've tested it thoroughly enough to be sure. I do know I can play it comfortably without worrying.
Comment 538 Simon 2019-03-10 12:32:43 UTC
Hi, 
these comments (536 and 535) sound to me like the instability is not
coming from idle, because of:
"... Applications start segfaulting, ..."
"running wine/... (with games)"
For segfaults you have to get the processor exchanged. Search the internet
for the "kill ryzen script", run it and see what happens.
If it segfaults, AMD will normally exchange your processor for one
that does not segfault.

That was the first thing I did to get all my Ryzen
systems rock stable: all processors were exchanged by AMD support (some twice)
(my (small) hardware computer store handled this for me).
After you get a new processor, check again!!!!
Also (it happened for me after BIOS updates that brought the "Typical
Current Idle" setting), check whether under heavy load (I used the
kill ryzen script again) 
you get:
kernel: pcieport 0000:00:03.1: AER: Multiple Corrected error received: id=0000
kernel: pcieport 0000:00:03.1: PCIe Bus Error: severity=Corrected, type=Data Link Layer, id=0019(Transmitter ID)
(or something similar) 
For this, use the "pcie_aspm=off" option on the grub/kernel command line.

What I'm saying is: before you are sure that the "kill ryzen script"
runs fine, do not search for other things - these processors have
bugs. FIRST get the processor exchanged with AMD. Do not trust the production
date of the processor; I have had processors which, according to all the info
I found on the internet, should have been OK, but were NOT OK.

As I understand it, this random soft lockup bug report is about soft lockups
when idle, or when waking up from idle.

simoN
Comment 540 onox 2019-03-10 19:16:55 UTC
Hi ! I haven’t read all the comments here because there are too many by now, but I’m puzzled that many consider this to be a Linux-only issue, because I can tell you that it definitely happens on Windows (10) too, and I found other people with the same freeze/reboot problem on Windows. For me the crashes are really the same on both Linux and Windows. I have a Ryzen 7 1700X and an Asrock X370 Taichi mainboard, and the "Power Supply Idle Control" > "Typical Current Idle" setting fixes the problem. However, lately I’ve noticed a couple of times, but still rarely, that the computer would crash and reboot when waking up from sleep, then it seems to enter a boot loop for a few times, and then starts normally but it resets the BIOS settings… I haven’t overclocked anything. My CPU also seems to have the "Ryzen segfault bug", which seems to be a different issue from what I’ve understood. 

I should probably RMA the CPU (and the mainboard?), but I haven't because disassembling everything would be difficult for me. In my case those crashes (without "Typical Current Idle") happen fairly rarely, like every two to three weeks, or even more rarely than that.
Comment 541 Philip Rosvall 2019-03-11 14:56:25 UTC
Try idle=halt! It is the workaround that fixes the random freezes! You don't need anything else (at least not with Ryzen 1*** and 2***U processors, as it works wonders on my desktop with a 1600 and my notebook with a 2700U).

IDLE=HALT IS THE WORKAROUND THAT FIXES THIS BULLSHIT!
I REPEAT:
IDLE=HALT!

It will probably work for most of you too!
Comment 542 alfie 2019-03-11 16:27:26 UTC
Here (Asus CrossHair VI, 1600, gentoo with any kernel around), no suggested workaround works. The system freezes when in very light use and I was never able to obtain a single line of debug via serial (I put a small linux mini itx box near this computer just to receive the serial log).

It seems like after a workaround is applied (BIOS settings, idle=halt, rcu_nocbs=0-11 and so on), the BAI (Bug AI) finds a way to freeze the system again and the dances restart.

Ryzen is not a CPU to run Linux on.
Comment 543 Philip Rosvall 2019-03-11 17:50:03 UTC
(In reply to alfie from comment #542)
> Here (Asus CrossHair VI, 1600, gentoo with any kernel around), no suggested
> workaround works. The system freezes when in very light use and I was never
> able to obtain a single line of debug via serial (I put a small linux mini
> itx box near this computer just to receive the serial log).
> 
> It seems like after a workaround is applied (BIOS settings, idle=halt,
> rcu_nocbs=0-11 and so on), the BAI (Bug AI) finds a way to freeze the system
> again and the dances restart.
> 
> Ryzen is not a CPU to run Linux on.

Strange, since idle=halt makes both my machines rock stable. Do you have AMD-V enabled? If so, iommu=pt fixes some problems that can occur with that.
Comment 544 ison 2019-03-12 03:34:21 UTC
(In reply to Philip Rosvall from comment #541)
> Try idle=halt! It is the workaround that fixes the random freezes! You don't
> need anything else (at least not with Ryzen 1*** and 2***U processors, as it
> works wonders on my desktop with a 1600 and my notebook with a 2700U).
> 
> IDLE=HALT IS THE WORKAROUND THAT FIXES THIS BULLSHIT!
> I REPEAT:
> IDLE=HALT!
> 
> It will probably work for most of you too!

In light of the last few comments here I can say that I may have been confusing 2 separate issues: The random segfaults and the lockups.
I tested idle=halt a while back and I may have inadvertently concluded that it didn't work due to my machine still receiving segfaults, which I was assuming would eventually lead to a lockup.

Likewise, the solution I mentioned earlier (disabling PAGE_TABLE_ISOLATION) seems to have fixed my crashes, but my machine just locked up.

I've set idle=halt again and this time I'll try to keep the two issues separate and only post back here if I still receive lockups.
Thanks for clarifying the issue.
Comment 545 CodingEagle02 2019-03-12 04:23:04 UTC
Hey out of curiosity, since I can't seem to find a clear answer, what exactly does idle=halt do?
Comment 546 Trevor Cordes 2019-03-12 06:16:23 UTC
(In reply to alfie from comment #542)
> 
> Ryzen is not a CPU to run Linux on.

YMMV but our Ryzen 2600 has been 100% rock solid stable (on 24/7) since my last comment #493 Jan 23, 2019.  All we did was the mwait and idle:typical bios tweaks.  That has worked for many/most people hitting this bug, as proven by the fact many people comment, say thanks for the tweaks, and never come back.

I strongly suspect that anyone having problems after doing the mwait and idle:typical tweaks (and maybe rcu_nocbs too) is having a separate problem.  Certainly the segfault people are hitting a different bug which should not get conflated with this bug thread.  Our 2600 never exhibited the segfault problem (thankfully!): it seems to be more of a 1xxx issue.
Comment 547 Dennis Schridde 2019-03-12 09:13:34 UTC
I was severely affected a year ago.  My system would not run through one rebuild of Gentoo, i.e. it would lock up / hard freeze after only a few hours.  I suspect the heavy disk corruption I experienced every time it happened was a direct result of this, and not caused by unrelated hardware defects.  The situation was dire and at the time I had to set "typical power idle" in the firmware in order to make the problem go away and the system usable.

Things seem to have improved a lot since then.  I now have default firmware settings (i.e. I have NOT set "typical power idle"), I have default Linux idle settings (i.e. I have NOT set idle=halt, idle=mwait or similar), and am I not using any other counter measure anymore.  The system is as stable as it gets: No lockups and no unexplainable segfaults.  This is since before Linux 5.0 and I strongly suspect the recent firmware update to version 4207 to have fixed the issue.

CPU: AMD Ryzen 5 2400G
Motherboard: Asus ROG STRIX B350-F Gaming
Firmware version: 4207
Linux: 5.0.1 (with Gentoo patches)
Comment 548 Tolga Cakir 2019-03-12 09:25:08 UTC
AGESA 0070 has been released for some mainboards (e.g. ASUS A320M-K, ASUS X470 Pro and a couple of MSI boards). For anyone experiencing crashes, I think 0070 is worth a try. The official changelog says something along the lines of "added support for new processors", but more changes under the hood aren't unlikely.

Regarding the segfault issue: it has already been confirmed by AMD to be a hardware issue on early batches of Ryzen 1xxx CPUs and you can RMA affected CPUs under warranty. I don't have the landing page at hand right now, but I'd advise anyone with the segfault bug to replace their CPUs; especially those experiencing crashes under Windows.

Like Trevor, I also think we have inconsistencies, because we're most likely dealing with multiple bugs. From what I've seen so far, Vega users have a different form of crash caused by Vega (including Ryzen Mobile, Vega 56/64 dGPU, 2200G/2400G), users experiencing crashes under both Windows and Linux have a different form of instability issue, crashes caused by segfaults are another category, and issues fixed by idle=halt / "Typical Load Idle" are yet another category.

From my own tests, I fall into the category fixed by "Typical Load Idle". Later today, I will update to AGESA 0070, test against latest kernel with various settings and report back.
Comment 549 Philip Rosvall 2019-03-12 14:55:49 UTC
(In reply to CodingEagle02 from comment #545)
> Hey out of curiosity, since I can't seem to find a clear answer, what
> exactly does idle=halt do?

idle=halt disables ACPI MWAIT completely, and uses HLT instead of the various C-states for idle.
Without idle=halt, both my machines issue an error upon boot ([Firmware Bug]: ACPI MWAIT C-state 0x0 not supported by HW (0x0)), which disappears when idle=halt is set.
Borislav Petkov has a suspicion about what happens (comment 531):
"In order to understand what's *really* going on in the
cstate detection code one would need to instrument
at least acpi_processor_get_power_info_cst() and
acpi_processor_ffh_cstate_probe() to figure out what exactly does the
kernel parse from those CST objects and what it uses to try to enter
idle.

And do all that instrumentation on an affected system.

My current suspicion is that it tries to enter idle with
misconfigured states and under certain conditions, it misses the wakeup,
leading to the stall."


I have successfully used idle=halt for over a month now, without a single lockup!
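
A quick way to see what the kernel is actually using for idle (a sketch; paths assume cpuidle sysfs support):

  $ cat /sys/devices/system/cpu/cpuidle/current_driver
  $ grep . /sys/devices/system/cpu/cpu0/cpuidle/state*/name

The set of states listed should differ between boots with and without idle=halt.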
Comment 550 Philip Rosvall 2019-03-12 15:05:17 UTC
(In reply to ison from comment #544)
> (In reply to Philip Rosvall from comment #541)
> > Try idle=halt! It is the workaround that fixes the random freezes! You
> don't
> > need anything else (at least not with Ryzen 1*** and 2***U processors, as
> it
> > works wonders on my desktop with a 1600 and my notebook with a 2700U).
> > 
> > IDLE=HALT IS THE WORKAROUND THAT FIXES THIS BULLSHIT!
> > I REPEAT:
> > IDLE=HALT!
> > 
> > It will probably work for most of you too!
> 
> In light of the last few comments here I can say that I may have been
> confusing 2 separate issues: The random segfaults and the lockups.
> I tested idle=halt a while back and I may have inadvertently concluded that
> it didn't work due to my machine still receiving segfaults, which I was
> assuming would eventually lead to a lockup.
> 
> Likewise, the solution I mentioned earlier (disabling PAGE_TABLE_ISOLATION)
> seems to have fixed my crashes, but my machine just locked up.
> 
> I've set idle=halt again and this time I'll try to keep the two issues
> separate and only post back here if I still receive lockups.
> Thanks for clarifying the issue.

It seems like separate issues. The lockups seem, to me, to be related to MWAIT and misconfigured C-states. Have you taken a look at your dmesg after a boot without idle=halt, to see if you get the error "[Firmware Bug]: ACPI MWAIT C-state 0x0 not supported by HW (0x0)"?
Comment 551 alfie 2019-03-12 15:17:41 UTC
Kernel 4.19.27 with rcu_nocbs=0-11 idle=halt nopti: yesterday at around 11:30 p.m. I had a "usual" total lockup of the system during light usage (I just had an ssh session open). With idle=halt, I get no "ACPI MWAIT C-state" bug line in dmesg.

The problem with this freeze issue is that it is difficult to replicate. The segfault bug was very easy: 3-4 Mesa compilations triggered it for sure. This bug, instead, can hide for days and days and then reappear all of a sudden.
Comment 552 eric.c.morgan 2019-03-12 15:25:33 UTC
(In reply to alfie from comment #551)
> Kernel 4.19.27 with rcu_nocbs=0-11 idle=halt nopti: yesterday at around
> 11:30 p.m. I had a "usual" total lockup of the system during light usage (I
> just had an ssh session open). With idle=halt, I get no "ACPI MWAIT C-state"
> bug line in dmesg.
> 
> The problem with this freeze issue is that it is difficult to replicate. The
> segfault bug was very easy: 3-4 Mesa compilations triggered it for sure.
> This bug, instead, can hide for days and days and then reappear all of a
> sudden.

The segfault compilation issue is well known. I had the same with my first 1700. AMD will RMA. My replacement chip has been fine.

As for the other idle lockup issues, a combination of a custom RCU kernel and the python script to disable C6 has kept my machine up for months.
Comment 553 Maxim Bakulin 2019-03-12 18:07:27 UTC
As I have said previously, I'm not sure that my problem was the same that others have (I didn't have freezes on idle, but rather freezes when compiling), but what helped me is updating BIOS to latest version. ASUS has released a new BIOS with AGESA 0070 for X470 Pro, and I can't trigger the bug anymore. As for ASRock, they will probably release update in a week or so, all vendors usually release new AGESA within a moth, as far as I remember.
Comment 554 Tommy Vercetti 2019-03-14 01:39:13 UTC
Same here, I'm using opensuse Leap 15 and Ubuntu 18.10, both 4.12 and 4.18 kernel suffer the same issue.
I'm using AMD Ryzen 5 1600 with Asrock Fatal1ty AB350 Gaming-ITX/ac board.
I'll try updating the latest bios to see if things got fixed.
Comment 555 ison 2019-03-15 21:52:37 UTC
I am still receiving lockups even with the latest BIOS, all the BIOS setting recommendations here, and with the idle=halt kernel parameter.

Although I am now fairly certain that I have a faulty motherboard (x470) which seems to be getting worse, and sometimes doesn't even get to the BIOS at all (I should have suspected something was up when I first bought it and the PS/2 port didn't work).
I'm going to try replacing it and hopefully that'll help.
Comment 556 ZaNKuRO 2019-03-16 00:11:51 UTC
Hi guys. I have an Asus ROG Strix notebook with a Ryzen 7 1700. My CPU was manufactured in week 46 of 2017 (UA 1746PGS), so it does not have the segfault problem. My notebook completely freezes when it is idle - not always, but it is more common when freshly powered on (cool) or after heavy activity. The freezes never occurred during more intense use, only when idle. Like all notebook BIOSes, mine is quite limited and does not have a "Typical current idle" or similar option. I tried to recompile the kernel with the "RCU_NOCB_CPU" option and also tried all possible combinations of the parameters "rcu_nocbs", "iommu=soft/pt", "idle=nomwait/halt", "pti=off" / "nopti", etc, etc, etc. The only alternative that actually solved the freeze problem was to use the "zenstates.py" program to disable the C6 state, both core and package. But this workaround makes my CPU run about 5°C warmer. I used Windows on this same machine for weeks without any crashes. This makes me conclude the following: it does not crash on Windows (OK, so it's a software problem, "Linux"); Intel CPUs do not hang on Linux (oh no, OK, so it's a hardware problem, "Ryzen"). I do not know what AMD thinks about this, but I was forced to change my notebook. Now I have one with an Intel processor. Thank you, AMD.
Comment 557 Tommy Vercetti 2019-03-16 17:27:03 UTC
(In reply to Tommy Vercetti from comment #554)
> Same here, I'm using opensuse Leap 15 and Ubuntu 18.10, both 4.12 and 4.18
> kernel suffer the same issue.
> I'm using AMD Ryzen 5 1600 with Asrock Fatal1ty AB350 Gaming-ITX/ac board.
> I'll try updating the latest bios to see if things got fixed.

I updated the BIOS to P5.30; it has only locked up once so far.
How do I turn on logging for this?
Comment 558 Another User 2019-03-17 15:02:37 UTC
I assume there are several (at least two) similar problems that cause spontaneous system hangs.

One of them is the mwait bug listed in the AMD errata. idle=halt looks like a partial workaround for this. But, as mentioned on the AMD community forum, a guest OS in a virtual machine may still execute the mwait instruction and provoke the bug. Luckily, mwait is not common in user-space applications (though I'm still not sure about that).

The other is a power supply problem. This may be caused by an unsupported PSU (no 0A minimum load on 12V) or an unsuitable power subsystem on the motherboard. It may be partially (again!) solved by the BIOS "typical current idle" option or by disabling C6 states with the .py script.
IMO this BIOS option may not be implemented properly in some motherboards' firmware. It looks like it only tells the OS not to use the C-states, but does not prevent deep sleep at the hardware level. In that case we get "[Firmware Bug]: ACPI MWAIT C-state 0x0 not supported by HW (0x0)" in dmesg and the problem persists.

I have read some forums with assurances of completely stable Ryzen systems running under Linux without any tweaks, so not that many systems are affected by the mwait and PSU lockups. I think some combination of factors may provoke this behavior: motherboard+PSU, memory latency (why not?) or some vendor-provided BIOS configuration.

Also, in some cases these two problems may appear together.
Obviously, segfaults and amdgpu crashes are not related to this bug.

PS: Sorry for my bad English.
Comment 559 alfie 2019-03-17 15:48:30 UTC
The PSU problem sounds like a weak excuse from AMD, or maybe they are talking about very, very bad 10-dollar units.

The BIOS option "typical/low current" doesn't seem to change the C-states set up by the kernel routines here. With any selection, I always end up with
c0 POLL
c1 ACPI HLT
c2 ACPI IOPORT 0x414

as you can see with cpupower idle-info or /usr/src/linux/tools/power/x86/turbostat/turbostat -n1.

That happens because the firmware has already informed the kernel that mwait/monitor cannot be used for C-states.

If idle=halt is used, there is no C-state management and the hlt instruction is used to put a core into idle mode.

I think it is the hyper-threading handling that is somehow bugged, somewhere.

Also note that many Ryzen users overclock their machines, and if you overclock, you have no C-state problems at all.
Comment 560 Another User 2019-03-18 08:41:49 UTC
So if you get this message with the "typical current idle" option:
[Firmware Bug]: ACPI MWAIT C-state 0x0 not supported by HW (0x0)
it means the probe of this C-state failed and it must be avoided by the kernel.
Is there any difference between the C-state list in the OS with and without this option? I only have a laptop with a very limited BIOS and can't check.

And only C0/C1/C2 are listed there. What about C6? What does zenstates.py say about that? Frankly, I have not tried the script - I don't want to disable power-saving features on a laptop.

Also, some overclocking may solve this problem (see comment 103). Turn on "performance bias" or a similar option in the BIOS for testing.
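
(For anyone who wants to compare the C-state list with and without the BIOS option, a quick sketch - the cpuidle sysfs entries show what the kernel actually registered; state numbering differs between machines:)

# names/descriptions of the idle states the kernel registered for CPU0
grep . /sys/devices/system/cpu/cpu0/cpuidle/state*/name
grep . /sys/devices/system/cpu/cpu0/cpuidle/state*/desc
# or, if the cpupower tool is installed:
cpupower idle-info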
Comment 561 alfie 2019-03-20 18:32:58 UTC
This is pretty strange, could someone explain it to me? Wasn't mwait bugged in ryzen?

Tired of the random freezes, I tried this:

--- ./arch/x86/kernel/acpi/cstate.c.orig     2018-10-22 08:37:37.000000000 +0200
+++ ./arch/x86/kernel/acpi/cstate.c        2019-03-20 19:26:45.261101857 +0100
@@ -86,6 +86,7 @@ static long acpi_processor_ffh_cstate_pr
        num_cstate_subtype = edx_part & MWAIT_SUBSTATE_MASK;
 
        retval = 0;
+#if 0
        /* If the HW does not support any sub-states in this C-state */
        if (num_cstate_subtype == 0) {
                pr_warn(FW_BUG "ACPI MWAIT C-state 0x%x not supported by HW (0x%x)\n",
@@ -93,6 +94,7 @@ static long acpi_processor_ffh_cstate_pr
                retval = -1;
                goto out;
        }
+#endif
 
        /* mwait ecx extensions INTERRUPT_BREAK should be supported for C2/C3 */
        if (!(ecx & CPUID5_ECX_EXTENSIONS_SUPPORTED) ||

And I haven't had any freezes in 3 days...
Comment 562 Tolga Cakir 2019-03-20 19:21:50 UTC
I have revisited the errata. Errata 1033 "A Lock Operation May Cause the System to Hang" and 1109 "MWAIT Instruction May Hang a Thread" are the top contenders. According to page 12 and 13 of that document, Pinnacle Ridge processors are not affected. However, page 16 and 17 suggest 2nd Gen Ryzen are affected. This makes me conclude: either there is an error in the document, or Raven Ridge aka improved Zen 1 (Desktop and Mobile) are affected, while Pinnacle Ridge is not.

During my latest tests on kernel 5.0.1, on AGESA 1006 and after upgrading to AGESA 0070, I wasn't able to reproduce any freezes / hangs on a Ryzen 7 2700, ASUS A320M-K (BIOS defaults, no OC, no Typical Current Idle setting), 16GB Patriot Viper RGB 3200CL16, ASUS RX480 8GB and Corsair AX860 (Haswell C6/C7 support, 80 PLUS Platinum).

I installed a fresh, bare-minimum, default Arch Linux install without any special configuration, let the system sit idle for half a day and checked the log for hangs. Then I enabled all power-management features using powertop and let the system sit overnight - still no hangs. I repeated the same for AGESA 0070, still no hangs.

I think it's too early to conclude anything, as the hangs / freezes used to be very random. I will repeat my tests (on AGESA 0070) with a slightly more bloated system by installing GNOME.
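
(For reference, a sketch of how all the power-management tunables can be enabled in one go, assuming a reasonably recent powertop:)

powertop --auto-tune     # flips every entry on powertop's "Tunables" tab to "Good"
# or run powertop interactively and toggle the entries on the Tunables tab by hand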
Comment 563 JerryD 2019-03-21 01:24:28 UTC
I am running Fedora on AMD Ryzen 5 2500U with Radeon Vega Mobile Gfx with Gnome.

I use kernel parameters idle=nomwait iommu=pt processor.max_cstate=1 set via grubby.

I am not getting any hangs unless I suspend with lid close or suspend on power button. It would probably be better to not execute an MWAIT per the errata.  The method in comment 561 would work for those who have the gumption to build their own kernel.

I read somewhere that iommu=pt is useful, but I don't know if it helps. Some suggest setting idle=halt, which also avoids the MWAIT instruction.

I am also told that the DRI driver has an issue with loading at boot which will hang the kernel. https://bugs.freedesktop.org/show_bug.cgi?id=109206
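
(For anyone on Fedora who wants to try the same parameters, a sketch of the grubby invocation - adjust the argument list as needed:)

grubby --update-kernel=ALL --args="idle=nomwait iommu=pt processor.max_cstate=1"
grubby --info=ALL | grep ^args     # verify the new kernel command line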
Comment 564 Another User 2019-03-21 07:43:13 UTC
(In reply to alfie from comment #561)
> This is pretty strange, could someone explain it to me? Wasn't mwait bugged
> in ryzen?

It looks like this patch makes the kernel ignore the BIOS claim about the unsupported C-state.
The function is used in acpi_processor_ffh_cstate_probe(), which is in turn used in processor_idle.c. It causes the "cx.entry_method" value to change from ACPI_CSTATE_SYSTEMIO to ACPI_CSTATE_FFH.

It is really strange that this patch works but idle=halt does not. Maybe it somehow silently disables the C6 state.
Can your system reach turbo frequencies with this patch?
Maybe I'm misunderstanding something...
Comment 565 Borislav Petkov 2019-03-21 09:39:03 UTC
There's a patch in comment #526 for people to test before we include it so that there's at least *some* fix in the kernel, going forward...
Comment 566 alfie 2019-03-21 11:45:36 UTC
(In reply to Another User from comment #564)
> Can your system reach turbo frequencies with this patch?
> Maybe I understand something wrongly...

Yes, I have a ryzen 1600 and I can see 1/2 cores running at ~3.6 GHz under some peculiar stress case.


(In reply to Borislav Petkov from comment #565)
> There's a patch in comment #526 for people to test before we include it so
> that there's at least *some* fix in the kernel, going forward...

But I really want C6/P6 and ignoring the BIOS seems to get me there...
Comment 567 Borislav Petkov 2019-03-21 11:55:10 UTC
(In reply to alfie from comment #566)
> But I really want C6/P6 and ignoring the BIOS seems to get me there...

Why do you think the patch I pointed to won't give you C6?
Comment 568 Tolga Cakir 2019-03-21 18:54:39 UTC
@Borislav has the fix for erratum 1033 "A Lock Operation May Cause the System to Hang" been applied so far? The suggested workaround was "Program MSRC001_1020[4] to 1b", but I couldn't find anything about it in master branch. According to the document, 1033 only affects B1.
Comment 569 Borislav Petkov 2019-03-21 21:16:22 UTC
(In reply to Tolga Cakir from comment #568)
> @Borislav has the fix for erratum 1033 "A Lock Operation May Cause the
> System to Hang" been applied so far? The suggested workaround was "Program
> MSRC001_1020[4] to 1b", but I couldn't find anything about it in master
> branch. According to the document, 1033 only affects B1.

Such a "fix" does not exist. Possibly because it is unlikely this is causing it and BIOS might be applying the fix already. People with B1s (model 1, stepping 1) could test though by doing as root:

# modprobe msr
# rdmsr -a 0xc0011020

and looking at bit 4 in the result.
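
(A small sketch for checking that bit without decoding the hex by eye, assuming msr-tools is installed:)

modprobe msr
# print bit 4 of MSR 0xc0011020 for every CPU; 1 means the workaround bit is set
for v in $(rdmsr -a 0xc0011020); do echo $(( (0x$v >> 4) & 1 )); done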
Comment 570 Lars Viklund 2019-03-21 23:22:04 UTC
(In reply to Borislav Petkov from comment #569)
> (In reply to Tolga Cakir from comment #568)
> Such a "fix" does not exist. Possibly because it is unlikely this is causing
> it and BIOS might be applying the fix already. People with B1s (model 1,
> stepping 1) could test though by doing as root:
> 
> # modprobe msr
> # rdmsr -a 0xc0011020
> 
> and looking at bit 4 in the result.

I've got an ASUS PRIME X370-PRO with the B1 CPU mentioned on firmware 4024
(from 2018/09/28).

rdmsr yields 6800000000010, which has bit 4 set.
Comment 571 Borislav Petkov 2019-03-22 04:50:18 UTC
(In reply to Lars Viklund from comment #570)
> rdmsr yields 6800000000010, which has bit 4 set.

Looks like your BIOS applies the fix. Now, does the patch in comment #526 fix your freezes?
Comment 572 Lars Viklund 2019-03-22 08:33:55 UTC
(In reply to Borislav Petkov from comment #571)
> (In reply to Lars Viklund from comment #570)
> > rdmsr yields 6800000000010, which has bit 4 set.
> 
> Looks like your BIOS applies the fix. Now, does the patch in comment #526
> fix your freezes?

I'm sorry, I don't currently suffer from any significant hangs on my machines when running stock Ubuntu 18.04.02 LTS (4.15.0-43-generic).

I mostly meant #570 to demonstrate that some firmware does indeed implement the errata.

While I've had core hangs and hard freezes in the past, the machine has enough workarounds applied to make it sufficiently stable. Current uptime is something like 81 days since the last incident and I can honestly not remember what it did then, but it's been used heavily as a build machine.

From memory: sufficiently RMA'd CPUs, slight overclocking, changing the current idle setting in firmware, booting with idle=nomwait.

Back when I had significant problems, the best way to mitigate them was to disable SMT.

It's possible that my increased stability has come from firmware implementing this errata, or it might be coincidental, I don't have enough data to tell.

(You people should be happy you're not running FreeBSD, there I can reasonably reliably hang Ryzens within hours by sending ZFS snapshots :D )
Comment 573 Dennis Schridde 2019-03-22 08:53:59 UTC
(In reply to Borislav Petkov from comment #569)
> (In reply to Tolga Cakir from comment #568)
> > @Borislav has the fix for erratum 1033 "A Lock Operation May Cause the
> > System to Hang" been applied so far? The suggested workaround was "Program
> > MSRC001_1020[4] to 1b", but I couldn't find anything about it in master
> > branch. According to the document, 1033 only affects B1.
> 
> Such a "fix" does not exist. Possibly because it is unlikely this is causing
> it and BIOS might be applying the fix already. People with B1s (model 1,
> stepping 1) could test though by doing as root:
> 
> # modprobe msr
> # rdmsr -a 0xc0011020
> 
> and looking at bit 4 in the result.

206800000000100

CPU: AMD Ryzen 5 2400G
Motherboard: Asus ROG STRIX B350-F Gaming
Firmware version: 4207
Linux: 4.18.16-300.fc29 (Fedora 29 live system)
Comment 574 Borislav Petkov 2019-03-22 09:32:31 UTC
(In reply to Dennis Schridde from comment #573)
> 206800000000100

If this is correct:

http://www.cpu-world.com/CPUs/Zen/AMD-Ryzen%205%202400G.html

then you should have something like this in /proc/cpuinfo:

cpu family: 23
model:      17
stepping:   0

and if so, not affected.
Comment 575 Dennis Schridde 2019-03-22 10:44:28 UTC
(In reply to Borislav Petkov from comment #574)
> (In reply to Dennis Schridde from comment #573)
> > 206800000000100
> 
> If this is correct:
> 
> http://www.cpu-world.com/CPUs/Zen/AMD-Ryzen%205%202400G.html
> 
> then you should have in /proc/cpuninfo something like this:
> 
> cpu family: 23
> model:      17
> stepping:   0
> 
> and if so, not affected.

A year ago I was affected by something (comment #295), but the issue appears to have vanished within the last months (comment #547 and comment #573).
Comment 576 Tolga Cakir 2019-03-22 12:05:22 UTC
(In reply to Lars Viklund from comment #572)
> (You people should be happy you're not running FreeBSD, there I can
> reasonably reliably hang Ryzens within hours by sending ZFS snapshots :D )

Looks like they've addressed Ryzen errata issues around August 2018:

https://github.com/freebsd/freebsd/blob/75ee4f08d3acd4bf70f24b3203fa440255873973/sys/amd64/amd64/initcpu.c#L133

Also, they've made machdep.idle=hlt and machdep.idle_mwait=0 default for Ryzen processors.

I've additionally checked the registers mentioned in FreeBSD's source. The latest BIOS 4406 (w/ AGESA 0070) for my ASUS A320M-K seems to apply fixes for all affected errata.

Even though I can't reliably reproduce the stability issues anymore, I think it's still good practice to have Boris' patch applied, since it addresses a known erratum. I'll test the patch and check for regressions.
Comment 577 Vladyslav Yamkovyi 2019-03-22 14:07:00 UTC
I'm actually desperate about this issue. I'm looking forward to seeing the results of Boris' patch, though I really doubt it will solve this issue. It really needs an explicit check against a self-built kernel.

Not all manufacturers are willing to fix this issue (especially when it comes to laptops), as Windows seems to have fixed it inside its kernel since 1709, I believe. I've made a post on the HP forum about this, specifically for my laptop:

https://h30434.www3.hp.com/t5/Notebook-Boot-and-Lockup/Random-Soft-Lock-up-on-Ryzen/m-p/7062823
Comment 578 JerryD 2019-03-24 19:42:20 UTC
(In reply to Borislav Petkov from comment #571)
> (In reply to Lars Viklund from comment #570)
> > rdmsr yields 6800000000010, which has bit 4 set.
> 
> Looks like your BIOS applies the fix. Now, does the patch in comment #526
> fix your freezes?

Hi Boris, on my laptop with its latest BIOS I get:

$ sudo rdmsr -a 0xc0011020
6800000000000

This is on:

cpu family	: 23
model		: 17
model name	: AMD Ryzen 5 2500U with Radeon Vega Mobile Gfx
stepping	: 0
microcode	: 0x8101007

Booting with:

[    0.000000] Command line: BOOT_IMAGE=/vmlinuz-4.20.16-200.fc29.x86_64 root=/dev/mapper/fedora-root ro resume=/dev/mapper/fedora-swap rd.lvm.lv=fedora/root rd.lvm.lv=fedora/swap rhgb quiet LANG=en_US.UTF-8 idle=nomwait iommu=pt processor.max_cstate=1

Has your patch been applied to kernel yet?
Comment 579 Borislav Petkov 2019-03-24 20:14:03 UTC
(In reply to JerryD from comment #578)
> $ sudo rdmsr -a 0xc0011020
> 6800000000000
> 
> This is on:
> 
> cpu family    : 23
> model         : 17
> model name    : AMD Ryzen 5 2500U with Radeon Vega Mobile Gfx
> stepping      : 0
> microcode     : 0x8101007

You're showing me the MSR output for erratum 1033 but your CPU is 
not affected.

> Has your patch been applied to kernel yet?

No it hasn't - I'm still waiting for someone to test it and confirm that it fixes the issue on their boxes. And by issue I don't mean erratum 1033 - that is a red herring anyway - but the lockups people are reporting on some broken BIOSes.
Comment 580 alfie 2019-03-25 09:06:47 UTC
Created attachment 281999 [details]
Add rifw kernel parameter to test a couple of patch to workaround ryzen freezes

Add rifw kernel parameter.

rifw=none - don't do anything
rifw=alfie - ignore BIOS
rifw=boris - use Borislav Petkov patch

It works with kernels 4.19, 4.20 and 5.0.2.
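
(To test it, the parameter just needs to be on the kernel command line - a sketch for a GRUB-based system; config paths vary by distro:)

# append rifw=alfie (or rifw=boris / rifw=none) to GRUB_CMDLINE_LINUX_DEFAULT in /etc/default/grub, e.g.:
#   GRUB_CMDLINE_LINUX_DEFAULT="quiet rifw=alfie"
grub-mkconfig -o /boot/grub/grub.cfg    # Fedora/openSUSE: grub2-mkconfig -o /boot/grub2/grub.cfg
# after a reboot, confirm the parameter is active:
cat /proc/cmdline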
Comment 581 alfie 2019-03-25 09:07:56 UTC
> No it hasn't - I'm still waiting for someone to test it and confirm that it
> fixes the issue on their boxes. And by issue I don't mean erratum 1033 -
> that is a red herring anyway - but the lockups people are reporting on some
> broken BIOSes.

Yes, your patch "seems" to work in my case. 

"Seems" because this bug is difficult to replicate...
Comment 582 brodyck 2019-04-01 15:40:26 UTC
I've spent the last few days reading through this thread.

I have Fedora 28, ryzen 1700x, and ASRock Steel Legend B450M. It's a server mostly used for files at the moment. Bone stock Fedora 28, samba, ZFS.

With GPU plugged in, hangs within 6-10 hours. Replicating this too often since March 27th.

With GPU unplugged, hangs happen within 10-20 minutes. Replicated 3 times.

Does not hang while SSH'd in.

Ran Memtest and tested for the segfault issue. Hardware is fine. Bought 'open box' on Amazon; this was in case they were re-selling an affected CPU without knowing.

Have not tried alfie's patch. Do have kernel 4.20 compiled with alfie's newest rifw patch as per Comment 580.


With Borislav's patch:
- Modifications to the OS are Samba, ZFS, Borislav's patch (comment #526 on kernel 5.0.1)
- booted March 30th at ~22:00
- Hung March 31st at ~09:40
- This is slightly longer than other tests
- I used the instructions he provided in Comment 511 without checking out 4.20. It compiles, but doesn't fix the issue, at least with Linux 5.0.1-rc2 (apologies for not following the steps and using 4.20)


'Typical Current Idle' set:
- Modifications to the OS are only Samba and ZFS
- Booted at 20:45 EST
- Still on now at the time of this post, 11:40 EST (I'll check before I hit send)
- This is the record up-time for this server
- Haven't tested anything else because I needed to have reliable access to my shares for my days off. This was 1 seemingly sure-fire way of making that happen.


Will update with more results once Linux 4.20 with Borislav's and Alfie's patches has been tested. Could be a couple of days. Less if 'Typical Current Idle' doesn't work, as I'll likely be RMAing my mobo and CPU.
Comment 583 Dyson 2019-04-08 14:53:05 UTC
I used rifw=alfie with a 1700X on an MSI X370 Gaming Plus (BIOS 7A33v5H) with no freezing for 1 week. The computer was rock solid on the older BIOS (7A33v55) for a year+ until the BIOS update to the latest version.


I'm using low power idle option with kernel 4.14.111 opensuse 15.0.


The PC had "kernel: [Firmware Bug]: ACPI MWAIT C-state 0x0 not supported by HW (0x0)" after the BIOS update - it is gone now with the patch in Comment 580 and the rifw=alfie boot option.
Comment 584 Hazok 2019-04-11 22:05:04 UTC
I haven't had the chance to get through the whole thread yet and not sure if I'll have enough time today. However, I wanted to quickly ask if anyone has faced this issue with an Intel processor?

We are seeing kernel panics due to the soft lockup issue with the following configuration:

CPU Architecture: Intel Skylake or Higher
vCPUs: 8
RAM: 52 GB
Kernel: 4.15.0
OS: Ubuntu 16.04.1

For some additional information, this setup does leverage Nested VMX to run Docker containers with the --privileged option enabled.

The soft lockups seem to occur at a varying rate, with 1-5% of our VMs getting impacted in a 24-hour period running this configuration.

The load on the cluster does not seem to impact the frequency of the issue.

One thing I did notice is that most/all of the kernel panics seem to have involved a call to smp_call_function_single.

Since I have not made it far enough through this thread, I'm not sure if the focus on this is exclusively on AMD, or if anyone else had an Intel example?

Is there a fix in the pipe for this issue and if so, a target kernel version?
Comment 585 Liu Liu 2019-04-13 16:05:54 UTC
Created attachment 282319 [details]
Threadripper 2920x soft lockup

I've gone through this thread and tried disabling the C6 state and adding idle=halt, with no effect. I followed the excellent repro steps in https://www.reddit.com/r/Amd/comments/apw8im/ryzen_freezes_in_linux_even_if_linux_is_in_vm/ and it is pretty consistent. However, switching to gcc-8 seems to have resolved it for these particular repro steps. With dmesg -w, I was able to capture and verify that this is indeed a soft lockup. The dmesg output is attached.

It is a Threadripper 2920x with MSI x399 Gaming Pro Carbon AC with latest BIOS on Ubuntu 18.04.2 LTS.

Linux sz77 4.15.0-47-generic #50-Ubuntu SMP Wed Mar 13 10:44:52 UTC 2019 x86_64 x86_64 x86_64 GNU/Linux
Comment 586 Klaus Mueller 2019-04-14 04:52:16 UTC
Hello Liu

You're hitting another bug: it happens under high load (which can easily be triggered by a lot of parallel compilation threads). The only solution for your kind of problem is to RMA your CPU. See: https://www.phoronix.com/scan.php?page=news_item&px=Ryzen-Segv-Response (but the problem does not seem to be limited to the Ryzen 7 as stated at Phoronix, and it's not possible to say that CPUs built after a certain date would be OK). I have been affected too (Ryzen 7) and went through the RMA process. Luckily, the first RMA'd CPU was OK - there have been people who needed more than one RMA. Take a look here:
https://community.amd.com/message/2822339
Back when I hit the problem (gcc segfaults), I had to reboot my machine quickly because it usually hung completely some time later.


This thread covers the problem of Ryzen CPUs hanging when the machine is completely idle for some time (it's the opposite of the problem you're facing, and the one I saw before the CPU RMA). I "fixed" the hang-on-idle problem by slightly overclocking the CPU (+200 MHz - ASUS X370-PRO / BIOS 4011 04/19/2018 - BIOS option: Optimization for daily computing). None of the other suggested workarounds work for me. But you can't say for sure that overclocking fixes the problem (the same goes for every other suggested workaround). Still, you could try it.
Comment 587 Liu Liu 2019-04-14 04:57:38 UTC
Hey, Klaus,

My understanding of that issue is that it only affects the Ryzen 1700 / 1700X and doesn't affect the Epyc / Threadripper line. I am on a Threadripper 2920X, which should have been past the marginality problem for over a year now. Also, it doesn't segfault at any point, it just hangs. I do believe it is a similar problem to the one in this thread, but since it is rare, it could also just be one faulty CPU. It just doesn't align well with the marginality-problem timeline.
Comment 588 Paulo J. S. Silva 2019-04-14 14:28:46 UTC
I have a couple of Threadrippers (1950X) that also hang under very specific workloads with an ASUS X399-A motherboard. In my case, when the freeze happens the only way to get the machine back is to unplug it from the wall and plug it back in. Not even reset works.

After many tweaks, I found out that I can avoid this by either turning on "overclocking enhancement" in the BIOS (this is specific to ASUS boards) or manually setting the Vcore to something like 1.315V (which should work on all motherboards). My guess is that when idle the motherboard drops the voltage too low and the CPU hangs.
Comment 589 Brendan Long 2019-04-16 15:53:17 UTC
I ran into similar problems to this with a Threadripper 1950X and none of the proposed workarounds did anything. Disabling C6 in the BIOS made it fail to boot for some reason. Setting "Typical Current Idle" in the BIOS, processor.max_cstate=1 in the kernel command line, or disabling C6 with ZenStates.py all had no effect. I also tried overclocking and underclocking.

I eventually did a from-scratch reinstall of Fedora 29 and it suddenly just works (for 10+ hours, I don't leave my machine on overnight). Once it started working, I set everything back to the defaults except for "Typical Current Idle" (since I have a PSU from 2010 and I doubt it supports any fancy features from 2013).

Unfortunately I can't easily narrow down all of the changes after the re-install, but the ones I'm aware of are:

 - Using the Nouveau driver (previously the proprietary NVIDIA drivers)
 - Booting in EFI mode now (previously legacy mode)
 - Using ext4 (previously btrfs)

I strongly suspect that the graphics driver was the problem since my lockups would cause the screen to become completely unresponsive, but sound continued working, and in one case I had a lockup during a video call and the other person could still see and hear me.

I plan to try the proprietary drivers again sometime and I'll make a note if that brings the problem back for me.
Comment 590 Klaus Mueller 2019-04-16 18:44:08 UTC
(In reply to Brendan Long from comment #589)
> I strongly suspect that the graphics driver was the problem since my lockups
> would cause the screen to become completely unresponsive, but sound
> continued working, and in one case I had a lockup during a video call and
> the other person could still see and hear me.

What you're describing here is a new "feature" introduced between kernels 4.19.16 and .17, e.g.; I can see exactly the same here with radeon hardware. The system keeps working completely (even VMs on the host keep running well) except for graphics - even tty terminals sometimes still work. When SSHing into the machine, I can always see log entries like these:

radeon 0000:0a:00.0: ring 0 stalled for more than 14084msec
radeon 0000:0a:00.0: GPU lockup (current fence id 0x0000000000053ed7 last fence id 0x0000000000053f0f on ring 0)
...

I'm currently trying to narrow it down using git bisect. The suspicious changes left at the moment are:

2019-01-22      arm64: Don't trap host pointer auth use to EL2                                  Mark Rutland           bad
2019-01-22      arm64/kvm: consistently handle host HCR_EL2 flags                               Mark Rutland
2019-01-22      scsi: target: iscsi: cxgbit: fix csk leak                                       Varun Prakash
2019-01-22      scsi: target: iscsi: cxgbit: fix csk leak                                       Varun Prakash
2019-01-22      Revert "scsi: target: iscsi: cxgbit: fix csk leak"                              Sasha Levin
2019-01-22      mmc: sdhci-msm: Disable CDR function on TX                                      Loic Poulain
2019-01-22      netfilter: nf_conncount: fix argument order to find_next_bit
2019-01-22      netfilter: nf_conncount: speculative garbage collection on empty lists          Pablo Neira Ayuso
2019-01-22      netfilter: nf_conncount: move all list iterations under spinlock                Pablo Neira Ayuso
2019-01-22      netfilter: nf_conncount: merge lookup and add functions                         Florian Westphal

2019-01-22      netfilter: nf_conncount: restart search when nodes have been erased             Florian Westphal                ?
2019-01-22      netfilter: nf_conncount: split gc in two phases                                 Florian Westphal
2019-01-22      netfilter: nf_conncount: don't skip eviction when age is negative               Florian Westphal
2019-01-22      netfilter: nf_conncount: replace CONNCOUNT_LOCK_SLOTS with CONNCOUNT_SLOTS      Shawn Bohrer
2019-01-22      can: gw: ensure DLC boundaries after CAN frame modification                     Oliver Hartkopp
2019-01-22      tty: Don't hold ldisc lock in tty_reopen() if ldisc present                     Dmitry Safonov
2019-01-22      tty: Simplify tty->count math in tty_reopen()                                   Dmitry Safonov
2019-01-22      tty: Hold tty_ldisc_lock() during tty_reopen()                                  Dmitry Safonov
2019-01-22      tty/ldsem: Wake up readers after timed out down_write()                         Dmitry Safonov

As you're describing, the problem seems to be network related. I'm getting this error too when watching videos from the internet. I'm currently testing the changes between "restart search when nodes have been erased" and "Wake up readers after timed out down_write()".
Comment 591 Trevor Cordes 2019-04-16 23:12:16 UTC
(In reply to Brendan Long from comment #589)
> 
> I eventually did a from-scratch reinstall of Fedora 29 and it suddenly just
> works (for 10+ hours, I don't leave my machine on overnight). Once it
> started working, I set everything back to the defaults except for "Typical
> Current Idle" (since I have a PSU from 2010 and I doubt it supports any
> fancy features from 2013).

a) Fedora 29 has very new kernels.  In the last few weeks they rebased to 5.0 and 5.0 is supposed to have a lot of Ryzen fixes in it.

b) For me "Typical Current Idle" (plus mwait) was all that was required to fix the problem.  I'm pretty sure kernel 5.0 makes it so mwait override is not needed.

>  - Using the Nouvea driver (previously proprietary nVidia drivers)
>  - Booting in EFI mode now (previously legacy mode)
>  - Using ext4 (previously btrfs)

We never used the proprietary video driver, and we both had the problem and fixed it. Same with EFI: we use just legacy mode. The filesystem should not matter at all.
Comment 592 Moritz Naumann 2019-05-15 10:56:35 UTC
The latest firmware for X370 Taichi, v5.50 (2019/4/24), removes the "Power Supply Idle Control" option off the configuration UI; downgrading is not supported (but effectively possible at least from Windows). It is still possible to set "Power Supply Idle Control" (C6 package) via MSR using e.g. 
  /sbin/modprobe msr && /usr/sbin/wrmsr -a 0xC0010292 true
during boot.

Luckily, this workaround may no longer be needed. While, with default 'BIOS' (actually UEFI firmware) settings on v5.50, Linux 4.18 still freezes during idle for me, it no longer does so on 5.1.2 (apparently - needs more testing - not on 5.0 either).
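
(A sketch of one way to apply that at every boot on a systemd system; the unit name is made up, and the ExecStart line simply reuses the command above:)

cat > /etc/systemd/system/psu-idle-workaround.service <<'EOF'
[Unit]
Description=Re-apply the Power Supply Idle Control MSR workaround

[Service]
Type=oneshot
ExecStart=/bin/sh -c '/sbin/modprobe msr && /usr/sbin/wrmsr -a 0xC0010292 true'

[Install]
WantedBy=multi-user.target
EOF
systemctl daemon-reload
systemctl enable --now psu-idle-workaround.service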
Comment 593 Philip Rosvall 2019-05-15 12:36:50 UTC
(In reply to Moritz Naumann from comment #592)
> The latest firmware for X370 Taichi, v5.50 (2019/4/24), removes the "Power
> Supply Idle Control" option off the configuration UI; downgrading is not
> supported (but effectively possible at least from Windows). It is still
> possible to set "Power Supply Idle Control" (C6 package) via MSR using e.g. 
>   /sbin/modprobe msr && /usr/sbin/wrmsr -a 0xC0010292 true
> during boot.
> 
> Luckily, this workaround may no longer be needed. While, with default 'BIOS'
> (actually UEFI firmware) settings on v5.50, Linux 4.18 still freezes during
> idle for me, it no longer does so on 5.1.2 (apparently - needs more testing
> - not on 5.0 either).

Since 5.0.10, and this commit ...:

(https://cdn.kernel.org/pub/linux/kernel/v5.x/ChangeLog-5.0.10)
"commit 205c53cbe553c9e5a9fe93f63e398da7e59124b6
Author: Thomas Gleixner <tglx@linutronix.de>
Date:   Sun Apr 14 19:51:06 2019 +0200

    x86/speculation: Prevent deadlock on ssb_state::lock
    
    commit 2f5fb19341883bb6e37da351bc3700489d8506a7 upstream.
    
    Mikhail reported a lockdep splat related to the AMD specific ssb_state
    lock:
    
      CPU0                       CPU1
      lock(&st->lock);
                                 local_irq_disable();
                                 lock(&(&sighand->siglock)->rlock);
                                 lock(&st->lock);
      <Interrupt>
         lock(&(&sighand->siglock)->rlock);
    
      *** DEADLOCK ***
    
    The connection between sighand->siglock and st->lock comes through seccomp,
    which takes st->lock while holding sighand->siglock.
    
    Make sure interrupts are disabled when __speculation_ctrl_update() is
    invoked via prctl() -> speculation_ctrl_update(). Add a lockdep assert to
    catch future offenders.
    
    Fixes: 1f50ddb4f418 ("x86/speculation: Handle HT correctly on AMD")"

... the problem seems to be fixed! I no longer need to use idle=halt and the systems don't freeze anymore!
Comment 594 Borislav Petkov 2019-05-15 14:03:34 UTC
(In reply to Philip Rosvall from comment #593)
> Since 5.0.10, and this commit ...:
> 
> (https://cdn.kernel.org/pub/linux/kernel/v5.x/ChangeLog-5.0.10)
> "commit 205c53cbe553c9e5a9fe93f63e398da7e59124b6
> Author: Thomas Gleixner <tglx@linutronix.de>
> Date:   Sun Apr 14 19:51:06 2019 +0200
> 
>     x86/speculation: Prevent deadlock on ssb_state::lock

If that commit really fixes the "issue" on your machine then you
shouldn't have been experiencing any softlockups etc, i.e., what this
bugzilla is about, but rather a software deadlock.

Which would mean that your machine is not necessarily affected by the
entering into a C-state with MWAIT and not waking up after, issue.
Comment 595 Radomír Polách 2019-05-16 12:28:58 UTC
This bug bugs me so much that I registered on the kernel Bugzilla.

I tried many workarounds; nothing solved it.

Machine: ThinkPad A485 with AMD 2500U.
System:
Distributor ID:	Ubuntu
Description:	Ubuntu 19.04
Release:	19.04
Codename:	disco
Kernel: Linux kni 5.0.0-13-generic #14-Ubuntu SMP Mon Apr 15 14:59:14 UTC 2019 x86_64 x86_64 x86_64 GNU/Linux

And the problem is still there.

I have the idle=nomwait kernel parameter set, but I have tried many others. Disabling C6 states, everything.

I can reproduce this with 100% success by running my accounting software Flexibee (https://www.flexibee.eu/podpora/stazeni-flexibee/), which has a PostgreSQL backend. When it starts PostgreSQL, soft lockups start to appear, I get freezes, and finally the system freezes completely after 30 seconds or so. I can even reproduce it in VirtualBox: when starting the software in VirtualBox, it crashes the whole system (the host system!).

I can offer help with debugging this problem. This problem needs to go away. It is not present on Windows as far as I know.
Comment 597 Liu Liu 2019-05-16 21:44:46 UTC
(In reply to Liu Liu from comment #585)
> Created attachment 282319 [details]
> Threadripper 2920x soft lockup
> 
> I've went through this thread, and tried to disable c6 state, add idle=halt
> with no effect. Followed this excellent reprod steps in
> https://www.reddit.com/r/Amd/comments/apw8im/
> ryzen_freezes_in_linux_even_if_linux_is_in_vm/ It is pretty consistent.
> However, switching to gcc-8 seems resolved this for this particular reprod
> steps. With dmesg -w, I was able to capture and verify this is indeed a soft
> lockup. Attached the dmesg output.
> 
> It is a Threadripper 2920x with MSI x399 Gaming Pro Carbon AC with latest
> BIOS on Ubuntu 18.04.2 LTS.
> 
> Linux sz77 4.15.0-47-generic #50-Ubuntu SMP Wed Mar 13 10:44:52 UTC 2019
> x86_64 x86_64 x86_64 GNU/Linux

Some updates since I last posted. I've updated to gcc-8 and enabled idle=halt. idle=halt + gcc-7 with the original repro steps can still cause a lockup, but by defaulting to gcc-8 and idle=halt in day-to-day use, I haven't encountered any system lockup in the past 2 months. I conclude that idle=halt should mitigate this problem for normal use.
Comment 598 JerryD 2019-05-17 01:05:23 UTC
(In reply to Liu Liu from comment #597)
--- snip --
> Some updates since I last posted. I've updated to gcc-8 and enabled
> idle=halt. Even though idle=halt + gcc-7 with the original reprod steps can
> still cause a lockup. By defaulting to gcc-8 and idle=halt, in day-to-day
> uses, I haven't encountered any system lockup in the past 2 months. I
> concluded that idle=halt should mitigate this problem for normal uses.

Are you saying that you compiled kernel with gcc-8 or that you use gcc-8 in your day to day work?

On my system I get a lockup if I do not use idle=nomwait. Like everyone else, I too have tried various combinations to be stable (Ryzen 2500U laptop). As far as I can tell this bug is not really fixed except on Microsoft Windows (and who knows what they are doing). Regardless, things seem stable now, but I cannot use suspend (a different bug).
Comment 599 OptionalRealName 2019-05-17 01:20:13 UTC
This embarrassing problem persists, will it be fixed on the new 3000 series, due in the next few months? Does anyone out there have access to this hardware?


Very close to the 2 year anniversary of this issue and there's still no 100% clear solution outlining how to fix the problem.
Comment 600 Radomír Polách 2019-05-17 01:44:33 UTC
It may be originally a hardware issue, but I am sure it does not affect Windows, so it should be fixable in kernel code. I am not even sure if the problem was present on Windows at all.

I can reproduce it with 100% success anytime I want.
Comment 601 Liu Liu 2019-05-17 03:00:35 UTC
(In reply to JerryD from comment #598)
> (In reply to Liu Liu from comment #597)
> --- snip --
> > Some updates since I last posted. I've updated to gcc-8 and enabled
> > idle=halt. Even though idle=halt + gcc-7 with the original reprod steps can
> > still cause a lockup. By defaulting to gcc-8 and idle=halt, in day-to-day
> > uses, I haven't encountered any system lockup in the past 2 months. I
> > concluded that idle=halt should mitigate this problem for normal uses.
> 
> Are you saying that you compiled kernel with gcc-8 or that you use gcc-8 in
> your day to day work?
> 
> On my system I get a lockup if I do not use idle=nomwait. Like everyone else
> I too have tried verious combinations to be stable. (Ryzen 2500U Laptop)  As
> far as I can tell this bug is not really fixed except on Microsoft Windows
> (and who knows what they are doing). Regardless things seem stable now but I
> can not use suspend (a different bug)

gcc-8 for my day-to-day work.
Comment 602 Jonathan 2019-05-17 06:19:33 UTC
(In reply to OptionalRealName from comment #599)
> This embarrassing problem persists, will it be fixed on the new 3000 series,
> due in the next few months? Does anyone out there have access to this
> hardware?
> 
> 
> Very close to the 2 year anniversary of this issue and there's still no 100%
> clear solution outlining how to fix the problem.

No, it won't be fixed, I GUESS. I have mailed this issue and a link to this thread to many hardware review sites to get attention, and to AMD, and not a single one even bothered to reply, let alone have a look at it.

If they don't even want to recognize this as a bug, why fix it?
Comment 603 Martin 2019-05-20 21:50:42 UTC
(In reply to Liu Liu from comment #597)
> Some updates since I last posted. I've updated to gcc-8 and enabled
> idle=halt. Even though idle=halt + gcc-7 with the original reprod steps can
> still cause a lockup. By defaulting to gcc-8 and idle=halt, in day-to-day
> uses, I haven't encountered any system lockup in the past 2 months. I
> concluded that idle=halt should mitigate this problem for normal uses.

I have a Ryzen Threadripper 2990WX setup and experience similar random lockups to the ones you describe! Setting idle=halt improved but did not entirely fix the problem for me. When doing a "btrfs scrub" on my NVMe SSD with 250 GB of data I could get reproducible freezes even with idle=halt set. On average, every second btrfs scrub over the 250 GB of data would cause a freeze, so this was a very effective way to reproduce the error condition (scrubbing twice through 250 GB of data takes only about 3 minutes on my system).

The "good" news is that since I changed the idle parameter to idle=poll (which obviously burns electricity like crazy) I can now do many (20+) btrfs scrub runs in a row without provoking any lockups, and so far the system runs stably.

Maybe someone here who also has btrfs on an SSD can try to reproduce the freezes with it in the same way. A btrfs scrub may be a more "evil" workload than parallel kernel compilations with gcc-7/8.

A final thought is that this thread may report two independent issues. There is the "completely unstable system" issue that initially froze the machine during two out of three boot cycles, which for me is solved by setting idle=nomwait and disabling C6 states in the BIOS.

And then there is this random freeze-once-in-a-while issue that so far is only resolved by going to idle=poll.
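
(For anyone who wants to try the same reproducer on their own btrfs filesystem, a minimal sketch - replace /mnt with the actual mount point; -B keeps each scrub in the foreground so the runs are serial:)

# hammer the filesystem with repeated scrubs; watch dmesg -w from another terminal
for i in $(seq 1 20); do
    btrfs scrub start -B /mnt || break
done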
Comment 604 Arup 2019-05-20 21:53:07 UTC
AMD issued new ucode a few days back. Is anyone using it? It updated automatically on my Arch system. I haven't had the lockup issue on my ASUS B350 board with the latest BIOS and the system idle current option checked.
Comment 605 Martin 2019-05-20 22:21:23 UTC
(In reply to Arup from comment #604)
> AMD issued a new ucode few days back. Anyone using that, it updated
> automatically on my Arch system. 
For me this update that came yesterday on Arch did not change the microcode version at all! Seems like there is no update for my processor.
Before (and still) running:
model name	: AMD Ryzen Threadripper 2990WX 32-Core Processor
microcode	: 0x800820b
Comment 606 OptionalRealName 2019-05-28 04:29:55 UTC
So Ryzen 3000 is coming, who will be the first guinea pig? 

Curious to see if this problem is mysteriously, entirely fixed on the 3000 series.

(I damn well hope so)
Comment 607 onox 2019-05-28 09:23:35 UTC
(In reply to Moritz Naumann from comment #592)
> The latest firmware for X370 Taichi, v5.50 (2019/4/24), removes the "Power
> Supply Idle Control" option off the configuration UI; downgrading is not
> supported (but effectively possible at least from Windows). It is still
> possible to set "Power Supply Idle Control" (C6 package) via MSR using e.g. 
>   /sbin/modprobe msr && /usr/sbin/wrmsr -a 0xC0010292 true
> during boot.
> 
> Luckily, this workaround may no longer be needed. While, with default 'BIOS'
> (actually UEFI firmware) settings on v5.50, Linux 4.18 still freezes during
> idle for me, it no longer does so on 5.1.2 (apparently - needs more testing
> - not on 5.0 either).

Hi, have you been able to do more testing? Does it still not freeze for you with v5.50? You're the second person I've found saying that v5.50 removes the freezes with the default settings.

I also use the ASRock X370 Taichi and the "Power Supply Idle Control" option, but I'm still on v4.70. However, since my computer also freezes on Windows when idle without changing that option, I now think that this is probably a different problem than the bug in this discussion. I actually found other people on forums who have also experienced those freezes on Windows, so I'm not an isolated case. They don't all have ASRock boards, but there might also be something specific to ASRock.

I have not really tested this yet, but some time ago I was told that at least on the ASRock boards the default SOC voltage of 0.9 V is too low, and that raising it to at least 1.0 V would prevent the freezes (but don't go higher than 1.1 V). I'm probably going to raise that voltage anyway because I would like to increase the frequency of my RAM and lower the timings, if the BIOS will let me do so properly and keep the settings… The v4.x BIOSes for this board are known to have serious bugs in this respect [http://forum.asrock.com/forum_posts.asp?TID=9371&KW=ram+worse&title=psa-stay-away-from-480-bios-x370-taichi], so I will probably take the risk and upgrade to v5.50 soon, too.

Meanwhile, AMD replaced my faulty 1700X CPU, and the replacement I received (same CPU model, but produced much later) seems not to have the segfault bug, although I should test it more in order to be completely sure. I must say that AMD was very nice in the RMA process; I really recommend doing the RMA directly with them if you have the segfault bug.

Meanwhile, AMD replaced my faulty 1700X CPU, and the replacement I received (same CPU model, but produced much later) seems not to have the segfault bug, although I should test it more in order to be completely sure. I must say that AMD was very nice in the RMA process, I really recommend doing the RMA directly with them if you have the segfault bug.
Comment 608 Moritz Naumann 2019-05-29 16:43:41 UTC
(In reply to onox from comment #607)

To prevent misunderstandings, let me repeat what I previously wrote, from a different angle: ASRock X370 Taichi firmware version v5.50 is worse than previous versions in that it removes the "Power Supply Idle Control" option from the configuration screen. As a result, with Linux 4.18 (Ubuntu kernel 4.18.0-20-generic) and below, the system *will* lock up unless different countermeasures are taken (namely setting the MSR).

This said, I have been and am running this single system 24/7 without issues despite extended low load periods on Ubuntu 18.04 LTS kernel 5.0.0-15-generic with microcode level 0x08001137 since May 15 (and from May 11 to May 15 on mainline Linux 5.1.2 with the same microcode) with AsRocks' firmware configuration defaults and without setting the MSR.

Summing up, the latest ASRock firmware for this mainboard takes a step in the wrong direction for Linux users, however newer Linux versions prevent this system from locking up.

MCExtractor is aware of a newer microcode, 0x08001138 (seen on 2019-02-04) for CPUID 0x00800f11, which ASRock decided not to include in their firmware.
Comment 609 weirdconstructor 2019-05-30 12:20:16 UTC
I think I am experiencing the same problem as described here with my recently built Ryzen computer. After installing a fresh Linux distribution I started out with Linux kernel 4.19.45. The computer ran continuously for two days; after a short pause I came back and found it completely frozen. The screen showed the last visible image, but no input worked - neither ping/ssh nor mouse or keyboard, not even the Num Lock LED. There were no log entries written, neither on screen nor on disk. The last entry was some cron job cleaning up temporary directories. The workload was just vim, a VNC viewer and Firefox. The next night, the same freeze happened. That finally got my attention to dig deeper.

As a first measure I updated the kernel to 5.1.4. After 2 hours of running without much load, it also froze.

I dug deeper, and next I set the BIOS option from "Auto" to "Typical current Idle". Uptime is now 1 day, without any workload except a chat client.

This is my System:

# CPU:
    cpu family	: 23
    model		: 1
    model name	: AMD Ryzen 5 1600 Six-Core Processor
    stepping	: 1
    microcode	: 0x8001137
    cpu MHz		: 1646.140

# Mainboard:
    Gigabyte Technology Co., Ltd. B450 AORUS ELITE/B450 AORUS ELITE, BIOS F5 01/25/2019

# GPU:
    NVIDIA GP107 [GeForce GTX 1050 Ti]
    Driver/OpenGL: 4.6.0 NVIDIA 418.74

# OS:
    Distro: Manjaro
    Kernel: 5.1.4-1-MANJARO x86_64

I also see this in my bootup log:
12 times: [Firmware Bug]: ACPI MWAIT C-state 0x0 not supported by HW (0x0)
12 times: microcode: CPU0: patch_level=0x0800113


Like I said, my computer has now been running for 1 day without any load with the BIOS option "Typical current Idle" set. This is so far the same "workaround" that brodyck described at 2019-04-01 15:40:26 UTC. Unfortunately, or fortunately for him, he has not written anything since then, which might mean that the problems stopped for him at that point.
I will keep observing this issue on my new workstation. If you don't hear back from me, the BIOS setting very likely worked.
Comment 610 Trevor Cordes 2019-05-31 07:50:50 UTC
As I mentioned around Comment 485 and later (starting Jan 22), we have nearly the same computer, and after setting the BIOS to typical current idle as you did, all of our problems disappeared. Haven't messed with it since. That's 4 months of uptime, zero hangs, zero crashes.

I'd bet everyone who was here and said they'd try the BIOS tweak and has not reported back falls into the same camp.  Everyone else is dealing with more or other problems, probably.

As for Ryzen 3000 series, AMD better have fixed this !@%!^# problem!  Since it looks like 3000 will do ECC nicely, I'll be building another box with it in early July.
Comment 611 kernel 2019-05-31 09:36:47 UTC
My two (server) machines running 1600's are also not experiencing problems any more using new BIOSes + 4.19 kernels + "Typical current bullshit". Hence I tend to agree that for most people the issue is solved.

However Trevor I'm mostly posting to let you know ECC works fine, at least with ASRock boards (of which the spec actually features ECC support, e.g. mine https://www.asrock.com/MB/AMD/AB350M%20Pro4/index.asp#Specification), even under Ryzen 1xxx CPU's:

# dmesg | grep -i edac
[    0.057169] EDAC MC: Ver: 3.0.0
[    9.926945] EDAC amd64: Node 0: DRAM ECC enabled.
[    9.926946] EDAC amd64: F17h detected (node 0).
[    9.926985] EDAC MC: UMC0 chip selects:
[    9.926986] EDAC amd64: MC: 0:     0MB 1:     0MB
[    9.926987] EDAC amd64: MC: 2:  4096MB 3:     0MB
[    9.926988] EDAC amd64: MC: 4:     0MB 5:     0MB
[    9.926989] EDAC amd64: MC: 6:     0MB 7:     0MB
[    9.926991] EDAC MC: UMC1 chip selects:
[    9.926991] EDAC amd64: MC: 0:     0MB 1:     0MB
[    9.926992] EDAC amd64: MC: 2:  4096MB 3:     0MB
[    9.926992] EDAC amd64: MC: 4:     0MB 5:     0MB
[    9.926993] EDAC amd64: MC: 6:     0MB 7:     0MB
[    9.926993] EDAC amd64: using x8 syndromes.
[    9.926994] EDAC amd64: MCT channel count: 2
[    9.927095] EDAC MC0: Giving out device to module amd64_edac controller F17h: DEV 0000:00:18.3 (INTERRUPT)
[    9.927104] EDAC PCI0: Giving out device to module amd64_edac controller EDAC PCI controller: DEV 0000:00:18.0 (POLLED)
[    9.927105] AMD64 EDAC driver v3.5.0

# edac-util -v
mc0: 0 Uncorrected Errors with no DIMM info
mc0: 0 Corrected Errors with no DIMM info
mc0: csrow2: 0 Uncorrected Errors
mc0: csrow2: mc#0csrow#2channel#0: 0 Corrected Errors
mc0: csrow2: mc#0csrow#2channel#1: 0 Corrected Errors
edac-util: No errors to report.

See also:
http://www.hardwarecanucks.com/forum/hardware-canucks-reviews/75030-ecc-memory-amds-ryzen-deep-dive.html
Comment 612 Grzechu 2019-06-25 10:42:26 UTC
AMD has released a new version of AGESA Combo-AM4 1.0.0.3 for my motherboard MSI B450 Gaming Plus and others. There were also updates of the CPU microcode.

The first (#1) on the list for example is Ryzen 2600, old and new bios, comparison of the microcode version:

OLD bios
╔═════════════════════════════════════════════════════════════════╗
║                               AMD                               ║
╟────┬──────────┬──────────┬────────────┬───────┬──────────┬──────╢
║ #  │  CPUID   │ Revision │    Date    │  Size │  Offset  │ Last ║
╟────┼──────────┼──────────┼────────────┼───────┼──────────┼──────╢
║ 1  │ 00800F82 │ 0800820B │ 2018-06-20 │ 0xC80 │ 0x4DD000 │  No  ║
╟────┼──────────┼──────────┼────────────┼───────┼──────────┼──────╢
║ 2  │ 00800F12 │ 08001230 │ 2018-08-04 │ 0xC80 │ 0x4DDD00 │  No  ║
╟────┼──────────┼──────────┼────────────┼───────┼──────────┼──────╢
║ 3  │ 00800F11 │ 08001137 │ 2018-02-14 │ 0xC80 │ 0x4DEA00 │  No  ║
╟────┼──────────┼──────────┼────────────┼───────┼──────────┼──────╢
║ 4  │ 00800F10 │ 0800100C │ 2017-01-31 │ 0xC80 │ 0x4DF700 │ Yes  ║
╟────┼──────────┼──────────┼────────────┼───────┼──────────┼──────╢
║ 5  │ 00800F00 │ 0800002A │ 2016-10-06 │ 0xC80 │ 0x4E0400 │ Yes  ║
╟────┼──────────┼──────────┼────────────┼───────┼──────────┼──────╢
║ 6  │ 00810F10 │ 0810100B │ 2018-02-12 │ 0xC80 │ 0x65A500 │  No  ║
╟────┼──────────┼──────────┼────────────┼───────┼──────────┼──────╢
║ 7  │ 00820F00 │ 08200002 │ 2018-02-14 │ 0xC80 │ 0x65B200 │ Yes  ║
╟────┼──────────┼──────────┼────────────┼───────┼──────────┼──────╢
║ 8  │ 00810F00 │ 08100004 │ 2016-11-20 │ 0xC80 │ 0x65BF00 │ Yes  ║
╟────┼──────────┼──────────┼────────────┼───────┼──────────┼──────╢
║ 9  │ 00810F80 │ 08108002 │ 2018-06-05 │ 0xC80 │ 0x65CC00 │ Yes  ║
╟────┼──────────┼──────────┼────────────┼───────┼──────────┼──────╢
║ 10 │ 00810F81 │ 08108102 │ 2018-08-13 │ 0xC80 │ 0x65D900 │  No  ║
╟────┼──────────┼──────────┼────────────┼───────┼──────────┼──────╢
║ 11 │ 00810F11 │ 08101102 │ 2018-11-06 │ 0xC80 │ 0x65E600 │ Yes  ║
╟────┼──────────┼──────────┼────────────┼───────┼──────────┼──────╢
║ 12 │ 00660F00 │ 06006012 │ 2014-10-14 │ 0xA20 │ 0xD97F50 │ Yes  ║
╟────┼──────────┼──────────┼────────────┼───────┼──────────┼──────╢
║ 13 │ 00660F01 │ 0600611A │ 2018-01-26 │ 0xA20 │ 0xD98970 │ Yes  ║
╚════╧══════════╧══════════╧════════════╧═══════╧══════════╧══════╝

NEW bios


╔═════════════════════════════════════════════════════════════════╗
║                               AMD                               ║
╟────┬──────────┬──────────┬────────────┬───────┬──────────┬──────╢
║ #  │  CPUID   │ Revision │    Date    │  Size │  Offset  │ Last ║
╟────┼──────────┼──────────┼────────────┼───────┼──────────┼──────╢
║ 1  │ 00800F82 │ 0800820D │ 2019-04-16 │ 0xC80 │ 0x50D000 │ Yes  ║
╟────┼──────────┼──────────┼────────────┼───────┼──────────┼──────╢
║ 2  │ 00800F12 │ 08001250 │ 2019-04-16 │ 0xC80 │ 0x50DD00 │ Yes  ║
╟────┼──────────┼──────────┼────────────┼───────┼──────────┼──────╢
║ 3  │ 00800F11 │ 08001138 │ 2019-02-04 │ 0xC80 │ 0x50EA00 │ Yes  ║
╟────┼──────────┼──────────┼────────────┼───────┼──────────┼──────╢
║ 4  │ 00800F10 │ 0800100C │ 2017-01-31 │ 0xC80 │ 0x50F700 │ Yes  ║
╟────┼──────────┼──────────┼────────────┼───────┼──────────┼──────╢
║ 5  │ 00800F00 │ 0800002A │ 2016-10-06 │ 0xC80 │ 0x510400 │ Yes  ║
╟────┼──────────┼──────────┼────────────┼───────┼──────────┼──────╢
║ 6  │ 00810F10 │ 08101016 │ 2019-04-30 │ 0xC80 │ 0x68A500 │ Yes  ║
╟────┼──────────┼──────────┼────────────┼───────┼──────────┼──────╢
║ 7  │ 00820F01 │ 08200103 │ 2019-04-17 │ 0xC80 │ 0x68B200 │ Yes  ║
╟────┼──────────┼──────────┼────────────┼───────┼──────────┼──────╢
║ 8  │ 00810F00 │ 08100004 │ 2016-11-20 │ 0xC80 │ 0x68BF00 │ Yes  ║
╟────┼──────────┼──────────┼────────────┼───────┼──────────┼──────╢
║ 9  │ 00810F80 │ 08108002 │ 2018-06-05 │ 0xC80 │ 0x68CC00 │ Yes  ║
╟────┼──────────┼──────────┼────────────┼───────┼──────────┼──────╢
║ 10 │ 00810F81 │ 08108109 │ 2019-04-17 │ 0xC80 │ 0x68D900 │ Yes  ║
╟────┼──────────┼──────────┼────────────┼───────┼──────────┼──────╢
║ 11 │ 00810F11 │ 08101102 │ 2018-11-06 │ 0xC80 │ 0x68E600 │ Yes  ║
╟────┼──────────┼──────────┼────────────┼───────┼──────────┼──────╢
║ 12 │ 00870F10 │ 08701012 │ 2019-05-08 │ 0xC80 │ 0x84F200 │ Yes  ║
╟────┼──────────┼──────────┼────────────┼───────┼──────────┼──────╢
║ 13 │ 00870F00 │ 08700004 │ 2018-12-06 │ 0xC80 │ 0x84FF00 │ Yes  ║
╚════╧══════════╧══════════╧════════════╧═══════╧══════════╧══════╝

Maybe this solves the problem? :)
Comment 613 OptionalRealName 2019-06-25 11:14:29 UTC
So when do we get our first post, with the same problems, from X570 users, with Ryzen 3600 / 3600x / 3800 etc?

I'm very curious to know if this issue magically is gone on the next series.
Comment 614 C0rn3j 2019-06-25 11:49:37 UTC
No workarounds fixed my 2400G lockups.
I RMAd the CPU (got a new one) and motherboard (because the firmware killed itself) and got it repaired.

Since then I haven't had a single lockup with all the workarounds disabled, so I'd say that CPUs from before a certain period were possibly manufactured with defects, or my unit was just defective.
Comment 615 Chris Hall 2019-06-30 12:44:57 UTC
FWIW: I have an ASUS Prime 370-Pro, Ryzen 7 1800X, currently running Kernel v5.0.17.

With "Typical Current Idle", and no kernel tweaks, that has run without "freezing while idle" for about a year.

Yesterday I upgraded the BIOS, which for some reason reset the "Power Supply Idle Control" to "Auto".  That prompted me to run a small experiment to see if kernel or other upgrades have improved things.  They have not: the machine "froze while idle" (fairly promptly).  I have now restored "Typical Current Idle", in the hope of another untroubled year.

I note, however, that my CPU (0x00800f11) is running with microcode 0x08001137, not the more recent 0x08001138 -- if I am reading the table in Comment 612, above, correctly.
Comment 616 OptionalRealName 2019-07-13 23:11:31 UTC
No 3000 owners yet? Is it safe to buy?
Comment 617 JerryD 2019-07-13 23:22:33 UTC
(In reply to OptionalRealName from comment #616)
> No 3000 owners yet? Is it safe to buy?

Google for Linux benchmarks, etc. You will find people are running these things. There is one BIOS update push I read about regarding some Windows-related thing. I plan to buy a 3000 series setup soon. No fear here. I noticed HP just issued a new BIOS for my 2500U laptop (F.21); it has been running fine on F.20 and likewise after the upgrade, so I suspect it was a 3000-related update. They are also selling complete systems as well.
Comment 618 linuxannoyance 2019-07-15 18:58:38 UTC
I have to kind of wonder, doesn't AMD like, want to sell these chips (in Epyc) to server farms? Ones that will run Linux? Or do Epyc/TR just not have this problem? My experience with this bug has been a bit bizarre, especially since AMD's GPUs have really great Linux support these days.

My report:
- Ryzen 1600
- CORSAIR Vengeance LPX 16GB (2 x 8GB) 288-Pin DDR4 SDRAM DDR4 3000 (PC4 24000)
- Gigabyte AB-350 Gaming 3
- RX 460 Sapphire 2GB
- TP-Link TL-WN722N USB Wifi -- This is its own bag of worms
- misc: SATA SSDs, HDDs
I've had this crash problem ever since I built this PC; however, for a very long time the LTS kernel 4.14 didn't have the crashing/hanging/random reboots. Usually, if I installed a new kernel, that kernel would be stable for a week or so (including lots of suspend/resume) and then develop the crashing problem, which meant I mainly just stayed on 4.14. But today was a special day, because 4.14 has now developed the problem as well!
IDK if this has been discussed further up already, but I'd like to highlight two of those points:
- freshly installed kernels don't have the problem at first? what could cause this?
- 4.14 didn't have it either? Again, what did they change to break it?
Moreover, the crashing is WORSE THAN EVER now. Like hourly. 

So I came back to this thread and I'm trying disabling Global C-States, as my older motherboard firmware doesn't have the other options that have been mentioned ITT. I'll wait and see if that solves it, and if not, I'll try the python script, and then maybe updating my motherboard so I can access more settings.
Comment 620 angel.forest.public 2019-07-23 01:26:10 UTC
Currently running ASRock AB350M Pro4 + Ryzen 1600
Ever since I updated BIOS to 5.50 and set "Typical Current Idle" setting (couple months ago) I haven't had a single hang.
Comment 621 Alberto 2019-08-14 18:44:01 UTC
(In reply to OptionalRealName from comment #616)
> No 3000 owners yet? Is it safe to buy?

I have one, but I think I am going to return the laptop. It works as expected, but in terms of battery life Ryzen on Linux is poor.

PS: Windows draws 5 W at idle.
Linux draws 12 W at idle on the same laptop with kernels 5.0, 5.1, 5.2, 5.3... A BIG problem for laptops. I don't think I will buy another Ryzen laptop, and I don't know if I am the only person who has noticed this consumption with a Ryzen mobile part...
Another thing: Linux only recognizes 3 idle states, which are:
CPUidle driver: acpi_idle
CPUidle governor: menu
analyzing CPU 0:

Number of idle states: 3
Available idle states: POLL C1 C2
POLL:
Flags/Description: CPUIDLE CORE POLL IDLE
Latency: 0
Usage: 1270
Duration: 2635
C1:
Flags/Description: ACPI FFH MWAIT 0x0
Latency: 1
Usage: 60347
Duration: 17870643
C2:
Flags/Description: ACPI IOPORT 0x414
Latency: 400
Usage: 22640
Duration: 59696920
I don't know whether these are integrated-GPU problems or CPU problems, but I think they are CPU ones for several reasons. One of them is that lscpu and cpupower report hardware limits of 1.4 GHz - 2.1 GHz (in a laptop!), whereas Windows reports 400 MHz as the minimum.
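(The same idle-state information can also be read straight from sysfs without cpupower; a minimal sketch, assuming the standard cpuidle sysfs layout under /sys/devices/system/cpu/.)

```
#!/usr/bin/env python3
# Sketch: dump the cpuidle states the kernel exposes for CPU 0, roughly the
# same data "cpupower idle-info" prints. Assumes the standard sysfs layout
# under /sys/devices/system/cpu/.
import glob
import os

def read(path):
    with open(path) as f:
        return f.read().strip()

print("CPUidle driver:", read("/sys/devices/system/cpu/cpuidle/current_driver"))
base = "/sys/devices/system/cpu/cpu0/cpuidle"
for state in sorted(glob.glob(os.path.join(base, "state*"))):
    print(read(os.path.join(state, "name")),
          "| desc:", read(os.path.join(state, "desc")),
          "| latency:", read(os.path.join(state, "latency")),
          "| usage:", read(os.path.join(state, "usage")),
          "| time:", read(os.path.join(state, "time")))
```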
Comment 622 C0rn3j 2019-08-14 19:23:43 UTC
My experience:
I bought a 2400G and a 3600. Both CPUs had lockups, and neither UEFI settings nor messing with C-states helped.
What helped was an RMA in both cases; the RMA'd CPUs do not have lockups.
Comment 623 eric.c.morgan 2019-08-15 17:45:49 UTC
Asrock Fatal1ty X370 Gaming-ITX/ac
Ryzen 1700

After the most recent Asrock BIOS update I haven't had any more crashing. I do not disable C6 anymore. This was a BIOS update to support the new Ryzen 3xxx chips as I was eyeing the 3900x and wanted the motherboard to support said chip if I had a weak moment and actually bought a 3900x.
Comment 624 zheilbron 2019-08-16 20:58:05 UTC
As @C0rn3j mentioned, I'm in a similar boat. With a Ryzen 3700x, I continue to get the following errors:

```
rcu: INFO: rcu_sched self-detected stall on CPU
...
watchdog: BUG: soft lockup - CPU#1 stuck for 23s! [kworker/1:1:94] 
```

I've tried just about every workaround that I've read about:
* Disable C5/C6 states (from BIOS, from zenstates.py)
* Set "Typical Current Idle" in BIOS
* Add a combination of kernel parameters (idle=nomwait, noapic, acpi=off, rcu_nocbs=0-15, processor.max_cstate=1)

Nothing works.

System specs:
Linux ... 4.19.0-5-amd64 #1 SMP Debian 4.19.37-5+deb10u2 (2019-08-08) x86_64
MSI B450 Gaming Pro Carbon AC (BIOS version 7B85v18)
Ryzen 7 3700X
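(For reference, the zenstates.py approach to disabling C6 boils down to clearing a handful of MSR bits. Below is a minimal sketch of that idea; the MSR numbers and bit positions are taken from zenstates.py and are an assumption here, not verified against AMD documentation. It needs root and the msr kernel module, and is run at your own risk.)

```
#!/usr/bin/env python3
# Minimal sketch of the zenstates.py-style C6 disable: clear the package-C6
# enable bit and the three core-C6 enable bits on every CPU via /dev/cpu/*/msr.
# ASSUMPTIONS: MSR 0xC0010292 bit 32 (package C6) and MSR 0xC0010296 bits
# 22/14/6 (core C6), as used by zenstates.py; not verified against AMD docs.
# Requires root and "modprobe msr". Use at your own risk.
import glob
import os
import struct

MSR_PKG_C6, PKG_C6_BIT = 0xC0010292, 1 << 32
MSR_CORE_C6, CORE_C6_BITS = 0xC0010296, (1 << 22) | (1 << 14) | (1 << 6)

def rdmsr(dev, reg):
    fd = os.open(dev, os.O_RDONLY)
    try:
        return struct.unpack("<Q", os.pread(fd, 8, reg))[0]
    finally:
        os.close(fd)

def wrmsr(dev, reg, value):
    fd = os.open(dev, os.O_WRONLY)
    try:
        os.pwrite(fd, struct.pack("<Q", value), reg)
    finally:
        os.close(fd)

for dev in sorted(glob.glob("/dev/cpu/*/msr")):
    wrmsr(dev, MSR_PKG_C6, rdmsr(dev, MSR_PKG_C6) & ~PKG_C6_BIT)
    wrmsr(dev, MSR_CORE_C6, rdmsr(dev, MSR_CORE_C6) & ~CORE_C6_BITS)
    print(dev, "package and core C6 disabled")
```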
Comment 625 zheilbron 2019-08-16 21:55:39 UTC
(In reply to zheilbron from comment #624)
> As @C0rn3j mentioned, I'm in a similar boat. With a Ryzen 3700x, I continue
> to get the following errors:
> 
> ```
> rcu: INFO: rcu_sched self-detected stall on CPU
> ...
> watchdog: BUG: soft lockup - CPU#1 stuck for 23s! [kworker/1:1:94] 
> ```
> 
> I've tried just about every workaround that I've read about:
> * Disable C5/C6 states (from BIOS, from zenstates.py)
> * Set "Typical Current Idle" in BIOS
> * Add a combination of kernel parameters (idle=nomwait, noapic, acpi=off,
> rcu_nocbs=0-15, processor.max_cstate=1)
> 
> Nothing works.
> 
> System specs:
> Linux ... 4.19.0-5-amd64 #1 SMP Debian 4.19.37-5+deb10u2 (2019-08-08) x86_64
> MSI B450 Gaming Pro Carbon AC (BIOS version 7B85v18)
> Ryzen 7 3700X

It seems that leaving the machine idle does not produce the issue. However, connecting over Wireguard + SSH (which is how I had been accessing the machine) seems to cause the issue to manifest. After following the advice here (https://bbs.archlinux.org/viewtopic.php?id=247900), I've added the `nordrand` boot flag. This seems to be keeping the issue at bay so far. Perhaps I was chasing the wrong issue!

Time will tell... I'll report back with any new findings or if it remains stable for some time.
Comment 626 JerryD 2019-08-17 00:10:10 UTC
--- snip ---
> It seems that leaving the machine idle does not produce the issue. However,
> connecting over Wireguard + SSH (which is how I had been accessing the
> machine) seems to cause the issue to manifest. After following the advice
> here (https://bbs.archlinux.org/viewtopic.php?id=247900), I've added the
> `nordrand` boot flag. This seems to be keeping the issue at bay so far.
> Perhaps I was chasing the wrong issue!
> 
> Time will tell... I'll report back with any new findings or if it remains
> stable for some time.

I have just last week installed Fedora 30 on a new Ryzen 3600X. The system would not boot due to a bug in systemd which has been fixed, but the FedoraLive image used to install does not have this fix. Fortunately, this is fixed in a respin of the live image that one can use. It has been running to perfection for my needs now for over a week without any kernel boot parameters needed. Clean as a whistle. So, your fix ought to be good.
Comment 628 Another User 2019-08-20 16:05:44 UTC
The 3rd generation of Zen has been released and this strange thing still persists. So "No fix planned" was not a lie...


Kernel parameter idle=halt fixed my problem for R3 2200u (no lockups for 7 months). You may give it a try.
...just a reminder.
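(A quick way to confirm what the running kernel actually ended up using after adding idle=halt or idle=nomwait; a minimal sketch, assuming the usual /proc and sysfs paths.)

```
#!/usr/bin/env python3
# Sketch: show the current kernel command line and which cpuidle driver and
# governor the kernel actually selected, to verify that idle=halt or
# idle=nomwait took effect after a reboot.
def read(path):
    try:
        with open(path) as f:
            return f.read().strip()
    except OSError as err:
        return f"<unreadable: {err}>"

print("cmdline: ", read("/proc/cmdline"))
print("driver:  ", read("/sys/devices/system/cpu/cpuidle/current_driver"))
print("governor:", read("/sys/devices/system/cpu/cpuidle/current_governor_ro"))
```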
Comment 629 pmenzel 2019-08-20 16:06:02 UTC
Dear Madam or Sir,


Thank you for your message, I’ll read after my return on August 27th, 2019.


Kind regards,

Paul Menzel
Comment 630 zheilbron 2019-08-21 09:48:24 UTC
Reporting back as I said I would: with `nordrand` set as a kernel boot parameter, the system is stable.
Comment 631 txrx 2019-08-21 13:39:46 UTC
(In reply to zheilbron from comment #630)
> Reporting back as I said I would: with `nordrand` set as a kernel boot
> parameter, the system is stable.

Are you using nordrand in combination with other workarounds (BIOS settings, ZenStates)?
My hardware specs are identical and my system still freezes using nordrand.

Ryzen 3700X
MSI B450 Gaming Pro Carbon AC (BIOS version 7B85v17, not v18!)
Linux 5.0.0-25-generic #26~18.04.1-Ubuntu SMP Thu Aug 1 13:51:02 UTC 2019 x86_64

On the other hand, MSI could have fixed these problems with the new BIOS.
Comment 632 zheilbron 2019-08-22 06:55:47 UTC
I'm not using any other workarounds in conjunction with `nordrand`. Perhaps try upgrading to the new BIOS. Note also that my kernel version differs from yours, but it seems unlikely (though possible) that there's a regression across those versions.
Comment 633 pmenzel 2019-08-22 06:56:04 UTC
Dear Madam or Sir,


Thank you for your message, I’ll read after my return on August 27th, 2019.


Kind regards,

Paul Menzel
Comment 634 raul 2019-08-23 02:15:36 UTC
My system reboots randomly, mostly at idle. I have tried the 'powersupply idle' UEFI option, but it didn't work. I have changed motherboard, RAM and PSU, but it still reboots/crashes, doesn't matter if it's overclocked or not. The system is not reliable.

Currently an ASUS CROSSHAIR VI HERO, BIOS 7201 07/12/2019.
The previous motherboard was a Gigabyte Aorus Ultra Gaming X470 with BIOS F40.

Now I'm waiting for the next reboot (if it happens) after disabling C6 global states in UEFI (core C6 as per zenstates.py). The next step would be to flash a new BIOS file containing AGESA 1.0.0.3ABB. Comment https://bugzilla.kernel.org/show_bug.cgi?id=196683#c623 reports no more stability problems with a BIOS providing that AGESA version.

Has anyone used the xHCI debug capability to at least know what happens just before the crash/reboot?
https://www.kernel.org/doc/html/latest/driver-api/usb/usb3-debug-port.html

AFAIK, the chipset should support DbC, as the file "dbc" is listed under the xHCI device node in sysfs and I can echo "enable" to it.
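(In case someone wants to try it: a minimal sketch of flipping that sysfs switch. The PCI address below is a placeholder, not a value from this report; find the xHCI controller's address with lspci first, and see the usb3-debug-port documentation linked above for the cabling requirements.)

```
#!/usr/bin/env python3
# Sketch: enable the xHCI Debug Capability (DbC) by writing "enable" to the
# controller's sysfs "dbc" attribute, as described in the usb3-debug-port
# kernel documentation. The PCI address is a PLACEHOLDER; replace it with
# your xHCI controller's address (see "lspci | grep -i xhci").
import os
import sys

XHCI_PCI_ADDR = "0000:03:00.3"  # placeholder, not taken from this report
dbc = f"/sys/bus/pci/devices/{XHCI_PCI_ADDR}/dbc"

if not os.path.exists(dbc):
    sys.exit(f"{dbc} not found: wrong PCI address, or DbC not supported")

with open(dbc) as f:
    print("dbc state before:", f.read().strip())
with open(dbc, "w") as f:
    f.write("enable")
with open(dbc) as f:
    print("dbc state after: ", f.read().strip())
```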
Comment 635 raul 2019-08-23 02:17:51 UTC
(In reply to raulvior.bcn from comment #634)

Ah, yes. I forgot to add kernel version:
Linux 5.0.0-25-generic #26-Ubuntu SMP Thu Aug 1 12:04:58 UTC 2019 x86_64 x86_64 x86_64 GNU/Linux

> My system reboots randomly, mostly at idle. I have tried the 'powersupply
> idle' UEFI option, but it didn't work. I have changed motherboard, RAM and
> PSU, but it still reboots/crashes, doesn't matter if it's overclocked or
> not. The system is not reliable.
> 
> Currently an ASUS CROSSHAIR VI HERO, BIOS 7201 07/12/2019.
> The previous motherboard was a Gigabyte Aorus Ultra Gaming X470 with BIOS
> F40.
> 
> Now I'm waiting for the next reboot (if it happens) after disabling C6
> global states in UEFI (core C6 as per zenstates.py). The next step would be
> to flash a new BIOS file containing AGESA 1.0.0.3ABB. Comment
> https://bugzilla.kernel.org/show_bug.cgi?id=196683#c623 reports no more
> stability problems with a BIOS providing that AGESA version.
> 
> Has anyone used the xHCI debug capability to at least know what happens just
> before the crash/reboot?
> https://www.kernel.org/doc/html/latest/driver-api/usb/usb3-debug-port.html
> 
> AFAIK, the chipset should support DbC, as the file "dbc" is listed under the
> xHCI device node in sysfs and echo "enable" to it.
Comment 636 linuxannoyance 2019-08-23 02:48:06 UTC
(In reply to linuxannoyance from comment #618)
After I posted this comment, I have not had any more random crashes. I disabled Global C-States in my motherboard's firmware, as that was the only option relating to C-states (C-6 states not being available). Based on my experience and other comments in this issue, I believe disabling Global C-States is what has caused my system to stabilize. I recommend this as a viable option for people with older motherboard firmware who don't want to flash it.

I still think this is an embarrassment for AMD. I do not know how they intend to take the server market when they allow an issue like this to persist. Maybe Epyc gets better support. Even so, I am a desktop and laptop Linux user, and I will mention this issue to everyone I meet who is looking for personal hardware to run Linux on.
Comment 637 raul 2019-08-28 12:20:10 UTC
(In reply to raulvior.bcn from comment #635)
ASUSTeK COMPUTER INC. CROSSHAIR VI HERO 7403 08/20/2019
AMD Ryzen 7 1800X Eight-Core Processor
16410MB
2560x1440 pixels
Radeon RX 580 Series (POLARIS10, DRM 3.27.0, 5.0.0-25-generic, LLVM 8.0.0)
Linux 5.0.0-25-generic (x86_64) #26-Ubuntu SMP Thu Aug 1 12:04:58 UTC 2019
GNU C library / (Ubuntu GLIBC 2.29-0ubuntu2) 2.29
Ubuntu 19.04
BOOT_IMAGE=/vmlinuz-5.0.0-25-generic root=UUID=XXXXXXXX-XXXX-XXXX-XXXX-XXXXXXXXXXXX ro quiet splash vt.handoff=1


With the latest BIOS (which provides AGESA 1.0.0.3ABB), the Power Supply Idle option did work. This option increases the core voltage. Knowing that increasing the core voltage stabilized the system for more than 48 hours, I suspected the "performance" governor could be problematic, because even though it made the CPU operate at higher frequencies, the voltage was not increased: values of 0.35, 0.5 or 0.85 V were still showing up despite the minimum frequency being 3 GHz instead of 2 GHz.

I have disabled the "Power Supply Idle" UEFI option and started using the "ondemand" governor, which has a lower minimum frequency of 2 GHz.

The system continues to be stable. This is the longest time I have had the system without reboots.

Linux reports the following firmware bugs:
[    0.065489] ACPI: [Firmware Bug]: BIOS _OSI(Linux) query ignored
[    0.586638] [Firmware Bug]: ACPI MWAIT C-state 0x0 not supported by HW (0x0)
[    0.586749] [Firmware Bug]: ACPI MWAIT C-state 0x0 not supported by HW (0x0)
[    0.586830] [Firmware Bug]: ACPI MWAIT C-state 0x0 not supported by HW (0x0)
[    0.586920] [Firmware Bug]: ACPI MWAIT C-state 0x0 not supported by HW (0x0)
[    0.586983] [Firmware Bug]: ACPI MWAIT C-state 0x0 not supported by HW (0x0)
[    0.587042] [Firmware Bug]: ACPI MWAIT C-state 0x0 not supported by HW (0x0)
[    0.587110] [Firmware Bug]: ACPI MWAIT C-state 0x0 not supported by HW (0x0)
[    0.587163] [Firmware Bug]: ACPI MWAIT C-state 0x0 not supported by HW (0x0)
[    0.587235] [Firmware Bug]: ACPI MWAIT C-state 0x0 not supported by HW (0x0)
[    0.587309] [Firmware Bug]: ACPI MWAIT C-state 0x0 not supported by HW (0x0)
[    0.587378] [Firmware Bug]: ACPI MWAIT C-state 0x0 not supported by HW (0x0)
[    0.587448] [Firmware Bug]: ACPI MWAIT C-state 0x0 not supported by HW (0x0)
[    0.587522] [Firmware Bug]: ACPI MWAIT C-state 0x0 not supported by HW (0x0)
[    0.587590] [Firmware Bug]: ACPI MWAIT C-state 0x0 not supported by HW (0x0)
[    0.587646] [Firmware Bug]: ACPI MWAIT C-state 0x0 not supported by HW (0x0)
[    0.587699] [Firmware Bug]: ACPI MWAIT C-state 0x0 not supported by HW (0x0)

Which effectively turns off MWAIT. cpupower idle-info output:

CPUidle driver: acpi_idle
CPUidle governor: menu
analyzing CPU 0:

Number of idle states: 3
Available idle states: POLL C1 C2
POLL:
Flags/Description: CPUIDLE CORE POLL IDLE
Latency: 0
Usage: 42208
Duration: 98245
C1:
Flags/Description: ACPI HLT
Latency: 1
Usage: 2147693
Duration: 605512928
C2:
Flags/Description: ACPI IOPORT 0x414
Latency: 400
Usage: 4338245
Duration: 71750833391


For my problem, which might not be the same as for the rest of the users here, not having MWAIT did not stop the system from rebooting randomly. And despite the kernel not detecting MWAIT, the processor still boosts up to 4.1 GHz as intended. C6 states are enabled; the CPU might enter them after entering C1 with HLT. Zenstates.py output:

P0 - Enabled - FID = 90 - DID = 8 - VID = 20 - Ratio = 36.00 - vCore = 1.35000
P1 - Enabled - FID = 80 - DID = 8 - VID = 2C - Ratio = 32.00 - vCore = 1.27500
P2 - Enabled - FID = 84 - DID = C - VID = 68 - Ratio = 22.00 - vCore = 0.90000
P3 - Disabled
P4 - Disabled
P5 - Disabled
P6 - Disabled
P7 - Disabled
C6 State - Package - Enabled
C6 State - Core - Enabled

I'll report in the future whether I have had another reboot. I'm monitoring the system with the Vitals GNOME extension (https://extensions.gnome.org/extension/1460/vitals/). The kernel module asus-wmi-sensors provides the actual data (https://github.com/electrified/asus-wmi-sensors).
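(The same sensors can also be polled directly from the hwmon sysfs tree that asus-wmi-sensors or k10temp expose, which is handy when logging to a serial or network console; a minimal sketch, assuming only the standard hwmon interface.)

```
#!/usr/bin/env python3
# Sketch: poll every hwmon voltage/temperature input once per second so that a
# voltage drop right before a hang can be spotted on a serial or network
# console. Assumes only the standard hwmon sysfs interface
# (/sys/class/hwmon/hwmon*/{name,in*_input,temp*_input}).
import glob
import os
import time

def read(path):
    try:
        with open(path) as f:
            return f.read().strip()
    except OSError:
        return None

while True:
    readings = []
    for hwmon in sorted(glob.glob("/sys/class/hwmon/hwmon*")):
        chip = read(os.path.join(hwmon, "name")) or "?"
        for sensor in sorted(glob.glob(os.path.join(hwmon, "in*_input"))):
            value = read(sensor)
            if value is not None:
                readings.append(f"{chip}/{os.path.basename(sensor)}={int(value) / 1000:.3f}V")
        for sensor in sorted(glob.glob(os.path.join(hwmon, "temp*_input"))):
            value = read(sensor)
            if value is not None:
                readings.append(f"{chip}/{os.path.basename(sensor)}={int(value) / 1000:.1f}C")
    print(time.strftime("%H:%M:%S"), " ".join(readings), flush=True)
    time.sleep(1)
```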
Comment 638 L.Y. Sim 2019-08-29 04:53:40 UTC
So since October of last year I have been having problems with my Arch + KDE system freezing at idle on my 2700X build. Freezes usually occur while a YouTube or Twitch video is playing; when it freezes, whatever audio is playing often keeps looping over and over. journalctl has no logs of anything at all after the system freezes.

To fix this problem, I had to make sure both that Power Supply Idle was set to 'Typical' and that my VCore undervolt wasn't too aggressive. Both conditions had to be satisfied: for a time I only set Power Supply Idle to 'Typical' while leaving my undervolt at -0.75 V, and even though that gave me no problems when the system was stressed (such as running Prime95 for close to 24 hours), it would cause the system to freeze at idle. All freezing stopped once I reduced my undervolt to -0.6875 V, and the system has been running for close to 3 months now, 16 hours a day, without a single hitch.

Just to be doubly sure that a fix has to require Power Supply Idle being set to 'Typical', I disabled it yesterday (while leaving my undervolt at -0.6875v), and it froze within a day. 

TLDR: Setting Power Supply Idle to 'Typical' and making sure that I wasn't running an overly aggressive VCore undervolt was what solved my freezing issues.

My hardware: Ryzen 2700X, ASUS X470 CROSSHAIR VII with 2304 BIOS, Seasonic Focus+ 80+ Gold 650W power supply.
Comment 639 L.Y. Sim 2019-08-29 05:10:15 UTC
Sorry, just to clarify my above comment (since there doesn't appear to be any way to edit my comment).

1. Freezes usually occur while the system is being left alone, but I noticed that it would also often occur while a YouTube video, Twitch stream or even music (via Clementine) was left playing in the background.

2. My motherboard is actually an ASUS X470 CROSSHAIR VII WI-FI edition with the 2304 BIOS.
Comment 640 JerryD 2019-08-30 18:30:27 UTC
Anyone here seeing this one: Bug 1738650 - Kernel 5.2.5 graphics unstable 

https://bugzilla.redhat.com/show_bug.cgi?id=1738650

Everything works great as long as I stay on kernel-5.1.20-300.fc30.x86_64
Comment 641 raul 2019-09-02 12:13:40 UTC
(In reply to raulvior.bcn from comment #637)

I got a reboot with the ondemand governor at 1d 14h of uptime.
I have reactivated the Typical Power Supply idle switch. Currently 4 days and 2 hours of uptime. The longest I've ever seen.

The minimum voltage reported by the UEFI is 0.83 V. Using the performance governor the minimum voltage reported is 1.26 V. This corresponds with P2 and P1 P-states voltages.

> (In reply to raulvior.bcn from comment #635)
> ASUSTeK COMPUTER INC. CROSSHAIR VI HERO 7403 08/20/2019
> AMD Ryzen 7 1800X Eight-Core Processor
> 16410MB
> 2560x1440 pixels
> Radeon RX 580 Series (POLARIS10, DRM 3.27.0, 5.0.0-25-generic, LLVM 8.0.0)
> Linux 5.0.0-25-generic (x86_64) #26-Ubuntu SMP Thu Aug 1 12:04:58 UTC 2019
> GNU C library / (Ubuntu GLIBC 2.29-0ubuntu2) 2.29
> Ubuntu 19.04
> BOOT_IMAGE=/vmlinuz-5.0.0-25-generic
> root=UUID=XXXXXXXX-XXXX-XXXX-XXXX-XXXXXXXXXXXX ro quiet splash vt.handoff=1
Comment 642 Nelson Castillo 2019-09-04 19:17:17 UTC
After one year, I am finally using my 2700x without any issues, latest Ubuntu 18.04, Linux 5.0.0-27-generic. No patches, no kernel options. I put the system to sleep everyday and it always comes back when I press a key the next day.

I updated the BIOS of the AB350M-DS3H to F31, the most recent update before 3rd Gen AMD support was added.

I also put in a PSU that I had around, a TT-450NL2NK; the previous one had a lower wattage. I left the Typical Current Idle option set, but I don't know whether it has any effect.

I also have a GeForce GTX 1050 Ti now.

I changed quite a few things, so I cannot really say what improved my setup. But I want to report that the system is finally stable.


Regards.

(In reply to Nelson Castillo from comment #399)
> Hello there. I read most of the thread a week ago.
> 
> My machine was freezing when idle and when not idle. I read most of this
> thread and thanks to it got a stable setup.
> 
> @Daniel: When you say that you think the problem is fixed, do you know what
> change/Linux version fixes the issue for most people?
> 
> -------------
> 
> Now my report.
> 
> I have the following setup:
> 
> Motherboard: AB350M-DS3H (F23d BIOS).
> CPU: Ryzen 2700X.
> No overclock.
> 
> I still wonder whether my CPU is doing too much extra work (I don't know
> what cpufreq would report if a C6 state was reached).
> 
> cpufreq stats:
> 
> CPU 0: 3.70 GHz:6.64%, 3.20 GHz:1.15%, 2.20 GHz:92.21%
> CPU15: 3.70 GHz:4.86%, 3.20 GHz:0.80%, 2.20 GHz:94.34%
> 
> Anyway, to make the things work I had to use three tweaks. I tried
> individual tweaks to no avail.
> 
> - Select in BIOS: "Typical Current Idle"
> - Start Linux with: idle=nomwait
> - Disable C6 states (both core and package) with Zenstates.py
> 
> Before doing the last step Zenstates.py reports Core enabled and Package
> disabled.
> 
> I'm using Ubuntu 18.04.01 LTS. I didn't compile Linux with
> CONFIG_RCU_NOCB_CPU / CONFIG_RCU_NOCB_CPU_ALL.
> 
> So, things are working for me. But if you think I should test a new Linux
> version that is supposed to fix the issue please let me know.
Comment 643 CodingEagle02 2019-09-05 10:37:48 UTC
I've been lightly following this thread, but it is still a seemingly unresolved issue that keeps me from using Linux. Has there been any real progress on this? Or is the resolution still a vague 'we more or less know where the issue is, but we're still waiting for someone upstream to take action to fix it'?
Comment 644 007lva 2019-09-05 11:09:19 UTC
I have a 2700X on an MSI X470 Gaming Plus. Updating the BIOS to support my new 3800X also fixed this issue for me. Both CPUs work well on Debian Buster without any extra kernel parameters and without disabling C6.
Comment 645 Boris 2019-09-06 07:45:25 UTC
Hi, I bought an Acer Aspire 3, A315-41G, Ryzen 5 2500U in January and since then I have had CPU lockups. I managed to install Arch, but I got a CPU lockup on every second boot; with every other distribution I got a CPU lockup in the installer... I never used the C6 disable script, but I tried all possible kernel parameters and none of them ever worked...

Three days ago I found a new BIOS patch, "Update PI code v1.1.0.8", from 2019/08/12 on the Acer web site. After installing it everything works: no lockups, every distribution I try works with no problems, and no GRUB or kernel parameters are needed...

Sorry for my bad English,

Regards
Comment 646 Rafal Kupiec 2019-09-11 06:52:17 UTC
Sep 10 21:13:56 BelliashPC kernel: BUG: kernel NULL pointer dereference, address: 000000000000000d
Sep 10 21:13:56 BelliashPC kernel: #PF: supervisor write access in kernel mode
Sep 10 21:13:56 BelliashPC kernel: #PF: error_code(0x0002) - not-present page
Sep 10 21:13:56 BelliashPC kernel: PGD 0 P4D 0 
Sep 10 21:13:56 BelliashPC kernel: Oops: 0002 [#1] SMP NOPTI
Sep 10 21:13:56 BelliashPC kernel: CPU: 11 PID: 30804 Comm: kworker/11:2 Tainted: P           O    T 5.2.13-gentoo #1
Sep 10 21:13:56 BelliashPC kernel: Hardware name: Gigabyte Technology Co., Ltd. B450 AORUS ELITE/B450 AORUS ELITE, BIOS F42a 07/31/2019
Sep 10 21:13:56 BelliashPC kernel: Workqueue:  0x0 (events)
Sep 10 21:13:56 BelliashPC kernel: RIP: 0010:worker_thread+0xfc/0x3b0
Sep 10 21:13:56 BelliashPC kernel: Code: 0f 84 8e 02 00 00 48 8b 4c 24 18 49 39 4f 38 0f 85 a9 02 00 00 83 e0 fb 41 89 47 60 ff 4a 34 49 8b 17 49 8b 47 08 48 89 42 08 <48> 89 10 4d 89 3f 4d 89 7f 08 48 8b 45 20 4c 8d 6d 20 49 39 c5 74
Sep 10 21:13:56 BelliashPC kernel: RSP: 0018:ffffaec608acfec0 EFLAGS: 00010002
Sep 10 21:13:56 BelliashPC kernel: RAX: 000000000000000d RBX: ffff8d78aecdfa80 RCX: ffff8d789fa1b400
Sep 10 21:13:56 BelliashPC kernel: RDX: ffff8d78a4112780 RSI: ffff8d78a91e56c0 RDI: ffff8d78aecdfa80
Sep 10 21:13:56 BelliashPC kernel: RBP: ffff8d78aecdfa80 R08: ffff8d78aece0180 R09: 0000000000001800
Sep 10 21:13:56 BelliashPC kernel: R10: 0000000000000000 R11: ffffffffffffffff R12: ffff8d78a4112b68
Sep 10 21:13:56 BelliashPC kernel: R13: ffff8d78aecdfaa0 R14: ffff8d78a4112b40 R15: ffff8d78a4112b40
Sep 10 21:13:56 BelliashPC kernel: FS:  0000000000000000(0000) GS:ffff8d78aecc0000(0000) knlGS:0000000000000000
Sep 10 21:13:56 BelliashPC kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Sep 10 21:13:56 BelliashPC kernel: CR2: 00000000000000b0 CR3: 000000069063a000 CR4: 0000000000340ee0
Sep 10 21:13:56 BelliashPC kernel: Call Trace:
Sep 10 21:13:56 BelliashPC kernel:  kthread+0xf8/0x130
Sep 10 21:13:56 BelliashPC kernel:  ? set_worker_desc+0xb0/0xb0
Sep 10 21:13:56 BelliashPC kernel:  ? kthread_park+0x80/0x80
Sep 10 21:13:56 BelliashPC kernel:  ret_from_fork+0x22/0x40
Sep 10 21:13:56 BelliashPC kernel: Modules linked in: nvidia_drm(PO) nvidia_modeset(PO) nvidia(PO) vboxpci(O) vboxnetadp(O) vboxnetflt(O) vboxdrv(O)
Sep 10 21:13:56 BelliashPC kernel: CR2: 000000000000000d
Sep 10 21:13:56 BelliashPC kernel: ---[ end trace 23be40ec0c6dea2a ]---
Sep 10 21:13:56 BelliashPC kernel: RIP: 0010:worker_thread+0xfc/0x3b0
Sep 10 21:13:56 BelliashPC kernel: Code: 0f 84 8e 02 00 00 48 8b 4c 24 18 49 39 4f 38 0f 85 a9 02 00 00 83 e0 fb 41 89 47 60 ff 4a 34 49 8b 17 49 8b 47 08 48 89 42 08 <48> 89 10 4d 89 3f 4d 89 7f 08 48 8b 45 20 4c 8d 6d 20 49 39 c5 74
Sep 10 21:13:56 BelliashPC kernel: RSP: 0018:ffffaec608acfec0 EFLAGS: 00010002
Sep 10 21:13:56 BelliashPC kernel: RAX: 000000000000000d RBX: ffff8d78aecdfa80 RCX: ffff8d789fa1b400
Sep 10 21:13:56 BelliashPC kernel: RDX: ffff8d78a4112780 RSI: ffff8d78a91e56c0 RDI: ffff8d78aecdfa80
Sep 10 21:13:56 BelliashPC kernel: RBP: ffff8d78aecdfa80 R08: ffff8d78aece0180 R09: 0000000000001800
Sep 10 21:13:56 BelliashPC kernel: R10: 0000000000000000 R11: ffffffffffffffff R12: ffff8d78a4112b68
Sep 10 21:13:56 BelliashPC kernel: R13: ffff8d78aecdfaa0 R14: ffff8d78a4112b40 R15: ffff8d78a4112b40
Sep 10 21:13:56 BelliashPC kernel: FS:  0000000000000000(0000) GS:ffff8d78aecc0000(0000) knlGS:0000000000000000
Sep 10 21:13:56 BelliashPC kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Sep 10 21:13:56 BelliashPC kernel: CR2: 00000000000000b0 CR3: 000000069063a000 CR4: 0000000000340ee0
Sep 10 21:14:16 BelliashPC kernel: NVRM: GPU at PCI:0000:06:00: GPU-4c9bfa83-5407-92ac-ce82-441bb9e3ac79
Sep 10 21:14:16 BelliashPC kernel: NVRM: GPU Board Serial Number: 
Sep 10 21:14:16 BelliashPC kernel: NVRM: Xid (PCI:0000:06:00): 8, pid=3723, Channel 00000010
Sep 10 21:14:18 BelliashPC kernel: rcu: INFO: rcu_sched detected stalls on CPUs/tasks:
Sep 10 21:14:18 BelliashPC kernel: rcu: \x090-...0: (0 ticks this GP) idle=f26/1/0x4000000000000000 softirq=1165408/1165408 fqs=5246 
Sep 10 21:14:18 BelliashPC kernel: \x09(detected by 6, t=21002 jiffies, g=3474865, q=223)
Sep 10 21:14:18 BelliashPC kernel: Sending NMI from CPU 6 to CPUs 0:
Sep 10 21:14:18 BelliashPC kernel: NMI backtrace for cpu 0
Sep 10 21:14:18 BelliashPC kernel: CPU: 0 PID: 26618 Comm: kworker/0:0 Tainted: P      D    O    T 5.2.13-gentoo #1
Sep 10 21:14:18 BelliashPC kernel: Hardware name: Gigabyte Technology Co., Ltd. B450 AORUS ELITE/B450 AORUS ELITE, BIOS F42a 07/31/2019
Sep 10 21:14:18 BelliashPC kernel: Workqueue: events vmstat_shepherd
Sep 10 21:14:18 BelliashPC kernel: RIP: 0010:queued_spin_lock_slowpath+0x3d/0x1a0
Sep 10 21:14:18 BelliashPC kernel: Code: 3e f0 0f ba 2f 08 0f 92 c0 0f b6 c0 c1 e0 08 89 c2 8b 07 30 e4 09 d0 a9 00 01 ff ff 75 18 85 c0 74 0e 8b 07 84 c0 74 08 f3 90 <8b> 07 84 c0 75 f8 66 c7 07 01 00 c3 f6 c4 01 75 04 c6 47 01 00 48
Sep 10 21:14:18 BelliashPC kernel: RSP: 0018:ffffaec609d3fe08 EFLAGS: 00000002
Sep 10 21:14:18 BelliashPC kernel: RAX: 0000000000300101 RBX: ffff8d78aece4900 RCX: 0000000000000000
Sep 10 21:14:18 BelliashPC kernel: RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffff8d78aecdfa80
Sep 10 21:14:18 BelliashPC kernel: RBP: 000000000000002a R08: ffff8d78ae001da8 R09: 00000000000000d8
Sep 10 21:14:18 BelliashPC kernel: R10: 0000000000000000 R11: ffff8d78ae001e80 R12: ffff8d78ab174e00
Sep 10 21:14:18 BelliashPC kernel: R13: ffff8d78aecdd4a0 R14: 000000000001d808 R15: 000000000000000b
Sep 10 21:14:18 BelliashPC kernel: FS:  0000000000000000(0000) GS:ffff8d78aea00000(0000) knlGS:0000000000000000
Sep 10 21:14:18 BelliashPC kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Sep 10 21:14:18 BelliashPC kernel: CR2: 00007f0232d5f8b0 CR3: 00000007de9a8000 CR4: 0000000000340ef0
Sep 10 21:14:18 BelliashPC kernel: Call Trace:
Sep 10 21:14:18 BelliashPC kernel:  __queue_work+0x95/0x2f0
Sep 10 21:14:18 BelliashPC kernel:  ? __switch_to_asm+0x40/0x70
Sep 10 21:14:18 BelliashPC kernel:  queue_delayed_work_on+0x17/0x20
Sep 10 21:14:18 BelliashPC kernel:  vmstat_shepherd+0x61/0xa0
Sep 10 21:14:18 BelliashPC kernel:  process_one_work+0x19a/0x310
Sep 10 21:14:18 BelliashPC kernel:  worker_thread+0x45/0x3b0
Sep 10 21:14:18 BelliashPC kernel:  kthread+0xf8/0x130
Sep 10 21:14:18 BelliashPC kernel:  ? set_worker_desc+0xb0/0xb0
Sep 10 21:14:18 BelliashPC kernel:  ? kthread_park+0x80/0x80
Sep 10 21:14:18 BelliashPC kernel:  ret_from_fork+0x22/0x40
Sep 10 21:14:18 BelliashPC kernel: NVRM: os_schedule: Attempted to yield the CPU while in atomic or interrupt context
Sep 10 21:14:20 BelliashPC kernel: NVRM: os_schedule: Attempted to yield the CPU while in atomic or interrupt context



This is what happens on my Ryzen 7 3700X CPU installed on a Gigabyte B450 Aorus Elite motherboard with the 5.2.13 kernel. The BIOS already has AGESA 1.0.0.3ABB flashed. I tried zenstates.py as well as CONFIG_RCU_NOCB_CPU=y in the kernel and passing rcu_nocbs=0-15 as a kernel parameter. None of this helps.
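(Whether the running kernel was even built with CONFIG_RCU_NOCB_CPU, and whether rcu_nocbs= actually made it onto the command line, can be checked quickly; a minimal sketch, assuming /proc/config.gz or the usual /boot/config-<release> file is present.)

```
#!/usr/bin/env python3
# Sketch: check whether the running kernel was built with CONFIG_RCU_NOCB_CPU
# and whether rcu_nocbs= is actually on the command line. Assumes either
# /proc/config.gz (CONFIG_IKCONFIG_PROC) or the distro's /boot/config-<release>.
import gzip
import os

release = os.uname().release
if os.path.exists("/proc/config.gz"):
    with gzip.open("/proc/config.gz", "rt") as f:
        config = f.read()
else:
    with open(f"/boot/config-{release}") as f:
        config = f.read()

with open("/proc/cmdline") as f:
    cmdline = f.read()

print("CONFIG_RCU_NOCB_CPU=y in kernel config:", "CONFIG_RCU_NOCB_CPU=y" in config)
print("rcu_nocbs= present on kernel cmdline:  ", "rcu_nocbs=" in cmdline)
```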
Comment 647 Rafal Kupiec 2019-09-11 06:53:58 UTC
And another example of calltrace:

Sep  5 21:25:32 BelliashPC kernel: NMI backtrace for cpu 1
Sep  5 21:25:32 BelliashPC kernel: CPU: 1 PID: 13564 Comm: DOM File Tainted: P      D    O    T 5.2.11-gentoo #1
Sep  5 21:25:32 BelliashPC kernel: Hardware name: Gigabyte Technology Co., Ltd. B450 AORUS ELITE/B450 AORUS ELITE, BIOS F42a 07/31/2019
Sep  5 21:25:32 BelliashPC kernel: Call Trace:
Sep  5 21:25:32 BelliashPC kernel:  <IRQ>
Sep  5 21:25:32 BelliashPC kernel:  dump_stack+0x46/0x60
Sep  5 21:25:32 BelliashPC kernel:  nmi_cpu_backtrace.cold+0x14/0x53
Sep  5 21:25:32 BelliashPC kernel:  ? lapic_can_unplug_cpu.cold+0x42/0x42
Sep  5 21:25:32 BelliashPC kernel:  nmi_trigger_cpumask_backtrace+0x89/0x8b
Sep  5 21:25:32 BelliashPC kernel:  rcu_dump_cpu_stacks+0x7b/0xa9
Sep  5 21:25:32 BelliashPC kernel:  rcu_sched_clock_irq.cold+0x1a2/0x38d
Sep  5 21:25:32 BelliashPC kernel:  ? tick_sched_do_timer+0x50/0x50
Sep  5 21:25:32 BelliashPC kernel:  update_process_times+0x24/0x60
Sep  5 21:25:32 BelliashPC kernel:  tick_sched_timer+0x33/0x70
Sep  5 21:25:32 BelliashPC kernel:  __hrtimer_run_queues+0xe7/0x180
Sep  5 21:25:32 BelliashPC kernel:  hrtimer_interrupt+0x100/0x220
Sep  5 21:25:32 BelliashPC kernel:  smp_apic_timer_interrupt+0x56/0x90
Sep  5 21:25:32 BelliashPC kernel:  apic_timer_interrupt+0xf/0x20
Sep  5 21:25:32 BelliashPC kernel:  </IRQ>
Sep  5 21:25:32 BelliashPC kernel: RIP: 0010:queued_spin_lock_slowpath+0x3f/0x1a0
Sep  5 21:25:32 BelliashPC kernel: Code: 0f ba 2f 08 0f 92 c0 0f b6 c0 c1 e0 08 89 c2 8b 07 30 e4 09 d0 a9 00 01 ff ff 75 18 85 c0 74 0e 8b 07 84 c0 74 08 f3 90 8b 07 <84> c0 75 f8 66 c7 07 01 00 c3 f6 c4 01 75 04 c6 47 01 00 48 c7 c0
Sep  5 21:25:32 BelliashPC kernel: RSP: 0018:ffffa299085afc48 EFLAGS: 00000202 ORIG_RAX: ffffffffffffff13
Sep  5 21:25:32 BelliashPC kernel: RAX: 00000000003c0101 RBX: ffffa299085afd10 RCX: 000000008a2ee8ff
Sep  5 21:25:32 BelliashPC kernel: RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffffa299034fdb44
Sep  5 21:25:32 BelliashPC kernel: RBP: ffffa299085afcc0 R08: ffffa299085afcc0 R09: 0000000000000000
Sep  5 21:25:32 BelliashPC kernel: R10: 0000000000000000 R11: 0000000000000000 R12: 00007fa4d675aa38
Sep  5 21:25:32 BelliashPC kernel: R13: 0000000000000000 R14: 0000000000000000 R15: ffffa299034fdb40
Sep  5 21:25:32 BelliashPC kernel:  futex_wait_setup+0x77/0x110
Sep  5 21:25:32 BelliashPC kernel:  futex_wait+0xcf/0x230
Sep  5 21:25:32 BelliashPC kernel:  ? ep_poll_callback+0x256/0x280
Sep  5 21:25:32 BelliashPC kernel:  do_futex+0x15e/0xc60
Sep  5 21:25:32 BelliashPC kernel:  ? pipe_write+0x382/0x410
Sep  5 21:25:32 BelliashPC kernel:  ? new_sync_write+0x12c/0x1d0
Sep  5 21:25:32 BelliashPC kernel:  __x64_sys_futex+0x137/0x170
Sep  5 21:25:32 BelliashPC kernel:  do_syscall_64+0x4a/0x110
Sep  5 21:25:32 BelliashPC kernel:  entry_SYSCALL_64_after_hwframe+0x44/0xa9
Sep  5 21:25:32 BelliashPC kernel: RIP: 0033:0x7fa4e76c7067
Sep  5 21:25:32 BelliashPC kernel: Code: 5c 24 68 48 89 44 24 50 e8 96 38 00 00 e8 c1 3d 00 00 89 de 41 89 c1 40 80 f6 80 45 31 d2 31 d2 4c 89 ff b8 ca 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 71 44 89 cf e8 f9 3d 00 00 31 f6 4c 89 f7 e8
Sep  5 21:25:32 BelliashPC kernel: RSP: 002b:00007fa4d7c5eac0 EFLAGS: 00000246 ORIG_RAX: 00000000000000ca
Sep  5 21:25:32 BelliashPC kernel: RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007fa4e76c7067
Sep  5 21:25:32 BelliashPC kernel: RDX: 0000000000000000 RSI: 0000000000000080 RDI: 00007fa4d675aa38
Sep  5 21:25:32 BelliashPC kernel: RBP: 0000000000000000 R08: 00007fa4d675a9e0 R09: 0000000000000000
Sep  5 21:25:32 BelliashPC kernel: R10: 0000000000000000 R11: 0000000000000246 R12: 00007fa4d675aa10
Sep  5 21:25:32 BelliashPC kernel: R13: 00007fa4d675a9e0 R14: 00007fa4d7c5eaf0 R15: 00007fa4d675aa38
Comment 648 Borislav Petkov 2019-09-11 07:08:36 UTC
> Sep 10 21:13:56 BelliashPC kernel: Workqueue:  0x0 (events)
> Sep 10 21:13:56 BelliashPC kernel: RIP: 0010:worker_thread+0xfc/0x3b0
> Sep 10 21:13:56 BelliashPC kernel: Code: 0f 84 8e 02 00 00 48 8b 4c 24 18 49
> 39 4f 38 0f 85 a9 02 00 00 83 e0 fb 41 89 47 60 ff 4a 34 49 8b 17 49 8b 47
> 08 48 89 42 08 <48> 89 10 4d 89 3f 4d 89 7f 08 48 8b 45 20 4c 8d 6d 20 49 39
> c5 74
> Sep 10 21:13:56 BelliashPC kernel: RSP: 0018:ffffaec608acfec0 EFLAGS:
> 00010002
> Sep 10 21:13:56 BelliashPC kernel: RAX: 000000000000000d RBX:
> ffff8d78aecdfa80 RCX: ffff8d789fa1b400
> Sep 10 21:13:56 BelliashPC kernel: RDX: ffff8d78a4112780 RSI:
> ffff8d78a91e56c0 RDI: ffff8d78aecdfa80
> Sep 10 21:13:56 BelliashPC kernel: RBP: ffff8d78aecdfa80 R08:
> ffff8d78aece0180 R09: 0000000000001800
> Sep 10 21:13:56 BelliashPC kernel: R10: 0000000000000000 R11:
> ffffffffffffffff R12: ffff8d78a4112b68
> Sep 10 21:13:56 BelliashPC kernel: R13: ffff8d78aecdfaa0 R14:
> ffff8d78a4112b40 R15: ffff8d78a4112b40
> Sep 10 21:13:56 BelliashPC kernel: FS:  0000000000000000(0000)
> GS:ffff8d78aecc0000(0000) knlGS:0000000000000000
> Sep 10 21:13:56 BelliashPC kernel: CS:  0010 DS: 0000 ES: 0000 CR0:
> 0000000080050033
> Sep 10 21:13:56 BelliashPC kernel: CR2: 00000000000000b0 CR3:
> 000000069063a000 CR4: 0000000000340ee0
> Sep 10 21:13:56 BelliashPC kernel: Call Trace:
> Sep 10 21:13:56 BelliashPC kernel:  kthread+0xf8/0x130
> Sep 10 21:13:56 BelliashPC kernel:  ? set_worker_desc+0xb0/0xb0
> Sep 10 21:13:56 BelliashPC kernel:  ? kthread_park+0x80/0x80
> Sep 10 21:13:56 BelliashPC kernel:  ret_from_fork+0x22/0x40
> Sep 10 21:13:56 BelliashPC kernel: Modules linked in: nvidia_drm(PO)
> nvidia_modeset(PO) nvidia(PO) vboxpci(O) vboxnetadp(O) vboxnetflt(O)
> vboxdrv(O)

You have proprietary crap loaded.

Try reproducing it with the latest upstream kernel and *without* that
nvidia* and vbox* gunk.

If you can, please open a separate bug and add me to CC.

Thx.
Comment 649 Rafal Kupiec 2019-09-11 07:17:05 UTC
This also happened without vbox (I only installed it a few days ago) and without the nvidia proprietary drivers; I used to use nouveau with the same results. I will open a new ticket.
Comment 650 Alberto 2019-09-11 13:09:54 UTC
I have freezes on a Ryzen 7 2700U in the Acer Swift SF315-41 model; kernel tested: 5.2.
Comment 651 txrx 2019-09-14 12:35:03 UTC
I was able to update my BIOS to version 18, but my system still locks up.
I tried the following with the new BIOS:
 - use factory defaults
 - disable SMT
 - disable SMT with Typical Current Idle
 - all of the above with SVM disabled/enabled
Right now I set the power supply idle control to "Low ..." and will report back.

On a positive note, I got an error message:

Sep 14 13:33:04 user-MS-7B85 kernel: BUG: unable to handle kernel NULL pointer dereference at 0000000000000160
Sep 14 13:33:04 user-MS-7B85 kernel: #PF error: [normal kernel read fault]
Sep 14 13:33:04 user-MS-7B85 kernel: PGD 0 P4D 0 
Sep 14 13:33:04 user-MS-7B85 kernel: Oops: 0000 [#1] SMP NOPTI
Sep 14 13:33:04 user-MS-7B85 kernel: CPU: 6 PID: 1620 Comm: QXcbEventReader Not tainted 5.0.0-27-generic #28~18.04.1-Ubuntu
Sep 14 13:33:04 user-MS-7B85 kernel: Hardware name: Micro-Star International Co., Ltd. MS-7B85/B450 GAMING PRO CARBON AC (MS-7B85), BIOS 1.80 07/22/2019
Sep 14 13:33:04 user-MS-7B85 kernel: RIP: 0010:pick_next_task_fair+0x225/0x6f0
Sep 14 13:33:04 user-MS-7B85 kernel: Code: 40 78 48 3d 80 39 e2 b9 75 6c 4c 8b 65 b0 eb 2a 4c 89 e7 45 31 ed e8 aa c6 ff ff 84 c0 75 41 4c 89 e7 4c 89 ee e8 4b 3b ff ff <4c> 8b a0 60 01 00 00 4d 85 e4 0f 84 e9 00 00 00 4d 8b 6c 24 40 4d
Sep 14 13:33:04 user-MS-7B85 kernel: RSP: 0018:ffffaa51c34ff938 EFLAGS: 00010046
Sep 14 13:33:04 user-MS-7B85 kernel: RAX: 0000000000000000 RBX: ffff8e210e7a2d80 RCX: 000000000000000a
Sep 14 13:33:04 user-MS-7B85 kernel: RDX: ffffaa51c34ff9c0 RSI: 0000000000000000 RDI: ffff8e210e7a2e00
Sep 14 13:33:04 user-MS-7B85 kernel: RBP: ffffaa51c34ff998 R08: 00000000000f48d3 R09: ffff8e2109ca8a00
Sep 14 13:33:04 user-MS-7B85 kernel: R10: ffffaa51c34ff898 R11: 0000000000000001 R12: ffff8e210e7a2e00
Sep 14 13:33:04 user-MS-7B85 kernel: R13: 0000000000000000 R14: ffff8e21089820c0 R15: ffffaa51c34ff9c0
Sep 14 13:33:04 user-MS-7B85 kernel: FS:  00007fa560ab6700(0000) GS:ffff8e210e780000(0000) knlGS:0000000000000000
Sep 14 13:33:04 user-MS-7B85 kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Sep 14 13:33:04 user-MS-7B85 kernel: CR2: 0000000000000160 CR3: 00000004081dc000 CR4: 0000000000340ee0
Sep 14 13:33:04 user-MS-7B85 kernel: Call Trace:
Sep 14 13:33:04 user-MS-7B85 kernel:  __schedule+0x173/0x850
Sep 14 13:33:04 user-MS-7B85 kernel:  ? ep_poll_callback+0x93/0x2a0
Sep 14 13:33:04 user-MS-7B85 kernel:  schedule+0x2c/0x70
Sep 14 13:33:04 user-MS-7B85 kernel:  schedule_hrtimeout_range_clock+0x181/0x190
Sep 14 13:33:04 user-MS-7B85 kernel:  ? __wake_up_common+0x73/0x130
Sep 14 13:33:04 user-MS-7B85 kernel:  ? add_wait_queue+0x44/0x50
Sep 14 13:33:04 user-MS-7B85 kernel:  ? __pollwait+0xaf/0xe0
Sep 14 13:33:04 user-MS-7B85 kernel:  schedule_hrtimeout_range+0x13/0x20
Sep 14 13:33:04 user-MS-7B85 kernel:  poll_schedule_timeout.constprop.11+0x46/0x70
Sep 14 13:33:04 user-MS-7B85 kernel:  do_sys_poll+0x3d6/0x590
Sep 14 13:33:04 user-MS-7B85 kernel:  ? kmem_cache_free+0x1a7/0x1d0
Sep 14 13:33:04 user-MS-7B85 kernel:  ? kmem_cache_free+0x1a7/0x1d0
Sep 14 13:33:04 user-MS-7B85 kernel:  ? update_load_avg+0x8b/0x5a0
Sep 14 13:33:04 user-MS-7B85 kernel:  ? update_load_avg+0x8b/0x5a0
Sep 14 13:33:04 user-MS-7B85 kernel:  ? __enqueue_entity+0x5c/0x60
Sep 14 13:33:04 user-MS-7B85 kernel:  ? enqueue_entity+0x111/0x650
Sep 14 13:33:04 user-MS-7B85 kernel:  ? record_times+0x1b/0xc0
Sep 14 13:33:04 user-MS-7B85 kernel:  ? poll_select_copy_remaining+0x1b0/0x1b0
Sep 14 13:33:04 user-MS-7B85 kernel:  ? ttwu_do_wakeup+0x1e/0x140
Sep 14 13:33:04 user-MS-7B85 kernel:  ? ttwu_do_activate+0x77/0x80
Sep 14 13:33:04 user-MS-7B85 kernel:  ? try_to_wake_up+0x59/0x4c0
Sep 14 13:33:04 user-MS-7B85 kernel:  ? default_wake_function+0x12/0x20
Sep 14 13:33:04 user-MS-7B85 kernel:  ? pollwake+0x72/0x90
Sep 14 13:33:04 user-MS-7B85 kernel:  ? wake_up_q+0x80/0x80
Sep 14 13:33:04 user-MS-7B85 kernel:  ? __wake_up_common+0x73/0x130
Sep 14 13:33:04 user-MS-7B85 kernel:  ? __wake_up_locked_key+0x1b/0x20
Sep 14 13:33:04 user-MS-7B85 kernel:  ? vfs_write+0x16a/0x1a0
Sep 14 13:33:04 user-MS-7B85 kernel:  __x64_sys_poll+0x3b/0x140
Sep 14 13:33:04 user-MS-7B85 kernel:  ? __x64_sys_poll+0x3b/0x140
Sep 14 13:33:04 user-MS-7B85 kernel:  do_syscall_64+0x5a/0x120
Sep 14 13:33:04 user-MS-7B85 kernel:  entry_SYSCALL_64_after_hwframe+0x44/0xa9
Sep 14 13:33:04 user-MS-7B85 kernel: RIP: 0033:0x7fa579df3bf9
Sep 14 13:33:04 user-MS-7B85 kernel: Code: 90 41 54 55 41 89 d4 53 48 89 f5 48 89 fb 48 83 ec 10 e8 fa ba 01 00 44 89 e2 41 89 c0 48 89 ee 48 89 df b8 07 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 31 44 89 c7 89 44 24 0c e8 33 bb 01 00 8b 44
Sep 14 13:33:04 user-MS-7B85 kernel: RSP: 002b:00007fa560ab5c20 EFLAGS: 00000293 ORIG_RAX: 0000000000000007
Sep 14 13:33:04 user-MS-7B85 kernel: RAX: ffffffffffffffda RBX: 00007fa560ab5c68 RCX: 00007fa579df3bf9
Sep 14 13:33:04 user-MS-7B85 kernel: RDX: 00000000ffffffff RSI: 0000000000000001 RDI: 00007fa560ab5c68
Sep 14 13:33:04 user-MS-7B85 kernel: RBP: 0000000000000001 R08: 0000000000000000 R09: 0000000000000005
Sep 14 13:33:04 user-MS-7B85 kernel: R10: 001dd0f901a02da5 R11: 0000000000000293 R12: 00000000ffffffff
Sep 14 13:33:04 user-MS-7B85 kernel: R13: 0000000000000000 R14: 00005626664481c8 R15: 0000000000000000
Sep 14 13:33:04 user-MS-7B85 kernel: Modules linked in: rfcomm cmac bnep edac_mce_amd binfmt_misc kvm irqbypass nls_iso8859_1 crct10dif_pclmul crc32_pclmul ghash_clmulni_intel aesni_intel snd_hda_codec_realtek aes_x86_64 snd_hda_codec_generic crypto_simd cryptd ledtrig_audio glue_helper snd_hda_codec_hdmi btusb btrtl snd_hda_intel btbcm snd_usb_audio btintel snd_hda_codec snd_usbmidi_lib bluetooth snd_hda_core snd_hwdep snd_seq_midi snd_pcm joydev snd_seq_midi_event input_leds wmi_bmof snd_rawmidi ecdh_generic snd_seq snd_seq_device snd_timer snd ccp soundcore mac_hid sch_fq_codel parport_pc ppdev lp parport ip_tables x_tables autofs4 uas usb_storage hid_generic usbhid hid amdgpu chash amd_iommu_v2 gpu_sched ttm drm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops drm igb i2c_piix4 dca i2c_algo_bit ahci libahci wmi gpio_amdpt gpio_generic
Sep 14 13:33:04 user-MS-7B85 kernel: CR2: 0000000000000160
Sep 14 13:33:04 user-MS-7B85 kernel: ---[ end trace c1ca7818f368c3a6 ]---
Sep 14 13:33:04 user-MS-7B85 kernel: RIP: 0010:pick_next_task_fair+0x225/0x6f0
Sep 14 13:33:04 user-MS-7B85 kernel: Code: 40 78 48 3d 80 39 e2 b9 75 6c 4c 8b 65 b0 eb 2a 4c 89 e7 45 31 ed e8 aa c6 ff ff 84 c0 75 41 4c 89 e7 4c 89 ee e8 4b 3b ff ff <4c> 8b a0 60 01 00 00 4d 85 e4 0f 84 e9 00 00 00 4d 8b 6c 24 40 4d
Sep 14 13:33:04 user-MS-7B85 kernel: RSP: 0018:ffffaa51c34ff938 EFLAGS: 00010046
Sep 14 13:33:04 user-MS-7B85 kernel: RAX: 0000000000000000 RBX: ffff8e210e7a2d80 RCX: 000000000000000a
Sep 14 13:33:04 user-MS-7B85 kernel: RDX: ffffaa51c34ff9c0 RSI: 0000000000000000 RDI: ffff8e210e7a2e00
Sep 14 13:33:04 user-MS-7B85 kernel: RBP: ffffaa51c34ff998 R08: 00000000000f48d3 R09: ffff8e2109ca8a00
Sep 14 13:33:04 user-MS-7B85 kernel: R10: ffffaa51c34ff898 R11: 0000000000000001 R12: ffff8e210e7a2e00
Sep 14 13:33:04 user-MS-7B85 kernel: R13: 0000000000000000 R14: ffff8e21089820c0 R15: ffffaa51c34ff9c0
Sep 14 13:33:04 user-MS-7B85 kernel: FS:  00007fa560ab6700(0000) GS:ffff8e210e780000(0000) knlGS:0000000000000000
Sep 14 13:33:04 user-MS-7B85 kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Sep 14 13:33:04 user-MS-7B85 kernel: CR2: 0000000000000160 CR3: 00000004081dc000 CR4: 0000000000340ee0
Sep 14 13:33:04 user-MS-7B85 kernel: BUG: unable to handle kernel paging request at 00007fa560ab6fff
Sep 14 13:33:04 user-MS-7B85 kernel: #PF error: [INSTR]
Comment 652 Rafal Kupiec 2019-09-14 12:46:36 UTC
(In reply to txrx from comment #651)
> I was able to update my BIOS to version 18, but my system still locks up.
> I tried the following with the new BIOS:
>  - use factory defaults
>  - disable SMT
>  - disable SMT with Typical Current Idle
>  - all of the above with SVM disabled/enabled
> Right now I set the power supply idle control to "Low ..." and will report
> back.
> 

You got the same error as me. There is also bug 204811 for that.
Comment 653 raul 2019-09-15 17:39:47 UTC
(In reply to txrx from comment #651)

Typical Current Idle might not be working. Read the sensor output; if the voltage is not higher than without it enabled, try increasing the core voltage.

My Ryzen 7 1800X has not produced any hangs since I upgraded to 1003ABB on an ASUS Crosshair VI Hero and enabled Typical Current Idle.


> I was able to update my BIOS to version 18, but my system still locks up.
> I tried the following with the new BIOS:
>  - use factory defaults
>  - disable SMT
>  - disable SMT with Typical Current Idle
>  - all of the above with SVM disabled/enabled
> Right now I set the power supply idle control to "Low ..." and will report
> back.
>
Comment 654 Rafal Kupiec 2019-10-29 18:56:42 UTC
https://cdn.arstechnica.net/wp-content/uploads/2019/10/rdrand-test.zip - download, extract, launch ./test-rdrand and check whether it produces anything other than 0xffffffff.
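(Roughly the same check can be done without the prebuilt binary; a minimal sketch that calls RDRAND directly on x86-64 Linux and looks for the all-ones pattern. The executable-stub trick is an assumption about the environment, it needs a kernel/policy that allows anonymous PROT_EXEC mappings, and this is not the same code as the linked test.)

```
#!/usr/bin/env python3
# Sketch: execute RDRAND directly and check for the known-bad behaviour where
# every call returns 0xFFFFFFFF (the failure mode the "nordrand" parameter
# works around). x86-64 Linux only; maps a tiny executable stub that runs
# "rdrand eax; ret".
import ctypes
import mmap

CODE = bytes([0x0F, 0xC7, 0xF0,   # rdrand eax
              0xC3])              # ret

buf = mmap.mmap(-1, mmap.PAGESIZE,
                prot=mmap.PROT_READ | mmap.PROT_WRITE | mmap.PROT_EXEC)
buf.write(CODE)
addr = ctypes.addressof(ctypes.c_char.from_buffer(buf))
rdrand = ctypes.CFUNCTYPE(ctypes.c_uint32)(addr)

samples = [rdrand() for _ in range(16)]
print(" ".join(f"{value:08x}" for value in samples))
if all(value == 0xFFFFFFFF for value in samples):
    print("RDRAND looks broken (all ones); the nordrand workaround applies")
else:
    print("RDRAND returns varying values")
```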
Comment 655 Jaap Crezee 2019-10-31 08:17:29 UTC
Could this be related?

[OCR of the on-screen trace from the photo below; heavily garbled, legible parts only:]
watchdog: BUG: soft lockup - CPU#4 stuck for 22s
CPU: 4 PID: 5552 Tainted: G U L 5.2.11-arch1-1-ARCH #1
Hardware name: System manufacturer System Product Name/PRIME B450-PLUS, BIOS 1804 07/29/2019
RIP: 0010:smp_call_function_many+0x20b/0x270
Call Trace: flush_tlb_mm_range / tlb_flush_mmu / tlb_finish_mmu / zap_page_range / sys_madvise / do_syscall_64 / entry_SYSCALL_64_after_hwframe

Sorry, no logging in syslog because the machine crashes hard. 
Optional picture of the output: https://iot.jcz.nl/soft_lockup.jpg
Comment 656 JerryD 2019-11-01 02:00:19 UTC
(In reply to Jaap Crezee from comment #655)
> Could this be related?
> 
> [garbled OCR of the on-screen soft-lockup trace snipped; see comment 655]

See also bug report 205017. It is difficult to capture the traces when it first happens. May or may not be the same bug.
Comment 657 Jaap Crezee 2019-11-13 08:14:30 UTC
Since I set up remote console logging and upgraded my BIOS (again), I have not seen the hang anymore.
Although I like this, I still have no more clues about the initial trace/problem.

FYI

I am running on a

        Manufacturer: ASUSTeK COMPUTER INC.
        Product Name: PRIME B450-PLUS
        Version: 1820
        Release Date: 09/12/2019
model name      : AMD Ryzen 5 1600 Six-Core Processor
Comment 658 pghj 2019-11-16 14:42:58 UTC
Ever since upgrading to a Ryzen (1700X), I have experienced frequent system freezes, which may be related to the problems discussed here.



The freeze mostly happens during a certain heavily threaded task with disk io.

Symptoms:

* Screen completely freezes, including mouse pointer,
* Existing SSH connections die, no new connection can be established,
* System can no longer switch to text console,
* LEDs indicate **unceasing disk activity**,
* System still responds to pings,
* Alt-SysRq keys remain active, but cannot output to screen even if already in text console.

I've succeeded in capturing kernel logging after a freeze using netconsole:

This timeout message appears:

    [35042.581242] INFO: task jbd2/dm-2-8:610 blocked for more than 120 seconds.
    [35042.581259]       Not tainted 4.15.0-62-generic #69-Ubuntu
    [35042.581262] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
    [35042.581273] jbd2/dm-2-8     D    0   610      2 0x80000000
    [35042.581278] Call Trace:
    [35042.581290]  __schedule+0x24e/0x880
    [35042.581295]  ? bit_wait+0x60/0x60
    [35042.581300]  schedule+0x2c/0x80
    [35042.581304]  io_schedule+0x16/0x40
    [35042.581308]  bit_wait_io+0x11/0x60
    [35042.581313]  __wait_on_bit+0x4c/0x90
    [35042.581317]  out_of_line_wait_on_bit+0x90/0xb0
    [35042.581323]  ? bit_waitqueue+0x40/0x40
    [35042.581328]  __wait_on_buffer+0x32/0x40
    [35042.581333]  jbd2_journal_commit_transaction+0xdac/0x1730
    [35042.581337]  ? __switch_to_asm+0x41/0x70
    [35042.581343]  kjournald2+0xc8/0x270
    [35042.581347]  ? kjournald2+0xc8/0x270
    [35042.581351]  ? wait_woken+0x80/0x80
    [35042.581355]  kthread+0x121/0x140
    [35042.581359]  ? commit_timeout+0x20/0x20
    [35042.581363]  ? kthread_create_worker_on_cpu+0x70/0x70
    [35042.581366]  ret_from_fork+0x22/0x40

Also, I have thousands of lines of output for blocked tasks. Most traces look more or less like this:

    [34274.346748] sysrq: SysRq : Show Blocked State
    [34274.346766]   task                        PC stack   pid father
    [34274.346771] systemd         D    0     1      0 0x00000000
    [34274.346776] Call Trace:
    [34274.346786]  __schedule+0x24e/0x880
    [34274.346792]  ? mempool_alloc_slab+0x15/0x20
    [34274.346795]  schedule+0x2c/0x80
    [34274.346798]  schedule_timeout+0x15d/0x350
    [34274.346804]  ? __next_timer_interrupt+0xe0/0xe0
    [34274.346808]  ? wait_woken+0x80/0x80
    [34274.346812]  io_schedule_timeout+0x1e/0x50
    [34274.346815]  mempool_alloc+0x15d/0x190
    [34274.346820]  ? wait_woken+0x80/0x80
    [34274.346825]  bio_alloc_bioset+0xa9/0x1e0
    [34274.346830]  __split_and_process_non_flush+0x147/0x2c0
    [34274.346834]  __split_and_process_bio+0x139/0x2a0
    [34274.346838]  dm_make_request+0x7a/0xd0
    [34274.346843]  ? SyS_madvise+0x990/0x990
    [34274.346847]  generic_make_request+0x124/0x300
    [34274.346850]  submit_bio+0x73/0x140
    [34274.346853]  ? submit_bio+0x73/0x140
    [34274.346856]  ? get_swap_bio+0xcd/0x100
    [34274.346861]  __swap_writepage+0x323/0x3b0
    [34274.346865]  ? __frontswap_store+0x73/0x100
    [34274.346869]  swap_writepage+0x34/0x90
    [34274.346872]  pageout.isra.54+0x11b/0x350
    [34274.346878]  shrink_page_list+0x99a/0xbc0
    [34274.346883]  shrink_inactive_list+0x242/0x590
    [34274.346887]  shrink_node_memcg+0x364/0x770
    [34274.346892]  shrink_node+0xf7/0x300
    [34274.346896]  ? shrink_node+0xf7/0x300
    [34274.346900]  do_try_to_free_pages+0xc9/0x330
    [34274.346904]  try_to_free_pages+0xee/0x1b0
    [34274.346910]  __alloc_pages_slowpath+0x3fc/0xe00
    [34274.346914]  ? __switch_to_asm+0x35/0x70
    [34274.346917]  ? __switch_to_asm+0x35/0x70
    [34274.346920]  ? __switch_to_asm+0x35/0x70
    [34274.346924]  ? __switch_to_asm+0x35/0x70
    [34274.346929]  ? __switch_to_asm+0x35/0x70
    [34274.346932]  ? __switch_to_asm+0x41/0x70
    [34274.346936]  __alloc_pages_nodemask+0x29a/0x2c0
    [34274.346940]  alloc_pages_current+0x6a/0xe0
    [34274.346944]  __page_cache_alloc+0x81/0xa0
    [34274.346948]  __do_page_cache_readahead+0x113/0x2c0
    [34274.346952]  ? radix_tree_lookup_slot+0x22/0x50
    [34274.346956]  ? find_get_entry+0x1e/0x110
    [34274.346959]  filemap_fault+0x2ad/0x6f0
    [34274.346968]  ? filemap_fault+0x2ad/0x6f0
    [34274.346971]  ? page_add_file_rmap+0x134/0x180
    [34274.346975]  ? filemap_map_pages+0x181/0x390
    [34274.346980]  ext4_filemap_fault+0x31/0x44
    [34274.346984]  __do_fault+0x5b/0x115
    [34274.346988]  __handle_mm_fault+0xdef/0x1290
    [34274.346992]  handle_mm_fault+0xb1/0x210
    [34274.346997]  __do_page_fault+0x281/0x4b0
    [34274.347001]  do_page_fault+0x2e/0xe0
    [34274.347004]  ? page_fault+0x2f/0x50
    [34274.347008]  page_fault+0x45/0x50
    [34274.347011] RIP: 0033:0x7fa9446ee83a
    [34274.347015] RSP: 002b:00007ffcccb01470 EFLAGS: 00010206
    [34274.347019] RAX: 0000000000000001 RBX: 00005615eff63650 RCX: 00007fa944bcebb7
    [34274.347021] RDX: 0000000000000093 RSI: 00007ffcccb01470 RDI: 0000000000000000
    [34274.347025] RBP: 00007ffcccb01c60 R08: 0000000000000000 R09: 0000000000000008
    [34274.347027] R10: 00000000ffffffff R11: 0000000000000000 R12: 0000000000000001
    [34274.347032] R13: ffffffffffffffff R14: 00007ffcccb01470 R15: 0000000000000001

Another detail that may be relevant is that show-task-states outputs about 5000 lines of this kind:

    [34830.962684]     in-flight: 3235:kcryptd_crypt [dm_crypt], 3237:kcryptd_crypt [dm_crypt], 6056:kcryptd_crypt [dm_crypt], 6058:kcryptd_crypt [dm_crypt], 6055:kcryptd_crypt [dm_crypt], 6057:kcryptd_crypt [dm_crypt], 3992:kcryptd_crypt [dm_crypt], 4861:kcryptd_crypt [dm_crypt], 32431:kcryptd_crypt [dm_crypt], 2682:kcryptd_crypt [dm_crypt], 4850:kcryptd_crypt [dm_crypt], 1429:kcryptd_crypt [dm_crypt], 6054:kcryptd_crypt [dm_crypt], 6060:kcryptd_crypt [dm_crypt], 4862:kcryptd_crypt [dm_crypt], 1862:kcryptd_crypt [dm_crypt]
    [34830.874519] DefaultDispatch 22853    343906.389378     12131   120 
    [34830.962714]     delayed: kcryptd_crypt [dm_crypt], kcryptd_crypt [dm_crypt], kcryptd_crypt [dm_crypt], kcryptd_crypt [dm_crypt], kcryptd_crypt [dm_crypt], kcryptd_crypt [dm_crypt], kcryptd_crypt [dm_crypt], kcryptd_crypt [dm_crypt], kcryptd_crypt [dm_crypt], kcryptd_crypt [dm_crypt], kcryptd_crypt [dm_crypt], kcryptd_crypt [dm_crypt], kcryptd_crypt [dm_crypt], kcryptd_crypt [dm_crypt], kcryptd_crypt [dm_crypt], kcryptd_crypt [dm_crypt], kcryptd_crypt [dm_crypt], kcryptd_crypt [dm_crypt], kcryptd_crypt [dm_crypt], kcryptd_crypt [dm_crypt], kcryptd_crypt [dm_crypt], kcryptd_crypt [dm_crypt], kcryptd_crypt [dm_crypt], kcryptd_crypt [dm_crypt], kcryptd_crypt [dm_crypt], kcryptd_crypt [dm_crypt], kcryptd_crypt [dm_crypt], kcryptd_crypt [dm_crypt], kcryptd_crypt [dm_crypt], kcryptd_crypt [dm_crypt], kcryptd_crypt [dm_crypt]
    [34830.962761] , kcryptd_crypt [dm_crypt], kcryptd_crypt [dm_crypt], kcryptd_crypt [dm_crypt], kcryptd_crypt [dm_crypt], kcryptd_crypt [dm_crypt], kcryptd_crypt [dm_crypt], kcryptd_crypt [dm_crypt], kcryptd_crypt [dm_crypt], kcryptd_crypt [dm_crypt], kcryptd_crypt [dm_crypt], kcryptd_crypt [dm_crypt], kcryptd_crypt [dm_crypt], kcryptd_crypt [dm_crypt], kcryptd_crypt [dm_crypt], kcryptd_crypt [dm_crypt], kcryptd_crypt [dm_crypt], kcryptd_crypt [dm_crypt], kcryptd_crypt [dm_crypt], kcryptd_crypt [dm_crypt], kcryptd_crypt [dm_crypt], kcryptd_crypt [dm_crypt], kcryptd_crypt [dm_crypt], kcryptd_crypt [dm_crypt], kcryptd_crypt [dm_crypt], kcryptd_crypt [dm_crypt], kcryptd_crypt [dm_crypt], kcryptd_crypt [dm_crypt], kcryptd_crypt [dm_crypt], kcryptd_crypt [dm_crypt], kcryptd_crypt [dm_crypt], kcryptd_crypt [dm_crypt]
    [34830.881008]   .nr_spread_over                : 0
    [34830.962862] , kcryptd_crypt [dm_crypt], kcryptd_crypt [dm_crypt], kcryptd_crypt [dm_crypt], kcryptd_crypt [dm_crypt], kcryptd_crypt [dm_crypt], kcryptd_crypt [dm_crypt], kcryptd_crypt [dm_crypt], kcryptd_crypt [dm_crypt], kcryptd_crypt [dm_crypt], kcryptd_crypt [dm_crypt], kcryptd_crypt [dm_crypt], kcryptd_crypt [dm_crypt], kcryptd_crypt [dm_crypt], kcryptd_crypt [dm_crypt], kcryptd_crypt [dm_crypt], kcryptd_crypt [dm_crypt], kcryptd_crypt [dm_crypt], kcryptd_crypt [dm_crypt], kcryptd_crypt [dm_crypt], kcryptd_crypt [dm_crypt], kcryptd_crypt [dm_crypt], kcryptd_crypt [dm_crypt], kcryptd_crypt [dm_crypt], kcryptd_crypt [dm_crypt], kcryptd_crypt [dm_crypt], kcryptd_crypt [dm_crypt], kcryptd_crypt [dm_crypt], kcryptd_crypt [dm_crypt], kcryptd_crypt [dm_crypt], kcryptd_crypt [dm_crypt], kcryptd_crypt [dm_crypt]

Is there someone who can interpret all this? If it is helpful I can attach the full blocked-tasks output.

(kernel version is 4.15.0-62-generic)
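
(For reference, the blocked-task and task-state dumps above come from the kernel's SysRq facility and can also be triggered from a shell; a quick sketch, assuming SysRq is not restricted by kernel.sysrq:)

    # allow all SysRq functions for this boot
    $ echo 1 | sudo tee /proc/sys/kernel/sysrq
    # "Show Blocked State" (the sysrq-w dump quoted above)
    $ echo w | sudo tee /proc/sysrq-trigger
    # "Show task states" (sysrq-t), the ~5000-line listing
    $ echo t | sudo tee /proc/sysrq-trigger
    # the output lands in the kernel log
    $ dmesg | tail -n 200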
Comment 659 eric.c.morgan 2019-11-18 18:46:01 UTC
I applied the latest asrock BIOS with new options "amd cbs global c-state control" to disable voltage lowering when idle.

Even with all BIOS settings, custom kernels and params, disabling C states, power supplies and so forth I'm done. 2 years of this BS.

I picked up a 65 watt Intel i9 9900. 

Good luck to you all and thanks for all the ideas and help. I hope to revisit AMD Ryzen based linux servers in a few years.
Comment 660 Michaël Colignon 2019-11-18 22:25:11 UTC
(In reply to eric.c.morgan from comment #659)
> I applied the latest asrock BIOS with new options "amd cbs global c-state
> control" to disable voltage lowering when idle.
> 
> Even with all BIOS settings, custom kernels and params, disabling C states,
> power supplies and so forth I'm done. 2 years of this BS.
> 
> I picked up a 65 watt Intel i9 9900. 
> 
> Good luck to you all and thanks for all the ideas and help. I hope to
> revisit AMD Ryzen based linux servers in a few years.

I think you had faulty hardware. Just the "Typical current idle" setting fixes the trouble; that happens.

On the other hand, the Windows kernel has no problem with default parameters, so I think something in Linux is not right.
Comment 661 JerryD 2019-11-19 01:25:09 UTC
(In reply to Michaël Colignon from comment #660)
> (In reply to eric.c.morgan from comment #659)
> > I applied the latest asrock BIOS with new options "amd cbs global c-state
> > control" to disable voltage lowering when idle.
> > 
> > Even with all BIOS settings, custom kernels and params, disabling C states,
> > power supplies and so forth I'm done. 2 years of this BS.
> > 
> > I picked up a 65 watt Intel i9 9900. 
> > 
> > Good luck to you all and thanks for all the ideas and help. I hope to
> > revisit AMD Ryzen based linux servers in a few years.
> 
> I think you had a faulty hardware. With just the typical idle current it fix
> the trouble.
> That happens.
> 
> On the other side, Windows kernel has no problem with default parameters, so
> i think something in Linux is not good.

Agree, see https://bugzilla.kernel.org/show_bug.cgi?id=205017
Comment 662 eric.c.morgan 2019-11-19 02:48:09 UTC
(In reply to Michaël Colignon from comment #660)
> (In reply to eric.c.morgan from comment #659)
> > I applied the latest asrock BIOS with new options "amd cbs global c-state
> > control" to disable voltage lowering when idle.
> > 
> > Even with all BIOS settings, custom kernels and params, disabling C states,
> > power supplies and so forth I'm done. 2 years of this BS.
> > 
> > I picked up a 65 watt Intel i9 9900. 
> > 
> > Good luck to you all and thanks for all the ideas and help. I hope to
> > revisit AMD Ryzen based linux servers in a few years.
> 
> I think you had a faulty hardware. With just the typical idle current it fix
> the trouble.
> That happens.
> 
> On the other side, Windows kernel has no problem with default parameters, so
> i think something in Linux is not good.

Likely not bad hardware. My 1700 was RMAd for the segfault issue (another pain point), and has passed all testing. Memory has been tested as well, many times over, with different profiles and settings.

Tiny i9-9900 itx ready for transplant. I feel dirty. https://i.imgur.com/fTvWrWd.png
Comment 663 jan.public@famvlug.nl 2019-11-19 08:54:37 UTC
> Likely not bad hardware. My 1700 was RMAd for the segfault issue (another
> pain point), and has passed all testing. Memory has been tested as well,
> many times over, with different profiles and settings.
> 

I also had my 1600 RMAd, but still I experience hangs now and then. I have the impression that the hangs are worse lately. But I have no hard statistics.
Comment 664 Borislav Petkov 2019-11-19 09:46:46 UTC
Folks,

this bugzilla entry, with the amount of different issues reported and the amount of FUD all collected in one place, is prohibitively hard for a debugger to handle. So, I'd suggest if you still would like your issue looked at, to open a separate bug. 

And please refrain from commenting on a bug with your own issue - it is much easier to open a separate one first and then when it turns out that it is a known issue, to merge the two bugzilla entries than to keep them apart in a single report.

Thx.
Comment 668 Account Deleted 2020-04-03 13:51:26 UTC
I as the thread starter fully agree with Borislav and set this bug report to status resolved. Thank you all very much for the interesting discussion!

Having started with massive stability problems on my new Ryzen build, I reported this issue to address a potential problem in the Linux kernel. In the hundreds of comments, which surely could be very interesting for everybody affected, I found a workaround - disabling C6 states with the python script attached to a comment above. Since then the system has been perfectly stable. A test with the "kill Ryzen" script also showed me that the CPU was not affected by the other huge problem early adopters had.
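
For anyone who wants to replicate that workaround, here is a minimal sketch of a oneshot systemd unit wrapping zenstates.py; the unit name, install path and the --c6-disable flag are assumptions based on the commonly used ZenStates-Linux script (and the patched variant attached above), so check your copy's options first:

    # /etc/systemd/system/ryzen-disable-c6.service  (hypothetical name and path)
    [Unit]
    Description=Disable the C6 package state as a Ryzen soft-lockup workaround

    [Service]
    Type=oneshot
    # zenstates.py programs the CPU via MSRs, so the msr module must be loaded
    ExecStartPre=/sbin/modprobe msr
    ExecStart=/usr/local/bin/zenstates.py --c6-disable

    [Install]
    WantedBy=multi-user.target

Enable it with `sudo systemctl enable --now ryzen-disable-c6.service` and verify the C6 status with zenstates.py after the next boot.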

A few days ago I finally found the time to update the AB350 BIOS to the latest version and set "Typical current idle" in the options. I had only two issues since then: the bootloader entry for my OS was broken after the update, and NetworkManager did not bring up the LAN interface anymore. But after resolving those: no more stability problems. Of course, I deactivated my systemd service for C6 states. I am still on the stable branch of Manjaro and use the latest 5.5 series kernel.

See you!
Comment 673 Charles Lim 2020-05-11 02:40:57 UTC
Hi folks!
 
For those who are looking for a solution or already found a solution, there is a new update of AGESA rolling out. The new version 1.0.0.4 claims:

* Improved system stability when switching through ACPI power states.

It arrived on my Asus PRIME B350M a few days ago; I have upgraded and kept the default optimised BIOS settings. When I used ZenStates to check the CPU status, the C6 state is now:

C6 State - Package - Enabled
C6 State - Core - Disabled

Previously they were all enabled. I guess the new AGESA release finally solved this problem at the source? I will report back in a few days to see if the system still hangs.
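
(For reference, that check can usually be reproduced with the ZenStates-Linux script itself; a small sketch, assuming its --list flag and the msr module being available:)

    $ sudo modprobe msr
    $ sudo ./zenstates.py --list
    # prints the P-state table plus the "C6 State - Package/Core" lines quoted above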
Comment 674 Paul Menzel 2020-05-12 10:55:06 UTC
(In reply to Charles Lim from comment #673)

> For those who are looking for a solution or already found a solution, there
> is a new update of AGESA rolling out. The new version 1.0.0.4 claims:
> 
> * Improved system stability when switching through ACPI power states.
> 
> It has arrived on my Asus PRIME B350M days ago, I have upgraded and kept
> default optimised BIOS settings. When I used ZenState to check cpu status,
> the C6 state is now: 
> 
> C6 State - Package - Enabled
> C6 State - Core - Disabled
> 
> Previously they were all enabled, I guessed the new AGESA release finally
> solved this problem from source? I will report back days later to see if the
> system still hangs.

Are you sure about the AGESA version string? For the Dell OptiPlex 5055, firmware version 1.1.20 [1], running it through *Dell PFS BIOS Extractor*, and then grepping for `AGESA!` in the strings/hexdump, it says it includes AGESA version 1.0.0.7a.

    $ strings 1\ --\ 1\ OptiPlex\ System\ BIOS\ v1.1.20.bin | grep -A1 AGESA!
    %pAGESA!V9
    SummitPI-AM4 1.0.0.7a

[1]: https://www.dell.com/support/home/de-de/drivers/driversdetails?driverid=w6mw5&oscode=wt64a&productcode=optiplex-5055-ryzen-cpu
[2]: https://github.com/platomav/BIOSUtilities
Comment 675 Rafal Kupiec 2020-05-12 11:02:04 UTC
(In reply to Paul Menzel from comment #674)
> (In reply to Charles Lim from comment #673)
> 
> > For those who are looking for a solution or already found a solution, there
> > is a new update of AGESA rolling out. The new version 1.0.0.4 claims:
> > 
> > * Improved system stability when switching through ACPI power states.
> > 
> > It has arrived on my Asus PRIME B350M days ago, I have upgraded and kept
> > default optimised BIOS settings. When I used ZenState to check cpu status,
> > the C6 state is now: 
> > 
> > C6 State - Package - Enabled
> > C6 State - Core - Disabled
> > 
> > Previously they were all enabled, I guessed the new AGESA release finally
> > solved this problem from source? I will report back days later to see if
> the
> > system still hangs.
> 
> Are you sure about the AGESA version string? For the Dell OptiPlex 5055,
> firmware version 1.1.20 [1], running it through *Dell PFS BIOS Extractor*,
> and then grepping for `AGESA!` in the strings/hexdump, it says it includes
> AGESA version 1.0.0.7a.
> 
>     $ strings 1\ --\ 1\ OptiPlex\ System\ BIOS\ v1.1.20.bin | grep -A1 AGESA!
>     %pAGESA!V9
>     SummitPI-AM4 1.0.0.7a
> 
> [1]:
> https://www.dell.com/support/home/de-de/drivers/
> driversdetails?driverid=w6mw5&oscode=wt64a&productcode=optiplex-5055-ryzen-
> cpu
> [2]: https://github.com/platomav/BIOSUtilities


1.0.0.7a is for Zen+
1.0.0.4 is for Zen2.

AMD resets versioning every new CPU generation. 1.0.0.4 is newer than 1.0.0.7
Comment 676 Paul Menzel 2020-05-12 11:06:50 UTC
(In reply to Rafal Kupiec from comment #675)
> (In reply to Paul Menzel from comment #674)
> > (In reply to Charles Lim from comment #673)
> > 
> > > For those who are looking for a solution or already found a solution,
> there
> > > is a new update of AGESA rolling out. The new version 1.0.0.4 claims:
> > > 
> > > * Improved system stability when switching through ACPI power states.

[…]

> > Are you sure about the AGESA version string? For the Dell OptiPlex 5055,
> > firmware version 1.1.20 [1], running it through *Dell PFS BIOS Extractor*,
> > and then grepping for `AGESA!` in the strings/hexdump, it says it includes
> > AGESA version 1.0.0.7a.

[…]

> 1.0.0.7a is for Zen+
> 1.0.0.4 is for Zen2.
> 
> AMD resets versioning every new CPU generation. 1.0.0.4 is newer than 1.0.0.7

Thank you for the clarification. Though I am confused now, as I thought you could use Zen2 devices in “Zen+ boards” (boards originally for Zen+). So, AGESA 1.0.0.4 for Zen2 also supports the predecessor generation?
Comment 677 Rafal Kupiec 2020-05-12 11:14:38 UTC
There is no such thing as a Zen+ or Zen2 board... There are AM4 motherboards based on different chipsets. All of them support Zen, Zen+ and Zen2. If you want to install a Zen2 CPU in such a board, you must first ensure it has a flashed BIOS that supports the new CPU. The piece of firmware that includes the CPU microcode is called AGESA. It is responsible for initializing at least the CPU and memory, and it is redistributed as a BIOS upgrade by motherboard manufacturers.

So if you want to install a Zen2 CPU on an AM4 motherboard, you need to make sure it has a BIOS flashed with AGESA 1.0.0.1 or newer, which brings support for Zen2 CPUs. That AGESA 1.0.0.1 is newer than 1.0.0.7. It still supports Zen and Zen+, but since it is dedicated to Zen2, the versioning restarts from 1.0.0.0, which gets people confused.
Comment 678 Charles Lim 2020-05-15 05:10:52 UTC
Hi folks again!

I have been testing the new AGESA 1.0.0.4 these days. Currently my uptime is past 3 days, which was impossible before - the system always hung within 2 days. Hence I consider that the new firmware has indeed solved this soft lockup issue!

I'm testing on Ubuntu Focal, acpi related settings are all left default. On my Asus PRIME B350M, default optimised settings are being used.

In addition, the CPU C6 state also looks normal: during idle, the CPU enters its idle state with very low power consumption (measured externally), and k10temp reports Vcore at approximately 850 mV.

>Thank you for the clarification. Though I am confused now, as I thought you
>could use Zen2 devices in “Zen+ boards” (boards original for Zen+). So, AGESA
>1.0.0.4 for Zen2 also support the predecessor generation?

It seems that, to address this versioning mess, AMD claims this 1.0.0.4 release reunited all architectures into one codebase. [1] I guess they also "reunited" the version numbering. The AGESA on my motherboard was also 1.0.0.7 before.


---REFERENCES---
[1]: https://www.reddit.com/r/Amd/comments/dtgutp/an_update_on_the_am4_platform_agesa_1004/
Comment 680 Ashesh Ambasta 2020-06-01 08:39:12 UTC
I've been facing the same issue since I've bought this machine and setting my BIOS power settings to "Typical current idle" has only partially fixed things (the system still crashes every 1-2 weeks during idle: I leave my desk, screensaver fires up, then screens go to sleep, and by the time I get back to my desk, the system no longer responds to any input and the only option is a hard reset.).

At the time of writing, I can't tell if the AGESA version has made it into my motherboard: https://www.asrock.com/mb/AMD/X399M%20Taichi/index.asp#BIOS 

But I'd be willing to give this a try. I also just yesterday requested an RMA from AMD though: I spoke to their support and they asked me to try setting the power settings in the BIOS settings. Once I reported back that the power settings only reduced the frequency of these crashes, their immediate response was to open an RMA request. It seemed like they were aware of the issue and they seemed quite confident that an RMA will fix this (I sure hope so since the RMA means at least 2 weeks of downtime for me).
Comment 681 Rafal Kupiec 2020-06-01 10:46:10 UTC
(In reply to Ashesh Ambasta from comment #680)
> I've been facing the same issue since I've bought this machine and setting
> my BIOS power settings to "Typical current idle" has only partially fixed
> things (the system still crashes every 1-2 weeks during idle: I leave my
> desk, screensaver fires up, then screens go to sleep, and by the time I get
> back to my desk, the system no longer responds to any input and the only
> option is a hard reset.).
> 
> At the time of writing, I can't tell if the AGESA version has made it into
> my motherboard: https://www.asrock.com/mb/AMD/X399M%20Taichi/index.asp#BIOS 
> 
> But I'd be willing to give this a try. I also just yesterday requested an
> RMA from AMD though: I spoke to their support and they asked me to try
> setting the power settings in the BIOS settings. Once I reported back that
> the power settings only reduced the frequency of these crashes, their
> immediate response was to open an RMA request. It seemed like they were
> aware of the issue and they seemed quite confident that an RMA will fix this
> (I sure hope so since the RMA means at least 2 weeks of downtime for me).

I wish you luck. I think this is all I can tell you.
I had the exact same problem with a B450. I opened an RMA request and got my money back. I bought an X570-based motherboard and this solved the issue for me.
Comment 682 Ashesh Ambasta 2020-06-01 11:15:25 UTC
Created attachment 289445 [details]
attachment-1715-0.html

Does this mean the issue could be motherboard related? I actually received a reply from AMD and they’re asking me for pictures of the processor installed on the motherboard for some reason. 

Best,

Ashesh Ambasta 

Comment 683 Paul Menzel 2020-06-01 15:59:23 UTC
(In reply to Ashesh Ambasta from comment #680)
> I've been facing the same issue since I've bought this machine and setting
> my BIOS power settings to "Typical current idle" has only partially fixed
> things (the system still crashes every 1-2 weeks during idle: I leave my
> desk, screensaver fires up, then screens go to sleep, and by the time I get
> back to my desk, the system no longer responds to any input and the only
> option is a hard reset.).
> 
> At the time of writing, I can't tell if the AGESA version has made it into
> my motherboard: https://www.asrock.com/mb/AMD/X399M%20Taichi/index.asp#BIOS 

[I have an MSI board, but checked the firmware update files for you.]

The description for version 3.60 includes:

> Update AMD AGESA ThreadRipperPI-SP3r2 1.1.0.2

Reading the ASRock forum thread *What happened to Fatality X399 Bios 1.60?* [1], it’s enough to run the firmware update file through a hex editor. I searched for *GES* and got lucky with version 3.80.

    $ hexdump -C X399MT3.80 | less
    […]
    00e34ce0  00 00 00 00 00 00 00 00  00 00 00 00 be 2c 63 17  |.............,c.|
    00e34cf0  80 6c b6 49 82 07 12 b5  3d 9b 25 70 41 47 45 53  |.l.I....=.%pAGES|
    00e34d00  41 21 56 39 00 54 68 72  65 61 64 52 69 70 70 65  |A!V9.ThreadRippe|
    00e34d10  72 50 49 2d 53 50 33 72  32 2d 31 2e 31 2e 30 2e  |rPI-SP3r2-1.1.0.|
    00e34d20  32 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |2...............|
[…]

So, it still has AGESA version 1.1.0.2.

[…]


[1]: http://forum.asrock.com/forum_posts.asp?TID=6165&PN=4&title=what-happened-to-fatality-x399-bios-160
Comment 684 Ashesh Ambasta 2020-06-01 16:15:40 UTC
I see. But that is assuming that the 1.0.0.4 version “fixes” this issue. At this point it’s only a guess but it seems likely given the previous replies I’ve seen. 
My god what a nightmare this has been. :-)

Best,

Ashesh Ambasta 

Comment 685 ChrisB 2020-06-01 18:51:06 UTC
I ran several experiments with a Ryzen 1700 and a Gigabyte X370 Gaming K7 board. The best choice was "Typical current idle", but the latest BIOS didn't have that option. So I bought an R5 3600, put the latest BIOS on the board, and all the problems went away. Rock solid now. I think the first generation of Ryzen causes some problems with Linux - mine was RMA'd because of the compiler error, see comment 195.
Meanwhile: gcc 9.3.0, kernel 5.7.0, BIOS F50a.
Comment 686 Michaël Colignon 2020-06-01 18:59:44 UTC
(In reply to Paul Menzel from comment #683)
> (In reply to Ashesh Ambasta from comment #680)
> > I've been facing the same issue since I've bought this machine and setting
> > my BIOS power settings to "Typical current idle" has only partially fixed
> > things (the system still crashes every 1-2 weeks during idle: I leave my
> > desk, screensaver fires up, then screens go to sleep, and by the time I get
> > back to my desk, the system no longer responds to any input and the only
> > option is a hard reset.).
> > 
> > At the time of writing, I can't tell if the AGESA version has made it into
> > my motherboard: https://www.asrock.com/mb/AMD/X399M%20Taichi/index.asp#BIOS 
> 
> [I have an MSI board, but checked the firmware update files for you.]
> 
> The description for version 3.60 includes:
> 
> > Update AMD AGESA ThreadRipperPI-SP3r2 1.1.0.2
> 
> Reading the ASRock forum thread *What happened to Fatality X399 Bios 1.60?*
> [1], it’s enough to run the firmware update file through a hexeditor. I
> searched for *GES* and was lucky for version 3.80.
> 
>     $ hexdump -C X399MT3.80 | less
>     […]
>     00e34ce0  00 00 00 00 00 00 00 00  00 00 00 00 be 2c 63 17 
> |.............,c.|
>     00e34cf0  80 6c b6 49 82 07 12 b5  3d 9b 25 70 41 47 45 53 
> |.l.I....=.%pAGES|
>     00e34d00  41 21 56 39 00 54 68 72  65 61 64 52 69 70 70 65 
> |A!V9.ThreadRippe|
>     00e34d10  72 50 49 2d 53 50 33 72  32 2d 31 2e 31 2e 30 2e 
> |rPI-SP3r2-1.1.0.|
>     00e34d20  32 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00 
> |2...............|
> […]
> ```
> 
> So, it still has AGESA version 1.1.0.2.
> 
> […]
> 
> 
> [1]:
> http://forum.asrock.com/forum_posts.asp?TID=6165&PN=4&title=what-happened-to-
> fatality-x399-bios-160

Hi,
AGESA versioning for Threadripper (TR) is not the same as for AM4.
Comment 687 Ashesh Ambasta 2020-06-01 19:55:08 UTC
I’ve tried that option of “Typical current idle” to no avail. 

Best,

Ashesh Ambasta 

Comment 688 Ashesh Ambasta 2020-06-07 14:09:12 UTC
I've also found https://www.reddit.com/r/Amd/comments/8yzvxz/ryzen_c6_state_sleep_power_supply_common_current/e33lc6s/ which seems promising. It is, however, appalling to see this kind of response from AMD, and I'm quite sure this is my last AMD system.

I'm currently testing my system with "Deep sleep" enabled and will report if it's stable. The "Typical current idle" fix certainly didn't fix things for me, and in any case, the best I can say about that "fix" is that it's an ugly hack and not really a fix from AMD.

I'm bewildered to see so many reports on this thread and no word from AMD. I'm also convinced they are aware of this issue and are doing nothing but proposing cheap hacks.

AMD has also approved my RMA request, and I'll send this processor back if the "deep sleep" "fix" doesn't fix things for me. If the RMA'd processor doesn't work, I'm going to sell it.
Comment 689 raul 2020-06-11 16:03:14 UTC
(In reply to raulvior.bcn from comment #653)
> (In reply to txrx from comment #651)
> 
> Typical Current Idle might not be working. Read the sensor output. If
> voltage is not higher than without enabling it, try to increase the core
> voltage.
> 
> My Ryzen 7 1800X seems to not produce hangs since I upgraded to 1003ABB with
> an ASUS Crosshair VI Hero and enabled Typical current idle.
> 
> 
> > I was able to update my BIOS to version 18, but my system still locks up.
> > I tried the following with the new BIOS:
> >  - use factory defaults
> >  - disable SMT
> >  - disable SMT with Typical Current Idle
> >  - all of the above with SVM disabled/enabled
> > Right now I set the power supply idle control to "Low ..." and will report
> > back.
> >

The motherboard kept hanging. I had to remove the Vitals GNOME Extension. It seems that polling voltage values hangs the motherboard... Still, there are times that the computer does not come back from suspend. There's something wrong with the BIOS/UEFI.
Comment 692 Ashesh Ambasta 2020-06-27 11:42:11 UTC
As a last resort, I've tried `idle=halt` on this machine. And yet my 
system just crashed after 3 weeks of uptime.

I'm done with AMD. I will RMA this processor to try things out, but 
overall, if that doesn't work, this thing is headed to the junkyard and 
I'm going to live with Intel.

At least in the 13 or so Intel systems I've tried, I've not had 
exasperating issues like these, where the company is actively trying 
to ignore an ongoing issue.

This is disgusting from AMD.

Comment 693 Arup 2020-06-27 20:31:02 UTC
Before you discard the CPU: if your BIOS has a PBO option, turn it off. With the latest BIOS on my B350 motherboard (version 5406), PBO is on by default, and apart from causing high temperatures it would lead to this kind of crashes and reboots with my Ryzen 7. Turning off PBO fixed it, and now I am running a Ryzen 9 with the same board; all works fine with PBO off.
Comment 694 Ashesh Ambasta 2020-06-28 05:23:16 UTC
I’ll try that, but I doubt if at this stage it will have any effect. 

Best,

Ashesh Ambasta 

Comment 695 Ashesh Ambasta 2020-06-28 06:07:11 UTC
Moreover, from a bit of reading I see that PBO is related to how fast the processor clock speeds are boosted. How is this related to crashes when idle?

Best,

Ashesh Ambasta 

Comment 698 raul 2020-07-12 13:58:58 UTC
Did you ask for an RMA? Did it work?
(In reply to Ashesh Ambasta from comment #692)
Comment 699 Ashesh Ambasta 2020-07-12 14:08:16 UTC
Created attachment 290243 [details]
attachment-20700-0.html

I haven't; and to be honest, I've been procrastinating on this issue.

As a very ugly hack/workaround, I've disabled screen power management in 
xscreensaver, so the CPU keeps drawing graphics on my screen instead of 
my displays going to sleep.

That way, my CPU never really enters the idle states for the crashes to 
occur.

I understand that this is /far/ from a satisfactory solution; but I 
didn't want to try my luck with the RMA anymore. As long as my system 
doesn't crash, I can live with this CPU (although this continues to 
frustrate me). I may lose patience in the coming months and go for an 
RMA anyway. But I'm deterred by the mixed reports about RMAs as well: 
some people claim that an RMA fixes their issues; some people say it 
makes no difference. I've even read reports of the RMA'd CPU actually 
turning out to be worse.

I don't think I'm prepared for the gamble. I've been burnt pretty bad 
with AMD at the moment. For now, I'm just making this work. The next 
time I'm buying a CPU, I'll do my research more thoroughly and stay away 
from AMD.

AMD did publish an erratum in which they acknowledge that an issue like 
this exists, but a fix is ruled out, which is further bad news. There was 
some discussion about a fix at the kernel level, but that isn't anywhere 
in sight either. I believe these CPUs are plagued by several issues, 
which probably makes a fix for this at the kernel level hard. However, 
Windows seems to have managed to fix it.

Anyway; rants aside, this is my current take on the CPU.

Comment 701 Paul Menzel 2020-07-13 08:20:37 UTC
(In reply to Ashesh Ambasta from comment #699)

> Created attachment 290243 [details]
> attachment-20700-0.html

I am getting:

> Sorry, you are not authorized to access attachment #290243 [details].

[…]

> AMD did publish an errata in which they claim an issue like this exists; 
> but a solution is ruled out. Which is further bad news. There was some 
> discussion on a fix being at the kernel level, but that isn't anywhere 
> in sight either. I believe these CPU's are plagued by several issues, 
> which probably makes a fix for this at the kernel level hard. However, 
> Windows seems to have managed to fix it.

That’s interesting. Could you please share the URL to these statements?

[…]

raulvior.bcn, Ashesh: Please always remove the cited parts, as otherwise the Web issue page gets even more convoluted. Thanks.


Comment 703 Victor Queiroz 2020-07-24 02:17:28 UTC
I'm also experiencing severe and constant crashes with a Ryzen 5 3600X. I also had problems with my Ryzen 7 1700, and I was confident I was never going to have problems again with the Ryzen 5 3600X, but I was wrong.

For the first months everything was fine, then after reinstalling the system everything went wrong: crashes daily and constantly. I had a critical Git repository corrupted (luckily I could recover things in time), but it completely destroyed my experience with my PC and actually became a financial loss, as I use the computer for work.

Not sure what to do. I've decided to keep trying until the end of this year to get rid of the crashes, but as a Linux user, I'm unsure I will ever think about buying AMD again. I will always remember the hassle and the lost time I had with this.

Sometimes all you want is a great experience with your Linux computer, and BOOM, a hard freeze that requires manually turning the computer off. It's ridiculous. To be honest, I expected much more from AMD. I hope they'll fix it ASAP; we can't stand this anymore. Intel becomes our only option with this problem.

I also opened an issue where I'm posting updates about my problems: https://bugzilla.kernel.org/show_bug.cgi?id=208615
Comment 704 Ashesh Ambasta 2020-07-24 16:45:26 UTC
Have you tried the suggested fixes in this thread? And what changed 
between that reinstall and earlier?

Comment 705 Ashesh Ambasta 2020-08-01 11:15:33 UTC
Even with screensavers enabled and displays never going to sleep, I just 
experienced a crash on idle.

Here are the logs (https://pastebin.com/D7SEJn9J). I did change the 
display configuration just before leaving my desk (with xrandr), which 
explains some of these logs. But I can see nothing from the kernel etc.

The system just dies.


Comment 719 Ewerton Urias 2020-09-17 05:25:09 UTC
Hello everyone.

I apologize for my English, I'll try to communicate.

I did a hardware upgrade in November 2019 (from Intel to AMD); my current hardware is this:

------------------------------------------------
ASUS TUF B450-PRO GAMING
Ryzen 5 1600 Six-Core (BIOS always updated)
GeForce GTX 960 4 GB 128 Bits
Corsair 650w
Corsair LPX 16 GB 2666
------------------------------------------------

During the first few weeks, I noticed reboots and freezes, and after a few months of research I found an alternative solution, which is simply to add "processor.max_cstate=1" to GRUB.

After I did this, my computer went 6 months without rebooting and/or freezing.

Yesterday I removed this parameter from GRUB to see what the result would be, and it happened: I had a reboot and a freeze. That means "processor.max_cstate=1" is a workaround for me.
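
For anyone who wants to try the same workaround, a minimal sketch of how such a parameter is usually added on a GRUB-based distribution (the existing options shown are placeholders; adapt them to your own /etc/default/grub):

    # /etc/default/grub - append the parameter to the kernel command line
    GRUB_CMDLINE_LINUX_DEFAULT="quiet splash processor.max_cstate=1"

    # regenerate the GRUB configuration and reboot
    $ sudo update-grub                              # Debian/Ubuntu and derivatives
    $ sudo grub-mkconfig -o /boot/grub/grub.cfg     # most other distributions

    # after the reboot, confirm the parameter is active
    $ cat /proc/cmdline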

The reason I'm here is to understand the root of the problem so it can be fixed properly. I'll soon test "Power Supply Idle Control" and "Global C-State Control" in the BIOS, but I have seen that for some users they didn't work.

I'm trying to read all of your comments (there are many), but skipping to the last comments, it seems that there is still no solution to this problem, and this makes me very sad.

I don't know anything about hardware, could someone explain to me, in a layman's way, the difference between using "processor.max_cstate=1", "processor.max_cstate=5", "Power Supply Idle Control" and "Global C-State Control"?

I thank you for your patience.
Comment 720 Ashesh Ambasta 2020-09-17 06:14:50 UTC
I can confirm that in my case, all the suggested alternatives in this 
thread didn't work (the ones that were applicable to my use-case anyway).

In the end, I threw my hands up and did an RMA. That was over a 
month ago, and it seems to have solved the issue. The processor I had 
previously and its replacement were from different batches, the former 
from 2019 and the latter from Feb. 2020. It seems to me that AMD ironed 
out some issues with the processors and I got lucky with the replacement, 
or it could just be some tiny variation in fabrication. I can never be sure.

But my machine has been up since the RMA without any crashes.

Comment 721 Paul Menzel 2020-09-17 06:34:49 UTC
(In reply to Ashesh Ambasta from comment #720)
> I can confirm that in my case, all the suggested alternatives in this 
> thread didn't work (the ones that were applicable to my use-case anyway).
> 
> In the end, I threw my hands up and did an RMA. And that was over a 
> month ago. It seems to have solved the isssue. The processor I had 
> previously and its replacement were different batches; the former being 
> from 2019 and the latter from Feb. 2020. It seems to me that AMD ironed 
> out some issues with processors and I got lucky with the replacement. Or 
> it could just be some tiny variations in fabrication. I can never be sure.
> 
> But my machine has been up since the RMA without any crashes.

As you removed the processor from the socket, were you able to copy the serial numbers of the old and the new one, and could you post them here, please?

PS: Please always remove the full quote from messages, when replying by email, as it just clutters the Web interface.
Comment 723 Ashesh Ambasta 2020-11-09 12:34:44 UTC
Apologies for the late reply.
> As you disassembled the processor, were you able to copy the serial numbers
I did copy and send the older processor's serial numbers to the dealer, 
for the RMA. That was a processor from a batch in late 2018 IIRC.
> from the old and new one, and could post them here please? Could you pleas
The new one comes from a batch in Feb. of 2020. I'm afraid I don't have 
the serial number copied somewhere (it is probably on the box but that 
doesn't show the batch number). That was a rookie mistake.
> PS: Please always remove the full quote from messages, when replying by
> email,
> as it just clutters the Web interface.
Noted, sorry about that!
Comment 724 matthew clark 2020-11-21 14:37:52 UTC
If I disable multithreading, my system that had these crashes is stable for months without them. E.g., run this in rc.local:

#!/bin/bash
#
# Disables SMT (hyperthreading), which stops the system from crashing:
# the first sibling of each physical core is kept online, the second
# sibling is taken offline.
#
for CPU in /sys/devices/system/cpu/cpu[0-9]*; do
    CPUID=$(basename "$CPU")
    echo "CPU: $CPUID"
    # cpu0 has no "online" file; only touch it where it exists
    if test -e "$CPU/online"; then
        echo "1" > "$CPU/online"
    fi

    COREID="$(cat "$CPU/topology/core_id")"
    eval "COREENABLE=\"\${core${COREID}enable}\""

    if ${COREENABLE:-true}; then
        # first sibling of this core: keep it online
        echo "$CPU core=${COREID} -> enable"
        eval "core${COREID}enable='false'"
    else
        # second sibling of this core: take it offline
        echo "$CPU core=${COREID} -> disable"
        echo "0" > "$CPU/online"
    fi
done
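
On kernels new enough to expose the global SMT switch in sysfs, the loop above can usually be replaced by a single write; this is a sketch, not part of the original script, and assumes /sys/devices/system/cpu/smt/control exists on your kernel:

    # take all SMT siblings offline in one go
    $ echo off | sudo tee /sys/devices/system/cpu/smt/control
    # confirm SMT is now inactive
    $ cat /sys/devices/system/cpu/smt/active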
Comment 725 Ignat Awwit 2020-12-13 09:03:25 UTC
Hello everyone. I've been following this topic for a long time. I have the same problem on my Ryzen 1700. For a while, switching "Power Supply Idle Control" to "Typical Current Idle" helped me; for a while, everything also worked fine for me with the value "auto". But then I changed something in my BIOS settings and the lockups began to happen even with "Typical Current Idle".

Motherboard: GA-AB350N-Gaming WIFI (rev. 1.0)
Linux Kernel Version: 5.9.12

Then I realized that the bug is due to a combination of some BIOS settings.

So far, I have managed to achieve stable operation with the following settings:

"Power Supply Idle Control" -> "Typical Current Idle"
"Power" (Tab) -> "ErP" -> "Enabled"
"Power" (Tab) -> "CEC 2019 Ready" -> "Disabled"

The settings on the "Power" tab are very important! Especially "CEC 2019 Ready (Enabled)" setting, with which lockups are repeated.

I will try to change different BIOS settings to achieve stable system operation when the "Power Supply Idle Control" is set to "auto".
Comment 726 Ewerton Urias 2020-12-17 02:22:12 UTC
I'm here again, after my first comment (Comment 719).

I have Ryzen 5 1600 (UA 1843PGS), and I come to report some news about my case.

I have 2x8 GB of 2666 MHz RAM, but in the BIOS setup the frequency is set to "AUTO" (2133 MHz) by default; in 2019 I had set it to 2666 MHz and didn't remember this.

A friend warned me that what I did counts as RAM overclocking. I didn't know that, so I decided to revert to "AUTO" (2133 MHz).

Since then... the reboots have stopped, however... the system has continued to freeze, and the logs point only to my NVIDIA GTX 960.

In the BIOS setup, I disabled "Global C-State Control", and it has now been almost 60 days with no more system freezes or reboots, things that used to happen every day.

Summary:

1. Keeping the RAM frequency on AUTO resolved the reboots;
2. Disabling "Global C-State Control" solved my problem with system freezes (the log pointed to NVIDIA).

These two things solved my problem, and the "processor.max_cstate=1" parameter in GRUB is no longer needed.
Comment 727 Nelson Castillo 2020-12-17 03:04:01 UTC
Created attachment 294183 [details]
attachment-20815-0.html

Hi Ewerton.

This might be a side effect and kind of unrelated to the bug. Since YMMV
and this can wreck your hardware, please read this as an anecdotal experience.
When you overclock RAM you need to change other settings in order to stabilize
the system. This video from AMD people helped me get 3000 MHz with no
issues. Before this video I was wasting part of the money I paid for the
sticks by running them at 2666. This was with an AB350M-DS3H (BIOS F31) and a
Ryzen 2700X.

  https://www.youtube.com/watch?v=vZgpHTaQ10k

BTW, Memtest86 helped me a lot here. It would detect an invalid
configuration in a second. I was happy after it ran a few times and I
didn't have any issues after that. Since then I switched to a recent
motherboard and the RAM was detected automatically.


Comment 728 Alexander von Gluck 2020-12-17 14:21:27 UTC
Some information. I was having this issue all the time on my Asus X370 and my Ryzen 1800X.

I upgraded my mainboard to an Asus X470, and the issue disappeared overnight. Same RAM/CPU/add-on cards/OS.


So, potentially due to the X370, or the combination of early Ryzen + X370.
Comment 729 Ashesh Ambasta 2021-01-14 13:50:40 UTC
So, months after my RMA, this issue seems to have surfaced again. I experienced a lockup while the system was idle just a while ago.

It also seems to somehow be related to particular boots: my system was off (not suspended, powered down) last night. It had been stable for months before that.

I'll be selling my 2950/mobo/cooler and switching to a more stable Intel (something old). Life is too short to live on the "edge" with these AMD processors.
Comment 730 raul 2021-01-14 18:00:32 UTC
This thing can happen due to multiple factors.
I was running a 1800X. Freezes occurred within 24-48 h of uptime. Disabling Global C-State or enabling Typical Current Idle in the UEFI stopped those freezes. The latter option disables PC6 on top of using a 0.85 V idle voltage. To disable PC6 you need a cold boot; otherwise only the voltage change is applied and PC6 remains enabled, crashing the system as usual.

I upgraded to a 2700X (UEFI cleared) and apparently the issue disappeared. But no, it's just less frequent. Much less. Currently I have an uptime of 23 days. I also had another 2700X running for 17 days before testing a 1800X again, which crashed within 24 h; afterwards I inserted the current 2700X, which crashed within 48 h. The following boot is the one with an uptime of 23 days.

All in all, this might be related either to the PSU or to a kernel workaround that is required but not yet implemented for an erratum in the processor, as detailed in the AMD Revision Guide [1]. The 1800X would crash more because it has more errata and needs more workarounds, while the 2700X has fewer, especially ones related to the PCIe controller. Ironically, the 2700X consumes less power at idle than the 1800X because it requires lower voltages (12nm+ vs. 14nm), and the voltages specified at every power level are also lower, so I don't see much logic in blaming the PSU for the instability.

Everyone trying "idle=nomwait", "idle=halt" or "processor.max_cstate=5" should be warned that those options are useless. There are only 2 C-states available on Ryzen systems, so if you want to limit the C-state you have to set it to 1 at most: "processor.max_cstate=1". Also, use of the MWAIT instruction is disabled by the UEFI when a Ryzen 1800X processor is installed, and the 2700X and the rest of the 2nd gen Ryzens are not affected by any MWAIT bug, so the "idle=" options are useless as well.

It's better to try "pcie_aspm=off", or "pcie_aspm=force" with "pcie_aspm.policy=performance", and/or "nvme_core.default_ps_max_latency_us=0". Maybe the PCIe root complex or something related does not wake up and the processor stalls waiting for an interrupt to be served.

[1] https://www.amd.com/system/files/TechDocs/55449_Fam_17h_M_00h-0Fh_Rev_Guide.pdf
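
For reference, a minimal sketch of how such parameters are typically applied on a GRUB-based distribution (the exact parameter set below is only an example; pick the options you actually want to test):

    # /etc/default/grub
    GRUB_CMDLINE_LINUX_DEFAULT="quiet processor.max_cstate=1 pcie_aspm=off nvme_core.default_ps_max_latency_us=0"

    # Regenerate the GRUB configuration and reboot
    sudo update-grub                               # Debian/Ubuntu
    sudo grub2-mkconfig -o /boot/grub2/grub.cfg    # Fedora/openSUSE

    # After rebooting, verify the parameters took effect
    cat /proc/cmdline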
Comment 731 ricki 2021-01-14 21:13:30 UTC
I have a Ryzen 1700X and I'm heavily affected by the problem. With the default kernel parameters: lockups every 1-2 days. With "processor.max_cstate=5 rcu_nocbs=0-15": reboots instead of lockups, but less frequent. Still too cumbersome to deal with.

Does anyone know whether this problem also occurs with Ryzen 5? I'd like to give it a second chance.
Comment 732 Ashesh Ambasta 2021-01-15 12:27:30 UTC
I did some research into the latest Ryzens, and I suspect the issue is 
still not solved. I did manage to read some alarming posts discussing 
similar lockups.

I suspect this also has to do with the motherboard. At this stage, I'm 
no longer willing to try (I've done one RMA and 2.5 years with an 
unreliable system).

Not to rant, but I had a similar experience with AMD a decade earlier. Maybe it's time for me to move on. :-)
Comment 734 Radomír Polách 2021-01-18 18:06:06 UTC
As far as I can tell, it was solved for me quite some time ago (on my A485), most probably by a BIOS update (together with the kernel updates applied earlier), and I have not encountered it on the E459 with a newer Ryzen at all.

I have never encountered it on Windows.

My thinking is that it was a combination of kernel behaviour with some issue in early BIOSes, and it was fixed when both were updated.

If you use a desktop and already have the latest BIOS, I would suggest replacing the motherboard, as somebody pointed out above.
Comment 735 Ashesh Ambasta 2021-01-23 14:01:47 UTC
Created attachment 294817 [details]
attachment-13720-0.html

After some more digging around, I see the following in `journalctl -b -1`:

    Jan 23 14:48:41 quasar-nixos-tr kernel: xhci_hcd 0000:09:00.3: Mismatch between completed Set TR Deq Ptr command & xHCI internal state.
    Jan 23 14:48:41 quasar-nixos-tr kernel: xhci_hcd 0000:09:00.3: ep deq seg = 00000000734e9686, deq ptr = 00000000f203822a

and indeed, the "crashes" I experience also seem to be accompanied by 
all my USB devices (mouse, keyboard) losing power.
Comment 736 Ashesh Ambasta 2021-01-23 14:46:00 UTC
Created attachment 294819 [details]
attachment-29532-0.html

Actually, never mind.

I also see the following when looking at all similar messages:

    $ journalctl | grep 'xhci.*Mismatch'
    Jun 13 14:43:09 quasar-nixos-tr kernel: xhci_hcd 0000:09:00.3: Mismatch between completed Set TR Deq Ptr command & xHCI internal state.
    Jun 27 17:58:56 quasar-nixos-tr kernel: xhci_hcd 0000:09:00.3: Mismatch between completed Set TR Deq Ptr command & xHCI internal state.
    Jul 16 18:10:53 quasar-nixos-tr kernel: xhci_hcd 0000:09:00.3: Mismatch between completed Set TR Deq Ptr command & xHCI internal state.
    Jul 19 16:53:52 quasar-nixos-tr kernel: xhci_hcd 0000:09:00.3: Mismatch between completed Set TR Deq Ptr command & xHCI internal state.
    Sep 10 12:24:53 quasar-nixos-tr kernel: xhci_hcd 0000:09:00.3: Mismatch between completed Set TR Deq Ptr command & xHCI internal state.
    Oct 09 13:23:53 quasar-nixos-tr kernel: xhci_hcd 0000:09:00.3: Mismatch between completed Set TR Deq Ptr command & xHCI internal state.
    Oct 10 12:40:37 quasar-nixos-tr kernel: xhci_hcd 0000:09:00.3: Mismatch between completed Set TR Deq Ptr command & xHCI internal state.
    Nov 01 16:44:36 quasar-nixos-tr kernel: xhci_hcd 0000:09:00.3: Mismatch between completed Set TR Deq Ptr command & xHCI internal state.
    Dec 21 12:24:03 quasar-nixos-tr kernel: xhci_hcd 0000:09:00.3: Mismatch between completed Set TR Deq Ptr command & xHCI internal state.
    Jan 13 19:09:51 quasar-nixos-tr kernel: xhci_hcd 0000:09:00.3: Mismatch between completed Set TR Deq Ptr command & xHCI internal state.
    Jan 23 14:48:41 quasar-nixos-tr kernel: xhci_hcd 0000:09:00.3: Mismatch between completed Set TR Deq Ptr command & xHCI internal state.

There were no crashes in October, November, or December, and I did not lose any USB functionality then the way I did now.

In any case, I'm still looking to replace this hardware.
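
For anyone trying to correlate these messages with the actual lockups, a small sketch using standard journalctl options (the boot offset -1 is just an example):

    # List recorded boots with their time ranges
    journalctl --list-boots

    # Kernel messages from the previous boot, filtered for the usual suspects
    journalctl -k -b -1 | grep -iE 'xhci|soft lockup|watchdog'

    # Tail of the previous boot, where a lockup trace would normally appear
    journalctl -k -b -1 -n 200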
Comment 737 Paul Menzel 2021-01-24 12:06:01 UTC
(In reply to Ashesh Ambasta from comment #736)

[…]


> I also see the following when looking at all similar messages:
> 
>     $ journalctl | grep 'xhci.*Mismatch'
>     Jun 13 14:43:09 quasar-nixos-tr kernel: xhci_hcd 0000:09:00.3:
>     Mismatch between completed Set TR Deq Ptr command & xHCI internal
>     state.

[…]

Please report that to the Linux USB folks (see the USB SUBSYSTEM entry in the MAINTAINERS file [1]). If you do, please also attach the output of `lsusb`.


[1]: https://www.kernel.org/doc/linux/MAINTAINERS
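
As a small aid when preparing such a report, something along these lines gathers the usual context (plain lsusb and journalctl invocations; the file names are only examples):

    lsusb > usb-report.txt                                # device list
    lsusb -t >> usb-report.txt                            # bus topology with the drivers in use
    journalctl -k -b | grep -i xhci >> usb-report.txt     # xHCI messages from the current boot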
Comment 744 JerryD 2021-03-19 17:42:25 UTC
Comment 743 is phishing; do not click its links.
Comment 751 Agmon 2021-08-05 16:27:29 UTC
Hello,

after a while with random AMD Ryzen 5 reboots, the final solution was to add the kernel parameter:

acpi_osi=Linux

Three days of stability with Ubuntu 18.04 (5.4.0-80-generic) and counting.
Comment 752 Paul Menzel 2021-08-05 17:13:55 UTC
Agmon, thank you for your report. Please open a separate issue with the full output of `dmesg` attached, and the output of `acpidump`. Please reference the issues here.
Comment 779 Jonathan N 2022-09-07 15:28:42 UTC
Hi, I was encountering this issue on a frequent basis. I tried several methods listed across the internet to solve the "soft lockup", which occurred especially while using rsync to copy data from one disk to another.

At some point, I ran Memtest86 to check whether there was a memory problem. I have 4 DIMMs of 16 GB each installed, and the BIOS settings were mostly left on Auto. Memtest86 failed 25% of the tests, which is really bad.

I noticed that Memtest86 reported the rated voltage of each memory module as 1.350 V, while the Auto setting in the BIOS had set those values to 1.200 V instead.

After updating that value, I ran the test suite again and none of the tests failed. I then ran rsync over a large directory and did not trigger the soft lockup.

tl;dr: Make sure your RAM is configured with the manufacturer-recommended settings instead of the BIOS Auto setting; this goes for both frequency and voltage.
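
For cross-checking what the firmware actually programmed from inside Linux, assuming dmidecode is available (the exact field names vary between BIOS versions, so treat this as a sketch):

    # Configured speed and voltage for each DIMM as reported via SMBIOS
    sudo dmidecode --type memory | grep -E 'Locator|Speed|Voltage'

Compare the reported values against the XMP profile printed on the modules or shown in Memtest86's SPD report.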
Comment 780 Paul Menzel 2022-09-07 15:36:52 UTC
Jonathan, thank you for the feedback. For the record, please share the mainboard model, the firmware version, and the memory used.
Comment 781 Nelson Castillo 2022-09-07 16:00:13 UTC
For the hardware I had, this talk was life-saving. I thought I was hitting this issue, but it was the RAM as well; I used Memtest86 for the first time. Mandatory stuff!

Fortunately things seem to be less manual nowadays. See:

https://www.youtube.com/watch?v=vZgpHTaQ10k
Comment 782 Jonathan N 2022-09-07 16:05:20 UTC
Here is the setup: 


Vendor: American Megatrends Inc.
Version: P4.80
Release Date: 02/16/2022
BIOS Revision: 5.17


Base Board Information
Manufacturer: ASRock
Product Name: X570 Taichi

Processor Information
Version: AMD Ryzen 9 3900X 12-Core Processor            

Memory:
Kingston RAM KF3200C16D4/16GX 16GB DIMM DDR4 3200MT
XMP Profile: DDR4-3200 CL16-18-18 @1.35V

I could not find a way to format this information nicely, so here is a pastebin:
https://pastebin.com/TsB2LVD3
Comment 784 Carol Jams 2022-11-08 13:49:02 UTC
You may need to upgrade your system with an SSD like https://www.allhdd.com/dell-dc29p-solid-state-drive/ or your Memory to DDR 4 to fix it.