Bug 197923 - ZEN CPUs hang when power management enabled, workaround included.
Summary: ZEN CPUs hang when power management enabled, workaround included.
Status: NEW
Alias: None
Product: Drivers
Classification: Unclassified
Component: Other
Hardware: x86-64 Linux
Importance: P1 blocking
Assignee: drivers_other
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2017-11-19 17:04 UTC by Mathieu Belanger
Modified: 2017-11-21 16:29 UTC

See Also:
Kernel Version: 4.13+
Subsystem:
Regression: No
Bisected commit-id:


Description Mathieu Belanger 2017-11-19 17:04:43 UTC
ZEN CPUs have an issue where the system hangs, mostly when idle, if proper power management is enabled; in my case it also causes display anomalies with AMDGPU.

The workaround was to enable CONFIG_RCU_NOCB_CPU_ALL, which fixed all of these issues. Major distributions began to turn it on by default from kernel 4.12 (Red Hat did; I did not check the others).

That option was removed from the kernel by commit 44c65ff2e3b0b48250a970183ab53b0602c25764, which is a problem. Competent users can still enable the workaround via the rcu_nocbs boot parameter, but it only accepts an explicit CPU range. That prevents mainstream distributions from integrating it, as they have no idea how many threads their users will have. CONFIG_RCU_NOCB_CPU_ALL covered all cores regardless of their number, so it should be put back, at least until AMD resolves the issue.
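
To illustrate the limitation, here is a minimal Python sketch of what each user (or an installer script) would have to compute to cover every CPU with rcu_nocbs. The helper name build_rcu_nocbs_param is hypothetical, and the sketch assumes logical CPUs are numbered contiguously from 0, which may not hold on every system.

```python
import os


def build_rcu_nocbs_param(cpu_count):
    """Return an rcu_nocbs= boot parameter covering CPUs 0..cpu_count-1.

    Hypothetical helper for illustration only; it assumes the logical
    CPUs are numbered contiguously starting at 0.
    """
    if cpu_count < 1:
        raise ValueError("cpu_count must be at least 1")
    if cpu_count == 1:
        return "rcu_nocbs=0"
    return f"rcu_nocbs=0-{cpu_count - 1}"


if __name__ == "__main__":
    # os.cpu_count() may return None when the count cannot be determined.
    detected = os.cpu_count() or 1
    print(build_rcu_nocbs_param(detected))
```

On a 16-thread machine this yields rcu_nocbs=0-15. The value has to be worked out per machine, which is why a fixed distribution default cannot replace the removed option.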

The option's description should mention that it is a temporary workaround for AMD ZEN (Ryzen, Threadripper, Epyc) CPUs and will be removed once AMD has fixed this issue.

Please revert 44c65ff2e3b0b48250a970183ab53b0602c25764.
Comment 1 James Le Cuirot 2017-11-19 17:47:59 UTC
Please get your facts straight. Debian, Ubuntu, and Mageia have never enabled either option. OpenSUSE 42.3 (kernel 4.4) enables CONFIG_RCU_NOCB_CPU but not CONFIG_RCU_NOCB_CPU_ALL. Fedora has enabled both since 3.16. RHEL enabled it earlier, and it is probably still enabled in RHEL.

https://bugzilla.redhat.com/show_bug.cgi?id=1109113

I build my own kernel anyway and I think reintroducing workarounds will reduce the incentive to fix it properly so I'm not calling for this, though I'm not against it either. Please see the original bug #196683.
Comment 2 Klaus Mueller 2017-11-20 20:22:28 UTC
I was facing those system hangs too, until I switched on the "daily computing" optimization in the BIOS (ASUS PRIME X370-PRO BIOS 0902 09/08/2017), running 4.13.x.

Switching to "daily computing" raises the maximum CPU clock to 3600 MHz (instead of 3400) on a Ryzen 7 1700X. Maybe it changes some more things I don't know of. But C-states are definitely enabled:

# zenstates.py -l 
P0 - Enabled - FID = 90 - DID = 8 - VID = 20 - Ratio = 36.00 - vCore = 1.35000
P1 - Enabled - FID = 90 - DID = 8 - VID = 20 - Ratio = 36.00 - vCore = 1.35000
P2 - Enabled - FID = 84 - DID = C - VID = 68 - Ratio = 22.00 - vCore = 0.90000
P3 - Disabled
P4 - Disabled
P5 - Disabled
P6 - Disabled
P7 - Disabled
C6 State - Package - Enabled
C6 State - Core - Enabled
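
The Ratio and vCore columns in the listing above can be cross-checked from the raw FID, DID and VID fields. The sketch below assumes the fields are hexadecimal and that ratio = 2 * FID / DID and vCore = 1.55 - 0.00625 * VID. These formulas are assumptions that reproduce the listed values, not something stated in this report, and the function names are hypothetical.

```python
def pstate_ratio(fid_hex, did_hex):
    """Clock multiplier from hex FID/DID, assuming ratio = 2 * FID / DID."""
    fid = int(fid_hex, 16)
    did = int(did_hex, 16)
    return 2.0 * fid / did


def pstate_vcore(vid_hex):
    """Core voltage in volts from hex VID, assuming 1.55 - 0.00625 * VID."""
    return 1.55 - 0.00625 * int(vid_hex, 16)


if __name__ == "__main__":
    # (name, FID, DID, VID) copied from the enabled P-states in the listing.
    enabled_pstates = [
        ("P0", "90", "8", "20"),
        ("P1", "90", "8", "20"),
        ("P2", "84", "C", "68"),
    ]
    for name, fid, did, vid in enabled_pstates:
        ratio = pstate_ratio(fid, did)
        vcore = pstate_vcore(vid)
        print(f"{name}: ratio = {ratio:.2f}, vCore = {vcore:.5f}")
```

Under those assumptions P0 and P1 work out to a ratio of 36 at 1.35V and P2 to a ratio of 22 at 0.90V, matching the listing.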
Comment 3 Mathieu Belanger 2017-11-20 20:30:01 UTC
Klaus Mueller: It might be stabilisation by overclocking.
It might help in some cases, but it also produces more heat.

Your two highest P-states are set at 1.35V. On my current Ryzen 1700 system, the vcore is set to 1.33V and the clock to 4.025GHz.

To confirm you actually have the same level of power saving (C6), you can run "watch sensors" and let the system idle; the Vcore should go down to about 0.4V if it's actually working.
Comment 4 Klaus Mueller 2017-11-21 16:29:08 UTC
(In reply to Mathieu Belanger from comment #3)
Yes, the lowest value was 0.39V over about 1 minute (the max was 0.91V). The machine runs 4 VMs in parallel.

C-states are definitely working, because there is no difference in power consumption between 3600 and 3400 MHz (measured at the 230V power socket). Disabling C-states increases power consumption by about 9W.
