Bug 202043 - amdgpu: Vega 56 SCLK drops to 700 Mhz when undervolting
Status: NEW
Alias: None
Product: Drivers
Classification: Unclassified
Component: Video(DRI - non Intel)
Hardware: All Linux
Importance: P1 normal
Assignee: drivers_video-dri
Depends on:
Reported: 2018-12-22 12:24 UTC by antifermion
Modified: 2019-10-31 13:07 UTC

See Also:
Kernel Version: 4.19.8, 4.20.0-rc6
Tree: Mainline
Regression: No

Example of undervolting effect (328.06 KB, image/png)
2019-03-20 07:45 UTC, Ivan Avdeev

Description antifermion 2018-12-22 12:24:14 UTC
When undervolting my Sapphire Pulse Vega 56 by just 1 mV, SCLK immediately drops to 700 MHz and pstate 1-2 under load (`gputest /test=fur /width=1920 /height=1080`).

Script to undervolt:
echo "s 7 1630 1199" > /sys/class/drm/card0/device/pp_od_clk_voltage
echo "c" > /sys/class/drm/card0/device/pp_od_clk_voltage

Stock voltage would be 1200 mV on the Vega 64 BIOS.
The same behavior can be observed with the stock Vega 56 BIOS.
Undervolting the memory by 1 mV results in similar behavior.
Overvolting by 1 mV has no discernible effect.

`echo r > pp_od_clk_voltage` does not restore the normal behavior. Instead, I need to write the stock value back with `echo "s 7 1630 1200" > pp_od_clk_voltage`, as above.
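For reference, the lines written to `pp_od_clk_voltage` in this report have the form `s <level> <clock in MHz> <voltage in mV>`. A minimal sketch of a helper that builds such a line from a stock voltage and an offset follows; the function name `od_sclk_line` is made up for illustration and is not part of the driver or any tool.

```shell
# Hypothetical helper: print an "s <level> <MHz> <mV>" line for pp_od_clk_voltage.
# Arguments: level, clock in MHz, stock voltage in mV, offset in mV (may be negative).
od_sclk_line() {
    printf 's %d %d %d\n' "$1" "$2" "$(( $3 + $4 ))"
}
```

With the values from this report, `od_sclk_line 7 1630 1200 -1` prints `s 7 1630 1199`, and an offset of `0` reproduces the stock line used to recover.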

Without undervolting, SCLK is around 1330 MHz, which matches the behavior on Windows, where undervolting by around 150 mV is no problem and increases clocks.

With an increased power limit of 300 W, the clocks increase to around 1100 MHz while the card uses the full 300 W.
It even hits that limit with a significant underclock/undervolt that would draw around 200 W on Windows.

I tested with current Manjaro (4.19.8-2-MANJARO), as well as Kubuntu 18.10 with stock (4.18) and 4.20 from https://github.com/M-Bab/linux-kernel-amdgpu-binaries.
Comment 1 antifermion 2018-12-22 15:31:30 UTC
The driver source says
> To manually adjust these settings, first select manual using power_dpm_force_performance_level

However, doing that does not change anything.
Setting the powerstate manually via `pp_dpm_sclk` does not work for me either.
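To check which power state is actually active while testing, the `pp_dpm_sclk` file can be parsed. This is a minimal sketch, assuming the usual layout of one `N: <clock>Mhz` line per level with a trailing `*` on the active one; the function name `active_dpm_state` is hypothetical.

```shell
# Hypothetical helper: read pp_dpm_sclk-style text on stdin and print
# "<level> <clock>" for the line marked active with "*".
active_dpm_state() {
    awk '/\*/ { sub(":", "", $1); print $1, $2 }'
}
```

Usage would be along the lines of `active_dpm_state < /sys/class/drm/card0/device/pp_dpm_sclk`, run repeatedly under load.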
Comment 2 fin4478 2018-12-24 13:05:52 UTC
Add `amdgpu.ppfeaturemask=0xffffffff` to the kernel command line.
Comment 3 antifermion 2018-12-26 19:39:31 UTC
I do have that enabled (my `/proc/cmdline` is `BOOT_IMAGE=/boot/vmlinuz-4.19-x86_64 root=UUID=2994b8cf-341a-48b0-b49d-771df87dc509 rw quiet amdgpu.ppfeaturemask=0xffffffff`)
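A small sketch for confirming the parameter from a command line string, in case others want to check their own setup. The function name `ppfeaturemask_from_cmdline` is made up for illustration; it only looks at the text it is given and does not verify what the driver actually applied.

```shell
# Hypothetical helper: read a kernel command line on stdin and print the
# value of amdgpu.ppfeaturemask, or nothing if the parameter is absent.
ppfeaturemask_from_cmdline() {
    # Put each parameter on its own line, then strip the key from the match.
    tr ' ' '\n' | sed -n 's/^amdgpu\.ppfeaturemask=//p'
}
```

For example, `ppfeaturemask_from_cmdline < /proc/cmdline` should print `0xffffffff` for the command line quoted above.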
Comment 4 Ivan Avdeev 2019-03-20 07:40:36 UTC
I have the very same issue on Vega 64. Changing any voltage value from stock even by 1mV leads to severe throttling.

# assumes DEV points at the card's sysfs directory, e.g. /sys/class/drm/card0/device
echo "manual" > $DEV/power_dpm_force_performance_level
echo "2" > $DEV/pp_sclk_od
echo "2" > $DEV/pp_mclk_od
echo "s 7 1663 1190" > $DEV/pp_od_clk_voltage # stock = 1200
echo "c" > $DEV/pp_od_clk_voltage

This results in 0.5x performance and stutter.

This doesn't happen on Windows on the same hardware.

Card: MSI Radeon RX Vega 64 Air Boost 8G OC
Kernels tried: 4.20, 5.0-rc2, 5.0.2
Mesa: 18.3.3, 18.3.4
MB: ASUS Prime X399-A, bios 0808
CPU: Threadripper 1950X
Comment 5 Ivan Avdeev 2019-03-20 07:45:41 UTC
Created attachment 281917
Example of undervolting effect

Example of what happens when undervolting is attempted. Notice how:
- frame time goes from stable ~15ms to stuttering 25-30ms
- core freq drops from ~1.4GHz to 1.1GHz
- memory freq goes from stable ~900MHz to rapidly oscillating between 200-900MHz.

Captured by running Unigine Superposition in game mode on medium settings on a 2560x1440 screen with `GALLIUM_HUD=.d.w1920frametime,.d.w1920shader-clock+memory-clock GALLIUM_HUD_PERIOD=0`
Comment 6 mistarzy 2019-03-27 00:41:56 UTC
Same issue here with Vega 64. From watching /sys/kernel/debug/dri/0/amdgpu_pm_info, my conclusion is that the driver basically applies max voltage even though the table in /sys/class/drm/card0/device/pp_od_clk_voltage suggests otherwise. I tested with Windows 10 and applied the same pp_od_clk_voltage values: no issues. For testing I applied 1.2 V for all states and had the same result with amdgpu. Tested on the 5.0.3 and 5.0.4 Ubuntu mainline kernels.
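A minimal sketch for pulling the reported core voltage out of `amdgpu_pm_info`, so it can be compared against the table in `pp_od_clk_voltage`. It assumes the file contains a line of the form `<value> mV (VDDGFX)`; that layout is an assumption and may differ between kernel versions. The function name `vddgfx_mv` is hypothetical.

```shell
# Hypothetical helper: read amdgpu_pm_info-style text on stdin and print
# the number in front of "mV (VDDGFX)". Assumes that line format exists.
vddgfx_mv() {
    awk '/\(VDDGFX\)/ { print $1 }'
}
```

It could be run as root with `vddgfx_mv < /sys/kernel/debug/dri/0/amdgpu_pm_info` while a load is active.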
Comment 7 mistarzy 2019-03-27 00:45:46 UTC
I got the best performance with the following setup:

echo 275000000 > /sys/class/drm/card0/device/hwmon/hwmon0/power1_cap
echo "m 3 1100 1000" > /sys/class/drm/card0/device/pp_od_clk_voltage

GPU clock hovered above 1400 MHz and memory stayed in state 3 at 1100 MHz, even though I did not switch to manual or send a commit. But after 20+ minutes of extensive benchmarking or gaming, the driver seems to bug out and drops to lower states even though temperatures stay the same. Let me know if you would like some tests and logs.
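Note that hwmon `power1_cap` takes its value in microwatts, so the `275000000` above corresponds to 275 W. A tiny sketch of the conversion, with a made-up function name:

```shell
# Hypothetical helper: convert whole watts to the microwatt value that
# hwmon power1_cap expects.
watts_to_microwatts() {
    echo "$(( $1 * 1000000 ))"
}
```

For example, `watts_to_microwatts 275` prints `275000000`.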
Comment 8 mart.b 2019-09-29 05:04:17 UTC
Same issue on a Strix Vega 64; it seems pretty weird. Is there any known fix?
Comment 9 haro41 2019-10-31 13:07:24 UTC

Should be fixed in the 5.4 release.
