Bug 202043
Summary: | amdgpu: Vega 56 SCLK drops to 700 Mhz when undervolting | ||
---|---|---|---|
Product: | Drivers | Reporter: | Dorian Rudolph (mail) |
Component: | Video(DRI - non Intel) | Assignee: | drivers_video-dri |
Status: | NEW --- | ||
Severity: | normal | CC: | fin4478, haro41, lists, mail, mart.b, maxijac, mistarzy |
Priority: | P1 | ||
Hardware: | All | ||
OS: | Linux | ||
Kernel Version: | 4.19.8, 4.20.0-rc6 | Subsystem: | |
Regression: | No | Bisected commit-id: | |
Attachments: | Example of undervolting effect |
Description
Dorian Rudolph
2018-12-22 12:24:14 UTC
The driver source says:
> To manually adjust these settings, first select manual using power_dpm_force_performance_level
However, that does not change anything.
Setting the powerstate manually via `pp_dpm_sclk` does not work for me either.
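
For reference, the manual selection described above comes down to two sysfs writes. A minimal sketch, assuming the first argument is the card's device directory (for example /sys/class/drm/card0/device) and that it runs as root; `force_sclk_state` is a hypothetical helper name, not part of any existing tool:

```shell
# Hypothetical helper: switch DPM to manual and request one SCLK state.
# $1 = device directory, $2 = state index as listed in pp_dpm_sclk.
force_sclk_state() {
    fs_dev="$1"
    fs_state="$2"
    # Manual mode has to be selected before a state can be forced.
    echo "manual" > "$fs_dev/power_dpm_force_performance_level" || return 1
    echo "$fs_state" > "$fs_dev/pp_dpm_sclk" || return 1
    # Read the table back; on real hardware the active state is marked with '*'.
    cat "$fs_dev/pp_dpm_sclk"
}
```

Calling `force_sclk_state /sys/class/drm/card0/device 7` and checking whether the `*` marker lands on state 7 shows whether the request took effect.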
Have amdgpu.ppfeaturemask=0xffffffff in the kernel command line.

I do have that enabled (my `/proc/cmdline` is `BOOT_IMAGE=/boot/vmlinuz-4.19-x86_64 root=UUID=2994b8cf-341a-48b0-b49d-771df87dc509 rw quiet amdgpu.ppfeaturemask=0xffffffff`)

I have the very same issue on Vega 64. Changing any voltage value from stock even by 1mV leads to severe throttling. E.g.:
```
echo "manual" > $DEV/power_dpm_force_performance_level
echo "2" > $DEV/pp_sclk_od
echo "2" > $DEV/pp_mclk_od
echo "s 7 1663 1190" > $DEV/pp_od_clk_voltage # stock = 1200
echo "c" > $DEV/pp_od_clk_voltage
```
Will result in 0.5x performance and stutter. This doesn't happen on windows on the same hardware.

Card: MSI Radeon RX Vega 64 Air Boost 8G OC
Kernels tried: 4.20, 5.0-rc2, 5.0.2
Mesa: 18.3.3, 18.3.4
MB: ASUS Prime X399-A, bios 0808
CPU: Threadripper 1950X

Created attachment 281917
Example of undervolting effect
Example of what happens when undervolting is attempted. Notice how:
- frame time goes from stable ~15ms to stuttering 25-30ms
- core freq drops from ~1.4GHz to 1.1GHz
- memory freq goes from stable ~900MHz to rapidly oscillating between 200-900MHz.
Captured by running Unigine Superposition in game mode on medium settings on a 2560x1440 screen with `GALLIUM_HUD=.d.w1920frametime,.d.w1920shader-clock+memory-clock GALLIUM_HUD_PERIOD=0`
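
The same clock drops can also be watched without the HUD by polling the DPM tables in sysfs. A rough sketch, assuming the usual `pp_dpm_sclk` / `pp_dpm_mclk` layout where each line looks like `7: 1630Mhz *` and the asterisk marks the active level; `active_clock` and `sample_clocks` are hypothetical helper names. As far as I know this reports the clock of the active DPM level, so it is coarser than the HUD graph:

```shell
# Print the clock on the line marked active ('*') in a pp_dpm_* file ($1).
# Assumes lines of the form "7: 1630Mhz *".
active_clock() {
    awk '/\*/ { print $2 }' "$1"
}

# Print one "sclk mclk" sample for the device directory in $1.
sample_clocks() {
    echo "$(active_clock "$1/pp_dpm_sclk") $(active_clock "$1/pp_dpm_mclk")"
}
```

Something like `while sleep 1; do sample_clocks /sys/class/drm/card0/device; done` then logs one sample per second while the benchmark runs.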
Same issue here with Vega 64. From watching /sys/kernel/debug/dri/0/amdgpu_pm_info, my conclusion is that the driver basically applies the maximum voltage even though the tables in /sys/class/drm/card0/device/pp_od_clk_voltage suggest otherwise. I tested with Windows 10 and applied the same pp_od_clk_voltage values: no issues. For testing I applied 1.2V for all states and got the same result with amdgpu. Tested on the 5.0.3 and 5.0.4 Ubuntu mainline kernels.

Best results performance-wise were with the following setup:

echo 275000000 >> /sys/class/drm/card0/device/hwmon/hwmon0/power1_cap
echo "m 3 1100 1000" > /sys/class/drm/card0/device/pp_od_clk_voltage

The GPU clock hovered above 1400MHz and memory kept state 3 at 1100MHz even though I did not switch to manual or send a commit. But after extensive use (benchmarking or gaming for 20+ min) the driver seems to bug out and goes to lower states even though temperatures stay the same. Let me know if you would like some tests and logs.

Same issue on a Strix Vega 64; it seems pretty weird. Is there any fix known?

https://bugzilla.kernel.org/show_bug.cgi?id=205277

Should be fixed in the 5.4 release.

I am seeing the same behavior with Vega 64 on 5.4.96 (LTS) and also newer kernels. From my testing, it seems that amdgpu is simply not able to properly apply the OD table to the GPU. As soon as a clock change or voltage change is sent to the GPU, it disturbs the PM, and this can only be fixed by rebooting.

(In reply to mistarzy from comment #6)
> Same issue here with Vega 64. From watching
> /sys/kernel/debug/dri/0/amdgpu_pm_info my conclusion is that basically
> driver gets max voltage applied even though tables in
> /sys/class/drm/card0/device/pp_od_clk_voltage suggest otherwise.

From my testing, I'd say that the voltages are just broken once a change to OD is sent to the GPU. After booting, if you monitor amdgpu_pm_info you will see some "uneven" VDD values (825, 831, 918mV, etc.), which makes me think some kind of curve is applied between the voltage values of the initial sclk table.
Once any single change to the OD table is sent, you will see that the VDD steps are now just big chunks, like incrementing by 50mV each up to the maximum value (1000, 1050, 1100, 1200mV...)

It seems using the voltage curve feature of pp_od_clk_voltage ("vc") _could_ fix the issue, but it is not supported on cards older than VEGA20... Not sure if this is a SW limitation in the driver or a GPU limitation.

Unexpectedly, the same effect can be seen when sending a full PP table.

(In reply to haro41 from comment #9)
> https://bugzilla.kernel.org/show_bug.cgi?id=205277
>
> Should be fixed 5.4 release.

No, it's not fixed.

(In reply to mistarzy from comment #7)
> Best results performance wise had with following setup:
>
> echo 275000000 >> /sys/class/drm/card0/device/hwmon/hwmon0/power1_cap
> echo "m 3 1100 1000" > /sys/class/drm/card0/device/pp_od_clk_voltage

Caution: if you just change the MCLK without committing (sending "c"), it seems the change is not actually sent to the GPU, even though all the other tables and info files report the updated value.
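
To avoid the uncommitted-edit trap when testing, the edit and the commit can be wrapped together, and the resulting voltage read back from amdgpu_pm_info. This is only a sketch under assumptions: `apply_od` and `vddgfx_mv` are hypothetical helper names, the first argument is the card's device directory, root is required, and the amdgpu_pm_info layout (a line of the form `<number> mV (VDDGFX)`) is assumed rather than guaranteed across kernel versions:

```shell
# Hypothetical helper: send OD edits, then always commit with "c".
# $1 = device directory, remaining args = edit strings such as "m 3 1100 1000".
apply_od() {
    ao_dev="$1"
    shift
    for ao_cmd in "$@"; do
        echo "$ao_cmd" > "$ao_dev/pp_od_clk_voltage" || return 1
    done
    # Without this commit the edit may only show up in the tables.
    echo "c" > "$ao_dev/pp_od_clk_voltage" || return 1
}

# Print the VDDGFX value in mV from an amdgpu_pm_info dump ($1).
# Assumes a line of the form "<number> mV (VDDGFX)".
vddgfx_mv() {
    awk '/\(VDDGFX\)/ { print $1 }' "$1"
}
```

For example, `apply_od /sys/class/drm/card0/device "m 3 1100 1000"` followed by `vddgfx_mv /sys/kernel/debug/dri/0/amdgpu_pm_info` under load shows whether the voltage actually applied matches the table.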