Created attachment 278829 [details] lsusb lspci cpuinfo config url sensors sensors reports a power consumption of 13 W idle for 4.18.10 sensors reports a power consumption of 7 W idle for 4.18.9 Attached: lsusb lspci cpuinfo url to mainboard an graphics card kernel config output of sensors while idle
Can you bisect?
Git bisect howto: https://www.kernel.org/doc/html/v4.18/admin-guide/bug-bisect.html
Am 28.09.18 um 20:14 schrieb bugzilla-daemon@bugzilla.kernel.org: > https://bugzilla.kernel.org/show_bug.cgi?id=201275 > > --- Comment #2 from Alex Deucher (alexdeucher@gmail.com) --- > Git bisect howto: > https://www.kernel.org/doc/html/v4.18/admin-guide/bug-bisect.html > Sounds prakticable, but may take 1-2 days.
I can second that for RX580 (Polaris20) It raised to 60 W 'idle' (from ~31/32 W with 4.18.9) bisect? - Not so fast 'cause I use openSUSE Tumbleweed 'Kernel:stable' when 'amd-staging-drm-next' do NOT work for me --- and it do NOT work for me since 21/22 August.... but that come with another ticket. amdgpu-pci-0100 Adapter: PCI adapter vddgfx: +1.20 V fan1: 886 RPM temp1: +43.0°C (crit = +94.0°C, hyst = -273.1°C) power1: 59.16 W (cap = 175.00 W)
Created attachment 278841 [details] bisect result Includes bisect steps + sensors output
RX560 is Polaris11, so Bug may be ported from Polaris20
For now i have resolved this problem by simply removing patch for 4.18.11 git diff 93b100ddda3be284be160e9ccba28c7f8f21ab73^1..93b100ddda3be284be160e9ccba28c7f8f21ab73 and patch -R -p1 or without "-R": git diff 93b100ddda3be284be160e9ccba28c7f8f21ab73..93b100ddda3be284be160e9ccba28c7f8f21ab73^1 Maybe the specs for Vega and Polaris just differ at this point?
4.18.12 Bug still present, removing 93b100ddda3be284be160e9ccba28c7f8f21ab73 solves this problem for now.
I don't think this is a bug. The problem is, prior to that patch, the display component was requesting minimum clocks that were 10x too low. This saved power, but led to display problems on some systems because the clocks were too low to sustain the display requirements.
(In reply to Alex Deucher from comment #9) > I don't think this is a bug. The problem is, prior to that patch, the > display component was requesting minimum clocks that were 10x too low. This > saved power, but led to display problems on some systems because the clocks > were too low to sustain the display requirements. Sorry Alex, what? _All_ was fine _before_ this commit for ages with stable upstream and all 'amd-staging-drm-next'. Now, I get ~60 W raised from ~30 W with 1920x1080 (even dual display was good before).
Can you attach the output of `cat /sys/kernel/debug/dri/0/amdgpu_pm_info` before and after the patch?
openSUSE Tumbleweed Kernel:stable 4.18.12-2.ga880bd8-default After the patch. (For 'before' I have to reboot to broken 'amd-staging-drm-next') https://bugs.freedesktop.org/show_bug.cgi?id=108096 Clock Gating Flags Mask: 0x37bcf Graphics Medium Grain Clock Gating: On Graphics Medium Grain memory Light Sleep: On Graphics Coarse Grain Clock Gating: On Graphics Coarse Grain memory Light Sleep: On Graphics Coarse Grain Tree Shader Clock Gating: Off Graphics Coarse Grain Tree Shader Light Sleep: Off Graphics Command Processor Light Sleep: On Graphics Run List Controller Light Sleep: On Graphics 3D Coarse Grain Clock Gating: Off Graphics 3D Coarse Grain memory Light Sleep: Off Memory Controller Light Sleep: On Memory Controller Medium Grain Clock Gating: On System Direct Memory Access Light Sleep: Off System Direct Memory Access Medium Grain Clock Gating: On Bus Interface Medium Grain Clock Gating: Off Bus Interface Light Sleep: On Unified Video Decoder Medium Grain Clock Gating: On Video Compression Engine Medium Grain Clock Gating: On Host Data Path Light Sleep: Off Host Data Path Medium Grain Clock Gating: On Digital Right Management Medium Grain Clock Gating: Off Digital Right Management Light Sleep: Off Rom Medium Grain Clock Gating: On Data Fabric Medium Grain Clock Gating: Off GFX Clocks and Power: 2000 MHz (MCLK) 1411 MHz (SCLK) 600 MHz (PSTATE_SCLK) 1000 MHz (PSTATE_MCLK) 1200 mV (VDDGFX) 61.254 W (average GPU) GPU Temperature: 44 C GPU Load: 0 % UVD: Disabled VCE: Disabled
amd-staging-drm-next (with broken SDDM and then 'init 3') Why is 'GPU Load:' so hight? => take it with a drain of salt. Clock Gating Flags Mask: 0x3fbcf Graphics Medium Grain Clock Gating: On Graphics Medium Grain memory Light Sleep: On Graphics Coarse Grain Clock Gating: On Graphics Coarse Grain memory Light Sleep: On Graphics Coarse Grain Tree Shader Clock Gating: Off Graphics Coarse Grain Tree Shader Light Sleep: Off Graphics Command Processor Light Sleep: On Graphics Run List Controller Light Sleep: On Graphics 3D Coarse Grain Clock Gating: Off Graphics 3D Coarse Grain memory Light Sleep: Off Memory Controller Light Sleep: On Memory Controller Medium Grain Clock Gating: On System Direct Memory Access Light Sleep: Off System Direct Memory Access Medium Grain Clock Gating: On Bus Interface Medium Grain Clock Gating: Off Bus Interface Light Sleep: On Unified Video Decoder Medium Grain Clock Gating: On Video Compression Engine Medium Grain Clock Gating: On Host Data Path Light Sleep: On Host Data Path Medium Grain Clock Gating: On Digital Right Management Medium Grain Clock Gating: Off Digital Right Management Light Sleep: Off Rom Medium Grain Clock Gating: On Data Fabric Medium Grain Clock Gating: Off GFX Clocks and Power: 300 MHz (MCLK) 303 MHz (SCLK) 600 MHz (PSTATE_SCLK) 1000 MHz (PSTATE_MCLK) 831 mV (VDDGFX) 32.176 W (average GPU) GPU Temperature: 29 C GPU Load: 84 % UVD: Disabled VCE: Disabled
Diff !!! BAD Host Data Path Light Sleep: Off GOOD Host Data Path Light Sleep: On
(In reply to Dieter Nützel from comment #14) > Diff !!! > > BAD > Host Data Path Light Sleep: Off card0/device> cat pp_dpm_mclk 0: 300Mhz 1: 1000Mhz 2: 2000Mhz * card0/device> cat pp_dpm_sclk 0: 300Mhz 1: 600Mhz 2: 900Mhz 3: 1145Mhz 4: 1215Mhz 5: 1257Mhz 6: 1300Mhz 7: 1411Mhz * > GOOD > Host Data Path Light Sleep: On card0/device cat pp_dpm_mclk 0: 300Mhz * 1: 1000Mhz 2: 2000Mhz card0/device cat pp_dpm_sclk 0: 300Mhz 1: 600Mhz * 2: 900Mhz 3: 1145Mhz 4: 1215Mhz 5: 1257Mhz 6: 1300Mhz 7: 1411Mhz But SCLK changed much. Need badly some sleep. Saturday morning off for family vacation. Greetings, Dieter
Created attachment 278933 [details] amdgpu_pm_info content of /sys/kernel/debug/dri/1/amdgpu_pm_info for 4.18.12 +/- 93b100ddda3be284be160e9ccba28c7f8f21ab73
I can confirm the issue with Polaris10. Power consumption is roughly 30 Watts higher in idle compared to what it used to be and compared to Windows. DPM are stuck in highest power modes for both s and m. The reporter has already bisected so I haven't. If you still need more info, please ping.
(In reply to Alex Deucher from comment #9) > I don't think this is a bug. The problem is, prior to that patch, the > display component was requesting minimum clocks that were 10x too low. This > saved power, but led to display problems on some systems because the clocks > were too low to sustain the display requirements. so - 93b100ddd... fools DC + 93b100ddd... fools PM from my point of view scaling clock values just happens at the wrong place. So we may have to find different points in code from where smuX_get_XXX gets called by PM _or_ DC, may be in Firmware.
People on the Phoronix forum mentioned that this doesn't seem to happen with 4.19-rc kernels. If people here can confirm that, maybe there are other corresponding changes that need to be backported as well.
(In reply to Michel Dänzer from comment #19) > People on the Phoronix forum mentioned that this doesn't seem to happen with > 4.19-rc kernels. If people here can confirm that, maybe there are other > corresponding changes that need to be backported as well. 4.19-rc1: amdgpu-pci-0100 Adapter: PCI adapter vddgfx: +0.81 V fan1: 1602 RPM temp1: +22.0°C (crit = +94.0°C, hyst = -273.1°C) power1: 6.10 W (cap = 48.00 W)
I can confirm that only 4.18.x (x > 9) is affected. 4.19-rc6 reports the same clock and power values as with 4.18.9. At least that's the case for my RX 560.
Created attachment 278939 [details] bisect Author of 93b100ddda3be284be160e9ccba28c7f8f21ab73 simply forgot to remove scaling values for powerplay. Have a look at 23ec3d1479fd79658cd52c47618d8ddd2f32550b where the same scaling applied to vega. You may have to patch drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm_services.c too. Have a look at needed_patch.txt
drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm_services.c (v4.18.12) There is a lot of work to do: 230 for (i = 0; i < dc_clks->num_levels; i++) { 231 DRM_INFO("DM_PPLIB:\t %d\n", pp_clks->clock[i]); 232 /* translate 10kHz to kHz */ 233 dc_clks->clocks_in_khz[i] = pp_clks->clock[i] * 10; 234 } 257 for (i = 0; i < clk_level_info->num_levels; i++) { 258 DRM_DEBUG("DM_PPLIB:\t %d in 10kHz\n", pp_clks->data[i].clocks_in_khz); 259 /* translate 10kHz to kHz */ 260 clk_level_info->data[i].clocks_in_khz = pp_clks->data[i].clocks_in_khz * 10; 261 clk_level_info->data[i].latency_in_us = pp_clks->data[i].latency_in_us; 262 } and maybe 306 /* Translate 10 kHz to kHz. */ 307 validation_clks.engine_max_clock *= 10; 308 validation_clks.memory_max_clock *= 10; since 2017-09-12 15:58:20 bool dm_pp_get_clock_levels_by_type_with_voltage( const struct dc_context *ctx, enum dm_pp_clock_type clk_type, struct dm_pp_clock_levels_with_voltage *clk_level_info) { /* TODO: to be implemented */ return false; } bool dm_pp_notify_wm_clock_changes( const struct dc_context *ctx, struct dm_pp_wm_sets_with_clock_ranges *wm_with_clock_ranges) { /* TODO: to be implemented */ return false; } bool dm_pp_apply_power_level_change_request( const struct dc_context *ctx, struct dm_pp_power_level_change_request *level_change_req) { /* TODO: to be implemented */ return false; } bool dm_pp_apply_clock_for_voltage_request( const struct dc_context *ctx, struct dm_pp_clock_for_voltage_req *clock_for_voltage_req) { /* TODO: to be implemented */ return false; } bool dm_pp_get_static_clocks( const struct dc_context *ctx, struct dm_pp_static_clock_info *static_clk_info) { /* TODO: to be implemented */ return false; }
This patch shouldn't have been applied to 4.18. It looks like it was autoselected: https://lkml.org/lkml/2018/9/15/172 It should be reverted.
(In reply to Alex Deucher from comment #24) > This patch shouldn't have been applied to 4.18. It looks like it was > autoselected: > https://lkml.org/lkml/2018/9/15/172 > It should be reverted. So "[...]the display component was requesting minimum clocks[...]" isn´t an issue with Polaris? Is there any QA left? Avoiding unusual units is a good idea generally, but it should happen very early in development.
(In reply to quirin.blaeser from comment #25) > (In reply to Alex Deucher from comment #24) > > This patch shouldn't have been applied to 4.18. It looks like it was > > autoselected: > > https://lkml.org/lkml/2018/9/15/172 > > It should be reverted. > > So "[...]the display component was requesting minimum clocks[...]" isn´t an > issue with Polaris? > Is there any QA left? > Avoiding unusual units is a good idea generally, but it should happen very > early in development. It was a fix for fallout from an interface refactor we did in 4.19 that mixed up the units between display and power. We did not intend to have the patch applied to 4.18 and we did not flag the patch for 4.18, it was flagged for 4.18 by someone else outside of AMD.
I can´t find "c3df50abc84b" from https://lkml.org/lkml/2018/9/15/172 but "drm/amd/pp: Convert clock unit to KHz as defined" is 23ec3d1479fd79658cd52c47618d8ddd2f32550b
(In reply to quirin.blaeser from comment #27) > I can´t find "c3df50abc84b" from https://lkml.org/lkml/2018/9/15/172 > but "drm/amd/pp: Convert clock unit to KHz as defined" is > 23ec3d1479fd79658cd52c47618d8ddd2f32550b That was the commit id in our amd-staging-drm-next branch: https://cgit.freedesktop.org/~agd5f/linux/commit/?h=amd-staging-drm-next&id=c3df50abc84b289be8e7b96968d7d7e006576880
Created attachment 278965 [details] git diff -p 93b100ddda3be284be160e9ccba28c7f8f21ab73..93b100ddda3be284be160e9ccba28c7f8f21ab73^1 git diff -p 93b100ddda3be284be160e9ccba28c7f8f21ab73..93b100ddda3be284be160e9ccba28c7f8f21ab73^1 Apply to v4.18.10 .. v4.18.12 to revert
(In reply to Alex Deucher from comment #28) > (In reply to quirin.blaeser from comment #27) > > I can´t find "c3df50abc84b" from https://lkml.org/lkml/2018/9/15/172 > > but "drm/amd/pp: Convert clock unit to KHz as defined" is > > 23ec3d1479fd79658cd52c47618d8ddd2f32550b > > That was the commit id in our amd-staging-drm-next branch: > https://cgit.freedesktop.org/~agd5f/linux/commit/?h=amd-staging-drm- > next&id=c3df50abc84b289be8e7b96968d7d7e006576880 thx
Hello, This error is also registered on launchpad.net https://bugs.launchpad.net/bugs/1796720 Best regards, -- Cristian Aravena Romero (caravena)
Hello, The kernel 4.18.13 works correctly? Best regards, -- Cristian Aravena Romero (caravena)
Yes, it should. GregKH reverted the offending commit in 4.18.13.