Bug 201275

Summary: Power consumption RX560 idle raised from 7 W to 13 W
Product: Drivers Reporter: quirin.blaeser
Component: Video(DRI - non Intel)Assignee: drivers_video-dri
Status: RESOLVED CODE_FIX    
Severity: low CC: abolte, alexdeucher, caravena, chewi, Dieter, grmat, harry.wentland, lee295012, sndirsch, taijian, thomas-lange2
Priority: P1    
Hardware: All   
OS: Linux   
Kernel Version: 4.18.10 Tree: Mainline
Regression: No
Attachments: lsusb lspci cpuinfo config url sensors
bisect result
amdgpu_pm_info
bisect
git diff -p 93b100ddda3be284be160e9ccba28c7f8f21ab73..93b100ddda3be284be160e9ccba28c7f8f21ab73^1

Description quirin.blaeser 2018-09-28 16:51:31 UTC
Created attachment 278829 [details]
lsusb lspci cpuinfo config url sensors

sensors reports a power consumption of 13 W idle for 4.18.10
sensors reports a power consumption of 7 W idle for 4.18.9

Attached:
lsusb
lspci
cpuinfo
url to mainboard an graphics card
kernel config
output of sensors while idle
Comment 1 Alex Deucher 2018-09-28 17:11:57 UTC
Can you bisect?
Comment 2 Alex Deucher 2018-09-28 18:14:20 UTC
Git bisect howto:
https://www.kernel.org/doc/html/v4.18/admin-guide/bug-bisect.html
Comment 3 quirin.blaeser 2018-09-28 21:51:16 UTC
Am 28.09.18 um 20:14 schrieb bugzilla-daemon@bugzilla.kernel.org:
> https://bugzilla.kernel.org/show_bug.cgi?id=201275
> 
> --- Comment #2 from Alex Deucher (alexdeucher@gmail.com) ---
> Git bisect howto:
> https://www.kernel.org/doc/html/v4.18/admin-guide/bug-bisect.html
> 
Sounds prakticable, but may take 1-2 days.
Comment 4 Dieter Nützel 2018-09-29 02:44:41 UTC
I can second that for RX580 (Polaris20)

It raised to 60 W 'idle' (from ~31/32 W with 4.18.9)

bisect? - Not so fast 'cause I use openSUSE Tumbleweed 'Kernel:stable' when 'amd-staging-drm-next' do NOT work for me --- and it do NOT work for me since 21/22 August.... but that come with another ticket.

amdgpu-pci-0100
Adapter: PCI adapter
vddgfx:       +1.20 V  
fan1:         886 RPM
temp1:        +43.0°C  (crit = +94.0°C, hyst = -273.1°C)
power1:       59.16 W  (cap = 175.00 W)
Comment 5 quirin.blaeser 2018-09-29 21:20:33 UTC
Created attachment 278841 [details]
bisect result

Includes bisect steps + sensors output
Comment 6 quirin.blaeser 2018-09-29 21:28:30 UTC
RX560 is Polaris11, so Bug may be ported from Polaris20
Comment 7 quirin.blaeser 2018-09-30 14:07:38 UTC
For now i have resolved this problem by simply removing patch for 4.18.11

git diff 93b100ddda3be284be160e9ccba28c7f8f21ab73^1..93b100ddda3be284be160e9ccba28c7f8f21ab73
and patch -R -p1

or without "-R":

git diff 93b100ddda3be284be160e9ccba28c7f8f21ab73..93b100ddda3be284be160e9ccba28c7f8f21ab73^1

Maybe the specs for Vega and Polaris just differ at this point?
Comment 8 quirin.blaeser 2018-10-04 17:56:53 UTC
4.18.12
Bug still present, removing 93b100ddda3be284be160e9ccba28c7f8f21ab73 solves this problem for now.
Comment 9 Alex Deucher 2018-10-04 20:21:41 UTC
I don't think this is a bug.  The problem is, prior to that patch, the display component was requesting minimum clocks that were 10x too low.  This saved power, but led to display problems on some systems because the clocks were too low to sustain the display requirements.
Comment 10 Dieter Nützel 2018-10-04 22:48:47 UTC
(In reply to Alex Deucher from comment #9)
> I don't think this is a bug.  The problem is, prior to that patch, the
> display component was requesting minimum clocks that were 10x too low.  This
> saved power, but led to display problems on some systems because the clocks
> were too low to sustain the display requirements.

Sorry Alex,
what?

_All_ was fine _before_ this commit for ages with stable upstream and all 'amd-staging-drm-next'.

Now, I get ~60 W raised from ~30 W with 1920x1080 (even dual display was good before).
Comment 11 Alex Deucher 2018-10-05 02:37:12 UTC
Can you attach the output of `cat /sys/kernel/debug/dri/0/amdgpu_pm_info` before and after the patch?
Comment 12 Dieter Nützel 2018-10-05 03:04:04 UTC
openSUSE Tumbleweed Kernel:stable 4.18.12-2.ga880bd8-default

After the patch.
(For 'before' I have to reboot to broken 'amd-staging-drm-next')
https://bugs.freedesktop.org/show_bug.cgi?id=108096

Clock Gating Flags Mask: 0x37bcf
        Graphics Medium Grain Clock Gating: On
        Graphics Medium Grain memory Light Sleep: On
        Graphics Coarse Grain Clock Gating: On
        Graphics Coarse Grain memory Light Sleep: On
        Graphics Coarse Grain Tree Shader Clock Gating: Off
        Graphics Coarse Grain Tree Shader Light Sleep: Off
        Graphics Command Processor Light Sleep: On
        Graphics Run List Controller Light Sleep: On
        Graphics 3D Coarse Grain Clock Gating: Off
        Graphics 3D Coarse Grain memory Light Sleep: Off
        Memory Controller Light Sleep: On
        Memory Controller Medium Grain Clock Gating: On
        System Direct Memory Access Light Sleep: Off
        System Direct Memory Access Medium Grain Clock Gating: On
        Bus Interface Medium Grain Clock Gating: Off
        Bus Interface Light Sleep: On
        Unified Video Decoder Medium Grain Clock Gating: On
        Video Compression Engine Medium Grain Clock Gating: On
        Host Data Path Light Sleep: Off
        Host Data Path Medium Grain Clock Gating: On
        Digital Right Management Medium Grain Clock Gating: Off
        Digital Right Management Light Sleep: Off
        Rom Medium Grain Clock Gating: On
        Data Fabric Medium Grain Clock Gating: Off

GFX Clocks and Power:
        2000 MHz (MCLK)
        1411 MHz (SCLK)
        600 MHz (PSTATE_SCLK)
        1000 MHz (PSTATE_MCLK)
        1200 mV (VDDGFX)
        61.254 W (average GPU)

GPU Temperature: 44 C
GPU Load: 0 %

UVD: Disabled

VCE: Disabled
Comment 13 Dieter Nützel 2018-10-05 03:41:47 UTC
amd-staging-drm-next (with broken SDDM and then 'init 3')

Why is 'GPU Load:' so hight?
 => take it with a drain of salt.

Clock Gating Flags Mask: 0x3fbcf
        Graphics Medium Grain Clock Gating: On
        Graphics Medium Grain memory Light Sleep: On
        Graphics Coarse Grain Clock Gating: On
        Graphics Coarse Grain memory Light Sleep: On
        Graphics Coarse Grain Tree Shader Clock Gating: Off
        Graphics Coarse Grain Tree Shader Light Sleep: Off
        Graphics Command Processor Light Sleep: On
        Graphics Run List Controller Light Sleep: On
        Graphics 3D Coarse Grain Clock Gating: Off
        Graphics 3D Coarse Grain memory Light Sleep: Off
        Memory Controller Light Sleep: On
        Memory Controller Medium Grain Clock Gating: On
        System Direct Memory Access Light Sleep: Off
        System Direct Memory Access Medium Grain Clock Gating: On
        Bus Interface Medium Grain Clock Gating: Off
        Bus Interface Light Sleep: On
        Unified Video Decoder Medium Grain Clock Gating: On
        Video Compression Engine Medium Grain Clock Gating: On
        Host Data Path Light Sleep: On
        Host Data Path Medium Grain Clock Gating: On
        Digital Right Management Medium Grain Clock Gating: Off
        Digital Right Management Light Sleep: Off
        Rom Medium Grain Clock Gating: On
        Data Fabric Medium Grain Clock Gating: Off

GFX Clocks and Power:
        300 MHz (MCLK)
        303 MHz (SCLK)
        600 MHz (PSTATE_SCLK)
        1000 MHz (PSTATE_MCLK)
        831 mV (VDDGFX)
        32.176 W (average GPU)

GPU Temperature: 29 C
GPU Load: 84 %

UVD: Disabled

VCE: Disabled
Comment 14 Dieter Nützel 2018-10-05 03:43:19 UTC
Diff !!!

BAD
Host Data Path Light Sleep: Off

GOOD
Host Data Path Light Sleep: On
Comment 15 Dieter Nützel 2018-10-05 03:52:06 UTC
(In reply to Dieter Nützel from comment #14)
> Diff !!!
> 
> BAD
> Host Data Path Light Sleep: Off

card0/device> cat pp_dpm_mclk 
0: 300Mhz 
1: 1000Mhz 
2: 2000Mhz *

card0/device> cat pp_dpm_sclk 
0: 300Mhz 
1: 600Mhz 
2: 900Mhz 
3: 1145Mhz 
4: 1215Mhz 
5: 1257Mhz 
6: 1300Mhz 
7: 1411Mhz *


> GOOD
> Host Data Path Light Sleep: On

card0/device cat pp_dpm_mclk
0: 300Mhz *
1: 1000Mhz
2: 2000Mhz

card0/device cat pp_dpm_sclk
0: 300Mhz
1: 600Mhz *
2: 900Mhz
3: 1145Mhz
4: 1215Mhz
5: 1257Mhz
6: 1300Mhz
7: 1411Mhz

But SCLK changed much.

Need badly some sleep.
Saturday morning off for family vacation.

Greetings,
Dieter
Comment 16 quirin.blaeser 2018-10-05 08:25:45 UTC
Created attachment 278933 [details]
amdgpu_pm_info

content of /sys/kernel/debug/dri/1/amdgpu_pm_info
for 4.18.12 +/- 93b100ddda3be284be160e9ccba28c7f8f21ab73
Comment 17 grmat 2018-10-05 11:52:39 UTC
I can confirm the issue with Polaris10. Power consumption is roughly 30 Watts higher in idle compared to what it used to be and compared to Windows. DPM are stuck in highest power modes for both s and m.

The reporter has already bisected so I haven't. If you still need more info, please ping.
Comment 18 quirin.blaeser 2018-10-05 13:50:25 UTC
(In reply to Alex Deucher from comment #9)
> I don't think this is a bug.  The problem is, prior to that patch, the
> display component was requesting minimum clocks that were 10x too low.  This
> saved power, but led to display problems on some systems because the clocks
> were too low to sustain the display requirements.

so
 - 93b100ddd... fools DC
 + 93b100ddd... fools PM

from my point of view scaling clock values just happens at the wrong place.
So we may have to find different points in code from where smuX_get_XXX gets called by PM _or_ DC, may be in Firmware.
Comment 19 Michel Dänzer 2018-10-05 14:44:38 UTC
People on the Phoronix forum mentioned that this doesn't seem to happen with 4.19-rc kernels. If people here can confirm that, maybe there are other corresponding changes that need to be backported as well.
Comment 20 quirin.blaeser 2018-10-05 22:18:46 UTC
(In reply to Michel Dänzer from comment #19)
> People on the Phoronix forum mentioned that this doesn't seem to happen with
> 4.19-rc kernels. If people here can confirm that, maybe there are other
> corresponding changes that need to be backported as well.

4.19-rc1:

amdgpu-pci-0100
Adapter: PCI adapter
vddgfx:       +0.81 V  
fan1:        1602 RPM
temp1:        +22.0°C  (crit = +94.0°C, hyst = -273.1°C)
power1:        6.10 W  (cap =  48.00 W)
Comment 21 Thomas Lange 2018-10-06 10:43:21 UTC
I can confirm that only 4.18.x (x > 9) is affected. 4.19-rc6 reports the same clock and power values as with 4.18.9. At least that's the case for my RX 560.
Comment 22 quirin.blaeser 2018-10-07 00:34:38 UTC
Created attachment 278939 [details]
bisect

Author of 93b100ddda3be284be160e9ccba28c7f8f21ab73 simply forgot to remove scaling values for powerplay.
Have a look at 23ec3d1479fd79658cd52c47618d8ddd2f32550b where the same scaling applied to vega.
You may have to patch drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm_services.c
too.
Have a look at needed_patch.txt
Comment 23 quirin.blaeser 2018-10-07 09:20:01 UTC
drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm_services.c (v4.18.12)
There is a lot of work to do:

      230         for (i = 0; i < dc_clks->num_levels; i++) {
      231                 DRM_INFO("DM_PPLIB:\t %d\n", pp_clks->clock[i]);
      232                 /* translate 10kHz to kHz */
      233                 dc_clks->clocks_in_khz[i] = pp_clks->clock[i] * 10;
      234         }

      257         for (i = 0; i < clk_level_info->num_levels; i++) {
      258                 DRM_DEBUG("DM_PPLIB:\t %d in 10kHz\n", pp_clks->data[i].clocks_in_khz);
      259                 /* translate 10kHz to kHz */
      260                 clk_level_info->data[i].clocks_in_khz
                            = pp_clks->data[i].clocks_in_khz * 10;
      261                 clk_level_info->data[i].latency_in_us
                            = pp_clks->data[i].latency_in_us;
      262         }

and maybe

      306         /* Translate 10 kHz to kHz. */
      307         validation_clks.engine_max_clock *= 10;
      308         validation_clks.memory_max_clock *= 10;


since 2017-09-12 15:58:20

bool dm_pp_get_clock_levels_by_type_with_voltage(
        const struct dc_context *ctx,
        enum dm_pp_clock_type clk_type,
        struct dm_pp_clock_levels_with_voltage *clk_level_info)
{
        /* TODO: to be implemented */
        return false;
}

bool dm_pp_notify_wm_clock_changes(
        const struct dc_context *ctx,
        struct dm_pp_wm_sets_with_clock_ranges *wm_with_clock_ranges)
{
        /* TODO: to be implemented */
        return false;
}

bool dm_pp_apply_power_level_change_request(
        const struct dc_context *ctx,
        struct dm_pp_power_level_change_request *level_change_req)
{
        /* TODO: to be implemented */
        return false;
}

bool dm_pp_apply_clock_for_voltage_request(
        const struct dc_context *ctx,
        struct dm_pp_clock_for_voltage_req *clock_for_voltage_req)
{
        /* TODO: to be implemented */
        return false;
}

bool dm_pp_get_static_clocks(
        const struct dc_context *ctx,
        struct dm_pp_static_clock_info *static_clk_info)
{
        /* TODO: to be implemented */
        return false;
}
Comment 24 Alex Deucher 2018-10-08 14:26:57 UTC
This patch shouldn't have been applied to 4.18.  It looks like it was autoselected:
https://lkml.org/lkml/2018/9/15/172
It should be reverted.
Comment 25 quirin.blaeser 2018-10-08 19:34:55 UTC
(In reply to Alex Deucher from comment #24)
> This patch shouldn't have been applied to 4.18.  It looks like it was
> autoselected:
> https://lkml.org/lkml/2018/9/15/172
> It should be reverted.

So "[...]the display component was requesting minimum clocks[...]" isn´t an issue with Polaris?
Is there any QA left?
Avoiding unusual units is a good idea generally, but it should happen very early in development.
Comment 26 Alex Deucher 2018-10-08 19:42:16 UTC
(In reply to quirin.blaeser from comment #25)
> (In reply to Alex Deucher from comment #24)
> > This patch shouldn't have been applied to 4.18.  It looks like it was
> > autoselected:
> > https://lkml.org/lkml/2018/9/15/172
> > It should be reverted.
> 
> So "[...]the display component was requesting minimum clocks[...]" isn´t an
> issue with Polaris?
> Is there any QA left?
> Avoiding unusual units is a good idea generally, but it should happen very
> early in development.

It was a fix for fallout from an interface refactor we did in 4.19 that mixed up the units between display and power.  We did not intend to have the patch applied to 4.18 and we did not flag the patch for 4.18, it was flagged for 4.18 by someone else outside of AMD.
Comment 27 quirin.blaeser 2018-10-08 19:49:14 UTC
I can´t find "c3df50abc84b" from https://lkml.org/lkml/2018/9/15/172
but "drm/amd/pp: Convert clock unit to KHz as defined" is 23ec3d1479fd79658cd52c47618d8ddd2f32550b
Comment 28 Alex Deucher 2018-10-08 19:59:04 UTC
(In reply to quirin.blaeser from comment #27)
> I can´t find "c3df50abc84b" from https://lkml.org/lkml/2018/9/15/172
> but "drm/amd/pp: Convert clock unit to KHz as defined" is
> 23ec3d1479fd79658cd52c47618d8ddd2f32550b

That was the commit id in our amd-staging-drm-next branch:
https://cgit.freedesktop.org/~agd5f/linux/commit/?h=amd-staging-drm-next&id=c3df50abc84b289be8e7b96968d7d7e006576880
Comment 29 quirin.blaeser 2018-10-08 20:58:38 UTC
Created attachment 278965 [details]
git diff -p 93b100ddda3be284be160e9ccba28c7f8f21ab73..93b100ddda3be284be160e9ccba28c7f8f21ab73^1

git diff -p 93b100ddda3be284be160e9ccba28c7f8f21ab73..93b100ddda3be284be160e9ccba28c7f8f21ab73^1

Apply to v4.18.10 .. v4.18.12 to revert
Comment 30 quirin.blaeser 2018-10-08 20:59:36 UTC
(In reply to Alex Deucher from comment #28)
> (In reply to quirin.blaeser from comment #27)
> > I can´t find "c3df50abc84b" from https://lkml.org/lkml/2018/9/15/172
> > but "drm/amd/pp: Convert clock unit to KHz as defined" is
> > 23ec3d1479fd79658cd52c47618d8ddd2f32550b
> 
> That was the commit id in our amd-staging-drm-next branch:
> https://cgit.freedesktop.org/~agd5f/linux/commit/?h=amd-staging-drm-
> next&id=c3df50abc84b289be8e7b96968d7d7e006576880

thx
Comment 31 Cristian Aravena Romero 2018-10-09 18:29:04 UTC
Hello,

This error is also registered on launchpad.net
https://bugs.launchpad.net/bugs/1796720

Best regards,
--
Cristian Aravena Romero (caravena)
Comment 32 Cristian Aravena Romero 2018-10-10 11:39:14 UTC
Hello,

The kernel 4.18.13 works correctly?

Best regards,
--
Cristian Aravena Romero (caravena)
Comment 33 Harry Wentland 2018-10-10 13:42:12 UTC
Yes, it should. GregKH reverted the offending commit in  4.18.13.