Bug 213561

Summary: [bisected][regression] GFX10 AMDGPUs can no longer enter idle state after commit. Commit has been pushed to stable branches too.
Product: Drivers Reporter: Linux_Chemist (untaintableangel)
Component: Video(DRI - non Intel) Assignee: drivers_video-dri
Status: RESOLVED CODE_FIX    
Severity: normal CC: bjo, fkrueger, hagar-dunor, matoro_bugzilla_kernel, mscardovi, reiver, soshial, tgnff242
Priority: P1    
Hardware: All   
OS: Linux   
Kernel Version: 5.13rc7, 5.12.13, 5.10.46, 5.4.128 Subsystem:
Regression: Yes Bisected commit-id:

Description Linux_Chemist 2021-06-23 14:41:10 UTC
Nature of the problem: RX 5700 is unable to enter low power state at idle (see below for usual behaviour)

Sensors at idle prior to the commit:
amdgpu-pci-0f00
Adapter: PCI adapter
vddgfx:      775.00 mV 
fan1:           0 RPM  (min =    0 RPM, max = 3200 RPM)
edge:         +48.0°C  (crit = +100.0°C, hyst = -273.1°C)
                       (emerg = +105.0°C)
junction:     +48.0°C  (crit = +110.0°C, hyst = -273.1°C)
                       (emerg = +115.0°C)
mem:          +52.0°C  (crit = +105.0°C, hyst = -273.1°C)
                       (emerg = +110.0°C)
power1:        8.00 W  (cap = 165.00 W)

After the commit, the lowest is:
amdgpu-pci-0f00
Adapter: PCI adapter
vddgfx:        1.03 V  
fan1:           0 RPM  (min =    0 RPM, max = 3200 RPM)
edge:         +54.0°C  (crit = +100.0°C, hyst = -273.1°C)
                       (emerg = +105.0°C)
junction:     +56.0°C  (crit = +110.0°C, hyst = -273.1°C)
                       (emerg = +115.0°C)
mem:          +52.0°C  (crit = +105.0°C, hyst = -273.1°C)
                       (emerg = +110.0°C)
power1:       31.00 W  (cap = 165.00 W)
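
For anyone wanting to compare their own readings, here is a minimal Python sketch that pulls the values out of `sensors` output and computes the idle difference. The function names and the regular expression are illustrative assumptions, not part of lm-sensors; the sample strings are copied from the two dumps above.

```python
import re

# Matches lines such as "vddgfx:      775.00 mV" or "power1:       31.00 W  (cap = 165.00 W)".
READING = re.compile(r"^(\w+):\s+\+?([\d.]+)\s*(mV|V|W|RPM|°C)")


def parse_sensors(text):
    """Return {label: (value, unit)}, with millivolts converted to volts."""
    readings = {}
    for line in text.splitlines():
        match = READING.match(line.strip())
        if not match:
            continue  # header lines and "(emerg = ...)" continuation lines
        label, value, unit = match.group(1), float(match.group(2)), match.group(3)
        if unit == "mV":
            value, unit = value / 1000.0, "V"
        readings[label] = (value, unit)
    return readings


def idle_delta(before, after, label):
    """Difference (after - before) for one sensor label, rounded to 3 places."""
    return round(parse_sensors(after)[label][0] - parse_sensors(before)[label][0], 3)


# Lines copied from the two sensor dumps in this report.
BEFORE = "vddgfx:      775.00 mV\npower1:        8.00 W  (cap = 165.00 W)"
AFTER = "vddgfx:        1.03 V\npower1:       31.00 W  (cap = 165.00 W)"

if __name__ == "__main__":
    for label in ("vddgfx", "power1"):
        print(label, idle_delta(BEFORE, AFTER, label))
```

With the values above, this shows the idle draw rising by 23 W and the core voltage by roughly a quarter of a volt.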


This problem wasn't present in rc6 but is present in 5.13rc7 and bisects to:

1c0b0efd148d5b24c4932ddb3fa03c8edd6097b3 is the first bad commit
commit 1c0b0efd148d5b24c4932ddb3fa03c8edd6097b3
Author: Yifan Zhang <yifan1.zhang@amd.com>
Date:   Thu Jun 10 10:10:07 2021 +0800

    drm/amdgpu/gfx10: enlarge CP_MEC_DOORBELL_RANGE_UPPER to cover full doorbell.
    
    If GC has entered CGPG, ringing doorbell > first page doesn't wakeup GC.
    Enlarge CP_MEC_DOORBELL_RANGE_UPPER to workaround this issue.
    
    Signed-off-by: Yifan Zhang <yifan1.zhang@amd.com>
    Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com>
    Reviewed-by: Alex Deucher <alexander.deucher@amd.com>
    Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
    Cc: stable@vger.kernel.org


The device is a Sapphire Pulse RX5700 and this problem is seen even with one monitor set at 60Hz.
GPU: 0f:00.0 VGA compatible controller [0300]: Advanced Micro Devices, Inc. [AMD/ATI] Navi 10 [Radeon RX 5600 OEM/5600 XT / 5700/5700 XT] [1002:731f] (rev c4)
Comment 1 matoro 2021-06-24 02:57:21 UTC
Hi, I am confirming this issue in 5.12.13 on my Colorful Red Devil RX 5700XT.  Because of the OC profile it was consuming almost 100W continuously and heated up to nearly 90°C before I realized what was happening and reverted to 5.12.12.

My card:
0d:00.0 VGA compatible controller: Advanced Micro Devices, Inc. [AMD/ATI] Navi 10 [Radeon RX 5600 OEM/5600 XT / 5700/5700 XT] (rev c1)
Comment 2 tgn-ff 2021-06-24 03:53:40 UTC
It's not just locked at the highest clock states, the GPU seems to be under "load": radeontop shows the GPU utilisation to be at nearly 100%. Consequently, performance is terrible.
Comment 3 hagar-dunor 2021-06-24 11:19:18 UTC
Same on vega 64
Comment 4 Linux_Chemist 2021-06-24 11:48:15 UTC
Yes, it seems this commit was also pushed into 5.12.13 so users with similar hardware (gfx10) may also be experiencing this.
Comment 5 hagar-dunor 2021-06-24 12:14:19 UTC
Sorry should have provided more info, I have this on 5.10.46.

cat /sys/kernel/debug/dri/0/amdgpu_pm_info reports 100% GPU usage and ~60W "idle" on 5.10.46 where I get 0% GPU usage and ~7W on 5.10.45.
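
A rough Python sketch of the same check, for anyone who wants to script it. It assumes the file contains lines shaped like "GPU Load: 0 %" and "7.0 W (average GPU)"; that layout, the function names, and the 90% threshold are assumptions, and the sample strings are hypothetical excerpts built from the figures quoted above rather than captured output.

```python
import re


def parse_pm_info(text):
    """Extract GPU load (%) and average power (W) from amdgpu_pm_info-style text.

    Assumes lines shaped like "GPU Load: 0 %" and "7.0 W (average GPU)".
    A field that cannot be found is returned as None.
    """
    load = re.search(r"GPU Load:\s*(\d+)\s*%", text)
    power = re.search(r"([\d.]+)\s*W\s*\(average GPU\)", text)
    return {
        "load_percent": int(load.group(1)) if load else None,
        "power_watts": float(power.group(1)) if power else None,
    }


def looks_stuck_busy(info, load_threshold=90):
    """Heuristic only: an idle desktop reporting near-100% load matches this bug."""
    load = info["load_percent"]
    return load is not None and load >= load_threshold


# Hypothetical excerpts using the figures reported in this comment.
SAMPLE_GOOD = "\t7.0 W (average GPU)\nGPU Load: 0 %\n"
SAMPLE_BAD = "\t60.0 W (average GPU)\nGPU Load: 100 %\n"

if __name__ == "__main__":
    for name, sample in (("5.10.45", SAMPLE_GOOD), ("5.10.46", SAMPLE_BAD)):
        info = parse_pm_info(sample)
        print(name, info, "stuck busy" if looks_stuck_busy(info) else "idle ok")
```

On a real system the text would come from reading /sys/kernel/debug/dri/0/amdgpu_pm_info, which normally needs root because it lives in debugfs.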
Comment 6 Linux_Chemist 2021-06-24 14:06:26 UTC
Commit is also present in kernels 5.10.46 and 5.4.128 (stables) dated 23rd June 2021, so updating info for bug report. Should be a few people hitting this if they update to the latest kernels this week.

It is also conceivable that gfx9 devices hit a similar bug caused by a similar commit (4cbbe34807938e6e494e535a68d5ff64edac3f20 upstream), which is also in all of these kernels. I don't know and can't test that, though.
If you don't have a gfx10 device, you should file a separate bug report if that commit IS causing you an issue (it may not, I'm just guessing based on reports) - build a linux kernel and do a git bisect to check.
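
For those who have not bisected before: `git bisect` is essentially a binary search over the commit history between a known-good and a known-bad kernel. The Python sketch below only illustrates that idea; the commit names and the `is_bad` callback are hypothetical stand-ins for "build this kernel, boot it, and check whether the GPU idles".

```python
def first_bad_commit(commits, is_bad):
    """Binary search for the first commit where is_bad(commit) is True.

    Assumes commits are ordered oldest to newest, commits[0] is known good,
    commits[-1] is known bad, and every commit after the first bad one is
    also bad (the same assumption a bisection relies on).
    """
    good, bad = 0, len(commits) - 1
    while bad - good > 1:
        mid = (good + bad) // 2
        if is_bad(commits[mid]):
            bad = mid   # the regression is at mid or earlier
        else:
            good = mid  # the regression is after mid
    return commits[bad]


if __name__ == "__main__":
    # Hypothetical history: c0 is the last good release, c9 the first bad one.
    history = [f"c{i}" for i in range(10)]
    culprit = first_bad_commit(history, lambda c: int(c[1:]) >= 6)
    print("first bad commit:", culprit)
```

Each step halves the remaining range, so even a few hundred commits between two releases need only a handful of build-and-boot cycles.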
Comment 7 Linux_Chemist 2021-06-24 14:18:27 UTC
(In reply to hagar-dunor from comment #3)
> Same on vega 64

I think I'm right in saying Vega is gfx9 rather than gfx10 (navi etc), so you may be affected by a similar commit (4cbbe34807938e6e494e535a68d5ff64edac3f20 upstream) and not the specific one I'm filing for. 

Are you able to build a linux kernel and check if you are being affected by this particular commit instead? 
At any rate, could you file a similar new bug report ("for gfx9 devices") and link it to this one, since the specific commit I've confirmed as causing the problem is not applicable to your particular device?
Comment 8 hagar-dunor 2021-06-24 18:16:22 UTC
Thanks for pointing to a different commit. I don't really have the time currently to revert a specific commit and try it out; pointing out that the problem appears between two consecutive kernel versions should be enough, TBH, for the author to know what this is about.

I don't mind filing another bug if you insist, but it would be nice to have the dev show up here and state whether that's necessary. The problem might not affect the same hwid, but it's basically identical; I wouldn't be surprised if I open a bug and the dev decides it's a duplicate.
Comment 9 Alan Swanson 2021-06-24 21:52:13 UTC
These patches have just been reverted for 5.13-rc8, and the reverts should hopefully be backported to stable.

https://lists.freedesktop.org/archives/amd-gfx/2021-June/065575.html
https://lists.freedesktop.org/archives/dri-devel/2021-June/312755.html
Comment 10 Linux_Chemist 2021-06-24 22:53:07 UTC
Thank you :) I'll mark this as resolved since the problem is known and code has been reverted ready for the next kernels.
Comment 11 Marco Scardovi (scardracs) 2021-06-28 07:22:35 UTC
Hi everyone, I'm facing the same issue here on kernel 5.12.13 with the AMD 3200U in an HP-15s laptop. Can you confirm these commits will fix it for iGPUs too?
Comment 12 Linux_Chemist 2021-06-28 11:10:06 UTC
(In reply to Marco Scardovi from comment #11)
> Hi everyone, I'm facing same issue here on kernel 5.12.13 with the AMD 3200U
> in an HP-15s laptop. Can you confirm these commits will fix for iGPU too?

Hi Marco, it should do if it's the same issue. Your options are:

1) Downgrade to or use kernel 5.12.12 (I don't know which distro you're using, but it should be available somewhere).
2) Build your own kernel from mainline (currently latest version is 5.13 final)
3) Wait until kernel 5.12.14 or later is available for you (at this time, I don't think it's been released yet).
4) Download and run a kernel from a 3rd party source that doesn't contain these commits.

As you're on a laptop (and thus probably on battery power), I would just pick an earlier kernel for now (option 1). 
If you've got grub for your bootloader (for example), install an earlier kernel (or use another one if one is already installed) and choose it at grub's menu when you boot up. Once you're logged in and have confirmed you're not on 5.12.13 (check with the command 'uname -a'), remove/uninstall 5.12.13 and then return things to how you like them.
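
If you'd rather script the version check, here is a small Python sketch. The set below contains only the stable releases this report lists as carrying the commit; the helper names are made up for illustration. 5.13-rc7 is left out on purpose, because stripping the suffix from its release string would not distinguish it from 5.13 final.

```python
import platform
import re

# Stable releases listed in this report's "Kernel Version" field.
AFFECTED_STABLE = {"5.12.13", "5.10.46", "5.4.128"}


def base_version(release):
    """Strip distro suffixes, e.g. "5.12.13-gentoo" -> "5.12.13"."""
    match = re.match(r"\d+\.\d+(?:\.\d+)?", release)
    return match.group(0) if match else release


def is_listed_affected(release):
    """True if the release is one of the stable versions named in this report."""
    return base_version(release) in AFFECTED_STABLE


if __name__ == "__main__":
    release = platform.release()  # the kernel release, as printed by `uname -r`
    verdict = "listed as affected" if is_listed_affected(release) else "not in the listed set"
    print(release, verdict)
```

A "not in the listed set" result only means the running kernel is not one of the versions named here; it does not prove the commit is absent from a distro-patched kernel.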
Comment 13 Marco Scardovi (scardracs) 2021-06-28 12:06:28 UTC
(In reply to Linux_Chemist from comment #12)

Hi, and thanks for the answer. I'm using Gentoo and waiting for the 5.13 release (it was released upstream today). I hope this will help, as my laptop is running at 73°C at idle.
Comment 14 Marco Scardovi (scardracs) 2021-06-28 16:05:41 UTC
Can confirm it is fixed on kernel 5.13 final: 44°C instead of 73°C at idle.
Comment 15 soshial 2022-01-20 05:27:24 UTC
I have exactly the same problem on my Dell XPS 9575 laptop.

GPU: AMD Polaris 22 XL [Radeon RX Vega M GL].
Kernel: 5.15.12

Several months ago there was no such problem. The amdgpu is always in D0 state and fans are spinning all the time. How may I help to fix the problem?