Bug 205291

Summary: Cannot switch off Radeon HD 4330/4350/4550 with vgaswitcheroo
Product: Drivers
Reporter: K J Petrie (kernel.bugs)
Component: Video(DRI - non Intel)
Assignee: drivers_video-dri
Status: REOPENED
Severity: normal
CC: alexdeucher, lukas, rui.zhang
Priority: P1
Hardware: x86-64
OS: Linux
Kernel Version: 4.17.0 and later
Subsystem:
Regression: Yes
Bisected commit-id:
Attachments: dmesg for 4.16.17; dmesg for 4.17.2; Corrected dmesg for 4.17.2

Description K J Petrie 2019-10-22 16:27:08 UTC
"]# uname -r
4.16.17
[root@master2 ken]# cat /sys/kernel/debug/vgaswitcheroo/switch 
0:IGD:+:Pwr:0000:00:02.0
1:DIS: :DynOff:0000:01:00.0
2:DIS-Audio: :Off:0000:01:00.1

]# uname -r
4.17.0
[root@master2 ken]# cat /sys/kernel/debug/vgaswitcheroo/switch 
0:IGD:+:Pwr:0000:00:02.0
1:DIS: :DynPwr:0000:01:00.0
2:DIS-Audio: :DynPwr:0000:01:00.1

]# uname -r
5.3.7
]# cat /sys/kernel/debug/vgaswitcheroo/switch 
0:IGD:+:Pwr:0000:00:02.0
1:DIS: :DynPwr:0000:01:00.0
2:DIS-Audio: :DynPwr:0000:01:00.1

]# inxi -G
Graphics:  Device-1: Intel Mobile 4 Series Integrated Graphics driver: i915 v: kernel 
           Device-2: AMD RV710/M92 [Mobility Radeon HD 4330/4350/4550] driver: radeon v: kernel 
           Display: server: X.Org 1.20.5 driver: ati,intel,modesetting,v4l unloaded: radeon 
           resolution: 1366x768~60Hz 
           OpenGL: renderer: Mesa DRI Mobile Intel GM45 Express v: 2.1 Mesa 19.2.1"

The problem is present with every kernel from 4.17.0 on, whether mainline or distribution, so that appears to be where the regression was introduced. As a result, the laptop runs hot and the battery drains quickly on the newer kernel series.
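
For anyone wanting to check the same thing on their own machine, here is a minimal sketch that reads the switch file and lists the inactive clients that are still powered. The function name awake_clients is made up for illustration, and the field layout (id, name, active flag, power state, PCI address) is assumed from the output pasted above rather than taken from any specification.

```shell
# Hypothetical helper: list inactive vgaswitcheroo clients that are still powered.
# Input lines are assumed to look like "1:DIS: :DynPwr:0000:01:00.0"
# (id : name : active flag : power state : PCI address).
awake_clients() {
    # Only the first four colons separate fields; the last variable keeps
    # the rest of the line, so the PCI address retains its own colons.
    while IFS=: read -r id name active state addr; do
        [ "$active" = "+" ] && continue      # the active client is meant to be on
        case "$state" in
            Off|DynOff|"") ;;                # powered down, or a blank line
            *) printf '%s %s %s\n' "$name" "$state" "$addr" ;;
        esac
    done
}

switch=/sys/kernel/debug/vgaswitcheroo/switch   # needs root and a mounted debugfs
if [ -r "$switch" ]; then
    awake_clients < "$switch"
fi
```

Fed the 4.16.17 output above it should print nothing, while the 4.17.0 and 5.3.7 outputs should list both DIS and DIS-Audio.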

I'm not an expert on kernel code - compiling is about my limit - but I'll help in any way I can. I hope I have the right component - Power Management describes the function, but not necessarily the source of the problem.
Comment 1 Alex Deucher 2020-04-20 13:50:48 UTC
Can you bisect?
Comment 2 Alex Deucher 2020-04-20 13:52:16 UTC
Please attach your dmesg output in both the working and non-working cases.
Comment 3 K J Petrie 2020-04-20 21:03:08 UTC
Created attachment 288637 [details]
dmesg for 4.16.17

I no longer had the kernels installed so I have used the nearest distribution kernels I could find in the interest of simplicity and speed, as that doesn't seem to make a difference and compiling kernels takes a while on this machine. I hope that's OK. If not it might take a while to get round to compiling the kernels again.
Comment 4 K J Petrie 2020-04-20 21:04:12 UTC
Created attachment 288639 [details]
dmesg for 4.17.2
Comment 5 K J Petrie 2020-04-20 21:05:53 UTC
(In reply to Alex Deucher from comment #1)
> Can you bisect?

Probably not. I can just about compile a kernel, but finding my way around anything other than a standard source download is likely to be beyond me.
Comment 6 K J Petrie 2020-04-20 21:16:57 UTC
Comment on attachment 288639 [details]
dmesg for 4.17.2

Scrub this. It has radeon.runpm=0 on its command line. I'll upload a corrected version.
Comment 7 K J Petrie 2020-04-20 21:27:08 UTC
Created attachment 288641 [details]
Corrected dmesg for 4.17.2

This one doesn't have radeon.runpm=0 in its command line!
Comment 8 Lukas Wunner 2020-04-22 05:58:50 UTC
Starting with v4.17, the power management of HDA controllers on discrete GPUs was changed such that the HDA keeps the GPU awake as long as it's in use:
https://lists.freedesktop.org/archives/nouveau/2018-February/029851.html

This exposed an issue with some ATI cards which was fixed in June 2018:
https://git.kernel.org/linus/57cb54e53bdd

So if you still experience GPU insomnia with v5.3 (which contains that fix), then it's a different problem.

In one case, a user reported GPU insomnia with an Nvidia card and it turned out that it was caused by a user space tool called "tlp" which disabled runtime power management of the HDA via sysfs. Naturally, this caused the GPU to stay awake. The solution in this case was to change the configuration of "tlp". But it was also possible to manually override disablement of runtime PM on the HDA by echoing "auto" to the "power/control" file in the HDA PCI device's sysfs directory:
https://bugs.freedesktop.org/show_bug.cgi?id=75985#c116

So you first may want to check whether runtime PM is disabled in sysfs, try to manually enable it and see if the GPU runtime suspends, and if that works, find out which user space tool disabled runtime PM on the HDA.
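
A minimal sketch of that check, assuming the HDA sits at 0000:01:00.1 as in this report. The function names are made up for illustration. In the kernel's runtime PM sysfs interface, a "power/control" value of "auto" allows the device to runtime suspend, while "on" keeps it powered.

```shell
# Sketch: report, and optionally re-enable, runtime PM for one PCI function.
# "auto" lets the kernel runtime-suspend the device; "on" keeps it powered.
runtime_pm_policy() {
    # $1 = device directory, e.g. /sys/bus/pci/devices/0000:01:00.1
    cat "$1/power/control" 2>/dev/null || echo unknown
}

allow_runtime_pm() {
    # Writing needs root; returns non-zero if the file is not writable.
    [ -w "$1/power/control" ] && echo auto > "$1/power/control"
}

hda=/sys/bus/pci/devices/0000:01:00.1    # HDA function from this report
if [ -d "$hda" ]; then
    echo "HDA runtime PM policy: $(runtime_pm_policy "$hda")"
    # Uncomment to override a tool that forced the device on:
    # allow_runtime_pm "$hda"
fi
```

If the policy reads "on" and switching it to "auto" lets the GPU suspend, the next step is finding which tool keeps writing "on".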

It's also possible that you've got a user space tool running which has opened the HDA and thereby keeps the GPU awake. Some audio mixers do that.
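
One way to look for such a tool, as a rough sketch: walk the /proc/<pid>/fd symlinks and print the PIDs that hold something open under /dev/snd, where the ALSA device nodes live. The function name holders_of is hypothetical, this is Linux only, and it needs root to see other users' processes.

```shell
# Sketch: print PIDs that have a file open under a given path prefix,
# found by reading the /proc/<pid>/fd symlinks.
holders_of() {
    prefix=$1
    for fd in /proc/[0-9]*/fd/*; do
        # Processes may exit or be unreadable while we scan; skip those.
        target=$(readlink "$fd" 2>/dev/null) || continue
        case "$target" in
            "$prefix"/*)
                pid=${fd#/proc/}       # strip the leading /proc/
                echo "${pid%%/*}"      # keep only the PID component
                ;;
        esac
    done | sort -un                    # one line per PID
}

holders_of /dev/snd
```

Each PID can then be looked up with ps to see whether it is a mixer or sound server that could be stopped for a test.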

If none of that fixes the problem, then we may indeed be dealing with a kernel bug. The other bugs related to runtime PM of the HDA contain all the steps and several debug patches to understand what's keeping the HDA awake, so we need you to follow those instructions and report the results back. Here are the relevant bugzillas:
https://bugs.freedesktop.org/show_bug.cgi?id=106597#c4
https://bugs.freedesktop.org/show_bug.cgi?id=106957#c1

One oddity I notice in your dmesg output is that there's only a single HDA controller detected in your machine and that's the one on the discrete GPU. Normally there are two HDAs: one is part of the Intel chipset and is responsible for headphones, loudspeakers, mic and so on, and the other is on the discrete GPU and is only responsible for HDMI audio. On your machine, there's no Intel chipset HDA and the one on the discrete GPU has a Line Out for loudspeakers, headphone out, digital out and two microphone inputs. So that's a little odd and may contribute to this issue.
Comment 9 K J Petrie 2020-04-22 11:46:51 UTC
Well, it'll take time to patch and recompile the kernel, but in the meantime here are the contents of the power directories:

AMD Radeon GPU
cat /sys/bus/pci/devices/0000\:01\:00.0/power/runtime_status 
active
cat /sys/bus/pci/devices/0000\:01\:00.0/power/control 
auto
cat /sys/bus/pci/devices/0000\:01\:00.0/power/autosuspend_delay_ms 
5000
cat /sys/bus/pci/devices/0000\:01\:00.0/power/runtime_active_time 
1744516
cat /sys/bus/pci/devices/0000\:01\:00.0/power/runtime_suspended_time 
184197

AMD HDA
cat /sys/bus/pci/devices/0000\:01\:00.1/power/runtime_status 
active
cat /sys/bus/pci/devices/0000\:01\:00.1/power/control 
auto
cat /sys/bus/pci/devices/0000\:01\:00.1/power/autosuspend_delay_ms 
cat: '/sys/bus/pci/devices/0000:01:00.1/power/autosuspend_delay_ms': Input/output error
cat /sys/bus/pci/devices/0000\:01\:00.1/power/runtime_active_time 
2415454
cat /sys/bus/pci/devices/0000\:01\:00.1/power/runtime_suspended_time 
214644

I double-checked that IO error, in case it was just a fluke, but it's consistent.
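
For reference, the same dump can be scripted so that a failing attribute is marked rather than aborting the listing. This is only a sketch and dump_power_attrs is a made-up name. As far as I understand it, the kernel returns an I/O error for autosuspend_delay_ms on a device that does not use autosuspend, so that error may not be a symptom in itself.

```shell
# Sketch: dump every attribute in a device's power/ directory,
# marking the ones whose read fails instead of stopping.
dump_power_attrs() {
    # $1 = device directory, e.g. /sys/bus/pci/devices/0000:01:00.1
    for f in "$1"/power/*; do
        [ -f "$f" ] || continue
        if value=$(cat "$f" 2>/dev/null); then
            printf '%s: %s\n' "${f##*/}" "$value"
        else
            printf '%s: <read error>\n' "${f##*/}"
        fi
    done
}

# GPU and HDA functions from this report
for dev in /sys/bus/pci/devices/0000:01:00.0 /sys/bus/pci/devices/0000:01:00.1; do
    if [ -d "$dev" ]; then
        echo "== $dev"
        dump_power_attrs "$dev"
    fi
done
```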
Comment 10 K J Petrie 2020-04-22 11:49:32 UTC
Looks like I need to recompile with CONFIG_PM_ADVANCED_DEBUG, I suspect.
Comment 11 K J Petrie 2020-04-23 09:57:18 UTC
I decided to compile 5.6.6 with CONFIG_PM_ADVANCED_DEBUG.

To my amazement, this gave me:
# uname -r
5.6.6

# cat /sys/kernel/debug/vgaswitcheroo/switch 
0:IGD:+:Pwr:0000:00:02.0
1:DIS: :DynOff:0000:01:00.0
2:DIS-Audio: :DynOff:0000:01:00.1    (!!!!!!)

(GPU)
# cat /sys/bus/pci/devices/0000\:01\:00.0/power/control 
auto
# cat /sys/bus/pci/devices/0000\:01\:00.0/power/runtime_status 
suspended
cat /sys/bus/pci/devices/0000\:01\:00.0/power/runtime_usage 
0
# cat /sys/bus/pci/devices/0000\:01\:00.0/power/runtime_active_kids
0

(HDA)
# cat /sys/bus/pci/devices/0000\:01\:00.1/power/control 
auto
# cat /sys/bus/pci/devices/0000\:01\:00.1/power/runtime_status 
suspended
# cat /sys/bus/pci/devices/0000\:01\:00.1/power/runtime_usage 
0
# cat /sys/bus/pci/devices/0000\:01\:00.1/power/runtime_active_kids 
0

So, either enabling CONFIG_PM_ADVANCED_DEBUG affects this or it's fixed in the latest kernels.

As a quick check I installed the distribution's latest kernel:

# uname -r
5.5.19-pclos1
[root@master2 ken]# cat /sys/kernel/debug/vgaswitcheroo/switch 
0:IGD:+:Pwr:0000:00:02.0
1:DIS: :DynOff:0000:01:00.0
2:DIS-Audio: :DynOff:0000:01:00.1

So it looks as if it was fixed in the 5.5 kernels.

However, both the 5.5 (distro) and 5.6 (mainline) kernels emit a terrible clattering sound while services are starting up. I'm unsure whether that's coming from the speakers, HDD or optical drive. If the former, it's just a nuisance; if one of the latter, it's not good news. I hope that's not related to the fix!
Comment 12 K J Petrie 2020-05-07 10:39:18 UTC
Reopening as the problem has returned with 5.6.11, although the rattling sound from the speaker (bug 207437) has now gone!