Bug 215203
Summary: | Monitors no longer remain in a sleep state | ||
---|---|---|---|
Product: | EFI | Reporter: | Brandon Nielsen (nielsenb) |
Component: | Video | Assignee: | EFI Virtual User (efi) |
Status: | RESOLVED CODE_FIX | ||
Severity: | normal | CC: | alexdeucher, dwagelaar, ego.cordatus, imre.deak, jan.public, mario.limonciello, mitch9654, pmenzel+bugzilla.kernel.org, regressions |
Priority: | P1 | ||
Hardware: | All | ||
OS: | Linux | ||
Kernel Version: | 5.16-rc3 | Subsystem: | |
Regression: | No | Bisected commit-id: | |
Attachments: |
possible fix
possible fix 2 possible fix patch 1/2 patch 2/2 patch 1/2 patch 2/2 Alternative patch 1/2 Alternative patch 2/2 possible fix |
Description
Brandon Nielsen
2021-12-02 20:49:48 UTC
Hi Brandon, from https://gist.github.com/belarusaed/7f21e2532cba9e30d9023b474c4bca77 referenced from the redhat ticket I presume the display is connected to the 0b:00.0 graphics card, and by "5. Observer monitor leaving "sleep" state (backlight comes back on), but nothing is displayed" you mean that the screen contents do get restored after some input events (so it doesn't stay blank). I suspect that the result of the commit you bisected will enable runtime PM on that card (as for any PCI graphics card that would take over the EFI framebuffer) and this somehow reveals the issue you observe. Could you verify this by checking with and without the commit cat /sys/devices/pci0000:00/0000:0b:00.0/power/control after having booted up? If the above shows 'auto' with the commit applied, could you set this to 'on' and check if the problem still reproduces? (In reply to Imre Deak from comment #1) > Hi Brandon, > > from > https://gist.github.com/belarusaed/7f21e2532cba9e30d9023b474c4bca77 > > referenced from the redhat ticket I presume the display is connected to the > 0b:00.0 That is not mine, that bug report turned into a bit of a mess of "me-to" comments, and was started well before this commit, I probably shouldn't have reopened it. I can attach fpaste output to this ticket if you want me to. Display on my machine is at 01:00.0. > graphics card, and by > "5. Observer monitor leaving "sleep" state (backlight comes back on), but > nothing is displayed" > > you mean that the screen contents do get restored after some input events > (so it doesn't stay blank). Yes, that is correct. > I suspect that the result of the commit you bisected will enable runtime PM > on that card (as for any PCI graphics card that would take over the EFI > framebuffer) and this somehow reveals the issue you observe. Could you > verify this by checking with and without the commit > > cat /sys/devices/pci0000:00/0000:0b:00.0/power/control > > after having booted up? Before the commit, kernel git revision fac956710ab0: [nielsenb@boleyn ~]$ cat /sys/devices/pci0000:00/0000:00:01.0/0000:01:00.0/power/control auto After the commit, kernel git revision 55285e21f045: [nielsenb@boleyn kernel]$ cat /sys/devices/pci0000:00/0000:00:01.0/0000:01:00.0/power/control auto > If the above shows 'auto' with the commit applied, could you set this to > 'on' and check if the problem still reproduces? With the commit applied (git revision 55285e21f045): Forcing on: [nielsenb@boleyn linux]$ sudo sh -c "echo on > /sys/devices/pci0000:00/0000:00:01.0/0000:01:00.0/power/control" Makes the problem go away, returning to auto: [nielsenb@boleyn linux]$ sudo sh -c "echo auto > /sys/devices/pci0000:00/0000:00:01.0/0000:01:00.0/power/control" Makes the problem return. Thanks, so runtime PM being enabled for the AMD gfx driver seems to trigger the problem. About power/control being 'auto' at fac956710ab0: I suppose that tree doesn't trigger the problem. It doesn't contain 55285e21f045 so I wonder how runtime PM there can be enabled after the EFI FB->drm FB takeover (IOW how the issue in that tree doesn't trigger if we consider the root cause being RPM being enabled). If you could double-check power/control in fac956710ab0 and it's really 'auto', there could be multiple issues at hand with the same symptom and it would be better to just take v5.15.6, and check power/control and if the issue triggers there before and after reverting 55285e21f045. In any case I suspect AMD folks will need to look at this, as 55285e21f045 seems to reveal a yet unknown issue. (In reply to Imre Deak from comment #3) > Thanks, so runtime PM being enabled for the AMD gfx driver seems to trigger > the problem. > > About power/control being 'auto' at fac956710ab0: I suppose that tree > doesn't trigger the problem. It doesn't contain 55285e21f045 so I wonder how > runtime PM there can be enabled after the EFI FB->drm FB takeover (IOW how > the issue in that tree doesn't trigger if we consider the root cause being > RPM being enabled). > > If you could double-check power/control in fac956710ab0 and it's really > 'auto', there could be multiple issues at hand with the same symptom and it > would be better to just take v5.15.6, and check power/control and if the > issue triggers there before and after reverting 55285e21f045. Just double checked, it really is set to 'auto' in fac956710ab0. Switching to the stable branch, with 5.15.6 the issue does occur, PM state is 'auto'. After reverting e3b39825ed0813f787cb3ebdc5ecaa5131623647 in v5.15.6, the issue does not occur, PM state is 'auto'. Switching PM state to 'on' makes the issue come back but only for the first screen lock. The second time, after a lock the monitors will take longer to go to sleep (~15 sec. vs the usual ~3 sec.), but will not wake up again until an input event. The third time, after a lock, everything works as expected (~3 sec. after lock for monitors to go to sleep, monitors stay asleep until an input event). I am not sure if this behavior is consistent with previous kernels, I only just noticed it. Ok. Could you check in 5.15.6 if setting power/control to 'on' solves the problem, at least at the second and subsequent screen blankings? Also it's actually power/usage_count that would show if the device runtime suspends or not, so could you test once more on an ssh shell the content of power/usage_count after blanking the screen (and waiting until the monitor is supposed to be in power saving state) both in 5.15.6 and then reverting e3b39825ed081? Thanks. (In reply to Imre Deak from comment #5) > Ok. Could you check in 5.15.6 if setting power/control to 'on' solves the > problem, at least at the second and subsequent screen blankings? With v5.15.6, forcing power/control to 'on' does not seem to solve the problem, even on second and subsequent blankings. This is different than I observed in comment 2 with mainline commit 55285e21f045. Perhaps 'on' isn't always a fix, and just makes the actual problem less likely to occur? > Also it's actually power/usage_count that would show if the device runtime > suspends or not, so could you test once more on an ssh shell the content of > power/usage_count after blanking the screen (and waiting until the monitor > is supposed to be in power saving state) both in 5.15.6 and then reverting > e3b39825ed081? > > Thanks. I don't have a power/usage_count. I have: [nielsenb@boleyn ~]$ ls /sys/devices/pci0000:00/0000:00:01.0/0000:01:00.0/power/*count* /sys/devices/pci0000:00/0000:00:01.0/0000:01:00.0/power/wakeup_abort_count /sys/devices/pci0000:00/0000:00:01.0/0000:01:00.0/power/wakeup_active_count /sys/devices/pci0000:00/0000:00:01.0/0000:01:00.0/power/wakeup_count /sys/devices/pci0000:00/0000:00:01.0/0000:01:00.0/power/wakeup_expire_count Sorry, I meant power/runtime_usage. Not sure how 55285e21f045 makes a difference if disabling runtime PM on the Gfx PCI device doesn't work around the problem. Maybe 0000:00:01.0/0000:01:00.0 is not the correct PCI device? Could you run find \( -path \*power/control -o -path \*power/runtime_usage \) -exec grep -H . {} \; in /sys/devices/pci0000:00/0000:00:01.0 while in the suspended state? You could also check the PCI device in interest and its control/runtime_usage corresponding to the DRM driver/DRM FB by # ls -l /sys/class/drm/card0/device # cat /sys/class/drm/card0/device/power/{control,runtime_usage} (In reply to Imre Deak from comment #7) > Sorry, I meant power/runtime_usage. Don't have that either: [nielsenb@boleyn ~]$ ll /sys/devices/pci0000:00/0000:00:01.0/0000:01:00.0/power/ total 0 -rw-r--r--. 1 root root 4096 Dec 9 14:26 autosuspend_delay_ms -rw-r--r--. 1 root root 4096 Dec 9 14:26 control -r--r--r--. 1 root root 4096 Dec 9 14:26 runtime_active_time -r--r--r--. 1 root root 4096 Dec 9 14:26 runtime_status -r--r--r--. 1 root root 4096 Dec 9 14:26 runtime_suspended_time -rw-r--r--. 1 root root 4096 Dec 9 14:26 wakeup -r--r--r--. 1 root root 4096 Dec 9 14:26 wakeup_abort_count -r--r--r--. 1 root root 4096 Dec 9 14:26 wakeup_active -r--r--r--. 1 root root 4096 Dec 9 14:26 wakeup_active_count -r--r--r--. 1 root root 4096 Dec 9 14:26 wakeup_count -r--r--r--. 1 root root 4096 Dec 9 14:26 wakeup_expire_count -r--r--r--. 1 root root 4096 Dec 9 14:26 wakeup_last_time_ms -r--r--r--. 1 root root 4096 Dec 9 14:26 wakeup_max_time_ms -r--r--r--. 1 root root 4096 Dec 9 14:26 wakeup_total_time_ms > Not sure how 55285e21f045 makes a difference if disabling runtime PM on the > Gfx PCI device doesn't work around the problem. Okay, I double checked everything, and I think the issue was I was confused by some of the behavior I was seeing. Specifically that forcing power/control to 'on' can sometimes result in the monitors not sleeping for 15 or so seconds and I'm impatient. 55285e21f045 - Displays issue, power/control 'auto' by default. Forcing power/control to 'on' fixes the issue (with sometimes delayed monitor sleep, as mentioned above). v5.15.6 - Same as above. Displays issue, power/control 'auto' by default. Forcing power/control to 'on' fixes the issue (with sometimes delayed monitor sleep, as mentioned above). When I performed my test from comment 6, I probably didn't wait for the monitor to sleep. v5.15.6 with e3b39825ed0813f787cb3ebdc5ecaa5131623647 reverted - Does not display issue. power/control 'auto' by default, no monitor sleep delay. Forcing power/control to 'on' introduces the monitor sleep delay but otherwise does not display the issue. > Maybe 0000:00:01.0/0000:01:00.0 is not the correct PCI device? Could you run > > find \( -path \*power/control -o -path \*power/runtime_usage \) -exec grep > -H . {} \; > > in > > /sys/devices/pci0000:00/0000:00:01.0 > > while in the suspended state? [nielsenb@boleyn ~]$ cd /sys/devices/pci0000:00/0000:00:01.0 [nielsenb@boleyn 0000:00:01.0]$ find \( -path \*power/control -o -path \*power/runtime_usage \) -exec grep -H . {} \; ./power/control:auto ./0000:01:00.0/graphics/fb0/power/control:auto ./0000:01:00.0/power/control:auto ./0000:01:00.0/hwmon/hwmon1/power/control:auto ./0000:01:00.0/drm/renderD128/power/control:auto ./0000:01:00.0/drm/card0/card0-HDMI-A-1/power/control:auto ./0000:01:00.0/drm/card0/card0-DVI-D-1/power/control:auto ./0000:01:00.0/drm/card0/power/control:auto ./0000:01:00.0/drm/card0/card0-DP-2/power/control:auto ./0000:01:00.0/drm/card0/card0-DP-2/drm_dp_aux1/power/control:auto ./0000:01:00.0/drm/card0/card0-DP-3/power/control:auto ./0000:01:00.0/drm/card0/card0-DP-3/drm_dp_aux2/power/control:auto ./0000:01:00.0/drm/card0/card0-DP-1/power/control:auto ./0000:01:00.0/drm/card0/card0-DP-1/drm_dp_aux0/power/control:auto ./pci_bus/0000:01/power/control:auto ./0000:01:00.1/power/control:auto ./0000:01:00.1/hdaudioC1D0/power/control:auto ./0000:01:00.1/sound/card1/input9/event9/power/control:auto ./0000:01:00.1/sound/card1/input9/power/control:auto ./0000:01:00.1/sound/card1/hwC1D0/power/control:auto ./0000:01:00.1/sound/card1/input14/power/control:auto ./0000:01:00.1/sound/card1/input14/event14/power/control:auto ./0000:01:00.1/sound/card1/pcmC1D9p/power/control:auto ./0000:01:00.1/sound/card1/pcmC1D8p/power/control:auto ./0000:01:00.1/sound/card1/pcmC1D7p/power/control:auto ./0000:01:00.1/sound/card1/input12/power/control:auto ./0000:01:00.1/sound/card1/input12/event12/power/control:auto ./0000:01:00.1/sound/card1/pcmC1D3p/power/control:auto ./0000:01:00.1/sound/card1/input10/power/control:auto ./0000:01:00.1/sound/card1/input10/event10/power/control:auto ./0000:01:00.1/sound/card1/power/control:auto ./0000:01:00.1/sound/card1/input13/event13/power/control:auto ./0000:01:00.1/sound/card1/input13/power/control:auto ./0000:01:00.1/sound/card1/controlC1/power/control:auto ./0000:01:00.1/sound/card1/input11/power/control:auto ./0000:01:00.1/sound/card1/input11/event11/power/control:auto ./0000:01:00.1/sound/card1/pcmC1D11p/power/control:auto ./0000:01:00.1/sound/card1/pcmC1D10p/power/control:auto Ok, thanks for the tests. +Alex, could you take a look at this? I think e3b39825ed0813 revealed another issue in the AMD Gfx driver which should be root-caused/fixed (as the bisected commit only permits all Gfx devices to runtime suspend whenever EFI FB is used). The reasoning based on all of Brandon's tests so far: On v5.15.6 (containing e3b39825ed0813) after inactivity the display (on /sys/devices/pci0000:00/0000:00:01.0) will blank/runtime suspend, but after a while will wake up prematurely without any user input. Reverting e3b39825ed0813 or force-disabling runtime PM on the device (by setting power/control to "on") seems to work around the issue. Brandon noticed that disabling RPM adds some extra delay before the monitor enters the sleep state. Not sure about the reason for that, power/autosuspend_delay_ms could be checked/lowered, though normally it would only have an effect when RPM is enabled. More direct ways to check that the device does in fact runtime suspend (on v5.15.6 with power/control left at 'auto'): - Check if power/runtime_suspended_time increases across multiple display sleep/wake-ups. In case the above counter doesn't increment: - Build the kernel with CONFIG_PM_ADVANCED=y and check power/runtime_usage at the point the monitor goes to sleep. or - Enable some debug option specific to the AMD gfx driver indicating the runtime suspend/resume events in dmesg. Yes, please check to see if the device has actually runtime suspended (runtime_status and runtime_suspended_time). Note that the displays blanking are independent of runtime suspend. Runtime suspend actually turns off the card when it's completely idle. The display blanking is controlled by the desktop compositor, but the card may still be active if you have background compute work or something still using the GPU. Also, in these kernels, runtime pm shouldn't kick in at all if you have drm fbdev support enabled and a console bound for this device because the console maintains a mmap to the framebuffer which increments the runtime pm ref count. If runtime pm is active, any access to the device will wake the card (e.g., IOCTL, hwmon sysfs query, etc.). You may have some process running that queries the card on a (ir)regular basis. With 5.15.6, power/control on 'auto', I can confirm approximately around the time the screens blank, /sys/devices/pci0000:00/0000:00:01.0/0000:01:00.0/power/runtime_status goes to 'suspended'. Suspended time increments as well: [nielsenb@boleyn ~]$ cat /sys/devices/pci0000:00/0000:00:01.0/0000:01:00.0/power/runtime_suspended_time 6409 At around the time the screens wake up, runtime_status returns to 'active', it never seems to return to 'suspended'. Lo, this is your Linux kernel regression tracker speaking. This bug about a regression on 5.16-rc looks stalled right before the holiday season, as there hasn't been any activity since ten days now. @Alex, could you give a status update please? Was the information provided by Brandon what you needed? What's needed to get this moving again? (In reply to Brandon Nielsen from comment #12) > With 5.15.6, power/control on 'auto', I can confirm approximately around the > time the screens blank, > /sys/devices/pci0000:00/0000:00:01.0/0000:01:00.0/power/runtime_status goes > to 'suspended'. > > Suspended time increments as well: > > [nielsenb@boleyn ~]$ cat > /sys/devices/pci0000:00/0000:00:01.0/0000:01:00.0/power/ > runtime_suspended_time > 6409 > > At around the time the screens wake up, runtime_status returns to 'active', > it never seems to return to 'suspended'. It seems that shortly after the device goes into runtime power management, something wakes it again. Perhaps some userspace process is querying the driver for something (e.g., the the compositor is polling for display changes or some process is polling GPU temperature) on a regular basis? Imre, Alex, as we are at rc6 already and the festive season is coming up; Linus said that he will release a rc8, but nevertheless: Is reverting the culprit an option? And if it is: is it a good idea to buy us more time? Runtime power management has been enabled in amdgpu long before either 55285e21f0451 or a6c0fd3d5a8b were applied to efifb. runtime pm has been enabled for radeon GPUs in laptops for close to a decade, and it's been enabled on desktop GPUs for almost 2 years. It seems like perhaps there was some change to some userspace component in the meantime which seems to regularly wake the GPU and perhaps the changes in 55285e21f0451 or a6c0fd3d5a8b papered over that. E.g., some desktop application is polling the GPU for display changes or GPU temperature. Can you repro the issue on bare bones X server for example without a full gnome or kde environment loaded? @Thorsten, reverting 55285e21f0451 alone is not acceptable from my POV since it fixed an actual problem leaking a runtime PM reference from the EFI FB driver. That would block runtime PM power saving for both the AMD and Intel graphics devices. Created attachment 300103 [details]
possible fix
Does this patch help?
(In reply to Alex Deucher from comment #18) > Created attachment 300103 [details] > possible fix > > Does this patch help? I can confirm it fixes the issue of the monitors waking from sleep, but the device no longer seems to runtime suspend ('runtime_suspended_time' no longer increments). (In reply to Alex Deucher from comment #16) > Runtime power management has been enabled in amdgpu long before either > 55285e21f0451 or a6c0fd3d5a8b were applied to efifb. runtime pm has been > enabled for radeon GPUs in laptops for close to a decade, and it's been > enabled on desktop GPUs for almost 2 years. It seems like perhaps there was > some change to some userspace component in the meantime which seems to > regularly wake the GPU and perhaps the changes in 55285e21f0451 or > a6c0fd3d5a8b papered over that. E.g., some desktop application is polling > the GPU for display changes or GPU temperature. Can you repro the issue on > bare bones X server for example without a full gnome or kde environment > loaded? I don't have any particular system monitoring services configured that I'm away of, but I am open to the idea this is just a strange confluence of issues that 55285e21f0451 happens to expose as 'monitors not sleeping'. There is certainly precedent for it really being a GNOME / mutter issue[0]. If it is still of interest, I can try to figure out how to boot this system into bare X and see what happens. [0] - https://gitlab.gnome.org/GNOME/mutter/-/issues/1021 Booted into twm (ah memories), but I can't figure out how to suspend in a fashion that would enter runtime power management. [nielsenb@boleyn ~]$ xlock -dpmsoff 1 Makes the monitors sleep, but I don't see 'runtime_suspended_time' incrementing. If this test is of interest, I think I'm going to need more guidance. $ xset dpms force off ? (In reply to Brandon Nielsen from comment #19) > (In reply to Alex Deucher from comment #18) > > Created attachment 300103 [details] > > possible fix > > > > Does this patch help? > > I can confirm it fixes the issue of the monitors waking from sleep, but the > device no longer seems to runtime suspend ('runtime_suspended_time' no > longer increments). I think that issue may be separate. It sounds like something is polling the GPU on a regular basis. If the frequency of that polling is <5 seconds, the device won't runtime suspend as the default autosuspend timeout set by the driver is 5 seconds. Created attachment 300109 [details]
possible fix 2
Does this patch also help?
(In reply to Imre Deak from comment #22) > $ xset dpms force off ? Same as '$ xlock -dpmsoff 1' without the thrill of 5 seconds of your favorite random retro screensaver. Monitors sleep correctly, and stay sleeping, but 'runtime_suspended_time' does not increment. (In reply to Alex Deucher from comment #25) > Created attachment 300109 [details] > possible fix 2 > > Does this patch also help? Yes. Monitors stay in sleep state, `runtime_suspended_time` increases. I will be gone until sometime early next week, so I won't be around to test until then. (In reply to Brandon Nielsen from comment #27) > (In reply to Alex Deucher from comment #25) > > Created attachment 300109 [details] > > possible fix 2 > > > > Does this patch also help? > > Yes. Monitors stay in sleep state, `runtime_suspended_time` increases. > > I will be gone until sometime early next week, so I won't be around to test > until then. Thanks. Unfortunately, it doesn't fix it for https://gitlab.freedesktop.org/drm/amd/-/issues/1840 . Created attachment 300111 [details]
possible fix
This patch should fix the issue at least for this stage of the kernel. We can revisit it more thoroughly for the next cycle.
Created attachment 300121 [details]
patch 1/2
Created attachment 300123 [details]
patch 2/2
How about these patches? They should restore the previous behavior.
@Alex, shouldn't amdgpu_is_conflicting_framebuffer() be called before drm_aperture_remove_conflicting_pci_framebuffers() in amdgpu_pci_probe()? I think the RPM refcount is still dropped then by drm_aperture_remove_conflicting_pci_framebuffers(), not sure if that's an issue and the driver should get it's own ref before that. Created attachment 300135 [details]
patch 1/2
Created attachment 300137 [details]
patch 2/2
Thanks. Fixed up patches attached. (In reply to Alex Deucher from comment #35) > Thanks. Fixed up patches attached. I can confirm they change 'power/control' to 'on', and that monitors sleep correctly. Created attachment 300177 [details]
Alternative patch 1/2
These new patches are an alternative approach which should handle things better if there is no efifb configured in your kernel.
Created attachment 300178 [details]
Alternative patch 2/2
(In reply to Alex Deucher from comment #37) > Created attachment 300177 [details] > Alternative patch 1/2 > > These new patches are an alternative approach which should handle things > better if there is no efifb configured in your kernel. hmm, looking at the runtime pm documentation more closely, I don't think these patches will help. If the driver enables autosuspend, the idle callback is not used. Created attachment 300213 [details]
possible fix
Does this patch work?
(In reply to Alex Deucher from comment #40) > Created attachment 300213 [details] > possible fix > > Does this patch work? No. Same behavior as described in comment #12. 'power/control' is 'auto'. On the first screen lock, monitors suspend, and '/power/runtime_suspended_time' increments. After a few seconds the screens wake up, but blank. Subsequent lockings behave similarly, but '/power/runtime_suspended_time' never increments further. (In reply to Imre Deak from comment #17) > @Thorsten, reverting 55285e21f0451 alone is not acceptable from my POV since > it fixed an actual problem leaking a runtime PM reference from the EFI FB > driver. That would block runtime PM power saving for both the AMD and Intel > graphics devices. So reverting commit 55285e21f045 (fbdev/efifb: Release PCI device's runtime PM ref during FB destroy) and a6c0fd3d5a8b (efifb: Ensure graphics device for efifb stays at PCI D0) would need to be reverted? I think this bug can be closed. The original behavior was restored in amdgpu with several iterations of patches, ultimately in: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=4020c2280233279ea682a7f2f24b54426416d91d |