Bug 219629
Summary: | [REGRESSION, BISECTED] Kernel version 6.13.rc4 not powering off laptop | ||
---|---|---|---|
Product: | Drivers | Reporter: | Evert Vorster (evorster) |
Component: | PCI | Assignee: | drivers_pci (drivers_pci) |
Status: | RESOLVED CODE_FIX | ||
Severity: | high | CC: | bjorn, ilpo.jarvinen, lukas |
Priority: | P1 | ||
Hardware: | All | ||
OS: | Linux | ||
Kernel Version: | Subsystem: | ||
Regression: | No | Bisected commit-id: | |
Attachments: |
Output of dmidecode
Journal of a session with Arch mainline kernel
Output of lspci -vvvv with USB-C cable plugged
Output of lspci -vvvv with HDMI cable plugged
Output of sudo lspci -vvvv with the USB-C cable plugged.
Output of sudo lspci -vvvv with HDMI cable plugged
attachment-27986-0.html
Test without enabling BW notifications
Output of sudo lspci -vvvv on 6.6.68 Kernel, with USB-C plugged
[PATCH] PCI/bwctrl: Preserve Link Status Register on write
[PATCH] PCI/bwctrl: Fix NULL pointer deref on bind and unbind |
Created attachment 307400 [details]
Journal of a session with Arch mainline kernel
Attached a journal of a session with the Arch mainline kernel, which is the earliest kernel that I have seen this issue with.
As a sanity check, on one of the shutdowns I waited for about 10 minutes to rule out a hanging process that was merely delaying the power down. It did not power down until I held down the power button for a few seconds.
Hopefully this helps in narrowing down the issue.
Merry Christmas!
On a side note, this is the earliest 6.13 version of the kernel I have tried. With the latest 6.12 kernel, it still works fine. Please bisect: https://docs.kernel.org/admin-guide/bug-bisect.html Hi there! Thanks, I'll go do that. Unfortunately it takes a little time to bisect, and it is Christmas at the moment, so I'll do it a bit later. One other interesting data point on this bug seems that it is triggered when my external monitor is plugged into the HDMI port of my laptop. I recently got a USB-C to DisplayPort cable, and when using that to communicate with the external monitor, this issue does not appear. Hi there. It took me a little while to get the bisection to work, but it was a nice learning experience. Currently I am on my fourth bisection build, and the first two runs were good, and the last one bad, so I am on track to find the commit that caused this issue. However there are still about 10 iterations left, and with the festive season upon us my computer time is limited. So, please have some patience while I track this one down. It turns out that I had a little more time on my hands. Here is the output of the kernel bisect: [code] [evert@Evert linux-mainline]$ git bisect good 665745f274870c921020f610e2c99a3b1613519b is the first bad commit commit 665745f274870c921020f610e2c99a3b1613519b Author: Ilpo Järvinen <ilpo.jarvinen@linux.intel.com> Date: Fri Oct 18 17:47:52 2024 +0300 PCI/bwctrl: Re-add BW notification portdrv as PCIe BW controller This mostly reverts the commit b4c7d2076b4e ("PCI/LINK: Remove bandwidth notification"). An upcoming commit extends this driver building PCIe bandwidth controller on top of it. PCIe bandwidth notifications were first added in the commit e8303bb7a75c ("PCI/LINK: Report degraded links via link bandwidth notification") but later had to be removed. 
The significant changes compared with the old bandwidth notification driver include: 1) Don't print the notifications into kernel log, just keep the Link Speed cached in struct pci_bus updated. While somewhat unfortunate, the log spam was the source of complaints that eventually lead to the removal of the bandwidth notifications driver (see the links below for further information). 2) Besides the Link Bandwidth Management Interrupt, also enable Link Autonomous Bandwidth Interrupt to cover the other source of bandwidth changes. 3) Handle Link Speed updates robustly. Refresh the cached Link Speed when enabling Bandwidth Notification Interrupts, and solve the race between Link Speed read and LBMS/LABS update in pcie_bwnotif_irq_thread(). 4) Use concurrency safe LNKCTL RMW operations. 5) The driver is now called PCIe bwctrl (bandwidth controller) instead of just bandwidth notifications because of increased scope and functionality within the driver. 6) Coexist with the Target Link Speed quirk in pcie_failed_link_retrain(). Provide LBMS counting API for it. 7) Tweaks to variable/functions names for consistency and length reasons. Bandwidth Notifications enable the cur_bus_speed in the struct pci_bus to keep track PCIe Link Speed changes. 
[bhelgaas: This is based on previous work by Alexandru Gagniuc <mr.nuke.me@gmail.com>; see e8303bb7a75c ("PCI/LINK: Report degraded links via link bandwidth notification")] Link: https://lore.kernel.org/r/20241018144755.7875-7-ilpo.jarvinen@linux.intel.com Link: https://lore.kernel.org/all/20190429185611.121751-1-helgaas@kernel.org/ Link: https://lore.kernel.org/linux-pci/20190501142942.26972-1-keith.busch@intel.com/ Link: https://lore.kernel.org/linux-pci/20200115221008.GA191037@google.com/ Suggested-by: Lukas Wunner <lukas@wunner.de> # Building bwctrl on top of bwnotif Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@linux.intel.com> [bhelgaas: squash fix to drop IRQF_ONESHOT and convert to hardirq handler: https://lore.kernel.org/r/20241115165717.15233-1-ilpo.jarvinen@linux.intel.com] Signed-off-by: Bjorn Helgaas <bhelgaas@google.com> Tested-by: Stefan Wahren <wahrenst@gmx.net> Reviewed-by: Jonathan Cameron <Jonathan.Cameron@huawei.com> MAINTAINERS | 6 ++++ drivers/pci/hotplug/pciehp_ctrl.c | 5 ++++ drivers/pci/pci.c | 2 +- drivers/pci/pci.h | 11 +++++++ drivers/pci/pcie/Makefile | 2 +- drivers/pci/pcie/bwctrl.c | 186 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++ drivers/pci/pcie/portdrv.c | 9 +++--- drivers/pci/pcie/portdrv.h | 6 ++-- drivers/pci/quirks.c | 9 +++++- include/linux/pci.h | 2 ++ 10 files changed, 229 insertions(+), 9 deletions(-) create mode 100644 drivers/pci/pcie/bwctrl.c [evert@Evert linux-mainline]$ [/code] Thanks a ton for your work! Please send your findings to LKML and CC ilpo.jarvinen AT linux.intel.com preferably after January, 2. I cannot CC him here because he's not on this bug tracker :-( The USA is now on a holiday break. I've changed the component to Drivers/PCI, hopefully at least some relevant people will now notice this bug report. I have sent Ilpo a mail, pointing to this bug report, and told him that it is not urgent. 
In my use case, this is not an urgent bug as I have a USB-C to Displayport cable, and if I drive the external monitor with that, not only do I get 144Hz refresh rate rather than the 120Hz that HDMI seems to top out at, this error does not show up. I'm not on the LKML, but if I find myself reporting many more bugs I might join.

Updating importance, as any regression is urgent since other users will trip over this and may not have the same workaround.

On shutdown, pcie_bwnotif_remove() is run via the following call chain:

  pcie_portdrv_shutdown()
    pcie_port_device_remove()
      remove_iter()
        device_unregister()
          pcie_bwnotif_remove()

Maybe something in pcie_bwnotif_remove() is deadlocking. Could you try editing that function in drivers/pci/pcie/bwctrl.c and commenting out the call to pcie_cooling_device_unregister() and if that doesn't help, try commenting out the scoped_guard() statements and see if that fixes the issue?

Hi Lukas. Unfortunately I am not a programmer. I can hunt down bugs and report on them, as well as test solutions, but actually making changes to the source is above my skill level.

So basically you'd just have to edit the file drivers/pci/pcie/bwctrl.c in the kernel source tree, find the function pcie_bwnotif_remove() and add two slashes in front of the pcie_cooling_device_unregister() and scoped_guard() statements, like this:

-	pcie_cooling_device_unregister(data->cdev);
+	// pcie_cooling_device_unregister(data->cdev);

-	scoped_guard(rwsem_write, &pcie_bwctrl_setspeed_rwsem)
-		scoped_guard(rwsem_write, &pcie_bwctrl_lbms_rwsem)
+	// scoped_guard(rwsem_write, &pcie_bwctrl_setspeed_rwsem)
+	// scoped_guard(rwsem_write, &pcie_bwctrl_lbms_rwsem)
 		srv->port->link_bwctrl = NULL;

And then recompile + reinstall the kernel. The two slashes result in the subsequent statement being commented out.
You can also insert a line in the pcie_bwnotif_remove() function which emits a message:

	pci_info(srv->port, "%s\n", __func__);

This will emit the name of each PCI device when the bandwidth controller is disabled on shutdown. You should be able to see this on the console and it'll allow you to pinpoint the device for which the function hangs. You may have to add "ignore_loglevel" to the kernel command line to see the messages.

Hi there! I followed the steps here, and unfortunately the issue remains. Also, the screen goes black, so I can't see the last messages printed to the terminal. I waited for a little over ten minutes for the laptop to shut down before holding down the power button to force it to shut down. Here is the last bit of the journal of that run:
---------------------------------
Dec 29 20:00:20 Evert.Strix systemd[1]: Reached target Unmount All Filesystems.
Dec 29 20:00:20 Evert.Strix systemd[1]: systemd-fsck@dev-disk-by\x2dlabel-Sbr_Int.service: Deactivated successfully.
Dec 29 20:00:20 Evert.Strix systemd[1]: Stopped File System Check on /dev/disk/by-label/Sbr_Int.
Dec 29 20:00:20 Evert.Strix systemd[1]: Removed slice Slice /system/systemd-fsck.
Dec 29 20:00:20 Evert.Strix systemd[1]: Stopped target Preparation for Local File Systems.
Dec 29 20:00:20 Evert.Strix systemd[1]: Stopping Monitoring of LVM2 mirrors, snapshots etc. using dmeventd or progress polling...
Dec 29 20:00:20 Evert.Strix systemd[1]: systemd-remount-fs.service: Deactivated successfully.
Dec 29 20:00:20 Evert.Strix systemd[1]: Stopped Remount Root and Kernel File Systems.
Dec 29 20:00:20 Evert.Strix systemd[1]: systemd-tmpfiles-setup-dev.service: Deactivated successfully.
Dec 29 20:00:20 Evert.Strix systemd[1]: Stopped Create Static Device Nodes in /dev.
Dec 29 20:00:20 Evert.Strix systemd[1]: systemd-tmpfiles-setup-dev-early.service: Deactivated successfully.
Dec 29 20:00:20 Evert.Strix systemd[1]: Stopped Create Static Device Nodes in /dev gracefully.
Dec 29 20:00:21 Evert.Strix systemd[1]: lvm2-monitor.service: Deactivated successfully.
Dec 29 20:00:21 Evert.Strix systemd[1]: Stopped Monitoring of LVM2 mirrors, snapshots etc. using dmeventd or progress polling.
Dec 29 20:00:21 Evert.Strix systemd[1]: Reached target System Shutdown.
Dec 29 20:00:21 Evert.Strix systemd[1]: Reached target Late Shutdown Services.
Dec 29 20:00:21 Evert.Strix systemd[1]: systemd-poweroff.service: Deactivated successfully.
Dec 29 20:00:21 Evert.Strix systemd[1]: Finished System Power Off.
Dec 29 20:00:21 Evert.Strix systemd[1]: Reached target System Power Off.
Dec 29 20:00:21 Evert.Strix systemd[1]: Shutting down.
Dec 29 20:00:21 Evert.Strix systemd-shutdown[1]: Syncing filesystems and block devices.
Dec 29 20:00:21 Evert.Strix systemd-shutdown[1]: Sending SIGTERM to remaining processes...
Dec 29 20:00:21 Evert.Strix systemd-journald[572]: Received SIGTERM from PID 1 (systemd-shutdow).
Dec 29 20:00:21 Evert.Strix systemd-journald[572]: Journal stopped
-------------------------------------------------------
As a sanity check, I compiled and installed the latest git version of the kernel, and the issue is still present.

Quick update on this issue, it also occurs with no HDMI cable plugged in. Only when I have the USB-C to DisplayPort cable plugged in does it not occur.

You've got a laptop with dual GPUs (AMD + Nvidia). It's possible that the HDMI port is attached to a different GPU than the USB-C port and that may influence the behavior here.

If pcie_bwnotif_remove() does not deadlock, maybe changes to the PCI configuration by the bandwidth controller cause the issue. Could you attach the output of "lspci -vvvv", both in the working case (i.e. with kernel v6.12) and in the non-working case?

Created attachment 307416 [details]
Output of lspci -vvvv with USB-C cable plugged
Created attachment 307417 [details]
Output of lspci -vvvv with HDMI cable plugged
I did a quick diff as well, seems that the IRQs are routed differently.
-----------------------------------------------
[evert@Evert Bug Reports]$ diff 2024-12-30_HDMI_Cable_Plugged_In.txt 2024-12-30_USB-c_DisplayPort_Cable_Plugged_In.txt
238c238
< Interrupt: pin A routed to IRQ 125
---
> Interrupt: pin A routed to IRQ 123
255c255
< Interrupt: pin B routed to IRQ 111
---
> Interrupt: pin B routed to IRQ 124
267c267
< Interrupt: pin A routed to IRQ 75
---
> Interrupt: pin A routed to IRQ 74
427c427
< Interrupt: pin C routed to IRQ 112
---
> Interrupt: pin C routed to IRQ 125
------------------------------------------------
Not sure why that would make a difference? In any case, I hope this helps! If there is anything else that I can provide, please do not hesitate to ask!

You need to execute the lspci command as root, otherwise the registers are hidden:

  Capabilities: <access denied>

Sorry for not explicitly mentioning this earlier!

Created attachment 307418 [details]
Output of sudo lspci -vvvv with the USB-C cable plugged.
There was also this output to the terminal:
----------------------------------------
[evert@Evert Bug Reports]$ sudo lspci -vvvv > 2024-12-30_sudo_USB-c_DisplayPort_Cable_Plugged_In.txt
[sudo] password for evert:
pcilib: sysfs_read_vpd: read failed: No such device
[evert@Evert Bug Reports]$
----------------------------------------
Created attachment 307419 [details]
Output of sudo lspci -vvvv with HDMI cable plugged
Same output to the terminal:
---------------------------
[evert@Evert Bug Reports]$ sudo lspci -vvvv > 2024-12-30_sudo_HDMI_Cable_Plugged_In.txt
[sudo] password for evert:
pcilib: sysfs_read_vpd: read failed: No such device
---------------------------
I really hope this helps. On one boot, I had the wireless card disabled, and the USB-C to DisplayPort cable plugged in. During the session, I enabled the wireless card, and then the system also was hanging on shutdown. Wireless card was enabled and disabled from the applet in KDE panel. I don't know if this has any bearing on anything, but might give us a clue as to what is happening here.

Thanks! Could you provide "sudo lspci -vvvv" output for a v6.12 kernel as well?

I assume "2024-12-30_sudo_HDMI_Cable_Plugged_In.txt" is the working case and "2024-12-30_sudo_USB-c_DisplayPort_Cable_Plugged_In.txt" is the non-working case.

Comparing the two files I note differences for two Root Ports:

0000:00:01.1
- leads to the Nvidia GPU
- indicates a pending ABWMgmt interrupt in the non-working case (Link Autonomous Bandwidth Status)
- indicates that a Non-Fatal Error and Unsupported Request error has occurred in the non-working case (NonFatalErr / UnsupReq); maybe harmless

0000:00:02.1
- leads to an ASMedia USB 3 controller with integrated PCIe switch
- indicates both a pending ABWMgmt interrupt as well as a pending BWMgmt interrupt in the non-working case (Link Autonomous Bandwidth Status + Link Bandwidth Management Status)
- indicates a Lane Error for all four lanes
- the Switch Upstream port of the ASMedia switch at 0000:05:00.0 indicates that an Advisory Non Fatal Error, Correctable Error and Unsupported Request error has occurred in the non-working case (AdvNonFatalErr / CorrErr / UnsupReq); maybe harmless

Now here's the thing: Normally there shouldn't be any pending ABWMgmt or BWMgmt interrupts because the bandwidth controller's interrupt handler should acknowledge and clear them. Two possible explanations come to mind: Either those AMD Root Ports set the status bits but do not raise an interrupt. Or they raise an interrupt continuously.

To verify whether the second explanation applies, you can check how often the interrupts fire. You can see the interrupt number in dmesg:

  $ egrep -i '0000:00:01.[12].*irq' /tmp/2024-12-24_Mainline_Journal.txt
  Dec 24 07:40:06 Evert.Strix kernel: pcieport 0000:00:01.1: PME: Signaling with IRQ 32
  Dec 24 07:40:06 Evert.Strix kernel: pcieport 0000:00:01.2: PME: Signaling with IRQ 33

Assuming that 32 and 33 are still correct (you need to double-check this), you can see how often they fire by running:

  $ cat /proc/irq/32/spurious
  $ cat /proc/irq/33/spurious

Note the "count" line. How big is that number? Does it increase rapidly, i.e. is the interrupt raised very often?

If the number is low and doesn't increase rapidly, then the first explanation is more likely: That the AMD chipset neglects to send an interrupt. Bandwidth management is a new feature in the kernel and vendors previously may have not bothered to validate that it functions correctly. I don't know why this would lead to a hang on suspend, but maybe we need to selectively disable bandwidth control on known-broken systems.

Note that I'm just filling in for Ilpo here, he has more experience with bandwidth control behavior of different chipsets, but he's still mostly offline this week due to holidays.

Created attachment 307422 [details]
attachment-27986-0.html
Small comment, the HDMI plugged in is where the problem shows up. So the logic is reversed from your analysis.

Created attachment 307423 [details]
Test without enabling BW notifications
BWMgmt+/ABWMgmt+ seem to occur in the HDMI case (Lukas stated that he assumed HDMI to be the working case, which contradicts the analysis part of his comment).
I suggest testing entirely without the BW notifications to see if that helps (a patch to do that attached).
Perhaps the bwnotif disable handler should also clear any pending notifications just in case (even if that would turn out to not be the root cause here).
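For anyone repeating this comparison on their own dumps, here is a rough Python sketch that lists the devices whose Link Status shows BWMgmt+ or ABWMgmt+. It assumes the usual "lspci -vv" text layout (device address at column 0, indented register lines below it) and that lspci was run as root so the capabilities are visible. The function name and the sample text are made up for illustration.

```python
import re

# A device block starts at column 0 with a bus address such as "00:02.1",
# optionally prefixed by a PCI domain such as "0000:".
DEVICE_RE = re.compile(r"^((?:[0-9a-f]{4}:)?[0-9a-f]{2}:[0-9a-f]{2}\.[0-7])\s")
# Match "BWMgmt+" and "ABWMgmt+" as two distinct flags; a trailing "-" means
# the bit is clear and is deliberately not matched.
FLAG_RE = re.compile(r"(?<![A-Za-z])(A?BWMgmt)\+")


def pending_bw_flags(lspci_text):
    """Map each device address to the set of bandwidth status flags shown as set."""
    pending = {}
    current = None
    for line in lspci_text.splitlines():
        header = DEVICE_RE.match(line)
        if header:
            current = header.group(1)
            continue
        if current is None:
            continue
        for flag in FLAG_RE.findall(line):
            pending.setdefault(current, set()).add(flag)
    return pending


# Made-up sample in the shape of lspci output, not taken from the attachments.
SAMPLE = """\
00:01.1 PCI bridge: (made-up sample device)
        LnkSta: Speed 8GT/s, Width x8
                TrErr- Train- SlotClk+ DLActive+ BWMgmt- ABWMgmt+
00:02.1 PCI bridge: (made-up sample device)
        LnkSta: Speed 8GT/s, Width x4
                TrErr- Train- SlotClk+ DLActive+ BWMgmt+ ABWMgmt+
"""

if __name__ == "__main__":
    for address, flags in sorted(pending_bw_flags(SAMPLE).items()):
        print(address, " ".join(sorted(flags)))
```

Running it on the HDMI and the USB-C dump in turn and comparing the two results should show the same difference that was spotted by eye.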
Created attachment 307426 [details]
Output of sudo lspci -vvvv on 6.6.68 Kernel, with USB-C plugged
Checking out the spurious count when the USB-C to DisplayPort cable is plugged in and where we are not expecting to see the issue, the numbers do not increase. This is the output:
-------------------------------------
[evert@Evert pcie]$ cat /proc/irq/32/spurious
count 10
unhandled 1
last_unhandled 1431356392 ms
[evert@Evert pcie]$ cat /proc/irq/33/spurious
count 1
unhandled 1
last_unhandled 1431356392 ms
-------------------------------------
I've patched the kernel, it is compiling now and I'll feed back on whether the issue is still present when the HDMI cable is plugged in. Notably, this issue is also present when neither the HDMI nor the USB-C cable is plugged in.

With the patch applied, the issue is gone now, and the laptop reboots without any issue with the HDMI cable plugged in. So, flatly disabling the BW notifications does seem to have the intended effect. I suppose we will still have to do a bit more troubleshooting to get to the bottom of this one. There are also no spurious interrupts:
-------------------------------------
[evert@Evert ~]$ cat /proc/irq/32/spurious
count 0
unhandled 0
last_unhandled 0 ms
[evert@Evert ~]$ cat /proc/irq/33/spurious
count 0
unhandled 0
last_unhandled 0 ms
--------------------------------------
I mixed up the working and non-working case in this sentence...

> I assume "2024-12-30_sudo_HDMI_Cable_Plugged_In.txt" is the working case and
> "2024-12-30_sudo_USB-c_DisplayPort_Cable_Plugged_In.txt" is the non-working
> case.

... but otherwise the analysis is not mixed up. In the HDMI case (non-working, as you say), there are unacknowledged BWMgmt+ and ABWMgmt+ interrupts and there are lane errors for all lanes of the Root Port above the ASMedia USB controller. These do not show up in the USB-C case (working case).

Comparing this to the (working) 6.6.68 kernel, the unacknowledged interrupts do show up (which is expected as there's no bandwidth controller to take care of them). But the lane errors do not show up.
Could you reproduce the 6.6.68 output with HDMI cable to rule out that the USB-C cable somehow changed the behavior to the better? It would probably make sense to selectively disable bandwidth control only for the Root Port leading to the ASMedia USB controller, and then only for the Root Port leading to the Nvidia GPU, to understand which is causing problems. The interrupt count of 10 for the Root Port leading to the Nvidia card indicates that the bandwidth was changed a couple of times. GPUs are known to autonomously change bandwidth to save power when not in use. The single unhandled interrupt on each Root Port may indicate that the interrupt was raised too early: The bandwidth controller's interrupt handler found that no interrupt was actually set. And the hardware set the BWMgmt and ABWMgmt bits only afterwards, which would explain the unacknowledged interrupts. @Ilpo: > Perhaps the bwnotif disable handler should also clear any pending > notifications just in case (even if that would turn out to not be the root > cause here). I agree, that would make a lot of sense. The 6.6.68 kernel and the 6.12 series of kernels did not have trouble shutting down the power of this laptop. Thank you for the detailed explanation of what is occurring here. In my severely limited understanding, I do agree with you that the bandwidth controller should clear any pending/un-handled interrupts on shutdown. This makes sense from a robustness perspective. Would it also be a good idea to log any un-handled interrupts that were cleared to the journal for future troubleshooting? I'm sure that the eventual target would be to have no un-handled interrupts? In any case, I'm out on a camp today and tomorrow to celebrate the new year away from the crowds of people, so I'll be back on this issue late on the 1st, or definitely on the 2nd of January. Created attachment 307431 [details]
[PATCH] PCI/bwctrl: Preserve Link Status Register on write
Perhaps we shouldn't clear an already cleared bit in the Link Status Register. Does this patch help? It's a bit of a shot in the dark.
Hi Lukas. A prosperous and happy new year to you. The patch you supplied also allows the laptop to reboot and shut down properly. What I have noticed with this patch, as well as the previous one, is that every second the external screen pauses for a very brief time. Enough to be noticed, and very bad for gaming. I notice it watching youtube videos as well. This may or may not be related to the kernel, as the system gets updates too and this might just be some transient issue. However, it is worth mentioning. I'll build vanilla kernel and test with that to see if it is the kernel or not. There is nothing mentioned in dmesg. Also: --------------------------------------------- [evert@Evert ~]$ cat /proc/irq/32/spurious count 1 unhandled 1 last_unhandled 1431356602 ms [evert@Evert ~]$ cat /proc/irq/33/spurious count 1 unhandled 1 last_unhandled 1431356602 ms [evert@Evert ~]$ ---------------------------------------------- Testing with unpatched vanilla kernel. I can see the same one second brief pause on the external monitor that is not present in on the laptop's internal panel. Not quite sure what to make of that yet. However, I do see the spurious interrupts pile up on irq 32: --------------------------------------- [evert@Evert ~]$ cat /proc/irq/32/spurious count 38 unhandled 1 last_unhandled 1431356462 ms [evert@Evert ~]$ cat /proc/irq/32/spurious count 44 unhandled 1 last_unhandled 1431356462 ms [evert@Evert ~]$ cat /proc/irq/32/spurious count 50 unhandled 1 last_unhandled 1431356462 ms [evert@Evert ~]$ cat /proc/irq/32/spurious count 68 unhandled 1 last_unhandled 1431356462 ms ---------------------------------------- One other thing that I am seeing that is VERY strange is that the laptop now reboots normally with the unpatched kernel and with the HDMI cable plugged in, which used to be my test case for not powering off or rebooting normally. OK, the world makes sense again now. I had a USB-C dock plugged in that also has an HDMI output included in it. 
(I was downloading footage after the camp, and the UHS card reader in the dock is quite fast) When unplugging this unit and rebooting, the vanilla kernel hangs on shutting down power again. So, the kernel hangs on reboot with no USB devices plugged in, but as soon as there is any USB device plugged in, it reboots fine. Normally my external keyboard and mouse are bluetooth, so quite often, there are no USB devices plugged in to the laptop, and that is when this issue shows up. I hope that helps anyone looking at this issue. Hm, just to double-check, when you tested my patch (in attachment 307431 [details]), did you have any USB devices connected that might have changed the result?
The "spurious" file in procfs is a bit of a misnomer: "count" is the number of times the interrupt has fired and "unhandled" is the number of times no interrupt handler felt responsible. So the "unhandled" number is actually the number of spurious interrupts. An increasing "count" is nothing to worry about in principle, it just means that bandwidth may have changed a couple of times, e.g. in response to increasing or decreasing PCIe traffic.
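To make that reading of the file concrete, here is a small Python sketch that parses two snapshots of /proc/irq/<N>/spurious and reports how many times the interrupt fired in between and how many of those were unhandled. The function names are made up, the two snapshots are the IRQ 32 values posted above for the unpatched kernel, and the sketch assumes the counters were not reset between the two reads.

```python
def parse_spurious(text):
    """Parse the contents of /proc/irq/<N>/spurious into a dict of integers."""
    fields = {}
    for line in text.splitlines():
        parts = line.split()
        # Lines look like "count 38" or "last_unhandled 1431356462 ms".
        if len(parts) >= 2:
            fields[parts[0]] = int(parts[1])
    return fields


def irq_delta(before_text, after_text):
    """Return (times fired, times unhandled) between two snapshots."""
    before = parse_spurious(before_text)
    after = parse_spurious(after_text)
    fired = after["count"] - before["count"]
    unhandled = after["unhandled"] - before["unhandled"]
    return fired, unhandled


FIRST = "count 38\nunhandled 1\nlast_unhandled 1431356462 ms\n"
LAST = "count 68\nunhandled 1\nlast_unhandled 1431356462 ms\n"

if __name__ == "__main__":
    fired, unhandled = irq_delta(FIRST, LAST)
    print(f"fired {fired} times, {unhandled} of them unhandled")
```

For those two snapshots the interrupt fired 30 more times and none of them went unhandled, which matches the point above: a growing "count" alone is not a sign of spurious interrupts.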
A little more information. I tested with only a USB dongle for the mouse plugged in, and the laptop still does not power down or reboot properly. This is leading me to suspect that it only reboots properly when there is something with a USB to HDMI adapter plugged in. The way that this laptop works is that the Advanced Optimus uses the dGPU to write to the framebuffer of these devices, and a different pathway for the HDMI cable that is plugged into the laptop. Maybe it will help to put some links to how the Advanced Optimus works in this thread. From the folks at nVidia themselves: https://nvidia.custhelp.com/app/answers/detail/a_id/5097/~/nvidia-advanced-optimus-overview#:~:text=Advanced%20Optimus%20allows%20dynamically%20switching,Gsync%20and%20high%20refresh%20rate. It is a very general and hand-wavey description, not meant for any technical troubleshooting. More descriptive is this Reddit post: https://www.reddit.com/r/LenovoLegion/comments/rka6d3/optimus_mux_and_advanced_optimus_explanations_for/ My laptop uses the same technology, but using an AMD iGPU rather than an Intel one, since that is what is built into the AMD CPU. I did have that laptop hub plugged in, and will re-test without it. Thanks for reminding me. Oh, before it is cleared: [evert@Evert ~]$ ddcutil detect Invalid display I2C bus: /dev/i2c-3 DRM connector: card1-eDP-1 EDID synopsis: Mfg id: BOE - BOE Model: NE173QHM-NZ2 Product code: 2921 (0x0b69) Serial number: Binary serial number: 0 (0x00000000) Manufacture year: 2022, Week: 24 This monitor does not support DDC/CI. (I2C slave address x37 is unresponsive.) Invalid display I2C bus: /dev/i2c-7 DRM connector: card1-eDP-1 EDID synopsis: Mfg id: BOE - BOE Model: NE173QHM-NZ2 Product code: 2921 (0x0b69) Serial number: Binary serial number: 0 (0x00000000) Manufacture year: 2022, Week: 24 This monitor does not support DDC/CI. (I2C slave address x37 is unresponsive.) 
Invalid display I2C bus: /dev/i2c-17 DRM connector: card0-HDMI-A-1 EDID synopsis: Mfg id: BNQ - UNK Model: BenQ EX3210U Product code: 32678 (0x7fa6) Serial number: ETW5N05026SL0 Binary serial number: 16843009 (0x01010101) Manufacture year: 2022, Week: 21 This monitor does not support DDC/CI. (I2C slave address x37 is unresponsive.) --------------------------------------------- This just shows the internal display, and the "virtual" display which is the MUX. OK, patched kernel built and I can now confirm that the bug is still present with it. In other words, my previous test with this patched kernel was wrong. To sum it up, the latest git master patched with the patch 0001-PCI-bwctrl-Preserve-Link-Status-Register-on-write.patch does not power down, or reboot with an HDMI cable attached, but it does reboot and power down the laptop when there is a USB-C HDMI or DisplayPort adapter is used. Interestingly enough, the USB-C to DisplayPort shows up as HDMI in the system. Created attachment 307439 [details]
[PATCH] PCI/bwctrl: Fix NULL pointer deref on bind and unbind
Could you give this patch a spin and see if it helps?
Hi there! The patch does seem to work on the first glance, in that the system does power off and reboot normally with it installed. Are there any tests that you would like me to run to ensure that it is working as intended?

Something new that I have not seen before is this error message in my journal, just about every second:

Jan 02 16:51:29 Evert.Strix wpa_supplicant[1020]: wlp4s0: CTRL-EVENT-SIGNAL-CHANGE above=1 signal=-34 noise=9999 txrate=270000
Jan 02 16:51:32 Evert.Strix wpa_supplicant[1020]: wlp4s0: CTRL-EVENT-SIGNAL-CHANGE above=1 signal=-31 noise=9999 txrate=270000
Jan 02 16:51:35 Evert.Strix wpa_supplicant[1020]: wlp4s0: CTRL-EVENT-SIGNAL-CHANGE above=1 signal=-31 noise=9999 txrate=135000
Jan 02 16:51:38 Evert.Strix wpa_supplicant[1020]: wlp4s0: CTRL-EVENT-SIGNAL-CHANGE above=1 signal=-30 noise=9999 txrate=135000
Jan 02 16:51:41 Evert.Strix wpa_supplicant[1020]: wlp4s0: CTRL-EVENT-SIGNAL-CHANGE above=1 signal=-30 noise=9999 txrate=135000
Jan 02 16:51:44 Evert.Strix wpa_supplicant[1020]: wlp4s0: CTRL-EVENT-SIGNAL-CHANGE above=1 signal=-32 noise=9999 txrate=135000
Jan 02 16:51:47 Evert.Strix wpa_supplicant[1020]: wlp4s0: CTRL-EVENT-SIGNAL-CHANGE above=1 signal=-20 noise=9999 txrate=135000

Could this be due to the patch?

> The patch does seem to work on the first glance, in that the system does
> power off and reboot normally with it installed.
> Are there any tests that you would like me to run to ensure that it is
> working as intended.
Did you test on a stock 6.13-rc kernel with just this patch applied and no USB-C monitor or dock attached? IIUC, you were able to reproduce reliably under these conditions, so if the issue no longer occurs with the patch, I think that's sufficient proof that this fixes the root cause.
I tested this patch against the latest git kernel, and that does seem to fix the issue.

The other issue I am seeing with wpa_supplicant also appears on 6.12.7, so that is a completely different issue. Funnily enough, it is not present on 6.6.67, so it is definitely something introduced by a kernel driver, but not this issue. I'll hunt it down sometime in the future.

As for this bug, it is present on unpatched git master and gone on the patched version.

> Jan 02 16:51:29 Evert.Strix wpa_supplicant[1020]: wlp4s0:
> CTRL-EVENT-SIGNAL-CHANGE above=1 signal=-34 noise=9999 txrate=270000

Definitely looks unrelated to the patch. Those messages are logged with MSG_INFO severity, according to this bug report:
https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=777170

I asked you earlier to add "ignore_loglevel" to the kernel command line, so if you did that and the default loglevel was, say, MSG_NOTICE, that would explain why you're now suddenly seeing those messages.

List of syslog severity levels:
https://signoz.io/guides/syslog-levels/

The hang on shutdown you were seeing is actually a kernel crash. It's dereferencing a pointer which is pointing to NULL, instead of into valid memory. But it did require a specific combination of prerequisites, hence it wasn't discovered during development and testing.

One possible explanation is that the ASMedia USB 3.2 controller gets hot-removed on shutdown if nothing is plugged in. "Hot-removed" not in the sense that it's actually removed from the mainboard, but the Root Port above it (which is hotplug-capable) sees it disappearing from the bus. The hotplug interrupt is shared with bandwidth control. If the hotplug interrupt triggers at exactly the right time when the bandwidth controller is in the middle of deregistering itself, it'll cause the crash.
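The kind of race described above can be illustrated with a small user-space sketch. This is not the actual bwctrl code or the attached patch; every name here (`struct port`, `bw_irq`, `unbind_unsafe`, `unbind_safe`) is hypothetical, and the "interrupt" is simulated by calling the handler by hand in the middle of the teardown. It only shows why the order of "remove the handler" and "clear the data pointer" matters when an interrupt line is shared.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical per-port state; these are NOT the kernel's structures. */
struct bw_data {
	unsigned int events;      /* number of events the handler processed */
};

struct port {
	struct bw_data *bw;       /* NULL while the service is not bound */
	bool irq_registered;      /* true while the handler may be invoked */
};

/*
 * Simulated shared interrupt handler.
 * Returns 1 if the event was handled, 0 if the handler is not installed,
 * and -1 where a real handler would have dereferenced a NULL pointer.
 */
static int bw_irq(struct port *p)
{
	if (!p->irq_registered)
		return 0;
	if (!p->bw)
		return -1;        /* in a kernel this would be the crash */
	p->bw->events++;
	return 1;
}

/* Teardown in the risky order: data pointer cleared before the handler is removed. */
static int unbind_unsafe(struct port *p)
{
	p->bw = NULL;
	int r = bw_irq(p);        /* shared interrupt fires in the window */
	p->irq_registered = false;
	return r;
}

/* Teardown in the safe order: handler removed first, then the pointer is cleared. */
static int unbind_safe(struct port *p)
{
	p->irq_registered = false;
	int r = bw_irq(p);        /* the same interrupt is now ignored */
	p->bw = NULL;
	return r;
}
```

With a bound port, `unbind_unsafe()` reports the would-be NULL dereference, while `unbind_safe()` lets the late interrupt pass harmlessly. Whether the real fix reorders teardown in this way or uses locking instead is not stated in this thread.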
We actually had a report of a boot hang before the Christmas holidays, which I strongly suspect was caused by the same issue:
https://lore.kernel.org/all/db8e457fcd155436449b035e8791a8241b0df400.camel@kernel.org/

We worked around the issue but didn't actually root-cause it:
https://lore.kernel.org/all/cover.1734257330.git.lukas@wunner.de/

I'll ask the reporter to test this patch as well to verify that suspicion.

Thanks so much for your patience and dedication reporting the issue and testing patches, this has prevented the issue from being visible to a wider audience. It likely also prevented the bandwidth controller from being reverted before release of v6.13.

> What I have noticed with this patch, as well as the previous one, is that
> every second the external screen pauses for a very brief time. Enough to be
> noticed, and very bad for gaming. I notice it watching youtube videos as
> well.

> I can see the same one second brief pause on the external monitor that is not
> present in on the laptop's internal panel.

This sounds like a graphics driver issue. I assume the internal panel is driven by the AMD iGPU and the external monitor is driven by the Nvidia dGPU?

AMD bugs are tracked on the freedesktop.org GitLab:
https://gitlab.freedesktop.org/drm/amd/-/issues

I note you're using the out-of-tree nvidia driver for the dGPU. Nvidia uses GitHub at least for the open source portion of their driver:
https://github.com/NVIDIA/open-gpu-kernel-modules/issues

Hi there!

Thanks for the really good description of what has gone wrong on my machine. It makes perfect sense, too.

Thank you also for your kind help. This has been a most useful and entertaining learning experience, and I look forward to using the knowledge gained here the next time I run into an easily reproduced issue.

What is the protocol in terms of closing this report? Do you do it, or do I? Do we wait for the patch to be accepted into the kernel, or is it more appropriate to close it now?
It's a good question, I don't really know. The patch I just submitted contains a "Closes:" tag which links to this bugzilla entry:
https://lore.kernel.org/r/0ee5faf5395cad8d29fb66e1ec444c8d882a4201.1735852688.git.lukas@wunner.de/

Theoretically this could be used to automatically close it when the patch gets accepted and eventually lands in Linus' tree. But I'm not sure anyone has set that up. I guess most bugzilla entries just remain open forever because nobody bothers closing them. :)

Feel free to close it once you're certain all your questions have been answered and issues resolved. Or don't. I dunno. :)

My personal process for closing bugzilla reports is to wait until the fix lands in the main upstream tree and then add the relevant https://git.kernel.org/linus/ link and the release that contains it in a closing comment.
Created attachment 307399 [details]
Output of dmidecode

Hi there!

I have recently installed a git version of the Linux kernel on my Asus ROG Strix laptop.

Hardware info:
[code]
Operating System: Arch Linux
KDE Plasma Version: 6.2.4
KDE Frameworks Version: 6.9.0
Qt Version: 6.8.1
Kernel Version: 6.12.6-arch1-1 (64-bit)
Graphics Platform: Wayland
Processors: 32 × AMD Ryzen 9 7945HX3D with Radeon Graphics
Memory: 62.0 GiB of RAM
Graphics Processor: AMD Radeon 610M
Manufacturer: ASUSTeK COMPUTER INC.
Product Name: ROG Strix G733PYV_G733PYV
System Version: 1.0
[/code]

Up to this point, I have been running a variety of kernels:
Arch's lts version: 6.6.67
Arch's normal kernel version: 6.12.6
A custom g14 kernel with patches for Asus laptops, version: 6.12.6

None of these kernels had any trouble powering off the laptop at shutdown.

The Arch mainline kernel, currently at 6.13-rc1, and the git version, currently at 6.13-rc4, both fail to power off the laptop after a shutdown command has been issued.

The journal stops after the shutdown command is executed, so I can't see any error explaining why the power off is not working.

I filed a bug report in the Arch Linux Forums, which has some more information on the journals, but ultimately we can't figure out what is going on.

Any help in troubleshooting this would be greatly appreciated.