I've been getting an oops when shutting down my laptop (with /sbin/halt) or rebooting it (/sbin/reboot or /usr/sbin/kexec). Unfortunately, I can't provide the backtrace because it is on the screen for only a moment before the system shuts down/reboots. Bisected to: cc27b735ad3a75574a6ab1a66ed6b09385e77e5e is the first bad commit commit cc27b735ad3a75574a6ab1a66ed6b09385e77e5e Author: Sinan Kaya <okaya@codeaurora.org> Date: Wed Oct 25 15:01:02 2017 -0400 PCI/portdrv: Turn off PCIe services during shutdown Some of the PCIe services such as AER are being left enabled during shutdown. This might cause spurious AER errors while SOC is being powered down. Clean up the PCIe services gracefully during shutdown to clear these false positives. Signed-off-by: Sinan Kaya <okaya@codeaurora.org> Signed-off-by: Bjorn Helgaas <bhelgaas@google.com> :040000 040000 5a827d6956c581344a0bf392e30155c337673c1d 76c6a39b53604a0a0a370383c3503f80aa7cbc1e M drivers Reversing this change on -rc3 removes problem
Created attachment 261113 [details] dmesg from 4.15.0-rc3
Created attachment 261115 [details] output from lspci -vv
I'm trying to see if we can reproduce by writing to the remove file in each pci directory for echo 1 > /sys/bus/pci/devices/<pciid>/remove 00:01.0 PCI bridge: Intel Corporation Xeon E3-1200 v3/4th Gen Core Processor PCI Express x16 Controller (rev 06) (prog-if 00 [Normal decode]) Kernel driver in use: pcieport 00:1c.0 PCI bridge: Intel Corporation 8 Series/C220 Series Chipset Family PCI Express Root Port #1 (rev d5) (prog-if 00 [Normal decode]) Kernel driver in use: pcieport 00:1c.1 PCI bridge: Intel Corporation 8 Series/C220 Series Chipset Family PCI Express Root Port #2 (rev d5) (prog-if 00 [Normal decode]) Kernel driver in use: pcieport 00:1c.2 PCI bridge: Intel Corporation 8 Series/C220 Series Chipset Family PCI Express Root Port #3 (rev d5) (prog-if 00 [Normal decode]) Kernel driver in use: pcieport 00:1c.3 PCI bridge: Intel Corporation 8 Series/C220 Series Chipset Family PCI Express Root Port #4 (rev d5) (prog-if 00 [Normal decode]) Kernel driver in use: pcieport Can you test this and let us know if you can capture any calltrace?
On 13/12/17 00:15, bugzilla-daemon@bugzilla.kernel.org wrote: > https://bugzilla.kernel.org/show_bug.cgi?id=198141 > > --- Comment #4 from Sinan Kaya (okaya@codeaurora.org) --- > I'm trying to see if we can reproduce by writing to the remove file in each > pci > directory for > > echo 1 > /sys/bus/pci/devices/<pciid>/remove > > 00:01.0 PCI bridge: Intel Corporation Xeon E3-1200 v3/4th Gen Core Processor > PCI Express x16 Controller (rev 06) (prog-if 00 [Normal decode]) > Kernel driver in use: pcieport > > 00:1c.0 PCI bridge: Intel Corporation 8 Series/C220 Series Chipset Family PCI > Express Root Port #1 (rev d5) (prog-if 00 [Normal decode]) > Kernel driver in use: pcieport > > 00:1c.1 PCI bridge: Intel Corporation 8 Series/C220 Series Chipset Family PCI > Express Root Port #2 (rev d5) (prog-if 00 [Normal decode]) > Kernel driver in use: pcieport > > 00:1c.2 PCI bridge: Intel Corporation 8 Series/C220 Series Chipset Family PCI > Express Root Port #3 (rev d5) (prog-if 00 [Normal decode]) > Kernel driver in use: pcieport > > 00:1c.3 PCI bridge: Intel Corporation 8 Series/C220 Series Chipset Family PCI > Express Root Port #4 (rev d5) (prog-if 00 [Normal decode]) > Kernel driver in use: pcieport > > Can you test this and let us know if you can capture any calltrace? I've tried that and in each case the device is removed successfully and no calltrace is produced. Sorry!
I've got a calltrace by videoing the laptop rebooting and taking a screenshot moment the trace is on the screen. Here are the highlights: remove_proc_entry: removing non-empty directory 'irq/17', leaking at least 'rtsx_pci' WARNING: CPU: 0 PID: 1578 at fs/proc/generic.c:572 remove_proc_entry+0x11d/0x130 Modules linked in <long list but none that are out-of-tree> ... Call Trace: unregister_irq_proc free_desc irq_free_descs mp_unmap_irq acpi_unregister_gsi_apic acpi_pci_irq_disable do_pci_disable_device pci_disable_device device_shutdown kernel_restart Sys_reboot ... If you need more, I can send you the image.
It looks like the kernel is complaining (warn) about a resource leak. It is not a real crash. It is a good thing. Need to consult Rafael
On 14/12/17 16:13, bugzilla-daemon@bugzilla.kernel.org wrote: > https://bugzilla.kernel.org/show_bug.cgi?id=198141 > > --- Comment #7 from Sinan Kaya (okaya@codeaurora.org) --- > It looks like the kernel is complaining (warn) about a resource leak. It is > not > a real crash. It is a good thing. > > Need to consult Rafael Is therany progress on this, please? >
Not at this moment due to holidays. I talked to Rafael in the meantime. He didn't have any immediate clues.
Back from holidays. Can you please try this patch? --- a/drivers/misc/cardreader/rtsx_pcr.c +++ b/drivers/misc/cardreader/rtsx_pcr.c @@ -1664,6 +1664,7 @@ static void rtsx_pci_shutdown(struct pci_dev *pcidev) rtsx_pci_power_off(pcr, HOST_ENTER_S1); pci_disable_device(pcidev); + free_irq(pcr->irq, (void *)pcr); } As far as I see, rtsx driver has a shutdown callback but it is not releasing the IRQ. This is causing the shutdown callback for pcie port driver to issue w a warning as it is cleaning up a resource that is in use.
On 02/01/18 14:02, bugzilla-daemon@bugzilla.kernel.org wrote: > https://bugzilla.kernel.org/show_bug.cgi?id=198141 > > --- Comment #10 from Sinan Kaya (okaya@codeaurora.org) --- > Back from holidays. Can you please try this patch? > > --- a/drivers/misc/cardreader/rtsx_pcr.c > +++ b/drivers/misc/cardreader/rtsx_pcr.c > @@ -1664,6 +1664,7 @@ static void rtsx_pci_shutdown(struct pci_dev *pcidev) > rtsx_pci_power_off(pcr, HOST_ENTER_S1); > > pci_disable_device(pcidev); > + free_irq(pcr->irq, (void *)pcr); > } > > As far as I see, rtsx driver has a shutdown callback but it is not releasing > the IRQ. This is causing the shutdown callback for pcie port driver to issue > w > a warning as it is cleaning up a resource that is in use. Your analysis seems good because the warning is not produced with this patch applied. Thanks. Tested-By: Chris Clayton <chris2553@googlemail.com> >
thanks for testing, let me post this.
On 02/01/18 20:22, Chris Clayton wrote: > > > On 02/01/18 14:02, bugzilla-daemon@bugzilla.kernel.org wrote: >> https://bugzilla.kernel.org/show_bug.cgi?id=198141 >> >> --- Comment #10 from Sinan Kaya (okaya@codeaurora.org) --- >> Back from holidays. Can you please try this patch? >> >> --- a/drivers/misc/cardreader/rtsx_pcr.c >> +++ b/drivers/misc/cardreader/rtsx_pcr.c >> @@ -1664,6 +1664,7 @@ static void rtsx_pci_shutdown(struct pci_dev *pcidev) >> rtsx_pci_power_off(pcr, HOST_ENTER_S1); >> >> pci_disable_device(pcidev); >> + free_irq(pcr->irq, (void *)pcr); >> } >> >> As far as I see, rtsx driver has a shutdown callback but it is not releasing >> the IRQ. This is causing the shutdown callback for pcie port driver to issue >> w >> a warning as it is cleaning up a resource that is in use. > > Your analysis seems good because the warning is not produced with this patch > applied. Thanks. > > Tested-By: Chris Clayton <chris2553@googlemail.com> > Sorry, I forgot to say that the patch above doesn't apply because rtsx_pcr.c is located in drivers/mfd in the current development tree. Obviously, I adjusted the patch accordingly. >>
thanks for the heads up.
Can you please test v2 and provide your tested by if it works? I added an MSI disable call for completeness. I don't have the HW to prove that it works.
On 03/01/18 12:35, bugzilla-daemon@bugzilla.kernel.org wrote: > https://bugzilla.kernel.org/show_bug.cgi?id=198141 > > --- Comment #15 from Sinan Kaya (okaya@codeaurora.org) --- > Can you please test v2 and provide your tested by if it works? > > I added an MSI disable call for completeness. I don't have the HW to prove > that > it works. I've applied v2 of the patch and built and the kernel (-rc6). All I can say, is that my system still closes down without the warning and call trace that the unpatched kernel produces. It's the best I can do by way of a test because I have no idea what the code added in v2 is supposed to achieve and, because my system shuts down (or reboots) moments later, there is no opportunity to check. If that constitutes a valid test: Tested-by: Chris Clayton <chris2553@googlemail.com> >
Can you post your tested-by to the list?
Now fixed, so closing.