Bug 198141 - Oops on shutdown/reboot
Summary: Oops on shutdown/reboot
Status: RESOLVED CODE_FIX
Alias: None
Product: Drivers
Classification: Unclassified
Component: PCI (show other bugs)
Hardware: All Linux
: P1 normal
Assignee: drivers_pci@kernel-bugs.osdl.org
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2017-12-11 19:50 UTC by Chris Clayton
Modified: 2018-01-20 09:53 UTC (History)
3 users (show)

See Also:
Kernel Version: 4.15-rc[123]
Subsystem:
Regression: No
Bisected commit-id:


Attachments
dmesg from 4.15.0-rc3 (45.18 KB, text/plain)
2017-12-11 19:59 UTC, Chris Clayton
Details
output from lspci -vv (32.51 KB, text/plain)
2017-12-11 19:59 UTC, Chris Clayton
Details

Description Chris Clayton 2017-12-11 19:50:54 UTC

    
Comment 1 Chris Clayton 2017-12-11 19:57:07 UTC
I've been getting an oops when shutting down my laptop (with /sbin/halt) or rebooting it (/sbin/reboot or /usr/sbin/kexec). Unfortunately, I can't provide the backtrace because it is on the screen for only a moment before the system shuts down/reboots.

Bisected to:

cc27b735ad3a75574a6ab1a66ed6b09385e77e5e is the first bad commit
commit cc27b735ad3a75574a6ab1a66ed6b09385e77e5e
Author: Sinan Kaya <okaya@codeaurora.org>
Date:   Wed Oct 25 15:01:02 2017 -0400

    PCI/portdrv: Turn off PCIe services during shutdown
    
    Some of the PCIe services such as AER are being left enabled during
    shutdown. This might cause spurious AER errors while SOC is being powered
    down.
    
    Clean up the PCIe services gracefully during shutdown to clear these false
    positives.
    
    Signed-off-by: Sinan Kaya <okaya@codeaurora.org>
    Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>

:040000 040000 5a827d6956c581344a0bf392e30155c337673c1d 76c6a39b53604a0a0a370383c3503f80aa7cbc1e M	drivers

Reversing this change on -rc3 removes problem
Comment 2 Chris Clayton 2017-12-11 19:59:19 UTC
Created attachment 261113 [details]
dmesg from 4.15.0-rc3
Comment 3 Chris Clayton 2017-12-11 19:59:53 UTC
Created attachment 261115 [details]
output from lspci -vv
Comment 4 Sinan Kaya 2017-12-13 00:15:16 UTC
I'm trying to see if we can reproduce by writing to the remove file in each pci directory for 

echo 1 > /sys/bus/pci/devices/<pciid>/remove

00:01.0 PCI bridge: Intel Corporation Xeon E3-1200 v3/4th Gen Core Processor PCI Express x16 Controller (rev 06) (prog-if 00 [Normal decode])
	Kernel driver in use: pcieport
	
00:1c.0 PCI bridge: Intel Corporation 8 Series/C220 Series Chipset Family PCI Express Root Port #1 (rev d5) (prog-if 00 [Normal decode])	
	Kernel driver in use: pcieport
	
00:1c.1 PCI bridge: Intel Corporation 8 Series/C220 Series Chipset Family PCI Express Root Port #2 (rev d5) (prog-if 00 [Normal decode])	
	Kernel driver in use: pcieport
	
00:1c.2 PCI bridge: Intel Corporation 8 Series/C220 Series Chipset Family PCI Express Root Port #3 (rev d5) (prog-if 00 [Normal decode])
	Kernel driver in use: pcieport
	
00:1c.3 PCI bridge: Intel Corporation 8 Series/C220 Series Chipset Family PCI Express Root Port #4 (rev d5) (prog-if 00 [Normal decode])
	Kernel driver in use: pcieport

Can you test this and let us know if you can capture any calltrace?
Comment 5 Chris Clayton 2017-12-13 11:20:04 UTC
On 13/12/17 00:15, bugzilla-daemon@bugzilla.kernel.org wrote:
> https://bugzilla.kernel.org/show_bug.cgi?id=198141
> 
> --- Comment #4 from Sinan Kaya (okaya@codeaurora.org) ---
> I'm trying to see if we can reproduce by writing to the remove file in each
> pci
> directory for 
> 
> echo 1 > /sys/bus/pci/devices/<pciid>/remove
> 
> 00:01.0 PCI bridge: Intel Corporation Xeon E3-1200 v3/4th Gen Core Processor
> PCI Express x16 Controller (rev 06) (prog-if 00 [Normal decode])
>         Kernel driver in use: pcieport
> 
> 00:1c.0 PCI bridge: Intel Corporation 8 Series/C220 Series Chipset Family PCI
> Express Root Port #1 (rev d5) (prog-if 00 [Normal decode])        
>         Kernel driver in use: pcieport
> 
> 00:1c.1 PCI bridge: Intel Corporation 8 Series/C220 Series Chipset Family PCI
> Express Root Port #2 (rev d5) (prog-if 00 [Normal decode])        
>         Kernel driver in use: pcieport
> 
> 00:1c.2 PCI bridge: Intel Corporation 8 Series/C220 Series Chipset Family PCI
> Express Root Port #3 (rev d5) (prog-if 00 [Normal decode])
>         Kernel driver in use: pcieport
> 
> 00:1c.3 PCI bridge: Intel Corporation 8 Series/C220 Series Chipset Family PCI
> Express Root Port #4 (rev d5) (prog-if 00 [Normal decode])
>         Kernel driver in use: pcieport
> 
> Can you test this and let us know if you can capture any calltrace?

I've tried that and in each case the device is removed successfully and no calltrace is produced. Sorry!
Comment 6 Chris Clayton 2017-12-13 12:23:18 UTC
I've got a calltrace by videoing the laptop rebooting and taking a screenshot moment the trace is on the screen. Here are the highlights:

remove_proc_entry: removing non-empty directory 'irq/17', leaking at least 'rtsx_pci'
WARNING: CPU: 0 PID: 1578 at fs/proc/generic.c:572 remove_proc_entry+0x11d/0x130
Modules linked in <long list but none that are out-of-tree>
...
Call Trace:
unregister_irq_proc
free_desc
irq_free_descs
mp_unmap_irq
acpi_unregister_gsi_apic
acpi_pci_irq_disable
do_pci_disable_device
pci_disable_device
device_shutdown
kernel_restart
Sys_reboot
...

If you need more, I can send you the image.
Comment 7 Sinan Kaya 2017-12-14 16:13:02 UTC
It looks like the kernel is complaining (warn) about a resource leak. It is not a real crash. It is a good thing.

Need to consult Rafael
Comment 8 Chris Clayton 2017-12-21 10:36:39 UTC
On 14/12/17 16:13, bugzilla-daemon@bugzilla.kernel.org wrote:
> https://bugzilla.kernel.org/show_bug.cgi?id=198141
> 
> --- Comment #7 from Sinan Kaya (okaya@codeaurora.org) ---
> It looks like the kernel is complaining (warn) about a resource leak. It is
> not
> a real crash. It is a good thing.
> 
> Need to consult Rafael

Is therany progress on this, please?
>
Comment 9 Sinan Kaya 2017-12-21 14:33:42 UTC
Not at this moment due to holidays. I talked to Rafael in the meantime. He didn't have any immediate clues.
Comment 10 Sinan Kaya 2018-01-02 14:02:22 UTC
Back from holidays. Can you please try this patch?

--- a/drivers/misc/cardreader/rtsx_pcr.c
+++ b/drivers/misc/cardreader/rtsx_pcr.c
@@ -1664,6 +1664,7 @@ static void rtsx_pci_shutdown(struct pci_dev *pcidev)
        rtsx_pci_power_off(pcr, HOST_ENTER_S1);

        pci_disable_device(pcidev);
+       free_irq(pcr->irq, (void *)pcr);
 }

As far as I see, rtsx driver has a shutdown callback but it is not releasing the IRQ. This is causing the shutdown callback for pcie port driver to issue w a warning as it is cleaning up a resource that is in use.
Comment 11 Chris Clayton 2018-01-02 20:22:15 UTC
On 02/01/18 14:02, bugzilla-daemon@bugzilla.kernel.org wrote:
> https://bugzilla.kernel.org/show_bug.cgi?id=198141
> 
> --- Comment #10 from Sinan Kaya (okaya@codeaurora.org) ---
> Back from holidays. Can you please try this patch?
> 
> --- a/drivers/misc/cardreader/rtsx_pcr.c
> +++ b/drivers/misc/cardreader/rtsx_pcr.c
> @@ -1664,6 +1664,7 @@ static void rtsx_pci_shutdown(struct pci_dev *pcidev)
>         rtsx_pci_power_off(pcr, HOST_ENTER_S1);
> 
>         pci_disable_device(pcidev);
> +       free_irq(pcr->irq, (void *)pcr);
>  }
> 
> As far as I see, rtsx driver has a shutdown callback but it is not releasing
> the IRQ. This is causing the shutdown callback for pcie port driver to issue
> w
> a warning as it is cleaning up a resource that is in use.

Your analysis seems good because the warning is not produced with this patch applied. Thanks.

Tested-By: Chris Clayton <chris2553@googlemail.com>

>
Comment 12 Sinan Kaya 2018-01-02 20:24:52 UTC
thanks for testing, let me post this.
Comment 13 Chris Clayton 2018-01-02 20:28:19 UTC
On 02/01/18 20:22, Chris Clayton wrote:
> 
> 
> On 02/01/18 14:02, bugzilla-daemon@bugzilla.kernel.org wrote:
>> https://bugzilla.kernel.org/show_bug.cgi?id=198141
>>
>> --- Comment #10 from Sinan Kaya (okaya@codeaurora.org) ---
>> Back from holidays. Can you please try this patch?
>>
>> --- a/drivers/misc/cardreader/rtsx_pcr.c
>> +++ b/drivers/misc/cardreader/rtsx_pcr.c
>> @@ -1664,6 +1664,7 @@ static void rtsx_pci_shutdown(struct pci_dev *pcidev)
>>         rtsx_pci_power_off(pcr, HOST_ENTER_S1);
>>
>>         pci_disable_device(pcidev);
>> +       free_irq(pcr->irq, (void *)pcr);
>>  }
>>
>> As far as I see, rtsx driver has a shutdown callback but it is not releasing
>> the IRQ. This is causing the shutdown callback for pcie port driver to issue
>> w
>> a warning as it is cleaning up a resource that is in use.
> 
> Your analysis seems good because the warning is not produced with this patch
> applied. Thanks.
> 
> Tested-By: Chris Clayton <chris2553@googlemail.com>
> 

Sorry, I forgot to say that the patch  above doesn't apply because rtsx_pcr.c is located in drivers/mfd in the current
development tree. Obviously, I adjusted the patch accordingly.

>>
Comment 14 Sinan Kaya 2018-01-02 20:44:23 UTC
thanks for the heads up.
Comment 15 Sinan Kaya 2018-01-03 12:35:19 UTC
Can you please test v2 and provide your tested by if it works?

I added an MSI disable call for completeness. I don't have the HW to prove that it works.
Comment 16 Chris Clayton 2018-01-03 16:00:03 UTC
On 03/01/18 12:35, bugzilla-daemon@bugzilla.kernel.org wrote:
> https://bugzilla.kernel.org/show_bug.cgi?id=198141
> 
> --- Comment #15 from Sinan Kaya (okaya@codeaurora.org) ---
> Can you please test v2 and provide your tested by if it works?
> 
> I added an MSI disable call for completeness. I don't have the HW to prove
> that
> it works.

I've applied v2 of the patch and built and the kernel (-rc6). All I can say, is that my system still closes down without
the warning and call trace that the unpatched kernel produces. It's the best I can do by way of a test because I have no
idea what the code added in v2 is supposed to achieve and, because my system shuts down (or reboots) moments later,
there is no opportunity to check. If that constitutes a valid test:

Tested-by: Chris Clayton <chris2553@googlemail.com>

>
Comment 17 Sinan Kaya 2018-01-03 16:25:38 UTC
Can you post your tested-by to the list?
Comment 18 Chris Clayton 2018-01-20 09:53:21 UTC
Now fixed, so closing.

Note You need to log in before you can comment on or make changes to this bug.