Bug 201001 - System with FADT ASPM disable bit set hangs at boot after '1302fcf PCI: Configure *all* devices, not just hot-added ones'
Summary: System with FADT ASPM disable bit set hangs at boot after '1302fcf PCI: Confi...
Status: NEW
Alias: None
Product: Drivers
Classification: Unclassified
Component: PCI (show other bugs)
Hardware: All Linux
: P1 normal
Assignee: drivers_pci@kernel-bugs.osdl.org
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2018-09-03 07:55 UTC by Patrick Talbert
Modified: 2018-09-20 12:55 UTC (History)
2 users (show)

See Also:
Kernel Version: 4.17.0
Subsystem:
Regression: No
Bisected commit-id:


Attachments
lspci -vvv output from affected host (94.54 KB, text/plain)
2018-09-03 07:55 UTC, Patrick Talbert
Details
PCI bus log file (19.34 KB, text/plain)
2018-09-03 07:57 UTC, Patrick Talbert
Details
analysis of log file (109.97 KB, text/plain)
2018-09-03 07:58 UTC, Patrick Talbert
Details

Description Patrick Talbert 2018-09-03 07:55:37 UTC
Created attachment 278259 [details]
lspci -vvv output from affected host

This bug is filed as a follow-up to the linux-pci mailing list thread found here:

-------------------------------------------------------------

https://www.spinics.net/lists/linux-pci/msg75104.html

Subject: Should pcie_aspm_init_link_state() run when FADT ASPM disable bit is set?
From: Patrick Talbert <ptalbert@xxxxxxxxxx>
Date: Thu, 9 Aug 2018 09:25:20 +0200

-------------------------------------------------------------


A RHEL 7 system was found to hang at boot after a kernel update, but only when a custom proprietary PCIE card was installed in the system. The problem was bisected down to the following commit:

1302fcf PCI: Configure *all* devices, not just hot-added ones
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=1302fcf0d03e6ea74846c7fee14736306ab2ce4b


It was further confirmed that the hang still occurred using an upstream 4.17.0 kernel.


The customer who reports this issue to RH has the ability to record PCI bus activity. They provided this logging and I was able to correspond all the bus activity and dmesg output directly to pci_scan_slot() activity as it descends into pci_scan_device() and initializes the card. So with this we could confirm everything appears to work as expected at least until pci_scan_slot() gets to:

2448         /* Only one slot has PCIe device */
2449         if (bus->self && nr)
2450                 pcie_aspm_init_link_state(bus->self);
2451 
2452         return nr;
2453 }
2454 EXPORT_SYMBOL(pci_scan_slot);

... which is where the dmesg output hangs and the PCI debugger output stopped.


This pointed me to a possible issue with ASPM, at which point I noticed the system logs the following at boot ...

[   18.347488] ACPI FADT declares the system doesn't support PCIe ASPM, so disable it


... and we then found the system boots normally with pcie_aspm=off set.


So that brings me to the question I posed in the linux-pci mailing list email: Should pcie_aspm_init_link_state() run when FADT ASPM disable bit is set? It *seems* like there should be a change like the following as as when acpi_pci_init() sees the ACPI_FADT_NO_ASPM bit set it calls pcie_no_aspm() which sets aspm_disabled (but not !aspm_support_enabled):

diff --git a/drivers/pci/pcie/aspm.c b/drivers/pci/pcie/aspm.c
index 8b56dad..5491dbf 100644
--- a/drivers/pci/pcie/aspm.c
+++ b/drivers/pci/pcie/aspm.c
@@ -574,7 +574,7 @@ void pcie_aspm_init_link_state(struct pci_dev *pdev)
        struct pcie_link_state *link;
        int blacklist = !!pcie_aspm_sanity_check(pdev);

-       if (!aspm_support_enabled)
+       if (!aspm_support_enabled || aspm_disabled)
                return;

        if (pdev->link_state)
Comment 1 Patrick Talbert 2018-09-03 07:57:23 UTC
Created attachment 278261 [details]
PCI bus log file
Comment 2 Patrick Talbert 2018-09-03 07:58:06 UTC
Created attachment 278263 [details]
analysis of log file
Comment 3 Zhang Rui 2018-09-19 06:59:47 UTC
reassign to the pci experts.
Plus, as the patch at https://patchwork.kernel.org/patch/10588355/ has been queued for 4.20, I think we can close this, right?
Comment 4 Patrick Talbert 2018-09-20 09:14:41 UTC
Hi Zhang,

I do not know what the formal process for kernel.org BZs is, but I agree it seems reasonable to close the BZ if a patch which addresses the issue is already accepted.


Thank you,

Patrick
Comment 5 Bjorn Helgaas 2018-09-20 12:55:54 UTC
I don't know if there's a formal process.  I like to include a URL to the commit in Linus' tree when I close things, which means I would wait until the patch gets merged during the v4.20 merge window.  Until that happens, the PCI tree is somewhat ephemeral; I sometimes update or drop patches if we discover problems before the merge window.

Note You need to log in before you can comment on or make changes to this bug.