Bug 193411

Summary: ASPM NULL pointer dereference with PCIe reverse bridge
Product: Drivers
Component: PCI
Reporter: Bjorn Helgaas (bjorn)
Assignee: drivers_pci (drivers_pci)
CC: hjl.tools
Status: NEW
Severity: normal
Priority: P1
Hardware: All
OS: Linux
URL: https://bugzilla.opensuse.org/show_bug.cgi?id=1022181
Kernel Version: v4.9-rc5
Regression: Yes
Attachments:
  dmesg log
  lspci -vv output
  disassembly of pcie_aspm_init_link_state()
  Test patch

Description Bjorn Helgaas 2017-01-27 23:05:33 UTC
Created attachment 253321
dmesg log

This is a duplicate of https://bugzilla.opensuse.org/show_bug.cgi?id=1022181
I'm opening this kernel.org bugzilla as a permanent public archive.

lists@ssl-mail.com reported a NULL pointer dereference panic on a Gigabyte D525TUD system running a SUSE kernel containing 51ebfc92b72b ("PCI: Enumerate switches below PCI-to-PCIe bridges"). From the dmesg log:

  Linux version 4.9.6-1.gd1207ac-default (geeko@buildhost) (gcc version 6.2.1 20161209 [gcc-6-branch revision 243481] (SUSE Linux) ) #1 SMP PREEMPT Thu Jan 26 09:09:16 UTC 2017 (d1207ac)
  ...
  pci 0000:00:1e.0: [8086:2448] type 01 class 0x060401
  pci 0000:00:1e.0: PCI bridge to [bus 03-04] (subtractive decode)
  pci 0000:03:00.0: [10b5:8112] type 01 class 0x060400
  pci 0000:03:00.0: disabling ASPM on pre-1.1 PCIe device
  pci 0000:04:00.0: [10de:10c3] type 00 class 0x030000
  pci 0000:04:00.1: [10de:0be3] type 00 class 0x040300
  ...
  BUG: unable to handle kernel NULL pointer dereference at 0000000000000090
  IP: [<ffffffff9e424350>] pcie_aspm_init_link_state+0x170/0x820

From the "lspci -vv" output:

  00:1e.0: Intel 82801 Mobile PCI Bridge (conventional PCI)
  03:00.0: PLX PEX8112 x1 Lane PCI Express-to-PCI Bridge (configured as PCI-to-PCIe bridge)
  04:00.0: Nvidia GT218 [GeForce 8400 GS Rev. 3] (PCIe)
  04:00.1: Nvidia High Definition Audio Controller (PCIe)
Comment 1 Bjorn Helgaas 2017-01-27 23:07:20 UTC
This worked in an older SUSE kernel, 4.9.5-3.1.g9bb1a8a-default, so this is a regression.  Submitter confirmed that reverting 51ebfc92b72b ("PCI: Enumerate switches below PCI-to-PCIe bridges") fixed it.
Comment 2 Bjorn Helgaas 2017-01-27 23:07:57 UTC
Created attachment 253331
lspci -vv output
Comment 3 Bjorn Helgaas 2017-01-27 23:16:12 UTC
Created attachment 253341
disassembly of pcie_aspm_init_link_state()

Analysis of the disassembly shows that the NULL pointer dereference comes from the pdev->bus->parent->self->link_state chain in the following code:

  static struct pcie_link_state *alloc_pcie_link_state(struct pci_dev *pdev)
  {
        struct pcie_link_state *link;

        link = kzalloc(sizeof(*link), GFP_KERNEL);
        if (!link)
                return NULL;
        INIT_LIST_HEAD(&link->sibling);
        INIT_LIST_HEAD(&link->children);
        INIT_LIST_HEAD(&link->link);
        link->pdev = pdev;
        if (pci_pcie_type(pdev) != PCI_EXP_TYPE_ROOT_PORT) {
                struct pcie_link_state *parent;
                parent = pdev->bus->parent->self->link_state;

In this case, pdev is 03:00.0, the PCI-to-PCIe bridge. pdev->bus is bus 03, so pdev->bus->parent should be bus 00, and pdev->bus->parent->self (the bridge leading to bus 00) is probably NULL. Even if ->self is not NULL, its ->link_state is certainly NULL, because everything above 03:00.0 is conventional PCI, not PCIe.
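To make the pointer chain concrete, here is a small userspace sketch that models the reported topology. The structs are hypothetical, heavily simplified stand-ins for the kernel's struct pci_bus and struct pci_dev (only the fields the failing expression touches), and the model assumes, as the analysis above suggests, that bus 00 is a root bus whose ->self is NULL. first_null_step() names the first NULL link in pdev->bus->parent->self->link_state, and parent_link_state() is an illustrative guard only; it is not the attached test patch.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical, simplified stand-ins for the kernel structs: only the
 * fields used by pdev->bus->parent->self->link_state are modeled. */
struct link_state { int unused; };
struct pci_dev;

struct pci_bus {
	struct pci_bus *parent;  /* upstream bus; NULL for a root bus */
	struct pci_dev *self;    /* bridge leading to this bus; NULL for a root bus */
};

struct pci_dev {
	const char *name;
	struct pci_bus *bus;            /* bus the device sits on */
	struct link_state *link_state;  /* NULL for conventional PCI bridges */
};

/* Topology from the report (assumed): bus 00 (root) -> 00:1e.0
 * (conventional PCI bridge) -> bus 03 -> 03:00.0 (PCI-to-PCIe bridge). */
static struct pci_bus bus00 = { NULL, NULL };
static struct pci_dev dev_00_1e_0 = { "00:1e.0", &bus00, NULL };
static struct pci_bus bus03 = { &bus00, &dev_00_1e_0 };
static struct pci_dev dev_03_00_0 = { "03:00.0", &bus03, NULL };

/* Return the first NULL step of the chain, or NULL if every pointer,
 * including ->link_state, is non-NULL. */
static const char *first_null_step(const struct pci_dev *pdev)
{
	if (!pdev->bus->parent)
		return "pdev->bus->parent";
	if (!pdev->bus->parent->self)
		return "pdev->bus->parent->self";
	if (!pdev->bus->parent->self->link_state)
		return "pdev->bus->parent->self->link_state";
	return NULL;
}

/* Illustrative guard: return NULL instead of dereferencing a NULL
 * pointer, leaving the caller to decide how to handle a device with
 * no PCIe parent link (not the attached test patch). */
static struct link_state *parent_link_state(const struct pci_dev *pdev)
{
	if (!pdev->bus->parent || !pdev->bus->parent->self)
		return NULL;
	return pdev->bus->parent->self->link_state;
}

static void report(const struct pci_dev *pdev)
{
	const char *step = first_null_step(pdev);

	if (step)
		printf("%s: %s is NULL\n", pdev->name, step);
	else
		printf("%s: parent link_state found\n", pdev->name);
}
```

Calling report(&dev_03_00_0) from a main() prints "03:00.0: pdev->bus->parent->self is NULL", so under this model the unguarded kernel expression would fault at that step, before ->link_state is even reached.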
Comment 4 Bjorn Helgaas 2017-01-27 23:20:05 UTC
Created attachment 253351
Test patch

Test patch.  This applies to v4.10-rc5.
Comment 5 H.J. Lu 2017-01-31 22:48:50 UTC
(In reply to Bjorn Helgaas from comment #4)
> Created attachment 253351
> Test patch
> 
> Test patch.  This applies to v4.10-rc5.

This fixed the kernel 4.9.6 panic on an Intel S5520SC.