Bug 187731 - Null pointer dereference during re-enumeration of PCI bus in alloc_pcie_link_state even though ASPM is off
Summary: Null pointer dereference during re-enumeration of PCI bus in alloc_pcie_link_...
Status: NEEDINFO
Alias: None
Product: Drivers
Classification: Unclassified
Component: PCI (show other bugs)
Hardware: x86-64 Linux
: P1 high
Assignee: drivers_pci@kernel-bugs.osdl.org
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2016-11-15 07:59 UTC by Jan
Modified: 2017-01-27 22:50 UTC (History)
1 user (show)

See Also:
Kernel Version: 4.4.0
Tree: Mainline
Regression: Yes


Attachments
Patch that fixed the null pointer dereference (595 bytes, text/plain)
2016-11-15 07:59 UTC, Jan
Details
dmesg output (109.82 KB, text/plain)
2016-12-30 07:37 UTC, Jan
Details
lspci -vv as root (42.82 KB, text/plain)
2016-12-30 07:37 UTC, Jan
Details

Description Jan 2016-11-15 07:59:36 UTC
Created attachment 244571 [details]
Patch that fixed the null pointer dereference

Hi,

I'm using an IBM x3850 8664 need to reenumerate PCI bus, as the system fails to alloc BAR memory during initial boot for the attached PCIe device. Therefore, I remove the corresponding PCIe root bus via sysfs (echo 1 > /sys/bus/pci/devices/0000\:19\:00.0/remove) and then re-enumerate it via sysfs (echo 1 > /sys/bus/pci/rescan). This has worked for my previous 3.10 kernel. On a 4.4.0 Kernel, this causes a null pointer dereference in alloc_pcie_link_state.

Interestingly, PCIe ASPM is off. The actual null pointer dereference happens in pcie_aspm_sanity_check in the for loop. I'm not quite sure why this function is called in the first place, as ASPM is off.

The corresponding code does the sanity check and then exists alloc_pcie_link_state because only then it is checked if ASPM is enabled or not.

The null pointer dereference can be fixed by simply doing the sanity check after the check whether ASPM is enabled or not (patch is attached).


Here is an lspci -t and lspci for your information.
lspci -t:
-+-[0000:19]---00.0-[1a-1d]----00.0-[1b-1d]--+-02.0-[1c]--+-00.0
 |                                           |            \-00.1
 |                                           \-04.0-[1d]--+-00.0
 |                                                        \-00.1
 +-[0000:14]---00.0-[15-18]--
 +-[0000:0f]---00.0-[10-13]--
 +-[0000:0a]---00.0-[0b-0e]--
 +-[0000:06]---00.0
 +-[0000:02]---00.0
 +-[0000:01]-+-00.0
 |           +-01.0
 |           +-01.1
 |           \-02.0
 \-[0000:00]-+-00.0
             +-01.0
             +-03.0
             +-03.1
             +-03.2
             +-0f.0
             +-0f.1
             \-0f.3

lspci:
00:00.0 Host bridge: IBM Calgary PCI-X Host Bridge (rev 04)
00:01.0 VGA compatible controller: Advanced Micro Devices, Inc. [AMD/ATI] RV100 [Radeon 7000 / Radeon VE]
00:03.0 USB controller: NEC Corporation OHCI USB Controller (rev 43)
00:03.1 USB controller: NEC Corporation OHCI USB Controller (rev 43)
00:03.2 USB controller: NEC Corporation uPD72010x USB 2.0 Controller (rev 04)
00:0f.0 Host bridge: Broadcom CSB6 South Bridge (rev a0)
00:0f.1 IDE interface: Broadcom CSB6 RAID/IDE Controller (rev a0)
00:0f.3 ISA bridge: Broadcom GCLE-2 Host Bridge
01:00.0 Host bridge: IBM Calgary PCI-X Host Bridge (rev 04)
01:01.0 Ethernet controller: Broadcom Corporation NetXtreme BCM5704 Gigabit Ethernet (rev 10)
01:01.1 Ethernet controller: Broadcom Corporation NetXtreme BCM5704 Gigabit Ethernet (rev 10)
01:02.0 RAID bus controller: Adaptec AAC-RAID (rev 02)
02:00.0 Host bridge: IBM Calgary PCI-X Host Bridge (rev 04)
06:00.0 Host bridge: IBM Calgary PCI-X Host Bridge (rev 04)
0a:00.0 PCI bridge: IBM CalIOC2 PCI-E Root Port (rev 01)
0f:00.0 PCI bridge: IBM CalIOC2 PCI-E Root Port (rev 01)
14:00.0 PCI bridge: IBM CalIOC2 PCI-E Root Port (rev 01)
19:00.0 PCI bridge: IBM CalIOC2 PCI-E Root Port (rev 01)
1a:00.0 PCI bridge: Integrated Device Technology, Inc. [IDT] PES12N3A PCI Express Switch (rev 0e)
1b:02.0 PCI bridge: Integrated Device Technology, Inc. [IDT] PES12N3A PCI Express Switch (rev 0e)
1b:04.0 PCI bridge: Integrated Device Technology, Inc. [IDT] PES12N3A PCI Express Switch (rev 0e)
1c:00.0 Ethernet controller: Intel Corporation 82576 Gigabit Network Connection (rev 01)
1c:00.1 Ethernet controller: Intel Corporation 82576 Gigabit Network Connection (rev 01)
1d:00.0 Ethernet controller: Intel Corporation 82576 Gigabit Network Connection (rev 01)
1d:00.1 Ethernet controller: Intel Corporation 82576 Gigabit Network Connection (rev 01)

Thank you
 Jan
Comment 1 Bjorn Helgaas 2016-12-29 18:35:54 UTC
Hi Jan, thanks for the report!  I'm marking this as a regression, since it worked in v3.10 and is broken in v4.4.

I think your patch makes sense.  Would you mind attaching a complete dmesg log as well as "lspci -vv" output (as root)?  I want to explore the question of why we even get here if ASPM is off.  I also want to see the details of the resource allocation issue, because it would be nice if Linux could take care of that automatically.
Comment 2 Jan 2016-12-30 07:37:25 UTC
Created attachment 249231 [details]
dmesg output
Comment 3 Jan 2016-12-30 07:37:46 UTC
Created attachment 249241 [details]
lspci -vv as root
Comment 4 Jan 2016-12-30 07:38:20 UTC
Hi Bjorn,

thanks for responding. I attached lspci -vv as root as well as dmesg output from the last reboot, please note that the machine is currently running a 4.4.0-47-generic Ubuntu kernel with the patch I attached in the original bug report and the rescan I talked about has already taken place at around second 18 in the boot process.
Comment 5 Bjorn Helgaas 2017-01-02 19:49:45 UTC
Hi Jan, I think your patch is correct.  When I first looked at it, I thought it was in alloc_pcie_link_state(), and I wondered why we would even get there.  But I had just misread the patch.

Can you please post it to linux-pci@vger.kernel.org with a changelog and Signed-off-by?  See https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/tree/Documentation/process/submitting-patches.rst
Comment 6 Jan 2017-01-03 08:05:32 UTC
Hi Bjorn,
thanks again for your feedback, I submitted the patch to the mailing list!

Should I close this issue? Or should it remain open until the patch was actually pulled into the tree?
Comment 7 Bjorn Helgaas 2017-01-03 14:12:39 UTC
Thanks, I saw the patch on the list.  Let's keep this open until the fix actually gets merged into Linus' tree.  Then we can close the issue with a link to the commit and a note about which release contains the fix.
Comment 8 Bjorn Helgaas 2017-01-24 20:34:51 UTC
[Already posted to linux-pci discussion, adding here for completeness]

I'm pretty sure the null pointer would be pdev->subordinate, and your
patch certainly avoids dereferencing that as long as you boot with
"pcie_aspm=off".  But I think if you boot with your patch but without
"pcie_aspm=off", we'll probably dereference that null pointer again.

The only way I can see to have a null pdev->subordinate pointer is via
SR-IOV VF devices, and you should have some of those.  But I think we
should see the messages from pci_setup_device() for them, and I don't
see them in the dmesg log in the bugzilla, so I'm still confused about
this.

Why are you using "pcie_aspm=off"?  If it's due to Linux ASPM bugs, I
want to fix those, too.  If it's due to hardware problems on your
platform, maybe we can add some sort of quirk to do this
automatically.
Comment 9 Bjorn Helgaas 2017-01-27 22:50:13 UTC
Ping?  Can you try booting with the patch (https://patchwork.ozlabs.org/patch/710415/) but without "pcie_aspm=off"?

Note You need to log in before you can comment on or make changes to this bug.