Bug 209283 - pcie hotplug doesn't work with kernel 4.19
Summary: pcie hotplug doesn't work with kernel 4.19
Status: NEW
Alias: None
Product: Drivers
Classification: Unclassified
Component: PCI
Hardware: All Linux
Importance: P1 normal
Assignee: drivers_pci@kernel-bugs.osdl.org
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2020-09-16 05:57 UTC by Jack Wang
Modified: 2020-09-30 08:55 UTC
CC List: 2 users

See Also:
Kernel Version: 4.19.133
Subsystem:
Regression: No
Bisected commit-id:


Attachments
lspci -vvv output (141.78 KB, text/plain)
2020-09-16 05:57 UTC, Jack Wang
full dmesg on 5.4.62 (92.98 KB, text/plain)
2020-09-16 08:56 UTC, Jack Wang
lspci on 4.19.133 (139.89 KB, text/plain)
2020-09-17 14:02 UTC, Jack Wang
full dmesg during hotplug (89.00 KB, text/plain)
2020-09-17 14:04 UTC, Jack Wang

Description Jack Wang 2020-09-16 05:57:16 UTC
Created attachment 292513 [details]
lspci -vvv output

We are testing PCIe NVMe SSD hotplug. It works out of the box with kernel 5.4.62;
dmesg during the hotplug:

[  605.734513] pcieport 0000:16:00.0: pciehp: Slot(0-3): Link Down
[  605.734516] pcieport 0000:16:00.0: pciehp: Slot(0-3): Card not present
[  605.842634] blk_update_request: I/O error, dev nvme0n1, sector 205976576 op 0x1:(WRITE) flags 0x0 phys_seg 112 prio class 0
[  608.908764] pcieport 0000:16:00.0: pciehp: Timeout on hotplug command 0x15e1 (issued 3030 msec ago)
[  609.988759] pcieport 0000:16:00.0: pciehp: Timeout on hotplug command 0x15e1 (issued 4110 msec ago)
[  683.218554] pcieport 0000:16:00.0: pciehp: Slot(0-3): Card present
[  683.218555] pcieport 0000:16:00.0: pciehp: Slot(0-3): Link Up
[  683.271702] pcieport 0000:16:00.0: pciehp: Timeout on hotplug command 0x17e1 (issued 73280 msec ago)
[  686.301874] pcieport 0000:16:00.0: pciehp: Timeout on hotplug command 0x13e1 (issued 3030 msec ago)
[  686.361894] pcieport 0000:16:00.0: pciehp: Timeout on hotplug command 0x13e1 (issued 3090 msec ago)
[  686.521911] pci 0000:17:00.0: [1b96:2400] type 00 class 0x010802
[  686.521924] pci 0000:17:00.0: reg 0x10: [mem 0x00000000-0x00007fff 64bit]
[  686.521934] pci 0000:17:00.0: reg 0x20: [mem 0x00000000-0x00000fff 64bit pref]
[  686.521937] pci 0000:17:00.0: reg 0x30: [mem 0x00000000-0x0001ffff pref]
[  686.521941] pci 0000:17:00.0: enabling Extended Tags
[  686.522045] pci 0000:17:00.0: BAR 6: assigned [mem 0xc5e00000-0xc5e1ffff pref]
[  686.522046] pci 0000:17:00.0: BAR 0: assigned [mem 0xc5e20000-0xc5e27fff 64bit]
[  686.522051] pci 0000:17:00.0: BAR 4: assigned [mem 0x387ffff00000-0x387ffff00fff 64bit pref]
[  686.522055] pcieport 0000:16:00.0: PCI bridge to [bus 17]
[  686.522057] pcieport 0000:16:00.0:   bridge window [io  0x4000-0x4fff]
[  686.522059] pcieport 0000:16:00.0:   bridge window [mem 0xc5e00000-0xc5efffff]
[  686.522060] pcieport 0000:16:00.0:   bridge window [mem 0x387ffff00000-0x387fffffffff 64bit pref]
[  686.522302] nvme nvme2: pci function 0000:17:00.0
[  686.522355] nvme 0000:17:00.0: enabling device (0100 -> 0102)
[  689.072008] pcieport 0000:16:00.0: pciehp: Timeout on hotplug command 0x12e1 (issued 2710 msec ago)
[  690.373707] nvme nvme2: 40/0/0 default/read/poll queues

But with kernel 4.19.133, the pcieport core doesn't print anything. Is
there a known problem with PCIe hotplug support in kernel 4.19? Do we
need to backport some fixes from a newer kernel to make it work?

In both kernel 4.19.133 and kernel 5.4.62, the following configs are enabled:

CONFIG_HOTPLUG_PCI=y
CONFIG_HOTPLUG_PCI_ACPI=y
CONFIG_HOTPLUG_PCI_ACPI_IBM=m
CONFIG_HOTPLUG_PCI_CPCI=y
CONFIG_HOTPLUG_PCI_CPCI_ZT5550=m
CONFIG_HOTPLUG_PCI_CPCI_GENERIC=m
CONFIG_HOTPLUG_PCI_SHPC=y
CONFIG_HOTPLUG_PCI_PCIE=y
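
For reference, the value in each "Timeout on hotplug command 0x..." line of the 5.4.62 dmesg output above appears to be the Slot Control register value that pciehp last wrote. That is an assumption based on reading the pciehp driver, not something stated in this report. Under that assumption, a minimal Python sketch for decoding those values could look like this. The bit layout follows the PCIe Slot Control register definition; the function and dictionary names are made up for illustration.

```python
# Event-enable bits of the PCIe Slot Control register.
# Short names are illustrative; masks match Linux's PCI_EXP_SLTCTL_* macros.
SLTCTL_EVENT_ENABLES = {
    0x0001: "AttnBtn",     # Attention Button Pressed Enable
    0x0002: "PwrFault",    # Power Fault Detected Enable
    0x0004: "MRL",         # MRL Sensor Changed Enable
    0x0008: "PresDet",     # Presence Detect Changed Enable
    0x0010: "CmdCplt",     # Command Completed Interrupt Enable
    0x0020: "HotPlugIrq",  # Hot-Plug Interrupt Enable
    0x1000: "LinkChg",     # Data Link Layer State Changed Enable
}

# Encoding of the two-bit indicator control fields.
INDICATOR = {0: "reserved", 1: "on", 2: "blink", 3: "off"}


def decode_sltctl(value):
    """Decode a 16-bit Slot Control value into readable fields."""
    return {
        "enabled_events": [
            name for mask, name in SLTCTL_EVENT_ENABLES.items() if value & mask
        ],
        # Bits 7:6 are the Attention Indicator Control field.
        "attention_indicator": INDICATOR[(value >> 6) & 0x3],
        # Bits 9:8 are the Power Indicator Control field.
        "power_indicator": INDICATOR[(value >> 8) & 0x3],
        # Bit 10 is Power Controller Control: 1 means power off.
        "power_controller": "off" if value & 0x0400 else "on",
    }


if __name__ == "__main__":
    for cmd in (0x15e1, 0x17e1, 0x13e1, 0x12e1):
        print(hex(cmd), decode_sltctl(cmd))
```

With the values from the log, 0x15e1 decodes to slot power off with the power indicator still on, and 0x12e1 decodes to slot power on with a blinking power indicator.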
Comment 1 Jack Wang 2020-09-16 08:56:55 UTC
Created attachment 292517 [details]
full dmesg on 5.4.62
Comment 2 Lukas Wunner 2020-09-16 09:40:55 UTC
The two hotplug ports are Skylake-E Root Ports. There are a bunch of oddities here:

* Hotplug port claims NoCompl- (i.e. has Command Completed support) but apparently never sets the CC bit.

* Hotplug port claims to have an Attention Button but there's no indication in the dmesg output that you pressed a button on insertion/removal.

* Hotplug port claims not to support surprise removal but dmesg output suggests that's what you're doing.

I can't find a spec for Skylake-E, so I'm not sure whether the Slot Capabilities bits are always wrong on Skylake-E. Perhaps the bits are just configured incorrectly by the BIOS, so you may want to ask Rausch/Supermicro whether a BIOS update is available to fix this.
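
The flags discussed above (Attention Button, surprise removal, NoCompl) are individual bits of the Slot Capabilities register, which lspci prints as "SltCap:". A minimal Python sketch of that decoding follows. The masks follow the PCIe Slot Capabilities register definition, the function names are made up for illustration, and the example value is hypothetical: it was chosen to match the flags described in this comment and was not read from the attached lspci output.

```python
# Selected bits of the PCIe Slot Capabilities register.
# Short names mirror the flags lspci prints; masks match Linux's PCI_EXP_SLTCAP_* macros.
SLTCAP_BITS = {
    "AttnBtn": 0x00000001,    # Attention Button Present
    "PwrCtrl": 0x00000002,    # Power Controller Present
    "MRL": 0x00000004,        # MRL Sensor Present
    "AttnInd": 0x00000008,    # Attention Indicator Present
    "PwrInd": 0x00000010,     # Power Indicator Present
    "Surprise": 0x00000020,   # Hot-Plug Surprise
    "HotPlug": 0x00000040,    # Hot-Plug Capable
    "Interlock": 0x00020000,  # Electromechanical Interlock Present
    "NoCompl": 0x00040000,    # No Command Completed Support
}


def decode_sltcap(value):
    """Map each capability flag name to True/False for a 32-bit SltCap value."""
    return {name: bool(value & mask) for name, mask in SLTCAP_BITS.items()}


def lspci_style(value):
    """Render the flags the way lspci does, e.g. 'AttnBtn+ PwrCtrl-'."""
    return " ".join(
        f"{name}{'+' if is_set else '-'}"
        for name, is_set in decode_sltcap(value).items()
    )


if __name__ == "__main__":
    # Hypothetical value: Attention Button present, Hot-Plug Capable,
    # no surprise removal, Command Completed supported (NoCompl clear).
    example = 0x00000041
    print(lspci_style(example))
```

A port whose register really had NoCompl clear would be expected to set the Command Completed bit after each Slot Control write, which is the mismatch the timeouts in the dmesg output point to.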
Comment 3 Jack Wang 2020-09-16 13:15:02 UTC
Thanks Lukas for checking. 

I checked with our colleague who does the hotplug in the DC. It looks like our SSD disk tray doesn't support the "Attention Button" when hotplugging the disk; per my understanding, the "Attention Button" is part of the disk tray. He just opened the disk tray and pulled out/plugged in the SSD during the test; he didn't open the rack.

We will check whether there is a newer BIOS with fixes in this regard.
Comment 4 Jack Wang 2020-09-17 14:02:54 UTC
Created attachment 292527 [details]
lspci on 4.19.133
Comment 5 Jack Wang 2020-09-17 14:04:57 UTC
Created attachment 292529 [details]
full dmesg during hotplug

During the hot-removal, there is only one message about an ACPI hotplug event, but it seems the kernel doesn't react to it, and lspci shows the device still present in the system the whole time.
Comment 6 Lukas Wunner 2020-09-23 12:44:36 UTC
Linux negotiates with the BIOS which PCI features are controlled by BIOS and which by Linux. On 5.4.62, the BIOS lets Linux control PCIeHotplug whereas on 4.19.133 it does not.

5.4.62:
ACPI: PCI Root Bridge [PC01] (domain 0000 [bus 16-63])
acpi PNP0A08:01: _OSC: OS supports [ExtendedConfig ASPM ClockPM Segments MSI HPX-Type3]
acpi PNP0A08:01: _OSC: platform does not support [SHPCHotplug AER LTR]
acpi PNP0A08:01: _OSC: OS now controls [PCIeHotplug PME PCIeCapability]


4.19.133:
ACPI: PCI Root Bridge [PC01] (domain 0000 [bus 16-63])
acpi PNP0A08:01: _OSC: OS supports [ExtendedConfig Segments MSI]
acpi PNP0A08:01: _OSC: not requesting OS control; OS requires [ExtendedConfig ASPM ClockPM MSI]

Try to amend negotiate_os_control() in drivers/acpi/pci_root.c with some debug printk's to understand why PCIeHotplug control is not handed over to the OS. Maybe it's broken in 4.19.133 or there's a problem with your BIOS (e.g. faulty ACPI tables).
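
As a starting point, the two 4.19.133 lines quoted above can be compared directly: the kernel reports what it supports and what it requires before it will request control. A minimal Python sketch of that comparison follows. The helper name is made up for illustration, and the log lines are copied from this comment.

```python
import re


def osc_features(line):
    """Return the set of feature names inside the [...] of an _OSC dmesg line."""
    match = re.search(r"\[([^\]]*)\]", line)
    return set(match.group(1).split()) if match else set()


# Lines copied from the 4.19.133 dmesg quoted in this comment.
supported = osc_features(
    "acpi PNP0A08:01: _OSC: OS supports [ExtendedConfig Segments MSI]"
)
required = osc_features(
    "acpi PNP0A08:01: _OSC: not requesting OS control; "
    "OS requires [ExtendedConfig ASPM ClockPM MSI]"
)

# Features the kernel says it needs but does not advertise as supported.
missing = sorted(required - supported)
print(missing)
```

This prints ['ASPM', 'ClockPM']. One possible explanation, which is an assumption and not confirmed anywhere in this report, is that ASPM support is disabled on the 4.19.133 system (for example through the pcie_aspm=off boot parameter or a kernel built without CONFIG_PCIEASPM), so that Linux never asks the BIOS for PCIeHotplug control at all.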
Comment 7 Jack Wang 2020-09-30 08:55:40 UTC
Thanks for all the hints, Lukas. I will try to get a local machine for testing; debugging with remote hands is too time-consuming.
