Bug 209113 - PCI-E hotplug doesn't work: Failed to check link status
Status: NEW
Alias: None
Product: Drivers
Classification: Unclassified
Component: PCI
Hardware: Intel Linux
Importance: P1 normal
Assignee: drivers_pci@kernel-bugs.osdl.org
Reported: 2020-09-01 15:10 UTC by Myron Stowe
Modified: 2020-09-14 17:08 UTC
Kernel Version: 5.8-rc6
Regression: No

Attachments
dmesg log of v5.8-rc6 (413.22 KB, text/plain)
2020-09-01 15:10 UTC, Myron Stowe
lspci -vvv log (510.45 KB, text/plain)
2020-09-01 15:11 UTC, Myron Stowe
dmesg with timestamp for RHEL7.8-success and RHEL8.2-failed (104.89 KB, application/zip)
2020-09-11 06:55 UTC, Skylar Zhao
dmesg log of v3.10+ showing successful hot-plug event (338.10 KB, text/plain)
2020-09-14 16:57 UTC, Myron Stowe
dmesg log of v5.8.8 with increased timeout (328.19 KB, text/plain)
2020-09-14 17:08 UTC, Myron Stowe

Description Myron Stowe 2020-09-01 15:10:34 UTC
Created attachment 292281 [details]
dmesg log of v5.8-rc6

Relaying all this information second hand:

On a system with a Mellanox Technologies MT27800 Family [ConnectX-5] NIC controller containing a power button, hot-plug fails to function properly.

  Normal, working, scenario:
    o Press the OCP NIC's power button;
    o Power button LED blinks and turns off (delivering event message to CPU);
    o Verify NIC is offline via 'lspci';
    o Remove controller.

  Scenario with cmdline parameter 'pcie_port_pm=off':
    o Press NIC's power button;
    o LED turns off;
    o Verify NIC is offline;
    o Press power button (in an attempt to hot-add controller);
    o NIC is not recognized.

  Scenario with no cmdline parameter, or 'pcie_aspm=off', or 'pcie_aspm=off
  pcie_port_pm=off':
    o Press NIC's power button;
    o LED continuously flashes;
    o 'lspci' shows the NIC is offline, but while the LED is flashing the
      controller cannot be removed.

The associated 'dmesg' log (which is attached in its entirety) is:

pcieport 0000:23:00.0: pciehp: Slot(2): Attention button pressed
pcieport 0000:23:00.0: pciehp: Slot(2) Powering on due to button press
pcieport 0000:23:00.0: pciehp: pciehp_set_indicators: SLOTCTRL a8 write cmd 2c0
pcieport 0000:23:00.0: pciehp: pending interrupts 0x0010 from Slot Status
pcieport 0000:23:00.0: pciehp: pciehp_check_link_active: lnk_status = 5001
pcieport 0000:23:00.0: pciehp: Slot(2): Card present
pcieport 0000:23:00.0: pciehp: pciehp_get_power_status: SLOTCTRL a8 value read 16f5
pcieport 0000:23:00.0: pciehp: pending interrupts 0x0010 from Slot Status
pcieport 0000:23:00.0: pciehp: pciehp_power_on_slot: SLOTCTRL a8 write cmd 0
pcieport 0000:23:00.0: pciehp: __pciehp_link_set: lnk_ctrl = 40
pcieport 0000:23:00.0: pciehp: pciehp_set_indicators: SLOTCTRL a8 write cmd 200
pcieport 0000:23:00.0: pciehp: pending interrupts 0x0010 from Slot Status
pcieport 0000:23:00.0: Data Link Layer Link Active not set in 1000 msec
pcieport 0000:23:00.0: pciehp: Failed to check link status
pcieport 0000:23:00.0: pciehp: pending interrupts 0x0010 from Slot Status
pcieport 0000:23:00.0: pciehp: pciehp_power_off_slot: SLOTCTRL a8 write cmd 400
pcieport 0000:23:00.0: pciehp: pciehp_set_indicators: SLOTCTRL a8 write cmd 340
pcieport 0000:23:00.0: pciehp: pending interrupts 0x0010 from Slot Status
pcieport 0000:23:00.0: pciehp: pciehp_set_indicators: SLOTCTRL a8 write cmd 300
pcieport 0000:23:00.0: pciehp: pending interrupts 0x0010 from Slot Status
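
For reference, the 'lnk_status = 5001' value logged by pciehp_check_link_active above can be decoded by hand. The following is a minimal Python sketch, not kernel code. The masks mirror the PCI_EXP_LNKSTA_* definitions in the kernel's pci_regs.h; decode_lnksta() and its field names are hypothetical helpers introduced only for illustration.

```python
# Field masks for the PCI Express Link Status register
# (values mirror the PCI_EXP_LNKSTA_* constants in Linux's pci_regs.h).
LNKSTA_CLS = 0x000F    # Current Link Speed (encoded)
LNKSTA_NLW = 0x03F0    # Negotiated Link Width
LNKSTA_LT = 0x0800     # Link Training
LNKSTA_SLC = 0x1000    # Slot Clock Configuration
LNKSTA_DLLLA = 0x2000  # Data Link Layer Link Active
LNKSTA_LBMS = 0x4000   # Link Bandwidth Management Status
LNKSTA_LABS = 0x8000   # Link Autonomous Bandwidth Status


def decode_lnksta(value):
    """Split a raw Link Status value into named fields (hypothetical helper)."""
    return {
        "speed_code": value & LNKSTA_CLS,
        "width": (value & LNKSTA_NLW) >> 4,
        "training": bool(value & LNKSTA_LT),
        "slot_clock": bool(value & LNKSTA_SLC),
        "dll_link_active": bool(value & LNKSTA_DLLLA),
        "bw_mgmt_status": bool(value & LNKSTA_LBMS),
        "autonomous_bw_status": bool(value & LNKSTA_LABS),
    }


# Value taken from the "lnk_status = 5001" line in the log.
fields = decode_lnksta(0x5001)
for name, val in fields.items():
    print(f"{name}: {val}")
```

Under these masks, 0x5001 has Data Link Layer Link Active (bit 13) clear and a negotiated width of 0, so the link was not up at the time of that read. That matches the later "Data Link Layer Link Active not set in 1000 msec" message.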

Device 0000:23:00.0 is a Root Port.
Device 0000:24:00.[0,1] is the Mellanox controller which resides downstream of
the root port ('lspci -vvv' log is also attached).
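
The 'SLOTCTRL a8 write cmd' values and the 'pending interrupts 0x0010' value that the Root Port logs above can also be decoded by hand. A minimal Python sketch, not kernel code, follows. The masks mirror the PCI_EXP_SLTCTL_* and PCI_EXP_SLTSTA_* definitions in pci_regs.h; the helper names are hypothetical. One assumption to flag: pciehp logs only the command value and not the write mask, so an indicator field of 0 is treated here as "not written".

```python
# Slot Control fields (values mirror PCI_EXP_SLTCTL_* in Linux's pci_regs.h).
SLTCTL_AIC = 0x00C0  # Attention Indicator Control
SLTCTL_PIC = 0x0300  # Power Indicator Control
SLTCTL_PCC = 0x0400  # Power Controller Control (set = power off)

# Indicator encodings. 0 is ASSUMED to mean "field not written",
# because the log shows the command value but not the write mask.
INDICATOR = {0: "not written", 1: "on", 2: "blink", 3: "off"}

# Slot Status event bits (values mirror PCI_EXP_SLTSTA_*).
SLTSTA_EVENTS = {
    0x0001: "Attention Button Pressed",
    0x0002: "Power Fault Detected",
    0x0004: "MRL Sensor Changed",
    0x0008: "Presence Detect Changed",
    0x0010: "Command Completed",
    0x0100: "Data Link Layer State Changed",
}


def decode_sltctl_cmd(cmd):
    """Decode the indicator and power-controller fields of a logged command."""
    return {
        "attention": INDICATOR[(cmd & SLTCTL_AIC) >> 6],
        "power": INDICATOR[(cmd & SLTCTL_PIC) >> 8],
        "power_controller_off": bool(cmd & SLTCTL_PCC),
    }


def decode_sltsta_events(status):
    """List the event bits set in a 'pending interrupts' value."""
    return [name for bit, name in SLTSTA_EVENTS.items() if status & bit]


# Command values in the order they appear in the log.
for cmd in (0x2C0, 0x0, 0x200, 0x400, 0x340, 0x300):
    print(f"cmd {cmd:#05x}: {decode_sltctl_cmd(cmd)}")

print("pending 0x0010:", decode_sltsta_events(0x0010))
```

Read this way, 2c0 sets the power indicator to blink with the attention indicator off, 'write cmd 0' from pciehp_power_on_slot leaves the Power Controller Control bit clear (power on), 400 sets that bit (power off), 340 turns the power indicator off and the attention indicator on, and each 'pending interrupts 0x0010' is a Command Completed event.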


The following is a snippet from a working scenario. Note that this is very much an apples-to-oranges comparison, as the working scenario is from a v4.18-based kernel.

localhost kernel: pciehp 0000:23:00.0:pcie004: Slot(2): Attention button pressed
localhost kernel: pciehp 0000:23:00.0:pcie004: Slot(2): Powering off due to button press
localhost kernel: infiniband mlx5_1: wait_for_async_commands:745:(pid 1150): done with all pending requests
localhost avahi-daemon[2639]: Withdrawing workstation service for ens2f1.
localhost NetworkManager[2773]: <info>  [1594075792.8387] device (ens2f1): state change: disconnected -> unmanaged (reason 'removed', sys-iface-state: 'removed')
localhost ModemManager[2695]: <info>  (net/ens2f1): released by modem /sys/devices/pci0000:23/0000:23:00.0/0000:24:00.1
localhost kernel: (0000:24:00.1): E-Switch: cleanup
localhost systemd: Unit rdma-hw.target is not needed anymore. Stopping.
localhost systemd: Unit rdma-ndd.service is not needed anymore. Stopping.
localhost systemd: Stopped target RDMA Hardware.
localhost systemd: Stopping RDMA Node Description Daemon...
localhost systemd: Stopped RDMA Node Description Daemon.
localhost kernel: infiniband mlx5_0: wait_for_async_commands:745:(pid 1150): done with all pending requests
localhost avahi-daemon[2639]: Withdrawing workstation service for ens2f0.
localhost NetworkManager[2773]: <info>  [1594075795.3042] device (ens2f0): state change: disconnected -> unmanaged (reason 'removed', sys-iface-state: 'removed')
localhost ModemManager[2695]: <info>  (net/ens2f0): released by modem /sys/devices/pci0000:23/0000:23:00.0/0000:24:00.0
localhost kernel: (0000:24:00.0): E-Switch: cleanup
localhost kernel: pciehp 0000:23:00.0:pcie004: Slot(2): Link Down
localhost kernel: pciehp 0000:23:00.0:pcie004: Slot(2): Already disabled
localhost systemd: Started Session 2 of user root.
localhost kernel: pciehp 0000:23:00.0:pcie004: Slot(2): Attention button pressed
localhost kernel: pciehp 0000:23:00.0:pcie004: Slot(2): Card present
localhost kernel: pciehp 0000:23:00.0:pcie004: Slot(2) Powering on due to button press
localhost kernel: pciehp 0000:23:00.0:pcie004: Slot(2): Link Up
Comment 1 Myron Stowe 2020-09-01 15:11:33 UTC
Created attachment 292283 [details]
lspci -vvv log
Comment 2 Skylar Zhao 2020-09-11 06:55:10 UTC
Created attachment 292459 [details]
dmesg with timestamp for RHEL7.8-success and RHEL8.2-failed

I tested RHEL 7.8 (based on 3.10) and RHEL 8.2 (based on 4.18) and collected dmesg logs with timestamps. I have uploaded the files to this Bugzilla entry.

If there is anything else I need to do, please feel free to let me know.
Comment 3 Myron Stowe 2020-09-14 16:57:58 UTC
Created attachment 292503 [details]
dmesg log of v3.10+ showing successful hot-plug event

'dmesg' log of a boot with the OCP NIC.

Hot-plug events:
  o  113 seconds, button pressed to remove the OCP NIC
  o  122 seconds, OCP NIC successfully disconnected.

  o  138 seconds, OCP NIC inserted and Attention button pressed;
     Card Present was detected and the OCP NIC initialized successfully.
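
A timeline like the one above can be pulled out of a timestamped dmesg log mechanically. The sketch below is only an illustration: it assumes the default "[seconds.micros]" dmesg prefix, the function names are hypothetical, and the SAMPLE text is synthetic (message wording copied from this report, timestamps are placeholders, not measured values).

```python
import re

# Assumes each line starts with the default dmesg prefix "[  secs.usecs] "
# and that hot-plug messages contain both "pciehp" and "Slot(N)".
LINE_RE = re.compile(r"^\[\s*(\d+\.\d+)\]\s+(.*pciehp.*Slot\(\d+\).*)$")


def pciehp_events(log_text):
    """Return (timestamp, message) pairs for pciehp slot messages."""
    events = []
    for line in log_text.splitlines():
        match = LINE_RE.match(line)
        if match:
            events.append((float(match.group(1)), match.group(2)))
    return events


def elapsed(events, start_marker, end_marker):
    """Seconds from the first start_marker line to the next end_marker line.

    Returns None if either marker is not found in that order.
    """
    start = None
    for timestamp, message in events:
        if start is None and start_marker in message:
            start = timestamp
        elif start is not None and end_marker in message:
            return timestamp - start
    return None


# SYNTHETIC sample: placeholder timestamps, not taken from the attachments.
SAMPLE = """\
[   10.000000] pciehp 0000:23:00.0:pcie004: Slot(2): Attention button pressed
[   10.000100] pciehp 0000:23:00.0:pcie004: Slot(2): Card present
[   15.250000] pciehp 0000:23:00.0:pcie004: Slot(2): Link Up
[   16.000000] some unrelated kernel message
"""

events = pciehp_events(SAMPLE)
delay = elapsed(events, "Attention button pressed", "Link Up")
print(f"{len(events)} pciehp events, button press to Link Up: {delay} s")
```

Run against the attached logs, the same two markers would give the button-press-to-Link-Up delay, which can then be compared with the 1000 msec wait reported in the failing case.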
Comment 4 Myron Stowe 2020-09-14 17:08:35 UTC
Created attachment 292505 [details]
dmesg log of v5.8.8 with increased timeout
