Bug 209113

Summary: PCI-E hotplug doesn't work: Failed to check link status
Product: Drivers
Component: PCI
Reporter: Myron Stowe (myron.stowe)
Assignee: drivers_pci (drivers_pci)
Status: NEW
Severity: normal
Priority: P1
CC: zhaoxuepeng01
Hardware: Intel
OS: Linux
Kernel Version: 5.8-rc6
Subsystem:
Regression: No
Bisected commit-id:
Attachments:
  o dmesg log of v5.8-rc6
  o lspci -vvv log
  o dmesg with timestamp for RHEL7.8-success and RHEL8.2-failed
  o dmesg log of v3.10+ showing successful hot-plug event
  o dmesg log of v5.8.8 with increased timeout

Description Myron Stowe 2020-09-01 15:10:34 UTC
Created attachment 292281 [details]
dmesg log of v5.8-rc6

Relaying all this information second hand:

On a system with a Mellanox Technologies MT27800 Family [ConnectX-5] NIC controller containing a power button, hot-plug fails to function properly.

  Normal, working, scenario:
    o Press the OCP NIC's power button;
    o Power button LED blinks and turns off (delivering event message to CPU);
    o Verify NIC is offline via 'lspci';
    o Remove controller.

  Scenario with cmdline parameter 'pcie_port_pm=off':
    o Press NIC's power button;
    o LED turns off;
    o Verify NIC is offline;
    o Press power button (in an attempt to hot-add controller);
    o NIC is not recognized.

  Scenario with no cmdline parameter, or 'pcie_aspm=off', or 'pcie_aspm=off
  pcie_port_pm=off':
    o Press NIC's power button;
    o LED continuously flashes;
    o Checking via 'lspci' indicates the NIC is offline, but with the LED
      still flashing the controller cannot be removed.
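
Since the behavior differs by boot parameter, it may help to record which of the two options a given boot actually used. The following is a minimal sketch, not part of the original report: the function name and the sample command line are hypothetical, and on a live system the string would normally come from /proc/cmdline.

```python
def pcie_cmdline_options(cmdline):
    """Return the pcie_aspm / pcie_port_pm values found in a kernel command line."""
    wanted = ("pcie_aspm", "pcie_port_pm")
    found = {}
    for token in cmdline.split():
        key, sep, value = token.partition("=")
        # Only keep key=value tokens for the two parameters of interest.
        if sep and key in wanted:
            found[key] = value
    return found


if __name__ == "__main__":
    # Hypothetical sample; replace with the contents of /proc/cmdline.
    sample = "BOOT_IMAGE=/vmlinuz root=/dev/sda1 pcie_aspm=off pcie_port_pm=off"
    print(pcie_cmdline_options(sample))
```

An empty result corresponds to the "no cmdline parameter" scenario.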

The associated 'dmesg' log (which is attached in its entirety) is:

pcieport 0000:23:00.0: pciehp: Slot(2): Attention button pressed
pcieport 0000:23:00.0: pciehp: Slot(2) Powering on due to button press
pcieport 0000:23:00.0: pciehp: pciehp_set_indicators: SLOTCTRL a8 write cmd 2c0
pcieport 0000:23:00.0: pciehp: pending interrupts 0x0010 from Slot Status
pcieport 0000:23:00.0: pciehp: pciehp_check_link_active: lnk_status = 5001
pcieport 0000:23:00.0: pciehp: Slot(2): Card present
pcieport 0000:23:00.0: pciehp: pciehp_get_power_status: SLOTCTRL a8 value read 16f5
pcieport 0000:23:00.0: pciehp: pending interrupts 0x0010 from Slot Status
pcieport 0000:23:00.0: pciehp: pciehp_power_on_slot: SLOTCTRL a8 write cmd 0
pcieport 0000:23:00.0: pciehp: __pciehp_link_set: lnk_ctrl = 40
pcieport 0000:23:00.0: pciehp: pciehp_set_indicators: SLOTCTRL a8 write cmd 200
pcieport 0000:23:00.0: pciehp: pending interrupts 0x0010 from Slot Status
pcieport 0000:23:00.0: Data Link Layer Link Active not set in 1000 msec
pcieport 0000:23:00.0: pciehp: Failed to check link status
pcieport 0000:23:00.0: pciehp: pending interrupts 0x0010 from Slot Status
pcieport 0000:23:00.0: pciehp: pciehp_power_off_slot: SLOTCTRL a8 write cmd 400
pcieport 0000:23:00.0: pciehp: pciehp_set_indicators: SLOTCTRL a8 write cmd 340
pcieport 0000:23:00.0: pciehp: pending interrupts 0x0010 from Slot Status
pcieport 0000:23:00.0: pciehp: pciehp_set_indicators: SLOTCTRL a8 write cmd 300
pcieport 0000:23:00.0: pciehp: pending interrupts 0x0010 from Slot Status

Device 0000:23:00.0 is a Root Port.
Device 0000:24:00.[0,1] is the Mellanox controller which resides downstream of
the root port ('lspci -vvv' log is also attached).
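
To make the hex values in the log above easier to read, here is a small decoding sketch. It is an illustration only, not kernel code: the bit masks follow the kernel's include/uapi/linux/pci_regs.h, the function names are hypothetical, and the indicator decoder assumes that a zero field in a pciehp_set_indicators command means that indicator was not written.

```python
# Bit masks taken from the Linux kernel header include/uapi/linux/pci_regs.h.
LNKSTA_CLS = 0x000F    # Current Link Speed
LNKSTA_NLW = 0x03F0    # Negotiated Link Width
LNKSTA_LT = 0x0800     # Link Training
LNKSTA_DLLLA = 0x2000  # Data Link Layer Link Active

SLTSTA_BITS = [
    (0x0001, "Attention Button Pressed"),
    (0x0002, "Power Fault Detected"),
    (0x0004, "MRL Sensor Changed"),
    (0x0008, "Presence Detect Changed"),
    (0x0010, "Command Completed"),
    (0x0100, "Data Link Layer State Changed"),
]

# Two-bit indicator encodings; 0 is assumed to mean "field not written".
INDICATOR = {0: "not written", 1: "on", 2: "blink", 3: "off"}


def decode_lnksta(value):
    """Split a Link Status register value into the fields of interest."""
    return {
        "speed_code": value & LNKSTA_CLS,
        "width": (value & LNKSTA_NLW) >> 4,
        "training": bool(value & LNKSTA_LT),
        "dll_link_active": bool(value & LNKSTA_DLLLA),
    }


def decode_sltsta_events(value):
    """List the Slot Status event bits set in value."""
    return [name for mask, name in SLTSTA_BITS if value & mask]


def decode_indicator_cmd(cmd):
    """Decode the Attention (bits 7:6) and Power (bits 9:8) indicator fields."""
    return {
        "attention": INDICATOR[(cmd >> 6) & 0x3],
        "power": INDICATOR[(cmd >> 8) & 0x3],
    }


if __name__ == "__main__":
    print(decode_lnksta(0x5001))
    print(decode_sltsta_events(0x0010))
    for cmd in (0x2C0, 0x200, 0x340, 0x300):
        print(hex(cmd), decode_indicator_cmd(cmd))
```

Under these masks, lnk_status 0x5001 has the Data Link Layer Link Active bit clear and a negotiated width of 0, which is consistent with the later "Data Link Layer Link Active not set in 1000 msec" message. The repeated pending interrupt value 0x0010 decodes to Command Completed.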


The following is a snippet from a working scenario. Note that this is very much an apples-to-oranges comparison, as the working scenario is from a v4.18-based kernel.

localhost kernel: pciehp 0000:23:00.0:pcie004: Slot(2): Attention button pressed
localhost kernel: pciehp 0000:23:00.0:pcie004: Slot(2): Powering off due to button press
localhost kernel: infiniband mlx5_1: wait_for_async_commands:745:(pid 1150): done with all pending requests
localhost avahi-daemon[2639]: Withdrawing workstation service for ens2f1.
localhost NetworkManager[2773]: <info>  [1594075792.8387] device (ens2f1): state change: disconnected -> unmanaged (reason 'removed', sys-iface-state: 'removed')
localhost ModemManager[2695]: <info>  (net/ens2f1): released by modem /sys/devices/pci0000:23/0000:23:00.0/0000:24:00.1
localhost kernel: (0000:24:00.1): E-Switch: cleanup
localhost systemd: Unit rdma-hw.target is not needed anymore. Stopping.
localhost systemd: Unit rdma-ndd.service is not needed anymore. Stopping.
localhost systemd: Stopped target RDMA Hardware.
localhost systemd: Stopping RDMA Node Description Daemon...
localhost systemd: Stopped RDMA Node Description Daemon.
localhost kernel: infiniband mlx5_0: wait_for_async_commands:745:(pid 1150): done with all pending requests
localhost avahi-daemon[2639]: Withdrawing workstation service for ens2f0.
localhost NetworkManager[2773]: <info>  [1594075795.3042] device (ens2f0): state change: disconnected -> unmanaged (reason 'removed', sys-iface-state: 'removed')
localhost ModemManager[2695]: <info>  (net/ens2f0): released by modem /sys/devices/pci0000:23/0000:23:00.0/0000:24:00.0
localhost kernel: (0000:24:00.0): E-Switch: cleanup
localhost kernel: pciehp 0000:23:00.0:pcie004: Slot(2): Link Down
localhost kernel: pciehp 0000:23:00.0:pcie004: Slot(2): Already disabled
localhost systemd: Started Session 2 of user root.
localhost kernel: pciehp 0000:23:00.0:pcie004: Slot(2): Attention button pressed
localhost kernel: pciehp 0000:23:00.0:pcie004: Slot(2): Card present
localhost kernel: pciehp 0000:23:00.0:pcie004: Slot(2) Powering on due to button press
localhost kernel: pciehp 0000:23:00.0:pcie004: Slot(2): Link Up
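
Because the two logs use different message prefixes ("pcieport ...: pciehp:" versus "pciehp ...:pcie004:"), comparing them by eye is awkward. The sketch below shows one way to reduce either log to its per-slot event sequence; the regular expression and function names are hypothetical helpers written for this report's log lines, not an existing tool.

```python
import re

# Matches both prefixes seen in this report, for example:
#   "pcieport 0000:23:00.0: pciehp: Slot(2): Attention button pressed"
#   "localhost kernel: pciehp 0000:23:00.0:pcie004: Slot(2): Link Up"
# The colon after "Slot(N)" is optional because some messages omit it.
SLOT_MSG = re.compile(r"pciehp.*?Slot\((?P<slot>\d+)\):?\s+(?P<msg>.+)$")


def slot_events(lines):
    """Return (slot number, message) pairs for pciehp slot messages."""
    events = []
    for line in lines:
        m = SLOT_MSG.search(line)
        if m:
            events.append((int(m.group("slot")), m.group("msg").strip()))
    return events


def link_check_failed(lines):
    """True if any line carries the 'Failed to check link status' message."""
    return any("Failed to check link status" in line for line in lines)


if __name__ == "__main__":
    failing = [
        "pcieport 0000:23:00.0: pciehp: Slot(2): Attention button pressed",
        "pcieport 0000:23:00.0: pciehp: Slot(2) Powering on due to button press",
        "pcieport 0000:23:00.0: pciehp: pciehp_check_link_active: lnk_status = 5001",
        "pcieport 0000:23:00.0: pciehp: Slot(2): Card present",
        "pcieport 0000:23:00.0: pciehp: Failed to check link status",
    ]
    for event in slot_events(failing):
        print(event)
    print("link check failed:", link_check_failed(failing))
```

Running the same functions over the working log gives a sequence ending in "Link Up", which makes the point of divergence between the two kernels easier to spot.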
Comment 1 Myron Stowe 2020-09-01 15:11:33 UTC
Created attachment 292283 [details]
lspci -vvv log
Comment 2 Skylar Zhao 2020-09-11 06:55:10 UTC
Created attachment 292459 [details]
dmesg with timestamp for RHEL7.8-success and RHEL8.2-failed

I tested RHEL7.8 (based on 3.10) and RHEL8.2 (based on 4.18) and collected dmesg output with timestamps. I uploaded the files to Bugzilla.

If there is anything else I need to do, please feel free to let me know.
Comment 3 Myron Stowe 2020-09-14 16:57:58 UTC
Created attachment 292503 [details]
dmesg log of v3.10+ showing successful hot-plug event

'dmesg' log of boot with the OCP NIC -

Hot-plug events:
  o  113 seconds, button press to remove OCP NIC
  o  122 seconds, OCP NIC successfully disconnected.

  o  138 seconds, insert OCP NIC and press Attention button;
     Card Present detected and OCP NIC initialized successfully.
Comment 4 Myron Stowe 2020-09-14 17:08:35 UTC
Created attachment 292505 [details]
dmesg log of v5.8.8 with increased timeout