Created attachment 292281 [details] dmesg log of v5.8-rc6

Relaying all this information second hand:

On a system with a Mellanox Technologies MT27800 Family [ConnectX-5] NIC controller containing a power button, hot-plug fails to function properly.

Normal, working, scenario:
  o Press the OCP NIC's power button;
  o Power button LED blinks and turns off (delivering event message to CPU);
  o Verify NIC is offline via 'lspci';
  o Remove controller.

Scenario with cmdline parameter 'pcie_port_pm=off':
  o Press NIC's power button;
  o LED turns off;
  o Verify NIC is offline;
  o Press power button (in an attempt to hot-add controller);
  o NIC is not recognized.

Scenario with no cmdline parameter, or 'pcie_aspm=off', or 'pcie_aspm=off pcie_port_pm=off':
  o Press NIC's power button;
  o LED continuously flashes;
  o Checking via 'lspci' indicates NIC is offline, but with the LED flashing the controller cannot be removed.

The associated 'dmesg' log (which is attached in its entirety) is:

  pcieport 0000:23:00.0: pciehp: Slot(2): Attention button pressed
  pcieport 0000:23:00.0: pciehp: Slot(2) Powering on due to button press
  pcieport 0000:23:00.0: pciehp: pciehp_set_indicators: SLOTCTRL a8 write cmd 2c0
  pcieport 0000:23:00.0: pciehp: pending interrupts 0x0010 from Slot Status
  pcieport 0000:23:00.0: pciehp: pciehp_check_link_active: lnk_status = 5001
  pcieport 0000:23:00.0: pciehp: Slot(2): Card present
  pcieport 0000:23:00.0: pciehp: pciehp_get_power_status: SLOTCTRL a8 value read 16f5
  pcieport 0000:23:00.0: pciehp: pending interrupts 0x0010 from Slot Status
  pcieport 0000:23:00.0: pciehp: pciehp_power_on_slot: SLOTCTRL a8 write cmd 0
  pcieport 0000:23:00.0: pciehp: __pciehp_link_set: lnk_ctrl = 40
  pcieport 0000:23:00.0: pciehp: pciehp_set_indicators: SLOTCTRL a8 write cmd 200
  pcieport 0000:23:00.0: pciehp: pending interrupts 0x0010 from Slot Status
  pcieport 0000:23:00.0: Data Link Layer Link Active not set in 1000 msec
  pcieport 0000:23:00.0: pciehp: Failed to check link status
  pcieport 0000:23:00.0: pciehp: pending interrupts 0x0010 from Slot Status
  pcieport 0000:23:00.0: pciehp: pciehp_power_off_slot: SLOTCTRL a8 write cmd 400
  pcieport 0000:23:00.0: pciehp: pciehp_set_indicators: SLOTCTRL a8 write cmd 340
  pcieport 0000:23:00.0: pciehp: pending interrupts 0x0010 from Slot Status
  pcieport 0000:23:00.0: pciehp: pciehp_set_indicators: SLOTCTRL a8 write cmd 300
  pcieport 0000:23:00.0: pciehp: pending interrupts 0x0010 from Slot Status

Device 0000:23:00.0 is a Root Port. Device 0000:24:00.[0,1] is the Mellanox controller which resides downstream of the root port ('lspci -vvv' log is also attached).

The following is a snippet from a working scenario, but note that this is very much an apples/oranges comparison, as the working scenario is from a v4.18-based kernel.

  localhost kernel: pciehp 0000:23:00.0:pcie004: Slot(2): Attention button pressed
  localhost kernel: pciehp 0000:23:00.0:pcie004: Slot(2): Powering off due to button press
  localhost kernel: infiniband mlx5_1: wait_for_async_commands:745:(pid 1150): done with all pending requests
  localhost avahi-daemon[2639]: Withdrawing workstation service for ens2f1.
  localhost NetworkManager[2773]: <info> [1594075792.8387] device (ens2f1): state change: disconnected -> unmanaged (reason 'removed', sys-iface-state: 'removed')
  localhost ModemManager[2695]: <info> (net/ens2f1): released by modem /sys/devices/pci0000:23/0000:23:00.0/0000:24:00.1
  localhost kernel: (0000:24:00.1): E-Switch: cleanup
  localhost systemd: Unit rdma-hw.target is not needed anymore. Stopping.
  localhost systemd: Unit rdma-ndd.service is not needed anymore. Stopping.
  localhost systemd: Stopped target RDMA Hardware.
  localhost systemd: Stopping RDMA Node Description Daemon...
  localhost systemd: Stopped RDMA Node Description Daemon.
  localhost kernel: infiniband mlx5_0: wait_for_async_commands:745:(pid 1150): done with all pending requests
  localhost avahi-daemon[2639]: Withdrawing workstation service for ens2f0.
  localhost NetworkManager[2773]: <info> [1594075795.3042] device (ens2f0): state change: disconnected -> unmanaged (reason 'removed', sys-iface-state: 'removed')
  localhost ModemManager[2695]: <info> (net/ens2f0): released by modem /sys/devices/pci0000:23/0000:23:00.0/0000:24:00.0
  localhost kernel: (0000:24:00.0): E-Switch: cleanup
  localhost kernel: pciehp 0000:23:00.0:pcie004: Slot(2): Link Down
  localhost kernel: pciehp 0000:23:00.0:pcie004: Slot(2): Already disabled
  localhost systemd: Started Session 2 of user root.
  localhost kernel: pciehp 0000:23:00.0:pcie004: Slot(2): Attention button pressed
  localhost kernel: pciehp 0000:23:00.0:pcie004: Slot(2): Card present
  localhost kernel: pciehp 0000:23:00.0:pcie004: Slot(2) Powering on due to button press
  localhost kernel: pciehp 0000:23:00.0:pcie004: Slot(2): Link Up
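For anyone reading the v5.8-rc6 log above, the raw SLOTCTRL and lnk_status hex values can be decoded by hand. The following is only a minimal sketch, assuming the standard PCIe Slot Control and Link Status bit layouts (the same masks as PCI_EXP_SLTCTL_* and PCI_EXP_LNKSTA_* in the kernel's pci_regs.h). The function names are hypothetical helpers, not part of pciehp. Note that pciehp logs only the command value and not the write mask, so an indicator field of 0 is reported here as "not set in cmd" rather than guessed at.

```python
# Hypothetical helpers for decoding the hex values printed by pciehp debug output.
# Bit positions follow the PCIe Slot Control and Link Status register layouts.

# 2-bit encoding shared by the Attention and Power Indicator Control fields.
INDICATOR = {0b01: "on", 0b10: "blink", 0b11: "off"}


def decode_slot_ctrl(cmd: int) -> dict:
    """Decode the indicator and power-controller fields of a Slot Control value."""
    attn = (cmd >> 6) & 0x3   # Attention Indicator Control, bits 7:6
    pwr = (cmd >> 8) & 0x3    # Power Indicator Control, bits 9:8
    return {
        "attention_indicator": INDICATOR.get(attn, "not set in cmd"),
        "power_indicator": INDICATOR.get(pwr, "not set in cmd"),
        # Power Controller Control, bit 10: 1 means slot power off.
        "power_controller_off": bool(cmd & 0x0400),
    }


def decode_link_status(val: int) -> dict:
    """Decode the main fields of a Link Status value."""
    return {
        "speed_code": val & 0xF,                        # Current Link Speed, bits 3:0
        "width": (val >> 4) & 0x3F,                     # Negotiated Link Width, bits 9:4
        "link_training": bool(val & 0x0800),            # bit 11
        "slot_clock_config": bool(val & 0x1000),        # bit 12
        "dll_link_active": bool(val & 0x2000),          # bit 13
        "bandwidth_mgmt_status": bool(val & 0x4000),    # bit 14
    }


if __name__ == "__main__":
    # Values copied from the v5.8-rc6 dmesg log in this report.
    for cmd in (0x2C0, 0x200, 0x400, 0x340, 0x300):
        print(f"SLOTCTRL cmd {cmd:03x}: {decode_slot_ctrl(cmd)}")
    print(f"lnk_status 5001: {decode_link_status(0x5001)}")
```

Under these assumptions, lnk_status 5001 decodes with the Data Link Layer Link Active bit clear and a negotiated width of 0, which is consistent with the "Data Link Layer Link Active not set in 1000 msec" message that follows it in the log.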
Created attachment 292283 [details] lspci -vvv log
Created attachment 292459 [details] dmesg with timestamp for RHEL7.8-success and RHEL8.2-failed

I tested RHEL7.8 (based on 3.10) and RHEL8.2 (based on 4.18) and collected dmesg logs with timestamps. I uploaded the files to the bugzilla. If there is anything else I need to do, please feel free to let me know.
Created attachment 292503 [details] dmesg log of v3.10+ showing successful hot-plug event

'dmesg' log of boot with the OCP NIC - Hot-plug events:
  o 113 seconds, button press to remove OCP NIC
  o 122 seconds, OCP-NIC successfully disconnected.
  o 138 seconds, Insert OCP-NIC and press Attention button
    OCP-NIC detected Card Present and initialized OCP successfully.
Created attachment 292505 [details] dmesg log of v5.8.8 with increased timeout