Created attachment 304031 [details]
the tracing of nvme_pci_enable() during re-insertion

Hi,

There is a JHL7540-based device that can host an NVMe drive. After the first insertion the NVMe drive is properly discovered and handled by the relevant modules, but once disconnected, any further insertion attempt fails. The device is still visible on the PCI bus, yet nvme_pci_enable() ends up calling pci_disable_device() every time; the runtime PM status of the device is "suspended", and the power state of the 04:01.0 PCI bridge is D3. Preventing the device from being power managed (writing "on" to /sys/devices/../power/control), combined with device removal and a PCI rescan, changes nothing. A host reboot restores the initial state.

I would appreciate any suggestions on how to debug this further.
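For reference, a minimal sketch of the steps described above, assuming the bus addresses from this report (downstream port 04:01.0, NVMe endpoint 05:00.0); the exact sysfs paths depend on the local topology:

  # Check the runtime PM state of the bridge and the endpoint.
  cat /sys/bus/pci/devices/0000:04:01.0/power/runtime_status
  cat /sys/bus/pci/devices/0000:05:00.0/power/runtime_status

  # Keep both out of runtime suspend ("on" instead of the default "auto").
  echo on > /sys/bus/pci/devices/0000:04:01.0/power/control
  echo on > /sys/bus/pci/devices/0000:05:00.0/power/control

  # Remove the endpoint and rescan the bus to force re-enumeration.
  echo 1 > /sys/bus/pci/devices/0000:05:00.0/remove
  echo 1 > /sys/bus/pci/rescan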
Created attachment 304032 [details] kernel buffer after the second insertion
Created attachment 304033 [details] pci buses overview
Created attachment 304034 [details] pci device status after the second insertion
Created attachment 304035 [details] 04:01.0 PCI bridge status after the second insertion
Created attachment 304036 [details] the tracing of nvme_pci_enable() during the first insertion
Created attachment 304037 [details] the tracing of nvme_remove() during unplugging of the peripheral
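For anyone wanting to reproduce traces like the ones attached, one option is the ftrace function_graph tracer; a minimal sketch, assuming nvme_pci_enable appears in available_filter_functions (static functions are not always traceable):

  cd /sys/kernel/debug/tracing
  # Trace only the call graph under nvme_pci_enable.
  echo nvme_pci_enable > set_graph_function
  echo function_graph > current_tracer
  echo 1 > tracing_on
  # ... re-insert the device, then read the result ...
  cat trace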
Can you attach the full dmesg and the output of 'sudo lspci -vv' after both insertions?
Created attachment 304053 [details] 1st dmesg
Created attachment 304054 [details] 1st lspci
Created attachment 304055 [details] 2nd dmesg
Created attachment 304056 [details] 2nd lspci
Thanks for the logs! Indeed, the PCIe downstream port 04:01.0 seems to enter D3 (runtime suspend) even though the connected endpoint (nvme 05:00.0) is in D0. That's unexpected. Can you check whether passing "pcie_port_pm=off" works around it?
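In case it helps, a minimal sketch for confirming the D-states and testing the workaround; the power_state sysfs attribute assumes a reasonably recent kernel:

  # The D-state is visible in the Power Management capability of lspci.
  sudo lspci -s 04:01.0 -vv | grep -i 'Status: D'
  sudo lspci -s 05:00.0 -vv | grep -i 'Status: D'

  # Or read it directly from sysfs.
  cat /sys/bus/pci/devices/0000:04:01.0/power_state
  cat /sys/bus/pci/devices/0000:05:00.0/power_state

pcie_port_pm=off has to go on the kernel command line (e.g. via GRUB_CMDLINE_LINUX in /etc/default/grub, then update-grub) followed by a reboot.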
bugzilla-daemon@kernel.org writes:

> https://bugzilla.kernel.org/show_bug.cgi?id=217251
>
> --- Comment #12 from Mika Westerberg (mika.westerberg@linux.intel.com) ---
> Thanks for the logs! Indeed, the PCIe downstream port 04:01.0 seems to enter
> D3 (runtime suspend) even though the connected endpoint (nvme 05:00.0) is in
> D0. That's unexpected. Can you check whether passing "pcie_port_pm=off"
> works around it?

I did, and the results were the same. I also decided to widen the problem space: I added four other distinct NVMe devices and another mobile platform (TGL). All combinations worked flawlessly except those including the 970 Pro device. In the end I could not confirm the claim of one of your colleagues that something has been botched since the introduction of ADL. As far as I am concerned, we could throw in the towel. Nonetheless, if you think the kernel might be at fault, I am willing to devote my time to nailing it down.
Okay, thanks for checking anyway. Yeah, it could be a device issue, but I'm not an NVMe expert (I'm more looking at this because TBT is involved).