Bug 217251 - pciehp: nvme not visible after re-insert to tbt port
Summary: pciehp: nvme not visible after re-insert to tbt port
Status: NEW
Alias: None
Product: Drivers
Classification: Unclassified
Component: PCI
Hardware: Intel Linux
Importance: P1 normal
Assignee: drivers_pci@kernel-bugs.osdl.org
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2023-03-27 12:36 UTC by Aleksander Trofimowicz
Modified: 2023-03-31 14:20 UTC
CC List: 1 user

See Also:
Kernel Version: 6.2
Subsystem:
Regression: No
Bisected commit-id:


Attachments
the tracing of nvme_pci_enable() during re-insertion (811.40 KB, text/plain)
2023-03-27 12:36 UTC, Aleksander Trofimowicz
kernel buffer after the second insertion (137.97 KB, text/plain)
2023-03-27 12:38 UTC, Aleksander Trofimowicz
pci buses overview (2.10 KB, text/plain)
2023-03-27 12:39 UTC, Aleksander Trofimowicz
pci device status after the second insertion (3.80 KB, text/plain)
2023-03-27 12:40 UTC, Aleksander Trofimowicz
04:01.0 PCI bridge status after the second insertion (4.74 KB, text/plain)
2023-03-27 12:42 UTC, Aleksander Trofimowicz
the tracing of nvme_pci_enable() during the first insertion (732.09 KB, text/plain)
2023-03-27 12:43 UTC, Aleksander Trofimowicz
the tracing of nvme_remove() during unplugging of the peripheral (2.84 MB, text/plain)
2023-03-27 12:46 UTC, Aleksander Trofimowicz
1st dmesg (97.65 KB, text/plain)
2023-03-29 19:22 UTC, Aleksander Trofimowicz
1st lspci (76.42 KB, text/plain)
2023-03-29 19:23 UTC, Aleksander Trofimowicz
2nd dmesg (115.18 KB, text/plain)
2023-03-29 19:24 UTC, Aleksander Trofimowicz
2nd lspci (76.39 KB, text/plain)
2023-03-29 19:25 UTC, Aleksander Trofimowicz

Description Aleksander Trofimowicz 2023-03-27 12:36:35 UTC
Created attachment 304031
the tracing of nvme_pci_enable() during re-insertion

Hi,

There is a JHL7540-based device that can host an NVMe drive. After the first insertion the NVMe drive is properly discovered and handled by the relevant modules, but once disconnected, any further insertion attempts are unsuccessful. The device is visible on the PCI bus, yet nvme_pci_enable() ends up calling pci_disable_device() every time; the runtime PM status of the device is "suspended", and the power state of the 04:01.0 PCI bridge is D3. Preventing the device from being power managed ("on" -> /sys/devices/../power/control) combined with device removal and a PCI rescan changes nothing. A host reboot restores the initial state.
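For reference, the workaround attempts described above correspond roughly to the following sysfs operations (a sketch; the 04:01.0/05:00.0 bus addresses are taken from this report, and the 0000: PCI domain prefix is assumed):

  # Keep the NVMe endpoint and the downstream port out of runtime PM:
  echo on > /sys/bus/pci/devices/0000:05:00.0/power/control
  echo on > /sys/bus/pci/devices/0000:04:01.0/power/control

  # Remove the endpoint and force a rescan of the PCI bus:
  echo 1 > /sys/bus/pci/devices/0000:05:00.0/remove
  echo 1 > /sys/bus/pci/rescan

  # Inspect the runtime PM status afterwards:
  cat /sys/bus/pci/devices/0000:05:00.0/power/runtime_status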

I would appreciate any suggestions on how to debug this further.
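The attached traces were presumably captured with something like ftrace's function_graph tracer; a minimal sketch, assuming tracefs is mounted at /sys/kernel/tracing and nvme_pci_enable appears in available_filter_functions:

  cd /sys/kernel/tracing
  echo nvme_pci_enable > set_graph_function  # restrict the graph to this function
  echo function_graph > current_tracer
  echo 1 > tracing_on
  # ... re-insert the device here ...
  echo 0 > tracing_on
  cat trace > nvme_pci_enable-reinsert.txt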
Comment 1 Aleksander Trofimowicz 2023-03-27 12:38:36 UTC
Created attachment 304032
kernel buffer after the second insertion
Comment 2 Aleksander Trofimowicz 2023-03-27 12:39:17 UTC
Created attachment 304033
pci buses overview
Comment 3 Aleksander Trofimowicz 2023-03-27 12:40:45 UTC
Created attachment 304034
pci device status after the second insertion
Comment 4 Aleksander Trofimowicz 2023-03-27 12:42:40 UTC
Created attachment 304035
04:01.0 PCI bridge status after the second insertion
Comment 5 Aleksander Trofimowicz 2023-03-27 12:43:42 UTC
Created attachment 304036
the tracing of nvme_pci_enable() during the first insertion
Comment 6 Aleksander Trofimowicz 2023-03-27 12:46:03 UTC
Created attachment 304037
the tracing of nvme_remove() during unplugging of the peripheral
Comment 7 Mika Westerberg 2023-03-28 12:17:40 UTC
Can you attach the full dmesg and the output of 'sudo lspci -vv' after both insertions?
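One way to collect the requested logs after each insertion (a sketch; file names are arbitrary):

  sudo dmesg > dmesg-after-insertion.txt
  sudo lspci -vv > lspci-after-insertion.txt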
Comment 8 Aleksander Trofimowicz 2023-03-29 19:22:56 UTC
Created attachment 304053
1st dmesg
Comment 9 Aleksander Trofimowicz 2023-03-29 19:23:38 UTC
Created attachment 304054
1st lspci
Comment 10 Aleksander Trofimowicz 2023-03-29 19:24:02 UTC
Created attachment 304055
2nd dmesg
Comment 11 Aleksander Trofimowicz 2023-03-29 19:25:32 UTC
Created attachment 304056
2nd lspci
Comment 12 Mika Westerberg 2023-03-30 06:32:52 UTC
Thanks for the logs! Indeed, the PCIe downstream port 04:01.0 seems to enter D3 (runtime suspend) even though the connected endpoint (nvme 05:00.0) is in D0. That's unexpected. Can you try whether passing "pcie_port_pm=off" works around it?
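Both observations can be verified from userspace, and the parameter tested at the next boot (a sketch; the grep pattern matches the Power Management capability line in typical lspci -vv output):

  # D-state reported in each device's Power Management capability:
  sudo lspci -s 04:01.0 -vv | grep 'Status: D'  # downstream port, reportedly D3
  sudo lspci -s 05:00.0 -vv | grep 'Status: D'  # NVMe endpoint, reportedly D0

  # Runtime PM view of the same port:
  cat /sys/bus/pci/devices/0000:04:01.0/power/runtime_status

  # Then boot with pcie_port_pm=off appended to the kernel command line
  # and confirm it took effect:
  grep pcie_port_pm /proc/cmdline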
Comment 13 Aleksander Trofimowicz 2023-03-30 21:11:46 UTC
I did, and the results were the same.

I also decided to widen the problem space: I added four other distinct NVMe devices and another mobile platform, TGL. All combinations worked flawlessly except those involving the 970 Pro. In the end I could not confirm the claim of one of your colleagues that something has been broken since the introduction of ADL.

As far as I am concerned, we could throw in the towel. Nonetheless, if you think the kernel might be at fault, I am willing to devote my time to nailing it down.
Comment 14 Mika Westerberg 2023-03-31 14:20:39 UTC
Okay, thanks for checking anyway. Yeah, it could be a device issue, but I'm not an NVMe expert (I'm mostly looking at this because TBT is involved).
