Bug 218360 - Disk drive of TBT3/USB4 storage device can’t show up if connecting behind TBT3 dock or some USB4 docks DFP TBT port
Summary: Disk drive of TBT3/USB4 storage device can’t show up if connecting behind TBT...
Status: RESOLVED PATCH_ALREADY_AVAILABLE
Alias: None
Product: Drivers
Classification: Unclassified
Component: PCI (show other bugs)
Hardware: AMD Linux
: P3 blocking
Assignee: drivers_pci@kernel-bugs.osdl.org
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2024-01-10 10:52 UTC by Sanath S
Modified: 2024-03-07 15:14 UTC (History)
5 users (show)

See Also:
Kernel Version:
Subsystem:
Regression: Yes
Bisected commit-id: d3fcd7360338358aa0036bec6d2cf0e37a0ca624


Attachments
message.eml (10.89 KB, message/rfc822)
2024-01-10 18:18 UTC, linux-pci+owner
Details
Provided dmesg and lspci outputs for pass and fail case (89.74 KB, application/x-zip-compressed)
2024-01-11 04:48 UTC, Sanath S
Details
Proposed fix (1.09 KB, patch)
2024-01-16 20:39 UTC, Alex Williamson
Details | Diff

Description Sanath S 2024-01-10 10:52:40 UTC
The disk drive of the TBT3/USB4 storage device can’t show up behind TBT3 dock and some USB4 docks have a DFP TBT port.

TBT3/USB4 storage device can be authorized, the disk drive doesn’t show up.

Issue reproduce sequence: Connect USB4 dock to host → Connect TBT3/USB4 storage device to dock DFP TBT3 port → TBT3/USB4 storage device disk driver can’t show up.

The issue was observed in the latest mainline kernel 6.7.


Failing on all the below combination

HP Thunderbolt Dock 120W G2 HSN-IX01	
HP Hook 2.0 USB4/TBT4 Dock (PV Phase)	
CalDigit TS3Plus+ Thunderbolt Station 3 Plus	
Dell Thunderbolt Dock – WD19TB
Comment 1 Sanath S 2024-01-10 10:54:46 UTC
This is a regression

It last worked for me on the 6.5 kernel.
Started to bisect and found this to be a bad commit.

d3fcd7360338358aa0036bec6d2cf0e37a0ca624 is the first bad commit
commit d3fcd7360338358aa0036bec6d2cf0e37a0ca624
Author: Alex Williamson <alex.williamson@redhat.com>
Date:   Thu Aug 3 11:12:33 2023 -0600

    PCI: Fix runtime PM race with PME polling

    Testing that a device is not currently in a low-power state provides no
    guarantees that the device is not imminently transitioning to such a state.
    Increment the PM usage counter before accessing the device.  Since we don't
    wish to wake the device for PME polling, do so only if the device is
    already active by using pm_runtime_get_if_active().

    Link: https://lore.kernel.org/r/20230803171233.3810944-3-alex.williamson@redhat.com
    Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
    Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>

 drivers/pci/pci.c | 23 ++++++++++++++++-------
 1 file changed, 16 insertions(+), 7 deletions(-)
Comment 2 linux-pci+owner 2024-01-10 18:18:22 UTC
Created attachment 305692 [details]
message.eml

Greetings!

This is the mlmmj program managing the <linux-pci@vger.kernel.org> mailing
list.

The message from <bugzilla-daemon@kernel.org> with subject "[Bug 218360]
New: Disk drive of TBT3/USB4 storage device can’t show up if connecting
behind TBT3 dock or some USB4 docks DFP TBT port" was unable to be
delivered to the list because the list address was not found in either the
To: or CC: header.

(The denied message is below.)
Comment 3 Sanath S 2024-01-11 04:48:29 UTC
Created attachment 305699 [details]
Provided dmesg and lspci outputs for pass and fail case

Hi Bjorn Helgaas,

Attaching the dmesg and "lspci -vv" for both failure and pass scenarios.

I've used HP hook2.0 Dock + Pluggable TBT3 storage device.
Boot to OS -> Plug the HP Hook2.0 dock -> Then hotplug the Pluggable storage device.

pass is taken from - 5cd903bce9dd ("PCI/VPD: Add runtime power management to sysfs interface")

failure is from - d3fcd7360338 ("PCI: Fix runtime PM race with PME polling")
Comment 4 Alex Williamson 2024-01-11 18:42:11 UTC
lspci shows 05:01.0 is in D3 in the failing case, this is the parent bridge to bus 07 where the nvme device lives.

The intent of d3fcd7360338 was only to make use of runtime power manage management to increment the usage counter of the device across the call to pci_pme_wakeup() without otherwise changing behavior.  Clearly that's not the case.

AIUI, the only way the code behaves differently with d3fcd7360338 is if pm_runtime_get_if_active() returns 0, which introduces a new continue whereas otherwise it's simply a matter of incrementing a usage counter.  That indicates dev->power.runtime_status != RPM_ACTIVE.  I used the _if_active() variant of pm_runtime_get() here to maintain the behavior suggested by the comment that we're not trying to poll devices in D3cold.

It'd be interesting to know what the runtime_status is here, whether the device is fully suspended or in a transition state.  Using pm_runtime_get() would resume the device, which appears not to be the original intent.  We could use pm_runtime_get_noresume() to prevent that transition before testing D3cold but I'm not sure if that isolates us from the race we were intending to resolve in d3fcd7360338 if the runtime_status were RPM_SUSPENDING.
Comment 5 Sanath S 2024-01-16 07:25:09 UTC
Hi Alex,

What's the way forward here? 
Let me know if you want me to try any changes.

Thanks,
Sanath S
Comment 6 Alex Williamson 2024-01-16 20:39:18 UTC
Created attachment 305721 [details]
Proposed fix

Hi Sanath,

Please test this patch in your configuration.  Forcing the device to be in the active state was an unintended change, lspci shows the bridge here in D3, so it's likely in a suspended state and therefore skipped.  Instead we can try acquiring a runtime pm usage reference and immediately thereafter flush any pending operations, at which point the PCI power state of the device should be stable relative to runtime power management operations.  Thanks,

Alex
Comment 7 Alex Williamson 2024-01-16 23:23:45 UTC
(In reply to Alex Williamson from comment #6)
> Created attachment 305721 [details]
> Proposed fix
> 
> Hi Sanath,
> 
> Please test this patch in your configuration.

I'm getting some unexplained live locks in pm_runtime_barrier() with this, so it's still got some issues to resolve.
Comment 8 Alex Williamson 2024-01-22 23:19:33 UTC
Latest attempt to resolve this is available here:

https://lore.kernel.org/all/20240122155003.587225aa.alex.williamson@redhat.com/

Testing and reviews welcome.
Comment 9 Bjorn Helgaas 2024-03-07 15:14:47 UTC
The patch is https://git.kernel.org/linus/41044d536068 ("PCI: Fix active state requirement in PME polling"), which appeared in v6.8-rc5, so we expect it to be included in v6.8.

Note You need to log in before you can comment on or make changes to this bug.