Bug 219748 - Pluggable UD-4VPD dock appears to continually reset with AMD AI 365
Status: NEW
Alias: None
Product: Drivers
Classification: Unclassified
Component: USB
Hardware: All Linux
Importance: P3 normal
Assignee: Default virtual assignee for Drivers/USB
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2025-02-03 20:58 UTC by Thomas
Modified: 2025-03-12 09:33 UTC

See Also:
Kernel Version:
Subsystem:
Regression: No
Bisected commit-id:


Attachments
dmesg (160.80 KB, text/plain), 2025-02-03 20:58 UTC, Thomas
dmesg while tracing (207.07 KB, text/plain), 2025-02-05 23:14 UTC, Thomas
trace (711.60 KB, application/zip), 2025-02-05 23:15 UTC, Thomas
Try to shorten SB access timeout (2.55 KB, patch), 2025-02-06 15:51 UTC, Mika Westerberg
Skip retimer scanning (419 bytes, patch), 2025-02-10 07:30 UTC, Mika Westerberg
tbtrace with skip retimer patch (188.58 KB, application/zip), 2025-02-14 19:46 UTC, Thomas
dmesg with skip retimer patch (201.27 KB, text/plain), 2025-02-14 19:47 UTC, Thomas
Delay router enumeration for a while (548 bytes, patch), 2025-02-24 07:58 UTC, Mika Westerberg
dmesg after delay-router-enumeration patch (151.37 KB, text/plain), 2025-03-03 16:06 UTC, Thomas
Scan downstream retimers after the router has been enumerated (756 bytes, patch), 2025-03-04 08:46 UTC, Mika Westerberg
dmesg after Scan downstream retimers after the router patch (20.27 KB, text/plain), 2025-03-04 23:41 UTC, Sean
dmesg after move retimers patch (162.86 KB, text/plain), 2025-03-05 01:30 UTC, Thomas
dmesg after move retimers patch only the pluggable device (150.01 KB, text/plain), 2025-03-05 12:40 UTC, Thomas
Scan downstream retimers after the router has been enumerated v2 (2.17 KB, patch), 2025-03-05 13:11 UTC, Mika Westerberg
0001-thunderbolt-Scan with display attached that was undetected (234.37 KB, text/plain), 2025-03-05 23:33 UTC, Thomas
dmesg showing no r8152 network device afer 0001-thunderbolt-Scan patch (131.63 KB, text/plain), 2025-03-05 23:35 UTC, Thomas

Description Thomas 2025-02-03 20:58:12 UTC
Not sure if this is related to the Linux kernel work to fix up the Pluggable UD-4VPD dock: https://www.spinics.net/lists/linux-usb/msg263475.html

The running kernel is the latest linux-mainline-um5606 package from the AUR (https://aur.archlinux.org/packages/linux-mainline-um5606). It is based on the 6.13 kernel and adds some patches, as this laptop has been a bit unstable; hence the additional amdgpu boot parameters.

The thunderbolt.dyndbg=+p output shows the issue starting at time 27.027128 in the attached dmesg.
Comment 1 Thomas 2025-02-03 20:58:55 UTC
Created attachment 307570 [details]
dmesg
Comment 2 Mika Westerberg 2025-02-04 06:49:05 UTC
Can you enable tracing as well and attach both dmesg and the trace output here?

1. Boot the system nothing connected.
2. Enable tracing as explained in (https://github.com/intel/tbtools?tab=readme-ov-file#tracing)

# tbtrace enable

3. Plug in the dock
4. Verify that the problem reproduces.
5. Disable tracing

# tbtrace disable

6. Take both trace output and dmesg and attach them:

# tbtrace dump -vv > trace.out
# dmesg > dmesg.out

(We need the dmesg as well to be able to match the trace lines with the driver logs).
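
The capture steps above can also be scripted so that the trace and the dmesg are always taken from the same session. This is only a minimal Python sketch: it assumes `tbtrace` (from tbtools) and `dmesg` are on the PATH and that it runs as root. The function name `capture` and its `run`/`wait` parameters are illustrative, not part of tbtools.

```python
import subprocess

# Commands taken from the steps above; the output files are written by
# Python instead of by shell redirection.
ENABLE = ["tbtrace", "enable"]
DISABLE = ["tbtrace", "disable"]
DUMPS = {
    "trace.out": ["tbtrace", "dump", "-vv"],
    "dmesg.out": ["dmesg"],
}


def capture(run=subprocess.run, wait=input):
    """Enable tracing, wait for the user to reproduce, then save both logs."""
    run(ENABLE, check=True)
    try:
        wait("Plug in the dock, confirm the problem reproduces, then press Enter: ")
    finally:
        # Always turn tracing off again, even if the wait is interrupted.
        run(DISABLE, check=True)
    for path, cmd in DUMPS.items():
        with open(path, "w") as out:
            run(cmd, check=True, stdout=out)
```

Calling `capture()` on a freshly booted system with nothing connected leaves `trace.out` and `dmesg.out` in the current directory, ready to attach.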
Comment 3 Thomas 2025-02-05 23:14:36 UTC
Created attachment 307577 [details]
dmesg while tracing
Comment 4 Thomas 2025-02-05 23:15:10 UTC
Created attachment 307578 [details]
trace
Comment 5 Thomas 2025-02-05 23:17:28 UTC
(In reply to Mika Westerberg from comment #2)
> Can you enable tracing as well and attach both dmesg and the trace output
> here?
> 
> 1. Boot the system nothing connected.
> 2. Enable tracing as explained in
> (https://github.com/intel/tbtools?tab=readme-ov-file#tracing)
> 
> # tbtrace enable
> 
> 3. Plug in the dock
> 4. Verify that the problem reproduces.
> 5. Disable tracing
> 
> # tbtrace disable
> 
> 6. Take both trace output and dmesg and attach them:
> 
> # tbtrace dump -vv > trace.out
> # dmesg > dmesg.out
> 
> (We need the dmesg as well to be able to match the trace lines with the
> driver logs).

Attached trace output. It charged the whole time. The first couple were slower than normal.
Comment 6 Mika Westerberg 2025-02-06 15:51:20 UTC
Created attachment 307579 [details]
Try to shorten SB access timeout

Thanks for the logs! From the trace it looks similar to what was already seen with the Pluggable hub connected to Intel hardware. The sideband accesses take some time, and the hub interprets this as an "issue" and triggers a disconnect (or this is what it looks like). The attached patch shortens the timeouts to half. Can you try with it and see if that makes any difference?
Comment 7 Thomas 2025-02-09 19:50:47 UTC
Tried the shorter timeout patch and verified it was running with uname -v. The issue still persisted. I also shortened USB4_PORT_SB_TIMEOUT to 200, and it is still disconnecting.

Are there other values I should try or logs I can get?
Comment 8 Mika Westerberg 2025-02-10 07:30:46 UTC
Created attachment 307605 [details]
Skip retimer scanning

Can you try this one instead of the previous one? This completely disables sideband accesses done by the driver. If this makes a difference then we need to look into the access patterns to see whether there is something we need to change in order to make the Pluggable device happy.
Comment 9 Thomas 2025-02-14 19:46:34 UTC
Created attachment 307659 [details]
tbtrace with skip retimer patch
Comment 10 Thomas 2025-02-14 19:47:05 UTC
Created attachment 307660 [details]
dmesg with skip retimer patch
Comment 11 Thomas 2025-02-14 19:50:00 UTC
The dock works with the skip retimer patch. 
Plugging in the dock plays the device connection sound twice. From there, I:
1. plugged in ethernet 
2. plugged in usb
3. bumped the cable and unplugged the dock briefly -- sorry
4. plugged in an HDMI monitor
Comment 12 Mario Limonciello (AMD) 2025-02-14 21:31:00 UTC
Mika - what would you think about putting the retimer scan into a delayed work queue for 5 or 10 seconds later or so?

As a guess, maybe these devices don't like the sideband traffic while they're trying to get set up.
Comment 13 Mika Westerberg 2025-02-24 06:44:22 UTC
That's an option, yes, but I would still like to understand what is going on. In order to support CLx (with ALPM) we are going to need to access the sideband (and this applies to Windows too) before we set up DP tunnels, and that might happen pretty soon after we have enumerated the router.

I'll look into the traces next and see if I can identify something. Thanks!
Comment 14 Mika Westerberg 2025-02-24 07:58:25 UTC
Created attachment 307705 [details]
Delay router enumeration for a while

Can you try the attached patch (instead of the previous one)? It still skips scanning retimers but adds an artificial delay before we enumerate the router. If there is some sort of timeout on the Pluggable device side it should trigger now. Please attach dmesg (no need for trace).
Comment 15 Thomas 2025-03-03 16:06:07 UTC
Created attachment 307733 [details]
dmesg after delay-router-enumeration patch

The latest patch resulted in the dock detaching again.  Dmesg attached.
Comment 16 Mika Westerberg 2025-03-04 08:44:17 UTC
Thanks! So we have:

[   56.773330] thunderbolt 0000:c6:00.6: 0:3: hotplug: scanning
[   56.773335] thunderbolt 0000:c6:00.6: 0:3: hotplug: no switch found
[   56.773349] thunderbolt 0000:c6:00.6: acking hot plug event on 0:2
[   56.773362] thunderbolt 0000:c6:00.6: 0:2: hotplug: scanning
[   56.773504] thunderbolt 0000:c6:00.6: 0:2: is connected, link is up (state: 2)
[   56.773515] thunderbolt 0000:c6:00.6: 0:2: waiting for a while
[   57.371077] thunderbolt 0000:c6:00.6: acking hot unplug event on 0:3
[   57.371143] thunderbolt 0000:c6:00.6: acking hot unplug event on 0:2
[   57.692761] thunderbolt 0000:c6:00.6: acking hot plug event on 0:3
[   57.692828] thunderbolt 0000:c6:00.6: acking hot plug event on 0:2
[   58.291125] thunderbolt 0000:c6:00.6: acking hot unplug event on 0:2
[   58.291217] thunderbolt 0000:c6:00.6: acking hot unplug event on 0:3
[   59.430462] thunderbolt 0000:c6:00.6: 0:2: .. done, trying to enumerate the router

In other words, there is some ~600 ms timeout in the Pluggable device that triggers a hot-remove and then a hot-add if the device is not enumerated. This is naturally outside the USB4 spec; as far as I can tell the host does not need to enumerate the device at all, actually.
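
The ~600 ms figure can be read directly off the timestamps in the excerpt above. The following is a small Python sketch, not part of the driver or tbtools, that pairs each "acking hot plug" line with the next "acking hot unplug" line on the same port and prints the gap in milliseconds. The helper name `plug_to_unplug_ms` is made up for illustration; the log lines are copied from the excerpt.

```python
import re

# Hotplug acknowledgement lines copied from the dmesg excerpt above.
LOG = """\
[   56.773349] thunderbolt 0000:c6:00.6: acking hot plug event on 0:2
[   57.371077] thunderbolt 0000:c6:00.6: acking hot unplug event on 0:3
[   57.371143] thunderbolt 0000:c6:00.6: acking hot unplug event on 0:2
[   57.692761] thunderbolt 0000:c6:00.6: acking hot plug event on 0:3
[   57.692828] thunderbolt 0000:c6:00.6: acking hot plug event on 0:2
[   58.291125] thunderbolt 0000:c6:00.6: acking hot unplug event on 0:2
[   58.291217] thunderbolt 0000:c6:00.6: acking hot unplug event on 0:3
"""

EVENT = re.compile(
    r"\[\s*(?P<ts>\d+\.\d+)\].*acking hot (?P<kind>plug|unplug) event on (?P<port>\d+:\d+)"
)


def plug_to_unplug_ms(log, port):
    """Return the time in ms from each hot plug to the next hot unplug on port."""
    intervals = []
    plugged_at = None
    for line in log.splitlines():
        m = EVENT.search(line)
        if not m or m["port"] != port:
            continue
        ts = float(m["ts"])
        if m["kind"] == "plug":
            plugged_at = ts
        elif plugged_at is not None:
            intervals.append(round((ts - plugged_at) * 1000, 1))
            plugged_at = None
    return intervals


print(plug_to_unplug_ms(LOG, "0:2"))
print(plug_to_unplug_ms(LOG, "0:3"))
```

Both plug/unplug cycles on port 0:2 come out just under 600 ms, which is what points at a fixed timer on the device side rather than a random link drop.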
Comment 17 Mika Westerberg 2025-03-04 08:46:07 UTC
Created attachment 307740 [details]
Scan downstream retimers after the router has been enumerated

Can you try this patch? It moves retimer scanning so that it happens after the router has been enumerated, so hopefully this is acceptable for the Pluggable device.
Comment 18 Sean 2025-03-04 23:41:55 UTC
Created attachment 307767 [details]
dmesg after Scan downstream retimers after the router patch

While this topic is related to a Plugable dock, the Thunderbolt implementation is having some issues for AMD GPUs where display EDIDs can't be read (https://gitlab.freedesktop.org/drm/amd/-/issues/3050). This latest patch appears to have helped that issue, as my TB3 Lenovo dock on a Framework 16 with an AMD 7840HS CPU is now able to use its display outputs at the correct resolutions. I'll be testing with additional TB docks I have access to in the near future as well.
Comment 19 Thomas 2025-03-05 01:30:13 UTC
Created attachment 307768 [details]
dmesg after move retimers patch

Patch worked. Two docks were tried in the dmesg: a Dell WD19S and then the Pluggable UD-4VPD.
Comment 20 Mika Westerberg 2025-03-05 09:26:33 UTC
(In reply to Sean from comment #18)
> Created attachment 307767 [details]
> dmesg after Scan downstream retimers after the router patch
> 
> While this topic is related to a Plugable dock, the Thunderbolt
> implementation is having some issues for AMD GPUs where display EDIDs can't
> be read (https://gitlab.freedesktop.org/drm/amd/-/issues/3050).  This latest
> patch appears to have  helped that issue, as my TB3 Lenovo dock on a
> Framework 16 with an AMD 7840HS CPU is now able to use its display outputs
> at the correct resolutions.  I'll be testing with addition TB docks I have
> access to in the near future as well.

I suggest opening a separate issue for this. I don't see how this patch could affect the GPU reading the EDID from the monitor. Assuming it is over a DisplayPort tunnel, at that point the device router is already enumerated. This patch just speeds up the enumeration to cope with the Pluggable device's internal timeout.

Please attach full dmesg with "thunderbolt.dyndbg=+p" in the command line too to that new issue.
Comment 21 Mika Westerberg 2025-03-05 09:31:50 UTC
(In reply to Thomas from comment #19)
> Created attachment 307768 [details]
> dmesg after move retimers patch
> 
> Patch worked.  There's two docks tried in the dmesg.  A Dell WD19S and then
> the pluggable UD-4VPD.

Thanks for testing. However, I don't see the Dell dock, only the Pluggable one, and there is one unplug and then a plug. Is that on purpose?
Comment 22 Thomas 2025-03-05 11:52:23 UTC
Testing two in one dmesg wasn't the best move on my part.  The Dell was me plugging in the wrong dock.  Thought I'd leave it for the extra test.  It does clutter up the dmesg though.


Dell

[   52.334084] usb 7-1: new high-speed USB device number 2 using xhci_hcd
[   52.475560] usb 7-1: New USB device found, idVendor=0bda, idProduct=5487, bcdDevice= 1.47
[   52.475577] usb 7-1: New USB device strings: Mfr=1, Product=2, SerialNumber=0
[   52.475581] usb 7-1: Product: Dell dock
[   52.475583] usb 7-1: Manufacturer: Dell Inc.

The switch from Dell to Pluggable, I think, happens here

[  105.079211] usb 7-1.5: USB disconnect, device number 4
[  108.745727] thunderbolt 0000:c6:00.5: control channel starting...

UD-4VPD is identified here

[  113.165640] thunderbolt 0-2: Plugable Plugable UD-4VPD
Comment 23 Thomas 2025-03-05 12:40:35 UTC
Created attachment 307771 [details]
dmesg after move retimers patch only the pluggable device

Thanks for the patch.  This is the dmesg output with the move retimers patch.  Only the pluggable device is tested in this one.
Comment 24 Mika Westerberg 2025-03-05 13:11:42 UTC
Created attachment 307772 [details]
Scan downstream retimers after the router has been enumerated v2

Can you still try the attached patch too? I missed two cases where we also want to scan the retimers, so I made that change there as well. It should not affect your case but it is better to check. Thanks!
Comment 25 Mario Limonciello (AMD) 2025-03-05 14:34:19 UTC
Comment on attachment 307772 [details]
Scan downstream retimers after the router has been enumerated v2

That patch looks good to me Mika.  Feel free to add:

Reviewed-by: Mario Limonciello <mario.limonciello@amd.com>

Also, since Sean mentions this helps a different issue on his TB3 Lenovo dock (no EDID read), I think you can also add:

Closes: https://gitlab.freedesktop.org/drm/amd/-/issues/3050
Comment 26 Thomas 2025-03-05 23:33:22 UTC
Created attachment 307773 [details]
0001-thunderbolt-Scan with display attached that was undetected

Tried the most recent patch.  This is a longer dmesg as devices were tried as well.  It was a bit odd as the network didn't seem to be detected initially and then, after unplugging and plugging in the dock, it did work.  Monitors plugged into the dock would report as not identified and, after plugging them in and being identified, they did not have video output.  Not sure how much of this is due to the patch.
Comment 27 Thomas 2025-03-05 23:35:14 UTC
Created attachment 307774 [details]
dmesg showing no r8152 network device afer 0001-thunderbolt-Scan patch

This is the dmesg that had the r8152 network adapter missing.
Comment 28 Mario Limonciello (AMD) 2025-03-06 03:53:51 UTC
Do those same failures happen in the case that the retimer scan was totally commented out?
Comment 29 Mika Westerberg 2025-03-06 07:16:38 UTC
For the first dmesg with "undetected" display the enumeration goes fine and the DP tunnel is created however:

[   54.496699] [drm] DPIA AUX failed on 0x0(1), error 3
[   54.579780] [drm] DM_MST: starting TM on aconnector: 00000000a96810d6 [id: 133]
[   54.587986] [drm] DM_MST: DP14, 4-lane link detected
[   54.591410] [drm] DPIA AUX failed on 0x1000(5), error 3
[   54.596690] [drm] DMUB HPD RX IRQ callback: link_index=7
[   54.605700] [drm] DMUB HPD RX IRQ callback: link_index=7
[   54.612512] [drm] DPIA AUX failed on 0x2003(1), error 3
[   54.616646] [drm] DMUB HPD RX IRQ callback: link_index=7
[   54.624654] [drm] DMUB HPD RX IRQ callback: link_index=7
[   54.631037] [drm] DPIA AUX failed on 0x2002(4), error 3

This looks to me like some sort of link training issue, but as far as I can tell the DP tunnel is up just fine. Immediately after this there is an unplug:

[   54.632887] thunderbolt 0000:c6:00.6: acking hot unplug event on 0:2
[   54.632987] thunderbolt 1-0:2.1: retimer disconnected

I think this was done by you, followed by a replug. If that's the case then from the USB4 perspective it looks okay (let me know if this is not the case). I'm leaving the graphics side for Mario to comment on as I'm not qualified. I do see MST, and that seems to be problematic in Linux IIRC.

There is one more thing that I need to check. Do you have CONFIG_USB4_DEBUGFS_MARGINING=y set in your kernel .config? I ask because the driver exposes the AMD retimer from the dock upstream port side too. It is okay if you have that set, but if not, then something is not right.

Regarding the missing network adapter, what happened there is that the USB4 link did not come up at all, so all you got is the USB 2.x wires and devices. You see them as "High Speed" and "Full Speed" in the dmesg. Now, this is something outside of what software can fix. Typically it's the Power Delivery firmware that handles this, but in most cases simply unplugging and replugging, or flipping the cable, makes it work.
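
A quick way to spot this "USB 2.x only" situation in a dmesg is to tally the speeds at which devices enumerated. The Python sketch below is illustrative only: the helper names are made up, and it assumes enumeration lines follow the form quoted in comment 22 (`usb 7-1: new high-speed USB device number 2 using xhci_hcd`) and that faster devices are reported with a speed string starting with `SuperSpeed`.

```python
import re
from collections import Counter

# Matches enumeration lines such as the one quoted in comment 22:
#   usb 7-1: new high-speed USB device number 2 using xhci_hcd
NEW_DEVICE = re.compile(r"usb (?P<dev>[\d.-]+): new (?P<speed>.+?) USB device number")


def usb_speeds(dmesg_text):
    """Count newly enumerated USB devices per reported speed."""
    return Counter(m["speed"] for m in NEW_DEVICE.finditer(dmesg_text))


def only_usb2(dmesg_text):
    """True if devices enumerated but none of them at SuperSpeed or faster."""
    speeds = usb_speeds(dmesg_text)
    return bool(speeds) and not any(s.startswith("SuperSpeed") for s in speeds)
```

Running `only_usb2(open("dmesg.out").read())` on a capture where the dock's devices show up only as high-speed or full-speed would return `True`, matching the case where the USB4 link never came up.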
Comment 30 Mario Limonciello (AMD) 2025-03-06 17:24:22 UTC
> I'm leaving the graphics side for Mario to comment on as I'm not qualified. I
> do see MST and that seem to be problematic in Linux IIRC.

My main worry is whether the retimer scan is the reason for the DPIA issues too. The report that Sean linked was a failure to read an EDID, which got fixed by pushing the scan later.

So if you can reproduce the display issues try with it commented out entirely (maybe we want that for debugging as a module parameter Mika?).

With it commented out if they're cleared up, I think we still need to push it to a delayed work queue.
If they're still happening, then this should be a GPU driver or GPU microcode issue.  We can talk about that on a drm/amd Gitlab issue.

> Now, this is something outside of what software can fix. Typically it's the
> Power Delivery firmware that handles this but most cases simply unplug plug
> or flipping the cable makes it work.

Yup agree.
Comment 31 Mika Westerberg 2025-03-07 10:39:47 UTC
(In reply to Mario Limonciello (AMD) from comment #30)
> > I'm leaving the graphics side for Mario to comment on as I'm not qualified. I
> > do see MST and that seem to be problematic in Linux IIRC.
> 
> My main worry is if the retimer scan is the reason for the DPIA issues too. 
> The report that Sean linked in was a failing to read an EDID which got fixed
> by pushing the scan later.
> 
> So if you can reproduce the display issues try with it commented out
> entirely (maybe we want that for debugging as a module parameter Mika?).

I'm not convinced that retimer scanning has anything to do with that issue. It does affect timing when the device router is enumerated and that's the issue with Pluggable, they have implemented some sort of non-standard timeout there.

The other issue happens long after device router is already enumerated and the DP tunnels are established. The patch may seem to "solve" the issue but it probably just hides the real one and this is what we need to understand.

Therefore I suggest opening a separate issue for that one with the appropriate full dmesg and the repro steps. CC me too so I can take a look. We can defer this fix until we understand the real problem, but let's keep the two separate.
Comment 32 Thomas 2025-03-07 20:38:12 UTC
>Do those same failures happen in the case that the retimer scan was totally
>commented out?
Yes.  Same issues happen when testing the HDMI port of the dock with that patch applied.

> Do you have CONFIG_USB4_DEBUGFS_MARGINING=y set in your kernel .config?
No, CONFIG_USB4_DEBUGFS_MARGINING doesn't show up in /proc/config.gz 
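
For anyone repeating this check, an option can be absent from /proc/config.gz in two different ways: listed as "is not set", or not mentioned at all. The Python sketch below tells those cases apart. The function names are illustrative, and reading /proc/config.gz assumes the kernel was built with CONFIG_IKCONFIG_PROC.

```python
import gzip
import re


def kconfig_state(config_text, option):
    """Return the option's value (e.g. 'y' or 'm'), 'not set', or None if never mentioned."""
    for line in config_text.splitlines():
        line = line.strip()
        m = re.fullmatch(rf"{re.escape(option)}=(.*)", line)
        if m:
            return m.group(1)
        if line == f"# {option} is not set":
            return "not set"
    return None


def running_kconfig_state(option, path="/proc/config.gz"):
    """Look the option up in the running kernel's config."""
    with gzip.open(path, "rt") as fh:
        return kconfig_state(fh.read(), option)
```

`running_kconfig_state("CONFIG_USB4_DEBUGFS_MARGINING")` returning `None` corresponds to the "doesn't show up" result reported above.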

Tried a few more things to see if it would provide clearer details.  

In Linux, when the monitor connected to the dock is detected correctly (which doesn't happen every time), setting the resolution below 4k makes the monitor display output. Before the resolution was set below 4k, the monitor would detect the HDMI signal over and over without displaying video output.

Booted to Windows and reproduced the issue of the monitor blinking on and off with the HDMI output on the dock. It did work after unplugging the dock and plugging it back in, though. I didn't check the resolution; if that is an important detail, I can get it.

Overall, both the Skip retimer scanning and 0001-thunderbolt patches cause the dock to be correctly detected. 

I'm not sure if I should pursue -- or how to pursue -- the other issues with video. They seem sporadic. Amdgpu doesn't seem entirely stable, so maybe video issues are expected.
Comment 33 Mario Limonciello (AMD) 2025-03-07 21:02:16 UTC
OK considering that I think this patch is correct, I feel we shouldn't dig any further from a CM perspective at this time.
Comment 34 Mika Westerberg 2025-03-12 09:33:30 UTC
Just sent the patch upstream:

https://lore.kernel.org/linux-usb/20250312092603.3666723-1-mika.westerberg@linux.intel.com/
