Bug 217737

Summary: [Thunderbolt][Intel Alder Lake] "timeout enabling lane bonding" with BlackMagic UltraStudio device
Product: Drivers
Component: Other
Status: NEW
Severity: normal
Priority: P3
Hardware: Intel
OS: Linux
Kernel Version:
Subsystem:
Regression: No
Bisected commit-id:
Reporter: Marek Šanta (teslan223)
Assignee: drivers_other
CC: mika.westerberg
Attachments:
- dmesg outputs, various scenarios
- dmesg output kernel 6.5-rc3
- Retry with single lane if lane bonding fails (draft)
- Check that lane 1 is in CL0

Description Marek Šanta 2023-07-31 11:13:15 UTC
Created attachment 304742
dmesg outputs, various scenarios

Device: Intel NUC (NUC12WSHi5)
Thunderbolt Device: https://www.blackmagicdesign.com/products/ultrastudio/techspecs/W-DLUS-13
Tested kernel: 6.4.7


An issue appears on an Intel 12th-gen CPU when the Thunderbolt device is connected:
"timeout enabling lane bonding"

The full dmesg is in the attachment as the dmesg_100ms file. There is also a dmesg_100ms_dyndbg file, which is the dmesg output with thunderbolt.dyndbg='+p' enabled.

Interesting points:

- I tried modifying the kernel code (drivers/thunderbolt/switch.c, line 2790) to use a 10000 ms timeout instead of 100 ms, but the issue persists. However, dmesg without dyndbg then prints more information (dmesg_10000ms file in the attachment zip)

- The same device works well with a 10th-gen CPU (on NUC10i7FNHN) with the same kernel build and driver version. It also works well on numerous 11th-gen NUCs and laptops with older kernels (I never had this issue before)

I am open to assistance, and it is also possible to borrow a Blackmagic UltraStudio from us.
Comment 1 Marek Šanta 2023-07-31 11:22:06 UTC
Created attachment 304743
dmesg output kernel 6.5-rc3

The dmesg output with kernel 6.5-rc3 is again slightly different, so I thought it might also be useful.
Comment 2 Mika Westerberg 2023-08-07 08:24:03 UTC
Can you check, on the working system, the values of /sys/bus/thunderbolt/devices/0-1/rx_lanes and rx_speed (and similarly tx_lanes and tx_speed)?
Comment 3 Marek Šanta 2023-08-07 08:48:51 UTC
Sure

rx_lanes: 1
rx_speed: 20.0 Gb/s
tx_lanes: 1
tx_speed: 20.0 Gb/s
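
A minimal sketch for collecting these four values in one go, assuming the device sits at /sys/bus/thunderbolt/devices/0-1 as in comment 2. The helper name read_link_attrs is hypothetical; only the sysfs attribute names come from this thread.

```python
from pathlib import Path

# Attribute names as given in comment 2; the default path assumes device 0-1.
ATTRS = ("rx_lanes", "rx_speed", "tx_lanes", "tx_speed")


def read_link_attrs(device_dir="/sys/bus/thunderbolt/devices/0-1"):
    """Return {attribute: value} for the link attributes of one device.

    Attributes that are missing or unreadable map to None.
    """
    result = {}
    for name in ATTRS:
        try:
            result[name] = (Path(device_dir) / name).read_text().strip()
        except OSError:
            result[name] = None
    return result


if __name__ == "__main__":
    for name, value in read_link_attrs().items():
        print(f"{name}: {value}")
```

Running it on both the working and the failing machine gives output in the same "name: value" form as above, which makes the two easy to compare.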
Comment 4 Mika Westerberg 2023-08-07 09:03:48 UTC
Okay that makes sense, so the connection manager firmware does not bond the lanes (rx/tx_lanes == 1). I'll talk to our firmware people and ask if this is some sort of workaround.
Comment 5 Marek Šanta 2023-08-08 08:14:43 UTC
Thanks!

So I don't know if this is the right place to discuss it in that case, but if this is a firmware issue and it (hopefully) gets fixed, how could Connection Manager FW be updated? BIOS? New firmware blob in linux-firmware? Do we have any hope?
Comment 6 Mika Westerberg 2023-08-08 08:20:12 UTC
The connection manager firmware is part of the Thunderbolt firmware and can be upgraded through Linux. However, this is most likely not a firmware issue but a Linux issue, so any fix will be added to the kernel driver.
Comment 7 Mario Limonciello (AMD) 2023-08-09 11:37:14 UTC
*** Bug 217617 has been marked as a duplicate of this bug. ***
Comment 8 Mika Westerberg 2023-08-21 10:09:32 UTC
Created attachment 304920
Retry with single lane if lane bonding fails (draft)

Can you try the attached patch? With USB4, the spec says that after lane bonding fails the link gets disconnected, and this is not currently handled properly in the driver. The expectation is that the first attempt fails with a lane bonding timeout and the second one succeeds with a single lane link. The patch is not yet finalized, so there might be some corner cases it is missing, but at least this gives us some indication of whether this is the right approach.
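
The expected sequence can be modeled roughly as follows. This is only an illustrative sketch of the behavior described here, not the attached patch: the link object, its enable_bonding() and enable_single_lane() methods, and the BondingTimeout exception are all hypothetical names that do not exist in the kernel driver.

```python
class BondingTimeout(Exception):
    """Raised by the hypothetical link object when lane bonding times out."""


def bring_up_link(link):
    """Try a bonded (dual lane) link first, then fall back to a single lane.

    `link` is a hypothetical object providing enable_bonding() and
    enable_single_lane(). Returns the number of lanes in use.
    """
    try:
        link.enable_bonding()
        return 2
    except BondingTimeout:
        # After a failed bonding attempt the link is disconnected, so the
        # second attempt starts over and does not try to bond again.
        link.enable_single_lane()
        return 1
```

With a device like the one in this report, the model takes the fallback branch: the first call times out and the second one brings the link up with one lane.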
Comment 9 Marek Šanta 2023-08-21 12:36:21 UTC
Perfect!
After applying the patch on top of the 6.4.11 stable, I confirm that the UltraStudio devices work well on Intel Alder Lake and AMD Zen3+ USB4.

After the initial "timeout enabling lane bonding" and a few seconds the device reconnects successfully and drivers are loaded.
Comment 10 Mika Westerberg 2023-08-22 05:16:49 UTC
Created attachment 304923
Check that lane 1 is in CL0

I realized that lane 1 is probably not connected at all on that device. Can you try the attached simpler patch as well? With this there should be no lane bonding timeouts, as bonding is not even attempted.
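
The idea behind the simpler patch can be sketched as a pre-check. Again this is a hedged model rather than the patch itself: the function names and the use of plain strings for lane states are assumptions made for illustration, and only the "lane 1 must be in CL0" condition comes from the patch title.

```python
def should_bond(lane0_state, lane1_state):
    """Return True only if both lanes report CL0 (states are plain strings here)."""
    return lane0_state == "CL0" and lane1_state == "CL0"


def configure_width(lane0_state, lane1_state):
    """Pick the link width without ever attempting a bond that cannot succeed."""
    if should_bond(lane0_state, lane1_state):
        return "dual"
    # Lane 1 is not in CL0 (for example, not connected): skip bonding
    # entirely, so no bonding timeout can occur.
    return "single"
```

Unlike the retry approach, this avoids the failed first attempt and the disconnect that follows it.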
Comment 11 Marek Šanta 2023-08-22 06:26:00 UTC
Not sure if I should apply this patch on top of the earlier one or try this one only.
Comment 12 Marek Šanta 2023-08-22 08:11:45 UTC
OK, so I applied the patch on top of the previous patch even though you marked the previous one as obsolete. But I guess you needed this piece of information anyway:

Indeed, dmesg wrote out "lane 1 is not connected" and the device worked straight away without reconnecting!

Just for the record, I had to insert the lines manually, as the patch does not apply to stable (TB_LINK_WIDTH_DUAL is not used in the current stable).
Comment 13 Mika Westerberg 2023-08-22 09:10:41 UTC
Hi, correct, I meant that you should apply only this patch (and drop the previous one). Thanks for checking! I will clean it up, give it some more testing, and send it upstream (probably after the merge window closes).
Comment 14 Marek Šanta 2023-08-22 11:44:07 UTC
Thanks! This was very helpful!