218795 – USB4 / Thunderbolt + AMD: unstable and slow link (many uncorrectable errors)



Status:	NEW

Alias:	None

Product:	Drivers
Classification:	Unclassified
Component:	USB
Hardware:	All Linux

Importance:	P3 normal
Assignee:	Default virtual assignee for Drivers/USB

URL:
Keywords:

Depends on:
Blocks:

Reported:	2024-04-30 11:41 UTC by Guilhem Lettron
Modified:	2024-11-20 11:11 UTC
CC List:	2 users

See Also:
Kernel Version:
Subsystem:
Regression:	No
Bisected commit-id:

Attachments
logs with thunderbolt debug (72.08 KB, text/plain) 2024-04-30 11:41 UTC, Guilhem Lettron
tbtrace dump during connection and crash (138.56 KB, text/plain) 2024-11-05 17:17 UTC, Eduard Kachur
trace log with thunderbolt and pci (2.27 MB, text/plain) 2024-11-05 17:36 UTC, Eduard Kachur

Description Guilhem Lettron 2024-04-30 11:41:54 UTC

Created attachment 306247
logs with thunderbolt debug

Context
Laptop: Asus UM5302TA
CPU: AMD Ryzen 7 6800U
eGPU case: Razer Core X Chroma, Thunderbolt 3
GPU: Nvidia + open driver 550.78
OS: Ubuntu 24.04, kernel 6.7 / 6.8 / 6.9

When plugging in the Thunderbolt cable:
```
[...]
pci 0000:35:00.0: 2.000 Gb/s available PCIe bandwidth, limited by 2.5 GT/s PCIe x1 link at 0000:00:04.1 (capable of 31.504 Gb/s with 8.0 GT/s PCIe x4 link)
[...]
```
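The two figures in that message follow from the per-lane transfer rate, the line-encoding overhead, and the lane count. A minimal sketch of the arithmetic, assuming 8b/10b encoding at 2.5 GT/s, 128b/130b at 8.0 GT/s, and per-lane rounding down to whole Mb/s (the helper names are hypothetical, not kernel functions):

```python
# Line encoding per transfer rate: (payload bits, total bits on the wire).
ENCODING = {
    2.5: (8, 10),     # 2.5 GT/s uses 8b/10b
    5.0: (8, 10),     # 5.0 GT/s uses 8b/10b
    8.0: (128, 130),  # 8.0 GT/s uses 128b/130b
}


def link_bandwidth_mbps(gt_per_s: float, lanes: int) -> int:
    """Usable bandwidth in Mb/s, rounding each lane down to a whole Mb/s."""
    payload, total = ENCODING[gt_per_s]
    per_lane = int(gt_per_s * 1000) * payload // total
    return per_lane * lanes


def format_gbps(mbps: int) -> str:
    """Format Mb/s the way the log line does, e.g. 31504 -> '31.504 Gb/s'."""
    return f"{mbps // 1000}.{mbps % 1000:03d} Gb/s"


print(format_gbps(link_bandwidth_mbps(2.5, 1)))  # 2.000 Gb/s
print(format_gbps(link_bandwidth_mbps(8.0, 4)))  # 31.504 Gb/s
```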

```
nvidia 0000:35:00.0: PCIe Bus Error: severity=Uncorrectable (Non-Fatal), type=Transaction Layer, (Requester ID)
nvidia 0000:35:00.0:   device [10de:2504] error status/mask=00100000/00000000
nvidia 0000:35:00.0:    [20] UnsupReq               (First)
nvidia 0000:35:00.0: AER:   TLP Header: 40001001 0000000c ad08000c f7f7f7f7
```
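The `error status` value in that output is a bitmask, and the bracketed number on the following line is the index of the bit that is set. A minimal sketch of the decoding, assuming the usual AER Uncorrectable Error Status bit positions and the short names the kernel prints; the table is deliberately partial and the helper is hypothetical:

```python
# Partial map of AER Uncorrectable Error Status bits to the short names
# seen in kernel logs. Assumed from the standard AER register layout.
UNCORRECTABLE_BITS = {
    4: "DLP",
    5: "SDES",
    12: "TLP",
    13: "FCP",
    14: "CmpltTO",
    15: "CmpltAbrt",
    16: "UnxCmplt",
    17: "RxOF",
    18: "MalfTLP",
    19: "ECRC",
    20: "UnsupReq",
    21: "ACSViol",
}


def decode_status(status: int, names: dict) -> list:
    """Return the names of all bits set in a 32-bit status value."""
    return [
        names.get(bit, f"bit{bit}")  # fall back to the raw bit index
        for bit in range(32)
        if (status >> bit) & 1
    ]


print(decode_status(0x00100000, UNCORRECTABLE_BITS))  # ['UnsupReq']
```

A mask of `00000000` means none of these errors are masked, so every one that occurs gets reported.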

I tried many cables (TB3, TB4, etc.).

Comment 1 Eduard Kachur 2024-11-05 17:17:24 UTC

Created attachment 307144
tbtrace dump during connection and crash

Comment 2 Eduard Kachur 2024-11-05 17:36:28 UTC

Created attachment 307145
trace log with thunderbolt and pci

I have a similar case with an eGPU and VFIO passthrough into a Windows VM, which crashes.

Laptop specs
HP Zbook Firefly G10 A 
Ryzen 7 7840 HS
Wikingoo Q1L box with JHL6340. I also bought and tried a Wikingoo P1-60W-M, which the manufacturer says uses a JHL7440, although lspci identifies it as JHL7540.
Nvidia Quadro P1000
Ubuntu 24.10 Kernel 6.11

The system logs lots of:
[ 6323.581954] pcieport 0000:00:04.1: AER: Correctable error message received from 0000:64:01.0
[ 6323.581966] pcieport 0000:64:01.0: PCIe Bus Error: severity=Correctable, type=Data Link Layer, (Receiver ID)
[ 6323.581969] pcieport 0000:64:01.0:   device [8086:15da] error status/mask=00000080/00002000
[ 6323.581973] pcieport 0000:64:01.0:    [ 7] BadDLLP     

And it eventually crashes the VM with:
[ 6360.466620] pcieport 0000:00:04.1: AER: Multiple Uncorrectable (Non-Fatal) error message received from 0000:65:00.0
[ 6360.466648] vfio-pci 0000:65:00.0: PCIe Bus Error: severity=Uncorrectable (Non-Fatal), type=Transaction Layer, (Requester ID)
[ 6360.466652] vfio-pci 0000:65:00.0:   device [10de:1cb1] error status/mask=00004000/00000000
[ 6360.466655] vfio-pci 0000:65:00.0:    [14] CmpltTO                (First)
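To compare enclosures, it can help to count how often each device reports each kind of error. A minimal sketch that tallies `PCIe Bus Error` lines from kernel log text; the regular expression is an assumption based only on the line format shown above, and `tally_aer` is a hypothetical helper:

```python
import re
from collections import Counter

# Matches e.g. "pcieport 0000:64:01.0: PCIe Bus Error: severity=..., type=...,"
ERROR_RE = re.compile(
    r"(?P<driver>\S+) "
    r"(?P<bdf>[0-9a-f]{4}:[0-9a-f]{2}:[0-9a-f]{2}\.[0-7]): "
    r"PCIe Bus Error: severity=(?P<severity>[^,]+), type=(?P<layer>[^,]+)"
)


def tally_aer(lines):
    """Count AER reports per (device, severity, layer)."""
    counts = Counter()
    for line in lines:
        match = ERROR_RE.search(line)
        if match:
            counts[(match["bdf"], match["severity"], match["layer"])] += 1
    return counts


sample = [
    "[ 6323.581966] pcieport 0000:64:01.0: PCIe Bus Error: "
    "severity=Correctable, type=Data Link Layer, (Receiver ID)",
    "[ 6360.466648] vfio-pci 0000:65:00.0: PCIe Bus Error: "
    "severity=Uncorrectable (Non-Fatal), type=Transaction Layer, (Requester ID)",
]

for key, count in tally_aer(sample).items():
    print(key, count)
```

Feeding it the full kernel log instead of `sample` gives a per-device count that can be compared between the JHL6340 and JHL7440 boxes.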

The box with the newer JHL7440 doesn't produce as many BadDLLP errors, but it also crashes with CmpltTO.
Without passthrough and Nvidia driver on host system there are still lots of BadDLLP errors, but I haven't seen a crash.

I tried pcie_aspm=off with those boxes, but in that case they are not initialized, either on hotplug or on cold boot. An Intel-based system shows the same behaviour.
pcie_aspm=force causes some additional errors on the PCIe bus.

A possible workaround for me to get a stable system with passthrough is to use pci=nommconf, but this causes graphical glitches on the host GPU during 3D rendering.

Comment 3 Eduard Kachur 2024-11-06 10:29:23 UTC

I guess the person here is in the same boat:
https://askubuntu.com/questions/1531087/pcie-bus-error-thunderbolt-4-bridge

Comment 4 Mario Limonciello (AMD) 2024-11-07 17:30:29 UTC

> pci 0000:35:00.0: 2.000 Gb/s available PCIe bandwidth, limited by 2.5 GT/s
> PCIe x1 link at 0000:00:04.1 (capable of 31.504 Gb/s with 8.0 GT/s PCIe x4
> link)

It's worth mentioning that this message is meaningless in the context of USB4.  There were various discussions on the mailing lists about changing this, but it never landed anywhere.

https://lore.kernel.org/linux-usb/20231103190758.82911-1-mario.limonciello@amd.com/

See specifically patch 8 for more context and the specs that indicate why it behaves this way.

At least with AMD dGPUs put in eGPU enclosures this was causing problems for amdgpu because it used pcie_bandwidth_available().  We've changed this in amdgpu to look at the link partner to exclude this causing issues.

https://github.com/torvalds/linux/blob/v6.12-rc5/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c#L5903
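The difference between the two approaches can be shown with a toy model. `pcie_bandwidth_available()` walks up from the device and reports the slowest link on the path, so a tunneled link that advertises 2.5 GT/s x1 becomes the reported limit. The sketch below is only an illustration under stated assumptions: the `Link` type, both helpers, and the 8.0 GT/s x4 figure for the GPU's own link are hypothetical, and this is not the amdgpu code.

```python
from dataclasses import dataclass


@dataclass
class Link:
    name: str
    mbps: int  # bandwidth advertised by this link, in Mb/s


def min_over_path(path):
    """Slowest link between the device and the root (walk-up approach)."""
    return min(path, key=lambda link: link.mbps)


def link_partner_only(path):
    """Only the link directly upstream of the device (path[0])."""
    return path[0]


# Ordered from the device towards the root port.
path = [
    Link("GPU to enclosure bridge", 31504),       # assumed 8.0 GT/s x4
    Link("tunneled link at 0000:00:04.1", 2000),  # advertises 2.5 GT/s x1
]

print(min_over_path(path).name)      # tunneled link at 0000:00:04.1
print(link_partner_only(path).name)  # GPU to enclosure bridge
```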

Comment 5 Eduard Kachur 2024-11-07 17:45:35 UTC

Just in case, I also ordered an ADT-UT3G and will keep you posted once I am able to check for errors and crashes with it.

Comment 6 Eduard Kachur 2024-11-08 08:43:49 UTC

Anyway, is there anything that can be done for PCIe errors except pci=nommconf?  GPU driver inside VM seems to be crashing periodically.

Comment 7 Mario Limonciello (AMD) 2024-11-08 15:41:06 UTC

> Anyway, is there anything that can be done for PCIe errors except
> pci=nommconf?

If you want to ignore the errors you can use "pci=noaer".
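When testing combinations of these parameters, it is easy to lose track of which ones the running kernel was actually booted with. A minimal sketch that pulls the `pci=` and `pcie_aspm=` settings out of a kernel command line string; `pci_options` is a hypothetical helper and the example command line is made up:

```python
def pci_options(cmdline: str) -> dict:
    """Extract pci= (comma-separated) and pcie_aspm= settings."""
    opts = {"pci": [], "pcie_aspm": None}
    for token in cmdline.split():
        key, sep, value = token.partition("=")
        if not sep:
            continue  # bare flag without a value, e.g. "quiet"
        if key == "pci":
            opts["pci"].extend(value.split(","))
        elif key == "pcie_aspm":
            opts["pcie_aspm"] = value
    return opts


# Made-up example; on a live system the string could be read from /proc/cmdline.
example = "BOOT_IMAGE=/vmlinuz root=/dev/sda1 quiet pci=noaer,nommconf pcie_aspm=off"
print(pci_options(example))
# {'pci': ['noaer', 'nommconf'], 'pcie_aspm': 'off'}
```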

Comment 8 Eduard Kachur 2024-11-08 19:19:35 UTC

(In reply to Mario Limonciello (AMD) from comment #7)
> If you want to ignore the errors you can use "pci=noaer".

Previously it didn't work with the newer box: the VM was silently crashing without any PCIe errors in the console (which is expected), so I didn't bother trying it with the older one. Surprisingly, it works well here.

Thanks!

Comment 9 Eduard Kachur 2024-11-20 11:11:59 UTC

So, I've got the ADT-UT3G: no errors, no crashes. Is there anything I can do to help debug the older boxes?
