Created attachment 306247 [details]
logs with thunderbolt debug

Context
laptop : Asus UM5302TA
CPU : AMD Ryzen 7 6800U
egpu case : Razer Core X Chroma, thunderbolt 3
GPU : nvidia + opendriver 550.78
Ubuntu 24.04, kernel 6.7 / 6.8 / 6.9

when plugging thunderbolt cable:
```
[...]
pci 0000:35:00.0: 2.000 Gb/s available PCIe bandwidth, limited by 2.5 GT/s PCIe x1 link at 0000:00:04.1 (capable of 31.504 Gb/s with 8.0 GT/s PCIe x4 link)
[...]
```
```
nvidia 0000:35:00.0: PCIe Bus Error: severity=Uncorrectable (Non-Fatal), type=Transaction Layer, (Requester ID)
nvidia 0000:35:00.0:   device [10de:2504] error status/mask=00100000/00000000
nvidia 0000:35:00.0:    [20] UnsupReq               (First)
nvidia 0000:35:00.0: AER:   TLP Header: 40001001 0000000c ad08000c f7f7f7f7
```

I tried many cables (tb3, tb4, etc).

Created attachment 307144 [details] tbtrace dump during connection and crash
Created attachment 307145 [details]
trace log with thunderbolt and pci

I have a similar case with an eGPU and VFIO passthrough into a Windows VM, which crashes.

Laptop specs:
HP Zbook Firefly G10 A
Ryzen 7 7840 HS
Wikingoo Q1L box with JHL6340. I also bought and tried a Wikingoo P1-60W-M, which the manufacturer says uses a JHL7440, but lspci names it JHL7540.
Nvidia Quadro P1000
Ubuntu 24.10
Kernel 6.11

The system gives lots of:
[ 6323.581954] pcieport 0000:00:04.1: AER: Correctable error message received from 0000:64:01.0
[ 6323.581966] pcieport 0000:64:01.0: PCIe Bus Error: severity=Correctable, type=Data Link Layer, (Receiver ID)
[ 6323.581969] pcieport 0000:64:01.0:   device [8086:15da] error status/mask=00000080/00002000
[ 6323.581973] pcieport 0000:64:01.0:    [ 7] BadDLLP

And eventually the VM crashes with:
[ 6360.466620] pcieport 0000:00:04.1: AER: Multiple Uncorrectable (Non-Fatal) error message received from 0000:65:00.0
[ 6360.466648] vfio-pci 0000:65:00.0: PCIe Bus Error: severity=Uncorrectable (Non-Fatal), type=Transaction Layer, (Requester ID)
[ 6360.466652] vfio-pci 0000:65:00.0:   device [10de:1cb1] error status/mask=00004000/00000000
[ 6360.466655] vfio-pci 0000:65:00.0:    [14] CmpltTO                (First)

The box with the newer JHL7440 doesn't produce as many BadDLLP errors, but it also crashes with CmpltTO.
Without passthrough, and with the Nvidia driver on the host system, there are still lots of BadDLLP errors, but I haven't seen a crash.

I tried pcie_aspm=off with those boxes, but in that case they are not initialized, either on hotplug or on cold boot. An Intel-based system shows the same behaviour.
pcie_aspm=force causes some additional errors on the PCIe bus.

A possible workaround for me to get a stable system with passthrough is pci=nommconf, but this causes graphical glitches on the host GPU during 3D rendering.

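For readers comparing the three AER reports quoted so far, here is a minimal sketch of how the `error status` hex value maps to the bit name printed on the following log line. The bit tables are a partial subset written from the PCIe AER register layout as I understand it, and `decode_aer_status` is a hypothetical helper, not a kernel or library API.

```python
# Partial bit-name tables for the AER status registers (assumed subset,
# using the short names that appear in the kernel log lines above).
UNCORRECTABLE_BITS = {
    4: "DLP",
    12: "TLP",
    14: "CmpltTO",
    15: "CmpltAbrt",
    16: "UnxCmplt",
    18: "MalfTLP",
    20: "UnsupReq",
}
CORRECTABLE_BITS = {
    0: "RxErr",
    6: "BadTLP",
    7: "BadDLLP",
    8: "Rollover",
    12: "Timeout",
    13: "NonFatalErr",
}


def decode_aer_status(status_hex, names):
    """Return (bit, name) pairs for every bit set in a hex status string."""
    status = int(status_hex, 16)
    return [
        (bit, names.get(bit, "unknown"))
        for bit in range(32)
        if status & (1 << bit)
    ]


# Status values copied from the logs in this report.
print(decode_aer_status("00100000", UNCORRECTABLE_BITS))  # nvidia 0000:35:00.0
print(decode_aer_status("00004000", UNCORRECTABLE_BITS))  # vfio-pci 0000:65:00.0
print(decode_aer_status("00000080", CORRECTABLE_BITS))    # pcieport 0000:64:01.0
```

The same helper can be applied to the mask field: the `00002000` mask on the pcieport line corresponds to bit 13 in the correctable table.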
I guess the person here is in the same boat: https://askubuntu.com/questions/1531087/pcie-bus-error-thunderbolt-4-bridge
> pci 0000:35:00.0: 2.000 Gb/s available PCIe bandwidth, limited by 2.5 GT/s
> PCIe x1 link at 0000:00:04.1 (capable of 31.504 Gb/s with 8.0 GT/s PCIe x4
> link)

It's worth mentioning that this message is meaningless in the context of USB4. There were various discussions on the mailing lists about changing this, but it never landed anywhere.

https://lore.kernel.org/linux-usb/20231103190758.82911-1-mario.limonciello@amd.com/

See specifically patch 8 for more context and the specs that indicate why it behaves this way.

At least with AMD dGPUs put in eGPU enclosures this was causing problems for amdgpu, because it used pcie_bandwidth_available(). We've changed amdgpu to look at the link partner instead, so that this no longer causes issues.

https://github.com/torvalds/linux/blob/v6.12-rc5/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c#L5903

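As an aside, the two figures in the quoted message can be reproduced from the advertised link speed, width, and line encoding alone, which is why they say nothing about the actual tunnelled throughput. The sketch below assumes 8b/10b encoding at 2.5 and 5.0 GT/s, 128b/130b at 8.0 GT/s and above, and integer rounding per lane in Mb/s. `link_bandwidth_mbps` is a hypothetical helper, not a kernel function.

```python
# Encoding efficiency per link speed in MT/s: (payload bits, total bits).
# Assumption: 8b/10b for Gen1/Gen2, 128b/130b for Gen3/Gen4.
ENCODING = {
    2500: (8, 10),
    5000: (8, 10),
    8000: (128, 130),
    16000: (128, 130),
}


def link_bandwidth_mbps(speed_mts, lanes):
    """Usable bandwidth in Mb/s for a link of the given speed and width."""
    payload, total = ENCODING[speed_mts]
    per_lane = speed_mts * payload // total  # integer Mb/s per lane
    return per_lane * lanes


# "limited by 2.5 GT/s PCIe x1 link"
print(link_bandwidth_mbps(2500, 1) / 1000)  # 2.0
# "capable of ... with 8.0 GT/s PCIe x4 link"
print(link_bandwidth_mbps(8000, 4) / 1000)  # 31.504
```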
Just in case, I also ordered an ADT-UT3G and will keep you posted once I am able to check for errors and crashes with it.
Anyway, is there anything that can be done about the PCIe errors other than pci=nommconf? The GPU driver inside the VM seems to be crashing periodically.
> Anyway, is there anything that can be done for PCIe errors except
> pci=nommconf?

If you want to ignore the errors you can use "pci=noaer".

(In reply to Mario Limonciello (AMD) from comment #7)
> If you want to ignore the errors you can use "pci=noaer".

Previously it didn't work with the newer box: the VM was silently crashing without any PCIe errors in the console (which is expected), so I didn't bother to try it with the older one, but surprisingly it works well here. Thanks!

So, I've got the ADT-UT3G: no errors, no crashes. Is there anything I can do to help debug the older boxes?