Bug 201527 - Thunderbolt 3 PCI Bridge Fails to Receive Proper PCI Resources
Summary: Thunderbolt 3 PCI Bridge Fails to Receive Proper PCI Resources
Status: RESOLVED INVALID
Alias: None
Product: ACPI
Classification: Unclassified
Component: BIOS
Hardware: Intel Linux
Importance: P1 normal
Assignee: acpi_bios
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2018-10-26 16:43 UTC by Robert Strube
Modified: 2020-12-10 19:04 UTC
CC List: 8 users

See Also:
Kernel Version: 4.19
Subsystem:
Regression: No
Bisected commit-id:


Attachments
dmesg log with ACPI enabled (97.90 KB, text/plain)
2018-10-26 16:43 UTC, Robert Strube
lspci with ACPI enabled (2.32 KB, text/plain)
2018-10-26 16:44 UTC, Robert Strube
dmesg log with ACPI disabled (65.69 KB, text/plain)
2018-10-26 16:45 UTC, Robert Strube
lspci with ACPI disabled (2.42 KB, text/plain)
2018-10-26 16:45 UTC, Robert Strube
dmesg log with ACPI enabled *default bios settings for TB* (102.43 KB, text/plain)
2018-10-26 18:21 UTC, Robert Strube
dmesg log with ACPI enabled *default bios settings for TB* CLEAN (93.19 KB, text/plain)
2018-10-26 18:37 UTC, Robert Strube
dmesg log eGPU AMD RX 580 attached ACPI enabled (98.23 KB, text/plain)
2018-11-29 22:41 UTC, Robert Strube
lspci eGPU AMD RX 580 attached ACPI enabled (2.32 KB, text/plain)
2018-11-29 22:47 UTC, Robert Strube
dmesg log with kernel 5.0.x with nvidia eGPU (89.20 KB, text/plain)
2019-04-25 00:14 UTC, Robert Strube
lspci kernel 5.0.x with nvidia eGPU (2.60 KB, text/plain)
2019-04-25 00:15 UTC, Robert Strube

Description Robert Strube 2018-10-26 16:43:48 UTC
Created attachment 279157 [details]
dmesg log with ACPI enabled

System: Dell XPS 9575 (2 in 1)
Processor: i7-8705G CPU
Internal iGPU: Intel 630
Discrete GPU: Vega M (Polaris 22)
Kernel: 4.19

Description:

I was working with amdgpu developers to try to get a Polaris 10 (RX 580) eGPU working over Thunderbolt 3 when we discovered some serious problems with PCI resource allocation to the Thunderbolt 3 PCI bridges.  These PCI resource issues prevent the eGPU from becoming initialized.

This issue is not tied directly to using eGPUs, as I can demonstrate the problem without the use of any eGPU, simply by booting the system without any thunderbolt devices attached to it.

After lots of trial and error I determined that if I pass in the acpi=off kernel boot parameter, the Vega M GPU becomes disabled - and although the PCI resource allocation issue is *still* present, the eGPU can become initialized.  This seems more like a coincidence rather than a proper fix to the problem.  The amdgpu devs seem to think that having the Vega M disabled frees up certain address ranges allowing the eGPU to become initialized.

One other thing worth mentioning is that I did try compiling a custom kernel with the Vega M device IDs commented out to see if this would help and it did *not* help the situation.  The Vega M was indeed not initialized at boot, but the PCI resource issues remained.

I'm attaching dmesg logs for 4.19 with ACPI enabled and disabled.  I'm also attaching lspci -vv -t -nn reports.

The information in the dmesg (with ACPI enabled) that appears relevant is this:

Problems with ACPI BIOS:

[  152.409452] ACPI BIOS Error (bug): Failure creating [\_GPE.XTBT.SPRT], AE_ALREADY_EXISTS (20180810/dswload2-316)
[  152.409479] No Local Variables are initialized for Method [XTBT]
[  152.409484] Initialized Arguments for Method [XTBT]:  (2 arguments defined for method invocation)
[  152.409486]   Arg0:   00000000280bc52d <Obj>           Integer 0000000000000009
[  152.409500]   Arg1:   00000000464bcc18 <Obj>           Integer 0000000001060002
[  152.409512] ACPI Error: AE_ALREADY_EXISTS, During name lookup/catalog (20180810/psobject-221)
[  152.409522] ACPI Error: Method parse/execution failed \_GPE.XTBT, AE_ALREADY_EXISTS (20180810/psparse-516)
[  152.409537] ACPI Error: Method parse/execution failed \_GPE.XTBT, AE_ALREADY_EXISTS (20180810/psparse-516)
[  152.409555] ACPI Error: Method parse/execution failed \_GPE._E42, AE_ALREADY_EXISTS (20180810/psparse-516)
[  152.409568] ACPI: Marking method _E42 as Serialized because of AE_ALREADY_EXISTS error
[  152.409578] ACPI Error: AE_ALREADY_EXISTS, while evaluating GPE method [_E42] (20180810/evgpe-509)

PCI resource allocation issues:

Note: devices 0000:04:00.0, 0000:05:00.0, 0000:05:01.0, 0000:05:02.0, and 0000:05:04.0 are all Thunderbolt PCI bridges, but device 0000:05:02.0 seems to be the problematic one.

[  152.673753] pci_bus 0000:05: Allocating resources
[  152.673792] pci 0000:05:01.0: bridge window [io  0x1000-0x0fff] to [bus 07-39] add_size 1000
[  152.673802] pci 0000:05:02.0: bridge window [io  0x1000-0x0fff] to [bus 3a] add_size 1000
[  152.673803] pci 0000:05:02.0: bridge window [mem 0x00100000-0x000fffff 64bit pref] to [bus 3a] add_size 200000 add_align 100000
[  152.673813] pci 0000:05:04.0: bridge window [io  0x1000-0x0fff] to [bus 3b-6e] add_size 1000
[  152.673823] pci 0000:04:00.0: bridge window [io  0x1000-0x0fff] to [bus 05-6e] add_size 3000
[  152.673825] pci 0000:04:00.0: BAR 13: assigned [io  0x2000-0x4fff]
[  152.673829] pci 0000:05:02.0: BAR 15: no space for [mem size 0x00200000 64bit pref]
[  152.673830] pci 0000:05:02.0: BAR 15: failed to assign [mem size 0x00200000 64bit pref]
[  152.673831] pci 0000:05:01.0: BAR 13: assigned [io  0x2000-0x2fff]
[  152.673832] pci 0000:05:02.0: BAR 13: assigned [io  0x3000-0x3fff]
[  152.673832] pci 0000:05:04.0: BAR 13: assigned [io  0x4000-0x4fff]
[  152.673834] pci 0000:05:02.0: BAR 15: no space for [mem size 0x00200000 64bit pref]
[  152.673835] pci 0000:05:02.0: BAR 15: failed to assign [mem size 0x00200000 64bit pref]
[  152.673837] pci 0000:05:00.0: PCI bridge to [bus 06]
[  152.673842] pci 0000:05:00.0:   bridge window [mem 0xea000000-0xea0fffff]
[  152.673852] pci 0000:05:01.0: PCI bridge to [bus 07-39]
[  152.673854] pci 0000:05:01.0:   bridge window [io  0x2000-0x2fff]
[  152.673859] pci 0000:05:01.0:   bridge window [mem 0xbc000000-0xd3efffff]
[  152.673863] pci 0000:05:01.0:   bridge window [mem 0x2fb0000000-0x2fcfffffff 64bit pref]
[  152.673870] pci 0000:05:02.0: PCI bridge to [bus 3a]
[  152.673872] pci 0000:05:02.0:   bridge window [io  0x3000-0x3fff]
[  152.673877] pci 0000:05:02.0:   bridge window [mem 0xd3f00000-0xd3ffffff]
[  152.673887] pci 0000:05:04.0: PCI bridge to [bus 3b-6e]
[  152.673889] pci 0000:05:04.0:   bridge window [io  0x4000-0x4fff]
[  152.673894] pci 0000:05:04.0:   bridge window [mem 0xd4000000-0xe9ffffff]
[  152.673898] pci 0000:05:04.0:   bridge window [mem 0x2fd0000000-0x2ff9ffffff 64bit pref]
[  152.673904] pci 0000:04:00.0: PCI bridge to [bus 05-6e]
[  152.673906] pci 0000:04:00.0:   bridge window [io  0x2000-0x4fff]
[  152.673912] pci 0000:04:00.0:   bridge window [mem 0xbc000000-0xea0fffff]
[  152.673915] pci 0000:04:00.0:   bridge window [mem 0x2fb0000000-0x2ff9ffffff 64bit pref]
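
For anyone comparing logs like the one above, the failed assignments can be tallied per device and BAR with a short script. This is only a sketch: the function name failed_assignments and the regular expression are illustrative assumptions based solely on the line format shown in this report, not on any kernel tooling.

```python
import re

# Matches lines such as:
# [  152.673830] pci 0000:05:02.0: BAR 15: failed to assign [mem size 0x00200000 64bit pref]
# The pattern is an assumption derived from the dmesg excerpts in this report.
FAILED_RE = re.compile(
    r"(?:pci|pcieport) (?P<dev>[0-9a-f]{4}:[0-9a-f]{2}:[0-9a-f]{2}\.[0-7]): "
    r"BAR (?P<bar>\d+): failed to assign "
    r"\[(?P<kind>io|mem)\s+size (?P<size>0x[0-9a-f]+)[^\]]*\]"
)


def failed_assignments(dmesg_text):
    """Return {(device, bar): (kind, size_in_bytes, count)} for failed BARs."""
    result = {}
    for line in dmesg_text.splitlines():
        m = FAILED_RE.search(line)
        if not m:
            continue
        key = (m["dev"], int(m["bar"]))
        kind, size = m["kind"], int(m["size"], 16)
        # Keep a running count of how often this device/BAR pair failed.
        count = result.get(key, (kind, size, 0))[2] + 1
        result[key] = (kind, size, count)
    return result


# Four lines copied from the log above.
SAMPLE = """\
[  152.673829] pci 0000:05:02.0: BAR 15: no space for [mem size 0x00200000 64bit pref]
[  152.673830] pci 0000:05:02.0: BAR 15: failed to assign [mem size 0x00200000 64bit pref]
[  152.673831] pci 0000:05:01.0: BAR 13: assigned [io  0x2000-0x2fff]
[  152.673835] pci 0000:05:02.0: BAR 15: failed to assign [mem size 0x00200000 64bit pref]
"""

if __name__ == "__main__":
    for (dev, bar), (kind, size, count) in sorted(failed_assignments(SAMPLE).items()):
        print(f"{dev} BAR {bar}: {kind} size {size:#x} failed {count} time(s)")
```

In practice the input would be the full attached dmesg log rather than the four sample lines.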

It also appears that pcieport has PCI resource allocation issues:

[  193.946376] thunderbolt 0000:06:00.0: stopping RX ring 0
[  193.946388] thunderbolt 0000:06:00.0: disabling interrupt at register 0x38200 bit 12 (0xffffffff -> 0xffffefff)
[  193.946404] thunderbolt 0000:06:00.0: stopping TX ring 0
[  193.946413] thunderbolt 0000:06:00.0: disabling interrupt at register 0x38200 bit 0 (0xffffffff -> 0xfffffffe)
[  193.946421] thunderbolt 0000:06:00.0: control channel stopped
[  193.946516] thunderbolt 0000:06:00.0: freeing RX ring 0
[  193.946527] thunderbolt 0000:06:00.0: freeing TX ring 0
[  193.946542] thunderbolt 0000:06:00.0: shutdown
[  193.985339] pci_bus 0000:05: Allocating resources
[  193.985415] pcieport 0000:05:02.0: bridge window [mem 0x00100000-0x000fffff 64bit pref] to [bus 3a] add_size 200000 add_align 100000
[  193.985458] pcieport 0000:05:02.0: BAR 15: no space for [mem size 0x00200000 64bit pref]
[  193.985462] pcieport 0000:05:02.0: BAR 15: failed to assign [mem size 0x00200000 64bit pref]
[  193.985470] pcieport 0000:05:02.0: BAR 15: no space for [mem size 0x00200000 64bit pref]
[  193.985473] pcieport 0000:05:02.0: BAR 15: failed to assign [mem size 0x00200000 64bit pref]
[  198.333956] pcieport 0000:05:00.0: Refused to change power state, currently in D3

Based on all the feedback I have received so far, this does appear to be a BIOS issue, but I felt it was important to report it in case the kernel developers can come up with a workaround - or perhaps if there is a more direct line of communication with the Dell engineers.

I'm very willing and able to test out any patches you throw at me!

Thanks!
Rob
Comment 1 Robert Strube 2018-10-26 16:44:20 UTC
Created attachment 279159 [details]
lspci with ACPI enabled
Comment 2 Robert Strube 2018-10-26 16:45:12 UTC
Created attachment 279161 [details]
dmesg log with ACPI disabled
Comment 3 Robert Strube 2018-10-26 16:45:35 UTC
Created attachment 279163 [details]
lspci with ACPI disabled
Comment 4 Robert Strube 2018-10-26 18:21:48 UTC
Created attachment 279165 [details]
dmesg log with ACPI enabled *default bios settings for TB*

Hello,

I realized that I had been tweaking many of the Thunderbolt BIOS settings in order to try to get things to work.  Specifically Thunderbolt Adapter Boot Support and Thunderbolt Adapter Pre-Boot Module Support.

I've reset my BIOS to factory defaults, rebooted, and plugged in a TB device.  Now it looks like the PCI resource issues are present for *all* the TB PCI bridges.

Here are the lines from dmesg that look relevant to me:

[   84.162294] pci_bus 0000:05: Allocating resources
[   84.162334] pci 0000:05:01.0: bridge window [io  0x1000-0x0fff] to [bus 07-39] add_size 1000
[   84.162344] pci 0000:05:02.0: bridge window [io  0x1000-0x0fff] to [bus 3a] add_size 1000
[   84.162346] pci 0000:05:02.0: bridge window [mem 0x00100000-0x000fffff 64bit pref] to [bus 3a] add_size 200000 add_align 100000
[   84.162356] pci 0000:05:04.0: bridge window [io  0x1000-0x0fff] to [bus 3b-6e] add_size 1000
[   84.162366] pci 0000:04:00.0: bridge window [io  0x1000-0x0fff] to [bus 05-6e] add_size 3000
[   84.162369] pci 0000:04:00.0: BAR 13: no space for [io  size 0x3000]
[   84.162369] pci 0000:04:00.0: BAR 13: failed to assign [io  size 0x3000]
[   84.162371] pci 0000:04:00.0: BAR 13: no space for [io  size 0x3000]
[   84.162371] pci 0000:04:00.0: BAR 13: failed to assign [io  size 0x3000]
[   84.162375] pci 0000:05:02.0: BAR 15: no space for [mem size 0x00200000 64bit pref]
[   84.162376] pci 0000:05:02.0: BAR 15: failed to assign [mem size 0x00200000 64bit pref]
[   84.162377] pci 0000:05:01.0: BAR 13: no space for [io  size 0x1000]
[   84.162377] pci 0000:05:01.0: BAR 13: failed to assign [io  size 0x1000]
[   84.162378] pci 0000:05:02.0: BAR 13: no space for [io  size 0x1000]
[   84.162379] pci 0000:05:02.0: BAR 13: failed to assign [io  size 0x1000]
[   84.162380] pci 0000:05:04.0: BAR 13: no space for [io  size 0x1000]
[   84.162381] pci 0000:05:04.0: BAR 13: failed to assign [io  size 0x1000]
[   84.162382] pci 0000:05:04.0: BAR 13: no space for [io  size 0x1000]
[   84.162383] pci 0000:05:04.0: BAR 13: failed to assign [io  size 0x1000]
[   84.162384] pci 0000:05:02.0: BAR 15: no space for [mem size 0x00200000 64bit pref]
[   84.162385] pci 0000:05:02.0: BAR 15: failed to assign [mem size 0x00200000 64bit pref]
[   84.162386] pci 0000:05:02.0: BAR 13: no space for [io  size 0x1000]
[   84.162387] pci 0000:05:02.0: BAR 13: failed to assign [io  size 0x1000]
[   84.162388] pci 0000:05:01.0: BAR 13: no space for [io  size 0x1000]
[   84.162388] pci 0000:05:01.0: BAR 13: failed to assign [io  size 0x1000]
[   84.162390] pci 0000:05:00.0: PCI bridge to [bus 06]
[   84.162396] pci 0000:05:00.0:   bridge window [mem 0xea000000-0xea0fffff]
[   84.162406] pci 0000:05:01.0: PCI bridge to [bus 07-39]
[   84.162411] pci 0000:05:01.0:   bridge window [mem 0xbc000000-0xd3efffff]
[   84.162432] pci 0000:05:01.0:   bridge window [mem 0x2fb0000000-0x2fcfffffff 64bit pref]
[   84.162441] pci 0000:05:02.0: PCI bridge to [bus 3a]
[   84.162448] pci 0000:05:02.0:   bridge window [mem 0xd3f00000-0xd3ffffff]
[   84.162457] pci 0000:05:04.0: PCI bridge to [bus 3b-6e]
[   84.162463] pci 0000:05:04.0:   bridge window [mem 0xd4000000-0xe9ffffff]
[   84.162467] pci 0000:05:04.0:   bridge window [mem 0x2fd0000000-0x2ff9ffffff 64bit pref]
[   84.162473] pci 0000:04:00.0: PCI bridge to [bus 05-6e]
[   84.162478] pci 0000:04:00.0:   bridge window [mem 0xbc000000-0xea0fffff]
[   84.162482] pci 0000:04:00.0:   bridge window [mem 0x2fb0000000-0x2ff9ffffff 64bit pref]

Just wanted to confirm that the problem is more than just device 0000:05:02.0 like I had originally thought.

Thanks!
Rob
Comment 5 Robert Strube 2018-10-26 18:37:13 UTC
Created attachment 279167 [details]
dmesg log with ACPI enabled *default bios settings for TB* CLEAN

A clean version of the previous dmesg log with ACPI enabled and default BIOS settings.  No eGPU was connected; the system was simply booted.
Comment 6 Robert Strube 2018-11-04 18:49:22 UTC
One more detail that I didn't include.  I was running the latest TB controller firmware for the XPS 9575, which is NVM 36 when I created this bug report.  I updated from NVM 30 to 36 using fwupdmgr.

XPS 9575 Thunderbolt Controller
  DeviceId:             069ac71f347e92d158f2c211cca10d52a19e2d41
  Guid:                 8926f505-8219-5d6c-969a-e927534113fb
  Summary:              Unmatched performance for high-speed I/O
  Plugin:               thunderbolt
  Flags:                internal|updatable|supported|registered
  Vendor:               Dell
  VendorId:             TBT:0x00D4
  Version:              36.00
  Icon:                 computer
  Created:              2018-10-24
  Modified:             2018-10-24
  UpdateState:          success
Comment 7 adnans 2018-11-04 21:25:03 UTC
I'm running the same system, with an MSI Armor MK2 8G OC in a Mantiz MZ-02 eGPU box.  Everything but the external PCI-E lane comes up (SSD, USB ports).  With pci=noacpi the card initializes and is usable, but performance is terrible.
Comment 8 Mika Westerberg 2018-11-28 22:19:04 UTC
Hi,

The 0x00200000 failure is actually not fatal. The Linux PCI core is just trying to add some additional resources, and failing to add them is fine. The requested resources are still allocated.
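
As an illustration of the point above, the window arithmetic can be checked directly from the values in the original dmesg. This is a sketch, not kernel code: window_size is a hypothetical helper, and it assumes the printed ranges are inclusive and that add_size is printed in hex, which matches the 0x00200000 size in the failure lines.

```python
def window_size(start, end):
    """Size in bytes of an inclusive [start-end] range; 0 if end < start."""
    return max(0, end - start + 1)


# Prefetchable window reported for 0000:05:02.0 before resizing:
#   bridge window [mem 0x00100000-0x000fffff 64bit pref] ... add_size 200000
# The end is one below the start, so the required size is zero.
required = window_size(0x00100000, 0x000FFFFF)   # 0

# The add_size value from the same line: the optional extra the PCI core
# tries to add on top of the (zero) required size.
optional_extra = 0x200000                         # 2 MiB

# Non-prefetchable window the same bridge does end up with:
#   bridge window [mem 0xd3f00000-0xd3ffffff]
assigned = window_size(0xD3F00000, 0xD3FFFFFF)    # 0x100000, i.e. 1 MiB

if __name__ == "__main__":
    print(f"required pref window: {required:#x}")
    print(f"optional request:     {optional_extra:#x} ({optional_extra // 2**20} MiB)")
    print(f"assigned mem window:  {assigned:#x} ({assigned // 2**20} MiB)")
```

Read this way, the "no space for [mem size 0x00200000 64bit pref]" lines concern only the optional 2 MiB, while the bridge still has its 1 MiB non-prefetchable window.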

I don't see in your dmesg that you are actually connecting any Thunderbolt device. Can you connect the device that fails and then post full dmesg and output of 'sudo lspci -vv' command?

Thanks!
Comment 9 Robert Strube 2018-11-29 22:33:10 UTC
Hi Mika,

Thanks for the response.  I've actually had to return my original GPU (AMD RX 580) because I was unable to get it working, but I did keep the TB enclosure and have access to another GPU (Nvidia GTX 1060) that I can test with as an eGPU.  I'll make sure to report back with my findings.

One important thing worth mentioning is that I had originally opened up a bug report for DRI/amdgpu - located here: https://bugs.freedesktop.org/show_bug.cgi?id=108521

At the time the AMD developers thought (and it seems reasonable) that the root cause of the issue was related to the PCI resource issues, and more specifically an ACPI BIOS bug - which is why I've opened up this bug report.

I still have lots of information from that original bug report that I'm attaching here; the main difference is that I'm now running kernel 4.19.4, while those logs, etc. were from kernel 4.19.0.

Thanks!
Rob
Comment 10 Robert Strube 2018-11-29 22:41:25 UTC
Created attachment 279745 [details]
dmesg log eGPU AMD RX 580 attached ACPI enabled

The attempted initialization of the RX 580 (as an eGPU over TB) happens at timestamp [   11.118853]

The initialization fails with these messages:

[   16.124778] [drm:atom_op_jump [amdgpu]] *ERROR* atombios stuck in loop for more than 5secs aborting
[   16.124805] [drm:amdgpu_atom_execute_table_locked [amdgpu]] *ERROR* atombios stuck executing E650 (len 187, WS 0, PS 4) @ 0xE6FA
[   16.124828] [drm:amdgpu_atom_execute_table_locked [amdgpu]] *ERROR* atombios stuck executing C53A (len 193, WS 4, PS 4) @ 0xC569
[   16.124851] [drm:amdgpu_atom_execute_table_locked [amdgpu]] *ERROR* atombios stuck executing C410 (len 114, WS 0, PS 8) @ 0xC47C
[   16.124853] amdgpu 0000:09:00.0: gpu post error!
[   16.124854] amdgpu 0000:09:00.0: Fatal error during GPU init

The AMD developers believe the issue might be caused by PCI BAR resource issues - specifically some of the PCI to TB bridges not getting the necessary resources.
Comment 11 Robert Strube 2018-11-29 22:47:04 UTC
Created attachment 279747 [details]
lspci eGPU AMD RX 580 attached ACPI enabled

lspci -t -nn -v output when the eGPU AMD RX 580 is connected.
Comment 12 Mika Westerberg 2018-11-30 09:05:17 UTC
Since you don't have the AMD card anymore, I don't think it makes much sense to investigate that. If it also does not work with the NVIDIA card, can you attach dmesg with it connected, and please include the output of 'sudo lspci -vv' so that I can check the PCIe configuration?
Comment 13 Robert Strube 2019-01-09 21:19:12 UTC
Apologies for the late reply.  I've been out of town for the past month.

I ended up giving away the Nvidia GPU to a friend of mine over the holidays, but I will be purchasing another AMD GPU (probably the Vega 56) in the near future.

I'll make sure to update this bug report once I've tried to get that card setup as an eGPU.  I'll also make sure to conduct my testing against 4.20 rather than 4.19.
Comment 14 Dimitar Atanasov 2019-03-29 13:39:27 UTC
I have the same problem, but with a Vega 56. The laptop is almost the same: a Precision 5530 2-in-1 (only the CPU is different). I have tested it with 4.18, 4.19, 5.0.4 and 5.0.5 with no success.
Comment 15 Robert Strube 2019-04-25 00:09:23 UTC
I'm sorry it's taken me so long to update this bug report.

After spending entirely too long trying to get eGPUs working with my Dell XPS 9575, I have been able to get one working correctly - but only an Nvidia GPU.

I purchased an RTX 2070, and combined with the Akitio Node 2 it works correctly (albeit with some Xorg.conf tweaks).

I'll include the dmesg from my system here as a reference (with the Nvidia GPU as an eGPU), but I believe we can close this bug report, as the Dell BIOS does not appear to be an issue here - at least not with an Nvidia GPU.

I think the bug over here still stands: https://bugs.freedesktop.org/show_bug.cgi?id=108521
Comment 16 Robert Strube 2019-04-25 00:14:52 UTC
Created attachment 282507 [details]
dmesg log with kernel 5.0.x with nvidia eGPU
Comment 17 Robert Strube 2019-04-25 00:15:22 UTC
Created attachment 282509 [details]
lspci kernel 5.0.x with nvidia eGPU
Comment 18 Maxim Levitsky 2020-10-14 22:38:22 UTC
I most likely have the same issue. It works on 5.7 and is broken on 5.8 with my NVMe Thunderbolt drive, which I rarely disconnect, so I didn't notice it until now.

Bisect is incoming.
Comment 19 Andrej Podzimek 2020-12-10 13:31:09 UTC
I have a very similar resource allocation issue on a completely different platform (ASRock Creator X570, Ryzen 3950X, Razer Core X Chroma, NVidia Quadro P5000). Here are my posts on egpu.io with dmesg and other details:
https://egpu.io/forums/postid/90573/
https://egpu.io/forums/postid/90608/

The kernel command line magic I’ve been using:
pcie_ports=native pci=assign-busses,hpbussize=0x33,realloc,hpmmiosize=128M,hpmmioprefsize=16G
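
To make the requested sizes on that command line explicit, the pci= options can be split apart and the size suffixes expanded. This is a rough sketch: parse_size and parse_pci_options are hypothetical helper names, and the suffix handling (K, M, G as powers of 1024) is an assumption about how these values are meant to be read.

```python
# Binary multipliers assumed for the K/M/G suffixes.
SUFFIXES = {"K": 2**10, "M": 2**20, "G": 2**30}


def parse_size(text):
    """Parse a value such as '128M', '16G' or '0x33' into an integer."""
    text = text.strip()
    suffix = text[-1:].upper()
    if suffix in SUFFIXES:
        return int(text[:-1], 0) * SUFFIXES[suffix]
    # Base 0 lets int() accept both decimal and 0x-prefixed hex.
    return int(text, 0)


def parse_pci_options(cmdline):
    """Return the comma-separated pci= options as a dict (bare flags map to True)."""
    options = {}
    for token in cmdline.split():
        if not token.startswith("pci="):
            continue
        for item in token[len("pci="):].split(","):
            name, sep, value = item.partition("=")
            options[name] = parse_size(value) if sep else True
    return options


CMDLINE = ("pcie_ports=native "
           "pci=assign-busses,hpbussize=0x33,realloc,"
           "hpmmiosize=128M,hpmmioprefsize=16G")

if __name__ == "__main__":
    for name, value in parse_pci_options(CMDLINE).items():
        print(f"{name} = {value}")
```

With the command line above, this reads as 51 reserved bus numbers, 128 MiB of MMIO and 16 GiB of prefetchable MMIO per hotplug bridge.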

The number of errors in dmesg depends on hpmmiosize and hpmmioprefsize like this (based on what I tried, which is in no way exhaustive):

256M / 64G: A huge number of errors affecting multiple BARs, even before the eGPU connection: https://pastebin.com/kcdHADw6 The eGPU works though.

128M / 16G: BAR 15 has errors during boot, BAR 13 has errors upon eGPU connection. The eGPU works.

128M / 512M: Exactly the same number and type of errors as above. The eGPU works.

128M / 256M: No I/O resource errors. But the eGPU doesn’t work; initialization fails in almost the same way as reported here: https://bbs.archlinux.org/viewtopic.php?id=261303 (Almost as if the pci= tweaks on the kernel command line had no effect.)

(By "eGPU works" I mean that it can run Folding@Home. (I haven’t tested it in other ways.) There is (I guess) a performance penalty due to the missing MMIO resources.)

Unfortunately I have no clue which of the errors are benign and which ones require further attention. But I’ll be more than happy to experiment if there are any suggestions.
Comment 20 Alex Deucher 2020-12-10 16:32:50 UTC
If your sbios supports an option like ">4G MMIO" or ">4G Decoding", try enabling that.  That should make more MMIO space available.
Comment 21 Andrej Podzimek 2020-12-10 19:04:10 UTC
(In reply to Alex Deucher from comment #20)
> If your sbios supports an option like ">4G MMIO" or ">4G Decoding" try
> enabling that.  That should make more MMIO space available.

I always have that enabled. (There are 64b resources in my dmesg.)
