Bug 215958 - thunderbolt3 egpu cannot disconnect cleanly
Summary: thunderbolt3 egpu cannot disconnect cleanly
Status: NEW
Alias: None
Product: Drivers
Classification: Unclassified
Component: PCI
Hardware: All Linux
Importance: P1 normal
Assignee: drivers_pci@kernel-bugs.osdl.org
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2022-05-08 20:29 UTC by Roberto
Modified: 2022-05-20 09:06 UTC
CC List: 3 users

See Also:
Kernel Version: 5.17.0-1003-oem #3-Ubuntu SMP PREEMPT
Subsystem:
Regression: No
Bisected commit-id:


Attachments
lspci after boot (no EGPU connected) (56.77 KB, text/plain)
2022-05-09 19:54 UTC, Bjorn Helgaas
lspci after attaching EGPU (84.84 KB, text/plain)
2022-05-09 19:55 UTC, Bjorn Helgaas
lspci after detaching EGPU (56.72 KB, text/plain)
2022-05-09 19:56 UTC, Bjorn Helgaas
dmesg during suspend and resume (ending in system freeze) (124.49 KB, text/plain)
2022-05-14 07:37 UTC, Roberto
dmesg during suspend and resume using kernel 5.15.0 (140.86 KB, text/plain)
2022-05-19 19:50 UTC, Roberto

Description Roberto 2022-05-08 20:29:50 UTC
I have an eGPU (Radeon RX 6600) connected through Thunderbolt 3 to my ThinkPad X1 Carbon 6th Gen. When I disconnect the Thunderbolt 3 cable I get the following errors in dmesg:

[21874.194994] amdgpu 0000:0c:00.0: amdgpu: SMU: response:0xFFFFFFFF for index:18 param:0x00000005 message:TransferTableSmu2Dram?
[21874.195006] amdgpu 0000:0c:00.0: amdgpu: Failed to export SMU metrics table!
[21874.195123] amdgpu 0000:0c:00.0: amdgpu: SMU: response:0xFFFFFFFF for index:18 param:0x00000005 message:TransferTableSmu2Dram?
[21874.195129] amdgpu 0000:0c:00.0: amdgpu: Failed to export SMU metrics table!
[21874.195271] amdgpu 0000:0c:00.0: amdgpu: SMU: response:0xFFFFFFFF for index:18 param:0x00000005 message:TransferTableSmu2Dram?
[21874.195276] amdgpu 0000:0c:00.0: amdgpu: Failed to export SMU metrics table!
[21874.195406] amdgpu 0000:0c:00.0: amdgpu: SMU: response:0xFFFFFFFF for index:18 param:0x00000005 message:TransferTableSmu2Dram?
[21874.195411] amdgpu 0000:0c:00.0: amdgpu: Failed to export SMU metrics table!
[21874.195544] amdgpu 0000:0c:00.0: amdgpu: SMU: response:0xFFFFFFFF for index:51 param:0x00000000 message:GetPptLimit?
[21874.195550] amdgpu 0000:0c:00.0: amdgpu: [smu_v11_0_get_current_power_limit] get PPT limit failed!
[21874.195582] amdgpu 0000:0c:00.0: amdgpu: SMU: response:0xFFFFFFFF for index:18 param:0x00000005 message:TransferTableSmu2Dram?
[21874.195587] amdgpu 0000:0c:00.0: amdgpu: Failed to export SMU metrics table!
[21874.227454] amdgpu 0000:0c:00.0: amdgpu: SMU: response:0xFFFFFFFF for index:18 param:0x00000005 message:TransferTableSmu2Dram?
[21874.227463] amdgpu 0000:0c:00.0: amdgpu: Failed to export SMU metrics table!
[21874.227532] amdgpu 0000:0c:00.0: amdgpu: SMU: response:0xFFFFFFFF for index:18 param:0x00000005 message:TransferTableSmu2Dram?
[21874.227536] amdgpu 0000:0c:00.0: amdgpu: Failed to export SMU metrics table!
[21874.227618] amdgpu 0000:0c:00.0: amdgpu: SMU: response:0xFFFFFFFF for index:18 param:0x00000005 message:TransferTableSmu2Dram?
[21874.227621] amdgpu 0000:0c:00.0: amdgpu: Failed to export SMU metrics table!
[21874.227700] amdgpu 0000:0c:00.0: amdgpu: SMU: response:0xFFFFFFFF for index:18 param:0x00000005 message:TransferTableSmu2Dram?
[21874.227703] amdgpu 0000:0c:00.0: amdgpu: Failed to export SMU metrics table!
[21874.227784] amdgpu 0000:0c:00.0: amdgpu: [smu_v11_0_get_current_power_limit] get PPT limit failed!
[21874.227804] amdgpu 0000:0c:00.0: amdgpu: Failed to export SMU metrics table!
[21874.514661] snd_hda_codec_hdmi hdaudioC1D0: Unable to sync register 0x2f0d00. -5
[21874.568360] amdgpu 0000:0c:00.0: amdgpu: Failed to switch to AC mode!
[21874.599292] amdgpu 0000:0c:00.0: amdgpu: Failed to switch to AC mode!
[21874.718562] amdgpu 0000:0c:00.0: amdgpu: amdgpu: finishing device.
[21878.722376] amdgpu: cp queue pipe 4 queue 0 preemption failed
[21878.722422] amdgpu 0000:0c:00.0: amdgpu: Failed to disable gfxoff!
[21879.134918] amdgpu 0000:0c:00.0: [drm:amdgpu_ring_test_helper [amdgpu]] *ERROR* ring kiq_2.1.0 test failed (-110)
[21879.135144] [drm:gfx_v10_0_hw_fini [amdgpu]] *ERROR* KGQ disable failed
[21879.338158] amdgpu 0000:0c:00.0: [drm:amdgpu_ring_test_helper [amdgpu]] *ERROR* ring kiq_2.1.0 test failed (-110)
[21879.338402] [drm:gfx_v10_0_hw_fini [amdgpu]] *ERROR* KCQ disable failed
[21879.543318] [drm:gfx_v10_0_cp_gfx_enable.isra.0 [amdgpu]] *ERROR* failed to halt cp gfx
[21879.544216] __smu_cmn_reg_print_error: 5 callbacks suppressed
[21879.544220] amdgpu 0000:0c:00.0: amdgpu: SMU: response:0xFFFFFFFF for index:7 param:0x00000000 message:DisableAllSmuFeatures?
[21879.544226] amdgpu 0000:0c:00.0: amdgpu: Failed to disable smu features.
[21879.544230] amdgpu 0000:0c:00.0: amdgpu: Fail to disable dpm features!
[21879.544238] [drm] free PSP TMR buffer
[21880.455935] i915 0000:00:02.0: vgaarb: changed VGA decodes: olddecodes=none,decodes=io+mem:owns=io+mem
[21880.456218] pci 0000:0c:00.0: Removing from iommu group 14
[21880.456715] pci 0000:0c:00.1: Removing from iommu group 14
[21880.456798] pci_bus 0000:0c: busn_res: [bus 0c] is released
[21880.456950] pci 0000:0b:00.0: Removing from iommu group 14
[21880.456985] pci_bus 0000:0b: busn_res: [bus 0b-0c] is released
[21880.457106] pci 0000:0a:00.0: Removing from iommu group 14
[21880.457156] pci_bus 0000:0a: busn_res: [bus 0a-0c] is released
[21880.457279] pci 0000:09:01.0: Removing from iommu group 14
[21880.457311] pci_bus 0000:09: busn_res: [bus 09-3a] is released
[21880.457543] pci 0000:08:00.0: Removing from iommu group 14
[21880.457847] pci_bus 0000:06: Allocating resources
[21880.457888] pcieport 0000:06:02.0: bridge window [io  0x1000-0x0fff] to [bus 3b] add_size 1000
[21880.457897] pcieport 0000:06:04.0: bridge window [io  0x1000-0x0fff] to [bus 3c-6f] add_size 1000
[21880.457913] pcieport 0000:06:02.0: BAR 13: no space for [io  size 0x1000]
[21880.457919] pcieport 0000:06:02.0: BAR 13: failed to assign [io  size 0x1000]
[21880.457924] pcieport 0000:06:04.0: BAR 13: no space for [io  size 0x1000]
[21880.457928] pcieport 0000:06:04.0: BAR 13: failed to assign [io  size 0x1000]
[21880.457934] pcieport 0000:06:04.0: BAR 13: no space for [io  size 0x1000]
[21880.457938] pcieport 0000:06:04.0: BAR 13: failed to assign [io  size 0x1000]
[21880.457943] pcieport 0000:06:02.0: BAR 13: no space for [io  size 0x1000]
[21880.457947] pcieport 0000:06:02.0: BAR 13: failed to assign [io  size 0x1000]


Upon reconnecting the cable I get:

[22192.753261] input: HDA ATI HDMI HDMI/DP,pcm=3 as /devices/pci0000:00/0000:00:1d.0/0000:05:00.0/0000:06:01.0/0000:08:00.0/0000:09:01.0/0000:0a:00.0/0000:0b:00.0/0000:0c:00.1/sound/card1/input98
[22192.753738] input: HDA ATI HDMI HDMI/DP,pcm=7 as /devices/pci0000:00/0000:00:1d.0/0000:05:00.0/0000:06:01.0/0000:08:00.0/0000:09:01.0/0000:0a:00.0/0000:0b:00.0/0000:0c:00.1/sound/card1/input99
[22192.753952] input: HDA ATI HDMI HDMI/DP,pcm=8 as /devices/pci0000:00/0000:00:1d.0/0000:05:00.0/0000:06:01.0/0000:08:00.0/0000:09:01.0/0000:0a:00.0/0000:0b:00.0/0000:0c:00.1/sound/card1/input100
[22192.755234] input: HDA ATI HDMI HDMI/DP,pcm=9 as /devices/pci0000:00/0000:00:1d.0/0000:05:00.0/0000:06:01.0/0000:08:00.0/0000:09:01.0/0000:0a:00.0/0000:0b:00.0/0000:0c:00.1/sound/card1/input101
[22192.763885] input: HDA ATI HDMI HDMI/DP,pcm=10 as /devices/pci0000:00/0000:00:1d.0/0000:05:00.0/0000:06:01.0/0000:08:00.0/0000:09:01.0/0000:0a:00.0/0000:0b:00.0/0000:0c:00.1/sound/card1/input102
[22192.975773] thunderbolt 0-1: new device found, vendor=0x127 device=0x1
[22192.975786] thunderbolt 0-1: Razer Core X


However, the eGPU no longer appears in `xrandr --listproviders`. A full reboot is needed.
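
Every SMU failure in the disconnect log above carries `response:0xFFFFFFFF`. An all-ones value is what a PCI read typically yields once the device is no longer reachable, so counting these lines in a saved log is a quick way to check whether amdgpu kept talking to hardware that was already gone. A minimal Python sketch, assuming the log was saved as plain text; the function name `count_all_ones_smu_responses` is hypothetical and not part of any kernel tooling:

```python
import re

# Matches amdgpu SMU error lines such as:
# "amdgpu 0000:0c:00.0: amdgpu: SMU: response:0xFFFFFFFF for index:18 ..."
SMU_LINE = re.compile(
    r"amdgpu (?P<bdf>[0-9a-f]{4}:[0-9a-f]{2}:[0-9a-f]{2}\.[0-7]): "
    r"amdgpu: SMU: response:(?P<resp>0x[0-9A-Fa-f]+) "
)

def count_all_ones_smu_responses(log_text):
    """Return {pci_address: count} of SMU responses equal to 0xFFFFFFFF."""
    counts = {}
    for line in log_text.splitlines():
        m = SMU_LINE.search(line)
        if m and int(m.group("resp"), 16) == 0xFFFFFFFF:
            bdf = m.group("bdf")
            counts[bdf] = counts.get(bdf, 0) + 1
    return counts

# Three lines copied from the log in this report.
sample = """\
[21874.194994] amdgpu 0000:0c:00.0: amdgpu: SMU: response:0xFFFFFFFF for index:18 param:0x00000005 message:TransferTableSmu2Dram?
[21874.195006] amdgpu 0000:0c:00.0: amdgpu: Failed to export SMU metrics table!
[21879.544220] amdgpu 0000:0c:00.0: amdgpu: SMU: response:0xFFFFFFFF for index:7 param:0x00000000 message:DisableAllSmuFeatures?
"""
print(count_all_ones_smu_responses(sample))
```

For a full log, pass the contents of the saved file instead of `sample`.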
Comment 1 Bjorn Helgaas 2022-05-09 19:54:32 UTC
Created attachment 300914
lspci after boot (no EGPU connected)
Comment 2 Bjorn Helgaas 2022-05-09 19:55:06 UTC
Created attachment 300915
lspci after attaching EGPU
Comment 3 Bjorn Helgaas 2022-05-09 19:56:56 UTC
Created attachment 300916
lspci after detaching EGPU

From Message-ID <23d4b1f4-09a7-c12a-7610-2863c8267341@yahoo.it>, unfortunately not archived because it contains HTML.
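
The three lspci snapshots (after boot, after attach, after detach) can be compared mechanically to see which PCI functions appear and disappear. The following is only a sketch: it assumes each snapshot is plain `lspci` or `lspci -vv` text in which device header lines start at column 0, the helper names `devices` and `diff_devices` are hypothetical, and the sample text is made up for illustration rather than taken from the attachments.

```python
import re

# A device header line starts at column 0 with bus:dev.fn, optionally
# prefixed by a PCI domain (as printed by `lspci -D`). Indented detail
# lines from -vv output do not match and are skipped.
HEADER = re.compile(r"^((?:[0-9a-f]{4}:)?[0-9a-f]{2}:[0-9a-f]{2}\.[0-7]) (.*)$")

def devices(lspci_text):
    """Map each PCI address in an lspci dump to its description."""
    found = {}
    for line in lspci_text.splitlines():
        m = HEADER.match(line)
        if m:
            found[m.group(1)] = m.group(2)
    return found

def diff_devices(before, after):
    """Return (added, removed) PCI addresses between two lspci dumps."""
    b, a = devices(before), devices(after)
    return sorted(set(a) - set(b)), sorted(set(b) - set(a))

# Illustrative snapshots only; the descriptions are placeholders.
boot = "00:02.0 VGA compatible controller: integrated GPU\n"
attached = (
    "00:02.0 VGA compatible controller: integrated GPU\n"
    "0c:00.0 VGA compatible controller: external GPU\n"
    "\tSubsystem: placeholder detail line\n"
    "0c:00.1 Audio device: external GPU HDMI audio\n"
)
added, removed = diff_devices(boot, attached)
print("added:", added)
print("removed:", removed)
```

Running `diff_devices` on the "after boot" and "after detaching" snapshots should ideally report nothing added or removed if the hierarchy was torn down cleanly.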
Comment 4 Roberto 2022-05-09 20:31:29 UTC

On 09/05/22 18:23, Bjorn Helgaas wrote:
> On Sun, May 8, 2022 at 3:29 PM <bugzilla-daemon@kernel.org> wrote:
>> https://bugzilla.kernel.org/show_bug.cgi?id=215958
>>
>>             Bug ID: 215958
>>            Summary: thunderbolt3 egpu cannot disconnect cleanly
>>            Product: Drivers
>>            Version: 2.5
>>     Kernel Version: 5.17.0-1003-oem #3-Ubuntu SMP PREEMPT
>>           Hardware: All
>>                 OS: Linux
>>               Tree: Mainline
>>             Status: NEW
>>           Severity: normal
>>           Priority: P1
>>          Component: PCI
>>           Assignee: drivers_pci@kernel-bugs.osdl.org
>>           Reporter: r087r70@yahoo.it
>>         Regression: No
> I assume this is not a regression, right?  If it is a regression, what
> previous kernel worked correctly?

No, it's not, but I haven't tested all possible kernel versions, just 5.15 and 5.17.

>> I have an external egpu (Radeon 6600 RX) connected through thunderbolt3 to
>> my
>> Thinkpad X1 carbon 6th Gen.. When I disconnect the thunderbolt3 cable I get
>> the
>> following error in dmesg:
>>
>> [21874.194994] amdgpu 0000:0c:00.0: amdgpu: SMU: response:0xFFFFFFFF for
>> index:18 param:0x00000005 message:TransferTableSmu2Dram?
>> ...
>> ...
>> [21879.544226] amdgpu 0000:0c:00.0: amdgpu: Failed to disable smu features.
>> [21879.544230] amdgpu 0000:0c:00.0: amdgpu: Fail to disable dpm features!
>> [21879.544238] [drm] free PSP TMR buffer
> The above looks like what amdgpu would see when the GPU is no longer
> accessible (writes are dropped and reads return 0xffffffff). It's
> possible amdgpu could notice this and shut down more gracefully, but I
> don't think it's the main problem here and it probably wouldn't force
> you to reboot.

Actually, in this state I cannot `modprobe -r amdgpu`:

modprobe: FATAL: Module amdgpu is in use.



>> [21880.455935] i915 0000:00:02.0: vgaarb: changed VGA decodes:
>> olddecodes=none,decodes=io+mem:owns=io+mem
>> [21880.456218] pci 0000:0c:00.0: Removing from iommu group 14
>> ...
>> ...
>> [21880.457311] pci_bus 0000:09: busn_res: [bus 09-3a] is released
>> [21880.457543] pci 0000:08:00.0: Removing from iommu group 14
> This looks like removing 0c:00.0 (the GPU) and two switches leading to
> it (probably part of the Thunderbolt topology), so to be expected.
>
>> [21880.457847] pci_bus 0000:06: Allocating resources
>> [21880.457888] pcieport 0000:06:02.0: bridge window [io 0x1000-0x0fff] to
>> [bus
>> 3b] add_size 1000
>> ...
>> ...
>> [21880.457947] pcieport 0000:06:02.0: BAR 13: failed to assign [io  size
>> 0x1000]
> I'm not sure why we're allocating resources as part of the removal.
> The hierarchies under 06:02.0 (to [bus 3b]) and 06:04.0 (to [bus
> 3c-6f]) seem to be siblings of the hierarchy you just removed (my
> guess is that was 06:01.0 to [bus 08-3a]).  But again, shouldn't
> require a reboot.
>
>> upon reconnection of the cable I get:
>>
>> [22192.753261] input: HDA ATI HDMI HDMI/DP,pcm=3 as
>>
>> /devices/pci0000:00/0000:00:1d.0/0000:05:00.0/0000:06:01.0/0000:08:00.0/0000:09:01.0/0000:0a:00.0/0000:0b:00.0/0000:0c:00.1/sound/card1/input98
>> [22192.753738] input: HDA ATI HDMI HDMI/DP,pcm=7 as
>>
>> /devices/pci0000:00/0000:00:1d.0/0000:05:00.0/0000:06:01.0/0000:08:00.0/0000:09:01.0/0000:0a:00.0/0000:0b:00.0/0000:0c:00.1/sound/card1/input99
>> [22192.753952] input: HDA ATI HDMI HDMI/DP,pcm=8 as
>>
>> /devices/pci0000:00/0000:00:1d.0/0000:05:00.0/0000:06:01.0/0000:08:00.0/0000:09:01.0/0000:0a:00.0/0000:0b:00.0/0000:0c:00.1/sound/card1/input100
>> [22192.755234] input: HDA ATI HDMI HDMI/DP,pcm=9 as
>>
>> /devices/pci0000:00/0000:00:1d.0/0000:05:00.0/0000:06:01.0/0000:08:00.0/0000:09:01.0/0000:0a:00.0/0000:0b:00.0/0000:0c:00.1/sound/card1/input101
>> [22192.763885] input: HDA ATI HDMI HDMI/DP,pcm=10 as
>>
>> /devices/pci0000:00/0000:00:1d.0/0000:05:00.0/0000:06:01.0/0000:08:00.0/0000:09:01.0/0000:0a:00.0/0000:0b:00.0/0000:0c:00.1/sound/card1/input102
>> [22192.975773] thunderbolt 0-1: new device found, vendor=0x127 device=0x1
>> [22192.975786] thunderbolt 0-1: Razer Core X
>>
>> but the egpu no longer appears in `xrandr --listproviders`. Full reboot is
>> needed.
> Can you please build with CONFIG_DYNAMIC_DEBUG=y, boot with
> 'dyndbg="file pciehp* +p"', and attach the complete dmesg log to the
> bugzilla?  Also please attach the complete "sudo lspci -vv" output
> (before the unplug and after the replug)?

Ironically, I rebooted to get the lspci output, and now I can no longer get into the above state. What I get now is that after attaching the eGPU, it is enabled *without* the need to restart the X server, while after detaching it the X server is restarted and the card gets released correctly, although the amdgpu driver stays loaded. But I can `modprobe -r amdgpu` without problems. I could connect/disconnect many times without issues. Attached is the lspci output after a fresh boot, upon eGPU connection, and upon disconnection.

I will test more in the coming days.

Thank you,
Roberto
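
A note on the `bridge window [io  0x1000-0x0fff] ... add_size 1000` lines quoted above: the end address is one below the start, which in the kernel's resource notation means the window is currently empty (size zero), and `add_size` is the additional space being requested, printed in hex. That matches the `no space for [io  size 0x1000]` messages that follow. A small Python sketch that decodes such a line; the function name `parse_window` is hypothetical, and the regular expression only covers the message shape seen in this log:

```python
import re

# Matches e.g. "bridge window [io  0x1000-0x0fff] to [bus 3b] add_size 1000"
WINDOW = re.compile(
    r"bridge window \[(?P<kind>io|mem)\s+0x(?P<start>[0-9a-f]+)-0x(?P<end>[0-9a-f]+)"
    r"[^\]]*\] to \[bus (?P<bus>[^\]]+)\] add_size (?P<add>[0-9a-f]+)"
)

def parse_window(line):
    """Decode a bridge-window line; return None if the line does not match."""
    m = WINDOW.search(line)
    if m is None:
        return None
    start = int(m.group("start"), 16)
    end = int(m.group("end"), 16)
    return {
        "kind": m.group("kind"),
        "bus": m.group("bus"),
        # Resource ranges are inclusive, so end == start - 1 gives size 0.
        "size": end - start + 1,
        "add_size": int(m.group("add"), 16),
    }

line = ("[21880.457888] pcieport 0000:06:02.0: bridge window "
        "[io  0x1000-0x0fff] to [bus 3b] add_size 1000")
info = parse_window(line)
print(info)
```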
Comment 5 Andrey Grodzovsky 2022-05-10 17:14:20 UTC
So is there any problem currently? Also, please provide the full dmesg log for both disconnect and reconnect of the eGPU.
Comment 7 Roberto 2022-05-10 17:41:43 UTC
Please give me some time, I'll be quite overloaded until this weekend, I will write back asap.
Comment 8 Roberto 2022-05-14 07:37:54 UTC
Created attachment 300955
dmesg during suspend and resume (ending in system freeze)
Comment 9 Roberto 2022-05-14 07:48:23 UTC
I notice two major issues related to the present bug:

1) suspend-resume seems to trigger the bug in the subject (one can no longer connect/disconnect the card cleanly)

2) upon disconnection of the eGPU, the X server restarts even though it is not using it. Note that I only use the eGPU indirectly through DRI_PRIME offloading, so there should be no need to restart the X server, which is running on the integrated "Intel UHD Graphics 620". The restart of the X server is very annoying.

I have attached the output of `sudo dmesg -ew > dmesg.log` captured during suspend and resume operations, which ended in an unresponsive system (frozen with a white screen).
I have set 'dyndbg="file pciehp* +p"' in GRUB at boot.
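
To confirm that the `dyndbg` setting actually reached the running kernel, the kernel command line can be inspected. This is only a sketch: the function name `dyndbg_queries` and the example command line (including the `BOOT_IMAGE` path) are hypothetical, and it relies on shell-style quoting, so it accepts both `dyndbg="file pciehp* +p"` and `"dyndbg=file pciehp* +p"`.

```python
import shlex

def dyndbg_queries(cmdline):
    """Return the dynamic-debug queries found on a kernel command line."""
    queries = []
    for arg in shlex.split(cmdline):
        key, sep, value = arg.partition("=")
        if sep and key == "dyndbg":
            queries.append(value)
    return queries

# Hypothetical command line for illustration.
example = 'BOOT_IMAGE=/vmlinuz ro quiet dyndbg="file pciehp* +p"'
print(dyndbg_queries(example))
```

On a running system, the string to pass is the content of `/proc/cmdline`.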
Comment 10 Andrey Grodzovsky 2022-05-16 14:56:47 UTC
Can you confirm you have no problems with disconnecting the eGPU as long as you don't do suspend/resume?

It seems your suspend/resume is broken regardless of the eGPU disconnection procedure, so it all looks more like an S3 bug that has nothing to do with eGPU disconnect.

Can you take your graphics card out of the eGPU, put it on the motherboard PCI bus, and check how suspend/resume works for you?
Comment 11 Roberto 2022-05-16 15:52:54 UTC
(In reply to Andrey Grodzovsky from comment #10)
> Can you confirm you have no problems with disconnecting the eGPU as long as
> you don't do suspend/resume?

As already stated, the only problem is that the X server restarts, which I would strongly prefer it didn't.

> It seems your suspend/resume is broken regardless of the eGPU disconnection
> procedure, so it all looks more like an S3 bug that has nothing to do with
> eGPU disconnect.

I see. Should we redirect this bug to the ACPI area?

> Can you take your graphics card out of the eGPU, put it on the motherboard
> PCI bus, and check how suspend/resume works for you?

Unfortunately I cannot, I only have a laptop not a desktop.
Comment 12 Andrey Grodzovsky 2022-05-16 16:35:05 UTC
(In reply to Roberto from comment #11)
> (In reply to Andrey Grodzovsky from comment #10)
> > Can you confirm you have no problems with disconnecting the eGPU as long
> > as you don't do suspend/resume?
> 
> As already stated, the only problem is that the X server restarts, which I
> would strongly prefer it didn't.


Please open a separate ticket for this and attach both dmesg and Xorg.log.

> 
> > It seems your suspend/resume is broken regardless of the eGPU disconnection
> > procedure, so it all looks more like an S3 bug that has nothing to do with
> > eGPU disconnect.
> 
> I see. Should we redirect this bug to the ACPI area?

No, it's our responsibility too. Can you go back and forth a bit and see if this is a regression? Try maybe the 5.15 and 5.16 kernels and see if the problem is still there.


> 
> > Can you take your graphics card out of the eGPU, put it on the motherboard
> > PCI bus, and check how suspend/resume works for you?
> 
> Unfortunately I cannot, I only have a laptop not a desktop.
Comment 13 Roberto 2022-05-19 19:50:30 UTC
Created attachment 300999
dmesg during suspend and resume using kernel 5.15.0
Comment 14 Roberto 2022-05-19 20:19:28 UTC
Using kernel 5.15 I could suspend, but during resume it seemed to freeze for a few seconds before it managed to resume. I then tried to reconnect the eGPU and it seemed to work, although with some lag: if I run glxinfo or glxgears it takes a couple of seconds to respond.

Moreover, as a second issue, with both kernels I cannot use the `sensors` application to monitor the temperature after resuming from suspend, since it hangs. If I type `sensors` in a terminal it prints ~40 lines of information about the CPU and the battery, but then it freezes and does not exit.

The suspend/resume procedure seems really problematic with the eGPU attached, or after detaching it.
Comment 15 Andrey Grodzovsky 2022-05-19 21:03:54 UTC
(In reply to Roberto from comment #14)
> Using kernel 5.15 I could suspend, but during resume it seemed to freeze for
> a few seconds, but then it managed to resume. I have tried to reconnect the
> egpu and it seemed to work, although with some lagging: if I type glxinfo or
> glxgears it takes a couple of seconds to respond.
> 
> Moreover, as a second issue, for both the kernels I cannot use the `sensors`
> application to monitor the temperature after resuming from suspend, since it
> hangs. If I type 'sensors' in a terminal it prints ~40 lines with some
> information about the cpu and the battery, but then it freezes and it does
> not exit.
> 
> The suspend/resume procedure seems really problematic with the egpu
> attached, or after detaching it.

Probably. We don't have a proper eGPU cage at hand to test it each time. Can you try to bisect from this 5.15 kernel forward to see at what point it started hard-hanging as in your original kernel?
Comment 16 Roberto 2022-05-20 09:06:16 UTC
For the X server issue I've created https://bugzilla.kernel.org/show_bug.cgi?id=216004
