Bug 216728

Summary: Thunderbolt USB Controller died after resume on Intel CometLake platform
Product: Drivers Reporter: Chris Chiu (chris.chiu)
Component: USBAssignee: Default virtual assignee for Drivers/USB (drivers_usb)
Status: NEW ---    
Severity: high CC: dev, dev, kernel, kparal, liang_xiao1, mark.blakeney, mathias.nyman, mika.westerberg, neo.wong, rustamabd
Priority: P1    
Hardware: Intel   
OS: Linux   
Kernel Version: 6.1-rc Subsystem:
Regression: No Bisected commit-id:
Attachments: output of lspci -nnv
output of dmesg
output of sudo lspci -vv
dmesg of kernel 6.0
dmesg of kernel 6.1
dmesg with PCI_DEBUG
dmesg for add d3dcold delay
dmesg with aspm off
Use pci_dev_wait() to check if the xHCI is reponsive after the link is activated
Use pci_dev_wait() to check if the xHCI is reponsive after the link is activated (corrected)
attachment-24626-0.html
possible regression dmesg
attachment-24997-0.html
thunderbolt.dyndbg=+p dmesg output
attachment-6036-0.html

Description Chris Chiu 2022-11-22 10:25:26 UTC
We found few Intel CometLake laptops with Thunderbolt USB controller (8086:15ec) failed to function after resume, the plugged usb devices are failed to be detected. We found a easier way to reproduce as follows.
1. Put the system into suspend first
2. Plug a type-C usb device (or a type-C docking with USB device attached) to the port managed by thunderbolt usb controller.
3. Resume the system
 
The kernel log on 1 laptop shows as follows. The system resume automatically after plug in the type-C port
[ 583.424555] xhci_hcd 0000:38:00.0: Unable to change power state from D3cold to D0, device inaccessible
[ 583.424575] xhci_hcd 0000:38:00.0: Controller not ready at resume -19
[ 583.424576] xhci_hcd 0000:38:00.0: PCI post-resume error -19!
[ 583.424577] xhci_hcd 0000:38:00.0: HC died; cleaning up
[ 583.424580] xhci_hcd 0000:38:00.0: PM: dpm_run_callback(): pci_pm_resume+0x0/0xb0 returns -19
[ 583.424585] xhci_hcd 0000:38:00.0: PM: failed to resume async: error -19

The other laptop didn't automatically resume, but the following messages still shows after manually resume
[ 1393.737519] xhci_hcd 0000:39:00.0: Refused to change power state, currently in D3
[ 1393.737638] xhci_hcd 0000:39:00.0: Controller not ready at resume -19
[ 1393.737638] xhci_hcd 0000:39:00.0: PCI post-resume error -19!
[ 1393.737639] xhci_hcd 0000:39:00.0: HC died; cleaning up
[ 1393.737643] dpm_run_callback(): pci_pm_resume+0x0/0xb0 returns -19
[ 1393.737645] PM: Device 0000:39:00.0 failed to resume async: error -19


The have 1 thing in common, the controller refused to change power state and host controller died after resume. Please help to find out the cause why the thunderbolt controller refuse to change power state and died after resume. Thanks
Comment 1 Chris Chiu 2022-11-22 10:27:23 UTC
Created attachment 303262 [details]
output of lspci -nnv
Comment 2 Chris Chiu 2022-11-22 10:29:17 UTC
Created attachment 303263 [details]
output of dmesg
Comment 3 Mika Westerberg 2022-11-24 09:19:42 UTC
Can you attach output of 'sudo lspci -vv' when this happens? This will show more details on the PCIe side.
Comment 4 Chris Chiu 2022-11-25 01:42:37 UTC
Created attachment 303289 [details]
output of sudo lspci -vv

Hi Mika, thanks for the prompt reply. Please refer to the attached.
Comment 5 Mika Westerberg 2022-11-25 06:34:09 UTC
Thanks. It looks like the PCIe link to the xHCI is down:

03:02.0 PCI bridge: Intel Corporation JHL7540 Thunderbolt 3 Bridge [Titan Ridge 4C 2018] (rev 06) (prog-if 00 [Normal decode])
	Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx+
	Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
	Latency: 0
	Interrupt: pin A routed to IRQ 127
	Bus: primary=03, secondary=38, subordinate=38, sec-latency=0
...
		LnkSta:	Speed 2.5GT/s (ok), Width x4 (ok)
			TrErr- Train- SlotClk+ DLActive- BWMgmt- ABWMgmt-

I have not seen this one before. What's the firmware version? You can find this out by reading /sys/bus/thunderbolt/devices/0-0/nvm_version.
Comment 6 Chris Chiu 2022-11-25 07:58:19 UTC
The nvm_version shows 61.0
Comment 7 Mathias Nyman 2022-11-28 10:47:17 UTC
dmesg is from a 5.15.0-52-generic kernel.
There are some changes in pci code in this area since then.
Does a 6.0 or 6.1-rc kernel behave the same?

The "HC died; cleaning up" message is expected when xhci driver
can't read a mmio xHC registers (gets 0xffffffff).
This is normally the case when PCI device is removed, or as in this
case not yet powered (still in D3)

No idea why PCI link fails to go to D0
Comment 8 Chris Chiu 2022-11-28 12:41:29 UTC
Created attachment 303306 [details]
dmesg of kernel 6.0
Comment 9 Chris Chiu 2022-11-28 12:42:56 UTC
Created attachment 303307 [details]
dmesg of kernel 6.1

Yes. The error is the same on kernel 6.0 and 6.1. I'm also curious about why the power transition fail. There's no problem on non-CML platforms.
Comment 10 Mika Westerberg 2022-11-28 12:54:22 UTC
Can you enable CONFIG_PCI_DEBUG=y and attach full dmesg (v6.1 kernel preferably but v6.0 is fine too)?
Comment 11 Chris Chiu 2022-11-29 03:26:51 UTC
Created attachment 303316 [details]
dmesg with PCI_DEBUG

[   35.907673] pcieport 0000:03:02.0: waiting 100 ms for downstream link, after activation
[   35.907687] pcieport 0000:03:00.0: waiting 100 ms for downstream link, after activation
[   37.038857] xhci_hcd 0000:38:00.0: waiting additional 100 ms to become accessible
[   37.039165] thunderbolt 0000:04:00.0: restoring config space at offset 0x14 (was 0x0, writing 0x9df40000)
[   37.039181] thunderbolt 0000:04:00.0: restoring config space at offset 0x10 (was 0x0, writing 0x9df00000)
[   37.039196] thunderbolt 0000:04:00.0: restoring config space at offset 0x4 (was 0x100000, writing 0x100406)
[   37.146904] xhci_hcd 0000:38:00.0: Unable to change power state from D3cold to D0, device inaccessible
[   37.148017] pcieport 0000:03:02.0: re-enabling LTR
Comment 12 Mika Westerberg 2022-11-29 11:31:39 UTC
Can you add more delay to the drivers/pci/pci.c::pci_bridge_wait_for_secondary_bus(). Something like 500ms or more and see if that changes anything?
Comment 13 Mika Westerberg 2022-11-29 11:33:32 UTC
Also if possible try to disable ASPM.
Comment 14 Chris Chiu 2022-11-30 02:50:39 UTC
Created attachment 303328 [details]
dmesg for add d3dcold delay

Add delay seems no help. Please refer to the log if you're interested.
Comment 15 Chris Chiu 2022-11-30 02:52:40 UTC
Created attachment 303329 [details]
dmesg with aspm off

I forced the ASPM off with kernel parameter `pcie_aspm=off`. It does help and at least the usb4 can survive across many suspend/resume. What do you suggest for next step? Thanks
Comment 16 Mika Westerberg 2022-11-30 08:06:35 UTC
There is something wrong now:

[    0.372625] acpi PNP0A08:00: _OSC: OS supports [ExtendedConfig Segments MSI EDR HPX-Type3]
[    0.372627] acpi PNP0A08:00: _OSC: not requesting OS control; OS requires [ExtendedConfig ASPM ClockPM MSI]

This means pretty much no PCIe at all so this cannot work as expected. Is there no way of disabling ASPM from the BIOS side? Or then through sysfs between the PCIe downstream port and the xHCI?
Comment 17 Chris Chiu 2022-11-30 13:17:55 UTC
           +-1c.0-[01]----00.0  Realtek Semiconductor Co., Ltd. RTS525A PCI Express Card Reader
           +-1c.4-[02-6b]----00.0-[03-6b]--+-00.0-[04]----00.0  Intel Corporation JHL7540 Thunderbolt 3 NHI [Titan Ridge 4C 2018]
           |                               +-01.0-[05-37]--
           |                               +-02.0-[38]----00.0  Intel Corporation JHL7540 Thunderbolt 3 USB Controller [Titan Ridge 4C 2018]
           |                               \-04.0-[39-6b]--


I disabled the aspm by assign 0 individually to 
/sys/devices/pci0000:00/0000:00:1c.4/0000:02:00.0/0000:03:02.0/0000:38:00.0/link/l1_aspm

and 

/sys/devices/pci0000:00/0000:00:1c.4/0000:02:00.0/link/l1_2_aspm
/sys/devices/pci0000:00/0000:00:1c.4/0000:02:00.0/link/l1_1_aspm
/sys/devices/pci0000:00/0000:00:1c.4/0000:02:00.0/link/l1_aspm

The same problem happens.

But I found something interesting which makes difference. If I disabled the "Wake on Dell USB-C dock" feature in BIOS setting, the problem will be gone. That matches the finding in the description, the power state change failure will always happens then the HC died after. I'll ask more detail about this feature. But also need suggestion about how we should handle this unexpected wakeup for the xhci_hcd.
Comment 18 Chris Chiu 2022-12-14 08:25:26 UTC
ODM claims they can still reproduce even with the "Wake on Dell USB-C dock" disabled in BIOS.
And they still see the power transition problem in dmesg of the fail case

[  330.481613] ACPI: EC: interrupt unblocked
[  332.132772] xhci_hcd 0000:38:00.0: can't change power state from D3cold to D0 (config space inaccessible)
[  332.203642] xhci_hcd 0000:38:00.0: can't change power state from D3cold to D0 (config space inaccessible)
[  332.203652] xhci_hcd 0000:38:00.0: Controller not ready at resume -19
[  332.203654] xhci_hcd 0000:38:00.0: PCI post-resume error -19!
[  332.203655] xhci_hcd 0000:38:00.0: HC died; cleaning up
[  332.203658] PM: dpm_run_callback(): pci_pm_resume+0x0/0x100 returns -19
[  332.203664] xhci_hcd 0000:38:00.0: PM: failed to resume async: error -19
[  332.210099] nvme nvme0: 8/0/0 default/read/poll queues
[  332.439757] OOM killer enabled.
[  332.439759] Restarting tasks ...

Do I need to turn on the ACPI log for the power transition problem?
Comment 19 Mika Westerberg 2022-12-15 11:52:25 UTC
It is likely that this is not an ACPI issue because rest of the Titan Ridge (TBT etc.) come up just fine and the power management is done for the root port so you would expect it to affect the whole add-in-card. Unfortunately I don't have any ideas how to debug this further either. I can check if we have this machine in our TBT team and if yes to try to reproduce. What is the exact model?
Comment 20 Xiao Liang 2022-12-16 07:51:56 UTC
Dell can provide a machine to you for issue further debug, is it ok for you?
Comment 21 Mika Westerberg 2022-12-16 07:55:17 UTC
Hi, yes that's ok too.
Comment 22 Mika Westerberg 2023-01-13 14:33:03 UTC
Created attachment 303594 [details]
Use pci_dev_wait() to check if the xHCI is reponsive after the link is activated

Can you try the attached patch? It seems that sometimes it takes longer than the additional 100ms (up to seconds) for the xHCI to become responsive after the PCIe link has been trained. This fixes the issue on the Dell Come Lake based system at least.
Comment 23 Mika Westerberg 2023-02-01 07:20:49 UTC
Created attachment 303670 [details]
Use pci_dev_wait() to check if the xHCI is reponsive after the link is activated (corrected)

Sorry, I attached wrong patch. Can you please try this one instead.
Comment 24 Liang.Xiao1 2023-02-01 07:21:09 UTC
Created attachment 303671 [details]
attachment-24380-0.html

Hi Dear Sender,
I'm on leave today without email access, please expect my delay response.

Sorry for inconvinence.

Thanks
/Liang
Comment 25 Chris Chiu 2023-02-08 04:29:53 UTC
The problem is gone after applying the patch in #23. @Mika, could you send it upstream then we can start working the next step? Thanks
Comment 26 Mika Westerberg 2023-02-08 13:54:22 UTC
Thanks for testing. I will send it upstream after v6.3-rc1 is released since we are pretty late in the release cycle.
Comment 27 Mika Westerberg 2023-03-06 12:03:58 UTC
Hi, can you try if this reproduces on v6.3-rc1? There are fixes from Lukas that should handle this one too:

ac91e6980563 ("PCI: Unify delay handling for reset and resume")

and related. I will check tomorrow on that CML based system when I'm in the office.
Comment 28 Mario Kicherer 2023-05-02 10:59:28 UTC
Are there plans to backport this to the stable/longterm branches?
Comment 29 Liang.Xiao1 2023-05-02 11:00:44 UTC
Created attachment 304201 [details]
attachment-24626-0.html

Hi Dear Sender,
I'm OoO for Labor Day Holiday during 4/29 ~ 5/5 without email access, please expect my delay response.

Sorry for inconvinence.

Thanks
/Liang
Comment 30 Chris Chiu 2023-05-02 12:31:20 UTC
https://www.spinics.net/lists/linux-pci/msg138650.html is the upstream patch and may land in v6.4. WIll backport it when it's in mainline kernel.
Comment 31 Neo Wong 2023-05-15 01:57:45 UTC
@Chris 

Is patch mentioned in #30 verified in current kernel version ?
Comment 32 Matt L 2023-07-02 18:49:53 UTC
Created attachment 304529 [details]
possible regression dmesg

Hello,

The above patch *may* have caused a regression on my system.  I have a thunderbolt dock and if I disconnect it during suspend of my laptop (Dell XPS13 plus) the laptop fails to resume.  The laptop is alive but userspace processes do not resumme.  The only way I've managed to fix this is to put the laptop back to sleep, then the following resume starts userspace processes.

Attached is a dmesg with pci_dbg messages enabled.

Thanks,
Matt
Comment 33 Liang.Xiao1 2023-07-02 18:50:14 UTC
Created attachment 304530 [details]
attachment-24997-0.html

Hi Dear Sender,
I'm OoO until 7/7 with limited email access, please expect my delay response.

If you need immediate assistance please reach out to Shen, Yijun <Yijun_Shen@Dell.com>

Sorry for inconvenience

Thanks
/Liang
Comment 34 Denis Loginov 2023-07-05 23:52:57 UTC
fwiw, I'm still seeing this issue even with https://www.spinics.net/lists/linux-pci/msg138650.html applied to 6.4.1

Moreover, this happens even without suspend/resume by just issuing `lspci` upon cold boot. It hangs either immediately or after a few successive lspci's in a row, with `Unable to change power state from D3cold to D0, device inaccessible` appearing shortly after.

The only thing that seems to help is `pcie_aspm=off`. Strangely, it *was* working even without that previously - I haven't updated the kernel or BIOS/hardware configuration, and then it started appearing again, even though the patch seems to have worked previously.
Comment 35 Kamil Paral 2023-08-21 12:34:58 UTC
Hello, a fix for this bugzilla ticket seems to have caused a regression on my hardware. A Thinkpad T480s laptop with a Thinkpad Thunderbolt 3 Dock connected can no longer resume from suspend. I reported the issue on the linux-pci and regressions lists:
https://lore.kernel.org/linux-pci/CA+cBOTeWrsTyANjLZQ=bGoBQ_yOkkV1juyRvJq-C8GOrbW6t9Q@mail.gmail.com/T/#u
https://lore.kernel.org/regressions/CA+cBOTeWrsTyANjLZQ=bGoBQ_yOkkV1juyRvJq-C8GOrbW6t9Q@mail.gmail.com/T/#u
Comment 36 Kamil Paral 2023-08-25 11:13:30 UTC
(In reply to Kamil Paral from comment #35)
> Hello, a fix for this bugzilla ticket seems to have caused a regression on
> my hardware. A Thinkpad T480s laptop with a Thinkpad Thunderbolt 3 Dock
> connected can no longer resume from suspend.

This turned out to be most probably a laptop's firmware issue. See the linked lists for more info.
Comment 37 Mark Blakeney 2023-08-30 00:41:14 UTC
I'd like to confirm the same issue as Matt L reports just above (yay, I finally found somebody else reporting this issue!). I have a Dell laptop and Dell Thunderbolt dock and normally work on 2 dock connected screens + keyboard + mouse with the laptop screen closed. If my laptop suspends like this, and I then unplug the dock (e.g. to take the laptop away) then the screen stays blank when I try to resume. This started when Arch Linux updated the kernel from 6.3.9 (good) to 6.4.1 (bad) and I have tested it bad with every point release of 6.4 since up to 6.4.12 and is also bad on today's 6.5 which just rolled in to Arch. I have generally been using the LTS kernel (6.1.49) to avoid this bug.

Actually, if you wait 2.5 minutes then the screen does kick back in to life although this is inconsistent now on 6.5.

@Kamil, I was guided here from your similar bug at (https://bugzilla.redhat.com/show_bug.cgi?id=2230357). I wouldn't be so quick to say this is a Lenovo firmware issue. Seems to me there is something dodgy with resume and thunderbolt and kernel 6.4/6.5.
Comment 38 Mark Blakeney 2023-09-14 12:30:06 UTC
I'll add to my comment above that since 6.5 the delay I mention of 2.5 mins is consistently now about 60 to 70 secs instead. However, after that time the laptop display returns but I find that some of my dock connected devices (displays, keyboard, and mouse) may not all recover. I have to re-plug the dock to kick them back into life.

To reiterate, all this worked fine on kernel 6.3.9 and earlier. Broke from 6.4.1 and still broken on current 6.5.3. This is all definitely related to the changes and issues discussed over this series of emails: https://www.spinics.net/lists/linux-pci/msg142902.html but there they seem to conclude it being a Lenovo issue whereas Matt L and I are seeing this 6.4/6.5 issue on Dell laptop + Dell dock so it is not Lenovo specific.
Comment 39 Mika Westerberg 2023-09-14 15:46:07 UTC
Hi, yes the delay happens because the PCIe link does not come up and this is the actual reason, not the patch that increased the delay. Can you add "thunderbolt.dyndbg=+p" in the kernel command line, reproduce and attach full dmesg so we can hopefully see why the PCIe tunnel is not created properly.

Also, did you tweak any BIOS settings from the defaults?
Comment 40 Mark Blakeney 2023-09-14 22:32:05 UTC
Created attachment 305110 [details]
thunderbolt.dyndbg=+p dmesg output
Comment 41 Mark Blakeney 2023-09-14 22:39:48 UTC
1) I added that kernel option and rebooted. 2) With the lid closed, I suspended my laptop. 3) Pulled out the dock cable. 4) Opened the lid and waited the 60 secs before the lid display recovered. 5) Plugged the dock cable back in and this time both external displays, keyboard, and mouse recovered (although GNOME swapped the windows I had open on the 2 external displays?). Then captured dmesg which I have attached.

I have not changed anything significant in the BIOS (not that I can recall). Note that I use fwupd often to ensure my BIOS and dock etc are always running the latest firmware. Dell updates both quite frequently and I am always running the latest. It was mentioned in that email thread that changing the BIOS thunderbird setting to "No security" helps alleviate this issue but I don't have any security options like that in my BIOS.
Comment 42 Mika Westerberg 2023-09-15 10:31:15 UTC
Okay you unplug the dock on purpose? Okay I see.

So what should happen is that these PCIe root ports do support active link reporting so with recent kernel (v6.5 I think) the PCI core checks if the link is active before it issues the ~60s (arbitrary) wait. However, it does not seem to work in your case. Now, you have "mem_sleep_default=deep" which forces S3 and possibly enters into completely untested BIOS paths (these systems are pretty much all s2idle).

Can you add CONFIG_PCI_DEBUG=y to your .config, and remove the above option from the command line, and see if it reproduces? If it does can you attach the full dmesg again?
Comment 43 Mika Westerberg 2023-09-15 10:32:02 UTC
Also I suggest to create a new bug and link it from here because this is not the same issue originally reported in this bugzilla.
Comment 44 Mark Blakeney 2023-09-15 23:12:33 UTC
OK, have raised bug #217915. Will address your comments above over there.
Comment 45 Rustam Abdullaev 2024-02-12 20:29:40 UTC
So, on my system, Lenovo X1 Extreme Gen 5, the commit e8b908146d44310473e43b3382eca126e12d279c does not fix the disappearing Thunderbolt issue and instead adds a 65-second delay when resuming from sleep.
Could it be that such a wait is too long? After upgrading to 6.4.1+ the wait after each resume is excruciating.
And can anything be done to fix the Thunderbolt issue on my system?

[   49.135899] pcieport 0000:21:02.0: Unable to change power state from D3cold to D0, device inaccessible
[   51.415400] xhci_hcd 0000:56:00.0: not ready 1023ms after resume; waiting
[   51.415407] thunderbolt 0000:22:00.0: not ready 1023ms after resume; waiting
[   52.471401] xhci_hcd 0000:56:00.0: not ready 2047ms after resume; waiting
[   52.471406] thunderbolt 0000:22:00.0: not ready 2047ms after resume; waiting
[   54.743394] xhci_hcd 0000:56:00.0: not ready 4095ms after resume; waiting
[   54.743399] thunderbolt 0000:22:00.0: not ready 4095ms after resume; waiting
[   59.095398] xhci_hcd 0000:56:00.0: not ready 8191ms after resume; waiting
[   59.095403] thunderbolt 0000:22:00.0: not ready 8191ms after resume; waiting
[   67.543405] xhci_hcd 0000:56:00.0: not ready 16383ms after resume; waiting
[   67.543410] thunderbolt 0000:22:00.0: not ready 16383ms after resume; waiting
[   85.207405] xhci_hcd 0000:56:00.0: not ready 32767ms after resume; waiting
[   85.207409] thunderbolt 0000:22:00.0: not ready 32767ms after resume; waiting
[  120.023396] xhci_hcd 0000:56:00.0: not ready 65535ms after resume; giving up
[  120.023401] thunderbolt 0000:22:00.0: not ready 65535ms after resume; giving up
[  120.023570] xhci_hcd 0000:56:00.0: Unable to change power state from D3cold to D0, device inaccessible
[  120.023628] thunderbolt 0000:22:00.0: Unable to change power state from D3cold to D0, device inaccessible
[  120.092723] ACPI: EC: event unblocked
[  120.094771] pcieport 0000:00:1d.0: pciehp: Slot(8): Card not present
[  120.095141] xhci_hcd 0000:56:00.0: Unable to change power state from D3cold to D0, device inaccessible
[  120.095163] xhci_hcd 0000:56:00.0: Controller not ready at resume -19
[  120.095166] xhci_hcd 0000:56:00.0: PCI post-resume error -19!
[  120.095167] xhci_hcd 0000:56:00.0: HC died; cleaning up
[  120.095175] xhci_hcd 0000:56:00.0: PM: dpm_run_callback(): pci_pm_resume+0x0/0x110 returns -19
[  120.095200] xhci_hcd 0000:56:00.0: PM: failed to resume async: error -19
[  120.110113] xhci_hcd 0000:56:00.0: remove, state 4
[  120.110120] usb usb3: USB disconnect, device number 1
Comment 46 Neo Wong 2024-02-12 20:30:25 UTC
Created attachment 305858 [details]
attachment-6036-0.html

Lunar New Year break, will back on 16/Feb.
-neo
Comment 47 Mika Westerberg 2024-02-13 07:08:35 UTC
Hi Rustam,

Can you open a new bugzilla about your issue where you describe the exact steps you do, the results you get vs. expectation and attach full dmesg (with thunderbolt.dyndbg=+p in the kernel command line)?