Bug 216728
Summary: | Thunderbolt USB Controller died after resume on Intel CometLake platform | ||
---|---|---|---|
Product: | Drivers | Reporter: | Chris Chiu (chris.chiu) |
Component: | USB | Assignee: | Default virtual assignee for Drivers/USB (drivers_usb) |
Status: | NEW --- | ||
Severity: | high | CC: | dev, dev, kernel, kparal, liang_xiao1, mark.blakeney, mathias.nyman, mika.westerberg, neo.wong, rustamabd |
Priority: | P1 | ||
Hardware: | Intel | ||
OS: | Linux | ||
Kernel Version: | 6.1-rc | Subsystem: | |
Regression: | No | Bisected commit-id: | |
Attachments: |
output of lspci -nnv
output of dmesg output of sudo lspci -vv dmesg of kernel 6.0 dmesg of kernel 6.1 dmesg with PCI_DEBUG dmesg for add d3dcold delay dmesg with aspm off Use pci_dev_wait() to check if the xHCI is reponsive after the link is activated Use pci_dev_wait() to check if the xHCI is reponsive after the link is activated (corrected) attachment-24626-0.html possible regression dmesg attachment-24997-0.html thunderbolt.dyndbg=+p dmesg output attachment-6036-0.html |
Description
Chris Chiu
2022-11-22 10:25:26 UTC
Created attachment 303262 [details]
output of lspci -nnv
Created attachment 303263 [details]
output of dmesg
Can you attach output of 'sudo lspci -vv' when this happens? This will show more details on the PCIe side. Created attachment 303289 [details]
output of sudo lspci -vv
Hi Mika, thanks for the prompt reply. Please refer to the attached.
Thanks. It looks like the PCIe link to the xHCI is down: 03:02.0 PCI bridge: Intel Corporation JHL7540 Thunderbolt 3 Bridge [Titan Ridge 4C 2018] (rev 06) (prog-if 00 [Normal decode]) Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx+ Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx- Latency: 0 Interrupt: pin A routed to IRQ 127 Bus: primary=03, secondary=38, subordinate=38, sec-latency=0 ... LnkSta: Speed 2.5GT/s (ok), Width x4 (ok) TrErr- Train- SlotClk+ DLActive- BWMgmt- ABWMgmt- I have not seen this one before. What's the firmware version? You can find this out by reading /sys/bus/thunderbolt/devices/0-0/nvm_version. The nvm_version shows 61.0 dmesg is from a 5.15.0-52-generic kernel. There are some changes in pci code in this area since then. Does a 6.0 or 6.1-rc kernel behave the same? The "HC died; cleaning up" message is expected when xhci driver can't read a mmio xHC registers (gets 0xffffffff). This is normally the case when PCI device is removed, or as in this case not yet powered (still in D3) No idea why PCI link fails to go to D0 Created attachment 303306 [details]
dmesg of kernel 6.0
Created attachment 303307 [details]
dmesg of kernel 6.1
Yes. The error is the same on kernel 6.0 and 6.1. I'm also curious about why the power transition fail. There's no problem on non-CML platforms.
Can you enable CONFIG_PCI_DEBUG=y and attach full dmesg (v6.1 kernel preferably but v6.0 is fine too)? Created attachment 303316 [details]
dmesg with PCI_DEBUG
[ 35.907673] pcieport 0000:03:02.0: waiting 100 ms for downstream link, after activation
[ 35.907687] pcieport 0000:03:00.0: waiting 100 ms for downstream link, after activation
[ 37.038857] xhci_hcd 0000:38:00.0: waiting additional 100 ms to become accessible
[ 37.039165] thunderbolt 0000:04:00.0: restoring config space at offset 0x14 (was 0x0, writing 0x9df40000)
[ 37.039181] thunderbolt 0000:04:00.0: restoring config space at offset 0x10 (was 0x0, writing 0x9df00000)
[ 37.039196] thunderbolt 0000:04:00.0: restoring config space at offset 0x4 (was 0x100000, writing 0x100406)
[ 37.146904] xhci_hcd 0000:38:00.0: Unable to change power state from D3cold to D0, device inaccessible
[ 37.148017] pcieport 0000:03:02.0: re-enabling LTR
Can you add more delay to the drivers/pci/pci.c::pci_bridge_wait_for_secondary_bus(). Something like 500ms or more and see if that changes anything? Also if possible try to disable ASPM. Created attachment 303328 [details]
dmesg for add d3dcold delay
Add delay seems no help. Please refer to the log if you're interested.
Created attachment 303329 [details]
dmesg with aspm off
I forced the ASPM off with kernel parameter `pcie_aspm=off`. It does help and at least the usb4 can survive across many suspend/resume. What do you suggest for next step? Thanks
There is something wrong now: [ 0.372625] acpi PNP0A08:00: _OSC: OS supports [ExtendedConfig Segments MSI EDR HPX-Type3] [ 0.372627] acpi PNP0A08:00: _OSC: not requesting OS control; OS requires [ExtendedConfig ASPM ClockPM MSI] This means pretty much no PCIe at all so this cannot work as expected. Is there no way of disabling ASPM from the BIOS side? Or then through sysfs between the PCIe downstream port and the xHCI? +-1c.0-[01]----00.0 Realtek Semiconductor Co., Ltd. RTS525A PCI Express Card Reader +-1c.4-[02-6b]----00.0-[03-6b]--+-00.0-[04]----00.0 Intel Corporation JHL7540 Thunderbolt 3 NHI [Titan Ridge 4C 2018] | +-01.0-[05-37]-- | +-02.0-[38]----00.0 Intel Corporation JHL7540 Thunderbolt 3 USB Controller [Titan Ridge 4C 2018] | \-04.0-[39-6b]-- I disabled the aspm by assign 0 individually to /sys/devices/pci0000:00/0000:00:1c.4/0000:02:00.0/0000:03:02.0/0000:38:00.0/link/l1_aspm and /sys/devices/pci0000:00/0000:00:1c.4/0000:02:00.0/link/l1_2_aspm /sys/devices/pci0000:00/0000:00:1c.4/0000:02:00.0/link/l1_1_aspm /sys/devices/pci0000:00/0000:00:1c.4/0000:02:00.0/link/l1_aspm The same problem happens. But I found something interesting which makes difference. If I disabled the "Wake on Dell USB-C dock" feature in BIOS setting, the problem will be gone. That matches the finding in the description, the power state change failure will always happens then the HC died after. I'll ask more detail about this feature. But also need suggestion about how we should handle this unexpected wakeup for the xhci_hcd. ODM claims they can still reproduce even with the "Wake on Dell USB-C dock" disabled in BIOS. And they still see the power transition problem in dmesg of the fail case [ 330.481613] ACPI: EC: interrupt unblocked [ 332.132772] xhci_hcd 0000:38:00.0: can't change power state from D3cold to D0 (config space inaccessible) [ 332.203642] xhci_hcd 0000:38:00.0: can't change power state from D3cold to D0 (config space inaccessible) [ 332.203652] xhci_hcd 0000:38:00.0: Controller not ready at resume -19 [ 332.203654] xhci_hcd 0000:38:00.0: PCI post-resume error -19! [ 332.203655] xhci_hcd 0000:38:00.0: HC died; cleaning up [ 332.203658] PM: dpm_run_callback(): pci_pm_resume+0x0/0x100 returns -19 [ 332.203664] xhci_hcd 0000:38:00.0: PM: failed to resume async: error -19 [ 332.210099] nvme nvme0: 8/0/0 default/read/poll queues [ 332.439757] OOM killer enabled. [ 332.439759] Restarting tasks ... Do I need to turn on the ACPI log for the power transition problem? It is likely that this is not an ACPI issue because rest of the Titan Ridge (TBT etc.) come up just fine and the power management is done for the root port so you would expect it to affect the whole add-in-card. Unfortunately I don't have any ideas how to debug this further either. I can check if we have this machine in our TBT team and if yes to try to reproduce. What is the exact model? Dell can provide a machine to you for issue further debug, is it ok for you? Hi, yes that's ok too. Created attachment 303594 [details]
Use pci_dev_wait() to check if the xHCI is reponsive after the link is activated
Can you try the attached patch? It seems that sometimes it takes longer than the additional 100ms (up to seconds) for the xHCI to become responsive after the PCIe link has been trained. This fixes the issue on the Dell Come Lake based system at least.
Created attachment 303670 [details]
Use pci_dev_wait() to check if the xHCI is reponsive after the link is activated (corrected)
Sorry, I attached wrong patch. Can you please try this one instead.
Created attachment 303671 [details]
attachment-24380-0.html
Hi Dear Sender,
I'm on leave today without email access, please expect my delay response.
Sorry for inconvinence.
Thanks
/Liang
The problem is gone after applying the patch in #23. @Mika, could you send it upstream then we can start working the next step? Thanks Thanks for testing. I will send it upstream after v6.3-rc1 is released since we are pretty late in the release cycle. Hi, can you try if this reproduces on v6.3-rc1? There are fixes from Lukas that should handle this one too: ac91e6980563 ("PCI: Unify delay handling for reset and resume") and related. I will check tomorrow on that CML based system when I'm in the office. Are there plans to backport this to the stable/longterm branches? Created attachment 304201 [details]
attachment-24626-0.html
Hi Dear Sender,
I'm OoO for Labor Day Holiday during 4/29 ~ 5/5 without email access, please expect my delay response.
Sorry for inconvinence.
Thanks
/Liang
https://www.spinics.net/lists/linux-pci/msg138650.html is the upstream patch and may land in v6.4. WIll backport it when it's in mainline kernel. @Chris Is patch mentioned in #30 verified in current kernel version ? Created attachment 304529 [details]
possible regression dmesg
Hello,
The above patch *may* have caused a regression on my system. I have a thunderbolt dock and if I disconnect it during suspend of my laptop (Dell XPS13 plus) the laptop fails to resume. The laptop is alive but userspace processes do not resumme. The only way I've managed to fix this is to put the laptop back to sleep, then the following resume starts userspace processes.
Attached is a dmesg with pci_dbg messages enabled.
Thanks,
Matt
Created attachment 304530 [details] attachment-24997-0.html Hi Dear Sender, I'm OoO until 7/7 with limited email access, please expect my delay response. If you need immediate assistance please reach out to Shen, Yijun <Yijun_Shen@Dell.com> Sorry for inconvenience Thanks /Liang fwiw, I'm still seeing this issue even with https://www.spinics.net/lists/linux-pci/msg138650.html applied to 6.4.1 Moreover, this happens even without suspend/resume by just issuing `lspci` upon cold boot. It hangs either immediately or after a few successive lspci's in a row, with `Unable to change power state from D3cold to D0, device inaccessible` appearing shortly after. The only thing that seems to help is `pcie_aspm=off`. Strangely, it *was* working even without that previously - I haven't updated the kernel or BIOS/hardware configuration, and then it started appearing again, even though the patch seems to have worked previously. Hello, a fix for this bugzilla ticket seems to have caused a regression on my hardware. A Thinkpad T480s laptop with a Thinkpad Thunderbolt 3 Dock connected can no longer resume from suspend. I reported the issue on the linux-pci and regressions lists: https://lore.kernel.org/linux-pci/CA+cBOTeWrsTyANjLZQ=bGoBQ_yOkkV1juyRvJq-C8GOrbW6t9Q@mail.gmail.com/T/#u https://lore.kernel.org/regressions/CA+cBOTeWrsTyANjLZQ=bGoBQ_yOkkV1juyRvJq-C8GOrbW6t9Q@mail.gmail.com/T/#u (In reply to Kamil Paral from comment #35) > Hello, a fix for this bugzilla ticket seems to have caused a regression on > my hardware. A Thinkpad T480s laptop with a Thinkpad Thunderbolt 3 Dock > connected can no longer resume from suspend. This turned out to be most probably a laptop's firmware issue. See the linked lists for more info. I'd like to confirm the same issue as Matt L reports just above (yay, I finally found somebody else reporting this issue!). I have a Dell laptop and Dell Thunderbolt dock and normally work on 2 dock connected screens + keyboard + mouse with the laptop screen closed. If my laptop suspends like this, and I then unplug the dock (e.g. to take the laptop away) then the screen stays blank when I try to resume. This started when Arch Linux updated the kernel from 6.3.9 (good) to 6.4.1 (bad) and I have tested it bad with every point release of 6.4 since up to 6.4.12 and is also bad on today's 6.5 which just rolled in to Arch. I have generally been using the LTS kernel (6.1.49) to avoid this bug. Actually, if you wait 2.5 minutes then the screen does kick back in to life although this is inconsistent now on 6.5. @Kamil, I was guided here from your similar bug at (https://bugzilla.redhat.com/show_bug.cgi?id=2230357). I wouldn't be so quick to say this is a Lenovo firmware issue. Seems to me there is something dodgy with resume and thunderbolt and kernel 6.4/6.5. I'll add to my comment above that since 6.5 the delay I mention of 2.5 mins is consistently now about 60 to 70 secs instead. However, after that time the laptop display returns but I find that some of my dock connected devices (displays, keyboard, and mouse) may not all recover. I have to re-plug the dock to kick them back into life. To reiterate, all this worked fine on kernel 6.3.9 and earlier. Broke from 6.4.1 and still broken on current 6.5.3. This is all definitely related to the changes and issues discussed over this series of emails: https://www.spinics.net/lists/linux-pci/msg142902.html but there they seem to conclude it being a Lenovo issue whereas Matt L and I are seeing this 6.4/6.5 issue on Dell laptop + Dell dock so it is not Lenovo specific. Hi, yes the delay happens because the PCIe link does not come up and this is the actual reason, not the patch that increased the delay. Can you add "thunderbolt.dyndbg=+p" in the kernel command line, reproduce and attach full dmesg so we can hopefully see why the PCIe tunnel is not created properly. Also, did you tweak any BIOS settings from the defaults? Created attachment 305110 [details]
thunderbolt.dyndbg=+p dmesg output
1) I added that kernel option and rebooted. 2) With the lid closed, I suspended my laptop. 3) Pulled out the dock cable. 4) Opened the lid and waited the 60 secs before the lid display recovered. 5) Plugged the dock cable back in and this time both external displays, keyboard, and mouse recovered (although GNOME swapped the windows I had open on the 2 external displays?). Then captured dmesg which I have attached. I have not changed anything significant in the BIOS (not that I can recall). Note that I use fwupd often to ensure my BIOS and dock etc are always running the latest firmware. Dell updates both quite frequently and I am always running the latest. It was mentioned in that email thread that changing the BIOS thunderbird setting to "No security" helps alleviate this issue but I don't have any security options like that in my BIOS. Okay you unplug the dock on purpose? Okay I see. So what should happen is that these PCIe root ports do support active link reporting so with recent kernel (v6.5 I think) the PCI core checks if the link is active before it issues the ~60s (arbitrary) wait. However, it does not seem to work in your case. Now, you have "mem_sleep_default=deep" which forces S3 and possibly enters into completely untested BIOS paths (these systems are pretty much all s2idle). Can you add CONFIG_PCI_DEBUG=y to your .config, and remove the above option from the command line, and see if it reproduces? If it does can you attach the full dmesg again? Also I suggest to create a new bug and link it from here because this is not the same issue originally reported in this bugzilla. OK, have raised bug #217915. Will address your comments above over there. So, on my system, Lenovo X1 Extreme Gen 5, the commit e8b908146d44310473e43b3382eca126e12d279c does not fix the disappearing Thunderbolt issue and instead adds a 65-second delay when resuming from sleep. Could it be that such a wait is too long? After upgrading to 6.4.1+ the wait after each resume is excruciating. And can anything be done to fix the Thunderbolt issue on my system? [ 49.135899] pcieport 0000:21:02.0: Unable to change power state from D3cold to D0, device inaccessible [ 51.415400] xhci_hcd 0000:56:00.0: not ready 1023ms after resume; waiting [ 51.415407] thunderbolt 0000:22:00.0: not ready 1023ms after resume; waiting [ 52.471401] xhci_hcd 0000:56:00.0: not ready 2047ms after resume; waiting [ 52.471406] thunderbolt 0000:22:00.0: not ready 2047ms after resume; waiting [ 54.743394] xhci_hcd 0000:56:00.0: not ready 4095ms after resume; waiting [ 54.743399] thunderbolt 0000:22:00.0: not ready 4095ms after resume; waiting [ 59.095398] xhci_hcd 0000:56:00.0: not ready 8191ms after resume; waiting [ 59.095403] thunderbolt 0000:22:00.0: not ready 8191ms after resume; waiting [ 67.543405] xhci_hcd 0000:56:00.0: not ready 16383ms after resume; waiting [ 67.543410] thunderbolt 0000:22:00.0: not ready 16383ms after resume; waiting [ 85.207405] xhci_hcd 0000:56:00.0: not ready 32767ms after resume; waiting [ 85.207409] thunderbolt 0000:22:00.0: not ready 32767ms after resume; waiting [ 120.023396] xhci_hcd 0000:56:00.0: not ready 65535ms after resume; giving up [ 120.023401] thunderbolt 0000:22:00.0: not ready 65535ms after resume; giving up [ 120.023570] xhci_hcd 0000:56:00.0: Unable to change power state from D3cold to D0, device inaccessible [ 120.023628] thunderbolt 0000:22:00.0: Unable to change power state from D3cold to D0, device inaccessible [ 120.092723] ACPI: EC: event unblocked [ 120.094771] pcieport 0000:00:1d.0: pciehp: Slot(8): Card not present [ 120.095141] xhci_hcd 0000:56:00.0: Unable to change power state from D3cold to D0, device inaccessible [ 120.095163] xhci_hcd 0000:56:00.0: Controller not ready at resume -19 [ 120.095166] xhci_hcd 0000:56:00.0: PCI post-resume error -19! [ 120.095167] xhci_hcd 0000:56:00.0: HC died; cleaning up [ 120.095175] xhci_hcd 0000:56:00.0: PM: dpm_run_callback(): pci_pm_resume+0x0/0x110 returns -19 [ 120.095200] xhci_hcd 0000:56:00.0: PM: failed to resume async: error -19 [ 120.110113] xhci_hcd 0000:56:00.0: remove, state 4 [ 120.110120] usb usb3: USB disconnect, device number 1 Created attachment 305858 [details]
attachment-6036-0.html
Lunar New Year break, will back on 16/Feb.
-neo
Hi Rustam, Can you open a new bugzilla about your issue where you describe the exact steps you do, the results you get vs. expectation and attach full dmesg (with thunderbolt.dyndbg=+p in the kernel command line)? |