Bug 219824 - [6.13 regression] USB controller just died
Status: RESOLVED CODE_FIX
Product: Drivers
Classification: Unclassified
Component: USB
Hardware: AMD Linux
Importance: P3 blocking
Assignee: Default virtual assignee for Drivers/USB
Reported: 2025-02-26 22:23 UTC by Artem S. Tashkinov
Modified: 2025-03-14 10:49 UTC (History)
Kernel Version: 6.13.4
Regression: Yes
Bisected commit-id: 36b972d4b7cef5d098de63fee8d00720c051f335


Attachments
xhci_hcd and usb debug log (14.21 KB, application/x-xz)
2025-02-28 19:49 UTC, Artem S. Tashkinov

Description Artem S. Tashkinov 2025-02-26 22:23:50 UTC
This is the first time it's happened to me.

My USB mouse just died.

The system was more or less idle, then I got these messages from the kernel:

[109773.985092] xhci_hcd 0000:c3:00.3: xHCI host not responding to stop endpoint command
[109773.998577] xhci_hcd 0000:c3:00.3: xHCI host controller not responding, assume dead
[109773.998622] xhci_hcd 0000:c3:00.3: HC died; cleaning up
[109773.998668] xhci_hcd 0000:c3:00.3: Timeout while waiting for stop endpoint command
[109773.998740] usb 1-2: USB disconnect, device number 2
[109774.032612] usb 1-3: USB disconnect, device number 3
[109774.033087] usb 1-4: USB disconnect, device number 4

This has never happened before with any of the previous kernels: 6.9, 6.10, 6.11, 6.12.

Now on 6.13.4 this happened a few minutes after the system resumed.

That looks like a major regression.

The kernel didn't try to recover the controller.

Unbinding and binding the USB endpoint in /sys using this script has fixed the mouse but I never had to do that before:

https://unix.stackexchange.com/a/704342
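
For readers who cannot open the link, a rebind of this kind can be sketched roughly as follows. This is a minimal sketch and not the linked script: it assumes the dead controller is the PCI function 0000:c3:00.3 from the log above and that it is bound to the xhci_hcd PCI driver. The `SYSFS_ROOT` variable and the `rebind_xhci` function name are made up for this example.

```shell
#!/bin/sh
# Sketch: unbind and rebind one xHCI controller through sysfs.
# SYSFS_ROOT and rebind_xhci are hypothetical names for this example;
# on a real system SYSFS_ROOT is /sys and the writes need root.
SYSFS_ROOT="${SYSFS_ROOT:-/sys}"

rebind_xhci() {
    dev="$1"    # PCI address of the controller, e.g. 0000:c3:00.3
    drv="$SYSFS_ROOT/bus/pci/drivers/xhci_hcd"

    # Bail out if the driver directory is not where we assume it is.
    [ -e "$drv/unbind" ] || { echo "no xhci_hcd driver at $drv" >&2; return 1; }

    # Detach the driver from the device, wait a moment, then attach it again.
    printf '%s' "$dev" > "$drv/unbind" || return 1
    sleep 1
    printf '%s' "$dev" > "$drv/bind"
}
```

Run as root, `rebind_xhci 0000:c3:00.3` would then re-probe the controller and re-enumerate the devices behind it.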

My lspci:

c3:00.1 Audio device: Advanced Micro Devices, Inc. [AMD/ATI] Rembrandt Radeon High Definition Audio Controller
c3:00.6 Audio device: Advanced Micro Devices, Inc. [AMD] Family 17h/19h/1ah HD Audio Controller
c3:00.2 Encryption controller: Advanced Micro Devices, Inc. [AMD] Phoenix CCP/PSP 3.0 Device
00:18.0 Host bridge: Advanced Micro Devices, Inc. [AMD] Phoenix Data Fabric; Function 0
00:18.1 Host bridge: Advanced Micro Devices, Inc. [AMD] Phoenix Data Fabric; Function 1
00:18.2 Host bridge: Advanced Micro Devices, Inc. [AMD] Phoenix Data Fabric; Function 2
00:18.3 Host bridge: Advanced Micro Devices, Inc. [AMD] Phoenix Data Fabric; Function 3
00:18.4 Host bridge: Advanced Micro Devices, Inc. [AMD] Phoenix Data Fabric; Function 4
00:18.5 Host bridge: Advanced Micro Devices, Inc. [AMD] Phoenix Data Fabric; Function 5
00:18.6 Host bridge: Advanced Micro Devices, Inc. [AMD] Phoenix Data Fabric; Function 6
00:18.7 Host bridge: Advanced Micro Devices, Inc. [AMD] Phoenix Data Fabric; Function 7
00:01.0 Host bridge: Advanced Micro Devices, Inc. [AMD] Phoenix Dummy Host Bridge
00:02.0 Host bridge: Advanced Micro Devices, Inc. [AMD] Phoenix Dummy Host Bridge
00:03.0 Host bridge: Advanced Micro Devices, Inc. [AMD] Phoenix Dummy Host Bridge
00:04.0 Host bridge: Advanced Micro Devices, Inc. [AMD] Phoenix Dummy Host Bridge
00:08.0 Host bridge: Advanced Micro Devices, Inc. [AMD] Phoenix Dummy Host Bridge
00:00.0 Host bridge: Advanced Micro Devices, Inc. [AMD] Phoenix Root Complex
00:00.2 IOMMU: Advanced Micro Devices, Inc. [AMD] Phoenix IOMMU
00:14.3 ISA bridge: Advanced Micro Devices, Inc. [AMD] FCH LPC Bridge (rev 51)
c3:00.5 Multimedia controller: Advanced Micro Devices, Inc. [AMD] ACP/ACP3X/ACP6x Audio Coprocessor (rev 63)
01:00.0 Network controller: MEDIATEK Corp. MT7922 802.11ax PCI Express Wireless Network Adapter
c4:00.0 Non-Essential Instrumentation [1300]: Advanced Micro Devices, Inc. [AMD] Phoenix Dummy Function
c5:00.0 Non-Essential Instrumentation [1300]: Advanced Micro Devices, Inc. [AMD] Phoenix Dummy Function
02:00.0 Non-Volatile memory controller: Micron Technology Inc 3400 NVMe SSD [Hendrix]
00:03.1 PCI bridge: Advanced Micro Devices, Inc. [AMD] Family 19h USB4/Thunderbolt PCIe tunnel
00:04.1 PCI bridge: Advanced Micro Devices, Inc. [AMD] Family 19h USB4/Thunderbolt PCIe tunnel
00:02.2 PCI bridge: Advanced Micro Devices, Inc. [AMD] Phoenix GPP Bridge
00:02.4 PCI bridge: Advanced Micro Devices, Inc. [AMD] Phoenix GPP Bridge
00:08.1 PCI bridge: Advanced Micro Devices, Inc. [AMD] Phoenix Internal GPP Bridge to Bus [C:A]
00:08.2 PCI bridge: Advanced Micro Devices, Inc. [AMD] Phoenix Internal GPP Bridge to Bus [C:A]
00:08.3 PCI bridge: Advanced Micro Devices, Inc. [AMD] Phoenix Internal GPP Bridge to Bus [C:A]
c4:00.1 Signal processing controller: Advanced Micro Devices, Inc. [AMD] AMD IPU Device
c3:00.7 Signal processing controller: Advanced Micro Devices, Inc. [AMD] Sensor Fusion Hub
00:14.0 SMBus: Advanced Micro Devices, Inc. [AMD] FCH SMBus Controller (rev 71)
c3:00.3 USB controller: Advanced Micro Devices, Inc. [AMD] Device 15b9
c3:00.4 USB controller: Advanced Micro Devices, Inc. [AMD] Device 15ba
c5:00.3 USB controller: Advanced Micro Devices, Inc. [AMD] Device 15c0
c5:00.4 USB controller: Advanced Micro Devices, Inc. [AMD] Device 15c1
c5:00.5 USB controller: Advanced Micro Devices, Inc. [AMD] Pink Sardine USB4/Thunderbolt NHI controller #1
c5:00.6 USB controller: Advanced Micro Devices, Inc. [AMD] Pink Sardine USB4/Thunderbolt NHI controller #2
c3:00.0 VGA compatible controller: Advanced Micro Devices, Inc. [AMD/ATI] Phoenix1 (rev d4)

My lsusb:

Bus 001 Device 001: ID 1d6b:0002 Linux Foundation 2.0 root hub
Bus 001 Device 002: ID 0489:e0f2 Foxconn / Hon Hai Wireless_Device
Bus 001 Device 003: ID 06cb:00f0 Synaptics, Inc. 
Bus 002 Device 001: ID 1d6b:0003 Linux Foundation 3.0 root hub
Bus 003 Device 001: ID 1d6b:0002 Linux Foundation 2.0 root hub
Bus 003 Device 002: ID 0408:545f Quanta Computer, Inc. HP 5MP Camera
Bus 004 Device 001: ID 1d6b:0003 Linux Foundation 3.0 root hub
Bus 005 Device 001: ID 1d6b:0002 Linux Foundation 2.0 root hub
Bus 006 Device 001: ID 1d6b:0003 Linux Foundation 3.0 root hub
Bus 007 Device 001: ID 1d6b:0002 Linux Foundation 2.0 root hub
Bus 008 Device 001: ID 1d6b:0003 Linux Foundation 3.0 root hub

I'm not bisecting this issue because so far it's happened just once and I've no idea how to trigger it. Yet it has never happened before with previous kernels.
Comment 1 Artem S. Tashkinov 2025-02-26 22:28:20 UTC
I'm utterly confused as to why the kernel decided to report "xHCI host not responding to stop endpoint command".

I didn't do anything at the time. Wasn't even using the mouse.

Something funky is going on with 6.13.
Comment 2 Artem S. Tashkinov 2025-02-26 22:31:25 UTC
This was reported in the SUSE bug tracker earlier:

https://bugzilla.suse.com/show_bug.cgi?id=1236992

I don't see it reported here, but the issue is evidently not new.

Yet I see no patches queued for 6.13.5.
Comment 3 Artem S. Tashkinov 2025-02-26 22:47:52 UTC
The SUSE issue is seemingly unrelated, please dismiss.
Comment 4 Artem S. Tashkinov 2025-02-27 12:58:38 UTC
This just happened again:

[161470.836493] PM: resume devices took 0.547 seconds
[161470.836720] OOM killer enabled.
[161470.836721] Restarting tasks ... done.
[161470.839715] random: crng reseeded on system resumption
[161470.845090] PM: suspend exit
[161471.322491] cs35l41-hda i2c-CSC3551:00-cs35l41-hda.0: DSP1: Firmware: 400a4 vendor: 0x2 v0.43.1, 2 algorithms
[161471.324469] cs35l41-hda i2c-CSC3551:00-cs35l41-hda.0: DSP1: cirrus/cs35l41-dsp1-spk-prot-103c8b72.bin: v0.43.1
[161471.324480] cs35l41-hda i2c-CSC3551:00-cs35l41-hda.0: DSP1: spk-prot: D:\Amp Tuning\HP\840\0930\103C8B45_220930.bin
[161471.403951] cs35l41-hda i2c-CSC3551:00-cs35l41-hda.1: Calibration applied: R0=10446
[161471.407392] cs35l41-hda i2c-CSC3551:00-cs35l41-hda.0: Calibration applied: R0=10526
[161471.432157] cs35l41-hda i2c-CSC3551:00-cs35l41-hda.1: Firmware Loaded - Type: spk-prot, Gain: 17
[161471.433916] cs35l41-hda i2c-CSC3551:00-cs35l41-hda.0: Firmware Loaded - Type: spk-prot, Gain: 17
[161471.523827] hp_wmi: Unknown event_id - 131073 - 0x0
[162644.637587] xhci_hcd 0000:c3:00.3: xHCI host not responding to stop endpoint command
[162644.651068] xhci_hcd 0000:c3:00.3: xHCI host controller not responding, assume dead
[162644.651076] xhci_hcd 0000:c3:00.3: HC died; cleaning up
[162644.651099] xhci_hcd 0000:c3:00.3: Timeout while waiting for stop endpoint command
[162644.651102] usb 1-2: USB disconnect, device number 4
[162644.678374] usb 1-3: USB disconnect, device number 2
[162644.678748] usb 1-4: USB disconnect, device number 3

Shortly after resume all the USB ports are disabled.

I'm reverting back to Linux 6.11. I cannot use my device like this.
Comment 5 Mathias Nyman 2025-02-27 15:12:46 UTC
6.13 has a lot of changes related to endpoint stopping:

e21ebe51af68 xhci: Turn NEC specific quirk for handling Stop Endpoint errors generic
474538b8dd1c usb: xhci: Avoid queuing redundant Stop Endpoint commands
484c3bab2d5d usb: xhci: Fix TD invalidation under pending Set TR Dequeue
42b758137601 usb: xhci: Limit Stop Endpoint retries

Endpoints are stopped in order to cancel transfers, before suspend, and to soft reset an endpoint after clearing a halt. 

I understand that bisecting an issue like this that triggers rarely isn't an option, but can I ask you to try running 6.13 with xhci dynamic debug enabled?

mount -t debugfs none /sys/kernel/debug
echo 'module xhci_hcd =p' >/sys/kernel/debug/dynamic_debug/control
echo 'module usbcore =p' >/sys/kernel/debug/dynamic_debug/control
and send dmesg after the issue is triggered.

It could reveal a bit more about what's going on.
Comment 6 Artem S. Tashkinov 2025-02-27 16:05:34 UTC
(In reply to Mathias Nyman from comment #5)
> I understand that bisecting an issue like this that triggers rarely isn't an
> option, but can I ask you to try running 6.13 with xhci dynamic debug
> enabled.

Will do as soon as possible. Thanks a lot!
Comment 7 Michał Pecio 2025-02-27 17:14:31 UTC
Which exact versions were you running successfully and for how long?

The patches listed by Mathias are the obvious first suspects, but they were all backported to v6.12.7 in December. Most of them were also backported to v6.11.11 in early December, and later in January to some LTS series.

Any chance that hibernation is indeed a (delayed) trigger and you weren't doing it as often in the past?

Did you come across similar reports from stable kernel branches this year?
Comment 8 Artem S. Tashkinov 2025-02-27 21:07:45 UTC
> Which exact versions were you running successfully and for how long?

Kernel 6.12.14 that I was running earlier didn't have this issue.

Used software suspend/resume multiple times successfully.

> Any chance that hibernation is indeed a (delayed) trigger and you weren't
> doing it as often in the past?

Not using hibernation, just software suspend.

I've not changed anything software-wise except installing a new kernel on this laptop.

> Did you come across similar reports from stable kernel branches in this year?

I've Googled a couple of times already for this exact error message and nothing turned up.
Comment 9 Artem S. Tashkinov 2025-02-28 19:49:03 UTC
Created attachment 307725 [details]
xhci_hcd and usb debug log

(In reply to Mathias Nyman from comment #5)
> 6.13 has a lot of changes related to endpoint stopping:
> 
> e21ebe51af68 xhci: Turn NEC specific quirk for handling Stop Endpoint errors
> generic
> 474538b8dd1c usb: xhci: Avoid queuing redundant Stop Endpoint commands
> 484c3bab2d5d usb: xhci: Fix TD invalidation under pending Set TR Dequeue
> 42b758137601 usb: xhci: Limit Stop Endpoint retries
> 
> Endpoints are stopped in order to cancel transfers, before suspend, and to
> soft reset an endpoint after clearing a halt. 
> 
> I understand that bisecting an issue like this that triggers rarely isn't an
> option, but can I ask you to try running 6.13 with xhci dynamic debug
> enabled.
> 
> mount -t debugfs none /sys/kernel/debug
> echo 'module xhci_hcd =p' >/sys/kernel/debug/dynamic_debug/control
> echo 'module usbcore =p' >/sys/kernel/debug/dynamic_debug/control
> and send dmesg after issue is triggered.
> 
> It could reveal a bit more what's going on

I'm confused.

If I resume the laptop and don't run these three commands immediately, all the USB ports eventually die (usually under 5 minutes).

If I resume the laptop and run these commands immediately, USB ports continue working like they always did before. So, weirdly and unexpectedly, when debugging is on ... it fixes the issue.

If I resume the laptop, don't run these commands, and then when the USB ports die I run them, there are no further messages from the xhci_hcd module.

I'm attaching the debug log (again, no bug here) regardless. Maybe it contains something that will let you understand what is going on.
Comment 10 Michał Pecio 2025-02-28 20:10:36 UTC
What if you enable only this one thing? Does anything show up under normal use or before the HC dies (if it still does)?

echo 'func xhci_handle_cmd_stop_ep +p' >/proc/dynamic_debug/control
Comment 11 Mathias Nyman 2025-03-03 13:54:41 UTC
(In reply to Artem S. Tashkinov from comment #9)
> Created attachment 307725 [details]
> xhci_hcd and usb debug log

Thanks, it does show some last-minute URB canceling for endpoints
on the 1-3 device before suspend, and some more canceling after resume.
There are also overflow/underflow messages after resume, indicating that an
endpoint might be started early.

> 
> I'm confused.
> 
> If I resume the laptop and don't run these three commands immediately, all
> the USB ports eventually die (usually under 5 minutes).
> 
> If I resume the laptop and run these commands immediately, USB ports
> continue working like they always did before. So, weirdly and unexpectedly,
> when debugging is on ... it fixes the issue.

dynamic debug adds delays, and the code that starts and stops endpoints is
a bit timing sensitive. Could be that enabling debug hides the issue.

Can you still run one more try with xhci tracing instead of dynamic debug?
It does not affect timing as much:

mount -t debugfs none /sys/kernel/debug
echo 81920 > /sys/kernel/debug/tracing/buffer_size_kb
echo 1 > /sys/kernel/debug/tracing/events/xhci-hcd/enable
echo 1 > /sys/kernel/debug/tracing/tracing_on
< Reproduce issue >
Send content of /sys/kernel/debug/tracing/trace

The trace file grows fast, so copy it as soon as possible after the issue is triggered.

Thanks
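
The tracing steps in this comment could be wrapped roughly like this, so that the buffer is frozen before it is copied. This is a minimal sketch under assumptions: debugfs is already mounted as shown in the comment, and the `TRACEFS` variable and the two function names are made up for this example. Writing 0 to `tracing_on` stops recording, which keeps later events from pushing out the interesting ones.

```shell
#!/bin/sh
# Sketch: start xhci tracing, then freeze and save the trace buffer.
# TRACEFS, start_xhci_trace and save_xhci_trace are hypothetical names;
# the paths underneath are the ones given in the comment above. Needs root.
TRACEFS="${TRACEFS:-/sys/kernel/debug/tracing}"

start_xhci_trace() {
    # Enlarge the buffer, enable all xhci-hcd events, start recording.
    echo 81920 > "$TRACEFS/buffer_size_kb" &&
    echo 1 > "$TRACEFS/events/xhci-hcd/enable" &&
    echo 1 > "$TRACEFS/tracing_on"
}

save_xhci_trace() {
    out="$1"    # destination file for the copied trace
    # Stop recording first so the buffer no longer changes, then copy it.
    echo 0 > "$TRACEFS/tracing_on" &&
    cat "$TRACEFS/trace" > "$out"
}
```

The idea is to call `start_xhci_trace` before reproducing the issue and `save_xhci_trace /tmp/xhci-trace.txt` (the output path is only an example) as soon as "HC died" shows up in dmesg.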
Comment 12 Artem S. Tashkinov 2025-03-03 15:42:52 UTC
(In reply to Michał Pecio from comment #10)
> What if you enable only this one thing? Does anything show up under normal
> use or before the HC dies (if it still does)?
> 
> echo 'func xhci_handle_cmd_stop_ep +p' >/proc/dynamic_debug/control

The same issue. Without debug the USB controller dies in a few minutes; with just this line it works just fine.

> Can you still run one more try with xhci tracing instead of dynamic debug?
> It does not affect timing as much:

I will try this ASAP.
Comment 13 Michał Pecio 2025-03-03 19:23:48 UTC
Hmm, can enabling a debug message which never gets printed really make a big difference? Maybe I'm crazy, but I wonder if it's possible that your interaction with USB devices while entering those commands somehow prevents (or at least delays) the failure?

I suppose that both dynamic debug and tracing settings persist across a suspend cycle, so if you still have no luck, maybe try setting everything up before suspending and then do nothing after resuming?
Comment 14 Michał Pecio 2025-03-03 22:38:51 UTC
I think I found it.

Does it help if you revert 36b972d4b7cef?
Comment 15 bugzilla 2025-03-06 11:15:15 UTC
I found this issue happening after resuming from sleep. Downgrading to Kernel 6.12.9 (6.12.9-arch1-1) removed the issue completely. The error message was the same.

I am not very knowledgeable about reporting or helping with kernel issues, but if you give me instructions on how to test a patch and report back, I can help.
Comment 16 Artem S. Tashkinov 2025-03-06 11:54:31 UTC
(In reply to bugzilla from comment #15)
> I found this issue happening after resuming from sleep. Downgrading to
> Kernel 6.12.9 (6.12.9-arch1-1) removed the issue completely. The error
> message was the same.
> 
> I am not very knowledgeable about reporting or helping with kernel issues,
> but if you give me instructions on how to test a patch and report back, I
> can help.

Please follow the instructions in https://bugzilla.kernel.org/show_bug.cgi?id=219824#c11

I'm currently on vacation, I left my mouse at home, so I cannot debug this.

BTW, here's an interesting tidbit: even with the debugging from comment 11 enabled I wasn't able to get my USB ports to die, but (!) on a second resume they died _immediately_. So maybe I just had to suspend/resume twice and then generate the debug data. Sadly, I ran out of time.
Comment 17 Artem S. Tashkinov 2025-03-06 11:57:11 UTC
Also, if you're able to compile your kernel, please try to revert commit 36b972d4b7cef on top of e.g. 6.13.5 and check if you can reproduce the issue.

The patch can be downloaded here: https://web.git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=36b972d4b7cef5d098de63fee8d00720c051f335

You'll need to apply it in reverse from the top of the kernel source tree, e.g. `patch -p1 -R < patch`.
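
A reverse application along these lines can be sketched as follows. This is a minimal sketch under assumptions: the commit has been saved as a unified diff with the usual `a/` and `b/` path prefixes, GNU patch is installed (`--dry-run` is a GNU patch option), and the `revert_patch` function name is made up for this example.

```shell
#!/bin/sh
# Sketch: reverse-apply a saved patch to a kernel source tree.
# revert_patch is a hypothetical helper name; the tree and patch file
# locations are whatever you pass in.
revert_patch() {
    tree="$1"         # top of the kernel source tree
    patchfile="$2"    # the commit saved as a unified diff

    # First check that the reverse patch applies cleanly without touching files.
    patch -d "$tree" -p1 -R --dry-run < "$patchfile" > /dev/null || return 1

    # Then apply it for real.
    patch -d "$tree" -p1 -R < "$patchfile"
}
```

The dry run is there so that a tree which does not contain the commit is left untouched instead of ending up half-patched.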
Comment 18 Artem S. Tashkinov 2025-03-06 12:03:52 UTC
Looks like there are multiple people affected by this regression; reports have started to trickle in:

https://bbs.archlinux.org/viewtopic.php?pid=2229550

BTW, https://lore.kernel.org/lkml/20250304085139.4610e8ff@foxbook/ has been queued for stable; it looks like 6.13.6 will have the issue fixed.

Please check.
Comment 19 Artem S. Tashkinov 2025-03-07 23:28:43 UTC
(In reply to Michał Pecio from comment #14)
> I think I found it.
> 
> Does it help if you revert 36b972d4b7cef?

People claim it fixes the issue.

Please actually submit the revert to stable, as 6.13.6 has been released without it and no 6.14-rc has seemingly seen it either.

And seeing "Reported by:" would be nice. Thanks.
Comment 20 Michał Pecio 2025-03-08 06:56:59 UTC
Sorry about Reported-by, but if I added that then they would also want a Closes and I wasn't sure if it really is this bug. It's something I produced on my machine under my workload, using knowledge of how one commit is broken.

I guess I could have taken a gamble and tagged it anyway because it seemed likely, but I had no confirmation and no idea if I would get any by the end of the week. Indeed, to this day the best I've seen is reports that downgrading to 6.12 helps; I haven't heard from anyone previously affected who tried reverting the suspect patch.

Like Mathias, I was expecting some nightmare scenario and not that. I hope that this is all there was to it, but we will really know next week.

My patch made it to Greg's usb-linus branch, which normally means it will land in the next -rc tomorrow and then propagate to stable. No other -rc has it because they were released before it existed.
Comment 21 christian.rohmann 2025-03-14 08:42:09 UTC
Will this be backported to 6.13? It seems 6.13.7 doesn't have it.
Comment 22 Artem S. Tashkinov 2025-03-14 09:06:39 UTC
6.13.7 absolutely includes it:

https://cdn.kernel.org/pub/linux/kernel/v6.x/ChangeLog-6.13.7

> commit 80cb8e694110dee4ac6fbf0956ba7439aeb0603d
> Author: Michal Pecio <michal.pecio@gmail.com>
> Date:   Tue Mar 4 13:31:47 2025 +0200
> 
>     usb: xhci: Fix host controllers "dying" after suspend and resume
>     
>     commit c7c1f3b05c67173f462d73d301d572b3f9e57e3b upstream.
>     
>     A recent cleanup went a bit too far and dropped clearing the cycle bit
>     of link TRBs, so it stays different from the rest of the ring half of
>     the time. Then a race occurs: if the xHC reaches such link TRB before
>     more commands are queued, the link's cycle bit unintentionally matches
>     the xHC's cycle so it follows the link and waits for further commands.
>     If more commands are queued before the xHC gets there, inc_enq() flips
>     the bit so the xHC later sees a mismatch and stops executing commands.
>     
>     This function is called before suspend and 50% of times after resuming
>     the xHC is doomed to get stuck sooner or later. Then some Stop Endpoint
>     command fails to complete in 5 seconds and this shows up
>     
>     xhci_hcd 0000:00:10.0: xHCI host not responding to stop endpoint command
>     xhci_hcd 0000:00:10.0: xHCI host controller not responding, assume dead
>     xhci_hcd 0000:00:10.0: HC died; cleaning up
>     
>     followed by loss of all USB devices on the affected bus. That's if you
>     are lucky, because if Set Deq gets stuck instead, the failure is silent.
>     
>     Likely responsible for kernel bug 219824. I found this while searching
>     for possible causes of that regression and reproduced it locally before
>     hearing back from the reporter. To repro, simply wait for link cycle to
>     become set (debugfs), then suspend, resume and wait. To accelerate the
>     failure I used a script which repeatedly starts and stops a UVC camera.
>     
>     Some HCs get fully reinitialized on resume and they are not affected.
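
Whether a given stable release carries the fix can be checked mechanically against its ChangeLog. This is a minimal sketch, assuming the ChangeLog file has already been downloaded from the URL above and that stable backports keep the "commit <id> upstream." line seen in the quote. The `changelog_has_fix` function name is made up for this example.

```shell
#!/bin/sh
# Sketch: check whether a stable ChangeLog mentions an upstream commit.
# changelog_has_fix is a hypothetical helper name.
changelog_has_fix() {
    changelog="$1"    # path to a downloaded ChangeLog-6.x.y file
    upstream="$2"     # full upstream commit id to look for

    # Backports quote the mainline commit as "commit <id> upstream."
    grep -qF "commit $upstream upstream" "$changelog"
}
```

For example, `changelog_has_fix ChangeLog-6.13.7 c7c1f3b05c67173f462d73d301d572b3f9e57e3b && echo included` should print "included" if the file matches the excerpt quoted in this comment.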
Comment 23 christian.rohmann 2025-03-14 10:49:47 UTC
(In reply to Artem S. Tashkinov from comment #22)
> 6.13.7 absolutely includes it:

> https://cdn.kernel.org/pub/linux/kernel/v6.x/ChangeLog-6.13.7


Ah sorry, my bad. Thanks!