Bug 219824

Summary: [6.13 regression] USB controller just died
Product: Drivers
Reporter: Artem S. Tashkinov (aros)
Component: USB
Assignee: Default virtual assignee for Drivers/USB (drivers_usb)
Status: NEW
Severity: blocking
CC: mathias.nyman, michal.pecio
Priority: P3
Hardware: AMD
OS: Linux
Kernel Version: 6.13.4
Subsystem:
Regression: Yes
Bisected commit-id:

Description Artem S. Tashkinov 2025-02-26 22:23:50 UTC
This is the first time it's happened to me.

My USB mouse just died.

The system was more or less idle, then I got these messages from the kernel:

[109773.985092] xhci_hcd 0000:c3:00.3: xHCI host not responding to stop endpoint command
[109773.998577] xhci_hcd 0000:c3:00.3: xHCI host controller not responding, assume dead
[109773.998622] xhci_hcd 0000:c3:00.3: HC died; cleaning up
[109773.998668] xhci_hcd 0000:c3:00.3: Timeout while waiting for stop endpoint command
[109773.998740] usb 1-2: USB disconnect, device number 2
[109774.032612] usb 1-3: USB disconnect, device number 3
[109774.033087] usb 1-4: USB disconnect, device number 4

This has never happened before with any of previous kernels, 6.9, 6.10, 6.11, 6.12.

Now on 6.13.4 this happened a few minutes after the system resumed.

That looks like a major regression.

The kernel didn't try anything.

Unbinding and binding the USB endpoint in /sys using this script has fixed the mouse but I never had to do that before:

https://unix.stackexchange.com/a/704342
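
For readers who cannot open the link, the workaround amounts to something like the sketch below. It is not the linked script. It assumes the dead controller is the PCI device 0000:c3:00.3 from the log above and that it is bound to the xhci_hcd PCI driver (the name shown in the kernel messages). The function name rebind_pci_dev and the optional second argument are made up for illustration.

```shell
#!/bin/sh
# Sketch: detach a PCI device from its driver and attach it again via sysfs.
# Usage: rebind_pci_dev <pci-address> [driver-dir]
# driver-dir defaults to the xhci_hcd PCI driver directory; this is an
# assumption, confirm the bound driver with `lspci -k` first.
rebind_pci_dev() {
    dev="$1"
    drv="${2:-/sys/bus/pci/drivers/xhci_hcd}"
    [ -n "$dev" ] || { echo "usage: rebind_pci_dev <pci-address> [driver-dir]" >&2; return 2; }
    [ -w "$drv/unbind" ] && [ -w "$drv/bind" ] || {
        echo "cannot write to $drv (run as root?)" >&2
        return 1
    }
    printf '%s\n' "$dev" > "$drv/unbind" || return 1   # detach the driver
    sleep 1                                            # give teardown a moment
    printf '%s\n' "$dev" > "$drv/bind"                 # re-probe the controller
}
```

Run as root, the call for this machine would be `rebind_pci_dev 0000:c3:00.3`. This only restores the devices after the fact; it does nothing about whatever made the host controller stop responding.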

My lspci:

c3:00.1 Audio device: Advanced Micro Devices, Inc. [AMD/ATI] Rembrandt Radeon High Definition Audio Controller
c3:00.6 Audio device: Advanced Micro Devices, Inc. [AMD] Family 17h/19h/1ah HD Audio Controller
c3:00.2 Encryption controller: Advanced Micro Devices, Inc. [AMD] Phoenix CCP/PSP 3.0 Device
00:18.0 Host bridge: Advanced Micro Devices, Inc. [AMD] Phoenix Data Fabric; Function 0
00:18.1 Host bridge: Advanced Micro Devices, Inc. [AMD] Phoenix Data Fabric; Function 1
00:18.2 Host bridge: Advanced Micro Devices, Inc. [AMD] Phoenix Data Fabric; Function 2
00:18.3 Host bridge: Advanced Micro Devices, Inc. [AMD] Phoenix Data Fabric; Function 3
00:18.4 Host bridge: Advanced Micro Devices, Inc. [AMD] Phoenix Data Fabric; Function 4
00:18.5 Host bridge: Advanced Micro Devices, Inc. [AMD] Phoenix Data Fabric; Function 5
00:18.6 Host bridge: Advanced Micro Devices, Inc. [AMD] Phoenix Data Fabric; Function 6
00:18.7 Host bridge: Advanced Micro Devices, Inc. [AMD] Phoenix Data Fabric; Function 7
00:01.0 Host bridge: Advanced Micro Devices, Inc. [AMD] Phoenix Dummy Host Bridge
00:02.0 Host bridge: Advanced Micro Devices, Inc. [AMD] Phoenix Dummy Host Bridge
00:03.0 Host bridge: Advanced Micro Devices, Inc. [AMD] Phoenix Dummy Host Bridge
00:04.0 Host bridge: Advanced Micro Devices, Inc. [AMD] Phoenix Dummy Host Bridge
00:08.0 Host bridge: Advanced Micro Devices, Inc. [AMD] Phoenix Dummy Host Bridge
00:00.0 Host bridge: Advanced Micro Devices, Inc. [AMD] Phoenix Root Complex
00:00.2 IOMMU: Advanced Micro Devices, Inc. [AMD] Phoenix IOMMU
00:14.3 ISA bridge: Advanced Micro Devices, Inc. [AMD] FCH LPC Bridge (rev 51)
c3:00.5 Multimedia controller: Advanced Micro Devices, Inc. [AMD] ACP/ACP3X/ACP6x Audio Coprocessor (rev 63)
01:00.0 Network controller: MEDIATEK Corp. MT7922 802.11ax PCI Express Wireless Network Adapter
c4:00.0 Non-Essential Instrumentation [1300]: Advanced Micro Devices, Inc. [AMD] Phoenix Dummy Function
c5:00.0 Non-Essential Instrumentation [1300]: Advanced Micro Devices, Inc. [AMD] Phoenix Dummy Function
02:00.0 Non-Volatile memory controller: Micron Technology Inc 3400 NVMe SSD [Hendrix]
00:03.1 PCI bridge: Advanced Micro Devices, Inc. [AMD] Family 19h USB4/Thunderbolt PCIe tunnel
00:04.1 PCI bridge: Advanced Micro Devices, Inc. [AMD] Family 19h USB4/Thunderbolt PCIe tunnel
00:02.2 PCI bridge: Advanced Micro Devices, Inc. [AMD] Phoenix GPP Bridge
00:02.4 PCI bridge: Advanced Micro Devices, Inc. [AMD] Phoenix GPP Bridge
00:08.1 PCI bridge: Advanced Micro Devices, Inc. [AMD] Phoenix Internal GPP Bridge to Bus [C:A]
00:08.2 PCI bridge: Advanced Micro Devices, Inc. [AMD] Phoenix Internal GPP Bridge to Bus [C:A]
00:08.3 PCI bridge: Advanced Micro Devices, Inc. [AMD] Phoenix Internal GPP Bridge to Bus [C:A]
c4:00.1 Signal processing controller: Advanced Micro Devices, Inc. [AMD] AMD IPU Device
c3:00.7 Signal processing controller: Advanced Micro Devices, Inc. [AMD] Sensor Fusion Hub
00:14.0 SMBus: Advanced Micro Devices, Inc. [AMD] FCH SMBus Controller (rev 71)
c3:00.3 USB controller: Advanced Micro Devices, Inc. [AMD] Device 15b9
c3:00.4 USB controller: Advanced Micro Devices, Inc. [AMD] Device 15ba
c5:00.3 USB controller: Advanced Micro Devices, Inc. [AMD] Device 15c0
c5:00.4 USB controller: Advanced Micro Devices, Inc. [AMD] Device 15c1
c5:00.5 USB controller: Advanced Micro Devices, Inc. [AMD] Pink Sardine USB4/Thunderbolt NHI controller #1
c5:00.6 USB controller: Advanced Micro Devices, Inc. [AMD] Pink Sardine USB4/Thunderbolt NHI controller #2
c3:00.0 VGA compatible controller: Advanced Micro Devices, Inc. [AMD/ATI] Phoenix1 (rev d4)

My lsusb:

Bus 001 Device 001: ID 1d6b:0002 Linux Foundation 2.0 root hub
Bus 001 Device 002: ID 0489:e0f2 Foxconn / Hon Hai Wireless_Device
Bus 001 Device 003: ID 06cb:00f0 Synaptics, Inc. 
Bus 002 Device 001: ID 1d6b:0003 Linux Foundation 3.0 root hub
Bus 003 Device 001: ID 1d6b:0002 Linux Foundation 2.0 root hub
Bus 003 Device 002: ID 0408:545f Quanta Computer, Inc. HP 5MP Camera
Bus 004 Device 001: ID 1d6b:0003 Linux Foundation 3.0 root hub
Bus 005 Device 001: ID 1d6b:0002 Linux Foundation 2.0 root hub
Bus 006 Device 001: ID 1d6b:0003 Linux Foundation 3.0 root hub
Bus 007 Device 001: ID 1d6b:0002 Linux Foundation 2.0 root hub
Bus 008 Device 001: ID 1d6b:0003 Linux Foundation 3.0 root hub

I'm not bisecting this issue because so far it's happened just once and I've no idea how to trigger it. Yet it has never happened before with previous kernels.
Comment 1 Artem S. Tashkinov 2025-02-26 22:28:20 UTC
I'm utterly confused as to why the kernel decided to "xHCI host not responding to stop endpoint command".

I didn't do anything at the time. Wasn't even using the mouse.

Something funky is going on with 6.13.
Comment 2 Artem S. Tashkinov 2025-02-26 22:31:25 UTC
This was reported in the SUSE bug tracker earlier:

https://bugzilla.suse.com/show_bug.cgi?id=1236992

I don't see it being reported here, so the issue is not new.

Yet I see no patches queued for 6.13.5.
Comment 3 Artem S. Tashkinov 2025-02-26 22:47:52 UTC
The SUSE issue is seemingly unrelated, please dismiss.
Comment 4 Artem S. Tashkinov 2025-02-27 12:58:38 UTC
This just happened again:

[161470.836493] PM: resume devices took 0.547 seconds
[161470.836720] OOM killer enabled.
[161470.836721] Restarting tasks ... done.
[161470.839715] random: crng reseeded on system resumption
[161470.845090] PM: suspend exit
[161471.322491] cs35l41-hda i2c-CSC3551:00-cs35l41-hda.0: DSP1: Firmware: 400a4 vendor: 0x2 v0.43.1, 2 algorithms
[161471.324469] cs35l41-hda i2c-CSC3551:00-cs35l41-hda.0: DSP1: cirrus/cs35l41-dsp1-spk-prot-103c8b72.bin: v0.43.1
[161471.324480] cs35l41-hda i2c-CSC3551:00-cs35l41-hda.0: DSP1: spk-prot: D:\Amp Tuning\HP\840\0930\103C8B45_220930.bin
[161471.403951] cs35l41-hda i2c-CSC3551:00-cs35l41-hda.1: Calibration applied: R0=10446
[161471.407392] cs35l41-hda i2c-CSC3551:00-cs35l41-hda.0: Calibration applied: R0=10526
[161471.432157] cs35l41-hda i2c-CSC3551:00-cs35l41-hda.1: Firmware Loaded - Type: spk-prot, Gain: 17
[161471.433916] cs35l41-hda i2c-CSC3551:00-cs35l41-hda.0: Firmware Loaded - Type: spk-prot, Gain: 17
[161471.523827] hp_wmi: Unknown event_id - 131073 - 0x0
[162644.637587] xhci_hcd 0000:c3:00.3: xHCI host not responding to stop endpoint command
[162644.651068] xhci_hcd 0000:c3:00.3: xHCI host controller not responding, assume dead
[162644.651076] xhci_hcd 0000:c3:00.3: HC died; cleaning up
[162644.651099] xhci_hcd 0000:c3:00.3: Timeout while waiting for stop endpoint command
[162644.651102] usb 1-2: USB disconnect, device number 4
[162644.678374] usb 1-3: USB disconnect, device number 2
[162644.678748] usb 1-4: USB disconnect, device number 3

Shortly after resume all the USB ports are disabled.
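
The delay between resume and the failure can be read off the timestamps in the log above. A small sketch, assuming the default dmesg timestamp format (seconds since boot in square brackets); the function name resume_to_failure_secs is hypothetical:

```shell
#!/bin/sh
# Sketch: print the seconds between the last "PM: suspend exit" line and the
# first following "not responding to stop endpoint command" line.
# Reads a kernel log on stdin; assumes "[seconds.micros] message" lines.
resume_to_failure_secs() {
    awk '
        # Extract the bracketed timestamp of the current line into ts.
        { ts = $0; sub(/^\[ */, "", ts); sub(/\].*/, "", ts) }
        /PM: suspend exit/ { resume = ts; seen = 1 }
        /not responding to stop endpoint command/ && seen {
            printf "%.1f\n", ts - resume
            exit
        }
    '
}
```

Fed the log from this comment (for example `dmesg | resume_to_failure_secs`), it prints 1173.8, since 162644.637587 - 161470.845090 = 1173.792497 seconds, or a little under 20 minutes after resume.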

I'm reverting back to Linux 6.11. I cannot use my device like this.
Comment 5 Mathias Nyman 2025-02-27 15:12:46 UTC
6.13 has a lot of changes related to endpoint stopping:

e21ebe51af68 xhci: Turn NEC specific quirk for handling Stop Endpoint errors generic
474538b8dd1c usb: xhci: Avoid queuing redundant Stop Endpoint commands
484c3bab2d5d usb: xhci: Fix TD invalidation under pending Set TR Dequeue
42b758137601 usb: xhci: Limit Stop Endpoint retries

Endpoints are stopped in order to cancel transfers, before suspend, and to soft reset an endpoint after clearing a halt. 

I understand that bisecting an issue like this that triggers rarely isn't an option, but can I ask you to try running 6.13 with xhci dynamic debug enabled.

mount -t debugfs none /sys/kernel/debug
echo 'module xhci_hcd =p' >/sys/kernel/debug/dynamic_debug/control
echo 'module usbcore =p' >/sys/kernel/debug/dynamic_debug/control
Then send the dmesg output after the issue is triggered.

It could reveal a bit more about what's going on.
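
With dynamic debug enabled the log gets large, so a quick way to skim it for the relevant lines may help. This is only a convenience sketch, assuming the default dmesg timestamp format; the function name usb_log_lines is made up. The complete dmesg is still what should be attached to the bug.

```shell
#!/bin/sh
# Sketch: keep only xhci_hcd, usb and hub lines from a kernel log on stdin.
# Matches the device/driver prefix that follows the "[timestamp] " field.
usb_log_lines() {
    grep -E '\] (xhci_hcd|usb|hub) '
}
```

For example, `dmesg > dmesg-full.txt` saves the whole log for the attachment, and `usb_log_lines < dmesg-full.txt | less` shows just the USB-related part.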
Comment 6 Artem S. Tashkinov 2025-02-27 16:05:34 UTC
(In reply to Mathias Nyman from comment #5)
> I understand that bisecting an issue like this that triggers rarely isn't an
> option, but can I ask you to try running 6.13 with xhci dynamic debug
> enabled.

Will do as soon as possible. Thanks a lot!
Comment 7 Michał Pecio 2025-02-27 17:14:31 UTC
Which exact versions were you running successfully and for how long?

These patches listed by Mathias are instant first suspects, but they were all backported to v6.12.7 in December. Most of them also to v6.11.11 in early December and later in January to some LTS series.

Any chance that hibernation is indeed a (delayed) trigger and you weren't doing it as often in the past?

Did you come across similar reports from stable kernel branches in this year?
Comment 8 Artem S. Tashkinov 2025-02-27 21:07:45 UTC
> Which exact versions were you running successfully and for how long?

Kernel 6.12.14 that I was running earlier didn't have this issue.

Used software suspend/resume multiple times successfully.

> Any chance that hibernation is indeed a (delayed) trigger and you weren't
> doing it as often in the past?

Not using hibernation, just software suspend.

I've not changed anything software-wise except installing a new kernel on this laptop.

> Did you come across similar reports from stable kernel branches in this year?

I've Googled a couple of times already for this exact error message and nothing turned up.