This seems, at least to me, to be a regression after the apparent fix in 6.13.7. I've tested 6.13.9 and 6.14.3 and both are affected, while 6.13.7 seems to work fine for me.

At random times (whether idle or under active use), there is a high chance that one of the xHCI controllers will fail.

```
Apr 28 6:54:28 PM kernel: usb 7-11: USB disconnect, device number 11
Apr 28 6:54:28 PM kernel: usb 7-7: USB disconnect, device number 6
Apr 28 6:54:28 PM kernel: usb 7-5.2: USB disconnect, device number 9
Apr 28 6:54:28 PM kernel: usb 7-5: USB disconnect, device number 4
Apr 28 6:54:28 PM kernel: usb 7-3.5: USB disconnect, device number 15
Apr 28 6:54:28 PM kernel: usb 7-3.4: USB disconnect, device number 14
Apr 28 6:54:28 PM kernel: usb 7-3.3: USB disconnect, device number 12
Apr 28 6:54:28 PM kernel: usb 8-5: USB disconnect, device number 4
Apr 28 6:54:28 PM kernel: usb 8-2: USB disconnect, device number 2
Apr 28 6:54:28 PM kernel: usb 8-3.4: USB disconnect, device number 5
Apr 28 6:54:28 PM kernel: usb 8-3: USB disconnect, device number 3
Apr 28 6:54:28 PM kernel: usb 7-3.1: USB disconnect, device number 7
Apr 28 6:54:28 PM kernel: usb 7-3: USB disconnect, device number 3
Apr 28 6:54:28 PM kernel: usb 7-2.5: USB disconnect, device number 8
Apr 28 6:54:28 PM kernel: usb 7-2.3.2: USB disconnect, device number 13
Apr 28 6:54:28 PM kernel: usb 7-2.3.1: USB disconnect, device number 10
Apr 28 6:54:28 PM kernel: usb 7-2.3: USB disconnect, device number 5
Apr 28 6:54:28 PM kernel: usb 7-2: USB disconnect, device number 2
Apr 28 6:51:25 PM kernel: usb 7-3.4.2: USB disconnect, device number 18
Apr 28 11:23:44 AM kernel: usb 7-3.4.2: USB disconnect, device number 17
Apr 27 1:06:10 AM kernel: usb 7-3.4.2: USB disconnect, device number 16
```

Running the following (replace `??` with the controller's actual PCI bus number) reinitialises the controller:

```
echo -n "0000:??:00.0" | sudo tee /sys/bus/pci/drivers/xhci_hcd/unbind && \
  echo -n "0000:??:00.0" | sudo tee /sys/bus/pci/drivers/xhci_hcd/bind
```
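Here is a minimal sketch of the same unbind/rebind step as a reusable POSIX shell function. It assumes you run it from a root shell. The function name `xhci_rebind` and the `XHCI_DRIVER_DIR` override are hypothetical. The override exists only so the function can be exercised against a scratch directory. The function rejects placeholders such as `0000:??:00.0`, so a half-edited command never reaches sysfs.

```sh
# Hypothetical override; defaults to the real xhci_hcd driver directory.
XHCI_DRIVER_DIR="${XHCI_DRIVER_DIR:-/sys/bus/pci/drivers/xhci_hcd}"

# xhci_rebind ADDR (hypothetical helper): unbind, then rebind xhci_hcd for
# one PCI function. The body runs in a subshell so its variables stay local.
xhci_rebind() (
    addr=$1
    # Require a full lowercase domain:bus:device.function address,
    # e.g. 0000:6a:00.0 (PCI device numbers only go up to 0x1f).
    case $addr in
        [0-9a-f][0-9a-f][0-9a-f][0-9a-f]:[0-9a-f][0-9a-f]:[01][0-9a-f].[0-7]) ;;
        *)
            echo "xhci_rebind: '$addr' is not a PCI address like 0000:6a:00.0" >&2
            exit 1 ;;
    esac
    # printf '%s' writes no trailing newline, like the echo -n above.
    printf '%s' "$addr" > "$XHCI_DRIVER_DIR/unbind" || exit 1
    printf '%s' "$addr" > "$XHCI_DRIVER_DIR/bind"
)
```

After sourcing the file in a root shell (for example after `sudo -s`), run `xhci_rebind 0000:6a:00.0`.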
----

uname:

```
6.13.9-103.bazzite.fc42.x86_64
```

rpm-ostree:

```
● ostree-image-signed:docker://ghcr.io/ublue-os/bazzite:stable
                   Digest: sha256:1a7ae28b95fde42b976cc9aa159219c0aaaa0611f7416f4b3b30284e292b0875
                  Version: 42.20250417 (2025-04-17T07:35:37Z)
          LayeredPackages: android-tools
            LocalPackages: 1password-8.10.70-1.x86_64 sublime-text-4192-1.x86_64
```
That's odd, because there are no xhci changes between 6.13.7 and 6.13.9, and very few USB changes at all. Are you really sure that 6.13.7 was OK? Can you reproduce this on upstream kernels built from source? Or could you bring the issue up with Fedora, in case they apply other patches? What's the affected host controller (lspci -nn)? I presume you are seeing the "HC died" message too, could you post full dmesg? There is clearly "something" here, but it's hard to tell what it is and why.
> Are you really sure that 6.13.7 was OK?

I can confirm, at least to the extent of my testing, that this issue was not present on 6.13.7.

> Can you reproduce this on upstream kernels built from source?

It might be a bit tricky (not to mention unstable) to use an upstream kernel on an ostree-locked OS. I could reach out to the maintainers of the distro (Silverblue/Bazzite) and check whether there are any guides on that.

> I presume you are seeing the "HC died" message too, could you post full dmesg?

That's correct, the exact symptoms are present. Here are some sample messages from `journald`:

```
xhci_hcd 0000:6a:00.0: HC died; cleaning up
xhci_hcd 0000:6a:00.0: xHCI host controller not responding, assume dead
xhci_hcd 0000:6a:00.0: Abort failed to stop command ring: -110
```

On the current boot, `6a` is not present in `lspci -nn`. I assume that's because of re-mapping (the order in which the kernel binds the controllers at boot time). Here's the output anyway:

```
00:00.0 Host bridge [0600]: Advanced Micro Devices, Inc. [AMD] Raphael/Granite Ridge Root Complex [1022:14d8]
00:00.2 IOMMU [0806]: Advanced Micro Devices, Inc. [AMD] Raphael/Granite Ridge IOMMU [1022:14d9]
00:01.0 Host bridge [0600]: Advanced Micro Devices, Inc. [AMD] Raphael/Granite Ridge Dummy Host Bridge [1022:14da]
00:01.1 PCI bridge [0604]: Advanced Micro Devices, Inc. [AMD] Raphael/Granite Ridge GPP Bridge [1022:14db]
00:01.2 PCI bridge [0604]: Advanced Micro Devices, Inc. [AMD] Raphael/Granite Ridge GPP Bridge [1022:14db]
00:02.0 Host bridge [0600]: Advanced Micro Devices, Inc. [AMD] Raphael/Granite Ridge Dummy Host Bridge [1022:14da]
00:02.1 PCI bridge [0604]: Advanced Micro Devices, Inc. [AMD] Raphael/Granite Ridge GPP Bridge [1022:14db]
00:03.0 Host bridge [0600]: Advanced Micro Devices, Inc. [AMD] Raphael/Granite Ridge Dummy Host Bridge [1022:14da]
00:04.0 Host bridge [0600]: Advanced Micro Devices, Inc. [AMD] Raphael/Granite Ridge Dummy Host Bridge [1022:14da]
00:08.0 Host bridge [0600]: Advanced Micro Devices, Inc. [AMD] Raphael/Granite Ridge Dummy Host Bridge [1022:14da]
00:08.1 PCI bridge [0604]: Advanced Micro Devices, Inc. [AMD] Raphael/Granite Ridge Internal GPP Bridge to Bus [C:A] [1022:14dd]
00:08.3 PCI bridge [0604]: Advanced Micro Devices, Inc. [AMD] Raphael/Granite Ridge Internal GPP Bridge to Bus [C:A] [1022:14dd]
00:14.0 SMBus [0c05]: Advanced Micro Devices, Inc. [AMD] FCH SMBus Controller [1022:790b] (rev 71)
00:14.3 ISA bridge [0601]: Advanced Micro Devices, Inc. [AMD] FCH LPC Bridge [1022:790e] (rev 51)
00:18.0 Host bridge [0600]: Advanced Micro Devices, Inc. [AMD] Raphael/Granite Ridge Data Fabric; Function 0 [1022:14e0]
00:18.1 Host bridge [0600]: Advanced Micro Devices, Inc. [AMD] Raphael/Granite Ridge Data Fabric; Function 1 [1022:14e1]
00:18.2 Host bridge [0600]: Advanced Micro Devices, Inc. [AMD] Raphael/Granite Ridge Data Fabric; Function 2 [1022:14e2]
00:18.3 Host bridge [0600]: Advanced Micro Devices, Inc. [AMD] Raphael/Granite Ridge Data Fabric; Function 3 [1022:14e3]
00:18.4 Host bridge [0600]: Advanced Micro Devices, Inc. [AMD] Raphael/Granite Ridge Data Fabric; Function 4 [1022:14e4]
00:18.5 Host bridge [0600]: Advanced Micro Devices, Inc. [AMD] Raphael/Granite Ridge Data Fabric; Function 5 [1022:14e5]
00:18.6 Host bridge [0600]: Advanced Micro Devices, Inc. [AMD] Raphael/Granite Ridge Data Fabric; Function 6 [1022:14e6]
00:18.7 Host bridge [0600]: Advanced Micro Devices, Inc. [AMD] Raphael/Granite Ridge Data Fabric; Function 7 [1022:14e7]
01:00.0 PCI bridge [0604]: Advanced Micro Devices, Inc. [AMD/ATI] Navi 10 XL Upstream Port of PCI Express Switch [1002:1478] (rev 10)
02:00.0 PCI bridge [0604]: Advanced Micro Devices, Inc. [AMD/ATI] Navi 10 XL Downstream Port of PCI Express Switch [1002:1479] (rev 10)
03:00.0 VGA compatible controller [0300]: Advanced Micro Devices, Inc. [AMD/ATI] Navi 31 [Radeon RX 7900 XT/7900 XTX/7900 GRE/7900M] [1002:744c] (rev c8)
03:00.1 Audio device [0403]: Advanced Micro Devices, Inc. [AMD/ATI] Navi 31 HDMI/DP Audio [1002:ab30]
03:00.2 USB controller [0c03]: Advanced Micro Devices, Inc. [AMD/ATI] Navi 31 USB [1002:7446]
03:00.3 Serial bus controller [0c80]: Advanced Micro Devices, Inc. [AMD/ATI] Device [1002:7444]
04:00.0 Non-Volatile memory controller [0108]: Samsung Electronics Co Ltd NVMe SSD Controller S4LV008[Pascal] [144d:a80c]
05:00.0 PCI bridge [0604]: Advanced Micro Devices, Inc. [AMD] 600 Series Chipset PCIe Switch Upstream Port [1022:43f4] (rev 01)
06:00.0 PCI bridge [0604]: Advanced Micro Devices, Inc. [AMD] 600 Series Chipset PCIe Switch Downstream Port [1022:43f5] (rev 01)
06:08.0 PCI bridge [0604]: Advanced Micro Devices, Inc. [AMD] 600 Series Chipset PCIe Switch Downstream Port [1022:43f5] (rev 01)
06:0c.0 PCI bridge [0604]: Advanced Micro Devices, Inc. [AMD] 600 Series Chipset PCIe Switch Downstream Port [1022:43f5] (rev 01)
06:0d.0 PCI bridge [0604]: Advanced Micro Devices, Inc. [AMD] 600 Series Chipset PCIe Switch Downstream Port [1022:43f5] (rev 01)
08:00.0 PCI bridge [0604]: Advanced Micro Devices, Inc. [AMD] 600 Series Chipset PCIe Switch Upstream Port [1022:43f4] (rev 01)
09:00.0 PCI bridge [0604]: Advanced Micro Devices, Inc. [AMD] 600 Series Chipset PCIe Switch Downstream Port [1022:43f5] (rev 01)
09:01.0 PCI bridge [0604]: Advanced Micro Devices, Inc. [AMD] 600 Series Chipset PCIe Switch Downstream Port [1022:43f5] (rev 01)
09:02.0 PCI bridge [0604]: Advanced Micro Devices, Inc. [AMD] 600 Series Chipset PCIe Switch Downstream Port [1022:43f5] (rev 01)
09:04.0 PCI bridge [0604]: Advanced Micro Devices, Inc. [AMD] 600 Series Chipset PCIe Switch Downstream Port [1022:43f5] (rev 01)
09:08.0 PCI bridge [0604]: Advanced Micro Devices, Inc. [AMD] 600 Series Chipset PCIe Switch Downstream Port [1022:43f5] (rev 01)
09:0c.0 PCI bridge [0604]: Advanced Micro Devices, Inc. [AMD] 600 Series Chipset PCIe Switch Downstream Port [1022:43f5] (rev 01)
09:0d.0 PCI bridge [0604]: Advanced Micro Devices, Inc. [AMD] 600 Series Chipset PCIe Switch Downstream Port [1022:43f5] (rev 01)
0a:00.0 Network controller [0280]: Intel Corporation Wi-Fi 6E(802.11ax) AX210/AX1675* 2x2 [Typhoon Peak] [8086:2725] (rev 1a)
0b:00.0 Ethernet controller [0200]: Intel Corporation Ethernet Controller I225-V [8086:15f3] (rev 03)
0d:00.0 PCI bridge [0604]: Intel Corporation Thunderbolt 4 Bridge [Maple Ridge 4C 2020] [8086:1136] (rev 02)
0e:00.0 PCI bridge [0604]: Intel Corporation Thunderbolt 4 Bridge [Maple Ridge 4C 2020] [8086:1136] (rev 02)
0e:01.0 PCI bridge [0604]: Intel Corporation Thunderbolt 4 Bridge [Maple Ridge 4C 2020] [8086:1136] (rev 02)
0e:02.0 PCI bridge [0604]: Intel Corporation Thunderbolt 4 Bridge [Maple Ridge 4C 2020] [8086:1136] (rev 02)
0e:03.0 PCI bridge [0604]: Intel Corporation Thunderbolt 4 Bridge [Maple Ridge 4C 2020] [8086:1136] (rev 02)
0f:00.0 USB controller [0c03]: Intel Corporation Thunderbolt 4 NHI [Maple Ridge 4C 2020] [8086:1137]
3b:00.0 USB controller [0c03]: Intel Corporation Thunderbolt 4 USB Controller [Maple Ridge 4C 2020] [8086:1138]
68:00.0 USB controller [0c03]: Advanced Micro Devices, Inc. [AMD] 600 Series Chipset USB 3.2 Controller [1022:43f7] (rev 01)
69:00.0 SATA controller [0106]: Advanced Micro Devices, Inc. [AMD] 600 Series Chipset SATA Controller [1022:43f6] (rev 01)
6a:00.0 USB controller [0c03]: Advanced Micro Devices, Inc. [AMD] 600 Series Chipset USB 3.2 Controller [1022:43f7] (rev 01)
6b:00.0 SATA controller [0106]: Advanced Micro Devices, Inc. [AMD] 600 Series Chipset SATA Controller [1022:43f6] (rev 01)
6c:00.0 VGA compatible controller [0300]: Advanced Micro Devices, Inc. [AMD/ATI] Granite Ridge [Radeon Graphics] [1002:13c0] (rev cb)
6c:00.1 Audio device [0403]: Advanced Micro Devices, Inc. [AMD/ATI] Rembrandt Radeon High Definition Audio Controller [1002:1640]
6c:00.2 Encryption controller [1080]: Advanced Micro Devices, Inc. [AMD] Family 19h PSP/CCP [1022:1649]
6c:00.3 USB controller [0c03]: Advanced Micro Devices, Inc. [AMD] Raphael/Granite Ridge USB 3.1 xHCI [1022:15b6]
6c:00.4 USB controller [0c03]: Advanced Micro Devices, Inc. [AMD] Raphael/Granite Ridge USB 3.1 xHCI [1022:15b7]
6d:00.0 USB controller [0c03]: Advanced Micro Devices, Inc. [AMD] Raphael/Granite Ridge USB 2.0 xHCI [1022:15b8]
```

Regarding `dmesg`, would you like the full output? The logs are... long.
*Apologies, I forgot to filter my last `lspci` snippet down to just the USB controllers.
(In reply to Claudio Wunder from comment #2)
> xhci_hcd 0000:6a:00.0: HC died; cleaning up
> xhci_hcd 0000:6a:00.0: xHCI host controller not responding, assume dead
> xhci_hcd 0000:6a:00.0: Abort failed to stop command ring: -110

If "abort failed" is what starts everything, this looks like a genuine case of hardware going bad for some reason. That is unlike the February "hc died" regression, which was a trivial driver bug while the HW worked as designed. This may be harder to solve and HW specific, possibly involving the connected devices.

Actually, a regression could conceivably be caused by a change in some device driver. Distro regressions can also be caused by changes in the kernel .config, so you may want to ask them about that too.

> Regarding `dmesg` would you like a full output? The logs are... long.

Well, at the very least, it would be nice to see the complete and unmodified kernel log from the event you are complaining about ;) The snippet quoted above, for example, appears to be in reverse order and I don't know why. Timestamps can be useful too.

If you can't or don't want to post the full kernel log (from boot to "hc died"), please at least grep it for '0000:6a:00.0'. This will show the PCI IDs of the culprit chip and maybe some anomalies previously logged by xhci_hcd.

Going forward, does your system support dynamic debug and/or debugfs? Please try:

echo 'module usbcore +p' | sudo tee /proc/dynamic_debug/control
echo 'module xhci_hcd +p' | sudo tee /proc/dynamic_debug/control

If this produces dmesg noise on a completely idle system, post a sample. If not, you may leave it enabled (changing +p to -p disables it) and collect dmesg with this debug info included the next time something happens.

If you can mount debugfs and access /sys/kernel/debug/usb/xhci/0000:??:00.0 (all as root), please save a copy of that directory after the next crash but before unbinding and rebinding the driver. This will contain information about what the chip was doing when it went belly up.
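You can wrap the two dynamic-debug commands above so that switching the extra logging on and off is symmetric. Here is a minimal POSIX shell sketch. It assumes a kernel built with dynamic debug support and a root shell. `xhci_debug` and the `DYNDBG_CONTROL` override are hypothetical names.

```sh
# Hypothetical override; defaults to the real dynamic debug control file.
DYNDBG_CONTROL="${DYNDBG_CONTROL:-/proc/dynamic_debug/control}"

# xhci_debug on|off (hypothetical helper): toggle pr_debug/dev_dbg output
# for usbcore and xhci_hcd, one query per write.
xhci_debug() (
    case $1 in
        on)  flag=+p ;;
        off) flag=-p ;;
        *)   echo "usage: xhci_debug on|off" >&2; exit 2 ;;
    esac
    for mod in usbcore xhci_hcd; do
        printf 'module %s %s\n' "$mod" "$flag" > "$DYNDBG_CONTROL" || exit 1
    done
)
```

After the next failure, something like `journalctl -k -b -1 | grep -F '0000:6a:00.0'` narrows the previous boot's kernel log to that controller. Use `-b 0` instead if the machine has not been rebooted since.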
The symptoms still hint at the regression caused by:

36b972d4b7ce usb: xhci: improve xhci_clear_command_ring()

which was fixed by Michal in:

c7c1f3b05c67 usb: xhci: Fix host controllers "dying" after suspend and resume

The regression was caused by command cycle confusion. The controller stops fetching commands from the ring buffer because it incorrectly assumes that the commands queued after the cycle mismatch are old commands from the previous pass over the ring buffer. The driver keeps queuing new commands and the hardware keeps ignoring them.

The regression was first noticed when the dedicated timeout timer for the "stop endpoint" command expired before the command completed, and the timer handler assumed the host was dead.

In this case it's the generic command timer that times out before a "stop endpoint" timer. The xhci driver tries to recover by aborting the failing command. The abort likely fails as well, since the hardware isn't really processing any commands at the moment, and we end up assuming the host is dead.
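The cycle-bit confusion described above can be pictured with a deliberately tiny model in POSIX shell. This is not xhci_hcd code. It ignores Link TRBs and the toggling of the cycle state at the end of the ring, and `hc_fetch` is a hypothetical name. The model assumes the controller walks forward from its dequeue pointer and only executes TRBs whose cycle bit matches its Consumer Cycle State (CCS). A TRB with the other value looks like a stale entry from the previous pass, so fetching stops there.

```sh
# hc_fetch CCS BIT... (hypothetical toy model, not driver code): print how
# many TRBs the controller would fetch, given its Consumer Cycle State and
# the cycle bits of the TRBs from the dequeue pointer onwards.
hc_fetch() (
    ccs=$1
    shift
    n=0
    for bit in "$@"; do
        # A mismatching cycle bit marks a TRB the producer has not yet
        # written on this pass, so the controller stops fetching here.
        [ "$bit" = "$ccs" ] || break
        n=$((n + 1))
    done
    echo "$n"
)
```

`hc_fetch 1 1 1 1 0 0` prints 3: three freshly written commands are fetched, followed by stale slots. `hc_fetch 1 0 0 0` prints 0. Commands written with the wrong cycle bit are never fetched, so only a timeout notices them.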
> If "abort failed" is what starts everything, this looks like a genuine case
> of hardware going bad for some reason, unlike the February "hc died"
> regression which was a trivial driver bug and the HW worked as designed. This
> may be harder to solve and HW specific, possibly including connected devices.

Here are some sample logs with the correct times. Apologies for not providing the timestamps earlier.

```
Apr 26 23:16:09 angel-thesis kernel: xhci_hcd 0000:6b:00.0: Event TRB for slot 18 ep 0 with no TDs queued
Apr 26 23:16:24 angel-thesis kernel: xhci_hcd 0000:6b:00.0: ERROR unknown event type 4
Apr 26 23:16:24 angel-thesis kernel: xhci_hcd 0000:6b:00.0: Abort failed to stop command ring: -110
Apr 26 23:16:24 angel-thesis kernel: xhci_hcd 0000:6b:00.0: xHCI host controller not responding, assume dead
Apr 26 23:16:24 angel-thesis kernel: xhci_hcd 0000:6b:00.0: HC died; cleaning up
Apr 26 23:16:24 angel-thesis kernel: xhci_hcd 0000:6b:00.0: Timeout while waiting for setup device command
```

```
Apr 26 23:16:24 angel-thesis kernel: usb 7-2: USB disconnect, device number 2
Apr 26 23:16:24 angel-thesis kernel: usb 7-2.3: USB disconnect, device number 5
Apr 26 23:16:24 angel-thesis kernel: usb 7-2.3.1: USB disconnect, device number 10
Apr 26 23:16:24 angel-thesis kernel: usb 7-2.4: USB disconnect, device number 8
Apr 26 23:16:24 angel-thesis kernel: usb 7-2.5: USB disconnect, device number 13
Apr 26 23:16:24 angel-thesis kernel: usb 7-3: USB disconnect, device number 3
Apr 26 23:16:25 angel-thesis kernel: usb 7-3.1: USB disconnect, device number 7
Apr 26 23:16:25 angel-thesis kernel: usb 8-3: USB disconnect, device number 3
Apr 26 23:16:25 angel-thesis kernel: usb 8-3.4: USB disconnect, device number 5
Apr 26 23:16:25 angel-thesis kernel: usb 8-2: USB disconnect, device number 2
Apr 26 23:16:25 angel-thesis kernel: usb 8-5: USB disconnect, device number 4
Apr 26 23:16:25 angel-thesis kernel: usb 7-3.3: USB disconnect, device number 12
Apr 26 23:16:25 angel-thesis kernel: usb 7-3.4: USB disconnect, device number 14
Apr 26 23:16:25 angel-thesis kernel: usb 7-3.4.2: USB disconnect, device number 19
Apr 26 23:16:25 angel-thesis kernel: usb 7-3.5: USB disconnect, device number 15
Apr 26 23:16:25 angel-thesis kernel: usb 7-5: USB disconnect, device number 4
Apr 26 23:16:25 angel-thesis kernel: usb 7-5.2: USB disconnect, device number 9
Apr 26 23:16:25 angel-thesis kernel: usb 7-7: USB disconnect, device number 6
Apr 26 23:16:25 angel-thesis kernel: usb 7-11: USB disconnect, device number 11
```

```
Apr 28 18:54:12 angel-thesis kernel: xhci_hcd 0000:6a:00.0: Event TRB for slot 18 ep 0 with no TDs queued
Apr 28 18:54:28 angel-thesis kernel: xhci_hcd 0000:6a:00.0: ERROR unknown event type 4
Apr 28 18:54:28 angel-thesis kernel: xhci_hcd 0000:6a:00.0: Abort failed to stop command ring: -110
Apr 28 18:54:28 angel-thesis kernel: xhci_hcd 0000:6a:00.0: xHCI host controller not responding, assume dead
Apr 28 18:54:28 angel-thesis kernel: xhci_hcd 0000:6a:00.0: HC died; cleaning up
Apr 28 18:54:28 angel-thesis kernel: xhci_hcd 0000:6a:00.0: Timeout while waiting for setup device command
```

```
Apr 28 18:51:25 angel-thesis kernel: usb 7-3.4.2: USB disconnect, device number 18
Apr 28 18:54:28 angel-thesis kernel: usb 7-2: USB disconnect, device number 2
Apr 28 18:54:28 angel-thesis kernel: usb 7-2.3: USB disconnect, device number 5
Apr 28 18:54:28 angel-thesis kernel: usb 7-2.3.1: USB disconnect, device number 10
Apr 28 18:54:28 angel-thesis kernel: usb 7-2.3.2: USB disconnect, device number 13
Apr 28 18:54:28 angel-thesis kernel: usb 7-2.5: USB disconnect, device number 8
Apr 28 18:54:28 angel-thesis kernel: usb 7-3: USB disconnect, device number 3
Apr 28 18:54:28 angel-thesis kernel: usb 7-3.1: USB disconnect, device number 7
Apr 28 18:54:28 angel-thesis kernel: usb 8-3: USB disconnect, device number 3
Apr 28 18:54:28 angel-thesis kernel: usb 8-3.4: USB disconnect, device number 5
Apr 28 18:54:28 angel-thesis kernel: usb 8-2: USB disconnect, device number 2
Apr 28 18:54:28 angel-thesis kernel: usb 8-5: USB disconnect, device number 4
Apr 28 18:54:28 angel-thesis kernel: usb 7-3.3: USB disconnect, device number 12
Apr 28 18:54:28 angel-thesis kernel: usb 7-3.4: USB disconnect, device number 14
Apr 28 18:54:28 angel-thesis kernel: usb 7-3.5: USB disconnect, device number 15
Apr 28 18:54:28 angel-thesis kernel: usb 7-5: USB disconnect, device number 4
Apr 28 18:54:28 angel-thesis kernel: usb 7-5.2: USB disconnect, device number 9
Apr 28 18:54:28 angel-thesis kernel: usb 7-7: USB disconnect, device number 6
Apr 28 18:54:28 angel-thesis kernel: usb 7-11: USB disconnect, device number 11
Apr 29 10:05:16 angel-thesis kernel: usb 5-1: USB disconnect, device number 2
```

> Actually, a regression could conceivably be caused by a change in some device
> driver. And distro regressions can also be caused by changes in kernel
> .config, so you may want to ask them about that too.

I'll reach out on the Fedora and Universal Blue Discourse forums.

> Well, at the very least, it would be nice to see complete and unmodified
> kernel log from the event you are complaining about ;) The snippet quoted
> above, for example, appears to be in reverse order and I don't know why.
> Timestamps can useful too.

Here's the full dmesg log via `journalctl -o short-precise -k -b -3`: https://gist.github.com/ovflowd/0b0aa5c748683eca33909dc3ed7c66f7 (I shared it on GitHub Gist due to its size. If you would rather I host it on a FOSS alternative, let me know and I can upload it to gitlab.gnome.org.)

> Going forward, does your system support dynamic debug and/or debugfs? Please
> try:

Let me check and circle back.
Running the echo commands against /proc produced zero noise in `dmesg`. I've also mounted debugfs and can see these files:

```
❯ sudo ls -la /sys/kernel/debug/usb/xhci/0000:6a:00.0
total 0
drwxr-xr-x.  6 root root 0 Apr 30 10:01 .
drwxr-xr-x.  9 root root 0 Apr 30 10:01 ..
drwxr-xr-x.  2 root root 0 Apr 30 10:01 command-ring
drwxr-xr-x.  7 root root 0 Apr 30 10:05 devices
drwxr-xr-x.  2 root root 0 Apr 30 10:01 event-ring
drwxr-xr-x. 20 root root 0 Apr 30 10:01 ports
-r--r--r--.  1 root root 0 Apr 30 10:01 reg-cap
-r--r--r--.  1 root root 0 Apr 30 10:01 reg-ext-dbc:00
-r--r--r--.  1 root root 0 Apr 30 10:01 reg-ext-legsup:00
-r--r--r--.  1 root root 0 Apr 30 10:01 reg-ext-protocol:00
-r--r--r--.  1 root root 0 Apr 30 10:01 reg-ext-protocol:01
-r--r--r--.  1 root root 0 Apr 30 10:01 reg-ext-protocol:02
-r--r--r--.  1 root root 0 Apr 30 10:01 reg-op
-r--r--r--.  1 root root 0 Apr 30 10:01 reg-runtime
```

I'll wait for the next crash and then archive that directory over `ssh`, before unbinding/rebinding as requested.
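debugfs reports a size of 0 for every file, as the `ls -la` output above shows. Archivers that trust the reported size may therefore store empty files; GNU `tar` run directly on such pseudo-files is a likely example. The minimal sketch below avoids this by reading every file with `cat` into an ordinary directory first. It assumes a root shell with debugfs mounted. `snapshot_debugfs` is a hypothetical name.

```sh
# snapshot_debugfs SRC DST (hypothetical helper): copy a debugfs directory
# by reading each file until EOF, ignoring the reported size of 0.
snapshot_debugfs() (
    src=$1 dst=$2
    mkdir -p "$dst" || exit 1
    dst=$(cd "$dst" && pwd) || exit 1      # make the destination absolute
    cd "$src" || exit 1
    # Recreate the directory layout first (command-ring, devices, ports, ...).
    find . -type d | while IFS= read -r d; do
        mkdir -p "$dst/$d"
    done
    # cat reads until EOF, so the bogus size does not matter; files that
    # refuse to be read are reported and skipped.
    find . -type f | while IFS= read -r f; do
        cat "$f" > "$dst/$f" 2>/dev/null || echo "skipped: $f" >&2
    done
)
```

For example, run `snapshot_debugfs /sys/kernel/debug/usb/xhci/0000:6a:00.0 /root/xhci-6a`, then `tar -czf /root/xhci-6a.tar.gz -C /root xhci-6a`. The second step is safe because the copy consists of regular files with real sizes.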
Ah, BTW, on the current boot, these are all the available USB controllers:

```
03:00.2 USB controller [0c03]: Advanced Micro Devices, Inc. [AMD/ATI] Navi 31 USB [1002:7446]
0f:00.0 USB controller [0c03]: Intel Corporation Thunderbolt 4 NHI [Maple Ridge 4C 2020] [8086:1137]
3b:00.0 USB controller [0c03]: Intel Corporation Thunderbolt 4 USB Controller [Maple Ridge 4C 2020] [8086:1138]
68:00.0 USB controller [0c03]: Advanced Micro Devices, Inc. [AMD] 600 Series Chipset USB 3.2 Controller [1022:43f7] (rev 01)
6a:00.0 USB controller [0c03]: Advanced Micro Devices, Inc. [AMD] 600 Series Chipset USB 3.2 Controller [1022:43f7] (rev 01)
6c:00.3 USB controller [0c03]: Advanced Micro Devices, Inc. [AMD] Raphael/Granite Ridge USB 3.1 xHCI [1022:15b6]
6c:00.4 USB controller [0c03]: Advanced Micro Devices, Inc. [AMD] Raphael/Granite Ridge USB 3.1 xHCI [1022:15b7]
6d:00.0 USB controller [0c03]: Advanced Micro Devices, Inc. [AMD] Raphael/Granite Ridge USB 2.0 xHCI [1022:15b8]
```

A random note, which may or may not have any effect at all: "Legacy xHCI Hand-Off" is disabled in my EFI settings. I assume modern kernels don't require it.
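You can also get this list without parsing `lspci` text, because sysfs exposes each PCI function's class code. xHCI controllers report `0x0c0330` (USB controller, xHCI programming interface), which is the value the kernel logs at boot for these chipset controllers. The minimal POSIX shell sketch below also prints the bound driver, so a controller left unbound after a failed rebind stands out. `list_xhci` and the `PCI_SYSFS` override are hypothetical.

```sh
# Hypothetical override; defaults to the real PCI device directory.
PCI_SYSFS="${PCI_SYSFS:-/sys/bus/pci/devices}"

# list_xhci (hypothetical helper): print "ADDRESS DRIVER" for every PCI
# function whose class is 0x0c0330; DRIVER is "none" when nothing is bound.
list_xhci() (
    for dev in "$PCI_SYSFS"/*; do
        # Skip entries without a readable class file (or an unmatched glob).
        [ -r "$dev/class" ] || continue
        [ "$(cat "$dev/class")" = 0x0c0330 ] || continue
        drv=none
        if [ -e "$dev/driver" ]; then
            # "driver" is a symlink to the driver's directory; keep its name.
            drv=$(cd -P "$dev/driver" && basename "$(pwd -P)")
        fi
        echo "${dev##*/} $drv"
    done
)
```

On this machine, lines such as `0000:6a:00.0 xhci_hcd` would be expected while the controller is bound.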
(In reply to Claudio Wunder from comment #6)
> Here's the full dmesg log via `journalctl -o short-precise -k -b -3`:
> https://gist.github.com/ovflowd/0b0aa5c748683eca33909dc3ed7c66f7 (I shared
> on GitHub Gist due to the large size, if you rather have me hosting it on a
> FOSS alternative, let me know, I can upload it to gitlab.gnome.org)

Thanks. That's not a really huge file; you can upload such files here as attachments to this bug.

> pci 0000:6b:00.0: [1022:43f7] type 00 class 0x0c0330 PCIe Legacy Endpoint

This means it's the "600 series chipset", which is reportedly a Promontory family chipset, made for AMD by ASMedia. And IME ASMedia controllers are pretty buggy.

> xhci_hcd 0000:6b:00.0: Event TRB for slot 18 ep 0 with no TDs queued
> usb 8-3: Device not responding to setup address.
> xhci_hcd 0000:6b:00.0: ERROR unknown event type 4

That's the sort of stuff which may show up when they get completely FUBAR'd and are about to stop working altogether. So I don't really share the optimism that a simple SW bug causes the abort to fail; it's probably a deeper screwup. The debugfs dump will tell...

I think you said you have more of those logs. Is the above always appearing a few seconds before "hc died"? It seems related to the 8-3 device, a VIA USB 3.0 hub.

IME such problems tend to happen under particular workloads, so I still have no idea how a minor kernel update could cause it to appear. Were there no hardware changes, like USB devices added or moved to other ports? How long have you been using this machine with all of those devices? Which kernel versions were working OK for so long that they certainly cannot have this problem?
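You can check mechanically whether the "no TDs queued" warning always precedes "HC died" by a few seconds. Here is a minimal sketch in POSIX shell and awk. It assumes journalctl's default `short` output (`Mon DD HH:MM:SS host kernel: ...`) and that the two messages fall within 24 hours of each other. `hc_died_gap` is a hypothetical name. For every "HC died", it prints the controller address and the seconds elapsed since that controller's most recent "with no TDs queued" warning.

```sh
# hc_died_gap (hypothetical helper): read kernel log lines on stdin and,
# for each "HC died", print "ADDRESS SECONDS" measured from the same
# controller's last "with no TDs queued" warning.
hc_died_gap() {
    awk '
    # Convert HH:MM:SS (fractional seconds allowed) into seconds.
    function secs(t,    p) {
        split(t, p, ":")
        return p[1] * 3600 + p[2] * 60 + p[3]
    }
    {
        # The controller address is the field right after "xhci_hcd".
        addr = ""
        for (i = 1; i < NF; i++) {
            if ($i == "xhci_hcd") { addr = $(i + 1); break }
        }
        if (addr == "") next
        sub(/:$/, "", addr)                 # drop the trailing colon
        if ($0 ~ /with no TDs queued/) {
            last[addr] = secs($3)           # field 3 is the time of day
        } else if ($0 ~ /HC died/ && (addr in last)) {
            d = secs($3) - last[addr]
            if (d < 0) d += 86400           # crossed midnight
            print addr, d
        }
    }'
}
```

Use it as `journalctl -k -b -1 | hc_died_gap`. For the Apr 26 23:16 excerpt earlier in this thread it reports `0000:6b:00.0 15`, and for the Apr 28 one `0000:6a:00.0 16`.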
> This means it's the "600 series chipset", which is reportedly a Promontory
> family chipset, made for AMD by ASMedia. And IME ASMedia controllers are
> pretty buggy.

That's interesting... I do recall that on Windows, resuming from sleep would at times have issues (I had to switch the AMD drivers to custom ones due to SMU/ULPS issues). I'm not sure whether this log line is relevant: `xhci_hcd 0000:03:00.2: xHC error in resume, USBSTS 0x401, Reinit`.

> The debugfs dump will tell...

Never been so excited for a crash to happen lol.

> I think you said you have more of those logs, is the above always appearing a
> few seconds before "hc died"? It seems related to the 8-3 device, a VIA USB
> 3.0 hub.

In the two samples I have so far, these messages do show up. Note that these errors appear with both the original regression and the current "apparent" one (if we can even call it a regression). I will need to wait and see whether the next crash also produces these logs.

> Were there no hardware changes, like USB devices added or moved to other
> ports?

Nope, no changes. I know older versions of Fedora (40) had no issues, but it's been a while since I used it. I was recently on Windows 11 (no issues whatsoever, except for the AMDGPU ULPS driver issues), and since moving fresh to Bazzite (Universal Blue/Silverblue) I've been facing these issues.

It could just as well be _my hardware_. But there is the coincidence: with 6.13.7 the issue disappeared for a whole week. It started happening again after the next Bazzite update with 6.13.9, and the one after that with 6.14.3. Curiously, on 6.13.9 it happens infrequently, but on 6.14.3 it was happening every few hours. Unfortunately, I rolled the system back with rpm-ostree, so the logs for those boots are lost. I could try upgrading to 6.14.3 again to see whether I hit the bug more often.

That said, I genuinely don't know whether it is due to a kernel change.
> IME such problems tend to be happening under particular workloads,

It has happened on resume from sleep, under intensive workloads (e.g. gaming), while simply typing/using the PC normally, and while the PC was idle. At the time I ruled out the power profile and power management as causes. Enabling or disabling OS-managed ASPM also has no effect, so I ruled out an ACPI/ASPM issue.
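When comparing test runs, it helps to record which ASPM policy was in effect. The kernel lists the available policies in `/sys/module/pcie_aspm/parameters/policy` and brackets the active one. Here is a minimal POSIX shell sketch that extracts it; `aspm_policy` and the `ASPM_POLICY_FILE` override are hypothetical. This is only the OS-side policy. The per-link `LnkCtl` lines in `sudo lspci -vv` show what is actually enabled on each link.

```sh
# Hypothetical override; defaults to the real module parameter file.
ASPM_POLICY_FILE="${ASPM_POLICY_FILE:-/sys/module/pcie_aspm/parameters/policy}"

# aspm_policy (hypothetical helper): print the bracketed, i.e. active,
# entry from a line such as "[default] performance powersave powersupersave".
aspm_policy() {
    sed -n 's/.*\[\([a-z]*\)].*/\1/p' "$ASPM_POLICY_FILE"
}
```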
> This means it's the "600 series chipset",

OOC, X670E is my chipset, so I assume it is indeed part of the buggy club. I could change my motherboard to something like an X870E (assuming the 800 series by ASMedia does not have the same issue), just to rule out the chipset as the culprit.
(In reply to Claudio Wunder from comment #10)
> > I think you said you have more of those logs, is the above always appearing
> a
> > few seconds before "hc died"? It seems related to the 8-3 device, a VIA USB
> > 3.0 hub.
>
> For the sample of two items I have so far, it appears that these are showing
> up. Note that on both the original regression and the current "apparent" one
> (if we can even call it a regression?), these errors above are happening. I
> will need to wait to see the next crash also happens to have said logs;

Wait, this is important. Suppose you were seeing "Abort failed to stop command ring: -110" instead of "xHCI host not responding to stop endpoint command" before 6.13.7. Then it is at least possible, if not likely, that you were already running into a different problem than the one fixed in 6.13.7. It gets doubly suspicious if you also saw "ERROR unknown event type <some number>" a few seconds before "HC died". Do you still have those logs by any chance?

As Mathias Nyman explained, the known 6.13 issue was a simple driver bug: commands were written incorrectly, chips correctly ignored them, and the driver incorrectly pronounced them dead.

Mathias further suggests that this or a similar bug may still somehow exist in your kernel, and that the command abort fails because the chip believes there are no pending commands. That is possible, but unlikely, because command abort is not supposed to fail like that. So if you ever see a command abort timeout, either the abort code is buggy (and it looks like no one has touched that part in ages) or the chip is buggy in one way or another.

It would be sad if this turns out to be a regression due to the commits initially suspected back in February: https://bugzilla.kernel.org/show_bug.cgi?id=219824#c5

These are present in all 6.12 and higher releases from this year, so the only supported kernels without them are the old LTS series. Not sure if you have the means to test those for a few weeks on the same HW, userspace and workload?

I could also suggest some stress tests which exercise this code (and the USB controller). I found webcams and USB serial dongles to be particularly suitable. Do you have anything like that at hand?
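The negative numbers in these messages are Linux errno values. Here is a minimal POSIX shell lookup covering only the two codes that appear in this thread, assuming the generic Linux numbering used on x86; `errno_name` is a hypothetical helper. `-110` (`ETIMEDOUT`) in "Abort failed to stop command ring" means the driver timed out waiting for the command ring to stop. `-62` (`ETIME`) in "device not accepting address 3" comes together with the "Timeout while waiting for setup device command" line.

```sh
# errno_name CODE (hypothetical helper): map the few negative errno values
# seen in these logs to their Linux names; anything else is "unknown".
errno_name() {
    case ${1#-} in
        62)  echo ETIME ;;       # "Timer expired"
        110) echo ETIMEDOUT ;;   # "Connection timed out"
        *)   echo "unknown ($1)" ;;
    esac
}
```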
(In reply to Claudio Wunder from comment #11)
> OOC, X670E is my chipset, I assume it is indeed part of the buggy club. I
> could change my motherboard, something like X870E (Assuming the 800 series
> by ASMedia does not have the same issue); Just to rule out the "this is
> probably the chipset" culprit.

That's an assumption ;)

A simpler thing is to try different USB ports (rear or front panel) and see whether any are connected to different (probably in-CPU) controllers. Or get a PCIe card: those with Renesas uPD720201/uPD720202 chips are reliable, though 5 Gbps only.

Your problem seems to be HW specific, because others generally stopped complaining after 6.13.7. I have heard about one more case of "Abort failed to stop command ring: -110" and suggested filing a bug here, but the reporter never did.
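To check which controller a port or device is attached to, note that each USB device's sysfs entry resolves to a path under its PCI host controller. Here is a minimal POSIX shell sketch assuming the usual Linux sysfs layout; `usb_controller` and the `USB_SYSFS` override are hypothetical. An xHCI controller normally registers two buses, one USB 2.0 and one USB 3.x. That is presumably why buses 7 and 8 disconnect together in the logs.

```sh
# Hypothetical override; defaults to the real USB device directory.
USB_SYSFS="${USB_SYSFS:-/sys/bus/usb/devices}"

# usb_controller NAME (hypothetical helper): print the PCI address of the
# host controller behind a USB device or bus, e.g. "8-3" or "usb7".
usb_controller() (
    # Resolve the symlink to the physical device path.
    path=$(cd -P "$USB_SYSFS/$1" 2>/dev/null && pwd -P) || {
        echo "usb_controller: no such USB device: $1" >&2
        exit 1
    }
    set -f              # no globbing while splitting the path below
    IFS=/
    ctrl=
    for part in $path; do
        case $part in
            [0-9a-f][0-9a-f][0-9a-f][0-9a-f]:[0-9a-f][0-9a-f]:[01][0-9a-f].[0-7])
                ctrl=$part ;;   # keep the deepest PCI address on the path
        esac
    done
    [ -n "$ctrl" ] || exit 1
    echo "$ctrl"
)
```

After moving a device to another port, run `usb_controller` on its new name (for example `usb_controller 8-3`) to see whether it landed on a different controller.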
> "xHCI host not responding to stop endpoint command"

The closest match to "xHCI host not responding to stop endpoint command" in all my previously stored journald logs is:

```
Apr 21 11:23:10.258699 angel-thesis kernel: xhci_hcd 0000:6a:00.0: xHCI host controller not responding, assume dead
```

Here are some more logs:

```
Apr 21 03:14:59 angel-thesis kernel: xhci_hcd 0000:6a:00.0: Event TRB for slot 4 ep 0 with no TDs queued
Apr 21 04:20:16 angel-thesis kernel: xhci_hcd 0000:6a:00.0: ERROR unknown event type 4
Apr 21 04:20:16 angel-thesis kernel: xhci_hcd 0000:6a:00.0: Abort failed to stop command ring: -110
Apr 21 04:20:16 angel-thesis kernel: xhci_hcd 0000:6a:00.0: xHCI host controller not responding, assume dead
Apr 21 04:20:16 angel-thesis kernel: xhci_hcd 0000:6a:00.0: HC died; cleaning up
Apr 21 04:20:16 angel-thesis kernel: xhci_hcd 0000:6a:00.0: Timeout while waiting for setup device command
```

> if you also saw "ERROR unknown event type <some number>" a few seconds before
> "HC died". Do you still have those logs by any chance?

Yes!

```
Apr 26 23:57:40 angel-thesis kernel: xhci_hcd 0000:6b:00.0: Event TRB for slot 4 ep 0 with no TDs queued
Apr 26 23:57:40 angel-thesis kernel: usb 8-3: Device not responding to setup address.
Apr 26 23:57:55 angel-thesis kernel: xhci_hcd 0000:6b:00.0: ERROR unknown event type 4
Apr 26 23:57:55 angel-thesis kernel: xhci_hcd 0000:6b:00.0: Abort failed to stop command ring: -110
Apr 26 23:57:55 angel-thesis kernel: xhci_hcd 0000:6b:00.0: xHCI host controller not responding, assume dead
Apr 26 23:57:55 angel-thesis kernel: xhci_hcd 0000:6b:00.0: HC died; cleaning up
Apr 26 23:57:55 angel-thesis kernel: xhci_hcd 0000:6b:00.0: Timeout while waiting for setup device command
Apr 26 23:57:55 angel-thesis kernel: usb 7-2: USB disconnect, device number 2
Apr 26 23:57:55 angel-thesis kernel: usb 7-2.3: USB disconnect, device number 5
Apr 26 23:57:55 angel-thesis kernel: usb 7-2.3.1: USB disconnect, device number 10
Apr 26 23:57:55 angel-thesis kernel: usb 7-2.4: USB disconnect, device number 8
Apr 26 23:57:55 angel-thesis kernel: usb 7-2.5: USB disconnect, device number 13
Apr 26 23:57:55 angel-thesis kernel: usb 7-3: USB disconnect, device number 3
Apr 26 23:57:55 angel-thesis kernel: usb 7-3.1: USB disconnect, device number 7
Apr 26 23:57:55 angel-thesis kernel: usb 7-3.3: USB disconnect, device number 12
Apr 26 23:57:55 angel-thesis kernel: usb 7-3.4: USB disconnect, device number 14
Apr 26 23:57:55 angel-thesis kernel: usb 7-3.4.2: USB disconnect, device number 16
Apr 26 23:57:55 angel-thesis kernel: usb 7-3.5: USB disconnect, device number 15
Apr 26 23:57:55 angel-thesis kernel: usb 7-5: USB disconnect, device number 4
Apr 26 23:57:55 angel-thesis kernel: usb 7-5.2: USB disconnect, device number 9
Apr 26 23:57:55 angel-thesis kernel: usb 7-7: USB disconnect, device number 6
Apr 26 23:57:55 angel-thesis kernel: usb 7-11: USB disconnect, device number 11
Apr 26 23:57:55 angel-thesis kernel: usb 8-3: device not accepting address 3, error -62
Apr 26 23:57:55 angel-thesis kernel: usb 8-3: USB disconnect, device number 3
Apr 26 23:57:55 angel-thesis kernel: usb 8-3.4: USB disconnect, device number 5
Apr 26 23:57:55 angel-thesis kernel: usb usb8-port3: couldn't allocate usb_device
Apr 26 23:57:55 angel-thesis kernel: usb 8-2: USB disconnect, device number 2
Apr 26 23:57:55 angel-thesis kernel: usb 8-5: USB disconnect, device number 4
```

```
Apr 28 18:54:12 angel-thesis kernel: xhci_hcd 0000:6a:00.0: Event TRB for slot 18 ep 0 with no TDs queued
Apr 28 18:54:12 angel-thesis kernel: usb 8-3: Device not responding to setup address.
Apr 28 18:54:28 angel-thesis kernel: xhci_hcd 0000:6a:00.0: ERROR unknown event type 4
Apr 28 18:54:28 angel-thesis kernel: xhci_hcd 0000:6a:00.0: Abort failed to stop command ring: -110
Apr 28 18:54:28 angel-thesis kernel: xhci_hcd 0000:6a:00.0: xHCI host controller not responding, assume dead
Apr 28 18:54:28 angel-thesis kernel: xhci_hcd 0000:6a:00.0: HC died; cleaning up
Apr 28 18:54:28 angel-thesis kernel: xhci_hcd 0000:6a:00.0: Timeout while waiting for setup device command
Apr 28 18:54:28 angel-thesis kernel: usb 7-2: USB disconnect, device number 2
Apr 28 18:54:28 angel-thesis kernel: usb 7-2.3: USB disconnect, device number 5
Apr 28 18:54:28 angel-thesis kernel: usb 7-2.3.1: USB disconnect, device number 10
Apr 28 18:54:28 angel-thesis kernel: usb 7-2.3.2: USB disconnect, device number 13
Apr 28 18:54:28 angel-thesis kernel: usb 7-2.5: USB disconnect, device number 8
Apr 28 18:54:28 angel-thesis kernel: usb 7-3: USB disconnect, device number 3
Apr 28 18:54:28 angel-thesis kernel: usb 7-3.1: USB disconnect, device number 7
Apr 28 18:54:28 angel-thesis kernel: usb 8-3: device not accepting address 3, error -62
Apr 28 18:54:28 angel-thesis kernel: usb 8-3: USB disconnect, device number 3
Apr 28 18:54:28 angel-thesis kernel: usb 8-3.4: USB disconnect, device number 5
Apr 28 18:54:28 angel-thesis kernel: usb usb8-port3: couldn't allocate usb_device
Apr 28 18:54:28 angel-thesis kernel: usb 8-2: USB disconnect, device number 2
Apr 28 18:54:28 angel-thesis kernel: usb 8-5: USB disconnect, device number 4
Apr 28 18:54:28 angel-thesis kernel: usb 7-3.3: USB disconnect, device number 12
Apr 28 18:54:28 angel-thesis kernel: usb 7-3.4: USB disconnect, device number 14
Apr 28 18:54:28 angel-thesis kernel: usb 7-3.5: USB disconnect, device number 15
Apr 28 18:54:28 angel-thesis kernel: usb 7-5: USB disconnect, device number 4
Apr 28 18:54:28 angel-thesis kernel: usb 7-5.2: USB disconnect, device number 9
Apr 28 18:54:28 angel-thesis kernel: usb 7-7: USB disconnect, device number 6
Apr 28 18:54:28 angel-thesis kernel: usb 7-11: USB disconnect, device number 11
```

And it does indeed happen a few seconds afterwards.

> So if you ever seem command abort timeout, either the abort code is buggy
> (and it looks like no one touched that part in ages) or the chip is buggy in
> one way or another.

That's interesting.

> These are present in all 6.12 and higher releases from this year, so the only
> supported kernels without them are old LTS series. Not sure if you have means
> of testing those for a few weeks on the same HW, userspace and workload?

I'll wait for the issue to happen again so I can at least upload the debugfs dump. Then I'll try switching to an older kernel version (6.12.x) if needed.

> I could also suggest some stress tests which exercise this code (and the USB
> controller). I found webcams and USB serial dongles to be particularly
> suitable, do you have some of such stuff at hand?

You mean a synthetic stress test, like Prime95?

As for the dongles/connected USB devices, here's a screenshot of the USB device tree: https://gist.github.com/ovflowd/0b0aa5c748683eca33909dc3ed7c66f7#file-screenshot_20250501_113016-png

But in short:

- There's a RodeCaster Duo connected to one of the rear USB ports.
  - Note that it has two (2) USB-out ports to connect to two devices.
- There's an Anker KVM switch (model number: A83K3) connected to another USB port. A keyboard (Wooting 60HE+) and a Logitech Bolt dongle (wireless mouse receiver) are connected to it.
  - The RodeCaster Duo's second USB port is also connected there.
- I'm using a monitor with a USB-B hub (supposedly USB 3.2, per the monitor settings, but Linux recognises it as USB 2.0, possibly because it negotiates at 2.0 speeds whether the webcam is idle or not), to which a webcam (Insta360 Link 2C) is connected.

To be honest, none of these devices is bandwidth-hungry; even the webcam is capped at 2K and always runs at 1080p/i.

A bunch of integrated peripherals also appear there, such as ASMedia's ASM controller (ASM107x, whatever that is; it seems to be shared between two xHCI controllers), the Bluetooth controller, the LED controller and the AIO pump controller.

It is really hard to say whether any of these devices is somehow crashing the xHCI controller, though I suspect it might be crashing one specific controller. For example, audio on my RodeCaster Duo and Bluetooth keep working when the crash happens (not sure if this is important info), but all other devices (like my mouse and keyboard) stop working. (I already tried plugging them into different front ports, but not rear ports.)

And according to the logs, the failing controller is exactly the one that all my peripherals, except the rear port of my RodeCaster Duo, are on:

```
Manufacturer: Linux 6.13.9-103.bazzite.fc42.x86_64 xhci-hcd
Serial #: 0000:6a:00.0
```

(This one contains all my devices, mouse, etc., except for Bluetooth and the 1st RodeCaster Duo port.)

```
Manufacturer: Linux 6.13.9-103.bazzite.fc42.x86_64 xhci-hcd
Serial #: 0000:68:00.0
```

And since all the integrated motherboard peripherals are on the former one, could it be related to some integrated hardware, as you mentioned before? It's really hard to know without the logs, so I'll stop speculating.

> A simpler thing is to try different USB ports (rear or front panel) and see
> if any are connected to different (probably in-CPU) controllers.

Yeah, that's what I described above.

> Your problem seems to be HW specific, because others generally stopped
> complaining after 6.13.7. I have heard about one more case of "Abort failed
> to stop command ring: -110" and suggested filing a bug here, but the reporter
> never did.

I worry I am wasting too much of your time, tbh. Genuinely speaking, I have no idea what's going on beyond "I definitely would like a solution for this", and I want to contribute as much as I can by reporting a bug that may or may not be affecting other users.
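If you want to reproduce this controller-to-device mapping without the GUI tree, there is a sysfs route. Each USB root hub reports its controller's PCI address as its serial number; the bind log later in this thread shows `usb usb7: SerialNumber: 0000:6a:00.0`. Reading the `serial` attribute of every root hub should therefore give the same mapping. This is a minimal sketch that assumes the usual `/sys/bus/usb/devices/usbN` layout. The function name `list_root_hubs` and its optional sysfs-root argument are hypothetical; the argument exists only so the function can be tried against a fake tree.

```sh
#!/bin/sh
# Print "<root hub> <xHCI PCI address>" for every USB root hub.
# Root hubs report their controller's PCI name as their serial number
# (e.g. "usb usb7: SerialNumber: 0000:6a:00.0" in the kernel log).
# $1: optional sysfs mount point, default /sys (hypothetical parameter,
#     only there so the function can be exercised on a fake tree).
list_root_hubs() {
    sysfs=${1:-/sys}
    for hub in "$sysfs"/bus/usb/devices/usb*; do
        # Skips the unexpanded glob when nothing matches, and any
        # entry without a readable serial attribute.
        [ -r "$hub/serial" ] || continue
        printf '%s %s\n' "${hub##*/}" "$(cat "$hub/serial")"
    done
}
```

For example, `list_root_hubs | grep 0000:6a:00.0` should list the two root hubs (one USB 2.0 bus, one USB 3.x bus) of the controller that keeps dying.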
And it happened again, during a gaming workload. Unfortunately, I had restarted my PC and the debugfs mount was gone, but here are the logs:

```
May 01 22:53:53 angel-thesis kernel: xhci_hcd 0000:6a:00.0: Abort failed to stop command ring: -110
May 01 22:53:53 angel-thesis kernel: xhci_hcd 0000:6a:00.0: xHCI host controller not responding, assume dead
May 01 22:53:53 angel-thesis kernel: xhci_hcd 0000:6a:00.0: HC died; cleaning up
May 01 22:53:53 angel-thesis kernel: usb 7-2: USB disconnect, device number 2
May 01 22:53:53 angel-thesis kernel: usb 7-2.3: USB disconnect, device number 5
May 01 22:53:53 angel-thesis kernel: usb 7-2.3.1: USB disconnect, device number 10
May 01 22:53:53 angel-thesis kernel: xhci_hcd 0000:6a:00.0: Timeout while waiting for setup device command
May 01 22:53:53 angel-thesis kernel: usb 7-2.3.2: USB disconnect, device number 13
May 01 22:53:53 angel-thesis kernel: usb 7-2.5: USB disconnect, device number 8
May 01 22:53:53 angel-thesis kernel: usb 7-3: USB disconnect, device number 3
May 01 22:53:53 angel-thesis kernel: usb 7-3.1: USB disconnect, device number 7
May 01 22:53:53 angel-thesis kernel: usb 8-3: device not accepting address 3, error -62
May 01 22:53:53 angel-thesis kernel: usb 8-3: USB disconnect, device number 3
May 01 22:53:53 angel-thesis kernel: usb 8-3.4: USB disconnect, device number 5
May 01 22:53:53 angel-thesis kernel: usb usb8-port3: couldn't allocate usb_device
May 01 22:53:53 angel-thesis kernel: usb 8-2: USB disconnect, device number 2
May 01 22:53:53 angel-thesis kernel: usb 8-5: USB disconnect, device number 4
May 01 22:53:53 angel-thesis kernel: usb 7-3.3: USB disconnect, device number 12
May 01 22:53:53 angel-thesis kernel: usb 7-3.4: USB disconnect, device number 14
May 01 22:53:53 angel-thesis kernel: usb 7-3.4.2: USB disconnect, device number 16
May 01 22:53:53 angel-thesis kernel: usb 7-3.5: USB disconnect, device number 15
May 01 22:53:53 angel-thesis kernel: usb 7-5: USB disconnect, device number 4
May 01 22:53:54 angel-thesis kernel: usb 7-5.2: USB disconnect, device number 9
May 01 22:53:54 angel-thesis kernel: usb 7-7: USB disconnect, device number 6
```

If it happens again, I will be ready this time.
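Every failure names the dying controller in its `xHCI host controller not responding, assume dead` line. A small filter can therefore list which controllers died in a given boot, which also gives the address needed for the unbind/bind workaround. This is a minimal sketch that assumes the journald message format shown above; `dead_controllers` is a hypothetical helper name.

```sh
#!/bin/sh
# Read kernel log lines on stdin and print each distinct PCI address of an
# xHCI controller the driver declared dead, in order of first appearance.
# Assumes lines of the form (hypothetical helper, format taken from above):
#   ... kernel: xhci_hcd 0000:6a:00.0: xHCI host controller not responding, assume dead
dead_controllers() {
    awk '
        /xhci_hcd .*: xHCI host controller not responding, assume dead/ {
            # The PCI address is the field right after "xhci_hcd";
            # strip its trailing colon.
            for (i = 1; i < NF; i++)
                if ($i == "xhci_hcd") {
                    addr = $(i + 1)
                    sub(/:$/, "", addr)
                    if (!(addr in seen)) { seen[addr] = 1; print addr }
                    break
                }
        }
    '
}
```

For example, `journalctl -k -b -1 | dead_controllers` would scan the kernel messages of the previous boot.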
(note that I restarted my PC prior to this due to an unrelated matter)
Actually, I think it would still work if you mount debugfs after the fact. These files are generated by the driver from data which already exist in memory and will only be lost after unbind. But no problem, it looks like the same thing is happening every few days.

(In reply to Claudio Wunder from comment #14)
> For example, Audio on my RodeCaster Duo and Bluetooth keep working when that
> said crash happens (not sure if this is important info), but all other
> devices (like my mouse, keyboard) stop working (I already tried to plug on
> different front ports, but not rear ports)

The devices which stop working are exactly those connected to the controller which fails. And the problem source is also somewhere there. There is something suspicious about the 8-3 hub, which is always undergoing a reset when the HC dies, and it is those commands which appear to be timing out and then refusing to abort.

Not sure if it's the cause or a symptom; maybe try to request a reset manually a few times? The 'usbutils' package, which provides 'lsusb', should also have 'usbreset'. This will do the business:

    sudo usbreset 2109:0817

Or exercise this hub a little by asking it to reset its child device. There is another hub connected to it with a different ID:

    sudo usbreset 2109:0211

(BTW, bus 7 is the USB 2.0 part of bus 8, and devices seen on both 7 and 8 are USB 3.x hubs.)
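The debugfs data is only lost on unbind, so one way to keep it is to copy the controller's debugfs directory before running the unbind/bind workaround. This is a sketch under several assumptions. It assumes debugfs is mounted at `/sys/kernel/debug` and that the per-controller directory is `usb/xhci/<PCI address>` under it. It also assumes that reading each file with `cat` is enough to capture it; debugfs files typically report a size of 0, which can trip up tools that trust the reported size. The helper names `snapshot_dir` and `rebind_xhci`, and the optional sysfs-root argument, are hypothetical.

```sh
#!/bin/sh
# Copy every regular file under $1 into $2, preserving the relative layout.
# Files are read with cat instead of being copied by size, because debugfs
# files typically report a size of 0 (hypothetical helper).
snapshot_dir() {
    src=$1
    dst=$2
    (cd "$src" && find . -type f) | while IFS= read -r rel; do
        mkdir -p "$dst/$(dirname "$rel")"
        # Some debugfs files are write-only or fail to read; drop those.
        cat "$src/$rel" > "$dst/$rel" 2>/dev/null || rm -f "$dst/$rel"
    done
}

# Unbind and rebind an xHCI controller given its PCI address, as in the
# echo/tee workaround (hypothetical helper).
# $2: optional sysfs root, default /sys, only for testing on a fake tree.
rebind_xhci() {
    pci=$1
    drv=${2:-/sys}/bus/pci/drivers/xhci_hcd
    # Refuse anything that does not look like "DDDD:BB:SS.F".
    case $pci in
        [0-9a-f][0-9a-f][0-9a-f][0-9a-f]:[0-9a-f][0-9a-f]:[0-9a-f][0-9a-f].[0-7]) ;;
        *) echo "bad PCI address: $pci" >&2; return 1 ;;
    esac
    printf '%s' "$pci" > "$drv/unbind" || return 1
    printf '%s' "$pci" > "$drv/bind"
}
```

Run it as root. For example, after sourcing the functions: `snapshot_dir /sys/kernel/debug/usb/xhci/0000:6a:00.0 ./xhci-dead && rebind_xhci 0000:6a:00.0`. If debugfs is not mounted yet, `mount -t debugfs none /sys/kernel/debug` mounts it first.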
> And the problem source is also somewhere there. There is something suspicious
> about the 8-3 hub which is always undergoing a reset when the HC dies

I assume you're referring to:

```
May 01 22:53:53 angel-thesis kernel: usb 8-3: device not accepting address 3, error -62
May 01 22:53:53 angel-thesis kernel: usb usb8-port3: couldn't allocate usb_device
```

> (BTW, bus 7 is the USB 2.0 part of bus 8, and devices seen on both 7 and 8
> are USB 3.x hubs).

Oh, that makes a lot of sense! Just to verify: with `sudo usbreset 2109:0817` and `sudo usbreset 2109:0211`, you're asking me to reset the USB 3.0 hub on bus 8 and its child device, right? There's nothing connected to them, but just verifying.

> But no problem, looks like the same thing is happening every few days.

The last two times it happened, I was playing games. Another symptom before the reset is that the system freezes for a few seconds, including audio, and then unfreezes as the devices disconnect.

> in memory and will only be lost after unbind.

Alrighty, then it's no use; I wasn't aware of this. The mount was done after I ran unbind/bind over `ssh`... But here's the zip anyway, maybe it helps: https://gist.github.com/ovflowd/0b0aa5c748683eca33909dc3ed7c66f7#file-debugfs-zip
BTW, I ran the `usbreset` commands and got: "Resetting USB3.0 Hub ... ok" and "Resetting USB3.0 Hub ... ok" respectively.
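`usbreset` is given an ID pair such as `2109:0817`. Before hammering a device, it can help to confirm which port path (such as `8-3` or `8-3.4`) that ID belongs to. This is a minimal sketch that assumes the standard `idVendor`/`idProduct` sysfs attributes, which hold lowercase hex. `find_usb_by_id` and its optional sysfs-root argument are hypothetical.

```sh
#!/bin/sh
# Print the sysfs name (port path such as 8-3 or 8-3.4) of every USB device
# whose idVendor:idProduct matches $1, e.g. 2109:0817 (hypothetical helper).
# $2: optional sysfs root, default /sys, only for testing on a fake tree.
find_usb_by_id() {
    # sysfs stores the IDs as lowercase hex, so normalise the argument.
    want=$(printf '%s' "$1" | tr 'A-F' 'a-f')
    sysfs=${2:-/sys}
    for dev in "$sysfs"/bus/usb/devices/*; do
        # Interface entries (e.g. 8-3:1.0) have no idVendor; skip them.
        [ -r "$dev/idVendor" ] && [ -r "$dev/idProduct" ] || continue
        id="$(cat "$dev/idVendor"):$(cat "$dev/idProduct")"
        [ "$id" = "$want" ] && printf '%s\n' "${dev##*/}"
    done
    return 0
}
```

On this machine, going by the bind log, `find_usb_by_id 2109:0817` would be expected to print `8-3`, and `find_usb_by_id 2109:0211` to print `8-3.4`.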
Try hammering each variant in a loop for a few hours:

    while :; do usbreset <numbers>; done

You could also pile more work on the HC at the same time; for example, starting and stopping video recording makes the HC quite busy. I suggest automating this job too. I use the yavta tool (linked below) like this:

    while test -c /dev/video0; do yavta -c3 /dev/video0; done

https://git.ideasonboard.org/yavta.git
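The `while :; do ...; done` loops above run until interrupted. A variant could instead stop after a fixed time and at the first failing command, with a timestamp that is easy to match against the journal. The following is a sketch, not the tooling used in this report. `stress_for` is a hypothetical name, and it relies on `date +%s`, which GNU and BSD `date` support.

```sh
#!/bin/sh
# Run a command repeatedly for $1 seconds, stopping early if it fails
# (hypothetical helper). On failure it prints a timestamp in journald's
# short format so it can be matched against the kernel log.
# Returns 0 if the time ran out, 1 on the first failure.
stress_for() {
    secs=$1
    shift
    deadline=$(( $(date +%s) + secs ))
    runs=0
    while [ "$(date +%s)" -lt "$deadline" ]; do
        if ! "$@"; then
            echo "$(date '+%b %d %H:%M:%S') failed after $runs runs: $*" >&2
            return 1
        fi
        runs=$((runs + 1))
    done
    echo "completed $runs runs" >&2
    return 0
}
```

For example, `stress_for 10800 usbreset 2109:0817` (as root) hammers the hub for three hours. `stress_for 10800 yavta -c3 /dev/video0` in a second terminal adds the video load at the same time.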
(In reply to Claudio Wunder from comment #18)
> In the last 2 times these match times I've been playing games. Another
> symptom before the reset, is the system freezes for a few seconds, including
> audio, but then it unfreezes with the devices then disconnecting.

If my tricks fail to accelerate the failure, you can always go back to gaming ;)

The freeze is a known (at least to me) bug. I believe it's caused by the driver waiting for the command abort to complete with xhci->lock held. This blocks the IRQ handler for a few seconds as it tries to acquire the same lock, which apparently delays other random IRQs as well. I have been putting off looking into it properly since December; maybe it's time to alert Mathias...

> The mount was done after I ran unbind/bind from `ssh`...
> But here's the zip anyways, maybe it helps.
> https://gist.github.com/ovflowd/0b0aa5c748683eca33909dc3ed7c66f7#file-debugfs-zip

This zip is OK, but it contains current data from normal operation, not a failure. There are a few random transfers on some devices, and it looks like the familiar 8-3 hub is periodically getting autosuspended; maybe that's what sometimes goes wrong.
> This zip is OK, but it contains current data from normal operation, not a
> failure. A few random transfers on some devices and it looks like the
> familiar 8-3 hub is periodically getting autosuspended, maybe that's what
> sometimes goes wrong.

Yeah, as I mentioned, that was after I did the unbind/bind.

> The freeze is a known (at least to me) bug. I believe it's caused by the
> driver waiting for the command abort to complete with xhci->lock held, which
> blocks the IRQ handler for a few seconds as it tries to acquire the same
> lock, which apparently causes other random IRQs to get delayed as well. I
> have been putting off looking into it properly since December, maybe it's
> time to alert Mathias..

I appreciate the extra info. Let me try those simulated failures and see if they accelerate the issue.
I was finally able to capture a zip from debugfs before unbind/bind. Here's the file: https://gist.github.com/ovflowd/0b0aa5c748683eca33909dc3ed7c66f7#file-debugfs-new-zip

And the logs:

```
May 03 01:53:29 angel-thesis kernel: xhci_hcd 0000:6a:00.0: Event TRB for slot 19 ep 0 with no TDs queued
May 03 01:53:29 angel-thesis kernel: usb 8-3: Device not responding to setup address.
May 03 01:53:44 angel-thesis kernel: xhci_hcd 0000:6a:00.0: ERROR unknown event type 4
May 03 01:53:44 angel-thesis kernel: xhci_hcd 0000:6a:00.0: Abort failed to stop command ring: -110
May 03 01:53:44 angel-thesis kernel: xhci_hcd 0000:6a:00.0: xHCI host controller not responding, assume dead
May 03 01:53:44 angel-thesis kernel: xhci_hcd 0000:6a:00.0: HC died; cleaning up
May 03 01:53:44 angel-thesis kernel: xhci_hcd 0000:6a:00.0: Timeout while waiting for setup device command
May 03 01:53:44 angel-thesis kernel: usb 7-2: USB disconnect, device number 11
May 03 01:53:44 angel-thesis kernel: usb 7-2.3: USB disconnect, device number 12
May 03 01:53:44 angel-thesis kernel: usb 7-2.3.1: USB disconnect, device number 14
May 03 01:53:44 angel-thesis kernel: usb 7-2.3.2: USB disconnect, device number 15
May 03 01:53:44 angel-thesis kernel: usb 7-2.5: USB disconnect, device number 13
May 03 01:53:44 angel-thesis kernel: usb 7-3: USB disconnect, device number 16
May 03 01:53:44 angel-thesis kernel: usb 7-3.1: USB disconnect, device number 17
May 03 01:53:44 angel-thesis kernel: usb 7-3.3: USB disconnect, device number 18
May 03 01:53:44 angel-thesis kernel: usb 7-3.4: USB disconnect, device number 19
May 03 01:53:44 angel-thesis kernel: usb 7-3.4.2: USB disconnect, device number 21
May 03 01:53:44 angel-thesis kernel: usb 7-3.5: USB disconnect, device number 20
May 03 01:53:44 angel-thesis kernel: usb 7-5: USB disconnect, device number 2
May 03 01:53:44 angel-thesis kernel: usb 7-5.2: USB disconnect, device number 4
May 03 01:53:44 angel-thesis kernel: usb 8-3: device not accepting address 5, error -62
May 03 01:53:44 angel-thesis kernel: usb 8-3: USB disconnect, device number 5
May 03 01:53:44 angel-thesis kernel: usb 8-3.4: USB disconnect, device number 6
May 03 01:53:44 angel-thesis kernel: usb usb8-port3: couldn't allocate usb_device
May 03 01:53:44 angel-thesis kernel: usb 8-2: USB disconnect, device number 4
May 03 01:53:44 angel-thesis kernel: usb 8-5: USB disconnect, device number 2
May 03 01:53:44 angel-thesis kernel: usb 7-7: USB disconnect, device number 3
May 03 01:53:44 angel-thesis kernel: usb 7-11: USB disconnect, device number 5
```

Here's unbind:

```
May 03 02:16:07 angel-thesis kernel: xhci_hcd 0000:6a:00.0: remove, state 1
May 03 02:16:07 angel-thesis kernel: usb usb8: USB disconnect, device number 1
May 03 02:16:07 angel-thesis kernel: xhci_hcd 0000:6a:00.0: USB bus 8 deregistered
May 03 02:16:07 angel-thesis kernel: xhci_hcd 0000:6a:00.0: remove, state 1
May 03 02:16:07 angel-thesis kernel: usb usb7: USB disconnect, device number 1
May 03 02:16:07 angel-thesis kernel: xhci_hcd 0000:6a:00.0: USB bus 7 deregistered
```

Here's bind:

```
May 03 02:16:12 angel-thesis kernel: xhci_hcd 0000:6a:00.0: xHCI Host Controller
May 03 02:16:12 angel-thesis kernel: xhci_hcd 0000:6a:00.0: new USB bus registered, assigned bus number 7
May 03 02:16:12 angel-thesis kernel: xhci_hcd 0000:6a:00.0: hcc params 0x0200ef81 hci version 0x110 quirks 0x0000000200000010
May 03 02:16:12 angel-thesis kernel: xhci_hcd 0000:6a:00.0: xHCI Host Controller
May 03 02:16:12 angel-thesis kernel: xhci_hcd 0000:6a:00.0: new USB bus registered, assigned bus number 8
May 03 02:16:12 angel-thesis kernel: xhci_hcd 0000:6a:00.0: Host supports USB 3.2 Enhanced SuperSpeed
May 03 02:16:12 angel-thesis kernel: usb usb7: New USB device found, idVendor=1d6b, idProduct=0002, bcdDevice= 6.14
May 03 02:16:12 angel-thesis kernel: usb usb7: New USB device strings: Mfr=3, Product=2, SerialNumber=1
May 03 02:16:12 angel-thesis kernel: usb usb7: Product: xHCI Host Controller
May 03 02:16:12 angel-thesis kernel: usb usb7: Manufacturer: Linux 6.14.4-103.bazzite.fc42.x86_64 xhci-hcd
May 03 02:16:12 angel-thesis kernel: usb usb7: SerialNumber: 0000:6a:00.0
May 03 02:16:12 angel-thesis kernel: usb usb8: We don't know the algorithms for LPM for this host, disabling LPM.
May 03 02:16:12 angel-thesis kernel: usb usb8: New USB device found, idVendor=1d6b, idProduct=0003, bcdDevice= 6.14
May 03 02:16:12 angel-thesis kernel: usb usb8: New USB device strings: Mfr=3, Product=2, SerialNumber=1
May 03 02:16:12 angel-thesis kernel: usb usb8: Product: xHCI Host Controller
May 03 02:16:12 angel-thesis kernel: usb usb8: Manufacturer: Linux 6.14.4-103.bazzite.fc42.x86_64 xhci-hcd
May 03 02:16:12 angel-thesis kernel: usb usb8: SerialNumber: 0000:6a:00.0
May 03 02:16:12 angel-thesis kernel: usb 7-2: new high-speed USB device number 2 using xhci_hcd
May 03 02:16:13 angel-thesis kernel: usb 7-2: New USB device found, idVendor=2109, idProduct=4817, bcdDevice= 1.73
May 03 02:16:13 angel-thesis kernel: usb 7-2: New USB device strings: Mfr=1, Product=2, SerialNumber=0
May 03 02:16:13 angel-thesis kernel: usb 7-2: Product: USB2.0 Hub
May 03 02:16:13 angel-thesis kernel: usb 7-2: Manufacturer: VIA Labs, Inc.
May 03 02:16:13 angel-thesis kernel: usb 8-2: new SuperSpeed USB device number 2 using xhci_hcd
May 03 02:16:13 angel-thesis kernel: usb 8-2: New USB device found, idVendor=2109, idProduct=3817, bcdDevice= 1.73
May 03 02:16:13 angel-thesis kernel: usb 8-2: New USB device strings: Mfr=1, Product=2, SerialNumber=0
May 03 02:16:13 angel-thesis kernel: usb 8-2: Product: USB3.0 Hub
May 03 02:16:13 angel-thesis kernel: usb 8-2: Manufacturer: VIA Labs, Inc.
May 03 02:16:13 angel-thesis kernel: usb 7-3: new high-speed USB device number 3 using xhci_hcd
May 03 02:16:13 angel-thesis kernel: usb 7-3: New USB device found, idVendor=2109, idProduct=2817, bcdDevice= 7.74
May 03 02:16:14 angel-thesis kernel: usb 7-3: New USB device strings: Mfr=1, Product=2, SerialNumber=3
May 03 02:16:14 angel-thesis kernel: usb 7-3: Product: USB2.0 Hub
May 03 02:16:14 angel-thesis kernel: usb 7-3: Manufacturer: VIA Labs, Inc.
May 03 02:16:14 angel-thesis kernel: usb 7-3: SerialNumber: 000000000
May 03 02:16:14 angel-thesis kernel: usb 8-3: new SuperSpeed USB device number 3 using xhci_hcd
May 03 02:16:14 angel-thesis kernel: usb 8-3: New USB device found, idVendor=2109, idProduct=0817, bcdDevice= 7.74
May 03 02:16:14 angel-thesis kernel: usb 8-3: New USB device strings: Mfr=1, Product=2, SerialNumber=3
May 03 02:16:14 angel-thesis kernel: usb 8-3: Product: USB3.0 Hub
May 03 02:16:14 angel-thesis kernel: usb 8-3: Manufacturer: VIA Labs, Inc.
May 03 02:16:14 angel-thesis kernel: usb 8-3: SerialNumber: 000000000
May 03 02:16:14 angel-thesis kernel: usb 7-5: new high-speed USB device number 4 using xhci_hcd
May 03 02:16:14 angel-thesis kernel: usb 7-5: New USB device found, idVendor=174c, idProduct=2074, bcdDevice= 0.01
May 03 02:16:14 angel-thesis kernel: usb 7-5: New USB device strings: Mfr=2, Product=3, SerialNumber=0
May 03 02:16:14 angel-thesis kernel: usb 7-5: Product: ASM107x
May 03 02:16:14 angel-thesis kernel: usb 7-5: Manufacturer: ASUS TEK.
May 03 02:16:14 angel-thesis kernel: usb 7-2.3: new high-speed USB device number 5 using xhci_hcd
May 03 02:16:14 angel-thesis kernel: usb 7-2.3: New USB device found, idVendor=1a40, idProduct=0801, bcdDevice= 1.00
May 03 02:16:14 angel-thesis kernel: usb 7-2.3: New USB device strings: Mfr=0, Product=1, SerialNumber=0
May 03 02:16:14 angel-thesis kernel: usb 7-2.3: Product: USB 2.0 Hub
May 03 02:16:15 angel-thesis kernel: usb 8-5: new SuperSpeed USB device number 4 using xhci_hcd
May 03 02:16:15 angel-thesis kernel: usb 8-5: New USB device found, idVendor=174c, idProduct=3074, bcdDevice= 0.01
May 03 02:16:15 angel-thesis kernel: usb 8-5: New USB device strings: Mfr=2, Product=3, SerialNumber=0
May 03 02:16:15 angel-thesis kernel: usb 8-5: Product: ASM107x
May 03 02:16:15 angel-thesis kernel: usb 8-5: Manufacturer: ASUS TEK.
May 03 02:16:15 angel-thesis kernel: usb 7-7: new high-speed USB device number 6 using xhci_hcd
May 03 02:16:15 angel-thesis kernel: usb 7-7: New USB device found, idVendor=1e71, idProduct=300e, bcdDevice= 2.00
May 03 02:16:15 angel-thesis kernel: usb 7-7: New USB device strings: Mfr=1, Product=2, SerialNumber=3
May 03 02:16:15 angel-thesis kernel: usb 7-7: Product: NZXT Kraken Base
May 03 02:16:15 angel-thesis kernel: usb 7-7: Manufacturer: NZXT Inc.
May 03 02:16:15 angel-thesis kernel: usb 7-7: SerialNumber: E7474745ABA22D589754E50CB2A69D11
May 03 02:16:15 angel-thesis kernel: nzxt_kraken3 0003:1E71:300E.0016: hidraw1: USB HID v1.11 Device [NZXT Inc. NZXT Kraken Base] on usb-0000:6a:00.0-7/input1
May 03 02:16:15 angel-thesis kernel: usb 7-3.1: new full-speed USB device number 7 using xhci_hcd
May 03 02:16:15 angel-thesis kernel: usb 8-3.4: new SuperSpeed USB device number 5 using xhci_hcd
May 03 02:16:15 angel-thesis kernel: usb 8-3.4: New USB device found, idVendor=2109, idProduct=0211, bcdDevice= 5.84
May 03 02:16:15 angel-thesis kernel: usb 8-3.4: New USB device strings: Mfr=1, Product=2, SerialNumber=0
May 03 02:16:15 angel-thesis kernel: usb 8-3.4: Product: USB3.0 Hub
May 03 02:16:15 angel-thesis kernel: usb 8-3.4: Manufacturer: VIA Labs, Inc.
May 03 02:16:15 angel-thesis kernel: usb 7-3.1: New USB device found, idVendor=046d, idProduct=c548, bcdDevice= 5.01
May 03 02:16:16 angel-thesis kernel: usb 7-3.1: New USB device strings: Mfr=1, Product=2, SerialNumber=0
May 03 02:16:16 angel-thesis kernel: usb 7-3.1: Product: USB Receiver
May 03 02:16:16 angel-thesis kernel: usb 7-3.1: Manufacturer: Logitech
May 03 02:16:16 angel-thesis kernel: input: Logitech USB Receiver as /devices/pci0000:00/0000:00:02.1/0000:05:00.0/0000:06:0c.0/0000:6a:00.0/usb7/7-3/7-3.1/7-3.1:1.0/0003:046D:C548.0017/input/input26
May 03 02:16:16 angel-thesis kernel: usb 7-2.5: new high-speed USB device number 8 using xhci_hcd
May 03 02:16:16 angel-thesis kernel: hid-generic 0003:046D:C548.0017: input,hidraw2: USB HID v1.11 Keyboard [Logitech USB Receiver] on usb-0000:6a:00.0-3.1/input0
May 03 02:16:16 angel-thesis kernel: usb 7-2.5: New USB device found, idVendor=2109, idProduct=8817, bcdDevice= 0.01
May 03 02:16:16 angel-thesis kernel: usb 7-2.5: New USB device strings: Mfr=1, Product=2, SerialNumber=3
May 03 02:16:16 angel-thesis kernel: usb 7-2.5: Product: USB Billboard Device
May 03 02:16:16 angel-thesis kernel: usb 7-2.5: Manufacturer: VIA Labs, Inc.
May 03 02:16:16 angel-thesis kernel: usb 7-2.5: SerialNumber: 0000000000000001
May 03 02:16:16 angel-thesis kernel: input: Logitech USB Receiver Mouse as /devices/pci0000:00/0000:00:02.1/0000:05:00.0/0000:06:0c.0/0000:6a:00.0/usb7/7-3/7-3.1/7-3.1:1.1/0003:046D:C548.0018/input/input27
May 03 02:16:16 angel-thesis kernel: input: Logitech USB Receiver Consumer Control as /devices/pci0000:00/0000:00:02.1/0000:05:00.0/0000:06:0c.0/0000:6a:00.0/usb7/7-3/7-3.1/7-3.1:1.1/0003:046D:C548.0018/input/input28
May 03 02:16:16 angel-thesis kernel: input: Logitech USB Receiver System Control as /devices/pci0000:00/0000:00:02.1/0000:05:00.0/0000:06:0c.0/0000:6a:00.0/usb7/7-3/7-3.1/7-3.1:1.1/0003:046D:C548.0018/input/input29
May 03 02:16:16 angel-thesis kernel: hid-generic 0003:046D:C548.0018: input,hidraw3: USB HID v1.11 Mouse [Logitech USB Receiver] on usb-0000:6a:00.0-3.1/input1
May 03 02:16:16 angel-thesis kernel: usb 7-5.2: new high-speed USB device number 9 using xhci_hcd
May 03 02:16:16 angel-thesis kernel: hid-generic 0003:046D:C548.0019: hiddev97,hidraw4: USB HID v1.11 Device [Logitech USB Receiver] on usb-0000:6a:00.0-3.1/input2
May 03 02:16:16 angel-thesis kernel: usb 7-5.2: New USB device found, idVendor=046d, idProduct=c54d, bcdDevice=14.02
May 03 02:16:16 angel-thesis kernel: usb 7-5.2: New USB device strings: Mfr=1, Product=2, SerialNumber=3
May 03 02:16:16 angel-thesis kernel: usb 7-5.2: Product: USB Receiver
May 03 02:16:16 angel-thesis kernel: usb 7-5.2: Manufacturer: Logitech
May 03 02:16:16 angel-thesis kernel: usb 7-5.2: SerialNumber: 327534813432
May 03 02:16:16 angel-thesis kernel: input: Logitech USB Receiver as /devices/pci0000:00/0000:00:02.1/0000:05:00.0/0000:06:0c.0/0000:6a:00.0/usb7/7-5/7-5.2/7-5.2:1.0/0003:046D:C54D.001A/input/input30
May 03 02:16:16 angel-thesis kernel: hid-generic 0003:046D:C54D.001A: input,hidraw5: USB HID v1.11 Mouse [Logitech USB Receiver] on usb-0000:6a:00.0-5.2/input0
May 03 02:16:16 angel-thesis kernel: usb 7-2.3.1: new full-speed USB device number 10 using xhci_hcd
May 03 02:16:16 angel-thesis kernel: input: Logitech USB Receiver Keyboard as /devices/pci0000:00/0000:00:02.1/0000:05:00.0/0000:06:0c.0/0000:6a:00.0/usb7/7-5/7-5.2/7-5.2:1.1/0003:046D:C54D.001B/input/input31
May 03 02:16:16 angel-thesis kernel: hid-generic 0003:046D:C54D.001B: input,hidraw7: USB HID v1.11 Keyboard [Logitech USB Receiver] on usb-0000:6a:00.0-5.2/input1
May 03 02:16:16 angel-thesis kernel: hid-generic 0003:046D:C54D.001C: hiddev98,hidraw8: USB HID v1.11 Device [Logitech USB Receiver] on usb-0000:6a:00.0-5.2/input2
May 03 02:16:16 angel-thesis kernel: usb 7-2.3.1: config 1 has an invalid interface number: 5 but max is 3
May 03 02:16:16 angel-thesis kernel: usb 7-2.3.1: config 1 has no interface number 3
May 03 02:16:16 angel-thesis kernel: usb 7-2.3.1: New USB device found, idVendor=19f7, idProduct=004e, bcdDevice= 1.54
May 03 02:16:16 angel-thesis kernel: usb 7-2.3.1: New USB device strings: Mfr=1, Product=2, SerialNumber=3
May 03 02:16:16 angel-thesis kernel: usb 7-2.3.1: Product: RØDECaster Duo
May 03 02:16:16 angel-thesis kernel: usb 7-2.3.1: Manufacturer: RØDE
May 03 02:16:16 angel-thesis kernel: usb 7-11: new full-speed USB device number 11 using xhci_hcd
May 03 02:16:16 angel-thesis kernel: usb 7-2.3.1: 1:1: cannot get freq at ep 0x82
May 03 02:16:16 angel-thesis kernel: usb 7-2.3.1: 1:2: cannot get freq at ep 0x82
May 03 02:16:16 angel-thesis kernel: usb 7-2.3.1: 2:1: cannot get freq at ep 0x2
May 03 02:16:17 angel-thesis kernel: usb 7-11: config 1 has an invalid interface number: 2 but max is 1
May 03 02:16:17 angel-thesis kernel: usb 7-11: config 1 has no interface number 1
May 03 02:16:17 angel-thesis kernel: usb 7-11: New USB device found, idVendor=0b05, idProduct=18f3, bcdDevice= 1.00
May 03 02:16:17 angel-thesis kernel: usb 7-11: New USB device strings: Mfr=1, Product=2, SerialNumber=3
May 03 02:16:17 angel-thesis kernel: usb 7-11: Product: AURA LED Controller
May 03 02:16:17 angel-thesis kernel: usb 7-11: Manufacturer: AsusTek Computer Inc.
May 03 02:16:17 angel-thesis kernel: usb 7-11: SerialNumber: 9876543210
May 03 02:16:17 angel-thesis kernel: hid-generic 0003:0B05:18F3.001D: hiddev99,hidraw9: USB HID v1.11 Device [AsusTek Computer Inc. AURA LED Controller] on usb-0000:6a:00.0-11/input2
May 03 02:16:17 angel-thesis mtp-probe[94519]: checking bus 7, device 11: "/sys/devices/pci0000:00/0000:00:02.1/0000:05:00.0/0000:06:0c.0/0000:6a:00.0/usb7/7-11"
May 03 02:16:17 angel-thesis mtp-probe[94520]: checking bus 7, device 6: "/sys/devices/pci0000:00/0000:00:02.1/0000:05:00.0/0000:06:0c.0/0000:6a:00.0/usb7/7-7"
May 03 02:16:17 angel-thesis kernel: usb 7-3.3: new high-speed USB device number 12 using xhci_hcd
May 03 02:16:17 angel-thesis mtp-probe[94537]: checking bus 7, device 9: "/sys/devices/pci0000:00/0000:00:02.1/0000:05:00.0/0000:06:0c.0/0000:6a:00.0/usb7/7-5/7-5.2"
May 03 02:16:17 angel-thesis mtp-probe[94545]: checking bus 7, device 6: "/sys/devices/pci0000:00/0000:00:02.1/0000:05:00.0/0000:06:0c.0/0000:6a:00.0/usb7/7-7"
May 03 02:16:17 angel-thesis kernel: usb 7-3.3: New USB device found, idVendor=2e1a, idProduct=4c03, bcdDevice= 2.00
May 03 02:16:17 angel-thesis kernel: usb 7-3.3: New USB device strings: Mfr=1, Product=2, SerialNumber=0
May 03 02:16:17 angel-thesis kernel: usb 7-3.3: Product: Insta360 Link 2C
May 03 02:16:17 angel-thesis kernel: usb 7-3.3: Manufacturer: Insta360
May 03 02:16:17 angel-thesis kernel: usb 7-3.3: Found UVC 1.10 device Insta360 Link 2C (2e1a:4c03)
May 03 02:16:17 angel-thesis kernel: usb 7-2.3.2: new full-speed USB device number 13 using xhci_hcd
May 03 02:16:17 angel-thesis kernel: usb 7-3.3: Warning! Unlikely big volume range (=32767), cval->res is probably wrong.
May 03 02:16:17 angel-thesis kernel: usb 7-3.3: [9] FU [Mic Capture Volume] ch = 1, val = -32768/-1/1
May 03 02:16:17 angel-thesis kernel: usb 7-2.3.2: New USB device found, idVendor=31e3, idProduct=1322, bcdDevice= 2.30
May 03 02:16:18 angel-thesis kernel: usb 7-2.3.2: New USB device strings: Mfr=1, Product=2, SerialNumber=3
May 03 02:16:18 angel-thesis kernel: usb 7-2.3.2: Product: Wooting 60HE+
May 03 02:16:18 angel-thesis kernel: usb 7-2.3.2: Manufacturer: Wooting
May 03 02:16:18 angel-thesis kernel: usb 7-2.3.2: SerialNumber: A02B2422W05T01100S02H23388
May 03 02:16:18 angel-thesis kernel: usb 7-3.4: new high-speed USB device number 14 using xhci_hcd
May 03 02:16:18 angel-thesis kernel: hid-generic 0003:31E3:1322.001E: hiddev101,hidraw10: USB HID v1.11 Device [Wooting Wooting 60HE+] on usb-0000:6a:00.0-2.3.2/input0
May 03 02:16:18 angel-thesis kernel: input: Wooting Wooting 60HE+ as /devices/pci0000:00/0000:00:02.1/0000:05:00.0/0000:06:0c.0/0000:6a:00.0/usb7/7-2/7-2.3/7-2.3.2/7-2.3.2:1.1/0003:31E3:1322.001F/input/input32
May 03 02:16:18 angel-thesis kernel: hid-generic 0003:31E3:1322.001F: input,hidraw11: USB HID v1.11 Keyboard [Wooting Wooting 60HE+] on usb-0000:6a:00.0-2.3.2/input1
May 03 02:16:18 angel-thesis kernel: usb 7-3.4: New USB device found, idVendor=2109, idProduct=2211, bcdDevice= 5.84
May 03 02:16:18 angel-thesis kernel: usb 7-3.4: New USB device strings: Mfr=1, Product=2, SerialNumber=0
May 03 02:16:18 angel-thesis kernel: usb 7-3.4: Product: USB2.0 Hub
May 03 02:16:18 angel-thesis kernel: usb 7-3.4: Manufacturer: VIA Labs, Inc.
May 03 02:16:18 angel-thesis kernel: hid-generic 0003:31E3:1322.0020: hiddev102,hidraw12: USB HID v1.11 Device [Wooting Wooting 60HE+] on usb-0000:6a:00.0-2.3.2/input2
May 03 02:16:18 angel-thesis kernel: input: Wooting Wooting 60HE+ System Control as /devices/pci0000:00/0000:00:02.1/0000:05:00.0/0000:06:0c.0/0000:6a:00.0/usb7/7-2/7-2.3/7-2.3.2/7-2.3.2:1.3/0003:31E3:1322.0021/input/input33
May 03 02:16:18 angel-thesis kernel: input: Wooting Wooting 60HE+ Consumer Control as /devices/pci0000:00/0000:00:02.1/0000:05:00.0/0000:06:0c.0/0000:6a:00.0/usb7/7-2/7-2.3/7-2.3.2/7-2.3.2:1.3/0003:31E3:1322.0021/input/input34
May 03 02:16:18 angel-thesis kernel: input: Wooting Wooting 60HE+ Mouse as /devices/pci0000:00/0000:00:02.1/0000:05:00.0/0000:06:0c.0/0000:6a:00.0/usb7/7-2/7-2.3/7-2.3.2/7-2.3.2:1.3/0003:31E3:1322.0021/input/input35
May 03 02:16:18 angel-thesis kernel: hid-generic 0003:31E3:1322.0021: input,hidraw13: USB HID v1.11 Mouse [Wooting Wooting 60HE+] on usb-0000:6a:00.0-2.3.2/input3
May 03 02:16:18 angel-thesis kernel: hid-generic 0003:31E3:1322.0022: hiddev103,hidraw14: USB HID v1.11 Device [Wooting Wooting 60HE+] on usb-0000:6a:00.0-2.3.2/input4
May 03 02:16:18 angel-thesis kernel: usb 7-3.5: new high-speed USB device number 15 using xhci_hcd
May 03 02:16:18 angel-thesis mtp-probe[94588]: checking bus 7, device 10: "/sys/devices/pci0000:00/0000:00:02.1/0000:05:00.0/0000:06:0c.0/0000:6a:00.0/usb7/7-2/7-2.3/7-2.3.1"
May 03 02:16:18 angel-thesis mtp-probe[94589]: checking bus 7, device 13: "/sys/devices/pci0000:00/0000:00:02.1/0000:05:00.0/0000:06:0c.0/0000:6a:00.0/usb7/7-2/7-2.3/7-2.3.2"
May 03 02:16:18 angel-thesis mtp-probe[94633]: checking bus 7, device 10: "/sys/devices/pci0000:00/0000:00:02.1/0000:05:00.0/0000:06:0c.0/0000:6a:00.0/usb7/7-2/7-2.3/7-2.3.1"
May 03 02:16:18 angel-thesis kernel: usb 7-3.5: New USB device found, idVendor=2109, idProduct=8884, bcdDevice= 0.01
May 03 02:16:18 angel-thesis kernel: usb 7-3.5: New USB device strings: Mfr=1, Product=2, SerialNumber=3
May 03 02:16:18 angel-thesis kernel: usb 7-3.5: Product: USB Billboard Device
May 03 02:16:18 angel-thesis kernel: usb 7-3.5: Manufacturer: VIA Labs, Inc.
May 03 02:16:18 angel-thesis kernel: usb 7-3.5: SerialNumber: 0000000000000001
May 03 02:16:18 angel-thesis mtp-probe[94639]: checking bus 7, device 7: "/sys/devices/pci0000:00/0000:00:02.1/0000:05:00.0/0000:06:0c.0/0000:6a:00.0/usb7/7-3/7-3.1"
May 03 02:16:18 angel-thesis mtp-probe[94640]: checking bus 7, device 12: "/sys/devices/pci0000:00/0000:00:02.1/0000:05:00.0/0000:06:0c.0/0000:6a:00.0/usb7/7-3/7-3.3"
May 03 02:16:18 angel-thesis mtp-probe[94679]: checking bus 7, device 12: "/sys/devices/pci0000:00/0000:00:02.1/0000:05:00.0/0000:06:0c.0/0000:6a:00.0/usb7/7-3/7-3.3"
May 03 02:16:18 angel-thesis mtp-probe[94686]: checking bus 7, device 9: "/sys/devices/pci0000:00/0000:00:02.1/0000:05:00.0/0000:06:0c.0/0000:6a:00.0/usb7/7-5/7-5.2"
May 03 02:16:18 angel-thesis mtp-probe[94687]: checking bus 7, device 11: "/sys/devices/pci0000:00/0000:00:02.1/0000:05:00.0/0000:06:0c.0/0000:6a:00.0/usb7/7-11"
```
Great. By the way, did you suspend, reboot or unbind xhci_hcd after taking the working-system debugfs dumps and before it died?

Unfortunately, the "dead" dump is missing information about connected devices, because they got dropped after "HC died".

Below are the final commands executed by the HC. Their cycle bits are all set ("flags C") and there is no evidence of Stop Endpoint retries anywhere in the whole command ring dump, so this is not any of the known or obviously suspected problems, but something new and weird.

Stop Ring Command: slot 19 sp 0 ep 3 flags C
Set TR Dequeue Pointer Command: deq 00000000ffead3c0 stream 0 slot 19 ep 3 flags C
Stop Ring Command: slot 19 sp 1 ep 1 flags C
Stop Ring Command: slot 19 sp 0 ep 3 flags C
Set TR Dequeue Pointer Command: deq 00000000ffead3d0 stream 0 slot 19 ep 3 flags C
Stop Ring Command: slot 19 sp 1 ep 1 flags C
Stop Ring Command: slot 6 sp 0 ep 5 flags C
Set TR Dequeue Pointer Command: deq 00000000fffddbc1 stream 0 slot 6 ep 5 flags C
Stop Ring Command: slot 6 sp 0 ep 5 flags C
Set TR Dequeue Pointer Command: deq 00000000fffddbd1 stream 0 slot 6 ep 5 flags C
Stop Ring Command: slot 6 sp 0 ep 5 flags C
Set TR Dequeue Pointer Command: deq 00000000fffddbe1 stream 0 slot 6 ep 5 flags C
Stop Ring Command: slot 6 sp 0 ep 5 flags C
Set TR Dequeue Pointer Command: deq 00000000fffddbf1 stream 0 slot 6 ep 5 flags C
Stop Ring Command: slot 6 sp 0 ep 5 flags C
Set TR Dequeue Pointer Command: deq 00000000fffddc01 stream 0 slot 6 ep 5 flags C
Stop Ring Command: slot 6 sp 0 ep 5 flags C
Set TR Dequeue Pointer Command: deq 00000000fffddc11 stream 0 slot 6 ep 5 flags C
Stop Ring Command: slot 19 sp 0 ep 1 flags C
Reset Device Command: slot 16 flags C
Reset Device Command: slot 19 flags C
Reset Device Command: slot 19 flags C
Address Device Command: ctx 00000000fff42000 slot 19 flags b:C
Disable Slot Command: slot 19 flags C
Enable Slot Command: flags C
Address Device Command: ctx 00000000fff42000 slot 20 flags b:C
Stop Ring Command: slot 6 sp 0 ep 5 flags C

Initially, we see the familiar pattern of canceling a pending transfer on slot 19 ep 3 and stopping slot 19 ep 1 (the control endpoint) with "sp 1", which is a hint that the device will be suspended. This is probably the 8-3 hub again.

Then there is some action on slot 6 ep 5, which I don't understand because information about devices is not available. In the earlier debugfs dump from a working system, slot 6 was the ASM107x hub, but endpoint id 5 was *not* enabled on it, so that makes no sense.

Things begin to get unusual now: a Stop Endpoint on slot 19 ep 1 with "sp 0", then some devices are reset. The last two commands fail to complete and the HC hangs when the driver tries to abort them.

Looking at the event ring, the "unknown event type 4" actually points to the Address Device command for slot 20, so maybe the HC completed this command (but was already fubared enough to produce a corrupted event) and then got stuck for real on the final command for slot 6.

But it was already fubared at this point, so something went wrong with those resets or it was the slot 6 ep 5 churn which broke it. That looks like repeatedly canceling a pending transfer before it completes and then resubmitting a similar transfer, and IME such a pattern can break "ass media" HCs if they repeat fast enough... (no timestamps here, unfortunately).

Not entirely sure what to think about it yet; I will take a closer look at the whole event ring later.
Some extra info:
- This time the computer was idle when it happened. I wasn't even near the PC, but I noticed something odd had happened, so I checked and saw the controller was dead.
- Running usbreset on the controller and `yavta` in a loop for about one hour did not trigger any USB crash, so I doubt it is related to heavy workloads on the controller.
- AFAIK the ASM107x device is disabled in EFI.

Regarding the different slots, I'm not sure whether the device order/attachment can change on unbind/bind (I believe so) or only on a new boot. Here's the output of lspci and lsusb from the current boot (the same boot as the crash):

```
❯ lspci -nn | grep USB
03:00.2 USB controller [0c03]: Advanced Micro Devices, Inc. [AMD/ATI] Navi 31 USB [1002:7446]
0f:00.0 USB controller [0c03]: Intel Corporation Thunderbolt 4 NHI [Maple Ridge 4C 2020] [8086:1137]
3b:00.0 USB controller [0c03]: Intel Corporation Thunderbolt 4 USB Controller [Maple Ridge 4C 2020] [8086:1138]
68:00.0 USB controller [0c03]: Advanced Micro Devices, Inc. [AMD] 600 Series Chipset USB 3.2 Controller [1022:43f7] (rev 01)
6a:00.0 USB controller [0c03]: Advanced Micro Devices, Inc. [AMD] 600 Series Chipset USB 3.2 Controller [1022:43f7] (rev 01)
6c:00.3 USB controller [0c03]: Advanced Micro Devices, Inc. [AMD] Raphael/Granite Ridge USB 3.1 xHCI [1022:15b6]
6c:00.4 USB controller [0c03]: Advanced Micro Devices, Inc. [AMD] Raphael/Granite Ridge USB 3.1 xHCI [1022:15b7]
6d:00.0 USB controller [0c03]: Advanced Micro Devices, Inc. [AMD] Raphael/Granite Ridge USB 2.0 xHCI [1022:15b8]
❯ lsusb
Bus 001 Device 001: ID 1d6b:0002 Linux Foundation 2.0 root hub
Bus 002 Device 001: ID 1d6b:0003 Linux Foundation 3.0 root hub
Bus 003 Device 001: ID 1d6b:0002 Linux Foundation 2.0 root hub
Bus 004 Device 001: ID 1d6b:0003 Linux Foundation 3.0 root hub
Bus 005 Device 001: ID 1d6b:0002 Linux Foundation 2.0 root hub
Bus 005 Device 002: ID 19f7:0050 RODE Microphones RODECaster Duo
Bus 005 Device 003: ID 8087:0032 Intel Corp. AX210 Bluetooth
Bus 005 Device 004: ID 0b05:1a53 ASUSTek Computer, Inc. USB Audio
Bus 006 Device 001: ID 1d6b:0003 Linux Foundation 3.0 root hub
Bus 007 Device 001: ID 1d6b:0002 Linux Foundation 2.0 root hub
Bus 007 Device 004: ID 174c:2074 ASMedia Technology Inc. ASM1074 High-Speed hub
Bus 007 Device 006: ID 1e71:300e NZXT NZXT Kraken Base
Bus 007 Device 009: ID 046d:c54d Logitech, Inc. USB Receiver
Bus 007 Device 011: ID 0b05:18f3 ASUSTek Computer, Inc. AURA LED Controller
Bus 008 Device 001: ID 1d6b:0003 Linux Foundation 3.0 root hub
Bus 008 Device 004: ID 174c:3074 ASMedia Technology Inc. ASM1074 SuperSpeed hub
Bus 009 Device 001: ID 1d6b:0002 Linux Foundation 2.0 root hub
Bus 010 Device 001: ID 1d6b:0003 Linux Foundation 3.0 root hub
Bus 011 Device 001: ID 1d6b:0002 Linux Foundation 2.0 root hub
Bus 012 Device 001: ID 1d6b:0003 Linux Foundation 3.0 root hub
Bus 013 Device 001: ID 1d6b:0002 Linux Foundation 2.0 root hub
Bus 014 Device 001: ID 1d6b:0003 Linux Foundation 3.0 root hub
```

One more piece of info: I have a `udev` rule to prevent USB devices from waking my PC from sleep. I doubt it has anything to do with this, but it's worth sharing:

```
❯ bat /etc/udev/rules.d/99-usb-wakeup.rules
ACTION=="add", SUBSYSTEM=="usb", RUN+="/bin/sh -c 'echo disabled > /sys/bus/usb/devices/%k/power/wakeup'"
```

(Note that, AFAIK, the issue was happening even before this rule existed.)
> But it was already fubared at this point, so something went wrong with those
> resets or it was the slot 6 ep 5 churn which broke it. That looks like
> repeatedly canceling a pending transfer before it completes and then
> resubmitting a similar transfer, and IME such pattern can break "ass media"
> HCs if they repeat fast enough... (no timestamps here, unfortunately).

Is there anything else I could provide?
Another crash happened, under the same conditions (idle).

```
❯ lspci -nn | grep USB
03:00.2 USB controller [0c03]: Advanced Micro Devices, Inc. [AMD/ATI] Navi 31 USB [1002:7446]
0f:00.0 USB controller [0c03]: Intel Corporation Thunderbolt 4 NHI [Maple Ridge 4C 2020] [8086:1137]
3b:00.0 USB controller [0c03]: Intel Corporation Thunderbolt 4 USB Controller [Maple Ridge 4C 2020] [8086:1138]
68:00.0 USB controller [0c03]: Advanced Micro Devices, Inc. [AMD] 600 Series Chipset USB 3.2 Controller [1022:43f7] (rev 01)
6a:00.0 USB controller [0c03]: Advanced Micro Devices, Inc. [AMD] 600 Series Chipset USB 3.2 Controller [1022:43f7] (rev 01)
6c:00.3 USB controller [0c03]: Advanced Micro Devices, Inc. [AMD] Raphael/Granite Ridge USB 3.1 xHCI [1022:15b6]
6c:00.4 USB controller [0c03]: Advanced Micro Devices, Inc. [AMD] Raphael/Granite Ridge USB 3.1 xHCI [1022:15b7]
6d:00.0 USB controller [0c03]: Advanced Micro Devices, Inc. [AMD] Raphael/Granite Ridge USB 2.0 xHCI [1022:15b8]
❯ lsusb
Bus 001 Device 001: ID 1d6b:0002 Linux Foundation 2.0 root hub
Bus 002 Device 001: ID 1d6b:0003 Linux Foundation 3.0 root hub
Bus 003 Device 001: ID 1d6b:0002 Linux Foundation 2.0 root hub
Bus 004 Device 001: ID 1d6b:0003 Linux Foundation 3.0 root hub
Bus 005 Device 001: ID 1d6b:0002 Linux Foundation 2.0 root hub
Bus 006 Device 001: ID 1d6b:0003 Linux Foundation 3.0 root hub
Bus 007 Device 001: ID 1d6b:0002 Linux Foundation 2.0 root hub
Bus 007 Device 002: ID 2109:4817 VIA Labs, Inc. USB2.0 Hub
Bus 007 Device 003: ID 2109:2817 VIA Labs, Inc. USB2.0 Hub
Bus 007 Device 004: ID 174c:2074 ASMedia Technology Inc. ASM1074 High-Speed hub
Bus 007 Device 005: ID 1a40:0801 Terminus Technology Inc. USB 2.0 Hub
Bus 007 Device 006: ID 1e71:300e NZXT NZXT Kraken Base
Bus 007 Device 007: ID 046d:c548 Logitech, Inc. Logi Bolt Receiver
Bus 007 Device 008: ID 2109:8817 VIA Labs, Inc. USB Billboard Device
Bus 007 Device 009: ID 046d:c54d Logitech, Inc. USB Receiver
Bus 007 Device 010: ID 19f7:004e RODE Microphones RØDECaster Duo
Bus 007 Device 011: ID 0b05:18f3 ASUSTek Computer, Inc. AURA LED Controller
Bus 007 Device 012: ID 2e1a:4c03 Insta360 Insta360 Link 2C
Bus 007 Device 013: ID 31e3:1322 Wooting Wooting 60HE+
Bus 007 Device 014: ID 2109:2211 VIA Labs, Inc. USB2.0 Hub
Bus 007 Device 015: ID 2109:8884 VIA Labs, Inc. USB Billboard Device
Bus 007 Device 016: ID 0cf2:a201 ENE Technology, Inc. 6K7732
Bus 008 Device 001: ID 1d6b:0003 Linux Foundation 3.0 root hub
Bus 008 Device 002: ID 2109:3817 VIA Labs, Inc. USB3.0 Hub
Bus 008 Device 003: ID 2109:0817 VIA Labs, Inc. USB3.0 Hub
Bus 008 Device 004: ID 174c:3074 ASMedia Technology Inc. ASM1074 SuperSpeed hub
Bus 008 Device 005: ID 2109:0211 VIA Labs, Inc. USB3.0 Hub
Bus 009 Device 001: ID 1d6b:0002 Linux Foundation 2.0 root hub
Bus 010 Device 001: ID 1d6b:0003 Linux Foundation 3.0 root hub
Bus 011 Device 001: ID 1d6b:0002 Linux Foundation 2.0 root hub
Bus 012 Device 001: ID 1d6b:0003 Linux Foundation 3.0 root hub
Bus 013 Device 001: ID 1d6b:0002 Linux Foundation 2.0 root hub
Bus 014 Device 001: ID 1d6b:0003 Linux Foundation 3.0 root hub
```

Interestingly enough, these last two crashes happened after a system update.
```
❯ rpm-ostree status
State: idle
Deployments:
● ostree-image-signed:docker://ghcr.io/ublue-os/bazzite:stable
    Digest: sha256:91f041cd775526e8fe11089a1ca3433224cb9b5d1580c3ceb50973ec558f6dd9
    Version: 42.20250430 (2025-05-01T19:32:10Z)
    LayeredPackages: android-tools
    LocalPackages: 1password-8.10.70-1.x86_64 sublime-text-4192-1.x86_64

  ostree-image-signed:docker://ghcr.io/ublue-os/bazzite:stable
    Digest: sha256:1a7ae28b95fde42b976cc9aa159219c0aaaa0611f7416f4b3b30284e292b0875
    Version: 42.20250417 (2025-04-17T07:35:37Z)
    LayeredPackages: android-tools
    LocalPackages: 1password-8.10.70-1.x86_64 sublime-text-4192-1.x86_64
```

I'm on:

```
❯ uname -r
6.14.4-103.bazzite.fc42.x86_64
```

Zip file: https://gist.github.com/ovflowd/0b0aa5c748683eca33909dc3ed7c66f7#file-debugfs-03-05-2025-zip
Created attachment 308078
collect xhc crash data and reset the thing

Could you run this script? It tries to capture debugfs after the controller hangs, but before the kernel "kills" it and forgets about all devices. It saves debugfs every few seconds. When all devices disappear, it packs the last saved copy, the current debugfs and dmesg into a tgz archive, then automatically resets the controller. Don't disconnect all devices by hand ;)

The outdir and tmpdir paths must point to existing directories. The PCI ID is passed as an argument; use lspci to check whether it's currently 0000:6a:00.0 or 0000:6b:00.0.

I think it's somehow related to the 7-3/8-3 hub (its USB 2.0 and 3.0 parts) and its children: either the 7-3.4/8-3.4 hub or the 7-3.3 webcam. Would it be possible to move one of those child devices somewhere else so that it ends up on a different bus? Perhaps bus 5, which is the first 600 series chipset controller. Then we could see whether the problem moves to the first 600 controller or stays on the second one.
> Would it be possible to move one of those child devices to another place so
> that it goes to a different bus?

The webcam and another USB dongle are connected to the monitor's USB hub, and that hub is connected directly to one of the rear USB ports. Do you want me to connect them to the other bus (5)? Note that bus 5 is also served by rear USB ports.

Also note that if the problem is 8-3, I don't think there's much I can do, as the ASM107x is an internal device on the motherboard.

The "6K7732" device (the only device on the 7-3.4 hub) is just a simple LED backlight with no other USB functionality whatsoever. The AURA LED Controller and the NZXT Kraken Base are also on bus 7, but they are connected directly to the motherboard, and I could remove them.

Anyhow, I can move the monitor hub to another rear USB port and also run your script.
Sorry, I'm not talking about those device numbers from lsusb - they can change on reboot, suspend or xhci_hcd rebind, they are not reliable. The only somewhat constant addressing scheme is the one in dmesg: 7-3 is a device on the 3rd port of bus 7, and if 7-3 is a hub then 7-3.4 is a device on the 4th port of this hub, and if 7-3.4 is also a hub, then 7-3.4.2 is its second port. This doesn't change until you actually move USB plugs around. So I'm referring to those addresses from dmesg. If hubs can't be separated then fine, start with moving the webcam away.
Noted. I will do the following:
- Update the system (I rolled back to 6.13.9, as 6.14.3/6.14.4 seem to be more unstable; the bug occurs consistently and far more often there)
- Mount debugfs
- Run your script and keep it running
- Wait for a crash and add the attachments here (I only just noticed I can add attachments here)
- Then move the webcam and the other devices to a different hub
- Wait for another crash
Created attachment 308082
Autopsy Result

A new crash happened; I added the autopsy!
Created attachment 308090
Autopsy Result (2)

A second crash happened, so here is another autopsy. After this crash I moved the monitor hub (webcam etc.) to bus 5.

```
Bus 001 Device 001: ID 1d6b:0002 Linux Foundation 2.0 root hub
Bus 002 Device 001: ID 1d6b:0003 Linux Foundation 3.0 root hub
Bus 003 Device 001: ID 1d6b:0002 Linux Foundation 2.0 root hub
Bus 004 Device 001: ID 1d6b:0003 Linux Foundation 3.0 root hub
Bus 005 Device 001: ID 1d6b:0002 Linux Foundation 2.0 root hub
Bus 005 Device 002: ID 19f7:0050 RODE Microphones RODECaster Duo
Bus 005 Device 003: ID 8087:0032 Intel Corp. AX210 Bluetooth
Bus 005 Device 004: ID 0b05:1a53 ASUSTek Computer, Inc. USB Audio
Bus 005 Device 005: ID 2109:2817 VIA Labs, Inc. USB2.0 Hub
Bus 005 Device 006: ID 046d:c548 Logitech, Inc. Logi Bolt Receiver
Bus 005 Device 007: ID 2e1a:4c03 Insta360 Insta360 Link 2C
Bus 005 Device 008: ID 2109:2211 VIA Labs, Inc. USB2.0 Hub
Bus 005 Device 009: ID 2109:8884 VIA Labs, Inc. USB Billboard Device
Bus 005 Device 010: ID 0cf2:a201 ENE Technology, Inc. 6K7732
Bus 006 Device 001: ID 1d6b:0003 Linux Foundation 3.0 root hub
Bus 006 Device 002: ID 2109:0817 VIA Labs, Inc. USB3.0 Hub
Bus 006 Device 003: ID 2109:0211 VIA Labs, Inc. USB3.0 Hub
Bus 007 Device 001: ID 1d6b:0002 Linux Foundation 2.0 root hub
Bus 007 Device 004: ID 174c:2074 ASMedia Technology Inc. ASM1074 High-Speed hub
Bus 007 Device 006: ID 1e71:300e NZXT NZXT Kraken Base
Bus 007 Device 009: ID 046d:c54d Logitech, Inc. USB Receiver
Bus 007 Device 011: ID 0b05:18f3 ASUSTek Computer, Inc. AURA LED Controller
Bus 007 Device 027: ID 2109:4817 VIA Labs, Inc. USB2.0 Hub
Bus 007 Device 028: ID 1a40:0801 Terminus Technology Inc. USB 2.0 Hub
Bus 007 Device 029: ID 2109:8817 VIA Labs, Inc. USB Billboard Device
Bus 007 Device 030: ID 19f7:004e RODE Microphones RØDECaster Duo
Bus 007 Device 031: ID 31e3:1322 Wooting Wooting 60HE+
Bus 008 Device 001: ID 1d6b:0003 Linux Foundation 3.0 root hub
Bus 008 Device 004: ID 174c:3074 ASMedia Technology Inc. ASM1074 SuperSpeed hub
Bus 008 Device 008: ID 2109:3817 VIA Labs, Inc. USB3.0 Hub
Bus 009 Device 001: ID 1d6b:0002 Linux Foundation 2.0 root hub
Bus 010 Device 001: ID 1d6b:0003 Linux Foundation 3.0 root hub
Bus 011 Device 001: ID 1d6b:0002 Linux Foundation 2.0 root hub
Bus 012 Device 001: ID 1d6b:0003 Linux Foundation 3.0 root hub
Bus 013 Device 001: ID 1d6b:0002 Linux Foundation 2.0 root hub
Bus 014 Device 001: ID 1d6b:0003 Linux Foundation 3.0 root hub
```
Looks like a third crash happened. Interestingly enough, the script did not detect it and did not issue an unbind/bind. Bus 5 got completely disconnected, while bus 7, which previously had the webcam, is still working. I did not have the autopsy script running for bus 5, so there are no logs, but I'll start it now and keep an eye on it.

```
00:00.0 Host bridge [0600]: Advanced Micro Devices, Inc. [AMD] Raphael/Granite Ridge Root Complex [1022:14d8]
00:00.2 IOMMU [0806]: Advanced Micro Devices, Inc. [AMD] Raphael/Granite Ridge IOMMU [1022:14d9]
00:01.0 Host bridge [0600]: Advanced Micro Devices, Inc. [AMD] Raphael/Granite Ridge Dummy Host Bridge [1022:14da]
00:01.1 PCI bridge [0604]: Advanced Micro Devices, Inc. [AMD] Raphael/Granite Ridge GPP Bridge [1022:14db]
00:01.2 PCI bridge [0604]: Advanced Micro Devices, Inc. [AMD] Raphael/Granite Ridge GPP Bridge [1022:14db]
00:02.0 Host bridge [0600]: Advanced Micro Devices, Inc. [AMD] Raphael/Granite Ridge Dummy Host Bridge [1022:14da]
00:02.1 PCI bridge [0604]: Advanced Micro Devices, Inc. [AMD] Raphael/Granite Ridge GPP Bridge [1022:14db]
00:03.0 Host bridge [0600]: Advanced Micro Devices, Inc. [AMD] Raphael/Granite Ridge Dummy Host Bridge [1022:14da]
00:04.0 Host bridge [0600]: Advanced Micro Devices, Inc. [AMD] Raphael/Granite Ridge Dummy Host Bridge [1022:14da]
00:08.0 Host bridge [0600]: Advanced Micro Devices, Inc. [AMD] Raphael/Granite Ridge Dummy Host Bridge [1022:14da]
00:08.1 PCI bridge [0604]: Advanced Micro Devices, Inc. [AMD] Raphael/Granite Ridge Internal GPP Bridge to Bus [C:A] [1022:14dd]
00:08.3 PCI bridge [0604]: Advanced Micro Devices, Inc. [AMD] Raphael/Granite Ridge Internal GPP Bridge to Bus [C:A] [1022:14dd]
00:14.0 SMBus [0c05]: Advanced Micro Devices, Inc. [AMD] FCH SMBus Controller [1022:790b] (rev 71)
00:14.3 ISA bridge [0601]: Advanced Micro Devices, Inc. [AMD] FCH LPC Bridge [1022:790e] (rev 51)
00:18.0 Host bridge [0600]: Advanced Micro Devices, Inc. [AMD] Raphael/Granite Ridge Data Fabric; Function 0 [1022:14e0]
00:18.1 Host bridge [0600]: Advanced Micro Devices, Inc. [AMD] Raphael/Granite Ridge Data Fabric; Function 1 [1022:14e1]
00:18.2 Host bridge [0600]: Advanced Micro Devices, Inc. [AMD] Raphael/Granite Ridge Data Fabric; Function 2 [1022:14e2]
00:18.3 Host bridge [0600]: Advanced Micro Devices, Inc. [AMD] Raphael/Granite Ridge Data Fabric; Function 3 [1022:14e3]
00:18.4 Host bridge [0600]: Advanced Micro Devices, Inc. [AMD] Raphael/Granite Ridge Data Fabric; Function 4 [1022:14e4]
00:18.5 Host bridge [0600]: Advanced Micro Devices, Inc. [AMD] Raphael/Granite Ridge Data Fabric; Function 5 [1022:14e5]
00:18.6 Host bridge [0600]: Advanced Micro Devices, Inc. [AMD] Raphael/Granite Ridge Data Fabric; Function 6 [1022:14e6]
00:18.7 Host bridge [0600]: Advanced Micro Devices, Inc. [AMD] Raphael/Granite Ridge Data Fabric; Function 7 [1022:14e7]
01:00.0 PCI bridge [0604]: Advanced Micro Devices, Inc. [AMD/ATI] Navi 10 XL Upstream Port of PCI Express Switch [1002:1478] (rev 10)
02:00.0 PCI bridge [0604]: Advanced Micro Devices, Inc. [AMD/ATI] Navi 10 XL Downstream Port of PCI Express Switch [1002:1479] (rev 10)
03:00.0 VGA compatible controller [0300]: Advanced Micro Devices, Inc. [AMD/ATI] Navi 31 [Radeon RX 7900 XT/7900 XTX/7900 GRE/7900M] [1002:744c] (rev c8)
03:00.1 Audio device [0403]: Advanced Micro Devices, Inc. [AMD/ATI] Navi 31 HDMI/DP Audio [1002:ab30]
03:00.2 USB controller [0c03]: Advanced Micro Devices, Inc. [AMD/ATI] Navi 31 USB [1002:7446]
03:00.3 Serial bus controller [0c80]: Advanced Micro Devices, Inc. [AMD/ATI] Device [1002:7444]
04:00.0 Non-Volatile memory controller [0108]: Samsung Electronics Co Ltd NVMe SSD Controller S4LV008[Pascal] [144d:a80c]
05:00.0 PCI bridge [0604]: Advanced Micro Devices, Inc. [AMD] 600 Series Chipset PCIe Switch Upstream Port [1022:43f4] (rev 01)
06:00.0 PCI bridge [0604]: Advanced Micro Devices, Inc. [AMD] 600 Series Chipset PCIe Switch Downstream Port [1022:43f5] (rev 01)
06:08.0 PCI bridge [0604]: Advanced Micro Devices, Inc. [AMD] 600 Series Chipset PCIe Switch Downstream Port [1022:43f5] (rev 01)
06:0c.0 PCI bridge [0604]: Advanced Micro Devices, Inc. [AMD] 600 Series Chipset PCIe Switch Downstream Port [1022:43f5] (rev 01)
06:0d.0 PCI bridge [0604]: Advanced Micro Devices, Inc. [AMD] 600 Series Chipset PCIe Switch Downstream Port [1022:43f5] (rev 01)
08:00.0 PCI bridge [0604]: Advanced Micro Devices, Inc. [AMD] 600 Series Chipset PCIe Switch Upstream Port [1022:43f4] (rev 01)
09:00.0 PCI bridge [0604]: Advanced Micro Devices, Inc. [AMD] 600 Series Chipset PCIe Switch Downstream Port [1022:43f5] (rev 01)
09:01.0 PCI bridge [0604]: Advanced Micro Devices, Inc. [AMD] 600 Series Chipset PCIe Switch Downstream Port [1022:43f5] (rev 01)
09:02.0 PCI bridge [0604]: Advanced Micro Devices, Inc. [AMD] 600 Series Chipset PCIe Switch Downstream Port [1022:43f5] (rev 01)
09:04.0 PCI bridge [0604]: Advanced Micro Devices, Inc. [AMD] 600 Series Chipset PCIe Switch Downstream Port [1022:43f5] (rev 01)
09:08.0 PCI bridge [0604]: Advanced Micro Devices, Inc. [AMD] 600 Series Chipset PCIe Switch Downstream Port [1022:43f5] (rev 01)
09:0c.0 PCI bridge [0604]: Advanced Micro Devices, Inc. [AMD] 600 Series Chipset PCIe Switch Downstream Port [1022:43f5] (rev 01)
09:0d.0 PCI bridge [0604]: Advanced Micro Devices, Inc. [AMD] 600 Series Chipset PCIe Switch Downstream Port [1022:43f5] (rev 01)
0a:00.0 Network controller [0280]: Intel Corporation Wi-Fi 6E(802.11ax) AX210/AX1675* 2x2 [Typhoon Peak] [8086:2725] (rev 1a)
0b:00.0 Ethernet controller [0200]: Intel Corporation Ethernet Controller I225-V [8086:15f3] (rev 03)
0d:00.0 PCI bridge [0604]: Intel Corporation Thunderbolt 4 Bridge [Maple Ridge 4C 2020] [8086:1136] (rev 02)
0e:00.0 PCI bridge [0604]: Intel Corporation Thunderbolt 4 Bridge [Maple Ridge 4C 2020] [8086:1136] (rev 02)
0e:01.0 PCI bridge [0604]: Intel Corporation Thunderbolt 4 Bridge [Maple Ridge 4C 2020] [8086:1136] (rev 02)
0e:02.0 PCI bridge [0604]: Intel Corporation Thunderbolt 4 Bridge [Maple Ridge 4C 2020] [8086:1136] (rev 02)
0e:03.0 PCI bridge [0604]: Intel Corporation Thunderbolt 4 Bridge [Maple Ridge 4C 2020] [8086:1136] (rev 02)
0f:00.0 USB controller [0c03]: Intel Corporation Thunderbolt 4 NHI [Maple Ridge 4C 2020] [8086:1137]
3b:00.0 USB controller [0c03]: Intel Corporation Thunderbolt 4 USB Controller [Maple Ridge 4C 2020] [8086:1138]
68:00.0 USB controller [0c03]: Advanced Micro Devices, Inc. [AMD] 600 Series Chipset USB 3.2 Controller [1022:43f7] (rev 01)
69:00.0 SATA controller [0106]: Advanced Micro Devices, Inc. [AMD] 600 Series Chipset SATA Controller [1022:43f6] (rev 01)
6a:00.0 USB controller [0c03]: Advanced Micro Devices, Inc. [AMD] 600 Series Chipset USB 3.2 Controller [1022:43f7] (rev 01)
6b:00.0 SATA controller [0106]: Advanced Micro Devices, Inc. [AMD] 600 Series Chipset SATA Controller [1022:43f6] (rev 01)
6c:00.0 VGA compatible controller [0300]: Advanced Micro Devices, Inc. [AMD/ATI] Granite Ridge [Radeon Graphics] [1002:13c0] (rev cb)
6c:00.1 Audio device [0403]: Advanced Micro Devices, Inc. [AMD/ATI] Rembrandt Radeon High Definition Audio Controller [1002:1640]
6c:00.2 Encryption controller [1080]: Advanced Micro Devices, Inc. [AMD] Family 19h PSP/CCP [1022:1649]
6c:00.3 USB controller [0c03]: Advanced Micro Devices, Inc. [AMD] Raphael/Granite Ridge USB 3.1 xHCI [1022:15b6]
6c:00.4 USB controller [0c03]: Advanced Micro Devices, Inc. [AMD] Raphael/Granite Ridge USB 3.1 xHCI [1022:15b7]
6d:00.0 USB controller [0c03]: Advanced Micro Devices, Inc. [AMD] Raphael/Granite Ridge USB 2.0 xHCI [1022:15b8]
```

```
Bus 001 Device 001: ID 1d6b:0002 Linux Foundation 2.0 root hub
Bus 002 Device 001: ID 1d6b:0003 Linux Foundation 3.0 root hub
Bus 003 Device 001: ID 1d6b:0002 Linux Foundation 2.0 root hub
Bus 004 Device 001: ID 1d6b:0003 Linux Foundation 3.0 root hub
Bus 005 Device 001: ID 1d6b:0002 Linux Foundation 2.0 root hub
Bus 005 Device 002: ID 19f7:0050 RODE Microphones RODECaster Duo
Bus 005 Device 003: ID 2109:2817 VIA Labs, Inc. USB2.0 Hub
Bus 005 Device 004: ID 8087:0032 Intel Corp. AX210 Bluetooth
Bus 005 Device 005: ID 046d:c548 Logitech, Inc. Logi Bolt Receiver
Bus 005 Device 006: ID 0b05:1a53 ASUSTek Computer, Inc. USB Audio
Bus 005 Device 009: ID 2109:2211 VIA Labs, Inc. USB2.0 Hub
Bus 005 Device 010: ID 2109:8884 VIA Labs, Inc. USB Billboard Device
Bus 005 Device 011: ID 0cf2:a201 ENE Technology, Inc. 6K7732
Bus 005 Device 012: ID 2e1a:4c03 Insta360 Insta360 Link 2C
Bus 006 Device 001: ID 1d6b:0003 Linux Foundation 3.0 root hub
Bus 006 Device 002: ID 2109:0817 VIA Labs, Inc. USB3.0 Hub
Bus 006 Device 003: ID 2109:0211 VIA Labs, Inc. USB3.0 Hub
Bus 007 Device 001: ID 1d6b:0002 Linux Foundation 2.0 root hub
Bus 007 Device 004: ID 174c:2074 ASMedia Technology Inc. ASM1074 High-Speed hub
Bus 007 Device 006: ID 1e71:300e NZXT NZXT Kraken Base
Bus 007 Device 009: ID 046d:c54d Logitech, Inc. USB Receiver
Bus 007 Device 011: ID 0b05:18f3 ASUSTek Computer, Inc. AURA LED Controller
Bus 007 Device 027: ID 2109:4817 VIA Labs, Inc. USB2.0 Hub
Bus 007 Device 028: ID 1a40:0801 Terminus Technology Inc. USB 2.0 Hub
Bus 007 Device 029: ID 2109:8817 VIA Labs, Inc. USB Billboard Device
Bus 007 Device 030: ID 19f7:004e RODE Microphones RØDECaster Duo
Bus 007 Device 031: ID 31e3:1322 Wooting Wooting 60HE+
Bus 008 Device 001: ID 1d6b:0003 Linux Foundation 3.0 root hub
Bus 008 Device 004: ID 174c:3074 ASMedia Technology Inc. ASM1074 SuperSpeed hub
Bus 008 Device 008: ID 2109:3817 VIA Labs, Inc. USB3.0 Hub
Bus 009 Device 001: ID 1d6b:0002 Linux Foundation 2.0 root hub
Bus 010 Device 001: ID 1d6b:0003 Linux Foundation 3.0 root hub
Bus 011 Device 001: ID 1d6b:0002 Linux Foundation 2.0 root hub
Bus 012 Device 001: ID 1d6b:0003 Linux Foundation 3.0 root hub
Bus 013 Device 001: ID 1d6b:0002 Linux Foundation 2.0 root hub
Bus 014 Device 001: ID 1d6b:0003 Linux Foundation 3.0 root hub
```

Out of curiosity: I found out my monitor has an interesting setting that "puts USB devices to sleep". I disabled it, and I also set the monitor's USB hub to USB 2.0 instead of 3.2. Something tells me this "USB sleep" setting could be related.
Restricting power management may reduce occurrences of this problem, maybe even to nothing or almost nothing, because things are breaking while resuming a hub from autosuspend. However, I don't think power management is the root cause or the only workload which could run into trouble. It's just the one which happens often enough to regularly run into trouble on your system.

I see a few things being wrong or suspicious here.

1. The "average TRB length" property of all Control Endpoint Contexts is set to zero, but it should be 8. This was recently reported as a violation of the xHCI spec with no known impact. It is probably irrelevant, but who knows; this should ideally be fixed for peace of mind. It can only be solved by a kernel patch.

2. The webcam driver regularly generates bursts of transfer cancellations, which is a known problem case. The command sequences and their completion events look OK, but IME ASMedia and Promontory can still behave weirdly under such conditions.

3. Sometimes, when an 8-3 hub resume happens some time after (or during) a burst of webcam driver activity, it fails. And it fails weirdly. dmesg suggests that some control transfer to confirm the resumed hub's status times out. When a Stop Endpoint command is issued to cancel that transfer, we suddenly get an event reporting a Transaction Error instead. The Endpoint Context may or may not indicate that the control endpoint is halted (which it should be after an error), yet the Stop Endpoint command succeeds (which it shouldn't). We may also get multiple "Stopped" events (we shouldn't) before the command's completion event. This is similar to misbehaviors I have seen due to problem #2 in the past.

4. Then an odd sequence of Reset Device commands is issued, probably courtesy of 2c31b05c63cf ("usb: hub: lack of clearing xHC resources"), which landed in 6.13.7 and 6.14+. I wouldn't be surprised if there are some spec violations in this commit, but I'm not sure whether they would be harmful in this case. We probably shouldn't be issuing Reset Device for the child 8-3.4 hub here. We also don't need to (and probably shouldn't) issue Reset Device for 8-3 twice.

5. Lastly, attempts to reset 8-3 don't improve the situation, and the hub driver tries to disable and enable the device slot and address the device again. It's hard to tell what the outcome of this command is, because it appears to generate a mangled completion event ("unknown event type <number>"). However, the event does contain a pointer to this command, so it is possible that the HC progresses to the next command.

6. Very lastly, all unrelated transfers stop happening. A Stop Endpoint command issued to the webcam endpoint is either never reached or hangs the HC for good.
I would suggest re-enabling power management and trying something else. If you moved the problematic hubs to bus 5, try moving the problematic webcam to bus 7.
- If bus 5 crashes, the problem is only the hubs and the webcam is harmless.
- If bus 7 crashes, the problem is the webcam.
- If neither bus crashes for a few days, it looks like it takes two to tango.

You could also try a quirk which seems to reduce some bizarre failures on my AMD B350 chipset. If your kernel has xhci_hcd and xhci_pci built as modules, run:

rmmod xhci_pci xhci_hcd ; modprobe xhci_hcd quirks=0x4000000 ; modprobe xhci_pci

If they are built into the kernel image, you need to figure out how to add xhci_hcd.quirks=0x4000000 to the kernel boot parameters in your distribution's bootloader config.
That's very useful (all the commentary)! By the way,

> Looks like a 3rd crash happened, interestingly enough, the script did not
> detect a 3rd crash and did not send an unbind/bind command; But the Bus (5)
> got completely disconnected, Bus (7) which previously had the webcam, now is
> still working. I did not have the autopsy command running for Bus (5) so no
> logs; But Ill start it now and keep an eye on it.

Ignore this; I facepalmed afterwards. Of course it would only unbind/bind and capture 6a, since that's the only ID I passed to the script. I modified the script to support multiple PCI IDs and to prefix the generated folders and dumps with the PCI ID.

> I would suggest to reenable power management, and try something else. If you
> moved the problematic hubs to bus 5, try to move the problematic webcam to
> bus 7.

Noted. I'll first keep it off for one more day to see whether a crash happens, then I'll re-enable it.

> If bus 5 crashes, the problem is only hubs and the webcam is harmless. If bus
> 7 crashes, the problem is the webcam. If neither bus crashes for a few days,
> it looks like it takes two to the tango.

I'll try these redistribution steps.

> If your kernel has xhci_hcd and xhci_pci built as modules, run:

`lsmod` doesn't seem to list either xhci_pci or xhci_hcd as modules.

> If they are built into the kernel image, you need to figure out how to add

That seems to be the case. Can I add this flag via GRUB?
(In reply to Claudio Wunder from comment #38)
> Seems the case. Can I add this flag via grub?

Yes. If you use GRUB, add it to the GRUB config like any other kernel parameter.
Just to double-check (unfortunately I'm not very familiar with GRUB): would `GRUB_CMDLINE_LINUX="quiet splash xhci_hcd.quirks=0x4000000"` be enough?
I don't use grub2, but it looks like you found the right place. You can confirm by checking the `/proc/cmdline` file or by running `dmesg | grep 'xhci.*quirks'`. If the quirk is applied, this line: `xhci_hcd 0000:6b:00.0: hcc params 0x0200ef81 hci version 0x110 quirks 0x0000000200000010` will become: `xhci_hcd 0000:6b:00.0: hcc params 0x0200ef81 hci version 0x110 quirks 0x0000000204000010` (the 0x4000000 bit gets added to the quirks value).
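To check the command line from a script, a small helper can pull the parameter out of `/proc/cmdline`. A sketch (the function name is hypothetical); note it only shows what the running kernel was booted with, not what the bootloader config will use next time:

```shell
#!/usr/bin/env bash
# Hypothetical helper: print the value of xhci_hcd.quirks from a kernel
# command line (defaults to /proc/cmdline). Returns 1 if it is absent.
cmdline_quirks() {
  local cmdline=${1:-$(cat /proc/cmdline)} tok
  local -a toks
  # read -ra splits on whitespace without glob expansion.
  read -ra toks <<< "$cmdline"
  for tok in "${toks[@]}"; do
    case $tok in
      xhci_hcd.quirks=*) echo "${tok#*=}"; return 0 ;;
    esac
  done
  return 1
}
# Example: cmdline_quirks || echo "quirk not on the command line"
```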
The Reset Device commands for two different slots, followed by an Address Device command, that Michal pointed out earlier could very well be the reason for the issue:

Reset Device Command: slot 16 flags C
Reset Device Command: slot 19 flags C
Reset Device Command: slot 19 flags C
Address Device Command: ctx 00000000fff42000 slot 19 flags b:C

A Reset Device command moves the slot (device) to the Default state (address 0), where only one device should be at a time. In the Default state, the device waits for an Address Device command to give it a unique address. Here we have both slots 16 and 19 in the Default state at the same time, and both devices will try to respond to the Address Device command. Section 4.5.3.4 "Default" of the xHCI specification has a note stating: "Software shall ensure that only one Device Slot is in the Default state at time, otherwise undefined behavior may occur." The commit 2b66ef84d0d2 ("usb: hub: lack of clearing xHC resource") that Michal pointed out earlier looks like a possible cause to me as well.
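To spot this pattern in longer traces, a rough heuristic is to track which slots got a Reset Device command without a later Address Device command. A sketch using awk, assuming trace lines formatted like the excerpt above; it ignores other commands (e.g. Disable Slot) that would also take a slot out of the Default state, so treat hits as hints only. The function name is made up.

```shell
#!/usr/bin/env bash
# Hypothetical heuristic: read xHCI command trace lines on stdin and report
# every Reset Device command that leaves more than one slot "pending", i.e.
# reset but not yet addressed (presumably all in the Default state).
check_default_slots() {
  awk '
    {
      # Find the number following the word "slot"; skip lines without one.
      slot = ""
      for (i = 1; i < NF; i++) if ($i == "slot") { slot = $(i + 1); break }
      if (slot == "") next
    }
    /Reset Device Command/ {
      pending[slot] = 1
      n = 0; list = ""
      for (s in pending) { n++; list = list " " s }
      if (n > 1) printf "line %d: slots in Default state:%s\n", NR, list
    }
    /Address Device Command/ { delete pending[slot] }
  '
}
# Example: grep -E "(Reset|Address) Device Command" trace.txt | check_default_slots
```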
Indeed, in that commit it looks like the call to hub_hc_release_resources() should be made _after_ the hcd->address0_mutex is acquired, not _before_. I didn't realize that the reset_device callback would put the slot or device into the default state. That's exactly what the mutex is intended to serialize.
Created attachment 308097
Autopsy 68

Attaching another crash on 68. I haven't moved the webcam yet. Should I move the webcam to another hub, or simply disconnect it to verify that the webcam is not related?
If you disconnect it from bus 5, you might as well connect it to bus 7 then. This way the two buses will be simultaneously testing two different configurations and time is saved ;)
FYI, I moved the webcam to a dedicated USB port (directly connected to a rear USB port), and so far (a few hours) I haven't noticed anything.
Still nothing. When I put the webcam back on the hub, the issue comes back. I will keep the webcam on a dedicated port for a couple more hours to see if anything at all changes.
Did you make the change I recommended in comment #43? That is, in drivers/usb/core/hub.c, in the usb_reset_and_verify_device() routine, move the `if (udev->reset_resume) hub_hc_release_resources(udev);` code _below_ the following `mutex_lock(hcd->address0_mutex);` line.
Hmm, I assume I'd need to build the kernel from source? I'm not sure I can change the kernel my distro ships (easily, and without possibly breaking other things).
(I'd appreciate some guidance here, or a pointer to docs. For reference, this is the kernel flavor my distro uses: https://github.com/bazzite-org/kernel-bazzite)
(let me check with the maintainers of Bazzite)
I wasn't able to get their attention with an issue, so I opened a PR directly on their "patchwork" repository (just to get their attention; I doubt they should merge such a patch).
Bazzite's maintainer gave me a path forward, but it requires me to first dual-boot another distro so I can build the kernel. Meanwhile, I'll try the kernel args that Michal provided, to verify whether they also help.
avg_trb_len doesn't seem to be an issue for my ASMedia/Promontory HCs; they still have their usual bugs after changing it to 8 as per spec.

By the way, the simplest way to break ASMedia chips is running `while true ; do ifconfig ethN up ; ifconfig ethN down ; done` on a SuperSpeed NIC (AX88179 and RTL8153 are known to work for this). This is what the limit endpoint interval quirk appears to help with, though recently I found a case (NIC behind a SuperSpeed hub) where it didn't help.

I am also not convinced that the "clearing xHC resources" patch is relevant, because it's hard to explain why nothing bad happens when you run a reset manually. I found no problems resetting devices on my AMD B350, ASM3142 and ASM1142, despite the dubious command sequence always being there. Your logs show a malfunction before we even get to this point: the reason the hub is being reset is that attempting to resume it normally suddenly failed.

So the outcome of testing is that the problem occurs if both the unlucky hub and the unlucky webcam are present on the same X670 bus, and disappears if they are on different X670 buses at the same time? And it's happening on 6.13.9 or newer, but not on 6.13.7? I have no explanation for that...
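Since `ifconfig` is deprecated on many distributions, the same stress loop can be written with iproute2's `ip link set`. A minimal sketch, bounded to a fixed number of iterations so it stops on its own; the function name, interface name and count are placeholders, and it needs root on a real NIC:

```shell
#!/usr/bin/env bash
# Hypothetical helper: bring a network interface up and down COUNT times
# (default 1000), stopping early if any ip command fails.
toggle_link() {
  local iface=$1 count=${2:-1000} i
  for ((i = 0; i < count; i++)); do
    ip link set dev "$iface" up || return 1
    ip link set dev "$iface" down || return 1
  done
}
# Example: toggle_link ethN 500
```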
> And it's happening on 6.13.9 or newer, but not on 6.13.7? I have no explanation for that...

Well, I can't say for 100% sure that it didn't happen on 6.13.7, only that it didn't happen during the periods I tested it.

> So the outcome of testing is that the problem occurs if both the unlucky hub and the unlucky webcam are present on the same X670 bus, and disappears if they are on different X670 buses at the same time?

Again, only within the limited time I tested, but that's right: it didn't happen at all with the webcam on a separate bus. I didn't test the webcam on the same bus but connected directly to a port (without the monitor hub); my feeling is that the monitor hub is somehow the issue, so I could focus on testing that aspect.

I also tested with the boot args; this is the result:
```
bash-5.2# dmesg | grep xhci.*quirks
[ 0.537829] xhci_hcd 0000:03:00.2: hcc params 0x0110ffc5 hci version 0x120 quirks 0x0000000200000010
[ 0.539340] xhci_hcd 0000:3b:00.0: hcc params 0x20007fc1 hci version 0x110 quirks 0x0000000200009810
[ 0.595976] xhci_hcd 0000:68:00.0: hcc params 0x0200ef81 hci version 0x110 quirks 0x0000000200000010
[ 0.653978] xhci_hcd 0000:6a:00.0: hcc params 0x0200ef81 hci version 0x110 quirks 0x0000000200000010
[ 0.656224] xhci_hcd 0000:6c:00.3: hcc params 0x0120ffc5 hci version 0x120 quirks 0x0000000200000010
[ 0.657094] xhci_hcd 0000:6c:00.4: hcc params 0x0120ffc5 hci version 0x120 quirks 0x0000000200000010
[ 0.658896] xhci_hcd 0000:6d:00.0: hcc params 0x0110ffc5 hci version 0x120 quirks 0x0000000200000010
```
I forgot to mention: with the boot args, I was still able to reproduce the crash.
That didn't work: the quirks values in your log are unchanged (still 0x...0200000010). If you edited the GRUB config in /etc, you apparently need to do some extra work to transfer the settings to /boot; see the link below. (That's why I don't like and don't use grub2.) https://docs.redhat.com/en/documentation/red_hat_enterprise_linux/9/html/managing_monitoring_and_updating_the_kernel/configuring-kernel-command-line-parameters_managing-monitoring-and-updating-the-kernel I would recommend option 4.5 instead, if possible: select the menu entry and press 'e' to edit it for one boot.
AFAIK the changes are in /boot (contents of `/boot/grub2/grub.cfg`):
```
# This section was generated by a script. Do not modify the generated file - all changes
# will be lost the next time file is regenerated. Instead edit the BootLoaderSpec files.
#
# The blscfg command parses the BootLoaderSpec files stored in /boot/loader/entries and
# populates the boot menu. Please refer to the Boot Loader Specification documentation
# for the files format: https://systemd.io/BOOT_LOADER_SPECIFICATION/.
# The kernelopts variable should be defined in the grubenv file. But to ensure that menu
# entries populated from BootLoaderSpec files that use this variable work correctly even
# without a grubenv file, define a fallback kernelopts variable if this has not been set.
#
# The kernelopts variable in the grubenv file can be modified using the grubby tool or by
# executing the grub2-mkconfig tool. For the latter, the values of the GRUB_CMDLINE_LINUX
# and GRUB_CMDLINE_LINUX_DEFAULT options from /etc/default/grub file are used to set both
# the kernelopts variable in the grubenv file and the fallback kernelopts variable.
if [ -z "${kernelopts}" ]; then
  set kernelopts="root=UUID=4349ca0b-5bb9-4ad2-83a9-78d7440c02f9 ro quiet splash xhci_hcd.quirks=0x4000000 "
fi
```
But let me try the 4.5 approach.
Got it:
````
bash-5.2# dmesg | grep xhci.*quirks
[ 0.000000] Command line: BOOT_IMAGE=(hd0,gpt2)/ostree/default-5c836f8b9cf21db25d8ea8042cfcac18de42950e638fd2ab6f1bf522afe3bf57/vmlinuz-6.14.6-102.bazzite.fc42.x86_64 rhgb quiet root=UUID=4349ca0b-5bb9-4ad2-83a9-78d7440c02f9 rootflags=subvol=root rw ostree=/ostree/boot.0/default/5c836f8b9cf21db25d8ea8042cfcac18de42950e638fd2ab6f1bf522afe3bf57/0 bluetooth.disable_ertm=1 preempt=full xhci_hcd.quirks=0x4000000
[ 0.064435] Kernel command line: BOOT_IMAGE=(hd0,gpt2)/ostree/default-5c836f8b9cf21db25d8ea8042cfcac18de42950e638fd2ab6f1bf522afe3bf57/vmlinuz-6.14.6-102.bazzite.fc42.x86_64 rhgb quiet root=UUID=4349ca0b-5bb9-4ad2-83a9-78d7440c02f9 rootflags=subvol=root rw ostree=/ostree/boot.0/default/5c836f8b9cf21db25d8ea8042cfcac18de42950e638fd2ab6f1bf522afe3bf57/0 bluetooth.disable_ertm=1 preempt=full xhci_hcd.quirks=0x4000000
[ 0.613780] xhci_hcd 0000:03:00.2: hcc params 0x0110ffc5 hci version 0x120 quirks 0x0000000204000010
[ 0.615279] xhci_hcd 0000:3b:00.0: hcc params 0x20007fc1 hci version 0x110 quirks 0x0000000204009810
[ 0.671965] xhci_hcd 0000:68:00.0: hcc params 0x0200ef81 hci version 0x110 quirks 0x0000000204000010
[ 0.730414] xhci_hcd 0000:6a:00.0: hcc params 0x0200ef81 hci version 0x110 quirks 0x0000000204000010
[ 0.733142] xhci_hcd 0000:6c:00.3: hcc params 0x0120ffc5 hci version 0x120 quirks 0x0000000204000010
[ 0.734136] xhci_hcd 0000:6c:00.4: hcc params 0x0120ffc5 hci version 0x120 quirks 0x0000000204000010
[ 0.736031] xhci_hcd 0000:6d:00.0: hcc params 0x0110ffc5 hci version 0x120 quirks 0x0000000204000010
````
Gonna wait for crashes to happen.
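To check every controller at once instead of comparing each line by eye, the quirks value can be tested for the 0x4000000 bit (bit 26; if I read drivers/usb/host/xhci.h correctly, this is `XHCI_LIMIT_ENDPOINT_INTERVAL_7`, but treat that name as my assumption). A sketch with a made-up function name, reading dmesg-style lines on stdin:

```shell
#!/usr/bin/env bash
# Hypothetical helper: for each "xhci_hcd <pci>: ... quirks 0x..." line on
# stdin, print the PCI address and whether the 0x4000000 bit is set.
# Bash arithmetic is 64-bit, so the 16-digit quirks values fit.
QUIRK_BIT=$((0x4000000))

report_quirk() {
  local line re='xhci_hcd ([0-9a-f:.]+):.*quirks (0x[0-9a-fA-F]+)'
  while IFS= read -r line; do
    [[ $line =~ $re ]] || continue   # skips "Command line:" and other lines
    if (( BASH_REMATCH[2] & QUIRK_BIT )); then
      echo "${BASH_REMATCH[1]} applied"
    else
      echo "${BASH_REMATCH[1]} missing"
    fi
  done
}
# Example: dmesg | report_quirk
```

With the log above, every controller should report `applied`; with the earlier boot-args attempt, they would all have reported `missing`.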
(I also found that I should use `rpm-ostree kargs --editor` for editing kernel args.)
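On rpm-ostree based systems like Bazzite, the boot entries are generated by ostree, which is likely why editing /etc/default/grub had no visible effect. Besides `--editor`, `rpm-ostree kargs` also accepts `--append=KEY=VALUE`. A sketch that only appends the quirk when the running kernel doesn't already have it; `ensure_karg` and `RPM_OSTREE` are hypothetical names, and new kargs only take effect after rebooting into the new deployment:

```shell
#!/usr/bin/env bash
# Hypothetical helper: append a kernel argument via rpm-ostree unless the
# given command line (default: /proc/cmdline) already contains it exactly.
# Caveat: /proc/cmdline reflects the booted deployment only, so a pending
# deployment that already has the arg would still get it appended again.
# RPM_OSTREE can be set to "echo" for a dry run.
RPM_OSTREE=${RPM_OSTREE:-rpm-ostree}

ensure_karg() {
  local karg=$1 cmdline=${2:-$(cat /proc/cmdline)} t
  local -a toks
  read -ra toks <<< "$cmdline"
  for t in "${toks[@]}"; do
    if [[ $t == "$karg" ]]; then
      echo "already set: $karg"
      return 0
    fi
  done
  "$RPM_OSTREE" kargs --append="$karg"
}
# Example: sudo env RPM_OSTREE=rpm-ostree bash -c '...; ensure_karg xhci_hcd.quirks=0x4000000'
```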
Unfortunately, the issue is still reproducible even with the quirk applied. Is another dump log needed?
(I still need to make the kernel changes to verify whether they work.)