Since kernel 5.15 I have had a problem with my USB hub (with kernel 5.13 I see no problem). The device stops working shortly after the system starts. In the dmesg log I see a DMAR fault on the USB controller:
[kwi27 22:03] usb 5-1.2: new high-speed USB device number 3 using xhci_hcd
[ +0,100440] usb 5-1.2: New USB device found, idVendor=1a40, idProduct=0101, bcdDevice= 1.11
[ +0,000004] usb 5-1.2: New USB device strings: Mfr=0, Product=1, SerialNumber=0
[ +0,000002] usb 5-1.2: Product: USB 2.0 Hub
[ +0,001002] hub 5-1.2:1.0: USB hub found
[ +0,000133] hub 5-1.2:1.0: 4 ports detected
[ +0,702453] usb 5-1.2.2: new full-speed USB device number 4 using xhci_hcd
[ +0,471198] usb 5-1.2.2: New USB device found, idVendor=047f, idProduct=c025, bcdDevice= 1.35
[ +0,000004] usb 5-1.2.2: New USB device strings: Mfr=1, Product=2, SerialNumber=3
[ +0,000002] usb 5-1.2.2: Product: Plantronics C320-M
[ +0,000001] usb 5-1.2.2: Manufacturer: Plantronics
[ +0,000001] usb 5-1.2.2: SerialNumber: B13D8BE491B04E73AEB4C95E162DBE2B
[ +0,255862] mc: Linux media interface: v0.10
[ +0,001057] input: Plantronics Plantronics C320-M as /devices/pci0000:00/0000:00:1c.5/0000:04:00.0/usb5/5-1/5-1.2/5-1.2.2/5-1.2.2:1.3/0003:047F:C025.0004/input/input21
[ +0,060275] plantronics 0003:047F:C025.0004: input,hiddev1,hidraw3: USB HID v1.11 Device [Plantronics Plantronics C320-M] on usb-0000:04:00.0-1.2.2/input3
[ +0,859655] usb 5-1.2.2: Warning! Unlikely big volume range (=8192), cval->res is probably wrong.
[ +0,000003] usb 5-1.2.2: [11] FU [Sidetone Playback Volume] ch = 1, val = 0/8192/1
[ +0,584234] usbcore: registered new interface driver snd-usb-audio
[ +0,229229] xhci_hcd 0000:04:00.0: WARNING: Host System Error
[ +0,000014] DMAR: DRHD: handling fault status reg 2
[ +0,000004] DMAR: [DMA Read NO_PASID] Request device [04:00.0] fault addr 0xfffca000 [fault reason 0x06] PTE Read access is not set
[ +0,031993] xhci_hcd 0000:04:00.0: Host halt failed, -110
[kwi27 22:04] xhci_hcd 0000:04:00.0: xHCI host not responding to stop endpoint command.
[ +0,000003] xhci_hcd 0000:04:00.0: USBSTS: HSE EINT
[ +0,032011] xhci_hcd 0000:04:00.0: Host halt failed, -110
[ +0,000002] xhci_hcd 0000:04:00.0: xHCI host controller not responding, assume dead
[ +0,000017] xhci_hcd 0000:04:00.0: HC died; cleaning up
[ +0,000042] usb 5-1: USB disconnect, device number 2
[ +0,000003] usb 5-1.2: USB disconnect, device number 3
[ +0,000002] usb 5-1.2.2: USB disconnect, device number 4
[ +0,000114] usb 5-1.2.2: 1:0: usb_set_interface failed (-110)
[ +0,000016] usb 5-1.2.2: 1:1: usb_set_interface failed (-19)
[ +0,000011] usb 5-1.2.2: 1:0: usb_set_interface failed (-19)
04:00.0 USB controller: VIA Technologies, Inc. VL805/806 xHCI USB 3.0 Controller (rev 01) (prog-if 30 [XHCI])
    Subsystem: Micro-Star International Co., Ltd. [MSI] VL805/806 xHCI USB 3.0 Controller
    Flags: bus master, fast devsel, latency 0, IRQ 31, IOMMU group 12
    Memory at f7100000 (64-bit, non-prefetchable) [size=4K]
    Capabilities: <access denied>
    Kernel driver in use: xhci_hcd
    Kernel modules: xhci_pci
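For anyone collecting several of these logs, the DMAR fault line can be pulled apart mechanically. The following is a minimal sketch, assuming the message format is exactly the one quoted in the dmesg output above; the pattern and the helper name `parse_dmar_fault` are illustrative and not part of any kernel tooling, and other kernel versions may print the fault differently.

```python
import re

# Pattern written against the single DMAR fault line quoted in this report.
# Assumption: other kernels may format this message differently.
DMAR_FAULT = re.compile(
    r"DMAR: \[DMA (?P<op>Read|Write)[^\]]*\] "
    r"Request device \[(?P<device>[0-9a-fA-F:.]+)\] "
    r"fault addr (?P<addr>0x[0-9a-fA-F]+) "
    r"\[fault reason (?P<reason>0x[0-9a-fA-F]+)\] (?P<text>.*)"
)


def parse_dmar_fault(line):
    """Return the fields of a DMAR fault line, or None if it does not match."""
    m = DMAR_FAULT.search(line)
    if m is None:
        return None
    return {
        "op": m.group("op"),                    # "Read" or "Write"
        "device": m.group("device"),            # PCI bus:device.function
        "addr": int(m.group("addr"), 16),       # faulting DMA address
        "reason": int(m.group("reason"), 16),   # numeric fault reason
        "text": m.group("text").strip(),        # human-readable reason
    }


line = ("[ +0,000004] DMAR: [DMA Read NO_PASID] Request device [04:00.0] "
        "fault addr 0xfffca000 [fault reason 0x06] PTE Read access is not set")
fault = parse_dmar_fault(line)
print(fault["device"], hex(fault["addr"]), fault["text"])
```

The device field (04:00.0) matches the PCI address of the VL805 controller in the lspci output, which is what ties the fault to the USB controller.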
I just hit exactly the same issue when upgrading the kernel from v5.13.0-40 to v5.15.0-27. With no devices plugged in, the USB hub reports everything as OK. Plugging in a USB keyboard worked for a minute or two, and then I got exactly the same errors from [+0,229229] to [+0,000004] above. Same USB controller chipset as OP by the looks of things. I've managed to list the Capabilities in case that's any help:
03:00.0 USB controller: VIA Technologies, Inc. VL805/806 xHCI USB 3.0 Controller (rev 01) (prog-if 30 [XHCI])
    Subsystem: VIA Technologies, Inc. VL805/806 xHCI USB 3.0 Controller
    Flags: bus master, fast devsel, latency 0, IRQ 28, IOMMU group 12
    Memory at e0a00000 (64-bit, non-prefetchable) [size=4K]
    Capabilities: [80] Power Management version 3
    Capabilities: [90] MSI: Enable+ Count=1/4 Maskable- 64bit+
    Capabilities: [c4] Express Endpoint, MSI 00
    Capabilities: [100] Advanced Error Reporting
    Kernel driver in use: xhci_hcd
    Kernel modules: xhci_pci
Downgrading back to v5.13.0-40 fixes the problem.
I wanted to add this issue to the regression tracking and poke the maintainers, but noticed there is a patch that is being backported right now that might or might not be related (not my area of expertise): https://lore.kernel.org/all/20220504153117.726462014@linuxfoundation.org/ It's already in 5.18-rc5; could somebody please give it a quick try before I proceed with my initial plan?
I built kernel 5.18-rc5 myself (with the Ubuntu default config), but the problem still exists.
I've misled you a bit by saying that the bug didn't occur on the 5.13 kernel. I tried bisecting on the upstream kernel, and it turns out that the problem also occurs on upstream 5.13 when I build it with the Ubuntu default config from kernel 5.15.0-27. So far, the only kernel build on which I haven't noticed the problem (excluding the 5.4 kernels from Ubuntu 20.04 LTS) is kernel 5.13.0-28-generic from Ubuntu. Interestingly, I found the sources of this kernel on git kernel.ubuntu.com and built it using the config from kernel 5.15, and the problem occurred there too. Only when I built this kernel with its own default config did I stop seeing the problem.
Sorry, this is starting to get confusing and hard to follow. If something used to work with an Ubuntu kernel and stopped working there, you might want to report it to the Ubuntu developers, but not here. This bug tracker cares mainly about the upstream kernel (see front page), so what happens with a kernel built from the Ubuntu sources (which are known to be modified a lot) is irrelevant, and even just mentioning it makes things hard to follow. :-/ Regarding your problem: I'm not familiar with the code that might cause this, but to me it looks a lot like Ubuntu switched on a kernel configuration option that is causing this. If that's the case, the problem doesn't qualify as a regression, as explained here: https://www.kernel.org/doc/html/latest/admin-guide/reporting-issues.html The developers nevertheless might be interested in fixing this, but might need more details from you (like the config option that is causing this).
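One way to narrow down a suspect config option is to diff the config of a working build against the config of a failing build. The sketch below is only an illustration of that idea: the helper names `read_config` and `diff_configs` are hypothetical, and the two sample configs are made up for the example (they are not the reporter's actual configs).

```python
def read_config(text):
    """Parse kernel .config text into {option: value}.

    Lines of the form '# CONFIG_FOO is not set' are recorded as 'n'.
    """
    suffix = " is not set"
    options = {}
    for raw in text.splitlines():
        line = raw.strip()
        if line.startswith("# CONFIG_") and line.endswith(suffix):
            options[line[2:-len(suffix)]] = "n"
        elif line.startswith("CONFIG_") and "=" in line:
            name, value = line.split("=", 1)
            options[name] = value
    return options


def diff_configs(good_text, bad_text):
    """Return {option: (value_in_good, value_in_bad)} for options that differ."""
    good, bad = read_config(good_text), read_config(bad_text)
    changed = {}
    for name in sorted(set(good) | set(bad)):
        g = good.get(name, "(absent)")
        b = bad.get(name, "(absent)")
        if g != b:
            changed[name] = (g, b)
    return changed


# Made-up sample data, for illustration only.
good_config = """CONFIG_INTEL_IOMMU=y
# CONFIG_INTEL_IOMMU_DEFAULT_ON is not set
CONFIG_USB_XHCI_HCD=y
"""
bad_config = """CONFIG_INTEL_IOMMU=y
CONFIG_INTEL_IOMMU_DEFAULT_ON=y
CONFIG_USB_XHCI_HCD=y
"""
for name, (g, b) in diff_configs(good_config, bad_config).items():
    print(f"{name}: good={g} bad={b}")
```

With real distribution configs the diff is usually long, so the differing options would still have to be tested in batches to find the one that matters.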
> [ +0,229229] xhci_hcd 0000:04:00.0: WARNING: Host System Error
The xHC controller reports a catastrophic error and sets the HSE bit. For PCI xHC controllers the spec lists the possible causes as: host controller PCI parity error, PCI Master Abort, PCI Target Abort. But DMA issues are also a possible cause, especially as the log shows DMAR problems right after this. Any chance you could bisect this on the upstream kernel?
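For readers unfamiliar with the status register, the "USBSTS: HSE EINT" line in the original log is a decoded view of the xHCI USBSTS register. A minimal sketch of that decoding follows, assuming the bit positions defined in the xHCI specification (HCHalted 0, HSE 2, EINT 3, PCD 4, SSS 8, RSS 9, SRE 10, CNR 11, HCE 12). The helper name `decode_usbsts` is illustrative, and the raw value 0x0c is inferred from the two flags in the log, not taken from the report.

```python
# Bit positions of the xHCI USBSTS register, per the xHCI specification.
USBSTS_BITS = {
    0: "HCHalted",  # host controller halted
    2: "HSE",       # host system error
    3: "EINT",      # event interrupt
    4: "PCD",       # port change detect
    8: "SSS",       # save state status
    9: "RSS",       # restore state status
    10: "SRE",      # save/restore error
    11: "CNR",      # controller not ready
    12: "HCE",      # host controller error
}


def decode_usbsts(value):
    """Return the names of the flags set in a raw USBSTS value."""
    return [name for bit, name in sorted(USBSTS_BITS.items())
            if value & (1 << bit)]


# 0x0c has bits 2 and 3 set, matching the "HSE EINT" flags in the log.
print(" ".join(decode_usbsts(0x0C)))
```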
@Thorsten Leemhuis sorry for misleading you, but when I added this bug here I didn't know it wasn't an upstream regression. At first glance it looked that way, as I also observed the problem on the upstream kernel. So far we only know that the problem does not occur with one of the kernel configurations, but this does not mean that the problem does not exist.
> Any chance you could bisect this on upstream kernel?
I'll try to do it this week.
(In reply to Piotr Piórkowski from comment #7)
> > Any chance you could bisect this on upstream kernel?
> I'll try to do it this week
Any news? Was the issue maybe fixed in the meantime?
The IOMMU error is caused by a buggy VL805 firmware. It is more visible with the Debian kernel as Debian patches the kernel to enable IOMMU by default. The updated firmware can be installed using the VIA Windows tool (this did not work for me), or you can just turn off IOMMU.
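Before changing anything, it can help to see which IOMMU-related parameters the running kernel was booted with. This is a rough sketch, assuming the documented kernel parameters `intel_iommu=`, `amd_iommu=` and `iommu=` (for example `intel_iommu=off`); the helper name `iommu_params` is illustrative. An empty result does not mean the IOMMU is off, since the default depends on how the kernel was configured.

```python
# Kernel command-line parameters that control the IOMMU on x86.
# Dotted forms such as iommu.passthrough= are deliberately not handled here.
IOMMU_PARAMS = ("intel_iommu", "amd_iommu", "iommu")


def iommu_params(cmdline):
    """Return {parameter: value} for IOMMU parameters found on a command line."""
    found = {}
    for token in cmdline.split():
        name, sep, value = token.partition("=")
        if sep and name in IOMMU_PARAMS:
            found[name] = value
    return found


# Sample command line for illustration; on a live system the text can be
# read from /proc/cmdline instead.
sample = "BOOT_IMAGE=/vmlinuz root=/dev/sda1 ro quiet intel_iommu=off"
print(iommu_params(sample))
```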
(In reply to Chris Bainbridge from comment #9) > The IOMMU error is caused by a buggy VL805 firmware. Makes me wonder: would it be possible to detect an old firmware and avoid the IOMMU path in this case? Or at least warn?
(In reply to The Linux kernel's regression tracker (Thorsten Leemhuis) from comment #10)
> (In reply to Chris Bainbridge from comment #9)
> > The IOMMU error is caused by a buggy VL805 firmware.
>
> Makes me wonder: would it be possible to detect an old firmware and avoid
> the IOMMU path in this case? Or at least warn?
It would be possible to detect; the firmware version can be read with:
$ sudo lspci -d 1106:3483 -xxx | awk '/^50:/ { print "VL805 FW version: " $5 $4 $3 $2 }'
VL805 FW version: 00013500
IMHO it would be a good idea for Linux to track the latest firmware versions for *all* hardware and warn if a firmware is out-of-date (even if the firmware updater is only available on Windows). Earlier this year I had an intermittent issue with a new laptop where the desktop would hang and processes would get IO errors. But this only happened once every 3 weeks or so. It took a few months to isolate the problem to NVME firmware (it was an HP laptop with Intel NVME, and I was unaware that these drives have locked HP-specific firmware). The firmware update was a Windows executable. I've also seen many forum posts where people have problems that were resolved by updates to GPU/motherboard/NVME/ethernet/wifi etc. firmware. Many of these problems could have been resolved a lot quicker if the kernel log contained "old firmware detected!".
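The awk one-liner reads the four bytes at PCI config offset 0x50 and prints them in reverse order, i.e. as a little-endian 32-bit value. A userspace check along those lines could look like the sketch below. The helper names `vl805_fw_version` and `is_outdated` are hypothetical, the sample dump row is constructed so that it reproduces the 00013500 output quoted above (the remaining bytes are zero placeholders), and the minimum version is left to the caller because this thread does not say which firmware version contains the fix.

```python
def vl805_fw_version(lspci_dump):
    """Return the dword at config offset 0x50 as an 8-digit hex string.

    Expects the text printed by 'lspci -xxx'; returns None if there is
    no row starting with '50:'.
    """
    for line in lspci_dump.splitlines():
        fields = line.split()
        if len(fields) >= 5 and fields[0] == "50:":
            # Bytes 0x50..0x53 are little-endian, so reverse them,
            # exactly as the awk script prints $5 $4 $3 $2.
            return "".join(reversed(fields[1:5]))
    return None


def is_outdated(version, minimum):
    """Compare two 8-digit hex version strings numerically.

    'minimum' is supplied by the caller; no fixed version is known here.
    """
    return int(version, 16) < int(minimum, 16)


# Constructed sample row, not a real dump.
sample_dump = "50: 00 35 01 00 00 00 00 00 00 00 00 00 00 00 00 00"
print("VL805 FW version:", vl805_fw_version(sample_dump))
```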
(In reply to Chris Bainbridge from comment #11) > It would be possible to detect […] Thx for that. > imho it would be a good idea for Linux to track the latest firmware versions > for *all* hardware […] Pretty sure that is something that should be done in userspace, as there it's a lot easier to update the dataset with the latest firmware versions.
Ideally this should be fixed, but we are kinda stuck here:
* Piotr, do you even care about this after all this time?
* Does the problem even still happen with the latest mainline kernel? It might be a good idea to test this with 6.6, or with 6.7-rc1 once it is out (e.g. in two weeks from now), before doing anything else.
* The regression was never bisected, hence it's unclear which developers are responsible for handling this (USB? IOMMU? something else?). But well, with a bit of luck Mathias, who commented earlier, might see this and share his thoughts.
Honestly, I had already forgotten about this problem. In the meantime, I changed the HW. Lately I have little time, but I still have this HW somewhere, and if there is a need I can verify something.
(In reply to Piotr Piórkowski from comment #14) > Honestly, I had already forgotten about this problem. Happens, no worries :-D > Lately I have little time, but I have somewhere this HW > still and if there is a need I can verify something I suggest you wait for Mathias to speak up first, I guess he should know best what's the best way forward here.
> Pretty sure that is something that should be done in userspace, as there it's
> a lot easier to update the dataset with the latest firmware versions.
True, but then you have the problem of convincing distributions to ship it and enable it by default.
> The regression was never bisected
It's not a regression in the kernel, as Piotr said in comment #7. It appeared on Debian kernels because Debian enabled IOMMU by default (but not for the GPU). This is a Debian-specific patch.
If I understand correctly, this is caused by the VIA VL805 xHC controller with bad firmware accessing some DMA address outside the allowed range. With IOMMU enabled, the IOMMU prevents this access and the controller fails. I'm speculating here, but the controller could be accessing memory past the end of one of the DMA ranges while trying to read ahead. If we can figure out which area it reads past, then it's possible to make a driver workaround for this controller that allocates a slightly larger DMA chunk for that specific purpose.
DMA memory allocated for xHC use before any USB device is connected:
- dcbaa: device context base address array, an array of pointers to device contexts. dma_alloc_coherent(dev)
- command ring: dma_pool_zalloc(segment_pool)
- event ring: dma_pool_zalloc(segment_pool)
- event ring segment table: info about event ring segments, size and location. dma_alloc_coherent(dev)
- scratchpad: only touched (RW) by the xHC controller, not the driver. dma_alloc_coherent()
DMA memory allocated for each connected USB device:
- device contexts: dma_pool_zalloc(device_pool)
- transfer rings: contain TRBs, metadata about transfers. dma_pool_zalloc(segment_pool)
- stream contexts: dma_alloc_coherent() or dma_pool_alloc(*_streams_pool)
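If the DMA addresses of these allocations were logged (for example by a debug patch printing each returned DMA handle), matching a fault address against them is simple bookkeeping. The sketch below shows that step only. The helper name `region_before` is hypothetical, and every region address in the sample is invented for illustration; only the fault address 0xfffca000 comes from the log in this report.

```python
def region_before(fault_addr, regions):
    """Find the region that a faulting DMA address belongs to or follows.

    regions is a list of (name, start, size) tuples. Returns
    (name, "inside", offset) if the address lies within a region,
    otherwise (name, "past end", gap) for the region whose end is
    closest below the address, or None if no region ends below it.
    """
    best = None
    for name, start, size in regions:
        end = start + size
        if start <= fault_addr < end:
            return (name, "inside", fault_addr - start)
        if end <= fault_addr:
            gap = fault_addr - end
            if best is None or gap < best[2]:
                best = (name, "past end", gap)
    return best


# HYPOTHETICAL addresses, made up for this example. Real values would
# have to be collected from the driver on the affected machine.
sample_regions = [
    ("dcbaa", 0xFFFFF000, 0x1000),
    ("command ring", 0xFFFCB000, 0x1000),
    ("event ring", 0xFFFC9000, 0x1000),
]
fault_addr = 0xFFFCA000  # from the DMAR line in the original report
print(region_before(fault_addr, sample_regions))
```

A gap of 0 would mean the controller touched the first byte after a region, which is the read-ahead pattern speculated about above.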
(In reply to Chris Bainbridge from comment #16) > It's not a regression in the kernel as Piotr said in comment #7. Ahh, sorry, missed/forgot that in all the regressions I deal with. Thx for pointing it out! BTW, thx Mathias for nevertheless looking into this.
(In reply to Mathias Nyman from comment #17)
> If we can figure out past which area, then its possible to make a driver
> workaround for this controller that allocates a bit larger DMA chunk for
> that specific purpose.
I am a Debian user and opened this bug report: https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=1050352 In the last mail I was pointed here. So I am now offering my help, because I have the hardware and can reproduce the bug. I am not a kernel developer, but I am able to modify kernel source files, compile a custom kernel and run it on Debian, either Debian stable or Debian testing. Maybe I could test new versions of the driver.