Bug 215906

Summary: DMAR fault when connected usb hub (xhci_hcd)
Product: Drivers Reporter: Piotr Piórkowski (qba100)
Component: USB Assignee: Default virtual assignee for Drivers/USB (drivers_usb)
Status: NEW ---    
Severity: normal CC: chris.bainbridge, mathias.nyman, royston
Priority: P1    
Hardware: x86-64   
OS: Linux   
Kernel Version: Tested on 5.15.0-27, 5.17.0-051700-generic (from https://kernel.ubuntu.com/~kernel-ppa/mainline/v5.17/) Subsystem:
Regression: Yes Bisected commit-id:

Description Piotr Piórkowski 2022-04-27 20:21:03 UTC
Since kernel 5.15 (with kernel 5.13 I see no problem) I have had a problem with my USB hub. The device stops working shortly after the system starts.
In the dmesg log I see a DMAR fault on the USB controller:


[kwi27 22:03] usb 5-1.2: new high-speed USB device number 3 using xhci_hcd
[  +0,100440] usb 5-1.2: New USB device found, idVendor=1a40, idProduct=0101, bcdDevice= 1.11
[  +0,000004] usb 5-1.2: New USB device strings: Mfr=0, Product=1, SerialNumber=0
[  +0,000002] usb 5-1.2: Product: USB 2.0 Hub
[  +0,001002] hub 5-1.2:1.0: USB hub found
[  +0,000133] hub 5-1.2:1.0: 4 ports detected
[  +0,702453] usb 5-1.2.2: new full-speed USB device number 4 using xhci_hcd
[  +0,471198] usb 5-1.2.2: New USB device found, idVendor=047f, idProduct=c025, bcdDevice= 1.35
[  +0,000004] usb 5-1.2.2: New USB device strings: Mfr=1, Product=2, SerialNumber=3
[  +0,000002] usb 5-1.2.2: Product: Plantronics C320-M
[  +0,000001] usb 5-1.2.2: Manufacturer: Plantronics
[  +0,000001] usb 5-1.2.2: SerialNumber: B13D8BE491B04E73AEB4C95E162DBE2B
[  +0,255862] mc: Linux media interface: v0.10
[  +0,001057] input: Plantronics Plantronics C320-M as /devices/pci0000:00/0000:00:1c.5/0000:04:00.0/usb5/5-1/5-1.2/5-1.2.2/5-1.2.2:1.3/0003:047F:C025.0004/input/input21
[  +0,060275] plantronics 0003:047F:C025.0004: input,hiddev1,hidraw3: USB HID v1.11 Device [Plantronics Plantronics C320-M] on usb-0000:04:00.0-1.2.2/input3
[  +0,859655] usb 5-1.2.2: Warning! Unlikely big volume range (=8192), cval->res is probably wrong.
[  +0,000003] usb 5-1.2.2: [11] FU [Sidetone Playback Volume] ch = 1, val = 0/8192/1
[  +0,584234] usbcore: registered new interface driver snd-usb-audio
[  +0,229229] xhci_hcd 0000:04:00.0: WARNING: Host System Error
[  +0,000014] DMAR: DRHD: handling fault status reg 2
[  +0,000004] DMAR: [DMA Read NO_PASID] Request device [04:00.0] fault addr 0xfffca000 [fault reason 0x06] PTE Read access is not set
[  +0,031993] xhci_hcd 0000:04:00.0: Host halt failed, -110
[kwi27 22:04] xhci_hcd 0000:04:00.0: xHCI host not responding to stop endpoint command.
[  +0,000003] xhci_hcd 0000:04:00.0: USBSTS: HSE EINT
[  +0,032011] xhci_hcd 0000:04:00.0: Host halt failed, -110
[  +0,000002] xhci_hcd 0000:04:00.0: xHCI host controller not responding, assume dead
[  +0,000017] xhci_hcd 0000:04:00.0: HC died; cleaning up
[  +0,000042] usb 5-1: USB disconnect, device number 2
[  +0,000003] usb 5-1.2: USB disconnect, device number 3
[  +0,000002] usb 5-1.2.2: USB disconnect, device number 4
[  +0,000114] usb 5-1.2.2: 1:0: usb_set_interface failed (-110)
[  +0,000016] usb 5-1.2.2: 1:1: usb_set_interface failed (-19)
[  +0,000011] usb 5-1.2.2: 1:0: usb_set_interface failed (-19)

04:00.0 USB controller: VIA Technologies, Inc. VL805/806 xHCI USB 3.0 Controller (rev 01) (prog-if 30 [XHCI])
	Subsystem: Micro-Star International Co., Ltd. [MSI] VL805/806 xHCI USB 3.0 Controller
	Flags: bus master, fast devsel, latency 0, IRQ 31, IOMMU group 12
	Memory at f7100000 (64-bit, non-prefetchable) [size=4K]
	Capabilities: <access denied>
	Kernel driver in use: xhci_hcd
	Kernel modules: xhci_pci
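
When comparing reports like this one, it can help to pull the requesting device and the faulting address out of the DMAR line mechanically. The following is only a minimal sketch: it assumes the message format shown in the log above (other kernel versions may word it differently), and `parse_dmar_fault` is a hypothetical helper name, not an existing tool.

```python
import re

# Matches the DMAR fault line as it appears in the log above.
# Assumption: other kernel versions may format this message differently.
DMAR_RE = re.compile(
    r"DMAR: \[(?P<kind>DMA (?:Read|Write))[^\]]*\] "
    r"Request device \[(?P<dev>[0-9a-fA-F:.]+)\] "
    r"fault addr (?P<addr>0x[0-9a-fA-F]+) "
    r"\[fault reason (?P<reason>0x[0-9a-fA-F]+)\]"
)


def parse_dmar_fault(line):
    """Return a dict describing a DMAR fault line, or None if it does not match."""
    m = DMAR_RE.search(line)
    if m is None:
        return None
    return {
        "kind": m.group("kind"),            # "DMA Read" or "DMA Write"
        "device": m.group("dev"),           # PCI bus:device.function
        "addr": int(m.group("addr"), 16),   # faulting DMA address
        "reason": int(m.group("reason"), 16),
    }
```

Fed the DMAR line from the log above, this returns the device `04:00.0` and the address `0xfffca000` as an integer, which can then be compared across reports.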
Comment 1 Royston Shufflebotham 2022-05-01 12:32:00 UTC
I just hit exactly the same issue when upgrading the kernel from v5.13.0-40 to v5.15.0-27. With no devices plugged in, the USB hub reports everything as ok. Plugging in a USB keyboard worked for a minute or two, and then I get exactly the same errors from [+0,229229] to [+0,000004] above.

Same USB controller chipset as OP by the looks of things. I've managed to list the Capabilities in case that's any help:
03:00.0 USB controller: VIA Technologies, Inc. VL805/806 xHCI USB 3.0 Controller (rev 01) (prog-if 30 [XHCI])
        Subsystem: VIA Technologies, Inc. VL805/806 xHCI USB 3.0 Controller
        Flags: bus master, fast devsel, latency 0, IRQ 28, IOMMU group 12
        Memory at e0a00000 (64-bit, non-prefetchable) [size=4K]
        Capabilities: [80] Power Management version 3
        Capabilities: [90] MSI: Enable+ Count=1/4 Maskable- 64bit+
        Capabilities: [c4] Express Endpoint, MSI 00
        Capabilities: [100] Advanced Error Reporting
        Kernel driver in use: xhci_hcd
        Kernel modules: xhci_pci

Downgrading back to v5.13.0-40 fixes the problem.
Comment 2 The Linux kernel's regression tracker (Thorsten Leemhuis) 2022-05-05 09:40:06 UTC
I wanted to add this issue to the regression tracking and poke the maintainers, but noticed there is a patch that is being backported right now that might or might not be related (not my area of expertise):
https://lore.kernel.org/all/20220504153117.726462014@linuxfoundation.org/

It's already in 5.18-rc5; could somebody please give it a quick try before I proceed with my initial plan?
Comment 3 Piotr Piórkowski 2022-05-05 12:30:34 UTC
I built kernel 5.18-rc5 myself (with the Ubuntu default config), but the problem still exists.
Comment 4 Piotr Piórkowski 2022-05-06 15:49:39 UTC
I've misled you a bit by saying that the bug didn't occur on the 5.13 kernel. I tried bisecting on the upstream kernel and it turns out that the problem also occurs on 5.13 - I built it using the Ubuntu default config from kernel 5.15.0-27.

So far, the only kernel build I haven't noticed a problem with (excluding the 5.4 kernels from Ubuntu 20.04 LTS) is kernel 5.13.0-28-generic from Ubuntu.

Interestingly, I found the sources of this kernel on git kernel.ubuntu.com and built this kernel using this config from kernel 5.15 and the problem also occurred.

It was only when I built this kernel using the default config for this kernel that I stopped seeing the problem.
Comment 5 The Linux kernel's regression tracker (Thorsten Leemhuis) 2022-05-09 05:42:03 UTC
Sorry, this is starting to get confusing and hard to follow.

If there is something that used to work with an Ubuntu kernel and stops working there, you might want to report it to the Ubuntu developers, but not here.

This bug tracker cares mainly about the upstream kernel (see front page), so what happens with a kernel built from the Ubuntu sources (which are known to be modified a lot) is irrelevant and even just mentioning that makes things hard to follow. :-/

Regarding your problem: I'm not familiar with the code that might cause this, but to me it looks a lot like Ubuntu switched on a kernel configuration option that is causing this. If that's the case the problem doesn't qualify as regression, as explained here:
https://www.kernel.org/doc/html/latest/admin-guide/reporting-issues.html

The developers nevertheless might be interested in fixing this, but might need more details from you (like the config option that is causing this)
Comment 6 Mathias Nyman 2022-05-09 08:11:22 UTC
[  +0,229229] xhci_hcd 0000:04:00.0: WARNING: Host System Error

The xHC controller reports a catastrophic error, and sets HSE bit.

For PCI xHC controllers the spec lists possible causes as:
host controller PCI parity error, PCI Master Abort, PCI Target Abort.
But DMA issues are also a possible cause, especially as the log shows DMAR
problems right after this.

Any chance you could bisect this on upstream kernel?
Comment 7 Piotr Piórkowski 2022-05-09 09:00:24 UTC
@Thorsten Leemhuis sorry for misleading you, but when adding this bug here I didn't know it wasn't an upstream regression - at first glance it looked that way, as I also observed the problem on upstream.

So far we only know that in one of the kernel configurations the problem does not occur - but this does not mean that the problem does not exist.

> Any chance you could bisect this on upstream kernel?

I'll try to do it this week
Comment 8 The Linux kernel's regression tracker (Thorsten Leemhuis) 2022-06-20 08:40:02 UTC
(In reply to Piotr Piórkowski from comment #7)
>
> > Any chance you could bisect this on upstream kernel?
> I'll try to do it this week

Any news? Was the issue maybe fixed meanwhile?
Comment 9 Chris Bainbridge 2023-10-22 19:19:42 UTC
The IOMMU error is caused by a buggy VL805 firmware. It is more visible with the Debian kernel as Debian patches the kernel to enable IOMMU by default. The updated firmware can be installed using the VIA Windows tool (this did not work for me), or you can just turn off IOMMU.
Comment 10 The Linux kernel's regression tracker (Thorsten Leemhuis) 2023-10-25 09:23:40 UTC
(In reply to Chris Bainbridge from comment #9)
> The IOMMU error is caused by a buggy VL805 firmware. 

Makes me wonder: would it be possible to detect an old firmware and avoid the IOMMU path in this case? Or at least warn?
Comment 11 Chris Bainbridge 2023-10-29 22:39:31 UTC
(In reply to The Linux kernel's regression tracker (Thorsten Leemhuis) from comment #10)
> (In reply to Chris Bainbridge from comment #9)
> > The IOMMU error is caused by a buggy VL805 firmware. 
> 
> Makes me wonder: would it be possible to detect an old firmware and avoid
> the IOMMU path in this case? Or at least warn?

It would be possible to detect, the firmware version can be read with: 

$ sudo lspci -d 1106:3483 -xxx | awk '/^50:/ { print "VL805 FW version: " $5 $4 $3 $2 }'
VL805 FW version: 00013500
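
For anyone who wants to do the same check from a script, the awk one-liner above can be sketched in Python. This is only an illustration: it assumes, as the awk command does, that the four bytes at config-space offset 0x50 hold the firmware version in little-endian order, and `vl805_fw_version` is a hypothetical helper name.

```python
def vl805_fw_version(lspci_dump):
    """Extract the VL805 firmware version from `lspci -xxx` output text.

    Assumption (taken from the awk command above): bytes 0x50..0x53 of
    the PCI config space hold the version, least significant byte first.
    Returns the version as a hex string, or None if no "50:" row exists.
    """
    for line in lspci_dump.splitlines():
        fields = line.split()
        # A hex-dump row looks like "50: 00 35 01 00 ..." (offset, then bytes).
        if len(fields) >= 5 and fields[0] == "50:":
            # Reverse the first four bytes to print most significant first.
            return "".join(reversed(fields[1:5]))
    return None
```

The input would be the text captured from `sudo lspci -d 1106:3483 -xxx`; the function itself does not run lspci.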

imho it would be a good idea for Linux to track the latest firmware versions for *all* hardware and warn if a firmware is out-of-date (even if the firmware updater is only available on Windows). Earlier this year I had an intermittent issue with a new laptop where the desktop would hang and processes would get IO errors. But this only happened once every 3 weeks or so. It took a few months to isolate the problem to NVME firmware (it was a HP laptop with Intel NVME, and I was unaware that these drives have locked HP-specific firmware). The firmware update was a Windows executable. I've also seen many forum posts where people have problems that were resolved by updates to GPU/motherboard/NVME/ethernet/wifi etc. firmware. Many of these problems could have been resolved a lot quicker if the kernel log contained "old firmware detected!".
Comment 12 The Linux kernel's regression tracker (Thorsten Leemhuis) 2023-10-30 16:21:07 UTC
(In reply to Chris Bainbridge from comment #11)
> It would be possible to detect […]

Thx for that.

> imho it would be a good idea for Linux to track the latest firmware versions
> for *all* hardware […]

Pretty sure that is something that should be done in userspace, as there it's a lot easier to update the dataset with the latest firmware versions.
Comment 13 The Linux kernel's regression tracker (Thorsten Leemhuis) 2023-10-30 16:26:04 UTC
Ideally this should be fixed, but we are kinda stuck here:

* Piotr, do you even care about this after all this time?

* does the problem even still happen with the latest mainline kernel? might be a good idea to test this with 6.6, or with 6.7-rc1 once it is out (e.g. in two weeks from now), before doing anything else.

* The regression was never bisected, hence it's unclear which developers are responsible for handling this (USB? IOMMU? something else?). But well, with a bit of luck Mathias, who commented earlier, might see this and share his thoughts.
Comment 14 Piotr Piórkowski 2023-10-30 17:18:10 UTC
Honestly, I had already forgotten about this problem. In the meantime, I changed the HW. Lately I have little time, but I still have this HW somewhere and if there is a need I can verify something.
Comment 15 The Linux kernel's regression tracker (Thorsten Leemhuis) 2023-10-31 09:10:27 UTC
(In reply to Piotr Piórkowski from comment #14)
> Honestly, I had already forgotten about this problem.

Happens, no worries :-D

> Lately I have little time, but I have somewhere this HW
> still and if there is a need I can verify something

I suggest you wait for Mathias to speak up first, I guess he should know best what's the best way forward here.
Comment 16 Chris Bainbridge 2023-10-31 15:40:00 UTC
> Pretty sure that is something that should be done in userspace, as there it's
> a lot easier to update the dataset with the latest firmware versions.

True, but then you have the problem of convincing distributions to ship it and enable it by default.

> The regression was never bisected

It's not a regression in the kernel as Piotr said in comment #7. It appeared on Debian kernels because Debian enabled IOMMU by default (but not for the GPU). This is a Debian-specific patch.
Comment 17 Mathias Nyman 2023-11-02 10:59:57 UTC
If I understand correctly this is caused by the VIA VL805 xHC controller with bad firmware accessing some DMA address outside the allowed range.

With IOMMU enabled the IOMMU will prevent this access, and the controller fails.

I'm speculating here, but it could be possible the controller accesses past one of the DMA ranges while trying to read-ahead.

If we can figure out past which area, then it's possible to make a driver workaround for this controller that allocates a bit larger DMA chunk for that specific purpose.

DMA memory allocated for xHC use before any USB device is connected:

- dcbaa device context base address array.
  arrays of pointers to device contexts
  dma_alloc_coherent(dev)

- command ring
  dma_pool_zalloc(segment_pool)

- event ring
  dma_pool_zalloc(segment_pool)

- event ring segment table
  info about event ring, segments, size and location
  dma_alloc_coherent(dev)

- Scratchpad,
  only touched (RW) by xHC controller, not driver.
  dma_alloc_coherent()

DMA memory allocated for each connected USB device.

- device contexts
  dma_pool_zalloc(device_pool)

- transfer rings,
  contains TRBs, metadata about transfers.
  dma_pool_zalloc(segment_pool)

- stream contexts,
  dma_alloc_coherent() or dma_pool_alloc(*_streams_pool)
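
To narrow down "past which area", one option is to log the DMA address and size of each allocation listed above and compare them with the fault address from the DMAR message. A minimal Python sketch of that comparison follows; the helper name `find_overrun`, the `slack` window and any addresses fed to it are hypothetical, and the real DMA addresses would have to come from debug output added to the driver.

```python
def find_overrun(fault_addr, allocations, slack=4096):
    """Find the allocation that fault_addr lies just past.

    allocations: iterable of (name, dma_base, size) tuples.
    slack: how far past the end of an allocation still counts as a
           likely read-ahead overrun (one 4 KiB page by default; this
           window is an assumption, not a known hardware limit).
    Returns (name, bytes_past_end) for the closest match, or None.
    """
    best = None
    for name, base, size in allocations:
        end = base + size              # first address after the allocation
        overshoot = fault_addr - end   # 0 means the very first byte past it
        if 0 <= overshoot < slack:
            if best is None or overshoot < best[1]:
                best = (name, overshoot)
    return best
```

If, for example, an event ring segment were found to end exactly at the faulting address, that would point at the event ring as the structure the controller reads past.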
Comment 18 The Linux kernel's regression tracker (Thorsten Leemhuis) 2023-11-04 08:06:24 UTC
(In reply to Chris Bainbridge from comment #16)
> It's not a regression in the kernel as Piotr said in comment #7.

Ahh, sorry, missed/forgot that in all the regressions I deal with. Thx for pointing it out!

BTW, thx Mathias for nevertheless looking into this.