Bug 217122

Summary: Regression in xhci driver since 6.1 "Transfer event TRB DMA ptr not part of current TD"
Product: Drivers Reporter: Jürgen Herrmann (t-5)
Component: USBAssignee: Default virtual assignee for Drivers/USB (drivers_usb)
Status: RESOLVED DUPLICATE    
Severity: normal CC: dmitri926, mario.limonciello, mathias.nyman, regressions, ZeroBeat
Priority: P1    
Hardware: AMD   
OS: Linux   
Kernel Version: 6.2 Subsystem:
Regression: No Bisected commit-id:
Attachments: collection of txt files with various hardware/software states

Description Jürgen Herrmann 2023-03-03 19:53:55 UTC
Created attachment 303837 [details]
collection of txt files with various hardware/software states

Occasionally this error happens:

[94467.984745] xhci_hcd 0000:05:00.4: ERROR Transfer event TRB DMA ptr not part of current TD ep_index 3 comp_code 1
[94467.984751] xhci_hcd 0000:05:00.4: Looking for event-dma 000000010f10c730 trb-start 000000010f10c740 trb-end 000000010f10c740 seg-start 000000010f10
c000 seg-end 000000010f10cff0

the playback software (camilladsp then reports an error:

2023-03-03 05:41:37.010408 WARN [src/alsadevice.rs:260] Capture device failed while waiting for available frames, error: ALSA function 'snd_pcm_wait' f
ailed with error 'EPIPE: Broken pipe'
2023-03-03 05:41:37.013214 ERROR [src/bin.rs:364] Capture error: ALSA function 'snd_pcm_wait' failed with error 'EPIPE: Broken pipe'

This did not occur in any 6.0 version and was introduced in 6.1, still present in 6.2
Comment 1 Mario Limonciello (AMD) 2023-03-06 02:04:21 UTC
When you say "any 6.0 version", have you also checked 6.0.y releases?  
As a guess - could this be related to runtime power management?  There is a policy change that happened in 6.1, but it also backported to 6.0.y.  So if that is the root cause I would have expected a failure in 6.0.y as well.

If it's not runtime PM, could you possibly do a bisect?
As it's an occasional issue it might be a lengthy process depending upon how frequently it happens.
Comment 2 Jürgen Herrmann 2023-03-06 09:17:05 UTC
Yes, i also checked 6.0.y releases, currently I have 6.0.19 running and the error does not happen there.

So I guess it is not runtime power management related? You say there was a policy change, how can I force the old policy on 6.2 to rule that out?

And if a bisect is necessary can you point me to some information how to do it?
Comment 3 Mathias Nyman 2023-03-06 11:35:44 UTC
Controller sent a transfer completion event for a transfer block (TRB) queued at *c740 in the ring buffer, but driver is still waiting for a completion of the previous transfer at *c730.

If git bisect doesn't point to a clear cause then xhci tracing and dynamic
debug could show what happened to that lost transfer.

steps for xhci trace and dynamic debug:

< disconnect all extra unrelated usb devices >
mount -t debugfs none /sys/kernel/debug
echo 'module xhci_hcd =p' >/sys/kernel/debug/dynamic_debug/control
echo 'module usbcore =p' >/sys/kernel/debug/dynamic_debug/control
echo 81920 > /sys/kernel/debug/tracing/buffer_size_kb
echo 1 > /sys/kernel/debug/tracing/events/xhci-hcd/enable
echo 1 > /sys/kernel/debug/tracing/tracing_on
< Reproduce issue >
Send output of dmesg
Send content of /sys/kernel/debug/tracing/trace

Trace file grows quickly so try to copy it as soon as possible after triggering the issue.
Comment 4 Mario Limonciello (AMD) 2023-03-06 13:15:24 UTC
> So I guess it is not runtime power management related? You say there was a
> policy change, how can I force the old policy on 6.2 to rule that out?

You can change /sys/bus/pci/drivers/${xhci_pci_device}/power/control from "auto" to "on".

> And if a bisect is necessary can you point me to some information how to do
> it?

https://www.kernel.org/doc/html/next/admin-guide/bug-bisect.html
Comment 5 Jürgen Herrmann 2023-03-06 21:06:52 UTC
# jh @ laptop in ~ [22:03:49] C:1
$ cat /sys/bus/pci/drivers/xhci_hcd/0000:05:00.4/power/control 
on

# jh @ laptop in ~ [22:04:00] 
$ cat /sys/bus/pci/drivers/xhci_hcd/0000:05:00.3/power/control
on

So power management does not seem to be the root cause.

reagrding bisect, should i mark 6.0 as good or 6.0.19?

already have the git tree checked out and compiling (6.2) and I am willing to go throught the maybe lengthy process.
Comment 6 Mario Limonciello (AMD) 2023-03-06 21:09:56 UTC
> So power management does not seem to be the root cause.

OK, good.

>reagrding bisect, should i mark 6.0 as good or 6.0.19?

6.0.19 might not strictly have everything that is in 6.1/6.2.  There are things that could be there from 6.3-rc1 as well.

So it's best IMO to set 6.0 as good and 6.1 as bad.

>and I am willing to go throught the maybe lengthy process.

Thanks!
Comment 7 Jürgen Herrmann 2023-03-06 21:18:39 UTC
OK, thanks. checked out v6.1 instead of v6.2 and it's compiling. Will take my laptop through some tough compiling times in the next days I guess ;)

I will report back after I have some result or problems with bisect arise along the way.
Comment 8 Jürgen Herrmann 2023-03-07 20:25:54 UTC
Update: I started with 6.1.0 to verify the bug is there still. As of now it seems it is NOT in this version. Will have to run it longer, to be really sure.

To be honest I probably never tried 6.1.0 but started at some point release like 6.1.5 or so, I compile kernels in an irregular interval, something like 3-4 weeks. And then I always pick the latest stable point release.

So maybe the bug was backported from 6.2 to 6.1.y? Maybe some of you guys will have some suspicion? If not, and if it turns out that 6.1.0 does NOT have the bug I will start a new bisect with 6.1 good, 6.2 bad.
Comment 9 The Linux kernel's regression tracker (Thorsten Leemhuis) 2023-03-23 14:00:19 UTC
(In reply to Jürgen Herrmann from comment #8)

> So maybe the bug was backported from 6.2 to 6.1.y?

Quite possible. 

Anyway, did your bisection make any progress? Or did the problem vanish on its own?
Comment 10 Jürgen Herrmann 2023-03-23 14:59:47 UTC
I already did one complete bisect which took 13 steps and came out with a change that 100$ was totally unrelated to the problem. So I'm  doing it again, suspect I made a mistake along the way. I takes a lot of time because I do one bisect a day as the problem can be so sporadic, to be totally sure I don't miss it.
Comment 11 Jürgen Herrmann 2023-04-19 16:36:13 UTC
OMG, just found out why bisect failed again. The bug only manifests itself, if nvidia drivers are NOT loaded. Currently running 6.2.11 to confirm and the bug has not shown up since 24h with nvidia drivers loaded.

Due do improper systemd configuration I had to restart once to get nvidia drivers compiled and loaded before Xorg started. Since I fixed that problem and every time the nvidia drivers are loaded now every time I log on.

Will disable DMKS for nvidia drivers for now and start the bisect at the last known bad state.
Comment 12 Jürgen Herrmann 2023-05-06 17:31:01 UTC
Update: now the bisect steps are different, currently I'm 4 steps away from completion.
Comment 13 Jürgen Herrmann 2023-05-10 20:06:53 UTC
Another round of bisecting is done. I'm quite sure this result is useless, I will post it here nonetheless. Please comment!

$ git bisect good
b006c439d58db625318bf2207feabf847510a8a6 is the first bad commit
commit b006c439d58db625318bf2207feabf847510a8a6
Author: Dominik Brodowski <linux@dominikbrodowski.net>
Date:   Thu Sep 22 15:59:31 2022 +0200

    hwrng: core - start hwrng kthread also for untrusted sources
    
    Start the hwrng kthread even if the hwrng source has a quality setting
    of zero. Then, every crng reseed interval, one batch of data from this
    zero-quality hwrng source will be mixed into the CRNG pool.
    
    This patch is based on the assumption that data from a hwrng source
    will not actively harm the CRNG state. Instead, many hwrng sources
    (such as TPM devices), even though they are assigend a quality level of
    zero, actually provide some entropy, which is good enough to mix into
    the CRNG pool every once in a while.
    
    Cc: Herbert Xu <herbert@gondor.apana.org.au>
    Cc: Jason A. Donenfeld <Jason@zx2c4.com>
    Signed-off-by: Dominik Brodowski <linux@dominikbrodowski.net>
    Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>

 drivers/char/hw_random/core.c | 36 ++++++++++--------------------------
 1 file changed, 10 insertions(+), 26 deletions(-)
Comment 14 Mario Limonciello (AMD) 2023-05-10 20:15:34 UTC
That's ... interesting.  

Do you have fTPM enabled in your BIOS?  If so, can you disable it?
Comment 15 Mario Limonciello (AMD) 2023-05-10 21:25:49 UTC
And, if you don't have that knob in your BIOS can you please try to apply 
f1324bbc4011 ("tpm: disable hwrng for fTPM on some AMD designs")
to a kernel that otherwise fails?

Does that help?
Comment 16 Jürgen Herrmann 2023-05-10 21:51:04 UTC
I cannot find anything about TPM in the BIOS. It's a lenovo laptop, If you need the actual model, tell me - I will have to flip it on it's back to read it's model number.

I found an option in the BIOS named "AMD security processor" - disabled it and I will test now... Maybe it's the correct option. If that does still fail I will apply "f1324bbc4011 ("tpm: disable hwrng for fTPM on some AMD designs")"
Comment 17 Mario Limonciello (AMD) 2023-05-10 21:52:51 UTC
> I found an option in the BIOS named "AMD security processor" - disabled it
> and I will test now...

If you disabled it and no longer have a /dev/tpm0 then yes that's the correct option.

>  If that does still fail I will apply "f1324bbc4011 ("tpm: disable hwrng for
>  fTPM on some AMD designs")"

OK.
Comment 18 Jürgen Herrmann 2023-05-10 21:54:35 UTC
can confirm, there's no longer a /dev/tpm0. Will compile previously failing 6.1.0 now and see what happens...
Comment 19 Jürgen Herrmann 2023-05-10 22:23:34 UTC
6.1.0 (previously faulty) is running now with TPM disabled in the BIOS. If that "fixes" the bug, do you want me to still apply f1324bbc4011 ("tpm: disable hwrng for
 fTPM on some AMD designs")" and retest it with TPM re-enabled?

Or is this fix in newer kernels already?
Comment 20 Mario Limonciello (AMD) 2023-05-10 22:25:51 UTC
That commit has already gone to stable kernels.

But for completeness's sake I think it would be good to re-enable TPM on a failing kernel and then try to apply that patch and confirm it's really fixed by that.
Comment 21 Jürgen Herrmann 2023-05-10 22:26:42 UTC
PS: I think I got that result as well in a previous bisect round, but I never ever would have thought that this commit would be the issue. Seemed TOTALLY unrelated. Oh my, better let the judges be the real experts :)

Thanks for all your help and patience so far!
Comment 22 Jürgen Herrmann 2023-05-10 22:28:51 UTC
OK, I will do that an dreport back if that particular commit also fixes the issue with TPM re-enabled.
Comment 23 Jürgen Herrmann 2023-05-10 22:29:07 UTC
OK, I will do that an dreport back if that particular commit also fixes the issue with TPM re-enabled.
Comment 24 Jürgen Herrmann 2023-05-12 16:38:10 UTC
I re-enabled TPM (confirmed by having /dev/tpm0) and applied the patch. Bug is gone.
Comment 25 Mario Limonciello (AMD) 2023-05-12 16:45:31 UTC
That's great news.  It's actually a duplicate of https://bugzilla.kernel.org/show_bug.cgi?id=216989 then.

That patch rolled out to stable kernels, so this issue is fixed.
Comment 26 Jürgen Herrmann 2023-05-12 16:51:39 UTC
Cool, thanks for your help! This was definitely the longest debugging session I ever did :)
Comment 27 Mario Limonciello (AMD) 2023-05-12 17:00:04 UTC

*** This bug has been marked as a duplicate of bug 216989 ***
Comment 28 Michael 2023-08-02 15:48:36 UTC
Is this regression fixed?

I'm asking because I got this issue report:
https://github.com/ZerBea/hcxdumptool/issues/334

dmesg showing something similar:
[  143.833329] xhci_hcd 0000:00:0c.0: ERROR Transfer event TRB DMA ptr not part of current TD ep_index 8 comp_code 13
[  143.833865] xhci_hcd 0000:00:0c.0: Looking for event-dma 00000000010a0070 trb-start 00000000010a0080 trb-end 00000000010a00b0 seg-start 00000000010a0000 seg-end 00000000010a0ff0
Comment 30 Michael 2023-08-02 16:46:55 UTC
Thanks for the fast response and the patch.
Comment 31 Michael 2023-08-07 05:12:10 UTC
Looks like the problem still exist after applying the patch:
https://github.com/ZerBea/hcxdumptool/issues/334#issuecomment-1667093519
Comment 32 Michael 2023-08-07 07:01:01 UTC
Some additional information:

I'm on kernel 6.4.8 and noticed that a WiFi device connected to an USB 3 port doesn't work:
Bus 002 Device 003: ID 0e8d:7612 MediaTek Inc. MT7612U 802.11a/b/g/n/ac Wireless Adapter

[ 7280.709821] usb 2-3: New USB device found, idVendor=0e8d, idProduct=7612, bcdDevice= 1.00
[ 7280.709829] usb 2-3: New USB device strings: Mfr=2, Product=3, SerialNumber=4
[ 7280.709830] usb 2-3: Product: Wireless 
[ 7280.709832] usb 2-3: Manufacturer: MediaTek Inc.
[ 7280.709833] usb 2-3: SerialNumber: 000000000
[ 7280.970177] usb 2-3: reset SuperSpeed USB device number 2 using xhci_hcd
[ 7280.994419] mt76x2u 2-3:1.0: ASIC revision: 76120044
[ 7281.025164] mt76x2u 2-3:1.0: ROM patch build: 20141115060606a
[ 7281.202395] mt76x2u 2-3:1.0: Firmware Version: 0.0.00
[ 7281.202404] mt76x2u 2-3:1.0: Build: 1
[ 7281.202405] mt76x2u 2-3:1.0: Build Time: 201507311614____
[ 7281.928318] ieee80211 phy0: Selected rate control algorithm 'minstrel_ht'
[ 7281.928840] usbcore: registered new interface driver mt76x2u
[ 7281.942982] mt76x2u 2-3:1.0 wlp22s0f0u3: renamed from wlan0

$ sudo iw dev wlp22s0f0u3 set type monitor
[ 7475.445473] mt76x2u 2-3:1.0 wlp22s0f0u3: entered promiscuous mode

$ sudo ip link set wlp22s0f0u3 up
$ sudo iw dev wlp22s0f0u3 set channel 6
$ iw dev
phy#1
	Interface wlp22s0f0u3
		ifindex 4
		wdev 0x100000001
		addr 00:c0:ca:ad:0e:49
		type monitor
		channel 6 (2437 MHz), width: 20 MHz (no HT), center1: 2437 MHz
		txpower 20.00 dBm
		multicast TXQ:
			qsz-byt	qsz-pkt	flows	drops	marks	overlmt	hashcol	tx-bytes	tx-packets
			0	0	0	0	0	0	0	0		0

$ tshark -i wlp22s0f0u3
Capturing on 'wlp22s0f0u3'
 ** (tshark:12534) 08:42:20.045697 [Main MESSAGE] -- Capture started.
 ** (tshark:12534) 08:42:20.045804 [Main MESSAGE] -- File: "/tmp/wireshark_wlp22s0f0u3LO1981.pcapng"
^Ctshark: 
0 packets captured


$ sudo ip link set wlp22s0f0u3 down
$ sudo iw dev wlp22s0f0u3 set type managed
[ 7479.304196] mt76x2u 2-3:1.0 wlp22s0f0u3: left promiscuous mode

Controller:
                description: USB controller
                product: Renoir/Cezanne USB 3.1
                vendor: Advanced Micro Devices, Inc. [AMD]
                physical id: 0.3
                bus info: pci@0000:30:00.3
                version: 00
                width: 64 bits
                clock: 33MHz
                capabilities: xhci bus_master cap_list
                configuration: driver=xhci_hcd latency=0
                resources: irq:33 memory:fc300000-fc3fffff


Mainboard: MSI B550 A-Pro


While all USB 2 devices are working as expected on this port, none of the USB 3 devices is working. I don't think that it is a driver problem, because I tested several devices manufactured by different vendors (mt76m rt2800usb).

Unfortunately I got no dmesg message like this user
https://github.com/ZerBea/hcxdumptool/issues/334#issuecomment-1667093519

I guess all this problems are related to each other.

BTW:
A long time ago, we got something similar related to XHCI):
https://bugzilla.kernel.org/show_bug.cgi?id=202541
but that seems to be fixed, now.
Comment 33 Mario Limonciello (AMD) 2023-08-07 15:47:30 UTC
This seems like it probably is a different problem than the original one. I suggest opening a new issue; and if you can please remove the VM from the equation to make it easier to follow and reproduce.
Comment 34 Michael 2023-08-07 16:15:41 UTC
Thanks.
I informed the user and advised him to join us here:
https://github.com/ZerBea/hcxdumptool/issues/334#issuecomment-1668171443
Comment 35 Dmitri 2023-08-08 00:13:41 UTC
Hi Mario & Michael,

I was able to test on a physical box. I'm not seeing the "Transfer event TRB DMA ptr not part of current TD" error in dmesg but not able to capture any packets either.

$ uname -a
Linux sonylaptop 6.5.0-0.a.test-amd64 #1 SMP PREEMPT_DYNAMIC Debian 6.5~rc4-1~exp1a~test (2023-08-07) x86_64 GNU/Linux

[ 1189.175651] usb 3-2: new high-speed USB device number 5 using xhci_hcd
[ 1189.326329] usb 3-2: New USB device found, idVendor=0e8d, idProduct=7961, bcdDevice= 1.00
[ 1189.326351] usb 3-2: New USB device strings: Mfr=6, Product=7, SerialNumber=8
[ 1189.326358] usb 3-2: Product: Wireless_Device
[ 1189.326363] usb 3-2: Manufacturer: MediaTek Inc.
[ 1189.326368] usb 3-2: SerialNumber: 000000000
[ 1189.332931] bluetooth hci1: firmware: direct-loading firmware mediatek/BT_RAM_CODE_MT7961_1_2_hdr.bin
[ 1189.337734] mt7921u 3-2:1.3: firmware: direct-loading firmware mediatek/WIFI_RAM_CODE_MT7961_1.bin
[ 1195.262160] mt7921u 3-2:1.3: firmware: direct-loading firmware mediatek/WIFI_RAM_CODE_MT7961_1.bin
[ 1195.276095] mt7921u 3-2:1.3: WM Firmware Version: ____010000, Build Time: 20230526130958
[ 1196.916463] mt7921u 3-2:1.3 wlx00c0cab3c769: renamed from wlan0

    description: Motherboard
       product: VAIO
       vendor: Sony Corporation
       physical id: 0
       version: N/A
       serial: N/A
     *-firmware
          description: BIOS
          vendor: Insyde Corp.
          physical id: 0
          version: R0220DA
          date: 11/18/2013
          size: 128KiB
          capacity: 3MiB
          capabilities: pci pnp upgrade shadowing cdboot bootselect edd int9keyboard int10video acpi usb smartbattery biosbootspecification netboot uefi
     *-cpu
          description: CPU
          product: Intel(R) Core(TM) i5-3337U CPU @ 1.80GHz
          vendor: Intel Corp.
          physical id: 4
          bus info: cpu@0
          version: 6.58.9
          serial: N/A
          slot: N/A
          size: 2591MHz
     *-usbhost:1
                product: xHCI Host Controller
                vendor: Linux 6.5.0-0.a.test-amd64 xhci-hcd
                physical id: 1
                bus info: usb@4
                logical name: usb4
                version: 6.05
                capabilities: usb-3.00
                configuration: driver=hub slots=4 speed=5000Mbit/s

$ lsusb
Bus 002 Device 002: ID 8087:0024 Intel Corp. Integrated Rate Matching Hub
Bus 002 Device 001: ID 1d6b:0002 Linux Foundation 2.0 root hub
Bus 001 Device 004: ID 0bda:5729 Realtek Semiconductor Corp. Front Camera
Bus 001 Device 003: ID 0489:e062 Foxconn / Hon Hai BCM43142A0
Bus 001 Device 002: ID 8087:0024 Intel Corp. Integrated Rate Matching Hub
Bus 001 Device 001: ID 1d6b:0002 Linux Foundation 2.0 root hub
Bus 004 Device 001: ID 1d6b:0003 Linux Foundation 3.0 root hub
Bus 003 Device 005: ID 0e8d:7961 MediaTek Inc. Wireless_Device
Bus 003 Device 001: ID 1d6b:0002 Linux Foundation 2.0 root hub

$ hcxdumptool -L

Requesting physical interface capabilities. This may take some time.
Please be patient...


available wlan devices:

phy idx hw-mac       virtual-mac  m ifname           driver (protocol)
---------------------------------------------------------------------------------------------
  3   6 00c0cab3c769 00c0cab3c769 * wlx00c0cab3c769  mt7921u (NETLINK)

* active monitor mode available
+ monitor mode available
- no monitor mode available


:~/work$ sudo iw dev $IFACE set type monitor
:~/work$ sudo ip link set $IFACE up
:~/work$ sudo iw dev
phy#3
        Interface wlx00c0cab3c769
                ifindex 6
                wdev 0x300000001
                addr d8:5d:fb:08:df:5c
                type monitor
                channel 6 (2437 MHz), width: 20 MHz (no HT), center1: 2437 MHz
                txpower 3.00 dBm
                multicast TXQ:
                        qsz-byt qsz-pkt flows   drops   marks   overlmt hashcol tx-bytes        tx-packets
                        0       0       0       0       0       0       0       0               0
						
							
						
$ sudo tshark -i $IFACE
Running as user "root" and group "root". This could be dangerous.
Capturing on 'wlx00c0cab3c769'
 ** (tshark:8102) 20:10:07.839462 [Main MESSAGE] -- Capture started.
 ** (tshark:8102) 20:10:07.839571 [Main MESSAGE] -- File: "/tmp/wireshark_wlx00c0cab3c769SOIS91.pcapng"
^Ctshark:
0 packets captured