Bug 217122
Summary: | Regression in xhci driver since 6.1 "Transfer event TRB DMA ptr not part of current TD" | ||
---|---|---|---|
Product: | Drivers | Reporter: | Jürgen Herrmann (t-5) |
Component: | USB | Assignee: | Default virtual assignee for Drivers/USB (drivers_usb) |
Status: | RESOLVED DUPLICATE | ||
Severity: | normal | CC: | dmitri926, mario.limonciello, mathias.nyman, regressions, ZeroBeat |
Priority: | P1 | ||
Hardware: | AMD | ||
OS: | Linux | ||
Kernel Version: | 6.2 | Subsystem: | |
Regression: | No | Bisected commit-id: | |
Attachments: | collection of txt files with various hardware/software states |
When you say "any 6.0 version", have you also checked 6.0.y releases? As a guess - could this be related to runtime power management? There is a policy change that happened in 6.1, but it also backported to 6.0.y. So if that is the root cause I would have expected a failure in 6.0.y as well. If it's not runtime PM, could you possibly do a bisect? As it's an occasional issue it might be a lengthy process depending upon how frequently it happens. Yes, i also checked 6.0.y releases, currently I have 6.0.19 running and the error does not happen there. So I guess it is not runtime power management related? You say there was a policy change, how can I force the old policy on 6.2 to rule that out? And if a bisect is necessary can you point me to some information how to do it? Controller sent a transfer completion event for a transfer block (TRB) queued at *c740 in the ring buffer, but driver is still waiting for a completion of the previous transfer at *c730. If git bisect doesn't point to a clear cause then xhci tracing and dynamic debug could show what happened to that lost transfer. steps for xhci trace and dynamic debug: < disconnect all extra unrelated usb devices > mount -t debugfs none /sys/kernel/debug echo 'module xhci_hcd =p' >/sys/kernel/debug/dynamic_debug/control echo 'module usbcore =p' >/sys/kernel/debug/dynamic_debug/control echo 81920 > /sys/kernel/debug/tracing/buffer_size_kb echo 1 > /sys/kernel/debug/tracing/events/xhci-hcd/enable echo 1 > /sys/kernel/debug/tracing/tracing_on < Reproduce issue > Send output of dmesg Send content of /sys/kernel/debug/tracing/trace Trace file grows quickly so try to copy it as soon as possible after triggering the issue. > So I guess it is not runtime power management related? You say there was a > policy change, how can I force the old policy on 6.2 to rule that out? You can change /sys/bus/pci/drivers/${xhci_pci_device}/power/control from "auto" to "on". > And if a bisect is necessary can you point me to some information how to do > it? https://www.kernel.org/doc/html/next/admin-guide/bug-bisect.html # jh @ laptop in ~ [22:03:49] C:1 $ cat /sys/bus/pci/drivers/xhci_hcd/0000:05:00.4/power/control on # jh @ laptop in ~ [22:04:00] $ cat /sys/bus/pci/drivers/xhci_hcd/0000:05:00.3/power/control on So power management does not seem to be the root cause. reagrding bisect, should i mark 6.0 as good or 6.0.19? already have the git tree checked out and compiling (6.2) and I am willing to go throught the maybe lengthy process. > So power management does not seem to be the root cause. OK, good. >reagrding bisect, should i mark 6.0 as good or 6.0.19? 6.0.19 might not strictly have everything that is in 6.1/6.2. There are things that could be there from 6.3-rc1 as well. So it's best IMO to set 6.0 as good and 6.1 as bad. >and I am willing to go throught the maybe lengthy process. Thanks! OK, thanks. checked out v6.1 instead of v6.2 and it's compiling. Will take my laptop through some tough compiling times in the next days I guess ;) I will report back after I have some result or problems with bisect arise along the way. Update: I started with 6.1.0 to verify the bug is there still. As of now it seems it is NOT in this version. Will have to run it longer, to be really sure. To be honest I probably never tried 6.1.0 but started at some point release like 6.1.5 or so, I compile kernels in an irregular interval, something like 3-4 weeks. And then I always pick the latest stable point release. So maybe the bug was backported from 6.2 to 6.1.y? Maybe some of you guys will have some suspicion? If not, and if it turns out that 6.1.0 does NOT have the bug I will start a new bisect with 6.1 good, 6.2 bad. (In reply to Jürgen Herrmann from comment #8) > So maybe the bug was backported from 6.2 to 6.1.y? Quite possible. Anyway, did your bisection make any progress? Or did the problem vanish on its own? I already did one complete bisect which took 13 steps and came out with a change that 100$ was totally unrelated to the problem. So I'm doing it again, suspect I made a mistake along the way. I takes a lot of time because I do one bisect a day as the problem can be so sporadic, to be totally sure I don't miss it. OMG, just found out why bisect failed again. The bug only manifests itself, if nvidia drivers are NOT loaded. Currently running 6.2.11 to confirm and the bug has not shown up since 24h with nvidia drivers loaded. Due do improper systemd configuration I had to restart once to get nvidia drivers compiled and loaded before Xorg started. Since I fixed that problem and every time the nvidia drivers are loaded now every time I log on. Will disable DMKS for nvidia drivers for now and start the bisect at the last known bad state. Update: now the bisect steps are different, currently I'm 4 steps away from completion. Another round of bisecting is done. I'm quite sure this result is useless, I will post it here nonetheless. Please comment! $ git bisect good b006c439d58db625318bf2207feabf847510a8a6 is the first bad commit commit b006c439d58db625318bf2207feabf847510a8a6 Author: Dominik Brodowski <linux@dominikbrodowski.net> Date: Thu Sep 22 15:59:31 2022 +0200 hwrng: core - start hwrng kthread also for untrusted sources Start the hwrng kthread even if the hwrng source has a quality setting of zero. Then, every crng reseed interval, one batch of data from this zero-quality hwrng source will be mixed into the CRNG pool. This patch is based on the assumption that data from a hwrng source will not actively harm the CRNG state. Instead, many hwrng sources (such as TPM devices), even though they are assigend a quality level of zero, actually provide some entropy, which is good enough to mix into the CRNG pool every once in a while. Cc: Herbert Xu <herbert@gondor.apana.org.au> Cc: Jason A. Donenfeld <Jason@zx2c4.com> Signed-off-by: Dominik Brodowski <linux@dominikbrodowski.net> Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au> drivers/char/hw_random/core.c | 36 ++++++++++-------------------------- 1 file changed, 10 insertions(+), 26 deletions(-) That's ... interesting. Do you have fTPM enabled in your BIOS? If so, can you disable it? And, if you don't have that knob in your BIOS can you please try to apply f1324bbc4011 ("tpm: disable hwrng for fTPM on some AMD designs") to a kernel that otherwise fails? Does that help? I cannot find anything about TPM in the BIOS. It's a lenovo laptop, If you need the actual model, tell me - I will have to flip it on it's back to read it's model number. I found an option in the BIOS named "AMD security processor" - disabled it and I will test now... Maybe it's the correct option. If that does still fail I will apply "f1324bbc4011 ("tpm: disable hwrng for fTPM on some AMD designs")" > I found an option in the BIOS named "AMD security processor" - disabled it > and I will test now... If you disabled it and no longer have a /dev/tpm0 then yes that's the correct option. > If that does still fail I will apply "f1324bbc4011 ("tpm: disable hwrng for > fTPM on some AMD designs")" OK. can confirm, there's no longer a /dev/tpm0. Will compile previously failing 6.1.0 now and see what happens... 6.1.0 (previously faulty) is running now with TPM disabled in the BIOS. If that "fixes" the bug, do you want me to still apply f1324bbc4011 ("tpm: disable hwrng for fTPM on some AMD designs")" and retest it with TPM re-enabled? Or is this fix in newer kernels already? That commit has already gone to stable kernels. But for completeness's sake I think it would be good to re-enable TPM on a failing kernel and then try to apply that patch and confirm it's really fixed by that. PS: I think I got that result as well in a previous bisect round, but I never ever would have thought that this commit would be the issue. Seemed TOTALLY unrelated. Oh my, better let the judges be the real experts :) Thanks for all your help and patience so far! OK, I will do that an dreport back if that particular commit also fixes the issue with TPM re-enabled. OK, I will do that an dreport back if that particular commit also fixes the issue with TPM re-enabled. I re-enabled TPM (confirmed by having /dev/tpm0) and applied the patch. Bug is gone. That's great news. It's actually a duplicate of https://bugzilla.kernel.org/show_bug.cgi?id=216989 then. That patch rolled out to stable kernels, so this issue is fixed. Cool, thanks for your help! This was definitely the longest debugging session I ever did :) *** This bug has been marked as a duplicate of bug 216989 *** Is this regression fixed? I'm asking because I got this issue report: https://github.com/ZerBea/hcxdumptool/issues/334 dmesg showing something similar: [ 143.833329] xhci_hcd 0000:00:0c.0: ERROR Transfer event TRB DMA ptr not part of current TD ep_index 8 comp_code 13 [ 143.833865] xhci_hcd 0000:00:0c.0: Looking for event-dma 00000000010a0070 trb-start 00000000010a0080 trb-end 00000000010a00b0 seg-start 00000000010a0000 seg-end 00000000010a0ff0 Try this patch. https://git.kernel.org/pub/scm/linux/kernel/git/jarkko/linux-tpmdd.git/commit/?h=next&id=2353032d292e97005bebf346939841c6c008df7a Thanks for the fast response and the patch. Looks like the problem still exist after applying the patch: https://github.com/ZerBea/hcxdumptool/issues/334#issuecomment-1667093519 Some additional information: I'm on kernel 6.4.8 and noticed that a WiFi device connected to an USB 3 port doesn't work: Bus 002 Device 003: ID 0e8d:7612 MediaTek Inc. MT7612U 802.11a/b/g/n/ac Wireless Adapter [ 7280.709821] usb 2-3: New USB device found, idVendor=0e8d, idProduct=7612, bcdDevice= 1.00 [ 7280.709829] usb 2-3: New USB device strings: Mfr=2, Product=3, SerialNumber=4 [ 7280.709830] usb 2-3: Product: Wireless [ 7280.709832] usb 2-3: Manufacturer: MediaTek Inc. [ 7280.709833] usb 2-3: SerialNumber: 000000000 [ 7280.970177] usb 2-3: reset SuperSpeed USB device number 2 using xhci_hcd [ 7280.994419] mt76x2u 2-3:1.0: ASIC revision: 76120044 [ 7281.025164] mt76x2u 2-3:1.0: ROM patch build: 20141115060606a [ 7281.202395] mt76x2u 2-3:1.0: Firmware Version: 0.0.00 [ 7281.202404] mt76x2u 2-3:1.0: Build: 1 [ 7281.202405] mt76x2u 2-3:1.0: Build Time: 201507311614____ [ 7281.928318] ieee80211 phy0: Selected rate control algorithm 'minstrel_ht' [ 7281.928840] usbcore: registered new interface driver mt76x2u [ 7281.942982] mt76x2u 2-3:1.0 wlp22s0f0u3: renamed from wlan0 $ sudo iw dev wlp22s0f0u3 set type monitor [ 7475.445473] mt76x2u 2-3:1.0 wlp22s0f0u3: entered promiscuous mode $ sudo ip link set wlp22s0f0u3 up $ sudo iw dev wlp22s0f0u3 set channel 6 $ iw dev phy#1 Interface wlp22s0f0u3 ifindex 4 wdev 0x100000001 addr 00:c0:ca:ad:0e:49 type monitor channel 6 (2437 MHz), width: 20 MHz (no HT), center1: 2437 MHz txpower 20.00 dBm multicast TXQ: qsz-byt qsz-pkt flows drops marks overlmt hashcol tx-bytes tx-packets 0 0 0 0 0 0 0 0 0 $ tshark -i wlp22s0f0u3 Capturing on 'wlp22s0f0u3' ** (tshark:12534) 08:42:20.045697 [Main MESSAGE] -- Capture started. ** (tshark:12534) 08:42:20.045804 [Main MESSAGE] -- File: "/tmp/wireshark_wlp22s0f0u3LO1981.pcapng" ^Ctshark: 0 packets captured $ sudo ip link set wlp22s0f0u3 down $ sudo iw dev wlp22s0f0u3 set type managed [ 7479.304196] mt76x2u 2-3:1.0 wlp22s0f0u3: left promiscuous mode Controller: description: USB controller product: Renoir/Cezanne USB 3.1 vendor: Advanced Micro Devices, Inc. [AMD] physical id: 0.3 bus info: pci@0000:30:00.3 version: 00 width: 64 bits clock: 33MHz capabilities: xhci bus_master cap_list configuration: driver=xhci_hcd latency=0 resources: irq:33 memory:fc300000-fc3fffff Mainboard: MSI B550 A-Pro While all USB 2 devices are working as expected on this port, none of the USB 3 devices is working. I don't think that it is a driver problem, because I tested several devices manufactured by different vendors (mt76m rt2800usb). Unfortunately I got no dmesg message like this user https://github.com/ZerBea/hcxdumptool/issues/334#issuecomment-1667093519 I guess all this problems are related to each other. BTW: A long time ago, we got something similar related to XHCI): https://bugzilla.kernel.org/show_bug.cgi?id=202541 but that seems to be fixed, now. This seems like it probably is a different problem than the original one. I suggest opening a new issue; and if you can please remove the VM from the equation to make it easier to follow and reproduce. Thanks. I informed the user and advised him to join us here: https://github.com/ZerBea/hcxdumptool/issues/334#issuecomment-1668171443 Hi Mario & Michael, I was able to test on a physical box. I'm not seeing the "Transfer event TRB DMA ptr not part of current TD" error in dmesg but not able to capture any packets either. $ uname -a Linux sonylaptop 6.5.0-0.a.test-amd64 #1 SMP PREEMPT_DYNAMIC Debian 6.5~rc4-1~exp1a~test (2023-08-07) x86_64 GNU/Linux [ 1189.175651] usb 3-2: new high-speed USB device number 5 using xhci_hcd [ 1189.326329] usb 3-2: New USB device found, idVendor=0e8d, idProduct=7961, bcdDevice= 1.00 [ 1189.326351] usb 3-2: New USB device strings: Mfr=6, Product=7, SerialNumber=8 [ 1189.326358] usb 3-2: Product: Wireless_Device [ 1189.326363] usb 3-2: Manufacturer: MediaTek Inc. [ 1189.326368] usb 3-2: SerialNumber: 000000000 [ 1189.332931] bluetooth hci1: firmware: direct-loading firmware mediatek/BT_RAM_CODE_MT7961_1_2_hdr.bin [ 1189.337734] mt7921u 3-2:1.3: firmware: direct-loading firmware mediatek/WIFI_RAM_CODE_MT7961_1.bin [ 1195.262160] mt7921u 3-2:1.3: firmware: direct-loading firmware mediatek/WIFI_RAM_CODE_MT7961_1.bin [ 1195.276095] mt7921u 3-2:1.3: WM Firmware Version: ____010000, Build Time: 20230526130958 [ 1196.916463] mt7921u 3-2:1.3 wlx00c0cab3c769: renamed from wlan0 description: Motherboard product: VAIO vendor: Sony Corporation physical id: 0 version: N/A serial: N/A *-firmware description: BIOS vendor: Insyde Corp. physical id: 0 version: R0220DA date: 11/18/2013 size: 128KiB capacity: 3MiB capabilities: pci pnp upgrade shadowing cdboot bootselect edd int9keyboard int10video acpi usb smartbattery biosbootspecification netboot uefi *-cpu description: CPU product: Intel(R) Core(TM) i5-3337U CPU @ 1.80GHz vendor: Intel Corp. physical id: 4 bus info: cpu@0 version: 6.58.9 serial: N/A slot: N/A size: 2591MHz *-usbhost:1 product: xHCI Host Controller vendor: Linux 6.5.0-0.a.test-amd64 xhci-hcd physical id: 1 bus info: usb@4 logical name: usb4 version: 6.05 capabilities: usb-3.00 configuration: driver=hub slots=4 speed=5000Mbit/s $ lsusb Bus 002 Device 002: ID 8087:0024 Intel Corp. Integrated Rate Matching Hub Bus 002 Device 001: ID 1d6b:0002 Linux Foundation 2.0 root hub Bus 001 Device 004: ID 0bda:5729 Realtek Semiconductor Corp. Front Camera Bus 001 Device 003: ID 0489:e062 Foxconn / Hon Hai BCM43142A0 Bus 001 Device 002: ID 8087:0024 Intel Corp. Integrated Rate Matching Hub Bus 001 Device 001: ID 1d6b:0002 Linux Foundation 2.0 root hub Bus 004 Device 001: ID 1d6b:0003 Linux Foundation 3.0 root hub Bus 003 Device 005: ID 0e8d:7961 MediaTek Inc. Wireless_Device Bus 003 Device 001: ID 1d6b:0002 Linux Foundation 2.0 root hub $ hcxdumptool -L Requesting physical interface capabilities. This may take some time. Please be patient... available wlan devices: phy idx hw-mac virtual-mac m ifname driver (protocol) --------------------------------------------------------------------------------------------- 3 6 00c0cab3c769 00c0cab3c769 * wlx00c0cab3c769 mt7921u (NETLINK) * active monitor mode available + monitor mode available - no monitor mode available :~/work$ sudo iw dev $IFACE set type monitor :~/work$ sudo ip link set $IFACE up :~/work$ sudo iw dev phy#3 Interface wlx00c0cab3c769 ifindex 6 wdev 0x300000001 addr d8:5d:fb:08:df:5c type monitor channel 6 (2437 MHz), width: 20 MHz (no HT), center1: 2437 MHz txpower 3.00 dBm multicast TXQ: qsz-byt qsz-pkt flows drops marks overlmt hashcol tx-bytes tx-packets 0 0 0 0 0 0 0 0 0 $ sudo tshark -i $IFACE Running as user "root" and group "root". This could be dangerous. Capturing on 'wlx00c0cab3c769' ** (tshark:8102) 20:10:07.839462 [Main MESSAGE] -- Capture started. ** (tshark:8102) 20:10:07.839571 [Main MESSAGE] -- File: "/tmp/wireshark_wlx00c0cab3c769SOIS91.pcapng" ^Ctshark: 0 packets captured |
Created attachment 303837 [details] collection of txt files with various hardware/software states Occasionally this error happens: [94467.984745] xhci_hcd 0000:05:00.4: ERROR Transfer event TRB DMA ptr not part of current TD ep_index 3 comp_code 1 [94467.984751] xhci_hcd 0000:05:00.4: Looking for event-dma 000000010f10c730 trb-start 000000010f10c740 trb-end 000000010f10c740 seg-start 000000010f10 c000 seg-end 000000010f10cff0 the playback software (camilladsp then reports an error: 2023-03-03 05:41:37.010408 WARN [src/alsadevice.rs:260] Capture device failed while waiting for available frames, error: ALSA function 'snd_pcm_wait' f ailed with error 'EPIPE: Broken pipe' 2023-03-03 05:41:37.013214 ERROR [src/bin.rs:364] Capture error: ALSA function 'snd_pcm_wait' failed with error 'EPIPE: Broken pipe' This did not occur in any 6.0 version and was introduced in 6.1, still present in 6.2