Bug 65021
Summary: | xhci: complete USB freeze | ||
---|---|---|---|
Product: | Drivers | Reporter: | dezifit |
Component: | USB | Assignee: | XHCI bugs virtual user (xhci) |
Status: | NEW --- | ||
Severity: | normal | CC: | alan, baolu.lu, bugzilla, dan.j.williams, keerthi, mathias.nyman, olli.salonen, robin, szg00000, xhci |
Priority: | P1 | ||
Hardware: | i386 | ||
OS: | Linux | ||
URL: | https://bugzilla.kernel.org/show_bug.cgi?id=62911 | ||
Kernel Version: | up to 3.12.8 | Subsystem: | |
Regression: | No | Bisected commit-id: | |
Attachments: |
lspci -vvvv
lsusb -vvvv
extract of dying xhci host
dmesg with enabled xHCI debugging
debug journalctl output of second vlc starting up
simpler journal output for 2x(vlc+pctv290e)
journalctl output with patched 3.18 kernel
debug patch for command ring status
journalctl output with patch from 20141127
Output of branch "for-usb-next-test" |
Description
dezifit
2013-11-15 00:38:11 UTC
Created attachment 114781 [details]
lspci -vvvv
Created attachment 114791 [details]
lsusb -vvvv
Created attachment 114801 [details]
extract of dying xhci host
Please send this to the linux-usb@vger.kernel.org mailing list.

Update: kernel 3.2 didn't cause this issue because XHCI was experimental until 3.7 and therefore simply not in the tested configuration (Ubuntu 12.04.1 stock). The bug is 100% reproducible and occurs only on audio access (video works even with xhci). A BIOS update to the recent RLH8710H.86A.0323.2013.1204.1726 didn't help.

Please test with 3.13, with CONFIG_USB_DEBUG=y. Run this command as root to enable xHCI debugging:
echo -n 'module xhci_hcd =p' > /sys/kernel/debug/dynamic_debug/control
Then start capturing dmesg, trigger the bug, and send me the resulting dmesg. I need to see a lot more of what happened to the xHCI driver before the host died to figure out what went wrong.

Created attachment 124701 [details]
dmesg with enabled xHCI debugging
The attached USB debug log was created with 3.13.1; I hope that's OK and that it contains the information you need.
Yes, that's what I was looking for. Nothing immediately wrong jumps out at me, unfortunately. I might need you to apply some debugging patches later, once I form some hypothesis as to what's going wrong.

I'd like to ask if there's been any progress on this bug? I have essentially the same kind of issue (see https://bbs.archlinux.org/viewtopic.php?id=190000), with a shutdown of usb caused by something in xhci. I'm willing to assist in getting more information if that's required. My kernel is 3.17.3-1-ARCH #1 SMP PREEMPT Fri Nov 14 23:13:48 CET 2014 x86_64 GNU/Linux. Is it worth me trying the same kind of build as in comment 6?

While dealing with halted endpoints in xhci I noticed that there could be issues with how we stop endpoints and cancel URBs as well. If I make an additional debug patch, could you apply it and run it with the xhci debugging enabled as in comment 6?

I am willing to try this, but when I looked at the drivers/usb/host files I did not see the debug flag mentioned in comment 6 (CONFIG_USB_DEBUG), nor is it in the .config prepared by the Arch pkgbuild. There are other flags with debug and usb in their names, e.g. CONFIG_DVB_USB_DEBUG, but I would need advice on which debugs to turn on.

Ah, yes, CONFIG_USB_DEBUG was for older kernels. With later kernels you only need to check that dynamic debug is enabled:
CONFIG_DYNAMIC_DEBUG=y
Check if debugfs is mounted; if not, run:
mount -t debugfs none /sys/kernel/debug
And to enable xhci debugging do:
echo -n 'module xhci_hcd =p' > /sys/kernel/debug/dynamic_debug/control
Many distros have dynamic debug enabled and debugfs mounted automatically, so you only need to echo to the dynamic_debug/control file.

OK, I have checked and it seems the stock Arch Linux kernel has CONFIG_DYNAMIC_DEBUG set, so I don't need a recompile. I will carry out the comment 6 type process when I get home. I'll post the dmesg output here tonight.

Created attachment 158441 [details]
debug journalctl output of second vlc starting up
This is the journalctl output of the boot when I start a second pctv 290e device with vlc after doing
sudo su
echo -n 'module xhci_hcd =p' > /sys/kernel/debug/dynamic_debug/control
my kernel
Linux minikat 3.17.3-1-ARCH #1 SMP PREEMPT Fri Nov 14 23:13:48 CET 2014 x86_64 GNU/Linux
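The checks described above for newer kernels (CONFIG_DYNAMIC_DEBUG set, debugfs mounted, then the write to the control file) could be wrapped up roughly as follows. This is only a sketch: the function names and the optional path argument are illustrative and do not come from the thread, and printf is used in place of echo -n only because echo -n is not portable across all sh implementations.

```shell
#!/bin/sh
# Sketch, not from the thread: helper names are illustrative.

# Reads a kernel config on stdin; succeeds if CONFIG_DYNAMIC_DEBUG=y is set.
has_dynamic_debug() {
    grep -q '^CONFIG_DYNAMIC_DEBUG=y'
}

# Enables xhci_hcd debug messages. $1 is the debugfs mount point
# (defaults to /sys/kernel/debug). Needs root on a real system.
enable_xhci_debug() {
    debugfs="${1:-/sys/kernel/debug}"
    control="$debugfs/dynamic_debug/control"
    # Mount debugfs only if the control file is not visible yet.
    if [ ! -e "$control" ]; then
        mount -t debugfs none "$debugfs" || return 1
    fi
    printf '%s' 'module xhci_hcd =p' > "$control"
}
```

On a kernel that exposes its configuration as /proc/config.gz (an assumption; not every distribution enables this), usage might look like `zcat /proc/config.gz | has_dynamic_debug && enable_xhci_debug`, run as root.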
Created attachment 158471 [details]
simpler journal output for 2x(vlc+pctv290e)
I ran this script immediately after a fresh boot
$ cat bin/debugger.sh
logger "=========================== turning on xhci_hcd debug"
sudo sh -c "echo -n 'module xhci_hcd =p' > /sys/kernel/debug/dynamic_debug/control"
logger "=========================== starting VLC on adapter 0"
vlc --dvb-adapter=0 ~/.config/channels.xspf &
sleep 30
logger "=========================== starting VLC on adapter 1"
vlc --dvb-adapter=1 ~/.config/channels.xspf &
sleep 60
sudo systemctl poweroff
The system was still working after the usb turnoff, and my sudo doesn't need a password, so the system did power off OK. Attached is the journalctl -b -1 --all output.
Thanks, the logs show that the host is asked to cancel a transfer, so it tries to stop the ring and remove the transfer block from the ring. Stopping the ring never gets a response, and we time out, assuming the host is dead. This might be related to bug# 75521.
I have a test branch that both fixes some related issues and adds more debugging for this case. If you can try it out, it's available at:
git://git.kernel.org/pub/scm/linux/kernel/git/mnyman/xhci.git for-usb-next-test
Remember to switch to the for-usb-next-test branch. It includes 8 patches that also apply cleanly on top of 3.18-rc6; I can send just the patches instead if you prefer it that way.
I'm hoping these patches will show one of the following messages when it fails:
"Cancelled TD not on stopped ring"
"Cancel URB NOT on current ring"
"Error unhandled cancelled TD's after dev reset"
"Error unhandled TD's after dev reset"

Created attachment 158871 [details]
journalctl output with patched 3.18 kernel
Same process as for comment 15, but with your patched kernel. I built using the Arch Linux 3.17.4 PKGBUILD after modifying it to use the source from the patched kernel and changing the version numbers. During boot I see the systemd module loading fail (probably because it has a direct reference to 3.17), but I don't normally need to load any modules specially, so I guess no harm done. I don't think I see any of your expected messages though.

Created attachment 158891 [details]
debug patch for command ring status
Thanks. It looks like it's not related to bug# 75521 after all. We never even get to run the command that should stop the endpoint ring. The whole command ring is not running. Could you add this attached debug patch (it's also added to the for-usb-next-test branch) and show me the output of it failing?

Created attachment 159001 [details]
journalctl output with patch from 20141127
I think this one has some new debugs coming out.
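To check a captured log for the four messages the test branch was expected to print, a fixed-string search is enough. The sketch below is illustrative only: the function name is made up here, and the message strings are copied from the comment that announced the test branch.

```shell
#!/bin/sh
# Sketch: print log lines (read from stdin) that contain any of the four
# messages quoted in the thread. The function name is illustrative.
find_cancel_messages() {
    grep -F \
        -e 'Cancelled TD not on stopped ring' \
        -e 'Cancel URB NOT on current ring' \
        -e "Error unhandled cancelled TD's after dev reset" \
        -e "Error unhandled TD's after dev reset"
}
```

For example, `journalctl -b -1 --all | find_cancel_messages` would search the previous boot's journal; the function prints nothing and returns non-zero if none of the messages appear.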
Another sample for comparison. Created with the for-usb-next-test branch mentioned in comment 18.
Created attachment 159031 [details]
Output of branch "for-usb-next-test"
(In reply to dezifit from comment #21)
> Created attachment 159031 [details]
> Output of branch "for-usb-next-test"
I think something went wrong when you cloned that branch. It shows output messages like "Endpoint 0x81 not halted, refusing to reset.", which shouldn't be possible in the "for-usb-next-test" branch; one of the patches removes that line.

(In reply to Robin Becker from comment #19)
> Created attachment 159001 [details]
> journalctl output with patch from 20141127
>
> I think this one has some new debugs coming out.
Thanks, it shows the command ring is running, and that we queued two "stop endpoint" commands on the command ring, but xhci doesn't complete the commands even though the ring is running. It's probably not a very common situation to have two reset endpoint commands for different usb slots queued at the same time; maybe that triggers something.
I can't think of anything else to try other than adding code that prevents queuing two stop endpoint commands on the actual hw ring at the same time, putting the other command on a sw ring until the first command completes. But it might take a while before I get to write a hack patch like that.

I'll try this out when you have it ready. I looked at the code and there is a comment in xhci.c suggesting that only one stop command should be in flight, but I'm not clever enough to know if that applies here, i.e. it might be 1 per ep or 1 per cancellation, etc. Anyway, thanks for your efforts.

Not sure whether this helps in resolution of this bug, but I asked for input on linux media and apparently others have also suffered from usb3/xhci problems with usb dvb. The following was suggested by Olli Salonen <olli.salonen a-t iki.fi>:
https://github.com/OpenELEC/OpenELEC.tv/commit/b636927dec20652ff020e54ed7838a2e9be51e03
which advises reverting commit 47f467ac740ebf0475a5176ddb1741acba6aad4
When I apply the above to Arch linux-18.2, my problem disappears and I can run 2 x pctv-290e + vlc / tzap & usb 3.
There are indications that this patch remedies the issue: http://www.spinics.net/lists/linux-usb/msg122678.html

I'm at work right now, but will give this patch a spin this evening. I will try it with Arch 3.19.3-3 x86_64 without the patch I currently use (see 25 above).

OK, I can confirm that the spinics patch in http://www.spinics.net/lists/linux-usb/msg122678.html also fixes the problem for me.

I tested the patch (see comment 26) with 3.12.40 and can confirm that XHCI_AVOID_BEI prevents xhci from freezing, so the initial issue is solved for me. I'm now seeing a bunch of new messages; I don't know if they are related. Examples:
xhci_hcd 0000:00:14.0: ERROR Transfer event TRB DMA ptr not part of current TD
xhci_hcd 0000:00:14.0: Signal while waiting for configure endpoint command

The spinics patch appears to be in Arch linux-4.0.1-1 and the bug has gone away for me.

I am using kernel 4.4.0-127-lowlatency (Ubuntu 14.04) and I have this bug. When we run a UVC camera, the host controller dies randomly after some point. The symptoms are exactly as described by other users above. I would like to know which kernels this bug affects and where a patch has been applied. May I know which kernel has the spinics patch?
Hardware:
Asus X99-E-10G WS Motherboard
Intel Xeon E5-2687W v4 3.0 GHz
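One rough way to see whether a given kernel source tree carries the change is to search it for the XHCI_AVOID_BEI name mentioned above. This is only a sketch: the function name is made up here, and looking under drivers/usb/host is an assumption based on the directory named earlier in this thread. A match is a hint that the patch is present, not proof that the freeze is fixed in that kernel.

```shell
#!/bin/sh
# Sketch: list files under drivers/usb/host of a kernel source tree ($1)
# that mention XHCI_AVOID_BEI. Function name and search path are
# illustrative assumptions, not taken from the patch itself.
tree_has_avoid_bei() {
    grep -rl 'XHCI_AVOID_BEI' "$1/drivers/usb/host" 2>/dev/null
}
```

Called as `tree_has_avoid_bei /path/to/linux`, it prints the matching files and returns non-zero when there are none.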