Bug 65021 - xhci: complete USB freeze
Summary: xhci: complete USB freeze
Status: NEW
Alias: None
Product: Drivers
Classification: Unclassified
Component: USB (show other bugs)
Hardware: i386 Linux
: P1 normal
Assignee: XHCI bugs virtual user
URL: https://bugzilla.kernel.org/show_bug....
Keywords:
Depends on:
Blocks:
 
Reported: 2013-11-15 00:38 UTC by dezifit
Modified: 2018-06-05 20:22 UTC (History)
10 users (show)

See Also:
Kernel Version: up to 3.12.8
Subsystem:
Regression: No
Bisected commit-id:


Attachments
lspci -vvvv (11.14 KB, application/octet-stream)
2013-11-15 00:39 UTC, dezifit
Details
lsusb -vvvv (53.67 KB, application/octet-stream)
2013-11-15 00:40 UTC, dezifit
Details
extract of dying xhci host (6.94 KB, application/octet-stream)
2013-11-15 00:41 UTC, dezifit
Details
dmesg with enabled xHCI debugging (87.25 KB, text/plain)
2014-02-05 15:09 UTC, dezifit
Details
debug journalctl output of second vlc starting up (333.47 KB, text/plain)
2014-11-21 22:30 UTC, Robin Becker
Details
simpler journal output for 2x(vlc+pctv290e) (214.80 KB, text/plain)
2014-11-22 09:36 UTC, Robin Becker
Details
journalctl output with patched 3.18 kernel (175.73 KB, text/plain)
2014-11-27 00:11 UTC, Robin Becker
Details
debug patch for command ring status (2.86 KB, patch)
2014-11-27 14:08 UTC, Mathias Nyman
Details | Diff
journalctl output with patch from 20141127 (177.26 KB, text/plain)
2014-11-28 00:20 UTC, Robin Becker
Details
Output of branch "for-usb-next-test" (147.35 KB, text/plain)
2014-11-28 18:25 UTC, dezifit
Details

Description dezifit 2013-11-15 00:38:11 UTC
Accessing the audio capture device (Em28xxAudio) of a USB TV adapter (Hauppauge HVR-900) causes a complete USB freeze (see attached log "xhci.log").

This device is working at least until kernel 3.2 with the same hardware (Intel DH87RL/Haswell, BIOS RLH8710H.86A.0320.2013.0606.1802).

Switching from a USB3 to a USB2 port doesn't change anything; if xhci is available, it is used and the bus freezes. If ehci is forced by disabling xhci completely, the capture device is working again.

See also https://bugzilla.kernel.org/show_bug.cgi?id=62911 for additional informations related to V4L
Comment 1 dezifit 2013-11-15 00:39:34 UTC
Created attachment 114781 [details]
lspci -vvvv
Comment 2 dezifit 2013-11-15 00:40:03 UTC
Created attachment 114791 [details]
lsusb -vvvv
Comment 3 dezifit 2013-11-15 00:41:56 UTC
Created attachment 114801 [details]
extract of dying xhci host
Comment 4 Greg Kroah-Hartman 2013-11-15 06:27:55 UTC
Please send this to the linux-usb@vger.kernel.org mailing list.
Comment 5 dezifit 2014-01-20 19:03:40 UTC
Update: kernel 3.2 didn't cause this issue as XHCI was experimental until 3.7 and therefore simply not in the tested configuration (ubuntu 12.04.1 stock).

The bug is 100% reproducible and occurs only on audio access (video is working even with xhci). Bios update to recent RLH8710H.86A.0323.2013.1204.1726 didn't help.
Comment 6 Sarah Sharp 2014-01-30 01:49:16 UTC
Please test with 3.13, with CONFIG_USB_DEBUG=y.  Run this command as root to enable xHCI debugging:

echo -n 'module xhci_hcd =p' > /sys/kernel/debug/dynamic_debug/control

Then start capturing dmesg, trigger the bug, and send me the resulting dmesg.  I need to see a lot more of what happened to the xHCI driver before the host died to figure out what went wrong.
Comment 7 dezifit 2014-02-05 15:09:51 UTC
Created attachment 124701 [details]
dmesg with enabled xHCI debugging

Attached USB debug log was created with 3.13.1, hope that's OK and contains the informations you need.
Comment 8 Sarah Sharp 2014-02-05 22:08:37 UTC
Yes, that's what I was looking for.  Nothing immediately wrong jumps out at me, unfortunately.  I might need you to apply some debugging patches later, once I form some hypothesis as to what's going wrong.
Comment 9 Robin Becker 2014-11-19 20:13:41 UTC
I'd like to ask if there's been any progress on this bug? I have essentially the same kind of issue see https://bbs.archlinux.org/viewtopic.php?id=190000 with a shutdown of usb caused by something in xhci. I'm willing to assist in getting more information if that's required. My kernel is 3.17.3-1-ARCH #1 SMP PREEMPT Fri Nov 14 23:13:48 CET 2014 x86_64 GNU/Linux. Is it worth me trying the same kind of build as in comment 6?
Comment 10 Mathias Nyman 2014-11-20 11:38:48 UTC
While dealing with halted endpoints in xhci I noticed that there could be issues with how we stop endpoints and cancel URBs as well. 

If I make an additional debug patch could you apply it and run it with the xhci debugging enabled as in comment 6?
Comment 11 Robin Becker 2014-11-20 12:18:16 UTC
I am willing to try this, but when I looked at the drivers/usb/host files I do not see the debug flag mentioned in comment 6 (CONFIG_USB_DEBUG) nor is it in the .config prepared by the  Arch pkgbuild. There are other flags with debug and usb in their names eg CONFIG_DVB_USB_DEBUG, but I would need advice on which debugs to turn on.
Comment 12 Mathias Nyman 2014-11-21 11:23:03 UTC
Ah, yes, the CONFIG_USB_DEBUG was for older kernels.
With later kernels you only need to check that dynamic debug is enabled:
CONFIG_DYNAMIC_DEBUG=y

check if debugfs is mounted, if not then run:
mount -t debugfs none /sys/kernel/debug

And to enable xhci debugging do: 
echo -n 'module xhci_hcd =p' > /sys/kernel/debug/dynamic_debug/control

many distos have the dynamic debug enabled and debugfs mounted automatically so you only need to echo to the dymanic_debug/control file.
Comment 13 Robin Becker 2014-11-21 12:16:11 UTC
OK I have checked and it seems the stock arch linux kernel has CONFIG_DYNAMIC_DEBUG set so I don't need a recompile. I will carry out the comment 6 type process when I get home. I'll post the dmesg output here tonight.
Comment 14 Robin Becker 2014-11-21 22:30:31 UTC
Created attachment 158441 [details]
debug journalctl output of second vlc starting up

this the journalctl output of the boot when I start a second pctv 290e device with vlc after doing

sudo su
echo -n 'module xhci_hcd =p' > /sys/kernel/debug/dynamic_debug/control

my kernel

Linux minikat 3.17.3-1-ARCH #1 SMP PREEMPT Fri Nov 14 23:13:48 CET 2014 x86_64 GNU/Linux
Comment 15 Robin Becker 2014-11-22 09:36:50 UTC
Created attachment 158471 [details]
simpler journal output for 2x(vlc+pctv290e)

I ran this script immediately after a fresh boot

$ cat bin/debugger.sh
logger "=========================== turning on xhci_hcd debug"
sudo sh -c "echo -n 'module xhci_hcd =p' > /sys/kernel/debug/dynamic_debug/control"
logger "=========================== starting VLC on adapter 0"
vlc --dvb-adapter=0 ~/.config/channels.xspf &
sleep 30
logger "=========================== starting VLC on adapter 1"
vlc --dvb-adapter=1 ~/.config/channels.xspf &
sleep 60
sudo systemctl poweroff

the system was working after the usb turnoff and my sudo doesn't need a password so the system did power off OK. attached is the journalctl -b -1 --all output
Comment 16 Mathias Nyman 2014-11-26 12:52:34 UTC
Thanks, logs show that the host is asked to cancel a transfer, so
it tries to stop the ring and remove the transfer block from the ring.
Stopping the ring never responses, and we timout assuming the host is dead.

This might be related to bug# 75521

I got a testbranch that both fixes some related issues, and adds more debugging for this case. If you can try it out its available at:

git://git.kernel.org/pub/scm/linux/kernel/git/mnyman/xhci.git for-usb-next-test

Remember to switch to the for-usb-next-test branch.

It includes 8 patches that also apply cleanly on top of 3.18-rc6, I can send
just patches instead if you prefer it that way.

I'm hoping these patches would show one of the following messages when it fails:
"Cancelled TD not on stopped ring"
"Cancel URB NOT on current ring"
"Error unhandled cancelled TD's after dev reset"
"Error unhandled TD's after dev reset"
Comment 17 Robin Becker 2014-11-27 00:11:53 UTC
Created attachment 158871 [details]
journalctl output with patched 3.18 kernel

Same process as for comment 15, but with your patched kernel. I built using the Arch Linux 3.17.4 PKGBUILD after modding to use the source from the patched kernel and changing the version numbers. During the boot I see the systemd load modules failing (probably because it has a direct reference to 3.17), but I don't normally need to load any modules specially so I guess no harm done. I don't think I see any of your expected messages though.
Comment 18 Mathias Nyman 2014-11-27 14:08:04 UTC
Created attachment 158891 [details]
debug patch for command ring status

Thanks. looks like its not related to bug# 75521 after all.
we never even get to run the command that should stop the endpoint ring. The
whole command ring is not running. Could you add this attached debug patch (its also added to the for-usb-next-test branch) and show me the output of it failing
Comment 19 Robin Becker 2014-11-28 00:20:12 UTC
Created attachment 159001 [details]
journalctl output with patch from 20141127

I think this one has some new debugs coming out.
Comment 20 dezifit 2014-11-28 18:17:53 UTC
Another sample for comparison. Created with for-usb-next-test branch mentioned in comment 18
Comment 21 dezifit 2014-11-28 18:25:45 UTC
Created attachment 159031 [details]
Output of branch "for-usb-next-test"
Comment 22 Mathias Nyman 2014-12-12 18:11:56 UTC
(In reply to dezifit from comment #21)
> Created attachment 159031 [details]
> Output of branch "for-usb-next-test"

I think something went wrong when you cloned that branch,

It shows output messages like "Endpoint 0x81 not halted, refusing to reset."
which shouldn't be possible in the "for-usb-next-test" branch, one of the patches removes that line.
Comment 23 Mathias Nyman 2014-12-12 18:20:44 UTC
(In reply to Robin Becker from comment #19)
> Created attachment 159001 [details]
> journalctl output with patch from 20141127
> 
> I think this one has some new debugs coming out.

Thanks, it shows the command ring is running, and that we queued two "stop endpoint" commands on the command ring, but xhci doesn't comlete the commands
even if the ring is running.

Its probably not a very common situation to have two reset endpoint commands for different usb slots queued at the same time, maybe that triggers something.

I can't think of anything else to try than adding code that prevents queuing two stop endpoint command on the actual hw ring at the same time, putting the other command on a sw ring until first command completes. but it might take a while before I get to write a hackpatch like that
Comment 24 Robin Becker 2014-12-13 13:52:17 UTC
I'll try this out when you have it ready. I looked at the code and there is a comment in xhci.c suggesting that only one stop command should be in flight, but I'm not clever enough to know if that applies here ie it might be 1 per ep or 1 per cancellation etc etc. Anyway thanks for your efforts.
Comment 25 Robin Becker 2015-01-25 16:18:08 UTC
Not sure whether this helps in resolution of this bug, but I asked for input on linux media and apparently others have also suffered from usb3/xhci problems with usb dvb. 

The following was suggested by Olli Salonen <olli.salonen a-t iki.fi>


https://github.com/OpenELEC/OpenELEC.tv/commit/b636927dec20652ff020e54ed7838a2e9be51e03

which advises reverting commit 47f467ac740ebf0475a5176ddb1741acba6aad4

When I apply the above to Arch linux-18.2, my problem disappears and I can run 2 x pctv-290e + vlc / tzap & usb 3.
Comment 26 Olli Salonen 2015-04-09 06:05:20 UTC
There are indications that this patch remedies the issue:
http://www.spinics.net/lists/linux-usb/msg122678.html
Comment 27 Robin Becker 2015-04-09 09:11:32 UTC
I'm at work right now, but will give this patch a spin this evening. I will try it with Arch  3.19.3-3 x86_64 without the patch I currently use see 25 above.
Comment 28 Robin Becker 2015-04-09 20:05:19 UTC
OK I can confirm that the spinics patch in http://www.spinics.net/lists/linux-usb/msg122678.html also fixes the problem for me.
Comment 29 dezifit 2015-04-15 09:46:05 UTC
Did test the patch (see comment 26) with 3.12.40 and can confirm that XHCI_AVOID_BEI prevents xhci from freezing, so the initial issue is solved for me.

I'm now seeing a bunch of new messages, don't know if they are related. Examples:
xhci_hcd 0000:00:14.0: ERROR Transfer event TRB DMA ptr not part of current TD
xhci_hcd 0000:00:14.0: Signal while waiting for configure endpoint command
Comment 30 Robin Becker 2015-04-30 21:22:46 UTC
The spinics patch appears to be in Arch linux-4.0.1-1 and the bug has gone away for me.
Comment 31 keerthi 2018-06-05 20:22:54 UTC
I am using kernel 4.4.0-127-lowlatency (Ubuntu 14.04) and I have this bug. When we run an UVC camera, the host controller dies randomly after some point. The symptoms are exactly as described by other users above.

I would like to know which kernels this bug affects and where a patch has been applied.

May I know which kernel has the spinics patch ? 

Hardware:
Asus X99-E-10G WS Motherboard
Intel Xeon E5-2687W v4 3.0 GHz

Note You need to log in before you can comment on or make changes to this bug.