Bug 70781 - [xhci_hcd] reset SuperSpeed, xhci_drop_endpoint called with disabled ep, Error in queuecommand_lck: task blocked
Status: NEW
Alias: None
Product: Drivers
Classification: Unclassified
Component: USB
Hardware: x86-64 Linux
Importance: P1 high
Assignee: XHCI bugs virtual user
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2014-02-18 16:10 UTC by Andreas Reis
Modified: 2014-11-15 17:13 UTC
CC List: 11 users

See Also:
Kernel Version: 3.14-rc3
Subsystem:
Regression: Yes
Bisected commit-id:


Attachments
dmesg showing bug, without module xhci_hcd =p (114.02 KB, application/octet-stream)
2014-02-18 16:10 UTC, Andreas Reis
dmesg after restart, before plugging stick in (77.83 KB, text/plain)
2014-02-18 16:15 UTC, Andreas Reis
dmesg after restart, after plugging stick in (61.92 KB, text/plain)
2014-02-18 16:16 UTC, Andreas Reis
dmesg of bug occurring while debug flag set (146.00 KB, text/plain)
2014-02-19 16:18 UTC, Andreas Reis
dmesg after reboot with debug flag set (125.83 KB, text/plain)
2014-02-19 16:20 UTC, Andreas Reis
dmesg with xhci_drop_endpoint (73.54 KB, text/plain)
2014-03-12 16:24 UTC, Vasily Tarasov
dmesg with DID_NO_CONNECT and hung sync (74.70 KB, text/plain)
2014-03-12 16:24 UTC, Vasily Tarasov
lspci (1.95 KB, text/plain)
2014-03-12 16:25 UTC, Vasily Tarasov
lspci -n (651 bytes, text/plain)
2014-03-12 16:25 UTC, Vasily Tarasov
lsusb -v (25.09 KB, text/plain)
2014-03-12 16:26 UTC, Vasily Tarasov
dmesg (104.43 KB, application/octet-stream)
2014-03-16 14:11 UTC, Andreas Reis
usbmon trace (905.40 KB, application/x-xz)
2014-03-16 14:20 UTC, Andreas Reis
dmesg with debug patch v2 (155.89 KB, application/octet-stream)
2014-03-19 11:30 UTC, Andreas Reis
testpatch to disable usb3 on intel hosts (962 bytes, patch)
2014-03-25 16:19 UTC, Mathias Nyman
patch to disable all lpm related control transfers (1.05 KB, patch)
2014-03-25 16:22 UTC, Mathias Nyman
dmesg, yet another crash (110.35 KB, text/plain)
2014-04-06 19:17 UTC, Andreas Reis

Description Andreas Reis 2014-02-18 16:10:32 UTC
Created attachment 126621 [details]
dmesg showing bug, without module xhci_hcd =p

Corsair Voyager GT 3.0 32GB on an Intel Z87. I'm using it mostly to store source code and as a ccache directory, i.e. lots of small reads and writes.

Currently on 3.14-rc3 (self-compiled on Arch), ie. with the other xhci revert-fixes already applied.

Happens since 3.14 development started. Can't remember ever having anything like it occur on 3.13.

The error always follows the same sequence — with apparently varying addresses — though I cannot (or don't know how to) trigger it manually:
1. usb 4-2: reset SuperSpeed USB device number 3 using xhci_hcd
2. xhci_hcd 0000:00:14.0: xHCI xhci_drop_endpoint called with disabled ep ffff880428207900
3. usb-storage: Error in queuecommand_lck: us->srb = ffff8804283d0d80

Then the task that issued the I/O hangs, followed by the entire system, at the latest when I try to shut down or restart.
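
For anyone checking their own logs for the same pattern, here is a minimal sketch that scans dmesg output for the three messages in the order listed above. The function name `find_failure_sequences` is made up for illustration, and the matching is plain substring search on the message fragments quoted in this report, so it is a rough filter rather than a diagnosis.

```python
# Message fragments copied from the report, in the order they were observed.
SEQUENCE = (
    "reset SuperSpeed USB device",
    "xhci_drop_endpoint called with disabled ep",
    "Error in queuecommand_lck",
)


def find_failure_sequences(lines):
    """Return (reset, drop_endpoint, queuecommand) line-index tuples (0-based)."""
    hits = []
    stage = 0      # index into SEQUENCE of the message expected next
    current = []   # line indexes matched so far for the sequence in progress
    for index, line in enumerate(lines):
        if SEQUENCE[0] in line:
            # A new reset always restarts the search.
            stage, current = 1, [index]
        elif stage and SEQUENCE[stage] in line:
            current.append(index)
            stage += 1
            if stage == len(SEQUENCE):
                hits.append(tuple(current))
                stage, current = 0, []
    return hits
```

Usage would be something like saving `dmesg` to a file and calling `find_failure_sequences(open("dmesg.txt").read().splitlines())`. Repeated xhci_drop_endpoint lines between the reset and the usb-storage error are simply skipped.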

It also often happens that if I restart leaving the stick plugged in, it either won't show after boot or immediately triggers some other file system related bug (as seen in the attachment). I've tried ext4, f2fs and btrfs — the first always seems able to recover at some point, the second sometimes (at least if one manages to avoid any corrupted files henceforth), the third never.

The attachment "dmesg", which actually shows the bug, was unfortunately taken without "module xhci_hcd =p", which I did not know about then. The other two were taken with it after two or three restarts, but apparently fsck.f2fs had managed to repair the partition by then, and no follow-up bug triggered.

I'll try to catch the bug with the debug flag set. But as written, since it's seemingly random I don't know when I'll manage to do that.
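
A minimal sketch of setting that debug flag at runtime, assuming the kernel is built with CONFIG_DYNAMIC_DEBUG and debugfs is mounted at /sys/kernel/debug (both are assumptions about the local setup, and writing the control file needs root). The helper name `enable_module_debug` is hypothetical; the query string is the same "module xhci_hcd =p" mentioned above.

```python
from pathlib import Path

# Assumed location of the dynamic debug control file (debugfs at /sys/kernel/debug).
DEFAULT_CONTROL = Path("/sys/kernel/debug/dynamic_debug/control")


def enable_module_debug(module, control=DEFAULT_CONTROL):
    """Write 'module <name> =p' to the control file and return the query written."""
    query = f"module {module} =p"
    with open(control, "w") as handle:
        handle.write(query + "\n")
    return query
```

Calling `enable_module_debug("xhci_hcd")` as root before reproducing the hang should make the driver's debug messages show up in dmesg.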
Comment 1 Andreas Reis 2014-02-18 16:15:49 UTC
Created attachment 126631 [details]
dmesg after restart, before plugging stick in
Comment 2 Andreas Reis 2014-02-18 16:16:40 UTC
Created attachment 126641 [details]
dmesg after restart, after plugging stick in
Comment 3 Greg Kroah-Hartman 2014-02-18 17:11:02 UTC
On Tue, Feb 18, 2014 at 04:10:32PM +0000, bugzilla-daemon@bugzilla.kernel.org wrote:
> https://bugzilla.kernel.org/show_bug.cgi?id=70781
> 
>             Bug ID: 70781
>            Summary: [xhci_hcd] reset SuperSpeed, xhci_drop_endpoint called
>                     with disabled ep, Error in queuecommand_lck: task
>                     blocked

Please send this to the linux-usb@vger.kernel.org mailing list.
Comment 4 Andreas Reis 2014-02-19 16:18:46 UTC
Created attachment 126751 [details]
dmesg of bug occurring while debug flag set
Comment 5 Andreas Reis 2014-02-19 16:20:47 UTC
Created attachment 126761 [details]
dmesg after reboot with debug flag set
Comment 6 Sarah Sharp 2014-03-07 19:22:53 UTC
Andreas, this may be related to a known regression.  Please try these two patches and see if it fixes your issue:
http://marc.info/?l=linux-kernel&m=139420429908418&w=2
http://marc.info/?l=linux-kernel&m=139420432208425&w=2
Comment 7 Andreas Reis 2014-03-08 12:27:22 UTC
Sorry for not having replied to the mailing list, I'm busy atm and the bug being sporadic makes it hard to debug anyhow.

I'll apply the patches and see if it still occurs.
Comment 8 NotTheEvilOne 2014-03-08 20:25:18 UTC
Hi list / bugzilla,

just want you to know that a similar or same bug also occurs for Linux version 3.13.5-1-ARCH / 3.13.6-1-ARCH. I know that it's not a vanilla kernel but I don't know of any patches related to USB 3.0 to be added by ArchLinux.

kernel log:
[ 4578.359232] usb 4-2: Disable of device-initiated U1 failed.
[ 4580.077667] usb 4-2: reset SuperSpeed USB device number 2 using xhci_hcd
[ 4580.092832] xhci_hcd 0000:00:14.0: xHCI xhci_drop_endpoint called with disabled ep --address--
[ 4580.092837] xhci_hcd 0000:00:14.0: xHCI xhci_drop_endpoint called with disabled ep --address--

Affected device:
Bus 004 Device 003: ID 05e3:0732 Genesys Logic, Inc. labeled as something like "Transcend SD card reader USB 3.0" with 32GB SD card

Let me know if I can do anything to verify the bug (or whether it is an independent one).
Comment 9 Andreas Reis 2014-03-12 10:59:55 UTC
Unfortunately it is still happening. Using linux-git (ie. both patches applied):

[ 3022.774515] usb 4-1: reset SuperSpeed USB device number 2 using xhci_hcd
[ 3022.787928] xhci_hcd 0000:00:14.0: xHCI xhci_drop_endpoint called with disabled ep ffff880422309e80
[ 3022.787930] xhci_hcd 0000:00:14.0: xHCI xhci_drop_endpoint called with disabled ep ffff880422309ec0
100x [ 3053.680116] usb-storage: Error in queuecommand_lck: us->srb = ffff880414422d80
Comment 10 Vasily Tarasov 2014-03-12 16:23:25 UTC
I encounter the same (or a similar) problem with an Intel DQ77MK motherboard (based on the Q77 chipset) and an Oyen Digital MiniPro 2TB USB 3.0 external disk drive (which uses an ASMedia 2105 USB-SATA bridge).

The symptoms are the same on all kernels I tried: 3.11, 3.13.6, and 3.14.0-rc6. Either I get these messages when I/O activity starts on the disk:

[  414.912907] usb 4-1: reset SuperSpeed USB device number 2 using xhci_hcd
[  414.929082] xhci_hcd 0000:00:14.0: xHCI xhci_drop_endpoint called with disabled ep ffff8800360f2200
[  414.929088] xhci_hcd 0000:00:14.0: xHCI xhci_drop_endpoint called with disabled ep ffff8800360f2240

or right after connecting the drive I see this:

[  136.463022] usb 4-1: USB disconnect, device number 2
[  136.463740] sd 6:0:0:0: [sdc] Synchronizing SCSI cache
[  136.463798] sd 6:0:0:0: [sdc]  
[  136.463800] Result: hostbyte=DID_NO_CONNECT driverbyte=DRIVER_OK

In the latter case, if I issue sync, the process hangs:

[  601.302650] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[  601.303368] sync            D ffff88041dc144c0     0  2065   2014 0x00000000
[  601.303371]  ffff88040600bd80 0000000000000082 ffff88040607e380 ffff88040600bfd8
[  601.303373]  00000000000144c0 00000000000144c0 ffff88040607e380 ffff88040600be90
[  601.303375]  ffff88040600be98 7fffffffffffffff ffff88040607e380 ffffffff811ef9d0
[  601.303377] Call Trace:
[  601.303382]  [<ffffffff811ef9d0>] ? vfs_fsync+0x40/0x40
[  601.303385]  [<ffffffff81731949>] schedule+0x29/0x70
[  601.303387]  [<ffffffff81730bf9>] schedule_timeout+0x219/0x2b0
[  601.303390]  [<ffffffff811518cb>] ? filemap_fdatawait_range+0x13b/0x190
[  601.303393]  [<ffffffff81081c8a>] ? __queue_delayed_work+0xaa/0x1a0
[  601.303395]  [<ffffffff810822dd>] ? try_to_grab_pending+0x11d/0x160
[  601.303396]  [<ffffffff811ef9d0>] ? vfs_fsync+0x40/0x40
[  601.303398]  [<ffffffff81732476>] wait_for_completion+0xa6/0x160
[  601.303400]  [<ffffffff8109c2d0>] ? wake_up_state+0x20/0x20
[  601.303403]  [<ffffffff811e85b5>] sync_inodes_sb+0xa5/0x1c0
[  601.303404]  [<ffffffff811ef9d0>] ? vfs_fsync+0x40/0x40
[  601.303406]  [<ffffffff811ef9e9>] sync_inodes_one_sb+0x19/0x20
[  601.303408]  [<ffffffff811c3d42>] iterate_supers+0xb2/0x110
[  601.303410]  [<ffffffff811efc55>] sys_sync+0x35/0x90
[  601.303413]  [<ffffffff8173e6ed>] system_call_fastpath+0x1a/0x1f
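
To read a trace like the one above more easily, here is a small sketch that pulls the symbol names out of the frame lines and separates the frames printed with "?" (addresses found on the stack that the unwinder could not confirm as part of the call chain) from the rest. The function names and the regular expression are made up for illustration and only cover the `[<address>] symbol+0xoff/0xsize` frame format shown here.

```python
import re

# Matches frames such as "[<ffffffff81731949>] schedule+0x29/0x70",
# with an optional "? " marker in front of the symbol.
FRAME = re.compile(r"\[<[0-9a-f]+>\]\s+(\?\s+)?([\w.]+)\+0x[0-9a-f]+/0x[0-9a-f]+")


def parse_call_trace(lines):
    """Return (symbol, reliable) pairs; reliable is False for '?' frames."""
    frames = []
    for line in lines:
        match = FRAME.search(line)
        if match:
            frames.append((match.group(2), match.group(1) is None))
    return frames


def reliable_stack(lines):
    """Return only the symbols of frames printed without a '?' marker."""
    return [symbol for symbol, reliable in parse_call_trace(lines) if reliable]
```

Applied to the trace above, the frames without "?" run from schedule through wait_for_completion and sync_inodes_sb down to sys_sync, i.e. sync is waiting for writeback that never completes.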


It seems that the problem occurs only when I connect the drive to a USB 3.0 port. If I connect it to a USB 2.0 port, the problem does not happen.
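
One way to confirm which speed a device actually negotiated is the `speed` attribute in sysfs, which reports the link speed in Mbit/s (5000 for SuperSpeed, 480 for high speed). The sketch below assumes the usual /sys/bus/usb/devices layout; the function names are hypothetical.

```python
import os


def devices_by_speed(sysfs_root="/sys/bus/usb/devices"):
    """Map each USB device entry to the content of its 'speed' attribute."""
    speeds = {}
    for name in sorted(os.listdir(sysfs_root)):
        speed_path = os.path.join(sysfs_root, name, "speed")
        # Interface entries such as "4-1:1.0" have no speed file and are skipped.
        if os.path.isfile(speed_path):
            with open(speed_path) as handle:
                speeds[name] = handle.read().strip()
    return speeds


def superspeed_devices(sysfs_root="/sys/bus/usb/devices"):
    """Return the entries that report 5000 Mbit/s (SuperSpeed)."""
    return [name for name, speed in devices_by_speed(sysfs_root).items()
            if speed == "5000"]
```

If the drive shows up in `superspeed_devices()` on the failing port and with speed 480 on the working one, that would match the observation above.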

The drive seems to run smoothly on Windows (which is installed on the same machine).

I attach dmesg, lspci, lspci -n, lsusb -v, and uname outputs to this comment. Let me know if any further information or actions on my side are needed.
Comment 11 Vasily Tarasov 2014-03-12 16:24:14 UTC
Created attachment 129191 [details]
dmesg with xhci_drop_endpoint
Comment 12 Vasily Tarasov 2014-03-12 16:24:50 UTC
Created attachment 129201 [details]
dmesg with DID_NO_CONNECT and hung sync
Comment 13 Vasily Tarasov 2014-03-12 16:25:15 UTC
Created attachment 129211 [details]
lspci
Comment 14 Vasily Tarasov 2014-03-12 16:25:51 UTC
Created attachment 129221 [details]
lspci -n
Comment 15 Vasily Tarasov 2014-03-12 16:26:21 UTC
Created attachment 129231 [details]
lsusb -v
Comment 16 Andreas Reis 2014-03-16 14:11:11 UTC
I finally managed to get a trace via usbmon of the bug. Happened during a longer compilation; the trace was started about a minute into it.
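
For sifting through a large usbmon capture, here is a rough sketch that lists completion callbacks with a non-zero URB status. It assumes the trace is in usbmon's text format as described in the kernel's usbmon documentation (URB tag, timestamp in microseconds, event type, address word such as "Bo:1:005:2", then the status word); the function name `failed_completions` is made up, and whether the attached trace uses exactly this format is an assumption.

```python
def failed_completions(lines, bus=None):
    """Return (timestamp_us, address, status) for callbacks with non-zero status.

    If bus is given, only events on that bus number are reported.
    """
    failures = []
    for line in lines:
        words = line.split()
        if len(words) < 5 or words[2] != "C":
            continue  # only completion callbacks carry a final URB status
        address = words[3]
        parts = address.split(":")  # type+direction, bus, device, endpoint
        if len(parts) != 4 or not parts[1].isdigit():
            continue
        # The status word may carry extra colon-separated fields (interval etc.).
        status_field = words[4].split(":")[0]
        try:
            timestamp = int(words[1])
            status = int(status_field)
        except ValueError:
            continue  # malformed line or a letter instead of a numeric status
        if status != 0 and (bus is None or int(parts[1]) == bus):
            failures.append((timestamp, address, status))
    return failures
```

For the device in this report, `failed_completions(lines, bus=4)` would narrow the output to the bus shown in the dmesg lines (usb 4-2).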
Comment 17 Andreas Reis 2014-03-16 14:11:50 UTC
Created attachment 129591 [details]
dmesg
Comment 18 Andreas Reis 2014-03-16 14:20:13 UTC
Created attachment 129601 [details]
usbmon trace
Comment 19 Andreas Reis 2014-03-19 11:30:19 UTC
Created attachment 130041 [details]
dmesg with debug patch v2

for mailing list
Comment 20 Mathias Nyman 2014-03-25 16:19:41 UTC
Created attachment 130661 [details]
testpatch to disable usb3 on intel hosts

Could you try this patch first to check whether this is Link PM related?
This patch disables LPM for Intel hosts but still sends LPM-related control transfers.
Comment 21 Mathias Nyman 2014-03-25 16:22:40 UTC
Created attachment 130671 [details]
patch to disable all lpm related control transfers

If the previous patch doesn't work, please try this one.
It prevents all LPM-related control transfers on all hosts.

If the issue still exists after this one, then it's probably not USB3 LPM related.
Comment 22 NotTheEvilOne 2014-03-25 21:26:27 UTC
(In reply to Mathias Nyman from comment #20)
> Created attachment 130661 [details]
> testpatch to disable usb3 on intel hosts
> 
> Could you try this patch first to check if this is Link PM related.
> This patch sets LPM disabled for intel hosts, but still sends LPM related
> control transfers.

Applying this patch to 3.13.7 seems to suppress the bug being triggered for me. I get the following output in dmesg:

[    1.410752] usb 4-2: new SuperSpeed USB device number 2 using xhci_hcd
[    1.424830] usb 4-2: Parent hub missing LPM exit latency info.  Power management will be impacted.
Comment 23 NotTheEvilOne 2014-03-26 10:03:42 UTC
Sorry for the noise. Looks like my original problem was indeed a different one fixed with "[PATCH 1/2] Revert xhci 1.0: Limit arbitrarily-aligned scatter gather." and "[PATCH 2/2] Revert USBNET: ax88179_178a: enable tso if usb host supports sg dma" for 3.13.7.

Could be that this triggered the same error as the one of the original author.

I rebuilt the original (broken for me) 3.13.6 with your patch. I didn't get any xhci related log entries so far. Looks like I can't reproduce the problem anymore.
Comment 24 Andreas Reis 2014-04-06 19:17:19 UTC
Created attachment 131571 [details]
dmesg, yet another crash
Comment 25 a92zu 2014-04-16 21:42:41 UTC
Hello,

I also tried the patch "patch to disable all lpm related control transfers" from Mathias Nyman, but without success, still "xHCI xhci_drop_endpoint called with disabled ep"-messages, ... (kernel 3.14)
USB3 controller: NEC Corporation uPD720200 USB 3.0 Host Controller (rev 03)

I have problems with my USB 3.0 devices; external HDDs (HGST Touro Mobile Pro (Simpletech-Chip), Fantec case with WDC, Asmedia Chip) and a Sandisk card reader.

** My configuration:
Home PC: ASUS P7P55D-E PRO with Intel Chipset P55 with PLX PEX PCI Bridges attached to NEC Controller => not working

For my office PC I have bought a PCIe-Card with the same NEC-Controller (it is exactly the same revision reported by lspci) and all the usb3-devices are working fine there => no errors !? 

Furthermore my usb3.0 stick Lexar 64GByte is working fine on *both* controllers!

It is very strange that all devices work as expected on the PCIe card's controller, although both controllers report the same revision (firmware?).
I can only explain that by different firmware revisions or by the different connection/interface.
(A firmware update on the mainboard's onboard chip is not possible or too risky.)
PLX Chip with NEC: not working, PCIe-card with NEC: working


** Chipset P55 with PLX PEX PCI Bridges vs Sandybridge Mainboard with PCIe-Card(PCIe 2.0 x1, 5Gbps):

Here are the differences between the lspci outputs:

PCIe Type                   | CacheLineSize | Flags: AuxCurrent | PME     | DevCtl   | MaxReadReq | AuxPwr  | CommClk
----------------------------+---------------+-------------------+---------+----------+------------+---------+---------
PLX Chip, NEC (not working) | 32            | 375mA             | D3cold+ | NoSnoop+ | 512 bytes  | AuxPwr+ | CommClk-
PCIe-Card, NEC (working)    | 64            | 0mA               | D3cold- | NoSnoop- | 128 bytes  | AuxPwr- | CommClk+

lsusb diffs;
PLX Chip, NEC; bmAttributes 0x02 (no LTM support)
PCIe-Card: bmAttributes 0x00  Latency Tolerance Messages (LTM) Supported

Are there any new patches available for testing? 

Thank you for your support, Bernhard

errors:
[18853.074081] xhci_hcd 0000:08:00.0: xHCI xhci_drop_endpoint called with disabled ep ffff8803e2119c80
[18853.074088] xhci_hcd 0000:08:00.0: xHCI xhci_drop_endpoint called with disabled ep ffff8803e2119cc0
[18890.189088] xhci_hcd 0000:08:00.0: xHCI host not responding to stop endpoint command.
[18890.189096] xhci_hcd 0000:08:00.0: Assuming host is dying, halting host.
[18890.232182] xhci_hcd 0000:08:00.0: Host not halted after 16000 microseconds.
[18890.232193] xhci_hcd 0000:08:00.0: Non-responsive xHCI host is not halting.
[18890.232197] xhci_hcd 0000:08:00.0: Completing active URBs anyway.
[18890.232209] xhci_hcd 0000:08:00.0: HC died; cleaning up

or

[ 8427.730368] usb 4-1: reset SuperSpeed USB device number 5 using xhci_hcd
[ 8427.748097] usb 4-1: Parent hub missing LPM exit latency info.  Power management will be impacted.
[ 8427.749549] xhci_hcd 0000:08:00.0: xHCI xhci_drop_endpoint called with disabled ep ffff880407ca7480
[ 8427.749555] xhci_hcd 0000:08:00.0: xHCI xhci_drop_endpoint called with disabled ep ffff880407ca74c0
[ 8427.870515] usb 4-1: reset SuperSpeed USB device number 5 using xhci_hcd
[ 8427.888132] usb 4-1: Parent hub missing LPM exit latency info.  Power management will be impacted.
[ 8427.889549] xhci_hcd 0000:08:00.0: xHCI xhci_drop_endpoint called with disabled ep ffff880407ca7480
[ 8427.889555] xhci_hcd 0000:08:00.0: xHCI xhci_drop_endpoint called with disabled ep ffff880407ca74c0
Comment 26 a92zu 2014-05-09 21:35:54 UTC
Are there any new patches to test or news regarding "xhci_drop_endpoint"-issue?
Comment 27 Karl Richter 2014-11-15 17:13:21 UTC
@Bernhard: at least there is a workaround, which is to use 3.12.32, the latest working longterm kernel according to a very simple bisection I've done. See https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1371233 for details (and yes, you seriously have to download my post and open it in a text editor to read it...)
