Bug 40702

Summary: cpu0 lockup when a sound is played thru USB soundcard connected to xHCI USB 3.0 controller
Product: Drivers Reporter: Matej Laitl (matej)
Component: USBAssignee: Greg Kroah-Hartman (greg)
Status: CLOSED CODE_FIX    
Severity: normal CC: florian, sarah, stern, zonque
Priority: P1    
Hardware: All   
OS: Linux   
Kernel Version: v3.0-7530-gde96355 Subsystem:
Regression: No Bisected commit-id:
Attachments: lockup-debug-on.jpeg
lockup-debug-on.jpeg
lockup-debug-off.jpeg
debugging.patch
.config for the debugging kernel
dmesg from the debugging kernel prior to the hang
lspci -vvnn
lsusb -v
netconsole-1.txt lockup dump type 1
netconsole-3.txt lockup dump type 2
netconsole-4.txt NULL ptr deref with 3.1-rc1 + for-usb-linus-transcend-fix
netconsole-10.txt td->last_trb is NULL, the system hand in the end
dmesg-20.txt dmesg from audio play. no hang, but audio stops after a few secs
dmesg-26-different.txt from audio play, audio stop after few secs and system hangs _after_ I hit Ctrl+C
Debug patch to not queue URBs that have not yet been given back
messages-40.txt non-truncated version of dmesg-20.txt
dmesg-43.txt full dmesg with Debug patch for caiaq
Patch for proper tracking of output URBs in snd_usb_caiaq
Patch for proper tracking of output URBs in snd_usb_caiaq (2)
messages-50-truncated.txt dmesg with CONFIG_USB_XHCI_HCD_DEBUGGING on

Description Matej Laitl 2011-08-07 20:18:30 UTC
Everytime I try to play a sound file though a NI Audio4DJ USB soundcard (snd-usb-caiaq module) that is connected to a NEC Corporation uPD720200 USB 3.0 Host Controller (xhci-hcd module), the system completely hangs. dmesg reports stalls or hard lockups on cpu 0; magic-sysrq-key doesn't work, but tasks on other cpus still apparently run. The card runs in the USB 2.0 (480M) mode.

The bug does NOT occur when the same soundcard is connected to a USB 2.0 (ehci) ports. Non-sound communication with the soundcard works even on USB 3.0: input mode can be set correctly using amixer.

Steps to reproduce:
1. connect a USB soundcard (Audio4DJ used) to a USB 3.0 port (NEC uPD720200 xHCI used)
2. try to play a sound: aplay -D plughw:Audio4DJ sound.wav
3. system (cpu 0) locks up

The bug happens everytime on v3.0-7530-gde96355 and also on v3.0.0. I cannot easily test older kernels due to EFI bugs.

Please find attached 2 dmesg screenshots from the hang, lockup-debug-off.png is from a kernel build without debugging-enabled kernel, lockup-debug-on.png is with a debugging kernel that has small debugging.patch applied that just adds a few printks.

.config, lspci -vvnn, lsusb -v outputs attached refer to the debugging kernel prior to the hang.
Comment 1 Matej Laitl 2011-08-07 20:19:35 UTC
Created attachment 67942 [details]
lockup-debug-on.jpeg
Comment 2 Matej Laitl 2011-08-07 20:20:31 UTC
Created attachment 67952 [details]
lockup-debug-on.jpeg
Comment 3 Matej Laitl 2011-08-07 20:23:19 UTC
Created attachment 67962 [details]
lockup-debug-off.jpeg
Comment 4 Matej Laitl 2011-08-07 20:23:49 UTC
Created attachment 67972 [details]
debugging.patch
Comment 5 Matej Laitl 2011-08-07 20:24:28 UTC
Created attachment 67982 [details]
.config for the debugging kernel
Comment 6 Matej Laitl 2011-08-07 20:25:42 UTC
Created attachment 67992 [details]
dmesg from the debugging kernel prior to the hang
Comment 7 Matej Laitl 2011-08-07 20:26:01 UTC
Created attachment 68002 [details]
lspci -vvnn
Comment 8 Matej Laitl 2011-08-07 20:26:24 UTC
Created attachment 68012 [details]
lsusb -v
Comment 9 Sarah Sharp 2011-08-08 17:18:55 UTC
Matej, can you use netconsole to capture the kernel trace with CONFIG_USB_XHCI_HCD_DEBUGGING and CONFIG_USB_DEBUG turned on?

There are three bug fixes for isochronous transfers (which your audio device is probably using).  Can you see if these three patches fix your issue?

http://marc.info/?l=linux-usb&m=131248351823895&w=2
http://marc.info/?l=linux-usb&m=131248352023905&w=2
http://marc.info/?l=linux-usb&m=131248351923898&w=2

They're also available on my for-usb-linus branch, if it's easier for you to pull from a git tree:

http://git.kernel.org/?p=linux/kernel/git/sarah/xhci.git;a=shortlog;h=refs/heads/for-usb-linus

Sarah Sharp
Comment 10 Matej Laitl 2011-08-09 11:07:04 UTC
Created attachment 68212 [details]
netconsole-1.txt lockup dump type 1

Hi Sarah,
thanks for working with me to debug the issue.

The lockup still occurs with git://git.kernel.org/pub/scm/linux/kernel/git/sarah/xhci.git for-usb-linus merged on top of v3.1-rc1; see attached netconsole log that starts just before the xHCI controller and usb soundard is attached.

I keep getting 2 different types of of hang dumps (sometimes none), this is the first one, I'll attach the second one in a while.

Other information I forgot to mention originally:
 * the xHCI is actually an ExpressCard marketed as Transcend USB 3.0 ExpressCard adapter.
 * other (mass storage, probably non-isochronous) USB 2.0 devices work fine when attached to the xHCI controller.
 * the soundcard is only detected when it is already attached to the ExpressCard _before_ the card is inserted. But this seems to be unrelated and less severe issue.
Comment 11 Sarah Sharp 2011-08-09 17:34:42 UTC
The Transcend expresscard should be fine, since it's an NEC chipset and I've tested with those.  Even if it's buggy, it shouldn't hang your kernel. :)

If you still have the xhci-hcd.ko from that kernel around, can you run gdb to find out where exactly in finish_td() it's hanging?  I think this will do it (although I just learned about this trick myself):

$ gdb xhci-hcd.ko
(gdb) l *(finish_td + 0xc8)

Let me know what that says.  I'll probably need you to add a little bit more debugging in order to figure out what's going on once I know where it's hanging.
Comment 12 Matej Laitl 2011-08-09 18:23:57 UTC
(gdb) l *(finish_td + 0xc8)
0xb9e8 is in finish_td (drivers/usb/host/xhci-ring.c:1555).
1550                                            slot_id, ep_index, ep_ring->stream_id,
1551                                            td, event_trb);
1552                    } else {
1553                            /* Update ring dequeue pointer */
1554                            while (ep_ring->dequeue != td->last_trb)
1555                                    inc_deq(xhci, ep_ring, false);
1556                            inc_deq(xhci, ep_ring, false);
1557                    }
1558
1559    td_cleanup:

No problem with patching kernel. :-)

Should I try to reproduce the second lockup dump, or netconsole-1.txt suffices?
Comment 13 Sarah Sharp 2011-08-09 18:32:47 UTC
I'd also like to see the second lockup dump, but this at least shows where part of the problem is.

It's just a very odd place to hang.  It looks like the xHCI driver is stuck looping trying to update the dequeue pointer.  Which probably means td->last_trb isn't set to a TRB in the ring that the event came from.  (Or it could be set to a link TRB, although I don't know how that could be possible.  Possibly due to this NEC host needing the link_trb quirk?)  I'll think about it and come up with a debugging patch in a couple hours.
Comment 14 Matej Laitl 2011-08-09 20:51:30 UTC
Created attachment 68242 [details]
netconsole-3.txt lockup dump type 2

Second type of lockup dump. Looking at the dump more closely, it looks very similar to the first one, but I'll let you judge.
Comment 15 Sarah Sharp 2011-08-09 22:18:17 UTC
Can you run the gdb command again to find out where finish_td+0xd6 is for the second crash?
Comment 16 Matej Laitl 2011-08-10 11:26:48 UTC
(gdb) l *(finish_td+0xd6)
0xb9f6 is in finish_td (drivers/usb/host/xhci-ring.c:1554).
1549                            xhci_cleanup_halted_endpoint(xhci,
1550                                            slot_id, ep_index, ep_ring->stream_id,
1551                                            td, event_trb);
1552                    } else {
1553                            /* Update ring dequeue pointer */
1554                            while (ep_ring->dequeue != td->last_trb)
1555                                    inc_deq(xhci, ep_ring, false);
1556                            inc_deq(xhci, ep_ring, false);
1557                    }
1558
Comment 17 Sarah Sharp 2011-08-10 18:16:22 UTC
Ok, that looks like the same place then.  Can you pull down my git tree, checkout the for-usb-linus-transcend-fix branch, and attach the dmesg with CONFIG_USB_DEBUGGING and CONFIG_USB_XHCI_HCD_DEBUGGING turned on?  I think it will at least prevent the kernel hang, but it's just a band-aid fix with some debugging to find out where the problem is.
Comment 18 Sarah Sharp 2011-08-10 18:53:34 UTC
Sorry, I forgot to commit the changes.  The debug patch should actually be on the for-usb-linus-transcend-fix branch now.
Comment 19 Alan Stern 2011-08-10 21:15:51 UTC
You should be aware that Daniel Mack has just posted a patch fixing a bug in the snd-usb-caiaq driver:

   http://marc.info/?l=linux-usb&m=131299041310285&w=2

I don't know if that would be related to this problem or not.  The effect of the bug is that sometimes and isochronous URB would be submitted in which some of the individual packet lengths are set to 0.
Comment 20 Matej Laitl 2011-08-10 22:25:55 UTC
Created attachment 68442 [details]
netconsole-4.txt NULL ptr deref with 3.1-rc1 + for-usb-linus-transcend-fix

Hi Sarah,
applying your patch resulted in a kernel NULL ptr dereference → panic instead of the lockup. Sometimes I get no lockup/xhci/null ptr dump at all, but caps lock keeps flashing (and system hungs) so that probably means the same symptom.

I'll try to enhance the debugging to show what ptr is null instead of dereferencing it.

(gdb) l *(finish_td+0x136)
0xba56 is in finish_td (drivers/usb/host/xhci-ring.c:1567).
1562                                            xhci_warn(xhci, "Event TRB: %08x %08x %08x %08x\n",
1563                                                            lower_32_bits(le64_to_cpu(event_trb->link.segment_ptr)),
1564                                                            upper_32_bits(le64_to_cpu(event_trb->link.segment_ptr)),
1565                                                            le32_to_cpu(event_trb->link.intr_target),
1566                                                            le32_to_cpu(event_trb->link.control));
1567                                            xhci_warn(xhci, "TD last TRB: %08x %08x %08x %08x\n",
1568                                                            lower_32_bits(le64_to_cpu(td->last_trb->link.segment_ptr)),
1569                                                            upper_32_bits(le64_to_cpu(td->last_trb->link.segment_ptr)),
1570                                                            le32_to_cpu(td->last_trb->link.intr_target),
1571                                                            le32_to_cpu(td->last_trb->link.control));
Comment 21 Sarah Sharp 2011-08-10 23:08:59 UTC
The xHCI driver doesn't know how to deal with zero length transfers at all (other than control transfers), so that's probably where the issue is.  In fact, looking at the code, xhci_td->last_td gets set in a for loop based on the number of TRBs to queue, which would be zero in the case where the driver submits isochronous URBs with some of the buffer lengths set to NULL.  That would explain why the latest patch oopsed when it tried to dereference td->last_trb.
Comment 22 Matej Laitl 2011-08-10 23:12:24 UTC
Created attachment 68482 [details]
netconsole-10.txt td->last_trb is NULL, the system hand in the end

Hi Sarah and Alan,
I've applied http://marc.info/?l=linux-usb&m=131299041310285&w=2 (which seems not to make a difference) and following little patch on top of 3.1-rc1 + for-usb-linus-transcend-fix and get this interesting result.

debugging patch applied:

index 204b8cd..4e77c87 100644
--- a/drivers/usb/host/xhci-ring.c
+++ b/drivers/usb/host/xhci-ring.c
@@ -1564,11 +1564,14 @@ static int finish_td(struct xhci_hcd *xhci, struct xhci_td *td,
                                                        upper_32_bits(le64_to_cpu(event_trb->link.segment_ptr)),
                                                        le32_to_cpu(event_trb->link.intr_target),
                                                        le32_to_cpu(event_trb->link.control));
-                                       xhci_warn(xhci, "TD last TRB: %08x %08x %08x %08x\n",
+                                       if(td->last_trb)
+                                               xhci_warn(xhci, "TD last TRB: %08x %08x %08x %08x\n",
                                                        lower_32_bits(le64_to_cpu(td->last_trb->link.segment_ptr)),
                                                        upper_32_bits(le64_to_cpu(td->last_trb->link.segment_ptr)),
                                                        le32_to_cpu(td->last_trb->link.intr_target),
                                                        le32_to_cpu(td->last_trb->link.control));
+                                       else
+                                               xhci_warn(xhci, "TD last TRB (td->last_trb) is a NULL pointer\n");
                                        xhci_warn(xhci, "Endpoint ring:\n");
                                        xhci_debug_ring(xhci, ep_ring);
                                        xhci_warn(xhci, "Event ring:\n");
Comment 23 Sarah Sharp 2011-08-10 23:33:30 UTC
Ok, to rule out zero-length buffers, I've created another patch on the for-usb-linus-transcend-fix and incorported your debug fix.  Can you retest?
Comment 24 Matej Laitl 2011-08-10 23:48:58 UTC
(In reply to comment #23)
> Ok, to rule out zero-length buffers, I've created another patch on the
> for-usb-linus-transcend-fix and incorported your debug fix.  Can you retest?

Thanks, will do, but it'll have to wail till tommorow. (it seems that you're all in a different timezone, I'm in GMT+2, which makes collaboration a bit impractical) :-)
Comment 25 Matej Laitl 2011-08-11 12:07:23 UTC
Created attachment 68542 [details]
dmesg-20.txt dmesg from audio play. no hang, but audio stops after a few secs

Partial success, I often don't get any lockups with 3.1-rc1 + current for-usb-linus-transcend-fix + http://git.kernel.org/?p=linux/kernel/git/tiwai/sound-2.6.git;a=commitdiff;h=15439b , yay!

But the audio plays just a few secs, than some white noise can be heard and it gradually slows down to silence. Attached dmesg is from this event. Sometimes in 1 of 10 tries a got different, more interesting behaviour, see the next attachement.
Comment 26 Matej Laitl 2011-08-11 12:16:19 UTC
Created attachment 68552 [details]
dmesg-26-different.txt from audio play, audio stop after few secs and system hangs _after_ I hit Ctrl+C

More rarely I get this behaviour: audio also stops after a few secs, but the aplay command does not return. The system is otherwise usable _until_ I press CTRL+C - aplay says it has been interrupted and the system hangs at that point.

(the more frequent behaviour is that that the audio stops, but aplay returns and the system is okay)

Attached dmesg is also a bit different one, see the
usb 3-2: Zero-length buffer submitted.
lines.
Comment 27 Sarah Sharp 2011-08-11 17:10:08 UTC
Hmm, so the first dmesg (https://bugzilla.kernel.org/attachment.cgi?id=68542) I think shows the endpoint rings are pretty messed up.  It looks like the driver keeps submitting the transfers, and the host controller never services them, so they eventually get canceled.  It's hard to tell exactly how the driver got into that state, and I think you're missing some debug output (perhaps you need to run `sudo dmesg -n 8` before capturing?).

It turns out the driver does need to handle zero-length isochronous transfers, but I think your log might also be an indication that the three patches to clean up after a failed isochronous transfer might not be working properly.  If you don't mind, I'd like to continue trying to get that bug fixed before we actually make your audio device work by adding support for zero-length transfers. :)

Can you try to run the test again after running `sudo dmesg -n 8` and running netconsole to capture the crash?  This log (https://bugzilla.kernel.org/attachment.cgi?id=68552) doesn't give any oops output, except for the message that the host controller died.
Comment 28 Daniel Mack 2011-08-11 17:47:03 UTC
I think that is digging at the wrong end. The driver doesn't submit any zero-length frames anymore, after my latest patch.

The only thing that might be worth mentioning is that the driver assumes that outbound iso packets *are* serviced by the host controller. The theory of operation is as follows: we queue a number (32) of iso urbs for incoming data and pre-allocate the same amount of urbs for outgoing data as well. Once a packet is received, the driver determines the internal index of the received frame and assumes that the outbound urb with the same index is good to use. There are currently no checks whether the output urb was actually given back by the USB subsystem. That is admittedly somewhat sub-optimal, and we should maybe track the output urbs in a better way. As the code stands, it is in theory possible that we submit an urb again before it actually returned, but OTOH, there is not much we can really do in case our queued urbs are not being processed. And such a check was never necessary in the past.
Comment 29 Sarah Sharp 2011-08-11 18:27:00 UTC
Not tracking outbound URBs is less than ideal, and mucking with URBs that haven't been completed just sounds like a recipe for disaster.  Doesn't the documentation say something about "no guarantee of completion ordering across different endpoints"?

Probably your code only worked before because the EHCI driver did all the scheduling in software, and you could count on it scheduling the outbound transfers before the in-bound transfers if you called usb_submit_urb() in the right order.  Does that sound likely?

The bigger challenge here is that the xHCI host controller driver doesn't have control over the bus schedule.  The hardware does all the scheduling internally.  There's no guarantee what order across endpoints the URBs will complete in.  There's even some leeway in how the host controller is allowed to schedule URBs within a service interval, such that you could get an IN URB completing before an OUT URB in one service interval, and then an OUT before an IN in another service interval.

Another issue is that the xHCI host controller can report a "missed service event" and tell the driver that it had to skip some isochronous frames.  We can't control when that could happen, so you could see, say, 3 inbound frames skipped when the host controller hasn't finished processing the 3rd outbound frame.

So that's my theory about why the check wasn't necessary before.  We still haven't proven that's actually Matej's issue, but it probably does need to get fixed in your driver.
Comment 30 Daniel Mack 2011-08-11 18:59:36 UTC
Well, I didn't count on the host controller to schedule the outbound before the inbound, I just chose a ring buffer size that is large enough to cope with worst case latency scenarios. The background of this is that I wanted to avoid dynamic allocation of outbound urb from inside the inbound urb's completion handler, so I need some sort of fixed-size implementation.

So the only thing that could happen here is that due to a bug in XHCI, urbs aren't dequeued at all, and my driver doesn't handle this fact properly and makes the situation much worse by queuing urbs again.

Speaking about ordering, you say I cannot count on anything? URBs queued for incoming data can be returned to the caller in arbitrary order, and I have to find out myself whether I actually received all "earlier" urbs already before I process the data and push the data to ALSA? Really? Let me put an example: when I queue 32 urbs for incoming data, each pointing to a specific buffer location within a continuous data stream buffer, I must cope with the fact that #6 returns before #5? If that is the case, I think there a much more drivers that will break due to wrong assumptions.

I'll hack up a quick debug patch that should show whether we're debugging the right thing ...
Comment 31 Daniel Mack 2011-08-11 19:03:53 UTC
Created attachment 68612 [details]
Debug patch to not queue URBs that have not yet been given back

That's a hackish approach which also doesn't recover from situations where this hits us, but it should at least yell at you in the logs and tell us that we're looking at the right thing.

Compile-tested only, I'm not at home, so I don't have any hardware to test with.
Comment 32 Matej Laitl 2011-08-11 19:05:31 UTC
Created attachment 68622 [details]
messages-40.txt non-truncated version of dmesg-20.txt

Sorry, dmesg-20.txt might be truncated, this is a better version. I started the logging before plugging the card this time.

I await your patches! ;)
Comment 33 Sarah Sharp 2011-08-11 19:43:51 UTC
> Speaking about ordering, you say I cannot count on anything? URBs queued for
> incoming data can be returned to the caller in arbitrary order, and I have to
> find out myself whether I actually received all "earlier" urbs already before
> I
> process the data and push the data to ALSA?

Within a particular endpoint, you will get the URBs completed in the order you submitted them.  Only in the case where the host controller hardware is dying will you ever get them out of order.  But in that case, the host controller won't allow you submit URBs anyway, because it knows the host controller is hosed.  It's a pretty rare case anyway.

However, it seemed like you were saying you were trying to rely on completion ordering *across* endpoints:

> Once a packet is received, the driver determines the internal index of the
> received frame and assumes that the outbound urb with the same index is good
> to use.

It seemed like if index 3 IN URB completed, you assumed index 3 of the OUT URB was OK to queue.  But what if the host controller had scheduled the OUT URB after the IN URB, and it was still being used by the host controller?
Comment 34 Daniel Mack 2011-08-11 19:53:01 UTC
(In reply to comment #33)
> > Speaking about ordering, you say I cannot count on anything? URBs queued
> for
> > incoming data can be returned to the caller in arbitrary order, and I have
> to
> > find out myself whether I actually received all "earlier" urbs already
> before I
> > process the data and push the data to ALSA?
> 
> Within a particular endpoint, you will get the URBs completed in the order
> you
> submitted them.  Only in the case where the host controller hardware is dying
> will you ever get them out of order.  But in that case, the host controller
> won't allow you submit URBs anyway, because it knows the host controller is
> hosed.  It's a pretty rare case anyway.

Ok, it should be easy to fix the driver then. However, I'm still not sure whether it's actually necessary.

> However, it seemed like you were saying you were trying to rely on completion
> ordering *across* endpoints:
> 
> > Once a packet is received, the driver determines the internal index of the
> > received frame and assumes that the outbound urb with the same index is
> good
> > to use.
> 
> It seemed like if index 3 IN URB completed, you assumed index 3 of the OUT
> URB
> was OK to queue.  But what if the host controller had scheduled the OUT URB
> after the IN URB, and it was still being used by the host controller?

Yes, that is in theory possible, but I do that based on the index. So once #1 comes back from input, I queue #1 for output. The only case I can think of where this breaks is when the host controller didn't queue 32 urbs in a row, and in this case, something is seriously wrong anyway, right? It should still be fixed, no question. But I doubt that this is actually the root cause of the problem here.

Let's wait what Matej reports and whether he sees any of those new log messages I added with the debug patch.
Comment 35 Matej Laitl 2011-08-11 23:01:17 UTC
Created attachment 68632 [details]
dmesg-43.txt full dmesg with Debug patch for caiaq

Hi Sarah and Daniel,
applying "Debug patch to not queue URBs that have not yet been given back" did not introduce audible changes - the audio still plays only for a second and then gets mixed with white noise and stops.

See attached dmesg where both
 * usb 1-2: Zero-length buffer submitted.   (exactly once and before any ...Weird! messages)
 * outbound URB 23 is still processed? Weird!
messages can be seen. (note the urb number is always the same in this case)

Sometimes I still also get hangs (with no hang dumps printed, aplay never returns and sigterm → aplay triggers hang) in which case the dumps end with these lines:

[ 1190.514236] xhci_hcd 0000:05:00.0: underrun event on endpoint
[ 1190.514237] xhci_hcd 0000:05:00.0: Underrun Event for slot 1 ep 11 still with TDs queued?
[ 1190.514272] outbound URB 20 is still processed? Weird!
[ 1190.519575] xhci_hcd 0000:05:00.0: Miss service interval error, set skip flag
[ 1190.519792] xhci_hcd 0000:05:00.0: Found td. Clear skip flag.
[ 1190.519822] outbound URB 22 is still processed? Weird!
[ 1190.519846] outbound URB 23 is still processed? Weird!
[ 1190.519866] outbound URB 24 is still processed? Weird!
[ 1225.121286] xhci_hcd 0000:05:00.0: Poll event ring: 4295895040
[ 1225.166060] xhci_hcd 0000:05:00.0: op reg status = 0xffffffff
[ 1225.166085] xhci_hcd 0000:05:00.0: HW died, polling stopped.

(Giveback URB messages stripped out)
Comment 36 Daniel Mack 2011-08-11 23:26:02 UTC
(In reply to comment #35)
> Created an attachment (id=68632) [details]
> dmesg-43.txt full dmesg with Debug patch for caiaq
> 
> Hi Sarah and Daniel,
> applying "Debug patch to not queue URBs that have not yet been given back"
> did
> not introduce audible changes - the audio still plays only for a second and
> then gets mixed with white noise and stops.
> 
> See attached dmesg where both
>  * usb 1-2: Zero-length buffer submitted.   (exactly once and before any
> ...Weird! messages)
>  * outbound URB 23 is still processed? Weird!
> messages can be seen. (note the urb number is always the same in this case)
> 
> Sometimes I still also get hangs (with no hang dumps printed, aplay never
> returns and sigterm → aplay triggers hang) in which case the dumps end with
> these lines:
> 
> [ 1190.514236] xhci_hcd 0000:05:00.0: underrun event on endpoint
> [ 1190.514237] xhci_hcd 0000:05:00.0: Underrun Event for slot 1 ep 11 still
> with TDs queued?
> [ 1190.514272] outbound URB 20 is still processed? Weird!
> [ 1190.519575] xhci_hcd 0000:05:00.0: Miss service interval error, set skip
> flag
> [ 1190.519792] xhci_hcd 0000:05:00.0: Found td. Clear skip flag.
> [ 1190.519822] outbound URB 22 is still processed? Weird!
> [ 1190.519846] outbound URB 23 is still processed? Weird!
> [ 1190.519866] outbound URB 24 is still processed? Weird!
> [ 1225.121286] xhci_hcd 0000:05:00.0: Poll event ring: 4295895040
> [ 1225.166060] xhci_hcd 0000:05:00.0: op reg status = 0xffffffff
> [ 1225.166085] xhci_hcd 0000:05:00.0: HW died, polling stopped.

Note that with my debug patch applied, output urbs are not actually submitted when that messages is printed. So the worst thing that can happen now is that the hardware encounters underruns in its data stream because of missing packets. That might result in distortion and dropouts, but it shouldn't actually be able to crash the host controller.

Given that, I dare to claim that the reason for these urbs that are still pending in the output queue is also some sort of a flaw in somer other layer than my driver.
Comment 37 Matej Laitl 2011-08-11 23:37:05 UTC
(In reply to comment #36)
> Note that with my debug patch applied, output urbs are not actually submitted
> when that messages is printed. So the worst thing that can happen now is that
> the hardware encounters underruns in its data stream because of missing
> packets. That might result in distortion and dropouts, but it shouldn't
> actually be able to crash the host controller.

Update: now I no longer get lockups, "zero-length buffers submitted" or "outbound URB 23 is still processed? Weird!" messages
with http://marc.info/?l=linux-usb&m=131309876820335&w=2 applied.

So now the problem reduces to those audio issues, but you seem to know to to handle these, Daniel?  (as I noted in the e-mail, the audio "quality" is actually much worse with mentioned patch applied)
Comment 38 Sarah Sharp 2011-08-12 04:39:36 UTC
Matej: Try turning off xHCI debugging and see if you stop getting underruns and missed service events.  The debugging can cause isochronous transfer completions to get delayed.

Daniel: I'm not sure what you mean by "valid status" on URBs.  If the xHCI encounters a missed service event, we'll leave the urb->status cleared to zero, and set the individual frame buffer status to -EXDEV and the frame length to zero.  Not all of the frames in an URB may be skipped.
Comment 39 Daniel Mack 2011-08-12 10:26:23 UTC
Created attachment 68672 [details]
Patch for proper tracking of output URBs in snd_usb_caiaq

Track the output urbs that were submitted and make sure not to queue one that was not yet given back.

Again, this is only compile-tested.
Comment 40 Matej Laitl 2011-08-12 11:51:44 UTC
Sarah & Daniel, I've got some partially good news. I've disabled CONFIG_USB_XHCI_HCD_DEBUGGING, but let CONFIG_USB_DEBUG turned on and retested both EHCI and xHCI with various patch combinations:

 * for-usb-linus-transcend-fix
   + ALSA: snd-usb-caiaq: Correct offset fields of outbound iso_frame_desc
   + Debug patch to not queue URBs that have not yet been given back:

   EHCI: perfect, this patch combination even fixed a bug I thought was unrelated - crackles would occur every 10secs in playback on EHCI. (in fact, i bought xHCI because I thought it was a harware error) Do you want me to isolate what patch caused it?

   xHCI: the best result with USB 3.0 so far, the song plays arbitrarily long but sounds like a badly tuned radio. I also get several "outbound URB 13 is still processed? Weird!" messages per second during playback, which seems to be connected to the distortion. (I always get numbers 13, 27, or both)


 * for-usb-linus-transcend-fix
   + ALSA: snd-usb-caiaq: Correct offset fields of outbound iso_frame_desc
   + Debug patch to not queue URBs that have not yet been given back:
   + "if (len == 0)	continue;" patch ( http://marc.info/?l=linux-usb&m=131309876820335&w=2 )

   Both EHCI & xHCI: playback is approx 10 times slower, distorted, but no messages in dmesg. (and both USB controllers behave the same ;) )


 * for-usb-linus-transcend-fix
   + ALSA: snd-usb-caiaq: Correct offset fields of outbound iso_frame_desc
   + proper tracking of output URBs in snd_usb_caiaq

   Both EHCI & xHCI: no sound playback at all, and exacly 32 following messages in dmesg:
   snd-usb-caiaq log: Unable to find an output urb to use



I'm not longer able to reproduce any lockups with CONFIG_USB_XHCI_HCD_DEBUGGING disabled, probably a heisenbug? Sarah, I'm still willing to test patches that improve xHCI behaviour even when my audio is fixed.
Comment 41 Daniel Mack 2011-08-12 12:00:55 UTC
Created attachment 68722 [details]
Patch for proper tracking of output URBs in snd_usb_caiaq (2)

Same approach with one less braino in urb allocation.
Comment 42 Daniel Mack 2011-08-12 12:07:00 UTC
(In reply to comment #40)
> Sarah & Daniel, I've got some partially good news. I've disabled
> CONFIG_USB_XHCI_HCD_DEBUGGING, but let CONFIG_USB_DEBUG turned on and
> retested
> both EHCI and xHCI with various patch combinations:
> 
>  * for-usb-linus-transcend-fix
>    + ALSA: snd-usb-caiaq: Correct offset fields of outbound iso_frame_desc
>    + Debug patch to not queue URBs that have not yet been given back:
> 
>    EHCI: perfect, this patch combination even fixed a bug I thought was
> unrelated - crackles would occur every 10secs in playback on EHCI. (in fact,
> i
> bought xHCI because I thought it was a harware error) Do you want me to
> isolate
> what patch caused it?

I'm pretty sure it's the "correct offset" bug - I had that too with my new laptop that I got some days ago :)

Follow this thread if you're curious: http://mailman.alsa-project.org/pipermail/alsa-devel/2011-July/041684.html

>    xHCI: the best result with USB 3.0 so far, the song plays arbitrarily long
> but sounds like a badly tuned radio. I also get several "outbound URB 13 is
> still processed? Weird!" messages per second during playback, which seems to
> be
> connected to the distortion. (I always get numbers 13, 27, or both)
> 
> 
>  * for-usb-linus-transcend-fix
>    + ALSA: snd-usb-caiaq: Correct offset fields of outbound iso_frame_desc
>    + Debug patch to not queue URBs that have not yet been given back:
>    + "if (len == 0)    continue;" patch (
> http://marc.info/?l=linux-usb&m=131309876820335&w=2 )
> 
>    Both EHCI & xHCI: playback is approx 10 times slower, distorted, but no
> messages in dmesg. (and both USB controllers behave the same ;) )

Yes, you can drop this one. The XHCI controller must be made cope with these zero-frames, and Sarah said she has ideas for that already.

>  * for-usb-linus-transcend-fix
>    + ALSA: snd-usb-caiaq: Correct offset fields of outbound iso_frame_desc
>    + proper tracking of output URBs in snd_usb_caiaq
> 
>    Both EHCI & xHCI: no sound playback at all, and exacly 32 following
>    messages
> in dmesg:
>    snd-usb-caiaq log: Unable to find an output urb to use

Eh, my bad. Please try the new version of this patch I just submitted.

Thanks for all your testing!
Comment 43 Sarah Sharp 2011-08-12 17:59:48 UTC
Matej, I just pushed a fix to allow zero-length transfers to the for-usb-linus-transcend-fix branch.  Can you pull and retest?  You might still need the patch from Daniel for proper tracking of outbound URBs, and the patch "ALSA: snd-usb-caiaq: Correct offset fields of outbound iso_frame_desc".

If you can test with CONFIG_USB_XHCI_HCD_DEBUGGING both on and off, that would be great.
Comment 44 Daniel Mack 2011-08-12 21:30:52 UTC
(In reply to comment #43)
> Matej, I just pushed a fix to allow zero-length transfers to the
> for-usb-linus-transcend-fix branch.  Can you pull and retest?

Have you really? I guess it should be here:

  http://git.kernel.org/?p=linux/kernel/git/sarah/xhci.git;a=shortlog;h=refs/heads/for-usb-linus-transcend-fix

but I can't find it. In fact, I see the "xhci: Don't allow zero-length isoc buffers." commit, which shouldn't be there, right?

>  You might still
> need the patch from Daniel for proper tracking of outbound URBs, and the
> patch
> "ALSA: snd-usb-caiaq: Correct offset fields of outbound iso_frame_desc".

Yes, both are mandatory.
Comment 45 Sarah Sharp 2011-08-12 21:53:05 UTC
Ok, I don't know what's wrong with the git servers, but gitk shows I have pushed this to master.kernel.org.  In any case, here are the two patches to apply:
http://userweb.kernel.org/~sarah/0001-Revert-xhci-Don-t-allow-zero-length-isoc-buffers.patch
http://userweb.kernel.org/~sarah/0002-xhci-Handle-zero-length-isochronous-packets.patch
Comment 46 Matej Laitl 2011-08-13 12:55:45 UTC
Created attachment 68752 [details]
messages-50-truncated.txt dmesg with CONFIG_USB_XHCI_HCD_DEBUGGING on

Success!

With current for-usb-linus-transcend-fix + the two caiaq patches, the soundcard works perfectly with both EHCI and xHCI and both states of CONFIG_USB_XHCI_HCD_DEBUGGING. Thanks a lot Sarah & Daniel, you rock!

I still get lots of kernel messages during playback and CONFIG_USB_XHCI_HCD_DEBUGGING enabled, but they do not impact the playback unless logged to console (dmesg -n8); reading them through /proc/kmsg is okay. I'm attaching the start of these recorded from the point audio is started.

Sarah, let me know if you want me to test your xHCI patches in different conditions.

From my point, this bug is resolved for me, should I mark is as such or you'll do it?
Comment 47 Daniel Mack 2011-08-14 09:48:19 UTC
(In reply to comment #46)
> Created an attachment (id=68752) [details]
> messages-50-truncated.txt dmesg with CONFIG_USB_XHCI_HCD_DEBUGGING on
> 
> Success!
> 
> With current for-usb-linus-transcend-fix + the two caiaq patches, the
> soundcard
> works perfectly with both EHCI and xHCI and both states of
> CONFIG_USB_XHCI_HCD_DEBUGGING. 

Glad to hear that. Thanks to everyone who helped in sorting this out :)
Comment 48 Sarah Sharp 2011-08-15 17:11:38 UTC
(In reply to comment #46)
> Created an attachment (id=68752) [details]
> messages-50-truncated.txt dmesg with CONFIG_USB_XHCI_HCD_DEBUGGING on
> 
> Success!
> 
> With current for-usb-linus-transcend-fix + the two caiaq patches, the
> soundcard
> works perfectly with both EHCI and xHCI and both states of
> CONFIG_USB_XHCI_HCD_DEBUGGING. Thanks a lot Sarah & Daniel, you rock!

Great!  Can you double check that removing some of the debugging we added still allows the driver to work?  I've added just the zero-length transfer fix to my for-usb-linus branch, so can you pull and re-test that, plus the two caiaq patches?

(If the kernel.org mirrors are being too slow, you can just git revert commits 3c1a996 and 3e81d55.)
Comment 49 Alan Stern 2011-08-15 18:01:37 UTC
Sarah, I just looked at your patch.  It would be more efficient if addr and td_len were declared unsigned rather than u64.  The high-order bits don't matter because you're going to divide by TRB_MAX_BUF_SIZE anyway.
Comment 50 Sarah Sharp 2011-08-16 15:17:11 UTC
(In reply to comment #49)
> Sarah, I just looked at your patch.  It would be more efficient if addr and
> td_len were declared unsigned rather than u64.  The high-order bits don't
> matter because you're going to divide by TRB_MAX_BUF_SIZE anyway.

I can certainly modify the variables to be unsigned in a different patch, but I'd like to keep this patch as a bug fix only so we can backport it to stable.
Comment 51 Matej Laitl 2011-08-16 15:33:31 UTC
(In reply to comment #48)
> (In reply to comment #46)
> > Success!
> 
> Great!  Can you double check that removing some of the debugging we added
> still
> allows the driver to work?  I've added just the zero-length transfer fix to
> my
> for-usb-linus branch, so can you pull and re-test that, plus the two caiaq
> patches?

Rebasing the two ALSA patches on top of your current for-usb-linux works just the same -- i.e. perfectly. (with XHCI debugging on, USB debugging off and dmesg -n8)
Comment 52 Florian Mickler 2011-08-29 19:01:00 UTC
A patch referencing this bug report has been merged in Linux v3.1-rc4:

commit 48df4a6fd8c40c0bbcbca2044f5f2bc75dcf6db1
Author: Sarah Sharp <sarah.a.sharp@linux.intel.com>
Date:   Fri Aug 12 10:23:01 2011 -0700

    xhci: Handle zero-length isochronous packets.
Comment 53 Greg Kroah-Hartman 2012-02-22 21:16:32 UTC
All USB bugs should be sent to the linux-usb@vger.kernel.org mailing 
list, and not entered into bugzilla.  Please bring this issue up there,
if it is still a problem in the latest kernel release.
Comment 54 Florian Mickler 2012-02-23 22:15:51 UTC
Let's assume it is fixed by the patch mentioned in the note above. (Bugzilla notifications were stuck again, so I didn't notice my script posting this notice to the bug, else I would probably already have closed it already... :-) )