Bug 72891

Summary:	Too many ISO frames scheduled when starting stream
Product:	v4l-dvb	Reporter:	beralt (linux)
Component:	em28xx	Assignee:	em28xx (v4l-dvb_em28xx)
Status:	REOPENED ---
Severity:	normal	CC:	bugzilla-kernel.org, crope, lee295012, martin, mchehab, mybigspam, nigel.jewell, schaefer.frank, stern, szg00000, tenminjoe
Priority:	P1
Hardware:	x86-64
OS:	Linux
Kernel Version:	3.13.6	Subsystem:
Regression:	No	Bisected commit-id:
Attachments:	Output of lsusb for PCTV 510e usbmon trace Initialize ehci->last_iso_frame before it gets used

Description beralt 2014-03-24 10:01:16 UTC

After a recent kernel update to 3.13.6 my USB DVB-C device, a PCTV 510e, stopped working. The device is recognized as usual, and at first it is possible to get a stream from the device. After a while the stream stops, and further data transfer is not possible. This is probably related to a lot of fixes that went in recently to get USB working for bulk transfers.

As suggested I tried the media_build approach yesterday to test backports of recent patches that did not went into 3.13. This did not fix the problem however. After a while the error "submit of urb 0 failed (error=-27)" appears in the syslog. Subsequent streams fail silently.

Hardware:
- Pinnacle Systems PCTV 510e @ 480 Mbps (2304:0242, interface 0, class 0)
- em28xx: chip ID is em2884
- em28xx: DVB interface 0 found: isoc
- Intel® NUC Board DCP847SKE [1] with QS77 chipset [2]
- The DVB-C USB stick is connected to a USB 2.0 port (the NUC kit only has USB 2.0 ports)

USB tree:
/:  Bus 02.Port 1: Dev 1, Class=root_hub, Driver=ehci-pci/2p, 480M
    |__ Port 1: Dev 2, If 0, Class=Hub, Driver=hub/8p, 480M
        |__ Port 1: Dev 3, If 0, Class=Vendor Specific Class, Driver=em28xx, 480M
        |__ Port 2: Dev 4, If 0, Class=Hub, Driver=hub/4p, 12M
            |__ Port 4: Dev 6, If 0, Class=Vendor Specific Class, Driver=mceusb, 12M
        |__ Port 7: Dev 5, If 0, Class=Vendor Specific Class, Driver=ftdi_sio, 12M

[1] http://www.intel.com/content/www/us/en/motherboards/desktop-motherboards/desktop-board-dcp847ske.html
[2] http://www.intel.com/content/www/us/en/chipsets/performance-chipsets/mobile-chipset-qs77.html

Comment 1 beralt 2014-03-25 17:05:40 UTC

Further investigation shows that the error is triggered from the ehci scheduler[1]. I'm currently running with some dynamic debugging enabled on those lines, as well as most em28xx debug options turned on.

[1] http://lxr.free-electrons.com/source/drivers/usb/host/ehci-sched.c#L1663

Comment 2 beralt 2014-03-25 17:10:16 UTC

And here is the trace obtained upon failure, showing that the scheduler overflows. Maybe the URB's are not removed from the queue, since this error happens after opening and closing the streams for a while.

kernel: em2884 #0/2-dvb: Using 5 buffers each with 64 x 940 bytes
kernel: em2884 #0 em28xx_init_usb_xfer :em28xx: called em28xx_init_usb_xfer in mode 2
kernel: ehci-pci 0000:00:1d.0: request ffff8800ccc59000 would overflow (8136+63 >= 8192)
kernel: submit of urb 0 failed (error=-27)
kernel: em2884 #0 em28xx_uninit_usb_xfer :em28xx: called em28xx_uninit_usb_xfer in mode 2

Comment 3 beralt 2014-03-26 13:01:49 UTC

This was already fixed in media_build, but somehow the backported modules did not load into my kernel. Seems the media_build script does not work completely as intended on my distribution (Arch Linux), I needed to run depmod manually.

Comment 4 beralt 2014-03-27 13:53:28 UTC

Apparently the problem still occurs in the backported module, since I got the error again a few minutes ago. So I'm reopening the report again, sorry for the confusion.

em2884 #0/2-dvb: Using 5 buffers each with 64 x 940 bytes, alternate 1
em2884 #0 em28xx_init_usb_xfer :em28xx: called em28xx_init_usb_xfer in mode 2
submit of urb 0 failed (error=-27)
em2884 #0 em28xx_uninit_usb_xfer :em28xx: called em28xx_uninit_usb_xfer in mode 2

This time I'm sure about the version used.

Comment 5 Nigel Jewell 2014-03-30 10:38:22 UTC

I am also seeing this with a 290e with any of the FC20 3.13 kernels.  It works fine with 3.11.10 from the Fedora repo.

Please let me know if I can provide anything that would help resolve this.

Comment 6 beralt 2014-03-30 11:48:56 UTC

There were some changes to the module made by Frank Schaefer which added analog video and improved USB endpoint and alternate setting handling. Those commits introduced this bug, and there are some weird USB related timing issues that need to be handled.

So I'm adding the USB descriptor for my device to this bug report. I have some serious doubts about the correctness of the bNumInterfaces value being reported as "1" by my device, since there are 8 Interface Descriptors present.

Between those the various configurations of the endpoints differ, namely the wMaxPacketSize and bInterval. I also think that setting the current value of interval in the urb struct to hardcoded "1" would cause problems on both xHCI hardware (Mauro patched this in is em28xx-audio code [1]).

I'm guessing that the current code selects the wrong interface configuration, and that the USB HCI scheduling fails to fill the buffers in time.

[1] http://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/commit/drivers/media/usb/em28xx?id=d2849fa59d79dea7a6deeef13cc3efaafab3bf63

Comment 7 beralt 2014-03-30 11:49:51 UTC

Created attachment 131041 [details]
Output of lsusb for PCTV 510e

Comment 8 schaefer.frank 2014-04-01 17:37:18 UTC

(In reply to beralt from comment #6)
> There were some changes to the module made by Frank Schaefer which added
> analog video and improved USB endpoint and alternate setting handling. Those
> commits introduced this bug, and there are some weird USB related timing
> issues that need to be handled.

Which commit are you talking about ?
 
> So I'm adding the USB descriptor for my device to this bug report. I have
> some serious doubts about the correctness of the bNumInterfaces value being
> reported as "1" by my device, since there are 8 Interface Descriptors
> present.

There's nothing unusual with your device descriptor.
A single interface (#0) with 8 alternate settings, EP 0x82 = analog video, EP 0x83 = audio, EP 0x84 = DVB.

Comment 9 beralt 2014-04-01 18:26:08 UTC

(In reply to schaefer.frank from comment #8)
> (In reply to beralt from comment #6)
> > There were some changes to the module made by Frank Schaefer which added
> > analog video and improved USB endpoint and alternate setting handling.
> Those
> > commits introduced this bug, and there are some weird USB related timing
> > issues that need to be handled.
> 
> Which commit are you talking about ?

The only relevant commit I can think of is this one:
http://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/commit/drivers/media/usb/em28xx/em28xx-dvb.c?id=961717b41bc1103dcd30d293fd689a614fbfa90c

And the bug does not appear in 3.12, only in 3.13 and above (including the media_build tree). Using the media_build tree also introduces the debug message about which alternate was used, and that was 1 in my case. That seems right, so maybe this is related to EHCI scheduling instead?

>  
> > So I'm adding the USB descriptor for my device to this bug report. I have
> > some serious doubts about the correctness of the bNumInterfaces value being
> > reported as "1" by my device, since there are 8 Interface Descriptors
> > present.
> 
> There's nothing unusual with your device descriptor.
> A single interface (#0) with 8 alternate settings, EP 0x82 = analog video,
> EP 0x83 = audio, EP 0x84 = DVB.

Good to know. If there is anything I can do to help let me know.

Comment 10 schaefer.frank 2014-04-02 18:20:14 UTC

(In reply to beralt from comment #9)
> (In reply to schaefer.frank from comment #8)
> > (In reply to beralt from comment #6)
> > > There were some changes to the module made by Frank Schaefer which added
> > > analog video and improved USB endpoint and alternate setting handling.
> Those
> > > commits introduced this bug, and there are some weird USB related timing
> > > issues that need to be handled.
> > 
> > Which commit are you talking about ?
> 
> The only relevant commit I can think of is this one:
> http://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/commit/
> drivers/media/usb/em28xx/em28xx-dvb.
> c?id=961717b41bc1103dcd30d293fd689a614fbfa90c

I doubt that, this change has been introduced with kernel 3.14 and it has not been backported to older kernels. ;-)

I can't think of any USB related changes in the em28xx driver between 3.12 and 3.13.

How long can you capture before you get the error ?
Does it happen with analog and DVB capturing ?

Comment 11 beralt 2014-04-02 18:58:55 UTC

(In reply to schaefer.frank from comment #10)
> (In reply to beralt from comment #9)
> > (In reply to schaefer.frank from comment #8)
> > > (In reply to beralt from comment #6)
> > > > There were some changes to the module made by Frank Schaefer which
> added
> > > > analog video and improved USB endpoint and alternate setting handling.
> Those
> > > > commits introduced this bug, and there are some weird USB related
> timing
> > > > issues that need to be handled.
> > > 
> > > Which commit are you talking about ?
> > 
> > The only relevant commit I can think of is this one:
> > http://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/commit/
> > drivers/media/usb/em28xx/em28xx-dvb.
> > c?id=961717b41bc1103dcd30d293fd689a614fbfa90c
> 
> I doubt that, this change has been introduced with kernel 3.14 and it has
> not been backported to older kernels. ;-)

I tested with the upstream media_tree to check if the bug might have been fixed. But I used a 3.13 kernel in combination with the upstream media tree. I will test against 3.12 too, since I'm suspecting this might be related to changes in the ehci scheduler [1]

> 
> I can't think of any USB related changes in the em28xx driver between 3.12
> and 3.13.

Then we can rule out the em28xx module? Let me test this by using the upstream code on a 3.12 kernel.

> How long can you capture before you get the error ?
> Does it happen with analog and DVB capturing ?

It happens with DVB (I do not use analog). I use TVHeadend, and during its EPG data scanning it opens and closes streams, causing the init_usb_xfer calls, followed by stop_urbs when it's done. This is repeated a few times (somewhere around 40 times) and then the ehchi scheduler queue overflows. So it seems that some URB's stay in the queue, despite kill()ing them. The device stops working once the queue is full, and resetting the ehci controller by rebooting is the only solution.

[1] http://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/commit/drivers/usb/host/ehci-q.c?id=d0ce5c6b9208c79fc725c578eebdeb5724faf17d

Comment 12 schaefer.frank 2014-04-05 12:47:43 UTC

(In reply to beralt from comment #11)
> 
> Then we can rule out the em28xx module? Let me test this by using the
> upstream code on a 3.12 kernel.

I'm not saying that the actual problem isn't in the em28xx driver, but at least nothing has changed here between 3.12 and 3.13.
There have indeed been several EHCI scheduling related changes recently.
At the moment, I would focus on them.

Comment 13 deadite66 2014-04-20 10:34:48 UTC

This is something that has been plaguing me for months,i thought it was the AMD E350 motherboard i was using but finding this bug post makes me wonder.

I run mythtv-backend 0.27 with 2x PCTV 290e's and get the "submit of urb 0 failed (error=-27)" errors, i get them at least once a week.

xubuntu saucy x64 3.13.1-031301-generic.

Comment 14 Martin Sofaru 2014-06-09 11:08:56 UTC

I have the same effect [1] after a while with my PCTV 461e device. I have tried kernels 3.14 and 3.15-rc7.

[1] "submit of urb 0 failed (error=-27)"

Comment 15 Antti Palosaari 2014-06-11 22:00:47 UTC

I have seen that also quite often with multiple em28xx based DVB devices.

Comment 16 Martin Sofaru 2014-06-15 19:32:45 UTC

I wanted to git bisect this. I do not get my 461e device to work with media_build and 3.12 though.

Comment 17 Martin Sofaru 2014-07-05 08:13:15 UTC

This is still happening almost every day and requires to unload/load the em28xx modules.

Jul  5 08:02:54 foo tvheadend[4115]: linuxdvb: Montage M88DS3103 : DVB-S #0 - failed to config dmx for pid 0 [e=File too large]
Jul  5 08:02:54 foo kernel: [463524.477658] submit of urb 0 failed (error=-27)

Comment 18 Martin Sofaru 2014-08-13 14:00:05 UTC

Throwing "options em28xx usb_xfer_mode=0" into /etc/modprobe.d/em28xx.conf seems to avoid the issue for me.

HTH.

Comment 19 Nigel Jewell 2014-08-13 20:50:03 UTC

Martin - what kernel version have you tested this option with?  Thanks.

Comment 20 Martin Sofaru 2014-08-13 22:43:43 UTC

I am currently running Debian's 3.14 Kernel:
Linux foo 3.14-2-amd64 #1 SMP Debian 3.14.15-2 (2014-08-09) x86_64 GNU/Linux

Comment 21 Nigel Jewell 2014-08-17 18:25:38 UTC

Hmm.  I've tried this under 3.15.10 on Fedora 20 and it lasted for about 24 hours, but today it failed with:

Aug 17 16:17:34 pvr kernel: [28704.791573] submit of urb 0 failed (error=-27)
Aug 17 16:17:34 pvr kernel: submit of urb 0 failed (error=-27)

Reverting back to 3.11.10.

Comment 22 beralt 2014-08-29 13:58:30 UTC

(In reply to Martin Sofaru from comment #18)
> Throwing "options em28xx usb_xfer_mode=0" into /etc/modprobe.d/em28xx.conf
> seems to avoid the issue for me.
> 
> HTH.

I tried this on 3.16.1 (x86_64) and setting this option avoided the issue for about three days now. So I guess this solves the problem for me too, although I'm not sure why. Setting it to "0" means "prefer isoc". The device used isoc transfers for DVB when the issue did occur, so setting this option should not change anything.

Comment 23 deadite66 2014-08-29 14:18:45 UTC

Haven't had the problem in 3.16.1 yet and didn't set usb_xfer_mode=0.
maybe something has been fixed/changed.

Comment 24 schaefer.frank 2014-08-31 15:24:22 UTC

Using the module parameter usb_xfer_mode=0 has no effect.
Without it, the driver prefers isoc transfers, too (exception: camera devices).

I know it's annoying, but someone will have to git bisect it. :/

Comment 25 Nigel Jewell 2014-09-21 06:00:20 UTC

Unfortunately I am seeing exactly the same behaviour with kernel-3.16.2 on Fedora 20.  I don't see any evidence of a resolution.

Comment 26 Martin Long 2014-09-28 17:01:58 UTC

Any movement on this? Somebody's commit has broken this, and I've been unable to use this piece of hardware for over 6 months now. (This broken kernel is part of an LTE release of Ubuntu, which I had expected to work as previous versions).

Comment 27 Antti Palosaari 2014-09-28 17:17:09 UTC

em28xx driver is maintained by Mauro, but I think he could not reproduce issue. Also, I suspect that issue is likely seen only certain USB hist controllers as it seems to work for many people.

For me, I can see that issues with both USB controllers I have:
ASM1042 SuperSpeed USB Host Controller
SB7x0/SB8x0/SB9x0 USB EHCI Controller

Comment 28 Martin Long 2014-09-28 17:27:28 UTC

If there is no progress on getting it fixed, can we not roll back all of the patches on this module since the last known working version? Just breaking stuff and then leaving it broken is not good. 

Not sure what my USB controller is. dmesg dump (including host and TV contoller): 

[    0.804139] usb usb1: New USB device found, idVendor=1d6b, idProduct=0002
[    0.804141] usb usb1: New USB device strings: Mfr=3, Product=2, SerialNumber=1
[    0.804144] usb usb1: Product: EHCI Host Controller
[    0.804146] usb usb1: Manufacturer: Linux 3.13.0-36-generic ehci_hcd
[    0.804148] usb usb1: SerialNumber: 0000:00:04.1
[    0.862044] usb usb2: New USB device found, idVendor=1d6b, idProduct=0001
[    0.862047] usb usb2: New USB device strings: Mfr=3, Product=2, SerialNumber=1
[    0.862049] usb usb2: Product: OHCI PCI host controller
[    0.862051] usb usb2: Manufacturer: Linux 3.13.0-36-generic ohci_hcd
[    0.862053] usb usb2: SerialNumber: 0000:00:04.0
[    1.120056] usb 1-4: new high-speed USB device number 2 using ehci-pci
[    1.255914] usb 1-4: New USB device found, idVendor=1b80, idProduct=e409
[    1.255920] usb 1-4: New USB device strings: Mfr=1, Product=2, SerialNumber=0
[    1.255925] usb 1-4: Product: DVB-T TV Stick
[    1.255929] usb 1-4: Manufacturer: ITE Technologies, Inc.
[    1.368077] usb 1-5: new high-speed USB device number 3 using ehci-pci
[    1.503918] usb 1-5: New USB device found, idVendor=1b80, idProduct=e409
[    1.503926] usb 1-5: New USB device strings: Mfr=1, Product=2, SerialNumber=0
[    1.503930] usb 1-5: Product: DVB-T TV Stick
[    1.503935] usb 1-5: Manufacturer: ITE Technologies, Inc.
[    1.616084] usb 1-6: new high-speed USB device number 4 using ehci-pci
[    1.749168] usb 1-6: New USB device found, idVendor=2013, idProduct=024f
[    1.749175] usb 1-6: New USB device strings: Mfr=1, Product=2, SerialNumber=3
[    1.749179] usb 1-6: Product: PCTV 290e
[    1.749184] usb 1-6: Manufacturer: PCTV Systems
[    1.749187] usb 1-6: SerialNumber: 00000006XYS0

Comment 29 Antti Palosaari 2014-09-28 18:53:06 UTC

Only em28xx changes went to (In reply to Martin Sofaru from comment #16)
> I wanted to git bisect this. I do not get my 461e device to work with
> media_build and 3.12 though.

We certain(In reply to schaefer.frank from comment #12)
> (In reply to beralt from comment #11)
> > 
> > Then we can rule out the em28xx module? Let me test this by using the
> > upstream code on a 3.12 kernel.
> 
> I'm not saying that the actual problem isn't in the em28xx driver, but at
> least nothing has changed here between 3.12 and 3.13.
> There have indeed been several EHCI scheduling related changes recently.
> At the moment, I would focus on them.

I looked also all em28xx changes done between 3.12 and 3.13 and I can confirm Frank. There is was no change that touches DTV streaming code. These are all patches:
bdee6bd [media] em28xx-video: Swap release order to avoid lock nesting
6dbea9f [media] Add support for KWorld UB435-Q V2
be353fa [media] V4L2: em28xx: tell the ov2640 driver to balance clock enabling internally
fc5d0f8 [media] V4L2: em28xx: register a V4L2 clock source
032f1dd [media] em28xx: fix error path in em28xx_start_analog_streaming()
b68cafc [media] em28xx: fix and unify the coding style of the GPIO register write sequences
de0fc46 [media] em28xx: MaxMedia UB425-TC change demod settings
b6c7abb [media] em28xx: MaxMedia UB425-TC switch RF tuner driver to another
8d100b2 [media] em28xx: MaxMedia UB425-TC offer firmware for demodulator


Is there anyone who could test kernel 3.12 or older, with latest media drivers (media_build.git). If it works, we can rule out media driver issue.

Comment 30 Mauro Carvalho Chehab 2014-09-28 20:02:09 UTC

(In reply to Martin Long from comment #28)
> If there is no progress on getting it fixed, can we not roll back all of the
> patches on this module since the last known working version?

If you can identify what patch broke it, then we can roll the patch back, but, as Antti and Frank pointed, it is very unlikely that this change is related to em28xx. Probably, it is on some other place.

Could you please do a git bisect, in order to identify what change broke it?

Comment 31 Mauro Carvalho Chehab 2014-09-28 20:06:57 UTC

As I commented at the linux-media@vger.kernel.org, this error is likely because the USB stack is returning EFBIG when trying to submit URBs to receive the data.

The EFBIG return code happens if there's not enough space in a given
USB bus to reserve traffic for ISOC transfers when submitting the
URBs. Changing the URB size helps to reduce the amount of ISOC
transfers, making more unlikely for this bug to happen. However, the
em28xx driver already tries to use the max supported size, using the
USB descriptors.

There's one easy way to reproduce it: plug two em28xx devices with
analog TV and try to start both. The first analog TV stream will
allocate about 60% of the bus traffic, and the Kernel will return
EFBIG when trying to stream at the second one.

That's said, on a USB bus where there's just one device connected,
this error shouldn't happen, especially for DVB, where the bandwidth
requirements are typically lower.

It used to be possible to check how much was allocated for ISOC
traffic via:

# cat /sys/kernel/debug/usb/devices 

T:  Bus=02 Lev=00 Prnt=00 Port=00 Cnt=00 Dev#=  1 Spd=5000 MxCh= 1
B:  Alloc=  0/800 us ( 0%), #Int=  0, #Iso=  0
D:  Ver= 3.00 Cls=09(hub  ) Sub=00 Prot=03 MxPS= 9 #Cfgs=  1
P:  Vendor=1d6b ProdID=0003 Rev= 3.17
S:  Manufacturer=Linux 3.17.0-rc6+ xhci_hcd
S:  Product=xHCI Host Controller
S:  SerialNumber=0000:00:14.0
C:* #Ifs= 1 Cfg#= 1 Atr=e0 MxPwr=  0mA
I:* If#= 0 Alt= 0 #EPs= 1 Cls=09(hub  ) Sub=00 Prot=00 Driver=hub
E:  Ad=81(I) Atr=03(Int.) MxPS=   4 Ivl=256ms

The "Alloc" above would indicate how many slots were allocated.

However, at least here on an Atom NUC I'm testing (model DE3815TYKE) and latest
upstream Kernel + media patches, no matter if I start some traffic or not there, it still shows 0/800, at the USB 2.0 or USB 3.0 ports. It seems that there's a bug somewhere on the USB stack that it is preventing it to show.

In any case, the return of EFBIG doesn't imply that there's a bug at
em28xx driver or at the usb stack. It is just an indication that the
bus has reached its maximum hardware capacity.

That's said, I was never ever be able to get EFBIG when there's
just one isoc device connected into an USB bus.

Are you sure that you don't have any other devices connected to the USB bus that also does ISOC transfers?

Aren't you trying to start it on a VM or connecting the device via an USB hub?

Comment 32 beralt 2014-09-28 21:27:42 UTC

Let me start with the fact that this bug still haunts me, in contrast to what I posted here earlier.

I originally found this bug with the tree I gave in my first comment. And I'm pretty sure this is not related to em28xx, since I've looked through the updates and none of them have any indication that it is something to do with this bug.

On the other hand I've seen some updates to the EHCI scheduling which has been overhauled after 3.12. I checked the devices on that bus, and the em28xx based device is the only one using ISOC transfers.

Note that that this bug only pops up after quite some transfers have taken place. Is there any reason why the EFBIG would occur only after a large number of transfers?

And if you remove the em28xx module, and reload it again, it seems you get another round of successful transfers. That's my current work-around by the way.

Comment 33 Mauro Carvalho Chehab 2014-09-28 22:47:22 UTC

(In reply to beralt from comment #32)
> Let me start with the fact that this bug still haunts me, in contrast to
> what I posted here earlier.
> 
> I originally found this bug with the tree I gave in my first comment. And
> I'm pretty sure this is not related to em28xx, since I've looked through the
> updates and none of them have any indication that it is something to do with
> this bug.

Ok.

> On the other hand I've seen some updates to the EHCI scheduling which has
> been overhauled after 3.12. I checked the devices on that bus, and the
> em28xx based device is the only one using ISOC transfers.

Well, I think that non-ISOC devices could also affect, as the EHCI URB scheduler dynamically estimates the amount of traffic needed on an USB bus and recalculates the available bandwidth.

> Note that that this bug only pops up after quite some transfers have taken
> place. Is there any reason why the EFBIG would occur only after a large
> number of transfers?

I'm not an expert at the EHCI driver. I would guess that the logic that dynamically estimates the bandwidth either has a bug or maybe some traffic
burst at the bus makes it to think that the bandwidth needed for other traffic increased, making em28xx to fail.

> 
> And if you remove the em28xx module, and reload it again, it seems you get
> another round of successful transfers. That's my current work-around by the
> way.

Weird. maybe setting again the alternate (or some other call to the USB functions) makes the ehci driver to reset the traffic counters, and re-start the bandwidth estimation algorithm.
counters.

Comment 34 Antti Palosaari 2014-09-29 03:56:09 UTC

I ran PCTV 520e DVB-C channel scan in a loop around 7 hours without problems using kernel 3.11, but latest media drivers top of that. Then I switched to 3.17.0-rc5+, which has same media drivers, started running same scan in a loop and problem occurs in a few minute.

It is strong indicator that problem is somewhere on USB stack.

Problem could be reproduced rather easily re-tuning device in a loop. Tuning causes USB stream start + stop, which will fail sooner or later.

Here is simple command to scan in a loop
$ while true; do scandvb intial-channels-file; done

Comment 35 schaefer.frank 2014-09-29 18:22:02 UTC

(In reply to Martin Long from comment #28)
> If there is no progress on getting it fixed, can we not roll back all of the
> patches on this module since the last known working version? Just breaking
> stuff and then leaving it broken is not good. 

Don't cry for a roll back unless you've located the problem.

Guys, this is going in circles.
If you can reproduce the issue, do a "git bisect v3.13..v3.12" on linus tree.
Maybe you can share the work. 
Please make sure to test "good" revisions for a while.

Comment 36 Antti Palosaari 2014-09-29 19:08:03 UTC

Today I was able to reproduce bug 2 times with PCTV 520e, just right after I started scan. But after that, I have been running scan with PCTV 520e and PCTV 292e many hours without a problem. Kernel is 3.17.0-rc5+ (media master). So that bug is very hard, if not impossible, to bisect unless you will find way to reproduce it reliably...

Comment 37 Martin Long 2014-09-29 19:29:52 UTC

(In reply to schaefer.frank from comment #35)

> Don't cry for a roll back unless you've located the problem.


Well, it seems that it is likely not a problem with the em28xx module, so fair enough. 

Please, just be aware that not everyone has the skills needed to do a git bisect and compile and install kernels. Just because I don't have the skills to do this doesn't change the fact that I bought a device which was working on my system, and commits have been made on part of the kernel which have since broken this (whether that is the dvb driver or USB, or whatever). And 6 months later the people who do have the skills to fix it seem to be no nearer to getting it working, or even knowing who broke it and how. 

I completely understand that nobody is getting paid for this, but stuff like this prevents Linux from being accessible to non-technical users. If something can stay months in a broken state, then it will be stop users coming to the platform - especially those who don't have a knowledge of how to debug kernel code!

I'm very sorry that I am not able to bisect this. If I was able to then I would be spending my evenings working on this. But, surely you can understand the frustration at not being able to use my hardware for the last 6 months, especially when, until the last couple of days, nobody seems to be interested in fixing it. 

I do, on the other hand, appreciate the work that some people here are now doing on this.

Comment 38 Joe Bryant 2014-11-10 15:07:50 UTC

To whom it may concern - I'm in the middle of doing a bisect on this, but it's going to take a week or two. "git bisect" reports there are approximately 13 iterations to test. I'm letting each iteration run for 48 hours before I pronounce it "good"... I've seen it run for 9 hours and then fail, so I want to be sure the good ones really are good.

If anyone else is doing/has done this already, let me know and save me some effort! Otherwise, I will report back in a while with results.

Comment 39 Joe Bryant 2014-11-19 08:57:06 UTC

Finally finished the bisect - it points to this as the first bad commit:

http://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/commit/drivers/media/usb/em28xx?id=46c73d1d3ebc38feed1d97c6980252a0a01f6a5b

"USB: EHCI: handle isochronous underruns with tasklets"

I don't really know what I'm looking at regarding kernel USB handling, so I don't know whether that commit is actually *wrong* or whether it's just exposing an issue elsewhere. It does, at least, look like a plausible candidate.

I'm using a Hauppauge 290e and my test was to run a channel scan in a loop with a 60 second pause between iterations. It would always successfully pick up channels at first. "Bad" tests were ones where the channel scan would eventually start to fail and I'd see ""submit of urb 0 failed" in the syslog. If it continued to run without problems for 48 hours then I called it a "good" commit.

On the bad commits, the longest run before failure was 26 hours. Sometimes it would fail in about 1 hour. I didn't investigate what caused the different lengths of time between failures.

I hope that this is enough information for someone with more understanding of the source to investigate. I'd be pleased to give any more information or run any more tests that would help.

Comment 40 schaefer.frank 2014-11-27 17:26:13 UTC

(In reply to Joe Bryant from comment #39)
> Finally finished the bisect - it points to this as the first bad commit:
> 
> http://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/commit/
> drivers/media/usb/em28xx?id=46c73d1d3ebc38feed1d97c6980252a0a01f6a5b
> 
> "USB: EHCI: handle isochronous underruns with tasklets"
> 
> I don't really know what I'm looking at regarding kernel USB handling, so I
> don't know whether that commit is actually *wrong* or whether it's just
> exposing an issue elsewhere. It does, at least, look like a plausible
> candidate.

Indeed, many thanks for bisecting.
You should report this to the author of the patch (Alan Stern) and CC the linux-usb mailinglist.

Comment 41 Alan Stern 2014-11-27 17:43:59 UTC

Considering how long people have suspected this bug might have something to do with the ehci-hcd driver, it's surprising that until now, nobody has asked the maintainer of that driver to take a look.

Mauro's comments aren't quite right.  EFBIG does not mean the bus's bandwidth is fully committed -- that's what ENOSPC means.  Rather, EFBIG means ehci-hcd thinks the driver has tried to submit URBs too far into the future.  Consider this example from comment #2:

kernel: ehci-pci 0000:00:1d.0: request ffff8800ccc59000 would overflow (8136+63 >= 8192)
kernel: submit of urb 0 failed (error=-27)

The debugging message means that there currently are URBs scheduled for the next 8136 microframes (one microframe = 125 us) and the driver has tried to submit an URB that would last an additional 63 microframes.  Since ehci-hcd can't schedule more than 8192 microframes into the future, the submission fails.

Tracking this down may require adding some additional debugging lines to ehci-hcd.  But for now, the best start would be to record a usbmon trace showing the problem (see Documentation/usb/usbmon.txt).  That will show when all the URBs get submitted, where they are placed in the schedule, and where the error occurs.

Comment 42 Mauro Carvalho Chehab 2014-11-27 19:13:59 UTC

(In reply to Alan Stern from comment #41)
> Considering how long people have suspected this bug might have something to
> do with the ehci-hcd driver, it's surprising that until now, nobody has
> asked the maintainer of that driver to take a look.
> 
> Mauro's comments aren't quite right.  EFBIG does not mean the bus's
> bandwidth is fully committed -- that's what ENOSPC means.  Rather, EFBIG
> means ehci-hcd thinks the driver has tried to submit URBs too far into the
> future.

Ah!

> Consider this example from comment #2:
> 
> kernel: ehci-pci 0000:00:1d.0: request ffff8800ccc59000 would overflow
> (8136+63 >= 8192)
> kernel: submit of urb 0 failed (error=-27)
> 
> The debugging message means that there currently are URBs scheduled for the
> next 8136 microframes (one microframe = 125 us) and the driver has tried to
> submit an URB that would last an additional 63 microframes.  Since ehci-hcd
> can't schedule more than 8192 microframes into the future, the submission
> fails.

For a media application, an additional delay of 63 microframes won't hurt
at all. Even a bigger delay won't hurt. The way the media system works is
that it reserves multiple video buffers, each buffer to store a full frame.
At 30 frames/s rate (NTSC usual frame frequency), it means that, when a
buffer is filled, the next one will only be ready after 33 ms. There are even some devices that only sends URBs in burst, on a minimum of 1 ms interval.

What I'm trying to say is that, on those devices, the time that the buffers
are filled are not too precise, as they don't need to be, and sometimes the
device firmware needs to wait for a while, as the device's microcontroller could be busy handling other things (like IR or audio).

Comment 43 Joe Bryant 2014-11-27 19:41:54 UTC

Created attachment 158941 [details]
usbmon trace

Attached is a usbmon trace - I was very fortunate in that this time it failed very quickly, and so this is the full trace, over about 15 minutes.

Comment 44 Alan Stern 2014-11-28 16:32:31 UTC

This clearly shows a bug in ehci-hcd.  To figure out what the bug is, I need more debugging info.

Please edit the iso_stream_schedule() function in drivers/usb/host/ehci-sched.c.  Shortly after the use_start: label, just before the "status = -EFBIG" line, add a line to print out the values of empty, new_stream, now, base, next, and start.  Let's see what they show.

Comment 45 Joe Bryant 2014-11-29 22:17:56 UTC

Okay, I added the line thus:

printk(KERN_ALERT "EFBIG BUG - empty: %d new_stream: %d now: %d base: %d next: %d start: %d\n", empty, new_stream, now, base, next, start);

and I got:

EFBIG BUG - empty: 1 new_stream: 1 now: 2680 base: 2752 next: 4033 start: 4033

Comment 46 Alan Stern 2014-11-30 16:17:15 UTC

Created attachment 159211 [details]
Initialize ehci->last_iso_frame before it gets used

That's definitely bad...  base should always be between now - 1000 and now (modulo the schedule size), not larger than now.

Reading through the code, the reason is obvious (and it makes me feel really stupid).  It's a simple use-before-initialize bug for ehci->last_iso_frame.  The attached patch should fix it.  Let me know what happens.

Comment 47 Joe Bryant 2014-12-01 08:28:43 UTC

Many thanks for the fast patch, Alan. Assuming it's correct, it will of course be a while before I can report back, because it'll just sit there working perfectly the whole time! I will set a test run going as soon as I'm able and report back when it's been going for 48 hours.

Comment 48 Martin Long 2014-12-01 10:05:18 UTC

Looks very promising. I've managed to get my kernel building so I'll also report back in a couple of days.

Comment 49 Joe Bryant 2014-12-03 23:06:03 UTC

Looks good here - it ran my test for 48 hours with no sign of trouble, before I stopped it myself.

Comment 50 Martin Long 2014-12-03 23:09:04 UTC

Me too. I haven't been running such extensive tests, but it's gone a couple of days without dropping out on my MythTV setup, which is the best it's done so far.

Comment 51 Alan Stern 2014-12-04 15:13:29 UTC

Okay, I will submit the patch and credit the two of you with testing it.

Comment 52 Nigel Jewell 2014-12-15 22:21:16 UTC

I've been using this patch on my Fedora 20 box for over a week now (with the latest kernel being used on F20 - 3.17.6) and I've not seen any evidence of an issue.  Thanks to all involved in the diagnosis and resolution.

Comment 53 Martin Long 2014-12-15 22:39:03 UTC

What is the lifecycle for the patch? How does it get picked up off the mailing list, and which git tree is it likely to hit first. I've just had a quick scan and haven't seen anything yet. 

Is there a process which ensures this patch doesn't just sit on the mailing list unnoticed?

Comment 54 Alan Stern 2014-12-16 21:30:51 UTC

The patch has been submitted, and Greg K-H (the USB maintainer) will apply it after the current merge window closes.  Some time after the patch makes it into the mainline kernel, it will be queued for the various -stable kernels.

The first place it will show up is in Greg's USB tree.  The only process for making sure the patch gets noticed is to poke Greg if he doesn't pick it up in a timely manner.

Comment 55 Alan Stern 2015-02-02 20:09:24 UTC

FYI, the patch is now in the main Linux 3.19-rc7 kernel (commit 	6d89252a998a695ecb0348fc2d717dc33d90cae9) and the 3.18.5 stable kernel (commit 	99688ecfbc63e2aa90fab5e704c47c4fd6439197).  This bug report can be closed.