Bug 50381 - [BISECTED]USB issue with kernel from 3.6.x
Summary: [BISECTED]USB issue with kernel from 3.6.x
Status: RESOLVED CODE_FIX
Alias: None
Product: Drivers
Classification: Unclassified
Component: USB (show other bugs)
Hardware: All Linux
: P1 normal
Assignee: Greg Kroah-Hartman
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2012-11-11 16:19 UTC by piergiorgio.sartor
Modified: 2013-11-19 18:46 UTC (History)
2 users (show)

See Also:
Kernel Version: 3.6.x
Subsystem:
Regression: Yes
Bisected commit-id:


Attachments
Log from usbmon (195.51 KB, application/x-bzip2)
2012-11-23 18:08 UTC, piergiorgio.sartor
Details
Log from usbmon with patch - First 50000 lines (614.88 KB, application/x-bzip2)
2012-12-01 17:50 UTC, piergiorgio.sartor
Details
Log from usbmon with patch - Last 50000 something lines (646.17 KB, application/x-bzip2)
2012-12-01 17:51 UTC, piergiorgio.sartor
Details
Registers dump (20.00 KB, application/x-tar)
2012-12-01 17:54 UTC, piergiorgio.sartor
Details
dmesg out until error (24.62 KB, application/x-bzip2)
2012-12-03 19:41 UTC, piergiorgio.sartor
Details
EHCI extended log (2.41 KB, text/x-log)
2012-12-08 17:53 UTC, piergiorgio.sartor
Details

Description piergiorgio.sartor 2012-11-11 16:19:05 UTC
Hi all,

this bug report was filed against Fedora bugzilla, please see:

https://bugzilla.redhat.com/show_bug.cgi?id=866166

Nevertheless, there they seem to quietly ignore it, so I try here...

This was also report to the USB mailing list. The suggestion there was to bisect the patches in order to find out the guilty one, but this seems to me a be beyond my actual capabilities, so I'll like to ask again if there is a "better" debug methodology.

I've a setup, where some USB Hubs are cascaded in order to get 10 ports. To these 10 ports there are 10 HDDs, composing a RAID-6.

Until kernel 3.5.6 (Fedora) everything was working fine, since kernel 3.6.x (still Fedora), after accessing the HDDs (in parallel, either via the md device, or all the single disks in parallel) for few seconds, I get error -110 and the HDDs are not any more accessible.
Furthermore the ehci stack seems to be dead, in fact, any other USB storage device is not recognized any more, unless the ehci is removed from kernel (sysfs interface). After that USB storage works again, but in USB 1.1 mode (of course).

Each single HDD can achieve around about 28MB/sec, but in RAID configuration the data rate increases to around about 43MB/s.

It seems like heavy traffic on the USB stack is, for some reason, not handled properly, this could also be an interrupt issue, I guess.

I would like to debug this, is there any way, using the Fedora kernel, to enable some verbose/debug output and see if more information is delivered?

In case the only resort is the bisect, how to do that? Where to find info? How is with Fedora kernels?

Thanks a lot in advance,

pg
Comment 1 Greg Kroah-Hartman 2012-11-11 17:27:29 UTC
On Sun, Nov 11, 2012 at 04:19:06PM +0000, bugzilla-daemon@bugzilla.kernel.org wrote:
> This was also report to the USB mailing list. The suggestion there was to
> bisect the patches in order to find out the guilty one, but this seems to me
> a
> be beyond my actual capabilities, so I'll like to ask again if there is a
> "better" debug methodology.

The mailing list is best, not here, sorry.
Comment 2 piergiorgio.sartor 2012-11-11 17:42:08 UTC
Hi Greg,

thanks for the answer.

I'm under the impression the mailing list has too much traffic, so my post went quickly covered by [PATCHES] and other issues, so there was no further answer to my request about debugging.

That's why I'm here.

In any case, despite my setup being quite unconventional, I see this issue as a serious regression, I'm baffled nobody cares.

A part from convoluted kernel re-compilations, how can one debug items like this?

Or you suggest I should send an email per day to the mailing list until I get an answer (or classified as spam)? :-)

Thanks again,

bye,

pg
Comment 3 Greg Kroah-Hartman 2012-11-11 19:13:22 UTC
On Sun, Nov 11, 2012 at 05:42:08PM +0000, bugzilla-daemon@bugzilla.kernel.org wrote:
> I'm under the impression the mailing list has too much traffic, so my post
> went
> quickly covered by [PATCHES] and other issues, so there was no further answer
> to my request about debugging.

Please send it again, and say you can't run 'git bisect'.  But note, if
you can't do this, and no one else can duplicate this, the odds of it
getting resolved are, of course, quite limited.
Comment 4 piergiorgio.sartor 2012-11-12 19:48:20 UTC
Hi Greg,

thanks for the answer, I'll try again with the mailing list.

One last question, is it possible to use the mailing list without subscribing to it? I would like to avoid all the other traffic.

Thanks,

bye,

pg
Comment 5 Greg Kroah-Hartman 2012-11-12 23:12:16 UTC
> One last question, is it possible to use the mailing list without subscribing
> to it? I would like to avoid all the other traffic.

Of course, you can always send messages to it, even if you are not a
subscriber, that's the way all vger.kernel.org mailing lists work.
Comment 6 Alan 2012-11-14 12:28:17 UTC
Git bisect between 3.5 and 3.6 shouldn't be too bad. It is a pile of builds but on the other hand you have a nice solid looking test case.

http://www.landley.net/writing/git-bisect-howto.html

has a reasonably good tutorial
Comment 7 piergiorgio.sartor 2012-11-14 20:34:26 UTC
Hi Alan,

thanks for the hints.

After some investigation, I was able to run "git bisect" (the correct version system it's pretty unclear).
It claims there are 13 runs to go.

I'll do in the week-end and see what I can get.

Thanks again,

bye,

pg
Comment 8 piergiorgio.sartor 2012-11-17 17:39:04 UTC
Hi all,

I followed this instructions: http://wiki.gentoo.org/wiki/Kernel_git-bisect
The result, unless I did something incorrect, is:

Bisecting: a merge base must be tested
[28a33cbc24e4256c143dce96c7d93bf423229f92] Linux 3.5
Bisecting: 5128 revisions left to test after this (roughly 13 steps)
[b13bc8dda81c54a66a1c84e66f60b8feba659f28] Merge tag 'staging-3.6-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/staging
Bisecting: 2384 revisions left to test after this (roughly 12 steps)
[3c4cfadef6a1665d9cd02a543782d03d3e6740c6] Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next
Bisecting: 1118 revisions left to test after this (roughly 10 steps)
[9fc377799bc9bfd8d5cb35d0d1ea2e2458cbdbb3] Merge tag 'usb-3.6-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/usb
Bisecting: 559 revisions left to test after this (roughly 9 steps)
[07c7f79ee1c7e6288c614ba88005a8de6dbaadff] staging:iio:imu:adis16400 drop sysfs interface for manual device reset.
Bisecting: 279 revisions left to test after this (roughly 8 steps)
[a2b520b7867c964bb629633c1f909f6ac901f91b] staging: comedi: rtd520: remove RtdInterruptClear macro
Bisecting: 139 revisions left to test after this (roughly 7 steps)
[b94cea48ab8428ae803f4e27131367574295853d] staging: comedi: adl_pci7296: remove the private data
Bisecting: 69 revisions left to test after this (roughly 6 steps)
[f2a64902e3c2ffa34b52e6ad13e8e3c40f6919c7] staging: comedi: dyna_pci10xx: fix detach
Bisecting: 34 revisions left to test after this (roughly 5 steps)
[f58bfd88be490c0f9e25d980abe1976ceef00647] staging: csr: remove CsrUtf16String typedef
Bisecting: 17 revisions left to test after this (roughly 4 steps)
[6a4f6d38a34a992f8fd1572fc8d5706c11089d5e] staging: csr: remove CsrStrChr()
Bisecting: 8 revisions left to test after this (roughly 3 steps)
[4fe9db37104f833972486355fe86d7dcd29279b5] staging: csr: remove CsrMemFree() and CsrMemFreeDma()
Bisecting: 4 revisions left to test after this (roughly 2 steps)
[55a27055b9ea5aabf8206ed6b7777c79f4e840c3] staging: csr: remove CsrPmemFree()
Bisecting: 2 revisions left to test after this (roughly 1 step)
[eaae2e9223e859cd968b58bd3173a3f17b50538d] staging: csr: remove csr_pmem.h
Bisecting: 0 revisions left to test after this (roughly 1 step)
[419e9266884fa853179ab726c27a63a9d3ae46e3] staging: csr: delete a bunch of unused library functions
b13bc8dda81c54a66a1c84e66f60b8feba659f28 is the first bad commit

Now is your turn... :-)

bye,

pg
Comment 9 piergiorgio.sartor 2012-11-18 10:34:42 UTC
Hi,

after the feedback from the mailing list, I did it again.
So, it seems I've no beginners luck, in fact it looks like I took a wrong turn, in the first try.

New bisect log is:

Bisecting: 5071 revisions left to test after this (roughly 13 steps)
[b13bc8dda81c54a66a1c84e66f60b8feba659f28] Merge tag 'staging-3.6-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/staging
Bisecting: 2384 revisions left to test after this (roughly 12 steps)
[3c4cfadef6a1665d9cd02a543782d03d3e6740c6] Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next
Bisecting: 1118 revisions left to test after this (roughly 10 steps)
[9fc377799bc9bfd8d5cb35d0d1ea2e2458cbdbb3] Merge tag 'usb-3.6-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/usb
Bisecting: 670 revisions left to test after this (roughly 9 steps)
[f14121ab35912e3d2e57ac9a4ce1f9d4b7baeffb] Merge tag 'dt-for-3.6' of git://sources.calxeda.com/kernel/linux
Bisecting: 340 revisions left to test after this (roughly 8 steps)
[801b03653fc04de2cc5bc83c06de504d41345b63] Merge git://git.kernel.org/pub/scm/linux/kernel/git/steve/gfs2-3.0-nmw
Bisecting: 170 revisions left to test after this (roughly 7 steps)
[b015cb79ce84944076a8a849055f898f2a4d6be8] USB: EHCI: return void instead of 0
Bisecting: 85 revisions left to test after this (roughly 7 steps)
[94c122ab01f84332c371dc4ae0f5b70e879173d6] [SCSI] ufs: fix incorrect return value about SUCCESS and FAILED
Bisecting: 42 revisions left to test after this (roughly 6 steps)
[d81a5d1956731c453b85c141458d4ff5d6cc5366] USB: add USB_VENDOR_AND_INTERFACE_INFO() macro
Bisecting: 19 revisions left to test after this (roughly 5 steps)
[6470cbc486652942dcf6c4b6420e3a521e766bbf] Merge tag 'for-usb-next-2012-07-11' of git://git.kernel.org/pub/scm/linux/kernel/git/sarah/xhci into usb-next
Bisecting: 11 revisions left to test after this (roughly 4 steps)
[569b394f53f0abd177cc665c9b4ace89e3f4c7fb] USB: EHCI: always scan each interrupt QH
Bisecting: 5 revisions left to test after this (roughly 3 steps)
[55934eb3b9fa52eb53b9d7342267fc73c38206aa] USB: EHCI: use hrtimer for (s)iTD deallocation
Bisecting: 2 revisions left to test after this (roughly 2 steps)
[3c273a056bf46167f0a1309c2ba72282a17d2541] USB: EHCI: unlink multiple async QHs together
Bisecting: 0 revisions left to test after this (roughly 1 step)
[361aabf395e4a23cf554cf4ec0c0c6963b8beb01] USB: EHCI: don't lose events during a scan
Bisecting: 0 revisions left to test after this (roughly 0 steps)
[32830f207691176234b4c4dd17f0d7ab6d87d94b] USB: EHCI: use hrtimer for unlinking empty async QHs
32830f207691176234b4c4dd17f0d7ab6d87d94b is the first bad commit
commit 32830f207691176234b4c4dd17f0d7ab6d87d94b
Author: Alan Stern <stern@rowland.harvard.edu>
Date:   Wed Jul 11 11:22:53 2012 -0400

    USB: EHCI: use hrtimer for unlinking empty async QHs
    
    This patch (as1583) changes ehci-hcd to use an hrtimer event for
    unlinking empty (unused) async QHs instead of using a kernel timer.
    
    The check for empty QHs is moved to a new routine, where it doesn't
    require going through an entire scan of both the async and periodic
    schedules.  And it can unlink multiple QHs at once, unlike the current
    code.
    
    Signed-off-by: Alan Stern <stern@rowland.harvard.edu>
    Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>

:040000 040000 9767ca0f3b8c9fa0530e90730dc3106420412822 6b567d1137df3cdec9f19f835ea7c546491d9ca8 M	drivers

Which looks like to me more consistent with the problem.

Your turn again,

bye,

pg
Comment 10 piergiorgio.sartor 2012-11-23 18:08:53 UTC
Created attachment 87141 [details]
Log from usbmon

Hi Alan,

as discussed in the mailing list, please find attached the log, taken with usbmon, with proper operation until the system stopped working.

I hope it is fine, in case of problems or else, please let me know,

bye,

pg
Comment 11 piergiorgio.sartor 2012-12-01 17:50:23 UTC
Created attachment 88081 [details]
Log from usbmon with patch - First 50000 lines

This is the logfile obtained after patching the kernel. I hope it is as expected.
It captures a running system and a consequent crash of the USB connection.
Comment 12 piergiorgio.sartor 2012-12-01 17:51:26 UTC
Created attachment 88091 [details]
Log from usbmon with patch - Last 50000 something lines

Rest of the log. I had to split, because the bzip2 file was more than 1MB.
Comment 13 piergiorgio.sartor 2012-12-01 17:54:36 UTC
Created attachment 88101 [details]
Registers dump

This tar archive contains the USB registers dump.
There are the following extensions:

.start --> before anything was connected to the PC
.plugin --> just after plugging in the HUBs system with 10 HDDs
.stop --> just after the crash of the USB subsystem
.unplug --> after unplugging the HUBs

One file as a further "_2", this was taken after a reboot (I accidentally deleted the first one), so it might have values non consistent with the others.
I hope it is not a problem.
Comment 14 piergiorgio.sartor 2012-12-02 18:22:40 UTC
Comment on attachment 88081 [details]
Log from usbmon with patch - First 50000 lines

Useless dump
Comment 15 piergiorgio.sartor 2012-12-02 18:22:54 UTC
Comment on attachment 88091 [details]
Log from usbmon with patch - Last 50000 something lines

Useless dump
Comment 16 piergiorgio.sartor 2012-12-03 19:41:42 UTC
Created attachment 88301 [details]
dmesg out until error

This is the complete dump of dmesg until the error occurred.
Comment 17 piergiorgio.sartor 2012-12-08 17:53:15 UTC
Created attachment 88661 [details]
EHCI extended log
Comment 18 piergiorgio.sartor 2012-12-08 17:53:58 UTC
Output of "lspci -v -s b.1":

00:0b.1 USB Controller: nVidia Corporation MCP51 USB Controller (rev a3) (prog-if 20 [EHCI])
	Subsystem: ASUSTeK Computer Inc. Device 81c0
	Flags: bus master, 66MHz, fast devsel, latency 0, IRQ 22
	Memory at fe02e000 (32-bit, non-prefetchable) [size=256]
	Capabilities: [44] Debug port: BAR=1 offset=0098
	Capabilities: [80] Power Management version 2
	Kernel driver in use: ehci_hcd

bye,

pg
Comment 19 Alan Stern 2013-01-26 18:51:40 UTC
Fixed by commit 6e0c3339a6f19d748f16091d0a05adeb1e1f822b (USB: EHCI: unlink one async QH at a time).
Comment 20 piergiorgio.sartor 2013-02-23 15:25:36 UTC
Hi Alan, Greg,

I'm testing the kernel from Fedora using this patch, 3.7.9-101.fc17.x86_64, and it seems it is working as expected.

I guess you can closed this bug.

Thanks,

bye,

pg
Comment 21 Alan 2013-11-19 18:46:13 UTC
Thanks for all the debug info

Note You need to log in before you can comment on or make changes to this bug.