Hi all, this bug report was filed against Fedora bugzilla, please see: https://bugzilla.redhat.com/show_bug.cgi?id=866166 Nevertheless, there they seem to quietly ignore it, so I try here... This was also report to the USB mailing list. The suggestion there was to bisect the patches in order to find out the guilty one, but this seems to me a be beyond my actual capabilities, so I'll like to ask again if there is a "better" debug methodology. I've a setup, where some USB Hubs are cascaded in order to get 10 ports. To these 10 ports there are 10 HDDs, composing a RAID-6. Until kernel 3.5.6 (Fedora) everything was working fine, since kernel 3.6.x (still Fedora), after accessing the HDDs (in parallel, either via the md device, or all the single disks in parallel) for few seconds, I get error -110 and the HDDs are not any more accessible. Furthermore the ehci stack seems to be dead, in fact, any other USB storage device is not recognized any more, unless the ehci is removed from kernel (sysfs interface). After that USB storage works again, but in USB 1.1 mode (of course). Each single HDD can achieve around about 28MB/sec, but in RAID configuration the data rate increases to around about 43MB/s. It seems like heavy traffic on the USB stack is, for some reason, not handled properly, this could also be an interrupt issue, I guess. I would like to debug this, is there any way, using the Fedora kernel, to enable some verbose/debug output and see if more information is delivered? In case the only resort is the bisect, how to do that? Where to find info? How is with Fedora kernels? Thanks a lot in advance, pg
On Sun, Nov 11, 2012 at 04:19:06PM +0000, bugzilla-daemon@bugzilla.kernel.org wrote: > This was also report to the USB mailing list. The suggestion there was to > bisect the patches in order to find out the guilty one, but this seems to me > a > be beyond my actual capabilities, so I'll like to ask again if there is a > "better" debug methodology. The mailing list is best, not here, sorry.
Hi Greg, thanks for the answer. I'm under the impression the mailing list has too much traffic, so my post went quickly covered by [PATCHES] and other issues, so there was no further answer to my request about debugging. That's why I'm here. In any case, despite my setup being quite unconventional, I see this issue as a serious regression, I'm baffled nobody cares. A part from convoluted kernel re-compilations, how can one debug items like this? Or you suggest I should send an email per day to the mailing list until I get an answer (or classified as spam)? :-) Thanks again, bye, pg
On Sun, Nov 11, 2012 at 05:42:08PM +0000, bugzilla-daemon@bugzilla.kernel.org wrote: > I'm under the impression the mailing list has too much traffic, so my post > went > quickly covered by [PATCHES] and other issues, so there was no further answer > to my request about debugging. Please send it again, and say you can't run 'git bisect'. But note, if you can't do this, and no one else can duplicate this, the odds of it getting resolved are, of course, quite limited.
Hi Greg, thanks for the answer, I'll try again with the mailing list. One last question, is it possible to use the mailing list without subscribing to it? I would like to avoid all the other traffic. Thanks, bye, pg
> One last question, is it possible to use the mailing list without subscribing > to it? I would like to avoid all the other traffic. Of course, you can always send messages to it, even if you are not a subscriber, that's the way all vger.kernel.org mailing lists work.
Git bisect between 3.5 and 3.6 shouldn't be too bad. It is a pile of builds but on the other hand you have a nice solid looking test case. http://www.landley.net/writing/git-bisect-howto.html has a reasonably good tutorial
Hi Alan, thanks for the hints. After some investigation, I was able to run "git bisect" (the correct version system it's pretty unclear). It claims there are 13 runs to go. I'll do in the week-end and see what I can get. Thanks again, bye, pg
Hi all, I followed this instructions: http://wiki.gentoo.org/wiki/Kernel_git-bisect The result, unless I did something incorrect, is: Bisecting: a merge base must be tested [28a33cbc24e4256c143dce96c7d93bf423229f92] Linux 3.5 Bisecting: 5128 revisions left to test after this (roughly 13 steps) [b13bc8dda81c54a66a1c84e66f60b8feba659f28] Merge tag 'staging-3.6-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/staging Bisecting: 2384 revisions left to test after this (roughly 12 steps) [3c4cfadef6a1665d9cd02a543782d03d3e6740c6] Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next Bisecting: 1118 revisions left to test after this (roughly 10 steps) [9fc377799bc9bfd8d5cb35d0d1ea2e2458cbdbb3] Merge tag 'usb-3.6-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/usb Bisecting: 559 revisions left to test after this (roughly 9 steps) [07c7f79ee1c7e6288c614ba88005a8de6dbaadff] staging:iio:imu:adis16400 drop sysfs interface for manual device reset. Bisecting: 279 revisions left to test after this (roughly 8 steps) [a2b520b7867c964bb629633c1f909f6ac901f91b] staging: comedi: rtd520: remove RtdInterruptClear macro Bisecting: 139 revisions left to test after this (roughly 7 steps) [b94cea48ab8428ae803f4e27131367574295853d] staging: comedi: adl_pci7296: remove the private data Bisecting: 69 revisions left to test after this (roughly 6 steps) [f2a64902e3c2ffa34b52e6ad13e8e3c40f6919c7] staging: comedi: dyna_pci10xx: fix detach Bisecting: 34 revisions left to test after this (roughly 5 steps) [f58bfd88be490c0f9e25d980abe1976ceef00647] staging: csr: remove CsrUtf16String typedef Bisecting: 17 revisions left to test after this (roughly 4 steps) [6a4f6d38a34a992f8fd1572fc8d5706c11089d5e] staging: csr: remove CsrStrChr() Bisecting: 8 revisions left to test after this (roughly 3 steps) [4fe9db37104f833972486355fe86d7dcd29279b5] staging: csr: remove CsrMemFree() and CsrMemFreeDma() Bisecting: 4 revisions left to test after this (roughly 2 steps) [55a27055b9ea5aabf8206ed6b7777c79f4e840c3] staging: csr: remove CsrPmemFree() Bisecting: 2 revisions left to test after this (roughly 1 step) [eaae2e9223e859cd968b58bd3173a3f17b50538d] staging: csr: remove csr_pmem.h Bisecting: 0 revisions left to test after this (roughly 1 step) [419e9266884fa853179ab726c27a63a9d3ae46e3] staging: csr: delete a bunch of unused library functions b13bc8dda81c54a66a1c84e66f60b8feba659f28 is the first bad commit Now is your turn... :-) bye, pg
Hi, after the feedback from the mailing list, I did it again. So, it seems I've no beginners luck, in fact it looks like I took a wrong turn, in the first try. New bisect log is: Bisecting: 5071 revisions left to test after this (roughly 13 steps) [b13bc8dda81c54a66a1c84e66f60b8feba659f28] Merge tag 'staging-3.6-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/staging Bisecting: 2384 revisions left to test after this (roughly 12 steps) [3c4cfadef6a1665d9cd02a543782d03d3e6740c6] Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next Bisecting: 1118 revisions left to test after this (roughly 10 steps) [9fc377799bc9bfd8d5cb35d0d1ea2e2458cbdbb3] Merge tag 'usb-3.6-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/usb Bisecting: 670 revisions left to test after this (roughly 9 steps) [f14121ab35912e3d2e57ac9a4ce1f9d4b7baeffb] Merge tag 'dt-for-3.6' of git://sources.calxeda.com/kernel/linux Bisecting: 340 revisions left to test after this (roughly 8 steps) [801b03653fc04de2cc5bc83c06de504d41345b63] Merge git://git.kernel.org/pub/scm/linux/kernel/git/steve/gfs2-3.0-nmw Bisecting: 170 revisions left to test after this (roughly 7 steps) [b015cb79ce84944076a8a849055f898f2a4d6be8] USB: EHCI: return void instead of 0 Bisecting: 85 revisions left to test after this (roughly 7 steps) [94c122ab01f84332c371dc4ae0f5b70e879173d6] [SCSI] ufs: fix incorrect return value about SUCCESS and FAILED Bisecting: 42 revisions left to test after this (roughly 6 steps) [d81a5d1956731c453b85c141458d4ff5d6cc5366] USB: add USB_VENDOR_AND_INTERFACE_INFO() macro Bisecting: 19 revisions left to test after this (roughly 5 steps) [6470cbc486652942dcf6c4b6420e3a521e766bbf] Merge tag 'for-usb-next-2012-07-11' of git://git.kernel.org/pub/scm/linux/kernel/git/sarah/xhci into usb-next Bisecting: 11 revisions left to test after this (roughly 4 steps) [569b394f53f0abd177cc665c9b4ace89e3f4c7fb] USB: EHCI: always scan each interrupt QH Bisecting: 5 revisions left to test after this (roughly 3 steps) [55934eb3b9fa52eb53b9d7342267fc73c38206aa] USB: EHCI: use hrtimer for (s)iTD deallocation Bisecting: 2 revisions left to test after this (roughly 2 steps) [3c273a056bf46167f0a1309c2ba72282a17d2541] USB: EHCI: unlink multiple async QHs together Bisecting: 0 revisions left to test after this (roughly 1 step) [361aabf395e4a23cf554cf4ec0c0c6963b8beb01] USB: EHCI: don't lose events during a scan Bisecting: 0 revisions left to test after this (roughly 0 steps) [32830f207691176234b4c4dd17f0d7ab6d87d94b] USB: EHCI: use hrtimer for unlinking empty async QHs 32830f207691176234b4c4dd17f0d7ab6d87d94b is the first bad commit commit 32830f207691176234b4c4dd17f0d7ab6d87d94b Author: Alan Stern <stern@rowland.harvard.edu> Date: Wed Jul 11 11:22:53 2012 -0400 USB: EHCI: use hrtimer for unlinking empty async QHs This patch (as1583) changes ehci-hcd to use an hrtimer event for unlinking empty (unused) async QHs instead of using a kernel timer. The check for empty QHs is moved to a new routine, where it doesn't require going through an entire scan of both the async and periodic schedules. And it can unlink multiple QHs at once, unlike the current code. Signed-off-by: Alan Stern <stern@rowland.harvard.edu> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> :040000 040000 9767ca0f3b8c9fa0530e90730dc3106420412822 6b567d1137df3cdec9f19f835ea7c546491d9ca8 M drivers Which looks like to me more consistent with the problem. Your turn again, bye, pg
Created attachment 87141 [details] Log from usbmon Hi Alan, as discussed in the mailing list, please find attached the log, taken with usbmon, with proper operation until the system stopped working. I hope it is fine, in case of problems or else, please let me know, bye, pg
Created attachment 88081 [details] Log from usbmon with patch - First 50000 lines This is the logfile obtained after patching the kernel. I hope it is as expected. It captures a running system and a consequent crash of the USB connection.
Created attachment 88091 [details] Log from usbmon with patch - Last 50000 something lines Rest of the log. I had to split, because the bzip2 file was more than 1MB.
Created attachment 88101 [details] Registers dump This tar archive contains the USB registers dump. There are the following extensions: .start --> before anything was connected to the PC .plugin --> just after plugging in the HUBs system with 10 HDDs .stop --> just after the crash of the USB subsystem .unplug --> after unplugging the HUBs One file as a further "_2", this was taken after a reboot (I accidentally deleted the first one), so it might have values non consistent with the others. I hope it is not a problem.
Comment on attachment 88081 [details] Log from usbmon with patch - First 50000 lines Useless dump
Comment on attachment 88091 [details] Log from usbmon with patch - Last 50000 something lines Useless dump
Created attachment 88301 [details] dmesg out until error This is the complete dump of dmesg until the error occurred.
Created attachment 88661 [details] EHCI extended log
Output of "lspci -v -s b.1": 00:0b.1 USB Controller: nVidia Corporation MCP51 USB Controller (rev a3) (prog-if 20 [EHCI]) Subsystem: ASUSTeK Computer Inc. Device 81c0 Flags: bus master, 66MHz, fast devsel, latency 0, IRQ 22 Memory at fe02e000 (32-bit, non-prefetchable) [size=256] Capabilities: [44] Debug port: BAR=1 offset=0098 Capabilities: [80] Power Management version 2 Kernel driver in use: ehci_hcd bye, pg
Fixed by commit 6e0c3339a6f19d748f16091d0a05adeb1e1f822b (USB: EHCI: unlink one async QH at a time).
Hi Alan, Greg, I'm testing the kernel from Fedora using this patch, 3.7.9-101.fc17.x86_64, and it seems it is working as expected. I guess you can closed this bug. Thanks, bye, pg
Thanks for all the debug info