Bug 8968 - One broken USB storage device can hang the entire USB subsystem
Summary: One broken USB storage device can hang the entire USB subsystem
Status: CLOSED CODE_FIX
Alias: None
Product: Drivers
Classification: Unclassified
Component: USB (show other bugs)
Hardware: All Linux
: P1 normal
Assignee: Greg Kroah-Hartman
URL:
Keywords:
Depends on:
Blocks: USB
  Show dependency tree
 
Reported: 2007-09-02 14:07 UTC by Alain Knaff
Modified: 2008-05-09 02:41 UTC (History)
4 users (show)

See Also:
Kernel Version: 2.6.22.6
Subsystem:
Regression: Yes
Bisected commit-id:


Attachments
echo t > /proc/sysrq-trigger ; dmesg -s 1000000 > foo (121.76 KB, text/plain)
2007-09-02 14:59 UTC, Alain Knaff
Details
echo t > /proc/sysrq-trigger ; dmesg -s 1000000 > foo (349.38 KB, text/plain)
2007-09-03 10:55 UTC, Alain Knaff
Details
Dmesg output on a kernel with CONFIG_USB_DEBUG=y (250.85 KB, text/plain)
2007-09-03 16:54 UTC, Alain Knaff
Details
echo t > /proc/sysrq-trigger ; dmesg -s 1000000 > foo without dodgy USB hub (334.29 KB, text/plain)
2007-09-04 10:47 UTC, Alain Knaff
Details
/proc/bus/usb/devices before crash (4.92 KB, text/plain)
2007-09-04 15:11 UTC, Alain Knaff
Details
/sys/class/usb_host/usb_host6/registers (789 bytes, text/plain)
2007-09-04 15:12 UTC, Alain Knaff
Details
/sys/class/usb_host/usb_host6/async (184 bytes, text/plain)
2007-09-04 15:12 UTC, Alain Knaff
Details
/proc/bus/usb/devices after crash (4.92 KB, text/plain)
2007-09-04 15:13 UTC, Alain Knaff
Details

Description Alain Knaff 2007-09-02 14:07:05 UTC
I have several Sharkoon external disk enclosures.

One contains an IDE (parallel) disk with some bad sectors. The other contains a good (as far as I know...) SATA disk.

With the IDE disk, If I attempt to read files containing the bad sectors, the process hangs forever, in an unkillable state.

With the SATA disk (new, and never fell => no reason to believe it is bad), it fails randomly after some heavy copying (rsyncing a huge tree from the external disk towards my internal disk), at different places on each try. But like with the IDE disk, the process will hang forever, in an unkillable state, and the drive's light turns green (red with Ubuntu kernel 2.6.22-10-generic).

In both cases (IDE and SATA), a sync started in another window will also hang forever (even if the first process did only read from the disk). Attempting to remove the ohci_hcd or ehci_hcd will just cause that rmmod process to hang as well. Lsusb will hang likewise. But strangely enough, my usb mouse still works, as do my usb speakers.

With the IDE disk, a reboot solves the issue (until I attempt read those bad sectors again), even without powercycling the disk.

IMHO, these Sharkoon enclosures do have issues, but should the kernel really handle the situation so badly? Shouldn't there be a timeout where the processes will just get an I/O error, but the rest of the system can continue using USB?

Interestingly enough, the problem doesn't occur (or occurs much less often) if I throttle rsync's bandwith using the --bwlimit option (if I throttle bw to 5 Megabytes/s, the problem almost never happens. The enclosure+disk is capable of 30 Megabyte/s though)


Most recent kernel where this bug did not occur: don't known for sure. However, on my old SuSE box (having 2.6.13-15.16) it does work all right.

Distribution: Kubuntu Gutsy

Hardware Environment: 
Sharkoon Swift-Case SATA/USB2.0 3.5"
http://www.computeruniverse.net/info.asp?id=90205832

Western Digital WD2500JS 250GB
http://www.computeruniverse.net/info.asp?id=90143746

or, for the IDE variant:
Sharkoon Swift-Case USB2.0 3.5"
http://www.computeruniverse.net/info.asp?id=90150304

MAXTOR 6L080J4

In both cases, with an ASUS M2A-VM, Sockel AM2, mATX motherboard
http://www.computeruniverse.net/info.asp?id=90216021
(in case it is relevant due to the USB controller?)


Software Environment: n.a.

Problem Description:
A copy from the affected device will hang in D (diskio) state, and is unkillable

Steps to reproduce:
rsync -uav /media/usbdisk /tmp

(For the IDE variant: knock over the drive while in operation or otherwise create damaged sectors...)
Comment 1 Andrew Morton 2007-09-02 14:46:06 UTC
Please make it hang, then do

echo t > /proc/sysrq-trigger
dmesg -s 1000000 > foo

then attach foo to this report, thanks.
Comment 2 Alain Knaff 2007-09-02 14:59:42 UTC
Created attachment 12674 [details]
echo t > /proc/sysrq-trigger ; dmesg -s 1000000 > foo

Output of echo t > /proc/sysrq-trigger ; dmesg -s 1000000 > foo
Comment 3 Andrew Morton 2007-09-02 23:34:51 UTC
Alas, not all the output is there.  You'll need a larger kernel log buffer.

Can you please add "log_buf_len=1M" to the kernel boot command line, reboot
and then rerun the sysrq-T thing?

Thank.
Comment 4 Alain Knaff 2007-09-03 10:55:26 UTC
Created attachment 12682 [details]
echo t > /proc/sysrq-trigger ; dmesg -s 1000000 > foo

Output of echo t > /proc/sysrq-trigger ; dmesg -s 1000000 > foo

Btw, this time the drive led stayed red... (formerly it was green)
Comment 5 Anonymous Emailer 2007-09-03 11:45:14 UTC
Reply-To: akpm@linux-foundation.org

> On Mon,  3 Sep 2007 10:55:26 -0700 (PDT) bugme-daemon@bugzilla.kernel.org
> wrote:
>
> http://bugzilla.kernel.org/show_bug.cgi?id=8968

Either a scsi bug or a usb-storage bug.  The interesting bits are below:


> [  411.364000] scsi_eh_6     D F7E21F28     0  2782      2 (L-TLB)
> [  411.364000]        f7e21f3c 00000046 00000002 f7e21f28 f7e21f24 00000000
> 00000082 f79ec600 
> [  411.364000]        dd5c5a00 0000000a b1510900 00000058 00000000 dd56a10c
> c3015b00 00000000 
> [  411.364000]        00000002 00000001 deb0e1c0 00004ec3 00000000 00000082
> 000000ff 00000000 
> [  411.364000] Call Trace:
> [  411.364000]  [<c02f11e4>] wait_for_completion+0x94/0xe0
> [  411.364000]  [<c0122bc0>] default_wake_function+0x0/0x10
> [  411.364000]  [<f88fe6d5>] command_abort+0x85/0xb0 [usb_storage]
> [  411.364000]  [<f8a51f89>] __scsi_try_to_abort_cmd+0x19/0x20 [scsi_mod]
> [  411.364000]  [<f8a53b86>] scsi_error_handler+0x356/0x510 [scsi_mod]
> [  411.364000]  [<c01205c0>] complete+0x40/0x60
> [  411.364000]  [<f8a53830>] scsi_error_handler+0x0/0x510 [scsi_mod]
> [  411.364000]  [<c013ba12>] kthread+0x42/0x70
> [  411.364000]  [<c013b9d0>] kthread+0x0/0x70
> [  411.364000]  [<c0105487>] kernel_thread_helper+0x7/0x10
> [  411.364000]  =======================
> [  411.364000] usb-storage   D F7E23EA0     0  2783      2 (L-TLB)
> [  411.364000]        f7e23eb4 00000046 00000000 f7e23ea0 f79ec600 de7e6a80
> 000031ba 00000000 
> [  411.364000]        00000a00 0000000a b56a6600 00000051 00000000 dd56a62c
> c3015b00 00000000 
> [  411.364000]        c02f18d8 0001b000 de7e6a80 000031ba dfb89d00 00002000
> 00000010 c0008380 
> [  411.364000] Call Trace:
> [  411.364000]  [<c02f18d8>] schedule_timeout+0x78/0xd0
> [  411.364000]  [<c02f11e4>] wait_for_completion+0x94/0xe0
> [  411.364000]  [<c0122bc0>] default_wake_function+0x0/0x10
> [  411.364000]  [<f8887716>] usb_sg_wait+0x126/0x170 [usbcore]
> [  411.364000]  [<f88ff2e0>] usb_stor_bulk_transfer_sg+0xb0/0x100
> [usb_storage]
> [  411.364000]  [<f88ff11e>] usb_stor_bulk_transfer_buf+0x4e/0x90
> [usb_storage]
> [  411.364000]  [<f88ff776>] usb_stor_Bulk_transport+0x126/0x290
> [usb_storage]
> [  411.364000]  [<f89007b0>] usb_stor_control_thread+0x0/0x1d0 [usb_storage]
> [  411.364000]  [<f88ff904>] usb_stor_invoke_transport+0x24/0x2f0
> [usb_storage]
> [  411.364000]  [<c012003f>] __wake_up_locked+0x1f/0x30
> [  411.364000]  [<c02f2820>] __down_interruptible+0xf0/0x120
> [  411.364000]  [<c0122bc0>] default_wake_function+0x0/0x10
> [  411.364000]  [<f89007b0>] usb_stor_control_thread+0x0/0x1d0 [usb_storage]
> [  411.364000]  [<f89007b0>] usb_stor_control_thread+0x0/0x1d0 [usb_storage]
> [  411.364000]  [<f8900918>] usb_stor_control_thread+0x168/0x1d0
> [usb_storage]
> [  411.364000]  [<c01205c0>] complete+0x40/0x60
> [  411.364000]  [<f89007b0>] usb_stor_control_thread+0x0/0x1d0 [usb_storage]
> [  411.364000]  [<c013ba12>] kthread+0x42/0x70
> [  411.364000]  [<c013b9d0>] kthread+0x0/0x70
> [  411.364000]  [<c0105487>] kernel_thread_helper+0x7/0x10
> [  411.364000]  =======================
>
> ...
>
> [  411.364000] rsync         D 000031F9     0  8809   8779 (NOTLB)
> [  411.364000]        de1a9db0 00000082 00000051 000031f9 c3016a38 00000000
> df93a6d0 c3010000 
> [  411.364000]        c010320a 00000009 b56a6600 00000051 00000000 df93a62c
> c3015b00 00000000 
> [  411.364000]        de1a9df0 00000082 de7e6a80 00000086 de1a9ea8 c012da61
> c0118735 b56a6600 
> [  411.364000] Call Trace:
> [  411.364000]  [<c010320a>] __switch_to+0xaa/0x1d0
> [  411.364000]  [<c012da61>] irq_exit+0x51/0x80
> [  411.364000]  [<c0118735>] smp_apic_timer_interrupt+0x55/0x80
> [  411.364000]  [<c02f171d>] io_schedule+0x1d/0x30
> [  411.364000]  [<c015e7c9>] sync_page+0x39/0x50
> [  411.364000]  [<c02f19ad>] __wait_on_bit_lock+0x3d/0x70
> [  411.364000]  [<c015e790>] sync_page+0x0/0x50
> [  411.364000]  [<c015e773>] __lock_page+0x73/0x80
> [  411.364000]  [<c013bd20>] wake_bit_function+0x0/0x60
> [  411.364000]  [<c015f193>] do_generic_mapping_read+0x213/0x5d0
> [  411.364000]  [<c01612ec>] generic_file_aio_read+0xcc/0x1d0
> [  411.364000]  [<c015e520>] file_read_actor+0x0/0xf0
> [  411.364000]  [<c01802f5>] do_sync_read+0xd5/0x120
> [  411.364000]  [<c013bcd0>] autoremove_wake_function+0x0/0x50
> [  411.364000]  [<c0180c2c>] vfs_read+0xbc/0x160
> [  411.364000]  [<c0180220>] do_sync_read+0x0/0x120
> [  411.364000]  [<c0181161>] sys_read+0x41/0x70
> [  411.364000]  [<c01041d2>] sysenter_past_esp+0x6b/0xa9
> 
> 
Comment 6 Alan Stern 2007-09-03 14:44:25 UTC
It's entirely possible that the hang (not the original error, just the hang) is caused by a bug in your USB host controller hardware.  See for example Bugzilla entry #8692.

You need to provide more information about your setup.  Start with the output from lspci and the contents of /proc/bus/usb/devices -- indicate which device is causing the problem.  Also, you should build a kernel with CONFIG_USB_DEBUG enabled and attach the dmesg log after the error occurs.
Comment 7 Alain Knaff 2007-09-03 16:54:17 UTC
Created attachment 12686 [details]
Dmesg output on a kernel with CONFIG_USB_DEBUG=y

Dmesg output on a kernel with CONFIG_USB_DEBUG=y
Comment 8 Alan Stern 2007-09-04 07:26:29 UTC
I don't see anything in your dmesg log indicating a hang.

But one thing does stand out: The Chesen USB-PS2 converter (attached to a keyboard) constantly gets reset every few seconds.  You might want to try plugging that device into the other hub, the one with the audio device, instead.
Comment 9 Alain Knaff 2007-09-04 10:46:27 UTC
The Chesen USB-PS2 converter is actually an USB2 hub, which also happens to have a PS/2 mouse and PS/2 keyboard connector on the side. I bought it for use as a hub, so my keyboard is actually connected directly to the PC, and not to this hub.

At the time of the crash nothing else was (no USB device) was connected to that hub either, the "faulty" Sharkoon enclosure was connected directly to another USB connector.

However, although my (PS/2) keyboard was connected directly to the PC, it did act up occasionally during my tests: a single keypress would generate multiple repeated characters.

The USB loudspeakers are connected to another USB, which is only 1.1. So, it would be somewhat pointless to connect the fast USB 2.0 hub to the slow one.

However, the fast hub is indeed a tad bid dodgy: a number of devices (the loudspeakers, and also the serial cable to connect my mobile phone) do not work correctly on it, but only on the slow hub.

Ok, in any case, I disconnected that dodgy hub altogether, and repeated the test. With the hub removed, the crash did indeed take much longer to occur (after 5 Gigs had been successfully copied), but it did occur nevertheless.

Attached is the echo t >/proc/sysrq-trigger output after this happened. This time, I used cp rather than rsync.
Comment 10 Alain Knaff 2007-09-04 10:47:56 UTC
Created attachment 12699 [details]
echo t > /proc/sysrq-trigger ; dmesg -s 1000000 > foo without dodgy USB hub

echo t > /proc/sysrq-trigger ; dmesg -s 1000000 > foo without dodgy USB hub
Comment 11 Alan Stern 2007-09-04 12:23:18 UTC
Okay.  The stack trace (echo t >/proc/sysrq-trigger) doesn't provide any more information than it did before, so you may as well leave it out in the future.

The next thing to do is check the contents of /sys/class/usb_host/usb_host6 after the next crash.  The "registers" and "async" files are of particular interest; please attach them to this bug report.  (And be sure to keep CONFIG_USB_DEBUG enabled!)

Also, it wouldn't hurt to attach a copy of /proc/bus/usb/devices (before the crash, but with the drive plugged in).
Comment 12 Alain Knaff 2007-09-04 15:11:19 UTC
Created attachment 12700 [details]
/proc/bus/usb/devices before crash

/proc/bus/usb/devices before crash
Comment 13 Alain Knaff 2007-09-04 15:12:13 UTC
Created attachment 12701 [details]
/sys/class/usb_host/usb_host6/registers
Comment 14 Alain Knaff 2007-09-04 15:12:38 UTC
Created attachment 12702 [details]
/sys/class/usb_host/usb_host6/async
Comment 15 Alain Knaff 2007-09-04 15:13:25 UTC
Created attachment 12703 [details]
/proc/bus/usb/devices after crash
Comment 16 Natalie Protasevich 2008-02-04 14:38:34 UTC
Any new status on this problem please?
Thanks.
Comment 17 Alain Knaff 2008-05-09 02:41:32 UTC
The problem hasn't re-occurred since a while (since the machine was upgraded to Kubuntu 7.10 , kernel 2.6.22-14-generic if I remember correctly)

Note You need to log in before you can comment on or make changes to this bug.