Bug 6789 - ehci_hcd driver freezes external usb hard-disk drive
Summary: ehci_hcd driver freezes external usb hard-disk drive
Status: REJECTED INSUFFICIENT_DATA
Alias: None
Product: Drivers
Classification: Unclassified
Component: USB (show other bugs)
Hardware: i386 Linux
: P2 normal
Assignee: Matthew Dharm
URL:
Keywords:
Depends on:
Blocks: USB
  Show dependency tree
 
Reported: 2006-07-04 03:46 UTC by Michael
Modified: 2007-02-27 10:11 UTC (History)
2 users (show)

See Also:
Kernel Version: 2.6.16/2.6.17
Subsystem:
Regression: ---
Bisected commit-id:


Attachments
list of /proc/bus/usb/devices (2.33 KB, text/plain)
2006-09-08 00:44 UTC, Michael
Details
usbmon trace of the failure (997.66 KB, application/gzip)
2006-09-11 04:27 UTC, Michael
Details
Display result code for failed commands (566 bytes, patch)
2006-09-11 08:25 UTC, Alan Stern
Details | Diff
Trace from a second recorded failure (36.80 KB, text/plain)
2006-09-11 09:52 UTC, Michael
Details
Improve handling of SCSI residue and 10-byte commands (2.54 KB, patch)
2006-09-11 12:07 UTC, Alan Stern
Details | Diff
rejected part of patch (1.82 KB, text/plain)
2006-09-11 13:21 UTC, Michael
Details
Improve handling of SCSI residue and 10-byte commands (2.43 KB, patch)
2006-09-12 07:46 UTC, Alan Stern
Details | Diff
Improve handling of SCSI residue and 10-byte commands (2.44 KB, patch)
2006-09-12 10:33 UTC, Alan Stern
Details | Diff
last 500 lines of usbmon trace of failure (37.53 KB, text/plain)
2006-09-13 01:19 UTC, Michael
Details
Improve handling of SCSI residue and 10-byte commands (3.55 KB, patch)
2006-09-18 08:19 UTC, Alan Stern
Details | Diff
Improve handling of bogus Illegal Command errors (2.98 KB, patch)
2006-09-25 08:48 UTC, Alan Stern
Details | Diff
Trace of usbmon during usb bus reset (911.10 KB, application/gzip)
2006-09-27 23:56 UTC, Michael
Details

Description Michael 2006-07-04 03:46:35 UTC
Most recent kernel where this bug did not occur:
Distribution: Kanotix, Debian SID
Hardware Environment: Several
Software Environment: Debian SID
Problem Description:
When transferring data to or from an external USB disk (Revoltec Alu Book
Edition 2 USB Enclosure, see http://www.revoltec.de/english/produkte/rs025.htm),
the connection stops after a short period of time with several i/o errors. The
external disk is then no longer accessible. 

After attaching the external device appears as sda:

Jul  4 12:36:27 LaptopMB kernel: usb 1-3: new high speed USB device using
ehci_hcd and address 14
Jul  4 12:36:27 LaptopMB kernel: usb 1-3: configuration #1 chosen from 1 choice
Jul  4 12:36:27 LaptopMB kernel: scsi3 : SCSI emulation for USB Mass Storage devices
Jul  4 12:36:27 LaptopMB kernel: usb-storage: device found at 14
Jul  4 12:36:27 LaptopMB kernel: usb-storage: waiting for device to settle
before scanning
Jul  4 12:36:32 LaptopMB kernel:   Vendor: IC25N040  Model: ATMR04-0         
Rev: MO2O
Jul  4 12:36:32 LaptopMB kernel:   Type:   Direct-Access                     
ANSI SCSI revision: 00
Jul  4 12:36:32 LaptopMB kernel: SCSI device sda: 78140160 512-byte hdwr sectors
(40008 MB)
Jul  4 12:36:32 LaptopMB kernel: sda: Write Protect is off
Jul  4 12:36:32 LaptopMB kernel: sda: Mode Sense: 00 14 00 00
Jul  4 12:36:32 LaptopMB kernel: sda: assuming drive cache: write through
Jul  4 12:36:32 LaptopMB kernel: SCSI device sda: 78140160 512-byte hdwr sectors
(40008 MB)
Jul  4 12:36:32 LaptopMB kernel: sda: Write Protect is off
Jul  4 12:36:32 LaptopMB kernel: sda: Mode Sense: 00 14 00 00
Jul  4 12:36:32 LaptopMB kernel: sda: assuming drive cache: write through
Jul  4 12:36:32 LaptopMB kernel:  sda: sda1 sda2 sda3 sda4
Jul  4 12:36:32 LaptopMB kernel: sd 3:0:0:0: Attached scsi disk sda
Jul  4 12:36:32 LaptopMB kernel: sd 3:0:0:0: Attached scsi generic sg0 type 0
Jul  4 12:36:32 LaptopMB kernel: usb-storage: device scan complete

fdisk -l shows the new device (hda is the internal, sda the external):

Platte /dev/hda: 80.0 GByte, 80026361856 Byte
255 K
Comment 1 Michael 2006-07-07 05:02:11 UTC
Just tested 2.6.17.3: Has the same bug.
Comment 2 Matthew Dharm 2006-08-30 10:13:46 UTC
Please turn on CONFIG_USB_STORAGE_DEBUG and capture (and post) the section of
the logfile which shows the very beginnings of the failure.
Comment 3 Alan Stern 2006-08-30 13:17:30 UTC
Also, can you attach the contents of /proc/bus/usb/devices when the disk is
plugged in?  You may need to mount the /proc/bus/usb directory first:

    mount -t usbfs none /proc/bus/usb
Comment 4 Michael 2006-09-08 00:44:18 UTC
Created attachment 8969 [details]
list of /proc/bus/usb/devices

@alan: this is the output of /proc/bus/usb/devices with the device attached
Comment 5 Michael 2006-09-08 00:59:34 UTC
@alan: 

interesting: when the error occurs and the device no longer appears in fdisk -l,
it is still visible in /proc/bus/usb/devices

Don't know if that matters, though.

@Matthew:
I tried, however did not succeed. I haven't compiled an own kernel yet. Most I
did was compiling a new driver or something like this. I guess I will need some
more time to do so. Maybe I get something done next week or so...
Comment 6 Alan Stern 2006-09-08 09:28:26 UTC
The fact that the device is still visible in /proc/bus/usb/devices means that it
hasn't disconnected itself electrically from the USB bus.  But fdisk doesn't
work because the device has stopped responding to commands.

As an alternative to using CONFIG_USB_STORAGE_DEBUG and rebuilding the device
driver, you could try using usbmon -- assuming it is configured into your
kernel.  Instructions are in the kernel source file Documentation/usb/usbmon.txt.
Comment 7 Michael 2006-09-11 04:27:46 UTC
Created attachment 8984 [details]
usbmon trace of the failure

sorry for the long file. I have no clue, where failure messages start in this
file, so I decided to take the last 10% of the whole capture.

I compiled kernel 2.6.17.13 and enabled DEBUG_FS. I mounted the filesystem,
loaded the module and attached the disk. /proc/bus/usb/devices showed bus 03,
so I started a trace of 3t. I then issued "dd if=/dev/sda2 of=/dev/null" and
after 4.9GB the dd stopped with the input/output error.

Hope this helps!

Michael
Comment 8 Alan Stern 2006-09-11 08:25:25 UTC
Created attachment 8985 [details]
Display result code for failed commands

It's kind of hard to say just what happened.  The usbmon log looks perfectly
normal except for one spot just before the end.  The computer tried to read 240
sectors starting at sector number 10559056.  The drive sent back 239 sectors of
data and then an error code meaning "Illegal command", which makes no sense.

The computer went on to read the following 280 sectors uneventfully (probably
trying to finish up the I/O request), but the damage had already been done.  It
seems like the drive encountered some sort of read error and then reported it
incorrectly.  I don't know why Linux didn't retry the command.

By the way, your instinct was correct.	The interesting parts were all in the
last 100 lines of the log!

Try again after applying this patch.  It should print a message in the system
log whenever a command sent to the drive fails.  Again, send the last few
hundred lines of the usbmon file and also whatever you get from dmesg.
Comment 9 Michael 2006-09-11 09:52:57 UTC
Created attachment 8990 [details]
Trace from a second recorded failure

Syslog shows:

Sep 11 18:52:56 LaptopMB kernel: Command result 0x2
Sep 11 18:52:56 LaptopMB kernel: end_request: I/O error, dev sda, sector
8752464
Sep 11 18:52:56 LaptopMB kernel: Buffer I/O error on device sda2, logical block

971964
Sep 11 18:52:56 LaptopMB kernel: Buffer I/O error on device sda2, logical block
971965
Sep 11 18:52:56 LaptopMB kernel: Buffer I/O error on device sda2, logical block
971966
Sep 11 18:52:56 LaptopMB kernel: Buffer I/O error on device sda2, logical block
971967
Sep 11 18:52:56 LaptopMB kernel: Buffer I/O error on device sda2, logical block
971968
Sep 11 18:52:56 LaptopMB kernel: Buffer I/O error on device sda2, logical block
971969
Sep 11 18:52:56 LaptopMB kernel: Buffer I/O error on device sda2, logical block
971970
Sep 11 18:52:56 LaptopMB kernel: Buffer I/O error on device sda2, logical block
971971
Sep 11 18:52:56 LaptopMB kernel: Buffer I/O error on device sda2, logical block
971972
Sep 11 18:52:56 LaptopMB kernel: Buffer I/O error on device sda2, logical block
971973

The last 500 (to be sure) lines of the usbmon trace are attached.

Btw, is this the right workflow to include the patch or is there a simpler way?

- rm /usr/src/linux/drivers/usb/storage/*.o
- make modules
- cp /usr/src/linux/drivers/usb/storage/usb-storage.ko
/lib/modules/2.6.17.13-mb-1/drivers/usb/storage

Michael
Comment 10 Michael 2006-09-11 09:56:19 UTC
sorry, forgot to mention the patching step in the procedure:

- rm /usr/src/linux/drivers/usb/storage/*.o
- patch -p0 < usb_patch.diff (and enter path to
/usr/src/linux/drivers/usb/storage/transport.c)
- make modules
- cp /usr/src/linux/drivers/usb/storage/usb-storage.ko
/lib/modules/2.6.17.13-mb-1/drivers/usb/storage
Comment 11 Alan Stern 2006-09-11 12:07:52 UTC
Created attachment 8991 [details]
Improve handling of SCSI residue and 10-byte commands

I think the problem lies in the SCSI core, not in the USB drivers.  Try this
patch instead of the other one and see if it works.  No need for usbmon.

Your workflow isn't quite right.  You don't need to remove the old *.o files,
since the kbuild system is smart enough to know when the source has been
changed and needs to be recompiled.  Also, when you apply the patch you should
first cd into /usr/src/linux and then invoke patch with -p1, not -p0.
Comment 12 Michael 2006-09-11 13:21:10 UTC
Created attachment 8997 [details]
rejected part of patch

Hi Alan,

seems I still have an error in my workflow. I saved the patch as
scsi_patch.diff into /usr/src/linux and did the following:

cd /usr/src/linux
patch -p1 < scsi_patch.diff
patching file drivers/scsi/sd.c
Hunk #1 FAILED at 926.
Hunk #2 FAILED at 1011.
2 out of 2 hunks FAILED -- saving rejects to file drivers/scsi/sd.c.rej
patching file include/scsi/scsi_device.h
patching file drivers/scsi/scsi_lib.c
Hunk #1 succeeded at 1050 with fuzz 2 (offset 158 lines).

What am I doing wrong (if anything)?
Comment 13 Alan Stern 2006-09-12 07:46:45 UTC
Created attachment 9002 [details]
Improve handling of SCSI residue and 10-byte commands

You didn't do anything wrong; I did.  The patch was based on a development
kernel, not 2.6.17, so it didn't apply properly.  This attachment has the
equivalent patch rebased against 2.6.17.10, so now there shouldn't be any
difficulty applying it.
Comment 14 Michael 2006-09-12 09:08:59 UTC
Alan,

it seems there is still something wrong:

  CC      drivers/scsi/scsi.o
  CC      drivers/scsi/hosts.o
  CC      drivers/scsi/scsi_ioctl.o
  CC      drivers/scsi/constants.o
  CC      drivers/scsi/scsicam.o
  CC      drivers/scsi/scsi_error.o
  CC      drivers/scsi/scsi_lib.o
  CC      drivers/scsi/scsi_scan.o
  CC      drivers/scsi/scsi_sysfs.o
  CC      drivers/scsi/scsi_devinfo.o
  CC      drivers/scsi/scsi_sysctl.o
  CC      drivers/scsi/scsi_proc.o
  CC      drivers/scsi/sd.o
drivers/scsi/sd.c: In function 'sd_rw_intr':
drivers/scsi/sd.c:911: error: 'xfer_size' undeclared (first use in this function)
drivers/scsi/sd.c:911: error: (Each undeclared identifier is reported only once
drivers/scsi/sd.c:911: error: for each function it appears in.)
make[2]: *** [drivers/scsi/sd.o] Fehler 1
make[1]: *** [drivers/scsi] Fehler 2
make: *** [drivers] Fehler 2

I even redownloaded the sources, but still get the same error with the new patch.

Any idea?
Comment 15 Alan Stern 2006-09-12 10:33:57 UTC
Created attachment 9003 [details]
Improve handling of SCSI residue and 10-byte commands

Third time's the charm.  I didn't notice that xfer_size should have been
renamed to this_count.
Comment 16 Michael 2006-09-13 01:19:39 UTC
Created attachment 9010 [details]
last 500 lines of usbmon trace of failure

O.K. we are one step ahead it seems.

With the latest patch (which compiles and installs perfectly), I still get
input/output errors, but the device no longer disconnects (remains accessible).


The input/output errors are random, i.e. when I repeat the dd command, the next
error occurs at a different block.
Comment 17 Alan Stern 2006-09-13 11:41:17 UTC
This is rapidly mutating into a question about the SCSI stack, not about USB.  I
can't answer these questions quickly.  In fact, I'll have to build a 2.6.17
kernel and do some experiments.  It'll take a while.
Comment 18 Michael 2006-09-13 12:51:48 UTC
No need to hurry. I have this disk now for more than two months and I can easily
wait some time. We're closing in to a solution, I feel, and this is good :-)

Anyway, I also have to do some changes on this new kernel. It has the suspend2
patches integrated. But they seem to be still very experimental. In eight
suspends the system froze 3 times and oopsed one time -> not really useful. I
never had problems with software suspend1. So, I will try to get suspend1
working with this kernel.

Thanks for your help, btw. This usb case is really a very cute enclosure.
Unfortunately, it seems to have a really buggy interface. (Although I sometimes
wonder what makes Windows work with all these buggy stuff. Is it so much more
fault-tolerant?)

Cheers,

Michael
Comment 19 Alan Stern 2006-09-18 08:19:57 UTC
Created attachment 9046 [details]
Improve handling of SCSI residue and 10-byte commands

This patch should fix things up for 2.6.17.  A similar but rather different
change will be needed for 2.6.18 and beyond.
Comment 20 Michael 2006-09-19 06:52:23 UTC
Looks as if we are another big step forward. There are no more
input/output-errors reported. Unfortunately, the data seem to get corrupted
during transfer. At least, md5sums don't match:

dd if=/dev/sda2 of=sda2 bs=1M
9765+1 Datens
Comment 21 Alan Stern 2006-09-19 07:34:53 UTC
Looks like we'll have to try a slightly different approach.  I've been assuming
that when those partial writes occur, all but the last sector was correctly
transferred.  That assumption may have been wrong.  The best approach might be
to retry the entire transfer.  I'll attach a patch to do that instead.
Comment 22 Michael 2006-09-25 07:15:31 UTC
Anything new on this?

If it is easier for you, I can also switch to a 2.6.18er kernel to test a new
patch. (Might be a good idea anyway...)

Cheers,

Michael
Comment 23 Alan Stern 2006-09-25 08:48:54 UTC
Created attachment 9098 [details]
Improve handling of bogus Illegal Command errors

It will be better to switch to 2.6.18, but first try this patch for 2.6.17. 
Unlike the earlier patch, when one of those bogus errors occurs this will cause
the entire command to be retried.
Comment 24 Michael 2006-09-25 14:32:46 UTC
Alan:

Thanks for the new patch.

It seems to work correctly for reads, as now I get consistent md5sums:

dd if=/dev/sda2 | md5sum
19999728+0 Datens
Comment 25 Alan Stern 2006-09-26 07:21:45 UTC
I can't tell without more information, like a usbmon log for one of those bad
writes.  To help keep the size of the log file down, try transferring only a few
hundred of your 1-MB blocks.

Another thing to try is to compare the contents of the drive against the disk
file to see the byte positions where they differ.  The location of the
differences ought to match up with errors in the usbmon log.
Comment 26 Michael 2006-09-27 23:56:01 UTC
Created attachment 9115 [details]
Trace of usbmon during usb bus reset

Hi Alan,

I started a new transfer (dd if=/home/michael/sda2 of=/dev/sda2 bs=1M) and
logged the output of /sys/kernel/debug/usbmon/1t, until I saw a usb reset
message in the syslog. I then stopped the logging, zipped the file and attached
it.

The syslog message was:
Sep 28 08:55:53 LaptopMB kernel: usb 1-4: reset high speed USB device using
ehci_hcd and address 2

Hope, it contains some useful information.

Thanks,

Michael
Comment 27 Alan Stern 2006-09-28 11:35:37 UTC
Here's the part of the log just before the reset occurred:

f62ce3c0 3407227500 S Bo:002:03 -115 31 = 55534243 5cb30200 00e00100 00000c2a
00000f94 e00000f0 00000000 000000
f62ce3c0 3407228070 C Bo:002:03 0 31 >
f5fe9cc0 3407228090 S Bo:002:03 -115 24576 = ed810000 b1670000 16044e44 19d84d44
711b3c44 00000000 00000100 38000000
f5fe92c0 3407228093 S Bo:002:03 -115 32768 = c03b3998 00000002 00265bf7 00000000
00000000 00000000 00000000 00000000
f5fe9240 3407228096 S Bo:002:03 -115 32768 = ffa10000 13000000 10b76444 6ad54d44
6ad54d44 00000000 00000100 00000000
f5fe9940 3407228099 S Bo:002:03 -115 32768 = ed810000 b1670000 16044e44 19d84d44
711b3c44 00000000 00000100 38000000
f5fe9cc0 3407231465 C Bo:002:03 0 24576 >
f5fe92c0 3407231474 C Bo:002:03 0 32768 >
f5fe9240 3407233423 C Bo:002:03 0 32768 >
f5fe9940 3407233432 C Bo:002:03 0 32768 >
f62ce3c0 3407233472 S Bi:002:04 -115 13 <
f62ce3c0 3407233540 C Bi:002:04 0 13 = 00000000 00000000 00000000 00

Basically what this contains is a WRITE command for 240 sectors (122880 bytes).
 They were transferred successfully to the drive.  But then afterward, in the
last line where the drive reports the status of the command, it just sent back a
bunch of 0s.  The data on that line should have been "55534253 5cb30200 00000000
00".  So obviously there's still something wrong with the drive itself, or with
its USB interface.

Anyway, the reset occurred and the WRITE command was retried.  It succeeded the
second time.  So as far as the log indicates, the entire transfer worked.

Did you check to see if the data was copied correctly?  Remember, the log for a
successful transfer isn't very useful.  We need to see the log for a transfer
where the data was bad.  And we also need to check to see if the places where
the data is bad correspond to the places where resets occurred.
Comment 28 Michael 2006-09-28 12:52:11 UTC
Alan,

no, I haven't checked. I just assumed it, as it seemed that the data on the disk
only changed when there where those reset messages in the log.

Furthermore, every transfer takes quite some time, as the original error only
occured after a lot of data was transfered. I usually got the input/output error
only on the 10 GB partition or on a 30 GB Partition. I nearly never got one on a
500MB or a 3 GB transfer.

But of course, I can't assume that this error is directly related to the
original input/output error. Maybe I should write some kind of small script,
which writes 500 MB of data, logs the transfer with usbmon, checks the syslog,
rereads the data and verifies it. If the data is o.k., it empties the usbmon log
and retries.

The problem could be, that if the error again only occurs on very large
transfers, then the usbmon log will be extremely big, too.

Anyway, I will need some time to do the tests and try to get the info you need.

Btw. the drive itself is fine. It has served me for more than three years as the
main drive in my laptop without any problems. I also tried another drive in the
same enclosure and had similar problems. So, it seems to be the usb interface.

One more info: Before your scsi patch, I also once tried a whole 10 GB read with
the ohci driver and yes, there also was this input/output error (although it
took three hours to get to 6GB where the error finally occured).

Hope to report something next week.

Cheers,

Michael

P.S.: Unbelievable, how you can read the usbmon data. I'm afraid the data tells
me nothing. Is there somewhere some documentation, so that I could be able to
find the data before the reset (and post only this part then)?
Comment 29 Alan Stern 2006-09-28 13:35:50 UTC
Automating the test sounds like a good idea.  You might even want to use the
"seek=" parameter in dd so that the script writes to various locations on the
disk, just like a real 30 GB copy would.

usbmon is explained in the kernel source file Documentation/usb/usbmon.txt.  In
this case it was pretty easy to find what I was looking for.  Errors show up as
negative numbers other than -115 (that's the standard IN-PROGRESS code), and
resets show up with the string "000:00".  Check your old log, you should see
what I mean.  There were no errors and one section containing a reset, about 5%
of the way through the log.

Note You need to log in before you can comment on or make changes to this bug.