Most recent kernel where this bug did not occur: Distribution: Kanotix, Debian SID Hardware Environment: Several Software Environment: Debian SID Problem Description: When transferring data to or from an external USB disk (Revoltec Alu Book Edition 2 USB Enclosure, see http://www.revoltec.de/english/produkte/rs025.htm), the connection stops after a short period of time with several i/o errors. The external disk is then no longer accessible. After attaching the external device appears as sda: Jul 4 12:36:27 LaptopMB kernel: usb 1-3: new high speed USB device using ehci_hcd and address 14 Jul 4 12:36:27 LaptopMB kernel: usb 1-3: configuration #1 chosen from 1 choice Jul 4 12:36:27 LaptopMB kernel: scsi3 : SCSI emulation for USB Mass Storage devices Jul 4 12:36:27 LaptopMB kernel: usb-storage: device found at 14 Jul 4 12:36:27 LaptopMB kernel: usb-storage: waiting for device to settle before scanning Jul 4 12:36:32 LaptopMB kernel: Vendor: IC25N040 Model: ATMR04-0 Rev: MO2O Jul 4 12:36:32 LaptopMB kernel: Type: Direct-Access ANSI SCSI revision: 00 Jul 4 12:36:32 LaptopMB kernel: SCSI device sda: 78140160 512-byte hdwr sectors (40008 MB) Jul 4 12:36:32 LaptopMB kernel: sda: Write Protect is off Jul 4 12:36:32 LaptopMB kernel: sda: Mode Sense: 00 14 00 00 Jul 4 12:36:32 LaptopMB kernel: sda: assuming drive cache: write through Jul 4 12:36:32 LaptopMB kernel: SCSI device sda: 78140160 512-byte hdwr sectors (40008 MB) Jul 4 12:36:32 LaptopMB kernel: sda: Write Protect is off Jul 4 12:36:32 LaptopMB kernel: sda: Mode Sense: 00 14 00 00 Jul 4 12:36:32 LaptopMB kernel: sda: assuming drive cache: write through Jul 4 12:36:32 LaptopMB kernel: sda: sda1 sda2 sda3 sda4 Jul 4 12:36:32 LaptopMB kernel: sd 3:0:0:0: Attached scsi disk sda Jul 4 12:36:32 LaptopMB kernel: sd 3:0:0:0: Attached scsi generic sg0 type 0 Jul 4 12:36:32 LaptopMB kernel: usb-storage: device scan complete fdisk -l shows the new device (hda is the internal, sda the external): Platte /dev/hda: 80.0 GByte, 80026361856 Byte 255 K
Just tested 2.6.17.3: Has the same bug.
Please turn on CONFIG_USB_STORAGE_DEBUG and capture (and post) the section of the logfile which shows the very beginnings of the failure.
Also, can you attach the contents of /proc/bus/usb/devices when the disk is plugged in? You may need to mount the /proc/bus/usb directory first: mount -t usbfs none /proc/bus/usb
Created attachment 8969 [details] list of /proc/bus/usb/devices @alan: this is the output of /proc/bus/usb/devices with the device attached
@alan: interesting: when the error occurs and the device no longer appears in fdisk -l, it is still visible in /proc/bus/usb/devices Don't know if that matters, though. @Matthew: I tried, however did not succeed. I haven't compiled an own kernel yet. Most I did was compiling a new driver or something like this. I guess I will need some more time to do so. Maybe I get something done next week or so...
The fact that the device is still visible in /proc/bus/usb/devices means that it hasn't disconnected itself electrically from the USB bus. But fdisk doesn't work because the device has stopped responding to commands. As an alternative to using CONFIG_USB_STORAGE_DEBUG and rebuilding the device driver, you could try using usbmon -- assuming it is configured into your kernel. Instructions are in the kernel source file Documentation/usb/usbmon.txt.
Created attachment 8984 [details] usbmon trace of the failure sorry for the long file. I have no clue, where failure messages start in this file, so I decided to take the last 10% of the whole capture. I compiled kernel 2.6.17.13 and enabled DEBUG_FS. I mounted the filesystem, loaded the module and attached the disk. /proc/bus/usb/devices showed bus 03, so I started a trace of 3t. I then issued "dd if=/dev/sda2 of=/dev/null" and after 4.9GB the dd stopped with the input/output error. Hope this helps! Michael
Created attachment 8985 [details] Display result code for failed commands It's kind of hard to say just what happened. The usbmon log looks perfectly normal except for one spot just before the end. The computer tried to read 240 sectors starting at sector number 10559056. The drive sent back 239 sectors of data and then an error code meaning "Illegal command", which makes no sense. The computer went on to read the following 280 sectors uneventfully (probably trying to finish up the I/O request), but the damage had already been done. It seems like the drive encountered some sort of read error and then reported it incorrectly. I don't know why Linux didn't retry the command. By the way, your instinct was correct. The interesting parts were all in the last 100 lines of the log! Try again after applying this patch. It should print a message in the system log whenever a command sent to the drive fails. Again, send the last few hundred lines of the usbmon file and also whatever you get from dmesg.
Created attachment 8990 [details] Trace from a second recorded failure Syslog shows: Sep 11 18:52:56 LaptopMB kernel: Command result 0x2 Sep 11 18:52:56 LaptopMB kernel: end_request: I/O error, dev sda, sector 8752464 Sep 11 18:52:56 LaptopMB kernel: Buffer I/O error on device sda2, logical block 971964 Sep 11 18:52:56 LaptopMB kernel: Buffer I/O error on device sda2, logical block 971965 Sep 11 18:52:56 LaptopMB kernel: Buffer I/O error on device sda2, logical block 971966 Sep 11 18:52:56 LaptopMB kernel: Buffer I/O error on device sda2, logical block 971967 Sep 11 18:52:56 LaptopMB kernel: Buffer I/O error on device sda2, logical block 971968 Sep 11 18:52:56 LaptopMB kernel: Buffer I/O error on device sda2, logical block 971969 Sep 11 18:52:56 LaptopMB kernel: Buffer I/O error on device sda2, logical block 971970 Sep 11 18:52:56 LaptopMB kernel: Buffer I/O error on device sda2, logical block 971971 Sep 11 18:52:56 LaptopMB kernel: Buffer I/O error on device sda2, logical block 971972 Sep 11 18:52:56 LaptopMB kernel: Buffer I/O error on device sda2, logical block 971973 The last 500 (to be sure) lines of the usbmon trace are attached. Btw, is this the right workflow to include the patch or is there a simpler way? - rm /usr/src/linux/drivers/usb/storage/*.o - make modules - cp /usr/src/linux/drivers/usb/storage/usb-storage.ko /lib/modules/2.6.17.13-mb-1/drivers/usb/storage Michael
sorry, forgot to mention the patching step in the procedure: - rm /usr/src/linux/drivers/usb/storage/*.o - patch -p0 < usb_patch.diff (and enter path to /usr/src/linux/drivers/usb/storage/transport.c) - make modules - cp /usr/src/linux/drivers/usb/storage/usb-storage.ko /lib/modules/2.6.17.13-mb-1/drivers/usb/storage
Created attachment 8991 [details] Improve handling of SCSI residue and 10-byte commands I think the problem lies in the SCSI core, not in the USB drivers. Try this patch instead of the other one and see if it works. No need for usbmon. Your workflow isn't quite right. You don't need to remove the old *.o files, since the kbuild system is smart enough to know when the source has been changed and needs to be recompiled. Also, when you apply the patch you should first cd into /usr/src/linux and then invoke patch with -p1, not -p0.
Created attachment 8997 [details] rejected part of patch Hi Alan, seems I still have an error in my workflow. I saved the patch as scsi_patch.diff into /usr/src/linux and did the following: cd /usr/src/linux patch -p1 < scsi_patch.diff patching file drivers/scsi/sd.c Hunk #1 FAILED at 926. Hunk #2 FAILED at 1011. 2 out of 2 hunks FAILED -- saving rejects to file drivers/scsi/sd.c.rej patching file include/scsi/scsi_device.h patching file drivers/scsi/scsi_lib.c Hunk #1 succeeded at 1050 with fuzz 2 (offset 158 lines). What am I doing wrong (if anything)?
Created attachment 9002 [details] Improve handling of SCSI residue and 10-byte commands You didn't do anything wrong; I did. The patch was based on a development kernel, not 2.6.17, so it didn't apply properly. This attachment has the equivalent patch rebased against 2.6.17.10, so now there shouldn't be any difficulty applying it.
Alan, it seems there is still something wrong: CC drivers/scsi/scsi.o CC drivers/scsi/hosts.o CC drivers/scsi/scsi_ioctl.o CC drivers/scsi/constants.o CC drivers/scsi/scsicam.o CC drivers/scsi/scsi_error.o CC drivers/scsi/scsi_lib.o CC drivers/scsi/scsi_scan.o CC drivers/scsi/scsi_sysfs.o CC drivers/scsi/scsi_devinfo.o CC drivers/scsi/scsi_sysctl.o CC drivers/scsi/scsi_proc.o CC drivers/scsi/sd.o drivers/scsi/sd.c: In function 'sd_rw_intr': drivers/scsi/sd.c:911: error: 'xfer_size' undeclared (first use in this function) drivers/scsi/sd.c:911: error: (Each undeclared identifier is reported only once drivers/scsi/sd.c:911: error: for each function it appears in.) make[2]: *** [drivers/scsi/sd.o] Fehler 1 make[1]: *** [drivers/scsi] Fehler 2 make: *** [drivers] Fehler 2 I even redownloaded the sources, but still get the same error with the new patch. Any idea?
Created attachment 9003 [details] Improve handling of SCSI residue and 10-byte commands Third time's the charm. I didn't notice that xfer_size should have been renamed to this_count.
Created attachment 9010 [details] last 500 lines of usbmon trace of failure O.K. we are one step ahead it seems. With the latest patch (which compiles and installs perfectly), I still get input/output errors, but the device no longer disconnects (remains accessible). The input/output errors are random, i.e. when I repeat the dd command, the next error occurs at a different block.
This is rapidly mutating into a question about the SCSI stack, not about USB. I can't answer these questions quickly. In fact, I'll have to build a 2.6.17 kernel and do some experiments. It'll take a while.
No need to hurry. I have this disk now for more than two months and I can easily wait some time. We're closing in to a solution, I feel, and this is good :-) Anyway, I also have to do some changes on this new kernel. It has the suspend2 patches integrated. But they seem to be still very experimental. In eight suspends the system froze 3 times and oopsed one time -> not really useful. I never had problems with software suspend1. So, I will try to get suspend1 working with this kernel. Thanks for your help, btw. This usb case is really a very cute enclosure. Unfortunately, it seems to have a really buggy interface. (Although I sometimes wonder what makes Windows work with all these buggy stuff. Is it so much more fault-tolerant?) Cheers, Michael
Created attachment 9046 [details] Improve handling of SCSI residue and 10-byte commands This patch should fix things up for 2.6.17. A similar but rather different change will be needed for 2.6.18 and beyond.
Looks as if we are another big step forward. There are no more input/output-errors reported. Unfortunately, the data seem to get corrupted during transfer. At least, md5sums don't match: dd if=/dev/sda2 of=sda2 bs=1M 9765+1 Datens
Looks like we'll have to try a slightly different approach. I've been assuming that when those partial writes occur, all but the last sector was correctly transferred. That assumption may have been wrong. The best approach might be to retry the entire transfer. I'll attach a patch to do that instead.
Anything new on this? If it is easier for you, I can also switch to a 2.6.18er kernel to test a new patch. (Might be a good idea anyway...) Cheers, Michael
Created attachment 9098 [details] Improve handling of bogus Illegal Command errors It will be better to switch to 2.6.18, but first try this patch for 2.6.17. Unlike the earlier patch, when one of those bogus errors occurs this will cause the entire command to be retried.
Alan: Thanks for the new patch. It seems to work correctly for reads, as now I get consistent md5sums: dd if=/dev/sda2 | md5sum 19999728+0 Datens
I can't tell without more information, like a usbmon log for one of those bad writes. To help keep the size of the log file down, try transferring only a few hundred of your 1-MB blocks. Another thing to try is to compare the contents of the drive against the disk file to see the byte positions where they differ. The location of the differences ought to match up with errors in the usbmon log.
Created attachment 9115 [details] Trace of usbmon during usb bus reset Hi Alan, I started a new transfer (dd if=/home/michael/sda2 of=/dev/sda2 bs=1M) and logged the output of /sys/kernel/debug/usbmon/1t, until I saw a usb reset message in the syslog. I then stopped the logging, zipped the file and attached it. The syslog message was: Sep 28 08:55:53 LaptopMB kernel: usb 1-4: reset high speed USB device using ehci_hcd and address 2 Hope, it contains some useful information. Thanks, Michael
Here's the part of the log just before the reset occurred: f62ce3c0 3407227500 S Bo:002:03 -115 31 = 55534243 5cb30200 00e00100 00000c2a 00000f94 e00000f0 00000000 000000 f62ce3c0 3407228070 C Bo:002:03 0 31 > f5fe9cc0 3407228090 S Bo:002:03 -115 24576 = ed810000 b1670000 16044e44 19d84d44 711b3c44 00000000 00000100 38000000 f5fe92c0 3407228093 S Bo:002:03 -115 32768 = c03b3998 00000002 00265bf7 00000000 00000000 00000000 00000000 00000000 f5fe9240 3407228096 S Bo:002:03 -115 32768 = ffa10000 13000000 10b76444 6ad54d44 6ad54d44 00000000 00000100 00000000 f5fe9940 3407228099 S Bo:002:03 -115 32768 = ed810000 b1670000 16044e44 19d84d44 711b3c44 00000000 00000100 38000000 f5fe9cc0 3407231465 C Bo:002:03 0 24576 > f5fe92c0 3407231474 C Bo:002:03 0 32768 > f5fe9240 3407233423 C Bo:002:03 0 32768 > f5fe9940 3407233432 C Bo:002:03 0 32768 > f62ce3c0 3407233472 S Bi:002:04 -115 13 < f62ce3c0 3407233540 C Bi:002:04 0 13 = 00000000 00000000 00000000 00 Basically what this contains is a WRITE command for 240 sectors (122880 bytes). They were transferred successfully to the drive. But then afterward, in the last line where the drive reports the status of the command, it just sent back a bunch of 0s. The data on that line should have been "55534253 5cb30200 00000000 00". So obviously there's still something wrong with the drive itself, or with its USB interface. Anyway, the reset occurred and the WRITE command was retried. It succeeded the second time. So as far as the log indicates, the entire transfer worked. Did you check to see if the data was copied correctly? Remember, the log for a successful transfer isn't very useful. We need to see the log for a transfer where the data was bad. And we also need to check to see if the places where the data is bad correspond to the places where resets occurred.
Alan, no, I haven't checked. I just assumed it, as it seemed that the data on the disk only changed when there where those reset messages in the log. Furthermore, every transfer takes quite some time, as the original error only occured after a lot of data was transfered. I usually got the input/output error only on the 10 GB partition or on a 30 GB Partition. I nearly never got one on a 500MB or a 3 GB transfer. But of course, I can't assume that this error is directly related to the original input/output error. Maybe I should write some kind of small script, which writes 500 MB of data, logs the transfer with usbmon, checks the syslog, rereads the data and verifies it. If the data is o.k., it empties the usbmon log and retries. The problem could be, that if the error again only occurs on very large transfers, then the usbmon log will be extremely big, too. Anyway, I will need some time to do the tests and try to get the info you need. Btw. the drive itself is fine. It has served me for more than three years as the main drive in my laptop without any problems. I also tried another drive in the same enclosure and had similar problems. So, it seems to be the usb interface. One more info: Before your scsi patch, I also once tried a whole 10 GB read with the ohci driver and yes, there also was this input/output error (although it took three hours to get to 6GB where the error finally occured). Hope to report something next week. Cheers, Michael P.S.: Unbelievable, how you can read the usbmon data. I'm afraid the data tells me nothing. Is there somewhere some documentation, so that I could be able to find the data before the reset (and post only this part then)?
Automating the test sounds like a good idea. You might even want to use the "seek=" parameter in dd so that the script writes to various locations on the disk, just like a real 30 GB copy would. usbmon is explained in the kernel source file Documentation/usb/usbmon.txt. In this case it was pretty easy to find what I was looking for. Errors show up as negative numbers other than -115 (that's the standard IN-PROGRESS code), and resets show up with the string "000:00". Check your old log, you should see what I mean. There were no errors and one section containing a reset, about 5% of the way through the log.