Bug 9638 - USB bad sense data [sense 0 00 00] causes corruption
USB bad sense data [sense 0 00 00] causes corruption
Status: REJECTED INVALID
Product: Drivers
Classification: Unclassified
Component: USB
All Linux
: P1 normal
Assigned To: Greg Kroah-Hartman
:
Depends on:
Blocks: USB
  Show dependency treegraph
 
Reported: 2007-12-26 11:29 UTC by Andrew J. Kroll
Modified: 2009-11-24 15:36 UTC (History)
3 users (show)

See Also:
Kernel Version: 2.6.xx
Tree: Mainline
Regression: No


Attachments
usbmon output plus dmesg (638.80 KB, text/plain)
2008-07-13 07:06 UTC, Andrew J. Kroll
Details
windows debugging data (156.01 KB, text/plain)
2008-07-16 07:10 UTC, Andrew J. Kroll
Details
usb-storage fix (4.60 KB, patch)
2008-07-21 13:08 UTC, Andrew J. Kroll
Details | Diff
proposed patch for error detection fixes (1.93 KB, patch)
2008-07-31 01:39 UTC, Robert Hancock
Details | Diff
scsi, usb-storage and usbmon logs for I/O failure during scsi-usn loop (77.12 KB, application/octet-stream)
2008-10-07 19:12 UTC, Alexandre Oliva
Details
Patch that fixes errors after SCSI IOCTLs on JMicron USB enclosure (802 bytes, patch)
2008-10-07 23:10 UTC, Alexandre Oliva
Details | Diff

Description Andrew J. Kroll 2007-12-26 11:29:36 UTC
Most recent kernel where this bug did not occur:
Distribution: Slackware 12.0.0
Hardware Environment: EHCI/OCHI/UHCI
Software Environment:
Problem Description:
One device (USB to PATA/SATA bridge) fails only on EHCI and OHCI, but is fine on the system with UHCI, the other device fails on ALL THREE

Steps to reproduce:
Read or write from the device, check dmesg

>>>>>>>>>>>>>>>>>>WD MYBOOK CASE

usb 4-2.1: new high speed USB device using ehci_hcd and address 25
usb 4-2.1: configuration #1 chosen from 1 choice
 ubc: ubc1
ub(4.15): dir w len/act 8192/1536 [sense 0 00 00] retry 1
usb 4-2.3: reset high speed USB device using ehci_hcd and address 15
<unplug device, plug in device, mount, ls>
usb 4-2.1: USB disconnect, address 25
usb 4-2.1: new high speed USB device using ehci_hcd and address 26
usb 4-2.1: configuration #1 chosen from 1 choice
 ubc: ubc1
ub(4.26): dir r len/act 4608/512 [sense 0 00 00] retry 1
usb 4-2.1: reset high speed USB device using ehci_hcd and address 26


/: Bus 04.Port 1: Dev 1, Class=root_hub, Drv=ehci_hcd/6p, 480M
    |_ Port 2: Dev 14, If 0, Prod=, Class=hub, Drv=hub/4p, 480M
        |_ Port 1: Dev 26, If 0, Prod=My Book, Class=stor., Drv=ub, 480M
        |_ Port 3: Dev 15, If 0, Prod=External HDD, Class=stor., Drv=ub, 480M
        |_ Port 4: Dev 24, If 0, Prod=External HDD, Class=stor., Drv=ub, 480M
/: Bus 03.Port 1: Dev 1, Class=root_hub, Drv=ohci_hcd/2p, 12M
/: Bus 02.Port 1: Dev 1, Class=root_hub, Drv=ohci_hcd/2p, 12M
/: Bus 01.Port 1: Dev 1, Class=root_hub, Drv=ohci_hcd/2p, 12M

Bus 4 Device 26: ID 1058:1100 Western Digital Technologies, Inc.
Bus 4 Device 24: ID 1058:0900 Western Digital Technologies, Inc.
Bus 4 Device 15: ID 1058:0901 Western Digital Technologies, Inc.
Bus 4 Device 14: ID 058f:6254 Alcor Micro Corp.
Bus 4 Device 1: ID 0000:0000
Bus 1 Device 1: ID 0000:0000
Bus 3 Device 1: ID 0000:0000
Bus 2 Device 1: ID 0000:0000

What is interesting is the other two WD MyBooks are and have been working fine for over 1 year, and continue to do so. This particular one is the 500GiB essentials model WD5000M1U-00, that I had just purchased, and tried on 2 different machines, and even across usb 1.x hubs to verify that it is not a speed issue. to verify that it is not a cable issue, 3 different sets of cables were used, all results presented the same bogus sense data on 2 different machines.

>>>>>>>>>>>>>>>> USB to PATA/SATA bridge case

This particular device works fine on UHCI USB 1.10, an older Athalon 900 motherboard. It errupts the same errors that the WD mybook does on EHCI/OHCI with same results as the above mentioned hardware, using the same testing.

usb 4-2.1: new high speed USB device using ehci_hcd and address 27
usb 4-2.1: configuration #1 chosen from 1 choice
 ubc: ubc1
Bus 4 Device 27: ID 152d:2338
Bus 4 Device 24: ID 1058:0900 Western Digital Technologies, Inc.
Bus 4 Device 15: ID 1058:0901 Western Digital Technologies, Inc.
Bus 4 Device 14: ID 058f:6254 Alcor Micro Corp.
Bus 4 Device 1: ID 0000:0000
Bus 1 Device 1: ID 0000:0000
Bus 3 Device 1: ID 0000:0000
Bus 2 Device 1: ID 0000:0000
/: Bus 04.Port 1: Dev 1, Class=root_hub, Drv=ehci_hcd/6p, 480M
    |_ Port 2: Dev 14, If 0, Prod=, Class=hub, Drv=hub/4p, 480M
        |_ Port 1: Dev 27, If 0, Prod=USB to ATA/ATAPI Bridge, Class=stor., Drv=ub, 480M
        |_ Port 3: Dev 15, If 0, Prod=External HDD, Class=stor., Drv=ub, 480M
        |_ Port 4: Dev 24, If 0, Prod=External HDD, Class=stor., Drv=ub, 480M
/: Bus 03.Port 1: Dev 1, Class=root_hub, Drv=ohci_hcd/2p, 12M
/: Bus 02.Port 1: Dev 1, Class=root_hub, Drv=ohci_hcd/2p, 12M
/: Bus 01.Port 1: Dev 1, Class=root_hub, Drv=ohci_hcd/2p, 12M

However this bug appears to only occur when slower devices or devices powered by the unit, so it may be that the sense data is reporting some sort of drive feature that is missing? perhaps to much draw from the hub? I still need to test this on a powered hub, to see if it mitigates the issue, as well as try different combinations of disk.

Regardless, it's very apparent that this strange sense data means something, what, I'm not sure. However I do know that massive corruption occurs when this message pops up. Perhaps the right thing to do is to discard that entire transaction and to retry it completely to avoid corruption. the rest I don't think is needed, and possibly slows the entire process down.
Comment 1 Anonymous Emailer 2007-12-26 14:53:12 UTC
Reply-To: akpm@linux-foundation.org

On Wed, 26 Dec 2007 11:29:38 -0800 (PST) bugme-daemon@bugzilla.kernel.org wrote:

> http://bugzilla.kernel.org/show_bug.cgi?id=9638
> 
>            Summary: USB bad sense data [sense 0 00 00] causes corruption
>            Product: Drivers
>            Version: 2.5
>      KernelVersion: 2.6.xx
>           Platform: All
>         OS/Version: Linux
>               Tree: Mainline
>             Status: NEW
>           Severity: normal
>           Priority: P1
>          Component: USB
>         AssignedTo: greg@kroah.com
>         ReportedBy: a@oo.ms
> 
> 
> Most recent kernel where this bug did not occur:
> Distribution: Slackware 12.0.0
> Hardware Environment: EHCI/OCHI/UHCI
> Software Environment:
> Problem Description:
> One device (USB to PATA/SATA bridge) fails only on EHCI and OHCI, but is fine
> on the system with UHCI, the other device fails on ALL THREE
> 
> Steps to reproduce:
> Read or write from the device, check dmesg
> 
> >>>>>>>>>>>>>>>>>>WD MYBOOK CASE
> 
> usb 4-2.1: new high speed USB device using ehci_hcd and address 25
> usb 4-2.1: configuration #1 chosen from 1 choice
>  ubc: ubc1
> ub(4.15): dir w len/act 8192/1536 [sense 0 00 00] retry 1
> usb 4-2.3: reset high speed USB device using ehci_hcd and address 15
> <unplug device, plug in device, mount, ls>
> usb 4-2.1: USB disconnect, address 25
> usb 4-2.1: new high speed USB device using ehci_hcd and address 26
> usb 4-2.1: configuration #1 chosen from 1 choice
>  ubc: ubc1
> ub(4.26): dir r len/act 4608/512 [sense 0 00 00] retry 1
> usb 4-2.1: reset high speed USB device using ehci_hcd and address 26
> 
> 
> /: Bus 04.Port 1: Dev 1, Class=root_hub, Drv=ehci_hcd/6p, 480M
>     |_ Port 2: Dev 14, If 0, Prod=, Class=hub, Drv=hub/4p, 480M
>         |_ Port 1: Dev 26, If 0, Prod=My Book, Class=stor., Drv=ub, 480M
>         |_ Port 3: Dev 15, If 0, Prod=External HDD, Class=stor., Drv=ub, 480M
>         |_ Port 4: Dev 24, If 0, Prod=External HDD, Class=stor., Drv=ub, 480M
> /: Bus 03.Port 1: Dev 1, Class=root_hub, Drv=ohci_hcd/2p, 12M
> /: Bus 02.Port 1: Dev 1, Class=root_hub, Drv=ohci_hcd/2p, 12M
> /: Bus 01.Port 1: Dev 1, Class=root_hub, Drv=ohci_hcd/2p, 12M
> 
> Bus 4 Device 26: ID 1058:1100 Western Digital Technologies, Inc.
> Bus 4 Device 24: ID 1058:0900 Western Digital Technologies, Inc.
> Bus 4 Device 15: ID 1058:0901 Western Digital Technologies, Inc.
> Bus 4 Device 14: ID 058f:6254 Alcor Micro Corp.
> Bus 4 Device 1: ID 0000:0000
> Bus 1 Device 1: ID 0000:0000
> Bus 3 Device 1: ID 0000:0000
> Bus 2 Device 1: ID 0000:0000
> 
> What is interesting is the other two WD MyBooks are and have been working fine
> for over 1 year, and continue to do so. This particular one is the 500GiB
> essentials model WD5000M1U-00, that I had just purchased, and tried on 2
> different machines, and even across usb 1.x hubs to verify that it is not a
> speed issue. to verify that it is not a cable issue, 3 different sets of cables
> were used, all results presented the same bogus sense data on 2 different
> machines.
> 
> >>>>>>>>>>>>>>>> USB to PATA/SATA bridge case
> 
> This particular device works fine on UHCI USB 1.10, an older Athalon 900
> motherboard. It errupts the same errors that the WD mybook does on EHCI/OHCI
> with same results as the above mentioned hardware, using the same testing.
> 
> usb 4-2.1: new high speed USB device using ehci_hcd and address 27
> usb 4-2.1: configuration #1 chosen from 1 choice
>  ubc: ubc1
> Bus 4 Device 27: ID 152d:2338
> Bus 4 Device 24: ID 1058:0900 Western Digital Technologies, Inc.
> Bus 4 Device 15: ID 1058:0901 Western Digital Technologies, Inc.
> Bus 4 Device 14: ID 058f:6254 Alcor Micro Corp.
> Bus 4 Device 1: ID 0000:0000
> Bus 1 Device 1: ID 0000:0000
> Bus 3 Device 1: ID 0000:0000
> Bus 2 Device 1: ID 0000:0000
> /: Bus 04.Port 1: Dev 1, Class=root_hub, Drv=ehci_hcd/6p, 480M
>     |_ Port 2: Dev 14, If 0, Prod=, Class=hub, Drv=hub/4p, 480M
>         |_ Port 1: Dev 27, If 0, Prod=USB to ATA/ATAPI Bridge, Class=stor.,
> Drv=ub, 480M
>         |_ Port 3: Dev 15, If 0, Prod=External HDD, Class=stor., Drv=ub, 480M
>         |_ Port 4: Dev 24, If 0, Prod=External HDD, Class=stor., Drv=ub, 480M
> /: Bus 03.Port 1: Dev 1, Class=root_hub, Drv=ohci_hcd/2p, 12M
> /: Bus 02.Port 1: Dev 1, Class=root_hub, Drv=ohci_hcd/2p, 12M
> /: Bus 01.Port 1: Dev 1, Class=root_hub, Drv=ohci_hcd/2p, 12M
> 
> However this bug appears to only occur when slower devices or devices powered
> by the unit, so it may be that the sense data is reporting some sort of drive
> feature that is missing? perhaps to much draw from the hub? I still need to
> test this on a powered hub, to see if it mitigates the issue, as well as try
> different combinations of disk.
> 
> Regardless, it's very apparent that this strange sense data means something,
> what, I'm not sure. However I do know that massive corruption occurs when this
> message pops up. Perhaps the right thing to do is to discard that entire
> transaction and to retry it completely to avoid corruption. the rest I don't
> think is needed, and possibly slows the entire process down.
> 
> 
> -- 
> Configure bugmail: http://bugzilla.kernel.org/userprefs.cgi?tab=email
> ------- You are receiving this mail because: -------
> You are on the CC list for the bug, or are watching someone who is.

Comment 2 Alan Stern 2007-12-27 08:25:45 UTC
Have you tried using usb-storage instead of ub?
Comment 3 Andrew J. Kroll 2007-12-28 04:27:52 UTC
No, however turing off fsreadahead on the IDE drives seems to have "solved" the problem.... and I got a speed gain as well... so I turned fsreadahead off of all my storage because for me it's a loss...

I'll give it a shot this coming week, right now it's impractical for me to do so, because of my job, I have no time to recompile at the moment. But I will try and report back what the results are on Monday.

Thanks for the suggestion.

Comment 4 Andrew J. Kroll 2007-12-29 05:36:46 UTC
Seems to work with the scsi varient okay... initial reads were slow, but after zeroing the drive, it came up from a shameful 350KB/sec up to a much better 27MB/sec.... however the oddness went away so far in my tests. I will beat the thing to death for the next day or two, and let you know my confidence level. Whatever usb-storage is doing, the ub driver aught to follow along perhaps.

Will advise by Monday.
Comment 5 Andrew J. Kroll 2007-12-31 13:47:56 UTC
scsi variant seems to work happily... I've really tortured the drives by doing multiple copies in parallel on them, then checking md5sums in parallel and all checked fine. Is the ub driver to be depreciated or fixed?
Comment 6 Alan Stern 2007-12-31 14:12:09 UTC
It's hard to say without knowing more about the original problem.

You can help by attaching to this bug report some usbmon traces (see the instructions in the kernel source file Documentation/usb/usbmon.txt).  It will be enough if you show what happens when you plug in the device and mount the drive; you don't have to trace the torture tests.  Collect one trace using ub and another one using usb-storage so we can compare them.
Comment 7 Andrew J. Kroll 2008-07-06 20:32:59 UTC
I did another test with the same results on a Debian box as well, although it fails as expected, an uses an older (2.6.18-6-1686) Debian patched kernel...
yes, it's tained by the debian devs, but the bug is there too, and the results are the same.

On the test box at home, I am now getting hard-lockups, the system totally wedges, and I can't get to a console for the information, and since it is storage, it's not able to be logged either. I firmly believe this is some sort of botch-up in the actual controller code, and when I get the time I will attempt to narrow down the problem.

Note the failures are occurring in BOTH "usb storage" mode AND scsi emulation mode, and only when the connection is USB 2.0. I have not seen this happen in 1.x USB connections except the SiS box mentioned, and they still happen even with the latest kernel.
Comment 8 Alan Stern 2008-07-07 07:15:56 UTC
It would help a lot if you could give a better description of the problem.  "Bogus sense data" doesn't tell us much.  In what way is it bogus?  How do you know it is bogus?  How do you even know what it is?  In what way does the bogus data affect the operation of the rest of the system?  Where does the system lock up?  What does the kernel log reveal when the debugging options are enabled?  What does usbmon show?
Comment 9 Andrew J. Kroll 2008-07-13 06:47:38 UTC
"Where" it is locking up from what I can guess, since I lack any ability to access any disk during the lockup, is a good question :-) I'm guessing there may be some sort of infinite loop issue, or a wedged spinlock. As far as the bogus sense data, what you see is what I could see in the log files that I could manage to get, it's just after these bogus sense datas that the system goes down hill fairly fast. It's as if the USB reset isn't fully recovering. I have managed to make that exact thing happen on the "tainted" box, and eventually it said the device was dead... That particular system did not lock up, but then, it IS a different chipset. I'm not the only person on the planet experiencing these problems either, a quick search on google finds MANY people having the same problem.

As far as turning on any debugging code, I have found that placing in smaller debug points tends to weed out problems like this faster, since debug code changes the timing, thus in some cases, especially race conditions, the bugs disappear when you use any heavy debugging.

As far as usbmon, I'll have to find and compile that, and give you the results.
Comment 10 Andrew J. Kroll 2008-07-13 07:06:38 UTC
Created attachment 16801 [details]
usbmon output plus dmesg

Here is the dump from usbmon, also within the file is the dmesg output.
I avoided a lockup by removing all USB devices first, but if several were plugged in, it would have been a real mess, and deadlocked someplace...
Comment 11 Alan Stern 2008-07-13 11:09:56 UTC
Okay, the log clears up a lot of questions.  The bogus sense data is not the cause of your lockups; failure of the drive (or its USB interface) is.

What's happening is that from time to time, the device doesn't transfer the data it's supposed to.  When the kernel sends a REQUEST SENSE command to find out what's wrong, the device replies with empty sense data -- indicating nothing is wrong!

Under these circumstances the kernel should have retried the command, but it didn't.  That is puzzling.  Anyway, things proceeded like this until finally the device failed to accept data for a WRITE transfer.  After the command timed out the kernel tried to reset the drive, and the reset failed.  It was taken offline, and presumably that's what led to your lockups.

Do you want to try to find out why the commands weren't retried?  The place to look is in the drivers/usb/storage/transport.c source file.  Near the end of usb_stor_invoke_transport() is a comment line that says:

        /* Did we transfer less than the minimum amount required? */

The test following that comment should trigger and srb->result should be set to an error code.  You can check whether this is really happening.
Comment 12 Andrew J. Kroll 2008-07-13 12:12:11 UTC
Here's what I did...


        /* Did we transfer less than the minimum amount required? */
        if (srb->result == SAM_STAT_GOOD &&
                        scsi_bufflen(srb) - scsi_get_resid(srb) < srb->underflow) {
                printk("AJK: Less than the minimum. (%x)\n", srb->result);
                srb->result = (DID_ERROR << 16) | (SUGGEST_RETRY << 24);
                printk("AJK: Suggesting a retry. (%x)\n", srb->result);
        }

The printk's are never hit, therefore something stranger is going on :-)
I.E. we *did* do the minimum xfer!
So that's not it.... next thing to check?
Comment 13 Alan Stern 2008-07-13 12:24:35 UTC
I see.  I had expected the preceding code to set srb->result to SAM_STAT_GOOD, but it doesn't because result != USB_STOR_TRANSPORT_GOOD.  That explains why there's no retry.

So maybe the next thing is to try changing the first line of the "if" statement.  Make it:

   if ((src->result == SAM_STAT_GOOD || !(srb->sense_buffer[2] & 0x0f)) &&

I think that will force the system to retry the failed reads.  But it won't solve your problem entirely, because it won't help the failed resets.  Once the firmware in your device's USB interface has crashed, there's nothing the kernel can do about it.

In the end, we'll probably conclude this is the result of a hopeless hardware failure in your drive.  Have you tried using it with different operating systems?
Comment 14 Andrew J. Kroll 2008-07-13 12:37:50 UTC
Yes, the usual "works fine in M$ products" claim applies, and yes, on the same hardware. Please note too that the hardware *does* work fine on one other system, my old AMD Athlon, which uses an VIA VT82xxxxx UHCI USB 1.1 controller, revision 10.... So I'm really thinking it's a bit deeper than this! Perhaps there is some sort of SiS quirk, however the same happens on Intel 82801DB/DBL/DBM (ICH4/ICH4-L/ICH4-M) USB UHCI Controller #1 (rev 02) as well... And to me, that's seriously curious.
Comment 15 Andrew J. Kroll 2008-07-13 12:43:06 UTC
Fails also on USB Controller: Intel Corporation 82371AB/EB/MB PIIX4 USB (rev 01)
And that's older 1.x USB... This is quite curious!
Comment 16 Andrew J. Kroll 2008-07-13 12:51:24 UTC
And the final test system, passes with the same chipset, but different speed cpu... so this is pretty strange. I've checked voltages, etc, and all are within tolerances that are acceptable, unfortunatly I don't have a 'scope to check for any ripple, but since MOST other devices seem fine (I have a few that fail like this!) it's a tiny bit.... strange.
Just thought I'd pass on the last bits... as far as what I am doing for testing.
Why one box passes, and another fails is a bit interesting, I'll try your patch suggestion, and see what explodes :-)
Comment 17 Andrew J. Kroll 2008-07-13 14:56:26 UTC
Still never hit the codeblock, and acted the same....
(and you had a typo too)
Comment 18 Alan Stern 2008-07-14 07:46:29 UTC
The new "if" statement never succeeded?  Did you find out why not?  What were the values of srb->result, srb->sense_buffer[2], scsi_bufflen(srb), scsi_get_resid(srb), and srb->underflow?

The differences in behavior from one system to another could be caused by variations in the low-level timing of the USB controller hardware in the machines.
Comment 19 Andrew J. Kroll 2008-07-14 14:45:48 UTC
Well, out of curiosity, I did an rmmod ehci_hcd...
Now reads and writes succeed OK. So it has to be the EHCI timings (as a guess).
I know this is going to sound a bit counter-productive, but, I'm going to experiment with placing small delays in write sections, and see what happens.
I've seen silly things like this happen before.

As far as the values, they never printed because the if() was never true.
I think it's EHCI corruption. Moreover, I did the same rmmod on another machine, with the same results... the hardware became sane :-) At least I've got it narrowed down now, and the code *is* labeled as "experimental" and according to the changelog in the comments, not touched since 2005...

Comment 20 Alan Stern 2008-07-15 07:37:09 UTC
> So it has to be the EHCI timings (as a guess).

Earlier you said that the drive failed on a system with an Intel 82371AB/EB/MB PIIX4 motherboard, which doesn't use EHCI.  So the EHCI timings cannot be the entire story.

If you want to experiment with changing timings, a good place to start is in drivers/usb/storage/transport.c:usb_stor_Bulk_transport().  An "if" statement in that routine tests US_FL_GO_SLOW and does a 125-us delay if the test succeeds.  You could try doing the delay always, whether the test succeeds or not.

> As far as the values, they never printed because the if() was never true.

But _why_ was the if() never true?  You still haven't answered my question.  This is the third time I've asked.

> the code *is* labeled as "experimental"

Which code is labeled "experimental"?  Are you referring to the code in ehci-hcd?  That driver is not experimental any more.  Furthermore, changelogs are no longer maintained in source code comments; they are maintained separately in the Git source-code management system.
Comment 21 Andrew J. Kroll 2008-07-15 09:16:07 UTC
> Earlier you said that the drive failed on a system with an Intel 82371AB/EB/MB
> PIIX4 motherboard, which doesn't use EHCI.  So the EHCI timings cannot be the
> entire story.

Might be device timings, as you suggested, since it is the only non-EHCI system that fails... i.e. a corner case.

> But _why_ was the if() never true?  You still haven't answered my question. 
> This is the third time I've asked.

As far as why it's false, I do not know. All I do know is that the printk()'s were never hit :-( Placing them elsewhere would spam the heck out of my logs, etc... If you have a good idea on how to get the info without spamming my logfiles, I'm all for it. I've always been a fan of doing seperate /proc/ entries for this sort of thing, then one can cat them and follow what is happening...

> Which code is labeled "experimental"? ....
> ...Furthermore, changelogs are no longer maintained in source code comments

That explains alot :-)

Is there any other place besides drivers/usb/storage/transport.c that can handle  a delay for everything? Reason I ask is because it may make sense if other devices corrupt on writes as well... I'll try playing with the delay, and see what happens first. I will also see about getting you the why's as to the codeblock not being true... Data corruption could be the answer... since there is no answer from the device on the USB chain, there really is no error to be triggered... That's just my theory though.



Comment 22 Alan Stern 2008-07-15 11:48:51 UTC
You can get the info by putting your printk() statements at the end of the "if (need_auto_sense) {" block.  That block executes only a couple of times under normal circumstances, but it executes every time you get one of those empty sense data messages.

As for EHCI data corruption... It's possible, but I rather doubt it.  Your log contains examples in which the data prove that corruption did _not_ occur.

For example, in the log at timestamp 78.904612 the computer sent a command asking the device for 4096 bytes (i.e., a read of eight 512-byte sectors).  At timestamp 80.376252 the device replied with only 154 bytes -- which is not a magic number of any sort and could not have happened by accident.  Then at timestamp 80.376498 the device sent a status message indicating that a problem had occurred (the final byte was 01) and that 0xf66 (= 3942 decimal or 660f0000 in little-endian hex) bytes of data had been omitted.  Clearly the device _knew_ that it had sent only 154 out of 4096 requested bytes.  But then at timestamp 80.376874 the device sent sense information in which all the important fields were set to 0 -- meaning that nothing had gone wrong!
Comment 23 Andrew J. Kroll 2008-07-15 15:16:37 UTC
I've been doing more testing, adding huge delays seems to reduce, but not eliminate the problem.

I now can do a full read of the device without any short reads. Is there some sort of possibility there is a problem in the device where there needs to be a larger delay for a read/write turn-around? 

If I dd TO the device, that is when it starts to complain... It would also explain why turning off readaheads as I did before, helped a little.
Comment 24 Alan Stern 2008-07-16 06:33:43 UTC
Any type of bug you can think of could possibly exist in the device.  :-)

If reading works okay without errors, I don't see why eliminating readaheads should improve the behavior.
Comment 25 Andrew J. Kroll 2008-07-16 07:10:20 UTC
Created attachment 16838 [details]
windows debugging data

This is the debugging info from a windows xp USB tool, exported to a text file.

Note no errors. I do have the full data file available if you really want it, but I'm more hopeful that something can be gained from looking at perhaps the times of the transactions, or how they are ordered instead.

One thing I really noticed is how differently the LED on the device flickers, it seems to be basically beat to death on Linux, but no so on XP, and the transfers appear to be faster on XP as well... This indicates some sort of delay or scheduling differences someplace. Where is anyone's guess. I also looked on the CD that came with the device for any clues. I never needed it to use the device since it is transparent SCSI over USB, however there does exist a USB 2.0 patch set that apparently deals with EHCI short transfer problems on the CD. This is the exact problem I am experiencing under Linux, but not on XP... That's atleast an interesting note.
Comment 26 Alan Stern 2008-07-16 07:51:33 UTC
In general I think Linux tends to drive mass-storage closer to the theoretical limit than Windows does, putting higher demands on the hardware.  But that's hard to verify without expensive bus-analyzer equipment.  The timestamps in your debugging log have a resolution of about 15 milliseconds, not nearly precise enough for timing comparisons -- we'd need to see values good to around a microsecond.

The most interesting part of that log is what happens at 11:34:00.875.  Only 3584 bytes out of 12288 are transferred to the device.  But the log doesn't indicate how or why the transfer stops short.

I don't know how to interpret your observations of the LED behavior.  Does the fact that it isn't beat to death on XP mean that data is being transferred faster or slower?

How do you know that there is a patch set on the CD dealing with EHCI short transfer problems?  Do you know what it does, in any detail?
Comment 27 Andrew J. Kroll 2008-07-16 09:05:30 UTC
> The most interesting part of that log is what happens at 11:34:00.875.  Only
> 3584 bytes out of 12288 are transferred to the device.  But the log doesn't
> indicate how or why the transfer stops short.

Right, I can see that now too. As you can see though in the log, the clear stalled pipe happens fast, there is no huge pauses or delays...

I'll take a better look at the original data... and try to get better information on how it did the recovery.

> How do you know that there is a patch set on the CD dealing with EHCI short
> transfer problems?  Do you know what it does, in any detail?

The patch info is in a readme.txt on the cdrom.
I can tarball and email it to you if you want to dig inside it.
Comment 28 Andrew J. Kroll 2008-07-16 09:20:06 UTC
The status returned was "USBD_STATUS_STALL_PID", This seems to be something internal to windows??
Comment 29 Andrew J. Kroll 2008-07-16 09:39:53 UTC
Forgot to say, the status number returned was 0xC0000004...

0936 says:
The Sync reset function issued to clear the problem was 0x001E
status 0x00004540

The actual log file I got here is much larger, I'll look for more instances of the same...
Comment 30 Alan Stern 2008-07-16 09:54:01 UTC
Don't send me the patch itself; I wouldn't know what to do with it.  Maybe the info alone would say something useful, maybe not.

The USBD_STATUS_STALL_PID means that the devices sent a STALL handshake, which agrees with the following log contents.  However the usbmon log you got earlier doesn't show the same behavior.  At timestamp 121.136113 we see that an 8192-byte transfer stopped after only 3072 bytes had been sent, and the completion code -104 indicates that the transfer was aborted by the computer.  If the device had sent a STALL, there wouldn't have been a 30-second delay and the completion code would have been -32 instead of -104.

Unless the ehci-hcd driver is somehow missing the STALL (which seems unlikely), the device behaved differently during the two log sessions.
Comment 31 Andrew J. Kroll 2008-07-16 10:01:42 UTC
Found something a bit "interesting" in the debug, with those double 31 byte out bound URBs... And I think it is how it detects stalls fast.

First out URB reads
Hdr:
Length 0x0048
Function 0x0009
status 0x00000000
UsbDeviceHandle 0x00000000
UsbdFlags 0x00000000
Pipehandle....
TransferFlages 0x00000000
TransferBufferLength 0x0000001f
Transfer buffer...
TransferBufferMDL 0x00000000
UrbLink 0x00000000
HCA is all zeros

The second URB sent has an easter egg in it :-)

Hdr:
Length 0x0048
Function 0x0009
status 0x00000000
UsbDeviceHandle 0x88A3A158 <--- I don't see the handle in the logs... Curious.
UsbdFlags 0x00000000
Pipehandle....
TransferFlages 0x00000000
TransferBufferLength 0x0000001f
Transfer buffer...
TransferBufferMDL 0x00000000
UrbLink 0x00000000
HCA is all zeros EXCEPT
HcdEndpoint 0xffffffff
HcdIrp 0xdeadfood
         ^^^^^^^^--- It sure is!
LOL!


These seem to occur regularly, about every 3 transactions.
I think that may be the key to it working.
Comment 32 Alan Stern 2008-07-16 11:50:24 UTC
No, that has nothing to do with anything.  It's just part of the way Windows keeps track internally of USB communications.
Comment 33 Andrew J. Kroll 2008-07-16 14:08:49 UTC
Your autosense data via prink() is...
-- Result from auto-sense is 0
-- code: 0x70, key: 0x0, ASC: 0x0, ASCQ: 0x0
not surprising to me :-)
Comment 34 Andrew J. Kroll 2008-07-16 14:55:43 UTC
On or about line 936, I added something ugly that seems to work :-)

        /* copy the command payload */
        memset(bcb->CDB, 0, sizeof(bcb->CDB));
        memcpy(bcb->CDB, srb->cmnd, bcb->Length);

        mdelay(6); /* 666 */
        /* send it to out endpoint */

yep, an ugly mdelay(), BUT, it no longer spews all over it'sself. My guess here is that we need an "unlikely" bit for this, eventually I can figgure out exactly where to place delays so it doesn't affect every transfer, and every darn device... 

USB Mass Storage support registered.
usb-storage: device found at 15
usb-storage: waiting for device to settle before scanning
scsi 11:0:0:0: Direct-Access     IBM-DMCA -21440           95   PQ: 0 ANSI: 2 CC
S
sd 11:0:0:0: [sda] 2818368 512-byte hardware sectors (1443 MB)
sd 11:0:0:0: [sda] Write Protect is off
sd 11:0:0:0: [sda] Mode Sense: 00 38 00 00
sd 11:0:0:0: [sda] Assuming drive cache: write through
sd 11:0:0:0: [sda] 2818368 512-byte hardware sectors (1443 MB)
sd 11:0:0:0: [sda] Write Protect is off
sd 11:0:0:0: [sda] Mode Sense: 00 38 00 00
sd 11:0:0:0: [sda] Assuming drive cache: write through
 sda: sda1 sda2
sd 11:0:0:0: [sda] Attached SCSI disk
sd 11:0:0:0: Attached scsi generic sg0 type 0
usb-storage: device scan complete

#dd if=/dev/sda of=/dev/null
*twiddle, twiddle, nap, etc*
2818368+0 records in
2818368+0 records out
1443004416 bytes (1.4 GB) copied, 513.807 s, 2.8 MB/s

...and no complaints in the dmesg :-) :-) :-)

I tried a value of 5, but that gave occasional "no sense" problems.
I think I'm on to something here...

Comment 35 Andrew J. Kroll 2008-07-17 06:09:39 UTC
Writing is successful as well:
# dd if=/dev/zero of=/dev/sda2
dd: writing to `/dev/sda2': No space left on device
971713+0 records in
971712+0 records out
497516544 bytes (498 MB) copied, 1077.61 s, 462 kB/s

As shown, dd was extremely slow but I'm not at all surprised since the drive on it sucks. I'll take the speed penalty in trade for stability. Still, I should find out where exactly the delay needs to be, and when... perhaps snooping the kind of URB.

On the drives that have no problems, the speed is reduced by 1/4th on reads... for example:

root@shit:~# dd if=/dev/sdd of=/dev/null bs=1024 count=10240
10240+0 records in
10240+0 records out
10485760 bytes (10 MB) copied, 1.39498 s, 7.5 MB/s

Normally this was ~27-30MB/sec... so 'till I get a quirk entry, I can live with stable data and the slow speeds. In-fact If you could place in a few of them, that would be awesome. In the event there is a different quirky device that can handle a bit more speed, it can be dealt with. If I may suggest the following QUIRK names:

SLOW_MDELAY_1
SLOW_MDELAY_2
SLOW_MDELAY_4

They can be used in combinations to produce 1/2/3/4/5/6/7 msec delays when used collectively. Perhaps if the driver "runs into trouble" during a test during init, it can set the bits as needed, thus mitigating the need for any quirk entries in the unusual devices table.

I realize this isn't the best fix, but it is a temporary fix, till I manage perhaps something like a snoop and see when it's really needed. I'm also thinking what would be nice is perhaps an extra field in the device struct (if it doesn't exist already, haven't checked) to allow for a void pointer to private data. This could be used for example, to store something like counters and timestamps, so that if it needs to cause a delay only after a certain amount of transactions within a given time it can be done.

Oh, and just FYI, this fix was tested on 2.6.26, so no crusty old kernel here :-)




Comment 36 Alan Stern 2008-07-17 10:03:20 UTC
I can't submit changes with delays all over the place; the maintainers wouldn't stand for it.  The best thing will be for you to determine exactly what delays you really need.

Any answer to my question in comments 20 and 22 about the variable values in the "need_auto_sense" block?
Comment 37 Andrew J. Kroll 2008-07-17 11:12:21 UTC
> I can't submit changes with delays all over the place; the maintainers
> wouldn't stand for it.  The best thing will be for you to determine exactly
> what delays you really need.

There is only one place it is needed at this point, re-read comments 34, and 35 :-) The current fix is a one-liner. If I have the 3 new quirks bits, I can write an in-kernel test to determine what quirk bits (if any) need to be set on a per-device basis.

> Any answer to my question in comments 20 and 22 about the variable values in
> the "need_auto_sense" block?

Comment 33 contains a prink within, I'll "unfix" the driver to get to you the data in about 8 or 9 hours.
Comment 38 Alan Stern 2008-07-17 11:43:54 UTC
Is it really worthwhile to add a "fix" which reduces the effective throughput by a factor of 4 or more?  That's a Pyrrhic victory at best.  Besides, don't you think Linux should be able to access the drive as fast as Windows?  The way to do that isn't by adding delays; it's by detecting when the drive makes an error and retrying.

How are you going to write an in-kernel test to find the necessary delay time?
Comment 39 Andrew J. Kroll 2008-07-17 17:27:10 UTC
> Is it really worthwhile to add a "fix" which reduces the effective throughput
> by a factor of 4 or more?  That's a Pyrrhic victory at best.  Besides, don't
> you think Linux should be able to access the drive as fast as Windows?

Well, as I said, temporary fix till the root cause is actually located. I don't mind the reduced speed if it is only on the quirky interface, and leaves the other ones alone. More on that below :-)

As far as windows, I only have it around to make sure devices that misbehave under Linux are indeed working, and also note that it did 64K requests too... larger requests means fewer of them, and the delay time would not matter in that particular case. I do realize a 64K transfer ability is NOT something we can test for yet, and that it hoses some devices. Obviously there has to be some sort of test that can be done that can tell us that, and who knows, the other OS might do just that, or it may not. I'll have to trace the plug-in event to tell.

>  The way to do that isn't by adding delays; it's by detecting when the 
> drive makes an error and retrying.

I agree, but till such a time, any fix is better than no fix, and getting at the error message is problematic... That's why I am trying to help fix it in the first place.

> How are you going to write an in-kernel test to find the necessary delay time? 

That's pretty trivial to do. Just use the transport code to request a pile of reads on sector 0 or so, and change the quirk value as needed. Once the device passes the test, it can be considered OK, and if the particular device does happen to fail later on, a simple if() can bump up the value another notch only on the particular device when no sense is detected and if any of the quirk bits are set. It would see to me that it is easy to test for this case initially, since we see no sense data with this particular failure. That would be the reason for the new quirk bits.

I'll get you those values in a few moments. I'm sure they will be interesting.
Comment 40 Andrew J. Kroll 2008-07-17 17:55:27 UTC
The log is quite curious... it doesn't reveal the root cause as far as I can see, and more-or-less exactly what I expected the output to look like. Perhaps it means more to you.

srb->result = 2
scsi_bufflen(srb)[8192] - scsi_get_resid(srb)[8192] = 0
srb->underflow = 0
sd 21:0:0:0: [sda] Sense Key : No Sense [current]
sd 21:0:0:0: [sda] Add. Sense: No additional sense information
usb 3-6: reset high speed USB device using ehci_hcd and address 30
usb 3-6: reset high speed USB device using ehci_hcd and address 30
srb->result = 2
scsi_bufflen(srb)[122880] - scsi_get_resid(srb)[122880] = 0
srb->underflow = 0
sd 21:0:0:0: [sda] Sense Key : No Sense [current]
sd 21:0:0:0: [sda] Add. Sense: No additional sense information
srb->result = 2
scsi_bufflen(srb)[122880] - scsi_get_resid(srb)[122710] = 170
srb->underflow = 0
sd 21:0:0:0: [sda] Sense Key : No Sense [current]
sd 21:0:0:0: [sda] Add. Sense: No additional sense information
srb->result = 2
scsi_bufflen(srb)[8192] - scsi_get_resid(srb)[8192] = 0
srb->underflow = 0
sd 21:0:0:0: [sda] Sense Key : No Sense [current]
sd 21:0:0:0: [sda] Add. Sense: No additional sense information
usb 3-6: reset high speed USB device using ehci_hcd and address 30
usb 3-6: reset high speed USB device using ehci_hcd and address 30

Comment 41 Andrew J. Kroll 2008-07-18 14:49:39 UTC
I'll post the patch later (hoe that you will accept it), I added 3 new quirk definitions, and an entry in to unusual_devs.h

Here's the nice sane result...
# cat /proc/scsi/usb-storage/22
   Host scsi22: usb-storage
       Vendor: JMicron
      Product: USB to ATA/ATAPI Bridge
Serial Number: 2C50C5193638
     Protocol: Transparent SCSI
    Transport: Bulk
       Quirks: SLOW_MDELAY_2 SLOW_MDELAY_4

Now it only affects the quirky device. I still want to write some sort of detection, but right now I do not have the time to do so.
Comment 42 Andrew J. Kroll 2008-07-18 14:53:54 UTC
also, just to be clear here... 

/dev/sdb:
 Timing buffered disk reads:   84 MB in  3.00 seconds =  27.96 MB/sec
root@shit:/usr/src/current# hdparm -t /dev/sda

/dev/sda:
 Timing buffered disk reads:   10 MB in  3.23 seconds =   3.10 MB/sec
root@shit:/usr/src/current#

It doesn't affect other devices like you thought it would....
Comment 43 Andrew J. Kroll 2008-07-18 15:17:45 UTC
Hitachi Micro drive...

/dev/sde:
 Timing buffered disk reads:   10 MB in  3.06 seconds =   3.27 MB/sec

Host: scsi31 Channel: 00 Id: 00 Lun: 00
  Vendor: HMS36040 Model: 4D5CF00          Rev: CA2A
  Type:   Direct-Access                    ANSI  SCSI revision: 02

256MB Dane-Electronics CF card Interestingly, ID's that hitachi made it):

/dev/sde:
 Timing buffered disk reads:    2 MB in  3.77 seconds = 543.75 kB/sec
Host: scsi32 Channel: 00 Id: 00 Lun: 00
  Vendor: Hitachi  Model: XXM2.3.0         Rev: 3.00
  Type:   Direct-Access                    ANSI  SCSI revision: 02

2GB Lexar "Premium 80x":
/dev/sde:
 Timing buffered disk reads:   18 MB in  3.10 seconds =   5.81 MB/sec
Host: scsi33 Channel: 00 Id: 00 Lun: 00
  Vendor: LEXAR AT Model: A FLASH CARD     Rev: 0911
  Type:   Direct-Access                    ANSI  SCSI revision: 02

A 3.5" Maxtor I had laying arround:
/dev/sde:
 Timing buffered disk reads:   18 MB in  3.03 seconds =   5.94 MB/sec
Host: scsi34 Channel: 00 Id: 00 Lun: 00
  Vendor: Maxtor 5 Model: 4098U8           Rev: 0CQ0
  Type:   Direct-Access                    ANSI  SCSI revision: 02

I don't have any SATA drives hany to test, but, considering how slow CF is any how, it doesn't seem to matter much... Personally, that's my primary use for it any way (CF on a laptop IDE converter board)

If the debug info way above provided you with a clue, I'm itching to know exactly what it is :-)
Comment 44 Alan Stern 2008-07-18 16:18:55 UTC
> srb->underflow = 0

That's the real problem.  srb->underflow is supposed to be equal to the total number of bytes to be transferred, for normal filesystem I/O.  In the case you posted that would be 8192, not 0.

In fact, I see where srb->underflow gets set to 0.  It happens in drivers/scsi/scsi_error.c:scsi_eh_prep_cmnd().  The corresponding routine scsi_eh_restore_cmnd() should restore the old value of underflow, but it doesn't.  Try adding that in there (you'll have to add an underflow field to struct scsi_eh_save), and then see how things work with the change recommended in comment #13 -- and no extra delays.
Comment 45 Andrew J. Kroll 2008-07-20 12:02:23 UTC
That would make sense, no pun intended.

Now that I know what those variables are for, I guess that it means that there is a bug anyway. I'll give it a fix in a few hours, and let you know what happens. I also noticed the iPod here doing the same thing more-or-less (has trouble on EHCI).
Comment 46 Andrew J. Kroll 2008-07-20 12:23:47 UTC
It would seem logical that I need to store the underflow too, would that be done in scsi_eh_prep_cmnd() or someplace else?
Comment 47 Alan Stern 2008-07-20 14:45:14 UTC
Yes, store the underflow in prep_cmnd and restore it in restore_cmnd.

If you want, I could write a combined patch to do it all.  But it's probably quicker for you to write it, if you understand what to do.
Comment 48 Andrew J. Kroll 2008-07-20 20:53:29 UTC
Modifications done.
Here's the results...

BEGIN TEST:
srb->result = 2 (SAM_STAT_GOOD = 0)
scsi_bufflen(srb)[122880] - scsi_get_resid(srb)[122880] = 0
srb->underflow = 122880
srb->result = 2 (SAM_STAT_GOOD = 0)
scsi_bufflen(srb)[8192] - scsi_get_resid(srb)[8036] = 156
srb->underflow = 8192
srb->result = 2 (SAM_STAT_GOOD = 0)
scsi_bufflen(srb)[122880] - scsi_get_resid(srb)[122880] = 0
srb->underflow = 122880
***HUGE PAUSE***
usb 3-1: reset high speed USB device using ehci_hcd and address 4
usb 3-1: reset high speed USB device using ehci_hcd and address 4
srb->result = 2 (SAM_STAT_GOOD = 0)
scsi_bufflen(srb)[8192] - scsi_get_resid(srb)[8192] = 0
srb->underflow = 8192
***HUGE PAUSE***
usb 3-1: reset high speed USB device using ehci_hcd and address 4
usb 3-1: reset high speed USB device using ehci_hcd and address 4

...repeat ad-nauseum :-)

Comment 49 Andrew J. Kroll 2008-07-20 20:57:52 UTC
This wouldn't be so bad if it wasn't for the huge pauses...
I do believe the modifications will atleast help prevent the corruption, and that's a plus. 

Now we need to nail down why the USB to IDE bridge hits those huge timeouts, needing a reset... The delay method does work...

Comment 50 Alan Stern 2008-07-21 07:38:22 UTC
Presumably those first few messages indicate read errors that were retried successfully, although it would be nice to make certain of that.

The huge pauses arise because the computer is waiting for the drive to send something, but the drive isn't responding.  After 30 seconds the computer gives up and resets the drive.  Your log doesn't include timestamps so I can't tell what relation there is between the pauses and the other messages.

The Windows log seemed to indicate that the pauses were avoided because the drive sent a STALL instead of not responding.  If it had done the same thing under Linux then the pauses would have been avoided in the same way, but apparently it didn't.
Comment 51 Andrew J. Kroll 2008-07-21 08:25:27 UTC
Right, but now what to do?
Comment 52 Alan Stern 2008-07-21 08:51:10 UTC
It's time to bring this problem to the attention of the community.  Post your test results and findings to the usb-storage mailing list or the linux-usb list.

At this stage I can't think of anything to do other than include those long delays (you should call msleep BTW, not mdelay).  Other people might have more ideas.
Comment 53 Andrew J. Kroll 2008-07-21 09:24:21 UTC
I'll draw up a patch and post it here for your consumption shortly, I assume you have GIT access to apply it to the repository, I don't. I'll also do some additional testing to see what the minimum delay needed is that will accept a retry, but not cause a reset. Furthermore, I'll take a peek into the ehci code, and see about slowing it down there. Yea, it may upset some, you can't please everyone all of the time...
 

Comment 54 Alan Stern 2008-07-21 09:36:46 UTC
I don't know which repository you mean, and I don't use GIT.
Comment 55 Andrew J. Kroll 2008-07-21 11:00:59 UTC
using msleep() seems to have a different effect... 
An msleep(1) seems to suffice now, v.s. the mdelay(6);
This is a good thing :-) speed++;
I Think I can leave it at that now.


   Host scsi8: usb-storage
       Vendor: JMicron
      Product: USB to ATA/ATAPI Bridge
Serial Number: 2C50C5193638
     Protocol: Transparent SCSI
    Transport: Bulk
       Quirks: SLOW_DELAY_1

the additional delays I'll leave in so they can be available for use in really demanding situations.

Comment 56 Andrew J. Kroll 2008-07-21 11:32:58 UTC
here's what the lexar cf does now, with the msleep(1)
dd if=/dev/sda of=/dev/null
4008816+0 records in
4008816+0 records out
2052513792 bytes (2.1 GB) copied, 333.474 s, 6.2 MB/s

That's pretty darn good IMHO.
Thanks for the suggestion to switch to msleep()!

Comment 57 Andrew J. Kroll 2008-07-21 13:08:27 UTC
Created attachment 16928 [details]
usb-storage fix

This patch contains:
drivers/scsi/scsi_error.c
    Adds underflow restore.

drivers/usb/storage/transport.c
    Fix corruption on no sense data.
    Adds 1 to 7ms delays for devices that need it.

drivers/usb/storage/unusual_devs.h
    Entry added for JMicron USB to ATA/ATAPI Bridge.

include/linux/usb_usual.h
    Added macros for 1 to 7ms delays

include/scsi/scsi_eh.h
    Added underflow variable to struct scsi_eh_save.
Comment 58 Alan Stern 2008-07-21 19:48:01 UTC
You're probably deluding yourself.  msleep() doesn't guarantee very good time resolution.  The results will vary according to your load and how many timer ticks per second the kernel uses.  It's perfectly possible for msleep(1) to delay for 6 ms on one system and 1 ms on another.

Also, 6 MB/s is a lousy transfer rate.  Maybe it's the best you can do with this particular device, though.  High-speed transfers in general can easily exceed 30 MB/s.

As you may have seen, I have already posted the SCSI parts of this fix.  The no-sense-data change to transport.c was out-of-date before you wrote it, because the code in that region has been changed (though not in any important way) in the development trees.

People might accept a new flag for a single 1-ms delay.  But I greatly doubt they will accept three flags for varying levels of delay.

In any case, attaching your patch to a Bugzilla report accomplishes nothing.  You have to post it to the appropriate mailing list so that people are aware of it.
Comment 59 David Brownell 2008-07-21 22:25:03 UTC
Agreed that 6 MB/s is lousy ... but I'll observe that 30 MB/s is hard to get, though I've certainly seen it in test rigs.  Disks generally get 20-25 MB/s in my limited observation.
Comment 60 Tomas Styblo 2008-07-22 03:08:33 UTC
Could you try this patch?

http://marc.info/?l=linux-kernel&m=121653150327283&w=2

The patch fixes what I think is a different, even worse problem caused by this chipset. It is possible that these two problems are related or have a common cause. You might also want to read the whole thread for more information about these troubles.
Comment 61 Andrew J. Kroll 2008-07-22 07:58:12 UTC
Alan:
I can bypass that process completly ;-)
I posted it so that others may benefit/test.

Alan/David:
That is the limit of the attached drive, not the interface. I get 30+MiB/sec easily on other units. Main point is that the speed is 6 times what it is on the CF reader, so it's a gain for me.

Tomas:
I will test the patch in a few hours, when I get the chance.
Comment 62 Andrew J. Kroll 2008-07-22 08:14:00 UTC
Tomas: I looked at your patch, problem is that I have seen this on multiple devices, not just the JMicron one, which leads me to believe that several companies could be using the same IP core in their base implementation. It would be better if we could detect such devices on-the-fly, and set a flag instead, until the real root cause is found. Note that M$ products DON'T have this issue, so it's more to the point of locating the root cause first, instead of making up for the shortcoming in other ways. This means hard research, and emails. I do plan on writing JMicron very soon and to ask them what the probable cause is. They do seem to be Linux friendly.
Comment 63 Andrew J. Kroll 2008-07-23 02:39:09 UTC
I've been thinking (and will test a theory) that the sleep may actually be allowing some other important thread to run... I know it sounds unlikly, but, I've seen missing locks and races happen before. Perhaps on my faster USB hard-disks, this doesn't matter... I'll try for example, a usleep() and see if that works too. Silly, perhaps, but... It's a thought, and I'll simply try it. Might be actually funny if it does find something like the odd race... :-)
Comment 64 Tomas Styblo 2008-07-23 16:55:11 UTC
> problem is that I have seen this on multiple
> devices, not just the JMicron one,

Yes. I have also been experiencing the problem _you_ describe with various devices with different chipsets, not just with JMicron. In my experience, USB storage with external disk enclosures has never been completely reliable neither on Linux nor on windows.

But this particular data corruption that I hopefully fixed with my patch is specific to JMicron and happens also on Windows, according to reports from various Windows users.
Comment 65 Andrew J. Kroll 2008-07-26 15:26:46 UTC
That's particularly interesting. Is JMicron aware of the issue?

I've been beating it to death now with the msleep(1); for several days, transferring over 1TiB on it so far, and there hasn't been one peep of any issues. 
Tests even included building a few kernels. The speed seems to top out at about 7MB/s...

Soon as I am totally satisfied with this, I'm going to try a usleep(); and see if that affects it the same way (The schedule guess). It could be just a disagreeable hardware combination, and I don't doubt that. I will also try your patch as well, and see what happens. If it avoids the whole huge reset, that would be wonderful. If it does not, then the losses are larger over time than a small ms of delay time, and that fix becomes a bigger loss in the long run.

The point I am aiming for is that since certain combinations of hardware can produce the effect, why can't we just trap this seemingly similar problem, and handle the error the same way no matter what device is connected? Certainly there can be no real harm, since there is no sense data, and it's a simple repeat of the last transaction. Are there any known devices that have a no sense condition meaning everything is OK? I would believe that would violate the SCSI spec in most unholy ways, and I doubt that's done in any implementation.

Comment 66 Alan Stern 2008-07-26 15:57:49 UTC
You don't seem to understand the real issues.  Your device has a bug, which manifests as two separate problems.  You can work around the bug by adding the msleep call; when you do this neither problem shows up.  That's fine, but it would be better to solve the problems directly instead of using an awkward work-around.

The first problem is that occasional READs stop short with no sense data.  The changes I have suggested will detect this condition and cause the READs to be retried.  Thus this will no longer be a problem.

The second problem is that occasional WRITEs stop short and the device's firmware crashes during the subsequent reset.  There doesn't appear to be any way to detect this error before it happens or to recover from it afterward.

I have seen reports of devices which suffered just the first problem or just the second problem, so they don't necessarily have to go together.
Comment 67 Andrew J. Kroll 2008-07-26 17:43:58 UTC
Actually, I do understand the issues with the buggy firmware. I've dealt with such crackhead engineered devices in the past. What's so bad about code that nukes off BOTH? It's certainly less to maintain...

The only other option I can think of is to detect such silliness in the ehci driver, but apparently everyone would piss and moan about that... Or am I wrong?

I think most people would simply be happy with something that works, after you get reliability with data, then you should worry about speed. That's at least how I feel about the whole unfortunate problem. Nobody likes it when their storage gets pushed through a meat grinder. 

I've been following the thread in the list too. I just can't really be bothered to subscribe to them, since there is always to much noise, and not enough signal, and I haven't time to wade through piles of emails that I don't care about. ;-)

So with that out of the way, I'll continue to look for a better fix, if I can locate one. If not, then this could be the only way to mitigate the issues with the hardware combination.

Do you also have this device to test with? It's pretty cheap, and you can probabbly pick one up for about $8USD, perhaps less... There is a huge advantage to having the actual device at your disposal when debugging.

Comment 68 Alan Stern 2008-07-26 19:57:01 UTC
I could get one of the devices, if I knew exactly what to buy.  You can't tell from the packaging what sort of bridge chip is used on the inside.  Can you specify a brand name and model?
Comment 69 Andrew J. Kroll 2008-07-28 23:52:49 UTC
Bit over priced, but this is exactly the one I have here.
http://www.showtimepc.com/showtimepc/details.asp?item=ADA-2020

literally, google for "ADA-2020" that is what it is called.
Comment 70 Alan Stern 2008-07-29 07:28:00 UTC
Okay, I ordered it.  Should be here by next week.
Comment 71 Andrew J. Kroll 2008-07-29 23:19:20 UTC
The plot thickens...

DOESN'T happen on SATA connected disk, ONLY PATA.
So you need to do the testing on anything PATA.
Also, the performance over SATA is actually worse than PATA, that's very strange!
Comment 72 Andrew J. Kroll 2008-07-30 00:01:45 UTC
Actually, the speed seems to be on-par with the WD's.... I forgot that it's doing a read/modify/write cycle, that would explain the slowness... still strange that the bug doesn't manifest on the SATA side of the bridge though... Can anyone else confirm this?
Comment 73 Robert Hancock 2008-07-31 01:39:11 UTC
Created attachment 17036 [details]
proposed patch for error detection fixes

Can those experiencing the problem test this patch to see if it prevents the corruption? This doesn't include any of the added delays (so the device lockup problem might still be there), but hopefully it should fix the error detection to avoid data corruption.
Comment 74 Alan Stern 2008-07-31 08:00:40 UTC
Your two sections of code are confusingly redundant.  The second section can never execute, because the first section will already have run.  That is, after the first section it is almost impossible to have both srb->result == SAM_STAT_CHECK_CONDITION and srb->sense_buffer[2] == 0.

Why do you think both sets of tests are needed?
Comment 75 Robert Hancock 2008-07-31 12:00:24 UTC
The first set of code is intended to trap a transport error that doesn't provide any valid sense data. The second set of code is intended to trap an underflow where we don't have valid sense data. There doesn't seem to be any guarantee that these two conditions will occur at the same time.

In particular, there doesn't seem to be any guarantee that the device can't say "Everything worked, oh by the way I didn't give you all the data you asked for". That should be treated as an error condition. If the device does not indicate a transport failure, then the first code will not execute.
Comment 76 Alan Stern 2008-07-31 12:31:15 UTC
A transport failure (not "error") with no sense data should probably be treated as a transport success.  It means the device essentially said "Something's wrong; oh no, wait, everything is okay."  What are we to believe when that happens?

An underflow is an error, and it should be reported as an error unless there's a real transport failure which takes precedence -- i.e., a failure with real sense data.  That is what my patch tests.

As for cases where the device says "Everything worked, oh by the way I didn't give you all the data you asked for," they aren't errors at all (unless there's an underflow -- note that underflows are not the same as short responses).  They are perfectly legal and they happen quite often.
Comment 77 Robert Hancock 2008-07-31 19:18:37 UTC
If we get a transport error from the device, that seems like a pretty good indication that a failure occurred. The fact that the device didn't provide valid sense data doesn't change this. The fact that this transport error is ignored is part of the reason why this corruption is happening.

In this case the device is both indicating an underflow and a transport error, so either one of the changes would suffice. But we should be properly handling both potential cases individually.

It is a bit odd that in the Windows log we see a stall happening, whereas in the case of a failed write here, we saw things just sort of wedge up. I suppose it could be a timing difference..
Comment 78 Andrew J. Kroll 2008-08-01 06:20:29 UTC
Stranger yet is that the bug does not manifest it'sself when you connect a hard disk that is SATA... So known craptastic issues with firmware aside, perhaps we could do a diagnostic some how to measure the request latency, and enable it, or, perhaps the device supplies information in another way that we can exploit to know if SATA or PATA... either way, it's a pain in the arse. I just moved several hundred GiB over it on SATA without patches, and it worked without a hitch. The speed was fairly good for a USB device too. I'm thinking there is something like a n overflow in stack inside the device, causing a crash. It won't happen on SATA because that would logically be a different code path, and a different timing. I still have to see if the problem is worse on a slower device than on a faster one (that test suite will be done today) and if it shows a relation there, then we'll know for sure, but not exactly how to maneuver around it.
Comment 79 Alan Stern 2008-08-01 08:33:21 UTC
If we get a transport failure (not "error"!) from the device without valid sense data, who knows what happened?  Maybe a failure occurred, maybe it didn't.  usb-storage leaves that determination up to the higher SCSI layers.  You might want to ask on the SCSI mailing list if that is the right thing to do.

As for the STALL in the Windows log -- it sure would be nice to know the reason for the difference from the Linux log.  Is it because the device didn't send a STALL to the Linux system?  Why not?  Or is it because it did send a STALL but the STALL wasn't recognized?

For these sorts of questions, the only way to get a definitive answer is to use a bus analyzer.
Comment 80 Andrew J. Kroll 2008-08-11 18:02:16 UTC
Any new news on this? ... Do you have access to one of those expensive analyzers?
Comment 81 Alan Stern 2008-08-12 06:56:25 UTC
Oh yes...  I did get the JMicron unit.  On my computer it worked perfectly.

I do not have access to any expensive analyzers, although I do have access to a relatively inexpensive full-speed-only analyzer.
Comment 82 rhubbell 2008-09-30 13:28:26 UTC
Last comment says "...it worked perfectly."  Can you define "it"?
I have issues with this and a PATA drive with an ext3 fs on the drive.

I have two of these enclosures and in one there's a SATA drive
and that worked fine. This one has a PATA drive and won't mount.


usb 1-1: usbfs: USBDEVFS_CONTROL failed cmd lsusb rqt 128 rq 6 len 4 ret -110

Running lsusb -v I get this:

can't get debug descriptor: Connection timed out


I also get the lsusb output:

idVendor  0x152d JMicron Technology Corp. / JMicron USA Technology Corp.
idProduct 0x2338 JM20337 Hi-Speed USB to SATA & PATA Combo Bridge

kernel: 2.6.25.10




more of the lsusb output:

 Configuration Descriptor:
    bLength                 9
    bDescriptorType         2
    wTotalLength           32
    bNumInterfaces          1
    bConfigurationValue     1
    iConfiguration          4 USB Mass Storage
    bmAttributes         0xc0
      Self Powered
    MaxPower                2mA
    Interface Descriptor:
      bLength                 9
      bDescriptorType         4
      bInterfaceNumber        0
      bAlternateSetting       0
      bNumEndpoints           2
      bInterfaceClass         8 Mass Storage
      bInterfaceSubClass      6 SCSI
      bInterfaceProtocol     80 Bulk (Zip)
      iInterface              6 MSC Bulk-Only Transfer
      Endpoint Descriptor:
        bLength                 7
        bDescriptorType         5
        bEndpointAddress     0x81  EP 1 IN
        bmAttributes            2
          Transfer Type            Bulk
          Synch Type               None
          Usage Type               Data
        wMaxPacketSize     0x0200  1x 512 bytes
        bInterval               0
      Endpoint Descriptor:
        bLength                 7
        bDescriptorType         5
        bEndpointAddress     0x02  EP 2 OUT
        bmAttributes            2
          Transfer Type            Bulk
          Synch Type               None
          Usage Type               Data
        wMaxPacketSize     0x0200  1x 512 bytes
        bInterval               0
Comment 83 Andrew J. Kroll 2008-09-30 23:09:58 UTC
Same here, both PATA ports seem fubared, SATA 100% working; it does seem to depend on drive latency. 

I had better successes with UDMA 100 or better drives, although, not perfect... 
When using something slower, it explodes. 

Still, I do run with the patch that fixes the SCSI save/restore variables, and it seems to help a little. Writes, still unfortunately corrupt from time to time, but the device isn't crashing, so that's a bonus. Atleast we know we aren't crashing the device firmware any longer.

As usual doze had no problems no matter what was hooked to it on the same hardware, so go figure
Comment 84 Alan Stern 2008-10-01 12:01:03 UTC
I just did another test.  I hooked up a Seagate Barracuda 500-GB PATA/100 drive to the JMicron cable.  (The power cable is very annoying; it has a pushbutton switch with no way to tell whether the power is on or off.)

I created a single partition, formatted it using mkfs.ext3, mounted it, wrote a file containing 8 GB of 0s (using dd), and unmounted it.  There were no errors at all, no problems of any sort.

Here's how lsusb describes the cable:

Bus 001 Device 002: ID 152d:2338 JMicron Technology Corp. / JMicron USA Technolo
gy Corp. JM20337 Hi-Speed USB to SATA & PATA Combo Bridge

The kernel was 2.6.27-rc8 together with http://www.kernel.org/pub/linux/kernel/people/gregkh/gregkh-2.6/gregkh-all-2.6.27-rc8.patch  

Is there anything else you'd like to know?
Comment 85 Alexandre Oliva 2008-10-05 18:08:28 UTC
Do you guys succeed in getting the disk serial number *while* doing I/O on the disk?  Say, the `usn' scsi_io request (easy to issue with
http://samba.org/~tridge/ctdb/utils/scsi_io/scsi_io.c) causes I/O errors on my JMicron-identified enclosure.  More details at https://bugzilla.redhat.com/show_bug.cgi?id=465539
Comment 86 Alan Stern 2008-10-06 07:25:52 UTC
No, because none of us have tried it.

It's not surprising that an obscure SCSI command would cause a USB-IDE bridge to fail.  A lot of those chips aren't terribly high quality; sending them commands they don't understand is asking for trouble.

Some chips even have trouble with commands they _do_ understand, if the amount of data asked for isn't exactly right.  In your original example there is an INQUIRY command asking for 254 bytes.  Plenty of USB-IDE bridges crash if INQUIRY asks for anything other than 18 bytes.

If you like, you can post a usbmon trace (see Documentation/usb/usbmon.txt for instructions) showing exactly what happens when the offending command is sent.
Comment 87 Alexandre Oliva 2008-10-06 11:30:12 UTC
Here's the typical usn successful interaction (no intervening I/O to fail):
$addr1 $T S Bo:$B:$D:2 -115 31 = 55534243 $REQCNT? ff000000 80000612 018000ff 00000000 00000000 000000
$addr1 $T C Bo:$B:$D:2 0 31 >
$addr2 $T S Bi:$B:$D:1 -115 255 <
$addr2 $T C Bi:$B:$D:1 -121 16 = 0080000c 31353244 32303332 39303030
$addr1 $T S Bi:$B:$D:1 -115 13 <
$addr1 $T C Bi:$B:$D:1 -32 0
$addr1 $T S Co:$B:$D:0 s 02 01 0000 0081 0000 0
$addr1 $T C Co:$B:$D:0 0 0
$addr1 $T S Bi:$B:$D:1 -115 13 <
$addr1 $T C Bi:$B:$D:1 0 13 = 55534253 $REQCNT? ef000000 00

here's an interaction with an intervening I/O request that happens to fail:

$addr1 $T S Bo:$B:$D:2 -115 31 = 55534243 $REQCNT? 00100000 00000a2a 000950e4 46000008 00000000 000000
$addr1 $T C Bo:$B:$D:2 0 31 >
$addr2 $T S Bo:$B:$D:2 -115 4096 = 6269746d 04000000 145795d1 60300c4b 01c6bf61 928e6785 1f0d0100 00000000
$addr2 $T C Bo:$B:$D:2 0 4096 >
$addr1 $T S Bi:$B:$D:1 -115 13 <
$addr1 $T C Bi:$B:$D:1 0 13 = 55534253 $REQCNT? ef000000 00
$addr1 $T S Bo:$B:$D:2 -115 31 = 55534243 $REQCNT? ff000000 80000612 018000ff 00000000 00000000 000000
$addr1 $T C Bo:$B:$D:2 0 31 >
$addr2 $T S Bi:$B:$D:1 -115 255 <
$addr2 $T C Bi:$B:$D:1 -121 16 = 0080000c 31353244 32303332 39303030
$addr1 $T S Bi:$B:$D:1 -115 13 <
$addr1 $T C Bi:$B:$D:1 -32 0
$addr1 $T S Co:$B:$D:0 s 02 01 0000 0081 0000 0
$addr1 $T C Co:$B:$D:0 0 0
$addr1 $T S Bi:$B:$D:1 -115 13 <
$addr1 $T C Bi:$B:$D:1 0 13 = 55534253 $REQCNT? ef000000 00

Now, what baffles me is that the I/O request appears to complete *before* the USN request is even issued to the USB bus.  Why would that one fail?

Or perhaps it wasn't this one request that failed, but some subsequent request, that didn't even make to usbmon.  The reason I think this is possible is that I see in the SCSI layer that it does retry such faulty I/Os, but there's no record of their being retried in the usbmon log.

I wonder if usb-storage debug dumps would help.  I don't recall seeing anything useful in them yesterday, but I guess I could give it another try.  Maybe there's something getting lost within Linux's USB layer, rather than as a device's fault.
Comment 88 Alan Stern 2008-10-06 12:04:51 UTC
I'm confused.  Why do you refer to the additional WRITE(10) command as "intervening"?  Clearly it comes before the USN.  And why are you surprised that it completes before the USB request is issued?  After all, USB mass-storage devices can't handle more than one command at a time -- the WRITE could hardly complete _after_ the USN was issued.

Nothing is getting lost.  usb-storage debug dumps would reveal just about the same information as the usbmon log, although in easier-to-read form.  Go ahead and try it; you'll see.

The WRITE(10) failed because the residue value in the CSW is totally wrong.  (In fact, it is the same as the residue value for the USN commands.)  Don't ask me why; I haven't seen any comparable problems with bad residues for writes.  Maybe it was "remembered" somehow.  You could try adding an unusual_devs entry for the JMicron with the IGNORE_RESIDUE flag.
Comment 89 Alexandre Oliva 2008-10-06 23:10:47 UTC
The dumped one does indeed come before the USN command.  It's intervening in that I was running USN repeatedly until I got an I/O error causing a disk failure.  One would think that a successful command shouldn't have caused an I/O error to be reported by the SCSI subsystem, but it's either this one displayed command that somehow got misinterpreted as an error, or some other command that didn't get logged at all.  And then, retries performed by sd before failing the raid member should have somehow made to usb-storage or usbmon logs, but they don't.

In fact, I tried hacking usb-storage so as to decrement retries when experiencing an underflow.  Now, I couldn't see usbmon's logs in that case (the system seemed to be frozen other than by that indefinite disk activity), but the disk led showing indefinite activity.  And since these were failed requests and retries identical to the ones I'd expect in the log above, it looks like requests are indeed being issued, but usbmon is somehow missing them, unless it somehow discards retries, but why should it?
Comment 90 Alan Stern 2008-10-07 07:56:56 UTC
"One would think that a successful command shouldn't have caused an
I/O error to be reported by the SCSI subsystem..."  But the command _wasn't_ successful!  The residue value was 239 (0xef), meaning that only 3857 out of 4096 bytes were transferred correctly.

As for retries making it to the log...  All I can say is that every transfer going across the USB bus does make it to the log.  In particular, usbmon does _not_ discard retries or anything else.  If you don't see them then they didn't use USB.  Maybe they are trying to access other members of the RAID group.

The disk LED showing indefinite activity may be due to firmware failure rather than software errors.  We already know the firmware is not working right because of the incorrect residue value it returns.  You say the activity was failed requests and retries identical to the ones already in the log -- but how do you _know_ that, if the activity doesn't appear in the log?
Comment 91 Alexandre Oliva 2008-10-07 13:24:06 UTC
Uhh...  I didn't realize there was a residue.  I wasn't familiar with usbmon at all.  So, ok, we do have an error, for a I/O that occurs shortly after a USN or similar request.  (although the log above shows right before, there was another similar request just before it)

Other members of the RAID group aren't in the same enclosure, so they can't cause the led to blink.  Only access to that particular disk does.  I know sd issues retries because I see it in the scsi debug logs.  And it only turned into infinite retries (leaving the disk's led blinking indefinitely) when I hacked usb-storage to decrement retries on underflow.  So the retries *are* issued, they make the disk, but I don't see them in the usbmon logs.

I'll keep investigating it, as time permits, and report back.  Thanks for your insights so far.
Comment 92 Alexandre Oliva 2008-10-07 19:12:03 UTC
Created attachment 18206 [details]
scsi, usb-storage and usbmon logs for I/O failure during scsi-usn loop

Here are some logs I collected while booted into a kernel that experimentally changed the underflow return status from usb-storage from SUGGEST_RETRY to SUGGEST_SENSE when retries > 2.

The scsi logs were extracted from /var/log/messages.  All scsi_logging_level bits were set.

The usb-storage logs were extracted with a loop that ran dmesg -c and slept for 100ms (apparently this wasn't enough to get the dmesg buffer full).

The usbmon logs were taken with cat from 1u, on which both external enclosures were connected.

sdb is 1:003 (the one that fails), sdc is 1:002, sda is a PATA internal disk.  The system is in single-user mode, running scsi-io --command usn --device /dev/sdb in a loop while logging to files in /tmp.  /tmp is on a filesystem that's in RAID 1 across all 3 disks.  The only I/O is supposed to be that caused by the logging.

Can you make any sense of it?

Now, on to try the IGNORE_RESIDUE suggestion.
Comment 93 Alexandre Oliva 2008-10-07 23:10:28 UTC
Created attachment 18207 [details]
Patch that fixes errors after SCSI IOCTLs on JMicron USB enclosure

IGNORE_RESIDUE works!  Thanks for the tip!  I'll post the patch to the relevant lists momentarily.
Comment 94 Andrew J. Kroll 2008-10-08 06:15:52 UTC
Does it also prevent corrupted writes? There is a bug in scsi_EH save/restore, extract that from the patches just to be safe...
There are also quite a few other devices that suffer from the same issue... Might be time to create a list someplace.

Comment 95 Andrew J. Kroll 2008-10-08 06:26:40 UTC
(In reply to comment #84)
> I just did another test.  I hooked up a Seagate Barracuda 500-GB PATA/100 drive
> to the JMicron cable.  (The power cable is very annoying; it has a pushbutton
> switch with no way to tell whether the power is on or off.)

Try it with something older on the PATA. Mostly, you will see it happen with something slow more often, which is possibly why my slowdown patches worked.
I'll have to try the ignore residue patch next week or so, and let everyone know the results.

Comment 96 Robert Hancock 2008-10-08 07:39:30 UTC
Be careful with the ignore residue.. the device might really be trying to indicate an error, and if it's returning data past the residue marker, it's possible it's not valid..
Comment 97 Alan Stern 2008-10-08 08:37:55 UTC
Re comment #92: You have mixed up sda and sdc: sdc is the PATA drive and sda is the other USB drive.  The logs make sense; they show 6 failed WRITE attemps on sdb (tagged 0x5d88 through 0x5d8d), after which sdb was removed from the RAID.

Re comment #95: I don't have any older drives to try.  The ignore-residue patch won't make any difference to the problem reported in comment #10.

Re comment #96: In this case I doubt the device really is trying to indicate an error.  The coincidence of the two residue values makes that highly unlikely.  It's much more probable that a firmware bug causes the residue not to be set back to 0 after the USN command.
Comment 98 Alexandre Oliva 2008-10-08 11:08:34 UTC
Uhh, sorry about the mix-up.  The drives keep changing sd hosts at every boot.

Andrew, I've never got a corrupted write AFAIK, but I've been running 2.6.27ish code, which contains the patch anyway.  Thanks for the reminder, I was about to build another 2.6.26.5 kernel with the final patch, for my "stable" instal, and I'd have left it out otherwise.
Comment 99 Andrew J. Kroll 2008-10-23 04:42:08 UTC
Alexandre: No problem :-)

Now for some news...

I did a new suite of testing after an order for some new toys to test with came in, and here is what happened.

Jmicron -> PATA(either connector) -> 4G Microdrive .. errors out, doesn't recover and dies a most horrible death

Jmicron -> SATA -> SATA2PATA bridge -> 4G Microdrive .... *no errors*

SATA obviously has a different microcode path in the adapter, and it's success isn't a surprise. I actually expected it to work OK. That's great news because it means I can now use the microdrives at near full speeds without much worry, even if it has the extra bridge as a "hardware penalty".

Basically what I am seeing here is that any drive not capable of at least UDMA100 getting screwed up. M$ OS somehow does the right thing... perhaps it is measuring the latency and doing some sort of profiled throttle?? Something else??? Note too that old USB 1.x speeds more closely match the older PATA speeds (1M/sec) thus will be less likely to trip up the bug. :-)
Comment 100 Alan Stern 2008-10-23 07:55:53 UTC
The old USB 1.x speed is indeed around 1 MB/s, but older PATA drivers are a lot faster than that!  Even a 40-pin IDE cable can handle up to 16 MB/s.
Comment 101 Andrew J. Kroll 2008-10-24 04:06:19 UTC
I agree ... "can handle" ... yes, however the original speeds were 1MB/sec on ISA :-) ...and that's the point. ;-) something is goofed in the firmware, that's obvious, however using the SATA side with another bridge solves the issue.
Comment 102 Alan Stern 2008-10-24 06:45:47 UTC
So if you like, you can close this bug report with a WILL NOT FIX code.
Comment 103 Andrew J. Kroll 2008-10-24 07:55:43 UTC
Would be better to see if there is a way to check which link is connected, or it's speed and dump out a warning. hdparm doesn't seem to be able to grab the information, but sdparm shows the following hint:

Supported VPD pages VPD page:
  0x1f

This seems to be a non-standard VPD page, but it also may contain the information needed.

here's a verbose output, still need to be able to see the 36 bytes yet...
open /dev/sdf with flags=0x800
    inquiry cdb: 12 00 00 00 24 00
      duration=4 ms
    /dev/sdf: FUJITSU   MPB3052ATU
  PQual=0  Device_type=0x0  RMB=0  version=0x02  [SCSI-2]
  [AERC=0]  [TrmTsk=0]  NormACA=0  HiSUP=0  Resp_data_format=1
  SCCS=0  ACC=0  TGPS=0  3PC=0  Protect=0  BQue=0
  EncServ=0  MultiP=0  MChngr=0  [ACKREQQ=0]  Addr16=0
  [RelAdr=0]  WBus16=0  Sync=0  Linked=0  [TranDis=0]  CmdQue=0
    inquiry cdb: 12 01 83 00 fc 00
      duration=1 ms
    inquiry: requested 252 bytes but got 36 bytes
malformed VPD response, VPD pages probably not supported


Comment 104 Andrew J. Kroll 2008-10-24 08:10:29 UTC
And those magical 36 bytes are....

f6bf1300 0.175028 C Bi:1:014:1 -121 36 =
00000201 1f000000 46554a49 54535520 4d504233 30353241 54552020 20202020
Comment 105 Alan Stern 2008-10-24 09:30:07 UTC
I don't follow your point.  What exactly are you suggesting?
Comment 106 Andrew J. Kroll 2008-10-25 01:55:34 UTC
Well, decoded those bytes are simply the string FUJITSU   MPB3052ATU, no idea on the other bytes there. Apears to look like stale data to me. What I am suggesting, however (Linux beating on the USB harder or not) is that since XP can work the device without problems, I don't see why Linux can't be taught a new trick or two. At the least, the data corruption problem is now gone now that the scsi error handler is usable, and that is a step in the right direction. I will look for better tools as far as USB sniffers for XP that are free, and try to get better information. I honestly think it is more of a lack of solid information, and we both can agree that since nobody we know currently has a hardware sniffer, that we are kind-of stuck at this juncture... and that using another bridge.... while solving the problem in a round-about way, is sub-optimal, but workable for me, and for this particular bridge. However it does not help to solve other devices that exhibit the same kind of reactions as observed with this device. I'll continue research as time allows, but leave the bug "closed" 'till I can get some sort of solid fix that everyone can agree upon.

Comment 107 Alan Stern 2008-10-25 08:02:09 UTC
I have to agree that it would be great if Linux could do whatever it is Windows does to make the device work.  If only we knew what that was...
Comment 108 Robert Hancock 2008-10-25 09:46:17 UTC
It seems likely that there's some problem in the chip that only shows up with the PATA interface, like some kind of buffer overflow perhaps. Linux tends to push USB storage devices faster and might be more likely to trigger this than Windows.

There may be some kind of delay that needs to be inserted somewhere, but that's the sort of thing that would likely need input from JMicron as far as exactly where it should go to ensure it fixes the problem. We have a such a thing in the kernel for Genesys chips already.
Comment 109 Andrew J. Kroll 2008-10-25 21:02:57 UTC
That would partially explain why my delay patches work, regardless of the accuracy of the delay.
Comment 110 Alan Stern 2008-12-10 08:22:48 UTC
Here's something interesting I just received:

http://bigacid.wordpress.com/2008/12/08/jm20337-read-data-corruption-solution/

Even if the blog post itself doesn't help, perhaps you can try writing to this Alex Lee person.
Comment 111 Andrew J. Kroll 2008-12-11 16:56:48 UTC
Did a visual inspection on the device I have, and that resistor is not present already...
Comment 112 Alan Stern 2008-12-12 06:48:10 UTC
Since JMicron's Alex Lee is reportedly such a helpful person, you should try asking him for help.
Comment 113 Andrew J. Kroll 2009-11-24 00:55:57 UTC
Fix found!

There seems to be two versions of this board, and one has better shielding traces than the other.
My solution is very simple.

1: Cover board with clear tape.
2: Glue aluminum foil to the inside of the cheap clear blue case.

I no longer have *ANY* problems with the device!

I stumbled upon the idea by looking at a different board revision that has a larger footprint, and noticed that it is larger because it has a lot of extra copper to ground. I also have had shielding issues with some other card readers, and did the same... just a simple layer of foil.

Anyone else with the same issue, give it a try. Heck, it won't even cost you one penny to perform the fix.
If it works for you, please comment, and let us know if it works for you.
Comment 114 Alan Stern 2009-11-24 15:36:56 UTC
What board are you referring to?  The original bug report concerned a USB to PATA/SATA bridge, and later the thread drifted off to cover the JMicron converter cable.

Note You need to log in before you can comment on or make changes to this bug.