Bug 16577

Summary: Firewire drive throughput slow, hangs, sbp2_scsi_aborts
Product: Drivers Reporter: bt
Component: IEEE1394Assignee: drivers_ieee1394
Status: RESOLVED OBSOLETE    
Severity: normal CC: alan, stefanr
Priority: P1    
Hardware: All   
OS: Linux   
Kernel Version: 2.6.34-020634 Subsystem:
Regression: No Bisected commit-id:
Attachments: dmesg output ("Device offlined" when I removed expresscard)
lspci -vnvn output
lsscsi output
sg_sat_identify --len 12 /dev/sdb output
contents of /proc/version
dmesg with ohci parameter debug=255

Description bt 2010-08-13 02:30:59 UTC
Created attachment 27424 [details]
dmesg output  ("Device offlined" when I removed expresscard)

Hardware:
  Lenovo x201
  LaCie fw800 Expresscard/34
  LaCie fw800 Rugged Drive

Software:
  linux-image-2.6.34-020634-generic 
    from http://kernel.ubuntu.com/~kernel-ppa/mainline/
  New firewire stack (modules firewire_sbp2, etc.)

Upon attaching the drive, it mounts fine immediately.  When copying files from internal drive to the fw drive, something like rsync --progress -a, it starts out at 80+ MB/s.  Very good.

It degrades in about 30s-1min to 20MB/s, then 1MB/s.  It freezes/restarts.  Sometimes it seems to freeze the whole drive system, because GUI elements start to hang.  (Or perhaps they're just poking at the mounted fw disk...)

Attaching another (mounted, unused) drive to the chain accelerates the slowdown and resets.

Eventually      sbp2_scsi_aborts     start to appear in dmesg.   Sometimes the disk freezes entirely -- yanking the expresscard is the only option at this point.

Ubuntu had an earlier bug (Launchpad #548513) where APM scripts were attempting to hdparm -i on firewire drives, causing hangs.  This script was fixed and this release of Ubuntu no longer pokes at firewire drives.

Attaching some logs, please advise what else to provide.
Comment 1 bt 2010-08-13 02:31:35 UTC
Created attachment 27425 [details]
lspci -vnvn output
Comment 2 bt 2010-08-13 02:32:10 UTC
Created attachment 27426 [details]
lsscsi output
Comment 3 bt 2010-08-13 02:32:49 UTC
Created attachment 27427 [details]
sg_sat_identify --len 12 /dev/sdb    output
Comment 4 bt 2010-08-13 02:33:40 UTC
Created attachment 27428 [details]
contents of /proc/version
Comment 5 Stefan Richter 2010-08-13 12:14:17 UTC
(Cc: linux1394-devel@lists.sourceforge.net)

> bugzilla-daemon@bugzilla.kernel.org wrote:
>> https://bugzilla.kernel.org/show_bug.cgi?id=16577
[...]
>> Hardware:
>>   Lenovo x201
>>   LaCie fw800 Expresscard/34
>>   LaCie fw800 Rugged Drive
[...]
>> Created an attachment (id=27425)
>>  --> (https://bugzilla.kernel.org/attachment.cgi?id=27425)
>> lspci -vnvn output

The part which is about the FireWire ExpressCard:
> 05:00.0 PCI bridge [0604]: Texas Instruments XIO2000(A)/XIO2200(A) PCI
> Express-to-PCI Bridge [104c:8231] (rev 03)
> 06:00.0 FireWire (IEEE 1394) [0c00]: Texas Instruments TSB82AA2 IEEE-1394b
> Link Layer Controller [104c:8025] (rev 01) (prog-if 10)

XIO2x00 and TSB82AA2 work quite well in practice, but many FireWire
cards with a TSB82AA2 link layer controller come with a TSB81BA3
physical layer at revision C or older.  These are seriously buggy when
used with 1394b alias FireWire 800 hardware.  They only work reasonably
with 1394a hardware or, more or less, with both 1394b and 1394a nodes
present at the same time.

However, I sure hope that a card as new as the LaCie ExpressCard
contains a TSB81BA3 of revision D or E which do not feature most of the
1394b related bugs of the earlier revisions.  Furthermore, your dmesg
(attachment 27424 [details]) does not show the typical failure mode of a bus with
early revisions of TSB81BA3, namely repeated bus resets.  I'll comment
on the dmesg next.

Likewise, the LaCie rugged drive may or may not contain one of those
older and seriously buggy TSB81BA3.  But again, the log does not show
the typical failure mode of these.
Comment 6 Stefan Richter 2010-08-13 13:04:16 UTC
(Cc: linux1394-devel@lists.sourceforge.net)

> bugzilla-daemon@bugzilla.kernel.org wrote:
>> Created an attachment (id=27424)
>>  --> (https://bugzilla.kernel.org/attachment.cgi?id=27424)
>> dmesg output  ("Device offlined" when I removed expresscard)

Quoting the dmesg from initialization of the firewire drivers till
removal of the card:

> [ 6478.536741] firewire_ohci 0000:06:00.0: enabling device (0000 -> 0002)
> [ 6478.536757] firewire_ohci 0000:06:00.0: PCI INT A -> GSI 19 (level, low)
> -> IRQ 19
> [ 6478.636077] firewire_ohci: Added fw-ohci device 0000:06:00.0, OHCI v1.10,
> 4 IR + 8 IT contexts, quirks 0x2
> [ 6479.128337] firewire_core: created device fw0: GUID 7f82000000400000, S800
> [ 6479.157967] scsi27 : SBP-2 IEEE-1394
> [ 6479.158077] firewire_core: created device fw1: GUID 00d04b881a053e71, S800
> [ 6479.158086] firewire_core: phy config: card 2, new root=ffc0, gap_count=5
> [ 6479.928094] firewire_sbp2: fw1.0: logged in to LUN 0000 (0 retries)
> [ 6479.934630] scsi 27:0:0:0: Direct-Access     LaCie    Rugged FW/USB       
>  PQ: 0 ANSI: 4
> [ 6479.934923] sd 27:0:0:0: Attached scsi generic sg1 type 0
> [ 6479.939850] sd 27:0:0:0: [sdb] 976773168 512-byte logical blocks: (500
> GB/465 GiB)
> [ 6479.943081] sd 27:0:0:0: [sdb] Write Protect is off
> [ 6479.943086] sd 27:0:0:0: [sdb] Mode Sense: 10 00 00 00
> [ 6479.945138] sd 27:0:0:0: [sdb] Cache data unavailable
> [ 6479.945142] sd 27:0:0:0: [sdb] Assuming drive cache: write through
> [ 6479.954428] sd 27:0:0:0: [sdb] Cache data unavailable
> [ 6479.954433] sd 27:0:0:0: [sdb] Assuming drive cache: write through
> [ 6479.954440]  sdb: sdb1
> [ 6479.989692] sd 27:0:0:0: [sdb] Cache data unavailable
> [ 6479.989697] sd 27:0:0:0: [sdb] Assuming drive cache: write through
> [ 6479.989702] sd 27:0:0:0: [sdb] Attached SCSI disk
> [ 6480.532800] EXT4-fs (sdb1): mounted filesystem with ordered data mode

So far so good.

> [ 7710.529501] firewire_sbp2: fw1.0: sbp2_scsi_abort

The error handler of Linux' SCSI core detected a time-out of a SCSI
transaction.  I.e. the disk received the request but did not send status
before the (typically plentiful) time for completion that the SCSI core
expects ran out.  The SCSI core's error handler called the abort handler
of firewire-sbp2.  Since there is no further error message shortly after
the one above, the SCSI core (or a higher layer?) was able to recover
from this failed SCSI transaction.

BTW, for the duration of the SCSI transaction time-out, the process
which issued the request is (typically or always?) waiting in
uninterruptible state, e.g. seen as "D" state in the output of ps.

> [ 7758.502832] firewire_sbp2: fw1.0: sbp2_scsi_abort

Ditto.

I may be wrong but the period of 48 seconds between this and the former
sbp2_scsi_abort indicates that this is about a different SCSI
transaction, not a retry of the one that failed at 7710.529501.

> [ 8950.977907] firewire_sbp2: fw1.0: sbp2_scsi_abort
> [ 8950.977937] sd 27:0:0:0: Device offlined - not ready after error recovery
> [ 8950.978038] sd 27:0:0:0: [sdb] Unhandled error code
> [ 8950.978042] sd 27:0:0:0: [sdb] Result: hostbyte=DID_BUS_BUSY
> driverbyte=DRIVER_OK
> [ 8950.978048] sd 27:0:0:0: [sdb] CDB: Write(10): 2a 00 01 da ec 3f 00 04 00
> 00
> [ 8950.978061] end_request: I/O error, dev sdb, sector 31124543

Now we have a SCSI transaction time-out from which SCSI core or higher
layers were unable to recover.

> [ 8950.978067] quiet_error: 246 callbacks suppressed
> [ 8950.978072] Buffer I/O error on device sdb1, logical block 3890560
> [ 8950.978075] lost page write due to I/O error on sdb1
> [ 8950.978085] Buffer I/O error on device sdb1, logical block 3890561
> [ 8950.978088] lost page write due to I/O error on sdb1
> [ 8950.978092] Buffer I/O error on device sdb1, logical block 3890562
> [ 8950.978096] lost page write due to I/O error on sdb1
> [ 8950.978100] Buffer I/O error on device sdb1, logical block 3890563
> [ 8950.978103] lost page write due to I/O error on sdb1
> [ 8950.978108] Buffer I/O error on device sdb1, logical block 3890564
> [ 8950.978111] lost page write due to I/O error on sdb1
> [ 8950.978116] Buffer I/O error on device sdb1, logical block 3890565
> [ 8950.978119] lost page write due to I/O error on sdb1
> [ 8950.978124] Buffer I/O error on device sdb1, logical block 3890566
> [ 8950.978127] lost page write due to I/O error on sdb1
> [ 8950.978131] Buffer I/O error on device sdb1, logical block 3890567
> [ 8950.978135] lost page write due to I/O error on sdb1
> [ 8950.978139] Buffer I/O error on device sdb1, logical block 3890568
> [ 8950.978142] lost page write due to I/O error on sdb1
> [ 8950.978147] Buffer I/O error on device sdb1, logical block 3890569
> [ 8950.978150] lost page write due to I/O error on sdb1
> [ 8950.988975] sd 27:0:0:0: [sdb] Unhandled error code
> [ 8950.988980] sd 27:0:0:0: [sdb] Result: hostbyte=DID_BUS_BUSY
> driverbyte=DRIVER_OK
> [ 8950.988983] sd 27:0:0:0: [sdb] CDB: Write(10): 2a 00 01 da f0 3f 00 04 00
> 00
> [ 8950.988989] end_request: I/O error, dev sdb, sector 31125567

And here we have a SCSI transaction failure which differs from the ones
above in the way that "sbp2_scsi_abort" is not logged.  As far as I can
tell, this means that from now on already the small IEEE 1394
transaction fails which firewire-sbp2 uses to signal the availability of
a new SCSI request to the target.  (As I wrote, in the early two
failures the target seemingly accepted the SCSI request but did not
complete it in time.  Now firewire-sbp2 is apparently unable to get in
touch with the target at all.)

> [ 8950.989094] end_request: I/O error, dev sdb, sector 486801991
> [ 8950.989109] Aborting journal on device sdb1-8.
> [ 8950.989598] EXT4-fs error (device sdb1) in ext4_da_writepages: IO failure
> [ 8950.999021] JBD2: I/O error detected when updating journal superblock for
> sdb1-8.
> [ 8951.000143] EXT4-fs error (device sdb1): ext4_journal_start_sb: Detected
> aborted journal
> [ 8951.000148] EXT4-fs (sdb1): Remounting filesystem read-only
> [ 8951.000151] EXT4-fs (sdb1): ext4_da_writepages: jbd2_start: 2048 pages,
> ino 5636100; err -30
> [ 8951.000155] 
> [ 8951.059965] JBD2: Detected IO errors while flushing file data on sdb1-8
> [ 8951.059978] journal commit I/O error
> [ 8951.059980] journal commit I/O error
> [ 8951.386489] EXT4-fs error (device sdb1): ext4_find_entry: reading
> directory #2 offset 0
> [ 8951.386565] EXT4-fs error (device sdb1): ext4_find_entry: reading
> directory #2 offset 0
> [ 8951.386616] EXT4-fs error (device sdb1): ext4_find_entry: reading
> directory #2 offset 0
> [ 8951.418399] sd 27:0:0:0: [sdb] Stopping disk
> [ 8951.418478] sd 27:0:0:0: [sdb] START_STOP FAILED
> [ 8951.418482] sd 27:0:0:0: [sdb] Result: hostbyte=DID_BUS_BUSY
> driverbyte=DRIVER_OK
> [ 8951.418577] firewire_sbp2: released fw1.0, target 27:0:0

The last four lines occur when the target node was switched off (or
switched itself off) or was unplugged or the FireWire controller card
was ejected.

Since all these error messages are logged right after the abort at
8950.977907, I presume that the target went away ( = was switched off,
switched itself off, was unplugged, or the card was ejected) shortly
before this abort.

> [ 8951.480505] EXT4-fs error (device sdb1): ext4_put_super: Couldn't clean up
> the journal
> [ 8961.502993] firewire_ohci 0000:06:00.0: PCI INT A disabled
> [ 8961.503000] firewire_ohci: Removed fw-ohci device.

The last two lines are when the FireWire controller card was ejected.
This is 10 seconds after the previous error messages.  Either it took
the driver stack 10 seconds to release all the remaining references to
the card --- if so, then all messages from 8950.977907 (inclusive)
onwards are a result of the card being ejected --- or the target node
went away 10 seconds before the card was ejected.

My conclusion here is that the unresponsiveness of the disk (first
recoverable but resulting in considerable slow-down according to your
description, later unrecoverable or due to the card being ejected) is
either a hardware problem or a firmware problem or a combination of these.

LaCie's product page to the rugged disk, "support" tab, shows a LaCie
Update Tool from Jun 23 2008, for Mac OS X and MS Windows.  If you
happen to have access to a FireWire equipped PC with one of these OSs,
check whether that tool offers a newer firmware for this disk.

A driver bug cannot be ruled out but is unlikely.  The code in question
is very well tested and known to be robust (also in conjunction with
Texas Instruments controllers like yours and with FireWire 800--to--SATA
bridge chips as used in LaCie drives, which are to my knowledge bridges
from Oxford Semiconductor).
Comment 7 bt 2010-08-13 13:22:43 UTC
You are correct, at 8950 + timestamps I had yanked out the card because everything had hung.

I've verified that the disk firmware is up to date on previous occasions (within 6mos).  I will do so again in case anything has changed recently.

I've contacted LaCie to see if they can clarify what revision the phys layer chipset is.  Is it correct for one to guess that OS X has workarounds to deal with the potentially buggy LaCie drive firmware?   OS X is the obvious usual target market for this gear.

I was unable to find a "Debugging firewire issues" HOWTO.  Is there some knob I can tweak, or some place I can RTFM, to get more debugging output from the kernel that may be useful?

Thanks for your diagnosing and comments thus far.
Comment 8 bt 2010-08-15 03:41:44 UTC
Created attachment 27447 [details]
dmesg with ohci parameter debug=255

While copying files to firewire drives with ohci debug=255, many pending/canceled transactions.  Are these important?
Comment 9 bt 2010-08-15 03:45:33 UTC
To try separating the drives from the card, I setup an IP over firewire network between a Mac OS X laptop and my LaCie card w/ firewire_net.  The network activity of copying files etc stuttered/hung, waited, restarted, etc similarly to the way disk activity does.
Comment 10 Stefan Richter 2010-08-21 16:18:28 UTC
(Cc: linux1394-devel@lists.sourceforge.net)

bugzilla-daemon@bugzilla.kernel.org wrote:
> https://bugzilla.kernel.org/show_bug.cgi?id=16577
> 
> --- Comment #7 from bt@lawngnome.org  2010-08-13 13:22:43 ---
> You are correct, at 8950 + timestamps I had yanked out the card because
> everything had hung.
> 
> I've verified that the disk firmware is up to date on previous occasions
> (within 6mos).  I will do so again in case anything has changed recently.
> 
> I've contacted LaCie to see if they can clarify what revision the phys layer
> chipset is.  Is it correct for one to guess that OS X has workarounds to deal
> with the potentially buggy LaCie drive firmware?   OS X is the obvious usual
> target market for this gear.

There have been occasional complaints from OS X users over trouble with LaCie
drives.  But maybe the disk is perfectly fine and it is rather an issue with
the TSB82AA2 card.

I have the following TSB82AA2 + TSB81BA3 based cards myself:
  - An AVLAB PCI card.  Worked mostly OK for a while, except with a 1394b 2.5"
disk enclosure from AVLAB themselves (go figure), eventually became unusable
with any other 1394b disks.
  - A Sunix PCIe card.  Worked quite well for quite a long time.  I had to
force the bus into 1394a arbitration mode though (by plugging in an additional
1394a node) whenever I used it together with a TSB81BA3 based 8-port 1394b
repeater.  I blamed it on the repeater.  Not long ago the card became unusable
for any of the 1394b devices that I have here.  I could continue to use it
with 1394a devices.  Recently, those stopped to work properly on this card
too, so I replaced that card now too.  I discovered that the mentioned 8-port
hub works without the 1394a trick on an Agere 1394b card.
  - An IOI CardBus card, FWB-CB01; version of the PHY within unknown but IOI's
web site says TSB81BA3.  Worked a while with 1394b devices, but stopped
working for 1394b not too long after purchase and only light usage.
  - A LaCie CardBus card, PN 109945.  Note how the plastic part of it looks
just like IOI FWB-CB01, only the metal part differs.  Right on arrival, this
card malfunctioned with any 1394b device like the IOI card is doing.

This is the complete list of all TSB82AA2 + TSB81BA3 cards ever in my
possession.  :-)  The typical behavior with all of these cards presently is
that there many bus resets when 1394b devices are connected.  However, given
that at least some of these cards worked better before for a longer or shorter
period, maybe your card is in an early stage of degradation only.  (Your
attachment 27424 [details] does not show those bus reset loops.)

> I was unable to find a "Debugging firewire issues" HOWTO.  Is there some knob
> I
> can tweak, or some place I can RTFM, to get more debugging output from the
> kernel that may be useful?

There is the debug parameter of firewire-ohci that you already found. However,
it is unlikely to reveal anything useful in this case as far as I can tell.
Note that 99.9% of the traffic from/to FireWire storage devices goes on
without interrupts, hence is not logged by firewire-ohci's debug feature.

Firmware issues on the level of SCSI command set support can be investigated
with command logging of the SCSI core.  You need the right compile options
enabled in the kernel config, and you need to feed some obscure numbers into a
sysfs interface.  Here is an example which I wrote down a while ago when I
needed it:
# echo 9216 > /sys/module/scsi_mod/parameters/scsi_logging_level

If the troubles always happen after some special requests that are not
normally seen during normal usage, e.g. after inquiry (in contrast to read or
write), then it probably is a firmware issue.

> Thanks for your diagnosing and comments thus far.
> 

----------------

> --- Comment #8 from bt@lawngnome.org  2010-08-15 03:41:44 ---
> Created an attachment (id=27447)
>  --> (https://bugzilla.kernel.org/attachment.cgi?id=27447)
> dmesg with ohci parameter debug=255
> 
> While copying files to firewire drives with ohci debug=255, many
> pending/canceled transactions.  Are these important?

----------------

> [ 8391.837568] firewire_ohci: IRQ 00000021 AR_resp AT_req
> [ 8391.837579] firewire_ohci: AR spd 3 tl 19, ffc1 -> ffc2, ack_complete, W
> resp
> [ 8391.837585] firewire_ohci: AT spd 3 tl 19, ffc2 -> ffc1,
> pending/cancelled, BW req, fffff0100008 8,0
> [ 8391.842839] firewire_ohci: IRQ 00000010 AR_req
> [ 8391.842852] firewire_ohci: AR spd 3 tl 0e, ffc1 -> ffc2, ack_complete, BW
> req, 000100000000 8,0
> [ 8391.843021] firewire_ohci: IRQ 00000021 AR_resp AT_req
> [ 8391.843031] firewire_ohci: AR spd 3 tl 1a, ffc1 -> ffc2, ack_complete, W
> resp
> [ 8391.843038] firewire_ohci: AT spd 3 tl 1a, ffc2 -> ffc1,
> pending/cancelled, BW req, fffff0100008 8,0
> [ 8391.848305] firewire_ohci: IRQ 00000010 AR_req
> [ 8391.848318] firewire_ohci: AR spd 3 tl 12, ffc1 -> ffc2, ack_complete, BW
> req, 000100000000 8,0
> [ 8391.848518] firewire_ohci: IRQ 00000021 AR_resp AT_req
> [ 8391.848529] firewire_ohci: AR spd 3 tl 1b, ffc1 -> ffc2, ack_complete, W
> resp
> [ 8391.848535] firewire_ohci: AT spd 3 tl 1b, ffc2 -> ffc1,
> pending/cancelled, BW req, fffff0100008 8,0
> [ 8391.853768] firewire_ohci: IRQ 00000010 AR_req

This is perfectly normal traffic during use of a FireWire storage device.

"AR spd 3 tl 0e, ffc1 -> ffc2, ack_complete, BW req, 000100000000 8,0"

A write request was received.   The transaction was completed with "ACK complete".

"AR spd 3 tl 19, ffc1 -> ffc2, ack_complete, W resp"
"AT spd 3 tl 19, ffc2 -> ffc1, pending/cancelled, BW req, fffff0100008 8,0"

A write request was sent, an "ACK pending" received; then a write response was
received and the controller answered that with an"AC complete".  As a twist,
the controller first sent an interrupt event about the received response (the
AR interrupt) before it sent the interrupt about the sent request (the AT
interrupt).  This is why the log says a seemingly indecisive
"pending/cancelled" in the AT event rather than "ack_pending" as it would if
the more logical order of AT-req interrupt -> AR-resp interrupt had happened.
However, it is quite common that controllers reorder those interrupt events.

> --- Comment #9 from bt@lawngnome.org  2010-08-15 03:45:33 ---
> To try separating the drives from the card, I setup an IP over firewire
> network
> between a Mac OS X laptop and my LaCie card w/ firewire_net.  The network
> activity of copying files etc stuttered/hung, waited, restarted, etc
> similarly
> to the way disk activity does.

firewire-net is not quite stable yet.  So this doesn't say much about the card
or the disk.

Does the disk work well under OS X?

Another thing that you could test is to put the OS X laptop (if it is an Apple
laptop, not a Hackintosh) into target disk mode (press and hold the "t" key
while you power the laptop up), the put Linux laptop -- Apple laptop --
FireWire disk into a chain and try to use the latter.  I.e. let the Apple
laptop act as an expensive FireWire repeater.  Perhaps this also works if you
plug this chain together and then power the Apple laptop up normally or have
it hibernating or whatever.
Comment 11 Stefan Richter 2010-08-21 16:45:34 UTC
> # echo 9216 > /sys/module/scsi_mod/parameters/scsi_logging_level

PS:  Alas this is impractical if your system disk is driven by the kernel's
SCSI core too.
Comment 12 bt 2010-08-23 03:32:08 UTC
(In reply to comment #10)
> sysfs interface.  Here is an example which I wrote down a while ago when I
> needed it:
> # echo 9216 > /sys/module/scsi_mod/parameters/scsi_logging_level

yes, unfortunately my primary disk is also on scsi.  

> Does the disk work well under OS X?

unfortunately yes, the disk does work well under OS X, especially on the laptop's embedded firewire interface.  it also works semi-well, but pops a few "out of transaction labels" messages into kernel and i slower, when using the LaCie expresscard. 

I also have another (sigh) LaCie drive on a fw400-only enclosure.  this drive does not work any better or worse than the rugged fw800 drive I have been mentioning, it just fails differently.  

I have tried chaining them together in various ways, like:

[fw800 card]--fw800:fw400 cable--[LaCie fw400]--fw400--[Rugged]
[fw800 card]--fw800---[Rugged]---[fw400]----[Lacie fw400]

etc

however the results are all the same, stalls, bad throughput, etc.

 
> 
> Another thing that you could test is to put the OS X laptop (if it is an
> Apple
> laptop, not a Hackintosh) into target disk mode (press and hold the "t" key
> while you power the laptop up), the put Linux laptop -- Apple laptop --
> FireWire disk into a chain and try to use the latter.  I.e. let the Apple
> laptop act as an expensive FireWire repeater.  Perhaps this also works if you
> plug this chain together and then power the Apple laptop up normally or have
> it hibernating or whatever.

this sounds like an interesting test but unless if I'm missing something obvious, unfortunately I can't do it with my test Apple laptop because it only has 1 fw800 port.  To make it a repeater I'd need 2 for this setup.

I will look around at work and see if I can find a Mac desktop which has multiple native firewire interfaces, and report back if I can determine anything from making the Mac host the devices.
Comment 13 Stefan Richter 2010-08-23 11:24:59 UTC
(Cc linux1394-devel, quoting in full for the list)

bugzilla-daemon@bugzilla.kernel.org wrote:
> https://bugzilla.kernel.org/show_bug.cgi?id=16577
> 
> --- Comment #12 from bt@lawngnome.org  2010-08-23 03:32:08 ---
> (In reply to comment #10)
>> sysfs interface.  Here is an example which I wrote down a while ago when I
>> needed it:
>> # echo 9216 > /sys/module/scsi_mod/parameters/scsi_logging_level
> 
> yes, unfortunately my primary disk is also on scsi.  
> 
>> Does the disk work well under OS X?
> 
> unfortunately yes, the disk does work well under OS X, especially on the
> laptop's embedded firewire interface.  it also works semi-well, but pops a
> few
> "out of transaction labels" messages into kernel and i slower, when using the
> LaCie expresscard. 

Which hints that the card may be troublesome.  And that in turn agrees with my
experience that all TSB82AA2 + TSB81BA3 sooner or later exhibit random
malfunctions of lesser or greater severity, primarily together with 1394b
(FireWire 800) devices but in some cases also with 1394a (FireWire 400)
devices.  My guess is that the TSB81BA3 is to blame.  (In contrast to
TSB81BA3D and TSB81BA3E which seem to be fine.)

> I also have another (sigh) LaCie drive on a fw400-only enclosure.  this drive
> does not work any better or worse than the rugged fw800 drive I have been
> mentioning, it just fails differently.  

I have a LaCie Little Disk myself, 2x FireWire 400 + USB 2.0 + button version,
based on the OXUF934DSA bridge chip.  This works without troubles.  Just now I
tested it on my LaCie 1394b CardBus card, with an otherwise disabled PCI card
added as a 12V bus power provider.  Works speedy and reliable.  hdparm -tT
reports > 35 MB/s, just like when the Little Disk is plugged into other cards.

As I mentioned, this LaCie CardBus card here is useless with any 1394b
devices.  I don't have a LaCie 1394b disk to test though.  But I am certain
that one or some of the 1394b disks which I have here have the same bridge
chip as whichever one sticks in your LaCie Rugged Disk; though there may be
firmware differences of course.

> I have tried chaining them together in various ways, like:
> 
> [fw800 card]--fw800:fw400 cable--[LaCie fw400]--fw400--[Rugged]
> [fw800 card]--fw800---[Rugged]---[fw400]----[Lacie fw400]
> 
> etc
> 
> however the results are all the same, stalls, bad throughput, etc.

How do you supply power to the disks?  Please do not use USB power cables.
These may work with some luck but more likely introduce random I/O errors, if
not outright discovery/ inquiry failures.

>> Another thing that you could test is to put the OS X laptop (if it is an
>> Apple
>> laptop, not a Hackintosh) into target disk mode (press and hold the "t" key
>> while you power the laptop up), the put Linux laptop -- Apple laptop --
>> FireWire disk into a chain and try to use the latter.  I.e. let the Apple
>> laptop act as an expensive FireWire repeater.  Perhaps this also works if
>> you
>> plug this chain together and then power the Apple laptop up normally or have
>> it hibernating or whatever.
> 
> this sounds like an interesting test but unless if I'm missing something
> obvious, unfortunately I can't do it with my test Apple laptop because it
> only
> has 1 fw800 port.  To make it a repeater I'd need 2 for this setup.

Oh, I forgot that newer Apple laptops only have a single FireWire port again.
Still, you could at least include it in a test as a reliable bus power
provider, if you do not have a dedicated PSU for either the CardBus card or
the LaCie disk(s) or do not trust the reliability of such a PSU.  Of course
you need to get OS X as SBP-2 initiator out of the picture, e.g. by starting
the laptop in SBP-2 target mode as mentioned.

> I will look around at work and see if I can find a Mac desktop which has
> multiple native firewire interfaces, and report back if I can determine
> anything from making the Mac host the devices.

A test with an Apple desktop as repeater --- or any other kind of repeater ---
is probably not going to reveal anything useful now that even the test with
the LaCie FireWire 400 disk as repeater did not improve matters.

One other thought:  If you have the experience and time for doing so, you
could try to rebuild your 2.6.34(.?) kernel from source, with latest firewire
driver updates patched in:  http://user.in-berlin.de/~s5r6/linux1394/updates/
The 2.6.34.y patch series in there applies to 2.6.34.1 or any later 2.6.34.y
sources.  If your 2.6.34-020634-generic is still a plain 2.6.34 kernel (check
the first few lines in Makefile), the one patch in
http://user.in-berlin.de/~s5r6/linux1394/merged/in_2.6.34.1/ needs to be
applied to make the combo patch or patch series in updates/ applicable.

I do not remember anything in these driver changes since 2.6.34 that is of
obvious importance to your bug report, but an actual test would be good if you
can manage.  I hope there are build instructions somewhere at ubuntu.com.
Comment 14 bt 2010-08-23 15:28:02 UTC
(In reply to comment #13)
> (Cc linux1394-devel, quoting in full for the list)
> How do you supply power to the disks?  Please do not use USB power cables.
> These may work with some luck but more likely introduce random I/O errors, if
> not outright discovery/ inquiry failures.

for the Rugged (800) drive I DO use the modified USB power cable provided by LaCie.

for the older 400 drive, the enclosure is self powered.

I don't have the power adapter for the LaCie card power bus, but I'll attempt to find one.

> the LaCie disk(s) or do not trust the reliability of such a PSU.  Of course
> you need to get OS X as SBP-2 initiator out of the picture, e.g. by starting
> the laptop in SBP-2 target mode as mentioned.

I found a Mac Pro desktop here at work and I did a few tests. It has a 400 and 800 port on front, and also the same pair on the back.  note I did these while the system was running, not while in target mode.  please let me know if that ruins the test.

key:
"drive" is the LaCie rugged drive
"800 cable N" is a new cable I am borrowing at work
"800 cable O" is the cable I have been using the whole time
"400 cable" is the 400 cable I have been using the whole time

laptop -- fw 800 cable N -- mac -- fw 400 cable -- drive -- does not work
laptop -- fw 800 cable O -- mac -- fw 400 cable -- drive -- 18MB/s
laptop -- fw 800 cable N -- mac -- fw 800 cable O -- drive -- does not work
laptop -- fw 800:400 cable -- mac -- fw 800 cable O -- drive -- does not work
laptop -- fw 800:400 cable -- mac -- fw 800 cable N -- drive -- does not work
laptop -- fw 800:400 cable -- mac -- fw 400 cable -- drive --18 MB/s
laptop -- fw 800 cable N -- drive -- 45MB/s - usb powered
laptop -- fw 800 cable O -- drive -- 48MB/s * - usb..
laptop -- fw 800:400 cable -- drive -- 34MB/s - usb..

there are some very strange oddities here, for example 800 in/out of the mac the drive does not work, but going 800-400 works.

my laptop says this in the "does not work" scenarios:

[  640.195486] scsi13 : SBP-2 IEEE-1394
[  640.195596] firewire_core: created device fw2: GUID 00d04b881a053e71, S800, 1 config ROM retries
[  640.764644] firewire_sbp2: fw2.0: error status: 0:4
[  641.122794] firewire_sbp2: fw2.0: error status: 0:4
[  641.480721] firewire_sbp2: fw2.0: error status: 0:4
[  641.839073] firewire_sbp2: fw2.0: error status: 0:4
[  642.197226] firewire_sbp2: fw2.0: error status: 0:4
[  642.555137] firewire_sbp2: fw2.0: error status: 0:4
[  642.555143] firewire_sbp2: fw2.0: failed to login to LUN 0000


even stranger, suddenly my drive cooperates with me (see asterisk) after all of the earlier tests, it works with the original 800 cable and does about 48MB/s without hiccups.    this persists even after I remove, reinsert the expresscard.

very flaky.


> One other thought:  If you have the experience and time for doing so, you
> could try to rebuild your 2.6.34(.?) kernel from source, with latest firewire
> driver updates patched in:  http://user.in-berlin.de/~s5r6/linux1394/updates/
> The 2.6.34.y patch series in there applies to 2.6.34.1 or any later 2.6.34.y
> sources.  If your 2.6.34-020634-generic is still a plain 2.6.34 kernel (check
> the first few lines in Makefile), the one patch in
> http://user.in-berlin.de/~s5r6/linux1394/merged/in_2.6.34.1/ needs to be
> applied to make the combo patch or patch series in updates/ applicable.

I'll attempt to try this soon.

thanks
Comment 15 Stefan Richter 2010-09-12 09:40:58 UTC
[laptop + Mac + driver scenario]
> my laptop says this in the "does not work" scenarios:
> 
> [  640.195486] scsi13 : SBP-2 IEEE-1394
> [  640.195596] firewire_core: created device fw2: GUID 00d04b881a053e71,
> S800,
1 config ROM retries
> [  640.764644] firewire_sbp2: fw2.0: error status: 0:4
> [  641.122794] firewire_sbp2: fw2.0: error status: 0:4
...
> [  642.555143] firewire_sbp2: fw2.0: failed to login to LUN 0000

Error status 0:4 means "request complete" : "access denied".  Very likely OS X already logged into the disk before Linux could.  Thus the disk allows only the Mac to control it.  Hence my earlier suggestion of putting a Mac in target mode when testing with it acting as FireWire repeater and power provider (but not as SBP-2 initiator).

Anyway.  At this point I still think the most likely cause for failures of the laptop--disk scenario is unreliable power supply to the disk.