Bug 10342 - Add SBP2_WORKAROUND_128K_MAX_TRANS workaround for Prolific PL3507 bridge
Summary: Add SBP2_WORKAROUND_128K_MAX_TRANS workaround for Prolific PL3507 bridge
Status: REJECTED INVALID
Alias: None
Product: Drivers
Classification: Unclassified
Component: IEEE1394
Hardware: All Linux
Importance: P1 normal
Assignee: Stefan Richter
Reported: 2008-03-27 13:59 UTC by Konstantin A. Lepikhov
Modified: 2008-06-05 16:04 UTC

See Also:
Kernel Version: 2.6.24
Attachments
Prolific PL3507 bridge workaround patch (986 bytes, patch)
2008-03-27 14:02 UTC, Konstantin A. Lepikhov

Description Konstantin A. Lepikhov 2008-03-27 13:59:58 UTC
Latest working kernel version: N/A
Earliest failing kernel version: N/A
Distribution: ALT Linux Sisyphus (20071221)
Hardware Environment: i586 system w/ AMD Athlon(tm) 64 X2 Dual Core Processor 3800+, ASUS M2R32-MVP motherboard, 4096 MB RAM, 2x400 GB SATA drives.
Software Environment: kernel-2.6.24.4 + glibc 2.5.1.
Problem Description: Without a special workaround, the Prolific PL3507 bridge hangs while copying large amounts of data (e.g. .avi files).
Steps to reproduce: install a PL3507 bridge (PATA/USB/FireWire enclosure), start copying some .avi files to the drive inside the PL3507 enclosure, and observe a hang that looks like:
Mar 27 01:54:46 lks kernel: sd 34:0:0:0: [sdd] Attached SCSI disk
Mar 27 01:54:46 lks kernel: sd 34:0:0:0: Attached scsi generic sg4 type 0
Mar 27 01:55:09 lks kernel: firewire_core: phy config: card 0, new root=ffc1, gap_count=5
Mar 27 01:55:09 lks kernel: scsi35 : SBP-2 IEEE-1394
Mar 27 01:55:09 lks kernel: firewire_core: created new fw device fw1 (0 config rom retries)
Mar 27 01:55:09 lks kernel: firewire_sbp2: logged in to sbp2 unit fw1.0 (0 retries)
Mar 27 01:55:09 lks kernel: firewire_sbp2:  - management_agent_address:    0xfffff0010000
Mar 27 01:55:09 lks kernel: firewire_sbp2:  - command_block_agent_address: 0xfffff0010020
Mar 27 01:55:09 lks kernel: firewire_sbp2:  - status write address:        0x000100000000
Mar 27 01:55:09 lks kernel: scsi 35:0:0:0: Direct-Access-RBC ST317242 A                     PQ: 0 ANSI: 4
Mar 27 01:55:09 lks kernel: sd 35:0:0:0: [sde] 33683328 512-byte hardware sectors (17246 MB)
Mar 27 01:55:09 lks kernel: sd 35:0:0:0: [sde] Write Protect is off
Mar 27 01:55:09 lks kernel: sd 35:0:0:0: [sde] Mode Sense: 11 00 00 00
Mar 27 01:55:09 lks kernel: sd 35:0:0:0: [sde] Write cache: enabled, read cache: enabled, doesn't support DPO or FUA
Mar 27 01:55:09 lks kernel: sd 35:0:0:0: [sde] 33683328 512-byte hardware sectors (17246 MB)
Mar 27 01:55:09 lks kernel: sd 35:0:0:0: [sde] Write Protect is off
Mar 27 01:55:09 lks kernel: sd 35:0:0:0: [sde] Mode Sense: 11 00 00 00
Mar 27 01:55:09 lks kernel: sd 35:0:0:0: [sde] Write cache: enabled, read cache: enabled, doesn't support DPO or FUA
Mar 27 01:55:09 lks kernel:  sde: sde1 sde2 < sde5 >
Mar 27 01:55:09 lks kernel: sd 35:0:0:0: [sde] Attached SCSI disk
Mar 27 01:55:09 lks kernel: sd 35:0:0:0: Attached scsi generic sg5 type 14
Mar 27 01:56:25 lks kernel: SGI XFS with ACLs, security attributes, realtime, large block numbers, no debug enabled
Mar 27 01:56:25 lks kernel: SGI XFS Quota Management subsystem
Mar 27 01:56:25 lks kernel: XFS mounting filesystem sde5
Mar 27 01:56:25 lks kernel: Ending clean XFS mount for filesystem: sde5
Mar 27 01:58:27 lks kernel: firewire_sbp2: sbp2_scsi_abort
Mar 27 01:58:27 lks kernel: firewire_sbp2: sbp2_scsi_abort
Mar 27 01:58:27 lks kernel: sd 35:0:0:0: rejecting I/O to offline device
Mar 27 01:58:27 lks last message repeated 114 times
Mar 27 01:58:27 lks kernel: Buffer I/O error on device sde5, logical block 1175581
Mar 27 01:58:27 lks kernel: Buffer I/O error on device sde5, logical block 1175582
Mar 27 01:58:27 lks kernel: Buffer I/O error on device sde5, logical block 1175583
Mar 27 01:58:27 lks kernel: Buffer I/O error on device sde5, logical block 1175584
Mar 27 01:58:27 lks kernel: Buffer I/O error on device sde5, logical block 1175585
Mar 27 01:58:27 lks kernel: Buffer I/O error on device sde5, logical block 1175586
Mar 27 01:58:27 lks kernel: Buffer I/O error on device sde5, logical block 1175587
Mar 27 01:58:27 lks kernel: Buffer I/O error on device sde5, logical block 1175588
Mar 27 01:58:27 lks kernel: Buffer I/O error on device sde5, logical block 1175589
Mar 27 01:58:27 lks kernel: Buffer I/O error on device sde5, logical block 1175590
Mar 27 01:58:27 lks kernel: I/O error in filesystem ("sde5") meta-data dev sde5 block 0xa94df4       ("xlog_iodone") error 5 buf count 11264
Mar 27 01:58:27 lks kernel: Filesystem "sde5": Log I/O Error Detected.  Shutting down filesystem: sde5
Mar 27 01:58:27 lks kernel: Please umount the filesystem, and rectify the problem(s)
Mar 27 01:58:27 lks kernel: sd 35:0:0:0: rejecting I/O to offline device
Mar 27 01:58:55 lks kernel: firewire_sbp2: management write failed, rcode 0x13

With attached patch:
Mar 27 23:31:05 lks kernel: firewire_core: phy config: card 0, new root=ffc1, gap_count=5
Mar 27 23:31:05 lks kernel: firewire_core: phy config: card 0, new root=ffc1, gap_count=5
Mar 27 23:31:05 lks kernel: scsi49 : SBP-2 IEEE-1394
Mar 27 23:31:05 lks kernel: firewire_sbp2: Workarounds for node fw1.0: 0x1 (firmware_revision 0x012804, model_id 0x000001)
Mar 27 23:31:05 lks kernel: firewire_core: created new fw device fw1 (0 config rom retries, S400)
Mar 27 23:31:05 lks kernel: firewire_sbp2: logged in to sbp2 unit fw1.0 (0 retries)
Mar 27 23:31:05 lks kernel: firewire_sbp2:  - management_agent_address:    0xfffff0010000
Mar 27 23:31:05 lks kernel: firewire_sbp2:  - command_block_agent_address: 0xfffff0010020
Mar 27 23:31:05 lks kernel: firewire_sbp2:  - status write address:        0x000100000000
Mar 27 23:31:05 lks kernel: scsi 49:0:0:0: Direct-Access-RBC ST317242 A                     PQ: 0 ANSI: 4
Mar 27 23:31:05 lks kernel: sd 49:0:0:0: [sdg] 33683328 512-byte hardware sectors (17246 MB)
Mar 27 23:31:05 lks kernel: sd 49:0:0:0: [sdg] Write Protect is off
Mar 27 23:31:05 lks kernel: sd 49:0:0:0: [sdg] Mode Sense: 11 00 00 00
Mar 27 23:31:05 lks kernel: sd 49:0:0:0: [sdg] Write cache: enabled, read cache: enabled, doesn't support DPO or FUA
Mar 27 23:31:05 lks kernel: sd 49:0:0:0: [sdg] 33683328 512-byte hardware sectors (17246 MB)
Mar 27 23:31:05 lks kernel: sd 49:0:0:0: [sdg] Write Protect is off
Mar 27 23:31:05 lks kernel: sd 49:0:0:0: [sdg] Mode Sense: 11 00 00 00
Mar 27 23:31:05 lks kernel: sd 49:0:0:0: [sdg] Write cache: enabled, read cache: enabled, doesn't support DPO or FUA
Mar 27 23:31:05 lks kernel:  sdg: sdg1 sdg2 < sdg5 >
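
The "Workarounds for node fw1.0: 0x1" line above reports that a quirk flag was selected from the device's firmware_revision and model_id. The following is a minimal user-space sketch of that kind of lookup, not the attached patch itself. The flag value 0x1 is taken from the log; the table entry 0x012800, the masking of the low byte of firmware_revision, the wildcard constant, and the helper names lookup_workarounds() and max_sectors_for() are assumptions made for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Flag value 0x1 matches the "Workarounds for node fw1.0: 0x1" log line. */
#define SBP2_WORKAROUND_128K_MAX_TRANS 0x1u

/* Hypothetical wildcard: the entry applies to every model_id. */
#define MODEL_WILDCARD 0xffffffffu

struct workaround_entry {
    uint32_t firmware_revision; /* compared with the low byte masked off (assumption) */
    uint32_t model_id;
    unsigned int workarounds;
};

/* Illustrative table; the 0x012800 entry is an assumption, not the attached patch. */
static const struct workaround_entry workaround_table[] = {
    { 0x012800u, MODEL_WILDCARD, SBP2_WORKAROUND_128K_MAX_TRANS },
};

/* Collect the quirk flags of every table entry that matches the device. */
static unsigned int lookup_workarounds(uint32_t firmware_revision, uint32_t model_id)
{
    unsigned int flags = 0;
    size_t i;

    for (i = 0; i < sizeof(workaround_table) / sizeof(workaround_table[0]); i++) {
        const struct workaround_entry *e = &workaround_table[i];

        if (e->firmware_revision != (firmware_revision & 0xffffff00u))
            continue;
        if (e->model_id != MODEL_WILDCARD && e->model_id != model_id)
            continue;
        flags |= e->workarounds;
    }
    return flags;
}

/* Largest request in sectors for a device with these flags; 0 means "no cap". */
static unsigned int max_sectors_for(unsigned int flags, unsigned int sector_size)
{
    assert(sector_size != 0);
    if (flags & SBP2_WORKAROUND_128K_MAX_TRANS)
        return 128u * 1024u / sector_size;
    return 0;
}
```

With the values from the log (firmware_revision 0x012804, model_id 0x000001) this sketch yields flag 0x1, and with the 512-byte sectors reported for the disk the cap works out to 256 sectors per request.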
Comment 1 Konstantin A. Lepikhov 2008-03-27 14:02:55 UTC
Created attachment 15468
Prolific PL3507 bridge workaround patch
Comment 2 Konstantin A. Lepikhov 2008-03-27 14:05:11 UTC
I also checked that the latest Prolific firmware(*) still needs this patch.

(*) http://www.prolific.com.tw/eng/downloads.asp?ID=44
Comment 3 Anonymous Emailer 2008-03-27 14:20:12 UTC
Reply-To: stefanr@s5r6.in-berlin.de

Please try 2.6.25-rc7 (or 2.6.25 when it comes out, or 2.6.24 + firewire 
patches from http://me.in-berlin.de/~s5r6/linux1394/updates/ --- 
whatever suits you most) without the 128k workaround.

Among the other misc fixes and improvements, the patch "firewire: 
fw-sbp2: set single-phase retry_limit" should considerably stabilize 
PL3507 based devices.
http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commit;h=51f9dbef5be41f3ff6000c874741a3a357f9bad7

I suspect the PL3507 does not need the 128k request size limit.
Comment 4 Jarod Wilson 2008-03-27 14:31:20 UTC
Yeah, I've got several PL-3507 devices which work flawlessly after the patch referenced in comment #3, without any 128k max xfer workaround. I believe before the patch, I was actually still able to trigger an I/O hang even with the 128k max xfer workaround enabled, it was just harder to trigger.
Comment 5 Konstantin A. Lepikhov 2008-03-27 15:05:25 UTC
Hmm, it looks like these patches didn't help on 2.6.22.y (sorry, I only have
this one at home). The dmesg diagnostics have changed, but the I/O timeouts and
errors are still there:

ACPI: PCI Interrupt 0000:03:03.0[A] -> GSI 22 (level, low) -> IRQ 21
firewire_ohci: Added fw-ohci device 0000:03:03.0, OHCI version 1.10
firewire_core: created device fw0: GUID 0011d80000d4626d, S400
firewire_core: created device fw1: GUID 0050770e00071002, S400
firewire_core: phy config: card 0, new root=ffc1, gap_count=5
scsi51 : SBP-2 IEEE-1394
firewire_sbp2: fw1.0: logged in to LUN 0000 (0 retries)
scsi 51:0:0:0: Direct-Access-RBC ST317242 A                     PQ: 0 ANSI: 4
sd 51:0:0:0: [sdh] 33683328 512-byte hardware sectors (17246 MB)
sd 51:0:0:0: [sdh] Write Protect is off
sd 51:0:0:0: [sdh] Mode Sense: 11 00 00 00
sd 51:0:0:0: [sdh] Write cache: enabled, read cache: enabled, doesn't support DPO or FUA
sd 51:0:0:0: [sdh] 33683328 512-byte hardware sectors (17246 MB)
sd 51:0:0:0: [sdh] Write Protect is off
sd 51:0:0:0: [sdh] Mode Sense: 11 00 00 00
sd 51:0:0:0: [sdh] Write cache: enabled, read cache: enabled, doesn't support DPO or FUA
 sdh: sdh1 sdh2 < sdh5 >
sd 51:0:0:0: [sdh] Attached SCSI disk
sd 51:0:0:0: Attached scsi generic sg8 type 14
XFS mounting filesystem sdh5
Ending clean XFS mount for filesystem: sdh5
firewire_sbp2: fw1.0: sbp2_scsi_abort
firewire_sbp2: fw1.0: sbp2_scsi_abort
sd 51:0:0:0: scsi: Device offlined - not ready after error recovery
sd 51:0:0:0: [sdh] Result: hostbyte=DID_BUS_BUSY driverbyte=DRIVER_OK,SUGGEST_OK
end_request: I/O error, dev sdh, sector 32745770
sd 51:0:0:0: rejecting I/O to offline device
...
sd 51:0:0:0: [sdh] Result: hostbyte=DID_NO_CONNECT driverbyte=DRIVER_OK,SUGGEST_OK
end_request: I/O error, dev sdh, sector 32746794
sd 51:0:0:0: rejecting I/O to offline device
sd 51:0:0:0: rejecting I/O to offline device
sd 51:0:0:0: rejecting I/O to offline device
sd 51:0:0:0: rejecting I/O to offline device
sd 51:0:0:0: rejecting I/O to offline device
sd 51:0:0:0: rejecting I/O to offline device
printk: 14600 messages suppressed.
Buffer I/O error on device sdh5, logical block 2657404
lost page write due to I/O error on sdh5
Buffer I/O error on device sdh5, logical block 2657405
lost page write due to I/O error on sdh5
Buffer I/O error on device sdh5, logical block 2657406
lost page write due to I/O error on sdh5
Buffer I/O error on device sdh5, logical block 2657407
lost page write due to I/O error on sdh5
Buffer I/O error on device sdh5, logical block 2657408
lost page write due to I/O error on sdh5
Buffer I/O error on device sdh5, logical block 2657409
lost page write due to I/O error on sdh5
Buffer I/O error on device sdh5, logical block 2657410
lost page write due to I/O error on sdh5
Buffer I/O error on device sdh5, logical block 2657411
lost page write due to I/O error on sdh5
Buffer I/O error on device sdh5, logical block 2657412
lost page write due to I/O error on sdh5
Buffer I/O error on device sdh5, logical block 2657413
lost page write due to I/O error on sdh5
I/O error in filesystem ("sdh5") meta-data dev sdh5 block 0xa9457a       ("xlog_iodone") error 5 buf count 5632
xfs_force_shutdown(sdh5,0x2) called from line 960 of file fs/xfs/xfs_log.c.  Return address = 0xfa3deb8c
Filesystem "sdh5": Log I/O Error Detected.  Shutting down filesystem: sdh5
Please umount the filesystem, and rectify the problem(s)
sd 51:0:0:0: rejecting I/O to offline device
Comment 6 Stefan Richter 2008-03-27 15:35:47 UTC
2.6.22.y or 2.6.24.y?
Comment 7 Stefan Richter 2008-03-27 15:41:18 UTC
If you plug the device out and back in (if bus-powered) or power-cycle it (if self-powered) after the failure happened, will the controller detect that there was a device plugged in again?

I ask in order to verify that this is not bug 10307.
Comment 8 Konstantin A. Lepikhov 2008-03-27 16:34:21 UTC
(In reply to comment #6)
> 2.6.22.y or 2.6.24.y?
> 

I only tested 2.6.22.19 because I don't have the 2.6.24.y sources at hand now.

(In reply to comment #7)
> If you plug the device out and back in (if bus-powered) or power-cycle it (if
> self-powered) after the failure happened, will the controller detect that
> there was a device plugged in again?
> 
> I ask in order to verify that this is not bug 10307.
> 

No, after the failure and a power off/on the device is detected properly.
Comment 9 Konstantin A. Lepikhov 2008-03-27 16:39:41 UTC
lspci -xnnvvv output for the onboard IEEE 1394 controller:

03:03.0 FireWire (IEEE 1394) [0c00]: VIA Technologies, Inc. IEEE 1394 Host Controller [1106:3044] (rev c0) (prog-if 10 [OHCI])
        Subsystem: ASUSTeK Computer Inc. Device [1043:81fe]
        Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV+ VGASnoop- ParErr- Stepping- SERR+ FastB2B- DisINTx-
        Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=medium >TAbort- <TAbort- <MAbort- >SERR- <PERR+ INTx-
        Latency: 64 (8000ns max), Cache Line Size: 64 bytes
        Interrupt: pin A routed to IRQ 21
        Region 0: Memory at fbfff000 (32-bit, non-prefetchable) [size=2K]
        Region 1: I/O ports at ec00 [size=128]
        Capabilities: <access denied>
        Kernel driver in use: firewire_ohci
00: 06 11 44 30 17 01 10 82 c0 10 00 0c 10 40 00 00
10: 00 f0 ff fb 01 ec 00 00 00 00 00 00 00 00 00 00
20: 00 00 00 00 00 00 00 00 00 00 00 00 43 10 fe 81
30: 00 00 00 00 50 00 00 00 00 00 00 00 03 01 00 20
Comment 10 Stefan Richter 2008-03-27 17:11:59 UTC
VIA rev c0?  I don't think I have seen this in other people's lspci before. Is it a VT6308 perhaps?  Anyway, I don't expect fundamental differences between VIA chips series.

Since you tried Prolific's firmware update I conclude that you don't have one of those really old PL3507 revisions which don't support firmware upload.

I only have one PL3507 myself;  I use it with a DVD-RW but shall replace that by a HDD and test it with the various cards here, including VIA rev 46 and rev 80 (VT6306, VT6307).  Of course my PL3507 runs different firmware than yours.

Would still be nice though if you could test one of the initially mentioned kernels when you have the opportunity, and check with stress tests that the 128k workaround really fixes it instead of just raising the bar for the problem to happen.  OTOH, even if it just reduces the problem rather than fixing it (that might be ultimately impossible with PL3507s), we should perhaps adopt it at least as an improvement.

The existing 128k workaround is a priori for a specific mess-up in a particular old ex Symbios now LSI Logic bridge: ftp://ftp2.de.freebsd.org/pub/misc/specs/symbios/symchips/1394/UPDATED_configuration_ROMs/ReadMe.doc
This is basically a first-generation SBP-2 chip which was quickly made obsolete by the much better performing main competitor OXFW911.
Comment 11 Stefan Richter 2008-03-27 17:17:24 UTC
And there is one more thing that we should avoid mixing up here:

PL3507s will produce I/O errors if they receive INQUIRY commands out of the order they expect.  (They should optimally get INQUIRY only as the very first command; but at least immediately followed by READ_CAPACITY.)

This means that you should ideally perform your tests with hald and any other programs which query storage devices (amarok?) disabled, to exclude other failure sources.
Comment 12 Konstantin A. Lepikhov 2008-04-07 14:26:39 UTC
Sorry for the delay, I was too busy to test :(  Last weekend I tested 2.6.24.4 + patches from http://me.in-berlin.de/~s5r6/linux1394/updates/ on another box (i965 based) with a different FireWire controller and didn't find any problems.  Everything works well, including transferring many big files in different directions, without any quirks or workarounds in the sbp2 driver. So I suspect it's definitely a 1394 controller hardware problem and we need some quirks in the OHCI driver. Anyway, I have 2 notebooks at hand with various chipsets and FireWire controllers, but I need a special ("little") FireWire cable and some time to test this.
Comment 13 Stefan Richter 2008-04-07 14:51:48 UTC
Thanks for this additional testing.

Are there any fundamental differences between this better working setup and the nonworking one (besides the different controller and mainboard) --- such as more or less RAM, uniprocessor vs. SMP, CONFIG_PREEMPT_NONE vs. CONFIG_PREEMPT, x86-32 vs. x86-64, or whatever else...

BTW, the firewire drivers were buggy under x86-64 before patch 691 "firewire: fw-ohci: plug dma memory leak in AR handler" (posted end of March, merged in Linux 2.6.25-rc8), and very noticeably broken under x86-64 + > 3GB RAM before patch 674 "firewire: fw-ohci: use dma_alloc_coherent for ar_buffer" (posted mid of March, merged in Linux 2.6.25-rc6 or so).
Comment 14 Konstantin A. Lepikhov 2008-04-07 15:03:40 UTC
(In reply to comment #13)
> Thanks for this additional testing.
> 
> Are there any fundamental differences between this better working setup and
> the nonworking one (besides the different controller and mainboard) --- such
> as more or less RAM, uniprocessor vs. SMP, CONFIG_PREEMPT_NONE vs.
> CONFIG_PREEMPT, x86-32 vs. x86-64, or whatever else...
No, all tested systems have the same kernel (2.6.24.4 + linux1394 patches) and arch (x86-32), but only the first, buggy box has > 3 GB RAM. However, this mainboard (ASUS M2R32-MVP) is _very_ buggy. For example, a 2.6.24.y kernel simply hangs under heavy USB activity or when just removing a USB stick. So, unfortunately, there are many other possible factors in this case.

> 
> BTW, the firewire drivers were buggy under x86-64 before patch 691 "firewire:
> fw-ohci: plug dma memory leak in AR handler" (posted end of March, merged in
> Linux 2.6.25-rc8), and very noticeably broken under x86-64 + > 3GB RAM before
> patch 674 "firewire: fw-ohci: use dma_alloc_coherent for ar_buffer" (posted
> mid of March, merged in Linux 2.6.25-rc6 or so).
> 
Okay, that will take a while because I need to build a live CD for testing an x86-64 environment. I'll try that, but only next weekend.
 
Comment 15 Stefan Richter 2008-04-07 15:15:39 UTC
No, since you already used x86-32 on both systems, it is not necessary for you to run tests with x86-64.

I'm just not sure how AMD based boards have (or have not) been affected by the DMA mapping bugs, therefore you should update the PC with M2R32-MVP to a firewire patch >= v691 if you didn't do so already.

I will try to do some own tests with a PL3507 enclosure + HDD this week, as mentioned in comment #10.
Comment 16 Jarod Wilson 2008-04-15 08:35:41 UTC
Oops, I meant to add myself to the cc list on this one when I commented last...
Comment 17 Konstantin A. Lepikhov 2008-04-15 15:37:22 UTC
Okay, today's testing results with 2.6.24.4 + ieee1394 patches (the last patch is "Subject: firewire: fw-ohci: work around generation bug in TI controllers (fix AV/C and more)"):
- write errors occur very rarely, but strange things happen: sometimes the system hangs during rmmod of the firewire-ohci module; sometimes rmmod is OK, but when I insmod firewire-ohci again I see the familiar symptoms in dmesg:
[  742.049744] firewire_sbp2: fw1.0: sbp2_scsi_abort
[  742.049794] sd 9:0:0:0: Device offlined - not ready after error recovery
[  742.049811] sd 9:0:0:0: [sdd] Result: hostbyte=DID_BUS_BUSY driverbyte=DRIVER_OK,SUGGEST_OK
[  742.049815] end_request: I/O error, dev sdd, sector 415
[  742.049819] Buffer I/O error on device sdd1, logical block 177
[  742.049824] Buffer I/O error on device sdd1, logical block 178
[  742.049827] Buffer I/O error on device sdd1, logical block 179
[  742.049832] Buffer I/O error on device sdd1, logical block 180
[  742.049904] Buffer I/O error on device sdd1, logical block 181
[  742.049910] Buffer I/O error on device sdd1, logical block 182
[  742.049913] Buffer I/O error on device sdd1, logical block 183
[  742.049916] Buffer I/O error on device sdd1, logical block 184
[  742.049919] Buffer I/O error on device sdd1, logical block 185
[  742.049922] Buffer I/O error on device sdd1, logical block 186
[  742.049994] sd 9:0:0:0: rejecting I/O to offline device
[  742.050004] sd 9:0:0:0: [sdd] Result: hostbyte=DID_NO_CONNECT driverbyte=DRIVER_OK,SUGGEST_OK
[  742.050007] end_request: I/O error, dev sdd, sector 0
[  742.050691] sd 9:0:0:0: rejecting I/O to offline device
[  742.050709] sd 9:0:0:0: rejecting I/O to offline device
[  742.050716] sd 9:0:0:0: rejecting I/O to offline device
[  742.050723] sd 9:0:0:0: rejecting I/O to offline device
[  779.526238] sd 9:0:0:0: [sdd] Synchronizing SCSI cache
[  779.526528] sd 9:0:0:0: [sdd] Result: hostbyte=DID_BUS_BUSY driverbyte=DRIVER_OK,SUGGEST_OK
[  779.526644] firewire_sbp2: released fw1.0
<power off/on device>
[  784.190125] firewire_core: phy config: card 1, new root=ffc1, gap_count=5
[  784.254040] scsi10 : SBP-2 IEEE-1394
[  784.254102] firewire_core: created device fw1: GUID 0050770e00071002, S400, 2 config ROM retries
[  784.260475] firewire_sbp2: fw1.0: logged in to LUN 0000 (0 retries)
[  784.260891] scsi 10:0:0:0: Direct-Access-RBC Maxtor 7 541 AP                PQ: 0 ANSI: 4
[  784.261250] sd 10:0:0:0: [sdd] 1061390 512-byte hardware sectors (543 MB)
[  784.261362] sd 10:0:0:0: [sdd] Write Protect is off
[  784.261365] sd 10:0:0:0: [sdd] Mode Sense: 11 00 00 00
[  784.261696] sd 10:0:0:0: [sdd] Write cache: enabled, read cache: enabled, doesn't support DPO or FUA
[  784.261984] sd 10:0:0:0: [sdd] 1061390 512-byte hardware sectors (543 MB)
[  784.262102] sd 10:0:0:0: [sdd] Write Protect is off
[  784.262105] sd 10:0:0:0: [sdd] Mode Sense: 11 00 00 00
[  784.262426] sd 10:0:0:0: [sdd] Write cache: enabled, read cache: enabled, doesn't support DPO or FUA
[  784.262431]  sdd: sdd1
[  784.270139] sd 10:0:0:0: [sdd] Attached SCSI disk
[  784.270179] sd 10:0:0:0: Attached scsi generic sg4 type 14
[  786.389308] reiser4: sdd1: found disk format 4.0.0.
Comment 18 Stefan Richter 2008-04-16 03:37:13 UTC
Konstantin, what is the precise product name of the enclosure and who is the vendor?

Did the firmware_revision marker in your patch match already before you updated the firmware of the device?
Comment 19 Konstantin A. Lepikhov 2008-04-16 14:19:39 UTC
It's an Agestar ICB3A combo enclosure. No, I updated the firmware first and then made the patch, so I don't remember the old marker. But I have a dump of the old firmware and may try to flash it again (I hope that is possible). I'll try this on one of the next weekends and report the results here.
Comment 20 Stefan Richter 2008-04-16 15:20:30 UTC
Note to self:
MacPower Prefect II 5.25" enclosure with PL3507 and as-shipped firmware from MacPower: firmware_revision 0x100102
Comment 21 Stefan Richter 2008-04-24 13:17:15 UTC
I stress-tested the device from comment #20 with a 400 GB HDD, reiserfs, and a loop of
    $ cp -pr directory_with_14gigs_of_avi_files test_directory
    $ cat test_directory/*/* > /dev/null
    $ rm -rf test_directory/*
No problem.  Even running "sg_inq" while the IO is going on does not do any harm.  So this one has quite good firmware.  (It is only quirky during login.  Sometimes it takes some replugging before it agrees to start working.)

The stamp on the chip says
    PL-3507
    05124C

So whatever it is that throws your PL-3507 out of whack, it is either limited to Prolific's original firmware, or to certain chip revisions, or when certain IDE devices are attached, or a combination.

Will you find out the firmware_id of the older firmware (before you updated to the latest firmware) soon?  Or shall I go ahead and add the patch as-is?

BTW, since Linux 2.6.25, drivers/ieee1394/sbp2.c does not limit the transfer size anymore either.  Hence I would take over your patch for that driver as well.  There is only a slight difference between firewire-sbp2 and sbp2 in so far as  firewire-core performs gap count optimization which increases IEEE1394a throughput.  This shouldn't matter to the problem at hand though if there is indeed a purely transfer size dependent device bug.
Comment 22 Stefan Richter 2008-05-03 04:43:58 UTC
There are two independent things we need to do:
  - Check if fw-ohci properly guarantees the necessary order of memory and
    register accesses.
  - Find a way to reset PL-3507 when fetch agent reset doesn't work.

Long story:

Now that I reread the whole bug history, I think the transfer size limit patch is not really what we need.  Comment #9, comment #12, comment #14 are significant.

It is not the PL-3507 which needs the transfer size limit, it is the board --- which needs the limit not as a fix, but to mask out driver bugs or/and board quirks.

I may be deluded, but I like to believe that the firewire code which you last tested has all the necessary fixes for DMA mapping.  (For physical response unit:  DMA mapping in fw-sbp2 which I believe to have fixed long ago;  for asynchronous transmission and reception:  DMA mapping in fw-ohci which should be fixed since conversion of AR DMA to coherent allocation which is AFAIR included in 2.6.24.)  Also, Jarod already extensively tested firewire-sbp2 on some boards with IOMMU and RAM outside the 32bit PCI physical address range.

But wait, there may still be at least theoretical bugs in fw-ohci regarding the ordering of the CPU writing to DMA buffers (including descriptors) and MMIO register writes.  I still haven't fully understood what is necessary to ensure the proper order:  http://lkml.org/lkml/2008/3/27/10
We do, however, need to look into the need for write barriers in fw-ohci.
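
The ordering concern above can be illustrated with a user-space analogue; this is only a sketch and not fw-ohci code. The descriptor must be fully written before the "go" signal becomes visible to the consumer. Here a C11 release fence stands in for a kernel write barrier such as wmb(), and an atomic variable stands in for the MMIO register. The types fake_controller and descriptor and the functions submit() and consume() are hypothetical names, and whether fw-ohci actually needs such a barrier, and where, is exactly the open question.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

/* Stand-in for a DMA descriptor that the controller reads from memory. */
struct descriptor {
    uint32_t data_address;
    uint32_t length;
};

/* Hypothetical model: "doorbell" plays the role of an MMIO run/wake register. */
struct fake_controller {
    struct descriptor desc;
    atomic_uint doorbell;
};

/* Producer side: fill in the descriptor first, then ring the doorbell. */
static void submit(struct fake_controller *c, uint32_t addr, uint32_t len)
{
    assert(len != 0);
    c->desc.data_address = addr;
    c->desc.length = len;

    /* Analogue of a write barrier: descriptor stores may not be reordered
     * after the doorbell store that follows. */
    atomic_thread_fence(memory_order_release);
    atomic_store_explicit(&c->doorbell, 1u, memory_order_relaxed);
}

/* Consumer side: only read the descriptor once the doorbell was seen. */
static int consume(struct fake_controller *c, struct descriptor *out)
{
    if (atomic_load_explicit(&c->doorbell, memory_order_relaxed) == 0)
        return 0;

    /* Pairs with the release fence in submit(). */
    atomic_thread_fence(memory_order_acquire);
    *out = c->desc;
    return 1;
}
```

Without the fence pair, a consumer on another CPU could in principle observe the doorbell before the descriptor contents, which is the user-space counterpart of a controller fetching a stale descriptor.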

The other thing we need to look into is error handling in fw-sbp2.  Most bridge chips cope with fetch agent reset like we expect it.  But PL-3507 is known to not react on fetch agent reset anymore after certain failure conditions.  We have at least one way to inject failure conditions of this sort:  By deactivating the ack_busy_x related fix from Jarod.  Having a sufficiently reliable fault injection method in place, we then need to try other device reset methods.  If one of them is able to bring PL-3507 back into action, we need to figure out how to properly escalate from fetch agent reset to the more aggressive device reset.
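
The escalation idea in the paragraph above could be structured roughly as below. This is a hedged sketch with simulated devices, not fw-sbp2 code: the comment does not say which stronger reset methods would be tried, so stronger_reset(), reset_ladder, fake_device, and escalate_reset() are all hypothetical names, and the "does the device respond again" check is reduced to a stub.

```c
#include <assert.h>
#include <stddef.h>

/* A reset attempt returns 1 if the device responds afterwards, else 0. */
typedef int (*reset_fn)(void *device);

struct reset_step {
    const char *name;
    reset_fn attempt;
};

/* Simulated device: recovers only at or above a given escalation level. */
struct fake_device {
    int min_level_needed;
    int attempts;
};

static int try_level(struct fake_device *d, int level)
{
    d->attempts++;
    return level >= d->min_level_needed;
}

/* Level 0: the reset that most bridge chips handle as expected. */
static int fetch_agent_reset(void *dev)
{
    return try_level(dev, 0);
}

/* Level 1: placeholder for a more aggressive, unspecified reset method. */
static int stronger_reset(void *dev)
{
    return try_level(dev, 1);
}

static const struct reset_step reset_ladder[] = {
    { "fetch agent reset", fetch_agent_reset },
    { "stronger reset (hypothetical)", stronger_reset },
};

/* Try each step in order; return the index of the first that works, or -1. */
static int escalate_reset(const struct reset_step *steps, size_t n, void *device)
{
    size_t i;

    assert(steps != NULL || n == 0);
    for (i = 0; i < n; i++) {
        if (steps[i].attempt(device))
            return (int)i;
    }
    return -1;
}
```

A device that recovers from fetch agent reset stops at index 0 after one attempt, while one that ignores it, as described for the PL-3507 after certain failures, moves on to the next step. A return value of -1 would correspond to giving up and taking the device offline.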
Comment 23 Stefan Richter 2008-06-01 05:02:01 UTC
Another recent discussion of memory access vs. MMIO access ordering:
http://lkml.org/lkml/2008/5/26/297
http://lkml.org/lkml/2008/5/27/223
Comment 24 Konstantin A. Lepikhov 2008-06-05 15:24:21 UTC
After some playing with copying large files, my enclosure is dead. It seems it was a power/bad-contact problem, because the drive occasionally switched off just after initialization and spin-up. I will try to test and fix the electric circuit, but I suspect this bug may be closed for now until the hardware is rechecked.

PS Sorry for noise and many thanks for interest and useful info about PL-3507 internals.
Comment 25 Stefan Richter 2008-06-05 16:04:19 UTC
BTW, in the meantime I got another PL-3507 enclosure, this time a 2.5" HDD enclosure.  It shows a firmware_revision 0x012804 and model_id 0x000001.  So far this thing worked very well.

If you get your enclosure fixed and the 128k transfer issue reliably resurfaces again, please reopen this bug.

I created a new bug about the fetch agent reset issue with PL-3507, bug 10867, so that I don't lose sight of this subissue.
