Latest working kernel version: N/A
Earliest failing kernel version: N/A
Distribution: ALT Linux Sisyphus (20071221)
Hardware Environment: i586 system w/ AMD Athlon(tm) 64 X2 Dual Core Processor 3800+, ASUS M2R32-MVP motherboard, 4096 MB RAM, 2x400 GB SATA drives.
Software Environment: kernel-2.6.24.4 + glibc 2.5.1.
Problem Description: Without a special workaround, the Prolific PL3507 bridge hangs while copying large amounts of data (e.g. .avi files).

Steps to reproduce: install a PL3507 bridge (PATA/USB/FireWire case), start copying some .avi files to the drive inside the PL3507 enclosure, and see a hang that looks like this:

Mar 27 01:54:46 lks kernel: sd 34:0:0:0: [sdd] Attached SCSI disk
Mar 27 01:54:46 lks kernel: sd 34:0:0:0: Attached scsi generic sg4 type 0
Mar 27 01:55:09 lks kernel: firewire_core: phy config: card 0, new root=ffc1, gap_count=5
Mar 27 01:55:09 lks kernel: scsi35 : SBP-2 IEEE-1394
Mar 27 01:55:09 lks kernel: firewire_core: created new fw device fw1 (0 config rom retries)
Mar 27 01:55:09 lks kernel: firewire_sbp2: logged in to sbp2 unit fw1.0 (0 retries)
Mar 27 01:55:09 lks kernel: firewire_sbp2: - management_agent_address: 0xfffff0010000
Mar 27 01:55:09 lks kernel: firewire_sbp2: - command_block_agent_address: 0xfffff0010020
Mar 27 01:55:09 lks kernel: firewire_sbp2: - status write address: 0x000100000000
Mar 27 01:55:09 lks kernel: scsi 35:0:0:0: Direct-Access-RBC ST317242 A PQ: 0 ANSI: 4
Mar 27 01:55:09 lks kernel: sd 35:0:0:0: [sde] 33683328 512-byte hardware sectors (17246 MB)
Mar 27 01:55:09 lks kernel: sd 35:0:0:0: [sde] Write Protect is off
Mar 27 01:55:09 lks kernel: sd 35:0:0:0: [sde] Mode Sense: 11 00 00 00
Mar 27 01:55:09 lks kernel: sd 35:0:0:0: [sde] Write cache: enabled, read cache: enabled, doesn't support DPO or FUA
Mar 27 01:55:09 lks kernel: sd 35:0:0:0: [sde] 33683328 512-byte hardware sectors (17246 MB)
Mar 27 01:55:09 lks kernel: sd 35:0:0:0: [sde] Write Protect is off
Mar 27 01:55:09 lks kernel: sd 35:0:0:0: [sde] Mode Sense: 11 00 00 00
Mar 27 01:55:09 lks kernel: sd 35:0:0:0: [sde] Write cache: enabled, read cache: enabled, doesn't support DPO or FUA
Mar 27 01:55:09 lks kernel: sde: sde1 sde2 < sde5 >
Mar 27 01:55:09 lks kernel: sd 35:0:0:0: [sde] Attached SCSI disk
Mar 27 01:55:09 lks kernel: sd 35:0:0:0: Attached scsi generic sg5 type 14
Mar 27 01:56:25 lks kernel: SGI XFS with ACLs, security attributes, realtime, large block numbers, no debug enabled
Mar 27 01:56:25 lks kernel: SGI XFS Quota Management subsystem
Mar 27 01:56:25 lks kernel: XFS mounting filesystem sde5
Mar 27 01:56:25 lks kernel: Ending clean XFS mount for filesystem: sde5
Mar 27 01:58:27 lks kernel: firewire_sbp2: sbp2_scsi_abort
Mar 27 01:58:27 lks kernel: firewire_sbp2: sbp2_scsi_abort
Mar 27 01:58:27 lks kernel: sd 35:0:0:0: rejecting I/O to offline device
Mar 27 01:58:27 lks last message repeated 114 times
Mar 27 01:58:27 lks kernel: Buffer I/O error on device sde5, logical block 1175581
Mar 27 01:58:27 lks kernel: Buffer I/O error on device sde5, logical block 1175582
Mar 27 01:58:27 lks kernel: Buffer I/O error on device sde5, logical block 1175583
Mar 27 01:58:27 lks kernel: Buffer I/O error on device sde5, logical block 1175584
Mar 27 01:58:27 lks kernel: Buffer I/O error on device sde5, logical block 1175585
Mar 27 01:58:27 lks kernel: Buffer I/O error on device sde5, logical block 1175586
Mar 27 01:58:27 lks kernel: Buffer I/O error on device sde5, logical block 1175587
Mar 27 01:58:27 lks kernel: Buffer I/O error on device sde5, logical block 1175588
Mar 27 01:58:27 lks kernel: Buffer I/O error on device sde5, logical block 1175589
Mar 27 01:58:27 lks kernel: Buffer I/O error on device sde5, logical block 1175590
Mar 27 01:58:27 lks kernel: I/O error in filesystem ("sde5") meta-data dev sde5 block 0xa94df4 ("xlog_iodone") error 5 buf count 11264
Mar 27 01:58:27 lks kernel: Filesystem "sde5": Log I/O Error Detected. Shutting down filesystem: sde5
Mar 27 01:58:27 lks kernel: Please umount the filesystem, and rectify the problem(s)
Mar 27 01:58:27 lks kernel: sd 35:0:0:0: rejecting I/O to offline device
Mar 27 01:58:55 lks kernel: firewire_sbp2: management write failed, rcode 0x13

With attached patch:

Mar 27 23:31:05 lks kernel: firewire_core: phy config: card 0, new root=ffc1, gap_count=5
Mar 27 23:31:05 lks kernel: firewire_core: phy config: card 0, new root=ffc1, gap_count=5
Mar 27 23:31:05 lks kernel: scsi49 : SBP-2 IEEE-1394
Mar 27 23:31:05 lks kernel: firewire_sbp2: Workarounds for node fw1.0: 0x1 (firmware_revision 0x012804, model_id 0x000001)
Mar 27 23:31:05 lks kernel: firewire_core: created new fw device fw1 (0 config rom retries, S400)
Mar 27 23:31:05 lks kernel: firewire_sbp2: logged in to sbp2 unit fw1.0 (0 retries)
Mar 27 23:31:05 lks kernel: firewire_sbp2: - management_agent_address: 0xfffff0010000
Mar 27 23:31:05 lks kernel: firewire_sbp2: - command_block_agent_address: 0xfffff0010020
Mar 27 23:31:05 lks kernel: firewire_sbp2: - status write address: 0x000100000000
Mar 27 23:31:05 lks kernel: scsi 49:0:0:0: Direct-Access-RBC ST317242 A PQ: 0 ANSI: 4
Mar 27 23:31:05 lks kernel: sd 49:0:0:0: [sdg] 33683328 512-byte hardware sectors (17246 MB)
Mar 27 23:31:05 lks kernel: sd 49:0:0:0: [sdg] Write Protect is off
Mar 27 23:31:05 lks kernel: sd 49:0:0:0: [sdg] Mode Sense: 11 00 00 00
Mar 27 23:31:05 lks kernel: sd 49:0:0:0: [sdg] Write cache: enabled, read cache: enabled, doesn't support DPO or FUA
Mar 27 23:31:05 lks kernel: sd 49:0:0:0: [sdg] 33683328 512-byte hardware sectors (17246 MB)
Mar 27 23:31:05 lks kernel: sd 49:0:0:0: [sdg] Write Protect is off
Mar 27 23:31:05 lks kernel: sd 49:0:0:0: [sdg] Mode Sense: 11 00 00 00
Mar 27 23:31:05 lks kernel: sd 49:0:0:0: [sdg] Write cache: enabled, read cache: enabled, doesn't support DPO or FUA
Mar 27 23:31:05 lks kernel: sdg: sdg1 sdg2 < sdg5 >
Created attachment 15468: Prolific PL3507 bridge workaround patch
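The attached patch itself is not reproduced in this thread. As a rough illustration of the general idea only, the following userspace sketch models a quirk table keyed on the firmware_revision and model_id values printed in the "With attached patch" log, and the effect of a 128 KiB transfer cap on a device with 512-byte sectors. The names Quirk, QUIRKS, lookup_workarounds and max_transfer_sectors are hypothetical, the reading of workaround bit 0x1 as the 128k limit is an assumption based on the discussion here, and the exact-match rule is a simplification; the driver's real matching logic may differ.

```python
from typing import NamedTuple, Optional

# Assumption: bit 0x1 in "Workarounds for node fw1.0: 0x1" is the 128k limit.
WORKAROUND_128K_MAX_TRANS = 0x1


class Quirk(NamedTuple):
    firmware_revision: int
    model_id: int
    workarounds: int


# One illustrative entry, using the IDs from the log above (hypothetical table).
QUIRKS = [
    Quirk(firmware_revision=0x012804, model_id=0x000001,
          workarounds=WORKAROUND_128K_MAX_TRANS),
]


def lookup_workarounds(firmware_revision: int, model_id: int) -> int:
    """Return the OR of the workaround bits of all matching entries (0 if none)."""
    bits = 0
    for quirk in QUIRKS:
        if (quirk.firmware_revision == firmware_revision
                and quirk.model_id == model_id):
            bits |= quirk.workarounds
    return bits


def max_transfer_sectors(workarounds: int, sector_size: int = 512) -> Optional[int]:
    """Sectors per request under the 128 KiB cap, or None when no cap applies."""
    if workarounds & WORKAROUND_128K_MAX_TRANS:
        return 128 * 1024 // sector_size
    return None
```

With 512-byte sectors, as reported for this disk, a 128 KiB cap corresponds to 256 sectors per request.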
I also checked that the latest Prolific firmware (*) still needs this patch.

(*) http://www.prolific.com.tw/eng/downloads.asp?ID=44
Reply-To: stefanr@s5r6.in-berlin.de

Please try 2.6.25-rc7 (or 2.6.25 when it comes out, or 2.6.24 + firewire patches from http://me.in-berlin.de/~s5r6/linux1394/updates/ --- whatever suits you most) without the 128k workaround.

Among the other misc fixes and improvements, the patch "firewire: fw-sbp2: set single-phase retry_limit" should considerably stabilize PL3507 based devices.
http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commit;h=51f9dbef5be41f3ff6000c874741a3a357f9bad7

I suspect the PL3507 does not need the 128k request size limit.
Yeah, I've got several PL-3507 devices which work flawlessly after the patch referenced in comment #3, without any 128k max xfer workaround. I believe before the patch, I was actually still able to trigger an I/O hang even with the 128k max xfer workaround enabled, it was just harder to trigger.
Hmm, it looks like these patches didn't help on 2.6.22.y (sorry, that is the only one I have at home). The dmesg diagnostics have changed, but the I/O timeouts and errors are still there:

ACPI: PCI Interrupt 0000:03:03.0[A] -> GSI 22 (level, low) -> IRQ 21
firewire_ohci: Added fw-ohci device 0000:03:03.0, OHCI version 1.10
firewire_core: created device fw0: GUID 0011d80000d4626d, S400
firewire_core: created device fw1: GUID 0050770e00071002, S400
firewire_core: phy config: card 0, new root=ffc1, gap_count=5
scsi51 : SBP-2 IEEE-1394
firewire_sbp2: fw1.0: logged in to LUN 0000 (0 retries)
scsi 51:0:0:0: Direct-Access-RBC ST317242 A PQ: 0 ANSI: 4
sd 51:0:0:0: [sdh] 33683328 512-byte hardware sectors (17246 MB)
sd 51:0:0:0: [sdh] Write Protect is off
sd 51:0:0:0: [sdh] Mode Sense: 11 00 00 00
sd 51:0:0:0: [sdh] Write cache: enabled, read cache: enabled, doesn't support DPO or FUA
sd 51:0:0:0: [sdh] 33683328 512-byte hardware sectors (17246 MB)
sd 51:0:0:0: [sdh] Write Protect is off
sd 51:0:0:0: [sdh] Mode Sense: 11 00 00 00
sd 51:0:0:0: [sdh] Write cache: enabled, read cache: enabled, doesn't support DPO or FUA
sdh: sdh1 sdh2 < sdh5 >
sd 51:0:0:0: [sdh] Attached SCSI disk
sd 51:0:0:0: Attached scsi generic sg8 type 14
XFS mounting filesystem sdh5
Ending clean XFS mount for filesystem: sdh5
firewire_sbp2: fw1.0: sbp2_scsi_abort
firewire_sbp2: fw1.0: sbp2_scsi_abort
sd 51:0:0:0: scsi: Device offlined - not ready after error recovery
sd 51:0:0:0: [sdh] Result: hostbyte=DID_BUS_BUSY driverbyte=DRIVER_OK,SUGGEST_OK
end_request: I/O error, dev sdh, sector 32745770
sd 51:0:0:0: rejecting I/O to offline device
...
sd 51:0:0:0: [sdh] Result: hostbyte=DID_NO_CONNECT driverbyte=DRIVER_OK,SUGGEST_OK
end_request: I/O error, dev sdh, sector 32746794
sd 51:0:0:0: rejecting I/O to offline device
sd 51:0:0:0: rejecting I/O to offline device
sd 51:0:0:0: rejecting I/O to offline device
sd 51:0:0:0: rejecting I/O to offline device
sd 51:0:0:0: rejecting I/O to offline device
sd 51:0:0:0: rejecting I/O to offline device
printk: 14600 messages suppressed.
Buffer I/O error on device sdh5, logical block 2657404
lost page write due to I/O error on sdh5
Buffer I/O error on device sdh5, logical block 2657405
lost page write due to I/O error on sdh5
Buffer I/O error on device sdh5, logical block 2657406
lost page write due to I/O error on sdh5
Buffer I/O error on device sdh5, logical block 2657407
lost page write due to I/O error on sdh5
Buffer I/O error on device sdh5, logical block 2657408
lost page write due to I/O error on sdh5
Buffer I/O error on device sdh5, logical block 2657409
lost page write due to I/O error on sdh5
Buffer I/O error on device sdh5, logical block 2657410
lost page write due to I/O error on sdh5
Buffer I/O error on device sdh5, logical block 2657411
lost page write due to I/O error on sdh5
Buffer I/O error on device sdh5, logical block 2657412
lost page write due to I/O error on sdh5
Buffer I/O error on device sdh5, logical block 2657413
lost page write due to I/O error on sdh5
I/O error in filesystem ("sdh5") meta-data dev sdh5 block 0xa9457a ("xlog_iodone") error 5 buf count 5632
xfs_force_shutdown(sdh5,0x2) called from line 960 of file fs/xfs/xfs_log.c. Return address = 0xfa3deb8c
Filesystem "sdh5": Log I/O Error Detected. Shutting down filesystem: sdh5
Please umount the filesystem, and rectify the problem(s)
sd 51:0:0:0: rejecting I/O to offline device
2.6.22.y or 2.6.24.y?
If you plug the device out and back in (if bus-powered) or power-cycle it (if self-powered) after the failure happened, will the controller detect that there was a device plugged in again? I ask in order to verify that this is not bug 10307.
(In reply to comment #6)
> 2.6.22.y or 2.6.24.y?

I only tested 2.6.22.19 because I don't have the 2.6.24.y sources at hand right now.

(In reply to comment #7)
> If you plug the device out and back in (if bus-powered) or power-cycle it (if
> self-powered) after the failure happened, will the controller detect that there
> was a device plugged in again?
>
> I ask in order to verify that this is not bug 10307.

No; after the failure and a power off/on cycle, the device is detected properly.
lspci -xnnvvv output for the onboard IEEE 1394 controller:

03:03.0 FireWire (IEEE 1394) [0c00]: VIA Technologies, Inc. IEEE 1394 Host Controller [1106:3044] (rev c0) (prog-if 10 [OHCI])
    Subsystem: ASUSTeK Computer Inc. Device [1043:81fe]
    Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV+ VGASnoop- ParErr- Stepping- SERR+ FastB2B- DisINTx-
    Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=medium >TAbort- <TAbort- <MAbort- >SERR- <PERR+ INTx-
    Latency: 64 (8000ns max), Cache Line Size: 64 bytes
    Interrupt: pin A routed to IRQ 21
    Region 0: Memory at fbfff000 (32-bit, non-prefetchable) [size=2K]
    Region 1: I/O ports at ec00 [size=128]
    Capabilities: <access denied>
    Kernel driver in use: firewire_ohci
00: 06 11 44 30 17 01 10 82 c0 10 00 0c 10 40 00 00
10: 00 f0 ff fb 01 ec 00 00 00 00 00 00 00 00 00 00
20: 00 00 00 00 00 00 00 00 00 00 00 00 43 10 fe 81
30: 00 00 00 00 50 00 00 00 00 00 00 00 03 01 00 20
VIA rev c0? I don't think I have seen this in other people's lspci before. Is it a VT6308 perhaps? Anyway, I don't expect fundamental differences between VIA chip series.

Since you tried Prolific's firmware update, I conclude that you don't have one of those really old PL3507 revisions which don't support firmware upload.

I only have one PL3507 myself; I use it with a DVD-RW but shall replace that by a HDD and test it with the various cards here, including VIA rev 46 and rev 80 (VT6306, VT6307). Of course my PL3507 runs different firmware than yours.

It would still be nice though if you could test one of the initially mentioned kernels when you have the opportunity, and check with stress tests that the 128k workaround really fixes it instead of just raising the bar for the problem to happen.

OTOH, even if it just reduces the problem rather than fixing it (that might be ultimately impossible with PL3507s), we should perhaps adopt it at least as an improvement.

The existing 128k workaround is a priori for a specific mess-up in a particular old ex Symbios now LSI Logic bridge:
ftp://ftp2.de.freebsd.org/pub/misc/specs/symbios/symchips/1394/UPDATED_configuration_ROMs/ReadMe.doc
This is basically a first-generation SBP-2 chip which was quickly made obsolete by the much better performing main competitor OXFW911.
And there is one more thing that we should avoid mixing up here: PL3507 will produce IO errors if they receive INQUIRY commands out of the order they expect. (They should optimally get INQUIRY only as the very first command; but at least immediately followed by READ_CAPACITY.) This means that you should ideally perform your tests with hald and any other programs which query storage devices (amarok?) disabled, to exclude other failure sources.
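To make the ordering rule above concrete, here is a small sketch that scans a recorded command sequence and flags suspicious INQUIRY commands. It rests on one reading of the rule, which is an assumption: an INQUIRY is acceptable if it is the very first command, or if it is immediately followed by READ_CAPACITY. The function name misordered_inquiries and the plain string command names are hypothetical; nothing here talks to a real device.

```python
from typing import List


def misordered_inquiries(commands: List[str]) -> List[int]:
    """Return the indices of INQUIRY commands that break the assumed rule.

    Assumed rule: INQUIRY is fine as the very first command, or when the
    next command is READ_CAPACITY. Anything else is flagged.
    """
    flagged = []
    for index, command in enumerate(commands):
        if command != "INQUIRY":
            continue
        is_first = index == 0
        followed_by_read_capacity = (
            index + 1 < len(commands) and commands[index + 1] == "READ_CAPACITY"
        )
        if not (is_first or followed_by_read_capacity):
            flagged.append(index)
    return flagged
```

A trace in which a polling program such as hald injects an INQUIRY in the middle of ongoing writes would be flagged by this check, which is the scenario the tests should exclude.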
Sorry for the delay, I was too busy to test :(

Last weekend I tested 2.6.24.4 + patches from http://me.in-berlin.de/~s5r6/linux1394/updates/ on another box (i965 based) with a different FireWire controller and didn't find any problems. Everything works well, including transferring many big files in both directions, without any quirks or workarounds in the sbp2 driver. So I suspect it is definitely a 1394 controller hardware problem and we need some quirks in the OHCI driver.

Anyway, I have 2 notebooks at hand with various chipsets and FireWire controllers, but I need a special ("little") FireWire cable and some time to test this.
Thanks for this additional testing. Are there any fundamental differences between this better working setup and the nonworking one (besides the different controller and mainboard) --- such as more or less RAM, uniprocessor vs. SMP, CONFIG_PREEMPT_NONE vs. CONFIG_PREEMPT, x86-32 vs. x86-64, or whatever else... BTW, the firewire drivers were buggy under x86-64 before patch 691 "firewire: fw-ohci: plug dma memory leak in AR handler" (posted end of March, merged in Linux 2.6.25-rc8), and very noticeably broken under x86-64 + > 3GB RAM before patch 674 "firewire: fw-ohci: use dma_alloc_coherent for ar_buffer" (posted mid of March, merged in Linux 2.6.25-rc6 or so).
(In reply to comment #13)
> Thanks for this additional testing.
>
> Are there any fundamental differences between this better working setup and the
> nonworking one (besides the different controller and mainboard) --- such as
> more or less RAM, uniprocessor vs. SMP, CONFIG_PREEMPT_NONE vs. CONFIG_PREEMPT,
> x86-32 vs. x86-64, or whatever else...

No, all tested systems have the same kernel (2.6.24.4 + linux1394 patches) and arch (x86-32), but only the first, buggy box has > 3 GB RAM. However, this mainboard (ASUS M2R32-MVP) is _very_ buggy. For example, a 2.6.24.y kernel simply hangs under heavy USB activity or when just removing a USB stick. So, unfortunately, there are many other possible correlations in this case.

> BTW, the firewire drivers were buggy under x86-64 before patch 691 "firewire:
> fw-ohci: plug dma memory leak in AR handler" (posted end of March, merged in
> Linux 2.6.25-rc8), and very noticeably broken under x86-64 + > 3GB RAM before
> patch 674 "firewire: fw-ohci: use dma_alloc_coherent for ar_buffer" (posted mid
> of March, merged in Linux 2.6.25-rc6 or so).

Okay, that will take a while because I need to build a live CD to test an x86-64 environment. I'll try that, but not before next weekend.
No, since you already used x86-32 on both systems, it is not necessary for you to run tests with x86-64. I'm just not sure how AMD based boards have (or have not) been affected by the DMA mapping bugs, therefore you should update the PC with M2R32-MVP to a firewire patch >= v691 if you didn't do so already. I will try to do some own tests with a PL3507 enclosure + HDD this week, as mentioned in comment #10.
Oops, I meant to add myself to the cc list on this one when I commented last...
Okay, today's testing results with 2.6.24.4 + ieee1394 patches (the last patch is "Subject: firewire: fw-ohci: work around generation bug in TI controllers (fix AV/C and more)"):

- Write errors occur very rarely, but strange things happen: sometimes the system hangs during rmmod of the firewire-ohci module; sometimes rmmod is OK, but when I insmod firewire-ohci again I see the familiar symptoms in dmesg:

[ 742.049744] firewire_sbp2: fw1.0: sbp2_scsi_abort
[ 742.049794] sd 9:0:0:0: Device offlined - not ready after error recovery
[ 742.049811] sd 9:0:0:0: [sdd] Result: hostbyte=DID_BUS_BUSY driverbyte=DRIVER_OK,SUGGEST_OK
[ 742.049815] end_request: I/O error, dev sdd, sector 415
[ 742.049819] Buffer I/O error on device sdd1, logical block 177
[ 742.049824] Buffer I/O error on device sdd1, logical block 178
[ 742.049827] Buffer I/O error on device sdd1, logical block 179
[ 742.049832] Buffer I/O error on device sdd1, logical block 180
[ 742.049904] Buffer I/O error on device sdd1, logical block 181
[ 742.049910] Buffer I/O error on device sdd1, logical block 182
[ 742.049913] Buffer I/O error on device sdd1, logical block 183
[ 742.049916] Buffer I/O error on device sdd1, logical block 184
[ 742.049919] Buffer I/O error on device sdd1, logical block 185
[ 742.049922] Buffer I/O error on device sdd1, logical block 186
[ 742.049994] sd 9:0:0:0: rejecting I/O to offline device
[ 742.050004] sd 9:0:0:0: [sdd] Result: hostbyte=DID_NO_CONNECT driverbyte=DRIVER_OK,SUGGEST_OK
[ 742.050007] end_request: I/O error, dev sdd, sector 0
[ 742.050691] sd 9:0:0:0: rejecting I/O to offline device
[ 742.050709] sd 9:0:0:0: rejecting I/O to offline device
[ 742.050716] sd 9:0:0:0: rejecting I/O to offline device
[ 742.050723] sd 9:0:0:0: rejecting I/O to offline device
[ 779.526238] sd 9:0:0:0: [sdd] Synchronizing SCSI cache
[ 779.526528] sd 9:0:0:0: [sdd] Result: hostbyte=DID_BUS_BUSY driverbyte=DRIVER_OK,SUGGEST_OK
[ 779.526644] firewire_sbp2: released fw1.0
<power off/on device>
[ 784.190125] firewire_core: phy config: card 1, new root=ffc1, gap_count=5
[ 784.254040] scsi10 : SBP-2 IEEE-1394
[ 784.254102] firewire_core: created device fw1: GUID 0050770e00071002, S400, 2 config ROM retries
[ 784.260475] firewire_sbp2: fw1.0: logged in to LUN 0000 (0 retries)
[ 784.260891] scsi 10:0:0:0: Direct-Access-RBC Maxtor 7 541 AP PQ: 0 ANSI: 4
[ 784.261250] sd 10:0:0:0: [sdd] 1061390 512-byte hardware sectors (543 MB)
[ 784.261362] sd 10:0:0:0: [sdd] Write Protect is off
[ 784.261365] sd 10:0:0:0: [sdd] Mode Sense: 11 00 00 00
[ 784.261696] sd 10:0:0:0: [sdd] Write cache: enabled, read cache: enabled, doesn't support DPO or FUA
[ 784.261984] sd 10:0:0:0: [sdd] 1061390 512-byte hardware sectors (543 MB)
[ 784.262102] sd 10:0:0:0: [sdd] Write Protect is off
[ 784.262105] sd 10:0:0:0: [sdd] Mode Sense: 11 00 00 00
[ 784.262426] sd 10:0:0:0: [sdd] Write cache: enabled, read cache: enabled, doesn't support DPO or FUA
[ 784.262431] sdd: sdd1
[ 784.270139] sd 10:0:0:0: [sdd] Attached SCSI disk
[ 784.270179] sd 10:0:0:0: Attached scsi generic sg4 type 14
[ 786.389308] reiser4: sdd1: found disk format 4.0.0.
Konstantin, what is the precise product name of the enclosure and who is the vendor? Did the firmware_revision marker in your patch match already before you updated the firmware of the device?
It's an Agestar ICB3A combo enclosure.

No, I updated the firmware first and then made the patch, so I don't remember the old marker. But I have a dump of the old firmware and may try to flash it again (I hope that's possible). I'll try this over the coming weekends and report the results here.
Note to self: MacPower Prefect II 5.25" enclosure with PL3507 and as-shipped firmware from MacPower: firmware_revision 0x100102
I stress-tested the device from comment #20 with a 400 GB HDD, reiserfs, and a loop of

$ cp -pr directory_with_14gigs_of_avi_files test_directory
$ cat test_directory/*/* > /dev/null
$ rm -rf test_directory/*

No problem. Even running "sg_inq" while the IO is going on does not do any harm. So this one has quite good firmware. (It only is quirky during login. Sometimes it takes some replugging before it agrees to start working.)

The stamp on the chip says
PL-3507
05124C

So whatever it is that throws your PL-3507 out of whack, it is either limited to Prolific's original firmware, or to certain chip revisions, or when certain IDE devices are attached, or a combination.

Will you find out the firmware_id of the older firmware (before you updated to the latest firmware) soon? Or shall I go ahead and add the patch as-is?

BTW, since Linux 2.6.25, drivers/ieee1394/sbp2.c does not limit the transfer size anymore either. Hence I would take over your patch for that driver as well. There is only a slight difference between firewire-sbp2 and sbp2 in so far as firewire-core performs gap count optimization which increases IEEE1394a throughput. This shouldn't matter to the problem at hand though if there is indeed a purely transfer size dependent device bug.
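For anyone who wants to repeat this kind of copy, read back, delete loop in a scripted form, the following is a minimal sketch of the same three steps. It is not the procedure that was actually run for this comment; the function names stress_once and stress and their parameters are hypothetical, and shutil.copytree only roughly corresponds to cp -pr (it preserves timestamps and mode bits by default, but not ownership).

```python
import shutil
from pathlib import Path
from typing import List, Union


def stress_once(source: Path, scratch: Path) -> int:
    """Copy `source` into `scratch`, read every file back, then delete the copy.

    Returns the number of bytes read back.
    """
    target = scratch / source.name
    shutil.copytree(source, target)          # roughly: cp -pr source scratch/
    total = 0
    for path in target.rglob("*"):           # roughly: cat ... > /dev/null
        if path.is_file():
            with path.open("rb") as handle:
                while chunk := handle.read(1 << 20):
                    total += len(chunk)
    shutil.rmtree(target)                    # roughly: rm -rf scratch/*
    return total


def stress(source: Union[str, Path], scratch: Union[str, Path],
           rounds: int) -> List[int]:
    """Run `rounds` iterations and return the bytes read back in each one."""
    return [stress_once(Path(source), Path(scratch)) for _ in range(rounds)]
```

Here `scratch` would be a directory on the filesystem inside the enclosure under test. An I/O failure like the ones logged earlier in this bug would surface as an OSError from the copy or the read.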
There are two independent things we need to do:
- Check if fw-ohci properly guarantees the necessary order of memory and register accesses.
- Find a way to reset PL-3507 when fetch agent reset doesn't work.

Long story:

Now that I reread the whole bug history, I think the transfer size limit patch is not really what we need. Comment #9, comment #12, comment #14 are significant. It is not the PL-3507 which needs the transfer size limit, it is the board --- which needs the limit not as a fix, but to mask out driver bugs and/or board quirks.

I may be deluded, but I like to believe that the firewire code which you last tested has all the necessary fixes for DMA mapping. (For the physical response unit: DMA mapping in fw-sbp2, which I believe to have fixed long ago; for asynchronous transmission and reception: DMA mapping in fw-ohci, which should be fixed since the conversion of AR DMA to coherent allocation, which is AFAIR included in 2.6.24.) Also, Jarod already extensively tested firewire-sbp2 on some boards with IOMMU and RAM outside the 32bit PCI physical address range.

But wait, there may still be at least theoretical bugs in fw-ohci regarding the ordering of the CPU writing to DMA buffers (including descriptors) and MMIO register writes. I still haven't fully understood what is necessary to ensure the proper order: http://lkml.org/lkml/2008/3/27/10 We do however need to look into the need for write barriers in fw-ohci.

The other thing we need to look into is error handling in fw-sbp2. Most bridge chips cope with fetch agent reset like we expect it. But PL-3507 is known to not react to fetch agent reset anymore after certain failure conditions. We have at least one way to inject failure conditions of this sort: by deactivating the ack_busy_x related fix from Jarod. Having a sufficiently reliable fault injection method in place, we then need to try other device reset methods. If one of them is able to bring PL-3507 back into action, we need to figure out how to properly escalate from fetch agent reset to the more aggressive device reset.
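The escalation idea can be pictured with a small simulation: try the mildest reset first and move on to more aggressive ones only while the device stays unresponsive. Everything below is a sketch under stated assumptions, not fw-sbp2 code. The function recover, the FakeBridge stand-in, and the particular ladder of steps are hypothetical; this thread does not say which reset methods should be tried or in which order, so the list only borrows names of SBP-2 concepts for illustration.

```python
from typing import Callable, List, Optional, Tuple

ResetStep = Tuple[str, Callable[[], None]]


def recover(steps: List[ResetStep],
            is_responsive: Callable[[], bool]) -> Optional[str]:
    """Try reset methods from mildest to most aggressive.

    Returns the name of the first step after which the device responds,
    or None if no step helped.
    """
    for name, do_reset in steps:
        do_reset()
        if is_responsive():
            return name
    return None


class FakeBridge:
    """Stand-in for a wedged bridge that ignores resets weaker than `needs`."""

    # Hypothetical escalation ladder, mildest first.
    LEVELS = ["fetch_agent_reset", "logical_unit_reset", "target_reset", "relogin"]

    def __init__(self, needs: Optional[str]) -> None:
        self.needs = needs          # None models a device that never recovers
        self.responsive = False

    def reset(self, level: str) -> None:
        if self.needs is None:
            return
        if self.LEVELS.index(level) >= self.LEVELS.index(self.needs):
            self.responsive = True

    def steps(self) -> List[ResetStep]:
        return [(level, lambda level=level: self.reset(level))
                for level in self.LEVELS]
```

Combined with a fault injection method like the one mentioned above, a harness of this shape would report which reset level, if any, brings the device back.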
Another recent discussion of memory access vs. MMIO access ordering: http://lkml.org/lkml/2008/5/26/297 http://lkml.org/lkml/2008/5/27/223
After some playing with copying large files, my enclosure is dead. It seems to have been a power or bad-contact problem, because the drive occasionally switched off just after initialization and spin-up. I will try to test and fix the electrical circuit, but I suspect this bug may be closed for now, until the hardware has been rechecked.

PS: Sorry for the noise, and many thanks for the interest and the useful info about PL-3507 internals.
BTW, in the meantime I got another PL-3507 enclosure, this time a 2.5" HDD enclosure. It shows a firmware_revision 0x012804 and model_id 0x000001. So far this thing worked very well. If you get your enclosure fixed and the 128k transfer issue reliably resurfaces again, please reopen this bug. I created a new bug about the fetch agent reset issue with PL-3507, bug 10867, so that I don't lose sight of this subissue.