Bug 9734
| Summary: | I/O error when plugging/unplugging a second firewire disk | | |
|---|---|---|---|
| Product: | Drivers | Reporter: | sbu (sbu) |
| Component: | IEEE1394 | Assignee: | Stefan Richter (stefanr) |
| Status: | CLOSED CODE_FIX | | |
| Severity: | high | CC: | bunk, jarod, max |
| Priority: | P1 | | |
| Hardware: | All | | |
| OS: | Linux | | |
| Kernel Version: | 2.6.22...2.6.24 | Subsystem: | |
| Regression: | --- | Bisected commit-id: | |
Description
sbu
2008-01-12 07:47:42 UTC
If you reconfigure the kernel to use sbp2 + ieee1394 + ohci1394 instead of firewire-sbp2 + firewire-core + firewire-ohci, the problem will probably go away. Hence this is not a regression, except from the point of view of users of prebuilt kernel packages whose distributor chose to deactivate the "classic" drivers and enable only the new experimental drivers. Could you try the old driver stack under 2.6.24-rc7?

Also please clarify: Does the failure happen in general, or only if you connect the 2nd disk _while_ the dd or other accesses to the 1st disk are running?

#1 We really would like to use a prebuilt kernel, not a customized one. It's a workstation heavily used in production and we are not kernel gurus.

#2 It happens only when I connect the 2nd disk. Possibly relevant: there is another Texas Instruments TSB82AA2 firewire card plugged into that workstation (2 in total, each one with 2 ports). If I connect the disk to one of the ports of the other card, everything works OK.

> it happens only when I connect the 2nd disk
If you stop all I/O to the 1st disk, then add the 2nd disk (to the same card), wait a few seconds, then attempt I/O to the 1st disk --- would that work?
Anyway, we of course need to try to harden the driver to survive hotplug events even during heavy I/O. (But even if we manage to do so, it isn't exactly advisable to run I/O while hotplugging on the same bus, because successful resumption after such disturbances also depends on all the other devices on the FireWire bus working properly.)
> We really would like to use a prebuilt kernel, not a customized one.
> It's a workstation heavily used in production and we are not
> kernel gurus.

That's why distributors of pre-built kernel packages should either build only the old drivers, or build old and new drivers but blacklist the new drivers. As a hint, the new driver stack is marked EXPERIMENTAL in the kernel build configurator. We also added a link to http://wiki.linux1394.org/JujuMigration in the help text of the new driver stack in the kernel configurator at some point; alas perhaps too late for some distributors to notice. Or these kernel packagers should at least track firewire driver development as closely and quickly (and participate as actively in fixing the drivers) as the Fedora maintainers do.

Sorry, we should have somehow provided some huge flashy warning signs (more flashy than the big-letter EXPERIMENTAL tag...) as a caveat emptor to distributors when we added the new alternative drivers to the kernel. But there is only so much you can do, plus a number of problems with the new driver stack were of course unforeseen at the time.
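For users of prebuilt kernels that ship both stacks, blacklisting the new drivers is usually done with a modprobe configuration file. A minimal sketch of such a file follows; the file name and location are only an example (they vary by distribution), and the module names are the ones discussed in this report:

```
# Example only -- file name and path vary by distribution, e.g. a file
# under /etc/modprobe.d/. Prevent the new "Juju" stack from loading so
# the classic ieee1394/ohci1394/sbp2 drivers bind to the controller:
blacklist firewire-core
blacklist firewire-ohci
blacklist firewire-sbp2
```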
Reply-To: James.Bottomley@HansenPartnership.com

Firewire list cc'd

On Sat, 2008-01-12 at 07:47 -0800, bugme-daemon@bugzilla.kernel.org wrote:
> http://bugzilla.kernel.org/show_bug.cgi?id=9734
>
> Summary: I/O error when inserting a second firewire sata disk
> Product: IO/Storage
> Version: 2.5
> KernelVersion: 2.6.24 rc7
> Platform: All
> OS/Version: Linux
> Tree: Mainline
> Status: NEW
> Severity: high
> Priority: P1
> Component: SCSI
> AssignedTo: linux-scsi@vger.kernel.org
> ReportedBy: sbu@squadrainformatica.com
>
> Latest working kernel version: 2.6.18-5
> (can't test any other kernel between 2.6.18-5 and 2.6.22-3 because they aren't in the repo)
>
> Earliest failing kernel version: 2.6.22-3
>
> Distribution: debian lenny 32bit AND 64 bit (it happens the same on both)
>
> Hardware Environment: ibm x3400 quad core 2GB ram type/number 7976-KBG bios version 1.56
> http://www-03.ibm.com/systems/x/tower/x3400/specs.html
> cpu info: -> http://www.pastebin.org/15078
> lspci firewire: -> http://www.pastebin.org/15081
> lsmod: -> http://www.pastebin.org/15083
>
> Software Environment: bash
>
> Problem Description:
> We have a 64bit pci double firewire 800 port card to which I attach 2 sata hdd (the brand
> of the hdd doesn't matter; we tried many). Everything works well as long as we use only
> one disk. Connecting a second disk while we are working on the first one (i.e. writing to
> the first device) causes the interruption of the first job. Once a kernel freeze also
> happened, but we can't document and reproduce it right now.
>
> Steps to reproduce:
> We attach a sata disk to the first port of the pci firewire B controller
>
> 08:02.0 FireWire (IEEE 1394): Texas Instruments TSB82AA2 IEEE-1394b Link Layer Controller (rev 01)
>
> Jan 12 16:19:13 x3400 kernel: ieee1394: Error parsing configrom for node 1-00:1023
> Jan 12 16:19:13 x3400 kernel: ieee1394: Node changed: 1-00:1023 -> 1-01:1023
> Jan 12 16:19:14 x3400 kernel: ieee1394: Node resumed: ID:BUS[1-00:1023] GUID[0030e002e0454697]
> Jan 12 16:19:14 x3400 kernel: scsi9 : SBP-2 IEEE-1394
> Jan 12 16:19:15 x3400 kernel: ieee1394: sbp2: Logged into SBP-2 device
> Jan 12 16:19:15 x3400 kernel: ieee1394: Node 1-00:1023: Max speed [S800] - Max payload [4096]
> Jan 12 16:19:15 x3400 kernel: Vendor: WDC WD16  Model: 00JD-00HBC0  Rev: 08.0
> Jan 12 16:19:15 x3400 kernel: Type: Direct-Access-RBC  ANSI SCSI revision: 04
> Jan 12 16:19:15 x3400 kernel: SCSI device sdc: 312579695 512-byte hdwr sectors (160041 MB)
> Jan 12 16:19:15 x3400 kernel: sdc: Write Protect is off
> Jan 12 16:19:15 x3400 kernel: sdc: Mode Sense: 11 00 00 00
> Jan 12 16:19:15 x3400 kernel: SCSI device sdc: drive cache: write back
> Jan 12 16:19:15 x3400 kernel: SCSI device sdc: 312579695 512-byte hdwr sectors (160041 MB)
> Jan 12 16:19:15 x3400 kernel: sdc: Write Protect is off
> Jan 12 16:19:15 x3400 kernel: sdc: Mode Sense: 11 00 00 00
> Jan 12 16:19:15 x3400 kernel: SCSI device sdc: drive cache: write back
> Jan 12 16:19:15 x3400 kernel: sdc: unknown partition table
> Jan 12 16:19:15 x3400 kernel: sd 9:0:0:0: Attached scsi disk sdc
>
> now we launch:
> dd if=/dev/zero of=/dev/sdc
>
> everything ok until now
>
> now we attach another sata disc to the second port of the pci firewire controller:
>
> Jan 12 16:20:09 x3400 kernel: ieee1394: Error parsing configrom for node 1-00:1023
> Jan 12 16:20:09 x3400 kernel: ieee1394: Node changed: 1-00:1023 -> 1-01:1023
> Jan 12 16:20:09 x3400 kernel: ieee1394: Node changed: 1-01:1023 -> 1-02:1023
> Jan 12 16:20:10 x3400 kernel: ieee1394: Reconnected to SBP-2 device
> Jan 12 16:20:10 x3400 kernel: ieee1394: Node 1-01:1023: Max speed [S800] - Max payload [4096]
>
> using 2.6.18-5-686 everything works well
> dd still works
>
> now we disconnect the disk from the second port
>
> Jan 12 16:21:13 x3400 kernel: ieee1394: Node changed: 1-01:1023 -> 1-00:1023
> Jan 12 16:21:13 x3400 kernel: ieee1394: Node changed: 1-02:1023 -> 1-01:1023
> Jan 12 16:21:13 x3400 kernel: ieee1394: Reconnected to SBP-2 device
> Jan 12 16:21:13 x3400 kernel: ieee1394: Node 1-00:1023: Max speed [S800] - Max payload [4096]
>
> everything ok also disconnecting the second device
>
> -------------------------------------------------------
>
> now the same issue using 2.6.24-rc7-686:
>
> we attach a sata disk to the first port of the pci firewire B controller
>
> Jan 12 16:49:45 x3400 kernel: firewire_core: phy config: card 1, new root=ffc1, gap_count=5
> Jan 12 16:49:46 x3400 kernel: scsi11 : SBP-2 IEEE-1394
> Jan 12 16:49:46 x3400 kernel: firewire_core: created new fw device fw2 (2 config rom retries, S800)
> Jan 12 16:49:46 x3400 kernel: firewire_sbp2: logged in to fw2.0 LUN 0000 (0 retries)
> Jan 12 16:49:46 x3400 kernel: scsi 11:0:0:0: Direct-Access-RBC WDC WD16 00JD-00HBC0 08.0 PQ: 0 ANSI: 4
> Jan 12 16:49:46 x3400 kernel: sd 11:0:0:0: [sdc] 312579695 512-byte hardware sectors (160041 MB)
> Jan 12 16:49:46 x3400 kernel: sd 11:0:0:0: [sdc] Write Protect is off
> Jan 12 16:49:46 x3400 kernel: sd 11:0:0:0: [sdc] Mode Sense: 11 00 00 00
> Jan 12 16:49:46 x3400 kernel: sd 11:0:0:0: [sdc] Write cache: enabled, read cache: enabled, doesn't support DPO or FUA
> Jan 12 16:49:46 x3400 kernel: sd 11:0:0:0: [sdc] 312579695 512-byte hardware sectors (160041 MB)
> Jan 12 16:49:46 x3400 kernel: sd 11:0:0:0: [sdc] Write Protect is off
> Jan 12 16:49:46 x3400 kernel: sd 11:0:0:0: [sdc] Mode Sense: 11 00 00 00
> Jan 12 16:49:46 x3400 kernel: sd 11:0:0:0: [sdc] Write cache: enabled, read cache: enabled, doesn't support DPO or FUA
> Jan 12 16:49:46 x3400 kernel: sdc: unknown partition table
> Jan 12 16:49:46 x3400 kernel: sd 11:0:0:0: [sdc] Attached SCSI disk
>
> now we launch:
> dd if=/dev/zero of=/dev/sdc
>
> now we attach another sata disk to the second port of the pci firewire B controller:
>
> Jan 12 16:50:49 x3400 kernel: firewire_sbp2: orb reply timed out, rcode=0x11
> Jan 12 16:50:49 x3400 kernel: sd 11:0:0:0: [sdc] Result: hostbyte=DID_BUS_BUSY driverbyte=DRIVER_OK,SUGGEST_OK

Best I can tell, this is the source of the problem. The sbp2 driver is replying DID_BUS_BUSY until that gets sorted out, which seems to be never.

So, first pass analysis indicates the error to be in the firewire subsystem. I'm guessing from the message that it's actually drivers/firewire, not drivers/ieee1394?

> Jan 12 16:50:49 x3400 kernel: end_request: I/O error, dev sdc, sector 2942302
> Jan 12 16:50:49 x3400 kernel: printk: 571588 messages suppressed.
> Jan 12 16:50:49 x3400 kernel: Buffer I/O error on device sdc, logical block 2942302
> Jan 12 16:50:49 x3400 kernel: lost page write due to I/O error on sdc
> Jan 12 16:50:49 x3400 kernel: Buffer I/O error on device sdc, logical block 2942303
> Jan 12 16:50:49 x3400 kernel: lost page write due to I/O error on sdc
> Jan 12 16:50:49 x3400 kernel: Buffer I/O error on device sdc, logical block 2942304
> Jan 12 16:50:49 x3400 kernel: lost page write due to I/O error on sdc
> Jan 12 16:50:49 x3400 kernel: Buffer I/O error on device sdc, logical block 2942305
> Jan 12 16:50:49 x3400 kernel: lost page write due to I/O error on sdc
> Jan 12 16:50:49 x3400 kernel: Buffer I/O error on device sdc, logical block 2942306
> Jan 12 16:50:49 x3400 kernel: lost page write due to I/O error on sdc
> Jan 12 16:50:49 x3400 kernel: Buffer I/O error on device sdc, logical block 2942307
> Jan 12 16:50:49 x3400 kernel: lost page write due to I/O error on sdc
> Jan 12 16:50:49 x3400 kernel: Buffer I/O error on device sdc, logical block 2942308
> Jan 12 16:50:49 x3400 kernel: lost page write due to I/O error on sdc
> Jan 12 16:50:49 x3400 kernel: Buffer I/O error on device sdc, logical block 2942309
> Jan 12 16:50:49 x3400 kernel: lost page write due to I/O error on sdc
> Jan 12 16:50:49 x3400 kernel: Buffer I/O error on device sdc, logical block 2942310
> Jan 12 16:50:49 x3400 kernel: lost page write due to I/O error on sdc
> Jan 12 16:50:49 x3400 kernel: Buffer I/O error on device sdc, logical block 2942311
> Jan 12 16:50:49 x3400 kernel: lost page write due to I/O error on sdc
> Jan 12 16:50:49 x3400 kernel: sd 11:0:0:0: [sdc] Result: hostbyte=DID_BUS_BUSY driverbyte=DRIVER_OK,SUGGEST_OK
> Jan 12 16:50:49 x3400 kernel: end_request: I/O error, dev sdc, sector 2942557
> Jan 12 16:50:49 x3400 kernel: sd 11:0:0:0: [sdc] Result: hostbyte=DID_BUS_BUSY driverbyte=DRIVER_OK,SUGGEST_OK
> Jan 12 16:50:49 x3400 kernel: end_request: I/O error, dev sdc, sector 2942812
> Jan 12 16:50:49 x3400 kernel: sd 11:0:0:0: [sdc] Result: hostbyte=DID_BUS_BUSY driverbyte=DRIVER_OK,SUGGEST_OK
> Jan 12 16:50:49 x3400 kernel: end_request: I/O error, dev sdc, sector 2943067
> Jan 12 16:50:49 x3400 kernel: sd 11:0:0:0: [sdc] Result: hostbyte=DID_BUS_BUSY driverbyte=DRIVER_OK,SUGGEST_OK
> Jan 12 16:50:49 x3400 kernel: end_request: I/O error, dev sdc, sector 2943322
> Jan 12 16:50:49 x3400 kernel: sd 11:0:0:0: [sdc] Result: hostbyte=DID_BUS_BUSY driverbyte=DRIVER_OK,SUGGEST_OK
> Jan 12 16:50:49 x3400 kernel: end_request: I/O error, dev sdc, sector 2943577
> Jan 12 16:50:49 x3400 kernel: sd 11:0:0:0: [sdc] Result: hostbyte=DID_BUS_BUSY driverbyte=DRIVER_OK,SUGGEST_OK
> Jan 12 16:50:49 x3400 kernel: end_request: I/O error, dev sdc, sector 2943832
> Jan 12 16:50:49 x3400 kernel: sd 11:0:0:0: [sdc] Result: hostbyte=DID_BUS_BUSY driverbyte=DRIVER_OK,SUGGEST_OK
> Jan 12 16:50:49 x3400 kernel: end_request: I/O error, dev sdc, sector 2944087
> Jan 12 16:50:49 x3400 kernel: sd 11:0:0:0: [sdc] Result: hostbyte=DID_BUS_BUSY driverbyte=DRIVER_OK,SUGGEST_OK
> Jan 12 16:50:49 x3400 kernel: end_request: I/O error, dev sdc, sector 2944342
> Jan 12 16:50:49 x3400 kernel: sd 11:0:0:0: [sdc] Result: hostbyte=DID_BUS_BUSY driverbyte=DRIVER_OK,SUGGEST_OK
> Jan 12 16:50:49 x3400 kernel: end_request: I/O error, dev sdc, sector 2944597
>
> it goes on this way until we kill the dd
>
> thanks in advance
>
> damko & divilinux

James

Reply-To: stefanr@s5r6.in-berlin.de

James Bottomley wrote:
> Firewire list cc'd

It's indeed a problem in the firewire stack.

...

>> now we attach another sata disk to the second port of the pci firewire B
>> controller:
>>
>> Jan 12 16:50:49 x3400 kernel: firewire_sbp2: orb reply timed out, rcode=0x11
>> Jan 12 16:50:49 x3400 kernel: sd 11:0:0:0: [sdc] Result: hostbyte=DID_BUS_BUSY
>> driverbyte=DRIVER_OK,SUGGEST_OK
>
> Best I can tell, this is the source of the problem. The sbp2 driver is
> replying DID_BUS_BUSY until that gets sorted out, which seems to be
> never.

When something was plugged in or out at the same bus, fw-sbp2 has to reconnect, i.e. renew the login to each logical unit. The syslog in the report is inconclusive as to whether that happened or failed. There should be one of "reconnected to ... LUN ... (... retries)" or "failed to reconnect to ..." logged eventually. As long as the reconnect hasn't succeeded, fw-sbp2 quits newly enqueued commands with DID_BUS_BUSY.

I need a bit of spare time and a way to reproduce this or similar things to get to the cause of it. (I have had random reconnect problems with fw-sbp2 now and then, but not yet 100% reproducible ones like the reporter has.) The last tested kernel in the report, 2.6.24-rc7, already contains the latest and greatest firewire code as far as the problem at hand is concerned.

> So, first pass analysis indicates the error to be in the firewire
> subsystem. I'm guessing from the message that it's actually
> drivers/firewire, not drivers/ieee1394?

Yes.

As a side note, the old sbp2 driver does not quit commands with DID_BUS_BUSY between bus reset and reconnect. Instead it blocks the Scsi_Host in order to not receive commands during that time at all. The time from bus reset to successful reconnect is typically circa a second, on bigger buses with lots of protocol activity after bus reset perhaps a few seconds. (Could IMO still be optimized, in both the drivers/ieee1394 and drivers/firewire implementations.)
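To illustrate the DID_BUS_BUSY behaviour described above (this is not the actual fw-sbp2 source; names are illustrative): with the 2.6.24-era queuecommand() convention, a low-level driver that currently cannot talk to the device completes the command straight back to the SCSI midlayer with DID_BUS_BUSY in the host byte, and the midlayer's limited retries are what dd eventually sees as I/O errors. A minimal sketch, with transport_ready() standing in for whatever state the real driver checks:

```c
#include <linux/types.h>
#include <scsi/scsi.h>
#include <scsi/scsi_cmnd.h>
#include <scsi/scsi_device.h>

/* Stand-in for the driver's real "is the login currently valid?" test. */
static bool transport_ready(struct scsi_device *sdev)
{
	return true;	/* placeholder */
}

/*
 * Old-style (2.6.24-era) queuecommand: if the transport is not usable,
 * e.g. between a bus reset and a successful SBP-2 reconnect, fail the
 * command back immediately with DID_BUS_BUSY in the host byte. The
 * midlayer retries it a limited number of times and then reports an
 * I/O error to the caller.
 */
static int example_queuecommand(struct scsi_cmnd *cmd,
				void (*done)(struct scsi_cmnd *))
{
	if (!transport_ready(cmd->device)) {
		cmd->result = DID_BUS_BUSY << 16;
		done(cmd);
		return 0;
	}
	/* ... otherwise map the command to an SBP-2 ORB and submit it ... */
	return 0;
}
```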
I set up a "just to test" environment: I attached a new sata hdd with an amd64 lenny install. In this way I can test whatever is needed without involving the stable OS. Hopefully, if we find a working solution, I'll move it to production (kernel 2.6.18 is not as quick as a 2.6.22 on a quad core, afaik).

#1 @stefan "If you reconfigure the kernel to use sbp2 + ieee1394 + ohci1394 instead of firewire-sbp2 + firewire-core + firewire-ohci, the problem will probably go away."
I chose to recompile the debian source 2.6.23-1-amd64 with the old firewire modules as you told me, in the hope of obtaining a stable kernel for production. I'll post the result. In the meantime I'll give everyone all my support if something needs to be tested. Please tell me what I have to do to be helpful.

#7 @anonymous "(I had random reconnect problems with fw-sbp2 now and then, but not yet 100% reproducible ones like the reporter has.)"
If you want to test something, please let me know what I should do.

#4 @stefan "If you stop all I/O to the 1st disk, then add the 2nd disk (to the same card), wait a few seconds, then attempt I/O to the 1st disk --- would that work?"
In that situation I think that everything should work well. I'll test it to be sure and post the result.

Reply-To: stefanr@s5r6.in-berlin.de

I did a quick and simple test now:

1.) switch on 1st disk (sdd)

Jan 19 13:46:18 stein sd 88:0:0:0: [sdd] Attached SCSI disk
Jan 19 13:46:18 stein sd 88:0:0:0: Attached scsi generic sg3 type 14

2.) start "dd if=/dev/sdd of=/dev/null"

3.) switch on 2nd disk (sde)

Jan 19 13:48:11 stein firewire_sbp2: orb reply timed out, rcode=0x11
Jan 19 13:48:11 stein sd 88:0:0:0: [sdd] Result: hostbyte=DID_BUS_BUSY driverbyte=DRIVER_OK,SUGGEST_OK
Jan 19 13:48:11 stein end_request: I/O error, dev sdd, sector 7871464
Jan 19 13:48:11 stein Buffer I/O error on device sdd, logical block 983933
Jan 19 13:48:11 stein Buffer I/O error on device sdd, logical block 983934
Jan 19 13:48:11 stein Buffer I/O error on device sdd, logical block 983935
Jan 19 13:48:11 stein Buffer I/O error on device sdd, logical block 983936
Jan 19 13:48:11 stein Buffer I/O error on device sdd, logical block 983937
Jan 19 13:48:11 stein Buffer I/O error on device sdd, logical block 983938
Jan 19 13:48:11 stein Buffer I/O error on device sdd, logical block 983939
Jan 19 13:48:11 stein Buffer I/O error on device sdd, logical block 983940
Jan 19 13:48:11 stein Buffer I/O error on device sdd, logical block 983941
Jan 19 13:48:11 stein Buffer I/O error on device sdd, logical block 983942
Jan 19 13:48:11 stein sd 88:0:0:0: [sdd] Result: hostbyte=DID_BUS_BUSY driverbyte=DRIVER_OK,SUGGEST_OK
Jan 19 13:48:11 stein end_request: I/O error, dev sdd, sector 7871712
Jan 19 13:48:11 stein sd 88:0:0:0: [sdd] Result: hostbyte=DID_BUS_BUSY driverbyte=DRIVER_OK,SUGGEST_OK
Jan 19 13:48:11 stein end_request: I/O error, dev sdd, sector 7871720
Jan 19 13:48:11 stein sd 88:0:0:0: [sdd] Result: hostbyte=DID_BUS_BUSY driverbyte=DRIVER_OK,SUGGEST_OK
Jan 19 13:48:11 stein end_request: I/O error, dev sdd, sector 7871464
Jan 19 13:48:11 stein firewire_sbp2: error status: 0:9
Jan 19 13:48:11 stein firewire_sbp2: error status: 0:9
Jan 19 13:48:11 stein firewire_sbp2: error status: 0:9
Jan 19 13:48:12 stein firewire_sbp2: error status: 0:9
Jan 19 13:48:12 stein firewire_sbp2: error status: 0:9
Jan 19 13:48:12 stein firewire_sbp2: failed to reconnect to fw3.0
Jan 19 13:48:12 stein firewire_sbp2: logged in to fw3.0 LUN 0000 (0 retries)
Jan 19 13:48:26 stein firewire_sbp2: orb reply timed out, rcode=0x11
Jan 19 13:48:27 stein firewire_sbp2: error status: 0:9
Jan 19 13:48:27 stein firewire_sbp2: error status: 0:9
Jan 19 13:48:27 stein firewire_sbp2: error status: 0:9
Jan 19 13:48:27 stein scsi89 : SBP-2 IEEE-1394
Jan 19 13:48:27 stein firewire_core: created new fw device fw4 (6 config rom retries, S800)
Jan 19 13:48:27 stein firewire_sbp2: error status: 0:9
Jan 19 13:48:27 stein firewire_sbp2: logged in to fw4.0 LUN 0000 (0 retries)
Jan 19 13:48:27 stein scsi 89:0:0:0: Direct-Access-RBC HDS72404 0KLAT80 KFAO PQ: 0 ANSI: 4
Jan 19 13:48:27 stein sd 89:0:0:0: [sde] 1562845488 512-byte hardware sectors (800177 MB)
Jan 19 13:48:27 stein sd 89:0:0:0: [sde] Write Protect is off
Jan 19 13:48:27 stein sd 89:0:0:0: [sde] Mode Sense: 11 00 00 00
Jan 19 13:48:27 stein sd 89:0:0:0: [sde] Write cache: enabled, read cache: enabled, doesn't support DPO or FUA
Jan 19 13:48:27 stein sde: sde1
Jan 19 13:48:27 stein sd 89:0:0:0: [sde] Attached SCSI disk
Jan 19 13:48:27 stein sd 89:0:0:0: Attached scsi generic sg4 type 14
Jan 19 13:48:27 stein firewire_sbp2: error status: 0:9
Jan 19 13:48:27 stein firewire_sbp2: failed to reconnect to fw3.0
Jan 19 13:48:28 stein firewire_sbp2: logged in to fw3.0 LUN 0000 (0 retries)

Doing that, dd aborted:

dd: reading `/dev/sdd': Input/output error
7871464+0 records in
7871464+0 records out
4030189568 bytes (4.0 GB) copied, 57.6538 s, 69.9 MB/s

sdd was however accessible again after the login at 13:48:28. As you see, the SCSI stack did not take the disk offline after the I/O errors. However, the desired behavior is that dd is simply put into I/O wait state rather than failing with errors while the SBP-2 transport is busy.

Both disks used in this test are based on the OXFW912 bridge. The "error status: 0:9" before reconnect is typical for them, at least as I have set them up right now (1394b card, 1394b hub, disks on the hub). The status means "request complete, function rejected". It has always been recoverable here when I switched on the 2nd disk with the 1st disk connected but idle.

Perhaps we should do it in fw-sbp2 like in sbp2: Put the scsi_device (in sbp2: the Scsi_Host) into blocked state between the bus reset event and successful reconnection. We took a while to get this right in sbp2 because blocking the host is prone to deadlocks. Therefore I was so far satisfied with fw-sbp2 not using that scheme. However, the first thing I shall try is to insert a "return SCSI_MLQUEUE_HOST_BUSY" early in sbp2_scsi_queuecommand, depending on a check of the logical unit's generation.

A variation of this problem is if the bus reset -- reconnect phase happens while the SCSI stack is just in the middle of INQUIRY or READ CAPACITY. During this, the SCSI stack is very quick to offline a device. We need to do something there, because this situation is not too unusual (e.g. when several FireWire devices are being powered up together).

I will proceed to experiment with this as spare time permits. Of course any advice on how to best interact with the SCSI core while the transport is busy would be appreciated.
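For reference, the idea mentioned in the previous paragraph would look roughly like the following sketch. It is not the actual fw-sbp2 code: the struct, its generation field, and the helper for the current bus generation are assumptions made for illustration. Unlike completing a command with DID_BUS_BUSY, returning SCSI_MLQUEUE_HOST_BUSY asks the midlayer to requeue the command rather than counting a failed attempt. (The next comment reports why this approach was abandoned.)

```c
#include <scsi/scsi.h>
#include <scsi/scsi_cmnd.h>
#include <scsi/scsi_device.h>
#include <scsi/scsi_host.h>

/* Illustrative stand-in for fw-sbp2's per-logical-unit state. */
struct example_logical_unit {
	int generation;		/* bus generation the current login belongs to */
};

/* Illustrative helper: the bus generation after the most recent reset. */
static int current_bus_generation(struct example_logical_unit *lu)
{
	return lu->generation;	/* placeholder */
}

/*
 * Sketch only: while the login is stale (its recorded generation no
 * longer matches the bus), hand the command back with
 * SCSI_MLQUEUE_HOST_BUSY so the midlayer retries it later instead of
 * failing it through to the caller.
 */
static int sketch_queuecommand(struct scsi_cmnd *cmd,
			       void (*done)(struct scsi_cmnd *))
{
	struct example_logical_unit *lu = cmd->device->hostdata;

	if (lu->generation != current_bus_generation(lu))
		return SCSI_MLQUEUE_HOST_BUSY;

	/* ... otherwise build and submit the SBP-2 ORB as usual ... */
	return 0;
}
```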
Reply-To: stefanr@s5r6.in-berlin.de

Stefan Richter wrote:
> However, the first thing I shall try is to insert a "return
> SCSI_MLQUEUE_HOST_BUSY" early in sbp2_scsi_queuecommand, depending on a
> check of the logical unit's generation.

Doesn't help. Even worse, it's deadlocky.

Reply-To: stefanr@s5r6.in-berlin.de

I wrote on 2008-01-13:
> James Bottomley wrote:
>> Firewire list cc'd
>>> Jan 12 16:50:49 x3400 kernel: firewire_sbp2: orb reply timed out, rcode=0x11
>>> Jan 12 16:50:49 x3400 kernel: sd 11:0:0:0: [sdc] Result: hostbyte=DID_BUS_BUSY
>>> driverbyte=DRIVER_OK,SUGGEST_OK
>> Best I can tell, this is the source of the problem. The sbp2 driver is
>> replying DID_BUS_BUSY until that gets sorted out, which seems to be
>> never.
>
> When something was plugged in or out at the same bus, fw-sbp2 has to
> reconnect == renew the login to each logical unit. The syslog in the
> report is inconclusive whether that happened or failed.

In any case, there are frequently commands retried or newly enqueued while fw-sbp2 waits to get the login renewed. (And fw-sbp2 continues to complete them with DID_BUS_BUSY until the reconnection succeeds. Whoever caused that I/O, e.g. dd as in the reporter's and my own tests, will quickly fail.)

> As a side note, the old sbp2 driver does not quit commands with
> DID_BUS_BUSY between bus reset and reconnect. Instead it blocks the
> Scsi_Host in order to not receive commands during that time at all.

I experimented with this yesterday. First I tried scsi_internal_device_block() because we
- want to block logical units individually if possible,
- need to block from within atomic context (softirq context).

However, this failed miserably with all sorts of lock inversion bug backtraces (alleged ones or real ones, I don't know) and with occasional kernel lock-ups (so they were probably real lock inversions). These locking issues cannot be solved easily because the block layer and scsi_lib play nauseating games with their locks.

So, I switched over to scsi_block_requests(), i.e. blocking the whole host like the old sbp2 driver does. This doesn't seem to have scsi_internal_device_block()'s locking issues. However, the sbp2 driver has one Scsi_Host for each logical unit while the new fw-sbp2 driver has one Scsi_Host for each target. Hence there are difficulties with targets with multiple logical units, but I probably got them sorted out now.

There remain frequent problems with reconnection + re-login failures though. These failures don't happen with exactly the same bus topology if I don't run I/O during the bus resets. I have an idea though what to try next...

I currently have a preliminary solution which works with OXUF922 and OXFW912 but not yet with TI StorageLynx. There also still seem to be refcounting bugs in my experimental patches. I will post them after more testing.
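The host-level blocking scheme described above can be illustrated with a minimal sketch. The two hook functions are invented for the illustration and are not actual fw-sbp2 entry points; the real change is the patch series linked in the next comment. While the host is blocked, the midlayer holds commands back instead of delivering them to queuecommand, so nothing has to be failed with DID_BUS_BUSY:

```c
#include <scsi/scsi_host.h>

/* Hypothetical hook: called when fw-sbp2 learns about a bus reset. */
static void example_bus_reset_seen(struct Scsi_Host *shost)
{
	/* stop new commands from reaching the driver */
	scsi_block_requests(shost);
}

/* Hypothetical hook: called once the SBP-2 login has been renewed. */
static void example_reconnect_done(struct Scsi_Host *shost)
{
	/* let queued commands flow again */
	scsi_unblock_requests(shost);
}
```

Per the comment above, the attraction of this scheme over scsi_internal_device_block() is that it avoided the observed locking problems; its drawback is the per-host rather than per-logical-unit granularity.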
Patch series posted at http://lkml.org/lkml/2008/2/3/195 and available for testing at http://me.in-berlin.de/~s5r6/linux1394/updates/ (v629 or later). Helps a lot according to my tests.

Patch series has been updated at http://lkml.org/lkml/2008/2/16/122 and pushed out to http://me.in-berlin.de/~s5r6/linux1394/updates/ (v633 or later). This update fixes concurrency issues in my earlier patch. Additional testing shows that bus reset stability isn't yet top-notch with the mentioned TI StorageLynx, buggy PL-3507, and older Oxsemi911. These remaining issues require further changes to firewire-sbp2 which I can't implement and submit upstream overnight. Anyway, the current patches already improve things a lot: I/O during bus reset simply had no chance to survive at all before the patch set; now it at least works quite well with a variety of devices.

Fix committed to linux1394-2.6.git:master, planned to be submitted to Linus next week.

The bus reset stabilization fix was merged by Linus.

Reopening this bug: There still needs to be some work done to make bus reset handling in firewire-sbp2 more robust. If I use a bunch of 1394b disks behind a 5 meter long cable (the IEEE 1394b cable assembly reference design is for a maximum of 4.5 meters), I occasionally get failed SCSI commands during heavy usage. The commands fail with hostbyte=DID_BUS_BUSY and hostbyte=DID_NO_CONNECT, i.e. connection loss, and the SCSI devices are subsequently taken offline. I have to examine it more, but I believe there are bus resets during which some or all of the external nodes go invisible. I have some rough ideas what may need to be done here, but it will take some time to verify and implement this.

The devices which I tested this way are OXFW912 and OXUF924DSB based, i.e. ones whose bus reset behavior should have been addressed by the patches from February. Side note: Bus reset behavior of the other bridges mentioned in comment #14 should also be improved as a side effect of a patch from Jarod which is already in Linus' tree (set single phase retry_limit).

Closing the bug again. After a test with short cables and self ID logging enabled, I found that the problem is something else. At some point, the controller just isn't issuing any interrupts anymore, except for cycle64Seconds interrupt events. Looks somewhat like bug 7569. Controller is a Sunix PCIe 1394b card with TSB82AA2 behind XIO2000, http://www.linux1394.org/view_device.php?id=1039