Bug 9734 - I/O error when plugging/unplugging a second firewire disk
Summary: I/O error when plugging/unplugging a second firewire disk
Status: CLOSED CODE_FIX
Alias: None
Product: Drivers
Classification: Unclassified
Component: IEEE1394 (show other bugs)
Hardware: All Linux
: P1 high
Assignee: Stefan Richter
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2008-01-12 07:47 UTC by sbu
Modified: 2008-03-19 11:00 UTC (History)
3 users (show)

See Also:
Kernel Version: 2.6.22...2.6.24
Subsystem:
Regression: ---
Bisected commit-id:


Attachments

Description sbu 2008-01-12 07:47:42 UTC
Latest working kernel version: 2.6.18-5
(can't test any other kernel between 2.6.18-5 and 2.6.22-3 because aren't in the repo)

Earliest failing kernel version: 2.6.22-3

Distribution: debian lenny 32bit AND 64 bit (it happens the same on both)

Hardware Environment: ibm x3400 quad core 2GB ram type/number 7976-KBG bios version 1.56 http://www-03.ibm.com/systems/x/tower/x3400/specs.html
cpu info: -> http://www.pastebin.org/15078
lspci firewire: -> http://www.pastebin.org/15081
lsmod: -> http://www.pastebin.org/15083

Software Environment: bash

Problem Description:
We have a 64bit pci double firewire 800 port to which I attach 2 sata hdd (it doesn't matter the brand of the hdd. We tried many)
Everything works well until we use only one disk. Connecting a second disk when we are working on the first one (i.e. writing to the first device)  causes the interruption of the first job. once happened also a kernel freeze but we can't document and reproduce it right now.

Steps to reproduce:
We attach a sata disk to the first port of the pci firewire B controller

08:02.0 FireWire (IEEE 1394): Texas Instruments TSB82AA2 IEEE-1394b Link Layer Controller (rev 01)
Jan 12 16:19:13 x3400 kernel: ieee1394: Error parsing configrom for node 1-00:1023
Jan 12 16:19:13 x3400 kernel: ieee1394: Node changed: 1-00:1023 -> 1-01:1023
Jan 12 16:19:14 x3400 kernel: ieee1394: Node resumed: ID:BUS[1-00:1023]  GUID[0030e002e0454697]
Jan 12 16:19:14 x3400 kernel: scsi9 : SBP-2 IEEE-1394
Jan 12 16:19:15 x3400 kernel: ieee1394: sbp2: Logged into SBP-2 device
Jan 12 16:19:15 x3400 kernel: ieee1394: Node 1-00:1023: Max speed [S800] - Max payload [4096]
Jan 12 16:19:15 x3400 kernel:   Vendor: WDC WD16  Model: 00JD-00HBC0       Rev: 08.0
Jan 12 16:19:15 x3400 kernel:   Type:   Direct-Access-RBC                  ANSI SCSI revision: 04
Jan 12 16:19:15 x3400 kernel: SCSI device sdc: 312579695 512-byte hdwr sectors (160041 MB)
Jan 12 16:19:15 x3400 kernel: sdc: Write Protect is off
Jan 12 16:19:15 x3400 kernel: sdc: Mode Sense: 11 00 00 00
Jan 12 16:19:15 x3400 kernel: SCSI device sdc: drive cache: write back
Jan 12 16:19:15 x3400 kernel: SCSI device sdc: 312579695 512-byte hdwr sectors (160041 MB)
Jan 12 16:19:15 x3400 kernel: sdc: Write Protect is off
Jan 12 16:19:15 x3400 kernel: sdc: Mode Sense: 11 00 00 00
Jan 12 16:19:15 x3400 kernel: SCSI device sdc: drive cache: write back
Jan 12 16:19:15 x3400 kernel:  sdc: unknown partition table
Jan 12 16:19:15 x3400 kernel: sd 9:0:0:0: Attached scsi disk sdc

now we launch :
dd if=/dev/zero of=/dev/sdc

everything ok until now

now we attach another sata disc to the second port of the pci firewire controller:

Jan 12 16:20:09 x3400 kernel: ieee1394: Error parsing configrom for node 1-00:1023
Jan 12 16:20:09 x3400 kernel: ieee1394: Node changed: 1-00:1023 -> 1-01:1023
Jan 12 16:20:09 x3400 kernel: ieee1394: Node changed: 1-01:1023 -> 1-02:1023
Jan 12 16:20:10 x3400 kernel: ieee1394: Reconnected to SBP-2 device
Jan 12 16:20:10 x3400 kernel: ieee1394: Node 1-01:1023: Max speed [S800] - Max payload [4096]

using 2.6.18-5-686 everything works well
dd still works

now we disconnect the disk from the second port 

Jan 12 16:21:13 x3400 kernel: ieee1394: Node changed: 1-01:1023 -> 1-00:1023
Jan 12 16:21:13 x3400 kernel: ieee1394: Node changed: 1-02:1023 -> 1-01:1023
Jan 12 16:21:13 x3400 kernel: ieee1394: Reconnected to SBP-2 device
Jan 12 16:21:13 x3400 kernel: ieee1394: Node 1-00:1023: Max speed [S800] - Max payload [4096]

everything ok also disconnecting the second device

-------------------------------------------------------

now the same issue using 2.6.24-rc7-686:

we attach a sata disk to the first port of the pci firewire B controller

Jan 12 16:49:45 x3400 kernel: firewire_core: phy config: card 1, new root=ffc1, gap_count=5
Jan 12 16:49:46 x3400 kernel: scsi11 : SBP-2 IEEE-1394
Jan 12 16:49:46 x3400 kernel: firewire_core: created new fw device fw2 (2 config rom retries, S800)
Jan 12 16:49:46 x3400 kernel: firewire_sbp2: logged in to fw2.0 LUN 0000 (0 retries)
Jan 12 16:49:46 x3400 kernel: scsi 11:0:0:0: Direct-Access-RBC WDC WD16 00JD-00HBC0      08.0 PQ: 0 ANSI: 4
Jan 12 16:49:46 x3400 kernel: sd 11:0:0:0: [sdc] 312579695 512-byte hardware sectors (160041 MB)
Jan 12 16:49:46 x3400 kernel: sd 11:0:0:0: [sdc] Write Protect is off
Jan 12 16:49:46 x3400 kernel: sd 11:0:0:0: [sdc] Mode Sense: 11 00 00 00
Jan 12 16:49:46 x3400 kernel: sd 11:0:0:0: [sdc] Write cache: enabled, read cache: enabled, doesn't support DPO or FUA
Jan 12 16:49:46 x3400 kernel: sd 11:0:0:0: [sdc] 312579695 512-byte hardware sectors (160041 MB)
Jan 12 16:49:46 x3400 kernel: sd 11:0:0:0: [sdc] Write Protect is off
Jan 12 16:49:46 x3400 kernel: sd 11:0:0:0: [sdc] Mode Sense: 11 00 00 00
Jan 12 16:49:46 x3400 kernel: sd 11:0:0:0: [sdc] Write cache: enabled, read cache: enabled, doesn't support DPO or FUA
Jan 12 16:49:46 x3400 kernel:  sdc: unknown partition table
Jan 12 16:49:46 x3400 kernel: sd 11:0:0:0: [sdc] Attached SCSI disk

now we launch :
dd if=/dev/zero of=/dev/sdc

now we attach another sata disk to the second port of the pci firewire B controller:

Jan 12 16:50:49 x3400 kernel: firewire_sbp2: orb reply timed out, rcode=0x11
Jan 12 16:50:49 x3400 kernel: sd 11:0:0:0: [sdc] Result: hostbyte=DID_BUS_BUSY driverbyte=DRIVER_OK,SUGGEST_OK
Jan 12 16:50:49 x3400 kernel: end_request: I/O error, dev sdc, sector 2942302
Jan 12 16:50:49 x3400 kernel: printk: 571588 messages suppressed.
Jan 12 16:50:49 x3400 kernel: Buffer I/O error on device sdc, logical block 2942302
Jan 12 16:50:49 x3400 kernel: lost page write due to I/O error on sdc
Jan 12 16:50:49 x3400 kernel: Buffer I/O error on device sdc, logical block 2942303
Jan 12 16:50:49 x3400 kernel: lost page write due to I/O error on sdc
Jan 12 16:50:49 x3400 kernel: Buffer I/O error on device sdc, logical block 2942304
Jan 12 16:50:49 x3400 kernel: lost page write due to I/O error on sdc
Jan 12 16:50:49 x3400 kernel: Buffer I/O error on device sdc, logical block 2942305
Jan 12 16:50:49 x3400 kernel: lost page write due to I/O error on sdc
Jan 12 16:50:49 x3400 kernel: Buffer I/O error on device sdc, logical block 2942306
Jan 12 16:50:49 x3400 kernel: lost page write due to I/O error on sdc
Jan 12 16:50:49 x3400 kernel: Buffer I/O error on device sdc, logical block 2942307
Jan 12 16:50:49 x3400 kernel: lost page write due to I/O error on sdc
Jan 12 16:50:49 x3400 kernel: Buffer I/O error on device sdc, logical block 2942308
Jan 12 16:50:49 x3400 kernel: lost page write due to I/O error on sdc
Jan 12 16:50:49 x3400 kernel: Buffer I/O error on device sdc, logical block 2942309
Jan 12 16:50:49 x3400 kernel: lost page write due to I/O error on sdc
Jan 12 16:50:49 x3400 kernel: Buffer I/O error on device sdc, logical block 2942310
Jan 12 16:50:49 x3400 kernel: lost page write due to I/O error on sdc
Jan 12 16:50:49 x3400 kernel: Buffer I/O error on device sdc, logical block 2942311
Jan 12 16:50:49 x3400 kernel: lost page write due to I/O error on sdc
Jan 12 16:50:49 x3400 kernel: sd 11:0:0:0: [sdc] Result: hostbyte=DID_BUS_BUSY driverbyte=DRIVER_OK,SUGGEST_OK
Jan 12 16:50:49 x3400 kernel: end_request: I/O error, dev sdc, sector 2942557
Jan 12 16:50:49 x3400 kernel: sd 11:0:0:0: [sdc] Result: hostbyte=DID_BUS_BUSY driverbyte=DRIVER_OK,SUGGEST_OK
Jan 12 16:50:49 x3400 kernel: end_request: I/O error, dev sdc, sector 2942812
Jan 12 16:50:49 x3400 kernel: sd 11:0:0:0: [sdc] Result: hostbyte=DID_BUS_BUSY driverbyte=DRIVER_OK,SUGGEST_OK
Jan 12 16:50:49 x3400 kernel: end_request: I/O error, dev sdc, sector 2943067
Jan 12 16:50:49 x3400 kernel: sd 11:0:0:0: [sdc] Result: hostbyte=DID_BUS_BUSY driverbyte=DRIVER_OK,SUGGEST_OK
Jan 12 16:50:49 x3400 kernel: end_request: I/O error, dev sdc, sector 2943322
Jan 12 16:50:49 x3400 kernel: sd 11:0:0:0: [sdc] Result: hostbyte=DID_BUS_BUSY driverbyte=DRIVER_OK,SUGGEST_OK
Jan 12 16:50:49 x3400 kernel: end_request: I/O error, dev sdc, sector 2943577
Jan 12 16:50:49 x3400 kernel: sd 11:0:0:0: [sdc] Result: hostbyte=DID_BUS_BUSY driverbyte=DRIVER_OK,SUGGEST_OK
Jan 12 16:50:49 x3400 kernel: end_request: I/O error, dev sdc, sector 2943832
Jan 12 16:50:49 x3400 kernel: sd 11:0:0:0: [sdc] Result: hostbyte=DID_BUS_BUSY driverbyte=DRIVER_OK,SUGGEST_OK
Jan 12 16:50:49 x3400 kernel: end_request: I/O error, dev sdc, sector 2944087
Jan 12 16:50:49 x3400 kernel: sd 11:0:0:0: [sdc] Result: hostbyte=DID_BUS_BUSY driverbyte=DRIVER_OK,SUGGEST_OK
Jan 12 16:50:49 x3400 kernel: end_request: I/O error, dev sdc, sector 2944342
Jan 12 16:50:49 x3400 kernel: sd 11:0:0:0: [sdc] Result: hostbyte=DID_BUS_BUSY driverbyte=DRIVER_OK,SUGGEST_OK
Jan 12 16:50:49 x3400 kernel: end_request: I/O error, dev sdc, sector 2944597

it goes on this way until when we kill the dd

thanks in advance

damko & divilinux
Comment 1 Stefan Richter 2008-01-12 08:55:56 UTC
If you reconfigure the kernel to use sbp2 + ieee1394 + ohci1394 instead of firewire-sbp2 + firewire-core + firewire-ohci, the problem will probably go away.

Hence this is not a regression, except from the point of view of users of prebuilt kernel packages if the distributor chose to deactivate the "classic" drivers and enable only the new experimental drivers.

Could you try the old driver stack under 2.6.24-rc7?
Comment 2 Stefan Richter 2008-01-12 09:02:45 UTC
Also please clarify:  Does the failure generally happen, or only if you connect the 2nd disk _while_ the dd or other accesses to the 1st disk are running?
Comment 3 sbu 2008-01-12 11:28:34 UTC
#1
We really would like to use a prebuild kernel not a customized one. It's a workstation heavily used in production and we are not kernel-gurus.

#2
it happens only when I connect the 2nd disk
perhaps there is another (2 in total, each one with 2 ports) Texas Instruments TSB82AA2 firewire card plugged in that workstation. If I connect the disk in one of the port of the other card everything works ok.
Comment 4 Stefan Richter 2008-01-12 13:21:00 UTC
> it happens only when I connect the 2nd disk

If you stop all I/O to the 1st disk, then add the 2nd disk (to the same card), wait a few seconds, then attempt I/O to the 1st disk --- would that work?

Anyway, we need of course try to harden the driver to survive hotplug event even during heavy I/O.  (But even if we manage to do so, it isn't exactly advisable to run I/O while hotplugging on the same bus, because the successful resumption after such disturbances also depends on proper working of all the other devices on the FireWire bus.)
Comment 5 Stefan Richter 2008-01-12 13:31:29 UTC
> We really would like to use a prebuild kernel not a customized one.
> It's a workstation heavily used in production and we are not
> kernel-gurus.

That's why distributors of pre-built kernel packages should either build only the old drivers, or build old and new drivers but blacklist the new drivers.  As a hint, the new driver stack is marked EXPERIMENTAL in the kernel build configurator.  We also added a link to http://wiki.linux1394.org/JujuMigration in the help text of the new driver stack in the kernel configurator at some point; alas perhaps too late for some distributors to notice.

Or these kernel packager should at least track the firewire drivers development as closely and quickly (and actively participate in fixing them) as the Fedora maintainers do.

Sorry, we should have somehow provided some huge flashy warning signs (more flashy than the big-letter EXPERIMENTAL tag...) as a caveat emptor to distributors when we added the new alternative drivers to the kernel.  But there is only so much you can do, plus a number of problems with the new driver stack were of course unforeseen at the time.
Comment 6 Anonymous Emailer 2008-01-12 17:12:04 UTC
Reply-To: James.Bottomley@HansenPartnership.com

Firewire list cc'd

On Sat, 2008-01-12 at 07:47 -0800, bugme-daemon@bugzilla.kernel.org
wrote:
> http://bugzilla.kernel.org/show_bug.cgi?id=9734
> 
>            Summary: I/O error when inserting a second firewire sata disk
>            Product: IO/Storage
>            Version: 2.5
>      KernelVersion: 2.6.24 rc7
>           Platform: All
>         OS/Version: Linux
>               Tree: Mainline
>             Status: NEW
>           Severity: high
>           Priority: P1
>          Component: SCSI
>         AssignedTo: linux-scsi@vger.kernel.org
>         ReportedBy: sbu@squadrainformatica.com
> 
> 
> Latest working kernel version: 2.6.18-5
> (can't test any other kernel between 2.6.18-5 and 2.6.22-3 because aren't in
> the repo)
> 
> Earliest failing kernel version: 2.6.22-3
> 
> Distribution: debian lenny 32bit AND 64 bit (it happens the same on both)
> 
> Hardware Environment: ibm x3400 quad core 2GB ram type/number 7976-KBG bios
> version 1.56 http://www-03.ibm.com/systems/x/tower/x3400/specs.html
> cpu info: -> http://www.pastebin.org/15078
> lspci firewire: -> http://www.pastebin.org/15081
> lsmod: -> http://www.pastebin.org/15083
> 
> Software Environment: bash
> 
> Problem Description:
> We have a 64bit pci double firewire 800 port to which I attach 2 sata hdd (it
> doesn't matter the brand of the hdd. We tried many)
> Everything works well until we use only one disk. Connecting a second disk
> when
> we are working on the first one (i.e. writing to the first device)  causes
> the
> interruption of the first job. once happened also a kernel freeze but we
> can't
> document and reproduce it right now.
> 
> Steps to reproduce:
> We attach a sata disk to the first port of the pci firewire B controller
> 
> 08:02.0 FireWire (IEEE 1394): Texas Instruments TSB82AA2 IEEE-1394b Link
> Layer
> Controller (rev 01)
> Jan 12 16:19:13 x3400 kernel: ieee1394: Error parsing configrom for node
> 1-00:1023
> Jan 12 16:19:13 x3400 kernel: ieee1394: Node changed: 1-00:1023 -> 1-01:1023
> Jan 12 16:19:14 x3400 kernel: ieee1394: Node resumed: ID:BUS[1-00:1023] 
> GUID[0030e002e0454697]
> Jan 12 16:19:14 x3400 kernel: scsi9 : SBP-2 IEEE-1394
> Jan 12 16:19:15 x3400 kernel: ieee1394: sbp2: Logged into SBP-2 device
> Jan 12 16:19:15 x3400 kernel: ieee1394: Node 1-00:1023: Max speed [S800] -
> Max
> payload [4096]
> Jan 12 16:19:15 x3400 kernel:   Vendor: WDC WD16  Model: 00JD-00HBC0      
> Rev:
> 08.0
> Jan 12 16:19:15 x3400 kernel:   Type:   Direct-Access-RBC                 
> ANSI
> SCSI revision: 04
> Jan 12 16:19:15 x3400 kernel: SCSI device sdc: 312579695 512-byte hdwr
> sectors
> (160041 MB)
> Jan 12 16:19:15 x3400 kernel: sdc: Write Protect is off
> Jan 12 16:19:15 x3400 kernel: sdc: Mode Sense: 11 00 00 00
> Jan 12 16:19:15 x3400 kernel: SCSI device sdc: drive cache: write back
> Jan 12 16:19:15 x3400 kernel: SCSI device sdc: 312579695 512-byte hdwr
> sectors
> (160041 MB)
> Jan 12 16:19:15 x3400 kernel: sdc: Write Protect is off
> Jan 12 16:19:15 x3400 kernel: sdc: Mode Sense: 11 00 00 00
> Jan 12 16:19:15 x3400 kernel: SCSI device sdc: drive cache: write back
> Jan 12 16:19:15 x3400 kernel:  sdc: unknown partition table
> Jan 12 16:19:15 x3400 kernel: sd 9:0:0:0: Attached scsi disk sdc
> 
> now we launch :
> dd if=/dev/zero of=/dev/sdc
> 
> everything ok until now
> 
> now we attach another sata disc to the second port of the pci firewire
> controller:
> 
> Jan 12 16:20:09 x3400 kernel: ieee1394: Error parsing configrom for node
> 1-00:1023
> Jan 12 16:20:09 x3400 kernel: ieee1394: Node changed: 1-00:1023 -> 1-01:1023
> Jan 12 16:20:09 x3400 kernel: ieee1394: Node changed: 1-01:1023 -> 1-02:1023
> Jan 12 16:20:10 x3400 kernel: ieee1394: Reconnected to SBP-2 device
> Jan 12 16:20:10 x3400 kernel: ieee1394: Node 1-01:1023: Max speed [S800] -
> Max
> payload [4096]
> 
> using 2.6.18-5-686 everything works well
> dd still works
> 
> now we disconnect the disk from the second port 
> 
> Jan 12 16:21:13 x3400 kernel: ieee1394: Node changed: 1-01:1023 -> 1-00:1023
> Jan 12 16:21:13 x3400 kernel: ieee1394: Node changed: 1-02:1023 -> 1-01:1023
> Jan 12 16:21:13 x3400 kernel: ieee1394: Reconnected to SBP-2 device
> Jan 12 16:21:13 x3400 kernel: ieee1394: Node 1-00:1023: Max speed [S800] -
> Max
> payload [4096]
> 
> everything ok also disconnecting the second device
> 
> -------------------------------------------------------
> 
> now the same issue using 2.6.24-rc7-686:
> 
> we attach a sata disk to the first port of the pci firewire B controller
> 
> Jan 12 16:49:45 x3400 kernel: firewire_core: phy config: card 1, new
> root=ffc1,
> gap_count=5
> Jan 12 16:49:46 x3400 kernel: scsi11 : SBP-2 IEEE-1394
> Jan 12 16:49:46 x3400 kernel: firewire_core: created new fw device fw2 (2
> config rom retries, S800)
> Jan 12 16:49:46 x3400 kernel: firewire_sbp2: logged in to fw2.0 LUN 0000 (0
> retries)
> Jan 12 16:49:46 x3400 kernel: scsi 11:0:0:0: Direct-Access-RBC WDC WD16
> 00JD-00HBC0      08.0 PQ: 0 ANSI: 4
> Jan 12 16:49:46 x3400 kernel: sd 11:0:0:0: [sdc] 312579695 512-byte hardware
> sectors (160041 MB)
> Jan 12 16:49:46 x3400 kernel: sd 11:0:0:0: [sdc] Write Protect is off
> Jan 12 16:49:46 x3400 kernel: sd 11:0:0:0: [sdc] Mode Sense: 11 00 00 00
> Jan 12 16:49:46 x3400 kernel: sd 11:0:0:0: [sdc] Write cache: enabled, read
> cache: enabled, doesn't support DPO or FUA
> Jan 12 16:49:46 x3400 kernel: sd 11:0:0:0: [sdc] 312579695 512-byte hardware
> sectors (160041 MB)
> Jan 12 16:49:46 x3400 kernel: sd 11:0:0:0: [sdc] Write Protect is off
> Jan 12 16:49:46 x3400 kernel: sd 11:0:0:0: [sdc] Mode Sense: 11 00 00 00
> Jan 12 16:49:46 x3400 kernel: sd 11:0:0:0: [sdc] Write cache: enabled, read
> cache: enabled, doesn't support DPO or FUA
> Jan 12 16:49:46 x3400 kernel:  sdc: unknown partition table
> Jan 12 16:49:46 x3400 kernel: sd 11:0:0:0: [sdc] Attached SCSI disk
> 
> now we launch :
> dd if=/dev/zero of=/dev/sdc
> 
> now we attach another sata disk to the second port of the pci firewire B
> controller:
> 
> Jan 12 16:50:49 x3400 kernel: firewire_sbp2: orb reply timed out, rcode=0x11
> Jan 12 16:50:49 x3400 kernel: sd 11:0:0:0: [sdc] Result:
> hostbyte=DID_BUS_BUSY
> driverbyte=DRIVER_OK,SUGGEST_OK

Best I can tell, this is the source of the problem.  The sbp2 driver is
replying DID_BUS_BUSY until that gets sorted out, which seems to be
never.

So, first pass analysis indicates the error to be in the firewire
subsystem.  I'm guessing from the message that it's actually
drivers/firewire, not drivers iee1934?

> Jan 12 16:50:49 x3400 kernel: end_request: I/O error, dev sdc, sector 2942302
> Jan 12 16:50:49 x3400 kernel: printk: 571588 messages suppressed.
> Jan 12 16:50:49 x3400 kernel: Buffer I/O error on device sdc, logical block
> 2942302
> Jan 12 16:50:49 x3400 kernel: lost page write due to I/O error on sdc
> Jan 12 16:50:49 x3400 kernel: Buffer I/O error on device sdc, logical block
> 2942303
> Jan 12 16:50:49 x3400 kernel: lost page write due to I/O error on sdc
> Jan 12 16:50:49 x3400 kernel: Buffer I/O error on device sdc, logical block
> 2942304
> Jan 12 16:50:49 x3400 kernel: lost page write due to I/O error on sdc
> Jan 12 16:50:49 x3400 kernel: Buffer I/O error on device sdc, logical block
> 2942305
> Jan 12 16:50:49 x3400 kernel: lost page write due to I/O error on sdc
> Jan 12 16:50:49 x3400 kernel: Buffer I/O error on device sdc, logical block
> 2942306
> Jan 12 16:50:49 x3400 kernel: lost page write due to I/O error on sdc
> Jan 12 16:50:49 x3400 kernel: Buffer I/O error on device sdc, logical block
> 2942307
> Jan 12 16:50:49 x3400 kernel: lost page write due to I/O error on sdc
> Jan 12 16:50:49 x3400 kernel: Buffer I/O error on device sdc, logical block
> 2942308
> Jan 12 16:50:49 x3400 kernel: lost page write due to I/O error on sdc
> Jan 12 16:50:49 x3400 kernel: Buffer I/O error on device sdc, logical block
> 2942309
> Jan 12 16:50:49 x3400 kernel: lost page write due to I/O error on sdc
> Jan 12 16:50:49 x3400 kernel: Buffer I/O error on device sdc, logical block
> 2942310
> Jan 12 16:50:49 x3400 kernel: lost page write due to I/O error on sdc
> Jan 12 16:50:49 x3400 kernel: Buffer I/O error on device sdc, logical block
> 2942311
> Jan 12 16:50:49 x3400 kernel: lost page write due to I/O error on sdc
> Jan 12 16:50:49 x3400 kernel: sd 11:0:0:0: [sdc] Result:
> hostbyte=DID_BUS_BUSY
> driverbyte=DRIVER_OK,SUGGEST_OK
> Jan 12 16:50:49 x3400 kernel: end_request: I/O error, dev sdc, sector 2942557
> Jan 12 16:50:49 x3400 kernel: sd 11:0:0:0: [sdc] Result:
> hostbyte=DID_BUS_BUSY
> driverbyte=DRIVER_OK,SUGGEST_OK
> Jan 12 16:50:49 x3400 kernel: end_request: I/O error, dev sdc, sector 2942812
> Jan 12 16:50:49 x3400 kernel: sd 11:0:0:0: [sdc] Result:
> hostbyte=DID_BUS_BUSY
> driverbyte=DRIVER_OK,SUGGEST_OK
> Jan 12 16:50:49 x3400 kernel: end_request: I/O error, dev sdc, sector 2943067
> Jan 12 16:50:49 x3400 kernel: sd 11:0:0:0: [sdc] Result:
> hostbyte=DID_BUS_BUSY
> driverbyte=DRIVER_OK,SUGGEST_OK
> Jan 12 16:50:49 x3400 kernel: end_request: I/O error, dev sdc, sector 2943322
> Jan 12 16:50:49 x3400 kernel: sd 11:0:0:0: [sdc] Result:
> hostbyte=DID_BUS_BUSY
> driverbyte=DRIVER_OK,SUGGEST_OK
> Jan 12 16:50:49 x3400 kernel: end_request: I/O error, dev sdc, sector 2943577
> Jan 12 16:50:49 x3400 kernel: sd 11:0:0:0: [sdc] Result:
> hostbyte=DID_BUS_BUSY
> driverbyte=DRIVER_OK,SUGGEST_OK
> Jan 12 16:50:49 x3400 kernel: end_request: I/O error, dev sdc, sector 2943832
> Jan 12 16:50:49 x3400 kernel: sd 11:0:0:0: [sdc] Result:
> hostbyte=DID_BUS_BUSY
> driverbyte=DRIVER_OK,SUGGEST_OK
> Jan 12 16:50:49 x3400 kernel: end_request: I/O error, dev sdc, sector 2944087
> Jan 12 16:50:49 x3400 kernel: sd 11:0:0:0: [sdc] Result:
> hostbyte=DID_BUS_BUSY
> driverbyte=DRIVER_OK,SUGGEST_OK
> Jan 12 16:50:49 x3400 kernel: end_request: I/O error, dev sdc, sector 2944342
> Jan 12 16:50:49 x3400 kernel: sd 11:0:0:0: [sdc] Result:
> hostbyte=DID_BUS_BUSY
> driverbyte=DRIVER_OK,SUGGEST_OK
> Jan 12 16:50:49 x3400 kernel: end_request: I/O error, dev sdc, sector 2944597
> 
> it goes on this way until when we kill the dd
> 
> thanks in advance
> 
> damko & divilinux

James
Comment 7 Anonymous Emailer 2008-01-13 01:06:44 UTC
Reply-To: stefanr@s5r6.in-berlin.de

James Bottomley wrote:
> Firewire list cc'd

It's indeed a problem in the firewire stack.

...
>> now we attach another sata disk to the second port of the pci firewire B
>> controller:
>>
>> Jan 12 16:50:49 x3400 kernel: firewire_sbp2: orb reply timed out, rcode=0x11
>> Jan 12 16:50:49 x3400 kernel: sd 11:0:0:0: [sdc] Result:
>> hostbyte=DID_BUS_BUSY
>> driverbyte=DRIVER_OK,SUGGEST_OK
> 
> Best I can tell, this is the source of the problem.  The sbp2 driver is
> replying DID_BUS_BUSY until that gets sorted out, which seems to be
> never.

When something was plugged in or out at the same bus, fw-sbp2 has to
reconnect == renew the login to each logical unit.  The syslog in the
report is inconclusive whether that happened or failed.  There should be
one of
	"reconnected to ... LUN ... (... retries)"
or	"failed to reconnect to ..."
logged eventually.  As long as the reconnect didn't succeed, fw-sbp2
quits newly enqueued commands with DID_BUS_BUSY.

I need a bit spare time and a way to reproduce this or similar things to
get to the cause of this.  (I had random reconnect problems with fw-sbp2
now and then, but not yet 100% reproducible ones like the reporter has.)

The last tested kernel in the report, 2.6.24-rc7, already contains the
latest and greatest firewire code as far as problem at hand is concerned.

> So, first pass analysis indicates the error to be in the firewire
> subsystem.  I'm guessing from the message that it's actually
> drivers/firewire, not drivers iee1934?

Yes.

As a side note, the old sbp2 driver does not quit commands with
DID_BUS_BUSY between bus reset and reconnect.  Instead it blocks the
Scsi_Host in order to not receive commands during that time at all.

The time from bus reset to successful reconnect is typically circa a
second, on bigger buses with lots of protocol activity after bus reset
perhaps a few seconds.  (Could IMO still be optimized, in both of the
drivers/ieee1394 and drivers/firewire implementations.)
Comment 8 sbu 2008-01-13 13:16:04 UTC
I set up a "just to test environment".
I attached a new sata hdd with an amd64 lenny. In this way I can test whatever needed without involve the stable os. 
Hopefully if we find a working solution I'll move it to the production (kernel 2.6.18 it's not so quick as a 2.6.22 on a quad core afaik).
#1 @stefan
"If you reconfigure the kernel to use sbp2 + ieee1394 + ohci1394 instead of
firewire-sbp2 + firewire-core + firewire-ohci, the problem will probably go
away."
I chose to recompile the debian source 2.6.23-1-amd64 with the old firewire modules as you told me, in the hope to obtain a stable kernel for production.
I'll post the result.

In the mean while  I'll give to everyone all my support if something should be test. Pls tell me what I have to do to be helpfull.
#7 @anonymous
"(I had random reconnect problems with fw-sbp2
now and then, but not yet 100% reproducible ones like the reporter has.)"
If you want to test something pls let me know what I should do.

#4 @stefan
"If you stop all I/O to the 1st disk, then add the 2nd disk (to the same card),
wait a few seconds, then attempt I/O to the 1st disk --- would that work?"
In that situation I think that everything should work well. I'll test it to be sure and I'll post the result.
Comment 9 Anonymous Emailer 2008-01-19 05:20:29 UTC
Reply-To: stefanr@s5r6.in-berlin.de

I did a quick and simple test now:

  1.) switch on 1st disk (sdd)

Jan 19 13:46:18 stein sd 88:0:0:0: [sdd] Attached SCSI disk
Jan 19 13:46:18 stein sd 88:0:0:0: Attached scsi generic sg3 type 14

  2.) start "dd if=/dev/sdd of=/dev/null"

  3.) switch on 2nd disk (sde)

Jan 19 13:48:11 stein firewire_sbp2: orb reply timed out, rcode=0x11
Jan 19 13:48:11 stein sd 88:0:0:0: [sdd] Result: hostbyte=DID_BUS_BUSY driverbyte=DRIVER_OK,SUGGEST_OK
Jan 19 13:48:11 stein end_request: I/O error, dev sdd, sector 7871464
Jan 19 13:48:11 stein Buffer I/O error on device sdd, logical block 983933
Jan 19 13:48:11 stein Buffer I/O error on device sdd, logical block 983934
Jan 19 13:48:11 stein Buffer I/O error on device sdd, logical block 983935
Jan 19 13:48:11 stein Buffer I/O error on device sdd, logical block 983936
Jan 19 13:48:11 stein Buffer I/O error on device sdd, logical block 983937
Jan 19 13:48:11 stein Buffer I/O error on device sdd, logical block 983938
Jan 19 13:48:11 stein Buffer I/O error on device sdd, logical block 983939
Jan 19 13:48:11 stein Buffer I/O error on device sdd, logical block 983940
Jan 19 13:48:11 stein Buffer I/O error on device sdd, logical block 983941
Jan 19 13:48:11 stein Buffer I/O error on device sdd, logical block 983942
Jan 19 13:48:11 stein sd 88:0:0:0: [sdd] Result: hostbyte=DID_BUS_BUSY driverbyte=DRIVER_OK,SUGGEST_OK
Jan 19 13:48:11 stein end_request: I/O error, dev sdd, sector 7871712
Jan 19 13:48:11 stein sd 88:0:0:0: [sdd] Result: hostbyte=DID_BUS_BUSY driverbyte=DRIVER_OK,SUGGEST_OK
Jan 19 13:48:11 stein end_request: I/O error, dev sdd, sector 7871720
Jan 19 13:48:11 stein sd 88:0:0:0: [sdd] Result: hostbyte=DID_BUS_BUSY driverbyte=DRIVER_OK,SUGGEST_OK
Jan 19 13:48:11 stein end_request: I/O error, dev sdd, sector 7871464
Jan 19 13:48:11 stein firewire_sbp2: error status: 0:9
Jan 19 13:48:11 stein firewire_sbp2: error status: 0:9
Jan 19 13:48:11 stein firewire_sbp2: error status: 0:9
Jan 19 13:48:12 stein firewire_sbp2: error status: 0:9
Jan 19 13:48:12 stein firewire_sbp2: error status: 0:9
Jan 19 13:48:12 stein firewire_sbp2: failed to reconnect to fw3.0
Jan 19 13:48:12 stein firewire_sbp2: logged in to fw3.0 LUN 0000 (0 retries)
Jan 19 13:48:26 stein firewire_sbp2: orb reply timed out, rcode=0x11
Jan 19 13:48:27 stein firewire_sbp2: error status: 0:9
Jan 19 13:48:27 stein firewire_sbp2: error status: 0:9
Jan 19 13:48:27 stein firewire_sbp2: error status: 0:9
Jan 19 13:48:27 stein scsi89 : SBP-2 IEEE-1394
Jan 19 13:48:27 stein firewire_core: created new fw device fw4 (6 config rom retries, S800)
Jan 19 13:48:27 stein firewire_sbp2: error status: 0:9
Jan 19 13:48:27 stein firewire_sbp2: logged in to fw4.0 LUN 0000 (0 retries)
Jan 19 13:48:27 stein scsi 89:0:0:0: Direct-Access-RBC HDS72404 0KLAT80          KFAO PQ: 0 ANSI: 4
Jan 19 13:48:27 stein sd 89:0:0:0: [sde] 1562845488 512-byte hardware sectors (800177 MB)
Jan 19 13:48:27 stein sd 89:0:0:0: [sde] Write Protect is off
Jan 19 13:48:27 stein sd 89:0:0:0: [sde] Mode Sense: 11 00 00 00
Jan 19 13:48:27 stein sd 89:0:0:0: [sde] Write cache: enabled, read cache: enabled, doesn't support DPO or FUA
Jan 19 13:48:27 stein sde: sde1
Jan 19 13:48:27 stein sd 89:0:0:0: [sde] Attached SCSI disk
Jan 19 13:48:27 stein sd 89:0:0:0: Attached scsi generic sg4 type 14
Jan 19 13:48:27 stein firewire_sbp2: error status: 0:9
Jan 19 13:48:27 stein firewire_sbp2: failed to reconnect to fw3.0
Jan 19 13:48:28 stein firewire_sbp2: logged in to fw3.0 LUN 0000 (0 retries)


Doing that, dd aborted:

dd: reading `/dev/sdd': Input/output error
7871464+0 records in
7871464+0 records out
4030189568 bytes (4.0 GB) copied, 57.6538 s, 69.9 MB/s

sdd was however accessible again after the login at 13:48:28.  As you
see, the SCSI stack did not take the disk offline after the I/O errors.
However, the desired behavior is that dd is simply put into I/O wait
state rather than failing with errors while the SBP-2 transport is busy.

Both disks used in this test are based on the OXFW912 bridge.  The
"error status: 0:9" before reconnect is typical for them, at least as I
have set them up right now (1394b card, 1394b hub, disks on the hub).
The status means "request complete, function rejected".  It has always
been recoverable here when I switched on the 2nd disk with the 1st disk
connected but idle.

Perhaps we should do it in fw-sbp2 like in sbp2:  Put the scsi_device
(in sbp2: the Scsi_Host) into blocked state between bus reset event and
successful reconnection.  We took a while to get this right in sbp2
because blocking the host is prone to deadlocks.  Therefore I was so far
satisfied with fw-sbp2 not using that scheme.

However, the first thing I shall try is to insert a "return
SCSI_MLQUEUE_HOST_BUSY" early in sbp2_scsi_queuecommand, depending on a
check of the logical unit's generation.

A variation of this problem is if the bus reset -- reconnect phase
happens while the SCSI stack is just in the middle of INQUIRY or READ
CAPACITY.  During this, the SCSI stack is very quick to offline a
device.  We need to do something there, because this situation is not
too unusual (e.g. when several FireWire devices are being powered up
together).

I will proceed to experiment with this as spare time permits.  Of course
any advice on how to best interact with the SCSI core while the
transport is busy would be appreciated.
Comment 10 Anonymous Emailer 2008-01-19 12:47:56 UTC
Reply-To: stefanr@s5r6.in-berlin.de

Stefan Richter wrote:
> However, the first thing I shall try is to insert a "return
> SCSI_MLQUEUE_HOST_BUSY" early in sbp2_scsi_queuecommand, depending on a
> check of the logical unit's generation.

Doesn't help.  Even worse, it's deadlocky.
Comment 11 Anonymous Emailer 2008-02-02 18:11:00 UTC
Reply-To: stefanr@s5r6.in-berlin.de

I wrote on 2008-01-13:
> James Bottomley wrote:
>> Firewire list cc'd
>>> Jan 12 16:50:49 x3400 kernel: firewire_sbp2: orb reply timed out,
>>> rcode=0x11
>>> Jan 12 16:50:49 x3400 kernel: sd 11:0:0:0: [sdc] Result:
>>> hostbyte=DID_BUS_BUSY
>>> driverbyte=DRIVER_OK,SUGGEST_OK
>> Best I can tell, this is the source of the problem.  The sbp2 driver is
>> replying DID_BUS_BUSY until that gets sorted out, which seems to be
>> never.
> 
> When something was plugged in or out at the same bus, fw-sbp2 has to
> reconnect == renew the login to each logical unit.  The syslog in the
> report is inconclusive whether that happened or failed.

In any case, there are frequently commands retried or newly enqueued
while fw-sbp2 waits to get the login renewed.  (And fw-sbp2 continues to
complete them with DID_BUS_BUSY until the reconnection didn't succeed.

Whoever caused that I/O, e.g. dd like in the reporter's and my own
tests, will quickly fail.

> As a side note, the old sbp2 driver does not quit commands with
> DID_BUS_BUSY between bus reset and reconnect.  Instead it blocks the
> Scsi_Host in order to not receive commands during that time at all.

I experimented with this yesterday.  First I tried
scsi_internal_device_block() because we
  - want to block logical units individually if possible,
  - need to block from within atomic context (softirq context).
However, this failed miserably with all sorts of lock inversion bug
backtraces (alleged ones or real ones, I don't know) and with occasional
kernel lock-ups (so it were probably real lock inversions).  These
locking issues cannot be solved easily because block layer and scsi_lib
play nauseating games with their locks.

So, I switched over to scsi_block_requests(), i.e. blocking the whole
host like the old sbp2 driver does.  This doesn't seem to have
scsi_internal_device_block()'s locking issues.  However, the sbp2 driver
has one Scsi_Host for each logical unit while the new fw-sbp2 driver
however has one Scsi_Host for each target.  Hence there are difficulties
with targets with multiple logical units, but I probably got them sorted
out now.

There remain frequent problems with reconnection + re-login failures
though.  These failures don't happen with exactly the same bus topology
if I don't run I/O during the bus resets.  I have an idea though what to
try next...
Comment 12 Stefan Richter 2008-02-03 10:05:43 UTC
I currently got a preliminary solution which works with OXUF922 and OXFW912 but not yet with TI StorageLynx.  There seem also still refcounting bugs in my experimental patches.  I will post them after more testing.
Comment 13 Stefan Richter 2008-02-03 15:05:20 UTC
Patch series posted at http://lkml.org/lkml/2008/2/3/195 and available for testing at http://me.in-berlin.de/~s5r6/linux1394/updates/ (v629 or later).  Helps a lot according to my tests.
Comment 14 Stefan Richter 2008-02-16 09:35:52 UTC
Patch series has been updated at http://lkml.org/lkml/2008/2/16/122 and pushed out to http://me.in-berlin.de/~s5r6/linux1394/updates/ (v633 or later).  This update fixes concurrency issues in my earlier patch.

Additional testing shows that bus reset stability isn't yet top-notch with the mentioned TI StorageLynx, buggy PL-3507, and older Oxsemi911.  These remaining issues require further changes to firewire-sbp2 which I can't implement and submit upstream overnight.  Anyway, the current patches already improve things a lot:  I/O during bus reset had simply no chance to survive at all before the patch set, now it at least works quite well with a variety of devices.
Comment 15 Stefan Richter 2008-02-19 11:08:03 UTC
fix committed to linux1394-2.6.git:master, planned to submit to Linus next week
Comment 16 Stefan Richter 2008-02-28 02:23:46 UTC
bus reset stabilization fix was merged by Linus
Comment 17 Stefan Richter 2008-03-19 07:13:43 UTC
Reopening this bug:  There still needs to be some work done to make bus reset handling in firewire-sbp2 more robust.

If I use a bunch of 1394b disks behind a 5 meters long cable (IEEE 1394b cable assembly reference design is for a maximum of 4.5 meters), I occasionally get failed SCSI commands during heavy usage.  The commands fail with hostbyte=DID_BUS_BUSY and hostbyte=DID_NO_CONNECT, i.e. connection loss, and the SCSI devices are subsequently taken offline.  I have to examine it more but I believe
  - there are bus resets
  - during which some or all of the external nodes go invisible.

I have some rough ideas what may be needed to be done here, but it will take some time to verify and implement this.

The devices which I tested this way are OXFW912 and OXUF924DSB based, i.e. ones whose bus reset behavior should have been addressed by the patches from February.

Side note:  Bus reset behavior of other bridges mentioned in comment #14 should also be improved as a side effect of a patch from Jarod which is already in Linus' tree (set single phase retry_limit).
Comment 18 Stefan Richter 2008-03-19 11:00:47 UTC
Closing the bug again.  After a test with short cables and self ID logging enabled, I found that the problem is something else.

At some point, the controller just isn't issuing any interrupts anymore, except for cycle64Seconds interrupt events.  Look somewhat like bug 7569.  Controller is a Sunix PCIe 1394b card with TSB82AA2 behind XIO2000, http://www.linux1394.org/view_device.php?id=1039

Note You need to log in before you can comment on or make changes to this bug.