Bug 10307

Summary: connection loss during heavy IO, probably hardware bug in 1394b PHY chip TSB81BA3
Product: Drivers
Reporter: Stefan Richter (stefanr)
Component: IEEE1394
Assignee: Stefan Richter (stefanr)
Status: REJECTED WILL_NOT_FIX
Severity: normal
CC: jarod
Priority: P1
Hardware: All
OS: Linux
Kernel Version: all
Subsystem:
Regression: ---
Bisected commit-id:
Bug Depends on: 11349
Bug Blocks:
Attachments: failure log with ieee1394 stack
             failure log with firewire stack (full log)

Description Stefan Richter 2008-03-22 04:17:21 UTC
Latest working kernel version: ?
Earliest failing kernel version: ?
Hardware Environment: Sunix 1394b PCIe card with TSB82AA2 + XIO2000
http://www.linux1394.org/view_device.php?id=1039
Software Environment: 2.6.25-rc5 SMP PREEMPT i686, with several 2.6.26 candidate patches

  - Connect 3 1394b disks (based on OXFW912 and OXUF924DSB).
  - Do IO on all of them in parallel, e.g. diff --recursive ...
  - After several hundreds of GBytes transferred, IO breaks down,
    SCSI devices are taken offline.

No IO is possible anymore.  Plugging devices out or in is not recognized anymore.  No interrupts except cycle64Seconds are received anymore, according to debug printks in fw-ohci.c and the counter in /proc/interrupts.  When another PC is plugged in and tries to read the dead card's bus info block (a request that would be handled by the physical response unit of the dead card), these read requests fail with evt_missing_ack.

It is not yet clear whether this is a driver bug or a card erratum.  I need to do the same test with ohci1394 + sbp2 instead of firewire-ohci + firewire-sbp2.  Note the similarities to bug 7569 (2.6.18, Host disappears on 1394 bus reset).

I have not yet observed this bug when accessing only one disk at a time.
Comment 1 Stefan Richter 2008-03-22 04:36:22 UTC
# echo -n 0000:02:02.0 > /sys/bus/pci/drivers/firewire_ohci/unbind
# echo -n 0000:02:02.0 > /sys/bus/pci/drivers/firewire_ohci/bind

gets the controller going again.
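
The same unbind/bind sequence can be scripted.  What follows is only a minimal sketch in Python, using the sysfs driver directory and PCI address from the commands above.  The function name rebind and its settle parameter are hypothetical, the pause between the two writes is an assumption (the original commands have none), and writing to the real sysfs files requires root.

```python
import os
import time

# Driver directory taken from the commands in this comment.
DRIVER_DIR = "/sys/bus/pci/drivers/firewire_ohci"


def rebind(device, driver_dir=DRIVER_DIR, settle=1.0):
    """Detach a PCI device from its driver, then attach it again.

    `settle` is an assumed pause between the two steps; it is not part
    of the original commands.
    """
    for action in ("unbind", "bind"):
        # Equivalent of: echo -n DEVICE > DRIVER_DIR/ACTION
        with open(os.path.join(driver_dir, action), "w") as f:
            f.write(device)
        if action == "unbind":
            time.sleep(settle)
```

Usage would be rebind("0000:02:02.0").  Passing a different driver_dir allows a dry run against an ordinary directory.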
Comment 2 Jarod Wilson 2008-03-23 07:00:13 UTC
I'll see if I can reproduce this with any of my controllers when I have a chance.
Comment 3 Stefan Richter 2008-04-27 04:08:09 UTC
Created attachment 15933
failure log with ieee1394 stack

Tested the same hardware with the same workload with ohci1394 + sbp2 on 2.6.25-rc8:

There were IO errors after a while too, but bus reset detection, self ID reception, and the physical response unit still worked after that.  I only had to unmount and remount the disks and could proceed.

As the syslog reveals, the three FireWire nodes corresponding to the four disks went away and came back, from the point of view of self ID reception:

Apr 27 12:26:14 stein ieee1394: Node changed: 0-04:1023 -> 0-00:1023
Apr 27 12:26:14 stein ieee1394: Node suspended: ID:BUS[0-00:1023]  GUID[0030e0a5e0080293]
Apr 27 12:26:15 stein ieee1394: Node suspended: ID:BUS[0-01:1023]  GUID[0001d202eef4048e]
Apr 27 12:26:15 stein ieee1394: Node suspended: ID:BUS[0-02:1023]  GUID[0001d202e08500e7]
Apr 27 12:26:15 stein ieee1394: Node resumed: ID:BUS[0-00:1023]  GUID[0030e0a5e0080293]
Apr 27 12:26:15 stein ieee1394: Node resumed: ID:BUS[0-01:1023]  GUID[0001d202eef4048e]
Apr 27 12:26:15 stein ieee1394: Node resumed: ID:BUS[0-02:1023]  GUID[0001d202e08500e7]
Apr 27 12:26:15 stein ieee1394: Node changed: 0-00:1023 -> 0-04:1023

This is a bus of five 1394b S800 nodes:  Sunix card in the PC, IOI hub, 3 disk enclosures.  The "Node changed" message revealed that all four external PHYs went away for a moment.

This could be a glitch of the PHY of the controller card or of the PHY in the repeater.

I am now re-running the test with firewire-ohci to see whether the new stack always uses interrupts, or also sometimes fails like just seen with the old stack.
Comment 4 Stefan Richter 2008-04-27 04:23:08 UTC
PS:
> As the syslog reveals, the three FireWire nodes corresponding to the four
> disks went away and came back, from the point of view of self ID reception:

(The three nodes of the disks and the node of the hub all went away.)

> I am now re-running the test with firewire-ohci to see whether the new stack
> always uses interrupts, or also sometimes fails like just seen with the old
> stack.

(...always loses interrupts, i.e. does not receive any interrupts anymore besides cycle64Seconds...)

Both failure modes can be caused by a hardware bug of the card's PHY.  However, it would be somewhat suspicious if the new stack always only failed in one way and the old stack always only in the other way.
Comment 5 Stefan Richter 2008-04-27 05:51:40 UTC
Created attachment 15934
failure log with firewire stack (full log)

Repeated the test with new firewire stack on 2.6.25-rc8 + firewire patches for 2.6.26:

It fails exactly like the previous test with ohci1394, and it can be recovered from in the same way, by simply unmounting and mounting the filesystems again.

The most important part of the log is where the errors start:

Apr 27 13:31:10 stein firewire_ohci: 1 selfIDs, generation 6, local node ID ffc0
Apr 27 13:31:10 stein selfID 0: 807fcc56, phy 0 [---] beta gc=63 -3W Lci
Apr 27 13:31:10 stein firewire_core: skipped bus generations, destroying all nodes
Apr 27 13:31:10 stein sd 46:0:0:0: [sdd] Synchronizing SCSI cache
Apr 27 13:31:11 stein firewire_ohci: 5 selfIDs, generation 7, local node ID ffc1
Apr 27 13:31:11 stein selfID 0: 807fc458, phy 0 [--p] beta gc=63 -3W L
Apr 27 13:31:11 stein selfID 0: 817fcc66, phy 1 [-p-] beta gc=63 -3W Lci
Apr 27 13:31:11 stein selfID 0: 827fc464, phy 2 [-p-] beta gc=63 -3W L
Apr 27 13:31:11 stein selfID 0: 837fc0b4, phy 3 [pc-] beta gc=63 +0W L
Apr 27 13:31:11 stein selfID 0: 843fc4fe, phy 4 [ccc] beta gc=63 -3W i
Apr 27 13:32:10 stein firewire_sbp2: fw1.0: sbp2_scsi_abort
Apr 27 13:32:10 stein sd 46:0:0:0: Device offlined - not ready after error recovery
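
For readers unfamiliar with the selfID lines, here is a hedged sketch in Python of how the quadlets in this log map to the printed fields.  The bit positions follow the layout of self-ID packet zero in IEEE 1394; the function and field names are made up for illustration and are not taken from the driver.

```python
# Port status codes: 0 = no port, 1 = not connected,
# 2 = connected to parent, 3 = connected to child.
PORT_CHARS = ".-pc"
SPEEDS = ["S100", "S200", "S400", "beta"]


def decode_self_id(quadlet):
    """Decode self-ID packet zero into the fields shown in the log."""
    ports = "".join(PORT_CHARS[(quadlet >> shift) & 3] for shift in (6, 4, 2))
    return {
        "phy": (quadlet >> 24) & 0x3F,                # "phy N"
        "link_active": bool(quadlet & (1 << 22)),     # "L"
        "gap_count": (quadlet >> 16) & 0x3F,          # "gc="
        "speed": SPEEDS[(quadlet >> 14) & 3],
        "contender": bool(quadlet & (1 << 11)),       # "c"
        "power_class": (quadlet >> 8) & 7,            # 4 is logged as -3W, 0 as +0W
        "ports": ports,                               # "[---]"
        "initiated_reset": bool(quadlet & (1 << 1)),  # "i"
    }
```

For example, decode_self_id(0x807fcc56) yields phy 0 with ports "---", that is, the local node with no connected port, which matches the first selfID line of generation 6.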

This is what happens:
  - bus reset(s) (bus generation changes) without selfID complete event
    (not logged)
  - selfID complete event with only the selfID of the local node
  - selfID complete event with the selfIDs of all five nodes
  - driver stack destroys old device representations,
    creates new device representations.
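
The sequence above can be spotted in a syslog mechanically.  A rough sketch in Python, assuming the firewire_ohci message format seen in this log; the function name lone_node_events is hypothetical.

```python
import re

# Matches e.g. "firewire_ohci: 1 selfIDs, generation 6, local node ID ffc0"
SELF_ID_LINE = re.compile(r"firewire_ohci: (\d+) selfIDs, generation (\d+)")


def lone_node_events(log_lines):
    """Find selfID complete events that reported only the local node.

    Returns a list of (generation, next_count) tuples, where next_count
    is the number of selfIDs in the following event, or None if the log
    ends there.
    """
    events = [(int(m.group(2)), int(m.group(1)))
              for m in map(SELF_ID_LINE.search, log_lines) if m]
    result = []
    for i, (generation, count) in enumerate(events):
        if count == 1:
            following = events[i + 1][1] if i + 1 < len(events) else None
            result.append((generation, following))
    return result
```

Fed the two firewire_ohci lines from the excerpt above, this reports generation 6 as a lone-node event followed by an event with 5 selfIDs.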
Comment 6 Stefan Richter 2008-04-27 06:28:04 UTC
I changed bug title and kernel version according to the last findings.

My interpretation now is that this is a bug in the TSB81BA3.  This is the most commonly used PHY in 1394b devices.  We therefore need a workaround for it in the drivers, at least in the new stack.

The workaround could be implemented in the high-level drivers (firewire-sbp2, sbp2, and any other driver which may need high connection reliability --- possibly including userspace drivers) or in the midlayer drivers (firewire-core, ieee1394, raw1394).

Workaround for the failure mode from comments #3 and #5:

    After nodes are gone, keep device representations around for a
    short while.  Revive them if nodes with same config ROM content
    are discovered shortly thereafter.

Workaround for the failure mode from comment #0:

    Like above.  Also reset the local PHY.  But how to detect the
    necessity of doing so?  This mode is indistinguishable from a
    merely idle bus.

Workaround 1 (for the failure mode from comments #3 and #5) can probably be implemented quite easily in sbp2, because ieee1394's nodemgr already has some infrastructure that sbp2 can use.  (Node entries are "suspended" and "resumed" by nodemgr.  Sbp2 would have to implement hooks for the suspend and resume events, plus a timeout for suspend events which truly are node removals.)
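
To illustrate the idea behind workaround 1 (this is not the kernel implementation), here is a small sketch in Python.  The class and method names are hypothetical, the 5 second grace period is an assumed value, and the "config ROM content" is reduced to an opaque key.

```python
import time


class DeviceRegistry:
    """Keeps device representations alive for a grace period after node loss."""

    def __init__(self, grace_seconds=5.0, clock=time.monotonic):
        self.grace = grace_seconds  # assumed value, not from the bug report
        self.clock = clock
        self.active = {}     # config ROM key -> device representation
        self.suspended = {}  # config ROM key -> (device, time of loss)

    def node_lost(self, rom):
        """Suspend instead of destroying the device right away."""
        device = self.active.pop(rom, None)
        if device is not None:
            self.suspended[rom] = (device, self.clock())

    def node_found(self, rom, make_device):
        """Revive a suspended device with the same ROM, else create one."""
        self.expire()
        if rom in self.active:
            return self.active[rom]
        if rom in self.suspended:
            device, _ = self.suspended.pop(rom)
        else:
            device = make_device()
        self.active[rom] = device
        return device

    def expire(self):
        """Drop suspended devices whose grace period ran out; return them."""
        now = self.clock()
        expired = [rom for rom, (_, lost_at) in self.suspended.items()
                   if now - lost_at > self.grace]
        return [self.suspended.pop(rom)[0] for rom in expired]
```

A node that reappears within the grace period gets its old device object back, so users of that object (mounted filesystems, in the sbp2 case) need not notice the interruption.  A node that stays away longer is treated as a true removal.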
Comment 7 Stefan Richter 2008-08-19 13:52:24 UTC
> Workaround for the failure mode from comments #3 and #5:
> 
>     After nodes are gone, keep device representations around for a
>     short while.  Revive them if nodes with same config ROM content
>     are discovered shortly thereafter.

Implemented for ieee1394 by patch
"ieee1394: survive a few seconds connection loss"
http://lkml.org/lkml/2008/8/19/369
Comment 8 Stefan Richter 2009-02-09 14:40:43 UTC
> Workaround for the failure mode from comments #3 and #5:
> 
>     After nodes are gone, keep device representations around for a
>     short while.  Revive them if nodes with same config ROM content
>     are discovered shortly thereafter.

Also implemented in firewire in 2.6.29-rc4.

> Workaround for the failure mode from comment #0:
> 
>     Like above.  Also reset the local PHY.  But how to detect the
>     necessity of doing so?  This mode is indistinguishable from a
>     merely idle bus.

Won't fix.  Apparently happens rarely enough to ignore... for now.
Comment 9 Stefan Richter 2009-02-09 14:51:15 UTC
> Apparently happens rarely enough to ignore... for now.

That's probably because I now use the bus with the hopelessly buggy TSB81BA3s only in legacy arbitration mode (by keeping a 1394a PHY attached).  Otherwise it wouldn't be usable because of too many bus resets: I want to use it with unbalanced cable lengths, which always triggers the TSB81BA3 erratum related to BOSS arbitration.  http://focus.ti.com/lit/er/sllz015c/sllz015c.pdf