Bug 10307
Summary: connection loss during heavy IO, probably hardware bug in 1394b PHY chip TSB81BA3
Product: Drivers
Reporter: Stefan Richter (stefanr)
Component: IEEE1394
Assignee: Stefan Richter (stefanr)
Status: REJECTED WILL_NOT_FIX
Severity: normal
CC: jarod
Priority: P1
Hardware: All
OS: Linux
Kernel Version: all
Subsystem:
Regression: ---
Bisected commit-id:
Bug Depends on: 11349
Bug Blocks:
Attachments:
failure log with ieee1394 stack
failure log with firewire stack (full log)
Description
Stefan Richter
2008-03-22 04:17:21 UTC
# echo -n 0000:02:02.0 > /sys/bus/pci/drivers/firewire_ohci/unbind
# echo -n 0000:02:02.0 > /sys/bus/pci/drivers/firewire_ohci/bind
gets the controller going again.
I'll see if I can reproduce this with any of my controllers when I have a chance.
Created attachment 15933 [details]
failure log with ieee1394 stack
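The unbind/bind recovery from the description can be scripted. The following is only a minimal sketch: the function name `rebind` and its `driver_dir` parameter are made up for illustration, and it does nothing beyond writing the PCI address to the two sysfs files named in the commands above.

```python
from pathlib import Path


def rebind(pci_addr: str, driver_dir: Path) -> None:
    """Unbind a PCI device from a driver and bind it again via sysfs.

    `rebind` and `driver_dir` are hypothetical names for this sketch.
    """
    for action in ("unbind", "bind"):
        # Like `echo -n`, write the address without a trailing newline.
        (driver_dir / action).write_text(pci_addr)
```

On the machine from this report the call would presumably be `rebind("0000:02:02.0", Path("/sys/bus/pci/drivers/firewire_ohci"))`, run as root.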
Tested the same hardware with the same workload with ohci1394 + sbp2 on 2.6.25-rc8:
There were IO errors after a while too, but bus reset detection, self ID reception, and physical response unit still worked after that. I only had to unmount and remount the disks and could proceed.
As the syslog reveals, the three FireWire nodes corresponding to the four disks went away and came back, from the point of view of self ID reception:
Apr 27 12:26:14 stein ieee1394: Node changed: 0-04:1023 -> 0-00:1023
Apr 27 12:26:14 stein ieee1394: Node suspended: ID:BUS[0-00:1023] GUID[0030e0a5e0080293]
Apr 27 12:26:15 stein ieee1394: Node suspended: ID:BUS[0-01:1023] GUID[0001d202eef4048e]
Apr 27 12:26:15 stein ieee1394: Node suspended: ID:BUS[0-02:1023] GUID[0001d202e08500e7]
Apr 27 12:26:15 stein ieee1394: Node resumed: ID:BUS[0-00:1023] GUID[0030e0a5e0080293]
Apr 27 12:26:15 stein ieee1394: Node resumed: ID:BUS[0-01:1023] GUID[0001d202eef4048e]
Apr 27 12:26:15 stein ieee1394: Node resumed: ID:BUS[0-02:1023] GUID[0001d202e08500e7]
Apr 27 12:26:15 stein ieee1394: Node changed: 0-00:1023 -> 0-04:1023
This is a bus of five 1394b S800 nodes: Sunix card in the PC, IOI hub, 3 disk enclosures. The "Node changed" message revealed that all four external PHYs went away for a moment.
This could be a glitch of the PHY of the controller card or of the PHY in the repeater.
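To check a longer syslog for the same pattern, the suspend/resume messages can be paired per GUID. This is a rough sketch that assumes the message format is exactly as in the excerpt above; `final_states` is a hypothetical helper, not part of any kernel tool.

```python
import re

# Log lines copied from the excerpt above.
LOG = """\
Apr 27 12:26:14 stein ieee1394: Node suspended: ID:BUS[0-00:1023] GUID[0030e0a5e0080293]
Apr 27 12:26:15 stein ieee1394: Node suspended: ID:BUS[0-01:1023] GUID[0001d202eef4048e]
Apr 27 12:26:15 stein ieee1394: Node suspended: ID:BUS[0-02:1023] GUID[0001d202e08500e7]
Apr 27 12:26:15 stein ieee1394: Node resumed: ID:BUS[0-00:1023] GUID[0030e0a5e0080293]
Apr 27 12:26:15 stein ieee1394: Node resumed: ID:BUS[0-01:1023] GUID[0001d202eef4048e]
Apr 27 12:26:15 stein ieee1394: Node resumed: ID:BUS[0-02:1023] GUID[0001d202e08500e7]
"""

EVENT = re.compile(
    r"ieee1394: Node (suspended|resumed): ID:BUS\[[^\]]+\]\s+GUID\[([0-9a-f]{16})\]"
)


def final_states(log: str) -> dict:
    """Map each GUID to the last event ("suspended" or "resumed") seen for it."""
    states = {}
    for line in log.splitlines():
        match = EVENT.search(line)
        if match:
            event, guid = match.groups()
            states[guid] = event
    return states


# GUIDs still "suspended" at the end of the log never came back.
still_gone = [g for g, s in final_states(LOG).items() if s == "suspended"]
```

For the excerpt above, `still_gone` is empty: every node that was suspended was resumed again.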
I am now re-running the test with firewire-ohci to see whether the new stack always uses interrupts, or also sometimes fails like just seen with the old stack.
PS:
> As the syslog reveals, the three FireWire nodes corresponding to the four
> disks went away and came back, from the point of view of self ID reception:
(The three nodes of the disks and the node of the hub all went away.)
> I am now re-running the test with firewire-ohci to see whether the new stack
> always uses interrupts, or also sometimes fails like just seen with the old
> stack.
(...always loses interrupts, i.e. does not receive any interrupts anymore besides cycle64Seconds...)
Both failure modes can be caused by a hardware bug of the card's PHY. However, it would be somewhat suspicious if the new stack always only failed in one way and the old stack always only in the other way.
Created attachment 15934 [details]
failure log with firewire stack (full log)
Repeated the test with new firewire stack on 2.6.25-rc8 + firewire patches for 2.6.26:
Fails exactly like the previous test with ohci1394. And can be recovered from exactly as with ohci1394 by simply unmounting and mounting the filesystems again.
Most important part of the log is where the errors start:
Apr 27 13:31:10 stein firewire_ohci: 1 selfIDs, generation 6, local node ID ffc0
Apr 27 13:31:10 stein selfID 0: 807fcc56, phy 0 [---] beta gc=63 -3W Lci
Apr 27 13:31:10 stein firewire_core: skipped bus generations, destroying all nodes
Apr 27 13:31:10 stein sd 46:0:0:0: [sdd] Synchronizing SCSI cache
Apr 27 13:31:11 stein firewire_ohci: 5 selfIDs, generation 7, local node ID ffc1
Apr 27 13:31:11 stein selfID 0: 807fc458, phy 0 [--p] beta gc=63 -3W L
Apr 27 13:31:11 stein selfID 0: 817fcc66, phy 1 [-p-] beta gc=63 -3W Lci
Apr 27 13:31:11 stein selfID 0: 827fc464, phy 2 [-p-] beta gc=63 -3W L
Apr 27 13:31:11 stein selfID 0: 837fc0b4, phy 3 [pc-] beta gc=63 +0W L
Apr 27 13:31:11 stein selfID 0: 843fc4fe, phy 4 [ccc] beta gc=63 -3W i
Apr 27 13:32:10 stein firewire_sbp2: fw1.0: sbp2_scsi_abort
Apr 27 13:32:10 stein sd 46:0:0:0: Device offlined - not ready after error recovery
This is what happens:
- bus reset(s) (bus generation changes) without selfID complete event
(not logged)
- selfID complete event with only the selfID of the local node
- selfID complete event with the selfIDs of all five nodes
- driver stack destroys old device representations,
creates new device representations.
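The selfID quadlets in the log above can be decoded by hand to cross-check the bracketed port summary. The sketch below assumes the self-ID packet zero layout of IEEE 1394a (phy ID in bits 29..24, link-active bit 22, gap count in bits 21..16, speed in bits 15..14, contender bit 11, power class in bits 10..8, three 2-bit port fields, initiated-reset bit 1). `decode_selfid0` is a hypothetical helper; the port letters are chosen to match the log output (`-` not connected, `p` parent, `c` child), and `.` for "port not present" is an assumption.

```python
PORT = {0: ".", 1: "-", 2: "p", 3: "c"}
SPEED = {0: "S100", 1: "S200", 2: "S400", 3: "beta"}


def decode_selfid0(quadlet: int) -> dict:
    """Decode the first quadlet of a self-ID packet (packet zero only)."""
    if quadlet >> 30 != 0b10 or (quadlet >> 23) & 1:
        raise ValueError("not a self-ID packet zero")
    return {
        "phy": (quadlet >> 24) & 0x3F,
        "link_active": bool((quadlet >> 22) & 1),
        "gap_count": (quadlet >> 16) & 0x3F,
        "speed": SPEED[(quadlet >> 14) & 3],
        "contender": bool((quadlet >> 11) & 1),
        "power_class": (quadlet >> 8) & 7,
        # Ports p0, p1, p2 sit at bit offsets 6, 4 and 2.
        "ports": "".join(PORT[(quadlet >> shift) & 3] for shift in (6, 4, 2)),
        "initiated_reset": bool((quadlet >> 1) & 1),
    }
```

Applied to `0x807fcc56` from generation 6, this gives phy 0 with ports `---`, i.e. the local node momentarily saw nothing connected on any port, which matches the second step in the list above.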
I changed bug title and kernel version according to the last findings.

My interpretation now is that this is a bug in the TSB81BA3. This is the most often used PHY of 1394b devices. We therefore need a workaround for it in the drivers, at least in the new stack.

The workaround could be implemented in the high-level drivers (firewire-sbp2, sbp2, and any other driver which may need high connection reliability, possibly including userspace drivers) or in the midlayer drivers (firewire-core, ieee1394, raw1394).

Workaround for the failure mode from comments #3 and #5:

After nodes are gone, keep device representations around for a short while. Revive them if nodes with same config ROM content are discovered shortly thereafter.

Workaround for the failure mode from comment #0:

Like above. Also reset the local PHY. But how to detect the necessity of doing so? This mode is indistinguishable from a merely idle bus.

Workaround 1 can probably be quite easily implemented in sbp2 because there is already some infrastructure in ieee1394's nodemgr of which sbp2 can make use. (Node entries are "suspended" and "resumed" by nodemgr. Sbp2 would have to implement hooks for the suspend and resume events and a timeout for suspend events which truly are node removals.)

> Workaround for the failure mode from comments #3 and #5:
>
> After nodes are gone, keep device representations around for a
> short while. Revive them if nodes with same config ROM content
> are discovered shortly thereafter.
Implemented for ieee1394 by patch "ieee1394: survive a few seconds connection loss"
http://lkml.org/lkml/2008/8/19/369

> Workaround for the failure mode from comments #3 and #5:
>
> After nodes are gone, keep device representations around for a
> short while. Revive them if nodes with same config ROM content
> are discovered shortly thereafter.
Also implemented in firewire in 2.6.29-rc4.

> Workaround for the failure mode from comment #0:
>
> Like above. Also reset the local PHY. But how to detect the
> necessity of doing so? This mode is indistinguishable from a
> merely idle bus.
Won't fix. Apparently happens rarely enough to ignore... for now.

> Apparently happens rarely enough to ignore... for now.
That's probably because I now use the bus with the hopelessly buggy TSB81BA3s only in legacy arbitration mode (by keeping a 1394a PHY attached). Otherwise it wouldn't be usable due to too many bus resets since I want to use it with unbalanced cable lengths, thereby always triggering the TSB81BA3 erratum related to BOSS arbitration.
http://focus.ti.com/lit/er/sllz015c/sllz015c.pdf
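The first workaround (keep device representations for a short while and revive them when a node with the same config ROM content reappears) can be illustrated with a small user-space model. This is only a sketch of the idea, not the code of the ieee1394 or firewire patches: `DeviceTable`, `rom_key` and the explicit `now` argument are all assumptions made so that the logic can be followed and tested without a kernel.

```python
class DeviceTable:
    """Toy model: devices keyed by config ROM content, with a removal grace period."""

    def __init__(self, grace_seconds: float):
        self.grace = grace_seconds
        self.live = {}     # rom_key -> device representation
        self.pending = {}  # rom_key -> (device, time the node went away)

    def node_gone(self, rom_key, now: float) -> None:
        """A node vanished: park its device instead of destroying it."""
        dev = self.live.pop(rom_key, None)
        if dev is not None:
            self.pending[rom_key] = (dev, now)

    def expire(self, now: float) -> list:
        """Drop parked devices whose grace period ran out; return their keys."""
        expired = [k for k, (_, t) in self.pending.items() if now - t > self.grace]
        for key in expired:
            del self.pending[key]
        return expired

    def node_found(self, rom_key, now: float):
        """A node appeared: revive a parked device if one matches, else create one."""
        self.expire(now)
        if rom_key in self.live:
            return self.live[rom_key]
        if rom_key in self.pending:
            dev, _ = self.pending.pop(rom_key)  # same ROM content: revive
        else:
            dev = {"rom": rom_key}              # genuinely new device
        self.live[rom_key] = dev
        return dev
```

With a grace period of a few seconds, a node that goes away and returns within about a second (as in both failure logs) gets the same device object back, so upper layers would not need to unmount and remount. A node that stays away longer is treated as a real removal.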