Bug 7569 - Host disappears on 1394 bus reset
Summary: Host disappears on 1394 bus reset
Status: REJECTED WILL_NOT_FIX
Alias: None
Product: Drivers
Classification: Unclassified
Component: IEEE1394 (show other bugs)
Hardware: i386 Linux
: P2 normal
Assignee: Stefan Richter
URL:
Keywords:
Depends on:
Blocks: 10046
  Show dependency tree
 
Reported: 2006-11-22 16:19 UTC by Robert Crocombe
Modified: 2008-03-22 04:18 UTC (History)
1 user (show)

See Also:
Kernel Version: 2.6.18
Subsystem:
Regression: ---
Bisected commit-id:


Attachments
Code to provoke and notice the problem (15.94 KB, text/plain)
2007-01-15 04:42 UTC, Robert Crocombe
Details
Very slightly revised code to provoke reset problem (15.90 KB, text/plain)
2007-01-15 10:55 UTC, Robert Crocombe
Details
[patch 2.6.20-rcX] ieee1394: ohci1394: serialize reset requests and irq handler (5.63 KB, patch)
2007-01-26 10:08 UTC, Stefan Richter
Details | Diff
Slightly more revised code to provoke reset problem (16.68 KB, text/plain)
2007-02-07 09:04 UTC, Robert Crocombe
Details
[PATCH linux1394-2.6.git] ieee1394: ohci1394: serialize reset requests and irq handler (5.69 KB, patch)
2007-03-11 15:03 UTC, Stefan Richter
Details | Diff
New, hopefully better code for provoking reset problem (13.94 KB, text/plain)
2007-04-11 17:52 UTC, Robert Crocombe
Details

Description Robert Crocombe 2006-11-22 16:19:44 UTC
Most recent kernel where this bug did *NOT* occur: None as yet (occurs in all
tested).  Kernels tried are 2.6.16, 2.6.16-rt29, and 2.6.18.  Easiest to
reproduce for 2.6.16-rt29.
Distribution: Fedora Core 5
Hardware Environment: IWill H8502 quad Opteron.  One Unibrain Fireboard 800 and
one Indigita IDT804 (quad port) 1394b card.
Software Environment:
Problem Description:  After an uncertain number of bus resets, a host adapter
will lose functionality: bus reset handlers, etc. registered for that port will
no longer be called when the proper event occurs.

Steps to reproduce: Enable "excessive debugging" kernel option for
drivers->ieee1394.  Plug all host adapters together into a single bus.  Generate
1394 bus resets.  Eventually one of the ports will stop showing up in
/var/log/messages, i.e. fw-hostXYZ where XYZ is a number from 0 to (number of
supported ports -1), [0,4] for me.

Port can be resurrected by rmmod ohci1394, then modprobe ohci1394.
Comment 1 Robert Crocombe 2006-11-22 16:24:51 UTC
Discussion on lkml and linux1394-devel can be found starting at:

http://lkml.org/lkml/2006/11/22/122
Comment 2 Stefan Richter 2007-01-08 11:09:04 UTC
Could you test with the patch mentioned in bug 6070 and see if you get the
warning? Newer kernels will log it with "WARNING...", not "Badness...".
Comment 3 Robert Crocombe 2007-01-15 04:40:39 UTC
Stefan,

I applied:

http://me.in-berlin.de/~s5r6/linux1394/work-in-progress/debug-warn-if-we-sleep-in-an-irq-for-a-long-time/debug-warn-if-we-sleep-in-an-irq-for-a-long-time.2.6.19.patch

to a 2.6.19 kernel, but got bored after about ~350 resets with no warnings and
no port losses.  I then applied the patch to 2.6.19.1-rt15, and lost a port
after 148 resets.  However, the WARN_ON in the patch doesn't appear to trigger.

According to linux/bug.h, I should be grepping on "BUG: warning", but "grep -i"
for both "bug" and "warning" turned up nothing interesting.
Comment 4 Robert Crocombe 2007-01-15 04:42:14 UTC
Created attachment 10083 [details]
Code to provoke and notice the problem

Here's the code I now use to provoke the problem.
Comment 5 Stefan Richter 2007-01-15 05:19:52 UTC
OK, then this is more or less separate from bug 6070. I shall try your testcase
on my PCs. One has a CardBus-PCI bridge, the other one a PCI-PCIe bridge.
Needless to say, I never got this bug before, mostly using one or a few SBP-2
devices.
Comment 6 Stefan Richter 2007-01-15 05:25:50 UTC
I'll see if I can get an -rt kernel set up sometime soon.
Comment 7 Robert Crocombe 2007-01-15 10:55:05 UTC
Created attachment 10085 [details]
Very slightly revised code to provoke reset problem

Okay.  Note that it's not -rt only.  I finally just let the test thing run and
provoked it in successive tests after 438, 371, 340, and 445 resets.

I poked at the output from my program to make it more concise and pretty.  It
now looks something like:

root@spanky:read_file$ date; ./break_things ; date
Mon Jan 15 11:54:36 MST 2007
Signal handler installed.
There are 5 1394 interfaces.
[    0] Looking for 5 ports: 0 1 2 . 4
[    1] Looking for 5 ports: 0 1 2 . 4
[    2] Looking for 5 ports: 0 1 2 . 4
[    3] Looking for 5 ports: . 1 2 . 4
[    4] Looking for 5 ports: 0 1 2 . 4
[    5] Looking for 5 ports: 0 1 2 . 4
[    6] Looking for 5 ports: 0 1 2 . 4
[    7] Looking for 5 ports: 0 1 2 . 4
[    8] Looking for 5 ports: 0 1 2 . 4
[    9] Looking for 5 ports: 0 1 2 . 4
[   10] Looking for 5 ports: 0 1 2 . 4
Okay, I'm pretty sure port 3 is gone: missing 10 consecutive times
Program terminating
libraw1394: raw1394_destroy_handle line 157: Calling close on fd 3
Mon Jan 15 11:54:47 MST 200

where I ran this after port 3 had previously gone out after 445 resets.  I'll
reattach the code.
Comment 8 Stefan Richter 2007-01-17 00:52:34 UTC
Do you know if the IWill H8502 contains an IOMMU?
Comment 9 Robert Crocombe 2007-01-23 15:16:50 UTC
Yes, from "dmesg | grep IOMMU":

PCI-DMA: using GART IOMMU.

and there is a BIOS entry that indicates its presence.
Comment 10 Stefan Richter 2007-01-23 16:11:18 UTC
I currently hope that the solution is to serialize ohci1394's IRQ handler vs.
ohci1394 bus reset request function (and certainly a lot of other register
accessing functions) by means of spinlocks plus mmiowb barrier. Might be easy to
do but I don't know if I manage to put this together at one of these evenings;
there is a lot of other stuff going on here lately. Sorry for dragging this
issue out...
Comment 11 Robert Crocombe 2007-01-25 20:25:36 UTC
Eh.  We contacted Indigita to let them know about the problem, and I provided
them with my code, but they have decided that the problem doesn't merit their
attention.  I wasn't in on the exact conversation at this point, but the email
to me (from a co-worker) simply indicated that they called "uncle".  Uhm, in
case that doesn't translate, that is an indicator that they have given up on
providing a solution at their end.

This is annoying from a port density point of view, but the approach for the
future for the work here is to use Fireboard 800s in a PCI expansion chassis,
which was the initial approach until someone noticed the Indigita cards.

I would note that apparently Indigita does not own a 1394b bus analyzer (from
email), and their first solution to us was "don't do that" (put all 4 ports on
the same bus).

However, I have a bunch of the cards in a desk drawer now, and will be happy to
install one to test any solution you propose.  But take your time: the card's
functionality has become moot.

Thanks for all your efforts thus far, by the way.
Comment 12 Stefan Richter 2007-01-26 10:08:35 UTC
Created attachment 10198 [details]
[patch 2.6.20-rcX] ieee1394: ohci1394: serialize reset requests and irq handler

This is a poke in the dark and could even break more than it attempts to fix.
It will only address behaviour WRT bus resets which were triggered by local
software, not if the bus resets were forced by hardware or remote software.
Comment 13 Stefan Richter 2007-01-26 10:15:52 UTC
Well, given that even the linux1394 crew (what's left of it) is perplexed by
Linux' IEEE 1394 drivers, what do you expect from the poor guys at Indigita?
Comment 14 Stefan Richter 2007-01-26 10:19:28 UTC
PS about the patch: Need to do more research for the issue with externally
triggered bus resets.
Comment 15 Robert Crocombe 2007-02-07 08:52:03 UTC
Hrm, I'm not sure it's made much difference.  I've been pushed to a different
machine: same H8502, same # of processors, but less (10GB vs 32GB) RAM, and
causing the reset problem to occur seems to take a bit longer.  I ran 2.6.20 and
2.6.20-rt2 kernels both with and without the patch, and the results are like so:

2.6.20 w/out patch

1st iteration: 1264 resets then broke
2nd iteration: 4172 resets then broke
3rd iteration: 46,256 resets (!) resets then broke

2.6.20 w/ patch (I actually tried this first)

1st iteration: 1718 resets then broke
2nd iteration: 3010 resets then stopped by me (so I could see if the non-patched
was any worse)

2.6.20-rt2 w/out patch

1st iteration: 311 resets then broke
2nd iteration: 1035 resets then broke
3rd iteration: "10" resets, which means one interface wasn't restored via
rmmod/modprobe (didn't think to record which)
4th iteration: 48 resets then broke
5th iteration: 702 resets then broke
6th iteration: port 5 (indigita interfaces were 2,3,4,5) not restored (program
quit after my "give it enough time" 10 minimum resets)

At this point I ran my little reset script:

#!/bin/bash

rmmod raw1394
rmmod ohci1394
modprobe ohci1394
modprobe raw1394

and the kernel locked up hard (can't ping, keyboard seized)

2.6.20-rt2 (w/ patch)

1st iteration: 75 resets and broke
2nd iteration: 496 resets and then locked up (as above)
reset machine
3rd iteration: 41 resets
reset modules
immediate lockup on restarting test program
reset machine
4th iteration: 252 resets and then lost all 4 ports simultaneously (wow!)
reset modules -> lockup


So there's some ugliness in the -rt kernel, or some ugliness revealed by the -rt
kernel.  I'll probably poke at Ingo a little later.  The number of interrupts
that occurs skyrockets right before the lockup.  'vmstat 1' follows.  Normally I
seem to be running around 1600-2200 interrupts a second (Hz == 250), which dips
to about 400+ when I start the reset test program (and you can see the amount of
debug stuff that gets written) or when I do the rmmod/modprobe magic, and then
picks back up when the program terminates.  So 102k interrupts is a bit unusual...


procs -----------memory---------- ---swap-- -----io---- --system-- ----cpu----
 0  0      0 8714900 145420 348580    0    0     0     0 1887 17922  0  2 98  0
 0  0      0 8714776 145420 348580    0    0     0     0 1595 17254  0  3 97  0
 0  0      0 8714900 145420 348580    0    0     0     0 1703 17461  0  2 98  0
 0  0      0 8714884 145420 348580    0    0     0     0 1701 17512  0  2 98  0
 0  0      0 8714884 145420 348580    0    0     0     0 1582 17154  0  2 98  0
 0  1      0 8716808 145420 348580    0    0     0   416 1236 12329  0  1 89 10
 0  1      0 8716808 145420 348580    0    0     0  1036  395  4306  0  1 75 24
 0  1      0 8717036 145420 348588    0    0     0   976  380  4249  0  0 76 25
 0  1      0 8717036 145420 348588    0    0     0   992  385  4262  0  0 75 25
 0  1      0 8717036 145420 348596    0    0     0  1152  446  4638  0  0 76 24
 0  1      0 8717044 145420 348596    0    0     0  1008  389  4298  0  0 75 25
 0  1      0 8717048 145420 348608    0    0     0  1056  403  4384  0  0 75 25
 0  0      0 8716924 145420 348608    0    0     0   848  335  4151  0  0 79 21
 0  0      0 8715064 145420 348616    0    0     0     0 1707 14770  0  1 99  0
 0  0      0 8715064 145420 348616    0    0     0     0 1718 17723  0  3 97  0
 0  0      0 8715068 145420 348616    0    0     0     0 1700 17209  0  2 98  0
 0  0      0 8715068 145420 348616    0    0     0     0 1893 17218  0  2 98  0
 0  0      0 8714944 145420 348616    0    0     0     0 1921 17703  0  3 97  0
 0  0      0 8715068 145420 348616    0    0     0     0 1886 16980  0  2 98  0
 0  0      0 8717052 145420 348616    0    0     0    12  546  5675  0  0 99  0
 0  0      0 8715068 145420 348616    0    0     0     0 1606 11743  0  1 99  0
 0  0      0 8715200 145420 348616    0    0     0     0 1790 14243  0  2 98  0
 0  1      0 8711484 145420 348616    0    0     0   836  820  9356  1 11 72 16
 6  1      0 8711872 145420 348620    0    0     0   992 35476  4091  0 10 65 25
 6  1      0 8711872 145420 348620    0    0     0  1056 99101  3801  0 25 50 25
 6  1      0 8711872 145420 348628    0    0     0  1088 101964  3805  0 25 50 25
Comment 16 Robert Crocombe 2007-02-07 09:04:17 UTC
Created attachment 10334 [details]
Slightly more revised code to provoke reset problem

Tweaked to make it possible to ignore ports, which means they needn't be reset,
which saves a lot of SelfID debugging output.  For the Unibrain cards, which
seem not to have problems.
Comment 17 Stefan Richter 2007-02-07 13:37:11 UTC
Reviewing our old conversation, I noticed in http://lkml.org/lkml/2006/11/27/178
that there is one IRQ shared between two hosts (I would guess 1x Unibrain + 1x
Indigita?) and 3 separate IRQs (3x Indigita?). The shared interrupt stays on the
same CPU almost all the time; the others are quite evenly distributed between
CPUs. Stuff to think about.

In one post I said something about write ops to the interrupt event mask perhaps
not being flushed fast enough. The patch from comment #12 doesn't do any
flushing. The patch merely preserves ordering of the MMIO in
ohci_devctl/RESET_BUS versus ohci_irq_handler in presence of SMP + bus bridges,
like described in
http://lxr.linux.no/source/Documentation/memory-barriers.txt?v=2.6.18#L1285
(assuming the bus bridges work correctly).

I'm also trying to realize what http://lwn.net/Articles/146861/ wants to tell me.
Comment 18 Robert Crocombe 2007-02-12 12:10:44 UTC
Ah, I hadn't remembered the interrupt sharing.  That is not the case on the
"new" machine, which has 2 Unibrain cards (ports 0 and 1) and 1 Indigita (2,3,4,5).

[root@bubba trunk]# cat /proc/interrupts
           CPU0       CPU1       CPU2       CPU3
  0:        240          0          0          0   IO-APIC-edge      timer
  1:          1          0          0          9   IO-APIC-edge      i8042
  8:          0          0          0          0   IO-APIC-edge      rtc
  9:          0          0          0          0   IO-APIC-fasteoi   acpi
 12:          0          0          0        114   IO-APIC-edge      i8042
 19:     230585          0          0         33   IO-APIC-fasteoi   eth0
 21:          0          0          0          0   IO-APIC-fasteoi   ohci_hcd:usb2
 22:          0          0          0          0   IO-APIC-fasteoi   ehci_hcd:usb1
 72:          0          0          0          1   IO-APIC-fasteoi   ohci1394
 81:          0          0          0          2   IO-APIC-fasteoi   ohci1394
 88:          0          0         15          0   IO-APIC-fasteoi   aic79xx
 89:       3203      34300      12618      32923   IO-APIC-fasteoi   aic79xx
 93:          0          0          1          0   IO-APIC-fasteoi   ohci1394
 94:          0          0          1          0   IO-APIC-fasteoi   ohci1394
 95:          0          0          1          0   IO-APIC-fasteoi   ohci1394
 96:          0          0          1          0   IO-APIC-fasteoi   ohci1394
NMI:          0          0          0          0
LOC:   60856314   60700937   60602709   60999361
ERR:          0

and for my test of isochronous transmit w/ realtime behavior (my end goal),
interrupts are absolutely nailed to CPU 1.  The short version of the test is
1200Hz data for 10 seconds (12k interrupts).  Sure enough:

[root@bubba trunk]# cat /proc/interrupts
           CPU0       CPU1       CPU2       CPU3
  0:        240          0          0          0   IO-APIC-edge      timer
  1:          1          0          0          9   IO-APIC-edge      i8042
  8:          0          0          0          0   IO-APIC-edge      rtc
  9:          0          0          0          0   IO-APIC-fasteoi   acpi
 12:          0          0          0        114   IO-APIC-edge      i8042
 19:     231032          0          0         33   IO-APIC-fasteoi   eth0
 21:          0          0          0          0   IO-APIC-fasteoi   ohci_hcd:usb2
 22:          0          0          0          0   IO-APIC-fasteoi   ehci_hcd:usb1
 72:          0          0          0          1   IO-APIC-fasteoi   ohci1394
 81:          0      12000          0          3   IO-APIC-fasteoi   ohci1394
 88:          0          0         15          0   IO-APIC-fasteoi   aic79xx
 89:       3208      34335      12768      33755   IO-APIC-fasteoi   aic79xx
 93:          0          0          1          0   IO-APIC-fasteoi   ohci1394
 94:          0          0          1          0   IO-APIC-fasteoi   ohci1394
 95:          0          0          1          0   IO-APIC-fasteoi   ohci1394
 96:          0          0          1          0   IO-APIC-fasteoi   ohci1394
NMI:          0          0          0          0
LOC:   60883385   60728002   60629667   61026483
ERR:          0

exactly 12k on CPU1 for IRQ 81 (port 1, Unibrain).  This is actually behavior
that I find desirable, since I hope to use processor affinity and nail the task
to the appropriate CPU either via cpusets or pthread_setaffinity stuff.

Hrm, I just noticed that the interrupt type is now "IO-APIC-fasteoi" rather than
"IO-APIC-level".  I hope that's not a cause for concern.  I guess my config has
drifted a bit more than I'd thought.

FYI, this code is what I've used to cause:

http://bugzilla.kernel.org/show_bug.cgi?id=7979

and the config and dmesg are there, too.
Comment 19 Stefan Richter 2007-03-11 15:03:07 UTC
Created attachment 10698 [details]
[PATCH linux1394-2.6.git] ieee1394: ohci1394: serialize reset requests and irq handler

Slightly revised patch:  Now also handles the host->in_bus_reset flag
atomically.  It still only addresses concurrency between local software
requesting a bus reset and the local reset IRQ handler.  Therefore I don't
think it is more interesting to your setup than the previous version.  I'm
posting this mostly just for reference.
Comment 20 Stefan Richter 2007-03-12 00:56:18 UTC
Now that Kristian H
Comment 21 Stefan Richter 2007-03-30 15:47:36 UTC
Could this make a difference?
Mitch Williams: Flush MSI-X table writes
http://lkml.org/lkml/2007/3/30/173
Comment 22 Stefan Richter 2007-03-30 15:55:48 UTC
Or this:
Eric W. Biederman: x86_64 irq: handle irqs pending in IRR during irq migration
http://marc.info/?l=linux-mm-commits&m=117046481708594
Comment 23 Robert Crocombe 2007-04-11 17:46:58 UTC
I've tried the patch from Mitch Williams, and it does not appear to fix things.
 I hope to try Eric's patch fairly soon.
Comment 24 Robert Crocombe 2007-04-11 17:52:41 UTC
Created attachment 11133 [details]
New, hopefully better code for provoking reset problem

No longer reads /var/log/messages (yikes), excessive debugging output not
required.
Comment 25 Robert Crocombe 2007-04-11 18:35:59 UTC
I'm seeing a fair number of error messages being reported in /var/log/messages,
probably hidden in the voluminous output of the earlier code.  Looks like the
following, which occurred during a sequence where after 119 resets the test code
terminated.

Apr 11 18:14:41 bubba kernel: ohci1394: fw-host5: Unrecoverable error!
Apr 11 18:14:41 bubba kernel: ohci1394: fw-host5: physical posted write error
Apr 11 18:14:41 bubba kernel: ohci1394: fw-host5: isochronous cycle too long
Apr 11 18:14:41 bubba kernel: ohci1394: fw-host5: SelfID received outside of bus
reset sequence
Apr 11 18:14:41 bubba kernel: ohci1394: fw-host5: Unhandled interrupt(s) 0x00000100
Apr 11 18:16:12 bubba kernel: ohci1394: fw-host4: isochronous cycle too long
Apr 11 18:16:12 bubba kernel: ohci1394: fw-host4: SelfID received outside of bus
reset sequence

I think the termination of the reset software coincided with that last selfID
message, but sometimes it can continue past them (as it is doing as I type
this).  The unrecoverable error doesn't seem to cause the reset stuff any
heartburn, either.

Note that there are no cycle starts and no isochronous traffic.  Only bus
resets, self IDs, and some quadlet read requests from the FireSpy as it asks
nodes for the start of their config ROMs: it keeps asking the root for the start
of its config ROM (16 times per reset).  However, the root is a phy-only 1394b
hub, so there's no reply.

Kernel version is 2.6.21-rc5 with Mitch William's patch applied.

I've put up the complete output from a run at:

http://rcrocomb.googlepages.com/
Comment 26 Robert Crocombe 2007-04-12 12:13:37 UTC
I just tried 2.6.21-rc6-mm1, which I presume has Eric's patch in it: sadly, the
test terminated after 504 resets.  The errors reported during the entirety of
the run are:

Apr 12 11:41:53 bubba kernel: ohci1394: fw-host3: physical posted write error
Apr 12 11:41:53 bubba kernel: ohci1394: fw-host3: isochronous cycle too long
Apr 12 11:41:53 bubba kernel: ohci1394: fw-host3: SelfID received outside of bus
reset sequence
Apr 12 11:41:53 bubba kernel: ohci1394: fw-host3: Unhandled interrupt(s) 0x00000100
Apr 12 11:42:20 bubba kernel: ohci1394: fw-host6: physical posted write error
Apr 12 11:42:20 bubba kernel: ohci1394: fw-host6: isochronous cycle too long
Apr 12 11:42:20 bubba kernel: ohci1394: fw-host6: SelfID received outside of bus
reset sequence
Apr 12 11:42:20 bubba kernel: ohci1394: fw-host6: Unhandled interrupt(s) 0x00000100
Apr 12 11:44:41 bubba kernel: ohci1394: fw-host4: Unrecoverable error!
Apr 12 11:44:41 bubba kernel: ohci1394: fw-host4: physical posted write error
Apr 12 11:44:41 bubba kernel: ohci1394: fw-host4: isochronous cycle too long
Apr 12 11:44:41 bubba kernel: ohci1394: fw-host4: SelfID received outside of bus
reset sequence
Apr 12 11:44:41 bubba kernel: ohci1394: fw-host4: Unhandled interrupt(s) 0x00000100
Apr 12 11:45:50 bubba kernel: ohci1394: fw-host4: Unrecoverable error!
Apr 12 11:45:50 bubba kernel: ohci1394: fw-host4: physical posted write error
Apr 12 11:45:50 bubba kernel: ohci1394: fw-host4: isochronous cycle too long
Apr 12 11:45:50 bubba kernel: ohci1394: fw-host4: SelfID received outside of bus
reset sequence
Apr 12 11:45:50 bubba kernel: ohci1394: fw-host4: Unhandled interrupt(s) 0x00000100
Apr 12 11:46:35 bubba kernel: ohci1394: fw-host6: physical posted write error
Apr 12 11:46:35 bubba kernel: ohci1394: fw-host6: isochronous cycle too long
Apr 12 11:46:35 bubba kernel: ohci1394: fw-host6: SelfID received outside of bus
reset sequence
Apr 12 11:46:35 bubba kernel: ohci1394: fw-host6: Unhandled interrupt(s) 0x00000100
Apr 12 11:46:41 bubba kernel: ohci1394: fw-host4: Unrecoverable error!
Apr 12 11:46:41 bubba kernel: ohci1394: fw-host4: physical posted write error
Apr 12 11:46:41 bubba kernel: ohci1394: fw-host4: isochronous cycle too long
Apr 12 11:46:41 bubba kernel: ohci1394: fw-host4: SelfID received outside of bus
reset sequence
Apr 12 11:46:41 bubba kernel: ohci1394: fw-host4: Unhandled interrupt(s) 0x00000100
Apr 12 11:48:49 bubba kernel: ohci1394: fw-host3: isochronous cycle too long
Apr 12 11:48:49 bubba kernel: ohci1394: fw-host3: SelfID received outside of bus
reset sequence
Apr 12 11:49:00 bubba last message repeated 11 times

Again, there were no isochronous packets or cycle starts.  The FireSpy shows:

504 resets
4032 self IDs
8064 Read Requests

I notice that this kernel has the 'juju' driver, so I will try and get
Kristian's libraw thing and try when I get the chance.
Comment 27 Stefan Richter 2007-04-12 13:13:54 UTC
BTW, the "physical posted write error"s mean the controller has difficulties to
get access to the PCI bus.

The "Unhandled interrupt(s) 0x00000100" is merely a small mistake on my part
when I added the logging of posted write errors.  There should be a
    event &= ~OHCI1394_postedWriteErr;
in the respective if block in ohci_irq_handler.  But this mistake doesn't affect
the operation.

I don't know though if or how physical posted write errors could be related to
the subsequent undesired masking of interrupt events.
Comment 28 Natalie Protasevich 2007-07-04 16:44:17 UTC
How is it working with latest kernel?
(I'd suggest enabling OHCI1394_DEBUG, maybe it will give more info on the event that was received. Also how about testing on single CPU. Sounds like bus analyszer would help.. ;)
Comment 29 Stefan Richter 2007-07-04 23:43:32 UTC
The FireSpy is a FireWire analyzer.  This is more a case for a PCI analyzer, or actually two:  One in front of and one behind the Indigita card's PCI-X/PCI bridge.  Theoretically spoken.  (The Unibrain card never had this problem, right?)
Comment 30 Stefan Richter 2007-07-04 23:46:53 UTC
PS:  About testing a new kernel:  The only interesting update in there is the new alternative firewire stack.  The updates of the old ieee1394 stack are unlikely to affect the issue.
Comment 31 Natalie Protasevich 2007-07-05 00:28:37 UTC
Ah, great - is it something that Robert should try, was it mentioned in #20 I suppose?
Comment 32 Stefan Richter 2007-07-05 01:02:04 UTC
Yes, the firewire-ohci driver has for example a different bus reset interrupt handler.  It's not split into bus reset event and self-id complete event anymore.  Of course the new drivers aren't as widely tested yet; probably nobody tried them together with the -rt patch yet.  And there are still some features missing in the drivers and the userland support.  (I still need to compile a status-quo summary; I hope I have it ready about when 2.6.22 is released.)

Library ports to the new driver stack:
libraw1394 from git://people.freedesktop.org/~krh/libraw1394 (juju branch)
libdc1394 v2 from SVN, https://libdc1394.svn.sourceforge.net/svnroot/libdc1394
Comment 33 Robert Crocombe 2007-08-20 15:58:56 UTC
This was on my queue to try, but unfortunately I left the job before I was able to try the new stack.  I passed the bug # along to the appropriate people.

One of the last things I did was to try a similar kind of card, from DAP design: a 3 interface card:

http://www.dapdesign.com/index.php?id=84

Which lasted through ~350,000 resets without any problems (1 reset per second).  I don't remember whose PCI bridge the card used, but it is 64-bit/66MHz and not PCI-X

Per #29: we actually had PCI bus analyzers, but the card and the slot were PCI-X 100/133MHz, which means 1 device per "bus".  We hadn't any way to split the slot to get the bus analyzer before the card, and of course after the bridge would have meant some kind of craziness (I don't think you solder wires to a 133MHz bus).
Comment 34 Stefan Richter 2008-02-19 12:08:14 UTC
There are currently no resources to fix this in drivers/ieee1394/.  Anybody who is able to (attempt to) solve this, please reopen the bug.

drivers/firewire/ has a fundamentally reworked bus reset handling and _hopefully_ does not feature this problem --- if it isn't a hardware problem in the first place.
Comment 35 Stefan Richter 2008-03-22 04:18:24 UTC
Bug with some similarities: bug 10307

Note You need to log in before you can comment on or make changes to this bug.