Bug 8704

Summary: Ethernet over firewire 4pin<->4pin cable crashes in high traffic!!
Product: Drivers Reporter: Cijoml Cijomlovic Cijomlov (cijoml)
Component: IEEE1394Assignee: Stefan Richter (stefanr)
Status: CLOSED CODE_FIX    
Severity: high CC: stefanr
Priority: P1    
Hardware: All   
OS: Linux   
Kernel Version: 2.6.22.2 Subsystem:
Regression: --- Bisected commit-id:
Attachments: gscanbus output

Description Cijoml Cijomlovic Cijomlov 2007-07-02 09:43:47 UTC
Most recent kernel where this bug did not occur: unknown
Distribution: Debian testing
Hardware Environment:

OHCI Firewire cards connected via 4pin <-> 4pin cable

Software Environment: 2.6.21.5

---------------------
Problem Description:

1st. laptop:
02:00.4 FireWire (IEEE 1394): ALi Corporation M5253 P1394 OHCI 1.1 Controller

eth1      Zapouzdření:NEZNÁM  HWadr 00-90-E6-00-00-00-24-F7-00-00-00-00-00-00-00-00
          inet adr:10.0.0.72  Všesměr:10.0.0.255 Maska:255.255.255.0
          AKTIVOVÁNO VŠESMĚROVÉ_VYSÍLÁNÍ BĚŽÍ MULTICAST  MTU:1500  Metrika:1
          RX packets:0 errors:0 dropped:0 overruns:0 frame:0
          TX packets:90 errors:0 dropped:0 overruns:0 carrier:0
          kolizí:0 délka odchozí fronty:1000
          RX bytes:0 (0.0 b)  TX bytes:3240 (3.1 KiB)
ieee1394: Node added: ID:BUS[0-00:1023]  GUID[00030d002558ad3d]
ieee1394: Node changed: 0-01:1023 -> 0-00:1023
ieee1394: Node suspended: ID:BUS[0-00:1023]  GUID[00030d002558ad3d]
ieee1394: Node changed: 0-00:1023 -> 0-01:1023
ieee1394: Node resumed: ID:BUS[0-01:1023]  GUID[00030d002558ad3d]
ieee1394: Node suspended: ID:BUS[0-01:1023]  GUID[00030d002558ad3d]
ieee1394: Node resumed: ID:BUS[0-01:1023]  GUID[00030d002558ad3d]
ieee1394: raw1394: /dev/raw1394 device initialized
ieee1394: Node suspended: ID:BUS[0-01:1023]  GUID[00030d002558ad3d]
ieee1394: Node resumed: ID:BUS[0-01:1023]  GUID[00030d002558ad3d]
ieee1394: Node suspended: ID:BUS[0-01:1023]  GUID[00030d002558ad3d]

notas:/home/cijoml# ssh 10.0.0.2
ssh: connect to host 10.0.0.2 port 22: No route to host


---------

2nd. laptop:

02:09.0 FireWire (IEEE 1394): Texas Instruments TSB43AB22/A IEEE-1394a-2000 Controller (PHY/Link)

eth1      Zapouzdření:NEZNÁM  HWadr 00-03-0D-00-25-58-AD-3D-00-00-00-00-00-00-00-00
          inet adr:10.0.0.2  Všesměr:10.0.0.255 Maska:255.255.255.0
          AKTIVOVÁNO VŠESMĚROVÉ_VYSÍLÁNÍ BĚŽÍ MULTICAST  MTU:1500  Metrika:1
          RX packets:0 errors:0 dropped:0 overruns:0 frame:0
          TX packets:41 errors:0 dropped:0 overruns:0 carrier:0
          kolizí:0 délka odchozí fronty:1000
          RX bytes:0 (0.0 b)  TX bytes:5412 (5.2 KiB)

ieee1394: Node suspended: ID:BUS[0-01:1023]  GUID[0090e600000024f7]
ieee1394: Node changed: 0-01:1023 -> 0-00:1023
ieee1394: Node resumed: ID:BUS[0-00:1023]  GUID[0090e600000024f7]
ieee1394: Node changed: 0-00:1023 -> 0-01:1023
ieee1394: Node changed: 0-01:1023 -> 0-00:1023
ieee1394: Node suspended: ID:BUS[0-00:1023]  GUID[0090e600000024f7]
ieee1394: Node resumed: ID:BUS[0-00:1023]  GUID[0090e600000024f7]
ieee1394: Node changed: 0-00:1023 -> 0-01:1023
ieee1394: raw1394: /dev/raw1394 device initialized
ieee1394: Node changed: 0-01:1023 -> 0-00:1023
ieee1394: Node suspended: ID:BUS[0-00:1023]  GUID[0090e600000024f7]
ieee1394: Node resumed: ID:BUS[0-00:1023]  GUID[0090e600000024f7]
ieee1394: Node changed: 0-00:1023 -> 0-01:1023
ieee1394: Node changed: 0-01:1023 -> 0-00:1023
ieee1394: Node suspended: ID:BUS[0-00:1023]  GUID[0090e600000024f7]



Listik:~# ssh 10.0.0.72
ssh: connect to host 10.0.0.72 port 22: No route to host

---------------------------
Steps to reproduce:

connect, setup, check
Comment 1 Stefan Richter 2007-07-02 12:30:31 UTC
The type of cable shouldn't matter.  Any combination of 4-, 6-, 9-pin plugs should work the same.

The log messages from ieee1394 indicate that the nodes see each other.  If you have gscanbus (low-level FireWire test tool with GUI), then this should too be able to "see" another Linux node connected.

Questions:

  - When do the "Node suspended" messages happen?  Arbitrarily while you want to use the eth1394 interface, or simply _after_ you were done with eth1394 and unplugged the cable?

  - Do you load the ieee1394 drivers with parameters different from their defaults?  (grep -r 1394 /etc/modprobe.conf /etc/modprobe.d/)
Comment 2 Cijoml Cijomlovic Cijomlov 2007-07-02 13:01:41 UTC
Node suspended is done "itself" - cable connected, but I see these messages.
I used default params no additional parameters.
Comment 3 Stefan Richter 2007-07-02 13:18:53 UTC
Well, that's bad.  It means that the connection was lost on a low IEEE 1394 level.  I don't know why.  Could yo please do this on both PCs after such a suspended message:
$ cat /sys/bus/ieee1394/devices/fw-host0/node_count
$ cat /sys/bus/ieee1394/devices/fw-host0/nodes_active
Comment 4 Cijoml Cijomlovic Cijomlov 2007-07-03 14:43:21 UTC
Well so here is the update - I made some tests and disconnect looks like I moved cables - so now I don't see any suspended.

Both nodes shows:

Listik:~# cat /sys/bus/ieee1394/devices/fw-host0/nodes_active
2
Listik:~# cat /sys/bus/ieee1394/devices/fw-host0/node_count
2

notas:~# cat /sys/bus/ieee1394/devices/fw-host0/nodes_active
2
notas:~# cat /sys/bus/ieee1394/devices/fw-host0/node_count
2

But no ping/ssh/tcp/udp traffic is going through
Comment 5 Cijoml Cijomlovic Cijomlov 2007-07-03 14:50:06 UTC
Created attachment 11936 [details]
gscanbus output

Attaching gscanbus output to see atributes of both hosts.
Comment 6 Cijoml Cijomlovic Cijomlov 2007-07-08 07:10:49 UTC
Hi Stefan,

any updates/news in this bugreport?

Michal
Comment 7 Stefan Richter 2007-07-12 16:02:38 UTC
Looks like ieee1394 and eth1394 should have all information they need. Do all of the following symlinks exist on both PCs when they are connected?
/sys/bus/ieee1394/devices/00030d002558ad3d
/sys/bus/ieee1394/devices/00030d002558ad3d-0
/sys/bus/ieee1394/devices/0090e600000024f7
/sys/bus/ieee1394/devices/0090e600000024f7-0
Comment 8 Cijoml Cijomlovic Cijomlov 2007-08-14 11:48:16 UTC
Hi,

I moved to 2.6.22.2 kernel and now communication works, but under HIGH traffic it crashes. I tried ping (works) but in the moment I try scp copy from 1 host to other I gett to following problem:

Laptop 1 (pcmcia card)
ohci1394: fw-host0: OHCI-1394 1.1 (PCI): IRQ=[10]  MMIO=[5c003000-5c0037ff]  Max Packet=[2048]  IR/IT contexts=[4/8]
eth1394: eth1: IPv4 over IEEE 1394 (fw-host0)
ieee1394: Error parsing configrom for node 0-00:1023
ieee1394: Node added: ID:BUS[0-01:1023]  GUID[00030d002558ad3d]
ieee1394: Host added: ID:BUS[0-00:1023]  GUID[0090e600000024f7]
ohci1394: fw-host0: AT dma reset ctx=0, aborting transmission
ieee1394: Node suspended: ID:BUS[0-01:1023]  GUID[00030d002558ad3d]
ieee1394: Node resumed: ID:BUS[0-01:1023]  GUID[00030d002558ad3d]

Laptop 2 (onboard controller)
ieee1394: The root node is not cycle master capable; selecting a new root node and resetting...
ieee1394: Error parsing configrom for node 0-00:1023
ieee1394: Node changed: 0-00:1023 -> 0-01:1023
ieee1394: Node resumed: ID:BUS[0-00:1023]  GUID[0090e600000024f7]
ohci1394: fw-host0: AT dma reset ctx=0, aborting transmission
ohci1394: fw-host0: AT dma reset ctx=1, aborting transmission
ieee1394: Node changed: 0-01:1023 -> 0-00:1023
ieee1394: Node suspended: ID:BUS[0-00:1023]  GUID[0090e600000024f7]
ieee1394: Node resumed: ID:BUS[0-00:1023]  GUID[0090e600000024f7]
ieee1394: Node changed: 0-00:1023 -> 0-01:1023
ieee1394: Node changed: 0-01:1023 -> 0-00:1023
ieee1394: Node suspended: ID:BUS[0-00:1023]  GUID[0090e600000024f7]
usb 4-5: USB disconnect, address 50
ieee1394: The root node is not cycle master capable; selecting a new root node and resetting...
ieee1394: Error parsing configrom for node 0-00:1023
ieee1394: Node changed: 0-00:1023 -> 0-01:1023
ieee1394: Node resumed: ID:BUS[0-00:1023]  GUID[0090e600000024f7]



notas:/home/cijoml# scp /usr/src/linux-2.6.22.2.tar.bz2 10.0.0.2:/usr/src/
Password:
linux-2.6.22.2.tar.bz2                                                                    10% 4744KB   0.2KB/s - stalled -n

64 bytes from 10.0.0.2: icmp_seq=48 ttl=64 time=0.346 ms
64 bytes from 10.0.0.2: icmp_seq=49 ttl=64 time=0.346 ms
64 bytes from 10.0.0.2: icmp_seq=50 ttl=64 time=0.354 ms
64 bytes from 10.0.0.2: icmp_seq=51 ttl=64 time=0.333 ms
64 bytes from 10.0.0.2: icmp_seq=52 ttl=64 time=0.348 ms
64 bytes from 10.0.0.2: icmp_seq=53 ttl=64 time=0.342 ms
64 bytes from 10.0.0.2: icmp_seq=54 ttl=64 time=0.340 ms
64 bytes from 10.0.0.2: icmp_seq=55 ttl=64 time=0.304 ms
64 bytes from 10.0.0.2: icmp_seq=56 ttl=64 time=0.471 ms
From 10.0.0.1 icmp_seq=86 Destination Host Unreachable
From 10.0.0.1 icmp_seq=87 Destination Host Unreachable
From 10.0.0.1 icmp_seq=88 Destination Host Unreachable
From 10.0.0.1 icmp_seq=90 Destination Host Unreachable
From 10.0.0.1 icmp_seq=91 Destination Host Unreachable
From 10.0.0.1 icmp_seq=92 Destination Host Unreachable
From 10.0.0.1 icmp_seq=94 Destination Host Unreachable
From 10.0.0.1 icmp_seq=95 Destination Host Unreachable
From 10.0.0.1 icmp_seq=96 Destination Host Unreachable
From 10.0.0.1 icmp_seq=98 Destination Host Unreachable
From 10.0.0.1 icmp_seq=99 Destination Host Unreachable
From 10.0.0.1 icmp_seq=100 Destination Host Unreachable
From 10.0.0.1 icmp_seq=102 Destination Host Unreachable
From 10.0.0.1 icmp_seq=103 Destination Host Unreachable
From 10.0.0.1 icmp_seq=104 Destination Host Unreachable



top - 10:39:06 up  4:22,  1 user,  load average: 1.53, 1.21, 1.21
Tasks: 111 total,   2 running, 109 sleeping,   0 stopped,   0 zombie
Cpu(s):  4.0%us, 50.0%sy,  0.0%ni,  0.0%id,  0.0%wa,  1.7%hi, 44.4%si,  0.0%st
Mem:   1018616k total,   449668k used,   568948k free,    42612k buffers
Swap:   698816k total,        0k used,   698816k free,   188048k cached

  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
    3 root      39  19     0    0    0 R 60.7  0.0   3:28.48 ksoftirqd/0
    5 root      10  -5     0    0    0 S 34.5  0.0   1:16.44 events/0
 3346 root       5 -10  176m  43m 3632 S  2.3  4.4  27:39.64 Xorg
11559 cijoml    15   0 33768  15m  11m S  2.3  1.6   0:19.27 konsole
23442 cijoml    15   0  2360 1144  864 R  0.7  0.1   0:00.66 top
 3891 cijoml    15   0 33868  15m  12m S  0.3  1.6   2:05.32 kicker
22574 cijoml    15   0 58812  40m  19m S  0.3  4.1   1:34.33 konqueror
    1 root      15   0  2064  700  600 S  0.0  0.1   0:00.83 init
    2 root      10  -5     0    0    0 S  0.0  0.0   0:00.00 kthreadd
    4 root      RT  -5     0    0    0 S  0.0  0.0   0:00.00 watchdog/0
    6 root      10  -5     0    0    0 S  0.0  0.0   0:00.19 khelper
   35 root      10  -5     0    0    0 S  0.0  0.0   0:00.16 kblockd/0
   36 root      20  -5     0    0    0 S  0.0  0.0   0:00.00 kacpid
   37 root      20  -5     0    0    0 S  0.0  0.0   0:00.00 kacpi_notify
  132 root      20  -5     0    0    0 S  0.0  0.0   0:00.00 ata/0
  133 root      20  -5     0    0    0 S  0.0  0.0   0:00.00 ata_aux
  134 root      10  -5     0    0    0 S  0.0  0.0   0:00.02 ksuspend_usbd
  137 root      13  -5     0    0    0 S  0.0  0.0   0:00.04 khubd
  139 root      10  -5     0    0    0 S  0.0  0.0   0:00.04 kseriod
  159 root      15   0     0    0    0 S  0.0  0.0   0:00.00 pdflush
  160 root      15   0     0    0    0 S  0.0  0.0   0:01.13 pdflush
  161 root      10  -5     0    0    0 S  0.0  0.0   0:00.00 kswapd0
  212 root      20  -5     0    0    0 S  0.0  0.0   0:00.00 aio/0
  847 root      10  -5     0    0    0 S  0.0  0.0   0:00.00 pccardd
  849 root      10  -5     0    0    0 S  0.0  0.0   0:00.00 pccardd
  865 root      10  -5     0    0    0 S  0.0  0.0   0:00.05 kondemand/0
  878 root      10  -5     0    0    0 S  0.0  0.0   0:02.35 kjournald
  980 root      16  -4  2888 1220  368 S  0.0  0.1   0:01.13 udevd
 1290 root      11  -5     0    0    0 S  0.0  0.0   0:00.00 kpsmoused
 1437 root      10  -5     0    0    0 S  0.0  0.0   0:00.00 khpsbpkt
 1438 root      11  -5     0    0    0 S  0.0  0.0   0:00.00 knodemgrd_0
Comment 9 Stefan Richter 2007-08-14 13:36:20 UTC
Please do the following for me:  Check which of the log files receives kernel messages. On many distros it is /var/log/messages, on some it is /var/log/syslog; there may be others.  Then let a "tail -f /var/log/messages" (or whatever) running in one terminal while you generate eth1394 traffic in another terminal.  As soon as traffic breaks down, check if there were kernel messages right in this moment.

And what kind of machines do you have connected?  (I.e. which processor horsepower do they have?)
Comment 10 Cijoml Cijomlovic Cijomlov 2007-08-14 14:41:04 UTC
Hi,

I use Debian - only logs I receive are stored in dmesg and same in syslog.

laptop with internal FW card - Pentium M 1.7 GHz
laptop with PCMCIA card - Pentium Celeron (P4 2.4 GHz)
Comment 11 Cijoml Cijomlovic Cijomlov 2007-08-14 14:43:01 UTC
BTW I think problem is with this in top:

    3 root      39  19     0    0    0 R 60.7  0.0   3:28.48 ksoftirqd/0
                                                    ^^^

Which disappeared when I removed pcmcia card from slot (only way howto get communication back to work)
Comment 12 Stefan Richter 2007-08-14 15:04:11 UTC
Right, the values for ksoftirqd/0 and events/0 are insane from what I can tell.
Comment 13 Stefan Richter 2007-08-14 15:11:32 UTC
Could you watch how fast the CPU time of ksoftirqd increases over a suitably long period, both while you generate traffic through eth1394 and also after that with eth1394 unloaded from the kernel?  ("ps aux" shows the CPU time too.)
Comment 14 Stefan Richter 2007-08-30 08:53:25 UTC
I almost forgot:  In the meantime, the Sparse checker found a bug in eth1394 regarding spinlock usage when handling incoming data.  I don't know if this is in any way related to this bug here with high CPU load from softirqd and events thread.  The patch is in -mm, linux1394-2.6.git, and at http://me.in-berlin.de/~s5r6/linux1394/updates/.
Comment 15 Cijoml Cijomlovic Cijomlov 2007-10-10 22:55:34 UTC
Hi Stefan, I just tested 2.6.23 kernel and now eth code work - via SSH I moved 1.7 GB without any problem. Only transport speed 7 MB/s looks little bit small, but maybe it is caused by encryption 73%/30% CPU usage at both nodes.
Comment 16 Stefan Richter 2007-10-11 10:15:26 UTC
Thanks for staying patient and testing each current kernel.  I'm glad that it works now; I'm not sure though which of the 2.6.22(.2)->2.6.23 changes did the trick...

The throughput of IP over FireWire 400 (~400 Mbit/s base rate) is roughly comparable with Fast Ethernet (100 Mbit/s) due to considerably larger overhead, and perhaps lack of throughput optimization in Linux' IEEE 1394 drivers.  And if a CPU intensive task is going on at the same time, IP over 1394 throughput will suffer more or less.  I observed the latter also on OS X, where an older machine had a considerable drop in IP over 1394 throughput if SSH was used instead of FTP.