Bug 8361

Summary: eth1394 unreliable with peers with moderate CPU power (or IP fragmentation?)
Product: Drivers
Reporter: Stefan Richter (stefanr)
Component: IEEE1394
Assignee: drivers_ieee1394
Status: REJECTED WILL_NOT_FIX
Severity: normal
CC: protasnb
Priority: P2
Hardware: i386
OS: Linux
Kernel Version: all
Subsystem:
Regression: ---
Bisected commit-id:
Bug Depends on:
Bug Blocks: 10046

Description Stefan Richter 2007-04-22 12:37:11 UTC
Tests of eth1394 in Linux 2.6.16.y and 2.6.21-rcX against OS X reveal considerable
reliability problems during bigger data transfers if peers are under high CPU load.

Test software: scp, encryption and compression enabled.

A) 1.6 GHz Athlon PC running Linux 2.6.16.48 or 2.6.21-rc7
B) 2.0 GHz Core 2 Duo PC running Linux 2.6.21-rc5
C) 400 MHz G4 PC running OS X 10.3.9
D) 2.3 GHz Core 2 Duo PC running OS X 10.4.9

In these tests, C was under 100% CPU load, with ~60% system load.  All other PCs
were only under moderate CPU load.  The OS X PCs only had 1394a (S400) adapters,
the Linux PCs had, and were tested with, 1394a (S400) and 1394b (S800) adapters.

-----------------------------------------------------------------

A-->B  OK, ~11 MB/s  (regardless of 1394a or 1394b connection)
B-->A  OK, ~11 MB/s  (regardless of 1394a or 1394b connection)

The equal speeds of S400 and S800 connections may perhaps be the fault of the
1394b card in A: http://www.linux1394.org/view_device.php?id=918

-----------------------------------------------------------------

A-->C  OK, ~3 MB/s
C-->A  fails after varying amounts of transferred data (~ 10...50 MB)
       scp says: "Received disconnect from UNKNOWN: 2: Corrupted MAC on input"

A-->D  OK, ~12 MB/s
D-->A  OK, ~10 MB/s

-----------------------------------------------------------------

B-->C  OK, ~3.5 MB/s
C-->B  fails like C-->A, but gets more data transferred before failure
       (~100...150 MB)

B-->D  OK, ~18 MB/s   (better than 100 Mbit/s ethernet: ~11 MB/s)
D-->B  OK, ~15 MB/s   (better than 100 Mbit/s ethernet: ~11 MB/s)

-----------------------------------------------------------------

C-->D  OK, <2.5 MB/s
D-->C  OK, ~3 MB/s

-----------------------------------------------------------------

Another observation:

vncviewer on A and AppleVNCServer on C quickly shows corrupted images and very soon
crashes with an error reported by zlib.
vncviewer on B and AppleVNCServer on C seemingly works.
vncviewer on A or B and AppleVNCServer on D works.

As a small comfort, in the scp tests the 2.0 GHz Linux PC had lower CPU utilization
from Linux's IP over 1394 implementation than the 2.3 GHz OS X PC had from OS X's
implementation.
Comment 1 Stefan Richter 2007-07-05 01:21:33 UTC
Bug status:  Still present with the latest eth1394 fixes.  Due to time constraints I am inclined to put eth1394 debugging on hold for now because we also still have to implement IP over 1394 for the new alternative/replacement firewire driver stack.  If/when that project succeeds, we will either have learned what is necessary to fix the old eth1394, or we do a WILL_NOT_FIX here.
Comment 2 Stefan Richter 2007-10-15 12:34:36 UTC
Note to self:  Need to retest under 2.6.23.  Different eth1394 troubles reportedly vanished due to unidentified changes between .22 and .23.
Comment 3 Stefan Richter 2007-11-25 12:59:42 UTC
Tested again with "C" (400 MHz G4 with OS X 10.3.9) and 2.33 GHz Linux dual core PC:  Bug still exists in 2.6.24-rc3.  Same symptoms:

  - scp from OS X to Linux fails after a few MB with "disconnect from UNKNOWN: 2: Corrupted MAC on input".  (The sshd on Linux killed the connection and logged "Disconnecting: Corrupted MAC on input" to syslog.)

  - vncviewer on Linux showing an animated preview of the screensaver prefs pane of OS X aborts after a while with "zlib inflate returned error: -3, msg: invalid stored block length".
Comment 4 Stefan Richter 2008-02-19 12:24:13 UTC
There are currently no resources to fix this in drivers/ieee1394/.
drivers/firewire/ does not feature this problem.
Comment 5 Stefan Richter 2008-07-06 02:33:26 UTC
Another cause for this issue could be that system C's FireWire controller is a CardBus card with maximum asynchronous payload limited to 1024 bytes.  Since the standard MTU (and the only standardized MTU) is 1500, IP datagrams are being fragmented when sent to and from such a CardBus controller.  So, perhaps Mac OS X and Linux don't interoperate correctly when IP fragmentation is involved.  Linux to Linux through a CardBus controller worked though when I tested the last time.
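
To put numbers on the fragmentation scenario above, here is a minimal sketch. It assumes the fragmentation in question is the link-level fragmentation of RFC 2734, with a 4-byte header on unfragmented datagrams and an 8-byte header on every fragment. The function name and the constants are made up for illustration; this is not code from eth1394.

```python
import math

# Assumed RFC 2734 encapsulation header sizes, in bytes (illustrative constants).
UNFRAGMENTED_HEADER = 4
FRAGMENT_HEADER = 8


def link_fragments(datagram_len, max_payload):
    """Return the number of datagram bytes carried by each 1394 packet.

    datagram_len: size of the IP datagram (at most the MTU, e.g. 1500)
    max_payload:  maximum asynchronous payload of the weaker peer
    """
    # The datagram fits into one packet together with the short header.
    if datagram_len + UNFRAGMENTED_HEADER <= max_payload:
        return [datagram_len]

    per_fragment = max_payload - FRAGMENT_HEADER
    if per_fragment <= 0:
        raise ValueError("max_payload too small to carry any data")

    count = math.ceil(datagram_len / per_fragment)
    sizes = [per_fragment] * (count - 1)
    # The last fragment carries whatever is left over.
    sizes.append(datagram_len - per_fragment * (count - 1))
    return sizes


if __name__ == "__main__":
    print(link_fragments(1500, 1024))  # CardBus controller case from this comment
    print(link_fragments(1500, 2048))  # assumed S400 controller with 2048-byte payload
```

Under these assumptions a full 1500-byte datagram to or from the CardBus controller is split into two packets carrying 1016 and 484 bytes, while a controller with a 2048-byte payload limit sends it in one piece.  So every full-sized datagram exchanged with system C would go through the fragment reassembly path.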

I wrote:
> There are currently no resources to fix this in drivers/ieee1394/.
> drivers/firewire/ does not feature this problem.

A lame sentence... (copy and paste syndrome).  It's true but only because drivers/firewire/ still does not have IP over 1394 support at all.

When we reimplement IP over 1394 for drivers/firewire/ or port eth1394 to drivers/firewire/, we should take care not to port possible bugs in the handling of IP fragments.  I reopen this bug for better visibility.
Comment 6 Stefan Richter 2008-12-13 03:22:06 UTC
It is unlikely that anybody will take care of this bug before an IP over 1394 implementation for the new firewire driver stack emerges.