Bug 8402

Summary: eth1394: cannot allocate transaction labels under heavy traffic
Product: Drivers
Reporter: Stefan Richter (stefanr)
Component: IEEE1394
Assignee: Stefan Richter (stefanr)
Status: CLOSED CODE_FIX    
Severity: normal    
Priority: P2    
Hardware: i386   
OS: Linux   
Kernel Version: all
Subsystem:
Regression: ---
Bisected commit-id:
Attachments: [PATCH 1/2] ieee1394: eth1394: remove bogus netif_wake_queue
[PATCH 2/2] ieee1394: eth1394: handle tlabel exhaustion
[PATCH 2.6.18] ieee1394: eth1394: handle tlabel exhaustion
ieee1394: fix to ether1394_tx in ether1394.c

Description Stefan Richter 2007-04-29 05:42:34 UTC
(got the report via personal e-mail)

Hardware Environment: 2.0 GHz dual core PC, 1394b hardware
Software Environment: eth1394 between two Linux PCs, tested with netperf -n 2 -H
... (http://www.netperf.org/)
Problem Description:
Apr 25 10:51:24 localhost kernel: 00:1023
Apr 25 10:51:24 localhost kernel: eth1394: No more tlabels left while sending to node 0-0:1023

The reason is that eth1394 attempts to transmit more than 64 packets before
the split transaction of the first packet has completed.  AFAICT this may
result in dropped IP packets.
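
To illustrate the limit: a 1394 transaction label is a 6-bit field, so at most 64 transactions to one node can be open at a time. The following is only a userspace sketch of such a pool, not the ieee1394 code; struct tlabel_pool, tlabel_get(), tlabel_put() and count_failed_sends() are made-up names for this example.

```c
#include <assert.h>
#include <stdint.h>

#define TLABEL_COUNT 64	/* 6-bit transaction label: 64 values per node */

/* Hypothetical pool: bit n set means label n belongs to an open transaction. */
struct tlabel_pool {
	uint64_t in_use;
};

/* Take the lowest free label; returns 0..63, or -1 if the pool is exhausted. */
static int tlabel_get(struct tlabel_pool *p)
{
	for (int t = 0; t < TLABEL_COUNT; t++) {
		uint64_t bit = (uint64_t)1 << t;

		if (!(p->in_use & bit)) {
			p->in_use |= bit;
			return t;
		}
	}
	return -1;
}

/* Give a label back once its split transaction has completed. */
static void tlabel_put(struct tlabel_pool *p, int t)
{
	assert(t >= 0 && t < TLABEL_COUNT);
	p->in_use &= ~((uint64_t)1 << t);
}

/* Try n sends with no completion in between; count those that get no label. */
static int count_failed_sends(struct tlabel_pool *p, int n)
{
	int failed = 0;

	for (int i = 0; i < n; i++)
		if (tlabel_get(p) < 0)
			failed++;
	return failed;
}
```

With an empty pool, every send beyond the 64th fails until some label has been put back.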
Comment 1 Stefan Richter 2007-04-30 07:38:34 UTC
The root cause is most certainly the same as in bug 6948:
Tlabels are consumed from softIRQ context but freed from kthread context
(khpsbpkt). There may already be enough transactions completed, but the kthread
is not woken up early enough to free their tlabels.

The fix in bug 6948 is more of a workaround: it blocks further SCSI requests
and initiates the last outstanding request (the first request which failed due
to tlabel pool exhaustion) from workqueue context, which sleeps until a tlabel
is available, i.e. until khpsbpkt has been woken up.

Therefore the ultimate fix for both bugs would apparently be to convert khpsbpkt
into tasklets.
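
A toy model of this effect, single-threaded and in userspace: labels are taken at send time but only returned by a separate recycle pass, which stands in for khpsbpkt here. All names (struct deferred_pool, demo_send(), demo_response(), demo_recycle_pass()) are invented for this sketch and do not exist in ieee1394.

```c
#include <assert.h>

#define DEMO_TLABELS 64

/* Hypothetical bookkeeping for one destination node. */
struct deferred_pool {
	int free_labels;	/* labels a sender can take right now */
	int in_flight;		/* transactions still waiting for a response */
	int to_recycle;		/* finished transactions, label not yet freed */
};

static void deferred_pool_init(struct deferred_pool *p)
{
	p->free_labels = DEMO_TLABELS;
	p->in_flight = 0;
	p->to_recycle = 0;
}

/* Sender side (softIRQ in the report): fails as soon as no label is free. */
static int demo_send(struct deferred_pool *p)
{
	if (p->free_labels == 0)
		return -1;
	p->free_labels--;
	p->in_flight++;
	return 0;
}

/* A response arrived: the transaction is done, but its label stays taken. */
static void demo_response(struct deferred_pool *p)
{
	if (p->in_flight > 0) {
		p->in_flight--;
		p->to_recycle++;
	}
}

/* Deferred pass (the kthread's role): only now do labels become free again. */
static void demo_recycle_pass(struct deferred_pool *p)
{
	p->free_labels += p->to_recycle;
	p->to_recycle = 0;
	assert(p->free_labels + p->in_flight + p->to_recycle == DEMO_TLABELS);
}
```

If 64 sends and 64 responses happen before the recycle pass runs, the 65th send fails even though no transaction is outstanding anymore.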
Comment 2 Stefan Richter 2007-04-30 09:13:15 UTC
Can also be reproduced using FTP.  Tested: 1.6 GHz single core Athlon with
proftpd serves a file larger than 500 MB to a 2.0 GHz Core 2 Duo with KDE's ftp
component.  The file is received fully intact; speed was ~12 MByte/s over
S400A.  No errors were logged on the client, but eth1394 on the server logged
tlabel exhaustion ~250 times.
Comment 3 Stefan Richter 2007-05-05 08:33:19 UTC
Created attachment 11397
[PATCH 1/2] ieee1394: eth1394: remove bogus netif_wake_queue

precondition for a following fix
Comment 4 Stefan Richter 2007-05-05 08:35:26 UTC
Created attachment 11398
[PATCH 2/2] ieee1394: eth1394: handle tlabel exhaustion

The patch halts the transmit queue if no tlabel is available to eth1394's
->hard_start_xmit().  A workqueue job is then scheduled to catch the moment
when ieee1394 has recycled the next lot of tlabels.


(Before this, I also tried an ieee1394 API addition which lets ieee1394
recycle tlabels already in the context of the low-level driver's receive
tasklet.  This lowered the frequency of tlabel exhaustion somewhat but didn't
address eth1394's inability to throttle the outgoing queue when necessary.  It
also made the transaction API even more complicated than it already is.)
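
The throttling idea from the first paragraph of this comment, reduced to a userspace sketch. This is not the attached patch: struct demo_txq, demo_xmit() and demo_wake_work() are hypothetical, the stopped flag stands in for what netif_stop_queue()/netif_wake_queue() do in a net driver, and the deferred job is simply handed the number of recycled labels instead of sleeping until one is available.

```c
#include <assert.h>
#include <stdbool.h>

enum xmit_result { XMIT_OK, XMIT_BUSY };	/* hypothetical return codes */

struct demo_txq {
	bool stopped;		/* queue halted, stack should not call xmit */
	bool work_pending;	/* deferred wake-up job has been scheduled */
	int free_labels;
	int sent;
};

/* Model of ->hard_start_xmit(): halt the queue instead of dropping. */
static enum xmit_result demo_xmit(struct demo_txq *q)
{
	if (q->stopped)
		return XMIT_BUSY;
	if (q->free_labels == 0) {
		q->stopped = true;	/* throttle the outgoing queue */
		q->work_pending = true;	/* ask a worker to watch for labels */
		return XMIT_BUSY;	/* packet stays queued for a retry */
	}
	q->free_labels--;
	q->sent++;
	return XMIT_OK;
}

/* Model of the deferred job: wake the queue once a label is free again. */
static void demo_wake_work(struct demo_txq *q, int recycled)
{
	assert(recycled >= 0);
	q->free_labels += recycled;
	if (q->work_pending && q->free_labels > 0) {
		q->work_pending = false;
		q->stopped = false;
	}
}
```

The point of the design is that a packet which finds no tlabel is held back and retried after the wake-up rather than being discarded.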
Comment 5 Stefan Richter 2007-05-06 03:43:33 UTC
Created attachment 11405
[PATCH 2.6.18] ieee1394: eth1394: handle tlabel exhaustion

patch 1/2 + 2/2 combined and backported to Linux 2.6.18, only compile-tested
Comment 6 Stefan Richter 2007-06-17 14:01:44 UTC
Created attachment 11772
ieee1394: fix to ether1394_tx in ether1394.c

additional patch, currently pending for mainline merge

This is necessary in addition to "ieee1394: eth1394: handle tlabel exhaustion" in order to get the re-queued packets after tlabel recovery into proper shape.
Comment 7 Stefan Richter 2007-07-05 11:59:19 UTC
fix merged in 2.6.22-rcSomething