Bug 6796 - skge performance slow with iperf as client
Summary: skge performance slow with iperf as client
Status: RESOLVED PATCH_ALREADY_AVAILABLE
Alias: None
Product: Drivers
Classification: Unclassified
Component: Network
Hardware: i386 Linux
Importance: P2 normal
Assignee: Stephen Hemminger
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2006-07-05 11:38 UTC by David G. Miller
Modified: 2008-12-15 10:12 UTC
CC List: 2 users

See Also:
Kernel Version: 2.6.17.1-2139
Subsystem:
Regression: ---
Bisected commit-id:


Attachments

Description David G. Miller 2006-07-05 11:38:58 UTC
Most recent kernel where this bug did not occur: 2.6.9 (latest kernel I have
installed that doesn't exhibit this behavior).

Distribution: Fedora Core 4

Hardware Environment: Tyan Tiger MPX mainboard with two Athlon (32 bit) 2400+
CPUs, 2GB RAM, NIC is a 3com 3c2000-t connecting through a 3com OfficeConnect
gigabit switch.

Software Environment: iperf version 2.0.2

Problem Description: I'm seeing asymmetric speed (significantly different
depending on whether the system is the iperf client or the iperf server)
reported by iperf between two systems:

[dave@bend ~]# iperf -c mutilate
------------------------------------------------------------
Client connecting to mutilate, TCP port 5001
TCP window size: 27.4 KByte (default)
------------------------------------------------------------
[  3] local 192.168.0.185 port 59584 connected with 192.168.255.254 port 5001
[  3]  0.0-10.0 sec    232 MBytes    194 Mbits/sec
[dave@bend ~]# iperf -s
------------------------------------------------------------
Server listening on TCP port 5001
TCP window size: 85.3 KByte (default)
------------------------------------------------------------
[  4] local 192.168.0.185 port 5001 connected with 192.168.255.254 port 33464
[  4]  0.0-10.0 sec    815 MBytes    684 Mbits/sec

I've swapped cables and NICs but the problem remains with the system running
2.6.17 and using the skge module.  The other system is running CentOS 4.3 which
is still on the 2.6.9 kernel and using the sk98lin module.  I don't see the same
asymmetric speed difference between it and another CentOS box depending on which
role each system has in the test.

Steps to reproduce:

1) install iperf on a system running a version of the kernel that does not
include the skge module and on one that does.

2) "Tune" the system with the skge module by setting the MTU to 9000.  This
makes the difference more dramatic but isn't strictly required.  The
iperf-reported speed drops by about a third at an MTU of 1500 for the system
with the skge module in server mode.

3) Run iperf on each system with one as client and the other as server and then
reverse the roles.

Expected results:

iperf performance varies very little regardless of which system is client and
which is server.

Actual results:

See bug description.  iperf reports the connection is about 3X faster when the
system running skge is the server as compared to when it is the client.

Notes: Made several cable swaps and even replaced the NIC with a brand new (was
still in the shrink wrap) 3c2000-t in the "skge" system.  Moved a cable that
exhibited the problem so that all test traffic traversed this cable and didn't
see the problem between the CentOS systems (2.6.9 kernel, sk98lin module); just
between either CentOS system and the FC4 system (2.6.17 kernel, skge driver).
Comment 1 Stephen Hemminger 2006-07-10 11:09:16 UTC
Perhaps you are seeing checksum errors.  Please try comparing performance
with transmit checksum offload disabled.

ethtool -K eth0 tx off

It could be that the transmit checksum offload is sending bad checksums.
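
A minimal before-and-after check, assuming the skge interface is eth1 (as in
the ethtool output later in this report) and the same iperf server:

   ethtool -k eth1          # show the current offload settings
   ethtool -K eth1 tx off   # disable transmit checksum offload
   iperf -c mutilate        # repeat the client-side test
   ethtool -K eth1 tx on    # restore the offload afterwards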
Comment 2 David G. Miller 2006-07-10 13:09:14 UTC
Tiny improvement.  The first iperf is "before" and the second one is "after":

[dave@bend ~/tgz]# iperf -c mutilate
------------------------------------------------------------
Client connecting to mutilate, TCP port 5001
TCP window size: 64.0 KByte (default)
------------------------------------------------------------
[  3] local 192.168.0.185 port 39842 connected with 192.168.255.254 port 5001
[  3]  0.0-10.0 sec    233 MBytes    195 Mbits/sec
[dave@bend ~/tgz]# iperf -c mutilate
------------------------------------------------------------
Client connecting to mutilate, TCP port 5001
TCP window size: 64.0 KByte (default)
------------------------------------------------------------
[  3] local 192.168.0.185 port 53165 connected with 192.168.255.254 port 5001
[  3]  0.0-10.0 sec    241 MBytes    202 Mbits/sec

The other offload settings are:

[root@bend ~]# ethtool -k eth1
Offload parameters for eth1:
Cannot get device tcp segmentation offload settings: Operation not supported
rx-checksumming: on
tx-checksumming: off
scatter-gather: off
tcp segmentation offload: off

Performance in the "other" direction is pretty much unaffected:

[dave@fraud ~]# iperf -c bend
------------------------------------------------------------
Client connecting to bend, TCP port 5001
TCP window size: 27.5 KByte (default)
------------------------------------------------------------
[  3] local 192.168.255.254 port 34406 connected with 192.168.0.185 port 5001
[  3]  0.0-10.0 sec    815 MBytes    684 Mbits/sec
Comment 3 Stephen Hemminger 2006-07-10 13:14:49 UTC
Did you tune the TCP settings? You may need to increase tcp_rmem and tcp_wmem
to get full TCP performance. Is this a direct LAN connection? Are there any
firewalls or other "middleboxes" in the way? What is the system on the other end?

Does turning off TCP window scaling change the results:
   sysctl -w net.ipv4.tcp_window_scaling=0
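
A sketch of the sort of buffer tuning being suggested here (example values
only, not ones used by the reporter or assignee; Comment 5 shows a real
/etc/sysctl.conf excerpt):

   sysctl -w net.core.rmem_max=16777216
   sysctl -w net.core.wmem_max=16777216
   sysctl -w net.ipv4.tcp_rmem="4096 87380 16777216"
   sysctl -w net.ipv4.tcp_wmem="4096 65536 16777216"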
Comment 4 David G. Miller 2006-07-10 19:34:27 UTC
First trial is after turning off TCP window scaling.  Second is after turning it
back on.

[dave@bend ~]# iperf -c mutilate -m
------------------------------------------------------------
Client connecting to mutilate, TCP port 5001
TCP window size: 64.0 KByte (default)
------------------------------------------------------------
[  3] local 192.168.0.185 port 50557 connected with 192.168.255.254 port 5001
[  3]  0.0-10.0 sec    229 MBytes    192 Mbits/sec
[  3] MSS size 8948 bytes (MTU 8988 bytes, unknown interface)
[dave@bend ~]# iperf -c mutilate -m
------------------------------------------------------------
Client connecting to mutilate, TCP port 5001
TCP window size: 64.0 KByte (default)
------------------------------------------------------------
[  3] local 192.168.0.185 port 36099 connected with 192.168.255.254 port 5001
[  3]  0.0-10.0 sec    241 MBytes    202 Mbits/sec
[  3] MSS size 8948 bytes (MTU 8988 bytes, unknown interface)
[dave@bend ~]# 

In answer to your other questions, the boxes connect directly via a 3com
3C1671600 switch.  All of the boxes run iptables with only a few ports
(including 5001) open (yes, I'm paranoid).  I did:

   echo 2500000 > /proc/sys/net/core/rmem_max
   echo "4096 5000000 5000000" > /proc/sys/net/ipv4/tcp_rmem
   echo "4096 65536 5000000" > /proc/sys/net/ipv4/tcp_wmem

when I first discovered this issue and went looking for tuning parameters that
might make a difference.  If you have alternate values you would like me to
try, let me know.

This is my own little corner of the internet so I can go offline long enough to
stop iptables on the two boxes involved and re-run the test without iptables if
you think that is an interesting data point.
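
A minimal version of that test, assuming the Fedora/CentOS iptables init
script on both boxes (the placeholder host name is illustrative):

   service iptables stop        # on both boxes
   iperf -s                     # on the server side
   iperf -c <other-box> -m      # on the client side
   service iptables start       # on both boxes, afterwards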
Comment 5 Stephen Hemminger 2006-07-11 13:44:56 UTC
It isn't a driver problem, probably something hardware or network related. With
my Linksys board I see:

~> iperf -c dxpl
------------------------------------------------------------
Client connecting to dxpl, TCP port 5001
TCP window size: 64.0 KByte (default)
------------------------------------------------------------
[  3] local 10.8.0.55 port 40502 connected with 10.8.0.74 port 5001
[  3]  0.0-10.0 sec  1.04 GBytes    894 Mbits/sec

The TCP settings in /etc/sysctl.conf
# increase Linux autotuning TCP buffer limits
net.ipv4.tcp_rmem = 4096 87380 67108864
net.ipv4.tcp_wmem = 4096 65536 67108864
net.core.wmem_max = 67108864
net.core.rmem_max = 67108864
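
These can be applied without a reboot (assuming the file is /etc/sysctl.conf
as shown above):

   sysctl -p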


The lspci is:

03:08.0 Ethernet controller: Linksys Gigabit Network Adapter (rev 12)
        Subsystem: Linksys EG1064 v2 Instant Gigabit Network Adapter
        Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr-
Stepping- SERR- FastB2B-
        Status: Cap+ 66MHz+ UDF- FastB2B+ ParErr- DEVSEL=medium >TAbort-
<TAbort- <MAbort- >SERR- <PERR-
        Latency: 32 (5750ns min, 7750ns max), Cache Line Size 08
        Interrupt: pin A routed to IRQ 169
        Region 0: Memory at f2004000 (32-bit, non-prefetchable) [size=16K]
        Region 1: I/O ports at 9000 [size=256]
        Capabilities: [48] Power Management version 2
                Flags: PMEClk- DSI- D1+ D2+ AuxCurrent=0mA
PME(D0+,D1+,D2+,D3hot+,D3cold+)
                Status: D0 PME-Enable- DSel=0 DScale=1 PME-
        Capabilities: [50] Vital Product Data

Could you capture a TCP trace (with tcpdump or ethereal) of the problem?
Comment 6 David G. Miller 2006-07-12 07:29:10 UTC
lspci says:

02:07.0 Ethernet controller: 3Com Corporation 3c940 10/100/1000Base-T [Marvell]
(rev 10)
        Subsystem: 3Com Corporation 3C941 Gigabit LOM Ethernet Adapter
        Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV+ VGASnoop- ParErr-
Stepping- SERR- FastB2B-
        Status: Cap+ 66Mhz+ UDF- FastB2B+ ParErr- DEVSEL=medium >TAbort-
<TAbort- <MAbort- >SERR- <PERR-
        Latency: 64 (5750ns min, 7750ns max), Cache Line Size 10
        Interrupt: pin A routed to IRQ 185
        Region 0: Memory at f1004000 (32-bit, non-prefetchable) [size=16K]
        Region 1: I/O ports at 2400 [size=256]
        [virtual] Expansion ROM at 88020000 [disabled] [size=128K]
        Capabilities: [48] Power Management version 2
                Flags: PMEClk- DSI- D1+ D2+ AuxCurrent=0mA
PME(D0+,D1+,D2+,D3hot+,D3cold+)
                Status: D0 PME-Enable- DSel=0 DScale=1 PME-
        Capabilities: [50] Vital Product Data

I did more moving cables, etc. and confirmed that the cable inside the wall
isn't the problem.  The only free box I have is the old PIII/733 but the jack I
had been using for the Tyan workstation gives me a little better than 400 Mbit/s
(symmetric) with the PIII plugged in and running iperf between the PIII and the
same server I had been using.

If I run a tcpdump to capture the iperf test, do you just need packet headers? 
Let me know if you have any particular set of tcpdump options you'd like to see.

I've filed support requests with both Tyan and 3com on this issue. 
Unfortunately, I'll probably just get the usual "Linux isn't supported" song and
dance.
Comment 7 Stephen Hemminger 2006-07-12 16:02:29 UTC
A tcpdump of headers only (binary file), or just the ASCII decode, will do.
It will show if packets are being dropped or if there are huge pauses.
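
A headers-only capture of one iperf run could look something like this (the
interface name, snap length, and file name here are only examples):

   tcpdump -i eth1 -s 128 -w iperf-skge.pcap tcp port 5001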
Comment 8 David G. Miller 2006-07-15 08:12:05 UTC
Update: I installed FC5 on my PIII/733 test box and repeated the iperf tests
between it and both the server (Abit KG-7 motherboard system, CentOS 4.3, 2.6.9
kernel) and the workstation (Tyan Tiger MPX motherboard, FC4, 2.6.17 kernel). 
Throughput between the PIII/733 and the server dropped by a couple of percent
(compared to both systems running CentOS 4.3) but there were no hardware
duplicates in either direction.  Performance between the PIII/733 and the Tyan
box remained about the same (400 Mbit/s when the PIII/733 is sending and about
200 Mbit/s when the Tyan is sending, but with hardware duplicates being generated
from the Tyan to the PIII/733).

Based on this result, the problem seems to be specific to the Tyan box.  I'm not
saying there isn't a bug in here but it may go back to the Tyan motherboard
design, which we can't do anything about.  I just ordered a 3com 3c996b-t
(64-bit PCI, 66 MHz) NIC.  If I recall correctly, the Tyan board gives
preference to cards in the 64-bit slots.  Also, this will be the only card in
either of the two 64-bit slots, which should eliminate the possibility of a
"slow" (33 MHz) card slowing down the bus.

I'd like to leave this bug open until I get a chance to try the 64 bit NIC so I
have a place to post results from testing (should be here by 7/21/2006).  This
board should be able to keep up better than what I'm seeing and having a gigabit
NIC in it shouldn't be that odd.
Comment 9 David G. Miller 2006-07-26 18:29:24 UTC
Update: I finally got the 3c996b-t card working.  I did the old "pull out all of
the other cards and see if the problem still occurs."  The problem went away so
I added the PCI cards back in one at a time assuming I'd find one that caused
the conflict.  The system is now in exactly the same configuration it was in
when the tg3 driver gave a load error (except cards might not be in the same PCI
slots) but no error.

I re-ran the iperf tests and got:

[dave@bend ~]# iperf -m -c mutilate
------------------------------------------------------------
Client connecting to mutilate, TCP port 5001
TCP window size: 1.00 MByte (default)
------------------------------------------------------------
[  3] local 192.168.0.184 port 43006 connected with 192.168.255.254 port 5001
[  3]  0.0-10.0 sec    970 MBytes    813 Mbits/sec
[  3] MSS size 8948 bytes (MTU 8988 bytes, unknown interface)

and

[root@fraud rc.d]# iperf -c bend -m
------------------------------------------------------------
Client connecting to bend, TCP port 5001
TCP window size: 1.00 MByte (default)
------------------------------------------------------------
[  3] local 192.168.255.254 port 33064 connected with 192.168.0.184 port 5001
[  3]  0.0-10.0 sec    940 MBytes    787 Mbits/sec
[  3] MSS size 8948 bytes (MTU 8988 bytes, unknown interface)

So, pretty close to symmetric performance.  A little slower than what you're
seeing but I'm guessing I'm hitting hardware limitations at this point given the
vintage of the systems I'm using.

I'm getting the performance I expected (after a significant amount of pain) with
the new NIC so I'm happy other than I had to shell out some $$ for a new NIC.  I
can rerun the iperf tests with tcpdump doing a capture to see if I'm still
seeing the same problem with hardware dupes.  Let me know if this is worth pursuing.
Comment 10 Stephen Hemminger 2006-07-27 10:47:11 UTC
Doesn't the 3c996b-t card use a different driver?
Comment 11 David G. Miller 2006-07-27 11:32:34 UTC
Yeah, the 3c996B-T uses the tg3 driver.  This mainly confirms that there isn't a
problem with the building wiring, the test server's NIC, or my switch.  I was
fairly convinced that those weren't causing the behavior that led me to open
the bug, but this confirms it.

About all we know is that the 3c2000-t with the skge driver generates hardware
dupes in this box (Tyan S2466N-4M) when pushed to maximum speed by iperf.  This
happens whether it is the iperf sender or the iperf receiver, but it's not as
bad on receive (fewer hardware dupes).  At slower speeds (~300 to ~400 Mbit/s)
in a different box (PIII/733), this NIC with the skge driver doesn't generate
such dupes, nor does the same NIC with the sk98lin driver.

I'm assuming that the presence of hardware dupes is "a bad thing" and is the
visible cause of, or somehow related to, the "slow transmit" performance.  If this
isn't the case, all we know is what I put in the original bug.  

Another "shot in the dark" theory would be that this is somehow a limitation of
the chipset or peripheral bus controller for this motherboard (I've pinged Tyan
support but I just get the predicted "Linux not supported" song and dance with a
"discontinued motherboard" refraim).  I'm typically an application developer so
anything this close to the hardware is all new to me.  If you have anything
specific you'd like to see, let me know and I'll try to set up the test. 
Unfortunately, the hardware I've described in this bug is about everything I
have access to so I'm somewhat limited in what I can do.

Cheers,
Dave
Comment 12 Stephen Hemminger 2006-09-25 10:43:59 UTC
A number of performance-related fixes went into 2.6.18. The main one related to
this bug report was a change to have the skge driver use NAPI for transmit
cleanup.

Is this still true with later kernels?
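
A quick way to confirm which kernel and skge version a given test box is
running (assuming the interface is eth1):

   uname -r
   ethtool -i eth1    # reports the driver name and version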
Comment 13 David G. Miller 2006-09-25 13:40:38 UTC
The most recent kernel I have access to is 2.6.17-1.2157_FC5.  Any idea if the
folks at Red Hat have back-ported the fix?  I'm hoping to put FC6-test3 on my
test box at some point.  I may be able to grab something even more current at
that point.
Comment 14 Stephen Hemminger 2006-11-10 14:10:34 UTC
Fedora Core 6 is using 2.6.18 based kernel, could you please try that?
Comment 15 Stephen Hemminger 2007-01-12 12:23:40 UTC
Please reopen this if problem remains with current kernels.
Comment 16 Tvrtko Ursulin 2008-11-18 08:53:08 UTC
It looks like this wasn't completely resolved but worked around by buying a new NIC?

Anyway, I have the same symptoms with skge and forcedeth connected with short patch cables through a switch (DLink DGS-1008).

I tried the wmem/rmem sysctl stuff (all of it) with no effect; also, turning off window scaling halves performance.

With default settings I get ~200 Mbit/s one way and ~400 Mbit/s the other. Even the faster direction is pretty slow on my setup.

Hardware on the skge side is:

02:08.0 Ethernet controller: D-Link System Inc DGE-530T Gigabit Ethernet Adapter (rev 11) (rev 11)
        Subsystem: D-Link System Inc DGE-530T Gigabit Ethernet Adapter (rev 11)
        Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx-
        Status: Cap+ 66MHz+ UDF- FastB2B+ ParErr- DEVSEL=medium >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
        Latency: 64 (5750ns min, 7750ns max), Cache Line Size: 32 bytes
        Interrupt: pin A routed to IRQ 22
        Region 0: Memory at fdcf8000 (32-bit, non-prefetchable) [size=16K]
        Region 1: I/O ports at dc00 [size=256]
        Expansion ROM at fde00000 [disabled] [size=128K]
        Capabilities: [48] Power Management version 2
                Flags: PMEClk- DSI- D1+ D2+ AuxCurrent=0mA PME(D0+,D1+,D2+,D3hot+,D3cold+)
                Status: D0 PME-Enable- DSel=0 DScale=1 PME-
        Capabilities: [50] Vital Product Data <?>
        Kernel driver in use: skge
        Kernel modules: skge

Forcedeth side:

00:0f.0 Ethernet controller: nVidia Corporation MCP73 Ethernet (rev a2)
        Subsystem: Giga-byte Technology Device e000
        Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx+
        Status: Cap+ 66MHz+ UDF- FastB2B+ ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
        Latency: 0 (250ns min, 5000ns max)
        Interrupt: pin A routed to IRQ 4350
        Region 0: Memory at e5109000 (32-bit, non-prefetchable) [size=4K]
        Region 1: I/O ports at e000 [size=8]
        Region 2: Memory at e510a000 (32-bit, non-prefetchable) [size=256]
        Region 3: Memory at e5106000 (32-bit, non-prefetchable) [size=16]
        Capabilities: [44] Power Management version 2
                Flags: PMEClk- DSI- D1+ D2+ AuxCurrent=0mA PME(D0+,D1+,D2+,D3hot+,D3cold+)
                Status: D0 PME-Enable+ DSel=0 DScale=0 PME-
        Capabilities: [50] Message Signalled Interrupts: Mask+ 64bit+ Count=1/8 Enable+
                Address: 00000000fee0100c  Data: 4179
                Masking: 000000fe  Pending: 00000000
        Kernel driver in use: forcedeth
        Kernel modules: forcedeth

Skge is in a PCI slot in a ~3-year-old AMD Turion64 box, while forcedeth is onboard a ~1-year-old Gigabyte MB with a C2D CPU. The first is running Ubuntu 8.10 (a 2.6.27 derivative) and the second openSUSE 11.1 Beta5 (also a 2.6.27 derivative).

But the numbers were exactly the same one release back on both platforms, i.e. 2.6.24 on Ubuntu 8.04 and 2.6.25 on openSUSE 11.

Anything further I could try?

P.S. I don't seem to have power to re-open this.
Comment 17 Stephen Hemminger 2008-12-15 10:12:15 UTC
Regular PCI slots have limited bandwidth (like 400 Mb/s) vs. on-board or PCI-E, which have 6 Gb/s.
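
Rough arithmetic behind that point (general PCI figures, not measurements from
this thread; whether the DGE-530T's slot actually runs at 33 MHz is an
assumption):

   32 bits x 33 MHz = ~133 MB/s (~1 Gbit/s) theoretical, shared by every device
   on the bus, so a few hundred Mbit/s of sustained iperf throughput is a
   realistic ceiling for a gigabit NIC in a conventional PCI slot.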
