Most recent kernel where this bug did not occur: 2.6.9 (latest kernel I have installed that doesn't exhibit this behavior).
Distribution: Fedora Core 4
Hardware Environment: Tyan Tiger MPX mainboard with two Athlon (32-bit) 2400+ CPUs, 2GB RAM. NIC is a 3Com 3c2000-t connecting through a 3Com OfficeConnect gigabit switch.
Software Environment: iperf version 2.0.2

Problem Description: I'm seeing asymmetric speed (significantly different depending on whether the system is the iperf client or the iperf server) reported by iperf between two systems:

[dave@bend ~]# iperf -c mutilate
------------------------------------------------------------
Client connecting to mutilate, TCP port 5001
TCP window size: 27.4 KByte (default)
------------------------------------------------------------
[ 3] local 192.168.0.185 port 59584 connected with 192.168.255.254 port 5001
[ 3] 0.0-10.0 sec 232 MBytes 194 Mbits/sec

[dave@bend ~]# iperf -s
------------------------------------------------------------
Server listening on TCP port 5001
TCP window size: 85.3 KByte (default)
------------------------------------------------------------
[ 4] local 192.168.0.185 port 5001 connected with 192.168.255.254 port 33464
[ 4] 0.0-10.0 sec 815 MBytes 684 Mbits/sec

I've swapped cables and NICs but the problem remains with the system running 2.6.17 and using the skge module. The other system is running CentOS 4.3, which is still on the 2.6.9 kernel and uses the sk98lin module. I don't see the same asymmetric speed difference between it and another CentOS box, whichever role each system has in the test.

Steps to reproduce:
1) Install iperf on a system running a kernel version that does not include the skge module and on one that does.
2) "Tune" the system with the skge module by setting the MTU to 9000. This makes the difference more dramatic but isn't strictly required; the iperf-reported speed drops by about a third with an MTU of 1500 for the skge system in server mode.
3) Run iperf on each system with one as client and the other as server, and then reverse the roles.

Expected results: iperf performance varies very little regardless of which system is client and which is server.

Actual results: See bug description. iperf reports the connection is about 3X faster when the system running skge is the server as compared to when it is the client.

Notes: Made several cable swaps and even replaced the NIC with a brand new (still in the shrink wrap) 3c2000-t in the "skge" system. Moved a cable that exhibited the problem so that all test traffic traversed this cable, and didn't see the problem between the CentOS systems (2.6.9 kernel, sk98lin module); just between either CentOS system and the FC4 system (2.6.17 kernel, skge driver).
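For reference, the reproduction boils down to roughly the following (the interface and host names are placeholders for whatever is in use):

ifconfig eth1 mtu 9000   # optional "tuning" on the skge system; makes the gap more dramatic
iperf -s                 # on one box
iperf -c <other-box>     # on the other box, then swap the roles and compare the reported rates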
Perhaps you are seeing checksum errors. Please try comparing performance with transmit checksum offload disabled:

ethtool -K eth0 tx off

It could be that the transmit checksum offload is sending bad checksums.
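(For reference, the offload state can be checked before and after the change; eth0 here is just a placeholder for the interface under test.)

ethtool -k eth0     # lower-case -k shows the current offload settings
ethtool -K eth0 tx off
ethtool -k eth0     # confirm tx-checksumming is now off, then re-run iperf in both directions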
Tiny improvement. The first iperf is "before" and the second one is "after":

[dave@bend ~/tgz]# iperf -c mutilate
------------------------------------------------------------
Client connecting to mutilate, TCP port 5001
TCP window size: 64.0 KByte (default)
------------------------------------------------------------
[ 3] local 192.168.0.185 port 39842 connected with 192.168.255.254 port 5001
[ 3] 0.0-10.0 sec 233 MBytes 195 Mbits/sec

[dave@bend ~/tgz]# iperf -c mutilate
------------------------------------------------------------
Client connecting to mutilate, TCP port 5001
TCP window size: 64.0 KByte (default)
------------------------------------------------------------
[ 3] local 192.168.0.185 port 53165 connected with 192.168.255.254 port 5001
[ 3] 0.0-10.0 sec 241 MBytes 202 Mbits/sec

The other offload settings are:

[root@bend ~]# ethtool -k eth1
Offload parameters for eth1:
Cannot get device tcp segmentation offload settings: Operation not supported
rx-checksumming: on
tx-checksumming: off
scatter-gather: off
tcp segmentation offload: off

Performance in the "other" direction is pretty much unaffected:

[dave@fraud ~]# iperf -c bend
------------------------------------------------------------
Client connecting to bend, TCP port 5001
TCP window size: 27.5 KByte (default)
------------------------------------------------------------
[ 3] local 192.168.255.254 port 34406 connected with 192.168.0.185 port 5001
[ 3] 0.0-10.0 sec 815 MBytes 684 Mbits/sec
Did you tune the TCP settings? You may need to increase tcp_rmem and tcp_wmem to get full TCP performance. Is this a direct LAN connection? Are there any firewalls or other "middleboxes" in the way? What is the system on the other end? Does turning off TCP window scaling change the results:

sysctl -w net.ipv4.tcp_window_scaling=0
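(If it helps, the relevant sysctls can be inspected and raised along these lines; the values below are only illustrative, not tuned for this particular hardware.)

sysctl net.ipv4.tcp_rmem net.ipv4.tcp_wmem net.core.rmem_max net.core.wmem_max
sysctl -w net.ipv4.tcp_rmem="4096 87380 16777216"
sysctl -w net.ipv4.tcp_wmem="4096 65536 16777216"
sysctl -w net.core.rmem_max=16777216
sysctl -w net.core.wmem_max=16777216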
First trial is after turning off TCP window scaling. Second is after turning it back on.

[dave@bend ~]# iperf -c mutilate -m
------------------------------------------------------------
Client connecting to mutilate, TCP port 5001
TCP window size: 64.0 KByte (default)
------------------------------------------------------------
[ 3] local 192.168.0.185 port 50557 connected with 192.168.255.254 port 5001
[ 3] 0.0-10.0 sec 229 MBytes 192 Mbits/sec
[ 3] MSS size 8948 bytes (MTU 8988 bytes, unknown interface)

[dave@bend ~]# iperf -c mutilate -m
------------------------------------------------------------
Client connecting to mutilate, TCP port 5001
TCP window size: 64.0 KByte (default)
------------------------------------------------------------
[ 3] local 192.168.0.185 port 36099 connected with 192.168.255.254 port 5001
[ 3] 0.0-10.0 sec 241 MBytes 202 Mbits/sec
[ 3] MSS size 8948 bytes (MTU 8988 bytes, unknown interface)
[dave@bend ~]#

In answer to your other questions, the boxes connect directly via a 3Com 3C1671600 switch. All of the boxes run iptables with only a few ports (including 5001) open (yes, I'm paranoid). I did the following when I first discovered this issue and went looking for tuning parameters that might make a difference:

1006  echo 2500000 > /proc/sys/net/core/rmem_max
1007  echo "4096 5000000 5000000" > /proc/sys/net/ipv4/tcp_rmem
1008  echo "4096 65536 5000000" > /proc/sys/net/ipv4/tcp_wmem
1009  echo "4096 65536 5000000" > /proc/sys/net/ipv4/tcp_wmemech

If you have alternate values that you would like for me to try, let me know. This is my own little corner of the internet, so I can go offline long enough to stop iptables on the two boxes involved and re-run the test without iptables if you think that is an interesting data point.
It isn't a driver problem, probably something hardware or network related. With my Linksys board I see:

~> iperf -c dxpl
------------------------------------------------------------
Client connecting to dxpl, TCP port 5001
TCP window size: 64.0 KByte (default)
------------------------------------------------------------
[ 3] local 10.8.0.55 port 40502 connected with 10.8.0.74 port 5001
[ 3] 0.0-10.0 sec 1.04 GBytes 894 Mbits/sec

The TCP settings in /etc/sysctl.conf:

# increase Linux autotuning TCP buffer limits
net.ipv4.tcp_rmem = 4096 87380 67108864
net.ipv4.tcp_wmem = 4096 65536 67108864
net.core.wmem_max = 67108864
net.core.rmem_max = 67108864

The lspci is:

03:08.0 Ethernet controller: Linksys Gigabit Network Adapter (rev 12)
        Subsystem: Linksys EG1064 v2 Instant Gigabit Network Adapter
        Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B-
        Status: Cap+ 66MHz+ UDF- FastB2B+ ParErr- DEVSEL=medium >TAbort- <TAbort- <MAbort- >SERR- <PERR-
        Latency: 32 (5750ns min, 7750ns max), Cache Line Size 08
        Interrupt: pin A routed to IRQ 169
        Region 0: Memory at f2004000 (32-bit, non-prefetchable) [size=16K]
        Region 1: I/O ports at 9000 [size=256]
        Capabilities: [48] Power Management version 2
                Flags: PMEClk- DSI- D1+ D2+ AuxCurrent=0mA PME(D0+,D1+,D2+,D3hot+,D3cold+)
                Status: D0 PME-Enable- DSel=0 DScale=1 PME-
        Capabilities: [50] Vital Product Data

Could you capture a TCP trace (with tcpdump or ethereal) of the problem?
lspci says:

02:07.0 Ethernet controller: 3Com Corporation 3c940 10/100/1000Base-T [Marvell] (rev 10)
        Subsystem: 3Com Corporation 3C941 Gigabit LOM Ethernet Adapter
        Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV+ VGASnoop- ParErr- Stepping- SERR- FastB2B-
        Status: Cap+ 66Mhz+ UDF- FastB2B+ ParErr- DEVSEL=medium >TAbort- <TAbort- <MAbort- >SERR- <PERR-
        Latency: 64 (5750ns min, 7750ns max), Cache Line Size 10
        Interrupt: pin A routed to IRQ 185
        Region 0: Memory at f1004000 (32-bit, non-prefetchable) [size=16K]
        Region 1: I/O ports at 2400 [size=256]
        [virtual] Expansion ROM at 88020000 [disabled] [size=128K]
        Capabilities: [48] Power Management version 2
                Flags: PMEClk- DSI- D1+ D2+ AuxCurrent=0mA PME(D0+,D1+,D2+,D3hot+,D3cold+)
                Status: D0 PME-Enable- DSel=0 DScale=1 PME-
        Capabilities: [50] Vital Product Data

I did more moving of cables, etc. and confirmed that the cable inside the wall isn't the problem. The only free box I have is the old PIII/733, but the jack I had been using for the Tyan workstation gives me a little better than 400 Mbit/s (symmetric) with the PIII plugged in, running iperf between the PIII and the same server I had been using.

If I run a tcpdump to capture the iperf test, do you just need packet headers? Let me know if you have any particular set of tcpdump options you'd like to see.

I've filed support requests with both Tyan and 3Com on this issue. Unfortunately, I'll probably just get the usual "Linux isn't supported" song and dance.
A tcpdump of headers only (as a binary file), or just the ascii decode, would work. It will show if packets are being dropped or if there are huge pauses.
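(Something along these lines should do it; eth1 and the file name are placeholders, adjust the interface and port to match the test.)

tcpdump -i eth1 -s 128 -w iperf-capture.pcap tcp port 5001

The -s 128 snaplen keeps only the first 128 bytes of each packet (enough for the headers), and -w writes the raw binary capture for later analysis in ethereal; drop -w and add -n for a plain ascii decode on the terminal instead.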
Update: I installed FC5 on my PIII/733 test box and repeated the iperf tests between it and both the server (Abit KG-7 motherboard system, CentOS 4.3, 2.6.9 kernel) and the workstation (Tyan Tiger MPX motherboard, FC4, 2.6.17 kernel). Throughput between the PIII/733 and the server dropped by a couple of percent (compared to both systems running CentOS 4.3), but there were no hardware duplicates in either direction. Performance between the PIII/733 and the Tyan box remained about the same (400 Mbit/s when the PIII/733 is sending and about 200 Mbit/s when the Tyan is sending, but with hardware duplicates being generated from the Tyan to the PIII/733).

Based on this result, the problem seems to be specific to the Tyan box. I'm not saying there isn't a bug in here, but it may go back to the Tyan motherboard design, which we can't do anything about. I just ordered a 3Com 3c996b-t (64-bit PCI, 66 MHz) NIC. If I recall correctly, the Tyan board gives preference to cards in the 64-bit slots. Also, this will be the only card in either of the two 64-bit slots, which should eliminate the possibility of a "slow" (33 MHz) card slowing down the bus.

I'd like to leave this bug open until I get a chance to try the 64-bit NIC so I have a place to post results from that testing (the card should be here by 7/21/2006). This board should be able to keep up better than what I'm seeing, and having a gigabit NIC in it shouldn't be that odd.
Update: I finally got the 3c996b-t card working. I did the old "pull out all of the other cards and see if the problem still occurs" routine. The problem went away, so I added the PCI cards back in one at a time assuming I'd find one that caused the conflict. The system is now in exactly the same configuration it was in when the tg3 driver gave a load error (except the cards might not be in the same PCI slots), but there's no error. I re-ran the iperf tests and got:

[dave@bend ~]# iperf -m -c mutilate
------------------------------------------------------------
Client connecting to mutilate, TCP port 5001
TCP window size: 1.00 MByte (default)
------------------------------------------------------------
[ 3] local 192.168.0.184 port 43006 connected with 192.168.255.254 port 5001
[ 3] 0.0-10.0 sec 970 MBytes 813 Mbits/sec
[ 3] MSS size 8948 bytes (MTU 8988 bytes, unknown interface)

and

[root@fraud rc.d]# iperf -c bend -m
------------------------------------------------------------
Client connecting to bend, TCP port 5001
TCP window size: 1.00 MByte (default)
------------------------------------------------------------
[ 3] local 192.168.255.254 port 33064 connected with 192.168.0.184 port 5001
[ 3] 0.0-10.0 sec 940 MBytes 787 Mbits/sec
[ 3] MSS size 8948 bytes (MTU 8988 bytes, unknown interface)

So, pretty close to symmetric performance. A little slower than what you're seeing, but I'm guessing I'm hitting hardware limitations at this point given the vintage of the systems I'm using. I'm getting the performance I expected (after a significant amount of pain) with the new NIC, so I'm happy other than having had to shell out some $$ for a new NIC. I can rerun the iperf tests with tcpdump doing a capture to see if I'm still seeing the same problem with hardware dupes. Let me know if this is worth pursuing.
Doesn't the 3c996b-t card use a different driver?
Yeah, the 3c996B-T uses the tg3 driver. This mainly confirms that there isn't a problem with the building wiring, the test server NIC, or my switch. I was fairly convinced that those weren't causing the behavior that led me to open the bug, but this confirms it.

About all we know is that the 3c2000-t with the skge driver generates hardware dupes in this box (Tyan S2466N-4M) when pushed to maximum speed by iperf. This happens whether it is the iperf sender or the iperf receiver, but it's not as bad on receive (fewer hardware dupes). At slower speeds (~300 to ~400 Mbit/s) in a different box (PIII/733), this NIC with the skge driver doesn't generate such dupes, and neither does the same NIC with the sk98lin driver. I'm assuming that the presence of hardware dupes is "a bad thing" and is the visible cause of, or somehow related to, the "slow transmit" performance. If this isn't the case, all we know is what I put in the original bug.

Another "shot in the dark" theory would be that this is somehow a limitation of the chipset or peripheral bus controller for this motherboard (I've pinged Tyan support but I just get the predicted "Linux not supported" song and dance with a "discontinued motherboard" refrain).

I'm typically an application developer, so anything this close to the hardware is all new to me. If you have anything specific you'd like to see, let me know and I'll try to set up the test. Unfortunately, the hardware I've described in this bug is about everything I have access to, so I'm somewhat limited in what I can do.

Cheers, Dave
A number of performance-related fixes went into 2.6.18. The main one related to this bug report was a change to have the skge driver use NAPI for transmit cleanup. Is this still true with later kernels?
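(When retesting, the kernel and skge driver version actually in use can be confirmed with something like the following; eth1 is a placeholder for the skge interface.)

uname -r
ethtool -i eth1
modinfo skge | grep -i version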
The most recent kernel I have access to is 2.6.17-1.2157_FC5. Any idea if the folks at Red Hat have back-ported the fix? I'm hoping to put FC6-test3 on my test box at some point. I may be able to grab something even more current at that point.
Fedora Core 6 is using a 2.6.18-based kernel, could you please try that?
Please reopen this if problem remains with current kernels.
It looks like this wasn't completely resolved, but rather worked around by buying a new NIC? Anyway, I have the same symptoms with skge and forcedeth connected with short patch cables through a switch (D-Link DGS-1008). I tried the wmem/rmem sysctl stuff (all of it) with no effect, and turning off window scaling just halves performance. With default settings I get ~200 Mbit/s one way and ~400 Mbit/s the other way. Even the faster direction is pretty slow on my setup.

Hardware on the skge side is:

02:08.0 Ethernet controller: D-Link System Inc DGE-530T Gigabit Ethernet Adapter (rev 11) (rev 11)
        Subsystem: D-Link System Inc DGE-530T Gigabit Ethernet Adapter (rev 11)
        Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx-
        Status: Cap+ 66MHz+ UDF- FastB2B+ ParErr- DEVSEL=medium >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
        Latency: 64 (5750ns min, 7750ns max), Cache Line Size: 32 bytes
        Interrupt: pin A routed to IRQ 22
        Region 0: Memory at fdcf8000 (32-bit, non-prefetchable) [size=16K]
        Region 1: I/O ports at dc00 [size=256]
        Expansion ROM at fde00000 [disabled] [size=128K]
        Capabilities: [48] Power Management version 2
                Flags: PMEClk- DSI- D1+ D2+ AuxCurrent=0mA PME(D0+,D1+,D2+,D3hot+,D3cold+)
                Status: D0 PME-Enable- DSel=0 DScale=1 PME-
        Capabilities: [50] Vital Product Data <?>
        Kernel driver in use: skge
        Kernel modules: skge

Forcedeth side:

00:0f.0 Ethernet controller: nVidia Corporation MCP73 Ethernet (rev a2)
        Subsystem: Giga-byte Technology Device e000
        Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx+
        Status: Cap+ 66MHz+ UDF- FastB2B+ ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
        Latency: 0 (250ns min, 5000ns max)
        Interrupt: pin A routed to IRQ 4350
        Region 0: Memory at e5109000 (32-bit, non-prefetchable) [size=4K]
        Region 1: I/O ports at e000 [size=8]
        Region 2: Memory at e510a000 (32-bit, non-prefetchable) [size=256]
        Region 3: Memory at e5106000 (32-bit, non-prefetchable) [size=16]
        Capabilities: [44] Power Management version 2
                Flags: PMEClk- DSI- D1+ D2+ AuxCurrent=0mA PME(D0+,D1+,D2+,D3hot+,D3cold+)
                Status: D0 PME-Enable+ DSel=0 DScale=0 PME-
        Capabilities: [50] Message Signalled Interrupts: Mask+ 64bit+ Count=1/8 Enable+
                Address: 00000000fee0100c  Data: 4179
                Masking: 000000fe  Pending: 00000000
        Kernel driver in use: forcedeth
        Kernel modules: forcedeth

The skge card is in a PCI slot in a ~3 year old AMD Turion64 machine, while forcedeth is onboard a ~1 year old Gigabyte motherboard with a C2D CPU. The first is running Ubuntu 8.10 (so a 2.6.27 derivative) and the second openSUSE 11.1 Beta5 (also a 2.6.27 derivative). But the numbers were exactly the same one version ago for both platforms, so 2.6.24 on Ubuntu 8.04 and 2.6.25 on openSUSE 11.

Anything further I could try?

P.S. I don't seem to have power to re-open this.
Regular PCI slots have limited bandwidth (on the order of a few hundred MB/s, shared across the whole bus), whereas onboard connections or PCI-E can provide several GB/s.
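As a rough back-of-the-envelope check (theoretical peak figures; real-world throughput is lower and the bus is shared with other devices):

32-bit/33 MHz PCI: 4 bytes × 33 MHz ≈ 133 MB/s ≈ 1.07 Gbit/s
64-bit/66 MHz PCI: 8 bytes × 66 MHz ≈ 533 MB/s ≈ 4.3 Gbit/s

So a gigabit NIC in a plain 32-bit/33 MHz slot is already at the edge of what the bus can move in one direction, which is consistent with such a card struggling to hold full line rate and with the 64-bit 3c996b-t in a 64-bit slot faring better.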