Latest working kernel version: none
Earliest failing kernel version: (not known)
Distribution: The test was performed with Ubuntu 7.10 server, but the problem applies to any distribution.
Hardware Environment: Intel S5000PAL motherboard + Intel 80003ES2LAN Gigabit Ethernet Controller (copper, full duplex) + MT25204 InfiniBand HCA + InfiniBand SDR 4x network.
Software Environment: 2.6.23.14 kernel + OFED 1.2.5.5 userspace components.

Problem Description:
When running iperf between two hosts with the configuration described above, iperf reports a bandwidth of 942 Mbit/s over Ethernet and between 1800 and 3300 Mbit/s over InfiniBand (default iperf setup: 8 KB messages, TCP, IPv4, Nagle enabled). Inspection of the network traffic with Wireshark shows that when running iperf over Ethernet, only iperf data and TCP window update messages are exchanged. When running iperf over InfiniBand, Wireshark shows iperf data, TCP window updates, and also "ACKed lost segment" and "Previous segment lost" messages. This is probably a bug in the IPoIB implementation. For comparison, ib_rdma_bw reports 933.2 MB/s between the two test systems (7466 Mbit/s).

Steps to reproduce:
Run iperf as follows on the two interconnected Linux hosts (the tune_pci=1 parameter was specified to compensate for a BIOS bug):

(host 1)
rmmod ib_mthca
modprobe ib_ipoib
modprobe ib_mthca tune_pci=1
sleep 5
ifdown ib0
ifup ib0
iperf -s

(host 2)
rmmod ib_mthca
modprobe ib_ipoib
modprobe ib_mthca tune_pci=1
sleep 5
ifdown ib0
ifup ib0
iperf -c ${IP_address_of_InfiniBand_interface_of_host_1}
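As a sanity check on the figures quoted in this report, the following is a minimal sketch that converts the ib_rdma_bw result from MB/s to Mbit/s and expresses the iperf-over-IPoIB range as a percentage of the raw RDMA bandwidth. The helper names mbytes_to_mbits and percent_of are made up for this illustration; only the input numbers come from the report.

```shell
#!/bin/sh
# Hypothetical helpers, not part of iperf or OFED.

# Convert MB/s to Mbit/s (1 byte = 8 bits), rounded to the nearest integer.
mbytes_to_mbits() {
    awk -v v="$1" 'BEGIN { printf "%.0f\n", v * 8 }'
}

# Express throughput $1 as a percentage of throughput $2, one decimal place.
percent_of() {
    awk -v a="$1" -v b="$2" 'BEGIN { printf "%.1f\n", 100 * a / b }'
}

mbytes_to_mbits 933.2    # ib_rdma_bw figure from the report
percent_of 3300 7466     # best iperf result over IPoIB vs. raw RDMA
percent_of 1800 7466     # worst iperf result over IPoIB vs. raw RDMA
```

Even the best iperf run over IPoIB stays below half of what ib_rdma_bw measures on the same link, which is why the spread between 1800 and 3300 Mbit/s looked suspicious.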
Tests with the STGT SCSI target implementation, using the iSCSI protocol over IPoIB with the iSCSI parameter node.conn[0].tcp.window_size = 524288, show a sharp performance drop for writes to a remote RAM disk with block transfer sizes of >= 1 MB. This may be due to the IPoIB implementation. Write throughput with a block transfer size of 512 KB: 97 MB/s. Write throughput with a block transfer size of 1 MB: 21 MB/s. That is a throughput reduction by more than a factor of four.
Please ignore comment #1 -- this is probably an STGT performance bug. Regarding the jittery iperf results, I see the following kernel messages appear on the iperf client system after having enabled ipoib debugging (echo 1 > /sys/module/ib_ipoib/parameters/debug_level):

[ 0.000000] ib0: TX ring full, stopping kernel net queue
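To see how often the message shows up during a run, one option is to count matching lines in the kernel log. The sketch below assumes the message text is exactly as quoted above; count_tx_ring_full is a hypothetical helper name, and the sample input is the single line from this comment plus an unrelated placeholder line.

```shell
#!/bin/sh
# Count "TX ring full" messages in a kernel log read from stdin.
# Hypothetical helper; it prints the number of matching lines.
count_tx_ring_full() {
    grep -c 'TX ring full'
}

# Demo on sample input. On a live system the input would come from
# the kernel log instead, e.g.: dmesg | count_tx_ring_full
printf '%s\n' \
    '[ 0.000000] ib0: TX ring full, stopping kernel net queue' \
    'some unrelated kernel message' | count_tx_ring_full
```

Comparing the count before and after an iperf run gives a rough idea of how frequently the transmit queue is being stopped.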
This must be an iperf problem and not an IPoIB problem. The fact that the kernel message "TX ring full" appears is normal: it means that iperf is supplying more data than can be transmitted. And Wireshark complains about acked lost segment/previous segment lost messages because it could not capture all transmitted packets. Netperf gives reproducible results for IPoIB on my setup: 3288+/-1 Mbit/s.
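A figure such as 3288+/-1 Mbit/s can be obtained by repeating the measurement and summarizing the results. The following is a minimal sketch that reads one throughput value per line and prints the mean and the sample standard deviation. The helper name mean_stddev and the three sample values are illustrative assumptions, not the actual netperf measurements, and the comment above does not say how its spread was computed.

```shell
#!/bin/sh
# Print "mean +/- sample standard deviation" for numbers read from stdin,
# one value per line. Hypothetical helper, not part of netperf.
mean_stddev() {
    awk '{ n++; s += $1; ss += $1 * $1 }
         END {
             if (n < 2) { print "need at least two samples"; exit 1 }
             m = s / n
             var = (ss - n * m * m) / (n - 1)
             if (var < 0) var = 0    # guard against rounding error
             printf "%.1f +/- %.1f\n", m, sqrt(var)
         }'
}

# Illustrative values in Mbit/s, NOT real measurements.
printf '%s\n' 3287 3288 3289 | mean_stddev
```

A small spread across runs is what separates a reproducible benchmark from the jittery 1800 to 3300 Mbit/s range that iperf reported.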