Bug 9883

Summary: Jittery performance when communicating via IPv4 over InfiniBand (IPoIB)
Product: Drivers
Reporter: Bart Van Assche (bvanassche)
Component: Other
Assignee: Roland Dreier (roland)
Severity: normal
Priority: P1
Hardware: All
OS: Linux
Kernel Version:
Subsystem:
Regression: ---
Bisected commit-id:

Description Bart Van Assche 2008-02-04 02:27:01 UTC
Latest working kernel version: none
Earliest failing kernel version: (not known)
Distribution: Test was performed with Ubuntu 7.10 server, but applies to any distribution.
Hardware Environment: Intel S5000PAL mother board + Intel 80003ES2LAN Gigabit Ethernet Controller (copper, full duplex) + MT25204 InfiniBand HCA + InfiniBand SDR 4x network.
Software Environment: kernel + OFED userspace components.
Problem Description: 

When running iperf between two hosts with the configuration described above, iperf reports a bandwidth of 942 Mbit/s over Ethernet and between 1800 and 3300 Mbit/s over InfiniBand (default iperf setup: 8 KB messages, TCP, IPv4, Nagle enabled). Inspecting the network traffic with Wireshark shows that, when running iperf over Ethernet, only the iperf data and TCP window update messages are exchanged. When running iperf over InfiniBand, one also sees "Acked lost segment" and "Previous segment lost" messages in addition to the iperf data and TCP window updates. This is probably a bug in the IPoIB implementation. ib_rdma_bw reports 933.2 MB/s between the two test systems (7466 Mbit/s).
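The gap between the raw RDMA figure and the iperf-over-IPoIB figures can be checked with a quick unit conversion (a small illustrative Python snippet; all numbers are the ones reported above):

```python
# Convert the ib_rdma_bw result (933.2 MB/s) to Mbit/s and compare it
# with the iperf-over-IPoIB range (1800-3300 Mbit/s) reported above.

def mb_per_s_to_mbit_per_s(mb_per_s: float) -> float:
    """1 MB/s = 8 Mbit/s (both prefixes taken as decimal, 10^6)."""
    return mb_per_s * 8.0

raw_link = mb_per_s_to_mbit_per_s(933.2)  # 7465.6 Mbit/s, i.e. ~7466
print(f"raw RDMA bandwidth: {raw_link:.1f} Mbit/s")
print(f"best iperf result uses  {3300 / raw_link:.0%} of that")
print(f"worst iperf result uses {1800 / raw_link:.0%} of that")
```

So even the best TCP-over-IPoIB run reaches well under half of what the link delivers over raw RDMA, which is why the report focuses on the IPoIB path.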

Steps to reproduce:

Run iperf as follows on the two interconnected Linux hosts (the tune_pci=1 parameter was specified as a compensation for a BIOS bug):
(host 1)
rmmod ib_mthca
modprobe ib_ipoib
modprobe ib_mthca tune_pci=1
sleep 5
ifdown ib0
ifup ib0
iperf -s
(host 2)
rmmod ib_mthca
modprobe ib_ipoib
modprobe ib_mthca tune_pci=1
sleep 5
ifdown ib0
ifup ib0
iperf -c ${IP_address_of_InfiniBand_interface_of_host_1}
Comment 1 Bart Van Assche 2008-02-07 01:55:08 UTC
Tests with the STGT SCSI target implementation, using the iSCSI protocol over IPoIB with the iSCSI parameter node.conn[0].tcp.window_size = 524288, show a sharp performance drop for writes to a remote RAM disk with block transfer sizes >= 1 MB. This may be due to the IPoIB implementation.

Write throughput with a block transfer size of 512 KB: 97 MB/s.
Write throughput with a block transfer size of 1 MB: 21 MB/s.

That is, a throughput reduction of more than a factor of four.
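The reduction factor follows directly from the two measurements quoted above (an illustrative check only):

```python
# Throughput reduction when moving from 512 KB to 1 MB block transfers,
# using the write-throughput figures reported above (in MB/s).
throughput_512k = 97.0
throughput_1m = 21.0

reduction = throughput_512k / throughput_1m
print(f"reduction factor: {reduction:.1f}x")  # ~4.6x, i.e. more than four times
```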
Comment 2 Bart Van Assche 2008-02-07 08:22:35 UTC
Please ignore comment #1 -- this is probably an STGT performance bug.

Regarding the jittery iperf results: after enabling IPoIB debugging (echo 1 > /sys/module/ib_ipoib/parameters/debug_level), I see the following kernel message appear on the iperf client system:

[    0.000000] ib0: TX ring full, stopping kernel net queue
Comment 3 Bart Van Assche 2008-02-08 02:06:06 UTC
This must be an iperf problem and not an IPoIB problem. The appearance of the kernel message "TX ring full" is normal: it means that iperf is supplying more data than can be transmitted. And Wireshark complains about "Acked lost segment"/"Previous segment lost" messages because it could not capture all transmitted packets. Netperf gives reproducible results for IPoIB on my setup: 3288 +/- 1 Mbit/s.
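The "jittery" versus "reproducible" contrast can be put in numbers (a quick illustrative calculation from the figures quoted in this report):

```python
# Relative spread of the two benchmarks quoted above:
# iperf ranged from 1800 to 3300 Mbit/s, netperf gave 3288 +/- 1 Mbit/s.
iperf_lo, iperf_hi = 1800.0, 3300.0
netperf_mean, netperf_dev = 3288.0, 1.0

# Spread as a fraction of the midpoint (iperf) or mean (netperf).
iperf_spread = (iperf_hi - iperf_lo) / ((iperf_hi + iperf_lo) / 2)
netperf_spread = 2 * netperf_dev / netperf_mean

print(f"iperf relative spread:   {iperf_spread:.1%}")    # ~59%
print(f"netperf relative spread: {netperf_spread:.3%}")  # ~0.061%
```

The roughly three-orders-of-magnitude difference in variability is consistent with the conclusion that the jitter comes from the measurement tool rather than from IPoIB itself.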
This is must be an iperf problem and not an IPoIB problem. The fact that the kernel message "TX ring full" appears is normal, it means that iperf is supplying more data than can be transmitted. And Wireshark complains about acked lost segment/previous segment lost messages because it could not capture all transmitted packets. Netperf gives reproducible results for IPoIB on my setup: 3288+/-1 Mbit/s.