Bug 111921

Summary: Severe IPoIB Routed Performance Regression
Product: Drivers
Reporter: John-Michael Mulesa (thesaxophonist)
Component: Infiniband/RDMA
Assignee: drivers_infiniband-rdma
Status: RESOLVED CODE_FIX
Severity: normal
CC: dledford, Hakon.Bugge, kb9vqf, koct9i, matveev.as, n.borisov.lkml, tiberizzle, v.tolstov, wry+bzkernel
Priority: P1
Hardware: All
OS: Linux
Kernel Version: 4.4.1
Regression: Yes

Description John-Michael Mulesa 2016-02-05 05:23:14 UTC
(Apologies if this is in the wrong section)

There appears to be a significant performance regression somewhere in the InfiniBand or IPoIB stack starting with kernel 4.4. IP performance over the InfiniBand network itself is fine; however, I have my internet connection routed from my ethernet network to my InfiniBand network. In kernel versions up through 4.3.3 this worked fine and delivered my ISP line speed of ~180 Mbps over IB. In 4.4, however, this has dropped by more than an order of magnitude, and I can now only achieve ~5 Mbps when traversing from ethernet to InfiniBand or vice versa. I'm running IPoIB in connected mode.

Reference: https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1536837
Comment 1 Doug Ledford 2016-02-05 19:42:59 UTC
Can you bisect this down to the patch that breaks your performance?
Comment 2 John-Michael Mulesa 2016-02-18 03:45:55 UTC
Sorry about the delay; finally got around to bisecting this issue:

9207f9d45b0ad071baa128e846d7e7ed85016df3 is the first bad commit
commit 9207f9d45b0ad071baa128e846d7e7ed85016df3
Author: Konstantin Khlebnikov <koct9i@gmail.com>
Date:   Fri Jan 8 15:21:46 2016 +0300

    net: preserve IP control block during GSO segmentation
    
    Skb_gso_segment() uses skb control block during segmentation.
    This patch adds 32-bytes room for previous control block which
    will be copied into all resulting segments.
    
    This patch fixes kernel crash during fragmenting forwarded packets.
    Fragmentation requires valid IP CB in skb for clearing ip options.
    Also patch removes custom save/restore in ovs code, now it's redundant.
    
    Signed-off-by: Konstantin Khlebnikov <koct9i@gmail.com>
    Link: http://lkml.kernel.org/r/CALYGNiP-0MZ-FExV2HutTvE9U-QQtkKSoE--KN=JQE5STYsjAA@mail.gmail.com
    Signed-off-by: David S. Miller <davem@davemloft.net>

:040000 040000 9c648128d8818ac7a2bc407fab9bf14198a34d41 212172b5c880eafc231be1277099e0e90eefdf45 M	include
:040000 040000 3bf1261c14b68f7542b795c85a0867ae451439ea f5dbca57e2c5638da2b0e0fab832f6ba89749f2b M	net
Comment 3 Timothy Pearson 2016-03-07 18:45:59 UTC
This also appears to be present in at least the 4.3 kernel for Debian.
Comment 4 Andrey Matveev 2016-04-01 15:13:42 UTC
I can confirm that reverting the above commit solves the issue:

https://bugzilla.proxmox.com/show_bug.cgi?id=927

It seems the fix from Konstantin Khlebnikov has an unfortunate side effect.
Comment 5 Konstantin Khlebnikov 2016-04-01 16:56:51 UTC
Evidently skb_gso_cb now overlaps the hwaddr field of struct ipoib_cb and probably corrupts it, so ipoib_start_xmit cannot find the neighbour entry and falls back to a slow path (or something similar). Before that commit, skb_gso_cb fell entirely inside ipoib_cb->qdisc_cb.
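
For reference, here are the two v4.4-era layouts involved (struct definitions as in drivers/infiniband/ulp/ipoib/ipoib.h and include/linux/skbuff.h; the offset comments are annotations, not part of the source):

/* skb->cb is a 48-byte scratch area shared by every layer handling the skb. */

/* drivers/infiniband/ulp/ipoib/ipoib.h */
struct ipoib_cb {
	struct qdisc_skb_cb	qdisc_cb;		 /* 28 bytes: cb[0..27]  */
	u8			hwaddr[INFINIBAND_ALEN]; /* 20 bytes: cb[28..47] */
};

/* include/linux/skbuff.h, after the offending commit */
struct skb_gso_cb {
	int	mac_offset;
	int	encap_level;
	__u16	csum_start;
};						 /* 12 bytes with tail padding */
#define SKB_SGO_CB_OFFSET	32
#define SKB_GSO_CB(skb) ((struct skb_gso_cb *)((skb)->cb + SKB_SGO_CB_OFFSET))

/* GSO now scribbles on cb[32..43], i.e. bytes 4..15 of ipoib_cb.hwaddr. */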
Comment 6 Konstantin Khlebnikov 2016-04-01 17:03:00 UTC
Related IPoIB hack: 936d7de3d736e0737542641269436f4b5968e9ef
("IPoIB: Stop lying about hard_header_len and use skb->cb to stash LL addresses")
Comment 7 Andrey Matveev 2016-04-02 01:55:24 UTC
Konstantin, many thanks for your comments!

So the choice is: a possible IPoIB crash without Konstantin's fix, or a 1000x+ speed degradation... I would prefer the first, as I'm using InfiniBand to access my storage from a virtualization environment.
Comment 8 Konstantin Khlebnikov 2016-04-02 18:02:50 UTC
Current sharing of skb->cb

TCP/UDP/SCTP/DCCP coexists with IPCB/IP6CB
IPCB/IP6CB coexists with GSO
OVS coexists with GSO

IPoIB coexists with GSO (and intersects)
IPoIB coexists with Qdisc

IPoIB uses skb->cb for a strange thing: it stashes the huge hwaddr there and looks up the neighbour entry in start_xmit for every segment (sketched below).
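
Concretely, the stash happens roughly like this (paraphrased from the v4.4 drivers/infiniband/ulp/ipoib/ipoib_main.c, simplified for illustration; comments are annotations):

static int ipoib_hard_header(struct sk_buff *skb, struct net_device *dev,
			     unsigned short type,
			     const void *daddr, const void *saddr, unsigned len)
{
	struct ipoib_header *header;
	struct ipoib_cb *cb = ipoib_skb_cb(skb);

	header = (struct ipoib_header *) skb_push(skb, sizeof *header);
	header->proto = htons(type);
	header->reserved = 0;

	/* The 20-byte link-layer address does not fit in the on-wire
	 * header, so it is stashed in skb->cb; ipoib_start_xmit() reads
	 * it back to look up the neighbour for every packet. This is
	 * the field that GSO segmentation now overwrites. */
	memcpy(cb->hwaddr, daddr, INFINIBAND_ALEN);

	return sizeof *header;
}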
Comment 9 Timothy Pearson 2016-04-05 21:44:00 UTC
(In reply to Andrey Matveev from comment #7)
> Konstantin, many thanks for your comments!
> 
> So the choice is: a possible IPoIB crash without Konstantin's fix, or a
> 1000x+ speed degradation... I would prefer the first, as I'm using
> InfiniBand to access my storage from a virtualization environment.

At this point I have to agree.

Konstantin, are you working on fixing this or is the bug "up for grabs" by other authors?

Thanks!
Comment 10 Konstantin Khlebnikov 2016-04-06 05:28:48 UTC
Sorry. I'm really busy right now.

The only quick solution I see is swapping the places of the qdisc and ipoib blocks in skb->cb. It would be perfect if the IB maintainers could have a look at this code and explain why ipoib is so special and cannot use the usual dst/neighbour machinery for this.
Comment 11 Doug Ledford 2016-04-07 15:24:33 UTC
On 4/5/16 10:28 PM, bugzilla-daemon@bugzilla.kernel.org wrote:
> https://bugzilla.kernel.org/show_bug.cgi?id=111921
> 
> --- Comment #10 from Konstantin Khlebnikov <koct9i@gmail.com> ---
> Sorry. I'm really busy right now.
> 
> The only quick solution I see is swapping the places of the qdisc and ipoib
> blocks in skb->cb. It would be perfect if the IB maintainers could have a
> look at this code and explain why ipoib is so special and cannot use the
> usual dst/neighbour machinery for this.
> 

I'm at a conference this week.  I'll look into this when I get back.
Comment 12 Timothy Pearson 2016-04-08 23:42:27 UTC
Interestingly, reverting the linked patch on Debian 4.4.6-1 did NOT fix the performance issue.
Comment 13 Timothy Pearson 2016-04-20 06:04:09 UTC
A bit more information... reverting the patch did fix my amd64 system. A ppc64el system here is still showing the problem, though; I will investigate further as I have time.
Comment 14 Andrey Matveev 2016-04-25 11:25:40 UTC
Any progress?
Comment 15 Timothy Pearson 2016-05-12 05:44:08 UTC
Still hitting this bug... it is now present in the Debian Jessie backport kernel 4.5 as well, so the number of affected systems will probably increase the longer this waits for a fix.
Comment 16 Doug Ledford 2016-05-20 13:51:21 UTC
I have my 4.7 pull request ready to go, so I'll be able to spend some time on this now.
Comment 17 Timothy Pearson 2016-06-01 21:43:24 UTC
Has any progress been made on this?  Is there any additional debugging information that may be helpful?  We're stuck on older kernels until this is resolved, so any forward movement would be welcome.

Thanks!
Comment 18 Timothy Pearson 2016-06-04 21:01:01 UTC
I got another chance to test the latest Debian 4.5 kernel with the patch reverted on ppc64el, and the performance issues disappeared. It appears something went wrong in my earlier test: reverting 9207f9d45b0ad071baa128e846d7e7ed85016df3 does indeed fix the issue on all systems tested.
Comment 19 Hakon.Bugge 2016-06-20 09:32:28 UTC
FWIW:

I did some limited testing with Oracle UEK4 kernel 4.1.12-49, which includes the offending commit.

In summary, my finding was that the performance degradation required selective ACKs to be disabled; with selective ACKs enabled on both nodes, no performance regression was observed.

Also, note that some distros have a sysctl_perf_tuning script as part of the rdma RPM. Depending on the settings in /etc/rdma/rdma.conf, sysctl_perf_tuning may be run, and it will then disable selective ACKs.
Comment 20 Andrey Matveev 2016-06-20 12:09:59 UTC
In my setup there is no difference whether net.ipv4.tcp_sack is on or off; in either case I see the performance degradation.
Comment 21 Hakon.Bugge 2016-06-23 11:00:12 UTC
Andrey,

Would you care to do a:

# for F in /proc/sys/net/ipv4/tcp_*; do printf "%35s : %10s\n" `basename $F` "`cat $F`"; done

on your system, so we can "compare notes"?
Comment 22 Andrey Matveev 2016-06-27 09:38:49 UTC
Here we go:

Sender:

root@pve02A:~#  for F in /proc/sys/net/ipv4/tcp_*; do printf "%35s : %10s" `basename $F`; cat $F; done
              tcp_abort_on_overflow :           0
                  tcp_adv_win_scale :           1
     tcp_allowed_congestion_control :           cubic reno
                        tcp_app_win :           31
                    tcp_autocorking :           1
   tcp_available_congestion_control :           cubic reno
                       tcp_base_mss :           1024
            tcp_challenge_ack_limit :           100
             tcp_congestion_control :           cubic
                          tcp_dsack :           1
                  tcp_early_retrans :           3
                            tcp_ecn :           2
                   tcp_ecn_fallback :           1
                           tcp_fack :           1
                       tcp_fastopen :           1
                   tcp_fastopen_key :           00000000-00000000-00000000-00000000
                    tcp_fin_timeout :           60
                           tcp_frto :           2
                  tcp_fwmark_accept :           0
              tcp_invalid_ratelimit :           500
                tcp_keepalive_intvl :           75
               tcp_keepalive_probes :           9
                 tcp_keepalive_time :           7200
             tcp_limit_output_bytes :           262144
                    tcp_low_latency :           0
                    tcp_max_orphans :           262144
                 tcp_max_reordering :           300
                tcp_max_syn_backlog :           2048
                 tcp_max_tw_buckets :           262144
                            tcp_mem :           869481  1159309 1738962
                   tcp_min_tso_segs :           2
                tcp_moderate_rcvbuf :           1
                    tcp_mtu_probing :           0
                tcp_no_metrics_save :           0
                  tcp_notsent_lowat :           -1
                 tcp_orphan_retries :           0
                 tcp_probe_interval :           600
                tcp_probe_threshold :           8
                     tcp_reordering :           3
               tcp_retrans_collapse :           1
                       tcp_retries1 :           3
                       tcp_retries2 :           15
                        tcp_rfc1337 :           0
                           tcp_rmem :           4096    87380   6291456
                           tcp_sack :           1
          tcp_slow_start_after_idle :           1
                         tcp_stdurg :           0
                 tcp_synack_retries :           5
                     tcp_syncookies :           1
                    tcp_syn_retries :           6
                    tcp_thin_dupack :           0
           tcp_thin_linear_timeouts :           0
                     tcp_timestamps :           1
                tcp_tso_win_divisor :           3
                     tcp_tw_recycle :           0
                       tcp_tw_reuse :           0
                 tcp_window_scaling :           1
                           tcp_wmem :           4096    16384   4194304
      tcp_workaround_signed_windows :           0



Receiver:
root@ib2eth:~#  for F in /proc/sys/net/ipv4/tcp_*; do printf "%35s : %10s" `basename $F`; cat $F; done
              tcp_abort_on_overflow :           0
                  tcp_adv_win_scale :           1
     tcp_allowed_congestion_control :           cubic reno
                        tcp_app_win :           31
                    tcp_autocorking :           1
   tcp_available_congestion_control :           cubic reno
                       tcp_base_mss :           512
            tcp_challenge_ack_limit :           100
             tcp_congestion_control :           cubic
                          tcp_dsack :           1
                  tcp_early_retrans :           3
                            tcp_ecn :           2
                           tcp_fack :           1
                       tcp_fastopen :           1
                   tcp_fastopen_key :           00000000-00000000-00000000-00000000
                    tcp_fin_timeout :           60
                           tcp_frto :           2
                  tcp_fwmark_accept :           0
                tcp_keepalive_intvl :           75
               tcp_keepalive_probes :           9
                 tcp_keepalive_time :           7200
             tcp_limit_output_bytes :           131072
                    tcp_low_latency :           0
                    tcp_max_orphans :           8192
                tcp_max_syn_backlog :           128
                 tcp_max_tw_buckets :           8192
                            tcp_mem :           45996   61331   91992
                   tcp_min_tso_segs :           2
                tcp_moderate_rcvbuf :           1
                    tcp_mtu_probing :           0
                tcp_no_metrics_save :           0
                  tcp_notsent_lowat :           -1
                 tcp_orphan_retries :           0
                     tcp_reordering :           3
               tcp_retrans_collapse :           1
                       tcp_retries1 :           3
                       tcp_retries2 :           15
                        tcp_rfc1337 :           0
                           tcp_rmem :           4096    87380   33554432
                           tcp_sack :           1
          tcp_slow_start_after_idle :           1
                         tcp_stdurg :           0
                 tcp_synack_retries :           5
                     tcp_syncookies :           1
                    tcp_syn_retries :           6
                    tcp_thin_dupack :           0
           tcp_thin_linear_timeouts :           0
                     tcp_timestamps :           0
                tcp_tso_win_divisor :           3
                     tcp_tw_recycle :           0
                       tcp_tw_reuse :           0
                 tcp_window_scaling :           1
                           tcp_wmem :           4096    65536   33554432
      tcp_workaround_signed_windows :           0



Just a note: I changed tcp_sack via the sysctl command, without a reboot (if that matters).
Comment 23 Hakon.Bugge 2016-06-28 16:20:44 UTC
(In reply to Andrey Matveev from comment #22)
> Here we go:
> 
Are you running two different kernels?

Here is the diff of your output:

# diff tx.txt  rx.txt 
7c7
<                        tcp_base_mss :           1024
---
>                        tcp_base_mss :           512
13d12
<                    tcp_ecn_fallback :           1
20d18
<               tcp_invalid_ratelimit :           500
24c22
<              tcp_limit_output_bytes :           262144
---
>              tcp_limit_output_bytes :           131072
26,30c24,27
<                     tcp_max_orphans :           262144
<                  tcp_max_reordering :           300
<                 tcp_max_syn_backlog :           2048
<                  tcp_max_tw_buckets :           262144
<                             tcp_mem :           869481  1159309 1738962
---
>                     tcp_max_orphans :           8192
>                 tcp_max_syn_backlog :           128
>                  tcp_max_tw_buckets :           8192
>                             tcp_mem :           45996   61331   91992
37,38d33
<                  tcp_probe_interval :           600
<                 tcp_probe_threshold :           8
44c39
<                            tcp_rmem :           4096    87380   6291456
---
>                            tcp_rmem :           4096    87380   33554432
53c48
<                      tcp_timestamps :           1
---
>                      tcp_timestamps :           0
58c53
<                            tcp_wmem :           4096    16384   4194304
---
>                            tcp_wmem :           4096    65536   33554432

My settings (and kernel) match your sender most closely:

[root@lab46 ~]# diff -b  me.txt tx.txt 
12a13
>                    tcp_ecn_fallback :           1
23c24
<              tcp_limit_output_bytes :      131072
---
>              tcp_limit_output_bytes :           262144
29c30
<                             tcp_mem :  8499705	11332943	16999410
---
>                             tcp_mem :           869481  1159309 1738962


You have the tcp_ecn_fallback sysctl and I do not, and you have twice my value of tcp_limit_output_bytes.

> Just a note: I changed tcp_sack via the sysctl command, without a reboot (if
> that matters)

It did not matter for me.
Comment 24 Andrey Matveev 2016-06-28 17:27:18 UTC
Yep, my receiver is a stand-alone server used as an IB <--> 10GbE gateway:

Linux ib2eth 3.16.0-4-amd64 #1 SMP Debian 3.16.7-ckt20-1+deb8u3 (2016-01-17) x86_64 GNU/Linux
Comment 25 Andrey Matveev 2016-06-28 17:48:00 UTC
with:

tcp_ecn_fallback :           0
tcp_limit_output_bytes :      131072


Client connecting to 172.16.253.2, TCP port 5001
TCP window size: 2.50 MByte (default)
------------------------------------------------------------
[  3] local 172.16.253.15 port 35688 connected with 172.16.253.2 port 5001
[ ID] Interval       Transfer     Bandwidth
[  3]  0.0-10.5 sec  16.5 MBytes  13.2 Mbits/sec
root@pve02A:~#


Unfortunately, I don't have the option of comparing throughput between two hosts running the same kernel (both WITH Konstantin's patch); I always test a kernel without the patch against a kernel with the patch (the only other differences are the kernel settings).
Comment 26 Timothy Pearson 2016-07-07 00:30:26 UTC
Confirming that setting net.ipv4.tcp_sack to 0 does NOT fix the problem.  I tested the option on both sides of the link; no effect.
Comment 27 Andrey Matveev 2016-07-21 08:09:21 UTC
Any progress?
Comment 28 Hakon.Bugge 2016-08-04 16:11:51 UTC
Yes. This is related to mixing the connected and datagram modes of IPoIB.

Test 1:

Kernel with commit "net: preserve IP control block during GSO segmentation".

Two nodes, S1 and S2. S1 has IPoIB configured in connected mode; S2 has IPoIB configured in datagram mode.

Now, with TCP traffic (qperf tcp_bw), from S1 (connected) to S2 (datagram) we get:

tcp_bw:
    bw  =  297 KB/sec


With traffic from S2 (datagram) to S1 (connected):

tcp_bw:
    bw  =  2.96 GB/sec


Test 2:

Kernel without commit "net: preserve IP control block during GSO segmentation".

Traffic from S1 (connected) to S2 (datagram):

tcp_bw:
    bw  =  1.23 GB/sec

Traffic from S2 (datagram) to S1 (connected):

tcp_bw:
    bw  =  2.96 GB/sec


I will not speculate as to why, but this seems to be a strong correlation.
Comment 29 Andrey Matveev 2016-08-05 08:51:22 UTC
In my case there is no mixture of connected and datagram modes; connected mode is used on both sides:

root@pve02A:~# cat /sys/class/net/ib0/mode
connected

root@ib2eth:~# cat /sys/class/net/ib0/mode
connected
Comment 30 Nikolay Borisov 2016-08-05 12:05:25 UTC
I've also come across this; the link below has more information plus scripts to reproduce. I've only observed it when the interfaces are in different network namespaces. https://www.mail-archive.com/netdev@vger.kernel.org/msg121918.html
Comment 31 Hakon.Bugge 2016-08-05 16:25:33 UTC
(In reply to Andrey Matveev from comment #29)
> In my case I don't have mixture of connected and datagram modes. Connected
> mode is used on both sides:
> 
> root@pve02A:~# cat /sys/class/net/ib0/mode
> connected
> 
> root@ib2eth:~# cat /sys/class/net/ib0/mode
> connected

Would an

# echo datagram > /sys/class/net/ib0/mode

on one of the nodes fix it?
Comment 32 Roland Dreier 2016-08-05 16:57:19 UTC
I don't think there is a need for more experiments.  The bug is
well-understood, see
http://marc.info/?l=linux-netdev&m=146787278625498&w=2 and replies.

The problem is that no one has stepped up to do the major surgery on
ipoib that is apparently required.

For what it's worth, on a 4.4.y system I am using the following
successfully, which preserves the original bug fix but also allows
IPoIB to work.  Unfortunately struct skb_gso_cb has grown upstream so
this approach is not applicable to newer kernels:

Upstream commit 1f0bdf609240 ("net: preserve IP control block during GSO
segmentation") moved SKB_GSO_CB() to 32 bytes into skb->cb[], to avoid
stomping on the IP control block.  However, the IPoIB control block
starts 28 bytes in, so GSO corrupts the IPoIB address for packets (and
kills IPoIB performance).

If we slightly shrink struct skb_gso_cb to 8 bytes, then we can move
SKB_GSO_CB() to 20 bytes (just after the IP control block) so that we
don't reintroduce the bug fixed by 1f0bdf609240 but also leave IPoIB working.
---
 include/linux/skbuff.h | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
index 171e12fc0696..52e326145458 100644
--- a/include/linux/skbuff.h
+++ b/include/linux/skbuff.h
@@ -3626,10 +3626,10 @@ static inline struct sec_path *skb_sec_path(struct sk_buff *skb)
  */
 struct skb_gso_cb {
        int     mac_offset;
-       int     encap_level;
+       __u16   encap_level;
        __u16   csum_start;
 };
-#define SKB_SGO_CB_OFFSET      32
+#define SKB_SGO_CB_OFFSET      20
 #define SKB_GSO_CB(skb) ((struct skb_gso_cb *)((skb)->cb + SKB_SGO_CB_OFFSET))

 static inline int skb_tnl_header_len(const struct sk_buff *inner_skb)
Comment 33 Nikolay Borisov 2016-08-08 06:56:35 UTC
Roland, unfortunately your patch only works when the following two config options are disabled: CONFIG_IPV6_MIP6 and CONFIG_IPV6_MIP6_MODULE. In that case the 'dsthao' member is compiled out; otherwise inet6_skb_parm is 24 bytes:

struct inet6_skb_parm {
	int                        iif;                  /*     0     4 */
	__be16                     ra;                   /*     4     2 */
	__u16                      dst0;                 /*     6     2 */
	__u16                      srcrt;                /*     8     2 */
	__u16                      dst1;                 /*    10     2 */
	__u16                      lastopt;              /*    12     2 */
	__u16                      nhoff;                /*    14     2 */
	__u16                      flags;                /*    16     2 */
	__u16                      dsthao;               /*    18     2 */
	__u16                      frag_max_size;        /*    20     2 */

	/* size: 24, cachelines: 1, members: 10 */
	/* padding: 2 */
	/* last cacheline: 24 bytes */
};
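
A hypothetical compile-time guard (sketch only, not in any kernel tree; v4.4-era names) would make both constraints explicit. With CONFIG_IPV6_MIP6 enabled, the first check fires for the proposed offset of 20:

/* Hypothetical sketch: compile-time checks that would reject an unsafe
 * SKB_SGO_CB_OFFSET. */
#include <linux/kernel.h>
#include <linux/skbuff.h>
#include <linux/ipv6.h>		/* struct inet6_skb_parm */

static inline void skb_cb_layout_check(void)
{
	/* The GSO block must start after the IPv6 control block... */
	BUILD_BUG_ON(sizeof(struct inet6_skb_parm) > SKB_SGO_CB_OFFSET);
	/* ...and must not run past the end of skb->cb. */
	BUILD_BUG_ON(SKB_SGO_CB_OFFSET + sizeof(struct skb_gso_cb) >
		     FIELD_SIZEOF(struct sk_buff, cb));
}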
Comment 34 Timothy Pearson 2016-10-06 16:57:31 UTC
Is anyone planning to work on the requisite fix?  If InfiniBand is this poorly maintained, we may start looking at competing technologies for our new cluster deployment.
Comment 35 Doug Ledford 2016-10-07 12:35:41 UTC
Yes.  We know what the problem is, and LLNL has tasked someone to work on it since I haven't had the time.  I gave them a full dump of the two solutions I thought were appropriate yesterday.
Comment 36 Doug Ledford 2016-10-11 17:32:25 UTC
Upstream submission of proposed fix:

http://marc.info/?l=linux-rdma&m=147620680520525&w=2
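
Roughly, the approach of the posted series (a paraphrased sketch, not a verbatim excerpt): instead of stashing the destination LL address in skb->cb, ipoib_hard_header() pushes it onto the skb itself as a pseudo header, which ipoib_start_xmit() later pulls back off, so GSO can scribble over skb->cb freely:

/* Paraphrased sketch of the fix. The pseudo header never goes on the
 * wire; it only travels inside the skb between hard_header and
 * start_xmit. */
struct ipoib_pseudo_header {
	u8 hwaddr[INFINIBAND_ALEN];
};

static int ipoib_hard_header(struct sk_buff *skb, struct net_device *dev,
			     unsigned short type,
			     const void *daddr, const void *saddr, unsigned len)
{
	struct ipoib_header *header;
	struct ipoib_pseudo_header *phdr;

	header = (struct ipoib_header *) skb_push(skb, sizeof *header);
	header->proto = htons(type);
	header->reserved = 0;

	/* Stash the destination LL address in front of the real header
	 * instead of in skb->cb, out of GSO's way. */
	phdr = (struct ipoib_pseudo_header *) skb_push(skb, sizeof *phdr);
	memcpy(phdr->hwaddr, daddr, INFINIBAND_ALEN);

	/* hard_header_len now accounts for both headers. */
	return IPOIB_HARD_LEN;
}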
Comment 37 Andrey Matveev 2016-10-13 10:33:38 UTC
It works! Confirmed!
Comment 38 Timothy Pearson 2016-10-23 06:25:31 UTC
I can also confirm the provided patch fixes kernel 4.4 (first system I tested it on).  Nice work!
Comment 39 tiberizzle 2016-10-28 09:14:31 UTC
I was experiencing an issue with NFS-over-IPoIB performance degrading to ~5 Mbps after the first time an NFS client was restarted, persisting until the NFS server was also restarted.

I bisected the issue to the same patch identified as the culprit for the original reporter's problem, and can confirm that Paolo Abeni's patches, applied to the NFS server, resolved my issue as well.
Comment 40 Vasiliy Tolstov 2017-01-25 14:27:46 UTC
I'm using 4.4.34 and have tried 4.4.44, but I don't see that patch applied in stable or upstream. Can somebody help me out: has the patch been accepted upstream, or has another patch or workaround been created for this issue?
Comment 41 Vasiliy Tolstov 2017-01-25 14:49:39 UTC
As I understand it, this patch has not gone into the stable kernels. Please send it to stable.
Comment 42 Nikolay Borisov 2017-01-25 14:53:28 UTC
So the upstream patch is this: fc791b6335152c5278dc4a4991bcb2d329f806f9

However, it's not tagged for stable, so you won't currently find it in anything before 4.9. You have the following options:

a) Backport the patch to your kernel.

b) Use the fix which Roland Dreier suggested in comment 32 - https://bugzilla.kernel.org/show_bug.cgi?id=111921#c32 - this will only work on 4.4 series kernels.

c) Ask for the commit to be included in the stable releases - I have already done this; check the stable mailing list. But it will take some time until this propagates to the next stable release (if it is accepted at all).