Bug 111921
| Summary: | Severe IPoIB Routed Performance Regression | | |
|---|---|---|---|
| Product: | Drivers | Reporter: | John-Michael Mulesa (thesaxophonist) |
| Component: | Infiniband/RDMA | Assignee: | drivers_infiniband-rdma |
| Status: | RESOLVED CODE_FIX | | |
| Severity: | normal | CC: | dledford, Hakon.Bugge, kb9vqf, koct9i, matveev.as, n.borisov.lkml, tiberizzle, v.tolstov, wry+bzkernel |
| Priority: | P1 | | |
| Hardware: | All | | |
| OS: | Linux | | |
| Kernel Version: | 4.4.1 | Subsystem: | |
| Regression: | Yes | Bisected commit-id: | |
Description
John-Michael Mulesa
2016-02-05 05:23:14 UTC
> Can you bisect this down to the patch that breaks your performance?

Sorry about the delay; I finally got around to bisecting this issue:

9207f9d45b0ad071baa128e846d7e7ed85016df3 is the first bad commit
commit 9207f9d45b0ad071baa128e846d7e7ed85016df3
Author: Konstantin Khlebnikov <koct9i@gmail.com>
Date:   Fri Jan 8 15:21:46 2016 +0300

    net: preserve IP control block during GSO segmentation

    skb_gso_segment() uses skb control block during segmentation.
    This patch adds 32-bytes room for previous control block which
    will be copied into all resulting segments.

    This patch fixes kernel crash during fragmenting forwarded packets.
    Fragmentation requires valid IP CB in skb for clearing ip options.
    Also patch removes custom save/restore in ovs code, now it's redundant.

    Signed-off-by: Konstantin Khlebnikov <koct9i@gmail.com>
    Link: http://lkml.kernel.org/r/CALYGNiP-0MZ-FExV2HutTvE9U-QQtkKSoE--KN=JQE5STYsjAA@mail.gmail.com
    Signed-off-by: David S. Miller <davem@davemloft.net>

:040000 040000 9c648128d8818ac7a2bc407fab9bf14198a34d41 212172b5c880eafc231be1277099e0e90eefdf45 M include
:040000 040000 3bf1261c14b68f7542b795c85a0867ae451439ea f5dbca57e2c5638da2b0e0fab832f6ba89749f2b M net

This also appears to be present in at least the 4.3 kernel for Debian. I can confirm that reverting the above commit solves the issue.

https://bugzilla.proxmox.com/show_bug.cgi?id=927

It seems the fix from Konstantin Khlebnikov has a side effect(((

Obviously skb_gso_cb now intersects with the hwaddr field of ipoib_skb_cb and probably corrupts it, so ipoib_start_xmit cannot find the neighbour entry and falls back to the slow path or something. Before that commit, skb_gso_cb was inside ipoib_cb->qdisc_cb. Related IPoIB hack: 936d7de3d736e0737542641269436f4b5968e9ef ("IPoIB: Stop lying about hard_header_len and use skb->cb to stash LL addresses").
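To make that overlap concrete, here is a minimal, stand-alone sketch of the skb->cb[48] layout as it looked around v4.4. The structs are simplified copies, not the kernel headers (the real definitions live in include/linux/skbuff.h, include/net/sch_generic.h and the ipoib driver), and the field sizes assume a typical 64-bit build.

```c
/* Hypothetical user-space sketch of the v4.4-era skb->cb layout.
 * Simplified copies of kernel structs; sizes assume a 64-bit build.
 */
#include <stdio.h>

#define QDISC_CB_PRIV_LEN 20
#define INFINIBAND_ALEN   20
#define SKB_CB_SIZE       48   /* sizeof(((struct sk_buff *)0)->cb) */
#define SKB_SGO_CB_OFFSET 32   /* after commit 9207f9d45b0a */

struct qdisc_skb_cb {           /* include/net/sch_generic.h (simplified) */
	unsigned int   pkt_len;
	unsigned short slave_dev_queue_mapping;
	unsigned short tc_classid;
	unsigned char  data[QDISC_CB_PRIV_LEN];
};

struct ipoib_cb {               /* IPoIB stashes the LL address in skb->cb */
	struct qdisc_skb_cb qdisc_cb;
	unsigned char hwaddr[INFINIBAND_ALEN];
};

struct skb_gso_cb {             /* include/linux/skbuff.h (v4.4-era layout) */
	int mac_offset;
	int encap_level;
	unsigned short csum_start;
};

int main(void)
{
	unsigned long hw_start  = sizeof(struct qdisc_skb_cb);        /* 28 */
	unsigned long hw_end    = hw_start + INFINIBAND_ALEN;         /* 48 */
	unsigned long gso_start = SKB_SGO_CB_OFFSET;                  /* 32 */
	unsigned long gso_end   = gso_start + sizeof(struct skb_gso_cb);

	printf("skb->cb is %d bytes\n", SKB_CB_SIZE);
	printf("ipoib hwaddr occupies cb[%lu..%lu)\n", hw_start, hw_end);
	printf("skb_gso_cb occupies   cb[%lu..%lu)\n", gso_start, gso_end);
	printf("overlap: %s\n",
	       gso_start < hw_end && hw_start < gso_end
	           ? "yes - GSO scribbles over the IPoIB hardware address"
	           : "no");
	return 0;
}
```

Before the offending commit SKB_GSO_CB() simply aliased cb[0], i.e. it sat inside ipoib_cb->qdisc_cb and never reached the hwaddr bytes; Roland's workaround further down in this thread instead shrinks the GSO block and moves it to offset 20, between the IP control block and the IPoIB address.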
Konstantin, many thanks for your comments!

So, the choice is: a possible IPoIB crash without the fix from Konstantin, or a 1000+ times speed degradation... I would prefer the first one, as I'm using InfiniBand to access my storage from a virtualization environment.

Current sharing of skb->cb:
- TCP/UDP/SCTP/DCCP coexists with IPCB/IP6CB
- IPCB/IP6CB coexists with GSO
- OVS coexists with GSO
- IPoIB coexists with GSO (and intersects)
- IPoIB coexists with Qdisc

IPoIB uses skb->cb for a strange thing - it keeps a huge hwaddr there and looks up the neighbour entry in start_xmit for each segment.

(In reply to Andrey Matveev from comment #7)
> Konstantin, many thanks for your comments!
>
> So, the choice is:
> a possible IPoIB crash without the fix from Konstantin, or a 1000+ times speed
> degradation... I would prefer the first one, as I'm using
> InfiniBand to access my storage from a virtualization environment.

At this point I have to agree. Konstantin, are you working on fixing this or is the bug "up for grabs" by other authors? Thanks!

Sorry. I'm really busy right now.

The only quick solution I see is swapping the places of the qdisc and ipoib blocks in skb->cb. It would be perfect if the IB maintainers had a look into this code and told us why ipoib is so special and cannot use the usual dst/neighbour machinery for this.

On 4/5/16 10:28 PM, bugzilla-daemon@bugzilla.kernel.org wrote:
> https://bugzilla.kernel.org/show_bug.cgi?id=111921
>
> --- Comment #10 from Konstantin Khlebnikov <koct9i@gmail.com> ---
> Sorry. I'm really busy right now.
>
> The only quick solution I see is swapping places of qdisc and ipoib blocks in
> skb->cb. It would be perfect if the IB maintainers had a look into this code
> and told us why ipoib is so special and cannot use the usual dst/neighbour
> machinery for this.

I'm at a conference this week. I'll look into this when I get back.

Interestingly, reverting the linked patch on Debian 4.4.6-1 did NOT fix the performance issue.

A bit more information... reverting the patch did fix my amd64 system. A ppc64el system here is still showing the problem though; I will investigate further as I have time.

any progress?

Still hitting this bug... it is now present in the Debian Jessie backport kernel 4.5 as well, so the number of affected systems will probably increase the longer this waits for a fix.

I have my 4.7 pull request ready to go, so I'll be able to spend some time on this now.

Has any progress been made on this? Is there any additional debugging information that may be helpful? We're stuck on older kernels until this is resolved, so any forward movement would be welcome. Thanks!

I got another chance to test the latest Debian 4.5 kernel with the patch reverted on ppc64el and the performance issues disappeared, so it appears something went wrong in my earlier test and reverting 9207f9d45b0ad071baa128e846d7e7ed85016df3 does indeed fix the issue on all systems tested.

FWIW: I did some limited testing with the Oracle UEK4 kernel 4.1.12-49, which includes the offending commit. In summary, my finding was that the performance degradation required selective ACKs to be disabled. Running with selective ACKs enabled on both nodes, no performance regression was observed.

Also note that some distros have sysctl_perf_tuning, part of the rdma RPM. Depending on the setting in /etc/rdma/rdma.conf, sysctl_perf_tuning might be run, and it will then disable selective ACKs.

In my setup there is no difference whether net.ipv4.tcp_sack is set on or off. In either case I'm facing the performance degradation.

Andrey, would you care to do a:

# for F in /proc/sys/net/ipv4/tcp_*; do printf "%35s : %10s\n" `basename $F` "`cat $F`"; done

on your system, so we can "compare notes"?
Here we go:

Sender:

root@pve02A:~# for F in /proc/sys/net/ipv4/tcp_*; do printf "%35s : %10s" `basename $F`; cat $F; done
tcp_abort_on_overflow : 0
tcp_adv_win_scale : 1
tcp_allowed_congestion_control : cubic reno
tcp_app_win : 31
tcp_autocorking : 1
tcp_available_congestion_control : cubic reno
tcp_base_mss : 1024
tcp_challenge_ack_limit : 100
tcp_congestion_control : cubic
tcp_dsack : 1
tcp_early_retrans : 3
tcp_ecn : 2
tcp_ecn_fallback : 1
tcp_fack : 1
tcp_fastopen : 1
tcp_fastopen_key : 00000000-00000000-00000000-00000000
tcp_fin_timeout : 60
tcp_frto : 2
tcp_fwmark_accept : 0
tcp_invalid_ratelimit : 500
tcp_keepalive_intvl : 75
tcp_keepalive_probes : 9
tcp_keepalive_time : 7200
tcp_limit_output_bytes : 262144
tcp_low_latency : 0
tcp_max_orphans : 262144
tcp_max_reordering : 300
tcp_max_syn_backlog : 2048
tcp_max_tw_buckets : 262144
tcp_mem : 869481 1159309 1738962
tcp_min_tso_segs : 2
tcp_moderate_rcvbuf : 1
tcp_mtu_probing : 0
tcp_no_metrics_save : 0
tcp_notsent_lowat : -1
tcp_orphan_retries : 0
tcp_probe_interval : 600
tcp_probe_threshold : 8
tcp_reordering : 3
tcp_retrans_collapse : 1
tcp_retries1 : 3
tcp_retries2 : 15
tcp_rfc1337 : 0
tcp_rmem : 4096 87380 6291456
tcp_sack : 1
tcp_slow_start_after_idle : 1
tcp_stdurg : 0
tcp_synack_retries : 5
tcp_syncookies : 1
tcp_syn_retries : 6
tcp_thin_dupack : 0
tcp_thin_linear_timeouts : 0
tcp_timestamps : 1
tcp_tso_win_divisor : 3
tcp_tw_recycle : 0
tcp_tw_reuse : 0
tcp_window_scaling : 1
tcp_wmem : 4096 16384 4194304
tcp_workaround_signed_windows : 0

Receiver:

root@ib2eth:~# for F in /proc/sys/net/ipv4/tcp_*; do printf "%35s : %10s" `basename $F`; cat $F; done
tcp_abort_on_overflow : 0
tcp_adv_win_scale : 1
tcp_allowed_congestion_control : cubic reno
tcp_app_win : 31
tcp_autocorking : 1
tcp_available_congestion_control : cubic reno
tcp_base_mss : 512
tcp_challenge_ack_limit : 100
tcp_congestion_control : cubic
tcp_dsack : 1
tcp_early_retrans : 3
tcp_ecn : 2
tcp_fack : 1
tcp_fastopen : 1
tcp_fastopen_key : 00000000-00000000-00000000-00000000
tcp_fin_timeout : 60
tcp_frto : 2
tcp_fwmark_accept : 0
tcp_keepalive_intvl : 75
tcp_keepalive_probes : 9
tcp_keepalive_time : 7200
tcp_limit_output_bytes : 131072
tcp_low_latency : 0
tcp_max_orphans : 8192
tcp_max_syn_backlog : 128
tcp_max_tw_buckets : 8192
tcp_mem : 45996 61331 91992
tcp_min_tso_segs : 2
tcp_moderate_rcvbuf : 1
tcp_mtu_probing : 0
tcp_no_metrics_save : 0
tcp_notsent_lowat : -1
tcp_orphan_retries : 0
tcp_reordering : 3
tcp_retrans_collapse : 1
tcp_retries1 : 3
tcp_retries2 : 15
tcp_rfc1337 : 0
tcp_rmem : 4096 87380 33554432
tcp_sack : 1
tcp_slow_start_after_idle : 1
tcp_stdurg : 0
tcp_synack_retries : 5
tcp_syncookies : 1
tcp_syn_retries : 6
tcp_thin_dupack : 0
tcp_thin_linear_timeouts : 0
tcp_timestamps : 0
tcp_tso_win_divisor : 3
tcp_tw_recycle : 0
tcp_tw_reuse : 0
tcp_window_scaling : 1
tcp_wmem : 4096 65536 33554432
tcp_workaround_signed_windows : 0

Just a notice: I changed tcp_sack via the sysctl command without a reboot (if it matters).

(In reply to Andrey Matveev from comment #22)
> Here we go:

Are you running two different kernels?
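For anyone repeating the comparison in the next comment, the same one-liner can be captured to files and diffed directly. This is only a sketch; the file names tx.txt/rx.txt follow the ones used below.

```sh
# On the sender:
for F in /proc/sys/net/ipv4/tcp_*; do printf "%35s : %10s\n" `basename $F` "`cat $F`"; done > tx.txt
# On the receiver (then copy rx.txt to the same host):
for F in /proc/sys/net/ipv4/tcp_*; do printf "%35s : %10s\n" `basename $F` "`cat $F`"; done > rx.txt
# Compare, ignoring whitespace differences from the column alignment:
diff -b tx.txt rx.txt
```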
Here is the diff of your output:

# diff tx.txt rx.txt
7c7
< tcp_base_mss : 1024
---
> tcp_base_mss : 512
13d12
< tcp_ecn_fallback : 1
20d18
< tcp_invalid_ratelimit : 500
24c22
< tcp_limit_output_bytes : 262144
---
> tcp_limit_output_bytes : 131072
26,30c24,27
< tcp_max_orphans : 262144
< tcp_max_reordering : 300
< tcp_max_syn_backlog : 2048
< tcp_max_tw_buckets : 262144
< tcp_mem : 869481 1159309 1738962
---
> tcp_max_orphans : 8192
> tcp_max_syn_backlog : 128
> tcp_max_tw_buckets : 8192
> tcp_mem : 45996 61331 91992
37,38d33
< tcp_probe_interval : 600
< tcp_probe_threshold : 8
44c39
< tcp_rmem : 4096 87380 6291456
---
> tcp_rmem : 4096 87380 33554432
53c48
< tcp_timestamps : 1
---
> tcp_timestamps : 0
58c53
< tcp_wmem : 4096 16384 4194304
---
> tcp_wmem : 4096 65536 33554432

My settings (and kernel) match your sender the most:

[root@lab46 ~]# diff -b me.txt tx.txt
12a13
> tcp_ecn_fallback : 1
23c24
< tcp_limit_output_bytes : 131072
---
> tcp_limit_output_bytes : 262144
29c30
< tcp_mem : 8499705 11332943 16999410
---
> tcp_mem : 869481 1159309 1738962

You have the tcp_ecn_fallback sysctl, I do not, and you have twice the value of tcp_limit_output_bytes.

> Just a notice: I changed tcp_sack via sysctl command without reboot (if it
> matters)

It did not matter for me.

Yeap, my receiver is a stand-alone server that is used as a gateway IB <--> 10Gb Ethernet:

Linux ib2eth 3.16.0-4-amd64 #1 SMP Debian 3.16.7-ckt20-1+deb8u3 (2016-01-17) x86_64 GNU/Linux

with:
tcp_ecn_fallback : 0
tcp_limit_output_bytes : 131072

Client connecting to 172.16.253.2, TCP port 5001
TCP window size: 2.50 MByte (default)
------------------------------------------------------------
[  3] local 172.16.253.15 port 35688 connected with 172.16.253.2 port 5001
[ ID] Interval       Transfer     Bandwidth
[  3]  0.0-10.5 sec  16.5 MBytes  13.2 Mbits/sec
root@pve02A:~#

Unfortunately, I don't have an option to compare throughput between two hosts running the same kernel (both having the kernel WITH the patch from Konstantin). I always test a kernel without the patch vs a kernel with the patch (the only differences being the kernel settings).

Confirming that setting net.ipv4.tcp_sack to 0 does NOT fix the problem. Tested the option on both sides of the link, no effect.

any progress?

Yes. This is related to the mixture of connected and datagram mode of IPoIB.

Test 1: Kernel with commit "net: preserve IP control block during GSO segmentation". Two nodes, S1 and S2. S1 has IPoIB configured in connected mode, S2 has IPoIB configured in datagram mode.

Now, with TCP traffic (qperf tcp_bw), from S1 (connected) to S2 (datagram) we get:
tcp_bw: bw = 297 KB/sec
With traffic from S2 (datagram) to S1 (connected):
tcp_bw: bw = 2.96 GB/sec

Test 2: Kernel without commit "net: preserve IP control block during GSO segmentation".

Traffic from S1 (connected) to S2 (datagram):
tcp_bw: bw = 1.23 GB/sec
Traffic from S2 (datagram) to S1 (connected):
tcp_bw: bw = 2.96 GB/sec

I will not speculate as to why, but this seems to be a strong correlation.
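A possible way to reproduce this connected-vs-datagram comparison is sketched below. The device name ib0 is taken from this thread, but the server hostname (s2-ib) is an assumption, and qperf must be installed on both nodes.

```sh
# On S1: put the interface in connected mode (leave S2 in datagram mode)
echo connected > /sys/class/net/ib0/mode
cat /sys/class/net/ib0/mode        # verify

# On S2: start the qperf server side
qperf &

# On S1: measure TCP bandwidth S1 -> S2 over S2's IPoIB address/hostname
qperf s2-ib tcp_bw

# Then swap roles (run "qperf" on S1 and "qperf s1-ib tcp_bw" on S2)
# to measure the datagram -> connected direction.
```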
In my case I don't have a mixture of connected and datagram modes. Connected mode is used on both sides:

root@pve02A:~# cat /sys/class/net/ib0/mode
connected

root@ib2eth:~# cat /sys/class/net/ib0/mode
connected

I've also come across this; here is more information on it plus scripts to reproduce. I've only observed it when the interfaces are in different namespaces.
https://www.mail-archive.com/netdev@vger.kernel.org/msg121918.html

(In reply to Andrey Matveev from comment #29)
> In my case I don't have a mixture of connected and datagram modes. Connected
> mode is used on both sides:
>
> root@pve02A:~# cat /sys/class/net/ib0/mode
> connected
>
> root@ib2eth:~# cat /sys/class/net/ib0/mode
> connected

Would an

# echo datagram > /sys/class/net/ib0/mode

on one of the nodes fix that?

I don't think there is a need for more experiments. The bug is well understood, see http://marc.info/?l=linux-netdev&m=146787278625498&w=2 and replies. The problem is that no one has stepped up to do the major surgery on ipoib that is apparently required.

For what it's worth, on a 4.4.y system I am using the following successfully, which preserves the original bug fix but also allows IPoIB to work. Unfortunately struct skb_gso_cb has grown upstream, so this approach is not applicable to newer kernels:

Upstream commit 1f0bdf609240 ("net: preserve IP control block during GSO segmentation") moved SKB_GSO_CB() to 32 bytes into skb->cb[], to avoid stomping on the IP control block. However, the IPoIB control block starts 28 bytes in, so GSO corrupts the IPoIB address for packets (and kills IPoIB performance).

If we slightly shrink struct skb_gso_cb to 8 bytes, then we can move SKB_GSO_CB() to 20 bytes (just after the IP control block) so that we don't reintroduce the bug fixed by 1f0bdf609240 but also leave IPoIB working.

---
 include/linux/skbuff.h | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
index 171e12fc0696..52e326145458 100644
--- a/include/linux/skbuff.h
+++ b/include/linux/skbuff.h
@@ -3626,10 +3626,10 @@ static inline struct sec_path *skb_sec_path(struct sk_buff *skb)
  */
 struct skb_gso_cb {
 	int	mac_offset;
-	int	encap_level;
+	__u16	encap_level;
 	__u16	csum_start;
 };
-#define SKB_SGO_CB_OFFSET	32
+#define SKB_SGO_CB_OFFSET	20
 #define SKB_GSO_CB(skb) ((struct skb_gso_cb *)((skb)->cb + SKB_SGO_CB_OFFSET))
 
 static inline int skb_tnl_header_len(const struct sk_buff *inner_skb)

Roland, unfortunately your patch only works when the following two config options are disabled: CONFIG_IPV6_MIP6 and CONFIG_IPV6_MIP6_MODULE. In that case the 'dsthao' member is compiled out; otherwise inet6_skb_parm is 24 bytes:

struct inet6_skb_parm {
	int                        iif;                  /*     0     4 */
	__be16                     ra;                   /*     4     2 */
	__u16                      dst0;                 /*     6     2 */
	__u16                      srcrt;                /*     8     2 */
	__u16                      dst1;                 /*    10     2 */
	__u16                      lastopt;              /*    12     2 */
	__u16                      nhoff;                /*    14     2 */
	__u16                      flags;                /*    16     2 */
	__u16                      dsthao;               /*    18     2 */
	__u16                      frag_max_size;        /*    20     2 */

	/* size: 24, cachelines: 1, members: 10 */
	/* padding: 2 */
	/* last cacheline: 24 bytes */
};

(With MIP6 enabled the IPv6 control block is 24 bytes, so a GSO control block at offset 20 would again overlap its tail.)

Is anyone planning to work on the requisite fix? If Infiniband is this poorly maintained, we may start looking at competing technologies for our new cluster deployment.

Yes. We know what the problem is, and LLNL has tasked someone to work on it since I haven't had the time. I gave them a full dump of the two solutions I thought were appropriate yesterday.

Upstream submission of proposed fix: http://marc.info/?l=linux-rdma&m=147620680520525&w=2

It works! Confirmed!

I can also confirm the provided patch fixes kernel 4.4 (first system I tested it on). Nice work!

I was experiencing an issue with NFS over IPoIB performance degrading to ~5 Mbps after the first time an NFS client was restarted, until the NFS server was also restarted. I bisected the issue to the same patch which was identified as the culprit for the original reporter's issue, and can confirm that Paolo Abeni's patches applied to the NFS server resolved my issue as well.

I'm using 4.4.34 and have tried 4.4.44, but I don't see that patch applied in stable or upstream.
Can somebody help me and say whether the patch has been accepted upstream, or whether another patch or workaround was created for this issue? As I understand it, this patch has not gone to the stable kernels. Please send it to stable.

So the upstream patch is this: fc791b6335152c5278dc4a4991bcb2d329f806f9

However, it's not tagged for stable, so you won't currently find it in anything before 4.9. So you have the following options:

a) Backport the patch to your kernel

b) Use the fix which Roland Dreier suggested in comment 32 - https://bugzilla.kernel.org/show_bug.cgi?id=111921#c32 - this will only work on the 4.4 series kernel

c) Ask for the commit to be included in stable releases - I have already done this, check the stable mailing list. But it will take some time until this propagates to the next stable release (if it is accepted at all).
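For anyone in the same situation, one way to check whether a given kernel tree already contains the upstream fix is sketched below. The tree path is an example; the commit id is the upstream one quoted above.

```sh
# In a clone of the (stable) kernel tree - path is an example
cd ~/src/linux-stable

# Does the currently checked-out kernel contain the upstream fix?
git merge-base --is-ancestor fc791b6335152c5278dc4a4991bcb2d329f806f9 HEAD \
    && echo "fix present" || echo "fix missing - backport needed"

# Which tags contain it (expect v4.9 and later until stable picks it up)?
git tag --contains fc791b6335152c5278dc4a4991bcb2d329f806f9 | head
```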