It looks like at some point between 4.4 and 4.9 conntrack cleanup performance got severely degraded. This is very similar to this bug that was supposed to be fixed: * https://lkml.org/lkml/2016/12/7/114 We used the same reproduce script and only adapted it to ping 65k addresses on the Internet instead of 254 addresses on LAN. To remove any possibility of our kernel config interference we booted xubuntu 17.04 in a VirtualBox VM (4.10 series). To reproduce one needs to: 1. Make sure that conntrack module is loaded. 2. Make sure that iptable_nat module is loaded (no rules are required). 3. Run the script and wait a few seconds for conntrack entries to start expiring: * https://gist.github.com/bobrik/734aa556048c93f0b6f7b47de3ed7bd9 At that point one CPU will start burning in nf_nat_cleanup_conntrack. Here's how "perf top -F 99 -U" looks like: * https://i.imgur.com/G68ZdkK.png Here's how net.netfilter.nf_conntrack_count looks like with 1s steps after number of entries started decreasing: * https://i.imgur.com/ojnW2FC.png The more sockets you have, the slower the recovery is. With 250k entries it can easily take 10+ minutes to recover, so there's some some super-linear GC complexity. We also got some hosts doing a lot of TCP connections and it seems that 500k TCP entries in conntrack tables doesn't trigger this behavior, when ICMP does it all the time. 65k is the default conntrack table limit and it seems to be easy to fill it and start saturating one CPU on a remote machine in GC work by just sending 500-1000 pings per second. One needs to have 65k IPs to do that, but IPv6 is there just for that.
Additionally, running "conntrack -L" almost certainly locks the system when launched with high number of ICMP entries. [ 1105.413732] NMI watchdog: BUG: soft lockup - CPU#5 stuck for 22s! [conntrack:12139] ... [ 1105.570500] task: ffff880ff0229e40 task.stack: ffffc900308b0000 [ 1105.584942] RIP: 0010:[<ffffffffa23a19ae>] [<ffffffffa23a19ae>] nf_nat_cleanup_conntrack+0xde/0x1f0 [nf_nat] [ 1105.603580] RSP: 0018:ffffc900308b3900 EFLAGS: 00000206 [ 1105.617538] RAX: ffff880fb995b518 RBX: 0000000000000000 RCX: 0000000000000101 [ 1105.633376] RDX: ffff880fb98dcbd8 RSI: 0000000000000001 RDI: ffff880fb7b7e9d8 [ 1105.649170] RBP: ffffc900308b3940 R08: 0000000000000001 R09: ffffffffa00c68a0 [ 1105.664855] R10: ffff88181f9bfb20 R11: ffffea003eebd000 R12: ffff880fba1ea9d8 [ 1105.680534] R13: ffff88103848be54 R14: ffff88100eb20000 R15: 000000000000039d [ 1105.696167] FS: 00007f42be84c700(0000) GS:ffff88103fb40000(0000) knlGS:0000000000000000 [ 1105.712862] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 1105.727208] CR2: 00007f7ee83439a0 CR3: 0000000fb7c0d000 CR4: 00000000001406e0 [ 1105.743034] Stack: [ 1105.753679] ffff880fba1ea940 0000000000000000 ffffffffa23a1810 ffffffff822f9b68 [ 1105.770038] ffff8810205fa132 ffff880fba1ea940 ffffffff81f3ea40 0000000000000189 [ 1105.786367] ffffc900308b3968 ffffffff8181a860 ffff880fba1ea940 ffffffff81f3ea40 [ 1105.802660] Call Trace: [ 1105.813889] [<ffffffffa23a1810>] ? nf_nat_l3proto_register+0x70/0x70 [nf_nat] [ 1105.830130] [<ffffffff8181a860>] __nf_ct_ext_destroy+0x40/0x60 [ 1105.844987] [<ffffffff81812135>] nf_conntrack_free+0x25/0x60 [ 1105.859588] [<ffffffff818130b9>] destroy_conntrack+0x79/0x90 [ 1105.874125] [<ffffffff8180dfb7>] nf_conntrack_destroy+0x17/0x20 [ 1105.888894] [<ffffffffa00c5d41>] ctnetlink_dump_table+0x2a1/0x370 [nf_conntrack_netlink] [ 1105.905950] [<ffffffff817c141e>] ? __alloc_skb+0x8e/0x1c0 [ 1105.920310] [<ffffffff81808f22>] netlink_dump+0x122/0x290 [ 1105.934657] [<ffffffff8180a0ca>] __netlink_dump_start+0x13a/0x170 [ 1105.949721] [<ffffffffa00c52b8>] ctnetlink_get_conntrack+0x98/0x250 [nf_conntrack_netlink] [ 1105.967100] [<ffffffffa00c5aa0>] ? ctnetlink_dump_dying+0x20/0x20 [nf_conntrack_netlink] [ 1105.984365] [<ffffffffa00c32f0>] ? ctnetlink_glue_build+0x2d0/0x2d0 [nf_conntrack_netlink] [ 1106.001859] [<ffffffff8180fff0>] ? nfnetlink_net_exit_batch+0x70/0x70 [ 1106.017574] [<ffffffff818101e1>] nfnetlink_rcv_msg+0x1f1/0x220 [ 1106.032680] [<ffffffff818101f9>] ? nfnetlink_rcv_msg+0x209/0x220 [ 1106.047995] [<ffffffff8180fff0>] ? nfnetlink_net_exit_batch+0x70/0x70 [ 1106.063725] [<ffffffff8180bd64>] netlink_rcv_skb+0xa4/0xc0 [ 1106.078480] [<ffffffff818105c6>] nfnetlink_rcv+0x286/0x570 [ 1106.093206] [<ffffffff81809efe>] ? __netlink_lookup+0xbe/0x100 [ 1106.108290] [<ffffffff8180b74a>] netlink_unicast+0x16a/0x210 [ 1106.123182] [<ffffffff8180bb17>] netlink_sendmsg+0x327/0x3a0 [ 1106.137851] [<ffffffff817b8138>] sock_sendmsg+0x38/0x50 [ 1106.151886] [<ffffffff817b856f>] SYSC_sendto+0xef/0x170 [ 1106.165778] [<ffffffff810fd77d>] ? __audit_syscall_entry+0xbd/0x100 [ 1106.180558] [<ffffffff810013a6>] ? syscall_trace_enter+0x1c6/0x2d0 [ 1106.195009] [<ffffffff817b957e>] SyS_sendto+0xe/0x10 [ 1106.208003] [<ffffffff81001b83>] do_syscall_64+0x63/0x220 [ 1106.221219] [<ffffffff819bc1c6>] entry_SYSCALL64_slow_path+0x25/0x25 [ 1106.235206] Code: 8d 2c 82 4c 89 ef e8 72 a4 61 df 4b 8b 3c fe 40 f6 c7 01 75 77 49 39 fc 74 7b 48 89 fa eb 08 49 39 c4 74 1e 48 89 c2 48 8b 42 08 <48> 85 c0 75 ef 48 8b 07 a8 01 75 55 49 39 c4 74 60 48 89 c7 eb [ 1112.664265] INFO: rcu_sched self-detected stall on CPU [ 1112.676831] 5-...: (23926 ticks this GP) idle=d3b/140000000000001/0 softirq=27807/27807 fqs=19749 [ 1112.684269] INFO: rcu_sched detected stalls on CPUs/tasks: [ 1112.684272] 5-...: (23926 ticks this GP) idle=d3b/140000000000001/0 softirq=27807/27807 fqs=19749 [ 1112.684274] (detected by 29, t=39932 jiffies, g=15631, c=15630, q=714917) [ 1112.684274] Task dump for CPU 5: [ 1112.684299] conntrack R running task 14176 12139 12138 0x0000008c [ 1112.684302] 0000000000000000 ffffc900308b3988 ffffffff81812135 ffff880fbac1da40 [ 1112.684303] ffff880fbac1da40 ffffc900308b39a0 ffffffff818130b9 ffff880fbac1da40 [ 1112.684304] ffffc900308b39b0 ffffffff8180dfb7 ffffc900308b3a50 ffffffffa00c5d41 [ 1112.684304] Call Trace: [ 1112.684312] [<ffffffff81812135>] ? nf_conntrack_free+0x25/0x60 [ 1112.684313] [<ffffffff818130b9>] ? destroy_conntrack+0x79/0x90 [ 1112.684316] [<ffffffff8180dfb7>] nf_conntrack_destroy+0x17/0x20 [ 1112.684321] [<ffffffffa00c5d41>] ctnetlink_dump_table+0x2a1/0x370 [nf_conntrack_netlink] [ 1112.684324] [<ffffffff817c141e>] ? __alloc_skb+0x8e/0x1c0 [ 1112.684326] [<ffffffff81808f22>] netlink_dump+0x122/0x290 [ 1112.684328] [<ffffffff8180a0ca>] __netlink_dump_start+0x13a/0x170 [ 1112.684330] [<ffffffffa00c52b8>] ctnetlink_get_conntrack+0x98/0x250 [nf_conntrack_netlink] [ 1112.684331] [<ffffffffa00c5aa0>] ? ctnetlink_dump_dying+0x20/0x20 [nf_conntrack_netlink] [ 1112.684333] [<ffffffffa00c32f0>] ? ctnetlink_glue_build+0x2d0/0x2d0 [nf_conntrack_netlink] [ 1112.684334] [<ffffffff8180fff0>] ? nfnetlink_net_exit_batch+0x70/0x70 [ 1112.684335] [<ffffffff818101e1>] nfnetlink_rcv_msg+0x1f1/0x220 [ 1112.684336] [<ffffffff818101f9>] ? nfnetlink_rcv_msg+0x209/0x220 [ 1112.684337] [<ffffffff8180fff0>] ? nfnetlink_net_exit_batch+0x70/0x70 [ 1112.684339] [<ffffffff8180bd64>] netlink_rcv_skb+0xa4/0xc0 [ 1112.684340] [<ffffffff818105c6>] nfnetlink_rcv+0x286/0x570 [ 1112.684341] [<ffffffff81809efe>] ? __netlink_lookup+0xbe/0x100 [ 1112.684343] [<ffffffff8180b74a>] netlink_unicast+0x16a/0x210 [ 1112.684344] [<ffffffff8180bb17>] netlink_sendmsg+0x327/0x3a0 [ 1112.684347] [<ffffffff817b8138>] sock_sendmsg+0x38/0x50 [ 1112.684348] [<ffffffff817b856f>] SYSC_sendto+0xef/0x170 [ 1112.684351] [<ffffffff810fd77d>] ? __audit_syscall_entry+0xbd/0x100 [ 1112.684353] [<ffffffff810013a6>] ? syscall_trace_enter+0x1c6/0x2d0 [ 1112.684354] [<ffffffff817b957e>] SyS_sendto+0xe/0x10 [ 1112.684356] [<ffffffff81001b83>] do_syscall_64+0x63/0x220 [ 1112.684359] [<ffffffff819bc1c6>] entry_SYSCALL64_slow_path+0x25/0x25 [ 1113.163397] (t=39980 jiffies g=15631 c=15630 q=715039)
tracking this down abit, seems like the change from hlist to rhlist makes nf_nat_cleanup_conntrack much more slower than it should when the table is very imbalance (i.e. 65k connections with the same source). I tried to revert the changes to rhlist and the systems looks fine afterwards, no more expensive nf_nat_cleanup_conntrack.
Created attachment 258191 [details] Revert "netfilter: nat: fix src map lookup"
Created attachment 258193 [details] Revert "netfilter: nat: fix crash when conntrack entry is re-used"
Created attachment 258195 [details] Revert "netfilter: nat: switch to new rhlist interface"
Created attachment 258197 [details] Revert "netfilter: nat: fix cmp return value"
Created attachment 258199 [details] Revert "netfilter: nat: convert nat bysrc hash to rhashtable"
Created attachment 258201 [details] Revert "netfilter: move nat hlist_head to nf_conn"