Bug 203671 - Stuck connections if flow offload enabled in nftables
Summary: Stuck connections if flow offload enabled in nftables
Status: NEW
Alias: None
Product: Networking
Classification: Unclassified
Component: Netfilter/Iptables
Hardware: All
OS: Linux
Importance: P1 normal
Assignee: networking_netfilter-iptables@kernel-bugs.osdl.org
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2019-05-21 23:16 UTC by nucleo
Modified: 2019-09-03 22:11 UTC
CC List: 2 users

See Also:
Kernel Version: 5.1.15-300.fc30.x86_64
Subsystem:
Regression: No
Bisected commit-id:


Attachments
skip fixup on teardown state (1.20 KB, application/mbox)
2019-08-24 18:34 UTC, Pablo Neira Ayuso

Description nucleo 2019-05-21 23:16:54 UTC
Hi,

I am trying to enable flow offload in nftables.

Testing on a Fedora 30 virtual machine with three virtio interfaces (eth0 for communication with the host system, eth1 and eth2 for routing), all settings at defaults except net.ipv4.ip_forward=1.
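
For reference, a minimal sketch of that one non-default setting (the sysctl name is standard; the drop-in file path is an assumption of this sketch):

# enable IPv4 forwarding for the current boot
sysctl -w net.ipv4.ip_forward=1
# make it persistent (assumed location; adjust to your distribution)
echo 'net.ipv4.ip_forward = 1' > /etc/sysctl.d/99-forwarding.conf
sysctl --system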

Interface configuration:

eth0: flags=4163<UP,BROADCAST,RUNNING,MULTICAST>  mtu 1500
        inet 192.168.122.247  netmask 255.255.255.0  broadcast 192.168.122.255
        inet6 fe80::5054:ff:fefd:4919  prefixlen 64  scopeid 0x20<link>
        ether 52:54:00:fd:49:19  txqueuelen 1000  (Ethernet)
        RX packets 375  bytes 27949 (27.2 KiB)
        RX errors 0  dropped 7  overruns 0  frame 0
        TX packets 168  bytes 39187 (38.2 KiB)
        TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0

eth1: flags=4163<UP,BROADCAST,RUNNING,MULTICAST>  mtu 1500
        inet 198.51.100.1  netmask 255.255.255.0  broadcast 198.51.100.255
        inet6 fe80::5054:ff:fefd:4921  prefixlen 64  scopeid 0x20<link>
        ether 52:54:00:fd:49:21  txqueuelen 1000  (Ethernet)
        RX packets 1204680  bytes 79543005 (75.8 MiB)
        RX errors 0  dropped 0  overruns 0  frame 0
        TX packets 1547644  bytes 12823278521 (11.9 GiB)
        TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0

eth2: flags=4163<UP,BROADCAST,RUNNING,MULTICAST>  mtu 1500
        inet 10.0.0.1  netmask 255.255.255.0  broadcast 10.0.0.255
        inet6 fe80::5054:ff:fefd:4920  prefixlen 64  scopeid 0x20<link>
        ether 52:54:00:fd:49:20  txqueuelen 1000  (Ethernet)
        RX packets 8785433  bytes 13300974656 (12.3 GiB)
        RX errors 0  dropped 0  overruns 0  frame 0
        TX packets 1204716  bytes 79548109 (75.8 MiB)
        TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0

nftables configuration:

table inet filter {
        flowtable ft {
                hook ingress priority 0
                devices = { eth1, eth2 }
        }

        chain forward {
                type filter hook forward priority 0; policy accept;
                ip protocol tcp flow offload @ft
                oif "eth2" jump lan
        }

        chain lan {
                ct state established,related accept
                drop
        }
}
table ip nat {
        chain prerouting {
                type nat hook prerouting priority -100; policy accept;
        }

        chain postrouting {
                type nat hook postrouting priority 100; policy accept;
                oif "eth1" ip saddr 10.0.0.0/24 snat to 198.51.100.1
        }
}
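
As an aside, a minimal sketch of loading and verifying such a ruleset (the file path is an assumption of this sketch):

# load the ruleset (path assumed)
nft -f /etc/nftables/flowtable-test.nft
# verify what got loaded
nft list ruleset
# recent nft versions can also list flowtables directly
nft list flowtables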

On the host system there are two veth pairs: one end of each pair is added to the bridge that carries the VM's interface eth1 or eth2, and the other end is placed in one of two network namespaces (a sketch of one way to build this follows the interface listing below):

veth1: flags=4163<UP,BROADCAST,RUNNING,MULTICAST>  mtu 1500
        inet 198.51.100.254  netmask 255.255.255.0  broadcast 198.51.100.255
        inet6 fe80::83b:2cff:fe75:6ea6  prefixlen 64  scopeid 0x20<link>
        ether 0a:3b:2c:75:6e:a6  txqueuelen 1000  (Ethernet)
        RX packets 6917456  bytes 56984049726 (53.0 GiB)
        RX errors 0  dropped 0  overruns 0  frame 0
        TX packets 5336448  bytes 352365569 (336.0 MiB)
        TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0

veth2: flags=4163<UP,BROADCAST,RUNNING,MULTICAST>  mtu 1500
        inet 10.0.0.2  netmask 255.255.255.0  broadcast 10.0.0.255
        inet6 fe80::1461:8eff:fe7b:8c13  prefixlen 64  scopeid 0x20<link>
        ether 16:61:8e:7b:8c:13  txqueuelen 1000  (Ethernet)
        RX packets 5336683  bytes 352452369 (336.1 MiB)
        RX errors 0  dropped 0  overruns 0  frame 0
        TX packets 920209  bytes 56590718932 (52.7 GiB)
        TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0
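
A rough sketch of one way to build this host-side topology (namespace and bridge names are assumptions of this sketch; the bridges carrying the VM's eth1/eth2 are expected to exist already):

# namespaces for the iperf3 server and client
ip netns add ns1
ip netns add ns2
# pair towards the bridge of the VM's eth1 (bridge name br1 assumed)
ip link add veth1 type veth peer name veth1-br
ip link set dev veth1-br master br1 up
ip link set dev veth1 netns ns1
ip netns exec ns1 ip addr add 198.51.100.254/24 dev veth1
ip netns exec ns1 ip link set veth1 up
ip netns exec ns1 ip route add default via 198.51.100.1
# pair towards the bridge of the VM's eth2 (bridge name br2 assumed)
ip link add veth2 type veth peer name veth2-br
ip link set dev veth2-br master br2 up
ip link set dev veth2 netns ns2
ip netns exec ns2 ip addr add 10.0.0.2/24 dev veth2
ip netns exec ns2 ip link set veth2 up
ip netns exec ns2 ip route add default via 10.0.0.1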

In the first netns I started the server ("iperf3 -s"); in the second, the client ("iperf3 -c 198.51.100.254 -n 100G").
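
With namespaces as in the sketch above, the two iperf3 instances would be started like this (namespace names are assumptions of the sketch):

# server in the first namespace
ip netns exec ns1 iperf3 -s
# client in the second namespace, from another terminal
ip netns exec ns2 iperf3 -c 198.51.100.254 -n 100G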

If I interrupt the test on the client side at 29 seconds or earlier, the test on the server side terminates as it should:

-----------------------------------------------------------
Server listening on 5201
-----------------------------------------------------------
Accepted connection from 198.51.100.1, port 58010
[  5] local 198.51.100.254 port 5201 connected to 198.51.100.1 port 58012
[ ID] Interval           Transfer     Bitrate
[  5]   0.00-1.00   sec   273 MBytes  2.29 Gbits/sec                  
[  5]   1.00-2.00   sec   308 MBytes  2.58 Gbits/sec                  
[  5]   2.00-3.00   sec   289 MBytes  2.42 Gbits/sec                  
....................................................
[  5]  28.00-29.00  sec   282 MBytes  2.37 Gbits/sec                  
[  5]  28.00-29.00  sec   282 MBytes  2.37 Gbits/sec                  
- - - - - - - - - - - - - - - - - - - - - - - - -
[ ID] Interval           Transfer     Bitrate
[  5]   0.00-29.00  sec  8.17 GBytes  2.42 Gbits/sec                  receiver
iperf3: the client has terminated
-----------------------------------------------------------
Server listening on 5201
-----------------------------------------------------------

After the test interruption, the following appears in /proc/net/nf_conntrack:

ipv4     2 tcp      6 5 CLOSE src=10.0.0.2 dst=198.51.100.254 sport=58030 dport=5201 src=198.51.100.254 dst=198.51.100.1 sport=5201 dport=58030 mark=0 zone=0 use=2

But if the test is terminated on the client after more than 30 seconds, it never terminates on the server side:

Accepted connection from 198.51.100.1, port 58018
[  5] local 198.51.100.254 port 5201 connected to 198.51.100.1 port 58020
[ ID] Interval           Transfer     Bitrate
[  5]   0.00-1.00   sec   251 MBytes  2.10 Gbits/sec                  
[  5]   1.00-2.00   sec   278 MBytes  2.33 Gbits/sec                  
[  5]   2.00-3.00   sec   286 MBytes  2.40 Gbits/sec                  
....................................................
[  5]  28.00-29.00  sec   301 MBytes  2.53 Gbits/sec                  
[  5]  29.00-30.00  sec   293 MBytes  2.46 Gbits/sec                  
[  5]  30.00-31.00  sec   283 MBytes  2.38 Gbits/sec                  
[  5]  31.00-32.00  sec   116 MBytes   974 Mbits/sec                  
[  5]  32.00-33.00  sec  0.00 Bytes  0.00 bits/sec                  
[  5]  33.00-34.00  sec  0.00 Bytes  0.00 bits/sec                  
[  5]  34.00-35.00  sec  0.00 Bytes  0.00 bits/sec                  
[  5]  35.00-36.00  sec  0.00 Bytes  0.00 bits/sec                  
[  5]  36.00-37.00  sec  0.00 Bytes  0.00 bits/sec                  
[  5]  37.00-38.00  sec  0.00 Bytes  0.00 bits/sec                  

In /proc/net/nf_conntrack, after the test interruption, only this appears:

ipv4     2 tcp      6 37 CLOSE_WAIT src=10.0.0.2 dst=198.51.100.254 sport=58034 dport=5201 src=198.51.100.254 dst=198.51.100.1 sport=5201 dport=58034 mark=0 zone=0 use=2
Comment 1 nucleo 2019-05-21 23:24:35 UTC
I should add that there is no problem after removing either the "ip protocol tcp flow offload @ft" rule or the "oif "eth1" ip saddr 10.0.0.0/24 snat to 198.51.100.1" rule.
Comment 2 nucleo 2019-07-01 21:07:01 UTC
The same behaviour with nftables-0.9.1 and kernel 5.1.15-300.fc30.x86_64.
Comment 3 ottorei 2019-07-02 12:49:44 UTC
I am also experiencing similar connection stalling with TCP connections. In my case the offload is applied on a NAT gateway before the ct rules, like this:

cat /etc/nftables.conf | grep flow
#add flowtable ip filter flows { hook ingress priority -50; devices = {enp6s0f0, enp6s0f1}; }
#add rule ip filter FORWARD counter flow offload @flows comment "FASTPATH TEST"

The stalling occurs systematically on long-running TCP connections, but I have not seen any issues with UDP. When offloading is disabled, the issue no longer occurs.
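
For what it's worth, a sketch of how offloading can be toggled at runtime so the stall can be correlated with the offload rule (this assumes the ip filter table and FORWARD chain already exist, as in the commented config above; the handle-based delete is the usual nft idiom):

nft 'add flowtable ip filter flows { hook ingress priority -50; devices = { enp6s0f0, enp6s0f1 }; }'
nft 'add rule ip filter FORWARD counter flow offload @flows comment "FASTPATH TEST"'
# later, find the rule handle and remove it to disable offloading again
nft -a list chain ip filter FORWARD
nft delete rule ip filter FORWARD handle <handle>   # handle taken from the -a listing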
Comment 4 nucleo 2019-07-02 17:21:35 UTC
I can also reproduce the bug with this minimal nftables 0.9.1 setup:

table inet filter {
        flowtable ft {
                hook ingress priority filter
                devices = { eth1, eth2 }
        }

        chain forward {
                type filter hook forward priority filter; policy accept;
                ip protocol tcp flow add @ft
        }
}
table ip nat {
        chain prerouting {
                type nat hook prerouting priority dstnat; policy accept;
        }

        chain postrouting {
                type nat hook postrouting priority srcnat; policy accept;
                oif "eth1" ip saddr 10.0.0.0/24 snat to 198.51.100.1
        }
}
Comment 5 Pablo Neira Ayuso 2019-07-02 17:36:45 UTC
Could you give these fixes a try?

https://patchwork.ozlabs.org/patch/1102703/
https://patchwork.ozlabs.org/patch/1102704/
https://patchwork.ozlabs.org/patch/1102705/
https://patchwork.ozlabs.org/patch/1102706/

I can request inclusion into -stable.

Thanks.
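
For testing, such patchwork submissions are usually fetched as mbox and applied with git am; a sketch, assuming the standard patchwork /mbox/ download path and a kernel tree checked out at the version under test:

# inside the kernel source tree being rebuilt
for p in 1102703 1102704 1102705 1102706; do
    curl -sL "https://patchwork.ozlabs.org/patch/$p/mbox/" | git am
done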
Comment 6 nucleo 2019-07-02 20:33:28 UTC
No stale connections with the patches applied to 4.19.56.
I am also going to test 5.1.15.
Comment 7 nucleo 2019-07-02 21:55:46 UTC
No stale connections in 5.1.15 with the patches either.

But I noticed that /proc/net/nf_conntrack sometimes has leftover TIME_WAIT or CLOSE entries with a large timeout (86399); this is hard to reproduce because most connections disappear shortly after closing.
Comment 8 nucleo 2019-07-02 22:03:38 UTC
While the iperf3 connection is active, all [OFFLOAD] entries disappear from /proc/net/nf_conntrack after about 60 seconds.
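
A simple way to observe this (just a sketch of watching the table; any equivalent loop works, and the conntrack tool needs conntrack-tools installed):

watch -n1 'grep 5201 /proc/net/nf_conntrack'
# or, with conntrack-tools:
conntrack -L -p tcp --dport 5201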
Comment 9 Pablo Neira Ayuso 2019-08-09 10:49:59 UTC
There is a race that might trigger the ~86400 timeout with TIME_WAIT (that is one day in seconds, an internal offload timeout that leaks to userspace when this bug is hit).

http://patchwork.ozlabs.org/patch/1144577/
http://patchwork.ozlabs.org/patch/1144578/

Please give these patches a try.

Regarding the "entries disappeared from /proc/net/nf_conntrack after 60 seconds": I cannot reproduce that here. Could you tell me how you reproduce it?
Comment 10 nucleo 2019-08-10 13:35:39 UTC
The disappearance is hard to reproduce. I just run the client ("iperf3 -c 198.51.100.254 -n 100G") several times, and the entries in /proc/net/nf_conntrack behave differently at random.

First, here are results with kernel 5.2.6-200.fc30.x86_64 without the patches from comment 9, using this ruleset:

table inet filter {
        flowtable ft {
                hook ingress priority filter
                devices = { eth1, eth2 }
        }

        chain forward {
                type filter hook forward priority filter; policy accept;
                flow add @ft
        }
}
table ip nat {
        chain prerouting {
                type nat hook prerouting priority dstnat; policy accept;
        }

        chain postrouting {
                type nat hook postrouting priority srcnat; policy accept;
                oif "eth1" ip saddr 10.0.0.0/24 snat to 198.51.100.1
        }
}



First iperf3 test, without the "flow add @ft" rule:

server side:
tcp6       0      0 198.51.100.254:5201     198.51.100.1:49806      ESTABLISHED 27619/iperf3        
tcp6       0      0 198.51.100.254:5201     198.51.100.1:49808      ESTABLISHED 27619/iperf3     

client side:
tcp        0      0 10.0.0.2:49806          198.51.100.254:5201     ESTABLISHED 27660/iperf3        
tcp        0 3082920 10.0.0.2:49808          198.51.100.254:5201     ESTABLISHED 27660/iperf3        

/proc/net/nf_conntrack has correct entries throughout the test:
ipv4     2 tcp      6 431932 ESTABLISHED src=10.0.0.2 dst=198.51.100.254 sport=49806 dport=5201 src=198.51.100.254 dst=198.51.100.1 sport=5201 dport=49806 [ASSURED] mark=0 zone=0 use=2
ipv4     2 tcp      6 300 ESTABLISHED src=10.0.0.2 dst=198.51.100.254 sport=49808 dport=5201 src=198.51.100.254 dst=198.51.100.1 sport=5201 dport=49808 [ASSURED] mark=0 zone=0 use=2 

/proc/net/nf_conntrack after interrupting client with ctrl+c:
ipv4     2 tcp      6 118 TIME_WAIT src=10.0.0.2 dst=198.51.100.254 sport=49806 dport=5201 src=198.51.100.254 dst=198.51.100.1 sport=5201 dport=49806 [ASSURED] mark=0 zone=0 use=2
ipv4     2 tcp      6 8 CLOSE src=10.0.0.2 dst=198.51.100.254 sport=49808 dport=5201 src=198.51.100.254 dst=198.51.100.1 sport=5201 dport=49808 [ASSURED] mark=0 zone=0 use=2 




Now several tests with the "flow add @ft" rule:

/proc/net/nf_conntrack after test started
ipv4     2 tcp      6 src=10.0.0.2 dst=198.51.100.254 sport=49724 dport=5201 src=198.51.100.254 dst=198.51.100.1 sport=5201 dport=49724 [OFFLOAD] mark=0 zone=0 use=3
ipv4     2 tcp      6 src=10.0.0.2 dst=198.51.100.254 sport=49722 dport=5201 src=198.51.100.254 dst=198.51.100.1 sport=5201 dport=49722 [OFFLOAD] mark=0 zone=0 use=3

/proc/net/nf_conntrack after 30 seconds
ipv4     2 tcp      6 src=10.0.0.2 dst=198.51.100.254 sport=49724 dport=5201 src=198.51.100.254 dst=198.51.100.1 sport=5201 dport=49724 [OFFLOAD] mark=0 zone=0 use=3
ipv4     2 tcp      6 24 SYN_RECV src=10.0.0.2 dst=198.51.100.254 sport=49722 dport=5201 src=198.51.100.254 dst=198.51.100.1 sport=5201 dport=49722 mark=0 zone=0 use=2

/proc/net/nf_conntrack after 59 seconds (on the client side there are still two established connections)
ipv4     2 tcp      6 src=10.0.0.2 dst=198.51.100.254 sport=49724 dport=5201 src=198.51.100.254 dst=198.51.100.1 sport=5201 dport=49724 [OFFLOAD] mark=0 zone=0 use=3

/proc/net/nf_conntrack after interrupting client
ipv4     2 tcp      6 2 CLOSE src=10.0.0.2 dst=198.51.100.254 sport=49724 dport=5201 src=198.51.100.254 dst=198.51.100.1 sport=5201 dport=49724 mark=0 zone=0 use=2
ipv4     2 tcp      6 52 CLOSE_WAIT src=10.0.0.2 dst=198.51.100.254 sport=49722 dport=5201 src=198.51.100.254 dst=198.51.100.1 sport=5201 dport=49722 mark=0 zone=0 use=2



Next test:

/proc/net/nf_conntrack after test started
ipv4     2 tcp      6 src=10.0.0.2 dst=198.51.100.254 sport=49748 dport=5201 src=198.51.100.254 dst=198.51.100.1 sport=5201 dport=49748 [OFFLOAD] mark=0 zone=0 use=3
ipv4     2 tcp      6 src=10.0.0.2 dst=198.51.100.254 sport=49746 dport=5201 src=198.51.100.254 dst=198.51.100.1 sport=5201 dport=49746 [OFFLOAD] mark=0 zone=0 use=3

/proc/net/nf_conntrack after 30 seconds
ipv4     2 tcp      6 src=10.0.0.2 dst=198.51.100.254 sport=49748 dport=5201 src=198.51.100.254 dst=198.51.100.1 sport=5201 dport=49748 [OFFLOAD] mark=0 zone=0 use=3
ipv4     2 tcp      6 86378 SYN_RECV src=10.0.0.2 dst=198.51.100.254 sport=49746 dport=5201 src=198.51.100.254 dst=198.51.100.1 sport=5201 dport=49746 mark=0 zone=0 use=2

/proc/net/nf_conntrack after 59 seconds
ipv4     2 tcp      6 src=10.0.0.2 dst=198.51.100.254 sport=49748 dport=5201 src=198.51.100.254 dst=198.51.100.1 sport=5201 dport=49748 [OFFLOAD] mark=0 zone=0 use=3
ipv4     2 tcp      6 86344 SYN_RECV src=10.0.0.2 dst=198.51.100.254 sport=49746 dport=5201 src=198.51.100.254 dst=198.51.100.1 sport=5201 dport=49746 mark=0 zone=0 use=2

/proc/net/nf_conntrack after interrupting client
ipv4     2 tcp      6 7 CLOSE src=10.0.0.2 dst=198.51.100.254 sport=49748 dport=5201 src=198.51.100.254 dst=198.51.100.1 sport=5201 dport=49748 mark=0 zone=0 use=2
ipv4     2 tcp      6 117 TIME_WAIT src=10.0.0.2 dst=198.51.100.254 sport=49746 dport=5201 src=198.51.100.254 dst=198.51.100.1 sport=5201 dport=49746 [ASSURED] mark=0 zone=0 use=2



Next test:

/proc/net/nf_conntrack after test started
ipv4     2 tcp      6 src=10.0.0.2 dst=198.51.100.254 sport=49760 dport=5201 src=198.51.100.254 dst=198.51.100.1 sport=5201 dport=49760 [OFFLOAD] mark=0 zone=0 use=3
ipv4     2 tcp      6 src=10.0.0.2 dst=198.51.100.254 sport=49762 dport=5201 src=198.51.100.254 dst=198.51.100.1 sport=5201 dport=49762 [OFFLOAD] mark=0 zone=0 use=3

/proc/net/nf_conntrack after 30 seconds
ipv4     2 tcp      6 26 SYN_RECV src=10.0.0.2 dst=198.51.100.254 sport=49760 dport=5201 src=198.51.100.254 dst=198.51.100.1 sport=5201 dport=49760 mark=0 zone=0 use=2
ipv4     2 tcp      6 src=10.0.0.2 dst=198.51.100.254 sport=49762 dport=5201 src=198.51.100.254 dst=198.51.100.1 sport=5201 dport=49762 [OFFLOAD] mark=0 zone=0 use=3

/proc/net/nf_conntrack after 59 seconds is empty; on the client side:
[  4]  59.00-60.00  sec   659 MBytes  5.53 Gbits/sec  1764   1.17 MBytes       
iperf3: error - unable to write to stream socket: Connection reset by peer


Tests with kernel 5.2.7-200.fc30.x86_64 without patches from comment 9:

/proc/net/nf_conntrack after test started
ipv4     2 tcp      6 src=10.0.0.2 dst=198.51.100.254 sport=49854 dport=5201 src=198.51.100.254 dst=198.51.100.1 sport=5201 dport=49854 [OFFLOAD] mark=0 zone=0 use=3
ipv4     2 tcp      6 src=10.0.0.2 dst=198.51.100.254 sport=49856 dport=5201 src=198.51.100.254 dst=198.51.100.1 sport=5201 dport=49856 [OFFLOAD] mark=0 zone=0 use=3

/proc/net/nf_conntrack after 33 seconds
ipv4     2 tcp      6 117 ESTABLISHED src=10.0.0.2 dst=198.51.100.254 sport=49854 dport=5201 src=198.51.100.254 dst=198.51.100.1 sport=5201 dport=49854 mark=0 zone=0 use=2
ipv4     2 tcp      6 src=10.0.0.2 dst=198.51.100.254 sport=49856 dport=5201 src=198.51.100.254 dst=198.51.100.1 sport=5201 dport=49856 [OFFLOAD] mark=0 zone=0 use=3

/proc/net/nf_conntrack after 154 seconds
ipv4     2 tcp      6 src=10.0.0.2 dst=198.51.100.254 sport=49856 dport=5201 src=198.51.100.254 dst=198.51.100.1 sport=5201 dport=49856 [OFFLOAD] mark=0 zone=0 use=3

/proc/net/nf_conntrack after interrupting client
ipv4     2 tcp      6 src=10.0.0.2 dst=198.51.100.254 sport=49854 dport=5201 src=198.51.100.254 dst=198.51.100.1 sport=5201 dport=49854 [OFFLOAD] mark=0 zone=0 use=3
ipv4     2 tcp      6 src=10.0.0.2 dst=198.51.100.254 sport=49856 dport=5201 src=198.51.100.254 dst=198.51.100.1 sport=5201 dport=49856 [OFFLOAD] mark=0 zone=0 use=3

/proc/net/nf_conntrack after couple of seconds
ipv4     2 tcp      6 39 ESTABLISHED src=10.0.0.2 dst=198.51.100.254 sport=49854 dport=5201 src=198.51.100.254 dst=198.51.100.1 sport=5201 dport=49854 mark=0 zone=0 use=2
Comment 11 nucleo 2019-08-10 13:37:24 UTC
Correction: the last test in comment 10 was with kernel 5.2.7-200.fc30.x86_64 and with the patches from comment 9 applied.
Comment 12 Pablo Neira Ayuso 2019-08-12 09:55:26 UTC
(In reply to nucleo from comment #10)
> Tests with kernel 5.2.7-200.fc30.x86_64 _with_ patches from comment 9:
> 
> /proc/net/nf_conntrack after test started
> ipv4     2 tcp      6 src=10.0.0.2 dst=198.51.100.254 sport=49854 dport=5201
> src=198.51.100.254 dst=198.51.100.1 sport=5201 dport=49854 [OFFLOAD] mark=0
> zone=0 use=3
> ipv4     2 tcp      6 src=10.0.0.2 dst=198.51.100.254 sport=49856 dport=5201
> src=198.51.100.254 dst=198.51.100.1 sport=5201 dport=49856 [OFFLOAD] mark=0
> zone=0 use=3

Both flows have been placed in the flowtable.

> /proc/net/nf_conntrack after 33 seconds
> ipv4     2 tcp      6 117 ESTABLISHED src=10.0.0.2 dst=198.51.100.254
> sport=49854 dport=5201 src=198.51.100.254 dst=198.51.100.1 sport=5201
> dport=49854 mark=0 zone=0 use=2
> ipv4     2 tcp      6 src=10.0.0.2 dst=198.51.100.254 sport=49856 dport=5201
> src=198.51.100.254 dst=198.51.100.1 sport=5201 dport=49856 [OFFLOAD] mark=0
> zone=0 use=3

The flowtable sees no packets for the flow sport=49854 for 30 seconds, so this flow is pushed out of the flowtable and conntrack recovers control of it. The pick-up timeout (120 seconds) kicks in and the entry is set to the ESTABLISHED state (tracking is also set to liberal).

> /proc/net/nf_conntrack after 154 seconds
> ipv4     2 tcp      6 src=10.0.0.2 dst=198.51.100.254 sport=49856 dport=5201
> src=198.51.100.254 dst=198.51.100.1 sport=5201 dport=49856 [OFFLOAD] mark=0
> zone=0 use=3

The flow sport=49854 is gone: there was no traffic for it for a while, and conntrack saw no packets within the 120-second window either (that is, the 30-second flowtable timeout plus the 120-second pick-up timeout), so the entry sport=49854 is released.

> /proc/net/nf_conntrack after interrupting client
> ipv4     2 tcp      6 src=10.0.0.2 dst=198.51.100.254 sport=49854 dport=5201
> src=198.51.100.254 dst=198.51.100.1 sport=5201 dport=49854 [OFFLOAD] mark=0
> zone=0 use=3
> ipv4     2 tcp      6 src=10.0.0.2 dst=198.51.100.254 sport=49856 dport=5201
> src=198.51.100.254 dst=198.51.100.1 sport=5201 dport=49856 [OFFLOAD] mark=0
> zone=0 use=3

After pressing ctrl-c on the client, in my testbed I see one entry in TIME_WAIT (in your case that would be the flow identified by sport=49856) and another flow in ESTABLISHED state, which is the one below...

> /proc/net/nf_conntrack after couple of seconds
> ipv4     2 tcp      6 39 ESTABLISHED src=10.0.0.2 dst=198.51.100.254
> sport=49854 dport=5201 src=198.51.100.254 dst=198.51.100.1 sport=5201
> dport=49854 mark=0 zone=0 use=2

... this is sport=49854: the fin/rst packet is sent back through the flowtable, then the entry expires (no packets for 30 seconds) and it goes back to conntrack.

I'll be posting two patches here:

1) Do not push the flow back into the flowtable if the packet is a fin/rst.
2) Likely increase the default flowtable timeout to 120 seconds. I'll also expose toggles to make this configurable.

Thanks for your feedback.
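
For readers following the timeout arithmetic: the 30-second flowtable timeout and the 120-second pick-up value are internal kernel defaults here (hence the proposed toggle). The general conntrack TCP knobs that do exist can be inspected like this, as a rough sketch for observing defaults rather than a tuning recipe:

sysctl net.netfilter.nf_conntrack_tcp_timeout_established
sysctl net.netfilter.nf_conntrack_tcp_timeout_time_wait
sysctl net.netfilter.nf_conntrack_tcp_loose
sysctl net.netfilter.nf_conntrack_tcp_be_liberal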
Comment 13 Pablo Neira Ayuso 2019-08-13 08:54:42 UTC
(In reply to Pablo Neira Ayuso from comment #12)
[...]
> 1) do not push back flow to flowtable if packet is fin/rst.

https://patchwork.ozlabs.org/patch/1146133/

With this patch, conntrack entries enter the TIME_WAIT state on fin/rst after interrupting the client.
Comment 14 Pablo Neira Ayuso 2019-08-13 15:42:26 UTC
(In reply to Pablo Neira Ayuso from comment #13)
> (In reply to Pablo Neira Ayuso from comment #12)
> [...]
> > 1) do not push back flow to flowtable if packet is fin/rst.
> 
> https://patchwork.ozlabs.org/patch/1146133/

Patch version 2:

https://patchwork.ozlabs.org/patch/1146419/
Comment 15 nucleo 2019-08-22 14:45:41 UTC
Here are my tests with the Fedora 5.2.9 kernel with the patches from comment 9 and comment 14 applied. I ran "iperf3 -c 198.51.100.254 -n 100G" several times, interrupting it with ctrl+c.

First run

Contents of /proc/net/nf_conntrack
ipv4     2 tcp      6 src=10.0.0.2 dst=198.51.100.254 sport=51994 dport=5201 src=198.51.100.254 dst=198.51.100.1 sport=5201 dport=51994 [OFFLOAD] mark=0 zone=0 use=3
ipv4     2 tcp      6 src=10.0.0.2 dst=198.51.100.254 sport=51992 dport=5201 src=198.51.100.254 dst=198.51.100.1 sport=5201 dport=51992 [OFFLOAD] mark=0 zone=0 use=3

after 30 seconds
ipv4     2 tcp      6 src=10.0.0.2 dst=198.51.100.254 sport=51994 dport=5201 src=198.51.100.254 dst=198.51.100.1 sport=5201 dport=51994 [OFFLOAD] mark=0 zone=0 use=3
ipv4     2 tcp      6 18 ESTABLISHED src=10.0.0.2 dst=198.51.100.254 sport=51992 dport=5201 src=198.51.100.254 dst=198.51.100.1 sport=5201 dport=51992 mark=0 zone=0 use=2

after 60 seconds /proc/net/nf_conntrack is empty, and one second later
ipv4     2 tcp      6 src=10.0.0.2 dst=198.51.100.254 sport=51994 dport=5201 [UNREPLIED] src=198.51.100.254 dst=198.51.100.1 sport=5201 dport=51994 [OFFLOAD] mark=0 zone=0 use=3

ctrl+c
ipv4     2 tcp      6 src=10.0.0.2 dst=198.51.100.254 sport=51992 dport=5201 src=198.51.100.254 dst=198.51.100.1 sport=5201 dport=51992 [OFFLOAD] mark=0 zone=0 use=3

and after that
ipv4     2 tcp      6 112 ESTABLISHED src=10.0.0.2 dst=198.51.100.254 sport=51992 dport=5201 src=198.51.100.254 dst=198.51.100.1 sport=5201 dport=51992 mark=0 zone=0 use=2





Second run

ipv4     2 tcp      6 src=10.0.0.2 dst=198.51.100.254 sport=52000 dport=5201 src=198.51.100.254 dst=198.51.100.1 sport=5201 dport=52000 [OFFLOAD] mark=0 zone=0 use=3
ipv4     2 tcp      6 src=10.0.0.2 dst=198.51.100.254 sport=52002 dport=5201 src=198.51.100.254 dst=198.51.100.1 sport=5201 dport=52002 [OFFLOAD] mark=0 zone=0 use=3

after 30 seconds
ipv4     2 tcp      6 26 ESTABLISHED src=10.0.0.2 dst=198.51.100.254 sport=52000 dport=5201 src=198.51.100.254 dst=198.51.100.1 sport=5201 dport=52000 mark=0 zone=0 use=2
ipv4     2 tcp      6 src=10.0.0.2 dst=198.51.100.254 sport=52002 dport=5201 src=198.51.100.254 dst=198.51.100.1 sport=5201 dport=52002 [OFFLOAD] mark=0 zone=0 use=3

after 60 seconds /proc/net/nf_conntrack is empty; on the client side:
[  4]  59.00-60.00  sec   528 MBytes  4.42 Gbits/sec  498   1.34 MBytes       
iperf3: error - unable to write to stream socket: Connection reset by peer

In one of the other runs the test continued with an empty /proc/net/nf_conntrack.




Third run

ipv4     2 tcp      6 src=10.0.0.2 dst=198.51.100.254 sport=52008 dport=5201 src=198.51.100.254 dst=198.51.100.1 sport=5201 dport=52008 [OFFLOAD] mark=0 zone=0 use=3
ipv4     2 tcp      6 src=10.0.0.2 dst=198.51.100.254 sport=52006 dport=5201 src=198.51.100.254 dst=198.51.100.1 sport=5201 dport=52006 [OFFLOAD] mark=0 zone=0 use=3

after 30 seconds
ipv4     2 tcp      6 src=10.0.0.2 dst=198.51.100.254 sport=52008 dport=5201 src=198.51.100.254 dst=198.51.100.1 sport=5201 dport=52008 [OFFLOAD] mark=0 zone=0 use=3
ipv4     2 tcp      6 25 ESTABLISHED src=10.0.0.2 dst=198.51.100.254 sport=52006 dport=5201 src=198.51.100.254 dst=198.51.100.1 sport=5201 dport=52006 mark=0 zone=0 use=2

after 60 seconds
ipv4     2 tcp      6 src=10.0.0.2 dst=198.51.100.254 sport=52008 dport=5201 src=198.51.100.254 dst=198.51.100.1 sport=5201 dport=52008 [OFFLOAD] mark=0 zone=0 use=3

ctrl+c
ipv4     2 tcp      6 src=10.0.0.2 dst=198.51.100.254 sport=52008 dport=5201 src=198.51.100.254 dst=198.51.100.1 sport=5201 dport=52008 [OFFLOAD] mark=0 zone=0 use=3
ipv4     2 tcp      6 src=10.0.0.2 dst=198.51.100.254 sport=52006 dport=5201 src=198.51.100.254 dst=198.51.100.1 sport=5201 dport=52006 [OFFLOAD] mark=0 zone=0 use=3

after that
ipv4     2 tcp      6 3 CLOSE src=10.0.0.2 dst=198.51.100.254 sport=52008 dport=5201 src=198.51.100.254 dst=198.51.100.1 sport=5201 dport=52008 [ASSURED] mark=0 zone=0 use=2
ipv4     2 tcp      6 src=10.0.0.2 dst=198.51.100.254 sport=52006 dport=5201 src=198.51.100.254 dst=198.51.100.1 sport=5201 dport=52006 [OFFLOAD] mark=0 zone=0 use=3

after that
ipv4     2 tcp      6 src=10.0.0.2 dst=198.51.100.254 sport=52006 dport=5201 src=198.51.100.254 dst=198.51.100.1 sport=5201 dport=52006 [OFFLOAD] mark=0 zone=0 use=3

after that
ipv4     2 tcp      6 71 ESTABLISHED src=10.0.0.2 dst=198.51.100.254 sport=52006 dport=5201 src=198.51.100.254 dst=198.51.100.1 sport=5201 dport=52006 mark=0 zone=0 use=2



Fourth run, which finished without being interrupted

ipv4     2 tcp      6 src=10.0.0.2 dst=198.51.100.254 sport=52020 dport=5201 src=198.51.100.254 dst=198.51.100.1 sport=5201 dport=52020 [OFFLOAD] mark=0 zone=0 use=3
ipv4     2 tcp      6 src=10.0.0.2 dst=198.51.100.254 sport=52022 dport=5201 src=198.51.100.254 dst=198.51.100.1 sport=5201 dport=52022 [OFFLOAD] mark=0 zone=0 use=3

after 30 seconds
ipv4     2 tcp      6 10 ESTABLISHED src=10.0.0.2 dst=198.51.100.254 sport=52020 dport=5201 src=198.51.100.254 dst=198.51.100.1 sport=5201 dport=52020 mark=0 zone=0 use=2
ipv4     2 tcp      6 src=10.0.0.2 dst=198.51.100.254 sport=52022 dport=5201 src=198.51.100.254 dst=198.51.100.1 sport=5201 dport=52022 [OFFLOAD] mark=0 zone=0 use=3

after 60 seconds /proc/net/nf_conntrack is empty, and after 1 second
ipv4     2 tcp      6 src=10.0.0.2 dst=198.51.100.254 sport=52022 dport=5201 [UNREPLIED] src=198.51.100.254 dst=198.51.100.1 sport=5201 dport=52022 [OFFLOAD] mark=0 zone=0 use=3

test finished
ipv4     2 tcp      6 107 TIME_WAIT src=10.0.0.2 dst=198.51.100.254 sport=52020 dport=5201 src=198.51.100.254 dst=198.51.100.1 sport=5201 dport=52020 mark=0 zone=0 use=2
Comment 16 Pablo Neira Ayuso 2019-08-24 17:44:29 UTC
I cannot reproduce this here on 5.3-rc; I have been repeating similar tests. Would you mind checking all flowtable infrastructure patches between 4.19 and 5.3?

You could do this via:

git log --oneline v4.19..v5.3-rc3 net/netfilter/nft_flow_offload.c

Also check these files (a combined query covering all of them is sketched after this list):

net/netfilter/nf_flow_table_core.c
net/netfilter/nf_flow_table_ip.c
net/netfilter/nf_flow_table_inet.c
net/ipv4/netfilter/nf_flow_table_ipv4.c
net/ipv6/netfilter/nf_flow_table_ipv6.c
include/net/netfilter/nf_flow_table.h
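
For example, the whole set of files can be covered in one query (same idea as the command above):

git log --oneline v4.19..v5.3-rc3 -- \
    net/netfilter/nft_flow_offload.c \
    net/netfilter/nf_flow_table_core.c \
    net/netfilter/nf_flow_table_ip.c \
    net/netfilter/nf_flow_table_inet.c \
    net/ipv4/netfilter/nf_flow_table_ipv4.c \
    net/ipv6/netfilter/nf_flow_table_ipv6.c \
    include/net/netfilter/nf_flow_table.h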

Make sure you get a fresh clone of:

https://git.kernel.org/pub/scm/linux/kernel/git/pablo/nf.git

to check for any missing patch.

If there are relevant patches already upstream that are not in 4.19 and that fix the problem you report, please send a list of commit IDs and I'll ask the -stable maintainers to include them in the 4.19 -stable release.

I agree that the 30-second timer that evicts a flow from the flowtable when it has seen no traffic is too aggressive, but before making a patch to raise this default timeout (and to expose a knob so users can configure it), it would be good to make sure no relevant patch is missing.

Thanks!
Comment 17 Pablo Neira Ayuso 2019-08-24 18:34:44 UTC
Created attachment 284587 [details]
skip fixup on teardown state

After 150 seconds (30 seconds to evict the iperf control flow from the flowtable + 120 seconds in ESTABLISHED state), if I press ctrl-c I can see this:

ipv4     2 tcp      6 104 TIME_WAIT src=192.168.10.2 dst=10.0.1.2 sport=33994 dport=5201 src=10.0.1.2 dst=10.0.1.1 sport=5201 dport=33994 mark=0 secctx=null zone=0 use=2
ipv4     2 tcp      6 104 ESTABLISHED src=192.168.10.2 dst=10.0.1.2 sport=33992 dport=5201 src=10.0.1.2 dst=10.0.1.1 sport=5201 dport=33992 mark=0 secctx=null zone=0 use=2

The flow with tcp sport=33992 is the iperf control plane flow.

It seems that iperf sends a data packet on this flow after ctrl-c:

20:13:22.268161 IP 192.168.10.2.33992 > 10.0.1.2.5201: Flags [P.], seq 3952723680:3952723681, ack 3326915136, win 502, options [nop,nop,TS val 2165195608 ecr 2773810022], length 1

This pushes the flow back into the flowtable, however...

20:13:22.268434 IP 10.0.1.2.5201 > 192.168.10.2.33992: Flags [F.], seq 1, ack 1, win 509, options [nop,nop,TS val 2773964852 ecr 2165195608], length 0
20:13:22.268472 IP 192.168.10.2.33992 > 10.0.1.2.5201: Flags [F.], seq 1, ack 2, win 502, options [nop,nop,TS val 2165195608 ecr 2773964852], length 0
20:13:22.268492 IP 10.0.1.2.5201 > 192.168.10.2.33992: Flags [.], ack 2, win 509, options [nop,nop,TS val 2773964852 ecr 2165195608], length 0

This tcp fin packet schedules the flowtable entry for removal, but the state fixup routine takes the conntrack entry from FIN_WAIT back to ESTABLISHED.
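
The trace above can be reproduced with a capture along these lines (interface name and port filter are assumptions of this sketch):

# watch the iperf flows crossing the forwarding interface
tcpdump -ni eth1 tcp port 5201
# or only the fin/rst packets that trigger the teardown
tcpdump -ni eth1 'tcp port 5201 and (tcp[tcpflags] & (tcp-fin|tcp-rst) != 0)'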
Comment 18 Pablo Neira Ayuso 2019-08-24 19:45:13 UTC
Scratch that, patch is not correct.
Comment 19 Pablo Neira Ayuso 2019-09-02 17:39:35 UTC
This patch fixes incorrect timeout initialization of the flowtable entry:

https://patchwork.ozlabs.org/patch/1156702/
Comment 20 nucleo 2019-09-02 17:47:49 UTC
Are all of the patches from comments 9, 14 and 19 needed now?

I can't test the 4.19 kernel because I can't apply the patches from comments 9 and 14 to the latest 4.19.x.
Comment 21 Pablo Neira Ayuso 2019-09-03 22:11:51 UTC
Could you try with Fedora 5.2.9 kernel?
