Bug 14580

Summary: Problem with nfs provided by two redundant (active/backup) servers
Product: Networking Reporter: Krzysztof Oledzki (ole)
Component: Other Assignee: Arnaldo Carvalho de Melo (acme)
Status: CLOSED OBSOLETE    
Severity: blocking CC: alan, bfields, kolo, trondmy
Priority: P1    
Hardware: All   
OS: Linux   
Kernel Version: 2.6.31.6 Subsystem:
Regression: No Bisected commit-id:

Description Krzysztof Oledzki 2009-11-10 15:37:49 UTC
Not sure where to file this, as it seems to be related to both NFS and TCP, so I'm starting with Networking.

I'm trying to set up NFS on two redundant servers, 192.168.152.21 and 192.168.152.22 (active/backup), providing the service through one virtual IP (192.168.152.20) managed by keepalived (VRRP). Currently there is one client (192.168.152.205) using NFSv3.

There is of course no problem with mounting the share the first time:

16:10:44.775497 00:26:b9:3f:41:5d > 00:1e:c9:ab:ca:98, ethertype IPv4 (0x0800), length 74: 192.168.152.205.38676 > 192.168.152.20.2049: Flags [S], seq 1257747497, win 5840, options [mss 1460,sackOK,TS val 23752 ecr 0,nop,wscale 7], length 0
16:10:44.775981 00:1e:c9:ab:ca:98 > 00:26:b9:3f:41:5d, ethertype IPv4 (0x0800), length 74: 192.168.152.20.2049 > 192.168.152.205.38676: Flags [S.], seq 1274865300, ack 1257747498, win 5792, options [mss 1460,sackOK,TS val 474289 ecr 23752,nop,wscale 7], length 0
16:10:44.775986 00:26:b9:3f:41:5d > 00:1e:c9:ab:ca:98, ethertype IPv4 (0x0800), length 66: 192.168.152.205.38676 > 192.168.152.20.2049: Flags [.], ack 1274865301, win 46, options [nop,nop,TS val 23753 ecr 474289], length 0
16:10:44.775990 00:26:b9:3f:41:5d > 00:1e:c9:ab:ca:98, ethertype IPv4 (0x0800), length 110: 192.168.152.205.934407731 > 192.168.152.20.2049: 40 null
16:10:44.776090 00:1e:c9:ab:ca:98 > 00:26:b9:3f:41:5d, ethertype IPv4 (0x0800), length 66: 192.168.152.20.2049 > 192.168.152.205.38676: Flags [.], ack 1257747542, win 46, options [nop,nop,TS val 474290 ecr 23753], length 0
16:10:44.776130 00:1e:c9:ab:ca:98 > 00:26:b9:3f:41:5d, ethertype IPv4 (0x0800), length 94: 192.168.152.20.2049 > 192.168.152.205.934407731: reply ok 24 null
16:10:44.776135 00:26:b9:3f:41:5d > 00:1e:c9:ab:ca:98, ethertype IPv4 (0x0800), length 66: 192.168.152.205.38676 > 192.168.152.20.2049: Flags [.], ack 1274865329, win 46, options [nop,nop,TS val 23753 ecr 474290], length 0
16:10:44.776170 00:26:b9:3f:41:5d > 00:1e:c9:ab:ca:98, ethertype IPv4 (0x0800), length 110: 192.168.152.205.951184947 > 192.168.152.20.2049: 40 null
16:10:44.776274 00:1e:c9:ab:ca:98 > 00:26:b9:3f:41:5d, ethertype IPv4 (0x0800), length 94: 192.168.152.20.2049 > 192.168.152.205.951184947: reply ok 24 null

The problem starts when I manually switch the active and backup nodes:

16:11:21.160921 00:26:b9:3f:41:5d > 00:1e:c9:ab:c3:6d, ethertype IPv4 (0x0800), length 210: 192.168.152.205.1823600179 > 192.168.152.20.2049: 140 lookup fh Unknown/01000100000000000000000831362D31312D3038000000000000000000000000 "16-11-08"
16:11:21.163126 00:1e:c9:ab:c3:6d > 00:26:b9:3f:41:5d, ethertype IPv4 (0x0800), length 60: 192.168.152.20.2049 > 192.168.152.205.38676: Flags [R], seq 1274873345, win 0, length 0

The above is OK, the new node does not have this connection so it sends a RST, but please look here:

16:11:27.163148 00:26:b9:3f:41:5d > 00:1e:c9:ab:c3:6d, ethertype IPv4 (0x0800), length 70: 192.168.152.205.38676 > 192.168.152.20.2049: Flags [S], seq 1257816600, win 5840, options [mss 1460,sackOK,TS val 66140 ecr 0], length 0
16:11:27.163251 00:1e:c9:ab:c3:6d > 00:26:b9:3f:41:5d, ethertype IPv4 (0x0800), length 70: 192.168.152.20.2049 > 192.168.152.205.38676: Flags [S.], seq 1930073820, ack 1257816601, win 5792, options [mss 1460,sackOK,TS val 492407 ecr 66140], length 0
16:11:27.163266 00:26:b9:3f:41:5d > 00:1e:c9:ab:c3:6d, ethertype IPv4 (0x0800), length 66: 192.168.152.205.38676 > 192.168.152.20.2049: Flags [.], ack 1930073821, win 5840, options [nop,nop,TS val 66140 ecr 492407], length 0
16:11:27.163290 00:26:b9:3f:41:5d > 00:1e:c9:ab:c3:6d, ethertype IPv4 (0x0800), length 210: 192.168.152.205.1823600179 > 192.168.152.20.2049: 140 lookup fh Unknown/01000100000000000000000831362D31312D3038000000000000000000000000 "16-11-08"
16:11:27.163419 00:1e:c9:ab:c3:6d > 00:26:b9:3f:41:5d, ethertype IPv4 (0x0800), length 66: 192.168.152.20.2049 > 192.168.152.205.38676: Flags [.], ack 1257816745, win 6432, options [nop,nop,TS val 492407 ecr 66140], length 0

The client decides to reuse port 38676!

This of course works, but now both nodes believe they hold the same TCP connection: 192.168.152.205:38676 -> 192.168.152.20:2049.
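
To illustrate the state at this point, here is a minimal sketch (not kernel code) that models each server's connection table as a dictionary keyed by the TCP 4-tuple. The `Node` class and the names `first_active` and `second_active` are hypothetical; the two initial sequence numbers are taken from the SYN-ACKs in the captures above.

```python
from typing import NamedTuple


class FourTuple(NamedTuple):
    src_ip: str
    src_port: int
    dst_ip: str
    dst_port: int


class Node:
    """Hypothetical model of one server's TCP connection table."""

    def __init__(self, name):
        self.name = name
        self.connections = {}  # FourTuple -> server initial sequence number

    def accept(self, key, server_isn):
        self.connections[key] = server_isn


# The client reuses the same source port, so both handshakes share one key.
key = FourTuple("192.168.152.205", 38676, "192.168.152.20", 2049)

first_active = Node("first_active")    # answered the first SYN
second_active = Node("second_active")  # answered the SYN after the switch

first_active.accept(key, 1274865300)   # seq of the first SYN-ACK
second_active.accept(key, 1930073820)  # seq of the second SYN-ACK

both_have_state = key in first_active.connections and key in second_active.connections
same_sequence_space = first_active.connections[key] == second_active.connections[key]

print("both nodes hold the 4-tuple:", both_have_state)
print("sequence numbers agree:", same_sequence_space)
```

Both tables contain the same key, but each node tracks a different sequence space, so the stale entry on the first node can never agree with the client again.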

The real problem starts when I switch the nodes again:

16:12:15.108153 00:26:b9:3f:41:5d > 00:1e:c9:ab:ca:98, ethertype IPv4 (0x0800), length 210: 192.168.152.205.4189187635 > 192.168.152.20.2049: 140 lookup fh Unknown/01000100000000000000000831362D31322D3032000000000000000000000000 "16-12-02"
16:12:15.108262 00:1e:c9:ab:ca:98 > 00:26:b9:3f:41:5d, ethertype IPv4 (0x0800), length 66: 192.168.152.20.2049 > 192.168.152.205.38676: Flags [.], ack 1257755910, win 473, options [nop,nop,TS val 564619 ecr 46513], length 0
16:12:15.108272 00:26:b9:3f:41:5d > 00:1e:c9:ab:ca:98, ethertype IPv4 (0x0800), length 66: 192.168.152.205.38676 > 192.168.152.20.2049: Flags [.], ack 1930096905, win 65535, options [nop,nop,TS val 114085 ecr 526685], length 0
16:12:15.108362 00:1e:c9:ab:ca:98 > 00:26:b9:3f:41:5d, ethertype IPv4 (0x0800), length 66: 192.168.152.20.2049 > 192.168.152.205.38676: Flags [.], ack 1257755910, win 473, options [nop,nop,TS val 564619 ecr 46513], length 0
16:12:15.108367 00:26:b9:3f:41:5d > 00:1e:c9:ab:ca:98, ethertype IPv4 (0x0800), length 66: 192.168.152.205.38676 > 192.168.152.20.2049: Flags [.], ack 1930096905, win 65535, options [nop,nop,TS val 114085 ecr 526685], length 0
16:12:15.108450 00:1e:c9:ab:ca:98 > 00:26:b9:3f:41:5d, ethertype IPv4 (0x0800), length 66: 192.168.152.20.2049 > 192.168.152.205.38676: Flags [.], ack 1257755910, win 473, options [nop,nop,TS val 564620 ecr 46513], length 0
16:12:15.108456 00:26:b9:3f:41:5d > 00:1e:c9:ab:ca:98, ethertype IPv4 (0x0800), length 66: 192.168.152.205.38676 > 192.168.152.20.2049: Flags [.], ack 1930096905, win 65535, options [nop,nop,TS val 114085 ecr 526685], length 0
16:12:15.108538 00:1e:c9:ab:ca:98 > 00:26:b9:3f:41:5d, ethertype IPv4 (0x0800), length 66: 192.168.152.20.2049 > 192.168.152.205.38676: Flags [.], ack 1257755910, win 473, options [nop,nop,TS val 564620 ecr 46513], length 0
16:12:15.108545 00:26:b9:3f:41:5d > 00:1e:c9:ab:ca:98, ethertype IPv4 (0x0800), length 66: 192.168.152.205.38676 > 192.168.152.20.2049: Flags [.], ack 1930096905, win 65535, options [nop,nop,TS val 114085 ecr 526685], length 0
16:12:15.109026 00:1e:c9:ab:ca:98 > 00:26:b9:3f:41:5d, ethertype IPv4 (0x0800), length 66: 192.168.152.20.2049 > 192.168.152.205.38676: Flags [.], ack 1257755910, win 473, options [nop,nop,TS val 564620 ecr 46513], length 0
16:12:15.109030 00:26:b9:3f:41:5d > 00:1e:c9:ab:ca:98, ethertype IPv4 (0x0800), length 66: 192.168.152.205.38676 > 192.168.152.20.2049: Flags [.], ack 1930096905, win 65535, options [nop,nop,TS val 114086 ecr 526685], length 0
16:12:15.109107 00:1e:c9:ab:ca:98 > 00:26:b9:3f:41:5d, ethertype IPv4 (0x0800), length 66: 192.168.152.20.2049 > 192.168.152.205.38676: Flags [.], ack 1257755910, win 473, options [nop,nop,TS val 564620 ecr 46513], length 0
16:12:15.109112 00:26:b9:3f:41:5d > 00:1e:c9:ab:ca:98, ethertype IPv4 (0x0800), length 66: 192.168.152.205.38676 > 192.168.152.20.2049: Flags [.], ack 1930096905, win 65535, options [nop,nop,TS val 114086 ecr 526685], length 0
(...)

The old node does not issue a RST, as it still thinks that this is the old, valid connection. Instead, both hosts start exchanging an endless ~35 Mbps flood, repeating the same ACKs.
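
The loop can be reproduced with a toy model, assuming the RFC 793 rule that a synchronized connection answers an unacceptable segment with an empty ACK carrying its own current sequence number. This is only a sketch: the `Endpoint` class and `count_segments` helper are hypothetical, only the sequence number is checked (not the ACK field), the client's `snd_nxt` is an invented value, and the old server's `snd_nxt` is inferred from the RST in the first switch capture. The ACK numbers and windows come from the capture above.

```python
from dataclasses import dataclass

SEQ_SPACE = 2 ** 32  # TCP sequence numbers wrap at 32 bits


@dataclass
class Endpoint:
    name: str
    snd_nxt: int  # next sequence number this side will send
    rcv_nxt: int  # next sequence number this side expects
    rcv_wnd: int  # receive window in bytes

    def acceptable(self, seg_seq):
        # Simplified RFC 793 test for a zero-length segment.
        return (seg_seq - self.rcv_nxt) % SEQ_SPACE < self.rcv_wnd

    def reply_seq(self, seg_seq):
        # An unacceptable segment elicits an empty ACK carrying our snd_nxt;
        # an acceptable empty ACK needs no answer.
        return None if self.acceptable(seg_seq) else self.snd_nxt


def count_segments(first_receiver, other, first_seq, limit):
    """Deliver segments back and forth until one is accepted or limit is hit."""
    receiver, peer = first_receiver, other
    seq, sent = first_seq, 0
    while seq is not None and sent < limit:
        sent += 1
        seq = receiver.reply_seq(seq)
        receiver, peer = peer, receiver
    return sent


# Client state belongs to the second connection; snd_nxt is a made-up value.
client = Endpoint("client", snd_nxt=1257840000, rcv_nxt=1930096905, rcv_wnd=65535)
# Old server state belongs to the first connection (win 473, wscale 7).
old_server = Endpoint("old server", snd_nxt=1274873345, rcv_nxt=1257755910,
                      rcv_wnd=473 * 128)

stuck = count_segments(old_server, client, client.snd_nxt, limit=1000)

# For comparison: two endpoints that agree on each other's sequence numbers.
a = Endpoint("a", snd_nxt=100, rcv_nxt=500, rcv_wnd=1000)
b = Endpoint("b", snd_nxt=500, rcv_nxt=100, rcv_wnd=1000)
healthy = count_segments(b, a, a.snd_nxt, limit=1000)

print("out-of-sync segments exchanged:", stuck)
print("in-sync segments exchanged:", healthy)
```

In this model neither side ever sees an acceptable segment, so the exchange only stops at the artificial `limit`, which matches the repeated ACK pairs in the capture.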

So I think there are two problems here:
 - why does NFS decide to reuse the same port?
 - isn't the TCP stack supposed to handle the situation more gracefully by dropping the connection?
Comment 1 Trond Myklebust 2009-11-10 20:21:16 UTC
NFS MUST reuse the same port because on most servers, the replay cache is keyed
to the port number. In other words, when we replay an RPC call, the server will
only recognise it as a replay if it originates from the same port.
See http://www.connectathon.org/talks96/werme1.html
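
As a rough illustration of why the port matters, here is a minimal sketch of a replay cache keyed by client address, client port and RPC XID. The `ReplayCache` class, the `remove` operation and the XID value are hypothetical and do not reflect any particular server's implementation.

```python
class ReplayCache:
    """Hypothetical duplicate request cache keyed by (ip, port, xid)."""

    def __init__(self):
        self._replies = {}

    def handle(self, client_ip, client_port, xid, execute):
        key = (client_ip, client_port, xid)
        if key in self._replies:
            # Recognised as a replay: return the saved reply, do not re-run.
            return self._replies[key], True
        reply = execute()
        self._replies[key] = reply
        return reply, False


# A non-idempotent operation: removing a file twice gives a different result.
files = {"16-11-08"}


def remove(name):
    if name in files:
        files.remove(name)
        return "OK"
    return "ENOENT"


cache = ReplayCache()
xid = 0x1234  # made-up transaction id

first = cache.handle("192.168.152.205", 38676, xid, lambda: remove("16-11-08"))
same_port = cache.handle("192.168.152.205", 38676, xid, lambda: remove("16-11-08"))
other_port = cache.handle("192.168.152.205", 40000, xid, lambda: remove("16-11-08"))

print("first call:", first)
print("replay from same port:", same_port)
print("replay from other port:", other_port)
```

The replay from the same port gets the cached "OK", while the same XID from a different port is executed again and fails, which is the situation port reuse is meant to avoid.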
Comment 2 Krzysztof Oledzki 2009-11-15 15:54:10 UTC
(In reply to comment #1)
> NFS MUST reuse the same port because on most servers, the replay cache is keyed
> to the port number. In other words, when we replay an RPC call, the server will
> only recognise it as a replay if it originates from the same port.
> See http://www.connectathon.org/talks96/werme1.html

OK, I see. Thank you for pointing it out.

So only the second part of the bug report is valid: TCP should handle such a situation, shouldn't it?
Comment 3 bfields 2009-11-16 22:04:14 UTC
I'm not sure what the TCP stack itself could do.

As long as you're switching active and backup nodes, can't you take down the (now unused) interface on the old active node?  Would that be sufficient to destroy any stale TCP state?
Comment 4 Krzysztof Oledzki 2009-11-16 23:51:05 UTC
(In reply to comment #3)
> I'm not sure what the TCP stack itself could do.

IMO it should detect that the other side is out of sync and reset such a connection, but instead it enters a loop. I know that SRC IP/DST IP/SRC PORT/DST PORT match, but the TCP sequence numbers do not.

> As long as you're switching active and backup nodes, can't you take down the
> (now unused) interface on the old active node?  Would that be sufficient to
> destroy any stale TCP state?

I'm removing the IP address from the node, but that is not enough to destroy the TCP state. Maybe I'm missing some important sysctls? Taking down the interface is not an option, of course.
Comment 5 Vaclav Bilek 2009-11-27 06:29:54 UTC
We are not using NFS, only the Varnish HTTP reverse proxy: http://varnish.projects.linpro.no/