Bug 32832

Summary: shutdown(2) does not fully shut down socket any more
Product: Networking Reporter: Kees Cook (kees)
Component: IPV4Assignee: Stephen Hemminger (stephen)
Status: CLOSED CODE_FIX    
Severity: normal CC: 773800126, Alexandra.Kossovsky, florian
Priority: P1    
Hardware: All   
OS: Linux   
Kernel Version: 2.6.38 Subsystem:
Regression: Yes Bisected commit-id:
Attachments: test case showing the behavioral change

Description Kees Cook 2011-04-06 22:42:37 UTC
In 2.6.35 and earlier, shutdown(2) will fully remove a socket. This does not appear to be true any more and is causing software to misbehave.

2.6.35:
$ ./testcase
parent: 5957
before:
tcp        0      0 0.0.0.0:12345           0.0.0.0:*               LISTEN     
after:
child: 5961
$ ./testcase
parent: 6001
before:
tcp        0      0 0.0.0.0:12345           0.0.0.0:*               LISTEN     
after:
child: 6002

2.6.38:
$ ./testcase
parent: 1138
before:
tcp        0      0 0.0.0.0:12345           0.0.0.0:*               LISTEN     
after:
child: 1142
$ ./testcase
bind: Address already in use

The listener doesn't show up in netstat any more, but as long as the child process is running, the socket is unavailable. It is as if the shutdown(2) behavior has partially reverted to close(2) behavior (but in the case of using close(2), the child's socket would remain visible in netstat).
Comment 1 Kees Cook 2011-04-06 22:43:01 UTC
Created attachment 53732 [details]
test case showing the behavioral change
Comment 2 Kees Cook 2011-04-06 22:44:05 UTC
Forwarded from https://launchpad.net/bugs/731878
Comment 3 Andrew Morton 2011-04-12 23:16:19 UTC
(switched to email.  Please respond via emailed reply-to-all, not via the
bugzilla web interface).

On Wed, 6 Apr 2011 22:42:38 GMT
bugzilla-daemon@bugzilla.kernel.org wrote:

> https://bugzilla.kernel.org/show_bug.cgi?id=32832

There is a tescase attached to this bugzilla report.

>            Summary: shutdown(2) does not fully shut down socket any more
>            Product: Networking
>            Version: 2.5
>     Kernel Version: 2.6.38
>           Platform: All
>         OS/Version: Linux
>               Tree: Mainline
>             Status: NEW
>           Severity: normal
>           Priority: P1
>          Component: IPV4
>         AssignedTo: shemminger@linux-foundation.org
>         ReportedBy: kees@outflux.net
>         Regression: Yes
> 
> 
> In 2.6.35 and earlier, shutdown(2) will fully remove a socket. This does not
> appear to be true any more and is causing software to misbehave.
> 
> 2.6.35:
> $ ./testcase
> parent: 5957
> before:
> tcp        0      0 0.0.0.0:12345           0.0.0.0:*               LISTEN    
> after:
> child: 5961
> $ ./testcase
> parent: 6001
> before:
> tcp        0      0 0.0.0.0:12345           0.0.0.0:*               LISTEN    
> after:
> child: 6002
> 
> 2.6.38:
> $ ./testcase
> parent: 1138
> before:
> tcp        0      0 0.0.0.0:12345           0.0.0.0:*               LISTEN    
> after:
> child: 1142
> $ ./testcase
> bind: Address already in use
> 
> The listener doesn't show up in netstat any more, but as long as the child
> process is running, the socket is unavailable. It is as if the shutdown(2)
> behavior has partially reverted to close(2) behavior (but in the case of
> using
> close(2), the child's socket would remain visible in netstat).
>
Comment 4 David S. Miller 2011-04-12 23:18:42 UTC
From: Andrew Morton <akpm@linux-foundation.org>
Date: Tue, 12 Apr 2011 16:15:56 -0700

> 
> (switched to email.  Please respond via emailed reply-to-all, not via the
> bugzilla web interface).

Stephen Hemminger forwarded this to the list last week, and Eric
Dumazet is actively working on a fix.
Comment 5 Andrew Morton 2011-04-12 23:42:39 UTC
On Tue, 12 Apr 2011 16:17:44 -0700 (PDT)
David Miller <davem@davemloft.net> wrote:

> From: Andrew Morton <akpm@linux-foundation.org>
> Date: Tue, 12 Apr 2011 16:15:56 -0700
> 
> > 
> > (switched to email.  Please respond via emailed reply-to-all, not via the
> > bugzilla web interface).
> 
> Stephen Hemminger forwarded this to the list last week, and Eric
> Dumazet is actively working on a fix.

OK.

Please don't forward bugzilla reports!  Instead do a reply-to-all and
add the cc's.  That way, bugzilla will capture the email discussion and
you won't get akpmspammed.

Sigh, kernel bugzilla is a PITA.  Some people do seem to like and
expect it though.
Comment 6 Eric Dumazet 2011-04-13 02:55:57 UTC
Le mardi 12 avril 2011 à 16:17 -0700, David Miller a écrit :
> From: Andrew Morton <akpm@linux-foundation.org>
> Date: Tue, 12 Apr 2011 16:15:56 -0700
> 
> > 
> > (switched to email.  Please respond via emailed reply-to-all, not via the
> > bugzilla web interface).
> 
> Stephen Hemminger forwarded this to the list last week, and Eric
> Dumazet is actively working on a fix.

I worked on it this week end to discover FreeBSD 8.1 would not allow
several CLOSE sockets bound to same port even with REUSEADDR.

So haproxy claim is a bit wrong (its trick doesnt work on FreeBSD), and
used an undocumented linux feature.

I feel this case is a call for SO_REUSEPORT, eventually.

http://www.unixguide.net/network/socketfaq/4.11.shtml

  SO_REUSEADDR allows your server to bind to an address which is in a
  TIME_WAIT state.  It does not allow more than one server to bind to
  the same address.  It was mentioned that use of this flag can create a
  security risk because another server can bind to a the same port, by
  binding to a specific address as opposed to INADDR_ANY.  The
  SO_REUSEPORT flag allows multiple processes to bind to the same
  address provided all of them use the SO_REUSEPORT option.


Since SO_REUSEPORT is not a 'stable fix', I suggest we revert the patch,
and eventually work on SO_REUSEPORT on net-next-2.6

What do you think ?
Comment 7 Eric Dumazet 2011-04-13 03:00:38 UTC
Le mercredi 13 avril 2011 à 04:55 +0200, Eric Dumazet a écrit :

> Since SO_REUSEPORT is not a 'stable fix', I suggest we revert the patch,
> and eventually work on SO_REUSEPORT on net-next-2.6
> 
> What do you think ?
> 

Sorry, I should have mentioned commit id : c191a836a908d1dd6
(tcp: disallow bind() to reuse addr/port)
Comment 8 Anonymous Emailer 2011-04-13 08:02:06 UTC
Reply-To: cyril.bonte@free.fr

Le mercredi 13 avril 2011 04:55:27, Eric Dumazet a écrit :
> I worked on it this week end to discover FreeBSD 8.1 would not allow
> several CLOSE sockets bound to same port even with REUSEADDR.

Just to complete the information, yes it does, but only after a shutdown() 
call. And this is the use case of haproxy, amavisd (quoted in the bugzilla bug 
report), and others.

> So haproxy claim is a bit wrong (its trick doesnt work on FreeBSD), and
> used an undocumented linux feature.

Both test cases (the one I provided to explain the haproxy issue and the one 
provided by Kees) are not about binding 2 sockets at the same time but binding 
a new socket after the first one has been shutdown.
Sadly this also looks undocumented on FreeBSD (only saw a reference on it in a 
code comment).

> Since SO_REUSEPORT is not a 'stable fix', I suggest we revert the patch,
> and eventually work on SO_REUSEPORT on net-next-2.6
> 
> What do you think ?

Agree.

Many thanks for the time you already spent on that.
Comment 9 Eric Dumazet 2011-04-13 08:51:54 UTC
Le mercredi 13 avril 2011 à 09:06 +0200, Cyril Bonté a écrit :
> Le mercredi 13 avril 2011 04:55:27, Eric Dumazet a écrit :
> > I worked on it this week end to discover FreeBSD 8.1 would not allow
> > several CLOSE sockets bound to same port even with REUSEADDR.
> 
> Just to complete the information, yes it does, but only after a shutdown() 
> call. And this is the use case of haproxy, amavisd (quoted in the bugzilla
> bug 
> report), and others.

Yes, but after a shutdown(), FreeBSD doesnt allow a reuse of the socket.
listen() is not available anymore. Its a bit like an unbind, or a full
close().
Comment 10 Anonymous Emailer 2011-04-13 11:57:45 UTC
Reply-To: dbaluta@ixiacom.com

Hi Eric, Cyril,

>> Since SO_REUSEPORT is not a 'stable fix', I suggest we revert the patch,
>> and eventually work on SO_REUSEPORT on net-next-2.6
>>
>> What do you think ?
>>
>
> Sorry, I should have mentioned commit id : c191a836a908d1dd6
> (tcp: disallow bind() to reuse addr/port)

Cyril's use case looks suspect. I don't think that this is a good
reason for reverting this commit.

I would love to help on adding SO_REUSEPORT option in kernel.

thanks,
Daniel.
Comment 11 David S. Miller 2011-04-13 17:44:18 UTC
From: Daniel Baluta <dbaluta@ixiacom.com>
Date: Wed, 13 Apr 2011 14:57:18 +0300

> Cyril's use case looks suspect. I don't think that this is a good
> reason for reverting this commit.

I complete disagree.

Something that worked perfectly fine, probably for years, we broke.

We simply cannot do that, especially since we do not have a reasonable
alternative at this time.

Adding SO_REUSEPORT is a long range option, and not something that
will provide a fix for users right now.

So please don't even pretend to suggest that we shouldn't fix this
with a revert unless a simple, obvious, kernel fix presents itself.
Comment 12 David S. Miller 2011-04-13 19:10:29 UTC
From: Eric Dumazet <eric.dumazet@gmail.com>
Date: Wed, 13 Apr 2011 05:00:08 +0200

> Le mercredi 13 avril 2011 à 04:55 +0200, Eric Dumazet a écrit :
> 
>> Since SO_REUSEPORT is not a 'stable fix', I suggest we revert the patch,
>> and eventually work on SO_REUSEPORT on net-next-2.6
>> 
>> What do you think ?
>> 
> 
> Sorry, I should have mentioned commit id : c191a836a908d1dd6
> (tcp: disallow bind() to reuse addr/port)

I'm commiting the revert as follows to net-2.6, and will queue
it up for -stable as well:

--------------------
Revert "tcp: disallow bind() to reuse addr/port"

This reverts commit c191a836a908d1dd6b40c503741f91b914de3348.

It causes known regressions for programs that expect to be able to use
SO_REUSEADDR to shutdown a socket, then successfully rebind another
socket to the same ID.

Programs such as haproxy and amavisd expect this to work.

This should fix kernel bugzilla 32832.

Signed-off-by: David S. Miller <davem@davemloft.net>
---
 net/ipv4/inet_connection_sock.c  |    5 ++---
 net/ipv6/inet6_connection_sock.c |    2 +-
 2 files changed, 3 insertions(+), 4 deletions(-)

diff --git a/net/ipv4/inet_connection_sock.c b/net/ipv4/inet_connection_sock.c
index 6c0b7f4..38f23e7 100644
--- a/net/ipv4/inet_connection_sock.c
+++ b/net/ipv4/inet_connection_sock.c
@@ -73,7 +73,7 @@ int inet_csk_bind_conflict(const struct sock *sk,
 		     !sk2->sk_bound_dev_if ||
 		     sk->sk_bound_dev_if == sk2->sk_bound_dev_if)) {
 			if (!reuse || !sk2->sk_reuse ||
-			    ((1 << sk2->sk_state) & (TCPF_LISTEN | TCPF_CLOSE))) {
+			    sk2->sk_state == TCP_LISTEN) {
 				const __be32 sk2_rcv_saddr = sk_rcv_saddr(sk2);
 				if (!sk2_rcv_saddr || !sk_rcv_saddr(sk) ||
 				    sk2_rcv_saddr == sk_rcv_saddr(sk))
@@ -122,8 +122,7 @@ again:
 					    (tb->num_owners < smallest_size || smallest_size == -1)) {
 						smallest_size = tb->num_owners;
 						smallest_rover = rover;
-						if (atomic_read(&hashinfo->bsockets) > (high - low) + 1 &&
-						    !inet_csk(sk)->icsk_af_ops->bind_conflict(sk, tb)) {
+						if (atomic_read(&hashinfo->bsockets) > (high - low) + 1) {
 							spin_unlock(&head->lock);
 							snum = smallest_rover;
 							goto have_snum;
diff --git a/net/ipv6/inet6_connection_sock.c b/net/ipv6/inet6_connection_sock.c
index 1660546..f2c5b0f 100644
--- a/net/ipv6/inet6_connection_sock.c
+++ b/net/ipv6/inet6_connection_sock.c
@@ -44,7 +44,7 @@ int inet6_csk_bind_conflict(const struct sock *sk,
 		     !sk2->sk_bound_dev_if ||
 		     sk->sk_bound_dev_if == sk2->sk_bound_dev_if) &&
 		    (!sk->sk_reuse || !sk2->sk_reuse ||
-		     ((1 << sk2->sk_state) & (TCPF_LISTEN | TCPF_CLOSE))) &&
+		     sk2->sk_state == TCP_LISTEN) &&
 		     ipv6_rcv_saddr_equal(sk, sk2))
 			break;
 	}
Comment 13 Anonymous Emailer 2011-04-13 19:36:15 UTC
Reply-To: shemminger@vyatta.com

On Wed, 13 Apr 2011 10:43:17 -0700 (PDT)
David Miller <davem@davemloft.net> wrote:

> From: Daniel Baluta <dbaluta@ixiacom.com>
> Date: Wed, 13 Apr 2011 14:57:18 +0300
> 
> > Cyril's use case looks suspect. I don't think that this is a good
> > reason for reverting this commit.
> 
> I complete disagree.
> 
> Something that worked perfectly fine, probably for years, we broke.
> 
> We simply cannot do that, especially since we do not have a reasonable
> alternative at this time.
> 
> Adding SO_REUSEPORT is a long range option, and not something that
> will provide a fix for users right now.
> 
> So please don't even pretend to suggest that we shouldn't fix this
> with a revert unless a simple, obvious, kernel fix presents itself.

Just to echo what Dave said.
Even though the semantics of this is not documented in some standard,
applications have been built on Linux expecting a certain behavior.
If you want to change what happens in this case, you have to have a
really good reason (like crash, security hole, or standards violation).
Comment 14 Eric Dumazet 2011-04-14 02:18:29 UTC
Le mercredi 13 avril 2011 à 12:09 -0700, David Miller a écrit :
> From: Eric Dumazet <eric.dumazet@gmail.com>
> Date: Wed, 13 Apr 2011 05:00:08 +0200
> 
> > Le mercredi 13 avril 2011 à 04:55 +0200, Eric Dumazet a écrit :
> > 
> >> Since SO_REUSEPORT is not a 'stable fix', I suggest we revert the patch,
> >> and eventually work on SO_REUSEPORT on net-next-2.6
> >> 
> >> What do you think ?
> >> 
> > 
> > Sorry, I should have mentioned commit id : c191a836a908d1dd6
> > (tcp: disallow bind() to reuse addr/port)
> 
> I'm commiting the revert as follows to net-2.6, and will queue
> it up for -stable as well:
> 
> --------------------
> Revert "tcp: disallow bind() to reuse addr/port"
> 
> This reverts commit c191a836a908d1dd6b40c503741f91b914de3348.
> 
> It causes known regressions for programs that expect to be able to use
> SO_REUSEADDR to shutdown a socket, then successfully rebind another
> socket to the same ID.
> 
> Programs such as haproxy and amavisd expect this to work.
> 
> This should fix kernel bugzilla 32832.
> 
> Signed-off-by: David S. Miller <davem@davemloft.net>
> ---

Thanks David, I was planning to submit this revert this morning, you
beat me ;)
Comment 15 Horms 2011-04-14 06:51:30 UTC
On Wed, Apr 13, 2011 at 04:55:27AM +0200, Eric Dumazet wrote:
> Le mardi 12 avril 2011 à 16:17 -0700, David Miller a écrit :
> > From: Andrew Morton <akpm@linux-foundation.org>
> > Date: Tue, 12 Apr 2011 16:15:56 -0700
> > 
> > > 
> > > (switched to email.  Please respond via emailed reply-to-all, not via the
> > > bugzilla web interface).
> > 
> > Stephen Hemminger forwarded this to the list last week, and Eric
> > Dumazet is actively working on a fix.
> 
> I worked on it this week end to discover FreeBSD 8.1 would not allow
> several CLOSE sockets bound to same port even with REUSEADDR.
> 
> So haproxy claim is a bit wrong (its trick doesnt work on FreeBSD), and
> used an undocumented linux feature.
> 
> I feel this case is a call for SO_REUSEPORT, eventually.
> 
> http://www.unixguide.net/network/socketfaq/4.11.shtml
> 
>   SO_REUSEADDR allows your server to bind to an address which is in a
>   TIME_WAIT state.  It does not allow more than one server to bind to
>   the same address.  It was mentioned that use of this flag can create a
>   security risk because another server can bind to a the same port, by
>   binding to a specific address as opposed to INADDR_ANY.  The
>   SO_REUSEPORT flag allows multiple processes to bind to the same
>   address provided all of them use the SO_REUSEPORT option.
> 
> 
> Since SO_REUSEPORT is not a 'stable fix', I suggest we revert the patch,
> and eventually work on SO_REUSEPORT on net-next-2.6
> 
> What do you think ?

Not entirely related, but FWIW I think that SO_REUSEPORT would
be rather useful for haproxy.

I've been working on allowing haproxy to be reconfigured without dropping
or reusing connections. I have done this by re-using open sockets. But it
would have been rather a lot easier to achieve using SO_REUSEPORT -
assuming that I understand SO_REUSEPORT correctly.
Comment 16 Alexandra Kossovsky 2011-04-19 10:01:21 UTC
2.6.38 behaviour is not only different from previous user-space ABI, but also inconsistent.
If port is not specified in bind(), i.e. port is selected by kernel, then it is possible to bind another socket to the port reported by getsockname() after listen-and-shutdown of the first socket (both sockets have SO_REUSEADDR set).
Comment 17 Florian Mickler 2011-04-27 06:17:08 UTC
A patch referencing this bug report has been merged in v2.6.39-rc5:

commit 3e8c806a08c7beecd972e7ce15c570b9aba64baa
Author: David S. Miller <davem@davemloft.net>
Date:   Wed Apr 13 12:01:14 2011 -0700

    Revert "tcp: disallow bind() to reuse addr/port"
Comment 18 xjqxz95 2014-05-04 01:48:19 UTC
How can I get the bug code and the fixed code?