Bug 16568 - Regression and incompatibility with Windows SP2-SP3-Vista TCP stack causing lost connections
Status: RESOLVED DOCUMENTED
Alias: None
Product: Networking
Classification: Unclassified
Component: IPV4
Hardware: All Linux
Importance: P1 high
Assignee: Stephen Hemminger
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2010-08-12 08:19 UTC by Yuriy Yevtukhov
Modified: 2010-08-12 17:41 UTC

See Also:
Kernel Version: 2.6.30+
Subsystem:
Regression: Yes
Bisected commit-id:


Attachments

Description Yuriy Yevtukhov 2010-08-12 08:19:47 UTC
Hi.
I administer about 50 heavily loaded web servers (free CMS hosting) running Linux. At the beginning of the year most of them ran kernel versions between 2.6.24 and 2.6.29, and I tuned the TCP sysctls to improve protection against DDoS and other flooding (our servers are attacked rather often). tcp_tw_recycle=1 was among those settings, as many guides on the net recommend it and the Linux documentation does not say anything bad about it. Because we had periodic kernel panics related to bugs in Ethernet card drivers and ext3, and after finding that 2.6.31+ kernels work faster with ext3, I upgraded almost all servers to 2.6.32.8, which had already been tested on several servers for several months.
Some time after that we began to receive complaints from our users (site owners) that they and their visitors found their sites very unstable. It looked as if HTTP connections were simply lost at random. Not everybody had the problem, just a small percentage. We looked for problems with internet providers or buggy firewalls, but finally concluded that the problem was on our servers. Analyzing lost connections with tcpdump, I found that the client host sends packets BUT LINUX JUST IGNORES THEM: the SYN packet was repeated 3 times at 3-second intervals, with NO SYN-ACK reply.
Users with Windows SP3 were affected most (almost all users with SP3 had the problem). I booted one server with the old 2.6.24 kernel and found that the problem disappeared. Then I began looking for the exact kernel version that introduced the incompatibility. Using binary search I compiled several kernels between 2.6.24 and 2.6.32.8 and found that 2.6.29.6 DOES NOT have the problem, but 2.6.30 DOES. Studying the commits made to tcp_input.c and tcp_ipv4.c (which I supposed were involved) between those releases, I found this one:
  author	Eric Dumazet <dada1@cosmosbay.com>	
	Wed, 11 Mar 2009 16:23:57 +0000 (09:23 -0700)
  committer	David S. Miller <davem@davemloft.net>	
	Wed, 11 Mar 2009 16:23:57 +0000 (09:23 -0700)
  commit	fc1ad92dfc4e363a055053746552cdb445ba5c57

  tcp: allow timestamps even if SYN packet has tsval=0

  Some systems send SYN packets with apparently wrong RFC1323 timestamp
  option values [timestamp tsval=0 tsecr=0].
  It might be for security reasons (http://www.secuobs.com/plugs/25220.shtml )
  Linux TCP stack ignores this option and sends back a SYN+ACK packet
  without timestamp option, thus many TCP flows cannot use timestamps
  and lose some benefit of RFC1323.
  Other operating systems seem to not care about initial tsval value, and let
  tcp flows to negotiate timestamp option.

  net/ipv4/tcp_ipv4.c 		diff :

--- a/net/ipv4/tcp_ipv4.c
+++ b/net/ipv4/tcp_ipv4.c
@@ -1226,15 +1226,6 @@ int tcp_v4_conn_request(struct sock *sk, struct sk_buff *skb)
        if (want_cookie && !tmp_opt.saw_tstamp)
                tcp_clear_options(&tmp_opt);
 
-       if (tmp_opt.saw_tstamp && !tmp_opt.rcv_tsval) {
-               /* Some OSes (unknown ones, but I see them on web server, which
-                * contains information interesting only for windows'
-                * users) do not send their stamp in SYN. It is easy case.
-                * We simply do not advertise TS support.
-                */
-               tmp_opt.saw_tstamp = 0;
-               tmp_opt.tstamp_ok  = 0;
-       }
        tmp_opt.tstamp_ok = tmp_opt.saw_tstamp;
 
        tcp_openreq_init(req, &tmp_opt, skb);

Removing that was not a good idea. Having analyzed lost connections from SP3 clients, I know that they have timestamps turned on and send a timestamp value of 0. Here is one such SYN:
13:39:10.430498 IP 192.168.99.130.3493 > 192.168.99.100.80: S 2507911465:2507911465(0) win 65535 <mss 1460,nop,wscale 3,nop,nop,timestamp 0 0,nop,nop,sackOK>
        0x0000:  4500 0040 2bda 4000 8006 86a6 c0a8 6382  E..@+.@.......c.
        0x0010:  c0a8 6364 0da5 0050 957b b129 0000 0000  ..cd...P.{.)....
        0x0020:  b002 ffff 992c 0000 0204 05b4 0103 0303  .....,..........
        0x0030:  0101 080a 0000 0000 0000 0000 0101 0402  ................
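
To double-check what the capture says, the option bytes can be decoded by hand. The following is a minimal sketch in C, not kernel code: find_tcp_timestamp and syn_options are hypothetical names, and the byte array is copied from offsets 0x28 to 0x3f of the dump above (after the 20-byte IP header and the 20-byte fixed TCP header).

```c
#include <stddef.h>
#include <stdint.h>

/* TCP option kinds needed here: end of list, no-op, timestamp. */
enum { OPT_EOL = 0, OPT_NOP = 1, OPT_TIMESTAMP = 8 };

/* Read a 32-bit big-endian (network byte order) value. */
static uint32_t read_be32(const uint8_t *p)
{
    return ((uint32_t)p[0] << 24) | ((uint32_t)p[1] << 16) |
           ((uint32_t)p[2] << 8)  |  (uint32_t)p[3];
}

/*
 * Hypothetical helper: walk a TCP option area looking for the
 * timestamp option. Returns 1 and fills tsval/tsecr if it is found,
 * 0 if it is absent or the option area is malformed.
 */
static int find_tcp_timestamp(const uint8_t *opts, size_t len,
                              uint32_t *tsval, uint32_t *tsecr)
{
    size_t i = 0;

    while (i < len) {
        uint8_t kind = opts[i];
        uint8_t optlen;

        if (kind == OPT_EOL)
            break;
        if (kind == OPT_NOP) {          /* single-byte padding */
            i++;
            continue;
        }
        if (i + 1 >= len)
            break;                      /* length byte is missing */
        optlen = opts[i + 1];
        if (optlen < 2 || i + optlen > len)
            break;                      /* malformed option */
        if (kind == OPT_TIMESTAMP && optlen == 10) {
            *tsval = read_be32(opts + i + 2);
            *tsecr = read_be32(opts + i + 6);
            return 1;
        }
        i += optlen;
    }
    return 0;
}

/* The 24 option bytes of the SYN shown above. */
static const uint8_t syn_options[] = {
    0x02, 0x04, 0x05, 0xb4,             /* mss 1460 */
    0x01,                               /* nop */
    0x03, 0x03, 0x03,                   /* wscale 3 */
    0x01, 0x01,                         /* nop, nop */
    0x08, 0x0a, 0x00, 0x00, 0x00, 0x00, /* timestamp: tsval */
    0x00, 0x00, 0x00, 0x00,             /*            tsecr */
    0x01, 0x01,                         /* nop, nop */
    0x04, 0x02                          /* sackOK */
};
```

Calling find_tcp_timestamp(syn_options, sizeof syn_options, &tsval, &tsecr) on these bytes finds the option with both fields zero, matching the "timestamp 0 0" that tcpdump printed.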

With the above code fragment removed we get tmp_opt.tstamp_ok=1, as I understand it. But a little later the source code of tcp_ipv4.c reads:
        /* VJ's idea. We save last timestamp seen
         * from the destination in peer table, when entering
         * state TIME-WAIT, and check against it before
         * accepting new connection request.
         *
         * If "isn" is not zero, this request hit alive
         * timewait bucket, so that all the necessary checks
         * are made in the function processing timewait state.
         */
        if (tmp_opt.saw_tstamp &&
            tcp_death_row.sysctl_tw_recycle &&
            (dst = inet_csk_route_req(sk, req)) != NULL &&
            (peer = rt_get_peer((struct rtable *)dst)) != NULL &&
            peer->v4daddr == saddr) {
            if ((u32)get_seconds() - peer->tcp_ts_stamp < TCP_PAWS_MSL &&
                (s32)(peer->tcp_ts - req->ts_recent) >
                            TCP_PAWS_WINDOW) {
                NET_INC_STATS_BH(sock_net(sk), LINUX_MIB_PAWSPASSIVEREJECTED);
                goto drop_and_release;
            }
        }
which (when tmp_opt.saw_tstamp and tcp_death_row.sysctl_tw_recycle are both true) leads, seemingly at random, to the packet being ignored while there are still unclosed time-wait sockets from the peer.
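
The effect of this check on a SYN with tsval=0 can be illustrated outside the kernel. This is only a sketch under stated assumptions: struct peer_state and syn_rejected_by_paws are hypothetical stand-ins for the kernel's peer entry and for the condition quoted above, and the two constants are assumed to have the values 60 and 1 that include/net/tcp.h gives them in kernels of that period.

```c
#include <stdint.h>

/* Assumed values, as defined in include/net/tcp.h at the time. */
#define TCP_PAWS_MSL    60  /* seconds a saved per-host timestamp is trusted */
#define TCP_PAWS_WINDOW 1   /* tolerated backwards step of the timestamp */

/* Hypothetical stand-in for the per-peer fields used by the check. */
struct peer_state {
    uint32_t tcp_ts;        /* last TSval saved for this source address */
    uint32_t tcp_ts_stamp;  /* time it was saved, in seconds */
};

/*
 * Mirrors the quoted condition: the SYN is dropped when the saved
 * timestamp is still fresh and the new TSval is older than it by
 * more than TCP_PAWS_WINDOW. The subtraction is done in unsigned
 * arithmetic and then interpreted as signed, as in the kernel code.
 */
static int syn_rejected_by_paws(const struct peer_state *peer,
                                uint32_t now, uint32_t syn_tsval)
{
    return (now - peer->tcp_ts_stamp) < TCP_PAWS_MSL &&
           (int32_t)(peer->tcp_ts - syn_tsval) > TCP_PAWS_WINDOW;
}
```

For example, if the peer entry holds tcp_ts=500000 recorded 10 seconds ago, a new SYN from the same address with tsval=0 satisfies both conditions and is dropped, while the same SYN arriving more than 60 seconds after the record was saved is accepted.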

As for me, I understand that I should not enable tw_recycle, BUT THE DOCUMENTATION DOES NOT STATE that enabling it causes random and rather frequent loss of connections from some popular types of clients (like Windows).
Concerning the commit above, it should have included something to prevent the above condition from becoming true if tmp_opt.rcv_tsval==0. I'm not sure, but something like
        if (tmp_opt.saw_tstamp &&
+           tmp_opt.rcv_tsval &&
            tcp_death_row.sysctl_tw_recycle &&
            (dst = inet_csk_route_req(sk, req)) != NULL &&
            (peer = rt_get_peer((struct rtable *)dst)) != NULL &&

just to avoid a regression and a strong TCP stack incompatibility when tw_recycle is enabled.
Also, the documentation does not state that tw_recycle should not be used at all on internet servers: web clients behind NAT will hit the same condition, because successive connections from different clients (which share a common IP) can have incompatible timestamps.
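
The NAT case can be sketched in the same spirit. This is an illustration under assumptions, not kernel code: struct nat_peer, remember_ts and syn_dropped are hypothetical names, the constants are assumed to be 60 and 1 as in include/net/tcp.h of that period, and the skip_zero_tsval flag models the extra tmp_opt.rcv_tsval test proposed above.

```c
#include <stdint.h>

/* Assumed values, as defined in include/net/tcp.h at the time. */
#define TCP_PAWS_MSL    60
#define TCP_PAWS_WINDOW 1

/* Hypothetical simplification: one saved timestamp per source IP. */
struct nat_peer {
    uint32_t last_ts;   /* TSval saved when a connection entered TIME-WAIT */
    uint32_t stamp;     /* time it was saved, in seconds */
    int      valid;     /* 0 until something has been saved */
};

/* Save the timestamp of a connection that has just finished. */
static void remember_ts(struct nat_peer *p, uint32_t now, uint32_t tsval)
{
    p->last_ts = tsval;
    p->stamp = now;
    p->valid = 1;
}

/*
 * Returns 1 if a new SYN from the same source IP would be dropped.
 * With skip_zero_tsval set, SYNs carrying tsval=0 bypass the check,
 * which is what the proposed one-line change would do.
 */
static int syn_dropped(const struct nat_peer *p, uint32_t now,
                       uint32_t tsval, int skip_zero_tsval)
{
    if (!p->valid)
        return 0;
    if (skip_zero_tsval && tsval == 0)
        return 0;
    return (now - p->stamp) < TCP_PAWS_MSL &&
           (int32_t)(p->last_ts - tsval) > TCP_PAWS_WINDOW;
}
```

With one host behind the NAT having left last_ts=9000000 and a second, recently booted host sending tsval=40000 five seconds later, the SYN is dropped with or without the proposed guard; the guard only changes the outcome for SYNs whose tsval is 0.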

Sorry if I distracted somebody busy from their work with my unimportant problem.
Comment 1 Yuriy Yevtukhov 2010-08-12 08:22:15 UTC
P.S. I just disabled tcp_tw_recycle and the problem, of course, disappeared.
Comment 2 Andrew Morton 2010-08-12 14:39:57 UTC
(switched to email.  Please respond via emailed reply-to-all, not via the
bugzilla web interface).


On Thu, 12 Aug 2010 08:20:01 GMT bugzilla-daemon@bugzilla.kernel.org wrote:

> https://bugzilla.kernel.org/show_bug.cgi?id=16568
> 
>            Summary: Regression and incompatibility with Windows
>                     SP2-SP3-Vista TCP stack causing lost connections
>            Product: Networking
>            Version: 2.5
>     Kernel Version: 2.6.30+
>           Platform: All
>         OS/Version: Linux
>               Tree: Mainline
>             Status: NEW
>           Severity: high
>           Priority: P1
>          Component: IPV4
>         AssignedTo: shemminger@linux-foundation.org
>         ReportedBy: yuriy@ucoz.com
>         Regression: No
Comment 3 Eric Dumazet 2010-08-12 15:40:35 UTC

Hi Yuriy

Interesting analysis but wrong conclusions :)

Clients using RFC1323 (timestamps) and behind a NAT device will barf on
your setup, no matter whether they use Windows SP3 or another operating
system.

Only because RFC1323 is more often enabled at client level (a registry
change on Windows XP, Vista or Seven, I don't know), you start noticing
your server drops more connections than before.

Point is:

Don't mess with tcp_tw_recycle=1, tcp_timestamps=1 on public machines.

It's a non-working setup for clients behind NAT devices (since their
TSVAL will probably lead to incorrect behavior on the server, with the
infamous LINUX_MIB_PAWSPASSIVEREJECTED status seen in netstat -s, as you
discovered).

And your patch solves nothing for this very common case, unless the NAT
device is able to overwrite TSVAL values with its own values (very
unlikely !!!)

A working setup is (and is the default):

tcp_tw_recycle=0
tcp_timestamps=1


Documentation might be improved, but I feel the whole "tcp_tw_recycle"
affair is really too tricky to ever be documented (not to mention using
it ;) )
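
The counter behind LINUX_MIB_PAWSPASSIVEREJECTED can also be read programmatically rather than through netstat -s. A minimal sketch, assuming the /proc/net/netstat layout of a line of field names followed by a line of values, and assuming the counter is exported there under the name PAWSPassive (worth verifying on the kernel in question); lookup_counter is a hypothetical helper, not an existing API.

```c
#define _POSIX_C_SOURCE 200809L  /* for strtok_r */
#include <stdlib.h>
#include <string.h>

/*
 * Hypothetical helper: given a line of space-separated field names and
 * the matching line of values, find the value stored under `wanted`.
 * Returns 1 and stores the value in *out on success, 0 otherwise.
 */
static int lookup_counter(const char *names, const char *values,
                          const char *wanted, unsigned long long *out)
{
    char nbuf[4096], vbuf[4096];
    char *nsave, *vsave, *n, *v;

    /* strtok_r modifies its input, so work on bounded local copies. */
    if (strlen(names) >= sizeof nbuf || strlen(values) >= sizeof vbuf)
        return 0;
    strcpy(nbuf, names);
    strcpy(vbuf, values);

    /* Advance through both lines in lockstep, one token at a time. */
    n = strtok_r(nbuf, " \n", &nsave);
    v = strtok_r(vbuf, " \n", &vsave);
    while (n != NULL && v != NULL) {
        if (strcmp(n, wanted) == 0) {
            *out = strtoull(v, NULL, 10);
            return 1;
        }
        n = strtok_r(NULL, " \n", &nsave);
        v = strtok_r(NULL, " \n", &vsave);
    }
    return 0;
}
```

In use, the two consecutive lines beginning with "TcpExt:" would be read from /proc/net/netstat and passed in with wanted set to "PAWSPassive"; a value that keeps growing while tcp_tw_recycle=1 is set would be consistent with the drops described in this report.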
Comment 4 Eric Dumazet 2010-08-12 17:19:02 UTC
On Thursday, 12 August 2010 at 19:46 +0300, Yuriy wrote:

> Thanks for the reply.
> 
> The main idea I wanted to convey is just to document this feature appropriately,
> as the internet is full of recommendations to enable it.
> Just a few words like "do not use it on public servers" would be much better
> than now.

Sure, but we don't maintain or correct the recommendations found on
various Internet pages ;)

BTW, a Google search on "tcp_tw_recycle" gives many results on problems
with this setting, not improvements.

Also, many "recommendations" found on the Internet suggest disabling
tcp_timestamps, only because it adds 12 bytes to the TCP header.

Apparently you chose to follow the tcp_tw_recycle=1 recommendation, not
the tcp_timestamps=0 one ;)
Comment 5 Yuriy Yevtukhov 2010-08-12 17:41:41 UTC
Hi, Eric.

Thanks for the reply.

The main idea I wanted to convey is just to document this feature appropriately, as the internet is full of recommendations to enable it.
Just a few words like "do not use it on public servers" would be much better than now.
