Bug 18952 - The amount of SYN retries is not equal to /proc/sys/net/ipv4/tcp_syn_retries
Summary: The amount of SYN retries is not equal to /proc/sys/net/ipv4/tcp_syn_retries
Status: RESOLVED OBSOLETE
Alias: None
Product: Networking
Classification: Unclassified
Component: IPV4 (show other bugs)
Hardware: All Linux
: P1 normal
Assignee: Stephen Hemminger
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2010-09-22 08:50 UTC by Yuri Chislov
Modified: 2012-08-13 16:40 UTC (History)
2 users (show)

See Also:
Kernel Version: 2.6.32.12, 2.6.32.15, 2.6.35.4
Subsystem:
Regression: Yes
Bisected commit-id:


Attachments

Description Yuri Chislov 2010-09-22 08:50:11 UTC
When setting a value in  /proc/sys/net/ipv4/tcp_syn_retries, the actual number of syn retries is the number set in  /proc/sys/net/ipv4/tcp_syn_retries minus 2.

For example:
When setting /proc/sys/net/ipv4/tcp_syn_retries to 5 the actual number of SYN retries is 3.
When setting /proc/sys/net/ipv4/tcp_syn_retries to 7 the actual number of SYN retries is 5.
However, when setting setting /proc/sys/net/ipv4/tcp_syn_retries to 2 the actual number of SYN retries is 2.

Note: In the kernel 2.6.31.9 actual number of SYN = tcp_syn_retries + 1


Steps to reproduce:
sudo iptables -I INPUT 1 -i lo -p tcp --dport 88 -j DROP
sudo tcpdump -n -i lo -v tcp port 88

from another console run:
time wget -t 1 -O - --connect-timeout=300 http://0:88

tcpdump output:
sudo tcpdump -n -i lo -v tcp port 88
11:29:39.820058 IP (tos 0x0, ttl 64, id 14664, offset 0, flags [DF], proto TCP (6), length 60)
    127.0.0.1.43730 > 127.0.0.1.kerberos: Flags [S], cksum 0xfe30 (incorrect -> 0xecf4), seq 1012617667, win 32792, options [mss 16396,sackOK,TS val 12871819 ecr 0,nop,wscale 7], length 0
11:29:42.824091 IP (tos 0x0, ttl 64, id 14665, offset 0, flags [DF], proto TCP (6), length 60)
    127.0.0.1.43730 > 127.0.0.1.kerberos: Flags [S], cksum 0xfe30 (incorrect -> 0xe137), seq 1012617667, win 32792, options [mss 16396,sackOK,TS val 12874824 ecr 0,nop,wscale 7], length 0
11:29:48.832153 IP (tos 0x0, ttl 64, id 14666, offset 0, flags [DF], proto TCP (6), length 60)
    127.0.0.1.43730 > 127.0.0.1.kerberos: Flags [S], cksum 0xfe30 (incorrect -> 0xc9bf), seq 1012617667, win 32792, options [mss 16396,sackOK,TS val 12880832 ecr 0,nop,wscale 7], length 0

wget output:
time wget -t 1 -O - --connect-timeout=300 http://0:88
Resolving 0... 0.0.0.0
Connecting to 0|0.0.0.0|:88... failed: Connection timed out.
Giving up.


real    0m21.050s
user    0m0.003s
sys     0m0.004s

It valid for remote host also.
Comment 1 Andrew Morton 2010-09-22 09:03:13 UTC
(switched to email.  Please respond via emailed reply-to-all, not via the
bugzilla web interface).

On Wed, 22 Sep 2010 08:50:12 GMT bugzilla-daemon@bugzilla.kernel.org wrote:

> https://bugzilla.kernel.org/show_bug.cgi?id=18952
> 
>            Summary: The mount of SYN retries is not equal to
>                     /proc/sys/net/ipv4/tcp_syn_retries
>            Product: Networking
>            Version: 2.5
>     Kernel Version: 2.6.32.12, 2.6.32.15, 2.6.35.4
>           Platform: All
>         OS/Version: Linux
>               Tree: Mainline
>             Status: NEW
>           Severity: normal
>           Priority: P1
>          Component: IPV4
>         AssignedTo: shemminger@linux-foundation.org
>         ReportedBy: yuri@itinteg.net
>         Regression: No
> 
> 
> When setting a value in  /proc/sys/net/ipv4/tcp_syn_retries, the actual
> number
> of syn retries is the number set in  /proc/sys/net/ipv4/tcp_syn_retries minus
> 2.
> 
> For example:
> When setting /proc/sys/net/ipv4/tcp_syn_retries to 5 the actual number of SYN
> retries is 3.
> When setting /proc/sys/net/ipv4/tcp_syn_retries to 7 the actual number of SYN
> retries is 5.
> However, when setting setting /proc/sys/net/ipv4/tcp_syn_retries to 2 the
> actual number of SYN retries is 2.
> 
> Note: In the kernel 2.6.31.9 actual number of SYN = tcp_syn_retries + 1
> 
> 
> Steps to reproduce:
> sudo iptables -I INPUT 1 -i lo -p tcp --dport 88 -j DROP
> sudo tcpdump -n -i lo -v tcp port 88
> 
> from another console run:
> time wget -t 1 -O - --connect-timeout=300 http://0:88
> 
> tcpdump output:
> sudo tcpdump -n -i lo -v tcp port 88
> 11:29:39.820058 IP (tos 0x0, ttl 64, id 14664, offset 0, flags [DF], proto
> TCP
> (6), length 60)
>     127.0.0.1.43730 > 127.0.0.1.kerberos: Flags [S], cksum 0xfe30 (incorrect
>     ->
> 0xecf4), seq 1012617667, win 32792, options [mss 16396,sackOK,TS val 12871819
> ecr 0,nop,wscale 7], length 0
> 11:29:42.824091 IP (tos 0x0, ttl 64, id 14665, offset 0, flags [DF], proto
> TCP
> (6), length 60)
>     127.0.0.1.43730 > 127.0.0.1.kerberos: Flags [S], cksum 0xfe30 (incorrect
>     ->
> 0xe137), seq 1012617667, win 32792, options [mss 16396,sackOK,TS val 12874824
> ecr 0,nop,wscale 7], length 0
> 11:29:48.832153 IP (tos 0x0, ttl 64, id 14666, offset 0, flags [DF], proto
> TCP
> (6), length 60)
>     127.0.0.1.43730 > 127.0.0.1.kerberos: Flags [S], cksum 0xfe30 (incorrect
>     ->
> 0xc9bf), seq 1012617667, win 32792, options [mss 16396,sackOK,TS val 12880832
> ecr 0,nop,wscale 7], length 0
> 
> wget output:
> time wget -t 1 -O - --connect-timeout=300 http://0:88
> Resolving 0... 0.0.0.0
> Connecting to 0|0.0.0.0|:88... failed: Connection timed out.
> Giving up.
> 
> 
> real    0m21.050s
> user    0m0.003s
> sys     0m0.004s
> 
> It valid for remote host also.
>
Comment 2 David S. Miller 2010-09-25 03:06:11 UTC
tcp_syn_retries is not an exact calculation.

It is input into a calculation which estimates how long that many
retransmits (with suitable backoff applied) will take, and that time
estimte in turn determines the time limit for when we'll kill the
connection attempt.

Feel free to update the documentation in
Documentation/networking/ip-sysctl.txt to more closely match the
behavior.

The logic is in net/ipv4/tcp_timer.c:retransmits_timed_out().
Comment 3 David S. Miller 2010-09-27 08:10:50 UTC
From: Yuri Chislov <yuri@itinteg.net>
Date: Mon, 27 Sep 2010 10:07:06 +0200

>   It looks like the behavior changed in 2.6.32. 2.6.32 and up, uses some 
> calculation instead of a direct definition of the retries number, that makes
> it 
> harder to achieve the necessary system behavior.
> 
>  The default behavior of the system changed completely
> (the old default connect timeout was ~ 180 seconds, while the new one is ~21 
> sec). 
> 
> The new behavior invalidates the kernel documentation and tcp man page.
> 
> It's not possible to set a connect timeout > 25 sec in the applications while 
> using the default values in /proc. 
> 
> From my view point is regression.

Agreed, Damian you have to fix this.

Otherwise I'm reverting all of your Revert Backoff commits.
Comment 4 Yuri Chislov 2010-09-27 08:44:12 UTC
  It looks like the behavior changed in 2.6.32. 2.6.32 and up, uses some 
calculation instead of a direct definition of the retries number, that makes it 
harder to achieve the necessary system behavior.

 The default behavior of the system changed completely
(the old default connect timeout was ~ 180 seconds, while the new one is ~21 
sec). 

The new behavior invalidates the kernel documentation and tcp man page.

It's not possible to set a connect timeout > 25 sec in the applications while 
using the default values in /proc. 

From my view point is regression.

On Saturday, September 25, 2010 05:05:57 am David Miller wrote:
> tcp_syn_retries is not an exact calculation.
> 
> It is input into a calculation which estimates how long that many
> retransmits (with suitable backoff applied) will take, and that time
> estimte in turn determines the time limit for when we'll kill the
> connection attempt.
> 
> Feel free to update the documentation in
> Documentation/networking/ip-sysctl.txt to more closely match the
> behavior.
> 
> The logic is in net/ipv4/tcp_timer.c:retransmits_timed_out().
Comment 5 Anonymous Emailer 2010-09-27 09:06:33 UTC
Reply-To: damian@tvk.rwth-aachen.de

Hi David,

give me some time, please. I'll take a closer look in the evening.

Regards
 Damian

Am Montag, den 27.09.2010, 01:10 -0700 schrieb David Miller:
> From: Yuri Chislov <yuri@itinteg.net>
> Date: Mon, 27 Sep 2010 10:07:06 +0200
> 
> >   It looks like the behavior changed in 2.6.32. 2.6.32 and up, uses some 
> > calculation instead of a direct definition of the retries number, that
> makes it 
> > harder to achieve the necessary system behavior.
> > 
> >  The default behavior of the system changed completely
> > (the old default connect timeout was ~ 180 seconds, while the new one is
> ~21 
> > sec). 
> > 
> > The new behavior invalidates the kernel documentation and tcp man page.
> > 
> > It's not possible to set a connect timeout > 25 sec in the applications
> while 
> > using the default values in /proc. 
> > 
> > From my view point is regression.
> 
> Agreed, Damian you have to fix this.
> 
> Otherwise I'm reverting all of your Revert Backoff commits.
Comment 6 Anonymous Emailer 2010-09-27 20:31:07 UTC
Reply-To: damian@tvk.rwth-aachen.de

Ok, I see.

When I read your mail this morning, I was afraid, that this is another
bug in the calculation routine. However, the routine seems ok to me.

The problem is the high discrepancy between the RTO_MIN-based
calculation and TCP_TIMEOUT_INIT-based actual backoff in the SYN-case.

Yuris example did not reveal a conceptually new bug, as the same
behaviour can be observed on pre 2.6.32 kernels at higher values.
I tested wget on a 2.6.26 kernel with timeout values of 300 and above.
All runs did time out after 189 seconds, with the sysctl-value as their
hard limit.

My suggestion for solving this issue:
Introducing a third boolean parameter for retransmits_timed_out()
indicating whether the socket is in SYN state or not. In the SYN case, 
TCP_TIMEOUT_INIT will be used for the calculation instead of
TCP_RTO_MIN.

Is that ok?

Regards
 Damian


Am Montag, den 27.09.2010, 01:10 -0700 schrieb David Miller:
> From: Yuri Chislov <yuri@itinteg.net>
> Date: Mon, 27 Sep 2010 10:07:06 +0200
> 
> >   It looks like the behavior changed in 2.6.32. 2.6.32 and up, uses some 
> > calculation instead of a direct definition of the retries number, that
> makes it 
> > harder to achieve the necessary system behavior.
> > 
> >  The default behavior of the system changed completely
> > (the old default connect timeout was ~ 180 seconds, while the new one is
> ~21 
> > sec). 
> > 
> > The new behavior invalidates the kernel documentation and tcp man page.
> > 
> > It's not possible to set a connect timeout > 25 sec in the applications
> while 
> > using the default values in /proc. 
> > 
> > From my view point is regression.
> 
> Agreed, Damian you have to fix this.
> 
> Otherwise I'm reverting all of your Revert Backoff commits.
Comment 7 David S. Miller 2010-09-28 04:52:45 UTC
From: Damian Lukowski <damian@tvk.rwth-aachen.de>
Date: Mon, 27 Sep 2010 22:00:08 +0200

> My suggestion for solving this issue:
> Introducing a third boolean parameter for retransmits_timed_out()
> indicating whether the socket is in SYN state or not. In the SYN case, 
> TCP_TIMEOUT_INIT will be used for the calculation instead of
> TCP_RTO_MIN.
> 
> Is that ok?

Sounds fine to me, please prepare a patch.

Thanks!
Comment 8 Yuri Chislov 2010-09-28 07:39:50 UTC
 What is advantage in the replace compare by calculation?
Please clarify, if possible.
Thanks.
Yuri.     


--- linux-2.6.31.14/net/ipv4/tcp_timer.c        2010-07-05 17:11:43.000000000 
+0000
+++ linux-2.6.32.15/net/ipv4/tcp_timer.c        2010-06-01 16:56:03.000000000 
+0000
@@ -137,13 +137,14 @@
 {
        struct inet_connection_sock *icsk = inet_csk(sk);
        int retry_until;
+       bool do_reset;
 
        if ((1 << sk->sk_state) & (TCPF_SYN_SENT | TCPF_SYN_RECV)) {
                if (icsk->icsk_retransmits)
                        dst_negative_advice(&sk->sk_dst_cache);
                retry_until = icsk->icsk_syn_retries ? : 
sysctl_tcp_syn_retries;
        } else {
-               if (icsk->icsk_retransmits >= sysctl_tcp_retries1) {
+               if (retransmits_timed_out(sk, sysctl_tcp_retries1)) {
                        /* Black hole detection */
                        tcp_mtu_probing(icsk, sk);
 
@@ -155,13 +156,15 @@
                        const int alive = (icsk->icsk_rto < TCP_RTO_MAX);
 
                        retry_until = tcp_orphan_retries(sk, alive);
+                       do_reset = alive ||
+                                  !retransmits_timed_out(sk, retry_until);
 
-                       if (tcp_out_of_resources(sk, alive || icsk-
>icsk_retransmits < retry_until))
+                       if (tcp_out_of_resources(sk, do_reset))
                                return 1;
                }
        }
 
-       if (icsk->icsk_retransmits >= retry_until) {
+       if (retransmits_timed_out(sk, retry_until)) {
                /* Has it gone just too far? */
                tcp_write_err(sk);
                return 1;
@@ -279,7 +282,7 @@
  *     The TCP retransmit timer.
  */
 
-static void tcp_retransmit_timer(struct sock *sk)
+void tcp_retransmit_timer(struct sock *sk)
 {
        struct tcp_sock *tp = tcp_sk(sk);
        struct inet_connection_sock *icsk = inet_csk(sk);
@@ -385,7 +388,7 @@
 out_reset_timer:
        icsk->icsk_rto = min(icsk->icsk_rto << 1, TCP_RTO_MAX);
        inet_csk_reset_xmit_timer(sk, ICSK_TIME_RETRANS, icsk->icsk_rto, 
TCP_RTO_MAX);
-       if (icsk->icsk_retransmits > sysctl_tcp_retries1)
+       if (retransmits_timed_out(sk, sysctl_tcp_retries1 + 1))
                __sk_dst_reset(sk);
 
 out:;
@@ -499,8 +502,7 @@
        elapsed = tcp_time_stamp - tp->rcv_tstamp;
 
        if (elapsed >= keepalive_time_when(tp)) {
-               if ((!tp->keepalive_probes && icsk->icsk_probes_out >= 
sysctl_tcp_keepalive_probes) ||
-                    (tp->keepalive_probes && icsk->icsk_probes_out >= tp-
>keepalive_probes)) {
+               if (icsk->icsk_probes_out >= keepalive_probes(tp)) {
                        tcp_send_active_reset(sk, GFP_ATOMIC);
                        tcp_write_err(sk);
                        goto out;


On Tuesday, September 28, 2010 06:52:41 am David Miller wrote:
> From: Damian Lukowski <damian@tvk.rwth-aachen.de>
> Date: Mon, 27 Sep 2010 22:00:08 +0200
> 
> > My suggestion for solving this issue:
> > Introducing a third boolean parameter for retransmits_timed_out()
> > indicating whether the socket is in SYN state or not. In the SYN case,
> > TCP_TIMEOUT_INIT will be used for the calculation instead of
> > TCP_RTO_MIN.
> > 
> > Is that ok?
> 
> Sounds fine to me, please prepare a patch.
> 
> Thanks!
Comment 9 Ilpo Järvinen 2010-09-28 10:29:18 UTC
On Tue, 28 Sep 2010, Yuri Chislov wrote:

>  What is advantage in the replace compare by calculation?
> Please clarify, if possible.
> Thanks.
> Yuri.     

If you did take this from kernel history, reading the particular log 
message might have helped?!?

> --- linux-2.6.31.14/net/ipv4/tcp_timer.c        2010-07-05 17:11:43.000000000 
> +0000
> +++ linux-2.6.32.15/net/ipv4/tcp_timer.c        2010-06-01 16:56:03.000000000 
> +0000
> @@ -137,13 +137,14 @@
>  {
>         struct inet_connection_sock *icsk = inet_csk(sk);
>         int retry_until;
> +       bool do_reset;
>  
>         if ((1 << sk->sk_state) & (TCPF_SYN_SENT | TCPF_SYN_RECV)) {
>                 if (icsk->icsk_retransmits)
>                         dst_negative_advice(&sk->sk_dst_cache);
>                 retry_until = icsk->icsk_syn_retries ? : 
> sysctl_tcp_syn_retries;
>         } else {
> -               if (icsk->icsk_retransmits >= sysctl_tcp_retries1) {
> +               if (retransmits_timed_out(sk, sysctl_tcp_retries1)) {
>                         /* Black hole detection */
>                         tcp_mtu_probing(icsk, sk);
>  
> @@ -155,13 +156,15 @@
>                         const int alive = (icsk->icsk_rto < TCP_RTO_MAX);
>  
>                         retry_until = tcp_orphan_retries(sk, alive);
> +                       do_reset = alive ||
> +                                  !retransmits_timed_out(sk, retry_until);
>  
> -                       if (tcp_out_of_resources(sk, alive || icsk-
> >icsk_retransmits < retry_until))
> +                       if (tcp_out_of_resources(sk, do_reset))
>                                 return 1;
>                 }
>         }
>  
> -       if (icsk->icsk_retransmits >= retry_until) {
> +       if (retransmits_timed_out(sk, retry_until)) {
>                 /* Has it gone just too far? */
>                 tcp_write_err(sk);
>                 return 1;
> @@ -279,7 +282,7 @@
>   *     The TCP retransmit timer.
>   */
>  
> -static void tcp_retransmit_timer(struct sock *sk)
> +void tcp_retransmit_timer(struct sock *sk)
>  {
>         struct tcp_sock *tp = tcp_sk(sk);
>         struct inet_connection_sock *icsk = inet_csk(sk);
> @@ -385,7 +388,7 @@
>  out_reset_timer:
>         icsk->icsk_rto = min(icsk->icsk_rto << 1, TCP_RTO_MAX);
>         inet_csk_reset_xmit_timer(sk, ICSK_TIME_RETRANS, icsk->icsk_rto, 
> TCP_RTO_MAX);
> -       if (icsk->icsk_retransmits > sysctl_tcp_retries1)
> +       if (retransmits_timed_out(sk, sysctl_tcp_retries1 + 1))
>                 __sk_dst_reset(sk);
>  
>  out:;
> @@ -499,8 +502,7 @@
>         elapsed = tcp_time_stamp - tp->rcv_tstamp;
>  
>         if (elapsed >= keepalive_time_when(tp)) {
> -               if ((!tp->keepalive_probes && icsk->icsk_probes_out >= 
> sysctl_tcp_keepalive_probes) ||
> -                    (tp->keepalive_probes && icsk->icsk_probes_out >= tp-
> >keepalive_probes)) {
> +               if (icsk->icsk_probes_out >= keepalive_probes(tp)) {
>                         tcp_send_active_reset(sk, GFP_ATOMIC);
>                         tcp_write_err(sk);
Comment 10 Yuri Chislov 2010-09-28 12:07:29 UTC
Thank you. The log is really helpful.

On Tuesday, September 28, 2010 11:47:23 am Ilpo Järvinen wrote:
> On Tue, 28 Sep 2010, Yuri Chislov wrote:
> >  What is advantage in the replace compare by calculation?
> >
> > Please clarify, if possible.
> > Thanks.
> > Yuri.     
> 
> If you did take this from kernel history, reading the particular log 
> message might have helped?!?

Note You need to log in before you can comment on or make changes to this bug.