Distribution: Red Hat 7.3 Hardware Environment: i386 Software Environment: Problem Description: Although the current kernel complies with RFC1323 (about timestamps, among other things), it does not follow the authors' later correction to the timestamp (ts) algorithm (http://www.kohala.com/start/tcplw- extensions.txt). In a nutshell: If the connection goes down, the subsequent retransmissions will have later ts values. However, if an ACK was lost, the kernel currently does not update ts_recent with these values. If the channel comes back up and a retrans succeeds, the client will send an ack with an ts echo equal to the old value of ts_recent. As the difference between the echo and the current time is quite large, it will result in an extremely large RTT measurement, triggering a huge increase in the RTO. In an unreliable, bursty link, such as satellites, another outage could occur shortly thereafter, and an inappropriately long timeout can occur. Steps to reproduce: (1) Start a long TCP transfer and a tcpdump on the client side (2) unplug the cable for a while (3) look at the results: 16:06:01.336173 192.168.2.100.32808 > 192.168.1.100.5000: . 13566465:13567913 (1448) ack 1 win 5840 <nop,nop,timestamp 158539946 20339> (DF) 16:06:01.336296 192.168.2.100.32808 > 192.168.1.100.5000: . 13567913:13569361 (1448) ack 1 win 5840 <nop,nop,timestamp 158539946 20339> (DF) 16:06:01.336320 192.168.1.100.5000 > 192.168.2.100.32808: . ack 13569361 win 63712 <nop,nop,timestamp 20339 158539946> (DF) < outage > 16:06:27.993187 192.168.2.100.32808 > 192.168.1.100.5000: . 13505649:13507097 (1448) ack 1 win 5840 <nop,nop,timestamp 158542613 20339> (DF) 16:06:27.993340 192.168.1.100.5000 > 192.168.2.100.32808: . ack 13569361 win 63712 <nop,nop,timestamp 23005 158539946,nop,nop,sack sack 1 {13505649:13507097} > (DF) 16:06:27.993827 192.168.2.100.32808 > 192.168.1.100.5000: . 13569361:13570809 (1448) ack 1 win 5840 <nop,nop,timestamp 158542613 23005> (DF) 16:06:27.993866 192.168.1.100.5000 > 192.168.2.100.32808: . ack 13570809 win 63712 <nop,nop,timestamp 23005 158542613> (DF) Note that the first packet after the outage triggers an ack with an echo not from its timestamp, but from one before the outage. Solution: In tcp_rcv_established() of tcp_input.c: if (!tcp_sequence(tp, TCP_SKB_CB(skb)->seq, TCP_SKB_CB(skb)->end_seq)) { /* RFC793, page 37: "In all states except SYN-SENT, all reset * (RST) segments are validated by checking their SEQ-fields." * And page 69: "If an incoming segment is not acceptable, * an acknowledgment should be sent in reply (unless the RST bit * is set, if so drop the segment and return)". */ if (!th->rst) tcp_send_dupack(sk, skb); goto discard; } if(th->rst) { tcp_reset(sk); goto discard; } tcp_replace_ts_recent(tp, TCP_SKB_CB(skb)->seq); Move the tcp_replace_ts_recent() call from the end to the beginning of this block of code. This updates ts_recent prior to the conditional which is executed in the case described above. Sample correct output (same scenario): 15:54:48.067421 192.168.2.100.32806 > 192.168.1.100.5000: . 9999249:10000697 (1448) ack 1 win 5840 <nop,nop,timestamp 158472657 2096312> (DF) 15:54:48.067544 192.168.2.100.32806 > 192.168.1.100.5000: . 10000697:10002145 (1448) ack 1 win 5840 <nop,nop,timestamp 158472657 2096312> (DF) 15:54:48.067569 192.168.1.100.5000 > 192.168.2.100.32806: . ack 10002145 win 63712 <nop,nop,timestamp 2096313 158472657> (DF) <outage> 15:55:14.725019 192.168.2.100.32806 > 192.168.1.100.5000: . 9938433:9939881 (1448) ack 1 win 5840 <nop,nop,timestamp 158475324 2096312> (DF) 15:55:14.725161 192.168.1.100.5000 > 192.168.2.100.32806: . ack 10002145 win 63712 <nop,nop,timestamp 2098979 158475324,nop,nop,sack sack 1 {9938433:9939881} > (DF) 15:55:14.725611 192.168.2.100.32806 > 192.168.1.100.5000: . 10002145:10003593 (1448) ack 1 win 5840 <nop,nop,timestamp 158475324 2098979> (DF) 15:55:14.725735 192.168.2.100.32806 > 192.168.1.100.5000: . 10003593:10005041 (1448) ack 1 win 5840 <nop,nop,timestamp 158475324 2098979> (DF) 15:55:14.725790 192.168.1.100.5000 > 192.168.2.100.32806: . ack 10003593 win 63712 <nop,nop,timestamp 2098979 158475324> (DF)
Said draft is buggy and our behavior is correct. The TCP implementation community considers our behavior acceptable. There was a discussion about exactly this on the tcp-impl list some time ago, sorry I don't have a reference handy. In particular, said draft opens up a hole where anyone can break the connection on us.