This has been a longstanding "bug" of sorts when talking to a system that has extremely small windows (under 1.5k).

The only way to give the stack on the other side a nudge is to ACK twice.

Here is a sample transcript, with a max window size of 1025 bytes.

18:25:43.968358 IP dr.ea.ms.http > 192.168.80.2.40246: . 37377:37633(256) ack 120 win 5840
18:25:43.992402 IP 192.168.80.2.40246 > dr.ea.ms.http: . ack 37121 win 769 <mss 256>
18:25:44.390305 IP 192.168.80.2.40246 > dr.ea.ms.http: . ack 37121 win 1025 <mss 256>
18:25:44.823084 IP dr.ea.ms.http > 192.168.80.2.40246: . 37633:37889(256) ack 120 win 5840

If I take the "nudge" code out of my IP stack, the sender sits for an awfully long time before sending the next packet, when there clearly is room for a few more.

Should I:
1: Have my IP stack lie about the window till it is important?
2: Something else?

I can't see any good reason for the large delay, since it is on a serial link, via SLIP.

I can point you to source code that will allow you to verify the problem for yourself, if you would like.
The Linux stack follows the RFC standard for silly window avoidance. Any window less than a full MTU is deemed a silly window and will not be used. The application can turn off the Nagle algorithm on a per-socket basis with TCP_NODELAY.

What is the OS or device on the other end that is so non-standards-compliant? Since Linux follows the standard, you really need to fix the receiver.
Reply-To: akpm@linux-foundation.org

On Sun, 16 Sep 2007 17:02:46 -0700 (PDT) bugme-daemon@bugzilla.kernel.org wrote:

> http://bugzilla.kernel.org/show_bug.cgi?id=9031
>
>            Summary: TPC window is to cautious on send
>            Product: Networking
>            Version: 2.5
>      KernelVersion: Any
>           Platform: All
>         OS/Version: Linux
>               Tree: Mainline
>             Status: NEW
>           Severity: normal
>           Priority: P1
>          Component: IPV4
>         AssignedTo: shemminger@osdl.org
>         ReportedBy: a@oo.ms
>
> This has been a longstanding "bug" of sorts when talking to a system that has
> extremely small windows (under 1.5k).
>
> The only way to give the stack on the other side a nudge is to ACK twice.
>
> Here is a sample transcript, with a max window size of 1025 bytes.
>
> 18:25:43.968358 IP dr.ea.ms.http > 192.168.80.2.40246: . 37377:37633(256) ack 120 win 5840
> 18:25:43.992402 IP 192.168.80.2.40246 > dr.ea.ms.http: . ack 37121 win 769 <mss 256>
> 18:25:44.390305 IP 192.168.80.2.40246 > dr.ea.ms.http: . ack 37121 win 1025 <mss 256>
> 18:25:44.823084 IP dr.ea.ms.http > 192.168.80.2.40246: . 37633:37889(256) ack 120 win 5840
>
> If I take the "nudge" code out of my IP stack, it sits for an aweful long time,
> waiting on the next packet, when there clearly is room for a few more.
>
> Should I:
> 1: Have my IP stack lie about the window till it is important?
> 2: Something else?
>
> I can't see any good reason for the large delay, since it is on a serial link,
> via SLIP.
>
> I can point you to source code that will allow you to verify the problem for
> yourself, if you would like.
Reply-To: shemminger@linux-foundation.org

On Sun, 16 Sep 2007 23:43:40 -0700 Andrew Morton <akpm@linux-foundation.org> wrote:

> On Sun, 16 Sep 2007 17:02:46 -0700 (PDT) bugme-daemon@bugzilla.kernel.org wrote:
>
> > http://bugzilla.kernel.org/show_bug.cgi?id=9031

See my comment on the bug report. Linux is doing Silly Window Syndrome avoidance (RFC 813), as required by the Host Requirements RFC (RFC 1122):

    4.2.3.4  When to Send Data

    A TCP MUST include a SWS avoidance algorithm in the sender.
    A TCP SHOULD implement the Nagle Algorithm [TCP:9] to coalesce short segments. However, there MUST be a way for an application to disable the Nagle algorithm on an individual connection.

    In all cases, sending data is also subject to the limitation imposed by the Slow Start algorithm (Section 4.2.2.15).

The Linux mechanism to disable Nagle is setsockopt(TCP_NODELAY).
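For reference, here is a minimal sketch of what disabling Nagle looks like from an application on the Linux side. It uses Python's standard `socket` module; the helper name `make_nodelay_socket` is made up for this example, and it only covers the sending application, not the small-window device.

```python
import socket


def make_nodelay_socket() -> socket.socket:
    """Create a TCP socket with the Nagle algorithm disabled.

    Hypothetical helper for illustration; the option itself is the
    standard TCP_NODELAY socket option.
    """
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    # A non-zero value turns Nagle off for this one connection only.
    sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY, 1)
    return sock


sock = make_nodelay_socket()
# Read the option back to confirm the kernel accepted it.
nodelay_enabled = sock.getsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY) != 0
sock.close()
```

Note that this only disables the Nagle part. The RFC text quoted above still requires sender-side SWS avoidance regardless of this option.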
So then the option is to lie on my end. There is no way to increase the actual window on some of the devices, which is very unfortunate. It shouldn't matter much anyway, since it is a serial device and the buffer space is available.

The windowing scheme I use reports to the other host the amount of buffer space available per connection. It doesn't count the free shared buffers, which are usually used to collect fragments and hold on to them for reassembly. I suppose counting those would be the ideal fix.

Thanks for pointing me to the RFC. It will be seriously helpful in the development of my IP stack.
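Since the question is what the receiving stack should advertise, here is a rough sketch of the receiver-side SWS avoidance rule from RFC 1122 section 4.2.3.3: hold the advertised window steady until it can grow by at least min(Fr * RCV.BUFF, Eff.snd.MSS), with Fr = 1/2 recommended. The function and parameter names are invented for this example (they mirror the RFC's variable names), and the sketch ignores how the window shrinks as data arrives. It is not a claim about what any particular stack does.

```python
def advertised_window(rcv_buff: int, rcv_user: int, rcv_wnd: int,
                      eff_snd_mss: int, fr: float = 0.5) -> int:
    """Return the window to advertise in the next ACK.

    rcv_buff    -- total receive buffer for the connection (RCV.BUFF)
    rcv_user    -- bytes received but not yet consumed (RCV.USER)
    rcv_wnd     -- window currently advertised (RCV.WND)
    eff_snd_mss -- effective send MSS of the connection
    fr          -- fraction of the buffer that counts as "worthwhile"
    """
    free = rcv_buff - rcv_user
    threshold = min(fr * rcv_buff, eff_snd_mss)
    # Only open the window when the increase is worth advertising.
    if free - rcv_wnd >= threshold:
        return free
    return rcv_wnd


# Numbers loosely based on the transcript: 1025-byte buffer, MSS 256.
small_gain = advertised_window(1025, 156, 769, 256)  # only 100 bytes freed
full_gain = advertised_window(1025, 0, 769, 256)     # a whole segment freed
```

With these inputs the first call keeps the window at 769 and the second opens it to 1025, so the window only ever moves in steps of at least one segment.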
If the device doesn't have enough buffer space for a whole Ethernet packet, then a sensible thing to do would be to use a smaller Maximum Segment Size (MSS) during the TCP negotiation phase. The SWS avoidance is done based on MSS, so if you set it to 512 bytes everything would work.
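To make the dependence on MSS concrete, here is a simplified sketch of the sender-side decision described in RFC 1122 section 4.2.3.4. It follows the RFC text, not the Linux source; the function and parameter names are made up for this example, the override timer (rule 4) is left out, and Fs = 1/2 is the RFC's recommended value.

```python
def sender_may_send(queued: int, usable: int, eff_snd_mss: int,
                    max_snd_wnd: int, pushed: bool, idle: bool,
                    nagle: bool = True, fs: float = 0.5) -> bool:
    """Decide whether a sender following RFC 1122 4.2.3.4 transmits now.

    queued      -- bytes waiting to be sent (D in the RFC)
    usable      -- usable window, SND.UNA + SND.WND - SND.NXT (U)
    eff_snd_mss -- effective send MSS for the connection
    max_snd_wnd -- largest window the peer has ever advertised
    pushed      -- True if the queued data has been pushed
    idle        -- True if nothing is unacknowledged (SND.NXT == SND.UNA)
    nagle       -- False models TCP_NODELAY, dropping the idle requirement
    """
    sendable = min(queued, usable)
    if sendable <= 0:
        return False  # guard added for this sketch: nothing can be sent
    if sendable >= eff_snd_mss:
        return True   # rule 1: a full-sized segment fits
    nagle_ok = idle or not nagle
    if nagle_ok and pushed and queued <= usable:
        return True   # rule 2: all queued data can go out now
    if nagle_ok and sendable >= fs * max_snd_wnd:
        return True   # rule 3: a large fraction of the peer's max window
    return False      # otherwise wait (override timer not modelled)


# A 1025-byte window with data already in flight and plenty queued.
with_mss_1460 = sender_may_send(4000, 1025, 1460, 1025, pushed=False, idle=False)
with_mss_512 = sender_may_send(4000, 1025, 512, 1025, pushed=False, idle=False)
```

Under these assumptions the sender holds back when its effective MSS is 1460 (the window never fits a full segment) and sends when the MSS is 512, which is the effect the smaller negotiated MSS is meant to have.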