Bug 206319

Summary: ECN header flag processing overly restrictive in TCP
Product: Networking Reporter: rscheff
Component: Other Assignee: Stephen Hemminger (stephen)
Status: NEW ---    
Severity: normal CC: nealcardwell, rscheff, ycheng
Priority: P1    
Hardware: All   
OS: Linux   
Kernel Version: HEAD Subsystem:
Regression: No Bisected commit-id:

Description rscheff 2020-01-27 08:15:36 UTC
RFC3168 states that a CWR flag "SHOULD" be *sent* together with a new data segment.

However, Linux as data receiver processes the CWR flag ONLY when it arrives together with some data (though it apparently does accept it on retransmissions).
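To make the difference concrete, here is a minimal sketch of the two receiver behaviours. It is a simplified model, not kernel code: `struct ecn_rx`, `struct segment`, `rx_strict`, `rx_liberal` and `rx_echoes_ece` are hypothetical names made up for illustration.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, simplified receiver state; not the kernel's struct tcp_sock. */
struct ecn_rx {
    bool demand_cwr;   /* true while the receiver keeps echoing ECE */
};

/* Hypothetical summary of an incoming segment. */
struct segment {
    bool cwr;          /* CWR bit set in the TCP header */
    size_t payload;    /* bytes of data carried; 0 for a pure ACK */
};

/* Strict variant: CWR only counts when the segment carries data. */
static void rx_strict(struct ecn_rx *rx, const struct segment *seg)
{
    if (seg->payload > 0 && seg->cwr)
        rx->demand_cwr = false;
}

/* Liberal variant: CWR counts on any segment, including pure ACKs. */
static void rx_liberal(struct ecn_rx *rx, const struct segment *seg)
{
    if (seg->cwr)
        rx->demand_cwr = false;
}

/* The receiver sets ECE on outgoing ACKs for as long as this is true. */
static bool rx_echoes_ece(const struct ecn_rx *rx)
{
    return rx->demand_cwr;
}
```

With a pure ACK carrying CWR, the strict variant leaves ECE latched while the liberal variant clears it.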

This has been found to be an interoperability issue with *BSD, which so far sends the CWR as quickly as possible, including on pure ACKs (or retransmissions). That deviation from RFC3168 is reported at https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=243231

Nevertheless, CWR processing as receiver should be less restrictive, to meet the spirit of Postel's Law: "Be liberal in what you accept, and conservative in what you send."


This has been demonstrated to be a dramatic performance impediment: the data receiver (Linux) keeps the ECE latched, while *BSD interprets the additional ECE flags as another round of congestion. The data sender reacts with continuous reductions of the congestion window until extremely low packet transmission rates are hit (1 packet per delayed ACK timeout, or even per persist timeout (5s)), and stays at that level for extensive periods of time.
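As a rough illustration of how quickly this collapses the window, the sketch below assumes the sender halves cwnd once per round trip in which it still sees ECE, with a floor of one segment. The halving factor and the function name `cwnd_after_ece_rounds` are assumptions for illustration; actual stacks and congestion control modules may use a different reduction.

```c
#include <stdint.h>

/*
 * Hypothetical model: one multiplicative decrease (assumed to be a halving)
 * per round trip in which ECE is still echoed, never dropping below 1 segment.
 */
static uint32_t cwnd_after_ece_rounds(uint32_t cwnd, unsigned int rounds)
{
    if (cwnd < 1)
        cwnd = 1;
    for (unsigned int i = 0; i < rounds && cwnd > 1; i++)
        cwnd /= 2;      /* another "round of congestion" */
    return cwnd;
}
```

Under these assumptions a window of 64 segments is down to a single segment after six round trips with ECE still latched, and stays there.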

Discussed this issue with Neal and Yuchung already; this bug report is to track the issue in the field (impacted environments).
Comment 1 rscheff 2020-01-27 08:30:41 UTC
Also, while not having validated this on linux:

As CWR is effectively bound to the first transmission of [snd_max+1] when entering ECN-congestion, the recovery point should be set to that sequence number too; it may instead be set to snd_max (as with loss recovery). This can lead to premature termination of the ECN-based congestion reaction on a full ACK of snd_max, which is likely during request-response workloads, where the data direction changes frequently.

If that happens, the still latched ECE can lead to two consecutive ECN reactions (reducing cwnd twice). 
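A minimal sketch of the recovery point check, assuming the congestion reaction is considered finished once the cumulative ACK reaches the recovery point. `seq_geq` and `ecn_reaction_done` are hypothetical helper names, not taken from any particular stack.

```c
#include <stdbool.h>
#include <stdint.h>

/* Wrap-safe "a >= b" for 32-bit TCP sequence numbers. */
static bool seq_geq(uint32_t a, uint32_t b)
{
    return (int32_t)(a - b) >= 0;
}

/*
 * Hypothetical check: the ECN congestion reaction ends once the cumulative
 * ACK has reached the recovery point recorded when the reaction started.
 */
static bool ecn_reaction_done(uint32_t ack, uint32_t recover)
{
    return seq_geq(ack, recover);
}
```

With the recovery point at snd_max, a full ACK of snd_max already ends the reaction, before the segment carrying CWR has been acknowledged; with the recovery point at snd_max+1 it does not.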

Again, I have not verified if this problem may exist on linux, but found this on *BSD derived stacks.