Bug 194723 - connect() to localhost stalls after 4.9 -> 4.10 upgrade
Summary: connect() to localhost stalls after 4.9 -> 4.10 upgrade
Status: NEW
Alias: None
Product: Networking
Classification: Unclassified
Component: IPV4
Hardware: All Linux
Importance: P1 high
Assignee: Stephen Hemminger
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2017-02-27 11:28 UTC by Lutz Vieweg
Modified: 2020-11-17 12:27 UTC
CC List: 9 users

See Also:
Kernel Version: 4.10
Subsystem:
Regression: Yes
Bisected commit-id:


Attachments
Arch-Linux supplied kernel config that was used on a system not exhibiting this bug. (182.61 KB, text/plain)
2017-03-14 17:45 UTC, Lutz Vieweg

Description Lutz Vieweg 2017-02-27 11:28:51 UTC
After upgrading a machine running the latest CentOS from mainline kernel linux-4.9 to linux-4.10, attempts to connect() via IPv4 to localhost fail in about half of the cases, leaving the process that called connect() stalled.

Reproduction:

> ncat -k -l 19999 &
> C=1 ; while true ; do echo -n "$C " ; echo ping | ncat localhost 19999 ;
> C=`expr $C + 1` ; sleep 1 ; done
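
For clarity, here is the same reproduction loop expanded with comments (equivalent to the one-liner above, not a different test; port 19999 is arbitrary):

> # background listener that stays up across connections (-k = keep listening)
> ncat -k -l 19999 &
> # connect once per second; on success the background listener prints the
> # received "ping", on failure the client prints "Ncat: Connection timed out."
> C=1
> while true ; do
>     echo -n "$C "
>     echo ping | ncat localhost 19999
>     C=`expr $C + 1`
>     sleep 1
> done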

Using linux-4.10, the output looks like this:

> 1 ping
> 2 Ncat: Connection timed out.
> 3 ping
> 4 Ncat: Connection timed out.
> 5 ping
> 6 ping
> 7 ping
> 8 Ncat: Connection timed out.
> 9 ping
> 10 ping
> 11 Ncat: Connection timed out.
> 12 ping
> 13 Ncat: Connection timed out.
> 14 ping
> 15 Ncat: Connection timed out.
> 16 ping
> 17 Ncat: Connection timed out.
> 18 ping
> 19 ping
> 20 Ncat: Connection timed out.
> 21 ping
> 22 ping
> 23 Ncat: Connection timed out.
> 24 ping
> 25 Ncat: Connection timed out.
> 26 Ncat: Connection timed out.
> 27 ping
> 28 Ncat: Connection timed out.
> 29 ping


Using linux-4.9, the output looks like this:

> 1 ping
> 2 ping
> 3 ping
> 4 ping
> 5 ping
> 6 ping
> 7 ping
> 8 ping
> 9 ping
> 10 ping
> 11 ping
> 12 ping
> 13 ping
> 14 ping
> 15 ping
> 16 ping
> 17 ping
> 18 ping
> 19 ping

(The same behaviour was later also confirmed by a colleague on a machine running Ubuntu, after upgrading to Ubuntu's 4.10 kernel.)
Comment 1 Longinos 2017-02-28 09:00:34 UTC
I have not tried these scripts, but I have scripts that populate a web page cache which stopped working after the upgrade to 4.10.0/4.10.1. These scripts fail by timeout most of the time; some connect. No error messages, no kernel panic.
CentOS 7 using kernel-ml from ELRepo.
Comment 2 Lutz Vieweg 2017-03-01 12:04:45 UTC
Using tcpdump I found that when the connect() stalls, the initial SYN packet appears at the "lo" interface, and is re-sent multiple times, but no ACK packet is ever returned.
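
(For reference, a capture invocation along these lines produces the traces below; the exact flags used originally were not recorded, and tcpdump resolves port 19999 to the service name "dnp-sec":)

> tcpdump -i lo tcp port 19999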

Error case with linux-4.10:

> 12:57:25.685640 IP 127.0.0.1.44074 > 127.0.0.1.dnp-sec: Flags [S], seq
> 1952288470, win 43690, options [mss 65495,sackOK,TS val 1942998659 ecr
> 0,nop,wscale 7], length 0
> 12:57:26.728890 IP 127.0.0.1.44074 > 127.0.0.1.dnp-sec: Flags [S], seq
> 1952288470, win 43690, options [mss 65495,sackOK,TS val 1942999703 ecr
> 0,nop,wscale 7], length 0
> 12:57:28.776935 IP 127.0.0.1.44074 > 127.0.0.1.dnp-sec: Flags [S], seq
> 1952288470, win 43690, options [mss 65495,sackOK,TS val 1943001751 ecr
> 0,nop,wscale 7], length 0
...

Normal case:

> 13:01:43.037135 IP 127.0.0.1.44362 > 127.0.0.1.dnp-sec: Flags [S], seq
> 3181010757, win 43690, options [mss 65495,sackOK,TS val 3314900273 ecr
> 0,nop,wscale 7], length 0
> 13:01:43.037171 IP 127.0.0.1.dnp-sec > 127.0.0.1.44362: Flags [S.], seq
> 1934682061, ack 3181010758, win 43690, options [mss 65495,sackOK,TS val
> 2947413993 ecr 3314900273,nop,wscale 7], length 0
> 13:01:43.037196 IP 127.0.0.1.44362 > 127.0.0.1.dnp-sec: Flags [.], ack 1, win
> 342, options [nop,nop,TS val 3314900273 ecr 2947413993], length 0


According to strace, in the error case the listening process never even leaves the select() call it uses to wait for incoming connections to accept.
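
For reference, that listener-side observation can be reproduced with an strace attach along these lines (assumed invocation, the original command line was not recorded; <pid> stands for the listening ncat process):

> strace -f -e trace=select,pselect6,accept,accept4 -p <pid>

In the error case no accept() ever shows up, matching the missing SYN/ACK on the wire.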
Comment 3 Lutz Vieweg 2017-03-03 11:25:19 UTC
I just confirmed that this bug is still present in linux-4.10.1.
Comment 4 SledgeSulaweyo 2017-03-12 13:20:53 UTC
I just wanted to confirm this issue for Arch kernel 4.10.1-1-ARCH as well. I spent some time investigating yesterday after the update. Thanks, Lutz, for tracking it down!
Comment 5 Dor Askayo 2017-03-13 10:36:21 UTC
I can confirm this as well on Arch Linux using kernel version 4.10.1. It's quite severe.
Comment 6 Lutz Vieweg 2017-03-14 17:00:28 UTC
While this bug is 100% reproducible on some machines, it seems to be 0% reproducible on others, and I am still unable to say what determines whether it occurs.

So far, we have verified the bug to occur on
- One AMD Opteron-based server running CentOS
- One Intel i5 desktop PC running Ubuntu
- SledgeSulaweyo above confirmed the bug to exist on some machine running Arch-Linux

On the other hand, I now know of at least two counter-examples:
- A qemu-based virtual machine running CentOS 6 on the same AMD Opteron physical host mentioned above does _not_ exhibit this bug
- Another colleague running Arch Linux on his private PC could also not reproduce this bug.

If I can find a machine that exhibits the bug and that I can use for bisecting, I will do so, but the machines above do not qualify at the moment due to their daily duties.
Comment 7 Jay Vosburgh 2017-03-14 17:14:41 UTC
I've also attempted to reproduce this without success, testing an Ubuntu 4.10 kernel and a mainline 4.10.1.  I'm testing on a physical machine, an Intel i7 Haswell system, installed with Ubuntu 16.04.

Perhaps there is some kernel config dependency here?
Comment 8 Lutz Vieweg 2017-03-14 17:44:49 UTC
There is one (unverified) indication against the kernel config being relevant: the colleague of mine who tried the Arch Linux-provided linux-4.10.1 kernel used the unmodified kernel config from the package maintainer, and if SledgeSulaweyo did the same, we would have one affected and one unaffected system both running kernels with the same config.

@SledgeSulaweyo: Find attached the kernel config my colleague used under Arch Linux - did you use the same?
Comment 9 Lutz Vieweg 2017-03-14 17:45:37 UTC
Created attachment 255247 [details]
Arch-Linux supplied kernel config that was used on a system not exhibiting this bug.
Comment 10 SledgeSulaweyo 2017-03-14 17:54:10 UTC
I pulled the kernel straight from the Arch repo without any change.
My affected home server is an Intel Core i3-4130T @ 2.90GHz.
My laptop is an Intel Core2 Duo U9600 @ 1.60GHz, and I can reproduce this issue there too. I did not try my desktop, but at least I did not directly notice any impact there (i7); I have to check.
Comment 11 SledgeSulaweyo 2017-03-14 18:07:37 UTC
Just verified it on my desktop (Intel Core i7-3930K @ 3.20GHz) and I can reproduce it there too. On my laptop I just pulled the fresh 4.10.2-1-ARCH kernel and the issue is still there.
Comment 12 SledgeSulaweyo 2017-03-15 07:09:24 UTC
Now it's getting interesting... on my Dell (i7) desktop at work I cannot reproduce this with 4.10.2-1-ARCH.
Comment 13 Tino Lange 2017-03-15 11:55:37 UTC
I could reproduce this bug on Xubuntu 16.04 with the unmodified kernel-ppa mainline kernel from http://kernel.ubuntu.com/~kernel-ppa/mainline/v4.10/
The system is an Intel(R) Core(TM) i5 CPU 660.
Comment 14 Lutz Vieweg 2017-03-15 17:41:44 UTC
JFYI: I have allocated a machine that reproduces the error and am currently running "git bisect" with the mainline kernel.

It takes a horribly long time due to the very long recompile/install/reboot cycles, but I am down to "8 iterations" by now.
Comment 15 Lutz Vieweg 2017-03-15 19:41:59 UTC
At last, bisecting converged:

git bisect start
# bad: [c470abd4fde40ea6a0846a2beab642a578c0b8cd] Linux 4.10
git bisect bad c470abd4fde40ea6a0846a2beab642a578c0b8cd
# good: [69973b830859bc6529a7a0468ba0d80ee5117826] Linux 4.9
git bisect good 69973b830859bc6529a7a0468ba0d80ee5117826
# bad: [f4000cd99750065d5177555c0a805c97174d1b9f] Merge tag 'arm64-upstream' of git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux
git bisect bad f4000cd99750065d5177555c0a805c97174d1b9f
# bad: [7079efc9d3e7f1f7cdd34082ec58209026315057] Merge tag 'fbdev-4.10' of git://git.kernel.org/pub/scm/linux/kernel/git/tomba/linux
git bisect bad 7079efc9d3e7f1f7cdd34082ec58209026315057
# bad: [669bb4c58c3091cd54650e37c5f4e345dd12c564] Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/egtvedt/linux-avr32
git bisect bad 669bb4c58c3091cd54650e37c5f4e345dd12c564
# good: [7a8bca043cf1bb0433aa43d008b6c4de6c07d6a2] Merge branch 'sfc-tso-v2'
git bisect good 7a8bca043cf1bb0433aa43d008b6c4de6c07d6a2
# bad: [4f4f907a6729ae9e132810711c3a05e48311a948] Merge branch 'mvneta-64bit'
git bisect bad 4f4f907a6729ae9e132810711c3a05e48311a948
# good: [33f8a0458b2ce4546b681c5fae04427e3077a543] Merge tag 'wireless-drivers-next-for-davem-2016-11-25' of git://git.kernel.org/pub/scm/linux/kernel/git/kvalo/wireless-drivers-next
git bisect good 33f8a0458b2ce4546b681c5fae04427e3077a543
# good: [80439a1704e811697ee01fd09dd95dd10790bc93] qede: Remove 'num_tc'.
git bisect good 80439a1704e811697ee01fd09dd95dd10790bc93
# good: [5067b6020770ef7c8102f47079c9e577d175ef2c] net/mlx5e: Remove flow encap entry in the correct place
git bisect good 5067b6020770ef7c8102f47079c9e577d175ef2c
# bad: [7091d8c7055d7310339435ae3af2fb490a92524d] net/sched: cls_flower: Add offload support using egress Hardware device
git bisect bad 7091d8c7055d7310339435ae3af2fb490a92524d
# good: [b14945ac3efdf5217182a344b037f96d6b0afae1] net: atarilance: use %8ph for printing hex string
git bisect good b14945ac3efdf5217182a344b037f96d6b0afae1
# bad: [25429d7b7dca01dc4f17205de023a30ca09390d0] tcp: allow to turn tcp timestamp randomization off
git bisect bad 25429d7b7dca01dc4f17205de023a30ca09390d0
# good: [1d6cff4fca4366d0529dbce170e0f33cfe213790] qed: Add iSCSI out of order packet handling.
git bisect good 1d6cff4fca4366d0529dbce170e0f33cfe213790
# bad: [95a22caee396cef0bb2ca8fafdd82966a49367bb] tcp: randomize tcp timestamp offsets for each connection
git bisect bad 95a22caee396cef0bb2ca8fafdd82966a49367bb


So the culprit seems to be this change: 
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=95a22caee396cef0bb2ca8fafdd82966a49367bb

"tcp: randomize tcp timestamp offsets for each connection
jiffies based timestamps allow for easy inference of number of devices
behind NAT translators and also makes tracking of hosts simpler.

commit ceaa1fef65a7c2e ("tcp: adding a per-socket timestamp offset")
added the main infrastructure that is needed for per-connection ts
randomization, in particular writing/reading the on-wire tcp header
format takes the offset into account so rest of stack can use normal
tcp_time_stamp (jiffies).

So only two items are left:
 - add a tsoffset for request sockets
 - extend the tcp isn generator to also return another 32bit number
   in addition to the ISN.

Re-use of ISN generator also means timestamps are still monotonically
increasing for same connection quadruple, i.e. PAWS will still work.

Includes fixes from Eric Dumazet.

Signed-off-by: Florian Westphal <fw@strlen.de>
Acked-by: Eric Dumazet <edumazet@google.com>
Acked-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
"

I will try to attract some attention from the people mentioned above.
Comment 16 Lutz Vieweg 2017-03-15 19:51:14 UTC
I tested whether the value in /proc/sys/net/ipv4/tcp_timestamps makes a difference, but the bug occurs with all of 0, 1 and 2 in there.
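
(A sketch of that test, assuming the ncat loop from the original report is used as the check after each setting:)

> for v in 0 1 2 ; do
>     echo $v > /proc/sys/net/ipv4/tcp_timestamps
>     # ... re-run the ncat reproduction loop for a while and watch for timeouts ...
> done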
Comment 17 Lutz Vieweg 2017-03-15 19:56:08 UTC
I tried to revert the culprit change from linux-4.10 using

> patch --dry-run -R -p 1 < ../tcp_randomize.patch

- but alas, this does not apply without errors.
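
(For anyone building from a git checkout rather than patching a tarball, reverting the commit by hash may be a cleaner attempt, though it can of course still conflict with later changes that touched the same code:)

> git revert 95a22caee396cef0bb2ca8fafdd82966a49367bb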
Comment 18 Lutz Vieweg 2017-03-15 20:02:15 UTC
Eric Dumazet provided what looks like a functional work-around:

> echo 0 >/proc/sys/net/ipv4/tcp_tw_recycle
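
(To keep the workaround across reboots, the usual sysctl.d mechanism applies; a minimal sketch, with an arbitrary file name:)

> # /etc/sysctl.d/99-tcp-tw-recycle-workaround.conf
> net.ipv4.tcp_tw_recycle = 0

> # apply without rebooting
> sysctl --system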
Comment 19 Lutz Vieweg 2017-03-16 00:37:24 UTC
Find an email thread discussing how to fix this at http://marc.info/?l=linux-netdev&m=148960986332106&w=2
Comment 20 Longinos 2017-03-16 08:56:28 UTC
Just to confirm that in 4.10.3 the issue is still here.

The "echo 0 > /proc/sys/net/ipv4/tcp_tw_recycle" workaround works.

CentOS 7, 4.10.3-1.el7.elrepo.x86_64 kernel in a Linode KVM VM with Intel(R) Xeon(R) CPU E5-2680 v3 @ 2.50GHz processors.
Comment 21 SledgeSulaweyo 2017-03-16 18:05:12 UTC
Just to confirm, the workaround solved the issue on all my affected nodes.
Comment 22 Lutz Vieweg 2017-03-16 18:56:38 UTC
You might also want to read https://vincent.bernat.im/en/blog/2014-tcp-time-wait-state-linux on the somewhat disappointing options regarding prevention of "running out of ports".
Comment 23 Dor Askayo 2017-03-16 19:30:28 UTC
I've done a bit of analysis on the suggested workarounds. On Arch Linux with kernel version 4.10.2, these are the settings I'm getting by default after a reboot:

# cat /proc/sys/net/ipv4/tcp_tw_recycle
0
# cat /proc/sys/net/ipv4/tcp_timestamps
1

With these, connections to my router (through HTTP/HTTPS) are unreliable and often stall. However, in this configuration, the reproduction steps mentioned in the bug report don't actually result in "Ncat: Connection timed out."

Doing the following does indeed introduce the "Ncat: Connection timed out." issue reliably:

# echo 1 > /proc/sys/net/ipv4/tcp_tw_recycle
# cat /proc/sys/net/ipv4/tcp_tw_recycle
1
# cat /proc/sys/net/ipv4/tcp_timestamps
1

However, doing the following actually fixes both issues:

# echo 0 > /proc/sys/net/ipv4/tcp_timestamps
# cat /proc/sys/net/ipv4/tcp_timestamps
0
# cat /proc/sys/net/ipv4/tcp_tw_recycle
1

So it appears that changes to the code enabled by tcp_timestamps introduced regressions, and when coupled with tcp_tw_recycle, the symptoms get even worse. However, when tcp_timestamps is disabled, the issue disappears completely. The odd this is that this some of you found it otherwise. 

On a somewhat similar note, another issue I've run into for which I've yet to find a workaround is that Transmission's (https://transmissionbt.com/) listening port test always returns "closed" on 4.10.0+ even when the port is forwarded properly. Going back to 4.9.x, however, fixes this issue reliably.

In conclusion, I'm not sure if all of these are symptoms of the same bug or different bugs with different symptoms, but one thing is sure – something went very wrong with the networking code in this cycle.
Comment 24 Dor Askayo 2017-03-16 19:34:49 UTC
(In reply to Dor Askayo from comment #23)
> So it appears that changes to the code enabled by tcp_timestamps introduced
> regressions, and when coupled with tcp_tw_recycle, the symptoms get even
> worse. However, when tcp_timestamps is disabled, the issue disappears
> completely. The odd this is that this some of you found it otherwise. 
Sorry, I mistyped there. I meant to write: "The odd thing is that some of you found the situation to be otherwise, with tcp_timestamps not affecting the issue at all."
Comment 25 Dor Askayo 2017-03-16 19:52:53 UTC
(In reply to Dor Askayo from comment #23)
> On a somewhat similar note, another issue I've run into for which I've yet
> to find a workaround is that Transmission's (https://transmissionbt.com/)
> listening port test always returns "closed" on 4.10.0+ even when the port is
> forwarded properly. Going back to 4.9.x, however, fixes this issue reliably.
Please disregard this last point. I've been able to track down and fix my issue, and it's not directly related to the kernel. The other points still stand.
Comment 27 Dor Askayo 2017-03-24 19:18:05 UTC
The bug is quite serious and fixing it should be a priority for 4.10.x as well, in my opinion.
