Bug 11588 - stalled connections (pop3, nntp, smtp, ftp)
Summary: stalled connections (pop3, nntp, smtp, ftp)
Status: REJECTED INVALID
Alias: None
Product: Networking
Classification: Unclassified
Component: IPV4
Hardware: All Linux
Importance: P1 normal
Assignee: Stephen Hemminger
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2008-09-18 13:31 UTC by Dâniel Fraga
Modified: 2008-11-01 22:56 UTC
CC List: 4 users

See Also:
Kernel Version: 2.6.25 and above
Subsystem:
Regression: Yes
Bisected commit-id:


Attachments

Description Dâniel Fraga 2008-09-18 13:31:26 UTC
Latest working kernel version: 2.6.24
Earliest failing kernel version: 2.6.25
Distribution: Linux from scratch
Hardware Environment: Intel Xeon 3040, x86_64
Software Environment: Linux 2.6.26, gcc 4.3.2
Problem Description:

The connection to the server running Linux sometimes stalls, and I can only recover it by running "nmap -sS <server>". We've been discussing it in this long thread:

http://kerneltrap.org/mailarchive/linux-netdev/2008/7/7/2374824

But my problem isn't related to tcp_frto, since I already tried disabling it and the problem persists. And every time the connection stalls, I have this in the log:

Sep 13 20:01:21 teleporto vmunix: C193.8. S=5.5.5.5 E=8TS00 RC00 T=1 D262DF PROTO=TCP SPT=4146 DPT=4899 WINDOW=65535 RES=0x00 SYN URGP=0 
Sep 13 20:01:22 teleporto vmunix: OOTPST45 P=89WNO=53 E=x0SNUG= 

Steps to reproduce:

After many, many tries, my conclusion is that there's a bug in the timer code, which was changed in 2.6.25. I know that because it started with 2.6.25, and if I disable ntpd, the problem doesn't happen anymore or happens very seldom. If I disable the high resolution timer, it helps, so the problem does not happen as frequently.

Unfortunately there's no reliable way to reproduce it. It may or may not happen.

***

What I can assure you is that running "nmap -sS server" from outside makes the connection work again. By "stalled connection" I mean that the server doesn't reply to a request. It happens only with old protocols (pop3, nntp, smtp, ftp). Other protocols, like http and ssh, don't suffer from stalled connections.

I am only posting this bug now because I first tried to solve it with the network people, but as it doesn't seem to be a network issue, I am posting it to the timer developers in case they have a clue. Thank you very much.

Ps: if I incorrectly chose "interval timers" as the Component, feel free to correct it.
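
The stall defined above (the server accepts the connection but never replies) can be checked from a client without a full mail or news client. POP3, NNTP, SMTP and FTP servers all send a greeting line as soon as a client connects, so a missing greeting is a usable symptom. The following is only a minimal sketch of such a probe; the function name `greeting_within` and the 10-second default are illustrative choices, not something used in this report.

```python
import socket


def greeting_within(host: str, port: int, timeout: float = 10.0) -> bool:
    """Return True if the server sends any bytes within `timeout` seconds.

    Hypothetical helper: a False result means the connect failed or the
    server stayed silent, which matches the "stalled connection" symptom.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout) as conn:
            conn.settimeout(timeout)
            # Greeting-first protocols speak before the client does.
            return len(conn.recv(512)) > 0
    except OSError:
        # Covers timeouts, refused connections and resets.
        return False
```

For example, `greeting_within("server.example", 110)` would probe POP3 on its standard port; `server.example` is a placeholder host name.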
Comment 1 Andrew Morton 2008-09-18 14:23:07 UTC
hm, I wonder if this could be related to the distcc hang which Peter saw..
Comment 2 Dâniel Fraga 2008-09-27 11:57:01 UTC
Is there a way to debug this? Maybe using some Kernel Hacking options... Suggestions would be appreciated. Thanks.
Comment 3 Thomas Gleixner 2008-09-27 12:14:14 UTC
can you try latest mainline 2.6.27-rc7 ?
Comment 4 Dâniel Fraga 2008-09-30 22:24:12 UTC
I tested with 2.6.27-rc7 and 2.6.27-rc8, but the problem still happens. What is the best way to debug this?
Comment 5 Thomas Gleixner 2008-10-01 02:09:46 UTC
I hand this off to the network people for the following reason:

> After many, many tries, my conclusion is that there's a bug in the timer
> code, which was changed in 2.6.25. I know that because it started with
> 2.6.25, and if I disable ntpd, the problem doesn't happen anymore or
> happens very seldom. If I disable the high resolution timer, it helps, so
> the problem does not happen as frequently.

Disabling ntpd and/or high resolution timers makes the problem less frequent, but it does not solve it. Disabling those things is just changing the timing slightly which

I don't see how the timers are connected to the observation that it happens with pop3, smtp, nntp, ftp but not with http, ssh.

Also nmap -sS is not even remotely related to timers.

@netfolks: is there any timer-related code which I should look at and think about how to instrument?

FYI, Peter is still having the distcc issue where sendfile times out, which might be related or not.
Comment 6 Dâniel Fraga 2008-11-01 22:56:48 UTC
Well, the problem was syslogd! Can you believe it? The following is the message I posted just now explaining the problem:

On Thu, 30 Oct 2008 12:43:05 +0200 (EET)
"Ilpo Järvinen" <ilpo.jarvinen@helsinki.fi> wrote:

> Perhaps we could try to solve it though stracing syslogd...

	Well Ilpo, you're right, what I'm about to write here will make
me very ashamed, but the truth must be told! The culprit was syslogd!
Almost unbelievable, but I had been using an old syslogd version for
about 5 years!

	How can I be sure that it's syslogd's fault? Simply because I
had a stall today and when I killed syslogd everything was back to
normal.

	Well, I reinstalled GNU inetutils 1.5 (which I had already
installed before), but I don't know why it put syslogd
in /usr/local/libexec directory.

	But no problem. I'll just wait a few more days to test whether
syslogd is solely responsible for this, but I'm 90% sure it is.

	So I'm just posting this so that if someone some day has
a similar problem, they can read this message and avoid all the problems I
had.

	I apologize for thinking that it was a kernel fault. Anyway,
one more lesson I learned: do not keep old binaries lying around... ;)

	Thanks everyone, mainly Ilpo, for giving me all the tools to reach
this point.

	Ps: just out of curiosity, I was using a syslogd binary from Mar 3,
2003! Extremely old! This is so old, it was compiled for Linux
2.2.5. Or maybe I was too lazy and copied it from another machine...

syslogd: ELF 32-bit LSB executable, Intel 80386, version 1 (SYSV), for
GNU/Linux 2.2.5, dynamically linked (uses shared libs), not stripped
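
The 32-bit/i386 facts in that output come from the first bytes of the ELF header, so stale binaries on an x86_64 system can also be spotted from a script. The following is a rough sketch that reads only the class byte and the machine field; `describe_elf` is a hypothetical helper name, and the lookup tables cover just the two architectures mentioned in this report.

```python
import struct

ELF_CLASS = {1: "32-bit", 2: "64-bit"}            # e_ident[EI_CLASS]
ELF_MACHINE = {3: "Intel 80386", 62: "x86-64"}    # e_machine values


def describe_elf(path):
    """Return (class, machine) for an ELF file, or None if it is not ELF."""
    with open(path, "rb") as f:
        header = f.read(20)
    if len(header) < 20 or header[:4] != b"\x7fELF":
        return None
    elf_class = ELF_CLASS.get(header[4], "unknown class")
    # e_ident[EI_DATA]: 1 means little-endian (LSB), 2 means big-endian.
    byte_order = "<" if header[5] == 1 else ">"
    # e_machine is a 16-bit field at offset 18, after e_ident and e_type.
    (machine,) = struct.unpack_from(byte_order + "H", header, 18)
    return elf_class, ELF_MACHINE.get(machine, f"machine {machine}")
```

Calling it on the syslogd path and on a binary known to be native makes the mismatch visible at a glance.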

	Ps2: I'll close the bug I opened on bugzilla.

	Ps3: anyway, it's interesting how a small piece of the system
(syslogd) can generate those kinds of problems... I mean, a simple
error in syslogd could lead to a complete stall of connections, just
because everything is waiting for it to log through /dev/log. Of course
the problem was the binary, but logging could have a timeout, so that even
a buggy syslogd wouldn't cause such a stall on the system. I really
don't know what changed from 2.6.24 to 2.6.25, but maybe 2.6.24 had
such a timeout? Maybe I'm just silly writing that... you guys know much
more than me.
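
The timeout idea in Ps3 can be sketched from the client side: a service that gives up on a log message after a short wait cannot be held hostage by a logger that has stopped reading. The following is a minimal illustration only, assuming the local log socket is a Unix datagram socket at /dev/log; the name `log_with_timeout` and the 0.5-second default are hypothetical, and this is not how any of the daemons in this report actually log.

```python
import socket


def log_with_timeout(message: str, path: str = "/dev/log",
                     timeout: float = 0.5) -> bool:
    """Try to hand one message to the log socket; give up after `timeout`.

    Returns False instead of blocking when the reader has stalled and its
    receive queue is full, or when the socket is missing or refuses us.
    """
    sock = socket.socket(socket.AF_UNIX, socket.SOCK_DGRAM)
    sock.settimeout(timeout)
    try:
        sock.connect(path)
        sock.send(message.encode("utf-8", "replace"))
        return True
    except OSError:
        # socket.timeout is a subclass of OSError, so stalls land here too.
        return False
    finally:
        sock.close()
```

The trade-off is that a message is dropped when the logger is stuck, which for a mail or FTP daemon is usually preferable to refusing service.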

	Ps4: maybe now we can understand why nmap solved the issue...
