Bug 11588
Summary: | stalled connections (pop3, nntp, smtp, ftp) | ||
---|---|---|---|
Product: | Networking | Reporter: | Dan (dan76) |
Component: | IPV4 | Assignee: | Stephen Hemminger (stephen) |
Status: | REJECTED INVALID | ||
Severity: | normal | CC: | a.p.zijlstra, akpm, mingo, tglx |
Priority: | P1 | ||
Hardware: | All | ||
OS: | Linux | ||
Kernel Version: | 2.6.25 and above | Subsystem: | |
Regression: | Yes | Bisected commit-id: |
Description
Dan
2008-09-18 13:31:26 UTC
Hm, I wonder if this could be related to the distcc hang which Peter saw.

Is there a way to debug this? Maybe using some Kernel Hacking options... Suggestions would be appreciated. Thanks.

Can you try the latest mainline, 2.6.27-rc7?

I tested with 2.6.27-rc7 and 2.6.27-rc8, but the problem still happens. What is the best way to debug this?

I am handing this off to the network people for the following reason:
> After many and many tries, my conclusion is that there's a bug in the timer
> code, which was changed in 2.6.25. I know that because it started on 2.6.25 and
> if I disable ntpd, the problem doesn't happen anymore or happen very seldom. If
> I disable high resolution timer, it helps so the problem will not happen
> frequently.
Disabling ntpd and/or high resolution timers makes the problem less frequent, but it does not solve it. Disabling those things is just changing the timing slightly which
I don't see how the timers are connected to the observation that it happens with pop3, smtp, nntp, ftp but not with http, ssh.
Also nmap -sS is not even remotely related to timers.
@netfolks: is there any timer-related code which I should look at and think about how to instrument?
FYI, Peter is still having the distcc issue where sendfile times out, which might be related or not.
Well, the problem was syslogd! Could you believe it? The following is the message I just posted explaining the problem:

On Thu, 30 Oct 2008 12:43:05 +0200 (EET) "Ilpo Järvinen" <ilpo.jarvinen@helsinki.fi> wrote:
> Perhaps we could try to solve it through stracing syslogd...

Well Ilpo, you're right. What I'm about to write here will make me very ashamed, but the truth must be told! The culprit was syslogd! Almost unbelievable, but I had been using an old syslogd version for about 5 years!

How can I be sure that it's syslogd's fault? Simply because I had a stall today, and when I killed syslogd everything went back to normal. Well, I reinstalled GNU inetutils 1.5 (which I had already installed before), but I don't know why it put syslogd in the /usr/local/libexec directory. But no problem.

I'll just wait a few more days to test whether syslogd alone is responsible for this, but I'm 90% sure it is. I'm posting this so that if someone, some day, has a similar problem, they can read this message and avoid all the problems I had.

I apologize for thinking that it was a kernel fault. Anyway, one more lesson I learned: do not keep old binaries lying around... ;)

Thanks everyone, mainly Ilpo, for giving me all the tools to reach this point.

Ps: just out of curiosity, I was using a syslogd binary from Mar 3, 2003! Extremely old! It is so old that it was compiled for Linux 2.2.5. Or maybe I was too lazy and copied it from another machine...

syslogd: ELF 32-bit LSB executable, Intel 80386, version 1 (SYSV), for GNU/Linux 2.2.5, dynamically linked (uses shared libs), not stripped

Ps2: I'll close the bug I opened on bugzilla.

Ps3: anyway, it's interesting how a small piece of the system (syslogd) can generate those kinds of problems... I mean, a simple error in syslogd could lead to a complete stall on connections, just because everything is waiting for it to log through /dev/log. Of course the problem was the binary, but it could have a timeout, so that even with a buggy syslogd it wouldn't cause such a stall on the system. I really don't know what changed from 2.6.24 to 2.6.25, but maybe 2.6.24 had such a timeout? Maybe I'm just silly writing that... you guys know much more than me.

Ps4: maybe now we can understand why nmap solved the issue...
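For anyone who wants to see the Ps3 effect in isolation, here is a minimal sketch of a writer stalling because nothing drains the other end of a Unix datagram socket. It assumes /dev/log behaves like such a socket (as it commonly does on Linux) and it does not reproduce the setup from this report: the socket pair stands in for /dev/log, the function name `sends_before_stall` is made up for the example, and the timeout is the one the logging path apparently lacked.

```python
import socket


def sends_before_stall(timeout=0.2, payload=b"x" * 1024, limit=100000):
    """Count datagrams accepted before a send to a never-reading peer stalls.

    Hypothetical helper for illustration only; not taken from syslogd or glibc.
    """
    # The reader end stands in for a hung syslogd: it never calls recv().
    writer, reader = socket.socketpair(socket.AF_UNIX, socket.SOCK_DGRAM)
    # Without a timeout, send() would simply block once the queue is full,
    # which is the kind of stall described in Ps3.
    writer.settimeout(timeout)
    sent = 0
    stalled = False
    try:
        while sent < limit:
            writer.send(payload)
            sent += 1
    except OSError:
        # On Linux this is a timeout; other platforms may report a full
        # queue with a different OSError.
        stalled = True
    finally:
        writer.close()
        reader.close()
    return sent, stalled


if __name__ == "__main__":
    count, stalled = sends_before_stall()
    print("datagrams accepted before stall:", count, "stalled:", stalled)
```

With the timeout in place the writer gives up and can carry on without logging; without it, every daemon that logs for each connection would sit in send() until the reader is restarted or killed.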