Created attachment 24049 [details] Warnings found in kernel, relating to network corruption. This bug is new as of 2.6.31.x kernels. After a short period of heavy surfing (e.g. lots of tabs open in Firefox), the kernel will suddenly stop responding. Nothing is written to the serial console, and the machine stops responding to pings. My only clue so far has been a warning which I found once in my dmesg log (attached). I have already tried manually applying this patch from the upcoming -stable queue: net-fix-sk_forward_alloc-corruption.patch to no effect. I am currently switching back to Fedora's 2.6.31.6-145.fc12.i686 kernel to see if it is more stable. (I cannot trust 2.6.31.6 any more.)
(switched to email. Please respond via emailed reply-to-all, not via the bugzilla web interface).

On Sun, 6 Dec 2009 13:40:18 GMT bugzilla-daemon@bugzilla.kernel.org wrote:

> http://bugzilla.kernel.org/show_bug.cgi?id=14749
>
>            Summary: Kernel locks up after a few minutes of heavy surfing
>            Product: Networking
>            Version: 2.5
>     Kernel Version: 2.6.31.6
>           Platform: All
>         OS/Version: Linux
>               Tree: Mainline
>             Status: NEW
>           Severity: high
>           Priority: P1
>          Component: IPV4
>         AssignedTo: shemminger@linux-foundation.org
>         ReportedBy: rankincj@yahoo.com
>         Regression: Yes
>
>
> Created an attachment (id=24049)
>  --> (http://bugzilla.kernel.org/attachment.cgi?id=24049)
> Warnings found in kernel, relating to network corruption.
>
> This bug is new as of 2.6.31.x kernels. After a short period of heavy surfing
> (e.g. lots of tabs open in Firefox), the kernel will suddenly stop responding.
> Nothing is written to the serial console, and the machine stops responding to
> pings. My only clue so far has been a warning which I found once in my dmesg
> log (attached).
>
> I have already tried manually applying this patch from the upcoming -stable
> queue:
>
> net-fix-sk_forward_alloc-corruption.patch
>
> to no effect.
>
> I am currently switching back to Fedora's 2.6.31.6-145.fc12.i686 kernel to see
> if it is more stable. (I cannot trust 2.6.31.6 any more.)
>

Thanks.

A regression in the latest 2.6.31 -stable tree.

Are you really really sure that you applied that patch, recompiled, reinstalled, etc?
--- On Mon, 7/12/09, Andrew Morton <akpm@linux-foundation.org> wrote:

> A regression in the latest 2.6.31 -stable tree.
>
> Are you really really sure that you applied that patch,
> recompiled, reinstalled, etc?

Yup, 'fraid so. And because you asked so nicely, I've just managed to reproduce the problem having first done "make distclean" and "make oldconfig", followed by "make" :-). (This was with F12's latest compiler gcc 4.4.2 20091027, BTW.)

The symptom was the same - a complete system freeze without anything written to the serial console. So it's just a *guess* that it's network-related, but it does always seem to happen while I'm waiting for a web page to load in my browser...

I saw something interesting in 2.6.31.7 about a crash due to fragmentation:

    ipv4: additional update of dev_net(dev) to struct *net in ip_fragment.c, NULL ptr OOPS

I'll try applying that patch too, to see if it makes any difference. Along with that other UDP-related thing I noticed:

    udp: Fix udp_poll() and ioctl()

Cheers,
Chris
One other point that seems worth mentioning: Fedora's 2.6.31.6-162.fc12.i686 kernel does *not* seem to have this problem, and neither did 2.6.31.6-145.fc12.i686 before it. (Fedora kernels have had KMS problems, but nothing that has stopped SysRq from working.) Cheers, Chris
Chris Rankin wrote:
>
> I saw something interesting in 2.6.31.7 about a crash due to fragmentation:
>
> ipv4: additional update of dev_net(dev) to struct *net in ip_fragment.c, NULL
> ptr OOPS
>
> I'll try applying that patch too, to see if it makes any difference. Along
> with that other UDP-related thing I noticed:
>
> udp: Fix udp_poll() and ioctl()
>

It's all two-year-old UDP bugs (I spotted another one some hours ago), and very rare.

I run heavy-duty servers with a lot of UDP traffic and have never caught a _single_ error, so I am quite surprised it could happen on your machine on demand.

1) Do you have another NIC adapter to try? It might be a buggy driver. (Neil Horman found an error in the Intel drivers some hours ago that can corrupt skbs.)

2) Could you add the following debugging aid? It replaces the four separate WARN_ON() checks in inet_sock_destruct() with a single WARN() that prints the protocol name and all four counters.

3) Any chance you can do a git bisect?

Thanks

diff --git a/net/ipv4/af_inet.c b/net/ipv4/af_inet.c
index 7d12c6a..5a7a456 100644
--- a/net/ipv4/af_inet.c
+++ b/net/ipv4/af_inet.c
@@ -147,10 +147,15 @@ void inet_sock_destruct(struct sock *sk)
 		return;
 	}
 
-	WARN_ON(atomic_read(&sk->sk_rmem_alloc));
-	WARN_ON(atomic_read(&sk->sk_wmem_alloc));
-	WARN_ON(sk->sk_wmem_queued);
-	WARN_ON(sk->sk_forward_alloc);
+	WARN((atomic_read(&sk->sk_rmem_alloc) | atomic_read(&sk->sk_wmem_alloc) |
+	      sk->sk_wmem_queued | sk->sk_forward_alloc) != 0,
+	      "%s socket sk_rmem_alloc=%d sk_wmem_alloc=%d "
+	      "sk_wmem_queued=%d sk_forward_alloc=%d\n",
+		sk->sk_prot->name,
+		atomic_read(&sk->sk_rmem_alloc),
+		atomic_read(&sk->sk_wmem_alloc),
+		sk->sk_wmem_queued,
+		sk->sk_forward_alloc);
 
 	kfree(inet->opt);
 	dst_release(sk->sk_dst_cache);
--- On Tue, 8/12/09, Eric Dumazet <eric.dumazet@gmail.com> wrote:

> It's all two-year-old UDP bugs (I spotted another one some
> hours ago), and very rare.

> I am quite surprised it could happen on your machine on
> demand.

Who said anything about "on demand"? It took about 30 minutes to freeze last time; I was starting to think that a complete recompile had fixed it!

For the record: I've only seen that dmesg warning I've reported *once*, and that didn't kill the machine immediately (hence I was able to report it in the first place).

> 1) Do you have another NIC adapter to try ? It might be a
> buggy driver. (Neil Horman found an error on Intel drivers some
> hours ago, that can corrupt skbs)

I can test any patches for an e1000 that apply to 2.6.31.x. But the e1000 is an on-board device and I don't have another. But Fedora's 2.6.31.x kernels seem OK.

> 2) Could you add following debugging aid ?

Not a problem; I do have a serial console attached.

> 3) Any chance you can do a git bisect ?

How do you git-bisect a bug that you can't reproduce on demand? A negative is easy to spot, but a positive would be not experiencing a random freeze. As I said, I *almost* thought that I'd resolved the issue by recompiling last night.

Cheers,
Chris
One other thing: this is an SMP machine with 2 physical hyper-threaded CPUs in. And all its IP traffic is routed through a UP 200MHz Pentium MMX machine that is also running 2.6.31.6 via an e100 card. The Pentium MMX machine has been rock-solid so far. Cheers, Chris
Chris Rankin wrote:
> --- On Tue, 8/12/09, Eric Dumazet <eric.dumazet@gmail.com> wrote:
>> It's all two-year-old UDP bugs (I spotted another one some
>> hours ago), and very rare.
>
>> I am quite surprised it could happen on your machine on
>> demand.
>
> Who said anything about "on demand"? It took about 30 minutes to freeze last
> time; I was starting to think that a complete recompile had fixed it!
>

30 minutes is pretty fast, this is why I said 'on demand'...

> For the record: I've only seen that dmesg warning I've reported *once*, and
> that didn't kill the machine immediately (hence I was able to report it in
> the first place).
>
>> 1) Do you have another NIC adapter to try ? It might be a
>> buggy driver. (Neil Horman found an error on Intel drivers some
>> hours ago, that can corrupt skbs)
>
> I can test any patches for a e1000 that apply to 2.6.31.x. But the e1000 is
> an on-board device and I don't have another. But Fedora's 2.6.31.x kernels
> seem OK.
>
>> 2) Could you add following debugging aid ?
>
> Not a problem; I do have a serial console attached.
>
>> 3) Any chance you can do a git bisect ?
>
> How do you git-bisect a bug that you can't reproduce on demand? A negative is
> easy to spot, but a positive would be not experiencing a random freeze. As I
> said, I *almost* thought that I'd resolved the issue by recompiling last
> night.
>

Please wrap your lines to < 70 characters.

If the Fedora kernel works, either it's just pure luck, or they found a bug and didn't send the fix to mainline (unlikely).
On 08-12-2009 12:21, Eric Dumazet wrote:

> If the Fedora kernel works, either it's just pure luck, or they found
> a bug and didn't send the fix to mainline (unlikely).

Is it the same .config?

Jarek P.
On Tue, Dec 08, 2009 at 01:03:15AM -0800, Chris Rankin wrote:
> --- On Tue, 8/12/09, Eric Dumazet <eric.dumazet@gmail.com> wrote:
> > It's all two-year-old UDP bugs (I spotted another one some
> > hours ago), and very rare.
>
> > I am quite surprised it could happen on your machine on
> > demand.
>
> Who said anything about "on demand"? It took about 30 minutes to freeze last
> time; I was starting to think that a complete recompile had fixed it!
>
> For the record: I've only seen that dmesg warning I've reported *once*, and
> that didn't kill the machine immediately (hence I was able to report it in
> the first place).
>

30 minutes isn't too long to wait for an error to appear, I think.

> > 1) Do you have another NIC adapter to try ? It might be a
> > buggy driver. (Neil Horman found an error on Intel drivers some
> > hours ago, that can corrupt skbs)
>
> I can test any patches for a e1000 that apply to 2.6.31.x. But the e1000 is
> an on-board device and I don't have another. But Fedora's 2.6.31.x kernels
> seem OK.
>

Those patches I posted for the Intel drivers will apply cleanly pretty far back in git, as that code hasn't changed much. You might also consider turning on slab debugging. Many of the errors I encountered leading up to a fatal oops weren't themselves fatal, and were hidden until such time as we used slab debugging to catch a bunch of redzone violations.

> > 2) Could you add following debugging aid ?
>
> Not a problem; I do have a serial console attached.
>
> > 3) Any chance you can do a git bisect ?
>
> How do you git-bisect a bug that you can't reproduce on demand? A negative is
> easy to spot, but a positive would be not experiencing a random freeze. As I
> said, I *almost* thought that I'd resolved the issue by recompiling last
> night.

Well, it sounds like your longest time to failure is about 30 minutes. Why not write a script that runs your test for an hour at a stretch, plug that into git bisect, and walk away? You should have results in a day or so.

Regards
Neil
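A minimal sketch of the "run for an hour, then decide" step, under some loud assumptions: the freeze leaves the machine unable to report anything itself, so the watcher runs on a second machine and relies on the reported symptom that a frozen box stops answering pings. The function names, the one-hour window and the three-miss threshold are all hypothetical, and building, installing and booting each candidate kernel is not automated here.

```python
import subprocess
import time


def ping_once(host, timeout_s=2):
    """Return True if a single ICMP echo to `host` is answered (Linux ping)."""
    result = subprocess.run(
        ["ping", "-c", "1", "-W", str(timeout_s), host],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    return result.returncode == 0


def soak(probe, duration_s=3600, interval_s=5, max_misses=3,
         clock=time.monotonic, sleep=time.sleep):
    """Watch a target for `duration_s` seconds.

    Returns "bad" as soon as `max_misses` consecutive probes fail (the
    target is presumed frozen), otherwise "good" once the time is up.
    `clock` and `sleep` are injectable so the loop can be checked quickly.
    """
    deadline = clock() + duration_s
    misses = 0
    while clock() < deadline:
        if probe():
            misses = 0
        else:
            misses += 1
            if misses >= max_misses:
                return "bad"
        sleep(interval_s)
    return "good"
```

With the candidate kernel booted and the usual workload running on it, something like `soak(lambda: ping_once("192.0.2.10"))` (a placeholder address) on the watching machine yields the verdict to pass to `git bisect good` or `git bisect bad`. A "good" here only means "no freeze within the window", so a longer window reduces the risk of mislabelling a bad kernel.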
--- On Tue, 8/12/09, Jarek Poplawski <jarkao2@gmail.com> wrote: > Is it the same .config? Similar, but no. I'll attach the .config to the bug tonight. Chris
--- On Tue, 8/12/09, Neil Horman <nhorman@tuxdriver.com> wrote: > 30 minutes isn't too long to wait for an error to appear, I think. Except it's a very "busy" waiting process with me actively surfing the web. I can't automate that. I'm still not entirely sure what the trigger condition is. Chris
On Tue, Dec 08, 2009 at 05:39:28AM -0800, Chris Rankin wrote:
> --- On Tue, 8/12/09, Neil Horman <nhorman@tuxdriver.com> wrote:
> > 30 minutes isn't too long to wait for an error to appear, I think.
>
> Except it's a very "busy" waiting process with me actively surfing the web. I
> can't automate that. I'm still not entirely sure what the trigger condition
> is.
>

Sure you can: generate a list of sites that you visited and access them all with a curl or wget script. I would imagine that's a reasonable test to trigger the reproducer.

Neil
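A rough sketch of such a replay script, written in Python rather than as a curl or wget loop. It assumes a plain-text list with one URL per line (a hypothetical format), and the helper names, worker count and timeout are illustrative choices. It exercises only the network path with parallel downloads; it does not reproduce whatever the browser, its threads or X11 contribute.

```python
import http.client
import urllib.request
from concurrent.futures import ThreadPoolExecutor


def read_urls(text):
    """Parse one URL per line, skipping blank lines and '#' comments."""
    urls = []
    for line in text.splitlines():
        line = line.strip()
        if line and not line.startswith("#"):
            urls.append(line)
    return urls


def fetch(url, timeout_s=10):
    """Download `url`; return the number of bytes read, or -1 on any error."""
    try:
        with urllib.request.urlopen(url, timeout=timeout_s) as response:
            return len(response.read())
    except (OSError, ValueError, http.client.HTTPException):
        return -1


def replay(urls, rounds=1, workers=8, fetch_one=fetch):
    """Fetch every URL `rounds` times using `workers` parallel threads.

    Returns a (successes, failures) tuple.
    """
    jobs = list(urls) * rounds
    with ThreadPoolExecutor(max_workers=workers) as pool:
        sizes = list(pool.map(fetch_one, jobs))
    failures = sum(1 for size in sizes if size < 0)
    return len(sizes) - failures, failures
```

Reading the list with `read_urls(open("visited.txt").read())` (a hypothetical file name) and calling `replay(urls, rounds=50)` keeps a steady stream of concurrent connections open, which is about as close to "lots of tabs loading" as a script gets without a browser.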
On Tue, Dec 08, 2009 at 05:35:40AM -0800, Chris Rankin wrote:
> --- On Tue, 8/12/09, Jarek Poplawski <jarkao2@gmail.com> wrote:
> > Is it the same .config?
>
> Similar, but no. I'll attach the .config to the bug tonight.

...And a diff against Fedora's .config; plus, if possible, test whether any of those differences matter.

Jarek P.
Neil Horman wrote:
> On Tue, Dec 08, 2009 at 05:39:28AM -0800, Chris Rankin wrote:
>> --- On Tue, 8/12/09, Neil Horman <nhorman@tuxdriver.com> wrote:
>>> 30 minutes isn't too long to wait for an error to appear, I think.
>> Except it's a very "busy" waiting process with me actively surfing the web.
>> I can't automate that. I'm still not entirely sure what the trigger
>> condition is.
>>
> Sure you can, generate a list of sites that you visited and access them all
> with a curl or wget script. I would imagine thats a reasonable test to
> trigger the reproducer.

Yes, but I suspect a multi-threading bug, or vm, or X11, or something.

Andi posted a futex patch that is worth trying if the machine is swapping a bit.

Chris, please provide as much information as you can:

# cat /proc/cpuinfo
# cat /proc/meminfo
# ps aux
# scripts/ver_linux
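For convenience, the four commands above can be gathered into one report, roughly as sketched below. The function names and report layout are hypothetical; `scripts/ver_linux` is a relative path, so the sketch assumes it is run from the top of the kernel source tree.

```python
import subprocess

# The commands requested above; scripts/ver_linux is relative to the
# top of the kernel source tree.
COMMANDS = [
    ["cat", "/proc/cpuinfo"],
    ["cat", "/proc/meminfo"],
    ["ps", "aux"],
    ["scripts/ver_linux"],
]


def run(argv):
    """Run one command and return its combined stdout/stderr as text."""
    try:
        done = subprocess.run(
            argv,
            stdout=subprocess.PIPE,
            stderr=subprocess.STDOUT,
            universal_newlines=True,
        )
        return done.stdout
    except OSError as exc:
        return "failed to run: {}\n".format(exc)


def collect(commands=COMMANDS, runner=run):
    """Return one text report with a '# command' header per section."""
    sections = []
    for argv in commands:
        sections.append("# " + " ".join(argv) + "\n" + runner(argv))
    return "\n".join(sections)
```

Writing `collect()` to a file gives a single attachment for the bug. Capturing it once right after boot and once shortly before a freeze is expected would make the memory and process figures easier to compare.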
Created attachment 24106 [details] .config file for bad kernel
Created attachment 24107 [details] Difference between Fedora's config and my own. Diff is output of: diff -u /boot/config-2.6.31.6-162.fc12.i686 config
Created attachment 24108 [details] Output of /proc/cpuinfo
Created attachment 24123 [details] Output of /proc/meminfo This is from shortly after the machine was turned on, and before I started trying to reproduce the problem.
Created attachment 24124 [details] Output of scripts/ver_linux
My patched 2.6.31.6 kernel has not crashed yet. I've been doing everything that I was doing before, too. It's still too early to know whether those two extra IPv4 patches have fixed the problem, though. (I've been trying to sort my DNS out in the meantime: I've been suffering from slow DNS in Fedora, although the "fix" is apparently to disable IPv6 in Firefox?! So I'm not sure if that's relevant to recreating the crash.)
On Tue, Dec 08, 2009 at 05:35:40AM -0800, Chris Rankin wrote:
> --- On Tue, 8/12/09, Jarek Poplawski <jarkao2@gmail.com> wrote:
> > Is it the same .config?
>
> Similar, but no. I'll attach the .config to the bug tonight.

I can see quite a lot of differences, and some could matter here, e.g. these:

-# CONFIG_PREEMPT_RCU is not set
+# CONFIG_TREE_RCU is not set
+CONFIG_PREEMPT_RCU=y
...
-CONFIG_PREEMPT_VOLUNTARY=y
-# CONFIG_PREEMPT is not set
+# CONFIG_PREEMPT_VOLUNTARY is not set
+CONFIG_PREEMPT=y

It's hard to guess, but at least this second patch mentioned by you (ipv4: additional update of dev_net(dev) to struct *net in ip_fragment.c) shouldn't matter here. Anyway, now 2.6.32.1 should be preferred for testing (if possible).

Jarek P.
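When two configs differ in many places, it can help to reduce them to a per-option before/after table instead of reading a raw `diff -u`. The sketch below is one way to do that; the function names are hypothetical, "# CONFIG_X is not set" is mapped to "n", and an option missing from a file is reported as None.

```python
def parse_config(text):
    """Map option name -> value; '# CONFIG_X is not set' becomes 'n'."""
    suffix = " is not set"
    options = {}
    for line in text.splitlines():
        line = line.strip()
        if line.startswith("# CONFIG_") and line.endswith(suffix):
            options[line[2:-len(suffix)]] = "n"
        elif line.startswith("CONFIG_") and "=" in line:
            name, value = line.split("=", 1)
            options[name] = value
    return options


def config_changes(old_text, new_text):
    """Return {option: (old_value, new_value)} for options that differ.

    An option absent from one file is reported with None on that side.
    """
    old, new = parse_config(old_text), parse_config(new_text)
    changes = {}
    for name in sorted(set(old) | set(new)):
        before, after = old.get(name), new.get(name)
        if before != after:
            changes[name] = (before, after)
    return changes
```

Fed the two sides of the excerpt above, this reports `CONFIG_PREEMPT` going from "n" to "y" and `CONFIG_PREEMPT_VOLUNTARY` from "y" to "n", which is the preemption-model difference worth testing first.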
(In reply to comment #23) > It's hard to guess, but at least this second patch mentioned by you > (ipv4: additional update of dev_net(dev) to struct *net in > ip_fragment.c) shouldn't matter here. Anyway, now 2.6.32.1 should be > preferred for testing (if possible). My kernel still hasn't locked up again - I am starting to think that one of those last two patches "did the trick" (i.e. "udp: Fix udp_poll() and ioctl()"). I upgraded to 2.6.31.7 last night.
No lockups any more, 2.6.32+ all fine so far.