Bug 14749 - Kernel locks up after a few minutes of heavy surfing
Summary: Kernel locks up after a few minutes of heavy surfing
Status: RESOLVED CODE_FIX
Alias: None
Product: Networking
Classification: Unclassified
Component: IPV4
Hardware: All Linux
Importance: P1 high
Assignee: Stephen Hemminger
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2009-12-06 13:40 UTC by Chris Rankin
Modified: 2010-04-06 22:38 UTC
CC List: 0 users

See Also:
Kernel Version: 2.6.31.6
Subsystem:
Regression: Yes
Bisected commit-id:


Attachments
Warnings found in kernel, relating to network corruption. (6.25 KB, text/plain)
2009-12-06 13:40 UTC, Chris Rankin
.config file for bad kernel (75.33 KB, text/plain)
2009-12-09 01:25 UTC, Chris Rankin
Difference between Fedora's config and my own. (136.41 KB, text/plain)
2009-12-09 01:27 UTC, Chris Rankin
Output of /proc/cpuinfo (2.19 KB, text/plain)
2009-12-09 01:31 UTC, Chris Rankin
Output of /proc/meminfo (1.01 KB, text/plain)
2009-12-09 21:47 UTC, Chris Rankin
Output of scripts/ver_linux (1.69 KB, text/plain)
2009-12-09 21:47 UTC, Chris Rankin

Description Chris Rankin 2009-12-06 13:40:17 UTC
Created attachment 24049
Warnings found in kernel, relating to network corruption.

This bug is new as of 2.6.31.x kernels. After a short period of heavy surfing (e.g. lots of tabs open in Firefox), the kernel will suddenly stop responding. Nothing is written to the serial console, and the machine stops responding to pings. My only clue so far has been a warning which I found once in my dmesg log (attached).

I have already tried manually applying this patch from the upcoming -stable queue:

net-fix-sk_forward_alloc-corruption.patch

to no effect.

I am currently switching back to Fedora's 2.6.31.6-145.fc12.i686 kernel to see if it is more stable. (I cannot trust 2.6.31.6 any more.)
Comment 1 Andrew Morton 2009-12-07 21:54:16 UTC
(switched to email.  Please respond via emailed reply-to-all, not via the
bugzilla web interface).

On Sun, 6 Dec 2009 13:40:18 GMT
bugzilla-daemon@bugzilla.kernel.org wrote:

> http://bugzilla.kernel.org/show_bug.cgi?id=14749
> 
>            Summary: Kernel locks up after a few minutes of heavy surfing
>            Product: Networking
>            Version: 2.5
>     Kernel Version: 2.6.31.6
>           Platform: All
>         OS/Version: Linux
>               Tree: Mainline
>             Status: NEW
>           Severity: high
>           Priority: P1
>          Component: IPV4
>         AssignedTo: shemminger@linux-foundation.org
>         ReportedBy: rankincj@yahoo.com
>         Regression: Yes
> 
> 
> Created an attachment (id=24049)
>  --> (http://bugzilla.kernel.org/attachment.cgi?id=24049)
> Warnings found in kernel, relating to network corruption.
> 
> This bug is new as of 2.6.31.x kernels. After a short period of heavy surfing
> (e.g. lots of tabs open in Firefox), the kernel will suddenly stop
> responding.
> Nothing is written to the serial console, and the machine stops responding to
> pings. My only clue so far has been a warning which I found once in my dmesg
> log (attached).
> 
> I have already tried manually applying this patch from the upcoming -stable
> queue:
> 
> net-fix-sk_forward_alloc-corruption.patch
> 
> to no effect.
> 
> I am currently switching back to Fedora's 2.6.31.6-145.fc12.i686 kernel to
> see
> if it is more stable. (I cannot trust 2.6.31.6 any more.)
> 

Thanks.

A regression in the latest 2.6.31 -stable tree.

Are you really really sure that you applied that patch, recompiled,
reinstalled, etc?
Comment 2 Chris Rankin 2009-12-08 00:19:38 UTC
--- On Mon, 7/12/09, Andrew Morton <akpm@linux-foundation.org> wrote:
> A regression in the latest 2.6.31 -stable tree.
> 
> Are you really really sure that you applied that patch,
> recompiled, reinstalled, etc?

Yup, 'fraid so. And because you asked so nicely, I've just managed to reproduce the problem having first done "make distclean" and "make oldconfig", followed by "make" :-). (This was with F12's latest compiler gcc 4.4.2 20091027, BTW.) The symptom was the same - a complete system freeze without anything written to the serial console. So it's just a *guess* that it's network-related, but it does always seem to happen while I'm waiting for a web page to load in my browser...

I saw something interesting in 2.6.31.7 about a crash due to fragmentation:

ipv4: additional update of dev_net(dev) to struct *net in ip_fragment.c, NULL ptr OOPS

I'll try applying that patch too, to see if it makes any difference. Along with that other UDP-related thing I noticed:

udp: Fix udp_poll() and ioctl()

Cheers,
Chris
Comment 4 Chris Rankin 2009-12-08 00:38:28 UTC
One other point that seems worth mentioning: Fedora's 2.6.31.6-162.fc12.i686 kernel does *not* seem to have this problem, and neither did 2.6.31.6-145.fc12.i686 before it. (Fedora kernels have had KMS problems, but nothing that has stopped SysRq from working.)

Cheers,
Chris
Comment 5 Eric Dumazet 2009-12-08 03:03:35 UTC
Chris Rankin wrote:
> 
> I saw something interesting in 2.6.31.7 about a crash due to fragmentation:
> 
> ipv4: additional update of dev_net(dev) to struct *net in ip_fragment.c, NULL
> ptr OOPS
> 
> I'll try applying that patch too, to see if it makes any difference. Along
> with that other UDP-related thing I noticed:
> 
> udp: Fix udp_poll() and ioctl()
> 

These are all two-year-old UDP bugs (I spotted another one a few hours ago), and very rare.
I run heavy-duty servers with a lot of UDP traffic and have never caught a _single_ error,
so I am quite surprised it could happen on your machine on demand.

1) Do you have another NIC to try? It might be a buggy driver.
  (Neil Horman found an error in the Intel drivers a few hours ago that can corrupt skbs.)

2) Could you add the following debugging aid (the patch at the end of this message)?

3) Any chance you can do a git bisect?

Thanks


diff --git a/net/ipv4/af_inet.c b/net/ipv4/af_inet.c
index 7d12c6a..5a7a456 100644
--- a/net/ipv4/af_inet.c
+++ b/net/ipv4/af_inet.c
@@ -147,10 +147,15 @@ void inet_sock_destruct(struct sock *sk)
 		return;
 	}
 
-	WARN_ON(atomic_read(&sk->sk_rmem_alloc));
-	WARN_ON(atomic_read(&sk->sk_wmem_alloc));
-	WARN_ON(sk->sk_wmem_queued);
-	WARN_ON(sk->sk_forward_alloc);
+	WARN((atomic_read(&sk->sk_rmem_alloc) | atomic_read(&sk->sk_wmem_alloc) |
+	     sk->sk_wmem_queued | sk->sk_forward_alloc) != 0,
+	     "%s socket sk_rmem_alloc=%d sk_wmem_alloc=%d "
+	     "sk_wmem_queued=%d sk_forward_alloc=%d\n",
+	     sk->sk_prot->name,
+	     atomic_read(&sk->sk_rmem_alloc),
+	     atomic_read(&sk->sk_wmem_alloc),
+	     sk->sk_wmem_queued,
+	     sk->sk_forward_alloc);
 
 	kfree(inet->opt);
 	dst_release(sk->sk_dst_cache);
Comment 6 Chris Rankin 2009-12-08 09:03:17 UTC
--- On Tue, 8/12/09, Eric Dumazet <eric.dumazet@gmail.com> wrote:
> These are all two-year-old UDP bugs (I spotted another one a few
> hours ago), and very rare.

> I am quite surprised it could happen on your machine on
> demand.

Who said anything about "on demand"? It took about 30 minutes to freeze last time; I was starting to think that a complete recompile had fixed it!

For the record: I've only seen that dmesg warning I've reported *once*, and that didn't kill the machine immediately (hence I was able to report it in the first place).

> 1) Do you have another NIC adapter to try ? It might be a
> buggy driver. (Neil Horman found an error on Intel drivers some
> hours ago, that can corrupt skbs)

I can test any patches for the e1000 that apply to 2.6.31.x. However, the e1000 is an on-board device and I don't have another NIC. Fedora's 2.6.31.x kernels seem OK, though.

> 2) Could you add following debugging aid ?

Not a problem; I do have a serial console attached.

> 3) Any chance you can do a git bisect ?

How do you git-bisect a bug that you can't reproduce on demand? A negative is easy to spot, but a positive would be not experiencing a random freeze. As I said, I *almost* thought that I'd resolved the issue by recompiling last night.

Cheers,
Chris
Comment 7 Chris Rankin 2009-12-08 09:17:30 UTC
One other thing: this is an SMP machine with 2 physical hyper-threaded CPUs. All of its IP traffic is routed, via an e100 card, through a UP 200MHz Pentium MMX machine that is also running 2.6.31.6.

The Pentium MMX machine has been rock-solid so far.

Cheers,
Chris
Comment 9 Eric Dumazet 2009-12-08 11:21:16 UTC
Chris Rankin wrote:
> --- On Tue, 8/12/09, Eric Dumazet <eric.dumazet@gmail.com> wrote:
>> These are all two-year-old UDP bugs (I spotted another one a few
>> hours ago), and very rare.
> 
>> I am quite surprised it could happen on your machine on
>> demand.
> 
> Who said anything about "on demand"? It took about 30 minutes to freeze last
> time; 
> I was starting to think that a complete recompile had fixed it!
> 

30 minutes is pretty fast; this is why I said 'on demand'...

> For the record: I've only seen that dmesg warning I've reported *once*, and
> that didn't kill the machine immediately (hence I was able to report it in
> the first place).
> 
>> 1) Do you have another NIC adapter to try ? It might be a
>> buggy driver. (Neil Horman found an error on Intel drivers some
>> hours ago, that can corrupt skbs)
> 
> I can test any patches for a e1000 that apply to 2.6.31.x. But the e1000 is
> an on-board device and I don't have another. But Fedora's 2.6.31.x kernels
> seem OK.
> 
>> 2) Could you add following debugging aid ?
> 
> Not a problem; I do have a serial console attached.
> 
>> 3) Any chance you can do a git bisect ?
> 
> How do you git-bisect a bug that you can't reproduce on demand? A negative is
> easy to spot, but a positive would be not experiencing a random freeze. As I
> said, I *almost* thought that I'd resolved the issue by recompiling last
> night.
> 

Please wrap your lines at fewer than 70 characters.

If the Fedora kernel works, either it's just pure luck, or they found
a bug and didn't send the fix to mainline (unlikely).
Comment 10 Jarek Poplawski 2009-12-08 11:36:50 UTC
On 08-12-2009 12:21, Eric Dumazet wrote:
> If the Fedora kernel works, either it's just pure luck, or they found
> a bug and didn't send the fix to mainline (unlikely).

Is it the same .config?

Jarek P.
Comment 11 Neil Horman 2009-12-08 12:00:30 UTC
On Tue, Dec 08, 2009 at 01:03:15AM -0800, Chris Rankin wrote:
> --- On Tue, 8/12/09, Eric Dumazet <eric.dumazet@gmail.com> wrote:
> > These are all two-year-old UDP bugs (I spotted another one a few
> > hours ago), and very rare.
> 
> > I am quite surprised it could happen on your machine on
> > demand.
> 
> Who said anything about "on demand"? It took about 30 minutes to freeze last
> time; I was starting to think that a complete recompile had fixed it!
> 
> For the record: I've only seen that dmesg warning I've reported *once*, and
> that didn't kill the machine immediately (hence I was able to report it in
> the first place).
> 
30 minutes isn't too long to wait for an error to appear, I think.

> > 1) Do you have another NIC adapter to try ? It might be a
> > buggy driver. (Neil Horman found an error on Intel drivers some
> > hours ago, that can corrupt skbs)
> 
> I can test any patches for a e1000 that apply to 2.6.31.x. But the e1000 is
> an on-board device and I don't have another. But Fedora's 2.6.31.x kernels
> seem OK.
> 
Those patches I posted for the Intel drivers will apply cleanly pretty far back
in git, as that code hasn't changed much.  You might also consider turning on
slab debugging.  Many of the errors I encountered leading up to a fatal oops
weren't themselves fatal, and were hidden until we used slab debugging to
catch a bunch of redzone violations.

> > 2) Could you add following debugging aid ?
> 
> Not a problem; I do have a serial console attached.
> 
> > 3) Any chance you can do a git bisect ?
> 
> How do you git-bisect a bug that you can't reproduce on demand? A negative is
> easy to spot, but a positive would be not experiencing a random freeze. As I
> said, I *almost* thought that I'd resolved the issue by recompiling last
> night.
Well, it sounds like your longest time to failure is about 30 minutes.  Why not
write a script that runs your test for an hour at a stretch, plug that into
git bisect, and walk away?  You should have results in a day or so.

Regards
Neil
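
A minimal sketch of the kind of helper that could sit behind "git bisect run", written in Python. It is an illustration only, not something posted in this thread. The exit codes are the ones git bisect run documents (0 for good, 125 for skip, any other code from 1 to 127 for bad). The names soak, bisect_exit_code and probe are hypothetical. Because the failure here is a hard freeze, the sketch assumes the check runs on a second machine and that probe() reports whether the machine under test still answers (for example, to a ping).

```python
import time

# Exit codes that `git bisect run` understands.
GOOD = 0    # revision survived the whole soak period
BAD = 1     # revision froze (any code 1-127 except 125 means "bad")
SKIP = 125  # revision could not be tested (e.g. it failed to build or boot)


def soak(probe, duration_s, interval_s, clock=time.monotonic, sleep=time.sleep):
    """Poll probe() every interval_s seconds until duration_s has elapsed.

    Returns True if every probe succeeded, False at the first failure.
    clock and sleep are injectable so the loop can be exercised quickly.
    """
    deadline = clock() + duration_s
    while clock() < deadline:
        if not probe():
            return False
        sleep(interval_s)
    return True


def bisect_exit_code(booted, survived):
    """Translate the outcome of one test run into a bisect exit code."""
    if not booted:
        return SKIP
    return GOOD if survived else BAD
```

A wrapper script would build and boot the revision being tested, start the workload, call soak(probe, 3600, 10) and then exit with bisect_exit_code(...). With an intermittent bug a "good" verdict is only probabilistic, so a soak period well beyond the longest observed time to failure is the safer choice.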
Comment 12 Chris Rankin 2009-12-08 13:35:42 UTC
--- On Tue, 8/12/09, Jarek Poplawski <jarkao2@gmail.com> wrote:
> Is it the same .config?

Similar, but no. I'll attach the .config to the bug tonight.

Chris
Comment 13 Chris Rankin 2009-12-08 13:39:30 UTC
--- On Tue, 8/12/09, Neil Horman <nhorman@tuxdriver.com> wrote:
> 30 minutes isn't too long to wait for an error to appear, I think.

Except it's a very "busy" waiting process with me actively surfing the web. I can't automate that. I'm still not entirely sure what the trigger condition is.

Chris
Comment 14 Neil Horman 2009-12-08 13:42:07 UTC
On Tue, Dec 08, 2009 at 05:39:28AM -0800, Chris Rankin wrote:
> --- On Tue, 8/12/09, Neil Horman <nhorman@tuxdriver.com> wrote:
> > 30 minutes isn't too long to wait for an error to appear, I think.
> 
> Except it's a very "busy" waiting process with me actively surfing the web. I
> can't automate that. I'm still not entirely sure what the trigger condition
> is.
> 
Sure you can: generate a list of sites that you visited and access them all with
a curl or wget script.  I would imagine that's a reasonable way to trigger the
problem.

Neil
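
The replay idea above could be sketched as follows, in Python rather than as a curl or wget script. This is an illustration under stated assumptions: the URL list is a plain text file with one URL per line, and the function names (load_urls, fetch, replay) and the worker count are hypothetical. Fetching in parallel is meant to approximate many browser tabs loading at once; whether that is enough to trigger the freeze is not known.

```python
import concurrent.futures
import urllib.request


def load_urls(text):
    """Return the non-empty, non-comment lines of a URL list."""
    urls = []
    for line in text.splitlines():
        line = line.strip()
        if line and not line.startswith("#"):
            urls.append(line)
    return urls


def fetch(url, timeout=15):
    """Download one URL and return the number of bytes read."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return len(response.read())


def replay(urls, fetch_one=fetch, workers=16):
    """Fetch all URLs concurrently, roughly like a browser with many tabs.

    Returns a dict mapping each URL to a byte count, or to the exception
    raised while fetching it.
    """
    results = {}
    with concurrent.futures.ThreadPoolExecutor(max_workers=workers) as pool:
        futures = {pool.submit(fetch_one, url): url for url in urls}
        for future in concurrent.futures.as_completed(futures):
            url = futures[future]
            try:
                results[url] = future.result()
            except Exception as exc:
                # Keep going: one unreachable site must not stop the run.
                results[url] = exc
    return results
```

Calling replay(load_urls(open("urls.txt").read())) in a loop for an hour would give a repeatable workload; urls.txt is a placeholder name.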
Comment 15 Jarek Poplawski 2009-12-08 13:48:00 UTC
On Tue, Dec 08, 2009 at 05:35:40AM -0800, Chris Rankin wrote:
> --- On Tue, 8/12/09, Jarek Poplawski <jarkao2@gmail.com> wrote:
> > Is it the same .config?
> 
> Similar, but no. I'll attach the .config to the bug tonight.

...And a diff against Fedora's .config; plus, if possible, test whether
this difference matters.

Jarek P.
Comment 16 Eric Dumazet 2009-12-08 14:39:40 UTC
Neil Horman wrote:
> On Tue, Dec 08, 2009 at 05:39:28AM -0800, Chris Rankin wrote:
>> --- On Tue, 8/12/09, Neil Horman <nhorman@tuxdriver.com> wrote:
>>> 30 minutes isn't too long to wait for an error to appear, I think.
>> Except it's a very "busy" waiting process with me actively surfing the web.
>> I can't automate that. I'm still not entirely sure what the trigger
>> condition is.
>>
> Sure you can, generate a list of sites that you visited and access them all
> with
> a curl or wget script.  I would imagine thats a reasonable test to trigger
> the
> reproducer.

Yes, but I suspect a multithreading bug, or VM, or X11, or something.

Andi posted a futex patch that is worth trying if the machine is swapping a bit.

Chris, please provide as much information as you can:

# cat /proc/cpuinfo
# cat /proc/meminfo
# ps aux
# scripts/ver_linux
Comment 17 Chris Rankin 2009-12-09 01:25:47 UTC
Created attachment 24106
.config file for bad kernel
Comment 18 Chris Rankin 2009-12-09 01:27:47 UTC
Created attachment 24107
Difference between Fedora's config and my own.

Diff is output of:

diff -u /boot/config-2.6.31.6-162.fc12.i686 config
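
A unified diff of two .config files is long (the attachment is 136 KB) and also reports reordered lines. As a rough aid, the sketch below reduces two configs to the options whose values actually differ. It is written in Python for illustration; the function names parse_config and config_changes are hypothetical. It assumes the usual .config line forms, "CONFIG_X=value" and "# CONFIG_X is not set", and as a simplification treats an option that is absent from one file as "n".

```python
import re

_SET = re.compile(r"^(CONFIG_\w+)=(.*)$")
_UNSET = re.compile(r"^# (CONFIG_\w+) is not set$")


def parse_config(text):
    """Map each option in a kernel .config to its value ("n" when unset)."""
    options = {}
    for line in text.splitlines():
        line = line.strip()
        match = _SET.match(line)
        if match:
            options[match.group(1)] = match.group(2)
            continue
        match = _UNSET.match(line)
        if match:
            options[match.group(1)] = "n"
    return options


def config_changes(old_text, new_text):
    """Return {option: (old_value, new_value)} for options that differ.

    An option missing from one file is reported as "n" on that side
    (a simplification: absent and explicitly unset are treated alike).
    """
    old, new = parse_config(old_text), parse_config(new_text)
    changes = {}
    for name in sorted(set(old) | set(new)):
        before, after = old.get(name, "n"), new.get(name, "n")
        if before != after:
            changes[name] = (before, after)
    return changes
```

Passing the contents of the two files to config_changes() yields a dictionary that can be printed one option per line, which makes changes to options such as preemption or RCU settings easier to spot.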
Comment 19 Chris Rankin 2009-12-09 01:31:34 UTC
Created attachment 24108
Output of /proc/cpuinfo
Comment 20 Chris Rankin 2009-12-09 21:47:01 UTC
Created attachment 24123
Output of /proc/meminfo

This is from shortly after the machine was turned on, and before I started trying to reproduce the problem.
Comment 21 Chris Rankin 2009-12-09 21:47:43 UTC
Created attachment 24124
Output of scripts/ver_linux
Comment 22 Chris Rankin 2009-12-11 00:29:32 UTC
My patched 2.6.31.6 kernel has not crashed yet. I've been doing everything that I was doing before, too. It's still too early to know whether those two extra IPv4 patches have fixed the problem, though.

(I've been trying to sort my DNS out in the meantime: I've been suffering from slow DNS in Fedora, although the "fix" is apparently to disable IPv6 in Firefox?! So I'm not sure if that's relevant to recreating the crash.)
Comment 23 Jarek Poplawski 2009-12-15 07:55:03 UTC
On Tue, Dec 08, 2009 at 05:35:40AM -0800, Chris Rankin wrote:
> --- On Tue, 8/12/09, Jarek Poplawski <jarkao2@gmail.com> wrote:
> > Is it the same .config?
> 
> Similar, but no. I'll attach the .config to the bug tonight.

I can see quite a lot of differences, and some could matter here,
such as these:

-# CONFIG_PREEMPT_RCU is not set
+# CONFIG_TREE_RCU is not set
+CONFIG_PREEMPT_RCU=y
...
-CONFIG_PREEMPT_VOLUNTARY=y
-# CONFIG_PREEMPT is not set
+# CONFIG_PREEMPT_VOLUNTARY is not set
+CONFIG_PREEMPT=y

It's hard to guess, but at least this second patch mentioned by you
(ipv4: additional update of dev_net(dev) to struct *net in
ip_fragment.c) shouldn't matter here. Anyway, now 2.6.32.1 should be
preferred for testing (if possible).

Jarek P.
Comment 24 Chris Rankin 2009-12-15 08:47:56 UTC
(In reply to comment #23)
> It's hard to guess, but at least this second patch mentioned by you
> (ipv4: additional update of dev_net(dev) to struct *net in
> ip_fragment.c) shouldn't matter here. Anyway, now 2.6.32.1 should be
> preferred for testing (if possible).

My kernel still hasn't locked up again - I am starting to think that one of those last two patches "did the trick" (i.e. "udp: Fix udp_poll() and ioctl()").

I upgraded to 2.6.31.7 last night.
Comment 25 Chris Rankin 2010-04-06 22:38:35 UTC
No lockups any more, 2.6.32+ all fine so far.
