Bug 10238

Summary: netconsole still hangs
Product: Networking Reporter: Rafael J. Wysocki (rjw)
Component: Other Assignee: Arnaldo Carvalho de Melo (acme)
Status: CLOSED CODE_FIX    
Severity: normal CC: akpm, jarkao2, stephen
Priority: P1    
Hardware: All   
OS: Linux   
Kernel Version: 2.6.25-rc5-git2 Subsystem:
Regression: Yes Bisected commit-id:
Bug Depends on:    
Bug Blocks: 9832    

Description Rafael J. Wysocki 2008-03-13 09:25:28 UTC
Subject    : netconsole still hangs
Submitter  : Andrew Morton <akpm@linux-foundation.org>
Date       : 2008-03-12 23:14
References : http://marc.info/?t=120536379200004&r=1&w=2
Handled-By : David Miller <davem@davemloft.net>
Handled-By : Stephen Hemminger <shemminger@linux-foundation.org>

This entry is being used for tracking a regression from 2.6.24.  Please don't
close it until the problem is fixed in the mainline.
Comment 1 Anonymous Emailer 2008-03-18 01:50:40 UTC
Reply-To: akpm@linux-foundation.org

On Tue, 18 Mar 2008 08:04:39 +0000 Jarek Poplawski <jarkao2@gmail.com> wrote:

> On Mon, Mar 17, 2008 at 04:12:22PM -0700, Andrew Morton wrote:
> ...
> > I retested.  This patch doesn't appear to make anything worse, but the hang
> > is still there.  
> 
> Yes, but since this doesn't look like something very common, and we
> don't even know if this OOPS and the hangs are the same bug, more
> information is needed, e.g.:
> 
> - is it reproducible with e1000E only and no wlan?

Yes.  Both the machines I can reproduce this on have both E1000=y and
E1000E=y.  From the dmesg (below), one uses e1000 and the other uses
e1000e.  Both crash.  

http://userweb.kernel.org/~akpm/config-akpm2.txt
http://userweb.kernel.org/~akpm/dmesg-akpm2.txt

http://userweb.kernel.org/~akpm/config-t61p.txt
http://userweb.kernel.org/~akpm/dmesg-t61p.txt

I used to be able to reproduce the problems with a 2-way i386 e100 system,
but that seems to be fixed now, perhaps from David's revert.

I also used to be able to reproduce the problem on a one-way i386 e100
machine but that also seems to have gone away.

> - is there a possibility to check this with some other card
>   (even wlan while e1000E is off)?

err, dunno.  Perhaps I could try e1000 on the e1000e-using machine and vice
versa, but for that some PCI ID table hacking might be needed.

I cc'ed bugzilla on this thread.

> - could you add .config to the bugzilla report:
>   http://bugzilla.kernel.org/show_bug.cgi?id=10238

See above.

> - is it acceptable to send you some patches for debugging this?

As a last resort.  But it'd surely be better if a net developer could
reproduce this and do some work on it.  It's bog-trivial to reproduce here
and afaik nobody has even tried.  Perhaps you have...

service syslog stop
while true
do
	echo t > /proc/sysrq-trigger
done

and that's it.
Comment 2 Jarek Poplawski 2008-03-18 14:02:18 UTC
Andrew Morton wrote, On 03/18/2008 09:50 AM:
...
> As a last resort.  But it'd surely be better if a net developer could
> reproduce this and do some work on it.  It's bog-trivial to reproduce here
> and afaik nobody has even tried.  Perhaps you have...
> 
> service syslog stop
> while true
> do
>       echo t > /proc/sysrq-trigger
> done
> 
> and that's it.

Alas, my testing possibilities, especially with a real network, are very
limited, but I can confirm: yes, the above test really hangs my box, even
with syslog on and netconsole off. So maybe I'm missing something, but I
don't understand why you expect netconsole to endure this.

IMHO, after the below patch to sched.c you can't compare netconsole to
2.6.24 with this sysrq-trigger test; any bugs found with this could be
something old and not necessarily in netconsole (they could be merely
exposed by netconsole, like the earlier mentioned, unexplained OOPS that
probably followed a double kfree).

Regards,
Jarek P.

From: Nick Piggin <nickpiggin@yahoo.com.au>
Date: Fri, 25 Jan 2008 20:08:34 +0000 (+0100)
Subject: sched: print backtrace of running tasks too
X-Git-Tag: v2.6.25-rc1~1237^2~3
X-Git-Url: http://git.kernel.org/?p=linux%2Fkernel%2Fgit%2Ftorvalds%2Flinux-2.6.git;a=commitdiff_plain;h=5fb5e6de55860a99c2d8fe7e0c8222d5c53d8464

sched: print backtrace of running tasks too

The attached patch is something really simple that can sometimes help
in getting more info out of a hung system.

Signed-off-by: Ingo Molnar <mingo@elte.hu>
---

diff --git a/kernel/sched.c b/kernel/sched.c
index 4d3a5a7..524285e 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -5161,8 +5161,7 @@ void sched_show_task(struct task_struct *p)
 	printk(KERN_CONT "%5lu %5d %6d\n", free,
 		task_pid_nr(p), task_pid_nr(p->real_parent));
 
-	if (state != TASK_RUNNING)
-		show_stack(p, NULL);
+	show_stack(p, NULL);
 }
 
 void show_state_filter(unsigned long state_filter)

http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commitdiff;h=5fb5e6de55860a99c2d8fe7e0c8222d5c53d8464
Comment 3 Anonymous Emailer 2008-03-18 14:48:30 UTC
Reply-To: akpm@linux-foundation.org

On Tue, 18 Mar 2008 22:05:42 +0100
Jarek Poplawski <jarkao2@gmail.com> wrote:

> Andrew Morton wrote, On 03/18/2008 09:50 AM:
> ...
> > As a last resort.  But it'd surely be better if a net developer could
> > reproduce this and do some work on it.  It's bog-trivial to reproduce here
> > and afaik nobody has even tried.  Perhaps you have...
> > 
> > service syslog stop
> > while true
> > do
> >     echo t > /proc/sysrq-trigger
> > done
> > 
> > and that's it.
> 
> Alas, my testing possibilities, especially with a real network, are very
> limited, but I can confirm: yes, the above test really hangs my box, even
> with syslog on and netconsole off. So maybe I'm missing something, but I
> don't understand why you expect netconsole to endure this.

I expect it to fail coz it's recently been filled with bugs ;)

I see that your netpoll-zap_completion_queue-adjust-skb-users-counter.patch
should fix the oops I earlier hit.  Good.

> IMHO, after the below patch to sched.c you can't compare netconsole to
> 2.6.24 with this sysrq-trigger test; any bugs found with this could be
> something old and not necessarily in netconsole (they could be merely
> exposed by netconsole, like the earlier mentioned, unexplained OOPS that
> probably followed a double kfree).
> 
> Regards,
> Jarek P.
> 
> From: Nick Piggin <nickpiggin@yahoo.com.au>
> Date: Fri, 25 Jan 2008 20:08:34 +0000 (+0100)
> Subject: sched: print backtrace of running tasks too
> X-Git-Tag: v2.6.25-rc1~1237^2~3
> X-Git-Url:
> http://git.kernel.org/?p=linux%2Fkernel%2Fgit%2Ftorvalds%2Flinux-2.6.git;a=commitdiff_plain;h=5fb5e6de55860a99c2d8fe7e0c8222d5c53d8464
> 
> sched: print backtrace of running tasks too
> 
> The attached patch is something really simple that can sometimes help
> in getting more info out of a hung system.
> 
> Signed-off-by: Ingo Molnar <mingo@elte.hu>
> ---
> 
> diff --git a/kernel/sched.c b/kernel/sched.c
> index 4d3a5a7..524285e 100644
> --- a/kernel/sched.c
> +++ b/kernel/sched.c
> @@ -5161,8 +5161,7 @@ void sched_show_task(struct task_struct *p)
>       printk(KERN_CONT "%5lu %5d %6d\n", free,
>               task_pid_nr(p), task_pid_nr(p->real_parent));
>  
> -     if (state != TASK_RUNNING)
> -             show_stack(p, NULL);
> +     show_stack(p, NULL);
>  }
>  
>  void show_state_filter(unsigned long state_filter)

hm.

I tried a few things:

1:

   cat monstrous-text-file > /dev/kmsg

  Works OK.

2:

   Disable netconsole, do

	while true
	do
		echo t > /proc/sysrq-trigger
	done

   Works OK.

3:

  Enable netconsole, do

	while true
	do
		echo t > /proc/sysrq-trigger
	done

  Output comes out.  I was able to ^C the while loop.  After a while the
  output stopped.  So that seems OK too.
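
When comparing runs like the three cases above, it may help to write a
marker line into the kernel log before each one, so that a capture on the
netconsole receiver shows where one case ends and the next begins. A minimal
sketch, assuming a hypothetical helper name mark_case and a KMSG variable
that defaults to /dev/kmsg (the device used in case 1); neither name comes
from the original tests:

```shell
# Hypothetical helper, not part of the original tests.
# KMSG defaults to /dev/kmsg; point it at a scratch file to try the
# helper without touching the kernel log.
KMSG=${KMSG:-/dev/kmsg}

mark_case() {
	# $1 = free-form case label, e.g. "2: netconsole disabled"
	printf 'sysrq-t test: case %s\n' "$1" > "$KMSG"
}
```

For example, mark_case "3: netconsole enabled" just before starting the
loop. Writing to /dev/kmsg normally needs root.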


So right now it's cannot-reproduce.  I'll try things on the other machine
this evening.

I dunno why the sched.c change causes your sysrq-T operation to fail.  Can
you provide more details please?
Comment 4 Jarek Poplawski 2008-03-18 15:50:55 UTC
On Tue, Mar 18, 2008 at 02:47:42PM -0700, Andrew Morton wrote:
> On Tue, 18 Mar 2008 22:05:42 +0100
> Jarek Poplawski <jarkao2@gmail.com> wrote:
...
> > IMHO, after the below patch to sched.c you can't compare netconsole to
> > 2.6.24 with this sysrq-trigger test; any bugs found with this could be
...
> hm.
...
> So right now it's cannot-reproduce.  I'll try things on the other machine
> this evening.
> 
> I dunno why the sched.c change causes your sysrq-T operation to fail.  Can
> you provide more details please?

...hmm...
Doesn't sysrq-t trigger this sched.c function?

Anyway... My first tests seemed to hang the box with syslog only. Now
I can't repeat it with either syslog or netconsole... So, this patch
is a bad hit or it's really about timing.

Jarek P.
Comment 5 Jarek Poplawski 2008-03-19 02:17:53 UTC
(In reply to comment #3)
> Reply-To: akpm@linux-foundation.org
...
> I see that your netpoll-zap_completion_queue-adjust-skb-users-counter.patch
> should fix the oops I earlier hit.  Good.

Actually, this patch is only expected to prevent a memory leak, so it can probably sometimes prevent or delay running out of skbs. But the OOPS which led to finding this showed there is probably a double kfree of skbs in one of your network drivers, not necessarily the one used with netconsole, which can corrupt completion_queue. This could trigger with or without netconsole (but is more probable under some stress). That's why I asked about the possibility of testing this with only one of your drivers enabled.
Comment 6 Jarek Poplawski 2008-03-19 12:14:00 UTC
On Tue, Mar 18, 2008 at 11:47:29PM +0100, Jarek Poplawski wrote:
...
> Anyway... My first tests seemed to hang the box with syslog only. Now
> I can't repeat it with either syslog or netconsole... So, this patch
> is a bad hit or it's really about timing.

I've just repeated this test with syslog only. After letting it
go for ~5 min. I couldn't break it with any keys for at least the next 10
min., and I turned the power off. Then the same but with this sched.c
patch reverted: ^C worked after a few seconds. It looks like time
can really matter here.

So, maybe it's again something accidental, I don't have another box
around to stay idle while repeating this test, but it seems this might
not be the best way to compare anything with 2.6.24 or older.

Regards,
Jarek P.
Comment 7 Anonymous Emailer 2008-03-19 14:20:18 UTC
Reply-To: akpm@linux-foundation.org

On Wed, 19 Mar 2008 20:17:25 +0100
Jarek Poplawski <jarkao2@gmail.com> wrote:

> On Tue, Mar 18, 2008 at 11:47:29PM +0100, Jarek Poplawski wrote:
> ...
> > Anyway... My first tests seemed to hang the box with syslog only. Now
> > I can't repeat it with either syslog or netconsole... So, this patch
> > is a bad hit or it's really about timing.
> 
> I've just repeated this test with syslog only. After letting it
> go for ~5 min. I couldn't break it with any keys for at least the next 10
> min., and I turned the power off. Then the same but with this sched.c
> patch reverted: ^C worked after a few seconds. It looks like time
> can really matter here.

Yeah, I was fiddling with that.  If you do

for i in $(seq 100)
do
	echo t > /proc/sysrq-trigger
done

then yes there's no response to ^C and the machine is basically dead.  But
when the loop finishes, things return to normal.

Perhaps it's something to do with longer holds on tasklist_lock, something
like that.
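
To put a rough number on how long the box stays unresponsive, the bounded
loop can be timed. A sketch only, assuming hypothetical names timed_sysrq_t
and TRIGGER (defaulting to /proc/sysrq-trigger); it reports whole seconds
and says nothing about where the time goes:

```shell
# Hypothetical helper, not from the original thread.
# TRIGGER defaults to the real sysrq trigger; override it with a scratch
# file to try the function without dumping task state.
TRIGGER=${TRIGGER:-/proc/sysrq-trigger}

timed_sysrq_t() {
	# $1 = number of sysrq-t dumps to request
	# POSIX sh has no local variables, so these names are global.
	count=$1
	start=$(date +%s)
	i=0
	while [ "$i" -lt "$count" ]
	do
		echo t > "$TRIGGER"
		i=$((i + 1))
	done
	end=$(date +%s)
	# Print the elapsed wall-clock time in whole seconds.
	echo $((end - start))
}
```

Run as root, timed_sysrq_t 100 corresponds to the loop above. Comparing the
figure with and without the sched.c change would be one way to check the
timing theory.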

> So, maybe it's again something accidental, I don't have another box
> around to stay idle while repeating this test, but it seems this might
> not be the best way to compare anything with 2.6.24 or older.

No.  I still haven't retested on the other offending machine.  Right now
I'm not sure that we any longer have anything which needs fixing.  Apart
from merging netpoll-zap_completion_queue-adjust-skb-users-counter.patch?
Comment 8 David S. Miller 2008-03-19 14:31:10 UTC
From: Andrew Morton <akpm@linux-foundation.org>
Date: Wed, 19 Mar 2008 14:20:10 -0700

> I'm not sure that we any longer have anything which needs fixing.  Apart
> from merging netpoll-zap_completion_queue-adjust-skb-users-counter.patch?

I'll take care of merging this, give me a day or two.
Comment 9 Jarek Poplawski 2008-03-19 14:51:43 UTC
On Wed, Mar 19, 2008 at 02:20:10PM -0700, Andrew Morton wrote:
...
> No.  I still haven't retested on the other offending machine.  Right now
> I'm not sure that we any longer have anything which needs fixing.  Apart
> from merging netpoll-zap_completion_queue-adjust-skb-users-counter.patch?

I agree that at least there seems to be no proof of a regression which
needs fixing. But I bet there are still things in netpoll, like this
zap_completion_queue, which could be (not urgently) fixed...

Jarek P.