Subject    : netconsole still hangs
Submitter  : Andrew Morton <akpm@linux-foundation.org>
Date       : 2008-03-12 23:14
References : http://marc.info/?t=120536379200004&r=1&w=2
Handled-By : David Miller <davem@davemloft.net>
Handled-By : Stephen Hemminger <shemminger@linux-foundation.org>

This entry is being used for tracking a regression from 2.6.24. Please don't close it until the problem is fixed in the mainline.
Reply-To: akpm@linux-foundation.org

On Tue, 18 Mar 2008 08:04:39 +0000 Jarek Poplawski <jarkao2@gmail.com> wrote:

> On Mon, Mar 17, 2008 at 04:12:22PM -0700, Andrew Morton wrote:
> ...
> > I retested. This patch doesn't appear to make anything worse, but the hang
> > is still there.
>
> Yes, but since this doesn't look like something very common, and we
> don't even know if this OOPS and the hangs are the same bug, there is
> needed more information e.g.:
>
> - is it reproducible with e1000E only and no wlan?

Yes. Both the machines I can reproduce this on have both E1000=y and E1000E=y. From the dmesg (below), one uses e1000 and the other uses e1000e. Both crash.

http://userweb.kernel.org/~akpm/config-akpm2.txt
http://userweb.kernel.org/~akpm/dmesg-akpm2.txt
http://userweb.kernel.org/~akpm/config-t61p.txt
http://userweb.kernel.org/~akpm/dmesg-t61p.txt

I used to be able to reproduce the problems with a 2-way i386 e100 system, but that seems to be fixed now, perhaps from David's revert.

I also used to be able to reproduce the problem on a one-way i386 e100 machine but that also seems to have gone away.

> - is there a possibility to check this with some other card
> (even wlan while e1000E is off)?

err, dunno. Perhaps I could try e1000 on the e1000e-using machine and vice versa, but for that some PCI ID table hacking might be needed.

I cc'ed bugzilla on this thread.

> - could you add .config to the bugzilla report:
> http://bugzilla.kernel.org/show_bug.cgi?id=10238

See above.

> - is it acceptable to send you some patches for debugging this?

As a last resort. But it'd surely be better if a net developer could reproduce this and do some work on it. It's bog-trivial to reproduce here and afaik nobody has even tried. Perhaps you have...

service syslog stop
while true
do
	echo t > /proc/sysrq-trigger
done

and that's it.
Andrew Morton wrote, On 03/18/2008 09:50 AM:
...
> As a last resort. But it'd surely be better if a net developer could
> reproduce this and do some work on it. It's bog-trivial to reproduce here
> and afaik nobody has even tried. Perhaps you have...
>
> service syslog stop
> while true
> do
> 	echo t > /proc/sysrq-trigger
> done
>
> and that's it.

Alas my testing possibilities, especially with real network, are very limited, I can confirm: yes, the above test really hangs my box, yet with syslog on and netconsole off. So, maybe I miss something, but I don't understand why do you expect netconsole should endure this?

IMHO, after the below patch to sched.c you can't compare netconsole to 2.6.24 with this sysrq-trigger test; any bugs found with this could be something old and not necessarily in netconsole (could be only exposed by netconsole like this earlier mentioned, unexplained, probably after double kfree OOPS).

Regards,
Jarek P.

From: Nick Piggin <nickpiggin@yahoo.com.au>
Date: Fri, 25 Jan 2008 20:08:34 +0000 (+0100)
Subject: sched: print backtrace of running tasks too
X-Git-Tag: v2.6.25-rc1~1237^2~3
X-Git-Url: http://git.kernel.org/?p=linux%2Fkernel%2Fgit%2Ftorvalds%2Flinux-2.6.git;a=commitdiff_plain;h=5fb5e6de55860a99c2d8fe7e0c8222d5c53d8464

sched: print backtrace of running tasks too

The attached patch is something really simple that can sometimes help in getting more info out of a hung system.

Signed-off-by: Ingo Molnar <mingo@elte.hu>
---

diff --git a/kernel/sched.c b/kernel/sched.c
index 4d3a5a7..524285e 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -5161,8 +5161,7 @@ void sched_show_task(struct task_struct *p)
 	printk(KERN_CONT "%5lu %5d %6d\n", free,
 		task_pid_nr(p), task_pid_nr(p->real_parent));
 
-	if (state != TASK_RUNNING)
-		show_stack(p, NULL);
+	show_stack(p, NULL);
 }
 
 void show_state_filter(unsigned long state_filter)

http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commitdiff;h=5fb5e6de55860a99c2d8fe7e0c8222d5c53d8464
Reply-To: akpm@linux-foundation.org

On Tue, 18 Mar 2008 22:05:42 +0100 Jarek Poplawski <jarkao2@gmail.com> wrote:

> Andrew Morton wrote, On 03/18/2008 09:50 AM:
> ...
> > As a last resort. But it'd surely be better if a net developer could
> > reproduce this and do some work on it. It's bog-trivial to reproduce here
> > and afaik nobody has even tried. Perhaps you have...
> >
> > service syslog stop
> > while true
> > do
> > 	echo t > /proc/sysrq-trigger
> > done
> >
> > and that's it.
>
> Alas my testing possibilities, especially with real network, are very
> limited, I can confirm: yes, the above test really hangs my box, yet
> with syslog on and netconsole off. So, maybe I miss something, but I
> don't understand why do you expect netconsole should endure this?

I expect it to fail coz it's recently been filled with bugs ;)

I see that your netpoll-zap_completion_queue-adjust-skb-users-counter.patch should fix the oops I earlier hit. Good.

> IMHO, after the below patch to sched.c you can't compare netconsole to
> 2.6.24 with this sysrq-trigger test; any bugs found with this could be
> something old and not necessarily in netconsole (could be only exposed
> by netconsole like this earlier mentioned, unexplained, probably after
> double kfree OOPS).
>
> Regards,
> Jarek P.
>
> From: Nick Piggin <nickpiggin@yahoo.com.au>
> Date: Fri, 25 Jan 2008 20:08:34 +0000 (+0100)
> Subject: sched: print backtrace of running tasks too
> X-Git-Tag: v2.6.25-rc1~1237^2~3
> X-Git-Url:
> http://git.kernel.org/?p=linux%2Fkernel%2Fgit%2Ftorvalds%2Flinux-2.6.git;a=commitdiff_plain;h=5fb5e6de55860a99c2d8fe7e0c8222d5c53d8464
>
> sched: print backtrace of running tasks too
>
> The attached patch is something really simple that can sometimes help
> in getting more info out of a hung system.
>
> Signed-off-by: Ingo Molnar <mingo@elte.hu>
> ---
>
> diff --git a/kernel/sched.c b/kernel/sched.c
> index 4d3a5a7..524285e 100644
> --- a/kernel/sched.c
> +++ b/kernel/sched.c
> @@ -5161,8 +5161,7 @@ void sched_show_task(struct task_struct *p)
>  	printk(KERN_CONT "%5lu %5d %6d\n", free,
>  		task_pid_nr(p), task_pid_nr(p->real_parent));
>
> -	if (state != TASK_RUNNING)
> -		show_stack(p, NULL);
> +	show_stack(p, NULL);
>  }
>
>  void show_state_filter(unsigned long state_filter)

hm. I tried a few things:

1: cat monstrous-text-file > /dev/kmsg

   Works OK.

2: Disable netconsole, do

   while true
   do
   	echo t > /proc/sysrq-trigger
   done

   Works OK.

3: Enable netconsole, do

   while true
   do
   	echo t > /proc/sysrq-trigger
   done

   Output comes out. I was able to ^C the while loop. After a while the output stopped. So that seems OK too.

So right now it's cannot-reproduce. I'll try things on the other machine this evening.

I dunno why the sched.c change causes your sysrq-T operation to fail. Can you provide more details please?
On Tue, Mar 18, 2008 at 02:47:42PM -0700, Andrew Morton wrote:
> On Tue, 18 Mar 2008 22:05:42 +0100
> Jarek Poplawski <jarkao2@gmail.com> wrote:
...
> > IMHO, after the below patch to sched.c you can't compare netconsole to
> > 2.6.24 with this sysrq-trigger test; any bugs found with this could be
...
> hm.
...
> So right now it's cannot-reproduce. I'll try things on the other machine
> this evening.
>
> I dunno why the sched.c change causes your sysrq-T operation to fail. Can
> you provide more details please?

...hmm... Doesn't sysrq-t trigger this sched.c function?

Anyway... My first tests seemed to hang the box with syslog only. Now I can't repeat it neither with syslog nor netconsole... So, this patch is a bad hit or it's really about timing.

Jarek P.
(In reply to comment #3)
> Reply-To: akpm@linux-foundation.org
...
> I see that your netpoll-zap_completion_queue-adjust-skb-users-counter.patch
> should fix the oops I earlier hit. Good.

Actually, this patch is only expected to prevent some memory leak, so probably can sometimes prevent or delay lack of skbs. But the OOPS which led to finding this showed there is probably a double kfreeing of skbs in some of your network drivers, not necessarily used with netconsole, which can corrupt completion_queue. This could trigger with or without netconsole (but more probable under some stress). That's why I asked about a possibility of testing this with only one of your drivers on.
On Tue, Mar 18, 2008 at 11:47:29PM +0100, Jarek Poplawski wrote:
...
> Anyway... My first tests seemed to hang the box with syslog only. Now
> I can't repeat it neither with syslog nor netconsole... So, this patch
> is a bad hit or it's really about timing.

I've just repeated this test with syslog only. After letting it go for ~5 min. I couldn't break it with any keys for at least next 10 min., and I turned the power down. Then the same but with this sched.c patch reverted: ^C worked after a few seconds. It looks like time can really matter here.

So, maybe it's again something accidental, I don't have another box around to stay idle while repeating this test, but it seems this could be not the best way to compare anything with 2.6.24 or older.

Regards,
Jarek P.
Reply-To: akpm@linux-foundation.org

On Wed, 19 Mar 2008 20:17:25 +0100 Jarek Poplawski <jarkao2@gmail.com> wrote:

> On Tue, Mar 18, 2008 at 11:47:29PM +0100, Jarek Poplawski wrote:
> ...
> > Anyway... My first tests seemed to hang the box with syslog only. Now
> > I can't repeat it neither with syslog nor netconsole... So, this patch
> > is a bad hit or it's really about timing.
>
> I've just repeated this test with syslog only. After letting it
> go for ~5 min. I couldn't break it with any keys for at least next 10
> min., and I turned the power down. Then the same but with this sched.c
> patch reverted: ^C worked after a few seconds. It looks like time
> can really matter here.

Yeah, I was fiddling with that. If you do

for i in $(seq 100)
do
	echo t > /proc/sysrq-trigger
done

then yes there's no response to ^C and the machine is basically dead. But when the loop finishes, things return to normal. Perhaps it's something to do with longer holds on tasklist_lock, something like that.

> So, maybe it's again something accidental, I don't have another box
> around to stay idle while repeating this test, but it seems this could
> be not the best way to compare anything with 2.6.24 or older.

No. I still haven't retested on the other offending machine. Right now I'm not sure that we any longer have anything which needs fixing. Apart from merging netpoll-zap_completion_queue-adjust-skb-users-counter.patch?
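For anyone retrying this, the bounded loop above could be wrapped roughly as follows. This is only a sketch: the function name sysrq_burst and its optional second argument are hypothetical, added here so the loop can be dry-run against an ordinary file before it is pointed at the real /proc/sysrq-trigger (which needs root and will flood the console just as described in this thread).

```shell
#!/bin/sh
# Hypothetical helper, not from the thread: write 't' to a trigger file a
# fixed number of times, then print how many writes completed.
#   $1 = number of task dumps to request
#   $2 = trigger path (assumed default: /proc/sysrq-trigger)
sysrq_burst() {
    count=$1
    trigger=${2:-/proc/sysrq-trigger}
    i=0
    while [ "$i" -lt "$count" ]; do
        # Same command as in the thread; stop early if the write fails.
        echo t > "$trigger" || return 1
        i=$((i + 1))
    done
    echo "$i"
}
```

A dry run such as `sysrq_burst 5 /tmp/fake-trigger` only touches the scratch file; `sysrq_burst 100` as root corresponds to the `seq 100` loop above, so expect the box to be unresponsive until it finishes.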
From: Andrew Morton <akpm@linux-foundation.org>
Date: Wed, 19 Mar 2008 14:20:10 -0700

> I'm not sure that we any longer have anything which needs fixing. Apart
> from merging netpoll-zap_completion_queue-adjust-skb-users-counter.patch?

I'll take care of merging this, give me a day or two.
On Wed, Mar 19, 2008 at 02:20:10PM -0700, Andrew Morton wrote:
...
> No. I still haven't retested on the other offending machine. Right now
> I'm not sure that we any longer have anything which needs fixing. Apart
> from merging netpoll-zap_completion_queue-adjust-skb-users-counter.patch?

I agree that at least there seems to be no proof of a regression which needs fixing. But I bet there are still things in netpoll, like this zap_completion_queue, which could be (not urgently) fixed...

Jarek P.
Fixed by: http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commit;h=8a455b087c9629b3ae3b521b4f1ed16672f978cc