|Summary:||CPU lockup during boot|
|Product:||Process Management||Reporter:||Bruno Wolff III (bruno)|
|Component:||Scheduler||Assignee:||Ingo Molnar (mingo)|
|Bug Depends on:|
lspci -vvv output
i686.PAE combined config file
Description Bruno Wolff III 2011-06-19 15:00:18 UTC
3.0 kernels mostly lock up during boot with a traceback. 2.6.39 does work for me.
Comment 2 Bruno Wolff III 2011-06-19 15:01:37 UTC
Created attachment 62872 [details] lspci -vvv output
Comment 3 Bruno Wolff III 2011-06-19 15:02:22 UTC
I have also filed a Fedora bug for this issue: https://bugzilla.redhat.com/show_bug.cgi?id=714478
Comment 4 john stultz 2011-06-20 18:11:49 UTC
Here's the referenced trace from the other bugzilla: https://bugzilla.redhat.com/attachment.cgi?id=505469 Looks like an NMI hit you in try_to_wake_up from the mutex_unlock path. Maybe a mutex or scheduler issue?
Comment 5 Bruno Wolff III 2011-06-20 18:44:51 UTC
Sorry about not adding that attachment here. I thought I had, but I must have screwed up.
Comment 6 Bruno Wolff III 2011-06-29 11:46:38 UTC
I am still seeing this with 3.0-rc5.
Comment 7 john stultz 2011-07-01 19:20:05 UTC
Peter: Did you get a chance to look at this? It doesn't look like a timers issue, so I'm reassigning it to "process management" since it looks scheduler-y.
Bruno: Could you also attach a full dmesg from the working 2.6.39 kernel?
Comment 8 Bruno Wolff III 2011-07-01 19:26:30 UTC
Created attachment 64422 [details] dmesg output
If it helps, I appear to have the same issue on an old Xeon-based machine, as well as the Athlon-based one. I could get another picture and hardware information.
Comment 9 Peter Zijlstra 2011-07-01 21:28:24 UTC
On Fri, 2011-07-01 at 19:26 +0000, email@example.com wrote:
> https://bugzilla.kernel.org/show_bug.cgi?id=37872
>
> Bruno Wolff III <firstname.lastname@example.org> changed:
>
>            What    |Removed                     |Added
> ----------------------------------------------------------------------------
> Attachment #64422 [details]|application/octet-stream |text/plain
>           mime type|                            |

I managed to get a 32bit SMP thing going using a F15 livecd on a usb-stick and an atom dev-board. Sadly with the modified i386_defconfig I cannot seem to reproduce. Could you post your .config?
Comment 10 Bruno Wolff III 2011-07-09 08:21:19 UTC
I am not sure what you are looking for. I am using the stock (for Fedora) rawhide kernels when I see this issue.
Comment 11 Bruno Wolff III 2011-07-09 08:34:57 UTC
Created attachment 65042 [details] i686.PAE combined config file
I think this is what you want. I checked out the source for the kernel package and ran make Makefile.config which builds the various config files used to build kernels.
Comment 12 Bruno Wolff III 2011-07-10 13:21:47 UTC
I am still seeing this as of rc6.
Comment 13 Bruno Wolff III 2011-07-11 13:37:45 UTC
I am still seeing this in Fedora's kernel-PAE-3.0-0.rc6.git6.1.fc16.i686 package.
Comment 14 Bruno Wolff III 2011-07-11 14:50:14 UTC
This patch has been recommended as a possible fix:

--- linux-2.6.orig/kernel/sched.c
+++ linux-2.6/kernel/sched.c
@@ -7750,6 +7750,9 @@ static void init_cfs_rq(struct cfs_rq *c
 #endif
 #endif
 	cfs_rq->min_vruntime = (u64)(-(1LL << 20));
+#ifndef CONFIG_64BIT
+	cfs_rq->min_vruntime_copy = cfs_rq->min_vruntime;
+#endif
 }
 
 static void init_rt_rq(struct rt_rq *rt_rq, struct rq *rq)
Comment 15 Bruno Wolff III 2011-07-11 15:48:30 UTC
On Mon, Jul 11, 2011 at 16:35:24 +0200, Peter Zijlstra <email@example.com> wrote:
> > --- Comment #13 from Bruno Wolff III <firstname.lastname@example.org> 2011-07-11 13:37:45 ---
> > I am still seeing this in Fedora's kernel-PAE-3.0-0.rc6.git6.1.fc16.i686
> > package.
>
> Does the below cure things? My Atom board seems to indeed trigger it
> with the fat (debug) .config from Fedora, after the below patch it does
> not.

I will try that out. I probably won't get a chance to work on it until tonight and then I'll need to wait for a kernel build. (If I do it on my machine it will take a while, but I'll look at doing a scratch build on Fedora's build system to speed things up.)

Thanks!
Comment 16 Peter Zijlstra 2011-07-11 15:54:33 UTC
> --- Comment #13 from Bruno Wolff III <email@example.com> 2011-07-11 13:37:45 ---
> I am still seeing this in Fedora's kernel-PAE-3.0-0.rc6.git6.1.fc16.i686
> package.

Does the below cure things? My Atom board seems to indeed trigger it with the fat (debug) .config from Fedora, after the below patch it does not.

---
Subject: sched: Fix 32bit race
From: Peter Zijlstra <firstname.lastname@example.org>
Date: Mon Jul 11 16:28:50 CEST 2011

Commit 3fe1698b7fe0 ("sched: Deal with non-atomic min_vruntime reads on 32bit") forgot to initialize min_vruntime_copy which could lead to an infinite while loop in task_waking_fair() under some circumstances (early boot, lucky timing).

Signed-off-by: Peter Zijlstra <email@example.com>
Link: http://firstname.lastname@example.org
---
Index: linux-2.6/kernel/sched.c
===================================================================
--- linux-2.6.orig/kernel/sched.c
+++ linux-2.6/kernel/sched.c
@@ -7750,6 +7750,9 @@ static void init_cfs_rq(struct cfs_rq *c
 #endif
 #endif
 	cfs_rq->min_vruntime = (u64)(-(1LL << 20));
+#ifndef CONFIG_64BIT
+	cfs_rq->min_vruntime_copy = cfs_rq->min_vruntime;
+#endif
 }
 
 static void init_rt_rq(struct rt_rq *rt_rq, struct rq *rq)
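
For readers who want to see why an uninitialized min_vruntime_copy can hang a 32-bit boot, here is a minimal user-space sketch. It is not the kernel code. It assumes the reader in task_waking_fair() retries until min_vruntime and min_vruntime_copy agree, which is what the commit title ("non-atomic min_vruntime reads on 32bit") suggests. The struct cfs_rq_sketch type, both init helpers, the zeroed starting value of the copy, and the max_tries bound are all hypothetical and exist only so the example terminates.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for the two struct cfs_rq fields the patch touches. */
struct cfs_rq_sketch {
    uint64_t min_vruntime;
    uint64_t min_vruntime_copy;
};

/* Same initial value as in the patch context: (u64)(-(1LL << 20)). */
#define INITIAL_MIN_VRUNTIME ((uint64_t)(-(1LL << 20)))

/* Initialization before the fix: the copy is left at its (assumed) zeroed value. */
static void init_without_fix(struct cfs_rq_sketch *cfs_rq)
{
    cfs_rq->min_vruntime = INITIAL_MIN_VRUNTIME;
    cfs_rq->min_vruntime_copy = 0;
}

/* Initialization after the fix: the copy starts out equal to the value. */
static void init_with_fix(struct cfs_rq_sketch *cfs_rq)
{
    cfs_rq->min_vruntime = INITIAL_MIN_VRUNTIME;
    cfs_rq->min_vruntime_copy = cfs_rq->min_vruntime;
}

/*
 * Assumed shape of the 32-bit reader: read the copy, then the value, and
 * retry until both agree. In the kernel a read barrier would sit between
 * the two loads and the loop would be unbounded; the max_tries bound is
 * added here only so the sketch can report failure instead of spinning.
 */
static bool read_min_vruntime(const struct cfs_rq_sketch *cfs_rq,
                              unsigned int max_tries, uint64_t *out)
{
    assert(max_tries > 0);
    for (unsigned int i = 0; i < max_tries; i++) {
        uint64_t copy = cfs_rq->min_vruntime_copy;
        uint64_t val = cfs_rq->min_vruntime;
        if (val == copy) {
            *out = val;
            return true;
        }
    }
    return false; /* never converged: the unbounded loop would spin forever */
}
```

With init_without_fix() and no writer updating the fields yet (early boot), read_min_vruntime() never sees matching values and returns false, which corresponds to the lockup. With init_with_fix() it succeeds on the first try.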
Comment 17 Bruno Wolff III 2011-07-11 23:23:34 UTC
The patch appears to work. Both machines (Athlon and Xeon) that had been locking up during boot now boot into graphical desktops and appear to be working normally. Thanks for the help!
Comment 18 Florian Mickler 2011-07-12 09:15:53 UTC