Bug 37872
Summary: | CPU lockup during boot | ||
---|---|---|---|
Product: | Process Management | Reporter: | Bruno Wolff III (bruno) |
Component: | Scheduler | Assignee: | Ingo Molnar (mingo) |
Status: | CLOSED CODE_FIX | ||
Severity: | normal | CC: | a.p.zijlstra, florian |
Priority: | P1 | ||
Hardware: | All | ||
OS: | Linux | ||
Kernel Version: | kernel-PAE-3.0-0.rc3.git5.1.fc16.i686 | Subsystem: | |
Regression: | Yes | Bisected commit-id: | |
Bug Depends on: | |||
Bug Blocks: | 36912 | ||
Attachments: | /proc/cpuinfo, lspci -vvv output, dmesg output, i686.PAE combined config file | ||
Description
Bruno Wolff III
2011-06-19 15:00:18 UTC
Created attachment 62862 [details]
/proc/cpuinfo
Created attachment 62872 [details]
lspci -vvv output
I have also filed a Fedora bug for this issue: https://bugzilla.redhat.com/show_bug.cgi?id=714478

Here's the referenced trace from the other bugzilla: https://bugzilla.redhat.com/attachment.cgi?id=505469
Looks like an NMI hit you in try_to_wake_up from the mutex_unlock path. Maybe a mutex or scheduler issue?

Sorry about not adding that attachment here. I thought I had, but I must have screwed up.

I am still seeing this with 3.0-rc5.

Peter: Did you get a chance to look at this? It doesn't look like a timers issue, so I'm reassigning it to "process management" since it looks scheduler-y.
Bruno: Could you also attach a full dmesg from the working 2.6.39 kernel?

Created attachment 64422 [details]
dmesg output
If it helps, I appear to have the same issue on an old Xeon-based machine, as well as the Athlon-based one. I could get another picture and hardware information.
On Fri, 2011-07-01 at 19:26 +0000, bugzilla-daemon@bugzilla.kernel.org wrote:
> https://bugzilla.kernel.org/show_bug.cgi?id=37872
>
> Bruno Wolff III <bruno@wolff.to> changed:
>
>            What    |Removed                     |Added
> ----------------------------------------------------------------------------
>   Attachment #64422|application/octet-stream    |text/plain
>           mime type|                            |

I managed to get a 32bit SMP thing going using an F15 livecd on a USB stick and an Atom dev-board. Sadly, with the modified i386_defconfig I cannot seem to reproduce. Could you post your .config?

I am not sure what you are looking for. I am using the stock (for Fedora) rawhide kernels when I see this issue.

Created attachment 65042 [details]
i686.PAE combined config file
I think this is what you want. I checked out the source for the kernel package and ran make Makefile.config which builds the various config files used to build kernels.
I am still seeing this as of rc6.

I am still seeing this in Fedora's kernel-PAE-3.0-0.rc6.git6.1.fc16.i686 package.

This patch has been recommended as a possible fix:

--- linux-2.6.orig/kernel/sched.c
+++ linux-2.6/kernel/sched.c
@@ -7750,6 +7750,9 @@ static void init_cfs_rq(struct cfs_rq *c
 #endif
 #endif
 	cfs_rq->min_vruntime = (u64)(-(1LL << 20));
+#ifndef CONFIG_64BIT
+	cfs_rq->min_vruntime_copy = cfs_rq->min_vruntime;
+#endif
 }
 
 static void init_rt_rq(struct rt_rq *rt_rq, struct rq *rq)

On Mon, Jul 11, 2011 at 16:35:24 +0200, Peter Zijlstra <a.p.zijlstra@chello.nl> wrote:
> > --- Comment #13 from Bruno Wolff III <bruno@wolff.to> 2011-07-11 13:37:45 ---
> > I am still seeing this in Fedora's kernel-PAE-3.0-0.rc6.git6.1.fc16.i686
> > package.
>
> Does the below cure things? My Atom board seems to indeed trigger it
> with the fat (debug) .config from Fedora, after the below patch it does
> not.

I will try that out. I probably won't get a chance to work on it until tonight and then I'll need to wait for a kernel build. (If I do it on my machine it will take a while, but I'll look at doing a scratch build on Fedora's build system to speed things up.)
Thanks!

> --- Comment #13 from Bruno Wolff III <bruno@wolff.to> 2011-07-11 13:37:45 ---
> I am still seeing this in Fedora's kernel-PAE-3.0-0.rc6.git6.1.fc16.i686
> package.

Does the below cure things? My Atom board seems to indeed trigger it with the fat (debug) .config from Fedora, after the below patch it does not.

---
Subject: sched: Fix 32bit race
From: Peter Zijlstra <a.p.zijlstra@chello.nl>
Date: Mon Jul 11 16:28:50 CEST 2011

Commit 3fe1698b7fe0 ("sched: Deal with non-atomic min_vruntime reads on 32bit") forgot to initialize min_vruntime_copy which could lead to an infinite while loop in task_waking_fair() under some circumstances (early boot, lucky timing).

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Link: http://lkml.kernel.org/n/tip-sw3r5hpdwctatwob4c19df4n@git.kernel.org
---
Index: linux-2.6/kernel/sched.c
===================================================================
--- linux-2.6.orig/kernel/sched.c
+++ linux-2.6/kernel/sched.c
@@ -7750,6 +7750,9 @@ static void init_cfs_rq(struct cfs_rq *c
 #endif
 #endif
 	cfs_rq->min_vruntime = (u64)(-(1LL << 20));
+#ifndef CONFIG_64BIT
+	cfs_rq->min_vruntime_copy = cfs_rq->min_vruntime;
+#endif
 }
 
 static void init_rt_rq(struct rt_rq *rt_rq, struct rq *rq)

The patch appears to work. Both machines (Athlon and Xeon) that had been locking up during boot now boot into graphical desktops and appear to be working normally. Thanks for the help!

Patch got merged into v3.0:

commit c64be78ffb415278d7d32d6f55de95c73dcc19a4
Author: Peter Zijlstra <a.p.zijlstra@chello.nl>
Date: Mon Jul 11 16:28:50 2011 +0200

    sched: Fix 32bit race
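For readers following along, the commit message above attributes the hang to a retry loop in task_waking_fair() that never sees min_vruntime and min_vruntime_copy agree. What follows is a minimal userspace sketch of that failure mode, not the kernel code. The struct cfs_rq_sketch, the helpers init_without_fix, init_with_fix and read_min_vruntime, and the max_tries bound are all hypothetical names introduced for illustration. The sketch assumes the uninitialised copy starts out as zero, and it leaves out the memory barriers that a real SMP reader and writer would need.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for the two cfs_rq fields named in the patch. */
struct cfs_rq_sketch {
	uint64_t min_vruntime;
	uint64_t min_vruntime_copy;
};

/* Same initial value as the patch context: (u64)(-(1LL << 20)). */
#define INITIAL_MIN_VRUNTIME ((uint64_t)(-(1LL << 20)))

/* Initialisation before the fix; the copy is assumed to be left at zero. */
static void init_without_fix(struct cfs_rq_sketch *rq)
{
	rq->min_vruntime = INITIAL_MIN_VRUNTIME;
	rq->min_vruntime_copy = 0;
}

/* Initialisation with the fix: the copy starts equal to the value. */
static void init_with_fix(struct cfs_rq_sketch *rq)
{
	rq->min_vruntime = INITIAL_MIN_VRUNTIME;
	rq->min_vruntime_copy = rq->min_vruntime;
}

/*
 * Reader that retries until both fields agree, which is how a 64-bit
 * value can be read consistently when a single load is not atomic.
 * The max_tries bound is an addition for this sketch so that it
 * terminates; per the commit message, the loop in the kernel had no
 * such bound and could spin forever.
 */
static bool read_min_vruntime(const struct cfs_rq_sketch *rq,
			      unsigned int max_tries, uint64_t *out)
{
	for (unsigned int i = 0; i < max_tries; i++) {
		uint64_t copy = rq->min_vruntime_copy;
		uint64_t value = rq->min_vruntime;

		if (value == copy) {
			*out = value;
			return true;
		}
	}
	return false;	/* the fields never matched */
}
```

Called after init_without_fix, read_min_vruntime uses up its retry budget and returns false, which corresponds to the unbounded loop described in the commit message. After init_with_fix it succeeds on the first pass, since nothing has updated min_vruntime yet and the two fields are equal.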