Bug 37872 - CPU lockup during boot
Status: CLOSED CODE_FIX
Product: Process Management
Classification: Unclassified
Component: Scheduler
Hardware: All Linux
Importance: P1 normal
Assigned To: Ingo Molnar
Depends on:
Blocks: 36912
Reported: 2011-06-19 15:00 UTC by Bruno Wolff III
Modified: 2011-08-15 08:07 UTC

See Also:
Kernel Version: kernel-PAE-3.0-0.rc3.git5.1.fc16.i686
Tree: Fedora
Regression: Yes


Attachments
/proc/cpuinfo (1.01 KB, text/plain)
2011-06-19 15:01 UTC, Bruno Wolff III
lspci -vvv output (12.41 KB, text/plain)
2011-06-19 15:01 UTC, Bruno Wolff III
dmesg output (85.48 KB, text/plain)
2011-07-01 19:26 UTC, Bruno Wolff III
i686.PAE combined config file (112.20 KB, text/plain)
2011-07-09 08:34 UTC, Bruno Wolff III

Description Bruno Wolff III 2011-06-19 15:00:18 UTC
3.0 kernels mostly lock up during boot with a traceback. 2.6.39 works for me.
Comment 1 Bruno Wolff III 2011-06-19 15:01:10 UTC
Created attachment 62862
/proc/cpuinfo
Comment 2 Bruno Wolff III 2011-06-19 15:01:37 UTC
Created attachment 62872
lspci -vvv output
Comment 3 Bruno Wolff III 2011-06-19 15:02:22 UTC
I have also filed a Fedora bug for this issue:
https://bugzilla.redhat.com/show_bug.cgi?id=714478
Comment 4 john stultz 2011-06-20 18:11:49 UTC
Here's the referenced trace from the other bugzilla: https://bugzilla.redhat.com/attachment.cgi?id=505469

Looks like an NMI hit you in try_to_wake_up from the mutex_unlock path.

Maybe a mutex or scheduler issue?
Comment 5 Bruno Wolff III 2011-06-20 18:44:51 UTC
Sorry about not adding that attachment here. I thought I had, but I must have screwed up.
Comment 6 Bruno Wolff III 2011-06-29 11:46:38 UTC
I am still seeing this with 3.0-rc5
Comment 7 john stultz 2011-07-01 19:20:05 UTC
Peter: Did you get a chance to look at this? It doesn't look like a timers issue, so I'm reassigning it to "process management" since it looks scheduler-y.

Bruno: Could you also attach a full dmesg from the working 2.6.39 kernel?
Comment 8 Bruno Wolff III 2011-07-01 19:26:30 UTC
Created attachment 64422
dmesg output

If it helps, I appear to have the same issue on an old Xeon-based machine, as well as the Athlon-based one. I could get another picture and hardware information.
Comment 9 Peter Zijlstra 2011-07-01 21:28:24 UTC
On Fri, 2011-07-01 at 19:26 +0000, bugzilla-daemon@bugzilla.kernel.org
wrote:
> https://bugzilla.kernel.org/show_bug.cgi?id=37872
> 
> 
> Bruno Wolff III <bruno@wolff.to> changed:
> 
>            What    |Removed                     |Added
> ----------------------------------------------------------------------------
>   Attachment #64422|application/octet-stream    |text/plain
>           mime type|                            |
> 
> 

I managed to get a 32bit SMP thing going using a F15 livecd on a
usb-stick and an atom dev-board.

Sadly with the modified i386_defconfig I cannot seem to reproduce. Could
you post your .config?
Comment 10 Bruno Wolff III 2011-07-09 08:21:19 UTC
I am not sure what you are looking for. I am using the stock (for Fedora) rawhide kernels when I see this issue.
Comment 11 Bruno Wolff III 2011-07-09 08:34:57 UTC
Created attachment 65042
i686.PAE combined config file

I think this is what you want. I checked out the source for the kernel package and ran make Makefile.config which builds the various config files used to build kernels.
Comment 12 Bruno Wolff III 2011-07-10 13:21:47 UTC
I am still seeing this as of rc6.
Comment 13 Bruno Wolff III 2011-07-11 13:37:45 UTC
I am still seeing this in Fedora's kernel-PAE-3.0-0.rc6.git6.1.fc16.i686 package.
Comment 14 Bruno Wolff III 2011-07-11 14:50:14 UTC
This patch has been recommended as a possible fix:
--- linux-2.6.orig/kernel/sched.c
+++ linux-2.6/kernel/sched.c
@@ -7750,6 +7750,9 @@ static void init_cfs_rq(struct cfs_rq *c
 #endif
 #endif
 	cfs_rq->min_vruntime = (u64)(-(1LL << 20));
+#ifndef CONFIG_64BIT
+	cfs_rq->min_vruntime_copy = cfs_rq->min_vruntime;
+#endif
 }

 static void init_rt_rq(struct rt_rq *rt_rq, struct rq *rq)
Comment 15 Bruno Wolff III 2011-07-11 15:48:30 UTC
On Mon, Jul 11, 2011 at 16:35:24 +0200,
  Peter Zijlstra <a.p.zijlstra@chello.nl> wrote:
> 
> > --- Comment #13 from Bruno Wolff III <bruno@wolff.to>  2011-07-11 13:37:45 ---
> > I am still seeing this in Fedora's kernel-PAE-3.0-0.rc6.git6.1.fc16.i686
> > package.
> > 
> 
> Does the below cure things? My Atom board seems to indeed trigger it
> with the fat (debug) .config from Fedora, after the below patch it does
> not.

I will try that out. I probably won't get a chance to work on it until
tonight and then I'll need to wait for a kernel build. (If I do it on
my machine it will take a while, but I'll look at doing a scratch build
on Fedora's build system to speed things up.)

Thanks!
Comment 16 Peter Zijlstra 2011-07-11 15:54:33 UTC
> --- Comment #13 from Bruno Wolff III <bruno@wolff.to>  2011-07-11 13:37:45 ---
> I am still seeing this in Fedora's kernel-PAE-3.0-0.rc6.git6.1.fc16.i686
> package.
> 

Does the below cure things? My Atom board seems to indeed trigger it
with the fat (debug) .config from Fedora, after the below patch it does
not.

---
Subject: sched: Fix 32bit race
From: Peter Zijlstra <a.p.zijlstra@chello.nl>
Date: Mon Jul 11 16:28:50 CEST 2011

Commit 3fe1698b7fe0 ("sched: Deal with non-atomic min_vruntime reads
on 32bit") forgot to initialize min_vruntime_copy which could lead to
an infinite while loop in task_waking_fair() under some circumstances
(early boot, lucky timing).

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Link: http://lkml.kernel.org/n/tip-sw3r5hpdwctatwob4c19df4n@git.kernel.org
---
Index: linux-2.6/kernel/sched.c
===================================================================
--- linux-2.6.orig/kernel/sched.c
+++ linux-2.6/kernel/sched.c
@@ -7750,6 +7750,9 @@ static void init_cfs_rq(struct cfs_rq *c
 #endif
 #endif
 	cfs_rq->min_vruntime = (u64)(-(1LL << 20));
+#ifndef CONFIG_64BIT
+	cfs_rq->min_vruntime_copy = cfs_rq->min_vruntime;
+#endif
 }
 
 static void init_rt_rq(struct rt_rq *rt_rq, struct rq *rq)
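
To illustrate why the missing initialization matters, here is a minimal userspace sketch, not the kernel code itself. It assumes the 32-bit reader retries until min_vruntime and min_vruntime_copy agree (as the commit message's "infinite while loop" suggests) and that an uninitialized min_vruntime_copy reads as zero. The names cfs_rq_sketch, init_sketch, and read_min_vruntime are hypothetical, and the retry count is capped so the sketch terminates where the kernel would spin.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for the two struct cfs_rq fields touched by the patch. */
struct cfs_rq_sketch {
	uint64_t min_vruntime;
	uint64_t min_vruntime_copy;
};

/*
 * Mirrors the patched hunk of init_cfs_rq(). with_fix selects whether
 * min_vruntime_copy is seeded, i.e. whether the three added lines are present.
 */
static void init_sketch(struct cfs_rq_sketch *cfs_rq, bool with_fix)
{
	/* Assumption: the never-initialized field starts out as zero. */
	cfs_rq->min_vruntime_copy = 0;
	cfs_rq->min_vruntime = (uint64_t)(-(1LL << 20));
	if (with_fix)
		cfs_rq->min_vruntime_copy = cfs_rq->min_vruntime;
}

/*
 * Retry-until-consistent read of a 64-bit value that cannot be loaded
 * atomically on 32-bit. Returns true and stores the value in *out if the
 * two fields agreed within max_tries attempts, false otherwise.
 */
static bool read_min_vruntime(const struct cfs_rq_sketch *cfs_rq,
			      unsigned int max_tries, uint64_t *out)
{
	assert(max_tries > 0);

	for (unsigned int i = 0; i < max_tries; i++) {
		uint64_t copy = cfs_rq->min_vruntime_copy;
		/* The kernel would need a read barrier here; omitted in this
		 * single-threaded sketch. */
		uint64_t val = cfs_rq->min_vruntime;

		if (val == copy) {
			*out = val;
			return true;
		}
	}
	/* No writer has updated the copy yet, so the fields never match. */
	return false;
}
```

With with_fix set to false, read_min_vruntime() exhausts its retries because nothing has written the copy yet; with the fix, the first attempt succeeds. This matches the "early boot" condition in the commit message: the loop can only spin until the first real update of min_vruntime refreshes the copy.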
Comment 17 Bruno Wolff III 2011-07-11 23:23:34 UTC
The patch appears to work. Both machines (Athlon and Xeon) that had been locking up during boot now boot into graphical desktops and appear to be working normally.
Thanks for the help!
Comment 18 Florian Mickler 2011-07-12 09:15:53 UTC
Patch: https://bugzilla.kernel.org/show_bug.cgi?id=37872#c16
Comment 19 Florian Mickler 2011-08-15 08:07:14 UTC
Patch got merged into v3.0:

commit c64be78ffb415278d7d32d6f55de95c73dcc19a4
Author: Peter Zijlstra <a.p.zijlstra@chello.nl>
Date:   Mon Jul 11 16:28:50 2011 +0200

    sched: Fix 32bit race
