Bug 80251 - Crash early in boot that is likely scheduler related.
Summary: Crash early in boot that is likely scheduler related.
Status: RESOLVED CODE_FIX
Alias: None
Product: Process Management
Classification: Unclassified
Component: Scheduler (show other bugs)
Hardware: i386 Linux
: P1 normal
Assignee: Ingo Molnar
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2014-07-15 04:38 UTC by Bruno Wolff III
Modified: 2014-07-27 23:29 UTC (History)
0 users

See Also:
Kernel Version: 3.16
Subsystem:
Regression: Yes
Bisected commit-id:


Attachments
Config file used to build from first bad commit (139.79 KB, text/plain)
2014-07-16 06:36 UTC, Bruno Wolff III
Details
lshw output from the machine exhibiting the problem (22.18 KB, text/plain)
2014-07-16 06:42 UTC, Bruno Wolff III
Details
lshw from another i686 machine that doesn't exhibit the problem (19.90 KB, text/plain)
2014-07-16 06:43 UTC, Bruno Wolff III
Details
/proc/cpuinfo (2.50 KB, text/plain)
2014-07-16 19:50 UTC, Bruno Wolff III
Details
/proc/sys/kernel/sched_domain/cpu*/domain*/* (454 bytes, text/plain)
2014-07-17 03:05 UTC, Bruno Wolff III
Details
/proc/schedstat output (1.43 KB, text/plain)
2014-07-17 04:32 UTC, Bruno Wolff III
Details
Boot picture (573.73 KB, image/jpeg)
2014-07-17 16:35 UTC, Bruno Wolff III
Details
dmesg output with Peter's debug patch and earlypr intk=keepsched_debug (57.04 KB, text/plain)
2014-07-18 05:31 UTC, Bruno Wolff III
Details
dmesg output with Peter's updated debug patch and earlyprintk=keep sched_debug (59.36 KB, text/plain)
2014-07-18 12:58 UTC, Bruno Wolff III
Details
cpuid output (20.41 KB, text/plain)
2014-07-22 12:08 UTC, Bruno Wolff III
Details
cpuid -r output (3.21 KB, text/plain)
2014-07-22 12:09 UTC, Bruno Wolff III
Details
dmesg output with latest test patches (57.75 KB, text/plain)
2014-07-23 01:33 UTC, Bruno Wolff III
Details
The previous dmesg output had these differences from 3.16-rc6 (4.84 KB, text/plain)
2014-07-23 01:34 UTC, Bruno Wolff III
Details

Description Bruno Wolff III 2014-07-15 04:38:34 UTC
There is a log of information about this in fedora bug https://bugzilla.redhat.com/show_bug.cgi?id=1116529.
I haven't finished a bisect yet, but the range of potential bad commits is down to a small number of scheduler related commits.
The crash is very early in the boot and I am not able to capture a crash dump using netconsole. I took some pictures that are attached to the Fedora bug after slowing down console output. If you want I can attach those here or try to get some new ones. The traceback starts off with a divide error and there is a message suggesting a CPU has locked up.
I have 2 i686 machines and only one is crashing. The one that has the problem has 2 hyperthreaded processors and the one that doesn't has 2 plain processors. There are some other differences, but that seems to be the one mostly likely to be related to scheduling.
I expect to finish the bisect late Tuesday and hopefully test reverting the problem commit against the HEAD of Linus' tree on Wednesday.
The bisect log so far is:
git bisect start
# good: [1860e379875dfe7271c649058aeddffe5afd9d0d] Linux 3.15
git bisect good 1860e379875dfe7271c649058aeddffe5afd9d0d
# good: [fad01e866afdbe01a1f3ec06a39c3a8b9e197014] Linux 3.15-rc8
git bisect good fad01e866afdbe01a1f3ec06a39c3a8b9e197014
# bad: [7171511eaec5bf23fb06078f59784a3a0626b38f] Linux 3.16-rc1
git bisect bad 7171511eaec5bf23fb06078f59784a3a0626b38f
# bad: [aaeb2554337217dfa4eac2fcc90da7be540b9a73] Merge branch 'v4l_for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mchehab/linux-media into next
git bisect bad aaeb2554337217dfa4eac2fcc90da7be540b9a73
# good: [5142c33ed86acbcef5c63a63d2b7384b9210d39f] Merge tag 'staging-3.16-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/staging into next
git bisect good 5142c33ed86acbcef5c63a63d2b7384b9210d39f
# bad: [b05d59dfceaea72565b1648af929b037b0f96d7f] Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm into next
git bisect bad b05d59dfceaea72565b1648af929b037b0f96d7f
# good: [e13cccfd86481bd4c0499577f44c570d334da79b] Merge tag 'spi-v3.16' of git://git.kernel.org/pub/scm/linux/kernel/git/broonie/spi into next
git bisect good e13cccfd86481bd4c0499577f44c570d334da79b
# good: [3d521f9151dacab566904d1f57dcb3e7080cdd8f] Merge branch 'perf-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip into next
git bisect good 3d521f9151dacab566904d1f57dcb3e7080cdd8f
# good: [f82393426afb7c82f7618b3b4e440d8dd2b40c08] MIPS: KVM: Add master disable count interface
git bisect good f82393426afb7c82f7618b3b4e440d8dd2b40c08
# bad: [4aef77b2fe373cdba461925589b9d1d4468ee016] Merge branch 'x86-platform-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip into next
git bisect bad 4aef77b2fe373cdba461925589b9d1d4468ee016
# good: [3944a9274ef6cda0cc282daf0739832f661670f7] sched: Fix exec_start/task_hot on migrated tasks
git bisect good 3944a9274ef6cda0cc282daf0739832f661670f7
# bad: [3d1a3bda65d2f48fead6f0727f2f392c15206852] Merge branch 'x86-asm-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip into next
git bisect bad 3d1a3bda65d2f48fead6f0727f2f392c15206852
# bad: [a803f0261bb2bb57aab5542af3174db43b2a3887] sched: Initialize rq->age_stamp on processor start
git bisect bad a803f0261bb2bb57aab5542af3174db43b2a3887
Comment 1 Bruno Wolff III 2014-07-16 06:29:41 UTC
caffcdd8d27ba78730d5540396ce72ad022aff2c is the first bad commit
commit caffcdd8d27ba78730d5540396ce72ad022aff2c
Author: Dietmar Eggemann <Dietmar.Eggemann@arm.com>
Date:   Wed Apr 30 14:39:38 2014 +0100

    sched: Do not zero sg->cpumask and sg->sgp->power in build_sched_groups()
    
    There is no need to zero struct sched_group member cpumask and struct
    sched_group_power member power since both structures are already allocated
    as zeroed memory in __sdt_alloc().
    
    This patch has been tested with
    BUG_ON(!cpumask_empty(sched_group_cpus(sg))); and BUG_ON(sg->sgp->power);
    in build_sched_groups() on ARM TC2 and INTEL i5 M520 platform including
    CPU hotplug scenarios.
    
    Signed-off-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
    Signed-off-by: Peter Zijlstra <peterz@infradead.org>
    Link: http://lkml.kernel.org/r/1398865178-12577-1-git-send-email-dietmar.eggemann@arm.com
    Signed-off-by: Ingo Molnar <mingo@kernel.org>

:040000 040000 8d78e3f468e8bd4a51ba53750ca53d16583e4b53 d42eabda6d8a22ec6ee830a739aa7ac408883184 M	kernel
Comment 2 Bruno Wolff III 2014-07-16 06:31:11 UTC
git bisect start
# good: [1860e379875dfe7271c649058aeddffe5afd9d0d] Linux 3.15
git bisect good 1860e379875dfe7271c649058aeddffe5afd9d0d
# good: [fad01e866afdbe01a1f3ec06a39c3a8b9e197014] Linux 3.15-rc8
git bisect good fad01e866afdbe01a1f3ec06a39c3a8b9e197014
# bad: [7171511eaec5bf23fb06078f59784a3a0626b38f] Linux 3.16-rc1
git bisect bad 7171511eaec5bf23fb06078f59784a3a0626b38f
# bad: [aaeb2554337217dfa4eac2fcc90da7be540b9a73] Merge branch 'v4l_for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mchehab/linux-media into next
git bisect bad aaeb2554337217dfa4eac2fcc90da7be540b9a73
# good: [5142c33ed86acbcef5c63a63d2b7384b9210d39f] Merge tag 'staging-3.16-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/staging into next
git bisect good 5142c33ed86acbcef5c63a63d2b7384b9210d39f
# bad: [b05d59dfceaea72565b1648af929b037b0f96d7f] Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm into next
git bisect bad b05d59dfceaea72565b1648af929b037b0f96d7f
# good: [e13cccfd86481bd4c0499577f44c570d334da79b] Merge tag 'spi-v3.16' of git://git.kernel.org/pub/scm/linux/kernel/git/broonie/spi into next
git bisect good e13cccfd86481bd4c0499577f44c570d334da79b
# good: [3d521f9151dacab566904d1f57dcb3e7080cdd8f] Merge branch 'perf-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip into next
git bisect good 3d521f9151dacab566904d1f57dcb3e7080cdd8f
# good: [f82393426afb7c82f7618b3b4e440d8dd2b40c08] MIPS: KVM: Add master disable count interface
git bisect good f82393426afb7c82f7618b3b4e440d8dd2b40c08
# bad: [4aef77b2fe373cdba461925589b9d1d4468ee016] Merge branch 'x86-platform-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip into next
git bisect bad 4aef77b2fe373cdba461925589b9d1d4468ee016
# good: [3944a9274ef6cda0cc282daf0739832f661670f7] sched: Fix exec_start/task_hot on migrated tasks
git bisect good 3944a9274ef6cda0cc282daf0739832f661670f7
# bad: [3d1a3bda65d2f48fead6f0727f2f392c15206852] Merge branch 'x86-asm-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip into next
git bisect bad 3d1a3bda65d2f48fead6f0727f2f392c15206852
# bad: [a803f0261bb2bb57aab5542af3174db43b2a3887] sched: Initialize rq->age_stamp on processor start
git bisect bad a803f0261bb2bb57aab5542af3174db43b2a3887
# good: [c515db8cd311ef77b2dc7cbd6b695022655bb0f3] sched/numa: Fix initialization of sched_domain_topology for NUMA
git bisect good c515db8cd311ef77b2dc7cbd6b695022655bb0f3
# bad: [52a08ef1f13a11289c9e18cd4cfb4e51c024058b] sched: Fix the rq->next_balance logic in rebalance_domains() and idle_balance()
git bisect bad 52a08ef1f13a11289c9e18cd4cfb4e51c024058b
# bad: [a9467fa3cd2d5bf39e7cb7d0706d29d7ef4df212] sched: Use clamp() and clamp_val() to make sys_nice() more readable
git bisect bad a9467fa3cd2d5bf39e7cb7d0706d29d7ef4df212
# bad: [caffcdd8d27ba78730d5540396ce72ad022aff2c] sched: Do not zero sg->cpumask and sg->sgp->power in build_sched_groups()
git bisect bad caffcdd8d27ba78730d5540396ce72ad022aff2c
# first bad commit: [caffcdd8d27ba78730d5540396ce72ad022aff2c] sched: Do not zero sg->cpumask and sg->sgp->power in build_sched_groups()
Comment 3 Bruno Wolff III 2014-07-16 06:36:34 UTC
Created attachment 143211 [details]
Config file used to build from first bad commit
Comment 4 Bruno Wolff III 2014-07-16 06:42:38 UTC
Created attachment 143221 [details]
lshw output from the machine exhibiting the problem
Comment 5 Bruno Wolff III 2014-07-16 06:43:50 UTC
Created attachment 143231 [details]
lshw from another i686 machine that doesn't exhibit the problem
Comment 6 Bruno Wolff III 2014-07-16 06:51:04 UTC
A simple revert (against commit 1795cd9b3a91d4b5473c97f491d63892442212ab) didn't build:
kernel/sched/core.c: In function ‘build_sched_groups’:
kernel/sched/core.c:5851:5: error: ‘struct sched_group’ has no member named ‘sgp’
   sg->sgp->power = 0;
     ^
scripts/Makefile.build:257: recipe for target 'kernel/sched/core.o' failed
Comment 7 Bruno Wolff III 2014-07-16 06:58:49 UTC
I have been using gcc-4.9.0 to do kernel builds.
Comment 8 Bruno Wolff III 2014-07-16 12:19:39 UTC
Adding back just the cpumask_clear(sched_group_cpus(sg)) (to rc5) gets things working again.

git diff v3.16-rc5
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 3bdf01b..7c3674d 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -5847,6 +5847,7 @@ build_sched_groups(struct sched_domain *sd, int cpu)
                        continue;
 
                group = get_group(i, sdd, &sg);
+               cpumask_clear(sched_group_cpus(sg));
                cpumask_setall(sched_group_mask(sg));
Comment 9 Bruno Wolff III 2014-07-16 19:50:20 UTC
Created attachment 143261 [details]
/proc/cpuinfo
Comment 10 Bruno Wolff III 2014-07-17 03:05:07 UTC
Created attachment 143311 [details]
/proc/sys/kernel/sched_domain/cpu*/domain*/*
Comment 11 Bruno Wolff III 2014-07-17 04:32:37 UTC
Created attachment 143321 [details]
/proc/schedstat output

This is from 3.16-rc5 with the following diff:
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 3bdf01b..21ba65c 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -5847,6 +5847,10 @@ build_sched_groups(struct sched_domain *sd, int cpu)
                        continue;
 
                group = get_group(i, sdd, &sg);
+               cpumask_clear(sched_group_cpus(sg));
+               sg->sgc->capacity = 0;
+               BUG_ON(!cpumask_empty(sched_group_cpus(sg)));
+               BUG_ON(sg->sgc->capacity);
                cpumask_setall(sched_group_mask(sg));
 
                for_each_cpu(j, span) {
Comment 12 Bruno Wolff III 2014-07-17 16:35:16 UTC
Created attachment 143331 [details]
Boot picture

The picture DSCN1530.JPG shows bug on output triggered at 5850 which is: BUG_ON(!cpumask_empty(sched_group_cpus(sg)));
Comment 13 Bruno Wolff III 2014-07-18 05:31:48 UTC
Created attachment 143361 [details]
dmesg output with Peter's debug patch and earlypr intk=keepsched_debug
Comment 14 Bruno Wolff III 2014-07-18 12:58:59 UTC
Created attachment 143381 [details]
dmesg output with Peter's updated debug patch and earlyprintk=keep sched_debug
Comment 15 Bruno Wolff III 2014-07-22 12:08:38 UTC
Created attachment 143871 [details]
cpuid output
Comment 16 Bruno Wolff III 2014-07-22 12:09:18 UTC
Created attachment 143881 [details]
cpuid -r output
Comment 17 Bruno Wolff III 2014-07-23 01:33:24 UTC
Created attachment 143961 [details]
dmesg output with latest test patches
Comment 18 Bruno Wolff III 2014-07-23 01:34:36 UTC
Created attachment 143971 [details]
The previous dmesg output had these differences from 3.16-rc6
Comment 19 Bruno Wolff III 2014-07-23 14:50:30 UTC
This last test was a success. Peter is going to formally write up a patch and send it to Linus. I'll test the formal patch when it shows up.
Comment 20 Bruno Wolff III 2014-07-27 23:29:30 UTC
This is in 3.16-rc7 as commit 2a2261553dd1472ca574acadbd93e12f44c4e6d5.

Note You need to log in before you can comment on or make changes to this bug.