There is a lot of information about this in Fedora bug https://bugzilla.redhat.com/show_bug.cgi?id=1116529. I haven't finished the bisect yet, but the range of potentially bad commits is down to a small number of scheduler-related commits.

The crash happens very early in boot, and I am not able to capture a crash dump using netconsole. After slowing down console output I took some pictures, which are attached to the Fedora bug. If you want, I can attach those here or try to get some new ones. The traceback starts with a divide error, and there is a message suggesting a CPU has locked up.

I have two i686 machines and only one is crashing. The one that has the problem has 2 hyperthreaded processors, and the one that doesn't has 2 plain processors. There are some other differences, but that seems to be the one most likely to be related to scheduling.

I expect to finish the bisect late Tuesday and hopefully test reverting the problem commit against the HEAD of Linus' tree on Wednesday.

The bisect log so far is:

git bisect start
# good: [1860e379875dfe7271c649058aeddffe5afd9d0d] Linux 3.15
git bisect good 1860e379875dfe7271c649058aeddffe5afd9d0d
# good: [fad01e866afdbe01a1f3ec06a39c3a8b9e197014] Linux 3.15-rc8
git bisect good fad01e866afdbe01a1f3ec06a39c3a8b9e197014
# bad: [7171511eaec5bf23fb06078f59784a3a0626b38f] Linux 3.16-rc1
git bisect bad 7171511eaec5bf23fb06078f59784a3a0626b38f
# bad: [aaeb2554337217dfa4eac2fcc90da7be540b9a73] Merge branch 'v4l_for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mchehab/linux-media into next
git bisect bad aaeb2554337217dfa4eac2fcc90da7be540b9a73
# good: [5142c33ed86acbcef5c63a63d2b7384b9210d39f] Merge tag 'staging-3.16-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/staging into next
git bisect good 5142c33ed86acbcef5c63a63d2b7384b9210d39f
# bad: [b05d59dfceaea72565b1648af929b037b0f96d7f] Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm into next
git bisect bad b05d59dfceaea72565b1648af929b037b0f96d7f
# good: [e13cccfd86481bd4c0499577f44c570d334da79b] Merge tag 'spi-v3.16' of git://git.kernel.org/pub/scm/linux/kernel/git/broonie/spi into next
git bisect good e13cccfd86481bd4c0499577f44c570d334da79b
# good: [3d521f9151dacab566904d1f57dcb3e7080cdd8f] Merge branch 'perf-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip into next
git bisect good 3d521f9151dacab566904d1f57dcb3e7080cdd8f
# good: [f82393426afb7c82f7618b3b4e440d8dd2b40c08] MIPS: KVM: Add master disable count interface
git bisect good f82393426afb7c82f7618b3b4e440d8dd2b40c08
# bad: [4aef77b2fe373cdba461925589b9d1d4468ee016] Merge branch 'x86-platform-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip into next
git bisect bad 4aef77b2fe373cdba461925589b9d1d4468ee016
# good: [3944a9274ef6cda0cc282daf0739832f661670f7] sched: Fix exec_start/task_hot on migrated tasks
git bisect good 3944a9274ef6cda0cc282daf0739832f661670f7
# bad: [3d1a3bda65d2f48fead6f0727f2f392c15206852] Merge branch 'x86-asm-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip into next
git bisect bad 3d1a3bda65d2f48fead6f0727f2f392c15206852
# bad: [a803f0261bb2bb57aab5542af3174db43b2a3887] sched: Initialize rq->age_stamp on processor start
git bisect bad a803f0261bb2bb57aab5542af3174db43b2a3887
caffcdd8d27ba78730d5540396ce72ad022aff2c is the first bad commit
commit caffcdd8d27ba78730d5540396ce72ad022aff2c
Author: Dietmar Eggemann <Dietmar.Eggemann@arm.com>
Date:   Wed Apr 30 14:39:38 2014 +0100

    sched: Do not zero sg->cpumask and sg->sgp->power in build_sched_groups()

    There is no need to zero struct sched_group member cpumask and struct
    sched_group_power member power since both structures are already
    allocated as zeroed memory in __sdt_alloc().

    This patch has been tested with
    BUG_ON(!cpumask_empty(sched_group_cpus(sg))); and BUG_ON(sg->sgp->power);
    in build_sched_groups() on ARM TC2 and INTEL i5 M520 platform including
    CPU hotplug scenarios.

    Signed-off-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
    Signed-off-by: Peter Zijlstra <peterz@infradead.org>
    Link: http://lkml.kernel.org/r/1398865178-12577-1-git-send-email-dietmar.eggemann@arm.com
    Signed-off-by: Ingo Molnar <mingo@kernel.org>

:040000 040000 8d78e3f468e8bd4a51ba53750ca53d16583e4b53 d42eabda6d8a22ec6ee830a739aa7ac408883184 M kernel
git bisect start
# good: [1860e379875dfe7271c649058aeddffe5afd9d0d] Linux 3.15
git bisect good 1860e379875dfe7271c649058aeddffe5afd9d0d
# good: [fad01e866afdbe01a1f3ec06a39c3a8b9e197014] Linux 3.15-rc8
git bisect good fad01e866afdbe01a1f3ec06a39c3a8b9e197014
# bad: [7171511eaec5bf23fb06078f59784a3a0626b38f] Linux 3.16-rc1
git bisect bad 7171511eaec5bf23fb06078f59784a3a0626b38f
# bad: [aaeb2554337217dfa4eac2fcc90da7be540b9a73] Merge branch 'v4l_for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mchehab/linux-media into next
git bisect bad aaeb2554337217dfa4eac2fcc90da7be540b9a73
# good: [5142c33ed86acbcef5c63a63d2b7384b9210d39f] Merge tag 'staging-3.16-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/staging into next
git bisect good 5142c33ed86acbcef5c63a63d2b7384b9210d39f
# bad: [b05d59dfceaea72565b1648af929b037b0f96d7f] Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm into next
git bisect bad b05d59dfceaea72565b1648af929b037b0f96d7f
# good: [e13cccfd86481bd4c0499577f44c570d334da79b] Merge tag 'spi-v3.16' of git://git.kernel.org/pub/scm/linux/kernel/git/broonie/spi into next
git bisect good e13cccfd86481bd4c0499577f44c570d334da79b
# good: [3d521f9151dacab566904d1f57dcb3e7080cdd8f] Merge branch 'perf-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip into next
git bisect good 3d521f9151dacab566904d1f57dcb3e7080cdd8f
# good: [f82393426afb7c82f7618b3b4e440d8dd2b40c08] MIPS: KVM: Add master disable count interface
git bisect good f82393426afb7c82f7618b3b4e440d8dd2b40c08
# bad: [4aef77b2fe373cdba461925589b9d1d4468ee016] Merge branch 'x86-platform-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip into next
git bisect bad 4aef77b2fe373cdba461925589b9d1d4468ee016
# good: [3944a9274ef6cda0cc282daf0739832f661670f7] sched: Fix exec_start/task_hot on migrated tasks
git bisect good 3944a9274ef6cda0cc282daf0739832f661670f7
# bad: [3d1a3bda65d2f48fead6f0727f2f392c15206852] Merge branch 'x86-asm-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip into next
git bisect bad 3d1a3bda65d2f48fead6f0727f2f392c15206852
# bad: [a803f0261bb2bb57aab5542af3174db43b2a3887] sched: Initialize rq->age_stamp on processor start
git bisect bad a803f0261bb2bb57aab5542af3174db43b2a3887
# good: [c515db8cd311ef77b2dc7cbd6b695022655bb0f3] sched/numa: Fix initialization of sched_domain_topology for NUMA
git bisect good c515db8cd311ef77b2dc7cbd6b695022655bb0f3
# bad: [52a08ef1f13a11289c9e18cd4cfb4e51c024058b] sched: Fix the rq->next_balance logic in rebalance_domains() and idle_balance()
git bisect bad 52a08ef1f13a11289c9e18cd4cfb4e51c024058b
# bad: [a9467fa3cd2d5bf39e7cb7d0706d29d7ef4df212] sched: Use clamp() and clamp_val() to make sys_nice() more readable
git bisect bad a9467fa3cd2d5bf39e7cb7d0706d29d7ef4df212
# bad: [caffcdd8d27ba78730d5540396ce72ad022aff2c] sched: Do not zero sg->cpumask and sg->sgp->power in build_sched_groups()
git bisect bad caffcdd8d27ba78730d5540396ce72ad022aff2c
# first bad commit: [caffcdd8d27ba78730d5540396ce72ad022aff2c] sched: Do not zero sg->cpumask and sg->sgp->power in build_sched_groups()
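
For readers unfamiliar with how the log above narrows the range, here is a minimal sketch of the idea behind bisection, written as a plain binary search in C. It is not how git implements it: real kernel history is a graph with merges, while this sketch assumes a linear list of commits in which every commit after the first bad one is also bad. The names first_bad_commit, is_bad_fn and example_is_bad are hypothetical, and the predicate stands in for the manual build-and-boot test.

```c
#include <stddef.h>

/* Hypothetical predicate: nonzero if the kernel built from the commit at
 * this index fails to boot. Stands in for a manual build-and-boot test. */
typedef int (*is_bad_fn)(size_t commit_index);

/*
 * Return the index of the first bad commit in a linear history.
 * Preconditions (assumptions of this sketch): good < bad, the commit at
 * `good` boots, the commit at `bad` does not, and every commit after the
 * first bad one is also bad.
 */
size_t first_bad_commit(size_t good, size_t bad, is_bad_fn is_bad)
{
    while (bad - good > 1) {
        size_t mid = good + (bad - good) / 2;

        if (is_bad(mid))
            bad = mid;   /* first bad commit is at mid or earlier */
        else
            good = mid;  /* first bad commit is after mid */
    }
    return bad;
}

/* Made-up example predicate: everything from index 42 onward is bad. */
int example_is_bad(size_t commit_index)
{
    return commit_index >= 42;
}
```

Each test halves the remaining range, which is why a range of thousands of commits between 3.15 and 3.16-rc1 needs only a few dozen builds.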
Created attachment 143211 [details] Config file used to build from first bad commit
Created attachment 143221 [details] lshw output from the machine exhibiting the problem
Created attachment 143231 [details] lshw from another i686 machine that doesn't exhibit the problem
A simple revert (against commit 1795cd9b3a91d4b5473c97f491d63892442212ab) didn't build:

kernel/sched/core.c: In function ‘build_sched_groups’:
kernel/sched/core.c:5851:5: error: ‘struct sched_group’ has no member named ‘sgp’
     sg->sgp->power = 0;
     ^
scripts/Makefile.build:257: recipe for target 'kernel/sched/core.o' failed
I have been using gcc-4.9.0 to do kernel builds.
Adding back just the cpumask_clear(sched_group_cpus(sg)) (to rc5) gets things working again.

git diff v3.16-rc5
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 3bdf01b..7c3674d 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -5847,6 +5847,7 @@ build_sched_groups(struct sched_domain *sd, int cpu)
 			continue;
 		group = get_group(i, sdd, &sg);
+		cpumask_clear(sched_group_cpus(sg));
 		cpumask_setall(sched_group_mask(sg));
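
That the one-line cpumask_clear() makes a difference suggests the group's CPU mask is not always empty when build_sched_groups() starts filling it in. The following is a toy user-space model of that situation, not kernel code: struct toy_group and toy_build_group are hypothetical names, a 32-bit integer stands in for the cpumask, and the assumption that the same group gets populated more than once is mine, not something this report establishes.

```c
#include <stdint.h>

/* Toy stand-in for struct sched_group; only the CPU mask is modelled. */
struct toy_group {
    uint32_t cpus; /* bit n set means CPU n is a member of the group */
};

/*
 * Add every CPU in `span` to the group. With clear_first set, the mask is
 * emptied beforehand, which mirrors the cpumask_clear() call in the diff.
 */
void toy_build_group(struct toy_group *sg, uint32_t span, int clear_first)
{
    unsigned int cpu;

    if (clear_first)
        sg->cpus = 0;

    for (cpu = 0; cpu < 32; cpu++) {
        if (span & (UINT32_C(1) << cpu))
            sg->cpus |= UINT32_C(1) << cpu;
    }
}
```

In this model, building a zero-initialised group once gives the same result with or without the clear. Building it a second time with a different span only gives the intended mask when the clear is present; without it the bits from the first pass remain set.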
Created attachment 143261 [details] /proc/cpuinfo
Created attachment 143311 [details] /proc/sys/kernel/sched_domain/cpu*/domain*/*
Created attachment 143321 [details]
/proc/schedstat output

This is from 3.16-rc5 with the following diff:

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 3bdf01b..21ba65c 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -5847,6 +5847,10 @@ build_sched_groups(struct sched_domain *sd, int cpu)
 			continue;
 		group = get_group(i, sdd, &sg);
+		cpumask_clear(sched_group_cpus(sg));
+		sg->sgc->capacity = 0;
+		BUG_ON(!cpumask_empty(sched_group_cpus(sg)));
+		BUG_ON(sg->sgc->capacity);
 		cpumask_setall(sched_group_mask(sg));
 		for_each_cpu(j, span) {
Created attachment 143331 [details]
Boot picture

The picture DSCN1530.JPG shows the BUG_ON output, triggered at line 5850, which is:
BUG_ON(!cpumask_empty(sched_group_cpus(sg)));
Created attachment 143361 [details] dmesg output with Peter's debug patch and earlyprintk=keep sched_debug
Created attachment 143381 [details] dmesg output with Peter's updated debug patch and earlyprintk=keep sched_debug
Created attachment 143871 [details] cpuid output
Created attachment 143881 [details] cpuid -r output
Created attachment 143961 [details] dmesg output with latest test patches
Created attachment 143971 [details] The previous dmesg output had these differences from 3.16-rc6
This last test was a success. Peter is going to formally write up a patch and send it to Linus. I'll test the formal patch when it shows up.
This is in 3.16-rc7 as commit 2a2261553dd1472ca574acadbd93e12f44c4e6d5.