Created attachment 160381
Picture of kernel panic.

I don't currently have access to hardware to capture the boot sequence. The best I could do is the attached image showing a chunk of the backtrace. The RIP line is:

[ 4.346095] RIP [<ffffffff81021166>] hswep_uncore_sbox_msr_init_box+0x76/0xb0

That leads me to believe this is related to this commit:

http://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/commit/?id=68055915c1c22489f9658bd2b7391bb11b2cf4e4

This is a Haswell-E box, and that commit is about Haswell-EP, so the problem may be specific to that platform. The same issue does not exist in 3.17, so this is a regression in 3.18.
cpuinfo, in case it's useful. Let me know what other information I can provide, or any debugging steps I can take.

# cat /proc/cpuinfo
processor : 0
vendor_id : GenuineIntel
cpu family : 6
model : 63
model name : Intel(R) Core(TM) i7-5930K CPU @ 3.50GHz
stepping : 2
microcode : 0x29
cpu MHz : 1944.960
cache size : 15360 KB
physical id : 0
siblings : 12
core id : 0
cpu cores : 6
apicid : 0
initial apicid : 0
fpu : yes
fpu_exception : yes
cpuid level : 15
wp : yes
flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc aperfmperf eagerfpu pni pclmulqdq dtes64 monitor ds_cpl vmx est tm2 ssse3 fma cx16 xtpr pdcm pcid dca sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm ida arat epb pln pts dtherm tpr_shadow vnmi flexpriority ept vpid fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms invpcid xsaveopt
bugs :
bogomips : 6996.49
clflush size : 64
cache_alignment : 64
address sizes : 46 bits physical, 48 bits virtual
power management:
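The relevant fields in the dump above are vendor, family and model. Haswell-E desktop parts and Haswell-EP Xeons both report family 6, model 63, which would explain why Haswell-EP uncore code runs on this box (that the driver selects its init path by model number is my understanding, not something confirmed in this report). A minimal Python sketch for pulling those fields out of a cpuinfo dump; `parse_cpuinfo`, `is_model_63` and the shortened `SAMPLE` text are illustrative names, not part of any kernel tool:

```python
def parse_cpuinfo(text):
    """Return the fields of the first processor block as a dict of strings."""
    fields = {}
    for line in text.splitlines():
        if not line.strip():
            if fields:
                break  # a blank line ends the first processor block
            continue
        key, sep, value = line.partition(":")
        if sep:
            fields[key.strip()] = value.strip()
    return fields


def is_model_63(fields):
    """True for an Intel family 6, model 63 CPU (Haswell-E / Haswell-EP)."""
    return (fields.get("vendor_id") == "GenuineIntel"
            and fields.get("cpu family") == "6"
            and fields.get("model") == "63")


# Shortened copy of the dump above; on a live system read /proc/cpuinfo instead.
SAMPLE = """\
processor\t: 0
vendor_id\t: GenuineIntel
cpu family\t: 6
model\t\t: 63
model name\t: Intel(R) Core(TM) i7-5930K CPU @ 3.50GHz
stepping\t: 2
"""

info = parse_cpuinfo(SAMPLE)
print(info["model name"], "| family 6 model 63:", is_model_63(info))
```

On an affected machine, `parse_cpuinfo(open("/proc/cpuinfo").read())` should give the same fields for the first logical CPU.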
3.18-rc5 boots with no issue. The commit mentioned in my first comment was included in rc6. I had previously tried rc7, which also panics in the same function. Since that commit added the function in question, it would appear to be the culprit.
As time allows, I'm happy to try to bisect between rc5 and rc6 if it's really necessary.
Simple bisect between rc5 and rc6. This may be too targeted given that only two changes exist in this path:

# git bisect bad
41a134a5830a5e1396723ace0a63000780d6e267 is the first bad commit
commit 41a134a5830a5e1396723ace0a63000780d6e267
Author: Andi Kleen <ak@linux.intel.com>
Date: Mon Nov 3 17:00:27 2014 -0800

    perf/x86/intel/uncore: Fix IRP uncore register offsets on Haswell EP

    The counter register offsets for the IRP box PMU for Haswell-EP were incorrect. The offsets actually changed over IvyBridge EP. Fix them to the correct values. For this we need to fork the read function from the IVB and use an own counter array.

    Tested-by: patrick.lu@intel.com
    Signed-off-by: Andi Kleen <ak@linux.intel.com>
    Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
    Cc: Arnaldo Carvalho de Melo <acme@kernel.org>
    Link: http://lkml.kernel.org/r/1415062828-19759-3-git-send-email-andi@firstfloor.org
    Signed-off-by: Ingo Molnar <mingo@kernel.org>

:040000 040000 497aa3072731499f5d256af3392c451c44d0ae22 d243608ad2fe37985cc85d454d73d54e59b83fdb M arch

# git bisect log
git bisect start '--' 'arch/x86/kernel/cpu'
# bad: [5d01410fe4d92081f349b013a2e7a95429e4f2c9] Linux 3.18-rc6
git bisect bad 5d01410fe4d92081f349b013a2e7a95429e4f2c9
# good: [fc14f9c1272f62c3e8d01300f52467c0d9af50f9] Linux 3.18-rc5
git bisect good fc14f9c1272f62c3e8d01300f52467c0d9af50f9
# bad: [13f5004c94785af107dd702d9fbbe160f1004064] Merge branch 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
git bisect bad 13f5004c94785af107dd702d9fbbe160f1004064
# bad: [41a134a5830a5e1396723ace0a63000780d6e267] perf/x86/intel/uncore: Fix IRP uncore register offsets on Haswell EP
git bisect bad 41a134a5830a5e1396723ace0a63000780d6e267
# first bad commit: [41a134a5830a5e1396723ace0a63000780d6e267] perf/x86/intel/uncore: Fix IRP uncore register offsets on Haswell EP
Full bisect between rc5 and rc6:

# git bisect bad
ce5686d4ed12158599d2042a6c8659254ed263ce is the first bad commit
commit ce5686d4ed12158599d2042a6c8659254ed263ce
Author: Peter Zijlstra (Intel) <peterz@infradead.org>
Date: Wed Oct 29 11:17:04 2014 +0100

    perf/x86: Fix embarrasing typo

    Because we're all human and typing sucks..

    Fixes: 7fb0f1de49fc ("perf/x86: Fix compile warnings for intel_uncore")
    Reported-by: Andi Kleen <ak@linux.intel.com>
    Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
    Cc: Linus Torvalds <torvalds@linux-foundation.org>
    Cc: x86@kernel.org
    Link: http://lkml.kernel.org/n/tip-be0bftjh8yfm4uvmvtf3yi87@git.kernel.org
    Signed-off-by: Ingo Molnar <mingo@kernel.org>

:040000 040000 c17be2a3d3aeee6d6f72cdbfdc0b87324dcfc16a 497aa3072731499f5d256af3392c451c44d0ae22 M arch

# git bisect log
git bisect start
# good: [fc14f9c1272f62c3e8d01300f52467c0d9af50f9] Linux 3.18-rc5
git bisect good fc14f9c1272f62c3e8d01300f52467c0d9af50f9
# bad: [5d01410fe4d92081f349b013a2e7a95429e4f2c9] Linux 3.18-rc6
git bisect bad 5d01410fe4d92081f349b013a2e7a95429e4f2c9
# bad: [928352e9eebabb814d0c38af1772af55677faf62] Merge branch 'drm-fixes' of git://people.freedesktop.org/~airlied/linux
git bisect bad 928352e9eebabb814d0c38af1772af55677faf62
# bad: [a46171d0100eafc0c276962d80f470406d66dcdd] Merge git://git.kernel.org/pub/scm/linux/kernel/git/nab/target-pending
git bisect bad a46171d0100eafc0c276962d80f470406d66dcdd
# bad: [4fc82c0a766cf1d0bc098fb42d00b5292dde65f7] Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mpe/linux
git bisect bad 4fc82c0a766cf1d0bc098fb42d00b5292dde65f7
# bad: [8b2ed21e846c63d8f1bdee0d8df0645721a604a1] Merge branch 'sched-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
git bisect bad 8b2ed21e846c63d8f1bdee0d8df0645721a604a1
# bad: [13f5004c94785af107dd702d9fbbe160f1004064] Merge branch 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
git bisect bad 13f5004c94785af107dd702d9fbbe160f1004064
# bad: [41a134a5830a5e1396723ace0a63000780d6e267] perf/x86/intel/uncore: Fix IRP uncore register offsets on Haswell EP
git bisect bad 41a134a5830a5e1396723ace0a63000780d6e267
# bad: [226424eee809251ec23bd4b09d8efba09c10fc3c] perf: Fix corruption of sibling list with hotplug
git bisect bad 226424eee809251ec23bd4b09d8efba09c10fc3c
# bad: [ce5686d4ed12158599d2042a6c8659254ed263ce] perf/x86: Fix embarrasing typo
git bisect bad ce5686d4ed12158599d2042a6c8659254ed263ce
# first bad commit: [ce5686d4ed12158599d2042a6c8659254ed263ce] perf/x86: Fix embarrasing typo

So, it looks like the problem isn't with any commit post rc5, but that perf counters for intel_uncore are completely broken on Haswell-E? That seems to have been masked previously by the config typo.
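The path-limited bisect and the full bisect blame different commits because limiting a bisect to a path only considers commits that touch that path. A rough Python model of this, assuming a simplified linear history (the real one contains merges) and assumed file paths and ordering for each commit; `first_bad`, `panics`, `touches` and `HISTORY` are made-up names for illustration:

```python
def first_bad(commits, is_bad):
    """Binary-search a linear history (oldest first) for the first bad commit.

    Assumes the state before commits[0] is good and commits[-1] is bad,
    which is what the initial `git bisect good` / `git bisect bad` establish.
    """
    lo, hi = 0, len(commits) - 1
    while lo < hi:
        mid = (lo + hi) // 2
        if is_bad(commits[mid]):
            hi = mid          # first bad commit is at mid or earlier
        else:
            lo = mid + 1      # first bad commit is after mid
    return commits[lo]


# Simplified history between rc5 and rc6, oldest first.
# Order and file paths are assumptions made for this sketch.
HISTORY = [
    {"id": "ce5686d", "paths": ["arch/x86/Kconfig"]},
    {"id": "226424e", "paths": ["kernel/events/core.c"]},
    {"id": "41a134a", "paths": ["arch/x86/kernel/cpu/perf_event_intel_uncore_snbep.c"]},
    {"id": "6805591", "paths": ["arch/x86/kernel/cpu/perf_event_intel_uncore_snbep.c"]},
]
ORDER = {c["id"]: i for i, c in enumerate(HISTORY)}


def panics(commit):
    # Model: every kernel that contains the typo fix panics at boot.
    return ORDER[commit["id"]] >= ORDER["ce5686d"]


def touches(commit, prefix):
    return any(p.startswith(prefix) for p in commit["paths"])


full = first_bad(HISTORY, panics)
limited = first_bad(
    [c for c in HISTORY if touches(c, "arch/x86/kernel/cpu")], panics
)
print("full bisect:", full["id"], "| path-limited bisect:", limited["id"])
```

In this model the full search lands on ce5686d while the path-limited search can only ever land on a commit under arch/x86/kernel/cpu, so it reports 41a134a, matching the two logs above.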
Even worse, CONFIG_PERF_EVENTS_INTEL_UNCORE can't be turned off. There is no option for it in menuconfig, and if you comment it out or set it to 'n', make automatically resets it to 'y'. This is argued about a bit here: https://lkml.org/lkml/2014/10/28/549 Not allowing this to be user configurable means there is no workaround for this issue.
Reintroducing the typo bug (which completely disables CONFIG_PERF_EVENTS_INTEL_UNCORE) and rebuilding allows 3.18.0 to boot.
This appears to be fixed in more recent versions of 3.18 (I just tried 3.18.6). Possibly fixed by http://kernel.opensuse.org/cgit/kernel/commit/?h=stable&id=7f5dada0d5890fd3c4eee140f9489ef32f1a3ae6