Bug 43098
Summary: | Kernel panic : early exception 06 rip 10 | ||
---|---|---|---|
Product: | Platform Specific/Hardware | Reporter: | valère monseur (valere.monseur) |
Component: | x86-64 | Assignee: | Tejun Heo (tj) |
Status: | RESOLVED CODE_FIX | ||
Severity: | blocking | CC: | arthur.titeica, dennyvatwork, hpa, j.heather, jfree143dev, lopezmen, nathansamson, the.ridikulus.rat, tj, valere.monseur |
Priority: | P1 | ||
Hardware: | All | ||
OS: | Linux | ||
Kernel Version: | 3.3.1 and 3.3.2 | Subsystem: | |
Regression: | No | Bisected commit-id: | |
Attachments: | lspci output; lsusb output; output of objdump; screenshot with debug info; adapted smp_scan_config in mpparse.c to allow boot; dmesg output after booting with the adapted mpparse.c; memblock-zero-len-fix.patch | ||
Description
valère monseur
2012-04-13 08:31:04 UTC
Created attachment 72902 [details]
lspci output
Created attachment 72903 [details]
lsusb output
It looks like people managed to boot after disabling the wifi in the BIOS (see info here: https://bbs.archlinux.org/viewtopic.php?id=139280).

*** Bug 43076 has been marked as a duplicate of this bug. ***

...and of course the info already mentioned here: https://bugzilla.kernel.org/show_bug.cgi?id=43076

What other info can we provide to help find the root cause of this bug?

This is a bug affecting quite a number of people: see #43076. It would be really good to get this sorted, because it affects every distro with a recent kernel, so Linux is currently a no-go on affected machines.

Ok, I've recompiled the current stable kernel, which is now 3.3.2, and the problem is still present. Because of the recompilation and code changes, the error message is now:

early exception 06 rip 10: ffffffff81453b2f

so only the address has changed. I've attached the file dump.txt, which is the output of 'objdump -d --no-show-raw-insn vmlinux'. As far as I can see, the address ffffffff81453b2f is in the function 'memblock_reserve', which I think is located in mm/bootmem.c. The only patch I can find in the 3.3 release that has changed bootmem.c is described here: http://www.spinics.net/lists/stable-commits/msg16531.html ...but I'm not a kernel expert, so it's just a guess. I will try to recompile the kernel tomorrow without the changes made by that patch and see the result.

Remark: I don't see any link with the wifi driver here, so I'm not sure this will give any good result. As mentioned, it's just a guess. Any comment on this is welcome.

Created attachment 72916 [details]
output of objdump
This means the kernel has tripped over a BUG() or BUG_ON(). Please compile the kernel with CONFIG_DEBUG_INFO and use:

addr2line -e vmlinux <address>

... to figure out which one.

git bisect gives the following result:

24aa07882b672fff2da2f5c955759f0bd13d32d5 is the first bad commit
commit 24aa07882b672fff2da2f5c955759f0bd13d32d5
Author: Tejun Heo <tj@kernel.org>
Date: Tue Jul 12 11:16:06 2011 +0200

    memblock, x86: Replace memblock_x86_reserve/free_range() with generic ones

    Other than sanity check and debug message, the x86 specific version of
    memblock reserve/free functions are simple wrappers around the generic
    versions - memblock_reserve/free().

    This patch adds debug messages with caller identification to the
    generic versions and replaces x86 specific ones and kills them.
    arch/x86/include/asm/memblock.h and arch/x86/mm/memblock.c are empty
    after this change and removed.

    Signed-off-by: Tejun Heo <tj@kernel.org>
    Link: http://lkml.kernel.org/r/1310462166-31469-14-git-send-email-tj@kernel.org
    Cc: Yinghai Lu <yinghai@kernel.org>
    Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
    Cc: Thomas Gleixner <tglx@linutronix.de>
    Cc: Ingo Molnar <mingo@redhat.com>
    Cc: "H. Peter Anvin" <hpa@zytor.com>
    Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>

:040000 040000 f87cd91f4ebb1883b5830906c25932a32919a9c6 1da1a8bb80a2049a0d97048d0b139ddae9ded2fc M arch
:040000 040000 c32fb9b38f23b5f024b03d490be3a0959843245b fd770cb6a1aec09ff1c713b498de5618ac09e4a7 M include
:040000 040000 45602c08b24f9afa325b38c6d915ab283ca92093 6b842d41af45012d764cd5c721138378b7d8812b M mm

Here is the first tag that the commit was included in:

git describe --contains 24aa07882b672fff2da2f5c955759f0bd13d32d5
v3.3-rc2~15^2~8

Previously, in arch/x86/mm/memblock.c,

void __init memblock_x86_reserve_range(u64 start, u64 end, char *name)

and

void __init memblock_x86_free_range(u64 start, u64 end)

did the following before calling memblock_reserve/memblock_free:

if (start == end)
        return;

Now those functions are gone and memblock_reserve/memblock_free are called directly. If start == end, then the following gets called:

BUG_ON(0 == size);

Created attachment 72918 [details]
screenshot with debug info
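The behaviour change found by the bisect can be modelled in user space. The following is only a sketch: `model_reserve` and `model_x86_reserve_range` are hypothetical names, not kernel functions, and the model returns -1 where the 3.3 kernels discussed here would hit `BUG_ON(0 == size)`.

```c
#include <stdint.h>
#include <stdio.h>

/*
 * Hypothetical stand-in for the generic reserve function. In this model a
 * zero size returns -1; the kernels in this report would panic instead.
 */
static int model_reserve(uint64_t base, uint64_t size)
{
    if (size == 0)
        return -1;  /* corresponds to BUG_ON(0 == size) */
    printf("reserve [%#llx-%#llx]\n",
           (unsigned long long)base,
           (unsigned long long)(base + size - 1));
    return 0;
}

/*
 * Hypothetical model of the removed x86 wrapper: an empty range is
 * silently skipped before the generic function is ever reached.
 */
static int model_x86_reserve_range(uint64_t start, uint64_t end)
{
    if (start == end)
        return 0;   /* nothing to reserve */
    return model_reserve(start, end - start);
}
```

With the wrapper in place an empty range such as 0x9e0a1-0x9e0a1 is ignored, whereas calling the generic model directly with a size of 0 reports the failure.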
I've added a screenshot of the debug info that is printed at the time the problem occurs. The command line parameters used are:

debug ignore_loglevel log_buf_len=10M print_fatal_signals=1 LOGLEVEL=8 earlyprintk=vga,keep sched_debug

What other info can I provide?

So, a bit more investigation shows me that the panic occurs when 'memblock_reserve' is called with a size of 0 (we already knew that). The sequence is the following:

In arch/x86/kernel/mpparse.c, the function 'smp_scan_config' first tries to find the 'mpf_intel' table and succeeds (in my case at address F6D60). It then performs a 'memblock_reserve' of that table and succeeds. This is the line of code:

memblock_reserve(mem, sizeof(*mpf));

Then it uses the pointer 'mpf->physptr' to access the 'mpc' table and calls 'smp_reserve_memory'. These are the lines of code:

if (mpf->physptr)
        smp_reserve_memory(mpf);

The call to 'memblock_reserve' is done in 'smp_reserve_memory', and it fails with a size of 0. In fact, the pointer 'mpf->physptr' doesn't seem to point at the right address, or the memory at that address is corrupted, so the size retrieved for the call to memblock_reserve is of course wrong. In my case the mpc table is at address 9E0A1 when it fails. If I boot with wifi disabled in the BIOS (which, as we know, allows the kernel to boot), the mpc table is at address 9C971.

Voilà, that's all for now. If you have any ideas on what to do next, please let me know.

Cheers

According to the Intel MultiProcessor Specification, "The MP Floating Pointer structure must be stored in at least...the first kilobyte of EBDA...within the last kilobyte of system base memory (639K-640K)...or in the BIOS rom address space (0F0000h-0FFFFFh)". Also according to the specification, the OS should look for it in this sequence:

1) EBDA
2) last kb of system base memory
3) BIOS rom

Looking at the code in mpparse.c, Linux does it in this sequence:

1) first kb of system base memory (why?)
2) last kb of system base memory
3) BIOS rom
4) EBDA

The reason for checking the EBDA last is that some buggy Linux loaders corrupt EBDA memory.

Ok, now let's assume that my BIOS is buggy in some cases (and that the way the code was written before was hiding it). Then, when 'smp_scan_config' runs in mpparse.c, shouldn't it also check that the pointer to the mpc table (if any) really points to a valid address (by checking the mpc signature) before it reserves the memory? If it doesn't point to a valid address, then it should skip it.

I've quickly tested the above and it worked. That is, it found an mpf table with a pointer, checked the signature at that address, found no mpc signature there, skipped the memory reservation and continued. It then found an mpf table in the EBDA and continued to boot.

So, once more, I have to say I'm not a kernel expert, so I would appreciate it if someone could have a look at this. Is this the direction to follow? Is it not a BIOS bug? I will post the mpparse.c code I've changed and the dmesg output right after this.

Created attachment 72923 [details]
adapted smp_scan_config in mpparse.c to allow boot
Created attachment 72924 [details]
dmesg output after booting with the adapted mpparse.c
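The signature check proposed above can be sketched in plain C. This is not the attached mpparse.c change, which is not reproduced in this report; `struct mpc_header` and `mpc_reserve_size` are hypothetical names, and the struct is a trimmed view that keeps only the signature and length fields at the start of the MP configuration table header. The "PCMP" signature is the one the MP specification defines for that table.

```c
#include <stdint.h>
#include <string.h>

#define MPC_SIGNATURE "PCMP"    /* signature of the MP configuration table */

/* Hypothetical, trimmed view of the table header: only the fields used here. */
struct mpc_header {
    char     signature[4];
    uint16_t length;            /* length of the base table in bytes */
};

/*
 * Return the number of bytes to reserve for the table at 'mem', or 0 when
 * the pointer is NULL or does not point at a table carrying the expected
 * signature, in which case the caller should skip the reservation.
 */
static size_t mpc_reserve_size(const void *mem)
{
    struct mpc_header hdr;

    if (mem == NULL)
        return 0;
    memcpy(&hdr, mem, sizeof(hdr));     /* copy to avoid unaligned access */
    if (memcmp(hdr.signature, MPC_SIGNATURE, 4) != 0)
        return 0;                       /* not an MP configuration table */
    return hdr.length;
}
```

A caller would reserve memory only when the returned size is non-zero, which covers both a bad pointer and a table that reports an empty length.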
Ok, now that it boots with 3.3.2, I've compared the dmesg output with the one I get with the LTS kernel (3.0.27). I can see this for the LTS kernel:

[ 0.000000] Scan SMP from ffff880000000000 for 1024 bytes.
[ 0.000000] Scan SMP from ffff88000009fc00 for 1024 bytes.
[ 0.000000] Scan SMP from ffff8800000f0000 for 65536 bytes.
[ 0.000000] found SMP MP-table at [ffff8800000f6d60] f6d60
[ 0.000000] memblock_x86_reserve_range: [0x000f6d60-0x000f6d6f] * MP-table mpf
[ 0.000000] mpc: 9e0a1-9e0a1
[ 0.000000] memblock_x86_reserve_range: [0x01b31000-0x01b311cb] BRK

What's really interesting is this line: "mpc: 9e0a1-9e0a1", which comes from this code:

apic_printk(APIC_VERBOSE, " mpc: %lx-%lx\n", physptr, physptr + size);

This means that even with previous kernel versions the mpc address was wrong, and so the size was 0. The difference is that before, it just did not panic. So, different solutions are possible:

- skip the mpf table if the mpc signature is not found (above proposal)
- skip the mpc table if the mpc signature is not found (as if it were a null pointer)
- do not call 'memblock_reserve' if the size is 0

Ok, waiting for you, guys!

I have the same issue running Fedora 16 on kernels 3.2.10-3.3.1, but on a Dell Studio XPS 13 (Studio XPS 1340). I could not deactivate the wifi through the BIOS, so I can't say whether the wifi is involved or not.

I can confirm this issue on my two Dell XPS L501X machines. The error is always:

PANIC: early exception 06 rip 10:ffffffff81d369ef error 0 cr2 0

Tested with kernel 3.3.1 from the Fedora 16 rpm and 3.3.2 compiled from sources. Disabling the WiFi in the BIOS makes the system boot.

Thanks for reporting this. I'm wondering, however, whether some kernel expert is really looking at it, as the status of the bug report is still NEW and not ASSIGNED. Does anyone have an idea?

(In reply to comment #21)
> I'm wondering however if some kernel expert is really looking at it as the
> status of the bug report is still NEW and not ASSIGNED.

Yes, I have the same worry. It would be really useful to get this fixed before Fedora 17 comes out, because otherwise no one with this hardware will be able to upgrade. @H. Peter Anvin is on the cc list, and is also one of the guys who signed off the original patch. Maybe he will look at it?

James

Suffering from the same issue with a Dell XPS 1340 on Archlinux. The last bootable kernel for me (because of other dependencies like the nvidia driver) is 3.2.7. I was able to boot 3.3.2 with acpi=off, but that just made nvidia crash... so I'm back on the older kernel, collecting dependency ignores... :/

There are some related upstream bugs on bugs.archlinux.org ... but it seems that no one really knows what to make of it...

https://bugs.archlinux.org/task/29240
https://bugs.archlinux.org/task/29063
https://bugs.archlinux.org/task/29351

It seems to be a serious problem for Dell XPS laptops and random other hardware. Weird stuff! Please look into it! :/

Created attachment 72983 [details]
memblock-zero-len-fix.patch
Sorry about the trouble. memblock shouldn't be panicking on zero-len ops. Can someone please verify the attached fix?
Thank you.
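The idea of tolerating zero-length operations can be illustrated with a toy region table. The attached patch itself is not reproduced in this report, so this is only a sketch under that assumption: `struct region_table` and `table_reserve` are hypothetical names, and the table does no merging or sorting of ranges.

```c
#include <stddef.h>
#include <stdint.h>

#define MAX_REGIONS 8

struct region {
    uint64_t base;
    uint64_t size;
};

struct region_table {
    struct region regions[MAX_REGIONS];
    size_t count;
};

/*
 * Record a reserved range. Returns 0 on success and -1 when the table is
 * full. A zero-length request is treated as a successful no-op instead of
 * an error, so callers that compute an empty range do not fail.
 */
static int table_reserve(struct region_table *t, uint64_t base, uint64_t size)
{
    if (size == 0)
        return 0;               /* nothing to record */
    if (t->count == MAX_REGIONS)
        return -1;
    t->regions[t->count].base = base;
    t->regions[t->count].size = size;
    t->count++;
    return 0;
}
```

In this model, reserving 16 bytes at 0xf6d60 adds one entry, while a zero-length request at 0x9e0a1 succeeds and leaves the table unchanged.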
@Tejun Heo: no problem, and thanks for the patch. I will test it asap; unfortunately not today, but I surely will tomorrow.

@David Runge: this bug report is only related to Archlinux bug 29351 (which I opened to keep track of this bug report on kernel.org). There may be no link at all with the other bugs you mentioned (29063 & 29240). You should first check with the person assigned in Archlinux to find out whether anything has already been done with them. Thanks

I tried it on the mainline Arch kernel (3.3.2) and it worked perfectly! Great job!

Thanks a lot for testing. Patch posted upstream.

http://article.gmane.org/gmane.linux.kernel/1285340

I have tested the patch by Tejun Heo on the Fedora 16 3.3.2 kernel from updates-testing (3.3.2-1.fc16.x86_64), and I can confirm that my system is now booting and working without any issue. Many thanks to Tejun Heo and all the others.

Patch committed to mainline. It should also appear in the next -stable release.

b3dc627cabb33fc95f93da78457770c1b2a364d2

Thanks.

Thank you very much!