Bug 43098

Summary: Kernel panic : early exception 06 rip 10
Product: Platform Specific/Hardware
Reporter: valère monseur (valere.monseur)
Component: x86-64
Assignee: Tejun Heo (tj)
Status: RESOLVED CODE_FIX
Severity: blocking
CC: arthur.titeica, dennyvatwork, hpa, j.heather, jfree143dev, lopezmen, nathansamson, the.ridikulus.rat, tj, valere.monseur
Priority: P1
Hardware: All
OS: Linux
Kernel Version: 3.3.1 and 3.3.2
Subsystem:
Regression: No
Bisected commit-id:
Attachments:
  lspci output
  lsusb output
  output of objdump
  screenshot with debug info
  adapted smp_scan_config in mpparse.c to allow boot
  dmesg output after booting with the adapted mpparse.c
  memblock-zero-len-fix.patch

Description valère monseur 2012-04-13 08:31:04 UTC
Hi,

I've upgraded from 3.2.9 to 3.3.1 and I face a kernel panic after the messages:
"Decompressing Linux...Parsing ELF...done." and "Booting the kernel".

The error message is "PANIC: early exception 06 rip 10:ffffffff81456a10 error 0 cr2 0"

I'm running a Dell XPS L501X laptop.
The output of lspci and lsusb is attached.

Remark: this bug entry replaces the old one (https://bugzilla.kernel.org/show_bug.cgi?id=43076), which was filed under the wrong architecture.
Comment 1 valère monseur 2012-04-13 08:31:41 UTC
Created attachment 72902
lspci output
Comment 2 valère monseur 2012-04-13 08:32:06 UTC
Created attachment 72903
lsusb output
Comment 3 valère monseur 2012-04-13 08:33:34 UTC
It looks like people have succeeded in booting after disabling the wifi in the BIOS (see info here: https://bbs.archlinux.org/viewtopic.php?id=139280).
Comment 4 valère monseur 2012-04-13 08:35:25 UTC
*** Bug 43076 has been marked as a duplicate of this bug. ***
Comment 5 valère monseur 2012-04-13 08:39:26 UTC
...and of course the info already mentioned here:
https://bugzilla.kernel.org/show_bug.cgi?id=43076

What other info can we provide to help find the root cause of this bug?
Comment 6 j.heather 2012-04-13 08:42:39 UTC
This is a bug affecting quite a number of people: see #43076.

Would be really good to get this sorted, because this affects every distro with a recent kernel, so Linux is currently no go on affected machines.
Comment 7 valère monseur 2012-04-13 22:14:55 UTC
Ok, I've recompiled the current stable kernel, which is now 3.3.2, and the problem is still present.

Because of the recompilation and code changes, the error message is now: early exception 06 rip 10: ffffffff81453b2f => so only the address has changed.

I've attached the file dump.txt which is the output of 'objdump -d --no-show-raw-insn vmlinux'

As far as I can see the address ffffffff81453b2f is in the function 'memblock_reserve' which I think is located in mm/bootmem.c

The only patch I can find in 3.3 release that has changed bootmem.c is described here http://www.spinics.net/lists/stable-commits/msg16531.html

...but I'm not a kernel expert, so it's just a guess.
...so I will try to recompile the kernel tomorrow without the changes made by that patch and see the result.

Remark: I don't see any link with the wifi driver here, so I'm not sure this will give any good result. As mentioned, it's just a guess.

Any comment on this is welcome.
Comment 8 valère monseur 2012-04-13 22:17:31 UTC
Created attachment 72916
output of objdump
Comment 9 H. Peter Anvin 2012-04-13 23:03:00 UTC
This means the kernel has tripped over a BUG() or BUG_ON().

Please compile the kernel with CONFIG_DEBUG_INFO and use:

addr2line -e vmlinux <address>

... to figure out which one.
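
For readers following along, here is a minimal user-space sketch of how the panic line can be taken apart. Exception vector 06 is the x86 invalid-opcode exception (#UD), and on x86 BUG() is implemented with the ud2 instruction, which raises exactly that exception. The sketch assumes the message format quoted in this thread; the struct and function names are hypothetical and do not come from the kernel.

```c
#include <stdio.h>

/* Fields of an early-exception panic line (names are illustrative). */
struct early_panic {
    unsigned long long vector; /* exception vector, printed in hex */
    unsigned long long cs;     /* code segment selector */
    unsigned long long rip;    /* faulting instruction pointer */
    unsigned long long error;  /* hardware error code */
    unsigned long long cr2;    /* page-fault address register */
};

/* Parse one panic line; returns 1 on success, 0 if it does not match. */
int parse_early_panic(const char *line, struct early_panic *out)
{
    int n = sscanf(line,
                   "PANIC: early exception %llx rip %llx:%llx error %llx cr2 %llx",
                   &out->vector, &out->cs, &out->rip, &out->error, &out->cr2);
    return n == 5;
}

/* Name a few x86 exception vectors; only a handful are listed. */
const char *x86_vector_name(unsigned long long vector)
{
    switch (vector) {
    case 0x00: return "#DE divide error";
    case 0x06: return "#UD invalid opcode";
    case 0x0d: return "#GP general protection";
    case 0x0e: return "#PF page fault";
    default:   return "other";
    }
}
```

The rip field is the value to pass to addr2line as <address>.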
Comment 10 Joseph Freeman 2012-04-14 02:40:53 UTC
git bisect gives the following result:

24aa07882b672fff2da2f5c955759f0bd13d32d5 is the first bad commit
commit 24aa07882b672fff2da2f5c955759f0bd13d32d5
Author: Tejun Heo <tj@kernel.org>
Date:   Tue Jul 12 11:16:06 2011 +0200

    memblock, x86: Replace memblock_x86_reserve/free_range() with generic ones
    
    Other than sanity check and debug message, the x86 specific version of
    memblock reserve/free functions are simple wrappers around the generic
    versions - memblock_reserve/free().
    
    This patch adds debug messages with caller identification to the
    generic versions and replaces x86 specific ones and kills them.
    arch/x86/include/asm/memblock.h and arch/x86/mm/memblock.c are empty
    after this change and removed.
    
    Signed-off-by: Tejun Heo <tj@kernel.org>
    Link: http://lkml.kernel.org/r/1310462166-31469-14-git-send-email-tj@kernel.org
    Cc: Yinghai Lu <yinghai@kernel.org>
    Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
    Cc: Thomas Gleixner <tglx@linutronix.de>
    Cc: Ingo Molnar <mingo@redhat.com>
    Cc: "H. Peter Anvin" <hpa@zytor.com>
    Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>

:040000 040000 f87cd91f4ebb1883b5830906c25932a32919a9c6 1da1a8bb80a2049a0d97048d0b139ddae9ded2fc M      arch
:040000 040000 c32fb9b38f23b5f024b03d490be3a0959843245b fd770cb6a1aec09ff1c713b498de5618ac09e4a7 M      include
:040000 040000 45602c08b24f9afa325b38c6d915ab283ca92093 6b842d41af45012d764cd5c721138378b7d8812b M      mm

Here is the first tag that the commit was included in:

git describe --contains  24aa07882b672fff2da2f5c955759f0bd13d32d5
v3.3-rc2~15^2~8
Comment 11 Joseph Freeman 2012-04-14 03:09:30 UTC
Previously in arch/x86/mm/memblock.c

void __init memblock_x86_reserve_range(u64 start, u64 end, char *name)

and

void __init memblock_x86_free_range(u64 start, u64 end)

did the following before calling memblock_reserve/memblock_free:
if (start == end)
        return;

Now those functions are gone and memblock_reserve/memblock_free are called directly...if start == end then the following gets called:
BUG_ON(0 == size);
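
A small user-space model of the behaviour change described above, as a sketch only. The function names are hypothetical stand-ins for memblock_x86_reserve_range() and memblock_reserve(), and the BUG_ON() is modelled as a return value so the difference can be observed without crashing.

```c
#include <stdint.h>

/* Outcome of a modelled reserve call. */
enum reserve_result {
    RESERVED,   /* range would be recorded */
    SKIPPED,    /* empty range silently ignored */
    WOULD_BUG   /* the real kernel would hit BUG_ON() here */
};

/* Stand-in for the generic reserve function as described above. */
enum reserve_result model_memblock_reserve(uint64_t base, uint64_t size)
{
    (void)base;             /* the base address does not matter for this model */
    if (size == 0)
        return WOULD_BUG;   /* models BUG_ON(0 == size) */
    return RESERVED;
}

/* Stand-in for the removed x86 wrapper, which took start/end. */
enum reserve_result model_x86_reserve_range(uint64_t start, uint64_t end)
{
    if (start == end)
        return SKIPPED;     /* the early return that was lost */
    return model_memblock_reserve(start, end - start);
}
```

With start == end, the wrapper model returns SKIPPED, while calling the generic model directly with a size of 0 returns WOULD_BUG.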
Comment 12 valère monseur 2012-04-14 09:12:15 UTC
Created attachment 72918
screenshot with debug info
Comment 13 valère monseur 2012-04-14 09:14:20 UTC
I've added a screenshot of the debug info that is printed at the time the problem occurs. The command line parameters used are: debug ignore_loglevel log_buf_len=10M print_fatal_signals=1 LOGLEVEL=8 earlyprintk=vga,keep sched_debug

What other info can I provide?
Comment 14 valère monseur 2012-04-14 15:54:00 UTC
So, a bit more investigation shows me that the panic occurs when 'memblock_reserve' is called with a size of 0 (we already knew that).

The sequence is the following:

In arch/x86/kernel/mpparse.c, the function 'smp_scan_config' first tries to find the 'mpf_intel' table and succeeds (in my case at address F6D60). It then performs a 'memblock_reserve' of that table, which also succeeds.

    this is the line of code:
        memblock_reserve(mem, sizeof(*mpf));

Then it uses the pointer 'mpf->physptr' to access the 'mpc' table and calls 'smp_reserve_memory'

    these are the lines of code:
			if (mpf->physptr)
				smp_reserve_memory(mpf);

The call to 'memblock_reserve' is done in 'smp_reserve_memory' and it fails with a size of 0.

In fact, the pointer 'mpf->physptr' doesn't seem to point at the right address, or the memory at that address is corrupted so the size retrieved for the call to memblock_reserve is of course wrong.

In my case the address of mpc table is at 9E0A1 when it fails.

If I try to boot with wifi disabled in the BIOS (as we know it allows the kernel to boot), the address of the mpc table is at 9C971.
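
To illustrate why a bad 'mpf->physptr' ends up as a zero size, here is a user-space sketch. In the Intel MP specification the configuration table header starts with the 4-byte signature "PCMP" followed by a 16-bit base table length. The sketch assumes the size is taken from that length field without any validation, which matches the behaviour described above; the names are hypothetical and this is not the kernel code.

```c
#include <stdint.h>

/* Read the 16-bit little-endian base table length at offset 4 of
 * whatever bytes the pointer refers to. No signature check is done. */
uint16_t mpc_length_unchecked(const uint8_t *header)
{
    return (uint16_t)(header[4] | (header[5] << 8));
}

/* The range that would be handed to the reserve call. */
struct reserve_req {
    uint64_t base;
    uint64_t size;
};

/* Build the request from the physical pointer and the bytes found there. */
struct reserve_req mpc_reserve_request(uint64_t physptr,
                                       const uint8_t *mem_at_physptr)
{
    struct reserve_req req;

    req.base = physptr;
    req.size = mpc_length_unchecked(mem_at_physptr);
    return req;
}
```

If the bytes at offsets 4 and 5 happen to be zero, the request has size 0, whatever the first four bytes contain.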

Voilà, that's all for now.
If you have any ideas on what to do next, please let me know.

Cheers
Comment 15 valère monseur 2012-04-15 09:14:40 UTC
According to the Intel MultiProcessor Specification, the MP Floating Pointer Structure must be stored in at least one of these places: the first kilobyte of the EBDA, the last kilobyte of system base memory (639K-640K), or the BIOS ROM address space (0F0000h-0FFFFFh).

Also according to the specification, the OS should search for it in this order:

1) EBDA
2) last kb of system base memory
3) BIOS ROM

When looking at the code in mpparse.c, Linux searches in this order:

1) first kb of system base memory (why?)
2) last kb of system base memory
3) BIOS ROM
4) EBDA

The reason for checking the EBDA last is that some buggy Linux loaders corrupt EBDA memory.
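
The scan of one such region can be sketched in user space as follows. Per the MP specification, the floating pointer structure is 16 bytes long, starts with the signature "_MP_", carries its length in 16-byte units at offset 8, and all of its bytes sum to zero modulo 256. The function name and the region table are hypothetical; the table lists only the fixed regions in the Linux order given above, since the EBDA address is only known at run time.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* A fixed physical region to scan (illustrative type). */
struct scan_region {
    uint32_t base;
    uint32_t len;
};

/* Fixed regions in the Linux order listed above (EBDA comes last). */
const struct scan_region linux_scan_order[] = {
    { 0x00000, 0x400 },   /* first kb of system base memory */
    { 0x9fc00, 0x400 },   /* last kb of system base memory (639K-640K) */
    { 0xf0000, 0x10000 }, /* BIOS ROM */
};

/* Sum of all bytes modulo 256; a valid structure sums to 0. */
static uint8_t byte_sum(const uint8_t *p, size_t len)
{
    uint8_t sum = 0;

    while (len--)
        sum = (uint8_t)(sum + *p++);
    return sum;
}

/* Return the offset of the first valid structure in buf, or -1. */
long find_mp_floating_pointer(const uint8_t *buf, size_t len)
{
    size_t off;

    /* The structure sits on a 16-byte boundary. */
    for (off = 0; off + 16 <= len; off += 16) {
        const uint8_t *p = buf + off;

        if (memcmp(p, "_MP_", 4) != 0)
            continue;
        if (p[8] != 1)              /* length in 16-byte units */
            continue;
        if (byte_sum(p, 16) != 0)   /* checksum must cancel out */
            continue;
        return (long)off;
    }
    return -1;
}
```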

Ok, now let's assume that my BIOS is buggy in some cases (and that the way the code was written before was hiding it). When doing the 'smp_scan_config' in mpparse.c, shouldn't it also check that the pointer to the mpc table (if any) really points to a valid address (by checking the mpc signature) before it reserves the memory? If it doesn't point to a valid address, then it should skip it.

I've quickly tested the above and it worked: it found an mpf table with a pointer, checked the signature at that address, found no mpc signature there, skipped the memory reservation and continued.

It has then found an mpf table in the EBDA and continued to boot.
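
The idea of the check can be sketched in user space like this. It is only an illustration of the decision, not the modified mpparse.c; the enum and function names are hypothetical. The signature "PCMP" is the one the MP specification defines for the configuration table header.

```c
#include <stdint.h>
#include <string.h>

/* What to do with an mpf candidate (illustrative names). */
enum mpf_verdict {
    MPF_ACCEPT_NO_MPC,   /* no config table pointer: nothing to reserve */
    MPF_ACCEPT_WITH_MPC, /* pointer leads to a real config table */
    MPF_SKIP             /* pointer leads to garbage: keep scanning */
};

/* Decide based on the physical pointer and the bytes found behind it. */
enum mpf_verdict check_mpf_candidate(uint32_t physptr,
                                     const uint8_t *mem_at_physptr)
{
    if (physptr == 0)
        return MPF_ACCEPT_NO_MPC;
    if (memcmp(mem_at_physptr, "PCMP", 4) != 0)
        return MPF_SKIP;
    return MPF_ACCEPT_WITH_MPC;
}
```

A candidate whose pointer does not lead to "PCMP" is skipped, so the scan can go on to the next region.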

So, once more I have to say I'm not a kernel expert so I would appreciate if someone could have a look at this.

Is this the direction to follow?
Is it not a BIOS bug?

I'll attach the mpparse.c code I've changed and the dmesg output right after this comment.
Comment 16 valère monseur 2012-04-15 09:15:35 UTC
Created attachment 72923
adapted smp_scan_config in mpparse.c to allow boot
Comment 17 valère monseur 2012-04-15 09:16:08 UTC
Created attachment 72924
dmesg output after booting with the adapted mpparse.c
Comment 18 valère monseur 2012-04-15 09:49:35 UTC
Ok, now that it boots with 3.3.2 I've just compared the dmesg output with the one I have with the LTS kernel (3.0.27).

I can see this for the LTS:

[    0.000000] Scan SMP from ffff880000000000 for 1024 bytes.
[    0.000000] Scan SMP from ffff88000009fc00 for 1024 bytes.
[    0.000000] Scan SMP from ffff8800000f0000 for 65536 bytes.
[    0.000000] found SMP MP-table at [ffff8800000f6d60] f6d60
[    0.000000]     memblock_x86_reserve_range: [0x000f6d60-0x000f6d6f]   * MP-table mpf
[    0.000000]   mpc: 9e0a1-9e0a1
[    0.000000]     memblock_x86_reserve_range: [0x01b31000-0x01b311cb]              BRK

What's really interesting is this line: "mpc: 9e0a1-9e0a1" which comes from this code:

	apic_printk(APIC_VERBOSE, "  mpc: %lx-%lx\n", physptr, physptr + size);

This means that even with previous kernel versions, the mpc address was wrong and so the size was 0. The difference is that before, it simply did not panic.

So, different solutions are possible:
- skip the mpf table if the mpc signature is not found (above proposal)
- skip the mpc table if the mpc signature is not found (as if it were a null pointer)
- do not 'memblock_reserve' if size is 0

Ok, waiting for you, guys!
Comment 19 lopezmen 2012-04-17 20:15:19 UTC
I have the same issue running Fedora 16 on kernels 3.2.10-3.3.1, but on a Dell Studio XPS 13 (Studio XPS 1340).

I could not deactivate the wifi through the BIOS, so I can't say whether the wifi is involved or not.
Comment 20 Daniele Viganò 2012-04-18 09:37:15 UTC
I can confirm this issue on my two Dell XPS L501X machines.

The error is always
PANIC: early exception 06 rip 10:ffffffff81d369ef error 0 cr2 0

Tested with kernel 3.3.1 from Fedora 16 rpm and 3.3.2 compiled from sources.
Disabling the WiFi in the BIOS makes the system boot.
Comment 21 valère monseur 2012-04-18 11:25:13 UTC
Thanks for reporting this.
I'm wondering, however, whether any kernel expert is really looking at it, as the status of the bug report is still NEW and not ASSIGNED.
Does anyone have an idea?
Comment 22 j.heather 2012-04-18 11:34:13 UTC
(In reply to comment #21)
> I'm wondering however if some kernel expert is really looking at it as the
> status of the bug report is still NEW and not ASSIGNED.

Yes, I have the same worry. It would be really useful to get this fixed before Fedora 17 comes out, because otherwise no one with this hardware will be able to upgrade.

@H. Peter Anvin is on the cc list, and also one of the guys who signed off the original patch. Maybe he will look at it?

James
Comment 23 David Runge 2012-04-18 22:22:57 UTC
Suffering from the same issue with a Dell XPS 1340 on Archlinux.
The last bootable kernel for me (because of other dependencies like the nvidia driver) is 3.2.7.
I was able to boot 3.3.2 with acpi=off but that just made nvidia crash... so I'm back on the older kernel, collecting dependency ignores... :/

There are some related bugs on bugs.archlinux.org ... but it seems that no one really knows what to make of it...
https://bugs.archlinux.org/task/29240
https://bugs.archlinux.org/task/29063
https://bugs.archlinux.org/task/29351

It seems to be a serious problem for Dell XPS laptops and random other hardware. Weird stuff! Please look into it! :/
Comment 24 Tejun Heo 2012-04-19 17:14:48 UTC
Created attachment 72983
memblock-zero-len-fix.patch

Sorry about the trouble. memblock shouldn't be panicking on zero-len ops. Can someone please verify the attached fix?

Thank you.
Comment 25 valère monseur 2012-04-19 20:06:54 UTC
@Tejun Heo: no problem and thanks for the patch. I will test it asap; unfortunately not today, but I will surely do it tomorrow.

@David Runge: this bug report is only related to archlinux bug 29351 (which I opened to keep track of this bug report on kernel.org). There may be no link at all with the other bugs you mentioned (29063 & 29240). You should first check with the person assigned in archlinux to find out whether anything has already been done about them.

Thanks
Comment 26 Joseph Freeman 2012-04-20 01:49:29 UTC
I tried it on the mainline Arch kernel (3.3.2) and it worked perfectly! Great Job!
Comment 27 Tejun Heo 2012-04-20 15:32:58 UTC
Thanks a lot for testing. Patch posted upstream.

 http://article.gmane.org/gmane.linux.kernel/1285340
Comment 28 Daniele Viganò 2012-04-20 16:10:08 UTC
I have tested the patch by Tejun Heo on Fedora 16 3.3.2 kernel from update-testing (3.3.2-1.fc16.x86_64) and I can confirm that now my system is booting and working without any issue.

Many thanks to Tejun Heo and all the others.
Comment 29 Tejun Heo 2012-04-20 21:02:40 UTC
Patch committed mainline. Should also appear in the next -stable release.

 b3dc627cabb33fc95f93da78457770c1b2a364d2

Thanks.
Comment 30 valère monseur 2012-04-21 19:51:22 UTC
Thank you very much!