Bug 12167

Summary: 2.6.27.6 vmware guest panics on boot with CONFIG_VMI=Y
Product: Memory Management
Reporter: Norman Back (norman)
Component: Other
Assignee: Andrew Morton (akpm)
Status: CLOSED CODE_FIX
Severity: normal
CC: akataria, alan, kernel, mingo, yhlu.kernel, zach
Priority: P1
Hardware: All
OS: Linux
Kernel Version: 2.6.27.6
Subsystem:
Regression: Yes
Bisected commit-id:
Attachments:
  System map of 2.6.27.6
  Patch against 2.6.27.6 that moves those dmi_... lines after * NOTE ...
  Patch to fix boot time ioremap crash with VMI

Description Norman Back 2008-12-05 11:52:43 UTC
Latest working kernel version: 2.6.27.5
Earliest failing kernel version: 2.6.27.6
Distribution: Gentoo vanilla-sources
Hardware Environment: AMD 
Software Environment: vmware-workstation 6.5
Problem Description: 2.6.27.6 vmware guest panics on boot with CONFIG_VMI=Y

Steps to reproduce: Compile 2.6.27.6 with CONFIG_VMI=Y and boot it as a guest in vmware-workstation 6.5. The kernel panics with "BUG: Int 14: CR2".

Bisection between 2.6.27.5 and 2.6.27.6 gives

5c371b31be32033b0a4a993431484da8a2305369 is first bad commit
commit 5c371b31be32033b0a4a993431484da8a2305369
Author: Yinghai Lu <yhlu.kernel@gmail.com>
Date:   Mon Sep 22 02:52:26 2008 -0700

    x86: fix CONFIG_X86_RESERVE_LOW_64K=y

    commit 2216d199b1430d1c0affb1498a9ebdbd9c0de439 upstream

    The bad_bios_dmi_table() quirk never triggered because we do DMI setup
    too late. Move it a bit earlier.

    Also change the CONFIG_X86_RESERVE_LOW_64K quirk to operate on the e820
    table directly instead of messing with early reservations - this handles
    overlaps (which do occur in this low range of RAM) more gracefully.

    Signed-off-by: Ingo Molnar <mingo@elte.hu>
    Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>

:040000 040000 b7b81ffb62eddf60c2d8545a61566f0d34c1b2a9 858d983687c53db5304015a245ee0c23f10c266d M      arch

See http://bugs.gentoo.org/show_bug.cgi?id=249751
Comment 1 Daniel Drake 2008-12-05 12:08:11 UTC
For completeness, here is the output from the VMI-enabled vmware guest as it crashes on boot:

Decompressing Linux... Parsing ELF... done.
Booting the kernel.
BUG: Int 14: CR2 fbe00000
      EDI c05b1f98  ESI fbe00000  EBP 00a6e003  ESP c05b1f7c
      EBX c05b1f98  EDX 0000000e  ECX 00000003  EAX fbe00000
      err 00000000  EIP c05db95c   CS 00000062  flg 00010092
 Stack: c00cc618 c00cc625 00000003 00000000 00000000 00000563 c05b1ff8 fbe00000
        fbe10000 fbe00000 c05dba7e c05b1ff8 c05b1ff8 00646513 00609000 c05bac50
        00000800 00099d00 c059a000 00a6e003 00000800 00099d00 c059a000 c05b66d2
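
For readers decoding the dump above: on x86, interrupt 14 is the page-fault exception, CR2 holds the faulting linear address, and the low bits of the error code describe the access. A minimal Python sketch that decodes the `err` field follows; the function name is made up for illustration and is not part of any kernel tool.

```python
def decode_page_fault_error(err: int) -> dict:
    """Decode the low three bits of an x86 page-fault error code."""
    return {
        "page_present": bool(err & 0x1),  # 0: page not present, 1: protection violation
        "write_access": bool(err & 0x2),  # 0: read, 1: write
        "user_mode": bool(err & 0x4),     # 0: supervisor (kernel) mode, 1: user mode
    }


# "err 00000000" taken from the crash screen above.
info = decode_page_fault_error(0x00000000)
print(info)
```

With err 00000000 this reads as a kernel-mode read from a page that is not present, at the address in CR2 (fbe00000).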
Comment 2 Axel Dyks 2008-12-05 12:14:32 UTC
I suspect that regression is caused by putting "dmi_scan_machine" BEFORE

  * NOTE: On x86-32, only from this point on, fixmaps are ready for use

in arch/x86/kernel/setup.c

See http://lkml.org/lkml/2008/8/7/298
http://git.kernel.org/?p=linux/kernel/git/stable/linux-2.6.27.y.git;a=commit;h=3a6ddd5f18405ca92e004416af8ed44b9c9783d7
http://git.kernel.org/?p=linux/kernel/git/stable/linux-2.6.27.y.git;a=commit;h=5c371b31be32033b0a4a993431484da8a2305369

Maybe just moving those two "dmi_*" lines AFTER the comment would solve the problem.
Comment 3 Yinghai Lu 2008-12-05 13:15:56 UTC
Can you check 2.6.28-rc7 etc.?
It seems there is a follow-up patch for Xen guests in mainline.
Comment 4 Axel Dyks 2008-12-05 14:48:27 UTC
(In reply to comment #3)
> can you check 2.6.28-rc7 etc?
> it seems there is one following up patch for xen guest...in mainline
Though we can of course test a lot of different kernels, it would be
interesting to know which of the patches added to 2.6.28-rc7 might have
solved the problem, and which Xen patch you are referring to.

Thanks
Axel
Comment 6 Axel Dyks 2008-12-05 15:33:27 UTC
Ah, I see. But this means it's in 2.6.27.8 already.
Comment 7 Norman Back 2008-12-05 15:43:04 UTC
(In reply to comment #3)
> can you check 2.6.28-rc7 etc?
> it seems there is one following up patch for xen guest...in mainline
> 
2.6.28-rc7 also panics with Int 14 CR2
Comment 8 Yinghai Lu 2008-12-05 18:35:58 UTC
Please use gdb to check what code is around c05db95c.
Comment 9 Norman Back 2008-12-06 00:19:33 UTC
Created attachment 19170 [details]
System map of 2.6.27.6

I'm not sure how to use gdb to get that from a kernel; however, I have attached the System.map and the crash screen (below) of the 2.6.27.6 kernel built with "# CONFIG_X86_RESERVE_LOW_64K is not set".

From System.map:
 c05d1b80 t dmi_present
 c05d1c80 T dmi_scan_machine

So EIP c05d1ba0 looks like dmi_present+0x20.

 Decompressing Linux... Parsing ELF... done.
 Booting the kernel.

 BUG: Int 14: CR2 fbe00000
      EDI c05a5f70  ESI fbe00000  EBP c05a5f8c  ESP c05a5f54
      EBX c05a5f70  EDX 0000000e  ECX 00000003  EAX fbe00000
      err 00000000  EIP c05d1ba0   CS 00000062  flg 00010046
 Stack: c05be709 00000163 80000000 00000000 00000001 000001ff 00000000 fbe00000
        fbe10000 fbe00000 c05a5fa4 c05d1cb7 00000001 c05a5fdc 00000000 c05a2f94
        c05a5fc8 c05aecc1 c04ee7e6 00099d00 c0597000 c05a5fc8 c04ee7e6 00099d00
 VGA: Screenshot done.
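
The System.map lookup done by hand above can be sketched in a few lines of Python: find the last symbol whose address is at or below EIP and report the offset. The two map entries are the ones quoted in this comment; the helper names are hypothetical.

```python
import bisect

# Two entries copied from the System.map excerpt above; a real map has thousands.
SYSTEM_MAP = """\
c05d1b80 t dmi_present
c05d1c80 T dmi_scan_machine
"""


def load_symbols(text):
    """Parse 'address type name' lines into a list sorted by address."""
    symbols = []
    for line in text.splitlines():
        parts = line.split()
        if len(parts) == 3:
            symbols.append((int(parts[0], 16), parts[2]))
    return sorted(symbols)


def resolve(symbols, addr):
    """Return (name, offset) for the last symbol at or below addr, else None."""
    addrs = [a for a, _ in symbols]
    i = bisect.bisect_right(addrs, addr) - 1
    if i < 0:
        return None
    base, name = symbols[i]
    return name, addr - base


symbols = load_symbols(SYSTEM_MAP)
name, offset = resolve(symbols, 0xC05D1BA0)  # EIP from the crash screen
print(f"{name}+{offset:#x}")  # dmi_present+0x20
```

This only identifies the enclosing symbol; it says nothing about which instruction sits at that offset, which is what the gdb disassembly is for.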
Comment 10 Norman Back 2008-12-06 01:33:53 UTC
After experimenting with gdb I got:

Dump of assembler code for function dmi_present:
0xc05d1b80 <dmi_present+0>:     push   %ebp
0xc05d1b81 <dmi_present+1>:     mov    %esp,%ebp
0xc05d1b83 <dmi_present+3>:     sub    $0x28,%esp
0xc05d1b86 <dmi_present+6>:     mov    %ebx,-0xc(%ebp)
0xc05d1b89 <dmi_present+9>:     mov    %esi,-0x8(%ebp)
0xc05d1b8c <dmi_present+12>:    mov    %edi,-0x4(%ebp)
0xc05d1b8f <dmi_present+15>:    call   0xc0104318 <mcount>
0xc05d1b94 <dmi_present+20>:    lea    -0x1c(%ebp),%ebx
0xc05d1b97 <dmi_present+23>:    mov    %eax,%esi
0xc05d1b99 <dmi_present+25>:    mov    $0x3,%ecx
0xc05d1b9e <dmi_present+30>:    mov    %ebx,%edi
0xc05d1ba0 <dmi_present+32>:    rep movsl %ds:(%esi),%es:(%edi)
0xc05d1ba2 <dmi_present+34>:    mov    $0xf,%ecx
0xc05d1ba7 <dmi_present+39>:    and    $0x3,%ecx
0xc05d1baa <dmi_present+42>:    je     0xc05d1bae <dmi_present+46>
0xc05d1bac <dmi_present+44>:    rep movsb %ds:(%esi),%es:(%edi)
0xc05d1bae <dmi_present+46>:    cld
0xc05d1baf <dmi_present+47>:    mov    $0xc0520bfc,%edi
0xc05d1bb4 <dmi_present+52>:    mov    $0x5,%ecx
0xc05d1bb9 <dmi_present+57>:    mov    %ebx,%esi
0xc05d1bbb <dmi_present+59>:    repz cmpsb %es:(%edi),%ds:(%esi)
0xc05d1bbd <dmi_present+61>:    je     0xc05d1bd3 <dmi_present+83>
0xc05d1bbf <dmi_present+63>:    mov    $0x1,%edx
0xc05d1bc4 <dmi_present+68>:    mov    -0xc(%ebp),%ebx
0xc05d1bc7 <dmi_present+71>:    mov    %edx,%eax
0xc05d1bc9 <dmi_present+73>:    mov    -0x8(%ebp),%esi
0xc05d1bcc <dmi_present+76>:    mov    -0x4(%ebp),%edi
0xc05d1bcf <dmi_present+79>:    mov    %ebp,%esp
0xc05d1bd1 <dmi_present+81>:    pop    %ebp
0xc05d1bd2 <dmi_present+82>:    ret
0xc05d1bd3 <dmi_present+83>:    mov    %ebx,%eax
0xc05d1bd5 <dmi_present+85>:    call   0xc05d1510 <dmi_checksum>
0xc05d1bda <dmi_present+90>:    test   %eax,%eax
0xc05d1bdc <dmi_present+92>:    je     0xc05d1bbf <dmi_present+63>

Does that look right?
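
One way to read the listing: the `rep movsl` with ECX=3 followed by `rep movsb` with ECX = 0xf & 3 copies 15 bytes from the address passed in EAX into a stack buffer, the `repz cmpsb` compares the first 5 bytes of that buffer with a constant, and `dmi_checksum` is then called on the buffer. In the crash screen ECX is still 3 and ESI equals CR2, so the fault appears to hit the very first read of that copy. The Python sketch below mirrors this flow. The "_DMI_" anchor string and the sum-to-zero checksum rule are assumptions based on the DMI entry-point format, not something read from this listing, and the function names only echo the kernel symbols.

```python
def dmi_checksum(buf: bytes) -> bool:
    # Assumed rule: all bytes of the entry sum to zero modulo 256.
    return sum(buf) % 256 == 0


def dmi_present(mem: bytes) -> bool:
    """Return True if mem starts with a plausible 15-byte DMI entry.

    Note: the listing suggests the kernel function returns 1 on a
    mismatch; this sketch uses a plain boolean instead.
    """
    buf = mem[:15]  # the 15-byte copy (3 dwords + 3 bytes in the listing)
    if len(buf) < 15:
        return False
    return buf[:5] == b"_DMI_" and dmi_checksum(buf)


def make_entry() -> bytes:
    """Build a synthetic 15-byte entry whose checksum is valid (test data only)."""
    body = bytearray(b"_DMI_" + bytes(10))
    body[5] = (-sum(body)) % 256  # pad byte chosen so the total sums to 0 mod 256
    return bytes(body)


print(dmi_present(make_entry()))
print(dmi_present(bytes(15)))
```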
Comment 11 Axel Dyks 2008-12-06 07:32:26 UTC
Created attachment 19184 [details]
Patch against 2.6.27.6 that moves those dmi_... lines after * NOTE ...

Now that it seems the kernel crashes when calling "dmi_scan_machine",
would you mind applying the attached patch against 2.6.27.6 and
checking whether it solves the problem?
Comment 12 Yinghai Lu 2008-12-06 12:09:41 UTC
dmi_scan_machine uses dmi_ioremap, which is early_ioremap, and we can already
use that after early_ioremap_init(),

so we don't need to move it later.
Comment 13 Norman Back 2008-12-06 13:54:54 UTC
(In reply to comment #11)
> Created an attachment (id=19184) [details]
> Patch against 2.6.27.6 that moves those dmi_... lines after * NOTE ...
> 
> Now, that it seems that kernel crashes when calling "dmi_scan_machine"
> would you mind applying the attached patch against 2.6.27.6 and
> check, if this solves the problem?

The patch resolves this issue. With the patch applied 2.6.27.6 boots to completion (logged into kde) without error.
Comment 14 Axel Dyks 2008-12-06 14:04:38 UTC
Hmm, I have to admit that I have almost no understanding of (early) memory
management on Linux. Just reading through http://lkml.org/lkml/2008/8/7/298
made me think about what would happen if vmi_init() moves the so-called
FIXMAP area down by 64MB.
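
To make that question concrete, here is a rough sketch of x86-32 fixmap address arithmetic, where a slot's virtual address is the top of the fixmap area minus the slot index times the page size. The default top of 0xfffff000 and the 64MB shift are assumptions (the latter taken from the comment above), and the helper names only mirror the kernel's; this is illustrative Python, not kernel code.

```python
PAGE_SHIFT = 12
DEFAULT_FIXADDR_TOP = 0xFFFFF000  # assumed x86-32 default top of the fixmap area
VMI_RESERVED = 64 << 20           # 64MB, per the comment above (assumption)


def fix_to_virt(idx, top):
    """Virtual address of fixmap slot idx, counting pages down from top."""
    return top - (idx << PAGE_SHIFT)


def virt_to_fix(vaddr, top):
    """Fixmap slot index of the page containing vaddr, relative to top."""
    return (top - (vaddr & ~0xFFF)) >> PAGE_SHIFT


shifted_top = DEFAULT_FIXADDR_TOP - VMI_RESERVED
cr2 = 0xFBE00000  # faulting address from the crash screens
idx = virt_to_fix(cr2, shifted_top)

print(hex(shifted_top))                            # 0xfbfff000
print(hex(idx))                                    # 0x1ff
print(hex(fix_to_virt(idx, DEFAULT_FIXADDR_TOP)))  # 0xffe00000
```

Under these assumptions the faulting address fbe00000 lies 0x1ff pages below the shifted top, and the same slot computed against the unshifted top would be ffe00000. That is consistent with the fixmap having been relocated, but it is an observation about the numbers, not a diagnosis.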
Comment 15 Daniel Drake 2008-12-07 05:37:33 UTC
Alok, Zach, can you help? thanks
Comment 16 Zachary Amsden 2008-12-08 15:24:22 UTC
This patch doesn't look right: moving dmi_scan_machine earlier fixes a real bug and should not be reverted.  I believe the issue is an early_ioremap bug with VMI that was never exposed before.
Comment 17 Axel Dyks 2008-12-08 15:41:25 UTC
(In reply to comment #16)
> This patch doesn't look right, moving dmi_scan_machine earlier fixes a real
> bug and should not be reverted.  I believe the issue is an early_ioremap bug
> with VMI that was never exposed before.
>
Thanks for getting back on this bug.
dmi_scan_machine used to be late in the code (later than my patch puts it),
but it has been moved a few lines up. Probably just to be BEFORE dmi_check_system.

Actually I don't know, because I'm no expert in this early_ioremap stuff ...

BTW, there's another VMI bug that we (at Gentoo) haven't yet been able to
bisect down to a single commit. Since it might be related, I would like
to point you to http://bugs.gentoo.org/show_bug.cgi?id=250094

Anyway, any help you could give in resolving those bugs is greatly appreciated.

Thanks
Comment 18 Axel Dyks 2008-12-09 11:06:41 UTC
It seems you are not alone ... http://lkml.org/lkml/2008/12/9/13
Comment 19 Zachary Amsden 2008-12-09 15:25:01 UTC
I wish someone had contacted me about this earlier... the fix is quite easy.  I have two fixes, actually; I'll send out both and see which is preferred. They have different tradeoffs as far as risks of future breakage.

I should test the fixes before making such broad statements as calling them fixes, however.
Comment 20 Axel Dyks 2008-12-09 15:34:06 UTC
(In reply to comment #19)
> I wish someone had contacted me about this earlier... the fix is quite easy.
> I have two fixes actually, I'll send out both and see which is preferred, they
> both have different tradeoffs as far as risks for future breakages.
>
> I should test the fixes before making such broad statements as calling them
> fixes, however.
>
Great!
Comment 21 Zachary Amsden 2008-12-12 10:21:37 UTC
Created attachment 19262 [details]
Patch to fix boot time ioremap crash with VMI

I've sent the attached patch upstream for inclusion in 2.6.28.
Comment 22 Axel Dyks 2008-12-12 10:50:41 UTC
Thanks. Great!
Comment 23 Jonas Frey 2009-02-10 07:32:42 UTC
I have exactly the same problem, but I am not using VMware.
I am using a dual quad-core Opteron on a Tyan S5376 mainboard with the latest BIOS 3.04 (AMI BIOS). Any kernel later than 2.6.27.4 crashes immediately, so some of these x86 64k memory patches must have broken something. If I can assist in finding the problem, please let me know. The current 2.6.28.4 also crashes.
Enabling/disabling CONFIG_X86_RESERVE_LOW_64K has no effect.

# dmidecode 2.8
SMBIOS 2.5 present.
81 structures occupying 2869 bytes.
Table at 0x000FCD70.

Handle 0x0000, DMI type 0, 24 bytes
BIOS Information
        Vendor: American Megatrends Inc.
        Version: 'V3.04    '
        Release Date: 12/25/2008
        Address: 0xF0000
        Runtime Size: 64 kB
        ROM Size: 1024 kB
        Characteristics:
                PCI is supported
                PNP is supported
                APM is supported
                BIOS is upgradeable
                BIOS shadowing is allowed
                ESCD support is available
                Boot from CD is supported
                Selectable boot is supported
                BIOS ROM is socketed
                EDD is supported
                5.25"/1.2 MB floppy services are supported (int 13h)
                3.5"/720 KB floppy services are supported (int 13h)
                3.5"/2.88 MB floppy services are supported (int 13h)
                Print screen service is supported (int 5h)
                8042 keyboard services are supported (int 9h)
                Serial services are supported (int 14h)
                Printer services are supported (int 17h)
                CGA/mono video services are supported (int 10h)
                ACPI is supported
                USB legacy is supported
                LS-120 boot is supported
                ATAPI Zip drive boot is supported
                BIOS boot specification is supported
                Targeted content distribution is supported
        BIOS Revision: 8.14

Handle 0x0001, DMI type 1, 27 bytes
System Information
        Manufacturer: empty
        Product Name: empty
        Version: empty
        Serial Number: empty
        UUID: 00020003-0004-0005-0006-000700080009
        Wake-up Type: Power Switch
        SKU Number: To Be Filled By O.E.M.
        Family: Embedded
Comment 24 Zachary Amsden 2009-02-10 13:12:59 UTC
(In reply to comment #23)
> I do have exactly the same problem but i am not using VMware.

It's highly unlikely to be the same problem.  This problem was only possible if using high address space reservation for VMI.

Any chance you can bisect this down to the bad commit?
Comment 25 Yinghai Lu 2009-02-10 21:08:54 UTC
On Tue, Feb 10, 2009 at 7:32 AM,  <bugme-daemon@bugzilla.kernel.org> wrote:
> http://bugzilla.kernel.org/show_bug.cgi?id=12167
>
> ------- Comment #23 from jonas.frey@gmx.de  2009-02-10 07:32 -------
> I do have exactly the same problem but i am not using VMware.
> I am using a Dual QuadCore Opteron using a Tyan S5376 mainboard with latest
> Bios 3.04 (AMI Bios). Any kernel later than 2.6.27.4 crashes immediatly so
> some of these x86 64k memory patches must have broken something. If i can
> assist on find the problem please let me know. Current 2.6.28.4 also crashes.
> Enabling/disabling CONFIG_X86_RESERVE_LOW_64K has no effect.

Can you post the boot log from a kernel before 2.6.27.4? Please add debug to the command line.

YH
Comment 26 Alan 2009-03-26 12:59:39 UTC
(as discussed off bug with Zachary a while back I've now put together a test dmi robustness patch for corrupt dmi tables as promised)