Bug 12236

Summary: 2.6.27 vmware x86_64 guests panic on boot on VMware ESX
Product: Platform Specific/Hardware
Reporter: Timo Gurr (timo.gurr)
Component: x86-64
Assignee: Zachary Amsden (zach)
Status: CLOSED CODE_FIX
Severity: normal
CC: alan, kernel
Priority: P1
Hardware: All
OS: Linux
Kernel Version: 2.6.27
Subsystem:
Regression: Yes
Bisected commit-id:
Attachments: successful boot on VMware ESX

Description Timo Gurr 2008-12-16 05:47:43 UTC
Latest working kernel version: 2.6.26.8
Earliest failing kernel version: 2.6.27
Distribution: Gentoo vanilla-sources
Hardware Environment: Blade System / AMD Opteron 2216 HE
Software Environment: VMware ESX Server version=3.5.0 build=build-98103
Problem Description: VMware ESX 64-bit guests running >=2.6.27 fail to boot with a kernel panic.

Steps to reproduce: Compile >=2.6.27 on a 64-bit guest and boot it as a guest in
VMware ESX Server version=3.5.0 build=build-98103. The kernel panics with the following messages:

Vanilla 2.6.27:

Booting the kernel.
Kernel alive
Kernel really alive
PANIC: early exception 0e rip 10:ffffffff8151558d error 0 cr2 fffffffeff34003d

git-sources-2.6.28_rc8-r2:

PANIC: early exception 0e rip 10:ffffffff815142c2 error 0 cr2 fffffffeff44003d


This panic only appears when booting 64-bit guests; 32-bit guests boot just fine. Please let me know if I can provide any further information to help resolve this.

See http://bugs.gentoo.org/show_bug.cgi?id=250695
Comment 1 Alan 2008-12-17 08:37:40 UTC
Please make sure you file a copy of this bug report with VMware as well; if it's a bug on the VMware side of things there isn't anything we can do about it.

Does the same kernel boot fine outside of VMware on that system?
Comment 2 Greg Kroah-Hartman 2008-12-17 13:28:19 UTC
This is already fixed in Linus's tree, and will be resolved in the 2.6.27.10 release tomorrow as well.
Comment 3 Zachary Amsden 2008-12-17 13:41:37 UTC
This is actually a different bug: the bug I fixed was for VMI, which is 32-bit only, and this is a 64-bit-only bug. Not sure how to change the status back to ASSIGNED, but feel free to assign it to me.
Comment 4 Daniel Drake 2008-12-17 14:02:39 UTC
Thanks Zach!
Comment 5 Greg Kroah-Hartman 2008-12-17 14:06:18 UTC
Oops, sorry about that.
Comment 6 Daniel Drake 2008-12-18 09:33:22 UTC
Zach, is there any other information we can provide to help here? Have you had a chance to attempt to reproduce it?
Comment 7 Zachary Amsden 2008-12-18 12:14:33 UTC
Thanks, I think I have everything I need.  I'm a bit slow on the uptake right now because I'm still recovering from wisdom tooth surgery.
Comment 8 Zachary Amsden 2008-12-19 13:38:30 UTC
You didn't select CONFIG_BOOTPARAM_SOFTLOCKUP_PANIC, did you?

Can you attach the .config used to generate this kernel?  I can't find it either here or in the Gentoo bug report and so far I'm having a bit of trouble reproducing this panic.
Comment 9 Timo Gurr 2008-12-19 14:04:09 UTC
I'll attach the log on Monday when I'm at work again. In the meantime, an easy way to reproduce the problem (at least on our ESX setup) was simply booting the SystemRescueCd beta5 iso offered on http://www.sysresccd.org/ under the beta section (direct download link: http://beta.sysresccd.org/systemrescuecd-x86-1.1.4-beta6.iso ). It boots a 64-bit 2.6.27.x kernel when started with "rescue64", and it also produces the kernel PANIC on our system.
Comment 10 Axel Dyks 2008-12-19 14:10:02 UTC
(In reply to comment #9)
> I'll attach the log on monday when I'm at work again.

Timo, you've been asked to attach the kernel .config
and not a (dmesg) log. :-)
Comment 11 Zachary Amsden 2008-12-19 15:12:50 UTC
This doesn't reproduce for me on VMware workstation; it could be a difference in configuration or a change in VMware.  I will try it on 3.5.

Interestingly, the crash is in dmi_string_nosave, which doesn't do much in the way of safety or validation. The other bugs referenced in the Gentoo bug report appear to be potentially different, hardware-related crashes.
Comment 12 Zachary Amsden 2008-12-19 15:45:20 UTC
This doesn't reproduce on ESX 3.5 with the same SystemRescueCd.

Can I get a copy of the .vmx file used to boot this VM?

I suspect it is dependent on the number of CPUs or the memory size. At this point, I am doubtful this is a VMware bug, and suspect something to do with DMI.
Comment 13 Timo Gurr 2008-12-19 15:50:23 UTC
Thanks for the heads-up, Zachary. I'll try to provide the .vmx file and more information about our hardware on Monday; I'll also try to physically boot one of our blade servers from the SystemRescueCd and see how well that works.
Comment 14 Zachary Amsden 2008-12-19 15:50:55 UTC
Created attachment 19393 [details]
successful boot on VMware ESX
Comment 15 Timo Gurr 2008-12-22 05:47:02 UTC
To test, we've just upgraded one of our Blades (HP ProLiant BL465c G1), which failed before with VMware ESX Server version=3.5.0 build=build-98103, to VMware ESX Server version=3.5.0 build=build-130756 (and all available updates), and things magically started to work. Booting the SystemRescueCd (rescue64) as well as upgrading and booting an installed Gentoo kernel 2.6.27.9 (64-bit) now works without producing any PANIC.
Comment 16 Zachary Amsden 2008-12-22 10:11:24 UTC
Sounds very plausible to me. I was testing ESX 3.5U2, which is build 110268. Since it's already fixed, I don't really see any need to open a bug on our end or binary-search it down to a root cause; it's likely the DMI data in our BIOS was badly terminated and caused a panic here. I'm going to go ahead and close it then. If it recurs, feel free to re-open.
Comment 17 Alan 2008-12-22 10:14:16 UTC
Zach, can you fill in more info about the bad DMI data, if you had it, in the VMware BIOS? Our DMI code should be robust against such things, and if it isn't I'd like to see it fixed.
Comment 18 Zachary Amsden 2008-12-22 10:27:02 UTC
(In reply to comment #17)
> Zach - can you fill in more info about bad DMI data if you had it in the
> VMware BIOS. Our DMI code should be robust against such things and if it
> isn't I'd like to see it fixed.

To be honest, it could be many things that caused this bug, and lack of DMI robustness is just a convenient suspect. Perhaps it was an MMU bug that was fixed? In any case, the same .iso image boots fine on one version of VMware and not on a slightly bugfixed version of the same branch, pointing to a VMware bug as the real culprit.

However, inspecting the DMI code I do see a reliance on proper NULL termination and length setting of strings in several places:

static const char * __init dmi_string_nosave(const struct dmi_header *dm, u8 s)
{
        const u8 *bp = ((u8 *) dm) + dm->length;

        if (s) {
                s--;
                while (s > 0 && *bp) {
                        bp += strlen(bp) + 1;    // (1)
                        s--;
                }

                if (*bp != 0) {
                        size_t len = strlen(bp)+1;   // (2)
                        size_t cmp_len = len > 8 ? 8 : len;

                        if (!memcmp(bp, dmi_empty_string, cmp_len))
                                return dmi_empty_string;
                        return bp;
                }
        }

        return "";
}

(1) strlen isn't really the best choice here; there should be a bounded dmi_strlen function which takes the DMI header pointer as an argument and imposes a sanity bound on the maximum length

(2) in the second case, the length is irrelevant if it is longer than 8, so a bounded strncmp would be a better choice.