Bug 90321

Summary: [REGRESSION] commit f5b2831d65 causes boot failure on VMware ESXi 5.1
Product: Platform Specific/Hardware
Reporter: Qu Wenruo (wqu)
Component: x86-64
Assignee: platform_x86_64 (platform_x86_64)
Status: RESOLVED PATCH_ALREADY_AVAILABLE
Severity: high
CC: igor.sverkos+kernel, wqu
Priority: P1
Hardware: Intel
OS: Linux
Kernel Version: v3.19.0-rc1
Subsystem:
Regression: Yes
Bisected commit-id:

Description Qu Wenruo 2014-12-26 01:41:00 UTC
The v3.19-rc1 kernel fails to boot on a VMware ESXi 5.1 host.

The kernel only prints "Probing EDD (edd=off to disable)...ok" and then hangs:
no backtrace, warning, or error, not even a VGA mode change.
(It may not even have switched to protected mode.)

It boots fine on qemu-kvm, VirtualBox, and bare metal.

Bisecting points to the following commit as the cause:
commit f5b2831d654167d77da8afbef4d2584897b12d0c
Author: Juergen Gross <jgross@suse.com>
Date:   Mon Nov 3 14:02:02 2014 +0100

    x86: Respect PAT bit when copying pte values between large and normal pages

Thanks,
Qu
Comment 1 Igor Sverkos 2014-12-29 16:11:28 UTC
Hi,

I saw this problem, too. Full boot log:

> Loading /kernel... ok
> Loading /initramfs...ok
> early console in setup code
> Probing EDD (edd=off to disable)... ok
> early console in decompress_kernel
> KASLR using RDRAND RDTSC...
> 
> Decompressing Linux...
> 
> [    0.000000] Initializing cgroup subsys cpuset
> [    0.000000] Initializing cgroup subsys cpu
> [    0.000000] Initializing cgroup subsys cpuacct
> [    0.000000] Linux version 3.19.0-rc1 (root@bare42) (gcc version 4.8.4
> (Gentoo 4.8.4 p1.0, pie-0.6.1) ) #1 SMP Mon Dec 29 15:55:00 CET 2014
> [    0.000000] Command line: BOOT_IMAGE=/kernel dolvm root=UID=...
> rootfs=ext4 initrd=/initramfs debug LOGLEVEL=8 earlyprintk=vga,keep
> [    0.000000] KERNEL supported cpus:
> [    0.000000]   Intel GenuineIntel
> [    0.000000] Disabled fast string operations
> [    0.000000] e820: BIOS-provided physical RAM map:
> [ Skipping the RAM map - please request if needed ]
> [    0.000000] debug: ignoring loglevel setting.
> [    0.000000] console [earlyvga0] enabled
> [    0.000000] NX (Execute Disable) protection: active
> [    0.000000] SMBIOS 2.4 present.
> [    0.000000] DMI: VMware, Inc. VMware Virtual Platform/440BX Desktop
> Reference Platform, BIOS 6.00 07/31/2013
> [    0.000000] e820: update [mem 0x00000000-0x00000fff] usable ==> reserved
> [    0.000000] e820: remove [mem 0x000a0000-0x000fffff] usable
> [    0.000000] e820: last_pfn = 0x140000 max_arch_pfn = 0x400000000
> [    0.000000] MTRR default type: uncachable
> [    0.000000] MTRR fixed ranges enabled:
> [    0.000000]   00000-9FFFF write-back
> [    0.000000]   A0000-BFFFF uncachable
> [    0.000000]   C0000-CBFFF write-protect
> [    0.000000]   CC000-EFFFF uncachable
> [    0.000000]   F0000-FFFFF write-protect
> [    0.000000] MTRR variable ranges enabled:
> [    0.000000]   0 base 00C0000000 mask FFC0000000 uncachable
> [    0.000000]   1 base 0000000000 mask FF00000000 write-back
> [    0.000000]   2 base 0100000000 mask FFC0000000 write-back
> [    0.000000]   3 disabled
> [    0.000000]   4 disabled
> [    0.000000]   5 disabled
> [    0.000000]   6 disabled
> [    0.000000]   7 disabled
> PANIC: early exception 06 rip 10:ffffffff8d03e852 error 0 cr2
> ffff88000dd4cff8
> [    0.000000] CPU: 0 PID: 0 Comm: swapper Not tainted 3.19.0-rc1 #1
> [    0.000000] Hardware name: VMware, Inc. VMware Virtual Platform/440BX
> Desktop Reference Platform, BIOS 6.00 07/31/2013
> [    0.000000] ffffffff8dc03e1b ffffffff8dc03d70 ffffffff8d6a9d63
> 000000000000004d
> [    0.000000] 00000000ffffffff ffffffff8dc03e08 ffffffff8dd011a1
> 0000000000000030
> [    0.000000] 000000000000002c 0000000000000019 0000000000000018
> 0000000000000000
> [    0.000000] Call Trace:
> [    0.000000]  [<ffffffff8d6a9d63>] dump_stack+0x45/0x57
> [    0.000000]  [<ffffffff8dd011a1>] early_idt_handler+0x81/0xa8
> [    0.000000]  [<ffffffff8d03e852>] ? update_cache_mode_entry+0x42/0x50
> [    0.000000]  [<ffffffff8d043d2c>] pat_init_cache_modes+0x7c/0xc0
> [    0.000000]  [<ffffffff8d043df4>] pat_init+0x84/0xa0
> [    0.000000]  [<ffffffff8dd0af26>] get_mtrr_state+0x284/0x296
> [    0.000000]  [<ffffffff8dd0aae3>] mtrr_bp_init+0x134/0x157
> [    0.000000]  [<ffffffff8dd047f1>] setup_arch+0x56e/0xc6f
> [    0.000000]  [<ffffffff8d11b69f>] ? vprintk_default+0x1f/0x30
> [    0.000000]  [<ffffffff8dd01c87>] start_kernel+0xd3/0x471
> [    0.000000]  [<ffffffff8dd015ad>] x86_64_start_reservations+0x2a/0x2c
> [    0.000000]  [<ffffffff8dd016a6>] x86_64_start_kernel+0xf7/0xfb
> [    0.000000] RIP 0x3


This is already known and fixed; see https://lkml.org/lkml/2014/12/28/35

I can confirm that the patch works for me. The boot log now looks like this:

> [    0.000000] MTRR default type: uncachable
> [    0.000000] MTRR fixed ranges enabled:
> [    0.000000]   00000-9FFFF write-back
> [    0.000000]   A0000-BFFFF uncachable
> [    0.000000]   C0000-CBFFF write-protect
> [    0.000000]   CC000-EFFFF uncachable
> [    0.000000]   F0000-FFFFF write-protect
> [    0.000000] MTRR variable ranges enabled:
> [    0.000000]   0 base 00C0000000 mask FFC0000000 uncachable
> [    0.000000]   1 base 0000000000 mask FF00000000 write-back
> [    0.000000]   2 base 0100000000 mask FFC0000000 write-back
> [    0.000000]   3 disabled
> [    0.000000]   4 disabled
> [    0.000000]   5 disabled
> [    0.000000]   6 disabled
> [    0.000000]   7 disabled
> [    0.000000] PAT read returns always zero, disabled.
> [    0.000000] original variable MTRRs
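
For readers unfamiliar with the "PAT read returns always zero" message: the PAT MSR holds eight 3-bit memory-type fields, one per byte, and a value of zero decodes to UC (uncacheable) in every slot, including entry 0. The sketch below only illustrates that decoding, using the encodings and reset value from the Intel SDM. That the early exception in update_cache_mode_entry comes from a check expecting entry 0 to be write-back is an assumption here; this report does not confirm it. The names decode_pat, PAT_TYPES and POWER_ON_DEFAULT are made up for the example.

```python
# Illustrative only: decode a 64-bit IA32_PAT value into its eight entries.
# Encodings follow the Intel SDM; all names here are hypothetical helpers.
PAT_TYPES = {0: "UC", 1: "WC", 4: "WT", 5: "WP", 6: "WB", 7: "UC-"}

# Reset value of IA32_PAT as documented in the Intel SDM.
POWER_ON_DEFAULT = 0x0007040600070406


def decode_pat(msr_value):
    """Return the memory-type name of each of the eight PAT entries."""
    entries = []
    for i in range(8):
        # Each entry occupies the low 3 bits of byte i.
        field = (msr_value >> (8 * i)) & 0x7
        entries.append(PAT_TYPES.get(field, "reserved"))
    return entries


if __name__ == "__main__":
    print("default:", decode_pat(POWER_ON_DEFAULT))
    print("zero:   ", decode_pat(0))
```

With the documented reset value, entry 0 decodes to WB; with an all-zero read, every entry (entry 0 included) decodes to UC, which is the situation the patched kernel reports and handles by disabling PAT.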
Comment 2 Qu Wenruo 2014-12-30 01:32:29 UTC
Thanks for the info.

I'm currently using the nopat kernel option to avoid the bug.
The fix is not in v3.19-rc1, so I still hit it without the nopat option.

Since the bug is already known and a fix has already been sent,
I will close this BZ.

Thanks,
Qu