Bug 192241 - 4.10-rc2 does not boot in VMware Fusion 7.1.2
Status: NEW
Alias: None
Product: Virtualization
Classification: Unclassified
Component: lguest
Hardware: x86-64 Linux
Importance: P1 normal
Assignee: virtualization_lguest
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2017-01-10 07:12 UTC by Christophe de Dinechin
Modified: 2017-02-14 03:31 UTC
CC List: 2 users

See Also:
Kernel Version: 4.10-rc2
Subsystem:
Regression: No
Bisected commit-id:


Attachments
Console log from the virtual serial port (230.24 KB, text/plain)
2017-01-10 07:12 UTC, Christophe de Dinechin
VM log file (8.67 KB, application/octet-stream)
2017-01-19 16:13 UTC, Christophe de Dinechin
vmware.log after one failed boot following one successful boot (2.38 MB, application/octet-stream)
2017-01-20 09:29 UTC, Christophe de Dinechin

Description Christophe de Dinechin 2017-01-10 07:12:45 UTC
Created attachment 251081
Console log from the virtual serial port

In a 64-bit Fedora 25 workstation installation running inside VMware Fusion 7.1.2 on macOS Sierra version 10.12.2, I can boot a locally built 4.9 kernel, but kernels from the current mainline hang shortly after activating DRM.

I first saw this with SHA1 e02003b515e8d95f40f20f213622bb82510873d2 on the 4.10-rc2 line.

Symptoms include one of:

- The VM boots, reaches the point where it would switch to graphics mode, then reboots with a corrupted VMware logo, and the cycle repeats.

- If I activate the serial console, I get this at the end, and the VM hangs instead of rebooting:

[   19.585294] [drm] Fifo max 0x00040000 min 0x00001000 cap 0x0000077f
[   19.588834] [drm] Using command buffers with DMA pool.
[   19.590977] [drm] DX: no.
         Starting dracut initqueue hook...
[  OK  ] Started Show Plymouth Boot Screen.
[  OK  ] Reached target Paths.
[  OK  ] Started Forward Password Requests to Plymouth Directory Watch.
[  OK  ] Reached target Basic System.
GG

I always see a "GG".

- After updating to 4.10-rc3, SHA1 a121103c922847ba5010819a3f250f1f7fc84ab8, I get:

[   11.919074] [TTM] Initializing pool allocator
[   11.923006] [TTM] Initializing DMA pool allocator
[   11.926280] [drm] Supports vblank timestamp caching Rev 2 (21.10.2013).
[   11.931211] [drm] No driver support for vblank timestamp query.
[   11.935264] [drm] Screen Target Display device initialized
[   11.937851] [drm] width 800
[   11.939140] [drm] height 480
[   11.940463] [drm] bpp 32
[   11.950890] [drm] Fifo max 0x00040000 min 0x00001000 cap 0x0000077f
[   11.955451] [drm] Using command buffers with DMA pool.
[   11.962502] [drm] DX: no.
G

Same thing, there's a G at the end.

The VM is configured with 2G of RAM and 4 cores. It normally has recursive virtualization and PMU support enabled, but I checked that disabling those did not change the symptoms.
Comment 1 Christophe de Dinechin 2017-01-11 11:02:25 UTC
Hmm, I may have filed this wrong. I assumed "lguest" stood for "Linux as guest", but apparently that's for the lguest hypervisor. Could someone tell me where I should file a problem with Linux as a guest?
Comment 2 Christophe de Dinechin 2017-01-13 08:17:06 UTC
Bisected: the first bad commit is https://github.com/torvalds/linux/commit/dabdcdc9822ae4e23cd7ff07090098d34f287b28. Unsurprisingly given the symptoms, it's a change in the VMware GFX driver.

https://patchwork.freedesktop.org/patch/125167/
Comment 3 Christophe de Dinechin 2017-01-13 08:54:05 UTC
Reverted the commit. The revert is clean, but it DOES NOT work. So it's not a simple matter of getting rid of that code. Need to double-check that I identified the right commit.
Comment 4 Christophe de Dinechin 2017-01-13 09:41:54 UTC
Manually checked out the previous commit, and it still does not work. So something went wrong with the bisect. Here are the bisect steps I wrote down:

9439b3710df688d853eb6cb4851256f2c92b1797: Bad
628d1655: Good. 9 steps to go. VMware copy to host stopped working, so typed by hand.
0d5320fc: Good. 8 steps to go.
2601a15d5d9b7f262e94b88784b1e1cf28ec020d: Bad
8e57ec613: Good
bfd5be0f9e0cd: Good
db444e1344dd: Bad
ad1231080be5a5cb: Good
8f5040e421ca4bbd: Bad, but interestingly, no DRM message, although GG is there
1f32478f: Bad

Re-testing the last known good, ad1231080be5a5cb. It is possible that it does not fail every time.
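
Since a commit that tested good may turn out bad on a retest, it may help to boot each build several times and only record a verdict when all boots agree. The following is a minimal sketch, not what was actually used for this bisect: it assumes the serial console is captured to one file per boot, and that a login prompt ("login:") only appears after a successful boot. The function names and the BOOT_OK_MARKER variable are hypothetical.

```shell
#!/bin/sh
# Assumption: this string only shows up in the serial console log once the
# guest has fully booted. Adjust it to whatever a good boot prints last.
BOOT_OK_MARKER=${BOOT_OK_MARKER:-"login:"}

# classify_boot_log FILE
# Prints "good" if the console log contains the success marker, else "bad".
classify_boot_log() {
    if grep -qF -- "$BOOT_OK_MARKER" "$1"; then
        echo good
    else
        echo bad
    fi
}

# retest_verdict VERDICT...
# Combines the verdicts of several boots of the same build. Prints the
# common verdict when they all agree and "inconsistent" when they do not.
retest_verdict() {
    first=$1
    for v in "$@"; do
        if [ "$v" != "$first" ]; then
            echo inconsistent
            return
        fi
    done
    echo "$first"
}
```

A result of "inconsistent" would then be a reason to rebuild, or to use "git bisect skip", rather than to mark the commit good or bad.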
Comment 5 Christophe de Dinechin 2017-01-13 11:07:02 UTC
It turns out ad1231080be5a5cb and bfd5be0f9e0cd now fail. It is a bit strange. I could have gotten one wrong, but I doubt I got three wrong. Some other effect is at play.
Comment 6 Sinclair 2017-01-13 17:11:13 UTC
What is the version of your Mac OS?
Comment 7 Christophe de Dinechin 2017-01-15 20:43:54 UTC
Rebuilt 4.9.0 from tag v4.9.0, and now this one fails too.

Looking at the console logs, I found this crash:

[    0.000000] Initmem setup node 0 [mem 0x0000000000001000-0x000000007fffffff]
[    0.000000] ACPI: PM-Timer IO Port: 0x1008
[    0.000000] ------------[ cut here ]------------
[    0.000000] WARNING: CPU: 0 PID: 0 at arch/x86/kernel/apic/apic.c:2065 __generic_processor_info+0x28c/0x370
[    0.000000] Only 63 processors supported.Processor 64/0x80 and the rest are ignored.
[    0.000000] Modules linked in:
[    0.000000] CPU: 0 PID: 0 Comm: swapper Not tainted 4.9.0 #25
[    0.000000] Hardware name: VMware, Inc. VMware Virtual Platform/440BX Desktop Reference Platform, BIOS 6.00 05/20/2014
[    0.000000]  ffffffff81e03cb0 ffffffff8132bbe7 ffffffff81e03d00 0000000000000000
[    0.000000]  ffffffff81e03cf0 ffffffff81059c26 0000081100001000 0000000000000040
[    0.000000]  0000000000000015 0000000000000000 0000000000000000 0000000000000080
[    0.000000] Call Trace:
[    0.000000]  [<ffffffff8132bbe7>] dump_stack+0x4d/0x66
[    0.000000]  [<ffffffff81059c26>] __warn+0xc6/0xe0
[    0.000000]  [<ffffffff81059c8a>] warn_slowpath_fmt+0x4a/0x50
[    0.000000]  [<ffffffff8103fa8c>] __generic_processor_info+0x28c/0x370
[    0.000000]  [<ffffffff8103ad12>] acpi_register_lapic+0x32/0x80
[    0.000000]  [<ffffffff81f5b252>] acpi_parse_lapic+0x46/0x4e
[    0.000000]  [<ffffffff81f88e2a>] acpi_parse_entries_array+0xf2/0x14d
[    0.000000]  [<ffffffff81f88fc4>] acpi_table_parse_entries_array+0xae/0xd0
[    0.000000]  [<ffffffff81f5bcc5>] acpi_boot_init+0xdf/0x4a7
[    0.000000]  [<ffffffff81f5b20c>] ? acpi_parse_x2apic_nmi+0x46/0x46
[    0.000000]  [<ffffffff81f5b6c8>] ? dmi_ignore_irq0_timer_override+0x2e/0x2e
[    0.000000]  [<ffffffff81f54dcf>] setup_arch+0xafa/0xc00
[    0.000000]  [<ffffffff81128082>] ? printk+0x43/0x4b
[    0.000000]  [<ffffffff81f4db86>] start_kernel+0x59/0x3c7
[    0.000000]  [<ffffffff81f4d28e>] x86_64_start_reservations+0x2a/0x2c
[    0.000000]  [<ffffffff81f4d408>] x86_64_start_kernel+0x178/0x18b
[    0.000000] ---[ end trace 3183ba07e9bf6fae ]---

So it seems to complain about the number of processors registered in the ACPI tables (more than 64?).

There is also what I find a surprisingly large number of ACPI LAPIC entries, but maybe that's just the way the virtual platform describes itself. Still, it's odd.
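
To see whether this warning only shows up in failing builds, the WARNING lines in the console logs of a booting and a non-booting kernel can be compared. A minimal sketch, assuming the log was saved from the virtual serial port and uses the line format shown above; list_kernel_warnings is a hypothetical helper name.

```shell
#!/bin/sh
# list_kernel_warnings FILE
# Prints the "file:line function+offset" part of every
# "WARNING: CPU: N PID: N at ..." line in a console log, with the printk
# timestamp stripped, prefixed by how many times each location occurs.
list_kernel_warnings() {
    sed -n 's/^.*WARNING: CPU: [0-9]* PID: [0-9]* at //p' "$1" | sort | uniq -c
}
```

Running it on the log of a good boot and on the log of a bad boot and comparing the two outputs shows which warnings are common to both.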
Comment 8 Christophe de Dinechin 2017-01-15 20:44:36 UTC
@Sinclair: macOS Sierra 10.12.2
Comment 9 Sinclair 2017-01-18 22:25:24 UTC
I've tried down the crash, although it doesn't seem to be related to the ACPI.

Please give this a try and let me know...

diff --git a/drivers/gpu/drm/vmwgfx/vmwgfx_fb.c b/drivers/gpu/drm/vmwgfx/vmwgfx_fb.c
index 723fd76..7a96798 100644
--- a/drivers/gpu/drm/vmwgfx/vmwgfx_fb.c
+++ b/drivers/gpu/drm/vmwgfx/vmwgfx_fb.c
@@ -481,8 +481,7 @@ static int vmw_fb_kms_framebuffer(struct fb_info *info)
        mode_cmd.height = var->yres;
        mode_cmd.pitches[0] = ((var->bits_per_pixel + 7) / 8) * mode_cmd.width;
        mode_cmd.pixel_format =
-               drm_mode_legacy_fb_format(var->bits_per_pixel,
-                       ((var->bits_per_pixel + 7) / 8) * mode_cmd.width);
+               drm_mode_legacy_fb_format(var->bits_per_pixel, depth);
 
        cur_fb = par->set_fb;
        if (cur_fb && cur_fb->width == mode_cmd.width &&
Comment 10 Christophe de Dinechin 2017-01-19 12:21:16 UTC
I tried your fix. No effect, apparently. Is there anything I could do to investigate further? When you said you "tried down the crash", does it mean you reproduced it?

I don't think it's related to ACPI either. I'm just puzzled why there are errors like this. Am I building Linux incorrectly? If so, why does it work on some versions and not others?
Comment 11 Sinclair 2017-01-19 15:30:40 UTC
I meant "tracked down the crash", or more accurately, tracked down "a" crash.  Given my fix did not fix your issue, I'm guessing it's not the same crash.

Can you attach the vmware.log for the VM right after it crashed?  This will likely be the most helpful log.

A few additional things that may help:

1.  Your HW model, e.g. 15" MacBook Pro 2014 with Intel graphics
2.  Try setting "modprobe.blacklist=vmwgfx" on the kernel command line just to
    see if it boots
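
To confirm that the parameter actually reached the guest kernel, the running command line can be checked from inside the VM. This is a small sketch rather than an official procedure; cmdline_has_param is a hypothetical helper name. Kernel command-line parameters are separated by whitespace, so the option has to appear as one word, with no spaces around the equals sign.

```shell
#!/bin/sh
# cmdline_has_param PARAM [CMDLINE]
# Succeeds if PARAM appears as a whole, space-delimited word on the kernel
# command line. CMDLINE defaults to the contents of /proc/cmdline, i.e. the
# command line of the currently running kernel.
cmdline_has_param() {
    param=$1
    cmdline=${2-$(cat /proc/cmdline)}
    case " $cmdline " in
        *" $param "*) return 0 ;;
        *) return 1 ;;
    esac
}
```

For example, "cmdline_has_param modprobe.blacklist=vmwgfx && echo present" prints "present" only if the option was passed to this boot exactly as written.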

Since you're seeing a corrupted VMware logo on guest boot, I suspect this may turn out to be something else entirely, but the data will help narrow things down.
Comment 12 Christophe de Dinechin 2017-01-19 16:13:02 UTC
Created attachment 252451
VM log file

This is from ~/Library/Logs/VMware Fusion/vmware-vmfusion.log.
Comment 13 Christophe de Dinechin 2017-01-19 16:55:20 UTC
I think you probably tracked down the first crash that I saw. Maybe if I disable console logging again, I will return to a bootable state? Tested that idea: no, still hangs. Although I do see the beginning of a graphical state in that case.

modprobe.blacklist did not fix things.

To answer the questions:
- MacBook Pro Retina 15-inch Mid 2015 (MacBook Pro 11,5), 2x2.5 Core i7
- AMD Radeon R9 M370X 2G RAM. Also Intel Iris Pro 1536MB, but I was always running on power when I had the issue.

The corrupted VMware logo is what you see at reboot every time. It's not different from what I can tell.

There is no vmware.log that I can find. Several log files, one called vmware-vmfusion.log. See attachment.
Comment 14 Christophe de Dinechin 2017-01-19 17:09:50 UTC
There is one thing I did not resolve yet: Building tag v4.9, I sometimes had a version that worked (that's the one I'm using right now), and sometimes had a version that did not work. There must be something in the sequence of things I do to build the kernel that has an impact on the result.

Basically, my bisecting script is doing:
make distclean
make menuconfig or make defconfig
make
make modules
sudo make modules_install install
sudo reboot

Anything in this sequence that is wrong?
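
One thing this sequence leaves open is which configuration each build used: "make distclean" removes .config, so every iteration starts from whatever "make menuconfig" or "make defconfig" produces that time. Whether this explains the varying results is only a guess, but it is cheap to rule out. A minimal sketch for recording a fingerprint of the configuration next to each good/bad verdict; config_fingerprint is a hypothetical helper name.

```shell
#!/bin/sh
# config_fingerprint FILE
# Prints a checksum of the options that are set in a kernel .config.
# Comment lines (the generated header and "# CONFIG_X is not set") and blank
# lines are dropped and the rest is sorted, so only differences in the
# options that are actually set change the result.
config_fingerprint() {
    grep -v -e '^#' -e '^$' "$1" | LC_ALL=C sort | cksum | cut -d' ' -f1
}
```

Calling "config_fingerprint .config" right after the configuration step and noting the value for each build makes it possible to tell whether two builds of the same tag really used the same options.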
Comment 15 Sinclair 2017-01-19 17:51:38 UTC
The easiest way to get the vmware.log file is probably through the terminal.

From the log file attached, it looks like the VM is located at /Volumes/LittleBig/Virtual Machines/Fedora 25 64-bit.vmwarevm

So start a terminal, and do a
  cp "/Volumes/LittleBig/Virtual Machines/Fedora 25 64-bit.vmwarevm/vmware.log" ~/Downloads

and then attach the vmware.log from downloads.
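
If the exact file name is not known, the logs in the bundle can be listed first. A minimal sketch, assuming the VM bundle keeps its current log as vmware.log and any older ones under names that also match vmware*.log; list_vm_logs is a hypothetical helper name.

```shell
#!/bin/sh
# list_vm_logs BUNDLE_DIR
# Lists the vmware*.log files directly inside a VM bundle directory,
# most recently modified first. Prints nothing if there are none.
list_vm_logs() {
    ls -t "$1"/vmware*.log 2>/dev/null
}
```

For the VM above this would be called as: list_vm_logs "/Volumes/LittleBig/Virtual Machines/Fedora 25 64-bit.vmwarevm"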

If the corrupted VMware logo is one with "lines" in the background, then that's most likely unrelated to the crash.

Is there a way you can try the same VM with Fusion 8.x?  We don't officially support Sierra with Fusion 7.x.  Given that blacklisting vmwgfx did not help, and that the crash appears to be random, I'm beginning to suspect it's due to certain incompatibilities between Fusion 7.x and Sierra.

I'll still try to reproduce the crash, but I'm only using Fusion 8.5.2.
Comment 16 Christophe de Dinechin 2017-01-20 09:29:45 UTC
Created attachment 252541
vmware.log after one failed boot following one successful boot

This is the requested vmware.log file, right after a failed boot of yesterday's linux master branch with your patch, following a successful boot of v4.9.
Comment 17 Christophe de Dinechin 2017-01-20 09:35:58 UTC
>Given that black listing vmwgfx did not help, and that the crash appears to be
>random

The crash is not random. For a given build, it is always successful, or always failing. However, for some versions, notably v4.9, I have done builds that reliably boot, and others that reliably fail. Right now, I have a v4.9 that reliably boots, for example. But if I rebuild v4.9, I may end up with one that reliably fails.

While I did the bisect, I suspect that I identified the correct change. It would be very strange if bisecting with some heisenbug would land right on a vmware gfx commit, out of all possible commits in Linux.

However, after that initial bisect, I started seeing really strange things, like: reverting to one specific commit that had been "good" during the bisect, rebuilding it, and seeing it as "bad". Never the other way round. It's more like there is something else I changed that made builds from good commits turn bad.

What I'll try today is a clean checkout, and redoing the steps of my build script manually the way I used to do them for the first iterations. I think something changed when I started scripting the Linux build.
Comment 18 Sinclair 2017-01-25 00:53:06 UTC
(In reply to Christophe de Dinechin from comment #17)
> What I'll try today is a clean checkout, and redoing the steps of my build
> script manually the way I used to do them for the first iterations. I think
> something changed when I started scripting the Linux build.

By any chance, can you try Fusion 8.x?
Comment 19 Christophe de Dinechin 2017-01-26 10:25:22 UTC
(In reply to Sinclair from comment #18)
> By any chance you can try Fusion 8.x?

I'm currently discussing how to get a Fusion 8 for testing. This may take some time.
Comment 20 Zhu Lingshan 2017-02-13 06:22:58 UTC
Hi experts,

I am using VMware Workstation Pro 12 on a Linux host and am facing the same issue. I also see "[drm] DX: no." in the log.
I have a temporary workaround: if I enable the 3D feature in the VMware settings, the kernel boots for me.

This cannot be a final solution, because I assume the kernel should boot without 3D. I am not a graphics expert, but I am glad to help, for example with testing.

Thanks,
BR
Zhu Lingshan
Comment 21 Sinclair 2017-02-13 15:41:01 UTC
(In reply to Zhu Lingshan from comment #20)
> Hi experts,
> 
> I am using VMware workstation Pro 12 on Linux host, facing the same issue, I
> also see [drm] DX: no. in the log. 
> I have a workaround can boot the kernel temporarily, that is you can try to
> enable 3D feature in VMware settings, then the kernel boot for me.
> 
> It can not be a final solution, because I assume the kernel should boot
> without 3D, I am not a graphic expert, but I am glad to help like testing.
> 
> Thanks,
> BR
> Zhu Lingshan

Hi, Which kernel are you using?

There was a fix that just made it into 4.10-rc8.  Can you try that?
Comment 22 Zhu Lingshan 2017-02-14 03:31:57 UTC
(In reply to Sinclair from comment #21)
> Hi, Which kernel are you using?
> 
> There was a fix that just made it into 4.10-rc8.  Can you try that?

Hi Sinclair,

The bug does not reproduce 100% of the time. The 4.10-rc8 kernel boots on one of my virtual machines but not on another, even though the two were cloned from the same VM. They are effectively the same virtual machine, yet they behave differently.

On the failing VM, I see the following log from the serial port.

[    4.495738] [drm]   DX Features.
[    4.495739] [drm] Max GMR ids is 64
[    4.495740] [drm] Max number of GMR pages is 65536
[    4.495740] [drm] Max dedicated hypervisor surface memory is 0 kiB
[    4.495741] [drm] Maximum display memory size is 32768 kiB
[    4.495742] [drm] VRAM at 0xe8000000 size is 8192 kiB
[    4.495742] [drm] MMIO at 0xfe000000 size is 256 kiB
[    4.495744] [drm] global init.
[    4.495846] [TTM] Zone  kernel: Available graphics memory: 2006924 kiB
[    4.495847] [TTM] Initializing pool allocator
[    4.495851] [TTM] Initializing DMA pool allocator
[    4.495989] [drm] Supports vblank timestamp caching Rev 2 (21.10.2013).
[    4.495990] [drm] No driver support for vblank timestamp query.
[    4.496289] [drm] Screen Target Display device initialized
[    4.496340] [drm] width 1280
[    4.496346] [drm] height 768
[    4.496351] [drm] bpp 32
[    4.519676] [drm] Fifo max 0x00040000 min 0x00001000 cap 0x0000077f
[    4.528399] [drm] Using command buffers with DMA pool.
[    4.529300] [drm] DX: no.

When I enable the 3D feature, it works again.

Additional information: again, the bug does not reproduce 100% of the time. From what I see, VMware for Linux (with Linux as the host) also has its own graphics problems. For example, when I run VMware Workstation 12 Pro on my laptop, which has an integrated GPU, it reports that there is no 3D accelerator hardware and disables 3D, even though I can see it is enabled in the settings. So I suggest we also look at the VMware side.

If you need further information, please let me know.

Thanks,
BR
Zhu Lingshan
