Bug 7640

Summary: [2.6.18] Significant delays when booting in vmware
Product: Platform Specific/Hardware Reporter: Frans Pop (elendil)
Component: i386Assignee: john stultz (john.stultz)
Status: REJECTED INVALID    
Severity: normal CC: john.stultz, jth, mann, zach
Priority: P2    
Hardware: i386   
OS: Linux   
Kernel Version: 2.6.18.3 Subsystem:
Regression: --- Bisected commit-id:
Attachments: Patch to verify calibrate_delay is being slow.

Description Frans Pop 2006-12-06 00:58:39 UTC
Most recent kernel where this bug did *NOT* occur: See below.
Distribution: Debian
Hardware Environment: VMWare 5.5.2 build-29772 on Pentium 4 laptop
Software Environment: Debian unstable
Problem Description:
During early boot of the kernel in WMWare I am seeing a significant pause in 
two places that was not there with 2.6.17 or earlier kernels.

Using git-bisect I have traced the origin of the delay to the following patch:
commit 539eb11e6e904f2cd4f62908cc5e44d724879721
Author: john stultz <johnstul@us.ibm.com>
Date:   Mon Jun 26 00:25:10 2006 -0700
    [PATCH] Time: i386 Conversion - part 2: Rework TSC Support

With a kernel build from git up to the previous commit, both delays disappear.
Note that this commit was part of a series.

I have marked the delays in the fragment from dmesg below.
<snip>
Memory: 154944k/163840k available (1500k kernel code, 8280k reserved, 602k 
data, 256k init, 0k highmem)
Checking if this processor honours the WP bit even in supervisor mode... Ok.
*** significant delay of about 5-7 seconds ***
Calibrating delay using timer specific routine.. 3222.25 BogoMIPS 
(lpj=6444502)
Security Framework v1.0.0 initialized
SELinux:  Disabled at boot.
Capability LSM initialized
Mount-cache hash table entries: 512
CPU: After generic identify, caps: 0febfbff 00000000 00000000 00000000 
00000000 00000000 00000000
CPU: After vendor identify, caps: 0febfbff 00000000 00000000 00000000 00000000 
00000000 00000000
CPU: Trace cache: 12K uops, L1 D cache: 8K
CPU: L2 cache: 512K
CPU: After all inits, caps: 0febfbff 00000000 00000000 00000080 00000000 
00000000 00000000
Compat vDSO mapped to ffffe000.
CPU: Intel Mobile Intel(R) Pentium(R) 4     CPU 2.80GHz stepping 08
Checking 'hlt' instruction... OK.
ACPI: Core revision 20060707
ENABLING IO-APIC IRQs
..TIMER: vector=0x31 apic1=0 pin1=2 apic2=-1 pin2=-1
*** significant delay of about 8-10 seconds ***
checking if image is initramfs... it is
Freeing initrd memory: 4182k freed
</snip>

Debian bug #401352 has some additional details:
http://bugs.debian.org/401352
Comment 1 john stultz 2006-12-06 11:14:50 UTC
Does booting w/ clocksource=acpi_pm or clocksource=pit change the behavior?
Comment 2 Frans Pop 2006-12-07 15:14:02 UTC
> Does booting w/ clocksource=acpi_pm or clocksource=pit change the behavior?

No, neither helps.
Comment 3 john stultz 2006-12-07 16:16:23 UTC
Huh. That's stumps me a bit. It looks like the delay calibration is going
slowly, but the lpj numbers and the cpu_khz values between the two dmesgs
(2.6.18 vs 2.6.17) are pretty similar. Its odd.
Comment 4 john stultz 2006-12-07 16:19:37 UTC
Created attachment 9759 [details]
Patch to verify calibrate_delay is being slow.

This is just a simple printk patch that should verify that calibrate_delay is
the cause of the initial stall. You should see the "starting calibrate" message
then the stall, then the "finishing calibrate" message.
Comment 5 Tim Mann 2006-12-07 16:49:21 UTC
I'll check whether we've seen this internally at VMware.  I didn't notice it
when I was testing a slightly earlier version of John's patches.
Comment 6 Frans Pop 2006-12-07 17:38:49 UTC
> You should see the "starting calibrate" message then the stall, then
> the "finishing calibrate" message.

Yes, confirmed. Only for the first delay though, it shows nothing for the 
second one (but from your message that was expected).
Comment 7 john stultz 2006-12-07 18:11:47 UTC
Huh. Still stumped on this one. I suspect it has something to do w/ changes to
read_current_timer(), but it appears both cases are using the TSC, so it
shouldn't really be different. Other potential causes for slowness would be the
timer tick being off, which could affect jiffies from being updated at the right
frequency.

But neither of these really seem right. Does booting w/ "noapic" change anything?
Comment 8 Tim Mann 2006-12-07 18:33:43 UTC
A couple of people have indeed seen this internally at VMware, and we'll look
into it from our end too.  No results yet.

Also see this LKML thread:
http://groups.google.com/group/fa.linux.kernel/browse_thread/thread/ede1e4e63b08b916/11e5725a86524552?lnk=st&q=2.6.18+delay+boot+vmware&rnum=1&hl=en#11e5725a86524552

The max_cstate part of that thread looks like a red herring to me, though.
Comment 9 Frans Pop 2006-12-07 18:38:53 UTC
> Does booting w/ "noapic" change anything?

Not for the initial delay.
It does get rid of the second delay, but seems to me only by skipping a whole 
codepath as the message preceding that delay is gone too.
It also seems to make the boot as a whole more sluggish.
Comment 10 Frans Pop 2006-12-23 11:51:40 UTC
Any news on this one?

I've just discovered that this issue seems to also have another effect: a 30 
second timeout (for swapping floppies in a floppy based Debian install) takes 
~60 seconds in VMWare. On real hardware it is exactly 30 seconds.

The code used for the timeout is in:
http://svn.debian.org/wsvn/d-i/trunk/packages/rootskel/src-bootfloppy/bin/timeout_read.c?op=file&rev=0&sc=0
Comment 11 Zachary Amsden 2007-01-04 15:55:26 UTC
John, your code looks great.  This turns out to be a VMware bug.  Two
workarounds, put this in the .vmx configuration file for the VM:

timer.calibrationUsec = 90000

Or boot Linux with lpj=XXXX as a bootparameter, where XXXX is the host value.
Comment 12 Frans Pop 2007-01-04 21:55:24 UTC
Closing report as it turns out to be a VMWare issue, not a bug in the kernel.
Setting timer.calibrationUsec works for me.

Thanks to all for their efforts.