Most recent kernel where this bug did *NOT* occur: See below. Distribution: Debian Hardware Environment: VMWare 5.5.2 build-29772 on Pentium 4 laptop Software Environment: Debian unstable Problem Description: During early boot of the kernel in WMWare I am seeing a significant pause in two places that was not there with 2.6.17 or earlier kernels. Using git-bisect I have traced the origin of the delay to the following patch: commit 539eb11e6e904f2cd4f62908cc5e44d724879721 Author: john stultz <johnstul@us.ibm.com> Date: Mon Jun 26 00:25:10 2006 -0700 [PATCH] Time: i386 Conversion - part 2: Rework TSC Support With a kernel build from git up to the previous commit, both delays disappear. Note that this commit was part of a series. I have marked the delays in the fragment from dmesg below. <snip> Memory: 154944k/163840k available (1500k kernel code, 8280k reserved, 602k data, 256k init, 0k highmem) Checking if this processor honours the WP bit even in supervisor mode... Ok. *** significant delay of about 5-7 seconds *** Calibrating delay using timer specific routine.. 3222.25 BogoMIPS (lpj=6444502) Security Framework v1.0.0 initialized SELinux: Disabled at boot. Capability LSM initialized Mount-cache hash table entries: 512 CPU: After generic identify, caps: 0febfbff 00000000 00000000 00000000 00000000 00000000 00000000 CPU: After vendor identify, caps: 0febfbff 00000000 00000000 00000000 00000000 00000000 00000000 CPU: Trace cache: 12K uops, L1 D cache: 8K CPU: L2 cache: 512K CPU: After all inits, caps: 0febfbff 00000000 00000000 00000080 00000000 00000000 00000000 Compat vDSO mapped to ffffe000. CPU: Intel Mobile Intel(R) Pentium(R) 4 CPU 2.80GHz stepping 08 Checking 'hlt' instruction... OK. ACPI: Core revision 20060707 ENABLING IO-APIC IRQs ..TIMER: vector=0x31 apic1=0 pin1=2 apic2=-1 pin2=-1 *** significant delay of about 8-10 seconds *** checking if image is initramfs... it is Freeing initrd memory: 4182k freed </snip> Debian bug #401352 has some additional details: http://bugs.debian.org/401352
Does booting w/ clocksource=acpi_pm or clocksource=pit change the behavior?
> Does booting w/ clocksource=acpi_pm or clocksource=pit change the behavior? No, neither helps.
Huh. That's stumps me a bit. It looks like the delay calibration is going slowly, but the lpj numbers and the cpu_khz values between the two dmesgs (2.6.18 vs 2.6.17) are pretty similar. Its odd.
Created attachment 9759 [details] Patch to verify calibrate_delay is being slow. This is just a simple printk patch that should verify that calibrate_delay is the cause of the initial stall. You should see the "starting calibrate" message then the stall, then the "finishing calibrate" message.
I'll check whether we've seen this internally at VMware. I didn't notice it when I was testing a slightly earlier version of John's patches.
> You should see the "starting calibrate" message then the stall, then > the "finishing calibrate" message. Yes, confirmed. Only for the first delay though, it shows nothing for the second one (but from your message that was expected).
Huh. Still stumped on this one. I suspect it has something to do w/ changes to read_current_timer(), but it appears both cases are using the TSC, so it shouldn't really be different. Other potential causes for slowness would be the timer tick being off, which could affect jiffies from being updated at the right frequency. But neither of these really seem right. Does booting w/ "noapic" change anything?
A couple of people have indeed seen this internally at VMware, and we'll look into it from our end too. No results yet. Also see this LKML thread: http://groups.google.com/group/fa.linux.kernel/browse_thread/thread/ede1e4e63b08b916/11e5725a86524552?lnk=st&q=2.6.18+delay+boot+vmware&rnum=1&hl=en#11e5725a86524552 The max_cstate part of that thread looks like a red herring to me, though.
> Does booting w/ "noapic" change anything? Not for the initial delay. It does get rid of the second delay, but seems to me only by skipping a whole codepath as the message preceding that delay is gone too. It also seems to make the boot as a whole more sluggish.
Any news on this one? I've just discovered that this issue seems to also have another effect: a 30 second timeout (for swapping floppies in a floppy based Debian install) takes ~60 seconds in VMWare. On real hardware it is exactly 30 seconds. The code used for the timeout is in: http://svn.debian.org/wsvn/d-i/trunk/packages/rootskel/src-bootfloppy/bin/timeout_read.c?op=file&rev=0&sc=0
John, your code looks great. This turns out to be a VMware bug. Two workarounds, put this in the .vmx configuration file for the VM: timer.calibrationUsec = 90000 Or boot Linux with lpj=XXXX as a bootparameter, where XXXX is the host value.
Closing report as it turns out to be a VMWare issue, not a bug in the kernel. Setting timer.calibrationUsec works for me. Thanks to all for their efforts.