Latest working kernel version: 2.6.21
Earliest failing kernel version: 2.6.22 (don't know if it's the same reason)
Distribution: debian
Hardware Environment: vmware esx 3.5
Software Environment: vmware esx 3.5

Problem Description: tested 2.6.24-rc7, -rc8 and final. the 64-bit vm freezes and puts out:

WARNING: at kernel/time/clockevents.c:82 clockevents_program_event()
Clocksource tsc unstable (delta = 29539902156 ns)
Pid: 0, comm: swapper Not tainted 2.6.24-1-amd64 #1

Call Trace:
 [<ffffffff80255e05>] clockevents_program_event+0x3b/0x91
 tick_program_event+0x3a/0x5a
 hrtimer_force_reprogram+0x7f/0x81
 __remove_hrtimer+0x72/0x8c
 hrtimer_try_to_cancel+0x5c/0x77
 hrtimer_cancel+0x14/0x20
 tick_nohz_restart_sched_tick+0xda/0x13e
 default_idle+0x0/0x42
 cpu_idle+0xbc/0xc3
 rest_init+0x5a/0x5c
 start_kernel+0x2e0/0x2eb
 _sinittext+0x119/0x120

Steps to reproduce: start the kernel and wait ...

i don't know the exact kernel releases where the problems begin to happen ... 2.6.22 has problems with snapshots, 2.6.23 works somewhat, but the server has 100% cpu usage (samba, 2cpu-vm), 2.6.21 was the last working ...
Created attachment 14617 [details] screenshot of the stacktrace
I'll mark this a regression.
Can you please verify whether the problem persists with the latest 2.6.24.4 and 2.6.25-rc7? Thanks, tglx
i have tested kernels 2.6.24.4 and 2.6.25-rc7, but there's no difference: the problem persists. the problem now is that the machines freeze: no output, the login prompt is displayed, but no user interaction or input is possible, and the machines are not reachable from the network. some VMs work with 2.6.24, but 2 of these machines are making trouble (both: distribution debian-testing, main app samba).

thanks, tk

(In reply to comment #3 from tglx@linutronix.de, 2008-03-27.)
is this specific to a dedicated esx box, or does it happen on other esx hosts, too? does the vm stick to a dedicated host, or is it relocated by vmotion at any time? if earlier kernel versions work, then maybe git bisect gives a clue....
>Pid: 0, comm: swapper Not tainted 2.6.24-1-amd64 #1 so, is this kernel optimized for amd cpu (CONFIG_MK8) ? can you post your .config and/or try generic kernel ?
Short version: found working solution. set clocksource to acpi_pm

Long version: i found my machines frozen, no indication why. turning off console blanking and found "clocksource tsc unstable (delta = 3231410198209 ns)". the machines that crashed were the only ones using "Virtual SMP" (2 or 4 CPUs). setting clocksource to acpi_pm made them stable again. (the other option to make them stable was to use only 1 cpu)

thank you for your help! please close this bug report.
thomas

(In reply to comment #6 from devzero@web.de, 2008-05-01.)
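For readers who want to try the same workaround, here is a minimal sketch of inspecting and switching the clocksource through sysfs. The sysfs directory is the standard one (the same path is used later in this thread); the function names are hypothetical helpers made up for this illustration, and writing to the file needs root on a real system.

```python
from pathlib import Path
from typing import List

# Standard sysfs location of the clocksource controls on Linux.
SYSFS_DIR = Path("/sys/devices/system/clocksource/clocksource0")


def current_clocksource(base: Path = SYSFS_DIR) -> str:
    """Return the clocksource the kernel is using right now."""
    return (base / "current_clocksource").read_text().strip()


def available_clocksources(base: Path = SYSFS_DIR) -> List[str]:
    """Return the clocksources the kernel offers on this machine."""
    return (base / "available_clocksource").read_text().split()


def switch_clocksource(name: str, base: Path = SYSFS_DIR) -> None:
    """Select another clocksource at runtime (hypothetical helper, needs root)."""
    if name not in available_clocksources(base):
        raise ValueError(f"clocksource {name!r} is not available")
    (base / "current_clocksource").write_text(name + "\n")
```

Calling `switch_clocksource("acpi_pm")` as root would apply the workaround until the next reboot. To make it permanent, the usual route is the `clocksource=acpi_pm` kernel boot parameter.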
>Short version: found working solution. set clocksource to acpi_pm

you found a workaround ;)

>please close this bug report.

seems YOUR problem is solved, but maybe it's a problem in the kernel which remains - and that's not getting resolved if this bug report is closed....
On Mon, 5 May 2008, bugme-daemon@bugzilla.kernel.org wrote: > i found my machines frozen, no indication why. turning off console > blanking and found "clocksource tsc unstable (delta = 3231410198209 ns)" > the machines that crashed were the only ones using "Virtual SMP" (2 or > 4 CPUs) setting clocksource to acpi_pm made them stable again. (the > other option to make them stable was to use only 1 cpu) So this is more a VMware problem ? Thanks, tglx
it seems to be. - Googling for this problem shows that the problem is known - even at vmware itself (they have support articles). But no clear solutions for 64-bit, virtual smp.

i think that the time stamp clock registers in the different cpu diverge more and more from each other (only visible on (Virtual) SMP - because there are more than 1 cpu that can diverge from the others).

linux-kernel reports on bootup that the ts in the different cpus have a too big difference and the kernel comes to the conclusion that the timesource is unstable. - that's all.

after working for 1 minute or for a few hours the freeze comes. - last message is "clocksource tsc unstable ..."

running ntp makes no difference.

for me it does not seem to be a problem with the kernel but a problem with the underlying (virtual) "hardware".

(In reply to comment #9 from tglx@linutronix.de, 2008-05-07.)
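To put the reported deltas in perspective, a small sketch that pulls the "clocksource tsc unstable" message out of a log and converts the delta to seconds. The message format is copied from the two messages quoted in this report; the regular expression and the function name are assumptions made for this illustration, and other kernel versions may word the message differently.

```python
import re

# Matches the message quoted in this report, in either capitalisation.
UNSTABLE_RE = re.compile(
    r"clocksource tsc unstable \(delta = (-?\d+) ns\)", re.IGNORECASE
)


def unstable_deltas_seconds(log_text: str) -> list:
    """Return every reported TSC delta found in the log, in seconds."""
    return [int(m.group(1)) / 1e9 for m in UNSTABLE_RE.finditer(log_text)]


# The two messages quoted in this bug report.
sample = """\
Clocksource tsc unstable (delta = 29539902156 ns)
clocksource tsc unstable (delta = 3231410198209 ns)
"""

for seconds in unstable_deltas_seconds(sample):
    print(f"{seconds:.1f} s")
```

For the two messages in this report that is roughly 29.5 seconds and 3231.4 seconds (close to 54 minutes), far more than a small per-CPU offset.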
>it seems to be. - Googling for this problem shows that the problem is
>known - even at vmware itself (they have support articles).

could you share your findings?

>i think that the time stamp clock registers in the different cpu
>diverge more and more from each other (only visible on (Virtual) SMP -
>because there are more than 1 cpu that can diverge from the others).

wasn't that true also for cpu's like AMD X2 ?
what CPU do you have?

>linux-kernel reports on bootup that the ts in the different cpus have
>a too big difference and the kernel comes to the conclusion that the
>timesource is unstable. - that's all.

yes, but should that be a reason to freeze ?

>for me it does not seem to be a problem with the kernel but a problem
>with the underlying (virtual) "hardware".

i'm not sure. regardless if this is a vmware problem or a linux problem - here is a bug which needs to be found or at least needs explanation.
This issue may have been fixed by this commit:
http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commit;h=d8bb6f4c1670c8324e4135c61ef07486f7f17379

When this condition occurs, not only can gettimeofday go backwards, but the kernel can freeze (due to the bogus time values used to calculate the timeout). Please try 2.6.26-rc1, which includes this patch, to see if this problem is still reproducible (without using clocksource=acpi_pm).

BTW, the virtual TSCs of each virtual cpu don't diverge from each other over the long term; it's just that this bug only needs a very small difference between TSCs and the right timing of taking a timer interrupt on one vcpu and reading the tsc on the other in order to hit.
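The race described above can be illustrated with a toy model. This is only a sketch under stated assumptions, not the code from the linked commit: it assumes the elapsed time is computed as an unsigned 64-bit difference between the current counter value and the value recorded at the last timer update, and it shows clamping as one plausible guard. All names and numbers are hypothetical.

```python
MASK64 = (1 << 64) - 1  # model the TSC as a 64-bit unsigned counter


def elapsed_cycles(now: int, last: int) -> int:
    """Unsigned 64-bit difference, as a naive delta computation would see it."""
    return (now - last) & MASK64


def read_clamped(raw: int, last: int) -> int:
    """Never report a counter value older than the last one recorded."""
    return raw if raw >= last else last


# Hypothetical values: one vcpu recorded `last` in its timer interrupt, and
# the other vcpu then reads its own TSC, which happens to be 50 cycles behind.
last = 1_000_000_000
raw_on_other_vcpu = last - 50

# Without a guard, the tiny backwards step wraps to a value close to 2**64.
print(elapsed_cycles(raw_on_other_vcpu, last))

# With the clamp, the reader simply sees no time elapsed.
print(elapsed_cycles(read_clamped(raw_on_other_vcpu, last), last))  # prints 0
```

The point of the model is that a difference of a few cycles is enough: once the subtraction wraps, any timeout derived from it is bogus, which is consistent with the freeze described in this comment.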
> >i think that the time stamp clock registers in the different cpu
> >diverge more and more from each other (only visible on (Virtual) SMP -
> >because there are more than 1 cpu that can diverge from the others).
>
> wasn't that true also for cpu's like AMD X2 ?
> what CPU do you have?

We actually exclude the TSC on systems like the X2 before it is used as the clock source.

> >linux-kernel reports on bootup that the ts in the different cpus have
> >a too big difference and the kernel comes to the conclusion that the
> >timesource is unstable. - that's all.

Hmm, which clocksource is used by the kernel?

  cat /sys/devices/system/clocksource/clocksource0/current_clocksource

> yes, but should that be a reason to freeze ?

Not really. But we have a hard time when the underlying "hardware" does something which we cannot detect.

> >for me it does not seem to be a problem with the kernel but a problem
> >with the underlying (virtual) "hardware".
>
> i'm not sure. regardless if this is a vmware problem or a linux problem -
> here is a bug which needs to be found or at least needs explanation.

Agreed. A workaround is one thing, but having a reasonable sense of the root cause is definitely what we want.

Thanks,
tglx
here is somebody with a probably related issue. thread at: http://marc.info/?l=linux-kernel&m=121062853005859&w=2
for me it showed up once within half an hour; however, as a result of this problem I lost a mysql database. mysql innodb does not start ("Assertion error"), and even "innodb_force_recovery = 4" did not help. the backup was a week old, so a whole department lost several days of data. the management is simply in shock, and I am going to be fired (((

this problem has been around for a whole year already:
-----------
I'm stumped trying to track down the below intermittent problem..... I've confirmed this problem on 2.6.19, 2.6.20 and 2.6.21.
-----------
http://lkml.org/lkml/2007/6/14/154
http://kerneltrap.org/mailarchive/linux-kernel/2007/6/14/103765
http://kerneltrap.org/node/16175

"System hang from time to time"
http://bugzilla.kernel.org/show_bug.cgi?id=8300
"sata hotplug removal of drive freezes all 2.6.21 kernels"
http://bugzilla.kernel.org/show_bug.cgi?id=8421
"(sata_via) system freeze in random time"
http://bugzilla.kernel.org/show_bug.cgi?id=9115
"kernel freezes with on clockevent warning"
http://bugzilla.kernel.org/show_bug.cgi?id=9834
"[pata_ali] Unspecified hang on Acer laptop"
http://bugzilla.kernel.org/show_bug.cgi?id=9898
"System freezes after I/O on pata_jmicron device"
http://bugzilla.kernel.org/show_bug.cgi?id=10296

https://bugs.launchpad.net/ubuntu/+source/linux/+bug/217920
https://bugs.launchpad.net/ubuntu/+bug/164183
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/229747
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/159521
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/187146
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/221437
https://bugs.launchpad.net/ubuntu/+bug/226600
Is this problem fixed now?