Bug 9834

Summary: kernel freezes with on clockevent warning
Product: Timers
Component: Other
Reporter: thomas kotzian (thomas.kotzian)
Assignee: john stultz (john.stultz)
Status: REJECTED INSUFFICIENT_DATA
Severity: high
Priority: P1
CC: akataria, akpm, devzero, dhecht, mingo, tglx
Hardware: All
OS: Linux
Kernel Version: 2.6.24
Subsystem:
Regression: Yes
Bisected commit-id:
Attachments: screenshot of the stacktrace

Description thomas kotzian 2008-01-27 16:28:21 UTC
Latest working kernel version: 2.6.21
Earliest failing kernel version: 2.6.22 (don't know if it's the same reason)
Distribution: debian
Hardware Environment: vmware esx 3.5
Software Environment: vmware esx 3.5
Problem Description:

tested 2.6.24-rc7,-rc8 and final

64-bit vm freezes and puts out:

WARNING: at kernel/time/clockevents.c:82 clockevents_program_event()
Clocksource tsc unstable (delta = 29539902156 ns)
Pid: 0, comm: swapper Not tainted 2.6.24-1-amd64 #1

Call Trace:
 [<ffffffff80255e05>] clockevents_program_event+0x3b/0x91
tick_program_event+0x3a/0x5a
hrtimer_force_reprogram+0x7f/0x81
__remove_hrtimer+0x72/0x8c
hrtimer_try_to_cancel+0x5c/0x77
hrtimer_cancel+0x14/0x20
tick_nohz_restart_sched_tick+0xda/0x13e
default_idle+0x0/0x42
cpu_idle+0xbc/0xc3
rest_init+0x5a/0x5c
start_kernel+0x2e0/0x2eb
_sinittext+0x119/0x120

Steps to reproduce: start the kernel and wait ...

i don't know the exact kernel releases where the problems begin to happen ... 2.6.22 has problems with snapshots, 2.6.23 works somewhat, but the server has 100% cpu usage (samba, 2cpu-vm), 2.6.21 was the last working ...
Comment 1 thomas kotzian 2008-01-27 16:29:37 UTC
Created attachment 14617
screenshot of the stacktrace
Comment 2 Andrew Morton 2008-01-27 21:04:34 UTC
I'll mark this a regression.
Comment 3 Thomas Gleixner 2008-03-27 12:50:14 UTC
Can you please verify whether the problem persists with the latest 2.6.24.4 and 2.6.25-rc7?

Thanks,
         tglx
Comment 4 thomas kotzian 2008-04-07 02:06:45 UTC
i have tested kernel 2.6.24.4 and 2.6.25-rc7 but there's no difference  
- problem persists. the problem now is that the machines are freezing.  
no output, login prompt displayed, no user interaction, no input  
possible, not reachable from the network.

some VMs work with 2.6.24 - but 2 of these machines are making  
trouble. (both: distribution: debian-testing, main app: samba)
thanks, tk

Comment 5 Roland Kletzing 2008-04-12 10:08:51 UTC
is this specific to a dedicated esx box, or does it happen on other esx hosts, too ?

does the vm stick to dedicated host, or is it relocated by vmotion at any time?

if earlier kernel versions work, then maybe git bisect gives a clue....
Comment 6 Roland Kletzing 2008-05-01 04:22:32 UTC
>Pid: 0, comm: swapper Not tainted 2.6.24-1-amd64 #1

so, is this kernel optimized for amd cpu (CONFIG_MK8) ?
can you post your .config and/or try generic kernel ?
Comment 7 thomas kotzian 2008-05-05 03:01:47 UTC
Short version: found working solution. set clocksource to acpi_pm

Long version:
i found my machines frozen, with no indication why. after turning off console
blanking i found "clocksource tsc unstable (delta = 3231410198209 ns)".
the machines that crashed were the only ones using "Virtual SMP" (2 or
4 CPUs). setting clocksource to acpi_pm made them stable again. (the
other option to make them stable was to use only 1 cpu)
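
For reference, the clocksource can be inspected and switched at runtime through sysfs, or fixed at boot with the clocksource=acpi_pm kernel parameter. Below is a minimal Python sketch of the runtime variant. The function names are made up for illustration, and it assumes the usual current_clocksource and available_clocksource files under /sys/devices/system/clocksource/clocksource0. Writing the file needs root, and the change does not survive a reboot.

```python
from pathlib import Path

# Usual sysfs location of the clocksource controls (assumed to be present).
SYSFS_DIR = Path("/sys/devices/system/clocksource/clocksource0")


def read_clocksources(base=SYSFS_DIR):
    """Return (current, available) clocksource names read from sysfs."""
    base = Path(base)
    current = (base / "current_clocksource").read_text().strip()
    available = (base / "available_clocksource").read_text().split()
    return current, available


def switch_clocksource(name, base=SYSFS_DIR):
    """Select `name` as the clocksource if the kernel offers it.

    Hypothetical helper: refuses names that are not listed as available
    and skips the write when `name` is already in use.
    """
    base = Path(base)
    current, available = read_clocksources(base)
    if name not in available:
        raise ValueError(f"{name!r} not offered by this kernel: {available}")
    if name != current:
        (base / "current_clocksource").write_text(name + "\n")
    return name
```

Calling read_clocksources() on an affected VM should show tsc as the current source before the workaround; switch_clocksource("acpi_pm") then applies it until the next reboot.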

thank you for your help!

please close this bug report.

thomas

Comment 8 Roland Kletzing 2008-05-05 13:32:44 UTC
>Short version: found working solution. set clocksource to acpi_pm

you found a workaround ;)

>please close this bug report.

seems YOUR problem is solved, but maybe it's a problem in the kernel which remains - and that's not getting resolved if this bug report is closed....
Comment 9 Thomas Gleixner 2008-05-07 01:45:59 UTC
On Mon, 5 May 2008, bugme-daemon@bugzilla.kernel.org wrote:
> i found my machines frozen, no indication why. turning off console  
> blanking and found "clocksource tsc unstable (delta = 3231410198209 ns)"
> the machines that crashed were the only ones using "Virtual SMP" (2 or  
> 4 CPUs) setting clocksource to acpi_pm made them stable again. (the  
> other option to make them stable was to use only 1 cpu)

So this is more a VMware problem ?

Thanks,
	tglx
Comment 10 thomas kotzian 2008-05-07 05:52:28 UTC
it seems to be. - Googling for this problem shows that the problem is  
known - even at vmware itself (they have support articles). But no  
clear solutions for 64-bit, virtual smp.

i think that the time stamp clock registers in the different cpu  
diverge more and more from each other (only visible on (Virtual) SMP -  
because there are more than 1 cpu that can diverge from the others).

linux-kernel reports on bootup that the ts in the different cpus have  
a too big difference and the kernel comes to the conclusion that the  
timesource is unstable. - that's all. after working for 1 minute or  
for a few hours the freeze comes. - last message is "clocksource tsc  
unstable ..."

running ntp makes no difference.

for me it does not seem to be a problem with the kernel but a problem  
with the underlying (virtual) "hardware".


Comment 11 Roland Kletzing 2008-05-07 12:36:43 UTC
>it seems to be. - Googling for this problem shows that the problem is  
>known - even at vmware itself (they have support articles). 

could you share your findings?

>i think that the time stamp clock registers in the different cpu  
>diverge more and more from each other (only visible on (Virtual) SMP -  
>because there are more than 1 cpu that can diverge from the others).

wasn`t that true also for cpu`s like AMD X2 ?
what CPU do you have?

>linux-kernel reports on bootup that the ts in the different cpus have  
>a too big difference and the kernel comes to the conclusion that the  
>timesource is unstable. - that's all. 

yes, but should that be a reason to freeze ?


>for me it does not seem to be a problem with the kernel but a problem  
>with the underlying (virtual) "hardware".

i`m not sure. regardless if this is a vmware problem or a linux problem  - here is a bug which needs to be found or at least needs explanation.
Comment 12 Dan Hecht 2008-05-07 19:19:57 UTC
This issue may have been fixed by this commit:

http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commit;h=d8bb6f4c1670c8324e4135c61ef07486f7f17379

When this condition occurs, not only can gettimeofday go backwards, but the kernel can freeze (due to the bogus time values used to calculate the timeout).

Please try 2.6.26-rc1 which includes this patch to see if this problem is still reproducible (without using clocksource=acpi_pm).

BTW, the virtual TSCs of each virtual cpu don't diverge from each other over the long term; it's just that this bug only needs a very small difference between TSCs and the right timing of taking a timer interrupt on one vcpu and reading the tsc on the other in order to hit.
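
To illustrate why a very small TSC difference is enough, here is a rough Python sketch of the arithmetic. It is not the code from the commit above, and all names and numbers are invented for the example. It assumes the elapsed cycle count is computed as an unsigned, masked difference between the counter value read now and the value recorded at the last timer update. If the reading vcpu's TSC is even slightly behind the recorded value, the difference wraps around to an enormous number.

```python
MASK64 = (1 << 64) - 1  # the TSC is a 64-bit counter


def cycles_delta(now, cycle_last, mask=MASK64):
    """Elapsed cycles computed with wrap-around (unsigned) arithmetic."""
    return (now - cycle_last) & mask


def clamped_read(now, cycle_last):
    """One possible guard: never report a value older than the last one."""
    return now if now >= cycle_last else cycle_last


# vcpu0 records cycle_last in its timer interrupt; vcpu1, whose TSC is
# 50 cycles behind in this example, reads the counter right afterwards.
cycle_last = 1_000_000
now_on_vcpu1 = cycle_last - 50

bogus = cycles_delta(now_on_vcpu1, cycle_last)
fixed = cycles_delta(clamped_read(now_on_vcpu1, cycle_last), cycle_last)

print(bogus == 2**64 - 50)  # True: a tiny negative step looks like a huge jump
print(fixed)                # 0: the clamped reading yields no elapsed time
```

A delta of that size, once converted to nanoseconds, would be consistent with the implausible values in the "clocksource tsc unstable" messages quoted earlier, although the sketch does not reproduce those exact numbers.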
Comment 13 Thomas Gleixner 2008-05-08 06:07:34 UTC
> >i think that the time stamp clock registers in the different cpu  
> >diverge more and more from each other (only visible on (Virtual) SMP -  
> >because there are more than 1 cpu that can diverge from the others).
> 
> wasn`t that true also for cpu`s like AMD X2 ?
> what CPU do you have?

We exclude TSC on systems like the X2 actually before it is used as
the clock source.
 
> >linux-kernel reports on bootup that the ts in the different cpus have  
> >a too big difference and the kernel comes to the conclusion that the  
> >timesource is unstable. - that's all. 

Hmm, which clocksource is used by the kernel ?

cat /sys/devices/system/clocksource/clocksource0/current_clocksource 
 
> yes, but should that be a reason to freeze ?

Not really. But we have a hard time when the underlying "hardware"
does something which we can not detect.

> >for me it does not seem to be a problem with the kernel but a problem  
> >with the underlying (virtual) "hardware".
> 
> i`m not sure. regardless if this is a vmware problem or a linux problem  -
> here
> is a bug which needs to be found or at least needs explanation.

Agreed. workaround is one thing, but having a reasonable sense for the
root cause is definitely what we want.

Thanks,

	tglx
Comment 14 Roland Kletzing 2008-05-14 10:59:31 UTC
here is somebody with a probably related issue

thread at: http://marc.info/?l=linux-kernel&m=121062853005859&w=2
Comment 15 kiev 2008-05-25 16:08:32 UTC
for me it showed up about once every half hour, and as a result of this
problem I lost a mysql database - mysql innodb would not start ("Assertion
error") and even "innodb_force_recovery = 4" did not help. the backup was a
week old - the whole department lost several days of work, management is
simply in shock - I am going to be fired (((

this problem has existed for a whole year already:
-----------
I'm stumped trying to track down the below intermittent problem.....
I've confirmed this problem on 2.6.19, 2.6.20 and 2.6.21.
-----------
http://lkml.org/lkml/2007/6/14/154
http://kerneltrap.org/mailarchive/linux-kernel/2007/6/14/103765
http://kerneltrap.org/node/16175

"System hang from time to time" http://bugzilla.kernel.org/show_bug.cgi?id=8300
"sata hotplug removal of drive freezes all 2.6.21 kernels"
http://bugzilla.kernel.org/show_bug.cgi?id=8421
"(sata_via) system freeze in random time"
http://bugzilla.kernel.org/show_bug.cgi?id=9115
"kernel freezes with on clockevent warning"
http://bugzilla.kernel.org/show_bug.cgi?id=9834
"[pata_ali] Unspecified hang on Acer laptop"
http://bugzilla.kernel.org/show_bug.cgi?id=9898
"System freezes after I/O on pata_jmicron device"
http://bugzilla.kernel.org/show_bug.cgi?id=10296

https://bugs.launchpad.net/ubuntu/+source/linux/+bug/217920
https://bugs.launchpad.net/ubuntu/+bug/164183
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/229747
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/159521
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/187146
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/221437
https://bugs.launchpad.net/ubuntu/+bug/226600
Comment 16 Dan Hecht 2008-05-25 16:09:54 UTC
I'm out of the office until June 23rd.
Comment 17 Thomas Gleixner 2008-09-04 10:57:19 UTC
Is this problem fixed now ?