Bug 7963 - system time slowing down
Summary: system time slowing down
Status: CLOSED CODE_FIX
Alias: None
Product: Timers
Classification: Unclassified
Component: Realtime Clock (show other bugs)
Hardware: i386 Linux
: P2 normal
Assignee: john stultz
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2007-02-08 01:07 UTC by hajma
Modified: 2012-09-05 06:44 UTC (History)
7 users (show)

See Also:
Kernel Version: 2.6.24
Subsystem:
Regression: No
Bisected commit-id:


Attachments
2.6.20 dmesg (14.95 KB, text/plain)
2007-02-08 11:39 UTC, hajma
Details
grep -e "kernel: Linux version" -e "MHz processor" /var/log/messages > mhz (13.21 KB, text/plain)
2007-02-09 00:12 UTC, hajma
Details
Reduce calibration time to 5 ticks (486 bytes, patch)
2007-02-09 14:27 UTC, john stultz
Details | Diff
Correct ntp drift (regression from 2.6.21) (3.84 KB, patch)
2008-01-28 09:46 UTC, john stultz
Details | Diff

Description hajma 2007-02-08 01:07:51 UTC
Most recent kernel where this bug did *NOT* occur: 2.6.17.13
Distribution: Mandriva
Hardware Environment: Athlon XP 1800+, 768MB RAM, PATA harddrives

# lspci
00:00.0 Host bridge: VIA Technologies, Inc. VT8377 [KT400/KT600 AGP] Host Bridge
00:01.0 PCI bridge: VIA Technologies, Inc. VT8235 PCI Bridge
00:09.0 Ethernet controller: Broadcom Corporation BCM4401 100Base-T (rev 01)
00:0b.0 Multimedia audio controller: C-Media Electronics Inc CM8738 (rev 10)
00:0d.0 Network controller: Intersil Corporation Prism 2.5 Wavelan chipset (rev 01)
00:0f.0 Ethernet controller: Realtek Semiconductor Co., Ltd.
RTL-8139/8139C/8139C+ (rev 10)
00:10.0 USB Controller: VIA Technologies, Inc. VT82xxxxx UHCI USB 1.1 Controller
(rev 80)
00:10.1 USB Controller: VIA Technologies, Inc. VT82xxxxx UHCI USB 1.1 Controller
(rev 80)
00:10.2 USB Controller: VIA Technologies, Inc. VT82xxxxx UHCI USB 1.1 Controller
(rev 80)
00:10.3 USB Controller: VIA Technologies, Inc. USB 2.0 (rev 82)
00:11.0 ISA bridge: VIA Technologies, Inc. VT8235 ISA Bridge
00:11.1 IDE interface: VIA Technologies, Inc.
VT82C586A/B/VT82C686/A/B/VT823x/A/C PIPC Bus Master IDE (rev 06)
01:00.0 VGA compatible controller: nVidia Corporation NV17 [GeForce4 MX 440]
(rev a3)
#

Software Environment: desktop, KDE, vanilla kernel 2.6.20, not tainted

Problem Description: 
System time keeps slowing down at a rate of aprox. 1second/hour.
Mandriva bug report: http://qa.mandriva.com/show_bug.cgi?id=28279
Appeared when I switched from Mandriva kernel 2.6.17.13 to Mandriva kernel 2.6.19.2.
With mandriva kernel 2.6.20.0.rc6 the slowdown was 10seconds/hour, with mandriva
kernel 2.6.20.0.rc7 it was back to 1sec/hour.
Does not happen when I switch back to mandriva kernel 2.6.17.13

Steps to reproduce:
I made a script to compare time from ntp server and log them:
#!/bin/bash
aaa=`LC_ALL=C rdate -p ntp.ujf.cas.cz | sed s/^.*]\\\t//g`
bbb=`LC_ALL=C date --utc -d "$aaa" +%s`
ccc=`date --utc +%s`
n=$(($bbb - $ccc))
d=`LC_ALL=C date`
echo $d,$n

Time , difference in seconds:
Thu Feb 8 00:11:07 CET 2007,3837
Thu Feb 8 01:11:01 CET 2007,3839
Thu Feb 8 02:11:02 CET 2007,3838
Thu Feb 8 03:11:01 CET 2007,3839
Thu Feb 8 04:11:01 CET 2007,3840
Thu Feb 8 05:11:02 CET 2007,3839
Thu Feb 8 06:11:02 CET 2007,3839
Thu Feb 8 07:11:03 CET 2007,3839
Thu Feb 8 08:11:01 CET 2007,3841
Thu Feb 8 09:11:01 CET 2007,3842
Comment 1 john stultz 2007-02-08 11:03:42 UTC
Hmm.. Right off my guess is that your TSC is being mis-calibrated and its
showing up now due to the switch between tick based timekeeping and continuous
TSC based timekeeping.

Can you attach a full dmesg log from 2.6.20?

Also check that booting w/ clock=jiffies avoids the drift?

Comment 2 hajma 2007-02-08 11:39:16 UTC
Created attachment 10356 [details]
2.6.20 dmesg

will try the clock option later today or tomorrow
Comment 3 john stultz 2007-02-08 11:59:48 UTC
Hmmm. From the dmesg, cpu_khz seems to in the ballpark (1533Mhz for 1800+). 1
second every hour is ~278ppm drift, which is pretty bad, but not insane, so it
might be due to a slight calibration error.

Is the cpu_khz value the same for pre-2.6.18 kernels? Does it vary much boot to
boot?

Also, clock=acpi_pm should work on your hardware as well.
Comment 4 hajma 2007-02-09 00:12:38 UTC
Created attachment 10363 [details]
grep -e "kernel: Linux version" -e "MHz processor" /var/log/messages > mhz

2.6.17 had it always 1533.469 MHz (one time 1533.477 MHz)
2.6.19 and 2.6.20 most of the time 1533.530 MHz, (one time 1533.383 MHz)

2.6.20 is running with clock=jiffies for 2 hours now. So far no sign of
slowdown.
Comment 5 john stultz 2007-02-09 14:27:58 UTC
Created attachment 10372 [details]
Reduce calibration time to 5 ticks

This is sort of a shot in the dark.

The TSC calibration was changed to improve accuracy, but apparently that
doesn't seem to be the case for you (although it could also be that the
accuracy is improved and the crystal in your system is just slightly off -
~278ppm is within the bound that is correctable using NTP). 

I'm going to guess that it might be SMIs affecting the new calibration code. So
this patch drops the calibration time back down to 5 ticks (as used
pre-2.6.18). I'm guessing here that the 30ms calibration now used is long
enough that any BIOS SMIs that might occur could affect the results.

If it does not affect the results, I can try a patch that reverts back to the
old calibration code to see if it helps as well.
Comment 6 hajma 2007-02-09 21:16:30 UTC
before I apply the patch - the clock=jiffies seems to make the clock run 
faster:
time, difference
Fri Feb 9 08:11:01 CET 2007,3858
Fri Feb 9 09:11:02 CET 2007,3857
Fri Feb 9 10:11:01 CET 2007,3858
Fri Feb 9 11:11:01 CET 2007,3858
Fri Feb 9 12:11:01 CET 2007,3857
Fri Feb 9 13:11:01 CET 2007,3857
Fri Feb 9 14:11:01 CET 2007,3857
Fri Feb 9 15:11:01 CET 2007,3857
Fri Feb 9 16:11:01 CET 2007,3857
Fri Feb 9 17:11:01 CET 2007,3857
Fri Feb 9 18:11:01 CET 2007,3857
Fri Feb 9 19:11:02 CET 2007,3857
Fri Feb 9 20:11:01 CET 2007,3857
Fri Feb 9 21:11:01 CET 2007,3857
Fri Feb 9 22:11:01 CET 2007,3857
Fri Feb 9 23:11:01 CET 2007,3857
Sat Feb 10 00:11:01 CET 2007,3856
Sat Feb 10 01:11:01 CET 2007,3857
Sat Feb 10 02:11:01 CET 2007,3856
Sat Feb 10 03:11:01 CET 2007,3856
Sat Feb 10 04:11:02 CET 2007,3856
Sat Feb 10 05:11:01 CET 2007,3856
Sat Feb 10 06:11:01 CET 2007,3856
[root@localhost hajma]#
Comment 7 hajma 2007-02-11 20:59:27 UTC
the patch did not help, in fact it is slightly worse now:

dmesg:Detected 1533.677 MHz processor.

the clock=jiffies vanilla kernel
Sun Feb 11 13:11:01 CET 2007,3853
Sun Feb 11 14:11:04 CET 2007,3852
Sun Feb 11 15:11:02 CET 2007,3853
Sun Feb 11 16:11:01 CET 2007,3853
Sun Feb 11 17:11:04 CET 2007,3853
Sun Feb 11 18:11:02 CET 2007,3852
Sun Feb 11 19:11:03 CET 2007,3851

patched kernel:

Sun Feb 11 20:11:01 CET 2007,3853
Sun Feb 11 21:11:01 CET 2007,3854
Sun Feb 11 22:11:01 CET 2007,3855
Sun Feb 11 23:11:01 CET 2007,3856
Mon Feb 12 00:11:01 CET 2007,3857
Mon Feb 12 01:11:01 CET 2007,3858
Mon Feb 12 02:11:01 CET 2007,3859
Mon Feb 12 03:11:01 CET 2007,3860
Mon Feb 12 04:11:01 CET 2007,3861
Mon Feb 12 05:11:01 CET 2007,3862
Comment 8 hajma 2007-02-12 04:00:53 UTC
Furthermore, the calibration may be only a part of the problem,as the time
difference changes sometimes - but I was not able to trace that to any
particular application - sometimes it is 1 sec/hour, sometimes 3 secs/hour:
time, difference
Mon Feb 12 05:11:01 CET 2007,3862
Mon Feb 12 06:11:01 CET 2007,3862
Mon Feb 12 07:11:01 CET 2007,3863
Mon Feb 12 08:11:01 CET 2007,3864
Mon Feb 12 09:11:02 CET 2007,3865
Mon Feb 12 10:11:02 CET 2007,3868
Mon Feb 12 11:11:01 CET 2007,3871
Mon Feb 12 12:11:01 CET 2007,3874
Comment 9 john stultz 2007-02-12 11:12:28 UTC
regarding comment #6 
There will always be some time drift, but it would be good to minimize it if
possible, so I wouldn't worry about jiffies drifting forward instead of behind,
it seems to be still quite minimal (~20ppm - quite reasonable).

regarding comment #7
Thanks for testing the patch. If I can find some time today I'll send a patch
that reverts it all back to the old calibration code.

regarding comment #8
That's odd. Do let me know if you see any pattern as to when you see the 3sec
per hour and when it is 1sec.
Comment 10 hajma 2007-02-13 00:45:43 UTC
I have a cron job sending 
top "-n1 -b -S" 
to a file every hour, but I see no special pattern there. I might suspect
kaffeine or VirtualBox, but how to measure? 
Comment 11 Natalie Protasevich 2007-07-07 14:21:12 UTC
Any update on this problem?
Thanks.
Comment 12 Ingo Molnar 2007-11-18 07:14:59 UTC
The 2.6.24-rc3 kernel includes fixes to the TSC calibration routines that might explain this problem. Could you please try it and let us know if the time skew still occurs?
Comment 13 hajma 2007-12-01 05:07:48 UTC
I will be able to test it in 3-4 weeks, thanks for understanding
Comment 14 hajma 2008-01-04 14:33:09 UTC
with kernel 2.6.24-rc6 the slowdown rate is still there - on average 1.5 seconds per hour.
As before, there are hours without a difference, there are hours with 3 seconds slowdown, usually 1-2 seconds.
Comment 15 john stultz 2008-01-04 14:41:32 UTC
Have you tried booting w/ clocksource=acpi_pm ? 

Since it uses a constant frequency (in other words, its not calibrated) so any skew in your tick rate that might affect calibration of the TSC shouldn't effect the acpi_pm.
Comment 16 hajma 2008-01-05 04:42:43 UTC
after the first 11 hours with clocksource=acpi_pm the time difference increased by 5 seconds -> .45 s/hour is much better than before but 6 minutes/month is still a lot.
Comment 17 Hal Murray 2008-01-05 15:59:29 UTC
I have a problem that seems related to this discussion.
I'll submit a new bug report if that's preferable.

The TSC calibration routine is not consistent.  I'd be
happy with any of the values it returns as long as it
would get the same result the next time it ran.

These are from a 2.6.23 kernel:

Dec  2 03:26:18 shuksan kernel: Detected 2793.070 MHz processor.
Dec  2 03:51:12 shuksan kernel: Detected 2793.093 MHz processor.
Dec  2 04:37:05 shuksan kernel: Detected 2793.067 MHz processor.
Dec  2 13:47:31 shuksan kernel: Detected 2793.063 MHz processor.
Dec  2 14:38:48 shuksan kernel: Detected 2793.183 MHz processor.
Dec  3 06:15:01 shuksan kernel: Detected 2793.146 MHz processor.
Dec  9 22:58:36 shuksan kernel: Detected 2793.175 MHz processor.
Dec 11 01:32:19 shuksan kernel: Detected 2793.058 MHz processor.
Dec 18 10:30:35 shuksan kernel: Detected 2793.130 MHz processor.
Dec 22 22:03:45 shuksan kernel: Detected 2793.089 MHz processor.

That's a spread of 44 ppm which makes a nasty transient for
ntpd.  You probably won't notice unless you look carefully.
A few ppm might be due to temperature changes.

I've seen similar quirks in 2.6.18, 2.6.19 and 2.6.22.
That's on two Dell workstations and the XO ($100 laptop
www.laptop.org)

At one point, I patched tsc_init to loop calling
calculate_cpu_khz and printing the answer.  It was
the same sort of scatter so I'm pretty sure the problem
isn't temperature or some strange initialization state/mode.
Comment 18 hajma 2008-01-27 15:23:56 UTC
just another comment:
with 2.6.24-rc7 the slowdown was terrible 10 seconds/hour, with rc8 it is back to 1s/h
Comment 19 john stultz 2008-01-28 09:42:46 UTC
I recently sent out a patch to lkml may help address this issue. Let me dig it up and I'll post it here for testing.
Comment 20 john stultz 2008-01-28 09:46:18 UTC
Created attachment 14627 [details]
Correct ntp drift (regression from 2.6.21)

Here's the patch to try. On my system it corrected for a 280ppm drift (about what you're seeing if I did the math right). Thanks for keeping on me about this issue!
Comment 21 hajma 2008-01-28 15:33:27 UTC
thanks, I do not have time for building the kernel ATM. Did it make it to 2.6.24 ?
Comment 22 john stultz 2008-01-28 15:46:38 UTC
No, not yet. I'm working to push it however for 2.6.25 and 2.6.24.x.
Comment 23 Heiko Gerstung 2008-02-15 00:49:12 UTC
Any news on this? I experience the same problems with an embedded x86 CPU, the CPU clock calculation is different in 7 out of 10 reboots and this leads to a massive delay for my ntp before it converges into a reasonable (for me) range. This is not the case with a 2.6.15.1 kernel and it seems to be cured if I set the clockrate to a fixed value:
        cpu_khz = 498052;
        tsc_khz = 498052;
Since ntpd calculates a drift rate and stores this for its next start, this is a big problem for me. Due to the changing clock rate calculation after a reboot, the drift value of ntpd is wrong and it takes a while (hours) before ntpd manages to detect that and recalculates the drift.

(I tested this with 2.6.24)

I would vote for accepting this as a bug (and change the status from NEEDINFO to ASSIGNED) :-)

Best regards,
 Heiko
Comment 24 john stultz 2009-06-10 19:08:31 UTC
For folks concerned about the TSC calibration variance issue in this bug (which does not affect folks not using the TSC for timekeeping). Please see the patch here: http://lkml.org/lkml/2009/5/12/506

It allows you to specify a tsc_khz freq at boot up that will always be used, avoiding the calibration error.

It has not been included into mainline, but if folks find it an acceptable solution, I'll push for it.
Comment 25 hajma 2009-06-11 05:49:54 UTC
thanks! I'll disable ntp, see if I still have the issue and eventually try the patch
Comment 26 john stultz 2010-01-05 01:57:02 UTC
hajma: Is this issue still affecting you on recent kernels?
Comment 27 Yuriy Yevtukhov 2010-09-20 14:30:18 UTC
Fast TSC calibration can cause very large error in detection of CPU speed and thus causing very bad system time speed. I have a server with 2GHz Xeons (two CPUs with 2 cores each) and recently got situation when system time was behind by 0,2%, i.e. every 100 seconds system clock became 2 seconds slower. When there were no fast TSC calibration, CPU speed were always correct with high precision (about 0,01 MHz):

kernel 2.6.24.1: 
TSC calibrated against PM_TIMER
Detected 1999.998 MHz processor.

TSC calibrated against PM_TIMER
time.c: Detected 1999.998 MHz processor.

TSC calibrated against PM_TIMER
time.c: Detected 1999.996 MHz processor.

TSC calibrated against PM_TIMER
time.c: Detected 2000.005 MHz processor.

TSC calibrated against PM_TIMER
time.c: Detected 1999.991 MHz processor.

But when fast TSC calibration appeared it always gave higher mistake and once it became extremely large:

kernel 2.6.32.8: 
Fast TSC calibration using PIT
Detected 2000.121 MHz processor.

Fast TSC calibration using PIT
Detected 1999.791 MHz processor.

Fast TSC calibration using PIT
Detected 2000.178 MHz processor.

Fast TSC calibration failed
TSC: PIT calibration matches HPET. 1 loops
Detected 1999.995 MHz processor.       !!!!NO fast calibr. and no such mistake!

kernel 2.6.32.19: 
Fast TSC calibration using PIT
Detected 2038.003 MHz processor.       !!!!!VERY BAD

Fast TSC calibration using PIT
Detected 2000.243 MHz processor.


I don't want to change clock source from TSC and don't want to specify CPU speed explicitly (because have many different servers with identical configs), but fast TSC calibration should be corrected or must be a way to disable it. I didn't found it looking through sources:

linux-2.6.32.19/arch/x86/kernel/tsc.c: line 401
unsigned long native_calibrate_tsc(void)
{
 u64 tsc1, tsc2, delta, ref1, ref2;
 unsigned long tsc_pit_min = ULONG_MAX, tsc_ref_min = ULONG_MAX;
 unsigned long flags, latch, ms, fast_calibrate;
 int hpet = is_hpet_enabled(), i, loopmin;

 local_irq_save(flags);
 fast_calibrate = quick_pit_calibrate();    //I WANT TO DISABLE IT IN KERNEL CONFIG AND/OR COMMAND LINE
 local_irq_restore(flags);
 if (fast_calibrate)
  return fast_calibrate;

 ....other methods of calibration
}
Comment 28 john stultz 2011-04-18 19:44:02 UTC
I've recently added a timer-based refined TSC calibration to the kernel. This should resolve most TSC calibration issues where the calibrated TSC freq would be different boot-to-boot.

I'm going to mark this as closed, but if anyone is still having problems with TSC calibration, please open a new bug and CC me.

Note You need to log in before you can comment on or make changes to this bug.