Most recent kernel where this bug did *NOT* occur: 2.6.17.13 Distribution: Mandriva Hardware Environment: Athlon XP 1800+, 768MB RAM, PATA harddrives # lspci 00:00.0 Host bridge: VIA Technologies, Inc. VT8377 [KT400/KT600 AGP] Host Bridge 00:01.0 PCI bridge: VIA Technologies, Inc. VT8235 PCI Bridge 00:09.0 Ethernet controller: Broadcom Corporation BCM4401 100Base-T (rev 01) 00:0b.0 Multimedia audio controller: C-Media Electronics Inc CM8738 (rev 10) 00:0d.0 Network controller: Intersil Corporation Prism 2.5 Wavelan chipset (rev 01) 00:0f.0 Ethernet controller: Realtek Semiconductor Co., Ltd. RTL-8139/8139C/8139C+ (rev 10) 00:10.0 USB Controller: VIA Technologies, Inc. VT82xxxxx UHCI USB 1.1 Controller (rev 80) 00:10.1 USB Controller: VIA Technologies, Inc. VT82xxxxx UHCI USB 1.1 Controller (rev 80) 00:10.2 USB Controller: VIA Technologies, Inc. VT82xxxxx UHCI USB 1.1 Controller (rev 80) 00:10.3 USB Controller: VIA Technologies, Inc. USB 2.0 (rev 82) 00:11.0 ISA bridge: VIA Technologies, Inc. VT8235 ISA Bridge 00:11.1 IDE interface: VIA Technologies, Inc. VT82C586A/B/VT82C686/A/B/VT823x/A/C PIPC Bus Master IDE (rev 06) 01:00.0 VGA compatible controller: nVidia Corporation NV17 [GeForce4 MX 440] (rev a3) # Software Environment: desktop, KDE, vanilla kernel 2.6.20, not tainted Problem Description: System time keeps slowing down at a rate of aprox. 1second/hour. Mandriva bug report: http://qa.mandriva.com/show_bug.cgi?id=28279 Appeared when I switched from Mandriva kernel 2.6.17.13 to Mandriva kernel 2.6.19.2. With mandriva kernel 2.6.20.0.rc6 the slowdown was 10seconds/hour, with mandriva kernel 2.6.20.0.rc7 it was back to 1sec/hour. Does not happen when I switch back to mandriva kernel 2.6.17.13 Steps to reproduce: I made a script to compare time from ntp server and log them: #!/bin/bash aaa=`LC_ALL=C rdate -p ntp.ujf.cas.cz | sed s/^.*]\\\t//g` bbb=`LC_ALL=C date --utc -d "$aaa" +%s` ccc=`date --utc +%s` n=$(($bbb - $ccc)) d=`LC_ALL=C date` echo $d,$n Time , difference in seconds: Thu Feb 8 00:11:07 CET 2007,3837 Thu Feb 8 01:11:01 CET 2007,3839 Thu Feb 8 02:11:02 CET 2007,3838 Thu Feb 8 03:11:01 CET 2007,3839 Thu Feb 8 04:11:01 CET 2007,3840 Thu Feb 8 05:11:02 CET 2007,3839 Thu Feb 8 06:11:02 CET 2007,3839 Thu Feb 8 07:11:03 CET 2007,3839 Thu Feb 8 08:11:01 CET 2007,3841 Thu Feb 8 09:11:01 CET 2007,3842
Hmm.. Right off my guess is that your TSC is being mis-calibrated and its showing up now due to the switch between tick based timekeeping and continuous TSC based timekeeping. Can you attach a full dmesg log from 2.6.20? Also check that booting w/ clock=jiffies avoids the drift?
Created attachment 10356 [details] 2.6.20 dmesg will try the clock option later today or tomorrow
Hmmm. From the dmesg, cpu_khz seems to in the ballpark (1533Mhz for 1800+). 1 second every hour is ~278ppm drift, which is pretty bad, but not insane, so it might be due to a slight calibration error. Is the cpu_khz value the same for pre-2.6.18 kernels? Does it vary much boot to boot? Also, clock=acpi_pm should work on your hardware as well.
Created attachment 10363 [details] grep -e "kernel: Linux version" -e "MHz processor" /var/log/messages > mhz 2.6.17 had it always 1533.469 MHz (one time 1533.477 MHz) 2.6.19 and 2.6.20 most of the time 1533.530 MHz, (one time 1533.383 MHz) 2.6.20 is running with clock=jiffies for 2 hours now. So far no sign of slowdown.
Created attachment 10372 [details] Reduce calibration time to 5 ticks This is sort of a shot in the dark. The TSC calibration was changed to improve accuracy, but apparently that doesn't seem to be the case for you (although it could also be that the accuracy is improved and the crystal in your system is just slightly off - ~278ppm is within the bound that is correctable using NTP). I'm going to guess that it might be SMIs affecting the new calibration code. So this patch drops the calibration time back down to 5 ticks (as used pre-2.6.18). I'm guessing here that the 30ms calibration now used is long enough that any BIOS SMIs that might occur could affect the results. If it does not affect the results, I can try a patch that reverts back to the old calibration code to see if it helps as well.
before I apply the patch - the clock=jiffies seems to make the clock run faster: time, difference Fri Feb 9 08:11:01 CET 2007,3858 Fri Feb 9 09:11:02 CET 2007,3857 Fri Feb 9 10:11:01 CET 2007,3858 Fri Feb 9 11:11:01 CET 2007,3858 Fri Feb 9 12:11:01 CET 2007,3857 Fri Feb 9 13:11:01 CET 2007,3857 Fri Feb 9 14:11:01 CET 2007,3857 Fri Feb 9 15:11:01 CET 2007,3857 Fri Feb 9 16:11:01 CET 2007,3857 Fri Feb 9 17:11:01 CET 2007,3857 Fri Feb 9 18:11:01 CET 2007,3857 Fri Feb 9 19:11:02 CET 2007,3857 Fri Feb 9 20:11:01 CET 2007,3857 Fri Feb 9 21:11:01 CET 2007,3857 Fri Feb 9 22:11:01 CET 2007,3857 Fri Feb 9 23:11:01 CET 2007,3857 Sat Feb 10 00:11:01 CET 2007,3856 Sat Feb 10 01:11:01 CET 2007,3857 Sat Feb 10 02:11:01 CET 2007,3856 Sat Feb 10 03:11:01 CET 2007,3856 Sat Feb 10 04:11:02 CET 2007,3856 Sat Feb 10 05:11:01 CET 2007,3856 Sat Feb 10 06:11:01 CET 2007,3856 [root@localhost hajma]#
the patch did not help, in fact it is slightly worse now: dmesg:Detected 1533.677 MHz processor. the clock=jiffies vanilla kernel Sun Feb 11 13:11:01 CET 2007,3853 Sun Feb 11 14:11:04 CET 2007,3852 Sun Feb 11 15:11:02 CET 2007,3853 Sun Feb 11 16:11:01 CET 2007,3853 Sun Feb 11 17:11:04 CET 2007,3853 Sun Feb 11 18:11:02 CET 2007,3852 Sun Feb 11 19:11:03 CET 2007,3851 patched kernel: Sun Feb 11 20:11:01 CET 2007,3853 Sun Feb 11 21:11:01 CET 2007,3854 Sun Feb 11 22:11:01 CET 2007,3855 Sun Feb 11 23:11:01 CET 2007,3856 Mon Feb 12 00:11:01 CET 2007,3857 Mon Feb 12 01:11:01 CET 2007,3858 Mon Feb 12 02:11:01 CET 2007,3859 Mon Feb 12 03:11:01 CET 2007,3860 Mon Feb 12 04:11:01 CET 2007,3861 Mon Feb 12 05:11:01 CET 2007,3862
Furthermore, the calibration may be only a part of the problem,as the time difference changes sometimes - but I was not able to trace that to any particular application - sometimes it is 1 sec/hour, sometimes 3 secs/hour: time, difference Mon Feb 12 05:11:01 CET 2007,3862 Mon Feb 12 06:11:01 CET 2007,3862 Mon Feb 12 07:11:01 CET 2007,3863 Mon Feb 12 08:11:01 CET 2007,3864 Mon Feb 12 09:11:02 CET 2007,3865 Mon Feb 12 10:11:02 CET 2007,3868 Mon Feb 12 11:11:01 CET 2007,3871 Mon Feb 12 12:11:01 CET 2007,3874
regarding comment #6 There will always be some time drift, but it would be good to minimize it if possible, so I wouldn't worry about jiffies drifting forward instead of behind, it seems to be still quite minimal (~20ppm - quite reasonable). regarding comment #7 Thanks for testing the patch. If I can find some time today I'll send a patch that reverts it all back to the old calibration code. regarding comment #8 That's odd. Do let me know if you see any pattern as to when you see the 3sec per hour and when it is 1sec.
I have a cron job sending top "-n1 -b -S" to a file every hour, but I see no special pattern there. I might suspect kaffeine or VirtualBox, but how to measure?
Any update on this problem? Thanks.
The 2.6.24-rc3 kernel includes fixes to the TSC calibration routines that might explain this problem. Could you please try it and let us know if the time skew still occurs?
I will be able to test it in 3-4 weeks, thanks for understanding
with kernel 2.6.24-rc6 the slowdown rate is still there - on average 1.5 seconds per hour. As before, there are hours without a difference, there are hours with 3 seconds slowdown, usually 1-2 seconds.
Have you tried booting w/ clocksource=acpi_pm ? Since it uses a constant frequency (in other words, its not calibrated) so any skew in your tick rate that might affect calibration of the TSC shouldn't effect the acpi_pm.
after the first 11 hours with clocksource=acpi_pm the time difference increased by 5 seconds -> .45 s/hour is much better than before but 6 minutes/month is still a lot.
I have a problem that seems related to this discussion. I'll submit a new bug report if that's preferable. The TSC calibration routine is not consistent. I'd be happy with any of the values it returns as long as it would get the same result the next time it ran. These are from a 2.6.23 kernel: Dec 2 03:26:18 shuksan kernel: Detected 2793.070 MHz processor. Dec 2 03:51:12 shuksan kernel: Detected 2793.093 MHz processor. Dec 2 04:37:05 shuksan kernel: Detected 2793.067 MHz processor. Dec 2 13:47:31 shuksan kernel: Detected 2793.063 MHz processor. Dec 2 14:38:48 shuksan kernel: Detected 2793.183 MHz processor. Dec 3 06:15:01 shuksan kernel: Detected 2793.146 MHz processor. Dec 9 22:58:36 shuksan kernel: Detected 2793.175 MHz processor. Dec 11 01:32:19 shuksan kernel: Detected 2793.058 MHz processor. Dec 18 10:30:35 shuksan kernel: Detected 2793.130 MHz processor. Dec 22 22:03:45 shuksan kernel: Detected 2793.089 MHz processor. That's a spread of 44 ppm which makes a nasty transient for ntpd. You probably won't notice unless you look carefully. A few ppm might be due to temperature changes. I've seen similar quirks in 2.6.18, 2.6.19 and 2.6.22. That's on two Dell workstations and the XO ($100 laptop www.laptop.org) At one point, I patched tsc_init to loop calling calculate_cpu_khz and printing the answer. It was the same sort of scatter so I'm pretty sure the problem isn't temperature or some strange initialization state/mode.
just another comment: with 2.6.24-rc7 the slowdown was terrible 10 seconds/hour, with rc8 it is back to 1s/h
I recently sent out a patch to lkml may help address this issue. Let me dig it up and I'll post it here for testing.
Created attachment 14627 [details] Correct ntp drift (regression from 2.6.21) Here's the patch to try. On my system it corrected for a 280ppm drift (about what you're seeing if I did the math right). Thanks for keeping on me about this issue!
thanks, I do not have time for building the kernel ATM. Did it make it to 2.6.24 ?
No, not yet. I'm working to push it however for 2.6.25 and 2.6.24.x.
Any news on this? I experience the same problems with an embedded x86 CPU, the CPU clock calculation is different in 7 out of 10 reboots and this leads to a massive delay for my ntp before it converges into a reasonable (for me) range. This is not the case with a 2.6.15.1 kernel and it seems to be cured if I set the clockrate to a fixed value: cpu_khz = 498052; tsc_khz = 498052; Since ntpd calculates a drift rate and stores this for its next start, this is a big problem for me. Due to the changing clock rate calculation after a reboot, the drift value of ntpd is wrong and it takes a while (hours) before ntpd manages to detect that and recalculates the drift. (I tested this with 2.6.24) I would vote for accepting this as a bug (and change the status from NEEDINFO to ASSIGNED) :-) Best regards, Heiko
For folks concerned about the TSC calibration variance issue in this bug (which does not affect folks not using the TSC for timekeeping). Please see the patch here: http://lkml.org/lkml/2009/5/12/506 It allows you to specify a tsc_khz freq at boot up that will always be used, avoiding the calibration error. It has not been included into mainline, but if folks find it an acceptable solution, I'll push for it.
thanks! I'll disable ntp, see if I still have the issue and eventually try the patch
hajma: Is this issue still affecting you on recent kernels?
Fast TSC calibration can cause very large error in detection of CPU speed and thus causing very bad system time speed. I have a server with 2GHz Xeons (two CPUs with 2 cores each) and recently got situation when system time was behind by 0,2%, i.e. every 100 seconds system clock became 2 seconds slower. When there were no fast TSC calibration, CPU speed were always correct with high precision (about 0,01 MHz): kernel 2.6.24.1: TSC calibrated against PM_TIMER Detected 1999.998 MHz processor. TSC calibrated against PM_TIMER time.c: Detected 1999.998 MHz processor. TSC calibrated against PM_TIMER time.c: Detected 1999.996 MHz processor. TSC calibrated against PM_TIMER time.c: Detected 2000.005 MHz processor. TSC calibrated against PM_TIMER time.c: Detected 1999.991 MHz processor. But when fast TSC calibration appeared it always gave higher mistake and once it became extremely large: kernel 2.6.32.8: Fast TSC calibration using PIT Detected 2000.121 MHz processor. Fast TSC calibration using PIT Detected 1999.791 MHz processor. Fast TSC calibration using PIT Detected 2000.178 MHz processor. Fast TSC calibration failed TSC: PIT calibration matches HPET. 1 loops Detected 1999.995 MHz processor. !!!!NO fast calibr. and no such mistake! kernel 2.6.32.19: Fast TSC calibration using PIT Detected 2038.003 MHz processor. !!!!!VERY BAD Fast TSC calibration using PIT Detected 2000.243 MHz processor. I don't want to change clock source from TSC and don't want to specify CPU speed explicitly (because have many different servers with identical configs), but fast TSC calibration should be corrected or must be a way to disable it. I didn't found it looking through sources: linux-2.6.32.19/arch/x86/kernel/tsc.c: line 401 unsigned long native_calibrate_tsc(void) { u64 tsc1, tsc2, delta, ref1, ref2; unsigned long tsc_pit_min = ULONG_MAX, tsc_ref_min = ULONG_MAX; unsigned long flags, latch, ms, fast_calibrate; int hpet = is_hpet_enabled(), i, loopmin; local_irq_save(flags); fast_calibrate = quick_pit_calibrate(); //I WANT TO DISABLE IT IN KERNEL CONFIG AND/OR COMMAND LINE local_irq_restore(flags); if (fast_calibrate) return fast_calibrate; ....other methods of calibration }
I've recently added a timer-based refined TSC calibration to the kernel. This should resolve most TSC calibration issues where the calibrated TSC freq would be different boot-to-boot. I'm going to mark this as closed, but if anyone is still having problems with TSC calibration, please open a new bug and CC me.