Distribution: Slackware 10.0
Hardware Environment: Athlon XP 2500+ (Barton, 1.83 GHz), MSI K7N2 nForce2 mobo
Software Environment: N/A
Problem Description: The system clock gains almost a second every two minutes, resulting in the clock running about 11 minutes fast at the end of a 24 hour period (time syncs are done every 24 hrs).
Steps to reproduce: Build the kernel with IO-APIC enabled.
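As a quick sanity check on the numbers in the description: a gain of one full second every two minutes scales to 720 seconds (12 minutes) per day, so the reported 11 minutes corresponds to a little over 0.9 s per two minutes. A minimal sketch of that scaling in shell; drift_per_day is just a helper name made up for this example:

```shell
# drift_per_day GAIN_S INTERVAL_S
# Scale a gain of GAIN_S seconds observed over INTERVAL_S seconds to 24 hours.
# Integer arithmetic only, so fractional gains have to be scaled by the caller.
drift_per_day() {
    echo $(( 86400 * $1 / $2 ))
}

# Upper bound implied by "a second every two minutes".
drift_per_day 1 120    # prints 720 (seconds per day, i.e. 12 minutes)
```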
Created attachment 5569 [details] kernel config
Created attachment 5570 [details] lspci -vvv
Created attachment 5571 [details] /proc/cpuinfo
Created attachment 5572 [details] dmesg
This looks similar but not exactly like the nVidia issue in bug #3341.
For what it's worth, I'm also seeing the symptoms described in this bug on a Fedora Core 4 x86-64 system. My clock drifts at about the same rate the original reporter described. There's an interesting correlation between the clock drift and the amount of data transferred over the network: for every ~11MB transferred, the clock advances one second. I'm currently running ntpd -q every five minutes to correct the drift, and usually the correction is only a few seconds, less during the night when the traffic is lower. When I tried downloading an ISO image, the correction jumped to ~20 seconds per 5 minutes. This is an ABIT AV8 motherboard (VIA K8T800 chipset) with AMD Athlon 64 4600+ dual-core CPU, 250GB SATA Maxtor hard disk, 4GB of RAM and ATI Radeon 7000 as the display adapter. The on-board network adapter is apparently "VIA Technologies, Inc. VT6120/VT6121/VT6122 Gigabit Ethernet Adapter (rev 11)", using the via_velocity driver, kernel 2.6.13-1.1526_FC4smp. I'm not running cpufreq, and PowerNow has been disabled. Various listings from the system can be found at http://jaguaari.miuku.net/clock/ if they're of assistance. Unfortunately this is a production system, located some 150km away from me, so my debugging possibilities are rather limited.
Just to verify my understanding of this issue, does booting w/ noapic cause the problem to go away?
As I mentioned in my comment above, my server is running in a colocation facility some 150km away from me, and as such I'd rather not try something that could leave the server unable to come up again after rebooting. From what I've seen, "noapic" may cause other problems and I don't really want to try my luck on a production system. Sorry, I hope you understand my concerns. Perhaps the original reporter can shed some more light on this issue? The problem is still present; at the current rate the clock seems to run about 1 second per minute too fast.
Cal: I suspect you're running into the same problem as seen in bug #5545. Have you tried updating your BIOS recently? That resolved the problem for that bug's submitter. Anssi: Since you have a different chipset, I suspect your issue is not quite the same as the original submitter's. Would you mind reproducing the issue with a vanilla 2.6.14 kernel and, if it still exists, filing a new bug? Since you're dealing with a production environment, I understand that you might not be able to test vanilla kernels. You might try booting with idle=poll to see if that helps. If it does not help, I'd suggest filing a bug with the distribution you are using to see if they can assist.
John, updating the BIOS was one of the first things my mind went to, but unfortunately it didn't fix it :( I'll have the box back online shortly and will retest with the latest kernel and try the noapic param as well. The acpi_skip_timer_override mentioned in bug 5545 looks interesting. It says it's being ignored in the boot log... perhaps it shouldn't be? Thanks!
I'm now running 2.6.14-1.1637_FC4smp and the problem appears to be gone. Sorry, I never had the time to test with plain vanilla kernels, but as FC4's current kernel is based on 2.6.13.2, people still suffering from this problem might want to check whether a newer kernel fixes it for them. There's some discussion about my clock problems at https://bugzilla.redhat.com/bugzilla/show_bug.cgi?id=171219
Still no luck with the latest kernel - 2.6.15-rc2. Booting with 'noapic' does make it disappear though.
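For anyone comparing runs with and without the boot parameters mentioned in this bug (noapic, idle=poll, acpi_skip_timer_override), it may help to confirm what the running kernel was actually booted with. A small sketch that checks /proc/cmdline; has_boot_param is a hypothetical helper name, not an existing tool:

```shell
# has_boot_param PARAM [CMDLINE]
# Succeed if PARAM appears as a whole word (bare, or as PARAM=value) in CMDLINE.
# CMDLINE defaults to the running kernel's /proc/cmdline.
has_boot_param() {
    local param cmdline word
    param=$1
    cmdline=${2-$(cat /proc/cmdline)}
    for word in $cmdline; do
        case $word in
            "$param"|"$param"=*) return 0 ;;
        esac
    done
    return 1
}
```

Usage would look like: has_boot_param noapic && echo "booted with noapic". Matching whole words keeps "noapic" from being confused with something like "apic=debug".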
bug confirmed on SuSE 10.0 64bit, kernel 2.6.15-rc5.
Truth: Could you attach your dmesg please?
Bug confirmed: PIIX4 board with Celeron 2.4GHz processor (Supermicro m/b) with 2.6.15.4 kernel (2.6.10 works). Not only is there time loss, openntpd dies:
    Feb 16 12:50:15 mickey ntpd[906]: ntp engine ready
    Feb 16 12:50:38 mickey ntpd[906]: peer 128.250.36.2 now valid
    Feb 16 12:50:40 mickey ntpd[906]: peer 130.102.2.123 now valid
    Feb 16 12:50:42 mickey ntpd[906]: peer 128.250.36.3 now valid
    Feb 16 13:39:38 mickey ntpd[906]: peer 130.102.2.123 now invalid
    Feb 16 13:41:25 mickey ntpd[905]: adjusting local clock by -0.178269s
    Feb 16 13:50:36 mickey ntpd[906]: fatal: client_query socket: Address family not supported by protocol
    Feb 16 13:50:36 mickey ntpd[905]: dispatch_imsg in main: pipe closed
    Feb 16 13:50:36 mickey ntpd[905]: Lost child: child exited
    Feb 16 13:50:36 mickey ntpd[905]: Terminating
Big problems in timekeeping code in 2.6.12+, IMO
Same problem here, with a twist: the clock runs normally, but every (re)boot makes it gain 11-12 seconds.
    Distro: SuSE 10.0
    CPU: AMD Athlon 64 X2 4200+
    MoBo: ASUS A8N5X R1.00 (no BIOS update yet)
    Kernel: 2.6.16-rc5 from kernel.org (vanilla, not SuSE, IO_APIC is enabled)
The changelog of 2.6.16-rc5 contains some hints that Andi Kleen got the time problems for AMD dual core processors solved, so I tried this kernel without any special boot parameters. The clock seems to run quite normally now on a rather unloaded system. I'm not yet sure if this will hold, but there was no "time gain" as observed before, after the last reboot. More on this tomorrow. But: every reboot "gains" 11-12 seconds! Should I provide dmesg output, .config, boot params and ...? Should I start a new bug? Cheers, Dirk
see http://h18004.www1.hp.com/products/servers/linux/powernow-notes.html#sles9sp2x86 for additional info and work-arounds.
Well, I'm not sure if this is completely fixed but... with kernel 2.6.18 the clock only gains about 15 secs over 24 hours. This is acceptable as I do time syncs every four hours now on this box. I'll prolly do a bisect to see what fixed it if for no other reason than I'll know who to thank. Unless anyone has anything else to add I guess I'll mark this resolved and close it out.
Could you check /proc/interrupts to see if you are getting the expected number of interrupts per second (depending on your HZ config setting)? It may be that the problem still exists but that the symptom (bad timekeeping) is resolved.
Here's some interesting results...
    v2.6.17 w/ ioapic: 1000 per second
    v2.6.17.13 w/o ioapic: 1000 per second
    v2.6.18 w/ ioapic: ~1008 per second
    v2.6.18.3 w/o ioapic: 1000 per second
    v2.6.18.3 w/ ioapic: ~1008 per second
all at 1000HZ. Based on these numbers I'm thinking broken timekeeping code got fixed but now IO-APIC is causing the timer interrupt to fire a few too many times per second...? Here's the bash code I used to get my numbers (realizing of course that if timing is broken then sleep(1) may not be reliable):
    for i in 5 10 15 30 45 60; do
        TIC=$(cat /proc/interrupts | awk '/timer/ { print $2 }')
        sleep $i
        TOC=$(cat /proc/interrupts | awk '/timer/ { print $2 }')
        echo $(( ($TOC - $TIC) / $i ))
    done
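A sketch of a slightly more defensive version of the loop above, for anyone repeating the measurement. The awk pattern /timer/ also matches a "LOC: ... Local timer interrupts" line on kernels that print one, and $2 is only the first CPU column, so this version selects the IRQ 0 row explicitly and sums every CPU column. timer_count and timer_rate are names made up for this example, and like the original it still trusts sleep to measure the interval.

```shell
# timer_count: read /proc/interrupts-style text on stdin and print the total
# count for IRQ 0, summed over all per-CPU columns.
timer_count() {
    awk '$1 == "0:" {
        n = 0
        # Count columns are the leading all-digit fields after the IRQ label.
        for (i = 2; i <= NF && $i ~ /^[0-9]+$/; i++) n += $i
        printf "%.0f\n", n
        exit
    }'
}

# timer_rate SECONDS: average IRQ 0 interrupts per second over the interval.
timer_rate() {
    local secs tic toc
    secs=$1
    tic=$(timer_count < /proc/interrupts)
    sleep "$secs"
    toc=$(timer_count < /proc/interrupts)
    echo $(( (toc - tic) / secs ))
}
```

For example, timer_rate 60 should print a value close to HZ. As a rough cross-check, and under the assumption (not confirmed anywhere in this bug) that each tick is credited as exactly 1/HZ seconds, 8 extra ticks per 1000 would be 0.8%, or about 691 seconds per day, which is in the range of the roughly 11 minutes per day first reported.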
Any update on this problem please. Thanks.
2.6.22.7 seems to have a time loss problem for me. Attached is a log of time readjustments. I have the sneaking suspicion it is related to using USB heavily, and the older /dev/ub* driver.
Created attachment 13208 [details] log of time adjustment over a few days
Created attachment 13209 [details] system info (combined dump of system info: /proc/cpuinfo, lspci -vvv, /usr/src/linux-2.6.22.7/.config, /proc/interrupts)
Created attachment 13210 [details] system info (combined system info, try #2: /proc/cpuinfo, lspci -vvv, /usr/src/linux-2.6.22.7/.config, /proc/interrupts)
Andrew, can you also provide dmesg please? I suspect your problem might be different, because your config shows both APIC and IO-APIC configured, yet the system runs in PIC mode. Also, you seem to have the NMI watchdog enabled (the number of NMIs on each processor suggests that); it is disabled by default on 2.6.22+. Maybe a separate bugzilla entry should be opened for your case.
I have to run with IO-APIC off due to a buggy IO-APIC: with it enabled I end up with the annoying vector error of 40:40, which is a known hardware flaw, and it fills up my logs :-( My current fix (as shown in the attached logs) is a cron job to re-sync with the official atomic clock. As you can see, there aren't really 2 CPUs, since it's hyper-threaded... I think hyper-threading possibly has issues with the broken IO-APIC, because there is really only one (or is there? the bootup dmesg shows two when I boot with IO-APIC...). The problem is that if there really isn't a second IO-APIC, the error code (I looked it up) is telling us that the vector is incorrect... What I should do is re-enable it after patching to have it show which "cpu" is bitching about it, and perhaps this "flaw" can actually be mitigated, or even fixed. So as you see it's kind of a two-way brokenness. :-( One problem might be affecting the other, and the root cause could be simply PIC related, or USB related (IRQ storm???). I don't know which it is, as I haven't done any serious diagnostics to locate the root cause, but if you have any pointers, or patches that can dump extra debug information, I'll be happy to give what limited time I have to help resolve the issue.
I am experiencing this issue on an older machine. It has been going on for at least a year (forgot which kernel version) and is still present in 2.6.24. The clock gains about 2.5 minutes every hour. The machine has a VIA K8T800 chipset. I also have the VT6120/VT6121/VT6122 Gigabit Ethernet Adapter. It is a 32 bit Athlon XP system. Doing:
    cat /proc/interrupts |grep timer; sleep 10s; cat /proc/interrupts |grep timer
shows the following:
      0:  137759783   IO-APIC-edge  timer
    LOC:  144077022   Local timer interrupts
      0:  137769322   IO-APIC-edge  timer
    LOC:  144087032   Local timer interrupts
I will try noapic at the next maintenance opportunity.
nuitari-kernel: I've got a similar chipset on one of my boxes (K8M800), and I've not seen any obvious issues, but I'll check again soon. Meanwhile, this bug has been a fairly long running bug, and contains a number of different issues, a few of which have been fixed. Could those on the CC list who still are having problems with recent (2.6.24+) kernels, comment and let me know? Thanks.
Hi John, The timekeeping looks better on 2.6.24, which is weird because the same kernel used to show the problem. However, I consistently see 9 extra firings of the timer interrupt using the following command:
    cat /proc/interrupts |grep timer; sleep 10s; cat /proc/interrupts |grep timer
      0:  731268897   IO-APIC-edge  timer
    LOC:  731280976   Local timer interrupts
      0:  731278906   IO-APIC-edge  timer
    LOC:  731290984   Local timer interrupts
I'll try updating to 2.6.26 when vmware-modules catches up to it.
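The arithmetic behind the "9 extra" figure, as a small sketch using the counter values quoted above. HZ=1000 is an assumption here, since the HZ setting for this machine is not stated anywhere in this bug.

```shell
# Counter values copied from the two /proc/interrupts samples above.
irq0_before=731268897; irq0_after=731278906
loc_before=731280976;  loc_after=731290984
interval=10    # seconds slept between the two samples
hz=1000        # assumption: HZ for this machine is not given in the report

expected=$(( hz * interval ))
irq0_delta=$(( irq0_after - irq0_before ))
loc_delta=$(( loc_after - loc_before ))

# Prints: IRQ 0: 10009 ticks, 9 over expected
echo "IRQ 0: $irq0_delta ticks, $(( irq0_delta - expected )) over expected"
# Prints: LOC: 10008 ticks, 8 over expected
echo "LOC: $loc_delta ticks, $(( loc_delta - expected )) over expected"
```

Note that the interval here also includes the time taken to run the second cat and grep, so a handful of extra ticks over 10 seconds does not by itself show that the clock is running fast.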
nuitari-kernel: Good to hear things are better. A few extra timer interrupts using the above command aren't unexpected. The wakeup may be a few ticks late, and it takes some time for the /proc/interrupts code to run. So I'd not fret much about that.
John, it looks fine now, I'm using the gentoo 2.6.26 kernel now on that machine. There is still some seconds of drift but nothing major. Before I used to have 10+ minute issues every hour. I've moved ntp-client from cron.hourly to cron.daily and that will be fine :)
nuitari-kernel: Thanks so much for testing! If you continue to see less severe ntp drift issues, do feel free to open a new bug describing the extent of the drift and how it is observed. I believe this issue can be closed. If other reporters are still seeing this problem (fast running clock which goes away with "noapic") please reopen this bug. If you are seeing a different or slightly different issue, please open a new bug.
My "fix" was to use the PIC instead of the TSC, which had a lot of drift anyway because of thermal throttling.
One other note as far as using PIC vs. IOAPIC: I have to disable it due to a buggy chipset :-)