Bug 5038 - Fast running system clock with IO-APIC enabled
Summary: Fast running system clock with IO-APIC enabled
Status: CLOSED CODE_FIX
Alias: None
Product: Platform Specific/Hardware
Classification: Unclassified
Component: i386
Hardware: i386 Linux
Importance: P2 normal
Assignee: john stultz
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2005-08-09 18:53 UTC by Cal Peake
Modified: 2008-08-11 09:02 UTC
CC List: 6 users

See Also:
Kernel Version: 2.6.13-rc6 - 2.6.17
Tree: Mainline
Regression: ---


Attachments
kernel config (34.70 KB, text/plain)
2005-08-09 18:54 UTC, Cal Peake
lspci -vvv (12.98 KB, text/plain)
2005-08-09 18:55 UTC, Cal Peake
/proc/cpuinfo (413 bytes, text/plain)
2005-08-09 19:39 UTC, Cal Peake
dmesg (12.86 KB, text/plain)
2005-08-09 19:40 UTC, Cal Peake
log of time adjustment over a few days (6.22 KB, text/plain)
2007-10-18 22:36 UTC, Andrew J. Kroll
system info (860 bytes, text/plain)
2007-10-18 22:40 UTC, Andrew J. Kroll
system info (68.82 KB, text/plain)
2007-10-18 22:44 UTC, Andrew J. Kroll

Description Cal Peake 2005-08-09 18:53:49 UTC
Distribution: Slackware 10.0
Hardware Environment: Athlon XP 2500+ (Barton, 1.83 GHz), MSI K7N2 nForce2 mobo
Software Environment: N/A
Problem Description:

The system clock gains almost a second every two minutes resulting in the clock
running about 11 minutes fast at the end of a 24 hour period (time syncs are
done every 24 hrs).
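
To put the reported figures on a common scale, here is a minimal sketch that converts the observed drift into a rate error. The function name drift_ppm is hypothetical, and the only inputs are the numbers quoted above (about 11 minutes gained per 24 hours).

```python
def drift_ppm(gained_seconds: float, elapsed_seconds: float) -> float:
    """Clock rate error in parts per million; positive means the clock runs fast."""
    return gained_seconds / elapsed_seconds * 1_000_000


day = 24 * 60 * 60  # seconds in 24 hours

# "about 11 minutes fast at the end of a 24 hour period"
print(f"{drift_ppm(11 * 60, day):.0f} ppm")

# The same rate expressed as seconds gained per two minutes
print(f"{11 * 60 / day * 120:.2f} s per 2 min")
```

This works out to roughly 7639 ppm, or about 0.92 s per two minutes, so the two figures in the description agree with each other.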

Steps to reproduce:

Build the kernel with IO-APIC enabled.
Comment 1 Cal Peake 2005-08-09 18:54:43 UTC
Created attachment 5569 [details]
kernel config
Comment 2 Cal Peake 2005-08-09 18:55:12 UTC
Created attachment 5570 [details]
lspci -vvv
Comment 3 Cal Peake 2005-08-09 19:39:36 UTC
Created attachment 5571 [details]
/proc/cpuinfo
Comment 4 Cal Peake 2005-08-09 19:40:00 UTC
Created attachment 5572 [details]
dmesg
Comment 5 john stultz 2005-09-02 11:50:59 UTC
This looks similar but not exactly like the nVidia issue in bug #3341.
Comment 6 Anssi Johansson 2005-10-19 14:46:03 UTC
For what it's worth, I'm also seeing symptoms described in this bug on a Fedora
Core 4 x86-64 system. My clock drifts at about the same rate as the original
reporter for this bug wrote.

There's an interesting correlation between the clock drift and the amount of
data transferred via network. For every ~11MB transferred, the clock advances
one second. I'm currently running ntpd -q every five minutes to correct the
drift, and usually the correction is only a few seconds, less during the night
when the traffic is lower. I tried downloading an ISO image and the correction
amount jumped to ~20 seconds / 5 minutes.

This is an ABIT AV8 motherboard (VIA K8T800 chipset) with AMD Athlon 64 4600+
dual-core CPU, 250GB SATA Maxtor hard disk, 4GB of RAM and ATI Radeon 7000 as
the display adapter. The on-board network adapter is apparently "VIA
Technologies, Inc. VT6120/VT6121/VT6122 Gigabit Ethernet Adapter (rev 11)",
using the via_velocity driver, kernel 2.6.13-1.1526_FC4smp. I'm not running
cpufreq, Powernow has been disabled.

Various listings from the system can be found at
http://jaguaari.miuku.net/clock/ if they're of assistance. Unfortunately this is
a production system, located some 150km away from me, so my debugging
possibilities are rather limited in this regard.
Comment 7 john stultz 2005-11-03 12:17:03 UTC
Just to verify my understanding of this issue, does booting w/ noapic cause the
problem to go away?
Comment 8 Anssi Johansson 2005-11-09 14:26:13 UTC
As I mentioned in my comment above, my server is running in a colocation
facility some 150km away from me and as such I'd rather not try something that
would potentially cause the server not to come up again after rebooting. From
what I've seen, "noapic" may cause other problems and I don't really want to try
my luck on a production system. Sorry, I hope you understand my concerns.
Perhaps the original reporter can shed some more light on this issue?

The problem is still present, at the current rate the clock seems to move about
1 second per minute too fast.
Comment 9 john stultz 2005-11-09 14:46:14 UTC
Cal: I suspect you're running into the same problem as seen in bug #5545. Have
you tried updating your BIOS recently, as that resolved the problem for that
bug's submitter?

Anssi: Since you have a different chipset, I suspect your issue is not quite the
same as the original submitter. Would you mind reproducing the issue with a
vanilla 2.6.14 kernel and, if it still exists, filing a new bug? Since you're
dealing with a production environment, I understand that you might not be able
to test vanilla kernels. You might try booting with idle=poll, to see if that
helps. If it does not help, I'd suggest filing a bug with the distribution you
are using to see if they can assist.
Comment 10 Cal Peake 2005-11-09 23:27:24 UTC
John, updating the BIOS was one of the first things my mind went to, but
unfortunately it didn't fix it :( I'll have the box back online shortly and will
retest with the latest kernel and try the noapic param as well. 

The acpi_skip_timer_override mentioned in bug 5545 looks interesting. It says
it's being ignored in the boot log... perhaps it shouldn't be?

Thanks!
Comment 11 Anssi Johansson 2005-11-20 08:49:47 UTC
I'm now running 2.6.14-1.1637_FC4smp and the problem appears to be gone. Sorry,
I never had the time to test with plain vanilla kernels, but as FC4's current
kernel is based on 2.6.13.2, people still suffering from this problem might want
to try if a newer kernel fixes the problem for them.

There's some discussion about my clock problems at
https://bugzilla.redhat.com/bugzilla/show_bug.cgi?id=171219
Comment 12 Cal Peake 2005-11-20 19:18:40 UTC
Still no luck with the latest kernel - 2.6.15-rc2. Booting with 'noapic' does
make it disappear though.
Comment 13 Truth 2005-12-05 07:04:51 UTC
bug confirmed on SuSE 10.0 64bit, kernel 2.6.15-rc5.
Comment 14 john stultz 2005-12-21 12:44:17 UTC
Truth: Could you attach your dmesg please? 
Comment 15 Mike Goatly 2006-02-17 03:02:02 UTC
Bug confirmed, PIIX4 board with Celeron 2.4 GHz processor (Supermicro m/b) with
2.6.15.4 kernel (2.6.10 works)

Not only is there time loss, openntpd dies:

Feb 16 12:50:15 mickey ntpd[906]: ntp engine ready
Feb 16 12:50:38 mickey ntpd[906]: peer 128.250.36.2 now valid
Feb 16 12:50:40 mickey ntpd[906]: peer 130.102.2.123 now valid
Feb 16 12:50:42 mickey ntpd[906]: peer 128.250.36.3 now valid
Feb 16 13:39:38 mickey ntpd[906]: peer 130.102.2.123 now invalid
Feb 16 13:41:25 mickey ntpd[905]: adjusting local clock by -0.178269s
Feb 16 13:50:36 mickey ntpd[906]: fatal: client_query socket: Address family not supported by protocol
Feb 16 13:50:36 mickey ntpd[905]: dispatch_imsg in main: pipe closed
Feb 16 13:50:36 mickey ntpd[905]: Lost child: child exited
Feb 16 13:50:36 mickey ntpd[905]: Terminating

Big problems in timekeeping code in 2.6.12+, IMO
Comment 16 DirkS 2006-02-27 09:22:40 UTC
Same problem here, with a twist: Clock runs normal, but every (re)boot makes it 
gain 11-12 seconds.

Distro: SuSE 10.0
CPU: AMD Athlon 64 X2 4200+
MoBo: ASUS A8N5X R1.00 (no BIOS update yet)
Kernel: 2.6.16-rc5 from kernel.org (vanilla, not SuSE, IO_APIC is enabled)

The changelog of 2.6.16-rc5 contains some hints that Andi Kleen got the time 
problems for AMD dual core processors solved, so I tried this kernel without any 
special boot parameters. The clock seems to run quite normal now on a rather 
unloaded system. I'm not yet sure if this will hold but there was no "time gain" 
as observed before, after the last reboot. More on this tomorrow.

But: Every reboot "gains" 11-12 seconds!
Should I provide dmesg output, .config, boot params and ...?
Should I start a new bug?
Cheers
  Dirk
Comment 17 Jim Short 2006-08-31 07:53:04 UTC
see http://h18004.www1.hp.com/products/servers/linux/powernow-notes.html#sles9sp2x86
for additional info and work-arounds.
Comment 18 Cal Peake 2006-11-25 20:20:56 UTC
Well, I'm not sure if this is completely fixed but... with kernel 2.6.18 the
clock only gains about 15 secs over 24 hours. This is acceptable as I do time
syncs every four hours now on this box. I'll prolly do a bisect to see what
fixed it if for no other reason than I'll know who to thank. Unless anyone has
anything else to add I guess I'll mark this resolved and close it out.
Comment 19 john stultz 2006-11-26 13:45:36 UTC
Could you check /proc/interrupts to see if you are getting the expected number
of interrupts per second (depending on your HZ config setting)? It may be that
the problem still exists but that the symptom (bad timekeeping) is resolved.
Comment 20 Cal Peake 2006-11-26 23:10:17 UTC
Here's some interesting results...

v2.6.17 w/ ioapic: 1000 per second
v2.6.17.13 w/o ioapic: 1000 per second
v2.6.18 w/ ioapic: ~1008 per second
v2.6.18.3 w/o ioapic: 1000 per second
v2.6.18.3 w/ ioapic: ~1008 per second

all at 1000HZ.

Based on these numbers I'm thinking broken timekeeping code got fixed but now
IO-APIC is causing the timer interrupt to fire a few too many times per second...?

Here's the bash code I used to get my numbers (realizing of course that if
timing is broken then sleep(1) may not be reliable):

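# Print the average number of timer interrupts per second over several intervals.
# Reads the first count column (CPU0) of the IRQ "timer" line in /proc/interrupts.
for i in 5 10 15 30 45 60; do
  TIC=$(awk '$NF == "timer" { print $2; exit }' /proc/interrupts)
  sleep "$i"
  TOC=$(awk '$NF == "timer" { print $2; exit }' /proc/interrupts)
  echo $(( (TOC - TIC) / i ))
done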
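
For a sense of scale, the sketch below converts an observed timer interrupt rate into the drift it would cause if the clock advanced by exactly one tick per interrupt. That tick-counting model is an assumption made for illustration, not a description of how any particular kernel version keeps time, and the function name is hypothetical.

```python
def drift_seconds_per_day(observed_hz: float, configured_hz: float) -> float:
    """Seconds gained (positive) or lost (negative) per day,
    assuming the clock advances one tick per timer interrupt."""
    seconds_per_day = 24 * 60 * 60
    return (observed_hz / configured_hz - 1.0) * seconds_per_day


if __name__ == "__main__":
    # ~1008 interrupts per second at HZ=1000, as measured above
    gain = drift_seconds_per_day(1008, 1000)
    print(f"{gain:.1f} s/day ({gain / 60:.1f} min/day)")
```

Under that assumption, 8 extra interrupts per second would come to about 691 s (roughly 11.5 minutes) per day, which is close to the drift in the original report and far more than the 15 seconds per day seen on 2.6.18. This is only arithmetic, not a diagnosis.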
Comment 21 Natalie Protasevich 2007-10-10 23:39:45 UTC
Any update on this problem, please?
Thanks.
Comment 22 Andrew J. Kroll 2007-10-18 22:35:28 UTC
2.6.22.7 seems to have a time loss problem for me.

Attached is a log of time readjustments. I have a sneaking suspicion it is related to using USB heavily, and the older /dev/ub* driver.
Comment 23 Andrew J. Kroll 2007-10-18 22:36:45 UTC
Created attachment 13208 [details]
log of time adjustment over a few days
Comment 24 Andrew J. Kroll 2007-10-18 22:40:48 UTC
Created attachment 13209 [details]
system info

combined dump of system info

/proc/cpuinfo
lspci -vvv 
/usr/src/linux-2.6.22.7/.config
/proc/interrupts
Comment 25 Andrew J. Kroll 2007-10-18 22:44:06 UTC
Created attachment 13210 [details]
system info

combined system info try #2
/proc/cpuinfo
lspci -vvv
/usr/src/linux-2.6.22.7/.config
/proc/interrupts
Comment 26 Natalie Protasevich 2007-10-18 22:59:26 UTC
Andrew, can you also provide dmesg please. 
I suspect your problem might be different, because your config shows both APIC and IO-APIC configured, yet the system runs in PIC mode. Also, you seem to have the NMI watchdog enabled (the number of NMIs on each processor suggests that); it is disabled by default on 2.6.22+. Maybe a separate bug should be opened for your case.
Comment 27 Andrew J. Kroll 2007-10-19 07:35:44 UTC
I have to run with IO-APIC off due to a buggy IO-APIC 
I end up with the annoying vector error of 40:40, 
which is a known hardware flaw, and it fills up my logs :-(

My current fix (as shown in the attachment logs) 
is a cron job to re-sync with the official atomic clock.

As you can see, there aren't really two CPUs, since it's hyper-threaded...
I think possibly that hyper-threading has issues with the broken IO-APIC, because there is really only one (or is there? The bootup dmesg shows two when I boot with IO-APIC...). The problem is that if there really isn't a second IO-APIC, the error code (I looked it up) is telling us that the vector is incorrect... What I should do is re-enable it after patching to have it show which "cpu" is bitching about it, and perhaps this "flaw" can actually be mitigated, or actually fixed.

So as you see it's kind-of a two-way brokenness. :-( One problem might be affecting the other, and the root cause could be simply PIC related, or USB related (IRQ storm???). I don't know which it is, as I haven't fully done any serious diagnostics to locate the root cause, but if you have any pointers, or patches that can dump extra debug information, I'll be happy to supply my limited time to help resolve the issue.
Comment 28 nuitari-kernel 2008-02-04 00:03:17 UTC
I am experiencing this issue on an older machine. It has been going on for at least a year (forgot which kernel version) and is still present in 2.6.24.

The clock gains about 2.5 minutes every hour.

The machine has a VIA K8T800 chipset. 
I also have the VT6120/VT6121/VT6122 Gigabit Ethernet Adapter.

It is a 32 bit Athlon XP system

Doing: cat /proc/interrupts |grep timer; sleep 10s; cat /proc/interrupts |grep timer

Shows the following:
  0:  137759783   IO-APIC-edge      timer
LOC:  144077022   Local timer interrupts

  0:  137769322   IO-APIC-edge      timer
LOC:  144087032   Local timer interrupts
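
The two snapshots can be compared mechanically. A minimal sketch, assuming the single-count-column layout shown above (an SMP system has one column per CPU, which this does not sum); the helper name timer_counts is hypothetical.

```python
def timer_counts(snapshot: str) -> dict:
    """Map each row label ("0", "LOC", ...) to its first count column."""
    counts = {}
    for line in snapshot.splitlines():
        label, sep, rest = line.partition(":")
        fields = rest.split()
        if sep and fields and fields[0].isdigit():
            counts[label.strip()] = int(fields[0])
    return counts


before = timer_counts("""\
  0:  137759783   IO-APIC-edge      timer
LOC:  144077022   Local timer interrupts
""")
after = timer_counts("""\
  0:  137769322   IO-APIC-edge      timer
LOC:  144087032   Local timer interrupts
""")

interval = 10  # seconds slept between the two reads, as timed by the system clock
for label in before:
    delta = after[label] - before[label]
    print(f"{label}: {delta} interrupts, {delta / interval:.1f} per second")
```

For these samples the difference is 9539 on IRQ 0 and 10010 on LOC over the nominal 10 seconds. Since sleep is timed by the same clock that is under suspicion, the per-second rates are only as trustworthy as that clock.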

I will try noapic at the next maintenance opportunity.
Comment 29 john stultz 2008-07-29 19:56:59 UTC
nuitari-kernel: I've got a similar chipset on one of my boxes (K8M800), and I've not seen any obvious issues, but I'll check again soon.

Meanwhile, this bug has been a fairly long running bug, and contains a number of different issues, a few of which have been fixed. 

Could those on the CC list who still are having problems with recent (2.6.24+) kernels, comment and let me know? Thanks.
Comment 30 nuitari-kernel 2008-07-30 12:24:13 UTC
Hi John,

The timekeeping looks better on 2.6.24, which is weird because the same kernel used to have the problem.

However, I consistently see 9 extra firings of the timer interrupt using the following command:

cat /proc/interrupts |grep timer; sleep 10s; cat /proc/interrupts |grep timer
  0:  731268897   IO-APIC-edge      timer
LOC:  731280976   Local timer interrupts
  0:  731278906   IO-APIC-edge      timer
LOC:  731290984   Local timer interrupts

I'll try updating to 2.6.26 when vmware-modules catches up to it.
Comment 31 john stultz 2008-07-30 12:30:35 UTC
nuitari-kernel: Good to hear things are better. 

A few extra timer interrupts using the above script isn't unexpected. The wakeup may be a few ticks late, and it takes some time for the /proc/interrupts code to run. So I'd not fret much about that.
Comment 32 nuitari-kernel 2008-08-04 16:21:11 UTC
John, it looks fine now. I'm using the Gentoo 2.6.26 kernel on that machine.

There is still some seconds of drift but nothing major. 
Before I used to have 10+ minute issues every hour.

I've moved ntp-client from cron.hourly to cron.daily and that will be fine :)
Comment 33 john stultz 2008-08-04 17:07:57 UTC
nuitari-kernel: Thanks so much for testing! If you continue to see less severe ntp drift issues, do feel free to open a new bug describing the extent of the drift and how it is observed. 

I believe this issue can be closed. If other reporters are still seeing this problem (fast running clock which goes away with "noapic") please reopen this bug.

If you are seeing a different or slightly different issue, please open a new bug.
Comment 34 Andrew J. Kroll 2008-08-11 08:52:45 UTC
My "fix" was to use the PIC instead of the TSC, which had a lot of drift anyway because of thermal throttling.
Comment 35 Andrew J. Kroll 2008-08-11 09:02:40 UTC
One other note as far as using PIC vs. IO-APIC -- I have to disable the IO-APIC due to a buggy chipset :-)
