Most recent kernel where this bug did not occur: 2.4
Distribution:
Hardware Environment:
Software Environment:
Problem Description:

[Note: I reported this bug on the high-res-timers-discourse mailing list and the kernel mailing list back in November. The folks who are working on major changes to timer code took it seriously, but it's not fixed in Linus's kernel yet, and those major changes don't seem close to being finished yet, so I'm filing the bug here -- mostly for reference, partly in hopes that it may get fixed incrementally in the current timer code. I've checked, and the buggy code is still in 2.6.12.5. Below is a revised version of my post from November.]

In 2.6, some code has been added to watch for "lost ticks" and increment jiffies_64 to compensate for them. A "lost tick" happens when timer interrupts are masked for so long that ticks pile up and the kernel doesn't see each one individually, so it loses count. Lost ticks are a real problem, especially in 2.6 with the base interrupt rate having been increased to 1000 Hz, and it's good that the kernel tries to correct for them. However, detecting when a tick has truly been lost is tricky. The code that has been added (both in timer_tsc.c's mark_offset_tsc and timer_pm.c's mark_offset_pmtmr) is overly simplistic and can get false positives. Each time this happens, a spurious extra tick gets added in, causing the kernel's clock to run faster than real time.

The lost ticks code in timer_pm.c essentially works as follows. Whenever we handle a timer tick interrupt, we note the current time as measured on a finer-grained clock (namely the PM timer). Let delta = current_tick_pm_time - last_tick_pm_time. If delta >= 2.0 ticks, then we assume that the last floor(delta) - 1 ticks were lost and add this amount into the jiffies counter. The timer_tsc.c code is more complex but shares the same basic concept.

What's wrong with this? The problem is that when we get around to reading the PM timer or TSC in the timer interrupt handler, there may already be *another* timer interrupt pending. There is a small amount of queuing in x86 interrupt controllers (PIC or APIC) to handle the case where a device needs to request another interrupt in the window between when its previous interrupt request has been passed on from the controller to the CPU and when the OS's interrupt handler has run to completion and unmasked the interrupt. When this case happens, the CPU gets interrupted again as soon as the interrupt is unmasked. The queue length here is very short, but it's not 0. When using the APIC, I think there can be one interrupt that the CPU has acknowledged and is currently processing with interrupts masked, a second interrupt currently pending in the APIC's ISR, and a third interrupt currently being requested in the APIC's IRR, making the queue length 2.

This queuing means that if we are being slow about responding to timer interrupts (due to having interrupts masked for too long, say), then when we finally get into the interrupt handler for timer interrupt number T, interrupt number T+1 (and possibly T+2 as well) may already be queued. If we handled interrupt T-1 on time, then at this point delta will be a little more than 2.0 ticks, because it's now past time for tick T+1 to happen, so the "lost ticks" code will fire and add an extra tick. But no ticks were really lost.
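To make the heuristic concrete, here is roughly what the check amounts to in C. This is a simplified sketch based on my description above, not the actual timer_pm.c code; pm_read(), PM_COUNTS_PER_TICK, and the function name are stand-ins I made up, and wraparound and locking details are omitted.

#include <linux/types.h>
#include <linux/jiffies.h>              /* for jiffies_64 */

/* Illustrative stand-ins -- not the real timer_pm.c symbols. */
extern u32 pm_read(void);               /* read the ACPI PM timer */
#define PM_COUNTS_PER_TICK 3580         /* ~3.579545 MHz PM timer / 1000 Hz tick */

static u32 last_pm_time;                /* PM timer value at the previous tick */

static void lost_tick_check(void)
{
        u32 now = pm_read();
        u32 delta = now - last_pm_time; /* reference time since the last handled tick */

        if (delta >= 2 * PM_COUNTS_PER_TICK) {
                /* Two or more tick periods have apparently elapsed, so
                 * assume floor(delta in ticks) - 1 ticks were lost and
                 * fold them into the jiffies counter.
                 */
                jiffies_64 += delta / PM_COUNTS_PER_TICK - 1;
        }

        last_pm_time = now;
        /* The normal jiffies_64++ for this tick happens elsewhere in
         * the timer interrupt path. */
}

The real code works in raw timer counts rather than in units of ticks, but the "delta >= 2 ticks" comparison above is the essence of the heuristic as I described it, and it is exactly where the false positives come from.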
To continue the scenario: we are handling tick T right now, and as soon as we return from the interrupt service routine and unmask the clock interrupt, we will immediately get another clock interrupt, the one for tick T+1, and we'll increment jiffies_64 for this tick too. So checking whether delta >= 2.0 gives us false positives.

How to fix this? Because of the queuing, I believe there's no way to detect lost ticks without either false positives or false negatives just by looking at the spacing between the current tick and the last tick.

One idea is to do this test over a longer period, not just the last inter-tick period (roughly as in the sketch at the end of this comment). If the TSC or PM timer tells us that we should have seen 10 ticks and we've seen only 7, we can be reasonably sure we've lost at least one tick. It's important not to make the period too long, though. Since we only know approximately how fast the TSC is, if we've seen 999,997 ticks when the TSC tells us we should have seen 1,000,000, that error may be due only to our idea of the TSC's rate being off by a few parts per million. (And actually, the measurement of the TSC rate that Linux does at boot time isn't that accurate.)

Another idea that may help: if we've recently added a "lost" tick and the next tick appears to be early, conclude that this is the tick we thought we'd lost showing up after all, and don't add it.

For more ideas, see the discussion on http://lists.sourceforge.net/lists/listinfo/high-res-timers-discourse in November 2004. The thread subject is: Spurious "lost ticks".

I should say that so far I haven't tried to measure how much of an effect this bug has on real hardware, but it certainly can happen on any system where the lost ticks code is needed at all. It has a big effect in VMware VMs. I've seen time in 2.6 kernels run as much as 10% fast using the code in timer_tsc.c (kernel command line option clock=tsc), and I've seen a gain of roughly 1 second per hour with clock=pmtmr. (I understand why VMs would tickle the bug a lot more than real hardware does, and unfortunately it's not something I can do enough about within the VM implementation. Until it's fixed on the Linux side, all I can do is tell people to use clock=pit when they run 2.6 in a VM, which turns off all lost ticks compensation.)
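For what it's worth, here is roughly what the longer-window idea might look like in C. This is only an illustrative sketch, not a proposed patch; read_ref_clock(), REF_COUNTS_PER_TICK, WINDOW_TICKS, and the rest are made-up names, and the one-tick margin is just one possible way to allow for interrupt queuing.

#include <linux/types.h>
#include <linux/jiffies.h>              /* for jiffies_64 */

/* Illustrative stand-ins -- all names here are made up. */
extern u64 read_ref_clock(void);        /* TSC or PM timer, in reference counts */
#define REF_COUNTS_PER_TICK 3580        /* reference counts per jiffy (approximate) */
#define WINDOW_TICKS 16                 /* keep the window short so rate error stays << 1 tick */

static u64 window_start;                /* reference clock at the start of the window */
static unsigned int ticks_seen;         /* ticks actually handled in this window */

static void lost_ticks_window_check(void)
{
        u64 now, elapsed;
        unsigned long expected;

        if (++ticks_seen < WINDOW_TICKS)
                return;

        now = read_ref_clock();
        elapsed = now - window_start;
        expected = elapsed / REF_COUNTS_PER_TICK;  /* real 32-bit kernel code
                                                      would use do_div() here */

        /* A deficit of a single tick can be explained by interrupt
         * queuing, so only compensate when the reference clock shows at
         * least two more ticks than were actually handled.  (Locking
         * around jiffies_64 is omitted.)
         */
        if (expected > ticks_seen + 1)
                jiffies_64 += expected - ticks_seen - 1;

        window_start = now;
        ticks_seen = 0;
}

The window length is the tuning knob: it has to be long enough that a genuinely lost tick stands out above the queuing slop, but short enough that the error in our estimate of the reference clock's rate stays well under one tick.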
Yes, this is a problematic issue and has been a motivator for my generic timekeeping code. I'm working to get those patches integrated into the kernel, but I am running into a bit of trouble on lkml. Please give my patches a try (the last released version was B5). The cumulative version of the patch can be found here: http://www.ussg.iu.edu/hypermail/linux/kernel/0508.1/0982.html
Thanks, John. I think I have a few cycles to try that this week...
err, John, we'd prefer to not rewrite the whole timer subsystem to fix one bug :(
Heh, that's kind of unfair to John, as this bug wasn't his motivation for rewriting it; he's just taking care that his rewrite doesn't have the bug. But yeah, it would be great to have the bug fixed in the short term within the current timer subsystem...
Andrew: I too would prefer not to have to re-write timekeeping to fix one bug. :) This bug actually is *one* of the motivating forces, because it illustrates a core problem with tick-based timekeeping. When ticks are missed we lose time. We've "fixed" that issue with the lost tick compensation code, which helps but doesn't really fix the problem, and in this case creates new problems where multiple ticks show up very late but are not lost (common in virtualization). I do realize that my patches feel like more than a heavyweight bug fix, and I've been working to break them up and re-work bits as needed. But I really felt it was necessary to first find a solution that would be correct in all of these odd cases. And hey, if someone can find a quick staplegun fix for this that doesn't break something else, that would be great.
Tim: Did you ever get a chance to verify that the issue is cleared up with my timeofday patchset? If not, I should be releasing a new B9 version against -mm today or tomorrow. I expect the issue should go away, but I just really want to be sure. Thanks!
The new timekeeping code in 2.6.18-rc1 should resolve this issue. Please re-open if the problem still exists.
Can anyone please point me to the relevant commits in Linus' tree?
I guess 734efb467b31e56c2f9430590a9aa867ecf3eea1 is the start of it, with 5d0cf410e94b1f1ff852c3f210d22cc6c5a27ffa providing the initial i386 clocksources.