Bug 5127

Summary: Lost ticks compensation fires when it should not
Product:        Timers
Component:      gettimeofday
Status:         CLOSED CODE_FIX
Severity:       normal
Priority:       P2
Hardware:       i386
OS:             Linux
Kernel Version: 2.6.12.5
Regression:     No
Reporter:       Tim Mann (mann)
Assignee:       john stultz (john.stultz)
CC:             akpm, jdelvare

Description Tim Mann 2005-08-24 17:12:48 UTC
Most recent kernel where this bug did not occur: 2.4
Problem Description:

[Note: I reported this bug on the high-res-timers-discourse mailing
list and the kernel mailing list back in November.  The folks who are
working on major changes to timer code took it seriously, but it's not
fixed in Linus's kernel yet and those major changes don't seem close
to being all finished yet, so I'm filing the bug here -- mostly for
reference, partly in hopes that it may get fixed incrementally in the
current timer code.  I've checked, and the buggy code is still in
2.6.12.5.  Below is a revised version of my post from November.]

In 2.6, some code has been added to watch for "lost ticks" and
increment jiffies_64 to compensate for them.  A "lost tick" is when
timer interrupts are masked for so long that ticks pile up and the
kernel doesn't see each one individually, so it loses count.

Lost ticks are a real problem, especially in 2.6 with the base
interrupt rate having been increased to 1000 Hz, and it's good that
the kernel tries to correct for them.  However, detecting when a tick
has truly been lost is tricky. The code that has been added (both in
timer_tsc.c's mark_offset_tsc and timer_pm.c's mark_offset_pmtmr) is
overly simplistic and can get false positives.  Each time this
happens, a spurious extra tick gets added in, causing the kernel's
clock to go faster than real time.

The lost ticks code in timer_pm.c essentially works as follows.
Whenever we handle a timer tick interrupt, we note the current time as
measured on a finer-grained clock (namely the PM timer).  Let delta =
current_tick_pm_time - last_tick_pm_time.  If delta >= 2.0 ticks, then
we assume that the last floor(delta) - 1 ticks were lost and add this
amount into the jiffies counter.  The timer_tsc.c code is more complex
but shares the same basic concept.
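
In code terms the check is roughly the following.  This is a minimal
sketch of the logic as just described, not the actual timer_pm.c
source; the function name and the counts-per-tick constant are
illustrative:

#include <stdint.h>

/* The ACPI PM timer runs at 3579545 Hz, so at HZ=1000 one tick is
 * roughly 3580 counts.  Illustrative value, not the kernel's. */
#define PMTMR_COUNTS_PER_TICK 3580

uint64_t jiffies_64;               /* the kernel's tick counter */
static uint32_t last_pmtmr;        /* PM timer value at previous tick */

void timer_tick_sketch(uint32_t pmtmr_now)
{
        uint32_t delta = pmtmr_now - last_pmtmr;

        jiffies_64++;              /* count the tick we are handling */
        if (delta >= 2 * PMTMR_COUNTS_PER_TICK) {
                /* delta spans >= 2.0 tick periods: assume
                 * floor(delta) - 1 ticks were lost and add them in. */
                jiffies_64 += delta / PMTMR_COUNTS_PER_TICK - 1;
        }
        last_pmtmr = pmtmr_now;
}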

What's wrong with this?  The problem is that when we get around to
reading the PM timer or TSC in the timer interrupt handler, there may
already be *another* timer interrupt pending.  There is a small amount
of queuing in x86 interrupt controllers (PIC or APIC), to handle the
case where a device needs to request another interrupt in the window
between when its previous interrupt request has been passed on from
the controller to the CPU and when the OS's interrupt handler has run
to completion and unmasked the interrupt.  When this case happens, the
CPU gets interrupted again as soon as the interrupt is unmasked.  The
queue length here is very short, but it's not 0.  When using the APIC,
I think there can be one interrupt that the CPU has acknowledged and
is currently processing with interrupts masked, a second interrupt
currently pending in the APIC's ISR, and a third interrupt currently
being requested in the APIC's IRR, making the queue length 2.

This queuing means that if we are slow to respond to timer interrupts
(because interrupts stayed masked for too long, say), then by the time
we finally enter the interrupt handler for timer interrupt number T,
interrupt number T+1 (and possibly T+2 as well) may already be queued.
If we handled interrupt T-1 on time, then at this point
delta will be a little more than 2.0 ticks, because it's now past time
for tick T+1 to happen, so the "lost ticks" code will fire and add an
extra tick.  But no ticks were really lost.  We are handling tick T
right now, and as soon as we return from the interrupt service routine
and unmask the clock interrupt, we will immediately get another clock
interrupt, the one for tick T+1, and we'll increment jiffies_64 for
this tick too.
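
To make that concrete (illustrative numbers, not measurements):
suppose ticks are 1 ms apart and tick T-1 was handled on time at
PM-timer time 0.0 ms.  Interrupts then stay masked until 2.1 ms, so
the handler for tick T reads delta = 2.1 ticks >= 2.0 and adds
floor(2.1) - 1 = 1 "lost" tick.  But tick T+1 is already latched in
the APIC and fires the moment the handler returns, so jiffies_64
advances by 3 (tick T, the compensation, and tick T+1) over an
interval in which only about two ticks of real time actually passed.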

So, checking whether delta >= 2.0 will give us false positives.  How
to fix this?  Because of the queuing, I believe there's no way to
detect lost ticks without either false positives or negatives just by
looking at the spacing between the current tick and the last tick.

One idea is to do this test over a longer period, not just the last
inter-tick period.  If the TSC or PM timer tells us that we should
have seen 10 ticks and we've seen only 7, we can be reasonably sure
we've lost at least 1 tick.  It's important not to make the period too
long, though.  Since we only know approximately how fast the TSC is,
if we've seen 999,997 ticks when the TSC tells us we should have seen
1,000,000, that error may be due only to our idea of the TSC's rate
being off by a few parts per million.  (And actually, the measurement
that Linux does of the TSC rate at bootup time isn't that accurate.)
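
A sketch of how that might look (names, the window length, and the
one-tick slack for a queued interrupt are all my own assumptions, not
anything in the kernel):

#include <stdint.h>

#define WINDOW_TICKS 16            /* keep the window short, since the
                                    * calibrated TSC rate is only
                                    * approximate */

extern uint64_t jiffies_64;        /* as in the earlier sketch */
static uint64_t window_start_tsc;  /* TSC value when the window began */
static uint64_t tsc_per_tick;      /* calibrated TSC counts per tick */
static uint32_t ticks_seen;        /* interrupts handled this window */

void timer_tick_window_sketch(uint64_t tsc_now)
{
        jiffies_64++;
        ticks_seen++;

        if (ticks_seen >= WINDOW_TICKS) {
                uint64_t expected =
                        (tsc_now - window_start_tsc) / tsc_per_tick;

                /* Allow one tick of slack for an interrupt that may
                 * still be queued in the APIC; compensate only for a
                 * clear deficit. */
                if (expected > ticks_seen + 1)
                        jiffies_64 += expected - ticks_seen - 1;

                window_start_tsc = tsc_now;
                ticks_seen = 0;
        }
}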

Another idea that may help: if we've recently added a "lost" tick and
the next tick appears early, conclude that this is the tick we thought
we'd lost showing up after all, and don't count it again.
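
Layered on top of the first sketch (again with assumed names, and
reusing PMTMR_COUNTS_PER_TICK from there), that could look like:

extern uint64_t jiffies_64;
static uint32_t lost_credit;       /* ticks already added as "lost" */

void timer_tick_credit_sketch(uint32_t delta)
{
        /* If we recently compensated and this tick arrives well under
         * a full tick period after the last one, treat it as a
         * previously counted "lost" tick finally showing up. */
        if (lost_credit > 0 && delta < PMTMR_COUNTS_PER_TICK / 2) {
                lost_credit--;
                return;            /* don't count it a second time */
        }

        jiffies_64++;
        if (delta >= 2 * PMTMR_COUNTS_PER_TICK) {
                uint32_t lost = delta / PMTMR_COUNTS_PER_TICK - 1;
                jiffies_64 += lost;
                lost_credit += lost;   /* remember what we added */
        }
}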

For more ideas, see the discussion on
http://lists.sourceforge.net/lists/listinfo/high-res-timers-discourse
in November 2004.  The thread subject is: Spurious "lost ticks".

I should say that so far I haven't tried to test how much of an effect
this bug has on real hardware, but it certainly can happen on any
system where the lost ticks code is needed at all.  It has a big
effect in VMware VMs.  I've seen time in 2.6 kernels run as much as
10% fast using the code in timer_tsc.c (kernel command line option
clock=tsc), and I've seen a gain of roughly 1 second per hour with
clock=pmtmr.  (I understand why VMs would tickle the bug a lot more
than real hardware does, and unfortunately it's not something I can do
enough about within the VM implementation.  Until it's fixed on the
Linux side, all I can do is tell people to use clock=pit when they run
2.6 in a VM, which turns off all lost ticks compensation.)
Comment 1 john stultz 2005-08-24 17:24:50 UTC
Yes, this is a problematic issue and has been a motivator for my generic
timekeeping code. I'm working to get those patches integrated into the kernel,
but I am running into a bit of trouble on lkml.

Please give my patches a try (the last released version was B5). The cumulative
version of the patch can be found here:
http://www.ussg.iu.edu/hypermail/linux/kernel/0508.1/0982.html
Comment 2 Tim Mann 2005-08-24 17:39:08 UTC
Thanks, John.  I think I have a few cycles to try that this week...
Comment 3 Andrew Morton 2005-08-25 21:46:32 UTC
err, John, we'd prefer to not rewrite the whole timer subsystem to fix
one bug :(
Comment 4 Tim Mann 2005-08-26 11:07:58 UTC
Heh, that's kind of unfair to John, as this bug wasn't his motivation for
rewriting it; he's just taking care that his rewrite doesn't have the bug.

But yeah, it would be great to have the bug fixed in the short term within the
current timer subsystem...
Comment 5 john stultz 2005-08-29 11:19:59 UTC
Andrew: I too would prefer not to have to re-write timekeeping to fix one bug.
:) This bug actually is *one* of the motivating forces, because it
illustrates a core problem with tick-based timekeeping: when ticks are
missed, we lose time.  We've "fixed" that issue with the lost tick
compensation code, which helps but doesn't really solve the problem, and
in this case creates new problems where multiple ticks show up very late
but are not lost (common in virtualization).

I do realize that my patches feel like more than a heavyweight bug fix, and
I've been working to break them up and re-work bits as needed. But I really felt
it was necessary to first find a solution that would be correct in all of these
odd cases.

And hey, if someone can find a quick staplegun fix for this that doesn't break
something else, that would be great.
Comment 6 john stultz 2005-10-31 10:06:22 UTC
Tim: Did you ever get a chance to verify that the issue is cleared up with my
timeofday patchset? If not, I should be releasing a new B9 version against -mm
today or tomorrow. I expect the issue should go away, but I just really want
to be sure. Thanks!
Comment 7 john stultz 2006-07-10 11:18:12 UTC
The new timekeeping code in 2.6.18-rc1 should resolve this issue. Please re-open
if the problem still exists.
Comment 8 Jean Delvare 2011-03-07 20:20:51 UTC
Can anyone please point me to the relevant commits in Linus' tree?
Comment 9 john stultz 2011-03-07 20:27:46 UTC
I guess 734efb467b31e56c2f9430590a9aa867ecf3eea1 is the start of it, with 5d0cf410e94b1f1ff852c3f210d22cc6c5a27ffa providing the initial i386 clocksources.