Bug 5366

Summary: synchronize_tsc_bp can zero the TSC during a lost ticks measurement
Product: Timers            Reporter: Tim Mann (mann)
Component: Other           Assignee: john stultz (john.stultz)
Status: CLOSED CODE_FIX
Severity: low              CC: andi-bz, weissman
Priority: P2
Hardware: i386
OS: Linux
Kernel Version: 2.6.11     Subsystem:
Regression: ---            Bisected commit-id:
Attachments: [UNTESTED] Use flag to avoid early lost tick compensation

Description Tim Mann 2005-10-04 18:16:37 UTC
Most recent kernel where this bug did not occur: Unknown -- long ago, probably
Distribution:
Hardware Environment: i386 and x86_64
Software Environment:
Problem Description:

I verified this in 2.6.11 since it was handy on lxr.linux.no, but it
probably exists in most kernel versions.  Also, the bug appears to be
present in both the i386 and x86_64 arch even though it's in
arch-specific code.

The basic problem is that (in some system configurations) the timer
interrupt handler uses the TSC to determine whether some ticks have
been lost, but the kernel synchronizes the TSCs by zeroing them all
*after* timer interrupts have been started.  Thus at the next timer
interrupt, lost ticks are computed as a value from after the reset
(near 0) minus a value from before the reset.  This overflows and
produces a garbage value that can be either positive or negative (the
details of the arithmetic differ between x86_64 and i386, but the
value is garbage on both).  If it's positive, it looks like a large
number of ticks have been lost and they get spuriously added to the
jiffies counter.  This in turn can cause timeouts to go off before
they should.
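
To make the arithmetic concrete, here is a standalone demo (not kernel
code; the constants are invented, assuming a 2 GHz CPU and HZ=1000) that
mimics the shape of the x86_64 computation quoted further below.
Depending on the values, the truncated result lands either negative
(later clamped to 0) or positive (spurious lost ticks):

	/* Illustration only: the unsigned subtraction wraps around 2^64
	 * and the result is then truncated into a signed int, which is
	 * what the kernel's int assignment does in practice. */
	#include <stdio.h>

	#define USEC_PER_SEC	1000000
	#define HZ		1000

	static int compute_offset(unsigned long long tsc,
				  unsigned long long last_tsc,
				  unsigned long long tsc_quot)
	{
		return (((tsc - last_tsc) * tsc_quot) >> 32)
			- (USEC_PER_SEC / HZ);
	}

	int main(void)
	{
		/* 32.32 fixed-point usecs per cycle for 2 GHz, like
		 * vxtime.tsc_quot = (1000 << 32) / cpu_khz: */
		unsigned long long quot = (1000ULL << 32) / 2000000ULL;

		/* tsc read just after the reset, last_tsc from before it: */
		printf("%d\n", compute_offset(1000000ULL,
					      6000000000ULL, quot));    /* garbage, ~ -3.0e6 */
		printf("%d\n", compute_offset(1000000ULL,
					      5000000000000ULL, quot)); /* garbage, ~ +1.8e9 */
		return 0;
	}

When the second case hits, lost = offset / (USEC_PER_SEC / HZ) comes out
around 1.8 million ticks, all of which get added to jiffies at once.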

Here are some more details for x86_64.  i386 is generally similar.

arch/x86_64/kernel/smpboot.c:smp_boot_cpus()

	if (!skip_ioapic_setup && nr_ioapics)
		setup_IO_APIC();      // calls check_timer() - hooks up PIT IRQ
	else
		nr_ioapics = 0;

	setup_boot_APIC_clock();

	/*
	 * Synchronize the TSC with the AP
	 */
	if (cpu_has_tsc && cpucount)
		synchronize_tsc_bp(); // Clear TSCs of all cpus


setup_IO_APIC() calls check_timer(). check_timer() goes through
various schemes of hooking up the PIT IRQ. On success, it leaves it in
the unmasked state and from that point on we get timer interrupts.

synchronize_tsc_bp() is called after PIT timer interrupts are enabled. This
function clears TSC on all cpus. By the time TSC is cleared, the BSP is
guaranteed to have received at least 4 PIT interrupts (otherwise
check_timer() would have declared it nonworking).
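
The clearing itself boils down to every CPU, boot processor included,
writing zero to its TSC.  A minimal sketch of the idea (distilled from
the routines quoted later in this report, not the verbatim kernel code;
write_tsc(lo, hi) is the kernel's wrmsr wrapper for the TSC MSR):

	/* Sketch only: the real synchronize_tsc_bp()/_ap() add barrier
	 * loops so the writes land nearly simultaneously on all cpus. */
	static void __init synchronize_tsc_sketch(void)
	{
		write_tsc(0, 0);	/* TSC := 0, on the BSP too */

		/* Note what is *not* here: nothing resets
		 * vxtime.last_tsc, which still holds a pre-reset
		 * reading -- that is the bug. */
	}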

The timer interrupt code has some logic to compensate for lost interrupts.
TSC of the last interrupt is saved in vxtime.last_tsc. However, when Linux
clears TSC in synchronize_tsc_bp(), it does not clear any state in vxtime.
As a result, the next timer interrupt _after_ TSC is cleared can get
confused and think that lots of time was lost. Whether this happens or not
is non-deterministic. The "offset" variable, which captures the time since
the last timer interrupt, is declared "int", so it can end up either
positive or negative.

		offset = (((tsc - vxtime.last_tsc) *
			   vxtime.tsc_quot) >> 32) - (USEC_PER_SEC / HZ);

The number of lost ticks in "lost" is a function of "offset", which can be
bogus. Here is the relevant code:

arch/x86_64/kernel/time.c:timer_interrupt(), VXTIME_TSC case:
	unsigned long tsc, lost = 0;
	int delay, offset = 0;

	// elided above: the earlier reads that set "tsc" and "delay"
	if (vxtime.mode == VXTIME_HPET) {
		... // we do not go here
	} else {
		offset = (((tsc - vxtime.last_tsc) *
			   vxtime.tsc_quot) >> 32) - (USEC_PER_SEC / HZ);

		if (offset < 0)
			offset = 0;

		if (offset > (USEC_PER_SEC / HZ)) {
			lost = offset / (USEC_PER_SEC / HZ);
			offset %= (USEC_PER_SEC / HZ);
		}

		monotonic_base += (tsc - vxtime.last_tsc)*1000000/cpu_khz ;

		vxtime.last_tsc = tsc - vxtime.quot * delay / vxtime.tsc_quot;

		if ((((tsc - vxtime.last_tsc) *
		      vxtime.tsc_quot) >> 32) < offset)
			vxtime.last_tsc = tsc -
				(((long) offset << 32) / vxtime.tsc_quot) - 1;
	}

	if (lost) {
		if (report_lost_ticks) {
			printk(KERN_WARNING "time.c: Lost %ld timer "
			       "tick(s)! ", lost);
			print_symbol("rip %s)\n", regs->rip);
		}
		jiffies += lost;
	}
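
For scale: "offset" is a 32-bit int, and USEC_PER_SEC / HZ is 1000 with
HZ=1000 (the usual configuration here), so a garbage offset near INT_MAX
yields lost = 2147483647 / 1000, about 2.1 million ticks. That is, jiffies
jumps forward by the equivalent of roughly 35 minutes in a single
interrupt, and every timeout due within that window fires immediately.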
Comment 1 Tim Mann 2005-10-04 18:18:43 UTC
p.s.  Boris Weissman found this bug; the detailed description above is due to him.
Comment 2 john stultz 2005-10-31 10:24:41 UTC
So to summarize, and to verify I understand the issue: since timer
interrupts occur before the TSCs are synced at boot, it's possible to get
large TSC offsets before, and for one tick after, the TSCs have been
synchronized. This causes a larger jiffies tick count at boot than there
should be.

This appears to be a fairly small window before init, so I'm not sure how
the issue manifests itself. Further, I'm not sure I understand the claim:
"This in turn can cause timeouts to go off before they should." Could
these issues be clarified a bit?  Thanks!
Comment 3 Tim Mann 2005-10-31 12:24:28 UTC
Yes, you understand the bug correctly.

We've seen the issue manifest itself because some timers do get set before the
TSC sync-up, so they can go off spuriously.  I don't claim to know how many
instances there are of timers that get set that early, but
http://bugme.osdl.org/show_bug.cgi?id=5397 gives one example:  The
blank_screen_t timer gets scheduled early, and we've seen it go off prematurely
because of this bug.  That would be fairly harmless except for the additional
bug (which bug 5397 describes) that the timer callback uses workqueues, which
aren't initialized yet at the point where the timer is set.
Comment 4 Andi Kleen 2005-11-03 10:30:27 UTC
It's bogus - timer interrupts only occur on the BP, which is synced by definition.
Comment 5 Tim Mann 2005-11-03 11:27:33 UTC
No, it's not bogus.  Syncing the TSCs is implemented by setting them all to 0,
not by setting the APs to the BP's value.  So the BP's TSC does change when the
TSCs are synced up.  (You can't do the latter because on older processors, it's
impossible to write a value to the TSC with a nonzero value in the high-order 32
bits.  That is, you can write values only between 0x0000000000000000 and
0x00000000ffffffff.)
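
To illustrate the limitation with the kernel's own helper, where
write_tsc(lo, hi) is the wrmsr wrapper for the TSC MSR (the values here
are made up):

	/* On the older processors in question, the hardware zeroes the
	 * upper 32 bits of the TSC regardless of the "hi" argument: */
	write_tsc(0xdeadbeef, 0x1);
	/* TSC is now 0x00000000deadbeef, not 0x1deadbeef -- so a large
	 * BP value cannot be restored, and syncing must go through 0. */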
Comment 6 Andi Kleen 2005-11-03 15:15:49 UTC
That was only true before 2.6.12, but hasn't been the case since then.
With the CPU hotplug changes, a new TSC sync algorithm came in that
doesn't do this.

-> INVALID
Comment 7 Tim Mann 2005-11-03 16:11:17 UTC
Andi, your patch in 2.6.12 was only for x86_64, but I'm talking about i386.  The
i386 arch still has the old algorithm that zeros all the TSCs.

It would be possible to port the new algorithm to i386, but you'd have to insert
a round of zeroing all the TSCs first, in case the BP's existing TSC value
happens to be more than 0x00000000ffffffff.  So the bug I reported here would
still exist.
Comment 8 Tim Mann 2005-11-03 16:21:45 UTC
p.s.  My second paragraph above isn't quite right, but you probably get the
point -- because of the limitation in writing the TSC, in general i386 will
still sometimes need to change the BP's TSC value when syncing up with the APs.

By the way, the limitation to writing values <= 0x00000000ffffffff exists in all
Intel processors older than P4 model 3 and all AMD processors older than Opteron.
Comment 9 john stultz 2005-11-03 16:39:52 UTC
That's funny -- just today Jim Houston posted a patch to lkml converting
i386 to use the x86-64 TSC sync method (look for "improved boot time TSC
synchronization").

Tim, might you be able to test Jim's patch to see if it resolves this issue?
Comment 10 Tim Mann 2005-11-03 16:53:35 UTC
Unlike the ia64 and x86_64 versions, Jim's patch for i386 zeros the TSC on the
BP (of necessity, as I explained above), so it can't fix this bug.  Quoting a
few lines (dropping some - lines):

+ * Writing the TSC resets the upper 32-bits, so we need to be careful
+ * that all of the cpus can be synchronized before we overflow the 
+ * 32-bit count.
...
+/*
+ * The boot processor set its own TSC to zero and then gives each 
+ * slave processor the chance to synchronize itself.
+ */
+static void __init synchronize_tsc_bp (void)
+{
+	unsigned int tsc_low, tsc_high, error;
+	int cpu;

+	printk("start TSC synchronization\n");
+	write_tsc(0, 0);
Comment 11 john stultz 2005-11-03 17:22:35 UTC
Ah, good point. Sorry for that wild goose chase.

Basically it seems we need a flag to disable lost tick compensation until after
the TSCs are synced. Does this sound reasonable? I'm a bit busy at the moment,
but I'll try to come up with something soon.
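
A minimal sketch of the idea (hypothetical names, not the actual patch):

	/* Hypothetical: smp_boot_cpus() would set this right after
	 * synchronize_tsc_bp() returns. */
	int tsc_sync_done;

	/* ...and timer_interrupt() would skip compensation until then,
	 * e.g. in the VXTIME_TSC branch: */
		if (!tsc_sync_done) {
			vxtime.last_tsc = tsc;	/* rebase on the fresh TSC */
			offset = 0;		/* no lost-tick math yet  */
		}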
Comment 12 Tim Mann 2005-11-03 17:53:59 UTC
Yes, that sounds quite reasonable.  Thanks.
Comment 13 john stultz 2005-12-21 12:32:19 UTC
Created attachment 6872 [details]
[UNTESTED] Use flag to avoid early lost tick compensation

Hey Tim, sorry for the very slow response on this one (things are finally
slowing down a bit for the holidays). Here's an untested patch that I
think should resolve this. Let me know if you continue to see the problem.
Comment 14 john stultz 2006-01-30 14:35:18 UTC
Patch has been sent to akpm and lkml.
Comment 15 john stultz 2006-02-01 13:45:34 UTC
Patch was included in Linus' -git tree today. Closing.
Comment 16 Tim Mann 2006-05-09 17:36:30 UTC
I don't have an easy reproducer for the bug, but the code change looks good. 
Sorry for not commenting sooner.  Just for reference (in case Boris cares),
2.6.16 is the first Linus release with the fix.