Most recent kernel where this bug did not occur: Unknown -- long ago, probably
Distribution:
Hardware Environment: i386 and x86_64
Software Environment:
Problem Description:

I verified this in 2.6.11 since it was handy on lxr.linux.no, but it probably exists in most kernel versions. Also, the bug appears to be present in both the i386 and x86_64 arches even though it's in arch-specific code.

The basic problem is that (in some system configurations) the timer interrupt handler uses the TSC to determine whether some ticks have been lost, but the kernel synchronizes the TSCs by zeroing them all *after* timer interrupts have been started. Thus at the next timer interrupt, lost ticks are computed as a value from after the reset (near 0) minus a value from before the reset. This overflows and produces a garbage value that can be either positive or negative (the details of the arithmetic differ between x86_64 and i386, but the value is garbage on both). If it's positive, it looks like a large number of ticks have been lost and they get spuriously added to the jiffies counter. This in turn can cause timeouts to go off before they should.

Here are some more details for x86_64. i386 is generally similar.

arch/x86_64/kernel/smpboot.c:smp_boot_cpus():

	if (!skip_ioapic_setup && nr_ioapics)
		setup_IO_APIC();	/* calls check_timer() - hooks up PIT IRQ */
	else
		nr_ioapics = 0;

	setup_boot_APIC_clock();

	/*
	 * Synchronize the TSC with the AP
	 */
	if (cpu_has_tsc && cpucount)
		synchronize_tsc_bp();	/* Clear TSCs of all cpus */

setup_IO_APIC() calls check_timer(). check_timer() goes through various schemes of hooking up the PIT IRQ. On success, it leaves the IRQ in the unmasked state, and from that point on we get timer interrupts.

synchronize_tsc_bp() is called after PIT timer interrupts are enabled. This function clears the TSC on all cpus. By the time the TSC is cleared, the BSP is guaranteed to have received at least 4 PIT interrupts (otherwise check_timer() would have declared the timer nonworking).

The timer interrupt code has some logic to compensate for lost interrupts. The TSC of the last interrupt is saved in vxtime.last_tsc. However, when Linux clears the TSC in synchronize_tsc_bp(), it does not clear any state in vxtime. As a result, the next timer interrupt _after_ the TSC is cleared can get confused and think that lots of time was lost. Whether this happens or not is non-deterministic. The "offset" variable, which captures the time since the last timer interrupt, is declared "int", so it can end up either positive or negative:

	offset = (((tsc - vxtime.last_tsc) * vxtime.tsc_quot) >> 32) -
		 (USEC_PER_SEC / HZ);

The number of lost ticks in "lost" is a function of offset, which can be bogus. Here is all the relevant code, from arch/x86_64/kernel/time.c:timer_interrupt(), VXTIME_TSC case:

	unsigned long tsc, lost = 0;
	int delay, offset = 0;
	...
	if (vxtime.mode == VXTIME_HPET) {
		...			/* we do not go here */
	} else {
		offset = (((tsc - vxtime.last_tsc) * vxtime.tsc_quot) >> 32) -
			 (USEC_PER_SEC / HZ);

		if (offset < 0)
			offset = 0;

		if (offset > (USEC_PER_SEC / HZ)) {
			lost = offset / (USEC_PER_SEC / HZ);
			offset %= (USEC_PER_SEC / HZ);
		}

		monotonic_base += (tsc - vxtime.last_tsc) * 1000000 / cpu_khz;

		vxtime.last_tsc = tsc - vxtime.quot * delay / vxtime.tsc_quot;

		if ((((tsc - vxtime.last_tsc) * vxtime.tsc_quot) >> 32) < offset)
			vxtime.last_tsc = tsc -
				(((long) offset << 32) / vxtime.tsc_quot) - 1;
	}

	if (lost) {
		if (report_lost_ticks) {
			printk(KERN_WARNING "time.c: Lost %ld timer "
			       "tick(s)! ", lost);
			print_symbol("rip %s)\n", regs->rip);
		}
		jiffies += lost;
	}
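To make the overflow concrete, here is a small userspace demonstration of the same arithmetic (illustrative only -- the TSC values, HZ=1000, and the ~2 GHz cpu_khz are made up, and it assumes a 64-bit unsigned long as on x86_64):

	#include <stdio.h>

	int main(void)
	{
		/* Made-up but plausible boot-time values. */
		unsigned long last_tsc = 5000000000UL;	/* saved before write_tsc(0, 0) */
		unsigned long tsc = 20000UL;		/* read just after the reset */
		unsigned long tsc_quot = (1000UL << 32) / 2000000;	/* usec per cycle << 32 */

		/* Same expression as timer_interrupt(): tsc - last_tsc wraps
		 * around to a huge unsigned value, and truncating the result
		 * to int yields garbage whose sign depends on the inputs
		 * (and differs again under the 32-bit i386 arithmetic). */
		int offset = (int) ((((tsc - last_tsc) * tsc_quot) >> 32) -
				    (1000000 / 1000));

		printf("offset = %d usec, lost = %d ticks\n",
		       offset, offset > 0 ? offset / 1000 : 0);
		return 0;
	}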
p.s. Boris Weissman found this bug; the detailed description above is due to him.
So to summarize, and to try to verify I understand the issue: since timer interrupts occur before the TSCs are synced at boot, it's possible to get large TSC offsets before, and for one tick after, the TSCs have been synchronized. This causes a larger jiffies tick count at boot than there should be. This appears to be a fairly small window before init, so I'm not sure how the issue manifests itself. Further, I'm not sure I understand the claim: "This in turn can cause timeouts to go off before they should." Could these issues be clarified a bit? Thanks!
Yes, you understand the bug correctly. We've seen the issue manifest itself because some timers do get set before the TSC sync-up, so they can go off spuriously. I don't claim to know how many instances there are of timers that get set that early, but http://bugme.osdl.org/show_bug.cgi?id=5397 gives one example: The blank_screen_t timer gets scheduled early, and we've seen it go off prematurely because of this bug. That would be fairly harmless except for the additional bug (which bug 5397 describes) that the timer callback uses workqueues, which aren't initialized yet at the point where the timer is set.
It's bogus - timer interrupts only occur on the BP, which is synced by definition.
No, it's not bogus. Syncing the TSCs is implemented by setting them all to 0, not by setting the APs to the BP's value. So the BP's TSC does change when the TSCs are synced up. (You can't do the latter because on older processors, it's impossible to write a value to the TSC with a nonzero value in the high-order 32 bits. That is, you can write values only between 0x0000000000000000 and 0x00000000ffffffff.)
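For reference, writing the TSC boils down to a WRMSR on MSR 0x10 (the time-stamp counter MSR); a minimal sketch of the operation (the helper name is mine, not the kernel's):

	/* Illustrative sketch: WRMSR with ECX = 0x10 and the new value in
	 * EDX:EAX. On the older processors described above, the hardware
	 * effectively zero-extends EAX, so the high half can only ever be
	 * written as 0 -- hence syncing by zeroing every cpu. */
	static inline void write_tsc_sketch(unsigned int low, unsigned int high)
	{
		asm volatile("wrmsr" : : "c" (0x10), "a" (low), "d" (high));
	}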
That was true only before 2.6.12; it hasn't been the case since then. The CPU hotplug changes brought along a new TSC sync algorithm that doesn't do this. -> INVALID
Andi, your patch in 2.6.12 was only for x86_64, but I'm talking about i386. The i386 arch still has the old algorithm that zeros all the TSCs. It would be possible to port the new algorithm to i386, but you'd have to insert a round of zeroing all the TSCs first, in case the BP's existing TSC value happens to be more than 0x00000000ffffffff. So the bug I reported here would still exist.
p.s. My second paragraph above isn't quite right, but you probably get the point -- because of the limitation in writing the TSC, in general i386 will still sometimes need to change the BP's TSC value when syncing up with the APs. By the way, the limitation to writing values <= 0x00000000ffffffff exists in all Intel processors older than P4 model 3 and all AMD processors older than Opteron.
That's funny; just today Jim Houston posted a patch to lkml converting i386 to use the x86-64 TSC sync method (look for "improved boot time TSC synchronization"). Tim, might you be able to test Jim's patch to see whether it resolves this issue?
Unlike the ia64 and x86_64 versions, Jim's patch for i386 zeros the TSC on the BP (of necessity, as I explained above), so it can't fix this bug. Quoting a few lines (dropping some - lines):

+ * Writing the TSC resets the upper 32-bits, so we need to be careful
+ * that all of the cpus can be synchronized before we overflow the
+ * 32-bit count.
...
+/*
+ * The boot processor set its own TSC to zero and then gives each
+ * slave processor the chance to synchronize itself.
+ */
+static void __init synchronize_tsc_bp (void)
+{
+	unsigned int tsc_low, tsc_high, error;
+	int cpu;
+
+	printk("start TSC synchronization\n");
+	write_tsc(0, 0);
Ah, good point. Sorry for sending you on that wild goose chase. Basically it seems we need a flag to disable lost-tick compensation until after the TSCs are synced. Does this sound reasonable? I'm a bit busy at the moment, but I'll try to come up with something soon.
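Roughly what I have in mind (an untested sketch, not a real patch -- the tsc_sync_done name and the exact resync point are hypothetical):

	/* Flag set once boot-time TSC sync has finished; until then,
	 * timer_interrupt() skips the lost-tick compensation entirely. */
	static int tsc_sync_done;

	static void __init synchronize_tsc_bp(void)
	{
		/* ... existing code that zeroes the TSC on every cpu ... */

		rdtscll(vxtime.last_tsc);	/* resync vxtime with the fresh TSC */
		tsc_sync_done = 1;
	}

	/* in timer_interrupt(), before the lost-tick arithmetic: */
		if (!tsc_sync_done)
			offset = 0;		/* no compensation until synced */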
Yes, that sounds quite reasonable. Thanks.
Created attachment 6872 [details]
[UNTESTED] Use flag to avoid early lost tick compensation

Hey Tim,
Sorry for the very slow response on this one (things are finally slowing down a bit for the holidays). Here's an untested patch that I think should resolve this. Let me know if you continue to see the problem.
Patch has been sent to akpm and lkml.
Patch was included in Linus' -git tree today. Closing.
I don't have an easy reproducer for the bug, but the code change looks good. Sorry for not commenting sooner. Just for reference (in case Boris cares), 2.6.16 is the first Linus release with the fix.