Bug 9791
Summary: | Clock is running too fast^Wslow using acpi_pm clocksource | ||
---|---|---|---|
Product: | Timers | Reporter: | tosn00j02 |
Component: | gettimeofday | Assignee: | john stultz (john.stultz) |
Status: | CLOSED PATCH_ALREADY_AVAILABLE | ||
Severity: | normal | CC: | bunk, macro, mingo, tglx |
Priority: | P1 | ||
Hardware: | All | ||
OS: | Linux | ||
Kernel Version: | 2.6.26-rc1 | Subsystem: | |
Regression: | Yes | Bisected commit-id: | |
Bug Depends on: | |||
Bug Blocks: | 10492 | ||
Attachments: |
dmesg
Dump APIC and PIC state dmesg with patch-2.6.26-rc1-20080505-all-pic-dump-3 Always try the timer both directly and through the 8259A |
Description
tosn00j02
2008-01-21 12:38:09 UTC
Reply-To: akpm@linux-foundation.org On Mon, 21 Jan 2008 12:38:09 -0800 (PST) bugme-daemon@bugzilla.kernel.org wrote: > http://bugzilla.kernel.org/show_bug.cgi?id=9791 > > Summary: Clock is running too fast using acpi_pm clocksource > Product: Timers > Version: 2.5 > KernelVersion: 2.6.24-rc8-ga7da60f4 > Platform: All > OS/Version: Linux > Tree: Mainline > Status: NEW > Severity: normal > Priority: P1 > Component: gettimeofday > AssignedTo: johnstul@us.ibm.com > ReportedBy: tosn00j02@sneakemail.com > > > Commit d4d25deca49ec2527a634557bf5a6cf449f85deb broke acpi_pm as a > clocksource > on my Shuttle SK41G. It makes the clock run a couple of seconds too fast per > minute. > > Reverting that commit fixes the problem. > So I assume this was a post-2.6.23 regression? Please attach dmesg output. Here is the git link to the patch: http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commit;h=d4d25deca49ec2527a634557bf5a6cf449f85deb Yes, post-2.6.23 regression. Created attachment 14514 [details]
dmesg
On Mon, 21 Jan 2008, Andrew Morton wrote:
> > ReportedBy: tosn00j02@sneakemail.com
> >
> > Commit d4d25deca49ec2527a634557bf5a6cf449f85deb broke acpi_pm as a
> > clocksource on my Shuttle SK41G. It makes the clock run a couple
> > of seconds too fast per minute.
> >
> > Reverting that commit fixes the problem.
> >
>
> So I assume this was a post-2.6.23 regression?
Yes it is a post-2.6.23 regression. This commit was done to fix long
standing bugzilla entries. We definitely need to revert it for .24 and
find a solution for .25 which fixes the old bugs and does not
introduce new ones.
Linus,
can you please pull the revert from:
ssh://master.kernel.org/pub/scm/linux/kernel/git/x86/linux-2.6-x86.git
Thanks,
tglx
Reverted in commit 4960c9df1407c2723459c60ff13e6efe0c209c62. Commit 6e908947b4995bc0e551a8257c586d5c3e428201 breaks this again. Regressions list annotation: Date : 2008-05-03 05:09 Created attachment 16168 [details] Dump APIC and PIC state (In reply to comment #7) > Commit 6e908947b4995bc0e551a8257c586d5c3e428201 breaks this again. Can you please try the patch attached, which makes an elaborate dump of all registers of all the interrupt controllers to be output to the bootstrap log, and post the result? I find the problem with your system strange as the lone effect of reverting the change is disabling OCW3 poll cycles to the 8259A. But your system has its 8254 timer interrupt routed directly to an I/O APIC and therefore timer interrupts should be unaffected no matter what the change does to the 8259A as the chip is not in the timer interrupt path on this system. My first suspect is some interrupt routing and/or masking misconfiguration which makes the 8259A post interrupts to the CPU somehow. Maciej > I find the problem with your system strange as the lone effect of
> reverting the change is disabling OCW3 poll cycles to the 8259A. But
> your system has its 8254 timer interrupt routed directly to an I/O
> APIC and therefore timer interrupts should be unaffected no matter
> what the change does to the 8259A as the chip is not in the timer
> interrupt path on this system. My first suspect is some interrupt
> routing and/or masking misconfiguration which makes the 8259A post
> interrupts to the CPU somehow.
It's even more odd because it affects the acpi_pm timer, which is read
only and neither related to the 8254 nor to the 8259. The time is
solely retrieved from the acpi_pm timer and should not be affected by
timer interrupts going too fast.
Does adding "nohz=off" to the kernel command line change anything?
Thanks,
tglx
nohz=off makes the clock stable for 2.6.26-rc2-00433-gf26a398. Furthermore, the drift was never as big with 2.6.26-rc? as it was with 2.6.24-rc?, but big enough for ntpd not to keep it stable. Will try your patch next (without nohz=off). Created attachment 16171 [details]
dmesg with patch-2.6.26-rc1-20080505-all-pic-dump-3
> nohz=off makes the clock stable for 2.6.26-rc2-00433-gf26a398.
>
> Furthermore, the drift was never as big with 2.6.26-rc? as it was with
> 2.6.24-rc?, but big enough for ntpd not to keep it stable.
Hmm. That's odd.
John, any idea how to decode this ?
Thanks,
tglx
To correct myself, the clock is about 1 second too _slow_ per minute, not too fast. I went back to 2.6.24-rc? and that was indeed slow too. Sorry for the confusion. (In reply to comment #14) > To correct myself, the clock is about 1 second too _slow_ per minute, not too > fast. I went back to 2.6.24-rc? and that was indeed slow too. Well, that is an important update. Such a symptom usually means a timer tick is lost somewhere, typically because interrupts are held disabled for too long. Perhaps the NMI watchdog could be used to track down this place now that we have more choice about timer sources than a few years ago. Your dump of IC registers looks sane. To rule out the influence of the output of the 8259A (the PIT interrupt is enabled in the chip), could you please try the "disable_8254_timer" option of the kernel? It should change the PIC IMR to ffff. > could you please try the "disable_8254_timer" option of the kernel?
> It should change the PIC IMR to ffff.
It did, and the problem is gone now.
Hmm, so now we have a solution that works, though we do not know why. I smell some SMM interaction. Anybody who could shed some light here is welcome to speak -- without the system at hands it is quite hard to figure out what is going on there. Anyway, the kernel option and the associated code is unnecessary as there is no need to unmask the timer interrupt in the PIC in the first place unless you want to make use of the I/O APIC NMI watchdog. I shall cook a reasonable patch shortly so that the option is not needed for you. I think the unmasking has been there since the day one and I always felt a bit uneasy about doing that unnecessarily. Whoever did this "disable_8254_timer" option must have not been aware that there is really no need for this piece of code here at all. ;-) Self-correction -- for the record. The interrupt in the PIC actually had to be unmasked for an obscure piece of code that used to live in do_slow_gettimeoffset() that did a read from the PIC to check for the value of the IRR bit corresponding to the timer interrupt. The code is now gone together with the whole function, so the unmasking can be removed from here. > To correct myself, the clock is about 1 second too _slow_ per minute, not too
> fast. I went back to 2.6.24-rc? and that was indeed slow too.
>
> Sorry for the confusion.
That makes a lot more sense.
Created attachment 16180 [details]
Always try the timer both directly and through the 8259A
This is a change to clean up the timer setup through the I/O APIC
a little bit. If a single candidate I/O APIC interrupt input pin
is found only (either in the MP table or in the ACPI table), then
it is always tried directly first and then, if failed, through the
8259A. This is to take care of broken BIOSes which report an
ExtINTA I/O APIC input (typically the pin #0 of the APIC #0) as a
native timer interrupt input. Care is taken that the 8259A has its
timer IRQ input unmasked only if known to be needed.
Please try this patch and see whether it fixes the clock being late
for you. Note with this set of changes the "disable_8254_timer"
option is now gone.
Works perfectly. Thank you very much. Good! It's on the way to upstream now (split into two as it comprises close, but functionally separate changes), but I suspect it will take a while to get integrated as it tackles historically one of the most fragile pieces of code dealing with the APIC and therefore needs a considerable amount of testing. Thanks for your feedback. Handled-By : Maciej W. Rozycki <macro@linux-mips.org> Patch : http://bugzilla.kernel.org/attachment.cgi?id=16180 Caused by: commit 6e908947b4995bc0e551a8257c586d5c3e428201 Author: Ingo Molnar <mingo@elte.hu> Date: Fri Mar 21 14:32:36 2008 +0100 x86: fix ioapic bug again un-revert: commit 4960c9df1407c2723459c60ff13e6efe0c209c62 Author: Thomas Gleixner <tglx@linutronix.de> Date: Tue Jan 22 10:23:01 2008 +0100 Revert "x86: fix NMI watchdog & 'stopped time' problem" This reverts commit d4d25deca49ec2527a634557bf5a6cf449f85deb. needs a proper fix though ... Signed-off-by: Ingo Molnar <mingo@elte.hu> Is this issue now resolved or not? It is resolved. 2.6.27-rc1, which has the attached patch, works perfectly. Thanks again. |