Bug 5105

Summary: lost ticks - hang check - after loading the CPU
Product: Timers Reporter: Nathan Becker (nbecker)
Component: Other Assignee: john stultz (john.stultz)
Status: CLOSED CODE_FIX    
Severity: normal CC: akpm, andi-bz, brix, drow, frankvm, gralves, gtdev, khanreaper, mlists, roy.franz
Priority: P2    
Hardware: i386   
OS: Linux   
Kernel Version: 2.6.12.5 Subsystem:
Regression: No Bisected commit-id:
Attachments: dmesg from running in console
dmesg from running in x
Dmesg for another affected system.
dmesg from boot with notsc on x2 3800+, 2.6.13 + mppe patch
2.6.13 SMP on Athlon X2: nanosleep returning waay to soon, clock_gettime(CLOCK_REALTIME...) proceeding too fast
2.6.13 SMP on AMD Athlon X2 (i386): time anomalies
Another machine w/ the problem
Drop single-socket synch assumption
dmesg Asus A8N-SLI premiul Athlon 64 X2 4400+ 4g ram
tsc synchronization check
Config of failing i386 system
dmesg of failing i386 system + clock=pit workaround
dmesg of failing i386 system + clock=pit + CONFIG_X86_PM_TIMER

Description Nathan Becker 2005-08-21 12:47:36 UTC
Distribution: Slamd64
Hardware Environment:

AMD Athlon64 X2 4800+
Gigabyte GA-K8NXP-SLI motherboard
eVGA NVIDIA 6600GT PCI express

Software Environment:
x86_64 kernel target compiled from vanilla 2.6.12.5 kernel source
x.org, nv driver (open source NVIDIA 2D accelerator driver)

Problem Description:
The clock begins to run fast as soon as the CPU is loaded with a mathematical
calculation.  Kernel dmesg gives different messages depending on whether I am in
X Windows or just at the console.  At the console the message is:

Losing some ticks... checking if CPU frequency changed.

In X-Windows an additional message is

warning: many lost ticks.
Your time source seems to be instable or some driver is hogging interupts
rip default_idle+0x20/0x30

It might be possible to get that second message while only in console
eventually, but I haven't been able to do it.

Also, if hangcheck is turned on then I get a huge number of Hangcheck messages.

At first glance this might just sound like an annoying problem where the system
clock is wrong.  However, it is more than just wrong system time.  After running
for a while the clock is running very fast.  The keyboard repeat delay depends
on the system clock being accurate, so once it is running fast then it is almost
impossible to type in X-windows.  Typing in the console does not seem to be
affected.  A workaround I have come up with for the typing problem in X is to
increase the key repeat delay to the max:

xset r rate 1000 30

Even this workaround eventually fails as the clock continues to speed up.

Steps to reproduce:

Run anything that uses a lot of CPU.  Wait a while.  May take 30 minutes to an
hour to start to get bad.
Comment 1 Nathan Becker 2005-08-21 12:51:11 UTC
Created attachment 5709 [details]
dmesg from running in console
Comment 2 Nathan Becker 2005-08-21 12:51:46 UTC
Created attachment 5710 [details]
dmesg from running in x
Comment 3 Andrew Morton 2005-08-21 12:52:44 UTC
bugme-daemon@kernel-bugs.osdl.org wrote:
>
> http://bugzilla.kernel.org/show_bug.cgi?id=5105
> 
>             Summary: lost ticks - hang check - after loading the CPU
>      Kernel Version: 2.6.12.5

Lots of people seem to be reporting that their clocks are running
way too fast.   Did we get to the bottom of this?


Comment 4 Nathan Becker 2005-08-21 13:03:47 UTC
I don't know.  I did pretty extensive searches on the kernel mailing list,
posted some messages, and then John Stultz suggested I open a bugzilla report. 
I definitely found many people reporting similar problems, but no definitive
solution.  Some people said that using the kernel option no_timer_check fixes
it, but it doesn't work for me.  In fact no_timer_check makes it significantly
worse.
Comment 5 Anonymous Emailer 2005-08-21 13:34:22 UTC
Reply-To: ak@muc.de

On Sun, Aug 21, 2005 at 12:51:16PM -0700, Andrew Morton wrote:
> bugme-daemon@kernel-bugs.osdl.org wrote:
> >
> > http://bugzilla.kernel.org/show_bug.cgi?id=5105
> > 
> >             Summary: lost ticks - hang check - after loading the CPU
> >      Kernel Version: 2.6.12.5
> 
> Lots of people seem to be reporting that their clocks are running
> way too fast.   Did we get to the bottom of this?

There are different classes of bugs: 

- ATI: still being worked on
- Nvidia: some new systems have developed problems.  totally mysterious
still. The strange thing is that it works for some people. e.g. I got
a report that the new sun single CPU opteron box had this problem,
but then for another user it ran just great with a similar kernel.
- AMD 8111: one report that it has problems now. totally surprising
because it has always worked great for me (it's the kind of reference
platform for x86-64) 

Not much progress unfortunately on any of them.

-Andi

Comment 6 john stultz 2005-08-22 12:25:37 UTC
Does the behaviour change if you boot a uniproc kernel?
Comment 7 Nathan Becker 2005-08-22 12:27:38 UTC
> Does the behaviour change if you boot a uniproc kernel?

Do you mean just remove SMP support?  I will try that tonight and get 
back to you.

Comment 8 Nathan Becker 2005-08-23 08:28:00 UTC
As far as I can tell, this problem does not occur on an identical kernel 
with SMP disabled.  I ran the CPU at 100% overnight and the clock was 
stable.  I also turned on hangcheck and there were no messages.  This was 
with x.org running.

Comment 9 Roel van der Made 2005-08-26 09:25:21 UTC
Hi,

I'm following this thread also since I'm experiencing about the same issues
mentioned by Nathan. I'm running an Athlon64 3700+ 2.2Ghz with 2GB ram on an MSI
Neo4FI board and an ATI pci-e video-card. I do not experience the 'speeding
console' issues but do see the kernel/dmesg messages about the lost ticks. I'm
also running x.org with (Debian-)kernel 2.6.11-9-amd64-k8.

Another problem I'm experiencing is that I'm unable to compile a kernel since it
segfaults on any random place during the compile. Don't know if it could be
related..(checked mem during a whole night with memtest86+).

If any more input is needed please tell me.

Thanks,

Roel.
Email: roelATroel.net

sniplet:

warning: many lost ticks.
Your time source seems to be instable or some driver is hogging interupts
rip default_idle+0x20/0x30

Comment 10 Roel van der Made 2005-08-26 10:13:10 UTC
Hi,

Ok, my kernel-build issue seems to be solved when forcing gcc-3.3 as the
compiler. I was building with gcc-4 (debian/sid).

Thanks,

Roel.
Comment 11 Nathan Becker 2005-08-26 11:08:44 UTC
From cc -v my compiler version is

Reading specs from /usr/lib/gcc/x86_64-slackware-linux/3.4.4/specs
Configured with: ../gcc-3.4.4/configure --prefix=/usr --enable-shared 
--enable-threads=posix --enable-__cxa_atexit --disable-checking 
--with-gnu-ld --verbose --target=x86_64-slackware-linux 
--host=x86_64-slackware-linux
Thread model: posix
gcc version 3.4.4

Here's a more effective (although probably obvious) workaround for the 
keyboard repeat problem:

xset r off

This completely disables character repeat.  It's sort of annoying if you 
type a lot like I do, but it does make the computer usable.

Also, I never posted the actual message from hangcheck. I don't know if 
hangcheck can give other messages so maybe this info is pointless, but the 
message is:

Hangcheck: hangcheck value past margin!

BTW, if anyone has any bleeding edge patches for this issue, then I'm 
happy to try them out.


Nathan

Comment 12 Daniel Jacobowitz 2005-08-27 09:44:02 UTC
This is definitely happening to me, too.  It's an nForce3 (Shuttle SN25P)
motherboard with an X2 4600+, running an SMP kernel.  I don't know offhand if it
happened on a UP kernel, but I doubt it.

I've attached a dmesg.  Here's the errors with times attached, for comparison:

Aug 25 12:44:39 caradoc kernel: input: USB HID v1.10 Device [Microsoft Natural
Keyboard Pro] on usb-0000:00:02.0-2.1
Aug 25 14:43:30 caradoc kernel: Losing some ticks... checking if CPU frequency
changed.
Aug 25 15:01:58 caradoc kernel: warning: many lost ticks.

Could this have something to do with the "Nvidia board detected, ignoring ACPI
timer override"?  I don't know when that was added, but perhaps it's obsolete on
some current boards...
Comment 13 Daniel Jacobowitz 2005-08-27 09:44:41 UTC
Created attachment 5784 [details]
Dmesg for another affected system.
Comment 14 Daniel Jacobowitz 2005-08-30 15:55:19 UTC
Well here's an interesting datapoint.  I've been booting with "apic" on the
command line, which overrides the Nvidia quirk.  Here's the diff in dmesg:

-Nvidia board detected. Ignoring ACPI timer override.
-ACPI: BIOS IRQ0 pin2 override ignored.
+ACPI: IRQ0 used by override.
+ACPI: IRQ2 used by override.

+..MP-BIOS bug: 8254 timer not connected to IO-APIC
+ failed.
+timer doesn't work through the IO-APIC - disabling NMI Watchdog!
+Uhhuh. NMI received for unknown reason 2d.
+Dazed and confused, but trying to continue
+Do you have a strange power saving mode enabled?

-testing NMI watchdog ... OK.
+testing NMI watchdog ... CPU#0: NMI appears to be stuck (11->11)!

-Losing some ticks... checking if CPU frequency changed.
-warning: many lost ticks.
-Your time source seems to be instable or some driver is hogging interupts
-rip default_idle+0x22/0x30

i.e. it complains, something is definitely wrong, but whatever time source it's
falling back to seems to work OK.  I haven't had the clock speed up since I
started using "apic".

The only substantial difference in /proc/interrupts is that the timer interrupt
isn't marked as connected to the APIC anymore:

  0:     312519      38470          XT-PIC  timer
Comment 15 Daniel Jacobowitz 2005-08-30 17:15:13 UTC
Never mind... five hours later it showed up again.
Comment 16 Daniel Jacobowitz 2005-08-31 08:46:21 UTC
[Andi, hope you don't mind being added to CC]

Anyone have suggestions on how to debug this?  It's not powernow (I don't even
have it loaded).  I'm willing to test about anything at this point :-)

The timer runs fine for a while, until something triggers the problem -
sometimes CPU load, sometimes no apparent cause.  After that the timer goes very
wonky.  It's kind of entertaining to try to type with key repeat enabled, though.
Comment 17 Roy Franz 2005-09-05 21:17:49 UTC
I have also seen the 'fast time' problem on my system:
2.6.13 + mppe patch
MSI neo4 platinum (Nforce4)
Athlon64x2 3800+
Nvidia 6600 video card

I upgraded to the latest bios (version 1.8), and this seemed
to make the problem go away.

I am still seeing the 'lost tick' messages (no powernow or cpufreq compiled in)
in all of the configs I have tried. (noapic, noacpi, noacpi noapic, apic)  I am
running with the 'report_lost_ticks' option all the time.  Doing some compute
and/or network traffic reliably gets this failure within a minute or so. I have
been scp'ing some large files - I have never managed to copy 600 MBytes before this
fails.  Running rsync over a WAN connection also causes this problem, so it is
not high traffic rate related.  I see this problem with or without the nvidia
driver (nv doesn't seem to support my card), and using either the nvidia
ethernet or the Marvell Yukon pci-e on board.
My remaining problem is somewhat off-topic for this bug, so I may open a new one.

Roy
Comment 18 Daniel Jacobowitz 2005-09-06 06:26:52 UTC
Here's a revised version of a script written by Frank van Maarseveen and posted
to lkml:

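#!/bin/sh
# Sample /proc/interrupts once per second, 100 times, and print how many
# timer (IRQ 0) interrupts each CPU received next to the elapsed wall time.
# Needs the "calc" utility for the floating-point subtraction.

for i in `yes|head -100`
do
        # Wall-clock time and interrupt counters before and after a 1s sleep.
        time1=`date '+%s.%N'`
        s1=`cat /proc/interrupts`
        sleep 1
        time2=`date '+%s.%N'`
        s2=`cat /proc/interrupts`

        # IRQ 0 line: column 2 is the CPU0 count, column 3 the CPU1 count.
        t10=`echo "$s1" | awk '$1=="0:"{ print $2}'`
        t11=`echo "$s1" | awk '$1=="0:"{ print $3}'`
        t20=`echo "$s2" | awk '$1=="0:"{ print $2}'`
        t21=`echo "$s2" | awk '$1=="0:"{ print $3}'`
        d1=`expr $t20 - $t10`
        d2=`expr $t21 - $t11`
        # Output: <CPU0 ticks> + <CPU1 ticks> = <total> <elapsed>s
        echo $d1 + $d2 = `expr $d1 + $d2` `calc $time2 - $time1`s
done | cat -n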

This shows the number of timer interrupts elapsed on each CPU, and the total
time elapsed according to gettimeofday, every second.  It's not very accurate
but it's accurate enough to show the problem.
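
The counting step that the script does with awk can also be sketched in Python. This is only an illustration and not part of the original script: timer_counts and timer_deltas are made-up names, and the sample text is modelled on the IRQ 0 line from comment #14 with invented "after" values.

```python
def timer_counts(snapshot):
    """Return the per-CPU counts on the IRQ 0 line of /proc/interrupts text."""
    for line in snapshot.splitlines():
        fields = line.split()
        if fields and fields[0] == "0:":
            counts = []
            for field in fields[1:]:
                if not field.isdigit():
                    break  # reached the controller name, e.g. XT-PIC
                counts.append(int(field))
            return counts
    raise ValueError("no IRQ 0 line in snapshot")


def timer_deltas(before, after):
    """Timer interrupts each CPU received between two snapshots."""
    return [b - a for a, b in zip(timer_counts(before), timer_counts(after))]


# Hypothetical snapshots; on a live system read /proc/interrupts twice instead.
HEADER = "           CPU0       CPU1\n"
before = HEADER + "  0:     312519      38470          XT-PIC  timer\n"
after = HEADER + "  0:     312520      39389          XT-PIC  timer\n"
print(timer_deltas(before, after))
```

Stopping at the first non-numeric field keeps the sketch independent of the number of CPUs, which the fixed awk columns in the script are not.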

I've verified some interesting properties:
  - A normal second has about a thousand timer ticks.  I built with HZ=1000
    (don't remember why) so this is what I'd expect.
  - A normal second has all its timer interrupts delivered to CPU1.
  - There are more and worse "bad" seconds under load than when idle.
  - A bad second will show 1s elapsed via gettimeofday but substantially
    fewer timer ticks.  I didn't verify that they were actually less than
    a second but Frank ran similar tests using rsh and got the expected
    results - they're actually short.
  - A bad second will show less than a thousand ticks delivered to CPU1,
    and a few (but not enough to make up the difference) ticks delivered
    to CPU0.

For example:
     1  1 + 919 = 920 1.006286s
     2  0 + 1006 = 1006 1.00587s
     3  0 + 1007 = 1007 1.00634s
     4  6 + 738 = 744 1.004804s
     5  2 + 830 = 832 1.007727s

    42  0 + 1007 = 1007 1.00585s
    43  0 + 1007 = 1007 1.006391s
    44  1 + 918 = 919 1.00729s
    45  5 + 885 = 890 1.061896s
    46  0 + 1006 = 1006 1.005736s
    47  0 + 1006 = 1006 1.004535s
    48  0 + 1007 = 1007 1.005311s
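
To put a number on how "bad" a second is, the tick count can be compared with what HZ=1000 would deliver in the measured interval. The following is a small sketch under that assumption; tick_shortfall is a hypothetical helper, and the samples are copied from the output above.

```python
def tick_shortfall(ticks, elapsed_s, hz=1000):
    """Fraction of the expected timer ticks that did not arrive."""
    expected = hz * elapsed_s
    return (expected - ticks) / expected


# (total ticks, elapsed seconds) taken from lines 1, 2, 4 and 45 above.
samples = [(920, 1.006286), (1006, 1.00587), (744, 1.004804), (890, 1.061896)]
for ticks, elapsed in samples:
    print(f"{ticks:5d} ticks in {elapsed:.6f}s: "
          f"{tick_shortfall(ticks, elapsed):6.1%} missing")
```

By this measure line 4 is missing roughly a quarter of its ticks, while a normal second is within a fraction of a percent of the expected count.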

So this makes me wonder whether the timer interrupts are supposed to be load
balanced to both CPUs (presumably they are and that's not the problem), what's
causing them to, and whether the PM timer would work better than the
PIT/TSC-based timing.  Since I don't think I have an HPET that's my only other
option.

Hmm, there's always a substantial correction at boot when synchronizing the
TSCs.  If the two TSCs drift, that would explain the constantly "lost" ticks...
Comment 19 Daniel Jacobowitz 2005-09-06 06:36:21 UTC
Here's what it looks like if I boot with notsc:
     1  0 + 1005 = 1005 1.005261s
     2  0 + 1007 = 1007 1.005885s
     3  3 + 1003 = 1006 1.005783s
     4  1 + 1006 = 1007 1.006228s
     5  0 + 1007 = 1007 1.006273s
     6  6 + 1000 = 1006 1.006229s
     7  0 + 1005 = 1005 1.005498s
     8  0 + 1006 = 1006 1.005203s
     9  1 + 1005 = 1006 1.005421s

I'll give it a day or so to work and see if this holds, but it looks much better.
Comment 20 john stultz 2005-09-08 14:55:57 UTC
Nathan: Does booting w/ notsc improve the situation for you? 
Comment 21 Brian S. Stephan 2005-09-09 00:54:49 UTC
The 'notsc' fix has worked for me on my troubled box. Similar hardware as  
above:  
  
AMD Athlon 64 X2 3800+  
Gigabyte GA-K8N Pro-SLI (nForce 4) 
  
Tried various kernels, up to 2.6.13-mm1, losing ticks on them all, error  
messages as above.  
  
Console felt mostly "fine", but when I hopped into KDE and started copying  
stuff over SSH (fish://) keyboard input repeat would get worse... and then  
bad... and then _really_ bad... tapping a key would result in ~10 keypresses.  
  
Running the tick calculation script verified that as time passed, more ticks  
got lost, and with an increasing amount of variance.  
  
Anyway, 'notsc' is a working fix for me too. Thanks.  
Comment 22 Roy Franz 2005-09-09 11:19:08 UTC
Booting with the 'notsc' option (kernel 2.6.13, HZ=100) 
fixes the lost tick messages I was getting.  I can now run
scp several times without getting any lost tick messages.

What are the side effects of the notsc option?
Comment 23 john stultz 2005-09-09 11:28:46 UTC
Thanks for checking that. Sounds like these dualcore systems are running w/
unsynced TSCs. Using notsc should force you to fall back to the HPET or ACPI PM
timer for timekeeping. Would you mind posting a dmesg w/ notsc to see which
timesource you end up with?

I'm not certain this is the same issue that the original bug-report discusses,
but I'll wait to see if Nathan can confirm or deny.
Comment 24 Nathan Becker 2005-09-09 11:40:38 UTC
Sorry for the delay in my response.  I've been in the process of moving to 
a new apartment.

I tried notsc and it seems to fix both hangcheck and lost ticks - at least 
at first glance.  I would like to run some serious calculations overnight 
to be sure that it's fixed.  So far, it looks very promising.  Thanks to 
everyone who worked on this!

Nathan

Comment 25 Roy Franz 2005-09-09 12:11:52 UTC
Created attachment 5958 [details]
dmesg from boot with notsc on x2 3800+, 2.6.13 + mppe patch

Here is the dmesg from my machine booting with the notsc option.
So far I have seen no lost ticks, but I have not been using the machine
heavily.
I will try to run more stuff this weekend.
Comment 26 Frank van Maarseveen 2005-09-09 13:40:45 UTC
Created attachment 5960 [details]
2.6.13 SMP on Athlon X2: nanosleep returning waay to soon, clock_gettime(CLOCK_REALTIME...) proceeding too fast

My posting to LKML. The script has already made it to bugzilla (comment #18).
Additional info:

The nanosleep problem seems gone for now (for no apparent reason) but there are
still many other subtle timing problems: random keyboard repeats under X
(problem becomes manageable by restricting Xorg to run on only one CPU using
the taskset command) and apparently a not-so smooth smoothscroll in mozilla.
Booting with "nosmp" fixes it. I'll try "notsc"

I'd be happy to try a patch.
Comment 27 Frank van Maarseveen 2005-09-09 13:59:12 UTC
nope -- "notsc" does not fix it for me: still random keyboard repeats
Comment 28 Frank van Maarseveen 2005-09-09 15:28:32 UTC
kernel: notsc: Kernel compiled with CONFIG_X86_TSC, cannot disable TSC
tried disabling TSC by patching arch/i386/Kconfig because make oldconfig
reverted it to =y again. Result _still_ didn't work. tried "notsc" in addition
but then
kernel hangs right after mounting /.
Comment 29 john stultz 2005-09-09 16:31:49 UTC
Frank: I believe the other submitters have been dealing with x86-64. If you're
using an i386 kernel, maybe we need to open a new bug?
Comment 30 Frank van Maarseveen 2005-09-09 16:35:03 UTC
in my .xinitrc I have:

        xset s 300                      # black after 5 min
        xset dpms 0 0 310               # off after 5 minutes 10 sec

but occasionally the screen goes black after maybe 20 seconds or so... I first
thought a fuse was blown. The random keyboard repeat remains but is manageable.
The nanosleep/clock_gettime(CLOCK_REALTIME...) problem is still gone (might be
BIOS setting related) and the script for counting per-CPU timer interrupts is
unable to detect any time anomalies right now. Even visually all "sleep 1"
commands really seem to take a second. I have never seen "lost ticks" or extreme
speedups. The common denominator is "time" but the symptoms vary wildly,
apparently even on the same machine such as mine. Another common denominator
seems to be the AMD Athlon X2 with SMP, for both i386 and x86_64.
Comment 31 Frank van Maarseveen 2005-09-09 16:46:19 UTC
John: I think it has to do with hardware initialization and i386/x86_64 do not
really matter. Even BIOS PnP OS versus no PnP OS seemed to make a difference.

Next time I'll start a discussion on LKML because that's a better place. IMO.
Comment 32 Frank van Maarseveen 2005-09-10 06:27:44 UTC
Created attachment 5966 [details]
2.6.13 SMP on AMD Athlon X2 (i386): time anomalies

posted on lkml
Comment 33 Gustavo Ribeiro Alves 2005-09-10 22:13:21 UTC
Created attachment 5971 [details]
Another machine w/ the problem

I'm also having the problem. I'm going to try the notsc option and see what
happens.
Comment 34 Mudreac Nelu 2005-09-11 01:55:04 UTC
Hardware 
ASUS A8N SLI Premium 
CPU AMD64 X2 3800+
RAM 2G 2-3-3-5
Video ASUS nVidia 6600
HDD RAID 1 /boot RAID 0 /
Software
Kernel gentoo-source-2.6.13 SMP enable Powernow enable
No X server
After running for some time with notsc in grub.conf, my system no longer shows
lost tick messages in dmesg.
I will test it more to be 100% sure.
Comment 35 Mudreac Nelu 2005-09-11 02:00:45 UTC
Nope, after more testing I get the message
Losing some ticks... checking if CPU frequency changed.
:(
Comment 36 john stultz 2005-09-13 12:24:53 UTC
Created attachment 5993 [details]
Drop single-socket synch assumption

Andi: It looks like we're finally running into live SMP cpufreq systems.
We might need to drop the single-socket assumption in unsynchronized_tsc(). The
patch attached does this. Your thoughts?
Comment 37 Gustavo Ribeiro Alves 2005-09-13 12:29:20 UTC
I've been running the system w/ the notsc option since I posted the last message and
the lost tick messages (and subsequent problems) seem to have disappeared.
Comment 38 john stultz 2005-09-13 12:40:07 UTC
Frank van Maarseveen: Could you attach a dmesg of i386 kernel on your hardware?
Comment 39 Roy Franz 2005-09-13 12:45:23 UTC
using the 'notsc' option I get no lost tick messages anymore.  I did many hours
of scp copies which previously would exhibit lost ticks in less than 1 minute.

Roy
Comment 40 Nathan Becker 2005-09-13 16:35:28 UTC
OK, I ran calculations all weekend and the last couple of days.  When I 
first started doing stuff I was able to get a lost ticks message after 
about 15 hours of loading both CPU cores.  That happened once.  I 
rebooted, but now I am unable to reproduce it.  My machine has been up for 
over 1 day with a constant load average > 2 and no lost ticks messages.

It seems that either notsc works or makes it extremely difficult to 
reproduce the problem.  Not sure which possibility is worse.  Anyway I'll 
keep an eye on it, but I think this is a good fix for me for now.


Nathan

Comment 41 john stultz 2005-09-13 17:20:27 UTC
Nathan: Thanks for the testing! If you could, I'd appreciate it if you could try
the patch from comment #36 just to verify that it automatically triggers the
notsc setting on your box. 
Comment 42 Kurtis Rader 2005-09-14 22:04:27 UTC
Configuration:
    AMD64 X2 4400+
    ASUS A8N-SLI mainboard
    Nvidia Nforce 4 chipset
    ATI Radeon video adapters
    kernel 2.6.13-git12 nopreempt 
    AMD Cool&Quiet mode disabled

System had been running a single core AMD64 CPU without problems for
approximately six months (with kernel 2.6.11). I installed a dual
core CPU 1.5 weeks ago and immediately noted lost tick warnings and
positive clock drift (i.e., clock too fast).  Booting with CONFIG_HZ
1000 results in random oopses (but typically when the SATA interface is
being initialized). CONFIG_HZ == 250 and 100 don't oops but both exhibit
clock drift and occasional lost tick warnings and key repeat flakiness.

Booting the same kernel config as UP instead of SMP results in a
stable system with no clock drift. With the UP kernel, the clock
stability reported by ntpd settles at 20 ppm. With the SMP kernel it
was never better than 500 ppm and was normally greater than 1000 ppm.
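
For a sense of scale, and assuming the ppm figures are read as a plain frequency error (ntpd's stability value is not exactly that), they convert to seconds gained per day as in this sketch. drift_seconds_per_day is a hypothetical helper name.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def drift_seconds_per_day(ppm):
    """Seconds a clock gains or loses per day at a given error in ppm."""
    return ppm * 1e-6 * SECONDS_PER_DAY


for ppm in (20, 500, 1000):
    print(f"{ppm:5d} ppm is about {drift_seconds_per_day(ppm):.3f} s/day")
```

On that reading, 20 ppm is under two seconds a day, while 1000 ppm is well over a minute a day.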

Booting a SMP kernel results in 10 out of 11 boots reporting:

    CPU 1: synchronized TSC with CPU 0 
    (last diff -82 cycles, maxerr 637 cycles)

The remaining boot was only slightly different:

    CPU 1: synchronized TSC with CPU 0 
    (last diff -70 cycles, maxerr 613 cycles)

Even when booting with "clock=pmtmr" the kernel reports:

    time.c: Using PIT/TSC based timekeeping.

Booting the SMP kernel with "notsc" results in a stable system (although
as expected the clock stability reported by ntpd is significantly
worse). Booting with the patch in comment #36 but without "notsc"
also results in a stable system. In both cases the kernel reports:

    time.c: Using PM based timekeeping.
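
Several reports in this bug hinge on which timesource the kernel actually picked, so it can help to check dmesg for it mechanically. A minimal sketch follows: it matches only the two x86-64 message forms quoted in this comment, and timekeeping_source is a hypothetical name. Other kernels or architectures print different text.

```python
import re

# Matches lines such as "time.c: Using PM based timekeeping."
_TIMESOURCE = re.compile(r"time\.c: Using (.+?) based timekeeping\.")


def timekeeping_source(dmesg_text):
    """Return the timesource named in dmesg text, or None if no line matches."""
    match = _TIMESOURCE.search(dmesg_text)
    return match.group(1) if match else None


print(timekeeping_source("time.c: Using PIT/TSC based timekeeping."))
print(timekeeping_source("time.c: Using PM based timekeeping."))
```

A result of "PIT/TSC" after booting with notsc would indicate that the option did not take effect, as some later comments describe.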

There is definitely a correlation with clock problems and the timer
interrupt being handled by the second CPU. That is, 99.99% of the
time the timer is handled by CPU #1. If something causes the timer to
be handled by CPU #0, ntpd complains and keyboard repeat starts acting
flakey. This is consistent with the hypothesis that the TSC of the two
CPU cores drift and the PIT/TSC source is used.

It's not too surprising that this problem exists since the AMD64 X2
CPU has two essentially independent cores that share a memory bus. I
haven't looked at the details of the implementation but I would expect
each core to have an independent TSC. This is in contrast to a SMT CPU
where much more logic is shared.
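
The hypothesis can be illustrated with a toy model. This is a deliberately simplified sketch and not the kernel's code: it assumes the tick handler compares the current TSC with the value saved at the previous tick, credits any extra whole tick periods as lost ticks, and ignores an apparent step backwards. The function names, the offset between the two TSCs and the cycles-per-tick value are all invented for illustration.

```python
def lost_ticks(tsc_now, tsc_last, cycles_per_tick):
    """Extra ticks credited when more than one tick period seems to have passed."""
    periods = (tsc_now - tsc_last) // cycles_per_tick
    return max(periods - 1, 0)  # an apparent step backwards credits nothing extra


def credited_ticks(handler_cpus, tsc_offset, cycles_per_tick=2_200_000):
    """Ticks credited to the clock when real tick i is handled on handler_cpus[i - 1]."""
    credited = 0
    tsc_last = tsc_offset[handler_cpus[0]]  # reading at "tick 0"
    for i, cpu in enumerate(handler_cpus, start=1):
        tsc_now = i * cycles_per_tick + tsc_offset[cpu]
        credited += 1 + lost_ticks(tsc_now, tsc_last, cycles_per_tick)
        tsc_last = tsc_now
    return credited


offsets = {0: 6_600_000, 1: 0}  # hypothetical: CPU0's TSC is three tick periods ahead
print(credited_ticks([1, 1, 1, 1, 1], offsets))  # all ticks handled on CPU1
print(credited_ticks([1, 1, 0, 1, 1], offsets))  # one tick strays to CPU0
```

Under these assumptions five real ticks are credited as five when CPU1 handles all of them, and as eight when one lands on CPU0. Flipping the sign of the offset gives the same total, so in this model the clock can only gain time, which would match the "clock too fast" symptom.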
Comment 43 john stultz 2005-09-15 11:23:22 UTC
Kurtis: Thanks for the verification. As an aside, the "clock=" option is a i386
thing only. 

Andi: Any feedback before I send this patch to lkml?
Comment 44 Marc Perkel 2005-09-20 10:20:50 UTC
For what it's worth - I too am having this problem. I'm heading to the
datacenter to try some of the suggestions made here.

My hardware:

Dual Core Athlon 4400+
Asus A8n-SLC Premium
2.6.13.1 kernel

Selected lines From DMESG

time.c: Using 3.579545 MHz PM timer.
time.c: Detected 2211.363 MHz processor.
Calibrating delay using timer specific routine.. 4427.26 BogoMIPS (lpj=8854528)
CPU 0(2) -> Node 0 -> Core 0
mtrr: v2.0 (20020519)
Using local APIC timer interrupts.
Detected 12.564 MHz APIC timer.

Comment 45 Marc Perkel 2005-09-20 10:25:55 UTC
Created attachment 6059 [details]
dmesg Asus A8N-SLI premiul Athlon 64 X2 4400+ 4g ram

Added my dmesg file
Comment 46 john stultz 2005-09-20 12:15:39 UTC
bugzilla.kernel.org lost all the data from yesterday. So I'm repopulating this
from my email logs.

------- Additional Comments From nbecker@physics.ucsb.edu  2005-09-19 12:13 -------
I tried the patch in comment #36.  It seems to work fine.  Sorry for the 
delay in my response; I was running calculations that prevented me from 
rebooting last week.

Comment 47 john stultz 2005-09-20 12:16:32 UTC
------- Additional Comments From ak@suse.de  2005-09-19 12:35 -------
I asked AMD some time ago and they told me it was synchronized.                
The TSC on K8 is C state invariant, but not P state invariant,                  
but P states always happen synchronized on dual cores.                         
                                                                                
So I'm not quite convinced of your explanation yet.                            
                                                                                
Most likely you are working around some other bug by switching to pmtimer,
or just changed the timing enough because pmtimer is incredibly
slow.  It would be better to find the other bug.
Comment 48 Marc Perkel 2005-09-20 12:17:42 UTC
So - about the patch - is this a real fix or does it just mask the problem?
Comment 49 john stultz 2005-09-20 12:18:28 UTC
Created attachment 6060 [details]
tsc synchronization check

Here's a modified time consistency test to look for unsynced TSCs.  If you
could, please run the program for a little while on any dualcore system seeing
this issue.
Comment 50 john stultz 2005-09-20 12:19:11 UTC
------- Additional Comments From ak@suse.de  2005-09-19 13:53 -------
I don't think the program tests what you're looking for.

First, if pstate/cstate is really wrong then the tsc would just run
with a slower frequency and not go backward. Your test wouldn't detect
that.
And then you need to actually idle to see any c/p state problems.

Better would be if you port the tsc sync code from smpboot.c to
run in user space. Then let the system run for a day or two
to give it time to build up any different tscs and then run
the tsc sync algorithm and see how much difference it reports.
If it's small it's still ok because it's not fully accurate.

But again i have my doubts.
Comment 51 john stultz 2005-09-20 12:19:32 UTC
------- Additional Comments From drow@false.org  2005-09-19 14:06 -------
Andi, would you expect any output from John's program at all?  In fact it
produces quite a lot.

[Note to self: only works when compiled as 32-bit, duh.]
Comment 52 john stultz 2005-09-20 12:19:53 UTC
------- Additional Comments From drow@false.org  2005-09-19 14:10 -------
Here's an example output.  The box has thirteen days of uptime at the moment.

2775877798926041
2775877798926057
2775877798926073
2775877798926089
2775877798926105
--------------------
2775877798926121
2775877616214435
--------------------
2775877616214453
2775877616214469
2775877616214485
2775877616214510
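
The size of the backward step in this output can be worked out directly. The sketch below uses the readings above with the separator lines dropped. backward_steps and cycles_to_ms are hypothetical helpers, and the 2.4 GHz clock is an assumption based on the X2 4600+ figures mentioned elsewhere in this bug.

```python
def backward_steps(readings):
    """Return (index, size) for every reading smaller than its predecessor."""
    return [(i, prev - cur)
            for i, (prev, cur) in enumerate(zip(readings, readings[1:]), start=1)
            if cur < prev]


def cycles_to_ms(cycles, cpu_hz=2.4e9):
    """Convert a cycle count to milliseconds at an assumed CPU frequency."""
    return cycles / cpu_hz * 1000.0


readings = [
    2775877798926041, 2775877798926057, 2775877798926073,
    2775877798926089, 2775877798926105, 2775877798926121,
    2775877616214435, 2775877616214453, 2775877616214469,
    2775877616214485, 2775877616214510,
]
for index, size in backward_steps(readings):
    print(f"reading {index}: back {size} cycles, about {cycles_to_ms(size):.1f} ms")
```

The single backward step is 182711686 cycles, which at the assumed 2.4 GHz is on the order of 76 ms.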
Comment 53 john stultz 2005-09-20 12:21:28 UTC
------- Additional Comments From jonas@mysql.com  2005-09-19 22:31 -------
Hi,

I have an AMD64 X2 and get "Lost tick".
I can't get the notsc option to work.
I.e. the kernel is using PIT/TSC even if I add notsc to the boot line.


Comment 54 john stultz 2005-09-20 12:23:08 UTC
Recovered data from email, comment number doesn't match up (should be comment #49)

------- Additional Comments From jonas@mysql.com  2005-09-19 22:54 -------
BTW: John's program from comment#48
produces output as soon as I start compiling on my machine
Comment 55 john stultz 2005-09-20 12:25:57 UTC
Marc: I believe the patch in comment #36 is the right fix, however Andi suspects
it is something else since it goes against what he's been previously told. I'll
update the bug if we find a different cause, but for now booting w/ notsc should
get you around the issue (assuming you have HPET or ACPI PM support on your
hardware).
Comment 56 Frank van Maarseveen 2005-09-20 13:46:26 UTC
For i386 on Athlon64 X2 the option "clock=pit" seems to be a workaround. The
kernel reports "kernel: Using pit for high-res timesource" and all time
anomalies seem gone. But still I see this:

Sep 20 21:02:50 iapetus kernel: Hangcheck: hangcheck value past margin!
Sep 20 21:05:50 iapetus kernel: Hangcheck: hangcheck value past margin!
Sep 20 21:08:50 iapetus kernel: Hangcheck: hangcheck value past margin!
Sep 20 21:11:50 iapetus kernel: Hangcheck: hangcheck value past margin!
Sep 20 21:14:50 iapetus kernel: Hangcheck: hangcheck value past margin!
Sep 20 21:17:50 iapetus kernel: Hangcheck: hangcheck value past margin!
Sep 20 21:20:50 iapetus kernel: Hangcheck: hangcheck value past margin!
Sep 20 21:23:50 iapetus kernel: Hangcheck: hangcheck value past margin!
Sep 20 21:26:50 iapetus kernel: Hangcheck: hangcheck value past margin!
Sep 20 21:29:50 iapetus kernel: Hangcheck: hangcheck value past margin!
Sep 20 21:32:50 iapetus kernel: Hangcheck: hangcheck value past margin!
Sep 20 21:35:50 iapetus kernel: Hangcheck: hangcheck value past margin!
Sep 20 21:38:50 iapetus kernel: Hangcheck: hangcheck value past margin!
Sep 20 21:41:50 iapetus kernel: Hangcheck: hangcheck value past margin!
Sep 20 21:44:50 iapetus kernel: Hangcheck: hangcheck value past margin!
Sep 20 21:47:50 iapetus kernel: Hangcheck: hangcheck value past margin!
Sep 20 21:50:50 iapetus kernel: Hangcheck: hangcheck value past margin! 
Sep 20 21:53:50 iapetus kernel: Hangcheck: hangcheck value past margin!
Sep 20 21:56:50 iapetus kernel: Hangcheck: hangcheck value past margin!
Sep 20 21:59:50 iapetus kernel: Hangcheck: hangcheck value past margin!

Probably a different issue.
Comment 57 john stultz 2005-09-20 13:58:41 UTC
Frank: Do you have the HPET or ACPI PM timer enabled in your .config for i386?
Comment 58 Ga 2005-09-20 14:00:36 UTC
I have an Athlon64 X2 on a ASUS A8V Deluxe board and I was suffering the same
clock problem.  Changing the time source did solve the problem, but I had to
*disable* ACPI 2.0 Support in the BIOS for the PM timer to be properly detected
(otherwise, the kernel continued to use the TSC timer even with the notsc
parameter, and the clock was still bad).  I now use the patch from comment #36
and my system runs just fine (without needing the notsc parameter, since it is
what the patch does).
Comment 59 Marc Perkel 2005-09-20 20:08:46 UTC
OK - I tried the patch and after 3 hours it is still working. And I upped the
freq from 100 to 250 to really test it. So far so good. And it is using the PM
timer now.
Comment 60 Johan van Baarlen 2005-09-21 02:49:35 UTC
Applied the patch for unsynced TSCs, and it seems to work on one of my systems, 
but not completely on the other. 
Same mainboard, same bios, same distro, same kernel, same memory (Asus A8N-E, 
1008, FC3, 2.6.13.2, 4G PC3200).
Both systems run stable, but clock is still too fast on one of them. Both use 
PMtimer as source, and kernel-ticks are at 250Hz.

System one (AthlonX2-4400+, cpu model 35, stepping 2 - 2.2GHz, 2x1Mb): ntpq 
shows jitter <5ms.
System two (AthlonX2-4600+, cpu model 43, stepping 1 - 2.4GHz, 2x512k): ntpq 
shows increasing jitter (1hr after startup, 99% idle, around 40 ms jitter). 

The little TSCsynctest gives output on both, but looks more like a counter 
overflowing than an actual problem.
--------------------
4294967292
11
--------------------
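A quick way to check the overflow reading: 4294967292 is 2^32 - 4, so if the test printed a small negative TSC difference through an unsigned 32-bit format, the underlying value would be -4. The sketch below assumes exactly that (a signed difference printed as unsigned 32-bit), which this comment does not confirm; `as_signed32` is a hypothetical helper, not part of the TSCsynctest tool.

```python
def as_signed32(value: int) -> int:
    """Reinterpret an unsigned 32-bit integer as a two's-complement signed value."""
    value &= 0xFFFFFFFF
    # Values with the top bit set represent negative numbers.
    return value - (1 << 32) if value >= (1 << 31) else value


# The two values printed by the test in this comment.
for raw in (4294967292, 11):
    print(raw, "->", as_signed32(raw))
```

Under that assumption the two readings would be -4 and 11, i.e. a difference of a handful of counts rather than a large skew.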

IRQbalance is active, timer-interrupts are handled by both CPU0 and CPU1, even 
distribution. The timertest-script shows interrupt jumping every 10 seconds, 
251 to 253 ticks-per-step, and a good 1.004 to 1.009 seconds between. Quite 
alright.
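As a rough cross-check of those figures: at 250 Hz each timer tick is 4 ms, so 251 to 253 ticks correspond to roughly 1.004 to 1.012 seconds, which is in line with the measured step length. A minimal sketch, assuming HZ=250 as stated above; `ticks_to_seconds` is a hypothetical helper and not part of the timertest script.

```python
HZ = 250  # kernel tick rate reported in this comment


def ticks_to_seconds(ticks: int, hz: int = HZ) -> float:
    """Convert a timer-interrupt count into the time it represents at `hz`."""
    return ticks / hz


# Ticks-per-step range reported by the timertest script.
for ticks in (251, 253):
    print(f"{ticks} ticks = {ticks_to_seconds(ticks):.3f} s")
```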

Situation seems to have improved, but let's see if ntpq can keep clocks synced 
after 24hrs of load - that would be the first time on these systems since they 
were upgraded to the X2-cpus.
Comment 61 Frank van Maarseveen 2005-09-21 13:38:18 UTC
Created attachment 6080
Config of failing i386 system
Comment 62 Frank van Maarseveen 2005-09-21 13:44:26 UTC
Created attachment 6081
dmesg of failing i386 system + clock=pit workaround
Comment 63 john stultz 2005-09-21 13:51:00 UTC
Frank: Could you try enabling ACPI and the ACPI PM timer in your .config?
Comment 64 Frank van Maarseveen 2005-09-21 14:35:04 UTC
I tried a new i386 config but clock=hpet and clock=pmtmr still revert
to PIT and "notsc" still uses the TSC (anomalies confirmed again). Both
CONFIG_HPET_TIMER and CONFIG_X86_PM_TIMER are now set. Diff with previous
config (=attachment id=6080):

> CONFIG_ACPI=y
> CONFIG_ACPI_BOOT=y
> CONFIG_ACPI_INTERPRETER=y
> CONFIG_ACPI_FAN=y
> CONFIG_ACPI_PROCESSOR=y
> CONFIG_ACPI_THERMAL=y
> CONFIG_ACPI_BLACKLIST_YEAR=0
> CONFIG_ACPI_BUS=y
> CONFIG_ACPI_EC=y
> CONFIG_ACPI_POWER=y
> CONFIG_ACPI_PCI=y
> CONFIG_ACPI_SYSTEM=y
> CONFIG_X86_PM_TIMER=y
> CONFIG_PCI_MMCONFIG=y

Comment 65 Nathan Becker 2005-09-22 01:41:38 UTC
I've been running a huge number of calculations lately and the timing 
problem has appeared again while running the patch from comment #36.  It 
seems that patch greatly improves the situation, but does not totally fix 
it, as some people on this thread seem to already have surmised.

It has taken almost 3 days of constantly loading both cores for this to 
reappear.  So far the only message is

Losing some ticks... checking if CPU frequency changed.

I will continue to keep an eye on it.

Nathan

Comment 66 john stultz 2005-09-22 11:06:09 UTC
Nathan: What exactly was the problem you saw in your last post? Was it just the
lost-ticks message? That isn't a critical message, as lost ticks do occasionally
occur on many systems without effect.
Comment 67 Nathan Becker 2005-09-22 11:25:55 UTC
Yes, the message was:

Losing some ticks... checking if CPU frequency changed.

There are no other symptoms that I've noticed so far.  So I guess it's 
fine, I was just trying to be thorough in my bug report.  The clock seems 
to be OK, although it gained about 12 seconds over a 3 day period - not 
ideal but not a big problem.
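To put that drift in perspective, it can be expressed in parts per million (ppm), the unit NTP tools normally use for frequency error. A minimal sketch using only the figures from this comment (12 seconds gained over 3 days); `drift_ppm` is a hypothetical helper name.

```python
def drift_ppm(gained_seconds: float, elapsed_seconds: float) -> float:
    """Clock drift as parts per million of elapsed time."""
    return gained_seconds / elapsed_seconds * 1e6


three_days = 3 * 24 * 60 * 60  # 259200 seconds
print(f"{drift_ppm(12, three_days):.1f} ppm")
```

This works out to roughly 46 ppm, comfortably inside the 500 ppm frequency error that ntpd is generally able to correct.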

Comment 68 Gustavo Ribeiro Alves 2005-09-25 09:00:52 UTC
Please take a look at this disturbing message someone posted on gentoo forums:

----- BEGIN QUOTES -----
I think that the problem is more basic. It has to be in the kernel's handling of
time. So I'm in the same boat as Entropias
Entropius wrote:
In my case it's not just error messages -- the clock is losing about twenty
minutes a day, making me late for work yesterday.

I am sitting in an office with 5 PC's. All of them are synched with ntpd to the
same time source at boot time. By the end of the day - each computer has a
different time showing.

They range from an ancient P100 running on the Intel 430TX chipset - to an AMD
751 [Irongate] - to a P4 on Intel 82850 (Tehama). Some use APM and some use ACPI
and some use nothing.

About all that they have in common is that they cannot keep time.

All have been booted at least once in the past three days. The AMD has lost
about 30 minutes and the P100 has lost about 20 (compared against the P4 which I
am using as a benchmark here). Is the P4 correct? It is today! But I have
noticed that after a full-system emerge even my P4 can be up to an hour off.

It VERY MUCH APPEARS that the more I emerge - the slower my clocks get. I have
not tested to see if it is "emerge" itself or, more likely, the compiler taking
up so many cycles that the clock wonks out.

So either there is something that I am setting up incorrectly in my kernels
(various kernels in play here folks) or "Houston, we have a problem". It sure
doesn't appear to be processor or chipset or APM related because NONE of my
systems have these things in common. Maybe I have 5 different problems - but if
that is the case and getting an accurate clock is that impossible, then we
have a BIG problem with Gentoo! 
----- END QUOTES -----

My system was also running for three days w/o boot. Yesterday I noticed a 20min
clock drift. I AM USING the notsc option.
Comment 69 john stultz 2005-09-26 11:21:57 UTC
Gustavo: Unacceptable time drift sounds like a different issue than what this
bug is covering. Would you mind filing a new bug describing the time drift you
are seeing as well as your NTP settings?

thanks.
Comment 70 john stultz 2005-09-26 11:26:41 UTC
Frank: Could you attach dmesg output with the CONFIG_X86_PM_TIMER  enabled
config? You may be in the unfortunate (and very rare) situation of having a
motherboard that does not support ACPI PM or HPET but supports dualcore AMD
cpus. clock=pit may be the only realistic workaround for you.
Comment 71 Frank van Maarseveen 2005-09-26 11:48:31 UTC
Created attachment 6163
dmesg of failing i386 system + clock=pit + CONFIG_X86_PM_TIMER

Lots of dmesg changes due to ACPI enabling. PM related diff seems small:
< apm: disabled - APM is not SMP safe (power off active).
---
> apm: overridden by ACPI.
Comment 72 john stultz 2005-09-26 11:58:00 UTC
Frank: Hmm. Unfortunately it does look like your system does not support HPET or
ACPI PM. clock=pit is likely going to be your best short term solution. 

You might want to double check that you've got the latest BIOS and if so,
contact your motherboard vendor and see if they might enable HPET or ACPI PM
support in the next revision.
Comment 73 john stultz 2005-09-26 12:42:32 UTC
With the confirmation from the AMD folks, the patch in comment #49 should be the
correct solution here. Marking this as resolved, code-fix. I'm going to re-submit
the patch to Andrew.
Comment 74 Gustavo Ribeiro Alves 2005-09-26 20:26:34 UTC
john: I will ask for the guy that posted on gentoo.org to open the bug. Since I
use a gentoo kernel tainted w/ ati binary-only drivers, and the bug seems only
to appear after more than 3 days of uptime, it will be hard for me to try to
reproduce it with a "clean" kernel (I need this machine running w/ the binary
drivers)
Comment 75 Dirk Vromans 2006-01-06 06:20:07 UTC
The problem seems to be solved.
I have an Asus A8N-E motherboard with an AMD 3800+ X2 CPU, and with the latest
2.6.15 kernel the problem seems to be gone. Even after a system stress test
the clock is still normal. I also haven't got any errors in the logs.
Seems to be fixed!
Comment 76 Pierre Baldensperger 2006-08-13 09:26:42 UTC
Hi,

Just a quick warning that this bug may have re-appeared
in the latest kernel update 2.6.16.21-0.13-smp (IA32)
with SuSE Linux 10.1 on AMD x2.

Same symptoms showed up after update. Had to use option
"clock=pit" to make them (apparently) disappear. Option
"notsc" ended up with kernel panic and crash.

dmesg | grep -i tsc says:
   checking TSC synchronization across 2 CPUs:
   CPU#0 had 0 usecs TSC skew, fixed it up.
   CPU#1 had -2136799 usecs TSC skew, fixed it up.
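For anyone comparing several machines, the skew lines above can be pulled out of dmesg output mechanically. The sketch below is only an illustration: the regular expression is written against the exact wording quoted in this comment and may not match other kernel versions, and `parse_tsc_skew` is a hypothetical helper.

```python
import re

# Matches lines such as "CPU#1 had -2136799 usecs TSC skew, fixed it up."
SKEW_RE = re.compile(r"CPU#(\d+) had (-?\d+) usecs TSC skew")


def parse_tsc_skew(dmesg_text: str) -> dict[int, float]:
    """Map each CPU number to its reported TSC skew, converted to seconds."""
    return {
        int(cpu): int(usecs) / 1e6
        for cpu, usecs in SKEW_RE.findall(dmesg_text)
    }


# Sample text copied from this comment.
sample = """\
checking TSC synchronization across 2 CPUs:
CPU#0 had 0 usecs TSC skew, fixed it up.
CPU#1 had -2136799 usecs TSC skew, fixed it up.
"""

for cpu, skew in parse_tsc_skew(sample).items():
    print(f"CPU#{cpu}: {skew:+.3f} s")
```

Read this way, the second core's TSC was reported as about 2.1 seconds behind the first at boot.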

Sorry if this comment has nothing to do here,

-Pierre.