Bug 5105
Description
Nathan Becker
2005-08-21 12:47:36 UTC
Created attachment 5709 [details]
dmesg from running in console
Created attachment 5710 [details]
dmesg from running in x
bugme-daemon@kernel-bugs.osdl.org wrote:
>
> http://bugzilla.kernel.org/show_bug.cgi?id=5105
>
> Summary: lost ticks - hang check - after loading the CPU
> Kernel Version: 2.6.12.5

Lots of people seem to be reporting that their clocks are running way too fast. Did we get to the bottom of this?

I don't know. I did pretty extensive searches on the kernel mailing list, posted some messages, and then John Stultz suggested I open a bugzilla report. I definitely found many people reporting similar problems, but no definitive solution. Some people said that using the kernel option no_timer_check fixes it, but it doesn't work for me. In fact no_timer_check makes it significantly worse.

Reply-To: ak@muc.de

On Sun, Aug 21, 2005 at 12:51:16PM -0700, Andrew Morton wrote:
> bugme-daemon@kernel-bugs.osdl.org wrote:
> >
> > http://bugzilla.kernel.org/show_bug.cgi?id=5105
> >
> > Summary: lost ticks - hang check - after loading the CPU
> > Kernel Version: 2.6.12.5
>
> Lots of people seem to be reporting that their clocks are running
> way too fast. Did we get to the bottom of this?

There are different classes of bugs:

- ATI: still being worked on
- Nvidia: some new systems have developed problems. totally mysterious still. The strange thing is that it works for some people. e.g. I got a report that the new sun single CPU opteron box had this problem, but then for another user it ran just great with a similar kernel.
- AMD 8111: one report that it has problems now. totally surprising because it has always worked great for me (it's the kind of reference platform for x86-64)

Not much progress unfortunately on any of them.

-Andi

Does the behaviour change if you boot a uniproc kernel?

> Does the behaviour change if you boot a uniproc kernel?
Do you mean just remove SMP support? I will try that tonight and get
back to you.
As far as I can tell, this problem does not occur on an identical kernel with SMP disabled. I ran the CPU at 100% overnight and the clock was stable. I also turned on hangcheck and there were no messages. This was with x.org running.

Hi,

I'm following this thread also since I'm experiencing about the same issues mentioned by Nathan. I'm running an Athlon64 3700+ 2.2GHz with 2GB RAM on an MSI Neo4FI board and an ATI PCI-E video card. I do not experience the 'speeding console' issues but do see the kernel/dmesg messages about the lost ticks. I'm also running x.org with (Debian) kernel 2.6.11-9-amd64-k8.

Another problem I'm experiencing is that I'm unable to compile a kernel, since it segfaults at a random place during the compile. I don't know if it could be related (I checked memory for a whole night with memtest86+).

If any more input is needed please tell me.

Thanks,
Roel.
Email: roelATroel.net

snippet:
warning: many lost ticks.
Your time source seems to be instable or some driver is hogging interupts
rip default_idle+0x20/0x30

Hi,

Ok, my kernel-build issue seems to be solved when forcing gcc-3.3 as the compiler. I was building with gcc-4 (debian/sid).

Thanks,
Roel.

From cc -v my compiler version is

Reading specs from /usr/lib/gcc/x86_64-slackware-linux/3.4.4/specs
Configured with: ../gcc-3.4.4/configure --prefix=/usr --enable-shared --enable-threads=posix --enable-__cxa_atexit --disable-checking --with-gnu-ld --verbose --target=x86_64-slackware-linux --host=x86_64-slackware-linux
Thread model: posix
gcc version 3.4.4

Here's a more effective (although probably obvious) workaround for the keyboard repeat problem:

xset r off

This completely disables character repeat. It's sort of annoying if you type a lot like I do, but it does make the computer usable.

Also, I never posted the actual message from hangcheck. I don't know if hangcheck can give other messages so maybe this info is pointless, but the message is:

Hangcheck: hangcheck value past margin!
BTW, if anyone has any bleeding edge patches for this issue, then I'm happy to try them out.

Nathan

This is definitely happening to me, too. It's an nForce3 (Shuttle SN25P) motherboard with an X2 4600+, running an SMP kernel. I don't know offhand if it happened on a UP kernel, but I doubt it. I've attached a dmesg. Here's the errors with times attached, for comparison:

Aug 25 12:44:39 caradoc kernel: input: USB HID v1.10 Device [Microsoft Natural Keyboard Pro] on usb-0000:00:02.0-2.1
Aug 25 14:43:30 caradoc kernel: Losing some ticks... checking if CPU frequency changed.
Aug 25 15:01:58 caradoc kernel: warning: many lost ticks.

Could this have something to do with the "Nvidia board detected, ignoring ACPI timer override"? I don't know when that was added, but perhaps it's obsolete on some current boards...

Created attachment 5784 [details]
Dmesg for another affected system.
Well here's an interesting datapoint. I've been booting with "apic" on the command line, which overrides the Nvidia quirk. Here's the diff in dmesg:

-Nvidia board detected. Ignoring ACPI timer override.
-ACPI: BIOS IRQ0 pin2 override ignored.
+ACPI: IRQ0 used by override.
+ACPI: IRQ2 used by override.
+..MP-BIOS bug: 8254 timer not connected to IO-APIC
+ failed.
+timer doesn't work through the IO-APIC - disabling NMI Watchdog!
+Uhhuh. NMI received for unknown reason 2d.
+Dazed and confused, but trying to continue
+Do you have a strange power saving mode enabled?
-testing NMI watchdog ... OK.
+testing NMI watchdog ... CPU#0: NMI appears to be stuck (11->11)!
-Losing some ticks... checking if CPU frequency changed.
-warning: many lost ticks.
-Your time source seems to be instable or some driver is hogging interupts
-rip default_idle+0x22/0x30

i.e. it complains, something is definitely wrong, but whatever time source it's falling back to seems to work OK. I haven't had the clock speed up since I started using "apic". The only substantial difference in /proc/interrupts is that the timer interrupt isn't marked as connected to the APIC anymore:

  0: 312519 38470 XT-PIC timer

Never mind... five hours later it showed up again.

[Andi, hope you don't mind being added to CC]

Anyone have suggestions on how to debug this? It's not powernow (I don't even have it loaded). I'm willing to test about anything at this point :-) The timer runs fine for a while, until something triggers the problem - sometimes CPU load, sometimes no apparent cause. After that the timer goes very wonky. It's kind of entertaining to try to type with key repeat enabled, though.

I have also seen the 'fast time' problem on my system:

2.6.13 + mppe patch
MSI neo4 platinum (Nforce4)
Athlon64x2 3800+
Nvidia 6600 video card

I upgraded to the latest bios (version 1.8), and this seemed to make the problem go away.
I am still seeing the 'lost tick' messages (no powernow or cpufreq compiled in) in all of the configs I have tried (noapic, noacpi, noacpi noapic, apic). I am running with the 'report_lost_ticks' option all the time.

Doing some compute and/or network traffic reliably gets this failure within a minute or so. I have been scp'ing some large files - I have never copied 600 MBytes before this fails. Running rsync over a WAN connection also causes this problem, so it is not high traffic rate related.

I see this problem with or without the nvidia driver (nv doesn't seem to support my card), and using either the nvidia ethernet or the marvel yukon pci-e on board.

My remaining problem is somewhat off-topic for this bug, so I may open a new one.

Roy

Here's a revised version of a script written by Frank van Maarseveen and posted to lkml:

#!/bin/sh
# Sample /proc/interrupts one second apart, 100 times, and print the
# IRQ 0 (timer) ticks seen on CPU0 and CPU1 plus the wall-clock time elapsed.
for i in `yes|head -100`
do
	time1=`date '+%s.%N'`
	s1=`cat /proc/interrupts`
	sleep 1
	time2=`date '+%s.%N'`
	s2=`cat /proc/interrupts`
	# Columns 2 and 3 of the "0:" row are the CPU0 and CPU1 counts.
	t10=`echo "$s1" | awk '$1=="0:"{ print $2}'`
	t11=`echo "$s1" | awk '$1=="0:"{ print $3}'`
	t20=`echo "$s2" | awk '$1=="0:"{ print $2}'`
	t21=`echo "$s2" | awk '$1=="0:"{ print $3}'`
	d1=`expr $t20 - $t10`
	d2=`expr $t21 - $t11`
	# Needs the external "calc" program for the fractional subtraction.
	echo $d1 + $d2 = `expr $d1 + $d2` `calc $time2 - $time1`s
done | cat -n

This shows the number of timer interrupts elapsed on each CPU, and the total time elapsed according to gettimeofday, every second. It's not very accurate but it's accurate enough to show the problem. I've verified some interesting properties:

- A normal second has about a thousand timer ticks. I built with HZ=1000 (don't remember why) so this is what I'd expect.
- A normal second has all its timer interrupts delivered to CPU1.
- There are more and worse "bad" seconds under load than when idle.
- A bad second will show 1s elapsed via gettimeofday but substantially fewer timer ticks. I didn't verify that they were actually less than a second but Frank ran similar tests using rsh and got the expected results - they're actually short.
- A bad second will show less than a thousand ticks delivered to CPU1, and a few (but not enough to make up the difference) ticks delivered to CPU0.

For example:

     1  1 + 919 = 920 1.006286s
     2  0 + 1006 = 1006 1.00587s
     3  0 + 1007 = 1007 1.00634s
     4  6 + 738 = 744 1.004804s
     5  2 + 830 = 832 1.007727s
    42  0 + 1007 = 1007 1.00585s
    43  0 + 1007 = 1007 1.006391s
    44  1 + 918 = 919 1.00729s
    45  5 + 885 = 890 1.061896s
    46  0 + 1006 = 1006 1.005736s
    47  0 + 1006 = 1006 1.004535s
    48  0 + 1007 = 1007 1.005311s

So this makes me wonder whether the timer interrupts are supposed to be load balanced to both CPUs (presumably they are and that's not the problem), what's causing them to, and whether the PM timer would work better than the PIT/TSC-based timing. Since I don't think I have an HPET that's my only other option.

Hmm, there's always a substantial correction at boot when synchronizing the TSCs. If the two TSCs drift, that would explain the constantly "lost" ticks...

Here's what it looks like if I boot with notsc:

     1  0 + 1005 = 1005 1.005261s
     2  0 + 1007 = 1007 1.005885s
     3  3 + 1003 = 1006 1.005783s
     4  1 + 1006 = 1007 1.006228s
     5  0 + 1007 = 1007 1.006273s
     6  6 + 1000 = 1006 1.006229s
     7  0 + 1005 = 1005 1.005498s
     8  0 + 1006 = 1006 1.005203s
     9  1 + 1005 = 1006 1.005421s

I'll give it a day or so to work and see if this holds, but it looks much better.

Nathan: Does booting w/ notsc improve the situation for you?

The 'notsc' fix has worked for me on my troubled box. Similar hardware as above:

AMD Athlon 64 X2 3800+
Gigabyte GA-K8N Pro-SLI (nForce 4)

Tried various kernels, up to 2.6.13-mm1, losing ticks on them all, error messages as above. Console felt mostly "fine", but when I hopped into KDE and started copying stuff over SSH (fish://) keyboard input repeat would get worse... and then bad... and then _really_ bad...
tapping a key would result in ~10 keypresses. Running the tick calculation script verified that as time passed, more ticks got lost, and with an increasing amount of variance.

Anyway, 'notsc' is a working fix for me too. Thanks.

Booting with the 'notsc' option (kernel 2.6.13, HZ=100) fixes the lost tick messages I was getting. I can now run scp several times without getting any lost tick messages. What are the side effects of the notsc option?

Thanks for checking that. Sounds like these dualcore systems are running w/ unsynced TSCs. Using notsc should force you to fall back to the HPET or ACPI PM timer for timekeeping. Would you mind posting a dmesg w/ notsc to see which timesource you end up with?

I'm not certain this is the same issue that the original bug-report discusses, but I'll wait to see if Nathan can confirm or deny.

Sorry for the delay in my response. I've been in the process of moving to a new apartment. I tried notsc and it seems to fix both hangcheck and lost ticks - at least at first glance. I would like to run some serious calculations overnight to be sure that it's fixed. So far, it looks very promising. Thanks to everyone who worked on this!

Nathan

Created attachment 5958 [details]
dmesg from boot with notsc on x2 3800+, 2.6.13 + mppe patch
Here is the dmesg from my machine, booted with the notsc option.
So far I have seen no lost ticks, but I have not been using the machine
heavily.
I will try to run more stuff this weekend.
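
For anyone comparing boots, the timekeeping lines can be pulled out of a dmesg with a small filter. This is only a sketch: the helper name timesource_lines is made up for illustration, and the patterns simply match the message prefixes quoted in this report ("time.c: Using ..." on x86-64, "Using ... for high-res timesource" on i386). Other kernel versions may word these messages differently.

```sh
#!/bin/sh
# timesource_lines (hypothetical helper): read dmesg text on stdin and keep
# only the lines that name the timekeeping source. The two patterns are the
# message prefixes quoted in this bug report, not an exhaustive list.
timesource_lines() {
    grep -E 'time\.c: Using|Using [a-z]+ for high-res timesource'
}

# Typical use:
#   dmesg | timesource_lines
#   timesource_lines < saved-dmesg.txt
```

A line such as "time.c: Using PIT/TSC based timekeeping." after booting with notsc would suggest the option did not take effect on that kernel.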
Created attachment 5960 [details]
2.6.13 SMP on Athlon X2: nanosleep returning way too soon, clock_gettime(CLOCK_REALTIME...) proceeding too fast

My posting to LKML. The script has already made it to bugzilla (comment #18).

Additional info: The nanosleep problem seems gone for now (for no apparent reason) but there are still many other subtle timing problems: random keyboard repeats under X (problem becomes manageable by restricting Xorg to run on only one CPU using the taskset command) and apparently a not-so smooth smoothscroll in mozilla. Booting with "nosmp" fixes it.

I'll try "notsc"

I'd be happy to try a patch.

nope -- "notsc" does not fix it for me: still random keyboard repeats

kernel: notsc: Kernel compiled with CONFIG_X86_TSC, cannot disable TSC

tried disabling TSC by patching arch/i386/Kconfig because make oldconfig reverted it to =y again. Result _still_ didn't work. tried "notsc" in addition but then kernel hangs right after mounting /.

Frank: I believe the other submitters have been dealing with x86-64. If you're using an i386 kernel, maybe we need to open a new bug?

in my .xinitrc I have:

xset s 300 # black after 5 min
xset dpms 0 0 310 # off after 5 minutes 10 sec

but occasionally the screen goes black after maybe 20 seconds or so... I first thought a fuse was blown.

The random keyboard repeat remains but is manageable. The nanosleep/clock_gettime(CLOCK_REALTIME...) problem is still gone (might be BIOS setting related) and the script for counting per-CPU timer interrupts is unable to detect any time anomalies right now. Even visually all "sleep 1" commands really seem to take a second.

I have never seen "lost ticks" or extreme speedups. The common denominator is "time" but the symptoms vary wildly, apparently even on the same machine such as mine. Another common denominator seems to be the AMD Athlon X2 with SMP, for both i386 and x86_64.

John: I think it has to do with hardware initialization and i386/x86_64 do not really matter. Even BIOS PnP OS versus no PnP OS seemed to make a difference.

Next time I'll start a discussion on LKML because that's a better place. IMO.

Created attachment 5966 [details]
2.6.13 SMP on AMD Athlon X2 (i386): time anomalies
posted on lkml
Created attachment 5971 [details]
Another machine w/ the problem
I'm also having the problem. I'm going to try the notsc option and see what
happens.
Hardware
ASUS A8N SLI Premium
CPU AMD64 X2 3800+
RAM 2G 2-3-3-5
Video ASUS nVidia 6600
HDD RAID 1 /boot RAID 0 /

Software
Kernel gentoo-source-2.6.13
SMP enable
Powernow enable
No X server

After using notsc in grub.conf for some time, my system no longer shows lost tick messages in dmesg. I will test it more to be 100% sure.

Nope, after more testing I have the message

Losing some ticks... checking if CPU frequency changed.

:(

Created attachment 5993 [details]
Drop single-socket synch assumption
Andi: It looks like we're finally running into live SMP cpufreq systems.
We might need to drop the single-socket assumption in unsynchronized_tsc(). The
patch attached does this. Your thoughts?
I've been running the system w/ the notsc option since I posted the last message, and the lost tick messages (and subsequent problems) seem to have disappeared.

Frank van Maarseveen: Could you attach a dmesg of i386 kernel on your hardware?

using the 'notsc' option I get no lost tick messages anymore. I did many hours of scp copies which previously would exhibit lost ticks in less than 1 minute.

Roy

OK, I ran calculations all weekend and the last couple of days. When I first started doing stuff I was able to get a lost ticks message after about 15 hours of loading both CPU cores. That happened once. I rebooted, but now I am unable to reproduce it. My machine has been up for over 1 day with a constant load average > 2 and no lost ticks messages. It seems that either notsc works or makes it extremely difficult to reproduce the problem. Not sure which possibility is worse. Anyway I'll keep an eye on it, but I think this is a good fix for me for now.

Nathan

Nathan: Thanks for the testing! If you could, I'd appreciate it if you could try the patch from comment #36 just to verify that it automatically triggers the notsc setting on your box.

Configuration:

AMD64 X2 4400+
ASUS A8N-SLI mainboard
Nvidia Nforce 4 chipset
ATI Radeon video adapters
kernel 2.6.13-git12 nopreempt
AMD Cool&Quiet mode disabled

System had been running a single core AMD64 CPU without problems for approximately six months (with kernel 2.6.11). I installed a dual core CPU 1.5 weeks ago and immediately noted lost tick warnings and positive clock drift (i.e., clock too fast).

Booting with CONFIG_HZ 1000 results in random oopses (but typically when the SATA interface is being initialized). CONFIG_HZ == 250 and 100 don't oops but both exhibit clock drift and occasional lost tick warnings and key repeat flakiness.

Booting the same kernel config only UP instead of SMP results in a stable system with no clock drift. Running the UP kernel, ntpd clock stability stabilizes at 20 ppm. The SMP kernel never has stability better than 500 ppm and was normally greater than 1000 ppm.

Booting a SMP kernel results in 10 out of 11 boots reporting:

CPU 1: synchronized TSC with CPU 0 (last diff -82 cycles, maxerr 637 cycles)

The remaining boot was only slightly different:

CPU 1: synchronized TSC with CPU 0 (last diff -70 cycles, maxerr 613 cycles)

Even when booting with "clock=pmtmr" the kernel reports:

time.c: Using PIT/TSC based timekeeping.

Booting the SMP kernel with "notsc" results in a stable system (although as expected the clock stability reported by ntpd is significantly worse). Booting with the patch in comment #36 but without "notsc" also results in a stable system. In both cases the kernel reports:

time.c: Using PM based timekeeping.

There is definitely a correlation with clock problems and the timer interrupt being handled by the second CPU. That is, 99.99% of the time the timer is handled by CPU #1. If something causes the timer to be handled by CPU #0, ntpd complains and keyboard repeat starts acting flaky. This is consistent with the hypothesis that the TSCs of the two CPU cores drift and the PIT/TSC source is used.

It's not too surprising that this problem exists since the AMD64 X2 CPU has two essentially independent cores that share a memory bus. I haven't looked at the details of the implementation but I would expect each core to have an independent TSC. This is in contrast to a SMT CPU where much more logic is shared.

Kurtis: Thanks for the verification. As an aside, the "clock=" option is a i386 thing only.

Andi: Any feedback before I send this patch to lkml?

For what it's worth - I too am having this problem. I'm heading to the datacenter to try some of the suggestions made here. My hardware:

Dual Core Athlon 4400+
Asus A8n-SLC Premium
2.6.13.1 kernel

Selected lines From DMESG

time.c: Using 3.579545 MHz PM timer.
time.c: Detected 2211.363 MHz processor.
Calibrating delay using timer specific routine.. 4427.26 BogoMIPS (lpj=8854528)
CPU 0(2) -> Node 0 -> Core 0
mtrr: v2.0 (20020519)
Using local APIC timer interrupts.
Detected 12.564 MHz APIC timer.

Created attachment 6059 [details]
dmesg Asus A8N-SLI Premium Athlon 64 X2 4400+ 4g ram
Added my dmesg file
bugzilla.kernel.org lost all the data from yesterday. So I'm repopulating this from my email logs.

------- Additional Comments From nbecker@physics.ucsb.edu 2005-09-19 12:13 -------
I tried the patch in comment #36. It seems to work fine. Sorry for the delay in my response; I was running calculations that prevented me from rebooting last week.

------- Additional Comments From ak@suse.de 2005-09-19 12:35 -------
I asked AMD some time ago and they told me it was synchronized. The TSC on K8 is C state invariant, but not P state invariant, but P states always happen synchronized on dual cores. So I'm not quite convinced of your explanation yet. Most likely you work around some other bug by switching to pmtimer, or just changed the timing enough because pmtimer is incredibly slow. It would be better to find the other bug.

So - about the patch - is this a real fix or does it just mask the problem?

Created attachment 6060 [details]
tsc synchronization check
Here's a modified time consistency test to look for unsynced TSCs. If you
could, please run the program for a little while on any dualcore system seeing
this issue.
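
For anyone collecting counter readings by hand, backward steps can be counted with a small filter. This is a rough sketch and not the attached test program: the helper name count_backward_steps is invented here, and it assumes the readings arrive one unsigned integer per line, with any other lines (separator rows, for instance) ignored. awk stores numbers as doubles, so values above 2^53 would lose precision.

```sh
#!/bin/sh
# count_backward_steps (hypothetical helper): read counter values on stdin,
# one integer per line, and print how many times a value is lower than the
# numeric line before it. Non-numeric lines are skipped.
count_backward_steps() {
    awk '
        /^[0-9]+$/ {
            if (seen && $1 + 0 < prev) steps++
            prev = $1 + 0
            seen = 1
        }
        END { print steps + 0 }
    '
}

# Typical use (readings.txt is a placeholder file name):
#   count_backward_steps < readings.txt
```

A result of 0 only says that no reading went backward; it would not reveal a counter that merely runs slow on one core.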
------- Additional Comments From ak@suse.de 2005-09-19 13:53 -------
I don't think the program tests what you're looking for. First, if pstate/cstate is really wrong then the tsc would just run with a slower frequency and not go backward. Your test wouldn't detect that. And then you need to actually idle to see any c/p state problems.

Better would be if you port the tsc sync code from smpboot.c to run in user space. Then let the system run for a day or two to give it time to build up any different tscs and then run the tsc sync algorithm and see how much difference it reports. If it's small it's still ok because it's not fully accurate. But again I have my doubts.

------- Additional Comments From drow@false.org 2005-09-19 14:06 -------
Andi, would you expect any output from John's program at all? In fact it produces quite a lot.

[Note to self: only works when compiled as 32-bit, duh.]

------- Additional Comments From drow@false.org 2005-09-19 14:10 -------
Here's an example output. The box has thirteen days of uptime at the moment.

2775877798926041
2775877798926057
2775877798926073
2775877798926089
2775877798926105
--------------------
2775877798926121
2775877616214435
--------------------
2775877616214453
2775877616214469
2775877616214485
2775877616214510

------- Additional Comments From jonas@mysql.com 2005-09-19 22:31 -------
Hi, I have an AMD64 X2 and get "Lost tick". I can't get notsc to work, i.e. the kernel is using PIT/TSC even if I add notsc to the boot line.

Recovered data from email, comment number doesn't match up (should be comment #49)

------- Additional Comments From jonas@mysql.com 2005-09-19 22:54 -------
BTW: John's program from comment#48 produces output as soon as I start compiling on my machine

Marc: I believe the patch in comment #36 is the right fix, however Andi suspects it is something else since it goes against what he's been previously told. I'll update the bug if we find a different cause, but for now booting w/ notsc should get you around the issue (assuming you have HPET or ACPI PM support on your hardware).

For i386 on Athlon64 X2 the option "clock=pit" seems to be a workaround. The kernel reports "kernel: Using pit for high-res timesource" and all time anomalies seem gone. But still I see this:

Sep 20 21:02:50 iapetus kernel: Hangcheck: hangcheck value past margin!
Sep 20 21:05:50 iapetus kernel: Hangcheck: hangcheck value past margin!
Sep 20 21:08:50 iapetus kernel: Hangcheck: hangcheck value past margin!
Sep 20 21:11:50 iapetus kernel: Hangcheck: hangcheck value past margin!
Sep 20 21:14:50 iapetus kernel: Hangcheck: hangcheck value past margin!
Sep 20 21:17:50 iapetus kernel: Hangcheck: hangcheck value past margin!
Sep 20 21:20:50 iapetus kernel: Hangcheck: hangcheck value past margin!
Sep 20 21:23:50 iapetus kernel: Hangcheck: hangcheck value past margin!
Sep 20 21:26:50 iapetus kernel: Hangcheck: hangcheck value past margin!
Sep 20 21:29:50 iapetus kernel: Hangcheck: hangcheck value past margin!
Sep 20 21:32:50 iapetus kernel: Hangcheck: hangcheck value past margin!
Sep 20 21:35:50 iapetus kernel: Hangcheck: hangcheck value past margin!
Sep 20 21:38:50 iapetus kernel: Hangcheck: hangcheck value past margin!
Sep 20 21:41:50 iapetus kernel: Hangcheck: hangcheck value past margin!
Sep 20 21:44:50 iapetus kernel: Hangcheck: hangcheck value past margin!
Sep 20 21:47:50 iapetus kernel: Hangcheck: hangcheck value past margin!
Sep 20 21:50:50 iapetus kernel: Hangcheck: hangcheck value past margin!
Sep 20 21:53:50 iapetus kernel: Hangcheck: hangcheck value past margin!
Sep 20 21:56:50 iapetus kernel: Hangcheck: hangcheck value past margin!
Sep 20 21:59:50 iapetus kernel: Hangcheck: hangcheck value past margin!

Probably a different issue.

Frank: Do you have the HPET or ACPI PM timer enabled in your .config for i386?
I have an Athlon64 X2 on an ASUS A8V Deluxe board and I was suffering the same clock problem. Changing the time source did solve the problem, but I had to *disable* ACPI 2.0 Support in the BIOS for the PM timer to be properly detected (otherwise, the kernel continued to use the TSC timer even with the notsc parameter, and the clock was still bad). I now use the patch from comment #36 and my system runs just fine (without needing the notsc parameter, since it is what the patch does).

OK - I tried the patch and after 3 hours it is still working. And I upped the freq from 100 to 250 to really test it. So far so good. And it is using the PM timer now.

Applied the patch for unsynced TSCs, and it seems to work on one of my systems, but not completely on the other. Same mainboard, same bios, same distro, same kernel, same memory (Asus A8N-E, 1008, FC3, 2.6.13.2, 4G PC3200). Both systems run stable, but clock is still too fast on one of them. Both use PMtimer as source, and kernel-ticks are at 250Hz.

System one (AthlonX2-4400+, cpu model 35, stepping 2 - 2.2GHz, 2x1Mb): ntpq shows jitter <5ms.
System two (AthlonX2-4600+, cpu model 43, stepping 1 - 2.4GHz, 2x512k): ntpq shows increasing jitter (1hr after startup, 99% idle, around 40 ms jitter).

The little TSCsynctest gives output on both, but looks more like a counter overflowing than an actual problem.

--------------------
4294967292
11
--------------------

IRQbalance is active, timer-interrupts are handled by both CPU0 and CPU1, even distribution. The timertest-script shows interrupt jumping every 10 seconds, 251 to 253 ticks-per-step, and a good 1.004 to 1.009 seconds between. Quite alright.

Situation seems to have improved, but let's see if ntpq can keep clocks synced after 24hrs of load - that would be the first time on these systems since they were upgraded to the X2-cpus.

Created attachment 6080 [details]
Config of failing i386 system
Created attachment 6081 [details]
dmesg of failing i386 system + clock=pit workaround
Frank: Could you try enabling ACPI and the ACPI PM timer in your .config?

I tried a new i386 config but clock=hpet and clock=pmtmr still revert
to PIT and "notsc" still uses the TSC (anomalies confirmed again). Both
CONFIG_HPET_TIMER and CONFIG_X86_PM_TIMER are now set. Diff with previous
config (=attachment id=6080):
> CONFIG_ACPI=y
> CONFIG_ACPI_BOOT=y
> CONFIG_ACPI_INTERPRETER=y
> CONFIG_ACPI_FAN=y
> CONFIG_ACPI_PROCESSOR=y
> CONFIG_ACPI_THERMAL=y
> CONFIG_ACPI_BLACKLIST_YEAR=0
> CONFIG_ACPI_BUS=y
> CONFIG_ACPI_EC=y
> CONFIG_ACPI_POWER=y
> CONFIG_ACPI_PCI=y
> CONFIG_ACPI_SYSTEM=y
> CONFIG_X86_PM_TIMER=y
> CONFIG_PCI_MMCONFIG=y
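
A quick way to confirm that the timer-related options really ended up in the build is to grep the .config for them. The helper below is a sketch only: the name config_enabled is invented for illustration, and it simply reports whether each named option appears as "=y" in the given file (an option built as a module, "=m", would be reported as not =y).

```sh
#!/bin/sh
# config_enabled (hypothetical helper): for each option name given after the
# config file path, report whether the file contains a line "<option>=y".
config_enabled() {
    cfg=$1
    shift
    for opt in "$@"; do
        if grep -q "^${opt}=y\$" "$cfg"; then
            echo "$opt: =y"
        else
            echo "$opt: not =y"
        fi
    done
}

# Typical use, with the two options mentioned in this comment:
#   config_enabled .config CONFIG_HPET_TIMER CONFIG_X86_PM_TIMER
```

Having both options set only means the drivers are compiled in; whether the kernel can use them still depends on what the BIOS exposes.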
I've been running a huge number of calculations lately and the timing problem has appeared again while running the patch from comment #36. It seems that patch greatly improves the situation, but does not totally fix it, as some people on this thread seem to already have surmised. It has taken almost 3 days of constantly loading both cores for this to reappear. So far the only message is

Losing some ticks... checking if CPU frequency changed.

I will continue to keep an eye on it.

Nathan

Nathan: What exactly was the problem you saw in your last post? Was it just the lost-ticks message? That isn't a critical message, as lost ticks do occasionally occur on many systems without effect.

Yes, the message was:

Losing some ticks... checking if CPU frequency changed.

There are no other symptoms that I've noticed so far. So I guess it's fine, I was just trying to be thorough in my bug report. The clock seems to be OK, although it gained about 12 seconds over a 3 day period - not ideal but not a big problem.

Please take a look at this disturbing message someone posted on gentoo forums:

----- BEGIN QUOTES -----
I think that the problem is more basic. It has to be in the kernel's handling of time. So I'm in the same boat as Entropias

Entropius wrote: In my case it's not just error messages -- the clock is losing about twenty minutes a day, making me late for work yesterday.

I am sitting in an office with 5 PC's. All of them are synched with ntpd to the same time source at boot time. By the end of the day - each computer has a different time showing. They range from an ancient P100 running on the Intel 430TX chipset - to an AMD 751 [Irongate] - to a P4 on Intel 82850 (Tehema). Some use APM and some use ACPI and some use nothing. About all that they have in common is that they cannot keep time. All have been booted at least once in the past three days. The AMD has lost about 30 minutes and the P100 has lost about 20 (compared against the P4 which I am using as a benchmark here).

Is the P4 correct? It is today! But I have noticed that after a full-system emerge even my P4 can be up to an hour off. It VERY MUCH APPEARS that the more I emerge - the slower my clocks get. I have not tested to see if it is "emerge" itself or, more likely, the compiler taking up so many cycles that the clock wonks out.

So either there is something that I am setting up incorrectly in my kernels (various kernels in play here folks) or "Houston, we have a problem". It sure doesn't appear to be processor or chipset or APM related because NONE of my systems have these things in common. Maybe I have 5 different problems - but if that is the case and getting an accurate clock is that impossible, then we have a BIG problem with Gentoo!
----- END QUOTES -----

My system was also running for three days w/o boot. Yesterday I noticed a 20min clock drift. I AM USING the notsc option.

Gustavo: Unacceptable time drift sounds like a different issue than what this bug is covering. Would you mind filing a new bug describing the time drift you are seeing as well as your NTP settings? thanks.

Frank: Could you attach dmesg output with the CONFIG_X86_PM_TIMER enabled config? You may be in the unfortunate (and very rare) situation of having a motherboard that does not support ACPI PM or HPET but supports dualcore AMD cpus. clock=pit may be the only realistic workaround for you.

Created attachment 6163 [details]
dmesg of failing i386 system + clock=pit + CONFIG_X86_PM_TIMER

Lots of dmesg changes due to ACPI enabling. PM related diff seems small:

< apm: disabled - APM is not SMP safe (power off active).
---
> apm: overridden by ACPI.

Frank: Hmm. Unfortunately it does look like your system does not support HPET or ACPI PM. clock=pit is likely going to be your best short term solution. You might want to double check that you've got the latest BIOS and if so, contact your motherboard vendor and see if they might enable HPET or ACPI PM support in the next revision.

With the confirmation from the AMD folks, the patch in comment #49 should be the correct solution here. Marking this as resolved, code-fix. I'm going to re-submit the patch to Andrew.

john: I will ask the guy that posted on gentoo.org to open the bug. Since I use a gentoo kernel tainted w/ ati binary-only drivers, and the bug seems only to appear after more than 3 days of uptime, it will be hard for me to try to reproduce it with a "clean" kernel (I need this machine running w/ the binary drivers)

The problem seems to be solved. I have an Asus A8N-E motherboard with an AMD 3800+ X2 CPU and with the latest 2.6.15 kernel the problem seems to be gone. Even after a system stresstest the clock is still normal. I also haven't got any errors in the logs. Seems to be fixed!

Hi,

Just a quick warning that this bug may have re-appeared in the latest kernel update 2.6.16.21-0.13-smp (IA32) with SuSE Linux 10.1 on AMD x2. Same symptoms showed up after update. Had to use option "clock=pit" to make them (apparently) disappear. Option "notsc" ended up with kernel panic and crash.

dmesg | grep -i tsc says:

checking TSC synchronization across 2 CPUs:
CPU#0 had 0 usecs TSC skew, fixed it up.
CPU#1 had -2136799 usecs TSC skew, fixed it up.

Sorry if this comment has nothing to do here,

-Pierre.