Subject : The never ending BEEEEP/__smp_call_function_mask with 2.6.25-rc7
Submitter : Chr <chunkeey@web.de>
Date : 2008-03-30 21:09
References : http://lkml.org/lkml/2008/3/30/87

This entry is being used for tracking a regression from 2.6.24. Please don't close it until the problem is fixed in the mainline.
Created attachment 15531 [details]
some logs

If the kernel log ends with some ATA garbage, don't worry: that was because of the "nolapic" boot parameter.
Created attachment 15542 [details]
let's do some physics!

Ouch, grep for the jiffies and compare... I put some "date" marks in here and there for the machine that was logging on the serial console...
Created attachment 15546 [details] my config - no comment - ;-)
Created attachment 15553 [details] 2.6.24.4 dmesg
Created attachment 15554 [details] /proc/timer_list of 2.6.24.4
Regressions list annotation: Handled-By : Thomas Gleixner <tglx@linutronix.de>
Any news? If not, then we should close this report/WILL_FIX_LATER... after all, I'm away and I won't be able to do anything until May.

Regards,
Chr
At http://lkml.org/lkml/2008/4/9/89 Christian said: It's still there in 2.6.25-rc8-git7. The workarounds so far: either disable chronyd (NTP-Daemon: my system clock is a bit too fast: ~ -0.879 seconds) or "noapictimer" parameter.
> ------- Comment #8 from rjw@sisk.pl 2008-04-09 13:45 -------
> At http://lkml.org/lkml/2008/4/9/89 Christian said:
>
> It's still there in 2.6.25-rc8-git7. The workarounds so far:
> either disable chronyd (NTP-Daemon: my system clock
> is a bit too fast: ~ -0.879 seconds) or "noapictimer" parameter.

noapictimer is the correct solution. I just have no idea how we can autodetect this wreckage without a pretty intrusive patch. I'll cook something up for .26.

This looks like the known AMD X2 C1E problem, but it seems the CPUs do not have the C1E bit set. Maybe another magic BIOS trick to annoy us.

Thanks,
tglx
Hmm, it's already the latest BIOS (1303)... so updates won't fix it :(
So, ping me if you have a test patch.

Regards,
Chr.
Confirmed to be present in 2.6.25-rc9. References : http://lkml.org/lkml/2008/4/13/243
Hi,

it took me some time to find this report (ever since the release of .25 -- I do not use the -rc's), mainly because I couldn't find the root of the problem. I have the same problem: when chrony runs as a service, it takes approx. 20 minutes to 2 hours and then my machine freezes. IMO this time is related to the time chrony needs to adjust to the clock drift. I experienced the never ending beep only once. Maybe a different way to trigger this is to change the clocksource, but I'm not really sure about this.

The hardware is a bit different (Gigabyte GA-MA770-DS3 with AMD770+SB600) but it's an Athlon X2 too. I already pulled linux-git and am trying to bisect this now, but this can take some time (first use of git-bisect).

regards,
frank
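For anyone who wants to note which clocksource a machine is using before trying to reproduce the freeze, here is a minimal sketch. It assumes the kernel exposes the usual sysfs files `current_clocksource` and `available_clocksource` under `/sys/devices/system/clocksource/clocksource0`; the function name `clocksource_info` is made up for this illustration and is not something used in this report.

```python
import os

# Assumed default location of the clocksource sysfs files.
DEFAULT_DIR = "/sys/devices/system/clocksource/clocksource0"


def clocksource_info(base_dir=DEFAULT_DIR):
    """Return (current, available) clocksources read from sysfs-style files."""
    with open(os.path.join(base_dir, "current_clocksource")) as f:
        current = f.read().strip()
    with open(os.path.join(base_dir, "available_clocksource")) as f:
        # The file holds a single space-separated line of names.
        available = f.read().split()
    return current, available
```

Calling `clocksource_info()` on the affected box and adding the result to a report would make it easier to compare runs with and without HPET.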
@Frank, does the noapictimer kernel parameter help for you too?

I'm still bisecting it... The _regression_ seems to have sneaked in before 2.6.25-rc1...

Regards,
Chr
(In reply to comment #13)
> @Frank, does the noapictimer kernel parameter help for you too?
>

It seems to help (so far). The last kernel I booted this morning without nolapic_timer (2.6.24-git) froze after 86 minutes. Every time this happens, chrony shows weird numbers (time jumps, no offset) in tracking.log. The same kernel has now been running for 58 minutes with nolapic_timer, but it shows some strange hangs (3-5 seconds) in text mode, mostly when switching between terminals.

bye,
frank

PS: I do not use x86_64; I'll attach my config.
Created attachment 15866 [details] config from 2.6.24-git
Hmm... you say "nolapic_timer"? That's funny... because I couldn't get it to boot with this option, for some unknown reason in sata_nv. :\

But yes, X.org-to-VT-to-X.org switching slowed down... however, I don't know if it's really a kernel fault, since I'm using nvidia's driver.

Thanks,
Chr
(In reply to comment #16)
> hmm... you say "nolapic_timer"? that's funny... because I couldn't get to boot
> with this option, for some unknown reasons in sata_nv. :\

Ehm, yes. I chose nolapic_timer because noapictimer is x86_64-only. But it doesn't help anyway: the box froze again. This time it took longer. With this option the kernel complains that it cannot switch to high resolution mode.

I usually use nvidia too, but for bug hunting I run the box without X.
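Since the parameter names differ between architectures and are easy to mix up, here is a small sketch for checking which of the workaround parameters mentioned in this report a kernel was booted with. The helper names `timer_workarounds` and `read_cmdline` are hypothetical; the only assumption about the system is that the boot command line is readable from `/proc/cmdline`.

```python
# Workaround parameters that came up in this report.
KNOWN_FLAGS = ("noapictimer", "nolapic_timer", "nolapic")


def timer_workarounds(cmdline):
    """Return the known workaround flags present in a kernel command line."""
    # Compare whole tokens so that "nolapic" is not matched inside "nolapic_timer".
    tokens = cmdline.split()
    return [flag for flag in KNOWN_FLAGS if flag in tokens]


def read_cmdline(path="/proc/cmdline"):
    """Read the command line the running kernel was booted with."""
    with open(path) as f:
        return f.read().strip()
```

On a Linux box, `timer_workarounds(read_cmdline())` lists the flags that are active, which is worth stating alongside any freeze or uptime report.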
Ok, git-bisect spat out this:

---
9d8af78b07976d4d84e0df491abd4e9db848d0ad is first bad commit
commit 9d8af78b07976d4d84e0df491abd4e9db848d0ad
Author: Bernhard Walle <bwalle@suse.de>
Date:   Wed Feb 6 01:38:52 2008 -0800

    rtc: add HPET RTC emulation to RTC_DRV_CMOS

    That patch adds the RTC emulation of the HPET timer to the new
    RTC_DRV_CMOS. The old drivers/char/rtc.ko driver had that functionality
    and it's important on new systems.

    [akpm@linux-foundation.org: unbreak alpha build]
    Signed-off-by: Bernhard Walle <bwalle@suse.de>
    Cc: Alessandro Zummo <a.zummo@towertech.it>
    Cc: David Brownell <david-b@pacbell.net>
    Cc: Ingo Molnar <mingo@elte.hu>
    Cc: Thomas Gleixner <tglx@linutronix.de>
    Cc: Andi Kleen <ak@suse.de>
    Cc: john stultz <johnstul@us.ibm.com>
    Cc: Robert Picco <Robert.Picco@hp.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
---

and if I build a kernel w/o HPET (and HPET_EMULATE_RTC) the box doesn't freeze.

There's another thread on LKML with the same problem: http://lkml.org/lkml/2008/4/28/247

regards,
frank
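For readers new to bisecting, the idea behind the search can be sketched as a binary search over an ordered list of commits. This is only an illustration under simplifying assumptions: it treats history as linear (real git-bisect also handles merges), and `first_bad` and `is_bad` are hypothetical names. In this report, each `is_bad` check means booting a kernel and waiting up to a few hours to see whether the box freezes, which is why halving the range at every step matters.

```python
def first_bad(commits, is_bad):
    """Find the first bad commit in a linear history.

    commits is ordered oldest to newest; commits[0] is assumed known good
    and commits[-1] known bad. is_bad(commit) returns True if the commit
    shows the problem.
    """
    good, bad = 0, len(commits) - 1
    # Invariant: commits[good] is good, commits[bad] is bad.
    while bad - good > 1:
        mid = (good + bad) // 2
        if is_bad(commits[mid]):
            bad = mid
        else:
            good = mid
    return commits[bad]
```

With a few thousand commits between two releases, this takes roughly a dozen test boots instead of one per commit.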
Hmm, I thought I had disabled HPET_EMULATE_RTC once and it froze nonetheless...

Anyway, I'm running a 2.6.25-git17 now (with HPET_EMULATE_RTC enabled) and it seems to be stable even without the noapictimer workaround...
Frank, http://lkml.org/lkml/2008/5/12/132 has a patch related to this freeze. Can you give it a try ? Thanks, tglx
No, it doesn't help. It's the same thing as before. I tried it with 2.6.25.3. regards, frank
Frank, does the latest mainline still have the same problem?
I don't know, because I have only run kernels w/o hpet since my last comment. Some minutes ago I booted a fresh build of 2.6.26.3 with hpet enabled and so far it looks good. If this lasts for at least 2 days I'll report it here (despite the fact that .26 introduced a new (timing) regression that shows up even more with hpet enabled).
Nope, the box froze after ~3 hours with hpet enabled and chrony running.
Can you please upload a bootlog of that machine?
Created attachment 17628 [details] boot log of 2.6.26.3 with HPET enabled
Created attachment 17629 [details] config of 2.6.26.3 with HPET enabled
Hmm, nothing too scary in there. Can you please grab 2.6.27-rc5 and the patch from: http://bugzilla.kernel.org/attachment.cgi?id=17622 and check whether we made any progress ? Thanks, tglx
Looks good. It has an uptime of 28 hours now and runs quite smoothly compared to 2.6.26.

regards,
frank
Can you please provide the bootlog or full dmesg with that kernel?
Created attachment 17714 [details] full dmesg of 2.6.27-rc5 with hpet patch
Looks good. I'm closing that one. Patches are in 27-rc6 already. Thanks for your patience and help. tglx