Bug 10369

Summary: The never ending BEEEEP/__smp_call_function_mask with 2.6.25-rc7
Product: Platform Specific/Hardware
Reporter: Rafael J. Wysocki (rjw)
Component: x86-64
Assignee: platform_x86_64 (platform_x86_64)
Status: CLOSED CODE_FIX
Severity: normal
CC: chunkeey, fzachi, tglx
Priority: P1
Hardware: All
OS: Linux
Kernel Version: 2.6.25-rc7
Subsystem:
Regression: Yes
Bisected commit-id:
Bug Depends on:
Bug Blocks: 9832
Attachments: some logs
let's do some physics!
my config
2.6.24.4 dmesg
/proc/timer_list of 2.6.24.4
config from 2.6.24-git
boot log of 2.6.26.3 with HPET enabled
config of 2.6.26.3 with HPET enabled
full dmesg of 2.6.27-rc5 with hpet patch

Description Rafael J. Wysocki 2008-03-30 12:23:04 UTC
Subject    : The never ending BEEEEP/__smp_call_function_mask with 2.6.25-rc7
Submitter  : Chr <chunkeey@web.de>
Date       : 2008-03-30 21:09
References : http://lkml.org/lkml/2008/3/30/87

This entry is being used for tracking a regression from 2.6.24.  Please don't
close it until the problem is fixed in the mainline.
Comment 1 Christian Lamparter 2008-03-31 06:21:12 UTC
Created attachment 15531 [details]
some logs

If the kernel log ends with some ATA garbage... don't worry, that was because of the "nolapic" boot parameter.
Comment 2 Christian Lamparter 2008-03-31 14:53:28 UTC
Created attachment 15542 [details]
let's do some physics!

Ouch, grep for the jiffies and compare... I put some "date" marks here and there for the machine that was logging on the serial console...
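
The "grep for the jiffies and compare" check can be sketched roughly like this. It is only an illustration: the helper name `drift_seconds`, the default HZ of 250 and the sample numbers in the usage note are assumptions, not values taken from the attached log, and jiffies wrap-around is ignored.

```python
# Hypothetical sketch: compare elapsed jiffies against elapsed wall-clock time.
# HZ=250 is an assumed default; use the CONFIG_HZ of the kernel under test.

def drift_seconds(jiffies_start, jiffies_end, wall_start, wall_end, hz=250):
    """Return (kernel elapsed time - wall-clock elapsed time) in seconds.

    jiffies_* are raw jiffies counter readings taken from the log;
    wall_* are wall-clock timestamps in seconds from the machine that
    recorded the serial console. A result far from zero means the
    kernel's tick count and real time disagree.
    """
    kernel_elapsed = (jiffies_end - jiffies_start) / hz
    wall_elapsed = wall_end - wall_start
    return kernel_elapsed - wall_elapsed
```

For example, if 5000 jiffies pass at HZ=250 while the "date" marks are only 10 seconds apart, the kernel believes 20 seconds went by and the function returns 10.0.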
Comment 3 Christian Lamparter 2008-03-31 16:25:51 UTC
Created attachment 15546 [details]
my config

- no comment - ;-)
Comment 4 Christian Lamparter 2008-04-01 11:53:43 UTC
Created attachment 15553 [details]
2.6.24.4 dmesg
Comment 5 Christian Lamparter 2008-04-01 11:54:43 UTC
Created attachment 15554 [details]
/proc/timer_list of 2.6.24.4
Comment 6 Rafael J. Wysocki 2008-04-08 16:21:37 UTC
Regressions list annotation:
Handled-By : Thomas Gleixner <tglx@linutronix.de>
Comment 7 Christian Lamparter 2008-04-08 16:29:20 UTC
Any news? If not, then we should close this report/WILL_FIX_LATER... after all, I'm away and I won't be able to do anything until May.

Regards,
	Chr
Comment 8 Rafael J. Wysocki 2008-04-09 13:45:42 UTC
At http://lkml.org/lkml/2008/4/9/89 Christian said:

It's still there in 2.6.25-rc8-git7. The workarounds so far:
either disable chronyd (NTP-Daemon: my system clock
is a bit too fast: ~ -0.879 seconds) or "noapictimer" parameter.
Comment 9 Thomas Gleixner 2008-04-09 22:54:29 UTC
> ------- Comment #8 from rjw@sisk.pl  2008-04-09 13:45 -------
> At http://lkml.org/lkml/2008/4/9/89 Christian said:
> 
> It's still there in 2.6.25-rc8-git7. The workarounds so far:
> either disable chronyd (NTP-Daemon: my system clock
> is a bit too fast: ~ -0.879 seconds) or "noapictimer" parameter.

noapictimer is the correct solution. I just have no idea how we can
autodetect this wreckage without a pretty intrusive patch. I'll cook
something up for .26.

This looks like the known AMD X2 C1E problem, but it seems the CPUs do
not have the C1E bit set. Maybe another magic BIOS trick to annoy us.

Thanks,
	tglx
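
To confirm that a workaround parameter such as noapictimer really reached the running kernel, one can look at /proc/cmdline. A minimal sketch, assuming hypothetical helper names `has_boot_param` and `running_kernel_has`:

```python
# Sketch: check whether a kernel command line carries a given boot parameter.
# /proc/cmdline is the procfs file that holds the boot command line;
# the function names here are made up for illustration.

def has_boot_param(cmdline, name):
    """Return True if `name` appears as a whole parameter (bare or name=value)."""
    for token in cmdline.split():
        if token == name or token.startswith(name + "="):
            return True
    return False


def running_kernel_has(name, path="/proc/cmdline"):
    """Read the running kernel's command line and look for `name`."""
    with open(path) as f:
        return has_boot_param(f.read(), name)
```

Matching whole tokens matters here because similar names exist side by side: a search for "nolapic" should not report a hit when only "nolapic_timer" was passed.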
Comment 10 Christian Lamparter 2008-04-10 07:27:43 UTC
Hmm, it's already the latest BIOS (1303)... so updates won't fix it :(
so, ping me if you have a test patch.

Regards,
	Chr.
Comment 11 Rafael J. Wysocki 2008-04-17 13:34:33 UTC
Confirmed to be present in 2.6.25-rc9.

References : http://lkml.org/lkml/2008/4/13/243
Comment 12 Frank Zacharias 2008-04-22 14:15:39 UTC
Hi,

it took me some time to find this report (exactly since the release of .25 -- i do not use the -rc's), mainly because i didn't find the root of the problem.
I have the same problem: when chrony runs as a service it takes approx. 20 min to 2 hours and then my machine freezes. IMO this time is related to the time chrony needs to adjust to the clock drift. The never ending beep i experienced only once. Maybe a different way to trigger this is to change the clocksource but i'm not really sure about this. The hardware is a bit different (Gigabyte GA-MA770-DS3 with AMD770+SB600) but it's an Athlon X2 too.
I already pulled linux-git and am trying to bisect this now but this can take some time (first use of git-bisect).

regards, frank
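
For readers new to git-bisect, the search it performs can be sketched as a plain binary search. This is a simplified illustration on a linear, oldest-to-newest list of commits (real history is a graph, which git handles itself); the function name `first_bad_commit` and the `is_bad` callback are assumptions for the example.

```python
# Simplified sketch of the bisection idea behind git-bisect.
# Assumes commits[0] is known good and commits[-1] is known bad.

def first_bad_commit(commits, is_bad):
    """Return the first commit for which is_bad(commit) is True."""
    lo, hi = 0, len(commits) - 1  # invariant: commits[lo] good, commits[hi] bad
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if is_bad(commits[mid]):
            hi = mid   # the regression is at mid or earlier
        else:
            lo = mid   # the regression came in after mid
    return commits[hi]
```

Each step halves the remaining range, so the number of kernels to build and test grows only with the logarithm of the number of commits. With a freeze that can take up to 2 hours to show up, every "good" verdict is still slow to reach.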
Comment 13 Christian Lamparter 2008-04-23 02:19:53 UTC
@Frank, does the noapictimer kernel-parameter help for you too? I'm still bisecting it... The _regression_ seemed to sneak in before the 2.6.25-rc1... 

Regards,
Chr
Comment 14 Frank Zacharias 2008-04-23 12:50:23 UTC
(In reply to comment #13)
> @Frank, does the noapictimer kernel-parameter help for you too?
> 
It seems to help (until now). The last kernel i booted in the morning without nolapic_timer (2.6.24-git) froze after 86 minutes - every time this happens chrony shows weird numbers (time jumps, no offset) in tracking.log. The same kernel runs now (58 minutes) with nolapic_timer. But it shows some strange hangs (3-5 seconds) in text mode, mostly when switching between terminals.

bye, frank

ps: i do not use x86_64, i'll attach my config
Comment 15 Frank Zacharias 2008-04-23 12:54:17 UTC
Created attachment 15866 [details]
config from 2.6.24-git
Comment 16 Christian Lamparter 2008-04-23 14:10:18 UTC
hmm... you say "nolapic_timer"? that's funny... because I couldn't get it to boot with this option, for some unknown reasons in sata_nv. :\

But yes, X.org-to-VT-to-X.org switching slowed down... however, I don't know if it's really a kernel fault, since I'm using nvidia's driver.

Thanks,
Chr
Comment 17 Frank Zacharias 2008-04-23 22:31:10 UTC
(In reply to comment #16)
> hmm... you say "nolapic_timer"? that's funny... because I couldn't get it to
> boot with this option, for some unknown reasons in sata_nv. :\
Ehm, yes. I chose nolapic_timer because noapictimer is x86_64-only. But it doesn't help anyway: the box froze again. This time it took longer. With this option the kernel complains that it cannot switch to high resolution mode.
I usually use nvidia too but for bug-hunting i run the box without X.
Comment 18 Frank Zacharias 2008-04-30 11:49:42 UTC
Ok, git-bisect spat out this:
---
9d8af78b07976d4d84e0df491abd4e9db848d0ad is first bad commit
commit 9d8af78b07976d4d84e0df491abd4e9db848d0ad
Author: Bernhard Walle <bwalle@suse.de>
Date:   Wed Feb 6 01:38:52 2008 -0800

    rtc: add HPET RTC emulation to RTC_DRV_CMOS
    
    That patch adds the RTC emulation of the HPET timer to the new RTC_DRV_CMOS.
    The old drivers/char/rtc.ko driver had that functionality and it's important
    on new systems.
    
    [akpm@linux-foundation.org: unbreak alpha build]
    Signed-off-by: Bernhard Walle <bwalle@suse.de>
    Cc: Alessandro Zummo <a.zummo@towertech.it>
    Cc: David Brownell <david-b@pacbell.net>
    Cc: Ingo Molnar <mingo@elte.hu>
    Cc: Thomas Gleixner <tglx@linutronix.de>
    Cc: Andi Kleen <ak@suse.de>
    Cc: john stultz <johnstul@us.ibm.com>
    Cc: Robert Picco <Robert.Picco@hp.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
---

and if i build a kernel w/o HPET (and HPET_EMULATE_RTC) the box doesn't freeze. There's another thread on LKML with the same problem: http://lkml.org/lkml/2008/4/28/247

regards, frank
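
When comparing kernels built with and without HPET support, it helps to check the options in the .config that was actually used. A rough sketch: the helper name `config_value` is made up, and looking at CONFIG_HPET_TIMER alongside CONFIG_HPET_EMULATE_RTC is an assumption about which symbols are meant by "HPET" above.

```python
# Sketch: look up an option in the text of a kernel .config file.
# Disabled options appear as "# CONFIG_FOO is not set" and are treated as unset.

def config_value(config_text, option):
    """Return the option's value (e.g. 'y' or 'm'), or None if unset or absent."""
    prefix = option + "="
    for line in config_text.splitlines():
        line = line.strip()
        if line.startswith(prefix):
            return line[len(prefix):]
    return None


sample = "CONFIG_HPET_TIMER=y\n# CONFIG_HPET_EMULATE_RTC is not set\n"
hpet_timer = config_value(sample, "CONFIG_HPET_TIMER")
hpet_rtc = config_value(sample, "CONFIG_HPET_EMULATE_RTC")
```

Matching on the full `NAME=` prefix keeps a lookup for CONFIG_HPET from accidentally picking up CONFIG_HPET_TIMER.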
Comment 19 Christian Lamparter 2008-05-01 15:32:55 UTC
hmm, I thought I had disabled HPET_EMULATE_RTC once and it froze nonetheless...
Anyway, I'm running a 2.6.25-git17 now (with HPET_EMULATE_RTC enabled) and it seems to be stable even without noapictimer workaround...
Comment 20 Thomas Gleixner 2008-05-13 00:08:37 UTC
Frank,

http://lkml.org/lkml/2008/5/12/132 has a patch related to this freeze. Can you give it a try ?

Thanks,
       tglx
Comment 21 Frank Zacharias 2008-05-13 22:37:26 UTC
No, it doesn't help. It's the same thing as before. I tried it with 2.6.25.3. 

regards, frank
Comment 22 Thomas Gleixner 2008-09-04 11:19:18 UTC
Frank, does the latest mainline still have the same problem ?
Comment 23 Frank Zacharias 2008-09-04 13:11:39 UTC
I don't know because i have only run kernels w/o hpet since my last comment. Some minutes ago i booted a fresh build of 2.6.26.3 with hpet enabled and until now it looks good. If this lasts for at least 2 days i'll report it here (despite the fact that .26 introduced a new (timing) regression that shows even more with hpet enabled).
Comment 24 Frank Zacharias 2008-09-04 16:26:56 UTC
Nope, the box froze after ~3 hours with hpet enabled and chrony running. 
Comment 25 Thomas Gleixner 2008-09-04 16:32:01 UTC
Can you please upload a bootlog of that machine ?
Comment 26 Frank Zacharias 2008-09-05 00:57:14 UTC
Created attachment 17628 [details]
boot log of 2.6.26.3 with HPET enabled
Comment 27 Frank Zacharias 2008-09-05 00:59:59 UTC
Created attachment 17629 [details]
config of 2.6.26.3 with HPET enabled
Comment 28 Thomas Gleixner 2008-09-05 01:13:54 UTC
Hmm, nothing too scary in there. Can you please grab 2.6.27-rc5 and the patch from: http://bugzilla.kernel.org/attachment.cgi?id=17622 and check whether we made any progress ?

Thanks,

       tglx
Comment 29 Frank Zacharias 2008-09-10 10:25:46 UTC
Looks good. It has an uptime of 28 hours now and runs quite smoothly compared to 2.6.26. 

regards, frank
Comment 30 Thomas Gleixner 2008-09-10 12:12:53 UTC
can you please provide the bootlog or full dmesg with that kernel ?
Comment 31 Frank Zacharias 2008-09-10 13:12:33 UTC
Created attachment 17714 [details]
full dmesg of 2.6.27-rc5 with hpet patch
Comment 32 Thomas Gleixner 2008-09-10 13:19:59 UTC
Looks good. I'm closing that one. Patches are in 27-rc6 already.

Thanks for your patience and help.

          tglx