Bug 12774 (NMI-AMD-Phenom9750) - NMI appears to be stuck (0->0)! AMD Phenom Quad Core
Summary: NMI appears to be stuck (0->0)! AMD Phenom Quad Core
Status: RESOLVED OBSOLETE
Alias: NMI-AMD-Phenom9750
Product: Platform Specific/Hardware
Classification: Unclassified
Component: x86-64
Hardware: All Linux
Importance: P1 normal
Assignee: platform_x86_64@kernel-bugs.osdl.org
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2009-02-24 14:18 UTC by Luis Chamberlain
Modified: 2013-11-20 13:20 UTC

See Also:
Kernel Version: 2.6.38
Subsystem:
Regression: No
Bisected commit-id:


Attachments
Log of nmi_watchdog=1 apic=debug (63.88 KB, application/octet-stream)
2009-02-24 14:20 UTC, Luis Chamberlain
Log of 2.6.27.9-159.fc10.x86_64 with nmi_watchdog=1 apic=debug (44.98 KB, application/octet-stream)
2009-02-25 10:06 UTC, Luis Chamberlain
dmesg output 2.6.38, apic=debug, hpet=verbose (51 bytes, text/plain)
2011-03-21 09:15 UTC, Nicos Gollan

Description Luis Chamberlain 2009-02-24 14:18:14 UTC
Latest working kernel version: none
Earliest failing kernel version: brand new box, 2.6.29-rc6
Distribution: Fedora 10
Hardware Environment: AMD Phenom Quad Core
Software Environment: ?
Problem Description: Adding nmi_watchdog=1 to the kernel command line produces the following messages in the kernel ring buffer:

[    0.833155] WARNING: CPU#0: NMI appears to be stuck (0->0)!
[    0.833219] Please report this to bugzilla.kernel.org,
[    0.833290] and attach the output of the 'dmesg' command.
[    0.833362]
[    0.833419] WARNING: CPU#1: NMI appears to be stuck (0->0)!
[    0.833490] Please report this to bugzilla.kernel.org,
[    0.833561] and attach the output of the 'dmesg' command.
[    0.833640]
[    0.833697] WARNING: CPU#2: NMI appears to be stuck (0->0)!
[    0.833768] Please report this to bugzilla.kernel.org,
[    0.833839] and attach the output of the 'dmesg' command.
[    0.833918]
[    0.833975] WARNING: CPU#3: NMI appears to be stuck (0->0)!
[    0.834046] Please report this to bugzilla.kernel.org,
[    0.834116] and attach the output of the 'dmesg' command.


Steps to reproduce: Boot with nmi_watchdog=1
Comment 1 Luis Chamberlain 2009-02-24 14:20:07 UTC
Created attachment 20353
Log of nmi_watchdog=1 apic=debug

Attached is the kernel ring buffer output captured immediately after boot with the following appended to the kernel command line:

nmi_watchdog=1 apic=debug
Comment 2 Luis Chamberlain 2009-02-25 10:06:43 UTC
Created attachment 20365
Log of 2.6.27.9-159.fc10.x86_64 with nmi_watchdog=1 apic=debug

Although this report is for 2.6.29-rc6, I figured I'd check older kernels to see whether this might be a regression.

The attached log is from 2.6.27.9-159.fc10.x86_64 with 

nmi_watchdog=1 apic=debug

The previous and current NMI counts are also 0 for all CPUs, so this is not a regression (at least not since 2.6.27).

I'm interested in trying to fix this myself. Please advise what things I can look at and I can try to give it a shot. Any pointers are greatly appreciated.
Comment 3 Jike Song 2009-02-25 21:12:12 UTC
I'm not sure what the problem is, but it seems that Ingo fixed a similar bug last year:

From ff11571b25152edfb1eb0e6feb7e0009670fe4a5 Mon Sep 17 00:00:00 2001
From: Ingo Molnar <mingo@elte.hu>
Date: Thu, 5 Jun 2008 11:17:16 +0200
Subject: [PATCH] x86, io-apic: fix nmi_watchdog=1 bootup hang

nmi_watchdog=1 hangs on 64-bit:

[    0.250000] Detected 12.564 MHz APIC timer.
[    0.254178] APIC timer registered as dummy, due to nmi_watchdog=1!
[    0.260366] Testing NMI watchdog ... <4>WARNING: CPU#0: NMI appears to be stuck (0->0)!
[    ...     ]
[    0.470003] calling  genl_init+0x0/0xd0
[  hard hang ]

bisected it down to:

 git-bisect start
 git-bisect good 1beee8dc8cf58e3f605bd7b34d7a39939be7d8d2
 git-bisect bad 11582ece0aaa2d0f94f345c08a4ab9997078a083
 git-bisect bad 5479c623bb44089844022c03d4c0eb16d5b7a15f
 git-bisect bad cfb4c7fabeb499e1c29f9d1878968e37a938e28a
 git-bisect good 246dd412d31e4f5de1d43aa6422a325b785f36e4
 git-bisect bad 3f8237eaff7dc1e35fa791dae095574fd974e671
 git-bisect good 90e23b13ab849e2a11f00c655eb3a2011b4623be
 git-bisect bad 833526a34eeefc117df3191a594c3c3a4f15a9ac
 git-bisect good 791b93d3dfaf16c23e978bec0cc0a3dd9d855d63
 git-bisect bad 65767c64068f2c93e56a1accfed5c78230ac12d7
 git-bisect bad 2abc5c05dd82c188e3bdf6641a274f013348d14b
 git-bisect bad 317e1f2597ffb4d4db940577bbe56dc6e881ef07

| 317e1f2597ffb4d4db940577bbe56dc6e881ef07 is first bad commit
| commit 317e1f2597ffb4d4db940577bbe56dc6e881ef07
| Author: Maciej W. Rozycki <macro@linux-mips.org>
| Date:   Wed May 21 22:10:22 2008 +0100
|     x86: I/O APIC: clean up the 8259A on a NMI watchdog failure

the problem is that in the dummy-lapic branch we rely on the i8259A
but if the NMI watchdog fails we turn off IRQ 0 - which doesnt work
too well ;-)

Signed-off-by: Ingo Molnar <mingo@elte.hu>
---
 arch/x86/kernel/apic_32.c |    6 ++++--
 arch/x86/kernel/apic_64.c |    6 ++++--
 2 files changed, 8 insertions(+), 4 deletions(-)

diff --git a/arch/x86/kernel/apic_32.c b/arch/x86/kernel/apic_32.c
index 4b99b1b..3cebf91 100644
--- a/arch/x86/kernel/apic_32.c
+++ b/arch/x86/kernel/apic_32.c
@@ -541,11 +541,13 @@ void __init setup_boot_APIC_clock(void)
 		 * PIT/HPET going.  Otherwise register lapic as a dummy
 		 * device.
 		 */
-		if (nmi_watchdog != NMI_IO_APIC)
+		if (nmi_watchdog != NMI_IO_APIC) {
 			lapic_clockevent.features &= ~CLOCK_EVT_FEAT_DUMMY;
-		else
+		} else {
 			printk(KERN_WARNING "APIC timer registered as dummy,"
 			       " due to nmi_watchdog=1!\n");
+			timer_through_8259 = 1;
+		}
 	}
 
 	/* Setup the lapic or request the broadcast */
diff --git a/arch/x86/kernel/apic_64.c b/arch/x86/kernel/apic_64.c
index 07fda23..6cbc668 100644
--- a/arch/x86/kernel/apic_64.c
+++ b/arch/x86/kernel/apic_64.c
@@ -413,11 +413,13 @@ void __init setup_boot_APIC_clock(void)
 	 * PIT/HPET going.  Otherwise register lapic as a dummy
 	 * device.
 	 */
-	if (nmi_watchdog != NMI_IO_APIC)
+	if (nmi_watchdog != NMI_IO_APIC) {
 		lapic_clockevent.features &= ~CLOCK_EVT_FEAT_DUMMY;
-	else
+	} else {
 		printk(KERN_WARNING "APIC timer registered as dummy,"
 		       " due to nmi_watchdog=1!\n");
+		timer_through_8259 = 1;
+	}
 
 	setup_APIC_timer();
 }
-- 
1.6.1.48.ge9b8
Comment 4 Luis Chamberlain 2009-02-26 14:56:02 UTC
Thanks. My issue is that the NMI counts never move: I do not get a hang, I just cannot use the NMI watchdog, which I'd like to have in order to look into hangs.

During NMI initialization there is a check in which each CPU is kept active in a busy loop for 20 ticks, which appears to force NMI interrupts to trigger at a higher frequency. If the number of NMI interrupts on a CPU has then increased by less than 5 since the start of the check, the NMI watchdog is given up on that CPU. In my case all 4 CPUs fail.
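
As a rough illustration of the check described above (not the kernel's actual code), here is a minimal user-space sketch in C. The function name `count_stuck_cpus`, the macro `NMI_MIN_DELTA` and the array-based interface are hypothetical; the threshold of 5 and the warning text come from this report, and the exact comparison used by the kernel may differ.

```c
#include <stdio.h>

/* Hypothetical name; the value 5 is the threshold mentioned in this comment. */
#define NMI_MIN_DELTA 5

/*
 * prev[] and cur[] hold per-CPU NMI counts sampled before and after the
 * busy-wait window. A CPU is treated as stuck when its count advanced by
 * fewer than NMI_MIN_DELTA. Returns the number of stuck CPUs.
 */
static int count_stuck_cpus(const unsigned int *prev,
                            const unsigned int *cur, int ncpus)
{
    int stuck = 0;

    for (int cpu = 0; cpu < ncpus; cpu++) {
        unsigned int delta = cur[cpu] - prev[cpu];

        if (delta < NMI_MIN_DELTA) {
            /* Same wording as the warning in the attached logs. */
            printf("WARNING: CPU#%d: NMI appears to be stuck (%u->%u)!\n",
                   cpu, prev[cpu], cur[cpu]);
            stuck++;
        }
    }
    return stuck;
}
```

With both samples at 0 on all four CPUs, as in my logs, every CPU is reported as stuck, which matches the "(0->0)" in the warning.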

The patch above seems to fix a hang that occurred because IRQ 0 was disabled when enabling the NMI watchdog failed.

Any pointers as to where I can look are appreciated.
Comment 5 Luis Chamberlain 2009-02-27 09:58:04 UTC
I checked, and I already have the latest BIOS. How do we link NMI interrupts with the timer interrupts? Or do we make the timer interrupts non-maskable?

Each of my CPUs has its own LAPIC with its own timer, but from what I gather from the output and the code, these are not used when enabling NMI. Is this correct?
Comment 6 Luis Chamberlain 2009-03-04 15:56:38 UTC
Just tried 2.6.29-rc7, with the same result.
Comment 8 Nicos Gollan 2011-03-21 09:15:56 UTC
Created attachment 51472
dmesg output 2.6.38, apic=debug, hpet=verbose

Same issue with an AMD Phenom II X6. The NMI count never increases. I'd kinda like to have it for Bug #31122 since I am experiencing the kind of freeze mentioned in the documentation of the NMI watchdog timer :-)
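
One way to confirm that the count never increases on a running system is to sample the `NMI:` row of /proc/interrupts twice and compare. The following is a minimal sketch in C under that assumption; the helper names `parse_nmi_row` and `nmi_advancing` are hypothetical, and the caller is expected to read the lines from /proc/interrupts itself.

```c
#include <ctype.h>
#include <stdlib.h>
#include <string.h>

/*
 * Parse a row such as
 *   "NMI:    0    0    0    0   Non-maskable interrupts"
 * into counts[]. Returns the number of per-CPU counts read,
 * or -1 if the line is not the NMI row.
 */
static int parse_nmi_row(const char *line, unsigned long *counts, int max)
{
    const char *p;
    int n = 0;

    while (isspace((unsigned char)*line))
        line++;
    if (strncmp(line, "NMI:", 4) != 0)
        return -1;

    p = line + 4;
    while (n < max) {
        char *end;

        while (isspace((unsigned char)*p))
            p++;
        if (!isdigit((unsigned char)*p))
            break;              /* reached the textual description */
        counts[n++] = strtoul(p, &end, 10);
        p = end;
    }
    return n;
}

/* Returns 1 if every CPU's NMI count grew between two samples, else 0. */
static int nmi_advancing(const unsigned long *before,
                         const unsigned long *after, int ncpus)
{
    for (int i = 0; i < ncpus; i++)
        if (after[i] <= before[i])
            return 0;
    return 1;
}
```

Taking two samples a few seconds apart and passing them to `nmi_advancing` would return 0 on a machine showing this bug, since all counts stay at 0.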
Comment 9 Alan 2013-11-20 13:20:16 UTC
2.6.38 is now obsolete
