Bug 7839 - boot hang unless "nmi_watchdog=0" - 2.6.19 regression - Asus M2400N (Centrino)
Status: CLOSED CODE_FIX
Alias: None
Product: ACPI
Classification: Unclassified
Component: Config-Interrupts
Hardware: i386 Linux
Importance: P2 high
Assignee: Len Brown
URL:
Keywords:
Duplicates: 9787
Depends on:
Blocks:
 
Reported: 2007-01-17 04:37 UTC by Andrej Podzimek
Modified: 2008-01-21 21:05 UTC (History)

See Also:
Kernel Version: 2.6.19.2
Subsystem:
Regression: ---
Bisected commit-id:


Attachments
2.6.18.6: dmesg -s64000 (15.09 KB, text/plain)
2007-02-11 09:01 UTC, Andrej Podzimek
2.6.18.6: /proc/cpuinfo (425 bytes, text/plain)
2007-02-11 09:02 UTC, Andrej Podzimek
2.6.18.6: /proc/interrupts (711 bytes, text/plain)
2007-02-11 09:02 UTC, Andrej Podzimek
2.6.20: dmesg -s64000 (15.10 KB, text/plain)
2007-02-11 09:02 UTC, Andrej Podzimek
2.6.20: /proc/cpuinfo (438 bytes, text/plain)
2007-02-11 09:02 UTC, Andrej Podzimek
2.6.20: /proc/interrupts (750 bytes, text/plain)
2007-02-11 09:02 UTC, Andrej Podzimek
dmesg output with nmi_watchdog=0 (6.70 KB, text/plain)
2007-05-03 03:48 UTC, Christian Limberg
dmesg output without nmi_watchdog=0 (6.72 KB, text/plain)
2007-05-03 03:51 UTC, Christian Limberg
keep-watchdog-disabled-by-default fix (3.43 KB, patch)
2007-08-10 13:58 UTC, Daniel Gollub
keep-watchdog-disabled-by-default fix v2 (3.10 KB, patch)
2007-08-12 01:45 UTC, Daniel Gollub

Description Andrej Podzimek 2007-01-17 04:37:46 UTC
Most recent kernel where this bug did *NOT* occur: 2.6.18.x
Distribution: Arch Linux + Vanilla kernel
Hardware Environment: Asus M2400N, Centrino, Pentium M
Software Environment: ArchLinux distro kernel, Vanilla kernel, Vanilla + Suspend2

Problem Description:

My machine doesn't boot. The last kernel message I see is
Setting up standard PCI resources

Kernels I tried:
Kernel 2.6.18.x: works fine. (Vanilla, Vanilla + Suspend2 and ArchLinux default)
Kernel 2.6.19.x: always fails. (Vanilla, Vanilla + Suspend2, ArchLinux default,
ArchLinux default + Suspend2)

Kernel command line:
devfs=nomount acpi=on acpi_irq_balance acpi_irq_isa=3,7,12

After removing the irq stuff, the problem persists. (IRQ 7 is needed to use LPT
in ECP mode. The other two IRQs are reserved for my IRDA port.)

I found a recommendation to disable ACPI. But that's what I shouldn't do on a
laptop... How could I diagnose this more precisely?

According to this thread http://bbs.archlinux.org/viewtopic.php?t=28794, Asus
M3N is also affected.

Steps to reproduce:

Compile a 2.6.19.x kernel with gcc 4.1.2 on Asus M2400N, install it and reboot...
Comment 1 Luming Yu 2007-01-23 07:15:10 UTC
Does the 2.6.20-rc kernel boot?
Comment 2 Andrej Podzimek 2007-01-23 11:42:48 UTC
Nope... rc5 hangs as well. What could be wrong? Are there any config or boot
options I should try?
Comment 3 Luming Yu 2007-01-23 21:55:16 UTC
What is your kernel config? I tried -rc5 on an Asus A6B and it boots just fine.
Before booting, please remove any unnecessarily attached devices.

Does the boot pass through request_standard_resources()? Please also add a
printk in topology_init to verify.
Comment 4 Andrej Podzimek 2007-01-29 12:38:05 UTC
Done.
With -rc6, the message I added to topology_init() appears after "Setting up
standard PCI resources", becoming the last line before disaster. ;-)
As for request_standard_resources(), I couldn't see its message anywhere. Either
its end is not reached, or it gets scrolled away too quickly.
Comment 5 Andrej Podzimek 2007-01-31 06:29:00 UTC
One more comment: I don't have any special devices attached and no cards
inserted in my PCMCIA slots. I haven't replaced any hardware parts since I
bought this laptop.
Comment 6 Andrej Podzimek 2007-02-02 10:13:10 UTC
Presumably, -rc6 does work with acpi=off in the kernel command line. However,
this is not a viable solution, as soundcard and modem don't work when ACPI is
disabled. 
Comment 7 Andrej Podzimek 2007-02-09 18:59:59 UTC
2.6.20 freezes the same way. This is pretty strange.
Comment 8 Christian Limberg 2007-02-10 09:27:41 UTC
I had the same issue here with Linux 2.6.20 and an ASUS M2400N notebook. The
solution for me was to deactivate APIC support (Processor type and features
---> Local APIC support on uniprocessors). After that the kernel runs quite well.

Here is a diff of my changed .config:

< CONFIG_X86_UP_APIC=y
< CONFIG_X86_UP_IOAPIC=y
< CONFIG_X86_LOCAL_APIC=y
< CONFIG_X86_IO_APIC=y
---
> # CONFIG_X86_UP_APIC is not set
164d160
< # CONFIG_X86_MCE_P4THERMAL is not set
306,307d301
< # CONFIG_PCI_MSI is not set
< CONFIG_HT_IRQ=y
1787,1788d1780
< CONFIG_X86_FIND_SMP_CONFIG=y
< CONFIG_X86_MPPARSE=y
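
The workaround in comment 8 amounts to building a kernel without the four APIC options at the top of the diff. As a rough illustration only, the sketch below scans the text of a .config file and reports which of those options are still enabled. The function name `enabled_options` is hypothetical and not part of any kernel tooling; the option names are taken from the diff.

```python
import re

# Option names taken from the .config diff in comment 8.
APIC_OPTIONS = (
    "CONFIG_X86_UP_APIC",
    "CONFIG_X86_UP_IOAPIC",
    "CONFIG_X86_LOCAL_APIC",
    "CONFIG_X86_IO_APIC",
)


def enabled_options(config_text, names=APIC_OPTIONS):
    """Return the entries of `names` that are set to y or m in .config text.

    Hypothetical helper for illustration. Lines such as
    "# CONFIG_X86_UP_APIC is not set" count as disabled.
    """
    enabled = set()
    for line in config_text.splitlines():
        match = re.match(r"^(CONFIG_\w+)=(.*)$", line.strip())
        if match and match.group(1) in names and match.group(2) in ("y", "m"):
            enabled.add(match.group(1))
    # Preserve the order of `names` in the result.
    return [name for name in names if name in enabled]
```

Calling `enabled_options(open(".config").read())` on the modified configuration should return an empty list if the APIC support was really switched off.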
Comment 9 Andrej Podzimek 2007-02-10 18:21:50 UTC
Bingo! Many thanks for the hint. The kernel command line token "nolapic" does
the thing. What was wrong? Do I need local APIC on a uniprocessor? Does it bring
any advantage?
Comment 10 Len Brown 2007-02-10 20:49:09 UTC
> The kernel command line token "nolapic" does the thing.

Please attach the output from dmesg -s64000 after this boot,
as well as past /proc/interrupts and /proc/cpuinfo.

There are a lot of laptops where the LAPIC is not configured
and is supposed to be disabled by the BIOS.  A few years back
Linux got too clever and re-enabled the LAPIC on these boxes so
it could use the Local APIC timer -- and a whole bunch of
laptops stopped booting.  I thought we fixed that, but maybe
we've had some sort of regression...

dmesg and /proc/interrupts from working 2.6.18 would help us find out
if you can attach them.

If you are doing profiling with NMI, you need the LAPIC.
If you have an IOAPIC, you need the LAPIC to talk to it.
If none of the above, there isn't a benefit
to the LAPIC vs no LAPIC on a uni-processor laptop.
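
One way to see from the requested /proc/cpuinfo output whether the running kernel has the local APIC enabled is to look for the "apic" CPU flag. The sketch below is only an illustration: it assumes the usual x86 layout with a "flags" line, assumes that the "apic" flag is absent when the LAPIC is disabled (for example with "nolapic"), and uses hypothetical function names.

```python
def cpu_flags(cpuinfo_text):
    """Return the flags of the first processor entry in /proc/cpuinfo text."""
    for line in cpuinfo_text.splitlines():
        key, sep, value = line.partition(":")
        if sep and key.strip() == "flags":
            return set(value.split())
    return set()


def lapic_enabled(cpuinfo_text):
    """Assumption: the "apic" flag is listed only while the local APIC is enabled."""
    return "apic" in cpu_flags(cpuinfo_text)
```

On a live system this could be called as `lapic_enabled(open("/proc/cpuinfo").read())`.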
Comment 11 Andrej Podzimek 2007-02-11 09:01:50 UTC
Created attachment 10380
2.6.18.6: dmesg -s64000

Kernel command line:
devfs=nomount acpi=on acpi_irq_balance acpi_irq_isa=3,7,12
resume2=swap:/dev/hda7
Comment 12 Andrej Podzimek 2007-02-11 09:02:04 UTC
Created attachment 10381
2.6.18.6: /proc/cpuinfo
Comment 13 Andrej Podzimek 2007-02-11 09:02:18 UTC
Created attachment 10382
2.6.18.6: /proc/interrupts
Comment 14 Andrej Podzimek 2007-02-11 09:02:32 UTC
Created attachment 10383
2.6.20: dmesg -s64000

Kernel command line:
devfs=nomount acpi=on nolapic acpi_irq_balance acpi_irq_isa=3,7,12
resume2=swap:/dev/hda7
Comment 15 Andrej Podzimek 2007-02-11 09:02:45 UTC
Created attachment 10384
2.6.20: /proc/cpuinfo
Comment 16 Andrej Podzimek 2007-02-11 09:02:56 UTC
Created attachment 10385
2.6.20: /proc/interrupts
Comment 17 Andrej Podzimek 2007-02-11 09:12:09 UTC
Attachments ready. Thank you for your interest. Should you need any further
information, just let me know.
Comment 18 Maciej Pawlik 2007-02-19 14:02:35 UTC
Same thing affects HP nx6110. Intel Celeron M 1.3GHz, intel915.

Latest Gentoo
newest working kernel: 2.6.18.6
affected 2.6.19-rc1 and up (including 2.6.20)

Closing the lid during boot freezes the system instantly. During normal work
(Xorg running), closing and opening the lid freezes the system about half the
time. It also happens when X blanks the monitor and tries to turn it back on,
which is pretty annoying. The nolapic option solves the problem. Do you need any
more info?
Comment 19 Andreas 2007-02-21 10:10:25 UTC
I have the same problems with an IBM Thinkpad R51 1830-DG4. You can find some 
valuable information about this laptop here:
http://www-307.ibm.com/pc/support/site.wss/product.do?subcategoryind=0&familyind=168386&brandind=10&doccategoryind=0&modelind=171165&doctypeind=9&validate=true&partnumberind=0&sitestyle=lenovo&template=%2Fproductpage%2Flandingpages%2FproductPageLandingPage.vm&operatingsystemind=49979&machineind=168593

My kernel configuration and dmesg output can be found here:
http://forums.gentoo.org/viewtopic-t-540369-start-0-postdays-0-postorder-asc-highlight-.html

My system: recent Gentoo linux
Last usable kernel: gentoo-sources-2.6.18-r6 (kernel 2.6.18.6 with a few 
patches)

I will check if nolapic will work here, but I think it will.
Comment 20 Andreas 2007-02-22 10:49:15 UTC
OK, I tried "nolapic" and it works now.
Note that I only had system freezes during the kernel's initial boot phase: it
would boot roughly 3 times out of 10. When it did boot with the local APIC
enabled it was stable; I had no other limitations and no freezes after the
boot process.
In addition I don't use any acpi_irq_*something statements. I use an IBM 
Thinkpad R51 as stated in comment #19.

Hope we find what the problem is here with > 2.6.18 kernels...
Comment 21 Len Brown 2007-02-22 23:00:20 UTC
Is booting with "nmi_watchdog=0" instead of "nolapic" a sufficient workaround?
Comment 22 Christian Limberg 2007-02-23 08:04:54 UTC
Yes, kernel 2.6.20 boots with kernel option "nmi_watchdog=0" instead of
"nolapic" and seems to run stable.
Comment 23 Andrej Podzimek 2007-02-23 09:19:05 UTC
Yes, that works here, too.
Comment 24 Maciej Pawlik 2007-02-23 10:53:08 UTC
Here, too.
Comment 25 Andreas 2007-02-24 08:51:40 UTC
Tried 2.6.19-gentoo-r6 with "nmi_watchdog=0" and the boot freeze is gone.
I had a freeze at a very different point though, during the init process (when
Gentoo is starting all the services in /etc/init.d, just before the runlevel 3
message), once. I don't know if this has anything to do with it.
Otherwise works here too. I'll continue testing...
Comment 26 Andreas 2007-02-26 10:11:14 UTC
I've booted Linux with "nmi_watchdog=0" about 6 times since and worked for more
than 10 hours (sometimes with a lot of applications performing their
calculations) with no hangs or errors whatsoever.
Thank you for the workaround.
Comment 27 Len Brown 2007-03-05 13:37:55 UTC
Ingo re-sent the patch to disable NMI watchdog by default:
http://lkml.org/lkml/2007/3/5/111

And Linus applied it today:
http://lkml.org/lkml/2007/3/5/303

So this should be fixed as of the release that follows 2.6.21-rc2-git4.

Closed.
Comment 28 Christian Limberg 2007-05-02 07:13:10 UTC
The regression seems to persist with kernel 2.6.21. With vanilla kernel 2.6.21.1
the boot hangs unless I append nmi_watchdog=0 or deactivate apic.
Comment 29 Andi Kleen 2007-05-02 11:32:54 UTC
2.6.21 doesn't enable the NMI watchdog by default so that would be hard to believe.
Comment 30 Christian Limberg 2007-05-02 12:51:38 UTC
Hmm, what can I say.

The following boot-parameter does work:
kernel (hd0,6)/linux-2.6.21.1 root=/dev/hda3 nmi_watchdog=0

And this one does not:
kernel (hd0,6)/linux-2.6.21.1 root=/dev/hda3

$ uname -a
Linux topeka 2.6.21.1 #1 Wed May 2 21:06:56 CEST 2007 i686 Intel(R) Pentium(R) M
processor 1400MHz GenuineIntel GNU/Linux

How can I verify that the NMI watchdog is disabled by default here?

As an attempt, I have deactivated all kernel options except APIC in my
kernel-config. After that the kernel does not freeze when "Setting up PCI
resources". But I have discovered some interesting kernel output:

Testing NMI watchdog ... <6>Time: tsc clocksource has been installed.
OK.
Using IPI Shortcut mode

The line "Testing NMI watchdog ... <6>" does not occur, when booting the kernel
with nmi_watchdog=0.

Maybe a module enables the NMI watchdog or something like that? I can attach my
.config and my stripped .config if that helps.
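
Comment 30 compares two boots that differ only in the "nmi_watchdog=0" parameter. As a small illustration, the sketch below extracts that parameter from a kernel command line such as the contents of /proc/cmdline. It only shows what was passed at boot, not whether the watchdog is actually running, and the function names are hypothetical.

```python
def nmi_watchdog_param(cmdline):
    """Return the value given for nmi_watchdog= on a kernel command line.

    Returns None when the parameter is absent. If it appears more than
    once, this sketch simply keeps the last occurrence.
    """
    value = None
    for token in cmdline.split():
        name, sep, arg = token.partition("=")
        if sep and name == "nmi_watchdog":
            value = arg
    return value


def read_cmdline(path="/proc/cmdline"):
    """Read the command line the running kernel was booted with."""
    with open(path) as handle:
        return handle.read().strip()
```

For the two boot entries above, `nmi_watchdog_param("root=/dev/hda3 nmi_watchdog=0")` gives `"0"` and `nmi_watchdog_param("root=/dev/hda3")` gives `None`.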
Comment 31 Andi Kleen 2007-05-02 15:39:39 UTC
Well, what you did is a placebo. Maybe it just fails randomly?

dmesg | grep -i watchdog

or cat /proc/interrupts and check if NMIs are increasing.
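
The second check suggested in comment 31, watching whether the NMI count in /proc/interrupts keeps growing, could be scripted roughly as follows. This is a minimal sketch: the function names and the 5-second default interval are arbitrary choices, and it assumes the counters sit on a line labelled "NMI:".

```python
import time


def nmi_counts(interrupts_text):
    """Return the per-CPU NMI counters found in /proc/interrupts text."""
    for line in interrupts_text.splitlines():
        label, sep, rest = line.partition(":")
        if sep and label.strip() == "NMI":
            # Keep only numeric fields; some kernels append a description.
            return [int(field) for field in rest.split() if field.isdigit()]
    return []


def nmi_is_ticking(interval=5.0, path="/proc/interrupts"):
    """Sample the NMI counters twice and report whether any of them grew."""
    def sample():
        with open(path) as handle:
            return nmi_counts(handle.read())

    first = sample()
    time.sleep(interval)
    second = sample()
    return any(after > before for before, after in zip(first, second))
```

If `nmi_is_ticking()` returns True, NMIs are still being delivered, which suggests the NMI watchdog is active.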
Comment 32 Christian Limberg 2007-05-03 00:25:09 UTC
No, it's not random behaviour. With "nmi_watchdog=0" the kernel runs stably;
without it, the boot always fails (tested approx. 10 times).

Furthermore I have done some tests with the stripped kernel (APIC, PCI, PATA and
ext3 enabled).

These are the results, when booting without "nmi_watchdog=0":

$ cat /proc/interrupts
           CPU0       
  0:       5868    XT-PIC-XT        timer
  1:        245    XT-PIC-XT        i8042
  2:          0    XT-PIC-XT        cascade
 12:         17    XT-PIC-XT        i8042
 14:        314    XT-PIC-XT        ide0
 15:         12    XT-PIC-XT        ide1
NMI:         53 
LOC:       5936 
ERR:          0
MIS:          0

$ dmesg | grep "watchdog"
Testing NMI watchdog ... OK.

When I append "nmi_watchdog=0" I get the following output:

$ cat /proc/interrupts

           CPU0       
  0:       7534    XT-PIC-XT        timer
  1:        240    XT-PIC-XT        i8042
  2:          0    XT-PIC-XT        cascade
 12:         17    XT-PIC-XT        i8042
 14:        312    XT-PIC-XT        ide0
 15:         12    XT-PIC-XT        ide1
NMI:          0 
LOC:       7602 
ERR:          0
MIS:          0

$ dmesg | grep "watchdog"

Kernel command line: root=/dev/hda3 nmi_watchdog=0
Comment 33 Andi Kleen 2007-05-03 02:17:53 UTC
Can you add full boot.msg with and without nmi_watchdog=0?
Comment 34 Christian Limberg 2007-05-03 03:48:11 UTC
Created attachment 11382
dmesg output with nmi_watchdog=0

dmesg output with kernel option nmi_watchdog=0 and stripped kernel.

Note: the boot hangs for several seconds at this line:
0000:00:1d.7 EHCI: BIOS handoff failed (BIOS bug ?) 01010001

But this does not occur with my 'normal' kernel configuration.
Comment 35 Christian Limberg 2007-05-03 03:51:35 UTC
Created attachment 11383
dmesg output without nmi_watchdog=0

dmesg output without kernel option nmi_watchdog=0 and stripped kernel.

Note: without the parameter the boot also hangs for several seconds at this
line:
0000:00:1d.7 EHCI: BIOS handoff failed (BIOS bug ?) 0101000
Comment 36 Pierre Willenbrock 2007-05-04 11:26:08 UTC
Just some observations:

Christian's dmesg (without forcing nmi_watchdog=0) says

Found and enabled local APIC!

which is emitted 2 lines after setting nmi_watchdog to NMI_LOCAL_APIC, if it
was not NMI_NONE(0) before (in apic.c:1063).

nmi_watchdog gets initialized to NMI_DEFAULT in nmi.c, so his nmi watchdog is
enabled.
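
The behaviour described in comment 36 can be modelled in a few lines. This is only a toy model in Python, not the kernel code: the numeric values of NMI_DEFAULT and NMI_LOCAL_APIC are assumptions (the comment only states NMI_NONE is 0), and both function names are hypothetical.

```python
# NMI_NONE == 0 is stated in comment 36; the other two values are
# assumptions made for this illustration.
NMI_DEFAULT = -1
NMI_NONE = 0
NMI_LOCAL_APIC = 2


def after_lapic_detection(nmi_watchdog):
    """Model the check described above.

    Any value other than NMI_NONE is turned into NMI_LOCAL_APIC once
    the local APIC has been found and enabled.
    """
    if nmi_watchdog != NMI_NONE:
        return NMI_LOCAL_APIC
    return nmi_watchdog


def boot_value(cmdline_value=None):
    """Hypothetical helper: start from NMI_DEFAULT unless nmi_watchdog=<n> was given."""
    start = NMI_DEFAULT if cmdline_value is None else int(cmdline_value)
    return after_lapic_detection(start)
```

In this model `boot_value()` ends up as NMI_LOCAL_APIC because NMI_DEFAULT is not NMI_NONE, while `boot_value("0")` stays at NMI_NONE, which matches the observation that only an explicit "nmi_watchdog=0" keeps the watchdog off.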
Comment 37 Daniel Gollub 2007-08-10 13:58:22 UTC
Created attachment 12347
keep-watchdog-disabled-by-default fix

I hit the same problem with the same machine and the latest kernel (2.6.23-rc2). My patch should solve it; at least it does on my ASUS M2400N. For more details please have a look at: https://bugzilla.novell.com/show_bug.cgi?id=298084#c9

I guess this could also solve some other issues which can be worked around with nmi_watchdog=0.

(In reply to comment #36)
> which is emitted 2 lines after setting the nmi_watchdog to NMI_LOCAL_APIC, if
> it was not NMI_NONE(0) before(in apic.c:1063).
> 
> nmi_watchdog gets initialized to NMI_DEFAULT in nmi.c, so his nmi watchdog is
> enabled.

Exactly.
Comment 38 Daniel Gollub 2007-08-12 01:45:47 UTC
Created attachment 12356
keep-watchdog-disabled-by-default fix v2

Fixed little typo in x86_64 changes.
x86_64 changes compiled but untested.
Comment 39 Andi Kleen 2007-08-19 12:48:38 UTC
Patch is in 2.6.23 mainline. Len, you can close it.
Comment 40 Len Brown 2008-01-21 21:05:40 UTC
*** Bug 9787 has been marked as a duplicate of this bug. ***
