Bug 11814 - Kernel hard hang when >1 core enabled
Summary: Kernel hard hang when >1 core enabled
Status: REJECTED INVALID
Alias: None
Product: Other
Classification: Unclassified
Component: Other
Hardware: All Linux
Importance: P1 high
Assignee: other_other
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2008-10-23 14:54 UTC by Fred Blaise
Modified: 2008-11-11 12:49 UTC
CC List: 1 user

See Also:
Kernel Version: 2.6.27.4
Subsystem:
Regression: ---
Bisected commit-id:


Attachments
dmidecode output (17.18 KB, text/plain)
2008-10-23 14:56 UTC, Fred Blaise
lspci -vv output (25.18 KB, text/plain)
2008-10-23 14:57 UTC, Fred Blaise
kernel trace after sysctl -a (4.19 KB, text/plain)
2008-10-28 11:20 UTC, Fred Blaise

Description Fred Blaise 2008-10-23 14:54:35 UTC
Latest working kernel version: n/a
Earliest failing kernel version: 2.6.25.x
Distribution: openSUSE 11.0/11.1 64-bit

Hardware Environment: x86_64
AMD 6000+, ASUS M2N-SLI (BIOS 1203), three 250 GB Western Digital hard disks in a RAID configuration with LVM on top, 4 GB of RAM. I will attach dmidecode and lspci output for starters.

Software Environment: GNOME or console; it doesn't matter.

Problem Description:

When the kernel runs SMP (that is, booted without maxcpus=1), normal usage produces a hard hang: no log output, no SysRq, nothing on the serial port. Only a manual reset works.

This started happening with the openSUSE distribution kernel, so I tried several of theirs; none has worked so far. I then tried the latest vanilla stable kernel, 2.6.27.3: same result, although the bug seemed to take longer to trigger, which is probably pure coincidence (and wild hope).

_Apparently_, it started happening after I added hard disks to the machine and began running Linux software RAID (levels 1 and 5) with LVM on top.

If I start with maxcpus=1, there is no problem at all.

I have tested in runlevel 3 (init 3), after blacklisting the nvidia module and rebooting the box, so the kernel was not tainted.

I pretty much tried all that I knew...

The bug is also reported on bugzilla.novell.com (https://bugzilla.novell.com/show_bug.cgi?id=434600), but that report has been very quiet so far.


Steps to reproduce:
- Probably have an AMD dual-core machine (does it matter?) configured with LVM on RAID and a 64-bit OS.
- Boot normally.
- Compile a kernel with make -j3 to trigger the bug much faster.
- Wait less than a minute; the machine hangs and needs a manual reset.

To have it work:
- Append maxcpus=1 to the kernel boot line in GRUB.
Comment 1 Fred Blaise 2008-10-23 14:56:48 UTC
Created attachment 18419
dmidecode output
Comment 2 Fred Blaise 2008-10-23 14:57:17 UTC
Created attachment 18420
lspci -vv output
Comment 3 Andrew Morton 2008-10-27 15:13:28 UTC
There isn't much to go on here - we need to get a trace out of the
kernel when it hangs.

Please try enabling the softlockup detector (Documentation/kernel-parameters.txt)
and the NMI watchdog (Documentation/nmi_watchdog.txt, Documentation/kernel-parameters.txt).

Thanks.
Comment 4 Fred Blaise 2008-10-28 10:30:28 UTC
OK, so per the documentation, it is a hard hang. I enabled the NMI watchdog (nmi_watchdog=1 didn't work, so I passed =2 instead and the NMI interrupts showed up, which seems kind of weird for an x86_64 SMP kernel, no?).

Unfortunately, even the NMI watchdog did not display or log anything; the machine was just completely frozen. I didn't try the softlockup detector, as I don't think it applies to this case.
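
For anyone repeating this check: one way to confirm that the NMI watchdog is really ticking is to compare the per-CPU counters on the `NMI:` row of /proc/interrupts a few seconds apart. The following is only a minimal Python sketch, assuming the usual row layout (a label, one counter per CPU, then an optional description). The helper names `parse_nmi_counts`, `nmi_is_ticking` and `read_nmi_counts` are made up for illustration.

```python
def parse_nmi_counts(interrupts_text):
    """Return the per-CPU NMI counters from /proc/interrupts-style text."""
    for line in interrupts_text.splitlines():
        fields = line.split()
        if fields and fields[0] == "NMI:":
            counts = []
            for field in fields[1:]:
                if not field.isdigit():
                    break  # reached the trailing description, if any
                counts.append(int(field))
            return counts
    return []  # no NMI row found


def nmi_is_ticking(before, after):
    """True if every CPU's NMI counter increased between two samples."""
    return (
        bool(before)
        and len(before) == len(after)
        and all(b < a for b, a in zip(before, after))
    )


def read_nmi_counts(path="/proc/interrupts"):
    """Read the current NMI counters from the running Linux system."""
    with open(path) as handle:
        return parse_nmi_counts(handle.read())
```

To use it, call `read_nmi_counts()` twice, some seconds apart and under some load, and pass both results to `nmi_is_ticking()`. As far as I understand Documentation/nmi_watchdog.txt, with the local APIC mode (nmi_watchdog=2) the NMI rate depends on system load, so on an idle machine the counters may advance only slowly.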

I should also mention that this computer dual-boots 32-bit Windows XP, on which I play resource-intensive games. It works fine there and never crashes, which leads me to think this is not a hardware problem.

Please let me know what I can try next.

Thank you.
Comment 5 Fred Blaise 2008-10-28 11:19:16 UTC
Well, I did get a trace trying something that seems to be totally trivial.

sysctl -a |grep nmi

causes kernel: BUG: unable to handle kernel NULL pointer dereference at 0000000000000000.

(file attached)

This ultimately leads to one core being stuck at 100% CPU, or to a hard hang of the system. I don't know what this means, though; maybe it contains a pointer to what is really happening?

This was under kernel 2.6.27.1-2. (I have since reinstalled the machine with openSUSE 11.1 beta 3, because RAID5 crashed on me... yes, the problems are piling up.)

I will try to reproduce this under 2.6.27.3 if I can compile it without the machine hanging; otherwise I will have to wait until I get my hands on another machine, which should be soon.
Comment 6 Fred Blaise 2008-10-28 11:20:15 UTC
Created attachment 18482
kernel trace after sysctl -a
Comment 7 Fred Blaise 2008-10-28 12:58:16 UTC
The bug above doesn't trigger with 2.6.27.4. 

But the hard hang initially reported still occurs.
Comment 8 Fred Blaise 2008-10-29 05:33:08 UTC
So, I found it at last. I just discovered netconsole (at least I learnt something :))

So it is a hardware problem after all. With maxcpus=1, it seems that I was lucky enough to land on the working core.

Please find the output below; I'd appreciate your expert comments on it.

HARDWARE ERROR
CPU 1: Machine Check Exception:                4 Bank 0: b606000000010015
TSC a897f40c81 ADDR 2b563c1bc000 
This is not a software problem!
Run through mcelog --ascii to decode and contact your hardware vendor
Kernel panic - not syncing: Machine check
------------[ cut here ]------------
WARNING: at kernel/smp.c:332 smp_call_function_mask+0x4d/0x21d()
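
The trace above was captured with netconsole, which sends kernel messages as UDP datagrams to a second machine. The receiving side can be any UDP listener. Below is a minimal Python sketch, assuming netconsole's default target port 6666. The function names `open_listener` and `receive_messages` are illustrative, not part of any existing tool.

```python
import socket


def open_listener(host="0.0.0.0", port=6666):
    """Bind a UDP socket for incoming netconsole datagrams."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.bind((host, port))
    return sock


def receive_messages(sock, max_datagrams=None, timeout=None):
    """Yield (sender_ip, text) per datagram until the limit or a timeout."""
    sock.settimeout(timeout)
    received = 0
    while max_datagrams is None or received < max_datagrams:
        try:
            data, (sender_ip, _sender_port) = sock.recvfrom(65535)
        except socket.timeout:
            return  # nothing arrived within the timeout
        received += 1
        yield sender_ip, data.decode("utf-8", errors="replace")


# Example use on the receiving machine (runs until interrupted):
#
#     listener = open_listener()
#     for sender_ip, text in receive_messages(listener):
#         print(text, end="", flush=True)
```

Flushing each message as it arrives matters here, because the last datagrams before a hard hang are the interesting ones.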


fblaise@scoomoon:~/Desktop> mcelog --k8 --ascii < hwerror.txt 
HARDWARE ERROR
HARDWARE ERROR. This is *NOT* a software problem!
Please contact your hardware vendor
CPU 1 0 data cache TSC a897f40c81 
  TLB parity error in virtual array
       bit57 = processor context corrupt
       bit61 = error uncorrected
  TLB error 'data transaction, level 1'
STATUS b606000000010015 MCGSTATUS 4
This is not a software problem!
Run through mcelog --ascii to decode and contact your hardware vendor
Kernel panic - not syncing: Machine check
------------[ cut here ]------------
WARNING: at kernel/smp.c:332 smp_call_function_mask+0x4d/0x21d()
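
The mcelog decoding above can be cross-checked by hand, because the top bits of the STATUS value and the low 16-bit error code follow the generic x86 machine-check layout. The sketch below is a rough Python illustration of that layout only. The name `decode_status` and the result dictionary are made up, and the vendor-specific extended error code (which mcelog --k8 uses to print "TLB parity error in virtual array") is deliberately not decoded.

```python
# Architectural flag bits in the upper part of an x86 MCi_STATUS register.
STATUS_FLAGS = {
    63: "VAL (register contents valid)",
    62: "OVER (an earlier error was overwritten)",
    61: "UC (error uncorrected)",
    60: "EN (error reporting enabled)",
    59: "MISCV (MISC register valid)",
    58: "ADDRV (ADDR register valid)",
    57: "PCC (processor context corrupt)",
}

TRANSACTION_TYPES = {0: "instruction", 1: "data", 2: "generic"}
CACHE_LEVELS = {0: "level 0", 1: "level 1", 2: "level 2", 3: "generic"}


def decode_status(status):
    """Decode the flag bits and, for TLB errors, the low 16-bit error code."""
    flags = [
        name
        for bit, name in sorted(STATUS_FLAGS.items(), reverse=True)
        if (status >> bit) & 1
    ]
    error_code = status & 0xFFFF
    result = {"flags": flags, "error_code": error_code}
    # TLB errors use the bit pattern 0000 0000 0001 TTLL.
    if (error_code & 0xFFF0) == 0x0010:
        result["kind"] = "TLB"
        result["transaction"] = TRANSACTION_TYPES.get(
            (error_code >> 2) & 0x3, "reserved"
        )
        result["level"] = CACHE_LEVELS[error_code & 0x3]
    return result


decoded = decode_status(0xB606000000010015)
for flag in decoded["flags"]:
    print(flag)
print(decoded.get("kind"), decoded.get("transaction"), decoded.get("level"))
```

For the value reported here, the set flags are VAL, UC, EN, ADDRV and PCC, and the error code 0x0015 decodes as a level 1 data TLB error. That agrees with the "error uncorrected", "processor context corrupt" and "data transaction, level 1" lines in the mcelog output.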
Comment 9 Fred Blaise 2008-11-11 12:49:37 UTC
Hello,

I received the replacement CPU, and installed it. I could compile the kernel with make -j3 with no problem.

So the problem really does seem to have been a defective CPU.

Closing as INVALID.

Thank you.
fred
