Latest working kernel version: n/a
Earliest failing kernel version: 2.6.25.x
Distribution: openSUSE 11.0/11.1, 64-bit
Hardware Environment: x86_64 AMD 6000+, ASUS M2N-SLI (BIOS 1203), 3 Western Digital 250 GB HDDs in a RAID configuration with LVM on top, 4 GB of RAM. Will attach dmidecode and lspci output for starters.
Software Environment: GNOME or console, it doesn't matter.

Problem Description:
When running the SMP code (that is, booting the kernel without maxcpus=1), normal usage produces a hard hang. No log, no sysrq, no serial console output, nothing. Only the finger on the reset button works.

This started happening with the openSUSE distro kernel, so I tried several of theirs; none has worked so far. I then tried the latest vanilla stable, 2.6.27.3: same thing, although the bug seemed to take longer to trigger, which is probably pure coincidence (and wild hope).

_Apparently_, it started happening after I added HDDs to the machine and began running Linux software RAID (1 and 5) with LVM on top.

If I boot with maxcpus=1, there is no problem at all.

I have tested in init 3, after blacklisting the nvidia module and rebooting the box, so the kernel is not tainted. I have pretty much tried everything I know...

The bug is also reported on bugzilla.novell.com (https://bugzilla.novell.com/show_bug.cgi?id=434600), but that report has been very quiet so far.

Steps to reproduce:
- Have an AMD dual-core machine (probably; does it matter?) configured with LVM on RAID and a 64-bit OS.
- Boot normally.
- Compile a kernel with make -j3 to trigger the bug much faster.
- Wait less than a minute, then give the machine the finger (on the reset button).

To have it work:
- Append maxcpus=1 to the kernel boot line in grub.
Created attachment 18419: dmidecode output
Created attachment 18420: lspci -vv output
There isn't much to go on here - we need to get a trace out of the kernel when it hangs. Please try enabling the softlockup detector (Documentation/kernel-parameters.txt) and the NMI watchdog (Documentation/nmi_watchdog.txt, Documentation/kernel-parameters.txt).

Thanks.
OK, so per the documentation, it is a hard hang. I enabled the NMI watchdog (nmi_watchdog=1 didn't work, so I passed in =2 and NMI interrupts showed up -- kind of weird for an x86_64 SMP kernel, no?). Unfortunately, even the NMI watchdog didn't display or log anything. The machine was just completely frozen. I didn't try the softlockup detector, as I don't think it applies to a hard hang.

I must also mention that this computer dual-boots 32-bit Windows XP, on which I play resource-intensive games. I mention this because it works fine there: the computer doesn't crash, which leads me to think this is not a hardware problem.

Please let me know what I can try next. Thank you.
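For reference, "interrupts showed up" can be checked by comparing the NMI line of /proc/interrupts at two points in time: with a working NMI watchdog, the counter for every CPU should keep increasing. The following is only a minimal sketch of that check. The function names and the sample text are made up for illustration, and it assumes the NMI line starts with "NMI:" followed by one counter per CPU.

```python
# Minimal sketch: verify that the NMI watchdog is ticking on every CPU.
# nmi_counts() and watchdog_ticking() are illustrative names, and the two
# SNAPSHOT strings are invented sample data, not output from this machine.

def nmi_counts(interrupts_text):
    """Return the per-CPU NMI counters from /proc/interrupts-style text."""
    for line in interrupts_text.splitlines():
        tokens = line.split()
        if tokens and tokens[0] == "NMI:":
            counts = []
            for tok in tokens[1:]:
                if not tok.isdigit():
                    break  # reached the trailing description, if any
                counts.append(int(tok))
            return counts
    return []  # no NMI line found


def watchdog_ticking(before, after):
    """True if every CPU's NMI counter increased between two snapshots."""
    if not before or len(before) != len(after):
        return False
    return all(b < a for b, a in zip(before, after))


# Invented two-CPU samples, as if taken a few seconds apart.
SNAPSHOT_1 = "           CPU0       CPU1\nNMI:        120        118   Non-maskable interrupts\n"
SNAPSHOT_2 = "           CPU0       CPU1\nNMI:        131        129   Non-maskable interrupts\n"

print(watchdog_ticking(nmi_counts(SNAPSHOT_1), nmi_counts(SNAPSHOT_2)))
```

On a live system the two snapshots would come from reading /proc/interrupts twice, a few seconds apart. A counter that stays at zero or stops increasing on one CPU suggests the chosen nmi_watchdog mode is not working there.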
Well, I did get a trace, from something that seems totally trivial. Running

sysctl -a | grep nmi

causes

kernel: BUG: unable to handle kernel NULL pointer dereference at 0000000000000000

(file attached). This ultimately leads either to one core being locked up at 100% CPU or to the system hard hanging. I don't know what this means at all, though; maybe it contains a pointer to what is really happening?

This was under kernel 2.6.27.1-2. (I have reinstalled the machine since, with openSUSE 11.1 beta 3, because RAID5 crashed on me... yes, the problems are piling up.) I will try to reproduce under 2.6.27.3 if I can compile it without the machine hanging; otherwise I will have to wait until I get my hands on another machine, which should be soon.
Created attachment 18482: kernel trace after sysctl -a
The NULL pointer dereference from sysctl -a above doesn't trigger with 2.6.27.4, but the hard hang initially reported still occurs.
So, I found it, at last. I just discovered netconsole (at least I learnt something :))

So it is a hardware problem after all. With maxcpus=1, it seems I was lucky enough to end up on the working core. Please find the output below; I'd appreciate your expert comments on it.

HARDWARE ERROR
CPU 1: Machine Check Exception: 4 Bank 0: b606000000010015
TSC a897f40c81 ADDR 2b563c1bc000
This is not a software problem!
Run through mcelog --ascii to decode and contact your hardware vendor
Kernel panic - not syncing: Machine check
------------[ cut here ]------------
WARNING: at kernel/smp.c:332 smp_call_function_mask+0x4d/0x21d()

fblaise@scoomoon:~/Desktop> mcelog --k8 --ascii < hwerror.txt
HARDWARE ERROR
HARDWARE ERROR. This is *NOT* a software problem!
Please contact your hardware vendor
CPU 1 0 data cache TSC a897f40c81
  TLB parity error in virtual array
  bit57 = processor context corrupt
  bit61 = error uncorrected
  TLB error 'data transaction, level 1'
STATUS b606000000010015 MCGSTATUS 4
This is not a software problem!
Run through mcelog --ascii to decode and contact your hardware vendor
Kernel panic - not syncing: Machine check
------------[ cut here ]------------
WARNING: at kernel/smp.c:332 smp_call_function_mask+0x4d/0x21d()
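For anyone wanting to cross-check the mcelog output by hand, the STATUS and MCGSTATUS values can be decoded from their bit fields. The sketch below covers only the architectural bits of the x86 machine-check status registers and the TLB form of the error code. It does not decode the AMD model-specific bits behind "TLB parity error in virtual array". The function and dictionary names are illustrative and are not part of mcelog.

```python
# Sketch: decode the architectural fields of an x86 machine-check status
# value. All names here are illustrative and not taken from mcelog.

TLB_TRANSACTION = {0b00: "instruction", 0b01: "data", 0b10: "generic"}
CACHE_LEVEL = {0b00: "level 0", 0b01: "level 1", 0b10: "level 2", 0b11: "generic"}


def bit(value, n):
    """Return True if bit n of value is set."""
    return bool((value >> n) & 1)


def decode_mci_status(status):
    """Decode the architectural bits of a bank status (MCi_STATUS) value."""
    mca_code = status & 0xFFFF          # bits 15:0, MCA error code
    fields = {
        "valid": bit(status, 63),        # VAL
        "overflow": bit(status, 62),     # OVER
        "uncorrected": bit(status, 61),  # UC
        "enabled": bit(status, 60),      # EN
        "misc_valid": bit(status, 59),   # MISCV
        "addr_valid": bit(status, 58),   # ADDRV
        "pcc": bit(status, 57),          # processor context corrupt
        "mca_code": mca_code,
        "model_code": (status >> 16) & 0xFFFF,  # bits 31:16, model specific
        "description": "not decoded by this sketch",
    }
    # TLB errors have the form 0000 0000 0001 TTLL.
    if (mca_code & 0xFFF0) == 0x0010:
        tt = TLB_TRANSACTION.get((mca_code >> 2) & 0b11, "reserved")
        ll = CACHE_LEVEL[mca_code & 0b11]
        fields["description"] = "TLB error: %s transaction, %s" % (tt, ll)
    return fields


def decode_mcg_status(mcg):
    """Decode the global machine-check status (MCG_STATUS) value."""
    return {"ripv": bit(mcg, 0), "eipv": bit(mcg, 1), "mcip": bit(mcg, 2)}


status = decode_mci_status(0xB606000000010015)
for name, value in status.items():
    print(name, value)
print(decode_mcg_status(4))
```

For the value in this report, bits 61 (uncorrected) and 57 (processor context corrupt) are set and the low 16 bits are 0x0015, a level 1 data TLB error, which agrees with what mcelog printed. Bit 58 is also set, consistent with the ADDR field appearing in the panic message.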
Hello,

I received the replacement CPU and installed it. I could compile the kernel with make -j3 with no problem, so the problem really does seem to have been a defective CPU.

Closing as INVALID.

Thank you.
fred