Kernel Bug Tracker – Bug 4951
System freezes with SMP kernel on AMD X2 4600
Last modified: 2006-01-19 01:26:39 UTC
Distribution: SuSE 9.3 x86_64
Hardware Environment: AMD X2 4600+ (manchester) with an MSI Neo4 Platinum bios
System completely freezes within a short interval (15 - 30 minutes) after
booting. This system is perfectly stable when running the non-smp x86_64
kernel. It is also stable when running Windows (and Windows does indicate the
presence of two processors), so I don't believe there is a hardware problem, but
I haven't completely ruled out the possibility.
I have noticed that things have been getting better with successive kernels.
The original SuSE 9.3 smp kernel would not boot. I upgraded it to the latest
SuSE kernel and got a lot of errors with powernow-k8. The machine booted but
would lock up quickly after. Finally I recompiled the kernel using the latest
stable source 2.6.12. All the powernow-k8 problems seemed to go away, but not
the lock ups.
Here's the output from lspci:
0000:00:00.0 Memory controller: nVidia Corporation CK804 Memory Controller (rev a3)
0000:00:01.0 ISA bridge: nVidia Corporation: Unknown device 0050 (rev a3)
0000:00:01.1 SMBus: nVidia Corporation CK804 SMBus (rev a2)
0000:00:02.0 USB Controller: nVidia Corporation CK804 USB Controller (rev a2)
0000:00:02.1 USB Controller: nVidia Corporation CK804 USB Controller (rev a3)
0000:00:04.0 Multimedia audio controller: nVidia Corporation CK804 AC'97 Audio Controller (rev a2)
0000:00:06.0 IDE interface: nVidia Corporation CK804 IDE (rev f2)
0000:00:07.0 IDE interface: nVidia Corporation CK804 Serial ATA Controller (rev f3)
0000:00:08.0 IDE interface: nVidia Corporation CK804 Serial ATA Controller (rev f3)
0000:00:09.0 PCI bridge: nVidia Corporation CK804 PCI Bridge (rev a2)
0000:00:0b.0 PCI bridge: nVidia Corporation CK804 PCIE Bridge (rev a3)
0000:00:0c.0 PCI bridge: nVidia Corporation CK804 PCIE Bridge (rev a3)
0000:00:0d.0 PCI bridge: nVidia Corporation CK804 PCIE Bridge (rev a3)
0000:00:0e.0 PCI bridge: nVidia Corporation CK804 PCIE Bridge (rev a3)
0000:00:18.0 Host bridge: Advanced Micro Devices [AMD] K8 [Athlon64/Opteron] HyperTransport Technology Configuration
0000:00:18.1 Host bridge: Advanced Micro Devices [AMD] K8 [Athlon64/Opteron]
0000:00:18.2 Host bridge: Advanced Micro Devices [AMD] K8 [Athlon64/Opteron]
0000:00:18.3 Host bridge: Advanced Micro Devices [AMD] K8 [Athlon64/Opteron]
0000:01:0c.0 FireWire (IEEE 1394): VIA Technologies, Inc. IEEE 1394 Host Controller (rev 80)
0000:03:00.0 Ethernet controller: Marvell Technology Group Ltd. Gigabit Ethernet Controller (rev 15)
0000:05:00.0 VGA compatible controller: nVidia Corporation: Unknown device 0141
/proc/cpuinfo gives me this (I'm currently running the non-SMP kernel):
processor : 0
vendor_id : AuthenticAMD
cpu family : 15
model : 43
model name : AMD Athlon(tm) 64 X2 Dual Core Processor 4600+
stepping : 1
cpu MHz : 1005.165
cache size : 512 KB
fpu : yes
fpu_exception : yes
cpuid level : 1
wp : yes
flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht pni syscall nx mmxext fxsr_opt lm 3dnowext 3dnow pni lahf_lm cmp_legacy
bogomips : 1993.38
TLB size : 1024 4K pages
clflush size : 64
cache_alignment : 64
address sizes : 40 bits physical, 48 bits virtual
power management: ts fid vid ttp
cpu cores : 2
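As a side note, the output above shows one "processor" entry but "cpu cores : 2", which is what you would expect from a non-SMP kernel on a dual-core chip. A minimal sketch for checking this automatically follows. The function names (parse_cpuinfo, summarize) are made up for illustration, and the sample text is a shortened copy of the output above.

```python
def parse_cpuinfo(text):
    """Parse /proc/cpuinfo-style text into one dict per logical processor."""
    cpus = []
    current = {}
    for line in text.splitlines():
        if not line.strip():
            # A blank line separates one processor block from the next.
            if current:
                cpus.append(current)
                current = {}
            continue
        if ":" not in line:
            # Not a "key : value" line (e.g. a wrapped continuation); skip it.
            continue
        key, _, value = line.partition(":")
        key = key.strip()
        if key == "processor" and current:
            # A new "processor" key also starts a new block.
            cpus.append(current)
            current = {}
        current[key] = value.strip()
    if current:
        cpus.append(current)
    return cpus


def summarize(text):
    """Return (logical processors the kernel sees, cores the chip reports)."""
    cpus = parse_cpuinfo(text)
    logical = len(cpus)
    cores = int(cpus[0].get("cpu cores", "1")) if cpus else 0
    return logical, cores


# Shortened copy of the report's output (non-SMP kernel).
SAMPLE = """processor : 0
vendor_id : AuthenticAMD
model name : AMD Athlon(tm) 64 X2 Dual Core Processor 4600+
cpu cores : 2
"""

logical, cores = summarize(SAMPLE)
print(f"kernel sees {logical} logical processor(s); chip reports {cores} core(s)")
```

On a live system you would pass the contents of /proc/cpuinfo instead of SAMPLE. Under an SMP kernel both numbers should come out as 2 on this hardware.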
I know that's not much to go on. I don't see any errors in the logs when I
reboot after a lockup. I'm hoping someone else is experiencing a similar issue.
I'll be happy to provide any other information if you want to request it.
Steps to reproduce:
Happens within 15 - 30 minutes after every reboot.
Whoops... I meant to say:
Hardware Environment: AMD X2 4600+ (manchester) with an MSI Neo4 Platinum
motherboard bios version 1.5
We've fixed a few x86_64 nasties recently. It would be useful to
test the current -linus tree, or, better, 2.6.13-rc3-mm2.
I tried as you suggested, didn't work. It froze 7 minutes after booting, using
Created attachment 5389 [details]
boot.msg file from 2.6.13-rc3-mm2
I have no idea if this will help or not, but I'm including my boot.msg from
So when it "freezes", is it completely dead? Not responding to pings,
not responding to alt-sysrq combinations?
Have you tried enabling the nmi watchdog?
Yes, it's completely dead. Alt-SysRq doesn't work, and pressing keys like Caps Lock
does nothing. The machine cannot be pinged. On your suggestion I tried the NMI
watchdog (nmi_watchdog=1 on bootup, right?). It doesn't appear to have helped.
No messages appeared on the console or in the logs. In the boot.msg log I see:
<6>testing NMI watchdog ... CPU#0: NMI appears to be stuck (0->0)!
That appears to be during the boot process though.
So NMIs don't work either.
Sigh. I give up.
It might be worth disabling various drivers in config, removing
or replacing hardware, wiggling the power supply connectors, etc.
Created attachment 5400 [details]
Fix NMI on x64 SMP
Could you apply this patch and we'll see if the NMI watchdog can give us any
idea of what is going on.
Disabling some of the drivers maybe...but I dunno. I don't think it's something
like a loose cable because the system is stable under windows. I know there's
not much to go on. Like I said, I'm hoping that by getting this out there,
maybe other people who can diagnose it better will have similar issues.
I assume the NMI patch is for 2.6.13-rc3-mm2 right? I'm going to try it now.
The NMI fix is in mainline now so it'll be there in -rc4 or the next -mm set.
The NMI fix didn't make it into 2.6.13-rc4. But Alex's attachment
is valid. Can you please add that and try to get the NMI
I tried the patch. Nothing showed up on the console or in the logs as far as I
can tell. However, I'm not entirely sure I'm using it correctly.
When I boot, I use nmi_watchdog=1 as a boot argument. Is this how I should be
turning it on? Also, where would the nmi watchdog dump any info? I assume the
console, but nothing shows there. Also should I wait a certain amount of time
after it freezes? I waited maybe 30 seconds or so.
Please test with nmi_watchdog=2 (either way should work but I'm more confident
with this). Does the boot message say something like "testing NMI watchdog...OK"?
And yes, the console is where it should turn up (after maybe 10 seconds).
You can also check /proc/interrupts after boot to see if any NMIs have been
With nmi_watchdog=2, nothing appeared on the console. I did verify that the boot
log says "testing NMI watchdog...OK". /proc/interrupts had a row with NMI listed on
the left and two columns of numbers.
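For anyone repeating this check: the two columns are per-CPU NMI counts, and the useful question is whether they grow between two reads. Here is a rough sketch, assuming the usual /proc/interrupts layout. The function names and the sample numbers are hypothetical, not taken from this machine.

```python
def nmi_counts(interrupts_text):
    """Return the per-CPU NMI counts from /proc/interrupts-style text."""
    for line in interrupts_text.splitlines():
        label, sep, rest = line.partition(":")
        if sep and label.strip() == "NMI":
            counts = []
            for field in rest.split():
                if not field.isdigit():
                    # Some kernels append a text description after the counts.
                    break
                counts.append(int(field))
            return counts
    return []


def nmi_is_ticking(before, after):
    """True if every CPU's NMI count grew between the two snapshots."""
    return (
        len(before) > 0
        and len(before) == len(after)
        and all(b > a for a, b in zip(before, after))
    )


# Hypothetical snapshots taken a few seconds apart (numbers are invented).
SNAPSHOT_1 = """           CPU0       CPU1
  0:     123456          0    IO-APIC-edge  timer
NMI:        215        198
LOC:     123400     123390
"""
SNAPSHOT_2 = SNAPSHOT_1.replace("215", "260").replace("198", "241")

print(nmi_counts(SNAPSHOT_1), nmi_counts(SNAPSHOT_2))
print("NMIs arriving on all CPUs:",
      nmi_is_ticking(nmi_counts(SNAPSHOT_1), nmi_counts(SNAPSHOT_2)))
```

On a real system you would read /proc/interrupts twice with a short sleep in between. Counts stuck at zero on a CPU would suggest the watchdog is not actually firing there, which matches the earlier "NMI appears to be stuck" boot message.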
Something different did happen this time, however. In /var/log/messages it said
Machine check events logged. Then a log showed up called mcelog. In it is:
CPU 1 1 instruction cache TSC 2b5fe2a4e9b
TLB parity error in virtual array
TLB error 'instruction transaction, level 1'
STATUS 9400000000010011 MCGSTATUS 0
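To make the STATUS value easier to read, here is a small sketch that decodes it using the generic x86 machine-check status layout (flag bits 57 to 63, MCA error code in the low 16 bits). It is an illustration only: the function name is made up, the bits 16-31 field is treated as opaque, and the vendor documentation for this CPU should be consulted for the model-specific details.

```python
def decode_mci_status(status):
    """Decode the architectural parts of an x86 MCi_STATUS value."""
    result = {
        "VAL":   bool((status >> 63) & 1),  # register holds a valid error
        "OVER":  bool((status >> 62) & 1),  # an earlier error was overwritten
        "UC":    bool((status >> 61) & 1),  # error was NOT corrected
        "EN":    bool((status >> 60) & 1),  # reporting was enabled
        "MISCV": bool((status >> 59) & 1),  # MISC register is valid
        "ADDRV": bool((status >> 58) & 1),  # ADDR register is valid
        "PCC":   bool((status >> 57) & 1),  # processor context corrupt
        "model_specific": (status >> 16) & 0xFFFF,  # treated as opaque here
        "mca_code": status & 0xFFFF,
    }
    code = result["mca_code"]
    # TLB errors are encoded as 0000 0000 0001 TTLL.
    if (code & 0xFFF0) == 0x0010:
        result["kind"] = "TLB"
        result["transaction"] = ["instruction", "data", "generic", "reserved"][(code >> 2) & 3]
        result["level"] = ["L0", "L1", "L2", "generic"][code & 3]
    else:
        result["kind"] = "other"
    return result


# The value from the mcelog excerpt above.
decoded = decode_mci_status(0x9400000000010011)
for name, value in decoded.items():
    print(f"{name:15} {value}")
```

Read this way, the low bits agree with the mcelog text (an instruction TLB error at level 1), and the UC bit is clear, which under this layout means the error was reported as corrected rather than uncorrected.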
I think your machine is sick. MCEs and TLB parity errors look
I'll tentatively reject this bug, sorry. If you can reproduce it
on a different machine then please reopen.
Actually that MCE happens on some machines during boot. That's a BIOS
bug, but does not point to hardware issues.
We still log boot MCEs because it made sense before this class of BIOS
bugs was discovered, although I planned to disable it (maybe should do
that for 2.6.13).
If the MCE only happens directly after boot that would be ok.
Have you tested without powernow?
I've tested it without acpi and apm...would that have the effect of disabling
How much directly after boot is the MCE ok? I get it maybe a few minutes after.
I have some good news to report on this. I installed a BIOS update and
installed the latest stable kernel, 2.6.13. It's a bit early to tell, but my
machine has been up for 2 hours with no errors in the mcelog (a record...as
stated before lock ups and problems appeared within a few minutes after boot).
Things look really promising. My machine will be up over the 3 day weekend.
I'll report back after that. If it's still up after the 3 day weekend, I think
it may be safe to consider the matter fixed.
I'm keeping my fingers crossed on this.
Ok, I've got more news to report. The machine hasn't locked up yet, and there's
nothing in the mcelog. However, now I get this in the message log:
Losing some ticks... checking if CPU frequency changed.
warning: many lost ticks.
Your time source seems to be instable or some driver is hogging interupts
rip acpi_processor_idle+0x137/0x389 [processor]
Otherwise things seem ok for the time being. Any ideas what the messages above
Well it didn't lock up. So that's good. However, the clock is running very
fast. It was over an hour ahead when I came in this morning. In addition when
I hit keys on the keyboard it sometimes records multiple keystrokes.
I found another group of people that seem to be having this issue here:
I'm now trying some of the solutions they've tried. There's definitely
something up here, I think.
Do the timer-running-too-fast problems still happen with 2.6.14rc1? If yes, please
attach an updated boot log. Also try with powernow disabled and notsc.
I'll test rc1 later. I'm a little apprehensive about changing anything because
the machine is working now. The other thing is that I have trouble compiling
drivers against the patched kernels. Is there some sort of trick I need to
know? Typically after I patch a kernel and then try to compile drivers such as
the NVIDIA driver, I get an error about the kernel headers.
The clock is also fine now, but when I boot I use the following options:
clock=pmtmr pci=biosirq pci=irqroute pci=noacpi noapic clock=notsc notsc
report_lost_ticks and nmi_watchdog are for troubleshooting, as for the others I
have no idea which ones or which combination causes things to work.
PowerNow is enabled and working (verified with /proc/cpuinfo) so I don't think
the problem is there. I also tried noapictimer, however the system would not
boot with that option enabled.
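One way to find out which of those options actually matters is to boot with each one removed in turn and see whether the lost ticks come back. The sketch below only generates the candidate command lines from the option string quoted above. The helper name is hypothetical, and each line would still have to be tested by hand with a reboot.

```python
# Option string as quoted in the comment above.
BOOT_OPTIONS = "clock=pmtmr pci=biosirq pci=irqroute pci=noacpi noapic clock=notsc notsc"


def leave_one_out(cmdline):
    """Return (dropped option, remaining command line) for each option."""
    opts = cmdline.split()
    return [
        (opts[i], " ".join(opts[:i] + opts[i + 1:]))
        for i in range(len(opts))
    ]


for dropped, remaining in leave_one_out(BOOT_OPTIONS):
    print(f"without {dropped:12} -> {remaining}")
```

If the problem returns only when one particular option is dropped, that option is the likely workaround. If it never returns, the options may be redundant with each other and larger subsets would need removing.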
It's been a little while...I just loaded 184.108.40.206. Without the options listed
above, I still get lost ticks. Prior to rebooting because of the kernel update,
the machine had been up for 28 days. Looks like the lockups are gone at least.
I will test 2.6.14 when it's officially released.
I'm trying 2.6.14, the machine has been up for a few days and I'm happy to
report that the lost ticks messages are gone. The only options I'm booting with
are report_lost_ticks=100 nmi_watchdog=2. As far as I can tell, everything
Great if it works. Closing.