Distribution: SuSE 9.3 x86_64

Hardware Environment: AMD X2 4600+ (manchester) with an MSI Neo4 Platinum bios version 1.5

Problem Description: System completely freezes within a short interval (15 - 30 minutes) after booting. This system is perfectly stable when running the non-smp x86_64 kernel. It is also stable when running Windows (and Windows does indicate the presence of two processors), so I don't believe there is a hardware problem, but I haven't completely ruled out the possibility.

I have noticed that things have been getting better with successive kernels. The original SuSE 9.3 smp kernel would not boot. I upgraded it to the latest SuSE kernel and got a lot of errors with powernow-k8. The machine booted but would lock up quickly after. Finally I recompiled the kernel using the latest stable source 2.6.12. All the powernow-k8 problems seemed to go away, but not the lock ups.

Here's the output from lspci:

0000:00:00.0 Memory controller: nVidia Corporation CK804 Memory Controller (rev a3)
0000:00:01.0 ISA bridge: nVidia Corporation: Unknown device 0050 (rev a3)
0000:00:01.1 SMBus: nVidia Corporation CK804 SMBus (rev a2)
0000:00:02.0 USB Controller: nVidia Corporation CK804 USB Controller (rev a2)
0000:00:02.1 USB Controller: nVidia Corporation CK804 USB Controller (rev a3)
0000:00:04.0 Multimedia audio controller: nVidia Corporation CK804 AC'97 Audio Controller (rev a2)
0000:00:06.0 IDE interface: nVidia Corporation CK804 IDE (rev f2)
0000:00:07.0 IDE interface: nVidia Corporation CK804 Serial ATA Controller (rev f3)
0000:00:08.0 IDE interface: nVidia Corporation CK804 Serial ATA Controller (rev f3)
0000:00:09.0 PCI bridge: nVidia Corporation CK804 PCI Bridge (rev a2)
0000:00:0b.0 PCI bridge: nVidia Corporation CK804 PCIE Bridge (rev a3)
0000:00:0c.0 PCI bridge: nVidia Corporation CK804 PCIE Bridge (rev a3)
0000:00:0d.0 PCI bridge: nVidia Corporation CK804 PCIE Bridge (rev a3)
0000:00:0e.0 PCI bridge: nVidia Corporation CK804 PCIE Bridge (rev a3)
0000:00:18.0 Host bridge: Advanced Micro Devices [AMD] K8 [Athlon64/Opteron] HyperTransport Technology Configuration
0000:00:18.1 Host bridge: Advanced Micro Devices [AMD] K8 [Athlon64/Opteron] Address Map
0000:00:18.2 Host bridge: Advanced Micro Devices [AMD] K8 [Athlon64/Opteron] DRAM Controller
0000:00:18.3 Host bridge: Advanced Micro Devices [AMD] K8 [Athlon64/Opteron] Miscellaneous Control
0000:01:0c.0 FireWire (IEEE 1394): VIA Technologies, Inc. IEEE 1394 Host Controller (rev 80)
0000:03:00.0 Ethernet controller: Marvell Technology Group Ltd. Gigabit Ethernet Controller (rev 15)
0000:05:00.0 VGA compatible controller: nVidia Corporation: Unknown device 0141 (rev a2)

/proc/cpuinfo gives me this (I'm currently running the non-smp kernel):

processor : 0
vendor_id : AuthenticAMD
cpu family : 15
model : 43
model name : AMD Athlon(tm) 64 X2 Dual Core Processor 4600+
stepping : 1
cpu MHz : 1005.165
cache size : 512 KB
fpu : yes
fpu_exception : yes
cpuid level : 1
wp : yes
flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht pni syscall nx mmxext fxsr_opt lm 3dnowext 3dnow pni lahf_lm cmp_legacy
bogomips : 1993.38
TLB size : 1024 4K pages
clflush size : 64
cache_alignment : 64
address sizes : 40 bits physical, 48 bits virtual
power management: ts fid vid ttp
cpu cores : 2

I know that's not much to go on. I don't see any errors in the logs when I reboot after a lockup. I'm hoping someone else is experiencing a similar issue. I'll be happy to provide any other information if you want to request it.

Steps to reproduce: Happens within 15 - 30 minutes after every reboot.
Whoops... I meant to say: Hardware Environment: AMD X2 4600+ (manchester) with an MSI Neo4 Platinum motherboard, BIOS version 1.5
We've fixed a few x86_64 nasties recently. It would be useful to test the current -linus tree or, better, 2.6.13-rc3-mm2.
I tried as you suggested, but it didn't work. It froze 7 minutes after booting 2.6.13-rc3-mm2.
Created attachment 5389: boot.msg file from 2.6.13-rc3-mm2

I have no idea if this will help or not, but I'm including my boot.msg from booting 2.6.13-rc3-mm2.
So when it "freezes", is it completely dead? Not responding to pings, not responding to alt-sysrq combinations? Have you tried enabling the nmi watchdog?
Yes, it's completely dead. Alt-SysRq doesn't work, and hitting keys like Caps Lock does nothing. The machine cannot be pinged. On your suggestion I tried the NMI watchdog (nmi_watchdog=1 on bootup, right?). It doesn't appear to have helped. No messages appeared on the console or in the logs. In the boot.msg log I see: <6>testing NMI watchdog ... CPU#0: NMI appears to be stuck (0->0)! That appears to be during the boot process, though.
So NMIs don't work either. Sigh. I give up. It might be worth disabling various drivers in the config, removing or replacing hardware, wiggling the power supply connectors, etc.
Created attachment 5400: Fix NMI on x64 SMP

Could you apply this patch, and we'll see if the NMI watchdog can give us any idea of what is going on?
Andrew: Disabling some of the drivers maybe...but I dunno. I don't think it's something like a loose cable, because the system is stable under Windows. I know there's not much to go on. Like I said, I'm hoping that by getting this out there, other people with similar issues who can diagnose it better will speak up. Alex: I assume the NMI patch is for 2.6.13-rc3-mm2, right? I'm going to try it now.
The NMI fix is in mainline now so it'll be there in -rc4 or the next -mm set.
The NMI fix didn't make it into 2.6.13-rc4. But Alex's attachment is valid. Can you please add that and try to get the NMI watchdog trace?
I tried the patch. Nothing showed up on the console or in the logs as far as I can tell. However, I'm not entirely sure I'm using it correctly. When I boot, I use nmi_watchdog=1 as a boot argument. Is this how I should be turning it on? Also, where would the NMI watchdog dump any info? I assume the console, but nothing shows there. Also, should I wait a certain amount of time after it freezes? I waited maybe 30 seconds or so.
Please test with nmi_watchdog=2 (either way should work, but I'm more confident with this). Does the boot message say something like "testing NMI watchdog...OK"? And yes, the console is where it should turn up (after maybe 10 seconds). You can also check /proc/interrupts after boot to see if any NMIs have been registered.
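The /proc/interrupts check suggested above can be scripted. The following is only a minimal sketch: the function name nmi_counts is made up for illustration, and it assumes the usual layout of one "NMI:" row with one counter per CPU. If the counters stay at zero, or stop increasing between two reads, the watchdog is probably not firing.

```python
from pathlib import Path


def nmi_counts(interrupts_text):
    """Return the per-CPU NMI counters from /proc/interrupts-style text.

    Hypothetical helper: assumes a row labelled "NMI:" followed by one
    integer per CPU, optionally followed by a text description.
    """
    for line in interrupts_text.splitlines():
        label, sep, rest = line.partition(":")
        if sep and label.strip() == "NMI":
            counts = []
            for field in rest.split():
                if not field.isdigit():
                    break  # reached the trailing description, if any
                counts.append(int(field))
            return counts
    return []  # no NMI row found


if __name__ == "__main__":
    path = Path("/proc/interrupts")
    if path.exists():
        print(nmi_counts(path.read_text()))
```

Running it twice a few seconds apart and comparing the two lists shows whether NMIs are still being delivered on each CPU.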
With nmi_watchdog=2, nothing appeared on the console. I did verify "testing NMI watchdog...OK". /proc/interrupts had a row with NMI listed on the left and two columns of numbers. Something different did happen this time, however. /var/log/messages said "Machine check events logged". Then a log showed up called mcelog. In it is:

MCE 0
CPU 1 1 instruction cache TSC 2b5fe2a4e9b
ADDR ffff8013efa0
  TLB parity error in virtual array
  TLB error 'instruction transaction, level 1'
STATUS 9400000000010011 MCGSTATUS 0
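For anyone reading the STATUS value by hand, here is a rough sketch of how it breaks down. It assumes the register follows the architectural MCi_STATUS layout (flag bits 63 down to 57, error code in the low 16 bits) and decodes only the TLB error class. The function name decode_mc_status is invented for this example, and none of this is taken from mcelog's own source.

```python
def decode_mc_status(status):
    """Decode a few architectural fields of an MCi_STATUS value.

    Sketch only: assumes the standard layout and handles just the
    TLB error-code class (0000 0000 0001 TTLL).
    """
    error_code = status & 0xFFFF
    fields = {
        "valid": bool((status >> 63) & 1),            # VAL
        "overflow": bool((status >> 62) & 1),         # OVER
        "uncorrected": bool((status >> 61) & 1),      # UC
        "enabled": bool((status >> 60) & 1),          # EN
        "misc_valid": bool((status >> 59) & 1),       # MISCV
        "addr_valid": bool((status >> 58) & 1),       # ADDRV
        "context_corrupt": bool((status >> 57) & 1),  # PCC
        "error_code": error_code,
    }
    if (error_code & 0xFFF0) == 0x0010:
        # TT = transaction type (bits 3:2), LL = cache level (bits 1:0)
        tt = ("instruction", "data", "generic", "reserved")[(error_code >> 2) & 3]
        ll = ("level 0", "level 1", "level 2", "generic")[error_code & 3]
        fields["description"] = f"TLB error '{tt} transaction, {ll}'"
    return fields


if __name__ == "__main__":
    for name, value in decode_mc_status(0x9400000000010011).items():
        print(name, value)
```

Under that reading, the value from the log has the valid, enabled and address-valid bits set, the uncorrected bit clear, and an error code of 0x0011, which agrees with the "instruction transaction, level 1" text mcelog printed.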
I think your machine is sick. MCEs and TLB parity errors look real bad. I'll tentatively reject this bug, sorry. If you can reproduce it on a different machine then please reopen.
Actually, that MCE happens on some machines during boot. That's a BIOS bug, but it does not point to hardware issues. We still log boot MCEs because it made sense before this class of BIOS bugs was discovered, although I planned to disable it (maybe we should do that for 2.6.13). If the MCE only happens directly after boot, that would be OK.
Have you tested without powernow?
I've tested it without ACPI and APM...would that have the effect of disabling powernow? How soon after boot does the MCE have to occur to be OK? I get it maybe a few minutes after.
I have some good news to report on this. I installed a BIOS update and installed the latest stable kernel, 2.6.13. It's a bit early to tell, but my machine has been up for 2 hours with no errors in the mcelog (a record...as stated before, lockups and problems appeared within a few minutes after boot). Things look really promising. My machine will be up over the 3 day weekend. I'll report back after that. If it's still up after the weekend, I think it may be safe to consider the matter fixed. I'm keeping my fingers crossed on this.
Ok, I've got more news to report. The machine hasn't locked up yet, and there's nothing in the mcelog. However, now I get this in the message log:

Losing some ticks... checking if CPU frequency changed.
warning: many lost ticks.
Your time source seems to be instable or some driver is hogging interupts
rip acpi_processor_idle+0x137/0x389 [processor]

Otherwise things seem ok for the time being. Any ideas what the messages above mean?
Well it didn't lock up. So that's good. However, the clock is running very fast. It was over an hour ahead when I came in this morning. In addition when I hit keys on the keyboard it sometimes records multiple keystrokes. I found another group of people that seem to be having this issue here: http://forums.gentoo.org/viewtopic-t-191716-postdays-0-postorder-asc-start-0.html I'm now trying some of the solutions they've tried. There's definitely something up here, I think.
Do the "timer too fast" problems still happen with 2.6.14-rc1? If yes, please attach an updated boot log. Also try with powernow disabled and with notsc.
I'll test rc1 later. I'm a little apprehensive about changing anything because the machine is working now. The other thing is that I have trouble compiling drivers against the patched kernels. Is there some sort of trick I need to know? Typically after I patch a kernel and then try to compile drivers such as the NVIDIA driver, I get an error about the kernel headers.

The clock is also fine now, but when I boot I use the following options:

clock=pmtmr pci=biosirq pci=irqroute pci=noacpi noapic clock=notsc notsc report_lost_ticks=100 nmi_watchdog=2

report_lost_ticks and nmi_watchdog are for troubleshooting; as for the others, I have no idea which ones or which combination causes things to work. PowerNow is enabled and working (verified with /proc/cpuinfo), so I don't think the problem is there. I also tried noapictimer; however, the system would not boot with that option enabled.
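One way to find out which of these options actually matters is to boot with each one removed in turn. The sketch below just generates those trial command lines. The names CANDIDATES, ALWAYS and leave_one_out are made up for illustration, and it treats options as plain whitespace-separated tokens (no quoting).

```python
# Workaround options from the comment above; which ones matter is unknown.
CANDIDATES = "clock=pmtmr pci=biosirq pci=irqroute pci=noacpi noapic clock=notsc notsc"
# Troubleshooting options that stay on every trial boot.
ALWAYS = "report_lost_ticks=100 nmi_watchdog=2"


def leave_one_out(candidates, always=""):
    """Yield (dropped_option, command_line) with one candidate removed."""
    tokens = candidates.split()
    for i, dropped in enumerate(tokens):
        kept = tokens[:i] + tokens[i + 1:] + always.split()
        yield dropped, " ".join(kept)


if __name__ == "__main__":
    for dropped, cmdline in leave_one_out(CANDIDATES, ALWAYS):
        print(f"without {dropped}: {cmdline}")
```

If the lost ticks come back only when one particular option is dropped, that option is the likely workaround. If several options interact, this one-at-a-time pass will not be enough.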
It's been a little while...I just loaded 2.6.13.4. Without the options listed above, I still get lost ticks. Prior to rebooting because of the kernel update, the machine had been up for 28 days. Looks like the lockups are gone at least. I will test 2.6.14 when it's officially released.
I'm trying 2.6.14, the machine has been up for a few days and I'm happy to report that the lost ticks messages are gone. The only options I'm booting with are report_lost_ticks=100 nmi_watchdog=2. As far as I can tell, everything seems great.
Great if it works. Closing.