Bug 4951
| Summary: | System freezes with SMP kernel on AMD X2 4600 | | |
|---|---|---|---|
| Product: | Platform Specific/Hardware | Reporter: | Jon Felder (felder) |
| Component: | x86-64 | Assignee: | Andi Kleen (andi-bz) |
| Status: | CLOSED PATCH_ALREADY_AVAILABLE | | |
| Severity: | high | CC: | akpm, alexn |
| Priority: | P2 | | |
| Hardware: | i386 | | |
| OS: | Linux | | |
| Kernel Version: | 2.6.12 | Subsystem: | |
| Regression: | --- | Bisected commit-id: | |
| Attachments: | boot.msg file from 2.6.13-rc3-mm2; Fix NMI on x64 SMP | | |
Description
Jon Felder
2005-07-27 13:17:30 UTC
Woops... I meant to say:
Hardware Environment: AMD X2 4600+ (Manchester) with an MSI Neo4 Platinum motherboard, BIOS version 1.5

We've fixed a few x86_64 nasties recently. It would be useful to test the current -linus tree or, better, 2.6.13-rc3-mm2.

I tried as you suggested; it didn't work. It froze 7 minutes after booting, using 2.6.13-rc3-mm2.

Created attachment 5389
boot.msg file from 2.6.13-rc3-mm2
I have no idea if this will help or not, but I'm including my boot.msg from booting 2.6.13-rc3-mm2.

So when it "freezes", is it completely dead? Not responding to pings, not responding to Alt-SysRq combinations? Have you tried enabling the NMI watchdog?

Yes, it's completely dead. Alt-SysRq doesn't work, and hitting keys like Caps Lock does nothing. The machine cannot be pinged. On your suggestion I tried the NMI watchdog... nmi_watchdog=1 on bootup, right? It doesn't appear to have helped. No messages appeared on the console or in the logs. In the boot.msg log I see:
<6>testing NMI watchdog ... CPU#0: NMI appears to be stuck (0->0)!
That appears to be during the boot process though.

So NMIs don't work either. Sigh. I give up. It might be worth disabling various drivers in config, removing or replacing hardware, wiggling the power supply connectors, etc.

Created attachment 5400
Fix NMI on x64 SMP
Could you apply this patch and we'll see if the NMI watchdog can give us any idea of what is going on.

Andrew: Disabling some of the drivers maybe... but I dunno. I don't think it's something like a loose cable because the system is stable under Windows. I know there's not much to go on. Like I said, I'm hoping that by getting this out there, maybe other people who can diagnose it better will have similar issues.
Alex: I assume the NMI patch is for 2.6.13-rc3-mm2, right? I'm going to try it now.

The NMI fix is in mainline now so it'll be there in -rc4 or the next -mm set.

The NMI fix didn't make it into 2.6.13-rc4. But Alex's attachment is valid. Can you please add that and try to get the NMI watchdog trace?

I tried the patch. Nothing showed up on the console or in the logs as far as I can tell. However, I'm not entirely sure I'm using it correctly. When I boot, I use nmi_watchdog=1 as a boot argument. Is this how I should be turning it on? Also, where would the NMI watchdog dump any info? I assume the console, but nothing shows there. Also, should I wait a certain amount of time after it freezes? I waited maybe 30 seconds or so.

Please test with nmi_watchdog=2 (either way should work but I'm more confident with this). Does the boot message say something like "testing NMI watchdog...OK"? And yes, the console is where it should turn up (after maybe 10 seconds). You can also check /proc/interrupts after boot to see if any NMIs have been registered.

Trying nmi_watchdog=2, nothing appeared on the console. I did verify "testing NMI watchdog...OK". /proc/interrupts had a row with NMI listed on the left and two columns of numbers. Something different did happen this time, however. In /var/log/messages it said "Machine check events logged". Then a log showed up called mcelog. In it is:
MCE 0
CPU 1 1 instruction cache
TSC 2b5fe2a4e9b
ADDR ffff8013efa0
TLB parity error in virtual array
TLB error 'instruction transaction, level 1'
STATUS 9400000000010011 MCGSTATUS 0

I think your machine is sick. MCEs and TLB parity errors look real bad. I'll tentatively reject this bug, sorry.
If you can reproduce it on a different machine then please reopen.

Actually that MCE happens on some machines during boot. That's a BIOS bug, but does not point to hardware issues. We still log boot MCEs because it made sense before this class of BIOS bugs was discovered, although I planned to disable it (maybe should do that for 2.6.13). If the MCE only happens directly after boot that would be OK. Have you tested without powernow?

I've tested it without ACPI and APM... would that have the effect of disabling powernow? How soon after boot does the MCE have to occur for it to be OK? I get it maybe a few minutes after.

I have some good news to report on this. I installed a BIOS update and installed the latest stable kernel, 2.6.13. It's a bit early to tell, but my machine has been up for 2 hours with no errors in the mcelog (a record... as stated before, lockups and problems appeared within a few minutes after boot). Things look really promising. My machine will be up over the 3-day weekend. I'll report back after that. If it's still up after the 3-day weekend, I think it may be safe to consider the matter fixed. I'm keeping my fingers crossed on this.

OK, I've got more news to report. The machine hasn't locked up yet, and there's nothing in the mcelog. However, now I get this in the message log:
Losing some ticks... checking if CPU frequency changed.
warning: many lost ticks.
Your time source seems to be instable or some driver is hogging interupts
rip acpi_processor_idle+0x137/0x389 [processor]
Otherwise things seem OK for the time being. Any ideas what the messages above mean?

Well, it didn't lock up. So that's good. However, the clock is running very fast. It was over an hour ahead when I came in this morning. In addition, when I hit keys on the keyboard it sometimes records multiple keystrokes.
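
For anyone reading the mcelog record quoted earlier in this thread, the STATUS value can be unpacked by hand. The following is only a sketch, assuming the generic x86 machine-check status layout (flag bits 63 to 57, a model-specific code in bits 31:16, and an error code of the form 0000 0000 0001 TTLL in bits 15:0 for TLB errors). The function name `decode_mci_status` is made up for illustration and is not part of mcelog; the meaning of the model-specific code depends on the CPU and is not decoded here.

```python
# Sketch: decode the architectural fields of an x86 machine-check STATUS value.
# decode_mci_status is a hypothetical helper, not an mcelog or kernel API.

STATUS_FLAGS = {
    63: "VAL",    # register holds a valid error record
    62: "OVER",   # an earlier error was overwritten
    61: "UC",     # error was NOT corrected by hardware
    60: "EN",     # reporting was enabled for this error
    59: "MISCV",  # MISC register holds additional information
    58: "ADDRV",  # ADDR register holds a valid address
    57: "PCC",    # processor context may be corrupt
}

TLB_TRANSACTION = {0: "instruction", 1: "data", 2: "generic"}
TLB_LEVEL = {0: "level 0", 1: "level 1", 2: "level 2", 3: "generic level"}


def decode_mci_status(status):
    """Return the set flags and error-code fields of a 64-bit STATUS value."""
    flags = [name for bit, name in STATUS_FLAGS.items() if (status >> bit) & 1]
    error_code = status & 0xFFFF            # bits 15:0
    model_code = (status >> 16) & 0xFFFF    # bits 31:16, CPU-specific meaning
    result = {
        "flags": flags,
        "error_code": error_code,
        "model_specific_code": model_code,
    }
    # TLB errors use the pattern 0000 0000 0001 TTLL.
    if error_code & 0xFFF0 == 0x0010:
        result["kind"] = "TLB error"
        result["transaction"] = TLB_TRANSACTION.get((error_code >> 2) & 0x3, "reserved")
        result["level"] = TLB_LEVEL[error_code & 0x3]
    return result


# STATUS value copied from the mcelog output in this report.
decoded = decode_mci_status(0x9400000000010011)
print(decoded)
```

Under these assumptions the value decodes to a valid, enabled record with a valid address and without the UC or PCC flags, classified as a level 1 instruction TLB error, which agrees with the text mcelog printed. The absence of UC suggests the hardware reported the error as corrected, though that alone does not settle whether the cause is the BIOS or the hardware.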
I found another group of people that seem to be having this issue here:
http://forums.gentoo.org/viewtopic-t-191716-postdays-0-postorder-asc-start-0.html
I'm now trying some of the solutions they've tried. There's definitely something up here, I think.

Do the timer-too-fast problems still happen with 2.6.14-rc1? If yes, please attach an updated boot log. Also try with powernow disabled and notsc.

I'll test rc1 later. I'm a little apprehensive about changing anything because the machine is working now. The other thing is that I have trouble compiling drivers against the patched kernels. Is there some sort of trick I need to know? Typically, after I patch a kernel and then try to compile drivers such as the NVIDIA driver, I get an error about the kernel headers.

The clock is also fine now, but when I boot I use the following options:
clock=pmtmr pci=biosirq pci=irqroute pci=noacpi noapic clock=notsc notsc report_lost_ticks=100 nmi_watchdog=2
report_lost_ticks and nmi_watchdog are for troubleshooting; as for the others, I have no idea which ones or which combination causes things to work. PowerNow is enabled and working (verified with /proc/cpuinfo), so I don't think the problem is there. I also tried noapictimer; however, the system would not boot with that option enabled.

It's been a little while... I just loaded 2.6.13.4. Without the options listed above, I still get lost ticks. Prior to rebooting because of the kernel update, the machine had been up for 28 days. Looks like the lockups are gone at least. I will test 2.6.14 when it's officially released.

I'm trying 2.6.14. The machine has been up for a few days and I'm happy to report that the lost ticks messages are gone. The only options I'm booting with are report_lost_ticks=100 nmi_watchdog=2. As far as I can tell, everything seems great.

Great if it works. Closing.
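
As a footnote to the NMI watchdog discussion above: the suggestion to check /proc/interrupts for registered NMIs can be scripted. This is a minimal sketch, assuming the usual layout of that file (a header row of CPU names followed by one labeled row per interrupt source). The helper names `nmi_counts` and `watchdog_ticking` and the sample numbers are invented for illustration.

```python
# Sketch: read per-CPU NMI counts from /proc/interrupts-style text.
# nmi_counts and watchdog_ticking are hypothetical helper names.

def nmi_counts(interrupts_text):
    """Return the NMI count for each CPU, or [] if there is no NMI row."""
    lines = interrupts_text.splitlines()
    if not lines:
        return []
    ncpus = len(lines[0].split())          # header row: CPU0 CPU1 ...
    for line in lines[1:]:
        label, _, rest = line.partition(":")
        if label.strip() == "NMI":
            # Newer kernels append a description after the counts; ignore it.
            return [int(field) for field in rest.split()[:ncpus]]
    return []


def watchdog_ticking(before, after):
    """True if every CPU received at least one NMI between two samples."""
    return bool(before) and len(before) == len(after) and all(
        later > earlier for earlier, later in zip(before, after)
    )


# Made-up sample in the assumed format, for a two-CPU machine.
SAMPLE = """           CPU0       CPU1
  0:     123456          0    IO-APIC-edge  timer
NMI:         42         40
LOC:     123400     123399
"""

print(nmi_counts(SAMPLE))
```

On a live system, one could read `/proc/interrupts` twice a few seconds apart and pass both results to `watchdog_ticking`. Counts that stay at zero on a CPU would match the "NMI appears to be stuck (0->0)" message from the first boot log.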