Bug 4951 - System freezes with SMP kernel on AMD X2 4600
Status: CLOSED PATCH_ALREADY_AVAILABLE
Product: Platform Specific/Hardware
Classification: Unclassified
Component: x86-64
Hardware: i386 Linux
Importance: P2 high
Assigned To: Andi Kleen
Depends on:
Blocks:
Reported: 2005-07-27 13:17 UTC by Jon Felder
Modified: 2006-01-19 01:26 UTC (History)

See Also:
Kernel Version: 2.6.12
Tree: Mainline
Regression: ---


Attachments
boot.msg file from 2.6.13-rc3-mm2 (34.04 KB, text/plain)
2005-07-27 15:34 UTC, Jon Felder
Fix NMI on x64 SMP (490 bytes, patch)
2005-07-28 14:58 UTC, Alexander Nyberg

Description Jon Felder 2005-07-27 13:17:30 UTC
Distribution:  SuSE 9.3 x86_64

Hardware Environment:  AMD X2 4600+ (manchester) with an MSI Neo4 Platinum bios
version 1.5

Problem Description:
System completely freezes within a short interval (15 - 30 minutes) after
booting.  This system is perfectly stable when running the non-smp x86_64
kernel.  It is also stable when running Windows (and windows does indicate the
presence of two processors), so I don't believe there is a hardware problem, but
I haven't completely ruled out the possibility.

I have noticed that things have been getting better with successive kernels. 
The original SuSE 9.3 smp kernel would not boot.  I upgraded it to the latest
SuSE kernel and got a lot of errors with powernow-k8. The machine booted but
would lock up quickly after. Finally I recompiled the kernel using the latest
stable source 2.6.12.  All the powernow-k8 problems seemed to go away, but not
the lock ups.

Here's the output from lspci:

0000:00:00.0 Memory controller: nVidia Corporation CK804 Memory Controller (rev a3)
0000:00:01.0 ISA bridge: nVidia Corporation: Unknown device 0050 (rev a3)
0000:00:01.1 SMBus: nVidia Corporation CK804 SMBus (rev a2)
0000:00:02.0 USB Controller: nVidia Corporation CK804 USB Controller (rev a2)
0000:00:02.1 USB Controller: nVidia Corporation CK804 USB Controller (rev a3)
0000:00:04.0 Multimedia audio controller: nVidia Corporation CK804 AC'97 Audio
Controller (rev a2)
0000:00:06.0 IDE interface: nVidia Corporation CK804 IDE (rev f2)
0000:00:07.0 IDE interface: nVidia Corporation CK804 Serial ATA Controller (rev f3)
0000:00:08.0 IDE interface: nVidia Corporation CK804 Serial ATA Controller (rev f3)
0000:00:09.0 PCI bridge: nVidia Corporation CK804 PCI Bridge (rev a2)
0000:00:0b.0 PCI bridge: nVidia Corporation CK804 PCIE Bridge (rev a3)
0000:00:0c.0 PCI bridge: nVidia Corporation CK804 PCIE Bridge (rev a3)
0000:00:0d.0 PCI bridge: nVidia Corporation CK804 PCIE Bridge (rev a3)
0000:00:0e.0 PCI bridge: nVidia Corporation CK804 PCIE Bridge (rev a3)
0000:00:18.0 Host bridge: Advanced Micro Devices [AMD] K8 [Athlon64/Opteron]
HyperTransport Technology Configuration
0000:00:18.1 Host bridge: Advanced Micro Devices [AMD] K8 [Athlon64/Opteron]
Address Map
0000:00:18.2 Host bridge: Advanced Micro Devices [AMD] K8 [Athlon64/Opteron]
DRAM Controller
0000:00:18.3 Host bridge: Advanced Micro Devices [AMD] K8 [Athlon64/Opteron]
Miscellaneous Control
0000:01:0c.0 FireWire (IEEE 1394): VIA Technologies, Inc. IEEE 1394 Host
Controller (rev 80)
0000:03:00.0 Ethernet controller: Marvell Technology Group Ltd. Gigabit Ethernet
Controller (rev 15)
0000:05:00.0 VGA compatible controller: nVidia Corporation: Unknown device 0141
(rev a2)

/proc/cpuinfo gives me this (I'm currently running the non-SMP kernel):

processor       : 0
vendor_id       : AuthenticAMD
cpu family      : 15
model           : 43
model name      : AMD Athlon(tm) 64 X2 Dual Core Processor 4600+
stepping        : 1
cpu MHz         : 1005.165
cache size      : 512 KB
fpu             : yes
fpu_exception   : yes
cpuid level     : 1
wp              : yes
flags           : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov
pat pse36 clflush mmx fxsr sse sse2 ht pni syscall nx mmxext fxsr_opt lm
3dnowext 3dnow pni lahf_lm cmp_legacy
bogomips        : 1993.38
TLB size        : 1024 4K pages
clflush size    : 64
cache_alignment : 64
address sizes   : 40 bits physical, 48 bits virtual
power management: ts fid vid ttp
cpu cores       : 2


I know that's not much to go on.  I don't see any errors in the logs when I
reboot after a lockup.  I'm hoping someone else is experiencing a similar issue.
I'll be happy to provide any other information if you want to request it.

Steps to reproduce:
Happens within 15 - 30 minutes after every reboot.
Comment 1 Jon Felder 2005-07-27 13:20:55 UTC
Whoops... I meant to say:

Hardware Environment:  AMD X2 4600+ (manchester) with an MSI Neo4 Platinum
motherboard bios version 1.5
Comment 2 Andrew Morton 2005-07-27 13:27:14 UTC
We've fixed a few x86_64 nasties recently.  It would be useful to
test the current -linus tree, or, better, 2.6.13-rc3-mm2.
Comment 3 Jon Felder 2005-07-27 15:03:04 UTC
I tried as you suggested, but it didn't work.  It froze 7 minutes after booting,
using 2.6.13-rc3-mm2.
Comment 4 Jon Felder 2005-07-27 15:34:55 UTC
Created attachment 5389 [details]
boot.msg file from 2.6.13-rc3-mm2

I have no idea if this will help or not, but I'm including my boot.msg from
booting 2.6.13-rc3-mm2
Comment 5 Andrew Morton 2005-07-27 15:51:30 UTC
So when it "freezes", is it completely dead?  Not responding to pings,
not responding to alt-sysrq combinations?

Have you tried enabling the nmi watchdog?
Comment 6 Jon Felder 2005-07-27 18:15:42 UTC
Yes, it's completely dead.  Alt-SysRq doesn't work, and keys like Caps Lock do
nothing.  The machine cannot be pinged.  On your suggestion I tried the NMI
watchdog (nmi_watchdog=1 on the boot command line, right?).  It doesn't appear
to have helped.  No messages appeared on the console or in the logs.  In the
boot.msg log I see:

<6>testing NMI watchdog ... CPU#0: NMI appears to be stuck (0->0)!

That appears to be during the boot process though.
Comment 7 Andrew Morton 2005-07-27 18:27:47 UTC
So NMIs don't work either.

Sigh.  I give up.

It might be worth disabling various drivers in config, removing
or replacing hardware, wiggling the power supply connectors, etc.
Comment 8 Alexander Nyberg 2005-07-28 14:58:43 UTC
Created attachment 5400 [details]
Fix NMI on x64 SMP

Could you apply this patch and we'll see if the NMI watchdog can give us any
idea of what is going on.
Comment 9 Jon Felder 2005-07-28 16:29:41 UTC
Andrew:

Disabling some of the drivers maybe...but I dunno.  I don't think it's something
like a loose cable because the system is stable under windows.  I know there's
not much to go on.  Like I said, I'm hoping that by getting this out there,
maybe other people who can diagnose it better will have similar issues.

Alex:
I assume the NMI patch is for 2.6.13-rc3-mm2 right?  I'm going to try it now.
Comment 10 Alexander Nyberg 2005-07-29 01:14:05 UTC
The NMI fix is in mainline now so it'll be there in -rc4 or the next -mm set.
Comment 11 Andrew Morton 2005-07-29 02:00:04 UTC
The NMI fix didn't make it into 2.6.13-rc4.  But Alex's attachment
is valid.   Can you please add that and try to get the NMI
watchdog trace?
Comment 12 Jon Felder 2005-07-29 10:01:09 UTC
I tried the patch.  Nothing showed up on the console or in the logs as far as I
can tell.  However, I'm not entirely sure I'm using it correctly.

When I boot, I use nmi_watchdog=1 as a boot argument.  Is this how I should be
turning it on?  Also, where would the nmi watchdog dump any info?  I assume the
console, but nothing shows there.  Also should I wait a certain amount of time
after it freezes?  I waited maybe 30 seconds or so.
Comment 13 Alexander Nyberg 2005-07-29 10:32:34 UTC
Please test with nmi_watchdog=2 (either way should work but I'm more confident
with this). Does the boot message say something like "testing NMI watchdog...OK"?
And yes, the console is where it should turn up (after maybe 10 seconds).

You can also check /proc/interrupts after boot to see if any NMIs have been
registered.
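
The /proc/interrupts check can be scripted. The following is a minimal Python sketch, not part of the original thread: the helper name nmi_counts and the sample text are made up for illustration, and the parsing assumes the usual layout of one "NMI:" row with one count per CPU.

```python
def nmi_counts(interrupts_text):
    """Return the per-CPU NMI counts from /proc/interrupts text ([] if no NMI row)."""
    for line in interrupts_text.splitlines():
        label, sep, rest = line.partition(":")
        if sep and label.strip() == "NMI":
            counts = []
            for token in rest.split():
                if not token.isdigit():
                    break  # stop at any trailing description text
                counts.append(int(token))
            return counts
    return []


# Hypothetical sample for a two-CPU machine; the numbers are invented.
SAMPLE = """\
           CPU0       CPU1
  0:     123456          0    IO-APIC-edge  timer
NMI:       4021       3987
LOC:     123400     123399
"""

if __name__ == "__main__":
    try:
        with open("/proc/interrupts") as f:
            text = f.read()
    except OSError:
        text = SAMPLE  # fall back to the sample when not running on Linux
    print(nmi_counts(text))
```

Running it twice a few seconds apart and comparing the lists shows whether the counts are increasing on every CPU, which is what a working watchdog should produce.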
Comment 14 Jon Felder 2005-07-29 11:26:13 UTC
With nmi_watchdog=2 nothing appeared on the console.  I did verify that the boot
log says "testing NMI watchdog ... OK".  /proc/interrupts had a row with NMI
listed on the left and two columns of numbers.

Something different did happen this time, however.  /var/log/messages said
"Machine check events logged".  Then a log showed up called mcelog.  In it is:

MCE 0
CPU 1 1 instruction cache TSC 2b5fe2a4e9b
ADDR ffff8013efa0
  TLB parity error in virtual array
  TLB error 'instruction transaction, level 1'
STATUS 9400000000010011 MCGSTATUS 0
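
For readers who want to see where mcelog's text comes from, here is a rough Python sketch (not from the original thread) that decodes the STATUS value by hand. It assumes the standard x86 machine-check status layout: flag bits 57 to 63, a model-specific code in bits 16 to 31, and a TLB error code of the form 0000 0000 0001 TTLL in the low 16 bits. The function name decode_mci_status is made up, and only the TLB error class is handled.

```python
# Flag bits of the machine-check status register (standard x86 MCA layout).
STATUS_FLAGS = {63: "VAL", 62: "OVER", 61: "UC", 60: "EN",
                59: "MISCV", 58: "ADDRV", 57: "PCC"}
TRANSACTION = {0: "instruction", 1: "data", 2: "generic"}
LEVEL = {0: "level 0", 1: "level 1", 2: "level 2", 3: "generic level"}


def decode_mci_status(status):
    """Decode the flag bits and, for TLB errors only, the low 16-bit error code."""
    flags = [name for bit, name in sorted(STATUS_FLAGS.items(), reverse=True)
             if (status >> bit) & 1]
    code = status & 0xFFFF
    if code & 0xFFF0 == 0x0010:  # TLB error pattern: 0000 0000 0001 TTLL
        tt = TRANSACTION.get((code >> 2) & 0x3, "reserved")
        kind = "TLB error '%s transaction, %s'" % (tt, LEVEL[code & 0x3])
    else:
        kind = "not decoded by this sketch"
    return {
        "flags": flags,
        "model_specific_code": (status >> 16) & 0xFFFF,
        "error_code": code,
        "kind": kind,
    }


if __name__ == "__main__":
    print(decode_mci_status(0x9400000000010011))
```

For the value above this yields the flags VAL, EN and ADDRV and the same "instruction transaction, level 1" TLB error that mcelog printed. UC (bit 61) is clear, which under this layout means the hardware reported the error as corrected.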
Comment 15 Andrew Morton 2005-08-04 15:46:31 UTC
I think your machine is sick.  MCEs and TLB parity errors look
real bad.

I'll tentatively reject this bug, sorry.  If you can reproduce it
on a different machine then please reopen.
Comment 16 Andi Kleen 2005-08-04 15:50:09 UTC
Actually that MCE happens on some machines during boot. That's a BIOS 
bug, but does not point to hardware issues. 
We still log boot MCEs because it made sense before this class of BIOS
bugs was discovered, although I planned to disable it (maybe should do
that for 2.6.13).
 
If the MCE only happens directly after boot that would be ok. 
Comment 17 Andi Kleen 2005-08-04 16:09:40 UTC
Have you tested without powernow? 
 
Comment 18 Jon Felder 2005-08-05 09:55:46 UTC
I've tested it without acpi and apm...would that have the effect of disabling
powernow?

How much directly after boot is the MCE ok?  I get it maybe a few minutes after.
Comment 19 Jon Felder 2005-09-02 14:46:56 UTC
I have some good news to report on this.  I installed a BIOS update and
installed the latest stable kernel, 2.6.13.  It's a bit early to tell, but my
machine has been up for 2 hours with no errors in the mcelog (a record...as
stated before, lockups and problems appeared within a few minutes after boot).

Things look really promising.  My machine will be up over the 3 day weekend. 
I'll report back after that.  If it's still up after the 3 day weekend, I think
it may be safe to consider the matter fixed.

I'm keeping my fingers crossed on this.
Comment 20 Jon Felder 2005-09-02 17:18:34 UTC
Ok, I've got more news to report.  The machine hasn't locked up yet, and there's
nothing in the mcelog.  However, now I get this in the message log:

Losing some ticks... checking if CPU frequency changed.
warning: many lost ticks.
Your time source seems to be instable or some driver is hogging interupts
rip acpi_processor_idle+0x137/0x389 [processor]

Otherwise things seem ok for the time being.  Any ideas what the messages above
mean? 
Comment 21 Jon Felder 2005-09-06 09:30:21 UTC
Well it didn't lock up.  So that's good.  However, the clock is running very
fast.  It was over an hour ahead when I came in this morning.  In addition when
I hit keys on the keyboard it sometimes records multiple keystrokes.

I found another group of people that seem to be having this issue here:

http://forums.gentoo.org/viewtopic-t-191716-postdays-0-postorder-asc-start-0.html

I'm now trying some of the solutions they've tried.  There's definitely
something up here, I think.
Comment 22 Andi Kleen 2005-09-13 01:07:37 UTC
Does the timer-running-too-fast problem still happen with 2.6.14-rc1? If yes,
please attach an updated boot log. Also try with powernow disabled and notsc.
Comment 23 Jon Felder 2005-09-13 13:18:51 UTC
I'll test rc1 later.  I'm a little apprehensive about changing anything because
the machine is working now.  The other thing is that I have trouble compiling
drivers against the patched kernels.  Is there some sort of trick I need to
know?  Typically after I patch a kernel and then try to compile drivers such as
the NVIDIA driver, I get an error about the kernel headers.

The clock is also fine now, but when I boot I use the following options:

clock=pmtmr pci=biosirq pci=irqroute pci=noacpi noapic clock=notsc notsc
report_lost_ticks=100 nmi_watchdog=2

report_lost_ticks and nmi_watchdog are for troubleshooting.  As for the others,
I have no idea which ones or which combination makes things work.

PowerNow is enabled and working (verified with /proc/cpuinfo) so I don't think
the problem is there.  I also tried noapictimer, however the system would not
boot with that option enabled.
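
One way to find out which of the options above actually matters is to boot once per option with that single option removed. The Python sketch below (an illustration added here, not something tried in this thread; leave_one_out is a made-up helper) just prints the candidate command lines to test by hand.

```python
# Workaround options quoted in the comment above.
WORKAROUNDS = ("clock=pmtmr pci=biosirq pci=irqroute pci=noacpi "
               "noapic clock=notsc notsc").split()
# Troubleshooting options to keep on every test boot.
ALWAYS = ["report_lost_ticks=100", "nmi_watchdog=2"]


def leave_one_out(options):
    """Return (removed_option, remaining_options) pairs, one per option."""
    return [(removed, options[:i] + options[i + 1:])
            for i, removed in enumerate(options)]


if __name__ == "__main__":
    for removed, remaining in leave_one_out(WORKAROUNDS):
        print("without %-13s -> %s" % (removed, " ".join(remaining + ALWAYS)))
```

If the lost-ticks messages come back on a boot where one option is missing, that option is probably doing real work. Options that can be dropped without any change are candidates for removal.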


Comment 24 Jon Felder 2005-10-14 16:09:57 UTC
It's been a little while...I just loaded 2.6.13.4.  Without the options listed
above, I still get lost ticks.  Prior to rebooting because of the kernel update,
the machine had been up for 28 days.  Looks like the lockups are gone at least.
I will test 2.6.14 when it's officially released.
Comment 25 Jon Felder 2005-11-07 09:32:43 UTC
I'm trying 2.6.14.  The machine has been up for a few days and I'm happy to
report that the lost ticks messages are gone.  The only options I'm booting with
are report_lost_ticks=100 nmi_watchdog=2.  As far as I can tell, everything
seems great.
Comment 26 Andi Kleen 2006-01-19 01:26:39 UTC
Great if it works. Closing.

