Bug 11527 - MCE on Athlon64 X2, nForce4 motherboard
Summary: MCE on Athlon64 X2, nForce4 motherboard
Status: CLOSED CODE_FIX
Alias: None
Product: Platform Specific/Hardware
Classification: Unclassified
Component: x86-64
Hardware: All Linux
Importance: P1 normal
Assignee: platform_x86_64@kernel-bugs.osdl.org
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2008-09-09 12:27 UTC by Matteo Croce
Modified: 2009-05-18 16:47 UTC
CC List: 7 users

See Also:
Kernel Version: 2.6.27-rc5
Subsystem:
Regression: No
Bisected commit-id:


Attachments
a screenshot of the panic (48.67 KB, image/jpeg)
2008-09-09 12:46 UTC, Matteo Croce
a screenshot of the panic, scrolling down (42.22 KB, image/jpeg)
2008-09-09 12:47 UTC, Matteo Croce
the syslog (27.76 KB, text/plain)
2008-09-11 05:53 UTC, Matteo Croce
the 2.6.26.5 syslog (28.02 KB, text/plain)
2008-09-11 11:00 UTC, Matteo Croce
a screenshot of the error (43.83 KB, image/png)
2008-09-23 16:31 UTC, Matteo Croce
collection of many dmesg (124.82 KB, application/x-bzip-compressed-tar)
2008-10-01 16:47 UTC, Matteo Croce
the MCE (12.40 KB, image/png)
2008-10-04 06:09 UTC, Matteo Croce
lspci -vvxxxx output (115.76 KB, text/plain)
2008-10-06 12:37 UTC, Matteo Croce
dmesg output (maxcpus=1) (43.08 KB, text/plain)
2008-10-30 06:07 UTC, Phil Linttell
lspci -vvxxxx output (while running with maxcpus=1) (62.88 KB, text/plain)
2008-10-30 06:08 UTC, Phil Linttell
output from lspci -vvv (both CPU cores enabled) (19.71 KB, text/plain)
2008-10-30 19:13 UTC, Phil Linttell
output from lspci -vn (both CPU cores enabled) (6.58 KB, text/plain)
2008-10-30 19:14 UTC, Phil Linttell
output from dmesg (both cores enabled) (43.42 KB, text/plain)
2008-10-30 19:17 UTC, Phil Linttell

Description Matteo Croce 2008-09-09 12:27:08 UTC
Latest working kernel version: 2.6.25
Earliest failing kernel version: 2.6.27-rc5
Distribution: Ubuntu 8.10
Hardware Environment: Athlon64 X2, nForce4 motherboard
Problem Description:

I can't boot: the kernel prints a large number of errors and then the watchdog reboots the machine.
The only things I can read are mce_panic and do_machine_check.

Steps to reproduce:

boot, start X, wait 2 or 3 seconds
Comment 1 Matteo Croce 2008-09-09 12:28:19 UTC
I don't have a serial console; please help me save a panic dump.
I have a serial->USB adapter which I use to flash my router. Can I log into the router and read the error from there?
Will the USB driver work while panicking?
Comment 2 Matteo Croce 2008-09-09 12:46:19 UTC
Created attachment 17700
a screenshot of the panic
Comment 3 Matteo Croce 2008-09-09 12:47:21 UTC
Created attachment 17701
a screenshot of the panic, scrolling down
Comment 4 Matteo Croce 2008-09-09 12:48:17 UTC
In the meantime I've posted two screenshots, low quality though.
Comment 5 Andrew Morton 2008-09-09 13:51:39 UTC
Thanks.  Were you able to test 2.6.26?
Comment 6 Matteo Croce 2008-09-09 16:20:59 UTC
Yes, but it hangs too, for some obscure reason.
I really can't debug it: I'm running X, so everything just hangs and I can't check anything.
I don't have a serial port; how can I provide a useful stack trace?
Comment 7 Thomas Gleixner 2008-09-10 01:27:56 UTC
> I don't have a serial port; how can I provide a useful stack trace?

netconsole might be worth a try. Documentation/networking/netconsole.txt

Please switch to 2.6.27-rc6 as well.

Thanks,

	tglx
Comment 8 Matteo Croce 2008-09-10 03:58:20 UTC
2.6.27-rc6 hangs too.
While setting up netconsole, I noticed "smp.c: smp_call_function_single" at the top of the backtrace.
Comment 9 Matteo Croce 2008-09-10 04:59:35 UTC
Tried netconsole: it logs the standard messages but not the panic.
Maybe the Ethernet driver is unable to send after the error?
Meanwhile I'm trying 2.6.26.2, as it takes a long time before it hangs (10-30 minutes).
Comment 10 Matteo Croce 2008-09-10 17:35:37 UTC
akpm: 2.6.26.5 never hangs, the bug is 2.6.27 specific
Comment 11 Thomas Gleixner 2008-09-11 04:22:38 UTC
ok, the machine check happens in acpi_pm_read() as far as I can decode your screenshots.

Can you please upload the messages which you see via netconsole up to the point where the kernel crashes ?
Comment 12 Matteo Croce 2008-09-11 05:53:37 UTC
Created attachment 17727
the syslog

The log from a normal bootup, just before the crash
Comment 13 Thomas Gleixner 2008-09-11 07:36:32 UTC
Can you please add "nopat" to the kernel command line ?
Comment 14 Thomas Gleixner 2008-09-11 07:38:25 UTC
Also can you please provide a dmesg from the last working kernel ?
Comment 15 Matteo Croce 2008-09-11 10:59:41 UTC
hangs with nopat too
Comment 16 Matteo Croce 2008-09-11 11:00:31 UTC
Created attachment 17729
the 2.6.26.5 syslog
Comment 17 Matteo Croce 2008-09-11 11:03:00 UTC
Thomas, note that PAT was enabled in 2.6.26.5 too
Comment 18 Matteo Croce 2008-09-14 18:57:03 UTC
Can't I save the error in RAM and read it after kexecing a new kernel?
Sort of...
Comment 19 Matteo Croce 2008-09-14 19:10:08 UTC
I've found CONFIG_CRASH_DUMP, can it help?
Comment 20 Rafael J. Wysocki 2008-09-14 21:29:29 UTC
On the basis of comment #10 I'm regarding this as a recent regression.
Comment 21 Matteo Croce 2008-09-15 05:00:45 UTC
Erratum: the system doesn't reboot by itself; it just hangs forever, and the watchdog reboots it.
I discovered this because, after loading a crashdump kernel, the machine still rebooted without kexecing the new kernel.
What happens is this: on the console everything freezes except the blinking cursor, and I can't change VT or type anything.
Comment 22 Matteo Croce 2008-09-20 19:13:16 UTC
No response from the kernel gurus; I'll git-bisect the kernel and try to find the cause.
Comment 23 Ingo Molnar 2008-09-20 23:46:39 UTC
> Can't I save the error in RAM and read it after kexecing a new kernel?
> Sort of...

If you don't have a serial console (a null modem cable), then the easiest way is 
to build netconsole into the kernel (CONFIG_NETCONSOLE=y) and set it up 
on the boot line, via something like:

  netconsole=4444@10.0.1.13/eth0,4444@10.0.1.16/00:11:22:33:44:55

where 10.0.1.13 is the IP of the crashing box, 10.0.1.16 is the IP and 
00:11:22:33:44:55 is the Ethernet address of the box you log to. On the 
box you log the crash to, run something like:

  netcat -u -p 4444 10.0.1.13 4444 2>&1 | tee -a crash.log

and you should have the log in crash.log.

OTOH, bisecting this bug would certainly be very informative, the crash 
seems to happen reliably.

	Ingo
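
For readers following along, the boot parameter and the receiving side described above can be sketched in Python instead of netcat. This is only an illustration: the helper names netconsole_param and receive_log are made up here, and the listener assumes the target UDP port 4444 used in the example.

```python
import socket


def netconsole_param(src_ip, dev, dst_ip, dst_mac, src_port=4444, dst_port=4444):
    """Build the boot-line value in the layout of the example above:
    <src-port>@<src-ip>/<dev>,<dst-port>@<dst-ip>/<dst-mac>."""
    return f"netconsole={src_port}@{src_ip}/{dev},{dst_port}@{dst_ip}/{dst_mac}"


def receive_log(port=4444, path="crash.log", max_packets=None):
    """Append every UDP datagram arriving on `port` to `path`.

    Runs until interrupted unless max_packets is given.
    """
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.bind(("", port))
    seen = 0
    try:
        with open(path, "ab") as log:
            while max_packets is None or seen < max_packets:
                data, _addr = sock.recvfrom(65535)
                log.write(data)
                log.flush()  # keep the file current in case the sender dies
                seen += 1
    finally:
        sock.close()


if __name__ == "__main__":
    # Same addresses as in the example boot line.
    print(netconsole_param("10.0.1.13", "eth0", "10.0.1.16", "00:11:22:33:44:55"))
```

Calling receive_log() on the logging box plays the role of the netcat command; flushing after every datagram means crash.log holds everything that arrived before the crashing box went silent.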
Comment 24 Matteo Croce 2008-09-21 07:32:17 UTC
network console hangs too, I have to bisect
Comment 25 Matteo Croce 2008-09-21 09:44:24 UTC
I've found what hangs the kernel: it is the ath9k driver.
Comment 26 Rafael J. Wysocki 2008-09-21 11:42:52 UTC
So, this is not a regression in a strict sense, because ath9k was not present in the previous kernels.

Can you confirm that the rest of the kernel works correctly if ath9k is not loaded?
Comment 27 Matteo Croce 2008-09-21 13:47:17 UTC
i can use a 2.6.27-rc6 kernel using another wireless driver (madwifi)
Comment 28 John W. Linville 2008-09-22 06:11:12 UTC
Not sure where to start...the output of 'lspci -n' might be a good place.  Also, is there a version where the ath9k loaded successfully?  If so, could we see the dmesg output from that version?

Were you trying to load madwifi as well as ath9k?  If so, does disabling the load of madwifi allow ath9k to load successfully?
Comment 29 Matteo Croce 2008-09-22 14:26:01 UTC
# lspci -nn |grep Atheros
05:08.0 Network controller [0280]: Atheros Communications Inc. AR5416 802.11abgn Wireless PCI Adapter [168c:0023] (rev 01)

This is the first time I've tried ath9k; I was using madwifi until now.

I can't test ath9k with a 2.6.26 kernel as it doesn't compile.
I have another amd64 machine where ath9k works but disconnects frequently; I have to do ifdown && ifup...
Comment 30 John W. Linville 2008-09-22 14:40:29 UTC
Could you try an earlier 2.6.27-rc?  2.6.27-rc3 perhaps?

You may also wish to try this patch (hopefully merged to Linus soon):

   http://marc.info/?l=linux-wireless&m=122163541519736&w=2
Comment 31 Matteo Croce 2008-09-22 18:46:15 UTC
It hung too with 2.6.27-rc3, IIRC, but I'm not 100% sure.
BTW, the patch you posted fixed the disconnects on the other machine.
Comment 32 Matteo Croce 2008-09-23 02:17:28 UTC
Tested on my PC: it still hangs.
John: is there any way to debug the ath9k driver or the mac80211 stack?
Could it be a mac80211 deadlock?
Comment 33 John W. Linville 2008-09-23 14:04:40 UTC
It _might_ be any number of issues -- hard to say with no stack trace.

When it hangs, do you have a text console?  Can you enable sysrq and capture the output of alt-sysrq-p while it is hung?
Comment 34 Matteo Croce 2008-09-23 16:30:24 UTC
I've taken many photos of the crash and joined them together with GIMP; please have a look while I run OCR on the image.
Comment 35 Matteo Croce 2008-09-23 16:31:20 UTC
Created attachment 17979
a screenshot of the error
Comment 36 Matteo Croce 2008-09-23 17:56:49 UTC
ath9k may leave the card active after rmmod, because loading madwifi afterwards produces complaints about spurious interrupts:

ath9k 0000:05:08.0: PCI INT A disabled
ath9k: driver unloaded
ath_hal: module license 'Proprietary' taints kernel.
AR5210, AR5211, AR5212, AR5416, RF5111, RF5112, RF2413, RF5413, RF2133)
ath_pci 0000:05:08.0: PCI INT A -> Link[APC1] -> GSI 16 (level, low) -> IRQ 16
MadWifi: ath_attach: Switching rfkill capability off.
wifi0: Atheros AR5418 chip found (MAC 13.10, PHY SChip 8.1, Radio 13.0)
ath_pci: wifi0: Atheros 5416: mem=0xfdee0000, irq=16
udev: renamed network interface ath0 to wlan0
irq 16: nobody cared (try booting with the "irqpoll" option)
Pid: 0, comm: swapper Tainted: P          2.6.27-rc7 #3

Call Trace:
 <IRQ>  [<ffffffff8025ab88>] getnstimeofday+0x48/0xc0
 [<ffffffff8027363e>] __report_bad_irq+0x1e/0x80
 [<ffffffff80273940>] note_interrupt+0x2a0/0x2d0
 [<ffffffff8027426d>] handle_fasteoi_irq+0xdd/0x100
 [<ffffffff8020f9f3>] do_IRQ+0xa3/0x130
 [<ffffffff8020c9b1>] ret_from_intr+0x0/0xa
 <EOI>  [<ffffffff8021f360>] lapic_next_event+0x0/0x10
 [<ffffffff8025f61f>] tick_nohz_stop_sched_tick+0xff/0x3e0
 [<ffffffff8020b3aa>] cpu_idle+0x2a/0x100

handlers:
[<ffffffffa0266e10>] (ath_intr+0x0/0x4460 [ath_pci])
Disabling IRQ #16
Comment 38 Matteo Croce 2008-09-24 03:50:22 UTC
Done, but now I can't load one driver after the other: when loading ath9k I get a panic in ath_rx_something.
Comment 39 John W. Linville 2008-09-24 07:24:05 UTC
Can we see that panic? :-)
Comment 40 Matteo Croce 2008-09-25 15:52:47 UTC
I didn't have the camera with me this time, sorry
Comment 41 Luis Chamberlain 2008-09-25 16:06:43 UTC
The idea is to kill MadWifi: please rm -rf it and its modules, and let's resolve your issue. Before posting a kernel panic, please make sure madwifi is nowhere to be seen:

modprobe -l ath_pci
Comment 42 Luis Chamberlain 2008-09-25 16:08:36 UTC
Also if you can upgrade to 2.6.27-rc7 that would be great as it has some of the pending patches posted by John here already merged.
Comment 43 Matteo Croce 2008-09-25 19:51:36 UTC
I upgraded just after its release: no luck.
As for madwifi, I don't have it installed; I only tried installing it to get some network access without rebooting.
Comment 44 Luis Chamberlain 2008-09-25 21:21:03 UTC
We definitely need some sort of panic in order to assist. Can you try to write down the oops? A picture works too.
Comment 45 Matteo Croce 2008-09-28 13:43:06 UTC
What panic? I've already attached the oops; it's attachment #17979.
Comment 46 Luis Chamberlain 2008-09-28 19:15:29 UTC
Yes, but that picture doesn't show me anything related to ath9k that we can help with. In fact, from reading the trace it seems a machine check exception may have been hit, so the kernel may have picked that up and thrown up.

The picture I was hoping for was the one of an oops related to ath9k without MadWifi (a proprietary driver) ever touching the kernel.
Comment 47 Matteo Croce 2008-09-29 10:09:42 UTC
I don't use madwifi; I use (well, try to use) ath9k, but I get this issue.
I know it's ath9k's fault, as I can stay connected with an rtl8187 or a ZyDAS card all day without errors.
Comment 48 Luis Chamberlain 2008-09-29 11:19:21 UTC
Alright, well, let's work with your current latest oops, which you indicate is attachment #17979. Your oops indicates it's tainted, but that is because of the MCE (Machine Check Exception) one of the CPUs hit. The last sane call I see is acpi_pm_read(), which just inl()'s the ACPI power management timer. Not sure how ath9k can affect this, but I'm just going with what is presented to us through the provided panic.

Can you try adding to your kernel boot parameters:

acpi_pm_good

And if that does not work try adding this:

acpi=off

Also if you can provide the output of cat /proc/interrupts that would be helpful.
Comment 49 Matteo Croce 2008-09-30 05:37:10 UTC
2.6.27-rc8 is out, will try it first :)
Comment 50 Matteo Croce 2008-09-30 05:50:00 UTC
no luck, 2.6.27-rc8 hangs too.
I load the module and in less than 10 seconds the system freezes
Comment 51 Luis Chamberlain 2008-09-30 07:34:10 UTC
Can you try adding to your kernel boot parameters:

acpi_pm_good

And if that does not work try adding this:

acpi=off

Also if you can provide the output of cat /proc/interrupts that would be
helpful.
Comment 52 Matteo Croce 2008-10-01 01:56:04 UTC
# cat /proc/interrupts
           CPU0       CPU1
  0:         15          0   IO-APIC-edge      timer
  6:          0          5   IO-APIC-edge      floppy
  7:          1          0   IO-APIC-edge
  8:          0          0   IO-APIC-edge      rtc0
  9:          0          0   IO-APIC-fasteoi   acpi
 14:          3        240   IO-APIC-edge      pata_amd
 15:          0        283   IO-APIC-edge      pata_amd
 16:          0          0   IO-APIC-fasteoi   ath
 18:          0          0   IO-APIC-fasteoi   EMU10K1
 20:          6      24855   IO-APIC-fasteoi   ehci_hcd:usb1
 22:          0          0   IO-APIC-fasteoi   sata_nv
 23:         13       9722   IO-APIC-fasteoi   sata_nv, ohci_hcd:usb2
NMI:          0          0   Non-maskable interrupts
LOC:      14422      21745   Local timer interrupts
RES:       4054       1806   Rescheduling interrupts
CAL:        641        266   function call interrupts
TLB:        622        400   TLB shootdowns
TRM:          0          0   Thermal event interrupts
THR:          0          0   Threshold APIC interrupts
SPU:          0          0   Spurious interrupts
ERR:          1
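
A listing like the one above is easier to compare across boots once it is parsed. The following is a rough sketch only; parse_interrupts is a made-up helper, and the sample reuses a few rows from the listing rather than reading /proc/interrupts directly.

```python
SAMPLE = """\
           CPU0       CPU1
 16:          0          0   IO-APIC-fasteoi   ath
 20:          6      24855   IO-APIC-fasteoi   ehci_hcd:usb1
 23:         13       9722   IO-APIC-fasteoi   sata_nv, ohci_hcd:usb2
ERR:          1
"""


def parse_interrupts(text):
    """Map each IRQ label to (per-CPU counts, remaining description)."""
    lines = text.strip("\n").splitlines()
    ncpus = len(lines[0].split())  # header row: CPU0 CPU1 ...
    table = {}
    for line in lines[1:]:
        # Split on the first colon only; handler names may contain colons.
        label, _, rest = line.partition(":")
        fields = rest.split()
        counts = [int(f) for f in fields[:ncpus] if f.isdigit()]
        table[label.strip()] = (counts, " ".join(fields[len(counts):]))
    return table


table = parse_interrupts(SAMPLE)
# IRQs that have not fired on any CPU so far.
silent = [irq for irq, (counts, _desc) in table.items() if sum(counts) == 0]
print(silent)
```

On the real file, the same function could be fed open("/proc/interrupts").read(); a line such as ERR, which carries a single counter, simply yields a shorter count list.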
Comment 53 Matteo Croce 2008-10-01 02:42:32 UTC
I've tested these parameters:

with acpi_pm_good it hangs just the same
with acpi=off my SATA controller (sata_nv) is unable to access the disks, so nothing boots.

But I've found that everything works flawlessly with maxcpus=0.
I tried this because the error in the image was duplicated: I hoped that with only one CPU enabled it would be printed only once,
so that I could see what comes before the errors, but I get no errors at all...
Comment 54 Matteo Croce 2008-10-01 02:50:19 UTC
Works with maxcpus=1 as well,

could it be a deadlock? warn_on_slow_path...
Comment 55 Luis Chamberlain 2008-10-01 10:53:18 UTC
Well, like I had said before, your panic only provides enough information to indicate the cause of the panic is a Machine Check Exception:

http://en.wikipedia.org/wiki/Machine_Check_Exception

The last sane call I see is acpi_pm_read().

I'm running low on ideas. One possibility is to recompile ath9k with all debug messages enabled and see if the log captures some errors prior to your panic. To do this modify the file drivers/net/wireless/ath9k/core.h on the line that says

#define DBG_DEFAULT (ATH_DBG_FATAL)

Modify this to be:

#define DBG_DEFAULT (ATH_DBG_ANY)

Then since your machine seems to crash as ath9k gets loaded, remove the module from your modules directory before booting into the kernel:

sudo rm -f /lib/modules/2.6.27-rc8/kernel/drivers/net/wireless/ath9k/ath9k.ko
sudo depmod -ae

After doing this boot into 2.6.27-rc8. Then recompile ath9k as follows:

make M=drivers/net/wireless/ath9k/
sudo insmod drivers/net/wireless/ath9k/ath9k.ko
Comment 56 Matteo Croce 2008-10-01 16:46:41 UTC
ATH_DBG_ANY gives an insane amount of output, so I ran a loop like:

i=0
while true
do
	dmesg > dmesg.$i
	i=$((i+1))
	sync
done

and I'm attaching all the dmesgs
Comment 57 Matteo Croce 2008-10-01 16:47:55 UTC
Created attachment 18136
collection of many dmesg
Comment 58 Matteo Croce 2008-10-01 16:58:20 UTC
ok, i've disabled warn_on_slowpath, let's see what happens...
Comment 59 Matteo Croce 2008-10-03 13:00:45 UTC
No luck, there are no relevant printks before the hang
Comment 60 Luis Chamberlain 2008-10-03 13:26:17 UTC
I've never tried this, but it may help to find the root cause of the MCE issue:

http://freshmeat.net/projects/mcelog/
Comment 61 Matteo Croce 2008-10-03 16:04:12 UTC
root@raver:~# mcelog
root@raver:~#
Comment 62 Luis Chamberlain 2008-10-03 18:08:37 UTC
You do have /dev/mcelog, right?

OK, well, I speak MCE and mcelog now. As per Andi Kleen's documentation, when PCI IO errors are enabled machine checks could be caused by software bugs in drivers -- BUT -- this normally isn't the case on current x86 machines.

But... let's keep trying to dig into what is causing this...

You need to force the MCE again. I suspect the kernel detected the MCE while it was running acpi_pm_read() and decided a deadlock was possible. So you can reproduce this, right? OK, what we need to do is trigger the MCE again and try to keep the machine alive to capture the log. Usually mcelog is run by cron to capture MC events, but if an MC event is captured in kernel context and it's an exception, the kernel panics. So we can try to force the panic to not happen to see if we can capture the data from the character device /dev/mcelog. The problem is that during reboot /dev/mcelog is obviously cleared, and I suspect it will be populated only if the MCE banks on the CPU were not cleared (perhaps during a warm boot) upon initialization.

Anyway, so let's force the box to not panic during an MCE:

# Note this is dangerous and perhaps stupid :)
root@deathMCEbox:# for i in /sys/devices/system/machinecheck/machinecheck*; do echo 3 > $i/tolerant; done

This forces the kernel to never panic on each CPU even if an MCE was found and the current CPU process was in kernel context and a deadlock was possible (dangerous).

Usually you would have mcelog run in cron as follows, perhaps hourly:

#!/bin/bash
/usr/sbin/mcelog --ignorenodev --filter >> /var/log/mcelog

But since you are eager to find this and you can trigger this, then run this in a loop as follows to run it every 5 seconds:

root@deathMCEbox:# while true; do /usr/sbin/mcelog --filter >> /var/log/mcelog; sleep 5;

Of course though you cannot do any of this if upon bootup ath9k gets loaded and that damn culprit driver is causing the oops huh? So what you have to do is disable ath9k driver from loading. But we want to enable it later so do something like this to be sure:

root@deathMCEbox:# mv /lib/modules/`uname -r`/kernel/drivers/net/wireless/ath9k/ath9k.ko /drivers/net/wireless/ath9k/ath9k.ko.old
root@deathMCEbox:# depmod -ae
root@deathMCEbox:# reboot

Then AFTER the BOX COMES UP SUCCESSFULLY, you will run:

root@deathMCEbox:# mv /lib/modules/`uname -r`/kernel/drivers/net/wireless/ath9k/ath9k.ko.old /drivers/net/wireless/ath9k/ath9k.ko
root@deathMCEbox:# depmod -ae

At this point start doing all the mcelog stuff I mentioned above... including the /sys/devices/system/machinecheck/machinecheck* tuning I mentioned.

Once you have all that set up, then on one vty have this running:

tail -f /var/log/mcelog

Then

root@deathMCEbox:# modprobe ath9k

Did you see anything? If so, then now you have to whip out the Athlon64 X2 and hope you can translate the errors. If you cannot, you can try reducing the mask of the MC events through each CPU's bank ctls (/sys/devices/system/machinecheck/machinecheck0/bank*ctl) and try to pinpoint the error.

Let me know how it goes.
Comment 63 Luis Chamberlain 2008-10-03 18:12:40 UTC
Oh a typo, I forgot the done here:

root@deathMCEbox:# while true; do /usr/sbin/mcelog --filter >> /var/log/mcelog;
sleep 5; done
Comment 64 Luis Chamberlain 2008-10-03 18:15:05 UTC
Oh and I meant that you'd have to whip out the Athlon64 X2 *documentation*, as only with that can you determine some of these errors I suspect. But we'll see.
Comment 65 Matteo Croce 2008-10-04 05:18:39 UTC
The driver doesn't hang on load; it hangs when I generate some traffic.
It can hang 2 seconds after ifup, or after almost a minute.

I guess that your suggestion is the same as booting with mce=3, which led me to a hang too.
Comment 66 Matteo Croce 2008-10-04 06:09:10 UTC
Created attachment 18158
the MCE
Comment 67 Matteo Croce 2008-10-04 06:10:57 UTC
Booting with mce=3 I get this exception, and I took a screenshot.
Comment 68 Matteo Croce 2008-10-04 06:16:18 UTC
HARDWARE ERROR
CPU 0: Machine check Exception:               4 Bank 4;  b2OOOOOOOOO7OfOf
TSC 6b0b27b740
This is not a sotware problem!
Run throuh mcelog --ascii to decode and contact your hardware vendor
Kernel panic - not syncing: Machine check
Comment 69 Matteo Croce 2008-10-04 06:18:36 UTC
mcelog --ascii <mce.txt
HARDWARE ERROR. This is *NOT* a software problem!
Please contact your hardware vendor
CPU 0 0 data cache TSC 6b0b27b740
STATUS 0 MCGSTATUS 0
Comment 70 Luis Chamberlain 2008-10-04 14:51:31 UTC
I'm adding Andi Kleen to the CC list. I'm hoping he can ACK if this is indeed a hardware issue or not but from my own review of mcelog documentation and its purpose it seems this is clearly a HARDWARE ISSUE and not a SOFTWARE ISSUE.

Andi -- is this fair to say?

It seems the bug reporter should not try to contact his CPU's hardware vendor.
Comment 71 Luis Chamberlain 2008-10-04 14:53:31 UTC
I mean -- he *SHOULD* contact his CPU's hardware vendor.
Comment 72 Andi Kleen 2008-10-04 16:44:20 UTC
mcelog's output is clear:

This is not a software problem!
Run through mcelog --ascii to decode and contact your hardware vendor

Yes he should.

I think he must have typoed because mcelog should have resolved it.

Bank 4 is the northbridge so it might be that some external IO device 
is not answering, but that would be still a hardware problem.
Comment 73 Luis Chamberlain 2008-10-04 17:08:09 UTC
Thanks for the confirmation Andi, this bug should be closed and marked as not-a-bug. It should be very clear now that this issue is not a software issue but a hardware issue on his first CPU (CPU 0) and the reporter should contact his hardware vendor to try to resolve or get support for this issue.
Comment 74 Matteo Croce 2008-10-04 17:32:13 UTC
Why does everything work fine if I don't use ath9k?
Comment 75 Luis Chamberlain 2008-10-04 17:35:34 UTC
I am not sure, but the issue is not a kernel bug. Do you understand that? What I mean is that the kernel panic you are seeing is caused by a hardware issue, and mcelog told you exactly what it was. It may be that using the device triggers the hardware issue. Either way, the issue is not something we can try to fix in software, as the oops is an MCE on your CPU #0.
Comment 76 Matteo Croce 2008-10-04 18:30:30 UTC
Who should I contact? AMD directly?
Comment 77 Luis Chamberlain 2008-10-04 18:35:44 UTC
Sounds like a good candidate.
Comment 78 Matteo Croce 2008-10-05 04:25:26 UTC
Andi, why didn't mcelog resolve it?
I admit I used O instead of 0 (I used an OCR tool), but with the right digits it still doesn't print useful info.
Comment 79 Matteo Croce 2008-10-05 04:50:11 UTC
Here is the correct one:

[/tmp]$ cat mce
CPU 0: Machine check Exception:               4 Bank 4:  b200000000070f0f
TSC 6b0b27b740
[/tmp]$ mcelog --ascii --k8 <mce
mcelog: Cannot open /dev/mem for DMI decoding: Permission denied
HARDWARE ERROR. This is *NOT* a software problem!
Please contact your hardware vendor
CPU 0 0 data cache TSC 6b0b27b740
STATUS 0 MCGSTATUS 0
[/tmp]
Comment 80 Matteo Croce 2008-10-05 05:02:12 UTC
Andi, your tool is a bit picky when parsing: it matches the "Machine Check Exception" string case-insensitively, but I had used ; instead of :. Here is the decode:

HARDWARE ERROR. This is *NOT* a software problem!
Please contact your hardware vendor
CPU 0 4 northbridge   Northbridge Watchdog error
       bit57 = processor context corrupt
       bit61 = error uncorrected
  bus error 'generic participation, request timed out
      generic error mem transaction
      generic access, level generic'
STATUS b200000000070f0f MCGSTATUS 4
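
The flag bits mcelog reports can be cross-checked by hand. Below is a small sketch, assuming the standard x86 machine-check layout for the top bits of a bank STATUS register (bits 63 down to 57); the helper names are invented for this note, and the OCR cleanup only covers the O-for-0 mix-up mentioned earlier in the thread.

```python
# Flag bits in the upper part of an x86 MCi_STATUS register.
STATUS_FLAGS = {
    63: "VAL (register holds a valid error)",
    62: "OVER (an earlier error was overwritten)",
    61: "UC (error uncorrected)",
    60: "EN (error reporting enabled)",
    59: "MISCV (MISC register valid)",
    58: "ADDRV (ADDR register valid)",
    57: "PCC (processor context corrupt)",
}


def clean_ocr_hex(text):
    """Undo the letter-O-for-zero OCR mistake in a hex field."""
    return text.replace("O", "0").replace("o", "0")


def decode_status(status):
    """Return the names of the set flag bits and the low 16-bit error code."""
    flags = [name for bit, name in sorted(STATUS_FLAGS.items(), reverse=True)
             if (status >> bit) & 1]
    return flags, status & 0xFFFF


status = int(clean_ocr_hex("b2" + "O" * 9 + "7OfOf"), 16)
flags, code = decode_status(status)
for name in flags:
    print(name)
print(hex(code))
```

For the value in this report the set flags include UC and PCC, which agrees with the "bit61 = error uncorrected" and "bit57 = processor context corrupt" lines in the decode above. The bank-specific meaning of the remaining bits (the "Northbridge Watchdog error" text) comes from the processor documentation and is not reproduced here.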
Comment 81 Matteo Croce 2008-10-05 09:52:42 UTC
Fortunately the latest madwifi release from OpenWrt works like a charm;
I'll stick with it until I get a reply from AMD
(not to be rude, but I _need_ connectivity).
Comment 82 herrmann.der.user 2008-10-06 06:02:51 UTC
Hi Matteo,

can you please provide output of following commands:

 # lspci -vvxxxx


 # ( perl -e 'sysseek(STDIN, 0xC001001F, 0)'; \
    hexdump -n 8 -e '2/4 "%08x " "\n"' )   < /dev/cpu/0/msr

 # ( perl -e 'sysseek(STDIN, 0xC001001F, 0)'; \
    hexdump -n 8 -e '2/4 "%08x " "\n"' )   < /dev/cpu/1/msr

Thanks.

I just want to check whether you might be affected by some CPU erratum.
See errata 131 and 169 in
"Revision Guide for AMD NPT Family 0Fh Processors"
(#33610 Rev. 3.30 February 2008).
I just want to know whether the suggested workaround(s) are applied
on your system.

Thanks, Andreas
Comment 83 Matteo Croce 2008-10-06 12:37:53 UTC
Created attachment 18183
lspci -vvxxxx output
Comment 84 Matteo Croce 2008-10-06 12:39:44 UTC
attached lspci -vvxxxx output to http://bugzilla.kernel.org/attachment.cgi?id=18183

(perl -e 'sysseek(STDIN, 0xC001001F, 0)'; hexdump -n 8 -e '2/4 "%08x " "\n"') </dev/cpu/0/msr
0200000e 00400000
(perl -e 'sysseek(STDIN, 0xC001001F, 0)'; hexdump -n 8 -e '2/4 "%08x " "\n"') </dev/cpu/1/msr
0200000e 00400000
Comment 85 herrmann.der.user 2008-10-06 13:24:51 UTC
Unless I'm very much mistaken -- at this time of day --
your system has neither the suggested workaround for
Erratum 131 nor for Erratum 169 applied.

For 169 I would have expected to see this bit set:
NB_CFG: ******** *******1
and furthermore some other values in the config space for
device 18.0 are necessary, e.g. something like:

# setpci -s 18.0 0x68.l
0f2ec820

But you have 0x0f00c820, i.e. bit 21 is not set.

For 131 I would have expected to see
NB_CFG: **1***** ********

From my point of view chances are high that your system is affected by
one of those errata.

Can you please check whether a new BIOS is available for your mobo?
Usually that's the way to fix such issues.
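
The register comparison above can be checked mechanically. This is just an illustrative sketch using the two config-space values quoted in this comment; the function names are invented, and which of the differing bits belong to the erratum workaround is something only the revision guide can say.

```python
def bit_is_set(value, bit):
    """True if the given bit position is 1 in value."""
    return bool((value >> bit) & 1)


def differing_bits(a, b, width=32):
    """List the bit positions where two register values disagree."""
    return [bit for bit in range(width) if ((a ^ b) >> bit) & 1]


def byte_view(dword_offset, bit):
    """Translate a bit of a 32-bit register into (byte offset, byte mask)."""
    return dword_offset + bit // 8, 1 << (bit % 8)


reference = 0x0f2ec820  # value read on the reference machine at offset 0x68
reported = 0x0f00c820   # value from the attached lspci dump

print(bit_is_set(reference, 21), bit_is_set(reported, 21))
print(differing_bits(reference, reported))
offset, mask = byte_view(0x68, 21)
print(hex(offset), hex(mask))
```

The last line shows why a byte-wide access is convenient here: bit 21 of the 32-bit register at 0x68 is bit 5 of the single byte at 0x6a.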
Comment 86 Matteo Croce 2008-10-06 14:51:48 UTC
No new BIOSes from my vendor (abit).
Can I set those registers by hand?
Comment 87 herrmann.der.user 2008-10-06 15:45:45 UTC
To set the bit in Northbridge function 0 should be easy.
This is from my desktop:

# setpci -s 18.0 0x6a.b
2e
# setpci -s 18.0 0x6a.b=0xe
# setpci -s 18.0 0x6a.b
0e
# setpci -s 18.0 0x6a.b=0x2e
# setpci -s 18.0 0x6a.b
2e

To set the NB_CFG MSR needs some more thinking.

But you can use HPA's msr-tools for this, see http://www.kernel.org/pub/linux/utils/cpu/msr-tools/

E.g., again on my desktop:

msr-tools-1.1.2 # ./rdmsr 0xc001001f
40000100000008
msr-tools-1.1.2 # ./wrmsr 0xc001001f 0x40000000000008
msr-tools-1.1.2 # ./rdmsr 0xc001001f
40000000000008
msr-tools-1.1.2 # ./wrmsr 0xc001001f 0x40000100000008
msr-tools-1.1.2 # ./rdmsr 0xc001001f
40000100000008

Of course there is some risk for your system.
In general such bits should be set from BIOS or during early boot.

On your machine you could try to do an
# ./wrmsr 0xc001001f 0x4000010200000e
# setpci -s 18.0 0x6a.b=0x20
before you load your ath9k driver.

But, of course, I don't recommend to do this!
You should better ask your mobo vendor for a BIOS update.
Comment 88 Matteo Croce 2008-10-06 17:41:42 UTC
# (perl -e 'sysseek(STDIN, 0xC001001F, 0)'; hexdump -n 8 -e '2/4 "%08x " "\n"') </dev/cpu/0/msr
0200000e 00400001
# (perl -e 'sysseek(STDIN, 0xC001001F, 0)'; hexdump -n 8 -e '2/4 "%08x " "\n"') </dev/cpu/1/msr
0200000e 00400001

are the values correct now?
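
One way to answer this kind of question is to reassemble the 64-bit MSR value from the hexdump output and compare it with what was written. A minimal sketch, assuming the dump prints the low 32-bit word first (as it does on a little-endian x86 machine); msr_from_hexdump is a name made up for this note.

```python
def msr_from_hexdump(line):
    """Combine 'low high' 32-bit words from the dump into one 64-bit value."""
    low, high = (int(word, 16) for word in line.split())
    return (high << 32) | low


before = msr_from_hexdump("0200000e 00400000")
after = msr_from_hexdump("0200000e 00400001")

# Bit positions that differ between the two readings.
changed_bits = [bit for bit in range(64) if ((before ^ after) >> bit) & 1]

print(hex(before))
print(hex(after))
print(changed_bits)
```

The second reading matches the 0x4000010200000e value passed to wrmsr in the earlier suggestion, and the only bit that changed is the lowest bit of the upper word.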
Comment 89 herrmann.der.user 2008-10-07 06:56:07 UTC
Yes, looks good.
And what about "setpci -s 18.0 0x6a=0x20" -- what is output of
"setpci -s 18.0 0x6a"?

If all is set up you can try to load/use ath9k.

Either the hang was due to Erratum 169 and does not occur anymore or you
encountered some other problem.

BTW, as you have different bits in NB_CFG set as I have, I'd like to double
check those settings. (Guess you have a socket 939 or 940 CPU.)

Can you please provide output of "x86info -a" from your system.
The tool is available at:

http://www.codemonkey.org.uk/projects/x86info/

(Please use version 1.21 of that tool.)
And last but not least I'd like to see output of dmidecode.

Thanks!
Comment 90 Matteo Croce 2008-10-07 14:46:49 UTC
# setpci -s 18.0 0x6a
20
# x86info -a
x86info v1.21.  Dave Jones 2001-2007
Feedback to <davej@redhat.com>.     

Found 2 CPUs
MP Table:   
#       APIC ID Version State           Family  Model   Step    Flags
#        0       0x11    BSP, usable     15      35      2       0xfbff
#        1       0x11    AP, usable      15      35      2       0xfbff

--------------------------------------------------------------------------
CPU #1                                                                    
eax in: 0x00000000, eax = 00000001 ebx = 68747541 ecx = 444d4163 edx = 69746e65
eax in: 0x00000001, eax = 00020f32 ebx = 00020800 ecx = 00000001 edx = 178bfbff

eax in: 0x80000000, eax = 80000018 ebx = 68747541 ecx = 444d4163 edx = 69746e65
eax in: 0x80000001, eax = 00020f32 ebx = 00000156 ecx = 00000003 edx = e3d3fbff
eax in: 0x80000002, eax = 20444d41 ebx = 6c687441 ecx = 74286e6f edx = 3620296d
eax in: 0x80000003, eax = 32582034 ebx = 61754420 ecx = 6f43206c edx = 50206572
eax in: 0x80000004, eax = 65636f72 ebx = 726f7373 ecx = 30343420 edx = 00002b30
eax in: 0x80000005, eax = ff08ff08 ebx = ff20ff20 ecx = 40020140 edx = 40020140
eax in: 0x80000006, eax = 00000000 ebx = 42004200 ecx = 04008140 edx = 00000000
eax in: 0x80000007, eax = 00000000 ebx = 00000000 ecx = 00000000 edx = 0000000f
eax in: 0x80000008, eax = 00003028 ebx = 00000000 ecx = 00000001 edx = 00000000
eax in: 0x80000009, eax = 00000000 ebx = 00000000 ecx = 00000000 edx = 00000000
eax in: 0x8000000a, eax = 00000000 ebx = 00000000 ecx = 00000000 edx = 00000000
eax in: 0x8000000b, eax = 00000000 ebx = 00000000 ecx = 00000000 edx = 00000000
eax in: 0x8000000c, eax = 00000000 ebx = 00000000 ecx = 00000000 edx = 00000000
eax in: 0x8000000d, eax = 00000000 ebx = 00000000 ecx = 00000000 edx = 00000000
eax in: 0x8000000e, eax = 00000000 ebx = 00000000 ecx = 00000000 edx = 00000000
eax in: 0x8000000f, eax = 00000000 ebx = 00000000 ecx = 00000000 edx = 00000000
eax in: 0x80000010, eax = 00000000 ebx = 00000000 ecx = 00000000 edx = 00000000
eax in: 0x80000011, eax = 00000000 ebx = 00000000 ecx = 00000000 edx = 00000000
eax in: 0x80000012, eax = 00000000 ebx = 00000000 ecx = 00000000 edx = 00000000
eax in: 0x80000013, eax = 00000000 ebx = 00000000 ecx = 00000000 edx = 00000000
eax in: 0x80000014, eax = 00000000 ebx = 00000000 ecx = 00000000 edx = 00000000
eax in: 0x80000015, eax = 00000000 ebx = 00000000 ecx = 00000000 edx = 00000000
eax in: 0x80000016, eax = 00000000 ebx = 00000000 ecx = 00000000 edx = 00000000
eax in: 0x80000017, eax = 00000000 ebx = 00000000 ecx = 00000000 edx = 00000000
eax in: 0x80000018, eax = 00000000 ebx = 00000000 ecx = 00000000 edx = 00000000

Family: 15 Model: 35 Stepping: 2
CPU Model : Dual-Core Opteron/Athlon 64 X2 Dual-Core (JH-E6)
Processor name string: AMD Athlon(tm) 64 X2 Dual Core Processor 4400+

Feature flags:
 fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflsh mmx fxsr sse sse2 ht sse3
Extended feature flags:                                                                                  
 fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 nx mmxext mmx fxsr ffxsr lm 3dnowext 3dnow lahf/sahf CmpLegacy

/dev/cpu/0/msr: No such file or directory
L1 Data TLB (2M/4M):        Fully associative. 8 entries.
L1 Instruction TLB (2M/4M): Fully associative. 8 entries.
L1 Data TLB (4K):           Fully associative. 32 entries.
L1 Instruction TLB (4K):    Fully associative. 32 entries.
L1 Data cache:                                            
        Size: 64Kb      2-way associative.                
        lines per tag=1 line size=64 bytes.               
L1 Instruction cache:                                     
        Size: 64Kb      2-way associative.                
        lines per tag=1 line size=64 bytes.               
L2 Data TLB (2M/4M):        Disabled. 0 entries.          
L2 Instruction TLB (2M/4M): Disabled. 0 entries.          
L2 Data TLB (4K):           4-way associative. 512 entries.
L2 Instruction TLB (4K):    4-way associative. 512 entries.
L2 cache:                                                  
        Size: 1024Kb    16-way associative.                
        lines per tag=1 line size=64 bytes.                

PowerNOW! Technology information
Available features:             
        Temperature sensing diode present.
        Frequency ID control              
        Voltage ID control                
        Thermal Trip                      

Couldn't read MSR 0xc0010041
Couldn't read MSR 0xc0010042

Something went wrong reading MSR_FID_VID_STATUS
SVM: revision 0, 0 ASIDs                       
Address Size: 48 bits virtual, 40 bits physical
The physical package has 2 of 2 possible cores implemented.
Connector type: Socket 939                                 


MTRR registers:
MTRRcap (0xfe): MTRRphysBase0 (0x200): MTRRphysMask0 (0x201): MTRRphysBase1 (0x202): MTRRphysMask1 (0x203): MTRRphysBase2 (0x204): MTRRphysMask2 (0x205): MTRRphysBase3 (0x206): MTRRphysMask3 (0x207): MTRRphysBase4 (0x208): MTRRphysMask4 (0x209): MTRRphysBase5 (0x20a): MTRRphysMask5 (0x20b): MTRRphysBase6 (0x20c): MTRRphysMask6 (0x20d): MTRRphysBase7 (0x20e): MTRRphysMask7 (0x20f): MTRRfix64K_00000 (0x250): MTRRfix16K_80000 (0x258): MTRRfix16K_A0000 (0x259): MTRRfix4K_C8000 (0x269): MTRRfix4K_D0000 0x26a: MTRRfix4K_D8000 0x26b: MTRRfix4K_E0000 0x26c: MTRRfix4K_E8000 0x26d: MTRRfix4K_F0000 0x26e: MTRRfix4K_F8000 0x26f: MTRRdefType (0x2ff):                                        

519022.75GHz processor (estimate).

--------------------------------------------------------------------------
CPU #2                                                                    
eax in: 0x00000000, eax = 00000001 ebx = 68747541 ecx = 444d4163 edx = 69746e65
eax in: 0x00000001, eax = 00020f32 ebx = 01020800 ecx = 00000001 edx = 178bfbff

eax in: 0x80000000, eax = 80000018 ebx = 68747541 ecx = 444d4163 edx = 69746e65
eax in: 0x80000001, eax = 00020f32 ebx = 00000156 ecx = 00000003 edx = e3d3fbff
eax in: 0x80000002, eax = 20444d41 ebx = 6c687441 ecx = 74286e6f edx = 3620296d
eax in: 0x80000003, eax = 32582034 ebx = 61754420 ecx = 6f43206c edx = 50206572
eax in: 0x80000004, eax = 65636f72 ebx = 726f7373 ecx = 30343420 edx = 00002b30
eax in: 0x80000005, eax = ff08ff08 ebx = ff20ff20 ecx = 40020140 edx = 40020140
eax in: 0x80000006, eax = 00000000 ebx = 42004200 ecx = 04008140 edx = 00000000
eax in: 0x80000007, eax = 00000000 ebx = 00000000 ecx = 00000000 edx = 0000000f
eax in: 0x80000008, eax = 00003028 ebx = 00000000 ecx = 00000001 edx = 00000000
eax in: 0x80000009, eax = 00000000 ebx = 00000000 ecx = 00000000 edx = 00000000
eax in: 0x8000000a, eax = 00000000 ebx = 00000000 ecx = 00000000 edx = 00000000
eax in: 0x8000000b, eax = 00000000 ebx = 00000000 ecx = 00000000 edx = 00000000
eax in: 0x8000000c, eax = 00000000 ebx = 00000000 ecx = 00000000 edx = 00000000
eax in: 0x8000000d, eax = 00000000 ebx = 00000000 ecx = 00000000 edx = 00000000
eax in: 0x8000000e, eax = 00000000 ebx = 00000000 ecx = 00000000 edx = 00000000
eax in: 0x8000000f, eax = 00000000 ebx = 00000000 ecx = 00000000 edx = 00000000
eax in: 0x80000010, eax = 00000000 ebx = 00000000 ecx = 00000000 edx = 00000000
eax in: 0x80000011, eax = 00000000 ebx = 00000000 ecx = 00000000 edx = 00000000
eax in: 0x80000012, eax = 00000000 ebx = 00000000 ecx = 00000000 edx = 00000000
eax in: 0x80000013, eax = 00000000 ebx = 00000000 ecx = 00000000 edx = 00000000
eax in: 0x80000014, eax = 00000000 ebx = 00000000 ecx = 00000000 edx = 00000000
eax in: 0x80000015, eax = 00000000 ebx = 00000000 ecx = 00000000 edx = 00000000
eax in: 0x80000016, eax = 00000000 ebx = 00000000 ecx = 00000000 edx = 00000000
eax in: 0x80000017, eax = 00000000 ebx = 00000000 ecx = 00000000 edx = 00000000
eax in: 0x80000018, eax = 00000000 ebx = 00000000 ecx = 00000000 edx = 00000000

Family: 15 Model: 35 Stepping: 2
CPU Model : Dual-Core Opteron/Athlon 64 X2 Dual-Core (JH-E6)
Processor name string: AMD Athlon(tm) 64 X2 Dual Core Processor 4400+

Feature flags:
 fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflsh mmx fxsr sse sse2 ht sse3
Extended feature flags:                                                                                  
 fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 nx mmxext mmx fxsr ffxsr lm 3dnowext 3dnow lahf/sahf CmpLegacy

L1 Data TLB (2M/4M):        Fully associative. 8 entries.
L1 Instruction TLB (2M/4M): Fully associative. 8 entries.
L1 Data TLB (4K):           Fully associative. 32 entries.
L1 Instruction TLB (4K):    Fully associative. 32 entries.
L1 Data cache:                                            
        Size: 64Kb      2-way associative.
        lines per tag=1 line size=64 bytes.
L1 Instruction cache:
        Size: 64Kb      2-way associative.
        lines per tag=1 line size=64 bytes.
L2 Data TLB (2M/4M):        Disabled. 0 entries.
L2 Instruction TLB (2M/4M): Disabled. 0 entries.
L2 Data TLB (4K):           4-way associative. 512 entries.
L2 Instruction TLB (4K):    4-way associative. 512 entries.
L2 cache:
        Size: 1024Kb    16-way associative.
        lines per tag=1 line size=64 bytes.

PowerNOW! Technology information
Available features:
        Temperature sensing diode present.
        Frequency ID control
        Voltage ID control
        Thermal Trip

Couldn't read MSR 0xc0010041
Couldn't read MSR 0xc0010042

Something went wrong reading MSR_FID_VID_STATUS
SVM: revision 0, 0 ASIDs
Address Size: 48 bits virtual, 40 bits physical
The physical package has 2 of 2 possible cores implemented.
Connector type: Socket 939


MTRR registers:
MTRRcap (0xfe): MTRRphysBase0 (0x200): MTRRphysMask0 (0x201): MTRRphysBase1 (0x202): MTRRphysMask1 (0x203): MTRRphysBase2 (0x204): MTRRphysMask2 (0x205): MTRRphysBase3 (0x206): MTRRphysMask3 (0x207): MTRRphysBase4 (0x208): MTRRphysMask4 (0x209): MTRRphysBase5 (0x20a): MTRRphysMask5 (0x20b): MTRRphysBase6 (0x20c): MTRRphysMask6 (0x20d): MTRRphysBase7 (0x20e): MTRRphysMask7 (0x20f): MTRRfix64K_00000 (0x250): MTRRfix16K_80000 (0x258): MTRRfix16K_A0000 (0x259): MTRRfix4K_C8000 (0x269): MTRRfix4K_D0000 0x26a: MTRRfix4K_D8000 0x26b: MTRRfix4K_E0000 0x26c: MTRRfix4K_E8000 0x26d: MTRRfix4K_F0000 0x26e: MTRRfix4K_F8000 0x26f: MTRRdefType (0x2ff):

2584453.95GHz processor (estimate).

--------------------------------------------------------------------------
Comment 91 Matteo Croce 2008-10-07 14:47:35 UTC
sorry, wrong kernel
Comment 92 Matteo Croce 2008-10-07 16:10:25 UTC
Just compiled a 2.6.27-rc9 kernel:
root@raver:~# (perl -e 'sysseek(STDIN, 0xC001001F, 0)'; hexdump -n 8 -e '2/4 "%08x " "\n"') </dev/cpu/0/msr
0200000e 00400001
root@raver:~# (perl -e 'sysseek(STDIN, 0xC001001F, 0)'; hexdump -n 8 -e '2/4 "%08x " "\n"') </dev/cpu/1/msr
0200000e 00400001
root@raver:~# setpci -s 18.0 0x6a.b
20
root@raver:~# x86info -a
x86info v1.21.  Dave Jones 2001-2007
Feedback to <davej@redhat.com>.     

Found 2 CPUs
MP Table:   
#       APIC ID Version State           Family  Model   Step    Flags
#        0       0x11    BSP, usable     15      35      2       0xfbff
#        1       0x11    AP, usable      15      35      2       0xfbff

--------------------------------------------------------------------------
CPU #1                                                                    
eax in: 0x00000000, eax = 00000001 ebx = 68747541 ecx = 444d4163 edx = 69746e65
eax in: 0x00000001, eax = 00020f32 ebx = 00020800 ecx = 00000001 edx = 178bfbff

eax in: 0x80000000, eax = 80000018 ebx = 68747541 ecx = 444d4163 edx = 69746e65
eax in: 0x80000001, eax = 00020f32 ebx = 00000156 ecx = 00000003 edx = e3d3fbff
eax in: 0x80000002, eax = 20444d41 ebx = 6c687441 ecx = 74286e6f edx = 3620296d
eax in: 0x80000003, eax = 32582034 ebx = 61754420 ecx = 6f43206c edx = 50206572
eax in: 0x80000004, eax = 65636f72 ebx = 726f7373 ecx = 30343420 edx = 00002b30
eax in: 0x80000005, eax = ff08ff08 ebx = ff20ff20 ecx = 40020140 edx = 40020140
eax in: 0x80000006, eax = 00000000 ebx = 42004200 ecx = 04008140 edx = 00000000
eax in: 0x80000007, eax = 00000000 ebx = 00000000 ecx = 00000000 edx = 0000000f
eax in: 0x80000008, eax = 00003028 ebx = 00000000 ecx = 00000001 edx = 00000000
eax in: 0x80000009, eax = 00000000 ebx = 00000000 ecx = 00000000 edx = 00000000
eax in: 0x8000000a, eax = 00000000 ebx = 00000000 ecx = 00000000 edx = 00000000
eax in: 0x8000000b, eax = 00000000 ebx = 00000000 ecx = 00000000 edx = 00000000
eax in: 0x8000000c, eax = 00000000 ebx = 00000000 ecx = 00000000 edx = 00000000
eax in: 0x8000000d, eax = 00000000 ebx = 00000000 ecx = 00000000 edx = 00000000
eax in: 0x8000000e, eax = 00000000 ebx = 00000000 ecx = 00000000 edx = 00000000
eax in: 0x8000000f, eax = 00000000 ebx = 00000000 ecx = 00000000 edx = 00000000
eax in: 0x80000010, eax = 00000000 ebx = 00000000 ecx = 00000000 edx = 00000000
eax in: 0x80000011, eax = 00000000 ebx = 00000000 ecx = 00000000 edx = 00000000
eax in: 0x80000012, eax = 00000000 ebx = 00000000 ecx = 00000000 edx = 00000000
eax in: 0x80000013, eax = 00000000 ebx = 00000000 ecx = 00000000 edx = 00000000
eax in: 0x80000014, eax = 00000000 ebx = 00000000 ecx = 00000000 edx = 00000000
eax in: 0x80000015, eax = 00000000 ebx = 00000000 ecx = 00000000 edx = 00000000
eax in: 0x80000016, eax = 00000000 ebx = 00000000 ecx = 00000000 edx = 00000000
eax in: 0x80000017, eax = 00000000 ebx = 00000000 ecx = 00000000 edx = 00000000
eax in: 0x80000018, eax = 00000000 ebx = 00000000 ecx = 00000000 edx = 00000000

Family: 15 Model: 35 Stepping: 2
CPU Model : Dual-Core Opteron/Athlon 64 X2 Dual-Core (JH-E6)
Processor name string: AMD Athlon(tm) 64 X2 Dual Core Processor 4400+

Feature flags:
 fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflsh mmx fxsr sse sse2 ht sse3
Extended feature flags:                                                                                  
 fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 nx mmxext mmx fxsr ffxsr lm 3dnowext 3dnow lahf/sahf CmpLegacy

Number of reporting banks : 5

MCG_CTL:
 Data cache check enabled
  ECC 1 bit error reporting enabled
  ECC multi bit error reporting enabled
  Data cache data parity enabled       
  Data cache main tag parity enabled   
  Data cache snoop tag parity enabled  
  L1 TLB parity enabled                
  L2 TLB parity enabled                
 Instruction cache check enabled       
  ECC 1 bit error reporting enabled    
  ECC multi bit error reporting enabled
  Instruction cache data parity enabled
  IC main tag parity enabled           
  IC snoop tag parity enabled          
  L1 TLB parity enabled                
  L2 TLB parity enabled                
  Predecode array parity enabled       
  Target selector parity enabled       
  Read data error enabled              
 Bus unit check enabled                
  External L2 tag parity error enabled 
  L2 partial tag parity error enabled  
  System ECC TLB reload error enabled  
  L2 ECC TLB reload error enabled      
  L2 ECC K7 deallocate enabled         
  L2 ECC probe deallocate enabled      
  System datareaderror reporting enabled
 Load/Store unit check enabled          
  Read data error enable (loads) enabled
  Read data error enable (stores) enabled

           31       23       15       7 
Bank: 0 (0x400)                         
MC0CTL:    00000000 00000000 00000000 01111111
MC0STATUS: 00000000 00000000 00000000 00000000
MC0ADDR:   11111111 11111111 11111111 11111111
MC0MISC:   00000000 00000000 00000000 00000000

Bank: 1 (0x404)
MC1CTL:    11111111 11111111 11111111 11111111
MC1STATUS: 00000000 00000000 00000000 00000000
MC1ADDR:   11111111 11111111 11111111 11111111
MC1MISC:   00000000 00000000 00000000 00000000

Bank: 2 (0x408)
MC2CTL:    00000000 00001111 11111111 11111111
MC2STATUS: 00000000 00000000 00000000 00000000
MC2ADDR:   11111111 11111111 11111111 11111111
MC2MISC:   11111111 11111111 11111111 11111111

Bank: 3 (0x40c)
MC3CTL:    00000000 00000000 00000000 00000111
MC3STATUS: 00000000 00000000 00000000 00000000
MC3ADDR:   11111111 11111111 11111111 11111111
MC3MISC:   00000000 00000000 00000000 00000000

Bank: 4 (0x410)
MC4CTL:    00000000 00000000 00111111 11111111
MC4STATUS: 00000000 00000000 00000000 00000000
MC4ADDR:   11111111 11111111 11111111 11111111
MC4MISC:   00000000 00000000 00000000 00000000

L1 Data TLB (2M/4M):        Fully associative. 8 entries.
L1 Instruction TLB (2M/4M): Fully associative. 8 entries.
L1 Data TLB (4K):           Fully associative. 32 entries.
L1 Instruction TLB (4K):    Fully associative. 32 entries.
L1 Data cache:                                            
        Size: 64Kb      2-way associative.                
        lines per tag=1 line size=64 bytes.               
L1 Instruction cache:                                     
        Size: 64Kb      2-way associative.                
        lines per tag=1 line size=64 bytes.               
L2 Data TLB (2M/4M):        Disabled. 0 entries.          
L2 Instruction TLB (2M/4M): Disabled. 0 entries.          
L2 Data TLB (4K):           4-way associative. 512 entries.
L2 Instruction TLB (4K):    4-way associative. 512 entries.
L2 cache:                                                  
        Size: 1024Kb    16-way associative.                
        lines per tag=1 line size=64 bytes.                

PowerNOW! Technology information
Available features:             
        Temperature sensing diode present.
        Frequency ID control              
        Voltage ID control                
        Thermal Trip                      

MSR: 0xc0010041=0x0000000100001202 : 00000000 00000000 00000000 00000001
           11111111 11111111 11111111 11111111                          
MSR: 0xc0010042=0x12060812060e0e02 : 00011111 11111111 11111111 11111111
           11111111 11111111 11111111 11111111                          

Voltage ID codes: Maximum=1.400V Startup=1.350V Currently=1.100V
Frequency ID codes: Maximum=11x Startup=11x Currently=5x        
SVM: revision 0, 0 ASIDs                                        
Address Size: 48 bits virtual, 40 bits physical                 
The physical package has 2 of 2 possible cores implemented.     
Connector type: Socket 939                                      


MTRR registers:
MTRRcap (0xfe): 0x0000000000000508
MTRRphysBase0 (0x200): 0x0000000000000006
MTRRphysMask0 (0x201): 0x000000ff80000800
MTRRphysBase1 (0x202): 0x0000000080000006
MTRRphysMask1 (0x203): 0x000000ffc0000800
MTRRphysBase2 (0x204): 0x00000000d0000001
MTRRphysMask2 (0x205): 0x000000fff0000800
MTRRphysBase3 (0x206): 0x0000000000000000
MTRRphysMask3 (0x207): 0x0000000000000000
MTRRphysBase4 (0x208): 0x0000000000000000
MTRRphysMask4 (0x209): 0x0000000000000000
MTRRphysBase5 (0x20a): 0x0000000000000000
MTRRphysMask5 (0x20b): 0x0000000000000000
MTRRphysBase6 (0x20c): 0x0000000000000000
MTRRphysMask6 (0x20d): 0x0000000000000000
MTRRphysBase7 (0x20e): 0x0000000000000000
MTRRphysMask7 (0x20f): 0x0000000000000000
MTRRfix64K_00000 (0x250): 0x0606060606060606
MTRRfix16K_80000 (0x258): 0x0606060606060606
MTRRfix16K_A0000 (0x259): 0x0000000000000000
MTRRfix4K_C8000 (0x269): 0x0000000000000000 
MTRRfix4K_D0000 0x26a: 0x0000000000000000   
MTRRfix4K_D8000 0x26b: 0x0000000000000000   
MTRRfix4K_E0000 0x26c: 0x0000000000000000   
MTRRfix4K_E8000 0x26d: 0x0000000000000000   
MTRRfix4K_F0000 0x26e: 0x0000000000000000   
MTRRfix4K_F8000 0x26f: 0x0000000000000000   
MTRRdefType (0x2ff): 0x0000000000000c00     


1109130.20GHz processor (estimate).

--------------------------------------------------------------------------
CPU #2                                                                    
eax in: 0x00000000, eax = 00000001 ebx = 68747541 ecx = 444d4163 edx = 69746e65
eax in: 0x00000001, eax = 00020f32 ebx = 01020800 ecx = 00000001 edx = 178bfbff

eax in: 0x80000000, eax = 80000018 ebx = 68747541 ecx = 444d4163 edx = 69746e65
eax in: 0x80000001, eax = 00020f32 ebx = 00000156 ecx = 00000003 edx = e3d3fbff
eax in: 0x80000002, eax = 20444d41 ebx = 6c687441 ecx = 74286e6f edx = 3620296d
eax in: 0x80000003, eax = 32582034 ebx = 61754420 ecx = 6f43206c edx = 50206572
eax in: 0x80000004, eax = 65636f72 ebx = 726f7373 ecx = 30343420 edx = 00002b30
eax in: 0x80000005, eax = ff08ff08 ebx = ff20ff20 ecx = 40020140 edx = 40020140
eax in: 0x80000006, eax = 00000000 ebx = 42004200 ecx = 04008140 edx = 00000000
eax in: 0x80000007, eax = 00000000 ebx = 00000000 ecx = 00000000 edx = 0000000f
eax in: 0x80000008, eax = 00003028 ebx = 00000000 ecx = 00000001 edx = 00000000
eax in: 0x80000009, eax = 00000000 ebx = 00000000 ecx = 00000000 edx = 00000000
eax in: 0x8000000a, eax = 00000000 ebx = 00000000 ecx = 00000000 edx = 00000000
eax in: 0x8000000b, eax = 00000000 ebx = 00000000 ecx = 00000000 edx = 00000000
eax in: 0x8000000c, eax = 00000000 ebx = 00000000 ecx = 00000000 edx = 00000000
eax in: 0x8000000d, eax = 00000000 ebx = 00000000 ecx = 00000000 edx = 00000000
eax in: 0x8000000e, eax = 00000000 ebx = 00000000 ecx = 00000000 edx = 00000000
eax in: 0x8000000f, eax = 00000000 ebx = 00000000 ecx = 00000000 edx = 00000000
eax in: 0x80000010, eax = 00000000 ebx = 00000000 ecx = 00000000 edx = 00000000
eax in: 0x80000011, eax = 00000000 ebx = 00000000 ecx = 00000000 edx = 00000000
eax in: 0x80000012, eax = 00000000 ebx = 00000000 ecx = 00000000 edx = 00000000
eax in: 0x80000013, eax = 00000000 ebx = 00000000 ecx = 00000000 edx = 00000000
eax in: 0x80000014, eax = 00000000 ebx = 00000000 ecx = 00000000 edx = 00000000
eax in: 0x80000015, eax = 00000000 ebx = 00000000 ecx = 00000000 edx = 00000000
eax in: 0x80000016, eax = 00000000 ebx = 00000000 ecx = 00000000 edx = 00000000
eax in: 0x80000017, eax = 00000000 ebx = 00000000 ecx = 00000000 edx = 00000000
eax in: 0x80000018, eax = 00000000 ebx = 00000000 ecx = 00000000 edx = 00000000

Family: 15 Model: 35 Stepping: 2
CPU Model : Dual-Core Opteron/Athlon 64 X2 Dual-Core (JH-E6)
Processor name string: AMD Athlon(tm) 64 X2 Dual Core Processor 4400+

Feature flags:
 fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflsh mmx fxsr sse sse2 ht sse3
Extended feature flags:                                                                                  
 fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 nx mmxext mmx fxsr ffxsr lm 3dnowext 3dnow lahf/sahf CmpLegacy

Number of reporting banks : 5

MCG_CTL:
 Data cache check enabled
  ECC 1 bit error reporting enabled
  ECC multi bit error reporting enabled
  Data cache data parity enabled       
  Data cache main tag parity enabled   
  Data cache snoop tag parity enabled  
  L1 TLB parity enabled                
  L2 TLB parity enabled                
 Instruction cache check enabled       
  ECC 1 bit error reporting enabled    
  ECC multi bit error reporting enabled
  Instruction cache data parity enabled
  IC main tag parity enabled           
  IC snoop tag parity enabled          
  L1 TLB parity enabled                
  L2 TLB parity enabled                
  Predecode array parity enabled       
  Target selector parity enabled       
  Read data error enabled              
 Bus unit check enabled                
  External L2 tag parity error enabled 
  L2 partial tag parity error enabled  
  System ECC TLB reload error enabled  
  L2 ECC TLB reload error enabled      
  L2 ECC K7 deallocate enabled         
  L2 ECC probe deallocate enabled      
  System datareaderror reporting enabled
 Load/Store unit check enabled          
  Read data error enable (loads) enabled
  Read data error enable (stores) enabled

           31       23       15       7 
Bank: 0 (0x400)                         
MC0CTL:    00000000 00000000 00000000 01111111
MC0STATUS: 00000000 00000000 00000000 00000000
MC0ADDR:   11111111 11111111 11111111 11111111
MC0MISC:   00000000 00000000 00000000 00000000

Bank: 1 (0x404)
MC1CTL:    11111111 11111111 11111111 11111111
MC1STATUS: 00000000 00000000 00000000 00000000
MC1ADDR:   11111111 11111111 11111111 11111111
MC1MISC:   00000000 00000000 00000000 00000000

Bank: 2 (0x408)
MC2CTL:    00000000 00001111 11111111 11111111
MC2STATUS: 00000000 00000000 00000000 00000000
MC2ADDR:   11111111 11111111 11111111 11111111
MC2MISC:   11111111 11111111 11111111 11111111

Bank: 3 (0x40c)
MC3CTL:    00000000 00000000 00000000 00000111
MC3STATUS: 00000000 00000000 00000000 00000000
MC3ADDR:   11111111 11111111 11111111 11111111
MC3MISC:   00000000 00000000 00000000 00000000

Bank: 4 (0x410)
MC4CTL:    00000000 00000000 00000000 00000000
MC4STATUS: 00000000 00000000 00000000 00000000
MC4ADDR:   00000000 00000000 00000000 00000000
MC4MISC:   00000000 00000000 00000000 00000000

L1 Data TLB (2M/4M):        Fully associative. 8 entries.
L1 Instruction TLB (2M/4M): Fully associative. 8 entries.
L1 Data TLB (4K):           Fully associative. 32 entries.
L1 Instruction TLB (4K):    Fully associative. 32 entries.
L1 Data cache:                                            
        Size: 64Kb      2-way associative.                
        lines per tag=1 line size=64 bytes.               
L1 Instruction cache:                                     
        Size: 64Kb      2-way associative.                
        lines per tag=1 line size=64 bytes.               
L2 Data TLB (2M/4M):        Disabled. 0 entries.          
L2 Instruction TLB (2M/4M): Disabled. 0 entries.          
L2 Data TLB (4K):           4-way associative. 512 entries.
L2 Instruction TLB (4K):    4-way associative. 512 entries.
L2 cache:                                                  
        Size: 1024Kb    16-way associative.                
        lines per tag=1 line size=64 bytes.                

PowerNOW! Technology information
Available features:             
        Temperature sensing diode present.
        Frequency ID control              
        Voltage ID control                
        Thermal Trip                      

MSR: 0xc0010041=0x0000000100001202 : 00000000 00000000 00000000 00000001
           11111111 11111111 11111111 11111111                          
MSR: 0xc0010042=0x12060812060e0e02 : 00011111 11111111 11111111 11111111
           11111111 11111111 11111111 11111111                          

Voltage ID codes: Maximum=1.400V Startup=1.350V Currently=1.100V
Frequency ID codes: Maximum=11x Startup=11x Currently=5x        
SVM: revision 0, 0 ASIDs                                        
Address Size: 48 bits virtual, 40 bits physical                 
The physical package has 2 of 2 possible cores implemented.     
Connector type: Socket 939                                      


MTRR registers:
MTRRcap (0xfe): 0x0000000000000508
MTRRphysBase0 (0x200): 0x0000000000000006
MTRRphysMask0 (0x201): 0x000000ff80000800
MTRRphysBase1 (0x202): 0x0000000080000006
MTRRphysMask1 (0x203): 0x000000ffc0000800
MTRRphysBase2 (0x204): 0x00000000d0000001
MTRRphysMask2 (0x205): 0x000000fff0000800
MTRRphysBase3 (0x206): 0x0000000000000000
MTRRphysMask3 (0x207): 0x0000000000000000
MTRRphysBase4 (0x208): 0x0000000000000000
MTRRphysMask4 (0x209): 0x0000000000000000
MTRRphysBase5 (0x20a): 0x0000000000000000
MTRRphysMask5 (0x20b): 0x0000000000000000
MTRRphysBase6 (0x20c): 0x0000000000000000
MTRRphysMask6 (0x20d): 0x0000000000000000
MTRRphysBase7 (0x20e): 0x0000000000000000
MTRRphysMask7 (0x20f): 0x0000000000000000
MTRRfix64K_00000 (0x250): 0x0606060606060606
MTRRfix16K_80000 (0x258): 0x0606060606060606
MTRRfix16K_A0000 (0x259): 0x0000000000000000
MTRRfix4K_C8000 (0x269): 0x0000000000000000
MTRRfix4K_D0000 0x26a: 0x0000000000000000
MTRRfix4K_D8000 0x26b: 0x0000000000000000
MTRRfix4K_E0000 0x26c: 0x0000000000000000
MTRRfix4K_E8000 0x26d: 0x0000000000000000
MTRRfix4K_F0000 0x26e: 0x0000000000000000
MTRRfix4K_F8000 0x26f: 0x0000000000000000
MTRRdefType (0x2ff): 0x0000000000000c00


2584449.70GHz processor (estimate).

--------------------------------------------------------------------------
Comment 93 Matteo Croce 2008-10-07 16:11:46 UTC
I have added this initscript to my system:
root@raver:~# cat /etc/init.d/errata_fix
#!/bin/sh

. /lib/lsb/init-functions

log_begin_msg 'Applying AMD errata fix'

/usr/local/bin/wrmsr 0xc001001f 0x4000010200000e && /usr/bin/setpci -s 18.0 0x6a.b=0x20

log_end_msg $?
Comment 94 herrmann.der.user 2008-10-08 03:03:59 UTC
Regarding comment #92:
This confirms that you have a socket 939 CPU.
And it explains why your NB_CFG MSR (0xc001001f) looks slightly different
from mine. My desktop has AM2 socket. (Just another erratum fix which
was only needed for socket 939/940 CPUs.)

So the last interesting point is: does ath9k work (when you have applied
the errata fix), or do you still get MCEs?

If it works I take this for evidence that you really stumbled over
the erratum 169 problem.

Then the question is how to proceed.
IMHO, it's best to add the erratum workaround to the Linux kernel --
as your mobo vendor obviously doesn't provide you a newer BIOS.
And maybe there are other systems like yours that potentially suffer from
this CPU erratum.
Comment 95 Matteo Croce 2008-10-08 13:42:32 UTC
It hangs too with the same MCE
Comment 96 herrmann.der.user 2008-10-09 03:14:22 UTC
That's unfortunate. (I assume you have double-checked that the workaround
was applied.) But we have one more try.

Apply the workaround for erratum 131 instead of the one for erratum 169.
(Revision guide says that erratum 169 has superseded 131,
but who knows.)

That would mean you have to change your init-script like this:
(This time w/o setpci fiddling.)


#!/bin/sh

. /lib/lsb/init-functions

log_begin_msg 'Applying AMD errata fix'

/usr/local/bin/wrmsr 0xc001001f 0x4000000210000e

log_end_msg $?

Would be great if you could verify that last idea
(... that I have at the moment ;-(
Thanks!
Comment 97 Matteo Croce 2008-10-09 03:40:14 UTC
no setpci at all?
Comment 98 herrmann.der.user 2008-10-09 03:48:24 UTC
No, not at all. Workarounds are
(1) Erratum 131: set bit 20 of NB_CFG
(2) Erratum 169: set bit 32 of NB_CFG and set F0x68[22:21] to 01b
(i.e. bit 22=0, bit 21=1 in device 18.0 offset 0x68)

You have tested (2), which did not work.
Now I'd like to see explicitly whether (1) helps.
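
For anyone following along, the bit positions above can be checked by hand against the values posted in this thread. This is only a rough sketch in POSIX sh: the function names are made up for illustration and are not part of msr-tools or pciutils. It assumes the hexdump command from comment #92 prints the low 32-bit word first, and that values are passed with a 0x prefix.

```shell
#!/bin/sh
# Illustrative helpers only; these names do not exist in any real tool.

# Combine the two 32-bit words printed by hexdump (low word first,
# then high word) into one 64-bit NB_CFG value.
nb_cfg_from_words() {
    lo=$1; hi=$2
    printf '0x%016x\n' $(( ($hi << 32) | $lo ))
}

# Print one bit (0 or 1) of an MSR value.
msr_bit() {
    val=$1; bit=$2
    echo $(( ($val >> $bit) & 1 ))
}

# Print F0x68[22:21] from the byte at config offset 0x6a.
# That byte holds bits 23:16 of the 32-bit register at offset 0x68,
# so register bits 22:21 are bits 6:5 of the byte.
f0x68_bits_22_21() {
    byte=$1
    echo $(( ($byte >> 5) & 3 ))
}
```

With the values from comment #92, `nb_cfg_from_words 0x0200000e 0x00400001` prints 0x004000010200000e, `msr_bit 0x004000010200000e 32` prints 1 and `f0x68_bits_22_21 0x20` prints 1 (i.e. 01b), so workaround (2) was in place. The value proposed for (1), 0x4000000210000e, has bit 20 set and bit 32 clear.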
Comment 99 Matteo Croce 2008-10-09 10:16:30 UTC
Same MCE, all hangs.
Thanks anyway :(
Comment 100 herrmann.der.user 2008-10-10 03:41:56 UTC
Quite disenchanting. But at least we know it was none of the two errata.
I'll open an internal bug report for your issue. (But I don't know how
much attention this will get ...)
Thanks for your patience and testing.

Andreas
Comment 101 Joachim Deguara 2008-10-13 14:10:58 UTC
Can you tell me which motherboard this is or which add-in card?  I figure it must be an add-in card, as this is an oldish CPU/motherboard with a new wireless card.

I fudged the MCE report you printed out to set STATUS to 4 and MCGSTATUS to b200000000070f0f and got the report:

HARDWARE ERROR. This is *NOT* a software problem!
Please contact your hardware vendor
CPU 0 4 northbridge TSC 6b0b27b740 (upper bound, found by polled driver)
  Northbridge RAM ECC error
  ECC syndrome = 0
STATUS 4 MCGSTATUS b200000000070f0f

But this doesn't make sense, because you don't have ECC memory, right?  Have you run memtest overnight just to check the hardware?  If not, please do.

We need to capture the real MCE and decode it. If the MCEs are what we see here, then this would not be one of the errata Andreas found, as those are related to Watchdog Timer MCEs.
Comment 102 herrmann.der.user 2008-10-13 14:29:31 UTC
Joachim, please see comment #80 and attachment #18158 [details].
(When I decode the MCE I see the same Watchdog error as in comment #80.)
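
As a cross-check that does not depend on a local mcelog, the status word quoted in comment #101 can be split into its flag bits by hand. This is a rough sketch only. It assumes the architectural MCi_STATUS layout (VAL=63, OVER=62, UC=61, EN=60, MISCV=59, ADDRV=58, PCC=57) and assumes that the K8 northbridge bank keeps its extended error code in bits 19:16. The function name is made up, and the 64-bit word is passed as two 32-bit halves to stay within signed shell arithmetic.

```shell
#!/bin/sh
# Illustrative helper only; not part of mcelog or any other tool.
# Usage: mci_status_flags <high 32 bits> <low 32 bits>, both with 0x prefix.
mci_status_flags() {
    hi=$1; lo=$2
    # Flag bits 63..57 of the status word are bits 31..25 of the high half.
    echo "VAL=$(( ($hi >> 31) & 1 )) OVER=$(( ($hi >> 30) & 1 )) UC=$(( ($hi >> 29) & 1 )) EN=$(( ($hi >> 28) & 1 )) MISCV=$(( ($hi >> 27) & 1 )) ADDRV=$(( ($hi >> 26) & 1 )) PCC=$(( ($hi >> 25) & 1 ))"
    # Assumed field positions: extended error code in bits 19:16,
    # error code in bits 15:0 of the low half.
    printf 'ext_err=0x%x err_code=0x%04x\n' $(( ($lo >> 16) & 0xf )) $(( $lo & 0xffff ))
}

# Status word b200000000070f0f split into halves:
mci_status_flags 0xb2000000 0x00070f0f
# VAL=1 OVER=0 UC=1 EN=1 MISCV=0 ADDRV=0 PCC=1
# ext_err=0x7 err_code=0x0f0f
```

Under those assumptions the word describes a valid, uncorrected error with no address captured; the extended code 7 is the field that should map to the watchdog error mentioned above.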
Comment 103 Joachim Deguara 2008-10-13 14:43:46 UTC
Sorry, disregard my ramblings in comment #101; my local mcelog must be broken, and I was trying too hard to coax it into giving me a correct decoding.
Comment 104 Matteo Croce 2008-10-13 17:12:27 UTC
My motherboard is an Abit AN8 SLI with the nForce4 (CK804) chipset.
The wireless card is a PCI Atheros 802.11n device, quite new compared to the motherboard.

memtest86 gives TONS of errors, so I assume memtest86 itself is broken. The system has been running for months without issues, while memtest86 reports errors immediately and then crashes.
Comment 105 Matteo Croce 2008-10-13 17:22:22 UTC
IIRC memtest86 started to give errors when I added 2GB (2x1GB) of RAM to the 1GB (2x512MB) that was already there.
My RAM is pretty good though: the 512MB banks are OCZ Gold, while the 1GB banks are OCZ Platinum, both with CAS 2.

I ran a userspace tool for almost 12 hours: http://pyropus.ca/software/memtester/
and I got no problems at all.

Usually when there is bad RAM, the first thing I notice is GCC segfaulting, yet I cross-compile software for MIPS, AVR32 and ARM most of the time without problems.
Comment 106 herrmann.der.user 2008-10-14 02:38:58 UTC
> memtest86 gives TONS of errors

Matteo, that's an interesting point.

Can you please remove the 2x1GB of RAM (which cause your memtest86 errors)
and redo your ath9k driver test.
Comment 107 Matteo Croce 2008-10-14 14:55:32 UTC
I knew that memtest was broken; the latest one was fixed here: https://qa.mandriva.com/42679
and it works fine even with all 4 banks.
Comment 108 Luis Chamberlain 2008-10-20 16:50:31 UTC
Any other ideas to help the user out?
Comment 109 Phil Linttell 2008-10-30 05:54:03 UTC
I've asked for the bug to be re-opened (not sure if I should have cloned it?) because I am able to reproduce this user's problem on my system, with quite a different configuration.  My system has been freezing shortly after starting wireless (anywhere from immediately to an hour, depending on what I'm doing... almost always within a minute if I start ktorrent).  I've been trying to solve it for weeks, and isolated it down to use of the ath9k driver, as this user has done, and setting maxcpus=1 also resolves the hangs on my system.  I'm actually using the system now, which I never could have done before (less one core, of course).

Note: this doesn't necessarily eliminate a hardware issue, as this system is only a couple of weeks old and I haven't run any other OS on it.  However, given it's a different CPU, motherboard, etc. I thought it was worth re-visiting whether this is really a hardware problem or not.

kernel: 2.6.27-7-generic
dist: kubuntu 8.10rc1 with all patches
CPU: AMD Athlon(tm) 64 X2 Dual Core Processor 6000+
Memory: 4GB
Motherboard: Asus M3N78-VM
NIC: D-Link DWA-552 (Atheros 5416)

I'll attach my dmesg and other system configuration info.  I'm not a Linux guru, but I'm happy to help in any way... if you want me to try something, please give me very specific instructions.

BTW, I tried the two errata checks:

# ( perl -e 'sysseek(STDIN, 0xC001001F, 0)'; \
hexdump -n 8 -e '2/4 "%08x " "\n"' ) < /dev/cpu/0/msr

# ( perl -e 'sysseek(STDIN, 0xC001001F, 0)'; \
hexdump -n 8 -e '2/4 "%08x " "\n"' ) < /dev/cpu/1/msr

But the device doesn't exist:

bash: /dev/cpu/0/msr: No such file or directory

(Maybe because I'm currently running with maxcpus=1?)
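
A note on the missing device: the /dev/cpu/N/msr nodes are created by the kernel's msr driver, so the more likely cause is that the driver is built as a module and not loaded, rather than maxcpus=1. A minimal check, sketched in POSIX sh with a made-up function name:

```shell
#!/bin/sh
# Illustrative helper only. Reports whether an MSR device node exists
# and, if it does not, suggests loading the msr driver (as root).
msr_node_hint() {
    node=${1:-/dev/cpu/0/msr}
    if [ -e "$node" ]; then
        echo "present: $node"
    else
        echo "missing: $node (try: modprobe msr)"
    fi
}
```

If `msr_node_hint` reports the node as missing, running `modprobe msr` as root and repeating the two hexdump commands should be enough, assuming the kernel was built with MSR support as a module.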
Comment 110 Phil Linttell 2008-10-30 06:07:56 UTC
Created attachment 18519 [details]
dmesg output (maxcpus=1)

Made while running with maxcpus=1, let me know if you want me to re-run with the boot parameter removed.
Comment 111 Phil Linttell 2008-10-30 06:08:48 UTC
Created attachment 18520 [details]
lspci -vvxxxx output (while running with maxcpus=1)
Comment 112 Luis Chamberlain 2008-10-30 10:06:18 UTC
Regardless of whether you boot with maxcpus=1 or not, the oops message that was printed points to *a hardware issue*. No matter how much we work on the driver code, that oops was generated *by a hardware issue*, not by a software driver issue.

So this is a *hardware issue*.

Andreas or Joachim, any other ideas for these users to test? It seems it's happening on two different CPUs:

AMD Athlon64 X2, nForce4 motherboard
AMD Athlon(tm) 64 X2 Dual Core Processor 6000+
Comment 113 Phil Linttell 2008-10-30 10:12:16 UTC
I haven't managed to isolate an oops message on my config... when the freeze happens I'm in the desktop and the screen goes blank: no toggling to a pseudo tty, no restarting X, no toggling numlock on/off.  No trace of any error messages later that I can find.

So, maybe it's a hardware issue on my system too... but we don't really know that.

And if it's a hardware issue, why does it only show up when ath9k is being used?  On-board ethernet is solid.
Comment 114 Luis Chamberlain 2008-10-30 10:47:16 UTC
If you do not have an oops message how can you know it is the same issue?

This bug report is for an oops message which is produced by a hardware issue. As to why it happens with ath9k and not without it, the answer should lie in the mcelog dump information. But this is still a hardware issue, not a software issue: the CPU detects a machine check exception.
Comment 115 Phil Linttell 2008-10-30 11:14:45 UTC
If you can tell me how to capture an oops message at the time the system freezes, I'll gladly try to capture it... the screen goes a solid colour and no other response can be obtained from the system.

The commonality in the problem is, I believe, that once a wireless interface using the ath9k driver is initialized and associated, the system will freeze, either immediately or after some amount of network access.  If the wireless interface is not initialized (using netconfig/iwconfig) then the system is rock solid, and can even be used with a wired interface.  Setting maxcpus=1 avoids the freeze and the system is rock solid with the wireless interface.  (There was a spinlock issue in the ath9k driver which resulted in systems freezing, but my understanding is that the fix for this is already in the kernel I'm running.)

The root cause may be hardware, or it may not.  It is certainly a kernel/hardware interaction, because it is easily and reproducibly initiated from software.  The fact that we have two quite different configs with identical behaviour (apart from the oops) suggests to me that it merits additional investigation.

So, my system is solid now, using 802.11n with great throughput on a single core.  I'm typing this on it now with ktorrent running in the background.  If I reboot with the second core enabled, associate to my AP, and start ktorrent, my system will freeze within a minute.  (If I don't run ktorrent, it'll still freeze; it'll just take longer.)

Perhaps the bug is miscategorized?
Comment 116 Luis Chamberlain 2008-10-30 11:39:46 UTC
I'm absolutely certain that this issue should be looked into. What I am saying is just that I cannot help you with it in ath9k at the moment, as the issue reported in *this* bug report was caused by an MCE panic, which is raised by the CPU. The kernel is just informed of it, and since it's an exception the kernel knows better than to continue, otherwise the hardware may be badly affected, so the kernel oopses.

To be clear, we are talking about two separate issues now in this bug report. The first one is the one I describe above. In that case the user rebooted after adding this to his kernel parameters:

mce=3

This forces the kernel not to panic even if it gets MCEs while in kernel context, where a deadlock is possible. This is dangerous; the more subtle way to try this is to use

mce=bootlog

bootlog seems to be disabled by default on AMD CPUs because some buggy BIOSes are known to leave bogus logs. You can still try this first though.

Additionally you may try

mce=off

To disable MCE.. but not sure this is a good idea.
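
To check which of these options a running system was actually booted with, a minimal sketch along these lines may help. The helper name cmdline_option is hypothetical (it is not part of any tool mentioned in this report); it only splits a command line string on spaces and prints the value of the first matching NAME=VALUE word.

```shell
# cmdline_option NAME CMDLINE
# Hypothetical helper: print the value of the first NAME=VALUE word in a
# kernel command line string; print nothing if the option is absent.
cmdline_option() {
    printf '%s\n' "$2" | tr -s ' ' '\n' | sed -n "s/^$1=//p" | head -n 1
}
```

On a live system it could be called as cmdline_option mce "$(cat /proc/cmdline)", and in the same way for maxcpus.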

Then there is your "issue", which you say is the same but for which you have not provided an oops; you are just saying it hangs with ath9k. What would help is to get your oops and/or mcelog output converted to ASCII. I provided instructions on how to do this above in this bug report, but here they are again:

Usually you would have mcelog run from cron, perhaps hourly, as follows:

#!/bin/bash
/usr/sbin/mcelog --ignorenodev --filter >> /var/log/mcelog

But since you are eager to find this and you can trigger it, run it in a loop
as follows, every 5 seconds:

root@deathMCEbox:# while true; do /usr/sbin/mcelog --filter >> /var/log/mcelog; sleep 5; done


Then make the mcelog human readable by running:

mcelog --ascii < mcelog > mcelog.txt
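
If an endless loop is not wanted, the same polling idea can be sketched as a bounded loop. The function name poll_to_log and its argument order are hypothetical; it simply runs whatever command it is given a fixed number of times and appends the output to a log file.

```shell
# poll_to_log INTERVAL COUNT LOG COMMAND [ARGS...]
# Hypothetical helper: run COMMAND COUNT times, INTERVAL seconds apart,
# appending its stdout and stderr to LOG.
poll_to_log() {
    interval=$1
    count=$2
    log=$3
    shift 3
    i=0
    while [ "$i" -lt "$count" ]; do
        "$@" >> "$log" 2>&1
        i=$((i + 1))
        # Do not sleep after the final run.
        if [ "$i" -lt "$count" ]; then
            sleep "$interval"
        fi
    done
}
```

With the values used above, poll_to_log 5 720 /var/log/mcelog /usr/sbin/mcelog --filter would poll every 5 seconds for about an hour (run as root).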

Again here are the two cases reported in this bug report so far:

== Case 1 ==
CPU: Athlon64 X2
Motherboard: nForce4
 * oops captured
 * mcelog ascii provided
 * two errata workarounds applied but issue still occurs
 * this is a hardware issue
 * no fix has been provided yet
----------------------------------------------------
== Case 2 ==
CPU: AMD Athlon(tm) 64 X2 Dual Core Processor 6000+
Motherboard: Asus M3N78-VM
 * no further information provided
----------------------------------------------------
Comment 117 Luis Chamberlain 2008-10-30 12:28:16 UTC
BTW, can the two users who are currently seeing this issue report the following back to us:

 * Motherboard model
 * BIOS and BIOS revision number
 * lspci -vn and lspci -vvv
Comment 118 Jonathan May 2008-10-30 12:42:58 UTC
We would basically like to get the same motherboard and BIOS in house to recreate this issue with PCIe link analysis equipment in place to determine the root cause.
Comment 119 Phil Linttell 2008-10-30 14:42:46 UTC
Jonathan/Luis,

Motherboard: Asus M3N78-VM
http://www.asus.com/products.aspx?l1=3&l2=149&l3=643&l4=0&model=2268&modelmenu=1

BIOS: Ver. 0804 (Dated 10/15/2008) - latest available from Asus web site
Processor: AMD Athlon 64 X2 Dual Core 6000+
Memory: 4GB (2x OCZ 2G DDR2 PC2 6400 )
Peripherals: Seagate 1TB SATA II hard drive (AHCI mode, ST31000340AS), D-Link DWA-552

The motherboard is a current model.

Kernel is 2.6.27-7-generic.  I boot with  iommu=noaperture,memaper=3 maxcpus=1
I installed using kubuntu 8.10 beta (amd64) alternate CD.  All updates applied.

Video is Integrated NVIDIA GeForce 8200 GPU.  I am using Nvidia accelerated graphics driver 177.80 (October 7, 2008) from their web site:
http://www.nvidia.com/object/linux_display_amd64_177.80.html
(Using VGA, not HDMI)

lspci output to follow.
Comment 120 Matteo Croce 2008-10-30 16:20:31 UTC
I have a completely different video subsystem: Ati 4870 with free drivers

Phil: if you want to see the error, switch to a text terminal while keeping the network active.
You can do that with something like:

while true; do curl http://kernel.org/pub/linux/kernel/v2.6/linux-2.6.27.tar.gz > /dev/null; done
Comment 121 Matteo Croce 2008-10-30 16:21:51 UTC
Also, to get the msr device and some verbose x86info output:

sudo modprobe msr
sudo modprobe cpuid
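
A quick way to confirm that the msr nodes now exist for every CPU the kernel knows about could look like the sketch below. The function name missing_msr_nodes is hypothetical, and it assumes the usual layout of /sys/devices/system/cpu/cpuN and /dev/cpu/N/msr; both directories can be overridden for testing.

```shell
# missing_msr_nodes [SYSDIR] [DEVDIR]
# Hypothetical helper: print the number of each CPU listed under SYSDIR
# (default /sys/devices/system/cpu) that has no DEVDIR/N/msr node
# (default /dev/cpu). Prints nothing when every CPU has one.
missing_msr_nodes() {
    sys=${1:-/sys/devices/system/cpu}
    dev=${2:-/dev/cpu}
    for dir in "$sys"/cpu[0-9]*; do
        [ -d "$dir" ] || continue
        n=${dir##*/cpu}
        [ -e "$dev/$n/msr" ] || echo "$n"
    done
}
```

Empty output after loading the module means the perl/hexdump commands from earlier in this report should be able to open the device.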
Comment 122 Phil Linttell 2008-10-30 19:13:16 UTC
Created attachment 18539 [details]
output from lspci -vvv (both CPU cores enabled)

Booted with both CPU cores enabled (no maxcpus boot parameter) and dumped output prior to initializing network
Comment 123 Phil Linttell 2008-10-30 19:14:36 UTC
Created attachment 18540 [details]
output from lspci -vn (both CPU cores enabled)

Output from 
lspci -vn 

after booting with both cores enabled.
Comment 124 Phil Linttell 2008-10-30 19:17:09 UTC
Created attachment 18541 [details]
output from dmesg (both cores enabled)

This was dumped after initializing the wireless network, with both cores enabled.  The system hung about 2 seconds after dumping.

Just an aside, it was necessary to turn off Wireless QoS on the router in order to be able to associate at all.
Comment 125 Matteo Croce 2008-10-31 19:54:00 UTC
Phil, can you paste the output of these commands?

modprobe msr
modprobe cpuid
(perl -e 'sysseek(STDIN, 0xC001001F, 0)'; hexdump -n 8 -e '2/4 "%08x " "\n"') </dev/cpu/0/msr
(perl -e 'sysseek(STDIN, 0xC001001F, 0)'; hexdump -n 8 -e '2/4 "%08x " "\n"') </dev/cpu/1/msr
setpci -s 18.0 0x68.l
Comment 126 Phil Linttell 2008-10-31 21:06:42 UTC
Hi Matteo,

Output is as follows:

root@media:~# modprobe msr
root@media:~# modprobe cpuid
root@media:~# (perl -e 'sysseek(STDIN, 0xC001001F, 0)'; hexdump -n 8 -e '2/4 "%08x " "\n"') </dev/cpu/0/msr
00000008 00400001
root@media:~# (perl -e 'sysseek(STDIN, 0xC001001F, 0)'; hexdump -n 8 -e '2/4 "%08x " "\n"') </dev/cpu/1/msr
00000008 00400001
root@media:~# setpci -s 18.0 0x68.l
0f2ec820
root@media:~#
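
For reading output like this, a small sketch that joins the two words into one 64-bit value and lists which bit positions are set may be useful. The function names msr_join and msr_set_bits are hypothetical. It assumes the hexdump format used above on a little-endian x86 machine, where the first printed word holds bits 31..0 and the second holds bits 63..32. Which bit belongs to which erratum workaround is not decided by this code; that has to come from comment 85 and AMD's revision guide.

```shell
# msr_join LOW HIGH
# Hypothetical helper: join the two 32-bit words printed by
#   hexdump -n 8 -e '2/4 "%08x " "\n"'
# into a single 64-bit hex value (first word = low half).
msr_join() {
    printf '0x%08x%08x\n' "$((0x$2))" "$((0x$1))"
}

# msr_set_bits VALUE
# Hypothetical helper: print the positions of all bits set in a 64-bit
# value, lowest first, separated by spaces.
msr_set_bits() {
    value=$(($1))
    bit=0
    out=""
    while [ "$bit" -lt 64 ]; do
        if [ $(( (value >> bit) & 1 )) -eq 1 ]; then
            out="$out $bit"
        fi
        bit=$((bit + 1))
    done
    echo "${out# }"
}
```

For the words shown above, msr_join 00000008 00400001 gives 0x0040000100000008, and msr_set_bits on that value reports bits 3, 32 and 54.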
Comment 127 Matteo Croce 2008-11-01 19:34:03 UTC
Phil: Your system seems to be affected by erratum 169.
Andreas: can you confirm it? Can't you build a similar system and try to find the error?
My CPU is old, but Phil's is pretty new, so the bug is more severe now.
Comment 128 Phil Linttell 2008-11-06 10:15:12 UTC
Andreas,

Assuming Matteo is correct, and I am (also?) suffering from erratum 169...

1) I should request a BIOS fix from my motherboard vendor (Asus) which might be possible given it's a recent mobo

2) I can try to poke the right bits to effect the work around in Linux... but the values to do this may differ from Matteo's?

3) Given that the hardware is all less than a month old, and I can probably swap it, could I avoid errata 169 by upgrading the CPU to a newer model (such as a quad core?).  I'd rather pay for an upgraded CPU than swap the mobo.

Thanks for your advice.
Comment 129 Matteo Croce 2008-11-09 17:41:33 UTC
I get the same hang while bringing the interface up under Windows, what a coincidence.
It's the first time I have tried to use the Internet with that OS, so it may be an Atheros HW issue, not an AMD one.
Comment 130 Phil Linttell 2008-11-11 08:28:42 UTC
Hi Matteo,

Thanks very much for trying to re-create the hang.  I've been able to trigger the freeze while on the console screen, and the screen just goes blank - no kernel message at all.  I'm not sure what else I can do to try and capture some useful information.

I don't believe I suffer from errata 169... according to Andreas' earlier comment (#85)

For 169 I would have expected to see this bit set:
NB_CFG: ******** *******1

That bit is set for me:

# (perl -e 'sysseek(STDIN, 0xC001001F, 0)'; hexdump -n 8 -e '2/4 "%08x " "\n"') </dev/cpu/0/msr
00000008 00400001


I have submitted a support request to Asus.... 

I'm at a bit of a loss as to how to go about getting 802.11n working on my system.  My understanding is ath9k is the only driver supporting 802.11n?  

Do I try exchanging the board for a DWA-556 PCIe x1 version of the same board?  (I only have one PCIe x1 slot, and would like to use it for a TV tuner card at some point).  Do I switch to NDISwrapper and satisfy myself with 802.11g?  Do I switch to an external router and try to run it in bridge mode?

It seems very strange that we both run into problems on different hardware when the second core is enabled.
Comment 131 Matteo Croce 2008-11-11 11:14:52 UTC
It could be a HW error related to the wifi card; I'd wait for a response from Atheros.
Comment 132 Luis Chamberlain 2008-11-11 14:59:37 UTC
At this point assumptions are being made; please don't make assumptions. We've already determined the issue here to be a CPU MCE. We cannot do anything about it and need involvement from AMD. If it's an issue with the CPU, we can potentially detect it somewhere in ath9k or the kernel and implement the fix/workaround there. But first we need to iron out why this occurs.
Comment 133 Phil Linttell 2008-11-11 17:48:34 UTC
Hi Luis,

As I said in comment 130, 113, and 115, I do not get an MCE.  I do not get a kernel message of any type.  My system does not seem to be suffering from errata 169.  There's nothing that categorically points to a hardware error.    

The commonality seems to be that both Matteo and I have system freezes with a second CPU core enabled and the ath9k driver.

That's why I was asking for suggestions as to how to go about isolating the problem.

I guess my next step is to re-compile ath9k with debugging.


Would you prefer I open a new bug?  I thought there was value in keeping the two issues associated, but perhaps it's confusing things more than it's helping.

Phil
Comment 134 Matteo Croce 2008-11-11 17:52:02 UTC
Ok sorry for bugging your hardware, but if I and Phil have the same error with two different CPUs and the same card there _could_ be something wrong with the card _or_ with that card combined with such two CPUs
Comment 135 Phil Linttell 2008-11-19 17:51:11 UTC
I've downloaded compat-wireless-2008-11-17 and compiled with ATH_DBG_ANY.  I did a "make install" and a "make load", then (after a modprobe -r ath9k, because mac80211 and another module were busy) another "make load" and a reboot.  I believe I got all the latest drivers installed (how do I verify the version of a loaded driver?) but I can't figure out where the debug output goes.

How do I find/capture the debug output?  It doesn't seem to show up on the Ctrl-Alt-F8 console screen.

Anyway, the new drivers don't resolve the problem.
Comment 136 Phil Linttell 2008-11-19 18:00:23 UTC
Never mind... I found it....  as Matteo says, an insane amount of output.
Comment 137 Phil Linttell 2008-11-21 10:21:13 UTC
I've spent considerable time looking into this.  I no longer believe this is the ath9k driver.  I was able to run the ath9k driver with all debugging enabled and provoke the freeze without any error messages from the driver.

The problem appears to be related to interrupt handling (and possibly CPU frequency power management) on multi-core AMD64 CPUs and NVIDIA drivers.  It occurs over multiple releases of the kernel and the video drivers, different (AMD64 multi-core) CPUs, different motherboards, and different NVIDIA chipsets.  It appears to affect quite a large number of people.

The common denominators are AMD64, multi-core, and NVIDIA.  The easiest way to diagnose the problem is that the system will not freeze with maxcpus=1.  Some people have found other ways around the problem by restricting interrupt handling to the first core.
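
The reports do not say exactly how interrupts were restricted; one common mechanism on Linux is writing a hexadecimal CPU bitmask to /proc/irq/<IRQ>/smp_affinity. As a rough sketch under that assumption, the hypothetical helper cpu_mask below builds such a mask from a list of CPU numbers.

```shell
# cpu_mask CPU [CPU...]
# Hypothetical helper: print the hex bitmask with one bit set per listed
# CPU number, in the form accepted by /proc/irq/<IRQ>/smp_affinity.
cpu_mask() {
    mask=0
    for cpu in "$@"; do
        mask=$((mask | (1 << cpu)))
    done
    printf '%x\n' "$mask"
}
```

Pinning one interrupt to the first core would then be something like writing the output of cpu_mask 0 to /proc/irq/<IRQ>/smp_affinity as root, with the IRQ number taken from /proc/interrupts.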

The reason this seemed to be related to networking (and ath9k, particularly when I run something like bittorrent) is that the network card is a great way to generate lots of interrupts when under load.

Thanks to those who tried to help.... you've certainly helped to isolate it.

There are quite a few bug reports, distribution and kernel, related to this but which haven't identified the root cause.  I'm going to try and summarize my findings, and see if I can pull some of those bugs together into one (possibly 10453). 

http://bugzilla.kernel.org/show_bug.cgi?id=10453

[Andreas, are you able to provide any guidance as to how to go about getting this investigated further?]
Comment 138 Matteo Croce 2009-01-07 15:42:06 UTC
My new computer just arrived, an Intel Core i7: quad core with hyper-threading, so I have 8 logical cores.
Well, when bringing the interface up the system freezes again, but booting with maxcpus=0 fixes it.
So it seems to be an issue with the Atheros card, as I'm now using a Realtek rtl8187 which works just fine.
I won't keep my 8-core PC in single-core mode; I hope the bug is fixed soon or I'll trash my Atheros card, as either the driver is still unstable or the card is just crap hardware.
Comment 139 Matteo Croce 2009-01-07 15:43:29 UTC
Also, many thanks to the AMD staff for their support and sorry for bugging their products
Comment 140 Luis Chamberlain 2009-01-07 16:07:49 UTC
Did you get an oops message from when the system hung? Did you get an MCE panic? What kernel are you using now? Have you tried the latest from wireless-testing / compat-wireless?
Comment 141 Matteo Croce 2009-01-07 16:38:27 UTC
The Intel gives no MCE or panic, just freezes.
I'm using a vanilla 2.6.28
never tried wireless-testing or compat-wireless
Comment 142 Luis Chamberlain 2009-01-07 16:56:44 UTC
Please try to capture a panic. You can first use a virtual terminal and see whether the panic shows up there. You can also try netconsole:

Documentation/networking/netconsole.txt 
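
As a rough sketch of what that document describes, the netconsole parameter has the form netconsole=[src-port]@[src-ip]/[dev],[tgt-port]@<tgt-ip>/[tgt-macaddr]. The helper name netconsole_param below is hypothetical, and any addresses passed to it are placeholders to be replaced with real ones. Since the hang here involves the wireless interface, the wired interface is probably the safer one to log over.

```shell
# netconsole_param SRC_PORT SRC_IP DEV TGT_PORT TGT_IP TGT_MAC
# Hypothetical helper: build the netconsole= parameter string in the
# format given in Documentation/networking/netconsole.txt.
netconsole_param() {
    printf 'netconsole=%s@%s/%s,%s@%s/%s\n' "$1" "$2" "$3" "$4" "$5" "$6"
}
```

The resulting string can be passed on the kernel command line or as an argument to modprobe netconsole, with a UDP listener running on the target machine.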

You can also feel free to try the latest bleeding edge wireless drivers and stack from the wireless-testing tree. This guide shows you how to get it:

http://wireless.kernel.org/en/developers/Documentation/git-guide

It's basically an entire kernel.

Alternatively you can also just get the latest and greatest wireless components only and compile them separately from here:

http://wireless.kernel.org/en/users/Download
Comment 143 Phil Linttell 2009-01-08 07:29:39 UTC
Hi Matteo,

I've come to the conclusion on my system that it's a hardware problem caused by the network card.

I've tried multiple kernels, including the latest stable.  I've configured a serial console to try to capture any kernel oops or panic messages... there isn't one.  Nothing in MCE log. 

At one point I re-compiled the wireless drivers with all debugging..... no messages prior to the freeze.  At the same time, the freeze only appears with multiple CPUs enabled under network load.  Slightly irksome with only two cores.... I can understand why you wouldn't want 7 cores sitting idle.

I've tested pulling the network card and using the on-board LAN....  no crashes under multi-core.

I can't seem to find any other PCI card that is supported under Linux with native 802.11n support (there's a PCI Express version of the same card that might be worth trying, but I only have 1 PCI Express slot that I plan to put a TV capture card into.)  My plan at the moment is to go with a d-link DAP-1522 wireless access point/bridge, that will plug into the on-board ethernet and find a new home for the PCI card.

If I thought there was more that I could do to help find the cause of the lock-up, and that a work-around might be possible in software, I might stick with it.  However, I think the chances of fixing it in software might be slim, and my kernel debugging skills might not be up to the challenge.

I might give it one more try with the latest wireless drivers suggested by Luis, but my suspicion is that the problem is in the hardware, specifically the D-Link card.  (Googling, I noticed that similar lock-ups were observed under 64-bit Windows early on in the card's life cycle, but this appeared to be cleared up by new Windows drivers being issued.)
Comment 144 Matteo Croce 2009-03-25 22:36:34 UTC
Seems to be fixed with the kernel 2.6.29, can anyone confirm it?
Comment 145 Phil Linttell 2009-03-25 23:51:11 UTC
(In reply to comment #144)
> Seems to be fixed with the kernel 2.6.29, can anyone confirm it?

Hey, great news, Matteo.  You're able to run with heavy network load with 2 cores enabled?

I'm downloading 2.6.29 stable now, and will try it later and let you know.  I'm still running with maxcpus=1.

Phil
Comment 146 Matteo Croce 2009-03-26 00:34:35 UTC
(In reply to comment #145)
> (In reply to comment #144)
> > Seems to be fixed with the kernel 2.6.29, can anyone confirm it?
> 
> Hey, great news, Matteo.  You're able to run with heavy network load with 2
> cores
> enabled?  
> 
> I'm downloading 2.6.29 stable now, and will try it later and let you know. 
> I'm
> still running with maxcpus=1.
> 
> Phil

root@raver:~# uname -a
Linux raver 2.6.29 #2 SMP PREEMPT Tue Mar 24 15:25:36 CET 2009 x86_64 GNU/Linux
root@raver:~# fgrep processor /proc/cpuinfo |wc -l
8
root@raver:~# uptime
 01:33:55 up  3:23,  2 users,  load average: 0.06, 0.08, 0.11
root@raver:~# ifconfig wlan0 |grep 'Byte RX'
          Byte RX:1658605233 (1.6 GB)  Byte TX:47759274 (47.7 MB)
root@raver:~#
Comment 147 Luis Chamberlain 2009-03-26 00:45:09 UTC
Just wanted to point out that if you are not seeing issues on 2.6.29 then you should also see no issues in the upcoming releases of the 2.6.27 and 2.6.28 kernels, if the patches I sent make it in by then. The change I expect to have helped is the serialization of IO. While the issue may be fixed, the MCE exception could theoretically still be triggered, but if you're not seeing it then great :)

For those stuck on older kernels who cannot wait until the serialization patches get merged, you can use compat-wireless or the respective ported patches for your kernel, either for 2.6.27 or for 2.6.28:

http://www.kernel.org/pub/linux/kernel/people/mcgrof/patches/ath9k/2009-03-12/serialization-v6.tar.bz2
Comment 148 Phil Linttell 2009-03-26 02:12:15 UTC
Hi Matteo and Luis,

I can confirm that 2.6.29 has resolved my issue and I'm able to run with both cores under heavy network load with ath9k and no hanging....

root@media:~# uname -a
Linux media 2.6.29-020629-generic #020629 SMP Tue Mar 24 11:23:53 UTC 2009 x86_64 GNU/Linux

Luis, if your IO serialization changes fixed it.... THANKS!  It's very nice to finally be able to use my second core!  I've invested 100's of hours into trying to track the problem down.


(In reply to comment #144)
> Seems to be fixed with the kernel 2.6.29, can anyone confirm it?
Comment 149 Luis Chamberlain 2009-03-31 16:50:50 UTC
Can someone close this bug?
Comment 150 Matteo Croce 2009-03-31 16:56:03 UTC
Sure, after 150 comments :)
Comment 151 Luis Chamberlain 2009-05-18 16:15:12 UTC
Can someone close this bug?
Comment 152 Matteo Croce 2009-05-18 16:44:31 UTC
isn't it closed already?
