Latest working kernel version: 2.6.25
Earliest failing kernel version: 2.6.27-rc5
Distribution: Ubuntu 8.10
Hardware Environment: Athlon64 X2, nForce4 motherboard
Problem Description: I can't boot: the kernel prints a large number of errors and then the watchdog reboots the machine. The only things I can read are mce_panic and do_machine_check.
Steps to reproduce: boot, start X, wait 2 or 3 seconds
I don't have a serial console; please help me save a panic dump. I have a serial->USB adapter with which I flash my router. Can I log into the router and read the error from there? Will the USB driver work while panicking?
Created attachment 17700 [details] a screenshot of the panic
Created attachment 17701 [details] a screenshot of the panic, scrolling down
In the meantime I've posted two screenshots, low quality though.
Thanks. Were you able to test 2.6.26?
Yes, but it hangs for some obscure reason too. I really can't debug it: as an X user, everything just hangs and I can't check anything. I don't have a serial port; how can I provide a useful stack trace?
> I haven't serial, how can I provide an useful stacktrace?

netconsole might be worth a try. See Documentation/networking/netconsole.txt

Please switch to 2.6.27-rc6 as well.

Thanks,
tglx
2.6.27-rc6 hangs too. While I try netconsole: I've seen "smp.c: smp_call_function_single" at the top of the backtrace.
Tried netconsole; it logs the standard messages but not the panic. Maybe the Ethernet driver is unable to send after the error? Meanwhile I'm trying 2.6.26.2, as it takes a long time before it hangs (10-30 minutes).
akpm: 2.6.26.5 never hangs, the bug is 2.6.27 specific
ok, the machine check happens in acpi_pm_read() as far as I can decode your screenshots. Can you please upload the messages which you see via netconsole up to the point where the kernel crashes ?
Created attachment 17727 [details] the syslog The log from a normal bootup, just before the crash
Can you please add "nopat" to the kernel command line ?
Also can you please provide a dmesg from the last working kernel ?
hangs with nopat too
Created attachment 17729 [details] the 2.6.26.5 syslog
Thomas, note that PAT was enabled in 2.6.26.5 too
Can't I save the error in RAM and read it after kexecing a new kernel? Sort of...
I've found CONFIG_CRASH_DUMP, can it help?
On the basis of comment #10 I'm regarding this as a recent regression.
Correction: the system doesn't reboot by itself; it just hangs forever and then the watchdog reboots it. I discovered this because, after loading a crashdump kernel, the machine still rebooted without kexecing the new kernel. What happens is this: while on the console everything freezes except the blinking cursor, and I can't change VT or type anything.
No response from the kernel gurus, i'll git-bisect the kernel and try to find what is the issue
> can't I save the error in RAM and read it after kexecing a new kernel?
> Sort of...

If you don't have a serial console (a null modem cable) then the easiest way is to build netconsole into the kernel (CONFIG_NETCONSOLE=y) and set it up on the boot line, via something like:

    netconsole=4444@10.0.1.13/eth0,4444@10.0.1.16/00:11:22:33:44:55

where 10.0.1.13 is the IP of the crashing box, 10.0.1.16 is the IP and 00:11:22:33:44:55 is the Ethernet address of the box you log to. On the box you log the crash to, run something like:

    netcat -u -p 4444 10.0.1.13 4444 2>&1 | tee -a crash.log

and you should have the log in crash.log.

OTOH, bisecting this bug would certainly be very informative; the crash seems to happen reliably.

Ingo
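The boot parameter above follows the pattern src-port@src-ip/dev,tgt-port@tgt-ip/tgt-mac. A minimal sketch of a helper that assembles it, so the values are easy to swap for your own network; the function name netconsole_param is hypothetical, and the addresses are simply the ones from the example above:

```shell
#!/bin/sh
# Hypothetical helper: assemble a netconsole= boot parameter.
# Arguments: src_port src_ip dev tgt_port tgt_ip tgt_mac
netconsole_param() {
    printf 'netconsole=%s@%s/%s,%s@%s/%s\n' "$1" "$2" "$3" "$4" "$5" "$6"
}

# Example call with the addresses used in the comment above.
netconsole_param 4444 10.0.1.13 eth0 4444 10.0.1.16 00:11:22:33:44:55
```

The printed string can be pasted onto the kernel command line of the crashing box.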
network console hangs too, I have to bisect
i've found what hangs the kernel, it was the ath9k driver
So, this is not a regression in a strict sense, because ath9k was not present in the previous kernels. Can you confirm that the rest of the kernel works correctly if ath9k is not loaded?
i can use a 2.6.27-rc6 kernel using another wireless driver (madwifi)
Not sure where to start...the output of 'lspci -n' might be a good place. Also, is there a version where the ath9k loaded successfully? If so, could we see the dmesg output from that version? Were you trying to load madwifi as well as ath9k? If so, does disabling the load of madwifi allow ath9k to load successfully?
# lspci -nn |grep Atheros
05:08.0 Network controller [0280]: Atheros Communications Inc. AR5416 802.11abgn Wireless PCI Adapter [168c:0023] (rev 01)

It's the first time I've tried ath9k; I was using madwifi until now. I can't test ath9k with a 2.6.26 kernel as it doesn't compile. I have another amd64 machine where ath9k works but disconnects frequently, so I have to do ifdown && ifup...
Could you try an earlier 2.6.27-rc? 2.6.27-rc3 perhaps? You may also wish to try this patch (hopefully merged to Linus soon): http://marc.info/?l=linux-wireless&m=122163541519736&w=2
It hung with 2.6.27-rc3 too, IIRC, but I'm not 100% sure. BTW, the patch you posted fixed the disconnects on the other machine.
Tested on my pc, it still hangs. John: any way to debug the ath9k driver or the mac80211 stack? could it be a mac80211 deadlock?
It _might_ be any number of issues -- hard to say with no stack trace. When it hangs, do you have a text console? Can you enable sysrq and capture the output of alt-sysrq-p while it is hung?
I took many photos of the crash and stitched them together with GIMP. Please have a look while I run OCR on the image.
Created attachment 17979 [details] a screenshot to the error
ath9k could leave the card active after rmmod, as loading madwifi complains about spurious interrupts:

ath9k 0000:05:08.0: PCI INT A disabled
ath9k: driver unloaded
ath_hal: module license 'Proprietary' taints kernel.
AR5210, AR5211, AR5212, AR5416, RF5111, RF5112, RF2413, RF5413, RF2133)
ath_pci 0000:05:08.0: PCI INT A -> Link[APC1] -> GSI 16 (level, low) -> IRQ 16
MadWifi: ath_attach: Switching rfkill capability off.
wifi0: Atheros AR5418 chip found (MAC 13.10, PHY SChip 8.1, Radio 13.0)
ath_pci: wifi0: Atheros 5416: mem=0xfdee0000, irq=16
udev: renamed network interface ath0 to wlan0
irq 16: nobody cared (try booting with the "irqpoll" option)
Pid: 0, comm: swapper Tainted: P 2.6.27-rc7 #3

Call Trace:
 <IRQ> [<ffffffff8025ab88>] getnstimeofday+0x48/0xc0
 [<ffffffff8027363e>] __report_bad_irq+0x1e/0x80
 [<ffffffff80273940>] note_interrupt+0x2a0/0x2d0
 [<ffffffff8027426d>] handle_fasteoi_irq+0xdd/0x100
 [<ffffffff8020f9f3>] do_IRQ+0xa3/0x130
 [<ffffffff8020c9b1>] ret_from_intr+0x0/0xa
 <EOI> [<ffffffff8021f360>] lapic_next_event+0x0/0x10
 [<ffffffff8025f61f>] tick_nohz_stop_sched_tick+0xff/0x3e0
 [<ffffffff8020b3aa>] cpu_idle+0x2a/0x100
handlers:
[<ffffffffa0266e10>] (ath_intr+0x0/0x4460 [ath_pci])
Disabling IRQ #16
Can you try this patch? http://git.kernel.org/?p=linux/kernel/git/linville/wireless-2.6.git;a=commitdiff;h=6115e8557a75b5f24b56ed46c60dffef7e7fa992;hp=5d89945e6ec44494285cb8de85d4f43d4647b740
Done, but now I can't load one driver after the other: when loading ath9k I get a panic in ath_rx_something.
Can we see that panic? :-)
I didn't have the camera with me this time, sorry
The idea is to kill MadWifi: please rm -rf it and its modules, and let's resolve your issue. Before posting a kernel panic please make sure madwifi is nowhere to be seen:

    modprobe -l ath_pci
Also if you can upgrade to 2.6.27-rc7 that would be great as it has some of the pending patches posted by John here already merged.
I upgraded just after its release, no luck. As for madwifi, I don't have it installed; I only tried installing it to get some network access without rebooting.
We definitely need some sort of panic in order to assist. Can you try to write down the oops, a picture works too?
What panic? i've already attached the OOPS, it's attachment #17979 [details]
Yes, but that picture doesn't show me anything related to ath9k that we can help with. In fact, from reading the trace it seems a machine check exception was hit, so the kernel may have picked that up and thrown up. The picture I was hoping for was one of an oops related to ath9k without MadWifi (a proprietary driver) ever touching the kernel.
I don't use madwifi, i use (well, try to) use ath9k but i get this issue. i know it's ath9k fault as I can stay connected with an rtl8187 or a zydas card all the day without errors
Alright, well let's work with your current latest oops, which you indicate is attachment #17979 [details]. Your oops indicates it's tainted, but that is because of the MCE (Machine Check Exception) one of the CPUs hit. The last sane call I see is acpi_pm_read(), which just inl()'s the ACPI power management timer. Not sure how ath9k can affect this, but I'm just going with what is presented to us through the provided panic.

Can you try adding to your kernel boot parameters:

    acpi_pm_good

And if that does not work try adding this:

    acpi=off

Also if you can provide the output of

    cat /proc/interrupts

that would be helpful.
2.6.27-rc8 is out, will try it first :)
no luck, 2.6.27-rc8 hangs too. I load the module and in less than 10 seconds the system freezes
Can you try adding to your kernel boot parameters:

    acpi_pm_good

And if that does not work try adding this:

    acpi=off

Also if you can provide the output of

    cat /proc/interrupts

that would be helpful.
# cat /proc/interrupts
           CPU0       CPU1
  0:         15          0   IO-APIC-edge      timer
  6:          0          5   IO-APIC-edge      floppy
  7:          1          0   IO-APIC-edge
  8:          0          0   IO-APIC-edge      rtc0
  9:          0          0   IO-APIC-fasteoi   acpi
 14:          3        240   IO-APIC-edge      pata_amd
 15:          0        283   IO-APIC-edge      pata_amd
 16:          0          0   IO-APIC-fasteoi   ath
 18:          0          0   IO-APIC-fasteoi   EMU10K1
 20:          6      24855   IO-APIC-fasteoi   ehci_hcd:usb1
 22:          0          0   IO-APIC-fasteoi   sata_nv
 23:         13       9722   IO-APIC-fasteoi   sata_nv, ohci_hcd:usb2
NMI:          0          0   Non-maskable interrupts
LOC:      14422      21745   Local timer interrupts
RES:       4054       1806   Rescheduling interrupts
CAL:        641        266   function call interrupts
TLB:        622        400   TLB shootdowns
TRM:          0          0   Thermal event interrupts
THR:          0          0   Threshold APIC interrupts
SPU:          0          0   Spurious interrupts
ERR:          1
I've tested these parameters:

With acpi_pm_good it hangs the same.
With acpi=off my SATA controller (sata_nv) is unable to access the disks, so nothing boots.

But I've found that everything works flawlessly with maxcpus=0. I tried this because the error in the image was duplicated: I hoped that with only one CPU enabled it would be printed only once, so I could see what comes before the errors. Instead I get no errors at all...
Works with maxcpus=1 as well, could it be a deadlock? warn_on_slow_path...
Well, like I said before, your panic only provides enough information to indicate that the cause of the panic is a Machine Check Exception:

http://en.wikipedia.org/wiki/Machine_Check_Exception

The last sane call I see is acpi_pm_read(). I'm running low on ideas. One possibility is to recompile ath9k with all debug messages enabled and see if the log captures some errors prior to your panic. To do this, modify the file drivers/net/wireless/ath9k/core.h on the line that says

    #define DBG_DEFAULT (ATH_DBG_FATAL)

Modify this to be:

    #define DBG_DEFAULT (ATH_DBG_ANY)

Then, since your machine seems to crash as ath9k gets loaded, remove the module from your modules directory before booting into the kernel:

    sudo rm -f /lib/modules/2.6.27-rc8/kernel/drivers/net/wireless/ath9k/ath9k.ko
    sudo depmod -ae

After doing this boot into 2.6.27-rc8. Then recompile ath9k as follows:

    make M=drivers/net/wireless/ath9k/
    sudo insmod drivers/net/wireless/ath9k/ath9k.ko
ATH_DBG_ANY gives an insane amount of output. I ran a loop like:

    i=0
    while true; do
        dmesg > dmesg.$i
        i=$((i+1))
        sync
    done

and I attach all the dmesgs.
Created attachment 18136 [details] collection of many dmesg
ok, i've disabled warn_on_slowpath, let's see what happens...
No luck, there are no relevant printks before the hang
I've never tried this but it may help to find out the details of root cause to the MCE issue: http://freshmeat.net/projects/mcelog/
root@raver:~# mcelog
root@raver:~#
You do have /dev/mcelog, right? OK, well, I speak MCE and mcelog now. As per Andi Kleen's documentation, when PCI IO errors are enabled machine checks could be caused by software bugs in drivers -- BUT -- this normally isn't the case on current x86 machines. But... let's keep trying to dig into what is causing this...

You need to force the MCE again. I suspect the kernel detected the MCE while it was running acpi_pm_read() and it detects that a deadlock is possible. So you can reproduce this, right? OK, well, what we need to do is trigger the MCE again and try to keep the machine alive to capture the log. Usually mcelog is run by cron to capture MC events, but if an MC event is captured during kernel context and it's an exception, it panics. So we can try to force the panic to not happen to see if we can capture the data from the character device /dev/mcelog. The problem is that during reboot /dev/mcelog is obviously cleared, and I suspect it will be populated only if the MCE banks on the CPU were not cleared (perhaps during a warm boot) upon initialization. Anyway, so let's force the box to not panic during an MCE:

# Note this is dangerous and perhaps stupid :)
root@deathMCEbox:# for i in /sys/devices/system/machinecheck/machinecheck*; do echo 3 > $i/tolerant; done

This forces the kernel to never panic on each CPU, even if an MCE was found and the current CPU process was in kernel context and a deadlock was possible (dangerous).

Usually you would have mcelog run in cron as follows, perhaps hourly:

#!/bin/bash
/usr/sbin/mcelog --ignorenodev --filter >> /var/log/mcelog

But since you are eager to find this and you can trigger it, run this in a loop as follows to run it every 5 seconds:

root@deathMCEbox:# while true; do /usr/sbin/mcelog --filter >> /var/log/mcelog; sleep 5;

Of course, you cannot do any of this if ath9k gets loaded upon bootup and that damn culprit driver is causing the oops, huh? So what you have to do is disable the ath9k driver from loading. But we want to enable it later, so do something like this to be sure:

root@deathMCEbox:# mv /lib/modules/`uname -r`/kernel/drivers/net/wireless/ath9k/ath9k.ko /lib/modules/`uname -r`/kernel/drivers/net/wireless/ath9k/ath9k.ko.old
root@deathMCEbox:# depmod -ae
root@deathMCEbox:# reboot

Then AFTER the BOX COMES UP SUCCESSFULLY, you will run:

root@deathMCEbox:# mv /lib/modules/`uname -r`/kernel/drivers/net/wireless/ath9k/ath9k.ko.old /lib/modules/`uname -r`/kernel/drivers/net/wireless/ath9k/ath9k.ko
root@deathMCEbox:# depmod -ae

At this point start doing all the mcelog stuff I mentioned above, including the /sys/devices/system/machinecheck/machinecheck* tuning. Once you have all that set up, then on one vty have this running:

tail -f /var/log/mcelog

Then:

root@deathMCEbox:# modprobe ath9k

Did you see anything? If so, then now you have to whip out the Athlon64 X2 and hope you can translate the errors. If you cannot, you can try reducing the mask of the MC events through each CPU's bank ctls (/sys/devices/system/machinecheck/machinecheck0/bank*ctl) and try to pinpoint the error.

Let me know how it goes.
Oh, a typo: I forgot the done here:

root@deathMCEbox:# while true; do /usr/sbin/mcelog --filter >> /var/log/mcelog; sleep 5; done
Oh and I meant that you'd have to whip out the Athlon64 X2 *documentation*, as only with that can you determine some of these errors I suspect. But we'll see.
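The tolerant tuning described above can be wrapped in a small function so it can be tried against a scratch directory before touching the real sysfs tree. This is only a sketch: the function name set_tolerant and its base-directory argument are illustrative assumptions, and on a real system the base would be /sys/devices/system/machinecheck and the call would need root.

```shell
#!/bin/sh
# Sketch: write a tolerance level into every machinecheck*/tolerant file
# found below a base directory (hypothetical helper, not part of mcelog).
set_tolerant() {
    base=$1
    level=$2
    for d in "$base"/machinecheck*; do
        # Skip entries that have no writable tolerant file.
        [ -w "$d/tolerant" ] || continue
        echo "$level" > "$d/tolerant"
    done
}

# Dry run against a scratch directory instead of the real sysfs tree.
scratch=$(mktemp -d)
mkdir -p "$scratch/machinecheck0" "$scratch/machinecheck1"
: > "$scratch/machinecheck0/tolerant"
: > "$scratch/machinecheck1/tolerant"
set_tolerant "$scratch" 3
cat "$scratch/machinecheck0/tolerant" "$scratch/machinecheck1/tolerant"
```

Passing the base directory as an argument keeps the dangerous write separate from the logic, so the loop can be checked first and only then pointed at the real machinecheck directory.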
The driver doesn't hang on load; it hangs when I generate some traffic. It can hang 2 seconds after ifup or after almost a minute. I guess your suggestion is the same as booting with mce=3, which led me to a hang too.
Created attachment 18158 [details] the MCE
booting with mce=3 i get this exception and i made a screenshot
HARDWARE ERROR
CPU 0: Machine check Exception: 4 Bank 4; b2OOOOOOOOO7OfOf
TSC 6b0b27b740
This is not a sotware problem!
Run throuh mcelog --ascii to decode and contact your hardware vendor
Kernel panic - not syncing: Machine check
mcelog --ascii <mce.txt
HARDWARE ERROR. This is *NOT* a software problem!
Please contact your hardware vendor
CPU 0 0 data cache TSC 6b0b27b740
STATUS 0 MCGSTATUS 0
I'm adding Andi Kleen to the CC list. I'm hoping he can ACK if this is indeed a hardware issue or not but from my own review of mcelog documentation and its purpose it seems this is clearly a HARDWARE ISSUE and not a SOFTWARE ISSUE. Andi -- is this fair to say? It seems the bug reporter should not try to contact his CPU's hardware vendor.
I mean -- he *SHOULD* contact his CPU's hardware vendor.
mcelog's output is clear: This is not a software problem! Run through mcelog --ascii to decode and contact your hardware vendor Yes he should. I think he must have typoed because mcelog should have resolved it. Bank 4 is the northbridge so it might be that some external IO device is not answering, but that would be still a hardware problem.
Thanks for the confirmation Andi, this bug should be closed and marked as not-a-bug. It should be very clear now that this issue is not a software issue but a hardware issue on his first CPU (CPU 0) and the reporter should contact his hardware vendor to try to resolve or get support for this issue.
why if i don't use ath9k all works fine?
I am not sure but the issue is not a kernel bug do you understand that? What I mean by that is that the kernel panic you are seeing is caused by a hardware issue, and mcelog told you exactly what it was. It may be that by using the device it triggers the hardware issue. Either way the issue is not something we can try to fix in software as the oops is a MCE on your CPU #0.
Who should I contact? AMD directly?
Sounds like a good candidate.
Andi, why didn't mcelog resolve it? I admit I used o instead of 0 (I used an OCR tool), but with the right digits it still doesn't print useful info.
Here is the correct one:

[/tmp]$ cat mce
CPU 0: Machine check Exception: 4 Bank 4: b200000000070f0f
TSC 6b0b27b740
[/tmp]$ mcelog --ascii --k8 <mce
mcelog: Cannot open /dev/mem for DMI decoding: Permission denied
HARDWARE ERROR. This is *NOT* a software problem!
Please contact your hardware vendor
CPU 0 0 data cache TSC 6b0b27b740
STATUS 0 MCGSTATUS 0
[/tmp]
Andi, your tool is a bit picky when parsing, e.g. it looked for the "Machine Check Exception" string case insensitive, while I used ; instead of :. Here is the decode:

HARDWARE ERROR. This is *NOT* a software problem!
Please contact your hardware vendor
CPU 0 4 northbridge
  Northbridge Watchdog error
       bit57 = processor context corrupt
       bit61 = error uncorrected
  bus error 'generic participation, request timed out
             generic error mem transaction
             generic access, level generic'
STATUS b200000000070f0f MCGSTATUS 4
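The flag bits in that STATUS word can also be checked by hand. The sketch below decodes only the architectural top bits of a machine check status register (VAL=63, OVER=62, UC=61, EN=60, MISCV=59, ADDRV=58, PCC=57) and the low 16-bit error code; it is not a replacement for mcelog, and the function name decode_mci_status is hypothetical. The value is split into two 32-bit halves so the arithmetic stays within what a plain POSIX shell handles.

```shell
#!/bin/sh
# Sketch: decode the architectural flag bits of a 64-bit MCi_STATUS value.
# Input must be exactly 16 hex digits (hypothetical helper).
decode_mci_status() {
    status=$1
    hi=$((0x${status%????????}))   # bits 63..32
    lo=$((0x${status#????????}))   # bits 31..0
    flags=""
    bit=31                         # bit 31 of the high half is bit 63
    for name in VAL OVER UC EN MISCV ADDRV PCC; do
        if [ $(( (hi >> bit) & 1 )) -eq 1 ]; then
            flags="$flags $name"
        fi
        bit=$((bit - 1))
    done
    printf 'flags:%s; error code 0x%04x\n' "$flags" $((lo & 0xffff))
}

# The status value from the decode above.
decode_mci_status b200000000070f0f
```

For b200000000070f0f this reports VAL, UC, EN and PCC: UC is bit 61 (error uncorrected) and PCC is bit 57 (processor context corrupt), the two bits mcelog called out.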
Fortunately latest madwifi release from OpenWrt works like a charm, i'll stick with it until I get a reply from AMD (not to be rude, but i _need_ connectivity)
Hi Matteo,

can you please provide the output of the following commands:

# lspci -vvxxxx
# ( perl -e 'sysseek(STDIN, 0xC001001F, 0)'; \
    hexdump -n 8 -e '2/4 "%08x " "\n"' ) < /dev/cpu/0/msr
# ( perl -e 'sysseek(STDIN, 0xC001001F, 0)'; \
    hexdump -n 8 -e '2/4 "%08x " "\n"' ) < /dev/cpu/1/msr

Thanks.

I just want to check whether you might be affected by some CPU erratum. See errata 131 and 169 in "Revision Guide for AMD NPT Family 0Fh Processors" (#33610 Rev. 3.30 February 2008). I just want to know whether the suggested workaround(s) are applied on your system.

Thanks,
Andreas
Created attachment 18183 [details] lspci -vvxxxx output
Attached the lspci -vvxxxx output as http://bugzilla.kernel.org/attachment.cgi?id=18183

(perl -e 'sysseek(STDIN, 0xC001001F, 0)'; hexdump -n 8 -e '2/4 "%08x " "\n"') </dev/cpu/0/msr
0200000e 00400000
(perl -e 'sysseek(STDIN, 0xC001001F, 0)'; hexdump -n 8 -e '2/4 "%08x " "\n"') </dev/cpu/1/msr
0200000e 00400000
Unless I'm very much mistaken -- at this time of day -- your system has neither the suggested workaround for Erratum 131 nor the one for Erratum 169 applied.

For 169 I would have expected to see this bit set:

NB_CFG: ******** *******1

and furthermore some other values in the config space for device 18.0 are necessary, e.g. something like:

# setpci -s 18.0 0x68.l
0f2ec820

But you have 0x0f00c820, i.e. bit 21 is not set.

For 131 I would have expected to see

NB_CFG: **1***** ********

From my point of view chances are high that your system is affected by one of those errata. Can you please check whether a new BIOS is available for your mobo? Usually that's the way to fix such issues.
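To compare the hexdump output with the masks above, it can help to see the two words as one 64-bit value. The sketch below assumes the hexdump command prints the low 32 bits first and the high 32 bits second (which is what a little-endian read of the MSR gives), so the marked bit in the second word is bit 32 of NB_CFG. The function name nb_cfg_from_words is hypothetical.

```shell
#!/bin/sh
# Sketch: join the two 32-bit words printed by the hexdump command
# (assumed order: low word, then high word) into one 64-bit hex value
# and report whether bit 32 is set.
nb_cfg_from_words() {
    lo=$1
    hi=$2
    bit32=$(( 0x$hi & 1 ))   # bit 32 of the MSR is bit 0 of the high word
    printf '%s%s bit32=%d\n' "$hi" "$lo" "$bit32"
}

# The values reported above for both cores.
nb_cfg_from_words 0200000e 00400000
```

The joined value has the same digits rdmsr from msr-tools would show, apart from leading zeros.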
No new BIOSes from my vendor (abit). can I set those register by hand?
Setting the bit in Northbridge function 0 should be easy. This is from my desktop:

# setpci -s 18.0 0x6a.b
2e
# setpci -s 18.0 0x6a.b=0xe
# setpci -s 18.0 0x6a.b
0e
# setpci -s 18.0 0x6a.b=0x2e
# setpci -s 18.0 0x6a.b
2e

Setting the NB_CFG MSR needs some more thinking. But you can use HPA's msr-tools for this, see http://www.kernel.org/pub/linux/utils/cpu/msr-tools/

E.g., again on my desktop:

msr-tools-1.1.2 # ./rdmsr 0xc001001f
40000100000008
msr-tools-1.1.2 # ./wrmsr 0xc001001f 0x40000000000008
msr-tools-1.1.2 # ./rdmsr 0xc001001f
40000000000008
msr-tools-1.1.2 # ./wrmsr 0xc001001f 0x40000100000008
msr-tools-1.1.2 # ./rdmsr 0xc001001f
40000100000008

Of course there is some risk for your system. In general such bits should be set from the BIOS or during early boot.

On your machine you could try to do

# ./wrmsr 0xc001001f 0x4000010200000e
# setpci -s 18.0 0x6a.b=0x20

before you load your ath9k driver.

But, of course, I don't recommend doing this! You had better ask your mobo vendor for a BIOS update.
# (perl -e 'sysseek(STDIN, 0xC001001F, 0)'; hexdump -n 8 -e '2/4 "%08x " "\n"') </dev/cpu/0/msr
0200000e 00400001
# (perl -e 'sysseek(STDIN, 0xC001001F, 0)'; hexdump -n 8 -e '2/4 "%08x " "\n"') </dev/cpu/1/msr
0200000e 00400001

Are the values correct now?
Yes, looks good. And what about "setpci -s 18.0 0x6a=0x20" -- what is the output of "setpci -s 18.0 0x6a"?

If all is set up you can try to load/use ath9k. Either the hang was due to Erratum 169 and does not occur anymore, or you encountered some other problem.

BTW, as you have different bits set in NB_CFG than I have, I'd like to double check those settings. (Guess you have a socket 939 or 940 CPU.) Can you please provide the output of "x86info -a" from your system. The tool is available at: http://www.codemonkey.org.uk/projects/x86info/ (Please use version 1.21 of that tool.)

And last but not least I'd like to see the output of dmidecode.

Thanks!
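For reference, this is how bit 21 of the 32-bit register at 0x68 maps onto the byte write at 0x6a. The sketch is plain arithmetic, and the function name bit_to_byte is hypothetical: a bit number is divided by 8 to get the byte offset and taken modulo 8 to get the bit inside that byte.

```shell
#!/bin/sh
# Sketch: for a 32-bit config register offset and a bit number, print the
# byte offset and the one-byte mask that a setpci ".b" access would use.
bit_to_byte() {
    reg=$1
    bit=$2
    printf '0x%02x 0x%02x\n' $(( $reg + $bit / 8 )) $(( 1 << ($bit % 8) ))
}

# Bit 21 of the register at 0x68, as discussed above.
bit_to_byte 0x68 21
```

Note that writing the mask with setpci replaces the whole byte. That is harmless on the reporter's machine, where the register read 0x0f00c820 and the byte at 0x6a was therefore 00, but a byte with other bits already set would need the mask ORed into its current value first.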
# setpci -s 18.0 0x6a 20 # x86info -a x86info v1.21. Dave Jones 2001-2007 Feedback to <davej@redhat.com>. Found 2 CPUs MP Table: # APIC ID Version State Family Model Step Flags # 0 0x11 BSP, usable 15 35 2 0xfbff # 1 0x11 AP, usable 15 35 2 0xfbff -------------------------------------------------------------------------- CPU #1 eax in: 0x00000000, eax = 00000001 ebx = 68747541 ecx = 444d4163 edx = 69746e65 eax in: 0x00000001, eax = 00020f32 ebx = 00020800 ecx = 00000001 edx = 178bfbff eax in: 0x80000000, eax = 80000018 ebx = 68747541 ecx = 444d4163 edx = 69746e65 eax in: 0x80000001, eax = 00020f32 ebx = 00000156 ecx = 00000003 edx = e3d3fbff eax in: 0x80000002, eax = 20444d41 ebx = 6c687441 ecx = 74286e6f edx = 3620296d eax in: 0x80000003, eax = 32582034 ebx = 61754420 ecx = 6f43206c edx = 50206572 eax in: 0x80000004, eax = 65636f72 ebx = 726f7373 ecx = 30343420 edx = 00002b30 eax in: 0x80000005, eax = ff08ff08 ebx = ff20ff20 ecx = 40020140 edx = 40020140 eax in: 0x80000006, eax = 00000000 ebx = 42004200 ecx = 04008140 edx = 00000000 eax in: 0x80000007, eax = 00000000 ebx = 00000000 ecx = 00000000 edx = 0000000f eax in: 0x80000008, eax = 00003028 ebx = 00000000 ecx = 00000001 edx = 00000000 eax in: 0x80000009, eax = 00000000 ebx = 00000000 ecx = 00000000 edx = 00000000 eax in: 0x8000000a, eax = 00000000 ebx = 00000000 ecx = 00000000 edx = 00000000 eax in: 0x8000000b, eax = 00000000 ebx = 00000000 ecx = 00000000 edx = 00000000 eax in: 0x8000000c, eax = 00000000 ebx = 00000000 ecx = 00000000 edx = 00000000 eax in: 0x8000000d, eax = 00000000 ebx = 00000000 ecx = 00000000 edx = 00000000 eax in: 0x8000000e, eax = 00000000 ebx = 00000000 ecx = 00000000 edx = 00000000 eax in: 0x8000000f, eax = 00000000 ebx = 00000000 ecx = 00000000 edx = 00000000 eax in: 0x80000010, eax = 00000000 ebx = 00000000 ecx = 00000000 edx = 00000000 eax in: 0x80000011, eax = 00000000 ebx = 00000000 ecx = 00000000 edx = 00000000 eax in: 0x80000012, eax = 00000000 ebx = 00000000 ecx = 00000000 edx 
= 00000000 eax in: 0x80000013, eax = 00000000 ebx = 00000000 ecx = 00000000 edx = 00000000 eax in: 0x80000014, eax = 00000000 ebx = 00000000 ecx = 00000000 edx = 00000000 eax in: 0x80000015, eax = 00000000 ebx = 00000000 ecx = 00000000 edx = 00000000 eax in: 0x80000016, eax = 00000000 ebx = 00000000 ecx = 00000000 edx = 00000000 eax in: 0x80000017, eax = 00000000 ebx = 00000000 ecx = 00000000 edx = 00000000 eax in: 0x80000018, eax = 00000000 ebx = 00000000 ecx = 00000000 edx = 00000000 Family: 15 Model: 35 Stepping: 2 CPU Model : Dual-Core Opteron/Athlon 64 X2 Dual-Core (JH-E6) Processor name string: AMD Athlon(tm) 64 X2 Dual Core Processor 4400+ Feature flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflsh mmx fxsr sse sse2 ht sse3 Extended feature flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 nx mmxext mmx fxsr ffxsr lm 3dnowext 3dnow lahf/sahf CmpLegacy /dev/cpu/0/msr: No such file or directory L1 Data TLB (2M/4M): Fully associative. 8 entries. L1 Instruction TLB (2M/4M): Fully associative. 8 entries. L1 Data TLB (4K): Fully associative. 32 entries. L1 Instruction TLB (4K): Fully associative. 32 entries. L1 Data cache: Size: 64Kb 2-way associative. lines per tag=1 line size=64 bytes. L1 Instruction cache: Size: 64Kb 2-way associative. lines per tag=1 line size=64 bytes. L2 Data TLB (2M/4M): Disabled. 0 entries. L2 Instruction TLB (2M/4M): Disabled. 0 entries. L2 Data TLB (4K): 4-way associative. 512 entries. L2 Instruction TLB (4K): 4-way associative. 512 entries. L2 cache: Size: 1024Kb 16-way associative. lines per tag=1 line size=64 bytes. PowerNOW! Technology information Available features: Temperature sensing diode present. 
Frequency ID control Voltage ID control Thermal Trip Couldn't read MSR 0xc0010041 Couldn't read MSR 0xc0010042 Something went wrong reading MSR_FID_VID_STATUS SVM: revision 0, 0 ASIDs Address Size: 48 bits virtual, 40 bits physical The physical package has 2 of 2 possible cores implemented. Connector type: Socket 939 MTRR registers: MTRRcap (0xfe): MTRRphysBase0 (0x200): MTRRphysMask0 (0x201): MTRRphysBase1 (0x202): MTRRphysMask1 (0x203): MTRRphysBase2 (0x204): MTRRphysMask2 (0x205): MTRRphysBase3 (0x206): MTRRphysMask3 (0x207): MTRRphysBase4 (0x208): MTRRphysMask4 (0x209): MTRRphysBase5 (0x20a): MTRRphysMask5 (0x20b): MTRRphysBase6 (0x20c): MTRRphysMask6 (0x20d): MTRRphysBase7 (0x20e): MTRRphysMask7 (0x20f): MTRRfix64K_00000 (0x250): MTRRfix16K_80000 (0x258): MTRRfix16K_A0000 (0x259): MTRRfix4K_C8000 (0x269): MTRRfix4K_D0000 0x26a: MTRRfix4K_D8000 0x26b: MTRRfix4K_E0000 0x26c: MTRRfix4K_E8000 0x26d: MTRRfix4K_F0000 0x26e: MTRRfix4K_F8000 0x26f: MTRRdefType (0x2ff): 519022.75GHz processor (estimate). 
-------------------------------------------------------------------------- CPU #2 eax in: 0x00000000, eax = 00000001 ebx = 68747541 ecx = 444d4163 edx = 69746e65 eax in: 0x00000001, eax = 00020f32 ebx = 01020800 ecx = 00000001 edx = 178bfbff eax in: 0x80000000, eax = 80000018 ebx = 68747541 ecx = 444d4163 edx = 69746e65 eax in: 0x80000001, eax = 00020f32 ebx = 00000156 ecx = 00000003 edx = e3d3fbff eax in: 0x80000002, eax = 20444d41 ebx = 6c687441 ecx = 74286e6f edx = 3620296d eax in: 0x80000003, eax = 32582034 ebx = 61754420 ecx = 6f43206c edx = 50206572 eax in: 0x80000004, eax = 65636f72 ebx = 726f7373 ecx = 30343420 edx = 00002b30 eax in: 0x80000005, eax = ff08ff08 ebx = ff20ff20 ecx = 40020140 edx = 40020140 eax in: 0x80000006, eax = 00000000 ebx = 42004200 ecx = 04008140 edx = 00000000 eax in: 0x80000007, eax = 00000000 ebx = 00000000 ecx = 00000000 edx = 0000000f eax in: 0x80000008, eax = 00003028 ebx = 00000000 ecx = 00000001 edx = 00000000 eax in: 0x80000009, eax = 00000000 ebx = 00000000 ecx = 00000000 edx = 00000000 eax in: 0x8000000a, eax = 00000000 ebx = 00000000 ecx = 00000000 edx = 00000000 eax in: 0x8000000b, eax = 00000000 ebx = 00000000 ecx = 00000000 edx = 00000000 eax in: 0x8000000c, eax = 00000000 ebx = 00000000 ecx = 00000000 edx = 00000000 eax in: 0x8000000d, eax = 00000000 ebx = 00000000 ecx = 00000000 edx = 00000000 eax in: 0x8000000e, eax = 00000000 ebx = 00000000 ecx = 00000000 edx = 00000000 eax in: 0x8000000f, eax = 00000000 ebx = 00000000 ecx = 00000000 edx = 00000000 eax in: 0x80000010, eax = 00000000 ebx = 00000000 ecx = 00000000 edx = 00000000 eax in: 0x80000011, eax = 00000000 ebx = 00000000 ecx = 00000000 edx = 00000000 eax in: 0x80000012, eax = 00000000 ebx = 00000000 ecx = 00000000 edx = 00000000 eax in: 0x80000013, eax = 00000000 ebx = 00000000 ecx = 00000000 edx = 00000000 eax in: 0x80000014, eax = 00000000 ebx = 00000000 ecx = 00000000 edx = 00000000 eax in: 0x80000015, eax = 00000000 ebx = 00000000 ecx = 00000000 edx = 
00000000 eax in: 0x80000016, eax = 00000000 ebx = 00000000 ecx = 00000000 edx = 00000000 eax in: 0x80000017, eax = 00000000 ebx = 00000000 ecx = 00000000 edx = 00000000 eax in: 0x80000018, eax = 00000000 ebx = 00000000 ecx = 00000000 edx = 00000000 Family: 15 Model: 35 Stepping: 2 CPU Model : Dual-Core Opteron/Athlon 64 X2 Dual-Core (JH-E6) Processor name string: AMD Athlon(tm) 64 X2 Dual Core Processor 4400+ Feature flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflsh mmx fxsr sse sse2 ht sse3 Extended feature flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 nx mmxext mmx fxsr ffxsr lm 3dnowext 3dnow lahf/sahf CmpLegacy L1 Data TLB (2M/4M): Fully associative. 8 entries. L1 Instruction TLB (2M/4M): Fully associative. 8 entries. L1 Data TLB (4K): Fully associative. 32 entries. L1 Instruction TLB (4K): Fully associative. 32 entries. L1 Data cache: Size: 64Kb 2-way associative. lines per tag=1 line size=64 bytes. L1 Instruction cache: Size: 64Kb 2-way associative. lines per tag=1 line size=64 bytes. L2 Data TLB (2M/4M): Disabled. 0 entries. L2 Instruction TLB (2M/4M): Disabled. 0 entries. L2 Data TLB (4K): 4-way associative. 512 entries. L2 Instruction TLB (4K): 4-way associative. 512 entries. L2 cache: Size: 1024Kb 16-way associative. lines per tag=1 line size=64 bytes. PowerNOW! Technology information Available features: Temperature sensing diode present. Frequency ID control Voltage ID control Thermal Trip Couldn't read MSR 0xc0010041 Couldn't read MSR 0xc0010042 Something went wrong reading MSR_FID_VID_STATUS SVM: revision 0, 0 ASIDs Address Size: 48 bits virtual, 40 bits physical The physical package has 2 of 2 possible cores implemented. 
Connector type: Socket 939 MTRR registers: MTRRcap (0xfe): MTRRphysBase0 (0x200): MTRRphysMask0 (0x201): MTRRphysBase1 (0x202): MTRRphysMask1 (0x203): MTRRphysBase2 (0x204): MTRRphysMask2 (0x205): MTRRphysBase3 (0x206): MTRRphysMask3 (0x207): MTRRphysBase4 (0x208): MTRRphysMask4 (0x209): MTRRphysBase5 (0x20a): MTRRphysMask5 (0x20b): MTRRphysBase6 (0x20c): MTRRphysMask6 (0x20d): MTRRphysBase7 (0x20e): MTRRphysMask7 (0x20f): MTRRfix64K_00000 (0x250): MTRRfix16K_80000 (0x258): MTRRfix16K_A0000 (0x259): MTRRfix4K_C8000 (0x269): MTRRfix4K_D0000 0x26a: MTRRfix4K_D8000 0x26b: MTRRfix4K_E0000 0x26c: MTRRfix4K_E8000 0x26d: MTRRfix4K_F0000 0x26e: MTRRfix4K_F8000 0x26f: MTRRdefType (0x2ff): 2584453.95GHz processor (estimate). --------------------------------------------------------------------------
sorry, wrong kernel
Just compiled a 2.6.27-rc9 kernel: root@raver:~# (perl -e 'sysseek(STDIN, 0xC001001F, 0)'; hexdump -n 8 -e '2/4 "%08x " "\n"') </dev/cpu/0/msr 0200000e 00400001 root@raver:~# (perl -e 'sysseek(STDIN, 0xC001001F, 0)'; hexdump -n 8 -e '2/4 "%08x " "\n"') </dev/cpu/1/msr 0200000e 00400001 root@raver:~# setpci -s 18.0 0x6a.b 20 root@raver:~# x86info -a x86info v1.21. Dave Jones 2001-2007 Feedback to <davej@redhat.com>. Found 2 CPUs MP Table: # APIC ID Version State Family Model Step Flags # 0 0x11 BSP, usable 15 35 2 0xfbff # 1 0x11 AP, usable 15 35 2 0xfbff -------------------------------------------------------------------------- CPU #1 eax in: 0x00000000, eax = 00000001 ebx = 68747541 ecx = 444d4163 edx = 69746e65 eax in: 0x00000001, eax = 00020f32 ebx = 00020800 ecx = 00000001 edx = 178bfbff eax in: 0x80000000, eax = 80000018 ebx = 68747541 ecx = 444d4163 edx = 69746e65 eax in: 0x80000001, eax = 00020f32 ebx = 00000156 ecx = 00000003 edx = e3d3fbff eax in: 0x80000002, eax = 20444d41 ebx = 6c687441 ecx = 74286e6f edx = 3620296d eax in: 0x80000003, eax = 32582034 ebx = 61754420 ecx = 6f43206c edx = 50206572 eax in: 0x80000004, eax = 65636f72 ebx = 726f7373 ecx = 30343420 edx = 00002b30 eax in: 0x80000005, eax = ff08ff08 ebx = ff20ff20 ecx = 40020140 edx = 40020140 eax in: 0x80000006, eax = 00000000 ebx = 42004200 ecx = 04008140 edx = 00000000 eax in: 0x80000007, eax = 00000000 ebx = 00000000 ecx = 00000000 edx = 0000000f eax in: 0x80000008, eax = 00003028 ebx = 00000000 ecx = 00000001 edx = 00000000 eax in: 0x80000009, eax = 00000000 ebx = 00000000 ecx = 00000000 edx = 00000000 eax in: 0x8000000a, eax = 00000000 ebx = 00000000 ecx = 00000000 edx = 00000000 eax in: 0x8000000b, eax = 00000000 ebx = 00000000 ecx = 00000000 edx = 00000000 eax in: 0x8000000c, eax = 00000000 ebx = 00000000 ecx = 00000000 edx = 00000000 eax in: 0x8000000d, eax = 00000000 ebx = 00000000 ecx = 00000000 edx = 00000000 eax in: 0x8000000e, eax = 00000000 ebx = 00000000 ecx = 00000000 edx = 
00000000 eax in: 0x8000000f, eax = 00000000 ebx = 00000000 ecx = 00000000 edx = 00000000 eax in: 0x80000010, eax = 00000000 ebx = 00000000 ecx = 00000000 edx = 00000000 eax in: 0x80000011, eax = 00000000 ebx = 00000000 ecx = 00000000 edx = 00000000 eax in: 0x80000012, eax = 00000000 ebx = 00000000 ecx = 00000000 edx = 00000000 eax in: 0x80000013, eax = 00000000 ebx = 00000000 ecx = 00000000 edx = 00000000 eax in: 0x80000014, eax = 00000000 ebx = 00000000 ecx = 00000000 edx = 00000000 eax in: 0x80000015, eax = 00000000 ebx = 00000000 ecx = 00000000 edx = 00000000 eax in: 0x80000016, eax = 00000000 ebx = 00000000 ecx = 00000000 edx = 00000000 eax in: 0x80000017, eax = 00000000 ebx = 00000000 ecx = 00000000 edx = 00000000 eax in: 0x80000018, eax = 00000000 ebx = 00000000 ecx = 00000000 edx = 00000000 Family: 15 Model: 35 Stepping: 2 CPU Model : Dual-Core Opteron/Athlon 64 X2 Dual-Core (JH-E6) Processor name string: AMD Athlon(tm) 64 X2 Dual Core Processor 4400+ Feature flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflsh mmx fxsr sse sse2 ht sse3 Extended feature flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 nx mmxext mmx fxsr ffxsr lm 3dnowext 3dnow lahf/sahf CmpLegacy Number of reporting banks : 5 MCG_CTL: Data cache check enabled ECC 1 bit error reporting enabled ECC multi bit error reporting enabled Data cache data parity enabled Data cache main tag parity enabled Data cache snoop tag parity enabled L1 TLB parity enabled L2 TLB parity enabled Instruction cache check enabled ECC 1 bit error reporting enabled ECC multi bit error reporting enabled Instruction cache data parity enabled IC main tag parity enabled IC snoop tag parity enabled L1 TLB parity enabled L2 TLB parity enabled Predecode array parity enabled Target selector parity enabled Read data error enabled Bus unit check enabled External L2 tag parity error enabled L2 partial tag parity error enabled System ECC TLB reload error enabled L2 ECC 
TLB reload error enabled L2 ECC K7 deallocate enabled L2 ECC probe deallocate enabled System datareaderror reporting enabled Load/Store unit check enabled Read data error enable (loads) enabled Read data error enable (stores) enabled 31 23 15 7 Bank: 0 (0x400) MC0CTL: 00000000 00000000 00000000 01111111 MC0STATUS: 00000000 00000000 00000000 00000000 MC0ADDR: 11111111 11111111 11111111 11111111 MC0MISC: 00000000 00000000 00000000 00000000 Bank: 1 (0x404) MC1CTL: 11111111 11111111 11111111 11111111 MC1STATUS: 00000000 00000000 00000000 00000000 MC1ADDR: 11111111 11111111 11111111 11111111 MC1MISC: 00000000 00000000 00000000 00000000 Bank: 2 (0x408) MC2CTL: 00000000 00001111 11111111 11111111 MC2STATUS: 00000000 00000000 00000000 00000000 MC2ADDR: 11111111 11111111 11111111 11111111 MC2MISC: 11111111 11111111 11111111 11111111 Bank: 3 (0x40c) MC3CTL: 00000000 00000000 00000000 00000111 MC3STATUS: 00000000 00000000 00000000 00000000 MC3ADDR: 11111111 11111111 11111111 11111111 MC3MISC: 00000000 00000000 00000000 00000000 Bank: 4 (0x410) MC4CTL: 00000000 00000000 00111111 11111111 MC4STATUS: 00000000 00000000 00000000 00000000 MC4ADDR: 11111111 11111111 11111111 11111111 MC4MISC: 00000000 00000000 00000000 00000000 L1 Data TLB (2M/4M): Fully associative. 8 entries. L1 Instruction TLB (2M/4M): Fully associative. 8 entries. L1 Data TLB (4K): Fully associative. 32 entries. L1 Instruction TLB (4K): Fully associative. 32 entries. L1 Data cache: Size: 64Kb 2-way associative. lines per tag=1 line size=64 bytes. L1 Instruction cache: Size: 64Kb 2-way associative. lines per tag=1 line size=64 bytes. L2 Data TLB (2M/4M): Disabled. 0 entries. L2 Instruction TLB (2M/4M): Disabled. 0 entries. L2 Data TLB (4K): 4-way associative. 512 entries. L2 Instruction TLB (4K): 4-way associative. 512 entries. L2 cache: Size: 1024Kb 16-way associative. lines per tag=1 line size=64 bytes. PowerNOW! Technology information Available features: Temperature sensing diode present. 
Frequency ID control Voltage ID control Thermal Trip MSR: 0xc0010041=0x0000000100001202 : 00000000 00000000 00000000 00000001 11111111 11111111 11111111 11111111 MSR: 0xc0010042=0x12060812060e0e02 : 00011111 11111111 11111111 11111111 11111111 11111111 11111111 11111111 Voltage ID codes: Maximum=1.400V Startup=1.350V Currently=1.100V Frequency ID codes: Maximum=11x Startup=11x Currently=5x SVM: revision 0, 0 ASIDs Address Size: 48 bits virtual, 40 bits physical The physical package has 2 of 2 possible cores implemented. Connector type: Socket 939 MTRR registers: MTRRcap (0xfe): 0x0000000000000508 MTRRphysBase0 (0x200): 0x0000000000000006 MTRRphysMask0 (0x201): 0x000000ff80000800 MTRRphysBase1 (0x202): 0x0000000080000006 MTRRphysMask1 (0x203): 0x000000ffc0000800 MTRRphysBase2 (0x204): 0x00000000d0000001 MTRRphysMask2 (0x205): 0x000000fff0000800 MTRRphysBase3 (0x206): 0x0000000000000000 MTRRphysMask3 (0x207): 0x0000000000000000 MTRRphysBase4 (0x208): 0x0000000000000000 MTRRphysMask4 (0x209): 0x0000000000000000 MTRRphysBase5 (0x20a): 0x0000000000000000 MTRRphysMask5 (0x20b): 0x0000000000000000 MTRRphysBase6 (0x20c): 0x0000000000000000 MTRRphysMask6 (0x20d): 0x0000000000000000 MTRRphysBase7 (0x20e): 0x0000000000000000 MTRRphysMask7 (0x20f): 0x0000000000000000 MTRRfix64K_00000 (0x250): 0x0606060606060606 MTRRfix16K_80000 (0x258): 0x0606060606060606 MTRRfix16K_A0000 (0x259): 0x0000000000000000 MTRRfix4K_C8000 (0x269): 0x0000000000000000 MTRRfix4K_D0000 0x26a: 0x0000000000000000 MTRRfix4K_D8000 0x26b: 0x0000000000000000 MTRRfix4K_E0000 0x26c: 0x0000000000000000 MTRRfix4K_E8000 0x26d: 0x0000000000000000 MTRRfix4K_F0000 0x26e: 0x0000000000000000 MTRRfix4K_F8000 0x26f: 0x0000000000000000 MTRRdefType (0x2ff): 0x0000000000000c00 1109130.20GHz processor (estimate). 
-------------------------------------------------------------------------- CPU #2 eax in: 0x00000000, eax = 00000001 ebx = 68747541 ecx = 444d4163 edx = 69746e65 eax in: 0x00000001, eax = 00020f32 ebx = 01020800 ecx = 00000001 edx = 178bfbff eax in: 0x80000000, eax = 80000018 ebx = 68747541 ecx = 444d4163 edx = 69746e65 eax in: 0x80000001, eax = 00020f32 ebx = 00000156 ecx = 00000003 edx = e3d3fbff eax in: 0x80000002, eax = 20444d41 ebx = 6c687441 ecx = 74286e6f edx = 3620296d eax in: 0x80000003, eax = 32582034 ebx = 61754420 ecx = 6f43206c edx = 50206572 eax in: 0x80000004, eax = 65636f72 ebx = 726f7373 ecx = 30343420 edx = 00002b30 eax in: 0x80000005, eax = ff08ff08 ebx = ff20ff20 ecx = 40020140 edx = 40020140 eax in: 0x80000006, eax = 00000000 ebx = 42004200 ecx = 04008140 edx = 00000000 eax in: 0x80000007, eax = 00000000 ebx = 00000000 ecx = 00000000 edx = 0000000f eax in: 0x80000008, eax = 00003028 ebx = 00000000 ecx = 00000001 edx = 00000000 eax in: 0x80000009, eax = 00000000 ebx = 00000000 ecx = 00000000 edx = 00000000 eax in: 0x8000000a, eax = 00000000 ebx = 00000000 ecx = 00000000 edx = 00000000 eax in: 0x8000000b, eax = 00000000 ebx = 00000000 ecx = 00000000 edx = 00000000 eax in: 0x8000000c, eax = 00000000 ebx = 00000000 ecx = 00000000 edx = 00000000 eax in: 0x8000000d, eax = 00000000 ebx = 00000000 ecx = 00000000 edx = 00000000 eax in: 0x8000000e, eax = 00000000 ebx = 00000000 ecx = 00000000 edx = 00000000 eax in: 0x8000000f, eax = 00000000 ebx = 00000000 ecx = 00000000 edx = 00000000 eax in: 0x80000010, eax = 00000000 ebx = 00000000 ecx = 00000000 edx = 00000000 eax in: 0x80000011, eax = 00000000 ebx = 00000000 ecx = 00000000 edx = 00000000 eax in: 0x80000012, eax = 00000000 ebx = 00000000 ecx = 00000000 edx = 00000000 eax in: 0x80000013, eax = 00000000 ebx = 00000000 ecx = 00000000 edx = 00000000 eax in: 0x80000014, eax = 00000000 ebx = 00000000 ecx = 00000000 edx = 00000000 eax in: 0x80000015, eax = 00000000 ebx = 00000000 ecx = 00000000 edx = 
00000000 eax in: 0x80000016, eax = 00000000 ebx = 00000000 ecx = 00000000 edx = 00000000 eax in: 0x80000017, eax = 00000000 ebx = 00000000 ecx = 00000000 edx = 00000000 eax in: 0x80000018, eax = 00000000 ebx = 00000000 ecx = 00000000 edx = 00000000 Family: 15 Model: 35 Stepping: 2 CPU Model : Dual-Core Opteron/Athlon 64 X2 Dual-Core (JH-E6) Processor name string: AMD Athlon(tm) 64 X2 Dual Core Processor 4400+ Feature flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflsh mmx fxsr sse sse2 ht sse3 Extended feature flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 nx mmxext mmx fxsr ffxsr lm 3dnowext 3dnow lahf/sahf CmpLegacy Number of reporting banks : 5 MCG_CTL: Data cache check enabled ECC 1 bit error reporting enabled ECC multi bit error reporting enabled Data cache data parity enabled Data cache main tag parity enabled Data cache snoop tag parity enabled L1 TLB parity enabled L2 TLB parity enabled Instruction cache check enabled ECC 1 bit error reporting enabled ECC multi bit error reporting enabled Instruction cache data parity enabled IC main tag parity enabled IC snoop tag parity enabled L1 TLB parity enabled L2 TLB parity enabled Predecode array parity enabled Target selector parity enabled Read data error enabled Bus unit check enabled External L2 tag parity error enabled L2 partial tag parity error enabled System ECC TLB reload error enabled L2 ECC TLB reload error enabled L2 ECC K7 deallocate enabled L2 ECC probe deallocate enabled System datareaderror reporting enabled Load/Store unit check enabled Read data error enable (loads) enabled Read data error enable (stores) enabled 31 23 15 7 Bank: 0 (0x400) MC0CTL: 00000000 00000000 00000000 01111111 MC0STATUS: 00000000 00000000 00000000 00000000 MC0ADDR: 11111111 11111111 11111111 11111111 MC0MISC: 00000000 00000000 00000000 00000000 Bank: 1 (0x404) MC1CTL: 11111111 11111111 11111111 11111111 MC1STATUS: 00000000 00000000 00000000 00000000 MC1ADDR: 
11111111 11111111 11111111 11111111 MC1MISC: 00000000 00000000 00000000 00000000 Bank: 2 (0x408) MC2CTL: 00000000 00001111 11111111 11111111 MC2STATUS: 00000000 00000000 00000000 00000000 MC2ADDR: 11111111 11111111 11111111 11111111 MC2MISC: 11111111 11111111 11111111 11111111 Bank: 3 (0x40c) MC3CTL: 00000000 00000000 00000000 00000111 MC3STATUS: 00000000 00000000 00000000 00000000 MC3ADDR: 11111111 11111111 11111111 11111111 MC3MISC: 00000000 00000000 00000000 00000000 Bank: 4 (0x410) MC4CTL: 00000000 00000000 00000000 00000000 MC4STATUS: 00000000 00000000 00000000 00000000 MC4ADDR: 00000000 00000000 00000000 00000000 MC4MISC: 00000000 00000000 00000000 00000000 L1 Data TLB (2M/4M): Fully associative. 8 entries. L1 Instruction TLB (2M/4M): Fully associative. 8 entries. L1 Data TLB (4K): Fully associative. 32 entries. L1 Instruction TLB (4K): Fully associative. 32 entries. L1 Data cache: Size: 64Kb 2-way associative. lines per tag=1 line size=64 bytes. L1 Instruction cache: Size: 64Kb 2-way associative. lines per tag=1 line size=64 bytes. L2 Data TLB (2M/4M): Disabled. 0 entries. L2 Instruction TLB (2M/4M): Disabled. 0 entries. L2 Data TLB (4K): 4-way associative. 512 entries. L2 Instruction TLB (4K): 4-way associative. 512 entries. L2 cache: Size: 1024Kb 16-way associative. lines per tag=1 line size=64 bytes. PowerNOW! Technology information Available features: Temperature sensing diode present. Frequency ID control Voltage ID control Thermal Trip MSR: 0xc0010041=0x0000000100001202 : 00000000 00000000 00000000 00000001 11111111 11111111 11111111 11111111 MSR: 0xc0010042=0x12060812060e0e02 : 00011111 11111111 11111111 11111111 11111111 11111111 11111111 11111111 Voltage ID codes: Maximum=1.400V Startup=1.350V Currently=1.100V Frequency ID codes: Maximum=11x Startup=11x Currently=5x SVM: revision 0, 0 ASIDs Address Size: 48 bits virtual, 40 bits physical The physical package has 2 of 2 possible cores implemented. 
Connector type: Socket 939 MTRR registers: MTRRcap (0xfe): 0x0000000000000508 MTRRphysBase0 (0x200): 0x0000000000000006 MTRRphysMask0 (0x201): 0x000000ff80000800 MTRRphysBase1 (0x202): 0x0000000080000006 MTRRphysMask1 (0x203): 0x000000ffc0000800 MTRRphysBase2 (0x204): 0x00000000d0000001 MTRRphysMask2 (0x205): 0x000000fff0000800 MTRRphysBase3 (0x206): 0x0000000000000000 MTRRphysMask3 (0x207): 0x0000000000000000 MTRRphysBase4 (0x208): 0x0000000000000000 MTRRphysMask4 (0x209): 0x0000000000000000 MTRRphysBase5 (0x20a): 0x0000000000000000 MTRRphysMask5 (0x20b): 0x0000000000000000 MTRRphysBase6 (0x20c): 0x0000000000000000 MTRRphysMask6 (0x20d): 0x0000000000000000 MTRRphysBase7 (0x20e): 0x0000000000000000 MTRRphysMask7 (0x20f): 0x0000000000000000 MTRRfix64K_00000 (0x250): 0x0606060606060606 MTRRfix16K_80000 (0x258): 0x0606060606060606 MTRRfix16K_A0000 (0x259): 0x0000000000000000 MTRRfix4K_C8000 (0x269): 0x0000000000000000 MTRRfix4K_D0000 0x26a: 0x0000000000000000 MTRRfix4K_D8000 0x26b: 0x0000000000000000 MTRRfix4K_E0000 0x26c: 0x0000000000000000 MTRRfix4K_E8000 0x26d: 0x0000000000000000 MTRRfix4K_F0000 0x26e: 0x0000000000000000 MTRRfix4K_F8000 0x26f: 0x0000000000000000 MTRRdefType (0x2ff): 0x0000000000000c00 2584449.70GHz processor (estimate). --------------------------------------------------------------------------
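The decoded "Family: 15 Model: 35 Stepping: 2" line in the x86info output above can be cross-checked against the raw CPUID leaf 1 value, eax = 00020f32. A minimal Python sketch of that decoding, assuming the usual AMD convention that the extended family and extended model fields are only folded in when the base family is 0xF (the function name is made up for illustration and is not part of x86info):

```python
def decode_cpuid_signature(eax):
    """Split a CPUID leaf 1 EAX value into (family, model, stepping)."""
    stepping = eax & 0xF
    base_model = (eax >> 4) & 0xF
    base_family = (eax >> 8) & 0xF
    ext_model = (eax >> 16) & 0xF
    ext_family = (eax >> 20) & 0xFF

    family, model = base_family, base_model
    if base_family == 0xF:
        # Extended fields only apply when the base family field is 0xF.
        family = base_family + ext_family
        model = (ext_model << 4) | base_model
    return family, model, stepping


# Value reported for both cores in the dump above.
print(decode_cpuid_signature(0x00020F32))
```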
I have added this initscript to my system:

root@raver:~# cat /etc/init.d/errata_fix
#!/bin/sh
. /lib/lsb/init-functions
log_begin_msg 'Applying AMD errata fix'
/usr/local/bin/wrmsr 0xc001001f 0x4000010200000e && /usr/bin/setpci -s 18.0 0x6a.b=0x20
log_end_msg $?
Regarding comment #92: This confirms that you have a socket 939 CPU, and it explains why your NB_CFG MSR (0xc001001f) looks slightly different from mine. My desktop has an AM2 socket. (Just another erratum fix which was only needed for socket 939/940 CPUs.) So the last interesting point is: does ath9k work when you have applied the errata fix, or do you still get MCEs? If it works, I take this as evidence that you really stumbled over the erratum 169 problem. Then the question is how to proceed. IMHO, it's best to add the erratum workaround to the Linux kernel, as your mobo vendor obviously doesn't provide you with a newer BIOS. And maybe there are other systems like yours that potentially suffer from this CPU erratum.
It hangs too with the same MCE
That's unfortunate. (I assume you have double-checked that the workaround was applied.) But we have one more thing to try: applying the workaround for erratum 131 instead of the one for erratum 169. (The revision guide says that erratum 169 has superseded 131, but who knows.) That would mean you have to change your init script like this (this time without setpci fiddling):

#!/bin/sh
. /lib/lsb/init-functions
log_begin_msg 'Applying AMD errata fix'
/usr/local/bin/wrmsr 0xc001001f 0x4000000210000e
log_end_msg $?

It would be great if you could verify that last idea (the last one that I have at the moment ;-( ). Thanks!
no setpci at all?
No, not at all. The workarounds are:
(1) Erratum 131: set bit 20 of NB_CFG
(2) Erratum 169: set bit 32 of NB_CFG and set F0x68[22:21] to 01b (i.e. bit 22=0, bit 21=1 in device 18.0 offset 0x68)
You have tested (2), which did not work. Now I'd like to see explicitly whether (1) helps.
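The two workarounds differ only in a few bits. As a rough Python sketch (the helper names are hypothetical and not taken from any tool used in this report): starting from the NB_CFG value written by the erratum 169 script, clearing bit 32 and setting bit 20 gives the value used in the erratum 131 script, and F0x68[22:21] = 01b corresponds to 0x20 in the byte at offset 0x6a, assuming the other bits of that byte are zero.

```python
def set_bit(value, bit):
    """Return value with the given bit set."""
    return value | (1 << bit)


def clear_bit(value, bit):
    """Return value with the given bit cleared."""
    return value & ~(1 << bit)


# NB_CFG value written by the erratum 169 init script (bit 32 set).
nb_cfg_169 = 0x4000010200000E

# Erratum 131 test: drop bit 32 again and set bit 20 instead.
nb_cfg_131 = set_bit(clear_bit(nb_cfg_169, 32), 20)
print(hex(nb_cfg_131))  # 0x4000000210000e

# F0x68[22:21] = 01b means bit 21 set and bit 22 clear in the dword at 0x68.
# Byte 0x6a holds bits 23:16 of that dword, so bit 21 is bit 5 of the byte.
f0x6a_byte = (0b01 << 21) >> 16
print(hex(f0x6a_byte))  # 0x20
```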
Same MCE, all hangs. Thanks anyway :(
Quite disenchanting. But at least we know it was none of the two errata. I'll open an internal bug report for your issue. (But I don't know how much attention this will get ...) Thanks for your patience and testing. Andreas
Can you tell me which motherboard this is, or which add-in card? I figure it must be an add-in card, as this is an oldish CPU/motherboard with a new wireless card. I fudged the MCE report you printed out to set Status to 4 and MCGStatus to b200000000070f0f and got the report:

HARDWARE ERROR. This is *NOT* a software problem!
Please contact your hardware vendor
CPU 0 4 northbridge TSC 6b0b27b740 (upper bound, found by polled driver)
Northbridge RAM ECC error
ECC syndrome = 0
STATUS 4 MCGSTATUS b200000000070f0f

But this doesn't make sense, because you don't have ECC memory, right? Have you run memtest overnight just to check the hardware? If not, please do. We need to capture the real MCE and decode it. If the MCEs are what we see, then it would not be one of the errata Andreas found, as those are Watchdog Timer MCE related.
Joachim, please see comment #80 and attachment #18158 [details]. (When I decode the MCE I see the same Watchdog error as in comment #80.)
Sorry, disregard my ramblings in comment #101; my local mcelog must be broken, and I was trying too hard to coax it into giving me a correct decoding.
My motherboard is an Abit AN8 SLI with the nForce4 (CK804) chipset. The wireless card is a PCI Atheros 802.11n device, quite new compared to the motherboard. memtest86 gives TONS of errors, so I assume memtest86 itself is broken: the system has been running for months without issues, while memtest86 reports errors immediately and then crashes.
IIRC memtest86 started to give errors when I added 2GB (2x1GB) of RAM to the 1GB (2x512MB) already installed. My RAM is pretty good though: the 512MB modules are OCZ Gold and the 1GB modules are OCZ Platinum, both CAS 2. I ran a userspace tool for almost 12 hours, http://pyropus.ca/software/memtester/, and got no problems at all. Usually when there is bad RAM the first thing I notice is GCC segfaulting, whereas I cross-compile software for MIPS, AVR32 and ARM most of the time without problems.
> memtest86 gives TONS of errors Matteo, that's an interesting point. Can you please remove the 2x1GB of RAM (which cause your memtest86 errors) and redo your ath9k driver test.
I knew that memtest was broken; the latest one was fixed here: https://qa.mandriva.com/42679 and it works fine even with all 4 banks.
Any other ideas to help the user out?
I've asked for the bug to be re-opened (not sure if I should have cloned it?) because I am able to reproduce this user's problem on my system, with quite a different configuration. My system has been freezing shortly after starting wireless (anywhere from immediately to an hour, depending on what I'm doing; almost always within a minute if I start ktorrent). I've been trying to solve it for weeks, and isolated it down to use of the ath9k driver, as this user has done, and setting maxcpus=1 also resolves the hangs on my system. I'm actually using the system now, which I never could have done before (less one core, of course). Note: this doesn't necessarily eliminate a hardware issue, as this system is only a couple of weeks old and I haven't run any other OS on it. However, given it's a different CPU, motherboard, etc., I thought it was worth re-visiting whether this is really a hardware problem or not.

kernel: 2.6.27-7-generic
dist: kubuntu 8.10rc1 with all patches
CPU: AMD Athlon(tm) 64 X2 Dual Core Processor 6000+
Memory: 4GB
Motherboard: Asus M3N78-VM
NIC: D-Link DWA-552 (Atheros 5416)

I'll attach my dmesg and other system configuration info. I'm not a Linux guru, but I'm happy to help in any way. If you want me to try something, please give me very specific instructions. BTW, I tried the two errata checks:

# ( perl -e 'sysseek(STDIN, 0xC001001F, 0)'; \
    hexdump -n 8 -e '2/4 "%08x " "\n"' ) < /dev/cpu/0/msr
# ( perl -e 'sysseek(STDIN, 0xC001001F, 0)'; \
    hexdump -n 8 -e '2/4 "%08x " "\n"' ) < /dev/cpu/1/msr

But the device doesn't exist:

bash: /dev/cpu/0/msr: No such file or directory

(Maybe because I'm currently running with maxcpus=1?)
Created attachment 18519 [details] dmesg output (maxcpus=1) Made while running with maxcpus=1, let me know if you want me to re-run with the boot parameter removed.
Created attachment 18520 [details] lspci -vvxxxx output (while running with maxcpus=1)
Regardless of whether you try with maxcpus=1 or not, the oops message printed *is a hardware issue*. This means that no matter how hard we work on the driver code, the oops message generated came *from a hardware issue*, not a software driver issue. So this is a *hardware issue*. Andreas or Joachim, any other ideas for these users to test? It seems it's happening on two different CPUs:
AMD Athlon64 X2, nForce4 motherboard
AMD Athlon(tm) 64 X2 Dual Core Processor 6000+
I haven't managed to isolate an oops message on my config. When the freeze happens I'm in the desktop and the screen goes blank: no toggling to a pseudo tty, no restarting X, no toggling numlock on/off. No trace of any error messages later that I can find. So maybe it's a hardware issue on my system too, but we don't really know that. And if it's a hardware issue, why does it only show up when ath9k is being used? The on-board ethernet is solid.
If you do not have an oops message how can you know it is the same issue? This bug report is for an oops message which is produced by a hardware issue. As to why it happens with ath9k and not without it -- the answer should lie in the mcelog dump information. But this is still a hardware issue, not a software issue, the CPU detects a machine check exception.
If you can tell me how to capture an oops message at the time the system freezes, I'll gladly try to capture it. The screen goes a solid colour and no other response can be obtained from the system. The commonality in the problem is, I believe, that once a wireless interface using the ath9k driver is initialized and associated, the system will freeze, either immediately or after some amount of network access. If the wireless interface is not initialized (using netconfig/iwconfig) then the system is rock solid, and can even be used with a wired interface. Setting maxcpus=1 avoids the freeze, and the system is rock solid with the wireless interface. (There was a spinlock issue in the ath9k driver which resulted in systems freezing, but my understanding is that the fix for this is already in the kernel I'm running.) The root cause may be hardware, or it may not. It is certainly a kernel/hardware interaction, because it is easily and reproducibly initiated from software. The fact that we have two quite different configs with identical behaviour (apart from the oops) suggests to me that it merits additional investigation. So, my system is solid now, using 802.11n with great throughput on a single core. I'm typing this on it now with ktorrent running in the background. If I reboot with the second core enabled, associate to my AP, and start ktorrent, my system will freeze within a minute. (If I don't run ktorrent, it'll still freeze; it'll just take longer.) Perhaps the bug is miscategorized?
I'm absolutely certain that this issue should be looked into. What I am saying is just that I cannot help you with it in ath9k at the moment, as the issue reported in *this* bug report was caused by an MCE panic, which is raised by the CPU. The kernel is just informed of it, and since it's an exception the kernel knows better than to continue, otherwise the hardware may be badly affected, so the kernel oopses. To be clear, we are now talking about two separate issues in this bug report. The first one is the one I describe above. In that case the user rebooted with this added to his kernel parameters:

mce=3

This forces the kernel not to panic even if it gets MCEs while in kernel context, where a deadlock is possible. This is dangerous. The more subtle way to try this is to use

mce=bootlog

bootlog seems to be disabled by default on AMD CPUs because some buggy BIOSes are known to leave bogus logs. You can still try this first, though. Additionally you may try

mce=off

to disable MCE, but I'm not sure this is a good idea. Then there is your "issue", which you say is the same, but you are not providing an oops for it; you are just saying it hangs with ath9k. What would help is to get your oops and/or mcelog output converted to ASCII.
I provided instructions on how to do this above in this bug report, but here they are again. Usually you would have mcelog run from cron, perhaps hourly, as follows:

#!/bin/bash
/usr/sbin/mcelog --ignorenodev --filter >> /var/log/mcelog

But since you are eager to find this and you can trigger it, run it in a loop every 5 seconds instead:

root@deathMCEbox:# while true; do /usr/sbin/mcelog --filter >> /var/log/mcelog; sleep 5; done

Then make the mcelog human readable by running:

mcelog --ascii < mcelog > mcelog.txt

Again, here are the two cases reported in this bug report so far:

== Case 1 ==
CPU: Athlon64 X2
Motherboard: nForce4
* oops captured
* mcelog ascii provided
* two errata workarounds applied but issue still occurs
* this is a hardware issue
* no fix has been provided yet
----------------------------------------------------
== Case 2 ==
CPU: AMD Athlon(tm) 64 X2 Dual Core Processor 6000+
Motherboard: Asus M3N78-VM
* no further information provided
----------------------------------------------------
BTW, can the two users who are currently seeing issues with this report back to us:
* Motherboard model
* BIOS and BIOS revision number
* lspci -vn and lspci -vvv
We would basically like to get the same motherboard and BIOS in house to recreate this issue with PCIe link analysis equipment in place to determine the root cause.
Jonathan/Luis, Motherboard: Asus M3N78-VM http://www.asus.com/products.aspx?l1=3&l2=149&l3=643&l4=0&model=2268&modelmenu=1 BIOS: Ver. 0804 (Dated 10/15/2008) - latest available from Asus web site Processor: AMD Athlon 64 X2 Dual Core 6000+ Memory: 4GB (2x OCZ 2G DDR2 PC2 6400 ) Peripherals: Seagate 1TB SATA II hard drive (AHCI mode, ST31000340AS), D-Link DWA-552 The motherboard is a current model. Kernel is 2.6.27-7-generic. I boot with iommu=noaperture,memaper=3 maxcpus=1 I installed using kubuntu 8.10 beta (amd64) alternate CD. All updates applied. Video is Integrated NVIDIA GeForce 8200 GPU. I am using Nvidia accelerated graphics driver 177.80 (October 7, 2008) from their web site: http://www.nvidia.com/object/linux_display_amd64_177.80.html (Using VGA, not HDMI) lspci output to follow.
I have a completely different video subsystem: ATI 4870 with free drivers. Phil: if you want to see the error, switch to a text terminal while keeping the network active. You can keep the network busy with something like:

while true; do curl http://kernel.org/pub/linux/kernel/v2.6/linux-2.6.27.tar.gz > /dev/null; done
Also, to get the msr device and some verbose x86info output:

sudo modprobe msr
sudo modprobe cpuid
Created attachment 18539 [details] output from lspci -vvv (both CPU cores enabled) Booted with both CPU cores enabled (no maxcpus boot parameter) and dumped output prior to initializing the network.
Created attachment 18540 [details] output from lspci -vn (both CPU cores enabled) Output from lspci -vn after booting with both cores enabled.
Created attachment 18541 [details] output from dmesg (both cores enabled) This was dumped after initializing the wireless network, with both cores enabled. The system hung about 2 seconds after dumping. As an aside, it was necessary to turn off Wireless QoS on the router in order to be able to associate at all.
Phil, can you paste the output of these commands?

modprobe msr
modprobe cpuid
(perl -e 'sysseek(STDIN, 0xC001001F, 0)'; hexdump -n 8 -e '2/4 "%08x " "\n"') </dev/cpu/0/msr
(perl -e 'sysseek(STDIN, 0xC001001F, 0)'; hexdump -n 8 -e '2/4 "%08x " "\n"') </dev/cpu/1/msr
setpci -s 18.0 0x68.l
Hi Matteo, Output is as follows:

root@media:~# modprobe msr
root@media:~# modprobe cpuid
root@media:~# (perl -e 'sysseek(STDIN, 0xC001001F, 0)'; hexdump -n 8 -e '2/4 "%08x " "\n"') </dev/cpu/0/msr
00000008 00400001
root@media:~# (perl -e 'sysseek(STDIN, 0xC001001F, 0)'; hexdump -n 8 -e '2/4 "%08x " "\n"') </dev/cpu/1/msr
00000008 00400001
root@media:~# setpci -s 18.0 0x68.l
0f2ec820
root@media:~#
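To make the values above easier to read, here is a small Python sketch with hypothetical helper names. It assumes that the hexdump format string prints the low 32-bit word of the MSR first and the high word second (little-endian x86). It rebuilds the 64-bit NB_CFG value and extracts the bits named in the workaround descriptions earlier in this report:

```python
def msr_from_hexdump(line):
    """Combine 'low high' 32-bit hex words into one 64-bit value."""
    low, high = (int(word, 16) for word in line.split())
    return (high << 32) | low


def bit(value, n):
    """Return bit n of value as 0 or 1."""
    return (value >> n) & 1


def field(value, hi, lo):
    """Extract bits hi..lo (inclusive) from value."""
    return (value >> lo) & ((1 << (hi - lo + 1)) - 1)


nb_cfg = msr_from_hexdump("00000008 00400001")
f0x68 = int("0f2ec820", 16)

print(hex(nb_cfg))                       # 0x40000100000008
print(bit(nb_cfg, 32), bit(nb_cfg, 20))  # 1 0
print(bin(field(f0x68, 22, 21)))         # 0b1
```

Whether these readings mean the erratum applies is a separate question; the sketch only shows which bits are set.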
Phil: your system seems to be affected by erratum 169. Andreas: can you confirm it? Can't you build a similar system and try to find the error? My CPU is old, but Phil's is pretty new, so the bug is more severe now.
Andreas, assuming Matteo is correct, and I am (also?) suffering from erratum 169:
1) I should request a BIOS fix from my motherboard vendor (Asus), which might be possible given it's a recent mobo.
2) I can try to poke the right bits to effect the workaround in Linux, but the values to do this may differ from Matteo's?
3) Given that the hardware is all less than a month old, and I can probably swap it, could I avoid erratum 169 by upgrading the CPU to a newer model (such as a quad core)? I'd rather pay for an upgraded CPU than swap the mobo.
Thanks for your advice.
I get the same hang while bringing the interface up under Windows, what a coincidence. It's the first time I have tried to go online with that OS, so it may be an Atheros HW issue, not an AMD one.
Hi Matteo, Thanks very much for trying to re-create the hang. I've been able to trigger the freeze while on the console screen, and the screen just goes blank, with no kernel message at all. I'm not sure what else I can do to try to capture some useful information. I don't believe I suffer from erratum 169. According to Andreas' earlier comment (#85), for 169 I would expect to see this bit set:

NB_CFG: ******** *******1

That bit is set for me:

# (perl -e 'sysseek(STDIN, 0xC001001F, 0)'; hexdump -n 8 -e '2/4 "%08x " "\n"') </dev/cpu/0/msr
00000008 00400001

I have submitted a support request to Asus. I'm at a bit of a loss as to how to go about getting 802.11n working on my system. My understanding is that ath9k is the only driver supporting 802.11n? Do I try exchanging the board for a DWA-556, the PCIe x1 version of the same board? (I only have one PCIe x1 slot, and would like to use it for a TV tuner card at some point.) Do I switch to NDISwrapper and satisfy myself with 802.11g? Do I switch to an external router and try to run it in bridge mode? It seems very strange that we both run into problems on different hardware when the second core is enabled.
It could be a HW error related to the wifi card; I'd wait for a response from Atheros.
At this point assumptions are being made; please don't make assumptions. We've already determined the issue here to be a CPU MCE. We cannot do anything about it and need involvement from AMD. If it's an issue with the CPU, we can potentially detect it somewhere in ath9k or the kernel and implement the fix/workaround there. But first we need to iron out why this occurs.
Hi Luis, As I said in comments 130, 113, and 115, I do not get an MCE. I do not get a kernel message of any type. My system does not seem to be suffering from erratum 169. There's nothing that categorically points to a hardware error. The commonality seems to be that both Matteo and I have system freezes with a second CPU core enabled and the ath9k driver. That's why I was asking for suggestions as to how to go about isolating the problem. I guess my next step is to re-compile ath9k with debugging. Would you prefer I open a new bug? I thought there was value in keeping the two issues associated, but perhaps it's confusing things more than it's helping. Phil
Ok sorry for bugging your hardware, but if I and Phil have the same error with two different CPUs and the same card there _could_ be something wrong with the card _or_ with that card combined with such two CPUs
I've downloaded compat-wireless-2008-11-17 and compiled with ATH_DBG_ANY. I did a "make install" and "make load", then (after doing a modprobe -r ath9k because mac80211 and another module were busy) another "make load" and a reboot. I believe I got all the latest drivers installed (how do I verify the version of a loaded driver?) but I can't figure out where the debug output goes. How do I find/capture the debug output? It doesn't seem to show up on the Ctrl-Alt-F8 console screen. Anyway, the new drivers don't resolve the problem.
Never mind... I found it.... as Matteo says, an insane amount of output.
I've spent considerable time looking into this. I no longer believe this is the ath9k driver. I was able to run the ath9k driver with all debug enabled and provoke the freeze without any error messages from the driver. The problem appears to be related to interrupt handling (and possibly CPU frequency power management) on multi-core AMD64 CPUs with NVIDIA drivers. It occurs over multiple releases of the kernel and the video drivers, different (AMD64 multi-core) CPUs, different motherboards, and different NVIDIA chipsets. It appears to affect quite a large number of people. The common denominators are AMD64, multi-core, NVIDIA. The easiest way to diagnose the problem is that the system will not freeze with maxcpus=1. Some people have found other ways around the problem by restricting interrupt handling to the first core. The reason this seemed to be related to networking (and ath9k, particularly when I run something like bittorrent) is that the network card is a great way to generate lots of interrupts when under load. Thanks to those who tried to help; you've certainly helped to isolate it. There are quite a few bug reports, distribution and kernel, related to this which haven't identified the root cause. I'm going to try to summarize my findings, and see if I can pull some of those bugs together into one (possibly 10453). http://bugzilla.kernel.org/show_bug.cgi?id=10453 [Andreas, are you able to provide any guidance as to how to go about getting this investigated further?]
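For reference, restricting interrupt handling to the first core is normally done by writing a hexadecimal CPU bitmask to /proc/irq/<irq>/smp_affinity. A small Python sketch of how such a mask is built; the helper name is hypothetical and no IRQ number from this report is implied:

```python
def affinity_mask(cpus):
    """Return the hex bitmask string for a collection of CPU numbers."""
    mask = 0
    for cpu in cpus:
        mask |= 1 << cpu
    return format(mask, "x")


print(affinity_mask([0]))     # 1  (first core only)
print(affinity_mask([0, 1]))  # 3  (both cores of a dual-core CPU)
```

The resulting string would be written as root to the smp_affinity file of the interrupt in question. A running irqbalance daemon may overwrite the setting again, so this is only a sketch of the idea, not a tested workaround for this bug.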
My new computer just arrived, an Intel Core i7: quad core with hyper-threading, so I have 8 logical cores. Well, when bringing the interface up the system freezes again, but booting with maxcpus=0 fixes it. So it seems to be an issue with the Atheros card, as I'm now using a Realtek rtl8187 which works just fine. I won't keep my 8 core PC in single core mode. I hope the bug is fixed soon, or I'll trash my Atheros card, as either the driver is still unstable or the card is just crap hardware.
Also, many thanks to the AMD staff for their support and sorry for bugging their products
Did you get an oops message from when the system hung? Did you get an MCE panic? What kernel are you using now? Have you tried the latest from wireless-testing / compat-wireless?
The Intel gives no MCE or panic, it just freezes. I'm using a vanilla 2.6.28; I've never tried wireless-testing or compat-wireless.
Please try to see if you can capture a panic. You can first use a virtual terminal and see if the panic shows up there. You can also try to use netconsole: Documentation/networking/netconsole.txt You can also feel free to try the latest bleeding edge wireless drivers and stack from the wireless-testing tree. This guide shows you how to get it: http://wireless.kernel.org/en/developers/Documentation/git-guide It's basically an entire kernel. Alternatively you can also just get the latest and greatest wireless components only and compile them separately from here: http://wireless.kernel.org/en/users/Download
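In case it helps with the netconsole suggestion: the module takes one parameter string of the form src-port@src-ip/dev,tgt-port@tgt-ip/tgt-mac, as described in Documentation/networking/netconsole.txt. A minimal sketch that only assembles that string follows. The helper name netconsole_param and every address in the usage comment are placeholders, not values from this report.

```shell
# Hypothetical helper: build the netconsole= parameter string.
# Usage: netconsole_param SRC_PORT SRC_IP DEV TGT_PORT TGT_IP TGT_MAC
netconsole_param() {
    printf '%s@%s/%s,%s@%s/%s\n' "$1" "$2" "$3" "$4" "$5" "$6"
}

# Example with placeholder addresses, to be run as root on the crashing box:
#   modprobe netconsole "netconsole=$(netconsole_param 6665 192.168.1.10 eth0 6666 192.168.1.20 00:11:22:33:44:55)"
```

On the receiving machine a UDP listener on the target port (netcat, or a syslog daemon accepting remote messages) collects the output; the exact netcat flags differ between variants. Raising the console log level first, for example with "dmesg -n 8", helps make sure all messages are sent.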
Hi Matteo, I've come to the conclusion on my system that it's a hardware problem caused by the network card. I've tried multiple kernels, including the latest stable. I've configured a serial console to try to capture any kernel oops or panic messages... there aren't any. Nothing in the MCE log. At one point I re-compiled the wireless drivers with all debugging..... no messages prior to the freeze. At the same time, the freeze only appears with multiple CPUs enabled under network load. Slightly irksome with only two cores.... I can understand why you wouldn't want 7 cores sitting idle. I've tested pulling the network card and using the on-board LAN.... no crashes under multi-core. I can't seem to find any other PCI card that is supported under Linux with native 802.11n support (there's a PCI Express version of the same card that might be worth trying, but I only have 1 PCI Express slot that I plan to put a TV capture card into.) My plan at the moment is to go with a d-link DAP-1522 wireless access point/bridge that will plug into the on-board ethernet, and find a new home for the PCI card. If I thought there was more that I could do to help find the cause of the lock-up, and that a work-around might be possible in software, I might stick with it. However, I think the chances of fixing it in software might be slim, and my kernel debugging skills might not be up to the challenge. I might give one more try with the latest wireless drivers suggested by Luis, but my suspicion is that the problem is in the hardware, specifically the d-link card. (Googling, I noticed that similar lock-ups were observed under Windows 64-bit early on in the card's life-cycle, but this appeared to be cleared up by new Windows drivers being issued.)
Seems to be fixed with the kernel 2.6.29, can anyone confirm it?
(In reply to comment #144) > Seems to be fixed with the kernel 2.6.29, can anyone confirm it? Hey, great news, Matteo. You're able to run with heavy network load with 2 cores enabled? I'm downloading 2.6.29 stable now, and will try it later and let you know. I'm still running with maxcpus=1. Phil
(In reply to comment #145)
> (In reply to comment #144)
> > Seems to be fixed with the kernel 2.6.29, can anyone confirm it?
> Hey, great news, Matteo. You're able to run with heavy network load with 2 cores enabled?
> I'm downloading 2.6.29 stable now, and will try it later and let you know. I'm still running with maxcpus=1.
> Phil
root@raver:~# uname -a
Linux raver 2.6.29 #2 SMP PREEMPT Tue Mar 24 15:25:36 CET 2009 x86_64 GNU/Linux
root@raver:~# fgrep processor /proc/cpuinfo |wc -l
8
root@raver:~# uptime
01:33:55 up 3:23, 2 users, load average: 0.06, 0.08, 0.11
root@raver:~# ifconfig wlan0 |grep 'Byte RX'
Byte RX:1658605233 (1.6 GB) Byte TX:47759274 (47.7 MB)
root@raver:~#
Just wanted to point out that if you are not seeing issues on 2.6.29 then you will also see no issues in the upcoming releases of the 2.6.27 and 2.6.28 kernels, if the patches I sent make it in by then. The change I expect to have helped is the serialization of IO. While the issue may be fixed, the MCE exception can theoretically still be triggered, but if you're not seeing it then great :) Those who are stuck on older kernels and cannot wait until the serialization patches get merged can use compat-wireless or the respective ported patches for their kernel, either for 2.6.27 or for 2.6.28: http://www.kernel.org/pub/linux/kernel/people/mcgrof/patches/ath9k/2009-03-12/serialization-v6.tar.bz2
Hi Matteo and Luis, I can confirm that 2.6.29 has resolved my issue and I'm able to run with both cores under heavy network load with ath9k and no hanging.... root@media:~# uname -a Linux media 2.6.29-020629-generic #020629 SMP Tue Mar 24 11:23:53 UTC 2009 x86_64 GNU/Linux Luis, if your IO serialization changes fixed it.... THANKS! It's very nice to finally be able to use my second core! I've invested hundreds of hours into trying to track the problem down. (In reply to comment #144) > Seems to be fixed with the kernel 2.6.29, can anyone confirm it?
Can someone close this bug?
Sure, after 150 comments :)
isn't it closed already?