Latest working kernel version: Don't know of any. Earliest failing kernel version: Tried with linux-image-generic and a previous version of the server kernel. (Can't remember the exact version off the top of my head, but can investigate if anyone wants to know.) Distribution: Ubuntu 8.04.1 Hardy. Hardware Environment: ASUS M3A-H/HDMI, 4GB RAM, AMD PHENOM X3 8750, 4xSATA, 256MB Radeon 2400 PRO PCI Express. Software Environment: Ubuntu 64bit server edition. Using the latest available bios update (Version 0801, dated 2008/07/29). Problem Description: When not using acpi=off, the system goes straight into kernel panic. This occurs on or after installation. Steps to reproduce: Use the same hardware set-up as me. Boot into ubuntu.
I just noticed that a newer version of the bios for my motherboard is now available (0803) which states: 0803 BIOS for M3A-H/HDMI 1. Resolve failing to install new VGA driver problem. 2. Resolve failing to install 64bit OS after RAID is built. I am using software raid, rather than hardware raid via the bios. Nevertheless, it is possible that this update might cure my problem. I'll try out the new bios version and report back my findings here.
Will you please attach the output of acpidump, lspci -vvvxxx? thanks.
Will you please capture the screenshot of kernel panic with ACPI enabled? thanks.
Created attachment 18197 [details] The output of lspci -vvvxxx This is with bios Version 0801.
Created attachment 18198 [details] photo of kernel panic screen after booting without using acpi=off This is with bios Version 0801. Let me know if you want me to email you a full quality RGB colour version.
Created attachment 18203 [details] photo of kernel panic screen after booting without using acpi=off (bios version 0803) This is with bios Version 0801.
Created attachment 18204 [details] The output of lspci -vvvxxx (bios version 0803) This is with bios Version 0803.
So, I have now upgraded the bios to the latest version (0803). This has not cured the kernel panic problem however. I have attached kernel panic screenshots and lspci -vvvxxx output for both versions 0801 and 0803 of the bios. I made a mistake when describing one of the attachments. Comment #6 should end by saying: This is with bios Version 0803. I couldn't figure out any way of correcting this directly. Cheers.
How about the boot option max_cpus=1? Or can you try the latest base kernel, since you are using a 2.6.24 kernel?
kernel version 2.6.24 is the latest offered to me using aptitude. Please let me know how to access and install the latest base kernel (in a way which allows me to roll back to my current version if it doesn't work.) I'll try max_cpus=1 this evening, UK time. Thanks.
I tried booting with max_cpus=1 (and without acpi=off). The system went straight into kernel panic as before. Looking at the screen output, the address of the paging request is slightly different to the address which appears when not specifying max_cpus=1, but otherwise is the same.
Please try "maxcpus=1" (for max_cpus is not actually a kernel parameter) or equally "nosmp". For the crash is in a kfree of a per-cpu data structure that will likely not occur on a uniprocessor config. also please try "acpi=ht", which will do this MP part of ACPI, but no other part. If it still fails there, then we know this is strictly a SMP configuration issue and not a core ACPI issue. How many cores does this hardware have, and how many processors is the kernel config'd for? (grep CONFIG_NR_CPUS .config) If under 32, try upping to 32. Please attach the output from dmesg -s64000 from a kernel that boots successfully.
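Since a misspelled parameter such as max_cpus does not limit the number of CPUs, it can help to confirm what the kernel was actually booted with. The following is only a minimal sketch in Python: the function name parse_cmdline and the example line are made up for illustration, and on a running system the real command line would be read from /proc/cmdline.

```python
def parse_cmdline(cmdline):
    """Split a kernel command line into {name: value}; bare flags map to None."""
    params = {}
    for token in cmdline.split():
        name, sep, value = token.partition("=")
        params[name] = value if sep else None
    return params


# Hypothetical example line, not taken from the reporter's machine.
example = "root=/dev/md0 ro quiet maxcpus=1"
params = parse_cmdline(example)

print("maxcpus:", params.get("maxcpus"))        # value passed to maxcpus, if any
print("has max_cpus:", "max_cpus" in params)    # the misspelled form tried earlier
print("has nosmp:", "nosmp" in params)
```

This simple split does not handle quoted values containing spaces, which is enough for the short parameters discussed here.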
It's got an AMD PHENOM X3 triple core processor. I've done a "sudo find . -type f -name .config" from the root directory, but I haven't found any files called .config. dmesg output to come...
Created attachment 18221 [details] dmesg -s64000 output (2.6.24-19-server kernel, acpi=off)
The system boots with maxcpus=1. No kernel panic. Now about to try with acpi=ht (but no maxcpus=1)
I can also confirm that the system successfully boots when using the acpi=ht boot param.
> The system boots with maxcpus=1. No kernel panic. please attach the dmesg from this successful ACPI mode boot. also, please attach the output from acpidump -- doesn't matter how you boot to get it. if you are running a distro kernel rather than one you build, your config file is probably in /boot/config*, > successfully boots when using the acpi=ht okay, that one is pretty much the same as your acpi=off boot, which successfully booted all 3 cores.
acpidump wasn't installed, so I installed it. Attachment coming up...
Created attachment 18225 [details] Output of acpidump
According to /boot/config-2.6.24-19-server: CONFIG_NR_CPUS=64
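For anyone else checking the same thing on a distro kernel, here is a rough Python sketch of looking up an option in a kernel config file. The helper name read_kernel_config_option and the sample text are assumptions for illustration; in practice the text would come from a file such as /boot/config-2.6.24-19-server.

```python
import re


def read_kernel_config_option(config_text, name):
    """Look up CONFIG_<name> in kernel config text.

    Returns the value as a string (e.g. "64", "y", "m"), "n" for an explicit
    "# CONFIG_<name> is not set" line, or None if the option is not mentioned.
    """
    key = "CONFIG_" + name
    for line in config_text.splitlines():
        line = line.strip()
        if line == "# %s is not set" % key:
            return "n"
        match = re.match(r"%s=(.*)$" % re.escape(key), line)
        if match:
            return match.group(1).strip('"')
    return None


# Illustrative sample only, not the full config from this machine.
sample = """\
CONFIG_SMP=y
CONFIG_NR_CPUS=64
# CONFIG_EXAMPLE_OPTION is not set
"""

nr_cpus = read_kernel_config_option(sample, "NR_CPUS")
print("CONFIG_NR_CPUS =", nr_cpus)
```

To use it on a real system, read the file contents first, for example with open("/boot/config-2.6.24-19-server").read(), and pass the text in.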
Created attachment 18226 [details] dmesg -s64000 output (2.6.24-19-server kernel, acpi=ht)
Created attachment 18227 [details] dmesg -s64000 (2.6.24-19-server kernel, maxcpus=1)
[ 0.000000] ACPI: LAPIC (acpi_id[0x01] lapic_id[0x00] enabled)
[ 0.000000] Processor #0 (Bootup-CPU)
[ 0.000000] ACPI: LAPIC (acpi_id[0x02] lapic_id[0x01] enabled)
[ 0.000000] Processor #1
[ 0.000000] ACPI: LAPIC (acpi_id[0x03] lapic_id[0x02] enabled)
[ 0.000000] Processor #2
[ 0.000000] ACPI: LAPIC (acpi_id[0x04] lapic_id[0x83] disabled)
[ 0.000000] ACPI: IOAPIC (id[0x03] address[0xfec00000] gsi_base[0])
[ 0.000000] IOAPIC[0]: apic_id 3, address 0xfec00000, GSI 0-23
The only thing strange here is the lapic_id[0x83] for your disabled processor #3, but hopefully we don't touch that... Please try booting with maxcpus=2. Please try booting with maxcpus=3.
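To compare these entries across different boots, the LAPIC lines can be pulled out of saved dmesg output mechanically. This is only a sketch in Python: the regular expression assumes the exact line format quoted above, and the names LAPIC_RE and parse_lapics are invented for this example.

```python
import re

# Matches lines such as:
#   ACPI: LAPIC (acpi_id[0x04] lapic_id[0x83] disabled)
LAPIC_RE = re.compile(
    r"ACPI: LAPIC \(acpi_id\[(0x[0-9a-fA-F]+)\] "
    r"lapic_id\[(0x[0-9a-fA-F]+)\] (enabled|disabled)\)"
)


def parse_lapics(dmesg_text):
    """Return a list of (acpi_id, lapic_id, enabled) tuples found in dmesg text."""
    return [
        (int(acpi_id, 16), int(lapic_id, 16), state == "enabled")
        for acpi_id, lapic_id, state in LAPIC_RE.findall(dmesg_text)
    ]


dmesg = """\
[ 0.000000] ACPI: LAPIC (acpi_id[0x01] lapic_id[0x00] enabled)
[ 0.000000] Processor #0 (Bootup-CPU)
[ 0.000000] ACPI: LAPIC (acpi_id[0x02] lapic_id[0x01] enabled)
[ 0.000000] Processor #1
[ 0.000000] ACPI: LAPIC (acpi_id[0x03] lapic_id[0x02] enabled)
[ 0.000000] Processor #2
[ 0.000000] ACPI: LAPIC (acpi_id[0x04] lapic_id[0x83] disabled)
"""

lapics = parse_lapics(dmesg)
enabled = [entry for entry in lapics if entry[2]]
disabled = [entry for entry in lapics if not entry[2]]
print(len(enabled), "enabled,", len(disabled), "disabled")
for acpi_id, lapic_id, _ in disabled:
    print("disabled: acpi_id=0x%02x lapic_id=0x%02x" % (acpi_id, lapic_id))
```

On the lines quoted in this comment it reports three enabled entries and one disabled entry, the latter being the one with lapic_id 0x83.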
I know it's a bit weird, but an X3 processor is a quad core with one processor disabled. I chose it over a similarly priced quad core version for environmental and running cost reasons. I knew I would be running a server and for more or less the same performance within a set budget, running only three cores consumes less power than running an equivalent quad core processor. Ironically, it looks like it might be the unconventional triple core design that is causing problems with acpi and preventing ubuntu from managing power consumption effectively! Funny how good intentioned decisions you make can come back to bite you. I'll try the maxcpus=2 and maxcpus=3 options later this evening (UK time). Thanks for your help.
Unfortunately, I get kernel panic with maxcpus=2 and maxcpus=3. I'll attach a kernel panic screenshot...
Created attachment 18233 [details] kernel panic screen when using maxcpus=3
Please try the latest base kernel. intel_cacheinfo.c has had some changes (e.g., kobject-related fixes), so maybe this is already fixed.
I'd be happy to try the latest kernel. Please can you explain how I can get and install it? Should I be looking for a kernel with a particular version number? Can I get it with aptitude? Will I have to download the source and compile it? Sorry for being so naive, but I am fairly new to linux and have never done this before so it will be necessary to be more explicit or give me some pointers to other information and instructions that I can follow. Cheers.
You need to download a kernel and build it. There are a lot of HOWTOs about kernel building on the internet, e.g. http://www.digitalhermit.com/linux/Kernel-Build-HOWTO.html
Okay. Thanks for the info. That guide looks pretty good. It looks like I should set aside several hours for this task at least. I'm going away and won't have time to try it before I leave. As a result I won't be able to try the latest kernel before the beginning of September. I should have occasional remote access to the server while I am away however, so I may be able to help if anyone requires any additional information that's easy to get and doesn't require rebooting the server.
Just to let you know that I upgraded to kernel version 2.6.24-21-server today using aptitude. I still get the same kernel panic behaviour. I am able to suppress it with acpi=off or acpi=ht. I haven't tried with maxcpus=1, but I presume that this would also prevent the kernel panic as it did for kernel version 2.6.24-19.
If this is an SMP bug in the AMD CPUID4 code, then I'm puzzled why the system boots all 3 cores in MPS mode when acpi=off -- why doesn't MPS mode have the same problem?
Yes, the cpuid4 emulation is independent of ACPI and should be active in MPS mode too. Also, the _exit function should only be called at boot when some error happens. It might be interesting to find out what error that is. It must be one of the kobject_register() calls failing. It's also quite possible that it is buggy -- error-handling code tends not to be well tested. Perhaps it would be best to pass it on to someone from AMD who can hopefully reproduce it.
I contacted AMD processor support and asked them if they would take a look at this problem, pointing them to this bug page. I got the following response: -------------------- ACPI functionality is motherboard (BIOS) controlled; not processor controlled. Please contact the motherboard manufacturer for more information. You may want to check if Asus has a beta BIOS which would provide better ACPI support for Linux based OSs. If you have any questions, please reply back to this email or contact us at 408-749-3060 (US CPU Support) or 44-1276-803299 (EU CPU Support). -------------------- Am I right in thinking that the CPUID4 code IS processor related? Perhaps they didn't take the time to read the full bug report.
I have now also requested that ASUS support take a look at this bug and am waiting for a reply.
Yes it is processor related. Although when it fails only with ACPI and not with MPS there's probably some ACPI dependency too.
Okay thanks. I've written back to AMD Support saying that while the processor may not ultimately be the root cause, the bug is processor related, so it would be useful to have their input. By taking part I hope they can help us rule out the processor as being the cause.
Hi Paul, Just tested 2.6.24.7, the latest official stable 2.6.24 kernel, on a down-cored Phenom CPU. No panic occurred. I don't know how different the Ubuntu kernel (2.6.24-19-server) is from the official kernels. Thus I highly suggest that you retest with an official kernel. E.g. you can use the recently released 2.6.27 kernel. Furthermore I have seen the following lines in your dmesg: mtrr: your CPUs had inconsistent fixed MTRR settings mtrr: probably your BIOS does not setup all CPUs. mtrr: corrected configuration. So your BIOS is at least "suboptimal". Last but not least, there seems to be a new BIOS available for your mobo. 1001 BIOS of M3A-H/HDMI (It's from 2008/10/07): 1. Resolve system is freezing after set USB hard disk to the first boot device in BIOS. 2. Enhance system stability when using certain memory. 3. Fix C1E function fail issue. As further steps I suggest: (1) update BIOS and retest (2) install mainline kernel and retest (3) do further debugging if needed
Andreas, thank you very much for your help. I will indeed try out the new bios update and, if necessary, the latest mainline kernel version once I get back from holidays at the beginning of November. (I think I said September before in a previous comment by mistake, but that's plain silly.) To anyone here: I don't know the differences between an ubuntu kernel and a linux mainline kernel. Is installing a mainline kernel on an ubuntu server install likely to suffer any incompatibility problems? Is installing a mainline kernel the same as installing a kernel labelled 'linux' and 'generic' via aptitude? Thank you.
Help please!!! I am back online now and decided to try the new bios update. This was apparently successful. However, it reset all my bios settings. I have restored them to what I believe to be the correct state, but now it gets through all the diagnostics and when it attempts to boot it says: Reboot and Select proper Boot device or Insert Boot Media in selected Boot device and press a key. I have made sure that the boot disk is selected as the primary boot device rather than the CD rom (or floppy which I don't have) but it still says it. I thought, no problem, I'll simply reinstall version 0803 of the bios... but now I have the same problem with that. This is (was) a live server. Any help greatly appreciated.
Panic over. It was attempting to boot from disk... just not the right one. I have now tried Bios version 1001. This does not solve the problem. Also tried with maxcpus=3. I have also performed all updates offered by aptitude, including updating to kernel version 2.6.24-21-server. Next step is to try a 'mainline linux kernel'. Not sure what I am doing here, since I have never tried it before. About to do some research. Once again, any help gratefully received.
Before trying to compile a mainline kernel for myself, I have now upgraded Ubuntu to version 8.10, which also allowed me to update my kernel to version 2.6.27-7-server. I still get kernel panic on booting - but the diagnostic output is now slightly different. The kernel panic can still be suppressed by using the acpi=ht boot param. I will now attach all the relevant diagnostic information for this new configuration for reference: a) a screen-shot of the new kernel panic b) the output of acpidump c) the output of dmesg -s64000 d) the output of lspci -vvvxxx
Created attachment 18587 [details] photo of kernel panic screen after booting (bios version 1001, ubuntu 8.10, kernel 2.6.27-7)
The description below should read: photo of kernel panic screen after booting (bios version 1001, ubuntu 8.10, kernel 2.6.27-7)
(In reply to comment #43)
> Created an attachment (id=18587) [details]
> photo of kernel panic screen after booting without using acpi=off (bios version 0803, ubuntu 8.10, kernel 2.6.27-7)
Created attachment 18589 [details] output of acpidump (bios version 1001, ubuntu 8.10, kernel 2.6.27-7)
Created attachment 18590 [details] output of dmesg -s64000 (bios version 1001, ubuntu 8.10, kernel 2.6.27-7)
Created attachment 18591 [details] output of lspci -vvvxxx (bios version 1001, ubuntu 8.10, kernel 2.6.27-7)
I have been looking at the diagnostic information I have just posted. Two things are prominent in the new kernel panic output:
1) it now says
BUG: unable to handle kernel NULL pointer dereference at 00000000
2) it now also says
Aperture pointing to e820 RAM. Ignoring.
Your BIOS doesn't leave a memory aperture hole.
Please enable the IOMMU option in the BIOS setup.
This costs you 64 MB of RAM.
There is more information about the aperture problem in the dmesg output:
[ 0.010000] Node 0: aperture @ 8174000000 size 32 MB
[ 0.010000] Aperture beyond 4GB. Ignoring.
[ 0.010000] Your BIOS doesn't leave a aperture memory hole
[ 0.010000] Please enable the IOMMU option in the BIOS setup
[ 0.010000] This costs you 64 MB of RAM
[ 0.010000] Mapping aperture over 65536 KB of RAM @ 20000000
[ 0.010000] PM: Registered nosave memory: 0000000020000000 - 0000000024000000
I don't know if the latter problem is causing the kernel panic, but I decided to try to resolve it in case it is. I have had a look around the ASUS M3A-H/HDMI bios setup, but I can't find any IOMMU option. I did an internet search and came up with a discussion (http://vip.asus.com/forum/view.aspx?id=20080110054618984&board_id=1&model=M2NPV-VM&page=1&SLanguage=en-us) which pointed me to an information source (ftp://download.nvidia.com/XFree86/Linux-x86/1.0-8174/README/32bit_html/appendix-l.html) which says:
"On AMD's AMD64 platform, the size of the IOMMU can be configured in the system BIOS or, if no IOMMU BIOS option is available, using the 'iommu=memaper' kernel parameter. This kernel parameter expects an order and instructs the Linux kernel to create an IOMMU of size 32MB^order overlapping physical memory. If the system's default IOMMU is smaller than 64MB, the Linux kernel automatically replaces it with a 64MB IOMMU."
So, it looks like I might be able to use boot parameters to sort out this problem.
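As a rough aid for reading the "32MB^order" wording in the quoted text: as far as I can tell, the kernel's x86_64 boot-options.txt describes the memaper size as 32MB shifted left by the order (32MB<<order, default order 1), which would match the 65536 KB aperture in the dmesg lines above. The Python sketch below assumes that interpretation; the function name memaper_size_mb is made up for illustration.

```python
def memaper_size_mb(order):
    """Aperture size in MB for iommu=memaper=<order>, assuming 32MB << order."""
    if order < 0:
        raise ValueError("order must be non-negative")
    return 32 << order


# Print the size for the first few orders under the assumption above.
for order in range(4):
    size_mb = memaper_size_mb(order)
    print("order %d -> %d MB (%d KB)" % (order, size_mb, size_mb * 1024))
```

Under this assumption, order 1 gives 64 MB, i.e. 65536 KB.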
I have looked up the iommu boot parameters in the kernel documentation (http://www.kernel.org/doc/Documentation/kernel-parameters.txt) and found the following potentially relevant parameters:
iommu= [x86] off force noforce biomerge panic nopanic merge nomerge forcesac soft
amd_iommu= [HW,X86-84] Pass parameters to the AMD IOMMU driver in the system. Possible values are: isolate - enable device isolation (each device, as far as possible, will get its own protection domain)
amd_iommu_size= [HW,X86-64] Define the size of the aperture for the AMD IOMMU driver. Possible values are: '32M', '64M' (default), '128M', '256M', '512M', '1G'
The 'iommu=memaper' option is not mentioned in the official documentation. I am not sure what the bios and dmesg diagnostic messages are asking me to do with these boot parameters. Please can someone advise me how I should use the iommu boot parameters? iommu=force ? Thank you.
Update: On further investigation, I found some additional relevant IOMMU info in my dmesg output:
[ 0.490101] PCI-DMA: Disabling AGP.
[ 0.490553] PCI-DMA: aperture base @ 20000000 size 65536 KB
[ 0.490553] PCI-DMA: using GART IOMMU.
[ 0.490553] PCI-DMA: Reserving 64MB of IOMMU area in the AGP aperture
So, my AGP is being disabled by the kernel and GART IOMMU is being used: I don't know if that is the correct type. I notice that there are several types in the boot-options.txt document (which appears to have been deleted from the 2.6.26 documentation? http://www.linuxhq.com/kernel/v2.6/27/Documentation/x86_64/boot-options.txt). I then found an (annoyingly bloated) thread in a forum which provided a workaround to the IOMMU problem for a very similar set-up to mine (amd64, asus motherboard, pci express, my agp getting disabled by the kernel). The workaround is to use the iommu=noaperture boot option (http://www.linuxquestions.org/questions/showthread.php?p=3173751#post3173751). When I do this I still get a kernel panic. However, the lines:
Aperture pointing to e820 RAM. Ignoring.
Your BIOS doesn't leave a memory aperture hole.
Please enable the IOMMU option in the BIOS setup.
This costs you 64 MB of RAM.
no longer appear at the start. Instead, the kernel panic starts directly with
BUG: unable to handle kernel NULL pointer dereference at 00000000
So, it looks like the IOMMU messages might not be the cause of the kernel panic after all, and their appearance might be the result of a separate bug in the kernel that doesn't recognise that it doesn't need to create a memory aperture when agp is disabled. So, I am still none the wiser. I created an account with ASUS support and sent them a support message on 18 October. They said they would try to get back to me within 24 hours and that if it took longer than 48 hours to reply my query would automatically become a 'first priority', but I still haven't received a reply.
Paul, the CPUID4 panic problem is unlikely to have anything to do with the IOMMU setup. Someone just has to reproduce and fix it, but normally that requires a mainline kernel (if you want to stay with Ubuntu kernels, you'll have to ask Ubuntu support). Also, IOMMU should be independent of acpi=off, so if it works with acpi=off then the IOMMU is likely ok. That iommu=memaper results in a NULL pointer reference is a regression; best you open a separate bug report for that. Fixing that won't help your problem though.
Dear all. I reported this as a bug with Ubuntu (https://bugs.launchpad.net/ubuntu/+source/linux/+bug/292619) and was given instructions on how to install an upstream kernel (https://wiki.ubuntu.com/KernelTeam/GitKernelBuild), which I have now done. However, even with the upstream kernel I still get kernel panic on boot, so this is apparently not an Ubuntu specific problem and so I have come back here to try to resolve it. What information do I need to provide to help debug this? Thank you.
Hmm, it would be great if you could attach a screenshot of the crash with the upstream kernel.
Created attachment 18889 [details] Kernel panic screen obtained with the upstream kernel
This isn't ACPI related.
I will be unable to access email from November 22nd until the 31st. I will respond to email sent during this period as soon as I can. In my absence, refer Linux or Xen issues to the OSRC at osrc@elbe.amd.com. Personal email should be sent to mlangsdo@io.com. -Mark Langsdorf Operating System Research Center AMD
If it isn't ACPI related, how do I find out what is causing it? Shaohua, can you tell from the kernel panic screen that it's not ACPI related, or did you draw that conclusion from other information? Thank you.
The OS can find all CPUs with ACPI enabled. But in this case it crashes in cache detection, which isn't ACPI related and may be a cache detection issue specific to this CPU.
Created attachment 19260 [details] Patch for testing Can you please apply the attached patch to the mainline kernel and retry ? Thanks, tglx
Created attachment 19269 [details] Kernel panic on boot with patch id=19260 applied to upstream kernel I have tried the patch, but I still get kernel panic on boot. I have attached a screen shot. The kernel panic can be suppressed using the acpi=ht boot param. Please let me know if I can provide any more info.
I just updated my bios to the latest version (1102, 2008/11/14, http://support.asus.com/download/download.aspx?SLanguage=en-us&model=M3A-H/HDMI). I have tried with the latest official ubuntu kernel and with the patched mainline kernel. Both give rise to kernel panic screens on boot if I don't use acpi=ht. However, the messages on the kernel panic screens are different. For the custom kernel the trace is so long that more than a whole screen's worth is output. I have taken a picture of the final visible output for both kernels. Will attach now...
Created attachment 19314 [details] Kernel panic screen on boot. Bios v1102. Latest ubuntu kernel Bios v1102. Latest ubuntu kernel (2.6.27-9-server)
Created attachment 19315 [details] Kernel panic screen on boot. Bios v1102. Patched mainline kernel (v2.6.28-rc4-custom) Bios v1102 Mainline kernel with patch id19260 applied (v2.6.28-rc4-custom)
Update: Just updated to the latest bios (1201). Still get kernel panic with latest ubuntu kernel and with mainline kernel with patch id19260 applied (v2.6.28-rc4-custom).
Created attachment 19787 [details] test patch V2 I just noticed that the previous patch #19260 checks the wrong variable. Can you please retest ? Thanks, tglx
Created attachment 19808 [details] Kernel panic on boot. Bios v1201. Mainline kernel (v2.6.29-rc1-custom) This is with the latest mainline kernel (no patch).
Created attachment 19809 [details] Kernel panic on boot. Bios v1201. Patched (id=19787) mainline kernel (v2.6.29-rc1-custom) This is with the patched mainline kernel. Still get kernel panic on boot.
Created attachment 19981 [details] full trace the kernel in splashtop works fine with m3a-h/hdmi http://bugzilla.kernel.org/show_bug.cgi?id=12352
So, it looks like this bug has crept in since kernel version 2.6.20 (the version of the kernel that splashtop uses according to the notes on bug 12352). The earliest I have tried with is 2.6.24.
I can't say for sure whether the splashtop kernel uses acpi or not (maybe I'll try to hack a terminal into it), but it can turn off the machine (although after that the bios complains that the machine failed to boot last time). It is also possible that DeviceVM changed something in the kernel. The source is available here (after registration): http://www.splashtop.com/open_source.php
As I have the same problem and would like to get this fixed: is there anything else that could be done? As reference: I reported this bug 2 months ago over at launchpad: https://bugs.launchpad.net/ubuntu/+source/linux-meta/+bug/305165
In addition to the things already said in this thread:
- The problem occurs for me only if I enable surroundview in BIOS. This option is necessary to get Hybrid Crossfire to work. If I disable surroundview everything works like it should: I can use the onboard graphic (surroundview disabled) or the discrete graphic (surroundview disabled) without problems.
- Testing acpi=off in grub gave me no kernel panic.
- I tested several Live-CDs to see if this problem is Ubuntu-specific. The only Live-CD that worked for me was Knoppix 5.1 with a 2.6.19 kernel. Everything else I tried did not work (Ubuntu 8.04.1, 32 & 64bit; Debian 5.0, 32bit; Ubuntu 8.10, 32 & 64bit; OpenSUSE 11.0, 32bit).
Is there more info needed to get this fixed?
- Distribution: Ubuntu 8.10 Intrepid
- Hardware Environment: ASUS M3A-H/HDMI, 2GB RAM, AMD Athlon64 4850e, 1xSATA, 256MB Radeon 3470 PCI Express (discrete) & Radeon HD 3200 (onboard) <-- for hybrid crossfire.
- BIOS version 0801, as later versions have problems with q-fan. But later versions like 1301 do not fix the problem.
Do any of the systems which only work with maxcpus=1 boot with a kernel using CONFIG_MTRR=n? There is another bugzilla where similar symptoms and workarounds are discussed, see bug #11541. It seems that switching off MTRR code works around the problem.
I recompiled the stock Ubuntu 8.10 32bit kernel with CONFIG_MTRR=n and I get no kernel panic, even if I enable surroundview in BIOS :-) But now X doesn't start: "(EE) No devices detected.". But that may be related to the radeon driver. Maybe it would work with fglrx because radeon or radeonhd doesn't manage hybrid crossfire. Or radeon/radeonhd don't know what to do if there are two cards. No idea.
Ok... X works now. Had to put BusID for the primary card into my xorg.conf.
Paul, can you confirm that switching off MTRR kernel code avoids the panic on your system, too? If yes, this bugzilla should be closed as duplicate of bug #11541
Yes. I will do. I plan to try it out this weekend. Paul
I can confirm that compiling without the MTRR code avoids the kernel panic for me and so the finger is pointed at this code. I agree that the root cause of this kernel panic is likely to be the same as for bug #11541 and so am marking it as a duplicate. *** This bug has been marked as a duplicate of bug 11541 ***