Bug 11714 - Kernel panic on boot if SMP - ASUS M3A-H/HDMI
Status: RESOLVED DUPLICATE of bug 11541
Product: Platform Specific/Hardware
Classification: Unclassified
Component: i386
Hardware: All, OS: Linux
Importance: P1 high
Assigned To: platform_i386
Depends on:
Blocks: 56331
 
Reported: 2008-10-07 01:45 UTC by Paul
Modified: 2013-04-09 06:23 UTC
3 users

See Also:
Kernel Version: 2.6.24-19-server
Tree: Mainline
Regression: ---


Attachments
The output of lspci -vvvxxx (41.88 KB, text/plain)
2008-10-07 14:27 UTC, Paul
photo of kernel panic screen after booting without using acpi=off (560.42 KB, image/jpeg)
2008-10-07 14:34 UTC, Paul
photo of kernel panic screen after booting without using acpi=off (bios version 0803) (959.02 KB, image/jpeg)
2008-10-07 16:29 UTC, Paul
The output of lspci -vvvxxx (bios version 0803) (41.88 KB, text/plain)
2008-10-07 16:31 UTC, Paul
dmesg -s64000 output (2.6.24-19-server kernel, acpi=off) (28.91 KB, text/plain)
2008-10-08 13:01 UTC, Paul
Output of acpidump (174.84 KB, text/plain)
2008-10-08 14:35 UTC, Paul
dmesg -s64000 output (2.6.24-19-server kernel, acpi=ht) (36.61 KB, text/plain)
2008-10-08 14:42 UTC, Paul
dmesg -s64000 (2.6.24-19-server kernel, maxcpus=1) (33.87 KB, text/plain)
2008-10-08 14:46 UTC, Paul
kernel panic screen when using maxcpus=3 (941.27 KB, image/jpeg)
2008-10-09 10:40 UTC, Paul
photo of kernel panic screen after booting (bios version 1001, ubuntu 8.10, kernel 2.6.27-7) (978.17 KB, image/jpeg)
2008-11-01 16:04 UTC, Paul
output of acpidump (bios version 1001, ubuntu 8.10, kernel 2.6.27-7) (49.40 KB, application/gzip)
2008-11-01 16:10 UTC, Paul
output of dmesg -s64000 (bios version 1001, ubuntu 8.10, kernel 2.6.27-7) (39.83 KB, text/plain)
2008-11-01 16:12 UTC, Paul
output of lspci -vvvxxx (bios version 1001, ubuntu 8.10, kernel 2.6.27-7) (45.04 KB, text/plain)
2008-11-01 16:16 UTC, Paul
Kernel panic screen obtained with the upstream kernel (148.66 KB, image/png)
2008-11-17 11:21 UTC, Paul
Patch for testing (936 bytes, patch)
2008-12-11 23:50 UTC, Thomas Gleixner
Kernel panic on boot with patch id=19260 applied to upstream kernel (130.55 KB, image/png)
2008-12-12 20:11 UTC, Paul
Kernel panic screen on boot. Bios v1102. Latest Ubuntu kernel (127.74 KB, image/png)
2008-12-15 12:05 UTC, Paul
Kernel panic screen on boot. Bios v1102. Patched mainline kernel (v2.6.28-rc4-custom) (143.34 KB, image/png)
2008-12-15 12:07 UTC, Paul
test patch V2 (940 bytes, patch)
2009-01-14 05:41 UTC, Thomas Gleixner
Kernel panic on boot. Bios v1201. Mainline kernel (v2.6.29-rc1-custom) (134.09 KB, image/png)
2009-01-14 16:38 UTC, Paul
Kernel panic on boot. Bios v1201. Patched (id=19787) mainline kernel (v2.6.29-rc1-custom) (133.35 KB, image/png)
2009-01-14 16:41 UTC, Paul
full trace (50 bytes, text/plain)
2009-01-24 18:41 UTC, Balázs Hámorszky

Description Paul 2008-10-07 01:45:03 UTC
Latest working kernel version: Don't know of any.
Earliest failing kernel version: Tried with linux-image-generic and a previous version of the server kernel. (Can't remember the exact version off the top of my head, but can investigate if anyone wants to know.)
Distribution: Ubuntu 8.04.1 (Hardy)
Hardware Environment: ASUS M3A-H/HDMI, 4GB RAM, AMD Phenom X3 8750, 4x SATA, 256MB Radeon 2400 PRO PCI Express.
Software Environment: Ubuntu 64-bit server edition, using the latest available BIOS update (version 0801, dated 2008/07/29).
Problem Description: Unless acpi=off is used, the system goes straight into a kernel panic. This happens both during and after installation.

Steps to reproduce: Use the same hardware set-up as me. Boot into ubuntu.
Comment 1 Paul 2008-10-07 02:00:17 UTC
I just noticed that a newer version of the BIOS for my motherboard (0803) is now available, which states:

0803 BIOS for M3A-H/HDMI
1. Resolve failing to install new VGA driver problem.
2. Resolve failing to install 64bit OS after RAID is built. 

I am using software RAID rather than hardware RAID via the BIOS. Nevertheless, it is possible that this update might cure my problem. I'll try out the new BIOS and report back my findings here.
Comment 2 ykzhao 2008-10-07 02:28:11 UTC
Will you please attach the output of acpidump, lspci -vvvxxx?
thanks.
Comment 3 ykzhao 2008-10-07 02:30:00 UTC
Will you please capture the screenshot of kernel panic with ACPI enabled?
thanks.
Comment 4 Paul 2008-10-07 14:27:17 UTC
Created attachment 18197 [details]
The output of lspci -vvvxxx

This is with bios Version 0801.
Comment 5 Paul 2008-10-07 14:34:24 UTC
Created attachment 18198 [details]
photo of kernel panic screen after booting without using acpi=off

This is with bios Version 0801.
Let me know if you want me to email you a full quality RGB colour version.
Comment 6 Paul 2008-10-07 16:29:28 UTC
Created attachment 18203 [details]
photo of kernel panic screen after booting without using acpi=off (bios version 0803)

This is with bios Version 0801.
Comment 7 Paul 2008-10-07 16:31:10 UTC
Created attachment 18204 [details]
The output of lspci -vvvxxx  (bios version 0803)

This is with bios Version 0803.
Comment 8 Paul 2008-10-07 16:37:12 UTC
So, I have now upgraded the bios to the latest version (0803). This has not cured the kernel panic problem however. I have attached kernel panic screenshots and lspci -vvvxxx output for both versions 0801 and 0803 of the bios. 

I made a mistake when describing one of the attachments. Comment #6 should end by saying: This is with bios Version 0803. I couldn't figure out any way of correcting this directly.

Cheers.
Comment 9 Shaohua 2008-10-07 23:56:51 UTC
How about the boot option max_cpus=1?
Or can you try the latest base kernel, since you are using a 2.6.24 kernel?
Comment 10 Paul 2008-10-08 01:37:24 UTC
kernel version 2.6.24 is the latest offered to me using aptitude.

Please let me know how to access and install the latest base kernel (in a way which allows me to roll back to my current version if it doesn't work.)

I'll try max_cpus=1 this evening, UK time.

Thanks.
Comment 11 Paul 2008-10-08 09:45:45 UTC
I tried booting with max_cpus=1 (and without acpi=off). The system went straight into kernel panic as before. Looking at the screen output, the address of the paging request is slightly different to the address which appears when not specifying max_cpus=1, but otherwise is the same.
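The following is a minimal Python sketch, for illustration only, of why a misspelled option changes nothing. It splits a command line, such as the contents of /proc/cmdline, into parameters and reports the CPU limit that would actually be requested. The helper names and the typo table are hypothetical, and the table covers only the two spellings discussed in this thread. As far as I know, the kernel does not reject parameters it doesn't recognise; it passes them on to init instead. That would explain why max_cpus=1 was silently ignored.

```python
# Hypothetical table of misspellings; only covers the case in this report.
KNOWN_TYPOS = {"max_cpus": "maxcpus"}


def parse_cmdline(cmdline):
    """Split a kernel command line into {name: value-or-None}."""
    params = {}
    for token in cmdline.split():
        name, sep, value = token.partition("=")
        params[name] = value if sep else None
    return params


def effective_cpu_limit(cmdline):
    """Return the CPU limit requested via nosmp or maxcpus=, or None for no limit."""
    params = parse_cmdline(cmdline)
    if "nosmp" in params:
        return 1
    if params.get("maxcpus") is not None:
        return int(params["maxcpus"])
    return None


def typo_warnings(cmdline):
    """List parameters that look like misspellings of real CPU-limit options."""
    params = parse_cmdline(cmdline)
    return [
        f"{bad}= is not a kernel parameter; did you mean {good}=?"
        for bad, good in KNOWN_TYPOS.items()
        if bad in params
    ]
```

With this sketch, effective_cpu_limit("ro quiet max_cpus=1") yields None (no limit), whereas "ro quiet maxcpus=1" yields 1.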
Comment 12 Len Brown 2008-10-08 12:05:55 UTC
Please try "maxcpus=1" (max_cpus is not actually a kernel parameter)
or, equivalently, "nosmp".

The crash is in a kfree of a per-cpu data structure,
which will likely not occur on a uniprocessor config.

Also, please try "acpi=ht", which enables only the MP part of ACPI
and nothing else. If it still fails there, then we know this
is strictly an SMP configuration issue and not a core ACPI issue.

How many cores does this hardware have, and how many processors
is the kernel configured for?
(grep CONFIG_NR_CPUS .config) If under 32, try upping it to 32.

Please attach the output from dmesg -s64000 from a kernel that
boots successfully.
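Below is a rough Python equivalent of the grep above. It assumes the usual Kconfig format, in which options that are set appear as CONFIG_NAME=value lines. The function names are illustrative only. For a distribution kernel, the text to pass in would typically come from a /boot/config-<version> file rather than from a .config file in a build tree.

```python
import re

# Options that are set are written as CONFIG_NAME=value lines.
OPTION_RE = re.compile(r"^(CONFIG_[A-Za-z0-9_]+)=(.*)$")


def parse_kconfig(text):
    """Return {option: value} for every CONFIG_* option set in a config file."""
    options = {}
    for line in text.splitlines():
        match = OPTION_RE.match(line.strip())
        if match:
            options[match.group(1)] = match.group(2)
    return options


def nr_cpus(text):
    """Return CONFIG_NR_CPUS as an int, or None if the option is absent."""
    value = parse_kconfig(text).get("CONFIG_NR_CPUS")
    return int(value) if value is not None else None
```

For example, passing in the contents of /boot/config-2.6.24-19-server should give 64 if that file contains CONFIG_NR_CPUS=64.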
Comment 13 Paul 2008-10-08 12:59:54 UTC
It's got an AMD PHENOM X3 triple core processor.

I've done a "sudo find . -type f -name .config" from the root directory, but I haven't found any files called .config.

dmesg output to come...
Comment 14 Paul 2008-10-08 13:01:52 UTC
Created attachment 18221 [details]
dmesg -s64000 output (2.6.24-19-server kernel, acpi=off)
Comment 15 Paul 2008-10-08 13:15:57 UTC
The system boots with maxcpus=1. No kernel panic.

Now about to try with acpi=ht (but no maxcpus=1)
Comment 16 Paul 2008-10-08 13:21:14 UTC
I can also confirm that the system successfully boots when using the acpi=ht boot param.
Comment 17 Len Brown 2008-10-08 14:12:02 UTC
> The system boots with maxcpus=1. No kernel panic.

please attach the dmesg from this successful ACPI mode boot.

also, please attach the output from acpidump -- doesn't matter
how you boot to get it.

If you are running a distro kernel rather than one you build,
your config file is probably in /boot/config*.

> successfully boots when using the acpi=ht

okay, that one is pretty much the same as your acpi=off boot, which successfully booted all 3 cores.
Comment 18 Paul 2008-10-08 14:33:57 UTC
acpidump wasn't installed, so I installed it. Attachment coming up...
Comment 19 Paul 2008-10-08 14:35:20 UTC
Created attachment 18225 [details]
Output of acpidump
Comment 20 Paul 2008-10-08 14:38:28 UTC
According to /boot/config-2.6.24-19-server: CONFIG_NR_CPUS=64


Comment 21 Paul 2008-10-08 14:42:18 UTC
Created attachment 18226 [details]
dmesg -s64000 output (2.6.24-19-server kernel, acpi=ht)
Comment 22 Paul 2008-10-08 14:46:18 UTC
Created attachment 18227 [details]
dmesg -s64000 (2.6.24-19-server kernel, maxcpus=1)
Comment 23 Len Brown 2008-10-08 18:18:36 UTC
[    0.000000] ACPI: LAPIC (acpi_id[0x01] lapic_id[0x00] enabled)
[    0.000000] Processor #0 (Bootup-CPU)
[    0.000000] ACPI: LAPIC (acpi_id[0x02] lapic_id[0x01] enabled)
[    0.000000] Processor #1
[    0.000000] ACPI: LAPIC (acpi_id[0x03] lapic_id[0x02] enabled)
[    0.000000] Processor #2
[    0.000000] ACPI: LAPIC (acpi_id[0x04] lapic_id[0x83] disabled)
[    0.000000] ACPI: IOAPIC (id[0x03] address[0xfec00000] gsi_base[0])
[    0.000000] IOAPIC[0]: apic_id 3, address 0xfec00000, GSI 0-23

The only thing strange here is the lapic_id (0x83) for your disabled processor #3,
but hopefully we don't touch that...

please try booting with maxcpus=2
please try booting with maxcpus=3
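To make the ACPI: LAPIC entries above easier to compare across boots, here is a small Python sketch. It pulls those lines out of a dmesg log and separates enabled entries from disabled ones. The regular expression assumes only the line format shown in the excerpt, and the helper names are hypothetical.

```python
import re

# Matches lines such as:
#   ACPI: LAPIC (acpi_id[0x04] lapic_id[0x83] disabled)
LAPIC_RE = re.compile(
    r"ACPI: LAPIC \(acpi_id\[0x([0-9a-fA-F]+)\] "
    r"lapic_id\[0x([0-9a-fA-F]+)\] (enabled|disabled)\)"
)


def parse_lapics(dmesg_text):
    """Return (acpi_id, lapic_id, enabled) tuples in the order they appear."""
    return [
        (int(m.group(1), 16), int(m.group(2), 16), m.group(3) == "enabled")
        for m in LAPIC_RE.finditer(dmesg_text)
    ]


def summarize(entries):
    """Count enabled CPUs and list the disabled ones with their APIC IDs."""
    enabled = sum(1 for _, _, on in entries if on)
    disabled = [(acpi, hex(lapic)) for acpi, lapic, on in entries if not on]
    return enabled, disabled
```

Run against the excerpt quoted above, summarize reports three enabled CPUs and one disabled entry, with APIC ID 0x83.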


Comment 24 Paul 2008-10-09 03:02:10 UTC
I know it's a bit weird, but an X3 processor is a quad core with one core disabled. I chose it over a similarly priced quad core for environmental and running-cost reasons. I knew I would be running a server, and for more or less the same performance within a set budget, running only three cores consumes less power than running an equivalent quad-core processor. Ironically, it looks like it might be the unconventional triple-core design that is causing problems with ACPI and preventing Ubuntu from managing power consumption effectively! Funny how well-intentioned decisions can come back to bite you.

I'll try the maxcpus=2 and maxcpus=3 options later this evening (UK time). Thanks for your help.
Comment 25 Paul 2008-10-09 10:34:35 UTC
Unfortunately, I get kernel panic with maxcpus=2 and maxcpus=3. I'll attach a kernel panic screenshot...
Comment 26 Paul 2008-10-09 10:40:37 UTC
Created attachment 18233 [details]
kernel panic screen when using maxcpus=3
Comment 27 Shaohua 2008-10-15 01:15:29 UTC
Please try the latest base kernel. intel_cacheinfo.c has had some changes (e.g. kobject-related fixes), so maybe this is already fixed.
Comment 28 Paul 2008-10-15 01:44:19 UTC
I'd be happy to try the latest kernel. Please can you explain how I can get and install it? Should I be looking for a kernel with a particular version number? Can I get it with aptitude? Will I have to download the source and compile it? Sorry for being so naive, but I am fairly new to linux and have never done this before so it will be necessary to be more explicit or give me some pointers to other information and instructions that I can follow.

Cheers.
Comment 29 Shaohua 2008-10-15 01:50:45 UTC
You need to download a kernel and build it. There are plenty of kernel-building HOWTOs on the internet, for example:
http://www.digitalhermit.com/linux/Kernel-Build-HOWTO.html
Comment 30 Paul 2008-10-15 03:14:27 UTC
Okay. Thanks for the info. That guide looks pretty good.

It looks like I should set aside at least several hours for this task. I'm going away and won't have time to try it before I leave. As a result, I won't be able to try the latest kernel before the beginning of September.

I should have occasional remote access to the server while I am away however, so I may be able to help if anyone requires any additional information that's easy to get and doesn't require rebooting the server.
Comment 31 Paul 2008-10-15 16:07:02 UTC
Just to let you know that I upgraded to kernel version 2.6.24-21-server today using aptitude. I still get the same kernel panic behaviour. I am able to suppress it with acpi=off or acpi=ht. I haven't tried with maxcpus=1, but I presume that this would also prevent the kernel panic as it did for kernel version 2.6.24-19.
Comment 32 Len Brown 2008-10-16 09:06:54 UTC
If this is an SMP bug in the AMD CPUID4 code,
then I'm puzzled why the system boots all 3 cores
in MPS mode when acpi=off -- why doesn't MPS mode
have the same problem?

Comment 33 Andi Kleen 2008-10-16 11:33:12 UTC
Yes, the cpuid4 emulation is independent of ACPI and should be active in MPS mode
too.

Also, the _exit function should only be called at boot when some error happens.
It might be interesting to find out what that error is. It must be
one of the kobject_register() calls failing.

It's also quite possible that the error path itself is buggy -- error-handling
code tends not to be well tested.

Perhaps it would be best to pass it on to someone from AMD who
can hopefully reproduce it.


Comment 34 Paul 2008-10-18 02:01:28 UTC
I contacted AMD processor support and asked them if they would take a look at this problem, pointing them to this bug page. I got the following response:

--------------------
ACPI functionality is motherboard (BIOS) controlled; not processor controlled. Please contact the motherboard manufacturer for more information. You may want to check if Asus has a beta BIOS which would provide better ACPI support for Linux based OSs.

If you have any questions, please reply back to this email or contact us at 408-749-3060 (US CPU Support) or 44-1276-803299 (EU CPU Support).
--------------------

Am I right in thinking that the CPUID4 code IS processor related? Perhaps they didn't take the time to read the full bug report.
Comment 35 Paul 2008-10-18 02:39:15 UTC
I have now also requested that ASUS support take a look at this bug and am waiting for a reply.
Comment 36 Andi Kleen 2008-10-18 05:24:06 UTC
Yes it is processor related.

Although when it fails only with ACPI and not with MPS there's probably
some ACPI dependency too.
Comment 37 Paul 2008-10-18 05:34:02 UTC
Okay thanks. I've written back to AMD Support saying that while the processor may not ultimately be the root cause, the bug is processor related, so it would be useful to have their input. By taking part I hope they can help us rule out the processor as being the cause. 
Comment 38 herrmann.der.user 2008-10-20 05:36:32 UTC
Hi Paul,

Just tested 2.6.24.7 the latest official stable 2.6.24 kernel on
a down-cored Phenom CPU. No panic occurred.

I don't know how different the Ubuntu kernel (2.6.24-19-server) is from
the official kernels.
Thus I highly suggest that you retest with an official kernel.
E.g. you can use the recently released 2.6.27 kernel.

Furthermore I have seen following lines in your dmesg:

  mtrr: your CPUs had inconsistent fixed MTRR settings
  mtrr: probably your BIOS does not setup all CPUs.
  mtrr: corrected configuration.

So your BIOS is at least "suboptimal".
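A quick way to check whether a given BIOS version still triggers these warnings is to filter the dmesg output for the MTRR messages. The sketch below is a minimal Python version that relies only on the message text quoted above. The function names are made up for this example.

```python
def mtrr_messages(dmesg_text):
    """Return the messages logged by the MTRR code, without timestamps."""
    found = []
    for line in dmesg_text.splitlines():
        line = line.strip()
        # Drop an optional "[    0.000000]" timestamp prefix.
        if line.startswith("[") and "]" in line:
            line = line.split("]", 1)[1].strip()
        if line.startswith("mtrr:"):
            found.append(line)
    return found


def bios_mtrr_inconsistent(dmesg_text):
    """True if the kernel reported inconsistent fixed MTRRs at boot."""
    return any(
        "inconsistent fixed MTRR settings" in msg
        for msg in mtrr_messages(dmesg_text)
    )
```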

Last but not least, there seems to be a new BIOS available for your mobo.

 1001 BIOS of M3A-H/HDMI (It's from 2008/10/07):
 1. Resolve system is freezing after set USB
    hard disk to the first boot device in BIOS.
 2. Enhance system stability when using certain memory.
 3. Fix C1E function fail issue.

As further steps I suggest:
(1) update BIOS and retest
(2) install mainline kernel and retest
(3) do further debugging if needed
Comment 39 Paul 2008-10-20 09:46:58 UTC
Andreas, thank you very much for your help. I will indeed try out the new bios update and, if necessary, the latest mainline kernel version once I get back from holidays at the beginning of November. (I think I said September before in a previous comment by mistake, but that's plain silly.)

To anyone here: I don't know the differences between an ubuntu kernel and a linux mainline kernel. Is installing a mainline kernel on an ubuntu server install likely to suffer any incompatibility problems? Is installing a mainline kernel the same as installing a kernel labelled 'linux' and 'generic' via aptitude?

Thank you.
Comment 40 Paul 2008-11-01 10:43:55 UTC
Help please!!!

I am back online now and decided to try the new bios update. This was apparently successful. However, it reset all my bios settings. I have restored them to what I believe to be the correct state, but now it gets through all the diagnostics and when it attempts to boot it says:

Reboot and Select proper Boot device
or Insert Boot Media in selected Boot device and press a key.

I have made sure that the boot disk is selected as the primary boot device rather than the CD-ROM (or floppy, which I don't have), but it still says it. I thought, no problem, I'll simply reinstall version 0803 of the BIOS... but now I have the same problem with that. This is (was) a live server.

Any help greatly appreciated.
Comment 41 Paul 2008-11-01 11:12:50 UTC
Panic over. It was attempting to boot from disk... just not the right one.

I have now tried Bios version 1001. This does not solve the problem. Also tried with maxcpus=3. I have also performed all updates offered by aptitude, including updating to kernel version 2.6.24-21-server.

Next step is to try a 'mainline linux kernel'. Not sure what I am doing here, since I have never tried it before. About to do some research. Once again, any help gratefully received.
Comment 42 Paul 2008-11-01 16:00:58 UTC
Before trying to compile a mainline kernel for myself, I have now upgraded Ubuntu to version 8.10, which also allowed me to update my kernel to version 2.6.27-7-server. I still get kernel panic on booting - but the diagnostic output is now slightly different. The kernel panic can still be suppressed by using the acpi=ht boot param.

I will now attach all the relevant diagnostic information for this new configuration for reference:
a) a screen-shot of the new kernel panic
b) the output of acpidump
c) the output of dmesg -s64000
d) the output of lspci -vvvxxx
Comment 43 Paul 2008-11-01 16:04:06 UTC
Created attachment 18587 [details]
photo of kernel panic screen after booting (bios version 1001, ubuntu 8.10, kernel 2.6.27-7)
Comment 44 Paul 2008-11-01 16:08:28 UTC
The description below should read:
photo of kernel panic screen after booting (bios version 1001, ubuntu 8.10, kernel 2.6.27-7)

(In reply to comment #43)
> Created an attachment (id=18587) [details]
> photo of kernel panic screen after booting without using acpi=off (bios version
> 0803, ubuntu 8.10, kernel 2.6.27-7)
> 

Comment 45 Paul 2008-11-01 16:10:07 UTC
Created attachment 18589 [details]
output of acpidump (bios version 1001, ubuntu 8.10, kernel 2.6.27-7)
Comment 46 Paul 2008-11-01 16:12:55 UTC
Created attachment 18590 [details]
output of dmesg -s64000 (bios version 1001, ubuntu 8.10, kernel 2.6.27-7)
Comment 47 Paul 2008-11-01 16:16:46 UTC
Created attachment 18591 [details]
output of lspci -vvvxxx (bios version 1001, ubuntu 8.10, kernel 2.6.27-7)
Comment 48 Paul 2008-11-01 17:01:35 UTC
I have been looking at the diagnostic information I have just posted.

Two things are prominent in the new kernel panic output:

1) it now says
BUG: unable to handle kernel NULL pointer dereference at 00000000

2) it now also says
Aperture pointing to e820 RAM. Ignoring.
Your BIOS doesn't leave a memory aperture hole.
Please enable the IOMMU option in the BIOS setup.
This costs you 64 MB of RAM.

There is more information about the aperture problem in the dmesg output:

[    0.010000] Node 0: aperture @ 8174000000 size 32 MB
[    0.010000] Aperture beyond 4GB. Ignoring.
[    0.010000] Your BIOS doesn't leave a aperture memory hole
[    0.010000] Please enable the IOMMU option in the BIOS setup
[    0.010000] This costs you 64 MB of RAM
[    0.010000] Mapping aperture over 65536 KB of RAM @ 20000000
[    0.010000] PM: Registered nosave memory: 0000000020000000 - 0000000024000000
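The numbers in these messages are hexadecimal physical addresses, and they can be checked with a little arithmetic. The Python sketch below assumes that the "beyond 4GB" test amounts to the aperture ending above the 4 GiB boundary. That is my reading of the message, not a copy of the kernel code. With that assumption, the BIOS-provided aperture at 0x8174000000 lies far above 4 GiB. The 65536 KB (64 MiB) fallback aperture at 0x20000000 (512 MiB) ends exactly at 0x24000000, which matches the nosave range in the last line.

```python
FOUR_GIB = 1 << 32
MIB = 1 << 20


def aperture_end(base_hex, size_bytes):
    """End address (exclusive) of an aperture starting at a hex base address."""
    return int(base_hex, 16) + size_bytes


def beyond_4gb(base_hex, size_mb):
    """Assumed 'Aperture beyond 4GB' condition: the aperture ends above 4 GiB."""
    return aperture_end(base_hex, size_mb * MIB) > FOUR_GIB
```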

I don't know if the latter problem is causing the kernel panic, but I decided to try to resolve it in case it is. I have had a look around the ASUS M3A-H/HDMI BIOS setup, but I can't find any IOMMU option. I did an internet search and came up with a discussion (http://vip.asus.com/forum/view.aspx?id=20080110054618984&board_id=1&model=M2NPV-VM&page=1&SLanguage=en-us) which pointed me to an information source (ftp://download.nvidia.com/XFree86/Linux-x86/1.0-8174/README/32bit_html/appendix-l.html) which says:

"On AMD's AMD64 platform, the size of the IOMMU can be configured in the system BIOS or, if no IOMMU BIOS option is available, using the 'iommu=memaper' kernel parameter. This kernel parameter expects an order and instructs the Linux kernel to create an IOMMU of size 32MB^order overlapping physical memory. If the system's default IOMMU is smaller than 64MB, the Linux kernel automatically replaces it with a 64MB IOMMU."

So, it looks like I might be able to use boot parameters to sort out this problem. I have looked up the iommu boot parameters in the kernel documentation (http://www.kernel.org/doc/Documentation/kernel-parameters.txt) and found the following potentially relevant parameters:

	iommu=		[x86]
		off
		force
		noforce
		biomerge
		panic
		nopanic
		merge
		nomerge
		forcesac
		soft

	amd_iommu=	[HW,X86-84]
			Pass parameters to the AMD IOMMU driver in the system.
			Possible values are:
			isolate - enable device isolation (each device, as far
			          as possible, will get its own protection
			          domain)
	amd_iommu_size= [HW,X86-64]
			Define the size of the aperture for the AMD IOMMU
			driver. Possible values are:
			'32M', '64M' (default), '128M', '256M', '512M', '1G'

The 'iommu=memaper' option is not mentioned in the official documentation.
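The quoted NVIDIA README gives the size as "32MB^order", which is easy to misread as exponentiation. As far as I can tell, the x86-64 boot options documentation describes iommu=memaper[=order] differently: it allocates an aperture of 32MB<<order, which doubles with each order step. The default is order 1, i.e. 64 MB. The Python sketch below works under that assumption. It also includes a hypothetical helper that extracts the option from an iommu= value, assuming the sub-options are comma-separated.

```python
def memaper_size_mb(order=1):
    """Aperture size in MB for iommu=memaper[=order], assuming 32MB << order."""
    if order < 0:
        raise ValueError("order must be non-negative")
    return 32 << order


def memaper_from_cmdline(cmdline):
    """Return the memaper size in MB requested on a kernel command line, or None."""
    for token in cmdline.split():
        if not token.startswith("iommu="):
            continue
        # Assumed: iommu= takes a comma-separated list of sub-options.
        for opt in token[len("iommu="):].split(","):
            if opt == "memaper":
                return memaper_size_mb()
            if opt.startswith("memaper="):
                return memaper_size_mb(int(opt.split("=", 1)[1]))
    return None
```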

I am not sure what the bios and dmesg diagnostic messages are asking me to do with these boot parameters. Please can someone advise me how I should use the iommu boot parameters? iommu=force ?

Thank you.


Comment 49 Paul 2008-11-01 18:58:59 UTC
Update:

On further investigation, I found some additional relevant IOMMU info in my dmesg output:

[    0.490101] PCI-DMA: Disabling AGP.
[    0.490553] PCI-DMA: aperture base @ 20000000 size 65536 KB
[    0.490553] PCI-DMA: using GART IOMMU.
[    0.490553] PCI-DMA: Reserving 64MB of IOMMU area in the AGP aperture

So, my AGP is being disabled by the kernel and GART IOMMU is being used: I don't know if that is the correct type. I notice that there are several types in the boot-options.txt document (which appears to have been deleted from the 2.6.26 documentation? http://www.linuxhq.com/kernel/v2.6/27/Documentation/x86_64/boot-options.txt).

I then found an (annoyingly bloated) forum thread which provided a workaround for the IOMMU problem on a set-up very similar to mine (amd64, ASUS motherboard, PCI Express, AGP getting disabled by the kernel). The workaround is to use the iommu=noaperture boot option (http://www.linuxquestions.org/questions/showthread.php?p=3173751#post3173751).

When I do this I still get a kernel panic. However, the lines:

Aperture pointing to e820 RAM. Ignoring.
Your BIOS doesn't leave a memory aperture hole.
Please enable the IOMMU option in the BIOS setup.
This costs you 64 MB of RAM.

no longer appear at the start. Instead, the kernel panic starts directly with

BUG: unable to handle kernel NULL pointer dereference at 00000000

So it looks like the IOMMU messages might not be the cause of the kernel panic after all. Their appearance might instead be the result of a separate kernel bug: the kernel doesn't recognise that it doesn't need to create a memory aperture when AGP is disabled.

So, I am still none the wiser. I created an account with ASUS support and sent them a support message on 18 October. They said they would try to get back to me within 24 hours, and that if a reply took longer than 48 hours my query would automatically become 'first priority', but I still haven't received a reply.
Comment 50 Andi Kleen 2008-11-02 01:28:06 UTC
Paul, the CPUID4 panic problem is unlikely to have anything to do with the IOMMU
setup. Someone just has to reproduce and fix it, but normally that requires
a mainline kernel (if you want to stay with Ubuntu kernels, you'll have to ask
Ubuntu support).

Also, the IOMMU should be independent of acpi=off, so if it works with acpi=off
then the IOMMU is likely OK.

That iommu=memaper results in a NULL pointer dereference is a regression;
it's best to open a separate bug report for that. Fixing it won't help
your problem, though.
Comment 51 Paul 2008-11-15 08:06:54 UTC
Dear all.

I reported this as a bug with Ubuntu (https://bugs.launchpad.net/ubuntu/+source/linux/+bug/292619) and was given instructions on how to install an upstream kernel (https://wiki.ubuntu.com/KernelTeam/GitKernelBuild), which I have now done.

However, even with the upstream kernel I still get a kernel panic on boot, so this is apparently not an Ubuntu-specific problem, and I have come back here to try to resolve it. What information do I need to provide to help debug this?

Thank you.
Comment 52 Zhang Rui 2008-11-17 00:31:19 UTC
Hmm, it would be great if you could attach a screenshot of the upstream kernel crash.
Comment 53 Paul 2008-11-17 11:21:32 UTC
Created attachment 18889 [details]
Kernel panic screen obtained with the upstream kernel
Comment 54 Shaohua 2008-11-25 18:47:55 UTC
This isn't ACPI related.
Comment 55 Mark Langsdorf 2008-11-25 18:49:32 UTC
I will be unable to access email from November 22nd until the 31st.  I will respond to email sent during this period as soon as I can.

In my absence, refer Linux or Xen issues to the OSRC at osrc@elbe.amd.com.  Personal email should be sent to mlangsdo@io.com.  

-Mark Langsdorf
Operating System Research Center
AMD
Comment 56 Paul 2008-12-04 16:46:52 UTC
If it isn't ACPI related, how do I find out what is causing it?

Shaohua, can you tell from the kernel panic screen that it's not ACPI related, or did you draw that conclusion from other information?

Thank you.
Comment 57 Shaohua 2008-12-04 17:21:26 UTC
The OS can find all CPUs with ACPI enabled. But in this case, it crashes during cache detection, which isn't ACPI related and may be a CPU-specific cache detection issue.
Comment 58 Thomas Gleixner 2008-12-11 23:50:05 UTC
Created attachment 19260 [details]
Patch for testing

Can you please apply the attached patch to the mainline kernel and retry?

Thanks,
       tglx
Comment 59 Paul 2008-12-12 20:11:45 UTC
Created attachment 19269 [details]
Kernel panic on boot with patch id=19260 applied to upstream kernel

I have tried the patch, but I still get kernel panic on boot. I have attached a screen shot. The kernel panic can be suppressed using the acpi=ht boot param. Please let me know if I can provide any more info.
Comment 60 Paul 2008-12-15 12:02:44 UTC
I just updated my bios to the latest version (1102, 2008/11/14, http://support.asus.com/download/download.aspx?SLanguage=en-us&model=M3A-H/HDMI). I have tried with the latest official ubuntu kernel and with the patched mainline kernel. Both give rise to kernel panic screens on boot if I don't use acpi=ht. However, the messages on the kernel panic screens are different. For the custom kernel the trace is so long that more than a whole screen's worth is output. I have taken a picture of the final visible output for both kernels. Will attach now...
Comment 61 Paul 2008-12-15 12:05:23 UTC
Created attachment 19314 [details]
Kernel panic screen on boot. Bios v1102. Latest Ubuntu kernel

Bios v1102.
Latest Ubuntu kernel (2.6.27-9-server)
Comment 62 Paul 2008-12-15 12:07:41 UTC
Created attachment 19315 [details]
Kernel panic screen on boot. Bios v1102. Patched mainline kernel (v2.6.28-rc4-custom)

Bios v1102
Mainline kernel with patch id=19260 applied (v2.6.28-rc4-custom)
Comment 63 Paul 2009-01-12 15:34:06 UTC
Update: I just updated to the latest BIOS (1201). I still get a kernel panic with the latest Ubuntu kernel and with the mainline kernel with patch id=19260 applied (v2.6.28-rc4-custom).
Comment 64 Thomas Gleixner 2009-01-14 05:41:20 UTC
Created attachment 19787 [details]
test patch V2

I just noticed that the previous patch #19260 checks the wrong variable. Can you please retest ?

Thanks,
        tglx
Comment 65 Paul 2009-01-14 16:38:41 UTC
Created attachment 19808 [details]
Kernel panic on boot. Bios v1201. Mainline kernel (v2.6.29-rc1-custom)

This is with the latest mainline kernel (no patch).
Comment 66 Paul 2009-01-14 16:41:02 UTC
Created attachment 19809 [details]
Kernel panic on boot. Bios v1201. Patched (id=19787) mainline kernel (v2.6.29-rc1-custom)

This is with the patched mainline kernel. Still get kernel panic on boot.
Comment 67 Balázs Hámorszky 2009-01-24 18:41:50 UTC
Created attachment 19981 [details]
full trace

The kernel in Splashtop works fine with the M3A-H/HDMI, see:
http://bugzilla.kernel.org/show_bug.cgi?id=12352
Comment 68 Paul 2009-01-25 06:44:38 UTC
So, it looks like this bug has crept in since kernel version 2.6.20 (the kernel version that Splashtop uses, according to the notes on bug 12352). The earliest I have tried is 2.6.24.
Comment 69 Balázs Hámorszky 2009-01-26 04:45:58 UTC
I can't say for sure whether the Splashtop kernel uses ACPI (maybe I'll try to hack a terminal into it), but it can turn off the machine (although afterwards the BIOS complains that the machine failed to boot last time).
It is also possible that DeviceVM changed something in the kernel.
The source is available here (after registration):
http://www.splashtop.com/open_source.php
Comment 70 Martin Reiche 2009-02-19 07:05:58 UTC
As I have the same problem and would like to get this fixed: is there anything else that could be done?
For reference: I reported this bug 2 months ago on Launchpad:
https://bugs.launchpad.net/ubuntu/+source/linux-meta/+bug/305165

In addition to the things already said in this thread:
- The problem only occurs for me if I enable SurroundView in the BIOS. This option is necessary to get Hybrid CrossFire to work. If I disable SurroundView, everything works as it should: I can use either the onboard graphics or the discrete graphics (with SurroundView disabled) without problems.
- Testing acpi=off in GRUB gave me no kernel panic.
- I tested several Live CDs to see whether this problem is Ubuntu-specific. The only Live CD that worked for me was Knoppix 5.1 with a 2.6.19 kernel. Everything else I tried did not work (Ubuntu 8.04.1, 32 & 64bit; Debian 5.0, 32bit; Ubuntu 8.10, 32 & 64bit; OpenSUSE 11.0, 32bit).

Is there more info needed to get this fixed?

- Distribution: Ubuntu 8.10 Intrepid
- Hardware Environment: ASUS M3A-H/HDMI, 2GB RAM, AMD Athlon64 4850e, 1xSATA,
256MB Radeon 3470 PCI Express (discrete) & Radeon HD 3200 (onboard) <-- for hybrid crossfire.
- BIOS version 0801, as later versions have problems with Q-Fan. Later versions such as 1301 do not fix the problem either.
Comment 71 herrmann.der.user 2009-02-19 09:13:37 UTC
Does any of the systems which only work with maxcpus=1 boot with
a kernel using CONFIG_MTRR=n?

There is another bugzilla where similar symptoms and workarounds
are discussed, see bug #11541. It seems that switching off
MTRR code works around the problem.
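To confirm which way a given kernel was built, it is enough to look at its config file. In Kconfig output, a disabled option appears as a comment of the form "# CONFIG_MTRR is not set" rather than as CONFIG_MTRR=n, so a plain grep for "=n" finds nothing. Here is a small Python sketch; the function name is hypothetical.

```python
import re

SET_RE = re.compile(r"^(CONFIG_[A-Za-z0-9_]+)=(.*)$")
NOT_SET_RE = re.compile(r"^# (CONFIG_[A-Za-z0-9_]+) is not set$")


def option_state(config_text, name):
    """Return an option's value, 'n' if it is explicitly unset, or None if absent."""
    for line in config_text.splitlines():
        line = line.strip()
        match = SET_RE.match(line)
        if match and match.group(1) == name:
            return match.group(2)
        match = NOT_SET_RE.match(line)
        if match and match.group(1) == name:
            return "n"
    return None
```

Passing in the contents of the relevant /boot/config-<version> file and checking for "n" shows whether the MTRR code was compiled out.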
Comment 72 Martin Reiche 2009-02-19 13:47:35 UTC
I recompiled the stock Ubuntu 8.10 32bit kernel with CONFIG_MTRR=n and I get no kernel panic, even if I enable SurroundView in the BIOS :-)

But now X doesn't start: "(EE) No devices detected.". But that may be related to the radeon driver. Maybe it would work with fglrx because radeon or radeonhd doesn't manage hybrid crossfire. Or radeon/radeonhd don't know what to do if there are two cards. No idea.
Comment 73 Martin Reiche 2009-02-19 14:43:04 UTC
Ok... X works now. Had to put BusID for the primary card into my xorg.conf.
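One pitfall when filling in BusID: lspci prints the bus, device and function numbers in hexadecimal, whereas xorg.conf expects BusID "PCI:bus:device:function" in decimal. The slots used below are only examples, not the address of the card in this report. Here is a minimal Python sketch of the conversion (the function name is hypothetical):

```python
def lspci_to_xorg_busid(slot):
    """Convert an lspci slot such as '01:00.0' (hex) to an xorg.conf BusID (decimal)."""
    parts = slot.split(":")
    if len(parts) == 3:
        # 'lspci -D' style slots carry a PCI domain prefix; this sketch
        # drops it and assumes domain 0.
        parts = parts[1:]
    if len(parts) != 2 or "." not in parts[1]:
        raise ValueError(f"unexpected lspci slot: {slot!r}")
    bus, devfn = parts
    device, function = devfn.split(".")
    return f"PCI:{int(bus, 16)}:{int(device, 16)}:{int(function, 16)}"
```

For example, a card that lspci lists at 01:00.0 would be written as BusID "PCI:1:0:0" in the Device section.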
Comment 74 herrmann.der.user 2009-02-20 06:31:55 UTC
Paul, can you confirm that switching off MTRR kernel code avoids
the panic on your system, too?

If yes, this bugzilla should be closed as duplicate of bug #11541
Comment 75 Paul 2009-02-20 07:15:45 UTC
Yes. I will do. I plan to try it out this weekend.

Paul
Comment 76 Paul 2009-02-20 18:03:19 UTC
I can confirm that compiling without the MTRR code avoids the kernel panic for me and so the finger is pointed at this code. I agree that the root cause of this kernel panic is likely to be the same as for bug #11541 and so am marking it as a duplicate.

*** This bug has been marked as a duplicate of bug 11541 ***
