Most recent kernel where this bug did not occur: 2.6.13.2
Distribution: kernel.org
Hardware Environment: dual-core Athlon w/4GB RAM, ASUS A8N-SLI Premium motherboard
Software Environment: FC4

Problem Description:
IOMMU handling is broken in 2.6.14-rc2, but it worked in 2.6.13.2. ASUS BIOS 1007 sets up memory remapping (using PAE or hardware on AMD's Rev.E CPUs) to make all 4GB visible to the OS. This works on dual-core Athlons with both SMP and uniprocessor 2.6.13.2 kernels, but is broken in 2.6.14-rc2 SMP kernels, which lock up on bootup unless PCI memory remapping is disabled in BIOS. With remapping disabled the kernel boots, but I see only 3GB instead of 4GB.

Steps to reproduce:
Build the SMP kernel 2.6.14-rc2, enable memory remapping in BIOS, reboot. If the kernel gets that far without crashing, run a large stream benchmark (e.g. 2GB).

SMP kernels based on 2.6.12 boot with memory remapping enabled in BIOS, but lock up once more than 200MB is in use, e.g. by the stream benchmark. This was fixed in 2.6.13, which worked fine, e.g.:

Linux version 2.6.13.1_smp
[...]
BIOS-provided physical RAM map:
 BIOS-e820: 0000000000000000 - 000000000009e800 (usable)
 BIOS-e820: 000000000009e800 - 00000000000a0000 (reserved)
 BIOS-e820: 00000000000f0000 - 0000000000100000 (reserved)
 BIOS-e820: 0000000000100000 - 00000000bfff0000 (usable)
 BIOS-e820: 00000000bfff0000 - 00000000bfff3000 (ACPI NVS)
 BIOS-e820: 00000000bfff3000 - 00000000c0000000 (ACPI data)
 BIOS-e820: 00000000e0000000 - 00000000f0000000 (reserved)
 BIOS-e820: 00000000fec00000 - 0000000100000000 (reserved)
 BIOS-e820: 0000000100000000 - 0000000140000000 (usable)
[...]
Checking aperture...
CPU 0: aperture @ 1a60000000 size 32 MB
Aperture from northbridge cpu 0 too small (32 MB)
No AGP bridge found
Your BIOS doesn't leave a aperture memory hole
Please enable the IOMMU option in the BIOS setup
This costs you 64 MB of RAM
Mapping aperture over 65536 KB of RAM @ 8000000
[...]
PCI-DMA: Disabling AGP.
PCI-DMA: aperture base @ 8000000 size 65536 KB
PCI-DMA: Reserving 64MB of IOMMU area in the AGP aperture

...and the above works fine.

Switching to kernel 2.6.14-rc2, I get a total lockup about 30 seconds after boot when PCI memory remapping is enabled in BIOS. This kernel boots only if BIOS remapping is disabled, but then I see only 3GB (not 4GB) of RAM. The IOMMU-related messages become:

Linux version 2.6.14-rc2_smp
[...]
PCI-DMA: Disabling IOMMU.

Note that CONFIG_GART_IOMMU=y was used in both kernels (the second kernel had fewer configuration options turned on, e.g. no NUMA, since the CPU is a single dual-core Athlon). ASUS BIOS 1007 doesn't have any IOMMU options, but the first example used PCI memory remapping, while the second one crashed unless remapping was off.

I believe that AMD's IOMMU is needed to see the full 4GB, so the fact that the second example locks up unless remapping is disabled is bad news. On today's 64-bit machines, the old 3GB memory limit is unacceptable -- sometimes even 16-32GB needs to be usable in Linux.
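As a rough cross-check of the 3GB vs. 4GB observation, the usable ranges from the e820 map quoted above can be summed on either side of the 4 GiB boundary. This is only a sketch: the ranges are copied from the log in this report, while the helper names are made up for illustration, and the assumption that the range above 4 GiB simply disappears when remapping is disabled is not shown in any log here.

```python
# Usable ranges (start, end) copied from the BIOS-e820 map in the report.
# The end address is treated as exclusive.
USABLE = [
    (0x0000000000000000, 0x000000000009E800),
    (0x0000000000100000, 0x00000000BFFF0000),
    (0x0000000100000000, 0x0000000140000000),
]

FOUR_GIB = 1 << 32


def usable_bytes(ranges):
    """Total number of bytes covered by the given ranges."""
    return sum(end - start for start, end in ranges)


def split_at_4g(ranges):
    """Return (bytes below 4 GiB, bytes at or above 4 GiB)."""
    below = sum(
        min(end, FOUR_GIB) - start for start, end in ranges if start < FOUR_GIB
    )
    above = sum(
        end - max(start, FOUR_GIB) for start, end in ranges if end > FOUR_GIB
    )
    return below, above


below, above = split_at_4g(USABLE)
top = max(end for _, end in USABLE)

print(f"usable below 4 GiB: {below / 2**20:.1f} MiB")
print(f"usable above 4 GiB: {above / 2**20:.1f} MiB")
print(f"highest usable address: {top:#x}")
```

The last usable range starts exactly at the 4 GiB boundary, so the remapped memory is only reachable through addresses that a 32-bit DMA device cannot generate. That is consistent with the kernel wanting an IOMMU (or a software bounce buffer) on this machine once remapping is enabled.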
Can you show the full message log for all cases (especially the lockup) ?
Memory setup is still broken in 2.6.14-rc4 SMP on dual-core Athlons with 4GB physical RAM (it used to work properly in 2.6.13.2 SMP). I cannot provide a log because 2.6.14-rc4 SMP locks up on bootup, but what I could copy from the screen reads:

PCI-DMA: More than 4GB of RAM and no IOMMU.
PCI-DMA: 32bit PCI IO may malfunction.
PCI-DMA: Disabling IOMMU.

...and at that point the new kernel totally locks up.

FYI, ASUS BIOS uses HW memory remapping with Rev. E Athlons (and later) to make all 4GB visible. Also, 2.6.14-rc4 was built ("make oldconfig", taking the defaults on all new options) with the Fedora Core 4 config-2.6.13-1.1526_FC4smp (this FC4 kernel also works fine).
Created attachment 6312
Correct dmesg output under good Fedora Core kernel 2.6.13-1.1526_FC4smp

This dmesg output results when the good Fedora Core kernel 2.6.13-1.1526_FC4smp is used (the system works fine). An identically configured (plus defaults on new config options) 2.6.14-rc4 SMP kernel locks up on bootup after failing to detect the IOMMU.
There are actually no relevant changes between 2.6.13 and 2.6.14. I don't know what kernel changes Fedora made. Can you verify that 2.6.13 from kernel.org also didn't show the problem?
2.6.13.2 from kernel.org worked fine -- and that's why 2.6.13-based FC4 kernels also work fine. 2.6.14-rc{2,4} from kernel.org broke this.
Another interesting memory event, possibly related, observed (only once) with the above-mentioned (good) FC4 kernel: I got a number of memory access faults (non-canonical pointers?) on starting known good binaries, but the problems went away after a reboot. Perhaps even that (good) 2.6.13.2-based FC4 SMP kernel can have intermittent memory setup problems...

WARNING: Kernel Errors Present
RPC: error 5 connecting to ...: 2 Time(s)
mozilla-xremote[3425] general protection rip:3325e121eb rsp:7fffffe2baa8 error:0...: 1 Time(s)
mozilla-xremote[3444] general protection rip:3325e121eb rsp:7ffffff04c48 error:0...: 1 Time(s)
mozilla-xremote[3485] general protection rip:3325e121eb rsp:7fffffb7a828 error:0...: 1 Time(s)
mozilla-xremote[3504] general protection rip:3325e121eb rsp:7fffff8fdac8 error:0...: 1 Time(s)
mozilla-xremote[3523] general protection rip:3325e121eb rsp:7ffffffae898 error:0...: 1 Time(s)
mozilla-xremote[3537] general protection rip:3325e121eb rsp:7fffffd8ad78 error:0...: 1 Time(s)
mozilla-xremote[3567] general protection rip:3325e121eb rsp:7fffff8acbd8 error:0...: 1 Time(s)
mozilla-xremote[3570] general protection rip:3325e121eb rsp:7fffff9e77c8 error:0...: 1 Time(s)
mozilla-xremote[3779] general protection rip:3325e121eb rsp:7fffffb99c78 error:0...: 1 Time(s)
mozilla-xremote[3793] general protection rip:3325e121eb rsp:7fffffa88908 error:0...: 1 Time(s)
mozilla-xremote[4172] general protection rip:3325e121eb rsp:7ffffff31098 error:0...: 1 Time(s)
mozilla-xremote[4189] general protection rip:3325e121eb rsp:7ffffff39ff8 error:0...: 1 Time(s)
mozilla-xremote[4203] general protection rip:3325e121eb rsp:7fffff97f8e8 error:0...: 1 Time(s)
thunderbird-bin[3435] general protection rip:3325e121eb rsp:7fffffb24ae8 error:0...: 1 Time(s)
thunderbird-bin[3454] general protection rip:3325e121eb rsp:7fffffa5eab8 error:0...: 1 Time(s)
thunderbird-bin[3495] general protection rip:3325e121eb rsp:7fffffaf7db8 error:0...: 1 Time(s)
thunderbird-bin[3514] general protection rip:3325e121eb rsp:7fffffb38f58 error:0...: 1 Time(s)
xsetroot[3109] general protection rip:3325e121eb rsp:7fffffc2dff8 error:0...: 1 Time(s)
xsetroot[3608] general protection rip:3325e121eb rsp:7fffffdd1f88 error:0...: 1 Time(s)
--------------------
The latest 2.6.14-based FC4 kernels 2.6.14-1.1637_FC4{smp,} fail to boot, whereas the 2.6.13-based FC4 kernels 2.6.13-1.1532_FC4{smp,} worked fine. The machine uses a dual-core Athlon 64 CPU and an ASUS A8N-SLI Premium motherboard with the latest BIOS 1008. The machine has no AGP bus, only PCI-E.

The problem was (re?)-introduced in 2.6.14, possibly related to SWIOTLB handling in arch/x86_64/kernel/aperture.c -- on bootup, 2.6.14-based kernels complain about not finding an IOMMU and about more than 4GB of RAM, then panic. Since the IOMMU is definitely present, this is probably just a failure to open a suitable aperture for the IOMMU.

Doing a diff on aperture.c between 2.6.13 and 2.6.14, I find the following:

--- kernel-2.6.13/linux-2.6.13/arch/x86_64/kernel/aperture.c 2005-08-28 17:41:01.000000000 -0600
+++ kernel-2.6.14/linux-2.6.14/arch/x86_64/kernel/aperture.c 2005-10-27 18:02:08.000000000 -0600
@@ -245,6 +245,8 @@
 	if (aper_alloc) {
 		/* Got the aperture from the AGP bridge */
+	} else if (swiotlb && !valid_agp) {
+		/* Do nothing */
 	} else if ((!no_iommu && end_pfn >= 0xffffffff>>PAGE_SHIFT) ||
 		   force_iommu ||
 		   valid_agp ||

Could the above change be responsible for the boot failure with 2.6.14-based kernels? Again, 2.6.13-based kernels work fine.
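To make the question about the diff concrete, here is a small model of the branch order in the quoted hunk. It is a sketch, not the kernel code. It covers only the terms visible in the diff (the last condition is truncated in the report and is left out), it assumes 4 KiB pages, and it assumes that `swiotlb` was set on the affected system, which nothing in this report confirms. The function name and the branch labels are made up for illustration.

```python
PAGE_SHIFT = 12  # assumption: 4 KiB pages


def aperture_branch(aper_alloc, swiotlb, valid_agp, no_iommu,
                    force_iommu, end_pfn, patched):
    """Return a label for the branch taken in the quoted if/else chain.

    patched=False models the 2.6.13 side of the diff, patched=True the
    2.6.14 side. Only conditions visible in the diff are modelled.
    """
    if aper_alloc:
        return "agp-bridge"
    if patched and swiotlb and not valid_agp:
        return "do-nothing"
    if ((not no_iommu and end_pfn >= 0xFFFFFFFF >> PAGE_SHIFT)
            or force_iommu or valid_agp):
        return "later-branch"  # body not shown in the diff
    return "none"


# Highest usable address in the reported e820 map is 0x140000000.
end_pfn = 0x140000000 >> PAGE_SHIFT

before = aperture_branch(False, True, False, False, False, end_pfn, patched=False)
after = aperture_branch(False, True, False, False, False, end_pfn, patched=True)
print("2.6.13 order:", before)
print("2.6.14 order:", after)
```

Under these assumptions the added `else if` is reached first on a machine with no valid AGP bridge, so the later branch that applied in 2.6.13 is skipped. Whether that is what actually happens on this board depends on how `swiotlb` gets set, which the diff does not show.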
Some ideas: http://lkml.org/lkml/2005/11/6/54 suggests booting with "iommu=soft swiotlb=65536". The theory proposed therein says that GART is probably not fully initialized in a system that has no AGP -- but that GART is needed anyway. This is a severe kernel-panics-on-bootup type bug introduced in the 2.6.13->14 transition. Reverting to 2.6.13.2 kernels makes the bug go away. Do you still need info, or can this be declared an official Linux kernel bug?
I got my hands on an Asus board (K8N-DL) and tracked it down to a broken MCFG table. The workaround is to boot with pci=nommconf (please type it exactly as written; I suspect quite a few people misspelled it). I have a workaround in progress to fix up the broken MCFG table. pci=nommconf helps you, right?
Correct! Kernel 2.6.14-1.1637_FC4smp boots normally with the pci=nommconf boot line option. Since this doesn't use SWIOTLB, I assume that it should have less overhead than soft IOMMU. With pci=nommconf, I see the following on bootup:

[...]
Allocating PCI resources starting at c2000000 (gap: c0000000:20000000)
Checking aperture...
CPU 0: aperture @ 1a30000000 size 32 MB
Aperture from northbridge cpu 0 too small (32 MB)
No AGP bridge found
Your BIOS doesn't leave a aperture memory hole
Please enable the IOMMU option in the BIOS setup
[...My comment: There is *no* IOMMU option in ASUS BIOS setup...]
This costs you 64 MB of RAM
Mapping aperture over 65536 KB of RAM @ 8000000
Built 1 zonelists
Kernel command line: ro root=LABEL=/ rhgb quiet 3 pci=nommconf
[...]
PCI-DMA: Disabling AGP.
PCI-DMA: aperture base @ 8000000 size 65536 KB
PCI-DMA: Reserving 64MB of IOMMU area in the AGP aperture
[...My comment: There is no AGP on this system...]
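Since misspelling the option was a suspected cause of failed workarounds, one way to confirm that the running kernel really received pci=nommconf is to parse the kernel command line. The sketch below is illustrative only: the function names are hypothetical, and it assumes that `pci=` may carry several comma-separated values and that `/proc/cmdline` is available.

```python
def has_pci_option(cmdline, option):
    """Return True if `option` appears as a value of a pci= parameter."""
    for token in cmdline.split():
        key, sep, value = token.partition("=")
        if key == "pci" and sep and option in value.split(","):
            return True
    return False


def running_cmdline(path="/proc/cmdline"):
    """Read the command line of the running kernel (Linux only)."""
    with open(path) as f:
        return f.read().strip()


# Command line copied from the boot log in this report.
reported = "ro root=LABEL=/ rhgb quiet 3 pci=nommconf"
print(has_pci_option(reported, "nommconf"))
```

On a live system, `has_pci_option(running_cmdline(), "nommconf")` performs the same check against the booted kernel. A misspelling such as `pci=nomconf` does not match, because the comparison is exact.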
Andi, what is the status of this issue?
Should be all solved. Well, the workarounds caused other issues that are not solved yet, but those don't affect this system.