Bug 5343
Summary: | IOMMU setup broken 2.6.13.2 -> 2.6.14-rc2 | ||
---|---|---|---|
Product: | Memory Management | Reporter: | jl-icase |
Component: | Other | Assignee: | Andrew Morton (akpm) |
Status: | CLOSED PATCH_ALREADY_AVAILABLE | ||
Severity: | blocking | CC: | andi-bz, bunk |
Priority: | P2 | ||
Hardware: | i386 | ||
OS: | Linux | ||
Kernel Version: | 2.6.14-rc2 | Subsystem: | |
Regression: | --- | Bisected commit-id: | |
Attachments: | Correct dmesg output under good Fedora Core kernel 2.6.13-1.1526_FC4smp |
Description
jl-icase
2005-10-01 09:46:29 UTC
Can you show the full message log for all cases (especially the lockup)?

Memory setup is still broken in 2.6.14-rc4 SMP on dual core Athlons with 4GB physical RAM (it used to work properly in 2.6.13.2 SMP). I cannot provide a log because 2.6.14-rc4 SMP locks up on bootup, but what I could copy from the screen reads:

    PCI-DMA: More than 4GB of RAM and no IOMMU.
    PCI-DMA: 32bit PCI IO may malfunction.
    PCI-DMA: Disabling IOMMU.

...and at that point the new kernel totally locks up.

FYI, ASUS BIOS uses HW memory remapping with Rev. E Athlons (and later) to make all 4GB visible. Also, 2.6.14-rc4 was built ("make oldconfig" and take defaults on all new options) with Fedora Core 4 config-2.6.13-1.1526_FC4smp (this FC4 kernel also works fine).

Created attachment 6312
Correct dmesg output under good Fedora Core kernel 2.6.13-1.1526_FC4smp
This dmesg output results when the good Fedora Core kernel 2.6.13-1.1526_FC4smp is
used (the system works fine). An identically configured (plus defaults on new
config options) 2.6.14-rc4 SMP kernel locks up on bootup after failing to
detect the IOMMU.
There are actually no relevant changes between 2.6.13 and 2.6.14. I don't know what kernel changes Fedora did. Can you verify that 2.6.13 from kernel.org also didn't show the problem?

2.6.13.2 from kernel.org worked fine -- and that's why 2.6.13-based FC4 kernels also work fine. 2.6.14-rc{2,4} from kernel.org broke this.

Another interesting memory event, possibly related, observed (only once) with the above-mentioned (good) FC4 kernel: I got a number of memory access faults (non-canonical pointers?) on starting known good binaries, but upon reboot problems went away. Perhaps even that (good) 2.6.13.2-based FC4 SMP kernel can have intermittent memory setup problems...

    WARNING: Kernel Errors Present
    RPC: error 5 connecting to ...: 2 Time(s)
    mozilla-xremote[3425] general protection rip:3325e121eb rsp:7fffffe2baa8 error:0...: 1 Time(s)
    mozilla-xremote[3444] general protection rip:3325e121eb rsp:7ffffff04c48 error:0...: 1 Time(s)
    mozilla-xremote[3485] general protection rip:3325e121eb rsp:7fffffb7a828 error:0...: 1 Time(s)
    mozilla-xremote[3504] general protection rip:3325e121eb rsp:7fffff8fdac8 error:0...: 1 Time(s)
    mozilla-xremote[3523] general protection rip:3325e121eb rsp:7ffffffae898 error:0...: 1 Time(s)
    mozilla-xremote[3537] general protection rip:3325e121eb rsp:7fffffd8ad78 error:0...: 1 Time(s)
    mozilla-xremote[3567] general protection rip:3325e121eb rsp:7fffff8acbd8 error:0...: 1 Time(s)
    mozilla-xremote[3570] general protection rip:3325e121eb rsp:7fffff9e77c8 error:0...: 1 Time(s)
    mozilla-xremote[3779] general protection rip:3325e121eb rsp:7fffffb99c78 error:0...: 1 Time(s)
    mozilla-xremote[3793] general protection rip:3325e121eb rsp:7fffffa88908 error:0...: 1 Time(s)
    mozilla-xremote[4172] general protection rip:3325e121eb rsp:7ffffff31098 error:0...: 1 Time(s)
    mozilla-xremote[4189] general protection rip:3325e121eb rsp:7ffffff39ff8 error:0...: 1 Time(s)
    mozilla-xremote[4203] general protection rip:3325e121eb rsp:7fffff97f8e8 error:0...: 1 Time(s)
    thunderbird-bin[3435] general protection rip:3325e121eb rsp:7fffffb24ae8 error:0...: 1 Time(s)
    thunderbird-bin[3454] general protection rip:3325e121eb rsp:7fffffa5eab8 error:0...: 1 Time(s)
    thunderbird-bin[3495] general protection rip:3325e121eb rsp:7fffffaf7db8 error:0...: 1 Time(s)
    thunderbird-bin[3514] general protection rip:3325e121eb rsp:7fffffb38f58 error:0...: 1 Time(s)
    xsetroot[3109] general protection rip:3325e121eb rsp:7fffffc2dff8 error:0...: 1 Time(s)
    xsetroot[3608] general protection rip:3325e121eb rsp:7fffffdd1f88 error:0...: 1 Time(s)
    --------------------

The latest 2.6.14-based FC4 kernels 2.6.14-1.1637_FC4{smp,} fail to boot, whereas 2.6.13-based FC4 kernels 2.6.13-1.1532_FC4{smp,} worked fine. The machine uses a dual-core Athlon 64 CPU and ASUS A8N-SLI Premium motherboard with the latest BIOS 1008. The machine doesn't have AGP bus, only PCI-E.

The problem was (re?)-introduced in 2.6.14, possibly related to SWIOTLB handling in arch/x86_64/kernel/aperture.c -- on bootup, 2.6.14-based kernels complain about not finding IOMMU and more than 4GB of RAM, then panic. Since the IOMMU is definitively present, this is probably just a failure to open a suitable aperture for the IOMMU. Doing a diff on aperture.c between 2.6.13 and 2.6.14, I find the following:

    --- kernel-2.6.13/linux-2.6.13/arch/x86_64/kernel/aperture.c 2005-08-28 17:41:01.000000000 -0600
    +++ kernel-2.6.14/linux-2.6.14/arch/x86_64/kernel/aperture.c 2005-10-27 18:02:08.000000000 -0600
    @@ -245,6 +245,8 @@
         if (aper_alloc) {
             /* Got the aperture from the AGP bridge */
    +    } else if (swiotlb && !valid_agp) {
    +        /* Do nothing */
         } else if ((!no_iommu && end_pfn >= 0xffffffff>>PAGE_SHIFT) ||
                    force_iommu ||
                    valid_agp ||

Could the above change be responsible for boot failure with 2.6.14-based kernels? Again, 2.6.13-based kernels work fine.

Some ideas: http://lkml.org/lkml/2005/11/6/54 suggests booting with "iommu=soft swiotlb=65536".
The theory proposed therein says that GART is probably not fully initialized in a system that has no AGP -- but that GART is needed anyway.

This is a severe kernel-panics-on-bootup type bug introduced in the 2.6.13->14 transition. Reverting to 2.6.13.2 kernels makes the bug go away. Do you still need info, or can this be declared an official Linux kernel bug?

I got my hands on an Asus board (K8N-DL) and tracked it down to a broken MCFG table. The workaround is to boot with pci=nommconf (please type exactly as written, I suspect quite a few people misspelled it). I have a workaround in process to fix up the broken MCFG table.

pci=nommconf helps you, right?

Correct! Kernel 2.6.14-1.1637_FC4smp boots normally with the pci=nommconf boot line option. Since this doesn't use SWIOTLB, I assume that it should have less overhead than soft IOMMU. With pci=nommconf, I see the following on bootup:

    [...]
    Allocating PCI resources starting at c2000000 (gap: c0000000:20000000)
    Checking aperture...
    CPU 0: aperture @ 1a30000000 size 32 MB
    Aperture from northbridge cpu 0 too small (32 MB)
    No AGP bridge found
    Your BIOS doesn't leave a aperture memory hole
    Please enable the IOMMU option in the BIOS setup
    [...My comment: There is *no* IOMMU option in ASUS BIOS setup...]
    This costs you 64 MB of RAM
    Mapping aperture over 65536 KB of RAM @ 8000000
    Built 1 zonelists
    Kernel command line: ro root=LABEL=/ rhgb quiet 3 pci=nommconf
    [...]
    PCI-DMA: Disabling AGP.
    PCI-DMA: aperture base @ 8000000 size 65536 KB
    PCI-DMA: Reserving 64MB of IOMMU area in the AGP aperture
    [...My comment: There is no AGP on this system...]

Andi, what is the status of this issue?

Should be all solved. Well the workarounds caused other issues that are not solved yet, but that doesn't affect this system.