Kernel Bug Tracker – Bug 11821
System crashes/unstable with intel_iommu
Last modified: 2013-04-09 06:23:26 UTC
When Intel VT is enabled in the BIOS of my Dell Latitude E6400, the system becomes severely unstable and crashes during boot. I can work around the instability with intel_iommu=off.
Depending on the precise kernel version (and on the phase of the moon?), the crashes/instability may differ. With stock 126.96.36.199:
- During the boot process, different stack traces start popping up one after another. Around the time when X is started, the machine stops responding.
- If I boot the machine in runlevel 3, stack traces continue to scroll over the screen (too fast to be able to read them). Several minutes after the series of stack traces has appeared, the machine locks up completely and the scroll, caps and num lock LEDs start blinking.
I was able to read a part of the last two backtraces.
General protection fault
And the last one which locked up the machine, after which the machine halted itself:
With other kernels (Mandriva's kernels), the system boots (with a backtrace when hald-addon-, but the e1000e network card times out after a few seconds of data transfer, needing a reboot to make it work again. With intel_iommu=off, the e1000e network card is perfectly stable.
Created attachment 18428 [details]
Created attachment 18429 [details]
dmesg 188.8.131.52 with intel_iommu=off
Created attachment 18431 [details]
dmesg Mandriva's TMB kernel based on 2.6.27-rc8 with intel_iommu
The system boots and is somewhat usable with this kernel with intel_iommu enabled. However, there is still one backtrace when hald-addon-dell-backlight is started, and you also see the e1000e timeout error appearing in this configuration. Booting this same kernel with intel_iommu=off makes both problems disappear.
Same problem here with the same system (Dell Latitude E6400), but it appeared only when I enabled the _Support for DMA Remapping Devices_ option in the kernel configuration (Bus Options section) while searching for about 200 MB of missing RAM.
On an ACPI DMAR-capable system such as the Latitude, with that option enabled, I see almost all of the RAM, but the kernel becomes unstable and reports "BUG: unable to handle kernel NULL pointer dereference...".
I think this problem is related to the DMAR part rather than to the IOMMU as a whole.
We are trying to get the IOMMU working on laptops and a desktop (Dell Latitude E6400, Lenovo X200 and Dell Precision T5400). However, we have exactly the same problem on all three computers. We get interrupts with three types of errors reported:
a) PTE doesn't have write bit set
b) PTE doesn't have read bit set
c) Present bit in root entry is clear
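For anyone reading these three messages, here is a minimal sketch of how they could be grouped when triaging. The numeric reason codes (0x1, 0x5, 0x6) are an assumption based on my reading of the VT-d fault reason encoding and are not stated in this bug; the type and function names are hypothetical, and this is user-space C, not kernel code.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical grouping: which translation level raised the fault. */
enum dmar_fault_level { FAULT_LEVEL_ROOT, FAULT_LEVEL_PAGE, FAULT_LEVEL_OTHER };

struct dmar_fault_info {
    enum dmar_fault_level level;
    const char *hint;
};

/* Reason codes are assumed (0x1, 0x5, 0x6); only the three faults
 * reported in this comment are handled, everything else is "other". */
void classify_dmar_fault(unsigned int reason, struct dmar_fault_info *out)
{
    assert(out != NULL);
    switch (reason) {
    case 0x1: /* c) present bit in root entry is clear */
        out->level = FAULT_LEVEL_ROOT;
        out->hint = "no translation set up for the requesting device's bus";
        break;
    case 0x5: /* a) PTE doesn't have write bit set */
        out->level = FAULT_LEVEL_PAGE;
        out->hint = "device wrote to an address not mapped writable";
        break;
    case 0x6: /* b) PTE doesn't have read bit set */
        out->level = FAULT_LEVEL_PAGE;
        out->hint = "device read from an address not mapped readable";
        break;
    default:
        out->level = FAULT_LEVEL_OTHER;
        out->hint = "not one of the three faults listed here";
        break;
    }
}
```

Faults a) and b) come from the page-table level, which is also what an unmapped address looks like, while c) means the lookup failed before any page table was reached.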
We have reported more details at https://bugzilla.redhat.co/show_bug.cgi?id=479996
Please advise us on how to debug it: which pieces should we look at, print, check, etc.?
1. Could we blame BIOS for that?
2. Is it possible that we built the kernel with options that conflict with the iommu implementation in some way? If so, could you share a binary/source/config that works for you?
3. Which desktops/laptops have been tested and are known to work?
4. KVM folks, does KVM+IOMMU work?
5. We used 2.6.28-rc2 and an up-to-date git kernel. The iommu code is very different between them. Why has it changed so drastically, and who is in charge of that?
(In reply to comment #5)
> We are trying to get the IOMMU working on laptops and a desktop (Dell Latitude
> E6400, Lenovo X200 and Dell Precision T5400). However, we have exactly the
> same problem on all three computers. We get interrupts with three types of
> errors reported:
> a) PTE doesn't have write bit set
> b) PTE doesn't have read bit set
> c) Present bit in root entry is clear
Oleg, these problems are better tracked in bug #12578
The issue in this bug is a NULL dereference in pci_find_upstream_pcie_bridge()
Can you do me a favor and get the acpidump of your Dell Latitude E6400 laptop?
I need this to support a new feature in Linux kernel.
BTW: you can get the latest pmtools at http://www.lesswatts.org/projects/acpi/utilities.php
build and run "acpidump > acpidump.log" to get the acpidump output.
Created attachment 21824 [details]
acpidump Dell Latitude E6400 (BIOS version A06)
Not sure this is relevant, but in the X200 BIOS changelog I found this:
BIOS: 3.02 / ECP: 1.04
* (Fix) Fixed an issue that might not get the Operating System started after applying the BIOS version 3.01 if both the Intel(R) Virtualization Technology setting and the Intel(R) VT-d Feature setting are enabled under Config and CPU in the BIOS Setup Utility.
Created attachment 22204 [details]
dmesg with 2.6.30(.1)
Oops remains with
Author: Kenji Kaneshige <email@example.com>
Date: Tue Feb 17 14:14:36 2009 +0900
(Kernel >= 2.6.30)
But changed from "pci_find_upstream_pcie_bridge" to "get_domain_for_dev"
Created attachment 22205 [details]
Please could you provide output of 'lspci -vt' and dmesg with the attached patch (and with iommu enabled).
Hm, this isn't even a PCI device. We shouldn't be in IOMMU code for this at all.
I suspect that we should be setting archdata.dma_ops for PCI devices only, and _not_ defaulting to the IOMMU dma_ops for all devices.
Although the dcdbas driver is arguably buggy, since it's abusing the DMA API to just allocate memory below 4GiB.
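To illustrate the idea of attaching the IOMMU dma_ops to PCI devices only, here is a minimal user-space sketch. All types and names below are hypothetical stand-ins and not the kernel's actual structures; the `dma_ops` member only plays the role that archdata.dma_ops plays in the comment above.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins for kernel structures, for illustration only. */
enum bus_kind { BUS_PCI, BUS_OTHER };

struct dma_ops { const char *name; };

struct device {
    enum bus_kind bus;
    const struct dma_ops *dma_ops; /* per-device override, NULL if unset */
};

const struct dma_ops nommu_dma_ops = { "nommu" };
const struct dma_ops intel_dma_ops = { "intel-iommu" };

/* Attach the IOMMU ops only to PCI devices; leave everything else alone. */
void attach_iommu_ops(struct device *dev)
{
    assert(dev != NULL);
    if (dev->bus == BUS_PCI)
        dev->dma_ops = &intel_dma_ops;
}

/* The global default stays on the non-IOMMU path, so a non-PCI device
 * never reaches IOMMU code that assumes a PCI device underneath. */
const struct dma_ops *get_dma_ops(const struct device *dev)
{
    if (dev == NULL || dev->dma_ops == NULL)
        return &nommu_dma_ops;
    return dev->dma_ops;
}
```

With this arrangement a non-PCI device such as the one behind dcdbas would fall through to the default ops instead of being walked as if it sat on a PCI bus.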
Re: dcdbas. Is there another way to allocate memory guaranteed to be below 4GB without using the DMA API? The calls are to a "device" (though that happens to be SMM) which requires memory below 4GB.
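For reference, the constraint being asked about can be written down as a simple range check. This is only a sketch of the "below 4GB" requirement, not an answer to which kernel allocator to use; the function name and the macro are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Highest address reachable with 32 address bits (just under 4 GiB). */
#define DMA_MASK_32BIT UINT64_C(0xffffffff)

/* True if the whole buffer [phys, phys + size) lies at or below mask. */
bool dma_range_fits(uint64_t phys, uint64_t size, uint64_t mask)
{
    assert(size > 0);
    uint64_t last = phys + size - 1; /* last byte of the buffer */
    if (last < phys)                 /* address arithmetic wrapped around */
        return false;
    return last <= mask;
}
```

Note that the check is on the last byte, so a buffer that starts below 4 GiB but crosses the boundary is rejected as well.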
Created attachment 22207 [details]
I think this should fix the crash, but the driver may well still not work if you have memory above 4GiB.
Created attachment 22208 [details]
workaround for dcdbas driver
This might even make the offending driver work, but I don't think it's the right thing to do.
Created attachment 22231 [details]
dmesg with iommu-2.6 git kernel tree
I tried the debug patch, but it doesn't add anything to the already known output. I also tried to add the proposed patches to my kernel tree, without success. So instead of finding the right kernel tree to patch, or patching it myself, I used the iommu-2.6 git branch directly (commit 3dfc813d94bba2046c6aed216e0fd69ac93a8e03).
Here is the dmesg log. I had to remove the inteldrm(fb) driver and use the default VESA frame buffer driver because of a useless blank screen.
The oops doesn't occur anymore, but DMAR seems to fail on the graphics controller (MTRR also has some problems with that ... I don't know if they're related).
We know the graphics driver is broken and doesn't use the DMA API properly. For now, you need to enable the "workaround broken graphics drivers" option (CONFIG_DMAR_BROKEN_GFX_WA) or boot with 'intel_iommu=igfx_off' on your command line to make it work.
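As a rough illustration of how the two command-line values mentioned in this bug differ, here is a small user-space sketch. It assumes the option value is a comma-separated list of tokens; the struct, its fields and the function name are hypothetical and this is not the kernel's own parser.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

/* Hypothetical result of parsing the value after "intel_iommu=". */
struct iommu_opts {
    bool disabled;     /* set by "off": no DMA remapping at all */
    bool map_graphics; /* cleared by "igfx_off": graphics is left out */
};

/* Parse a comma-separated value such as "off" or "igfx_off".
 * Tokens not mentioned in this bug are ignored. */
void parse_intel_iommu(const char *value, struct iommu_opts *opts)
{
    assert(value != NULL && opts != NULL);
    opts->disabled = false;
    opts->map_graphics = true;
    while (*value != '\0') {
        size_t len = strcspn(value, ","); /* length of the current token */
        if (len == 3 && strncmp(value, "off", 3) == 0)
            opts->disabled = true;
        else if (len == 8 && strncmp(value, "igfx_off", 8) == 0)
            opts->map_graphics = false;
        value += len;
        if (*value == ',')
            value++; /* skip the separator */
    }
}
```

The point of the distinction is that igfx_off keeps remapping active for other devices such as the e1000e, while off disables it entirely.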
The tree you tested has my patch from comment #14, but not the one from comment #15 -- since you _do_ have memory above 4GiB, the dcdbas driver might still be unreliable. I'll leave the Dell folks to work out what they want to do about that.
Thanks for testing.