Subject : [regression] [drm:i915_gem_execbuffer] *ERROR* Execbuf while wedged
Submitter : Michael <schnitzelkuchen@googlemail.com>
Date : 2009-11-15 10:48
References : http://lkml.org/lkml/2009/11/15/40
Notify-Also : David Woodhouse <david.woodhouse@intel.com>

This entry is being used for tracking a regression from 2.6.31. Please don't close it until the problem is fixed in the mainline.
I'm expecting the graphics folks to look into this. Zhenyu?
The fix is probably http://git.infradead.org/users/dwmw2/iommu-agp.git/commitdiff/135cbc4e Waiting for confirmation from reporter.
http://git.kernel.org/linus/ec402ba97a6479dd80488b4404a73275e894289f
Closing.
Was it ever confirmed that this actually fixed the problem?
(In reply to comment #5)
> Was it ever confirmed that this actually fixed the problem?

No, I didn't confirm that the problem was fixed, and the problem isn't fixed (at least for me/my laptop).
Zhenyu, what's the status on this one?
It looks like this happens on 965GM (without VT-d), but I couldn't reproduce it on a T61 the last time I tried.
Right, 965GM (without VT-d) is my setup. It may help to know that I have 4G of RAM. If any additional tests or output are needed, I am willing to help.
I can confirm this with a GM965 (Thinkpad R61). It also has 4G of RAM.
Today I've reproduced this bug with vanilla 2.6.32 kernel on 945GM hardware. So this bug is not 965-specific.
I suffer from the same issue on a Dell D830 (T7500, GM965, 4GB RAM) and I've found a workaround: limiting the kernel memory by booting with "mem=...". With "mem=4400M" or more I can see the bug when starting gdm. If I use less memory the bug becomes harder to reproduce, but I was able to reproduce it with "mem=4300". I wasn't able to reproduce the bug with "mem=4275M" or less. Tell me if you need more info or tests.
Hi again, I played with what Marco described, and I can confirm this behavior. Currently my kernel uses 3 of the 4G I have installed and the system runs just fine (no problems so far, that is). Cheers, Marcus
I can also confirm the workaround. I am currently running 2.6.32 with mem=4075M. I have no more problems with X, 2d and 3d work. Hope this helps, brot
With the newest 2.6.33-rc1 kernel I am getting the same errors. dmesg says:

[ 68.174125] [drm:i915_hangcheck_elapsed] *ERROR* Hangcheck timer elapsed... GPU hung
[ 68.174134] render error detected, EIR: 0x00000000
[ 68.174180] [drm:i915_do_wait_request] *ERROR* i915_do_wait_request returns -5 (awaiting 751 at 749)

Hope we can get this fixed for the 2.6.33 kernel :)
Sorry, I still don't have the memory upgrade on my 965GM that I need to reproduce this.
I'm not sure if the cause is different but I'm seeing the same error message on my 945GM (intel 2.9.99.902, mesa 7.7, libdrm 2.4.17 and kernel 2.6.32.2 (which contains the commit from comment #3)) with only 2gig system memory.
Please attach full dmesg.
A full dmesg is hard to come by, as it gets filled up with this message over and over until the fifo is exhausted, so I quite literally only get this error message in the log. I've gone back to the v2.6.31.6 kernel to see if this helps, but it seems to have replaced the error with something else: [drm:i915_gem_object_bind_to_gtt] *ERROR* Invalid object alignment requested 4096 (no other info in dmesg there). I fear this is a different problem, so I don't want to pollute this bug report. I can open a new one if preferred, or feel free to ask for other debug info/bisects etc.
*** Bug 14862 has been marked as a duplicate of this bug. ***
Created attachment 24373 [details] disable pci dma mapping in non-iommu case

On non-IOMMU machines with large memory, swiotlb is used, but it looks like our access to the graphics buffers doesn't work properly in that case. I'm not quite sure which part is really broken; my guess is the hardware status page access, but its coherency is already guarded by the GPU... This patch tries to do PCI DMA mapping only when real IOMMU hardware is available, which reverts to my original patch's behavior. David, what do you think about it?
After talking with Shaohua Li: in the swiotlb case with a bounce buffer, the current GEM driver can't work correctly. The hardware status page is critical for tracking requests to the GPU; although the GPU sets it up as cache coherent, in the bounce buffer case it can't really be coherent without a sync operation. And on a GEM domain change, which needs to flush the CPU cache, the cache for the bounce buffer never gets flushed either. So disabling PCI DMA mapping looks like the correct solution for now.
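A toy userspace model may help picture the bounce buffer problem described above. It is only a sketch: every name in it (toy_mapping, toy_device_write_seqno, and so on) is hypothetical, nothing here is kernel or i915 code, and it ignores CPU caches entirely. It only shows that a value written to the device's copy of a page stays invisible to the CPU's copy until something copies it back.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define TOY_PAGE_BYTES 4096

/* Toy model of a bounced mapping (hypothetical names, not kernel code):
 * the device only ever touches `bounce`, the CPU only ever reads `cpu_page`. */
struct toy_mapping {
    uint8_t cpu_page[TOY_PAGE_BYTES]; /* the driver's view, e.g. a status page */
    uint8_t bounce[TOY_PAGE_BYTES];   /* low-memory copy the device writes to */
};

/* Device side: write a completed sequence number into the device's view. */
static void toy_device_write_seqno(struct toy_mapping *m, size_t offset,
                                   uint32_t seqno)
{
    assert(offset + sizeof seqno <= TOY_PAGE_BYTES);
    memcpy(m->bounce + offset, &seqno, sizeof seqno);
}

/* CPU side: read the sequence number from the driver's own view. */
static uint32_t toy_cpu_read_seqno(const struct toy_mapping *m, size_t offset)
{
    uint32_t seqno;

    assert(offset + sizeof seqno <= TOY_PAGE_BYTES);
    memcpy(&seqno, m->cpu_page + offset, sizeof seqno);
    return seqno;
}

/* Explicit sync step: copy the device's view back into the CPU's view. */
static void toy_sync_for_cpu(struct toy_mapping *m)
{
    memcpy(m->cpu_page, m->bounce, TOY_PAGE_BYTES);
}
```

In this model, a driver that calls toy_cpu_read_seqno() without first calling toy_sync_for_cpu() keeps seeing the old sequence number, which is one plausible way to end up waiting on a request that the hardware has already completed.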
I have applied the "disable pci dma mapping in non-iommu case" patch to the 2.6.32-gentoo-r1 kernel, and everything works like it should :)
I can confirm too, that the "disable pci dma mapping in non-iommu case" patch seems to fix that issue, no more black/corrupted screen and the performance (glxgears ;o)) is on par with 2.6.31. Thank you, Marcus
Handled-By : zhenyuw <zhenyuw@linux.intel.com>
Patch : http://bugzilla.kernel.org/attachment.cgi?id=24373
Created attachment 24389 [details] set dma mask in i915 driver

Please test this patch instead, and revert the agp patch above. David noticed that if we've set up a 36-bit DMA mask properly, swiotlb shouldn't be a problem, as dma_capable will be true and map_page will return an unmangled address. I don't have a test machine, so please help test this one. Thanks.
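The reasoning about the 36-bit mask can be sketched in userspace C. This is a minimal model, assuming the capability check boils down to "the end of the buffer still fits under the device's mask"; toy_dma_bit_mask() and toy_dma_capable() are hypothetical stand-ins named after the kernel's DMA_BIT_MASK() and dma_capable(), not the kernel's own code.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Mask with the low `bits` bits set, in the spirit of DMA_BIT_MASK(). */
static uint64_t toy_dma_bit_mask(unsigned bits)
{
    assert(bits >= 1 && bits <= 64);
    /* Shifting a 64-bit value by 64 is undefined, so handle it separately. */
    return bits == 64 ? ~UINT64_C(0) : (UINT64_C(1) << bits) - 1;
}

/* Assumed check: a buffer is directly reachable by the device when its end
 * address does not exceed the mask. When this is false, a swiotlb-style
 * layer would have to substitute a bounce buffer address instead. */
static bool toy_dma_capable(uint64_t mask, uint64_t addr, size_t size)
{
    return addr + size <= mask;
}
```

With a 32-bit mask, a page sitting at the 4 GiB boundary fails the check, while the same page passes with a 36-bit mask. That matches the observation in this thread that only machines with about 4G of RAM or more were affected.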
The drm-i915-set-dma-mask patch doesn't work for me. Same as before. The agp-intel-pci-map-only-for-iommu-detected patch does work, but I also needed to add the following line to compile intel-agp as a module:

diff --git a/arch/x86/kernel/pci-dma.c b/arch/x86/kernel/pci-dma.c
index 75e14e2..79595e2 100644
--- a/arch/x86/kernel/pci-dma.c
+++ b/arch/x86/kernel/pci-dma.c
@@ -33,6 +33,7 @@ int iommu_merge __read_mostly = 0;
 int no_iommu __read_mostly;
 /* Set this to 1 if there is a HW IOMMU in the system */
 int iommu_detected __read_mostly = 0;
+EXPORT_SYMBOL(iommu_detected);
 
 /*
  * This variable becomes 1 if iommu=pt is passed on the kernel command line.

Thanks.
*** Bug 14728 has been marked as a duplicate of this bug. ***
The drm-i915-set-dma-mask patch does not work here. dmesg says:

[ 45.593231] [drm:i915_hangcheck_elapsed] *ERROR* Hangcheck timer elapsed... GPU hung
[ 45.593244] render error detected, EIR: 0x00000000
[ 45.593248] i915: Waking up sleeping processes
[ 45.593270] [drm:i915_wait_request] *ERROR* i915_wait_request returns -5 (awaiting 2 at 1)
[ 45.661072] sky2 eth0: Link is up at 100 Mbps, full duplex, flow control both
[ 45.661735] ADDRCONF(NETDEV_CHANGE): eth0: link becomes ready
[ 50.689226] [drm:i915_hangcheck_elapsed] *ERROR* Hangcheck timer elapsed... GPU hung
[ 50.689234] render error detected, EIR: 0x00000000
[ 50.689239] i915: Waking up sleeping processes
Created attachment 24430 [details] remove dma mask setting in drm_pci_alloc()

Here's the refreshed patch, after investigating the real cause of the DMA mask failure. Please help verify this one, thanks.
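The thread does not spell out the root cause, but one reading of the patch title is that an allocation helper was writing its own DMA mask and thereby overriding the one the driver had set. The sketch below models only that reading: toy_device, toy_driver_init(), toy_alloc_setting_mask() and toy_alloc_keeping_mask() are all hypothetical names, and none of this is the actual drm_pci_alloc() code.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in for a device; only the DMA mask matters here. */
struct toy_device {
    uint64_t dma_mask;
};

/* Driver init: declare that the device can address 36 bits. */
static void toy_driver_init(struct toy_device *dev)
{
    dev->dma_mask = (UINT64_C(1) << 36) - 1;
}

/* Allocation helper that also writes the mask from a caller-supplied limit,
 * silently replacing whatever the driver configured earlier. */
static void toy_alloc_setting_mask(struct toy_device *dev, uint64_t maxaddr)
{
    assert(maxaddr != 0);
    dev->dma_mask = maxaddr;
    /* ... the allocation itself is omitted in this sketch ... */
}

/* Same helper with the mask write removed: the driver's setting survives. */
static void toy_alloc_keeping_mask(struct toy_device *dev)
{
    (void)dev;
    /* ... the allocation itself is omitted in this sketch ... */
}
```

Under this assumption, a single call to the first helper with a 32-bit limit would undo a 36-bit mask set during driver init, which would explain why setting the mask in the driver alone did not help the testers above.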
The "remove dma mask"-patch fixes the problem on my machine (GM965, 2.6.32.2 vanilla kernel). Thank you Martin
I can confirm, it fixed my bug.
Feels like fixed here too (GM965, 2.6.32-zen4 + patch). Cheers, Marcus
Patch : http://bugzilla.kernel.org/attachment.cgi?id=24430
Just wanted to confirm that the new "remove dma mask setting in drm_pci_alloc()" patch works for me, too, on 2.6.33-rc2, but resuming from suspend only works with KMS enabled. But that is probably something different. Thanks a lot, Henry.
I can confirm that the new 'remove dma mask setting' patch fixes it here too. Is there any chance of this making 2.6.32.3?
I'm not sure if it's the same bug, but it seems awfully similar. Anyway, when the GPU freezes here, dmesg gets spammed with:

[drm:i915_gem_execbuffer] *ERROR* Execbuf while wedged

I've already tried http://bugzilla.kernel.org/attachment.cgi?id=24430 with 2.6.32.2 and the issue is still there. Hardware is:

00:02.0 VGA compatible controller [0300]: Intel Corporation Mobile 945GM/GMS, 943/940GML Express Integrated Graphics Controller [8086:27a2] (rev 03)
I've run tests on 945GM and 945GME, which seem fine. You can try disabling CONFIG_DMAR in your kernel to see if you can still reproduce the problem. If you can, it is probably a different problem. You may open another bug, as this one is mostly for the problem on 965G. Dave has sent the fix patch to Linus.
Tomas, can you CC me on any 945 bug you open regarding the comment above? I've seen this same behaviour (as per my comment further up) and it's unlikely I'll be able to test the CONFIG_DMAR suggestion anytime soon, so you'll probably beat me to it!
Fixed by commit e6be8d9d17bd44061116f601fe2609b3ace7aa69 .