Bug 14627 - i915: *ERROR* Execbuf while wedged
Summary: i915: *ERROR* Execbuf while wedged
Status: CLOSED CODE_FIX
Alias: None
Product: Drivers
Classification: Unclassified
Component: Video(DRI - Intel)
Hardware: All Linux
Importance: P1 normal
Assignee: Wang, Zhenyu Z
URL:
Keywords:
Duplicates: 14728 14862
Depends on:
Blocks: 14230
Reported: 2009-11-16 22:10 UTC by Rafael J. Wysocki
Modified: 2010-01-15 09:01 UTC
CC List: 16 users

See Also:
Kernel Version: 2.6.32-rc6
Subsystem:
Regression: Yes
Bisected commit-id:


Attachments
disable pci dma mapping in non-iommu case (3.08 KB, patch)
2009-12-30 02:23 UTC, zhenyuw
set dma mask in i915 driver (587 bytes, patch)
2009-12-31 04:46 UTC, zhenyuw
remove dma mask setting in drm_pci_alloc() (4.31 KB, patch)
2010-01-04 12:03 UTC, zhenyuw

Description Rafael J. Wysocki 2009-11-16 22:10:42 UTC
Subject    : [regression] [drm:i915_gem_execbuffer] *ERROR* Execbuf while wedged
Submitter  : Michael <schnitzelkuchen@googlemail.com>
Date       : 2009-11-15 10:48
References : http://lkml.org/lkml/2009/11/15/40
Notify-Also : David Woodhouse <david.woodhouse@intel.com>

This entry is being used for tracking a regression from 2.6.31.  Please don't
close it until the problem is fixed in the mainline.
Comment 1 David Woodhouse 2009-11-17 23:01:08 UTC
I'm expecting the graphics folks to look into this. Zhenyu?
Comment 2 David Woodhouse 2009-11-18 22:09:38 UTC
The fix is probably http://git.infradead.org/users/dwmw2/iommu-agp.git/commitdiff/135cbc4e

Waiting for confirmation from reporter.
Comment 4 Rafael J. Wysocki 2009-11-20 20:43:57 UTC
Closing.
Comment 5 David Woodhouse 2009-11-20 23:31:23 UTC
Was it ever confirmed that this actually fixed the problem?
Comment 6 Michael Groh 2009-11-24 09:07:48 UTC
(In reply to comment #5)
> Was it ever confirmed that this actually fixed the problem?

No, I didn't confirm that the problem was fixed, and the problem isn't fixed (at least on my laptop).
Comment 7 Jesse Barnes 2009-12-02 19:49:14 UTC
Zhenyu, what's the status on this one?
Comment 8 zhenyuw 2009-12-03 01:22:55 UTC
It looks like this happens on 965GM (without VT-d), but I couldn't reproduce it on a T61 the last time I tried.
Comment 9 Michael Groh 2009-12-03 09:05:32 UTC
Right, 965GM (without VT-d) is my setup. It may help to know that I have 4G of RAM.

If any additional tests or output are needed, I am willing to help.
Comment 10 Marcus Fritzsch 2009-12-08 09:31:55 UTC
I can confirm this with a GM965 (Thinkpad R61). Having 4G of RAM too.
Comment 11 Vasily Khoruzhick 2009-12-09 13:56:19 UTC
Today I reproduced this bug with a vanilla 2.6.32 kernel on 945GM hardware. So this bug is not 965-specific.
Comment 12 Marco Innocenti 2009-12-12 08:26:38 UTC
I suffer from the same issue on a Dell D830 (T7500, GM965, 4GB RAM) and I've found a workaround: limiting the kernel memory by booting with "mem=...".
With "mem=4400M" or more I can see the bug when starting gdm. If I use less memory the bug becomes harder to reproduce, but I was able to reproduce it with "mem=4300". I wasn't able to reproduce the bug with "mem=4275M" or less.
Tell me if you need more info or tests.
Comment 13 Marcus Fritzsch 2009-12-12 14:42:44 UTC
Hi again,

I played with what Marco described, and I can confirm this behavior. Currently my kernel uses 3 of the 4G I have installed and the system runs just fine (no problems so far, that is).

Cheers,
Marcus
Comment 14 Michael Groh 2009-12-14 13:49:40 UTC
I can also confirm the workaround.

I am currently running 2.6.32 with mem=4075M. I have no more problems with X, 2d and 3d work.

Hope this helps,
brot
Comment 15 Michael Groh 2009-12-20 17:49:27 UTC
With the newest 2.6.33-rc1 kernel I am getting the same errors. dmesg says:

[   68.174125] [drm:i915_hangcheck_elapsed] *ERROR* Hangcheck timer elapsed... GPU hung
[   68.174134] render error detected, EIR: 0x00000000
[   68.174180] [drm:i915_do_wait_request] *ERROR* i915_do_wait_request returns -5 (awaiting 751 at 749) 

Hope we can get this fixed for the 2.6.33 kernel :)
Comment 16 zhenyuw 2009-12-21 03:55:58 UTC
Sorry, I still don't have a memory upgrade for my 965GM to reproduce this.
Comment 17 Colin Guthrie 2009-12-24 16:25:21 UTC
I'm not sure if the cause is different but I'm seeing the same error message on my 945GM (intel 2.9.99.902, mesa 7.7, libdrm 2.4.17 and kernel 2.6.32.2 (which contains the commit from comment #3)) with only 2gig system memory.
Comment 18 zhenyuw 2009-12-29 01:34:35 UTC
Please attach full dmesg.
Comment 19 Colin Guthrie 2009-12-29 11:46:36 UTC
A full dmesg is hard to come by: the log fills up with this message over and over until the fifo is exhausted, so I quite literally only get this error message in the log. I've gone back to the v2.6.31.6 kernel to see if this helps, but it seems to have replaced the error with something else:
[drm:i915_gem_object_bind_to_gtt] *ERROR* Invalid object alignment requested 4096

(no other info in dmesg there).

I fear this is a different problem so don't want to pollute this bug report. I can open a new one if preferred or feel free to ask for other debug info/bisects etc.
Comment 20 Rafael J. Wysocki 2009-12-29 21:31:28 UTC
*** Bug 14862 has been marked as a duplicate of this bug. ***
Comment 21 zhenyuw 2009-12-30 02:23:27 UTC
Created attachment 24373
disable pci dma mapping in non-iommu case

On non-iommu machines with large memory, swiotlb is used, but it looks like our access to the graphics buffers doesn't work properly in that case. I'm not quite sure which part is really broken; my guess is the hardware status page access, but its coherency is already guarded by the GPU... This patch only does pci dma mapping when real iommu hardware is available, which reverts to my original patch's behavior. David, what do you think about it?
Comment 22 zhenyuw 2009-12-30 02:38:39 UTC
After talking with Shaohua Li: in the swiotlb case with a bounce buffer, the current GEM driver can't work correctly. The hardware status page, which is critical for tracking requests to the GPU, is set cache coherent by the GPU, but in the bounce buffer case it can't really be coherent without a sync operation. And on a GEM domain change that needs to flush the CPU cache, the cache for the bounce buffer never gets flushed either. So disabling pci dma mapping looks like the correct solution for now.
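The coherency problem described in this comment can be illustrated with a small userspace model. This is only a sketch under stated assumptions, not kernel or i915 code: `struct bounced_page`, `device_write_seqno()`, `sync_for_cpu()` and `driver_read_seqno()` are hypothetical names invented for the illustration. The device writes a sequence number into the bounce copy, and the driver keeps reading its own copy, which stays stale until something copies the data back (in the kernel, that is the role of a call such as `dma_sync_single_for_cpu()`).

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define PAGE_WORDS 4

/* Hypothetical model of a status page that swiotlb has bounced:
 * the device and the CPU end up looking at two different copies. */
struct bounced_page {
    uint32_t cpu_copy[PAGE_WORDS];    /* memory the driver reads */
    uint32_t bounce_copy[PAGE_WORDS]; /* memory the device actually writes */
};

/* The device reports progress by writing a sequence number.
 * With a bounce buffer in place, the write lands in the bounce copy. */
static void device_write_seqno(struct bounced_page *p, uint32_t seqno)
{
    assert(p != NULL);
    p->bounce_copy[0] = seqno;
}

/* Stand-in for the sync step: copy device-written data back to the
 * memory the CPU reads. */
static void sync_for_cpu(struct bounced_page *p)
{
    assert(p != NULL);
    memcpy(p->cpu_copy, p->bounce_copy, sizeof p->cpu_copy);
}

/* The driver polls its own copy of the status page. */
static uint32_t driver_read_seqno(const struct bounced_page *p)
{
    assert(p != NULL);
    return p->cpu_copy[0];
}
```

In this model, a driver that never calls `sync_for_cpu()` never sees the sequence number advance, which is consistent with the hangcheck and "awaiting N at M" messages reported in this bug, although the model does not prove that this is the exact failure path.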
Comment 23 Michael Groh 2009-12-30 04:16:55 UTC
I have applied the "disable pci dma mapping in non-iommu case" patch to the 2.6.32-gentoo-r1 kernel, and everything works like it should :)
Comment 24 Marcus Fritzsch 2009-12-30 12:13:20 UTC
I can confirm too, that the "disable pci dma mapping in non-iommu case" patch seems to fix that issue, no more black/corrupted screen and the performance (glxgears ;o)) is on par with 2.6.31.

Thank you,
Marcus
Comment 25 Rafael J. Wysocki 2009-12-30 21:03:31 UTC
Handled-By : zhenyuw <zhenyuw@linux.intel.com>
Patch : http://bugzilla.kernel.org/attachment.cgi?id=24373
Comment 26 zhenyuw 2009-12-31 04:46:04 UTC
Created attachment 24389
set dma mask in i915 driver

Please help test this patch instead, and revert the agp patch above. David noticed that if we've set up a 36 bit dma mask properly, swiotlb shouldn't be a problem, as dma_capable will be true and map_page will return the unmangled address.

I don't have a machine to test on, so please help test this one.

thanks.
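The reasoning in this comment can be sketched as a userspace model. The assumptions: `dma_bit_mask()`, `model_dma_capable()` and `model_map_page()` are hypothetical helpers written for this illustration, loosely mirroring the kernel's `DMA_BIT_MASK()` macro and the `dma_capable` check mentioned above; they are not the kernel's implementation. The idea is that a page above 4 GiB fails the check under a 32 bit mask, so it would be redirected to a bounce buffer, while under a 36 bit mask the same page passes and is handed to the device unchanged.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Userspace model only; hypothetical helpers, not kernel code. */

/* Build an n-bit DMA mask, e.g. 32 -> 0xffffffff. */
static uint64_t dma_bit_mask(unsigned bits)
{
    assert(bits >= 1 && bits <= 64);
    return bits == 64 ? ~UINT64_C(0) : (UINT64_C(1) << bits) - 1;
}

/* True if the last byte of [addr, addr + size) is addressable under mask.
 * Assumes size > 0 and that addr + size does not wrap around. */
static bool model_dma_capable(uint64_t mask, uint64_t addr, size_t size)
{
    assert(size > 0);
    return addr + size - 1 <= mask;
}

/* Address the device would be given: the page itself if it is reachable,
 * otherwise the (low) address of a bounce buffer. */
static uint64_t model_map_page(uint64_t mask, uint64_t phys, size_t size,
                               uint64_t bounce_phys)
{
    return model_dma_capable(mask, phys, size) ? phys : bounce_phys;
}
```

For example, a 4096-byte page at physical address 0x110000000 (4.25 GiB) is not reachable with `dma_bit_mask(32)` but is reachable with `dma_bit_mask(36)`, so only the 32 bit case would go through the bounce buffer in this model.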
Comment 27 Henry Gebhardt 2009-12-31 10:31:35 UTC
The drm-i915-set-dma-mask patch doesn't work for me. Same as before.

The agp-intel-pci-map-only-for-iommu-detected patch does work, but I also needed to add the following line to compile intel-agp as a module:

diff --git a/arch/x86/kernel/pci-dma.c b/arch/x86/kernel/pci-dma.c
index 75e14e2..79595e2 100644
--- a/arch/x86/kernel/pci-dma.c
+++ b/arch/x86/kernel/pci-dma.c
@@ -33,6 +33,7 @@ int iommu_merge __read_mostly = 0;
 int no_iommu __read_mostly;
 /* Set this to 1 if there is a HW IOMMU in the system */
 int iommu_detected __read_mostly = 0;
+EXPORT_SYMBOL(iommu_detected);
 
 /*
  * This variable becomes 1 if iommu=pt is passed on the kernel command line.


Thanks.
Comment 28 Rafael J. Wysocki 2009-12-31 10:51:57 UTC
*** Bug 14728 has been marked as a duplicate of this bug. ***
Comment 29 Michael Groh 2010-01-01 13:43:17 UTC
The drm-i915-set-dma-mask patch does not work here. dmesg says:

[   45.593231] [drm:i915_hangcheck_elapsed] *ERROR* Hangcheck timer elapsed... GPU hung
[   45.593244] render error detected, EIR: 0x00000000
[   45.593248] i915: Waking up sleeping processes
[   45.593270] [drm:i915_wait_request] *ERROR* i915_wait_request returns -5 (awaiting 2 at 1)
[   45.661072] sky2 eth0: Link is up at 100 Mbps, full duplex, flow control both
[   45.661735] ADDRCONF(NETDEV_CHANGE): eth0: link becomes ready
[   50.689226] [drm:i915_hangcheck_elapsed] *ERROR* Hangcheck timer elapsed... GPU hung
[   50.689234] render error detected, EIR: 0x00000000
[   50.689239] i915: Waking up sleeping processes
Comment 30 zhenyuw 2010-01-04 12:03:48 UTC
Created attachment 24430
remove dma mask setting in drm_pci_alloc()

Here's the refreshed patch after investigating the real cause of the dma mask failure.
Please help verify this one, thanks.
Comment 31 Martin Knopp 2010-01-04 15:31:30 UTC
The "remove dma mask"-patch fixes the problem on my machine (GM965, 2.6.32.2 vanilla kernel).

Thank you
Martin
Comment 32 Kornel Lugosi 2010-01-04 16:31:32 UTC
I can confirm, it fixed my bug.
Comment 33 Marcus Fritzsch 2010-01-04 18:16:28 UTC
Feels fixed here too (GM965, 2.6.32-zen4 + patch).

Cheers,
Marcus
Comment 34 Rafael J. Wysocki 2010-01-04 19:33:26 UTC
Patch : http://bugzilla.kernel.org/attachment.cgi?id=24430
Comment 35 Henry Gebhardt 2010-01-04 20:41:12 UTC
Just wanted to confirm that the new "remove dma mask setting in drm_pci_alloc()" patch works for me, too, on 2.6.33-rc2, but resuming from suspend only works with KMS enabled. But that is probably something different. Thanks a lot, Henry.
Comment 36 Zephaniah E. Hull. 2010-01-05 23:31:50 UTC
I can confirm that the new 'remove dma mask setting' patch fixes it here too.

Is there any chance of this making 2.6.32.3?
Comment 37 tomas m 2010-01-06 19:54:51 UTC
I'm not sure if it's the same bug, but it seems awfully similar.

Anyway, when the GPU freezes here, dmesg gets spammed with

[drm:i915_gem_execbuffer] *ERROR* Execbuf while wedged

I've already tried http://bugzilla.kernel.org/attachment.cgi?id=24430
with 2.6.32.2 and the issue is still there.

Hardware is
00:02.0 VGA compatible controller [0300]: Intel Corporation Mobile 945GM/GMS, 943/940GML Express Integrated Graphics Controller [8086:27a2] (rev 03)
Comment 38 zhenyuw 2010-01-07 08:28:37 UTC
I've run tests on 945GM and 945GME, and both seem fine. You can try disabling CONFIG_DMAR in your kernel to see if you can still reproduce the problem. If you can, it should be a different problem, and you may open another bug, as this one is mostly for the problem on 965G. Dave has sent the fix patch to Linus.
Comment 39 Colin Guthrie 2010-01-07 10:29:03 UTC
Tomas, can you CC me in on any 945 bug you open regarding the comment above. I've seen this same behaviour (as per my comment further up) and it's unlikely I'll be able to test the CONFIG_DMAR anytime soon so you'll probably beat me to it!
Comment 40 Rafael J. Wysocki 2010-01-10 22:31:08 UTC
Fixed by commit e6be8d9d17bd44061116f601fe2609b3ace7aa69.
