Bug 15004

Summary: i915: *ERROR* Execbuf while wedged
Product: Drivers
Reporter: tomas m (tmezzadra)
Component: Video(DRI - Intel)
Assignee: drivers_video-dri-intel (drivers_video-dri-intel)
Status: CLOSED DOCUMENTED
Severity: normal
CC: anarsoul, colin, finstaden, jasondbecker, jbarnes, jcnengel, loonyphoenix, rjw, theholyettlz
Priority: P1
Hardware: All
OS: Linux
Kernel Version: 2.6.32.2
Tree: Mainline
Regression: Yes
Bug Depends on:
Bug Blocks: 14230
Attachments: config, Xorg log, Part of syslog

Description tomas m 2010-01-07 18:53:44 UTC
Created attachment 24477
config

As suggested in http://bugzilla.kernel.org/show_bug.cgi?id=14627, I'm opening a new bug report since this appears to be a different issue.

I've already patched 2.6.32.2 with the fix proposed there and disabled CONFIG_DMAR, yet the issue is still there.

The behaviour is the same as in the previous bug.

The screen hangs, but the mouse still moves and the computer continues to work; I can even close open windows blindly. I can switch to a virtual terminal and keep working from there.

dmesg follows:
---
[drm:i915_hangcheck_elapsed] *ERROR* Hangcheck timer elapsed... GPU hung
render error detected, EIR: 0x00000000
i915: Waking up sleeping processes
[drm:i915_wait_request] *ERROR* i915_wait_request returns -5 (awaiting 2744405 at 2744397)
reboot required
[drm:i915_gem_execbuffer] *ERROR* Execbuf while wedged
[drm:i915_gem_execbuffer] *ERROR* Execbuf while wedged
[drm:i915_gem_execbuffer] *ERROR* Execbuf while wedged
[drm:i915_gem_execbuffer] *ERROR* Execbuf while wedged
...
...
...
[drm:i915_gem_execbuffer] *ERROR* Execbuf while wedged
[drm] DAC-6: set mode 1680x1050 1b
-----------------------
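
(Editorial sketch, not part of the original report.) The excerpt above can be condensed mechanically when comparing logs from several machines. The snippet below is a minimal sketch: the function name `summarize_hang` and the result keys are made up for illustration, and it assumes that "awaiting N at M" means a waiter wanted request N while the last completed request was M.

```python
import re

# Lines copied from the dmesg excerpt in this report.
SAMPLE_LOG = """\
[drm:i915_hangcheck_elapsed] *ERROR* Hangcheck timer elapsed... GPU hung
render error detected, EIR: 0x00000000
i915: Waking up sleeping processes
[drm:i915_wait_request] *ERROR* i915_wait_request returns -5 (awaiting 2744405 at 2744397)
reboot required
[drm:i915_gem_execbuffer] *ERROR* Execbuf while wedged
[drm:i915_gem_execbuffer] *ERROR* Execbuf while wedged
"""

WAIT_RE = re.compile(r"i915_wait_request returns (-?\d+) \(awaiting (\d+) at (\d+)\)")
EIR_RE = re.compile(r"EIR: (0x[0-9a-fA-F]+)")


def summarize_hang(log_text):
    """Collect the hang-related facts from a dmesg excerpt (hypothetical helper)."""
    summary = {
        "gpu_hung": False,          # hangcheck timer fired
        "eir": None,                # error identity register value, if logged
        "wait_errno": None,         # return code of the failed wait
        "seqno_gap": None,          # awaited seqno minus last seen seqno (assumed meaning)
        "execbuf_while_wedged": 0,  # number of rejected execbuffer calls
    }
    for line in log_text.splitlines():
        # Substring and regex searches also match lines with a "[1234.567]" timestamp prefix.
        if "Hangcheck timer elapsed" in line:
            summary["gpu_hung"] = True
        if "Execbuf while wedged" in line:
            summary["execbuf_while_wedged"] += 1
        eir = EIR_RE.search(line)
        if eir:
            summary["eir"] = int(eir.group(1), 16)
        wait = WAIT_RE.search(line)
        if wait:
            code, awaiting, current = (int(g) for g in wait.groups())
            summary["wait_errno"] = code
            summary["seqno_gap"] = awaiting - current
    return summary


if __name__ == "__main__":
    print(summarize_hang(SAMPLE_LOG))
```

An EIR of zero together with a small positive seqno gap is the pattern every log in this report shows.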



attaching config.


One slight detail that might annoy: BFS is enabled, although I've already ruled it out by trying 2.6.33-rc2 vanilla, which shows the same problem. Am I required to test 2.6.32.3 without BFS? Is it relevant?
Comment 1 James Ettle 2010-01-07 22:10:40 UTC
Saw a load of these after a suspend/resume cycle. (Also running BFS.) Same messages as above.
Comment 2 tomas m 2010-01-08 15:53:05 UTC
May I add more info:

I've got 2 screens: a 1280x800 LVDS notebook screen and a 1680x1050 screen attached to the VGA port.

Both are placed by Xorg in a single framebuffer of size

intel(0): Allocate new frame buffer 1680x1850 stride 2048


and compiz is running.

I'm attaching Xorg.0.log too.
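
(Editorial sketch, not part of the original comment.) The 1680x1850 framebuffer is consistent with the two screens being stacked vertically: the width is the larger of the two widths and the height is 1050 + 800. That layout is inferred from the numbers, not stated in the comment. The stride of 2048 equals 1680 rounded up to the next power of two; treating that rounding as the driver's rule is an assumption that merely matches the logged value. Both helper names below are hypothetical.

```python
def stacked_framebuffer(screens):
    """Bounding box of screens stacked vertically: (widest width, sum of heights)."""
    width = max(w for w, _ in screens)
    height = sum(h for _, h in screens)
    return width, height


def next_power_of_two(n):
    """Smallest power of two that is >= n (assumed stride rounding, see lead-in)."""
    p = 1
    while p < n:
        p *= 2
    return p


if __name__ == "__main__":
    vga = (1680, 1050)   # external screen on the VGA port
    lvds = (1280, 800)   # notebook panel
    width, height = stacked_framebuffer([vga, lvds])
    print(width, height, next_power_of_two(width))
```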
Comment 3 tomas m 2010-01-08 15:54:31 UTC
Created attachment 24483
Xorg log
Comment 4 Johannes Engel 2010-01-11 12:25:50 UTC
*** Bug 15045 has been marked as a duplicate of this bug. ***
Comment 5 Rafael J. Wysocki 2010-01-12 22:01:00 UTC
What's the last working kernel here?
Comment 6 tomas m 2010-01-12 22:20:32 UTC
Hmm, I can't remember when this started happening... I'm building 2.6.31.11 right now to test. I'm almost sure it wasn't a problem with 2.6.31; meanwhile I'm testing with my distribution's 2.6.31.6.
Comment 7 Jesse Barnes 2010-01-12 23:00:30 UTC
Looks like the GPU hang check code is getting it wrong, the EIR is all zeros.  Maybe some batch is just taking forever to finish?

Can you bisect this down to a specific bad commit?
Comment 8 Colin Guthrie 2010-01-13 10:12:27 UTC
For me I can use a 2.6.32.3 kernel fine with intel driver 2.9.x but with 2.10 it bombs out with this error. So while the problem could be in the kernel, the trigger case could be something in 2.10 only? May or may not be helpful :p
Comment 9 tomas m 2010-01-13 13:22:30 UTC
I tested 2.6.31.11, and while it locks the hardware similarly, it does so with a different error and I could not switch to a tty. I could ssh into the system, though. I'm trying to bisect the kernel. Could you do the same with the intel driver?
Comment 10 Johannes Engel 2010-01-13 16:48:24 UTC
I can confirm both statements: 2.10.0 together with 2.6.32.* shows the behaviour described above, 2.9.x + 2.6.31.11 works, 2.9.x + 2.6.32.* does not throw the error messages, but freezes as tomas said. Since I cannot ssh into my machine, I cannot give the exact error message though.
Comment 11 mif86 2010-01-13 19:15:47 UTC
I'm using 2.6.33-rc3-git5 with intel 2.9.1 and it's working perfectly.
Using 2.10.0 gives exactly the same symptoms as Johannes describes.
Comment 12 Jason Becker 2010-01-15 17:35:04 UTC
I have this issue as well on (openSUSE) 2.6.32-3-pae kernel with a dual monitor setup: 1280x800 notebook screen and a 1680x1050 screen attached to the VGA port.

lspci:

00:02.0 VGA compatible controller: Intel Corporation Mobile 945GM/GMS, 943/940GML Express Integrated Graphics Controller (rev 03) (prog-if 00 [VGA controller])
	Subsystem: Toshiba America Info Systems Device ff03
	Flags: bus master, fast devsel, latency 0, IRQ 16
	Memory at dc100000 (32-bit, non-prefetchable) [size=512K]
	I/O ports at 1800 [size=8]
	Memory at c0000000 (32-bit, prefetchable) [size=256M]
	Memory at dc200000 (32-bit, non-prefetchable) [size=256K]
	Expansion ROM at <unassigned> [disabled]
	Capabilities: [90] MSI: Enable- Count=1/1 Maskable- 64bit-
	Capabilities: [d0] Power Management version 2
	Kernel driver in use: i915

dmesg:

[171330.189103] [drm:i915_hangcheck_elapsed] *ERROR* Hangcheck timer elapsed... GPU hung
[171330.189119] render error detected, EIR: 0x00000000
[171330.189125] i915: Waking up sleeping processes
[171330.189193] reboot required
[171330.189209] [drm:i915_wait_request] *ERROR* i915_wait_request returns -5 (awaiting 1778246 at 1778244)
[171330.222994] [drm:i915_gem_execbuffer] *ERROR* Execbuf while wedged
[171330.249447] [drm:i915_gem_execbuffer] *ERROR* Execbuf while wedged
[171330.249783] [drm:i915_gem_execbuffer] *ERROR* Execbuf while wedged
[snip]

Let me know if I can provide additional info to troubleshoot.

Cheers
Comment 13 Victor 2010-01-15 21:35:21 UTC
I'm also having this on Arch Linux with both these kernels:
2.6.32.3-1 (Arch's default kernel)
2.6.32-5 (a patched kernel with BFS support, http://aur.archlinux.org/packages.php?ID=15224)
This intel driver:
2.10.0-2 (http://aur.archlinux.org/packages.php?ID=22968)
And this Intel DRI module:
7.7-1 (don't remember its source)
Comment 14 tomas m 2010-01-15 22:34:39 UTC
I'm trying to bisect this, but it has proven more difficult than I first thought. Sometimes it takes 12 hours to trigger the bug, so it's difficult to tell good commits from bad ones... If anyone is willing to help me trigger this, mail me privately so that I can send the bisect log and we can tackle this together.
Comment 15 Vasily Khoruzhick 2010-01-18 09:28:57 UTC
Created attachment 24614
Part of syslog

I'm getting this error from time to time on my 945gm with latest stable kernel
(2.6.32.2 with gentoo patches). I've noticed that it's really easy to reproduce
this bug with kde 4.4rc1 - just switch plasma to netbook mode.

As you can see in the attached log, a "[drm:i915_gem_object_pin] *ERROR* Failure to
install fence: -28" message preceded the message in the subject - it seems there's a fence leak somewhere.

Video driver components versions:
mesa-7.7
libdrm-2.4.17
xf86-video-intel-2.10.0
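
(Editorial sketch, not part of the original comment.) The negative numbers in these messages are kernel errno values: -28 in the fence message and -5 in the i915_wait_request lines earlier in this report. A minimal way to decode them, assuming the script runs on a system with the usual Linux/POSIX errno numbering (the helper name `describe_kernel_error` is hypothetical):

```python
import errno
import os


def describe_kernel_error(code):
    """Map a negative kernel return code such as -28 to its errno name and text."""
    number = abs(code)
    name = errno.errorcode.get(number, "UNKNOWN")
    return f"{name}: {os.strerror(number)}"


if __name__ == "__main__":
    for code in (-5, -28):
        print(code, describe_kernel_error(code))
```

ENOSPC on fence installation fits the suspicion of a fence leak, since it suggests no free fence register could be found.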
Comment 16 Vasily Khoruzhick 2010-01-18 09:29:12 UTC
*** Bug 15072 has been marked as a duplicate of this bug. ***
Comment 17 tomas m 2010-01-18 13:22:31 UTC
I was given an easy way to trigger this through email:

- enable compiz resize in normal mode (window contents are updated on the fly)
- resize like crazy for about 10 to 20 seconds.

I've found that at a certain point in the drm intel next branch, the bug changes from what we have reported to:

[drm:i915_gem_object_bind_to_gtt] *ERROR* Invalid object alignment requested 4096

where the screen hard locks (no tty switch available), but I can ssh into the system.

I'm not sure if it's the same bug or an old bug that was fixed. This is somewhere around the 2.6.31-rc kernels. Right now I'm hunting for the commit that flips this error. Am I doing the right thing?
Comment 18 tomas m 2010-01-18 16:37:28 UTC
my results:

e67b8ce1b59006ba41245838db60b6fcda365ba8 is the first bad commit
commit e67b8ce1b59006ba41245838db60b6fcda365ba8
Author: Chris Wilson <chris@chris-wilson.co.uk>
Date:   Mon Sep 14 16:50:26 2009 +0100

This commit introduces the "Execbuf while wedged" error, where the mouse pointer still works and you can click on the frozen screen to close programs.

Prior to that, I get the "Invalid object alignment" error, where the screen and mouse freeze and I need to ssh into the box to reboot it.

It looks like the two are related. What's the next step? Should I hunt down the "Invalid object alignment" bug?
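
(Editorial sketch, not part of the original comment.) When bisecting for the commit that flips one failure into the other, each test run has to be sorted by which message shows up in dmesg. The snippet below is a minimal sketch of that sorting; the function names and the labels "new", "old" and "inconclusive" are made up for illustration. A clean log is treated as inconclusive and not as good, because the hang can take many hours to appear.

```python
WEDGED = "Execbuf while wedged"
ALIGNMENT = "Invalid object alignment requested"


def classify_failure(log_text):
    """Return which of the two failure signatures from this report appears in a log."""
    if WEDGED in log_text:
        return "wedged"
    if ALIGNMENT in log_text:
        return "alignment"
    return "none"


def bisect_verdict(log_text):
    """Translate a log into a verdict for the 'which commit flips the error' hunt."""
    verdicts = {
        "wedged": "new",         # behaviour after the flip
        "alignment": "old",      # behaviour before the flip
        "none": "inconclusive",  # no hang seen yet; says nothing about the commit
    }
    return verdicts[classify_failure(log_text)]


if __name__ == "__main__":
    sample = "[drm:i915_gem_execbuffer] *ERROR* Execbuf while wedged"
    print(bisect_verdict(sample))
```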
Comment 19 tomas m 2010-01-19 15:17:54 UTC
Maybe it's an intel driver issue.
2.9.1 works OK with 2.6.32.4.
2.9.99.901 fails with the same kernel, with the wedged error message.

I tried to bisect this, but half of the commits between those versions fail to build or break Xorg badly.

Should I file a bug with the xf86-video-intel people?
Comment 20 tomas m 2010-01-19 15:35:15 UTC
there already is a bug: https://bugs.freedesktop.org/show_bug.cgi?id=25475
Comment 21 Rafael J. Wysocki 2010-01-27 20:57:53 UTC
On Wednesday 27 January 2010, Chris Wilson wrote:
> On Wed, 27 Jan 2010 10:01:44 -0800, Jesse Barnes <jbarnes@virtuousgeek.org>
> wrote:
> > On Sun, 24 Jan 2010 23:23:11 +0100 (CET)
> > "Rafael J. Wysocki" <rjw@sisk.pl> wrote:
> > 
> > > This message has been generated automatically as a part of a report
> > > of regressions introduced between 2.6.31 and 2.6.32.
> > > 
> > > The following bug entry is on the current list of known regressions
> > > introduced between 2.6.31 and 2.6.32.  Please verify if it still should
> > > be listed and let me know (either way).
> > > 
> > > 
> > > Bug-Entry : http://bugzilla.kernel.org/show_bug.cgi?id=15004
> > > Subject           : i915: *ERROR* Execbuf while wedged
> > > Submitter : tomas m <tmezzadra@gmail.com>
> > > Date              : 2010-01-07 18:53 (18 days old)
> > 
> > Chris, any ideas about this one?  I remember seeing a few like this
> > that have been bisected to 2D driver changes recently...
> 
> Yes, this is almost certainly our userspace driver sending the GPU into a
> spin. And no, we haven't identified the cause yet, there have been a lot
> of conflicting reports and guesswork, with very little information.
Comment 22 tomas m 2010-01-27 21:35:33 UTC
Rafael, I don't know what to make of your LKML quote.

Concerning what Chris said: I don't know how to provide more information. Whatever info I could collect, I've already provided, both here and in the bug report at Xorg (comment #20).
Comment 23 Rafael J. Wysocki 2010-01-27 21:58:08 UTC
This is an additional status update, it doesn't mean you're expected to do anything more.
Comment 24 tomas m 2010-02-10 16:03:04 UTC
A patch has been applied to libdrm that appears to fix this bug... and many others. This appears to be fixed for me.

More info here: http://bugs.freedesktop.org/show_bug.cgi?id=25475#c88

So I guess this bug can be closed now ;)
Comment 25 Rafael J. Wysocki 2010-02-10 18:09:30 UTC
OK, closing.