Bug 15664 - Graphics hang and kernel backtrace when starting Azureus with Compiz enabled
Summary: Graphics hang and kernel backtrace when starting Azureus with Compiz enabled
Status: CLOSED CODE_FIX
Alias: None
Product: Drivers
Classification: Unclassified
Component: Video(DRI - Intel)
Hardware: All Linux
Importance: P1 high
Assignee: drivers_video-dri-intel@kernel-bugs.osdl.org
URL:
Keywords:
Depends on:
Blocks: 15310
Reported: 2010-04-01 01:09 UTC by Alex Villacis Lasso
Modified: 2011-03-06 00:32 UTC
CC List: 10 users

See Also:
Kernel Version: 2.6.34-rc3
Subsystem:
Regression: Yes
Bisected commit-id:


Attachments
dmesg output showing kernel backtrace at moment of hang (62.19 KB, text/plain)
2010-04-01 01:09 UTC, Alex Villacis Lasso
Xorg log at moment of graphics hang (78.48 KB, text/plain)
2010-04-01 01:11 UTC, Alex Villacis Lasso
lspci -v output for my machine right after the crash (7.52 KB, text/plain)
2010-04-01 01:12 UTC, Alex Villacis Lasso
Kernel configuration used to compile crashing kernel (94.27 KB, text/plain)
2010-04-01 01:15 UTC, Alex Villacis Lasso
/var/log/messages at time of crash, bzip compressed (37.63 KB, application/octet-stream)
2010-04-01 01:18 UTC, Alex Villacis Lasso
dmesg of reporter with just the two crashing kernel runs (158.07 KB, application/octet-stream)
2010-04-15 08:00 UTC, Daniel Vetter
debug kernel patch (660 bytes, patch)
2010-04-15 11:31 UTC, Daniel Vetter
Extract of dmesg with 2.6.34-rc4 and NO PATCH (7.18 KB, text/plain)
2010-04-15 14:42 UTC, Alex Villacis Lasso
debug patch (500 bytes, patch)
2010-04-15 20:44 UTC, Daniel Vetter
dmesg output after 2.6.34-rc4 and second patch (53.06 KB, text/plain)
2010-04-16 15:11 UTC, Alex Villacis Lasso
dmesg output after 2.6.34-rc5 and BUG_ON patch (53.09 KB, text/plain)
2010-04-22 15:47 UTC, Alex Villacis Lasso
be paranoid in rmfb (1.04 KB, patch)
2010-05-05 19:29 UTC, Daniel Vetter
dmesg output after 2.6.34-rc6 and BUG_ON patch and paranoid rmfb patch (52.29 KB, text/plain)
2010-05-06 00:53 UTC, Alex Villacis Lasso
dmesg output after 2.6.33 and BUG_ON patch and paranoid rmfb patch (46.67 KB, text/plain)
2010-05-06 18:05 UTC, Alex Villacis Lasso
print fb<->bo association (823 bytes, patch)
2010-05-14 12:12 UTC, Daniel Vetter
dmesg output after 2.6.34-rc7 and BUG_ON, paranoid rmfb, fb-bo patches (53.32 KB, text/plain)
2010-05-14 16:46 UTC, Alex Villacis Lasso
backtrace with 2.6.34 with no extra patches (123.15 KB, text/plain)
2010-05-21 23:53 UTC, Alex Villacis Lasso
Backtrace with 2.6.35-rc2 (64.84 KB, text/plain)
2010-06-09 21:12 UTC, Alex Villacis Lasso
Extract of /var/log/messages with 2.6.35-rc3 and patched DRM locking (131.32 KB, text/plain)
2010-06-28 15:33 UTC, Alex Villacis Lasso
Repeat unbind during free. (1.67 KB, patch)
2010-07-23 13:37 UTC, Chris Wilson
Repeat unbind during free. (4.90 KB, patch)
2010-07-23 14:46 UTC, Chris Wilson
Repeat unbind during free. (4.98 KB, patch)
2010-07-23 14:56 UTC, Chris Wilson
Repeat unbind during free. (9.35 KB, patch)
2010-07-23 18:22 UTC, Chris Wilson
dmesg output with different backtrace (67.32 KB, text/plain)
2010-07-30 15:39 UTC, Alex Villacis Lasso

Description Alex Villacis Lasso 2010-04-01 01:09:52 UTC
Created attachment 25789 [details]
dmesg output showing kernel backtrace at moment of hang

System is Fedora 12 x86_64 with latest updates. I recently compiled and installed kernel 2.6.34-rc3. After the kernel had been running for a few hours, I tried to start Azureus (the BitTorrent client), and the graphics screen froze completely. I could not move the mouse, the keyboard was unresponsive, and not even the Num Lock LED changed when I pressed the Num Lock key. At the same time I heard a single beep from the PC speaker.

I have found that the machine is still responsive via network. When I logged in remotely, I was able to collect the attached kernel backtraces.

This problem is intermittent. Twice I have had the problem, and other times I have started Azureus without problems.

Steps to reproduce:
1) Run 2.6.34-rc3 kernel on x86_64 and intel graphics chipset
2) Enable compiz on GNOME
3) Try to start Azureus (might take a few retries)

Expected results: no crash, application starts normally
Actual results: graphics hang
Comment 1 Alex Villacis Lasso 2010-04-01 01:11:34 UTC
Created attachment 25790 [details]
Xorg log at moment of graphics hang

Cannot see anything wrong with this particular log.
Comment 2 Alex Villacis Lasso 2010-04-01 01:12:39 UTC
Created attachment 25791 [details]
lspci -v output for my machine right after the crash
Comment 3 Alex Villacis Lasso 2010-04-01 01:15:00 UTC
Created attachment 25792 [details]
Kernel configuration used to compile crashing kernel
Comment 4 Alex Villacis Lasso 2010-04-01 01:16:54 UTC
Forgot to mention: no such hangs occurred with 2.6.34-rc2 (as far as I remember) or 2.6.33.
Comment 5 Alex Villacis Lasso 2010-04-01 01:18:19 UTC
Created attachment 25793 [details]
/var/log/messages at time of crash, bzip compressed

Look for date Mar 31. There are two hangs logged there.
Comment 6 Daniel Vetter 2010-04-15 08:00:20 UTC
Created attachment 26008 [details]
dmesg of reporter with just the two crashing kernel runs

Thanks for the report. I haven't yet looked closely, but it looks like an issue I've been hunting for a while. Bisecting is likely not worth it (if it really is the same problem) because I have reports with kernels older than 2.6.34-rc2.

I've also cut down your logfile to only show the dmesg output of the two crashing kernel runs - messages from other services just add noise for kernel problems. Please do this next time you upload a dmesg from the disk logs.
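A minimal sketch of the trimming requested above, assuming the classic syslog layout in which kernel lines carry the tag `kernel:` after the hostname. The function name `kernel_only` is made up for illustration, and the exact line layout on a given distribution may differ.

```shell
# Hypothetical helper: keep only kernel messages from a syslog-style file.
# Assumes lines look like "Mar 31 20:01:02 host kernel: ..." (classic syslog format).
kernel_only() {
    # $1 = path to the log file, for example /var/log/messages
    grep ' kernel: ' "$1"
}
```

Typical use would be `kernel_only /var/log/messages > kernel-messages.txt`, followed by cutting the result down by hand to the boots that actually crashed.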
Comment 7 Daniel Vetter 2010-04-15 10:10:30 UTC
Ok, I've decoded the oops and it's barfing on the 5th page (address in RDX) in the pages array. In other words, that pages array handed to drm_clflush_pages has been corrupted (it's not even a valid kernel address anymore, hence the gp fault).

Possible explanation: The pages_refcount of the corresponding gem bo dropped to zero and the pages got freed. At least I have gathered tons of backtraces from the corresponding BUG_ON in i915_gem_object_put_pages from various testers while developing my i855 gtt coherency fix:

https://bugs.freedesktop.org/show_bug.cgi?id=27187

[look for put_pages and/or BUG to find the relevant dmesgs]

All the testers from that bug report are using my unmap-inactive-objects hack, i.e. it's much more likely that they hit the BUG_ON(pages_refcount == 0) (which is usually hit in the gtt unmap path) before the pages array has a chance to be corrupted.

We only access the pages array when clflushing. So there's plenty of time for corruptions to happen without ill effects on vanilla kernels.

This reporter's dmesg has an error about a fb inconsistency right before the kernel blows up in both cases. Adding Jesse because this might be pageflip related.
Comment 8 Daniele C. 2010-04-15 10:39:30 UTC
I am also affected.

drm-intel-next 2.6.34-rc2
mesa 7.7.1
libgl 7.7.1
xorg-server 1.7.6
xf86-video-intel 2.10.0
libdrm 2.4.18/2.4.20
Comment 9 Daniel Vetter 2010-04-15 11:31:49 UTC
Created attachment 26011 [details]
debug kernel patch

Alex, can you please apply this patch against the kernel and try to rehang your box. This will kill performance (so expect a somewhat sluggish feel on the desktop), but it should also kill your box pretty fast (if my theory is right).

If it hangs, please capture the full dmesg.
Comment 10 Alex Villacis Lasso 2010-04-15 14:42:03 UTC
Created attachment 26019 [details]
Extract of dmesg with 2.6.34-rc4 and NO PATCH

The bug still persists in 2.6.34-rc4. This backtrace seems almost identical to the previous one. I have not yet applied the patch at this point.
Comment 11 Rafael J. Wysocki 2010-04-15 16:52:31 UTC
On Thursday 15 April 2010, Alex Villacís Lasso wrote:
> On 07/04/10 16:13, Rafael J. Wysocki wrote:
> > This message has been generated automatically as a part of a summary report
> > of recent regressions.
> >
> > The following bug entry is on the current list of known regressions
> > from 2.6.33.  Please verify if it still should be listed and let the
> > tracking team know (either way).
> >
> >
> > Bug-Entry   : http://bugzilla.kernel.org/show_bug.cgi?id=15664
> > Subject     : Graphics hang and kernel backtrace when starting Azureus with Compiz enabled
> > Submitter   : Alex Villacis Lasso<avillaci@ceibo.fiec.espol.edu.ec>
> > Date        : 2010-04-01 01:09 (7 days old)
> >
> Still present in 2.6.34-rc4.
Comment 12 Daniel Vetter 2010-04-15 20:44:58 UTC
Created attachment 26024 [details]
debug patch

Please retest with this debug patch - it contains some paranoia-checks to see which refcount went south. Also enable slab debugging (CONFIG_DEBUG_SLAB), this should help to shed some light on this if it's really a use-after-free thing (it looks like).

btw, it's definitely not the same bug as the BUG_ON in put_pages I've mentioned earlier - that's resolved and was definitely a bug in one of my debug patches.

Thanks for the other dmesg. I haven't looked as closely as with the other ones, but it's essentially the same error.
Comment 13 Alex Villacis Lasso 2010-04-16 15:11:54 UTC
Created attachment 26029 [details]
dmesg output after 2.6.34-rc4 and second patch

I applied the patch against 2.6.34-rc4 (offset -1 lines) and rebooted. I could not even reach the graphical login. The screen froze with the completed Fedora logo in the blue background, with no X cursor. I logged in over the network and captured the attached dmesg output with backtrace.
Comment 14 Daniele C. 2010-04-16 17:00:17 UTC
@Alex: I experienced the same issue with an unpatched Fedora kernel. Are you sure that patch was applied?
Comment 15 Alex Villacis Lasso 2010-04-16 20:08:40 UTC
@Daniele: I applied the second patch (the one that just adds the BUG_ON). I am hitting

BUG_ON(atomic_read(&obj->refcount.refcount) == 0);

which was the one added by the second patch. So yes, the patch is applied. Or else I do not understand your question.

Unpatched -rc4 lets me log in and work for a while, until the hang happens, as remarked in comment #10.
Comment 16 Daniel Vetter 2010-04-16 21:05:44 UTC
[comment seems to have been eaten by some mta, resending via the web interface]

> --- Comment #13 from Alex Villacis Lasso <avillaci@ceibo.fiec.espol.edu.ec> 2010-04-16 15:11:54 ---
> I applied the patch against 2.6.34-rc4 (offset -1 lines) and rebooted. I could
> not even reach the graphical login. The screen froze with the completed Fedora
> logo in the blue background, with no X cursor. I logged from the network and
> captured the attached dmesg output with backtrace.

Thanks for testing. So yep, refcounts gone south. Unfortunately I've been
a little bit too lazy when doing that debug patch. Can you check with your
sources which of the two BUG_ONs my patch inserted you've hit? If it's the
first one, please exchange them like this:

        BUG_ON(atomic_read(&obj->refcount.refcount) == 0);
        BUG_ON(obj_priv->pages_refcount == 0);

and check whether it's still only hitting the pages_refcount thing.
Thanks.
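One way to tell which BUG_ON fired is to read the source location that the kernel prints in its "kernel BUG at file:line!" message and compare it with the patched source. The sketch below assumes that message format is present in the captured dmesg; the function name `bug_location` is hypothetical.

```shell
# Hypothetical helper: print the "file:line" part of every
# "kernel BUG at <file>:<line>!" message found in dmesg text on stdin.
bug_location() {
    sed -n 's/.*kernel BUG at \(.*\)!.*/\1/p'
}
```

For example, `dmesg | bug_location` prints the location, and the line number can then be looked up in the patched i915_gem.c to see which of the two checks sits on that line.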
Comment 17 Alex Villacis Lasso 2010-04-19 19:06:21 UTC
I am hitting the second BUG_ON, not the first. My kernel was compiled with CONFIG_SLUB and CONFIG_SLUB_DEBUG, but the last dmesg was without any slub_debug configured. I will retest with slub_debug.
Comment 18 Alex Villacis Lasso 2010-04-19 19:11:17 UTC
I meant, I did not pass slub_debug as a kernel command line parameter.
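Since SLUB debugging compiled in with CONFIG_SLUB_DEBUG is only switched on when `slub_debug` is passed at boot, it can help to verify the running kernel's command line before retesting. A small sketch, with the hypothetical helper name `has_slub_debug`:

```shell
# Hypothetical helper: succeed if a kernel command line contains slub_debug,
# either bare or in the form slub_debug=<options>.
has_slub_debug() {
    # $1 = kernel command line string
    case " $1 " in
        *" slub_debug "*|*" slub_debug="*) return 0 ;;
        *) return 1 ;;
    esac
}
```

On a running system this could be called as `has_slub_debug "$(cat /proc/cmdline)" && echo "slub_debug is active"`.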
Comment 19 Alex Villacis Lasso 2010-04-22 15:47:09 UTC
Created attachment 26096 [details]
dmesg output after 2.6.34-rc5 and BUG_ON patch

Bug still present in 2.6.34-rc5, crashes in the same way as 2.6.34-rc4. It hits the *second* BUG_ON at:
        BUG_ON(obj_priv->pages_refcount == 0);
        BUG_ON(atomic_read(&obj->refcount.refcount) == 0);
Comment 20 Daniel Vetter 2010-04-22 17:25:35 UTC
> --- Comment #19 from Alex Villacis Lasso <avillaci@ceibo.fiec.espol.edu.ec> 2010-04-22 15:47:09 ---
> Created an attachment (id=26096)
>  --> (https://bugzilla.kernel.org/attachment.cgi?id=26096)
> dmesg output after 2.6.34-rc5 and BUG_ON patch
> 
> Bug still present in 2.6.34-rc5, crashes in the same way as 2.6.34-rc4. It hits
> the *second* BUG_ON at:
>         BUG_ON(obj_priv->pages_refcount == 0);
>         BUG_ON(atomic_read(&obj->refcount.refcount) == 0);

Thanks for the clarification. I was somewhat confused which BUG_ON you've
hit, but too busy to ask for clarification.

Jesse, this dmesg (like all the ones before) again shows the fb related
ERROR right before. Can you please take a look? The BUG_ON Alex is
hitting indicates a screwed-up refcount somewhere.
Comment 21 Rafael J. Wysocki 2010-04-22 17:53:52 UTC
On Thursday 22 April 2010, Alex Villacís Lasso wrote:
> On 19/04/10 22:19, Rafael J. Wysocki wrote:
> > This message has been generated automatically as a part of a summary report
> > of recent regressions.
> >
> > The following bug entry is on the current list of known regressions
> > from 2.6.33.  Please verify if it still should be listed and let the
> > tracking team know (either way).
> >
> >
> > Bug-Entry   : http://bugzilla.kernel.org/show_bug.cgi?id=15664
> > Subject     : Graphics hang and kernel backtrace when starting Azureus with Compiz enabled
> > Submitter   : Alex Villacis Lasso<avillaci@ceibo.fiec.espol.edu.ec>
> > Date        : 2010-04-01 01:09 (19 days old)
> >
> Still present in -rc5. I am testing with a patch that triggers a BUG_ON
> when the refcount reaches zero, and it triggered in rc4 and rc5. The
> patch is attached as part of the bug report.
Comment 22 Alex Villacis Lasso 2010-04-30 16:55:52 UTC
Still present in 2.6.34-rc6.
Comment 23 Alex Villacis Lasso 2010-04-30 16:57:03 UTC
Still present in 2.6.34-rc6.
Comment 24 Rafael J. Wysocki 2010-05-04 22:19:19 UTC
On Wednesday 05 May 2010, Alex Villacís Lasso wrote:
> On 04/05/10 16:21, Rafael J. Wysocki wrote:
> > This message has been generated automatically as a part of a summary report
> > of recent regressions.
> >
> > The following bug entry is on the current list of known regressions
> > from 2.6.33.  Please verify if it still should be listed and let the
> > tracking team know (either way).
> >
> >
> > Bug-Entry   : http://bugzilla.kernel.org/show_bug.cgi?id=15664
> > Subject     : Graphics hang and kernel backtrace when starting Azureus with Compiz enabled
> > Submitter   : Alex Villacis Lasso<avillaci@ceibo.fiec.espol.edu.ec>
> > Date        : 2010-04-01 01:09 (34 days old)
> >
> Still present in -rc6
Comment 25 Daniel Vetter 2010-05-05 19:29:35 UTC
Created attachment 26243 [details]
be paranoid in rmfb

Ok, that's a shot in the dark. Care to test this patch?

If it doesn't turn up anything new, can you please try to bisect this regression? If it helps to crash the kernel faster, simply apply my earlier debug patch for testing.

Thanks, Daniel
Comment 26 Alex Villacis Lasso 2010-05-06 00:53:37 UTC
Created attachment 26251 [details]
dmesg output after 2.6.34-rc6 and BUG_ON patch and paranoid rmfb patch

Seems there is nothing new. Bisecting might take a long time since this is my primary workstation machine and I cannot just reboot it at will.
Comment 27 Alex Villacis Lasso 2010-05-06 18:05:33 UTC
Created attachment 26264 [details]
dmesg output after 2.6.33 and BUG_ON patch and paranoid rmfb patch

Now this is new.

I was trying to locate a kernel version that does not crash, in order to start bisecting. I remember vanilla 2.6.33 had no video-related crash problems, so I started there. I applied the BUG_ON patch and the paranoid rmfb patch to 2.6.33, recompiled, and rebooted. This time I hit the BUG_ON in the same way as in 2.6.34-rcX.

My conclusion is that 2.6.33 already had the refcount bug but was saved from crashing until recent drm changes exposed the bug. The trouble is that without the BUG_ON patch, 2.6.34-rcX works for a while, even if internally its objects are incorrectly refcounted. What do you think?

BTW, my stock Fedora 12 kernel (2.6.32.11-99.fc12.x86_64) also has the "tried to remove a fb..." message, so I think it might have the hidden refcounting bug too. However, it has never crashed yet.
Comment 28 Daniel Vetter 2010-05-14 12:12:29 UTC
Created attachment 26378 [details]
print fb<->bo association

Please apply this debug patch on top of the rest. This is just to check that the gem object the kernel is crashing on is really associated with an fb (I think so, but testing is always better). Please upload the full dmesg of a crashing kernel. Thanks.
Comment 29 Alex Villacis Lasso 2010-05-14 16:46:51 UTC
Created attachment 26384 [details]
dmesg output after 2.6.34-rc7 and BUG_ON, paranoid rmfb, fb-bo patches

As requested, here is the full dmesg output with crash.
Comment 30 field_it 2010-05-14 18:11:36 UTC
Same BUG here.

FYI:

http://article.gmane.org/gmane.comp.video.dri.devel/46062

Thx Daniel.

Cheers
Comment 31 Daniel Vetter 2010-05-15 08:28:38 UTC
> --- Comment #29 from Alex Villacis Lasso <avillaci@ceibo.fiec.espol.edu.ec> 2010-05-14 16:46:51 ---
> Created an attachment (id=26384)
>  --> (https://bugzilla.kernel.org/attachment.cgi?id=26384)
> dmesg output after 2.6.34-rc7 and BUG_ON, paranoid rmfb, fb-bo patches
> 
> As requested, here is the full dmesg output with crash.

Thanks. Confirms indeed that we blow up on the backing bo of a just freed
framebuffer.
Comment 32 Chris Wilson 2010-05-15 09:02:07 UTC
I've fixed some page-flipping and framebuffer reference counting bugs in xf86-video-intel very recently. Could you please try with an up-to-date xorg driver?

commit 9f54107f866a25cf670f81f7c52b8c108728c6a5
Author: Chris Wilson <chris@chris-wilson.co.uk>
Date:   Tue May 11 14:55:16 2010 +0100

    dri2: Handle reference counting across page flipping
    
    1. Instead of swapping bos, swap the entire private structure.
    
    2. If we update the pixmap bo for the Screen, make sure we update the
    reference inside intel->front_buffer so that xrandr still functions.
    
    Fixes:
    
      Bug 27922 - i965: Rapidly resizing OpenGL window causes GPU to hang.
      https://bugs.freedesktop.org/show_bug.cgi?id=27922
    
    Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>

commit 0d2392d44aae95d6b571d98f7ec323cf672a687f
Author: Chris Wilson <chris@chris-wilson.co.uk>
Date:   Fri May 14 10:32:12 2010 +0100

    dri: Hold reference to buffers across swap
    
    As we schedule swaps for some time in the future and may process a
    detachment prior to receiving the vblank notification from the kernel,
    we need to hold a reference to the buffers for our swap event handler.
    
    Fixes:
      Bug 28080 - "glresize" causes X server segfault with indirect rendering.
      https://bugs.freedesktop.org/show_bug.cgi?id=28080
    
    Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
Comment 33 Alex Villacis Lasso 2010-05-18 14:59:58 UTC
I am running 2.6.34 now, with NO additional patches. There have been no crashes in at least a day. However, I *still* get the line on dmesg:
[   31.313153] [drm:drm_mode_rmfb] *ERROR* tried to remove a fb that we didn't own

So I do not think the problem can be considered solved yet. The -rc3 kernel also ran for a day or so, and then crashed...
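Because the error line shows up long before any hang, counting it in the kernel log gives a quick check on each new kernel. A minimal sketch, assuming the message text stays exactly as quoted above; the function name `count_rmfb_errors` is invented for this example.

```shell
# Hypothetical helper: count rmfb ownership errors in kernel log text on stdin.
# grep -c exits non-zero when the count is 0, so "|| true" keeps the status clean.
count_rmfb_errors() {
    grep -c -F 'tried to remove a fb' || true
}
```

It could be run as `dmesg | count_rmfb_errors`; any non-zero count means the message is still being logged, whether or not the kernel has crashed yet.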
Comment 34 Alex Villacis Lasso 2010-05-21 23:53:35 UTC
Created attachment 26490 [details]
backtrace with 2.6.34 with no extra patches

I ran 2.6.34 for several days, and then I get this backtrace and hang. So the problem is DEFINITELY not solved. Grrr...
Comment 35 Alex Villacis Lasso 2010-06-07 21:15:45 UTC
The telltale message "tried to remove a fb" still appears in 2.6.35-rc2. In addition, I am now affected by bug #16149, but reverting the commit mentioned in the bug report fixes the problem.
Comment 36 Alex Villacis Lasso 2010-06-09 21:12:24 UTC
Created attachment 26703 [details]
Backtrace with 2.6.35-rc2

The machine finally produced a backtrace and hang, but now the backtrace is different. Is there any news on this?
Comment 37 Alex Villacis Lasso 2010-06-15 17:21:47 UTC
The message "tried to remove a fb" still appears in 2.6.35-rc3.
Comment 38 Rafael J. Wysocki 2010-06-21 18:23:29 UTC
On Monday, June 21, 2010, Alex Villacís Lasso wrote:
> On 20/06/10 17:34, Rafael J. Wysocki wrote:
> > This message has been generated automatically as a part of a report
> > of regressions introduced between 2.6.33 and 2.6.34.
> >
> > The following bug entry is on the current list of known regressions
> > introduced between 2.6.33 and 2.6.34.  Please verify if it still should
> > be listed and let the tracking team know (either way).
> >
> >
> > Bug-Entry   : http://bugzilla.kernel.org/show_bug.cgi?id=15664
> > Subject     : Graphics hang and kernel backtrace when starting Azureus with Compiz enabled
> > Submitter   : Alex Villacis Lasso<avillaci@ceibo.fiec.espol.edu.ec>
> > Date        : 2010-04-01 01:09 (81 days old)
> >
> Still present in 2.6.35-rc3.
Comment 39 Alex Villacis Lasso 2010-06-28 15:33:45 UTC
Created attachment 26962 [details]
Extract of /var/log/messages with 2.6.35-rc3 and patched DRM locking

I can reproduce the reference miscount at home using the paranoid rmfb patch. In addition, I patched the DRM locking functions to WARN_ON on each call, and posted the resulting dmesg as written to /var/log/messages. Is this useful at all?
Comment 40 Alex Villacis Lasso 2010-07-06 16:01:22 UTC
Still having the "tried to remove a fb..." message with 2.6.35-rc4.
Comment 41 Rafael J. Wysocki 2010-07-09 23:34:31 UTC
Do you have the VGA framebuffer selected in .config?
Comment 42 Alex Villacis Lasso 2010-07-12 16:10:13 UTC
[alex@srv64 linux-2.6.34-rc4-git]$ grep VGA .config
CONFIG_VGA_ARB=y
CONFIG_VGA_ARB_MAX_GPUS=16
# CONFIG_VGA_SWITCHEROO is not set
CONFIG_VGASTATE=m
# CONFIG_FB_SVGALIB is not set
CONFIG_FB_VGA16=m
CONFIG_VGA_CONSOLE=y
CONFIG_VGACON_SOFT_SCROLLBACK=y
CONFIG_VGACON_SOFT_SCROLLBACK_SIZE=64
# CONFIG_LOGO_LINUX_VGA16 is not set
CONFIG_USB_SISUSBVGA=m
CONFIG_USB_SISUSBVGA_CON=y

I see that CONFIG_FB_VGA16 is set to module.
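For a single option, the grep above can be narrowed to report just that option's state. This is a sketch only: the helper name `config_state` is hypothetical, and it assumes the usual .config convention of `NAME=value` for enabled options and a `# NAME is not set` comment for disabled ones.

```shell
# Hypothetical helper: print the value (y, m, a number, ...) of one option
# in a kernel .config, or "unset" if it has no NAME=value line.
config_state() {
    # $1 = path to .config, $2 = option name such as CONFIG_FB_VGA16
    value=$(sed -n "s/^$2=\(.*\)$/\1/p" "$1")
    echo "${value:-unset}"
}
```

For the configuration shown above, `config_state .config CONFIG_FB_VGA16` would report the module setting and `config_state .config CONFIG_FB_SVGALIB` would report the option as unset.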
Comment 43 Rafael J. Wysocki 2010-07-13 21:24:33 UTC
On Tuesday, July 13, 2010, Alex Villacís Lasso wrote:
> On 09/07/10 19:25, Rafael J. Wysocki wrote:
> > This message has been generated automatically as a part of a report
> > of regressions introduced between 2.6.33 and 2.6.34.
> >
> > The following bug entry is on the current list of known regressions
> > introduced between 2.6.33 and 2.6.34.  Please verify if it still should
> > be listed and let the tracking team know (either way).
> >
> >
> > Bug-Entry   : http://bugzilla.kernel.org/show_bug.cgi?id=15664
> > Subject     : Graphics hang and kernel backtrace when starting Azureus with Compiz enabled
> > Submitter   : Alex Villacis Lasso<avillaci@ceibo.fiec.espol.edu.ec>
> > Date        : 2010-04-01 01:09 (100 days old)
> >
> I am currently testing 2.6.35-rc4 at commit
> 589643be6693c46fbc54bae77745f336c8ed4bcc . So far it has been up for 20
> hours, but it still showed the message [drm:drm_mode_rmfb] *ERROR* tried
> to remove a fb that we didn't own on boot.
Comment 44 Chris Wilson 2010-07-23 13:37:04 UTC
Created attachment 27218 [details]
Repeat unbind during free.

The crash/hang is definitely a use-after-free; the attached patch fixes a likely suspect.

Not sure what the cause of the invalid rmfb is, though.
Comment 45 Chris Wilson 2010-07-23 14:46:28 UTC
Created attachment 27219 [details]
Repeat unbind during free.

Daniel Vetter pointed out the stupid userspace mistake in the first attempt...
Comment 46 Chris Wilson 2010-07-23 14:56:11 UTC
Created attachment 27220 [details]
Repeat unbind during free.

Third time lucky. Move the deferred into the main retire requests, and hopefully the patch will apply to the stable tree as well.
Comment 47 Alex Villacis Lasso 2010-07-23 15:37:22 UTC
(In reply to comment #46)
> Created an attachment (id=27220) [details]
> Repeat unbind during free.
> 
> Third time lucky. Move the deferred into the main retire requests, and
> hopefully the patch will apply to the stable tree as well.

Does not apply to v2.6.35-rc6:

[alex@srv64 linux-2.6.35-rc6]$ patch -p1 --dry-run < ../bug15664-0001-drm-i915-Repeat-unbinding-during-free-if-interrupted.patch 
patching file drivers/gpu/drm/i915/i915_drv.h
Hunk #1 succeeded at 542 (offset -9 lines).
patching file drivers/gpu/drm/i915/i915_gem.c
Hunk #1 succeeded at 53 (offset 1 line).
Hunk #2 FAILED at 1747.
Hunk #3 succeeded at 1946 (offset 16 lines).
Hunk #4 succeeded at 1987 with fuzz 1 (offset 18 lines).
Hunk #5 succeeded at 4442 (offset 152 lines).
Hunk #6 succeeded at 4490 with fuzz 1 (offset 176 lines).
1 out of 6 hunks FAILED -- saving rejects to file drivers/gpu/drm/i915/i915_gem.c.rej
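When checking a patch against several trees, the dry-run output can be summarized by counting the hunks reported as failed. A minimal sketch that parses output in the format shown above; the function name `failed_hunks` is hypothetical, and hunks applied with fuzz or an offset are deliberately not counted as failures.

```shell
# Hypothetical helper: count "Hunk #N FAILED" lines in patch output on stdin.
# grep -c exits non-zero when the count is 0, so "|| true" keeps the status clean.
failed_hunks() {
    grep -c '^Hunk #[0-9]* FAILED' || true
}
```

It could be combined with the dry run as `patch -p1 --dry-run < some.patch | failed_hunks`, where a result of 0 means every hunk would apply.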
Comment 48 Chris Wilson 2010-07-23 18:22:50 UTC
Created attachment 27227 [details]
Repeat unbind during free.

Sorry about that, too great a delta between trees.
Comment 49 Rafael J. Wysocki 2010-07-23 19:44:26 UTC
On Friday, July 23, 2010, Alex Villacís Lasso wrote:
> On 23/07/10 07:11, Rafael J. Wysocki wrote:
> > This message has been generated automatically as a part of a report
> > of regressions introduced between 2.6.33 and 2.6.34.
> >
> > The following bug entry is on the current list of known regressions
> > introduced between 2.6.33 and 2.6.34.  Please verify if it still should
> > be listed and let the tracking team know (either way).
> >
> >
> > Bug-Entry   : http://bugzilla.kernel.org/show_bug.cgi?id=15664
> > Subject     : Graphics hang and kernel backtrace when starting Azureus with Compiz enabled
> > Submitter   : Alex Villacis Lasso<avillaci@ceibo.fiec.espol.edu.ec>
> > Date        : 2010-04-01 01:09 (114 days old)
> >
> Could not trigger with -rc5 after 9 days. Testing now with 2.6.35-rc6.
Comment 50 Alex Villacis Lasso 2010-07-30 15:39:51 UTC
Created attachment 27302 [details]
dmesg output with different backtrace

My machine just hung today, after running for 7 days. However, now the backtrace is completely different. This was produced with 2.6.35-rc6 plus the "Repeat unbind during free" patch.
Comment 51 Jesse Barnes 2010-07-30 15:51:20 UTC
> Created an attachment (id=27302)
>  --> (https://bugzilla.kernel.org/attachment.cgi?id=27302)
> dmesg output with different backtrace
> 
> My machine just hung today, after running for 7 days. However, now the
> backtrace is completely different. This was produced with 2.6.35-rc6 plus the
> "Repeat unbind during free" patch.

Can you file a new bug for this issue at bugs.freedesktop.org? (We look
at it more and it tends to be a lot faster than bz.kernel.org.)
Comment 52 Alex Villacis Lasso 2010-07-30 20:52:55 UTC
Bug created at freedesktop bugzilla:

https://bugs.freedesktop.org/show_bug.cgi?id=29325
Comment 53 Florian Mickler 2011-03-06 00:32:15 UTC
fix merged for .36-rc1:

commit be72615bcf4d5b7b314d836c5e1b4baa4b65dad1
Author: Chris Wilson <chris@chris-wilson.co.uk>
Date:   Fri Jul 23 23:18:50 2010 +0100

    drm/i915: Repeat unbinding during free if interrupted (v6)
