Created attachment 25789 [details] dmesg output showing kernel backtrace at moment of hang
System is Fedora 12 x86_64 with latest updates. I recently compiled and installed kernel 2.6.34-rc3. A few hours after booting this kernel, I tried to start Azureus (the BitTorrent client), and the graphics screen froze completely. I could not move the mouse, the keyboard was unresponsive, and not even the numlock LED changed when pressing the numlock key. At the same time I heard a single beep from the PC speaker. The machine was still responsive via the network, and when I logged in remotely I was able to collect the attached kernel backtraces.
This problem is intermittent. I have hit it twice; other times I have started Azureus without problems.
Steps to reproduce:
1) Run 2.6.34-rc3 kernel on x86_64 with an Intel graphics chipset
2) Enable compiz on GNOME
3) Try to start Azureus (might take a few retries)
Expected results: no crash, application starts normally
Actual results: graphics hang
Created attachment 25790 [details] Xorg log at moment of graphics hang Cannot see anything wrong with this particular log.
Created attachment 25791 [details] lspci -v output for my machine right after the crash
Created attachment 25792 [details] Kernel configuration used to compile crashing kernel
Forgot to mention: no such hangs occurred with 2.6.34-rc2 (as far as I remember) or 2.6.33.
Created attachment 25793 [details] /var/log/messages at time of crash, bzip compressed Look for date Mar 31. There are two hangs logged there.
Created attachment 26008 [details] dmesg of reporter with just the two crashing kernel runs
Thanks for the report. I haven't yet looked closely, but it looks like an issue I've been hunting for a while. Bisecting is likely not worth it (if it really is the same problem) because I have reports with kernels older than 2.6.34-rc2. I've also cut down your logfile to only show the dmesg output of the two crashing kernel runs - messages from other services just add noise for kernel problems. Please do this next time you upload a dmesg from the disk logs.
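A minimal sketch of the kind of filtering asked for above, assuming the traditional syslog layout "Mon DD HH:MM:SS host kernel: ..." in /var/log/messages. The helper names `is_kernel_line` and `is_boot_start` are made up for illustration, and the substring match is deliberately naive (a message from another service that happens to contain " kernel: " would also match).

```c
#include <assert.h>
#include <string.h>

/*
 * Returns 1 if a syslog line looks like it was logged by the kernel.
 * Assumption: the line uses the "... host kernel: message" layout.
 */
int is_kernel_line(const char *line)
{
    assert(line != NULL);
    return strstr(line, " kernel: ") != NULL;
}

/*
 * Returns 1 if the line looks like the first message of a new boot.
 * The kernel prints a "Linux version ..." banner early during startup,
 * so this can be used to split the log into separate kernel runs.
 */
int is_boot_start(const char *line)
{
    return is_kernel_line(line) && strstr(line, "Linux version ") != NULL;
}
```

Reading the log line by line and keeping only the lines for which `is_kernel_line` returns 1, starting a new output section whenever `is_boot_start` returns 1, would give one dmesg-like extract per kernel run.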
Ok, I've decoded the oops and it's barfing on the 5th page (address in RDX) in the pages array. In other words, the pages array handed to drm_clflush_pages has been corrupted (it's not even a valid kernel address anymore, hence the gp fault).
Possible explanation: The pages_refcount of the corresponding gem bo dropped to zero and the pages got freed. At least I have gathered tons of backtraces from the corresponding BUG_ON in i915_gem_object_put_pages from various testers while developing my i855 gtt coherency fix: https://bugs.freedesktop.org/show_bug.cgi?id=27187 [look for put_pages and/or BUG to find the relevant dmesgs]
All the testers from that bug report are using my unmap-inactive-objects hack, i.e. it's much more likely that they hit the BUG_ON(pages_refcount == 0) (which is usually hit in the gtt unmap path) before the pages array has a chance to be corrupted. We only access the pages array when clflushing. So there's plenty of time for corruptions to happen without ill effects on vanilla kernels.
This reporter's dmesg has an error about a fb inconsistency right before the kernel blows up in both cases. Adding Jesse because this might be pageflip related.
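To illustrate the suspected failure mode, here is a small user-space model of a reference-counted pages array. It is only a sketch: `struct model_bo` and the `model_*` functions are hypothetical stand-ins, not the i915 code, and an unbalanced put returns an error here where the kernel would hit BUG_ON(pages_refcount == 0).

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical stand-in for a GEM buffer object. */
struct model_bo {
    int pages_refcount;   /* number of users of the pages array */
    void **pages;         /* backing pages array; NULL once released */
    size_t page_count;
};

/* Take a reference, allocating the array on first use. Returns 0 on success. */
int model_get_pages(struct model_bo *bo)
{
    assert(bo != NULL);
    if (bo->pages_refcount++ == 0) {
        bo->pages = calloc(bo->page_count, sizeof(*bo->pages));
        if (!bo->pages) {
            bo->pages_refcount--;
            return -1;
        }
    }
    return 0;
}

/* Drop a reference. Returns -1 on an unbalanced put instead of crashing. */
int model_put_pages(struct model_bo *bo)
{
    if (bo->pages_refcount == 0)
        return -1;        /* the BUG_ON(pages_refcount == 0) situation */
    if (--bo->pages_refcount == 0) {
        free(bo->pages);
        bo->pages = NULL; /* the model clears the pointer; stale pointers are the danger */
    }
    return 0;
}

/* A clflush-like walk over the array is only safe while a reference is held. */
int model_can_flush(const struct model_bo *bo)
{
    return bo->pages_refcount > 0 && bo->pages != NULL;
}
```

If some path drops one reference too many, the array is released while another user still expects it to be valid. Nothing goes wrong until that user walks the array again, which would match the observation that the crash only shows up when clflushing.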
I am also affected. drm-intel-next 2.6.34-rc2 mesa 7.7.1 libgl 7.7.1 xorg-server 1.7.6 xf86-video-intel 2.10.0 libdrm 2.4.18/2.4.20
Created attachment 26011 [details] debug kernel patch Alex, can you please apply this patch against the kernel and try to rehang your box. This will kill performance (so expect a somewhat sluggish feel on the desktop), but it should also kill your box pretty fast (if my theory is right). If it hangs, please capture the full dmesg.
Created attachment 26019 [details] Extract of dmesg with 2.6.34-rc4 and NO PATCH The bug still persists in 2.6.34-rc4. This backtrace seems almost identical to the previous one. I have not yet applied the patch at this point.
On Thursday 15 April 2010, Alex Villacís Lasso wrote:
> On 07/04/10 16:13, Rafael J. Wysocki wrote:
> > This message has been generated automatically as a part of a summary report
> > of recent regressions.
> >
> > The following bug entry is on the current list of known regressions
> > from 2.6.33. Please verify if it still should be listed and let the tracking team
> > know (either way).
> >
> > Bug-Entry : http://bugzilla.kernel.org/show_bug.cgi?id=15664
> > Subject : Graphics hang and kernel backtrace when starting Azureus with Compiz enabled
> > Submitter : Alex Villacis Lasso <avillaci@ceibo.fiec.espol.edu.ec>
> > Date : 2010-04-01 01:09 (7 days old)
>
> Still present in 2.6.34-rc4.
Created attachment 26024 [details] debug patch
Please retest with this debug patch - it contains some paranoia-checks to see which refcount went south. Also enable slab debugging (CONFIG_DEBUG_SLAB), this should help to shed some light on this if it's really a use-after-free thing (which is what it looks like). btw, it's definitely not the same bug as the BUG_ON in put_pages I've mentioned earlier - that's resolved and was definitely a bug in one of my debug patches. Thanks for the other dmesg. I haven't looked at it as closely as the other ones, but it's essentially the same error.
Created attachment 26029 [details] dmesg output after 2.6.34-rc4 and second patch I applied the patch against 2.6.34-rc4 (offset -1 lines) and rebooted. I could not even reach the graphical login. The screen froze with the completed Fedora logo in the blue background, with no X cursor. I logged from the network and captured the attached dmesg output with backtrace.
@Alex: I experienced the same issue with unpatched Fedora kernel, are you sure that patch was applied?
@Daniele: I applied the second patch (the one that just adds the BUG_ON). I am hitting BUG_ON(atomic_read(&obj->refcount.refcount) == 0); which was the one added by the second patch. So yes, the patch is applied. Or else I do not understand your question. Unpatched -rc4 lets me login and work for a while, until the hang happens, as remarked in comment #10.
[comment seems to have been eaten by some mta, resending via the web interface]
> --- Comment #13 from Alex Villacis Lasso <avillaci@ceibo.fiec.espol.edu.ec> 2010-04-16 15:11:54 ---
> I applied the patch against 2.6.34-rc4 (offset -1 lines) and rebooted. I could
> not even reach the graphical login. The screen froze with the completed Fedora
> logo in the blue background, with no X cursor. I logged from the network and
> captured the attached dmesg output with backtrace.
Thanks for testing. So yep, refcounts gone south. Unfortunately I've been a little bit too lazy when doing that debug patch. Can you check with your sources which of the two BUG_ONs my patch inserted you've hit? If it's the first one, please exchange them like this:
	BUG_ON(atomic_read(&obj->refcount.refcount) == 0);
	BUG_ON(obj_priv->pages_refcount == 0);
and check whether it's still only hitting the pages_refcount thing. Thanks.
I am hitting the second BUG_ON, not the first. My kernel was compiled with CONFIG_SLUB and CONFIG_SLUB_DEBUG, but the last dmesg was without any slub_debug configured. I will retest with slub_debug.
I meant, I did not pass slub_debug as a kernel command line parameter.
Created attachment 26096 [details] dmesg output after 2.6.34-rc5 and BUG_ON patch
Bug still present in 2.6.34-rc5, crashes in the same way as 2.6.34-rc4. It hits the *second* BUG_ON at:
	BUG_ON(obj_priv->pages_refcount == 0);
	BUG_ON(atomic_read(&obj->refcount.refcount) == 0);
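Why the order of the two checks matters can be shown with a tiny model. This is a sketch only: `first_failed_check` and the enum are hypothetical names, and the function merely mirrors the order of the two BUG_ON lines quoted above (pages refcount first, object refcount second).

```c
#include <assert.h>

enum first_failure {
    CHECK_NONE,
    CHECK_PAGES_REFCOUNT,
    CHECK_OBJECT_REFCOUNT
};

/* Reports which of two ordered checks would fire first, if any. */
enum first_failure first_failed_check(int pages_refcount, int object_refcount)
{
    /* Precondition of the model: counts are never negative. */
    assert(pages_refcount >= 0 && object_refcount >= 0);

    if (pages_refcount == 0)
        return CHECK_PAGES_REFCOUNT;
    if (object_refcount == 0)
        return CHECK_OBJECT_REFCOUNT;
    return CHECK_NONE;
}
```

With this order, reaching the second check means the first one passed. Hitting the second BUG_ON therefore says that pages_refcount was still nonzero while the object's own reference count had already reached zero.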
> --- Comment #19 from Alex Villacis Lasso <avillaci@ceibo.fiec.espol.edu.ec> 2010-04-22 15:47:09 ---
> Created an attachment (id=26096)
> --> (https://bugzilla.kernel.org/attachment.cgi?id=26096)
> dmesg output after 2.6.34-rc5 and BUG_ON patch
>
> Bug still present in 2.6.34-rc5, crashes in the same way as 2.6.34-rc4. It hits
> the *second* BUG_ON at:
> BUG_ON(obj_priv->pages_refcount == 0);
> BUG_ON(atomic_read(&obj->refcount.refcount) == 0);
Thanks for the clarification. I was somewhat confused about which BUG_ON you've hit, but too busy to ask for clarification. Jesse, this dmesg (like all the ones before) again shows the fb related ERROR right before. Can you please take a look? The BUG_ON Alex is hitting indicates a screwed-up refcount somewhere.
On Thursday 22 April 2010, Alex Villacís Lasso wrote:
> On 19/04/10 22:19, Rafael J. Wysocki wrote:
> > This message has been generated automatically as a part of a summary report
> > of recent regressions.
> >
> > The following bug entry is on the current list of known regressions
> > from 2.6.33. Please verify if it still should be listed and let the tracking team
> > know (either way).
> >
> > Bug-Entry : http://bugzilla.kernel.org/show_bug.cgi?id=15664
> > Subject : Graphics hang and kernel backtrace when starting Azureus with Compiz enabled
> > Submitter : Alex Villacis Lasso <avillaci@ceibo.fiec.espol.edu.ec>
> > Date : 2010-04-01 01:09 (19 days old)
>
> Still present in -rc5. I am testing with a patch that triggers a BUG_ON
> when the refcount reaches zero, and it triggered in rc4 and rc5. The
> patch is attached as part of the bug report.
Still present in 2.6.34-rc6.
On Wednesday 05 May 2010, Alex Villacís Lasso wrote:
> On 04/05/10 16:21, Rafael J. Wysocki wrote:
> > This message has been generated automatically as a part of a summary report
> > of recent regressions.
> >
> > The following bug entry is on the current list of known regressions
> > from 2.6.33. Please verify if it still should be listed and let the tracking team
> > know (either way).
> >
> > Bug-Entry : http://bugzilla.kernel.org/show_bug.cgi?id=15664
> > Subject : Graphics hang and kernel backtrace when starting Azureus with Compiz enabled
> > Submitter : Alex Villacis Lasso <avillaci@ceibo.fiec.espol.edu.ec>
> > Date : 2010-04-01 01:09 (34 days old)
>
> Still present in -rc6
Created attachment 26243 [details] be paranoid in rmfb Ok, that's a shot in the dark. Care to test this patch? If it doesn't turn up anything new, can you please try to bisect this regression? If it helps to crash the kernel faster, simply apply my earlier debug patch for testing. Thanks, Daniel
Created attachment 26251 [details] dmesg output after 2.6.34-rc6 and BUG_ON patch and paranoid rmfb patch Seems there is nothing new. Bisecting might take a long time since this is my primary workstation machine and I cannot just reboot it at will.
Created attachment 26264 [details] dmesg output after 2.6.33 and BUG_ON patch and paranoid rmfb patch Now this is new. I was trying to locate a kernel version that does not crash, in order to start bisecting. I remember vanilla 2.6.33 had no video-related crash problems, so I started there. I applied the BUG_ON patch and the paranoid rmfb patch to 2.6.33, recompiled, and rebooted. This time I hit the BUG_ON in the same way as done in 2.6.34-rcX. My conclusion is that 2.6.33 already had the refcount bug but was saved from crashing until recent drm changes exposed the bug. The trouble is that without the BUG_ON patch, 2.6.34-rcX works for a while, even if internally its objects are incorrectly refcounted. What do you think? BTW, my stock Fedora 12 kernel (2.6.32.11-99.fc12.x86_64) also has the "tried to remove a fb..." message, so I think it might have the hidden refcounting bug too. However, it has never crashed yet.
Created attachment 26378 [details] print fb<->bo association
Please apply this debug patch on top of the rest. This is just to check that the gem object the kernel is crashing on is really associated with an fb (I think so, but testing is always better). Please upload the full dmesg of a crashing kernel. Thanks.
Created attachment 26384 [details] dmesg output after 2.6.34-rc7 and BUG_ON, paranoid rmfb, fb-bo patches As requested, here is the full dmesg output with crash.
Same BUG here. FYI: http://article.gmane.org/gmane.comp.video.dri.devel/46062 Thx Daniel. Cheers
> --- Comment #29 from Alex Villacis Lasso <avillaci@ceibo.fiec.espol.edu.ec> > 2010-05-14 16:46:51 --- > Created an attachment (id=26384) > --> (https://bugzilla.kernel.org/attachment.cgi?id=26384) > dmesg output after 2.6.34-rc7 and BUG_ON, paranoid rmfb, fb-bo patches > > As requested, here is the full dmesg output with crash. Thanks. Confirms indeed that we blow up on the backing bo of a just freed framebuffer.
I've fixed some page-flipping and framebuffer reference counting bugs in xf86-video-intel very recently. Could you please try with an up-to-date xorg driver?
commit 9f54107f866a25cf670f81f7c52b8c108728c6a5
Author: Chris Wilson <chris@chris-wilson.co.uk>
Date: Tue May 11 14:55:16 2010 +0100
    dri2: Handle reference counting across page flipping
    1. Instead of swapping bos, swap the entire private structure.
    2. If we update the pixmap bo for the Screen, make sure we update the reference inside intel->front_buffer so that xrandr still functions.
    Fixes: Bug 27922 - i965: Rapidly resizing OpenGL window causes GPU to hang.
    https://bugs.freedesktop.org/show_bug.cgi?id=27922
    Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
commit 0d2392d44aae95d6b571d98f7ec323cf672a687f
Author: Chris Wilson <chris@chris-wilson.co.uk>
Date: Fri May 14 10:32:12 2010 +0100
    dri: Hold reference to buffers across swap
    As we schedule swaps for some time in the future and may process a detachment prior to receiving the vblank notification from the kernel, we need to hold a reference to the buffers for our swap event handler.
    Fixes: Bug 28080 - "glresize" causes X server segfault with indirect rendering.
    https://bugs.freedesktop.org/show_bug.cgi?id=28080
    Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
I am running 2.6.34 now, with NO additional patches. There have been no crashes in at least a day. However, I *still* get this line in dmesg:
[ 31.313153] [drm:drm_mode_rmfb] *ERROR* tried to remove a fb that we didn't own
So I do not think the problem can be considered solved yet. The -rc3 kernel also ran for a day or so, and then crashed...
Created attachment 26490 [details] backtrace with 2.6.34 with no extra patches I ran 2.6.34 for several days, and then I get this backtrace and hang. So the problem is DEFINITELY not solved. Grrr...
The telltale message "tried to remove a fb" still appears in 2.6.35-rc2. In addition, I am now affected by bug #16149, but reverting the commit mentioned in that bug report fixes that problem.
Created attachment 26703 [details] Backtrace with 2.6.35-rc2 The machine finally produced a backtrace and hang, but now the backtrace is different. Is there any news on this?
The message "tried to remove a fb" still appears in 2.6.35-rc3.
On Monday, June 21, 2010, Alex Villacís Lasso wrote:
> On 20/06/10 17:34, Rafael J. Wysocki wrote:
> > This message has been generated automatically as a part of a report
> > of regressions introduced between 2.6.33 and 2.6.34.
> >
> > The following bug entry is on the current list of known regressions
> > introduced between 2.6.33 and 2.6.34. Please verify if it still should
> > be listed and let the tracking team know (either way).
> >
> > Bug-Entry : http://bugzilla.kernel.org/show_bug.cgi?id=15664
> > Subject : Graphics hang and kernel backtrace when starting Azureus with Compiz enabled
> > Submitter : Alex Villacis Lasso <avillaci@ceibo.fiec.espol.edu.ec>
> > Date : 2010-04-01 01:09 (81 days old)
>
> Still present in 2.6.35-rc3.
Created attachment 26962 [details] Extract of /var/log/messages with 2.6.35-rc3 and patched DRM locking I can reproduce the reference miscount at home using the paranoid rmfb patch. In addition, I patched the DRM locking functions to WARN_ON on each call, and posted the resulting dmesg as written to /var/log/messages . Is this useful at all?
Still having the "tried to remove a fb..." message with 2.6.35-rc4.
Do you have the VGA framebuffer selected in .config?
[alex@srv64 linux-2.6.34-rc4-git]$ grep VGA .config
CONFIG_VGA_ARB=y
CONFIG_VGA_ARB_MAX_GPUS=16
# CONFIG_VGA_SWITCHEROO is not set
CONFIG_VGASTATE=m
# CONFIG_FB_SVGALIB is not set
CONFIG_FB_VGA16=m
CONFIG_VGA_CONSOLE=y
CONFIG_VGACON_SOFT_SCROLLBACK=y
CONFIG_VGACON_SOFT_SCROLLBACK_SIZE=64
# CONFIG_LOGO_LINUX_VGA16 is not set
CONFIG_USB_SISUSBVGA=m
CONFIG_USB_SISUSBVGA_CON=y
I see that CONFIG_FB_VGA16 is built as a module.
On Tuesday, July 13, 2010, Alex Villacís Lasso wrote:
> On 09/07/10 19:25, Rafael J. Wysocki wrote:
> > This message has been generated automatically as a part of a report
> > of regressions introduced between 2.6.33 and 2.6.34.
> >
> > The following bug entry is on the current list of known regressions
> > introduced between 2.6.33 and 2.6.34. Please verify if it still should
> > be listed and let the tracking team know (either way).
> >
> > Bug-Entry : http://bugzilla.kernel.org/show_bug.cgi?id=15664
> > Subject : Graphics hang and kernel backtrace when starting Azureus with Compiz enabled
> > Submitter : Alex Villacis Lasso <avillaci@ceibo.fiec.espol.edu.ec>
> > Date : 2010-04-01 01:09 (100 days old)
>
> I am currently testing 2.6.35-rc4 at commit
> 589643be6693c46fbc54bae77745f336c8ed4bcc. So far it has been up for 20
> hours, but on boot it still showed the message
> [drm:drm_mode_rmfb] *ERROR* tried to remove a fb that we didn't own
Created attachment 27218 [details] Repeat unbind during free.
The crash/hang is definitely a use-after-free; the attached patch fixes a likely suspect. Not sure what the cause of the invalid rmfb is, though.
Created attachment 27219 [details] Repeat unbind during free. Daniel Vetter pointed out the stupid userspace mistake in the first attempt...
Created attachment 27220 [details] Repeat unbind during free. Third time lucky. Move the deferred into the main retire requests, and hopefully the patch will apply to the stable tree as well.
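The idea named in the patch title, as far as it can be read from these comments, can be modelled roughly as follows. This is a user-space sketch under stated assumptions, not the actual i915 patch: every `model_*` name, the error code `MODEL_EINTERRUPTED`, and the list layout are hypothetical. The assumption is that an unbind can be interrupted part-way, and that completing the free anyway would leave a half-torn-down object behind.

```c
#include <assert.h>
#include <stddef.h>

#define MODEL_EINTERRUPTED (-1)  /* hypothetical "unbind was interrupted" code */

struct model_obj {
    int bound;                   /* 1 while still bound in the model */
    int interruptions_left;      /* how many unbind attempts will be interrupted */
    int freed;                   /* 1 once the object has really been released */
    struct model_obj *next_deferred;
};

struct model_dev {
    struct model_obj *deferred_free_list;  /* objects whose free must be retried */
};

/* Try to unbind; may report an interruption instead of completing. */
static int model_unbind(struct model_obj *obj)
{
    if (obj->interruptions_left > 0) {
        obj->interruptions_left--;
        return MODEL_EINTERRUPTED;
    }
    obj->bound = 0;
    return 0;
}

/* Free an object, deferring the free if the unbind was interrupted. */
void model_free_object(struct model_dev *dev, struct model_obj *obj)
{
    assert(!obj->freed);
    if (model_unbind(obj) == MODEL_EINTERRUPTED) {
        /* Do not release a still-bound object; queue it for a retry. */
        obj->next_deferred = dev->deferred_free_list;
        dev->deferred_free_list = obj;
        return;
    }
    obj->freed = 1;
}

/* Called periodically; retries every deferred free. */
void model_retire_requests(struct model_dev *dev)
{
    struct model_obj *list = dev->deferred_free_list;

    dev->deferred_free_list = NULL;
    while (list) {
        struct model_obj *obj = list;

        list = obj->next_deferred;
        obj->next_deferred = NULL;
        model_free_object(dev, obj);  /* may re-queue if interrupted again */
    }
}
```

The point of the design is that an object is never marked freed while it is still bound: an interrupted free is retried later from the retire path instead of being silently abandoned.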
(In reply to comment #46)
> Created an attachment (id=27220) [details]
> Repeat unbind during free.
>
> Third time lucky. Move the deferred into the main retire requests, and
> hopefully the patch will apply to the stable tree as well.
Does not apply to v2.6.35-rc6:
[alex@srv64 linux-2.6.35-rc6]$ patch -p1 --dry-run < ../bug15664-0001-drm-i915-Repeat-unbinding-during-free-if-interrupted.patch
patching file drivers/gpu/drm/i915/i915_drv.h
Hunk #1 succeeded at 542 (offset -9 lines).
patching file drivers/gpu/drm/i915/i915_gem.c
Hunk #1 succeeded at 53 (offset 1 line).
Hunk #2 FAILED at 1747.
Hunk #3 succeeded at 1946 (offset 16 lines).
Hunk #4 succeeded at 1987 with fuzz 1 (offset 18 lines).
Hunk #5 succeeded at 4442 (offset 152 lines).
Hunk #6 succeeded at 4490 with fuzz 1 (offset 176 lines).
1 out of 6 hunks FAILED -- saving rejects to file drivers/gpu/drm/i915/i915_gem.c.rej
Created attachment 27227 [details] Repeat unbind during free. Sorry about that, too great a delta between trees.
On Friday, July 23, 2010, Alex Villacís Lasso wrote:
> On 23/07/10 07:11, Rafael J. Wysocki wrote:
> > This message has been generated automatically as a part of a report
> > of regressions introduced between 2.6.33 and 2.6.34.
> >
> > The following bug entry is on the current list of known regressions
> > introduced between 2.6.33 and 2.6.34. Please verify if it still should
> > be listed and let the tracking team know (either way).
> >
> > Bug-Entry : http://bugzilla.kernel.org/show_bug.cgi?id=15664
> > Subject : Graphics hang and kernel backtrace when starting Azureus with Compiz enabled
> > Submitter : Alex Villacis Lasso <avillaci@ceibo.fiec.espol.edu.ec>
> > Date : 2010-04-01 01:09 (114 days old)
>
> Could not trigger with -rc5 after 9 days. Testing now with 2.6.35-rc6.
Created attachment 27302 [details] dmesg output with different backtrace My machine just hung today, after running for 7 days. However, now the backtrace is completely different. This was produced with 2.6.35-rc6 plus the "Repeat unbind during free" patch.
> Created an attachment (id=27302)
> --> (https://bugzilla.kernel.org/attachment.cgi?id=27302)
> dmesg output with different backtrace
>
> My machine just hung today, after running for 7 days. However, now the
> backtrace is completely different. This was produced with 2.6.35-rc6 plus the
> "Repeat unbind during free" patch.
Can you file a new bug for this issue at bugs.freedesktop.org? We look at it more often, and it tends to be a lot faster than bz.kernel.org.
Bug created at freedesktop bugzilla: https://bugs.freedesktop.org/show_bug.cgi?id=29325
fix merged for .36-rc1:
commit be72615bcf4d5b7b314d836c5e1b4baa4b65dad1
Author: Chris Wilson <chris@chris-wilson.co.uk>
Date: Fri Jul 23 23:18:50 2010 +0100
    drm/i915: Repeat unbinding during free if interrupted (v6)