Created attachment 79181 [details]
syslog of crash
I've had this a couple of times in 3.6-rc4 (I don't recall seeing it in rc3):
kernel BUG at drivers/gpu/drm/i915/i915_gem.c:3245!
invalid opcode: 0000 [#1] SMP
...
Call Trace:
Sep 4 10:38:30 sierra kernel: [11712.658968] [<ffffffffa00aee68>] i915_gem_object_pin_to_display_plane+0x88/0x100 [i915]
It locks up the PC, requiring a hard reboot. A longer syslog is attached.
The nvidia module was loaded because it was running on a separate X server (ie via bumblebee) and it had itself just crashed. I don't think it is relevant, though: the last time the PC froze the nvidia module wasn't loaded, and the nvidia module crashes frequently while the i915 driver rarely does.
If the nvidia module has been loaded at any point since boot, it could easily be responsible. One driver can crash another.
I've seen it crash with the same error without the nvidia module ever being loaded. I only mentioned it because the nvidia module was loaded at the time of the syslog. The nvidia module is blacklisted so that it doesn't get loaded at boot - it only gets loaded if I manually start an X server with it via bumblebee.
Created attachment 79231 [details]
syslog of second crash
Here's a second crash from later today. It looks like the same stack trace (no nvidia module was ever loaded this time).
rocko, I think you'll have to bisect it.
That's a real pity, because I don't know how to reproduce it. For instance, it happened again for the first time since the 4th of September - ie a full 8 days later, so bisecting will take forever. I was hoping the stack trace might be of use.
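To put a rough number on "forever": a bisection needs about log2(N) test kernels for N candidate commits, and each one has to run long enough to be trusted as good. The sketch below is only an estimate. The 1000-commit range is a hypothetical figure, not the real rc3..rc4 count, and `bisect_steps` and `bisect_days` are illustrative helpers, not part of git.

```c
#include <assert.h>

/* Approximate number of kernels to build and test when bisecting
 * `commits` candidate commits: halve the range until one is left. */
static unsigned bisect_steps(unsigned long commits)
{
    unsigned steps = 0;

    while (commits > 1) {
        commits = (commits + 1) / 2;  /* round up: worst-case half */
        steps++;
    }
    return steps;
}

/* Total wall-clock days if every step needs `days_per_test` days of
 * crash-free running before the kernel can be marked good. */
static unsigned long bisect_days(unsigned long commits, unsigned days_per_test)
{
    return (unsigned long)bisect_steps(commits) * days_per_test;
}
```

With a hypothetical range of 1000 commits and 8 days per test, `bisect_days(1000, 8)` comes to 80 days, and a single wrong "good" verdict would send the search down the wrong half.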
Created attachment 79861 [details] another crashlog from 3.6-rc5
Created attachment 80521 [details]
Flush outstanding unpin tasks.
The working theory is that we have a backlog of incomplete pageflip tasks leaking the pincount.
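The working theory can be illustrated with a small userspace model. This is a sketch under stated assumptions, not the i915 code: `fb_obj`, `unpin_queue`, `MAX_PIN_COUNT` and every function below are hypothetical stand-ins. Each flip pins the object and defers the matching unpin to a queue. If the queue is never drained, the pin count climbs until it reaches the cap, which is where a real driver might hit a BUG. Draining the queue before each new pin keeps the count bounded.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define MAX_PIN_COUNT 15  /* assumed cap, for illustration only */
#define MAX_PENDING   64  /* capacity of the toy deferred-unpin queue */

struct fb_obj {
    int pin_count;
};

struct unpin_queue {
    struct fb_obj *pending[MAX_PENDING];
    size_t count;
};

/* Take a pin; returns false where a real driver might BUG instead. */
static bool pin(struct fb_obj *obj)
{
    if (obj->pin_count >= MAX_PIN_COUNT)
        return false;
    obj->pin_count++;
    return true;
}

/* Queue an unpin to run later, like a deferred work item. */
static bool defer_unpin(struct unpin_queue *q, struct fb_obj *obj)
{
    if (q->count >= MAX_PENDING)
        return false;
    q->pending[q->count++] = obj;
    return true;
}

/* Run every queued unpin now. */
static void flush_unpins(struct unpin_queue *q)
{
    for (size_t i = 0; i < q->count; i++)
        q->pending[i]->pin_count--;
    q->count = 0;
}

/* Attempt n flips where the deferred unpins never run on their own.
 * With flush_first set, the backlog is drained before each new pin.
 * Returns how many flips succeeded before the pin cap was hit. */
static int simulate_flips(int n, bool flush_first)
{
    struct fb_obj obj = { .pin_count = 0 };
    struct unpin_queue q = { .count = 0 };
    int done = 0;

    for (int i = 0; i < n; i++) {
        if (flush_first)
            flush_unpins(&q);
        if (!pin(&obj) || !defer_unpin(&q, &obj))
            break;
        done++;
    }
    return done;
}
```

In this model `simulate_flips(100, false)` stops after 15 flips because the backlog holds every pin, while `simulate_flips(100, true)` completes all 100. Whether the real leak follows this pattern is exactly what the patch is meant to test.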
OK, trying the patch out now, so far so good. (But this is a really tricky one to trigger; I don't think I've seen it now in the last 11 days.)
Any sign of success?
Created attachment 83911 [details]
syslog of a crash
I have seen the occasional hang; for instance, the attached was from five days ago when running on kernel 3.6.2. There's an intel_unpin_fb_obj call in the stack trace - is that relevant to this bug? At the time of this crash I didn't have the patch from this bug applied; it was a standard kernel. However, I assumed the patch had already been sent to git, because I had to revert it in an earlier rc when it conflicted with a git update.
Created attachment 83971 [details]
Flush outstanding unpin tasks.
It was never applied as I was waiting for confirmation that it does fix things before yelling at Daniel.
Oops, I guess I should have said something earlier. Is the last crash I attached the same bug then?
Which version of the kernel is the new patch for? I tried it against 3.7-rc1 (my preferred one since it is working quite nicely at the moment) and 3.6.2 and in both cases it failed to apply hunk #2, ie to method do_intel_finish_page_flip. In both cases this line isn't in the method:
    if (atomic_read(&obj->pending_flip) == 0)
        wake_up(&dev_priv->pending_flip_queue);
Instead, the lines before where it is trying to apply the patch are:
    atomic_clear_mask(1 << intel_crtc->plane, &obj->pending_flip.counter);
    wake_up(&dev_priv->pending_flip_queue);
Should I just replace "schedule_work(&work->work);" with "queue_work(dev_priv->wq, &work->work);"?
(In reply to comment #12)
> Oops, I guess I should have said something earlier. Is the last crash I
> attached the same bug then?
It's close enough that the cause is likely to be the same, but for the moment I'm just a little puzzled by that appearing to be an unpin leak...
> Which version of the kernel is the new patch for? I tried it against 3.7-rc1
> (my preferred one since it is working quite nicely at the moment) and 3.6.2
> and in both cases it failed to apply hunk #2, ie to method
It was against dinq, rebasing now against 3.7-rc1 for you.
Created attachment 83981 [details] Flush outstanding unpin tasks (v3.7-rc1)
Thanks, I've applied it against the latest git and am now trying it out.
So I've been running with the patch against 3.7-rcX for the last four weeks now. Generally it has behaved fine, although I did see it crash once - right after the nvidia driver crashed with an Xid error. The syslog looked similar to the previous errors, but I only got a couple of lines of it before the kernel locked up, so it's not any help.
I do find that if I run SNA with the latest video driver (xf86-video-intel-2.6.99.902 at the moment), occasionally unity/compiz locks up after the screen saver - I just get a black screen where only the mouse pointer moves until I restart unity. Is it at all possible that this behaviour is linked to the patch? If I use UXA I get the slightly less annoying glitch where the screen locks up completely at random (again except for the mouse) but I can reset it by CTRL-ALT-F1 and then back again with CTRL-ALT-F7.
(In reply to comment #16)
> I do find that if I run SNA with the latest video driver
> (xf86-video-intel-2.6.99.902 at the moment), occasionally unity/compiz locks
> up after the screen saver - I just get a black screen where only the mouse
> pointer moves until I restart unity. Is it at all possible that this
> behaviour is linked to the patch?
No, there are a few other funnies in the kernel eating the pageflip...
A patch referencing this bug report has been merged in Linux v3.8-rc1:
commit b4a98e57fc27854b5938fc8b08b68e5e68b91e1f
Author: Chris Wilson <chris@chris-wilson.co.uk>
Date: Thu Nov 1 09:26:26 2012 +0000
drm/i915: Flush outstanding unpin tasks before pageflipping