Bug 46991

Summary: kernel BUG at drivers/gpu/drm/i915/i915_gem.c:3245 (RIP [<ffffffffa00aed94>] i915_gem_object_pin+0x144/0x190 [i915])
Product: Drivers
Reporter: rocko (rockorequin)
Component: Video(DRI - Intel)
Assignee: Chris Wilson (chris)
Status: RESOLVED CODE_FIX
Severity: high
CC: alan, ben, chris, florian, intel-gfx-bugs, rockorequin
Priority: P1
Hardware: All
OS: Linux
Kernel Version: 3.6-rc4
Subsystem:
Regression: Yes
Bisected commit-id:
Attachments:
syslog of crash
syslog of second crash
another crashlog from 3.6-rc5
Flush outstanding unpin tasks.
syslog of a crash
Flush outstanding unpin tasks.
Flush outstanding unpin tasks (v3.7-rc1)

Description rocko 2012-09-04 02:53:28 UTC
Created attachment 79181
syslog of crash

I've had this a couple of times in 3.6-rc4 (I don't recall seeing it in rc3):

kernel BUG at drivers/gpu/drm/i915/i915_gem.c:3245!
invalid opcode: 0000 [#1] SMP 
...
Call Trace:
Sep  4 10:38:30 sierra kernel: [11712.658968]  [<ffffffffa00aee68>] i915_gem_object_pin_to_display_plane+0x88/0x100 [i915]


It locks up the PC, requiring a hard reboot.

A longer syslog is attached. The nvidia module was loaded because it was running on a separate X server (ie via bumblebee) and it had itself just crashed, but I don't think it is relevant because the last time the PC froze the nvidia module wasn't loaded, and because the nvidia module crashes frequently but the i915 driver rarely does.
Comment 1 Alan 2012-09-04 11:00:51 UTC
If you've had the nvidia module even loaded during a boot then it could easily be responsible. One driver can crash another.
Comment 2 rocko 2012-09-04 12:52:43 UTC
I've seen it crash with the same error without the nvidia module ever being loaded. I only mentioned it because the nvidia module was loaded at the time of the syslog. The nvidia module is blacklisted so that it doesn't get loaded at boot - it only gets loaded if I manually start an X server with it via bumblebee.
Comment 3 rocko 2012-09-04 12:55:31 UTC
Created attachment 79231
syslog of second crash

Here's a second crash from later today. It looks like the same stack trace (no nvidia module was ever loaded this time).
Comment 4 Ben Widawsky 2012-09-04 21:16:56 UTC
rocko, I think you'll have to bisect it.
Comment 5 rocko 2012-09-11 23:12:30 UTC
That's a real pity, because I don't know how to reproduce it. For instance, it happened again for the first time since the 4th of September - ie a full 8 days later, so bisecting will take forever. I was hoping the stack trace might be of use.
Comment 6 rocko 2012-09-11 23:14:13 UTC
Created attachment 79861
another crashlog from 3.6-rc5
Comment 7 Chris Wilson 2012-09-19 11:23:16 UTC
Created attachment 80521
Flush outstanding unpin tasks.

The working theory is that we have a backlog of incomplete pageflip tasks leaking the pincount.
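
To illustrate the theory above, here is a minimal userspace sketch in C of how deferred unpin tasks that never get to run could exhaust a bounded pin count. It is a toy model, not driver code: every name (toy_obj, toy_pin, toy_queue_unpin, toy_flush_unpins, toy_run_flips) and the limit TOY_MAX_PIN_COUNT are assumptions made up for illustration and are not taken from the attached patch.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical pin limit for this toy model (an assumption, not from the report). */
#define TOY_MAX_PIN_COUNT 15

struct toy_obj {
	int pin_count;      /* pins currently held on the buffer */
	int pending_unpins; /* deferred unpin tasks queued but not yet run */
};

/* Take one pin. Returns false at the point where a real driver might BUG. */
static bool toy_pin(struct toy_obj *obj)
{
	if (obj->pin_count == TOY_MAX_PIN_COUNT)
		return false;
	obj->pin_count++;
	return true;
}

/* A flip completes: the matching unpin is only queued, not executed. */
static void toy_queue_unpin(struct toy_obj *obj)
{
	obj->pending_unpins++;
}

/* Run every queued unpin task, dropping one pin per task. */
static void toy_flush_unpins(struct toy_obj *obj)
{
	while (obj->pending_unpins > 0) {
		assert(obj->pin_count > 0);
		obj->pin_count--;
		obj->pending_unpins--;
	}
}

/*
 * Flip the same buffer up to `flips` times while the deferred unpin
 * tasks never run on their own. Returns how many flips succeeded
 * before the pin count hit the limit.
 */
static int toy_run_flips(int flips, bool flush_first)
{
	struct toy_obj obj = { 0, 0 };
	int done = 0;

	for (int i = 0; i < flips; i++) {
		if (flush_first)
			toy_flush_unpins(&obj); /* drain the backlog before pinning */
		if (!toy_pin(&obj))
			break;
		toy_queue_unpin(&obj);
		done++;
	}
	return done;
}
```

In this model, toy_run_flips(100, false) stops after 15 flips because the queued unpins never drop the count, while toy_run_flips(100, true) completes all 100. Flushing the backlog before taking a new pin is the idea suggested by the patch title; the real driver's locking and workqueue handling are not modelled here.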
Comment 8 rocko 2012-09-19 23:54:46 UTC
OK, trying the patch out now, so far so good. (But this is a really tricky one to trigger; I don't think I've seen it now in the last 11 days.)
Comment 9 Chris Wilson 2012-10-18 14:27:38 UTC
Any sign of success?
Comment 10 rocko 2012-10-18 20:56:01 UTC
Created attachment 83911
syslog of a crash

I have seen the occasional hang, for instance the attached was from five days ago when running on kernel 3.6.2. There's an intel_unpin_fb_obj call in the stack trace - is that relevant to this bug? 

At the time of this crash I didn't have the patch from this bug applied; it was a standard kernel. However, I had assumed the patch was already in git because I had to revert it in an earlier rc when it conflicted with a git update.
Comment 11 Chris Wilson 2012-10-19 08:19:53 UTC
Created attachment 83971
Flush outstanding unpin tasks.

It was never applied as I was waiting for confirmation that it does fix things before yelling at Daniel.
Comment 12 rocko 2012-10-19 09:09:24 UTC
Oops, I guess I should have said something earlier. Is the last crash I attached the same bug then?

Which version of the kernel is the new patch for? I tried it against 3.7-rc1 (my preferred one since it is working quite nicely at the moment) and 3.6.2 and in both cases it failed to apply hunk #2, ie to method do_intel_finish_page_flip. In both cases this line isn't in the method:

	if (atomic_read(&obj->pending_flip) == 0)
		wake_up(&dev_priv->pending_flip_queue);

Instead, the lines before where it is trying to apply the patch are "atomic_clear_mask(1 << intel_crtc->plane, &obj->pending_flip.counter);" and "wake_up(&dev_priv->pending_flip_queue);".

Should I just replace "schedule_work(&work->work);" with "queue_work(dev_priv->wq, &work->work);"?
Comment 13 Chris Wilson 2012-10-19 09:29:07 UTC
(In reply to comment #12)
> Oops, I guess I should have said something earlier. Is the last crash I
> attached the same bug then?

It's close enough that the cause is likely to be the same, but for the moment I'm just a little puzzled by that appearing to be an unpin leak...

> Which version of the kernel is the new patch for? I tried it against 3.7-rc1
> (my preferred one since it is working quite nicely at the moment) and 3.6.2
> and in both cases it failed to apply hunk #2, ie to method

It was against dinq (drm-intel-next-queued), rebasing now against 3.7-rc1 for you.
Comment 14 Chris Wilson 2012-10-19 09:29:32 UTC
Created attachment 83981
Flush outstanding unpin tasks (v3.7-rc1)
Comment 15 rocko 2012-10-20 03:23:39 UTC
Thanks, I've applied it against the latest git and am now trying it out.
Comment 16 rocko 2012-11-17 05:27:33 UTC
So I've been running with the patch against 3.7-rcX for the last four weeks now. Generally it has behaved fine, although I did see it crash once - right after the nvidia driver crashed with an Xid error. The syslog looked similar to the previous errors, but I only got a couple of lines of it before the kernel locked up, so it's not any help.

I do find that if I run SNA with the latest video driver (xf86-video-intel-2.6.99.902 at the moment), occasionally unity/compiz locks up after the screen saver - I just get a black screen where only the mouse pointer moves until I restart unity. Is it at all possible that this behaviour is linked to the patch? If I use UXA I get the slightly less annoying glitch where the screen locks up completely at random (again except for the mouse) but I can reset it by CTRL-ALT-F1 and then back again with CTRL-ALT-F7.
Comment 17 Chris Wilson 2012-11-17 08:26:54 UTC
(In reply to comment #16)
> I do find that if I run SNA with the latest video driver
> (xf86-video-intel-2.6.99.902 at the moment), occasionally unity/compiz locks
> up after the screen saver - I just get a black screen where only the mouse
> pointer moves until I restart unity. Is it at all possible that this
> behaviour is linked to the patch?

No, there are a few other funnies in the kernel eating the pageflip...
Comment 18 Florian Mickler 2012-12-22 09:24:44 UTC
A patch referencing this bug report has been merged in Linux v3.8-rc1:

commit b4a98e57fc27854b5938fc8b08b68e5e68b91e1f
Author: Chris Wilson <chris@chris-wilson.co.uk>
Date:   Thu Nov 1 09:26:26 2012 +0000

    drm/i915: Flush outstanding unpin tasks before pageflipping