Bug 46991
Summary: | kernel BUG at drivers/gpu/drm/i915/i915_gem.c:3245 (RIP [<ffffffffa00aed94>] i915_gem_object_pin+0x144/0x190 [i915]) | ||
---|---|---|---|
Product: | Drivers | Reporter: | rocko (rockorequin) |
Component: | Video(DRI - Intel) | Assignee: | Chris Wilson (chris) |
Status: | RESOLVED CODE_FIX | ||
Severity: | high | CC: | alan, ben, chris, florian, intel-gfx-bugs, rockorequin |
Priority: | P1 | ||
Hardware: | All | ||
OS: | Linux | ||
Kernel Version: | 3.6-rc4 | Subsystem: | |
Regression: | Yes | Bisected commit-id: | |
Attachments:
syslog of crash
syslog of second crash
another crashlog from 3.6-rc5
Flush outstanding unpin tasks.
syslog of a crash
Flush outstanding unpin tasks.
Flush outstanding unpin tasks (v3.7-rc1)
If you've had the nvidia module even loaded during a boot then it could easily be responsible. One driver can crash another.
I've seen it crash with the same error without the nvidia module ever being loaded. I only mentioned it because the nvidia module was loaded at the time of the syslog. The nvidia module is blacklisted so that it doesn't get loaded at boot - it only gets loaded if I manually start an X server with it via bumblebee.
Created attachment 79231 [details]
syslog of second crash
Here's a second crash from later today. It looks like the same stack trace (no nvidia module was ever loaded this time).
rocko, I think you'll have to bisect it.
That's a real pity, because I don't know how to reproduce it. For instance, it happened again for the first time since the 4th of September - i.e. a full 8 days later, so bisecting will take forever. I was hoping the stack trace might be of use.
Created attachment 79861 [details]
another crashlog from 3.6-rc5
Created attachment 80521 [details]
Flush outstanding unpin tasks.
The working theory is that we have a backlog of incomplete pageflip tasks leaking the pincount.
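To make the working theory concrete, here is a minimal toy model in Python. It is not kernel code, and every name in it (FrameBuffer, FlipQueue, flips_until_failure, MAX_PIN_COUNT and its value of 15) is a hypothetical stand-in chosen for illustration. It assumes the BUG fires when a buffer's pin count hits a small fixed ceiling, and it exaggerates the backlog by never running the deferred unpin work unless it is flushed explicitly.

```python
from collections import deque

# Assumption for illustration: the pin count is a small counter with a hard ceiling.
MAX_PIN_COUNT = 15


class PinLeakError(RuntimeError):
    """Stands in for the kernel BUG in this toy model."""


class FrameBuffer:
    def __init__(self):
        self.pin_count = 0

    def pin(self):
        # Pinning an object that is already at the ceiling is treated as fatal.
        if self.pin_count == MAX_PIN_COUNT:
            raise PinLeakError("pin count would overflow")
        self.pin_count += 1

    def unpin(self):
        assert self.pin_count > 0
        self.pin_count -= 1


class FlipQueue:
    """Holds unpin tasks that were deferred to a worker and have not run yet."""

    def __init__(self):
        self.pending_unpins = deque()

    def flush(self):
        # Run every outstanding unpin task.
        while self.pending_unpins:
            self.pending_unpins.popleft().unpin()

    def page_flip(self, old_fb, new_fb, flush_first):
        if flush_first:
            self.flush()
        new_fb.pin()
        # The old buffer's unpin is deferred rather than done immediately.
        self.pending_unpins.append(old_fb)


def flips_until_failure(flush_first, max_flips=100):
    """Flip between two buffers; return the flip number that fails, or None."""
    front, back = FrameBuffer(), FrameBuffer()
    front.pin()  # the buffer currently being scanned out starts pinned
    queue = FlipQueue()
    for n in range(1, max_flips + 1):
        try:
            queue.page_flip(front, back, flush_first)
        except PinLeakError:
            return n
        front, back = back, front
    return None


if __name__ == "__main__":
    print("without flush:", flips_until_failure(flush_first=False))
    print("with flush:   ", flips_until_failure(flush_first=True))
```

In this model the unflushed backlog lets each buffer's pin count creep up until a pin fails, while flushing before each flip keeps the count bounded. Real unpin tasks are only delayed rather than lost, which would fit how rarely the crash shows up in practice.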
OK, trying the patch out now, so far so good. (But this is a really tricky one to trigger; I don't think I've seen it now in the last 11 days.)
Any sign of success?
Created attachment 83911 [details]
syslog of a crash
I have seen the occasional hang; for instance, the attached log is from five days ago, when I was running kernel 3.6.2. There's an intel_unpin_fb_obj call in the stack trace - is that relevant to this bug?
At the time of this crash I didn't have the patch from this bug applied; it was a standard kernel. However, I had assumed the patch was already in git, because I had to revert it in an earlier rc when it conflicted with a git update.
Created attachment 83971 [details]
Flush outstanding unpin tasks.
It was never applied as I was waiting for confirmation that it does fix things before yelling at Daniel.
Oops, I guess I should have said something earlier. Is the last crash I attached the same bug then?
Which version of the kernel is the new patch for? I tried it against 3.7-rc1 (my preferred one since it is working quite nicely at the moment) and 3.6.2, and in both cases it failed to apply hunk #2, i.e. to method do_intel_finish_page_flip. In both cases this line isn't in the method:
    if (atomic_read(&obj->pending_flip) == 0)
        wake_up(&dev_priv->pending_flip_queue);
Instead, the lines before where it is trying to apply the patch are "atomic_clear_mask(1 << intel_crtc->plane, &obj->pending_flip.counter);" and "wake_up(&dev_priv->pending_flip_queue);".
Should I just replace "schedule_work(&work->work);" with "queue_work(dev_priv->wq, &work->work);"?
(In reply to comment #12)
> Oops, I guess I should have said something earlier. Is the last crash I
> attached the same bug then?
It's close enough that the cause is likely to be the same, but for the moment I'm just a little puzzled by that appearing to be an unpin leak...
> Which version of the kernel is the new patch for? I tried it against 3.7-rc1
> (my preferred one since it is working quite nicely at the moment) and 3.6.2
> and in both cases it failed to apply hunk #2, ie to method
It was against dinq, rebasing now against 3.7-rc1 for you.
Created attachment 83981 [details]
Flush outstanding unpin tasks (v3.7-rc1)
Thanks, I've applied it against the latest git and am now trying it out.
So I've been running with the patch against 3.7-rcX for the last four weeks now. Generally it has behaved fine, although I did see it crash once - right after the nvidia driver crashed with an Xid error. The syslog looked similar to the previous errors, but I only got a couple of lines of it before the kernel locked up, so it's not any help.
I do find that if I run SNA with the latest video driver (xf86-video-intel-2.6.99.902 at the moment), occasionally unity/compiz locks up after the screen saver - I just get a black screen where only the mouse pointer moves until I restart unity. Is it at all possible that this behaviour is linked to the patch? If I use UXA I get the slightly less annoying glitch where the screen locks up completely at random (again except for the mouse) but I can reset it by CTRL-ALT-F1 and then back again with CTRL-ALT-F7.
(In reply to comment #16)
> I do find that if I run SNA with the latest video driver
> (xf86-video-intel-2.6.99.902 at the moment), occasionally unity/compiz locks
> up after the screen saver - I just get a black screen where only the mouse
> pointer moves until I restart unity. Is it at all possible that this
> behaviour is linked to the patch?
No, there are a few other funnies in the kernel eating the pageflip...
A patch referencing this bug report has been merged in Linux v3.8-rc1:
commit b4a98e57fc27854b5938fc8b08b68e5e68b91e1f
Author: Chris Wilson <chris@chris-wilson.co.uk>
Date: Thu Nov 1 09:26:26 2012 +0000
    drm/i915: Flush outstanding unpin tasks before pageflipping
Created attachment 79181 [details]
syslog of crash
I've had this a couple of times in 3.6-rc4 (I don't recall seeing it in rc3):
kernel BUG at drivers/gpu/drm/i915/i915_gem.c:3245!
invalid opcode: 0000 [#1] SMP
...
Call Trace:
Sep 4 10:38:30 sierra kernel: [11712.658968] [<ffffffffa00aee68>] i915_gem_object_pin_to_display_plane+0x88/0x100 [i915]
It locks up the PC, requiring a hard reboot. A longer syslog is attached.
The nvidia module was loaded because it was running on a separate X server (i.e. via bumblebee) and it had itself just crashed, but I don't think it is relevant, because the last time the PC froze the nvidia module wasn't loaded, and because the nvidia module crashes frequently but the i915 driver rarely does.