Bug 20452 - Clarkdale: kernel BUG at drivers/gpu/drm/i915/i915_gem_evict.c:244!
Summary: Clarkdale: kernel BUG at drivers/gpu/drm/i915/i915_gem_evict.c:244!
Status: RESOLVED CODE_FIX
Alias: None
Product: Drivers
Classification: Unclassified
Component: Video(DRI - Intel)
Hardware: x86-64 Linux
Importance: P1 high
Assignee: drivers_video-dri-intel@kernel-bugs.osdl.org
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2010-10-17 11:10 UTC by Martin Rogge
Modified: 2010-11-09 21:30 UTC

See Also:
Kernel Version: 2.6.36-rc8, 2.6.36
Subsystem:
Regression: Yes
Bisected commit-id:


Attachments
Syslog (12.17 KB, text/plain)
2010-10-17 11:10 UTC, Martin Rogge
Kernel config (47.29 KB, text/plain)
2010-10-17 11:12 UTC, Martin Rogge
Dmesg (32.32 KB, text/plain)
2010-10-17 11:12 UTC, Martin Rogge
Lspci -vv (26.12 KB, text/plain)
2010-10-17 11:13 UTC, Martin Rogge

Description Martin Rogge 2010-10-17 11:10:31 UTC
Created attachment 33792
Syslog

Due to various i915 display related issues on a couple of machines (GM45, i3 Clarkdale) I am frequently updating and testing parts of the i915 stack.

With some of the latest user-space components (xorg 1.8.2, libdrm-2.4.22, mesa-7.9, xf86-video-intel-2.13.0, pixman-0.19.4, cairo-1.10.0) I now hit a sporadic display freeze that leaves a trace in the syslog. Unfortunately I cannot really bisect because the freeze occurs so sporadically.

I logged two different symptoms in the attached system log. 

The first one was less severe and allowed me to switch to a virtual terminal. I could kill and restart X, although the display remained black and I had to reboot. The log has a lot of lines of the form
[drm:i915_hangcheck_elapsed] *ERROR* Hangcheck timer elapsed... GPU hung
[drm:i915_do_wait_request] *ERROR* i915_do_wait_request returns -5 (awaiting 7527114 at 7527108)
The kernel was a BFS-modified 2.6.35.4.

The second incident (for which this ticket is created) happened under a vanilla 2.6.36-rc8 kernel. The symptoms were more severe: I could not switch to virtual terminals, but I could ssh in. Interestingly, the system did not react to kill -9 of the X server, i.e. the process list remained unchanged (X and all children in state S). I had to reboot. The main characteristic in the log is the kernel BUG and stack trace.
Comment 1 Martin Rogge 2010-10-17 11:12:01 UTC
Created attachment 33802
Kernel config
Comment 2 Martin Rogge 2010-10-17 11:12:40 UTC
Created attachment 33812
Dmesg
Comment 3 Martin Rogge 2010-10-17 11:13:27 UTC
Created attachment 33822
Lspci -vv
Comment 4 Martin Rogge 2010-10-22 21:19:20 UTC
I have upgraded to vanilla 2.6.36. After 24 hours I was caught out by an X freeze with more or less the same symptoms as before with 2.6.35.4-ck1. However, there was nothing logged in the syslog. I had full control from an ssh session, but the display was irrevocably stuck. It would not switch back to text mode nor display a freshly started xorg server. Reboot was the only option.
Comment 5 Martin Rogge 2010-10-26 18:50:54 UTC
Another crash last night. Again the phenomenon that I could not kill any processes. I had to ssh in and reboot. There was no event caught in the system logs.
Comment 6 Chris Wilson 2010-10-27 20:50:00 UTC
The BUG was coincidentally fixed with

commit 69dc4987cbe5fe70ae1c2a08906d431d53cdd242
Author: Chris Wilson <chris@chris-wilson.co.uk>
Date:   Tue Oct 19 10:36:51 2010 +0100

    drm/i915: Track objects in global active list (as well as per-ring)
    
    To handle retirements, we need per-ring tracking of active objects.
    To handle evictions, we need global tracking of active objects.
    
    As we enable more rings, rebuilding the global list from the individual
    per-ring lists quickly grows tiresome and overly complicated. Tracking the
    active objects in two lists is the lesser of two evils.
    
    Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>

The underlying issue (the cause of the hangs leading to this BUG) is a broken userspace driver.
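
As a rough illustration of the two-list tracking described in the commit message above, here is a minimal user-space sketch. It is not the i915 code: every name in it (list_node, gem_object, sketch_ring, sketch_device, object_mark_active, object_retire, device_is_idle) is made up for the example, and the list helpers only imitate the kernel's list_head.

```c
#include <stddef.h>

/* Minimal intrusive doubly linked list, loosely modelled on list_head. */
struct list_node {
    struct list_node *prev, *next;
};

void list_init(struct list_node *head)
{
    head->prev = head->next = head;
}

void list_add_tail(struct list_node *node, struct list_node *head)
{
    node->prev = head->prev;
    node->next = head;
    head->prev->next = node;
    head->prev = node;
}

void list_del(struct list_node *node)
{
    node->prev->next = node->next;
    node->next->prev = node->prev;
    node->prev = node->next = node;
}

int list_empty(const struct list_node *head)
{
    return head->next == head;
}

/* Hypothetical buffer object: linked into two lists at the same time. */
struct gem_object {
    struct list_node ring_link; /* entry in one ring's active list */
    struct list_node mm_link;   /* entry in the global active list */
};

struct sketch_ring {
    struct list_node active;    /* per-ring list, used for retirement */
};

struct sketch_device {
    struct list_node active;    /* global list, used for eviction */
    struct sketch_ring render, bsd;
};

void device_init(struct sketch_device *dev)
{
    list_init(&dev->active);
    list_init(&dev->render.active);
    list_init(&dev->bsd.active);
}

/* Mark an object busy on one ring: it goes onto both lists. */
void object_mark_active(struct sketch_device *dev, struct sketch_ring *ring,
                        struct gem_object *obj)
{
    list_add_tail(&obj->ring_link, &ring->active);
    list_add_tail(&obj->mm_link, &dev->active);
}

/* Retire an object: it comes off both lists together. */
void object_retire(struct gem_object *obj)
{
    list_del(&obj->ring_link);
    list_del(&obj->mm_link);
}

/* Eviction only needs one check, however many rings there are. */
int device_is_idle(const struct sketch_device *dev)
{
    return list_empty(&dev->active);
}
```

The point of the second link is that the idle check does not have to walk every ring, so adding another ring does not change the eviction side at all.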
Comment 7 Martin Rogge 2010-10-28 18:41:02 UTC
Thanks!

Due to dependencies the commit doesn't apply to vanilla 2.6.36. Has it been submitted to the stable branch of 2.6.36?
Comment 8 Chris Wilson 2010-10-28 19:01:42 UTC
No, I was looking at solving a different problem and only realized later the bug that lurked there.

The minimal patch that would be a candidate for stable is:

diff --git a/drivers/gpu/drm/i915/i915_gem.c b/drivers/gpu/drm/i915/i915_gem.c
index 90b1d67..a538002 100644
--- a/drivers/gpu/drm/i915/i915_gem.c
+++ b/drivers/gpu/drm/i915/i915_gem.c
@@ -2045,6 +2045,8 @@ i915_gpu_idle(struct drm_device *dev)
        if (seqno1 == 0)
                return -ENOMEM;
        ret = i915_wait_request(dev, seqno1, &dev_priv->render_ring);
+       if (ret)
+           return ret;
 
        if (HAS_BSD(dev)) {
                seqno2 = i915_add_request(dev, NULL, I915_GEM_GPU_DOMAINS,
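
To show why the two added lines matter, here is a small user-space sketch of the error-propagation pattern. It is only an illustration under assumptions: sketch_ring, wait_ring, gpu_idle_unpatched and gpu_idle_patched are hypothetical names, and it assumes that the BSD-ring wait following the hunk reassigns ret, which the hunk itself does not show.

```c
#include <errno.h>
#include <stdbool.h>

/* Hypothetical stand-in for a GPU ring; not the real i915 structure. */
struct sketch_ring {
    bool hung;        /* true if the GPU stopped processing this ring */
    int active_count; /* objects the GPU may still be using */
};

/* Stand-in for i915_wait_request(): fails with -EIO on a hung ring. */
int wait_ring(struct sketch_ring *ring)
{
    if (ring->hung)
        return -EIO;        /* the "-5" reported in the syslog */
    ring->active_count = 0; /* work finished, nothing left active */
    return 0;
}

/* Assumed pre-patch flow: the render-ring result is overwritten. */
int gpu_idle_unpatched(struct sketch_ring *render, struct sketch_ring *bsd)
{
    int ret = wait_ring(render);

    ret = wait_ring(bsd);   /* an earlier error is lost here */
    return ret;
}

/* Patched flow: stop as soon as the render-ring wait fails. */
int gpu_idle_patched(struct sketch_ring *render, struct sketch_ring *bsd)
{
    int ret = wait_ring(render);

    if (ret)
        return ret;
    return wait_ring(bsd);
}
```

With a hung render ring and a healthy BSD ring, the unpatched variant returns 0 while the render ring still has active objects, so a caller would wrongly treat the GPU as idle; the patched variant returns -EIO instead.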
Comment 9 Martin Rogge 2010-10-29 21:58:49 UTC
Thanks again. I shall apply the patch and if the system survives for a week we can call it 100% test success. ;-)
Comment 10 Martin Rogge 2010-11-09 21:30:18 UTC
Me again. I have not observed the kernel bug since applying the patch. So we can call it tested and closed.

Unfortunately I have experienced a couple of different Xorg freezes for which I may open another ticket if I can find the motivation for it. To be honest, after 6 months of trying I am getting very tired of this buggy stuff. All I want is a stable system but obviously that is asking too much.
