I get a GPU hang when the slip screensaver is running on SandyBridge machines. When /usr/lib/xscreensaver/slip has been running for long enough (usually about 20 minutes), X hangs. This doesn't seem to happen on the 2.6.36 kernel, so it's possibly a regression in 2.6.37. 2.6.38-rc2 (with a few fixes) still shows the same problem. I tested it on an IronLake machine, and it doesn't show the problem there, so this may be SandyBridge-specific. So far, the only indication of a hang is a kernel message like the one below:
    [drm:i915_hangcheck_elapsed] *ERROR* Hangcheck timer elapsed... GPU idle, missed IRQ.
Let me know if anything is needed for debugging.
I fear I may know the outcome of this....
    echo 1 > /sys/kernel/debug/dri/0/i915_wedged # manually trigger the hangcheck
    attach /sys/kernel/debug/dri/0/i915_error_state
Should I run the i915_wedged echo after the actual GPU hang or before? Sorry, I'm not familiar with i915 GPU debugging...
Once the error message starts appearing, invoke a manual wedged. The hangcheck finds that one ring is waiting upon a semaphore and so kicks that ring. Unfortunately, the other ring is still not progressing, and so the next time the first ring needs a result from the second, it will hang again. Ad infinitum. I guess I should add some termination condition such that if the first ring hangs again upon the same result from the second ring, it's dead.
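The termination condition suggested above could look roughly like the following. This is only a minimal sketch of the idea, not the i915 driver's code: `struct ring_state`, `enum hang_action`, and `hangcheck_ring()` are hypothetical names invented for illustration, and the "kick" is modeled as simply clearing the wait flag.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical, simplified ring state; not the driver's real structures. */
struct ring_state {
    bool     waiting_on_semaphore; /* ring is stalled on a semaphore wait */
    uint32_t wait_seqno;           /* result it is waiting for from the other ring */
    bool     kicked_before;        /* has the hangcheck already kicked this ring? */
    uint32_t last_kicked_seqno;    /* wait_seqno at the time of the previous kick */
};

enum hang_action { HANG_NONE, HANG_KICK, HANG_DEAD };

/* Decide what the hangcheck should do with one ring. */
static enum hang_action hangcheck_ring(struct ring_state *ring)
{
    if (!ring->waiting_on_semaphore)
        return HANG_NONE;

    /* Stuck again on the very same result we already kicked it past:
     * the other ring is not progressing, so stop kicking and give up. */
    if (ring->kicked_before && ring->last_kicked_seqno == ring->wait_seqno)
        return HANG_DEAD;

    /* First stall on this result: remember it and kick the ring. */
    ring->kicked_before = true;
    ring->last_kicked_seqno = ring->wait_seqno;
    ring->waiting_on_semaphore = false; /* the kick releases the wait */
    return HANG_KICK;
}
```

With this rule, a ring that stalls twice on the same result is declared dead instead of being kicked forever, while a stall on a newer result still gets one kick.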
Hm, the machine locks up completely after invoking the manual wedge :-< No sysrq reaction, either. The last kernel message before the death is:
    [drm] Manually setting wedged to 1
Gah. Nope, can't see it. Care to do the old printk(KERN_ERR "%s:%d\n", __FUNCTION__, __LINE__); to find out where we die?
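For readers who want to try the same bisect-by-print trick outside the kernel, here is a minimal userspace sketch. It assumes `fprintf`-style output to stderr as a stand-in for `printk(KERN_ERR ...)`; `trace_format()`, `TRACE_HERE()`, and `suspect_path()` are hypothetical names made up for this illustration.

```c
#include <stdio.h>

/* Format "function:line\n" into buf; returns what snprintf returns. */
static int trace_format(char *buf, size_t len, const char *func, int line)
{
    return snprintf(buf, len, "%s:%d\n", func, line);
}

/* Userspace stand-in for printk(KERN_ERR "%s:%d\n", __FUNCTION__, __LINE__).
 * Flushing right away matters: the marker must be out before a lock-up. */
#define TRACE_HERE() do {                                              \
        char trace_buf_[128];                                          \
        trace_format(trace_buf_, sizeof(trace_buf_), __func__, __LINE__); \
        fputs(trace_buf_, stderr);                                     \
        fflush(stderr);                                                \
    } while (0)

/* Hypothetical code path being narrowed down; fail_at_step picks where it stops. */
static int suspect_path(int fail_at_step)
{
    TRACE_HERE();
    if (fail_at_step == 1)
        return -1;
    TRACE_HERE();
    if (fail_at_step == 2)
        return -2;
    TRACE_HERE();
    return 0;
}
```

The last marker that reaches the log brackets the point where execution stopped; adding more markers between it and the next expected one narrows the location further.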
It locks up at i915_reset() in the error work. I remember that I also tried to implement the automatic GPU reset, and got a lock-up.
OK, got the i915_error_state output after disabling i915_reset() in the code. Attached below.
Created attachment 45782: i915_error_state after GPU hang
That's a screensaver! How eco-unfriendly! ;-) Nothing amiss within the dump, and not the bug I feared it was. Have you tried the following patch?
Created attachment 45802: The joy of poor documentation.
(In reply to comment #9)
> That's a screensaver! How eco-unfriendly! ;-)
Oh, you didn't know that the real purpose of screensavers is for stress-tests? ;)
> Nothing amiss within the dump, and not the bug I feared it was.
>
> Have you tried the following patch?
Tried now, but it still hangs...
Actually I have the impression that the patch makes the hang trigger more quickly. Now it locks up in two minutes or so.
BTW, the lock-up in i915_reset() is in gen6_do_reset(). The machine locks up before returning from it.
Created attachment 46022: Missing invalidate for BLT/BSD rings
Well this is the fix for the bug I feared you had found... Maybe...
Thanks, but the new patch didn't help either. It still hangs after running for a few minutes...
Takashi, Chris, has this been fixed in the meantime?
Not fixed yet.
Bug 28882 (https://bugzilla.kernel.org/show_bug.cgi?id=28882) could be the same issue.
Takashi, any progress with the following commit?
commit 91355834646328e7edc6bd25176ae44bcd7386c7
Author: Chris Wilson <chris@chris-wilson.co.uk>
Date:   Fri Mar 4 19:22:40 2011 +0000

    drm/i915: Do not overflow the MMADDR write FIFO

That's the only lead I've had so far. And for all the bug-watchers at home, 28882 is definitely not the same bug.
I tested 2.6.38-rc8, but it still hangs. FWIW, this is a 32-bit kernel and X is running without compositing.
Created attachment 51312: Flush BLT before the interrupt
Found this bug by attacking the BLT ring with firefox. Looks to be very similar...
I had to backport this to the 2.6.38 kernel, and unfortunately it didn't help. The same hang still occurs after some time...
I am running this on both a SugarBar rev09 and HuronRiver rev09, waiting for a hang. How long does it usually take on your machine to die?
It pretty much depends on the machine and the situation. If you are (un)lucky, you'll get it in a couple of minutes. It may take 20 minutes or so, too. Also, don't use compiz. It seems compositing could work around this bug. I tried a freshly installed 64-bit openSUSE-11.4, and no hang was seen there.
Joy of joys, I experienced a hang on the HuronRiver. It blew up in the middle of a series of seemingly identical ops. But after slowing it down by flushing the batches/caches, it is still going. Try either or both of these in xorg.conf:
    Section "Driver"
        Option "DebugFlushCaches" "True"
        Option "DebugFlushBatches" "True"
    EndSection
I smell a silicon validation issue... ;-)
For kicks, try adding this to xf86-video-intel:
diff --git a/src/intel_uxa.c b/src/intel_uxa.c
index 13d8cf9..66e531a 100644
--- a/src/intel_uxa.c
+++ b/src/intel_uxa.c
@@ -506,6 +506,9 @@ static void intel_uxa_done_copy(PixmapPtr dest)
 {
 	ScrnInfoPtr scrn = xf86Screens[dest->drawable.pScreen->myNum];
+	if (IS_GEN6(intel_get_screen_private(scrn)))
+		intel_batch_emit_flush(scrn);
+
 	intel_debug_flush(scrn);
 }
Hooray, the patch in comment 26 fixes the hang, indeed.
But not even close to explaining what's going on though. Mystery gen6 bug.
Well the good news is that this is not limited to the BLT ring... I've just had a near-identical crash flooding the RENDER ring.
:( I was just adding a comment to suggest that this appears to have been resolved, when it hung. Oh well.
(In reply to comment #30)
> :( I was just adding a comment to suggest that this appears to have been
> resolved, when it hung. Oh well.
If the patch in comment #26 is included in version 2.16.0 then the bug is indeed not fixed - please see bug #35122.
The workaround that we eventually landed upon in xf86-video-intel-2.16.901 was:
commit 46f97127c22ea42bc8fdae59d2a133e4b8b6c997
Author: Chris Wilson <chris@chris-wilson.co.uk>
Date:   Sun Oct 16 21:40:15 2011 +0100

    snb,ivb: Workaround unknown blitter death

    The first workaround was a performance killing MI_FLUSH_DW after every
    op. This workaround appears to be a stable compromise instead, only
    requiring a redundant command after every BLT command with little
    impact on throughput.

    Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=27892
    Bugzilla: https://bugs.freedesktop.org/show_bug.cgi?id=39524
    Tested-by: Daniel Vetter <daniel.vetter@ffwll.ch>
    Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>