Bug 27892

Summary: SNB: GPU hang with Slip xscreensaver
Product: Drivers
Reporter: Takashi Iwai (tiwai)
Component: Video(DRI - Intel)
Assignee: drivers_video-dri-intel (drivers_video-dri-intel)
Status: RESOLVED CODE_FIX
Severity: normal
CC: amitshah, chris, eric, florian, keithp, maciej.rutecki, mat, nemesis, pebolle, rjw, sndirsch, toralf.foerster, v_mac
Priority: P1
Hardware: All
OS: Linux
Kernel Version: 2.6.37
Subsystem:
Regression: Yes
Bisected commit-id:
Bug Depends on:
Bug Blocks: 21782
Attachments:
  i915_error_state after GPU hang
  The joy of poor documentation.
  Missing invalidate for BLT/BSD rings
  Flush BLT before the interrupt

Description Takashi Iwai 2011-01-31 12:06:58 UTC
I got a GPU hang when the slip screensaver is running on SandyBridge machines.  When /usr/lib/xscreensaver/slip has been running for long enough (usually about 20 minutes), X hangs.

This doesn't seem to happen on 2.6.36 kernel, so it's possibly a regression in 2.6.37.

2.6.38-rc2 (with a few fixes) still shows the same problem.

I tested it on an IronLake machine, and it doesn't show the problem.  So this may be SandyBridge-specific.

The only indication of a hang is the kernel message like below, so far:
[drm:i915_hangcheck_elapsed] *ERROR* Hangcheck timer elapsed... GPU idle, missed IRQ.

Let me know if anything is needed for debugging.
Comment 1 Chris Wilson 2011-01-31 12:15:20 UTC
I fear I may know the outcome of this....

echo 1 > /sys/kernel/debug/dri/0/i915_wedged # manually trigger the hangcheck

attach /sys/kernel/debug/dri/0/i915_error_state
Comment 2 Takashi Iwai 2011-01-31 12:31:05 UTC
Should I run the i915_wedged echo after the actual GPU hang or before?
Sorry, I'm not familiar with i915 GPU debugging...
Comment 3 Chris Wilson 2011-01-31 12:50:51 UTC
Once the error message starts appearing, trigger a manual wedge.

The hangcheck finds that one ring is waiting upon a semaphore and so kicks that ring. Unfortunately, the other ring is still not progressing and so the next time the first ring needs a result from the second, it will hang again. Ad infinitum.

I guess I should add some termination condition such that if the first ring hangs again upon the same result from the second ring, it's dead.
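
The termination condition suggested here could look roughly like the sketch below. It is only an illustration under stated assumptions, not the i915 code: `struct ring_state`, `hangcheck_waiter()` and the seqno bookkeeping are hypothetical names invented for this example. The idea is to remember which result the waiting ring was stuck on when it was last kicked, and to give up if it is found stuck on the same result again.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical per-ring bookkeeping; not the real i915 structures. */
struct ring_state {
	uint32_t wait_seqno;        /* result awaited from the other ring */
	uint32_t last_kicked_seqno; /* wait_seqno at the time of the last kick */
	bool     kicked;            /* true once this ring has been kicked */
};

enum hangcheck_action { HANGCHECK_KICK, HANGCHECK_DEAD };

/*
 * Called when hangcheck finds the ring blocked on a semaphore wait.
 *
 * First time on a given seqno: kick the ring and remember the seqno.
 * Blocked again on the same seqno: the signalling ring has made no
 * progress, so another kick cannot help and the GPU is declared dead.
 */
enum hangcheck_action hangcheck_waiter(struct ring_state *ring)
{
	if (ring->kicked && ring->last_kicked_seqno == ring->wait_seqno)
		return HANGCHECK_DEAD;

	ring->kicked = true;
	ring->last_kicked_seqno = ring->wait_seqno;
	return HANGCHECK_KICK;
}
```

With this shape, a ring that later blocks on a different seqno is kicked again, so only a repeat stall on the same result ends the kick loop.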
Comment 4 Takashi Iwai 2011-01-31 13:36:57 UTC
Hm, the machine locks up completely after invoking the manual wedge :-<
No sysrq reaction, either.

The last kernel message before the death is:
    [drm] Manually setting wedged to 1
Comment 5 Chris Wilson 2011-02-01 10:49:07 UTC
Gah. Nope, can't see it. Care to do the old
  printk(KERN_ERR "%s:%d\n", __FUNCTION__, __LINE__);
to find out where we die?
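
For readers unfamiliar with this trick, here is a userspace sketch of the same technique. It assumes nothing about the driver: `trace_format()`, `TRACE_HERE()` and `suspect_reset_path()` are hypothetical names, and `fputs()` to stderr stands in for `printk()`. `__func__` is the standard C99 spelling of `__FUNCTION__`.

```c
#include <stddef.h>
#include <stdio.h>

/* Format a "function:line" marker into buf; returns snprintf()'s result. */
int trace_format(char *buf, size_t len, const char *func, int line)
{
	return snprintf(buf, len, "%s:%d\n", func, line);
}

/* Userspace stand-in for the printk() call: emit the marker on stderr. */
#define TRACE_HERE() do {                                             \
	char trace_buf_[128];                                         \
	trace_format(trace_buf_, sizeof(trace_buf_), __func__, __LINE__); \
	fputs(trace_buf_, stderr);                                    \
} while (0)

/* Hypothetical function under suspicion, instrumented step by step. */
int suspect_reset_path(void)
{
	TRACE_HERE();	/* reached entry */
	/* ... step 1 ... */
	TRACE_HERE();	/* survived step 1 */
	/* ... step 2 ... */
	TRACE_HERE();	/* survived step 2 */
	return 0;
}
```

The last marker that appears in the log before the lock-up brackets the statement that never returned.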
Comment 6 Takashi Iwai 2011-02-01 14:46:12 UTC
It locks up at i915_reset() in the error work.

I remember that I also tried to implement the automatic GPU reset, and got a lock-up.
Comment 7 Takashi Iwai 2011-02-01 15:07:20 UTC
OK, got the i915_error_state output after disabling i915_reset() in the code.
Attached below.
Comment 8 Takashi Iwai 2011-02-01 15:09:31 UTC
Created attachment 45782
i915_error_state after GPU hang
Comment 9 Chris Wilson 2011-02-01 15:49:34 UTC
That's a screensaver! How eco-unfriendly! ;-)

Nothing amiss within the dump, and not the bug I feared it was.

Have you tried the following patch?
Comment 10 Chris Wilson 2011-02-01 15:50:31 UTC
Created attachment 45802
The joy of poor documentation.
Comment 11 Takashi Iwai 2011-02-01 16:03:40 UTC
(In reply to comment #9)
> That's a screensaver! How eco-unfriendly! ;-)

Oh, you didn't know that the real purpose of screensavers is for stress-tests? ;)
 
> Nothing amiss within the dump, and not the bug I feared it was.
> 
> Have you tried the following patch?

Tried it now, but it still hangs...
Comment 12 Takashi Iwai 2011-02-01 16:12:22 UTC
Actually, my impression is that the patch makes the hang trigger more quickly.  Now it locks up in two minutes or so.
Comment 13 Takashi Iwai 2011-02-01 16:35:49 UTC
BTW, the lock-up in i915_reset() is in gen6_do_reset().  The machine locks up before returning from it.
Comment 14 Chris Wilson 2011-02-02 12:23:05 UTC
Created attachment 46022
Missing invalidate for BLT/BSD rings

Well this is the fix for the bug I feared you had found... Maybe...
Comment 15 Takashi Iwai 2011-02-02 13:28:46 UTC
Thanks, but the new patch didn't help either.  It still hangs after running for a few minutes...
Comment 16 Florian Mickler 2011-02-20 15:11:30 UTC
Takashi, Chris, is this fixed in the meantime?
Comment 17 Takashi Iwai 2011-02-21 10:03:43 UTC
Not fixed yet.
Comment 18 Amit Shah 2011-02-28 12:00:09 UTC
Bug 28882 (https://bugzilla.kernel.org/show_bug.cgi?id=28882) could be the same issue.
Comment 19 Chris Wilson 2011-03-10 08:34:58 UTC
Takashi, any progress with

commit 91355834646328e7edc6bd25176ae44bcd7386c7
Author: Chris Wilson <chris@chris-wilson.co.uk>
Date:   Fri Mar 4 19:22:40 2011 +0000

    drm/i915: Do not overflow the MMADDR write FIFO
    
that's the only lead I've had so far.

And for all the bug-watchers at home, 28882 is definitely not the same bug.
Comment 20 Takashi Iwai 2011-03-11 11:10:20 UTC
I tested 2.6.38-rc8, but it still hangs.

FWIW, this is a 32-bit kernel and X is running without compositing.
Comment 21 Chris Wilson 2011-03-20 09:08:44 UTC
Created attachment 51312
Flush BLT before the interrupt

Found this bug by attacking the BLT ring with firefox. Looks to be very similar...
Comment 22 Takashi Iwai 2011-03-21 12:13:06 UTC
I had to backport this for the 2.6.38 kernel, and it didn't help, unfortunately.
Still the same hang occurs after some time...
Comment 23 Chris Wilson 2011-03-22 16:18:26 UTC
I am running this on both a SugarBay rev09 and a HuronRiver rev09, waiting for a hang. How long does it usually take on your machine to die?
Comment 24 Takashi Iwai 2011-03-22 16:27:57 UTC
It pretty much depends on the machine and the situation.
If you are (un)lucky, you'll get it in a couple of minutes.  It may take 20 minutes or so, too.

Also, don't use compiz.  It seems compositing could work around this bug.
I tried the freshly installed openSUSE-11.4 64bit, and no hang was seen there.
Comment 25 Chris Wilson 2011-03-22 17:37:14 UTC
Joy of joys, I experienced a hang on the HuronRiver. It blew up in the middle of a series of seemingly identical ops. But with it slowed down by flushing the batches/caches, it is still going.

Try either or both of these in xorg.conf:

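Section "Device"
  # Identifier is a placeholder; reuse the one from your existing Device section.
  Identifier "Intel Graphics"
  Driver "intel"
  Option "DebugFlushCaches" "True"
  Option "DebugFlushBatches" "True"
EndSection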

I smell a silicon validation issue... ;-)
Comment 26 Chris Wilson 2011-03-22 18:03:54 UTC
For kicks, try adding this to xf86-video-intel:

diff --git a/src/intel_uxa.c b/src/intel_uxa.c
index 13d8cf9..66e531a 100644
--- a/src/intel_uxa.c
+++ b/src/intel_uxa.c
@@ -506,6 +506,9 @@ static void intel_uxa_done_copy(PixmapPtr dest)
 {
        ScrnInfoPtr scrn = xf86Screens[dest->drawable.pScreen->myNum];
 
+       if (IS_GEN6(intel_get_screen_private(scrn)))
+               intel_batch_emit_flush(scrn);
+
        intel_debug_flush(scrn);
 }
Comment 27 Takashi Iwai 2011-03-23 13:24:08 UTC
Hooray, the patch in comment 26 fixes the hang, indeed.
Comment 28 Chris Wilson 2011-03-23 13:27:30 UTC
But not even close to explaining what's going on though. Mystery gen6 bug.
Comment 29 Chris Wilson 2011-04-07 13:40:10 UTC
Well, the good news is that this is not limited to the BLT ring... I've just had a near-identical crash flooding the RENDER ring.
Comment 30 Chris Wilson 2011-07-06 21:10:55 UTC
:( I was just adding a comment to suggest that this appears to have been resolved, when it hung. Oh well.
Comment 31 Toralf Förster 2011-08-10 07:50:39 UTC
(In reply to comment #30)
> :( I was just adding a comment to suggest that this appears to have been
> resolved, when it hung. Oh well.

If the patch in comment #26 is included in version 2.16.0, then the bug is indeed not fixed - please see bug #35122.
Comment 32 Chris Wilson 2012-01-26 19:17:39 UTC
The workaround that we eventually landed upon in xf86-video-intel-2.16.901 was:

commit 46f97127c22ea42bc8fdae59d2a133e4b8b6c997
Author: Chris Wilson <chris@chris-wilson.co.uk>
Date:   Sun Oct 16 21:40:15 2011 +0100

    snb,ivb: Workaround unknown blitter death
    
    The first workaround was a performance killing MI_FLUSH_DW after every
    op. This workaround appears to be a stable compromise instead, only
    requiring a redundant command after every BLT command with little
    impact on throughput.
    
    Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=27892
    Bugzilla: https://bugs.freedesktop.org/show_bug.cgi?id=39524
    Tested-by: Daniel Vetter <daniel.vetter@ffwll.ch>
    Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
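
The shape of that workaround, as the commit message describes it, can be sketched as below. This is a toy model under loud assumptions, not the xf86-video-intel code: `struct batch`, `batch_emit()` and `batch_emit_blt()` are hypothetical, and the opcode values are placeholders rather than real hardware encodings. It only shows the structure: each blit on SNB/IVB is followed by one extra redundant command, instead of a full flush after every operation.

```c
#include <stddef.h>
#include <stdint.h>

#define BATCH_MAX 64

/* Placeholder opcodes for illustration only; not real hardware encodings. */
enum { CMD_BLT = 1, CMD_REDUNDANT = 2 };

/* Hypothetical batch buffer; not the driver's real structure. */
struct batch {
	uint32_t cmds[BATCH_MAX];
	size_t   used;
	int      gen;	/* GPU generation: 6 = SandyBridge, 7 = IvyBridge */
};

/* Append one command; returns 0 on success, -1 if the batch is full. */
int batch_emit(struct batch *b, uint32_t cmd)
{
	if (b->used >= BATCH_MAX)
		return -1;
	b->cmds[b->used++] = cmd;
	return 0;
}

/*
 * Emit a blit. On gen6/gen7, follow it with one redundant command.
 * Space for both is reserved up front so a blit is never emitted
 * without its trailing workaround command.
 */
int batch_emit_blt(struct batch *b)
{
	int needs_workaround = (b->gen == 6 || b->gen == 7);
	size_t need = needs_workaround ? 2 : 1;

	if (b->used + need > BATCH_MAX)
		return -1;

	batch_emit(b, CMD_BLT);
	if (needs_workaround)
		batch_emit(b, CMD_REDUNDANT);
	return 0;
}
```

In this model the cost is one extra command per blit on the affected generations, which matches the commit message's claim of little impact on throughput compared with flushing after every op.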