Bug 52311

Summary: Flash 11.5 video hangs ivy bridge
Product: Drivers
Reporter: Christoph Haag (haagch.christoph)
Component: Video(DRI - Intel)
Assignee: intel-gfx-bugs (intel-gfx-bugs)
Status: RESOLVED CODE_FIX
Severity: normal
CC: chris, daniel, florian, intel-gfx-bugs, shuber2, timo.jyrinki
Priority: P1
Hardware: All
OS: Linux
Kernel Version: 3.8-rc2
Subsystem:
Regression: Yes
Bisected commit-id:
Attachments:
  full dmesg with hang at the end
  Xorg.0.log
  i915_error_state, gzipped because of the size limit
  i915_error_state from drm-fixed
  i915_error_state_drm-fixes with #19 and #20 applied
  error state with #22
  i915_error_state, SNB, git 9452618e

Description Christoph Haag 2013-01-04 18:09:58 UTC
Created attachment 90481 [details]
full dmesg with hang at the end

Software: Linux 3.8-rc2; the problem also occurred in rc1. In 3.7 it does NOT hang.
mesa/xf86-video-intel are the latest from git master, and I'm also using sna.
Hardware: i7 3632qm

Unfortunately I have only been able to trigger this bug with flash 11.5 but fortunately it always happens immediately so it's easy to reproduce.

Steps:
Use google chrome / chromium with pepper flash
Watch a flash video (e.g. from youtube) / try to put it on fullscreen

Without compositing: As soon as the video starts the hang occurs
With compositing (kwin 4.10 beta, opengl): The video plays fine, but as soon as I put it on fullscreen the hang occurs.

Maybe I'll write this snippet from dmesg directly here so it's easier to find with google (I have not found that exact problem):

[ 1424.837260] [drm:i915_hangcheck_hung] *ERROR* Hangcheck timer elapsed... GPU hung
[ 1424.837270] [drm] capturing error event; look for more information in /debug/dri/0/i915_error_state
[ 1424.842657] [drm:kick_ring] *ERROR* Kicking stuck wait on render ring

The rest is hopefully clear from the attached files.
Comment 1 Christoph Haag 2013-01-04 18:10:34 UTC
Created attachment 90491 [details]
Xorg.0.log
Comment 2 Christoph Haag 2013-01-04 18:12:20 UTC
Created attachment 90501 [details]
i915_error_state, gzipped because of the size limit
Comment 3 Chris Wilson 2013-01-04 18:33:08 UTC
It doesn't seem to like screen height-1 scanline waits. Probably should convert those into a vblank wait.
Comment 4 Chris Wilson 2013-01-04 18:42:42 UTC
What you can try instead is:

diff --git a/src/sna/sna_display.c b/src/sna/sna_display.c
index c66bb47..40205f1 100644
--- a/src/sna/sna_display.c
+++ b/src/sna/sna_display.c
@@ -2774,6 +2774,11 @@ static bool sna_emit_wait_for_scanline_gen7(struct sna *s
        assert(y2 > y1);
        assert(sna->kgem.mode);
 
+       if (full_height) {
+               y2 -= 2;
+               full_height = 0;
+       }
+
        b = kgem_get_batch(&sna->kgem, 16);
        b[0] = MI_LOAD_REGISTER_IMM | 1;
        b[1] = 0x44050; /* DERRMR */
Comment 5 Christoph Haag 2013-01-04 19:26:56 UTC
It does not seem to help but I have discovered some more symptoms:

I have one notebook screen and one external screen. The hang only happens if I try to use fullscreen on the external screen, both when it is on the left and when it is on the right of the notebook screen. On the notebook screen it works fine, whether the external screen is enabled or disabled.

If both screens are enabled and I try xrandr --output LVDS1 --off my external screen gets no signal anymore without any error message. But X is still running it seems.

All of the above is both with the patch and without.
Comment 6 Chris Wilson 2013-01-04 19:56:23 UTC
That will be something like an error being detected whilst changing state, so it disabled the external output as well. You should be able to recover by running xrandr --output HDMI1 --off ; xrandr --output HDMI1 --preferred

Once you regain control of just the HDMI, can you please try testing with

xrandr --output HDMI1 --off ; xrandr --output HDMI1 --crtc 0

and

xrandr --output HDMI1 --off ; xrandr --output HDMI1 --crtc 1

(if you feel really brave, --crtc 2 as well).

You can also repeat the test using s/HDMI1/LVDS1/

If it consistently fails with either on crtc 1, then I'm requesting the wrong events to be unmasked and delivered, or setting up the wrong scanline register, etc.
Comment 7 Christoph Haag 2013-01-04 21:17:11 UTC
Well, this is frustrating to test. It seems to have something to do with how the crtcs are configured but I have not found out any pattern...


This is how enabling only hdmi finally worked consistently:

xrandr --output LVDS1 --off --output HDMI1 --preferred ; sleep 2 ; xrandr --output HDMI1 --off ; xrandr --output HDMI1 --preferred --crtc 1

When just randomly enabling and disabling hdmi only it sometimes seemed to work but went black as soon as the kwin compositor did anything and without compositing it mostly worked... But with the above command it consistently works.



So with HDMI1 only there usually is no hang with any of the crtcs. After playing around a bit with dual settings and coming back to HDMI1 only with crtc 1 the hang happened once. But usually it doesn't.

Actually according to xrandr --verbose without any parameters LVDS1 seems to be connected via crtc 1 and HDMI1 via crtc 0.
So the hang originally happened with crtc 0 on the external screen, but not with crtc 1 on the internal screen.
When I tried just switching them, internal with crtc 0 and external with crtc 1, the hang just happened on both screens.

I have not tried all 9 possible combinations, but it mostly seems to cause the hang everywhere. Maybe tomorrow I'll be up for a more systematic test.
Comment 8 Daniel Vetter 2013-01-14 09:38:23 UTC
Tracking as a regression, since a kernel upgrade shows it, and we have regression reports about it:

https://lkml.org/lkml/2013/1/14/33
Comment 9 Chris Wilson 2013-01-21 11:40:31 UTC
The bspec recommends setting a few other registers when trying to program waits:

https://patchwork.kernel.org/patch/2008311/
https://patchwork.kernel.org/patch/2008341/
https://patchwork.kernel.org/patch/2008371/

And also update your DDX for a few fine adjustments - I especially like the caveat that the LOAD_SCANLINE + WAIT must be in the same cacheline on IVB.
Comment 10 Stefan Huber 2013-01-21 12:28:04 UTC
The three patches applied onto v3.8-rc4 have resolved the issue for me.

The issue appeared on the following system: v3.8-rc4 kernel (git bisect gave me commit d7d4eeddb8f72342f70621c4b3cb718af9361712), xf86-video-intel-2.20-{13,18} with sna enabled (with uxa the problem did not appear), a rev 09 Intel Sandybridge graphics adapter. Mplayer with default xv video output triggered the issue for me after a few seconds.
Comment 11 Christoph Haag 2013-01-21 14:09:05 UTC
Sorry for having been inactive for a while; at least one person said they were already in the process of bisecting...

I have applied the three patches to 3.8rc4 too and it is not fixed for me.

I saw this: https://lkml.org/lkml/2013/1/14/165
With Option "SwapbuffersWait" "false" in my Device section the GPU does not hang.
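
For anyone wanting to try the same workaround, a Device section along these lines should do it. The Identifier string is a placeholder; the file it belongs in depends on the distribution's X configuration layout.

```
Section "Device"
	Identifier "Intel Graphics"
	Driver     "intel"
	Option     "SwapbuffersWait" "false"
EndSection
```

X has to be restarted for the option to take effect.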
Comment 12 Timo Jyrinki 2013-01-23 06:22:13 UTC
Confirming that disabling SwapbuffersWait seems to fix my problem. (downstream bug http://pad.lv/1102390)
Comment 13 Daniel Vetter 2013-01-23 09:33:02 UTC
Can you please retest with latest drm-intel-fixes from http://cgit.freedesktop.org/~danvet/drm-intel ? I've merged 2 patches from Chris Wilson which should help here.
Comment 14 Christoph Haag 2013-01-23 11:59:06 UTC
3.8.0-rc3-g9452618, it does not help against my hang.

The patches you mentioned are f05bb0c and 1c8c38c I guess, sounds like the ones already posted here in #9.

By the way, drm-intel-fixes makes activating and deactivating screens horribly slow. 10-15 seconds to change from LVDS1 only to LVDS1 + HDMI1 + HDMI2 side-by-side. It takes another 10-30 seconds to make the mouse pointer stop lagging.
I accidentally built drm-intel-testing first and there screen switching works relatively fast (~2 seconds). This is not really relevant to the bug report, just to let you know that the patches between drm-intel-testing and drm-intel-fixes have some side effects.
Comment 15 Timo Jyrinki 2013-01-23 14:18:04 UTC
The kernel from git in comment #13 does not fix my problem. My i915_state.txt as a direct link http://is.gd/jOugbB
Comment 16 Chris Wilson 2013-01-23 14:41:06 UTC
(In reply to comment #15)
> The kernel from git in comment #13 does not fix my problem. My i915_state.txt
> as a direct link http://is.gd/jOugbB

That error state was not from drm-intel-fixes.
Comment 17 Christoph Haag 2013-01-23 15:32:42 UTC
Created attachment 91701 [details]
i915_error_state from drm-fixed

So, is the error state from drm-fixes supposed to be different? This is mine (and I am definitely running 3.8.0-rc3-g9452618).
Comment 18 Chris Wilson 2013-01-23 15:37:08 UTC
(In reply to comment #17)
> Created an attachment (id=91701) [details]
> i915_error_state from drm-fixed
> 
> So, is the error state from drm-fixes supposed to be different? This is mine
> (and I am definitely running 3.8.0-rc3-g9452618).

Yes, it contains these two extra lines:

FORCEWAKE: 0x00010003
DERRMR: 0xfffff7ff
Comment 19 Chris Wilson 2013-01-23 15:57:22 UTC
(In reply to comment #17)
> Created an attachment (id=91701) [details]
> i915_error_state from drm-fixed

Here we are supposed to be waiting for the scanline window to be outside of the [437,438) range. It suggests that maybe IVB has a similar granularity to SNB, maybe:

diff --git a/src/sna/sna_display.c b/src/sna/sna_display.c
index 967b88b..5eefb20 100644
--- a/src/sna/sna_display.c
+++ b/src/sna/sna_display.c
@@ -2788,6 +2788,12 @@ static bool sna_emit_wait_for_scanline_gen7(struct sna *s
                y1 = crtc->bounds.y2;
        y2--;
 
+       /* The scanline granularity is 3 bits */
+       y1 &= ~7;
+       y2 &= ~7;
+       if (y2 == y1)
+               return false;
+
        b = kgem_get_batch(&sna->kgem);
 
        /* Both the LRI and WAIT_FOR_EVENT must be in the same cacheline */
Comment 20 Chris Wilson 2013-01-23 15:59:20 UTC
And for IVB, you should also be sure to test https://patchwork.kernel.org/patch/2008371/ as well - this was only explicitly mentioned as a w/a for SNB...
Comment 21 Christoph Haag 2013-01-23 16:53:25 UTC
Created attachment 91711 [details]
i915_error_state_drm-fixes with #19 and #20 applied

I think I have applied the patch from #19 to xf86-video-intel git master and the one from #20 on top of intel-drm-fixes, and it doesn't seem to help.
Comment 22 Chris Wilson 2013-01-23 17:04:17 UTC
Ok, maybe it just doesn't like the wait-for-vblank, so try:

diff --git a/src/sna/sna_display.c b/src/sna/sna_display.c
index 967b88b..25b492f 100644
--- a/src/sna/sna_display.c
+++ b/src/sna/sna_display.c
@@ -2783,6 +2783,8 @@ static bool sna_emit_wait_for_scanline_gen7(struct sna *sn
        assert(y2 > y1);
        assert(sna->kgem.mode);
 
+       full_height = false;
+
        /* Always program one less than the desired value */
        if (--y1 < 0)
                y1 = crtc->bounds.y2;
Comment 23 Christoph Haag 2013-01-23 17:19:58 UTC
Created attachment 91721 [details]
error state with #22

No change. This happened with an HTML5 video on youtube.
Comment 24 Timo Jyrinki 2013-01-23 17:28:53 UTC
Created attachment 91731 [details]
i915_error_state, SNB, git 9452618e

(In reply to comment #16)
> That error state was not from drm-intel-fixes.

True, that was the original one. Furthermore as indicated in the downstream bug I've SNB, not IVB, and my way of triggering the bug is attaching an external screen, then closing the laptop lid. Nevertheless, in case it's the same root cause here's the i915_error_state.txt that got generated during the git 9452618e kernel usage. In case it's completely different and I'm in the wrong bug (sorry), I'll file a new one for mine.
Comment 25 Chris Wilson 2013-01-23 18:02:35 UTC
Meh, I completely fluffed programming the WAIT_FOR_EVENT on the secondary pipes. Please retest with xf86-video-intel 3c3a87a2d4261cbd66602812637328a04787f510.
Comment 26 Christoph Haag 2013-01-23 18:37:00 UTC
Ok, I tried ea8148b24d48db4f46205817db8a55dd6ea1a4b3, the next commit onto 3c3a87a2d4261cbd66602812637328a04787f510 with no patches.

It works fine for me now. The GPU hang is finally gone with the patched intel-drm-fixes kernel and also with an unpatched 3.8rc4 kernel.

Thanks.
Comment 27 Chris Wilson 2013-01-23 22:39:33 UTC
Timo, can you test with a new ddx? It should be available through xorg-edgers in a day or so, though if you find the time to compile it now, I would be grateful. As you are on SNB, you will need to use drm-intel-fixes as well.
Comment 28 Timo Jyrinki 2013-01-24 06:36:46 UTC
Confirming that the problem is fixed, tried out git 778dba90 (five commits after 3c3a87a2).
Comment 29 Daniel Vetter 2013-01-24 09:59:08 UTC
Ok, looks like things are working nicely now, thanks everyone for reporting this bug and testing patches.
Comment 30 Florian Mickler 2013-02-05 00:08:14 UTC
A patch referencing this bug report has been merged in Linux v3.8-rc6:

commit f05bb0c7b624252a5e768287e340e8e45df96e42
Author: Chris Wilson <chris@chris-wilson.co.uk>
Date:   Sun Jan 20 16:33:32 2013 +0000

    drm/i915: GFX_MODE Flush TLB Invalidate Mode must be '1' for scanline waits
Comment 31 Florian Mickler 2013-02-05 00:11:31 UTC
A patch referencing this bug report has been merged in Linux v3.8-rc6:

commit 1c8c38c588ea91f8deeae21284840459d1bb58e3
Author: Chris Wilson <chris@chris-wilson.co.uk>
Date:   Sun Jan 20 16:11:20 2013 +0000

    drm/i915: Disable AsyncFlip performance optimisations