Created attachment 90481 [details]
full dmesg with hang at the end

Software: Linux 3.8-rc2; the problem also occurred in rc1. In 3.7 it does NOT hang. mesa and xf86-video-intel are the latest from git master, and I'm using SNA.
Hardware: i7 3632QM

Unfortunately I have only been able to trigger this bug with Flash 11.5, but fortunately it always happens immediately, so it's easy to reproduce.

Steps:
Use Google Chrome / Chromium with Pepper Flash
Watch a Flash video (e.g. from YouTube) / try to put it on fullscreen

Without compositing: the hang occurs as soon as the video starts.
With compositing (kwin 4.10 beta, OpenGL): the video plays fine, but the hang occurs as soon as I put it on fullscreen.

I'll paste this snippet from dmesg directly here so it's easier to find with Google (I have not found that exact problem reported elsewhere):

[ 1424.837260] [drm:i915_hangcheck_hung] *ERROR* Hangcheck timer elapsed... GPU hung
[ 1424.837270] [drm] capturing error event; look for more information in /debug/dri/0/i915_error_state
[ 1424.842657] [drm:kick_ring] *ERROR* Kicking stuck wait on render ring

The rest is hopefully clear from the attached files.
Created attachment 90491 [details] Xorg.0.log
Created attachment 90501 [details] i915_error_state, gzipped because of the size limit
It doesn't seem to like screen height-1 scanline waits. Probably should convert those into a vblank wait.
What you can try instead is:

diff --git a/src/sna/sna_display.c b/src/sna/sna_display.c
index c66bb47..40205f1 100644
--- a/src/sna/sna_display.c
+++ b/src/sna/sna_display.c
@@ -2774,6 +2774,11 @@ static bool sna_emit_wait_for_scanline_gen7(struct sna *s
 	assert(y2 > y1);
 	assert(sna->kgem.mode);
 
+	if (full_height) {
+		y2 -= 2;
+		full_height = 0;
+	}
+
 	b = kgem_get_batch(&sna->kgem, 16);
 	b[0] = MI_LOAD_REGISTER_IMM | 1;
 	b[1] = 0x44050; /* DERRMR */
It does not seem to help, but I have discovered some more symptoms:

I have one notebook screen and one external screen. The hang only happens when I try to use fullscreen on the external screen, regardless of whether it is placed to the left or to the right of the notebook screen. On the notebook screen fullscreen works fine, whether the external screen is enabled or disabled.

If both screens are enabled and I run

xrandr --output LVDS1 --off

my external screen loses its signal without any error message, although X still seems to be running.

All of the above happens both with and without the patch.
That will be something like an error being detected whilst changing state, so it disabled the external output as well. You should be able to recover by running

xrandr --output HDMI1 --off ; xrandr --output HDMI1 --preferred

Once you regain control of just the HDMI, can you please try testing with

xrandr --output HDMI1 --off ; xrandr --output HDMI1 --crtc 0

and

xrandr --output HDMI1 --off ; xrandr --output HDMI1 --crtc 1

(if you feel really brave, --crtc 2 as well). You can also repeat the test using s/HDMI1/LVDS1/.

If it consistently fails with either on crtc 1, then I'm requesting the wrong events to be unmasked and delivered, or setting up the wrong scanline register, etc.
Well, this is frustrating to test. It seems to have something to do with how the crtcs are configured, but I have not found any pattern...

This is how enabling only HDMI finally worked consistently:

xrandr --output LVDS1 --off --output HDMI1 --preferred ; sleep 2 ; xrandr --output HDMI1 --off ; xrandr --output HDMI1 --preferred --crtc 1

When I just randomly enabled and disabled HDMI on its own, it sometimes seemed to work but went black as soon as the kwin compositor did anything; without compositing it mostly worked. With the above command it works consistently.

So with HDMI1 only there is usually no hang with any of the crtcs. After playing around a bit with dual-screen settings and coming back to HDMI1 only on crtc 1, the hang happened once, but usually it doesn't.

According to xrandr --verbose (run without any other parameters), LVDS1 seems to be connected via crtc 1 and HDMI1 via crtc 0. So the hang originally happened with crtc 0 on the external screen, but not with crtc 1 on the internal screen. When I tried just switching them (internal on crtc 0, external on crtc 1), the hang happened on both screens.

I have not tried all 9 possible combinations, but most of them seem to cause the hang on both screens. Maybe tomorrow I'll be up for a more systematic test.
Tracking as a regression, since a kernel upgrade shows it, and we have regression reports about it: https://lkml.org/lkml/2013/1/14/33
The bspec recommends setting a few other registers when trying to program waits:

https://patchwork.kernel.org/patch/2008311/
https://patchwork.kernel.org/patch/2008341/
https://patchwork.kernel.org/patch/2008371/

Please also update your DDX for a few fine adjustments. I especially like the caveat that the LOAD_SCANLINE + WAIT must be in the same cacheline on IVB.
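The same-cacheline caveat can be illustrated with a small standalone sketch. This is not the DDX code: the helper name cacheline_pad_dwords is hypothetical, and it assumes 64-byte cachelines (16 dwords) and a batch buffer that starts on a cacheline boundary. It only shows how one might compute the padding needed before a command sequence so that the sequence does not straddle a cacheline.

```c
#include <assert.h>
#include <stdint.h>

/* Assumption: 64-byte cachelines, i.e. 16 dwords of 4 bytes each. */
#define CACHELINE_DWORDS 16u

/*
 * Hypothetical helper (not part of xf86-video-intel).
 *
 * offset: current write position in the batch, in dwords; the batch is
 *         assumed to begin on a cacheline boundary.
 * len:    length in dwords of the sequence that must stay together.
 *
 * Returns the number of filler dwords to emit first so that the whole
 * sequence lands inside a single cacheline.
 */
uint32_t cacheline_pad_dwords(uint32_t offset, uint32_t len)
{
	uint32_t used = offset % CACHELINE_DWORDS;

	assert(len <= CACHELINE_DWORDS);
	if (used + len > CACHELINE_DWORDS)
		return CACHELINE_DWORDS - used; /* skip to the next cacheline */
	return 0; /* already fits */
}
```

For example, a 5-dword sequence at dword offset 13 would need 3 filler dwords under these assumptions, while the same sequence at offset 11 fits as is.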
The three patches applied onto v3.8-rc4 have resolved the issue for me. The issue appeared on the following system: v3.8-rc4 kernel (git bisect gave me commit d7d4eeddb8f72342f70621c4b3cb718af9361712), xf86-video-intel-2.20-{13,18} with sna enabled (with uxa the problem did not appear), a rev 09 Intel Sandybridge graphics adapter. Mplayer with default xv video output triggered the issue for me after a few seconds.
Sorry for having been inactive for a while; at least one person said they were already in the process of bisecting...

I have applied the three patches to 3.8-rc4 too, and it is not fixed for me.

I saw this:
https://lkml.org/lkml/2013/1/14/165

With
Option "SwapbuffersWait" "false"
in my Device section the GPU does not hang.
Confirming that disabling SwapbuffersWait seems to fix my problem. (downstream bug http://pad.lv/1102390)
Can you please retest with latest drm-intel-fixes from http://cgit.freedesktop.org/~danvet/drm-intel ? I've merged 2 patches from Chris Wilson which should help here.
Tested with 3.8.0-rc3-g9452618; it does not fix my hang. The patches you mentioned are f05bb0c and 1c8c38c I guess, which sound like the ones already posted here in #9.

By the way, drm-intel-fixes makes activating and deactivating screens horribly slow: 10-15 seconds to change from LVDS1 only to LVDS1 + HDMI1 + HDMI2 side by side, and another 10-30 seconds until the mouse pointer stops lagging. I accidentally built drm-intel-testing first, and there screen switching is relatively fast (~2 seconds). This is not really relevant to the bug report; I just want to let you know that the patches between drm-intel-testing and drm-intel-fixes have some side effects.
The kernel from git in comment #13 does not fix my problem. My i915_state.txt as a direct link http://is.gd/jOugbB
(In reply to comment #15)
> The kernel from git in comment #13 does not fix my problem. My i915_state.txt
> as a direct link http://is.gd/jOugbB

That error state was not from drm-intel-fixes.
Created attachment 91701 [details]
i915_error_state from drm-fixed

So, is the error state from drm-fixes supposed to be different? This is mine (and I am definitely running 3.8.0-rc3-g9452618).
(In reply to comment #17)
> Created an attachment (id=91701) [details]
> i915_error_state from drm-fixed
>
> So, is the error state from drm-fixes supposed to be different? This is mine
> (and I am definitely running 3.8.0-rc3-g9452618).

Yes, it contains these two extra lines:

FORCEWAKE: 0x00010003
DERRMR: 0xfffff7ff
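For anyone who wants to check their own dump, here is a minimal sketch of looking for those two lines programmatically. It assumes only the "NAME: 0x..." line layout quoted above; the function name error_state_register is hypothetical, and the dump is assumed to have been read into memory already.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>
#include <string.h>

/*
 * Hypothetical helper: search an in-memory i915_error_state dump for a
 * line of the form "NAME: 0x<hex>" and store the parsed value.
 * Returns true if such a line was found.
 */
bool error_state_register(const char *dump, const char *name,
			  unsigned int *value)
{
	size_t len;
	const char *line = dump;

	assert(dump && name && value);
	len = strlen(name);

	while (line && *line) {
		/* Skip leading blanks so indented lines still match. */
		while (*line == ' ' || *line == '\t')
			line++;
		/* Require "NAME:" at the start of the line, then a hex value. */
		if (strncmp(line, name, len) == 0 && line[len] == ':' &&
		    sscanf(line + len + 1, " %x", value) == 1)
			return true;
		/* Advance to the character after the next newline, if any. */
		line = strchr(line, '\n');
		if (line)
			line++;
	}
	return false;
}
```

Under the statement in this comment, a dump in which neither FORCEWAKE nor DERRMR can be found would not have come from a drm-intel-fixes kernel.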
(In reply to comment #17)
> Created an attachment (id=91701) [details]
> i915_error_state from drm-fixed

Here we are supposed to be waiting for the scanline window to be outside of the [437,438) range. It suggests that maybe IVB has a similar granularity to SNB, maybe:

diff --git a/src/sna/sna_display.c b/src/sna/sna_display.c
index 967b88b..5eefb20 100644
--- a/src/sna/sna_display.c
+++ b/src/sna/sna_display.c
@@ -2788,6 +2788,12 @@ static bool sna_emit_wait_for_scanline_gen7(struct sna *s
 		y1 = crtc->bounds.y2;
 	y2--;
 
+	/* The scanline granularity is 3 bits */
+	y1 &= ~7;
+	y2 &= ~7;
+	if (y2 == y1)
+		return false;
+
 	b = kgem_get_batch(&sna->kgem);
 
 	/* Both the LRI and WAIT_FOR_EVENT must be in the same cacheline */
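The masking step in the patch above can be tried in isolation. The following is a minimal sketch, not the driver function: align_scanline_window is a hypothetical name, the mask ~7 is taken from the patch, and the sketch assumes y1 < y2 on entry (it ignores the wrap-around case handled earlier in the real function).

```c
#include <assert.h>
#include <stdbool.h>

/* From the patch: drop the low 3 bits, i.e. round down to 8-scanline steps. */
#define SCANLINE_MASK (~7)

/*
 * Hypothetical helper mirroring the masking step of the patch.
 * Rounds both ends of the window down to the assumed granularity and
 * returns false when the aligned window is empty, in which case the
 * caller would skip emitting the wait.
 */
bool align_scanline_window(int *y1, int *y2)
{
	assert(*y2 > *y1);
	*y1 &= SCANLINE_MASK;
	*y2 &= SCANLINE_MASK;
	return *y2 != *y1;
}
```

With the [437,438) window from the error state, both ends round down to 432, so the window collapses and no wait would be emitted; a window such as [100,200) becomes [96,200) and is kept.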
And for IVB, you should also be sure to test https://patchwork.kernel.org/patch/2008371/ as well - this was only explicitly mentioned as a w/a for SNB...
Created attachment 91711 [details]
i915_error_state_drm-fixes with #19 and #20 applied

I think I have applied the patch from #19 to xf86-video-intel git master and the one from #20 on top of drm-intel-fixes, and it doesn't seem to help.
Ok, maybe it just doesn't like the wait-for-vblank, so try:

diff --git a/src/sna/sna_display.c b/src/sna/sna_display.c
index 967b88b..25b492f 100644
--- a/src/sna/sna_display.c
+++ b/src/sna/sna_display.c
@@ -2783,6 +2783,8 @@ static bool sna_emit_wait_for_scanline_gen7(struct sna *sn
 	assert(y2 > y1);
 	assert(sna->kgem.mode);
 
+	full_height = false;
+
 	/* Always program one less than the desired value */
 	if (--y1 < 0)
 		y1 = crtc->bounds.y2;
Created attachment 91721 [details]
error state with #22

No change. This happened with an HTML5 video on YouTube.
Created attachment 91731 [details]
i915_error_state, SNB, git 9452618e

(In reply to comment #16)
> That error state was not from drm-intel-fixes.

True, that was the original one. Furthermore, as indicated in the downstream bug, I have SNB, not IVB, and my way of triggering the bug is attaching an external screen and then closing the laptop lid.

Nevertheless, in case it's the same root cause, here's the i915_error_state.txt that was generated while running the git 9452618e kernel. In case it's completely different and I'm in the wrong bug (sorry), I'll file a new one for mine.
Meh, I completely fluffed programming the WAIT_FOR_EVENT on the secondary pipes. Please retest with xf86-video-intel 3c3a87a2d4261cbd66602812637328a04787f510.
Ok, I tried ea8148b24d48db4f46205817db8a55dd6ea1a4b3, the next commit after 3c3a87a2d4261cbd66602812637328a04787f510, with no extra patches. It works fine for me now. The GPU hang is finally gone, both with the patched drm-intel-fixes kernel and with an unpatched 3.8-rc4 kernel. Thanks.
Timo, can you test with a new ddx? It should be available through xorg-edgers in a day or so, though if you find the time to compile it now, I would be grateful. As you are on SNB, you will need to use drm-intel-fixes as well.
Confirming that the problem is fixed, tried out git 778dba90 (five commits after 3c3a87a2).
Ok, looks like things are working nicely now, thanks everyone for reporting this bug and testing patches.
A patch referencing this bug report has been merged in Linux v3.8-rc6:

commit f05bb0c7b624252a5e768287e340e8e45df96e42
Author: Chris Wilson <chris@chris-wilson.co.uk>
Date:   Sun Jan 20 16:33:32 2013 +0000

    drm/i915: GFX_MODE Flush TLB Invalidate Mode must be '1' for scanline waits
A patch referencing this bug report has been merged in Linux v3.8-rc6:

commit 1c8c38c588ea91f8deeae21284840459d1bb58e3
Author: Chris Wilson <chris@chris-wilson.co.uk>
Date:   Sun Jan 20 16:11:20 2013 +0000

    drm/i915: Disable AsyncFlip performance optimisations