Bug 27572

Summary: [i915] display artifacts on some windows (snapshots incld)
Product: Drivers
Reporter: tomas m (tmezzadra)
Component: Video(DRI - Intel)
Assignee: drivers_video-dri-intel (drivers_video-dri-intel)
Status: RESOLVED OBSOLETE
Severity: low
CC: chris, daniel, eddy.petrisor+linbug, gdamjan, indan, r.schtz, raa.lkml
Priority: P1
Hardware: All
OS: Linux
Kernel Version: 2.6.38-rc
Subsystem:
Regression: Yes
Bisected commit-id:
Attachments:
  black stripes over a pidgin conversation.
  white stripes over a pidgin conversation
  the google logo rendered by firefox.
  slashdot page with artifacts. kernel 2.6.38-rc4
  updated version of the patch in comment #31 for 2.6.38 stable
  Image corruption with 2.6.32

Description tomas m 2011-01-25 11:22:04 UTC
Created attachment 45112
black stripes over a pidgin conversation. 

happens since 2.6.38-rc
KMS is enabled.

running a dualscreen display.


Name           : xf86-video-intel
Version        : 2.14.0-1

Name           : mesa
Version        : 7.10-1

Name           : xorg-server
Version        : 1.9.3.901-1

when clicking on the window, the corruption goes away.

im attaching two screenshots. the stripes appear to be pixels from other portions of the screen.

this is running compiz. havent tested without it, but im doing it now
Comment 1 tomas m 2011-01-25 11:25:36 UTC
Created attachment 45122
white stripes over a pidgin conversation

forgot to mention: blue boxes are intentional, for privacy reasons. screen below them was correctly rendered

sometimes, the screen gets stretched vertically, where some additional 1 pixel width lines get inserted in between (as if being interlaced)
Comment 2 tomas m 2011-01-26 01:42:53 UTC
Created attachment 45202
the google logo rendered by firefox.
Comment 3 Chris Wilson 2011-01-26 20:25:47 UTC
Looks like a tiling ddx/mesa bug.
Comment 4 tomas m 2011-01-29 17:15:51 UTC
should i file this bug somewhere else?
Comment 5 Chris Wilson 2011-01-29 17:27:59 UTC
You haven't given much to go on. With an unspecified kernel version and no log files, you could easily be reporting one of the short-lived regressions before rc1 or something much more sinister.

If it is indeed a kernel regression, then it would have been trivial to bisect...
Comment 6 tomas m 2011-01-29 18:03:06 UTC
will try to bisect this and report back.

bisecting before rc1 has always proven unfruitful to me in the past.
Comment 7 tomas m 2011-01-31 02:18:26 UTC
rc1 and rc2 have the same issue.

i will try to bisect from 2.6.37 to 2.6.38-rc1 but its quite hard to catch it (have to use the system for a while for the bug to show itself).
Comment 8 tomas m 2011-02-12 15:42:08 UTC
bisected this between 2.6.37 and 2.6.38-rc1

came down to this:

----------------
6bda10d152735c22baf1dcd92937420b4b0a359a is the first bad commit
commit 6bda10d152735c22baf1dcd92937420b4b0a359a
Author: Chris Wilson <chris@chris-wilson.co.uk>
Date:   Sun Dec 5 21:04:18 2010 +0000

    drm/i915: Completely disable fence pipelining.
    
    I'm still seeing tiling corruption of PutImage and CopyArea (I think)
    under mutter on pnv, so obviously the pipelining logic is deeply flawed.
    
    Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>

:040000 040000 379a8302a494aee383969f5cefe5a166ac040a47 a15f4dce648440a8445da8c784ef7fa44a99ad42 M      drivers
--------------------------


ive reverted this on 2.6.38-rc1 and artifacts appear to be gone.

does this make any sense?
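
For context, git bisect is a binary search over the commit range, and it only converges on the right commit if every good/bad verdict is correct. A minimal sketch of that search, assuming commits are numbered by index and a hypothetical is_bad callback (neither is part of git or the kernel), shows why an intermittent bug is a problem: one build wrongly judged "good" moves the lower bound past the real culprit.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical verdict callback: true if the build at this index shows the bug. */
typedef bool (*is_bad_fn)(size_t commit_index);

/*
 * Binary search for the first bad commit, assuming `good` is known good,
 * `bad` is known bad, and every commit after the first bad one is also bad.
 */
size_t first_bad_commit(size_t good, size_t bad, is_bad_fn is_bad)
{
    while (bad - good > 1) {
        size_t mid = good + (bad - good) / 2;

        if (is_bad(mid))
            bad = mid;   /* culprit is at mid or earlier */
        else
            good = mid;  /* culprit is after mid, if the verdict is right */
    }
    return bad;
}

/* Example verdict for illustration only: the bug appears from commit 7 on. */
bool bad_from_seven(size_t commit_index)
{
    return commit_index >= 7;
}
```

With a bug that takes hours to show up, the "good" branch is the unreliable one, so each good verdict needs a long test run before it is trusted.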
Comment 9 tomas m 2011-02-13 14:01:59 UTC
its a regression
Comment 10 Chris Wilson 2011-02-13 20:38:49 UTC
I doubt your bisection. That patch does exactly what it says and restores the behaviour to 2.6.37.
Comment 11 tomas m 2011-02-14 01:56:50 UTC
right now im testing rc4 with the patch reverted, and things work ok.

i will build it without reverting and report back. but im almost sure of the outcome :(

testing rc1 with the patch reverted did indeed remove the artifacts for me (?????)
Comment 12 tomas m 2011-02-14 02:36:34 UTC
Created attachment 47742
slashdot page with artifacts. kernel 2.6.38-rc4

and it didnt take me long to grab another example.

rc4 without the patch reverted.

i dont know what triggers this, but reverting this patch fixes the issue for me.

is there anything else i can try to get to the bottom of this?

scrolling the webpage makes them go away and come back.
Comment 13 tomas m 2011-02-14 23:23:53 UTC
Chris,

Ive been testing 2.6.37 and found no screen corruption.

is it possible that another commit in conjunction with the one from comment 8 be triggering this? 

is there a way i can test this theory out?
Comment 14 Alex Riesen 2011-02-19 14:56:31 UTC
FWIW, I had seen problems in updating parts on the screen.

For example, just trying to scroll back in rxvt would cause only parts of the rxvt window to be redrawn (usually some 100 pixels at the top and always the last line), but the middle of the window stayed at the old contents. Same for YouTube videos in firefox (with Adobe's plugin).

After reverting the patch I couldn't reproduce the problem yet.
Comment 15 Alex Riesen 2011-02-19 15:02:14 UTC
Strangely, I also cannot reproduce the problem without reverting the patch but when using compiz. Probably because all the graphic output goes through the compositor and takes a different path in the graphics code.
Comment 16 Indan 2011-02-21 03:02:07 UTC
I suspect you're hitting the same problem as me. Please try Daniel's test patch:
https://lkml.org/lkml/2011/2/20/25

diff --git a/drivers/gpu/drm/i915/i915_dma.c b/drivers/gpu/drm/i915/i915_dma.c
index 17bd766..d275c96 100644
--- a/drivers/gpu/drm/i915/i915_dma.c
+++ b/drivers/gpu/drm/i915/i915_dma.c
@@ -762,9 +762,6 @@ static int i915_getparam(struct drm_device *dev, void *data,
 	case I915_PARAM_HAS_BLT:
 		value = HAS_BLT(dev);
 		break;
-	case I915_PARAM_HAS_RELAXED_FENCING:
-		value = 1;
-		break;
 	case I915_PARAM_HAS_COHERENT_RINGS:
 		value = 1;
 		break;

If that does not totally fix it, try his other patch too.
Comment 17 Alex Riesen 2011-02-21 18:58:49 UTC
This patch can't be a fix for anything! It just leaves "value" uninitialized
if "param->param" is set to I915_PARAM_HAS_RELAXED_FENCING.

gcc complains about uninitialized "value", BTW.
Comment 18 tomas m 2011-02-21 19:04:15 UTC
that patch is not supposed to be a fix, but a workaround test to figure out where we are failing.
Comment 19 Alex Riesen 2011-02-21 19:16:57 UTC
ah. That's why the variable was left at random value...
Comment 20 Alex Riesen 2011-02-21 20:55:57 UTC
Whatever the reasons: it does not help at all. I initialized the value with 0, just to be the opposite to the removed case. Or because zeros happen relatively often in uninitialized on-stack variables?
Comment 21 tomas m 2011-02-21 22:09:59 UTC
the patch does not fix the issue for me. :(

gonna test the first one next
Comment 22 Indan 2011-02-22 01:37:12 UTC
(In reply to comment #20)
> Whatever the reasons: it does not help at all. I initialized the value with 0,
> just to be the opposite to the removed case. Or because zeros happen relatively
> often in uninitialized on-stack variables?

Alex, you missed that it removes the whole case, not just the value setting.
Initializing to 0 does not help, because the userspace code only checks the 
return value, not the actual value set. You have to remove the 
I915_PARAM_HAS_RELAXED_FENCING case.

Tomas, too bad it doesn't work for you. Your corruption looks different than
the one I get, but it was worth a try.
Comment 23 Alex Riesen 2011-02-22 06:35:51 UTC
Indan, you missed that the variable value becomes UN-FSCKING-initialized!
Comment 24 Indan 2011-02-22 07:46:27 UTC
Read the code.

'value' isn't used if the param is unknown, because it goes to the default 
case which returns -EINVAL. So it's impossible to use it uninitialised, whatever 
gcc says. And if you read the userspace code which actually uses this ioctl, 
the damn 'value' isn't even used, it just checks the function return value 
(which is a bug in itself, but oh well).
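
The control flow argued about here can be modelled in a small standalone sketch. This is not the kernel source: the parameter IDs, model_getparam and model_userspace_has_feature are hypothetical names, and the real i915_getparam() copies the value back to userspace rather than writing through a pointer. It only illustrates that an unknown parameter leaves the function through the default case before value is read, and that a caller which checks only the return code behaves differently once the case is removed.

```c
#include <errno.h>

/* Hypothetical parameter IDs for this sketch; not the real I915_PARAM_* values. */
enum model_param {
    MODEL_PARAM_HAS_BLT = 1,
    MODEL_PARAM_HAS_RELAXED_FENCING = 2,
    MODEL_PARAM_HAS_COHERENT_RINGS = 3,
};

/*
 * Simplified model of the switch with the RELAXED_FENCING case removed,
 * as in the test patch. An unknown parameter hits the default case and
 * returns -EINVAL, so `value` is never read while uninitialized.
 */
int model_getparam(int param, int *value_out)
{
    int value;

    switch (param) {
    case MODEL_PARAM_HAS_BLT:
        value = 1; /* the driver derives this from the hardware; fixed here */
        break;
    case MODEL_PARAM_HAS_COHERENT_RINGS:
        value = 1;
        break;
    default:
        return -EINVAL;
    }

    *value_out = value;
    return 0;
}

/*
 * Model of a caller that, as described in the thread, looks only at the
 * return code and ignores the value that was set.
 */
int model_userspace_has_feature(int param)
{
    int ignored = 0;

    return model_getparam(param, &ignored) == 0;
}
```

Under this model, setting value = 0 in the case would not change what such a caller sees; only returning an error (or removing the case) does.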

Anyway, I discovered I have lots of screen corruption when running without
xcompmgr, so there are even more problems, though it seems caused by user
space bugginess.
Comment 25 Alex Riesen 2011-02-22 08:12:03 UTC
I see. You should have just replaced the "break;" in that case with
"return -EINVAL;" and your intention would be immediately obvious.

I re-tested this (with "return -EINVAL;") on my laptop and could not
reproduce any picture corruption running without compiz compositor.

Chris?

Is this change any better than reverting that "disable fence pipelining" commit?
If so, could we have it (or something which you deem better) submitted
for v2.6.38? Just now 2.6.38 is almost unusable on this chipset.
Comment 26 Indan 2011-02-22 08:42:35 UTC
It's Daniel's patch, not mine. And his original patch set value = 0, which
is clear, but as mentioned before, the userspace code ignores that and only
checks the return value. It's just a quick patch trying to narrow down the
source of the problem, the patch was called "for-indan-2.patch". Daniel is
working in his spare time on the Intel code, supposedly for fun ;-), and I'm
just a user like you who tries to keep things working for myself.

But anyway, it would be nice to have some real fix before the release,
or if the source isn't found, at least a temporary hack to disable the
faulty code.

I'll test with "disable fence pipelining" reverted and tell if it makes a 
difference for me.
Comment 27 Indan 2011-02-22 09:54:03 UTC
I tried plain 2.6.38-rc6 with "disable fence pipelining" reverted and I 
still got the screen corruption, so reverting 6bda10d152735c22baf1dcd
doesn't work for me.
Comment 28 Alex Riesen 2011-02-22 10:04:01 UTC
But returning EINVAL for I915_PARAM_HAS_RELAXED_FENCING does?
Comment 29 Indan 2011-02-22 10:10:04 UTC
Yes.
Comment 30 Alex Riesen 2011-02-22 10:12:59 UTC
Tomas, does returning -EINVAL for I915_PARAM_HAS_RELAXED_FENCING help in your case?

(Just replace the "break;" in the case with "return -EINVAL;").
Comment 31 Indan 2011-02-22 10:28:25 UTC
Tomas, what hardware do you have? Is it a gen 2 or 3 chipset?
If so, then perhaps we can use this patch for 2.6.38, assuming
the fence stuff actually works for someone:

diff --git a/drivers/gpu/drm/i915/i915_dma.c b/drivers/gpu/drm/i915/i915_dma.c
index 17bd766..d70edcd 100644
--- a/drivers/gpu/drm/i915/i915_dma.c
+++ b/drivers/gpu/drm/i915/i915_dma.c
@@ -763,6 +763,8 @@ static int i915_getparam(struct drm_device *dev, void *data,
 		value = HAS_BLT(dev);
 		break;
 	case I915_PARAM_HAS_RELAXED_FENCING:
+		if (INTEL_INFO(dev)->gen <= 3)
+			return -EINVAL;
 		value = 1;
 		break;
 	case I915_PARAM_HAS_COHERENT_RINGS:

But I think it's safer to remove the I915_PARAM_HAS_RELAXED_FENCING
case until it gets sorted out properly.

Alex, Tomas already reported that that patch doesn't help his case.
Comment 32 tomas m 2011-02-22 10:43:54 UTC
hi Indan,

ive got a 945GMA (gen3?)

im testing Daniel's first patch and it fixed my issues.

now im going to test this last patch you proposed here with -EINVAL to see if it makes a difference.

will report back when im done with it
Comment 33 tomas m 2011-02-22 10:51:13 UTC
and i spoke too soon. the first patch doesnt fix it either :(
Comment 34 Indan 2011-02-23 01:53:25 UTC
hi Tomas,

Daniel has a real fix at:
http://lists.freedesktop.org/archives/dri-devel/2011-February/008658.html

Can you give that a try?

Are you using X composition? If not, can you try with xcompmgr -a
and see if that improves things for you? If it does, it might be
the same bug I'm hitting. Do you get it without compiz?

Can you also try an older intel X driver, 2.12 or 2.13 or so?

Lastly, can you go back to a clean 2.6.38-rc6 kernel, test it out
with and without 6bda10d152735c22baf1dcd92937420b4b0a359a reverted
and make really sure that reverting that fixes it for you.
Comment 35 Daniel Vetter 2011-02-23 10:38:25 UTC
> --- Comment #34 from Indan <indan@nul.nu>  2011-02-23 01:53:25 ---
> Daniel has a real fix at:
> http://lists.freedesktop.org/archives/dri-devel/2011-February/008658.html

You can safely ignore this patch. It only fixes a very special corruption due
to relaxed tiling. This kind of corruption manifests itself in garbage in the
lower-left corner of pixmaps (think ui elements) if and only if the height
rounded up to the next multiple of 8 is not a multiple of 16. Your
corruptions look different.

[lower-left corner means: at most 8 pixels high, at most half the width of
the total pixmap]
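
The height condition above can be written as a short predicate. This is only a sketch of the arithmetic as stated in this comment; round_up_8 and height_is_affected are hypothetical helper names, and "rounded up to the next multiple of 8" is assumed to leave a height that is already a multiple of 8 unchanged.

```c
#include <stdbool.h>

/* Round a pixmap height up to a multiple of 8 (unchanged if already one). */
unsigned round_up_8(unsigned height)
{
    return (height + 7u) & ~7u;
}

/*
 * True if the rounded-up height is not a multiple of 16, which is the
 * condition given above for this particular relaxed-tiling corruption.
 */
bool height_is_affected(unsigned height)
{
    return round_up_8(height) % 16u != 0;
}
```

Under that assumption, heights 1 to 8 and 17 to 24 satisfy the condition, while heights 9 to 16 and 25 to 32 do not.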
Comment 36 tomas m 2011-02-24 12:53:22 UTC
Indan,

ive tested 2.6.38-rc6 and artifacts appeared after several hours of usage.

im testing 2.6.38-rc6 - 6bda10d152735c22baf1 and i havent seen artifacts yet. but as you stated, its much easier to detect a bad build than a good one, and considering the doubt Chris placed on my bisection, im going to test this more to be sure.

after this test, i will downgrade userspace and check results 

thanks for the followup on this
Comment 37 tomas m 2011-02-25 12:49:01 UTC
i did not see the artifacts with 6bda10d15 reverted.

went back to a clean rc6 with xf86-video-intel 2.13 and have not seen the artifacts there either. will keep on testing some more.

next, i will try to bisect the xorg driver but last time i tried to bisect xf86-video-intel, it was impossible (too many changes produced too many shortlived bugs).
Comment 38 Indan 2011-02-25 23:11:24 UTC
Tomas, if you can't easily trigger a bug then don't bother bisecting.

But in this case if reverting 6bda10d15 fixes it for you, then it should be a kernel bug. Chris seems to be working on it, let's wait till he comes with a fix, because reverting 6bda10d15 apparently causes other bugs.
Comment 39 tomas m 2011-02-26 04:17:09 UTC
Indan, thanks for the input.

its difficult as in: it doesnt show 100% of the time and in the same test case. but using the computer for half an hour has always shown it.

ive been extra careful to bisect this, testing working kernels days at a time.

now im actually running rc6 clean with xf86-video-intel 2.13 and it seems to work. its definitely something intertwined between the xorg driver and the kernel.

and of course. im always open to test patches, use cases, etcetera.

im sure reverting 6bda10 causes other bugs, the commit itself was introduced to revert to previous behaviour just because of that. but in the mix...something broke here and thats what i wish to find out ;)
Comment 40 tomas m 2011-02-27 12:55:52 UTC
after testing for several days. the bug has appeared.

i dont know what else to do. will revert 6bda10 again and test more exhaustively.
Comment 41 Indan 2011-02-28 00:12:47 UTC
Are you sure it's the same corruption, or is it slightly different?
It could be another bug.
Comment 42 tomas m 2011-02-28 01:03:23 UTC
the only difference was the time it took for it to appear.
Comment 43 Richard Schütz 2011-03-16 23:43:03 UTC
I noticed the graphic glitches in 2.6.38 final with my Intel GMA3150. An updated version of the patch in comment #31 seems to fix the problem for me.
Comment 44 Richard Schütz 2011-03-16 23:45:52 UTC
Created attachment 51002
updated version of the patch in comment #31 for 2.6.38 stable
Comment 45 Indan 2011-03-17 00:11:41 UTC
Daniel's patch that fixed the problem from comment 31 in upstream was reverted because it caused problems for some gen >4 hardware. It's fixed in userspace with libdrm 2.4.24 and xf86-video-intel newer than 2011-02-22 instead. This is unrelated to tomas's problem.

Richard, I think GMA3150 is newer than gen 3, so that patch shouldn't change anything for you...

Tomas, can you confirm that with 6bda10 you don't get any corruptions?
Comment 46 Daniel Vetter 2011-03-17 10:57:28 UTC
> --- Comment #45 from Indan <indan@nul.nu>  2011-03-17 00:11:41 ---
> Richard, I think GMA3150 is newer than gen 3, so that patch shouldn't change
> anything for you...

GMA3510 is a gen3 device (codename g33/pineview in the source). It's
essentially a i945 with a bigger gtt and hw virtualization support. Only
GMA devices with the X in the marketing name are gen4 and later (e.g GMA
X3100).
Comment 47 Indan 2011-03-17 11:24:51 UTC
(In reply to comment #46)
> GMA3510 is a gen3 device (codename g33/pineview in the source). It's
> essentially a i945 with a bigger gtt and hw virtualization support. Only
> GMA devices with the X in the marketing name are gen4 and later (e.g GMA
> X3100).

I grepped the source, but didn't find any mentioning of GMA, 
nor 3510, but a quick internet search showed that it was an
Atom, so I expected it would be "new".

Intel marketing is a nightmare to navigate. If you go to
http://intellinuxgraphics.org/documentation.html there's no 
mapping from the model names to the docs or chipset generation.
Closest thing I found is i915_drv.c, but that uses other
code names too. :-(
Comment 48 tomas m 2011-03-17 12:23:44 UTC
> Tomas, can you confirm that with 6bda10 you don't get any corruptions?

if you mean, with 6bda10 reverted, then yes.

ive been running rc6 rc7 rc8 and final with 6bda10 reverted and i have seen no corruption.

i did try the final release without reverting 6bda10 and the glitches did appear.
Comment 49 Damjan Georgievski 2011-03-24 10:53:40 UTC
I'm seeing glitches in Firefox, urxvt and awesome (those I use most often) too.
Distro is ArchLinux, hardware is Thinkpad x60s.

packages:
xorg-server 1.9.4.901-1
xf86-video-intel 2.14.0-3
libdrm 2.4.23-2
intel-dri 7.10.1-1

hardware:
945GM
Comment 50 tomas m 2011-03-24 13:43:23 UTC
Damjan,

have you tried reverting 6bda10 ?

could you post your results?
Comment 51 Eddy Petrișor 2011-04-03 10:50:41 UTC
Created attachment 53312
Image corruption with 2.6.32

I wanted to check if the corruption was happening with the Debian 2.6.32-5-amd64 kernel, too. Before upgrading to Debian's Squeeze release I haven't experienced these issues although I had been using 2.6.32+27~bpo.5+1. So I suspect there could be some user space code that triggers the bug.
Comment 52 Eddy Petrișor 2011-04-15 08:33:58 UTC
Is this the same bug as bug #26002 ?
Comment 53 Daniel Vetter 2011-04-15 09:04:29 UTC
On Fri, Apr 15, 2011 at 08:34:03AM +0000, bugzilla-daemon@bugzilla.kernel.org wrote:
> --- Comment #52 from Eddy Petrișor <eddy.petrisor+linbug@gmail.com> 
> 2011-04-15 08:33:58 ---
> Is this the same bug as bug #26002 ?

Maybe, maybe not. Generally with corruptions and gpu hangs it's better to
handle bugs separately until a root cause is clearly identified (with
either a fix or at least pointing to the same culprit when it's a
regression). There are simply too many ways corruptions can show up and
gpus can hang.

If you hit something like this it's easiest if you could ask us on irc
(#intel-gfx on freenode) - it's much quicker to triage such things
interactively.
-Daniel
Comment 54 Alex Riesen 2011-04-19 20:37:41 UTC
The problem still persists in 2.6.39-rc4 (it looks like it got even worse since 2.6.38). A "man wmii" (probably any long enough page) in urxvt is absolutely unusable after paging back and forth a few times. After the artifacts (all over the urxvt window) appear, the window starts blinking, roughly once in 5 seconds.

To confirm it is not wmii, I tried twm with almost the same effect: the blinking only happens once, after ~10 seconds since last attempt to change window contents.

I used a very minimal environment, just X, window manager, urxvt and man.

As before, I couldn't reproduce the situation in exactly the same way under a full desktop environment (an XFCE desktop, with and without Compiz), but there were small quirks here and there: a missing pixel, sluggish or jerky scrolling in a terminal: just do a long listing ("find /"), then start scrolling the terminal up one line at a time. After a while scrolling pauses. If I release the buttons now, the window will be updated after a while (but to something slightly shifted up or down by half a symbol or so).

And I also get these:

[drm:i915_hangcheck_elapsed] *ERROR* Hangcheck timer elapsed... GPU hung

many times in a row. Sometimes they coincide with really weird display (kind of expected).
Comment 55 Daniel Vetter 2011-04-20 06:22:11 UTC
On Tue, Apr 19, 2011 at 10:37 PM,  <bugzilla-daemon@bugzilla.kernel.org> wrote:
> [drm:i915_hangcheck_elapsed] *ERROR* Hangcheck timer elapsed... GPU hung

Your gpu hung. Yes, we're doing quite some work to make this non-fatal by
resetting the chip (and if that fails switching to sw rendering). But it's
kinda expected to get graphics corruptions when that happens. The kernel
should dump an error state in the file i915_error_state in debugfs. Can you
please rehang your gpu and attach that?
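
One way to capture that file for attaching is to copy it out while the error state is still present. The sketch below is generic C file copying, not an i915 tool. The path in ERROR_STATE_PATH is an assumption (debugfs mounted at /sys/kernel/debug and the card at DRM minor 0), and copy_stream and save_error_state are hypothetical helper names.

```c
#include <stdio.h>

/* Assumed location; adjust if debugfs is mounted elsewhere or the minor differs. */
#define ERROR_STATE_PATH "/sys/kernel/debug/dri/0/i915_error_state"

/* Copy everything from `in` to `out`; return bytes copied, or -1 on error. */
long copy_stream(FILE *in, FILE *out)
{
    char buf[4096];
    long total = 0;
    size_t n;

    while ((n = fread(buf, 1, sizeof buf, in)) > 0) {
        if (fwrite(buf, 1, n, out) != n)
            return -1;
        total += (long)n;
    }
    return ferror(in) ? -1 : total;
}

/* Copy src_path to dst_path; return bytes copied, or -1 on any failure. */
long save_error_state(const char *src_path, const char *dst_path)
{
    FILE *in, *out;
    long copied;

    in = fopen(src_path, "r");
    if (!in)
        return -1;

    out = fopen(dst_path, "w");
    if (!out) {
        fclose(in);
        return -1;
    }

    copied = copy_stream(in, out);
    if (fclose(out) != 0)
        copied = -1;
    fclose(in);
    return copied;
}
```

A call such as save_error_state(ERROR_STATE_PATH, "i915_error_state.txt") would typically need root, since debugfs is normally readable only by root.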
Comment 56 Alex Riesen 2011-04-20 08:02:30 UTC
(In reply to comment #55)
> On Tue, Apr 19, 2011 at 10:37 PM,  <bugzilla-daemon@bugzilla.kernel.org>
> wrote:
> > [drm:i915_hangcheck_elapsed] *ERROR* Hangcheck timer elapsed... GPU hung
> 
> Your gpu hung.

The case when "GPU hung" happens is VERY rare.
What about all the "scrolling in urxvt" cases?

> The kernel should dump an error state in the file i915_error_state in
> debugfs. Can you please rehang your gpu and attach that?

Will do, but as I said, such cases are rare.
Comment 57 tomas m 2011-05-07 14:17:23 UTC
just reporting i cannot reproduce this in the 2.6.39-rc series

tbh, tested since rc4 cause i could not boot before that.
Comment 58 Alex Riesen 2011-05-11 18:44:30 UTC
2.6.39-rc7: scrolling still stalls.
Comment 59 Alex Riesen 2011-05-15 19:41:12 UTC
The "scrolling stalls" bisected to a7a09aebe8c0dd2b76c7b97018a9c614ddb483a5:

    drm/i915: Rework execbuffer pinning
    
    Avoid evicting buffers that will be used later in the batch in order to
    make room for the initial buffers by pinning all bound buffers in a
    single pass before binding (and evicting for) fresh buffer.
    
    Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>

I am not able to simply revert it, the code is heavily modified and moved
into a file of its own.
Comment 60 Daniel Vetter 2012-03-25 13:42:02 UTC
To Alex Riesen and the scrolling stalls: Please file a separate bug for that, preferably on bugs.freedesktop.org.

To everyone else: This bug report got out of hand and mixes too many issues. If you still have hangs, graphics corruptions and other issues on the latest code, please file a new bug report. Thanks everyone for reporting issues.
Comment 61 Alex Riesen 2012-03-25 13:44:48 UTC
Haven't seen the scrolling bug for a good while. Will file a report on freedesktop.org if I see it again.