Bug 60530

Summary: Garbage displayed after resume from suspend to RAM
Product: Drivers Reporter: Rafael J. Wysocki (rjw)
Component: Video(DRI - Intel) Assignee: intel-gfx-bugs (intel-gfx-bugs)
Status: RESOLVED PATCH_ALREADY_AVAILABLE    
Severity: normal CC: abrouwers, bjorn.bidar, bwat47, chris, daniel, david.pretty, frankvanklaveren, h, intel-gfx-bugs, jcalvinowens, kernel.org, linuxbugs, lvml, menzinoah, mjt, ralf, stavallo, stuffcorpse
Priority: P1    
Hardware: All   
OS: Linux   
URL: http://marc.info/?t=137311876900002&r=1&w=4
Kernel Version: 3.10 Subsystem:
Regression: Yes Bisected commit-id:
Attachments: Snapshot taken with ksnapshot
Password prompt screen right after resume (camera shot)
Desktop after typing in the password
intel_reg_dumper output before suspend
intel_reg_dumper output after resume
intel_reg_dumper output without commit 19b2dbd before suspend
intel_reg_dumper output without commit 19b2dbd after resume
intel_reg_dump before hibernate
intel_reg_dumper output after hibernate, before standby
intel_reg_dumper output after hibernate and standby
test patch 1: unmap tiled objects from userspace before suspend
test patch 2: restore fence regs after gem hw init completes
reg_dumps from commit 7d13205
reg_dumps on 3.10 (commit 8bb495e)
reg_dumps for master with patch 1 applied
reg_dumps for master with patch 2 applied
correctly restore fence registers with objects attached

Description Rafael J. Wysocki 2013-07-06 21:04:40 UTC
Created attachment 106824 [details]
Snapshot taken with ksnapshot

As described in the linked message:

I've just started to play with a new Acer Aspire S5 test box and noticed that
garbage is displayed after resume from suspend to RAM with the 3.10 kernel
(under KDE 4.10.3 on openSUSE 12.3).  The display corruption goes away after
killing X and restarting it.

The CPU is a Core i5-3317U (Ivy Bridge), i915 graphics.

That doesn't happen with 3.9 (same config otherwise).

Also it turns out to happen on an SNB-based machine (not 100% of the time).
Comment 1 Rafael J. Wysocki 2013-07-06 21:28:57 UTC
Created attachment 106825 [details]
Password prompt screen right after resume (camera shot)
Comment 2 Rafael J. Wysocki 2013-07-06 21:30:58 UTC
Right after resume, the screen does not look like the file attached to the Description (that image is what ksnapshot saved).

The "lock screen" password prompt screen looks like the camera shot (scaled, due to the BZ attachment size limit) attached in comment #1.
Comment 3 Rafael J. Wysocki 2013-07-06 21:34:52 UTC
Created attachment 106826 [details]
Desktop after typing in the password

After typing in the password the password prompt goes away and the desktop looks like this (visible is a corrupted application window).
Comment 4 Rafael J. Wysocki 2013-07-06 22:37:31 UTC
So far, I haven't been able to reproduce the problem on the IVB-based machine with 3.10.0-rc7, so I'm going to revert commit 19b2dbd and retest.
Comment 5 Rafael J. Wysocki 2013-07-06 23:20:15 UTC
So far, with commit 19b2dbd reverted, I haven't been able to reproduce the problem.
Comment 6 Chris Wilson 2013-07-07 08:29:44 UTC
Easy enough to check:

Please run intel_reg_dumper before suspend and after resume.
Comment 7 Rafael J. Wysocki 2013-07-07 13:26:26 UTC
Created attachment 106828 [details]
intel_reg_dumper output before suspend
Comment 8 Rafael J. Wysocki 2013-07-07 13:27:02 UTC
Created attachment 106829 [details]
intel_reg_dumper output after resume
Comment 9 Rafael J. Wysocki 2013-07-07 13:36:05 UTC
Created attachment 106830 [details]
intel_reg_dumper output without commit 19b2dbd before suspend
Comment 10 Rafael J. Wysocki 2013-07-07 13:36:34 UTC
Created attachment 106831 [details]
intel_reg_dumper output without commit 19b2dbd after resume
Comment 11 Rafael J. Wysocki 2013-07-07 13:39:50 UTC
Without commit 19b2dbd the problem is definitely not reproducible for me (on two different machines).
Comment 12 Chris Wilson 2013-07-07 13:54:28 UTC
Hmm, I was expecting intel_reg_dumper to include the fence registers.
Comment 13 Rafael J. Wysocki 2013-07-07 20:10:38 UTC
Well, if you have a debug tool that'll give you the information you need and that I can run on 64-bit, please let me know.
Comment 14 Calvin Owens 2013-07-08 04:33:04 UTC
I've encountered the same problem on an Asus laptop, although the corruption doesn't seem to be quite as bad for me. Reverting 19b2dbd fixes the issue.

vendor_id       : GenuineIntel
cpu family      : 6
model           : 58
model name      : Intel(R) Core(TM) i7-3630QM CPU @ 2.40GHz
stepping        : 9
microcode       : 0x12
Comment 15 Jona Stubbe 2013-07-08 15:09:10 UTC
Created attachment 106842 [details]
intel_reg_dump before hibernate
Comment 16 Jona Stubbe 2013-07-08 15:10:12 UTC
Created attachment 106843 [details]
intel_reg_dumper output after hibernate, before standby
Comment 17 Jona Stubbe 2013-07-08 15:11:04 UTC
Created attachment 106844 [details]
intel_reg_dumper output after hibernate and standby
Comment 18 Jona Stubbe 2013-07-08 15:21:58 UTC
I had this issue on my Acer Aspire 5750G laptop as well.

Vendor ID:             GenuineIntel
CPU family:            6
Model:                 42
Model name:            Intel(R) Core(TM) i7-2630QM CPU @ 2.00GHz
Stepping:              7

It also happened on hibernate (suspend-to-disk).

I'm using past tense here since, ironically, after installing intel-gpu-tools (no kernel update), the corruption did not occur anymore, neither on standby nor hibernate.  If it interests you, I have attached the reg dumps. 

A remaining problem is that resume seems to break the power management so that the CPU core temperature, normally under 50° Celsius, skyrockets to over 70°C in about 2 minutes, so that I have to restart my computer in order to not endanger the hardware. (Is this a separate bug?)
Comment 19 Stefano Avallone 2013-07-08 15:46:22 UTC
I suspended to disk and then twice to RAM in a row with no problem. This is an Ivybridge desktop PC (Intel (R) Core(TM) i7-3770 CPU @ 3.40GHz) running kernel 3.10.

I am using KDE 4.11 beta 2. I read that KDE 4.11 ships some changes to the code handling suspend/resume, don't know if that matters...
Comment 20 Jona Stubbe 2013-07-09 06:47:09 UTC
The overheating issues seemed to be unrelated and appear fixed after updating the 'intel-dri' package to version 9.1.4-3 (Arch Linux). Graphical glitches are still present after resume.
Comment 21 Daniel Vetter 2013-07-09 07:59:48 UTC
So first a call to order: This bug here seems to have caught the attention of google and already gathered a few me-too reports: If you think this describes your issue please check first whether reverting the offending commit (19b2dbde5732170a03bd82cc8bd442cf88d856f7 on upstream) works around it. If that's not the case please file a new bug report. Otherwise we'll quickly have a mess of contradicting reporters ...

Now on the bug itself, that looks very much like we've lost track of fences (i.e. the screenshots are rather typical for this, not random garbage). Which is strange since that commit fixes a different case of "lost track of fences". I'll hunt down a few theories and hopefully should have a patch or so soon.
Comment 22 Daniel Vetter 2013-07-09 08:21:13 UTC
Created attachment 106848 [details]
test patch 1: unmap tiled objects from userspace before suspend
Comment 23 Daniel Vetter 2013-07-09 08:23:18 UTC
Created attachment 106849 [details]
test patch 2: restore fence regs after gem hw init completes

Please test the attached two patches, they should undo what the other patch might have changed accidentally without breaking that fix. Patch 2 has imo the higher chances to work out.
Comment 24 Daniel Vetter 2013-07-09 08:48:34 UTC
Ok, I've pushed an updated intel_reg_dumper to http://cgit.freedesktop.org/xorg/app/intel-gpu-tools/

Please grab the latest git from there and attach new reg dumps for working and broken kernels. Also please attach the contents of dri/0/i915_gem_gtt from debugfs for both cases so that we can correlate the registers.
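
For anyone comparing the dumps by hand, here is a minimal sketch of one way to list the registers that differ between two dump files. It assumes each register appears on a line shaped like "NAME: 0xVALUE" (the shape of the "FENCE START 2: 0x0214d001" excerpt quoted later in this report); real intel_reg_dumper output may carry extra decoded text, which this sketch ignores. The names parse_dump and changed_registers are hypothetical helpers, not part of intel-gpu-tools.

```python
import re
import sys

# Assumed line shape: optional indent, register name, colon, hex value.
REG_LINE = re.compile(r"^\s*(?P<name>[^:]+?)\s*:\s*(?P<value>0x[0-9a-fA-F]+)")


def parse_dump(text):
    """Map register name -> integer value for every line that matches."""
    regs = {}
    for line in text.splitlines():
        m = REG_LINE.match(line)
        if m:
            regs[m.group("name")] = int(m.group("value"), 16)
    return regs


def changed_registers(before, after):
    """Return {name: (old, new)} for registers whose value differs.

    A register missing from one dump is reported with None on that side.
    """
    old, new = parse_dump(before), parse_dump(after)
    return {
        name: (old.get(name), new.get(name))
        for name in sorted(set(old) | set(new))
        if old.get(name) != new.get(name)
    }


def fmt(value):
    """Render a register value, or 'absent' when it was not in the dump."""
    return "absent" if value is None else f"0x{value:08x}"


if __name__ == "__main__":
    # Usage: python regdiff.py before.txt after.txt  (file names are examples)
    with open(sys.argv[1]) as f_before, open(sys.argv[2]) as f_after:
        diff = changed_registers(f_before.read(), f_after.read())
    for name, (old, new) in diff.items():
        print(f"{name}: {fmt(old)} -> {fmt(new)}")
```

Running it on the before/after pair from a working kernel and again on the pair from a broken kernel should make it easier to spot which registers only change in the broken case.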
Comment 25 Jona Stubbe 2013-07-09 18:20:13 UTC
Created attachment 106850 [details]
reg_dumps from commit 7d13205

taken with the git version of intel-gpu-tools (installed via 'yaourt -S intel-gpu-tools-git' on my arch laptop)
Comment 26 Jona Stubbe 2013-07-09 18:22:01 UTC
The power management problems seem to persist after downgrading to 7d1320; I'm looking for that issue on the bugtracker separately now.
Comment 27 Jona Stubbe 2013-07-09 19:07:32 UTC
Created attachment 106851 [details]
reg_dumps on 3.10 (commit 8bb495e)

reg_dumps from commit 8bb495e (with the regression). Taken with the same version of intel-gpu-tools as the dumps from commit 7d13205.
Ignore what I said earlier about the temperatures, I highly suspect the crappy cooling in this laptop is at fault.
Comment 28 Jona Stubbe 2013-07-09 19:38:37 UTC
After applying the first test patch on master I get this build error:

drivers/gpu/drm/i915/i915_gem.c: In function 'i915_gem_reset_fences':
drivers/gpu/drm/i915/i915_gem.c:2149:26: error: 'obj' undeclared (first use in this function)
    i915_gem_release_mmap(obj);
                          ^
drivers/gpu/drm/i915/i915_gem.c:2149:26: note: each undeclared identifier is reported only once for each function it appears in

Or was I supposed to apply the two patches together?
Comment 29 Jona Stubbe 2013-07-09 20:24:08 UTC
Created attachment 106852 [details]
reg_dumps for master with patch 1 applied

I took a look at the first patch and concluded that there was supposed to be 'reg->obj' instead of 'obj'. It compiled well but the display corruption was still there.
Comment 30 Jona Stubbe 2013-07-09 20:51:38 UTC
Created attachment 106853 [details]
reg_dumps for master with patch 2 applied

Patch 2 compiled correctly, but did not fix the bug for me. (returning to 3.10-rc6 for now)
Comment 31 Daniel Vetter 2013-07-11 07:37:43 UTC
At least on my snb here I can only reproduce garbage when enabling the UXA xf86-video-intel backend, not with SNA. Can everyone who sees this please check in Xorg.log which backend is in use (just grep for UXA|SNA)?
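
A small sketch of that check, for those who prefer a script over grep. It assumes the driver logs a line like the one quoted in comment #40 ("(II) intel(0): SNA initialized with IvyBridge backend"); other wordings are not covered. The default log path and the function name find_backend are assumptions for illustration.

```python
import re
import sys

# Looks for "UXA" or "SNA" as a whole word on a line logged by the intel driver.
BACKEND = re.compile(r"intel\(\d+\):.*\b(UXA|SNA)\b")


def find_backend(log_text):
    """Return "UXA" or "SNA" from the first matching log line, else None."""
    for line in log_text.splitlines():
        m = BACKEND.search(line)
        if m:
            return m.group(1)
    return None


if __name__ == "__main__":
    # Assumed default location; pass another path as the first argument.
    path = sys.argv[1] if len(sys.argv) > 1 else "/var/log/Xorg.0.log"
    with open(path, errors="replace") as log:
        print(find_backend(log.read()))
```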
Comment 32 Jona Stubbe 2013-07-11 11:32:41 UTC
I used UXA after I got graphical glitches from SNA in Mozilla Firefox (window contents of other applications appearing in images) but using SNA seems to fix this resume issue, even for the latest mainline release.
Comment 33 Chris Wilson 2013-07-11 11:39:50 UTC
(In reply to Jona Stubbe from comment #32)
> I used UXA after I got graphical glitches from SNA in Mozilla Firefox
> (window contents of other applications appearing in images).

For the record, they are fixed by
commit daa13e1ca587bc773c1aae415ed1af6554117bd4
Author: Chris Wilson <chris@chris-wilson.co.uk>
Date:   Fri Jun 28 16:54:08 2013 +0100

    drm/i915: Only clear write-domains after a successful wait-seqno
Comment 34 Daniel Vetter 2013-07-11 11:53:30 UTC
(In reply to Chris Wilson from comment #33)
> (In reply to Jona Stubbe from comment #32)
> > I used UXA after I got graphical glitches from SNA in Mozilla Firefox
> > (window contents of other applications appearing in images).
> 
> For the record, they are fixed by
> commit daa13e1ca587bc773c1aae415ed1af6554117bd4
> Author: Chris Wilson <chris@chris-wilson.co.uk>
> Date:   Fri Jun 28 16:54:08 2013 +0100
> 
>     drm/i915: Only clear write-domains after a successful wait-seqno

Note that this commit is currently only available in linux-next or drm-intel-nightly. I'll send the pull to Dave for it shortly, but I guess it'd be useful if people could retest whether this fixes the issue - the regression fixed by this patch is a rather old one.
Comment 35 Stefano Avallone 2013-07-11 12:17:21 UTC
I am the one who reported not to be affected by this resume issue (comment 19), though I am on Ivybridge. I am indeed using the SNA backend.
Comment 36 Chris Wilson 2013-07-11 12:27:50 UTC
That's actually reassuring since trying to ascribe this to a difference in UXA vs SNA across resume was worrisome. I guess the reason why Daniel found it easier to reproduce with UXA is that UXA tends to use fences a lot more than SNA.
Comment 37 Frank van Klaveren 2013-07-11 12:29:09 UTC
I am using OpenELEC (http://openelec.tv/news/22-releases/99-testing-openelec-3-1-2-released) with the 3.10 kernel (Mesa-9.1.4, xf86-video-intel-2.21.11) and I might be affected by this too (i3-3220 Ivy Bridge). 

After resume from suspend, XBMC is showing a lot of distortions on the screen:

http://i.imgur.com/ktxJ9uG.jpg
http://i.imgur.com/4JrRGVt.jpg

The issue is reported to OpenELEC here: https://github.com/OpenELEC/OpenELEC.tv/issues/2453

UXA / SNA both have the problem.
Comment 38 Rafael J. Wysocki 2013-07-11 22:52:41 UTC
On Thursday, July 11, 2013 07:37:43 AM bugzilla-daemon@bugzilla.kernel.org wrote:

> --- Comment #31 from Daniel Vetter <daniel@ffwll.ch> ---
> At least on my snb here I can only reproduce garbage when enabling the UXA
> xf86-video-intel backend, not with SNA. Can everyone who sees this please
> check in Xorg.log which backend is in use (just grep for UXA|SNA)?

UXA for me.
Comment 39 Brandon Watkins 2013-07-14 18:22:52 UTC
I can reproduce this quite easily with the SNA backend. Intel HD 4000, using the latest 3.10.1 kernel on Arch Linux. On almost every resume, gnome-shell becomes totally unusable, covered in graphical garbage and artifacts.
Comment 40 Calvin Owens 2013-07-16 16:34:25 UTC
(In reply to Daniel Vetter from comment #31)
> At least on my snb here I can only reproduce garbage when enabling the UXA
> xf86-video-intel backend, not with SNA. Can everyone who sees this please
> check in Xorg.log which backend is in use (just grep for UXA|SNA)?

On my laptop:
[    30.592] (II) intel(0): SNA initialized with IvyBridge backend
Comment 41 Daniel Vetter 2013-07-17 14:12:14 UTC
Created attachment 106910 [details]
correctly restore fence registers with objects attached

Please test the attached patch and report whether it works or whether I need to go back to banging my head against the wall.
Comment 42 Björn Bidar 2013-07-18 21:47:15 UTC
I tested it a bit and it fixes the bug for me (Thinkpad Edge E530 (Intel 3rd gen), 3.10.1 with pf patchset)
Comment 43 Daniel Vetter 2013-07-18 22:09:58 UTC
Ok, I think this is the real fix (since I've gotten other confirmations on irc, too). Patch merged to drm-intel-fixes, will get forwarded soon and then trickle back to stable trees:

commit 94a335dba34ff47cad3d6d0c29b452d43a1be3c8
Author: Daniel Vetter <daniel.vetter@ffwll.ch>
Date:   Wed Jul 17 14:51:28 2013 +0200

    drm/i915: correctly restore fences with objects attached

Thanks everyone for reporting this issue and testing stuff.
Comment 44 Lutz Vieweg 2013-07-22 21:53:37 UTC
*** Bug 60598 has been marked as a duplicate of this bug. ***
Comment 45 abrouwers 2013-07-29 11:18:11 UTC
(In reply to Daniel Vetter from comment #43)
> Ok, I think this is the real fix (since I've gotten other confirmations on
> irc, too). Patch merged to drm-intel-fixes, will get forwarded soon and then
> trickle back to stable trees:
> 
> commit 94a335dba34ff47cad3d6d0c29b452d43a1be3c8
> Author: Daniel Vetter <daniel.vetter@ffwll.ch>
> Date:   Wed Jul 17 14:51:28 2013 +0200
> 
>     drm/i915: correctly restore fences with objects attached
> 
> Thanks everyone for reporting this issue and testing stuff.

Is there anything that needs to be done in order for this to be backported?  It seems that 3.10.3 and 3.10.4 still do not contain the fix - it would be really awesome to not have to restart gnome-shell after every suspend cycle :-)
Comment 46 Björn Bidar 2013-07-29 12:06:23 UTC
There's no backport required to apply this to 3.10.1+, I already use it with this version (+ pf patchset) and have no issues.
Comment 47 abrouwers 2013-07-29 12:23:35 UTC
Sorry, I meant actually getting the fix into the mainline 3.10.x series.  At least 3.10.2, .3, and .4 do not contain the fix.
Comment 48 Michael Tokarev 2013-08-03 07:49:55 UTC
I too just came across this issue and this patch, on the 3.10.4 kernel.  The issue is easily triggered by suspending while (e.g.) glxgears is running; 3 reboots in a row confirmed this (each time I ran glxgears, suspended and resumed, and each time the display was garbled).  With the proposed patch I can't trigger the issue anymore, having tried multiple suspend/resume cycles.  I took commit 94a335dba34ff47cad3d6d0c29b452d43a1be3c8 from Linus' tree and applied it to 3.10.4.
Comment 49 Brandon Watkins 2013-08-04 18:34:28 UTC
What upstream kernel version will the patch be included in? Is it fixed in the new 3.10.5?
Comment 50 Daniel Vetter 2013-08-04 20:34:06 UTC
3.10.5 should have the fix:

commit 19a280cac37e30243023a7f53651504a135ac960
Author: Daniel Vetter <daniel.vetter@ffwll.ch>
Date:   Wed Jul 17 14:51:28 2013 +0200

    drm/i915: correctly restore fences with objects attached
    
    commit 94a335dba34ff47cad3d6d0c29b452d43a1be3c8 upstream.
    
    To avoid stalls we delay tiling changes and especially hold of
    committing the new fence state for as long as possible.
    Synchronization points are in the execbuf code and in our gtt fault
    handler.
    
    Unfortunately we've missed that tricky detail when adding proper fence
    restore code in
    
    commit 19b2dbde5732170a03bd82cc8bd442cf88d856f7
    Author: Chris Wilson <chris@chris-wilson.co.uk>
    Date:   Wed Jun 12 10:15:12 2013 +0100
    
        drm/i915: Restore fences after resume and GPU resets
    
    The result was that we've restored fences for objects with no tiling,
    since the object<->fence link still existed after resume. Now that
    wouldn't have been too bad since any subsequent access would have
    fixed things up, but if we've changed from tiled to untiled real havoc
    happened:
    
    The tiling stride is stored -1 in the fence register, so a stride of 0
    resulted in all 1s in the top 32bits, and so a completely bogus fence
    spanning everything from the start of the object to the top of the
    GTT. The tell-tale in the register dumps looks like:
    
                     FENCE START 2: 0x0214d001
                     FENCE END 2: 0xfffff3ff
    
    Bit 11 isn't set since the hw doesn't store it, even when writing all
    1s (at least on my snb here).
    
    To prevent such a gaffle in the future add a sanity check for fences
    with an untiled object attached in i915_gem_write_fence.
    
    v2: Fix the WARN, spotted by Chris.
    
    v3: Trying to reuse get_fences looked ugly and obfuscated the code.
    Instead reuse update_fence and to make it really dtrt also move the
    fence dirty state clearing into update_fence.
    
    Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=60530
    Cc: Chris Wilson <chris@chris-wilson.co.uk>
    Cc: Stéphane Marchesin <marcheu@chromium.org>
    Reviewed-by: Chris Wilson <chris@chris-wilson.co.uk>
    Tested-by: Matthew Garrett <matthew.garrett@nebula.com>
    Tested-by: Björn Bidar <theodorstormgrade@gmail.com>
    Signed-off-by: Daniel Vetter <daniel.vetter@ffwll.ch>
    Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
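
To illustrate the "stride stored -1" wrap-around described in the commit message, here is a rough sketch in Python. It is a simplified model, not the driver code: the 128-byte pitch unit, the shift of 32, and bit 0 as the valid bit are assumptions made for illustration, and the end-address part of the upper half is left out. The start address 0x0214d000 is taken from the FENCE START value in the dump excerpt with the low bit cleared.

```python
MASK32 = 0xFFFFFFFF
PITCH_SHIFT = 32   # assumption: the pitch field sits in the upper 32 bits
PITCH_UNIT = 128   # assumption: the stride is counted in 128-byte units


def pitch_field(stride):
    """Encode the stride as (units - 1), wrapping like a 32-bit unsigned int."""
    return (stride // PITCH_UNIT - 1) & MASK32


def fence_value(start, stride):
    """Build a simplified 64-bit fence value.

    Lower half: page-aligned start address plus an assumed valid bit (bit 0).
    Upper half: only the pitch field (the real register holds more).
    """
    low = (start & 0xFFFFF000) | 1
    return (pitch_field(stride) << PITCH_SHIFT) | low


if __name__ == "__main__":
    # A tiled object with a 512-byte stride encodes a small pitch value.
    print(hex(pitch_field(512)))             # 0x3
    # An untiled object has stride 0, so 0 - 1 wraps to all ones.
    print(hex(pitch_field(0)))               # 0xffffffff
    print(hex(fence_value(0x0214D000, 0)))   # 0xffffffff0214d001
```

In this model the lower half matches the FENCE START 2 value from the dump, and the upper half comes out as all 1s, which is consistent with the bogus FENCE END value apart from the bits the hardware does not store.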