Bug 52661

Summary: GPU hang when transferring large files
Product: Drivers
Reporter: Sudaraka Wijesinghe (sudaraka)
Component: Video(DRI - Intel)
Assignee: intel-gfx-bugs (intel-gfx-bugs)
Status: RESOLVED CODE_FIX
Severity: normal
CC: daniel, intel-gfx-bugs
Priority: P1
Hardware: All
OS: Linux
Kernel Version: 3.7
Subsystem:
Regression: Yes
Bisected commit-id:
Attachments:
  lspci output
  journalctl output from the HP laptop that completely freezes
  /proc/cpuinfo
  /proc/meminfo
  Kernel configuration
  i915_error_state from 3.7.2 kernel after GPU hang

Description Sudaraka Wijesinghe 2013-01-13 22:44:39 UTC
Created attachment 91221
lspci output

Every time a large file (e.g. 2 GB) is transferred from one place to another, the GPU hangs with the following error messages.

-- from: journalctl -f --------------------------------------------------------
Jan 13 20:18:11 sw-main kernel: [drm:i915_hangcheck_hung] *ERROR* Hangcheck timer elapsed... GPU hung
Jan 13 20:18:12 sw-main kernel: [drm:i915_hangcheck_hung] *ERROR* Hangcheck timer elapsed... GPU hung
Jan 13 20:18:12 sw-main kernel: [drm:i915_reset] *ERROR* GPU hanging too fast, declaring wedged!
Jan 13 20:18:12 sw-main kernel: [drm:i915_reset] *ERROR* Failed to reset chip.
-------------------------------------------------------------------------------
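A small sketch for spotting this failure mode in a saved kernel log. It only matches the three i915 messages quoted above; the function name `summarize_i915_hangs` and the summary fields are hypothetical, and other kernel versions may word these messages differently.

```python
import re
from dataclasses import dataclass

# Message patterns copied from the journalctl excerpt in this report.
# Other kernel versions may word these messages differently.
HANG_RE = re.compile(r"\[drm:i915_hangcheck_hung\] \*ERROR\* Hangcheck timer elapsed\.\.\. GPU hung")
WEDGED_RE = re.compile(r"\[drm:i915_reset\] \*ERROR\* GPU hanging too fast, declaring wedged!")
RESET_FAILED_RE = re.compile(r"\[drm:i915_reset\] \*ERROR\* Failed to reset chip\.")


@dataclass
class HangSummary:
    hangs: int = 0              # number of hangcheck timeouts seen
    wedged: bool = False        # driver gave up and declared the GPU wedged
    reset_failed: bool = False  # the chip reset itself failed


def summarize_i915_hangs(lines):
    """Scan kernel log lines (any iterable of str) for i915 hang/reset errors."""
    summary = HangSummary()
    for line in lines:
        if HANG_RE.search(line):
            summary.hangs += 1
        elif WEDGED_RE.search(line):
            summary.wedged = True
        elif RESET_FAILED_RE.search(line):
            summary.reset_failed = True
    return summary
```

For example, running `summarize_i915_hangs` over the four lines above reports two hangcheck timeouts, a wedged GPU and a failed reset. A file saved with `journalctl -k > kernel.log` can be passed directly as `open("kernel.log")`.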

Hardware info for this machine (Acer Aspire 4738) is attached. On another machine (an HP laptop), the system freezes entirely; its error log is attached (hp-error.txt).
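For anyone trying to reproduce this, a minimal sketch of the trigger described above: write a large file, then copy it. The helper names are hypothetical, and the 2 GiB default is an assumption (the report only says "2GB"). The data is written explicitly in chunks rather than created by seeking, so the file is not sparse and every byte is actually written and copied.

```python
import os
import shutil

CHUNK = 1 << 20  # write in 1 MiB pieces to keep memory use small


def write_test_file(path, size_bytes):
    """Create a file of size_bytes zero bytes, written explicitly (not sparse)."""
    block = b"\0" * CHUNK
    remaining = size_bytes
    with open(path, "wb") as f:
        while remaining > 0:
            n = min(CHUNK, remaining)
            f.write(block[:n])
            remaining -= n


def copy_large_file(src, dst, size_bytes=2 * 1024**3):
    """Write a size_bytes test file at src, copy it to dst, return dst's size."""
    write_test_file(src, size_bytes)
    shutil.copyfile(src, dst)
    return os.path.getsize(dst)
```

Running `journalctl -k -f` in another terminal while the copy runs shows whether the hangcheck messages appear.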


-- bisect result --------------------------------------------------------------
504c7267a1e84b157cbd7e9c1b805e1bc0c2c846 is the first bad commit
commit 504c7267a1e84b157cbd7e9c1b805e1bc0c2c846
Author: Chris Wilson <chris@chris-wilson.co.uk>
Date:   Thu Aug 23 13:12:52 2012 +0100

    drm/i915: Use cpu relocations if the object is in the GTT but not mappable
    
    This prevents the case of unbinding the object in order to process the
    relocations through the GTT and then rebinding it only to then proceed
    to use cpu relocations as the object is now in the CPU write domain. By
    choosing to use cpu relocations up front, we can therefore avoid the
    rebind penalty.
    
    Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
    Signed-off-by: Daniel Vetter <daniel.vetter@ffwll.ch>

:040000 040000 090ed3d52b4f3210b988877f747b6ff86e123385 1d48be89ded4777a543b693db833de64877059c4 M	drivers
-------------------------------------------------------------------------------
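The bisected commit changes how the driver chooses between two ways of writing relocations. The following is a simplified Python model of that decision only, not the kernel code: `GemObject`, its fields and `choose_reloc_path` are hypothetical names, and the real check in drivers/gpu/drm/i915 considers additional object state. It illustrates the rule from the commit title: an object that is already bound in the GTT but lies outside the mappable range is relocated through the CPU up front, instead of being unbound and rebound just to process relocations through the GTT.

```python
from dataclasses import dataclass
from enum import Enum


class RelocPath(Enum):
    CPU = "cpu"  # write relocation entries through a CPU mapping of the pages
    GTT = "gtt"  # write relocation entries through the mappable GTT aperture


@dataclass
class GemObject:
    """Hypothetical, heavily simplified stand-in for a GEM buffer object."""
    in_cpu_write_domain: bool  # CPU currently owns writes to the object
    bound_in_gtt: bool         # object already has a GTT binding
    mappable: bool             # that binding is CPU-visible (in the aperture)


def choose_reloc_path(obj: GemObject) -> RelocPath:
    # Objects in the CPU write domain already used CPU relocations
    # before this commit.
    if obj.in_cpu_write_domain:
        return RelocPath.CPU
    # Rule added by the bisected commit: bound in the GTT but not mappable
    # means CPU relocations up front, avoiding the unbind/rebind penalty.
    if obj.bound_in_gtt and not obj.mappable:
        return RelocPath.CPU
    return RelocPath.GTT
```

In this model, only the second `if` is new; everything else behaves as before, which is why the commit is described as avoiding a rebind rather than changing the relocation format.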
Comment 1 Sudaraka Wijesinghe 2013-01-13 22:46:29 UTC
Created attachment 91231
journalctl output from the HP laptop that completely freezes
Comment 2 Sudaraka Wijesinghe 2013-01-13 22:47:11 UTC
Created attachment 91241
/proc/cpuinfo
Comment 3 Sudaraka Wijesinghe 2013-01-13 22:47:31 UTC
Created attachment 91251
/proc/meminfo
Comment 4 Sudaraka Wijesinghe 2013-01-13 22:48:08 UTC
Created attachment 91261
Kernel configuration
Comment 5 Daniel Vetter 2013-01-14 13:28:21 UTC
Can you please rehang your machine and attach the i915_error_state file from debugfs? Also, please test this on the latest upstream git branch from Linus' tree. Make sure that you have the following commit:

commit 93927ca52a55c23e0a6a305e7e9082e8411ac9fa
Author: Daniel Vetter <daniel.vetter@ffwll.ch>
Date:   Thu Jan 10 18:03:00 2013 +0100

    drm/i915: Revert shrinker changes from "Track unbound pages"
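Both requests in this comment can be scripted. The sketch below assumes debugfs is mounted at /sys/kernel/debug and that the Intel GPU is DRM minor 0 (so the file is /sys/kernel/debug/dri/0/i915_error_state); reading it normally requires root. The helper names are hypothetical. The commit check uses `git merge-base --is-ancestor`, which exits with 0 when the commit is contained in HEAD and 1 when it is not.

```python
import subprocess
from pathlib import Path

# Assumption: debugfs mounted at /sys/kernel/debug, Intel GPU is DRM minor 0.
DEFAULT_ERROR_STATE = Path("/sys/kernel/debug/dri/0/i915_error_state")

# Commit Daniel asks testers to have (quoted in the comment above).
SHRINKER_REVERT = "93927ca52a55c23e0a6a305e7e9082e8411ac9fa"


def save_error_state(dest, src=DEFAULT_ERROR_STATE):
    """Copy the i915 error state to dest and return the number of bytes saved.

    debugfs files usually report a size of 0, so read the content until EOF
    instead of relying on the reported file size.
    """
    data = Path(src).read_bytes()
    Path(dest).write_bytes(data)
    return len(data)


def tree_has_commit(repo, commit=SHRINKER_REVERT):
    """Return True if `commit` is an ancestor of HEAD in the git tree at repo."""
    result = subprocess.run(
        ["git", "-C", str(repo), "merge-base", "--is-ancestor", commit, "HEAD"],
        capture_output=True,
    )
    if result.returncode not in (0, 1):  # e.g. unknown commit or not a repo
        raise RuntimeError(result.stderr.decode(errors="replace").strip())
    return result.returncode == 0
```

Typical use: call `save_error_state("i915_error_state.txt")` as root right after a hang, and `tree_has_commit("linux")` in a checkout of Linus' tree before building the test kernel.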
Comment 6 Sudaraka Wijesinghe 2013-01-14 16:24:29 UTC
Created attachment 91321
i915_error_state from 3.7.2 kernel after GPU hang
Comment 7 Sudaraka Wijesinghe 2013-01-14 16:25:07 UTC
On 01/14/13 18:58, bugzilla-daemon@bugzilla.kernel.org wrote:
> --- Comment #5 from Daniel Vetter <daniel@ffwll.ch>  2013-01-14 13:28:21 ---
> Can you please rehang your machine and attach the i915_error_state file from
> debugfs? Also, please test this on latest upstream git branch from Linus'
> tree.

The problem doesn't seem to occur in Linus' tree (3.8.0-rc3), so I guess
it's already fixed.

I will attach the i915_error_state from 3.7.2 anyway, in case you need it.

You may close this if it is no longer necessary; if there is something to
be looked into, I'm glad to help with testing.

Thanks.
Comment 8 Daniel Vetter 2013-01-14 16:42:46 UTC
Yeah, we've (re)applied a workaround for this one, but it has already been reported by tons of people. See https://bugs.freedesktop.org/show_bug.cgi?id=55984