Bug 49571 - [drm:i915_hangcheck_hung] *ERROR* Hangcheck timer elapsed... GPU hung
Summary: [drm:i915_hangcheck_hung] *ERROR* Hangcheck timer elapsed... GPU hung
Status: RESOLVED CODE_FIX
Alias: None
Product: Drivers
Classification: Unclassified
Component: Video(DRI - Intel) (show other bugs)
Hardware: All Linux
: P1 high
Assignee: intel-gfx-bugs@lists.freedesktop.org
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2012-10-26 04:12 UTC by Justin P. Mattock
Modified: 2013-01-16 14:19 UTC (History)
5 users (show)

See Also:
Kernel Version: 3.7.0-rc2-00008-g0390c88
Subsystem:
Regression: Yes
Bisected commit-id:


Attachments
dmesg of crash (66.21 KB, application/octet-stream)
2012-10-26 04:17 UTC, Justin P. Mattock
Details
Full bisect log (2.88 KB, text/x-log)
2012-12-08 16:49 UTC, Brad Jackson
Details
bisect_log_i915_hangcheck_timer_elapsed (3.40 KB, text/plain)
2012-12-11 03:59 UTC, Justin P. Mattock
Details

Description Justin P. Mattock 2012-10-26 04:12:08 UTC
this bug is with intel i915. system becomes _very_ sluggish once this fires off.

the original thread is here..:
https://lkml.org/lkml/2012/10/23/498

attached is 
intel_error_decode which has intel_error_state
and dmesg of the last crash. I will start a bisect to see if I get anywhere on this(might be a few weeks since this is a slow trigger)will post if I get anywhere.
Comment 1 Justin P. Mattock 2012-10-26 04:17:09 UTC
Created attachment 84921 [details]
dmesg of crash

dmesg of crash.
Comment 2 Justin P. Mattock 2012-10-26 04:36:43 UTC
such a pain.. 3.4 mbs is to big for bugzilla
here is the link to a filesharing site with intel_error_decode..:

http://www.fileserve.com/file/FYdxvFC/intel_error_decode

just scroll to the middle of the page, wait for the timer, do the captcha then download. ignore all the other crap(since _all_ these sites do this crap with popups etc..)
Comment 3 Daniel Vetter 2012-10-26 08:04:00 UTC
Ok, I'll guess we'll move it back to the mailing list again. Btw, I've asked you to report the bug on bugs.freedesktop.org since that bugzilla does not have a stupid size limitation for attachments and happily handles an i915_error_state ;-)
Comment 4 Justin P. Mattock 2012-11-15 01:32:09 UTC
just for the record please dont close this.. I am still bisecting this. about half way through the bisect. hopefully I make sure I get the proper commits down(good/bad) so I dont go astray with the bisect.

from looking at it.. its between:

3.6.0-00722-gbd0d104 and 3.6.0-06942-g11126c6

I have about 200 revisions to test.. on this before I can get an idea.
Comment 5 Justin P. Mattock 2012-12-03 01:30:45 UTC
okay.. still bisecting a few more commits, but not sure if I need to do so, given that the rest of the commits are with kernel-doc and fat.
image is here through gitk:
http://i1341.photobucket.com/albums/o745/justinmattock/bisect.png

the bisect log is here: (3 commits left)

http://fpaste.org/JBuh/

from now I should be running a good kernel(according to the bisect):
commit 65478428 but might go and finish the bisect early so I can start running commit 66b8ef677 (revert)

will post what I find out..
Comment 6 Justin P. Mattock 2012-12-03 22:16:02 UTC
seems there might already be a fix upstream for this..:
5b7aa7d5b

will run the bisect for a few more days, then can pull to the Mainline to test this to see if I crash and burn. seems after three days I get the crash, so whatever (parent/child) process is firing off is getting denied by this commit(or old commit now)
Comment 7 Brad Jackson 2012-12-04 20:24:52 UTC
My Dell Latitude E6500 has this problem. I can test any patches or maybe do a bisect if needed.
Comment 8 Justin P. Mattock 2012-12-05 04:44:02 UTC
what kernel version are you on? 
over here the bisect is pointing to: 

[66b8ef67756b3051bf42a077a82c3c5c279caa5b] device_cgroup: add "deny_all" in dev_cgroup structure

keep in mind looking through the history of device_cgroup.c I am noticing lots of changes with deny_all commit so the current Mainline could already be fixed.

with that in mind will continue bisecting.. once finished will pull the current Mainline to see if I hit this, then will go from there. 
NOTE:during the bisect I am getting consistent crash at around 3 days of being up(uptime) but other machines could be different
Comment 9 Brad Jackson 2012-12-05 11:47:43 UTC
This has been happening on every 3.7-RC I've tried. It occurs within minutes of booting up and I have to give up and reboot into 3.6.x for a usable desktop.

The main activity on fixing this bug appears to be here. I do use the rc6 boot flag and haven't tried disabling it.

https://bugs.freedesktop.org/show_bug.cgi?id=55984
Comment 10 Justin P. Mattock 2012-12-05 16:24:34 UTC
okay so doing a bisect on your part should not take so long since you can easily hit this in my case was 3 days before firing off, causing the bisect to last over a month now to get proper results(hopefully). if you can.. try the bisect to see if you get the same results since its easily reproduceable.

right now im testing a commit and wont reboot unless I hit the bug, or after 5 days of running this revision(only way to really test this from here) then if good I will pull to the Mainline and run that to see if all the device_cgroup fixes are fixing this, if not, then I revert the cgroup stuff on the Mainline to verify its _really_ the cause of this crash..
Comment 11 Brad Jackson 2012-12-05 21:11:59 UTC
Tested 3.7.0-RC8 with device_cgroup.c changes reverted back to your bisect revision. I've been unable to reproduce the crash so far. I will post again if it still does it.
Comment 12 Justin P. Mattock 2012-12-05 21:25:32 UTC
yeah this is so we can verify that the cgroup deny_all policy is blocking a process causing the GPU to crap out.
kind of like SELinux, having one AVC's denial can keep X11 from starting when in enforcing mode, but in this case(reading through the code) its between parent/child processes mostly the child process is being denied. Im guessing probably the reason for different times on different machines for when this crash occurs. but keep in mind I dont know much about cgroup to really give a definitive answer. main thing now is finding the culprit(commit) so atleast we can get a handle on this bug

good sign over here:
uptime:
13:23pm  up 3 days 14:24,  4 users,  load average: 0.20, 0.14, 0.14

the longest on a bad kernel was 3 days 04:00

thanks for testing/helping on this comment #11
Comment 13 Brad Jackson 2012-12-05 21:42:00 UTC
Sorry, bad news. My desktop just froze briefly and logged the error. Took almost 90 minutes this time.
Comment 14 Justin P. Mattock 2012-12-05 21:51:31 UTC
and this is before the cgroup stuff? 

if so then my bisect is astray.. maybe you can bisect this since you have a more stable crash than I on this?
Comment 16 Justin P. Mattock 2012-12-05 22:44:17 UTC
okay.. so you are on a kernel before commit: 66b8ef6775 

over here I am on commit: 65478428443 in of which has not crashed the kernel. I am going ahead with my plan to finish this bisect, then will pull to the current to verify the Mainline. I can do another bisect, but not until I finish with these other tests of verifying that the Mainline is fixed or not. 

as for you, bisecting would be great since you can easily reproduce this.
Comment 17 Brad Jackson 2012-12-07 21:33:56 UTC
This is where I am on my bisect. Might have it finished tonight or tomorrow.

Bisecting: 14 revisions left to test after this (roughly 4 steps)
[0327d6ba998ca181013a5a1709701a6532a41972] drm/i915: Extract general object init routine

git bisect start
# bad: [a0d271cbfed1dd50278c6b06bead3d00ba0a88f9] Linux 3.6
git bisect bad a0d271cbfed1dd50278c6b06bead3d00ba0a88f9
# good: [a0d271cbfed1dd50278c6b06bead3d00ba0a88f9] Linux 3.6
git bisect good a0d271cbfed1dd50278c6b06bead3d00ba0a88f9
# good: [a0d271cbfed1dd50278c6b06bead3d00ba0a88f9] Linux 3.6
git bisect good a0d271cbfed1dd50278c6b06bead3d00ba0a88f9
# bad: [ddffeb8c4d0331609ef2581d84de4d763607bd37] Linux 3.7-rc1
git bisect bad ddffeb8c4d0331609ef2581d84de4d763607bd37
# good: [24d7b40a60cf19008334bcbcbd98da374d4d9c64] ARM: OMAP2+: PM: MPU DVFS: use generic CPU device for MPU-SS
git bisect good 24d7b40a60cf19008334bcbcbd98da374d4d9c64
# bad: [21c8715f0a1f4df8bfa2bd6f3915e5e33c1c2e6e] ARM: integrator: use __iomem pointers for MMIO, part 2
git bisect bad 21c8715f0a1f4df8bfa2bd6f3915e5e33c1c2e6e
# bad: [33e2a4227ddff7c18921ac175fae3ab0e3ff8a76] lib/decompress.c add __init to decompress_method and data
git bisect bad 33e2a4227ddff7c18921ac175fae3ab0e3ff8a76
# good: [3a494318b14b1bc0f59d2d6ce84c505c74d82d2a] Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace
git bisect good 3a494318b14b1bc0f59d2d6ce84c505c74d82d2a
# bad: [268d28371cd326be4dfcd7eba5917bf4b9d30c8f] Merge branch 'drm-nouveau-next' of git://anongit.freedesktop.org/git/nouveau/linux-2.6 into drm-next
git bisect bad 268d28371cd326be4dfcd7eba5917bf4b9d30c8f
# skip: [8c0bd3c02d52eff11396e81b4d217ee668e03528] drm/i915: placeholder getparam
git bisect skip 8c0bd3c02d52eff11396e81b4d217ee668e03528
# bad: [37e680a15ff6375ff02773161f817e90a38c51f7] drm/i915: Introduce drm_i915_gem_object_ops
git bisect bad 37e680a15ff6375ff02773161f817e90a38c51f7
# good: [1aa4b628ee63f55db96c7e820257b6e4948abb74] drm/i915: don't save all the encoder/crtc state in set_config
git bisect good 1aa4b628ee63f55db96c7e820257b6e4948abb74
# bad: [4772eaebcdf86dd65630339dbe58316b90f80aed] drm/i915: add gen7_render_ring_flush
git bisect bad 4772eaebcdf86dd65630339dbe58316b90f80aed
# good: [ce5fdd63752390b89804b1e246de267b398d1515] gma500: fix incorrect declaration.
git bisect good ce5fdd63752390b89804b1e246de267b398d1515
Comment 18 Brad Jackson 2012-12-07 22:01:58 UTC
One more.

Bisecting: 6 revisions left to test after this (roughly 3 steps)
[de9932d13c95c4125d2ed8703f5574fcc108bbcf] drm/i915: fix reassignment of variable "intel_dp->DP"

I was running glxgears and other video and disk intensive processes when it happened and this was in the console. Not sure if that is useful.

intel_do_flush_locked failed: Input/output error
Comment 19 Brad Jackson 2012-12-07 23:05:07 UTC
Should have this done tomorrow. Current build is stable so far.

Bisecting: 2 revisions left to test after this (roughly 2 steps)
[b7abb7144356536c41a81eda2bbfd46027e529f6] drm/i915: Show (count, size) of purgeable objects in i915_gem_objects

Bisecting: 1 revision left to test after this (roughly 1 step)
[6c085a728cf000ac1865d66f8c9b52935558b328] drm/i915: Track unbound pages

# bad: [0327d6ba998ca181013a5a1709701a6532a41972] drm/i915: Extract general object init routine
git bisect bad 0327d6ba998ca181013a5a1709701a6532a41972
# bad: [de9932d13c95c4125d2ed8703f5574fcc108bbcf] drm/i915: fix reassignment of variable "intel_dp->DP"
git bisect bad de9932d13c95c4125d2ed8703f5574fcc108bbcf
# bad: [b7abb7144356536c41a81eda2bbfd46027e529f6] drm/i915: Show (count, size) of purgeable objects in i915_gem_objects
git bisect bad b7abb7144356536c41a81eda2bbfd46027e529f6
Comment 20 Justin P. Mattock 2012-12-08 02:33:14 UTC
well.. much faster you doing this than waiting a month for any result.. looking at the bisect logs seems you are in the right direction(but you never know). let me know what you find out. over here running the current Mainline for a few days to see if I hit this. bisect is finished, will post the results for the record.

Thanks for doing the bisect on this!
Comment 21 Brad Jackson 2012-12-08 16:45:16 UTC
Here's the rest of the bisect. Will attach full bisect log.

Bisecting: 0 revisions left to test after this (roughly 0 steps)
[225067eedf1f4d4065940232c9069fcb255206ee] drm/i915: move functions around

6c085a728cf000ac1865d66f8c9b52935558b328 is the first bad commit
commit 6c085a728cf000ac1865d66f8c9b52935558b328
Author: Chris Wilson <chris@chris-wilson.co.uk>
Date:   Mon Aug 20 11:40:46 2012 +0200

drm/i915: Track unbound pages
Comment 22 Brad Jackson 2012-12-08 16:49:48 UTC
Created attachment 88651 [details]
Full bisect log

One thing confuses me. I tried 3.7-rc1 and it had the crash, so my bisect was between 3.6.0 and 3.1-rc1. But part way through the bisect, I had an unbootable kernel that couldn't mount the partition, so I had to do a skip. After that, it started including rc3 and then rc2. So it's possible that my bisect is not right, or that there are two patches that cause the problem.
Comment 23 Daniel Vetter 2012-12-08 17:20:58 UTC
On Sat, Dec 8, 2012 at 5:49 PM,  <bugzilla-daemon@bugzilla.kernel.org> wrote:
> --- Comment #22 from Brad  Jackson <bjackson0971@gmail.com>  2012-12-08
> 16:49:48 ---
> Created an attachment (id=88651)
>  --> (https://bugzilla.kernel.org/attachment.cgi?id=88651)
> Full bisect log
>
> One thing confuses me. I tried 3.7-rc1 and it had the crash, so my bisect was
> between 3.6.0 and 3.1-rc1. But part way through the bisect, I had an
> unbootable
> kernel that couldn't mount the partition, so I had to do a skip. After that,
> it
> started including rc3 and then rc2. So it's possible that my bisect is not
> right, or that there are two patches that cause the problem.

That's completely expected - the kernel version and also the one git
shows you just goes backwards in history until it finds the first tag
and uses that. Which means if the drm-intel-next tree for 3.7 is e.g.
based of 3.6-rc2, and your bisect is only in there (since all the
other branches merged into 3.7-rc1 seem to work) git lets you test
just commits on that branch.

It looks like git bisect went back in time before the 3.6 release, but
in reality the commits your testing with that 3.6-rc2 kernel have
never been in 3.6, so really need to be tested. You're not the first
to get totally confused by this though ;-)
Comment 24 Justin P. Mattock 2012-12-08 17:57:23 UTC
yeah bisecting can be linear that way.. one minute its at 3.6 then its at 3.2 then back to 3.6 main thing is getting the proper good/bad so it does not go astray. in your case that might have occurred when you had to skip a revision because of brokenness(hopefully not though). over here still running the Mainline with no signs of breakage, but still have another day and a half to run since I seem to be hitting this at/around 3 days. will keep you updated
Comment 25 Justin P. Mattock 2012-12-11 03:59:05 UTC
Created attachment 88891 [details]
bisect_log_i915_hangcheck_timer_elapsed

the following attachment is the bisect log that I did for this bug. at the moment I have been running the Mainline kernel without hitting this bug. normally at/around 3days I would reliably hit this and as it stands my uptime is:

 19:58pm  up 3 days 22:39,  6 users,  load average: 0.21, 0.12, 0.14

which is a good sign that this bug is fixed. but will run this kernel for 5 days to really make sure. will post and update if this fires off.
Comment 26 Brad Jackson 2012-12-11 16:46:36 UTC
I just got the hang on a 3.7.0 release kernel after a couple hours of uptime.
Comment 27 Justin P. Mattock 2012-12-11 18:24:21 UTC
over here, still running the Mainline without hitting this.. will continue to run to see if it fires off. 
as for comment#26 maybe try another go at the bisect, you may need to even perform a git rebase after the initial bisect to perform a bisect in the merge to find the culprit.

also since you can trigger this in a short amount of time, you may try:  CONFIG_CGROUPS = n
before doing any bisects to verify that the cgroup commits are causing this or not.
Comment 28 Brad Jackson 2012-12-11 19:34:54 UTC
If I had a patch to cleanly revert 6c085a728cf000ac1865d66f8c9b52935558b328 from 3.7.0, I could see if that fixes the problem. I tried to revert it against 3.7.0-rc8, but there were errors.
Comment 29 njin 2012-12-25 19:01:50 UTC
downstream at launchpad:
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1093651?comments=all
Comment 30 Daniel Vetter 2013-01-07 16:37:29 UTC
Smells like yet another variation of the 3.7 gpu hang regression. Please attach a dmesg with drm.debug=0xe (I've lost track of your hw) and test the following two kernel patches (alone or in combination, we're still trying to figure out what's going on):

https://bugs.freedesktop.org/attachment.cgi?id=72022

https://bugs.freedesktop.org/attachment.cgi?id=71805
Comment 31 Daniel Vetter 2013-01-16 14:19:19 UTC
Presumably fixed with

commit 262b6d363fcff16359c93bd58c297f961f6e6273
Author: Chris Wilson <chris@chris-wilson.co.uk>
Date:   Tue Jan 15 16:17:54 2013 +0000

    drm/i915: Invalidate the relocation presumed_offsets along the slow path

and paper over by

commit 93927ca52a55c23e0a6a305e7e9082e8411ac9fa
Author: Daniel Vetter <daniel.vetter@ffwll.ch>
Date:   Thu Jan 10 18:03:00 2013 +0100

    drm/i915: Revert shrinker changes from "Track unbound pages"

and

commit 901593f2bf221659a605bdc1dcb11376ea934163
Author: Chris Wilson <chris@chris-wilson.co.uk>
Date:   Wed Dec 19 16:51:06 2012 +0000

    drm: Only evict the blocks required to create the requested hole

Thanks a lot for reporting this regression and please reopen if you're still experiencing issues. Note that the first commit is only available in drm-intel-fixes for now.

Note You need to log in before you can comment on or make changes to this bug.