Bug 15671 - intel graphic card hanging (Hangcheck timer elapsed... GPU hung)
Status: CLOSED CODE_FIX
Product: Drivers
Classification: Unclassified
Component: Video(DRI - Intel)
Hardware: All Linux
Importance: P1 normal
Assigned To: drivers_video-dri-intel@kernel-bugs.osdl.org
Depends on:
Blocks: 15310
Reported: 2010-04-01 18:34 UTC by Maciej Rutecki
Modified: 2011-02-17 03:41 UTC (History)

See Also:
Kernel Version: 2.6.34-rc2
Tree: Mainline
Regression: Yes


Attachments
dmesg output showing GPU hung messages (72.30 KB, text/plain)
2010-07-22 06:10 UTC, Eduardo Bacchi Kienetz
GPU hung in 2.6.35-rc4-71625-gd44a78e (70.36 KB, text/plain)
2010-07-27 05:17 UTC, Eduardo Bacchi Kienetz
dmesg output while running 2.6.38rc4. (62.87 KB, text/plain)
2011-02-17 03:39 UTC, Eduardo Bacchi Kienetz

Description Maciej Rutecki 2010-04-01 18:34:23 UTC
Subject    : intel graphic card hanging (Hangcheck timer elapsed... GPU hung)
Submitter  : Norbert Preining <preining@logic.at>
Date       : 2010-03-27 16:11
Message-ID : 20100327161104.GA12043@gamma.logic.tuwien.ac.at
References : http://marc.info/?l=linux-kernel&m=126970883105262&w=2

This entry is being used for tracking a regression from 2.6.33.  Please don't
close it until the problem is fixed in the mainline.
Comment 1 Rafael J. Wysocki 2010-06-13 12:06:12 UTC
Handled-By : Jesse Barnes <jbarnes@virtuousgeek.org>
Comment 2 Eduardo Bacchi Kienetz 2010-06-14 17:14:07 UTC
I can confirm this with kernel 2.6.34 as downloaded from kernel.org.
I get many of these, until X freezes and only a reboot seems to help (init 3 and init 4 don't seem to help; Slackware 13.1):
[drm:i915_do_wait_request] *ERROR* i915_do_wait_request returns -5
(awaiting 463980 at 462562)
[drm:i915_hangcheck_elapsed] *ERROR* Hangcheck timer elapsed... GPU hung
render error detected, EIR: 0x00000000

I can test fixes every night once I'm back home from work and provide some feedback.
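
For anyone comparing logs across kernels, here is a minimal sketch of how the messages quoted above could be counted in saved dmesg output. The function name `summarize` and the result keys are hypothetical, the patterns are copied from the lines in this comment, and the "gap" is simply the difference between the two numbers in the "awaiting ... at ..." part, with no claim about what the driver means by them. The -5 return value corresponds to -EIO.

```python
import re

# Patterns taken from the log lines quoted in this bug report.
HANG_RE = re.compile(
    r"\[drm:i915_hangcheck_elapsed\] \*ERROR\* Hangcheck timer elapsed\.\.\. GPU hung"
)
WAIT_RE = re.compile(
    r"\[drm:i915_do_wait_request\] \*ERROR\* i915_do_wait_request returns (-?\d+)"
)
AWAIT_RE = re.compile(r"\(awaiting (\d+) at (\d+)\)")


def summarize(log_text):
    """Count hangcheck messages and collect wait-request error details.

    Hypothetical helper for illustration; it only looks at message text,
    so it works with or without dmesg timestamps.
    """
    pairs = [(int(a), int(b)) for a, b in AWAIT_RE.findall(log_text)]
    return {
        "hangs": len(HANG_RE.findall(log_text)),
        "wait_errors": [int(code) for code in WAIT_RE.findall(log_text)],
        "gaps": [awaited - current for awaited, current in pairs],
    }


# Sample copied from the excerpt in this comment.
SAMPLE = """\
[drm:i915_do_wait_request] *ERROR* i915_do_wait_request returns -5
(awaiting 463980 at 462562)
[drm:i915_hangcheck_elapsed] *ERROR* Hangcheck timer elapsed... GPU hung
render error detected, EIR: 0x00000000
"""

if __name__ == "__main__":
    print(summarize(SAMPLE))
```

Running it against the output of `dmesg` saved to a file (read the file and pass its contents to `summarize`) gives a quick way to tell whether one kernel hangs more often than another.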
Comment 3 Jesse Barnes 2010-07-02 23:17:59 UTC
Does this still happen with libdrm and xf86-video-intel from git?
Comment 4 Eduardo Bacchi Kienetz 2010-07-05 19:47:27 UTC
I've cloned git://anongit.freedesktop.org/git/xorg/driver/xf86-video-intel
and git://git.kernel.org/pub/scm/linux/kernel/git/anholt/drm-intel

First I tried drm-intel alone, but weirdness still happened. I then compiled xf86-video-intel (after compiling libdrm 2.4.21, since it wouldn't go on with 2.4.20), and I played tremulous for quite a while and the bug wasn't triggered. Before I was also getting some weird flickering with dark spots in the screen (like triangle shapes), which disappeared as well. There is some bug in the USB on this branch though, since my keyboard started to eventually go crazy (I have some dmesg output), but that's obviously unrelated, just FYI.
Let me know if there is any commit you want me to revert to try to spot the fix (I wouldn't really like to bisect as it might take forever).
BTW, out of curiosity, how do I make sure I'm using the right xf86-video-intel? Not that I think I am, just wondering how to check the version or something.
Thanks!
Comment 5 Eduardo Bacchi Kienetz 2010-07-05 20:31:13 UTC
Oh, I should also mention that the mouse cursor (the arrow, just to be clear) got messed up after this. It takes longer to draw up (well, I need to keep moving the mouse for it to fully draw), drawing vertically like "blinds". I'm not sure what it's related to, but appeared with this new kernel + xf86-video-intel + libdrm 2.4.21.
I'll boot with the old kernel and take a look (since now my xf86-video-intel got replaced with the new one anyway).
Comment 6 Eduardo Bacchi Kienetz 2010-07-07 15:09:37 UTC
I booted with the old kernel, but now X is using the new libdrm + xf86-video-intel since that's not in the kernel, and I don't seem to be able to trigger the bug anymore. I still saw the dark spots eventually, but X didn't freeze anymore. So maybe libdrm and/or xf86-video-intel were the culprits, unless they are simply not reaching the bug in the video driver anymore. Anyway, I'll now keep booting with the new kernel and if the bug appears again I'll report here. I'll be waiting for the release into the mainline kernel. Thanks again guys.
Comment 7 Eduardo Bacchi Kienetz 2010-07-22 06:07:23 UTC
Well, turns out that a path to the bug still exists somewhere. I'm pretty sure it used to happen more frequently with the older kernel (2.6.34), so this one seems to have fixed most of the sources, but there is still something left (took days to happen). I'm also attaching my dmesg output for review.
Comment 8 Eduardo Bacchi Kienetz 2010-07-22 06:10:58 UTC
Created attachment 27202 [details]
dmesg output showing GPU hung messages

I was wrong: the bug isn't completely gone as I reported earlier, it just happens less frequently now.
Comment 9 Norbert Preining 2010-07-23 15:53:01 UTC
I'm sorry to say that the same still happens with 2.6.35-rc6: when starting d2x-xl, X is killed as soon as the program starts the renderer, and the graphics card cannot be recovered except by restarting the computer. In the log I have many of these messages:
[ 9851.516046] [drm:i915_hangcheck_elapsed] *ERROR* Hangcheck timer elapsed... GPU hung
[ 9851.516285] [drm:i915_do_wait_request] *ERROR* i915_do_wait_request returns -5 (awaiting 130915 at 130914)

So seeing 
    Status: CLOSED CODE_FIX
makes me a bit surprised. Do you have other patches to be applied? But it might be that my user mode (drm etc.) is too old; I am running Debian/sid.

Thanks

Norbert
Comment 10 Jesse Barnes 2010-07-23 16:31:21 UTC
http://git.kernel.org/?p=linux/kernel/git/anholt/drm-intel.git;a=commit;h=f602afd4b86ff862ec54e6e3009e0e425129e9b0

That's from drm-intel-next, can you give that tree a try?  It makes the hang check test a bit less trigger happy.
Comment 11 Eduardo Bacchi Kienetz 2010-07-23 16:41:53 UTC
Jesse, note my message dated 2010-07-05 (comment #4), where I said I had cloned that repo. Even after that, I still get the error above. That clearly means the above-mentioned commit (dated Jun 6th, so likely already in the branch by the time I cloned it) is not a complete fix.
Comment 12 Eduardo Bacchi Kienetz 2010-07-23 17:34:23 UTC
You know what, never mind: Jun 6th is the date Chris committed to his original branch, while Eric merged it only on July 8th, so I likely missed that! I'll pull it tonight, recompile and test again. grr.... sorry.
Comment 13 Eduardo Bacchi Kienetz 2010-07-27 05:17:26 UTC
Created attachment 27264 [details]
GPU hung in 2.6.35-rc4-71625-gd44a78e

Very well, so I've pulled the latest git://git.kernel.org/pub/scm/linux/kernel/git/anholt/drm-intel, so that I'd get the commit mentioned by Jesse in comment #10, but I still get the GPU hung error. Actually, it seems to have gotten worse with 2.6.35rc4. I also get dark spots quite frequently. I'll see if at some point I am lucky enough to get a photo of them. BTW, when the dark spots (of random shapes, mostly triangles) appear, it's a sign that the GPU is about to hang.
Please check my dmesg output attached.
Moreover, something with the screen size changes with this update. That is: if I boot with 2.6.34-69471-ge3a815f and run, for example, tremulous (the game), it shows up in full screen. Doing the same while running 2.6.35rc4, I get a non-widescreen screen size (edges cut, black). This is more of a side note, since I understand it could be due to something completely unrelated.
Comment 14 Eduardo Bacchi Kienetz 2010-08-23 15:04:56 UTC
wow, guys... I've discovered something really interesting this weekend regarding this problem!
I decided to at least improve my situation regarding this bug and went through the kernel configuration to review which suspicious options I might have enabled that make the bug appear so often. I ended up disabling 6-10 options in my kernel configuration (via make menuconfig), then recompiled, installed and rebooted. The bug now gets triggered only very occasionally, to the point that I am now actually able to play whole maps of Urban Terror (that's the easiest way to trigger the intel-drm-or-whatever bug).
Naturally I was dumb enough not to save the .config that would make the bug appear often, as I didn't believe my kernel changes would affect it at all and the bug would still appear as usual. So, FYI, there is some option that once activated triggers the nasty bug as seen in my previous attachments. 
I'll try to activate some options and retest until I can trigger the bug again and will report.
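
Had both configurations been saved, the changed options could be listed mechanically. The following is a minimal sketch, assuming two saved .config files in the usual `CONFIG_FOO=y` / `# CONFIG_FOO is not set` format; the function names and the `CONFIG_EXAMPLE_*` options are hypothetical placeholders, not the options actually involved here. As a simplification, an option missing from one file is treated the same as "not set".

```python
import re

SET_RE = re.compile(r"^(CONFIG_\w+)=(.*)$")
UNSET_RE = re.compile(r"^# (CONFIG_\w+) is not set$")


def parse_config(text):
    """Map option name -> value; options marked 'is not set' become 'n'."""
    options = {}
    for raw in text.splitlines():
        line = raw.strip()
        match = SET_RE.match(line)
        if match:
            options[match.group(1)] = match.group(2)
            continue
        match = UNSET_RE.match(line)
        if match:
            options[match.group(1)] = "n"
    return options


def diff_configs(old_text, new_text):
    """Return {option: (old_value, new_value)} for every option that differs."""
    old, new = parse_config(old_text), parse_config(new_text)
    changes = {}
    for name in sorted(set(old) | set(new)):
        # Simplification: an absent option is treated as disabled ('n').
        before, after = old.get(name, "n"), new.get(name, "n")
        if before != after:
            changes[name] = (before, after)
    return changes


# Hypothetical option names, for illustration only.
OLD = "CONFIG_EXAMPLE_A=y\nCONFIG_EXAMPLE_B=m\n# CONFIG_EXAMPLE_C is not set\n"
NEW = "# CONFIG_EXAMPLE_A is not set\nCONFIG_EXAMPLE_B=m\nCONFIG_EXAMPLE_C=y\n"

if __name__ == "__main__":
    for name, (before, after) in diff_configs(OLD, NEW).items():
        print(f"{name}: {before} -> {after}")
```

With the list of differing options in hand, re-enabling them one at a time and retesting would narrow down which one makes the hang frequent.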
Comment 15 Eduardo Bacchi Kienetz 2011-02-16 21:02:12 UTC
GPU hang still happens on 2.6.38rc4. I will post an update (dmesg output) when I get home. Just a self-reminder and advance notice to you.
Comment 16 Eduardo Bacchi Kienetz 2011-02-17 03:39:12 UTC
Created attachment 48052 [details]
dmesg output while running 2.6.38rc4.

GPU still hangs while running 2.6.38rc4 as seen at the end of the attached file (dmesg > 2.6.38rc4).
Comment 17 Eduardo Bacchi Kienetz 2011-02-17 03:41:39 UTC
BTW, note that the frigging USB is behaving weirdly with all those messages. Wonder if it could be conflicting with something else somehow, but that's also new behavior :þ
