Subject : intel graphic card hanging (Hangcheck timer elapsed... GPU hung)
Submitter : Norbert Preining <firstname.lastname@example.org>
Date : 2010-03-27 16:11
Message-ID : 20100327161104.GA12043@gamma.logic.tuwien.ac.at
References : http://marc.info/?l=linux-kernel&m=126970883105262&w=2
This entry is being used for tracking a regression from 2.6.33. Please don't
close it until the problem is fixed in the mainline.
Handled-By : Jesse Barnes <email@example.com>
I can confirm this with kernel 2.6.34 as downloaded from kernel.org.
I get many of these until X freezes, and only a reboot seems to fix it; switching to init 3 or init 4 doesn't seem to help (Slackware 13.1):
[drm:i915_do_wait_request] *ERROR* i915_do_wait_request returns -5 (awaiting 463980 at 462562)
[drm:i915_hangcheck_elapsed] *ERROR* Hangcheck timer elapsed... GPU hung
render error detected, EIR: 0x00000000
I can test fixes every night once I'm back home from work and provide some feedback.
Does this still happen with libdrm and xf86-video-intel from git?
I've cloned git://anongit.freedesktop.org/git/xorg/driver/xf86-video-intel
First I tried drm-intel alone, but the weirdness still happened. I then compiled xf86-video-intel (after compiling libdrm 2.4.21, since it wouldn't go on with 2.4.20), played Tremulous for quite a while, and the bug wasn't triggered. Before that I was also getting some weird flickering with dark spots on the screen (like triangle shapes), which disappeared as well. There is some bug in the USB on this branch though, since my keyboard eventually started to go crazy (I have some dmesg output), but that's obviously unrelated, just FYI.
Let me know if there is any commit you want me to revert to try to spot the fix (I wouldn't really like to bisect as it might take forever).
BTW, out of curiosity, how do I make sure I'm using the right xf86-video-intel? Not that I think I am, just wondering how to check the version or something.
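One way to check which driver build X actually loaded is to look at the X server log, which normally records a "module version" line for each loaded driver. Below is a minimal Python sketch, assuming the usual `(II) Module intel:` log layout. The function name `intel_module_version`, the sample text, and the version numbers in it are made up for illustration; a real log would typically be read from somewhere like /var/log/Xorg.0.log, and that path may differ per distribution.

```python
import re


def intel_module_version(log_text):
    """Return the 'module version' string logged for the intel driver, or None."""
    lines = log_text.splitlines()
    for i, line in enumerate(lines):
        if re.search(r"\(II\) Module intel:", line):
            # The version is printed on one of the next few continuation lines.
            for follow in lines[i + 1:i + 4]:
                m = re.search(r"module version = ([\w.\-]+)", follow)
                if m:
                    return m.group(1)
    return None


# Illustrative sample only (hypothetical versions), not taken from this report.
SAMPLE = """(II) LoadModule: "intel"
(II) Module intel: vendor="X.Org Foundation"
\tcompiled for 1.7.7, module version = 2.12.0
\tModule class: X.Org Video Driver
"""

if __name__ == "__main__":
    print(intel_module_version(SAMPLE))
```

If the function returns None, the intel driver was either not loaded or the log uses a layout this sketch does not handle.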
Oh, I should also mention that the mouse cursor (the arrow, just to be clear) got messed up after this. It takes longer to draw up (well, I need to keep moving the mouse for it to fully draw), drawing vertically like "blinds". I'm not sure what it's related to, but appeared with this new kernel + xf86-video-intel + libdrm 2.4.21.
I'll boot with the old kernel and take a look (since now my xf86-video-intel got replaced with the new one anyway).
I booted with the old kernel, but now X is using the new libdrm + xf86-video-intel since that's not in the kernel, and I don't seem to be able to trigger the bug anymore. I still saw the dark spots eventually, but X didn't freeze anymore. So maybe libdrm and/or xf86-video-intel were the culprits, unless they are simply not reaching the bug in the video driver anymore. Anyway, I'll now keep booting with the new kernel and if the bug appears again I'll report here. I'll be waiting for the release into the mainline kernel. Thanks again guys.
Well, turns out that a path to the bug still exists somewhere. I'm pretty sure it used to happen more frequently with the older kernel (2.6.34), so this one seems to have fixed most of the sources, but there is still something left (took days to happen). I'm also attaching my dmesg output for review.
Created attachment 27202
dmesg output showing GPU hung messages
I was wrong: the bug isn't completely gone as I reported earlier, it just happens less frequently now.
I'm sorry to say that with 2.6.35-rc6 the same still happens: when I start d2x-xl, X is killed as soon as the program starts the renderer, and the graphics card cannot be recovered except by restarting the computer. In the log I have many of these messages:
[ 9851.516046] [drm:i915_hangcheck_elapsed] *ERROR* Hangcheck timer elapsed... GPU hung
[ 9851.516285] [drm:i915_do_wait_request] *ERROR* i915_do_wait_request returns -5 (awaiting 130915 at 130914)
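For anyone comparing dmesg dumps across kernels, the messages can be tallied mechanically. The following is a minimal Python sketch, not part of any i915 tooling: `summarize`, `HANG_RE`, and `WAIT_RE` are names chosen here. It counts hangcheck messages and, for each failed wait, records the return code and the difference between the two numbers in the message. Reading those two numbers as the awaited and the current request sequence number is an assumption; -5 corresponds to -EIO on Linux.

```python
import re

HANG_RE = re.compile(
    r"\[drm:i915_hangcheck_elapsed\] \*ERROR\* Hangcheck timer elapsed"
)
# \s* before the parenthesis also accepts the message wrapped over two lines.
WAIT_RE = re.compile(
    r"\[drm:i915_do_wait_request\] \*ERROR\* i915_do_wait_request returns (-?\d+)"
    r"\s*\(awaiting (\d+) at (\d+)\)"
)


def summarize(dmesg_text):
    """Return (hangcheck count, list of (return code, awaited - current))."""
    hangs = len(HANG_RE.findall(dmesg_text))
    waits = [
        (int(rc), int(awaited) - int(current))
        for rc, awaited, current in WAIT_RE.findall(dmesg_text)
    ]
    return hangs, waits


# The two log lines from this comment, used as sample input.
SAMPLE = """\
[ 9851.516046] [drm:i915_hangcheck_elapsed] *ERROR* Hangcheck timer elapsed... GPU hung
[ 9851.516285] [drm:i915_do_wait_request] *ERROR* i915_do_wait_request returns -5 (awaiting 130915 at 130914)
"""

if __name__ == "__main__":
    print(summarize(SAMPLE))  # -> (1, [(-5, 1)])
```

Running the same function over the output of `dmesg` from two different kernels gives a rough way to compare how often the hang is hit.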
Status: CLOSED CODE_FIX
makes me a bit surprised; do you have other patches that need to be applied? But it might be that my userspace (libdrm etc.) is too old; I am running Debian/sid.
That's from drm-intel-next, can you give that tree a try? It makes the hang check test a bit less trigger happy.
Jesse, note my message dated 2010-07-05 (comment #4), where I said I've cloned that repo. So, even after that, I still get the error above. That clearly means the above-mentioned commit (dated Jun 6th, so likely already in the branch by the time I cloned it) is not a complete fix.
You know what, never mind: Jun 6th is the date Chris committed to his original branch, while Eric merged it only on July 8th, so I likely missed that! I'll pull it tonight, recompile and test again. grr.... sorry.
Created attachment 27264
GPU hung in 2.6.35-rc4-71625-gd44a78e
Very well, so I've pulled the latest git://git.kernel.org/pub/scm/linux/kernel/git/anholt/drm-intel, so that I'd get the commit mentioned by Jesse in comment #10, but I still get the GPU hung error. Actually, it seems to have gotten worse with 2.6.35rc4. I also get dark spots quite frequently. I'll see if at some point I am lucky enough to get a photo of them. BTW, when the dark spots (of random shapes, mostly triangles) appear, it's a sign that the GPU is about to hang.
Please check my dmesg output attached.
Moreover, something about the screen size changes with this update. That is: if I boot with 2.6.34-69471-ge3a815f and run, for example, Tremulous (the game), it shows up in full screen. Doing the same while running 2.6.35rc4, I get a non-widescreen screen size (edges cut off, black). This is more of a side note, since I understand it could be due to something completely unrelated.
wow, guys... I've discovered something really interesting this weekend regarding this problem!
I decided to at least improve my situation regarding this bug and went through the kernel configuration to review which suspicious options I might have enabled that make the bug appear so often. I ended up disabling 6-10 options in my kernel configuration (via make menuconfig), then recompiled, installed and rebooted. The bug now gets triggered only very occasionally, to the point that I'm actually able to play whole maps of Urban Terror (that's the easiest way to trigger the intel-drm-or-whatever bug).
Naturally I was dumb enough not to save the .config that would make the bug appear often, as I didn't believe my kernel changes would affect it at all and the bug would still appear as usual. So, FYI, there is some option that once activated triggers the nasty bug as seen in my previous attachments.
I'll try to activate some options and retest until I can trigger the bug again and will report.
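Keeping a copy of each .config before rebuilding would make it possible to list exactly which options changed between a "bad" and a "good" kernel. Below is a minimal Python sketch of such a comparison, assuming the standard .config syntax (`CONFIG_FOO=y` and `# CONFIG_FOO is not set`). The helper names `parse_config` and `diff_configs` and the `CONFIG_EXAMPLE_*` options are hypothetical and are not the options changed in this report. Treating an option that is absent from one file as "n" is a simplification.

```python
import re


def parse_config(text):
    """Map each CONFIG_ option in a kernel .config to its value ('n' if unset)."""
    opts = {}
    for line in text.splitlines():
        line = line.strip()
        m = re.match(r"# (CONFIG_\w+) is not set$", line)
        if m:
            opts[m.group(1)] = "n"
            continue
        m = re.match(r"(CONFIG_\w+)=(.*)$", line)
        if m:
            opts[m.group(1)] = m.group(2)
    return opts


def diff_configs(old_text, new_text):
    """Return {option: (old value, new value)} for every option that differs."""
    old, new = parse_config(old_text), parse_config(new_text)
    changes = {}
    for name in sorted(set(old) | set(new)):
        # Simplification: an option missing from one file is treated as 'n'.
        before, after = old.get(name, "n"), new.get(name, "n")
        if before != after:
            changes[name] = (before, after)
    return changes


# Hypothetical option names, for illustration only.
OLD = """CONFIG_EXAMPLE_A=y
CONFIG_EXAMPLE_B=m
# CONFIG_EXAMPLE_C is not set
"""
NEW = """# CONFIG_EXAMPLE_A is not set
CONFIG_EXAMPLE_B=m
CONFIG_EXAMPLE_C=y
"""

if __name__ == "__main__":
    for name, (before, after) in diff_configs(OLD, NEW).items():
        print(f"{name}: {before} -> {after}")
```

With the list of differing options in hand, re-enabling them a few at a time narrows down which one makes the hang frequent again.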
GPU hang still happens on 2.6.38rc4. Will post an update (dmesg output) when I get home. Just a self-reminder and a heads-up for you.
Created attachment 48052
dmesg output while running 2.6.38rc4.
GPU still hangs while running 2.6.38rc4 as seen at the end of the attached file (dmesg > 2.6.38rc4).
BTW, note that the frigging USB is behaving weirdly with all those messages. Wonder if it could be conflicting with something else somehow, but that's also new behavior :þ