Bug 43267
Description
jrierab
2012-05-19 19:09:47 UTC
We need the i915_error_state from debugfs. Also, the dmesg seems to be from a pretty old kernel (we should have fixed the other bugs you're hitting in there), so please double-check that you're indeed running the drm-intel-experimental kernel. Then attach the complete dmesg (make sure it contains everything from the first boot message up to and including the first hangcheck report). Please attach these as individual text/plain files; it makes handling them much easier. Also, please list the versions of your userspace driver stack, i.e. mesa, libdrm, xf86-video-intel.
Created attachment 73333 [details]
dmesg drm-intel-experimental kernel
Created attachment 73334 [details]
Full report (dmesg, cpuinfo, version, /proc/dri/0/* etc...) drm-intel-experimental kernel
Oops, I've mixed things up, this is a different warning. Your gpu is pretty much dropping every single write, so gpu hangs are highly expected. I've never seen anything like that. Can you please also attach the dmesg when booting up on 3.0.17 on the same system? Also, the dmesg is still cut off. You can try booting with log_buf_size=4M or so to make the dmesg buffer much bigger.
Thanks for your quick answer, Daniel! The previous attachment was from the latest available upstream kernel v3.4-rc6-precise (rc7 is not complete on http://kernel.ubuntu.com/~kernel-ppa/mainline/v3.4-rc7-precise/). Now, I am attaching the info using the drm-intel-experimental/2012-05-14-precise (latest).
* The dmesg is run without parameters. Does it include all the info you need?
* The i915_error_state is included in the "Full Report Experimental" (dri directory). I am not able to upload it as plain text, because the system does not allow files that big.
The versions used:
xf86-video-intel: 2:2.17.0-1ubuntu4
libdrm-intel1: 2.4.32-1ubuntu1
libdrm2: 2.4.32-1ubuntu1
libgl1-mesa-dri: 8.0.2-0ubuntu3
I am not sure if this is the info you need (my first time reporting an upstream bug). If you need anything else, let me know.
Created attachment 73337 [details]
dmesg (full) Ubuntu 3.0.0-17.30-generic 3.0.22
Created attachment 73338 [details]
dmesg (full) Linux version 3.4.0-994-generic
Created attachment 73339 [details]
Full report (dmesg, cpuinfo, version, /proc/dri/0/* etc...) drm-intel-experimental kernel
This contains the full dmesg, to compare against the 3.0.0-17 one.
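The items requested above (a complete dmesg, the i915_error_state from debugfs, and the userspace package versions) could be gathered as separate text files with a small script along these lines. This is only a sketch: the function name collect_gpu_report and the output file names are made up for illustration, the default debugfs path is the one the kernel prints in the hang messages quoted later in this thread, and the package names are the Ubuntu ones mentioned in the thread. Reading debugfs normally requires root. One thing worth double-checking: the kernel's documented boot option for enlarging the log buffer is log_buf_len (for example log_buf_len=4M).

```shell
#!/bin/sh
# collect_gpu_report OUTDIR [ERROR_STATE_PATH]
# Hypothetical helper: writes dmesg.txt, i915_error_state.txt and versions.txt
# into OUTDIR so each can be attached as an individual text/plain file.
collect_gpu_report() {
    outdir=$1
    error_state=${2:-/sys/kernel/debug/dri/0/i915_error_state}

    mkdir -p "$outdir" || return 1

    # Kernel log, from the first boot message onwards (if the buffer is big enough).
    dmesg > "$outdir/dmesg.txt" 2>/dev/null \
        || echo "could not read the kernel log (try as root)" >&2

    # GPU error state captured by the driver after a hang.
    if [ -r "$error_state" ]; then
        cat "$error_state" > "$outdir/i915_error_state.txt"
    else
        echo "cannot read $error_state (debugfs mounted? running as root?)" >&2
    fi

    # Userspace driver versions (Debian/Ubuntu package names, an assumption).
    if command -v dpkg-query >/dev/null 2>&1; then
        dpkg-query -W -f='${Package} ${Version}\n' \
            xserver-xorg-video-intel libdrm-intel1 libdrm2 libgl1-mesa-dri \
            > "$outdir/versions.txt" 2>/dev/null || true
    fi
    return 0
}
```

Usage would be something like `collect_gpu_report ~/gpu-report` right after the first hangcheck message shows up.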
Whatever the cause, your gpu seems to literally drop off the earth. First we have some WARN backtraces about debug registers, where the hw tells us that it had to drop some writes. Then the gpu dies. And in the error_state all hw registers in the gt domain are zero. I have no idea what's going on here. Can you please try to bisect where this issue has been introduced? (In reply to comment #10) > Then the gpu dies. And in the error_state all hw > registers in the gt domain are zero. I speak from experience that it is possible to kill the (SNB) GPU so completely that the register reads return zero from hangcheck. Still the general trend is for the hw to become more resilient... It may even be possible that is the hang killing the GPU preventing the writes, rather than the dropped writes killing the GPU. I will try my best to bisect the problem. It will take some time, though, because I do not have an exact procedure to trigger the bug. It may happen in the first 30s after booting, or it can run flawlessly for half an hour. So if it happens, it is for sure the bug is present. But if not... I should run that kernel enough time to be quite certain that the bug is not still there but hasn't triggered. I will perform the test on my production system (sic) and report back. First of all, I would try to find in which branch the bug was introduced. We know for sure it is not present in 3.0.0.17 (running stable for more than one month now) and it is present in 3.2.0.23. This gives three branches and a lot of kernels to test... Created attachment 73363 [details]
Full report (dmesg, cpuinfo, version, /proc/dri/0/* etc...) drm-intel-experimental kernel v2012-05-22
Latest kernel produces a
[drm:i915_hangcheck_hung] *ERROR* Hangcheck timer elapsed... GPU hung
a few seconds after boot, but not the 2-3 seconds black screen. Only a noticeable delay (so I get the dmesg).
Then, after some 5 minutes, the window decorations and Unity disappeared. The full report corresponds to this situation. There are some errors and a
[drm:i915_hangcheck_hung] *ERROR* Hangcheck timer elapsed... GPU hung
I am currently testing kernels in the http://kernel.ubuntu.com/~kernel-ppa/mainline/. Using kernel 3.2.11-030211-generic for some hours now without a related crash. I have a question, because I do not know the different numbering of kernels: Ubuntu uses 3.2.0-24-generic, while in the PPA something like 3.2.11-030211-generic is used, for instance. Is the PPA kernel version equivalent to an Ubuntu 3.2.0-11-generic one? Because if that is not the case, there is something which escapes me... A good hint or pointer will be appreciated.
Answering myself about the different kernel versions between Ubuntu and mainline: http://kernel.ubuntu.com/~kernel-ppa/info/kernel-version-map.html.
Now, the tests:
* 3.2.14-030214-generic OK (1h)
* 3.2.15-030215-generic OK (4h) 1 IRQ error making USB transfers sloooow down (not related to this bug)
These results are a bit surprising, as Ubuntu 3.2.0.23 presents the bug and is based on mainline 3.2.14, which seems to be good... Can it be because a patch released for a later mainline kernel version was applied to the Ubuntu one? I will continue to test mainline kernels until one fails, then go back to be sure the previous one is stable. Does anyone have a better idea or suggestions?
Ok, because the gpu disappeared it's been a bit harder than usual to figure out where exactly it died, but thanks to the new ring state dumping in drm-intel-experimental things worked out.
The gpu seems to die in this tiny blitter batchbuffer:
batchbuffer (blitter ring) at 0x0f8e3000:
0x0f8e3000: 0x54f08806: XY_SRC_COPY_BLT (rgb enabled, alpha enabled, src tile 1, dst tile 1)
0x0f8e3004: 0x03cc0500: format 8888, pitch 1280, rop 0xcc, clipping disabled,
0x0f8e3008: 0x001d03a1: dst (929,29)
0x0f8e300c: 0x002d03b1: dst (945,45)
0x0f8e3010: 0x0498d000: dst offset 0x0498d000
0x0f8e3014: 0x001d03a1: src (929,29)
0x0f8e3018: 0x00000500: src pitch 1280
0x0f8e301c: 0x077ea000: src offset 0x077ea000
0x0f8e3020: 0x40c00001: XY_SETUP_CLIP_BLT (rgb disabled, alpha disabled, src tile 0, dst tile 0)
0x0f8e3024: 0x00000000: cliprect (0,0)
0x0f8e3028: 0x00000000: cliprect (0,1280)
0x0f8e302c: 0x05000000: MI_BATCH_BUFFER_END
Quick check: Are you enabling the i915_enable_fbc module option by chance?
(In reply to comment #16)
> The gpu seems to die in this tiny
> blitter batchbuffer:
Not quite. Look again. ;-)
(In reply to comment #17)
> Quick check: Are you enabling the i915_enable_fbc module option by chance?
Me? Nope. I do not have any special tweaks. Booting with default options. My test system is even a clean one, with no additional SW (only the Ubuntu updates).
To double-check for paranoia, because iirc debian/ubuntu enabled fbc by default once: Can you please boot with i915.i915_enable_fbc=0 added to the kernel cmdline, check that it is indeed disabled with cat /sys/module/i915/parameters/i915_enable_fbc and then try to reproduce the issue?
Created attachment 73470 [details]
Full report (dmesg, cpuinfo, version, /proc/dri/0/* etc...) drm-intel-experimental kernel v2012-05-26
Latest drm-intel-experimental (2012-05-26) kernel, booting with i915.i915_enable_fbc=0.
This time the GUI freezes (window decorations are still there and the mouse still moves normally, but menus and panels do not respond to clicks). I could get the report by switching to the console. When back in the GUI, I needed to start a new session and the whole GUI was weird: small portions of windows or decorations were drawn while others were not, all of it changing when moving the mouse.
It is interesting to see in dmesg that the bug fires even before the system is completely up (I mean the "eth0: no IPv6 routers present" line, which is where non buggy kernels end after full boot). At least, I did not notice this behaviour before.
If you think there is anything else I can test, do not hesitate to ask :-)
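Checking that a module option really took effect, as asked for above, amounts to reading the parameter back from sysfs. A minimal sketch, assuming a POSIX shell; the helper name i915_param and the I915_PARAM_DIR override (handy for trying the function on a machine without the driver loaded) are hypothetical, only the /sys/module/i915/parameters path comes from the thread:

```shell
#!/bin/sh
# i915_param NAME
# Prints the current value of an i915 module parameter, or fails with a
# message on stderr if the parameter file is not readable.
i915_param() {
    name=$1
    # Directory can be overridden for testing; the default is the real sysfs path.
    dir=${I915_PARAM_DIR:-/sys/module/i915/parameters}
    if [ -r "$dir/$name" ]; then
        cat "$dir/$name"
    else
        echo "parameter $name not readable under $dir" >&2
        return 1
    fi
}
```

After booting with i915.i915_enable_fbc=0, `i915_param i915_enable_fbc` is expected to print 0; the same helper works for any other parameter the driver exposes.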
I've just noticed that you have a bunch of virtual box drivers in your kernel. Can you please try to reproduce this issue without them loaded? Preferrably on a mainline kernel ... Created attachment 73552 [details]
Full report (dmesg, cpuinfo, version, /proc/dri/0/* etc...) mainline kernel 3.5-rc2
The vbox drivers were present because the report was from my main system. The other ones were from my clean system, however.
Anyway, here is the report for the first sign of the bug (a brief system freeze) on my clean partition (Ubuntu Precise with up-to-date updates and nothing else) for the latest mainline kernel, v3.5-rc2-quantal.
This first small (2-5 seconds) "freeze" did not produce a black screen or missing decorations. Several seconds after that, the system froze completely. I was unable to switch to a shell session and it did not even respond to the power button (until I held it long enough to force a power-off, of course). All the while, the cursor was responding to my mouse movements, but nothing else. Obviously, I could not capture any info for this complete freeze.
Well, after so many tests in the 3.2 branch, at last some good news: * 3.4.0-030400rc2-generic KO -10min (GPU Hung! Bug certainly present here) * 3.4.0-030400rc1-generic OK 45min (seems to be safe) I will work some hours with 3.4-rc1 to confirm that it is completely free from the bug. It seems that a patch introduced between these two kernels has been backported to the Ubuntu ones, which are based on 3.2.14 and 3.2.16 mainline kernels (but original mainline kernels 3.2.14-3.2.19 seem to be safe according to my tests). Hm, one big thing we've changed between -rc1 and -rc2 is that rc6 is now enabled by default. Can you please check whether disabling that with i915.i915_enable_rc6=0 on the kernel cmdline prevents the hangs even on broken kernels? YES ! This seems to be it!!! Running nearly 2h with 3.4.0-030400rc2-generic without the bug, when loading with i915.i915_enable_rc6=0. Tomorrow I will test the last official Ubuntu kernel with the same option, just to confirm. Today, more than 3h with Ubuntu kernel 3.2.0-24.39-generic (based on mainline 3.2.16) with i915.i915_enable_rc6=0, without errors. It seems clear that the rc6 trick has something to tell us about this bug ;-) Just to confirm, all 3 of your systems fail with drm-intel-experimental, but work with i915.i915_enable_rc6=0? Do you use the same config for all? Might be worth attaching that. Yep, it does. Can you please retry this exercise with the latest 3.5-rc kernel, too? Just to make sure this trick still works there ... We're working on a few patches around rc6, so stay tuned. Could you also try using other combination of i915.i915_enable_rc6 parameters, like: - i915.i915_enable_rc6=2 - i915.i915_enable_rc6=3 to check if it still happens? (In reply to comment #28) > Just to confirm, all 3 of your systems fail with drm-intel-experimental, but > work with i915.i915_enable_rc6=0? There are only two PC involved. One which does not fails (i5-2400) and it is running with the current Ubuntu kernel. 
The other one (i5-2500k) has two partitions: the first one is my main system, which have lots of things installed, and the second one is a clean Ubuntu Precise installation with up-to-date updates, and nothing more (this is ideal for testing). Normally, I do my tests in the clean partition, however as I had to test so many kernels revisions and options (more than 20 until now), sometimes I perform the tests on the main system, specially when I believe that the kernel would not crash and I would need to run it for some hours to confirm that. Created attachment 73611 [details]
Full report (dmesg, cpuinfo, version, /proc/dri/0/* etc...) mainline kernel 3.5-rc2 i915_enable_rc6=2
3.5.0-030500rc2-generic with i915.i915_enable_rc6=0 working OK for more than 30min (will continue to test later). I believe it is safe.
3.5.0-030500rc2-generic with i915.i915_enable_rc6=2 crash just after boot (GUI freeze while cursor moving, but able to switch to console to get report).
Created attachment 73621 [details]
Full report (dmesg, cpuinfo, version, /proc/dri/0/* etc...) mainline kernel 3.5-rc2 i915_enable_rc6=3
3.5.0-030500rc2-generic with i915.i915_enable_rc6=3 crash just after boot (GUI freeze while cursor moving, but able to switch to console to get report).
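Since each result above depends entirely on which i915.i915_enable_rc6 value was passed at boot, it can help to record the value actually present on the kernel command line next to each report. A sketch, assuming /proc/cmdline is readable and awk is available; cmdline_value is a hypothetical helper name:

```shell
#!/bin/sh
# cmdline_value KEY [CMDLINE_FILE]
# Prints the value of KEY=value from the kernel command line, or nothing if
# the key was not given. If the key appears more than once, the last one is printed.
cmdline_value() {
    key=$1
    file=${2:-/proc/cmdline}
    awk -v key="$key" '
        {
            for (i = 1; i <= NF; i++)
                # Exact prefix match on "KEY=", no regular expressions involved.
                if (index($i, key "=") == 1)
                    val = substr($i, length(key) + 2)
        }
        END { if (val != "") print val }
    ' "$file"
}
```

For example, `cmdline_value i915.i915_enable_rc6` prints nothing on a default boot, which makes it easy to tell a default run from an explicit rc6=0 one when comparing reports.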
Could you please try the patch at http://permalink.gmane.org/gmane.comp.freedesktop.xorg.drivers.intel/11894 ? By those symptoms, it looks like your GPU could be affected by the issue which that patch tries to address (basically, we awake the GPU from its sleep state, but the report of it awareness comes in before all the threads are really ready. So it could happen that we are sending commands to it before it is ready to listen to them, and then chaos happens). (In reply to comment #34) > Could you please try the patch at > http://permalink.gmane.org/gmane.comp.freedesktop.xorg.drivers.intel/11894 ? Sorry, but I am not an expert on kernel compilation. I have tried to apply the patch to the last 3.5-rc2 mainline kernel, but it has failed (the first 3 hunks). If there is a pre-built kernel somewhere which has the patch built-in, I would gladly test it. If not, I will try to find a kernel where the patch applies ok... Patch applied perfectly on top of 3.5-rc3, dunno what happen for you (I guess some whitespace mangling somewhere): http://cgit.freedesktop.org/~danvet/drm/log/?h=for-jirierab Sorry, disregard that, wrong bug report. The patch doesn't apply because it needs another one, too. I'll update the branch. Ok, updated branch with the right patches pushed. See http://cgit.freedesktop.org/~danvet/drm/log/?h=for-jirierab Created attachment 74151 [details]
Full report (dmesg, cpuinfo, version, /proc/dri/0/* etc...) mainline kernel 3.5-rc3 patch attempt
Thank you Daniel for applying the patch and generating a test branch for me.
Today I've been able to compile the branch, as follows:
1. Download the zipped branch code.
2. Unzip and cd to the directory.
3. cp /boot/config-$(uname -r) .config (I was on Ubuntu 3.2.0-25-generic)
4. make oldconfig (and accepting all default options)
5. fakeroot make-kpkg --initrd --append-to-version=-custom kernel-image kernel-headers
Unfortunately, the generated kernel does not seem to solve this issue. The system has crashed right away after initial boot (mouse moved, but system was unresponsive). I've been able to switch to console and generate the attached report.
If you see something wrong on my procedure or would like to test other things, just let me know.
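The build steps listed above can be written down as a script, which makes the procedure easier to review and repeat. This sketch only wraps the commands from the list; the function names, the DRY_RUN switch and the `yes ""` idiom for accepting every default in make oldconfig are additions for illustration, and the source directory is whatever the branch archive unpacked to.

```shell
#!/bin/sh
# run CMD...: execute the command, or only print it when DRY_RUN is set.
run() {
    if [ -n "${DRY_RUN:-}" ]; then
        printf '%s\n' "$*"
    else
        "$@"
    fi
}

# build_custom_kernel SRC_DIR
# The body runs in a subshell (note the parentheses) so the cd does not
# change the caller's working directory.
build_custom_kernel() (
    src=$1
    cd "$src" || exit 1
    # Step 3: start from the configuration of the running kernel.
    run cp "/boot/config-$(uname -r)" .config || exit 1
    # Step 4: accept the default answer for every new option.
    run sh -c 'yes "" | make oldconfig' || exit 1
    # Step 5: build Debian packages for the kernel image and headers.
    run fakeroot make-kpkg --initrd --append-to-version=-custom kernel-image kernel-headers
)
```

Running `DRY_RUN=1 build_custom_kernel ./linux-src` (directory name hypothetical) prints the three commands without executing them, which is a cheap way to check the procedure before a long build.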
Can you please retest with latest drm-intel-fixes?
Mmm... I have tested the kernel from http://kernel.ubuntu.com/~kernel-ppa/mainline/drm-intel-experimental/2012-09-17-quantal/ but I can't conclude anything. The system boots, but neither the keyboard nor the mouse is working (the rest seems ok: notifications, etc...), so I could not generate any report. The bizarre thing is that the same occurs with i915.i915_enable_rc6=0, which is the trick I use with my current kernel to boot and work. So, it seems that this effect is not related to this bug.
Hi jrierab, could you please test the drm-intel-fixes branch from http://cgit.freedesktop.org/~danvet/drm-intel/ ? Also I'd like to know if there are any differences between i915.i915_enable_rc6=0, i915.i915_enable_rc6=1, and not using this flag on the boot cmdline at all, letting it use the default. Thanks
Created attachment 80761 [details]
Full report (dmesg, cpuinfo, version, /proc/dri/0/* etc...) mainline kernel 3.6-rc6 (default, rc6=0, rc6=1)
Well, good news at last :-)
I've compiled that kernel and tested it on my clean, fully updated system (booting with default, rc6=1, rc6=0). No sign of this bug has shown so far. However, as there is no clear way to trigger the bug, I cannot be absolutely sure it is really gone until enough uptime is accumulated on the system, but my impression is really gooood.
I will install the kernel on my main system and see how it performs in real life. If anything goes wrong, I will report immediately. But, so far, so good. So, many thanks to you guys for your hard work !
Thanks for verifying that. I'm closing this bug for now. Feel free to reopen if you face this issue again on your main system. Created attachment 82671 [details] Full report (dmesg, cpuinfo, version, /proc/dri/0/* etc...) mainline kernel 3.6-rc7 Unfortunately, the bug is not gone. I've suffered some related hangs in my system. They are much distant in time now, but still... I've noticed a new kernel 3.6-rc7 in http://cgit.freedesktop.org/~danvet/drm-intel/ so I've compiled and tested it. The attachment corresponds to the initial phase of the bug symptoms. If I continue to work without noticing the first symptoms of the bug (basically slow desktop response to mouse or interactions with menus), the GPU will completely hung with the following message in dmesg: [drm:i915_hangcheck_hung] *ERROR* Hangcheck timer elapsed... GPU hung Normally this will produce a desktop freeze, or even worse, a full system hung. The attached report corresponds to a default boot (no i915_enable_rc6 trick). I will continue to test with this option set to 0 and 1. Anything else you need, just let me know. Bug is still present in kernel 3.6-rc7. Please check my comment above. We've added a few snb/rc6 related workarounds to 3.8. Can you please retest with the latest 3.8-rc kernels? Created attachment 91851 [details] Full report (dmesg, cpuinfo, version, /proc/dri/0/* etc...) mainline kernel 3.8.rc4 I've testing kernel 3.8 rc4 (http://kernel.ubuntu.com/~kernel-ppa/mainline/v3.8-rc4-raring/) for nearly a week now. It is very stable. The bug appeared only two times, and I am not sure of the first one (could not get the report about it). The second time, the desktop cease to respond (but mouse was moving) without previous warnings. I was able to switch to console and get the report attached. Then, went to desktop again and all window's decorations were missing, and all the windows were in the same desktop space. So, this bug seems to be still present, but it appears now very rarely. 
BTW the dmesg output is about 250K and it appears truncated. Do you know why that is? I run the kernel with the log_buf_size=4M parameter, so it should be complete. Anyway, I attached an initial dmesg, created just after rebooting from the crash, in case you would like to see the exact system configuration. Also, I noticed a new 3.8-rc5 kernel, so I will test it ASAP.
So we're finally getting somewhere, nice. Since your machine still dies somewhere in a ddx batchbuffer, testing SNA and UXA acceleration backends would be interesting. Please make sure that you have the latest xf86-video-intel release though.
The error state looks symptomatic of the dropped mmio hangs - nearly all the registers read zero which is a sure sign of a GPU pining for the fjords.
(In reply to comment #49)
> So we're finally getting somewhere, nice. Since your machine still dies
> somewhere in a ddx batchbuffer, testing SNA and UXA acceleration backends
> would be interesting. Please make sure that you have the latest
> xf86-video-intel release though.
Daniel, I was just testing with an up-to-date Ubuntu Quantal system with the latest 3.8-rc4 kernel. Where can I get the latest xf86-video-intel release? And I don't know what to do with that SNA and UXA acceleration stuff... I am willing to help with the testing, but as I'm not a kernel/driver developer I will need more specific instructions to do so. If this message was addressed to me, of course.
On Sun, Jan 27, 2013 at 1:51 PM, <bugzilla-daemon@bugzilla.kernel.org> wrote:
> Daniel, I was just testing with an up-to-date Ubuntu Quantal system with the
> latest 3.8-rc4 kernel. Where can I get the latest xf86-video-intel
> release? And I don't know what to do with that SNA and UXA acceleration
> stuff...
xorg-edgers ppa should have it all: https://launchpad.net/~xorg-edgers/+archive/ppa
SNA/UXA you can select with an xorg.conf snippet:
Section "Device"
    Identifier "igd"
    Driver "intel"
    Option "AccelMethod" "SNA"
EndSection
You can double-check whether it worked by looking for SNA/UXA in Xorg.log
Created attachment 92641 [details]
Full report (dmesg, cpuinfo, version, /proc/dri/0/* etc...) mainline kernel 3.8.rc6 + SNA
Another crash, testing kernel 3.8 rc6 with the latest intel drivers and SNA acceleration as requested. This time the logs are complete (the Xorg conf and log are also included) and the system could be resumed after switching to a terminal and then back to the desktop.
Will test with UXA acceleration now.
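Switching between the two acceleration backends and confirming which one the driver picked up can be scripted roughly as follows. This is a sketch under assumptions: the function names are made up, the snippet content is the one suggested earlier in the thread, and /var/log/Xorg.0.log is assumed as the default log location (it may differ per system). Where the snippet has to be installed, for instance a file under /etc/X11/xorg.conf.d/, also depends on the distribution.

```shell
#!/bin/sh
# write_accel_snippet METHOD FILE
# Writes an xorg.conf Device section selecting the given AccelMethod (SNA or UXA).
write_accel_snippet() {
    method=$1
    file=$2
    case $method in
        SNA|UXA) ;;
        *) echo "unknown AccelMethod: $method" >&2; return 1 ;;
    esac
    cat > "$file" <<EOF
Section "Device"
    Identifier "igd"
    Driver "intel"
    Option "AccelMethod" "$method"
EndSection
EOF
}

# accel_methods_in_log [LOGFILE]
# Lists which of SNA/UXA are mentioned in the X server log, one per line.
accel_methods_in_log() {
    log=${1:-/var/log/Xorg.0.log}
    grep -o -E 'SNA|UXA' "$log" | sort -u
}
```

After restarting X with the new snippet, `accel_methods_in_log` should mention the backend that was requested; if it lists the other one, the snippet was probably not picked up.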
Created attachment 92821 [details]
Full report (dmesg, cpuinfo, version, /proc/dri/0/* etc...) mainline kernel 3.8.rc6 + UXA
First symptoms of the bug in 3.8-rc6 with UXA acceleration using the latest intel drivers. The desktop froze for several seconds (5-10) but recovered after that.
dmesg does not show "GPU Hung", but bug starts with several "*ERROR* Timed out waiting for forcewake to ack request." and seems to recover after a "[sched_delayed] sched: RT throttling activated".
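To compare runs like these, it can be useful to count how often each of the driver messages quoted in this thread shows up in a saved dmesg. A minimal sketch; count_i915_errors is a hypothetical helper, and the three strings are copied from messages reported in this bug:

```shell
#!/bin/sh
# count_i915_errors DMESG_FILE
# Prints "<count><TAB><message>" for each known i915 error message.
count_i915_errors() {
    file=$1
    for msg in \
        'Hangcheck timer elapsed... GPU hung' \
        'Timed out waiting for forcewake to ack request' \
        'GT thread status wait timed out'
    do
        # grep -c prints 0 but exits non-zero when nothing matches,
        # hence the "|| true". -F treats the message as a fixed string.
        n=$(grep -c -F -- "$msg" "$file" || true)
        printf '%s\t%s\n' "$n" "$msg"
    done
}
```

Saving the log with `dmesg > dmesg.txt` and running `count_i915_errors dmesg.txt` gives a quick per-kernel summary that can be pasted next to the uptime figures.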
Created attachment 93491 [details]
Fix SNB rc6 init with documented sequence and threshold values
Could you please test this patch and let me know what changed in the issues you are facing?
Hi, today, exactly this error occurred to me after applying an 'apt-get dist-upgrade' to a system. Here some system details: CPU: Intel Core i5-2500 CPU @ 3.30Ghz OS: Ubuntu 12.04.02 (64bit) NEW KERNEL: 3.0.2-39-generic (buildd@lamiak) #62-Ubuntu SMP Thu Feb 28 00:28:52 UTC 2013 (Ubuntu 3.2.0-39.62-generic 3.2.39) OLD KERNEL: 3.0.2-38-generic (buildd@akateko) #61-Ubuntu SMP Tue Feb 19 12:18:21 UTC 2013 CONFIGURATION: The OS-configuration is almost as distributed. No heavy tunings. NOTE: The 2D-version of the Unity desktop is used. At an earlier dist-upgrade (might be 2.0.3-35) the the 3D-Unity panel got difficulties in a way that the graphics adapter did not return the rendered objects like the buttons etc. Rebooting the system did not solve the problem when it occurred. Only after making a real power off of the system restored the functionality. I assumed, the reset of the graphics adapter did not succeed. I know, the kernel is the distribution specific one and not in active development any more. But the chance to advance the filed problem is, that on this system the error does occur almost immediately after login when using 3.0.2-39. Using the previous kernel 3.0.2-38, everything seems to be stable as before the upgrade. I have access to the system for the next two weeks, so if you would like to get some details and testing, I'm willing to do my best (being almost typical user with some unnoticeable administration experience). So tell me what information you would like to get and how I should gather it. sorry, a typo occurred in my previous comment: please exchange "2.0.3-35" by "3.0.2-35". Oh no, last night obviously my eyes got scrolled over, so I always mismatched the major version in all version notes: The erratic system is running with the originally distributed kernel versions of Ubuntu 12.04 / 64bit. 
So the major kernel version is 3.2.0 The kernel showing the error is 3.2.0-39 The kernel not having this problem is 3.2.0-38 I have similar (or the same) symptoms on my system (using Ubuntu 12.10, 64bit and an intel core i3-2130 sandy bridge). Here, a workaround seems to be to completely remove the xserver-xorg-video-intel driver package. This is clearly not acceptable for most people because it also removes a lot of functionality. However, it might be useful for someone. Created attachment 96921 [details] Full report (dmesg, cpuinfo, version, /proc/dri/0/* etc...) mainline kernel 3.9..0-rc5 + UXA Sorry for taking too long to try your patch, Rodrigo. I've not had much time in the last few weeks... Today I tested the last kernel 3.9.0-rc5 from http://kernel.ubuntu.com/~kernel-ppa/mainline/v3.9-rc5-raring/ with UXA active and last intel drivers from xorg-edgers ppa (no i915_enable_rc6 trick). No "GPU hung" for more than 3 hours uptime running glxgears and compiling linux kernel. However, there are symptoms of the/some bug. Lot's of: - [drm:__gen6_gt_force_wake_get] *ERROR* Timed out waiting for forcewake to ack request. - [drm:__gen6_gt_wait_for_thread_c0] *ERROR* GT thread status wait timed out - and some crashes also. From the user point-of-view, only some half-a-second non-responsive UI when a crash happens (only 2 times in 3 hours, 575s and 9620s in dmesg log file). So the system seems to be perfectly usable. Far better than before in any case. The attached report file contains a directory 'Reports' with all the info from the first crash (aprox. 575s uptime), after the first half-a-second noticeable freeze of UI. Also, there is a 'dmesg_3.9.0-rc5_12000.txt' with full dmesg log (uptime more than 3:20h) to see the error frequency. In the meantime, I updated my kernel source from git and tried to apply your patch. It was partially rejected (*.rej files are included in the report also). However, I compiled it anyway and it is running just now, while I'm writting. 
The glxgears demo is also running from the beginning, and up to now, 3600s uptime, no sign of any bug in dmesg. So, I will continue to test it and inform here about the results. Can you please also test the latest drm-intel-nigthly git branch from http://cgit.freedesktop.org/~danvet/drm-intel ? It contains some patches to fix an off-by-one in the timeout of the forcewake functions - it could very well be that this ends up upsetting the hw. Daniel, the test did not go well. Lots of "*ERROR* Timed out waiting for forcewake to ack request." and crashes, most like the report in my previous post. However, while compiling that kernel version I've continued to use the last kernel from git ( git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git) with the patch from Rodrigo (#55, https://bugzilla.kernel.org/show_bug.cgi?id=43267#c55) and also running glxgears, as commented in previous post. This makes for an accumulated uptime of nearly 5h without any single symptom of the bug. The patch was not fully applied (rejected files are in my previous attachment), but I believe it is worth to look at what was applied, because it is the only difference between my report with crashes and 5h of calm. Anyway, I will continue to work with that patched kernel to accumulate more uptime. Are you sure my patch was the only difference? From the 2 .rej files in your last attachment it seems that my patch didn't applied entirely. :( I agreed with Chris that this pathc I sent was ugly and introduced another already fixed bugs. I have to send a new version, but before that could you try reverting this patch: commit 1ee9ae3244c4789f3184c5123f3b2d7e405b3f4c Author: Daniel Vetter <daniel.vetter@ffwll.ch> Date: Wed Aug 15 10:41:45 2012 +0200 drm/i915: use hsw rps tuning values everywhere on gen6+ Thanks Created attachment 97491 [details] Git Kernel HEAD 07961ac7c0ee8b546658717034fe692fd12eefa9 vs patched diff Rodrigo, I believe you are right. 
It was very late yesterday when I wrote the last comment :-( There could be more differences, because my initial (#60) test was with kernel 3.9.0-rc5 from http://kernel.ubuntu.com/~kernel-ppa/mainline/v3.9-rc5-raring/ and the (partially) patched one is from HEAD 07961ac7c0ee8b546658717034fe692fd12eefa9 kernel branch 'master' of git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux (I am using it right now, with glxsgears running, and no bug symptoms). Your patch was applied partially, because it was older than the kernel I applied it to. I attach the diff file here, in case you would like to check what was really applied. I would also like to revert it and try the original kernel as it is, just to check. In your last comment, you refer to a certain patch. I suppose it is from the http://cgit.freedesktop.org/~danvet/drm-intel/commit/?h=drm-intel-nightly branch, isn't it? On Fri, Apr 5, 2013 at 8:26 PM, <bugzilla-daemon@bugzilla.kernel.org> wrote: > > In your last comment, you refer to a certain patch. I suppose it is from the > http://cgit.freedesktop.org/~danvet/drm-intel/commit/?h=drm-intel-nightly > branch, isn't it? The mentioned git commit (1ee9ae3244c4789f3184c5123f3b2d) is already included in 3.6. But to keep variability low it would be best to test the revert on one of the recent kernels you've tested already. -- Daniel Vetter Software Engineer, Intel Corporation +41 (0) 79 365 57 48 - http://blog.ffwll.ch Ok. Let's summarize my tests. All of them are performed with the HEAD 07961ac7c0ee8b546658717034fe692fd12eefa9 kernel branch 'master' of git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux, UXA accelaration activated and same intel drivers from xorg-edgers ppa (git 20130328). 1. HEAD kernel (no patch, no revert). No sign of bug for 3h. But just when I was about to power off, it hung for about 1 minute. Then recovered. 2. HEAD with revert commit as suggested. 
After 30m uptime, lots of errors in dmesg ("*ERROR* Timed out waiting for forcewake to ack request."). After a little while, a completely system hung, with no recovery. 3. HEAD with Rodrigo's patch partially applied. Accumulates more than 11h uptime without any sign of the bug. I captured reports of 1 and 2 (just before complete system hung). They are probably very similar at the ones already attached, but feel free to ask for them if you feel they will give some useful info. I am using 3 as my default boot kernel for now on. Since there is no easy way to exactly trigger the bug, only with enough accumulated uptime can we assume that the bug is really gone. However, a look at the diff file from 1 to 3, which is attached in #64 (https://bugzilla.kernel.org/show_bug.cgi?id=43267#c64) should be considered. It really seems that the patch does something good. Hi jrierab, Thank you very much for your efforts. But if possible I'd like you help me to figure out what part of my ugly patch is really fixing the hungs. I have 3 guesses, so I prepared 3 small patches over your tree and put them on a branch for you: http://cgit.freedesktop.org/~vivijim/drm-intel/log/?h=snb-rc6-43267 Could you test the head of this branch and if possible also individual commits. Thanks again, Rodrigo. Hi Rodrigo, Testing will take a while ;-) Up to now: * "Original" 3.9.0.rc5 with partial patch accumulates 15:30h without error. * Head branch accumulates 4:15h without error. * Head with TurboDisable patch failed after 1:42 uptime with a single error: [drm:i915_hangcheck_hung] *ERROR* Hangcheck timer elapsed... GPU hung [ 6116.667033] [drm] capturing error event; look for more information in/sys/kernel/debug/dri/0/i915_error_state Report captured, if you are interested. * Now testing head with FixFreqTable patch. Will continue to inform as new results will became available. Best Regards, Jordi. Maybe it will not take so much time after all... * Head with FixfreqTable failed after 00:12h uptime. 
First with several: [drm:__gen6_gt_force_wake_get] *ERROR* Timed out waiting for forcewake to ack request. [drm:__gen6_gt_wait_for_thread_c0] *ERROR* GT thread status wait timed out Then several crashes like: WARNING: at drivers/gpu/drm/i915/intel_pm.c:4350 gen6_gt_check_fifodbg+0x41/0x60 [i915]() followed by an stack dump. Report captured. * Now testing head with FixRPControl. So, does this branch's head with 3 patches applied work similar to that partial first patch? I'm currious if I might be missing something from the first partial one. Also if I understood correctly, any of this patches alone fixed the error and only the sum of them are fixing the hungs? Rodrigo, I'm still doing my homework with that branch ;-) I need to test on my home system, and I have not much time lately... However, today I've been testing the base branch, just to be absolutely sure that the bug is still present there, and it has crashed after an accumulated 5:30h (1:15h today). It failed with: "[drm:i915_hangcheck_hung] *ERROR* Hangcheck timer elapsed... GPU hung [drm] capturing error event; look for more information in/sys/kernel/debug/dri/0/i915_error_state". Desktop freezes completely for more than 20 seconds. I was able to switch to console and get a report. Then, switched back to graphics and eventually recovered, but with small freezes and lags in desktop for several seconds. Lots of crash errors followed on dmesg after the first one. Summarizing the tests: * "Original" 3.9.0.rc5 with partial patch accumulated 15:30h without error. * Head branch (no patch) failed after 5:30h. Single error followed by multiple crashes. * Head with TurboDisable patch failed after 1:42h uptime with a single error, but I didn't wait to see if crashes appeared in dmesg after a while. * Head with FixfreqTable failed after 00:12h uptime. Several errors and crashes. * Now testing head with FixRPControl. No error for about 6h uptime. So, maybe FixRPControl does the trick. 
Or maybe I have simply been lucky not to hit the bug so far, and it will appear eventually. I will try to accumulate enough running time to be as sure as possible about it.
Mmm...
* Head with FixRPControl failed after about 7h uptime, with a single error, but after a freeze of about 20 seconds the system recovered and I am still working with it. No more errors after 30 minutes.
So, no single patch seems to completely remove the bug. Tomorrow I will continue by testing the branch with all three patches applied together, like the original partial patch, to see if this changes something.
Tests with all three patches applied together also failed. First with a crash in glxgears after some 2:15h uptime, and then after some 5:30h the same type of errors appeared...
So, either the first partial patch contains something more, or I have simply been lucky not to hit the bug in the first 3.9 branch with that patch. I am using that kernel right now, with an accumulated uptime of more than 21h without any symptom of the bug.
I feel lost in regards to this bug and how to proceed from here. The last test results are a bit disorienting.
Well, confirmed at last. The bug was still present in the "original" 3.9.0.rc5 with RV's partial patch. It failed with a single error after more than 30h uptime. So, the patch does not resolve the issue.
In the meantime, I've updated my work PC to Ubuntu 13.04 (kernel 3.8.0-19-generic). The interesting thing is that it failed today with the classic message:
[drm:i915_hangcheck_hung] *ERROR* Hangcheck timer elapsed... GPU hung
I say it is interesting because it never failed with previous kernel versions (which did fail on my home PC). Furthermore, as it is up and running much longer than my home desktop, it would be easy to test kernels or patches and accumulate uptime. So, feel free to share any patch you would like me to test.
Hi jrierab, many rc6 fixes were added recently.
Could you please check if you still face this issue on a newer tree? Thanks, Rodrigo.
jrierab, ping.
Created attachment 110881 [details]
Kernel_intel_nightly_2103_08_24 error
PING jrierab (jrierab) 56(84) bytes of data.
64 bytes from jrierab: icmp_req=1 ttl=64 time=several days
--- jrierab ping statistics ---
1 packets transmitted, 1 received, 0% packet loss, time several days
Sorry, mates. First I was on holiday, and then I missed the email among a pile of accumulated junk.
To the point. The last test I performed was on Kernel_intel_nightly_2103_08_24 3.11.0. Sorry to say that some variant of this bug was still present:
[29117.440040] [drm:i915_hangcheck_elapsed] *ERROR* stuck on render ring
[29117.440043] [drm:i915_hangcheck_elapsed] *ERROR* stuck on blitter ring
[29117.440045] [drm] capturing error event; look for more information in /sys/class/drm/card0/error
I will test the latest intel kernel branch to check how it does now.
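Since most of the testing in this report comes down to "how long was the machine up before the first i915 error", a small script can pull that out of a saved dmesg. The following is only a minimal sketch, not a tool used in this report: the helper names `first_errors` and `format_uptime` are hypothetical, the patterns are copied from the messages quoted in this thread, and it assumes the bracketed dmesg timestamp is seconds since boot. Note that it reports uptime since the current boot, not uptime accumulated across several boots.

```python
import re

# Message fragments copied from the errors quoted in this report.
PATTERNS = {
    "hangcheck": re.compile(r"Hangcheck timer elapsed|stuck on \w+ ring"),
    "forcewake": re.compile(r"Timed out waiting for forcewake"),
    "thread_c0": re.compile(r"GT thread status wait timed out"),
}

# Leading dmesg timestamp, e.g. "[29117.440040]" or "[ 6116.667033]".
TIMESTAMP = re.compile(r"^\[\s*(\d+\.\d+)\]")


def format_uptime(seconds):
    """Format seconds since boot as h:mm, matching the style used in the thread."""
    hours, remainder = divmod(int(seconds), 3600)
    return f"{hours}:{remainder // 60:02d}h"


def first_errors(lines):
    """Map each error kind to the timestamp of its first occurrence.

    The value is None when the matching line carries no timestamp.
    Kinds that never occur are left out of the result.
    """
    found = {}
    for line in lines:
        match = TIMESTAMP.match(line)
        stamp = float(match.group(1)) if match else None
        for kind, pattern in PATTERNS.items():
            if kind not in found and pattern.search(line):
                found[kind] = stamp
    return found


# Sample input: the three lines quoted above from the 3.11.0 nightly test.
sample = [
    "[29117.440040] [drm:i915_hangcheck_elapsed] *ERROR* stuck on render ring",
    "[29117.440043] [drm:i915_hangcheck_elapsed] *ERROR* stuck on blitter ring",
    "[29117.440045] [drm] capturing error event; look for more information in /sys/class/drm/card0/error",
]

result = first_errors(sample)
for kind, stamp in result.items():
    print(kind, format_uptime(stamp) if stamp is not None else "no timestamp")
```

To run it against a real log, the `sample` list could be replaced with the lines of a file saved via `dmesg > dmesg.txt`.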
Created attachment 111011 [details]
Full report (dmesg, cpuinfo, version, /proc/dri/0/* etc...) drm-intel kernel 2013-10-13
I've compiled and tested the latest drm-intel kernel. Lots of drm-related errors, so it seems that something similar to this bug is still present. The good news is that there are no noticeable hangs for the moment.
[drm:__gen6_gt_force_wake_get] *ERROR* Timed out waiting for forcewake to ack request.
[drm:__gen6_gt_wait_for_thread_c0] *ERROR* GT thread status wait timed out
[drm] stuck on render ring
[drm] stuck on blitter ring
[drm:__gen6_gt_force_wake_get] *ERROR* Timed out waiting for forcewake to ack request.
[drm:__gen6_gt_wait_for_thread_c0] *ERROR* GT thread status wait timed out
An error report is attached.
Yeah, your gpu seems to drop off the world completely on occasion. No ideas in the pipeline for how to fix this unfortunately.
Please test Ken's snb blorp fixes from http://cgit.freedesktop.org/~kwg/mesa/log/?h=snbfixes
Note that this is a mesa series, not kernel patches. But it could be that a gpu hang caused by mesa results in your gpu dropping off the earth - we unfortunately can't exactly tell where it died :(
Ok, Daniel. I've never compiled a mesa patch until now. Will try. Have you seen this comment in launchpad? https://bugs.launchpad.net/ubuntu/+source/linux/+bug/946899/comments/105
It seems that André can consistently reproduce this bug. This may be of some help, at the very least to test patches.
I've spammed a lot of bug reports with test requests. Changes in the snb mesa flushing code (which Ken's patches mostly are) are known to be fairly risky, so we're gathering tested-bys to make sure we can safely backport the patches to stable mesa releases.
Hi jrierab, did you have any luck with the mesa patch Daniel pointed out?
Any news on this? Any luck with new code?
I have been testing with Ken's patches and am still getting these errors:
[drm:__gen6_gt_force_wake_get] *ERROR* Timed out waiting for forcewake to ack request.
[drm:__gen6_gt_wait_for_thread_c0] *ERROR* GT thread status wait timed out
The hangs seem to be shorter and less frequent though.
Appears to be fixed with a BIOS update to my ASRock Z68 Pro3-M motherboard. The update changes the IGPU voltage from 'auto' to 'fixed' at 1.25V. Only downside so far is the CPU is running hotter now.
Yeah, smells like a hw issue then, closing accordingly.