Bug 43267 - [SNB rc6] GPU hangs with i915_enable_rc6=1
Summary: [SNB rc6] GPU hangs with i915_enable_rc6=1
Status: RESOLVED INVALID
Alias: None
Product: Drivers
Classification: Unclassified
Component: Video(DRI - Intel)
Hardware: All Linux
Importance: P1 high
Assignee: Rodrigo Vivi
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2012-05-19 19:09 UTC by jrierab
Modified: 2014-06-18 16:15 UTC

See Also:
Kernel Version: 3.8.rc6
Subsystem:
Regression: Yes
Bisected commit-id:


Attachments
Full report (dmesg, cpuinfo, version, /proc/dri/0/* etc...) (28.75 KB, application/octet-stream)
2012-05-19 19:09 UTC, jrierab
dmesg drm-intel-experimental kernel (247.24 KB, text/plain)
2012-05-19 20:46 UTC, jrierab
Full report (dmesg, cpuinfo, version, /proc/dri/0/* etc...) drm-intel-experimental kernel (357.65 KB, application/octet-stream)
2012-05-19 20:52 UTC, jrierab
dmesg (full) Ubuntu 3.0.0-17.30-generic 3.0.22 (55.40 KB, text/plain)
2012-05-19 21:30 UTC, jrierab
dmesg (full) Linux version 3.4.0-994-generic (76.69 KB, text/plain)
2012-05-19 21:52 UTC, jrierab
Full report (dmesg, cpuinfo, version, /proc/dri/0/* etc...) drm-intel-experimental kernel (362.66 KB, application/x-gzip)
2012-05-19 21:53 UTC, jrierab
Full report (dmesg, cpuinfo, version, /proc/dri/0/* etc...) drm-intel-experimental kernel v2012-05-22 (405.13 KB, application/x-gzip)
2012-05-23 17:12 UTC, jrierab
Full report (dmesg, cpuinfo, version, /proc/dri/0/* etc...) drm-intel-experimental kernel v2012-05-26 (360.02 KB, application/octet-stream)
2012-05-30 17:41 UTC, jrierab
Full report (dmesg, cpuinfo, version, /proc/dri/0/* etc...) mainline kernel 3.5-rc2 (351.12 KB, application/octet-stream)
2012-06-10 16:15 UTC, jrierab
Full report (dmesg, cpuinfo, version, /proc/dri/0/* etc...) mainline kernel 3.5-rc2 i915_enable_rc6=2 (354.07 KB, application/octet-stream)
2012-06-14 18:27 UTC, jrierab
Full report (dmesg, cpuinfo, version, /proc/dri/0/* etc...) mainline kernel 3.5-rc2 i915_enable_rc6=3 (364.63 KB, application/octet-stream)
2012-06-14 18:29 UTC, jrierab
Full report (dmesg, cpuinfo, version, /proc/dri/0/* etc...) mainline kernel 3.5-rc3 patch attemp (362.73 KB, application/octet-stream)
2012-06-23 19:41 UTC, jrierab
Full report (dmesg, cpuinfo, version, /proc/dri/0/* etc...) mainline kernel 3.6-rc6 (default, rc6=0, rc6=1) (71.43 KB, application/x-gzip)
2012-09-22 20:45 UTC, jrierab
Full report (dmesg, cpuinfo, version, /proc/dri/0/* etc...) mainline kernel 3.6-rc7 (35.47 KB, application/x-zip)
2012-10-08 16:53 UTC, jrierab
Full report (dmesg, cpuinfo, version, /proc/dri/0/* etc...) mainline kernel 3.8.rc4 (93.49 KB, application/x-7z-compressed)
2013-01-26 19:30 UTC, jrierab
Full report (dmesg, cpuinfo, version, /proc/dri/0/* etc...) mainline kernel 3.8.rc6 + SNA (30.86 KB, application/x-7z-compressed)
2013-02-07 18:59 UTC, jrierab
Full report (dmesg, cpuinfo, version, /proc/dri/0/* etc...) mainline kernel 3.8.rc6 + UXA (27.93 KB, application/x-7z-compressed)
2013-02-10 10:50 UTC, jrierab
Fix SNB rc6 init with documented sequence and threashold values (5.19 KB, patch)
2013-02-18 13:54 UTC, Rodrigo Vivi
Full report (dmesg, cpuinfo, version, /proc/dri/0/* etc...) mainline kernel 3.9..0-rc5 + UXA (44.59 KB, application/x-7z-compressed)
2013-04-01 15:45 UTC, jrierab
Git Kernel HEAD 07961ac7c0ee8b546658717034fe692fd12eefa9 vs patched diff (4.51 KB, patch)
2013-04-05 18:26 UTC, jrierab
Kernel_intel_nightly_2103_08_24 error (305.78 KB, application/gzip)
2013-10-13 16:23 UTC, jrierab
Full report (dmesg, cpuinfo, version, /proc/dri/0/* etc...) drm-intel kernel 2013-10-13 (338.09 KB, application/gzip)
2013-10-14 20:37 UTC, jrierab

Description jrierab 2012-05-19 19:09:47 UTC
Created attachment 73330 [details]
Full report (dmesg, cpuinfo, version, /proc/dri/0/* etc...) 

[1.] One line summary of the problem:

System hangs or becomes unusable with error [drm:i915_hangcheck_elapsed] *ERROR* Hangcheck timer elapsed... GPU hung



[2.] Full description of the problem/report:

The bug produces several 2-3 second black screens, each followed by 10-20 seconds of normal operation, normally ending in a complete desktop hang which requires a full reset. Sometimes the hang does not occur, but all window decorations and the Unity bar disappear (as if the window manager were dead).

The bug appears randomly, but more often when switching between desktop workspaces and with the Firefox browser open.

I reported this at https://bugs.launchpad.net/ubuntu/+source/linux/+bug/946899 and was directed here for support. Below I copy the updated details:

I have 3 similar systems with the following configurations:

1- Work PC, with Precise updated from Oneiric since Beta 2, i5-2400. Does not seem to be affected by the bug.

2- Main home PC, with Precise updated from Oneiric since Beta 2, i5-2500K. The bug and complete system hangs occur so often that it is nearly unusable (so I marked the bug as highly severe). This also happens with the latest mainline kernel 3.4-rc7-precise (http://kernel.ubuntu.com/~kernel-ppa/mainline/). However, if I start with kernel 3.0.17 from Oneiric, the bug does not appear (so I mark this bug as a regression). I have been working with it for nearly two days without a single hang. So, this may be a workaround.

3- Same home PC, fresh Ubuntu Precise distribution installed in a clean partition, same i5-2500K. I use it as a test platform. The bug has occurred, and the attached reports belong to this clean system. Nothing beyond the defaults is installed, save the Precise updates and kernel 3.4-rc7-precise.

Also, the kernel from http://kernel.ubuntu.com/~kernel-ppa/mainline/drm-intel-experimental/2012-05-14-precise/ has been tested on the clean system and the bug is still present.



[3.] Keywords (i.e., modules, networking, kernel):

drm, i915_hangcheck_elapsed, GPU hung, 3.4-rc7-precise



[4.] Kernel version (from /proc/version):

Linux version 3.4.0-030400rc6-generic (apw@gomeisa) (gcc version 4.4.3 (Ubuntu 4.4.3-4ubuntu5.1) ) #201205061835 SMP Sun May 6 22:36:08 UTC 2012



[7.] Environment

All relevant files included in attachment.



[7.2.] Processor information (from /proc/cpuinfo):

processor	: 0
vendor_id	: GenuineIntel
cpu family	: 6
model		: 42
model name	: Intel(R) Core(TM) i5-2500K CPU @ 3.30GHz
stepping	: 7
microcode	: 0x14
cpu MHz		: 1600.000
cache size	: 6144 KB
physical id	: 0
siblings	: 4
core id		: 0
cpu cores	: 4
apicid		: 0
initial apicid	: 0
fpu		: yes
fpu_exception	: yes
cpuid level	: 13
wp		: yes
flags		: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx rdtscp lm constant_tsc arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc aperfmperf pni pclmulqdq dtes64 monitor ds_cpl vmx est tm2 ssse3 cx16 xtpr pdcm pcid sse4_1 sse4_2 popcnt tsc_deadline_timer aes xsave avx lahf_lm ida arat epb xsaveopt pln pts dts tpr_shadow vnmi flexpriority ept vpid
bogomips	: 6585.00
clflush size	: 64
cache_alignment	: 64
address sizes	: 36 bits physical, 48 bits virtual
power management:


[7.3.] Module information (from /proc/modules):
[7.4.] Loaded driver and hardware information (/proc/ioports, /proc/iomem)
[7.5.] PCI information ('lspci -vvv' as root)
[7.6.] SCSI information (from /proc/scsi/scsi)

All this information is also included in attachment.



[7.7.] Other information that might be relevant to the problem
       (please look in /proc and include all information that you
       think to be relevant):

The contents of /proc/dri/0/* is also included in attachment.



[X.] Other notes, patches, fixes, workarounds:

As mentioned before, a workaround is to use kernel 3.0.17 from Oneiric. So, it seems to me a regression.

Please do not hesitate to contact me for whatever info or test I can provide to help in resolving this issue.
Comment 1 Daniel Vetter 2012-05-19 19:32:00 UTC
We need the i915_error_state from debugfs. Also, the dmesg seems to be from a pretty old kernel (we should have fixed the other bugs you're hitting in there), so please double-check that you're indeed running the drm-intel-experimental kernel. Then attach the complete dmesg (make sure it contains everything from the first boot message up to and including the first hangcheck report).

Please attach these as individual text/plain files, it makes handling them much easier.

Also, please list the versions of your userspace driver, i.e. mesa, libdrm, xf86-video-intel.
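
The two requests above (the error state as its own plain-text file, plus the userspace versions) could be scripted roughly as follows. This is only a sketch: the default path assumes debugfs is mounted at /sys/kernel/debug and that the card is DRM minor 0, and the function names are made up for illustration.

```shell
#!/bin/sh
# Copy the i915 error state into a plain-text file that can be attached.
# $1 = source file (assumed default: debugfs, DRM minor 0), $2 = output file.
collect_error_state() {
    src="${1:-/sys/kernel/debug/dri/0/i915_error_state}"
    dst="${2:-i915_error_state.txt}"
    if [ ! -r "$src" ]; then
        echo "cannot read $src (is debugfs mounted? are you root?)" >&2
        return 1
    fi
    cat "$src" > "$dst"
}

# Print package versions on a Debian/Ubuntu system.
# xserver-xorg-video-intel is the Ubuntu package name for xf86-video-intel.
list_versions() {
    dpkg-query -W -f='${Package} ${Version}\n' \
        xserver-xorg-video-intel libdrm2 libdrm-intel1 libgl1-mesa-dri
}
```

Run as root, `collect_error_state` with no arguments writes i915_error_state.txt to the current directory; `list_versions` only works where dpkg is available.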
Comment 2 jrierab 2012-05-19 20:46:42 UTC
Created attachment 73333 [details]
dmesg drm-intel-experimental kernel
Comment 3 jrierab 2012-05-19 20:52:10 UTC
Created attachment 73334 [details]
Full report (dmesg, cpuinfo, version, /proc/dri/0/* etc...) drm-intel-experimental kernel
Comment 4 Daniel Vetter 2012-05-19 20:59:36 UTC
Oops, I've mixed things up, this is a different warning. Your gpu is pretty much dropping every single write. gpu hangs are highly expected.

I've never seen anything like that.

Can you please also attach the dmesg when booting up on 3.0.17 on the same system?
Comment 5 Daniel Vetter 2012-05-19 21:01:33 UTC
Also, the dmesg is still cut off. You can try booting with log_buf_size=4M or so to make the dmesg buffer much bigger.
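
A note on the parameter name: as far as I know, the option listed in the kernel's kernel-parameters documentation for enlarging the log buffer is log_buf_len (for example log_buf_len=4M); I am not aware of a log_buf_size option, so please verify against the documentation for your kernel. The sketch below checks whether a given option is present on the running kernel's command line; the helper name is hypothetical.

```shell
#!/bin/sh
# Return success if a boot option is present on the kernel command line.
# $1 = option name (with or without "=value" on the cmdline)
# $2 = cmdline file (defaults to /proc/cmdline; overridable for testing)
has_boot_param() {
    name="$1"
    file="${2:-/proc/cmdline}"
    for word in $(cat "$file"); do
        case "$word" in
            "$name"|"$name"=*) return 0 ;;
        esac
    done
    return 1
}
```

For example, `has_boot_param log_buf_len && dmesg > dmesg.txt` would only save the log when the larger buffer was actually requested at boot.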
Comment 6 jrierab 2012-05-19 21:09:20 UTC
Thanks for your quick answer, Daniel !

The previous attachment was from the latest available upstream kernel v3.4-rc6-precise (rc7 is not complete on http://kernel.ubuntu.com/~kernel-ppa/mainline/v3.4-rc7-precise/).

Now, I am attaching the info using the drm-intel-experimental/2012-05-14-precise (latest). 

* The dmesg is run without parameters. Does it include all your needed info?

* The i915_error_state is included in the "Full Report Experimental" (dri directory). I am not able to upload it as plain text, because the system does not allow files that big.

The versions used:

xf86-video-intel: 2:2.17.0-1ubuntu4
libdrm-intel1:    2.4.32-1ubuntu1
libdrm2:          2.4.32-1ubuntu1
libgl1-mesa-dri:  8.0.2-0ubuntu3

I am not sure if this is the info you need (my first time reporting an upstream bug). If anything else, let me know.
Comment 7 jrierab 2012-05-19 21:30:29 UTC
Created attachment 73337 [details]
dmesg (full) Ubuntu 3.0.0-17.30-generic 3.0.22
Comment 8 jrierab 2012-05-19 21:52:21 UTC
Created attachment 73338 [details]
dmesg (full) Linux version 3.4.0-994-generic
Comment 9 jrierab 2012-05-19 21:53:46 UTC
Created attachment 73339 [details]
Full report (dmesg, cpuinfo, version, /proc/dri/0/* etc...) drm-intel-experimental kernel

This contains full dmesg to compare against 3.0.0.17 one.
Comment 10 Daniel Vetter 2012-05-19 22:45:29 UTC
Whatever the cause, your gpu seems to literally drop off the earth. First we have some WARN backtraces about debug registers, where the hw tells us that it had to drop some writes. Then the gpu dies. And in the error_state all hw registers in the gt domain are zero.

I have no idea what's going on here. Can you please try to bisect where this issue has been introduced?
Comment 11 Chris Wilson 2012-05-20 09:21:43 UTC
(In reply to comment #10)
> Then the gpu dies. And in the error_state all hw
> registers in the gt domain are zero.

I speak from experience that it is possible to kill the (SNB) GPU so completely that the register reads return zero from hangcheck. Still the general trend is for the hw to become more resilient...

It may even be possible that it is the hang killing the GPU that prevents the writes, rather than the dropped writes killing the GPU.
Comment 12 jrierab 2012-05-20 09:38:06 UTC
I will try my best to bisect the problem. It will take some time, though, because I do not have an exact procedure to trigger the bug. It may happen in the first 30s after booting, or the system can run flawlessly for half an hour. So if it happens, the bug is present for sure. But if not... I have to run that kernel long enough to be reasonably certain that the bug is really absent and has not simply failed to trigger yet.

I will perform the test on my production system (sic) and report back. First of all, I would try to find in which branch the bug was introduced. We know for sure it is not present in 3.0.0.17 (running stable for more than one month now) and it is present in 3.2.0.23. This gives three branches and a lot of kernels to test...
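
For planning purposes, a bisection over N candidate commits needs roughly log2(N) build-and-boot cycles. The sketch below shows the usual git bisect commands as comments (they need a kernel git tree and a reboot per step, so they are not executed here) and a small helper, with a made-up name, that estimates the number of steps.

```shell
#!/bin/sh
# Usual bisection session between a known-good and a known-bad revision:
#   git bisect start
#   git bisect bad  <first-known-bad-revision>
#   git bisect good <last-known-good-revision>
#   ...build, boot and test, then mark the result:
#   git bisect good      # or: git bisect bad
#   git bisect reset     # when finished

# Estimate how many test cycles a bisection over $1 commits takes
# by halving the candidate range until one commit is left.
bisect_steps() {
    n="$1"
    steps=0
    while [ "$n" -gt 1 ]; do
        n=$(( (n + 1) / 2 ))
        steps=$(( steps + 1 ))
    done
    echo "$steps"
}
```

With a bug that can take half an hour to show up, `bisect_steps 1000` (which prints 10) translates into several hours of testing at the very least.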
Comment 13 jrierab 2012-05-23 17:12:13 UTC
Created attachment 73363 [details]
Full report (dmesg, cpuinfo, version, /proc/dri/0/* etc...) drm-intel-experimental kernel v2012-05-22

Latest kernel produces a
[drm:i915_hangcheck_hung] *ERROR* Hangcheck timer elapsed... GPU hung
a few seconds after boot, but without the 2-3 second black screen, only a noticeable delay (so I could get the dmesg).

Then, after some 5 minutes, the window decorations and Unity disappeared. The full report corresponds to this situation. There are some errors and a
[drm:i915_hangcheck_hung] *ERROR* Hangcheck timer elapsed... GPU hung
Comment 14 jrierab 2012-05-23 17:21:08 UTC
I am currently testing kernels from http://kernel.ubuntu.com/~kernel-ppa/mainline/. I have been using kernel 3.2.11-030211-generic for some hours now without a related crash.

I have a question, because I do not understand the different numbering of kernels:
Ubuntu uses 3.2.0-24-generic, while in the PPA something like 3.2.11-030211-generic is used, for instance. Is the PPA kernel version equivalent to an Ubuntu 3.2.0-11-generic one? 

Because if it is not the case, there is something which escapes me... A good hint or pointer will be appreciated.
Comment 15 jrierab 2012-05-27 22:33:13 UTC
Answering to myself about the different kernel versions between Ubuntu and mainline: http://kernel.ubuntu.com/~kernel-ppa/info/kernel-version-map.html.

Now, the tests:
* 3.2.14-030214-generic     OK (1h)
* 3.2.15-030215-generic     OK (4h) 1 IRQ error making USB transfers sloooow down (not related to this bug)

These results are a bit surprising, as Ubuntu 3.2.0.23 presents the bug and is based on mainline 3.2.14, which seems to be good... Can it be because a patch released for a later mainline kernel version was applied to the Ubuntu one?

I will continue to test mainline kernels until one fails, then go back to be sure the previous one is stable. Does anyone have a better idea or suggestions?
Comment 16 Daniel Vetter 2012-05-29 08:13:36 UTC
Ok, because the gpu disappeared it's been a bit harder than usual to figure out where exactly it died, but thanks to the new ring state dumping in drm-intel-experimental things worked out. The gpu seems to die in this tiny blitter batchbuffer:

batchbuffer (blitter ring) at 0x0f8e3000:
0x0f8e3000:      0x54f08806: XY_SRC_COPY_BLT (rgb enabled, alpha enabled, src tile 1, dst tile 1)
0x0f8e3004:      0x03cc0500:    format 8888, pitch 1280, rop 0xcc, clipping disabled,  
0x0f8e3008:      0x001d03a1:    dst (929,29)
0x0f8e300c:      0x002d03b1:    dst (945,45)
0x0f8e3010:      0x0498d000:    dst offset 0x0498d000
0x0f8e3014:      0x001d03a1:    src (929,29)
0x0f8e3018:      0x00000500:    src pitch 1280
0x0f8e301c:      0x077ea000:    src offset 0x077ea000
0x0f8e3020:      0x40c00001: XY_SETUP_CLIP_BLT (rgb disabled, alpha disabled, src tile 0, dst tile 0)
0x0f8e3024:      0x00000000:    cliprect (0,0)
0x0f8e3028:      0x00000000:    cliprect (0,1280)
0x0f8e302c:      0x05000000: MI_BATCH_BUFFER_END
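
To make the decoded values above easier to check by hand: in this dump the coordinate dwords carry x in the low 16 bits and y in the high 16 bits, and the dword after the opcode carries the pitch in the low 16 bits and the raster op in bits 16-23. A minimal sketch, with hypothetical helper names and treating the fields as unsigned:

```shell
#!/bin/sh
# Decode a coordinate dword such as 0x001d03a1 into "(x,y)".
decode_xy() {
    dword=$(( $1 ))
    x=$(( dword & 0xffff ))           # low 16 bits
    y=$(( (dword >> 16) & 0xffff ))   # high 16 bits
    echo "($x,$y)"
}

# Decode pitch (low 16 bits) and raster op (bits 16-23) from a dword
# such as 0x03cc0500.
decode_pitch_rop() {
    dword=$(( $1 ))
    pitch=$(( dword & 0xffff ))
    rop=$(( (dword >> 16) & 0xff ))
    printf 'pitch %d, rop 0x%02x\n' "$pitch" "$rop"
}
```

`decode_xy 0x001d03a1` prints (929,29) and `decode_pitch_rop 0x03cc0500` prints "pitch 1280, rop 0xcc", matching the dump.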
Comment 17 Daniel Vetter 2012-05-29 08:15:42 UTC
Quick check: Are you enabling the i915_enable_fbc module option by chance?
Comment 18 Chris Wilson 2012-05-29 08:20:08 UTC
(In reply to comment #16)
> The gpu seems to die in this tiny
> blitter batchbuffer:

Not quite. Look again. ;-)
Comment 19 jrierab 2012-05-29 18:26:41 UTC
(In reply to comment #17)
> Quick check: Are you enabling the i915_enable_fbc module option by chance?

Me? Nope. I do not have any special tweaks. Booting with default options.

My test system is even a clean one, with no additional SW (only the Ubuntu updates).
Comment 20 Daniel Vetter 2012-05-29 18:54:04 UTC
To double-check for paranoia, because iirc debian/ubuntu enabled fbc by default once: Can you please boot with i915.i915_enable_fbc=0 added to the kernel cmdline, check that it is indeed disabled with

cat /sys/module/i915/parameters/i915_enable_fbc

and then try to reproduce the issue?
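
The same check can be wrapped in a small helper so the value ends up in the report alongside the parameter name. This is just a sketch; the helper name is made up, and the directory argument exists only so the function can be exercised without the module loaded.

```shell
#!/bin/sh
# Print "name=value" for an i915 module parameter.
# $1 = parameter name, $2 = parameters directory (defaults to sysfs).
show_i915_param() {
    dir="${2:-/sys/module/i915/parameters}"
    if [ -r "$dir/$1" ]; then
        printf '%s=%s\n' "$1" "$(cat "$dir/$1")"
    else
        echo "$1: not available in $dir" >&2
        return 1
    fi
}
```

For example, `show_i915_param i915_enable_fbc >> report.txt` records the effective setting next to the dmesg.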
Comment 21 jrierab 2012-05-30 17:41:55 UTC
Created attachment 73470 [details]
Full report (dmesg, cpuinfo, version, /proc/dri/0/* etc...) drm-intel-experimental kernel v2012-05-26

Latest drm-intel-experimental (2012-05-26) kernel, booted with i915.i915_enable_fbc=0

This time the GUI freezes (but window decorations are still there and the mouse still moves normally; however, menus and panels do not respond to clicks). I could get the report by switching to console. When back in the GUI, I needed to start a new session and the whole GUI was weird, with small portions of windows or decorations drawn and others not, all of it changing when moving the mouse.

It is interesting to see in dmesg that the bug fires even before the system is completely up (I mean the "eth0: no IPv6 routers present" line, which is where non buggy kernels end after full boot). At least, I did not notice this behaviour before.

If you think there is anything else I can test, do not hesitate to ask :-)
Comment 22 Daniel Vetter 2012-06-06 13:17:37 UTC
I've just noticed that you have a bunch of virtual box drivers in your kernel. Can you please try to reproduce this issue without them loaded? Preferably on a mainline kernel ...
Comment 23 jrierab 2012-06-10 16:15:27 UTC
Created attachment 73552 [details]
Full report (dmesg, cpuinfo, version, /proc/dri/0/* etc...) mainline kernel 3.5-rc2

The vbox drivers were present because that report was from my main system. The other ones were from my clean system, however.

Anyway, here is the report for the first sign of the bug (a brief system freeze) on my clean partition (Ubuntu Precise with up-to-date updates and nothing else) for the latest mainline kernel, v3.5-rc2-quantal.

This first small (2-5 seconds) "freeze" did not produce a black screen or missing decorations. Several seconds after that, the system froze completely. I was unable to switch to a shell session and it did not even respond to the power button (until I held it long enough to force a power-off, of course). All the while, the cursor was responding to my mouse movements, but nothing else. Obviously, I could not collect any info for this complete freeze.
Comment 24 jrierab 2012-06-11 21:20:16 UTC
Well, after so many tests in the 3.2 branch, at last some good news:

* 3.4.0-030400rc2-generic  KO	-10min	(GPU Hung! Bug certainly present here)
* 3.4.0-030400rc1-generic  OK	45min   (seems to be safe)

I will work some hours with 3.4-rc1 to confirm that it is completely free from the bug. It seems that a patch introduced between these two kernels has been backported to the Ubuntu ones, which are based on 3.2.14 and 3.2.16 mainline kernels (but original mainline kernels 3.2.14-3.2.19 seem to be safe according to my tests).
Comment 25 Daniel Vetter 2012-06-12 13:13:13 UTC
Hm, one big thing we've changed between -rc1 and -rc2 is that rc6 is now enabled by default. Can you please check whether disabling that with i915.i915_enable_rc6=0 on the kernel cmdline prevents the hangs even on broken kernels?
Comment 26 jrierab 2012-06-12 23:11:23 UTC
YES !

This seems to be it!!! Running nearly 2h with 3.4.0-030400rc2-generic without the bug, when loading with i915.i915_enable_rc6=0. Tomorrow I will test the last official Ubuntu kernel with the same option, just to confirm.
Comment 27 jrierab 2012-06-13 23:05:19 UTC
Today, more than 3h with Ubuntu kernel 3.2.0-24.39-generic (based on mainline 3.2.16) with i915.i915_enable_rc6=0, without errors. It seems clear that the rc6 trick has something to tell us about this bug ;-)
Comment 28 Chris Wilson 2012-06-13 23:18:52 UTC
Just to confirm, all 3 of your systems fail with drm-intel-experimental, but work with i915.i915_enable_rc6=0?

Do you use the same config for all? Might be worth attaching that.
Comment 29 Daniel Vetter 2012-06-13 23:39:01 UTC
Yep, it does. Can you please retry this exercise with the latest 3.5-rc kernel, too? Just to make sure this trick still works there ...

We're working on a few patches around rc6, so stay tuned.
Comment 30 Eugeni Dodonov 2012-06-14 12:14:01 UTC
Could you also try using other combination of i915.i915_enable_rc6 parameters, like:
 - i915.i915_enable_rc6=2
 - i915.i915_enable_rc6=3

to check if it still happens?
Comment 31 jrierab 2012-06-14 17:53:15 UTC
(In reply to comment #28)
> Just to confirm, all 3 of your systems fail with drm-intel-experimental, but
> work with i915.i915_enable_rc6=0?

There are only two PCs involved. One does not fail (i5-2400) and is running the current Ubuntu kernel. The other one (i5-2500K) has two partitions: the first one is my main system, which has lots of things installed, and the second one is a clean Ubuntu Precise installation with up-to-date updates and nothing more (this is ideal for testing). Normally I do my tests in the clean partition. However, as I have had to test so many kernel revisions and options (more than 20 so far), I sometimes perform the tests on the main system, especially when I believe the kernel will not crash and I need to run it for some hours to confirm that.
Comment 32 jrierab 2012-06-14 18:27:59 UTC
Created attachment 73611 [details]
Full report (dmesg, cpuinfo, version, /proc/dri/0/* etc...) mainline kernel 3.5-rc2 i915_enable_rc6=2

3.5.0-030500rc2-generic with i915.i915_enable_rc6=0 working OK for more than 30min (will continue to test later). I believe it is safe.

3.5.0-030500rc2-generic with i915.i915_enable_rc6=2 crashes just after boot (GUI freezes while the cursor keeps moving, but I was able to switch to console to get the report).
Comment 33 jrierab 2012-06-14 18:29:20 UTC
Created attachment 73621 [details]
Full report (dmesg, cpuinfo, version, /proc/dri/0/* etc...) mainline kernel 3.5-rc2 i915_enable_rc6=3

3.5.0-030500rc2-generic with i915.i915_enable_rc6=3 crashes just after boot (GUI freezes while the cursor keeps moving, but I was able to switch to console to get the report).
Comment 34 Eugeni Dodonov 2012-06-15 18:34:40 UTC
Could you please try the patch at http://permalink.gmane.org/gmane.comp.freedesktop.xorg.drivers.intel/11894 ?

By those symptoms, it looks like your GPU could be affected by the issue which that patch tries to address (basically, we wake the GPU from its sleep state, but the report that it is awake comes in before all the threads are really ready. So it could happen that we send commands to it before it is ready to listen to them, and then chaos happens).
Comment 35 jrierab 2012-06-19 19:06:15 UTC
(In reply to comment #34)
> Could you please try the patch at
> http://permalink.gmane.org/gmane.comp.freedesktop.xorg.drivers.intel/11894 ?

Sorry, but I am not an expert on kernel compilation. I have tried to apply the patch to the last 3.5-rc2 mainline kernel, but it has failed (the first 3 hunks).

If there is a pre-built kernel somewhere which has the patch built-in, I would gladly test it. If not, I will try to find a kernel where the patch applies ok...
Comment 36 Daniel Vetter 2012-06-19 19:09:38 UTC
Patch applied perfectly on top of 3.5-rc3, dunno what happened for you (I guess some whitespace mangling somewhere):

http://cgit.freedesktop.org/~danvet/drm/log/?h=for-jirierab
Comment 37 Daniel Vetter 2012-06-19 19:13:40 UTC
Sorry, disregard that, wrong bug report. The patch doesn't apply because it needs another one, too. I'll update the branch.
Comment 38 Daniel Vetter 2012-06-19 19:15:10 UTC
Ok, updated branch with the right patches pushed. See http://cgit.freedesktop.org/~danvet/drm/log/?h=for-jirierab
Comment 39 jrierab 2012-06-23 19:41:40 UTC
Created attachment 74151 [details]
Full report (dmesg, cpuinfo, version, /proc/dri/0/* etc...) mainline kernel 3.5-rc3 patch attemp

Thank you Daniel for applying the patch and generating a test branch for me.

Today I've been able to compile the branch, as follows:

1. Download the zipped branch code.
2. Unzip and cd to the directory.
3. cp /boot/config-$(uname -r) .config   (I was on Ubuntu 3.2.0-25-generic)
4. make oldconfig                        (and accepting all default options)
5. fakeroot make-kpkg --initrd --append-to-version=-custom kernel-image kernel-headers

Unfortunately, the generated kernel does not seem to solve this issue. The system crashed right after the initial boot (the mouse moved, but the system was unresponsive). I was able to switch to console and generate the attached report.

If you see something wrong with my procedure or would like me to test other things, just let me know.
Comment 40 Daniel Vetter 2012-09-17 16:42:27 UTC
Can you please retest with latest drm-intel-fixes?
Comment 41 jrierab 2012-09-17 21:09:12 UTC
Mmm...

I have tested the kernel from http://kernel.ubuntu.com/~kernel-ppa/mainline/drm-intel-experimental/2012-09-17-quantal/ but I can't conclude anything. 

The system boots, but neither the keyboard nor the mouse is working (the rest seems OK: notifications, etc...), so I could not generate any report. The bizarre thing is that the same happens with i915.i915_enable_rc6=0, which is the trick I use with my current kernel to boot and work. So, it seems that this effect is not related to this bug.
Comment 42 Rodrigo Vivi 2012-09-21 02:18:34 UTC
Hi jrierab, could you please test drm-intel-fixes branch from http://cgit.freedesktop.org/~danvet/drm-intel/ ?
Also I'd like to know if there are any differences between 
i915.i915_enable_rc6=0
i915.i915_enable_rc6=1
and also not using this flag at boot cmdline letting it use default.
Thanks
Comment 43 jrierab 2012-09-22 20:45:41 UTC
Created attachment 80761 [details]
Full report (dmesg, cpuinfo, version, /proc/dri/0/* etc...) mainline kernel 3.6-rc6 (default, rc6=0, rc6=1)

Well, good news at last :-)

I've compiled that kernel and tested it on my clean, fully updated system (booting with default, rc6=1, rc6=0). No sign of this bug has shown so far. However, as there is no clear way to trigger the bug, I cannot be absolutely sure it is really gone until enough uptime has accumulated on the system, but my impression is really gooood.

I will install the kernel on my main system and see how it performs in real life. If anything goes wrong, I will report immediately. But, so far, so good. So, many thanks to you guys for your hard work !
Comment 44 Rodrigo Vivi 2012-09-24 16:27:46 UTC
Thanks for verifying that.
I'm closing this bug for now. Feel free to reopen if you face this issue again on your main system.
Comment 45 jrierab 2012-10-08 16:53:49 UTC
Created attachment 82671 [details]
Full report (dmesg, cpuinfo, version, /proc/dri/0/* etc...) mainline kernel 3.6-rc7

Unfortunately, the bug is not gone. I've suffered some related hangs on my system. They are much further apart in time now, but still...

I've noticed a new kernel 3.6-rc7 in http://cgit.freedesktop.org/~danvet/drm-intel/ so I've compiled and tested it. The attachment corresponds to the initial phase of the bug symptoms. If I continue to work without noticing the first symptoms of the bug (basically slow desktop response to the mouse or to interactions with menus), the GPU will hang completely with the following message in dmesg:

[drm:i915_hangcheck_hung] *ERROR* Hangcheck timer elapsed... GPU hung

Normally this will produce a desktop freeze, or even worse, a full system hang.

The attached report corresponds to a default boot (no i915_enable_rc6 trick). I will continue to test with this option set to 0 and 1.

Anything else you need, just let me know.
Comment 46 jrierab 2012-10-08 16:55:05 UTC
Bug is still present in kernel 3.6-rc7. Please check my comment above.
Comment 47 Daniel Vetter 2013-01-16 14:26:05 UTC
We've added a few snb/rc6 related workarounds to 3.8. Can you please retest with the latest 3.8-rc kernels?
Comment 48 jrierab 2013-01-26 19:30:37 UTC
Created attachment 91851 [details]
Full report (dmesg, cpuinfo, version, /proc/dri/0/* etc...) mainline kernel 3.8.rc4

I've been testing kernel 3.8 rc4 (http://kernel.ubuntu.com/~kernel-ppa/mainline/v3.8-rc4-raring/) for nearly a week now. It is very stable. The bug appeared only two times, and I am not sure of the first one (could not get the report about it).

The second time, the desktop ceased to respond (but the mouse was moving) without previous warnings. I was able to switch to console and get the attached report. Then I went back to the desktop and all window decorations were missing, and all the windows were in the same desktop space.

So, this bug seems to be still present, but it now appears very rarely.

BTW the dmesg output is about 250K and it appears truncated. Do you know why that is? I run the kernel with the log_buf_size=4M parameter, so it should be complete. Anyway, I attached an initial dmesg, created just after rebooting from the crash, in case you would like to see the exact system configuration.

Also, I noticed a new 3.8-rc5 kernel, so I will test it ASAP.
Comment 49 Daniel Vetter 2013-01-26 21:52:31 UTC
So we're finally getting somewhere, nice. Since your machine still dies somewhere in a ddx batchbuffer, testing SNA and UXA acceleration backends would be interesting. Please make sure that you have the latest xf86-video-intel release though.
Comment 50 Chris Wilson 2013-01-27 12:08:23 UTC
The error state looks symptomatic of the dropped mmio hangs - nearly all the registers read zero which is a sure sign of a GPU pining for the fjords.
Comment 51 jrierab 2013-01-27 12:51:33 UTC
(In reply to comment #49)
> So we're finally getting somewhere, nice. Since your machine still dies
> somewhere in a ddx batchbuffer, testing SNA and UXA acceleration backends
> would
> be interesting. Please make sure that you have the latest xf86-video-intel
> release though.

Daniel, I was just testing with an up-to-date Ubuntu Quantal system with the latest 3.8.rc4 kernel. Where can I get the latest xf86-video-intel
release? And I don't know what to do with that SNA and UXA acceleration stuff...

I am willing to help with the testing, but as I'm not a kernel/driver developer I will need more specific instructions to do so. If this message was addressed to me, of course.
Comment 52 Daniel Vetter 2013-01-27 16:16:57 UTC
On Sun, Jan 27, 2013 at 1:51 PM,  <bugzilla-daemon@bugzilla.kernel.org> wrote:
> Daniel, I was just testing with an up-to-date Ubuntu Quantal system with the
> latest 3.8.rc4 kernel. Where can I get the latest xf86-video-intel
> release? And I don't know what to do with that SNA and UXA acceleration
> stuff...

xorg-edgers ppa should have it all:
https://launchpad.net/~xorg-edgers/+archive/ppa

SNA/UXA you can select with an xorg.conf snippet:


Section "Device"
        Identifier "igd"
        Driver "intel"
        Option "AccelMethod" "SNA"
EndSection

You can double-check whether it worked by looking for SNA/UXA in Xorg.log
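
The log check mentioned above could look like the sketch below. The default log path is the common /var/log/Xorg.0.log location and the helper name is made up; the exact wording of the driver's log lines varies between versions, so the sketch only searches for the two backend names.

```shell
#!/bin/sh
# Print the Xorg log lines that mention the SNA or UXA acceleration backend.
# $1 = log file (assumed default: /var/log/Xorg.0.log)
accel_lines() {
    log="${1:-/var/log/Xorg.0.log}"
    grep -E 'SNA|UXA' "$log"
}
```

If `accel_lines` prints nothing (grep then returns a non-zero status), the xorg.conf snippet was probably not picked up.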
Comment 53 jrierab 2013-02-07 18:59:28 UTC
Created attachment 92641 [details]
Full report (dmesg, cpuinfo, version, /proc/dri/0/* etc...) mainline kernel 3.8.rc6 + SNA

Another crash testing with kernel 3.8 rc6 with latest intel drivers and SNA acceleration as requested. This time the logs are complete (Xorg conf and log is included also) and the system could be resumed after switching to terminal and then back to desktop.

Will test with UXA acceleration now.
Comment 54 jrierab 2013-02-10 10:50:00 UTC
Created attachment 92821 [details]
Full report (dmesg, cpuinfo, version, /proc/dri/0/* etc...) mainline kernel 3.8.rc6 + UXA

First symptoms of the bug in 3.8.rc6 with UXA acceleration using the latest intel drivers. The desktop freezes for several seconds (5-10) but recovers after that.

dmesg does not show "GPU Hung", but bug starts with several "*ERROR* Timed out waiting for forcewake to ack request." and seems to recover after a "[sched_delayed] sched: RT throttling activated".
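
To tell the two failure signatures seen in this report apart quickly, a saved dmesg can be scanned for both messages. A rough sketch; the function name is hypothetical and the search strings are simply the ones quoted in this thread.

```shell
#!/bin/sh
# Count hangcheck and forcewake-timeout messages in a saved dmesg file.
# $1 = path to the saved log
summarize_gpu_errors() {
    log="$1"
    # grep -c exits non-zero when the count is 0, hence the "|| true".
    hung=$(grep -F -c 'Hangcheck timer elapsed... GPU hung' "$log" || true)
    fw=$(grep -F -c 'Timed out waiting for forcewake' "$log" || true)
    echo "gpu_hung=$hung forcewake_timeouts=$fw"
}
```

Typical use would be `dmesg > dmesg.txt` followed by `summarize_gpu_errors dmesg.txt`.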
Comment 55 Rodrigo Vivi 2013-02-18 13:54:21 UTC
Created attachment 93491 [details]
Fix SNB rc6 init with documented sequence and threashold values

Could you please test this patch and let me know what changed in the issues you are facing?
Comment 56 dorg 2013-03-25 00:12:29 UTC
Hi, today, exactly this error occurred to me after applying an 'apt-get dist-upgrade' to a system.

Here are some system details:
CPU:  Intel Core i5-2500 CPU @ 3.30Ghz
OS:   Ubuntu 12.04.02  (64bit)
NEW KERNEL:   3.0.2-39-generic (buildd@lamiak)   #62-Ubuntu SMP Thu Feb 28 00:28:52 UTC 2013 (Ubuntu 3.2.0-39.62-generic 3.2.39)

OLD KERNEL:   3.0.2-38-generic (buildd@akateko)  #61-Ubuntu SMP Tue Feb 19 12:18:21 UTC 2013

CONFIGURATION:
The OS-configuration is almost as distributed. No heavy tunings.

NOTE:   The 2D version of the Unity desktop is used. At an earlier dist-upgrade (might be 2.0.3-35) the 3D Unity panel ran into trouble: the graphics adapter did not return the rendered objects such as buttons. Rebooting the system did not solve the problem when it occurred. Only a real power-off of the system restored the functionality. I assume the reset of the graphics adapter did not succeed.


I know this is a distribution-specific kernel that is no longer in active development. But this may be a chance to make progress on the filed problem, because on this system the error occurs almost immediately after login when using 3.0.2-39. With the previous kernel, 3.0.2-38, everything seems to be as stable as before the upgrade.

I have access to the system for the next two weeks, so if you would like to get some details and testing, I'm willing to do my best (being almost typical user with some unnoticeable administration experience).

So tell me what information you would like to get and how I should gather it.
Comment 57 dorg 2013-03-25 00:20:31 UTC
sorry, a typo occurred in my previous comment:
please exchange "2.0.3-35" by "3.0.2-35".
Comment 58 dorg 2013-03-25 09:11:09 UTC
Oh no, last night obviously my eyes got scrolled over, so I always mismatched the major version in all version notes:

The erratic system is running with the originally distributed kernel versions of Ubuntu 12.04 / 64bit. So the major kernel version is  3.2.0

The kernel  showing   the error is     3.2.0-39
The kernel not having this problem is  3.2.0-38
Comment 59 Robert Bredereck 2013-03-26 10:59:48 UTC
I have similar (or the same) symptoms on my system (using Ubuntu 12.10, 64bit and an intel core i3-2130 sandy bridge). Here, a workaround seems to be to completely remove the xserver-xorg-video-intel driver package.

This is clearly not acceptable for most people because it also removes a lot of functionality. However, it might be useful for someone.
Comment 60 jrierab 2013-04-01 15:45:36 UTC
Created attachment 96921 [details]
Full report (dmesg, cpuinfo, version, /proc/dri/0/* etc...) mainline kernel 3.9.0-rc5 + UXA

Sorry for taking too long to try your patch, Rodrigo. I've not had much time in the last few weeks...

Today I tested the latest kernel 3.9.0-rc5 from http://kernel.ubuntu.com/~kernel-ppa/mainline/v3.9-rc5-raring/ with UXA active and the latest intel drivers from xorg-edgers ppa (no i915_enable_rc6 trick). No "GPU hung" for more than 3 hours uptime running glxgears and compiling the linux kernel.

However, there are still symptoms of the bug (or of some bug). Lots of:
- [drm:__gen6_gt_force_wake_get] *ERROR* Timed out waiting for forcewake to ack request.
- [drm:__gen6_gt_wait_for_thread_c0] *ERROR* GT thread status wait timed out
- and some crashes also.

From the user point-of-view, only some half-a-second non-responsive UI when a crash happens (only 2 times in 3 hours, 575s and 9620s in dmesg log file). So the system seems to be perfectly usable. Far better than before in any case.

The attached report file contains a directory 'Reports' with all the info from the first crash (aprox. 575s uptime), after the first half-a-second noticeable freeze of UI. Also, there is a 'dmesg_3.9.0-rc5_12000.txt' with full dmesg log (uptime more than 3:20h) to see the error frequency.

In the meantime, I updated my kernel source from git and tried to apply your patch. It was partially rejected (*.rej files are included in the report also). However, I compiled it anyway and it is running right now, while I'm writing. The glxgears demo has also been running from the beginning, and up to now, 3600s uptime, there is no sign of any bug in dmesg. So, I will continue to test it and report the results here.
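
For anyone who wants to compare error frequency between kernels from a saved dmesg log, here is a rough sketch, assuming Python 3 and dmesg output with the default bracketed timestamps. The match strings are the messages quoted in this report; the function name, the category labels and the timestamps in the sample are made up for illustration.

```python
import re
from collections import Counter

# Substrings taken from the error messages quoted in this bug report.
PATTERNS = {
    "forcewake_timeout": "Timed out waiting for forcewake to ack request",
    "thread_c0_timeout": "GT thread status wait timed out",
    "gpu_hung": "GPU hung",
}

# Leading dmesg timestamp such as "[  575.120000]".
TIMESTAMP = re.compile(r"^\[\s*(\d+\.\d+)\]")

def summarize_dmesg(text):
    """Count each known error and record the uptime (seconds) of its first occurrence."""
    counts = Counter()
    first_seen = {}
    for line in text.splitlines():
        stamp_match = TIMESTAMP.match(line)
        stamp = float(stamp_match.group(1)) if stamp_match else None
        for name, needle in PATTERNS.items():
            if needle in line:
                counts[name] += 1
                if stamp is not None and name not in first_seen:
                    first_seen[name] = stamp
    return counts, first_seen

# Illustrative sample; the timestamps are invented, not taken from an attachment.
sample = "\n".join([
    "[  575.120000] [drm:__gen6_gt_force_wake_get] *ERROR* Timed out waiting for forcewake to ack request.",
    "[  575.130000] [drm:__gen6_gt_wait_for_thread_c0] *ERROR* GT thread status wait timed out",
    "[ 9620.500000] [drm:__gen6_gt_force_wake_get] *ERROR* Timed out waiting for forcewake to ack request.",
    "[ 9700.000000] [drm:i915_hangcheck_hung] *ERROR* Hangcheck timer elapsed... GPU hung",
])

counts, first_seen = summarize_dmesg(sample)
for name in PATTERNS:
    print(name, counts[name], first_seen.get(name))
```

Feeding it the text of two dmesg files (for example one per kernel under test) gives a quick count per error type and the uptime at which each first appeared.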
Comment 61 Daniel Vetter 2013-04-02 08:20:07 UTC
Can you please also test the latest drm-intel-nightly git branch from http://cgit.freedesktop.org/~danvet/drm-intel ?

It contains some patches to fix an off-by-one in the timeout of the forcewake functions - it could very well be that this ends up upsetting the hw.
Comment 62 jrierab 2013-04-04 23:21:28 UTC
Daniel, the test did not go well. Lots of "*ERROR* Timed out waiting for forcewake to ack request." and crashes, much like the report in my previous post.

However, while compiling that kernel version I've continued to use the latest kernel from git (git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git) with the patch from Rodrigo (#55, https://bugzilla.kernel.org/show_bug.cgi?id=43267#c55), also running glxgears, as mentioned in my previous post. This makes for an accumulated uptime of nearly 5h without a single symptom of the bug. The patch was not fully applied (rejected files are in my previous attachment), but I believe it is worth looking at what was applied, because it is the only difference between my report with crashes and 5h of calm.

Anyway, I will continue to work with that patched kernel to accumulate more uptime.
Comment 63 Rodrigo Vivi 2013-04-05 16:57:17 UTC
Are you sure my patch was the only difference?

From the 2 .rej files in your last attachment it seems that my patch didn't apply entirely. :(

I agreed with Chris that the patch I sent was ugly and reintroduced other, already fixed bugs. I have to send a new version, but before that could you try reverting this patch:

commit 1ee9ae3244c4789f3184c5123f3b2d7e405b3f4c
Author: Daniel Vetter <daniel.vetter@ffwll.ch>
Date:   Wed Aug 15 10:41:45 2012 +0200

    drm/i915: use hsw rps tuning values everywhere on gen6+

Thanks
Comment 64 jrierab 2013-04-05 18:26:02 UTC
Created attachment 97491 [details]
Git Kernel HEAD 07961ac7c0ee8b546658717034fe692fd12eefa9 vs patched diff

Rodrigo, I believe you are right. It was very late yesterday when I wrote the last comment :-(

There could be more differences, because my initial (#60) test was with kernel 3.9.0-rc5 from
http://kernel.ubuntu.com/~kernel-ppa/mainline/v3.9-rc5-raring/ and the (partially) patched one is from HEAD 07961ac7c0ee8b546658717034fe692fd12eefa9 kernel branch 'master' of git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux (I am using it right now, with glxgears running, and no bug symptoms).

Your patch was applied partially, because it was older than the kernel I applied it to. I attach the diff file here, in case you would like to check what was really applied. I would also like to revert it and try the original kernel as it is, just to check.

In your last comment, you refer to a certain patch. I suppose it is from the http://cgit.freedesktop.org/~danvet/drm-intel/commit/?h=drm-intel-nightly branch, isn't it?
Comment 65 Daniel Vetter 2013-04-05 19:00:59 UTC
On Fri, Apr 5, 2013 at 8:26 PM,  <bugzilla-daemon@bugzilla.kernel.org> wrote:
>
> In your last comment, you refer to a certain patch. I suppose it is from the
> http://cgit.freedesktop.org/~danvet/drm-intel/commit/?h=drm-intel-nightly
> branch, isn't it?

The mentioned git commit (1ee9ae3244c4789f3184c5123f3b2d) is already
included in 3.6. But to keep variability low it would be best to test
the revert on one of the recent kernels you've tested already.
--
Daniel Vetter
Software Engineer, Intel Corporation
+41 (0) 79 365 57 48 - http://blog.ffwll.ch
Comment 66 jrierab 2013-04-06 14:36:01 UTC
Ok. Let's summarize my tests. All of them are performed with the HEAD 07961ac7c0ee8b546658717034fe692fd12eefa9 kernel branch 'master' of
git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux, UXA acceleration activated and the same intel drivers from xorg-edgers ppa (git 20130328).

1. HEAD kernel (no patch, no revert). No sign of bug for 3h. But just when I was about to power off, it hung for about 1 minute. Then recovered.

2. HEAD with the commit reverted as suggested. After 30m uptime, lots of errors in dmesg ("*ERROR* Timed out waiting for forcewake to ack request."). After a little while, a complete system hang, with no recovery.

3. HEAD with Rodrigo's patch partially applied. Accumulates more than 11h uptime without any sign of the bug.

I captured reports of 1 and 2 (just before the complete system hang). They are probably very similar to the ones already attached, but feel free to ask for them if you feel they will give some useful info.

I am using 3 as my default boot kernel from now on. Since there is no easy way to trigger the bug on demand, only with enough accumulated uptime can we assume that the bug is really gone.

However, a look at the diff file from 1 to 3, which is attached in #64 (https://bugzilla.kernel.org/show_bug.cgi?id=43267#c64) should be considered. It really seems that the patch does something good.
Comment 67 Rodrigo Vivi 2013-04-09 20:41:49 UTC
Hi jrierab, 

Thank you very much for your efforts. But if possible I'd like you to help me figure out which part of my ugly patch is really fixing the hangs.
I have 3 guesses, so I prepared 3 small patches over your tree and put them on a branch for you:
http://cgit.freedesktop.org/~vivijim/drm-intel/log/?h=snb-rc6-43267

Could you test the head of this branch and, if possible, also the individual commits?

Thanks again,
Rodrigo.
Comment 68 jrierab 2013-04-12 18:17:09 UTC
Hi Rodrigo,

Testing will take a while ;-)

Up to now:

* "Original" 3.9.0.rc5 with partial patch accumulates 15:30h without error.

* Head branch accumulates 4:15h without error.

* Head with TurboDisable patch failed after 1:42 uptime with a single error:

[drm:i915_hangcheck_hung] *ERROR* Hangcheck timer elapsed... GPU hung
[ 6116.667033] [drm] capturing error event; look for more information in /sys/kernel/debug/dri/0/i915_error_state

Report captured, if you are interested.

* Now testing head with FixFreqTable patch.

I will keep reporting as new results become available.

Best Regards,
Jordi.
Comment 69 jrierab 2013-04-12 18:31:16 UTC
Maybe it will not take so much time after all...

* Head with FixfreqTable failed after 00:12h uptime. First with several:

[drm:__gen6_gt_force_wake_get] *ERROR* Timed out waiting for forcewake to ack request.
[drm:__gen6_gt_wait_for_thread_c0] *ERROR* GT thread status wait timed out

Then several crashes like:

WARNING: at drivers/gpu/drm/i915/intel_pm.c:4350 gen6_gt_check_fifodbg+0x41/0x60 [i915]()

followed by a stack dump.

Report captured.

* Now testing head with FixRPControl.
Comment 70 Rodrigo Vivi 2013-04-18 19:50:05 UTC
So, does this branch's head with 3 patches applied work similar to that partial first patch?

I'm curious whether I might be missing something from the first partial one.

Also, if I understood correctly, none of these patches alone fixed the error, and only the sum of them fixes the hangs?
Comment 71 jrierab 2013-04-19 18:40:11 UTC
Rodrigo, I'm still doing my homework with that branch ;-)

I need to test on my home system, and I have not much time lately...

However, today I've been testing the base branch, just to be absolutely sure that the bug is still present there, and it has crashed after an accumulated 5:30h (1:15h today). It failed with:

"[drm:i915_hangcheck_hung] *ERROR* Hangcheck timer elapsed... GPU hung
 [drm] capturing error event; look for more information in /sys/kernel/debug/dri/0/i915_error_state".

Desktop freezes completely for more than 20 seconds. I was able to switch to console and get a report. Then, switched back to graphics and eventually recovered, but with small freezes and lags in desktop for several seconds. Lots of crash errors followed on dmesg after the first one.

Summarizing the tests:

* "Original" 3.9.0.rc5 with partial patch accumulated 15:30h without error.

* Head branch (no patch) failed after 5:30h. Single error followed by multiple crashes.

* Head with TurboDisable patch failed after 1:42h uptime with a single error, but I didn't wait to see if crashes appeared in dmesg after a while.

* Head with FixfreqTable failed after 00:12h uptime. Several errors and crashes.

* Now testing head with FixRPControl. No error for about 6h uptime.

So, maybe FixRPControl does the trick. 

Or maybe I am simply lucky to avoid hitting the bug by now, but it will appear eventually. I will try to accumulate enough running time to be as sure as possible about it.
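
To put a rough number on "lucky", here is a back-of-the-envelope sketch in Python. It assumes (and this is only an assumption, not something measured in this report) that hangs arrive at random with a constant average rate, so the chance of seeing no hang for t hours is exp(-t / mean time between hangs). The 5.5h figure is the single 5:30h failure of the unpatched branch, which is a very thin basis for a mean; the function names are hypothetical.

```python
import math

def prob_no_failure(hours_observed, mean_hours_between_failures):
    """Chance of a clean run of the given length if the bug is still present."""
    return math.exp(-hours_observed / mean_hours_between_failures)

def hours_needed(mean_hours_between_failures, max_luck=0.05):
    """Clean uptime needed before 'just lucky' drops below max_luck."""
    return -mean_hours_between_failures * math.log(max_luck)

# Assumed mean time between hangs, taken from one observation (5:30h).
baseline_mtbf = 5.5

# Chance that an unfixed kernel still survives 6h without an error.
print(round(prob_no_failure(6.0, baseline_mtbf), 2))

# Error-free hours needed to push that chance under 5%.
print(round(hours_needed(baseline_mtbf), 1))
```

Under these assumptions an unfixed kernel would still run 6h cleanly about one time in three, and something like 16 to 17 error-free hours would be needed before luck becomes an unlikely explanation. If the real hang rate is lower than assumed, even more uptime is needed.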
Comment 72 jrierab 2013-04-19 20:31:34 UTC
Mmm...

* Head with FixRPControl failed after about 7h uptime, with a single error, but after a freeze of about 20 seconds the system recovered and I am still working with it. No more errors after 30 minutes.


So, no single patch seems to completely remove the bug.

Tomorrow I will try to continue testing the branch with all three patches applied together, like the original partial patch, to see if this changes something.
Comment 73 jrierab 2013-05-01 18:06:50 UTC
Tests with all three patches applied together also failed. First, with a crash in glxgears after some 2:15h uptime, but then after some 5:30h the same type of errors appeared...

So, either the first partial patch contains something more, or I have simply been lucky not to hit the bug in the first 3.9 branch with that patch. I am using that kernel right now, with an accumulated uptime of more than 21h without any symptom of the bug.

I feel lost in regards to this bug and how to proceed from here. The last test results are a bit disorientating.
Comment 74 jrierab 2013-05-07 18:09:13 UTC
Well, confirmed at last. The bug was still present in the "original" 3.9.0.rc5 with partial RV's patch. It failed with a single error after more than 30h uptime. So, the patch does not resolve the issue.

In the meantime, I've updated my work PC to Ubuntu 13.04 (kernel 3.8.0-19-generic). The interesting thing is that it failed today with the classic message:

"[drm:i915_hangcheck_hung] *ERROR* Hangcheck timer elapsed... GPU hung"

I say it is interesting because it never failed with previous kernel versions (which failed in my home PC). Furthermore, as it is up and running much more time than my home desktop, it would be easy to test kernels or patches and accumulate uptime.

So, feel free to share any patch you would like to test.
Comment 75 Rodrigo Vivi 2013-08-15 22:28:54 UTC
Hi jrierab, 

many rc6 fixes were added recently. Could you please check if you still face this issue on a newer tree?

Thanks,
Rodrigo.
Comment 76 Jani Nikula 2013-10-09 11:26:50 UTC
jrierab, ping.
Comment 77 jrierab 2013-10-13 16:23:22 UTC
Created attachment 110881 [details]
Kernel_intel_nightly_2103_08_24 error

PING jrierab (jrierab) 56(84) bytes of data.
64 bytes from jrierab: icmp_req=1 ttl=64 time=several days
--- jrierab ping statistics ---
1 packets transmitted, 1 received, 0% packet loss, time several days

Sorry, mates. First, I was on holidays and then I missed the email among a lot of accumulated junk.

To the point. Last test I performed was on Kernel_intel_nightly_2103_08_24 3.11.0. Sorry to say that some variant of this bug was still present:

[29117.440040] [drm:i915_hangcheck_elapsed] *ERROR* stuck on render ring
[29117.440043] [drm:i915_hangcheck_elapsed] *ERROR* stuck on blitter ring
[29117.440045] [drm] capturing error event; look for more information in /sys/class/drm/card0/error

I will test the last intel kernel branch to check how it does now.
Comment 78 jrierab 2013-10-14 20:37:08 UTC
Created attachment 111011 [details]
Full report (dmesg, cpuinfo, version, /proc/dri/0/* etc...) drm-intel kernel 2013-10-13

I've compiled and tested the latest drm-intel kernel. Lots of drm related errors, so it seems that something similar to this bug is still present. The good news is that there aren't any noticeable hangs for the moment.

[drm:__gen6_gt_force_wake_get] *ERROR* Timed out waiting for forcewake to ack request.
[drm:__gen6_gt_wait_for_thread_c0] *ERROR* GT thread status wait timed out
[drm] stuck on render ring
[drm] stuck on blitter ring
[drm:__gen6_gt_force_wake_get] *ERROR* Timed out waiting for forcewake to ack request.
[drm:__gen6_gt_wait_for_thread_c0] *ERROR* GT thread status wait timed out

An error report is attached.
Comment 79 Daniel Vetter 2013-10-16 08:34:22 UTC
Yeah, your gpu seems to drop off the world completely on occasion. No ideas in the pipeline for how to fix this unfortunately.
Comment 80 Daniel Vetter 2013-10-28 18:19:22 UTC
Please test Ken's snb blorp fixes from

http://cgit.freedesktop.org/~kwg/mesa/log/?h=snbfixes

Note that this is a mesa series, not kernel patches. But it could be that a gpu hang caused by mesa results in your gpu dropping off the earth - we unfortunately can't exactly tell where it died :(
Comment 81 jrierab 2013-10-28 18:25:37 UTC
Ok, Daniel. Never compiled a mesa patch until now. Will try.

Have you seen this comment in launchpad? https://bugs.launchpad.net/ubuntu/+source/linux/+bug/946899/comments/105 It seems that André can consistently reproduce this bug. This may be of some help, at the very least, to test patches.
Comment 82 Daniel Vetter 2013-10-28 18:27:24 UTC
I've spammed a lot of bug reports with test requests. Changes in the snb mesa flushing code (which Ken's patches are mostly) are known to be fairly risky, so we're gathering tested-bys to make sure we can safely backport the patches to stable mesa releases.
Comment 83 Rodrigo Vivi 2014-01-30 09:30:08 UTC
Hi jrierab,

did you have any luck with mesa patch Daniel pointed out?
Comment 84 Rodrigo Vivi 2014-03-14 14:29:06 UTC
Any news on this? Any luck with new code?
Comment 85 Jonathan 2014-05-25 00:15:41 UTC
I have been testing with Ken's patches and still getting these errors:

[drm:__gen6_gt_force_wake_get] *ERROR* Timed out waiting for forcewake to ack request.
[drm:__gen6_gt_wait_for_thread_c0] *ERROR* GT thread status wait timed out

The hangs seem to be shorter and less frequent though.
Comment 86 Jonathan 2014-05-25 20:14:41 UTC
Appears to be fixed with a BIOS update to my ASRock Z68 Pro3-M motherboard. Update changes the IGPU voltage from 'auto' to 'fixed' at 1.25V. Only downside so far is the CPU is running hotter now.
Comment 87 Daniel Vetter 2014-06-18 16:15:32 UTC
Yeah, smells like a hw issue then, closing accordingly.
