Bug 58381

Summary: i915: GPU hangs after upgrade to 3.8.13 or 3.9
Product: Drivers
Reporter: Aleksandr Mezin (mezin.alexander)
Component: Video(DRI - Intel)
Assignee: Daniel Vetter (daniel)
Status: RESOLVED CODE_FIX
Severity: normal
CC: chris, daniel, intel-gfx-bugs, mk
Priority: P1
Hardware: All
OS: Linux
Kernel Version: 3.8.13
Subsystem:
Regression: Yes
Bisected commit-id:
Attachments: i 915 error state
i915_error_state (kernel 3.9.5)
Don't punt on the gen6 w/a pipe_controls for disabled depth state
i915_error_state (kernel 3.10.7, mesa 9.1.6 with patch)
i915_error_state, kernel 3.10.10, mesa 9.1.6 (with patch)

Description Aleksandr Mezin 2013-05-16 21:41:39 UTC
I've enabled full hardware acceleration in Chromium, and before 3.8.13 everything worked fine. However, since I upgraded to 3.8.13, playing YouTube videos sometimes causes a GPU hang, a choppy compositor, and later an X crash.

dmesg contains this message: [drm:i915_hangcheck_hung] *ERROR* Hangcheck timer elapsed... GPU hung

GPU:
00:02.0 VGA compatible controller: Intel Corporation 2nd Generation Core Processor Family Integrated Graphics Controller (rev 09)

It seems that the problem is introduced by this commit:
https://git.kernel.org/cgit/linux/kernel/git/stable/linux-stable.git/commit/?id=b578b3a82d830e2170d403b1fb29b649e26a48fb

If I revert it, the GPU hangs no longer occur. The same is true of 3.9 kernels.
Comment 1 Chris Wilson 2013-05-17 10:12:41 UTC
That commit is unlikely to be the root cause; more likely it only changes the timing behind the hangs. Please attach the i915_error_state.
Comment 2 Daniel Vetter 2013-05-20 18:54:49 UTC
The regressing commit from stable, just so we can keep a close check on our elephants:

commit b578b3a82d830e2170d403b1fb29b649e26a48fb
Author: Chris Wilson <chris@chris-wilson.co.uk>
Date:   Thu Apr 4 21:31:03 2013 +0100

    drm/i915: Workaround incoherence between fences and LLC across multiple CPUs
Comment 3 Aleksandr Mezin 2013-05-25 15:51:22 UTC
i915_error_state:
Bugzilla says that the attachment is too big, so I posted it here: http://ge.tt/3CoX8dh/v/0?c
Comment 4 Aleksandr Mezin 2013-05-27 08:08:17 UTC
Experienced this problem again. i915_error_state: http://ge.tt/9Abhykh/v/0?c
This happens only when I play some video with vaapi.
kernel 3.9.4, not patched
Comment 5 Chris Wilson 2013-06-07 21:23:05 UTC
Mesa gen6 blorp death; update mesa.
Comment 6 Aleksandr Mezin 2013-06-08 21:21:43 UTC
Rebuilt mesa from git master today and it didn't help
Comment 7 Chris Wilson 2013-06-08 22:20:20 UTC
Compare the i915_error_state.
Comment 8 Aleksandr Mezin 2013-06-09 05:06:05 UTC
The diff is larger than the file itself, so here is the new i915_error_state: http://ge.tt/5V6EEri/v/0?c
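
Since an error state is mostly dumps of buffer contents, a line-by-line diff of two captures can easily be dominated by noise. Below is a minimal sketch of a narrower comparison. It assumes the capture contains short `NAME: value` summary lines (register values, for example) next to the dumps. The line pattern and the function names are illustrative assumptions, not part of any i915 tool.

```python
import re

# Assumed shape of a summary line: "NAME: value", with no space before the colon.
# Lines that start with a digit (such as offset/value dump rows) do not match.
FIELD = re.compile(r"^\s*([A-Za-z](?:[A-Za-z0-9_ ]*[A-Za-z0-9_])?):\s*(\S.*)$")

def summary_fields(text):
    """Collect 'NAME: value' lines, keeping every value seen for a name."""
    fields = {}
    for line in text.splitlines():
        m = FIELD.match(line)
        if m:
            fields.setdefault(m.group(1), []).append(m.group(2).strip())
    return fields

def changed_fields(old_text, new_text):
    """Return {name: (old_values, new_values)} for names whose values differ."""
    old, new = summary_fields(old_text), summary_fields(new_text)
    return {
        name: (old.get(name), new.get(name))
        for name in sorted(set(old) | set(new))
        if old.get(name) != new.get(name)
    }
```

Calling `changed_fields(open(a).read(), open(b).read())` on two decompressed captures would then list only the summary values that differ, which may be a more useful starting point than a full diff.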
Comment 9 Toralf Förster 2013-06-16 09:55:53 UTC
Created attachment 104821 [details]
i 915 error state

I'm running a T420 with the i915 module, kernel 3.9.6 and mesa 9.1.3, and occasionally experience such hangs too:

$ ls -l /var/tmp/*i915*
-rw-r--r-- 1 root root 313708 May  4 15:08 /var/tmp/i915_error_state.20130504-150841.txt.gz
-rw-r--r-- 1 root root 319491 May 19 14:04 /var/tmp/i915_error_state.20130519-140449.txt.gz
-rw-r--r-- 1 root root 316166 Jun 16 11:49 /var/tmp/i915_error_state.20130616-114928.txt.gz
Comment 10 Daniel Vetter 2013-06-16 11:36:25 UTC
Toralf, please file a separate bug report with your error state. There are gazillions of reasons for GPU hangs, each one different. Mixing them up just results in a giant mess.
Comment 11 Daniel Vetter 2013-06-16 11:38:28 UTC
Alexander, can you please attach the new error state gzip'ed? The download link doesn't work for me ...
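
For anyone following along, here is a rough sketch of capturing the error state as a gzip'ed file small enough to attach. The debugfs path is the one printed in the dmesg hint quoted elsewhere in this report, and reading it normally requires root. The function name and the timestamped file naming (modelled on the file names in comment 9) are assumptions made for illustration.

```python
import gzip
import shutil
import time
from pathlib import Path

# Path as printed by the kernel in this report; debugfs must be mounted.
ERROR_STATE = "/sys/kernel/debug/dri/0/i915_error_state"

def save_error_state(src=ERROR_STATE, dest_dir="."):
    """Copy the error state into a timestamped .txt.gz file and return its path."""
    stamp = time.strftime("%Y%m%d-%H%M%S")
    dest = Path(dest_dir) / f"i915_error_state.{stamp}.txt.gz"
    # Stream the file so a large dump is never held in memory at once.
    with open(src, "rb") as fin, gzip.open(dest, "wb") as fout:
        shutil.copyfileobj(fin, fout)
    return dest
```

Running `save_error_state(dest_dir="/var/tmp")` as root shortly after a hang should leave a compressed copy that can be attached to the bug.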
Comment 12 Aleksandr Mezin 2013-06-22 04:43:50 UTC
Created attachment 105701 [details]
i915_error_state (kernel 3.9.5)

I installed 3.10-rc6 and can't reproduce this bug on it
Comment 13 Daniel Vetter 2013-06-24 16:17:11 UTC
Created attachment 105881 [details]
Don't punt on the gen6 w/a pipe_controls for disabled depth state

Can you please test whether the attached mesa patch (against latest mesa git, but should apply to 9.1, too) helps?
Comment 14 Aleksandr Mezin 2013-08-13 10:13:12 UTC
I tried this patch with mesa 9.1.6 and 9.2 (from git), and the problem is still there.
Currently I have kernel 3.10.5, but for 3.9.x the results were the same.

The problem isn't limited to Flash videos; the GPU also hangs in some games almost immediately. Sometimes it hangs even when only KDE is running, with no applications open. The patch doesn't change anything at all.
Comment 15 Daniel Vetter 2013-08-13 10:27:54 UTC
Can you please upload a new error state with the latest versions you've tested (kernel, mesa)? Just to make sure it's still the same bug.
Comment 16 Aleksandr Mezin 2013-08-23 15:14:39 UTC
After thorough testing: the attached patch actually fixes the problem.

Other hangs happen only when I compile the kernel with custom flags.
Comment 17 Aleksandr Mezin 2013-08-23 16:04:30 UTC
Created attachment 107291 [details]
i915_error_state (kernel 3.10.7, mesa 9.1.6 with patch)

And just after I wrote that comment, the GPU hung again.
Comment 18 Aleksandr Mezin 2013-08-27 06:58:42 UTC
With the attached patch, hangs still happen. However, it seems that without the patch they happen more frequently.

Hangs don't happen with "i915_enable_rc6=0".

"-march=native" in KCFLAGS causes a lot more hangs.

Before 3.8.13/3.9 I never had hangs with "-march=native" and "i915_enable_rc6=3"
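
When comparing results across boots, it helps to confirm which value of the parameter the running kernel actually picked up. The sketch below reads it through the generic `/sys/module/<module>/parameters/<name>` sysfs layout. Whether `i915_enable_rc6` is exposed there, and whether it is readable without root, depends on the kernel version, so treat that as an assumption. The helper name is hypothetical.

```python
from pathlib import Path

def read_module_param(module, name, root="/sys/module"):
    """Return a loaded module's parameter value as text, or None if it cannot be read."""
    path = Path(root) / module / "parameters" / name
    try:
        return path.read_text().strip()
    except OSError:
        # Module not loaded, parameter not exported, or insufficient permissions.
        return None

# Example: read_module_param("i915", "i915_enable_rc6")
```

Recording this value next to each error state makes it clearer which captures were taken with rc6 disabled.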
Comment 19 Aleksandr Mezin 2013-09-02 02:14:48 UTC
Created attachment 107383 [details]
i915_error_state, kernel 3.10.10, mesa 9.1.6 (with patch)

dmesg:
[41785.031185] Watchdog[7014]: segfault at 0 ip 00007f4d5cc094ee sp 00007f4d49367fb0 error 6 in chrome[7f4d5be2c000+4a85000]
[41794.077941] [drm:i915_hangcheck_hung] *ERROR* Hangcheck timer elapsed... GPU hung
[41794.077950] [drm] capturing error event; look for more information in /sys/kernel/debug/dri/0/i915_error_state
[41794.080850] [drm:kick_ring] *ERROR* Kicking stuck semaphore on render ring
[41802.055404] [drm:i915_hangcheck_hung] *ERROR* Hangcheck timer elapsed... GPU hung
[41814.071594] Watchdog[23460]: segfault at 0 ip 00007fa2c71b54ee sp 00007fa2b3913fb0 error 6 in chrome[7fa2c63d8000+4a85000]
[41816.070944] [drm:i915_hangcheck_hung] *ERROR* Hangcheck timer elapsed... GPU hung

Only KDE and Chromium were running, no videos were open.
Comment 20 Mark E. 2013-09-06 17:12:49 UTC
I was going to file a separate bug but this one seems quite similar.   I have had the below problem with at least my last few kernels (3.9.6, 3.10.1), previously with 3.2 or 3.6 it was fine.  I could verify the last working version if asked, but the problem definitely started between 3.2 and 3.9; I remember noticing this immediately after compiling the kernel but decided to live with it.

There are three distinct but similar issues:

A) When coming back from hibernate or sleep, the screen is often (50/50) screwed up, sometimes illegibly.  HID works, and moving the desktop around or back and forth to console a few times usually clears it up.

B) Sometimes that does not work and a reboot is required, meaning (A) and (B) differ only in severity.

C) Less often, the screen rapidly degrades, and HID response is very sluggish, implying some wheels are spinning. Syslog/console will then contain:

[drm:i915_hangcheck_hung] *ERROR* Hangcheck timer elapsed... GPU hung
[drm] capturing error event; look for more information in /sys/kernel/debug/dri/0/i915_error_stat

This error report distinguishes (C) from (A) and (B). I have checked i915_error  immediately in situation (A) and (B) when it is still possible, but there is no error there then. 

Within 60 seconds in situation (C) the system reboots itself. I've created a shell shortcut to dump i915_error_state to disk with minimal keystrokes and will report that here if anything comes out of it.

BTW: This problem occurs without chrome installed.
Comment 21 Daniel Vetter 2013-09-06 19:06:49 UTC
(In reply to Mark E. from comment #20)

A) and B) are very likely the same bug, but also very likely completely unrelated to issue C).

For issue C) we can't tell what's wrong without the error state. But please file a new bug report - for us it's much easier to mark a bug as duplicate after analysis than trying to untangle different bugs reported in the same bugzilla.

For issue A/B please also file a new (separate) bug report. For that one we need dmesg with drm.debug=0xe added to the kernel cmdline (you might need to increase the dmesg buffer or grab dmesg from the disk logs, there's lots of stuff). We need everything from the boot-up messages to when you've hit the black screen issue after resume.
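
As a side note on the `drm.debug=0xe` value: it is a bitmask of debug categories. The sketch below decodes it using the category bits as I recall them from the DRM headers of kernels from this period (core 0x1, driver 0x2, kms 0x4, prime 0x8). That table is an assumption and should be checked against the kernel actually in use. The helper names are hypothetical.

```python
# Debug category bits, assumed from DRM headers of this era; verify for your kernel.
DRM_DEBUG_BITS = {
    0x1: "core",
    0x2: "driver",
    0x4: "kms",
    0x8: "prime",
}

def decode_drm_debug(mask):
    """List the category names enabled by a drm.debug bitmask."""
    return [name for bit, name in sorted(DRM_DEBUG_BITS.items()) if mask & bit]

def drm_debug_from_cmdline(cmdline):
    """Return the drm.debug value found in a kernel command line, or None."""
    for token in cmdline.split():
        if token.startswith("drm.debug="):
            # Base 0 accepts both "0xe" and "14".
            return int(token.split("=", 1)[1], 0)
    return None
```

Under that assumption, `decode_drm_debug(0xe)` gives `['driver', 'kms', 'prime']`, that is, everything except the very verbose core messages. Passing the contents of `/proc/cmdline` to `drm_debug_from_cmdline` is one way to confirm that the option took effect after reboot.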
Comment 22 Daniel Vetter 2013-09-06 19:13:13 UTC
Alexander, your remaining error state is a dupe of bug #53571. For the mesa issue, please file a bug report against mesa -> dri/i965 on bugs.freedesktop.org so that the mesa team can handle this. My little hack there is probably insufficient to keep things going.