Bug 58381
Summary: | i915: GPU hangs after upgrade to 3.8.13 or 3.9 | ||
---|---|---|---|
Product: | Drivers | Reporter: | Aleksandr Mezin (mezin.alexander) |
Component: | Video(DRI - Intel) | Assignee: | Daniel Vetter (daniel) |
Status: | RESOLVED CODE_FIX | ||
Severity: | normal | CC: | chris, daniel, intel-gfx-bugs, mk |
Priority: | P1 | ||
Hardware: | All | ||
OS: | Linux | ||
Kernel Version: | 3.8.13 | Subsystem: | |
Regression: | Yes | Bisected commit-id: | |
Attachments:
i 915 error state
i915_error_state (kernel 3.9.5)
Don't punt on the gen6 w/a pipe_controls for disabled depth state
i915_error_state (kernel 3.10.7, mesa 9.1.6 with patch)
i915_error_state, kernel 3.10.10, mesa 9.1.6 (with patch)
Description
Aleksandr Mezin
2013-05-16 21:41:39 UTC
Unlikely to be the root cause, nor anything more significant than changing the timing behind the hangs. Please attach the i915_error_state.
The regressing commit from stable, just so we can keep a close check on our elephants:
commit b578b3a82d830e2170d403b1fb29b649e26a48fb
Author: Chris Wilson <chris@chris-wilson.co.uk>
Date: Thu Apr 4 21:31:03 2013 +0100
drm/i915: Workaround incoherence between fences and LLC across multiple CPUs
i915_error_state: Bugzilla says that the attachment is too big, so I posted it there: http://ge.tt/3CoX8dh/v/0?c
Experienced this problem again. i915_error_state: http://ge.tt/9Abhykh/v/0?c This happens only when I play some video with vaapi. kernel 3.9.4, not patched
Mesa gen6 blorp death; update mesa.
Rebuilt mesa from git master today and it didn't help.
Compare the i915_error_state.
The diff is larger than the file itself, so here is the new i915_error_state: http://ge.tt/5V6EEri/v/0?c
Created attachment 104821 [details]
i 915 error state
I'm running a T420 with the i915 module, kernel 3.9.6 and mesa 9.1.3, and I occasionally see such hangs too:
$ ls -l /var/tmp/*i915*
-rw-r--r-- 1 root root 313708 May 4 15:08 /var/tmp/i915_error_state.20130504-150841.txt.gz
-rw-r--r-- 1 root root 319491 May 19 14:04 /var/tmp/i915_error_state.20130519-140449.txt.gz
-rw-r--r-- 1 root root 316166 Jun 16 11:49 /var/tmp/i915_error_state.20130616-114928.txt.gz
Toralf, please file a separate bug report with your error state - there are gazillions of reasons for gpu hangs, each one different. Mixing them up just results in a giant mess.
Alexander, can you please attach the new error state gzip'ed? The download link doesn't work for me ...
Created attachment 105701 [details]
i915_error_state (kernel 3.9.5)
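For anyone who needs to attach an error state in gzip'ed form, here is a minimal Python sketch that copies the debugfs file into a compressed, timestamped file similar to the ones listed above. The function name `save_error_state`, the default destination directory and the file naming scheme are assumptions chosen for illustration; reading debugfs normally requires root.

```python
import gzip
import shutil
import time
from pathlib import Path

# Debugfs location of the i915 error state (the path the kernel log points to).
ERROR_STATE = Path("/sys/kernel/debug/dri/0/i915_error_state")


def save_error_state(src=ERROR_STATE, dest_dir="/var/tmp"):
    """Copy `src` into a gzip'ed, timestamped file in `dest_dir`.

    Hypothetical helper: returns the path of the file that was written.
    """
    stamp = time.strftime("%Y%m%d-%H%M%S")
    dest = Path(dest_dir) / f"i915_error_state.{stamp}.txt.gz"
    # Stream the copy so a large error state is never held in memory at once.
    with open(src, "rb") as fin, gzip.open(dest, "wb") as fout:
        shutil.copyfileobj(fin, fout)
    return dest
```

Calling `save_error_state()` as root right after a hang should leave a file such as `/var/tmp/i915_error_state.<date>-<time>.txt.gz` that can be attached to the report.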
I installed 3.10-rc6 and can't reproduce this bug on it
Created attachment 105881 [details]
Don't punt on the gen6 w/a pipe_controls for disabled depth state
Can you please test whether the attached mesa patch (against latest mesa git, but should apply to 9.1, too) helps?
I tried this patch with mesa 9.1.6 and 9.2 (from git), the problem is still here. Currently I have kernel 3.10.5, but for 3.9.x the results were the same. The problem isn't limited only to flash videos, gpu also hangs in some games almost immediately. And sometimes it hangs even when only KDE is running, without any application opened. The patch doesn't change anything at all.
Can you please upload a new error state with latest versions you've tested (kernel, mesa)? Just to make sure it's still the same bug report.
After thorough testing: attached patch actually fixes the problem. Other hangs happen only when I compile kernel with custom flags.
Created attachment 107291 [details]
i915_error_state (kernel 3.10.7, mesa 9.1.6 with patch)
And just after I wrote that comment, the gpu hung again.
With the attached patch hangs still happen. However, it seems that without the patch hangs happen more frequently. Hangs don't happen with "i915_enable_rc6=0". "-march=native" in KCFLAGS causes a lot more hangs. Before 3.8.13/3.9 I never had hangs with "-march=native" and "i915_enable_rc6=3".
Created attachment 107383 [details]
i915_error_state, kernel 3.10.10, mesa 9.1.6 (with patch)
dmesg:
[41785.031185] Watchdog[7014]: segfault at 0 ip 00007f4d5cc094ee sp 00007f4d49367fb0 error 6 in chrome[7f4d5be2c000+4a85000]
[41794.077941] [drm:i915_hangcheck_hung] *ERROR* Hangcheck timer elapsed... GPU hung
[41794.077950] [drm] capturing error event; look for more information in /sys/kernel/debug/dri/0/i915_error_state
[41794.080850] [drm:kick_ring] *ERROR* Kicking stuck semaphore on render ring
[41802.055404] [drm:i915_hangcheck_hung] *ERROR* Hangcheck timer elapsed... GPU hung
[41814.071594] Watchdog[23460]: segfault at 0 ip 00007fa2c71b54ee sp 00007fa2b3913fb0 error 6 in chrome[7fa2c63d8000+4a85000]
[41816.070944] [drm:i915_hangcheck_hung] *ERROR* Hangcheck timer elapsed... GPU hung
Only KDE and Chromium were running, no videos were open.
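To see how often the hangcheck fires in a log like the one above, a small Python sketch can pull out the kernel timestamps of the "GPU hung" lines. The regular expression is written against the message format shown in this dmesg excerpt only; other kernel versions may word the message differently, and the function name `hang_timestamps` is made up for this example.

```python
import re

# Matches lines such as:
# [41794.077941] [drm:i915_hangcheck_hung] *ERROR* Hangcheck timer elapsed... GPU hung
HANG_RE = re.compile(
    r"^\[\s*(\d+\.\d+)\]\s+\[drm:i915_hangcheck_hung\].*GPU hung"
)


def hang_timestamps(dmesg_text):
    """Return kernel timestamps (seconds since boot) of reported GPU hangs."""
    stamps = []
    for line in dmesg_text.splitlines():
        match = HANG_RE.match(line.strip())
        if match:
            stamps.append(float(match.group(1)))
    return stamps
```

Fed the excerpt above, this picks out three hang reports spanning roughly 22 seconds; the segfault, "capturing error event" and "Kicking stuck semaphore" lines are ignored.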
I was going to file a separate bug but this one seems quite similar. I have had the below problem with at least my last few kernels (3.9.6, 3.10.1), previously with 3.2 or 3.6 it was fine. I could verify the last working version if asked, but the problem definitely started between 3.2 and 3.9; I remember noticing this immediately after compiling the kernel but decided to live with it.
There are three distinct but similar issues:
A) When coming back from hibernate or sleep, the screen is often (50/50) screwed up, sometimes illegibly. HID works, and moving the desktop around or back and forth to console a few times usually clears it up.
B) Sometimes that does not work and a reboot is required, meaning (A) and (B) differ only in severity.
C) Less often, the screen rapidly degrades, and HID response is very sluggish, implying some wheels are spinning. Syslog/console will then contain:
[drm:i915_hangcheck_hung] *ERROR* Hangcheck timer elapsed... GPU hung
[drm] capturing error event; look for more information in /sys/kernel/debug/dri/0/i915_error_stat
This error report distinguishes (C) from (A) and (B). I have checked i915_error immediately in situation (A) and (B) when it is still possible, but there is no error there then.
Within 60 seconds in situation (C) the system reboots itself. I've created a shell shortcut to dump i915_error_state to disk with minimal keystrokes and will report that here if anything comes out of it.
BTW: This problem occurs without chrome installed.
(In reply to Mark E. from comment #20)
A) and B) are very likely the same bug, but also very likely completely unrelated to issue C). For issue C) we can't tell what's wrong without the error state. But please file a new bug report - for us it's much easier to mark a bug as duplicate after analysis than trying to untangle different bugs reported in the same bugzilla.
For issue A/B please also file a new (separate) bug report. For that one we need dmesg with drm.debug=0xe added to the kernel cmdline (you might need to increase the dmesg buffer or grab dmesg from disk logs, there's lots of stuff). We need everything from the boot-up message to when you've hit the black screen issue after resume.
Alexander, your remaining error state is a dupe of bug #53571.
For the mesa issue please file a bug report against mesa -> dri/i965 on bugs.freedesktop.org so that the mesa team can handle this. My little hack there is probably insufficient to keep things going.
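Before collecting the requested dmesg, it can help to confirm that the running kernel was actually booted with drm.debug=0xe. The following Python sketch parses a kernel command line string for that; both helper names are hypothetical, and the check simply tests that every bit of 0xe is set in the given value.

```python
def cmdline_param(cmdline, name):
    """Return the value of `name=value` from a kernel command line, or None."""
    for token in cmdline.split():
        key, sep, value = token.partition("=")
        if sep and key == name:
            return value
    return None


def drm_debug_enabled(cmdline, wanted=0xE):
    """True if drm.debug is present and includes all bits in `wanted`."""
    value = cmdline_param(cmdline, "drm.debug")
    if value is None:
        return False
    # int(..., 0) accepts both "0xe" and "14".
    return (int(value, 0) & wanted) == wanted
```

On a running system the input would typically come from `/proc/cmdline`, for example `drm_debug_enabled(open("/proc/cmdline").read())`.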