Bug 15911

Summary:	Intermittent X crash (freeze)
Product:	Drivers	Reporter:	Christian von Schultz (kernel)
Component:	Video(DRI - Intel)	Assignee:	Chris Wilson (chris)
Status:	RESOLVED PATCH_ALREADY_AVAILABLE
Severity:	normal	CC:	chris, jbarnes
Priority:	P1
Hardware:	All
OS:	Linux
Kernel Version:	2.6.32.1	Subsystem:
Regression:	Yes	Bisected commit-id:
Attachments:
	Output of lspci -vvv
	/proc/version, /proc/modules, /proc/ioports, /proc/iomem
	dmesg (linux 2.6.32.1)
	.config (linux 2.6.32.1)
	.config (linux 2.6.31.13)
	dmesg (linux 2.6.31.13)
	Output of lspci -vvv (linux 2.6.31.13)
	Linux 2.6.31.13: /proc/version, /proc/modules, /proc/ioports, /proc/iomem
	X server log
	perf report (linux 2.6.32.12)
	Protect mmapped buffers from causal eviction.
	Protect mmapped buffers from causal eviction.
	linux-2.6.32.12/drivers/gpu/drm/i915/i915_gem.c.rej
	v2.6.32.12 - Protect mmapped buffers from casual eviction.
	perf report (patched linux 2.6.32.12) when switching virtual consoles
	dmesg after trying to switch to X (patched linux 2.6.32.12)
	v2.6.32.12 - Protect mmapped buffers from casual eviction.

Description Christian von Schultz 2010-05-05 14:02:12 UTC

Created attachment 26226 [details]
Output of lspci -vvv

* Problem: Intermittent X crash (freeze)
* Triggered by: Firefox
* Symptoms: The system freezes. I can move the mouse cursor, but that's all.
* Recovery: Alt+SysRq+K kills X, making xdm give me a new login screen.
* Regression: Does not happen with Linux 2.6.31.13 (from kernel.org).
              Does happen with Linux 2.6.32.1 (from kernel.org).

* Secondary symptom:
The screen is entirely black during shutdown, after X goes down,
instead of going to a console with "Shutting down the-daemon ... [ok]"
messages as usual. (If there has been a crash during the computing
session, that is.) 

* Procedure for triggering crash:
Surf the web for a while. It does not happen immediately, and I have
not been able to find a procedure that always and reliably produces
the crash. But after surfing around for a while, go to a tab
containing a couple of large images (on the order of 9070px × 550px).
Then it crashes, and the system freezes before Firefox has drawn the
images in their tab. Or else, it doesn't crash, and you have to surf
some more, and then go back to the big-image tab and make it crash.
If you are interested in what kind of web pages trigger the
crash, see <http://christian.vonschultz.se/2010/x-crashing-graph.xhtml>.

* Other software involved:
Firefox 3.6.3 from mozilla.com.
X.Org X Server 1.7.6 (Release Date: 2010-03-17)
Distro: Gentoo.
$ equery list xorg
[ Searching for package 'xorg' in all categories among: ]
 * installed packages
[I--] [M ] app-doc/xorg-docs-1.4-r1 (0)
[I--] [  ] x11-base/xorg-drivers-1.7 (0)
[I--] [  ] x11-base/xorg-server-1.7.6 (0)
[I--] [  ] x11-base/xorg-x11-7.4-r1 (0)
[I--] [  ] x11-misc/xorg-cf-files-1.0.3 (0)
$ equery list x11-drivers/
[ Searching for all packages in 'x11-drivers' among: ]
 * installed packages
[I--] [  ] x11-drivers/xf86-input-evdev-2.3.2 (0)
[I--] [  ] x11-drivers/xf86-input-keyboard-1.4.0 (0)
[I--] [  ] x11-drivers/xf86-input-mouse-1.5.0 (0)
[I--] [  ] x11-drivers/xf86-input-synaptics-1.2.1 (0)
[I--] [  ] x11-drivers/xf86-video-intel-2.9.1 (0)

* Alternative recovery:
SSH into the system, and kill X. (I think "kill -9" is required, but I
don't remember for sure just now.) Note that Ctrl+Alt+Bsp does not
kill X, even if X is configured to quit upon that key combination.
Switching to a virtual console fails too - I'm stuck with X until I
kill it with Alt+SysRq+K.

* Git bisection:
$ git bisect bad
07f73f6912667621276b002e33844ef283d98203 is the first bad commit
commit 07f73f6912667621276b002e33844ef283d98203
Author: Chris Wilson <chris@chris-wilson.co.uk>
Date:   Mon Sep 14 16:50:30 2009 +0100

    drm/i915: Improve behaviour under memory pressure
    
    Due to the necessity of having to take the struct_mutex, the i915
    shrinker can not free the inactive lists if we fail to allocate memory
    whilst processing a batch buffer, triggering an OOM and an ENOMEM that
    is reported back to userspace. In order to fare better under such
    circumstances we need to manually retry a failed allocation after
    evicting inactive buffers.
    
    To do so involves 3 steps:
    1. Marking the backing shm pages as NORETRY.
    2. Updating the get_pages() callers to evict something on failure and then
       retry.
    3. Revamping the evict something logic to be smarter about the required
       buffer size and prefer to use volatile or clean inactive pages.
    
    Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
    Signed-off-by: Jesse Barnes <jbarnes@virtuousgeek.org>

:040000 040000 564eb5007c4c2456f2977cdb41d961ce87a9b612 bde06eece95e814ad6542f7ed3087775eb527120 M      drivers
$ git bisect log
git bisect start
# bad: [22763c5cf3690a681551162c15d34d935308c8d7] Linux 2.6.32
git bisect bad 22763c5cf3690a681551162c15d34d935308c8d7
# good: [74fca6a42863ffacaf7ba6f1936a9f228950f657] Linux 2.6.31
git bisect good 74fca6a42863ffacaf7ba6f1936a9f228950f657
# good: [73c583e4e2dd0fbbf2fafe0cc57ff75314fe72df] Merge branch 'omap-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tmlind/linux-omap-2.6
git bisect good 73c583e4e2dd0fbbf2fafe0cc57ff75314fe72df
# bad: [8b3f6af86378d0a10ca2f1ded1da124aef13b62c] Merge branch 'master' of /home/davem/src/GIT/linux-2.6/
git bisect bad 8b3f6af86378d0a10ca2f1ded1da124aef13b62c
# good: [a87e84b5cdfacf11af4e8a85c4bca9793658536f] Merge branch 'for-2.6.32' of git://linux-nfs.org/~bfields/linux
git bisect good a87e84b5cdfacf11af4e8a85c4bca9793658536f
# good: [fd8b327ee46593ccc5230bfd053287fbf7c38a69] Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/lrg/voltage-2.6
git bisect good fd8b327ee46593ccc5230bfd053287fbf7c38a69
# good: [9f6ac7850a9c6363f4117fd2248e232a2d534627] Merge branch 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mchehab/linux-2.6
git bisect good 9f6ac7850a9c6363f4117fd2248e232a2d534627
# good: [2c9871de0ae89a0e2c365ea6e277135fe031d8b4] Merge git://git.kernel.org/pub/scm/linux/kernel/git/rusty/linux-2.6-for-linus
git bisect good 2c9871de0ae89a0e2c365ea6e277135fe031d8b4
# bad: [94e0fb086fc5663c38bbc0fe86d698be8314f82f] Merge branch 'drm-intel-next' of git://git.kernel.org/pub/scm/linux/kernel/git/anholt/drm-intel
git bisect bad 94e0fb086fc5663c38bbc0fe86d698be8314f82f
# bad: [2d7ef395b310e17c86fa6190f21ea1f2eccae5d1] drm/i915: Immediately discard any backing storage for uneeded objects
git bisect bad 2d7ef395b310e17c86fa6190f21ea1f2eccae5d1
# good: [11ed50ec2a316928c2bacc1149bded86c6a96068] drm/i915: Implement GPU reset on i965
git bisect good 11ed50ec2a316928c2bacc1149bded86c6a96068
# bad: [d660467c3ff2a0b7413e1b7a51452b34ffb49e5f] drm/i915: prevent FIFO calculation overflows on 32 bits with high dotclocks
git bisect bad d660467c3ff2a0b7413e1b7a51452b34ffb49e5f
# good: [e67b8ce1b59006ba41245838db60b6fcda365ba8] drm/i915: Remove stored gtt_alignment
git bisect good e67b8ce1b59006ba41245838db60b6fcda365ba8
# good: [3ef94daae7530b4ebcd2e5f48f1028cd2d2470ba] drm/i915: Add ioctl to set 'purgeability' of objects
git bisect good 3ef94daae7530b4ebcd2e5f48f1028cd2d2470ba
# bad: [b7e53aba2f0e6abf23e3f07b38b241145c33a005] drm/i915: remove restore in resume
git bisect bad b7e53aba2f0e6abf23e3f07b38b241145c33a005
# bad: [07f73f6912667621276b002e33844ef283d98203] drm/i915: Improve behaviour under memory pressure
git bisect bad 07f73f6912667621276b002e33844ef283d98203
$ 

I cannot absolutely guarantee that the ones marked as "good" are
actually good, due to the intermittent character of the problem. They
are _probably_ good. I can, however, guarantee that the ones marked as
bad are bad.
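
The asymmetry between the "good" and "bad" marks can be put into rough numbers. The following is a minimal sketch, assuming (hypothetically) that each test session on a truly bad kernel triggers the freeze independently with a fixed probability. The value 0.5 is an illustrative assumption, not a measurement from this report, and the function names are made up for the example.

```python
import math


def chance_bad_looks_good(p_trigger: float, sessions: int) -> float:
    """Probability that a bad kernel survives `sessions` tests without freezing."""
    return (1.0 - p_trigger) ** sessions


def sessions_needed(p_trigger: float, max_false_good: float) -> int:
    """Smallest number of clean sessions before marking a commit "good"
    with at most `max_false_good` chance of the mark being wrong."""
    return math.ceil(math.log(max_false_good) / math.log(1.0 - p_trigger))


if __name__ == "__main__":
    p = 0.5  # assumed per-session trigger probability (illustrative only)
    for n in (1, 3, 5):
        print(n, chance_bad_looks_good(p, n))
    print(sessions_needed(p, 0.05))
```

A single freeze is enough to mark a commit bad, which is why the bad marks are certain, while a good mark only becomes more likely with every clean session.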

* Random thoughts:
I don't know if this is a bug in the kernel or in the X driver, or a
problem with my configuration or something else. I found that changing
kernel version affected the outcome, so that's what I have been
doing. I'm filing this report under Video(DRI - Intel), because the
git bisection says something about drm/i915 - DRM has something to do
with DRI, right? And I think i915 means Intel. My apologies if it is
the wrong place for this report.
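
For readers unfamiliar with the bisected commit, the three steps listed in its commit message (fail fast, evict on failure and retry, prefer clean buffers when evicting) can be pictured roughly as below. This is only a toy model: ToyPool, get_pages and every other name here are hypothetical stand-ins, not the actual i915 code.

```python
class OutOfMemory(Exception):
    """Stands in for a failed page allocation (ENOMEM)."""


class ToyPool:
    """Toy memory pool; every name here is hypothetical, not i915 code."""

    def __init__(self, free_pages):
        self.free_pages = free_pages
        self.inactive = []  # entries are (name, pages, is_clean)

    def alloc(self, pages):
        # Step 1 analogue: fail fast instead of retrying internally (NORETRY).
        if pages > self.free_pages:
            raise OutOfMemory
        self.free_pages -= pages

    def evict_something(self, needed):
        # Step 3 analogue: prefer clean buffers and stop once enough is freed.
        self.inactive.sort(key=lambda buf: not buf[2])  # clean buffers first
        freed = 0
        while self.inactive and freed < needed:
            _name, pages, _clean = self.inactive.pop(0)
            self.free_pages += pages
            freed += pages
        return freed


def get_pages(pool, pages):
    # Step 2 analogue: on failure, evict something and retry.
    while True:
        try:
            pool.alloc(pages)
            return True
        except OutOfMemory:
            if pool.evict_something(pages - pool.free_pages) == 0:
                return False  # nothing left to evict: report the failure
```

In this model a request only fails once the inactive list has been emptied, which also hints at how such a scheme could evict buffers more often than before.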

Comment 1 Christian von Schultz 2010-05-05 14:04:30 UTC

Created attachment 26227 [details]
/proc/version, /proc/modules, /proc/ioports, /proc/iomem

Comment 2 Christian von Schultz 2010-05-05 14:07:08 UTC

Created attachment 26228 [details]
dmesg (linux 2.6.32.1)

Comment 3 Christian von Schultz 2010-05-05 14:08:24 UTC

Created attachment 26229 [details]
.config (linux 2.6.32.1)

Comment 4 Christian von Schultz 2010-05-05 14:10:40 UTC

Created attachment 26230 [details]
.config (linux 2.6.31.13)

Comment 5 Christian von Schultz 2010-05-05 14:11:46 UTC

Created attachment 26231 [details]
dmesg (linux 2.6.31.13)

Comment 6 Christian von Schultz 2010-05-05 14:13:20 UTC

Created attachment 26232 [details]
Output of lspci -vvv (linux 2.6.31.13)

Comment 7 Christian von Schultz 2010-05-05 14:14:54 UTC

Created attachment 26233 [details]
Linux 2.6.31.13: /proc/version, /proc/modules, /proc/ioports, /proc/iomem

Comment 8 Christian von Schultz 2010-05-05 14:17:31 UTC

Created attachment 26234 [details]
X server log

Comment 9 Christian von Schultz 2010-05-05 14:21:58 UTC

I should have mentioned, I have found that this problem is also present on Linux 2.6.32.9 from kernel.org, Linux 2.6.32.12 from kernel.org, and Linux 2.6.32-gentoo-r7 from Gentoo.

Comment 10 Chris Wilson 2010-05-06 21:47:55 UTC

I thought this might be one of the unbind bugs unearthed by the shrinker triggering eviction more often, but nothing relevant seems missing from 2.6.32.12. The symptoms you describe more closely match the page-fault-of-doom when handling large images in Firefox, where we are attempting to copy from one surface to another, but can only fit one surface in the aperture at any one time. [The system/X just appears to hang as it proceeds to perform the copy very slowly.]

But that is inconsistent with the bisection result - except in the scenario where the kernel is reclaiming memory from the principal hog, the i915 driver. Or rather such a freeze can be reproduced without the shrinker being involved, given sufficient memory.

A perf profile during the freeze would confirm whether we are spending all our time in the kernel evicting textures and then mapping them back in.
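
The evict-and-map-back cycle described above can be illustrated with a toy model. This is a sketch under simplifying assumptions: surfaces are bound into the aperture as a whole, the least recently used surface is evicted first, and the sizes are invented for the example. The real driver works on pages and is considerably more involved.

```python
def count_rebinds(surface_pages, aperture_pages, rows):
    """Count how often a surface must be (re)bound while copying row by row.

    Toy model: a surface is bound into the aperture as a whole, and binding
    evicts the least recently used surfaces until the new one fits.
    """
    bound = []  # surfaces currently in the aperture, least recently used first
    rebinds = 0

    def touch(name):
        nonlocal rebinds
        if name in bound:
            bound.remove(name)  # already bound: just refresh its LRU position
        else:
            rebinds += 1
            while bound and (len(bound) + 1) * surface_pages > aperture_pages:
                bound.pop(0)  # evict the least recently used surface
        bound.append(name)

    for _ in range(rows):
        touch("src")  # read one row from the source surface
        touch("dst")  # write it to the destination surface
    return rebinds


if __name__ == "__main__":
    print(count_rebinds(surface_pages=100, aperture_pages=150, rows=550))
    print(count_rebinds(surface_pages=100, aperture_pages=200, rows=550))
```

With these assumed sizes, 550 rows cause 1100 rebinds when only one surface fits, versus 2 when both fit. Each rebind stands in for one evict-and-map-back cycle, so the copy still completes, only very slowly.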

Comment 11 Christian von Schultz 2010-05-07 11:13:29 UTC

Created attachment 26275 [details]
perf report (linux 2.6.32.12)

I have never actually used perf before. There were many event sources to choose from, and I was not sure what to do. I ended up doing a simple "perf record -a". Attached is the perf report, for those with overhead listed as 0.01% or higher. The top ones are the following: 

Samples: 34389583

Overhead      Command              Shared Object  Symbol
........  ...........  .........................  ......

  87.45%            X  [kernel]                   [k] drm_clflush_pages
   0.95%            X  [kernel]                   [k] find_get_page
   0.91%            X  [kernel]                   [k] intel_i915_remove_entries
   0.81%            X  [kernel]                   [k] intel_i915_insert_entries
   0.67%            X  [kernel]                   [k] put_page
   0.66%      swapper  [kernel]                   [k] read_hpet
   0.65%         init  [kernel]                   [k] read_hpet
   0.44%            X  [kernel]                   [k] read_hpet
   0.34%      kcryptd  [kernel]                   [k] enc128
   0.28%            X  [kernel]                   [k] mark_page_accessed
   0.27%            X  [kernel]                   [k] radix_tree_lookup_slot
   0.26%            X  [kernel]                   [k] acpi_os_read_port
   0.17%            X  [kernel]                   [k] do_read_cache_page
   0.16%  firefox-bin  /[...]/firefox/libxul.so   [.] 0x00000000836a2f
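
To make the listing easier to read at a glance, the rows can be summed per command. A small sketch follows; the parsing assumes the whitespace-separated layout shown above (overhead, command, shared object, symbol), and the function name is made up for the example.

```python
from collections import defaultdict

# The top rows of the report, copied from the listing above.
PERF_LINES = """\
  87.45%            X  [kernel]                   [k] drm_clflush_pages
   0.95%            X  [kernel]                   [k] find_get_page
   0.91%            X  [kernel]                   [k] intel_i915_remove_entries
   0.81%            X  [kernel]                   [k] intel_i915_insert_entries
   0.67%            X  [kernel]                   [k] put_page
   0.66%      swapper  [kernel]                   [k] read_hpet
   0.65%         init  [kernel]                   [k] read_hpet
   0.44%            X  [kernel]                   [k] read_hpet
   0.34%      kcryptd  [kernel]                   [k] enc128
   0.28%            X  [kernel]                   [k] mark_page_accessed
   0.27%            X  [kernel]                   [k] radix_tree_lookup_slot
   0.26%            X  [kernel]                   [k] acpi_os_read_port
   0.17%            X  [kernel]                   [k] do_read_cache_page
   0.16%  firefox-bin  /[...]/firefox/libxul.so   [.] 0x00000000836a2f
"""


def overhead_by_command(report):
    """Sum the overhead percentages of a perf report per command."""
    totals = defaultdict(float)
    for line in report.splitlines():
        fields = line.split()
        # Skip headers, separators and blank lines.
        if len(fields) < 2 or not fields[0].endswith("%"):
            continue
        totals[fields[1]] += float(fields[0].rstrip("%"))
    return dict(totals)


if __name__ == "__main__":
    for command, total in sorted(overhead_by_command(PERF_LINES).items()):
        print(f"{command:>12} {total:6.2f}%")
```

Among the rows shown, X accounts for about 92.2% of the samples, of which drm_clflush_pages alone is 87.45%.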

Comment 12 Chris Wilson 2010-05-07 16:08:14 UTC

Created attachment 26276 [details]
Protect mmapped buffers from causal eviction.

That profile is consistent with evicting the active buffer (causing cache-line flushes, the bane of our existence). This is an (untested) patch that should address the issue.

Comment 13 Christian von Schultz 2010-05-08 05:18:26 UTC

I applied your patch to Linux 2.6.32.12. I'm afraid to say that it
went from bad to worse. The same procedure made it crash, but this
time I was unable to move the mouse cursor, unable to Alt+SysRq+K, and
unable to use the ACPI power button to make the computer shut down. I
had to shut down the hard way. Also, judging by how the fan was
revving up, something was taking 100% CPU. But I could not get a perf
report - I only got "Samples: 0".

Comment 14 Chris Wilson 2010-05-08 19:11:33 UTC

Created attachment 26286 [details]
Protect mmapped buffers from causal eviction.

Aye, found the same crash as soon as I tested it as well. ;-)

In conjunction with this patch, you may like to try an updated xf86-video-intel which avoids the fallback triggering this pathological behaviour. But first, I'd appreciate your tested-by (and then I can also add stable@kernel.org :)

Comment 15 Christian von Schultz 2010-05-09 10:42:27 UTC

Created attachment 26295 [details]
linux-2.6.32.12/drivers/gpu/drm/i915/i915_gem.c.rej

Linux 2.6.32.12 does not like that patch... (Linux 2.6.33.3 gives
similar results.)

~/software/kernel/linux-2.6.32.12 $ patch -p1 -i ../chris_wilson_8_may.patch 
patching file drivers/gpu/drm/i915/i915_drv.h
Hunk #1 succeeded at 487 (offset -70 lines).
patching file drivers/gpu/drm/i915/i915_gem.c
Hunk #1 succeeded at 51 (offset -1 lines).
Hunk #2 succeeded at 1067 (offset 2 lines).
Hunk #3 succeeded at 1205 (offset 4 lines).
Hunk #4 succeeded at 2115 (offset -54 lines).
Hunk #5 succeeded at 2136 (offset -49 lines).
Hunk #6 succeeded at 4253 (offset -384 lines).
Hunk #7 succeeded at 4275 (offset -384 lines).
Hunk #8 succeeded at 4853 with fuzz 2 (offset 151 lines).
Hunk #9 FAILED at 5195.
Hunk #10 FAILED at 5239.
Hunk #11 succeeded at 4927 (offset -417 lines).
Hunk #12 FAILED at 5051.
3 out of 12 hunks FAILED -- saving rejects to file drivers/gpu/drm/i915/i915_gem.c.rej

Comment 16 Chris Wilson 2010-05-10 16:19:59 UTC

Created attachment 26318 [details]
v2.6.32.12 - Protect mmapped buffers from casual eviction.

Oops, that was against drm-intel-next which already has quite a few changes in the same area. Please can you try this patch which is against v2.6.32, in effect just dropping the trailing hunks called from the shrinker.

Comment 17 Christian von Schultz 2010-05-11 11:55:40 UTC

I have tested the patch in comment #16 with Linux 2.6.32.12. Everything seems to work. I believe that this patch fixed the bug. :-D

Comment 18 Christian von Schultz 2010-05-12 09:50:46 UTC

Created attachment 26353 [details]
perf report (patched linux 2.6.32.12) when switching virtual consoles

While the patch in comment #16 has fixed the crash, I have discovered that it introduces a new problem. If I switch from X to a virtual console (Ctrl+Alt+F1) and then try to go back to X (Alt+F7), the screen goes black. I can't see anything, and I can't go back to the virtual console (or if I do, I can't see it). When the screen goes black, it goes really black - no backlight or anything.

I went in with SSH and examined what happens when the display goes black. Looking at it with "top", the system seems to be working normally - no high loads or anything like that. A perf report (attached) during the switching shows "/usr/lib64/xorg/modules/drivers/intel_drv.so [.] i830SetLVDSPanelPower" dominating with 53.50% overhead, but if I do a perf record after that, with the display still black, there is no sign of X or intel in the perf report.

Comment 19 Chris Wilson 2010-05-12 10:02:26 UTC

Hmm, not that many samples so it could be a legitimate spike and seems consistent with turning off the display. I can't reproduce the behaviour here on drm-intel-next; can you sanity check that it is a regression caused by the patch? It shouldn't have any effect on changing VT as far as I can see. To recover, I guess you can try "xset -dpms", or "xset dpms force on".

Comment 20 Christian von Schultz 2010-05-12 15:13:16 UTC

"xset -dpms" did not turn the display back on, and neither did "xset dpms force on". (If I do "xset dpms force off", the display turns off, but comes back as soon as I type anything.)

Unpatched Linux 2.6.32.1: changing VT works.
Linux 2.6.32.1 patched (comment #16): Ctrl+Alt+F1 works, going back to X (Alt+F7) turns display off.

Unpatched Linux 2.6.32.9: changing VT works.
Linux 2.6.32.9 patched (comment #16): Ctrl+Alt+F1 works, going back to X (Alt+F7) turns display off.

Unpatched Linux 2.6.32.12: changing VT works.
Linux 2.6.32.12 patched (comment #16): Ctrl+Alt+F1 works, going back to X (Alt+F7) turns display off.

With Linux 2.6.32 it seems to be a regression caused by the patch. I'm not sure how to proceed with testing.

Comment 21 Chris Wilson 2010-05-12 16:23:01 UTC

Ok, no doubt the patch is triggering this. Nothing in dmesg? (I am hoping for an OOPS! ;-)

The oddity is that under KMS leave/enter VT do nothing, and I don't see how this could be interfering with dpms. Hmm, I think the answer probably lies in fbcon and why the console isn't appearing when you switch to it.

Comment 22 Christian von Schultz 2010-05-12 17:11:35 UTC

Created attachment 26359 [details]
dmesg after trying to switch to X (patched linux 2.6.32.12)

As a matter of fact, there _is_ something interesting in dmesg. It says "kernel BUG at drivers/gpu/drm/i915/i915_gem.c:4650!" I'm attaching the entire dmesg. The lines starting with "------------[ cut here ]------------" are the ones that appear after attempting to switch back to X, after having visited the first virtual console.

Comment 23 Chris Wilson 2010-05-12 17:26:51 UTC

Created attachment 26360 [details]
v2.6.32.12 - Protect mmapped buffers from casual eviction.

Oh. 2.6.32.12 still has the open-coded evict-everything in idle and you are using (fortunately in this case ;-) UMS so are hitting this path.

This patch should clear the extra list upon leaveVT and so avoid the BUG_ON upon returning. Not sure if it will fix the fbcon behaviour though.
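
One way to picture the failure and the fix is the toy model below. It is an interpretation of the explanation above, not the driver code: ToyGem, its list names and the exception are all hypothetical, and the assumption that the check fires because the extra list is still populated on return is a guess at the mechanism.

```python
class ToyGem:
    """Toy model of the buffer lists; all names are hypothetical, not i915 code."""

    def __init__(self, clear_extra_list_on_leave_vt):
        self.inactive = ["pixmap"]
        self.mmapped = ["big-image"]  # stands in for the extra list from the patch
        self.clear_extra_list_on_leave_vt = clear_extra_list_on_leave_vt

    def leave_vt(self):
        # Open-coded "evict everything": it only walks the lists it knows about.
        self.inactive.clear()
        if self.clear_extra_list_on_leave_vt:
            self.mmapped.clear()  # the follow-up fix: drain the extra list too

    def enter_vt(self):
        # Stand-in for the BUG_ON: every list is expected to be empty here.
        if self.inactive or self.mmapped:
            raise RuntimeError("toy BUG: buffers left over on VT re-entry")


if __name__ == "__main__":
    for fixed in (False, True):
        gem = ToyGem(clear_extra_list_on_leave_vt=fixed)
        gem.leave_vt()
        try:
            gem.enter_vt()
            print("fixed" if fixed else "unfixed", "-> ok")
        except RuntimeError as err:
            print("fixed" if fixed else "unfixed", "->", err)
```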

Comment 24 Christian von Schultz 2010-05-12 18:34:37 UTC

Preliminary testing says that the patch in comment #23 fixes the bug. I can switch virtual consoles however I want, and it hasn't crashed yet while browsing. I'll do some more tests tomorrow and report back, but it looks like everything works with the patch in comment #23.

Comment 25 Christian von Schultz 2010-05-13 13:16:57 UTC

Yes, the patch in comment #23 is good. It fixes the crash and the black screen problems. :-)