Bug 15911
Description
Christian von Schultz
2010-05-05 14:02:12 UTC
Created attachment 26227 [details]
/proc/version, /proc/modules, /proc/ioports, /proc/iomem
Created attachment 26228 [details]
dmesg (linux 2.6.32.1)
Created attachment 26229 [details]
.config (linux 2.6.32.1)
Created attachment 26230 [details]
.config (linux 2.6.31.13)
Created attachment 26231 [details]
dmesg (linux 2.6.31.13)
Created attachment 26232 [details]
Output of lspci -vvv (linux 2.6.31.13)
Created attachment 26233 [details]
Linux 2.6.31.13: /proc/version, /proc/modules, /proc/ioports, /proc/iomem
Created attachment 26234 [details]
X server log
I should have mentioned, I have found that this problem is also present on Linux 2.6.32.9 from kernel.org, Linux 2.6.32.12 from kernel.org, and Linux 2.6.32-gentoo-r7 from Gentoo.
I thought this might be one of the unbind bugs unearthed by the shrinker triggering eviction more often, but nothing relevant seems missing from 2.6.32.12. The symptoms you describe more closely match the page-fault-of-doom when handling large images in firefox, where we are attempting to copy from one surface to another, but can only fit one surface in the aperture at any one time. [The system/X just appears to hang as it proceeds to perform the copy very slowly.] But that is inconsistent with the bisection result - except in the scenario where the kernel is reclaiming memory from the principal hog, the i915 driver. Or rather such a freeze can be reproduced without the shrinker being involved, given sufficient memory. A perf profile during the freeze would confirm whether we are spending all our time in the kernel evicting textures and then mapping them back in.
Created attachment 26275 [details]
perf report (linux 2.6.32.12)
I have never actually used perf before. There were many event sources to choose from, and I was not sure what to do. I ended up doing a simple "perf record -a". Attached is the perf report, for those with overhead listed as 0.01% or higher. The top ones are the following:
Samples: 34389583
Overhead  Command      Shared Object              Symbol
........  ...........  .........................  ......
  87.45%  X            [kernel]                   [k] drm_clflush_pages
   0.95%  X            [kernel]                   [k] find_get_page
   0.91%  X            [kernel]                   [k] intel_i915_remove_entries
   0.81%  X            [kernel]                   [k] intel_i915_insert_entries
   0.67%  X            [kernel]                   [k] put_page
   0.66%  swapper      [kernel]                   [k] read_hpet
   0.65%  init         [kernel]                   [k] read_hpet
   0.44%  X            [kernel]                   [k] read_hpet
   0.34%  kcryptd      [kernel]                   [k] enc128
   0.28%  X            [kernel]                   [k] mark_page_accessed
   0.27%  X            [kernel]                   [k] radix_tree_lookup_slot
   0.26%  X            [kernel]                   [k] acpi_os_read_port
   0.17%  X            [kernel]                   [k] do_read_cache_page
   0.16%  firefox-bin  /[...]/firefox/libxul.so   [.] 0x00000000836a2f
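As a rough aid to reading the profile, the rows listed above can be tallied by symbol group and by command. This sketch is not part of the original report: it hard-codes the rows from the table, and the choice of which symbols count as "eviction-related" (`EVICTION_SYMBOLS`) is an assumption made purely for illustration.

```python
# Rows copied from the perf report above: (overhead %, command, symbol).
ROWS = [
    (87.45, "X", "drm_clflush_pages"),
    (0.95, "X", "find_get_page"),
    (0.91, "X", "intel_i915_remove_entries"),
    (0.81, "X", "intel_i915_insert_entries"),
    (0.67, "X", "put_page"),
    (0.66, "swapper", "read_hpet"),
    (0.65, "init", "read_hpet"),
    (0.44, "X", "read_hpet"),
    (0.34, "kcryptd", "enc128"),
    (0.28, "X", "mark_page_accessed"),
    (0.27, "X", "radix_tree_lookup_slot"),
    (0.26, "X", "acpi_os_read_port"),
    (0.17, "X", "do_read_cache_page"),
    (0.16, "firefox-bin", "0x00000000836a2f"),
]

# Assumption for illustration: treat the cache-line flush and the GTT
# insert/remove routines as the work done when buffers are evicted and
# mapped back in.
EVICTION_SYMBOLS = {
    "drm_clflush_pages",
    "intel_i915_remove_entries",
    "intel_i915_insert_entries",
}


def eviction_overhead(rows):
    """Sum the overhead of rows whose symbol is in EVICTION_SYMBOLS."""
    return sum(pct for pct, _cmd, sym in rows if sym in EVICTION_SYMBOLS)


def overhead_by_command(rows):
    """Sum the listed overhead per command name."""
    totals = {}
    for pct, cmd, _sym in rows:
        totals[cmd] = totals.get(cmd, 0.0) + pct
    return totals


if __name__ == "__main__":
    print(f"eviction-related: {eviction_overhead(ROWS):.2f}%")
    ranked = sorted(overhead_by_command(ROWS).items(), key=lambda kv: -kv[1])
    for cmd, pct in ranked:
        print(f"{cmd:12s} {pct:6.2f}%")
```

Under that grouping, the three symbols account for 89.17% of all samples, nearly all of it in drm_clflush_pages.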
Created attachment 26276 [details]
Protect mmapped buffers from causal eviction.
That profile is consistent with evicting the active buffer (causing cache-line flushes, the bane of our existence). This is an (untested) patch that should address the issue.
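To illustrate why this access pattern is so expensive, here is a toy model. It is not i915 code and does not describe what the attached patch does: the `Aperture` class, the least-recently-used policy, and the page counts are all assumptions chosen to show the thrashing that occurs when a copy touches two surfaces but only one fits at a time.

```python
from collections import OrderedDict


class Aperture:
    """Toy LRU model of an aperture that holds whole surfaces (hypothetical)."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.resident = OrderedDict()  # surface name -> None, oldest first
        self.evictions = 0

    def touch(self, surface):
        """Access a surface, binding it and evicting the LRU one if full."""
        if surface in self.resident:
            self.resident.move_to_end(surface)
            return
        if len(self.resident) >= self.capacity:
            self.resident.popitem(last=False)  # evict least recently used
            self.evictions += 1
        self.resident[surface] = None


def copy_surface(aperture, pages):
    """Copy page by page, touching the source and then the destination."""
    for _ in range(pages):
        aperture.touch("src")
        aperture.touch("dst")
    return aperture.evictions


if __name__ == "__main__":
    for capacity in (1, 2):
        count = copy_surface(Aperture(capacity), pages=1000)
        print(f"capacity={capacity}: {count} evictions")
```

In this model every eviction stands in for an unbind plus cache-line flush. With room for only one surface, almost every access evicts the other surface; with room for two, the copy causes no evictions at all.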
I applied your patch to Linux 2.6.32.12. I'm afraid to say that it went from bad to worse. The same procedure made it crash, but this time I was unable to move the mouse cursor, unable to Alt+SysRq+K, and unable to use the ACPI power button to make the computer shut down. I had to shut down the hard way. Also, judging by how the fan was revving up, something was taking 100% CPU. But I could not get a perf report - I only got "Samples: 0".
Created attachment 26286 [details]
Protect mmapped buffers from causal eviction.
Aye, found the same crash as soon as I tested it as well. ;-) In conjunction with this patch, you may like to try an updated xf86-video-intel which avoids the fallback triggering this pathological behaviour. But first, I'd appreciate your tested-by (and then I can also add stable@kernel.org :)
Created attachment 26295 [details]
linux-2.6.32.12/drivers/gpu/drm/i915/i915_gem.c.rej
Linux 2.6.32.12 does not like that patch... (Linux 2.6.33.3 gives
similar results.)
~/software/kernel/linux-2.6.32.12 $ patch -p1 -i ../chris_wilson_8_may.patch
patching file drivers/gpu/drm/i915/i915_drv.h
Hunk #1 succeeded at 487 (offset -70 lines).
patching file drivers/gpu/drm/i915/i915_gem.c
Hunk #1 succeeded at 51 (offset -1 lines).
Hunk #2 succeeded at 1067 (offset 2 lines).
Hunk #3 succeeded at 1205 (offset 4 lines).
Hunk #4 succeeded at 2115 (offset -54 lines).
Hunk #5 succeeded at 2136 (offset -49 lines).
Hunk #6 succeeded at 4253 (offset -384 lines).
Hunk #7 succeeded at 4275 (offset -384 lines).
Hunk #8 succeeded at 4853 with fuzz 2 (offset 151 lines).
Hunk #9 FAILED at 5195.
Hunk #10 FAILED at 5239.
Hunk #11 succeeded at 4927 (offset -417 lines).
Hunk #12 FAILED at 5051.
3 out of 12 hunks FAILED -- saving rejects to file drivers/gpu/drm/i915/i915_gem.c.rej
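When a patch applies this unevenly, it can help to pull the failed hunks out of the output before opening the .rej file. The sketch below is not from the original thread. It parses the `patch` output quoted above with two regular expressions; the helper name `failed_hunks` is hypothetical, and the patterns match only the line shapes seen in this output.

```python
import re

# Output of the patch run quoted above, copied verbatim.
PATCH_OUTPUT = """\
patching file drivers/gpu/drm/i915/i915_drv.h
Hunk #1 succeeded at 487 (offset -70 lines).
patching file drivers/gpu/drm/i915/i915_gem.c
Hunk #1 succeeded at 51 (offset -1 lines).
Hunk #2 succeeded at 1067 (offset 2 lines).
Hunk #3 succeeded at 1205 (offset 4 lines).
Hunk #4 succeeded at 2115 (offset -54 lines).
Hunk #5 succeeded at 2136 (offset -49 lines).
Hunk #6 succeeded at 4253 (offset -384 lines).
Hunk #7 succeeded at 4275 (offset -384 lines).
Hunk #8 succeeded at 4853 with fuzz 2 (offset 151 lines).
Hunk #9 FAILED at 5195.
Hunk #10 FAILED at 5239.
Hunk #11 succeeded at 4927 (offset -417 lines).
Hunk #12 FAILED at 5051.
3 out of 12 hunks FAILED -- saving rejects to file drivers/gpu/drm/i915/i915_gem.c.rej
"""

FILE_RE = re.compile(r"^patching file (\S+)$")
HUNK_RE = re.compile(r"^Hunk #(\d+) (succeeded|FAILED) at (\d+)")


def failed_hunks(text):
    """Map each patched file to the list of hunk numbers that FAILED."""
    failed = {}
    current = None
    for line in text.splitlines():
        file_match = FILE_RE.match(line)
        if file_match:
            current = file_match.group(1)
            continue
        hunk_match = HUNK_RE.match(line)
        if hunk_match and hunk_match.group(2) == "FAILED":
            failed.setdefault(current, []).append(int(hunk_match.group(1)))
    return failed


if __name__ == "__main__":
    for path, hunks in failed_hunks(PATCH_OUTPUT).items():
        print(path, hunks)
```

For the output above this reports hunks 9, 10 and 12 of i915_gem.c, matching the "3 out of 12 hunks FAILED" summary line.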
Created attachment 26318 [details]
v2.6.32.12 - Protect mmapped buffers from casual eviction.
Oops, that was against drm-intel-next which already has quite a few changes in the same area. Please can you try this patch which is against v2.6.32, in effect just dropping the trailing hunks called from the shrinker.
I have tested the patch in comment #16 with Linux 2.6.32.12. Everything seems to work. I believe that this patch fixed the bug. :-D
Created attachment 26353 [details]
perf report (patched linux 2.6.32.12) when switching virtual consoles
While the patch in comment #16 has fixed the crash, I have discovered that it introduces a new problem. If I switch from X to a virtual console (Ctrl+Alt+F1) and then try to go back to X (Alt+F7), the screen goes black. I can't see anything, and I can't go back to the virtual console (or if I do, I can't see it). When the screen goes black, it goes really black - no backlight or anything. I went in with SSH and examined what happens when the display goes black. Looking at it with "top", the system seems to be working normally - no high loads or anything like that. A perf report (attached) during the switching shows "/usr/lib64/xorg/modules/drivers/intel_drv.so [.] i830SetLVDSPanelPower" dominating with 53.50% overhead, but if I do a perf record after that, with the display still black, there is no sign of X or intel in the perf report.
Hmm, not that many samples so it could be a legitimate spike and seems consistent with turning off the display. I can't reproduce the behaviour here on drm-intel-next, can you sanity check that it is a regression caused by the patch? It shouldn't have any effect on changing VT as far as I can see. To recover, I guess you can try "xset -dpms", or "xset dpms on".
"xset -dpms" did not turn the display back on, and neither did "xset dpms force on". (If I do "xset dpms force off", the display turns off, but comes back as soon as I type anything.)
Unpatched Linux 2.6.32.1: changing VT works.
Linux 2.6.32.1 patched (comment #16): Ctrl+Alt+F1 works, going back to X (Alt+F7) turns display off.
Unpatched Linux 2.6.32.9: changing VT works.
Linux 2.6.32.9 patched (comment #16): Ctrl+Alt+F1 works, going back to X (Alt+F7) turns display off.
Unpatched Linux 2.6.32.12: changing VT works.
Linux 2.6.32.12 patched (comment #16): Ctrl+Alt+F1 works, going back to X (Alt+F7) turns display off.
With Linux 2.6.32 it seems to be a regression caused by the patch. I'm not sure how to proceed with testing.
Ok, no doubt the patch is triggering this. Nothing in dmesg? (I am hoping for an OOPS! ;-) The oddity is that under KMS leave/enter VT do nothing, and I don't see how this could be interfering with dpms. Hmm, I think the answer probably lies in fbcon and why the console isn't appearing when you switch to it.
Created attachment 26359 [details]
dmesg after trying to switch to X (patched linux 2.6.32.12)
As a matter of fact, there _is_ something interesting in dmesg. It says "kernel BUG at drivers/gpu/drm/i915/i915_gem.c:4650!" I'm attaching the entire dmesg. The lines starting with "------------[ cut here ]------------" are the ones that appear after attempting to switch back to X, after having visited the first virtual console.
Created attachment 26360 [details]
v2.6.32.12 - Protect mmapped buffers from casual eviction.
Oh. 2.6.32.12 still has the open-coded evict-everything in idle and you are using (fortunately in this case ;-) UMS so are hitting this path.
This patch should clear the extra list upon leaveVT and so avoid the BUG_ON upon returning. Not sure if it will fix the fbcon behaviour though.
Preliminary testing says that the patch in comment #23 fixes the bug. I can switch virtual consoles however I want, and it hasn't crashed yet while browsing. I'll do some more tests tomorrow and report back, but it looks like everything works with the patch in comment #23.
Yes, the patch in comment #23 is good. It fixes the crash and the black screen problems. :-)