Bug 64841
| Summary: | [snb] s4 regression due to hsw s4 gtt quiescent workaround | | |
|---|---|---|---|
| Product: | Drivers | Reporter: | Milan Plzik (milan.plzik) |
| Component: | Video(DRI - Intel) | Assignee: | Daniel Vetter (daniel) |
| Status: | RESOLVED CODE_FIX | | |
| Severity: | normal | CC: | aaron.lu, ben, bjackson0971, daniel, intel-gfx-bugs, jhyeon, przanoni, rjw, tiwai, tprevite |
| Priority: | P1 | | |
| Hardware: | All | | |
| OS: | Linux | | |
| Kernel Version: | 3.13-rc | Subsystem: | |
| Regression: | Yes | Bisected commit-id: | |
| Attachments: | functional revert on top of drm-intel-nightly | | |
Description
Milan Plzik
2013-11-12 08:25:58 UTC
If you do this:

# echo shutdown > /sys/power/disk

before you hibernate, does it make a difference?

My Dell Latitude E6520 laptop has a similar hang. Suspend worked in my 3.12-rc6 build, but is broken in rc7 and 3.12 final. The echo shutdown made no difference for me.

On Wednesday, November 20, 2013 09:30:50 PM bugzilla-daemon@bugzilla.kernel.org wrote:
> https://bugzilla.kernel.org/show_bug.cgi?id=64841
>
> Brad Jackson <bjackson0971@gmail.com> changed:
>
> What |Removed |Added
> ----------------------------------------------------------------------------
> CC| |bjackson0971@gmail.com
>
> --- Comment #2 from Brad Jackson <bjackson0971@gmail.com> ---
> My Dell Latitude E6520 laptop has a similar hang. Suspend worked in my
> 3.12-rc6 build, but is broken in rc7 and 3.12 final. The echo shutdown
> made no difference for me.

Any chance to bisect changes between 3.12-rc6 and 3.12-rc7?

828c79087cec61eaf4c76bb32c222fbe35ac3930 is the first bad commit
commit 828c79087cec61eaf4c76bb32c222fbe35ac3930
Author: Ben Widawsky <benjamin.widawsky@intel.com>
Date: Wed Oct 16 09:21:30 2013 -0700

    drm/i915: Disable GGTT PTEs on GEN6+ suspend

Confirmed that backing that patch out of 3.12.1 fixes suspend.

I guess this provides further evidence that the GPU is not behaving. Which, see #1, makes me wonder if a revert is even the correct solution.

A few questions:
1. Prior to this patch, had you ever seen any corruption? Are you sure? Can you confirm this with slub debugging, or some other way? One risk to simply disabling this workaround is that it can prevent real memory corruption.
2. Can you enable drm.debug=0xe, and using netconsole or some other mechanism try to collect all messages after the hang?
3. What is the actual PCI ID of gfx, 0:2.0?
4. Is s3 affected?

As for me (I have not yet done any tests, since this is also my production laptop):

1) 3.6.11 worked fine for me.
I'm seeing some pixmap corruption in X and chromium, but that might be more because of userspace components, and I also often see "[drm:ring_stuck] *ERROR* Kicking stuck wait on render ring" when using DisplayPort monitor, but I hope that's another issue.

2) I hope I'll find some time to test this during weekend

3) PCI ID is:

00:02.0 0300: 8086:0126 (rev 09) (prog-if 00 [VGA controller])

4) S3 works fine for me, I use it as an alternative for S4.

(In reply to Milan Plzik from comment #7)
> As for me (I have not yet done any tests, since this is also my production
> laptop):
>
> 1) 3.6.11 worked fine for me.

I hope you mean 3.11.6

> I'm seeing some pixmap corruption in X and
> chromium, but that might be more because of userspace components, and I also
> often see "[drm:ring_stuck] *ERROR* Kicking stuck wait on render ring" when
> using DisplayPort monitor, but I hope that's another issue.

I primarily meant outside of gfx domain. It would be likely to see fs corruption for example. Depending on your system, it might be unlikely to see this corruption without using various memory debugging tools. My fear is we've traded silent memory corruption for hangs.

> 2) I hope I'll find some time to test this during weekend

Do you have the same symptom, i.e. you last see: "[drm] Enabling RC6 states: RC6 on, RC6p off, RC6pp off." Have you done the bisect to my commit also?

> 3) PCI ID is:
>
> 00:02.0 0300: 8086:0126 (rev 09) (prog-if 00 [VGA controller])
>
> 4) S3 works fine for me, I use it as an alternative for S4.

This patch sadly/ironically, is what fixes S4 on HSW.

1) Sorry, I meant 3.11.1; just stupid typo. I have not yet seen corruption of any kind, excluding corruption caused by not-synced filesystem when the hang occurs. :)

2) Yes, I'm the original poster. I have not yet managed to do bisect.

Actually the memory corruption of S4 was seen on multiple Haswell machines, but not on older ones, AFAIK. Maybe a oneliner below should suffice?
(Of course, it'd be better to fix the comment in the final patch.)

--- a/drivers/gpu/drm/i915/i915_gem_gtt.c
+++ b/drivers/gpu/drm/i915/i915_gem_gtt.c
@@ -818,7 +818,7 @@ void i915_gem_suspend_gtt_mappings(struct drm_device *dev)
 	/* Don't bother messing with faults pre GEN6 as we have little
 	 * documentation supporting that it's a good idea.
 	 */
-	if (INTEL_INFO(dev)->gen < 6)
+	if (INTEL_INFO(dev)->gen < 7)
 		return;
 
 	i915_check_and_clear_faults(dev);

Imo we have clear evidence here that across S4 the gpu writes to the gtt when we no longer expect it to. The only difference is that on Haswell it seems to not crash if we set all ptes to invalid, whereas snb falls over. Not surprising really because snb.

Two things:

- Milan, can you please grab a working kernel and stress-test S4? Script would be good which does an S4 every 5 minutes or so with X running (maybe some gl apps if you have some handy) and memory pressure (kernel compile in an endless loop of make -j 4 && make clean or so). If this is the same bug as on hsw we'll eventually see memory/fs corruptions (so have backups ready).

- For the fix I think we should try to point all ptes at some piece in stolen memory instead of invalid ptes. Ofc that's just a stop-gap, I guess we really need to figure out what function is exactly doing these gtt writes.

Milan, can you also try to reproduce this on our development branch? I've tried it on two sandybridge machines, and cannot reproduce. I wonder if we've managed to fix it in some other way already.

http://cgit.freedesktop.org/~danvet/drm-intel/log/?h=drm-intel-nightly

Takashi, the only issue I have with your approach is I am afraid we're just ignoring corruption that will manifest itself in some other way later.
I suppose if we're lucky, we always end up doing a read, and we therefore won't corrupt memory - but unless someone has the time and proficiency to prove we're not corrupting memory, I think it's a dangerous fix (I think trading an obvious failure for a non-obvious one is a step up).

Daniel, the stolen memory approach seems good as long as we make sure it's not a piece of stolen memory used by anything else.

I can't say I am surprised that we've found yet another thing that makes Sandybridge unhappy. I would like to know what Windows does, since I tested their "quiescing" method in Takashi's original bug, and it demonstrably did not work.

(In reply to Ben Widawsky from comment #12)
> prove we're not corrupting memory, I think it's a dangerous fix (I think
> trading an obvious failure for a non-obvious one is a step up).

step down

For the original bug we could try to reproduce it with fbcon/vgacon disabled to rule out them stomping onto the gtt when they shouldn't. Nowadays we have a Kconfig option for that.

BTW, does S2RAM still work?

In 3.13.0, I can suspend to RAM, but suspend to disk still freezes.

Suspend to disk still freezes for me in 3.14.0-rc8. I am continuing to use 3.12.x until this is fixed.

Created attachment 130761
functional revert on top of drm-intel-nightly
Can you please test this patch on top of latest drm-intel-nightly (what's going in for 3.15)? It should also apply on 3.14.
I'll queue that up if it fixes your issue.
The patch didn't apply to 3.14-rc8, but I edited i915_gem_suspend_gtt_mappings manually (line 845) and suspend to disk and resume now work. Suspend to RAM also works.

My apologies that this regression was lingering for so long. Please complain earlier next time around. Fix is in dinq, will land in 3.15-rc1 and then get backported to stable kernels:

commit 79541f94acdbbc33441b3b56bd5ee831ba080b9e
Author: Daniel Vetter <daniel.vetter@ffwll.ch>
Date: Wed Mar 26 20:08:20 2014 +0100

    drm/i915: Undo gtt scratch pte unmapping again
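
For anyone who wants to verify a kernel against this bug, the stress test requested earlier in the thread (an S4 cycle every five minutes or so, with memory pressure from an endless kernel build) could be scripted roughly as below. This is only a sketch under assumptions that are not from the thread: rtcwake(8) from util-linux is installed, the RTC is able to wake the machine from S4, and KERNEL_TREE points at an already configured kernel source tree. All function and variable names are made up for illustration. DRY_RUN defaults to 1, so nothing is hibernated until it is switched off.

```shell
#!/bin/sh
# Sketch of an S4 stress loop. Assumptions (not from the thread): rtcwake(8)
# is available, the RTC can wake this machine from S4, and KERNEL_TREE is a
# configured kernel source tree. All names here are hypothetical.

DRY_RUN="${DRY_RUN:-1}"                     # 1 = only print what would run
KERNEL_TREE="${KERNEL_TREE:-$HOME/linux}"   # tree used for memory pressure
INTERVAL="${INTERVAL:-300}"                 # seconds between S4 cycles
WAKE_AFTER="${WAKE_AFTER:-60}"              # seconds to stay hibernated

# Run a command, or just print it when DRY_RUN=1.
run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "would run: $*"
    else
        "$@"
    fi
}

# Memory pressure: endless kernel build, as suggested in the thread.
build_loop() {
    while :; do
        make -C "$KERNEL_TREE" -j 4 && make -C "$KERNEL_TREE" clean
    done
}

# One hibernate/resume cycle: flush filesystems, arm the RTC, enter S4.
s4_cycle() {
    run sync
    run rtcwake -m disk -s "$WAKE_AFTER"
}

# Repeat s4_cycle $1 times, waiting INTERVAL seconds between cycles.
stress() {
    count="$1"
    i=1
    while [ "$i" -le "$count" ]; do
        echo "cycle $i"
        s4_cycle
        if [ "$i" -lt "$count" ]; then
            run sleep "$INTERVAL"
        fi
        i=$((i + 1))
    done
}
```

To use it for real, source the script as root in a terminal under X, start `build_loop &`, then set `DRY_RUN=0` and call `stress 100`. As noted in the thread, have backups ready: if the GPU is still writing through the GTT across S4, the expected failure mode is memory or filesystem corruption.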