Bug 64841 - [snb] s4 regression due to hsw s4 gtt quiescent workaround
Status: RESOLVED CODE_FIX
Alias: None
Product: Drivers
Classification: Unclassified
Component: Video(DRI - Intel)
Hardware: All Linux
Importance: P1 normal
Assignee: Daniel Vetter
Reported: 2013-11-12 08:25 UTC by Milan Plzik
Modified: 2014-03-26 19:11 UTC

See Also:
Kernel Version: 3.13-rc
Subsystem:
Regression: Yes
Bisected commit-id:


Attachments
functional revert on top of drm-intel-nightly (526 bytes, patch)
2014-03-26 18:41 UTC, Daniel Vetter

Description Milan Plzik 2013-11-12 08:25:58 UTC
After invoking suspend to disk, the Linux kernel starts writing pages to disk, but ultimately freezes after a message from the GPU driver appears; excerpt from the console output:

PM: Image saving progress:  30%
PM: Image saving progress:  40%
PM: Image saving progress:  50%
[drm] Enabling RC6 states: RC6 on, RC6p off, RC6pp off

After this line, there is no disk activity and the system does not react to any keypress (nor to any magic SysRq combination). A forced shutdown is required to bring the system back to a usable state.

-----

Affected hardware:
Lenovo Thinkpad X220 tablet

uname -a:
Linux localhost 3.12.0 #2 SMP PREEMPT Mon Nov 11 18:44:20 CET 2013 x86_64 GNU/Linux

lspci -v:
00:00.0 Host bridge: Intel Corporation 2nd Generation Core Processor Family DRAM Controller (rev 09)
	Subsystem: Lenovo Device 21db
	Flags: bus master, fast devsel, latency 0
	Capabilities: <access denied>

00:02.0 VGA compatible controller: Intel Corporation 2nd Generation Core Processor Family Integrated Graphics Controller (rev 09) (prog-if 00 [VGA controller])
	Subsystem: Lenovo Device 21db
	Flags: bus master, fast devsel, latency 0, IRQ 47
	Memory at f0000000 (64-bit, non-prefetchable) [size=4M]
	Memory at e0000000 (64-bit, prefetchable) [size=256M]
	I/O ports at 5000 [size=64]
	Expansion ROM at <unassigned> [disabled]
	Capabilities: <access denied>
	Kernel driver in use: i915
	Kernel modules: i915

00:16.0 Communication controller: Intel Corporation 6 Series/C200 Series Chipset Family MEI Controller #1 (rev 04)
	Subsystem: Lenovo Device 21db
	Flags: bus master, fast devsel, latency 0, IRQ 42
	Memory at f2525000 (64-bit, non-prefetchable) [size=16]
	Capabilities: <access denied>
	Kernel driver in use: mei_me

00:16.3 Serial controller: Intel Corporation 6 Series/C200 Series Chipset Family KT Controller (rev 04) (prog-if 02 [16550])
	Subsystem: Lenovo Device 21db
	Flags: bus master, 66MHz, fast devsel, latency 0, IRQ 19
	I/O ports at 50b0 [size=8]
	Memory at f252c000 (32-bit, non-prefetchable) [size=4K]
	Capabilities: <access denied>
	Kernel driver in use: serial

00:19.0 Ethernet controller: Intel Corporation 82579LM Gigabit Network Connection (rev 04)
	Subsystem: Lenovo Device 21ce
	Flags: bus master, fast devsel, latency 0, IRQ 44
	Memory at f2500000 (32-bit, non-prefetchable) [size=128K]
	Memory at f252b000 (32-bit, non-prefetchable) [size=4K]
	I/O ports at 5080 [size=32]
	Capabilities: <access denied>
	Kernel driver in use: e1000e
	Kernel modules: e1000e

00:1a.0 USB controller: Intel Corporation 6 Series/C200 Series Chipset Family USB Enhanced Host Controller #2 (rev 04) (prog-if 20 [EHCI])
	Subsystem: Lenovo Device 21db
	Flags: bus master, medium devsel, latency 0, IRQ 16
	Memory at f252a000 (32-bit, non-prefetchable) [size=1K]
	Capabilities: <access denied>
	Kernel driver in use: ehci-pci
	Kernel modules: ehci_pci

00:1b.0 Audio device: Intel Corporation 6 Series/C200 Series Chipset Family High Definition Audio Controller (rev 04)
	Subsystem: Lenovo Device 21db
	Flags: bus master, fast devsel, latency 0, IRQ 46
	Memory at f2520000 (64-bit, non-prefetchable) [size=16K]
	Capabilities: <access denied>
	Kernel driver in use: snd_hda_intel
	Kernel modules: snd_hda_intel

00:1c.0 PCI bridge: Intel Corporation 6 Series/C200 Series Chipset Family PCI Express Root Port 1 (rev b4) (prog-if 00 [Normal decode])
	Flags: fast devsel
	Bus: primary=00, secondary=02, subordinate=02, sec-latency=0
	Capabilities: <access denied>
	Kernel driver in use: pcieport

00:1c.1 PCI bridge: Intel Corporation 6 Series/C200 Series Chipset Family PCI Express Root Port 2 (rev b4) (prog-if 00 [Normal decode])
	Flags: bus master, fast devsel, latency 0
	Bus: primary=00, secondary=03, subordinate=03, sec-latency=0
	Memory behind bridge: f2400000-f24fffff
	Capabilities: <access denied>
	Kernel driver in use: pcieport

00:1c.3 PCI bridge: Intel Corporation 6 Series/C200 Series Chipset Family PCI Express Root Port 4 (rev b4) (prog-if 00 [Normal decode])
	Flags: bus master, fast devsel, latency 0
	Bus: primary=00, secondary=05, subordinate=0c, sec-latency=0
	I/O behind bridge: 00004000-00004fff
	Memory behind bridge: f1c00000-f23fffff
	Prefetchable memory behind bridge: 00000000f0400000-00000000f0bfffff
	Capabilities: <access denied>
	Kernel driver in use: pcieport

00:1c.4 PCI bridge: Intel Corporation 6 Series/C200 Series Chipset Family PCI Express Root Port 5 (rev b4) (prog-if 00 [Normal decode])
	Flags: bus master, fast devsel, latency 0
	Bus: primary=00, secondary=0d, subordinate=0d, sec-latency=0
	I/O behind bridge: 00003000-00003fff
	Memory behind bridge: f1400000-f1bfffff
	Prefetchable memory behind bridge: 00000000f0c00000-00000000f13fffff
	Capabilities: <access denied>
	Kernel driver in use: pcieport

00:1d.0 USB controller: Intel Corporation 6 Series/C200 Series Chipset Family USB Enhanced Host Controller #1 (rev 04) (prog-if 20 [EHCI])
	Subsystem: Lenovo Device 21db
	Flags: bus master, medium devsel, latency 0, IRQ 23
	Memory at f2529000 (32-bit, non-prefetchable) [size=1K]
	Capabilities: <access denied>
	Kernel driver in use: ehci-pci
	Kernel modules: ehci_pci

00:1f.0 ISA bridge: Intel Corporation QM67 Express Chipset Family LPC Controller (rev 04)
	Subsystem: Lenovo Device 21db
	Flags: bus master, medium devsel, latency 0
	Capabilities: <access denied>
	Kernel driver in use: lpc_ich
	Kernel modules: lpc_ich

00:1f.2 SATA controller: Intel Corporation 6 Series/C200 Series Chipset Family 6 port SATA AHCI Controller (rev 04) (prog-if 01 [AHCI 1.0])
	Subsystem: Lenovo Device 21db
	Flags: bus master, 66MHz, medium devsel, latency 0, IRQ 43
	I/O ports at 50a8 [size=8]
	I/O ports at 50bc [size=4]
	I/O ports at 50a0 [size=8]
	I/O ports at 50b8 [size=4]
	I/O ports at 5060 [size=32]
	Memory at f2528000 (32-bit, non-prefetchable) [size=2K]
	Capabilities: <access denied>
	Kernel driver in use: ahci
	Kernel modules: ahci

00:1f.3 SMBus: Intel Corporation 6 Series/C200 Series Chipset Family SMBus Controller (rev 04)
	Subsystem: Lenovo Device 21db
	Flags: medium devsel, IRQ 18
	Memory at f2524000 (64-bit, non-prefetchable) [size=256]
	I/O ports at efa0 [size=32]
	Kernel driver in use: i801_smbus
	Kernel modules: i2c_i801

03:00.0 Network controller: Intel Corporation Centrino Advanced-N 6205 [Taylor Peak] (rev 34)
	Subsystem: Intel Corporation Centrino Advanced-N 6205 AGN
	Flags: bus master, fast devsel, latency 0, IRQ 45
	Memory at f2400000 (64-bit, non-prefetchable) [size=8K]
	Capabilities: <access denied>
	Kernel driver in use: iwlwifi
	Kernel modules: iwlwifi

0d:00.0 System peripheral: Ricoh Co Ltd PCIe SDXC/MMC Host Controller (rev 07) (prog-if 01)
	Subsystem: Lenovo Device 21db
	Flags: bus master, fast devsel, latency 0, IRQ 16
	Memory at f1400000 (32-bit, non-prefetchable) [size=256]
	Capabilities: <access denied>
	Kernel driver in use: sdhci-pci
Comment 1 Aaron Lu 2013-11-18 02:03:31 UTC
If you do this:
# echo shutdown > /sys/power/disk
before you hibernate, does it make a difference?
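
A minimal sketch of the check suggested above, assuming the usual layout of /sys/power/disk, where the active mode is shown in square brackets. The helper names are hypothetical and the second function must run as root.

```shell
#!/bin/sh
# Sketch only: helper names are made up for illustration.

# Print the currently selected mode from a /sys/power/disk style string,
# in which the active entry is wrapped in square brackets.
current_disk_mode() {
    printf '%s\n' "$1" | sed -n 's/.*\[\([a-z_]*\)].*/\1/p'
}

# Select the "shutdown" mode and then trigger hibernation (S4).
# Only call this on a machine you are prepared to hibernate.
hibernate_with_shutdown_mode() {
    echo "mode before: $(current_disk_mode "$(cat /sys/power/disk)")"
    echo shutdown > /sys/power/disk
    echo disk > /sys/power/state
}
```

Calling current_disk_mode "$(cat /sys/power/disk)" before and after the write shows whether the mode switch actually took effect.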
Comment 2 Brad Jackson 2013-11-20 21:30:50 UTC
My Dell Latitude E6520 laptop has a similar hang. Suspend worked in my 3.12-rc6 build, but is broken in rc7 and 3.12 final. The echo shutdown made no difference for me.
Comment 3 Rafael J. Wysocki 2013-11-21 00:05:26 UTC
On Wednesday, November 20, 2013 09:30:50 PM bugzilla-daemon@bugzilla.kernel.org wrote:
> https://bugzilla.kernel.org/show_bug.cgi?id=64841
> 
> --- Comment #2 from Brad  Jackson <bjackson0971@gmail.com> ---
> My Dell Latitude E6520 laptop has a similar hang. Suspend worked in my
> 3.12-rc6 build, but is broken in rc7 and 3.12 final. The echo shutdown made
> no difference for me.

Any chance to bisect changes between 3.12-rc6 and 3.12-rc7?
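
A rough sketch of such a bisect, assuming a mainline git checkout that carries the v3.12-rc6 and v3.12-rc7 tags. The run and start_bisect helpers are hypothetical; DRY_RUN defaults to 1, so the commands are only printed.

```shell
#!/bin/sh
# Sketch only: helper names are made up for illustration.
DRY_RUN=${DRY_RUN:-1}

# Print the command in dry-run mode, execute it otherwise.
run() {
    if [ "$DRY_RUN" = 1 ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

# Start a bisect between a known-good and a known-bad tag.
# Note that "git bisect start" takes the bad revision first.
start_bisect() {
    good=$1
    bad=$2
    run git bisect start "$bad" "$good"
}

# After each build and boot, test hibernation and report the result:
#   git bisect good    # hibernation completed
#   git bisect bad     # hibernation hung
# git prints "<sha> is the first bad commit" when the search is done.
```

With DRY_RUN=0, start_bisect v3.12-rc6 v3.12-rc7 checks out the first revision to test.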
Comment 4 Brad Jackson 2013-11-21 16:09:18 UTC
828c79087cec61eaf4c76bb32c222fbe35ac3930 is the first bad commit
commit 828c79087cec61eaf4c76bb32c222fbe35ac3930
Author: Ben Widawsky <benjamin.widawsky@intel.com>
Date:   Wed Oct 16 09:21:30 2013 -0700

    drm/i915: Disable GGTT PTEs on GEN6+ suspend
Comment 5 Brad Jackson 2013-11-21 16:34:24 UTC
Confirmed that backing that patch out of 3.12.1 fixes suspend.
Comment 6 Ben Widawsky 2013-11-21 21:29:06 UTC
I guess this provides further evidence that the GPU is not behaving. That (see question 1 below) makes me wonder whether a revert is even the correct solution.

A few questions:

1. Prior to this patch, had you ever seen any corruption? Are you sure? Can you confirm this with slub debugging, or some other way? One risk of simply disabling this workaround is that the workaround may be preventing real memory corruption.

2. Can you enable drm.debug=0xe, and using netconsole or some other mechanism try to collect all messages after the hang?

3. What is the actual PCI ID of gfx, 0:2.0?

4. Is S3 affected?
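
A sketch of how the information for questions 2 and 3 could be gathered. The IP addresses and the interface name are placeholders, the helper names are hypothetical, and the netconsole ports are the documented defaults written out explicitly (an empty target MAC means broadcast).

```shell
#!/bin/sh
# Sketch only: helper names are made up for illustration.

# Build the extra kernel command-line options for a debug boot.
# Arguments: local IP, local network interface, IP of the log receiver.
debug_cmdline() {
    src_ip=$1
    dev=$2
    tgt_ip=$3
    # drm.debug=0xe is the value requested above; 6665/6666 are the
    # default netconsole source and target ports.
    printf 'drm.debug=0xe netconsole=6665@%s/%s,6666@%s/\n' \
        "$src_ip" "$dev" "$tgt_ip"
}

# Print the numeric vendor:device ID of the graphics device at 00:02.0.
gfx_pci_id() {
    lspci -n -s 00:02.0
}
```

The string printed by debug_cmdline is appended to the kernel command line in the boot loader. On the receiving machine, a UDP listener on port 6666 (for example a netcat variant) collects the messages.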
Comment 7 Milan Plzik 2013-11-21 23:39:17 UTC
As for me (I have not yet done any tests, since this is also my production laptop):

1) 3.6.11 worked fine for me. I'm seeing some pixmap corruption in X and chromium, but that might be more because of userspace components, and I also often see "[drm:ring_stuck] *ERROR* Kicking stuck wait on render ring" when using a DisplayPort monitor, but I hope that's another issue.

2) I hope I'll find some time to test this during the weekend

3) PCI ID is:

00:02.0 0300: 8086:0126 (rev 09) (prog-if 00 [VGA controller])

4) S3 works fine for me, I use it as an alternative for S4.
Comment 8 Ben Widawsky 2013-11-21 23:47:20 UTC
(In reply to Milan Plzik from comment #7)
> As for me (I have not yet done any tests, since this is also my production
> laptop):
> 
> 1) 3.6.11 worked fine for me. 

I hope you mean 3.11.6

> I'm seeing some pixmap corruption in X and
> chromium, but that might be more because of userspace components, and I also
> often see "[drm:ring_stuck] *ERROR* Kicking stuck wait on render ring" when
> using DisplayPort monitor, but I hope  that's another issue.

I primarily meant outside of the gfx domain. You would be likely to see fs corruption, for example. Depending on your system, it might be unlikely to see this corruption without using various memory debugging tools. My fear is we've traded silent memory corruption for hangs.

> 
> 2) I hope I'll find some time to test this during weekend

Do you have the same symptom, i.e. the last thing you see is "[drm] Enabling RC6 states: RC6 on, RC6p off, RC6pp off"? Have you also bisected to my commit?

> 
> 3) PCI ID is:
> 
> 00:02.0 0300: 8086:0126 (rev 09) (prog-if 00 [VGA controller])
> 
> 4) S3 works fine for me, I use it as an alternative for S4.

This patch, sadly/ironically, is what fixes S4 on HSW.
Comment 9 Milan Plzik 2013-11-21 23:57:45 UTC
1) Sorry, I meant 3.11.1; just a stupid typo. I have not yet seen corruption of any kind, excluding corruption caused by an unsynced filesystem when the hang occurs. :)

2) Yes, I'm the original poster. I have not yet managed to do a bisect.
Comment 10 Takashi Iwai 2013-11-22 06:50:01 UTC
Actually the memory corruption of S4 was seen on multiple Haswell machines, but not on older ones, AFAIK.

Maybe the one-liner below would suffice?  (Of course, it'd be better to fix the comment in the final patch.)

--- a/drivers/gpu/drm/i915/i915_gem_gtt.c
+++ b/drivers/gpu/drm/i915/i915_gem_gtt.c
@@ -818,7 +818,7 @@ void i915_gem_suspend_gtt_mappings(struct drm_device *dev)
        /* Don't bother messing with faults pre GEN6 as we have little
         * documentation supporting that it's a good idea.
         */
-       if (INTEL_INFO(dev)->gen < 6)
+       if (INTEL_INFO(dev)->gen < 7)
                return;
 
        i915_check_and_clear_faults(dev);
Comment 11 Daniel Vetter 2013-11-25 10:03:46 UTC
Imo we have clear evidence here that across S4 the gpu writes to the gtt when we no longer expect it to. The only difference is that on Haswell it seems to not crash if we set all ptes to invalid, whereas snb falls over. Not surprising really because snb.

Two things:
- Milan, can you please grab a working kernel and stress-test S4? A script would be good that does an S4 every 5 minutes or so with X running (maybe some GL apps if you have some handy) and under memory pressure (a kernel compile in an endless loop of make -j 4 && make clean or so). If this is the same bug as on hsw, we'll eventually see memory/fs corruption (so have backups ready).

- For the fix I think we should try to point all ptes at some piece in stolen memory instead of invalid ptes. Ofc that's just a stop-gap, I guess we really need to figure out what function is exactly doing these gtt writes.
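
A minimal sketch of the stress-test loop described in the first point. The function names and the DRY_RUN and INTERVAL variables are hypothetical; DRY_RUN defaults to 1, so nothing hibernates unless explicitly requested, and real cycles need root.

```shell
#!/bin/sh
# Sketch only: names and defaults are made up for illustration.
DRY_RUN=${DRY_RUN:-1}
INTERVAL=${INTERVAL:-300}   # seconds between cycles (about 5 minutes)

# Write a value to a sysfs file, or only print the action in dry-run mode.
sysfs_write() {
    if [ "$DRY_RUN" = 1 ]; then
        echo "would write '$1' to $2"
    else
        echo "$1" > "$2"
    fi
}

# One hibernate cycle; the write returns once the machine has resumed.
s4_cycle() {
    sysfs_write disk /sys/power/state
}

# Run the given number of cycles with a pause between them.
s4_stress() {
    n=$1
    i=1
    while [ "$i" -le "$n" ]; do
        echo "cycle $i of $n"
        s4_cycle || return 1
        [ "$DRY_RUN" = 1 ] || sleep "$INTERVAL"
        i=$((i + 1))
    done
}
```

The memory pressure would come from a separate terminal, for example the make -j 4 && make clean loop mentioned above running in a kernel tree. Each cycle still needs the machine to be powered back on; where the RTC alarm works, rtcwake -m disk -s 60 may be able to automate the wake-up.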
Comment 12 Ben Widawsky 2013-11-25 19:17:51 UTC
Milan, can you also try to reproduce this on our development branch? I've tried it on two sandybridge machines, and cannot reproduce. I wonder if we've managed to fix it in some other way already.
http://cgit.freedesktop.org/~danvet/drm-intel/log/?h=drm-intel-nightly

Takashi, the only issue I have with your approach is I am afraid we're just ignoring corruption that will manifest itself in some other way later. I suppose if we're lucky, we always end up doing a read, and we therefore won't corrupt memory - but unless someone has the time and proficiency to prove we're not corrupting memory, I think it's a dangerous fix (I think trading an obvious failure for a non-obvious one is a step up).

Daniel, the stolen memory approach seems good as long as we make sure it's not a piece of stolen memory used by anything else. I can't say I am surprised that we've found yet another thing that makes Sandybridge unhappy. I would like to know what Windows does, since I tested their "quiescing" method in Takashi's original bug, and it demonstrably did not work.
Comment 13 Ben Widawsky 2013-11-25 19:18:46 UTC
(In reply to Ben Widawsky from comment #12)


> prove we're not corrupting memory, I think it's a dangerous fix (I think
> trading an obvious failure for a non-obvious one is a step up).

step down
Comment 14 Daniel Vetter 2013-11-25 22:48:17 UTC
For the original bug we could try to reproduce it with fbcon/vgacon disabled to rule out them stomping onto the gtt when they shouldn't. Nowadays we have a Kconfig option for that.
Comment 15 Takashi Iwai 2013-12-05 17:39:50 UTC
BTW, does S2RAM still work?
Comment 16 Brad Jackson 2014-01-22 21:12:18 UTC
In 3.13.0, I can suspend to RAM, but suspend to disk still freezes.
Comment 17 Brad Jackson 2014-03-25 14:38:39 UTC
Suspend to disk still freezes for me in 3.14.0-rc8. I am continuing to use 3.12.x until this is fixed.
Comment 18 Daniel Vetter 2014-03-26 18:41:28 UTC
Created attachment 130761
functional revert on top of drm-intel-nightly

Can you please test this patch on top of the latest drm-intel-nightly (what's going in for 3.15)? It should also apply on 3.14.

I'll queue that up if it fixes your issue.
Comment 19 Brad Jackson 2014-03-26 19:07:40 UTC
The patch didn't apply to 3.14-rc8, but I edited i915_gem_suspend_gtt_mappings manually (line 845) and suspend to disk and resume now work. Suspend to RAM also works.
Comment 20 Daniel Vetter 2014-03-26 19:11:42 UTC
My apologies that this regression was lingering for so long. Please complain earlier next time around. Fix is in dinq, will land in 3.15-rc1 and then get backported to stable kernels:

commit 79541f94acdbbc33441b3b56bd5ee831ba080b9e
Author: Daniel Vetter <daniel.vetter@ffwll.ch>
Date:   Wed Mar 26 20:08:20 2014 +0100

    drm/i915: Undo gtt scratch pte unmapping again
