Bug 59321

Summary: [hsw] S4 broken with Haswell
Product: Drivers
Reporter: Takashi Iwai (tiwai)
Component: Video(DRI - Intel)
Assignee: intel-gfx-bugs (intel-gfx-bugs)
Status: RESOLVED MOVED
Severity: normal
CC: ben, daniel, imre.deak, intel-gfx-bugs, jens-bugzilla.kernel.org, mauromol, mmarek, przanoni
Priority: P2
Hardware: All
OS: Linux
Kernel Version: 3.10,3.11
Subsystem:
Regression: No
Bisected commit-id:
Attachments:
- Possible fix
- Don't let the GT write to memory after we're suspending.
- Idle harder
- 3.17-rc1 crash "Watchdog detected hard LOCKUP on CPU #x"
- 3.17-rc1 Crash during hibernate/resume, OOM failures

Description Takashi Iwai 2013-06-05 09:27:18 UTC
On laptops with Haswell, the machine hangs after a number of S4 (hibernation) cycles, typically within 20 cycles.  3.10-rc4 is the most unstable and usually hits the hang within a couple of S4 cycles.

With luck, you get an Oops message like the one below and the machine slowly dies.

 general protection fault: 0000 [#1] SMP 
 CPU: 3 PID: 3804 Comm: packagekitd Tainted: GF            3.10.0-rc4-test+ #1
 task: ffff880231ea8380 ti: ffff88022e138000 task.ti: ffff88q6
 RIP: 0010:[<ffffffff81166ed0>]  [<ffffffff81166ed0>] path_lookupat+0x120/0x830
 RSP: 0018:ffff88022e139cd8  EFLAGS: 00010246
 RAX: 00f9000000f80000 RBX: 0000000000000000 RCX: 0000000000000000
 RDX: ffff88022e139d18 RSI: 0000000000000000 RDI: ffff88022ed4e740
 RBP: ffff88022e139d58 R08: ffff88022e139c3f R09: ffff8802358e303e
 R10: ffff88022ed4e778 R11: 0000000000000003 R12: ffff88022ec38da0
 R13: ffff88022e139da8 R14: 0000000000000000 R15: ffff88022e139d08
 FS:  00007f509d934700(0000) GS:ffff88023eac0000(0000) knlGS:0000000000000000
 CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
 CR2: 00007f509d924cd8 CR3: 0000000231418000 CR4: 00000000001407e0
 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
 DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
 Stack:
  ffff88022e139cf8 0000000000000000 ffff88022e139cf8 ffffffff81162945
  ffff88022e139d98 0000000000000286 ffff8802360673a0 ffff88022ed4e740
  ffff88022ec38da0 0000000000000000 000000d02e139d68 ffff8802358e3000
 Call Trace:
  [<ffffffff81162945>] ? terminate_walk+0x35/0x40
  [<ffffffff81167613>] filename_lookup+0x33/0xd0
  [<ffffffff8116877b>] user_path_at_empty+0x7b/0xb0
  [<ffffffff8117681c>] ? mntput_no_expire+0x4c/0x1b0
  [<ffffffff8115d5c7>] ? cp_new_stat+0x137/0x150
  [<ffffffff811687bc>] user_path_at+0xc/0x10
  [<ffffffff8115d881>] vfs_fstatat+0x51/0xb0
  [<ffffffff8115d949>] vfs_lstat+0x19/0x20
  [<ffffffff8115d96f>] SyS_newlstat+0x1f/0x50
  [<ffffffff814701d2>] system_call_fastpath+0x16/0x1b
 Code: ff ff 83 f8 00 89 c3 0f 85 6e 06 00 00 4c 8b 65 c0 4d 85 e4 0f 84 47 06 00 00 41 f6 44 24 02 04 0f 85 a2 01 00 00 49 8b 44 24 20 <48> 83 78 08 00 0f 84 71 01 00 00 41 83 e6 01 90 0f 84 87 01 00 
 RIP  [<ffffffff81166ed0>] path_lookupat+0x120/0x830
  RSP <ffff88022e139cd8>

The Oops patterns vary quite a lot, but most of them are related to VFS path lookup.  For example, another typical Oops looks like the one below (in this case it was on a 3.0-based kernel with drm/i915 backports, but it is also seen on all kernels):
 BUG: soft lockup - CPU#0 stuck for 23s! [sh:11043]
 CPU 0 
 Pid: 11043, comm: sh Tainted: G      D    NX 3.0.65-0.6.6.1.5358.1.PTF-default 
 RIP: 0010:[<ffffffff81445c58>]  [<ffffffff81445c58>] _raw_spin_lock+0x18/0x20
 RSP: 0018:ffff8801b384fc40  EFLAGS: 00000297
 RAX: 000000000000f221 RBX: ffff8801b384fc78 RCX: 0000000000013568
 RDX: 000000000000f220 RSI: ffffc90000878760 RDI: ffffffff81a02700
 RBP: ffffc90000878760 R08: 0000000000000007 R09: 0000000000000025
 R10: 0000000000000007 R11: ffffffff811e17e0 R12: ffffffff8144e2ee
 R13: 0000000000000000 R14: 00000002000200da R15: 0000000000000000
 FS:  00007ff4bd5f1700(0000) GS:ffff8801bfa00000(0000) knlGS:0000000000000000
 CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
 CR2: 00007ff4bcd88428 CR3: 00000001b3970000 CR4: 00000000001406f0
 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
 DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
 Process sh (pid: 11043, threadinfo ffff8801b384e000, task ffff8801543f64c0)
 Stack:
  ffffffff81168fa0 ffff88018d19e838 ffffffff8116aa75 0000000000000000
  ffff8801b331fbc0 ffff8801b331fbc0 ffff8801bec04d58 0000000000000000
  ffffffff811ad190 ffff8801bec04d58 ffff8801b331fbc0 ffff880190f71540
 Call Trace:
  [<ffffffff81168fa0>] inode_sb_list_add+0x10/0x50
  [<ffffffff8116aa75>] iget_locked+0x155/0x170
  [<ffffffff811ad190>] proc_get_inode+0x10/0x110
  [<ffffffff811b3dd9>] proc_lookup_de+0x69/0xe0
  [<ffffffff811adc20>] proc_root_lookup+0x20/0x60
  [<ffffffff8115b012>] d_alloc_and_lookup+0x42/0x80
  [<ffffffff8115c7c5>] do_lookup+0x2a5/0x3a0
  [<ffffffff8115d992>] do_last+0x102/0x800
  [<ffffffff8115ecf9>] path_openat+0xd9/0x420
  [<ffffffff8115f17c>] do_filp_open+0x4c/0xc0
  [<ffffffff8114fdc1>] do_sys_open+0x171/0x1f0
  [<ffffffff8144d912>] system_call_fastpath+0x16/0x1b
  [<00007ff4bcd18da0>] 0x7ff4bcd18d9f
 Code: 0f 95 c0 0f b6 c0 c3 66 66 66 2e 0f 1f 84 00 00 00 00 00 b8 00 00 01 00 00 0f c1 07 0f b7 d0 c1 e8 10 39 c2 74 07 f3 90 0f b7 17 <eb> f5 c3 0f 1f 44 00 00 9c 58 0f 1f 44 00 00 48 89 c6 fa 66 0f 
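
To compare many such traces, the "mostly VFS path lookup" observation can be checked mechanically. Below is a minimal sketch, not part of the original report: the function names and the keyword list in LOOKUP_HINTS are hypothetical and chosen only for illustration. It pulls the symbol names out of the bracketed frames and flags traces whose reliable frames contain lookup-related functions.

```python
import re

# Matches frames such as
#   [<ffffffff81167613>] filename_lookup+0x33/0xd0
#   [<ffffffff81162945>] ? terminate_walk+0x35/0x40
# A leading "? " marks a frame the kernel considers unreliable.
FRAME_RE = re.compile(
    r"\[<[0-9a-f]+>\]\s+(\?\s+)?([A-Za-z_][\w.]*)\+0x[0-9a-f]+/0x[0-9a-f]+"
)

# Hypothetical keyword list; adjust it to whatever pattern is being hunted.
LOOKUP_HINTS = ("lookup", "path_", "do_last", "iget")


def frames(oops_text):
    """Return (symbol, reliable) pairs for every frame found in the text."""
    return [(m.group(2), m.group(1) is None) for m in FRAME_RE.finditer(oops_text)]


def looks_like_path_lookup(oops_text):
    """True if any reliable frame name contains one of the lookup hints."""
    return any(
        reliable and any(hint in symbol for hint in LOOKUP_HINTS)
        for symbol, reliable in frames(oops_text)
    )
```

Feeding it the two traces above should flag both, while an Oops that only shows allocator or block-freeing frames would not be flagged.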


The problem is found on all kernels up to 3.10-rc4.
Also, it's seen on different Haswell variants.  At least, Mobile GT2 and ULT show the problem.

Some more data points:

- The S4 problem appears with both the user-space and the kernel hibernation methods.

- S4 cycles are more stable when no network is connected.
  The crash above was seen with this test:
    * running the SLED11 user space with an updated X stack, and
    * starting netconsole over Ethernet (r8169 or e1000e drivers)

  Without the network connection, S4 once survived over 100 cycles.
  But that might just be luck.

- S4 becomes more stable if you disable loading the i915 module in the initrd.
  On the SUSE kernel, the i915 module is loaded in the initrd, and the initrd triggers the resume of the S4 image either via the user-space suspend command or by writing to sysfs.
  When I exclude the i915 module by setting $NO_KMS_IN_INITRD in /etc/sysconfig/kernel and run mkinitrd, the problem is rarely seen.
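
For anyone who wants to repeat the S4 stress cycles, here is a rough sketch of one way to automate them through sysfs. It is not the script used for the tests above. It assumes that the RTC exposed as rtc0 can wake the machine from S4 (not true on every board), and the function names, the default timings and the dry-run switch are all made up for illustration.

```python
import time

POWER_DISK = "/sys/power/disk"    # selects the hibernation mode
POWER_STATE = "/sys/power/state"  # writing "disk" starts hibernation
# Assumption: rtc0 exists and is able to wake this machine from S4.
WAKEALARM = "/sys/class/rtc/rtc0/wakealarm"


def cycle_plan(wake_after_s=120, mode="platform"):
    """Return the ordered (path, value) writes for one hibernate/resume cycle."""
    return [
        (WAKEALARM, "0"),                   # clear any pending alarm
        (WAKEALARM, "+%d" % wake_after_s),  # arm a wake-up relative to now
        (POWER_DISK, mode),
        (POWER_STATE, "disk"),              # returns only after the resume
    ]


def run_cycles(count, dry_run=True, settle_s=30, **plan_args):
    """Run `count` cycles; with dry_run=True only collect the planned writes."""
    log = []
    for _ in range(count):
        for path, value in cycle_plan(**plan_args):
            log.append((path, value))
            if not dry_run:  # needs root and really hibernates the machine
                with open(path, "w") as f:
                    f.write(value)
        if not dry_run:
            time.sleep(settle_s)  # let resume hooks and the network settle
    return log
```

Calling run_cycles(20) only returns the list of writes; run_cycles(20, dry_run=False) as root performs them, so netconsole or another load should be started beforehand.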
Comment 1 Takashi Iwai 2013-06-05 09:56:18 UTC
BTW, I checked fdo bugzilla
   https://bugs.freedesktop.org/show_bug.cgi?id=63586
and tried to revert the commit mentioned there.  It didn't help.

But the Oops pattern shown there (in comment 9) looks similar to what I've seen (procfs path lookup), so this might have the same cause.

I'll also test whether the kernel just before the commit mentioned above survives my test case.
Comment 2 Takashi Iwai 2013-06-05 11:04:58 UTC
BTW, I tried S4 stress tests without i915 KMS (nomodeset), and it survived well.

This doesn't prove that it is an i915 driver bug, since the Oops points to some kind of memory corruption, but at least the i915 driver has a strong influence on the buggy S4 behavior.
Comment 3 Takashi Iwai 2013-06-05 13:18:05 UTC
(In reply to comment #1)
> BTW, I checked fdo bugzilla
>    https://bugs.freedesktop.org/show_bug.cgi?id=63586
> and tried to revert the commit mentioned there.  It didn't help.
> 
> But the Oops pattern shown there (in comment 9) looks similar to what I've seen
> (procfs path lookup), so this might have the same cause.
> 
> I'll try whether the point before the commit mentioned above survives my test
> case, too.

The kernel at commit 0e8ffe1bf81b crashed after 20 cycles, with similar Oops about procfs path lookup.
Comment 4 Daniel Vetter 2013-06-05 13:43:35 UTC
Afaik Haswell has been flakey ever since, with random deaths at all kinds of strange places. Thus far I haven't seen any progress on this at all, so I'll escalate this. Been a while since I've had to do a maintainer-drill internally anyway ;-)
Comment 5 Paulo Zanoni 2013-06-05 18:56:14 UTC
(In reply to comment #2)
> BTW, I tried S4 stress tests without i915 KMS (nomodeset), and it survived
> well.

Please define "survived well". Did it crash at least a single time without i915 loaded?
Comment 6 Takashi Iwai 2013-06-05 19:22:39 UTC
No, it didn't crash at all without i915 for over 100 cycles.  No Oops was seen either.  That is what I meant by "survived well".
If it had been more than 10000 cycles, I would have concluded that it doesn't crash in normal situations :)
Comment 7 Paulo Zanoni 2013-06-06 20:39:06 UTC
Hi

I want to have exactly the same environment as you have, so I can reproduce it locally.

- Which outputs are you using when reproducing this bug? VGA? eDP? DP? HDMI? DVI?
- Do you attach/remove any of them while trying to reproduce?
- Do you have X running when you suspend?
- How do you suspend? By clicking on some interface or running some specific command?
- Do you suspend from X or do you vt switch away before doing that?

Thanks,
Paulo
Comment 8 Paulo Zanoni 2013-06-06 21:38:14 UTC
Ok, so I have been playing with this for some time today. The machine I've used has only an eDP monitor.

I enabled a bunch of debug options in the Kernel, including kmemleak. I could reproduce the bug a few times, and so far I notice that the S4 problem only happens *after* I see kmemleak complains about some weird stuff that happens when we're trying to turn the eDP panel power on. I have seen this 3 times:

- Boot the machine
- Check for kmemleak on dmesg
- S4 suspend
- Boot again and check for kmemleak on dmesg
- So far, I have only seen crashes after dmesg complains about kmemleak

Another thing I have to point out is that the crash happens when I'm already back in X, many seconds after the real reboot.

Do you also observe that?

Thanks,
Paulo
Comment 9 Paulo Zanoni 2013-06-06 22:22:30 UTC
Another bug which looks like a memory corruption and might actually be the same thing as the one you're seeing:

Remove eDP and all other outputs, attach only HDMI. Boot the machine, load i915 and see the error message. Happens 100% of the time for me.
Comment 10 Takashi Iwai 2013-06-07 06:18:24 UTC
(In reply to comment #7)
> Hi
> 
> I want to have exactly the same environment as you have, so I can reproduce
> it
> locally.
> 
> - Which outputs are you using when reproducing this bug? VGA? eDP? DP? HDMI?
> DVI? - Do you attach/remove any of them while trying to reproduce?

We've seen the hangs on both laptops and a desktop machine.
eDP is used on all laptops, so at least eDP is always connected/used.
I forgot about the detail of desktop, but it doesn't have eDP, at least.

It happens without attaching/removing connections.  Boot a laptop with eDP only, try S4 a few times, and it hangs.

> - Do you have X running when you suspend?

In most test cases, yes.  It's GNOME 2.6 with compiz.
But the hang happened without X, too.

> - How do you suspend? By clicking on some interface or running some specific
> command?

The hang happens in all cases, no matter whether the suspend is triggered through the button (via pm-suspend), the direct kernel suspend, or the user-space suspend.

> - Do you suspend from X or do you vt switch away before doing that?

When the pm-utils hook is running, yes, it switches to VT1 before suspending.
But the crash happens even without it, by just writing to /sys/power/disk from an X terminal.
Comment 11 Takashi Iwai 2013-06-07 06:21:57 UTC
(In reply to comment #8)
> Ok, so I have been playing with this for some time today. The machine I've
> used
> has only an eDP monitor.
> 
> I enabled a bunch of debug options in the Kernel, including kmemleak. I could
> reproduce the bug a few times, and so far I notice that the S4 problem only
> happens *after* I see kmemleak complains about some weird stuff that happens
> when we're trying to turn the eDP panel power on. I have seen this 3 times:
> 
> - Boot the machine
> - Check for kmemleak on dmesg
> - S4 suspend
> - Boot again and check for kmemleak on dmesg
> - So far, I have only seen crashes after dmesg complains about kmemleak
> 
> Another thing I have to point is that the crash happens when I'm already back
> to X, many seconds after the real reboot.
> 
> Do you also observe that?

OK, will try with kmemleak.  I already tried other debug options but it didn't catch anything special before the crash.

The crashing behavior isn't always the same.  As stated in the bug description, with luck you get the Oops message.  With bad luck, the machine crashes immediately during the resume, possibly because lots of tasks run via the pm-utils resume hooks.
Comment 12 Takashi Iwai 2013-06-07 11:49:26 UTC
kmemleak didn't catch the error in my case.  The machine first shows the general protection fault: 0000 at do_dentry_open.

Paulo, could you give your kernel config showing the kmemleak result, so that I can test on machines here, too?
Comment 13 Paulo Zanoni 2013-06-11 22:19:00 UTC
Created attachment 104441 [details]
Possible fix

Hi

Does this patch help? I found it while debugging another memory corruption from our driver...

Thanks,
Paulo
Comment 14 Takashi Iwai 2013-06-18 16:20:07 UTC
No dice, unfortunately.  It still gives the same Oops after the first S4.
Comment 15 Paulo Zanoni 2013-09-06 20:56:47 UTC
Hi

This really seems to be a memory corruption problem. I already had to fsck my disk twice while debugging this. I think our best bet is to try to bisect this using the linux-stable tree.

Do you know any Kernel version that can't reproduce the problem? If you have the time, you could perhaps try to do the bisecting.

Thanks,
Paulo
Comment 16 Takashi Iwai 2013-09-09 07:58:10 UTC
S4 has always been broken on Haswell, so it's not a bisectable regression.  (Otherwise I would have done it :)

And, this seems broken only on Haswell.  S4 works fine with older chips (IvyBridge, at least) with the very same kernel.

BTW, comment 2 is still valid, and HP confirmed that, too.  If you don't load the i915 driver in the initrd of the resume kernel (i.e. the kernel that loads the S4 image), the crash probability goes down from 10% to 2% or less.
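
To put those percentages in relation to the cycle counts mentioned in this bug, here is a small worked calculation. It assumes, purely for illustration, that the 10% and 2% figures are per-cycle crash probabilities and that cycles are independent; neither assumption is confirmed anywhere in this thread, and the function names are made up.

```python
import math


def survival_probability(p_crash, cycles):
    """Chance of seeing no crash in `cycles` independent S4 cycles."""
    return (1.0 - p_crash) ** cycles


def cycles_for_confidence(p_crash, confidence=0.95):
    """Crash-free cycles needed before a per-cycle crash rate of at least
    `p_crash` can be ruled out with the given confidence."""
    return math.ceil(math.log(1.0 - confidence) / math.log(1.0 - p_crash))


# With a 10% rate, a run of 20 clean cycles still happens about 12% of the time.
print(round(survival_probability(0.10, 20), 3))
# With a 2% rate, a run of 100 clean cycles still happens about 13% of the time.
print(round(survival_probability(0.02, 100), 3))
```

Under these assumptions a clean run of "over 100 cycles" is good evidence against a 10% crash rate, but it does not rule out a 2% one; that takes roughly 150 clean cycles.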
Comment 17 Paulo Zanoni 2013-10-01 19:47:50 UTC
Hi

I did some tests, and it seems that if I disable fbcon, vgacon and their friends I can't reproduce the problem. Can you please confirm that?

Also, my tests show that the problem happens even if we don't start X. Can you also confirm that?

In the meantime, I'll keep testing.

Thanks,
Paulo
Comment 18 Takashi Iwai 2013-10-02 05:36:05 UTC
(In reply to Paulo Zanoni from comment #17)
> Hi
> 
> I did some tests, and it seems that if I disable fbcon, vgacon and their
> friends I can't reproduce the problem. Can you please confirm that?

Do you mean to disable the corresponding Kconfig?  If so, could you share your kconfig to test here, too?
 
> Also, my tests show that the problem happens even if we don't start X. Can
> you also confirm that?

Yes, read comment 10 again :)
 
> In the meantime, I'll keep testing.

Thanks!
Comment 19 Paulo Zanoni 2013-10-07 19:07:54 UTC
(In reply to Takashi Iwai from comment #18)
> (In reply to Paulo Zanoni from comment #17)
> > Hi
> > 
> > I did some tests, and it seems that if I disable fbcon, vgacon and their
> > friends I can't reproduce the problem. Can you please confirm that?
> 
> Do you mean to disable the corresponding Kconfig?  If so, could you share
> your kconfig to test here, too?

Never mind, I redid the tests and I was still able to reproduce the bug. I'm sorry. But yes, I meant the .config file. You need CONFIG_EXPERT before you can change the values of fbcon and vgacon.
Comment 20 Paulo Zanoni 2013-10-07 19:13:35 UTC
Hi

I did some more investigation and I discovered the following:

- It seems that, after resuming, if you run "slabinfo -v" (from tools/vm/), there's a good chance you'll see dmesg messages saying we detected corruption on our slabs. It seems to me that it is much much easier to reproduce the bug with "hibernate, resume, run slabinfo -v, check dmesg, hibernate, resume, etc" than with just "hibernate, resume". Can you confirm that?

- It also seems that the bug goes away if the kernel that resumes the machine doesn't load i915.ko. So an experiment you can try is: boot the machine normally, with i915.ko loaded, tell it to hibernate. Then make the machine wake-up, and use the "modprobe.blacklist=i915" option when loading the kernel that will resume the machine. After it resumes, check if the bug is there (possibly with slabinfo -v). The bug should be gone. Can you please confirm that?
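
The "run slabinfo -v, check dmesg" step can be scripted by scanning the dmesg text for slab consistency reports. The sketch below is only an illustration: it assumes that such reports start with a line of the form "BUG <cache> (<taint>): <reason>", which may differ between kernel versions and allocator configurations, and the function name is made up.

```python
import re

# Assumption: a slab consistency report starts with a line like
#   BUG kmalloc-64 (Tainted: G        W): Poison overwritten
# Lines such as "BUG: soft lockup ..." have a colon right after "BUG"
# and are deliberately not matched.
SLAB_BUG_RE = re.compile(r"BUG (\S+) \([^)]*\): (.+)$")


def slab_corruption_reports(dmesg_text):
    """Return (cache_name, reason) for every slab report line in the text."""
    reports = []
    for line in dmesg_text.splitlines():
        match = SLAB_BUG_RE.search(line)
        if match:
            reports.append((match.group(1), match.group(2).strip()))
    return reports
```

A test loop would then be: hibernate, resume, run "slabinfo -v", pass the output of dmesg to slab_corruption_reports() and stop at the first non-empty result.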

Thanks,
Paulo
Comment 21 Takashi Iwai 2013-10-07 19:31:18 UTC
(In reply to Paulo Zanoni from comment #20)
> Hi
> 
> I did some more investigation and I discovered the following:
> 
> - It seems that, after resuming, if you run "slabinfo -v" (from tools/vm/),
> there's a good chance you'll see dmesg messages saying we detected
> corruption on our slabs. It seems to me that it is much much easier to
> reproduce the bug with "hibernate, resume, run slabinfo -v, check dmesg,
> hibernate, resume, etc" than with just "hibernate, resume". Can you confirm
> that?

I have no time this week due to a company event, but I guess this would trigger the bug more often, too.  In my tests, an easy way to reproduce the bug is to run netconsole in the background on the Haswell machine.  Then it occurs after just a couple of S4 cycles.

> - It also seems that the bug goes away if the kernel that resumes the
> machine doesn't load i915.ko. So an experiment you can try is: boot the
> machine normally, with i915.ko loaded, tell it to hibernate. Then make the
> machine wake-up, and use the "modprobe.blacklist=i915" option when loading
> the kernel that will resume the machine. After it resumes, check if the bug
> is there (possibly with slabinfo -v). The bug should be gone. Can you please
> confirm that?

This was already mentioned in the bug description!
Comment 22 Ben Widawsky 2013-10-07 23:40:53 UTC
Created attachment 110401 [details]
Don't let the GT write to memory after we're suspending.

Takashi, can you please test this patch?
Comment 23 Ben Widawsky 2013-10-07 23:50:34 UTC
Created attachment 110411 [details]
Idle harder

Please try this patch (with and without the previous, if possible) as well. If you only have time to test 1, please test this with the previous patch.

This one is only compile tested.
Comment 24 Paulo Zanoni 2013-10-08 14:57:58 UTC
(In reply to Ben Widawsky from comment #22)
> Created attachment 110401 [details]
> Don't let the GT write to memory after we're suspending.
> 
> Takashi, can you please test this patch?

I tested this patch and, alone, it seems enough to fix the bug.

I tested it yesterday (the similar version which you sent to my personal email) and today (the version on this bugzilla). Both versions survived many hibernate/resume cycles without problems on "slabinfo -v". The interesting thing is that in two cases a problem happened where "slabinfo -v" got stuck, never finishing. I've never seen this problem before, so it may be caused by your patch, or it may be just another bug that was "hidden" behind the previous easier-to-reproduce bug which you fixed with the patch.
Comment 25 Paulo Zanoni 2013-10-08 14:58:29 UTC
(In reply to Ben Widawsky from comment #23)
> Created attachment 110411 [details]
> Idle harder
> 
> Please try this patch (with and without the previous, if possible) as well.
> If you only have time to test 1, please test this with the previous patch.
> 
> This one is only compile tested.

This patch alone is not enough to fix the bug: I can still reproduce the problem.
Comment 26 Ben Widawsky 2013-10-08 18:30:10 UTC
(In reply to Paulo Zanoni from comment #25)
> (In reply to Ben Widawsky from comment #23)
> > Created attachment 110411 [details]
> > Idle harder
> > 
> > Please try this patch (with and without the previous, if possible) as well.
> > If you only have time to test 1, please test this with the previous patch.
> > 
> > This one is only compile tested.
> 
> This patch alone is not enough to fix the bug: I can still reproduce the
> problem.

If anybody else happens to test the patch, please let me know if you see the WARN or DRM_ERROR. Thanks.
Comment 27 Takashi Iwai 2013-10-09 08:02:32 UTC
Thanks Ben, it looks promising.  I'll try to find some time and test the patches.

BTW what are the magic registers 0x4194 and 0x2050?  Are they available no matter which GPU generation?
Comment 28 Ben Widawsky 2013-10-09 19:02:16 UTC
(In reply to Takashi Iwai from comment #27)
> Thanks Ben, it looks promising.  I'll try to find some time and test the
> patches.
> 
> BTW what are the magic registers 0x4194 and 0x2050?  Are they available no
> matter which GPU generation?

0x4194 has existed for some time. It's just hardcoded as a quick hack since all the experiments with Paulo show the render ring causing the issue. See #define RING_FAULT_REG(ring)    (0x4094 + 0x100*(ring)->id) in drivers/gpu/drm/i915/i915_reg.h
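
The arithmetic behind that macro, as a small sketch. The base and stride come from the #define quoted above; the Python function names are made up, and which engine carries which id is not stated in this thread.

```python
RING_FAULT_BASE = 0x4094   # from the RING_FAULT_REG() definition quoted above
RING_FAULT_STRIDE = 0x100


def ring_fault_reg(ring_id):
    """Offset of the fault register for the ring with the given id."""
    return RING_FAULT_BASE + RING_FAULT_STRIDE * ring_id


def ring_id_for(reg):
    """Inverse mapping: which ring id a fault register offset belongs to."""
    offset = reg - RING_FAULT_BASE
    if offset < 0 or offset % RING_FAULT_STRIDE:
        raise ValueError("not a ring fault register offset: %#x" % reg)
    return offset // RING_FAULT_STRIDE


for ring_id in range(4):
    print(ring_id, hex(ring_fault_reg(ring_id)))
```

Under that formula the hardcoded 0x4194 corresponds to ring id 1.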


I'm not sure how long 0x2050 has been in the HW. It's a register which is meant for debug purposes only, and let's just say it's magic for now (you can figure out from the code what it should be telling us).
Comment 29 Takashi Iwai 2013-10-10 11:42:54 UTC
As Paulo already confirmed, the first patch alone seems to work.  With the first patch, more than 100 S4 cycles with net and proc loads survived without a single crash.  Great! \o/

With the second patch, there is no visible change, neither WARN nor DRM_ERROR.

Feel free to take my tested-by tags when submitting to upstream:
  Tested-by: Takashi Iwai <tiwai@suse.de>

(In reply to Ben Widawsky from comment #28)
> (In reply to Takashi Iwai from comment #27)
> > BTW what are the magic registers 0x4194 and 0x2050?  Are they available no
> > matter which GPU generation?
> 
> 0x4194 has existed for some time. It's just hardcoded as a quick hack since
> all the experiments with Paulo shows the render ring causing issue. See
> #define RING_FAULT_REG(ring)    (0x4094 + 0x100*(ring)->id) in
> drivers/gpu/drm/i915/i915_reg.h
 
Thanks for the pointer.
 
> I'm not sure how long 0x2050 has been in the HW. It's a register which is
> meant for debug purposes only, and let's just say it's magic for now (you
> can figure out from the code what it should be telling us).

OK, I was just curious because the functions are applied globally to all chips.
Comment 30 Takashi Iwai 2013-10-15 12:08:02 UTC
Let me know when the final patch is ready and upstreamed (hopefully merged in time for 3.12).
Comment 31 Ben Widawsky 2013-10-15 23:51:27 UTC
I've cleaned up the patches. Unfortunately I cannot reproduce the issue on my machine locally, so I am waiting for someone else on my team to test them. If you'd like to test them, that might speed things up.

They are the top two patches here:
http://cgit.freedesktop.org/~bwidawsk/drm-intel/commit/?h=bug59321&id=d78498d9ced7d0b9b1b23bdabe02f8467d4d1503

We should be able to hit 3.12.
Comment 32 Ben Widawsky 2013-10-15 23:54:18 UTC
(In reply to Ben Widawsky from comment #31)
> I've cleaned up the patches. Unfortunately I cannot reproduce the issue on
> my machine locally, so I am waiting for someone else on my team to test
> them. If you'd like to test them, that might speed things up.
> 
> They are the top two patches here:
> http://cgit.freedesktop.org/~bwidawsk/drm-intel/commit/
> ?h=bug59321&id=d78498d9ced7d0b9b1b23bdabe02f8467d4d1503
> 
> We should be able to hit 3.12.

Whoops, make that the top two commits here (I've force-pushed)
http://cgit.freedesktop.org/~bwidawsk/drm-intel/log/?h=bug59321
Comment 33 Takashi Iwai 2013-10-16 05:27:36 UTC
(In reply to Ben Widawsky from comment #32)
> (In reply to Ben Widawsky from comment #31)
> > I've cleaned up the patches. Unfortunately I cannot reproduce the issue on
> > my machine locally, so I am waiting for someone else on my team to test
> > them. If you'd like to test them, that might speed things up.
> > 
> > They are the top two patches here:
> > http://cgit.freedesktop.org/~bwidawsk/drm-intel/commit/
> > ?h=bug59321&id=d78498d9ced7d0b9b1b23bdabe02f8467d4d1503
> > 
> > We should be able to hit 3.12.
> 
> Whoops, make that top two commits here (I've forced pushed)
> http://cgit.freedesktop.org/~bwidawsk/drm-intel/log/?h=bug59321

In commit 0036ecbfb, hsw_pte_encode() doesn't seem to clear GEN6_PTE_VALID while others do.  Is it really correct?
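
To illustrate the kind of difference I mean, here is a simplified sketch. It is not the driver code: it assumes the valid flag is bit 0 of the PTE, leaves out all cache-control bits, and the function names are made up.

```python
PTE_VALID = 1 << 0   # assumption: the valid flag is the lowest PTE bit
PAGE_MASK = ~0xFFF   # keep only the page-aligned address bits


def encode_pte(addr, valid):
    """Encoder that honours `valid`: an invalid PTE has the flag cleared."""
    pte = PTE_VALID if valid else 0
    return pte | (addr & PAGE_MASK)


def encode_pte_ignoring_valid(addr, valid):
    """Encoder that always sets the flag, whatever the caller asks for."""
    return PTE_VALID | (addr & PAGE_MASK)
```

With the second variant, a caller that asks for an invalid entry at suspend time still gets a valid one back, so the GPU could keep writing through it.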
Comment 34 Ben Widawsky 2013-10-16 14:51:35 UTC
(In reply to Takashi Iwai from comment #33)
> (In reply to Ben Widawsky from comment #32)
> > (In reply to Ben Widawsky from comment #31)
> > > I've cleaned up the patches. Unfortunately I cannot reproduce the issue
> on
> > > my machine locally, so I am waiting for someone else on my team to test
> > > them. If you'd like to test them, that might speed things up.
> > > 
> > > They are the top two patches here:
> > > http://cgit.freedesktop.org/~bwidawsk/drm-intel/commit/
> > > ?h=bug59321&id=d78498d9ced7d0b9b1b23bdabe02f8467d4d1503
> > > 
> > > We should be able to hit 3.12.
> > 
> > Whoops, make that top two commits here (I've forced pushed)
> > http://cgit.freedesktop.org/~bwidawsk/drm-intel/log/?h=bug59321
> 
> In commit 0036ecbfb, hsw_pte_encode() doesn't seem to clear GEN6_PTE_VALID
> while others do.  Is it really correct?

Darn. That is not correct. And that's the one platform we're trying to fix :P

Just pushed the fix. Hopefully I can get internal testing soon.
Comment 35 Ben Widawsky 2013-10-17 02:16:29 UTC
Posted for review.

http://lists.freedesktop.org/archives/intel-gfx/2013-October/034863.html
Comment 36 Daniel Vetter 2013-10-29 15:12:19 UTC
Ok, fix is merged and has landed:

commit 828c79087cec61eaf4c76bb32c222fbe35ac3930
Author: Ben Widawsky <benjamin.widawsky@intel.com>
Date:   Wed Oct 16 09:21:30 2013 -0700

    drm/i915: Disable GGTT PTEs on GEN6+ suspend
Comment 37 Takashi Iwai 2014-04-01 07:23:55 UTC
The fix was reverted recently, so this bug must be reopened...

I still think applying this fix conditionally to certain chips (either gen >= 7 or only HSW) would be a better workaround.  Or, we may apply it only for hibernation, too.
Comment 38 Daniel Vetter 2014-04-11 14:37:35 UTC
Imo we should attempt to refill the GTT PTEs with stolen memory entries. That should address both the snb and hsw issues and looks more solid than trying to block all access to memory, which apparently can kill the system.

Someone from our side was signed up to do the testing but that didn't seem to happen. Meh ...
Comment 39 Jens 2014-08-17 09:44:00 UTC
Is there any activity on this bug yet? I am tracking (what I think is) the same issue at https://bugs.freedesktop.org/show_bug.cgi?id=78424 and have tested multiple kernels from 3.13 to 3.16.1. When I resume from hibernation, I always get an ext4-related Oops similar to the ones described above.

In brief:
* mainboard: MSI-B85M (MSI-7817 Haswell chipset, i5-4570 CPU)
* Ubuntu 14.04 LTS (amd64)
* hibernation/resume from the console using 'sudo pm-hibernate' seems to work multiple times, when i915.ko is not loaded
* hibernation/resume from within X will work fine once after reboot (will result in "WARNING: SPLL already enabled" error message like above)
* further attempts always fail on resume with a heap of Oops messages related to ext4 (lookup_fast, __inode_permission, etc.) and the machine will grind to death
* It seems like using 'no_console_suspend=1' will delay the Oops until the second or third resume.

Is there any chance of this being fixed soon? How can I help? I use multiple Haswell based machines and badly need the hibernation functionality.

Thank you!
Comment 40 Jani Nikula 2014-08-18 08:22:24 UTC
Raising priority.
Comment 41 Jens 2014-08-18 16:52:18 UTC
I pulled the current "drm-intel-nightly" tree (http://cgit.freedesktop.org/drm-intel/log/?h=drm-intel-nightly, commit 09fcefee...) and tried again.

Setup:
* Ubuntu 14.04 LTS, Kernel 3.16.0+ (3.17rc1 as of now)
* MSI-7817 chipset with i5-4570
* Boot Lubuntu desktop
* start "make -j4" in the above git checkout
* start Firefox with Youtube video
* hibernate and resume in a loop (I tried 4 reboots with 3..5 resume cycles each)

Results:
* No more WARNING: messages upon resume
* Multiple resumes work fine
* About one in every fifth resume the machine grinds to a halt with dozens of OOM killer messages in the logs

So: a big improvement over before (I can hibernate and resume multiple times in a row, even with a loaded machine!). But we're not quite there yet - where do the OOM errors come from? When I hibernated, only ~1.5G out of 8G RAM were actually in use.
Comment 42 Jens 2014-08-18 18:45:07 UTC
Created attachment 147091 [details]
3.17-rc1 crash "Watchdog detected hard LOCKUP on CPU #x"

Here is another (it seems) non-OOM related crash on the mentioned kernel during resume after hibernate which I was only able to catch using a digicam (thus sorry for the file format).
Comment 43 Jens 2014-08-18 18:53:35 UTC
Created attachment 147101 [details]
3.17-rc1 Crash during hibernate/resume, OOM failures

Here is another dmesg output. The first file is a clean resume after a hibernate. The second file is a subsequent resume which triggered the mentioned OOM chaos. Unfortunately the first part was cut off because of the limited size of the dmesg buffer.
Comment 44 Jens 2014-08-20 08:03:23 UTC
Another crash, after 5 (working) cycles of hibernate/resume during a kernel "make -j4" to keep the machine busy. I have seen ext4_ functions in the call trace often - this might be the same memory corruption that causes the OOM killer to run amok.

[  976.247858] CPU: 2 PID: 7160 Comm: rm Tainted: G        W   E  3.17.0-rc1+ #5
[  976.247876] Hardware name: MSI MS-7817/CSM-B85M-E45 (MS-7817), BIOS V10.5 05/30/2014
[  976.247895] task: ffff8800c89ee400 ti: ffff880133e08000 task.ti: ffff880133e08000
[  976.247914] RIP: 0010:[<ffffffff811ba875>]  [<ffffffff811ba875>] kmem_cache_alloc+0x75/0x1e0
[  976.247938] RSP: 0018:ffff880133e0bbb8  EFLAGS: 00010286
[  976.247951] RAX: 0000000000000000 RBX: ffff88021ea5d340 RCX: 0000000000000338
[  976.247969] RDX: 0000000000000337 RSI: 0000000000000050 RDI: ffff88020fdf6600
[  976.247986] RBP: ffff880133e0bbe8 R08: 000000000001b200 R09: ffffffff8128e632
[  976.248002] R10: ffff880133e0bb38 R11: ffffea00083c3040 R12: 0031000000300000
[  976.248019] R13: 0000000000000050 R14: ffff88020fdf6600 R15: ffff88020fdf6600
[  976.248037] FS:  00002b3156a34b80(0000) GS:ffff88021eb00000(0000) knlGS:0000000000000000
[  976.248056] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  976.248070] CR2: 00002b3156d5dfa0 CR3: 000000011799e000 CR4: 00000000001407e0
[  976.248087] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[  976.248103] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[  976.248120] Stack:
[  976.248125]  ffffffff8128e632 ffff88021ea5d340 ffff88020fe9d800 0000000000000000
[  976.248145]  0000000004cd599c 000000000000599c ffff880133e0bcc0 ffffffff8128e632
[  976.248166]  ffffea00083c3040 0000000000000000 0000000100000000 0000000004cd59a0
[  976.248186] Call Trace:
[  976.248194]  [<ffffffff8128e632>] ? ext4_free_blocks+0x6d2/0xb40
[  976.248210]  [<ffffffff8128e632>] ext4_free_blocks+0x6d2/0xb40
[  976.248226]  [<ffffffff81281369>] ext4_ext_remove_space+0x7d9/0x1050
[  976.248242]  [<ffffffff812954c9>] ? ext4_es_free_extent+0x59/0x60
[  976.248258]  [<ffffffff81283bd0>] ext4_ext_truncate+0xb0/0xe0
[  976.248274]  [<ffffffff8125cf57>] ext4_truncate+0x387/0x3d0
[  976.248289]  [<ffffffff8125db11>] ext4_evict_inode+0x491/0x4f0
[  976.248305]  [<ffffffff811f23b4>] evict+0xb4/0x180
[  976.248317]  [<ffffffff811f2bf5>] iput+0xf5/0x180
[  976.248330]  [<ffffffff811e4db3>] do_unlinkat+0x193/0x2c0
[  976.248344]  [<ffffffff81021d25>] ? syscall_trace_enter+0x145/0x250
[  976.248360]  [<ffffffff811e859b>] SyS_unlinkat+0x1b/0x40
[  976.248375]  [<ffffffff817513ff>] tracesys+0xe1/0xe6
[  976.248387] Code: dd 00 00 49 8b 50 08 4d 8b 20 49 8b 40 10 4d 85 e4 0f 84 17 01 00 00 48 85 c0 0f 84 0e 01 00 00 49 63 46 20 48 8d 4a 01 4d 8b 06 <49> 8b 1c 04 4c 89 e0 65 49 0f c7 08 0f 94 c0 84 c0 74 b9 49 63 
[  976.248475] RIP  [<ffffffff811ba875>] kmem_cache_alloc+0x75/0x1e0
[  976.248491]  RSP <ffff880133e0bbb8>
[  976.254794] ---[ end trace 9b598b75bf3f05bd ]---
[  985.828721] r8169 0000:02:00.0 eth0: link up
Comment 45 Imre Deak 2014-09-11 15:09:21 UTC
Could you check whether the following branch fixes the issue?
https://github.com/ideak/linux/commits/suspend-fix
Comment 46 Jens 2014-09-12 22:10:48 UTC
No crashes after 8 hibernate/resume cycles with this kernel tree. Great!

However, I got the backtrace below on the second resume, and after that the network was flaky (on about every second resume I had no network access until I hibernated again):


[  460.342312] r8169 0000:02:00.0 eth0: link down
[  460.342331] r8169 0000:02:00.0 eth0: link down
[  460.342369] IPv6: ADDRCONF(NETDEV_UP): eth0: link is not ready
[  463.191127] r8169 0000:02:00.0 eth0: link up
[  463.191134] IPv6: ADDRCONF(NETDEV_CHANGE): eth0: link becomes ready
[  559.157551] ------------[ cut here ]------------
[  559.157557] WARNING: CPU: 0 PID: 0 at net/sched/sch_generic.c:264 dev_watchdog+0x276/0x280()
[  559.157558] NETDEV WATCHDOG: eth0 (r8169): transmit queue 0 timed out
[  559.157559] Modules linked in: bnep(E) rfcomm(E) bluetooth(E) snd_hda_codec_realtek(E) snd_hda_codec_hdmi(E) snd_hda_codec_generic(E) snd_hda_intel(E) snd_hda_controller(E) snd_hda_codec(E) snd_hwdep(E) snd_pcm(E) snd_seq_midi(E) snd_seq_midi_event(E) snd_rawmidi(E) snd_seq(E) snd_seq_device(E) snd_timer(E) intel_rapl(E) snd(E) soundcore(E) x86_pkg_temp_thermal(E) intel_powerclamp(E) coretemp(E) lpc_ich(E) serio_raw(E) kvm_intel(E) mei_me(E) mei(E) kvm(E) shpchp(E) tpm_infineon(E) intel_smartconnect(E) mac_hid(E) parport_pc(E) ppdev(E) lp(E) parport(E) dm_crypt(E) netconsole(E) configfs(E) crct10dif_pclmul(E) crc32_pclmul(E) ghash_clmulni_intel(E) i915(E) mxm_wmi(E) r8169(E) i2c_algo_bit(E) drm_kms_helper(E) hid_generic(E) aesni_intel(E) ahci(E) aes_x86_64(E) lrw(E) gf128mul(E) glue_helper(E) libahci(E) mii(E) ablk_helper(E) usbhid(E) cryptd(E) drm(E) hid(E) wmi(E) video(E)
[  559.157582] CPU: 0 PID: 0 Comm: swapper/0 Tainted: G            E  3.17.0-rc4+ #1
[  559.157583] Hardware name: MSI MS-7817/CSM-B85M-E45 (MS-7817), BIOS V10.5 05/30/2014
[  559.157584]  0000000000000009 ffff88021ea03d98 ffffffff8174630a ffff88021ea03de0
[  559.157585]  ffff88021ea03dd0 ffffffff8106c8bd 0000000000000000 ffff88020fb4a000
[  559.157586]  ffff88020f812880 0000000000000001 0000000000000000 ffff88021ea03e30
[  559.157588] Call Trace:
[  559.157589]  <IRQ>  [<ffffffff8174630a>] dump_stack+0x45/0x56
[  559.157595]  [<ffffffff8106c8bd>] warn_slowpath_common+0x7d/0xa0
[  559.157596]  [<ffffffff8106c92c>] warn_slowpath_fmt+0x4c/0x50
[  559.157598]  [<ffffffff8166b8c6>] dev_watchdog+0x276/0x280
[  559.157600]  [<ffffffff8166b650>] ? dev_graft_qdisc+0x80/0x80
[  559.157602]  [<ffffffff810cefd6>] call_timer_fn+0x36/0x100
[  559.157603]  [<ffffffff8166b650>] ? dev_graft_qdisc+0x80/0x80
[  559.157605]  [<ffffffff810d076f>] run_timer_softirq+0x20f/0x310
[  559.157607]  [<ffffffff81070865>] __do_softirq+0xf5/0x2e0
[  559.157609]  [<ffffffff81070d25>] irq_exit+0x105/0x110
[  559.157611]  [<ffffffff81751855>] smp_apic_timer_interrupt+0x45/0x60
[  559.157612]  [<ffffffff8174f95d>] apic_timer_interrupt+0x6d/0x80
[  559.157613]  <EOI>  [<ffffffff815f4582>] ? poll_idle+0x42/0x90
[  559.157616]  [<ffffffff815f3fc5>] cpuidle_enter_state+0x55/0x170
[  559.157617]  [<ffffffff815f4197>] cpuidle_enter+0x17/0x20
[  559.157620]  [<ffffffff810aadfd>] cpu_startup_entry+0x31d/0x340
[  559.157623]  [<ffffffff81737747>] rest_init+0x77/0x80
[  559.157625]  [<ffffffff81d41084>] start_kernel+0x42f/0x43a
[  559.157626]  [<ffffffff81d40a4e>] ? set_init_arg+0x53/0x53
[  559.157628]  [<ffffffff81d40120>] ? early_idt_handlers+0x120/0x120
[  559.157629]  [<ffffffff81d405ee>] x86_64_start_reservations+0x2a/0x2c
[  559.157630]  [<ffffffff81d40733>] x86_64_start_kernel+0x143/0x152
[  559.157631] ---[ end trace 94d117e156a45de1 ]---
[  559.171533] r8169 0000:02:00.0 eth0: link up
Comment 47 Jens 2014-09-14 20:23:33 UTC
The above was on an MSI B85M chipset:
[  559.157583] Hardware name: MSI MS-7817/CSM-B85M-E45 (MS-7817), BIOS V10.5 05/30/2014

This has so far survived several more hibernate/resume cycles and the above network watchdog failure did not repeat (yet).

On another board (MSI B81M chipset), the same kernel does not boot; it crashed during boot on each of several attempts:

....
Sep 14 22:04:34 desktop kernel: [    2.199146] AVX2 version of gcm_enc/dec engaged.
Sep 14 22:04:34 desktop kernel: [    2.199147] AES CTR mode by8 optimization enabled
Sep 14 22:04:34 desktop kernel: [    2.211116] checking generic (e0000000 500000) vs hw (e0000000 10000000)
Sep 14 22:04:34 desktop kernel: [    2.211117] fb: switching to inteldrmfb from VESA VGA
Sep 14 22:04:34 desktop kernel: [    2.211214] Console: switching to colour dummy device 80x25
Sep 14 22:04:34 desktop kernel: [    2.211235] BUG: unable to handle kernel NULL pointer dereference at 0000000000000100
Sep 14 22:04:34 desktop kernel: [    2.211238] IP: [<ffffffff8147057e>] con_set_unimap+0x4e/0x270
Sep 14 22:04:34 desktop kernel: [    2.211239] PGD 213bea067 PUD 213be9067 PMD 0 
Sep 14 22:04:34 desktop kernel: [    2.211240] Oops: 0000 [#1] SMP 
Sep 14 22:04:34 desktop kernel: [    2.211249] Modules linked in: aesni_intel(E+) mxm_wmi(E) bnep(E) snd_seq(E) i915(E+) rfcomm(E+) snd_seq_device(E) aes_x86_64(E) snd_timer(E) lrw(E) bluetooth(E) gf128mul(E) glue_helper(E) ablk_helper(E) drm_kms_helper(E) cryptd(E) snd(E) drm(E) soundcore(E) mei_me(E) serio_raw(E) i2c_algo_bit(E) lpc_ich(E) mei(E) video(E) wmi(E) intel_smartconnect(E) mac_hid(E) tpm_infineon(E) parport_pc(E) ppdev(E) lp(E) parport(E) ahci(E) libahci(E) r8169(E) mii(E)
Sep 14 22:04:34 desktop kernel: [    2.211251] CPU: 3 PID: 497 Comm: setfont Tainted: G            E  3.17.0-rc4+ #3
Sep 14 22:04:34 desktop kernel: [    2.211251] Hardware name: MSI MS-7817/H81M-P33 (MS-7817), BIOS V1.5 05/30/2014
Sep 14 22:04:34 desktop kernel: [    2.211252] task: ffff8800d34f0000 ti: ffff8800d34e4000 task.ti: ffff8800d34e4000
Sep 14 22:04:34 desktop kernel: [    2.211253] RIP: 0010:[<ffffffff8147057e>]  [<ffffffff8147057e>] con_set_unimap+0x4e/0x270
Sep 14 22:04:34 desktop kernel: [    2.211254] RSP: 0018:ffff8800d34e7d50  EFLAGS: 00010246
Sep 14 22:04:34 desktop kernel: [    2.211254] RAX: ffff88021700c310 RBX: 00000000022b9b10 RCX: ffff88021700c000
Sep 14 22:04:34 desktop kernel: [    2.211255] RDX: 00000000000010c6 RSI: 0000000000000282 RDI: 0000000000000282
Sep 14 22:04:34 desktop kernel: [    2.211255] RBP: ffff8800d34e7db8 R08: ffff8800d34e4000 R09: ffff8800d4293200
Sep 14 22:04:34 desktop kernel: [    2.211255] R10: 000000000000b50d R11: 0000000000000010 R12: 0000000000000000
Sep 14 22:04:34 desktop kernel: [    2.211256] R13: 0000000000004b67 R14: ffff88021700c000 R15: 00000000ffffffff
Sep 14 22:04:34 desktop kernel: [    2.211257] FS:  00007f4da387e740(0000) GS:ffff88021fb80000(0000) knlGS:0000000000000000
Sep 14 22:04:34 desktop kernel: [    2.211257] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Sep 14 22:04:34 desktop kernel: [    2.211257] CR2: 0000000000000100 CR3: 00000000d345c000 CR4: 00000000001407e0
Sep 14 22:04:34 desktop kernel: [    2.211258] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
Sep 14 22:04:34 desktop kernel: [    2.211258] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
Sep 14 22:04:34 desktop kernel: [    2.211258] Stack:
Sep 14 22:04:34 desktop kernel: [    2.211259]  ffffffff810b1052 ffff8800d34e7d80 ffffffff813258ac ffff8800d34f0000
Sep 14 22:04:34 desktop kernel: [    2.211260]  0000000000000000 0000000000004b67 000002e1d34e7d90 ffff88021700c000
Sep 14 22:04:34 desktop kernel: [    2.211261]  ffff880036bb9400 0000000000000000 0000000000004b67 ffff88021700c000
Sep 14 22:04:34 desktop kernel: [    2.211261] Call Trace:
Sep 14 22:04:34 desktop kernel: [    2.211264]  [<ffffffff810b1052>] ? up+0x32/0x50
Sep 14 22:04:34 desktop kernel: [    2.211267]  [<ffffffff813258ac>] ? apparmor_capable+0x1c/0x60
Sep 14 22:04:34 desktop kernel: [    2.211270]  [<ffffffff8146a475>] vt_ioctl+0xe75/0x11b0
Sep 14 22:04:34 desktop kernel: [    2.211271]  [<ffffffff8145e06d>] tty_ioctl+0x26d/0xbb0
Sep 14 22:04:34 desktop kernel: [    2.211273]  [<ffffffff811e7250>] do_vfs_ioctl+0x2e0/0x4c0
Sep 14 22:04:34 desktop kernel: [    2.211275]  [<ffffffff8109ab34>] ? vtime_account_user+0x54/0x60
Sep 14 22:04:34 desktop kernel: [    2.211277]  [<ffffffff811e74b1>] SyS_ioctl+0x81/0xa0
Sep 14 22:04:34 desktop kernel: [    2.211279]  [<ffffffff8174eb7f>] tracesys+0xe1/0xe6
Sep 14 22:04:34 desktop kernel: [    2.211287] Code: 14 48 83 c4 40 5b 41 5c 41 5d 41 5e 41 5f 5d c3 0f 1f 44 00 00 48 89 d3 e8 e0 c2 c4 ff 48 8b 4d d0 48 8b 81 18 03 00 00 4c 8b 20 <49> 83 bc 24 00 01 00 00 01 0f 86 03 01 00 00 48 89 c8 48 05 18 
Sep 14 22:04:34 desktop kernel: [    2.211288] RIP  [<ffffffff8147057e>] con_set_unimap+0x4e/0x270
Sep 14 22:04:34 desktop kernel: [    2.211288]  RSP <ffff8800d34e7d50>
Sep 14 22:04:34 desktop kernel: [    2.211289] CR2: 0000000000000100
Sep 14 22:04:34 desktop kernel: [    2.211290] ---[ end trace 4c81f325d41b44a2 ]---

Both run Lubuntu 14.04 with all current updates applied.
The kernels were created using make-kpkg with all default choices (during 'make oldconfig').

If you need more debugging logs, just say so; I can help with testing. Thanks!
Comment 48 Imre Deak 2014-09-15 14:43:01 UTC
Thanks.

I can't say much about the network problem; I think it would be best to open a new bug for it.

Afaics, the console crash happens because i915 switches to the dummy console and then userspace tries to change the font mapping, but the dummy console doesn't support this (its vc_uni_pagedir is never initialized, so the pointer read in con_set_unimap() is NULL). Could you try the following:

diff --git a/drivers/tty/vt/consolemap.c b/drivers/tty/vt/consolemap.c
index 610b720..eb867fa 100644
--- a/drivers/tty/vt/consolemap.c
+++ b/drivers/tty/vt/consolemap.c
@@ -539,6 +539,9 @@ int con_set_unimap(struct vc_data *vc, ushort ct, struct unipair __user *list)
 
 	/* Save original vc_unipagdir_loc in case we allocate a new one */
 	p = *vc->vc_uni_pagedir_loc;
+
+	if (!p)
+		return -EINVAL;
 	
 	if (p->refcount > 1) {
 		int j, k;
Comment 49 Imre Deak 2014-09-15 14:44:10 UTC
(In reply to Imre Deak from comment #48)
> diff --git a/drivers/tty/vt/consolemap.c b/drivers/tty/vt/consolemap.c
> index 610b720..eb867fa 100644
> --- a/drivers/tty/vt/consolemap.c
> +++ b/drivers/tty/vt/consolemap.c
> @@ -539,6 +539,9 @@ int con_set_unimap(struct vc_data *vc, ushort ct, struct
> unipair __user *list)
>  
>       /* Save original vc_unipagdir_loc in case we allocate a new one */
>       p = *vc->vc_uni_pagedir_loc;
> +
> +     if (!p)
> +             return -EINVAL;
>       
>       if (p->refcount > 1) {
>               int j, k;

Oops, wrong patch, the correct one:

diff --git a/drivers/tty/vt/consolemap.c b/drivers/tty/vt/consolemap.c
index 610b720..59b25e0 100644
--- a/drivers/tty/vt/consolemap.c
+++ b/drivers/tty/vt/consolemap.c
@@ -539,6 +539,12 @@ int con_set_unimap(struct vc_data *vc, ushort ct, struct unipair __user *list)
 
 	/* Save original vc_unipagdir_loc in case we allocate a new one */
 	p = *vc->vc_uni_pagedir_loc;
+
+	if (!p) {
+		err = -EINVAL;
+
+		goto out_unlock;
+	}
 	
 	if (p->refcount > 1) {
 		int j, k;
@@ -623,6 +629,7 @@ int con_set_unimap(struct vc_data *vc, ushort ct, struct unipair __user *list)
 		set_inverse_transl(vc, p, i); /* Update inverse translations */
 	set_inverse_trans_unicode(vc, p);
 
+out_unlock:
 	console_unlock();
 	return err;
 }
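
For readers comparing the two patches above, here is a minimal user-space sketch of the difference. It assumes con_set_unimap() already holds the console lock when it reads the page directory pointer, which the out_unlock label in the corrected patch suggests. All names below (model_vc, model_dict, model_console_lock, and so on) are hypothetical stand-ins, not kernel APIs.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical stand-ins for the kernel structures; not the real types. */
struct model_dict { int refcount; };
struct model_vc { struct model_dict **uni_pagedir_loc; };

/* Models how many times the console lock is currently held. */
int lock_depth;

void model_console_lock(void) { lock_depth++; }

void model_console_unlock(void)
{
	assert(lock_depth > 0);	/* unlocking an unheld lock is a bug */
	lock_depth--;
}

/* Shape of the first patch: bail out with a plain return. */
int set_unimap_early_return(struct model_vc *vc)
{
	struct model_dict *p;

	model_console_lock();
	p = *vc->uni_pagedir_loc;
	if (!p)
		return -EINVAL;	/* the lock is still held on this path */

	p->refcount++;		/* placeholder for the real mapping update */
	model_console_unlock();
	return 0;
}

/* Shape of the corrected patch: route the error through the unlock. */
int set_unimap_goto_unlock(struct model_vc *vc)
{
	struct model_dict *p;
	int err = 0;

	model_console_lock();
	p = *vc->uni_pagedir_loc;
	if (!p) {
		err = -EINVAL;
		goto out_unlock;
	}

	p->refcount++;		/* placeholder for the real mapping update */

out_unlock:
	model_console_unlock();
	return err;
}
```

With a NULL page directory, both variants return -EINVAL, but only the goto variant leaves lock_depth at zero afterwards. Under the stated assumption, this is presumably why the first patch was withdrawn.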
Comment 50 Jens 2014-10-02 12:30:12 UTC
The above patch seems to fix the issue. I still have occasional network outages after a resume, another suspend/resume cycle usually fixes this, but the crash is gone (so far). I am running this patch on 3.17 on two machines and they hibernate a few times a day (both MSI-7817 chipsets, one Haswell 81, one 85).
Comment 51 Imre Deak 2014-10-02 13:40:38 UTC
(In reply to Jens from comment #50)
> The above patch seems to fix the issue. I still have occasional network
> outages after a resume, another suspend/resume cycle usually fixes this, but
> the crash is gone (so far). I am running this patch on 3.17 on two machines
> and they hibernate a few times a day (both MSI-7817 chipsets, one Haswell
> 81, one 85).

Thanks!

I sent the VT patch to the maintainers. The patches in the suspend-fix branch are under review on the intel-gfx ML, so we can close this bug (possibly) once those get merged.
Comment 52 Imre Deak 2014-10-24 09:57:09 UTC
Jens, have you seen the problem since your last report (with or w/o the fixes)?

Could you also check whether you can reproduce the problem with the latest -nightly kernel, and with the same tree with the fixes reverted (resetting to 598ae05fd937 - "drm/i915: Emit even number of dwords when emitting LRIs")?
Comment 53 Imre Deak 2014-10-24 10:08:26 UTC
(In reply to Imre Deak from comment #52)
> Jens, have you seen the problem since your last report (with or w/o the
> fixes)?
> 
> Could you also check whether you can reproduce the problem with the latest
> -nightly kernel, and with the same tree with the fixes reverted (resetting to
> 598ae05fd937 - "drm/i915: Emit even number of dwords when emitting LRIs")?

Also, let's continue to track this on fdo:
https://bugs.freedesktop.org/show_bug.cgi?id=82864