Bug 13811

Summary: [GM965/KMS/UXA] memory corruption on resume from hibernate
Product: Drivers
Component: Video(DRI - Intel)
Reporter: Evgeni Golov (sargentd)
Assignee: ykzhao (yakui.zhao)
Status: RESOLVED DUPLICATE
Severity: normal
Priority: P2
CC: airlied, bgamari, bojan, boyarsh, didit21, fgouget, gordon.jin, guido+kernel.org, jim, jmprieto, kernel, kernelbug, kirill, lists, m.v.b, michel.lafonpuyo, mishu, rjw, saki, vi5u0-kernelbugs
Hardware: All
OS: Linux
Kernel Version: 2.6.30.1
Subsystem:
Regression: No
Bisected commit-id:
Attachments: dmesg with segfaults after resume
dmesg log before hibernating
intel_reg_dumper before hibernating
dmesg log after hibernating
Dmesg and intel_reg_dumper dumps for 2.6.32.8
dmesg before on F-12
intel_gpu_dump before on F-12
dmesg after first thaw on F-12
intel_gpu_dump after first thaw on F-12

Description Evgeni Golov 2009-07-22 19:58:15 UTC
[first reported at https://bugs.freedesktop.org/show_bug.cgi?id=22886, but seems like kernel problem]

Hi,

I have a ThinkPad X300 here, with a GM965 inside. Running kernel 2.6.30.1 (and
2.6.31-rc3), xf86-video-intel 2.8.0 (also happened with 2.7.1), xorg 1.6.2. Ah, and it's all x86_64 here, Debian Sid.

When I have KMS enabled, about 50% of my resumes from hibernate (s2disk either via uswsusp or plain "echo disk > /sys/power/state") end up absolutely fubar: X is there, I can move the mouse, type, the running apps run fine, but new ones segfault right after the start (or die with an assertion error). Same happens when there is no Xorg loaded, with only a 1440x900 console and /sys/power/state suspend.

As soon as I disable KMS, I can't reproduce this behavior anymore.
Comment 1 Evgeni Golov 2009-07-23 18:23:37 UTC
Just had a quick test with 2.6.29.3, same behaviour.
Also same with only 2GiB RAM (I had 4 in the tests before).
Comment 2 Evgeni Golov 2009-07-24 18:57:25 UTC
And another test: also happens with a 32bit system.
I'm quite puzzled and have no real idea how to debug this :(
Comment 3 lists 2009-07-24 19:50:13 UTC
This is likely the same bug as Debian
http://bugs.debian.org/cgi-bin/bugreport.cgi?bug=534422
Comment 4 saki 2009-09-06 08:30:38 UTC
I can confirm the same problem on FSC Esprimo Mobile U9200, 4GB ram, FC11,
kernel 2.6.29.6-217.2.8.fc11.i686.PAE
graphic card Intel Mobile GM965/GL960

After resume it crashes.
With the nomodeset option added to the kernel line in /etc/grub.conf, it resumes without problems.
Comment 5 Anton Boyarshinov 2010-02-11 15:18:15 UTC
I have the same problem with 2.6.32.8, built myself, on an Asus eeepc 900.
Additionally, I can say that if i915.ko and related modules are not in the initrd, resume works correctly. But when I add i915, drm, agpgart and intel_agp to the initrd for mode switching at the early boot stage, only a normal boot works OK. Resume with this initrd causes multiple segmentation faults and "bad ELF" error messages. Looks like memory corruption.

Can any additional information help to fix this problem?
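
To put a number on "multiple segmentation faults", the segfault lines in a saved dmesg log can be tallied per process. The following is a minimal sketch, not a tool used by anyone in this report; it assumes the usual x86 kernel message shape `name[pid]: segfault at ...`, and the function name `count_segfaults` is made up for illustration.

```python
import re
from collections import Counter

# Assumed line shape (x86 kernels), with an optional "[  123.456789] " prefix:
#   ls[1234]: segfault at 0 ip 00007f... sp 00007f... error 4 in libc-2.11.so[...]
SEGFAULT_RE = re.compile(
    r"^(?:\[\s*\d+\.\d+\]\s*)?(?P<comm>\S+)\[(?P<pid>\d+)\]: segfault at [0-9a-f]+ "
)


def count_segfaults(dmesg_text):
    """Return a Counter mapping process name -> number of segfault lines."""
    counts = Counter()
    for line in dmesg_text.splitlines():
        match = SEGFAULT_RE.match(line)
        if match:
            counts[match.group("comm")] += 1
    return counts
```

Feeding it the text of a post-resume dmesg log gives a quick view of which programs crashed and how often, which makes logs from different thaw cycles easier to compare.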
Comment 6 Anton Boyarshinov 2010-02-12 09:48:48 UTC
Created attachment 24996 [details]
dmesg with segfaults after resume
Comment 7 I. P. Kowal 2010-02-27 16:26:10 UTC
Numerous Intel owners are affected by this bug and nobody has done a thing since JULY. Linux 2.6.33 just came out and guess what? It's still broken.

Didn't people claim that this KMS business would make power management *so much* better, easier and more reliable? As if we hadn't heard that one before with UXA and other Intel technology that is so innovative yet always fails to work as well as the old one.

Even in the periods when you finally fix i915, we still can't hibernate, because we are afraid for our data: you break it every 3 months and we never know when it will hit again.
Comment 8 Bojan Smojver 2010-03-11 20:59:52 UTC
My experience with kernel 2.6.33 and suspend/resume (F-13 alpha cannot hibernate/thaw for me at all) is here:

https://bugzilla.redhat.com/show_bug.cgi?id=537494#c31

More or less the same thing. With KMS enabled, it doesn't work. Also, the new Intel driver doesn't have user mode setting, so KMS is mandatory.
Comment 9 ykzhao 2010-04-02 08:16:29 UTC
Hi, Bojan/Anton
    Sorry for the late response.
    Will you please add the boot option of "drm.debug=0x04 initcall_debug" and do the following test on 2.6.33 kernel?
    a. enter the console mode
    b. Get the output of intel_reg_dumper, dmesg before using hibernation
    c. use the command of "echo disk > /sys/power/state" to enter the hibernation
    d. press the power button and add the boot option of "drm.debug=0x04 resume=XXX". XXX is the swap partition.
    e. after the system is resumed, please confirm whether the system can work well under the console mode and get the output of dmesg, intel_reg_dumper.

Thanks.
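
The procedure above can be sketched as a small helper that builds the kernel parameters for each boot and saves the two requested outputs. This is only an illustration under assumptions: the names `boot_options` and `capture` and the log file naming are hypothetical, and the hibernation step itself (`echo disk > /sys/power/state`) is deliberately left out so that running the sketch never suspends a machine.

```python
import subprocess
from pathlib import Path

# Options requested for the boot from which the system is hibernated.
DEBUG_OPTS = ["drm.debug=0x04", "initcall_debug"]


def boot_options(stage, resume_dev=None):
    """Kernel parameters for the "hibernate" boot or the "resume" boot (step d)."""
    if stage == "hibernate":
        return " ".join(DEBUG_OPTS)
    if stage == "resume":
        if not resume_dev:
            raise ValueError("the resume boot needs the swap partition")
        return f"drm.debug=0x04 resume={resume_dev}"
    raise ValueError(f"unknown stage: {stage!r}")


def capture(tag, outdir=".", timeout=60):
    """Save dmesg and intel_reg_dumper output as <tool>.<tag>.log (steps b and e)."""
    for tool in ("dmesg", "intel_reg_dumper"):
        try:
            result = subprocess.run(
                [tool], capture_output=True, text=True, timeout=timeout
            )
            output = result.stdout + result.stderr
        except FileNotFoundError:
            output = f"{tool}: not installed\n"
        except subprocess.TimeoutExpired:
            # The dumper may hang after a bad thaw; record that instead of blocking.
            output = f"{tool}: no result after {timeout}s\n"
        Path(outdir, f"{tool}.{tag}.log").write_text(output)
```

A run would call `capture("before")` on the console before hibernating and `capture("after")` once the system is back, with `boot_options("resume", "/dev/sda2")` showing what to type at the boot prompt if the swap partition happens to be /dev/sda2.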
Comment 10 François Gouget 2010-04-11 17:32:37 UTC
I have the same problem as Bojan and Anton, so I tried to follow the above instructions. However, after coming out of hibernation the console was not really functional; for instance, I got:

  # ls
  ls: error while loading shared libraries: /lib/libselinux.so.1: unexpected PLT reloc type 0x00

And also after a little bit I got this a few times:

  INIT: Id "3" respawning too fast: disabled for 5 minutes

Somehow dmesg did work anyway but intel_reg_dumper only started, it produced no output and locked up the system (caused more of the INIT messages above).
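
Errors like the libselinux one suggest that the in-memory copy of a file no longer matches what it was before hibernation. One possible way to check this, sketched here as an assumption rather than something done in this report, is to hash a set of libraries before hibernating and again after the thaw. The helper names `snapshot` and `changed` are hypothetical. File reads normally go through the page cache, so a changed digest hints at a corrupted cached page; an unchanged digest says nothing about anonymous memory.

```python
import hashlib
from pathlib import Path


def snapshot(paths):
    """Map each file path to the SHA-256 of its contents (None if unreadable)."""
    digests = {}
    for path in paths:
        try:
            digests[str(path)] = hashlib.sha256(Path(path).read_bytes()).hexdigest()
        except OSError:
            digests[str(path)] = None
    return digests


def changed(before, after):
    """Return the sorted paths whose digest differs between two snapshots."""
    return sorted(p for p in before if before[p] != after.get(p))
```

The "before" snapshot would have to be stored on disk (for example as JSON) and compared after resume; if the interpreter itself fails to start after the thaw, that is of course a data point too.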

I'm attaching the logs:
Before: fg-dmesg-pre.log.bz2 and fg-intel-pre.log.bz2
After:  fg-dmesg-post.log.bz2

My system:
* An EeePC 1000H
  00:02.0 VGA compatible controller: Intel Corporation Mobile 945GME Express Integrated Graphics Controller (rev 03) (prog-if 00 [VGA controller])
          Subsystem: ASUSTeK Computer Inc. Device 8340
          Flags: bus master, fast devsel, latency 0, IRQ 16
          Memory at f7f00000 (32-bit, non-prefetchable) [size=512K]
          I/O ports at dc00 [size=8]
          Memory at d0000000 (32-bit, prefetchable) [size=256M]
          Memory at f7ec0000 (32-bit, non-prefetchable) [size=256K]
          Expansion ROM at <unassigned> [disabled]
          Capabilities: [90] MSI: Enable- Count=1/1 Maskable- 64bit-
          Capabilities: [d0] Power Management version 2
          Kernel driver in use: i915

* Debian Testing with a kernel from Debian Unstable:
  $ uname -a
  Linux malte 2.6.33-2-686 #1 SMP Thu Mar 18 07:30:30 UTC 2010 i686 GNU/Linux

Should I have stopped the X server when switching to the console?
(i.e. Ctrl-Alt-F2, login, /etc/init.d/gdm stop)
Comment 11 François Gouget 2010-04-11 17:35:48 UTC
Created attachment 25956 [details]
dmesg log before hibernating
Comment 12 François Gouget 2010-04-11 17:37:24 UTC
Created attachment 25957 [details]
intel_reg_dumper before hibernating
Comment 13 François Gouget 2010-04-11 17:40:15 UTC
Created attachment 25958 [details]
dmesg log after hibernating
Comment 14 Anton Boyarshinov 2010-04-12 19:47:10 UTC
I have no vanilla 2.6.33 here and will build and test it tomorrow. Following test was done on 2.6.32.8.

Files in archive:
First try
dmesg before hibernating: dmesg.1
intel_reg_dumper before hibernating: dump.1

dmesg after hibernating: dmesg.2
intel_reg_dumper: can't load libc.so.6 :-D

Second try
dmesg before hibernating: dmesg.3
intel_reg_dumper before hibernating: dump.3

dmesg after hibernating: Segmentation fault
intel_reg_dumper after hibernating: dump.4

The tests were run on the console, with the X server running on another VT.
Comment 15 Anton Boyarshinov 2010-04-12 19:48:52 UTC
Created attachment 25972 [details]
Dmesg and intel_reg_dumper dumps for 2.6.32.8
Comment 16 Bojan Smojver 2010-04-13 07:48:27 UTC
First, on F-13 there doesn't seem to be a program called intel_reg_dumper. There is something called intel_gpu_dump though. Not sure if that's what you're after.

Second, it seems that F-13 is not capable of thawing hibernated images right now. I get no root/no boot message on startup. I have dmesg and intel_gpu_dump output before hibernation, but I'm pretty sure that'll be useless without the same thing after the thaw.
Comment 17 Bojan Smojver 2010-04-13 08:20:14 UTC
OK, some success with this on F-12. I used pm-hibernate, which is just a bit fancier way to hibernate (it obeys blacklisted modules and such).

I'll attach dmesg and intel_gpu_dump output before the hibernation and after the first thaw. On second thaw, the machine went berserk (could not login to any console any more etc.). So, the whole thing probably isn't going to help much, but here it is nevertheless.
Comment 18 Bojan Smojver 2010-04-13 08:21:58 UTC
Created attachment 25981 [details]
dmesg before on F-12
Comment 19 Bojan Smojver 2010-04-13 08:23:12 UTC
Created attachment 25982 [details]
intel_gpu_dump before on F-12
Comment 20 Bojan Smojver 2010-04-13 08:23:48 UTC
Created attachment 25983 [details]
dmesg after first thaw on F-12
Comment 21 Bojan Smojver 2010-04-13 08:24:40 UTC
Created attachment 25984 [details]
intel_gpu_dump after first thaw on F-12
Comment 22 Bojan Smojver 2010-04-13 08:26:32 UTC
I tried dumping intel gpu state after second thaw, but that resulted in an empty file.

Also, echo disk > /sys/power/state does not work on my notebook (hang on thaw), because module b44 must be unloaded before hibernation (which is what pm-hibernate does).
Comment 23 Michel Lafon-Puyo 2010-04-14 14:12:19 UTC
Hi,

I'm running Arch with kernel 2.6.33 on a Samsung NC10 netbook (Intel GMA950 graphics chipset) with early KMS initialization and I'm experiencing the same problem.

Twice, I tried to follow the instructions given above, but I was unable to get the log from dmesg or to dump the GPU registers after thaw (segmentation fault).

Here is what I did,
First time:
 - I added drm.debug=0x04 initcall_debug to the boot cmdline
 - After initialization, I switched to a console VT (X was running on another VT)
 - dmesg + intel_gpu_dump
 - echo disk > /sys/power/state
 - I pressed the power button, added drm.debug=0x04 resume=/dev/sda2 to the boot cmd line => thaw was successful and I was back to the console VT
 - dmesg => segmentation fault
 - intel_gpu_dump => segmentation fault

Second time:
 - I added drm.debug=0x04 initcall_debug to the boot cmdline AND added "single"
 - After initialization, I logged in to the console VT (X wasn't running)
 - dmesg + intel_gpu_dump
 - I pressed the power button, added drm.debug=0x04 resume=/dev/sda2 to the boot cmd line => thaw was successful and I was back to the console VT
 - dmesg => segmentation fault
 - intel_gpu_dump => segmentation fault

If needed, I can post the dmesg log and reg dump from the second try but it is very similar to those sent by Bojan.

Thank you for your help.
Comment 24 Anton Boyarshinov 2010-04-19 07:35:58 UTC
(In reply to comment #14)
> I have no vanilla 2.6.33 here and will build and test it tomorrow. Following
> test was done on 2.6.32.8.
The SSD in that machine has completely crashed, so I can't test this bug any more, sorry.
Comment 25 Bojan Smojver 2010-05-04 06:55:06 UTC
I was able to hibernate F-13 after taking out some boot options that Anaconda stuffed in there. Anyhow, still seeing segfaults. See:

https://bugzilla.redhat.com/attachment.cgi?id=411190
Comment 26 didit21 2010-05-26 17:26:51 UTC
Hi,

I've been affected by this bug for some time but since upgrading to kernel 2.6.34 I can hibernate my laptop and resume it again.
I'm running debian/unstable with kernel 2.6.34-1~experimental.1 from experimental.
Display controller: Intel Corporation Mobile 945GM/GMS/GME, 943/940GML Express Integrated Graphics Controller (rev 03)
Have I been lucky so far or are other people seeing the same improvement with this kernel?
Comment 27 Bojan Smojver 2010-05-28 02:30:05 UTC
(In reply to comment #26)

> I've been affected by this bug for some time but since upgrading to kernel
> 2.6.34 I can hibernate my laptop and resume it again.

I tried kernel-2.6.34-11.fc14 with F-13 (on advice of one of the Fedora developers), but that did not work. On fourth hibernate/thaw cycle, segfaults made the system unusable.

Did you try to hibernate/thaw multiple times?
Comment 28 didit21 2010-05-28 04:54:06 UTC
(In reply to comment #27)
> I tried kernel-2.6.34-11.fc14 with F-13 (on advice of one of the Fedora
> developers), but that did not work. On fourth hibernate/thaw cycle, segfaults
> made the system unusable.
> 
> Did you try to hibernate/thaw multiple times?

Sure. I did it 14 times already and it still works OK. It's much more than with previous kernels. I hope it's not going to break but if it occurs, I'll add a new comment to this bug.
Comment 29 Bojan Smojver 2010-05-28 07:19:33 UTC
(In reply to comment #28)
 
> Sure. I did it 14 times already and it still works OK. It's much more than
> with previous kernels. I hope it's not going to break but if it occurs, I'll
> add a new comment to this bug.

Excellent. Thanks for the update.
Comment 30 I. P. Kowal 2010-05-28 12:24:09 UTC
> Have I been lucky so far or are other people
> seeing the same improvement with this kernel?

There are no improvements in 2.6.34. Tested on two generations (GM45, GM965) of intel graphics on two laptops of completely different vendors.
Comment 31 M. Vefa Bicakci 2010-06-28 19:06:08 UTC
Hello all,

I am having the same issue as well. After I resume from suspend
to disk (e.g. thaw), I get a lot of segfaults. However I started
to see this problem after upgrading from 2.6.32.7 to 2.6.32.9
a few months ago.

I bisected the i915 related commits between these two versions
and found out that the following commit causes the memory
corruption issue after resuming from suspend to disk.

=== 8< ===
commit d8e0902806c0bd2ccc4f6a267ff52565a3ec933b
Author: Chris Wilson <chris@chris-wilson.co.uk>
Date:   Wed Jan 27 13:36:32 2010 +0000

    drm/i915: Selectively enable self-reclaim

    commit 4bdadb9785696439c6e2b3efe34aa76df1149c83 upstream.

    Having missed the ENOMEM return via i915_gem_fault(), there are probably
    other paths that I also missed. By not enabling NORETRY by default these
    paths can run the shrinker and take memory from the system (but not from
    our own inactive lists because our shrinker can not run whilst we hold
    the struct mutex) and this may allow the system to survive a little longer
    whilst our drivers consume all available memory.

    References:
      OOM killer unexpectedly called with kernel 2.6.32
      http://bugzilla.kernel.org/show_bug.cgi?id=14933

    v2: Pass gfp into page mapping.
    v3: Use new read_cache_page_gfp() instead of open-coding.

    ...
=== >8 ===

When I revert this commit from any recent vanilla kernel, I
no longer have the memory corruption issue. I have used/tested
kernels compiled this way extensively.

I reported my finding in LKML, but only Rafael Wysocki has shown
interest in it. Here is a link to my first e-mail on LKML:

http://marc.info/?l=linux-kernel&m=126845754409543&w=2

Here's a link to the regression report:

http://bugzilla.kernel.org/show_bug.cgi?id=15585

One thing I realize is that some people have had this issue long
before 2.6.32.8, so I am not sure whether the problems we are
having are the same, or they just show the same symptoms. (This
is the reason I haven't marked 15585 as a duplicate of this bug.)

Could anyone try to revert the above-mentioned commit on a
vanilla 2.6.34 tree, compile it and try the resulting kernel?

Another piece of information is that the memory corruption issue
doesn't exist if one does *not* load the i915.ko module *before*
one calls the "resume" binary (uswsusp or regular version) in
the initramfs. (I stumbled into this information on Arch users'
forum via a search engine.) One way of achieving this is removing
the i915.ko module from the initramfs. I have tested this as well,
and can verify that it works. Obviously, this breaks plymouth and
the like. (For the record, I am using Debian Sid.)
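
Whether an initramfs loads the graphics stack early can be checked from a listing of its contents. The sketch below takes such a listing as text and reports which of the modules named in this report are present. It is a hedged illustration: the function name `early_kms_modules` is hypothetical, and how the listing is produced (for example with `lsinitramfs` on Debian) is an assumption about the reader's distribution.

```python
from pathlib import PurePosixPath

# Modules named in this report as being added to the initrd for early KMS.
EARLY_KMS_MODULES = ("i915", "drm", "agpgart", "intel_agp")


def early_kms_modules(listing, wanted=EARLY_KMS_MODULES):
    """Return which of `wanted` appear as module files in an initramfs listing."""
    found = set()
    for line in listing.splitlines():
        name = PurePosixPath(line.strip()).name
        for suffix in (".ko", ".ko.gz", ".ko.xz"):
            if name.endswith(suffix):
                # Module file names may use '-' where the module name uses '_'.
                found.add(name[: -len(suffix)].replace("-", "_"))
    return [m for m in wanted if m in found]
```

An empty result for an initramfs that still resumes correctly would be consistent with the observation above; a non-empty one identifies the image as a candidate for the corruption.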

I am willing to help in debugging this problem.

Regards,

M. Vefa Bicakci
Comment 32 Bojan Smojver 2010-07-03 03:59:58 UTC
Looks like progress:

https://bugzilla.redhat.com/show_bug.cgi?id=537494#c53
Comment 34 vi5u0-kernelbugs 2011-01-20 16:23:27 UTC
Sorry all.  When I resume from hibernation, my system still hangs (with about 25% probability) when kernel mode setting (for i915) is enabled, but does not hang with the nomodeset boot parameter.  This is in a kernel to which 985b823b919273fe1327d56d2196b4f92e5d0fae has been applied (Gentoo's 2.6.36-gentoo-r5).  More details, including a kernel oops message and some PM_TRACE output, at <http://forums.gentoo.org/viewtopic-t-860680.html>.

This is now an urgent problem, because distros are shipping XF86 Intel video drivers as stable, for which kernel mode setting is mandatory.
Comment 35 Petr Tesarik 2011-06-06 10:23:28 UTC
I've experienced similar trouble with openSUSE 11.4. The problematic page is saved correctly, but it gets corrupted upon resume. The corruption always happens in the first 32 bytes of the page. More details can be found at

https://bugzilla.novell.com/show_bug.cgi?id=697699
Comment 36 vi5u0-kernelbugs 2011-06-06 10:30:14 UTC
The problem no longer occurs in Gentoo's 2.6.38-gentoo-r1 with in-kernel memtest enabled.  But I don't know whether it's the kernel upgrade or the enabling of memtest that fixed it.  Does in-kernel memtest write a log message anywhere when it finds bad RAM?
Comment 37 Bojan Smojver 2011-06-07 01:28:20 UTC
I also don't think this has been fixed. Or maybe it's now a different problem, but hibernate/thaw still doesn't work on my ThinkPad T510 with Intel graphics.

I can suspend/resume as many times as I like. However, if I hibernate/thaw, after a few cycles, my box will OOPS almost randomly:

https://bugzilla.redhat.com/show_bug.cgi?id=709915

So, yeah, this should most likely be reopened, because it seems that folks with Intel graphics are still experiencing the same (or similar) problems.
Comment 38 Petr Tesarik 2011-06-10 13:24:46 UTC
@vi5u0: I am quite confident this is not bad RAM. I have run memtest86+ on the machine and found no errors. Moreover the corruption happens in many different physical memory pages, but it always follows one of two corruption patterns (all zeroes or series of 0x00aaaaaa in the first 16 bytes of a page).
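
The two patterns described here can be recognised mechanically when comparing a page against its saved copy. The following is a minimal sketch under stated assumptions: the name `classify_page_head` is hypothetical, and "series of 0x00aaaaaa" is read as four consecutive 32-bit little-endian words, which the comment does not spell out.

```python
import struct

HEAD = 16  # number of leading bytes described as affected in this comment


def classify_page_head(page):
    """Classify the first 16 bytes of a page as 'zeroes', '0x00aaaaaa' or 'other'."""
    head = bytes(page[:HEAD])
    if len(head) < HEAD:
        raise ValueError("need at least 16 bytes")
    if head == bytes(HEAD):
        return "zeroes"
    # Assumption: the pattern is four little-endian 32-bit words.
    words = struct.unpack("<4I", head)
    if all(word == 0x00AAAAAA for word in words):
        return "0x00aaaaaa"
    return "other"
```

Many pages legitimately start with zeroes, so the result only means something for pages already known to differ from their pre-hibernation contents.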
Comment 39 vi5u0-kernelbugs 2011-06-10 18:39:03 UTC
(In reply to comment #38)
> @vi5u0: I am quite confident this is not bad RAM.

I think you're right: I've now disabled memtest in my 2.6.38-gentoo-r1, and everything still works fine.
Comment 40 Rafael J. Wysocki 2011-06-10 21:26:48 UTC
I'd like to continue debugging this issue in bug #37142 that gives a good
summary of findings so far.

*** This bug has been marked as a duplicate of bug 37142 ***