Bug 13811
Summary: | [GM965/KMS/UXA] memory corruption on resume from hibernate | ||
---|---|---|---|
Product: | Drivers | Reporter: | Evgeni Golov (sargentd) |
Component: | Video(DRI - Intel) | Assignee: | ykzhao (yakui.zhao) |
Status: | RESOLVED DUPLICATE | ||
Severity: | normal | CC: | airlied, bgamari, bojan, boyarsh, didit21, fgouget, gordon.jin, guido+kernel.org, jim, jmprieto, kernel, kernelbug, kirill, lists, m.v.b, michel.lafonpuyo, mishu, rjw, saki, vi5u0-kernelbugs |
Priority: | P2 | ||
Hardware: | All | ||
OS: | Linux | ||
Kernel Version: | 2.6.30.1 | Subsystem: | |
Regression: | No | Bisected commit-id: | |
Attachments: |
dmesg with segfaults after resume
dmesg log before hibernating
intel_reg_dumper before hibernating
dmesg log after hibernating
Dmesg and intel_reg_dumper dumps for 2.6.32.8
dmesg before on F-12
intel_gpu_dump before on F-12
dmesg after first thaw on F-12
intel_gpu_dump after first thaw on F-12 |
Description
Evgeni Golov
2009-07-22 19:58:15 UTC
Just had a quick test with 2.6.29.3, same behaviour. Also same with only 2GiB RAM (I had 4 in the tests before). And another test: also happens with a 32bit system. I'm quite puzzled and have no real idea how to debug this :(

This is likely the same bug as Debian http://bugs.debian.org/cgi-bin/bugreport.cgi?bug=534422

I can confirm the same problem on an FSC Esprimo Mobile U9200 with 4GB RAM, FC11, kernel 2.6.29.6-217.2.8.fc11.i686.PAE and an Intel Mobile GM965/GL960 graphics card: it crashes after resume. With the nomodeset option added to the kernel line in /etc/grub.conf, it resumes without problems.

I have the same problem with 2.6.32.8, built myself, on an Asus Eee PC 900. Additionally, I can say that if i915.ko and the related modules are not in the initrd, resume works correctly. But when I add i915, drm, agpgart and intel_agp to the initrd for mode switching at the early boot stage, only a normal boot works OK. Resume with this initrd causes multiple segmentation faults and "bad ELF" error messages. Looks like memory corruption. Can any additional information help to fix this problem?

Created attachment 24996 [details]
dmesg with segfaults after resume
Numerous Intel owners are affected by the bug and nobody has done a thing since JULY. Linux 2.6.33 just came out and guess what? It's still broken. Didn't people claim that this KMS business would make power management *so much* better, easier and more reliable? As if we hadn't heard that one before with UXA and other Intel technology that is so innovative yet never works as well as the old one. Even in the periods when you finally fix i915, we still can't hibernate, because we fear for our data: you break it every 3 months and we never know when it will hit again.

My experience with kernel 2.6.33 and suspend/resume (F-13 alpha cannot hibernate/thaw for me at all) is here: https://bugzilla.redhat.com/show_bug.cgi?id=537494#c31 More or less the same thing. With KMS enabled, it doesn't work. Also, the new Intel driver doesn't have user mode switching, so KMS is mandatory.

Hi, Bojan/Anton
Sorry for the late response. Will you please add the boot option "drm.debug=0x04 initcall_debug" and do the following test on a 2.6.33 kernel?
a. enter the console mode
b. get the output of intel_reg_dumper and dmesg before hibernating
c. use the command "echo disk > /sys/power/state" to enter hibernation
d. press the power button and add the boot option "drm.debug=0x04 resume=XXX", where XXX is the swap partition
e. after the system has resumed, please confirm whether it works well in console mode, and get the output of dmesg and intel_reg_dumper
Thanks.

I have the same problem as Bojan and Anton so I tried to follow the above instructions. 
However, after coming out of hibernation the console was not really functional. For instance I got:
# ls
ls: error while loading shared libraries: /lib/libselinux.so.1: unexpected PLT reloc type 0x00
And after a little while I got this a few times:
INIT: Id "3" respawning too fast: disabled for 5 minutes
Somehow dmesg still worked, but intel_reg_dumper only started: it produced no output and locked up the system (causing more of the INIT messages above).
I'm attaching the logs:
Before: fg-dmesg-pre.log.bz2 and fg-intel-pre.log.bz2
After: fg-dmesg-post.log.bz2
My system:
* An EeePC 1000H
00:02.0 VGA compatible controller: Intel Corporation Mobile 945GME Express Integrated Graphics Controller (rev 03) (prog-if 00 [VGA controller])
    Subsystem: ASUSTeK Computer Inc. Device 8340
    Flags: bus master, fast devsel, latency 0, IRQ 16
    Memory at f7f00000 (32-bit, non-prefetchable) [size=512K]
    I/O ports at dc00 [size=8]
    Memory at d0000000 (32-bit, prefetchable) [size=256M]
    Memory at f7ec0000 (32-bit, non-prefetchable) [size=256K]
    Expansion ROM at <unassigned> [disabled]
    Capabilities: [90] MSI: Enable- Count=1/1 Maskable- 64bit-
    Capabilities: [d0] Power Management version 2
    Kernel driver in use: i915
* Debian Testing with a kernel from Debian Unstable:
$ uname -a
Linux malte 2.6.33-2-686 #1 SMP Thu Mar 18 07:30:30 UTC 2010 i686 GNU/Linux
Should I have stopped the X server when switching to the console? (i.e. Ctrl-Alt-F2, login, /etc/init.d/gdm stop)

Created attachment 25956 [details]
dmesg log before hibernating
Created attachment 25957 [details]
intel_reg_dumper before hibernating
Created attachment 25958 [details]
dmesg log after hibernating
I have no vanilla 2.6.33 here and will build and test it tomorrow. The following test was done on 2.6.32.8. Files in the archive:
First try:
- dmesg before hibernating: dmesg.1
- intel_reg_dumper before hibernating: dump.1
- dmesg after hibernating: dmesg.2
- intel_reg_dumper after hibernating: can't load libc.so.6 :-D
Second try:
- dmesg before hibernating: dmesg.3
- intel_reg_dumper before hibernating: dump.3
- dmesg after hibernating: Segmentation fault
- intel_reg_dumper after hibernating: dump.4
The tests were run in the console, but with the X server running on another VT.

Created attachment 25972 [details]
Dmesg and intel_reg_dumper dumps for 2.6.32.8
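The before/after dumps requested above lend themselves to a line-by-line comparison. The following is only a minimal sketch, assuming the dumps are plain text with one value per line; the function name `changed_lines` and the register names in the usage note are hypothetical and not taken from any attachment on this bug.

```python
import difflib


def changed_lines(before_text, after_text):
    """Return the lines that differ between two plain-text dumps.

    Removed lines are prefixed with "- " and added lines with "+ ",
    following difflib.ndiff conventions.
    """
    diff = difflib.ndiff(before_text.splitlines(), after_text.splitlines())
    # Keep only real additions and removals; drop unchanged lines ("  ")
    # and the intraline hint lines ("? ") that ndiff emits.
    return [line for line in diff if line.startswith(("- ", "+ "))]
```

For example, `changed_lines(open("dump.1").read(), open("dump.4").read())` would list what changed between a dump taken before hibernation and one taken after the thaw. The file names follow the archive listing above, but the archive layout is an assumption.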
First, on F-13 there doesn't seem to be a program called intel_reg_dumper. There is something called intel_gpu_dump though. Not sure if that's what you're after. Second, it seems that F-13 is not capable of thawing hibernated images right now. I get a no root/no boot message on startup. I have dmesg and intel_gpu_dump output from before hibernation, but I'm pretty sure that'll be useless without the same thing after the thaw.

OK, some success with this on F-12. I used pm-hibernate, which is just a slightly fancier way to hibernate (it obeys blacklisted modules and such). I'll attach dmesg and intel_gpu_dump output from before the hibernation and after the first thaw. On the second thaw, the machine went berserk (could not log in to any console any more etc.). So, the whole thing probably isn't going to help much, but here it is nevertheless.

Created attachment 25981 [details]
dmesg before on F-12
Created attachment 25982 [details]
intel_gpu_dump before on F-12
Created attachment 25983 [details]
dmesg after first thaw on F-12
Created attachment 25984 [details]
intel_gpu_dump after first thaw on F-12
I tried dumping the Intel GPU state after the second thaw, but that resulted in an empty file. Also, echo disk > /sys/power/state does not work on my notebook (hang on thaw), because module b44 must be unloaded before hibernation (which is what pm-hibernate does).

Hi, I'm running Arch with kernel 2.6.33 on a Samsung NC10 netbook (Intel GMA950 graphics chipset) with early KMS initialization and I'm experiencing the same problem. Twice, I tried to follow the instructions given above, but I was unable to get the log from dmesg or to dump the GPU registers after thaw (segmentation fault). Here is what I did.
First time:
- I added drm.debug=0x04 initcall_debug to the boot cmdline
- After initialization, I switched to a console VT (X was running on another VT)
- dmesg + intel_gpu_dump
- echo disk > /sys/power/state
- I pressed the power button, added drm.debug=0x04 resume=/dev/sda2 to the boot cmdline => thaw was successful and I was back on the console VT
- dmesg => segmentation fault
- intel_gpu_dump => segmentation fault
Second time:
- I added drm.debug=0x04 initcall_debug to the boot cmdline AND added "single"
- After initialization, I logged in to the console VT (X wasn't running)
- dmesg + intel_gpu_dump
- I pressed the power button, added drm.debug=0x04 resume=/dev/sda2 to the boot cmdline => thaw was successful and I was back on the console VT
- dmesg => segmentation fault
- intel_gpu_dump => segmentation fault
If needed, I can post the dmesg log and reg dump from the second try, but they are very similar to those sent by Bojan. Thank you for your help.

(In reply to comment #14)
> I have no vanilla 2.6.33 here and will build and test it tomorrow. Following
> test was done on 2.6.32.8.
The SSD in that machine has completely died, so I can't test this bug any more, sorry.

I was able to hibernate F-13 after taking out some boot options that Anaconda stuffed in there. Anyhow, still seeing segfaults. 
See: https://bugzilla.redhat.com/attachment.cgi?id=411190

Hi, I've been affected by this bug for some time but since upgrading to kernel 2.6.34 I can hibernate my laptop and resume it again. I'm running debian/unstable with kernel 2.6.34-1~experimental.1 from experimental.
Display controller: Intel Corporation Mobile 945GM/GMS/GME, 943/940GML Express Integrated Graphics Controller (rev 03)
Have I been lucky so far or are other people seeing the same improvement with this kernel?

(In reply to comment #26)
> I've been affected by this bug for some time but since upgrading to kernel
> 2.6.34 I can hibernate my laptop and resume it again.
I tried kernel-2.6.34-11.fc14 with F-13 (on advice of one of the Fedora developers), but that did not work. On the fourth hibernate/thaw cycle, segfaults made the system unusable. Did you try to hibernate/thaw multiple times?

(In reply to comment #27)
> I tried kernel-2.6.34-11.fc14 with F-13 (on advice of one of the Fedora
> developers), but that did not work. On fourth hibernate/thaw cycle, segfaults
> made the system unusable.
>
> Did you try to hibernate/thaw multiple times?
Sure. I did it 14 times already and it still works OK. That's many more than with previous kernels. I hope it's not going to break, but if it does, I'll add a new comment to this bug.

(In reply to comment #28)
> Sure. I did it 14 times already and it still works OK. It's much more than
> with
> previous kernels. I hope it's not going to break but if it occurs, I'll add a
> new comment to this bug.
Excellent. Thanks for the update.

> Have I been lucky so far or are other people
> seeing the same improvement with this kernel?
There are no improvements in 2.6.34. Tested on two generations (GM45, GM965) of intel graphics on two laptops of completely different vendors.
Hello all,
I am having the same issue as well. After I resume from suspend to disk (i.e. thaw), I get a lot of segfaults. However, I started to see this problem after upgrading from 2.6.32.7 to 2.6.32.9 a few months ago. I bisected the i915-related commits between these two versions and found that the following commit causes the memory corruption issue after resuming from suspend to disk.
=== 8< ===
commit d8e0902806c0bd2ccc4f6a267ff52565a3ec933b
Author: Chris Wilson <chris@chris-wilson.co.uk>
Date: Wed Jan 27 13:36:32 2010 +0000

drm/i915: Selectively enable self-reclaim

commit 4bdadb9785696439c6e2b3efe34aa76df1149c83 upstream.

Having missed the ENOMEM return via i915_gem_fault(), there are probably other paths that I also missed. By not enabling NORETRY by default these paths can run the shrinker and take memory from the system (but not from our own inactive lists because our shrinker can not run whilst we hold the struct mutex) and this may allow the system to survive a little longer whilst our drivers consume all available memory.

References: OOM killer unexpectedly called with kernel 2.6.32 http://bugzilla.kernel.org/show_bug.cgi?id=14933

v2: Pass gfp into page mapping.
v3: Use new read_cache_page_gfp() instead of open-coding.
...
=== >8 ===
When I revert this commit from any recent vanilla kernel, I no longer have the memory corruption issue. I have used/tested kernels compiled this way extensively. I reported my finding on LKML, but only Rafael Wysocki has shown interest in it.
Here is a link to my first e-mail on LKML: http://marc.info/?l=linux-kernel&m=126845754409543&w=2
Here's a link to the regression report: http://bugzilla.kernel.org/show_bug.cgi?id=15585
One thing I realize is that some people have had this issue since long before 2.6.32.8, so I am not sure whether the problems we are having are the same, or whether they just show the same symptoms. (This is the reason I haven't marked 15585 as a duplicate of this bug.) 
Could anyone try to revert the above-mentioned commit on a vanilla 2.6.34 tree, compile it and try the resulting kernel?
Another piece of information is that the memory corruption issue doesn't exist if one does *not* load the i915.ko module *before* one calls the "resume" binary (uswsusp or regular version) in the initramfs. (I stumbled upon this information on the Arch users' forum via a search engine.) One way of achieving this is removing the i915.ko module from the initramfs. I have tested this as well, and can verify that it works. Obviously, this breaks plymouth and the like. (For the record, I am using Debian Sid.)
I am willing to help in debugging this problem.
Regards,
M. Vefa Bicakci

Looks like progress: https://bugzilla.redhat.com/show_bug.cgi?id=537494#c53

Sorry all. When I resume from hibernation, my system still hangs (with about 25% probability) when kernel mode setting (for i915) is enabled, but does not hang with the nomodeset boot parameter. This is in a kernel to which 985b823b919273fe1327d56d2196b4f92e5d0fae has been applied (Gentoo's 2.6.36-gentoo-r5). More details, including a kernel oops message and some PM_TRACE output, at <http://forums.gentoo.org/viewtopic-t-860680.html>. This is now an urgent problem, because distros are shipping XF86 Intel video drivers as stable, for which kernel mode setting is mandatory.

I've experienced similar trouble with openSUSE 11.4. The problematic page is saved correctly, but it gets corrupted upon resume. The corruption always happens in the first 32 bytes of the page. More details can be found at https://bugzilla.novell.com/show_bug.cgi?id=697699

The problem no longer occurs in Gentoo's 2.6.38-gentoo-r1 with in-kernel memtest enabled. But I don't know whether it's the kernel upgrade or the enabling of memtest that fixed it. Does in-kernel memtest write a log message anywhere when it finds bad RAM?

I also don't think this has been fixed. 
Or maybe it's now a different problem, but hibernate/thaw still doesn't work on my ThinkPad T510 with Intel graphics. I can suspend/resume as many times as I like. However, if I hibernate/thaw, after a few cycles, my box will OOPS almost randomly: https://bugzilla.redhat.com/show_bug.cgi?id=709915 So, yeah, this should most likely be reopened, because it seems that folks with Intel graphics are still experiencing the same (or similar) problems.

@vi5u0: I am quite confident this is not bad RAM. I have run memtest86+ on the machine and found no errors. Moreover, the corruption happens in many different physical memory pages, but it always follows one of two corruption patterns (all zeroes or a series of 0x00aaaaaa in the first 16 bytes of a page).

(In reply to comment #38)
> @vi5u0: I am quite confident this is not bad RAM.
I think you're right: I've now disabled memtest in my 2.6.38-gentoo-r1, and everything still works fine.

I'd like to continue debugging this issue in bug #37142, which gives a good summary of the findings so far.

*** This bug has been marked as a duplicate of bug 37142 *** |
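The corruption signature reported in comment #38 (the first 16 bytes of a page turning into all zeroes or into repeated 0x00aaaaaa words) can be checked mechanically if two copies of the same memory are available. This is only a sketch under stated assumptions: 4 KiB pages, little-endian 32-bit words as on x86, and two hypothetical byte buffers `before` and `after` holding the same pages as saved and as found after the thaw. The function names are illustrative and not part of any kernel tool.

```python
import struct

PAGE_SIZE = 4096  # assumption: 4 KiB pages, as on x86
HEAD_LEN = 16     # comment #38 reports damage in the first 16 bytes of a page


def classify_head(head):
    """Return the name of the matching corruption pattern, or None."""
    # Read the head as little-endian 32-bit words (assumption: x86 byte order).
    words = struct.unpack("<%dI" % (len(head) // 4), head)
    if all(word == 0 for word in words):
        return "zeroes"
    if all(word == 0x00AAAAAA for word in words):
        return "0x00aaaaaa"
    return None


def find_suspect_pages(before, after, page_size=PAGE_SIZE, head_len=HEAD_LEN):
    """Yield (page_index, pattern) for pages whose head changed into a pattern."""
    if head_len <= 0 or head_len % 4 or head_len > page_size:
        raise ValueError("head_len must be a positive multiple of 4 within a page")
    usable = min(len(before), len(after))
    for offset in range(0, usable - page_size + 1, page_size):
        old_head = before[offset:offset + head_len]
        new_head = after[offset:offset + head_len]
        if old_head == new_head:
            continue  # head unchanged, so not damaged in the reported way
        pattern = classify_head(new_head)
        if pattern is not None:
            yield offset // page_size, pattern
```

Comparing against the saved copy matters because a page that legitimately starts with zeroes would otherwise be flagged. Passing `head_len=32` would cover the 32-byte range mentioned in the openSUSE report.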