Bug 16019
Summary: | Resume from hibernate corrupts ext4 | ||
---|---|---|---|
Product: | Power Management | Reporter: | David Lowe (lowe) |
Component: | Hibernation/Suspend | Assignee: | power-management_other |
Status: | CLOSED INSUFFICIENT_DATA | ||
Severity: | high | CC: | fs_ext4, lichtenwalder, mishu, philippe.planchon, rjw, rui.zhang, sandeen, sb, stuart, tytso |
Priority: | P1 | ||
Hardware: | All | ||
OS: | Linux | ||
Kernel Version: | 2.6.33.4-95 | Subsystem: | |
Regression: | No | Bisected commit-id: | |
Bug Depends on: | |||
Bug Blocks: | 7216 | ||
Attachments: | log messages for hibernate/resume cycle; backtrace; Config for 32bit kernel | ||
Description
David Lowe
2010-05-20 22:49:44 UTC
Created attachment 26478 [details]
log messages for hibernate/resume cycle
What kind of HDD is the filesystem on?

Is this reliably reproducible? It looks like a block allocation is getting corrupted when you hibernate.

What are you using as your hibernation partition?

David isn't alone in this ... others have reported it. Sadly I haven't had time to look into it.
https://bugzilla.redhat.com/show_bug.cgi?id=556560

(In reply to comment #3)
> Is this reliably reproducible? It looks like a block allocation is getting
> corrupted when you hibernate.
>
> What are you using as your hibernation partition?

The disk is an ATA Toshiba MK6037GSX (SATA hard drive) set up as an lvm2 volume (28 GB) partitioned as 24 GB ext4 plus 4 GB swap.
As far as I know, the swap is the hibernate partition.
I am using pm-hibernate or the standard gnome tools to hibernate.
The problem happens every time I hibernate.
Let me know what other info would be useful.

Just to test a theory, could you try doing "hdparm -W" on your disk(s) to check the write cache status?

If it's 1 (on), then prior to hibernate, can you try hdparm -W 0 on the disk(s) and see if the problem persists? I'm wondering if it's possible that data in the drive's write cache got lost when the system powered down, if it wasn't properly flushed during hibernate.

You'll want to do hdparm -W 1 later to turn them back on if that was the original state.

Thanks,
-Eric

(In reply to comment #6)
> Just to test a theory, could you try doing "hdparm -W" on your disk(s) to check
> the write cache status?
>
> If it's 1 (on), then prior to hibernate, can you try hdparm -W 0 on the disk(s)
> and see if the problem persists? I'm wondering if it's possible that data in
> the drive's write cache got lost when the system powered down, if it wasn't
> properly flushed during hibernate.
>
> You'll want to do hdparm -W 1 later to turn them back on if that was the
> original state.
>
> Thanks,
>
> -Eric

I gave that a try and did not find any difference. The original state was 1 (on) so I set it to 0 and did a hibernate.
The system comes back up with the unwritable disk as before.

Thanks for testing ... so much for that theory.

David, was that test done after a fresh/clean reboot, or was there a previous failed hibernate before the test? (I assume you have to reboot when you hit the error, but just checking)

(In reply to comment #9)
> David, was that test done after a fresh/clean reboot, or was there a previous
> failed hibernate before the test? (I assume you have to reboot when you hit
> the error, but just checking)

Yes, the test was done after a fresh reboot.

Created attachment 26491 [details]
backtrace
Previously I had been initiating hibernate from a graphic environment. Just tried doing it from runlevel 3 and the logs now show a kernel oops just before hibernate ends.

David, that's not an oops just a warning, and it's unrelated to this problem (fixed in fedora now, too, FWIW, kernel -102 or so).

see https://bugzilla.kernel.org/show_bug.cgi?id=15906#c26 & subsequent comments.

-Eric

(In reply to comment #13)
> David, that's not an oops just a warning, and it's unrelated to this problem
> (fixed in fedora now, too, FWIW, kernel -102 or so).
>
> see https://bugzilla.kernel.org/show_bug.cgi?id=15906#c26 & subsequent
> comments.
>
> -Eric

I tried kernel 2.6.33.4-106.fc13.x86_64 and no longer get the warning. However the ext4 errors are back with a vengeance -- auto fsck on reboot now fails and a manual fsck is needed to repair the disk.

Testing on F13 in an x86 guest with pm-hibernate from runlevel 3 doesn't trip up the error, so it's not something blindingly stupid/obvious/reproducible ...

David, do you know if this regressed for you at some point? Has it ever worked?

Based on another internal redhat bug suggestion - does booting with the "iommu=soft" kernel parameter affect the bug at all?

thanks,
-Eric

(In reply to comment #17)
> Based on another internal redhat bug suggestion - does booting with the
> "iommu=soft" kernel parameter affect the bug at all?
>
> thanks,
> -Eric

Back in Fedora 11 everything was working fine. With Fedora 12 so many things regressed at the same time that testing was impractical (I think mostly intel stack related...). I think suspend to ram worked fine, and hibernation would just freeze. However with Fedora 13 I am really just left with this one problem, hibernation.

iommu=soft gave me one clean hibernation from inside gnome, but the second attempt gave me the same ext4-fs errors.

I switched over to Ubuntu 10.04 and no longer have problems with hibernate or suspend.
I also tested this against the vanilla 2.6.33.4 kernel under ubuntu, without issue. The main differences seem to be:
(1) I am no longer using an lvm2 volume with an ext4 partition, just a simple extended partition with ext4
(2) fedora patches to the kernel
(3) fedora/ubuntu differences in pm-utils, if any

I suppose it's a lot to ask, but I wonder if it'd be possible to test Fedora w/o LVM to further narrow this down?

Thanks,
-Eric

Do you happen to be using i915 video?

Bug 13811 - [GM965/KMS/UXA] memory corruption on resume from hibernate

-Eric

Hi, I came here from "my" fedora bugzilla (https://bugzilla.redhat.com/show_bug.cgi?id=603897) and found this bug. I'm having it on an Aspire One with SSD. Guess the lvm theory is very reasonable. If I can spare the time I'll try the alternate install this weekend.

Btw, I also got the following messages:
ERROR: sil: invalid metadata checksum in area 3 on /dev/dm-1
ERROR: sil: invalid metadata checksum in area 4 on /dev/dm-1
with /dev/dm-1 being the swap partition.

Klaus, see also comment #21, what video driver do you have?

Eric, I have a
[root@linpus ~]# lspci|grep VGA
00:02.0 VGA compatible controller: Intel Corporation Mobile 945GME Express Integrated Graphics Controller (rev 03)
[root@linpus ~]#
Xorg sees it as
[ 35.777] (II) intel(0): Integrated Graphics Chipset: Intel(R) 945GME
[ 35.777] (--) intel(0): Chipset: "945GME"

And sorry, another bit I forgot. I've been using ubuntu 10.04 also for some time with lots of suspending and hibernating, and never had filesystem troubles. And as stated in Comment #19 ubuntu doesn't use lvm in a normal install.
Klaus

That may be affected. You might try unloading/blacklisting i915 and test hibernate again, either from plain VGA X, or a text console, see if it makes a difference. (I don't know if ubuntu has a different i915 driver... feel free to test the lvm theory too!)

Eric, just did some tests with a clean initrd without i915. No fs corruption.
While as soon as the i915 module is in use I get reliable fs corruption.
So. Guess the lvm theory is dead... It's the i915 module
Klaus

ok ... I'm probably going to end up duping this one over to the i915 bug then.

Thanks for testing!

-Eric

I'm using the i915 module on ubuntu 10.04 without issue.

(In reply to comment #29)
> I'm using the i915 module on ubuntu 10.04 without issue.

Which kernel is that?

the ubuntu kernel 2.6.32.22 as well as the vanilla kernel 2.6.33.4

I suspect assigning this to an i915 bug is premature, since when I tested this under runlevel 3 in fedora 13, I didn't see any problem until I stressed the system a bit.

Well, just going in runlevel 3 is not enough, as drm loads i915 even then, from initramfs. You need to create an initramfs without the i915 module.

If I do the following on Fedora 13:
- boot on runlevel 3 with "nomodeset 3" on grub line
- from root prompt run "modprobe -r i915"
- run pm-hibernate
then I no longer see the messages from comment 1 and all seems to be fine.

Ok. So, I think i915 is certainly one root cause of this problem. There could be others. David, are you certain that you can hit it without i915 loaded? If so then there is more to track down.

Thanks,
-Eric

Also, I'd like to know if the kernels the issue is reproducible with are 64-bit or 32-bit, and if 32-bit, then what memory .config options are set (ie. is highmem set and if so, what kind of highmem etc.).

I'm unable to reproduce the issue with my i915 hardware and a 64-bit kernel, FWIW.

Created attachment 26846 [details]
Config for 32bit kernel
In my case it's a Fedora 13 32bit kernel. Just for completeness I attached the config (Comment 36).
Klaus

I don't know whether this is related, but once in a while I have a lockup of the X server. I can move the mouse, and switch to the tty consoles, and log in, but for a working X I have to reboot. Yesterday I saw in dmesg:
[drm:i915_hangcheck_elapsed] *ERROR* Hangcheck timer elapsed... GPU hung
render error detected, EIR: 0x00000000
[drm:i915_do_wait_request] *ERROR* i915_do_wait_request returns -5 (awaiting 302900 at 302900)
Klaus

Well, it looks like the newest Fedora 13 kernel hibernates just fine without fs corruption

I'm running 2.6.33.6-147.fc13.x86_64 with an i915, and unfortunately, it still corrupts the ext4 FS on hibernate/wakeup.

(Off-topic: I'd be happy for my laptop to just auto-suspend when it's running on battery, but that doesn't seem to be configurable: it's hibernate or nothing.)

Steven

Is the problem still present in 2.6.37?

Just got me 2.6.37 from rawhide and will test

I'm running 2.6.37-2.fc15.i686 from updates-testing for 12 days now, with at least one suspend-cycle a day and it's working fine (up to now). So, for me it seems the problem is gone

Hi, running 2.6.37-2.fc15.i686 works fine, but using rawhide kernel-2.6.38-0.rc4.git0.2.fc15.i686 or kernel-2.6.38-0.rc3.git0.1.fc15.i686 shows errors very fast! Even faster than in 2.6.35... Most of the time in /sys...

Still the only kernel working is the one from comment #44. I tried the kernel from fc15 alpha, kernel-2.6.38-1.fc15, but no cigar. At most two resumes, then I get a kernel oops, with dentry_* on top. I can try to manually get a trace, if this helps.

I'm running 2.6.35.13-91.fc14.x86_64 and have the same ext4-fs issue on resume after hibernate.

right now on 2.6.38.6-26.rc1.fc15.i686 (from Fedora 15) the oops happens very regularly...
Klaus

It's great that kernel bugzilla is back.
Can you please verify if the problem still exists in the latest upstream kernel?
Bug closed as there is no response from the bug reporter.
Please feel free to reopen it if the problem still exists in the latest upstream kernel.
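
For anyone reopening this report, the details requested earlier in the thread (whether the i915 module is loaded, whether the kernel is 32-bit or 64-bit, and which highmem options are set) could be gathered with something like the Python sketch below. It is only an illustration: the helper names are made up here, and the /boot/config-<release> path is an assumed distro convention that may not hold on every system.

```python
import platform


def module_loaded(proc_modules_text, name):
    """Return True if `name` is the first field of a line in /proc/modules text."""
    for line in proc_modules_text.splitlines():
        fields = line.split()
        if fields and fields[0] == name:
            return True
    return False


def highmem_options(config_text):
    """Return enabled kernel config options whose name contains HIGHMEM."""
    enabled = {}
    for line in config_text.splitlines():
        line = line.strip()
        # Disabled options appear as comments ("# CONFIG_FOO is not set").
        if line.startswith("#") or "=" not in line:
            continue
        key, value = line.split("=", 1)
        if "HIGHMEM" in key:
            enabled[key] = value
    return enabled


def report():
    """Print the facts asked for in the thread, where they can be read."""
    release = platform.release()
    print("kernel release:", release)
    print("machine:", platform.machine())  # e.g. i686 vs x86_64

    try:
        with open("/proc/modules") as f:
            print("i915 loaded:", module_loaded(f.read(), "i915"))
    except OSError:
        print("i915 loaded: unknown (/proc/modules not readable)")

    # Assumption: the distro installs the kernel config next to the kernel.
    config_path = "/boot/config-" + release
    try:
        with open(config_path) as f:
            print("highmem options:", highmem_options(f.read()) or "none set")
    except OSError:
        print("highmem options: unknown (cannot read " + config_path + ")")


if __name__ == "__main__":
    report()
```

Running it as a normal user should be enough, since it only reads files. Note that a module listed only as a dependency of another (for example "i915" in the used-by column of drm) is deliberately not counted as loaded; only the first field of each line is checked.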