Bug 16019

Summary: Resume from hibernate corrupts ext4
Product: Power Management
Reporter: David Lowe (lowe)
Component: Hibernation/Suspend
Assignee: power-management_other
Status: CLOSED INSUFFICIENT_DATA
Severity: high
CC: fs_ext4, lichtenwalder, mishu, philippe.planchon, rjw, rui.zhang, sandeen, sb, stuart, tytso
Priority: P1
Hardware: All
OS: Linux
Kernel Version: 2.6.33.4-95
Subsystem:
Regression: No
Bisected commit-id:
Bug Depends on:
Bug Blocks: 7216
Attachments:
- log messages for hibernate/resume cycle
- backtrace
- Config for 32bit kernel

Description David Lowe 2010-05-20 22:49:44 UTC
Overview: System appears to hibernate successfully. Upon resume at first sight all seems well until one discovers it is no longer possible to write to disk.

Steps to reproduce:

1) Hibernate system to disk. (Note suspend to ram seems to work ok).

2) Resume from disk.

3) System starts up ok, but disk seems to be mounted read only. Sometimes resume just fails and system auto reboots itself and fixes the ext4 disk errors.

4) Rebooting seems to fix the disk issues.

Expected result:
Resume from hibernate should work ok.

System info: Fedora 13 kernel  2.6.33.4-95.fc13.x86_64
Dell Latitude D630, 2G ram, 4G swap

Attached log messages from suspend/resume cycle. 

First ext4 error is

May 20 06:28:41 trillian kernel: EXT4-fs error (device dm-0): ext4_mb_generate_buddy: EXT4-fs: group 26: 1161 blocks in bitmap, 1149 in gd
May 20 06:28:41 trillian kernel: JBD: Spotted dirty metadata buffer (dev = dm-0, blocknr = 0). There's a risk of filesystem corruption in case of system crash.

then it ends with lots of

May 20 06:50:20 trillian kernel: EXT4-fs error (device dm-0): mb_free_blocks: double-free of inode 0's block 872631(bit 20663 in group 26)
May 20 06:50:20 trillian kernel: EXT4-fs error (device dm-0): mb_free_blocks: double-free of inode 0's block 872632(bit 20664 in group 26)
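
For anyone reading these lines: the first error reports 1161 blocks in the bitmap against 1149 in the group descriptor (gd) for group 26, a difference of 12. The double-free lines are also internally consistent if one assumes 32768 blocks per group (the usual value for 4 KiB blocks) and a first data block of 0; neither value is stated in this report. A minimal Python sketch of that arithmetic follows; the function names and regular expressions are illustrative, derived only from the log lines quoted above.

```python
import re

# Both patterns are assumptions derived only from the log lines quoted above.
BUDDY_RE = re.compile(r"group (\d+): (\d+) blocks in bitmap, (\d+) in gd")
DOUBLE_FREE_RE = re.compile(r"block (\d+)\(bit (\d+) in group (\d+)\)")

# Assumption: 4 KiB blocks, i.e. 32768 blocks per group, first data block 0.
BLOCKS_PER_GROUP = 32768


def buddy_mismatch(line):
    """Return (group, bitmap_count - gd_count), or None if the line does not match."""
    m = BUDDY_RE.search(line)
    if m is None:
        return None
    group, in_bitmap, in_gd = map(int, m.groups())
    return group, in_bitmap - in_gd


def double_free_consistent(line):
    """Return True if block == group * BLOCKS_PER_GROUP + bit, None if no match."""
    m = DOUBLE_FREE_RE.search(line)
    if m is None:
        return None
    block, bit, group = map(int, m.groups())
    return block == group * BLOCKS_PER_GROUP + bit


first = ("EXT4-fs error (device dm-0): ext4_mb_generate_buddy: EXT4-fs: "
         "group 26: 1161 blocks in bitmap, 1149 in gd")
later = ("EXT4-fs error (device dm-0): mb_free_blocks: double-free of "
         "inode 0's block 872631(bit 20663 in group 26)")

print(buddy_mismatch(first))          # (26, 12)
print(double_free_consistent(later))  # True
```

So all of the quoted errors point at the same block group, which fits the idea that one allocation bitmap ended up out of sync with its descriptor.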
Comment 1 David Lowe 2010-05-20 22:55:31 UTC
Created attachment 26478 [details]
log messages for hibernate/resume cycle
Comment 2 Rafael J. Wysocki 2010-05-20 23:00:33 UTC
What kind of HDD is the filesystem on?
Comment 3 Theodore Tso 2010-05-20 23:36:50 UTC
Is this reliably reproducible?  It looks like a block allocation is getting corrupted when you hibernate.  

What are you using as your hibernation partition?
Comment 4 Eric Sandeen 2010-05-21 00:10:59 UTC
David isn't alone in this ... others have reported it.

Sadly I haven't had time to look into it.

https://bugzilla.redhat.com/show_bug.cgi?id=556560
Comment 5 David Lowe 2010-05-21 10:36:07 UTC
(In reply to comment #3)
> Is this reliably reproducible?  It looks like a block allocation is getting
> corrupted when you hibernate.  
> 
> What are you using as your hibernation partition?

The disk is an ATA Toshiba MK6037GSX (SATA hard drive) set up as an LVM2 volume (28 GB), partitioned as 24 GB ext4 plus 4 GB swap.

As far as I know, the swap is the hibernate partition.
Am using pm-hibernate or standard gnome tools to hibernate.

The problem happens every time I hibernate. Let me know what other info would be useful.
Comment 6 Eric Sandeen 2010-05-21 17:10:58 UTC
Just to test a theory, could you try doing "hdparm -W" on your disk(s) to check the write cache status?

If it's 1 (on), then prior to hibernate, can you try hdparm -W 0 on the disk(s) and see if the problem persists?  I'm wondering if it's possible that data in the drive's write cache got lost when the system powered down, if it wasn't properly flushed during hibernate.

You'll want to do hdparm -W 1 later to turn them back on if that was the original state.

Thanks,

-Eric
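
A minimal Python sketch of the check described above, in case it helps with scripting the test. It assumes `hdparm -W <device>` prints a line of the form `write-caching =  1 (on)`; that output format, the function names, and the `/dev/sda` device in the usage note are assumptions, not taken from this report.

```python
import re
import subprocess


def parse_write_cache(hdparm_output):
    """Return True if the write cache is on, False if off, None if not reported.

    Assumes a line such as ' write-caching =  1 (on)' in the output.
    """
    m = re.search(r"write-caching\s*=\s*(\d)", hdparm_output)
    return None if m is None else m.group(1) == "1"


def write_cache_enabled(device):
    """Run `hdparm -W <device>` (needs root) and parse its output."""
    result = subprocess.run(
        ["hdparm", "-W", device], capture_output=True, text=True, check=True
    )
    return parse_write_cache(result.stdout)
```

Usage would be something like `write_cache_enabled("/dev/sda")` as root before and after running `hdparm -W 0`, so the original state can be restored afterwards.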
Comment 7 David Lowe 2010-05-21 18:00:27 UTC
(In reply to comment #6)
> Just to test a theory, could you try doing "hdparm -W" on your disk(s) to
> check
> the write cache status?
> 
> If it's 1 (on), then prior to hibernate, can you try hdparm -W 0 on the
> disk(s)
> and see if the problem persists?  I'm wondering if it's possible that data in
> the drive's write cache got lost when the system powered down, if it wasn't
> properly flushed during hibernate.
> 
> You'll want to do hdparm -W 1 later to turn them back on if that was the
> original state.
> 
> Thanks,
> 
> -Eric

I gave that a try and did not find any difference.

The original state was 1 (on) so I set it to 0 and did a hibernate. 
The system comes back up with the unwritable disk as before.
Comment 8 Eric Sandeen 2010-05-21 18:04:40 UTC
Thanks for testing ... so much for that theory.
Comment 9 Eric Sandeen 2010-05-21 18:25:04 UTC
David, was that test done after a fresh/clean reboot, or was there a previous failed hibernate before the test?  (I assume you have to reboot when you hit the error, but just checking)
Comment 10 David Lowe 2010-05-21 19:58:58 UTC
(In reply to comment #9)
> David, was that test done after a fresh/clean reboot, or was there a previous
> failed hibernate before the test?  (I assume you have to reboot when you hit
> the error, but just checking)

Yes, the test was done after a fresh reboot.
Comment 11 David Lowe 2010-05-22 00:37:52 UTC
Created attachment 26491 [details]
backtrace
Comment 12 David Lowe 2010-05-22 00:40:16 UTC
Previously I had been initiating hibernate from a graphic environment. Just tried doing it from runlevel 3 and the logs now show a kernel oops just before hibernate ends.
Comment 13 Eric Sandeen 2010-05-22 04:34:20 UTC
David, that's not an oops just a warning, and it's unrelated to this problem (fixed in fedora now, too, FWIW, kernel -102 or so).

see https://bugzilla.kernel.org/show_bug.cgi?id=15906#c26 & subsequent comments.

-Eric
Comment 14 David Lowe 2010-05-22 10:11:37 UTC
(In reply to comment #13)
> David, that's not an oops just a warning, and it's unrelated to this problem
> (fixed in fedora now, too, FWIW, kernel -102 or so).
> 
> see https://bugzilla.kernel.org/show_bug.cgi?id=15906#c26 & subsequent
> comments.
> 
> -Eric

I tried kernel 2.6.33.4-106.fc13.x86_64 and no longer get the warning.

However the ext4 errors are back with a vengeance -- auto fsck on reboot now fails and a manual fsck is needed to repair the disk.
Comment 15 Eric Sandeen 2010-05-22 17:03:03 UTC
Testing on F13 in an x86 guest with pm-hibernate from runlevel 3 doesn't trip up the error, so it's not something blindingly stupid/obvious/reproducible ...
Comment 16 Eric Sandeen 2010-05-22 17:10:04 UTC
David, do you know if this regressed for you at some point?  Has it ever worked?
Comment 17 Eric Sandeen 2010-05-24 14:59:33 UTC
Based on another internal redhat bug suggestion - does booting with the "iommu=soft" kernel parameter affect the bug at all?

thanks,
-Eric
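
To confirm that the parameter actually reached the kernel after reboot, one can look at /proc/cmdline. A small Python sketch follows; the helper name is hypothetical and it splits on whitespace only, so quoted values containing spaces are not handled.

```python
def kernel_param(cmdline, name):
    """Return the value of name=value, True for a bare flag, None if absent."""
    for token in cmdline.split():
        key, sep, value = token.partition("=")
        if key == name:
            return value if sep else True
    return None


def running_kernel_param(name, path="/proc/cmdline"):
    """Look up a parameter on the command line of the running kernel."""
    with open(path) as f:
        return kernel_param(f.read(), name)
```

For this test, `running_kernel_param("iommu")` should return `"soft"` if the boot entry was edited correctly.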
Comment 18 David Lowe 2010-05-24 17:00:01 UTC
(In reply to comment #17)
> Based on another internal redhat bug suggestion - does booting with the
> "iommu=soft" kernel parameter affect the bug at all?
> 
> thanks,
> -Eric

Back in Fedora 11 everything was working fine. With Fedora 12 so many things regressed at the same time that testing was impractical (I think mostly intel stack related...). I think suspend to ram worked fine, and hibernation would just freeze. However with Fedora 13 I am really just left with this one problem, hibernation. 

iommu=soft gave me one clean hibernation from inside gnome, but the second attempt gave me the same ext4-fs errors
Comment 19 David Lowe 2010-05-28 23:03:31 UTC
I switched over to Ubuntu 10.04 and no longer have problems with hibernate or suspend. I also tested this against the vanilla 2.6.33.4 kernel under ubuntu, without issue.

The main differences seem to be:
(1) I am no longer using an lvm2 volume with an ext4 partition, just a simple extended partition with ext4
(2) fedora patches to the kernel
(3) fedora/ubuntu differences in pm-utils, if any
Comment 20 Eric Sandeen 2010-06-01 16:02:08 UTC
I suppose it's a lot to ask, but I wonder if it'd be possible to test Fedora w/o LVM to further narrow this down?

Thanks,
-Eric
Comment 21 Eric Sandeen 2010-06-16 03:10:10 UTC
Do you happen to be using i915 video?

  Bug 13811 -  [GM965/KMS/UXA] memory corruption on resume from hibernate

-Eric
Comment 22 Klaus Lichtenwalder 2010-06-17 15:49:04 UTC
Hi,

I came here from "my" Fedora bugzilla (https://bugzilla.redhat.com/show_bug.cgi?id=603897) and found this bug. I'm having it on an Aspire One with SSD. I guess the lvm theory is very reasonable. If I can spare the time I'll try the alternate install this weekend.
Btw, I also got the following messages:
ERROR: sil: invalid metadata checksum in area 3 on /dev/dm-1
ERROR: sil: invalid metadata checksum in area 4 on /dev/dm-1

with /dev/dm-1 being the swap partition.
Comment 23 Eric Sandeen 2010-06-17 15:51:51 UTC
Klaus, see also comment #21, what video driver do you have?
Comment 24 Klaus Lichtenwalder 2010-06-17 16:27:22 UTC
Eric,

I have a 
[root@linpus ~]# lspci|grep VGA
00:02.0 VGA compatible controller: Intel Corporation Mobile 945GME Express Integrated Graphics Controller (rev 03)
[root@linpus ~]# 

Xorg sees it as
[    35.777] (II) intel(0): Integrated Graphics Chipset: Intel(R) 945GME
[    35.777] (--) intel(0): Chipset: "945GME"
Comment 25 Klaus Lichtenwalder 2010-06-17 16:35:04 UTC
And sorry, another bit I forgot. I've been using ubuntu 10.04 also for some time with lots of suspending and hibernating, and never had  filesystem troubles. And as stated in Comment #19 ubuntu doesn't use lvm in a normal install.

Klaus
Comment 26 Eric Sandeen 2010-06-17 16:47:50 UTC
That may be affected.  You might try unloading/blacklisting i915 and test hibernate again, either from plain VGA X, or a text console, see if it makes a difference.  (I don't know if ubuntu has a different i915 driver... feel free to test the lvm theory too!)
Comment 27 Klaus Lichtenwalder 2010-06-17 19:16:17 UTC
Eric,
just did some tests with a clean initrd without i915. No fs corruption. While as soon as the i915 module is in use I get reliable fs corruption. So. Guess the lvm theory is dead... It's the i915 module

Klaus
Comment 28 Eric Sandeen 2010-06-17 19:24:40 UTC
ok ... I'm probably going to end up duping this one over to the i915 bug then.

Thanks for testing!

-Eric
Comment 29 David Lowe 2010-06-17 21:37:29 UTC
I'm using the i915 module on ubuntu 10.04 without issue.
Comment 30 Eric Sandeen 2010-06-17 21:45:37 UTC
(In reply to comment #29)
> I'm using the i915 module on ubuntu 10.04 without issue.

Which kernel is that?
Comment 31 David Lowe 2010-06-17 23:37:11 UTC
The ubuntu kernel 2.6.32.22 as well as the vanilla kernel 2.6.33.4.

I suspect assigning this to an i915 bug is premature, since when I tested this under runlevel 3 in fedora 13, I didn't see any problem until I stressed the system a bit.
Comment 32 Klaus Lichtenwalder 2010-06-18 06:32:19 UTC
Well, just going in runlevel 3 is not enough, as drm loads i915 even then, from initramfs. You need to create an initramfs without the i915 module.
Comment 33 Mihai Harpau 2010-06-18 09:40:36 UTC
If I do the following on Fedora 13:
- boot on runlevel 3 with "nomodeset 3" on grub line
- from root prompt run "modprobe -r i915"
- run pm-hibernate

then I no longer see the messages from comment 1 and all seems to be fine.
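
Since i915 can be loaded from the initramfs even in runlevel 3, it is worth confirming the module is really gone before running pm-hibernate. A minimal Python sketch that reads /proc/modules, where the module name is the first field on each line; the function names are illustrative.

```python
def loaded_modules(proc_modules_text):
    """Return the set of module names from /proc/modules-style text."""
    return {
        line.split()[0]
        for line in proc_modules_text.splitlines()
        if line.strip()
    }


def module_loaded(name, path="/proc/modules"):
    """Return True if the named module is currently loaded."""
    with open(path) as f:
        return name in loaded_modules(f.read())
```

If `module_loaded("i915")` still returns True after `modprobe -r i915`, the hibernate test is not exercising the no-i915 case.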
Comment 34 Eric Sandeen 2010-06-18 13:59:35 UTC
Ok.  So, I think i915 is certainly one root cause of this problem.  There could be others.

David, are you certain that you can hit it without i915 loaded?  If so then there is more to track down.

Thanks,

-Eric
Comment 35 Rafael J. Wysocki 2010-06-18 14:39:17 UTC
Also, I'd like to know if the kernels the issue is reproducible with are
64-bit or 32-bit, and if 32-bit, then what memory .config options are set
(ie. is highmem set and if so, what kind of highmem etc.).

I'm unable to reproduce the issue with my i915 hardware and a 64-bit kernel,
FWIW.
Comment 36 Klaus Lichtenwalder 2010-06-18 14:51:28 UTC
Created attachment 26846 [details]
Config for 32bit kernel
Comment 37 Klaus Lichtenwalder 2010-06-18 14:52:04 UTC
In my case it's a Fedora 13 32bit kernel. Just for completeness I attached the config (Comment 36).

Klaus
Comment 38 Klaus Lichtenwalder 2010-07-01 05:04:30 UTC
I don't know whether this is related, but once in a while I have a lockup of the X server. I can move the mouse, and switch to the tty consoles, and log in, but for a working X I have to reboot. Yesterday I saw in dmesg:
[drm:i915_hangcheck_elapsed] *ERROR* Hangcheck timer elapsed... GPU hung
render error detected, EIR: 0x00000000
[drm:i915_do_wait_request] *ERROR* i915_do_wait_request returns -5 (awaiting 302900 at 302900)

Klaus
Comment 39 Klaus Lichtenwalder 2010-07-16 21:13:56 UTC
Well, it looks like the newest Fedora 13 kernel hibernates just fine without fs corruption
Comment 40 Steven Bakker 2010-07-27 20:37:22 UTC
I'm running 2.6.33.6-147.fc13.x86_64 with an i915, and unfortunately, it still corrupts the ext4 FS on hibernate/wakeup.

(Off-topic: I'd be happy for my laptop to just auto-suspend when it's running on battery, but that doesn't seem to be configurable: it's hibernate or nothing.)

Steven
Comment 41 Rafael J. Wysocki 2011-01-16 22:34:39 UTC
Is the problem still present in 2.6.37?
Comment 42 Klaus Lichtenwalder 2011-01-17 18:13:42 UTC
Just got me 2.6.37 from rawhide and will test
Comment 43 Klaus Lichtenwalder 2011-01-30 11:12:09 UTC
I'm running 2.6.37-2.fc15.i686 from updates-testing for 12 days now, with at least one suspend-cycle a day and it's working fine (up to now). So, for me it seems the problem is gone
Comment 44 Klaus Lichtenwalder 2011-02-15 15:31:10 UTC
Hi, 
running 2.6.37-2.fc15.i686 works fine, but using rawhide kernel-2.6.38-0.rc4.git0.2.fc15.i686 or kernel-2.6.38-0.rc3.git0.1.fc15.i686
shows errors very fast! Even faster than in 2.6.35... Most of the time in /sys...
Comment 45 Klaus Lichtenwalder 2011-03-21 15:04:18 UTC
Still the only kernel working is the one from comment #44. I tried the kernel from fc15 alpha, kernel-2.6.38-1.fc15, but no cigar. At most two resumes, then I get a kernel oops, with dentry_* on top. I can try to manually get a trace, if this helps.
Comment 46 Philippe Planchon 2011-05-25 21:10:51 UTC
I'm running 2.6.35.13-91.fc14.x86_64 and have the same ext4-fs issue on resume after hibernate.
Comment 47 Klaus Lichtenwalder 2011-05-26 16:32:54 UTC
right now on 2.6.38.6-26.rc1.fc15.i686 (from Fedora 15) the oops happens very regularly...

Klaus
Comment 48 Zhang Rui 2012-01-18 02:10:40 UTC
It's great that kernel bugzilla is back.

can you please verify if the problem still exists in the latest upstream
kernel?
Comment 49 Zhang Rui 2012-05-24 07:34:11 UTC
bug closed as there is no response from the bug reporter.
please feel free to reopen it if the problem still exists in the latest upstream kernel.