Bug 12151 - Unexplained fsck errors on a ext4 filesystem
Summary: Unexplained fsck errors on a ext4 filesystem
Status: RESOLVED CODE_FIX
Alias: None
Product: File System
Classification: Unclassified
Component: ext4
Hardware: All Linux
Importance: P1 normal
Assignee: fs_ext4@kernel-bugs.osdl.org
Reported: 2008-12-03 13:38 UTC by Nathan Grennan
Modified: 2009-05-02 22:14 UTC
Kernel Version: 2.6.27.5
Regression: No

Description Nathan Grennan 2008-12-03 13:38:30 UTC
Distribution: Fedora 10
Hardware Environment: 

Processor:
Q6600 2.4 GHz

Memory:
4 GB

dmidecode:
Manufacturer: ASUSTeK Computer INC.
Product Name: P5B-Deluxe

lspci:
00:00.0 Host bridge: Intel Corporation 82P965/G965 Memory Controller Hub (rev 02)
00:01.0 PCI bridge: Intel Corporation 82P965/G965 PCI Express Root Port (rev 02)
00:1a.0 USB Controller: Intel Corporation 82801H (ICH8 Family) USB UHCI Controller #4 (rev 02)
00:1a.1 USB Controller: Intel Corporation 82801H (ICH8 Family) USB UHCI Controller #5 (rev 02)
00:1a.7 USB Controller: Intel Corporation 82801H (ICH8 Family) USB2 EHCI Controller #2 (rev 02)
00:1c.0 PCI bridge: Intel Corporation 82801H (ICH8 Family) PCI Express Port 1 (rev 02)
00:1c.2 PCI bridge: Intel Corporation 82801H (ICH8 Family) PCI Express Port 3 (rev 02)
00:1c.4 PCI bridge: Intel Corporation 82801H (ICH8 Family) PCI Express Port 5 (rev 02)
00:1c.5 PCI bridge: Intel Corporation 82801H (ICH8 Family) PCI Express Port 6 (rev 02)
00:1d.0 USB Controller: Intel Corporation 82801H (ICH8 Family) USB UHCI Controller #1 (rev 02)
00:1d.1 USB Controller: Intel Corporation 82801H (ICH8 Family) USB UHCI Controller #2 (rev 02)
00:1d.2 USB Controller: Intel Corporation 82801H (ICH8 Family) USB UHCI Controller #3 (rev 02)
00:1d.7 USB Controller: Intel Corporation 82801H (ICH8 Family) USB2 EHCI Controller #1 (rev 02)
00:1e.0 PCI bridge: Intel Corporation 82801 PCI Bridge (rev f2)
00:1f.0 ISA bridge: Intel Corporation 82801HB/HR (ICH8/R) LPC Interface Controller (rev 02)
00:1f.2 SATA controller: Intel Corporation 82801HR/HO/HH (ICH8R/DO/DH) 6 port SATA AHCI Controller (rev 02)
00:1f.3 SMBus: Intel Corporation 82801H (ICH8 Family) SMBus Controller (rev 02)
02:00.0 Ethernet controller: Marvell Technology Group Ltd. 88E8056 PCI-E Gigabit Ethernet Controller (rev 12)
03:00.0 SATA controller: JMicron Technologies, Inc. JMicron 20360/20363 AHCI Controller (rev 02)
03:00.1 IDE interface: JMicron Technologies, Inc. JMicron 20360/20363 AHCI Controller (rev 02)
04:00.0 Ethernet controller: Intel Corporation 82572EI Gigabit Ethernet Controller (Copper) (rev 06)
06:01.0 Multimedia audio controller: Creative Labs SB Audigy (rev 04)
06:01.1 Input device controller: Creative Labs SB Audigy Game Port (rev 04)
06:01.2 FireWire (IEEE 1394): Creative Labs SB Audigy FireWire Port (rev 04)
06:02.0 Ethernet controller: Lite-On Communications Inc LNE100TX (rev 20)
06:03.0 FireWire (IEEE 1394): Texas Instruments TSB43AB22/A IEEE-1394a-2000 Controller (PHY/Link)
06:04.0 Ethernet controller: Marvell Technology Group Ltd. 88E8001 Gigabit Ethernet Controller (rev 14)


Software Environment:
e2fsprogs-1.41.3-2.fc10.x86_64
rsync-3.0.4-0.fc10.x86_64

Problem Description:
I rebooted and received the errors below from fsck. The nature of the errors suggests to me a race condition or an off-by-one bug. In all but one case the problem is the count being off by one. In the one exception the count is off by two.
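
The off-by-one pattern can be checked mechanically. The following is a rough sketch in Python, not part of the original report, that tallies the Pass 4 reference-count discrepancies from the fsck output quoted further down. The regular expression and the function name `tally_deltas` are illustrative choices, and only Pass 4 is covered.

```python
import re
from collections import Counter

# Pass 4 lines copied verbatim from the fsck output in this report.
PASS4 = """\
Inode 181 ref count is 12, should be 11.
Inode 82141 ref is 1, should be 2.
Inode 95745 ref count is 1, should be 2.
Inode 96001 ref count is 1, should be 2.
Inode 150529 ref count is 1, should be 2.
Inode 240641 ref count is 1, should be 2.
Inode 314016 ref count is 347, should be 346.
Inode 314881 ref count is 0, should be 2.
Inode 314882 ref count is 3, should be 2.
Inode 380416 ref count is 4, should be 3.
Inode 380417 ref count is 1, should be 2.
Inode 730596 ref count is 14, should be 13.
Inode 730625 ref count is 1, should be 2.
Inode 730723 ref count is 284, should be 283.
Inode 784897 ref count is 1, should be 2.
Inode 821761 ref count is 1, should be 2.
Inode 1191927 ref count is 7, should be 6.
Inode 1191937 ref count is 1, should be 2.
"""

# "count" is optional because one line in the report omits it.
LINE_RE = re.compile(r"Inode (\d+) ref(?: count)? is (\d+), should be (\d+)\.")


def tally_deltas(text):
    """Count how many inodes are off by each absolute amount."""
    deltas = Counter()
    for _ino, found, expected in LINE_RE.findall(text):
        deltas[abs(int(found) - int(expected))] += 1
    return deltas


deltas = tally_deltas(PASS4)
print(dict(deltas))  # {1: 17, 2: 1}
```

For Pass 4 this gives 17 inodes off by one and a single inode (314881) off by two.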

I didn't receive complaints from fsck on previous reboots after the creation of the filesystem. The shutdown before the startup that resulted in these errors seemed to have gone normally.

I do remember rsync complaining about at least mpc1211 during the rsync run that copied the data across the network from another system. I don't remember what the complaint was.

Most of the files on the filesystem are video files in the 100 MB+ range. There are also other large files such as ISOs, virtualization images, etc. All the files complained about are really small files.

The underlying layer is Linux software RAID 5 running on six 1 TB hard drives. Other arrays using the same drives are RAID 1 and RAID 10.

mkfs command used to make the filesystem:
mkfs.ext4 -j -b 4096 -i 524288 -m 0 -E stride=256 -O extents /dev/md3

Other messages that may relate:

EXT4-fs: barriers enabled
EXT4-fs: barriers enabled
EXT4-fs: barriers enabled
JBD: barrier-based sync failed on md1:8 - disabling barriers
JBD: barrier-based sync failed on md2:8 - disabling barriers
JBD: barrier-based sync failed on md3:8 - disabling barriers

df -h output:
/dev/md1               32G  5.3G   25G  18% /
/dev/md0              198M   14M  174M   8% /boot
/dev/md2              288G   60G  229G  21% /home
/dev/md3              4.1T  2.3T  1.8T  56% /home/data



An automatic fsck started on boot and reported the following errors.

Group descriptor 374 has invalid unused inodes count 1
Group descriptor 375 has invalid unused inodes count 1
Group descriptor 588 has invalid unused inodes count 1
Group descriptor 940 has invalid unused inodes count 1
Group descriptor 1230 has invalid unused inodes count 1
Group descriptor 1486 has invalid unused inodes count 1
Group descriptor 1834 has invalid unused inodes count 1
Group descriptor 2444 has invalid unused inodes count 1
Group descriptor 2854 has invalid unused inodes count 1
Group descriptor 3066 has invalid unused inodes count 1
Group descriptor 3210 has invalid unused inodes count 1
Group descriptor 3933 has invalid unused inodes count 1
Group descriptor 4656 has invalid unused inodes count 1
Extended attribute block 12255232 has reference count 3 should be 1

Pass 1
Extended attribute block 12255232 has reference count 3 should be 1

Pass 2
Entry '..' in ??? (314882) has incorrect filetype (was 2, should be 1).
Entry 'txdps.tex' in /backup/home/backup/11-26-2006/home/backup/12-3-2005/usr/share/texmf/tex/generic/texdraw (453383) has incorrect filetype (was 1, should be 2).
Entry 'mpc1211' in /backup/home/backup/11-26-2006/home/backup/12-3-2005/usr/src/kernels/2.6.14-1.1637_FC4-x86_64/arch/sh/boards (469489) is a link to directory /backup/home/backup/11-26-2006/home/backup/12-3-2005/usr/share/texmf/tex/generic/texdraw/txdps.tex (469505).
Entry 'gencfg.c' in /backup/home/builder/mozilla/nsprpub/pr/include (995609) has an incorrect filetype (was 1, should be 2).
Entry 'CVS' in /backup/home/builder/mozilla/toolkit/themes/pinstripe/mozapps/extensions (1006847) is a link to directory /backup/home/builder/mozilla/nsprpub/pr/include/gencfg.c (1006849).
Entry 'lost+found' in /video/movies (181) has incorrect filetype (was 2, should be 1).
Entry 'text_italic.png' in /backup/home/backup/11-26-2006/usr/share/icons/crystalsvg/16x16/actions (613940) has incorrect filetype (was 1, should be 2).
Entry 'ko' in /backup/home/backup/11-26-2006/usr/share/locale (621673) is a link to directory /backup/home/backup/11-26-2006/usr/share/icons/crystalsvg/16x16/actions/text_italic.png (625665).

Pass 3
Unconnected directory inode 314882 (???)
'..' in /backup/home/backup/11-26-2006/home/backup/12-3-2005/usr/share/texmf/tex/generic/texdraw/txdps.tex (469505) is /backup/home/backup/11-26-2006/home/backup/12-3-2005/usr/src/kernels/2.6.14-1.1637_FC4-x86_64/arch/sh/boards (469489), should be /backup/home/backup/11-26-2006/home/backup/12-3-2005/usr/share/texmf/tex/generic/texdraw (453383).
'..' in /backup/home/backup/11-26-2006/usr/share/icons/crystalsvg/16x16/actions/text_italic.png (625665) is /backup/home/backup/11-26-2006/usr/share/locale (621673), should be /backup/home/backup/11-26-2006/usr/share/icons/crystalsvg/16x16/actions (613940).
'..' in /backup/home/builder/mozilla/nsprpub/pr/include/gencfg.c (1006849) is /backup/home/builder/mozilla/toolkit/themes/pinstripe/mozapps/extensions (1006847), should be /backup/home/builder/mozilla/nsprpub/pr/include (995609).

Pass 4
Inode 181 ref count is 12, should be 11.
Inode 82141 ref is 1, should be 2.
Inode 95745 ref count is 1, should be 2.
Inode 96001 ref count is 1, should be 2.
Inode 150529 ref count is 1, should be 2.
Inode 240641 ref count is 1, should be 2.
Inode 314016 ref count is 347, should be 346.
Inode 314881 ref count is 0, should be 2.
Inode 314882 ref count is 3, should be 2.
Inode 380416 ref count is 4, should be 3.
Inode 380417 ref count is 1, should be 2.
Inode 730596 ref count is 14, should be 13.
Inode 730625 ref count is 1, should be 2.
Inode 730723 ref count is 284, should be 283.
Inode 784897 ref count is 1, should be 2.
Inode 821761 ref count is 1, should be 2.
Inode 1191927 ref count is 7, should be 6.
Inode 1191937 ref count is 1, should be 2.

Pass 5
Block bitmap differences: -40304640 -48693248 -59540393 -79796520 -93519872 -105185280 -(128714059--128714061) -152568096
Free blocks count wrong for group #1230 (1845, counted=1846).
Free blocks count wrong for group #1486 (1851, counted=1852).
Free blocks count wrong for group #1817 (910, counted=911).
Free blocks count wrong for group #2454 (1149, counted=1150)
Free blocks count wrong for group #3210 (1845, counted=1846).
Free blocks count wrong for group #3928 (31706, counted=31709).
Free blocks count wrong for group #4656 (1540, counted=1541).
Free blocks count wrong (494216298, counted=494216308).
Free inodes count wrong for group #374 (0, counted=1).
Free inodes count wrong for group #375 (0, counted=1).
Free inodes count wrong for group #588 (0, counted=1).
Free inodes count wrong for group #940 (0, counted=1).
Free inodes count wrong for group #1230 (0, counted=1).
Directories count wrong for group #1230 (194, counted=193).
Free inodes count wrong for group #1486 (0, counted=1).
Directories count wrong for group #1486 (197, counted=196).
Free inodes count wrong for group #1834 (0, counted=1).
Free inodes count wrong for group #2444 (0, counted=1).
Free inodes count wrong for group #2854 (0, counted=1).
Directories count wrong for group #2854 (67, counted=66).
Free inodes count wrong for group #3066 (0, counted=1).
Free inodes count wrong for group #3210 (0, counted=1).
Directories count wrong for group #3210 (201, counted=200).
Free inodes count wrong for group #3933 (0, counted=1).
Free inodes count wrong for group #4656 (0, counted=1).
Directories count wrong for group #4656 (104, counted=103).
Free inodes count wrong (6899829, counted=6899842).
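
One way to cross-check the output above: if the mkfs options are taken at face value, each block group holds 8 x 4096 = 32768 blocks (128 MiB), and at one inode per 524288 bytes that works out to 256 inodes per group. That figure is an assumption derived from the mkfs line, not something confirmed by dumpe2fs output. Under that assumption, the sketch below (Python, with illustrative helper names) maps every inode number mentioned in Passes 2 to 4 to its group and picks out those that are the first inode of a group.

```python
# Assumed geometry, derived from "mkfs.ext4 -b 4096 -i 524288";
# not confirmed by dumpe2fs output for this filesystem.
BLOCK_SIZE = 4096
BLOCKS_PER_GROUP = 8 * BLOCK_SIZE  # one bitmap block covers 32768 blocks
BYTES_PER_INODE = 524288
INODES_PER_GROUP = BLOCKS_PER_GROUP * BLOCK_SIZE // BYTES_PER_INODE

# Groups reported with "invalid unused inodes count" before Pass 1.
FLAGGED_GROUPS = [374, 375, 588, 940, 1230, 1486, 1834,
                  2444, 2854, 3066, 3210, 3933, 4656]

# Every inode number that appears in Pass 2, Pass 3 or Pass 4.
PASS2_3 = [314882, 453383, 469489, 469505, 995609, 1006847,
           1006849, 181, 613940, 621673, 625665]
PASS4 = [181, 82141, 95745, 96001, 150529, 240641, 314016, 314881,
         314882, 380416, 380417, 730596, 730625, 730723, 784897,
         821761, 1191927, 1191937]
REPORTED_INODES = sorted(set(PASS2_3 + PASS4))


def inode_group(ino):
    """Block group holding an inode; inode numbers start at 1."""
    return (ino - 1) // INODES_PER_GROUP


def is_first_inode_of_group(ino):
    """True if the inode is the first entry in its group's inode table."""
    return (ino - 1) % INODES_PER_GROUP == 0


first_inode_groups = sorted(
    inode_group(i) for i in REPORTED_INODES if is_first_inode_of_group(i)
)
print(INODES_PER_GROUP)                      # 256
print(first_inode_groups == FLAGGED_GROUPS)  # True
```

With these numbers, the reported inodes that are the first inode of their group fall in exactly the 13 groups flagged with "invalid unused inodes count" at the top of the output.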
Comment 1 Eric Sandeen 2008-12-10 18:33:10 UTC
Was this a one-time occurrence? Have you encountered this again?
Comment 2 Theodore Tso 2009-01-17 18:24:11 UTC
Any updates on this bug report?  Has this happened again for you?
Comment 3 Nathan Grennan 2009-01-18 13:30:58 UTC
At least a related problem seems to have happened on another reboot. This reboot was a hard reset, because the system had gone into a mostly hung state. I saw it respond somewhat a few times: ping still worked, ssh would return the comment string, and I even managed to log in once, but it hung before the shell.

On the next boot I found that I seem to have a corrupt superblock. It sounds a lot like this report:

http://kerneltrap.org/mailarchive/linux-ext4/2009/1/5/4598534
Comment 4 Eric Sandeen 2009-03-03 21:41:35 UTC
If you can try the 2.6.29 kernels out of koji and see if you still hit this it'd be great.  As I mentioned on IRC I have found another race w/ inode alloc/free but it's not likely to lead to as much damage as your above fsck found.

Have you been hitting this reliably?
Comment 5 Eric Sandeen 2009-03-03 21:43:25 UTC
Oh, and a boot-time fsck on Fedora root filesystems probably means the fs was marked with errors prior to the shutdown. You might look in your logs to see if there's anything there.
Comment 6 Florian Engelhardt 2009-04-30 09:10:41 UTC
Same problem here with 2.6.29.2.
I have tried ext4 on a Linux software RAID 0, and it worked flawlessly for about a week. I booted the computer yesterday and had exactly the same errors as described above. Running e2fsck from an Arch Linux boot stick took about one hour and left me with an ext4 filesystem that is mountable but has lost its entire directory structure.
I am running that RAID 0 on two 500 GB SATA disks. I can provide you with more information on that this evening.
Comment 7 Nathan Grennan 2009-04-30 20:56:56 UTC
I have been running Fedora's kernels the whole time, and I haven't seen this issue with 2.6.29 kernels. I don't think I saw it with 2.6.28 kernels either. I wonder if Fedora has been putting more patches for ext4 in over what goes into Linus's tree.
Comment 8 Theodore Tso 2009-05-01 13:52:35 UTC
We're desperately looking for a reliable reproduction case for this problem.   My suggestion at this point is if you have a large filesystem (the reports for this seem to come from users with > 1TB filesystems) to take a periodic e2image backup of your filesystem before the corruption, and save it on some other filesystem so there is a backup of your filesystem metadata.   This will help recover the filesystem after the corruption.

If you can reproduce this reliably, please let us know.  We haven't been able to get this problem reproduced yet.
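
The periodic e2image backup suggested above could be scripted along these lines. This is only a sketch: it builds the basic `e2image <device> <image-file>` command with a dated file name and prints it rather than running it. The destination directory `/mnt/other/e2image`, the `.e2i` suffix, and the function name are hypothetical; the destination should be on a different filesystem from the one being imaged.

```python
import datetime
import os


def e2image_command(device, dest_dir, day=None):
    """Build an e2image command that saves the metadata of `device`
    into a dated image file under `dest_dir`."""
    day = day or datetime.date.today()
    name = "{}-{}.e2i".format(os.path.basename(device), day.isoformat())
    return ["e2image", device, os.path.join(dest_dir, name)]


# Dry run: show the command instead of executing it.
cmd = e2image_command("/dev/md3", "/mnt/other/e2image",
                      datetime.date(2009, 5, 1))
print(" ".join(cmd))

# To actually take the image (needs root), something like:
#   import subprocess
#   subprocess.run(cmd, check=True)
```

Run from cron or a similar scheduler, this would leave one metadata image per day to fall back on if the filesystem is later found corrupted.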
Comment 9 Eric Sandeen 2009-05-01 15:19:38 UTC
(In reply to comment #7)
> I have been running Fedora's kernels the whole time, and I haven't seen this
> issue with 2.6.29 kernels. I don't think I saw it with 2.6.28 kernels either.
> I wonder if Fedora has been putting more patches for ext4 in over what goes
> into Linus's tree.

Nathan - well, not really.  I'll never put something in Fedora that hasn't been sent upstream; it's not how we work.

One difference may be that the 2.6.27 kernels in F10 did have the ext4 "stable" backports that Ted was doing...

If you're running .29 kernels from Fedora, it should be equivalent to what's upstream.  The only changes in F11, for example, are:

Patch2920: linux-2.6-ext4-flush-on-close.patch
Patch2921: linux-2.6-ext4-really-print-warning-once.patch
Comment 10 Nathan Grennan 2009-05-02 16:29:29 UTC
Here is my basic experience with ext4. I had two basic problems.

One was where the system would just go off into a hang. The other was this issue. This issue went away when I went to a 2.6.28+ kernel. Backporting patches didn't work for me. I say this because at the time you guys were telling me there were no new patches, yet I would have issues with 2.6.27 kernels but not with 2.6.28 kernels. Later cebbert said there was something nasty in 2.6.28.1 kernels, so I upgraded to 2.6.29. I have had zero issues with ext4 since upgrading to 2.6.29.

I just looked through my IRC logs and found the errors that I think caused this problem. Sandeen, I have mentioned these to you before. What I think happened is that I would get one of these errors and the system would carry on, because that is the crazy default. Then a few days later, not having noticed the errors, I would reboot the system and get the fsck errors above. From what I remember reading, this issue has been fixed.

Feb 16 12:03:19 proton kernel: EXT4-fs error (device md3): ext4_mb_generate_buddy: EXT4-fs: group 
EXT4-fs error (device md3): mb_free_blocks: double-free of inode 0's block 321550248(bit 30632 in group 9812)
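
As a sanity check on the double-free message, the block number and the (group, bit) pair it reports should agree with each other. The sketch below assumes 4 KiB blocks with the default 32768 blocks per group and a first data block of 0; both are assumptions based on the mkfs options in this report, and the helper names are illustrative.

```python
BLOCKS_PER_GROUP = 8 * 4096  # assumption: 4 KiB blocks, default group size


def block_from_group_bit(group, bit, first_data_block=0):
    """Absolute block number for a bit position in a group's block bitmap."""
    return first_data_block + group * BLOCKS_PER_GROUP + bit


def group_bit_from_block(block, first_data_block=0):
    """Inverse mapping: (group, bit) for an absolute block number."""
    return divmod(block - first_data_block, BLOCKS_PER_GROUP)


print(block_from_group_bit(9812, 30632))  # 321550248
print(group_bit_from_block(321550248))    # (9812, 30632)
```

The numbers in the message are consistent with that geometry.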
Comment 11 Theodore Tso 2009-05-02 20:24:56 UTC
Hi guys (and gals), please don't assume that your problem is the same as a previously reported bug --- especially if the bug report title is as vague as "unexplained fsck errors".     That could mean software bugs, or hardware bugs --- and just because you have an "unexplained fsck error", please don't assume your problem is the same as another person's.   It could be, if the kernel versions are the same, and the symptoms are exactly the same, and especially if the way to reproduce it is the same.

The original bug report dated from a 2.6.27 kernel, and there have been a huge number of bugs fixed since then.   To be honest, not all bug fixes have been backported to the 2.6.27.x series, either.   In some cases it was just way too difficult to do.   So Nathan, if you're at 2.6.29, and you're not seeing any problem, then we're probably better off closing this bug.

Florian, my guess is that whatever problem Nathan reported back in the 2.6.27 kernel is very different from what you're seeing.   May I suggest that you open a new bugzilla entry for your problems, and please give us as much detail as possible?

Many thanks.
