Bug 13230 - unexplained fsck error with ext4 on software raid 10
Status: CLOSED PATCH_ALREADY_AVAILABLE
Alias: None
Product: File System
Classification: Unclassified
Component: ext4
Hardware: All Linux
Importance: P1 high
Assignee: fs_ext4@kernel-bugs.osdl.org
Reported: 2009-05-03 12:53 UTC by Florian Engelhardt
Modified: 2009-06-10 13:53 UTC
Kernel Version: 2.6.29.2
Regression: No

Description Florian Engelhardt 2009-05-03 12:53:05 UTC
Distribution: Archlinux

Hardware Environment: 

Processor: Intel(R) Atom(TM) CPU  330 @ 1.60GHz (Dual-Core)
Memory: 2GB

lspci:
00:00.0 Host bridge: Intel Corporation 82945G/GZ/P/PL Memory Controller Hub (rev 02)
00:02.0 VGA compatible controller: Intel Corporation 82945G/GZ Integrated Graphics Controller (rev 02)
00:1c.0 PCI bridge: Intel Corporation 82801G (ICH7 Family) PCI Express Port 1 (rev 01)
00:1c.2 PCI bridge: Intel Corporation 82801G (ICH7 Family) PCI Express Port 3 (rev 01)
00:1c.3 PCI bridge: Intel Corporation 82801G (ICH7 Family) PCI Express Port 4 (rev 01)
00:1d.0 USB Controller: Intel Corporation 82801G (ICH7 Family) USB UHCI Controller #1 (rev 01)
00:1d.1 USB Controller: Intel Corporation 82801G (ICH7 Family) USB UHCI Controller #2 (rev 01)
00:1d.2 USB Controller: Intel Corporation 82801G (ICH7 Family) USB UHCI Controller #3 (rev 01)
00:1d.3 USB Controller: Intel Corporation 82801G (ICH7 Family) USB UHCI Controller #4 (rev 01)
00:1d.7 USB Controller: Intel Corporation 82801G (ICH7 Family) USB2 EHCI Controller (rev 01)
00:1e.0 PCI bridge: Intel Corporation 82801 PCI Bridge (rev e1)
00:1f.0 ISA bridge: Intel Corporation 82801GB/GR (ICH7 Family) LPC Interface Bridge (rev 01)
00:1f.1 IDE interface: Intel Corporation 82801G (ICH7 Family) IDE Controller (rev 01)
00:1f.2 IDE interface: Intel Corporation 82801GB/GR/GH (ICH7 Family) SATA IDE Controller (rev 01)
00:1f.3 SMBus: Intel Corporation 82801G (ICH7 Family) SMBus Controller (rev 01)
01:00.0 Ethernet controller: Realtek Semiconductor Co., Ltd. RTL8111/8168B PCI Express Gigabit Ethernet controller (rev 02)
04:00.0 Mass storage controller: Promise Technology, Inc. PDC40718 (SATA 300 TX4) (rev 02)


Software:
e2fsprogs-1.41.5-2

Problem description:

I have a Linux software RAID 10 array of four disks (SAMSUNG HD103SI, 1 TB SATA), created with the following command:
mdadm --create /dev/md0 --assume-clean --chunk=128 --level=raid10 --raid-devices=4 --spare-devices=0 --layout=f2 /dev/sdb1 /dev/sdc1 /dev/sdd1 /dev/sde1

On this md0 I created an ext4 filesystem using:
mkfs.ext4 -O large_file,dir_index,sparse_super -E stride=32,stripe-width=128 -b 4096 /dev/md0
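
As a rough cross-check of the -E values in the mkfs.ext4 command above, here is a minimal Python sketch of the arithmetic described in the mke2fs man page: stride is the RAID chunk size expressed in filesystem blocks, and stripe-width is typically stride times the number of data-bearing disks. The function name and the data_disks parameter are illustrative and not part of the report, and which disk count suits a RAID 10 f2 layout is a tuning judgment this thread does not settle.

```python
def ext4_raid_hints(chunk_kib, block_bytes, data_disks):
    """Return (stride, stripe_width) in filesystem blocks.

    chunk_kib   -- md chunk size in KiB (mdadm --chunk)
    block_bytes -- filesystem block size in bytes (mkfs.ext4 -b)
    data_disks  -- number of disks counted as data-bearing (an assumption
                   the caller has to make for the RAID level in use)
    """
    block_kib = block_bytes // 1024
    stride = chunk_kib // block_kib       # blocks written to one disk per chunk
    stripe_width = stride * data_disks    # blocks in one full stripe
    return stride, stripe_width


if __name__ == "__main__":
    # Values taken from the commands in this report: --chunk=128, -b 4096.
    print(ext4_raid_hints(128, 4096, 4))
    print(ext4_raid_hints(128, 4096, 2))
```

With the reported 128 KiB chunk and 4096-byte blocks, stride comes out as 32. The reported stripe-width of 128 corresponds to counting all four disks; counting two would give 64.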

After putting some load on the disks (copying files and so on), I rebooted the machine. After the reboot I tried to mount the filesystem manually, but could not; I was advised to run fsck, which reports errors:

[root@hal9000 ~]# fsck /dev/md0
fsck 1.41.5 (23-Apr-2009)
e2fsck 1.41.5 (23-Apr-2009)
fsck.ext4: Group descriptors look bad... trying backup blocks...
Group descriptor 0 checksum is invalid.  Fix<y>? yes
Group descriptor 1 checksum is invalid.  Fix<y>? yes
Group descriptor 2 checksum is invalid.  Fix<y>? yes
Group descriptor 3 checksum is invalid.  Fix<y>? yes
Group descriptor 4 checksum is invalid.  Fix<y>? yes
Group descriptor ... checksum is invalid.  Fix<y>? yes
Group descriptor 14904 checksum is invalid.  Fix? yes

/dev/md0 contains a file system with errors, check forced.
Resize inode not valid.  Recreate? yes

Pass 1: Checking inodes, blocks, and sizes
Inode 83425 is in use, but has dtime set.  Fix? yes

Inode 83425 has imagic flag set.  Clear? yes

Inode 83425 has a extra size (24906) which is invalid
Fix? yes

Inode 83426 is in use, but has dtime set.  Fix? yes

Inode 83426 has imagic flag set.  Clear? yes

Inode 83426 has a extra size (15123) which is invalid
Fix? yes

Inode 83426 has compression flag set on filesystem without compression support.  Clear? yes

Error while reading over extent tree in inode 83426: Corrupt extent header
Clear inode? yes

Inode 83426, i_blocks is 2892048078, should be 0.  Fix? yes

Inode 83427 is in use, but has dtime set.  Fix? yes

Inode 83427 has a extra size (30948) which is invalid
Fix? yes

Inode 83427 has compression flag set on filesystem without compression support.  Clear? yes

Inode 83427, i_size is 6852659100897434679, should be 0.  Fix? yes

Inode 83427, i_blocks is 24634205603455, should be 0.  Fix? yes

Inode 83428 is in use, but has dtime set.  Fix? yes

Inode 83428 has imagic flag set.  Clear? yes

Inode 83428 has a extra size (10145) which is invalid
Fix? yes

Inode 83428 has INDEX_FL flag set but is not a directory.
Clear HTree index? yes

Inode 83428, i_size is 4063880120011657287, should be 0.  Fix? yes

......



Inode 83432 has INDEX_FL flag set but is not a directory.
Clear HTree index? yes

Inode 83432, i_size is 10651702139991005323, should be 0.  Fix? yes

Inode 83432, i_blocks is 75253109187231, should be 0.  Fix? yes

Inode 83436 has compression flag set on filesystem without compression support.  Clear? yes

Inode 83436 has INDEX_FL flag set but is not a directory.
Clear HTree index? yes

Inode 83436, i_size is 1452293747930507946, should be 0.  Fix? yes

Inode 83436, i_blocks is 225351860648724, should be 0.  Fix? yes

Inode 83437 has compression flag set on filesystem without compression support.  Clear? yes

Inode 83437, i_size is 10409160169330118727, should be 0.  Fix? yes

Inode 83437, i_blocks is 52103229380007, should be 0.  Fix? yes

Inode 83440 has compression flag set on filesystem without compression support.  Clear? yes

Inode 83440 has a bad extended attribute block 258978553.  Clear? yes

Inode 83440, i_size is 16993295139261714503, should be 0.  Fix? yes

Inode 83440, i_blocks is 267701257729082, should be 0.  Fix? yes
Comment 1 Theodore Tso 2009-05-04 02:44:02 UTC
What kind of filesystem operations were you doing, and how long had the filesystem been in service?

This looks like an instance of "low block number corruption", which typically strikes the block group descriptors and inode table.  It seems to happen mostly to people with RAID.
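
To illustrate why damage to inodes in the 83425-83440 range sits near the start of the filesystem, the following Python sketch maps an inode number to its block group using the standard ext2/3/4 relation group = (inode - 1) / inodes_per_group. The value 8192 for inodes per group is an assumption (it is what a 16384-byte inode ratio would yield with 128 MiB groups); the real figure is not given in this report and would come from `dumpe2fs -h /dev/md0`. The function and constant names are illustrative.

```python
BLOCK_SIZE = 4096                      # from "-b 4096" in the report
BLOCKS_PER_GROUP = 8 * BLOCK_SIZE      # default: one bitmap block maps 8 * block_size blocks
ASSUMED_INODES_PER_GROUP = 8192        # ASSUMPTION, not stated in the report


def locate_inode(ino, inodes_per_group):
    """Return (block_group, index_within_group) for a 1-based inode number."""
    return divmod(ino - 1, inodes_per_group)


# The fsck log lists descriptors 0..14904, so there are 14905 groups.
group_count = 14904 + 1
group_bytes = BLOCKS_PER_GROUP * BLOCK_SIZE
fs_tib = group_count * group_bytes / 2**40

if __name__ == "__main__":
    group, index = locate_inode(83425, ASSUMED_INODES_PER_GROUP)
    print(f"inode 83425 -> group {group}, index {index} (of {group_count} groups)")
    print(f"approximate filesystem size: {fs_tib:.2f} TiB")
```

Under that assumption, inode 83425 falls in group 10 of 14905, close to the start of a filesystem of roughly 1.8 TiB, which is consistent with the four 1 TB disks in a two-copy RAID 10.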

We are very much interested in a way to easily reproduce this problem, as we haven't been able to reproduce it ourselves.
Comment 2 Florian Engelhardt 2009-05-04 07:42:22 UTC
I was copying files via NFS from my desktop computer to that server, around 380 GB in total.
The filesystem was in service until the reboot ;)
As I don't need the server right now, I can build a tunnel and give you root access to that machine via ssh.
Comment 3 Florian Engelhardt 2009-05-15 16:58:14 UTC
Any news on this? I need to have this machine up and running stably by the 15th of June, so until then I can still help: try patches, give you root access, ...
Comment 4 Theodore Tso 2009-05-15 17:24:56 UTC
As a matter of fact we think we found the cause of the problem just yesterday.

The fix is now in mainline:

http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commit;h=2ec0ae3acec47f628179ee95fe2c4da01b5e9fc4

Was this a problem you can easily reproduce?  If so, I'd appreciate it if you could try out this patch, and confirm that you can no longer reproduce the problem with this patch.
Comment 5 Florian Engelhardt 2009-06-10 06:47:42 UTC
I applied that patch to the 2.6.29.4 kernel and recreated the filesystem. It has been up and running for four days now, and I was not able to reproduce this bug. Good work, thanks so far. Is this patch also in the 2.6.30 mainline? That would allow me to switch back to the official kernel of my distribution.
Comment 6 Theodore Tso 2009-06-10 13:53:11 UTC
Yes, this patch is in 2.6.30, and it's in the 2.6.27.y and 2.6.29.y patches which Greg sent out for review yesterday.  So 2.6.27.25 and 2.6.29.5 should have this patch when they get released next week.
