Bug 8144 - raid5 disk failure followed by xfs filesystem corruption
Summary: raid5 disk failure followed by xfs filesystem corruption
Status: REJECTED INSUFFICIENT_DATA
Alias: None
Product: File System
Classification: Unclassified
Component: XFS
Hardware: i386 Linux
Importance: P2 high
Assignee: XFS Guru
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2007-03-07 17:48 UTC by lazx888
Modified: 2008-09-26 06:20 UTC

See Also:
Kernel Version: 2.6.18
Subsystem:
Regression: ---
Bisected commit-id:


Attachments
Error messages from /var/log/messages (10.39 KB, text/plain)
2007-03-07 17:49 UTC, lazx888

Description lazx888 2007-03-07 17:48:13 UTC
Distribution: Gentoo Linux
Hardware Environment: x86 (amd)

linux-2.6.18-gentoo-r6
xfsprogs-2.8.11
mdadm-2.6.1

RAID5 array (5 x 250 GB disks): one of the disks died and XFS corruption ensued.

Before corruption: ~600 GB of data.

After corruption: 200 GB lost, 150 GB in lost+found, 250 GB "okay" but with files
missing here and there.

I have already posted this on the XFS bugzilla
(http://oss.sgi.com/bugzilla/show_bug.cgi?id=741) and the Gentoo bugzilla
(http://bugs.gentoo.org/show_bug.cgi?id=169667).
Comment 1 lazx888 2007-03-07 17:49:40 UTC
Created attachment 10648
Error messages from /var/log/messages
Comment 2 Neil Brown 2007-03-07 19:38:24 UTC
There was a bug in raid5 in 2.6.19 and earlier whereby error returns
weren't properly recognised by the filesystem (depending on the
filesystem). (We cleared the UPTODATE bit but passed a '0' error code.)
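The mismatch described above can be illustrated with a minimal userspace sketch in C. This is not the kernel code: `struct fake_bio`, `FAKE_BIO_UPTODATE`, `struct completion_result`, and every function below are hypothetical names invented for illustration. The sketch only assumes what the comment states, namely that a failed read was completed with the "up to date" flag cleared while the error code stayed 0, so a caller that looks only at the error code would treat the data as valid.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>

/* Hypothetical stand-ins for illustration; NOT the real kernel structures. */
#define FAKE_BIO_UPTODATE 0x1u
#define FAKE_EIO 5

struct fake_bio {
    unsigned int flags;   /* FAKE_BIO_UPTODATE set means the data is valid */
};

struct completion_result {
    struct fake_bio bio;
    int error;            /* 0 = success, negative value = failure */
};

/* Failure path as described: flag cleared, but error code left at 0. */
static struct completion_result complete_failed_read_buggy(void)
{
    struct completion_result r = { { FAKE_BIO_UPTODATE }, 0 };
    r.bio.flags &= ~FAKE_BIO_UPTODATE;
    return r;
}

/* Consistent failure path: flag cleared AND a non-zero error code. */
static struct completion_result complete_failed_read_consistent(void)
{
    struct completion_result r = { { FAKE_BIO_UPTODATE }, 0 };
    r.bio.flags &= ~FAKE_BIO_UPTODATE;
    r.error = -FAKE_EIO;
    return r;
}

/* A caller that trusts only the error code. Returns true if it
 * would treat the data as valid. */
static bool caller_checks_error_only(const struct completion_result *r)
{
    return r->error == 0;
}

/* A caller that also checks the flag. */
static bool caller_checks_flag_too(const struct completion_result *r)
{
    return r->error == 0 && (r->bio.flags & FAKE_BIO_UPTODATE) != 0;
}

/* Prints how each kind of caller interprets each completion. */
void print_demo(void)
{
    struct completion_result buggy = complete_failed_read_buggy();
    struct completion_result ok = complete_failed_read_consistent();

    /* Sanity check: both paths clear the flag. */
    assert((buggy.bio.flags & FAKE_BIO_UPTODATE) == 0);
    assert((ok.bio.flags & FAKE_BIO_UPTODATE) == 0);

    printf("buggy, error-only caller accepts data: %d\n",
           caller_checks_error_only(&buggy));
    printf("buggy, flag-aware caller accepts data: %d\n",
           caller_checks_flag_too(&buggy));
    printf("consistent, error-only caller accepts data: %d\n",
           caller_checks_error_only(&ok));
}
```

Calling `print_demo()` from a `main` function shows that only the error-only caller accepts the failed read on the buggy path. Whether a given filesystem behaved like the error-only caller is exactly the open question raised in this comment.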

In this case it was probably a read-ahead request that failed due to lack of
resources, as much of the stripe cache was tied up with retries on the failed
drive.

I don't know if this analysis meshes with the reality of how XFS works;
the code is a bit too complex for me to follow easily.

I think this bug should possibly be assigned to someone with XFS
knowledge to comment if that is a possible explanation....
I wonder how I do that...
Comment 3 Neil Brown 2007-03-07 19:39:30 UTC
Maybe I do it like that.... accept the bug first, then reassign...
Comment 4 lazx888 2007-03-13 08:27:51 UTC
Neil:

I still have the machine off and in a "broken" state.

I am planning on redoing the array soon and was wondering whether you need any
other info before I do this.

Thanks
Comment 5 Natalie Protasevich 2007-06-24 18:48:31 UTC
Have you been able to bring up your RAID and maybe do more testing with newer kernels? Is the problem still there?
Thanks.
