Bug 8144

Summary: raid5 disk failure followed by xfs filesystem corruption
Product: File System
Reporter: lazx888
Component: XFS
Assignee: XFS Guru (xfs-masters)
Status: REJECTED INSUFFICIENT_DATA
Severity: high
CC: neilb, protasnb, sandeen-xfs
Priority: P2
Hardware: i386
OS: Linux
Kernel Version: 2.6.18
Subsystem:
Regression: ---
Bisected commit-id:
Attachments: Error messages from /var/log/messages

Description lazx888 2007-03-07 17:48:13 UTC
Distribution: Gentoo Linux
Hardware Environment: x86 (amd)

linux-2.6.18-gentoo-r6
xfsprogs-2.8.11
mdadm-2.6.1

raid5 (5 x 250 GB disks); one of the disks died and XFS corruption ensued.

Before corruption: ~600 GB of data.

After corruption: 200 GB lost, 150 GB in lost+found, 250 GB "okay" but with
files missing here and there.

I have already posted this on the XFS bugzilla
(http://oss.sgi.com/bugzilla/show_bug.cgi?id=741) and the Gentoo bugzilla
(http://bugs.gentoo.org/show_bug.cgi?id=169667).
Comment 1 lazx888 2007-03-07 17:49:40 UTC
Created attachment 10648
Error messages from /var/log/messages
Comment 2 Neil Brown 2007-03-07 19:38:24 UTC
There was a bug in raid5 in 2.6.19 and earlier whereby error returns
weren't properly recognised by the filesystem (depending on the
filesystem): we cleared the UPTODATE bit but passed a '0' error code.

In this case it was probably a read-ahead request that failed due to lack of
resources, as much of the stripe cache was tied up with retries on the failed
drive.
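
As a rough illustration of the mismatch described above, here is a toy model in Python. It is not kernel code: the `Bio` class, the flag value, and the function names are all made up for this sketch, and it makes no claim about what XFS actually checks. It only shows why a completion callback that looks at the error code alone would accept a failed read when the up-to-date flag is cleared but the error code is 0.

```python
from dataclasses import dataclass

BIO_UPTODATE = 1 << 0  # hypothetical flag value, chosen for this sketch only
EIO = 5                # stands in for the kernel's -EIO error code


@dataclass
class Bio:
    """Toy stand-in for a block I/O request; not the kernel's struct bio."""
    flags: int = BIO_UPTODATE
    data: bytes = b""


def raid5_complete_read(bio, failed, report_error):
    """Finish a read and return the error code passed to the callback."""
    if not failed:
        bio.data = b"good data"
        return 0
    # The read failed, so the buffer contents must not be trusted.
    bio.flags &= ~BIO_UPTODATE
    # report_error=False models the reported bug: flag cleared, error code 0.
    return -EIO if report_error else 0


def end_io_error_only(bio, error):
    """Callback that trusts only the error code (assumed behaviour)."""
    return error == 0


def end_io_checks_flag(bio, error):
    """Callback that also requires the up-to-date flag to be set."""
    return error == 0 and bool(bio.flags & BIO_UPTODATE)


def demo():
    """Run a failed read through both callbacks, with and without the bug."""
    results = {}
    for name, report_error in (("buggy", False), ("fixed", True)):
        bio = Bio()
        err = raid5_complete_read(bio, failed=True, report_error=report_error)
        results[name] = (end_io_error_only(bio, err),
                         end_io_checks_flag(bio, err))
    return results


if __name__ == "__main__":
    # Each tuple is (error-only callback accepted, flag-checking callback accepted).
    print(demo())
```

In the "buggy" case the error-only callback accepts the failed read while the flag-checking one rejects it; once a real error code is passed, both reject it.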

I don't know if this analysis meshes with the reality of how XFS works;
the code is a bit too complex for me to follow easily.

I think this bug should possibly be assigned to someone with XFS
knowledge to comment on whether that is a possible explanation....
I wonder how I do that...
Comment 3 Neil Brown 2007-03-07 19:39:30 UTC
Maybe I do it like that.... accept the bug first, then reassign...
Comment 4 lazx888 2007-03-13 08:27:51 UTC
Neil:

I still have the machine off and in a "broken" state.

I am planning on redoing the array soon and was wondering if you need any
other info before I do this.

Thanks
Comment 5 Natalie Protasevich 2007-06-24 18:48:31 UTC
Have you been able to bring up your RAID and maybe do more testing with newer kernels? Is the problem still there?
Thanks.