8144 – raid5 disk failure followed by xfs filesystem corruption


Summary: raid5 disk failure followed by xfs filesystem corruption

Status:	REJECTED INSUFFICIENT_DATA

Alias:	None

Product:	File System
Classification:	Unclassified
Component:	XFS
Hardware:	i386 Linux

Importance:	P2 high
Assignee:	XFS Guru

URL:
Keywords:

Depends on:
Blocks:

Reported:	2007-03-07 17:48 UTC by lazx888
Modified:	2008-09-26 06:20 UTC
CC List:	3 users

See Also:
Kernel Version:	2.6.18
Subsystem:
Regression:	---
Bisected commit-id:

Attachments
Error messages from /var/log/messages (10.39 KB, text/plain) 2007-03-07 17:49 UTC, lazx888

Description lazx888 2007-03-07 17:48:13 UTC

Distribution: Gentoo Linux
Hardware Environment: x86 (amd)

linux-2.6.18-gentoo-r6
xfsprogs-2.8.11
mdadm-2.6.1

raid5 array (5 x 250gb disks): one of the disks died, and XFS corruption ensued.

Before corruption: ~600gb of data.

After corruption: 200gb lost, 150gb in lost+found, 250gb "okay" but files
missing here and there.

I have already posted this on the xfs bugzilla
(http://oss.sgi.com/bugzilla/show_bug.cgi?id=741) and the gentoo bugzilla
(http://bugs.gentoo.org/show_bug.cgi?id=169667).

Comment 1 lazx888 2007-03-07 17:49:40 UTC

Created attachment 10648
Error messages from /var/log/messages

Comment 2 Neil Brown 2007-03-07 19:38:24 UTC

There was a bug in raid5 in 2.6.19 and earlier whereby error returns
weren't properly recognised by the filesystem (depending on the
filesystem): we cleared the UPTODATE bit but passed a '0' error code.
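
To illustrate the mismatch described above, here is a minimal toy model in
Python. It is not kernel code: the Bio class, the BIO_UPTODATE constant, and
both completion handlers are hypothetical stand-ins invented for this sketch,
and nothing here claims to reflect how XFS or any other filesystem actually
checks completions. It only shows why a caller that looks at the error code
alone would accept a failed read, while a caller that also checks the flag
would not.

```python
from dataclasses import dataclass

BIO_UPTODATE = 1 << 0  # assumed flag bit meaning "buffer contents are valid"
EIO = 5                # conventional I/O error number


@dataclass
class Bio:
    """Toy stand-in for a block I/O request (not the kernel's struct bio)."""
    flags: int = BIO_UPTODATE


def buggy_raid5_fail_read(bio: Bio) -> int:
    """Model of the behaviour in the comment: clear the flag, report 0."""
    bio.flags &= ~BIO_UPTODATE
    return 0  # error code wrongly signals success


def consistent_fail_read(bio: Bio) -> int:
    """Model of a consistent failure: clear the flag and report an error."""
    bio.flags &= ~BIO_UPTODATE
    return -EIO


def fs_trusts_error_code(bio: Bio, error: int) -> bool:
    """Hypothetical completion handler that checks only the error code."""
    return error == 0


def fs_checks_flag(bio: Bio, error: int) -> bool:
    """Hypothetical handler that also requires the up-to-date flag."""
    return error == 0 and bool(bio.flags & BIO_UPTODATE)


# A failed read reported the buggy way: flag cleared, error code 0.
bio = Bio()
err = buggy_raid5_fail_read(bio)
accepted_by_code_only = fs_trusts_error_code(bio, err)  # failure goes unnoticed
accepted_by_flag_check = fs_checks_flag(bio, err)       # failure is caught

# The same failure reported consistently is rejected by either handler.
bio2 = Bio()
err2 = consistent_fail_read(bio2)
accepted_when_consistent = fs_trusts_error_code(bio2, err2)
```

In this model, whether the failure is noticed depends entirely on which of
the two signals the caller inspects, which matches the "depending on the
filesystem" remark.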

In this case it was probably a read-ahead request that failed due to lack of
resources, as much of the stripe cache was tied up with retries on the failed
drive.

I don't know if this analysis meshes with the reality of how XFS works;
the code is a bit too complex for me to follow easily.

I think this bug should possibly be assigned to someone with XFS
knowledge to comment on whether that is a possible explanation....
I wonder how I do that...

Comment 3 Neil Brown 2007-03-07 19:39:30 UTC

Maybe I do it like that.... accept the bug first, then reassign...

Comment 4 lazx888 2007-03-13 08:27:51 UTC

Neil:

I still have the machine off and in a "broken" state.

I am planning on redoing the array soon, and was wondering if you need any
other info before I do this.

Thanks

Comment 5 Natalie Protasevich 2007-06-24 18:48:31 UTC

Have you been able to bring up your RAID and maybe do more testing with newer kernels? Is the problem still there?
Thanks.
