Bug 6957 - Oops with file with holes
Summary: Oops with file with holes
Status: REJECTED INSUFFICIENT_DATA
Alias: None
Product: File System
Classification: Unclassified
Component: XFS
Hardware: i386 Linux
Importance: P2 normal
Assignee: XFS Guru
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2006-08-04 05:27 UTC by Damian Pietras
Modified: 2008-09-22 16:55 UTC
CC List: 4 users

See Also:
Kernel Version: 2.6.18-rc3
Tree: Mainline
Regression: ---


Attachments
Oops when running stress-xfs (21.39 KB, text/plain)
2006-08-04 05:28 UTC, Damian Pietras
Program that writes randomly to a file with holes. (935 bytes, text/plain)
2006-08-04 05:29 UTC, Damian Pietras
Another call trace (1.94 KB, text/plain)
2006-08-17 05:39 UTC, Damian Pietras

Description Damian Pietras 2006-08-04 05:27:37 UTC
Most recent kernel where this bug did not occur: 2.6.18-rc3
Distribution: Custom made
Hardware Environment: i386, P IV, SATA disk with 3ware controller
Software Environment: XFS file system on an LVM logical volume
Problem Description: After turning off the power while writing to a file with
holes, there is an Oops on mount after booting, or shortly after writing to the
file again.

Steps to reproduce:

1. Create a logical volume and format it with mkfs.xfs /dev/vg/lv
2. Mount the filesystem and create a file with holes with xfs_mkfile -n 10G file
(in my case the file size is 1GB below the size of the filesystem)
3. Run stress-xfs file
4. After about a minute, unplug the power cable
5. Turn on the computer, mount the filesystem and run stress-xfs file again

You should see an Oops at mount, or after a while when running stress-xfs; a few
power failures might be required.
Comment 1 Damian Pietras 2006-08-04 05:28:40 UTC
Created attachment 8697 [details]
Oops when running stress-xfs
Comment 2 Damian Pietras 2006-08-04 05:29:32 UTC
Created attachment 8698 [details]
Program that writes randomly to a file with holes.
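
[Editorial sketch] The attachment itself is not reproduced here. As a rough
illustration only, a minimal stressor of the kind described (random 4K writes
scattered through a 10G sparse file; the sizes are assumptions taken from the
reproduction steps above, and this is not the attached stress-xfs program)
could look like:

/* Hypothetical sketch of a random-write stressor for a sparse file;
 * not the attached stress-xfs program.  It keeps writing 4K blocks at
 * random offsets inside a 10G file sized with ftruncate(), so most of
 * the file stays holes while new extents are allocated under it.
 */
#define _FILE_OFFSET_BITS 64
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>
#include <unistd.h>

#define FILE_SIZE  (10LL * 1024 * 1024 * 1024)  /* 10G, as with xfs_mkfile -n 10G */
#define BLOCK_SIZE 4096

int main(int argc, char **argv)
{
        static char buf[BLOCK_SIZE];
        long long nblocks = FILE_SIZE / BLOCK_SIZE;
        int fd;

        if (argc != 2) {
                fprintf(stderr, "usage: %s <file>\n", argv[0]);
                return 1;
        }

        fd = open(argv[1], O_WRONLY | O_CREAT, 0644);
        if (fd < 0) {
                perror("open");
                return 1;
        }
        /* Size the file without allocating any blocks (i.e. full of holes). */
        if (ftruncate(fd, FILE_SIZE) < 0) {
                perror("ftruncate");
                return 1;
        }

        memset(buf, 'x', sizeof(buf));
        srand(time(NULL));

        for (;;) {
                /* Pick a random block-aligned offset within the file. */
                off_t off = (off_t)(rand() % nblocks) * BLOCK_SIZE;

                if (pwrite(fd, buf, sizeof(buf), off) != (ssize_t)sizeof(buf)) {
                        perror("pwrite");
                        return 1;
                }
        }
}

Random offsets rather than sequential ones keep splitting holes into small new
extents, which matches the heavily fragmented files mentioned later in this bug.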
Comment 3 Andrew Morton 2006-08-07 00:12:33 UTC
> Most recent kernel where this bug did not occur: 2.6.18-rc3

You misunderstand.  We'd like to know which kernel version did _not_ have this bug.
Comment 4 Damian Pietras 2006-08-10 04:59:23 UTC
I can't find a kernel version where it works. I've tested 2.6.18-rc3,
2.6.17.7 and 2.6.12.
The Oops has a better chance of occurring when running stress-xfs just after
mount, without any delay, e.g. mount /dev/vg/lv /mnt/tmp && ./stress-xfs /mnt/tmp/test
Comment 5 Nathan Scott 2006-08-10 15:43:49 UTC
Your oops is due to some rather braindead code in XFS that tries to
catch inode corruption in the form of an extent which points at the
primary superblock as its start (which is a really bad place to then
go write).  This code shouldn't be causing a panic; it should instead
report the corrupt file and battle on.  I'll fix that.

But, that aside, the root issue here seems to be related to logging
and/or recovery of the log.  The usual initial question here is: are you
running with any form of I/O completion ordering guarantees?  It looks
like SATA and device mapper from your report, so I'm guessing not...
Could you try your test with a single drive (no DM/LVM), and with
the barrier mount option (this is the default in recent kernels),
just to see if that fares any better?  If it does, it will point our
investigation in one direction; if not, we'll need to look elsewhere.

One other question - are you saying you always see this same panic
when you run the test (several times, till it eventually fails)?

thanks.
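
[Editorial sketch] For illustration only (hypothetical names and userspace
stand-ins, not the actual XFS source), the behavioural change described above
amounts to the following control flow: report the corrupt inode and hand an
error back to the caller, rather than panic the machine:

#include <errno.h>
#include <stdio.h>

/* Hypothetical stand-ins so the sketch is self-contained; in XFS the real
 * check is kernel code and logs messages like the "Access to block zero in
 * inode ..." lines quoted elsewhere in this bug.
 */
struct fs_info {
        const char *name;
};

static void report_corruption(struct fs_info *fs, unsigned long long ino)
{
        fprintf(stderr, "Filesystem \"%s\": Access to block zero in inode %llu\n",
                fs->name, ino);
}

/* If an extent claims to start at block zero (where the primary superblock
 * lives), report it and fail the operation -- do not panic.
 */
static int check_extent_startblock(struct fs_info *fs, unsigned long long ino,
                                   unsigned long long start_block)
{
        if (start_block == 0) {
                report_corruption(fs, ino);  /* previously this path ended in a panic */
                return -EIO;
        }
        return 0;
}

int main(void)
{
        struct fs_info fs = { "dm-4" };

        /* A zero start block is reported and the write fails; it is not fatal. */
        return check_extent_startblock(&fs, 131, 0) == -EIO ? 0 : 1;
}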
Comment 6 Damian Pietras 2006-08-17 05:38:29 UTC
With XFS directly on a partition, using the same hardware, software and test, I
didn't manage to reproduce this.

The Oops is not always exactly the same (attached another call trace).

One more thing: I can see that if you run stress-xfs just after mounting,
without any delay, the bug always occurs (after the first power failure).
Comment 7 Damian Pietras 2006-08-17 05:39:50 UTC
Created attachment 8812 [details]
Another call trace
Comment 8 Natalie Protasevich 2007-10-04 03:03:54 UTC
Damian, is this still a problem with recent kernel?
Thanks.
Comment 9 Damian Pietras 2007-10-04 08:36:57 UTC
I'm no longer working at the company where I tested this, so I don't have access to the hardware I used. Now I have only my laptop, which is completely different from the server board with a 3ware controller :)

But I repeated the test on my box with kernel 2.6.23-rc9 and the result just after running stress-xfs (after power failure and mount) is:

[   96.023459] Filesystem "dm-3": Access to block zero in inode 131 start_block: 0 start_off: 0 blkcnt: 0 extent-state: 0 lastx: 11190
[   96.023580] Filesystem "dm-3": Access to block zero in inode 131 start_block: 0 start_off: 0 blkcnt: 0 extent-state: 0 lastx: 11190
[   96.024045] Filesystem "dm-3": Access to block zero in inode 131 start_block: 0 start_off: 0 blkcnt: 0 extent-state: 0 lastx: 11191
[   96.024228] Filesystem "dm-3": Access to block zero in inode 131 start_block: 0 start_off: 0 blkcnt: 0 extent-state: 0 lastx: 11191
[   96.228231] Filesystem "dm-3": Access to block zero in inode 131 start_block: 0 start_off: 0 blkcnt: 0 extent-state: 0 lastx: 11299
[   96.228351] Filesystem "dm-3": Access to block zero in inode 131 start_block: 0 start_off: 0 blkcnt: 0 extent-state: 0 lastx: 11299
[   96.245877] Filesystem "dm-3": Access to block zero in inode 131 start_block: 0 start_off: 0 blkcnt: 0 extent-state: 0 lastx: 112a5
[   96.245994] Filesystem "dm-3": Access to block zero in inode 131 start_block: 0 start_off: 0 blkcnt: 0 extent-state: 0 lastx: 112a5
[   96.246568] Filesystem "dm-3": Access to block zero in inode 131 start_block: 0 start_off: 0 blkcnt: 0 extent-state: 0 lastx: 112a6
[   96.246689] Filesystem "dm-3": Access to block zero in inode 131 start_block: 0 start_off: 0 blkcnt: 0 extent-state: 0 lastx: 112a6
[   96.323049] Filesystem "dm-3": Access to block zero in inode 131 start_block: 0 start_off: 0 blkcnt: 0 extent-state: 0 lastx: 11381


And many more such messages. The test was done like the first one: with an LVM volume (11G) where the test file (10G) is 1G smaller than the volume size.
Comment 10 Dave Chinner 2007-10-04 14:39:39 UTC
Following up from Nathan's original comments, I see that the original
report showed:

Filesystem "dm-4": Disabling barriers, not supported by the underlying device
XFS mounting filesystem dm-4
Starting XFS recovery on filesystem: dm-4 (logdev: internal)
Ending XFS recovery on filesystem: dm-4 (logdev: internal)
Access to block zero: fs: <dm-4> inode: 131 start_block : 0 start_off : 0 blkcnt : 0 extent-state : 0 

That is, barriers were disabled on the block device you were using. This could
lead to the sort of problem that you're seeing.

Can this be reproduced on a filesystem that mounts with barriers (should
happen by default) and does not produce the "barriers disabled" warning
on mount?

Also, from comment #9, this is from an extent btree block full of
zeros, which again implies a log recovery problem which initially points
back to the above question about barriers....
Comment 11 Dave Chinner 2007-10-05 01:26:01 UTC
No luck reproducing this on a filesystem with barriers enabled
or a drive that does not have write cache enabled. I've built
some wonderfully fragmented files, though (>430,000 extents).

Next is to try a disk with WCE and no barriers.
Comment 12 Damian Pietras 2007-10-05 11:24:41 UTC
(In reply to comment #10)
> Can this be reproduced on a filesystem that mounts with barriers (should
> happen by default) and does not produce the "barriers disabled" warning
> on mount?
> 
> Also, from comment #9, this is from an extent btree block full of
> zeros, which again implies a log recovery problem which initially points
> back to the above question about barriers....
> 

I'm not able to reproduce it without LVM now. Previous tests at my former company showed that with a disk partition (like /dev/sdb1) the bug does not occur.
Comment 13 Damian Pietras 2007-10-05 11:26:45 UTC
(In reply to comment #12)
> I'm not able to reproduce it without LVM now.

To be clear: I can't do the test. As described earlier, I only have my laptop now, so I can't play with my partitions; everything is on LVM.
Comment 14 Natalie Protasevich 2007-11-26 23:13:04 UTC
That is unfortunate; we should try to reproduce this exact environment somehow...
Comment 15 Dave Chinner 2007-11-26 23:49:38 UTC
(removing CC as I get one through xfs-masters@oss.sgi.com)

Regardless of whether it can be reproduced or not, the critical
question that we need answered is whether barriers are enabled or
not when the corruption occurs.
Comment 16 Natalie Protasevich 2008-02-02 01:57:30 UTC
Dave, were you able to try with no barriers? - thanks.
