Bug 12821
Summary: | filesystem corrupts on heavy I/O | ||
---|---|---|---|
Product: | File System | Reporter: | ddi (ddi) |
Component: | ext4 | Assignee: | Eric Sandeen (sandeen) |
Status: | CLOSED CODE_FIX | ||
Severity: | normal | CC: | apw |
Priority: | P1 | ||
Hardware: | All | ||
OS: | Linux | ||
Kernel Version: | ubuntu-server-2.6.27-9 (smp) | Subsystem: | |
Regression: | --- | Bisected commit-id: | |
Attachments: |
potential fix
stat of the problematic inode in (hacked) debugfs output of debugfs 'stat <2491101>' (split 1 of 2) output of debugfs 'stat <2491101>' (split 2 of 2) |
Description
ddi
2009-03-05 08:42:37 UTC
Is this something you can reproduce? I'm not remembering anything related offhand other than an early bug in the extent code, which I think should be fixed well before 2.6.27. If you find inode #2621457, I expect that it is very fragmented, and I'd guess that any access to it (reading it, that is, or perhaps extending it) will trip this error. If you can try a later kernel, maybe 2.6.27.x or 28.x or even .29 it'd let us know if the problem still exists; if not maybe there is more debugging we can do. Providing an e2image -r of the filesystem would also let us see what kind of layout that file has, though from your comments I'm guessing this may not be possible due to security reasons. It continually spews messages in syslog while it's powered on, but that's not exactly "reproduction" I guess. Tried to find the inode in question with debugfs, it said that this particular inode does not belong to any file - maybe because the inode contains filesystem metadata, not file data? Anyhow, no useful output - probably in no small part because I have no clue what I'm doing with debugfs ;-) New kernel: Okay, upgraded to the latest available Ubuntu Server kernel. Had to yank the power; the sync would never finish (vmstat says it synced about 1kB/s). Looks like the filesystem is also super-slow for some reason. (Possible that a logging daemon was putting error messages in the buffer cache while the sync got farther and farther behind trying to sync those, not sure. Did kill all loggers at one point, but didn't help much. Hmm.) It came up with the new kernel, did a fsck, which finished off with "fsck died with exit status 1". Will try e2image in a moment.. I am thinking that the header check in search_right, in the loop, is still backwards but i'm in the middle of leaving on a road trip so may be thinking about it ... backwards. If others don't pick this up I'll look more tomorrow. The error messages you are seeing are the result of the filesystem being corrupted. The kernel will continue to printk those messages until the filesystem is fixed. The fact that the fsck died with exit status 1 probably means you need to upgrade to e2fsprogs 1.41.4 (1.41.3 and earlier e2fsck had a bug that would be triggered by certain types of corruption). Are you willing to try a newer kernel? This looks familiar and I think it's fixed in the latest 2.6.29-rc7 kernel. I'll need to go back and figure out which bug fix this is most likely related to, and check whether it's been backported to the 2.6.27.y stable series. @sandeen: Thanks a *lot*. Have a nice trip!
@tytso: Same to you, info much appreciated! (I was wondering if the fsck thing was an anomaly or not.)
> Are you willing to try a newer kernel?
> This looks familiar and I think it's
> fixed in the latest 2.6.29-rc7 kernel.
I'd rather not since it's production stuff, but then again if you know what the
problem is, and a shiny new kernel is the cure, then yes. I'll have to live on
the edge, bleeding. (or whatever it's called..)
FYI, e2image is coming up when it's done (running for about an hour now).
If it's what I think it is, it's not an anomaly; e2fsck 1.41.3 had a bug which would cause it to not fix certain types of corruption in the extent tree, and it would just bail out with an exit(1). And the error messages that the kernel printed are consistent with it. Did you try re-running a forced fsck on the filesystem after you got that failure? > If it's what I think it is, it's not an anomaly; (By "anomaly" I just meant that I have no clue whether "fsck died with exit status 1" means "all OK" or "something bad happened". The console message doesn't really say anything about whether fsck exit code 1 is a good, bad or neutral thing. Just sort of states the fact, heh.) > e2fsck 1.41.3 had a bug which would cause it to ... bail out with an exit(1). Got it. No, I didn't try a forced fsck (didn't know it would be different from an automated one), but I'll re-mount ro and do just that when e2image is finished! If it still breaks, I'll upgrade e2progs as suggested? Created attachment 20443 [details]
potential fix
I have not yet tested this patch, but it may fix your problem.
you can wait 'til I've had a chance to test it, or give it a go; if I'm wrong, the worst you should get is more corrupt header messages and EIO errors. :)
The ext4_ext_search_right() function is confusing; it uses a "depth" variable which is 0 at the root and maximum at the leaves, but the on-disk metadata uses a "depth" (actually eh_depth) which is opposite: maximum at the root, and 0 at the leaves.
The ext4_ext_check_header() function expects the on-disk semantics, but we are feeding it the opposite in the while loop in this function. We should be feeding it the on-disk notion of "depth" which we can get with (p_depth - depth) - and if you look, the last (more commonly hit) call to ext4_ext_check_header() does just this.
For a depth 4 tree (p_depth 3) (which would be an incredible amount of extents, so perhaps not your problem, but perhaps....), then, you'd have:
"depth" in function eh_depth p_depth - depth
0 3 3
1 2 2
2 1 <---- I think you are here 1
3 0 0
Your error messages:
[1099257.456522] EXT4-fs error (device sdb1): ext4_ext_search_right: bad header
in inode #2621457: unexpected eh_depth - magic f30a, entries 340, max 340(0),
depth 1(2)
indicate that it was asked to check that the header was depth (2) but it actually found depth 1 in the disk metadata.
I've not yet crafted a test to hit this codepath at all, so have not tested this yet, but by inspection I *think* it is correct.
Well, the first thing I would do before doing anything else is upgrade to e2fsprogs 1.41.4, and then run '"script -c "e2fsck -fv /dev/sdb1" e2fsck.log', and upload the e2fsck.log file to bugzilla. Then see if the problem goes away. The fact that e2fsck exited with a status code one of 1 causes me to suspect that you really did have an inconsistent filesystem, and it's not the problem which Eric is chasing. e2fsprogs 1.41.4ubuntu1 is available in Jaunty so you could pick those tools up from the URL below: http://archive.ubuntu.com/ubuntu/pool/main/e/e2fsprogs/e2fsprogs_1.41.4-1ubuntu1_amd64.deb http://archive.ubuntu.com/ubuntu/pool/main/e/e2fsprogs/e2fsprogs_1.41.4-1ubuntu1_i386.deb Ok, at least w/ e2fsprogs-1.41.4 the fs doesn't look too bad: # e2fsck -fn sdb1.e2i e2fsck 1.41.4 (27-Jan-2009) Pass 1: Checking inodes, blocks, and sizes Pass 2: Checking directory structure Pass 3: Checking directory connectivity Pass 4: Checking reference counts Pass 5: Checking group summary information Free blocks count wrong (102518871, counted=100960243). Fix? no Free inodes count wrong (90170045, counted=90170006). Fix? no LargeData: ********** WARNING: Filesystem still has errors ********** LargeData: 7491/90177536 files (39.4% non-contiguous), 258164220/360683091 blocks the inode in question: debuge4fs: ncheck 2621457 Inode Pathname 2621457 /mysql/0004_rldb/fw1log20090302.MYI is pretty fragmented: # filefrag mysql/0004_rldb/fw1log20090302.MYI File is stored in extents format mysql/0004_rldb/fw1log20090302.MYI: 5820 extents found There is a really large number of indirect (i.e. extent index) blocks in that file: # debugfs -c sdb1.e2i -R "stat <2621457>" | sed -e "s/, /\n/g" | grep IND | wc -l debugfs 1.41.4 (27-Jan-2009) sdb1.e2i: catastrophic mode - not reading inode or group bitmaps 2730 I would still try my patch. -Eric Created attachment 20449 [details]
stat of the problematic inode in (hacked) debugfs
This is for the ext4 devs as much as anything; it's a hacked stat <inode> of the inode that generated the messages.
It does have 3 levels, which would be sufficient to enter the code that generated the warning, due to what I believe is your bug.
dd'd the problematic file to /dev/null just now. Yielded no complaints in the syslog about unexpected eh_depths. Maybe the reboot+fsck had some effect. In retrospect, I should probably have found a network mount (since write speed suffered badly when the problem hit) and done the e2image before reboot/fsck, so we'd have a pristine image. A quick HOWTO on the subject with best practices for ad-hoc network mounting (using random kernel/distro) plus a copy of relevant parts of the e2image man page would be helpful to others in the same situation. comment #8: Priceless information. Thanks! comment #11: Transcripts are great; I'm learning a bit of ext4 debugging as we go along. I take it that there's no on-disk problem (now), then. Confirmed by running fsck 1.41.4 as requested in comment #9: e2fsck 1.41.4 (27-Jan-2009) Pass 1: Checking inodes, blocks, and sizes Pass 2: Checking directory structure Pass 3: Checking directory connectivity Pass 4: Checking reference counts Pass 5: Checking group summary information 7648 inodes used (0.01%) 2994 non-contiguous files (39.1%) 17 non-contiguous directories (0.2%) # of inodes with ind/dind/tind blocks: 0/0/0 Extent depth histogram: 3710/3891/35/2 273816895 blocks used (75.92%) 0 bad blocks 7 large files 7593 regular files 46 directories 0 character device files 0 block device files 0 fifos 0 links 0 symbolic links (0 fast symbolic links) 0 sockets -------- 7639 files comment #10: Thanks! Used above. There's some Ubuntu-related issues, I'll try my luck with Launchpad for those *cross fingers*. comment #11 (again): Will try the patch ASAP, meaning once I've figured out how to best "inject" it into the local Ubuntu eco-system without disrupting the package management, updates etc. I'm seeing a similar error with heavy read/write I/O on a 1TB ext4 volume. It's not clear what behavior triggers the error for me. Occasionally I see the error in dmesg... [ 7829.004269] EXT4-fs error (device sdb1): ext4_ext_search_right: bad header in inode #2491097: unexpected eh_depth - magic f30a, entries 78, max 340(0), depth 1(2) [ 7829.012197] mpage_da_map_blocks block allocation failed for inode 2491097 at logical offset 2788227 with max blocks 7 with error -5 [ 7829.012220] This should not happen.!! Data will be lost ...and sync does not complete. It's not clear from the discussion if e2image needs to be done before the error occurs on a clean mount, or afterward while the system is up. When this happens the reboot procedure does not complete, and I power cycle the machine. After a power cycle, the journal replays and fsck completes, no inconsistencies. The files affected by data loss are part of a BitTorrent network download and after a Torrent data consistency check, I confirm that data has been lost. If I leave the Torrent active downloading for more than an hour or so, the ext4 errors occur. So, "me too" with linux-image-2.6.28-8-generic 2.6.28-8-generic #28-Ubuntu SMP Thu Mar 5 21:49:36 UTC 2009 i686 GNU/Linux. I'm willing to provide debug info to an interested developer, privately. (In reply to comment #14) > I'm seeing a similar error with heavy read/write I/O on a 1TB ext4 volume. > It's > not clear what behavior triggers the error for me. Occasionally I see the > error > in dmesg... > > [ 7829.004269] EXT4-fs error (device sdb1): ext4_ext_search_right: bad header > in inode #2491097: unexpected eh_depth - magic f30a, entries 78, max 340(0), > depth 1(2) > [ 7829.012197] mpage_da_map_blocks block allocation failed for inode 2491097 > at > logical offset 2788227 with max blocks 7 with error -5 > [ 7829.012220] This should not happen.!! Data will be lost > > ...and sync does not complete. It's not clear from the discussion if e2image > needs to be done before the error occurs on a clean mount, or afterward while > the system is up. If you like, it's probably sufficient to point debugfs at the system and do debugfs> stat <2491097> to give us an idea of the layout of that file. > When this happens the reboot procedure does not complete, and I power cycle > the > machine. the fs may have gone readonly after the error, but I'm not sure it should have hung up the system. > After a power cycle, the journal replays and fsck completes, no > inconsistencies. The files affected by data loss are part of a BitTorrent > network download and after a Torrent data consistency check, I confirm that > data has been lost. If I leave the Torrent active downloading for more than > an > hour or so, the ext4 errors occur. Great, if you can reproduce it w/ bittorrent, can you please try the patch in the attachments? Thanks, -Eric Will try patch soon. I mistakenly thought the patched kernel was running today, it was not, and so I have some more information from When Bad Things Happen. Without patch applied: # dmesg output: [17415.140240] EXT4-fs error (device sdb1): ext4_ext_search_right: bad header in inode #2491101: unexpected eh_depth - magic f30a, entries 228, max 340(0), depth 1(2) [17415.158912] mpage_da_map_blocks block allocation failed for inode 2491101 at logical offset 1032931 with max blocks 5 with error -5 [17415.158923] This should not happen.!! Data will be lost [17461.031925] EXT4-fs error (device sdb1): ext4_ext_search_right: bad header in inode #2491101: unexpected eh_depth - magic f30a, entries 228, max 340(0), depth 1(2) [17461.034989] mpage_da_map_blocks block allocation failed for inode 2491101 at logical offset 1032967 with max blocks 9 with error -5 [17461.034996] This should not happen.!! Data will be lost [18307.004556] EXT4-fs error (device sdb1): ext4_ext_search_right: bad header in inode #2491101: unexpected eh_depth - magic f30a, entries 228, max 340(0), depth 1(2) [18307.017829] mpage_da_map_blocks block allocation failed for inode 2491101 at logical offset 1032935 with max blocks 9 with error -5 [18307.017839] This should not happen.!! Data will be lost [18426.839773] EXT4-fs error (device sdb1): ext4_ext_search_right: bad header in inode #2491101: unexpected eh_depth - magic f30a, entries 228, max 340(0), depth 1(2) [18426.854493] mpage_da_map_blocks block allocation failed for inode 2491101 at logical offset 1032943 with max blocks 5 with error -5 [18426.854503] This should not happen.!! Data will be lost [25859.886794] EXT4-fs error (device sdb1): ext4_ext_search_right: bad header in inode #2491101: unexpected eh_depth - magic f30a, entries 229, max 340(0), depth 1(2) [25859.908981] mpage_da_map_blocks block allocation failed for inode 2491101 at logical offset 1032924 with max blocks 4 with error -5 [25859.908992] This should not happen.!! Data will be lost # debugfs output: debugfs 1.41.4 (27-Jan-2009) debugfs: stat 2491101 2491101: File not found by ext2_lookup debugfs: ncheck 2491101 Inode Pathname 2491101 /Datasets/MAME/LD-CHD/cubeqst/cubeqst.chd debugfs: icheck /Datasets/MAME/LD-CHD/cubeqst/cubeqst.chd icheck: Bad block number - /Datasets/MAME/LD-CHD/cubeqst/cubeqst.chd icheck: Invalid block number: /Datasets/MAME/LD-CHD/cubeqst/cubeqst.chd # filefrag output: filefrag $MOUNTPATH/Datasets/MAME/LD-CHD/cubeqst/cubeqst.chd (stuck / does not complete) (In reply to comment #16) > # debugfs output: > debugfs 1.41.4 (27-Jan-2009) > debugfs: stat 2491101 > 2491101: File not found by ext2_lookup Just FWIW, what you want is: debugfs: stat <2491101> to get that inode. Thanks, -Eric Created attachment 20561 [details]
output of debugfs 'stat <2491101>' (split 1 of 2)
Having first applied the suggested patch, is debugfs output taken after file completed and verified intact from BitTorrent download. Split and then compressed to comply with 1000k bugzilla constraint: 1 of 2.
Created attachment 20562 [details]
output of debugfs 'stat <2491101>' (split 2 of 2)
Having first applied the suggested patch, is debugfs output taken after file completed and verified intact from BitTorrent download. Split and then compressed to comply with 1000k bugzilla constraint: 2 of 2.
As noted in attachments, I have applied the patch and completed my BitTorrent download (which was problematic without the patch). linux-image-2.6.28-8-generic (Ubuntu): Download unable to complete due to EXT4 error and non-responsive sync. linux-image-2.6.28-9-generic (Ubuntu+potential fix patch), first try: Computer is found unresponsive and screen dark, power and capslock blinking. linux-image-2.6.28-9-generic (Ubuntu+potential fix patch), second try: Computer is normal, no errors in dmesg about EXT4, download is successful. The failure on first try is a new behavior I have not seen before with this machine. I therefore am reporting inconclusively that the patch works for me. (In reply to comment #20) > As noted in attachments, I have applied the patch and completed my BitTorrent > download (which was problematic without the patch). > > linux-image-2.6.28-8-generic (Ubuntu): Download unable to complete due to > EXT4 > error and non-responsive sync. > > linux-image-2.6.28-9-generic (Ubuntu+potential fix patch), first try: > Computer > is found unresponsive and screen dark, power and capslock blinking. That indicates a panic, but we'd need a lot more info, which you probably can't gather at this time. Getting panics visible when you're in X is tricky, netconsole or serial console or similar may be needed. So you paniced, but we have no idea why, unfortunately. > linux-image-2.6.28-9-generic (Ubuntu+potential fix patch), second try: > Computer > is normal, no errors in dmesg about EXT4, download is successful. well, that sounds good ;) > The failure on first try is a new behavior I have not seen before with this > machine. I therefore am reporting inconclusively that the patch works for me. Ok, thanks for the testing.. It's headed upstream in any case, I think, as it seems pretty well correct from inspection and review. Oh, re: the stat of that inode, for a 10G file that's a metric ton of extents... you should encourage whichever torrent client this is to use the preallocation routines now available (fallocate(2)) - a syscall now, exposed by glibc soon. the file has 600+ extent tree *nodes* each of which holds an extent... you have nearly 105000 extents. That's ... impressive ... :) I use Transmission BitTorrent 1.51 and the application default allocation method. It's not clear to me (for purposes of discussion) what the default allocation method is or what it does to fragment output files. I did not know how many extents to expect for a 10G file. # dd if=cubeqst.chd of=cubeqst.chd~copy 22495032+1 records in 22495032+1 records out 11517456457 bytes (12 GB) copied, 7216.4 s, 1.6 MB/s # filefrag cubeqst.chd~copy cubeqst.chd~copy: 66 extents found cubeqst.chd: 104179 extents found Now I know that only 66 extents are needed, and I am stunned by this fact. Thank you for code and time invested, -Other Eric Eric, thanks for the info. Transmission does no preallocation by default. FWIW, in recent transmission you can set "preallocation": 2, in the configfile and it will call posix_fallocate()... and on recent-enough kernels with recent-enough glibc and a smart-enough filesystem (!) that will do a quick "unwritten extents" style preallocation. On older glibc/kernels/filesystems it will just write out as many zeros as it takes to grow the file to the final size. At least in the unwritten extents case, it should wind up with a very contiguous on-disk layout. However, for ext4, it's not coalescing adjacent extents as they fill in, so although the file iteslf is largely contiguous, it's still tracked by many extent records. In my test on a 4.7GB download, I had 5100 extents tracking a file which was actually laid out in 10 contiguous regions. We should probably fix that :) (FWIW filefrag will show it as contiguous, as it doesn't know the on-disk details, it just looks for contiguous regions. It has no way of knowing that ext4 has many more extent records tracking those contiguous regions). Still, probably worth playing with that option; many extent records + contiguous data is better than many extent records + badly fragmented data. Fixed by commit 395a87bfefbc400011417e9eaae33169f9f036c0 upstream. Thanks for the report and all the debugging info! |