Most recent kernel where this bug did not occur: Not easy to say, because I had lots of XFS corruption problems in the last months, most prominently kernel bug #6380. But I also had XFS crashes with 2.6.15.7 with *disabled write caches*. This slight kind of corruption, however, seems to be new with 2.6.17.1. I am not absolutely sure, but I think I did not have it with 2.6.16.11, 2.6.16.4 and 2.6.15.7. Well, better a slight corruption without lost+found files than what I had before ;)

Distribution: Debian Etch / Sid / Experimental (almost nothing from experimental)

Hardware Environment: IBM ThinkPad T23, Pentium III 1.13 GHz, 384 MB RAM (lspci output attached)

Software Environment: 2.6.17.1 + sws2 patches

Problem Description: XFS seems to get slightly corrupted after some days. Right now I have only seen it twice: once last Friday and once today.

Steps to reproduce: Frankly, I have no idea; it just happens. But it only happened on the root partition. /home has fortunately been unaffected.

Details: It first happened last Friday, see my post on the linux-xfs mailing list: "Re: xfs crash with linux 2.6.15.7 and disabled write caches (long)", Message-Id: <200606232201.29440.Martin@lichtvoll.de>

It happened again today and I have quite some diagnostic data at hand: I halted the computer regularly after some suspend/resume cycles (Software Suspend 2), because it seems this way I can detect filesystem corruption more easily (for example, KDE crashing upon shutdown has been a good indicator). I got some kernel messages on tty0, which I found in /var/log/syslog later on.
As a sample, here is the first occurrence I found in the log. I attach the whole portion of the log file to this bug report:
---------------------------------------------------------------
Jun 27 23:34:00 deepdance shutdown[18694]: shutting down for system halt
Jun 27 23:34:02 deepdance kernel: 0x0: 00 00 00 7e 1f 69 00 00 17 62 03 00 ff ff 07 00
Jun 27 23:34:02 deepdance kernel: Filesystem "hda5": XFS internal error xfs_da_do_buf(2) at line 2212 of file fs/xfs/xfs_da_btree.c.  Caller 0xc020b60d
Jun 27 23:34:02 deepdance kernel:  <c021ea5b> xfs_corruption_error+0x10b/0x140  <c020b60d> xfs_da_read_buf+0x3d/0x50
Jun 27 23:34:02 deepdance kernel:  <c024f7c1> kmem_zone_alloc+0x61/0xe0  <c020a869> xfs_da_buf_make+0x159/0x160
Jun 27 23:34:02 deepdance kernel:  <c020b4bb> xfs_da_do_buf+0x8bb/0x960  <c020b60d> xfs_da_read_buf+0x3d/0x50
Jun 27 23:34:02 deepdance kernel:  <c020b60d> xfs_da_read_buf+0x3d/0x50  <c0214ebc> xfs_dir2_leaf_lookup_int+0x6c/0x2d0
Jun 27 23:34:02 deepdance kernel:  <c0214ebc> xfs_dir2_leaf_lookup_int+0x6c/0x2d0  <c01f8193> xfs_bmap_last_offset+0x133/0x160
Jun 27 23:34:02 deepdance kernel:  <c021564d> xfs_dir2_leaf_lookup+0x2d/0xc0  <c021098a> xfs_dir2_lookup+0x13a/0x160
Jun 27 23:34:02 deepdance kernel:  <c0148a16> generic_file_buffered_write+0x3b6/0x6e0  <c02435ac> xfs_dir_lookup_int+0x4c/0x150
Jun 27 23:34:02 deepdance kernel:  <c017843f> do_lookup+0x5f/0x180  <c0247c9e> xfs_lookup+0x7e/0xc0
Jun 27 23:34:02 deepdance kernel:  <c02576bc> xfs_vn_lookup+0x4c/0xa0  <c0178533> do_lookup+0x153/0x180
Jun 27 23:34:02 deepdance kernel:  <c0178dfd> __link_path_walk+0x89d/0xfa0  <c0256e86> xfs_vn_permission+0x26/0x30
Jun 27 23:34:02 deepdance kernel:  <c017955c> link_path_walk+0x5c/0x100  <c0105d5a> do_gettimeofday+0x1a/0xd0
Jun 27 23:34:02 deepdance kernel:  <c011d1ec> sys_gettimeofday+0x3c/0xb0  <c0179a27> do_path_lookup+0xa7/0x270
Jun 27 23:34:02 deepdance kernel:  <c017712f> getname+0xdf/0x110  <c017a26c> __user_walk_fd+0x3c/0x70
Jun 27 23:34:02 deepdance kernel:  <c0166daa> sys_faccessat+0xfa/0x180  <c0105d5a> do_gettimeofday+0x1a/0xd0
Jun 27 23:34:02 deepdance kernel:  <c011d1ec> sys_gettimeofday+0x3c/0xb0  <c0166e4f> sys_access+0x1f/0x30
Jun 27 23:34:02 deepdance kernel:  <c0103027> syscall_call+0x7/0xb
---------------------------------------------------------------
Write caches have been disabled all the time:
---------------------------------------------------------------
root@deepdance:~ -> hdparm -I /dev/hda | grep -i "write cache"
           Write cache
(No asterisk in front of it means disabled.)
---------------------------------------------------------------
Mount options are: defaults,barrier,logbufs=8

I booted into a SUSE 10.1 installation with the SUSE kernel 2.6.16.13-4-default. Mount options for the Debian partition are the same as above. Write caches should have been disabled by an init script I wrote... what is strange is the output of hdparm, though:
---------------------------------------------------------------
deepdance:~ # hdparm -W0 /dev/hda
/dev/hda:
 setting drive write-caching to 0 (off)
 HDIO_SET_WCACHE(wcache) failed: Success
---------------------------------------------------------------
(This does not happen under Debian. Maybe I should try blktool wcache off. I can verify under SUSE whether the write cache has actually been disabled using hdparm -I /dev/hda as well, but I believe it has; also, the barrier mount option is used, so things should be safe anyway.)
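For reference, the mount options above correspond to an /etc/fstab entry roughly like the following. This is a sketch only: the device and mount point are my assumptions based on the "hda5" root partition mentioned in the log, not copied from the actual fstab.
---------------------------------------------------------------
/dev/hda5    /    xfs    defaults,barrier,logbufs=8    0    1
---------------------------------------------------------------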
xfs_check reported errors like this (full output attached):
---------------------------------------------------------------
deepdance:~ # xfs_check /dev/hda5
bad free block nvalid/nused 6/-1 for dir ino 5012689 block 16777216
missing free index for data block 0 in dir ino 5012689
missing free index for data block 1 in dir ino 5012689
missing free index for data block 2 in dir ino 5012689
missing free index for data block 3 in dir ino 5012689
missing free index for data block 4 in dir ino 5012689
missing free index for data block 5 in dir ino 5012689
bad free block nvalid/nused 21/-1 for dir ino 33641428 block 16777216
---------------------------------------------------------------
xfs_repair was able to repair it and printed messages like this:
---------------------------------------------------------------
empty data block 53 in directory inode 55176185: junking block
empty data block 54 in directory inode 55176185: junking block
empty data block 56 in directory inode 55176185: junking block
empty data block 58 in directory inode 55176185: junking block
empty data block 60 in directory inode 55176185: junking block
empty data block 63 in directory inode 55176185: junking block
empty data block 64 in directory inode 55176185: junking block
free block 16777216 entry 52 for directory ino 55176185 bad
rebuilding directory inode 55176185
free block 16777216 for directory inode 48409589 bad nused
rebuilding directory inode 48409589
---------------------------------------------------------------
I already suspected hardware problems and tried:

1) badblocks -s -v -n -o /home/martin/XFS-Probleme/badblocks.txt /dev/hda5
   It found no bad blocks.

2) smartctl -t long /dev/hda
   It completed successfully.

3) memtest86 overnight
   It found 0 errors.

So I am pretty sure that the hardware is fine.

Regards, Martin
This bug may be related to kernel bug #6737
Created attachment 8427 [details] the corruption errors that XFS wrote to syslog
Created attachment 8428 [details] xfs_check and xfs_repair output
Created attachment 8429 [details] output of lspci and lspci -vvn
Created attachment 8430 [details] configuration of the kernel I used (2.6.17.1 with sws2-2.2.6)
I will now reboot into a 2.6.17.1 without Software Suspend 2. I will not use Software Suspend 2 nor the new userspace software suspend for a while, to rule out that this is suspend related. Actually I highly doubt that it is suspend related, but it still makes sense to test it.
Make sure you're using Mandy's patch that I sent to the list earlier, too... I'll send that to the -stable folks today. cheers.
Created attachment 8452 [details] patch that might fix the issue

I am currently testing a patch that Nathan Scott sent to the stable kernel team and apparently CCed to me. From what I understand, this patch may fix the issue I am seeing. For now all seems fine, but it is too early to say anything definite. Here is that patch, including the description of what it does:

Fix nused counter. It's currently getting set to -1 rather than getting decremented by 1. Since nused never reaches 0, the "if (!free->hdr.nused)" check in xfs_dir2_leafn_remove() fails every time and xfs_dir2_shrink_inode() doesn't get called when it should. This causes extra blocks to be left on an empty directory, and the directory is unable to be converted back to inline extent mode.

Signed-off-by: Mandy Kirkconnell <alkirkco@sgi.com>
Signed-off-by: Nathan Scott <nathans@sgi.com>

--- a/fs/xfs/xfs_dir2_node.c	2006-06-28 08:20:56.000000000 +1000
+++ b/fs/xfs/xfs_dir2_node.c	2006-06-28 08:20:56.000000000 +1000
@@ -972,7 +972,7 @@ xfs_dir2_leafn_remove(
 	/*
 	 * One less used entry in the free table.
 	 */
-	free->hdr.nused = cpu_to_be32(-1);
+	be32_add(&free->hdr.nused, -1);
 	xfs_dir2_free_log_header(tp, fbp);
 	/*
 	 * If this was the last entry in the table, we can
Mandy's patch seems to fix this issue. I have had three days of production use without any corruption, and an rsync backup with lots of file and directory deletions (SUSE 10.0 -> 10.1 update on one partition) also worked well. Thanks, Mandy and Nathan! This should really go into the next stable kernel release.
Another 24 days without corruption. This patch really seems to be fine!
2.6.17.7 contains the patch. So I am closing this. Kudos to the stable kernel team for finally including it!
Just some additional information for those who were hit by this bug:

Mandy Kirkconnell, XFS: corruption fix:
http://marc.theaimsgroup.com/?t=115315520200004&r=1&w=2

XFS FAQ, "What is the issue with directory corruption in Linux 2.6.17?":
http://oss.sgi.com/projects/xfs/faq.html#dir2

Barry Naujock, Review: xfs_repair fixes for dir2 corruption, 28 July 2006:
http://oss.sgi.com/archives/xfs/2006-07/msg00374.html