Most recent kernel where this bug did not occur: no later kernel version tested
Distribution: Debian Linux Etch/Sid
Hardware Environment: IBM ThinkPad T23, P3 with 1.13 GHz, 384 MB RAM
Software Environment: Kernel 220.127.116.11, kernel 18.104.22.168 with sws2 2.2.4
I get severe XFS corruption on random occasions. It happens with my Debian
Linux root partition. /home on a different XFS partition has not been affected
yet (lucky me). And OpenSUSE 10 on yet another partition was not affected
either. I only used OpenSUSE to recover my broken Debian partition, though.
I have found no reproducible pattern that triggers it; it just
happens. I got XFS corruption three times within one week:
1) I don't know when it happened, but I noticed it as dpkg complained about
several errors in /var/lib/dpkg/available. I first suspected that it was
corrupted due to some bug in Debian package management, but then found out that
it just contained lots of garbage characters at the end of the file. In the
middle of the file some text was missing or duplicated.
I booted into OpenSUSE 10 and xfs_check reported errors beyond the usual stuff
about old deleted files (agi unlinked node or something like that) that can
easily be fixed. There were just a few errors and I restored the "available"
file via "apt-cache dumpavail".
2) Next time I wanted to start a mind map in kdissert, a mind mapping tool for
KDE, which I used with KDE 3.5.2. I just clicked around a bit, added an item or
two and then the machine became unresponsive and finally X.org (modular
X.org 7 from Debian experimental) died. Then the machine seemed to be locked up
completely. I finally switched it off.
The machine didn't boot again, but GRUB found its menu.lst and I managed to boot
into OpenSUSE 10. OpenSUSE (Kernel 22.214.171.124 or something like that) didn't
manage to mount my root partition: error 990. I do not remember what happened
with xfs_check... I think it reported tons of errors or I started with
xfs_repair straight away. I had to use xfs_repair -L to force log zeroing. It
reported tons of stuff. Unfortunately I did not log it to a file.
Debian Linux then booted up to KDE 3.5 nicely again. I tried to repair it, finally
giving up due to about 200 MB of stuff in lost+found. I restored a backup from
my external USB hard disk via rsync.
This was yesterday. I updated my system from 126.96.36.199 to 188.8.131.52, before I had
184.108.40.206 in use and the third crash happened.
martin@deepdance:~ -> dpkg -l | grep kdissert
ii kdissert 1.0.5.debian-3 mindmapping tool
(I doubt it's related to kdissert)
3) Today XFS got corrupted again. I had an extensive apt-get update running to
make up for the 3 weeks since the backup I had restored, and it also installed
a new koffice version (release 1.5). I wanted to try out kword, but it crashed
straight away. I tried from the console: bash told me "error while starting the
executable". I did apt-get --reinstall install kword, then it worked.
Ok, once again OpenSUSE 10 and xfs_check. Errors again. Quite a few. This time
I made a log file. Then xfs_repair, also with log file. I attach those two to
this bug report.
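For other readers, a minimal sketch of how both outputs can be captured in one go. The function name check_and_repair, the log file names and the device path in the usage line are only illustrative assumptions; the filesystem has to be unmounted, for example by booting another installation as described above.

```shell
# Hypothetical helper: run xfs_check and then xfs_repair on an unmounted
# XFS partition and keep the output of each tool in its own log file.
check_and_repair() {
    dev=$1            # partition to check (placeholder, pass your own)
    logdir=${2:-.}    # directory for the log files, default: current dir

    # Record the exit status without aborting when a tool reports errors.
    if xfs_check "$dev" > "$logdir/xfs_check.log" 2>&1; then
        check_status=0
    else
        check_status=$?
    fi

    if xfs_repair "$dev" > "$logdir/xfs_repair.log" 2>&1; then
        repair_status=0
    else
        repair_status=$?
    fi

    echo "xfs_check exit=$check_status, xfs_repair exit=$repair_status"
}

# Usage (as root, partition unmounted; the device name is a placeholder):
#   check_and_repair /dev/hdaN /tmp
```

The two log files can then be attached to a report as they are.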
One thing I found was that at least with 2) and 3) I had an empty
file /core in the corrupted XFS filesystem. I thought about the possibility of
a kernel crash that overwrote XFS in-memory data structures, but I learned that
the Linux kernel itself usually does not core dump to the filesystem.
On occasion 3 I made sure when compiling 220.127.116.11 that I disabled core dumping
for ELF files. I still got that empty /core in the corrupted Debian root
Steps to reproduce:
I am not really interested in reproducing this ;-). Well, I have no idea. Probably
use a similar kernel, similar hardware and try to use that system productively
for a while.
I have not had any XFS corruption during my usage of the various 2.6.15 kernels
I had in use.
This bug report is probably related to:
I will revert to 18.104.22.168 for now or even compile 22.214.171.124 as I can not afford
the time to restore my Debian system from scratch once again. I will however
restore it from the backup once again to make absolutely sure that it is
I know I probably won't be of much help debugging this, but I just don't have
the resources to do fs debugging with my laptop that is in heavy productive use
and at least at home I have no spare system either.
I may try again with 2.6.17 as soon as I am convinced that it's stable enough.
Created attachment 7845 [details]
output of xfs_check for occasion 3 from the bug report
Created attachment 7846 [details]
output of xfs_repair for occasion 3 from the bug report
I suspected defective RAM on my IBM ThinkPad, so I ran memtest86 for an hour.
It reported no errors. I have 128 MB that were originally in that ThinkPad +
256 MB of Kingston RAM that is made with timings especially for the IBM ThinkPad T23.
I will now be running 2.6.15 again and report whether I get corruption again. I
think I won't.
The other bugreport I think this is related to with complete link is:
Does bugzilla generate a link for this? bug #6180
I know this is not a plain vanilla kernel, because I use sws2, but I have used
2.6.15 with sws2 for months without getting XFS corruption three times a week. I
would love to test it without sws2, but as I mentioned I cannot afford the time
to restore my backup again and again.
Created attachment 7847 [details]
kernel config of 126.96.36.199 under which I had occasion 2 and 3 from the bug report
Created attachment 7848 [details]
kernel config of the latest 2.6.15 kernel I had in use
I did not have an XFS corruption problem with that kernel, nor with earlier
Created attachment 7849 [details]
output of lspci -nvv for my IBM ThinkPad T23
Classic symptoms of the write cache being enabled on your drive.
Switch it off, or try a recent kernel with the -o barrier option
(this will be on by default in 2.6.17).
Many thanks for your prompt answer.
Indeed, write cache should have been switched on, according to this
root@deepdance:~ -> hdparm -i /dev/hda | grep WriteCache
AdvancedPM=yes: mode=0x80 (128) WriteCache=enabled
root@deepdance:~ -> cat /etc/hdparm.conf
mult_sect_io = 16
write_cache = on
dma = on
apm = 0
acoustic_management = 128
io32_support = 3
keep_settings_over_reset = on
interrupt_unmask = on
I know no way to query it directly.
I switched it to off immediately using hdparm -W 0 and set it to off in the
hdparm.conf file as well.
I take your comment to mean that this should be set to off with kernel 2.6.15 as well.
Well, I know SmartFilesystem for AmigaOS relies on data that is flushed to disk
being written immediately, before the call returns, in order to ensure a certain
order for atomic writes. So this is the case with XFS as well?
Well ok, Documentation/block/barrier.txt sheds some light on this - just for
other readers of this bug:
"There are four cases,
i. No write-back cache. Keeping requests ordered is enough.
ii. Write-back cache but no flush operation. There's no way to
gurantee physical-medium commit order. This kind of devices can't to
iii. Write-back cache and flush operation but no FUA (forced unit
access). We need two cache flushes - before and after the barrier
Ok, so either barriers on or write cache off. Got this.
"-o barrier" is a mount option? I do not find it documented anywhere.
Any hints on why it may have worked quite well with 2.6.15 but I got three
corruptions in one week with 2.6.16? Just coincidence or were there some write
cache related changes in 2.6.16? During my time with 2.6.15 I had quite a few 3D
savage DRI driver lockups without any data loss.
I lowered severity to normal as it seems from your comment that it is a
misconfiguration on my side. Feel free to raise it again if it seems appropriate
to you. If, according to hdparm -i /dev/hda, the drive defaults to
WriteCache enabled, this may have severe implications. At least the default
setting should never burn any data --> kernel 2.6.17 with -o barrier on by
I will test kernel 188.8.131.52 with write cache off in hdparm and when that works,
I may try with barrier on and write cache on - I am still a bit scared ATM.
When both work, this bug can be closed. A hint in the xfs.txt readme would
still be in order IMHO until 2.6.17 is standard.
Hard drive in my laptop:
root@deepdance:~ -> smartctl -i /dev/hda
smartctl version 5.34 [i686-pc-linux-gnu] Copyright (C) 2002-5 Bruce Allen
Home page is http://smartmontools.sourceforge.net/
=== START OF INFORMATION SECTION ===
Model Family: Hitachi Travelstar 5K80 family
Device Model: HTS548060M9AT00
Serial Number: MRLB21L4G6G3DC
Firmware Version: MGBOA50A
User Capacity: 60.011.642.880 bytes
Device is: In smartctl database [for details use: -P show]
ATA Version is: 6
ATA Standard is: ATA/ATAPI-6 T13 1410D revision 3a
Local Time is: Thu Apr 13 14:18:21 2006 CEST
SMART support is: Available - device has SMART capability.
SMART support is: Enabled
How do I query the drive whether write caching is on or off to make sure that
hdparm -W 0 /dev/hda or my entry in /etc/hdparm.conf actually worked the way it
should? hdparm -W does not seem to be able to query the drive status regarding
write cache. Any hints?
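One possible way, offered as a sketch rather than a definitive answer: hdparm -I (capital I) asks the drive itself for its identify data, and in the versions I know its feature list marks currently enabled features with a leading asterisk. Assuming that output format, the state can be read like this. The function name write_cache_state is made up for illustration.

```shell
# Reads `hdparm -I` style output on stdin and prints the write cache state.
# Assumption: enabled features carry a leading "*" in the feature list,
# e.g. "   *    Write cache"; without the "*" the feature is off.
write_cache_state() {
    awk '
        /Write cache/ {
            if ($1 == "*") print "enabled"; else print "disabled"
            found = 1
            exit
        }
        END { if (!found) print "unknown" }
    '
}

# Usage (as root):
#   hdparm -I /dev/hda | write_cache_state
```

Running it once before and once after hdparm -W 0 /dev/hda should show whether the setting actually reached the drive.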
One additional thought: Can XFS detect common cases where write operation is
dangerous (ATM only write cache on with barrier off comes to my mind) and do a
readonly mount in that case issuing an error message explaining why it does so?
IMHO a filesystem should follow that better safe than sorry strategy where
possible when it comes to the risk of data corruption. And at least here it
shouldn't have any serious performance impact.
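Until the filesystem does something like this itself, the same rule can be sketched in user space. The function below is purely hypothetical. It takes the write cache state and the mount options as strings and assumes that barriers are off unless "barrier" is given explicitly, which according to the comment above holds before 2.6.17.

```shell
# Hypothetical check following the "better safe than sorry" rule:
# writing counts as safe only if the write cache is known to be disabled
# or the filesystem is mounted with barriers.
# $1: write cache state ("enabled", "disabled" or anything else = unknown)
# $2: comma-separated mount options
xfs_write_risk() {
    cache=$1
    opts=$2

    if [ "$cache" = "disabled" ]; then
        echo safe
        return 0
    fi

    case ",$opts," in
        *,nobarrier,*) echo unsafe ;;
        *,barrier,*)   echo safe ;;
        *)             echo unsafe ;;  # assumes barriers are off by default
    esac
}

# Usage:
#   xfs_write_risk enabled rw,barrier
```

An unknown cache state is deliberately treated like an enabled cache.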
I tested for about one week in total with write cache off and kernel 2.6.16 (with
184.108.40.206 and 220.127.116.11). I had no further XFS crashes.
I am now using 2.6.15 again, because ALSA sound output doesn't work after suspend
to disk with 2.6.16 (a different bug, I know).
Just for information for readers of this bugreport:
- linux-xfs mailing list:
  912426 - write barrier support
  912426 - disable barriers by default
- linux-kernel mailing list:
  enable XFS write barrier
There is a FAQ entry about the write cache issue available now in the XFS FAQ
I would like to try it with kernel 2.6.17 and then also with enabled write caches
again, but I want to wait a little bit longer until I switch to 2.6.17, since
it's rather new at the moment. (After all, it's a productively used system, no test
Humm, can anybody explain or give some pointers about this write cache
I mean - I don't seem to understand how the corruption can occur
while the system is *continuously* online.
I am assuming that after each 'event' the fs was repaired and the machine was rebooted.
[or maybe that isn't the case here?]
There is some kernel documentation about the write barrier stuff. I have no
unpacked kernel 2.6.16 at hand here currently, but you should be able to find
it by using find -name "*barrier*" or grep -ir "barrier" *. That explains the
issue quite nicely, as does the SGI FAQ I posted before.
But actually, even after reading it I do not understand this issue completely.
I do not understand why I got three crashes with 2.6.16 in one week while with
2.6.15 it worked quite stably. It was not perfect with 2.6.15, but at least I
only rarely got XFS corruption after a DRI savage driver crash or when suspend
to disk did not work correctly - when the machine was not online, as you say.
Actually XFS survived most of those crashes nicely. With 2.6.16 at least once -
when I used kdissert - the kernel just went down while I was using the machine
regularly (no 3D stuff and no suspend to disk issues). Even if kdissert /
KDE somehow managed to crash X.org, the kernel should still have been alive and
X.org should have been restarted. So either kernel 2.6.16 was a lot more unstable
than 2.6.15 in the beginning, or XFS had an issue with enabled write cache that
happened while it was running and not only on power outages and kernel crashes.
I had no kernel crashes during regular use with 2.6.16 once I disabled the write
cache, which may point to the second alternative.
I repaired the filesystem after each event, either by using xfs_repair or, when
the damage was too big, by restoring a backup via rsync.
Anyway, I think it's best to test with 2.6.17 again with barrier functionality
and write cache enabled. I will do so once 2.6.17 has matured a bit more and I do
not hear about new issues, because this is a production machine and I lose quite
some time on each filesystem crash that happens.
Hello, Ok, I had a three week test period with 18.104.22.168 + the xfs-fix for
kernel bug #6757 (that one is really needed and IMHO should go into a stable
kernel patch as soon as possible!) + sws2 2.2.6. One week with disabled write
caches, one week with enabled write caches and barrier mount option mentioned
in /etc/fstab, one week with enabled write caches and barrier mount option not
mentioned in /etc/fstab, thus specifically testing whether it's really the
No problems. xfs_check on the root partition showed three agi unlinked buckets
that xfs_repair fixed, but from what I know these are no real defects. If that
shouldn't happen, though, something still needs to be fixed. Please tell me if
that's the case.
In addition, I ran three tests in which I switched off the computer while
writing data to an XFS partition:
1) rsync -a /usr/src /destination/partition
2) ddrescue /dev/hda1 /destination/partition
3) 1 + 2 + rm -rf /that/usr/src/directory-from-test-one. The rm job was
completed a second before I switched off the laptop, but I am sure that the
other jobs were still running
Result: No problems. No single line of output in xfs_check after each of the
So I am pretty much convinced that XFS is now working really stably with write
caches, provided that the patch from kernel bug #6757, which is unrelated to the
write cache issue, is applied.
Thank you, guys!