Guys, I have experienced this problem many times since I switched my file system to XFS last year. The problem is that after copying some files to an XFS partition, if there is an unexpected power cut, then in most cases those newly copied files disappear completely, as if they had never been created on the partition in the first place. The hard drives are internal ones without RAID. Normally I lose 4 or 5 GB of data because of this problem. But yesterday I copied 4 folders containing more than 200 files, with a total size of 20 GB, to a new XFS partition. The copy was done at about 8:00 AM; then in the evening, at 6 PM, I got a power cut and all of these newly copied files were missing again. During the day there were no other read/write operations on this partition. Clearly this problem cannot be the typical caching problem, because the cache is far too small to hold that much data. The only thing I can imagine is that XFS always keeps metadata in cache and does not flush it until the partition is unmounted, but this is just my guess.
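For what it's worth, the usual userspace defence against exactly this scenario is to fsync both the copied file and its parent directory before trusting the data to survive a power cut; `sync` after a bulk copy does the same thing wholesale. A minimal sketch (the `durable_copy` helper is hypothetical, not part of any tool mentioned in this thread; assumes Linux semantics where fsync on a directory fd persists the directory entry):

```python
import os
import shutil


def durable_copy(src, dst):
    """Copy src to dst and try to make the result survive a power cut.

    Hypothetical helper for illustration: fsync the file's data, then
    fsync the parent directory so the new directory entry itself is
    persisted, not just cached metadata.
    """
    shutil.copyfile(src, dst)
    # Flush the file's data blocks to stable storage.
    with open(dst, "rb") as f:
        os.fsync(f.fileno())
    # Flush the directory entry; without this, the file can still
    # vanish after an unclean shutdown even though its data was written.
    dirfd = os.open(os.path.dirname(os.path.abspath(dst)) or ".", os.O_RDONLY)
    try:
        os.fsync(dirfd)
    finally:
        os.close(dirfd)
```

This does not excuse the filesystem from eventually writing the metadata out on its own, of course; it only bounds the window in which a power cut loses the files.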
Hi guys, I have seen this behaviour too. I first noticed it late in the 3.5.x builds (currently I am running 3.7.0-rc8). It has been exacerbated by the addition of more RAM in my system: I now have 16 GB vs. the 8 GB I had previously; the RAM upgrade was to alleviate heavy swapping under my machine's workload. In my case xfs_repair was failing, claiming that the journal was corrupted. Running it with -L to zero the journal gets me a working filesystem, but with heavy corruption in many files, even files with only minor changes (on the order of only a few bytes). In an HA situation this could be catastrophic. It feels like journal entries are ending up in the buffer cache and not getting written out in a reasonable time frame. I say this because now that there is extra RAM free to be used as buffer cache, the issue is much worse; previously, when I was running with less RAM and frequently ending up with more than moderate swap usage, I did not see this issue with the same frequency. The improper shutdowns in my case are caused by a buggy 3G dongle driver not behaving correctly with suspend. If you need me to do testing, I have a bunch of VMs I can use to look for the exact triggers, but it seems any files changed since the last sync get damaged.
Lin Li - What kind of storage are you using? Malcolm - "In my case xfs_repair was failing, claiming that the journal was corrupted." This makes me think that your storage is not set up properly; that's the classic result of a drive write cache that vaporized on a power loss without having been flushed during operation via the barrier mechanism. Both: http://xfs.org/index.php/XFS_FAQ#Q:_What_information_should_I_include_when_reporting_a_problem.3F -Eric
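The two things Eric is asking about can be checked from userspace: whether the filesystem was mounted with `nobarrier` (which disables the cache-flush protection he describes), and what the mount options look like in general. A rough sketch, assuming Linux and the standard six-field /proc/mounts format (the `mounts_file` parameter is only there so the parser can be exercised against a sample file):

```python
import os


def mount_options(mount_point, mounts_file="/proc/mounts"):
    """Return the option list for the filesystem at mount_point, or None.

    Parses the standard /proc/mounts line format:
    device mountpoint fstype options dump pass
    """
    with open(mounts_file) as f:
        for line in f:
            fields = line.split()
            if len(fields) >= 4 and fields[1] == mount_point:
                return fields[3].split(",")
    return None


# Example check on a live Linux system (mount point "/" is an assumption).
if os.path.exists("/proc/mounts"):
    opts = mount_options("/")
    if opts and "nobarrier" in opts:
        print("WARNING: barriers disabled; a power loss can corrupt the log")
```

The drive's volatile write cache itself can be inspected with `hdparm -W /dev/sdX`; if it is enabled and barriers are off, a corrupted log after power loss is the expected outcome, not a filesystem bug.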
Eric: No, my storage is quite fine. Please feel free to check my email address; it might give you a hint about where I work and why I might know a thing or two about setting up XFS storage. The actual physical setup is one hard drive in a laptop, nothing tricky. And there is no power-off involved, so the disk write cache is not at play. The issue is resolved in later kernel versions. I am running a 3.9-rc3 kernel at the moment, but it was resolved in kernels later than ~3.8ish. I can try to track down an older kernel to test and collect the information you need, but I'm pretty sure I saw a discussion and a fix for it. And I just found it: http://oss.sgi.com/archives/xfs/2012-12/msg00307.html This is the exact behaviour pattern, and it explains the large amounts of metadata I was seeing in RAM when doing kernel dumps. And in this email Dave confirms what I was seeing, the exact time period, and the kernel it was fixed in: http://oss.sgi.com/archives/xfs/2012-12/msg00329.html Also, the OP of this email thread is the OP of this bug report: http://oss.sgi.com/archives/xfs/2012-12/msg00089.html It does feel like this is not the right place to lodge these bug reports, considering (a) the delay in replies and (b) it got resolved on the mailing list already. Not having a go, just an observation; you guys are insanely busy.
I've seen people @$(STORAGE_CORP).com get things wrong before. Receiving a paycheck does not always impart knowledge. OTOH, if you're at @sgi.com, source of all storage knowledge, why log the bug externally? ;) Anyway - ok, glad it's resolved, but even so, the bug you pointed at should not result in a corrupted log, and you mentioned you had one. Hence the question about storage (and/or nobarrier mount options, and/or flaky hardware in a laptop, which can happen to the best of us).
Eric: All good. There might have been more than one bug at play. The only reason I added to this was that it looked/sounded like the exact issue I was seeing, where I was seeing it. No point in having two bug reports in two different places; combining efforts was what I was aiming for. Plus, the XFS/kernel combo I use most days wasn't showing this issue. Anywho, thanks for the speedy reply! :D
Well, you're right that the kernel.org bz is not a great place for bugs, unfortunately. We can't manipulate (close, reassign) the bugs - don't know who can, so things stay open forever, and largely get ignored - it's just not a good tool for us. We should probably see about fixing it or eliminating it.
Also all good! I'll stick to mailing lists like I do for everything else! Thanks again for following this up!