Bug 14826

Summary: jdm-20002 reiserfs_xattr_get: Invalid hash for xattr
Product: File System
Reporter: Christian Kujau (kernel)
Component: ReiserFS
Assignee: ReiserFS developers team (reiserfs-devel)
Status: CLOSED OBSOLETE
Severity: normal
CC: alan, andrex, gregsurbey, jeffm, kernel, kernel, marco.gatti, ohnobinki
Priority: P1
Hardware: All
OS: Linux
Kernel Version: 2.6.32
Subsystem:
Regression: Yes
Bisected commit-id:

Description Christian Kujau 2009-12-17 09:00:02 UTC
Note: this has been reported on reiserfs-devel by me as well:
http://www.spinics.net/lists/reiserfs-devel/msg02009.html


After upgrading to 2.6.32, the following messages appear in the logs:

REISERFS warning (device xvda3): jdm-20002 reiserfs_xattr_get: Invalid hash for xattr (user.{http:%2F%2Ftwistedmatrix.com%2Fxml_namespace%2Fdav%2F}getcontentmd5) associated with [765 43775 0x0 SD]

The filesystem was created a long time ago, and the mount options haven't changed either:

$ grep reiser /proc/mounts 
/dev/xvda3 /var reiserfs rw,nosuid,nodev,noexec,relatime,user_xattr,notail 0 0

So far, the messages are not flooding the logs but are triggered a few times a day:

$ uptime
 09:32:48 up 2 days,  6:30, 10 users,  load average: 0.95, 0.31, 0.11
$ dmesg | grep -c jdm-20002
35
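
To see which devices and attributes are affected rather than just a total, the warnings can be grouped. This is only a minimal sketch that assumes the message format shown above; `summarize_jdm20002` is a hypothetical helper name, not an existing tool:

```shell
# Hypothetical helper: reads kernel log lines on stdin and prints
# "<count> <device> <xattr name>" for each distinct device/xattr pair.
summarize_jdm20002() {
    # Keep only jdm-20002 warnings and reduce them to "device xattr".
    sed -n 's/.*REISERFS warning (device \([^)]*\)): jdm-20002 reiserfs_xattr_get: Invalid hash for xattr (\(.*\)) associated with .*/\1 \2/p' |
        sort | uniq -c |
        sed 's/^ *//'   # strip the padding uniq -c puts before the count
}
```

It would be used as `dmesg | summarize_jdm20002`.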

A full dmesg and the .config is available on:
http://nerdbynature.de/bits/2.6.33-git/jdm-20002/

Although this is cosmetic (right?) and probably fixable by removing/resetting the xattr flags, this should not happen.
Comment 1 Jethro Beekman 2010-02-14 00:37:47 UTC
I can confirm that these messages keep popping up, seemingly at random, since I upgraded to 2.6.32. This is a *huge* problem when using ACLs, as the file becomes inaccessible.

$ cat thefile.xlsx
cat: thefile.xlsx: Input/output error
$

And in the logs:
Feb 14 01:24:58 victoria [1149610.665902] REISERFS warning (device dm-0): jdm-20002 reiserfs_xattr_get: Invalid hash for xattr (system.posix_acl_access) associated with [1953724787 1882090853 0x7869736f UNKNOWN]

I first noticed this because my automated backups failed. At first the problem seemed to affect a given file only for a short period: the backup would not complain about the same file twice in a row, and I can access those files now. For some time now, however, I have been getting the error persistently on one particular file.
Comment 2 Greg Surbey 2010-04-12 18:51:53 UTC
My organization's main file server started experiencing this issue when we upgraded from 2.6.24-gentoo-r8 to 2.6.31-gentoo-r6.

This is how the error manifests itself on our server's console:
fs1 Equipment # pwd
/home/share/Projects/Equipment
fs1 Equipment # cat Equipment\ List.xls
cat: Equipment List.xls: Input/output error
fs1 Equipment # dmesg
REISERFS warning (device dm-0): jdm-20002 reiserfs_xattr_get: Invalid hash for xattr (system.posix_acl_access) associated with [1953724787 1882090853 0x7869736f UNKNOWN]

We can fix each of these errors individually:
# For directories we need to run this extra first step
# setfattr -x system.posix_acl_default Equipment\ Directory
setfattr -x system.posix_acl_access Equipment\ List.xls
# copy permissions from current directory to fix file
getfacl . | setfacl -M - Equipment\ List.xls

For many instances within a directory structure:
# This will find corrupted ACLs recursively starting from where you are
getfattr -Rd -m '.*' . 2>&1 | grep 'Input/output error' | cut -d: -f1

I slapped together a poor man's recursive function since setfattr does not support a '-R' option:
# first, find a good ACL and then back it up
getfacl . > backup.acl
# These next two commands are almost equivalent to "setfacl -Rb ."
# Note: you may need to run these two commands several times to fix everything
# Note: -print0 / -0 pass filenames NUL-separated, so names with apostrophes or spaces are handled
find . -print0 | xargs -0 -I{} setfattr -x system.posix_acl_default {}
find . -print0 | xargs -0 -I{} setfattr -x system.posix_acl_access {}
# now we recursively restore all the ACLs from the good ACL
setfacl -RM backup.acl .

Google shows me that this issue is not new:
http://old.nabble.com/jdm-20002-reiserfs_xattr_get:-Invalid-hash-for-xattr-td26786353.html

It might be caused by a locking error:
http://markmail.org/thread/zrrenmncxisqlooc

Looking towards a permanent solution, I e-mailed Jeff Mahoney at SUSE who was the last one to work with the reiserfs xattr code:
http://ftp.suse.com/pub/people/jeffm/reiserfs/xattr-rework/

Linus committed Jeff's new code to the mainline "stable" kernel on March 30, 2009:
http://kerneltrap.org/mailarchive/git-commits-head/2009/3/30/5337714

And it got implemented with this patch:
http://www.kernel.org/pub/linux/kernel/v2.6/snapshots/patch-2.6.29-git7.log
Comment 3 Marco Gatti 2010-04-13 08:55:39 UTC
It seems to me that ACLs went bad with the release of kernel 2.6.30.
Check this discussion if you haven't: http://marc.info/?t=126979018700003&r=1&w=2

In particular this post with a simple test case and results: http://marc.info/?l=reiserfs-devel&m=127056179512450&w=2

Is there something I can do to help? I mean just as a Linux user...
Comment 4 Greg Surbey 2010-04-13 15:41:09 UTC
Considering that this bug eventually leads to completely unrepairable filesystem corruption, I strongly recommend that Linux revert to the old pre-2.6.30 reiserfs xattr code for the time being.  I realize that these code improvements are a worthy goal; however, this is supposed to be a stable kernel release with a stable filesystem.  Losing data is bad enough, but losing it silently and randomly is even worse.  Unless this issue is fixed today, this patch-set needs to go back to testing/development mode.
Comment 5 Jeff Mahoney 2010-04-13 18:58:59 UTC
I'm looking into this today. Once I've identified the source, I can write something up to repair the damage.
Comment 6 Jeff Mahoney 2010-04-13 19:18:49 UTC
Ok, I've found the problem. Turns out expose_privroot is useful after all.

The loss isn't random. It affects xattrs that have been shrunk. The test case in comment #2 demonstrated the issue perfectly because it removes a single ACL, which shrinks the xattr, before it removes all of them. The issue is that when an xattr is shrunk, the size of the header wasn't accounted for, so it's shrunk by 8 bytes too many.
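
The arithmetic can be sketched roughly as follows. This only models the size calculation described above and is not the kernel code: the stored object is treated as an 8-byte header followed by the value, and all names (`buggy_size`, `fixed_size`, `truncate_to`) are made up for illustration.

```shell
# Illustration only: models the size arithmetic, not the reiserfs code.
HEADER_SIZE=8   # taken from the "8 bytes too many" figure above

# Hypothetical: target size when the header is forgotten (the bug).
buggy_size() { echo "$1"; }

# Hypothetical: target size when the header is accounted for.
fixed_size() { echo $(( $1 + HEADER_SIZE )); }

# Cut a simulated "header + value" string down to the given size.
truncate_to() { printf '%s' "$1" | cut -c1-"$2"; }

header='HHHHHHHH'               # 8 placeholder header bytes
value='ABCDEFGHIJKLMNOPQRST'    # the new, smaller 20-byte value
stored="${header}${value}"

# Buggy shrink: the last 8 bytes of the value (MNOPQRST) are cut off.
truncate_to "$stored" "$(buggy_size ${#value})"
# Correct shrink: header and the full value survive.
truncate_to "$stored" "$(fixed_size ${#value})"
```

With the header left out of the target size, the tail of the value is lost, which matches the observation that only shrunken xattrs are damaged.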

The test case is sensitive to memory pressure because the ACL is cached with the in-memory inode. Accesses to it use the cached value until the inode is dropped from memory and it needs to be re-read from disk. Upon the re-read, the corruption is noticed and an -EIO results.

Unfortunately, I am not going to be able to repair this type of damage. I'll be able to fix the checksums, but the data inside a shrunken xattr will be lost.