Most recent kernel where this bug did not occur: 2.6.21.6
Distribution: very custom, based on Slackware (quite old: glibc 2.3, gcc 3.4)
Hardware Environment:
MB: 2 Opteron X64, RAID controller 3com series 95XX
Software Environment:
OCFS2 over DRBD 8.0.4 (active/active) over a hardware RAID volume
NFS servers (version 1.1 from source)
Problem Description:
NFS shares OCFS2 file systems. The problem appears on both NFSv3 and NFSv4 (both over TCP). When I write to a file over NFS, the content of the file is destroyed and contains random bytes. The size of the file is preserved. The kernel reports no errors. Reads work correctly. A file written directly on the NFS server is OK (I can read it on the server and on the client station). On the same server, NFS over reiserfs works correctly.
Steps to reproduce:
Mount the NFS share on a client workstation and execute
echo "test" >> mytestfile
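The reproduction step above can be turned into a small self-checking script. This is only a minimal sketch, not something posted in the report: the function name `append_and_verify` and the default payload are illustrative assumptions. It appends a known payload to a file, forces it out with `fsync`, re-reads the file, and checks that the result is exactly the old content plus the payload.

```python
import os


def append_and_verify(path, payload=b"test\n"):
    """Append payload to path, re-read the file, and report whether the
    content is exactly the previous content followed by the payload.

    Hypothetical helper for illustration; not part of the original report.
    """
    # Remember what the file held before the append (empty if it is new).
    before = b""
    if os.path.exists(path):
        with open(path, "rb") as f:
            before = f.read()

    # Equivalent of: echo "test" >> mytestfile
    with open(path, "ab") as f:
        f.write(payload)
        f.flush()
        os.fsync(f.fileno())  # push the data to the server before checking

    with open(path, "rb") as f:
        after = f.read()

    return after == before + payload
```

Run on the client against a file under the NFS mount (for example `append_and_verify("/home2/x/mytestfile")` with the mount point from the later test environment), a `False` result would indicate the corruption. Note that the client's page cache may hide a problem that is visible on the server, so re-reading the same file on the server is the more reliable check.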
Please try to duplicate this _without_ DRBD. It has been known to cause problems in the past.
I can reproduce the problem on the same server without DRBD.
Test env:
on server (172.17.1.3) (2.6.22.11 #2 SMP x86_64 AMD Opteron(tm) Processor 246 unknown):
/dev/sda5 on /data1 type ocfs2 (rw,_netdev,heartbeat=local)
on client:
172.17.1.3:/data1/assets on /home2/x type nfs (rw,addr=172.17.1.3)
When I try to create or modify a file in the client's mount directory, the file contains random data.
OK. Thanks. Would it be possible to post a script or program that demonstrates the problem, as well as a typical output? I'd like to try to reproduce this on my own setup.
Pawel replied:
> 1. kernel has no patches
> 2. machine is stable (works for 2 years)
> 3. I discover that between version 2.6.21.6 and 22.1 ocf2 protocol has
> been changed from v 7 to 8
> To reproduce problem is enough to create or modify ANY file on nfs
> client directory.

It therefore looks to me as if this is an interaction between knfsd and the recent ocfs2 updates. I have not seen this sort of bug on any other client+knfsd+filesystem combination that I've tried. I'm therefore tentatively reassigning this bug to the ocfs2 folks in order to see if they have any ideas...
FYI, none of the ocfs2 protocol changes made were in the area of basic file writes. Mostly, they were related to deleting inodes and dentry handling. I'll try to reproduce this on my test setup tomorrow. Pawel, let me make sure I get this right though - you're saying that it's the nfs client writes which are corrupting file data? I.e., reading and writing from a process on the server is fine, and nfs client reads are also ok (assuming the data hasn't already been corrupted)?
YES
1. Read/write directly on ocfs2 works correctly (under stress test).
2. Read on the NFS client station works correctly.
3. On the NFS client, write/append destroys the content of the file. The size of the file is preserved.
4. After that, the MD5 sums of the corrupted file calculated on the server and on the client are equal.
5. Quite often the corrupted file contains null bytes (not always!).
6. I think that in my test node relocation does NOT ALWAYS occur (modification of a small file, size less than the allocation unit, ~50 B).

When I try to make a connection between 2.6.21.6 and 2.6.22.1, I receive this error:
->kernel: (27793,1):o2net_check_handshake:1149 node mail2 (num 0) at
->172.16.238.1:7776 advertised net protocol version 7 but 8 is required,
->disconnecting
It suggests that the "net protocol" has been changed.
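Points 3 to 5 above (size preserved, content destroyed, frequent null bytes, identical MD5 on both hosts) can be checked mechanically once the expected content is known. The following is a rough sketch under that assumption; `describe_corruption` and its result keys are hypothetical names chosen for illustration, not part of any tool mentioned in this report.

```python
import hashlib


def describe_corruption(expected: bytes, actual: bytes) -> dict:
    """Summarise how the bytes read back differ from the bytes written.

    Hypothetical helper for illustration only.
    """
    return {
        # Point 3: the reported corruption keeps the file size intact.
        "size_preserved": len(expected) == len(actual),
        "content_matches": expected == actual,
        # Point 5: corrupted files often, but not always, contain null bytes.
        "null_bytes": actual.count(b"\x00"),
        # Point 4: compare this digest between the server and the client.
        "md5": hashlib.md5(actual).hexdigest(),
    }
```

Running it on the server and on the client against the same file and comparing the `md5` values would show whether both sides see the same (corrupted) data, which is what point 4 describes.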
Ok, thanks for making that clear. I tried an nfs mount via exported ocfs2 fs using my existing 2.6.23-rc2 test setup and didn't see any problems reading/writing to files. I'll try your exact kernel version next. The protocol version between 2.6.21 and 2.6.22 got bumped in order to support a change to speed up deleted inode messaging. This is not likely to be related to the bug as you've outlined it. It does mean though that you'd have to upgrade your 2.6.21 node to 2.6.22 (or downgrade 2.6.22 to 2.6.21).
Created attachment 12348
Ocfs2 patch to fix this bug

Ok, I was able to reproduce on 2.6.22. It looks like this regression was introduced with the sparse file support patches for 2.6.22. Some of that code was simplified for 2.6.23, which is why it didn't initially reproduce for me. Does the attached patch fix things for you? It survives a kernel build via nfs client export on my test machines.
Any updates? We took the patch through another round of more strenuous testing and it seems to fix the problem, at least over here. It'd be great to verify that it fixes things for you before I send it off to be put in the stable tree...
*** Bug 8308 has been marked as a duplicate of this bug. ***
OK, #10 was a typo and has been reverted. Is the problem fixed? Has the patch been committed/verified? Thanks.
This issue was limited to 2.6.22. The fix is in the 2.6.22.latest stable tree:
http://git.kernel.org/?p=linux/kernel/git/stable/linux-2.6.22.y.git;a=commitdiff;h=1db5759e2d29c90d99659e132d4a137e20460061
Great, thanks, closing the bug.