Bug 8838

Summary: Random data in files created on NFS (over OCFS2 over DRBD)
Product: File System
Reporter: Pawel Zawora (pzawora)
Component: Other
Assignee: Mark Fasheh (mark.fasheh)
Status: CLOSED CODE_FIX
Severity: high
CC: jakethompson1, mark.fasheh, protasnb, sunil.mushran, trondmy
Priority: P1
Hardware: All
OS: Linux
Kernel Version: 2.6.22.1 SMP X64
Subsystem:
Regression: ---
Bisected commit-id:
Attachments: Ocfs2 patch to fix this bug

Description Pawel Zawora 2007-08-02 02:36:50 UTC
Most recent kernel where this bug did not occur: 2.6.21.6
Distribution: heavily customized, based on Slackware (quite old: glibc 2.3, gcc 3.4)
Hardware Environment:
MB: 2 Opteron x64, RAID controller 3com series 95XX
Software Environment:
OCFS2 over DRBD 8.0.4 (active/active) over hardware raid volume
NFS servers (version 1.1 from source)
Problem Description:
The NFS server exports ocfs2 file systems.
The problem appears on NFS3 and NFS4 (both over TCP).

When I write to a file over NFS, the content of the file is destroyed and contains random bytes. The size of the file is preserved. The kernel reports no errors. Reads work correctly.
A file written directly on the NFS server is OK (I can read it on the server and on the client station).
On the same server, NFS over reiserfs works correctly.


Steps to reproduce:
Mount the NFS share on a client workstation and execute: echo "test" >> mytestfile
Comment 1 Trond Myklebust 2007-08-02 06:57:11 UTC
Please try to duplicate this _without_ DRBD. It has been known to cause problems
in the past.
Comment 2 Pawel Zawora 2007-08-09 03:23:40 UTC
I can reproduce the problem on the same server without DRBD.

Test env:
on server  (172.17.1.3)  (2.6.22.11 #2 SMP x86_64 AMD Opteron(tm) Processor 246 unknown)
/dev/sda5 on /data1 type ocfs2 (rw,_netdev,heartbeat=local)

on client:
172.17.1.3:/data1/assets on /home2/x type nfs (rw,addr=172.17.1.3)
When I try to create or modify a file in the client's mount directory, the file contains random data.
Comment 3 Trond Myklebust 2007-08-09 05:51:43 UTC
OK. Thanks.

Would it be possible to post a script or program that demonstrates the problem,
as well as a typical output? I'd like to try to reproduce this on my own setup.
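
A minimal sketch of what such a test program could look like, not the reporter's actual script. It appends a known payload to a file (as the `echo "test" >> mytestfile` step does), re-reads the file, and reports whether the size and the appended bytes are as expected. The function name `append_and_verify` is hypothetical, and the sketch assumes the re-read reflects what the server stored; a client cache could hide corruption, so checking the same file on the server is the more reliable test.

```python
import os


def append_and_verify(path, payload=b"test\n"):
    """Append payload to path, re-read the file, and check the result.

    Returns (size_ok, content_ok):
      size_ok    - the file grew by exactly len(payload)
      content_ok - the appended bytes read back unchanged
    """
    before = os.path.getsize(path) if os.path.exists(path) else 0

    with open(path, "ab") as f:
        f.write(payload)
        f.flush()
        os.fsync(f.fileno())  # ask the OS to push the write out to the server

    with open(path, "rb") as f:
        data = f.read()

    size_ok = len(data) == before + len(payload)
    content_ok = data[before:] == payload
    return size_ok, content_ok
```

Run against a path under the NFS mount on the client, a result of `(True, False)` would match the reported symptom: size preserved, content destroyed.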
Comment 4 Trond Myklebust 2007-08-09 13:30:52 UTC
Pawel replied:

> 1. kernel has no patches
> 2. machine is stable (works for 2 years)
> 3. I discovered that between versions 2.6.21.6 and 2.6.22.1 the ocfs2
>    protocol has been changed from v 7 to 8

> To reproduce the problem it is enough to create or modify ANY file in the
> NFS client directory.

It therefore looks to me as if this is an interaction between knfsd and the
recent ocfs2 updates. I have not seen this sort of bug on any other
client+knfsd+filesystem combination that I've tried. I'm therefore
tentatively reassigning this bug to the ocfs2 folks in order to see if
they have any ideas...
Comment 5 Mark Fasheh 2007-08-09 18:29:48 UTC
FYI, none of the ocfs2 protocol changes made were in the area of basic file writes. Mostly, they were related to deleting inodes and dentry handling.

I'll try to reproduce this on my test setup tomorrow.

Pawel, let me make sure I get this right though - you're saying that it's the nfs client writes which are corrupting file data? I.E., reading and writing from a process on the server is fine, and nfs client reads are also ok (assuming the data hasn't already been corrupted)?
Comment 6 Pawel Zawora 2007-08-10 01:33:23 UTC
YES
1. Read/write directly on ocfs2 works correctly (under stress test).
2. Read on the NFS client station works correctly.
3. On the NFS client: write/append destroys the content of the file. The size of the file is preserved.
4. After that, the MD5 sums of the corrupted file calculated on the server and on the client are equal.
5. Quite often the corrupted file contains null bytes (not always!)
6. I think that in my test node relocation does NOT ALWAYS occur (I modify a small file of about 50 B, smaller than the allocation unit).
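
Points 3 to 5 can be checked with a small helper like the following sketch. It is an illustration only: the name `describe_file` and the report layout are hypothetical, not something used in this bug. Running it on the same file on the server and on the client and comparing the `md5` values corresponds to point 4.

```python
import hashlib


def describe_file(path, expected=None):
    """Summarize a file: size, MD5 digest, and number of null bytes.

    If `expected` (bytes) is given, also report whether the content matches.
    """
    with open(path, "rb") as f:
        data = f.read()

    report = {
        "size": len(data),
        "md5": hashlib.md5(data).hexdigest(),
        "null_bytes": data.count(b"\x00"),
    }
    if expected is not None:
        report["matches_expected"] = data == expected
    return report
```

Equal MD5 sums on both sides combined with `matches_expected` being false would indicate that the damaged data is what is actually stored, rather than a client-side read problem.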


When I tried to connect a 2.6.21.6 node to a 2.6.22.1 node, I received this error:

kernel: (27793,1):o2net_check_handshake:1149 node mail2 (num 0) at 172.16.238.1:7776 advertised net protocol version 7 but 8 is required, disconnecting

It suggests that "net protocol" has been changed.
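
For anyone scanning logs for the same mismatch, here is a small sketch that pulls the two version numbers out of such a message. The regular expression is based only on the wording of the log line quoted above, and the function name `parse_handshake_mismatch` is hypothetical.

```python
import re

# Matches the tail of the o2net handshake message quoted in this comment.
HANDSHAKE_RE = re.compile(
    r"advertised net protocol version (\d+) but (\d+) is required"
)


def parse_handshake_mismatch(line):
    """Return (advertised, required) as ints, or None if the line does not match."""
    m = HANDSHAKE_RE.search(line)
    if m is None:
        return None
    return int(m.group(1)), int(m.group(2))
```

Applied to the message above, this returns `(7, 8)`: the 2.6.21.6 node advertises version 7 while the 2.6.22.1 node requires 8.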
Comment 7 Mark Fasheh 2007-08-10 10:39:06 UTC
Ok, thanks for making that clear.

I tried an nfs mount via exported ocfs2 fs using my existing 2.6.23-rc2 test setup and didn't see any problems reading/writing to files. I'll try your exact kernel version next.

The protocol version between 2.6.21 and 2.6.22 got bumped in order to support a change to speed up deleted inode messaging. This is not likely to be related to the bug as you've outlined it. It does mean though that you'd have to upgrade your 2.6.21 node to 2.6.22 (or downgrade 2.6.22 to 2.6.21).
Comment 8 Mark Fasheh 2007-08-10 15:08:41 UTC
Created attachment 12348 [details]
Ocfs2 patch to fix this bug

Ok, I was able to reproduce on 2.6.22. It looks like this regression was introduced with the sparse file support patches for 2.6.22. Some of that code was simplified for 2.6.23, which is why it didn't initially reproduce for me.

Does the attached patch fix things for you? It survives a kernel build via nfs client export on my test machines.
Comment 9 Mark Fasheh 2007-08-14 10:45:24 UTC
Any updates? We took the patch through another round of more strenuous testing and it seems to fix the problem, at least over here. It'd be great to verify that it fixes things for you before I send it off to be put in the stable tree...
Comment 10 Fu Michael 2007-10-26 17:22:49 UTC
*** Bug 8308 has been marked as a duplicate of this bug. ***
Comment 11 Natalie Protasevich 2008-02-23 00:44:54 UTC
OK, #10 was a typo and has been reverted.

Is the problem fixed? Has the patch been committed/verified?
Thanks.
Comment 12 Sunil Mushran 2008-02-25 17:00:16 UTC
This issue was limited to 2.6.22. The fix is in the latest 2.6.22 stable tree.
http://git.kernel.org/?p=linux/kernel/git/stable/linux-2.6.22.y.git;a=commitdiff;h=1db5759e2d29c90d99659e132d4a137e20460061
Comment 13 Natalie Protasevich 2008-02-25 17:22:21 UTC
Great, thanks, closing the bug.