Used the following in /etc/exports:
1. Mount the NFS export from the client:
sudo mount server:/nfs_export /nfs_mount
2. Start collecting tcpdump data on client and server.
3. Perform a simple dd to cause an NFS write (using oflag=sync):
strace -o /tmp/strace.dd.joe.out dd if=/dev/zero of=/nfs_mount/syncfile bs=1k count=1 oflag=sync
4. Review the tcpdump data and notice that the client does not issue "nfs_file_sync" write requests.
The oflag=sync option causes the file to be opened with O_SYNC. From the descriptions of this flag, it sounds as if it should switch from the newer UNSTABLE/COMMIT mode back to FILE_SYNC mode. However, looking at the code, it seems FILE_SYNC is only used on retries or on writeback writes for reclaim.
So the question here is: Is the assumption wrong or the implementation?
To add a little detail here: I slightly modified the test case to write out 100 blocks in sequence. The strace output on both kernel versions (2.6.32-based and 2.6.38-based) shows only reads and writes (no fsync or sync). The tcpdump output, however, shows in both cases that every write (with the UNSTABLE flag) is followed by a COMMIT.
It seems to me that, from a data integrity point of view, the result is the same as one would expect: every write is waited for before the next write starts.
The only downside I can see is that, instead of one write request with the FILE_SYNC flag, each write requires two requests, which seems a bit of a waste. It is also unexpected compared to the documentation of O_SYNC I found in section 5.9 of
http://www.faqs.org/docs/Linux-HOWTO/NFS-HOWTO.html#MOUNTOPTIONS.
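For anyone who wants to repeat the 100-block variant of the test, a minimal sketch could look like the following. The exact command used for the modified test is not given above, so count=100 with the original bs=1k is an assumption, and TARGET is a hypothetical variable: set it to /nfs_mount/syncfile to exercise NFS, otherwise it falls back to a temporary local file so the commands can be tried without a mount.

```shell
# Hypothetical target path; the original test wrote to /nfs_mount/syncfile.
# Falls back to a temporary local file so the sketch runs without an NFS mount.
TARGET="${TARGET:-$(mktemp)}"

# Write 100 blocks of 1 KiB each (assumed parameters); oflag=sync makes dd
# open the output file with O_SYNC, as in the original single-block test.
dd if=/dev/zero of="$TARGET" bs=1k count=100 oflag=sync

# Print the resulting size in bytes (100 * 1024 = 102400).
wc -c < "$TARGET"
```

Run against the NFS mount while tcpdump is capturing on both sides, this should produce one WRITE per block, which makes the UNSTABLE plus COMMIT pattern easy to spot in the trace.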
This should be fixed in the upstream kernel with commit
b31268ac793fd300da66b9c28bbf0a200339ab96 (FS: Use stable writes when not
doing a bulk flush).
Please reopen this bug if the above commit doesn't fix the problem.