Bug 6127

Summary: NFS updates file modification time on close() or after last write, breaks "cp -p", "cp -a" and "cp --preserve=timestamps"
Product: File System Reporter: Bernhard Kaindl (bk)
Component: NFSAssignee: Trond Myklebust (trondmy)
Status: CLOSED CODE_FIX    
Severity: high CC: bunk
Priority: P2    
Hardware: i386   
OS: Linux   
Kernel Version: 2.6.16-rc4 Subsystem:
Regression: --- Bisected commit-id:
Attachments: NFS: writes should not clobber utimes() calls

Description Bernhard Kaindl 2006-02-24 08:40:12 UTC
Most recent kernel where this bug did not occur: 2.4.18 (maybe 2.4 in general)
Distribution: any (Debian, SuSE, ...)
Hardware Environment: any
Software Environment: coreutils-5.92 to CVS HEAD.

======================================================================

Problem Description:

While NFS apparently never claimed to be fully POSIX-compliant
(fcntl and lockf) there seems to be a regression in one point with 2.6:

NFS updates file modification time on close() which clashes with the
way the standrd "cp" command is implemented in coreutils-5.92 and newer.

Apparently, POSIX does not allow updating the modification time on close:

http://lists.gnu.org/archive/html/bug-coreutils/2005-10/msg00221.html

Quoting from there:
> > POSIX is pretty careful in specifying when st_atime, st_mtime, and
> > st_ctime can be changed (see, e.g., readdir()).  POSIX doesn't grant
> > permission for close() to change these times, so close() shouldn't
> > change the times.  This is independent of whether utimes() was called.

According to
http://lists.gnu.org/archive/html/bug-coreutils/2005-12/msg00219.html
the usage pattern of the new cp from coreutils works fine with Linux-2.4.18
and an NFS Server as server.

The problem of updating modification times on close (but apparently
only if there were buffers to be written) causes trouble in "cp" because
it updates the modification time of the file before it closes the
file. This is perfectly legal according to POSIX and worked fine
in the above test with Linux-2.4

The workaround for "cp" would be to call fsync() before updating the
modification time, then close() does not update it again, and
cp -p, -a and --preserve=timestamps work.

The problem has been analzed in the Debiang bugtracker and
http://bugs.donarmstrong.com/cgi-bin/bugreport.cgi?bug=340236

The end of it is also recorded in bug-coreutils:
http://lists.gnu.org/archive/html/bug-coreutils/2005-12/msg00219.html

The outcome there was that it would be a kernel problem.

I don't know if it's possible to make NFS behave as it did in the
test with 2.4.18 and Solaris as Server, but I'm filing this to
make the problem researchable which look into this bugzilla for
an answer and to hopefully get an answer too.

=================================================================

Steps to reproduce:

Simple local check using 2.6.16-rc4 with a local NFS server:

mkdir /tmp/coreutils-test
cd /tmp/coreutils-test
wget http://ftp.gnu.org/pub/gnu/coreutils/coreutils-5.94.tar.bz2
tar xvfj coreutils-5.94.tar.bz2
cd coreutils-5.94
./configure
make
mkdir -p test
chown nobody test
echo "$PWD/test 127.0.0.1(rw,sync)" >>/etc/exports

# start/reload the kernel nfsserver now

mkdir mountpoint
mount localhost:/space/test $PWD/mountpoint

# cat /proc/mounts shows me these mount options:
# rw,v3,rsize=32768,wsize=32768,hard,lock,proto=tcp,addr=localhost

cp -p configure configure.new
ls -l configure configure.new
# (modification time idential)

src/cp -p configure mountpoint
# (expect an error: failed to preserve ownerships / Operation not permitted)

ls -l configure mountpoint/configure
# (modification time differ, while the same cp command works locally)

umount $PWD/mountpoint
# stop nfsserver/remove exports entry and reload it.

Summary:

Either NFS needs to set the modification time on the last write (and never
on close) or cp needs a workaround to call fsync() before calling utime().
Comment 1 Bernhard Kaindl 2006-02-24 09:16:50 UTC
Further explanation:

cp versions prior to 5.92 closed the destination file before
triggering the utime system call to set the modification time
of the destination file, while 5.92 to 5.94 and the current CVS
close the file _*_after_*_ setting the modfication time/

That's a sample of how the newer versions do it (from cp -p):

fd 4 is the destination file here:
> open("~", O_WRONLY|O_TRUNC|O_LARGEFILE) = 4
> fstat64(4, {st_mode=S_IFREG|0644, st_size=0, ...}) = 0
> write(4, "foo\n", 4)                    = 4
> utimes("/proc/self/fd/4", {1134767841, 0}) = 0
> chmod("baz", 0100644)                   = 0
> close(4)                                = 0

It has also been mentioned:

> 2. Timestamps of empty files are preserved, which reinforces the above
> suspicion (no writes for empty file)

So the cause of the update may indeed that some data is pending to be
written slightly after the utimes call and in any case finally on close,
but this should, according to POSIX not update the mtime/not be visible
to the user.
Comment 2 Bernhard Kaindl 2006-02-24 12:42:43 UTC
I've a possible better workaround for cp (avoids fsync) now:

It creates a copy of the destination fd, closes dest_desc and continues
the remaining operations (utime/utimes/futimesat and fchmod with the copy:

--- src/copy.c
+++ src/copy.c
@@ -420,6 +420,12 @@
       timespec[0] = get_stat_atime (src_sb);
       timespec[1] = get_stat_mtime (src_sb);
 
+      int dest_desc2 = dup (dest_desc);
+
+      close(dest_desc); /* needed for NFS with Linux-2.6 */
+
+      dest_desc = dest_desc2;
+
       if (futimens (dest_desc, dst_name, timespec) != 0)
        {
          error (0, errno, _("preserving times for %s"), quote (dst_name));

copying /etc/passwd then looks like this:

read(3, "root:x:0:0:root:/root:/bin/bash\n"..., 32768) = 1254
write(4, "root:x:0:0:root:/root:/bin/bash\n"..., 1254) = 1254
read(3, "", 32768)                      = 0
dup(4)                                  = 5
close(4)                                = 0
utimes("/proc/self/fd/5", {1140803734, 0}) = 0
fchown(5, 0, 0)                         = 0
close(5)                                = 0

This avoids the need for fsync which waits until the data is
written, which would have a negative impact on the performance
of cp, expecially when copy huge numbers of small files locally.

About using "/proc/self/fd/5" here: It's simply symlink, so nothing
special, but also suboptimal I think since it creates an extra indifection
which the kernel has to resolve internally. Not a big one of course, only
a few function calls in the VFS layer. I think it would be better to use
the futimesat() system call, but that's offtopic here, I'll report that
to the coreutils people.

Still it is a workaround only.
Comment 3 Bernhard Kaindl 2006-02-24 13:55:20 UTC
I found this, Paul Eggert of UCLA got this this approved by the Austin Group:

> I asked the POSIX folks about this (see
> <http://www.opengroup.org/austin/mailarchives/ag-review/msg01971.html>)
> and the answer came back that the Cygwin behavior was indeed a bug;
> see <http://www.opengroup.org/austin/mailarchives/ag/msg08912.html>
> and look for "XBD ERN 61".

----------------------------------------------------------------------
XBD ERN 61 File times update Accept as marked

 Change XBD page 99 lines 3089-3091 (File Times Update) from:

   All fields that are marked for update shall be updated when the file
   ceases to be open by any process, or when a stat(), fstat(), or
   lstat() is performed on the file.

 to:

   All fields that are marked for update shall be updated when the
   file ceases to be open by any process or before a stat(), fstat(),
   lstat(), fsync(), utime(), or utimes() is successfully performed on
   the file.
----------------------------------------------------------------------

So the POSIX folks want to have the fields updated before the functions
are called which makes sense, at least I, from a API user's viewpoint,
would like to have defined, always consistant behaviour.

I have also read that it works this way on OpenBSD 3.4 NFS
client and a Solaris 8 NFS server in this post:
http://lists.gnu.org/archive/html/bug-coreutils/2005-12/msg00219.html

I don't know how utimes() is implemented in NFS, but if this is the
way it's wanted then probably the implementation of utimes in NFS.

I just looked at fs/open.c:do_utimes() which calls fs/attr.c:notify_change()
which calls fs/nfs/inode.c:nfs_setattr(), which has this check:

        /* Write all dirty data if we're changing file permissions or size */
        if ((attr->ia_valid & (ATTR_MODE|ATTR_UID|ATTR_GID|ATTR_SIZE)) != 0) {
                filemap_write_and_wait(inode->i_mapping);
                nfs_wb_all(inode);
        }

Could we add ATTR_CTIME|ATTR_MTIME|ATTR_ATIME which are set by du_utimes()
to this check to write all dirty data also if we are changing timestamps?
Comment 4 Trond Myklebust 2006-02-25 11:27:03 UTC
Created attachment 7472 [details]
NFS: writes should not clobber utimes() calls

Ensure that we flush out writes in the case when someone calls utimes() in
order to set the file times.
Comment 5 Adrian Bunk 2006-03-13 14:03:58 UTC
Trond, what about getting this patch into 2.6.16?
Comment 6 Trond Myklebust 2006-03-13 14:45:03 UTC
I'm sceptical about doing that as it isn't really a critical regression: in
fact, this is the first complaint despite the fact that the issue has existed
for all(?) 2.6 kernels.
Unless someone complains loudly, I'd therefore prefer to wait until 2.6.17.
Comment 7 Trond Myklebust 2006-03-26 11:21:06 UTC
Patch merged into mainline kernel prior to 2.6.17-rc1.