Bug 12557 - NFS going stale for access()/stat() for renamed files like .Xauthority
Status: CLOSED DOCUMENTED
Product: File System
Classification: Unclassified
Component: NFS
Hardware: All Linux
Importance: P1 normal
Assigned To: Trond Myklebust
Reported: 2009-01-27 17:15 UTC by Tim Connors
Modified: 2012-11-02 22:01 UTC (History)

Kernel Version: 3.2
Tree: Mainline
Regression: No


Attachments
Proposed patch (2.14 KB, patch)
2009-03-04 22:58 UTC, Suresh Jayaraman
Wireshark trace of the stat() test in comment #22 (4.17 KB, application/octet-stream)
2012-11-02 21:31 UTC, Trond Myklebust
Wireshark trace of the stat() test in comment #22 but with acdirmin=3 (4.93 KB, application/octet-stream)
2012-11-02 22:01 UTC, Trond Myklebust

Description Tim Connors 2009-01-27 17:15:55 UTC
Latest working kernel version: 2.6.23

Earliest failing kernel version: 2.6.24?

Distribution: debian/unstable

Hardware Environment:

Software Environment: debian/unstable nfs client running 2.6.24 to 2.6.28, centos/rhel 5.2 nfs server running redhat 2.6.18-92.1.10 kernel.
nfs is mounted with autofs, and mount flags are:
aatlxz:/home/aatlxz /net/aatlxz/home/aatlxz nfs
rw,nosuid,nodev,vers=3,rsize=32768,wsize=32768,namlen=255,hard,intr,proto=tcp,timeo=600,retrans=2,sec=sys,mountproto=udp,addr=192.231.166.57 0 0

Problem Description: I am getting an occasional ESTALE from stat() or access() calls on the NFS-hosted ~/.Xauthority when a rename has previously been performed on another host. This is a pain because it means I can't open new X clients when it happens (libXau does a stat on .Xauthority before reading its contents).  It is rather rare: under 2.6.26 it happened anywhere from three times a day to once a week, and it has only just occurred again under 2.6.28 after quite a few days of usage. I suspect it depends on my usage pattern, namely how often I log into other machines with ssh (which atomically renames .Xauthority on the remote host, giving this bug the opportunity to arise).

see also https://bugs.launchpad.net/ubuntu/+source/linux/+bug/269954
and
http://bugs.debian.org/cgi-bin/bugreport.cgi?bug=508866.html

That ubuntu bug report from some other person indicates that 2.6.27 somewhat mitigated the issue, but clearly it still does happen occasionally in 2.6.28.

Steps to reproduce:
Put your home directory on NFS, and ensure you are using ~/.Xauthority as your X authority file (I know, security blah blah).  Open 2 xterms.  Log into a remote machine with X forwarding (xauth will atomically rename the .Xauthority file, resulting in a new inode).  Open a new local xterm.  Sometimes it fails, sometimes it doesn't.  When it does fail, run 'strace -f xterm' from one of the original local xterms.  The access() call will return ESTALE, and the xlibs won't go any further.

You can clear this condition by catting the .Xauthority file on the local host or running 'xauth list', which evidently finally gets around to invalidating the cache.  I was under the impression that with a local kernel version of 2.6.26, access() returned ESTALE pretty much every time I logged into a remote machine.  Under 2.6.28 this is now rarer, but it still happens.  An strace from 2.6.26 (sorry, I didn't yet get a chance to get one under 2.6.28, but surely it has to be the same) follows:

getgid()                                = 15
ioctl(0, SNDCTL_TMR_TIMEBASE or TCGETS, {B38400 opost isig icanon echo ...}) = 0
setresgid(-1, 15, -1)                   = 0
setresuid(-1, 582, -1)                  = 0
open("/proc/meminfo", O_RDONLY)         = 3
fstat(3, {st_mode=S_IFREG|0444, st_size=0, ...}) = 0
mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7f4cd0a2a000
read(3, "MemTotal:      3096244 kB\nMemFree"..., 1024) = 774
close(3)                                = 0
munmap(0x7f4cd0a2a000, 4096)            = 0
socket(PF_FILE, SOCK_STREAM, 0)         = 3
getsockopt(3, SOL_SOCKET, SO_TYPE, [68719476737], [4]) = 0
connect(3, {sa_family=AF_FILE, path="/tmp/.X11-unix/X0"...}, 110) = 0
getpeername(3, {sa_family=AF_FILE, path="/tmp/.X11-unix/X0"...}, [139964394242068]) = 0
uname({sys="Linux", node="aatlxx", ...}) = 0
access("/home/aatlxz/twc/.Xauthority", R_OK) = -1 ESTALE (Stale NFS file handle)
fcntl(3, F_GETFL)                       = 0x2 (flags O_RDWR)
fcntl(3, F_SETFL, O_RDWR|O_NONBLOCK)    = 0
fcntl(3, F_SETFD, FD_CLOEXEC)           = 0
select(4, [3], [3], NULL, NULL)         = 1 (out [3])
writev(3, [{"l\0\v\0\0\0\0\0\0\0"..., 10}, {"\0\0"..., 2}], 2) = 12
read(3, 0x198e160, 8)                   = -1 EAGAIN (Resource temporarily unavailable)
select(4, [3], NULL, NULL, NULL)        = 1 (in [3])
read(3, "\0\26\v\0\0\0\6\0"..., 8)      = 8

Note that access()'s manpage doesn't even list ESTALE as a valid error.

I don't think it would really be acceptable to reduce the attribute
cache time, because that would only reduce the window of opportunity
of the bug, not remove it, and it looks like the stale handle lasts
beyond the timeout anyway.

Would the solution be to, if one gets an ESTALE from metadata only
operations like stat(), retry getting the current metadata from the
server freshly as if it was open()ed?
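
The retry suggested above can be approximated in user space. The following is only a minimal sketch, not the kernel-side change being asked for: stat_retry_on_estale is a hypothetical helper name, and whether the second attempt succeeds depends on the client dropping its cached entry after the first ESTALE, which is an assumption here.

```python
import errno
import os


def stat_retry_on_estale(path, retries=1, stat_fn=os.stat):
    """Call stat_fn(path), retrying when it fails with ESTALE.

    Hypothetical helper: stat_fn is injectable only so that the retry
    logic can be exercised without an NFS mount.
    """
    attempt = 0
    while True:
        try:
            return stat_fn(path)
        except OSError as exc:
            # Re-raise anything that is not a stale handle, and give up
            # once the retry budget is exhausted.
            if exc.errno != errno.ESTALE or attempt >= retries:
                raise
            attempt += 1
```

An application would call stat_retry_on_estale(path) where it currently calls os.stat(path). Any other error, or a second ESTALE, is still reported to the caller.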
Comment 1 Trond Myklebust 2009-01-28 10:30:37 UTC
Why is this marked as a regression? access() and stat() have _always_ had that
race. It is nothing new to 2.6.28. Removing regression flag.

Anyhow, if this is happening due to a rename() (and not an unlink()), then the bug is likely to be that you are using 'subtree_check' as one of your server export options (you don't list any details about the server setup). 'subtree_check' is known to break renames, and should never be used.

Comment 2 Tim Connors 2009-01-28 12:08:28 UTC
(I didn't mark the regression flag; that was done after)

We've never seen this in our environment, until we had a machine with a greater than 2.6.18 (+redhat patches) kernel.  The ubuntu bug implies this happened in 2.6.24.

server exports are according to /proc/fs/nfsd/exports:
/home/aatlxz    *(rw,root_squash,sync,wdelay,no_subtree_check,fsid=666)

Comment 3 Trond Myklebust 2009-01-28 12:32:53 UTC
It doesn't matter what the ubuntu bug says. stat() and access() have _always_ had
the possibility of racing with deletions on the server. This is because both
nfs_permission() and nfs_getattr() are given an inode, and not a path by the VFS.
If the file gets deleted on the server, then the server will return ESTALE
because it doesn't recognise the filehandle.

Normally, a rename of the file on the server is not supposed to invalidate the
filehandle, so my guess is that your old .Xauthority is being deleted by the
other client, and not just renamed.

If so, tough... NFS is not fully cache coherent. It never has been, and it
never will be. You can limit some of the effects by using the
"-olookupcache=none" mount option that was introduced in linux 2.6.28, but
for the reasons I listed above, you can never fully eliminate the ESTALE.
The price to pay is more NFS traffic in order to check every single file
lookup, and hence slower performance.
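
To check whether a given NFS mount already carries a lookupcache or noac setting, the option string of its /proc/mounts entry can be inspected. This is a rough sketch: parse_mount_options and nfs_cache_settings are hypothetical helper names, the sample line is the mount entry from the original report, and treating an absent lookupcache option as "client default" is an assumption.

```python
def parse_mount_options(mounts_line):
    """Split one /proc/mounts line into (device, mountpoint, fstype, options)."""
    device, mountpoint, fstype, opts = mounts_line.split()[:4]
    options = {}
    for item in opts.split(","):
        key, _, value = item.partition("=")
        # Flags without a value (rw, hard, noac, ...) are stored as True.
        options[key] = value if value else True
    return device, mountpoint, fstype, options


def nfs_cache_settings(mounts_line):
    """Return the cache-related options of an NFS mount, or None otherwise."""
    _, _, fstype, options = parse_mount_options(mounts_line)
    if fstype not in ("nfs", "nfs4"):
        return None
    return {
        # Assumption: an absent option means the client default applies.
        "lookupcache": options.get("lookupcache", "default"),
        "noac": "noac" in options,
    }


SAMPLE = ("aatlxz:/home/aatlxz /net/aatlxz/home/aatlxz nfs "
          "rw,nosuid,nodev,vers=3,rsize=32768,wsize=32768,namlen=255,hard,"
          "intr,proto=tcp,timeo=600,retrans=2,sec=sys,mountproto=udp,"
          "addr=192.231.166.57 0 0")
print(nfs_cache_settings(SAMPLE))
```

On a live client the same function could be applied to each line read from /proc/mounts.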

Comment 4 Suresh Jayaraman 2009-02-19 06:51:18 UTC
Hi Trond,
I have seen a similar bug report where a user trying to launch an xterm from a KDE session on an NFS-mounted home fails due to an -ESTALE error on .Xauthority.

The packet capture reveals that there were two consecutive ACCESS calls; the server returns -ESTALE for both, and the client doesn't seem to issue a LOOKUP.

Isn't it expected that the client, upon receiving -ESTALE (due to a stale fh) from the server during an ACCESS(), should unhash the dcache entry and reinstantiate a new dcache entry and a new inode? Or can ACCESS() be made to handle -ESTALE better?
Comment 5 Suresh Jayaraman 2009-02-19 06:59:48 UTC
little more information:
The unlinking goes something like this:

unlink("/home/jay/.Xauthority")       = 0
link("/home/jay/.Xauthority-n", "/home/jay/.Xauthority") = 0
unlink("/home/jay/.Xauthority-n")     = 0
unlink("/home/jay/.Xauthority-c")     = 0
unlink("/home/jay/.Xauthority-l")     = 0

where -n is the new file, -c is a lock file, and -l is a hard link to
the lock file - all created to try to get exclusive locking over NFS
without having to use fcntl locking.
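
For illustration, the sequence above can be replayed on a local directory. This is a sketch under assumptions rather than xauth's actual code: replace_via_link is a hypothetical name, the -c/-l/-n suffixes follow the description above, and lock retries and most error handling are omitted.

```python
import os
import tempfile


def replace_via_link(target, new_contents):
    """Replace target using the unlink/link sequence from the strace above."""
    lock, lock_link, new = target + "-c", target + "-l", target + "-n"
    # Exclusive creation plus a hard link stands in for the lock.
    fd = os.open(lock, os.O_CREAT | os.O_EXCL | os.O_WRONLY, 0o600)
    os.close(fd)
    try:
        os.link(lock, lock_link)
        try:
            with open(new, "w") as handle:
                handle.write(new_contents)
            os.unlink(target)      # the old inode goes away here
            os.link(new, target)   # the name now refers to a different inode
            os.unlink(new)
        finally:
            os.unlink(lock_link)
    finally:
        os.unlink(lock)


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, ".Xauthority")
    with open(path, "w") as handle:
        handle.write("old")
    before = os.stat(path).st_ino
    replace_via_link(path, "new")
    after = os.stat(path).st_ino
    print("inode changed:", before != after)
```

The new file is created while the old one still exists, so the name ends up on a different inode and the old file is removed outright. That matches the "deleted, not just renamed" case described in comment #3.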
Comment 6 Suresh Jayaraman 2009-03-04 22:58:27 UTC
Created attachment 20435 [details]
Proposed patch

The attached patch fixed the issue for me. Can you see whether it fixes the problem for you?
Comment 7 Joe Kislo 2009-07-16 16:16:04 UTC
We are running into a very similar issue; except our use case has nothing to do with home directories on NFS.

The attached patch did NOT solve our problem; we patched an Ubuntu Hardy 2.6.24-24 kernel.  Our test case works a bit differently, and actually does not enter the code path that the patch touches.  Here is our test case:

System A: NFS Server
System B: NFS Client

System A:
mkdir tmpdir
touch tmpdir/tmpfile
tar -cvf x.tar tmpdir

System B:
stat tmpdir/tmpfile

System A:
tar -xvf x.tar

System B:
stat tmpdir/tmpfile

This will reliably generate a:
stat: cannot stat `tmpdir/tmpfile': Stale NFS file handle

error.  Using the Ubuntu Edgy kernel, we do not run into this problem.  Unfortunately we use NFS quite a bit, and having one client be able to completely destroy another client by just untarring a file is a big deal for us :(  

I have provided more details of our use case in the ubuntu bug:
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/269954

Anybody have any ideas?
Comment 8 Trond Myklebust 2009-07-16 16:53:53 UTC
On Thu, 2009-07-16 at 16:16 +0000, bugzilla-daemon@bugzilla.kernel.org
wrote:
> Anybody have any ideas?
> 
Yes: either use 'noac', or get Ubuntu to backport the 'lookupcache'
mount option.

Trond
Comment 9 Tobias Oetiker 2009-08-21 12:47:04 UTC
(In reply to comment #7)

> We are running into a very similar issue; except our use case has nothing to 

you are seeing exactly the same thing as the .Xauthority person.
the following script reproduces it nicely on ubuntu 2.6.28-15-generic:

  client$ touch x;
  client$ perl -e 'use POSIX;while(1){if(POSIX::access(q{x}, POSIX::R_OK)){print "OK\n"}else{print "$!\n"};sleep 1}'

  server$ rm x;touch x.`date +%s`; touch x

  -------------------------------
  OK
  Stale NFS file handle
  Stale NFS file handle
  ...
  -------------------------------

Note the "touch x.`date +%s`" command. The reason for this is that
when you run rm x; touch x, the new x will have the same
inode number as the old x. The extra touch uses up
the inode number of the deleted x, so the new x gets
a new number.


Now try the following:

client$ perl -e 'use POSIX;while(1){if(lstat(q{x})){print "OK\n"}else{print "$!\n"};sleep 1}'
OK
Stale NFS file handle
OK

While the client does not recover from the access call,
it always recovers instantly from the lstat call on the next try.

The patch attached to this bug fixes the access call to also recover
as quickly as the lstat call ... 
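
For anyone who prefers to run the polling test without Perl, here is a rough Python rendering of the lstat loop above. poll_lstat is a name chosen here, and the loop is bounded instead of endless so that it terminates by itself.

```python
import os
import time


def poll_lstat(path, count, interval=1.0):
    """Call lstat on path `count` times and record "OK" or the error text."""
    results = []
    for i in range(count):
        try:
            os.lstat(path)
            results.append("OK")
        except OSError as exc:
            # exc.strerror is the message for the errno that was returned,
            # which on an affected NFS client would be the ESTALE text.
            results.append(exc.strerror)
        if i + 1 < count:
            time.sleep(interval)
    return results


if __name__ == "__main__":
    # Run on the client while the server replaces the file named x.
    for line in poll_lstat("x", count=10):
        print(line)
```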

The remaining problem is that there is an error at all.

I do understand why it happens: the access/lstat calls operate on an inode number they got from a cache. So when they find out that the server no longer knows about the inode, they cannot just go out and try again, because they no longer know the filename.

It would be great if NFS (VFS?) could hide this effect from the user by retrying (once) internally somewhere higher up in the processing chain, rather than exposing the problem to the user.

On another note, while testing I found:

  client$ perl -e 'use POSIX;while(1){if(-e q{new}){print "OK\n"}else{print "$!\n"};sleep 1}'

  server$ touch new
Comment 10 Tobias Oetiker 2009-08-21 12:49:37 UTC
(In reply to comment #9)
... you get 

  No such file or directory
  No such file or directory
  No such file or directory
  No such file or directory
  No such file or directory

as if the client used the last cache access time to prolong
the cache validity. As soon as I stop asking for a few seconds
the 'new' file will appear on the client.
Comment 11 Tobias Oetiker 2009-08-21 13:37:26 UTC
Trond mentioned the lookupcache option in comment #8 
this is in the kernel since 2.6.27.

So when you mount with

  mount -o lookupcache=none

then the inodes will get looked up from the server on every new file access, thus exchanging this problem for some performance impact on applications that open lots of files (local access to maildir would probably suffer?).

The patch for the access problem has been added to 2.6.29 btw.
Comment 12 Trond Myklebust 2009-08-21 22:30:23 UTC
On Fri, 2009-08-21 at 12:47 +0000, bugzilla-daemon@bugzilla.kernel.org
wrote:
> http://bugzilla.kernel.org/show_bug.cgi?id=12557

> It would be great nfs (vfs ?) could hide this effect from the user by retrying
> (once) internally somewhere higher up in the processing chain and not exposing
> the problem to the user.

That will require rather extensive VFS changes. I've discussed this with
Al Viro, and we do have a plan for how it could be done, but it will
take time to implement.

Cheers
  Trond
Comment 13 Henrik Carlqvist 2010-01-24 12:31:36 UTC
I notice that this bug thread seems rather inactive. Maybe the bug has been fixed in more recent kernels as it now has taken some time? If so the bug should be closed with a note from which kernel version it has been fixed.

Anyhow I still suffer from the bug in kernel 2.6.24.5 from 2008 and find no reason to upgrade until the bug has been fixed. Instead I prefer Slackware 12.0 with kernel 2.6.21.5 where possible to avoid this bug.

When I have to use the newer kernel I have found the following workaround for the ssh problem with .Xauthority:

1) Add a file /etc/profile.d/nfs_ssh_xauth_fix.csh
-8<------------------------------------
#!/bin/csh
 
# Added by henca to avoid nfs bug in kernel 2.6.24
setenv XAUTHORITY $HOME/.Xauthority-$HOST
-8<------------------------------------

2) Add a file /etc/profile.d/nfs_ssh_xauth_fix.sh
-8<------------------------------------
#!/bin/sh
 
# Added by henca to avoid nfs bug in kernel 2.6.24
XAUTHORITY=$HOME/.Xauthority-$HOST
export XAUTHORITY
-8<------------------------------------

3) Add a file /etc/ssh/sshrc
-8<------------------------------------
if read proto cookie && [ -n "$DISPLAY" ]; then
  case $DISPLAY in
    localhost:*)
      xauth -f $HOME/.Xauthority-$HOSTNAME add \
                unix:$(echo $DISPLAY | cut -c11-) $proto $cookie
      xauth -f $HOME/.Xauthority add \
                unix:$(echo $DISPLAY | cut -c11-) $proto $cookie
      ;;
    *)
      xauth -f $HOME/.Xauthority-$HOSTNAME add $DISPLAY $proto $cookie
      xauth -f $HOME/.Xauthority add $DISPLAY $proto $cookie
      ;;
  esac
  export XAUTHORITY=$HOME/.Xauthority-$HOSTNAME
fi
-8<------------------------------------

4) Modify file /etc/kde/kdm/Xsession, apply the following patch:
-8<------------------------------------
--- /old/Xsession       Fri Jun 29 03:40:20 2007
+++ etc/kde/kdm/Xsession        Tue Nov 10 15:44:14 2009
@@ -63,6 +63,8 @@
     [ -f $HOME/.profile ] && . $HOME/.profile
     ;;
 esac
+
+/usr/bin/xauth merge $HOME/.Xauthority
  
 [ -f /etc/xprofile ] && . /etc/xprofile
 [ -f $HOME/.xprofile ] && . $HOME/.xprofile
-8<------------------------------------

The idea is to avoid the problem of sharing files across NFS mounts by letting each problematic machine use its own unique .Xauthority-$HOSTNAME. To avoid losing old authorizations, the old .Xauthority is also read.

Unfortunately this workaround is only a fix for X11 forwarding over ssh, this fix does not help against any other problems which might come from this bug.

regards Henrik
Comment 14 Alan 2012-05-30 12:02:22 UTC
Closing old bugs - there have been too many changes to tell if this bug is remotely valid any more. Please re-test with a modern kernel and reopen the bug, giving that version, if it is.
Comment 15 Tim Connors 2012-06-01 07:33:46 UTC
Is debian 3.2.0-1-amd64 new enough for you?

Easy enough to test:

ant is $HOME fileserver, weinberg is client:

17:24:38 tconnors@weinberg:~ > echo test > bsd.1
17:25:40 tconnors@ant:~ > mv -f bsd.1 bsd
17:26:10 tconnors@weinberg:~ > stat bsd
  File: `bsd'
  Size: 5               Blocks: 8          IO Block: 32768  regular file
Device: 18h/24d Inode: 6505243     Links: 1
Access: (0644/-rw-r--r--)  Uid: ( 2983/tconnors)   Gid: ( 1430/    srss)
Access: 2012-06-01 17:26:10.000000000 +1000
Modify: 2012-06-01 17:26:10.000000000 +1000
Change: 2012-06-01 17:26:17.000000000 +1000
17:26:21 tconnors@weinberg:~ > echo test > bsd.1
17:26:33 tconnors@ant:~ > mv -f bsd.1 bsd
17:26:33 tconnors@weinberg:~ > stat bsd
stat: cannot stat `bsd': Stale NFS file handle

Surely at this point, upon a stale NFS filehandle while access()/stat() is being called, you'd simply flush the cache for that inode and try once more.
Comment 16 Trond Myklebust 2012-10-30 18:25:15 UTC
Use the documented workaround, or use another filesystem. We are _never_
going to fix this in any stable kernels.

Closing...
Comment 17 Jonathan Nieder 2012-11-01 18:42:06 UTC
(In reply to comment #16)
> Use the documented workaround, or use another filesystem. We are _never_
> going to fix this in any stable kernels.

I don't think that is what Tim meant.  Looking around with "git log", it
doesn't seem like the series from [1] has been applied to mainline yet.
Am I misunderstanding?

Thanks,
Jonathan

[1] http://thread.gmane.org/gmane.linux.kernel/1383596
Comment 18 Trond Myklebust 2012-11-02 03:18:47 UTC
Jonathan:

What "series from [1]" ? Are you talking about Suresh's patch?
Nobody has ever promised you or anybody else that this patch would be
applied upstream. It is dealing with just another corner case of the "file was
deleted on server" kind of race (namely the case of "I happen to have tried
something other than a regular lookup first").

As mentioned above, we have a workaround (use the lookupcache mount option),
so why are you trying to argue that this patch is so urgent that it
can't wait? Jeff Layton is working on a VFS level fix to retry lookups when
we race with common system calls, but ultimately the bug here is not with the
Linux NFS client, It would be with the application that thinks that it is
somehow safe to delete shared files without knowing for sure that those files
are no longer in use.
Comment 19 Jonathan Nieder 2012-11-02 03:21:33 UTC
(In reply to comment #18)
> What "series from [1]" ? Are you talking about Suresh's patch?

No, I'm talking about Jeff Layton's series.

[...]
> As mentioned above, we have a workaround (use the lookupcache mount option),
> so why are you trying to argue that this patch is so urgent that it
> can't wait?

I never said anything close to that.  It just seemed odd to close a bug while symptoms are still present and a fix is just on the horizon.

To be completely clear: I don't think you should hurry or apply these to stable.
Comment 20 Trond Myklebust 2012-11-02 03:41:06 UTC
On Fri, 2012-11-02 at 03:21 +0000, bugzilla-daemon@bugzilla.kernel.org wrote:
> > As mentioned above, we have a workaround (use the lookupcache mount option),
> > so why are you trying to argue that this patch is so urgent that it
> > can't wait?
> 
> I never said anything close to that.  It just seemed odd to close a bug while
> symptoms are still present and a fix is just on the horizon.

The reason for closing the bug is that the workaround is _already_
present. If you or anybody else refuses to accept that as a fix, then
that isn't my problem.

A remotely renamed file will not cause problems. A remotely deleted file
will, and there is nothing that we can do to fix that on the client.
Jeff's patches deal with a larger subset of problems than Suresh's,
however those are still a bunch of corner cases and the jury is still
out as to whether the patches will go upstream.

IOW: I've closed the bug because Jeff's patches are completely
irrelevant w.r.t whether or not the bug can be resolved.
Comment 21 Tim Connors 2012-11-02 04:10:27 UTC
(In reply to comment #18)
> Jonathan:
> 
> What "series from [1]" ? Are you talking about Suresh's patch?
> Nobody has ever promised you or anybody else that this patch would be
> applied upstream. It is dealing with just another corner case of the "file was
> deleted on server" kind of race (namely the case of "I happen to have tried
> something other than a regular lookup first").
> 
> As mentioned above, we have a workaround (use the lookupcache mount option),
> so why are you trying to argue that this patch is so urgent that it
> can't wait? Jeff Layton is working on a VFS level fix to retry lookups when
> we race with common system calls, but ultimately the bug here is not with the
> Linux NFS client, It would be with the application that thinks that it is
> somehow safe to delete shared files without knowing for sure that those files
> are no longer in use.

Huh?  In use?

on nfs client:
14:28:00 tconnors@fs:/net/dirac/home/tconnors> stat tmp.tmp
  File: `tmp.tmp'
  Size: 0               Blocks: 0          IO Block: 1048576 regular empty file
Device: 25h/37d Inode: 5260540     Links: 1
Access: (0644/-rw-r--r--)  Uid: (  738/tconnors)   Gid: (  273/tconnors)
Access: 2012-11-02 14:25:23.765502234 +1100
Modify: 2012-11-02 14:25:23.765502234 +1100
Change: 2012-11-02 14:25:23.765502234 +1100
14:28:15 tconnors@fs:/net/dirac/home/tconnors> 

Meanwhile, on nfs server 9 seconds later:
14:28:24 tconnors@dirac:~> touch tmp.tmp1
14:28:24 tconnors@dirac:~> \mv tmp.tmp1 tmp.tmp
14:28:27 tconnors@dirac:~> 

And back on the client 4 seconds later (longer than default acregmin, shorter than acregmax though):
14:28:31 tconnors@fs:/net/dirac/home/tconnors> stat tmp.tmp
stat: cannot stat `tmp.tmp': Stale NFS file handle
14:28:31 tconnors@fs:/net/dirac/home/tconnors>

That's a pretty damn long race condition - 13 seconds between first caching the valid stat results, and then next needing to refer to the cached-now-invalid results.  At no time other than during "touch" was the file open and in use.

I'm wondering if this sort of bogosity affects the production NFS servers at work, where we see the even worse result of the client returning the previous contents of the file from before the mv *indefinitely* (i.e., far longer than acregmax/acdirmax) if anything other than lookupcache=none is specified, as per https://bugzilla.redhat.com/show_bug.cgi?id=488780 and https://bugzilla.redhat.com/show_bug.cgi?id=113636.  It had been attributed to using ext3 with second resolution rather than ext4 with nanosecond resolution, but that sounds like a furphy.  As I explain in comment 9 of redhat BZ 488780, lookupcache=none is an unworkable workaround.
Comment 22 Trond Myklebust 2012-11-02 21:29:47 UTC
On Fri, 2012-11-02 at 04:10 +0000, bugzilla-daemon@bugzilla.kernel.org
wrote:
> That's a pretty damn long race condition - 13 seconds between first caching the
> valid stat results, and then next needing to refer to the cached-now-invalid
> results.  At no time other than during "touch" was the file open and in use.

Well here's what I get on a stock Fedora 17 based 3.6.3 client:

[trondmy@lade tmp]$ stat .; touch tmp.tmp; stat tmp.tmp; sleep 9; ssh dragvoll "(cd /home/trondmy/tmp; touch tmp.tmp1; mv tmp.tmp1 tmp.tmp)"; sleep 4; stat tmp.tmp; stat .
  File: `.'
  Size: 4096      	Blocks: 8          IO Block: 32768  directory
Device: 26h/38d	Inode: 229507141   Links: 4
Access: (0754/drwxr-xr--)  Uid: (  520/ trondmy)   Gid: (  100/   users)
Context: system_u:object_r:nfs_t:s0
Access: 2012-11-02 17:15:02.172804537 -0400
Modify: 2012-11-02 17:15:03.046066225 -0400
Change: 2012-11-02 17:15:03.000000000 -0400
 Birth: -
  File: `tmp.tmp'
  Size: 0         	Blocks: 0          IO Block: 32768  regular empty file
Device: 26h/38d	Inode: 229512286   Links: 1
Access: (0640/-rw-r-----)  Uid: (  520/ trondmy)   Gid: (  100/   users)
Context: system_u:object_r:nfs_t:s0
Access: 2012-11-02 17:15:21.189926202 -0400
Modify: 2012-11-02 17:15:21.189926202 -0400
Change: 2012-11-02 17:15:21.000000000 -0400
 Birth: -
  File: `tmp.tmp'
  Size: 0         	Blocks: 0          IO Block: 32768  regular empty file
Device: 26h/38d	Inode: 229512286   Links: 1
Access: (0640/-rw-r-----)  Uid: (  520/ trondmy)   Gid: (  100/   users)
Context: system_u:object_r:nfs_t:s0
Access: 2012-11-02 17:15:21.189926202 -0400
Modify: 2012-11-02 17:15:21.189926202 -0400
Change: 2012-11-02 17:15:21.000000000 -0400
 Birth: -
  File: `.'
  Size: 4096      	Blocks: 8          IO Block: 32768  directory
Device: 26h/38d	Inode: 229507141   Links: 4
Access: (0754/drwxr-xr--)  Uid: (  520/ trondmy)   Gid: (  100/   users)
Context: system_u:object_r:nfs_t:s0
Access: 2012-11-02 17:15:02.172804537 -0400
Modify: 2012-11-02 17:15:21.186925364 -0400
Change: 2012-11-02 17:15:21.000000000 -0400
 Birth: -

If you look at the wireshark trace for that, it turns out that after
14 seconds the stat for tmp.tmp _does_ in fact go over the wire, and
the Linux server at the other end is happy to tell me not only that the
filehandle for the original tmp.tmp is still valid, but that the file it
points to has nlink==1...

The stat for the current directory '.', on the other hand, is cached,
because acdirmin=15 by default.

So the problem here is not necessarily a violation of acregmin/acregmax
rules as you appear to think (without offering any proof).
Comment 23 Trond Myklebust 2012-11-02 21:31:50 UTC
Created attachment 85401 [details]
Wireshark trace of the stat() test in comment #22
Comment 24 Trond Myklebust 2012-11-02 22:01:03 UTC
Created attachment 85411 [details]
Wireshark trace of the stat() test in comment #22 but with acdirmin=3

Here is another trace that shows what happens when you adjust acdirmin down
to reflect the actual frequency of remote changes in the directory.

This time, the second stat triggers an ACCESS call, which tells the client
that the directory has changed, and triggers a LOOKUP of the file.

We now get:

  File: `.'
  Size: 4096      	Blocks: 8          IO Block: 1048576 directory
Device: 26h/38d	Inode: 1184818     Links: 2
Access: (0755/drwxr-xr-x)  Uid: (  520/ trondmy)   Gid: (  100/   users)
Context: system_u:object_r:nfs_t:s0
Access: 2012-11-02 17:53:10.911703138 -0400
Modify: 2012-11-02 17:54:12.667679207 -0400
Change: 2012-11-02 17:54:12.667679207 -0400
 Birth: -
  File: `tmp.tmp'
  Size: 0         	Blocks: 0          IO Block: 1048576 regular empty file
Device: 26h/38d	Inode: 1188042     Links: 1
Access: (0644/-rw-r--r--)  Uid: (  520/ trondmy)   Gid: (  100/   users)
Context: system_u:object_r:nfs_t:s0
Access: 2012-11-02 17:54:25.855674106 -0400
Modify: 2012-11-02 17:54:25.855674106 -0400
Change: 2012-11-02 17:54:25.855674106 -0400
 Birth: -
  File: `tmp.tmp'
  Size: 0         	Blocks: 0          IO Block: 1048576 regular empty file
Device: 26h/38d	Inode: 1196335     Links: 1
Access: (0664/-rw-rw-r--)  Uid: (  520/ trondmy)   Gid: (  520/ trondmy)
Context: system_u:object_r:nfs_t:s0
Access: 2012-11-02 17:54:35.222670615 -0400
Modify: 2012-11-02 17:54:35.222670615 -0400
Change: 2012-11-02 17:54:35.223670615 -0400
 Birth: -
  File: `.'
  Size: 4096      	Blocks: 8          IO Block: 1048576 directory
Device: 26h/38d	Inode: 1184818     Links: 2
Access: (0755/drwxr-xr-x)  Uid: (  520/ trondmy)   Gid: (  100/   users)
Context: system_u:object_r:nfs_t:s0
Access: 2012-11-02 17:53:10.911703138 -0400
Modify: 2012-11-02 17:54:35.223670615 -0400
Change: 2012-11-02 17:54:35.223670615 -0400
 Birth: -

IOW: all changes to the directory are seen correctly.
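
The difference between the two traces comes down to which cached attributes are still considered fresh when the second stat runs. As a rough illustration only: the model below looks at the minimum timeouts alone (the real client scales the timeout between the min and max values), served_from_cache and second_stat are hypothetical names, acregmin=3 is the default assumed here, and acdirmin=15 is the value quoted in comment #22.

```python
def served_from_cache(elapsed_s, ac_min_s):
    """Simplified model: attributes younger than the minimum timeout are trusted."""
    return elapsed_s < ac_min_s


def second_stat(elapsed_s, acregmin_s, acdirmin_s):
    """Describe which cached attributes the second stat would rely on."""
    return {
        "file_attrs_cached": served_from_cache(elapsed_s, acregmin_s),
        "dir_attrs_cached": served_from_cache(elapsed_s, acdirmin_s),
    }


# About 14 s pass between the two stat calls in the test from comment #22.
print(second_stat(14, acregmin_s=3, acdirmin_s=15))
# Same timing with acdirmin=3, as in the second trace.
print(second_stat(14, acregmin_s=3, acdirmin_s=3))
```

In the first case the file attributes are refetched but the directory is still trusted, so the old filehandle keeps being used. In the second case the directory is revalidated as well, which is what leads to the LOOKUP of the file.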
