Latest working kernel version: 2.6.23
Earliest failing kernel version: 2.6.24?
Distribution: debian/unstable
Hardware Environment:
Software Environment: debian/unstable nfs client running 2.6.24 to 2.6.28, centos/rhel 5.2 nfs server running redhat 2.6.18-92.1.10 kernel. nfs is mounted with autofs, and the mount flags are:

aatlxz:/home/aatlxz /net/aatlxz/home/aatlxz nfs rw,nosuid,nodev,vers=3,rsize=32768,wsize=32768,namlen=255,hard,intr,proto=tcp,timeo=600,retrans=2,sec=sys,mountproto=udp,addr=192.231.166.57 0 0

Problem Description:
I am getting the occasional ESTALE from stat() or access() calls on the nfs-hosted ~/.Xauthority if a rename has previously been performed on another host. This is a pain because it means I can't open new X clients when it happens (libXau does a stat on .Xauthority before reading its contents). It is rather rare (from thrice a day under 2.6.26 to once per week), and it has only just occurred again in 2.6.28 after quite a few days of usage. I suspect it depends on whether my pattern of usage involves frequently logging into other machines with ssh (which atomically renames .Xauthority on the remote host, giving this bug the opportunity to arise) or not.

See also https://bugs.launchpad.net/ubuntu/+source/linux/+bug/269954 and http://bugs.debian.org/cgi-bin/bugreport.cgi?bug=508866.html

That ubuntu bug report from another person indicates that 2.6.27 somewhat mitigated the issue, but clearly it still happens occasionally in 2.6.28.

Steps to reproduce:
Put your home directory on nfs, and ensure you are using ~/.Xauthority as your X authority file (I know, security blah blah). Open 2 xterms. Log into a remote machine with X forwarding (xauth will atomically rename the .Xauthority file, resulting in a new inode). Open a new local xterm. Sometimes it fails, sometimes it doesn't. When it does fail, run 'strace -f xterm' from one of the original local xterms. The access() call will return ESTALE, and the xlibs won't go any further. You can clear this condition by catting the .Xauthority file on the local host or running 'xauth list', which obviously finally gets around to invalidating the cache.

I was under the impression that pretty much every time I logged into a remote machine with a local kernel of 2.6.26, access() would return ESTALE. Under 2.6.28 this is now rarer, but it still happens.
strace from 2.6.26 (sorry, I didn't yet get a chance to get one under 2.6.28, but surely it has to be the same) follows:

getgid()                                = 15
ioctl(0, SNDCTL_TMR_TIMEBASE or TCGETS, {B38400 opost isig icanon echo ...}) = 0
setresgid(-1, 15, -1)                   = 0
setresuid(-1, 582, -1)                  = 0
open("/proc/meminfo", O_RDONLY)         = 3
fstat(3, {st_mode=S_IFREG|0444, st_size=0, ...}) = 0
mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7f4cd0a2a000
read(3, "MemTotal: 3096244 kB\nMemFree"..., 1024) = 774
close(3)                                = 0
munmap(0x7f4cd0a2a000, 4096)            = 0
socket(PF_FILE, SOCK_STREAM, 0)         = 3
getsockopt(3, SOL_SOCKET, SO_TYPE, [68719476737], [4]) = 0
connect(3, {sa_family=AF_FILE, path="/tmp/.X11-unix/X0"...}, 110) = 0
getpeername(3, {sa_family=AF_FILE, path="/tmp/.X11-unix/X0"...}, [139964394242068]) = 0
uname({sys="Linux", node="aatlxx", ...}) = 0
access("/home/aatlxz/twc/.Xauthority", R_OK) = -1 ESTALE (Stale NFS file handle)
fcntl(3, F_GETFL)                       = 0x2 (flags O_RDWR)
fcntl(3, F_SETFL, O_RDWR|O_NONBLOCK)    = 0
fcntl(3, F_SETFD, FD_CLOEXEC)           = 0
select(4, [3], [3], NULL, NULL)         = 1 (out [3])
writev(3, [{"l\0\v\0\0\0\0\0\0\0"..., 10}, {"\0\0"..., 2}], 2) = 12
read(3, 0x198e160, 8)                   = -1 EAGAIN (Resource temporarily unavailable)
select(4, [3], NULL, NULL, NULL)        = 1 (in [3])
read(3, "\0\26\v\0\0\0\6\0"..., 8)      = 8

Note that access()'s manpage doesn't even list ESTALE as a valid error.

I don't think it would really be acceptable to reduce the attribute cache time, because that would only reduce the window of opportunity for the bug, not remove it, and it looks like the stale handle lasts beyond the timeout anyway. Would the solution be, if one gets an ESTALE from metadata-only operations like stat(), to retry getting the current metadata freshly from the server, as if the file had been open()ed?
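A minimal sketch of scripting that manual workaround in the meantime (the wrapper below is purely illustrative, not something that actually exists anywhere):

#!/bin/sh
# Hypothetical wrapper: reading ~/.Xauthority forces a fresh lookup on the
# client, which clears the stale handle the same way running
# 'cat .Xauthority' or 'xauth list' does by hand.
cat "$HOME/.Xauthority" > /dev/null 2>&1
exec xterm "$@"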
Why is this marked as a regression? access() and stat() have _always_ had that race. It is nothing new to 2.6.28. Removing regression flag.

Anyhow, if this is happening due to a rename() (and not an unlink()), then the bug is likely to be that you are using 'subtree_check' as one of your server export options (you don't list any details about the server setup). 'subtree_check' is known to break renames and should never be used.
(I didn't mark the regression flag; that was done afterwards.)

We've never seen this in our environment until we had a machine with a kernel newer than 2.6.18 (+redhat patches). The ubuntu bug implies this happened in 2.6.24.

The server exports, according to /proc/fs/nfsd/exports, are:

/home/aatlxz	*(rw,root_squash,sync,wdelay,no_subtree_check,fsid=666)
It doesn't matter what the ubuntu bug says. stat() and access() have _always_ had the possibility of racing with deletions on the server. This is because both nfs_permission() and nfs_getattr() are given an inode, and not a path, by the VFS. If the file gets deleted on the server, then the server will return ESTALE because it doesn't recognise the filehandle.

Normally, a rename of the file on the server is not supposed to invalidate the filehandle, so my guess is that your old .Xauthority is being deleted by the other client, and not just renamed. If so, tough... NFS is not fully cache coherent. It never has been, and it never will be.

You can limit some of the effects by using the "-olookupcache=none" mount option that was introduced in linux 2.6.28, but for the reasons I listed above, you can never fully eliminate the ESTALE. The price to pay is more NFS traffic in order to check every single file lookup, and hence slower performance.
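Purely as an illustration of that option (the server name and paths below are placeholders, not taken from this bug):

# needs a client kernel with lookupcache= support (2.6.28, as noted above);
# trades extra LOOKUP traffic for tighter cache coherency
mount -t nfs -o vers=3,lookupcache=none server:/export/home /mnt/home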
Hi Trond,

I have seen a similar bug report where a user was unable to launch an xterm from a KDE session with an NFS-mounted home directory, due to an -ESTALE error on .Xauthority. The packet capture reveals that there were two consecutive ACCESS calls, the server returned -ESTALE for both, and the client doesn't seem to LOOKUP.

Is it not expected that the client, upon receiving -ESTALE (due to a stale fh) from the server during an ACCESS(), should unhash the dcache entry and instantiate a new dcache entry and new inode? Or can ACCESS() be made to handle -ESTALE better?
A little more information: the unlinking goes something like this:

unlink("/home/jay/.Xauthority")          = 0
link("/home/jay/.Xauthority-n", "/home/jay/.Xauthority") = 0
unlink("/home/jay/.Xauthority-n")        = 0
unlink("/home/jay/.Xauthority-c")        = 0
unlink("/home/jay/.Xauthority-l")        = 0

where -n is the new file, -c is a lock file, and -l is a hard link to the lock file - all created to try to get exclusive locking over NFS without having to use fcntl locking.
Created attachment 20435 [details]
Proposed patch

The attached patch fixed the issue for me. Can you see whether it fixes the problem for you?
We are running into a very similar issue, except our use case has nothing to do with home directories on NFS.

Using the attached patch, it did NOT solve our problem; we patched against an Ubuntu Hardy 2.6.24-24 kernel. Our test case works a bit differently, and actually does not enter the sequence of code that the patch hits. Here is our test case:

System A: NFS Server
System B: NFS Client

System A:
  mkdir tmpdir
  touch tmpdir/tmpfile
  tar -cvf x.tar tmpdir

System B:
  stat tmpdir/tmpfile

System A:
  tar -xvf x.tar

System B:
  stat tmpdir/tmpfile

This will reliably generate a:

  stat: cannot stat `tmpdir/tmpfile': Stale NFS file handle

error. Using the Ubuntu Edgy kernel, we do not run into this problem. Unfortunately we use NFS quite a bit, and having one client be able to completely destroy another client by just untarring a file is a big deal for us :(

I have provided more details of our use case in the ubuntu bug:
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/269954

Anybody have any ideas?
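The same sequence condensed into a rough script, in case that helps anyone reproduce it (the 'serverA' hostname and /export path are made up; run it from the client inside the mounted directory):

#!/bin/sh
# sketch of the test case above
ssh serverA 'cd /export && mkdir -p tmpdir && touch tmpdir/tmpfile && tar -cvf x.tar tmpdir'
stat tmpdir/tmpfile                         # client caches the dentry and attributes
ssh serverA 'cd /export && tar -xvf x.tar'  # replaces tmpdir/tmpfile on the server
stat tmpdir/tmpfile                         # reliably reports "Stale NFS file handle" on affected kernels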
On Thu, 2009-07-16 at 16:16 +0000, bugzilla-daemon@bugzilla.kernel.org wrote:
> http://bugzilla.kernel.org/show_bug.cgi?id=12557
>
> --- Comment #7 from Joe Kislo <joe@k12s.phast.umass.edu> 2009-07-16 16:16:04 ---
> We are running into a very similar issue; except our use case has nothing to
> do with home directories on NFS.
[...]
> Anybody have any ideas?

Yes: either use 'noac', or get Ubuntu to backport the 'lookupcache' mount option.

Trond
(In reply to comment #7)
> We are running into a very similar issue; except our use case has nothing to

You are seeing exactly the same thing as the .Xauthority person. The following script reproduces it nicely on ubuntu 2.6.28-15-generic:

client$ touch x
client$ perl -e 'use POSIX;while(1){if(POSIX::access(q{x}, POSIX::R_OK)) \
        {print "OK\n"}else{print "$!\n"};sleep 1}'
server$ rm x; touch x.`date +%s`; touch x
-------------------------------
OK
Stale NFS file handle
Stale NFS file handle
...
-------------------------------

Note the "touch x.`date +%s`" command. The reason for this is that when you run "rm x; touch x", the new x will have the same inode number as the old x. By doing the extra touch, the inode number of the deleted x gets used up, and the new x gets a new number.

Now try the following:

client$ perl -e 'use POSIX;while(1){if(lstat(q{x})){print "OK\n"}else{print "$!\n"};sleep 1}'
OK
Stale NFS file handle
OK

While the client does not recover from the access call, it always recovers instantly from the lstat call on the next try. The patch attached to this bug fixes the access call to also recover as quickly as the lstat call.

The remaining problem is that there is an error at all. I do understand why it happens: the access/lstat calls operate on an inode number they got from a cache. So when they figure out that the server no longer knows about that inode, they cannot just go out and try again, because they no longer know the filename. It would be great if nfs (vfs?) could hide this effect from the user by retrying (once) internally somewhere higher up in the processing chain, and not exposing the problem to the user.

On another note, while testing I found:

client$ perl -e 'use POSIX;while(1){if(-e q{new}) {print "OK\n"}else{print "$!\n"};sleep 1}'
server$ touch new
(In reply to comment #9)

... you get:

No such file or directory
No such file or directory
No such file or directory
No such file or directory
No such file or directory

as if the client used the last cache access time to prolong the cache validity. As soon as I stop asking for a few seconds, the 'new' file will appear on the client.
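Regarding the inode-number reuse mentioned above, a quick way to see it for yourself is ls -i on the server (just a sketch; 'x' and 'x.pad' are scratch files in the exported directory):

server$ touch x; ls -i x                      # note the inode number
server$ rm x; touch x; ls -i x                # the new x typically comes back with the same number
server$ rm x; touch x.pad; touch x; ls -i x   # the extra file uses up the freed number, so x gets a new one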
Trond mentioned the lookupcache option in comment #8; this has been in the kernel since 2.6.27. So when you mount with "mount -o lookupcache=none", the inodes will get looked up from the server on every new file access, thus exchanging this problem for some performance impact for applications that open lots of files... (local access to a maildir would probably suffer?)

The patch for the access problem has been added to 2.6.29, btw.
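For a persistent mount the option just goes in the options column; a hypothetical /etc/fstab line based on the mount shown in the original report (trimmed to the interesting options) would be:

# client kernel must be new enough to understand lookupcache= (see the versions discussed above)
aatlxz:/home/aatlxz  /net/aatlxz/home/aatlxz  nfs  rw,vers=3,hard,intr,lookupcache=none  0 0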
On Fri, 2009-08-21 at 12:47 +0000, bugzilla-daemon@bugzilla.kernel.org wrote:
> http://bugzilla.kernel.org/show_bug.cgi?id=12557
>
> It would be great if nfs (vfs?) could hide this effect from the user by
> retrying (once) internally somewhere higher up in the processing chain,
> and not exposing the problem to the user.

That will require rather extensive VFS changes. I've discussed this with Al Viro, and we do have a plan for how it could be done, but it will take time to implement.

Cheers
  Trond
I notice that this bug thread seems rather inactive. Maybe the bug has been fixed in more recent kernels, as it has now been some time? If so, the bug should be closed with a note saying from which kernel version it has been fixed.

Anyhow, I still suffer from the bug in kernel 2.6.24.5 from 2008 and find no reason to upgrade until the bug has been fixed. Instead I prefer Slackware 12.0 with kernel 2.6.21.5 where possible, to avoid this bug. When I have to use the newer kernel, I have found the following workaround for the ssh problem with .Xauthority:

1) Add a file /etc/profile.d/nfs_ssh_xauth_fix.csh
-8<------------------------------------
#!/bin/csh
# Added by henca to avoid nfs bug in kernel 2.6.24
setenv XAUTHORITY $HOME/.Xauthority-$HOST
-8<------------------------------------

2) Add a file /etc/profile.d/nfs_ssh_xauth_fix.sh
-8<------------------------------------
#!/bin/sh
# Added by henca to avoid nfs bug in kernel 2.6.24
XAUTHORITY=$HOME/.Xauthority-$HOST
export XAUTHORITY
-8<------------------------------------

3) Add a file /etc/ssh/sshrc
-8<------------------------------------
if read proto cookie && [ -n "$DISPLAY" ]; then
  case $DISPLAY in
  localhost:*)
    xauth -f $HOME/.Xauthority-$HOSTNAME add \
      unix:$(echo $DISPLAY | cut -c11-) $proto $cookie
    xauth -f $HOME/.Xauthority add \
      unix:$(echo $DISPLAY | cut -c11-) $proto $cookie
    ;;
  *)
    xauth -f $HOME/.Xauthority-$HOSTNAME add $DISPLAY $proto $cookie
    xauth -f $HOME/.Xauthority add $DISPLAY $proto $cookie
    ;;
  esac
  export XAUTHORITY=$HOME/.Xauthority-$HOSTNAME
fi
-8<------------------------------------

4) Modify the file /etc/kde/kdm/Xsession, applying the following patch:
-8<------------------------------------
--- /old/Xsession       Fri Jun 29 03:40:20 2007
+++ etc/kde/kdm/Xsession        Tue Nov 10 15:44:14 2009
@@ -63,6 +63,8 @@
     [ -f $HOME/.profile ] && . $HOME/.profile
     ;;
 esac
+
+/usr/bin/xauth merge $HOME/.Xauthority
 
 [ -f /etc/xprofile ] && . /etc/xprofile
 [ -f $HOME/.xprofile ] && . $HOME/.xprofile
-8<------------------------------------

The idea is to avoid the problem of sharing files across NFS mounts by letting each problematic machine use its own unique .Xauthority-$HOSTNAME. To not lose old authorizations, the old .Xauthority is also read. Unfortunately this workaround only fixes X11 forwarding over ssh; it does not help against any other problems which might come from this bug.

regards Henrik
Closing old bugs - there have been too many changes to tell if this bug is remotely valid any more. Please re-test with a modern kernel and reopen the bug, giving that version, if it is still present.
Is debian 3.2.0-1-amd64 new enough for you? Easy enough to test: ant is the $HOME fileserver, weinberg is the client:

17:24:38 tconnors@weinberg:~ > echo test > bsd.1
17:25:40 tconnors@ant:~ > mv -f bsd.1 bsd
17:26:10 tconnors@weinberg:~ > stat bsd
  File: `bsd'
  Size: 5               Blocks: 8          IO Block: 32768  regular file
Device: 18h/24d Inode: 6505243     Links: 1
Access: (0644/-rw-r--r--)  Uid: ( 2983/tconnors)   Gid: ( 1430/    srss)
Access: 2012-06-01 17:26:10.000000000 +1000
Modify: 2012-06-01 17:26:10.000000000 +1000
Change: 2012-06-01 17:26:17.000000000 +1000
17:26:21 tconnors@weinberg:~ > echo test > bsd.1
17:26:33 tconnors@ant:~ > mv -f bsd.1 bsd
17:26:33 tconnors@weinberg:~ > stat bsd
stat: cannot stat `bsd': Stale NFS file handle

Surely at this point, upon a stale NFS filehandle while access()/stat() is being called, you'd simply flush the cache for that inode and try once more.
Use the documented workaround, or use another filesystem. We are _never_ going to fix this in any stable kernels. Closing...
(In reply to comment #16)
> Use the documented workaround, or use another filesystem. We are _never_
> going to fix this in any stable kernels.

I don't think that is what Tim meant. Looking around with "git log", it doesn't seem like the series from [1] has been applied to mainline yet. Am I misunderstanding?

Thanks,
Jonathan

[1] http://thread.gmane.org/gmane.linux.kernel/1383596
Jonathan:

What "series from [1]"? Are you talking about Suresh's patch? Nobody has ever promised you or anybody else that this patch would be applied upstream. It is dealing with just another corner case of the "file was deleted on server" kind of race (namely the case of "I happen to have tried something other than a regular lookup first").

As mentioned above, we have a workaround (use the lookupcache mount option), so why are you trying to argue that this patch is so urgent that it can't wait? Jeff Layton is working on a VFS-level fix to retry lookups when we race with common system calls, but ultimately the bug here is not with the Linux NFS client; it would be with the application that thinks that it is somehow safe to delete shared files without knowing for sure that those files are no longer in use.
(In reply to comment #18)
> What "series from [1]"? Are you talking about Suresh's patch?

No, I'm talking about Jeff Layton's series.

[...]

> As mentioned above, we have a workaround (use the lookupcache mount option),
> so why are you trying to argue that this patch is so urgent that it
> can't wait?

I never said anything close to that. It just seemed odd to close a bug while symptoms are still present and a fix is just on the horizon. To be completely clear: I don't think you should hurry or apply these to stable.
On Fri, 2012-11-02 at 03:21 +0000, bugzilla-daemon@bugzilla.kernel.org wrote:
> > As mentioned above, we have a workaround (use the lookupcache mount
> > option), so why are you trying to argue that this patch is so urgent
> > that it can't wait?
>
> I never said anything close to that. It just seemed odd to close a bug
> while symptoms are still present and a fix is just on the horizon.

The reason for closing the bug is that the workaround is _already_ present. If you or anybody else refuses to accept that as a fix, then that isn't my problem. A remotely renamed file will not cause problems. A remotely deleted file will, and there is nothing that we can do to fix that on the client.

Jeff's patches deal with a larger subset of problems than Suresh's, however those are still a bunch of corner cases, and the jury is still out as to whether the patches will go upstream.

IOW: I've closed the bug because Jeff's patches are completely irrelevant w.r.t. whether or not the bug can be resolved.
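To spell out that rename/delete distinction in shell terms (a sketch only; the file names are placeholders, and the commands are run on the server or on another client):

touch shared.dat                      # the file some other client has stat()ed and cached

# Case 1 - a plain rename: the inode survives, so the other client's cached
# filehandle stays valid; only the name changes.
mv shared.dat shared.dat.renamed

# Case 2 - replacing it with a freshly created file (what xauth and tar
# effectively do): the cached inode ends up unlinked, so the other client's
# old filehandle eventually returns ESTALE.
touch shared.dat.new
mv shared.dat.new shared.dat.renamed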
(In reply to comment #18)
> [...] ultimately the bug here is not with the Linux NFS client; it would be
> with the application that thinks that it is somehow safe to delete shared
> files without knowing for sure that those files are no longer in use.

Huh? In use?

On the nfs client:

14:28:00 tconnors@fs:/net/dirac/home/tconnors> stat tmp.tmp
  File: `tmp.tmp'
  Size: 0               Blocks: 0          IO Block: 1048576 regular empty file
Device: 25h/37d Inode: 5260540     Links: 1
Access: (0644/-rw-r--r--)  Uid: (  738/tconnors)   Gid: (  273/tconnors)
Access: 2012-11-02 14:25:23.765502234 +1100
Modify: 2012-11-02 14:25:23.765502234 +1100
Change: 2012-11-02 14:25:23.765502234 +1100
14:28:15 tconnors@fs:/net/dirac/home/tconnors>

Meanwhile, on the nfs server 9 seconds later:

14:28:24 tconnors@dirac:~> touch tmp.tmp1
14:28:24 tconnors@dirac:~> \mv tmp.tmp1 tmp.tmp
14:28:27 tconnors@dirac:~>

And back on the client 4 seconds later (longer than the default acregmin, shorter than acregmax though):

14:28:31 tconnors@fs:/net/dirac/home/tconnors> stat tmp.tmp
stat: cannot stat `tmp.tmp': Stale NFS file handle
14:28:31 tconnors@fs:/net/dirac/home/tconnors>

That's a pretty damn long race condition - 13 seconds between first caching the valid stat results and next needing to refer to the cached-now-invalid results. At no time other than during the "touch" was the file open and in use.

I'm wondering whether this sort of bogosity affects the production nfs servers at work, where the even worse result of returning the previous contents of the file prior to the mv *indefinitely* (i.e. far longer than acregmax/acdirmax) occurs if anything other than lookupcache=none is specified, as per https://bugzilla.redhat.com/show_bug.cgi?id=488780 and https://bugzilla.redhat.com/show_bug.cgi?id=113636. It had been attributed to using ext3 with second resolution rather than ext4 with nanosecond resolution, but that sounds like a furphy. As I explain in comment 9 of redhat BZ 488780, lookupcache=none is an unworkable workaround.
On Fri, 2012-11-02 at 04:10 +0000, bugzilla-daemon@bugzilla.kernel.org wrote:
> https://bugzilla.kernel.org/show_bug.cgi?id=12557
>
> --- Comment #21 from Tim Connors <tim.w.connors@gmail.com> 2012-11-02 04:10:27 ---
> That's a pretty damn long race condition - 13 seconds between first caching
> the valid stat results and next needing to refer to the cached-now-invalid
> results. At no time other than during the "touch" was the file open and in use.

Well, here's what I get on a stock Fedora 17 based 3.6.3 client:

[trondmy@lade tmp]$ stat .; touch tmp.tmp; stat tmp.tmp; sleep 9; ssh dragvoll "(cd /home/trondmy/tmp; touch tmp.tmp1; mv tmp.tmp1 tmp.tmp)"; sleep 4; stat tmp.tmp; stat .
  File: `.'
  Size: 4096            Blocks: 8          IO Block: 32768  directory
Device: 26h/38d Inode: 229507141   Links: 4
Access: (0754/drwxr-xr--)  Uid: (  520/ trondmy)   Gid: (  100/   users)
Context: system_u:object_r:nfs_t:s0
Access: 2012-11-02 17:15:02.172804537 -0400
Modify: 2012-11-02 17:15:03.046066225 -0400
Change: 2012-11-02 17:15:03.000000000 -0400
 Birth: -
  File: `tmp.tmp'
  Size: 0               Blocks: 0          IO Block: 32768  regular empty file
Device: 26h/38d Inode: 229512286   Links: 1
Access: (0640/-rw-r-----)  Uid: (  520/ trondmy)   Gid: (  100/   users)
Context: system_u:object_r:nfs_t:s0
Access: 2012-11-02 17:15:21.189926202 -0400
Modify: 2012-11-02 17:15:21.189926202 -0400
Change: 2012-11-02 17:15:21.000000000 -0400
 Birth: -
  File: `tmp.tmp'
  Size: 0               Blocks: 0          IO Block: 32768  regular empty file
Device: 26h/38d Inode: 229512286   Links: 1
Access: (0640/-rw-r-----)  Uid: (  520/ trondmy)   Gid: (  100/   users)
Context: system_u:object_r:nfs_t:s0
Access: 2012-11-02 17:15:21.189926202 -0400
Modify: 2012-11-02 17:15:21.189926202 -0400
Change: 2012-11-02 17:15:21.000000000 -0400
 Birth: -
  File: `.'
  Size: 4096            Blocks: 8          IO Block: 32768  directory
Device: 26h/38d Inode: 229507141   Links: 4
Access: (0754/drwxr-xr--)  Uid: (  520/ trondmy)   Gid: (  100/   users)
Context: system_u:object_r:nfs_t:s0
Access: 2012-11-02 17:15:02.172804537 -0400
Modify: 2012-11-02 17:15:21.186925364 -0400
Change: 2012-11-02 17:15:21.000000000 -0400
 Birth: -

If you look at the wireshark trace for that, it turns out that after 14 seconds the stat for tmp.tmp _does_ in fact go over the wire, and the Linux server at the other end is happy to tell me not only that the filehandle for the original tmp.tmp is still valid, but that the file it points to has nlink==1... The stat for the current directory '.', on the other hand, is cached, because acdirmin=15 by default.

So the problem here is not necessarily a violation of acregmin/acregmax rules, as you appear to think (without offering any proof).
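For reference, tightening the directory attribute cache in that test is just a mount option (a sketch; server and path are placeholders):

# acdirmin sets the minimum time, in seconds, for which cached directory
# attributes are trusted before being revalidated against the server
mount -t nfs -o acdirmin=3 server:/export /mnt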
Created attachment 85401 [details] Wireshark trace of the stat() test in comment #22
Created attachment 85411 [details]
Wireshark trace of the stat() test in comment #22 but with acdirmin=3

Here is another trace that shows what happens when you adjust acdirmin down to reflect the actual frequency of remote changes in the directory. This time, the second stat triggers an ACCESS call, which tells the client that the directory has changed, and triggers a LOOKUP of the file. We now get:

  File: `.'
  Size: 4096            Blocks: 8          IO Block: 1048576 directory
Device: 26h/38d Inode: 1184818     Links: 2
Access: (0755/drwxr-xr-x)  Uid: (  520/ trondmy)   Gid: (  100/   users)
Context: system_u:object_r:nfs_t:s0
Access: 2012-11-02 17:53:10.911703138 -0400
Modify: 2012-11-02 17:54:12.667679207 -0400
Change: 2012-11-02 17:54:12.667679207 -0400
 Birth: -
  File: `tmp.tmp'
  Size: 0               Blocks: 0          IO Block: 1048576 regular empty file
Device: 26h/38d Inode: 1188042     Links: 1
Access: (0644/-rw-r--r--)  Uid: (  520/ trondmy)   Gid: (  100/   users)
Context: system_u:object_r:nfs_t:s0
Access: 2012-11-02 17:54:25.855674106 -0400
Modify: 2012-11-02 17:54:25.855674106 -0400
Change: 2012-11-02 17:54:25.855674106 -0400
 Birth: -
  File: `tmp.tmp'
  Size: 0               Blocks: 0          IO Block: 1048576 regular empty file
Device: 26h/38d Inode: 1196335     Links: 1
Access: (0664/-rw-rw-r--)  Uid: (  520/ trondmy)   Gid: (  520/ trondmy)
Context: system_u:object_r:nfs_t:s0
Access: 2012-11-02 17:54:35.222670615 -0400
Modify: 2012-11-02 17:54:35.222670615 -0400
Change: 2012-11-02 17:54:35.223670615 -0400
 Birth: -
  File: `.'
  Size: 4096            Blocks: 8          IO Block: 1048576 directory
Device: 26h/38d Inode: 1184818     Links: 2
Access: (0755/drwxr-xr-x)  Uid: (  520/ trondmy)   Gid: (  100/   users)
Context: system_u:object_r:nfs_t:s0
Access: 2012-11-02 17:53:10.911703138 -0400
Modify: 2012-11-02 17:54:35.223670615 -0400
Change: 2012-11-02 17:54:35.223670615 -0400
 Birth: -

IOW: all changes to the directory are seen correctly.