We appear to be running into 'caching issues' with NFS in this version of the kernel. We had recently switched to this version in order to get around a problem we had encountered with NFS in version 3.18.9.
After switching to version 4.1.6, our parallelized and distributed workflows now fail consistently with errors of the form:
T34: ./regex.c:39:22: error: config.h: No such file or directory
This error message is from a relatively simple test case (compared to our regular workflow): a parallelized and distributed build of binutils 2.25.1 using lsmake (a proprietary make utility from IBM/LSF). The test case runs in parallel on 2 hosts. In nearly all of the failures, "config.h" is created on host A and the failure occurs on host B.
We have already tried mounting the filesystem we were using for the test case with progressively lower values of acregmin/acregmax/acdirmin/acdirmax, and even with lookupcache set to none. None of these helped.
We have also run simpler tests using the nfstest_cache utility from http://wiki.linux-nfs.org/wiki/index.php/NFStest. The results suggest that NFS caching is behaving normally.
Could you help shed some light on this issue?
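The failure pattern above (file created on host A, "No such file" on host B) can be probed with a small script run on host B while host A creates the file. This is only a minimal sketch in Python, not part of lsmake or NFStest; the function name `probe_then_wait` and its parameters are hypothetical.

```python
import os
import time


def probe_then_wait(path, timeout=10.0, interval=0.1):
    """Look up `path` before it exists, then poll until it can be opened.

    The initial os.stat() is intended to make the client cache a
    "not found" result for the name. Returns the seconds that elapsed
    before open() succeeded, or None if the file never became visible
    within `timeout` seconds.
    """
    try:
        os.stat(path)
    except FileNotFoundError:
        pass  # expected when the file has not been created yet

    start = time.monotonic()
    while time.monotonic() - start < timeout:
        try:
            with open(path, "rb"):
                return time.monotonic() - start
        except FileNotFoundError:
            time.sleep(interval)
    return None
```

On an affected client, one would expect the call to return None (or take far longer than the attribute cache timeouts) for a file that host A has already created, while a healthy client should see the file shortly after creation.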
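When lowering these mount options, it can be worth confirming that they actually took effect on the client. The sketch below parses text in the format of /proc/mounts and returns the options for one NFS mount point. It is an illustration only: the function name and the sample line (server name, export path, mount point, option values) are hypothetical, and it assumes the kernel lists the non-default cache options in /proc/mounts.

```python
def nfs_mount_options(mounts_text, mount_point):
    """Return a dict of mount options for the NFS mount at `mount_point`.

    `mounts_text` is text in /proc/mounts format. Options of the form
    key=value map to the value string; bare flags map to True.
    Returns None if no NFS mount is found at that mount point.
    """
    for line in mounts_text.splitlines():
        fields = line.split()
        if len(fields) < 4:
            continue
        _device, target, fstype, options = fields[:4]
        # Matches "nfs" and "nfs4"; the mount point check keeps other
        # filesystems whose type starts with "nfs" out of the result.
        if target != mount_point or not fstype.startswith("nfs"):
            continue
        parsed = {}
        for opt in options.split(","):
            key, sep, value = opt.partition("=")
            parsed[key] = value if sep else True
        return parsed
    return None


# Hypothetical sample line, for illustration only.
SAMPLE = (
    "server:/export /mnt/build nfs rw,relatime,vers=3,acregmin=0,"
    "acregmax=0,acdirmin=0,acdirmax=0,lookupcache=none 0 0\n"
)
```

For a live check, the contents of /proc/mounts can be read and passed in as `mounts_text`, then the `acregmin`, `acregmax`, `acdirmin`, `acdirmax` and `lookupcache` keys compared against what was requested.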
Created attachment 189641
log file from 'git bisect' testing
Fyi. From our 'git bisect' testing, the following commit appears to be
the cause of the behavior we've been seeing:
Author: Al Viro <email@example.com>
Date: Thu May 7 19:24:57 2015 -0400
namei: d_is_negative() should be checked before ->d_seq validation
Fetching ->d_inode, verifying ->d_seq and finding d_is_negative() to
be true does *not* mean that inode we'd fetched had been NULL - that
holds only while ->d_seq is still unchanged.
Shift d_is_negative() checks into lookup_fast() prior to ->d_seq
verification.
Reported-by: Steven Rostedt <firstname.lastname@example.org>
Tested-by: Steven Rostedt <email@example.com>
Signed-off-by: Al Viro <firstname.lastname@example.org>
Yes, that looks bad. The negative lookup code in lookup_fast() is circumventing the revalidation of the dentry.
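To illustrate why the ordering matters for a network filesystem, here is a toy model in Python. It is not kernel code: the classes and functions (`Dentry`, `revalidate`, `slow_lookup`, and the two lookup variants) are hypothetical simplifications, and the "server" is just a dict. The model assumes that revalidation compares the cached entry against the server's current state, which is far simpler than what NFS actually does.

```python
ENOENT = "ENOENT"


class Dentry:
    """Cached name lookup result; inode=None models a negative dentry."""

    def __init__(self, name, inode=None):
        self.name = name
        self.inode = inode


def revalidate(dentry, server):
    """Toy revalidation: the cached entry is valid only if the server agrees."""
    return server.get(dentry.name) == dentry.inode


def slow_lookup(dcache, server, name):
    """Ask the server and refresh the cache with a fresh entry."""
    inode = server.get(name)
    dcache[name] = Dentry(name, inode)
    return ENOENT if inode is None else inode


def lookup_negative_first(dcache, server, name):
    """Negative check happens before revalidation (the problematic order)."""
    dentry = dcache.get(name)
    if dentry is not None:
        if dentry.inode is None:
            return ENOENT  # stale negative entry is never revalidated
        if revalidate(dentry, server):
            return dentry.inode
    return slow_lookup(dcache, server, name)


def lookup_revalidate_first(dcache, server, name):
    """Revalidate first, and only then trust a negative result."""
    dentry = dcache.get(name)
    if dentry is not None and revalidate(dentry, server):
        return ENOENT if dentry.inode is None else dentry.inode
    return slow_lookup(dcache, server, name)
```

In this model, once a client has cached a negative entry for "config.h", `lookup_negative_first` keeps returning ENOENT even after another client creates the file on the server, while `lookup_revalidate_first` notices the stale entry and falls back to a fresh lookup.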
Reassigning to the VFS maintainer.
Created attachment 189751
[PATCH] namei: results of d_is_negative() should be checked after dentry revalidation
Please could you check whether or not the attached patch helps?
Fyi. The patch definitely helped. I just completed a test run and it passed.