We appear to be running into caching issues with NFS in this version of the kernel. We recently switched to this version to work around a problem we had encountered with NFS in version 3.18.9. After switching to version 4.1.6, our parallelized and distributed workflows now fail consistently with errors of the form:

    T34: ./regex.c:39:22: error: config.h: No such file or directory

This error message comes from a relatively simple test case (compared to our regular workflow): a parallelized and distributed build of binutils 2.25.1 using lsmake (a proprietary make utility from IBM/LSF). The test case runs in parallel on 2 hosts. In the failures, "config.h" is almost always created on host A, while the failure itself occurs on host B.

We have already tried mounting the filesystem used for the test case with progressively lower values of acregmin/acregmax/acdirmin/acdirmax, and even with lookupcache set to none. None of these helped.

We have also run simpler tests using the nfstest_cache utility from http://wiki.linux-nfs.org/wiki/index.php/NFStest. The results suggest that NFS caching is behaving normally.

Could you help shed some light on this issue? Thank you.

-- Leandro Awa
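A minimal probe one might run on host B to check whether a file created on host A ever becomes visible there. This is a sketch under stated assumptions: the helper name `wait_for_file()`, the polling loop and its parameters are hypothetical and are not part of lsmake or the NFStest suite; it only uses the standard POSIX `stat()` and `nanosleep()` calls.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <errno.h>
#include <sys/stat.h>
#include <time.h>

/* Hypothetical helper: poll `path` with stat() up to `attempts` times,
 * sleeping `delay_ms` milliseconds between tries. Returns 0 as soon as
 * the file is visible, or the last errno seen (typically ENOENT) if it
 * never appears. */
static int wait_for_file(const char *path, int attempts, long delay_ms)
{
    struct stat st;
    struct timespec pause = { .tv_sec = delay_ms / 1000,
                              .tv_nsec = (delay_ms % 1000) * 1000000L };
    int err = ENOENT;

    assert(path != NULL && attempts > 0 && delay_ms >= 0);
    for (int i = 0; i < attempts; i++) {
        if (stat(path, &st) == 0)
            return 0;               /* lookup succeeded */
        err = errno;                /* remember why it failed */
        if (i + 1 < attempts)
            nanosleep(&pause, NULL);
    }
    return err;
}
```

One way to use it: on host B, stat() config.h once before it exists (so the client caches a failed lookup), create the file on host A, then call `wait_for_file("config.h", 50, 100)` on host B. On an affected client the probe might keep returning ENOENT even though the file exists on the server; that is a hypothesis to test, not an observed result.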
Created attachment 189641: log file from 'git bisect' testing
FYI. Our 'git bisect' testing points to the following commit as the likely cause of the behavior we've been seeing:

    commit 766c4cbfacd8634d7580bac6a1b8456e63de3e84
    Author: Al Viro <viro@zeniv.linux.org.uk>
    Date:   Thu May 7 19:24:57 2015 -0400

        namei: d_is_negative() should be checked before ->d_seq validation

        Fetching ->d_inode, verifying ->d_seq and finding d_is_negative() to
        be true does *not* mean that inode we'd fetched had been NULL - that
        holds only while ->d_seq is still unchanged.  Shift d_is_negative()
        checks into lookup_fast() prior to ->d_seq verification.

        Reported-by: Steven Rostedt <rostedt@goodmis.org>
        Tested-by: Steven Rostedt <rostedt@goodmis.org>
        Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>

-- Leandro Awa
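The commit message relies on the seqcount pattern behind ->d_seq: a lockless reader snapshots the counter, reads the fields, then rechecks the counter, and any conclusion drawn from those fields is only trustworthy if the counter did not change in between. The sketch below is a simplified userspace model of that idea, not kernel code; `struct model_dentry`, `model_set_inode()` and `model_is_negative()` are hypothetical names (the kernel itself uses helpers such as read_seqcount_retry()).

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

/* Hypothetical stand-in for a dentry: an even/odd sequence counter
 * guarding an inode pointer (NULL means "negative"). */
struct model_dentry {
    atomic_uint seq;
    _Atomic(void *) inode;
};

static void model_dentry_init(struct model_dentry *d)
{
    atomic_init(&d->seq, 0u);
    atomic_init(&d->inode, NULL);
}

/* Writer: make seq odd, update the inode, make seq even again.
 * Default (sequentially consistent) atomics keep the sketch simple. */
static void model_set_inode(struct model_dentry *d, void *inode)
{
    atomic_fetch_add(&d->seq, 1u);
    atomic_store(&d->inode, inode);
    atomic_fetch_add(&d->seq, 1u);
}

/* Lockless reader. Returns 1 (negative), 0 (positive) or -1 (raced with
 * a writer, so the caller must retry or fall back to a locked lookup).
 * The negativity test is taken *before* the final seq check, so it is
 * only reported if the unchanged counter shows it was not torn. */
static int model_is_negative(struct model_dentry *d)
{
    assert(d != NULL);
    unsigned int start = atomic_load(&d->seq);
    if (start & 1u)
        return -1;                          /* writer in progress */
    int negative = atomic_load(&d->inode) == NULL;
    if (atomic_load(&d->seq) != start)
        return -1;                          /* observation may be stale */
    return negative;
}
```

In the kernel, the -1 case roughly corresponds to lookup_fast() giving up on RCU-walk and returning -ECHILD.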
Yes, that looks bad. The negative lookup code in lookup_fast() is circumventing the revalidation of the dentry. Reassigning to the VFS maintainer.
Created attachment 189751: [PATCH] namei: results of d_is_negative() should be checked after dentry revalidation Please could you check whether or not the attached patch helps?
FYI: the patch definitely helped. I just completed a test run and it passed. Thank you.

-- Leandro Awa