Most recent kernel where this bug did not occur: don't know Distribution: RHEL 4.5 Hardware Environment: Athlon 4800+, P4 Software Environment: x86_64, i686, portable bug Problem Description: Whenever a file has a lease applied with fcntl(fd, F_SETLEASE, F_WRLCK); the modification time for the file as seen via NFSv3 is temporarily set to the time of the lease acquire. Once lease is removed, modification time reverts to correct original value. Altered modification time appease *only* via NFS. It does not change on the local system. Steps to reproduce: Run attached test program on a file and 'ls -l' the file via a remote NFS mount. Observe modification time. Once test program exits, check again. ----- Specific impact of this bug is spurious 'make' target rebuilding when a single source tree is used to build executables for multiple platforms, including Windows. Reported to Red Hat Bugzilla too. Complete evolution of discovery and analysis appears in report. https://bugzilla.redhat.com/show_bug.cgi?id=396281
Created attachment 13739 [details] nfs_lease_bug.c
Can you reproduce this on a mainline client? AFAIK, that should just return EINVAL. Otherwise, this looks like a problem for Red Hat...
Don't have time to build vanilla kernel and try it, but the testcase is trivial and should be easy to check on existing setups. Reported this as a courtesy in case it's in vanilla kernel. It's obscure--took months to figure out what was causing the spurious 'make' rebuilding. Frustrating experience.
>Can you reproduce this on a mainline client? BTW, this is a server-side issue.
This is by design--see fs/locks.c:lease_get_mtime(). The argument is: if somebody has a write lease on the file (are you exporting this via Samba? That'd be the typical user), then they're caching writes--they're explicitly telling us that we do not know whether the file is still the same, because we may have modified it on a remote client and not told us about it. So lease_get_mtime() reports the current time as the mtime, prompting you to actually try opening the file, at which point the write lease gets broken and any cached writes get flushed out. It's a terrible kludge, I agree, and maybe we should remove it. But I'd first like to understand what circumstances prompted smoebody to add it originally, and talk to the Samba people about how they're using these write leases. (In NFSv4, by the way, there's a callback to the client to allow the server to find out the attributes of a file that the client holding a write lease is caching, which solves this problem. We haven't implemented that yet; it seems likely to be an enormous pain. I wonder if Samba would want something similar?)
Yes, using a Samba mount along with NFS mounts to 'make' build an application on multiple platforms simultaneously. When re-building existing trees the result is spurious target building, a real problem. Took months of frustration to figure out what was going on. Actually reported this as a bug to Samba along with RH, and a request for info from the Samba team provoked the analysis that isolated the problem. Closed the Samba bug but it's all still there: https://bugzilla.samba.org/show_bug.cgi?id=5103 My impression for looking at the Samba code is that the leases are put in place so that files can be monitored for changes, but that's just a guess. Perhaps Samba can use a read lease instead of a write lease.
Thanks for the pointer to the samba bug. I understand the frustration. My feeling is that: yes, it's suboptimal (but perhaps not a bug) for samba to be requesting a write lease when a read lease (sufficient to alert it to modifications of the file) would be enough. It seems to me that the real bug is the incomplete lease implementation--if the purpose of a lease is to allow a remote client to cache writes to the file, then there's no way for us to give a sensible answer to a stat call, unless we break the lease first. Perhaps we could find out (in the samba case) what the consequences would be of not updating mtime in this case? I suppose the worst case would be that a modification to a file made on a samba client could be indefinitely delayed from being flushed to the server. If we agree that that would be less of a problem than these spurious bumps in the mtime, then the best solution for now may just be to rip out lease_get_mtime(). I'll cook up a straw-man patch....
Created attachment 13763 [details] remove lease_get_mtime Not sure if this is what we want to do, but it's at least something you could test to confirm the source of the problem.
Thanks! Happen to have a kernel source tree set up for the Centos/RH images running on the server, so I can try this out in the next week or so. Seems certain to work. In the interim perhaps you could start a dialog with Samba about the reasoning behind the application of the F_WRLCK? Or if you wish I could do so using the Samaba bug report cited above.
Created attachment 13773 [details] kernel-2.6.9-55_remove_lease_get_mtime.patch Adapted patch for RHEL 4.5 2.6.9-55 kernel (attached). Works as expected and eliminates the spurious target rebuilding when a concurrent Samaba and NFS 'make' is run on same source tree. Test case also shows expected behavior. Obviously did not do any regression testing with other Samba file scenarios. In our build trees the same files and directories are never written by more than one server.
Hmmm...kludge or not, it seems like using lease_get_mtime makes everything err on the side of tighter cache consistency, and that seems to be a good thing. I assume that samba is doing this to emulate oplocks on the share. In fact, snipping from the smb.conf manpage: Kernel oplocks support allows Samba oplocks to be broken whenever a local UNIX process or NFS operation accesses a file that smbd(8) has oplocked. This allows complete data consistency between SMB/CIFS, NFS and local file access (and is a very cool feature :-). ...but if we remove lease_get_mtime then that "very cool feature" is now a "very broken feature". A samba client could get an oplock on a file which may never be revoked even though clients may have the file open read only and are polling for changes. While lease_get_mtime is a kludge, it's one that appears to help data consistency and may help prevent corruption. I think that's generally a good thing and not something we should eliminate without some careful consideration. Finally, this probably can be worked around in userspace by setting "kernel oplocks = no" in smb.conf, though I haven't tested it and can't confirm it.
Tried "kernel oplocks = no" in the Samba config and it does work around the issue.
Thanks. That's good to know. The only downside of using that is that samba is not taking out leases for oplocks and is just tracking them internally. Accesses by other applications (and NFS) won't revoke the oplock so you won't have the same data consistency guarantees. On the other hand, the kernel patch that Bruce posted would just make NFS a lot less consistent about revoking the oplocks. When the client can satisfy the read from the cache, then the lease wouldn't be broken. When it has to satisfy the read by querying the server it would be. When the lease isn't broken then the oplock holder can continue to buffer up the writes, and you can end up with a large amount of dirty pages on the client outstanding. Bruce's patch might still be reasonable, but we need to be aware that there might be drawbacks and it could break some people relying on this behavior. Given that this is a nfsv2/3 specific kludge, I'm less keen on breaking it. Finally, NFSv4 doesn't use lease_get_mtime since it's a stateful protocol, and actually has nfsd hold the file open. I assume that sharing the same data with samba and nfsv4 might be a lot smoother, but you may just be best off having the Linux clients use CIFS if you need consistency with oplocks.