Bug 1402 - filesystem cache invalidations take too long at NFS mount time
filesystem cache invalidations take too long at NFS mount time
Product: File System
Classification: Unclassified
Component: VFS
i386 Linux
Importance: P2 normal
Assigned To: Jose R. Santos
Depends on:
Reported: 2003-10-22 08:09 UTC by Dave Hansen
Modified: 2004-12-05 05:55 UTC
CC List: 4 users

See Also:
Kernel Version: 2.6.0-test5
Tree: Mainline
Regression: ---


Description Dave Hansen 2003-10-22 08:09:23 UTC
Hardware Environment:
* Server
- p650 8x 1.45GHz POWER4
- 32GB RAM
- 8x FlipperX Adapters (Emulex FC)
- 6x Goliad GB Adapter (Intel Gigabit)
- 4x FAStT 700 (56 arrays defined)
* Client
- p660 8x 750MHz RS64
- 16GB RAM
- 6x Goliad GB Adapter (Intel Gigabit)

Software Environment:
* Server
- 2.6.0-test5 from BitKeeper
- 1.21f Emulex Open Source Driver (ported to 2.6)
- SpecSFS97 benchmark code
* Client
- SLES8 SR3 beta
- SpecSFS97 benchmark code


  The SpecSFS benchmark fails on the second iteration attempt because it takes
too long for the clients to mount the exported NFS directories.  When mounting
the NFS exports by hand, the mountd thread on the server consumes 100% of a
single CPU.  The machine is also very unresponsive while this is happening.

Time to mount 56 Disks
	- Before Running SFS: 1.6sec
	- After Running SFS: 1min 38sec

A profile of the machine showed the following.
Hottest Routines
	- ( 84.6%) .cpu_idle 
	- (  6.6%) .invalidate_list
	- (  3.8%) .shrink_dcache_sb 
	- (  1.9%) .do_slb_bolted
	- (  1.3%) .trc_spinlock
	- (  0.4%) .__copy_tofrom_user

The problem is caused by the nfsd filesystem (fs/nfsctl.c, new for 2.6) which
controls the nfsd threads.  Every time we call into this filesystem we trigger
an invalidation of both the inode cache and the dcache.

The inode cache is invalidated by the fput() call in sys_nfsservctl().  The flow
of the code goes like this:


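A rough sketch of that teardown path, reconstructed from reading the
2.6.0-test5 sources (an approximation, not a verbatim trace; intermediate
helpers may differ):

```
sys_nfsservctl()
  fput(file)                          /* drops the last file reference */
    __fput()
      mntput()
        deactivate_super()            /* last mount reference is gone */
          kill_litter_super()         /* nfsd_fs_type->kill_sb */
            kill_anon_super()
              generic_shutdown_super()
                shrink_dcache_parent(sb->s_root)
                invalidate_inodes(sb) /* the hot routine in the profile */
```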
Looking at fs/nfsd/nfsctl.c
static struct file_system_type nfsd_fs_type = {
        .owner          = THIS_MODULE,
        .name           = "nfsd",
        .get_sb         = nfsd_get_sb,
        .kill_sb        = kill_litter_super,
};
.kill_sb is defined as kill_litter_super().

The dcache, on the other hand, gets invalidated when we do a do_kern_mount()
in do_open() (fs/nfsctl.c).  Every time we issue an nfsctl system call we
mount this small fs.  The flow for this case looks like:
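A sketch of that mount path, again reconstructed from the 2.6.0-test5 sources
(the key point being that get_sb_single() remounts the already-live
superblock):

```
do_open()                             /* fs/nfsctl.c */
  do_kern_mount("nfsd", ...)
    nfsd_get_sb()                     /* nfsd_fs_type->get_sb */
      get_sb_single()                 /* sb already exists after first call */
        do_remount_sb()
          shrink_dcache_sb(sb)        /* the other hot routine in the profile */
```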


The second iteration is the one that fails because by the time the first
iteration is done most of the system memory is occupied by the icache and
dcache.  For some reason the kernel feels the need to invalidate all of the
dcache and icache of the filesystems the server is exporting to the clients.
Looking at /proc/slabinfo I see that the memory is not being freed up, but it
is if we unmount the local filesystems.  This problem was not seen on my other
test system since it only had 1GB of memory.

The problem is easy to recreate without running SFS if the SFS data is still
on the disk.  To recreate without SFS:

1) umount /all_56_disks
2) mount /all_56_disks /somewhere
3) find /all_56_disks -type f -exec cat {} >/dev/null 2>&1 \;
4) mount the NFS exports on the client.
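The same four steps as a small dry-run script; the device name and NFS paths
are hypothetical placeholders, and the `run` wrapper only prints each command
(drop the wrapper to execute for real, as root):

```shell
#!/bin/sh
# Dry-run sketch of the reproduction steps above.  /dev/array0,
# server:/all_56_disks, and /mnt/nfs are hypothetical placeholders.
run() { echo "+ $*"; }                 # print each command instead of executing

run umount /all_56_disks                          # 1) drop the local mounts
run mount /dev/array0 /all_56_disks               # 2) remount the data disks
run find /all_56_disks -type f -exec cat '{}' ';' # 3) repopulate icache/dcache
run mount -t nfs server:/all_56_disks /mnt/nfs    # 4) time this on the client
```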

I will attach a patch that works around the problem.  It does the following:

1) Eliminates the fput() call in sys_nfsservctl().  I really don't know why it
is needed since this filesystem doesn't really handle files; it just causes
nfsctl to take some actions depending on the way we open the file.  (Could
someone confirm this is not needed?)

2) If the filesystem name is nfsd, do not run shrink_dcache_sb() in
do_remount_sb().  Again, this filesystem doesn't handle real files so it is
"probably" safe to do this.  (Again, some confirmation would be nice.)

The patch is just a workaround for the problem.  Looking at the code, it seems
like the best way to fix this is by changing nfsd_fs_type->(get_sb/kill_sb) to
something that handles this FS in a better way.

------- Additional Comment #1 From Jose R. Santos 2003-10-16 17:37 -------

Created an attachment (id=1847)
Patch to avoid invalidating caches

------- Additional Comment #2 From Michael Ranweiler 2003-10-16 17:59 -------

Bruce Allen does NFS stuff - hopefully he can comment on the patch....

------- Additional Comment #3 From Gregory C. Kroah-Hartman 2003-10-16 18:40 -------


You cannot just leak memory because you want fast benchmark times.

------- Additional Comment #4 From David C. Hansen 2003-10-16 19:08 -------

I believe a more acceptable solution for now is to increase the SFS timeout for
that particular operation.  I think the benchmark is just trying to make sure
that the mount doesn't take *forever* because the NFS server has gone away.  We
can worry about compliance later.  

------- Additional Comment #5 From Jose R. Santos 2003-10-16 21:21 -------

The patch was never meant to be a solution.  It just illustrates the areas
where this could be fixed.

I did try increasing the timeout value in the source code, but it still
failed.  I may need to increase the value by a ridiculous amount.

Greg's suggestion to actually mount the fs to avoid the invalidation seems
interesting.  I'll try it as soon as my system becomes available.

------- Additional Comment #6 From Jose R. Santos 2003-10-17 13:21 -------

I managed to reproduce this on another system with 16GB of memory.

I also tested mounting the nfsd filesystem at /tmp/nfsd.  Time was
significantly reduced, from 1m50.732s to 0m30.226s.  So doing this seems to
eliminate the need to invalidate the inode cache, since there is an active
reference left on the superblock.

However, it still doesn't resolve the dentry cache issue, since we mount
whenever we call do_open() (fs/nfsctl.c).  I need to see how this looks on the
32GB POWER4 box.

------- Additional Comment #7 From David C. Hansen 2003-10-21 18:43 -------

Can we close this bug now that we have an alternative method that doesn't use nfsd?

------- Additional Comment #8 From Jose R. Santos 2003-10-22 09:01 -------

What is the alternative method?  In the old method, the nfsctl system calls
point at the nfsd filesystem.  The new method manipulates the nfsd filesystem
directly.  Both methods use the nfsd filesystem.

Mounting the filesystem helps somewhat, but the delays to mount the NFS
exports are still big.
Comment 1 Andrew Morton 2003-10-22 12:12:34 UTC
The alternative is to speed up invalidate_inodes().
Kirill Korotaev <kk@sw.ru> was working on a patch;
I shall ask him if it is ready.
Comment 2 Jose R. Santos 2003-10-22 12:32:06 UTC
Well, we can avoid calling invalidate_inodes() by mounting the nfsd filesystem
somewhere.  This causes deactivate_super() not to kill the sb.

I still don't know how to avoid shrink_dcache_sb() being called.

Speedups to invalidate_inodes() are very welcome, since I doubt this is the
last time I'll see invalidate_inodes() problems.
Comment 3 Neil Brown 2003-10-22 20:55:59 UTC
The problem is NOT that the inode or dentry caches are being flushed (though
possibly you know this already, it isn't clear from the comments).

The problem is that the inode and dentry caches *for*the*(tiny)*nfsd*filesystem
are being flushed, and this requires iterating through each cache under a
spinlock, and this can take a while.

Basically, "mount" and "unmount" are slow operations.

As has been mentioned, "unmount" can be avoided by mounting the nfsd filesystem
somewhere.  I recommend /proc/fs/nfsd.
The "mount" can be avoided by not using the nfsservctl systemcall at all. 
This is achived by mounting nfsd at /proc/fs/nfsd, and running the latest
release of nfs-utils.

*fixing* the problem requires optimising the mount and unmount process,
which is, I think, beyond my scope as NFSD maintainer.  
Hence I hereby disown this problem :-)  It is now a "VFS" problem - mount and
unmount are too slow.
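In command form, Neil's recommendation amounts to the following (the mount
invocation is the standard one for the nfsd filesystem; the fstab line is the
usual way to make it persist across boots):

```
# one-off, as root:
mount -t nfsd nfsd /proc/fs/nfsd

# or persistently, via /etc/fstab:
nfsd    /proc/fs/nfsd   nfsd    defaults        0 0
```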
Comment 4 Alexander Nyberg 2004-12-05 05:37:13 UTC

Did this end up getting resolved?  Can the bug be closed?  It has been a year
since the last activity.
