TLDR: nfs sets a questionably high readahead size (2*3840 pages), splits readahead up into individual reads of 256 pages and submits these to the server asynchronously. This results in multiple nfsd threads on the server blocking each other while waiting for i/o on the underlying file system and block device.

Long: Our nfs servers often appear to be overloaded. It seems that too many nfsd threads are blocked doing i/o to the same block device, leaving not enough threads free to answer new requests. This went so far that other, idle clients lost state, because they couldn't refresh their leases in time. The parallelism seen on the servers does not always seem to be justified by the load on the clients. A simple `cat BIG-FILE-ON-NFS-SERVER > /dev/null` can engage many, many nfsd threads:

The readahead size of an nfs mount with default values is set to 3840 pages(!) by nfs using the "server->rpages * NFS_MAX_READAHEAD" formula. rpages is 1MB/(page size) by default and NFS_MAX_READAHEAD is RPC_DEF_SLOT_TABLE(=16)-1, so it's 256*15 = 3840. Because cat uses `posix_fadvise(.., POSIX_FADV_SEQUENTIAL)`, this number is doubled again to 7680 pages. The initial readahead size used is only 512 pages, but during the read, the readahead algorithm increases that to reads of size 7680 pages.

These page reads are collected by nfs in batches of 256 pages (because of the rsize of 1MB) and submitted asynchronously to the server. This would account for up to 30 requests submitted in parallel to the server, all reading the same file from the same block device, congesting it and putting the nfsd threads out of service for other uses. And that is just from one application thread on a single client.

We currently try to mitigate the problem by reducing sunrpc.tcp_max_slot_table_entries from its default value (RPC_MAX_SLOT_TABLE_LIMIT=65536) to something much smaller, e.g. 4. With this setting, a client seems to engage only 4 nfsd threads on the server, while it doesn't get slower in the `cat BIGFILE` case. But we haven't tested this on highly loaded machines yet.

Decreasing the readahead size of nfs helps, but not too much: the minimum readahead size is twice the read size of the buffered i/o, which - in the case of cat - is 256 pages. Turning off readahead for nfs altogether (ra_size=0) slows down the client unacceptably. Increasing the number of nfsd threads (typically 64 at the moment) wouldn't help either; the block device would still be overloaded by the parallel reads. And you can't reserve 30 nfsd threads for each possible thread on each client.

On local file access, readahead would stop because of backpressure from a congested block device. However, there seems to be no backpressure path from the nfs server to the client (aside from exhausting the rpc slots).
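To have the numbers from above in one place (assuming the usual 4 KiB page size):

    rpages            = rsize / page size      = 1048576 / 4096 = 256 pages per READ
    NFS_MAX_READAHEAD = RPC_DEF_SLOT_TABLE - 1 = 16 - 1         = 15
    ra_pages          = rpages * NFS_MAX_READAHEAD              = 3840 pages (~15 MiB)
    doubled by POSIX_FADV_SEQUENTIAL                            = 7680 pages (~30 MiB)
    7680 / 256                                                  = up to 30 READs in flight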
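For reference, a rough sketch of how we poke at the two client-side knobs mentioned above. The bdi device number (0:52 here) is only an example and differs per mount, the 1024 is an arbitrary value, and whether a lowered tcp_max_slot_table_entries affects already-established transports may depend on the kernel, so a remount might be needed:

    # per-mount readahead: find the bdi of the nfs mount and lower read_ahead_kb
    mountpoint -d /project/stresser                  # prints the device number, e.g. 0:52
    cat /sys/class/bdi/0:52/read_ahead_kb            # 3840 pages show up as 15360 (KiB)
    echo 1024 > /sys/class/bdi/0:52/read_ahead_kb    # cap readahead at 1 MiB

    # limit the number of rpc slots (and thus concurrent READs) per transport
    sysctl -w sunrpc.tcp_max_slot_table_entries=4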
Demonstration (clean cache on client, file cached on server):

buczek@simsalabimbambasaladusaladim:~/nfs_problems/nfs_problem7$ cat ./test_1.sh
#! /bin/bash
set -ve
time cat /project/stresser/files/random_10G.dat >/dev/null &
sleep 5
sudo bash -c 'cat /sys/kernel/debug/sunrpc/rpc_clnt/*/tasks'
cat /proc/mounts|grep stresser
uname -a
ssh mrtorgue uname -a
wait
buczek@simsalabimbambasaladusaladim:~/nfs_problems/nfs_problem7$ ./test_1.sh
time cat /project/stresser/files/random_10G.dat >/dev/null &
sleep 5
sudo bash -c 'cat /sys/kernel/debug/sunrpc/rpc_clnt/*/tasks'
43230 4801      0 0x8 0x8cde1f3f    60000 unregister_nfs_fs [nfs] nfsv4 READ a:call_status [sunrpc] q:xprt_pending
43237 4801      0 0x8 0x93de1f3f    60000 unregister_nfs_fs [nfs] nfsv4 READ a:call_status [sunrpc] q:xprt_pending
43238 4801      0 0x8 0x94de1f3f    60000 unregister_nfs_fs [nfs] nfsv4 READ a:call_status [sunrpc] q:xprt_pending
43239 4001      0 0x8 0x95de1f3f        0 unregister_nfs_fs [nfs] nfsv4 READ a:call_transmit_status [sunrpc] q:none
43240 4001    -11 0x8 0x96de1f3f        0 unregister_nfs_fs [nfs] nfsv4 READ a:call_status [sunrpc] q:xprt_sending
43241 4001    -11 0x8 0x0               0 unregister_nfs_fs [nfs] nfsv4 READ a:call_reserveresult [sunrpc] q:xprt_sending
43242 4001    -11 0x8 0x0               0 unregister_nfs_fs [nfs] nfsv4 READ a:call_reserveresult [sunrpc] q:xprt_sending
43243 4001    -11 0x8 0x0               0 unregister_nfs_fs [nfs] nfsv4 READ a:call_reserveresult [sunrpc] q:xprt_sending
43244 4001    -11 0x8 0x0               0 unregister_nfs_fs [nfs] nfsv4 READ a:call_reserveresult [sunrpc] q:xprt_sending
43245 4001    -11 0x8 0x0               0 unregister_nfs_fs [nfs] nfsv4 READ a:call_reserveresult [sunrpc] q:xprt_sending
43246 4001    -11 0x8 0x0               0 unregister_nfs_fs [nfs] nfsv4 READ a:call_reserveresult [sunrpc] q:xprt_sending
43247 4001    -11 0x8 0x0               0 unregister_nfs_fs [nfs] nfsv4 READ a:call_reserveresult [sunrpc] q:xprt_sending
43248 4001    -11 0x8 0x0               0 unregister_nfs_fs [nfs] nfsv4 READ a:call_reserveresult [sunrpc] q:xprt_sending
43249 4001    -11 0x8 0x0               0 unregister_nfs_fs [nfs] nfsv4 READ a:call_reserveresult [sunrpc] q:xprt_sending
43250 4001    -11 0x8 0x0               0 unregister_nfs_fs [nfs] nfsv4 READ a:call_reserveresult [sunrpc] q:xprt_sending
43251 4001    -11 0x8 0x0               0 unregister_nfs_fs [nfs] nfsv4 READ a:call_reserveresult [sunrpc] q:xprt_sending
43252 4001    -11 0x8 0x0               0 unregister_nfs_fs [nfs] nfsv4 READ a:call_reserveresult [sunrpc] q:xprt_sending
43253 4001    -11 0x8 0x0               0 unregister_nfs_fs [nfs] nfsv4 READ a:call_reserveresult [sunrpc] q:xprt_sending
43254 4001    -11 0x8 0x0               0 unregister_nfs_fs [nfs] nfsv4 READ a:call_reserveresult [sunrpc] q:xprt_sending
43255 4001    -11 0x8 0x0               0 unregister_nfs_fs [nfs] nfsv4 READ a:call_reserveresult [sunrpc] q:xprt_sending
43256 4001    -11 0x8 0x0               0 unregister_nfs_fs [nfs] nfsv4 READ a:call_reserveresult [sunrpc] q:xprt_sending
43257 4001    -11 0x8 0x0               0 unregister_nfs_fs [nfs] nfsv4 READ a:call_reserveresult [sunrpc] q:xprt_sending
43258 4001    -11 0x8 0x0               0 unregister_nfs_fs [nfs] nfsv4 READ a:call_reserveresult [sunrpc] q:xprt_sending
43259 4001    -11 0x8 0x0               0 unregister_nfs_fs [nfs] nfsv4 READ a:call_reserveresult [sunrpc] q:xprt_sending
43260 4001    -11 0x8 0x0               0 unregister_nfs_fs [nfs] nfsv4 READ a:call_reserveresult [sunrpc] q:xprt_sending
43261 4001    -11 0x8 0x0               0 unregister_nfs_fs [nfs] nfsv4 READ a:call_reserveresult [sunrpc] q:xprt_sending
43262 4001    -11 0x8 0x0               0 unregister_nfs_fs [nfs] nfsv4 READ a:call_reserveresult [sunrpc] q:xprt_sending
43263 4001    -11 0x8 0x0               0 unregister_nfs_fs [nfs] nfsv4 READ a:call_reserveresult [sunrpc] q:xprt_sending
43264 4001    -11 0x8 0x0               0 unregister_nfs_fs [nfs] nfsv4 READ a:call_reserveresult [sunrpc] q:xprt_sending
43265 4001    -11 0x8 0x0               0 unregister_nfs_fs [nfs] nfsv4 READ a:call_reserveresult [sunrpc] q:xprt_sending
43266 4001    -11 0x8 0x0               0 unregister_nfs_fs [nfs] nfsv4 READ a:call_reserveresult [sunrpc] q:xprt_sending
43267 4001    -11 0x8 0x0               0 unregister_nfs_fs [nfs] nfsv4 READ a:call_reserveresult [sunrpc] q:xprt_sending
cat /proc/mounts|grep stresser
mrtorgue:/amd/mrtorgue/2/project/stresser /project/stresser nfs4 rw,nosuid,relatime,vers=4.0,rsize=1048576,wsize=1048576,namlen=255,hard,proto=tcp,timeo=600,retrans=2,sec=sys,clientaddr=141.14.16.144,local_lock=none,addr=141.14.17.250 0 0
uname -a
Linux simsalabimbambasaladusaladim.molgen.mpg.de 4.19.40.mx64.262 #1 SMP Mon May 6 12:38:40 CEST 2019 x86_64 GNU/Linux
ssh mrtorgue uname -a
Linux mrtorgue.molgen.mpg.de 4.14.87.mx64.236 #1 SMP Mon Dec 10 09:48:57 CET 2018 x86_64 GNU/Linux
wait

real    0m11.533s
user    0m0.010s
sys     0m8.238s
buczek@simsalabimbambasaladusaladim:~/nfs_problems/nfs_problem7$

I wonder if this is expected behavior?

Best
  Donald
This seems like a server QoS problem rather than a client issue. The client will happily ensure that other requests are scheduled in between the READs, so this should not cause a problem for processes running on the same client. It is not the job of the client to ensure that servers schedule requests from other clients properly.
Assuming the server is running a recent Linux kernel, it might be worth looking at the sunrpc.svc_rpc_per_connection_limit module parameter.
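For illustration, a minimal sketch of how that parameter can be set on the server; the value 8 is an arbitrary example, 0 (the default) means unlimited, and on some kernels it may have to be given as a module option at load time rather than written through sysfs:

    # cap the number of requests a single client connection may have in flight
    cat /sys/module/sunrpc/parameters/svc_rpc_per_connection_limit      # 0 = unlimited
    echo 8 > /sys/module/sunrpc/parameters/svc_rpc_per_connection_limit

    # or persistently, e.g. in /etc/modprobe.d/sunrpc.conf:
    # options sunrpc svc_rpc_per_connection_limit=8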
To be quite frank, we've seen similar problems at Hammerspace, and we have a set of patches that attempt to ensure that the server has at least one dedicated thread allocated per connection. Hammerspace is, of course, quite happy to contribute these patches upstream (as obligated to do by the GPL), and I am planning to submit them for review soon. The main hangup is just dedicating the time to clean the patches up.