Bug 204019 - server overload caused by client readahead
Summary: server overload caused by client readahead
Status: NEW
Alias: None
Product: File System
Classification: Unclassified
Component: NFS
Hardware: All
OS: Linux
Importance: P1 low
Assignee: Trond Myklebust
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2019-06-28 21:19 UTC by Donald Buczek
Modified: 2019-07-01 14:22 UTC
CC List: 2 users

See Also:
Kernel Version: 5.2.0-rc6, 4.19.40, nfs4
Subsystem:
Regression: No
Bisected commit-id:


Description Donald Buczek 2019-06-28 21:19:25 UTC
TLDR: nfs sets a questionably high readahead size (2*3840 pages), splits the readahead up into individual reads of 256 pages and submits these to the server asynchronously. This results in multiple nfsd threads on the server blocking each other while waiting for i/o on the underlying file system and block device.

Long:

Our nfs servers often appear to be overloaded. It seems that too many nfsd threads are blocked doing i/o to the same block device, leaving not enough threads free to answer new requests. This has gone so far that other, idle clients lost state because they couldn't refresh their leases in time.

The parallelism seen on the servers does not always seem to be justified by the load on the clients. A simple `cat BIG-FILE-ON-NFS-SERVER > /dev/null` can engage many, many nfsd threads:

The readahead size of an nfs mount with default values is set to 3840 pages(!) by nfs using the formula server->rpages * NFS_MAX_READAHEAD. rpages is 1 MB / (page size) by default and NFS_MAX_READAHEAD is RPC_DEF_SLOT_TABLE(=16) - 1, so it is 256 * 15 = 3840.
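
For reference, a small sketch of that arithmetic and of how the effective value can be checked on a live mount (assuming the standard sysfs BDI interface; the mount point is the one from the demonstration below):

    # rpages * NFS_MAX_READAHEAD, with 4 KiB pages and the default rsize of 1 MB
    echo $(( (1048576 / 4096) * (16 - 1) ))      # prints 3840 (pages)
    # the per-mount readahead lives in the mount's backing device info (BDI)
    bdi=$(mountpoint -d /project/stresser)       # prints the mount's "major:minor", e.g. "0:44"
    cat /sys/class/bdi/$bdi/read_ahead_kb        # 3840 pages = 15360 KB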

Because cat uses `posix_fadvise(..., POSIX_FADV_SEQUENTIAL)`, this number is doubled again to 7680 pages. The initial readahead size used is only 512 pages, but during the read the readahead algorithm ramps that up to reads of 7680 pages.

These page reads are collected by nfs into batches of 256 pages (because of the 1 MB rsize) and submitted asynchronously to the server. That accounts for up to 30 requests submitted in parallel to the server, all reading the same file from the same block device, congesting it and putting the nfsd threads out of service for other uses. And that is just from one application thread on a single client.

We currently try to mitigate the problem by reducing sunrpc.tcp_max_slot_table_entries from its default value (RPC_MAX_SLOT_TABLE_LIMIT=65536) to something much smaller, e.g. 4. With this setting, a client seems to engage only 4 nfsd threads on the server while not getting any slower in the `cat BIGFILE` case. But we haven't tested it on highly loaded machines yet.
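
For completeness, a minimal sketch of how we apply that setting (it is a sunrpc sysctl on the client; as far as we can tell it is picked up when the RPC transport is created, so it should be in place before the mount):

    # temporary, on the client
    sysctl -w sunrpc.tcp_max_slot_table_entries=4
    # persistent, e.g. in /etc/sysctl.d/90-nfs-slots.conf:
    #   sunrpc.tcp_max_slot_table_entries = 4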

Decreasing the readahead size of nfs helps, but not too much: the minimum readahead size is twice the read size of the buffered i/o, which - in the case of cat - is 256 pages. Turning off readahead for nfs altogether (ra_size=0) slows down the client unacceptably.
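
In case it is useful to others, a sketch of how we lowered it per mount (assuming readahead is tuned via the mount's BDI in sysfs, as root):

    bdi=$(mountpoint -d /project/stresser)
    echo 1024 > /sys/class/bdi/$bdi/read_ahead_kb   # 256 pages (1024 KB) instead of the default 15360 KB
    echo 0    > /sys/class/bdi/$bdi/read_ahead_kb   # no readahead at all - unacceptably slow for us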

Increasing the number of nfsd threads (typically 64 at the moment) wouldn't help; the block device would still be overloaded by the parallel reads. And you can't reserve 30 nfsd threads for every possible thread on every client.
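
For reference, a sketch of how the thread count would be inspected or raised on the server (not that it addresses the congestion described above):

    cat /proc/fs/nfsd/threads    # current number of nfsd threads
    rpc.nfsd 128                 # raises the count, but the block device remains the bottleneck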

On local file access, readahead would stop because of backpressure from a congested block device. However, there seems to be no backpressure path from the nfs server to the client (aside from exhausting the rpc slots).

Demonstration (clean cache on client, file cached on server)

    buczek@simsalabimbambasaladusaladim:~/nfs_problems/nfs_problem7$ cat ./test_1.sh
    #! /bin/bash
    set -ve
    time cat /project/stresser/files/random_10G.dat >/dev/null &
    sleep 5
    sudo bash -c 'cat /sys/kernel/debug/sunrpc/rpc_clnt/*/tasks'
    cat /proc/mounts|grep stresser
    uname -a
    ssh mrtorgue uname -a
    wait

    buczek@simsalabimbambasaladusaladim:~/nfs_problems/nfs_problem7$ ./test_1.sh
    time cat /project/stresser/files/random_10G.dat >/dev/null &
    sleep 5
    sudo bash -c 'cat /sys/kernel/debug/sunrpc/rpc_clnt/*/tasks'
    43230 4801      0 0x8 0x8cde1f3f    60000 unregister_nfs_fs [nfs] nfsv4 READ a:call_status [sunrpc] q:xprt_pending
    43237 4801      0 0x8 0x93de1f3f    60000 unregister_nfs_fs [nfs] nfsv4 READ a:call_status [sunrpc] q:xprt_pending
    43238 4801      0 0x8 0x94de1f3f    60000 unregister_nfs_fs [nfs] nfsv4 READ a:call_status [sunrpc] q:xprt_pending
    43239 4001      0 0x8 0x95de1f3f        0 unregister_nfs_fs [nfs] nfsv4 READ a:call_transmit_status [sunrpc] q:none
    43240 4001    -11 0x8 0x96de1f3f        0 unregister_nfs_fs [nfs] nfsv4 READ a:call_status [sunrpc] q:xprt_sending
    43241 4001    -11 0x8 0x0        0 unregister_nfs_fs [nfs] nfsv4 READ a:call_reserveresult [sunrpc] q:xprt_sending
    43242 4001    -11 0x8 0x0        0 unregister_nfs_fs [nfs] nfsv4 READ a:call_reserveresult [sunrpc] q:xprt_sending
    43243 4001    -11 0x8 0x0        0 unregister_nfs_fs [nfs] nfsv4 READ a:call_reserveresult [sunrpc] q:xprt_sending
    43244 4001    -11 0x8 0x0        0 unregister_nfs_fs [nfs] nfsv4 READ a:call_reserveresult [sunrpc] q:xprt_sending
    43245 4001    -11 0x8 0x0        0 unregister_nfs_fs [nfs] nfsv4 READ a:call_reserveresult [sunrpc] q:xprt_sending
    43246 4001    -11 0x8 0x0        0 unregister_nfs_fs [nfs] nfsv4 READ a:call_reserveresult [sunrpc] q:xprt_sending
    43247 4001    -11 0x8 0x0        0 unregister_nfs_fs [nfs] nfsv4 READ a:call_reserveresult [sunrpc] q:xprt_sending
    43248 4001    -11 0x8 0x0        0 unregister_nfs_fs [nfs] nfsv4 READ a:call_reserveresult [sunrpc] q:xprt_sending
    43249 4001    -11 0x8 0x0        0 unregister_nfs_fs [nfs] nfsv4 READ a:call_reserveresult [sunrpc] q:xprt_sending
    43250 4001    -11 0x8 0x0        0 unregister_nfs_fs [nfs] nfsv4 READ a:call_reserveresult [sunrpc] q:xprt_sending
    43251 4001    -11 0x8 0x0        0 unregister_nfs_fs [nfs] nfsv4 READ a:call_reserveresult [sunrpc] q:xprt_sending
    43252 4001    -11 0x8 0x0        0 unregister_nfs_fs [nfs] nfsv4 READ a:call_reserveresult [sunrpc] q:xprt_sending
    43253 4001    -11 0x8 0x0        0 unregister_nfs_fs [nfs] nfsv4 READ a:call_reserveresult [sunrpc] q:xprt_sending
    43254 4001    -11 0x8 0x0        0 unregister_nfs_fs [nfs] nfsv4 READ a:call_reserveresult [sunrpc] q:xprt_sending
    43255 4001    -11 0x8 0x0        0 unregister_nfs_fs [nfs] nfsv4 READ a:call_reserveresult [sunrpc] q:xprt_sending
    43256 4001    -11 0x8 0x0        0 unregister_nfs_fs [nfs] nfsv4 READ a:call_reserveresult [sunrpc] q:xprt_sending
    43257 4001    -11 0x8 0x0        0 unregister_nfs_fs [nfs] nfsv4 READ a:call_reserveresult [sunrpc] q:xprt_sending
    43258 4001    -11 0x8 0x0        0 unregister_nfs_fs [nfs] nfsv4 READ a:call_reserveresult [sunrpc] q:xprt_sending
    43259 4001    -11 0x8 0x0        0 unregister_nfs_fs [nfs] nfsv4 READ a:call_reserveresult [sunrpc] q:xprt_sending
    43260 4001    -11 0x8 0x0        0 unregister_nfs_fs [nfs] nfsv4 READ a:call_reserveresult [sunrpc] q:xprt_sending
    43261 4001    -11 0x8 0x0        0 unregister_nfs_fs [nfs] nfsv4 READ a:call_reserveresult [sunrpc] q:xprt_sending
    43262 4001    -11 0x8 0x0        0 unregister_nfs_fs [nfs] nfsv4 READ a:call_reserveresult [sunrpc] q:xprt_sending
    43263 4001    -11 0x8 0x0        0 unregister_nfs_fs [nfs] nfsv4 READ a:call_reserveresult [sunrpc] q:xprt_sending
    43264 4001    -11 0x8 0x0        0 unregister_nfs_fs [nfs] nfsv4 READ a:call_reserveresult [sunrpc] q:xprt_sending
    43265 4001    -11 0x8 0x0        0 unregister_nfs_fs [nfs] nfsv4 READ a:call_reserveresult [sunrpc] q:xprt_sending
    43266 4001    -11 0x8 0x0        0 unregister_nfs_fs [nfs] nfsv4 READ a:call_reserveresult [sunrpc] q:xprt_sending
    43267 4001    -11 0x8 0x0        0 unregister_nfs_fs [nfs] nfsv4 READ a:call_reserveresult [sunrpc] q:xprt_sending
    cat /proc/mounts|grep stresser
    mrtorgue:/amd/mrtorgue/2/project/stresser /project/stresser nfs4 rw,nosuid,relatime,vers=4.0,rsize=1048576,wsize=1048576,namlen=255,hard,proto=tcp,timeo=600,retrans=2,sec=sys,clientaddr=141.14.16.144,local_lock=none,addr=141.14.17.250 0 0
    uname -a
    Linux simsalabimbambasaladusaladim.molgen.mpg.de 4.19.40.mx64.262 #1 SMP Mon May 6 12:38:40 CEST 2019 x86_64 GNU/Linux
    ssh mrtorgue uname -a
    Linux mrtorgue.molgen.mpg.de 4.14.87.mx64.236 #1 SMP Mon Dec 10 09:48:57 CET 2018 x86_64 GNU/Linux
    wait
    real    0m11.533s
    user    0m0.010s
    sys    0m8.238s
    buczek@simsalabimbambasaladusaladim:~/nfs_problems/nfs_problem7$

I wonder if this is expected behavior?

Best
  Donald
Comment 1 Trond Myklebust 2019-06-28 21:44:22 UTC
This seems like a server QoS problem rather than a client issue.

The client will happily ensure that other requests are scheduled in between the READs, so this should not cause a problem for processes running on the same client.

It is not the job of the client to ensure that servers schedule requests from other clients properly.
Comment 2 bfields 2019-06-28 23:38:55 UTC
Assuming the server is a recent Linux, it might be worth looking at the sunrpc.svc_rpc_per_connection_limit module parameter.
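
A sketch of how it can be checked and set at runtime on the server (assuming the usual module-parameter path under /sys/module and that the parameter is writable; 0 means no per-connection limit):

    cat /sys/module/sunrpc/parameters/svc_rpc_per_connection_limit      # default 0 = unlimited
    echo 4 > /sys/module/sunrpc/parameters/svc_rpc_per_connection_limit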
Comment 3 Trond Myklebust 2019-06-29 00:41:37 UTC
To be quite frank, we've seen similar problems at Hammerspace, and we have a set of patches that attempt to ensure that the server has at least one dedicated thread allocated per connection.
Hammerspace is, of course, quite happy to contribute these patches upstream (as we are obligated to do by the GPL), and I am planning to submit them for review soon. The main hangup is just dedicating the time to clean the patches up.
