Bug 202435 - nfs4.1: RPC request reserved 84 but used 276
Summary: nfs4.1: RPC request reserved 84 but used 276
Status: RESOLVED CODE_FIX
Alias: None
Product: File System
Classification: Unclassified
Component: NFS (show other bugs)
Hardware: All Linux
: P1 normal
Assignee: bfields
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2019-01-28 14:18 UTC by Donald Buczek
Modified: 2023-12-07 09:35 UTC (History)
6 users (show)

See Also:
Kernel Version: 4.14.87
Subsystem:
Regression: No
Bisected commit-id:


Attachments

Description Donald Buczek 2019-01-28 14:18:53 UTC
After an upgrade of nfs-utils, nfs version nfs4.1 became available to
our nfs clients. nfs4.1 was previously masked by an outdated mount.nfs
tools. nfs4.2 was not configured in most of our server kernels.
We use automount a lot, so clients started to use nfs4.1.

Two days later, the bug described here was triggered for the fist time.
After it repeated a few times on other client/server combinations, we
forced nfs back to 4.0 (by mount option in the automount map).

The symptom is, that he nsf server logs messages like "RPC request
reserved 84 but used 276" many times a second. At the same time, one
(some?)  remote mounted file system becomes unresponsive on a specific
client. Other mounts from the same client to the same server continue
to work. In the case described in the following, mounting of new file
systems on the same server was also effected. I'm not sure, if this was
always the case.

Blocked processes (`mount.nfs`,`df -a`) start to accumulate on the
client.

If you search the web for "RPC request reserved but used" you find a
few other people reporting this bug, but I didn't find a solution yet.

I have some data from one such occasion, including a full traffic dump
between the client and the server and a function graph trace of the
client and the server, which I hope might be helpful.

NFS-Server is `ex`. It logs:

    2019-01-22T13:04:01.305269+01:00 ex kernel: [1636476.605802] RPC request reserved 84 but used 276
    2019-01-22T13:04:01.306253+01:00 ex kernel: [1636476.606297] RPC request reserved 84 but used 276
    2019-01-22T13:04:01.306270+01:00 ex kernel: [1636476.606612] RPC request reserved 84 but used 276
    2019-01-22T13:04:01.306272+01:00 ex kernel: [1636476.606859] RPC request reserved 84 but used 276
    2019-01-22T13:04:01.307242+01:00 ex kernel: [1636476.607147] RPC request reserved 84 but used 276

A locally developed tool (nfsdtop) [1], which counts all nfs4
operations, identifies the client (`godsavethequeen`), which seems to
bombard the server with nfs4 sequence operations:

    ******************** 1 second trace
    
    Top 5 operation count:
    
          3990 ops by build@godsavethequeen.molgen.mpg.de (3990 sequence )
            30 ops by mielke@heulsuse.molgen.mpg.de (10 getattr 10 sequence 10 putfh )
             1 ops by root@platsch.molgen.mpg.de (1 renew )
    
    Top 5 user read:
    
    
    Top 5 user write:

On the client, `ss` indicates, that one of the nfs tcp connections
between the server and the client is continuously closed down and
reestablished:

    root:godsavethequeen:~/# ss|grep 25.185
    tcp    ESTAB      0      0      141.14.18.6:smpnameres           141.14.25.185:nfs                  
    tcp    SYN-SENT   0      1      141.14.18.6:963                  141.14.25.185:nfs                  
    tcp    ESTAB      0      0      141.14.18.6:57550                141.14.25.185:ssh                  
    root:godsavethequeen:~/# ss|grep 25.185
    tcp    ESTAB      0      0      141.14.18.6:smpnameres           141.14.25.185:nfs                  
    tcp    ESTAB      0      148    141.14.18.6:963                  141.14.25.185:nfs                  
    tcp    ESTAB      0      0      141.14.18.6:57550                141.14.25.185:ssh                  
    root:godsavethequeen:~/# ss|grep 25.185
    tcp    ESTAB      0      0      141.14.18.6:smpnameres           141.14.25.185:nfs                  
    tcp    FIN-WAIT-1 0      1      141.14.18.6:963                  141.14.25.185:nfs                  
    tcp    ESTAB      0      0      141.14.18.6:57550                141.14.25.185:ssh                  
    root:godsavethequeen:~/# ss|grep 25.185
    tcp    ESTAB      0      0      141.14.18.6:smpnameres           141.14.25.185:nfs                  
    tcp    SYN-SENT   0      1      141.14.18.6:963                  141.14.25.185:nfs                  
    tcp    ESTAB      0      0      141.14.18.6:57550                141.14.25.185:ssh                  
    root:godsavethequeen:~/# ss|grep 25.185
    tcp    ESTAB      0      0      141.14.18.6:smpnameres           141.14.25.185:nfs                  
    tcp    SYN-SENT   0      1      141.14.18.6:963                  141.14.25.185:nfs                  
    tcp    ESTAB      0      0      141.14.18.6:57550                141.14.25.185:ssh                  

On the client, a hanging mount.nfs (forked by automount) is visible.
When it is killed, automount continues and queued up mounts are
performed (which hang again):

    root:godsavethequeen:~/# ps -Af|grep mount
    root       1427      1  0 Jan03 ?        00:00:52 /usr/sbin/automount -v
    root      30841   1427  0 Jan22 ?        00:00:00 /usr/bin/mount -t nfs -s -o nosuid ex:/amd/ex/X/X2008/package/node /package/node
    root      30842  30841  0 Jan22 ?        00:00:00 /sbin/mount.nfs ex:/amd/ex/X/X2008/package/node /package/node -s -o rw,nosuid
    root      45258      1  0 Jan17 ?        00:00:00 /usr/sbin/rpc.mountd --foreground --manage-gids
    root      60196  60120  0 12:39 pts/7    00:00:00 umount -f /pkg/node-9.11.2-0
    root      60657  60120  0 12:53 pts/7    00:00:00 grep mount
    root:godsavethequeen:~/# kill 30842
    root:godsavethequeen:~/# ps -Af|grep mount
    root       1427      1  0 Jan03 ?        00:00:52 /usr/sbin/automount -v
    root      45258      1  0 Jan17 ?        00:00:00 /usr/sbin/rpc.mountd --foreground --manage-gids
    root      60196  60120  0 12:39 pts/7    00:00:00 umount -f /pkg/node-9.11.2-0
    root      60695   1427  0 12:53 ?        00:00:00 /usr/bin/umount /project/PausingDynamics
    root      60696  60695 81 12:53 ?        00:00:01 /sbin/umount.nfs /project/PausingDynamics -n
    root      60698  60120  0 12:53 pts/7    00:00:00 grep mount
    root:godsavethequeen:~/# 

I've enabled the function_graph traces on both systems and took a dump
of the network traffic. The traces and the dump were not takes at
exactly the same time, however. I don't want to attach these files here,
because of their size, they are available at [2].

[1] : https://github.molgen.mpg.de/mariux64/mxtools/tree/master/nfsdtop
[2] : https://owww.molgen.mpg.de/~buczek/nfs_problem3/
Comment 1 Trond Myklebust 2019-01-28 14:31:46 UTC
This is a server bug, not a client issue.
Comment 2 bfields 2019-01-28 20:39:46 UTC
(In reply to Donald Buczek from comment #0)
> If you search the web for "RPC request reserved but used" you find a
> few other people reporting this bug, but I didn't find a solution yet.

That symptom could have multiple different causes, so I wouldn't assume all those reports are cases of the same bug.

Looking at the trace....  A bare SEQUENCE call is getting a SEQUENCE+PUTFH+SETATTR+GETATTR reply.

So it's a cached reply.  That's probably confusing the logic that predicts how large a result will be.

There were some patches in that area recently, I'll take a look.
Comment 3 bfields 2019-01-28 20:47:51 UTC
Yes, this looks likely to be fixed by upstream commit 53da6a53e1d4 "nfsd4: catch some false session retries" (may also need preceding 085def3ade52 "nfsd4: fix cached replies to solo SEQUENCE compounds").

Those are probably reasonable to backport to stable branches.  Somebody could check whether they apply and fix the problem and, if so, mail them to stable@vger.kernel.org and cc: to linux-nfs@vger.kernel.org and bfields@fieldses.org.  I'm not volunteering to do that, but would happily ack them.
Comment 4 Donald Buczek 2019-01-29 12:18:47 UTC
These two patches are already included in 4.19 LTS and 4.20 stable. They do apply to 4.14 LTS and all previous LTS versions shown on kernel.org.

The problem occurred only a few times with hundredth of nfs-mounts using nfs4.1 and we don't know what triggered it. So I can't verify, that the problem is fixed by switching a few servers to 4.14 with the two patches on top on it. And I can't reboot all servers to a new kernel.
Comment 5 Jeff Layton 2019-04-20 10:58:22 UTC
I've seen similar messages, and added a small patch to print out the XID of the request:

[   99.433198] RPC request 576820793 reserved 108 but used 324
[  119.840823] RPC request 2540410425 reserved 200 but used 324

I then sniffed traffic while running a testcase that seemed to get a lot of these and saw this:

35112	2019-04-20 06:38:54.139728563	fd90:79d3:5065:f00d::da9	fd90:79d3:5065:f00d::444	NFS	294	V4 Call (Reply In 35156) COMMIT FH: 0x4a1e9ff2 Offset: 0 Len: 0
35156	2019-04-20 06:38:54.176012154	fd90:79d3:5065:f00d::444	fd90:79d3:5065:f00d::da9	NFS	410	V4 Reply (Call In 35112) SETATTR

...and...

43316	2019-04-20 06:39:14.583768975	fd90:79d3:5065:f00d::da9	fd90:79d3:5065:f00d::444	NFS	318	V4 Call (Reply In 43319) CLOSE StateID: 0xacd1
43319	2019-04-20 06:39:14.584101924	fd90:79d3:5065:f00d::444	fd90:79d3:5065:f00d::da9	NFS	410	V4 Reply (Call In 43316) SETATTR

...in both cases, it looks like the server is sending the reply for a different compound (a SETATTR rather than a COMMIT or CLOSE reply).
Comment 6 Donald Buczek 2019-05-21 11:35:24 UTC
Which kernel? The two commits 53da6a53e1d4 "nfsd4: catch some false session retries" and "nfsd4: fix cached replies to solo SEQUENCE compounds", which fixed my problem, have been backported to v4.14 (v4.14.99) and are in all kernels v4.15 and later, so this bug could be closed.

Note You need to log in before you can comment on or make changes to this bug.