After an upgrade of nfs-utils, NFS version 4.1 became available to our NFS clients; it had previously been masked by an outdated mount.nfs tool, and NFSv4.2 was not configured in most of our server kernels. We use automount a lot, so clients started to use NFSv4.1. Two days later, the bug described here was triggered for the first time. After it had repeated a few times on other client/server combinations, we forced NFS back to 4.0 (by a mount option in the automount map).
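For illustration, such a pinned-down map entry would look roughly like the following. The actual map is not reproduced in this report, so the layout and option spelling below are an assumption rather than a copy of our configuration; `vers=4.0` is the standard nfs mount option for staying on 4.0.

```
# hypothetical autofs direct-map entry forcing NFSv4.0 for one export
# (server name and paths taken from this report, everything else illustrative)
/package/node  -nosuid,vers=4.0  ex:/amd/ex/X/X2008/package/node
```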
The symptom is that the NFS server logs messages like "RPC request reserved 84 but used 276" many times a second. At the same time, one (or several?) remotely mounted file systems become unresponsive on a specific client, while other mounts from the same client to the same server continue to work. In the case described below, mounting new file systems from the same server was also affected; I'm not sure whether that was always the case. Blocked processes (`mount.nfs`, `df -a`) start to accumulate on the client.

If you search the web for "RPC request reserved but used" you find a few other people reporting this bug, but I didn't find a solution yet.

I have some data from one such occasion, including a full traffic dump between the client and the server and a function_graph trace of both the client and the server, which I hope might be helpful.

The NFS server is `ex`. It logs:

```
2019-01-22T13:04:01.305269+01:00 ex kernel: [1636476.605802] RPC request reserved 84 but used 276
2019-01-22T13:04:01.306253+01:00 ex kernel: [1636476.606297] RPC request reserved 84 but used 276
2019-01-22T13:04:01.306270+01:00 ex kernel: [1636476.606612] RPC request reserved 84 but used 276
2019-01-22T13:04:01.306272+01:00 ex kernel: [1636476.606859] RPC request reserved 84 but used 276
2019-01-22T13:04:01.307242+01:00 ex kernel: [1636476.607147] RPC request reserved 84 but used 276
```

A locally developed tool (nfsdtop) [1], which counts all NFSv4 operations, identifies the client (`godsavethequeen`) that seems to bombard the server with NFSv4 SEQUENCE operations:

```
******************** 1 second trace
Top 5 operation count:
    3990 ops by build@godsavethequeen.molgen.mpg.de (3990 sequence )
      30 ops by mielke@heulsuse.molgen.mpg.de (10 getattr 10 sequence 10 putfh )
       1 ops by root@platsch.molgen.mpg.de (1 renew )
Top 5 user read:
Top 5 user write:
```

On the client, `ss` indicates that one of the NFS TCP connections between the server and the client is continuously closed down and reestablished:

```
root:godsavethequeen:~/# ss|grep 25.185
tcp    ESTAB       0      0    141.14.18.6:smpnameres  141.14.25.185:nfs
tcp    SYN-SENT    0      1    141.14.18.6:963         141.14.25.185:nfs
tcp    ESTAB       0      0    141.14.18.6:57550       141.14.25.185:ssh
root:godsavethequeen:~/# ss|grep 25.185
tcp    ESTAB       0      0    141.14.18.6:smpnameres  141.14.25.185:nfs
tcp    ESTAB       0      148  141.14.18.6:963         141.14.25.185:nfs
tcp    ESTAB       0      0    141.14.18.6:57550       141.14.25.185:ssh
root:godsavethequeen:~/# ss|grep 25.185
tcp    ESTAB       0      0    141.14.18.6:smpnameres  141.14.25.185:nfs
tcp    FIN-WAIT-1  0      1    141.14.18.6:963         141.14.25.185:nfs
tcp    ESTAB       0      0    141.14.18.6:57550       141.14.25.185:ssh
root:godsavethequeen:~/# ss|grep 25.185
tcp    ESTAB       0      0    141.14.18.6:smpnameres  141.14.25.185:nfs
tcp    SYN-SENT    0      1    141.14.18.6:963         141.14.25.185:nfs
tcp    ESTAB       0      0    141.14.18.6:57550       141.14.25.185:ssh
root:godsavethequeen:~/# ss|grep 25.185
tcp    ESTAB       0      0    141.14.18.6:smpnameres  141.14.25.185:nfs
tcp    SYN-SENT    0      1    141.14.18.6:963         141.14.25.185:nfs
tcp    ESTAB       0      0    141.14.18.6:57550       141.14.25.185:ssh
```

On the client, a hanging `mount.nfs` (forked by automount) is visible. When it is killed, automount continues and queued-up mounts are performed (which hang again):

```
root:godsavethequeen:~/# ps -Af|grep mount
root      1427      1  0 Jan03 ?      00:00:52 /usr/sbin/automount -v
root     30841   1427  0 Jan22 ?      00:00:00 /usr/bin/mount -t nfs -s -o nosuid ex:/amd/ex/X/X2008/package/node /package/node
root     30842  30841  0 Jan22 ?      00:00:00 /sbin/mount.nfs ex:/amd/ex/X/X2008/package/node /package/node -s -o rw,nosuid
root     45258      1  0 Jan17 ?      00:00:00 /usr/sbin/rpc.mountd --foreground --manage-gids
root     60196  60120  0 12:39 pts/7  00:00:00 umount -f /pkg/node-9.11.2-0
root     60657  60120  0 12:53 pts/7  00:00:00 grep mount
root:godsavethequeen:~/# kill 30842
root:godsavethequeen:~/# ps -Af|grep mount
root      1427      1  0 Jan03 ?      00:00:52 /usr/sbin/automount -v
root     45258      1  0 Jan17 ?      00:00:00 /usr/sbin/rpc.mountd --foreground --manage-gids
root     60196  60120  0 12:39 pts/7  00:00:00 umount -f /pkg/node-9.11.2-0
root     60695   1427  0 12:53 ?      00:00:00 /usr/bin/umount /project/PausingDynamics
root     60696  60695 81 12:53 ?      00:00:01 /sbin/umount.nfs /project/PausingDynamics -n
root     60698  60120  0 12:53 pts/7  00:00:00 grep mount
root:godsavethequeen:~/#
```

I've enabled the function_graph tracer on both systems and took a dump of the network traffic (the tracing setup is sketched after the links below). The traces and the dump were not taken at exactly the same time, however. I don't want to attach these files here because of their size; they are available at [2].

[1]: https://github.molgen.mpg.de/mariux64/mxtools/tree/master/nfsdtop
[2]: https://owww.molgen.mpg.de/~buczek/nfs_problem3/
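For reference, the function_graph traces mentioned above were gathered through the standard ftrace interface, roughly as sketched here. This is the generic procedure rather than the exact commands used; no function filters are shown and the output filename is arbitrary.

```
# run as root on both the client and the server
cd /sys/kernel/debug/tracing          # or /sys/kernel/tracing on newer kernels
echo function_graph > current_tracer  # select the function_graph tracer
echo 1 > tracing_on                   # start recording
# ... reproduce the problem (trigger the automount, watch the server log) ...
echo 0 > tracing_on                   # stop recording
cat trace > /tmp/function_graph.txt   # save the trace buffer
echo nop > current_tracer             # restore the default tracer
```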
This is a server bug, not a client issue.
(In reply to Donald Buczek from comment #0)
> If you search the web for "RPC request reserved but used" you find a
> few other people reporting this bug, but I didn't find a solution yet.

That symptom could have multiple different causes, so I wouldn't assume all those reports are cases of the same bug.

Looking at the trace... a bare SEQUENCE call is getting a SEQUENCE+PUTFH+SETATTR+GETATTR reply. So it's a cached reply. That's probably confusing the logic that predicts how large a result will be. There were some patches in that area recently; I'll take a look.
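For anyone who wants to look for the same pattern in their own capture, listing the XID and the per-compound opcodes with tshark makes the mismatch easy to spot. This is only a sketch: the field names are wireshark's standard RPC/NFS ones, and `dump.pcap` stands in for the actual capture file.

```
# print frame number, call/reply flag (rpc.msgtyp: 0 = call, 1 = reply), XID and
# the NFSv4 opcodes of each compound (SEQUENCE=53, PUTFH=22, SETATTR=34, GETATTR=9);
# a call listing only "53" whose reply lists "53,22,34,9" is the mismatch described above
tshark -r dump.pcap -Y nfs \
       -T fields -e frame.number -e rpc.msgtyp -e rpc.xid -e nfs.opcode
```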
Yes, this looks likely to be fixed by upstream commit 53da6a53e1d4 "nfsd4: catch some false session retries" (may also need preceding 085def3ade52 "nfsd4: fix cached replies to solo SEQUENCE compounds"). Those are probably reasonable to backport to stable branches. Somebody could check whether they apply and fix the problem and, if so, mail them to stable@vger.kernel.org and cc: to linux-nfs@vger.kernel.org and bfields@fieldses.org. I'm not volunteering to do that, but would happily ack them.
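For whoever picks this up, the mechanics would be roughly as follows. This is only a sketch: it assumes a clone of the linux-stable tree, that both commits cherry-pick cleanly onto the 4.14 branch, and that the reproducer is available for re-testing.

```
# on a branch tracking the stable series to be tested (4.14.y as an example)
git checkout -b nfsd-false-retries origin/linux-4.14.y
git cherry-pick 085def3ade52 53da6a53e1d4   # the earlier fix first, then the follow-up
# build, boot and re-test; if the problem is gone, post the backports
git format-patch -2 --subject-prefix='PATCH 4.14'
git send-email --to=stable@vger.kernel.org \
               --cc=linux-nfs@vger.kernel.org --cc=bfields@fieldses.org *.patch
```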
These two patches are already included in 4.19 LTS and 4.20 stable. They do apply to 4.14 LTS and to all older LTS versions shown on kernel.org. The problem occurred only a few times across hundreds of NFS mounts using NFSv4.1, and we don't know what triggered it. So I can't verify that the problem is fixed by switching a few servers to 4.14 with the two patches on top of it. And I can't reboot all servers to a new kernel.
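As a side note, whether a given release already contains the upstream fixes can be checked from a mainline clone along these lines; backports to stable branches get new commit IDs, so stable series have to be searched by commit subject instead.

```
# first mainline tag that contains the fix (should point into the v4.15 merge window)
git describe --contains 53da6a53e1d4

# is the fix an ancestor of a given release tag?
git merge-base --is-ancestor 53da6a53e1d4 v4.19 && echo present || echo missing   # present
git merge-base --is-ancestor 53da6a53e1d4 v4.14 && echo present || echo missing   # missing
```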
I've seen similar messages, and added a small patch to print out the XID of the request:

```
[   99.433198] RPC request 576820793 reserved 108 but used 324
[  119.840823] RPC request 2540410425 reserved 200 but used 324
```

I then sniffed traffic while running a testcase that seemed to get a lot of these and saw this:

```
35112 2019-04-20 06:38:54.139728563 fd90:79d3:5065:f00d::da9 fd90:79d3:5065:f00d::444 NFS 294 V4 Call (Reply In 35156) COMMIT FH: 0x4a1e9ff2 Offset: 0 Len: 0
35156 2019-04-20 06:38:54.176012154 fd90:79d3:5065:f00d::444 fd90:79d3:5065:f00d::da9 NFS 410 V4 Reply (Call In 35112) SETATTR
```

...and...

```
43316 2019-04-20 06:39:14.583768975 fd90:79d3:5065:f00d::da9 fd90:79d3:5065:f00d::444 NFS 318 V4 Call (Reply In 43319) CLOSE StateID: 0xacd1
43319 2019-04-20 06:39:14.584101924 fd90:79d3:5065:f00d::444 fd90:79d3:5065:f00d::da9 NFS 410 V4 Reply (Call In 43316) SETATTR
```

...in both cases, it looks like the server is sending the reply for a different compound (a SETATTR rather than a COMMIT or CLOSE reply).
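With the XID in the log, the offending exchange can be pulled straight out of a capture. Again a sketch only: wireshark's standard field names are assumed, `dump.pcap` stands in for the capture file, and the XID is the first one from the messages above, converted from decimal to hex.

```
# the kernel log prints the XID in decimal, wireshark displays it in hex
printf '0x%08x\n' 576820793        # -> 0x22619639
# show the call and the (wrong) reply carrying that XID, with their NFSv4 opcodes
tshark -r dump.pcap -Y 'rpc.xid == 0x22619639' \
       -T fields -e frame.number -e rpc.msgtyp -e nfs.opcode
```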
Which kernel? The two commits 53da6a53e1d4 "nfsd4: catch some false session retries" and 085def3ade52 "nfsd4: fix cached replies to solo SEQUENCE compounds", which fixed my problem, have been backported to the 4.14 LTS series (v4.14.99) and are in all kernels from v4.15 onward, so this bug could be closed.