Bug 216383

Summary: Can no longer mount using NFSv4 from a Solaris client
Product: File System
Component: NFS
Reporter: Don Brady (don.brady)
Assignee: Chuck Lever (cel)
Status: RESOLVED CODE_FIX
Severity: normal
Priority: P3
CC: bfields, jlayton, trondmy
Hardware: All
OS: Linux
Kernel Version: 4.16 and later
Regression: Yes

Description Don Brady 2022-08-19 19:56:30 UTC
I believe this is a regression that started in 4.16 kernels, introduced by the commit posted at https://www.spinics.net/lists/linux-nfs/msg67189.html:
nfsd: return RESOURCE not GARBAGE_ARGS on too many ops

Prior to this change, we could mount from a Solaris client using vers=4.0. The only work-around we have is to force all Solaris client mounts to use NFSv3, but now we need to move our Solaris clients to NFSv4.

I have confirmed the failure path by tracing the nfsd4_proc_compound function and observing that it returns 10018 (NFSERR_RESOURCE) for an op count of 20 with a mount path like "/one/two/three/four/five".

The issue stems from the fact that Solaris mounts use a COMPOUND request that contains 5 ops per path element in the mount path. So with a path like the one above, the op count is 20 and exceeds the new restriction of NFSD_MAX_OPS_PER_COMPOUND=16.  Prior to the change, the NFS server would mount our five-element paths just fine when we asked for vers=4.0.

Solaris Client version is 11.4.0.0.1.15.0
Linux server is Ubuntu 20.04.4 LTS with a 5.4.0 kernel

Can we bump NFSD_MAX_OPS_PER_COMPOUND to 32?  This would allow us to use both NFS v4.0 and v4.1 from a Solaris client.
Comment 1 Chuck Lever 2022-08-20 20:35:32 UTC
There is a pre-allocated fixed array of compound arguments that is available to every nfsd thread. The size of that array is 8 items.

Since 0078117c6d91 ("nfsd: return RESOURCE not GARBAGE_ARGS on too many ops"), the server should be smart enough to allocate a separate piece of memory to handle a COMPOUND with more than 8 operations. I don't think there is a particular structural limitation which dictates a maximum of 16 operations. Make NFSD_MAX_OPS_PER_COMPOUND 32 and see how it works out. You might try even larger numbers than 32.
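
For reference, the decode path works roughly like this (a simplified sketch of the nfsd4_decode_compound() logic, not the exact upstream code; ops, iops, and opcnt are fields of struct nfsd4_compoundargs):

   /* Simplified sketch, not the exact fs/nfsd/nfs4xdr.c code: small
    * COMPOUNDs use the fixed per-thread array; anything larger gets a
    * separate allocation sized for the incoming operation count. */
   if (argp->opcnt > ARRAY_SIZE(argp->iops)) {
           argp->ops = kcalloc(argp->opcnt, sizeof(*argp->ops), GFP_KERNEL);
           if (argp->ops == NULL)
                   return nfserr_resource;  /* sketch: fail the decode */
   } else {
           argp->ops = argp->iops;
   }

The separately allocated array is what would need to be freed reliably when the request completes, which is why leak checking matters here.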

While testing, I would keep an eye out for memory leaks on the server. You might enable kmemleak if the NFS server system has plenty of CPU and memory available.

I don't have a problem increasing NFSD_MAX_OPS_PER_COMPOUND in upstream Linux as long as we determine that there are no unwelcome side effects.
Comment 2 bfields 2022-08-20 21:11:46 UTC
No objection to patching the server, but the described client behavior is definitely buggy.  NFSv4.0 servers are not required to process arbitrarily long compounds, and a client that sends them has to be prepared to recover from NFS4ERR_RESOURCE.
Comment 3 Chuck Lever 2022-08-20 21:18:42 UTC
Solaris clients have worked this way forever. It's hard to argue now that this behavior is no good, especially because Linux NFSD used to support up to 100 operations per COMPOUND. From their point of view, we're the ones who broke things.

To make the case to the Solaris folks that their client is buggy, I will need RFC 7530 chapter and verse.
Comment 4 Trond Myklebust 2022-08-20 21:28:31 UTC
RFC7530 Section 14.2 is probably the most explicit:

   A COMPOUND is not a transaction, and it is the client's
   responsibility to recover from any partially completed COMPOUND
   procedure.  These may occur at any point due to errors such as
   NFS4ERR_RESOURCE and NFS4ERR_DELAY.  Note that these errors can occur
   in an otherwise valid operation string.  Further, a server reboot
   that occurs in the middle of processing a COMPOUND procedure may
   leave the client with the difficult task of determining how far
   COMPOUND processing has proceeded.  Therefore, the client should
   avoid overly complex COMPOUND procedures in the event of the failure
   of an operation within the procedure.


However, there is also language in Section 15.2.5 to the effect of:

   Since an error of any type may occur after only a portion of the
   operations have been evaluated, the client must be prepared to
   recover from any failure.  If the source of an NFS4ERR_RESOURCE error
   was a complex or lengthy set of operations, it is likely that if the
   number of operations were reduced the server would be able to
   evaluate them successfully.  Therefore, the client is responsible for
   dealing with this type of complexity in recovery.
Comment 5 bfields 2022-08-21 00:37:03 UTC
Also worth pointing out that they're still going to fail mounts of paths with more than about MAX_OPS_PER_COMPOUND/5 components.
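
(For concreteness: with NFSD_MAX_OPS_PER_COMPOUND at 16 and roughly 5 ops per path component, that is mounts of paths deeper than about 3 components.)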

And it would be so easy to fix that I'm surprised they haven't--which makes me look harder at the server code, and--oh, I see, I think it's our fault after all!

Our previous large MAX_OPS value was just hiding the real bug in our handling of the >MAX_OPS case.

It looks like, in this case, nfsd4_proc_compound is failing the entire compound without processing any ops.  That leaves the client with no way to make forward progress.  What it should be doing is basically truncating the incoming compound to MAX_OPS, processing the first MAX_OPS-1 ops normally, then returning NFS4ERR_RESOURCE on the MAX_OPSth op.  I'm not sure exactly how that would be done most simply, but it shouldn't be hard, and it should fix the real bug.
Comment 6 Chuck Lever 2022-08-21 02:27:43 UTC
(In reply to bfields from comment #5)
> It looks like, in this case, nfsd4_proc_compound is failing the entire
> compound without processing any ops.

I noticed that too, but did not catch the implication below.

> That leaves the client with no way to
> make forward progress.  What it should be doing is basically truncating the
> incoming compound to MAX_OPS, processing the first MAX_OPS-1 ops normally,
> then returning NFS4ERR_RESOURCE on the MAX_OPSth op.  I'm not sure exactly
> how that would be done most simply, but it shouldn't be hard, and it should
> fix the real bug.

I think it might be easy. Instead of comparing the number of incoming ops to NFSD_MAX_OPS_PER_COMPOUND before looping over the operations in the COMPOUND, compare the op number to the limit at the top of the loop, and exit there with NFS4ERR_RESOURCE if the op number exceeds NFSD_MAX_OPS_PER_COMPOUND.
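
In other words, something along these lines inside nfsd4_proc_compound()'s dispatch loop (a rough sketch of the idea, not a tested patch; the opcnt fields and encode_op label exist in the current code, but the exact placement would need care):

   /* Rough sketch, not a tested patch: let the first ops execute, then
    * fail only the op that crosses the limit, so an NFSv4.0 client can
    * observe partial progress and resume with a shorter COMPOUND. */
   while (resp->opcnt < args->opcnt) {
           op = &args->ops[resp->opcnt++];

           if (resp->opcnt > NFSD_MAX_OPS_PER_COMPOUND) {
                   op->status = nfserr_resource;
                   goto encode_op;  /* encode the error, stop processing */
           }
           /* ... normal per-op dispatch ... */
   }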
Comment 7 Chuck Lever 2022-08-28 21:17:10 UTC
Mike Kupfer pointed out that we have another problem here. NFSv4.1 and newer do not have NFS4ERR_RESOURCE.
Comment 8 Chuck Lever 2022-08-28 21:21:24 UTC
Since these problems don't pose a security risk and have been around for some time, I'm lowering the priority of this bug to P3.
Comment 9 Chuck Lever 2022-08-28 21:34:16 UTC
(In reply to Chuck Lever from comment #7)
> Mike Kupfer pointed out that we have another problem here. NFSv4.1 and newer
> do not have NFS4ERR_RESOURCE.

He also says that the Solaris client will not retry a LOOKUP that fails with NFS4ERR_RESOURCE, even if the server were to make some progress. This is a known issue that they want to correct, but it's not a priority because of technical debt in this area of their implementation.

Thus it's likely that a server-side behavior change alone will not address the problem with long mount paths, but increasing the maximum number of operations per COMPOUND will relieve the issue. I'm still inclined to apply patches for both of these issues, in addition to ensuring that NFS4ERR_RESOURCE is returned only to NFSv4.0 clients.
Comment 10 Trond Myklebust 2022-08-28 21:55:25 UTC
In NFSv4.1, none of this is a problem.

The client and server negotiate a maximum buffer size and a maximum number of ops per COMPOUND during session creation. If the client exceeds the number of ops that was agreed to, the server responds with NFS4ERR_TOO_MANY_OPS or NFS4ERR_REQ_TOO_BIG, depending on the circumstances.
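
So for v4.1+ the server-side check should key off the session rather than a compile-time constant; roughly (a sketch using the se_fchannel.maxops field from struct nfsd4_session):

   /* Sketch: for a session-based (v4.1+) request, enforce the limit
    * agreed to at CREATE_SESSION time, with the minor-version-correct
    * error code. */
   if (nfsd4_has_session(&resp->cstate) &&
       args->opcnt > resp->cstate.session->se_fchannel.maxops)
           return nfserr_too_many_ops;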
Comment 11 Chuck Lever 2022-08-29 00:00:19 UTC
(In reply to Trond Myklebust from comment #10)
> In NFSv4.1, none of this is a problem.
> 
> The client and server negotiate a maximum buffer size and a maximum number
> of ops per COMPOUND during session creation. If the client exceeds the
> number of ops that was agreed to, the server responds with
> NFS4ERR_TOO_MANY_OPS or NFS4ERR_REQ_TOO_BIG, depending on the circumstances.

Right, I'm not blaming the structure of the protocols. I'm saying that the Linux NFS server's logic for counting operations in a COMPOUND does not appear to be keyed on which minor version is in play when it decides there are too many.

I thought I saw some unit tests in pynfs to exercise a server’s COMPOUND sanity checking. I’ll poke around there.
Comment 12 bfields 2022-08-29 12:53:09 UTC
(In reply to Chuck Lever from comment #11)
> I'm saying that the Linux NFS server's logic for counting operations in a
> COMPOUND does not appear to be keyed on which minor version is in play when
> it decides there are too many.

That's because in the 4.1 case we always negotiate a maxoperations value of at most NFSD_MAX_OPS_PER_COMPOUND, so the limit is the same across minor versions.

(Well, sort of: in the 4.1 case we're willing to negotiate a *lower* limit if the client asks us to, and then fail to return TOO_MANY_OPS if the client later exceeds that smaller value.  But that's a pretty weird case.)
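
For reference, that negotiation boils down to a one-line clamp at CREATE_SESSION time, something like this sketch of the check_forechannel_attrs() logic:

   /* Sketch: never agree to more ops per COMPOUND than the server's own
    * ceiling; a client can negotiate the limit down but not up. */
   ca->maxops = min_t(u32, ca->maxops, NFSD_MAX_OPS_PER_COMPOUND);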

> He also says that the Solaris client will not retry a LOOKUP that fails with
> NFS4ERR_RESOURCE, even if the server were to make some progress.  This is a
> known issue that they want to correct, but it's not a priority because of
> technical debt in this area of their implementation.

Oh well, not up to me, but it really does seem like a serious bug that should be easy to fix.
Comment 13 Chuck Lever 2022-08-29 17:25:16 UTC
(In reply to bfields from comment #12)
> (In reply to Chuck Lever from comment #11)
> > I'm saying that the Linux NFS server's logic for counting operations in a
> > COMPOUND does not appear to be keyed on which minor version is in play
> > when it decides there are too many.
> 
> That's because in the 4.1 case we always negotiate a maxoperations value at
> most NFSD_MAX_OPS_PER_COMPOUND; so the limit is the same across minor
> versions.

Agreed, but I'm wondering if NFSD's COMPOUND limit checking would actually trigger an NFS4ERR_RESOURCE response if an NFSv4.1 client happens to exceed the limit agreed to at CREATE_SESSION time. The limit checking is split between nfsd4_decode_compound() and nfsd4_proc_compound(). I need to study it carefully to see how it's working.

If you know of any particular pynfs tests that might probe a server's COMPOUND limit checking, let me know.
Comment 14 Chuck Lever 2023-03-09 14:57:01 UTC
Commit 7518a3dc5ea2 ("NFSD: Fix handling of oversized NFSv4 COMPOUND requests") was merged in v6.1.