Bug 219737
Summary: | warning in nfsd4_cb_done | ||
---|---|---|---|
Product: | File System | Reporter: | rik.theys |
Component: | NFSD | Assignee: | Filesystem/NFSD virtual assignee (filesystem_nfsd) |
Status: | NEW --- | ||
Severity: | normal | CC: | cel, chuck.lever, jlayton |
Priority: | P3 | ||
Hardware: | All | ||
OS: | Linux | ||
Kernel Version: | Subsystem: | ||
Regression: | No | Bisected commit-id: |
Description
rik.theys
2025-01-30 14:00:07 UTC
I recommended a separate bug because this looks to me like the CB_GETATTR reply handling path needs some attention. I'm not sure the problems here will result in the hanging symptoms seen in Bug #218735. First issue is the explicit use of NFS4ERR_BAD_XDR in the CB_GETATTR reply decoder. Should be EIO instead. Second issue is the CB_GETATTR reply decoder does not seem capable of handling a non-zero status code in the reply. Third issue is whether NFS4ERR_BADHANDLE means the server requested a CB_GETATTR for the wrong file, or if it is an expected situation. Hi, It seems I've cloned the wrong bug. It should have been https://bugzilla.kernel.org/show_bug.cgi?id=219710 instead. As mentioned by Chuck in https://bugzilla.kernel.org/show_bug.cgi?id=219710#c25 there are multiple events in the trace, but there is only a single WARNING logged? So is there still something different between the various events, or is the warning only logged once? Regards, Rik Olga Kornievskaia <aglo@umich.edu> replies to comment #1: On Thu, Jan 30, 2025 at 9:09 AM Chuck Lever via Bugspray Bot <bugbot@kernel.org> wrote: > > Chuck Lever writes via Kernel.org Bugzilla: > > I recommended a separate bug because this looks to me like the CB_GETATTR > reply handling path needs some attention. I'm not sure the problems here will > result in the hanging symptoms seen in Bug #218735. > > First issue is the explicit use of NFS4ERR_BAD_XDR in the CB_GETATTR reply > decoder. Should be EIO instead. > > Second issue is the CB_GETATTR reply decoder does not seem capable of > handling a non-zero status code in the reply. > > Third issue is whether NFS4ERR_BADHANDLE means the server requested a > CB_GETATTR for the wrong file, or if it is an expected situation. Isn't this because 6.12.x is still missing the patch "NFSD: fix decoding in nfs4_xdr_dec_cb_getattr" that just went into 6.14? > View: https://bugzilla.kernel.org/show_bug.cgi?id=219737#c1 > You can reply to this message to join the discussion. > -- > Deet-doot-dot, I am a bot. > Kernel.org Bugzilla (bugspray 0.1-dev) > > (via https://msgid.link/CAN-5tyFeGvihaucNL1rPPhbkxM04gMfOvDz04V0%2B-05mLbCS4w@mail.gmail.com) (In reply to Bugspray Bot from comment #3) > Olga Kornievskaia <aglo@umich.edu> replies to comment #1: > > > First issue is the explicit use of NFS4ERR_BAD_XDR in the CB_GETATTR reply > > decoder. Should be EIO instead. > > > > Second issue is the CB_GETATTR reply decoder does not seem capable of > > handling a non-zero status code in the reply. > > > > Third issue is whether NFS4ERR_BADHANDLE means the server requested a > > CB_GETATTR for the wrong file, or if it is an expected situation. > > Isn't this because 6.12.x is still missing the patch "NFSD: fix > decoding in nfs4_xdr_dec_cb_getattr" that just went into 6.14? Yes, second and third issues are addressed by 1b3e26a5ccbf ("NFSD: fix decoding in nfs4_xdr_dec_cb_getattr"). The first issue has not yet been addressed upstream. (In reply to Chuck Lever from comment #4) > (In reply to Bugspray Bot from comment #3) > > Olga Kornievskaia <aglo@umich.edu> replies to comment #1: > > > > > First issue is the explicit use of NFS4ERR_BAD_XDR in the CB_GETATTR > reply > > > decoder. Should be EIO instead. > > > > > > Second issue is the CB_GETATTR reply decoder does not seem capable of > > > handling a non-zero status code in the reply. > > > > > > Third issue is whether NFS4ERR_BADHANDLE means the server requested a > > > CB_GETATTR for the wrong file, or if it is an expected situation. > > > > Isn't this because 6.12.x is still missing the patch "NFSD: fix > > decoding in nfs4_xdr_dec_cb_getattr" that just went into 6.14? > > Yes, second and third issues are addressed by 1b3e26a5ccbf ("NFSD: fix > decoding in nfs4_xdr_dec_cb_getattr"). > > The first issue has not yet been addressed upstream. The first issue isn't a big deal, though. Applying 1b3e26a5ccbf should address this bug. (In reply to Chuck Lever from comment #4) > (In reply to Bugspray Bot from comment #3) > > Olga Kornievskaia <aglo@umich.edu> replies to comment #1: > > > > > First issue is the explicit use of NFS4ERR_BAD_XDR in the CB_GETATTR > reply > > > decoder. Should be EIO instead. > > > > > > Second issue is the CB_GETATTR reply decoder does not seem capable of > > > handling a non-zero status code in the reply. > > > > > > Third issue is whether NFS4ERR_BADHANDLE means the server requested a > > > CB_GETATTR for the wrong file, or if it is an expected situation. > > > > Isn't this because 6.12.x is still missing the patch "NFSD: fix > > decoding in nfs4_xdr_dec_cb_getattr" that just went into 6.14? > > Yes, second and third issues are addressed by 1b3e26a5ccbf ("NFSD: fix > decoding in nfs4_xdr_dec_cb_getattr"). I take it back: The third issue is not addressed by 1b3e26a5ccbf. Why is the server sending the client a delegation that it rejects? And does the server recover properly in this case? But again, 1b3e26a5ccbf should prevent the warning reported here. Hi, (In reply to Chuck Lever from comment #4) > (In reply to Bugspray Bot from comment #3) > > Olga Kornievskaia <aglo@umich.edu> replies to comment #1: > > > > > First issue is the explicit use of NFS4ERR_BAD_XDR in the CB_GETATTR > reply > > > decoder. Should be EIO instead. > > > > > > Second issue is the CB_GETATTR reply decoder does not seem capable of > > > handling a non-zero status code in the reply. > > > > > > Third issue is whether NFS4ERR_BADHANDLE means the server requested a > > > CB_GETATTR for the wrong file, or if it is an expected situation. > > > > Isn't this because 6.12.x is still missing the patch "NFSD: fix > > decoding in nfs4_xdr_dec_cb_getattr" that just went into 6.14? > > Yes, second and third issues are addressed by 1b3e26a5ccbf ("NFSD: fix > decoding in nfs4_xdr_dec_cb_getattr"). > > The first issue has not yet been addressed upstream. Is it possible this patch has not (yet) been sent to stable@vger.kernel.org so it ends up in 6.12.y? Regards, Rik (In reply to rik.theys from comment #7) > Is it possible this patch has not (yet) been sent to stable@vger.kernel.org > so it ends up in 6.12.y? Commit 1b3e26a5ccbf has been in a publicly released kernel for exactly two days (as of this writing). It will take some time before it appears in any LTS kernel. For now, if you would like to test the commit, you need to apply it manually. FYI: 521 == EBADHANDLE 10036 == NFS4ERR_BADXDR 3 == CB_GETATTR The cb_status gets set via decode_cb_op_status() during the decode phase and is not cleared in nfsd4_cb_prepare. Note that the code will call rpc_restart_call() in the NFS4ERR_DELAY case, which skips the rpc_call_prepare phase when the task is restarted. So, I think this may go something like this: Get a NFS4ERR_DELAY in a callback operation, that restarts the call but doesn't clear cb_status. Then on the next attempt, it gets a non-zero tk_status (and probably skips the decode phase). That would cause this warning to pop, I think. In practice, I'm warming up to the idea that this entire if statement is just bogus: if (cb->cb_status) { WARN_ONCE(task->tk_status, "cb_status=%d tk_status=%d cb_opcode=%d", cb->cb_status, task->tk_status, cb->cb_ops->opcode); task->tk_status = cb->cb_status; } Consider the case where we successfully decode a cb_status value but then some later field in the reply was corrupt. We will have set the cb_status, but the decode will fail which will cause the tk_status to be non-zero, and this warning will pop. I think we should probably just remove the entire if statement above. (In reply to Chuck Lever from comment #8) > (In reply to rik.theys from comment #7) > > Is it possible this patch has not (yet) been sent to stable@vger.kernel.org > > so it ends up in 6.12.y? > > Commit 1b3e26a5ccbf has been in a publicly released kernel for exactly two > days (as of this writing). It will take some time before it appears in any > LTS kernel. For now, if you would like to test the commit, you need to apply > it manually. Now that I look, 1b3e26a5ccbf is wrong. The patch on the ml was correct, but the one that got committed is different. It should be: status = decode_cb_op_status(xdr, OP_CB_GETATTR, &cb->cb_status); if (unlikely(status || cb->cb_status)) If "status" is non-zero, decoding failed (usu. BADXDR), but we also want to bail out and not decode the rest of the call if the decoded cb_status is non-zero. That's not happening here, cb_seq_status has already been checked and is non-zero, so this ends up trying to decode the rest of the CB_GETATTR reply when it doesn't exist. Also, -NFS4ERR_BADXDR is the wrong status code to use here. The correct code to use is -EIO. We're decoding the callback server's response, not a call (where BADXDR would indeed be appropriate). For comparison, see how the NFS client handles a server response it cannot decode. Hi, We're currently running 6.12.13 with the following 4 patches on top: 1. 1b3e26a5ccbfc2f85bda1930cc278e313165e353: NFSD: fix decoding in nfs4_xdr_dec_cb_getattr This patch went into 6.13-rc7, but I don't think it was CC'ed to stable? I also don't see it in nfsd-6.12.y branch on https://git.kernel.org/pub/scm/linux/kernel/git/cel/linux.git/log Is it not needed on 6.12? 2. cedfbb92cf97a6bff3d25633001d9c44442ee854: NFSD: fix hang in nfsd4_shutdown_callback This one is now in 6.12.16 3. 99e98a2312c8d08fba60d548009c03e7cfb1bf6b: NFSD: Skip sending CB_RECALL_ANY when the backchannel isn't up This one seems to have a CC: stable but I assume it's not yet there as it isn't included in the mainline kernel yet? Is it expected to be merged into 6.14-rcX, or will it have to wait for 6.15? 4. 4990d098433db18c854e75fb0f90d941eb7d479e: NFSD: Fix CB_GETATTR status fix This one seems to be in 6.14-rc, but also doesn't have a CC: stable? It fixes 1b3e26a5ccbfc2f85bda1930cc278e313165e353 above. Are there any plans to include these patches in the 6.12.y kernel? Some of the patches mentioned above are in the nfsd-{fixes,testing,next} branches on https://git.kernel.org/pub/scm/linux/kernel/git/cel/linux.git, but I don't know what all of these branches are used for. Is the nfsd-next branch what's expected to land in the next upstream kernel (6.15-rc1), or also the current rc kernels (6.14-rcX)? What is the difference between the nfsd-testing and nfsd-fixes branch? This kernel seems to be running stable for now, but it's too soon to conclude it fixes the issue that sometimes comes up. It does show the following kernel message sometimes: rpc-srv/tcp: nfsd: sent 106256 when sending 131204 bytes - shutting down socket I don't recall seeing this message with previous (6.11 and earlier) kernels. Does this message mean a single connection to a client was shut down? Is this a message we can ignore? Regards, Rik > rpc-srv/tcp: nfsd: sent 106256 when sending 131204 bytes - shutting down
> socket
This is unrelated to the original bug report.
It means that the kernel tried to send 131204 bytes on the socket, but only 106256 was sent. It was a short send, IOW. Usually that means the TCP buffers are full and couldn't hold the whole message, possibly because the receiver isn't ACKing things quickly enough.
To try and remedy this, the server shuts down the socket. That should be harmless though -- the client should just reestablish it once it notices the socket is down. I wonder if we really even need that printk at all. It doesn't tell the admin anything actionable, especially since it doesn't tell you anything about the peer.
|