Bug 218735
| Summary: | NFSD callback operations block everything when clients are unresponsive | | |
|---|---|---|---|
| Product: | File System | Reporter: | Chuck Lever (cel) |
| Component: | NFSD | Assignee: | Chuck Lever (cel) |
| Status: | ASSIGNED | | |
| Severity: | normal | CC: | jlayton |
| Priority: | P3 | | |
| Hardware: | All | | |
| OS: | Linux | | |
| Kernel Version: | | Subsystem: | |
| Regression: | No | Bisected commit-id: | |
Description
Chuck Lever
2024-04-16 18:18:18 UTC
A little code audit:

fs/nfsd/nfs4layouts.c 739 static const struct nfsd4_callback_ops nfsd4_cb_layout_ops = {
fs/nfsd/nfs4proc.c 1622 static const struct nfsd4_callback_ops nfsd4_cb_offload_ops = {
fs/nfsd/nfs4state.c 399 static const struct nfsd4_callback_ops nfsd4_cb_notify_lock_ops = {
fs/nfsd/nfs4state.c 3079 static const struct nfsd4_callback_ops nfsd4_cb_recall_any_ops = {
fs/nfsd/nfs4state.c 3084 static const struct nfsd4_callback_ops nfsd4_cb_getattr_ops = {
fs/nfsd/nfs4state.c 5182 static const struct nfsd4_callback_ops nfsd4_cb_recall_ops = {

We have these five callback operations to deal with.

I think the ->release nfsd4_callback_ops method might be used to schedule retry -- it's invoked by nfsd41_destroy_cb(), which should be able to tell whether a reply has been received. Now I just need to figure out how to keep a record of needing to resend a callback.

The retry loop in this case is requeueing the work using queue_delayed_work. That shouldn't be blocking jobs with shorter delays that are sitting on the same queue. Are you certain that's the case? That sounds like a bug in the workqueue implementation if so.

Agreed, the queue_delayed_work() isn't working the way I expected, but it may behave differently with an ordered work queue than it does with a bog standard work queue instance. In any event, some CB operations can be "fire and forget" while others will want some recourse on failure-to-send, and at least CB_OFFLOAD needs to be as reliable as we can make it. Thus having specific retry handlers for each CB operation seems like the best long-term approach.

Commit c1ccfcf1a9bf ("NFSD: Reschedule CB operations when backchannel rpc_clnt is shut down") was reverted from v6.9-rc to prevent an unresponsive client from backing up callbacks from all clients.

In addition, I'm planning to prototype an implementation of OFFLOAD_STATUS for the Linux NFS client so the COPY operations don't hang if the CB_OFFLOAD gets lost.
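To make the "specific retry handlers for each CB operation" idea concrete, here is a minimal userspace sketch of how a release-time decision could differ per operation. It is not kernel code and not the actual NFSD implementation: every name in it (cb_model, cb_model_ops, cb_model_release_should_retry, the retry limit of 3) is a hypothetical stand-in chosen for illustration, and which operations would really be "fire and forget" is an assumption apart from CB_OFFLOAD wanting reliability.

```c
#include <stdbool.h>

/* Hypothetical userspace model; none of these names are kernel APIs. */
enum cb_retry_policy {
	CB_FIRE_AND_FORGET,	/* give up if no reply was received */
	CB_RETRY_BOUNDED,	/* resend, up to max_retries times */
};

/* Per-operation policy, loosely analogous to a per-op ops table. */
struct cb_model_ops {
	const char *name;
	enum cb_retry_policy policy;
	unsigned int max_retries;	/* ignored for CB_FIRE_AND_FORGET */
};

/* One in-flight callback and its record of resends so far. */
struct cb_model {
	const struct cb_model_ops *ops;
	bool reply_received;
	unsigned int resends;
};

/* Assumption: CB_OFFLOAD is modeled as the "reliable" case. */
static const struct cb_model_ops cb_model_offload_ops = {
	.name = "offload",
	.policy = CB_RETRY_BOUNDED,
	.max_retries = 3,	/* arbitrary illustrative limit */
};

/* Assumption: a generic operation that tolerates being dropped. */
static const struct cb_model_ops cb_model_droppable_ops = {
	.name = "droppable",
	.policy = CB_FIRE_AND_FORGET,
	.max_retries = 0,
};

/*
 * Called when a callback is being released. Returns true if the
 * caller should requeue the callback instead of freeing it, and
 * records the resend in cb->resends.
 */
static bool cb_model_release_should_retry(struct cb_model *cb)
{
	if (cb->reply_received)
		return false;
	if (cb->ops->policy == CB_FIRE_AND_FORGET)
		return false;
	if (cb->resends >= cb->ops->max_retries)
		return false;
	cb->resends++;
	return true;
}
```

In this model the resend record lives in the callback itself, so a lost reply for one client's callback only affects that callback's own retry budget. Whether the real ->release method can carry that state is exactly the open question raised above.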