Bug 213887 - kernel v5.13 regression & NFSv4.2 client timeouts?
Summary: kernel v5.13 regression & NFSv4.2 client timeouts?
Status: RESOLVED CODE_FIX
Alias: None
Product: File System
Classification: Unclassified
Component: NFS
Hardware: All
OS: Linux
Importance: P1 normal
Assignee: bfields
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2021-07-28 10:19 UTC by Daire Byrne
Modified: 2022-05-13 18:20 UTC
CC List: 4 users

See Also:
Kernel Version: 5.13.5
Subsystem:
Regression: No
Bisected commit-id:


Attachments

Description Daire Byrne 2021-07-28 10:19:43 UTC
Hi,

I have been using nfs re-exporting for quite a few kernel versions now and something has changed in v5.13 such that I am now seeing random client timeouts to the v4.2 re-export server.

Over the course of 24 hours, we will see something like this on around 5% of our ~1000 clients:

Jul 19 21:59:50 c8000b52 kernel: nfs: server XX not responding, still trying
Jul 19 21:59:50 c8000b52 kernel: nfs: server XX not responding, still trying
Jul 19 21:59:50 c8000b52 kernel: nfs: server XX not responding, still trying
Jul 19 21:59:50 c8000b52 kernel: nfs: server XX not responding, still trying
Jul 19 21:59:50 c8000b52 kernel: nfs: server XX not responding, still trying
Jul 20 07:17:38 c8000b52 kernel: nfs: server XX not responding, still trying
Jul 20 07:17:38 c8000b52 kernel: nfs: server XX not responding, still trying
Jul 20 07:17:38 c8000b52 kernel: nfs: server XX not responding, still trying
Jul 20 07:17:38 c8000b52 kernel: nfs: server XX not responding, still trying
Jul 20 07:17:38 c8000b52 kernel: nfs: server XX not responding, still trying
Jul 20 07:17:38 c8000b52 kernel: nfs: server XX not responding, timed out

Where server XX is a v5.13.x NFSv4.2 server re-exporting an NFSv3 server. The client's mounts hang indefinitely and never recover (no OK messages). The re-export server is exporting multiple originating NFSv3 servers.

Looking at the NFSv4.2 re-export server, there are no messages of interest and I can't see the client listed in /proc/fs/nfsd/clients. I couldn't even see a corresponding rpc.mountd message for the client. All the mounts and exports seem fine, and the other 95% of clients also seem fine and perform well.

I do see these messages on the server (newly added in nfs-utils v2.5.4):

Jul 19 20:24:06 nfscache4 rpc.mountd[2740]: v4.2 client attached: 0xbcc365f060eec638 from "A.B.C.D:999"
Jul 19 20:24:06 nfscache4 rpc.mountd[2742]: v4.2 client attached: 0xbcc365f060eec638 from "A.B.C.D:999"
Jul 19 21:58:20 nfscache4 rpc.mountd[2744]: v4.2 client detached: 0xbcc365f060eec638 from "A.B.C.D:999"
Jul 19 21:58:20 nfscache4 rpc.mountd[2745]: v4.2 client detached: 0xbcc365f060eec638 from "A.B.C.D:999"

but I'm not sure they are relevant to this issue - it's just that we now have a log entry for them, and the hung mounts often appear near a flurry of them in the server logs.

These client hangs do not occur with v5.12.19 or any previous kernel versions all the way back to v5.7 (even with nfs-utils v2.5.4). They began in v5.13.0 and are still occurring in v5.13.5.

As always, I'm unsure if this issue is related to re-exporting or is just a general NFSv4.2 server issue.

I'm also fairly sure it's specific to v4.2 as I have another cluster running v5.13 that is re-exporting NFSv4.2 servers to NFSv3 clients and I have not seen the issue there yet.

I have begun a git bisect between v5.12.19 & v5.13.5 but because I can only reproduce this at scale with production workloads, it is going to take at least 24 hours for each iteration.

So I decided to open this bug report now in case there are suggestions for particular changes that went into v5.13 that I could test more quickly.

Daire
Comment 1 bfields 2021-08-02 13:41:24 UTC
Note if you bisect you may need to apply 061478438d04 "mm/page_alloc: further fix __alloc_pages_bulk() return value".

Bug 213869 also looks like mysterious timeouts starting with 5.13.x, though over NFSv3.
Comment 2 Daire Byrne 2021-08-02 14:45:27 UTC
Right... yea I forgot about that mm/page_alloc problem that broke late in 5.13.0. 

I need to read up on how to go through the bisect while always making sure that particular commit has been applied, as I haven't done that before.
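
I'm assuming it's something along these lines at each step, provided the commit applies cleanly at that point in the range - apply the fix without committing it, build and test, then drop it again before recording the result:

git cherry-pick -n 061478438d04   # apply the page_alloc fix without committing it
make -j$(nproc) && make modules_install && make install
# ... reboot, run the production workload, watch for hangs ...
git reset --hard                  # drop the temporary fix again
git bisect good                   # or: git bisect bad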

Yea, I had a look at Bug 213869 but couldn't really say if it's related.

The bisect is going to take a while for sure. I'm only on the 2nd iteration as our production workload was pretty idle for most of last week.
Comment 3 bfields 2021-08-02 15:08:51 UTC
Thanks for the persistence.  I'd be interested in any partial bisect results as you go.

Can you tell (e.g. from /proc/*/stack or sysrq-t dump) if any nfsd threads are stuck somewhere interesting during the hangs?
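
Something like this on the re-export server should show where each nfsd thread is sitting while a client is hung (run as root):

for pid in $(pgrep nfsd); do echo "=== nfsd $pid ==="; cat /proc/$pid/stack; done
# or dump every task's state into the kernel log:
echo t > /proc/sysrq-trigger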
Comment 4 Daire Byrne 2021-08-05 13:10:18 UTC
Yea, it's really slow going unfortunately. Taking around 48 hours per iteration. This is as far as I have gotten so far:

git bisect start
git bisect good 0e6f651912bdd027a6d730b68d6d1c3f4427c0ae
git bisect bad 25423f4bd9a9ac3e6b0ce7ecfe56c36f4e514893
git bisect good 9f4ad9e425a1d3b6a34617b8ea226d56a119a717
git bisect bad 85f3f17b5db2dd9f8a094a0ddc665555135afd22
git bisect bad ca62e9090d229926f43f20291bb44d67897baab7

I also hit Bug 213273 (nfs4_proc_lock) a few times so I had to pull in that patch too (which slowed me down).

I have checked /proc/*/stack a few times now and I can't see any hung nfsd processes or anything interesting.

It's almost like the client thinks it has the server mounted but the server has no record of it and is ignoring all requests from the client.

I'll keep chugging through it...

Daire
Comment 5 bfields 2021-08-05 20:35:37 UTC
The remaining range does include the nfsd merge.

It'd be worth just testing nfsd-5.13 (commit b73ac6808b0f7994a05ebc38571e2e9eaf98a0f4).  If that's bad, that'd narrow it down to those 70 nfsd commits.

If nfsd threads aren't hung, another reason the server might not respond is that it's not getting responses to its upcalls to rpc.mountd.

I assume rpc.mountd is still running?

We should be able to figure out some tracing to help....
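
In the meantime, a rough sanity check of the mountd upcall path on the server might look like this (a sketch; adjust paths and flags as needed):

pgrep -a rpc.mountd                      # is mountd still running?
cat /proc/net/rpc/auth.unix.ip/content   # client IP -> auth domain cache entries
cat /proc/net/rpc/nfsd.export/content    # export cache entries
rpcdebug -m rpc -s cache                 # log cache upcall/downcall activity to dmesg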
Comment 6 Daire Byrne 2021-08-10 11:48:31 UTC
Still chugging through the bisect...

I did notice this when looking at a hung client and running "rpcdebug -m rpc -c all"

[35266.929232] -pid- flgs status -client- --rqstp- -timeout ---ops--
[35266.935502] 43470 4821      0 ffff9b4e73fad000 ffff9b4e73468200    60000 ffffffffc04042c0 nfsv4 WRITE a:call_status [sunrpc] q:xprt_pending
[35266.948198] 43454 4821      0 ffff9b4e785fde00 ffff9b3df3a3d200    60000 ffffffffc04042c0 nfsv4 READ a:call_status [sunrpc] q:xprt_pending
[35266.960805] 43455 4821      0 ffff9b4e785fde00 ffff9b3df3a3c600    60000 ffffffffc04042c0 nfsv4 READ a:call_status [sunrpc] q:xprt_pending
[35266.973413] 43456 4821      0 ffff9b4e785fde00 ffff9b3df3a3fe00    60000 ffffffffc04042c0 nfsv4 READ a:call_status [sunrpc] q:xprt_pending
[35266.985985] 43457 4821      0 ffff9b4e785fde00 ffff9b3df3a3e400    60000 ffffffffc04042c0 nfsv4 READ a:call_status [sunrpc] q:xprt_pending
[35266.998550] 43458 4821      0 ffff9b4e785fde00 ffff9b3df3a3ee00    60000 ffffffffc04042c0 nfsv4 READ a:call_status [sunrpc] q:xprt_pending
[35267.011119] 43459 4821      0 ffff9b4e785fde00 ffff9b3df3a3da00    60000 ffffffffc04042c0 nfsv4 READ a:call_status [sunrpc] q:xprt_pending
[35267.023684] 43460 4821      0 ffff9b4e785fde00 ffff9b3df3a3cc00    60000 ffffffffc04042c0 nfsv4 READ a:call_status [sunrpc] q:xprt_pending
[35267.036247] 43461 4821      0 ffff9b4e785fde00 ffff9b3df3a3f800    60000 ffffffffc04042c0 nfsv4 READ a:call_status [sunrpc] q:xprt_pending
[35267.048813] 43462 4821      0 ffff9b4e785fde00 ffff9b4e73468a00    60000 ffffffffc04042c0 nfsv4 READ a:call_status [sunrpc] q:xprt_pending
[35267.061382] 43463 4821      0 ffff9b4e785fde00 ffff9b4e73468c00    60000 ffffffffc04042c0 nfsv4 READ a:call_status [sunrpc] q:xprt_pending
[35267.073949] 43464 4821      0 ffff9b4e785fde00 ffff9b4e73468400    60000 ffffffffc04042c0 nfsv4 READ a:call_status [sunrpc] q:xprt_pending
[35267.086512] 43465 4821      0 ffff9b4e785fde00 ffff9b4e73fb7800    60000 ffffffffc04042c0 nfsv4 READ a:call_status [sunrpc] q:xprt_pending
[35267.099078] 43466 4821      0 ffff9b4e785fde00 ffff9b4e73fb7e00    60000 ffffffffc04042c0 nfsv4 READ a:call_status [sunrpc] q:xprt_pending
[35267.111640] 43467 4821      0 ffff9b4e785fde00 ffff9b4e73fb7600    60000 ffffffffc04042c0 nfsv4 READ a:call_status [sunrpc] q:xprt_pending
[35267.124218] 43575 4880      0 ffff9b4e706d9000 ffff9b4e70f7ce00   120000 ffffffffc0535a40 nfsv4 GETATTR a:call_status [sunrpc] q:xprt_pending
[35267.137040] 43516 5a81      0 ffff9b4e70f7cc00 ffff9b4ccffc2e00   120000 ffffffffc0535840 nfsv4 SEQUENCE a:call_status [sunrpc] q:xprt_pending
[35267.149952] 43517 5a81      0 ffff9b4e70f7cc00 ffff9b4ccffc3400   120000 ffffffffc0535840 nfsv4 SEQUENCE a:call_status [sunrpc] q:xprt_pending
[35267.162860] 43518 5a81      0 ffff9b4e70f7cc00 ffff9b4ccffc0600   120000 ffffffffc0535840 nfsv4 SEQUENCE a:call_status [sunrpc] q:xprt_pending
[35267.175774] 43519 5a81      0 ffff9b4e70f7cc00 ffff9b4ccffc2400   120000 ffffffffc0535840 nfsv4 SEQUENCE a:call_status [sunrpc] q:xprt_pending
[35267.188685] 43520 5a81      0 ffff9b4e70f7cc00 ffff9b4ccffc0c00   120000 ffffffffc0535840 nfsv4 SEQUENCE a:call_status [sunrpc] q:xprt_pending
[35267.201598] 43521 5a81      0 ffff9b4e70f7cc00 ffff9b4bd53e7200   120000 ffffffffc0535840 nfsv4 SEQUENCE a:call_status [sunrpc] q:xprt_pending
[35267.214508] 43522 5a81      0 ffff9b4e70f7cc00 ffff9b4bd53e4a00   120000 ffffffffc0535840 nfsv4 SEQUENCE a:call_status [sunrpc] q:xprt_pending
[35267.227422] 43576 5a81      0 ffff9b4e70f7cc00 ffff9b4ccffc2000   120000 ffffffffc0535840 nfsv4 SEQUENCE a:call_status [sunrpc] q:xprt_pending
[35267.240349] 43577 5a81      0 ffff9b4e70f7cc00 ffff9b4ccffc2600   120000 ffffffffc0535840 nfsv4 SEQUENCE a:call_status [sunrpc] q:xprt_pending
[35267.253261] 43578 5a81      0 ffff9b4e70f7cc00 ffff9b4ccffc0000   120000 ffffffffc0535840 nfsv4 SEQUENCE a:call_status [sunrpc] q:xprt_pending
[35267.266175] 43579 5a81      0 ffff9b4e70f7cc00 ffff9b4bd53e5c00   120000 ffffffffc0535840 nfsv4 SEQUENCE a:call_status [sunrpc] q:xprt_pending
Comment 7 bfields 2021-08-10 14:15:24 UTC
It might be interesting to keep an eye on the svc_drop tracepoint on the server.

svc_drop calls don't necessarily indicate a problem, but if you see them correlated with the client hangs that would be suggestive.
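
E.g. something like this, assuming tracefs is mounted in the usual place:

cd /sys/kernel/debug/tracing
echo 1 > events/sunrpc/svc_drop/enable
cat trace_pipe    # watch for drops while a client is hung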
Comment 8 Daire Byrne 2021-08-16 11:42:17 UTC
And right as I got into the final straight, our production workloads dried up. I'm trying to see if I can simulate this in any way.

This is as far as I got, with around 7 more steps left:

git bisect start
git bisect good 0e6f651912bdd027a6d730b68d6d1c3f4427c0ae
git bisect bad 25423f4bd9a9ac3e6b0ce7ecfe56c36f4e514893
git bisect good 9f4ad9e425a1d3b6a34617b8ea226d56a119a717
git bisect bad 85f3f17b5db2dd9f8a094a0ddc665555135afd22
git bisect bad ca62e9090d229926f43f20291bb44d67897baab7
git bisect good d08410d8c9908058a2f69b55e24edfb0d19da7a1
git bisect good f7857bf3745e9ede6367a8ff89c1c4f308bfd914
git bisect bad 55ba0fe059a577fa08f23223991b24564962620f
git bisect good b5b3097d9cbb1eb3df0ade9507585e6e9e3b2385
git bisect bad 2a19866b6e4cf554b57660549d12496ea84aa7d7

Daire
Comment 9 bfields 2021-08-16 14:43:43 UTC
Thanks.  So, that narrows it down to either the cifs or nfsd merges, and the cifs merge touches absolutely no code outside of fs/cifs.  So it's almost definitely the nfsd changes between v5.12-rc4 and b73ac6808b0f.
Comment 10 bfields 2021-08-16 15:08:15 UTC
Looking again at those nfsd changes....  Most are changes to server NFSv2/v3 xdr code, but your clients are all v4.2, so those are unlikely.

There are also a bunch of RDMA changes.  You didn't mention it, so I assume you're not using RDMA.

There's 7f7e7a4006f7 "nfsd: helper for laundromat expiry calculations".  Ben Coddington points out this has an unintended side effect which can cause state to expire one second later.  Or perhaps I've missed some other issue with that patch.  Seems unlikely to be related, though.

c6c7f2a84d "nfsd: Ensure knfsd shuts down when the "nfsd" pseudofs is unmounted" should only affect server shutdown.

There's some changes to copy offload code (e.g., COPY with a length 0 now copies to EOF).  Seems unlikely too.

I'm getting a little suspicious of 472d155a0631b "nfsd: report client confirmation status in "info" file"--which is partly responsible for those new "client attached" messages you noted.  I dismissed that at first.  But now I'm wondering: that patch on its own seems innocent enough, but it's probably triggering the new nfs-utils to do something different.  And maybe there's a bug in that codepath that causes mountd to fail to respond to an upcall--that *might* appear to make the server unresponsive to a single client.

The other patches mostly look like pretty minor cleanup.
Comment 11 bfields 2021-08-16 15:14:08 UTC
So, stuff to try:

- Check to make sure that rpc.mountd is still running normally after the problem reproduces.
- See if you can reproduce after reverting that one patch (472d155a0631b), or after downgrading nfs-utils.
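
For the revert, roughly (assuming 472d155a0631b still reverts cleanly on top of v5.13.y):

git checkout v5.13.5
git revert --no-edit 472d155a0631b
# rebuild, reinstall, and see whether the hangs still reproduce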

And we should also take a look at the new nfs-utils code.
Comment 12 Daire Byrne 2021-08-16 15:39:16 UTC
I'm pretty sure that rpc.mountd continues running okay, simply because subsequent clients can still mount the re-exported filesystems just fine. It really is a small subset of random clients that get stuck while others continue happily.

I forgot to mention, before I started the bisect I also downgraded nfs-utils from 2.5.4 to 2.5.2. I was also suspicious of those new messages and associated patches. But I clearly get the same behavior with 2.5.2 as well.

However, I have not tried reverting just 472d155a0631b. If I can devise a better reproducer or the production load starts up again I'll test that next.

Daire
Comment 13 bfields 2021-08-16 16:03:59 UTC
Oh, well, so much for that theory.  I don't think nfs-utils 2.5.2 has any code that pays attention to /proc/fs/nfsd/clients/.

I forgot, there is actually a known bug in 472d155a0631b, fixed by 934bd07fae7e "nfsd: move fsnotify on client creation outside spinlock", though I don't understand how it would have caused the symptoms you're seeing.
Comment 14 bfields 2021-08-25 21:06:46 UTC
This: https://lore.kernel.org/linux-nfs/20210825193314.354079-1-trond.myklebust@hammerspace.com/ fixes a bug introduced in that range, and I think it would cause the server to stop responding to a particular client.  I don't see why it's 4.2-specific, but maybe that was a coincidence.
Comment 15 Neil Brown 2021-08-27 23:21:33 UTC
Is the svc_rpc_per_connection_limit module parameter non-zero?
If it is, then the patch Bruce mentioned is a perfect fit for your problem.
If it is zero (the default), then it cannot be relevant.

cat /sys/module/sunrpc/parameters/svc_rpc_per_connection_limit
Comment 16 Daire Byrne 2021-08-28 07:50:48 UTC
Ah hah! We do indeed set svc_rpc_per_connection_limit non-zero....

The main reason is that we found, around the time this option was introduced on our on-prem RHEL7.4 storage servers, that one or two clients could quickly eat up all the nfsd threads and essentially block other clients. If those clients are asking for lots of random data and we are IO limited, everything grinds to a standstill. So since RHEL7.4 we have always set svc_rpc_per_connection_limit=1 to replicate the previous behaviour.

In the case of NFS re-exporting, we similarly found that the re-export server could also grind to a halt with a couple of greedy clients eating up all the nfsd threads and stopping other clients from getting much love. So again, we limit it to 1-4 depending on workloads and numbers of clients.

We prefer that greedy clients go slow and leave some nfsd threads available to service other clients.
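
Concretely, that amounts to something like the following (illustrative only; our exact config may differ):

# set the limit at sunrpc module load time, then check the value currently in effect
echo "options sunrpc svc_rpc_per_connection_limit=1" > /etc/modprobe.d/sunrpc.conf
cat /sys/module/sunrpc/parameters/svc_rpc_per_connection_limit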

I will test this patch out early next week and report back. Thanks for the help.

Daire
Comment 17 Daire Byrne 2021-08-31 22:59:22 UTC
So, I actually managed to finish the git bisect and it resulted in:

commit 82011c80b3ec0e05940a2ee2c76c1df9fd2b1ce8
Author: Chuck Lever <chuck.lever@oracle.com>
Date:   Tue Jan 5 10:15:09 2021 -0500

    SUNRPC: Move svc_xprt_received() call sites
    
This tallies with the fix that Trond recently posted. I have applied that fix on top of 5.13.13 and have not seen any client timeouts yet today.

I'll give it another 24 hours and report back, but it looks good so far.

Thanks all,

Daire
Comment 18 Daire Byrne 2021-09-02 09:12:55 UTC
After 2 days, still no sign of the client timeouts and I would definitely have expected to see them by now.

So yea, Trond's patch fixes the issue introduced in 5.13 by Chuck's "SUNRPC: Move svc_xprt_received() call sites" patch.

Cheers,

Daire
Comment 19 bfields 2021-09-08 22:07:03 UTC
Thanks for the confirmation.
Comment 20 Ho Ming Shun 2022-05-13 18:12:31 UTC
It seems like I am running into the same problem on multiple NFS servers serving a read-only root FS to raspberry pi NFS clients mounting using NFS v4.0.

This issue is still persisting for me all the way to 5.17.5.arch1-2.

When the lockup happens, stopping nfs-server, rmmod'ing all nfsd-related modules, and then restarting nfs-server does not get the NFS server unstuck.

Wireshark shows the server not responding to the initial NULL calls.
Comment 21 bfields 2022-05-13 18:20:19 UTC
(In reply to Ho Ming Shun from comment #20)
> It seems like I am running into the same problem on multiple NFS servers
> serving a read-only root FS to raspberry pi NFS clients mounting using NFS
> v4.0.
> 
> This issue is still persisting for me all the way to 5.17.5.arch1-2.

More likely the bug you're seeing is something different.

Please open a new bug with your details.
