Bug 218671

Summary: nfsd: memory leak when client does many file operations
Product: File System
Reporter: scpcom
Component: NFS
Assignee: Chuck Lever (cel)
Status: RESOLVED CODE_FIX    
Severity: normal    
Priority: P3    
Hardware: All   
OS: Linux   
Kernel Version:
Subsystem:
Regression: No
Bisected commit-id:

Description scpcom 2024-04-01 17:25:31 UTC
Issue found on: v6.6.13, v6.6.14, v6.6.20 and v6.8.1
Not found on: v6.4, v6.1.82 and below
Architectures: amd64 and arm(hf)

Steps to reproduce:
- Create a VM with 1GB RAM (or less)
- Install Debian 12
- Install linux-image-6.6.13+bpo-amd64-unsigned and nfs-kernel-server
- Export some folder
On the client:
- Mount the share
- Run a script that puts heavy load on the share (for example, unpacking large tar archives that contain many small files into a git repository and committing them)

On my setup it takes 20-40 hours until memory is full and nfsd triggers the OOM killer to kill other processes. The memory stays full and the system reboots:
[121972.900000] Out of memory and no killable processes...
[121972.910000] Kernel panic - not syncing: System is deadlocked on memory

I found the buggy commit using "git bisect".
Full test result:

$ git bisect start v6.6 v6.5
Bisecting: 7882 revisions left to test after this (roughly 13 steps)
[a1c19328a160c80251868dbd80066dce23d07995] Merge tag 'soc-arm-6.6' of git://git.kernel.org/pub/scm/linux/kernel/git/soc/soc
--
$ git bisect good
Bisecting: 3935 revisions left to test after this (roughly 12 steps)
[e4f1b8202fb59c56a3de7642d50326923670513f] Merge tag 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mst/vhost
--
$ git bisect bad
Bisecting: 2014 revisions left to test after this (roughly 11 steps)
[e0152e7481c6c63764d6ea8ee41af5cf9dfac5e9] Merge tag 'riscv-for-linus-6.6-mw1' of git://git.kernel.org/pub/scm/linux/kernel/git/riscv/linux
--
$ git bisect bad
Bisecting: 975 revisions left to test after this (roughly 10 steps)
[4a3b1007eeb26b2bb7ae4d734cc8577463325165] Merge tag 'pinctrl-v6.6-1' of git://git.kernel.org/pub/scm/linux/kernel/git/linusw/linux-pinctrl
--
$ git bisect good
Bisecting: 476 revisions left to test after this (roughly 9 steps)
[4debf77169ee459c46ec70e13dc503bc25efd7d2] Merge tag 'for-linus-iommufd' of git://git.kernel.org/pub/scm/linux/kernel/git/jgg/iommufd
--
$ git bisect good
Bisecting: 237 revisions left to test after this (roughly 8 steps)
[e7e9423db459423d3dcb367217553ad9ededadc9] Merge tag 'v6.6-vfs.super.fixes.2' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs
--
$ git bisect good
Bisecting: 141 revisions left to test after this (roughly 7 steps)
[8ae5d298ef2005da5454fc1680f983e85d3e1622] Merge tag '6.6-rc-ksmbd-fixes-part1' of git://git.samba.org/ksmbd
--
$ git bisect good
Bisecting: 61 revisions left to test after this (roughly 6 steps)
[99d99825fc075fd24b60cc9cf0fb1e20b9c16b0f] Merge tag 'nfs-for-6.6-1' of git://git.linux-nfs.org/projects/anna/linux-nfs
--
$ git bisect bad
Bisecting: 39 revisions left to test after this (roughly 5 steps)
[7b719e2bf342a59e88b2b6215b98ca4cf824bc58] SUNRPC: change svc_recv() to return void.
--
$ git bisect bad
Bisecting: 19 revisions left to test after this (roughly 4 steps)
[e7421ce71437ec8e4d69cc6bdf35b6853adc5050] NFSD: Rename struct svc_cacherep
--
$ git bisect good
Bisecting: 9 revisions left to test after this (roughly 3 steps)
[baabf59c24145612e4a975f459a5024389f13f5d] SUNRPC: Convert svc_udp_sendto() to use the per-socket bio_vec array
--
$ git bisect bad
Bisecting: 4 revisions left to test after this (roughly 2 steps)
[be2be5f7f4436442d8f6bffbb97a6f438df2896b] lockd: nlm_blocked list race fixes
--
$ git bisect good
Bisecting: 2 revisions left to test after this (roughly 1 step)
[d424797032c6e24b44037e6c7a2d32fd958300f0] nfsd: inherit required unset default acls from effective set
--
$ git bisect good
Bisecting: 0 revisions left to test after this (roughly 1 step)
[e18e157bb5c8c1cd8a9ba25acfdcf4f3035836f4] SUNRPC: Send RPC message on TCP with a single sock_sendmsg() call
--
$ git bisect bad
Bisecting: 0 revisions left to test after this (roughly 0 steps)
[2eb2b93581813b74c7174961126f6ec38eadb5a7] SUNRPC: Convert svc_tcp_sendmsg to use bio_vecs directly
--
$ git bisect good
e18e157bb5c8c1cd8a9ba25acfdcf4f3035836f4 is the first bad commit
commit e18e157bb5c8c1cd8a9ba25acfdcf4f3035836f4
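
The "roughly N steps" estimates in the log above follow from bisection halving the candidate range on each test, so the step count tracks the base-2 logarithm of the revisions left. The sketch below is only an approximation of that relationship (the helper name is hypothetical, and this is not git's actual estimator); it reproduces the estimates printed for the non-zero counts in this log.

```python
import math

def approx_bisect_steps(revisions_left):
    """Approximate the number of remaining bisect steps.

    Assumption: a rounded log2 of the remaining revision count.
    git computes its own estimate differently; this is an illustration.
    """
    if revisions_left < 1:
        return 0
    return round(math.log2(revisions_left))

# Revision counts taken from the bisect log above.
for left in (7882, 3935, 2014, 975, 476, 237, 141, 61, 39, 19, 9, 4, 2):
    print(left, approx_bisect_steps(left))
```

With a 20-40 hour reproduction time per test, each of those steps is expensive, which is why narrowing the starting range to v6.5..v6.6 matters.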

In /proc/meminfo, the memory loss shows up only in MemAvailable.
 MemTotal:         346948 kB
On a bad test run it looks like this:
-MemAvailable:     210820 kB
+MemAvailable:      26608 kB
On a good test run it looks like this:
-MemAvailable:     215872 kB
+MemAvailable:     221128 kB
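
A minimal sketch of how the comparison above could be automated, assuming the "-" and "+" lines are the MemAvailable readings before and after a test run. The function names are hypothetical; on a live system the text would come from reading /proc/meminfo.

```python
def mem_available_kb(meminfo_text):
    """Return the MemAvailable value (in kB) from /proc/meminfo-style text."""
    for line in meminfo_text.splitlines():
        if line.startswith("MemAvailable:"):
            return int(line.split()[1])
    raise ValueError("MemAvailable not found")

def mem_available_delta_kb(before_text, after_text):
    """Change in MemAvailable between two snapshots (negative means memory was lost)."""
    return mem_available_kb(after_text) - mem_available_kb(before_text)

# Snapshots built from the values reported above.
bad_before = "MemTotal:         346948 kB\nMemAvailable:     210820 kB\n"
bad_after = "MemTotal:         346948 kB\nMemAvailable:      26608 kB\n"
good_before = "MemTotal:         346948 kB\nMemAvailable:     215872 kB\n"
good_after = "MemTotal:         346948 kB\nMemAvailable:     221128 kB\n"

print(mem_available_delta_kb(bad_before, bad_after))    # -184212
print(mem_available_delta_kb(good_before, good_after))  # 5256
```

On the bad run roughly 180 MB of a 339 MB system disappears, while the good run ends with slightly more available memory than it started with.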

Discussion on the mailing list:
https://lore.kernel.org/lkml/trinity-068f55c9-6088-418d-bf3a-c2778a871e98-1711310237802@msvc-mesg-gmx120/
Comment 1 Chuck Lever 2024-04-01 18:04:19 UTC
The bad commit adds a page_frag_cache to be used when sending a four-byte record marker as part of an RPC reply. The pages backing the page_frag_cache never seem to be freed.
Comment 2 Chuck Lever 2024-04-02 13:13:31 UTC
Possible workarounds until a fix is available: revert e18e157bb5c8; use a kernel release before v6.6; or use proto=rdma or proto=udp.
Comment 3 Chuck Lever 2024-04-03 14:18:06 UTC
I've confirmed that our original belief that sock_sendmsg() would manage the reference counts of the pages underlying the bvec was wrong: when MSG_SPLICE_PAGES is in use, sock_sendmsg() is synchronous and the page reference counts are not altered. Thus the pages underlying the page_frag_cache are never released. Releasing the fragment immediately seems to be a safe, narrow, and effective fix.
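
The leak mechanism described in this comment can be illustrated with a toy user-space model. This is not kernel code: the class and function names are hypothetical, and it simplifies by giving each fragment its own page, whereas a real page_frag_cache carves many fragments out of one page. It only shows why a reference that nobody drops accumulates with every reply, and why releasing the fragment right after the send stops that.

```python
class Page:
    """Toy stand-in for a kernel page with a reference count."""
    def __init__(self):
        self.refcount = 0

class ToyFragCache:
    """Hypothetical model of a fragment cache: each allocation pins its page."""
    def __init__(self):
        self.pages = []

    def alloc(self):
        page = Page()        # simplification: one fragment per page
        page.refcount += 1   # the allocation holds a reference
        self.pages.append(page)
        return page

def toy_sendmsg(frag):
    """Models the behaviour described in this comment: the send is
    synchronous and leaves the page reference count untouched."""
    return 4  # bytes "sent" for a four-byte record marker

def send_reply(cache, release_after_send):
    frag = cache.alloc()
    toy_sendmsg(frag)
    if release_after_send:
        frag.refcount -= 1   # drop the reference immediately after sending

def leaked_pages(cache):
    """Count pages that are still pinned by an outstanding reference."""
    return sum(1 for page in cache.pages if page.refcount > 0)

leaky = ToyFragCache()
fixed = ToyFragCache()
for _ in range(1000):
    send_reply(leaky, release_after_send=False)
    send_reply(fixed, release_after_send=True)

print(leaked_pages(leaky))  # 1000
print(leaked_pages(fixed))  # 0
```

In the model, every reply sent without the release leaves one page pinned. That matches the slow, load-proportional growth seen in the report.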
Comment 4 Chuck Lever 2024-04-07 14:38:51 UTC
We believe that commit 05258a0a69b3 ("SUNRPC: Fix a slow server-side memory leak with RPC-over-TCP") fixes this issue. It will be available in v6.9-rc3.