Issue found on: v6.6.13, v6.6.14, v6.6.20 and v6.8.1
Not found on: v6.4, v6.1.82 and below
Architectures: amd64 and arm(hf)

Steps to reproduce:
- Create a VM with 1GB RAM (or less)
- Install Debian 12
- Install linux-image-6.6.13+bpo-amd64-unsigned and nfs-kernel-server
- Export some folder

On the client:
- Mount the share
- Run a script that produces heavy load on the share (like unpacking large tar archives that contain many small files into a git repository and committing them)

On my setup it takes 20-40 hours until memory is full and nfsd invokes the OOM killer to kill other processes. The memory stays full and the system reboots:

[121972.900000] Out of memory and no killable processes...
[121972.910000] Kernel panic - not syncing: System is deadlocked on memory

I found the buggy commit using "git bisect". Full test result:

$ git bisect start v6.6 v6.5
Bisecting: 7882 revisions left to test after this (roughly 13 steps)
[a1c19328a160c80251868dbd80066dce23d07995] Merge tag 'soc-arm-6.6' of git://git.kernel.org/pub/scm/linux/kernel/git/soc/soc
--
$ git bisect good
Bisecting: 3935 revisions left to test after this (roughly 12 steps)
[e4f1b8202fb59c56a3de7642d50326923670513f] Merge tag 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mst/vhost
--
$ git bisect bad
Bisecting: 2014 revisions left to test after this (roughly 11 steps)
[e0152e7481c6c63764d6ea8ee41af5cf9dfac5e9] Merge tag 'riscv-for-linus-6.6-mw1' of git://git.kernel.org/pub/scm/linux/kernel/git/riscv/linux
--
$ git bisect bad
Bisecting: 975 revisions left to test after this (roughly 10 steps)
[4a3b1007eeb26b2bb7ae4d734cc8577463325165] Merge tag 'pinctrl-v6.6-1' of git://git.kernel.org/pub/scm/linux/kernel/git/linusw/linux-pinctrl
--
$ git bisect good
Bisecting: 476 revisions left to test after this (roughly 9 steps)
[4debf77169ee459c46ec70e13dc503bc25efd7d2] Merge tag 'for-linus-iommufd' of git://git.kernel.org/pub/scm/linux/kernel/git/jgg/iommufd
--
$ git bisect good
Bisecting: 237 revisions left to test after this (roughly 8 steps)
[e7e9423db459423d3dcb367217553ad9ededadc9] Merge tag 'v6.6-vfs.super.fixes.2' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs
--
$ git bisect good
Bisecting: 141 revisions left to test after this (roughly 7 steps)
[8ae5d298ef2005da5454fc1680f983e85d3e1622] Merge tag '6.6-rc-ksmbd-fixes-part1' of git://git.samba.org/ksmbd
--
$ git bisect good
Bisecting: 61 revisions left to test after this (roughly 6 steps)
[99d99825fc075fd24b60cc9cf0fb1e20b9c16b0f] Merge tag 'nfs-for-6.6-1' of git://git.linux-nfs.org/projects/anna/linux-nfs
--
$ git bisect bad
Bisecting: 39 revisions left to test after this (roughly 5 steps)
[7b719e2bf342a59e88b2b6215b98ca4cf824bc58] SUNRPC: change svc_recv() to return void.
--
$ git bisect bad
Bisecting: 19 revisions left to test after this (roughly 4 steps)
[e7421ce71437ec8e4d69cc6bdf35b6853adc5050] NFSD: Rename struct svc_cacherep
--
$ git bisect good
Bisecting: 9 revisions left to test after this (roughly 3 steps)
[baabf59c24145612e4a975f459a5024389f13f5d] SUNRPC: Convert svc_udp_sendto() to use the per-socket bio_vec array
--
$ git bisect bad
Bisecting: 4 revisions left to test after this (roughly 2 steps)
[be2be5f7f4436442d8f6bffbb97a6f438df2896b] lockd: nlm_blocked list race fixes
--
$ git bisect good
Bisecting: 2 revisions left to test after this (roughly 1 step)
[d424797032c6e24b44037e6c7a2d32fd958300f0] nfsd: inherit required unset default acls from effective set
--
$ git bisect good
Bisecting: 0 revisions left to test after this (roughly 1 step)
[e18e157bb5c8c1cd8a9ba25acfdcf4f3035836f4] SUNRPC: Send RPC message on TCP with a single sock_sendmsg() call
--
$ git bisect bad
Bisecting: 0 revisions left to test after this (roughly 0 steps)
[2eb2b93581813b74c7174961126f6ec38eadb5a7] SUNRPC: Convert svc_tcp_sendmsg to use bio_vecs directly
--
$ git bisect good
e18e157bb5c8c1cd8a9ba25acfdcf4f3035836f4 is the first bad commit
commit e18e157bb5c8c1cd8a9ba25acfdcf4f3035836f4

In /proc/meminfo, the memory loss shows up only in MemAvailable:

MemTotal: 346948 kB

On a bad test run it looks like this:
-MemAvailable: 210820 kB
+MemAvailable: 26608 kB

On a good test run it looks like this:
-MemAvailable: 215872 kB
+MemAvailable: 221128 kB

Discussion on the mailing list:
https://lore.kernel.org/lkml/trinity-068f55c9-6088-418d-bf3a-c2778a871e98-1711310237802@msvc-mesg-gmx120/
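
The before/after MemAvailable comparison above can be scripted so that a long test run is easier to judge. The following is only a minimal sketch in Python: the function names (parse_meminfo, mem_available_drop, read_meminfo) are made up for illustration and are not part of any existing tool, and it assumes the usual "Name:   value kB" layout of /proc/meminfo.

```python
import re

# Matches lines such as "MemAvailable:   210820 kB" or "Active(anon):  1234 kB".
# Some fields (for example HugePages_Total) carry no "kB" suffix, so it is optional.
_LINE = re.compile(r"^(\w+(?:\(\w+\))?):\s+(\d+)(?:\s+kB)?\s*$")


def parse_meminfo(text):
    """Return a dict mapping each /proc/meminfo field name to its integer value."""
    values = {}
    for line in text.splitlines():
        match = _LINE.match(line)
        if match:
            values[match.group(1)] = int(match.group(2))
    return values


def mem_available_drop(before_text, after_text):
    """Return how many kB MemAvailable shrank between two snapshots.

    A negative result means MemAvailable grew.
    """
    before = parse_meminfo(before_text)["MemAvailable"]
    after = parse_meminfo(after_text)["MemAvailable"]
    return before - after


def read_meminfo(path="/proc/meminfo"):
    """Read a snapshot from the running system (Linux only)."""
    with open(path) as handle:
        return handle.read()
```

Taking one snapshot with read_meminfo() before starting the client workload and another after it finishes, then passing both to mem_available_drop(), gives the same comparison as the numbers quoted in the report. A drop that is large relative to MemTotal points at the leak; a small or negative value matches the good run.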
The bad commit adds a page_frag_cache to be used when sending a four-byte record marker as part of an RPC reply. The pages backing the page_frag_cache never seem to be freed.
Possible workarounds until a fix is available:
- Revert e18e157bb5c8
- Use a kernel release before v6.6
- Use proto=rdma or proto=udp
I've confirmed that our original belief that sock_sendmsg() would manage the reference counts of the pages underlying the bvec was wrong: when MSG_SPLICE_PAGES is in use, sock_sendmsg() is synchronous and the page reference counts are not altered. Thus the pages underlying the page_frag_cache are never released. Releasing the fragment immediately seems to be a safe, narrow, and effective fix.
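
To illustrate why the missing release leaks pages, here is a toy model in Python. It is not kernel code and does not reproduce the real page_frag_cache implementation: the class ToyFragCache, its methods, and the 4096-byte page size are assumptions chosen for illustration, and the real cache can also reuse a page in place, which this sketch ignores. The only behaviour it models is the one described above: the send leaves the page reference count untouched, so a fragment that is not freed explicitly pins its page forever.

```python
PAGE_SIZE = 4096    # assumed page size for the toy model
MARKER_SIZE = 4     # four-byte record marker sent with each RPC reply


class Page:
    def __init__(self):
        self.refcount = 1   # reference held by the cache itself
        self.offset = 0     # bytes already handed out from this page


class ToyFragCache:
    """Hypothetical stand-in for a page fragment cache."""

    def __init__(self):
        self.live_pages = 0
        self.current = self._new_page()

    def _new_page(self):
        self.live_pages += 1
        return Page()

    def _put(self, page):
        # Drop one reference; the page is freed when the count reaches zero.
        page.refcount -= 1
        if page.refcount == 0:
            self.live_pages -= 1

    def alloc(self, size):
        if self.current.offset + size > PAGE_SIZE:
            # Page exhausted: the cache drops its own reference and moves on.
            self._put(self.current)
            self.current = self._new_page()
        page = self.current
        page.offset += size
        page.refcount += 1   # each outstanding fragment pins its page
        return page

    def free(self, page):
        self._put(page)


def send_replies(count, release_after_send):
    """Simulate `count` replies and return the number of pages still allocated."""
    cache = ToyFragCache()
    for _ in range(count):
        frag = cache.alloc(MARKER_SIZE)
        # The synchronous send would happen here; in this model, as in the
        # explanation above, it does not change the page reference count.
        if release_after_send:
            cache.free(frag)
    return cache.live_pages
```

In this model, send_replies(n, False) grows by one page for every 1024 replies, while send_replies(n, True) stays at a single page no matter how many replies are sent. That mirrors the slow, load-dependent loss of MemAvailable in the report and why releasing the fragment right after the send is enough to stop it.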
We believe that commit 05258a0a69b3 ("SUNRPC: Fix a slow server-side memory leak with RPC-over-TCP") fixes this issue. It will be available in v6.9-rc3.