Bug 219535 - Possible memory leak on nfsd
Summary: Possible memory leak on nfsd
Status: ASSIGNED
Alias: None
Product: File System
Classification: Unclassified
Component: NFSD
Hardware: All
OS: Linux
Importance: P3 normal
Assignee: Chuck Lever
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2024-11-27 02:32 UTC by Chen Chen
Modified: 2025-01-22 21:23 UTC
CC List: 2 users

See Also:
Kernel Version:
Subsystem:
Regression: No
Bisected commit-id:


Attachments
sar -r mem usage (6.95 KB, text/plain), 2024-11-27 02:32 UTC, Chen Chen
lsmod (4.96 KB, text/plain), 2024-11-27 02:32 UTC, Chen Chen
/proc/meminfo (1.53 KB, text/plain), 2024-11-27 02:33 UTC, Chen Chen
/proc/slabinfo (30.92 KB, text/plain), 2024-11-27 02:33 UTC, Chen Chen
systemctl status (4.06 KB, application/octet-stream), 2024-11-27 02:33 UTC, Chen Chen
/proc/vmallocinfo (170.08 KB, text/plain), 2024-11-27 02:34 UTC, Chen Chen
/proc/vmstat (3.65 KB, text/plain), 2024-11-27 02:34 UTC, Chen Chen
oom dmesg from kdump (535.11 KB, text/plain), 2024-11-27 02:34 UTC, Chen Chen
dmesg of another 3 crashes (31.77 KB, text/plain), 2024-12-07 08:30 UTC, Chen Chen
Logs and traces from Jan-18 pt1 (4.61 MB, application/octet-stream), 2025-01-22 20:43 UTC, JJ Jordan
Logs and traces from Jan-18 pt2 (601.99 KB, application/octet-stream), 2025-01-22 20:44 UTC, JJ Jordan

Description Chen Chen 2024-11-27 02:32:10 UTC
Created attachment 307283 [details]
sar -r mem usage

My RHEL9 server, which runs only an NFS service, often OOMed after a day or two with no userspace memory usage. I switched to the elrepo kernel-lts package, but the problem persists.

I'm now using 6.1.119-1.el9.elrepo.x86_64. The problem also occurred on (RHEL) 5.14.0-427.40.1.el9_4, (RHEL) 5.14.0-503.14.1.el9_5 and 6.1.115-1.el9.elrepo.x86_64.

I'm not sure the leak is caused by NFS, but since it is the only service running on the server it is the obvious suspect. The server has a Mellanox Technologies MT27500 Family [ConnectX-3] InfiniBand card and NFSoRDMA is enabled. No third-party drivers are used.

The following data were gathered moments before it OOMed and crashed.

sar -r showed the typical appearance of a memory leak (columns: kbmemfree kbavail kbmemused %memused kbbuffers kbcached kbcommit %commit kbactive kbinact kbdirty).
01:20:13 AM 390187300 388732764   3501864      0.89      4856    363952    390344      0.09    100680    358384     17148
01:30:13 AM 379492128 378312768  13642416      3.46      4856    909388    390344      0.09    108844    895740        16
01:40:13 AM 367687716 367062060  24851416      6.30      4856   1498272    390344      0.09    116736   1476672        16
01:50:50 AM 361704244 361471420  30437312      7.72      4856   1888780    390344      0.09    127888   1856036     29912
02:00:13 AM 355796296 355848120  36061648      9.15      4856   2173560    390344      0.09    131544   2137152         0
....
09:00:13 AM   1518392  18089616 373760196     94.79      4760  18648816    390344      0.09    470608  18273412        36
09:10:13 AM   1499980  17223900 374626172     95.01      4740  17801676    390344      0.09    471964  17424672      5292
09:20:13 AM   1561896   6784736 385059756     97.66      1712   7338540    423580      0.10    325452   7070372         0
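
(For reference, a minimal way to capture this kind of periodic snapshot, assuming sysstat is installed; the plain /proc/meminfo loop is an alternative if it is not. The 10-minute interval just mirrors the log above.)

# one sample every 10 minutes, matching the log above (needs sysstat)
sar -r 600

# or, without sysstat: append /proc/meminfo snapshots with timestamps
while true; do
    date
    grep -E 'MemFree|MemAvailable|Slab|SUnreclaim|VmallocUsed' /proc/meminfo
    sleep 600
done >> /var/log/meminfo-watch.log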

meminfo also didn't show anything using RAM.
MemTotal:       394292660 kB
MemFree:         1551296 kB
MemAvailable:    6776108 kB
Buffers:            1712 kB
Cached:          7340144 kB
SwapCached:         4308 kB
Active:           325936 kB
Inactive:        7071836 kB
...
KReclaimable:     129816 kB
Slab:             331596 kB
SReclaimable:     129816 kB
SUnreclaim:       201780 kB
...
VmallocUsed:      319528 kB

slabinfo usage is low. Attached.

vmallocinfo doesn't show much either. Attached.

The dmesg log showed that the OOM killer had killed nearly every userspace program.
[29960.547403] Tasks state (memory values in pages):
[29960.547404] [  pid  ]   uid  tgid total_vm      rss pgtables_bytes swapents oom_score_adj name
[29960.547412] [   1020]     0  1020     9498      640    94208     1000         -1000 systemd-udevd
[29960.547417] [   1247]     0  1247   105208     6888   126976        0         -1000 multipathd
[29960.547421] [   1342]     0  1342    23190      330    65536      764         -1000 auditd
[29960.547428] [   1472]     0  1472     4185      806    73728      357         -1000 sshd
[29960.547438] Out of memory and no killable processes...
[29960.547439] Kernel panic - not syncing: System is deadlocked on memory

The systemctl status output is attached. Nothing else is running.

I have a 224 GB vmcore dump but don't know how to analyze it, and I think it is too large to upload anywhere.
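
(In case it helps, this is roughly how I understand the dump could be inspected locally with the crash utility; the vmlinux path below is only an example and needs debuginfo matching the exact running kernel.)

crash /usr/lib/debug/lib/modules/6.1.119-1.el9.elrepo.x86_64/vmlinux \
      /var/crash/<timestamp>/vmcore

# useful commands at the crash prompt:
#   kmem -i    overall memory usage summary
#   kmem -s    per-cache slab usage
#   log        kernel ring buffer saved in the dump
#   bt -a      backtrace of the active task on every CPU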

I'd appreciate any help in figuring out what went wrong.
Comment 1 Chen Chen 2024-11-27 02:32:47 UTC
Created attachment 307284 [details]
lsmod
Comment 2 Chen Chen 2024-11-27 02:33:06 UTC
Created attachment 307285 [details]
/proc/meminfo
Comment 3 Chen Chen 2024-11-27 02:33:22 UTC
Created attachment 307286 [details]
/proc/slabinfo
Comment 4 Chen Chen 2024-11-27 02:33:38 UTC
Created attachment 307287 [details]
systemctl status
Comment 5 Chen Chen 2024-11-27 02:34:02 UTC
Created attachment 307288 [details]
/proc/vmallocinfo
Comment 6 Chen Chen 2024-11-27 02:34:21 UTC
Created attachment 307289 [details]
/proc/vmstat
Comment 7 Chen Chen 2024-11-27 02:34:47 UTC
Created attachment 307290 [details]
oom dmesg from kdump
Comment 8 Chen Chen 2024-12-07 08:30:58 UTC
Created attachment 307330 [details]
dmesg of another 3 crashes

Since reporting this I have hit another 3 crashes, all pointing at nfsd.

First one:
[136965.765431] Out of memory and no killable processes...
[136965.765433] Kernel panic - not syncing: System is deadlocked on memory
[136965.766148] CPU: 2 PID: 1856 Comm: nfsd Kdump: loaded Tainted: G            E      6.1.119-1.el9.elrepo.x86_64 #1
[136965.766852] Hardware name: Dell Inc. PowerEdge R740/0923K0, BIOS 2.22.2 09/12/2024
[136965.767546] Call Trace:
[136965.768230]  <TASK>
[136965.768903]  dump_stack_lvl+0x45/0x5e
[136965.769571]  panic+0x10c/0x2c2
[136965.770231]  out_of_memory.cold+0x2f/0x7e
[136965.770874]  __alloc_pages_slowpath.constprop.0+0x707/0x9d0
[136965.771518]  __alloc_pages+0x35d/0x370
[136965.772147]  __alloc_pages_bulk+0x3e5/0x680
[136965.772766]  svc_alloc_arg+0x81/0x1f0 [sunrpc]
[136965.773431]  svc_recv+0x1f/0x190 [sunrpc]
[136965.774089]  ? nfsd_inet6addr_event+0x110/0x110 [nfsd]
[136965.774726]  nfsd+0x87/0xc0 [nfsd]
[136965.775347]  kthread+0xe5/0x110
[136965.775926]  ? kthread_complete_and_exit+0x20/0x20
[136965.776499]  ret_from_fork+0x1f/0x30
[136965.777062]  </TASK>

Second:
[167723.787640] WARNING: CPU: 3 PID: 1872 at mm/slab_common.c:957 free_large_kmalloc+0x5a/0x80
[167723.787667] Modules linked in: <cut here>
[167723.787874] CPU: 3 PID: 1872 Comm: nfsd Kdump: loaded Not tainted 5.14.0-503.15.1.el9_5.x86_64 #1
[167723.787882] Hardware name: Dell Inc. PowerEdge R740/0923K0, BIOS 2.22.2 09/12/2024
[167723.787886] RIP: 0010:free_large_kmalloc+0x5a/0x80

Third:
[ 3883.748094] ------------[ cut here ]------------
[ 3883.748105] WARNING: CPU: 9 PID: 1886 at mm/slab_common.c:957 free_large_kmalloc+0x5a/0x80
[ 3883.748131] Modules linked in: <cut here>
[ 3883.748339] CPU: 9 PID: 1886 Comm: nfsd Kdump: loaded Not tainted 5.14.0-503.15.1.el9_5.x86_64 #1
[ 3883.748342] Hardware name: Dell Inc. PowerEdge R740/0923K0, BIOS 2.22.2 09/12/2024
[ 3883.748344] RIP: 0010:free_large_kmalloc+0x5a/0x80
Comment 9 Chuck Lever 2024-12-07 15:28:40 UTC
Hi Chen -

After some review, these all appear to be Red Hat Enterprise kernels. Such kernels are extensively patched and maintained exclusively by Red Hat engineers. I kindly request that you report this issue to Red Hat first and have them troubleshoot it.

If they find there is a needed upstream fix, do feel free to re-open this bug.

[I am a fan of the old ConnectX-3 cards, btw]
Comment 10 Chen Chen 2024-12-10 05:16:13 UTC
Hi Mr. Lever,

I *clearly* stated I was using 6.1.119, which is the latest longterm kernel (released 2024-11-22), compiled by the ELRepo Project as-is from the upstream tarball.

[136965.766148] CPU: 2 PID: 1856 Comm: nfsd Kdump: loaded Tainted: G            E      6.1.119-1.el9.elrepo.x86_64 #1


I encountered the problem on both the shipped RHEL kernels and the latest and next-to-latest LTS releases, so the bug must still exist upstream. That's why I filed this bug.

Anyway, I encountered another 2 crashes in the last two days, and the call stack again points at nfsd.
Comment 11 Chuck Lever 2024-12-10 14:43:45 UTC
This is what comment 0 says:

> My RHEL9 server, which runs only an NFS service, often OOMed after a day
> or two with no userspace memory usage. I switched to the elrepo
> kernel-lts package, but the problem persists.

> I'm now using 6.1.119-1.el9.elrepo.x86_64. The problem also occurred on
> (RHEL) 5.14.0-427.40.1.el9_4, (RHEL) 5.14.0-503.14.1.el9_5 and
> 6.1.115-1.el9.elrepo.x86_64.

You mentioned RHEL, and RHEL 9 in particular, several times here. I have no prior knowledge of "the ELRepo Project" -- never heard of it. By "uname" these all look like distro-built kernels to me.

> Anyway, I encountered another 2 crashes in the last two days and
> call stack insists nfsd caused it.

I'm not saying this isn't an NFSD bug. But it might not be a problem in recent kernels. If I'm reading your reports correctly, you have not tested with 6.12 or newer. 6.1.anything is based on a two-year-old code base.

Any fix we create for this issue must be applied to the upstream Linus kernel first. Indeed, a fix might already exist somewhere in upstream. By upstream, I mean the "master" branch in this repo:

https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git

Therefore the first task is for you to confirm by testing that this branch either still has this issue, in which case we have to troubleshoot further; or does not, in which case you can bisect to find the upstream fix that needs to be backported to the LTS kernels.
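
A minimal sketch of one way to build and boot that tree for testing (starting from your current distro config is only a suggestion, not a requirement):

git clone https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git
cd linux
cp /boot/config-"$(uname -r)" .config
make olddefconfig            # accept defaults for any new config options
make -j"$(nproc)"
sudo make modules_install install
sudo reboot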
Comment 12 Chen Chen 2024-12-11 01:11:25 UTC
Hi Mr. Lever,

> You mentioned RHEL, and RHEL 9 in particular, several times here.

Because I wanted to indicate that, apart from the kernel, all of the other toolchains were the latest versions from RHEL9.

The ELRepo Project (https://elrepo.org/) is a group that takes the latest kernel source and packages it into RPMs for easy installation on EL-like distributions (RHEL, Oracle Linux, Rocky, Alma, etc.).

> By upstream, I mean the "master" branch in this repo

OK. I've just installed the latest stable kernel (6.12.4) and will see whether it helps.
Comment 13 Chuck Lever 2024-12-12 15:57:39 UTC
From attachment 307290 [details]:

[29924.805968] oom-kill:constraint=CONSTRAINT_NONE,nodemask=(null),cpuset=/,mems_allowed=0-1,global_oom,task_memcg=/user.slice/user-0.slice/user@0.service/init.scope,task=(sd-pam),pid=4503,uid=0
[29924.805991] Out of memory: Killed process 4503 ((sd-pam)) total-vm:173972kB, anon-rss:0kB, file-rss:0kB, shmem-rss:0kB, UID:0 pgtables:96kB oom_score_adj:100
[29925.425864] nfsd invoked oom-killer: gfp_mask=0xcc0(GFP_KERNEL), order=0, oom_score_adj=0
[29925.425872] CPU: 0 PID: 1874 Comm: nfsd Kdump: loaded Tainted: G            E      6.1.119-1.el9.elrepo.x86_64 #1
[29925.425875] Hardware name: Dell Inc. PowerEdge R740/0923K0, BIOS 2.22.2 09/12/2024
[29925.425877] Call Trace:
[29925.425880]  <TASK>
[29925.425885]  dump_stack_lvl+0x45/0x5e
[29925.425893]  dump_header+0x4a/0x213
[29925.425897]  oom_kill_process.cold+0xb/0x10
[29925.425901]  out_of_memory+0xed/0x2e0
[29925.425906]  __alloc_pages_slowpath.constprop.0+0x707/0x9d0
[29925.425916]  __alloc_pages+0x35d/0x370
[29925.425921]  __alloc_pages_bulk+0x3e5/0x680
[29925.425927]  svc_alloc_arg+0x81/0x1f0 [sunrpc]
[29925.425991]  svc_recv+0x1f/0x190 [sunrpc]
[29925.426043]  ? nfsd_inet6addr_event+0x110/0x110 [nfsd]
[29925.426080]  nfsd+0x87/0xc0 [nfsd]
[29925.426113]  kthread+0xe5/0x110
[29925.426118]  ? kthread_complete_and_exit+0x20/0x20
[29925.426122]  ret_from_fork+0x1f/0x30
[29925.426129]  </TASK>

NFSD is targeted by the OOM killer because it frequently allocates up to 256 pages at a time to fill its send and receive buffers. It is not necessarily the source of a leak.

The bulk page allocator is on the slow path here, suggesting there weren't any free pages available on the lists it normally checks first. So it is doing one-at-a-time order-0 allocations, a sign that memory is short.

We see that Node 1 appears to be short on free memory, but the system has not pushed into swap at all. Kernel memory isn't swappable, so whatever is leaking is in the kernel proper.

The slab caches all look reasonably sized, so not likely a slab leak.

At this point we would want someone with some MM expertise to come in and help us nail down the leak.
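
In the meantime, here is a rough sketch of data that could help narrow down a non-slab kernel page leak. It assumes debugfs is mounted; kmemleak and page_owner only work on kernels built with CONFIG_DEBUG_KMEMLEAK and CONFIG_PAGE_OWNER, and page_owner additionally needs page_owner=on on the kernel command line.

# watch which kernel-side counters grow over time
grep -E 'nr_free_pages|nr_slab_(un)?reclaimable|nr_kernel_stack|nr_page_table_pages' /proc/vmstat
cat /proc/buddyinfo

# if kmemleak is available, trigger a scan and dump suspected leaks
echo scan > /sys/kernel/debug/kmemleak
cat /sys/kernel/debug/kmemleak

# if page_owner is enabled, dump allocation stacks for all tracked pages
cat /sys/kernel/debug/page_owner > page_owner_full.txt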
Comment 14 Bugspray Bot 2024-12-12 16:20:08 UTC
Chuck Lever <chuck.lever@oracle.com> replies to comment #13:

Hi -

An NFSD page allocation on v6.1.y is triggering OOM-killer. The reporter
has provided a lot of detail, and we need some help steering us towards
the possible leak culprit. Any takers?

(We've asked the reporter to reproduce on a more recent kernel if
possible).

-------- Forwarded Message --------
Subject: Re: Possible memory leak on nfsd
Date: Thu, 12 Dec 2024 16:00:17 +0000
From: Chuck Lever via Bugspray Bot <bugbot@kernel.org>
To: jlayton@kernel.org, linux-nfs@vger.kernel.org, trondmy@kernel.org, 
cel@kernel.org, anna@kernel.org

Chuck Lever writes via Kernel.org Bugzilla:

[The forwarded message body repeats the call trace and analysis from comment 13.]

View: https://bugzilla.kernel.org/show_bug.cgi?id=219535#c13
You can reply to this message to join the discussion.

(via https://msgid.link/45ccbf00-3bd8-444e-8765-637caac5f738@oracle.com)
Comment 15 Chen Chen 2025-01-10 16:46:38 UTC
Sorry for my rudeness in my previous discussion.

After switching to 6.12.4, the server stayed stable for 30 days. So whatever caused the memory leak appears to have been resolved somewhere between 6.1.119 and 6.12.

You might want to close this bug if a backport is not worthwhile.
Comment 16 Bugspray Bot 2025-01-10 20:40:00 UTC
Chuck Lever <chuck.lever@oracle.com> replies to comment #15:

On 1/10/25 11:50 AM, Chen Chen via Bugspray Bot wrote:
> Chen Chen writes via Kernel.org Bugzilla:
> 
> Sorry for my rudeness in my previous discussion.
> 
> After switching to 6.12.4, the server stayed stable for 30 days.

That's good news!


> So whatever caused the memory leak appears to have been resolved somewhere
> between 6.1.119 and 6.12.

That's tens of thousands of commits over two years. Unfortunately that
doesn't really tell us what the problem is.


> You might want to close this bug if a backport is not worthwhile.

We need to know the exact commit that contains the fix before it can
be determined whether a backport is feasible.

Are you able to bisect between v6.1 and v6.12? If not, do you have
a simple, narrow reproducer that we can use to explore this ourselves?

(via https://msgid.link/463b4513-5c5a-4394-8a93-58c0feef3149@oracle.com)
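
For reference, a rough sketch of how a "find the fix" bisect between those two releases could be run; the custom terms just avoid the confusing good/bad inversion, since here the newer kernel is the working one:

git clone https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git
cd linux
git bisect start --term-old=broken --term-new=fixed
git bisect fixed v6.12
git bisect broken v6.1
# build, boot, and run the NFS workload at each step, then mark the result:
#   git bisect fixed     if the server survives the workload
#   git bisect broken    if it still leaks and OOMs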
Comment 17 JJ Jordan 2025-01-22 20:43:25 UTC
Created attachment 307525 [details]
Logs and traces from Jan-18 pt1

Here are the traces from two NFS crashes that occurred this past weekend.
Both occurred in the AM (US time) on Jan 18, a few hours apart from one
another.

I followed the instructions I found in the various threads.
There was no output from `rpcdebug -m rpc -c`; I'm not sure what I did wrong
there. The syslog ought to contain the output of sysrq-trigger, however.

The output from trace-cmd captures several days' worth of logs in either
case, but not from system boot.

I have cut the syslogs to cover from roughly one hour before the incident until
the system finished shutting down prior to reboot. I have removed the output of other services.

Both are VMs on GCE running the 6.1.119 kernel from Debian bookworm (6.1.0-28),
with ~60 GiB of memory and 16 CPUs.
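
(For reference, the capture was roughly along these lines; the exact tracepoint selection here is only an example and may not be what the maintainers want.)

# record sunrpc/nfsd tracepoints until the problem reproduces (Ctrl-C to stop)
trace-cmd record -e sunrpc -e nfsd -o nfsd-trace.dat

# printk-based RPC debugging; -s sets the flags, -c clears them again afterwards
rpcdebug -m rpc -s all
rpcdebug -m nfsd -s all

# after the incident: dump task states into the kernel log, then convert the trace
echo t > /proc/sysrq-trigger
trace-cmd report nfsd-trace.dat > nfsd-trace.txt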
Comment 18 JJ Jordan 2025-01-22 20:44:44 UTC
Created attachment 307526 [details]
Logs and traces from Jan-18 pt2

Part 2, see previous description
Comment 19 JJ Jordan 2025-01-22 21:23:29 UTC
Comment on attachment 307525 [details]
Logs and traces from Jan-18 pt1

This was submitted in error, apologies.
Comment 20 JJ Jordan 2025-01-22 21:23:52 UTC
Comment on attachment 307526 [details]
Logs and traces from Jan-18 pt2

Also submitted in error.
