Created attachment 307283 [details]
sar -r mem usage

My RHEL9 server, which runs nothing but the NFS service, often OOMs after a day or two, with no userspace memory usage to account for it. I switched to the ELRepo kernel-lts and the problem persists. I'm now using 6.1.119-1.el9.elrepo.x86_64. The problem also occurred on (RHEL) 5.14.0-427.40.1.el9_4, (RHEL) 5.14.0-503.14.1.el9_5, and 6.1.115-1.el9.elrepo.x86_64.

I'm not certain NFS is the cause, but since it is the only service running on the server it is the only suspect I have. The server has a Mellanox Technologies MT27500 Family [ConnectX-3] InfiniBand card and NFSoRDMA is enabled. No third-party drivers are used.

The following data were gathered moments before the server OOMed and crashed.

sar shows the typical profile of a memory leak (columns are kbmemfree, kbavail, kbmemused, %memused, kbbuffers, kbcached, kbcommit, %commit, kbactive, kbinact, kbdirty):

01:20:13 AM 390187300 388732764   3501864  0.89 4856   363952 390344 0.09 100680   358384 17148
01:30:13 AM 379492128 378312768  13642416  3.46 4856   909388 390344 0.09 108844   895740    16
01:40:13 AM 367687716 367062060  24851416  6.30 4856  1498272 390344 0.09 116736  1476672    16
01:50:50 AM 361704244 361471420  30437312  7.72 4856  1888780 390344 0.09 127888  1856036 29912
02:00:13 AM 355796296 355848120  36061648  9.15 4856  2173560 390344 0.09 131544  2137152     0
....
09:00:13 AM   1518392  18089616 373760196 94.79 4760 18648816 390344 0.09 470608 18273412    36
09:10:13 AM   1499980  17223900 374626172 95.01 4740 17801676 390344 0.09 471964 17424672  5292
09:20:13 AM   1561896   6784736 385059756 97.66 1712  7338540 423580 0.10 325452  7070372     0

meminfo also shows nothing using the RAM:

MemTotal:       394292660 kB
MemFree:          1551296 kB
MemAvailable:     6776108 kB
Buffers:             1712 kB
Cached:           7340144 kB
SwapCached:          4308 kB
Active:            325936 kB
Inactive:         7071836 kB
...
KReclaimable:      129816 kB
Slab:              331596 kB
SReclaimable:      129816 kB
SUnreclaim:        201780 kB
...
VmallocUsed:       319528 kB

slabinfo usage is low (attached). vmallocinfo doesn't show much either (attached).

The dmesg log shows the OOM killer had killed nearly every userspace program:

[29960.547403] Tasks state (memory values in pages):
[29960.547404] [  pid  ]   uid  tgid total_vm      rss pgtables_bytes swapents oom_score_adj name
[29960.547412] [   1020]     0  1020     9498      640    94208     1000         -1000 systemd-udevd
[29960.547417] [   1247]     0  1247   105208     6888   126976        0         -1000 multipathd
[29960.547421] [   1342]     0  1342    23190      330    65536      764         -1000 auditd
[29960.547428] [   1472]     0  1472     4185      806    73728      357         -1000 sshd
[29960.547438] Out of memory and no killable processes...
[29960.547439] Kernel panic - not syncing: System is deadlocked on memory

systemctl status is attached; nothing else is running.

I have a 224G vmcore dump but have no idea how to deal with it, and I think it is too big to upload anywhere. I'd appreciate any help in finding out what went wrong.
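From what I've read, the crash utility should be able to open the vmcore in place rather than my uploading it. A minimal sketch of what I plan to try, assuming the matching kernel debuginfo is installed (both paths below are illustrative, not my actual paths):

  # open the dump against a vmlinux with debug info for the crashed kernel
  crash /usr/lib/debug/lib/modules/6.1.119-1.el9.elrepo.x86_64/vmlinux \
        /var/crash/<dump-dir>/vmcore

  # inside the crash session, summarize where kernel memory went:
  crash> kmem -i     # overall memory usage breakdown
  crash> kmem -s     # per-slab-cache usage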
Created attachment 307284 [details] lsmod
Created attachment 307285 [details] /proc/meminfo
Created attachment 307286 [details] /proc/slabinfo
Created attachment 307287 [details] systemctl status
Created attachment 307288 [details] /proc/vmallocinfo
Created attachment 307289 [details] /proc/vmstat
Created attachment 307290 [details] oom dmesg from kdump
Created attachment 307330 [details]
dmesg of another 3 crashes

Since reporting I have hit another three crashes, all with nfsd on the stack.

First one:

[136965.765431] Out of memory and no killable processes...
[136965.765433] Kernel panic - not syncing: System is deadlocked on memory
[136965.766148] CPU: 2 PID: 1856 Comm: nfsd Kdump: loaded Tainted: G E 6.1.119-1.el9.elrepo.x86_64 #1
[136965.766852] Hardware name: Dell Inc. PowerEdge R740/0923K0, BIOS 2.22.2 09/12/2024
[136965.767546] Call Trace:
[136965.768230]  <TASK>
[136965.768903]  dump_stack_lvl+0x45/0x5e
[136965.769571]  panic+0x10c/0x2c2
[136965.770231]  out_of_memory.cold+0x2f/0x7e
[136965.770874]  __alloc_pages_slowpath.constprop.0+0x707/0x9d0
[136965.771518]  __alloc_pages+0x35d/0x370
[136965.772147]  __alloc_pages_bulk+0x3e5/0x680
[136965.772766]  svc_alloc_arg+0x81/0x1f0 [sunrpc]
[136965.773431]  svc_recv+0x1f/0x190 [sunrpc]
[136965.774089]  ? nfsd_inet6addr_event+0x110/0x110 [nfsd]
[136965.774726]  nfsd+0x87/0xc0 [nfsd]
[136965.775347]  kthread+0xe5/0x110
[136965.775926]  ? kthread_complete_and_exit+0x20/0x20
[136965.776499]  ret_from_fork+0x1f/0x30
[136965.777062]  </TASK>

Second:

[167723.787640] WARNING: CPU: 3 PID: 1872 at mm/slab_common.c:957 free_large_kmalloc+0x5a/0x80
[167723.787667] Modules linked in: <cut here>
[167723.787874] CPU: 3 PID: 1872 Comm: nfsd Kdump: loaded Not tainted 5.14.0-503.15.1.el9_5.x86_64 #1
[167723.787882] Hardware name: Dell Inc. PowerEdge R740/0923K0, BIOS 2.22.2 09/12/2024
[167723.787886] RIP: 0010:free_large_kmalloc+0x5a/0x80

Third:

[ 3883.748094] ------------[ cut here ]------------
[ 3883.748105] WARNING: CPU: 9 PID: 1886 at mm/slab_common.c:957 free_large_kmalloc+0x5a/0x80
[ 3883.748131] Modules linked in: <cut here>
[ 3883.748339] CPU: 9 PID: 1886 Comm: nfsd Kdump: loaded Not tainted 5.14.0-503.15.1.el9_5.x86_64 #1
[ 3883.748342] Hardware name: Dell Inc. PowerEdge R740/0923K0, BIOS 2.22.2 09/12/2024
[ 3883.748344] RIP: 0010:free_large_kmalloc+0x5a/0x80
Hi Chen -

After some review, these all appear to be Red Hat Enterprise kernels. Such kernels are extensively patched and maintained exclusively by Red Hat engineers. I kindly request that you report this issue to Red Hat first and have them troubleshoot it. If they find there is a needed upstream fix, do feel free to re-open this bug.

[I am a fan of the old ConnectX-3 cards, btw]
Hi Mr. Lever,

I *clearly* stated I was using 6.1.119, which is the latest longterm kernel (released 2024-11-22), built by the ELRepo Project as-is from the upstream tarball:

[136965.766148] CPU: 2 PID: 1856 Comm: nfsd Kdump: loaded Tainted: G E 6.1.119-1.el9.elrepo.x86_64 #1

I hit the problem on both the shipped RHEL kernels and the latest and next-to-latest LTS kernels, so the bug must still exist upstream. That's why I filed this bug here.

Anyway, I encountered another two crashes in the last two days, and the call stacks keep pointing at nfsd.
This is what comment 0 says:

> My RHEL9 server with only NFS service often OOMed after a day or two,
> with no userspace memory usage. So I switched to elrepo kernel-lts and
> still the problem persists.
> I'm now using 6.1.119-1.el9.elrepo.x86_64. The problem also occurred on
> (RHEL) 5.14.0-427.40.1.el9_4, (RHEL) 5.14.0-503.14.1.el9_5 and
> 6.1.115-1.el9.elrepo.x86_64.

You mentioned RHEL, and RHEL 9 in particular, several times here. I have no prior knowledge of "the ELRepo Project" -- never heard of it. By "uname" these all look like distro-built kernels to me.

> Anyway, I encountered another 2 crashes in the last two days and
> call stack insists nfsd caused it.

I'm not saying this isn't an NFSD bug. But it might not be a problem in recent kernels. If I'm reading your reports correctly, you have not tested with 6.12 or newer; 6.1.anything is based on a two-year-old code base.

Any fix we create for this issue must be applied to the upstream Linus kernel first. Indeed, a fix might already exist somewhere upstream. By upstream, I mean the "master" branch in this repo:

https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git

Therefore the first task is for you to confirm by testing whether this branch still has the issue, in which case we have to troubleshoot further; or does not, in which case you can bisect to find the upstream fix that needs to be backported to the LTS kernels.
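Concretely, that means building and booting the master branch on the affected server. A rough sketch, assuming an EL9-style build environment with the usual kernel build dependencies installed (the config source is an assumption on my part):

  git clone https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git
  cd linux
  cp /boot/config-$(uname -r) .config   # start from the running kernel's config
  make olddefconfig                     # accept defaults for new options
  make -j$(nproc)
  sudo make modules_install install     # then reboot into the new kernel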
Hi Mr. Lever,

> You mentioned RHEL, and RHEL 9 in particular, several times here.

Because I wanted to indicate that, apart from the kernel, everything else in the toolchain is the latest version from RHEL9. The ELRepo Project (https://elrepo.org/) is a group of people who take the latest kernel source and package it into RPMs for easy installation on EL-like releases (RHEL, Oracle Linux, Rocky, Alma, etc.).

> By upstream, I mean the "master" branch in this repo

OK. I've just installed the latest stable (aka 6.12.4) and will see whether it helps.
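For the record, this is roughly how the mainline kernel gets installed from ELRepo. A sketch only; the repository is assumed to be configured already and the kernel path is illustrative:

  dnf --enablerepo=elrepo-kernel install kernel-ml   # "ml" = mainline
  grubby --set-default /boot/vmlinuz-6.12.4-1.el9.elrepo.x86_64
  reboot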
From attachment 307290 [details]:

[29924.805968] oom-kill:constraint=CONSTRAINT_NONE,nodemask=(null),cpuset=/,mems_allowed=0-1,global_oom,task_memcg=/user.slice/user-0.slice/user@0.service/init.scope,task=(sd-pam),pid=4503,uid=0
[29924.805991] Out of memory: Killed process 4503 ((sd-pam)) total-vm:173972kB, anon-rss:0kB, file-rss:0kB, shmem-rss:0kB, UID:0 pgtables:96kB oom_score_adj:100
[29925.425864] nfsd invoked oom-killer: gfp_mask=0xcc0(GFP_KERNEL), order=0, oom_score_adj=0
[29925.425872] CPU: 0 PID: 1874 Comm: nfsd Kdump: loaded Tainted: G E 6.1.119-1.el9.elrepo.x86_64 #1
[29925.425875] Hardware name: Dell Inc. PowerEdge R740/0923K0, BIOS 2.22.2 09/12/2024
[29925.425877] Call Trace:
[29925.425880]  <TASK>
[29925.425885]  dump_stack_lvl+0x45/0x5e
[29925.425893]  dump_header+0x4a/0x213
[29925.425897]  oom_kill_process.cold+0xb/0x10
[29925.425901]  out_of_memory+0xed/0x2e0
[29925.425906]  __alloc_pages_slowpath.constprop.0+0x707/0x9d0
[29925.425916]  __alloc_pages+0x35d/0x370
[29925.425921]  __alloc_pages_bulk+0x3e5/0x680
[29925.425927]  svc_alloc_arg+0x81/0x1f0 [sunrpc]
[29925.425991]  svc_recv+0x1f/0x190 [sunrpc]
[29925.426043]  ? nfsd_inet6addr_event+0x110/0x110 [nfsd]
[29925.426080]  nfsd+0x87/0xc0 [nfsd]
[29925.426113]  kthread+0xe5/0x110
[29925.426118]  ? kthread_complete_and_exit+0x20/0x20
[29925.426122]  ret_from_fork+0x1f/0x30
[29925.426129]  </TASK>

NFSD is targeted by the OOM killer because it frequently allocates up to 256 pages at a time to fill the send and receive buffers. It is not necessarily the source of a leak.

The bulk page allocator is on the slow path here, suggesting there weren't any free pages available on the lists it normally checks first. So it is doing one-at-a-time order-0 allocations, a sign that memory is short.

We see that Node 1 appears to be short on free memory, but the system has not pushed into swap at all. Kernel memory isn't swappable, so whatever is leaking is in the kernel proper. The slab caches all look reasonably sized, so a slab leak is unlikely.

At this point we want someone with some MM expertise to come in and help us nail down the leak.
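In the meantime, one way to chase a kernel-side leak is kmemleak. A minimal sketch, assuming a test kernel built with CONFIG_DEBUG_KMEMLEAK (and CONFIG_PAGE_OWNER for raw page allocations) and debugfs mounted; the boot parameters are standard but untested here:

  # boot the test kernel with:  kmemleak=on page_owner=on
  # then, once memory has visibly grown:

  echo scan > /sys/kernel/debug/kmemleak   # trigger a leak scan
  cat /sys/kernel/debug/kmemleak           # dump suspected kmalloc/vmalloc leaks

  # kmemleak does not track raw page allocations (e.g. __alloc_pages_bulk),
  # so for those, dump the page_owner records instead:
  cat /sys/kernel/debug/page_owner > page_owner_dump.txt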
Chuck Lever <chuck.lever@oracle.com> replies to comment #13:

Hi -

An NFSD page allocation on v6.1.y is triggering OOM-killer. The reporter has provided a lot of detail, and we need some help steering us towards the possible leak culprit. Any takers?

(We've asked the reporter to reproduce on a more recent kernel if possible).

-------- Forwarded Message --------
Subject: Re: Possible memory leak on nfsd
Date: Thu, 12 Dec 2024 16:00:17 +0000
From: Chuck Lever via Bugspray Bot <bugbot@kernel.org>
To: jlayton@kernel.org, linux-nfs@vger.kernel.org, trondmy@kernel.org, cel@kernel.org, anna@kernel.org

Chuck Lever writes via Kernel.org Bugzilla:

[... forwarded body repeats the analysis from comment #13 verbatim; snipped ...]

View: https://bugzilla.kernel.org/show_bug.cgi?id=219535#c13

(via https://msgid.link/45ccbf00-3bd8-444e-8765-637caac5f738@oracle.com)
Sorry for my rudeness in my previous discussion.

After switching to 6.12.4, the server stayed stable for 30 days. So whatever caused the memleak should have been fixed between 6.1.119 and 6.12. You might want to close this bug if a backport is not worthwhile.
Chuck Lever <chuck.lever@oracle.com> replies to comment #15:

On 1/10/25 11:50 AM, Chen Chen via Bugspray Bot wrote:
> Chen Chen writes via Kernel.org Bugzilla:
>
> Sorry for my rudeness in my previous discussion.
>
> After switching to 6.12.4, the server stayed stable for 30 days.

That's good news!

> So whatever caused the memleak should have been fixed between 6.1.119
> and 6.12.

That's tens of thousands of commits over two years. Unfortunately that doesn't really tell us what the problem is.

> You might want to close this bug if a backport is not worthwhile.

We need to know the exact commit that contains the fix before it can be determined whether a backport is feasible.

Are you able to bisect between v6.1 and v6.12? If not, do you have a simple, narrow reproducer that we can use to explore this ourselves?

(via https://msgid.link/463b4513-5c5a-4394-8a93-58c0feef3149@oracle.com)
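For reference, a sketch of how that bisection could run. The terms are inverted because we are hunting a fix rather than a regression; the long reproduction time (a day or two per step) is the hard part:

  git bisect start --term-new=fixed --term-old=broken
  git bisect fixed v6.12      # kernel that does not leak
  git bisect broken v6.1      # kernel that leaks

  # for each kernel git checks out: build, boot, run the NFS workload
  # long enough to be confident, then mark the result:
  git bisect fixed            # or: git bisect broken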
Created attachment 307525 [details]
Logs and traces from Jan-18 pt1

Here are the traces from two NFS crashes that occurred this past weekend. Both occurred in the AM (US time) on Jan 18, a few hours apart. I followed the instructions I found on the various threads. There was no output from `rpcdebug -m rpc -c`; I'm not sure what I did wrong there. The syslog ought to contain the output of sysrq-trigger, however. The trace-cmd output captures several days' worth of events in either case, but not from system boot. I have cut the syslogs from ~one hour before each incident until the machine finished shutting down prior to reboot, and removed the output of other services.

Both are VMs on GCE running the 6.1.119 kernel from Debian bookworm (6.1.0-28), with ~60Gi memory and 16 CPUs.
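For reference, the commands I believe were intended — `-c` clears the rpc debug flags rather than printing anything, which probably explains the silence I saw. This is my understanding of the tools, not gospel:

  # set (-s) rather than clear (-c) the debug flags; rpcdebug prints little
  # or nothing itself, the actual logging goes to dmesg/syslog
  rpcdebug -m rpc -s all
  rpcdebug -m nfsd -s all

  # record all sunrpc tracepoints until stopped
  trace-cmd record -e sunrpc

  # dump every task's state into the kernel log (needs sysrq enabled)
  echo t > /proc/sysrq-trigger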
Created attachment 307526 [details] Logs and traces from Jan-18 pt2 Part 2, see previous description
Comment on attachment 307525 [details] Logs and traces from Jan-18 pt1 This was submitted in error, apologies.
Comment on attachment 307526 [details] Logs and traces from Jan-18 pt2 Also submitted in error.