Bug 110501

Summary: kswapd uses 100% of cpu, no swap, no NUMA
Product: Memory Management Reporter: NermaN (nalorokk)
Component: Slab Allocator Assignee: Andrew Morton (akpm)
Status: NEW ---    
Severity: high CC: benjamingslade+kernel, cbillett, ckazanci, d, dieter.ferdinand, ego.cordatus, gim6626, howaboutsynergy, iam, jacky, jharbold, johan.radivoj, kappawingman+kernel, kenny.macdermid, me, newton.moura.junior, nico-bugzilla.kernel.org, noahb2713, pbs3141, reynhout, rezbit.hex, robincello, sanderboom, sebdeligny, seo.d, someuniquename, szg00000, t.clastres, VStoiakin, Wilhelm.Buchmueller, zan.loy
Priority: P1    
Hardware: Intel   
OS: Linux   
Kernel Version: 4.1+ Subsystem:
Regression: Yes Bisected commit-id:
Attachments: journalctl log
echo m > /proc/sysrq-trigger; echo t > /proc/sysrq-trigger; dmesg > dmesg.log
sysctl -a | grep vm > sysctl.txt
/proc/meminfo

Description NermaN 2016-01-07 16:21:58 UTC
Sometimes kswapd0 uses 100% of one core. This issue occurs on all 4.0+ kernels. Some time ago I also tested some 3.0+ kernels, and the problem was there as well.

drop_caches didn't help. 
Adding swap didn't work.

There is no pattern to when the problem appears. It can stay away for 1-2 weeks, or appear twice in a day. Most of the time it persists until I reboot the server. Rarely it cures itself without a reboot.


Perftop with i915 module enabled:

+   98,22%     0,90%  [kernel]             [k] kswapd
+   93,07%     0,32%  [kernel]             [k] shrink_zone
+   87,41%     3,71%  [kernel]             [k] shrink_slab
+   56,55%     1,11%  [i915]               [k] i915_gem_shrinker_scan
+   50,65%    46,24%  [i915]               [k] i915_gem_shrink
+   23,28%     4,32%  [kernel]             [k] super_cache_count
+   18,22%     2,33%  [kernel]             [k] list_lru_count_one
+   15,26%     0,00%  [kernel]             [k] ret_from_fork
+   15,26%     0,00%  [kernel]             [k] kthread
+   11,84%    11,83%  [kernel]             [k] _raw_spin_lock

  46,24%  [i915]            [k] i915_gem_shrink
  12,11%  [kernel]          [k] _raw_spin_lock
   4,28%  [kernel]          [k] super_cache_count
   3,74%  [i915]            [k] i915_vma_unbind
   3,67%  [kernel]          [k] shrink_slab
   3,20%  [i915]            [k] i915_gem_object_put_pages
   2,78%  [kernel]          [k] __list_lru_count_one.isra.0


This is with i915 blacklisted. It didn't change the bug's behavior in any way, only this output:

    +   97,30%     2,32%  [kernel]             [k] kswapd
    +   83,40%     0,65%  [kernel]             [k] shrink_zone
    +   69,79%     7,77%  [kernel]             [k] shrink_slab
    +   59,73%    10,66%  [kernel]             [k] super_cache_count
    +   46,97%     5,76%  [kernel]             [k] list_lru_count_one
    +   30,73%    30,73%  [kernel]             [k] _raw_spin_lock
    +   23,84%     0,00%  [kernel]             [k] ret_from_fork
    +   23,84%     0,00%  [kernel]             [k] kthread
    +    9,63%     7,23%  [kernel]             [k] __list_lru_count_one.isra.0
    +    6,18%     0,82%  [kernel]             [k] zone_balanced
    +    5,05%     2,28%  [kernel]             [k] shrink_lruvec

  30,63%  [kernel]             [k] _raw_spin_lock
  10,84%  [kernel]             [k] super_cache_count
   7,66%  [kernel]             [k] shrink_slab
   7,28%  [kernel]             [k] __list_lru_count_one.isra.0
   5,29%  [kernel]             [k] list_lru_count_one
   3,69%  [kernel]             [k] memcg_cache_id
   3,62%  [kernel]             [k] _raw_spin_unlock
   2,63%  [kernel]             [k] zone_watermark_ok_safe
   2,53%  [kernel]             [k] mem_cgroup_iter
   2,44%  [kernel]             [k] shrink_lruvec
   2,36%  [kernel]             [k] kswapd
   1,35%  [kernel]             [k] _raw_spin_lock
Comment 1 NermaN 2016-01-14 23:53:47 UTC
Very probably a regression; there seems to be no problem on kernel 3.14.58.
Comment 2 NermaN 2016-01-18 11:11:50 UTC
Problem appeared in kernel 4.1 and still exists.
Comment 3 reynhout 2016-01-22 18:33:27 UTC
A few observations:

* I can reproduce this at will on any 4.1 kernel with limited physical RAM (tested on a 2GB machine) and zram
* I cannot reproduce this with no swap enabled
* I cannot reproduce this on the same kernels when limiting RAM by kernel parameters (I've tried mem=1024M on a 2GB machine; collaborators have tested mem=2048M on 4GB machines)
* I cannot reproduce this on kernel 4.0.9 or earlier

* When the condition is triggered, kswapd jumps from 0% to 99+% CPU over a couple seconds
* kswapd will apparently never recover on its own, but will bounce around between 97-100% CPU for hours and overnight
* Relieving memory pressure does not help directly: I've frequently returned to >1GB freemem and nearly empty swap, while kswapd remained pegged
* Killing a specific process (or probably the last of a pair or a set, it's hard to tell) *does* help: kswapd drops to 0% CPU nearly instantly
* I do not know how to determine which process/set will be the "magic" one: it doesn't seem to be predictable by start order, and is rarely if ever the same process that triggered the condition

To reproduce:

Quickly allocate largish blocks of memory (100MB) in separate processes. On a 2GB machine with 2.8GB zram configured, the condition triggers reliably at ~27 100MB blocks allocated. I can then allocate an additional ~13-15 100MB blocks before mallocs fail.

It is sometimes possible to avoid the condition when allocating gradually, and it does not occur at all when the allocations are in a single process.

This test case was created to resemble the launch of a multiprocess web browser with several tabs open, which is where I first heard about the issue. Reproducing with the web browser was less predictable but still possible.
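The reproduction recipe above can be sketched as a small script (hypothetical; `alloc_blocks` and its defaults are made up for illustration, and python3 is assumed for the per-process allocation — this is not the reporter's actual tool):

```shell
#!/bin/sh
# Sketch of the reproducer: spawn COUNT background processes, each
# allocating and holding SIZE_MB megabytes of anonymous memory for
# HOLD_S seconds, mimicking a multiprocess browser launch.
alloc_blocks() {
    count=$1
    size_mb=$2
    hold_s=$3
    i=0
    while [ "$i" -lt "$count" ]; do
        # each child allocates and touches size_mb MB, then idles
        python3 -c "import time; b = bytearray($size_mb * 1024 * 1024); time.sleep($hold_s)" &
        i=$((i + 1))
    done
    wait  # block until every child has exited
}

# e.g. ~27 blocks of 100MB triggered the condition on the 2GB/zram machine:
# alloc_blocks 27 100 600
```

Per the observations above, the condition depends on enough blocks being live at the same time; gradual allocation or a single process does not trigger it.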
Comment 4 reynhout 2016-01-23 15:20:32 UTC
Update:

Patch suggested by Kirill Shutemov in http://lkml.iu.edu//hypermail/linux/kernel/1601.2/03564.html , applied to kernel 4.1.14, seems to resolve this issue.

I've run about a dozen tests so far, and have not been able to trigger the condition except possibly one time (the first time), whereas I can trigger it every time under the unpatched kernel.

When kswapd hit 100% this time, I allocated another chunk and kswapd immediately recovered, which would never happen under the unpatched kernel.

I didn't give kswapd any time to recover on its own (I was expecting the old behaviour), so I don't know if it would have done so without the additional allocation.

I have not seen any similar behaviour since that first test, but will be doing more testing today, and will update again if I do.
Comment 5 Tim Edwards 2016-02-21 07:27:10 UTC
I've found a workaround that works well for me so far: create a file /etc/sysctl.d/60-workaround-kswapd-allcpu.conf with the following contents and reboot:
vm.min_free_kbytes=67584

The idea behind this workaround is a post by Kirill A. Shutemov on LKML (http://lkml.iu.edu//hypermail/linux/kernel/1601.2/03564.html) and this Gallium OS bug report: https://github.com/GalliumOS/galliumos-distro/issues/52

Would be interesting to know if this helps others
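For completeness, the same setting can also be applied at runtime without a reboot (a sketch; requires root):

```shell
# Apply the workaround immediately:
sysctl -w vm.min_free_kbytes=67584

# Verify the live value:
sysctl vm.min_free_kbytes

# Re-load all sysctl.d files (picks up the new 60-workaround file):
sysctl --system
```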
Comment 6 jacky 2016-03-11 09:38:36 UTC
(In reply to Tim Edwards from comment #5)
> I've found a workaround that works well for me so far: create a file
> /etc/sysctl.d/60-workaround-kswapd-allcpu.conf with the following contents
> and reboot:
> vm.min_free_kbytes=67584
> 
> The idea behind this workaround is a post by Kirill A. Shutemov on LKML
> (http://lkml.iu.edu//hypermail/linux/kernel/1601.2/03564.html) and this
> Gallium OS bug report:
> https://github.com/GalliumOS/galliumos-distro/issues/52
> 
> Would be interesting to know if this helps others

No use for me.
I have an ASUS notebook with an Intel Celeron 847 and 2GB RAM.
I'm using Arch Linux with kernel 4.4.3, and when I rebuild the kernel with the makepkg command, kswapd0 uses 100% of one core after a few minutes.
Comment 7 jacky 2016-03-11 09:45:41 UTC
(In reply to jacky from comment #6)

When kswapd0 uses 100% of one core, the free command output:
              total        used        free      shared  buff/cache   available
Mem:        1866212       66036     1465132        1088      335044     1605068
Swap:       1952764           0     1952764

As you can see, there is enough memory for my system, and swap usage is 0%.
Comment 8 Benjamin 2016-06-02 14:44:32 UTC
I've been seeing problems like this since the 2.x kernels (6 years ago) when doing large amounts of buffered I/O writes, i.e. when writing hundreds of MB of data to database devices or to an NFS filesystem as part of a database backup. Where I work, we're currently on Linux 3.10 and we're still seeing it.

Various things we've read say it might be related to kernel memory fragmentation: you can have lots of regular memory free, but that doesn't help. It's related to balancing kernel memory across NUMA nodes, slabs, and zones.

In some cases, disabling NUMA might help. We also changed our databases to use direct I/O (i.e. non-buffered, which doesn't thrash Linux virtual memory doing useless buffering for a database server that already does its own caching). If possible, we try not to do large database backups to NFS.

And of course, we always get the answer, "Upgrade to the newest version of Linux. There's a patch there that sounds like it might fix it".

I wish kswapd had some clearer monitoring info to figure out why it's spinning.
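To illustrate the buffered-vs-direct distinction above, a quick comparison with dd (a sketch; the file path and sizes are arbitrary, and O_DIRECT is not supported on all filesystems, e.g. tmpfs):

```shell
#!/bin/sh
f=/var/tmp/odirect-demo.bin   # arbitrary scratch path

# Buffered write: goes through the page cache, contributing to the
# reclaim pressure kswapd has to deal with.
dd if=/dev/zero of="$f" bs=1M count=8 2>/dev/null

# Direct write: bypasses the page cache entirely (fails on filesystems
# without O_DIRECT support, such as tmpfs).
dd if=/dev/zero of="$f" bs=1M count=8 oflag=direct 2>/dev/null || \
    echo "O_DIRECT not supported on this filesystem"

rm -f "$f"
```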
Comment 9 Roman Evstifeev 2016-10-07 19:32:57 UTC
Created attachment 241121 [details]
journalctl log

I have been affected by this bug since kernel 2.x. Right now I hit it with kernel 4.7.5.
The behavior in this most recent kernel has changed a bit: now I can see in the journalctl logs that firefox invokes the oom-killer. In previous kernel versions there was nothing in the syslog. But despite this message, firefox did not get killed by the oom-killer, and as usual I had to switch to a root console session to run "killall firefox" (it takes minutes to do that, as the system is thrashing). Only after this is firefox killed and kswapd0 stops tearing the PC apart.

Attaching the relevant journalctl log.

Ignore the drm bugs in the log - it happens when i switch from X to console, or back.
Comment 10 Roman Evstifeev 2016-10-07 19:34:52 UTC
Oh, I forgot to mention that the bug happens regardless of whether swap is enabled or not.
Comment 11 Wilhelm Buchmüller 2017-01-06 04:52:00 UTC
Happens with no swap, no mount, only reading from SSD:
 -> https://bugzilla.kernel.org/show_bug.cgi?id=65201#c50
Comment 12 NoahB 2017-02-20 20:16:34 UTC
To me it seems like kswapd is getting stuck in an infinite loop. It may be some sort of buffer overflow, which explains why it is triggered by allocating multiple blocks of memory. Allocating gradually allows kswapd to empty its buffer before the next blocks come in, therefore not triggering the bug. After the buffer overflows (or worse, wraps around) the invalid data triggers an infinite loop that devours the CPU.
Comment 13 Roman Evstifeev 2019-01-01 08:46:36 UTC
Created attachment 280229 [details]
echo m > /proc/sysrq-trigger; echo t > /proc/sysrq-trigger; dmesg > dmesg.log

output of 
dmesg -n 7
dmesg -c
echo m > /proc/sysrq-trigger
echo t > /proc/sysrq-trigger
dmesg -s 1000000 > foo

when kswapd0 is starting to hang the system
Comment 14 Roman Evstifeev 2019-01-01 08:48:16 UTC
Created attachment 280231 [details]
sysctl -a | grep vm > sysctl.txt

my sysctl vm.*
Comment 15 Roman Evstifeev 2019-01-01 08:49:39 UTC
Reproduced on kernel 4.19.8
Swap: disabled
Reproducible: always
uname -a: Linux linux-fdsk 4.19.8-1-default #1 SMP PREEMPT Sun Dec 9 20:08:37 UTC 2018 (9cae63f) x86_64 x86_64 x86_64 GNU/Linux

Steps to reproduce:
Download my simple ramhog.py script: https://gist.github.com/Fak3/3c6bf52000651b00d9ab03e5b6bae677
Launch it, then press Enter until it hogs all the RAM.
The kswapd0 process will hang the system, taking all the CPU and all disk I/O forever (I waited a few hours):

> wget
> https://gist.githubusercontent.com/Fak3/3c6bf52000651b00d9ab03e5b6bae677/raw/a54c816498811041e5eb416fb25fcdbb86232fd2/ramhog.py
> python3 ramhog.py

Expected results: the system should not hang; the OOM killer should kill some process to free some RAM.

I managed to get some logs when kswapd was starting to go rogue;
see my attachments to this bug: sysctl, dmesg, meminfo
Comment 16 Roman Evstifeev 2019-01-01 08:51:01 UTC
Created attachment 280233 [details]
/proc/meminfo

/proc/meminfo when kswapd is starting to hang the system
Comment 17 Roman Evstifeev 2019-01-01 09:07:49 UTC
oh, some more details:
The script hang test above was done on the openSUSE Tumbleweed distro, on an SSD drive.
This bug has been haunting me for about 10 years, regardless of sysctl settings, swap, drive type, and kernel. All that is required is to open enough tabs in a web browser to eat all the RAM.
Comment 18 Roman Evstifeev 2019-01-01 12:47:58 UTC
oh, more info:
"echo 3 > /proc/sys/vm/drop_caches" does not solve this bug
Comment 19 Caner Kazanci 2020-10-07 17:44:11 UTC
This issue was driving me crazy after a recent Tumbleweed update. I spent hours reading bug reports and logs, trying different things to identify and fix the issue. Then I added a swap partition, and the issue was gone. What's interesting is that the swap partition is not even used, but its mere existence somehow prevents kswapd0 from persisting unnecessarily and using up resources. Hope this helps someone.
Comment 20 Caner Kazanci 2020-10-16 08:52:39 UTC
Update: Did not work. The issue is not kswapd. Shared memory fills up and is not emptied for some reason. Offending process needs to be identified.
Comment 21 Newton 2020-12-21 05:24:20 UTC
The problem persists in Debian Buster:
root@newton:~# uname -a
Linux newton 4.19.0-9-amd64 #1 SMP Debian 4.19.118-2+deb10u1 (2020-06-07) x86_64 GNU/Linux

From time to time I must run "killall -9 kswapd0" to get rid of it.

The interesting thing is that kswapd0 always runs under my daughter's account.
She almost never uses the computer.
Comment 22 Newton 2020-12-21 05:29:12 UTC
Additional information: in my case kswapd0 takes almost 2.0GB of RAM and pegs 3 of the six cores of my Phenom II X6 at 100%.
Comment 23 lesebas 2021-07-08 19:14:12 UTC
I confirm this problem is still unsolved; it happens to me on Arch Linux with kernel 5.12.
Comment 24 Caner Kazanci 2021-07-16 21:40:52 UTC
In my case, issue is not kswapd, it is RAM management. Somehow buffered RAM is not recycled.  If you do not have SWAP, kswapd steps in as available RAM diminishes. If you have SWAP, it gets used, but eventually gets filled up. 

I added more RAM (to 32GB). Eventually the same issue persisted, though it took longer. 

Now I installed Leap instead of Tumbleweed. Issue does not appear to be present, buffered RAM gets easily recycled. But I need more time to evaluate and be sure. 

This is a serious issue for desktops, workstations and servers that run 24/7.
Comment 25 Johan 2021-08-24 14:32:25 UTC
Can also confirm I am seeing the same issue on arch linux with kernel 5.13.7-arch1-1.

I've tried sysctl settings:
vm.swappiness=10
vm.min_free_kbytes=65536

killall -9 kswapd0 stops the process from doing CPU intensive operations, X and all my applications continue to function without crashes
Comment 26 ValdikSS 2021-08-29 20:15:44 UTC
(In reply to Johan from comment #25)
> Can also confirm I am seeing the same issue on arch linux with kernel
> 5.13.7-arch1-1.
> 
> I've tried sysctl settings:
> vm.swappiness=10
> vm.min_free_kbytes=65536
> 
> killall -9 kswapd0 stops the process from doing CPU intensive operations, X
> and all my applications continue to function without crashes

If `killall -9 kswapd0` really helps you, that means you have a virus process called "kswapd0", not a kernel thread. You won't be able to kill a kernel thread from userspace.
You'd better reinstall your system or at least try to clean the virus.
Comment 27 Johan 2021-08-30 09:32:24 UTC
(In reply to ValdikSS from comment #26)
> If `killall -9 kswapd0` really helps you, that means you have a virus
> process called "kswapd0", not a kernel thread. You won't be able to kill a
> kernel thread from userspace.
> You'd better reinstall your system or at least try to clean the virus.

That would explain a lot and be an easy solution to the problem. When doing killall and writing up above post, I was a bit puzzled since the process didn't actually disappear and was using the same pid.

I wanted to confirm your notion but the methods I identified to verify if the process is a kernel thread or not all indicate (to me) that it is a kernel thread process spawned under kthreadd. Is there any other way I can verify this claim?

Methods:
htop, shift-k, check for kthreadd and look if the kswapd0 process was inherited from kthreadd
`ps -ef | grep kswapd0` returns `root          82       2  0 08:20 ?        00:00:06 [kswapd0]` (brackets should indicate a kthread?)
`cat /proc/82/cmdline` yields an empty response
`readlink /proc/82/exe ; echo $?` returns `1`

```
stat /proc/82/exe
  File: /proc/82/exe
stat: cannot read symbolic link '/proc/82/exe': No such file or directory

  Size: 0               Blocks: 0          IO Block: 1024   symbolic link
Device: 15h/21d Inode: 4537028     Links: 1
Access: (0777/lrwxrwxrwx)  Uid: (    0/    root)   Gid: (    0/    root)
Access: 2021-08-30 11:22:48.614106587 +0200
Modify: 2021-08-30 11:16:07.510344491 +0200
Change: 2021-08-30 11:16:07.510344491 +0200
 Birth: -
```
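The checks above can be folded into one helper (a sketch; `is_kthread` is a made-up name). The heuristic matches the observations: kernel threads expose an empty /proc/PID/cmdline and an unreadable /proc/PID/exe link, while userspace processes have at least one of the two.

```shell
#!/bin/sh
# Heuristic: a kernel thread has an empty /proc/PID/cmdline and no
# readable /proc/PID/exe symlink; a userspace process does not.
is_kthread() {
    pid=$1
    [ -d "/proc/$pid" ] || { echo "no such pid" >&2; return 1; }
    if [ -z "$(tr -d '\0' < "/proc/$pid/cmdline" 2>/dev/null)" ] \
        && ! readlink "/proc/$pid/exe" >/dev/null 2>&1; then
        echo "kthread"
    else
        echo "userspace"
    fi
}

# e.g. is_kthread 82  -> "kthread" for a genuine [kswapd0]
# e.g. is_kthread $$  -> "userspace" for the current shell
```

A process merely named kswapd0 but started from a real binary would come back as "userspace", which is what the virus hypothesis predicts.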
Comment 28 Nico Schottelius 2021-09-27 06:51:52 UTC
Running Alpine Linux with plenty of free RAM I also see kswapd consuming 100% cpu and slowing down the system significantly. I tried to drop caches, which seems to have reduced the problem for a bit:


nb3:~# free -m
               total        used        free      shared  buff/cache   available
Mem:           15734        4570        1765        3334        9399        7525
Swap:              0           0           0
nb3:~# echo 3 > /proc/sys/vm/drop_caches
nb3:~# free -m
               total        used        free      shared  buff/cache   available
Mem:           15734        4647        4632        3620        6454        7422
Swap:              0           0           0
nb3:~# 

My kernel version is

Linux nb3 5.10.67-0-lts #1-Alpine SMP Mon, 20 Sep 2021 09:03:55 +0000 x86_64 GNU/Linux
Comment 29 d 2022-02-11 22:43:32 UTC
I got the same issue.

kswapd is running using all CPU. System freezes.

Here's my free -m

[root@admin ~]# free -mh
              total        used        free      shared  buff/cache   available
Mem:           755G        407G        192G         12G        155G        330G
Swap:          4,0G        881M        3,1G

Running 

sync; echo 1 > /proc/sys/vm/drop_caches

in crontab every 10 minutes solves it.
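The crontab entry implied above would look something like this (a sketch for root's crontab; note that periodically dropping caches is a blunt mitigation, not a fix, since it also discards useful cached data):

```shell
# root crontab (crontab -e): every 10 minutes, flush dirty pages and
# drop the page cache.
*/10 * * * * sync; echo 1 > /proc/sys/vm/drop_caches
```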
Comment 30 John E. Harbold 2022-05-17 04:40:47 UTC
I was able to kill kswapd0 with "sudo killall5 -9 kswapd0", but it caused my PuTTY session to exit. It looks like doing this caused a reboot for me. I could not find any references to a "kswapd0 virus". My thought is that when ValdikSS, in comment 26 above, talked about killing kswapd0, people were doing it from userspace, not as root.
Comment 31 pbs3141 2022-09-04 05:23:14 UTC
Similarly to Comment 9, I can trigger this bug 100% reliably by running a parallel C++ compile job that exceeds available RAM with no swap enabled. System becomes unresponsive, but OOM killer does not run. Logging into a virtual console is slow but eventually succeeds, and running top reports that kswapd is pegging all cores. Running killall -9 on the offending process usually resolves the problem. If left long enough, eventually OOM killer does run. (Arch Linux, 5.18.16-zen1-1-zen.)
Comment 32 ValdikSS 2022-09-04 05:27:53 UTC
(In reply to pbs3141 from comment #31)

MGLRU patchset would finally help with this issue.
https://www.phoronix.com/news/MGLRU-v14-Released

As an alternative, there's so-called le9 patch, a much simpler one.
https://github.com/hakavlad/le9-patch

It allowed me to run a lot of software with only 2GB RAM
https://notes.valdikss.org.ru/linux-for-old-pc-from-2007/en/
Comment 33 Johan 2022-09-05 07:18:24 UTC
(In reply to ValdikSS from comment #32)
> (In reply to pbs3141 from comment #31)
> 
> MGLRU patchset would finally help with this issue.
> https://www.phoronix.com/news/MGLRU-v14-Released
> 
> As an alternative, there's so-called le9 patch, a much simpler one.
> https://github.com/hakavlad/le9-patch
> 
> It allowed me to run a lot of software with only 2GB RAM
> https://notes.valdikss.org.ru/linux-for-old-pc-from-2007/en/

So it wasn't a virus after all? ;)
Comment 34 Benjamin 2022-09-06 17:17:07 UTC
Re: Comment 31 (I can trigger this bug reliably)

Is this on Linux running on Intel processors with Intel memory management?

What does the "numactl --hardware" command show for the host where you're running your test?

Can you tell if /proc/zoneinfo or /proc/buddyinfo show anything interesting when the kswapd process(es) start to use up CPU? (These files are very hard to read.)

/proc/buddyinfo on Intel systems may show no large free blocks available for Node 0, zone DMA32 when in a bad situation.

I use these commands to look at interesting zoneinfo changes (the long filter pattern is factored into a shell variable for readability):

PAT='^Node|^  pages free|^        min|^        low |^        high |^        active|^        inactive|^        scanned|^        present|^  all_unreclaimable'
egrep "$PAT" /proc/zoneinfo > /tmp/tmp1.dat
sleep 10
egrep "$PAT" /proc/zoneinfo > /tmp/tmp2.dat
diff -wy -W90 /tmp/tmp1.dat /tmp/tmp2.dat
Comment 35 Dieter Ferdinand 2023-08-06 15:35:46 UTC
Hello,
I have the same problem: kswapd uses a lot of CPU time (>90%) on a system that is used for backups.

I use rsync to copy the backup data onto a big RAID system (4 x 4 TB RAID 5), and every time the transfer rate is high and all available memory is used for buffers or cache, kswapd's CPU usage is very high.

If I drop the caches with "echo 1 > drop_caches", kswapd uses less CPU time and free memory goes up for a short time.

At the moment I drop the caches every 15 seconds, because kswapd reduces my transfer rate to 60-70% of the maximum, and I have a lot of data to copy (11 TB).

How can I get the oldest buffers/cache, which were used for writing, dropped automatically? I need that data only once.

No swap space is used on this system.

I see this problem on all kernels I use. The only system that possibly doesn't have the problem is my new server with 64GB RAM, but I don't copy enough data on that system for the kernel to fill all memory with cache in a short time.

goodbye
Comment 37 lesebas 2024-01-09 23:00:02 UTC
Hello, this bug has been open for 7 years now! And I confirm it's still happening on my computer. Is there any chance of getting it solved?
Comment 38 Caner Kazanci 2024-01-10 01:15:50 UTC
Well, a couple of years passed after my last update. I can confirm that the kswapd issue is resolved for me after installing more RAM and switching from Tumbleweed to Leap 15.4. 

I had previously messed with sysctl and swappiness settings, which did not help. Current OS is installed this summer, no changes to default settings, and I experience no issues. 

I don't know if this has anything to do with it, but my RAM-to-disk-size ratio was rather low: I have a total of 22 TB of disk space and previously had only 16GB of RAM; now I have 32GB. I allocated 32GB of swap, which is rarely used.

How much is your RAM and total HD size?
Comment 39 Roman Evstifeev 2024-01-10 04:26:17 UTC
The issue has been fixed for me since the introduction of MGLRU in kernel 6.1. I recall it needs to be enabled properly at boot time.
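For anyone wanting to verify their own setup: on kernels built with CONFIG_LRU_GEN, the MGLRU state is exposed under /sys/kernel/mm/lru_gen (a sketch; root is required to change it):

```shell
# Show the current state: 0x0000 = disabled, 0x0007 = all components enabled
cat /sys/kernel/mm/lru_gen/enabled

# Enable all MGLRU components at runtime:
echo y > /sys/kernel/mm/lru_gen/enabled
```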
Comment 40 lesebas 2024-01-23 12:41:54 UTC
Hello, I've enabled MGLRU but the bug is still present for me. However, the system now becomes slower but stays responsive. That allows me to switch to a tty and kill some processes to recover the system.
Comment 41 pbs3141 2024-02-22 07:44:50 UTC
(In reply to ValdikSS from comment #32)
> (In reply to pbs3141 from comment #31)
> MGLRU patchset would finally help with this issue.

I've noticed no difference with MGLRU enabled (Arch Linux, 6.7.4-zen1-1-zen), at least assuming that /sys/kernel/mm/lru_gen/enabled reporting 0x0007 indicates that MGLRU is enabled. Alt+SysRq+F has become my friend (runs the OOM killer manually, killing the memory-hogging process and unlocking the system again).
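A note for anyone trying the same Alt+SysRq+F trick: it only works if the SysRq mask permits process signalling (bit 64 in kernel.sysrq; a value of 1 enables all functions). A sketch:

```shell
# Check the current SysRq mask:
cat /proc/sys/kernel/sysrq

# Enable at least process signalling, which covers the manual OOM kill:
sysctl -w kernel.sysrq=64   # or kernel.sysrq=1 for all functions
```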

(In reply to Benjamin from comment #34)
> Re: Comment 31 (I can trigger this bug reliably)
> Is this on Linux running on Intel processors with Intel memory management?

Not in my case; this is on an AMD processor.

> What does this "numactl --hardware" command show for the host where you're
> running your test.

available: 1 nodes (0)
node 0 cpus: 0 1 2 3 4 5 6 7 8 9 10 11
node 0 size: 15327 MB
node 0 free: 3052 MB
node distances:
node   0 
  0:  10