Bug 200105 - High paging activity as soon as the swap is touched (with steps and code to reproduce it)
Summary: High paging activity as soon as the swap is touched (with steps and code to reproduce it)
Status: NEW
Alias: None
Product: IO/Storage
Classification: Unclassified
Component: LVM2/DM
Hardware: All Linux
Importance: P1 normal
Assignee: Andrew Morton
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2018-06-17 21:56 UTC by john terragon
Modified: 2019-12-14 10:21 UTC
CC List: 2 users

See Also:
Kernel Version: 4.14 4.15 4.16 4.17
Subsystem:
Regression: No
Bisected commit-id:


Attachments
attachment-18225-0.html (3.72 KB, text/html)
2018-06-21 21:38 UTC, john terragon
attachment-12219-0.html (2.70 KB, text/html)
2018-06-22 11:45 UTC, john terragon
read_vmstat.c (4.91 KB, text/x-csrc)
2018-06-22 12:25 UTC, Michal Hocko
out.txt (1.29 MB, text/plain)
2018-06-25 17:57 UTC, john terragon
out5.txt (1.23 MB, text/plain)
2018-06-27 10:38 UTC, john terragon
attachment-27873-0.html (3.38 KB, text/html)
2018-06-27 13:52 UTC, john terragon
attachment-3570-0.html (4.25 KB, text/html)
2018-06-29 20:24 UTC, john terragon
attachment-32664-0.html (1.29 KB, text/html)
2018-07-02 17:46 UTC, john terragon
attachment-12017-0.html (1.22 KB, text/html)
2018-07-05 20:35 UTC, john terragon
attachment-16893-0.html (2.26 KB, text/html)
2018-07-05 21:09 UTC, john terragon
4.17.0-rc7 VM config (197.39 KB, text/plain)
2018-07-06 01:39 UTC, Daniel Jordan
attachment-5996-0.html (2.59 KB, text/html)
2018-07-06 15:35 UTC, john terragon
out6.txt (750.81 KB, text/plain)
2018-07-06 20:05 UTC, john terragon
vmstat snapshot of the test without a GUI (750.81 KB, text/plain)
2018-07-06 20:09 UTC, john terragon
attachment-11371-0.html (1.26 KB, text/html)
2018-07-07 15:18 UTC, john terragon
vmstat snapshot of the test on FreeBSD, console-only (38.36 KB, text/plain)
2018-07-07 15:18 UTC, john terragon
attachment-11874-0.html (883 bytes, text/html)
2018-07-07 17:44 UTC, john terragon
read_vmstat output (797.53 KB, text/plain)
2018-07-19 20:03 UTC, john terragon
perf report (19.62 KB, text/plain)
2018-07-19 20:03 UTC, john terragon
start and finish times for the memeaters (272 bytes, text/plain)
2018-07-19 20:04 UTC, john terragon
report (19.62 KB, application/octet-stream)
2018-07-20 08:19 UTC, john terragon
out7.txt (797.53 KB, text/plain)
2018-07-20 08:19 UTC, john terragon
start_finish.txt (272 bytes, text/plain)
2018-07-20 08:19 UTC, john terragon

Description john terragon 2018-06-17 21:56:23 UTC
Hi.

Under any desktop environment, as soon as the swap is touched the kernel starts paging out gigabytes of memory for no apparent reason causing the system to freeze for at least a minute.


This happens with normal desktop usage, but I'm able to reproduce this behavior systematically by using the code included at the end of this report; let's call it "memeater". memeater takes three parameters: totalmem, chunksize, sleeptime.
It allocates (and initializes to 0) totalmem bytes of memory in chunks of chunksize bytes each. After each chunk allocation it sleeps for sleeptime seconds. If sleeptime=0, it does not sleep in between chunk allocations.
After totalmem bytes of memory have been allocated it sleeps indefinitely.

To reproduce the behaviour using memeater:

1) start a desktop environment (in my case KDE plasma).
2) invoke "memeater x y 0" to bring the system to the brink of swapping. I usually execute "memeater x y 0" multiple times where x is 4 gigs and y is 20 megs. (I've got 16 GB of memory, so I usually go up to 14-15 GB of used memory.)
3) invoke "memeater x y 1" one last time so that it slowly fills up the memory with small chunks of a few megs each (again, I tried it with x=4 gigs and y=20 megs).
 
When the last memeater fills the memory (and keeps allocating chunks) I would expect the mm system to swap out a few megabytes' worth of pages to accommodate the new requests slowly coming from the last memeater. 
However, what actually happens is that the mm starts to swap out gigabytes worth of pages completely freezing the system for one or two minutes.

I've verified this on various desktop systems, all using SSDs.
Obviously, I'm willing to provide more info and to test patches.


####### memeater #######################

#include <stdlib.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

/* usage: memeater <totalmem bytes> <chunksize bytes> <sleeptime seconds> */
int main(int argc, char** argv) {
    long int max = -1;
    int mb = 0;
    long int size = 0;
    long int total = 0;
    int sleep_time = 0;
    char* buffer;

    if (argc > 3)
      {
        max = atol(argv[1]);
        size = atol(argv[2]);
        sleep_time = atoi(argv[3]);
      }
    printf("Max: %ld bytes\n", max);
    /* allocate chunk after chunk, touching every byte so the pages are
       really faulted in, until totalmem bytes have been allocated */
    while ((buffer = malloc(size)) != NULL && total < max) {
        memset(buffer, 0, size);
        mb++;
        total = mb * size;
        printf("Allocated %ld bytes\n", total);
        if (sleep_time) sleep(sleep_time);
    }
    /* then hold on to the memory indefinitely */
    sleep(3000000);
    return 0;
}
Comment 1 Andrew Morton 2018-06-18 23:17:38 UTC
(switched to email.  Please respond via emailed reply-to-all, not via the
bugzilla web interface).

On Sun, 17 Jun 2018 21:56:23 +0000 bugzilla-daemon@bugzilla.kernel.org wrote:

> https://bugzilla.kernel.org/show_bug.cgi?id=200105
> 
>             Bug ID: 200105
>            Summary: High paging activity as soon as the swap is touched
>                     (with steps and code to reproduce it)
>            Product: Memory Management
>            Version: 2.5
>     Kernel Version: 4.14 4.15 4.16 4.17
>           Hardware: All
>                 OS: Linux
>               Tree: Mainline
>             Status: NEW
>           Severity: normal
>           Priority: P1
>          Component: Page Allocator
>           Assignee: akpm@linux-foundation.org
>           Reporter: terragonjohn@yahoo.com
>         Regression: No
> 
> Hi.
> 
> Under any desktop environment, as soon as the swap is touched the kernel
> starts
> paging out gigabytes of memory for no apparent reason causing the system to
> freeze for at least a minute.
> 
> 
> This happens with normal desktop usage but I'm able to reproduce this
> behavior
> systematically by using the code included at the end of this report, let's
> call
> it "memeater". memeater takes three parameters: totalmem, chunksize,
> sleeptime.
> It allocates (and initializes to 0) totalmem bytes of memory in chunks of
> chunksize bytes each. After each chunk allocation it sleeps for sleeptime
> seconds. If sleeptime=0, it does not sleep in between chunk allocations.
> After totalmem bytes of memory have been allocated it sleeps indefinitely.
> 
> To reproduce the behaviour using memeater:
> 
> 1) start a desktop environment (in my case KDE plasma).
> 2) invoke "memeater x y 0" to bring the system to the brink of swapping. I
> usually execute "memeater x y 0" multiple times where x is 4 gigs and y is 20
> megs.(I've got 16GB of memory so I usually go up to 14/15 GB of used memory).
> 3) invoke "memeater x y 1" one last time so that it slowly fills up the
> memory
> with small chunks of a few megs each (again, I tried it with x=4 gigs and
> y=20
> megs).
> 
> When the last memeater fills the memory (and keeps allocating chunks) I would
> expect the mm system to swap out few megabytes worth of pages to accomodate
> the
> new requests slowly coming from the last memeater. 
> However, what actually happens is that the mm starts to swap out gigabytes
> worth of pages completely freezing the system for one or two minutes.
> 
> I've verified this on various desktop systems, all using SSDs.
> Obviously, I'm willing to provide more info and to test patches.
> 
> 
> ####### memeater #######################à
> 
> #include <stdlib.h>
> #include <stdio.h>
> #include <string.h>
> #include <unistd.h>
> 
> 
> 
> int main(int argc, char** argv) {
>     long int max = -1;
>     int mb = 0;
>     long int size = 0;
>     long int total = 0;
>     int sleep_time = 0;
>     char* buffer;
> 
>     if(argc > 1)
>       {
>         max = atol(argv[1]);
>         size = atol(argv[2]);
>         sleep_time = atoi(argv[3]);
>       }
>     printf("Max: %lu bytes\n", max);
>     while((buffer=malloc(size)) != NULL && total < max) {
>         memset(buffer, 0, size);
>         mb++;
>         total=mb*size;
>         printf("Allocated %lu bytes\n", total);
>         if (sleep_time) sleep(sleep_time);
>     }      
> sleep(3000000);
> return 0;
> }
> 
> -- 
> You are receiving this mail because:
> You are the assignee for the bug.
Comment 2 Michal Hocko 2018-06-21 15:13:13 UTC
On Mon 18-06-18 16:17:35, Andrew Morton wrote:
[...]
> > I've verified this on various desktop systems, all using SSDs.
> > Obviously, I'm willing to provide more info and to test patches.

Could you snapshot /proc/vmstat during your test please? Once per second
or so should tell us more about what is going on.
Comment 3 john terragon 2018-06-21 21:38:40 UTC
Created attachment 276709 [details]
attachment-18225-0.html

   On Thursday, June 21, 2018, 5:13:14 PM GMT+2, <bugzilla-daemon@bugzilla.kernel.org> wrote:  
 
 https://bugzilla.kernel.org/show_bug.cgi?id=200105

--- Comment #2 from Michal Hocko (mhocko@kernel.org) ---
On Mon 18-06-18 16:17:35, Andrew Morton wrote:
[...]
> > I've verified this on various desktop systems, all using SSDs.
> > Obviously, I'm willing to provide more info and to test patches.

> Could you snapshot /proc/vmstat during your test please? Once per second
> or so should tell us more what is going on.
I'll do that. For now, you can find below the result of invoking vmstat 1 during the test (I did this some time ago hoping to shed some light on the problem). The output starts right before the swap is touched. As you can see, there is a sudden and 
huge jump in swap usage.


procs -----------memory---------- ---swap-- -----io---- -system-- ------cpu-----
r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa st
0  0   6144 160792      8 272868    0    0     0     0  281 1839  1  0 99  0  0
0  0   6144 177844      8 246096    0    0     0     0  425 2300  1  1 99  0  0
0  0   6144 168528      8 246112    0    0    16     0  293 1939  1  0 99  0  0
0  0   6144 158320      8 246116    0    0     0     0  261 1245  0  0 100  0  0
2  0  10752 161624      8 229024    0 4820 17148  4820  845 3656  1  2 97  0  0
2  0  10752 157300      8 228096    0    0 88348     0 2114 8902  0  5 94  1  0
0  0  10752 176108      8 200052    0    0 108312     0 2466 9772  1  5 91  3  0
0  0  10752 170040      8 196780    0    0 17380     0  507 1895  0  1 99  0  0
0 10  10752 160436      8 191244    0    0 346872    20 4184 17274  1  9 64 26  0
0 29 12033856 152888      8 116696 5992 15916880 1074132 15925816 819374 2473643  0 94  0  6  0
3 21 12031552 295644      8 136536 1188    0 11348     0 1362 3913  0  1 10 89  0
0 11 12030528 394072      8 151000 2016    0 17304     0  907 2867  0  1 13 86  0
0 11 12030016 485252      8 158528  708    0  7472     0  566 1680  0  1 23 77  0
0 11 12029248 605820      8 159608  900    0  2024     0  371 1289  0  0 31 69  0
0 11 12028992 725344      8 160472 1076    0  1204     0  387 1381  0  1 33 66  0
0 12 12028480 842276      8 162056  724    0  3112     0  357 1142  0  1 38 61  0
0 13 12027968 937828      8 162652  776    0  1312     0  363 1191  0  1 31 68  0
0  9 12027456 1085672      8 163260  656    0  1520     0  439 1497  0  0 30 69  0
0 10 12027200 1207624      8 163684  728    0   992     0  411 1268  0  0 42 58  0
0  9 12026688 1331492      8 164740  600    0  1732     0  392 1203  0  0 36 64  0
0  9 12026432 1458312      8 166020  628    0  1644     0  366 1176  0  0 33 66  0
Comment 4 john terragon 2018-06-22 11:45:39 UTC
Created attachment 276723 [details]
attachment-12219-0.html

    On Thursday, June 21, 2018, 5:13:14 PM GMT+2, bugzilla-daemon@bugzilla.kernel.org <bugzilla-daemon@bugzilla.kernel.org> wrote:  
 
 https://bugzilla.kernel.org/show_bug.cgi?id=200105

--- Comment #2 from Michal Hocko (mhocko@kernel.org) ---
On Mon 18-06-18 16:17:35, Andrew Morton wrote:
[...]
> > I've verified this on various desktop systems, all using SSDs.
> > Obviously, I'm willing to provide more info and to test patches.

> Could you snapshot /proc/vmstat during your test please? Once per second
> or so should tell us more what is going on.
I have a doubt about how to effectively snapshot /proc/vmstat. As you can see from the output of vmstat 1 in my previous comment, there is a huge jump in swap usage. I suspect that vmstat just got frozen like everything else during the test, and therefore it did not correctly report what was happening during the big swapping freeze, but rather the situation right before the freeze and the situation right after (when it got unfrozen). So, if I snapshot /proc/vmstat by just doing something like this:
watch -n 1 "cat /proc/vmstat >>/tmp/out ; echo -------- >>/tmp/out"
I fear I'm going to get the same behaviour: we'll know /proc/vmstat before and after the swapping-induced freeze but not what's happening _during_ it.
So, did you have something more sophisticated in mind for snapshotting /proc/vmstat?
Comment 5 Michal Hocko 2018-06-22 12:25:24 UTC
Created attachment 276725 [details]
read_vmstat.c

On Fri 22-06-18 11:45:33, john terragon wrote:
>  
> 
>     On Thursday, June 21, 2018, 5:13:14 PM GMT+2,
>     bugzilla-daemon@bugzilla.kernel.org <bugzilla-daemon@bugzilla.kernel.org>
>     wrote:  
>  
>  https://bugzilla.kernel.org/show_bug.cgi?id=200105
> 
> --- Comment #2 from Michal Hocko (mhocko@kernel.org) ---
> On Mon 18-06-18 16:17:35, Andrew Morton wrote:
> [...]
> > > I've verified this on various desktop systems, all using SSDs.
> > > Obviously, I'm willing to provide more info and to test patches.
> 
> > Could you snapshot /proc/vmstat during your test please? Once per second
> > or so should tell us more what is going on.
> I've a doubt about how to effectively snapshot /proc/vmstat. As you can see
> from the output of vmstat 1 in my previous comment, there is a huge jump in
> swap usage. I suspect that vmstat just got frozen like everything else during
> the test. And therefore it has not correctly reported what was happening
> during the big swapping freeze, but rather the situation right before the
> freeze and the situation right after (when it got unfrozen).So, if I snapshot
> /proc/vmstat by just doing something like this:
> watch -n 1 "cat /proc/vmstat >>/tmp/out ; echo -------->>/tmp/out"
> I fear I'm going to get the same behaviour: we'll know /proc/vmstat before
> and after the swapping-induced freeze but not what's happening _during_ it.
> So, did you have something more sophisticated in mind on how to snapshot
> /proc/vmstat?

I usually use the attached little program which tries to make sure that
it doesn't get reclaimed during heavy memory pressure. Could you give it
a try?
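
For readers without the attachment, a minimal sketch of that kind of snapshotter looks roughly like this (this is not the attached read_vmstat.c itself, just an illustration; the "=S <epoch seconds>" header per snapshot is an assumption based on the patterns the awk commands later in the thread match against). The important part is mlockall(), which keeps the tool's own pages from being reclaimed or swapped out while the system is thrashing.

#include <stdio.h>
#include <sys/mman.h>
#include <time.h>
#include <unistd.h>

int main(void)
{
    static char buf[1 << 16];

    /* lock current and future mappings so this tool cannot be reclaimed
       or swapped out while the system is under heavy memory pressure */
    if (mlockall(MCL_CURRENT | MCL_FUTURE))
        perror("mlockall");

    for (;;) {
        FILE *f = fopen("/proc/vmstat", "r");
        size_t n;

        if (!f) {
            perror("/proc/vmstat");
            return 1;
        }
        /* one "=S <epoch seconds>" header per snapshot; the awk commands
           below key off this line to pick up the timestamp */
        printf("=S %ld\n", (long)time(NULL));
        while ((n = fread(buf, 1, sizeof(buf), f)) > 0)
            fwrite(buf, 1, n, stdout);
        fclose(f);
        fflush(stdout);
        sleep(1);
    }
}

Redirecting its standard output to a file (something like ./read_vmstat > /tmp/out.txt) then yields one timestamped copy of /proc/vmstat per second.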
Comment 6 john terragon 2018-06-25 17:57:48 UTC
Created attachment 276833 [details]
out.txt

    On Friday, June 22, 2018, 2:26:31 PM GMT+2, <bugzilla-daemon@bugzilla.kernel.org> wrote:  
> I usually use the attached little program which tries to make sure that
> it doesn't get reclaimed during heavy memory pressure. Could you give it
> a try?
I used your program during the test (which, as always, caused the usual freeze) and I've attached 
the output to this email.
Comment 7 Michal Hocko 2018-06-26 12:34:13 UTC
On Mon 25-06-18 17:57:37, john terragon wrote:
>  
> 
>     On Friday, June 22, 2018, 2:26:31 PM GMT+2,
>     <bugzilla-daemon@bugzilla.kernel.org> wrote:  
> > I usually use the attached little program which tries to make sure that
> > it doesn't get reclaimed during heavy memory pressure. Could you give it
> > a try?
> I used your program during the test (which, as always, caused the usual
> freeze) and I've attached 
> the output to this email.

Thanks. It seems that the output is complete. Good. Now let's see
what is going on reclaim-wise. You were complaining about stalls and
excessive reclaim (swapout AFAIR), so it is good to check the direct
reclaim activity first.

The awk below reports the timestamp (taken from each snapshot's "=S"
header row) of every entry that shows at least some increase over the
previously measured value:

$ awk '/=S/{ts=$2}/steal_direct/{if (prev) {diff=$2-prev}; prev=$2; if (diff > 50) printf "%d %d\n", ts, diff }' out.txt        
1529944049 3620
1529944053 19003
1529944054 55177
1529944055 55471
1529944056 44795
1529944057 26652
1529944058 41044
1529944059 62195
1529944060 61118
1529944061 41875
1529944062 64176
1529944063 51765
1529944064 53420
1529944065 62557
1529944066 39649
1529944067 61628
1529944068 29689
1529944069 12631
1529944070 43268
1529944071 65988
1529944072 55976
1529944073 47137
1529944336 1438
1529944342 103570

It is obvious that the heaviest direct reclaim activity started at
1529944053 and then continued at a pace of ~50k pages reclaimed per
second from the direct reclaim context. You get the background
reclaim activity by replacing steal_direct with steal_kswapd, and that
would tell us that kswapd didn't reclaim or scan all that much during
that time period:
ts	   scan_kswapd		steal_kswapd
1529944053 202878		61020
1529944054 207151		6066
1529944055 374415		3115
1529944056 217907		1871
1529944057 157264		1445
1529944058 103033		798
1529944059 242288		1055
1529944060 557561		777
1529944061 314082		907
1529944062 1169696		1597
1529944063 343617		240
1529944064 121597		694
1529944065 238726		863
1529944066 99702		1201
1529944067 169272		2002
1529944068 15838		1534
1529944069 2622			390
1529944070 34216		3053
1529944071 199088		6797
1529944072 140024		6141

So it seems that kswapd was able to scan a lot of pages while it
didn't manage to reclaim a large part of them. This is interesting
because the direct reclaim had basically 100% reclaim efficiency
(reclaimed/scanned) at that time. One notable difference between the two
is that kswapd is allowed to write dirty data (except for swap pages):

$ awk '/=S/{ts=$2}/vmscan_write\>/{if (prev) {diff=$2-prev}; prev=$2; if (diff > 50) printf "%d %d\n", ts, diff }' out.txt
1529943959 442
1529943962 245
1529943963 54
1529944018 380
1529944020 257
1529944047 2579
1529944049 25646
1529944053 20761
1529944054 150695
1529944055 237989
1529944056 197554
1529944057 120136
1529944058 171693
1529944059 264284
1529944060 256677
1529944061 175652
1529944062 276186
1529944063 225011
1529944064 210441
1529944065 261282
1529944066 168771
1529944067 264044
1529944068 134312
1529944069 46161
1529944070 192242
1529944071 301327
1529944072 259882
1529944073 5935

and this seems to be happening a lot in the said time period. We have to
check swap-outs to distinguish them from regular pagecache writeback, though:

$ awk '/=S/{ts=$2}/pswpout\>/{if (prev) {diff=$2-prev}; prev=$2; if (diff > 50) printf "%d %d\n", ts, diff }' out.txt
1529944053 20844
1529944054 150738
1529944055 237977
1529944056 197614
1529944057 120201
1529944058 171583
1529944059 264189
1529944060 256794
1529944061 175607
1529944062 276164
1529944063 225054
1529944064 210540
1529944065 261106
1529944066 168903
1529944067 263999
1529944068 134370
1529944069 46189
1529944070 192251
1529944071 301349
1529944072 259830

which suggests that almost all of it was swap-out. I do not know how
to tell which context did the actual reclaim; it is possible that kswapd
got throttled or something like that.

Anyway, do we have to swap out that much, though?

$ awk '/=S/{ts=$2} /nr_inactive_file/{file=$2} /nr_active_file/{file+=$2; printf "%d %d\n", ts, file}' out.txt
[...]
1529944053 2868
1529944054 552
1529944055 843
1529944056 867
1529944057 886
1529944058 1155
1529944059 1186
1529944060 1276
1529944061 1314
1529944062 1283
1529944063 1280
1529944064 872
1529944065 1060
1529944066 1075
1529944067 514
1529944068 417
1529944069 470
1529944070 516
1529944071 578
1529944072 506
[...]

so yes, we have to, because there is barely any page cache left.

From the above I do not see an obvious bug here. Kswapd seems to be
hitting a wall, but the swap-out seems to be unavoidable. How this
relates to the stalls is another question, and we would need tracepoint
data for that. Anyway, your workload seems to be overcommitting the
available memory here. Maybe we could behave much better in that case,
but then it would be great to get some tracing data when a real workload
hits this.
Comment 8 john terragon 2018-06-27 10:38:28 UTC
Created attachment 276905 [details]
out5.txt

    On Tuesday, June 26, 2018, 2:33:16 PM GMT+2, Michal Hocko <mhocko@kernel.org> wrote:  
 

> From the above I do not see an obvious bug here. Kswapd seems to be
> hitting the wall but the swap out seems to unavoidable. How does this
> related to stalls is another question. And we would need tracepoint
> data for that. Anyway your workload seems to be overcommitting on the
> available memory here. Maybe we could work much better in that case
> but then it would be great to get some tracing data when a real workload
> hits this.

I've redone the test, this time asking for 20 MB of space every 5 seconds (the other test was once every second), just to make things easier (?) for the vm (snapshot of vmstat attached). 
Besides the vmstat numbers, this is what happens:

- As soon as RAM is full (or close to full, with a few tens of MB left), a complete system freeze ensues.
- The freeze lasts for about 60 seconds.
- After the 60 seconds, the swap occupation has gone from about zero to 15 GB (!).
- All that was asked for during those 60 seconds was about 240 MB of space (20 MB chunks, once every 5 seconds).

Just to get this straight: you're telling me that this behavior is correct? And let's even forget about the system freeze. That is, writing 15 GB to swap in 60 seconds when all that was asked of the vm during those 60 seconds was a total of 240 MB of RAM. Is this the vm working as intended?
And let me just clarify that I didn't come up with that test just to troll the vm ;) . 
I use linux on desktop PCs, and it all started when, all of a sudden, these big freezes (even longer than the one in the "synthetic" workload) began to happen during normal use: one DE (KDE), one mailer, one editor, and some instances of firefox (they take the lion's share of the memory, but they do not start to allocate 1 GB every second; they seem to slowly get "fatter and fatter", so to speak).
Comment 9 Michal Hocko 2018-06-27 11:03:52 UTC
On Wed 27-06-18 10:38:19, john terragon wrote:
>  
> 
>     On Tuesday, June 26, 2018, 2:33:16 PM GMT+2, Michal Hocko
>     <mhocko@kernel.org> wrote:  
>  
> 
> > From the above I do not see an obvious bug here. Kswapd seems to be
> > hitting the wall but the swap out seems to unavoidable. How does this
> > related to stalls is another question. And we would need tracepoint
> > data for that. Anyway your workload seems to be overcommitting on the
> > available memory here. Maybe we could work much better in that case
> > but then it would be great to get some tracing data when a real workload
> > hits this.
> 
> I've redone the test, this time asking for 20 MB of space every 5
> seconds (the other test was once every second), just to make things
> easier (?) for the vm (snapshot of vmstat attached).  Besides the
> vmstat number, this is what happens:
> 
> -As soon as RAM is full (or close to full, like few tens of MB left)
> a complete system freeze ensues.-The freeze lasts for about 60
> seconds.-After the 60 seconds the swap occupation went from about zero
> to 15 GB (!).-All that was asked during the 60 seconds was about 240
> MB of space (20 MB chunks once every5 seconds).

All I can see in the vmstat data is that the heaviest direct reclaim happens
during the following 23s
1530092789 2939
1530092790 26530
1530092791 32758
1530092792 43734
1530092793 56487
1530092794 36676
1530092795 26592
1530092796 40690
1530092797 59096
1530092798 60331
1530092799 54677
1530092800 38491
1530092801 39346
1530092802 56641
1530092803 66019
1530092804 27940
1530092805 53971
1530092806 43717
1530092807 39238
1530092809 73689
1530092810 51959
1530092811 69046
1530092812 11387

We swap out a lot of memory during that time (this is in pages)
1530092790 42106
1530092791 104644
1530092792 153647
1530092793 215620
1530092794 158937
1530092795 115064
1530092796 174210
1530092797 256667
1530092798 255214
1530092799 228175
1530092800 157931
1530092801 164286
1530092802 255517
1530092803 274509
1530092804 121529
1530092805 222045
1530092806 194336
1530092807 172144
1530092809 313238
1530092810 214102
1530092811 163935
1530092812 4378

this is ~15G of memory in total. The amount of free memory during that time
period is
$ awk '/=S/{ts=$2}/nr_free_pages\>/{if (ts >= 1530092790 && ts <= 1530092812) printf "%d %d\n", ts, $2 }' out5.txt
1530092790 33365
1530092791 33447
1530092792 33234
1530092793 33343
1530092794 33376
1530092795 33376
1530092796 33280
1530092797 33463
1530092798 33414
1530092799 33535
1530092800 33715
1530092801 33394
1530092802 33436
1530092803 33436
1530092804 33435
1530092805 33293
1530092806 34028
1530092807 33300
1530092809 33329
1530092810 33398
1530092811 59889
1530092812 92072

which is roughly 130 MB and corresponds to the min watermark, so there
is permanent memory pressure even though we reclaim a lot of
memory (~400 MB/s); somebody must be allocating at roughly the same pace.

There is basically no pagecache during that time period
1530092790 916
1530092791 900
1530092792 1465
1530092793 641
1530092794 723
1530092795 798
1530092796 798
1530092797 798
1530092798 876
1530092799 1047
1530092800 1138
1530092801 1802
1530092802 1041
1530092803 1058
1530092804 1042
1530092805 1337
1530092806 819
1530092807 824
1530092809 1174
1530092810 1135
1530092811 7729
1530092812 12004

and all the memory is consumed by anonymous memory (in pages)
1530092790 3943370
1530092791 3910853
1530092792 3868239
1530092793 3816441
1530092794 3777916
1530092795 3750696
1530092796 3709097
1530092797 3648167
1530092798 3586738
1530092799 3531756
1530092800 3494133
1530092801 3455150
1530092802 3393068
1530092803 3327196
1530092804 3299318
1530092805 3244940
1530092806 3198904
1530092807 3158415
1530092809 3082721
1530092810 3030749
1530092811 2967568
1530092812 2956265

so roughly 15GB.

> Just to get this straight. You're telling me that this behavior is
> correct? And let's even forget about the system freeze. That is,
> writing 15GB in the swap in 60 seconds when all it was asked to the vm
> during these 60 seconds was a total of 240MB of ram. This is the vm
> working as intended?

Well, I have no idea what exactly you are doing here. All I know is that your
memory is full of anonymous memory and there is no way to free that
memory without swapping it out. Considering that the reclaim doesn't really
free up memory during this time window, it means that somebody is
allocating at roughly the same pace. And under such conditions, yes,
we do swap out a lot.
Comment 10 john terragon 2018-06-27 13:52:42 UTC
Created attachment 276923 [details]
attachment-27873-0.html

    On Wednesday, June 27, 2018, 1:03:50 PM GMT+2, Michal Hocko <mhocko@kernel.org> wrote:  
 
> Well, I have no idea what exactly you do here. All I know is that your
> memory is full of anonymous memory and there is no way to free that
> memory without swap out. Considering that the reclaim doesn't really
> frees up memory during this time window it means that somebody is
> allocating roughly at the same pace. And under such conditions, yes,
> we do swap out a lot.

I don't understand. Are you saying that in the vmstat numbers you are seeing an allocation rate that is higher than the roughly 4MB/s that memeater is causing? 


Because the test is very simple and the "actors" are:

a) normal system stuff + KDE stuff
b) one console window
c) some (4 or 5) instances of memeater used to bring memory usage close to full RAM quickly
d) the last memeater, which allocates at 4 MB/s

After the c) instances are done doing their part, they pause (and at that point no swapping is going on).
After that, d) is the only one allocating at any discernible rate. So, correct me if I'm wrong: in this setting, what should happen is roughly this:

1) we get close to full RAM.
2) the active memeater allocates another 20 MB.
3) the vm swaps out, let's say, 100 MB to avoid threshold effects. It has to do this within 5 seconds, before getting another request from the last memeater.

Isn't that roughly what should happen in this setting?
Comment 11 Michal Hocko 2018-06-28 19:03:30 UTC
On Thu 28-06-18 14:07:46, Michal Hocko wrote:
> On Wed 27-06-18 13:52:37, john terragon wrote:
> >  
> > 
> >     On Wednesday, June 27, 2018, 1:03:50 PM GMT+2, Michal Hocko
> <mhocko@kernel.org> wrote:  
> >  
> > > Well, I have no idea what exactly you do here. All I know is that your
> > > memory is full of anonymous memory and there is no way to free that
> > > memory without swap out. Considering that the reclaim doesn't really
> > > frees up memory during this time window it means that somebody is
> > > allocating roughly at the same pace. And under such conditions, yes,
> > > we do swap out a lot.
> > 
> > I don't understand. Are you saying that in the vmstat numbers you are
> > seeing an allocation rate that is higher than the roughly 4MB/s that
> > memeater is causing?
> 
> Well, I might have screwed up my awk-foo (that's why I've pasted it
> there, hint hint). But yes. If I see that there is a lot of reclaimed
> memory and the number of free pages doesn't grow then there is either a
> memory leak or the pages do not get back to the free pool in timely
> fashion or somebody is eating the reclaimed memory. I do not see
> any evidence of the former because the number of free pages grows
> eventually. Ruling out the second option would require some tracepoints
> to follow the IO path. I am currently quite busy but one way to check
> the last option is to look at pgalloc* counters and see whether they
> grow in a similar pace as the number of reclaimed pages.

So I found a minute and the situation looks as follows.
$ awk '/=S/{ts=$2} /pgalloc_normal/{diff=$2-prev;prev=$2; if (ts >= 1530092790 && ts <= 1530092812) printf "%d %d\n", ts, diff}' out5.txt

will tell you how many pages have been allocated from zone Normal
between two consecutive timestamps. To be absolutely precise you should also
consider the other zones, but I assume that this one is the biggest
consumer. If we combine that with the direct reclaim (again, to be precise,
we should consider both kswapd and direct reclaim) then we have the
following picture:
ts		steal_direct	pgalloc_normal	ratio
1530092790	26530        107146	4.04
1530092791	32758        58778      1.79
1530092792	43734        62185      1.42
1530092793	56487        59678      1.06
1530092794	36676        41816      1.14
1530092795	26592        29041      1.09
1530092796	40690        46552      1.14
1530092797	59096        77442      1.31
1530092798	60331        64751      1.07
1530092799	54677        63432      1.16
1530092800	38491        41643      1.08
1530092801	39346        44797      1.14
1530092802	56641        63045      1.11
1530092803	66019        74768      1.13
1530092804	27940        33814      1.21
1530092805	53971        59844      1.11
1530092806	43717        53088      1.21
1530092807	39238        45798      1.17
1530092809	73689        84833      1.15
1530092810	51959        61545      1.18
1530092811	69046        56313      0.82
1530092812	11387        6839       0.60

that gives a good idea that we are allocating at roughly the same pace as we reclaim.
Comment 12 Michal Hocko 2018-06-28 19:03:37 UTC
On Wed 27-06-18 13:52:37, john terragon wrote:
>  
> 
>     On Wednesday, June 27, 2018, 1:03:50 PM GMT+2, Michal Hocko
>     <mhocko@kernel.org> wrote:  
>  
> > Well, I have no idea what exactly you do here. All I know is that your
> > memory is full of anonymous memory and there is no way to free that
> > memory without swap out. Considering that the reclaim doesn't really
> > frees up memory during this time window it means that somebody is
> > allocating roughly at the same pace. And under such conditions, yes,
> > we do swap out a lot.
> 
> I don't understand. Are you saying that in the vmstat numbers you are
> seeing an allocation rate that is higher than the roughly 4MB/s that
> memeater is causing?

Well, I might have screwed up my awk-foo (that's why I've pasted it
there, hint hint). But yes. If I see that there is a lot of reclaimed
memory and the number of free pages doesn't grow then there is either a
memory leak or the pages do not get back to the free pool in timely
fashion or somebody is eating the reclaimed memory. I do not see
any evidence of the former because the number of free pages grows
eventually. Ruling out the second option would require some tracepoints
to follow the IO path. I am currently quite busy but one way to check
the last option is to look at pgalloc* counters and see whether they
grow in a similar pace as the number of reclaimed pages.
Comment 13 john terragon 2018-06-29 20:24:48 UTC
Created attachment 277063 [details]
attachment-3570-0.html

    On Thursday, June 28, 2018, 9:03:27 PM GMT+2, Michal Hocko <mhocko@kernel.org> wrote:  

>ts        steal_direct    pgalloc_normal    ratio
>1530092790    26530        107146    4.04
>1530092791    32758        58778      1.79
>1530092792    43734        62185      1.42
>1530092793    56487        59678      1.06
>1530092794    36676        41816      1.14
>1530092795    26592        29041      1.09
>1530092796    40690        46552      1.14
>1530092797    59096        77442      1.31
>1530092798    60331        64751      1.07
>1530092799    54677        63432      1.16
>1530092800    38491        41643      1.08
>1530092801    39346        44797      1.14
>1530092802    56641        63045      1.11
>1530092803    66019        74768      1.13
>1530092804    27940        33814      1.21
>1530092805    53971        59844      1.11
>1530092806    43717        53088      1.21
>1530092807    39238        45798      1.17
>1530092809    73689        84833      1.15
>1530092810    51959        61545      1.18
>1530092811    69046        56313      0.82
>1530092812    11387        6839      0.60

>that gives a good idea that we are allocating roughly in the same pace we
>reclaim.
Those numbers are pages, right? A page is 4K, right? So, for example, 59844*4096 ~= 240 MB. Hence the allocation pace is 240 MB/s?! But there's no process doing that. 
I'm sorry, I sound like a broken record: there is only one user process (the last memeater) that's allocating 
at a relatively high pace during the test, and that's 4 MB/s...
Is there a way to trace which process allocates how much, instead of just having 
the aggregate number?
Comment 14 Michal Hocko 2018-07-02 14:42:04 UTC
On Fri 29-06-18 20:24:39, john terragon wrote:
>  
> 
>     On Thursday, June 28, 2018, 9:03:27 PM GMT+2, Michal Hocko
>     <mhocko@kernel.org> wrote:  
> 
> >ts        steal_direct    pgalloc_normal    ratio
> >1530092790    26530        107146    4.04
> >1530092791    32758        58778      1.79
> >1530092792    43734        62185      1.42
> >1530092793    56487        59678      1.06
> >1530092794    36676        41816      1.14
> >1530092795    26592        29041      1.09
> >1530092796    40690        46552      1.14
> >1530092797    59096        77442      1.31
> >1530092798    60331        64751      1.07
> >1530092799    54677        63432      1.16
> >1530092800    38491        41643      1.08
> >1530092801    39346        44797      1.14
> >1530092802    56641        63045      1.11
> >1530092803    66019        74768      1.13
> >1530092804    27940        33814      1.21
> >1530092805    53971        59844      1.11
> >1530092806    43717        53088      1.21
> >1530092807    39238        45798      1.17
> >1530092809    73689        84833      1.15
> >1530092810    51959        61545      1.18
> >1530092811    69046        56313      0.82
> >1530092812    11387        6839      0.60
> 
> >that gives a good idea that we are allocating roughly in the same pace we
> reclaim.
>
> Those numbers are pages, right? A page is 4K, right?So, for example
> 59844*4096 ~= 240MB. Hence the allocating pace is 240 MB/s ?!But there's no
> process doing that. 
> I'm sorry, I sound like a broken record:there is only one user  process (the
> last memeater) that's allocating 
> at a relatively high pace during the test, and that's 4 MB/s....
> Is there a way to trace which process allocates how much, instead of just
> having 
> the aggregate number? 

You can try to use the mm_page_alloc tracepoint (events/kmem). One way to do
that is to use ftrace, or to mount tracefs, enable the tracepoint under
the above path and then read trace_pipe.
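
For illustration, a rough sketch of that approach (not a tool referenced in this thread; it assumes tracefs is mounted at /sys/kernel/tracing, while some systems have it under /sys/kernel/debug/tracing instead). Run as root, it enables the kmem:mm_page_alloc event and streams trace_pipe; the prefix of each trace line names the allocating task and pid, so the output can be aggregated per process afterwards.

#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

/* assumed tracefs mount point; adjust if it is /sys/kernel/debug/tracing */
#define TRACEFS "/sys/kernel/tracing"

static void write_str(const char *path, const char *val)
{
    int fd = open(path, O_WRONLY);

    if (fd < 0 || write(fd, val, strlen(val)) < 0)
        perror(path);
    if (fd >= 0)
        close(fd);
}

int main(void)
{
    char buf[4096];
    ssize_t n;
    int fd;

    /* turn on the kmem:mm_page_alloc tracepoint */
    write_str(TRACEFS "/events/kmem/mm_page_alloc/enable", "1");

    /* stream the events until interrupted; each line begins with
       "comm-pid", which identifies the allocating process */
    fd = open(TRACEFS "/trace_pipe", O_RDONLY);
    if (fd < 0) {
        perror(TRACEFS "/trace_pipe");
        return 1;
    }
    while ((n = read(fd, buf, sizeof(buf))) > 0)
        fwrite(buf, 1, n, stdout);
    return 0;
}

Stop it with ^C once the freeze is over; the event can be switched back off by writing 0 to the same enable file.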
Comment 15 john terragon 2018-07-02 17:46:42 UTC
Created attachment 277133 [details]
attachment-32664-0.html

    On Monday, July 2, 2018, 4:42:03 PM GMT+2, Michal Hocko <mhocko@kernel.org> wrote:  
> You can try to use mm_page_alloc tracepoint (events/kmem). One way to do
> that is to use ftrace or mount tracefs and enable the tracepoint under
> the above path and then read the trace_pipe.
In my bug report I submitted code and steps to reproduce the behaviour. 
Have you tried to reproduce it?
Comment 16 Michal Hocko 2018-07-02 18:02:13 UTC
On Mon 02-07-18 17:46:09, john terragon wrote:
>  
> 
>     On Monday, July 2, 2018, 4:42:03 PM GMT+2, Michal Hocko
>     <mhocko@kernel.org> wrote:  
> > You can try to use mm_page_alloc tracepoint (events/kmem). One way to do
> > that is to use ftrace or mount tracefs and enable the tracepoint under
> > the above path and then read the trace_pipe.
>
> In my bug report I submitted code and steps to reproduce the behaviour. 
> Have you tried to reproduce it? 

No, I haven't. I am sorry, but my time is quite scarce, so I focused on the
state of the system when you see the problem, and that data shows that
there is a continuous source of allocations; as such, the behavior is
expected. Maybe this is solely your test case, or maybe something else.
Comment 17 john terragon 2018-07-05 20:35:30 UTC
Created attachment 277203 [details]
attachment-12017-0.html

Maybe you can give me an explanation for the following:

1) The memeater processes (the first ones, which allocate close to 16 GB of memory without pausing in between requests, and the last one, which pauses for 5 seconds between requests and triggers the unexplained heavy swapping) allocate a grand total of about 16.5 GB of memory (I pause the last memeater right after the one-minute freeze and the unexplained 15 GB worth of swap-out).

2) At that point, as you also have seen from the vmstat numbers, both RAM and swap are about full (so about 32 GB total).

3) If, at that point, I kill all the memeater processes at once, then RAM and swap BOTH go to about zero (about three or four hundred megs each).

So, all the memory allocated by the memeater processes seems to be duplicated both in RAM and swap. 
Why is that?
Comment 18 Daniel Jordan 2018-07-05 20:49:24 UTC
(In reply to john terragon from comment #15)
> Created attachment 277133 [details]
> attachment-32664-0.html
> 
>     On Monday, July 2, 2018, 4:42:03 PM GMT+2, Michal Hocko
> <mhocko@kernel.org> wrote:  
> > You can try to use mm_page_alloc tracepoint (events/kmem). One way to do
> > that is to use ftrace or mount tracefs and enable the tracepoint under
> > the above path and then read the trace_pipe.
> In my bug report I submitted code and steps to reproduce the behaviour. 
> Have you tried to reproduce it?

I spun up a VM with 8 CPUs (i7 4790k), 8G memory and 4G SSD swap (IIRC).  I also installed GNOME desktop since you were also using a desktop environment.

Tried using memeater.c and the steps in your original post (#1) to reproduce on a 4.17 kernel, but no luck.  The system remained responsive throughout.

Same deal when I tried filling up memory by opening tens of browser tabs: no freezing.

I've seen something similar to your issue several times on my desktop at home, however, and so have some interest in understanding what's happening here more deeply.

I wonder what's different between your system and mine?  Anyway, let me catch up on the rest of the thread.
Comment 19 john terragon 2018-07-05 21:09:53 UTC
Created attachment 277205 [details]
attachment-16893-0.html

    On Thursday, July 5, 2018, 10:49:28 PM GMT+2, bugzilla-daemon@bugzilla.kernel.org <bugzilla-daemon@bugzilla.kernel.org> wrote: 
> I spun up a VM with 8 CPUs (i7 4790k), 8G memory and 4G SSD swap (IIRC).  I
> also installed GNOME desktop since you were also using a desktop environment.
> 
> Tried using memeater.c and the steps in your original post (#1) to reproduce
> on
> a 4.17 kernel, but no luck.  The system remained responsive throughout.
>
> Same deal when I tried filling up memory by opening tens of browser tabs: no
> freezing.
>
> I've seen something similar to your issue several times on my desktop at
> home,
> however, and so have some interest in understanding what's happening here
> more
> deeply.
>
> I wonder what's different between your system and mine?  Anyway, let me catch
> up on the rest of the thread.
Could you send me the .config of the kernel you used in the VM? I don't think my kernel config has anything strange, as it's mainly derived from the ones in the debian kernel image packages (but I also tried the test with a debian kernel image, with the same results). But maybe there's something.
Comment 20 Daniel Jordan 2018-07-06 01:39:41 UTC
Created attachment 277209 [details]
4.17.0-rc7 VM config
Comment 21 Daniel Jordan 2018-07-06 01:46:45 UTC
(In reply to john terragon from comment #19)
> Created attachment 277205 [details]
> attachment-16893-0.html
> 
>     On Thursday, July 5, 2018, 10:49:28 PM GMT+2,
> bugzilla-daemon@bugzilla.kernel.org <bugzilla-daemon@bugzilla.kernel.org>
> wrote: 
> > I spun up a VM with 8 CPUs (i7 4790k), 8G memory and 4G SSD swap (IIRC).  I
> > also installed GNOME desktop since you were also using a desktop
> environment.
> > 
> > Tried using memeater.c and the steps in your original post (#1) to
> reproduce
> > on
> > a 4.17 kernel, but no luck.  The system remained responsive throughout.
> >
> > Same deal when I tried filling up memory by opening tens of browser tabs:
> no
> > freezing.
> >
> > I've seen something similar to your issue several times on my desktop at
> > home,
> > however, and so have some interest in understanding what's happening here
> > more
> > deeply.
> >
> > I wonder what's different between your system and mine?  Anyway, let me
> catch
> > up on the rest of the thread.
> Could you send me the .config of the kernel you used in the VM? I don't
> think my kernel config has anything strange, as it's mainly derived from the
> ones in the debian kernel image packages (but I tried the test also with a
> debian kernel images with the same results).But maybe there's something.

The config I just posted is also derived from debian (the VM is running debian testing).

So I looked at out5.txt, and it contains 458 seconds' worth of samples.  Do you remember roughly the timing of when you started the memeaters and when you observed the freeze?  I'm looking for something like: the memeaters started around 1530092770 or so, and it started freezing around 1530092791.  Just trying to correlate the vmstats with what you were doing and seeing.
Comment 22 Daniel Jordan 2018-07-06 01:54:42 UTC
It would also be helpful to know about when the system unfroze.
Comment 23 john terragon 2018-07-06 15:35:21 UTC
Created attachment 277231 [details]
attachment-5996-0.html

    On Friday, July 6, 2018, 3:54:44 AM GMT+2, <bugzilla-daemon@bugzilla.kernel.org> wrote:  
> It would also be helpful to know about when the system unfroze
I can always redo the test and take note of those timings (since the test works like clockwork in triggering the behaviour), but as far as I can remember, I started Michal's program (read_vmstat) as soon as the first memeater started. All but the last memeater just allocate some GB in chunks of 20 MB each without pausing. I start them one after the other; they have the role of almost filling the RAM, and then they sleep. Their executions go smoothly (because they do not involve the swap). Then the last memeater is started and it allocates at a rate of 20 MB every 5 seconds (in the out5.txt test). It goes smoothly until the RAM is exhausted. Then the system freezes for about 60 seconds, and as soon as it unfreezes I have

1) RAM almost full (13/14 GB)

2) swap almost full too (15GB)

This despite the fact that a) the last memeater during the freeze asked for just 200/300 MB of space (about 20 MB every 5 seconds) and b) all the memeaters together asked for a total of 16GB.

Anyways, I believe that as soon as the system unfroze, I stopped read_vmstat. So, the start of the freeze should be about 60 seconds prior to the end of out5.txt.

And I see the freeze as just a symptom of what seems (to me) to be an unjustified swap-out: 15 GB against the 300 MB asked for during that 60-second period.

Maybe I should try doing the test without a GUI, just alt-tabbing and doing it from the console, and see if the big swap-out happens again.
Comment 24 john terragon 2018-07-06 20:05:11 UTC
Created attachment 277251 [details]
out6.txt

So, I repeated the test from console, without the GUI. The vmstat snapshot is attached. 


As I thought, there was no visible system freeze (the symptom), so the problem itself was visible "to the naked eye". Here's what happened:
1) read_vmstat was started right before starting the first memeater.

2) The RAM was then almost filled with unpausing memeaters. As always, no swapping.

3) Then the last memeater, allocating 20MB once every 2 seconds, started.

4) I was observing the output of watch -n 1 free on another console, and as soon as the RAM was filled, the kernel swapped out 15 GB of memory in a matter of, I'd say, 5 seconds.

The kernel apparently behaved like in the old days when there was no hardware support for paging and swapping literally meant "take the whole memory of a process and swap it out at once". The whole 15 GB of memory allocated by the other memeaters was swapped out in five seconds in response to less than 50 MB of requests from the last memeater.

This can't be normal.
Comment 25 john terragon 2018-07-06 20:09:32 UTC
Created attachment 277253 [details]
vmstat snapshot of the test without a GUI

It didn't get attached to my last email. Anyways, here it is.
Comment 26 john terragon 2018-07-07 15:18:06 UTC
Created attachment 277267 [details]
attachment-11371-0.html

 Since the console-only test with linux shed some light on the behavior of the vm, I installed freebsd 11.2 on the same machine and conducted the same test. 
Freebsd doesn't have /proc/vmstat so I've used vmstat 1 to get the vm info. You can find an attachment with the snapshot after this comment.


As you can see from the vmstat snapshot, freebsd handles the vm pressure put on by the last memeater as it should be handled: it swaps out a few megabytes at a time, every time the RAM gets close to full due to the 20 MB chunk requests from the last memeater (which in this case were made once every second, instead of once every two seconds... so even more pressure).


How can the linux kernel (or at least mine) perform so much worse than freebsd's?
Comment 27 john terragon 2018-07-07 15:18:18 UTC
Created attachment 277269 [details]
vmstat snapshot of the test on FreeBSD, console-only
Comment 28 john terragon 2018-07-07 17:44:12 UTC
Created attachment 277271 [details]
attachment-11874-0.html

It's the damned encrypted swap partition (normal dm-crypt luks)....

I just tried the test with a normal swap partition (it's the same ssd and the same partition, except that it is not encrypted anymore) and, lo and behold, it went through it just as smoothly as freebsd. This is the console-only test; I'll try the one with a GUI, but I'm sure that now, without the anomalous swap-out, there's not going to be any freeze.

So, am I missing something? Because I've never read anywhere that using dm-crypt with a swap partition would influence the vm to this degree.
Comment 29 john terragon 2018-07-09 14:38:05 UTC
Workarounds for the abysmal performance of dm-crypt with swap:

1) encrypt the swap partition with the old cryptoloop (or loop-AES, but it's out-of-tree)

2) convert the swap partition to ext4, create a swapfile on it and encrypt it with the 
VFS-level encryption of ext4

I did the test with both and they behave as smoothly as an unencrypted swap.
Comment 30 Daniel Jordan 2018-07-09 18:50:27 UTC
Thanks for rerunning those tests earlier.

(In reply to john terragon from comment #28)
> It's the damned encrypted swap partition (normal dm-crypt luks)....
> 
> I just tried the test with a normal swap partition (it's the same ssd and
> the same partition except that is not encrypted anymore) and, lo and behold,
> it went through it just a smoothly as freebsd. It's the console-only test
> and I'll try the one with a GUI but I'm sure that now, without the anomalous
> swapout there's not going to be any freeze.
> 
> So, am I missing something? Because I've never read anywere that using
> dm-crypt with a swap partition would influence the vm to this degree.

Still not seeing this behavior after adding an encrypted swap partition to my own setup:

debian-vm:~ λ cat /etc/crypttab   # after editing this file
# <target name>	<source device>		<key file>	<options>
swap	/dev/vda5	/dev/urandom	swap,cipher=aes-xts-plain64,size=256

Just throwing it out there for anyone who doesn't know: using raw device names that can change is dangerous for an encrypted swap partition; see https://wiki.archlinux.org/index.php/Dm-crypt/Swap_encryption for alternatives.

debian-vm:~ λ sudo mkswap /dev/mapper/swap 
mkswap: /dev/mapper/swap: warning: wiping old swap signature.
Setting up swapspace version 1, size = 8 GiB (8586784768 bytes)
no label, UUID=179abefe-c3c8-4db6-b249-2ade08af972d
debian-vm:~ λ sudo swapon /dev/mapper/swap 
debian-vm:~ λ lsblk
NAME     MAJ:MIN RM  SIZE RO TYPE  MOUNTPOINT
sr0       11:0    1 1024M  0 rom   
vda      254:0    0   46G  0 disk  
├─vda1   254:1    0 19.2G  0 part  /
├─vda2   254:2    0    1K  0 part  
├─vda4   254:4    0   16G  0 part  /scratch
├─vda5   254:5    0    8G  0 part  
│ └─swap 253:0    0    8G  0 crypt [SWAP]
└─vda6   254:6    0  2.8G  0 part  /boot

On my VM with 8G of memory, the system starts out with over 7G free.  Then

$ memeater $((7*2**30)) $((256*2**20)) 0

and most of the free memory is used.  Another few

$ memeater $((300*2**20)) $((20*2**20)) 2

just swap out a little more memory as expected each time, with the majority of the swap space remaining free.

So, can you explain your configuration, please?  Steps that you're using to initialize the encrypted swap partition, contents of /etc/crypttab and /etc/fstab, output of lsblk and swapon -s, that sort of thing.

Thanks.
Comment 31 john terragon 2018-07-09 20:31:42 UTC
So, the normal setup is a luks-encrypted partition. It does not have an entry in either crypttab or fstab (root obviously has) 
because it is opened and activated in a "late" ad-hoc script that does this:


/lib/cryptsetup/scripts/decrypt_derived cryroot | 
cryptsetup luksOpen UUID=d4b0a395-321a-41e7-87eb-61778959a95a cryswap --key-file -


swapon /dev/mapper/cryswap


The decrypt_derived script just gets the key from the root partition (also luks-encrypted). 
The disk layout is really simple:


/dev/sda1:   luks root    200 odd GB
/dev/sda2:   luks swap  16GB



Cipher and key length are the defaults used by cryptsetup, which nowadays should be aes-xts-plain64 and 256 bits.


As I wrote, it's an ssd (but I've tried with hdds). allow_discard is enabled for cryroot. I didn't normally have it enabled for cryswap,
but I've tried the test both ways (not that that was likely to change anything).


As I said this is the normal setup, but I've also tried (with the same results):


1) different ciphers with luks: serpent, twofish etc


2) swap partition with plain dm-crypt and, again, various ciphers.


3) swap file on ext4 FS on dm-crypt (on /dev/sda2, because cryroot uses btrfs, which does not support swapfiles).


You said that you witnessed a similar behaviour on your desktop system at home. Do you use a dm-crypt swap partition, or a swapfile on a dm-crypt volume? Just to eliminate all the strange possibilities, it would be great if you could do the test on a real machine, not a vm.
Comment 32 Daniel Jordan 2018-07-11 02:09:41 UTC
Ok, missed the part about luks.

I don't use any form of encrypted swap partition, until now that is.  :)

Sure, the virtualization could be changing things.  I'll try it on bare metal.
Comment 33 Daniel Jordan 2018-07-11 21:05:36 UTC
(In reply to john terragon from comment #31)
> Just to eliminate all the strange possibilities, it would be great
> if you could do the test on a real machine, not a vm.

Since I don't have LUKS setup on my existing system, today I installed another disk in my desktop and created a LUKS encrypted swap partition on it.

zissou:~ λ sudo cryptsetup luksFormat /dev/sdc1

WARNING!
========
This will overwrite data on /dev/sdc1 irrevocably.

Are you sure? (Type uppercase yes): YES
Enter passphrase for /dev/sdc1: 
Verify passphrase: 

zissou:~ λ lsblk
NAME   MAJ:MIN RM   SIZE RO TYPE MOUNTPOINT
sda      8:0    0 232.9G  0 disk 
├─sda1   8:1    0   550M  0 part /boot/efi
└─sda2   8:2    0 232.4G  0 part /
sdb      8:16   0 931.5G  0 disk 
└─sdb1   8:17   0 931.5G  0 part /hdd
sdc      8:32   0 931.5G  0 disk 
└─sdc1   8:33   0    16G  0 part                # destined for swap
zissou:~ λ sudo cryptsetup luksDump /dev/sdc1
LUKS header information for /dev/sdc1

Version:       	1
Cipher name:   	aes
Cipher mode:   	xts-plain64
Hash spec:     	sha256
Payload offset:	4096
MK bits:       	256
MK digest:     	7f e3 19 a6 9c 1d 5c 2a 72 36 36 6c 52 bc 36 b2 c9 81 c3 36 
MK salt:       	d1 55 16 3f 23 81 53 ee ac ad 05 e3 26 5b b8 22 
               	c1 4a 0e 87 18 75 38 16 42 8c b4 11 e0 c2 ff 31 
MK iterations: 	252061
UUID:          	5ee3c1b1-095c-4761-a349-93b357647d83

Key Slot 0: ENABLED
	Iterations:         	3934618
	Salt:               	6f 21 80 c8 32 04 20 ec 29 2e 09 dd 49 fb 80 c9 
	                      	11 cd 52 9b 80 16 3a 46 f3 b4 c6 1f 85 15 1c bb 
	Key material offset:	8
	AF stripes:            	4000
Key Slot 1: DISABLED
Key Slot 2: DISABLED
Key Slot 3: DISABLED
Key Slot 4: DISABLED
Key Slot 5: DISABLED
Key Slot 6: DISABLED
Key Slot 7: DISABLED
zissou:~ λ sudo cryptsetup luksOpen /dev/sdc1 swap
Enter passphrase for /dev/sdc1: 
zissou:~ λ lsblk
NAME     MAJ:MIN RM   SIZE RO TYPE  MOUNTPOINT
sda        8:0    0 232.9G  0 disk  
├─sda1     8:1    0   550M  0 part  /boot/efi
└─sda2     8:2    0 232.4G  0 part  /
sdb        8:16   0 931.5G  0 disk  
└─sdb1     8:17   0 931.5G  0 part  /hdd
sdc        8:32   0 931.5G  0 disk  
└─sdc1     8:33   0    16G  0 part  
  └─swap 254:0    0    16G  0 crypt 
zissou:~ λ sudo mkswap /dev/mapper/swap 
Setting up swapspace version 1, size = 16 GiB (17177767936 bytes)
no label, UUID=c2304cc4-667a-4ad9-ac37-6e3c4fda5c8a
zissou:~ λ sudo swapon /dev/mapper/swap 
zissou:~ λ lsblk
NAME     MAJ:MIN RM   SIZE RO TYPE  MOUNTPOINT
sda        8:0    0 232.9G  0 disk  
├─sda1     8:1    0   550M  0 part  /boot/efi
└─sda2     8:2    0 232.4G  0 part  /
sdb        8:16   0 931.5G  0 disk  
└─sdb1     8:17   0 931.5G  0 part  /hdd
sdc        8:32   0 931.5G  0 disk  
└─sdc1     8:33   0    16G  0 part  
  └─swap 254:0    0    16G  0 crypt [SWAP]

Still no luck reproducing after doing the same tests as in comment 31: the system used the expected amount of swap.  The disk is an HDD from circa 2006, though, so I could hear the thing loud and clear when it started swapping.

So, what Michal was doing in the first place, taking data from where the problem is actually happening, seems to make more sense.  I'll take another look at your test results and hopefully have some tracing commands to suggest when I'm done.
Comment 34 Daniel Jordan 2018-07-11 21:06:49 UTC
(In reply to Daniel Jordan from comment #33)
> Still no luck reproducing after doing the same tests as in comment 31:

Sorry, comment 30.
Comment 35 john terragon 2018-07-11 22:14:29 UTC
Thanks for trying, Daniel.
Comment 36 Daniel Jordan 2018-07-19 02:34:14 UTC
(In reply to Michal Hocko from comment #14)
> On Fri 29-06-18 20:24:39, john terragon wrote:
> >  
> > 
> >     On Thursday, June 28, 2018, 9:03:27 PM GMT+2, Michal Hocko
> >     <mhocko@kernel.org> wrote:  
> > 
> > >ts        steal_direct    pgalloc_normal    ratio
> > >1530092790    26530        107146    4.04
> > >1530092791    32758        58778      1.79
...snip...
> > >1530092809    73689        84833      1.15
> > >1530092810    51959        61545      1.18
> > >1530092811    69046        56313      0.82
> > >1530092812    11387        6839      0.60
> > 
> > > that gives a good idea that we are allocating roughly in the same pace we
> > > reclaim.
> >
> > Those numbers are pages, right? A page is 4K, right? So, for example
> > 59844*4096 ~= 240MB. Hence the allocating pace is 240 MB/s?! But there's no
> > process doing that.
> > I'm sorry, I sound like a broken record: there is only one user process (the
> > last memeater) that's allocating at a relatively high pace during the test,
> > and that's 4 MB/s...
> > Is there a way to trace which process allocates how much, instead of just
> > having the aggregate number?
> 
> You can try to use mm_page_alloc tracepoint (events/kmem). One way to do
> that is to use ftrace or mount tracefs and enable the tracepoint under
> the above path and then read the trace_pipe.

Yes, could you please try the tracepoint?  The commands below assume you have perf installed on your system, but as Michal says, there are other ways to get tracepoint data.

And please keep using read_vmstat as well, and try to take note of when each memeater starts and finishes allocating, and of when read_vmstat and perf are each started and stopped.  That will help in interpreting the data.
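Something as simple as this would do (just a sketch; the memeater arguments and the log file name here are placeholders):

log() { echo "$(date +%s) $*" >> timings.txt; }
log "memeater started"; ./memeater <totalmem> <chunksize> 0 &
log "read_vmstat started"    # then start read_vmstat as you normally do
log "perf record started"    # then start perf record as shown below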

During your test, right before you start the last memeater that starts the system swapping, you can run this as root in its own terminal:

perf record -a -e kmem:mm_page_alloc

and then ^C it once the system comes back from the freeze.  Aggregate the results by pid with this, again as root:

perf report -s pid --stdio

Hopefully this will shed some light on what's causing the extra 200+ MiB per second of page allocation aside from the memeater.
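For completeness, the ftrace/tracefs route Michal mentioned would look roughly like this, run as root (this assumes tracefs is mounted at /sys/kernel/tracing; on some systems it lives under /sys/kernel/debug/tracing instead):

echo 1 > /sys/kernel/tracing/events/kmem/mm_page_alloc/enable
cat /sys/kernel/tracing/trace_pipe > mm_page_alloc.trace
# ... run the test, then ^C the cat and disable the tracepoint:
echo 0 > /sys/kernel/tracing/events/kmem/mm_page_alloc/enable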
Comment 37 Daniel Jordan 2018-07-19 02:46:48 UTC
(In reply to Michal Hocko from comment #7)
> So it seems that the kswapd was able to scan a lot of pages while it
> didn't manage to reclaim a large part of them. This is interesting
> because the direct reclaim had basically 100% reclaim efficiency
> (reclaim/scanned) at that time.

So I've learned a lot reading Michal's analysis, which is really spot on.

I think this one part, though, may not be right.  For out.txt during the period of heavy direct reclaim, the efficiency (pgsteal_direct/pgscan_direct) hovers around 1%.  Excuse this ugly thing:

$ awk '/=S/ { ts=$2; if (ts >= 1529944053 && ts <= 1529944072) { show=1 } else { show=0 }} /pgsteal_direct / { if (steal_prev) {steal_diff=$2-steal_prev} steal_prev=$2 } /pgscan_direct / { if (scan_prev) {scan_diff=$2-scan_prev} scan_prev=$2; if (show) { printf "%d  %d  %d  %.2f\n", ts, steal_diff, scan_diff, steal_diff / scan_diff * 100}}' out.txt 

timestamp   pgsteal_direct   pgscan_direct   efficiency (%)
1529944053  19003  1377730  1.38
1529944054  55177  7491802  0.74
1529944055  55471  3932763  1.41
1529944056  44795  6491762  0.69
1529944057  26652  7809653  0.34
1529944058  41044  8317191  0.49
1529944059  62195  7411019  0.84
1529944060  61118  6906786  0.88
1529944061  41875  8259804  0.51
1529944062  64176  5992609  1.07
1529944063  51765  8664945  0.60
1529944064  53420  8441670  0.63
1529944065  62557  6858885  0.91
1529944066  39649  7736501  0.51
1529944067  61628  6697437  0.92
1529944068  29689  6731073  0.44
1529944069  12631  9350035  0.14
1529944070  43268  5487761  0.79
1529944071  65988  2651462  2.49
1529944072  55976  2298567  2.44

Not sure if such a low rate is unusual, though, not having dealt with these kinds of issues much.
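In case anyone wants to rerun it, here is the same filter written out more readably (identical logic and the same hard-coded time window; it assumes the read_vmstat output format in which each sample starts with an "=S <timestamp>" line):

awk '
/=S/ {
    ts = $2
    show = (ts >= 1529944053 && ts <= 1529944072)
}
/pgsteal_direct / {
    if (steal_prev) steal_diff = $2 - steal_prev
    steal_prev = $2
}
/pgscan_direct / {
    if (scan_prev) scan_diff = $2 - scan_prev
    scan_prev = $2
    if (show)
        printf "%d  %d  %d  %.2f\n", ts, steal_diff, scan_diff, steal_diff / scan_diff * 100
}' out.txt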
Comment 38 john terragon 2018-07-19 20:03:24 UTC
Created attachment 277415 [details]
read_vmstat output
Comment 39 john terragon 2018-07-19 20:03:58 UTC
Created attachment 277417 [details]
perf report
Comment 40 john terragon 2018-07-19 20:04:36 UTC
Created attachment 277419 [details]
start and finish times for the memeaters
Comment 41 john terragon 2018-07-19 20:11:28 UTC
Three new attachments above:

1) the read_vmstat output (started right before the first memeater was executed)

2) the perf report (as requested, perf was started right before executing the last memeater, the one that caused the swapping)

3) start and finish allocation times of the memeaters (except the last memeater, which has only the starting time because I Ctrl-Z'd it before it completed all its allocations).
Comment 42 Andrew Morton 2018-07-20 00:03:48 UTC
Sorry, but nobody reads bugzilla.  I tried to switch this discussion to an email thread for a reason!

Please resend all this (useful) info in reply to the email thread which I created for this purpose.
Comment 43 john terragon 2018-07-20 08:19:10 UTC
Created attachment 277431 [details]
report

    On Friday, July 20, 2018, 2:03:48 AM GMT+2, bugzilla-daemon@bugzilla.kernel.org <bugzilla-daemon@bugzilla.kernel.org> wrote:
>
> https://bugzilla.kernel.org/show_bug.cgi?id=200105
>
> --- Comment #42 from Andrew Morton (akpm@linux-foundation.org) ---
> Sorry, but nobody reads bugzilla.  I tried to switch this discussion to an
> email thread for a reason!
>
> Please resend all this (useful) info in reply to the email thread which I
> created for this purpose.
I'll resend the last message and attachments. Anyone interested in the previous "episodes" can read 
https://bugzilla.kernel.org/show_bug.cgi?id=200105

Three attachments:

1) the read_vmstat output (started right before the first memeater was executed)

2) the perf report (as requested, perf was started right before executing the last memeater, the one that caused the swapping)

3) start and finish allocation times of the memeaters (except the last memeater, which has only the starting time because I Ctrl-Z'd it before it completed all its allocations).
Comment 44 john terragon 2018-07-20 08:19:11 UTC
Created attachment 277433 [details]
out7.txt
Comment 45 john terragon 2018-07-20 08:19:11 UTC
Created attachment 277435 [details]
start_finish.txt
Comment 46 lkmldmj 2018-07-21 14:39:10 UTC
On 07/20/2018 04:19 AM, john terragon wrote:
> 
> ...snip...
> 
> I'll resend the last message and attachments. Anyone interested on the
> previous "episodes" go read
> https://bugzilla.kernel.org/show_bug.cgi?id=200105

The summary is that John has put together a reliable reproducer for a problem he's seeing where, under high memory usage, any of his desktop systems with SSDs hangs for around a minute, completely unresponsive, and swaps out 2-3x more memory than the system is allocating.

John's issue only happens when using a LUKS-encrypted swap partition; unencrypted swap, or swap encrypted without LUKS, works fine.

In one test (out5.txt) where most system memory is taken by anon pages beforehand, the heavy direct reclaim that Michal noticed lasts for 24 seconds, during which (if I've crunched my numbers right) John's test program was allocating at an average of 4 MiB/s, the system overall (pgalloc_normal) was allocating at 235 MiB/s, and the system was swapping out (pswpout) at 673 MiB/s.  pgalloc_normal and pswpout stay roughly the same each second, no big swings.
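(Those MiB/s figures are just the per-second page deltas from the vmstat counters multiplied by the 4 KiB page size: roughly 60,000 pgalloc_normal pages/s * 4096 bytes ~= 235 MiB/s, and roughly 172,000 pswpout pages/s * 4096 bytes ~= 673 MiB/s.)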

Is the disparity between allocation and swapout rate expected?

John ran perf during another test right before the last test program was started (this doesn't include the initial large allocation bringing the system close to swapping).  The top five allocators (kmem:mm_page_alloc):

# Overhead      Pid:Command
# ........  .......................
#
     48.45%     2005:memeater     # the test program
     32.08%       73:kswapd0
      3.16%     1957:perf_4.17
      1.41%     1748:watch
      1.16%     2043:free

So it seems to be just reclaim activity, but why so much when the test program only allocates at 4MiB/s?

John, adding -g to perf record would show call stacks.
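That is, something like this (the same aggregation as in comment 36, just with call-graph recording added):

perf record -a -g -e kmem:mm_page_alloc
perf report -s pid --stdio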


I'll be away for 2.5 weeks so won't be able to get back to this until then.
Comment 47 Daniel Jordan 2018-07-21 18:41:50 UTC
On 07/21/2018 10:39 AM, Daniel Jordan wrote:
> On 07/20/2018 04:19 AM, john terragon wrote:
>>
>> ...snip...
>>
>> I'll resend the last message and attachments. Anyone interested on the
>> previous "episodes" go read
>> https://bugzilla.kernel.org/show_bug.cgi?id=200105
> 
> The summary is that John has put together a reliable reproducer for a problem
> he's seeing where on high memory usage any of his desktop systems with SSDs
> hang for around a minute, completely unresponsive, and swaps out 2-3x more
> memory than the system is allocating.
> 
> John's issue only happens using a LUKS encrypted swap partition, unencrypted
> swap or swap encrypted without LUKS works fine.
> 
> In one test (out5.txt) where most system memory is taken by anon pages
> beforehand, the heavy direct reclaim that Michal noticed lasts for 24
> seconds, during which on average if I've crunched my numbers right, John's
> test program was allocating at 4MiB/s, the system overall (pgalloc_normal)
> was allocating at 235MiB/s, and the system was swapping out (pswpout) at
> 673MiB/s.  pgalloc_normal and pswpout stay roughly the same each second, no
> big swings.
> 
> Is the disparity between allocation and swapout rate expected?
> 
> John ran perf during another test right before the last test program was
> started (this doesn't include the initial large allocation bringing the
> system close to swapping).  The top five allocators (kmem:mm_page_alloc):
> 
> # Overhead      Pid:Command
> # ........  .......................
> #
>      48.45%     2005:memeater     # the test program
>      32.08%       73:kswapd0
>       3.16%     1957:perf_4.17
>       1.41%     1748:watch
>       1.16%     2043:free
> 
> So it seems to be just reclaim activity, but why so much when the test
> program only allocates at 4MiB/s?

Should add that during the 24 seconds, reclaim efficiency for both kswapd and direct (pgsteal/pgscan) hovered around 1%, which seems low.

The 24 seconds cover =S 1530092789 to =S 1530092812 in out5.txt from bugzilla.
Comment 48 john terragon 2018-07-21 20:09:48 UTC
I'll add a couple of points to Daniel's summary:
 

1) Initially I performed the test with the memeaters inside a GUI (KDE, but I tried GNOME too). The system freezes occur when the test is performed inside a GUI.

2) Then, to isolate what seemed to be more a symptom (the freezes) than the problem itself, I started performing the test using only text consoles. Under these conditions there are no big, visible system freezes, although one of the tools used to snapshot vmstat (Michal's read_vmstat) complains with messages like "it took 28s to snapshot!" or something like that, so maybe there are still slowdowns.

3) Anyway, both in GUI and text "mode", the test always causes huge swap activity, which seems disproportionate given the low pressure the VM is put under (I tried allocation rates as low as 4 MB/s).

4) These are the swap configurations in which the huge swap activity occurs:

- LUKS-encrypted swap partition
- dm-crypt plain swap partition
- unencrypted swapfile located inside an encrypted (LUKS or dm-crypt plain) filesystem

5) The test does not cause any huge swap activity under these swap configs *when performed without a GUI*:

a) unencrypted swap partition
b) encrypted swap file located inside an unencrypted ext4 filesystem (that is, ext4 directly on the partition, with the directory in which the swapfile resides encrypted with fs-level encryption, so no dm-crypt involved).

6) I was particularly happy about b) above because it seemed to be a viable workaround, but sadly I have to report that, although the test does not cause large swap activity with b), minute-long system freezes still occur during normal usage even with b). These seem to be the same kind of freezes that prompted me to find a way to reliably reproduce them under controlled conditions. I haven't yet performed the test with b)+GUI, though, only b)+text console, as I wrote in 5).
Comment 49 Michal Hocko 2018-07-23 13:02:40 UTC
[I am really sorry to be slow on responding]

On Sat 21-07-18 10:39:05, Daniel Jordan wrote:
> John's issue only happens using a LUKS encrypted swap partition,
> unencrypted swap or swap encrypted without LUKS works fine.
> 
> In one test (out5.txt) where most system memory is taken by anon pages
> beforehand, the heavy direct reclaim that Michal noticed lasts for 24
> seconds, during which on average if I've crunched my numbers right,
> John's test program was allocating at 4MiB/s, the system overall
> (pgalloc_normal) was allocating at 235MiB/s, and the system was
> swapping out (pswpout) at 673MiB/s. pgalloc_normal and pswpout stay
> roughly the same each second, no big swings.
>
> Is the disparity between allocation and swapout rate expected?
> 
> John ran perf during another test right before the last test program
> was started (this doesn't include the initial large allocation
> bringing the system close to swapping).  The top five allocators
> (kmem:mm_page_alloc):
> 
> # Overhead      Pid:Command
> # ........  .......................
> #
>     48.45%     2005:memeater     # the test program
>     32.08%       73:kswapd0
>      3.16%     1957:perf_4.17
>      1.41%     1748:watch
>      1.16%     2043:free

Huh, kswapd allocating memory sounds really wrong here. Is it possible
that the swap device driver is double buffering and allocating a new
page for each one to swap out?
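For illustration only, here is the kind of pattern being asked about, written as userspace-style C (this is NOT dm-crypt source; the names and structure are invented): a crypto layer that cannot transform data in place allocates a second "bounce" buffer for every outgoing page, so swapping out N pages costs roughly N extra page allocations, right when memory is already tight.

#include <stdlib.h>
#include <string.h>

#define PAGE_SIZE 4096

/* hypothetical: encrypt one swapped-out page into a freshly allocated bounce page */
static void *encrypt_to_bounce_page(const void *plaintext)
{
    void *bounce = malloc(PAGE_SIZE);     /* the extra allocation per swapped-out page */
    if (!bounce)
        return NULL;
    memcpy(bounce, plaintext, PAGE_SIZE); /* real code would encrypt here */
    return bounce;                        /* this copy is what gets written to disk */
}

int main(void)
{
    char page[PAGE_SIZE] = { 0 };
    free(encrypt_to_bounce_page(page));
    return 0;
}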
