Hi.

Under any desktop environment, as soon as the swap is touched the kernel starts paging out gigabytes of memory for no apparent reason, causing the system to freeze for at least a minute.

This happens with normal desktop usage, but I'm able to reproduce this behavior systematically by using the code included at the end of this report; let's call it "memeater". memeater takes three parameters: totalmem, chunksize, sleeptime. It allocates (and initializes to 0) totalmem bytes of memory in chunks of chunksize bytes each. After each chunk allocation it sleeps for sleeptime seconds. If sleeptime=0, it does not sleep in between chunk allocations. After totalmem bytes of memory have been allocated it sleeps indefinitely.

To reproduce the behaviour using memeater:

1) Start a desktop environment (in my case KDE Plasma).
2) Invoke "memeater x y 0" to bring the system to the brink of swapping. I usually execute "memeater x y 0" multiple times where x is 4 gigs and y is 20 megs. (I've got 16GB of memory so I usually go up to 14/15 GB of used memory.)
3) Invoke "memeater x y 1" one last time so that it slowly fills up the memory with small chunks of a few megs each (again, I tried it with x=4 gigs and y=20 megs).

When the last memeater fills the memory (and keeps allocating chunks) I would expect the mm system to swap out a few megabytes worth of pages to accommodate the new requests slowly coming from the last memeater. However, what actually happens is that the mm starts to swap out gigabytes worth of pages, completely freezing the system for one or two minutes.

I've verified this on various desktop systems, all using SSDs. Obviously, I'm willing to provide more info and to test patches.
####### memeater #######################

#include <stdlib.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

int main(int argc, char** argv) {
	long int max = -1;
	int mb = 0;
	long int size = 0;
	long int total = 0;
	int sleep_time = 0;
	char* buffer;

	if (argc > 3) {	/* was argc > 1, which read argv[2]/argv[3] unchecked */
		max = atol(argv[1]);
		size = atol(argv[2]);
		sleep_time = atoi(argv[3]);
	}
	printf("Max: %ld bytes\n", max);	/* %ld: max is a signed long */
	while ((buffer = malloc(size)) != NULL && total < max) {
		memset(buffer, 0, size);	/* touch the pages so they are really allocated */
		mb++;
		total = (long int)mb * size;
		printf("Allocated %ld bytes\n", total);
		if (sleep_time)
			sleep(sleep_time);
	}
	sleep(3000000);	/* hold the memory indefinitely */
	return 0;
}
(switched to email. Please respond via emailed reply-to-all, not via the bugzilla web interface).

On Sun, 17 Jun 2018 21:56:23 +0000 bugzilla-daemon@bugzilla.kernel.org wrote:

> https://bugzilla.kernel.org/show_bug.cgi?id=200105
>
>             Bug ID: 200105
>            Summary: High paging activity as soon as the swap is touched
>                     (with steps and code to reproduce it)
>            Product: Memory Management
>            Version: 2.5
>     Kernel Version: 4.14 4.15 4.16 4.17
>           Hardware: All
>                 OS: Linux
>               Tree: Mainline
>             Status: NEW
>           Severity: normal
>           Priority: P1
>          Component: Page Allocator
>           Assignee: akpm@linux-foundation.org
>           Reporter: terragonjohn@yahoo.com
>         Regression: No
>
> [...]
>
> --
> You are receiving this mail because:
> You are the assignee for the bug.
On Mon 18-06-18 16:17:35, Andrew Morton wrote:
[...]
> > I've verified this on various desktop systems, all using SSDs.
> > Obviously, I'm willing to provide more info and to test patches.

Could you snapshot /proc/vmstat during your test please? Once per second or so should tell us more what is going on.
Created attachment 276709 [details]
attachment-18225-0.html

On Thursday, June 21, 2018, 5:13:14 PM GMT+2, <bugzilla-daemon@bugzilla.kernel.org> wrote:

> https://bugzilla.kernel.org/show_bug.cgi?id=200105
>
> --- Comment #2 from Michal Hocko (mhocko@kernel.org) ---
> Could you snapshot /proc/vmstat during your test please? Once per second
> or so should tell us more what is going on.

I'll do that. For now, you can find below the result of invoking vmstat 1 during the test (I did this some time ago hoping to shed some light on the problem). This is right before the swap is touched. As you can see, there is a sudden and huge jump in swap usage.

procs -----------memory---------- ---swap-- -----io---- -system-- ------cpu-----
 r  b swpd free buff cache si so bi bo in cs us sy id wa st
 0  0 6144 160792 8 272868 0 0 0 0 281 1839 1 0 99 0 0
 0  0 6144 177844 8 246096 0 0 0 0 425 2300 1 1 99 0 0
 0  0 6144 168528 8 246112 0 0 16 0 293 1939 1 0 99 0 0
 0  0 6144 158320 8 246116 0 0 0 0 261 1245 0 0 100 0 0
 2  0 10752 161624 8 229024 0 4820 17148 4820 845 3656 1 2 97 0 0
 2  0 10752 157300 8 228096 0 0 88348 0 2114 8902 0 5 94 1 0
 0  0 10752 176108 8 200052 0 0 108312 0 2466 9772 1 5 91 3 0
 0  0 10752 170040 8 196780 0 0 17380 0 507 1895 0 1 99 0 0
 0 10 10752 160436 8 191244 0 0 346872 20 4184 17274 1 9 64 26 0
 0 29 12033856 152888 8 116696 5992 15916880 1074132 15925816 819374 2473643 0 94 0 6 0
 3 21 12031552 295644 8 136536 1188 0 11348 0 1362 3913 0 1 10 89 0
 0 11 12030528 394072 8 151000 2016 0 17304 0 907 2867 0 1 13 86 0
 0 11 12030016 485252 8 158528 708 0 7472 0 566 1680 0 1 23 77 0
 0 11 12029248 605820 8 159608 900 0 2024 0 371 1289 0 0 31 69 0
 0 11 12028992 725344 8 160472 1076 0 1204 0 387 1381 0 1 33 66 0
 0 12 12028480 842276 8 162056 724 0 3112 0 357 1142 0 1 38 61 0
 0 13 12027968 937828 8 162652 776 0 1312 0 363 1191 0 1 31 68 0
 0  9 12027456 1085672 8 163260 656 0 1520 0 439 1497 0 0 30 69 0
 0 10 12027200 1207624 8 163684 728 0 992 0 411 1268 0 0 42 58 0
 0  9 12026688 1331492 8 164740 600 0 1732 0 392 1203 0 0 36 64 0
 0  9 12026432 1458312 8 166020 628 0 1644 0 366 1176 0 0 33 66 0
Created attachment 276723 [details]
attachment-12219-0.html

On Thursday, June 21, 2018, 5:13:14 PM GMT+2, bugzilla-daemon@bugzilla.kernel.org <bugzilla-daemon@bugzilla.kernel.org> wrote:

> Could you snapshot /proc/vmstat during your test please? Once per second
> or so should tell us more what is going on.

I have a doubt about how to effectively snapshot /proc/vmstat. As you can see from the output of vmstat 1 in my previous comment, there is a huge jump in swap usage. I suspect that vmstat just got frozen like everything else during the test, and therefore it has not correctly reported what was happening during the big swapping freeze, but rather the situation right before the freeze and the situation right after (when it got unfrozen). So, if I snapshot /proc/vmstat by just doing something like this:

watch -n 1 "cat /proc/vmstat >>/tmp/out ; echo -------- >>/tmp/out"

I fear I'm going to get the same behaviour: we'll know /proc/vmstat before and after the swapping-induced freeze but not what's happening _during_ it. So, did you have something more sophisticated in mind on how to snapshot /proc/vmstat?
Created attachment 276725 [details]
read_vmstat.c

On Fri 22-06-18 11:45:33, john terragon wrote:
[...]
> I fear I'm going to get the same behaviour: we'll know /proc/vmstat before
> and after the swapping-induced freeze but not what's happening _during_ it.
> So, did you have something more sophisticated in mind on how to snapshot
> /proc/vmstat?

I usually use the attached little program which tries to make sure that it doesn't get reclaimed during heavy memory pressure. Could you give it a try?
Created attachment 276833 [details]
out.txt

On Friday, June 22, 2018, 2:26:31 PM GMT+2, <bugzilla-daemon@bugzilla.kernel.org> wrote:

> I usually use the attached little program which tries to make sure that
> it doesn't get reclaimed during heavy memory pressure. Could you give it
> a try?

I used your program during the test (which, as always, caused the usual freeze) and I've attached the output to this email.
On Mon 25-06-18 17:57:37, john terragon wrote:
[...]
> I used your program during the test (which, as always, caused the usual
> freeze) and I've attached the output to this email.

Thanks. It seems that the output is complete. Good. Now let's see what is going on reclaim wise. You were complaining about stalls and excessive reclaim (swapout AFAIR), so it is good to check the direct reclaim activity first. The following reports a timestamp (first column) for each entry that shows at least some increase over the previously measured value:

$ awk '/=S/{ts=$2}/steal_direct/{if (prev) {diff=$2-prev}; prev=$2; if (diff > 50) printf "%d %d\n", ts, diff }' out.txt
1529944049 3620
1529944053 19003
1529944054 55177
1529944055 55471
1529944056 44795
1529944057 26652
1529944058 41044
1529944059 62195
1529944060 61118
1529944061 41875
1529944062 64176
1529944063 51765
1529944064 53420
1529944065 62557
1529944066 39649
1529944067 61628
1529944068 29689
1529944069 12631
1529944070 43268
1529944071 65988
1529944072 55976
1529944073 47137
1529944336 1438
1529944342 103570

It is obvious that the heaviest direct reclaim activity started at 1529944053 and then continued at a pace of ~50k pages reclaimed per second from the direct reclaim context.
You will get the background reclaim activity by replacing steal_direct with steal_kswapd, and that tells us that kswapd didn't reclaim all that much during that time period, even though it scanned a lot:

ts scan_kswapd steal_kswapd
1529944053 202878 61020
1529944054 207151 6066
1529944055 374415 3115
1529944056 217907 1871
1529944057 157264 1445
1529944058 103033 798
1529944059 242288 1055
1529944060 557561 777
1529944061 314082 907
1529944062 1169696 1597
1529944063 343617 240
1529944064 121597 694
1529944065 238726 863
1529944066 99702 1201
1529944067 169272 2002
1529944068 15838 1534
1529944069 2622 390
1529944070 34216 3053
1529944071 199088 6797
1529944072 140024 6141

So it seems that kswapd was able to scan a lot of pages while it didn't manage to reclaim a large part of them. This is interesting because the direct reclaim had basically 100% reclaim efficiency (reclaimed/scanned) at that time. One notable difference between the two is that kswapd is allowed to write dirty data (except for swap pages):

$ awk '/=S/{ts=$2}/vmscan_write\>/{if (prev) {diff=$2-prev}; prev=$2; if (diff > 50) printf "%d %d\n", ts, diff }' out.txt
1529943959 442
1529943962 245
1529943963 54
1529944018 380
1529944020 257
1529944047 2579
1529944049 25646
1529944053 20761
1529944054 150695
1529944055 237989
1529944056 197554
1529944057 120136
1529944058 171693
1529944059 264284
1529944060 256677
1529944061 175652
1529944062 276186
1529944063 225011
1529944064 210441
1529944065 261282
1529944066 168771
1529944067 264044
1529944068 134312
1529944069 46161
1529944070 192242
1529944071 301327
1529944072 259882
1529944073 5935

and this seems to be happening a lot in the said time period.
We have to check swap outs to distinguish them from regular pagecache writeback, though:

$ awk '/=S/{ts=$2}/pswpout\>/{if (prev) {diff=$2-prev}; prev=$2; if (diff > 50) printf "%d %d\n", ts, diff }' out.txt
1529944053 20844
1529944054 150738
1529944055 237977
1529944056 197614
1529944057 120201
1529944058 171583
1529944059 264189
1529944060 256794
1529944061 175607
1529944062 276164
1529944063 225054
1529944064 210540
1529944065 261106
1529944066 168903
1529944067 263999
1529944068 134370
1529944069 46189
1529944070 192251
1529944071 301349
1529944072 259830

which suggests that almost all of it was swap out. I do not know how to tell which context did the actual reclaim. It is possible that kswapd got throttled or something like that. Anyway, do we have to swap out that much though?

$ awk '/=S/{ts=$2} /nr_inactive_file/{file=$2} /nr_active_file/{file+=$2; printf "%d %d\n", ts, file}' out.txt
[...]
1529944053 2868
1529944054 552
1529944055 843
1529944056 867
1529944057 886
1529944058 1155
1529944059 1186
1529944060 1276
1529944061 1314
1529944062 1283
1529944063 1280
1529944064 872
1529944065 1060
1529944066 1075
1529944067 514
1529944068 417
1529944069 470
1529944070 516
1529944071 578
1529944072 506
[...]

So yes, we have to, because there is barely any page cache left.

From the above I do not see an obvious bug here. Kswapd seems to be hitting the wall but the swap out seems to be unavoidable. How this relates to the stalls is another question, and we would need tracepoint data for that. Anyway, your workload seems to be overcommitting the available memory here. Maybe we could work much better in that case, but then it would be great to get some tracing data when a real workload hits this.
Created attachment 276905 [details]
out5.txt

On Tuesday, June 26, 2018, 2:33:16 PM GMT+2, Michal Hocko <mhocko@kernel.org> wrote:

> [...] Anyway your workload seems to be overcommitting on the
> available memory here. Maybe we could work much better in that case
> but then it would be great to get some tracing data when a real workload
> hits this.

I've redone the test, this time asking for 20 MB of space every 5 seconds (the other test was once every second), just to make things easier (?) for the vm (snapshot of vmstat attached). Besides the vmstat numbers, this is what happens:

- As soon as RAM is full (or close to full, like a few tens of MB left) a complete system freeze ensues.
- The freeze lasts for about 60 seconds.
- After the 60 seconds the swap occupation went from about zero to 15 GB (!).
- All that was asked during the 60 seconds was about 240 MB of space (20 MB chunks once every 5 seconds).

Just to get this straight: you're telling me that this behavior is correct? And let's even forget about the system freeze. That is, writing 15 GB to swap in 60 seconds when all that was asked of the vm during these 60 seconds was a total of 240 MB of RAM. This is the vm working as intended?

And let me just clarify that I didn't come up with that test just to troll the vm ;). I use Linux on desktop PCs and it all started when, all of a sudden, these big freezes (even longer than the one in the "synthetic" workload) began to happen during normal use: one DE (KDE), one mailer, one editor, and some instances of Firefox (they take the lion's share of the memory, but they do not start to allocate 1 GB every second; they seem to slowly get "fatter and fatter", so to speak).
On Wed 27-06-18 10:38:19, john terragon wrote:
[...]
> I've redone the test, this time asking for 20 MB of space every 5
> seconds (the other test was once every second), just to make things
> easier (?) for the vm (snapshot of vmstat attached). Besides the
> vmstat numbers, this is what happens:
>
> - As soon as RAM is full (or close to full, like a few tens of MB left)
>   a complete system freeze ensues.
> - The freeze lasts for about 60 seconds.
> - After the 60 seconds the swap occupation went from about zero to 15 GB (!).
> - All that was asked during the 60 seconds was about 240 MB of space
>   (20 MB chunks once every 5 seconds).
All I can see in the vmstat data is that the heaviest direct reclaim happens during the following 23s:

1530092789 2939
1530092790 26530
1530092791 32758
1530092792 43734
1530092793 56487
1530092794 36676
1530092795 26592
1530092796 40690
1530092797 59096
1530092798 60331
1530092799 54677
1530092800 38491
1530092801 39346
1530092802 56641
1530092803 66019
1530092804 27940
1530092805 53971
1530092806 43717
1530092807 39238
1530092809 73689
1530092810 51959
1530092811 69046
1530092812 11387

We swap out a lot of memory during that time (this is in pages):

1530092790 42106
1530092791 104644
1530092792 153647
1530092793 215620
1530092794 158937
1530092795 115064
1530092796 174210
1530092797 256667
1530092798 255214
1530092799 228175
1530092800 157931
1530092801 164286
1530092802 255517
1530092803 274509
1530092804 121529
1530092805 222045
1530092806 194336
1530092807 172144
1530092809 313238
1530092810 214102
1530092811 163935
1530092812 4378

This is ~15G of memory in total. The amount of free memory during that time period is:

$ awk '/=S/{ts=$2}/nr_free_pages\>/{if (ts >= 1530092790 && ts <= 1530092812) printf "%d %d\n", ts, $2 }' out5.txt
1530092790 33365
1530092791 33447
1530092792 33234
1530092793 33343
1530092794 33376
1530092795 33376
1530092796 33280
1530092797 33463
1530092798 33414
1530092799 33535
1530092800 33715
1530092801 33394
1530092802 33436
1530092803 33436
1530092804 33435
1530092805 33293
1530092806 34028
1530092807 33300
1530092809 33329
1530092810 33398
1530092811 59889
1530092812 92072

which is roughly 130MB and would correspond to the min watermark. So there is permanent memory pressure even while we reclaim a lot of memory (~400MB/s), which means somebody is allocating at roughly the same pace.
There is basically no pagecache during that time period:

1530092790 916
1530092791 900
1530092792 1465
1530092793 641
1530092794 723
1530092795 798
1530092796 798
1530092797 798
1530092798 876
1530092799 1047
1530092800 1138
1530092801 1802
1530092802 1041
1530092803 1058
1530092804 1042
1530092805 1337
1530092806 819
1530092807 824
1530092809 1174
1530092810 1135
1530092811 7729
1530092812 12004

and all the memory is consumed by anonymous memory (in pages):

1530092790 3943370
1530092791 3910853
1530092792 3868239
1530092793 3816441
1530092794 3777916
1530092795 3750696
1530092796 3709097
1530092797 3648167
1530092798 3586738
1530092799 3531756
1530092800 3494133
1530092801 3455150
1530092802 3393068
1530092803 3327196
1530092804 3299318
1530092805 3244940
1530092806 3198904
1530092807 3158415
1530092809 3082721
1530092810 3030749
1530092811 2967568
1530092812 2956265

so roughly 15GB.

> Just to get this straight. You're telling me that this behavior is
> correct? And let's even forget about the system freeze. That is,
> writing 15GB in the swap in 60 seconds when all it was asked to the vm
> during these 60 seconds was a total of 240MB of ram. This is the vm
> working as intended?

Well, I have no idea what exactly you do here. All I know is that your memory is full of anonymous memory and there is no way to free that memory without swapping it out. Considering that the reclaim doesn't really free up memory during this time window, it means that somebody is allocating at roughly the same pace. And under such conditions, yes, we do swap out a lot.
Created attachment 276923 [details]
attachment-27873-0.html

On Wednesday, June 27, 2018, 1:03:50 PM GMT+2, Michal Hocko <mhocko@kernel.org> wrote:

> Well, I have no idea what exactly you do here. All I know is that your
> memory is full of anonymous memory and there is no way to free that
> memory without swap out. Considering that the reclaim doesn't really
> frees up memory during this time window it means that somebody is
> allocating roughly at the same pace. And under such conditions, yes,
> we do swap out a lot.

I don't understand. Are you saying that in the vmstat numbers you are seeing an allocation rate that is higher than the roughly 4 MB/s that memeater is causing? Because the test is very simple and the "actors" are:

a) normal system stuff + KDE stuff
b) one console window
c) some (4 or 5) instances of memeater used to bring memory usage close to full RAM quickly
d) the last memeater, which allocates at 4 MB/s

After the c) are done doing their part, they pause (and at that point no swapping is going on). After the c) are done, d) is the only one allocating at any discernible rate. So, correct me if I'm wrong, in this setting what should happen is roughly this:

1) we get close to full RAM.
2) the active memeater allocates another 20 MB.
3) the vm swaps out, let's say, 100 MB to avoid threshold effects. It has to do this within 5 seconds, before getting another request from the last memeater.

Isn't that roughly what should happen in this setting?
On Thu 28-06-18 14:07:46, Michal Hocko wrote:
> On Wed 27-06-18 13:52:37, john terragon wrote:
> > I don't understand. Are you saying that in the vmstat numbers you are
> > seeing an allocation rate that is higher than the roughly 4MB/s that
> > memeater is causing?
>
> Well, I might have screwed up my awk-foo (that's why I've pasted it
> there, hint hint). But yes. If I see that there is a lot of reclaimed
> memory and the number of free pages doesn't grow then there is either a
> memory leak or the pages do not get back to the free pool in timely
> fashion or somebody is eating the reclaimed memory. I do not see
> any evidence of the former because the number of free pages grows
> eventually. Ruling out the second option would require some tracepoints
> to follow the IO path. I am currently quite busy but one way to check
> the last option is to look at pgalloc* counters and see whether they
> grow in a similar pace as the number of reclaimed pages.

So I found a minute and the situation looks as follows.

$ awk '/=S/{ts=$2} /pgalloc_normal/{diff=$2-prev;prev=$2; if (ts >= 1530092790 && ts <= 1530092812) printf "%d %d\n", ts, diff}' out5.txt

will tell you how many pages have been allocated from zone Normal between each two timestamps. To be absolutely precise you should also consider other zones as well, but I assume that this one is the biggest consumer.
If we combine that with the direct reclaim (again, we should consider both kswapd and direct to be precise) then we have the following picture:

ts steal_direct pgalloc_normal ratio
1530092790 26530 107146 4.04
1530092791 32758 58778 1.79
1530092792 43734 62185 1.42
1530092793 56487 59678 1.06
1530092794 36676 41816 1.14
1530092795 26592 29041 1.09
1530092796 40690 46552 1.14
1530092797 59096 77442 1.31
1530092798 60331 64751 1.07
1530092799 54677 63432 1.16
1530092800 38491 41643 1.08
1530092801 39346 44797 1.14
1530092802 56641 63045 1.11
1530092803 66019 74768 1.13
1530092804 27940 33814 1.21
1530092805 53971 59844 1.11
1530092806 43717 53088 1.21
1530092807 39238 45798 1.17
1530092809 73689 84833 1.15
1530092810 51959 61545 1.18
1530092811 69046 56313 0.82
1530092812 11387 6839 0.60

which gives a good idea that we are allocating at roughly the same pace as we reclaim.
On Wed 27-06-18 13:52:37, john terragon wrote:
[...]
> I don't understand. Are you saying that in the vmstat numbers you are
> seeing an allocation rate that is higher than the roughly 4MB/s that
> memeater is causing?

Well, I might have screwed up my awk-foo (that's why I've pasted it there, hint hint). But yes. If I see that there is a lot of reclaimed memory and the number of free pages doesn't grow, then there is either a memory leak, or the pages do not get back to the free pool in a timely fashion, or somebody is eating the reclaimed memory. I do not see any evidence of the first because the number of free pages grows eventually. Ruling out the second option would require some tracepoints to follow the IO path. I am currently quite busy, but one way to check the last option is to look at the pgalloc* counters and see whether they grow at a similar pace as the number of reclaimed pages.
Created attachment 277063 [details]
attachment-3570-0.html

On Thursday, June 28, 2018, 9:03:27 PM GMT+2, Michal Hocko <mhocko@kernel.org> wrote:

> ts steal_direct pgalloc_normal ratio
> 1530092790 26530 107146 4.04
> 1530092791 32758 58778 1.79
> 1530092792 43734 62185 1.42
> 1530092793 56487 59678 1.06
> 1530092794 36676 41816 1.14
> 1530092795 26592 29041 1.09
> 1530092796 40690 46552 1.14
> 1530092797 59096 77442 1.31
> 1530092798 60331 64751 1.07
> 1530092799 54677 63432 1.16
> 1530092800 38491 41643 1.08
> 1530092801 39346 44797 1.14
> 1530092802 56641 63045 1.11
> 1530092803 66019 74768 1.13
> 1530092804 27940 33814 1.21
> 1530092805 53971 59844 1.11
> 1530092806 43717 53088 1.21
> 1530092807 39238 45798 1.17
> 1530092809 73689 84833 1.15
> 1530092810 51959 61545 1.18
> 1530092811 69046 56313 0.82
> 1530092812 11387 6839 0.60
>
> that gives a good idea that we are allocating roughly in the same pace we
> reclaim.

Those numbers are pages, right? A page is 4K, right? So, for example, 59844*4096 ~= 240MB. Hence the allocating pace is 240 MB/s?! But there's no process doing that. I'm sorry, I sound like a broken record: there is only one user process (the last memeater) that's allocating at a relatively high pace during the test, and that's 4 MB/s.... Is there a way to trace which process allocates how much, instead of just having the aggregate number?
On Fri 29-06-18 20:24:39, john terragon wrote:
[...]
> Those numbers are pages, right? A page is 4K, right? So, for example,
> 59844*4096 ~= 240MB. Hence the allocating pace is 240 MB/s?! But there's
> no process doing that.
> I'm sorry, I sound like a broken record: there is only one user process
> (the last memeater) that's allocating at a relatively high pace during
> the test, and that's 4 MB/s....
> Is there a way to trace which process allocates how much, instead of just
> having the aggregate number?

You can try to use the mm_page_alloc tracepoint (events/kmem). One way to do that is to use ftrace, or to mount tracefs, enable the tracepoint under the above path, and then read the trace_pipe.
Created attachment 277133 [details]
attachment-32664-0.html

On Monday, July 2, 2018, 4:42:03 PM GMT+2, Michal Hocko <mhocko@kernel.org> wrote:

> You can try to use mm_page_alloc tracepoint (events/kmem). One way to do
> that is to use ftrace or mount tracefs and enable the tracepoint under
> the above path and then read the trace_pipe.

In my bug report I submitted code and steps to reproduce the behaviour. Have you tried to reproduce it?
On Mon 02-07-18 17:46:09, john terragon wrote:
[...]
> In my bug report I submitted code and steps to reproduce the behaviour.
> Have you tried to reproduce it?

No, I haven't. I am sorry, but my time is quite scarce, so I focused on the state of the system when you see the problem, and that data shows that there is a continuous source of allocations; as such the behavior is expected. Maybe this is solely your test case, or maybe something else.
Created attachment 277203 [details]
attachment-12017-0.html

Maybe you can give me an explanation for the following:

1) The memeater processes (the first ones, which allocate close to 16GB of memory without pausing between requests, and the last one, which pauses 5 seconds between requests and starts the unexplained heavy swapping) allocate a grand total of about 16.5 GB of memory (I pause the last memeater right after the one-minute freeze and the unexplained 15GB-worth swap-out).
2) At that point, as you have also seen from the vmstat numbers, both RAM and swap are about full (so about 32GB total).
3) If, at that point, I kill all the memeater processes at once, then RAM and swap usage BOTH drop to about zero (three/four hundred megs each).

So all the memory allocated by the memeater processes seems to be duplicated in both RAM and swap. Why is that?
(In reply to john terragon from comment #15)
> On Monday, July 2, 2018, 4:42:03 PM GMT+2, Michal Hocko <mhocko@kernel.org> wrote:
> > You can try to use mm_page_alloc tracepoint (events/kmem). ...snip...
> In my bug report I submitted code and steps to reproduce the behaviour.
> Have you tried to reproduce it?

I spun up a VM with 8 CPUs (i7 4790k), 8G memory and 4G SSD swap (IIRC). I also installed GNOME desktop since you were also using a desktop environment.

Tried using memeater.c and the steps in your original post (#1) to reproduce on a 4.17 kernel, but no luck. The system remained responsive throughout.

Same deal when I tried filling up memory by opening tens of browser tabs: no freezing.

I've seen something similar to your issue several times on my desktop at home, however, and so have some interest in understanding what's happening here more deeply.

I wonder what's different between your system and mine? Anyway, let me catch up on the rest of the thread.
Created attachment 277205 [details]
attachment-16893-0.html

On Thursday, July 5, 2018, 10:49:28 PM GMT+2, bugzilla-daemon@bugzilla.kernel.org wrote:
> I spun up a VM with 8 CPUs (i7 4790k), 8G memory and 4G SSD swap (IIRC). I
> also installed GNOME desktop since you were also using a desktop environment.
>
> ...snip...
>
> I wonder what's different between your system and mine? Anyway, let me catch
> up on the rest of the thread.

Could you send me the .config of the kernel you used in the VM? I don't think my kernel config has anything strange, as it's mainly derived from the ones in the debian kernel image packages (but I also tried the test with debian kernel images, with the same results). But maybe there's something.
Created attachment 277209 [details] 4.17.0-rc7 VM config
(In reply to john terragon from comment #19)
> Could you send me the .config of the kernel you used in the VM? I don't
> think my kernel config has anything strange, as it's mainly derived from the
> ones in the debian kernel image packages (but I tried the test also with
> debian kernel images with the same results). But maybe there's something.

The config I just posted is also derived from debian (the VM is running debian testing).

So I looked at out5.txt, and it contains 458 seconds worth of samples. Do you remember roughly the timing of when you started the memeaters and when you observed the freeze? I'm looking for something like: the memeaters started around 1530092770 or so, and it started freezing around 1530092791. Just trying to correlate the vmstats with what you were doing and seeing.
It would also be helpful to know about when the system unfroze.
Created attachment 277231 [details]
attachment-5996-0.html

On Friday, July 6, 2018, 3:54:44 AM GMT+2, <bugzilla-daemon@bugzilla.kernel.org> wrote:
> It would also be helpful to know about when the system unfroze.

I can always redo the test and take note of those timings (since the test works like clockwork in triggering the behaviour), but as far as I can remember, I started Michal's program (read_vmstat) as soon as the first memeater started.

All but the last memeater just allocate some GB in chunks of 20MB each without pausing. I start them one after the other; they have the role of almost filling the RAM, and then they sleep. Their executions go smoothly (because they do not involve the swap).

Then the last memeater is started, and it allocates at a rate of 20MB every 5 seconds (in the out5.txt test). It goes smoothly until the RAM is exhausted. Then the system freezes for about 60 seconds, and as soon as it unfreezes I have:

1) RAM almost full (13/14 GB)
2) swap almost full too (15GB)

This despite the fact that a) the last memeater asked for just 200/300 MB of space during the freeze (about 20 MB every 5 seconds) and b) all the memeaters together asked for a total of 16GB.

Anyway, I believe that as soon as the system unfroze, I stopped read_vmstat. So the start of the freeze should be about 60 seconds before the end of out5.txt. And I see the freeze as just a symptom of what seems (to me) to be an unjustified swap-out: 15GB against the 300MB asked for during that 60-second period.

Maybe I should try doing the test without a GUI, just alt-tabbing and doing it from the console, and see if the big swap-out happens again.
Created attachment 277251 [details]
out6.txt

So, I repeated the test from the console, without the GUI. The vmstat snapshot is attached. As I expected, there was no visible system freeze (the symptom), so the problem itself was visible "to the naked eye". Here's what happened:

1) read_vmstat was started right before starting the first memeater.
2) The RAM was then almost filled with non-pausing memeaters. As always, no swapping.
3) Then the last memeater, allocating 20MB once every 2 seconds, was started.
4) I was watching the output of "watch -n 1 free" on another console, and as soon as the RAM was filled, the kernel swapped out 15GB of memory in a matter of, I'd say, 5 seconds.

The kernel apparently behaved like in the old days when there was no hardware support for paging, and swapping literally meant "take the whole memory of a process and swap it out at once". The whole 15GB of memory allocated by the other memeaters was swapped out in five seconds, upon a request of less than 50MB from the last memeater. This can't be normal.
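For scale, the disparity described in 4) works out as follows (a rough back-of-the-envelope check; the 15 GB, 5 s and 50 MB figures are the approximate observations from this test, not exact measurements):

```python
# Approximate figures from the console-only test above.
swapped_out = 15 * 2**30   # ~15 GB swapped out...
window_s    = 5            # ...in roughly 5 seconds
requested   = 50 * 2**20   # <50 MB actually requested by the last memeater

print(f"swap-out rate: ~{swapped_out / window_s / 2**20:.0f} MiB/s")
print(f"swapped out ~{swapped_out // requested}x more than was requested")
```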
Created attachment 277253 [details]
vmstat snapshot of the test without a GUI

I forgot to attach it to my last email. Anyway, here it is.
Created attachment 277267 [details]
attachment-11371-0.html

Since the console-only test with linux shed some light on the behavior of the vm, I installed freebsd 11.2 on the same machine and conducted the same test. Freebsd doesn't have /proc/vmstat, so I've used "vmstat 1" to get the vm info. You can find an attachment with the snapshot after this comment.

As you can see from the vmstat snapshot, freebsd handles the vm pressure put on by the last memeater as it should be handled: it swaps out a few megabytes at a time every time the RAM gets close to full due to the 20MB chunk requests from the last memeater (which in this case were made once every second, instead of once every two seconds... so even more pressure).

How can the linux kernel (or at least mine) perform so much worse than freebsd's?
Created attachment 277269 [details]
vmstat snapshot of the console-only test on FreeBSD
Created attachment 277271 [details]
attachment-11874-0.html

It's the damned encrypted swap partition (normal dm-crypt luks)...

I just tried the test with a normal swap partition (it's the same ssd and the same partition, except that it's not encrypted anymore) and, lo and behold, it went through just as smoothly as freebsd. That was the console-only test; I'll try the one with a GUI, but I'm sure that now, without the anomalous swap-out, there's not going to be any freeze.

So, am I missing something? Because I've never read anywhere that using dm-crypt with a swap partition would influence the vm to this degree.
Workarounds for the abysmal performance of dm-crypt with swap:

1) encrypt the swap partition with the old cryptoloop (or loop-AES, but it's out-of-tree)
2) convert the swap partition to ext4, create a swapfile, and encrypt it with ext4's VFS-level encryption

I did the test with both, and they behave as smoothly as an unencrypted swap.
Thanks for rerunning those tests earlier.

(In reply to john terragon from comment #28)
> It's the damned encrypted swap partition (normal dm-crypt luks)....
>
> I just tried the test with a normal swap partition (it's the same ssd and
> the same partition except that is not encrypted anymore) and, lo and behold,
> it went through it just as smoothly as freebsd. It's the console-only test
> and I'll try the one with a GUI but I'm sure that now, without the anomalous
> swapout there's not going to be any freeze.
>
> So, am I missing something? Because I've never read anywhere that using
> dm-crypt with a swap partition would influence the vm to this degree.

Still not seeing this behavior after adding an encrypted swap partition to my own setup:

debian-vm:~ λ cat /etc/crypttab   # after editing this file
# <target name> <source device> <key file> <options>
swap /dev/vda5 /dev/urandom swap,cipher=aes-xts-plain64,size=256

Just throwing it out there for anyone who doesn't know: using raw device names that can change is dangerous for an encrypted swap partition, see https://wiki.archlinux.org/index.php/Dm-crypt/Swap_encryption for alternatives.

debian-vm:~ λ sudo mkswap /dev/mapper/swap
mkswap: /dev/mapper/swap: warning: wiping old swap signature.
Setting up swapspace version 1, size = 8 GiB (8586784768 bytes)
no label, UUID=179abefe-c3c8-4db6-b249-2ade08af972d
debian-vm:~ λ sudo swapon /dev/mapper/swap
debian-vm:~ λ lsblk
NAME     MAJ:MIN RM  SIZE RO TYPE  MOUNTPOINT
sr0       11:0    1 1024M  0 rom
vda      254:0    0   46G  0 disk
├─vda1   254:1    0 19.2G  0 part  /
├─vda2   254:2    0    1K  0 part
├─vda4   254:4    0   16G  0 part  /scratch
├─vda5   254:5    0    8G  0 part
│ └─swap 253:0    0    8G  0 crypt [SWAP]
└─vda6   254:6    0  2.8G  0 part  /boot

On my VM with 8G of memory, the system starts out with over 7G free. Then

$ memeater $((7*2**30)) $((256*2**20)) 0

and most of the free memory is used.
Another few

$ memeater $((300*2**20)) $((20*2**20)) 2

just swap out a little more memory, as expected, each time, with the majority of the swap space remaining free.

So, can you explain your configuration, please? Steps that you're using to initialize the encrypted swap partition, contents of /etc/crypttab and /etc/fstab, output of lsblk and swapon -s, that sort of thing. Thanks.
So, the normal setup is a luks encrypted partition. It does not have an entry in either crypttab or fstab (root obviously has) because it is opened and activated in a "late" ad-hoc script that does this:

/lib/cryptsetup/scripts/decrypt_derived cryroot | cryptsetup luksOpen UUID=d4b0a395-321a-41e7-87eb-61778959a95a cryswap --key-file -
swapon /dev/mapper/cryswap

The decrypt_derived script just gets the key from the root partition (also luks-encrypted). The disk layout is really simple:

/dev/sda1: luks root, 200-odd GB
/dev/sda2: luks swap, 16GB

Cipher and key length are the defaults used by cryptsetup, which nowadays should be aes-xts-plain64 and 256 bits. As I wrote, it's an SSD (but I've tried with HDDs). allow_discards is enabled for cryroot. I didn't normally have it enabled for cryswap, but I've tried the test both ways (not that that was likely to change anything).

As I said, this is the normal setup, but I've also tried (with the same results):

1) different ciphers with luks: serpent, twofish, etc.
2) a swap partition with plain dm-crypt and, again, various ciphers
3) a swap file on an ext4 FS on dm-crypt (on /dev/sda2, because cryroot uses btrfs, which does not support swapfiles)

You said that you witnessed a similar behaviour on your desktop system at home. Do you use a dm-crypt swap partition, or a swapfile on a dm-crypt volume? Just to eliminate all the strange possibilities, it would be great if you could do the test on a real machine, not a vm.
Ok, missed the part about luks. I don't use any form of encrypted swap partition, until now that is. :) Sure, the virtualization could be changing things. I'll try it on bare metal.
(In reply to john terragon from comment #31)
> Just to eliminate all the strange possibilities, it would be great
> if you could do the test on a real machine, not a vm.

Since I don't have LUKS set up on my existing system, today I installed another disk in my desktop and created a LUKS encrypted swap partition on it.

zissou:~ λ sudo cryptsetup luksFormat /dev/sdc1

WARNING!
========
This will overwrite data on /dev/sdc1 irrevocably.

Are you sure? (Type uppercase yes): YES
Enter passphrase for /dev/sdc1:
Verify passphrase:

zissou:~ λ lsblk
NAME   MAJ:MIN RM   SIZE RO TYPE MOUNTPOINT
sda      8:0    0 232.9G  0 disk
├─sda1   8:1    0   550M  0 part /boot/efi
└─sda2   8:2    0 232.4G  0 part /
sdb      8:16   0 931.5G  0 disk
└─sdb1   8:17   0 931.5G  0 part /hdd
sdc      8:32   0 931.5G  0 disk
└─sdc1   8:33   0    16G  0 part   # destined for swap

zissou:~ λ sudo cryptsetup luksDump /dev/sdc1
LUKS header information for /dev/sdc1

Version:        1
Cipher name:    aes
Cipher mode:    xts-plain64
Hash spec:      sha256
Payload offset: 4096
MK bits:        256
MK digest:      7f e3 19 a6 9c 1d 5c 2a 72 36 36 6c 52 bc 36 b2 c9 81 c3 36
MK salt:        d1 55 16 3f 23 81 53 ee ac ad 05 e3 26 5b b8 22
                c1 4a 0e 87 18 75 38 16 42 8c b4 11 e0 c2 ff 31
MK iterations:  252061
UUID:           5ee3c1b1-095c-4761-a349-93b357647d83

Key Slot 0: ENABLED
        Iterations:             3934618
        Salt:                   6f 21 80 c8 32 04 20 ec 29 2e 09 dd 49 fb 80 c9
                                11 cd 52 9b 80 16 3a 46 f3 b4 c6 1f 85 15 1c bb
        Key material offset:    8
        AF stripes:             4000
Key Slot 1: DISABLED
Key Slot 2: DISABLED
Key Slot 3: DISABLED
Key Slot 4: DISABLED
Key Slot 5: DISABLED
Key Slot 6: DISABLED
Key Slot 7: DISABLED

zissou:~ λ sudo cryptsetup luksOpen /dev/sdc1 swap
Enter passphrase for /dev/sdc1:

zissou:~ λ lsblk
NAME     MAJ:MIN RM   SIZE RO TYPE  MOUNTPOINT
sda        8:0    0 232.9G  0 disk
├─sda1     8:1    0   550M  0 part  /boot/efi
└─sda2     8:2    0 232.4G  0 part  /
sdb        8:16   0 931.5G  0 disk
└─sdb1     8:17   0 931.5G  0 part  /hdd
sdc        8:32   0 931.5G  0 disk
└─sdc1     8:33   0    16G  0 part
  └─swap 254:0    0    16G  0 crypt

zissou:~ λ sudo mkswap /dev/mapper/swap
Setting up swapspace version 1, size = 16 GiB (17177767936 bytes)
no label, UUID=c2304cc4-667a-4ad9-ac37-6e3c4fda5c8a

zissou:~ λ sudo swapon /dev/mapper/swap
zissou:~ λ lsblk
NAME     MAJ:MIN RM   SIZE RO TYPE  MOUNTPOINT
sda        8:0    0 232.9G  0 disk
├─sda1     8:1    0   550M  0 part  /boot/efi
└─sda2     8:2    0 232.4G  0 part  /
sdb        8:16   0 931.5G  0 disk
└─sdb1     8:17   0 931.5G  0 part  /hdd
sdc        8:32   0 931.5G  0 disk
└─sdc1     8:33   0    16G  0 part
  └─swap 254:0    0    16G  0 crypt [SWAP]

Still no luck reproducing after doing the same tests as in comment 31: the system used the expected amount of swap. The disk is an HDD from circa 2006, though, so I could hear the thing loud and clear when it started swapping.

So, what Michal was doing in the first place, taking data from where the problem is actually happening, seems to make more sense. I'll take another look at your test results and hopefully have some tracing commands to suggest when I'm done.
(In reply to Daniel Jordan from comment #33) > Still no luck reproducing after doing the same tests as in comment 31: Sorry, comment 30.
Thanks for trying, Daniel.
(In reply to Michal Hocko from comment #14)
> On Fri 29-06-18 20:24:39, john terragon wrote:
> > > ts steal_direct pgalloc_normal ratio
> > > 1530092790 26530 107146 4.04
> > ...snip...
> > Is there a way to trace which process allocates how much, instead of just
> > having the aggregate number?
>
> You can try to use mm_page_alloc tracepoint (events/kmem). One way to do
> that is to use ftrace or mount tracefs and enable the tracepoint under
> the above path and then read the trace_pipe.

Yes, could you please try the tracepoint? I'm assuming you have perf installed on your system but, as Michal says, there are other ways to get tracepoint data.

And please keep using read_vmstat as well, and try to take note of the timing of when each memeater starts and finishes allocating, when read_vmstat is started and stopped, and when perf is started and stopped. That will help in interpreting the data.

During your test, right before you start the last memeater (the one that starts the system swapping), you can run this as root in its own terminal:

perf record -a -e kmem:mm_page_alloc

and then ^C it once the system comes back from the freeze.

Aggregate the results by pid with this, again as root:

perf report -s pid --stdio

Hopefully this will shed some light on what's causing the extra 200+ MiB per second of page allocation aside from the memeater.
(In reply to Michal Hocko from comment #7)
> So it seems that the kswapd was able to scan a lot of pages while it
> didn't manage to reclaim a large part of them. This is interesting
> because the direct reclaim had basically 100% reclaim efficiency
> (reclaim/scanned) at that time.

So I've learned a lot reading Michal's analysis, which is really spot on. I think this one part, though, may not be right. For out.txt, during the period of heavy direct reclaim the efficiency (pgsteal_direct/pgscan_direct) hovers around 1%. Excuse this ugly thing:

$ awk '/=S/ { ts=$2; if (ts >= 1529944053 && ts <= 1529944072) { show=1 } else { show=0 } }
       /pgsteal_direct / { if (steal_prev) { steal_diff=$2-steal_prev } steal_prev=$2 }
       /pgscan_direct /  { if (scan_prev)  { scan_diff=$2-scan_prev }  scan_prev=$2;
           if (show) { printf "%d %d %d %.2f\n", ts, steal_diff, scan_diff, steal_diff / scan_diff * 100 } }' out.txt

timestamp   pgsteal_direct  pgscan_direct  ratio
1529944053  19003           1377730        1.38
1529944054  55177           7491802        0.74
1529944055  55471           3932763        1.41
1529944056  44795           6491762        0.69
1529944057  26652           7809653        0.34
1529944058  41044           8317191        0.49
1529944059  62195           7411019        0.84
1529944060  61118           6906786        0.88
1529944061  41875           8259804        0.51
1529944062  64176           5992609        1.07
1529944063  51765           8664945        0.60
1529944064  53420           8441670        0.63
1529944065  62557           6858885        0.91
1529944066  39649           7736501        0.51
1529944067  61628           6697437        0.92
1529944068  29689           6731073        0.44
1529944069  12631           9350035        0.14
1529944070  43268           5487761        0.79
1529944071  65988           2651462        2.49
1529944072  55976           2298567        2.44

Not sure if such a low rate is unusual, though, not having dealt with these kinds of issues much.
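The same per-second arithmetic in a short Python sketch, for anyone who prefers it to awk; the sample (steal, scan) deltas are rows copied from the table above:

```python
# Direct-reclaim efficiency = pgsteal_direct / pgscan_direct * 100, computed
# per one-second interval. Sample deltas copied from the table above.
samples = [
    (1529944053, 19003, 1377730),
    (1529944054, 55177, 7491802),
    (1529944069, 12631, 9350035),
]

for ts, steal, scan in samples:
    print(f"{ts} {steal} {scan} {steal / scan * 100:.2f}")
```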
Created attachment 277415 [details] read_vmstat output
Created attachment 277417 [details] perf report
Created attachment 277419 [details] start and finish times for the memeaters
Three new attachments above:

1) the read_vmstat output (started right before the first memeater was executed)
2) the perf report (as requested, perf was started right before executing the last memeater, the one that caused the swapping)
3) start and finish allocation times of the memeaters (except the last memeater, which has only the starting time, because I Ctrl-Z'd it before it completed all its allocations)
Sorry, but nobody reads bugzilla. I tried to switch this discussion to an email thread for a reason! Please resend all this (useful) info in reply to the email thread which I created for this purpose.
Created attachment 277431 [details]
report

On Friday, July 20, 2018, 2:03:48 AM GMT+2, bugzilla-daemon@bugzilla.kernel.org wrote:
> https://bugzilla.kernel.org/show_bug.cgi?id=200105
>
> --- Comment #42 from Andrew Morton (akpm@linux-foundation.org) ---
> Sorry, but nobody reads bugzilla. I tried to switch this discussion to an
> email thread for a reason!
>
> Please resend all this (useful) info in reply to the email thread which I
> created for this purpose.

I'll resend the last message and attachments. Anyone interested in the previous "episodes" can read https://bugzilla.kernel.org/show_bug.cgi?id=200105

Three attachments:
1) the read_vmstat output (started right before the first memeater was executed)
2) the perf report (as requested, perf was started right before executing the last memeater, the one that caused the swapping)
3) start and finish allocation times of the memeaters (except the last memeater, which has only the starting time, because I Ctrl-Z'd it before it completed all its allocations)
Created attachment 277433 [details] out7.txt
Created attachment 277435 [details] start_finish.txt
On 07/20/2018 04:19 AM, john terragon wrote:
> On Friday, July 20, 2018, 2:03:48 AM GMT+2, bugzilla-daemon@bugzilla.kernel.org wrote:
> > https://bugzilla.kernel.org/show_bug.cgi?id=200105
> >
> > --- Comment #42 from Andrew Morton (akpm@linux-foundation.org) ---
> > Sorry, but nobody reads bugzilla. I tried to switch this discussion to an
> > email thread for a reason!
> >
> > Please resend all this (useful) info in reply to the email thread which I
> > created for this purpose.
>
> I'll resend the last message and attachments. Anyone interested in the
> previous "episodes" go read https://bugzilla.kernel.org/show_bug.cgi?id=200105

The summary is that John has put together a reliable reproducer for a problem he's seeing where, on high memory usage, any of his desktop systems with SSDs hang for around a minute, completely unresponsive, and swap out 2-3x more memory than the system is allocating.

John's issue only happens using a LUKS encrypted swap partition; unencrypted swap, or swap encrypted without LUKS, works fine.

In one test (out5.txt) where most system memory is taken by anon pages beforehand, the heavy direct reclaim that Michal noticed lasts for 24 seconds, during which, on average, if I've crunched my numbers right, John's test program was allocating at 4MiB/s, the system overall (pgalloc_normal) was allocating at 235MiB/s, and the system was swapping out (pswpout) at 673MiB/s. pgalloc_normal and pswpout stay roughly the same each second, no big swings.

Is the disparity between allocation and swapout rate expected?

John ran perf during another test, right before the last test program was started (this doesn't include the initial large allocation bringing the system close to swapping). The top five allocators (kmem:mm_page_alloc):

# Overhead  Pid:Command
# ........  .......................
#
    48.45%  2005:memeater    # the test program
    32.08%  73:kswapd0
     3.16%  1957:perf_4.17
     1.41%  1748:watch
     1.16%  2043:free

So it seems to be just reclaim activity, but why so much when the test program only allocates at 4MiB/s?

John, adding -g to perf record would show call stacks. I'll be away for 2.5 weeks so won't be able to get back to this until then.
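Setting those averages side by side makes the mismatch concrete (simple arithmetic on the rates quoted in the summary above; the figures are the quoted averages, not fresh measurements):

```python
# Average rates over the 24-second heavy-reclaim window in out5.txt,
# as quoted in the summary above (MiB/s).
memeater = 4     # the test program's own allocation rate
pgalloc  = 235   # system-wide page allocation rate
pswpout  = 673   # swap-out rate
window_s = 24

print(f"system-wide allocation is ~{pgalloc // memeater}x the test program's")
print(f"swap-out over the window: ~{pswpout * window_s / 1024:.1f} GiB")
```

Note the second figure lines up with the ~15GB swap-out John reported observing.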
On 07/21/2018 10:39 AM, Daniel Jordan wrote:
> The summary is that John has put together a reliable reproducer for a problem
> he's seeing where on high memory usage any of his desktop systems with SSDs
> hang for around a minute, completely unresponsive, and swaps out 2-3x more
> memory than the system is allocating.
>
> ...snip...
>
> The top five allocators (kmem:mm_page_alloc):
>
> # Overhead  Pid:Command
> # ........  .......................
> #
>     48.45%  2005:memeater    # the test program
>     32.08%  73:kswapd0
>      3.16%  1957:perf_4.17
>      1.41%  1748:watch
>      1.16%  2043:free
>
> So it seems to be just reclaim activity, but why so much when the test
> program only allocates at 4MiB/s?

Should add that during the 24 seconds, reclaim efficiency for both kswapd and direct (pgsteal/pgscan) hovered around 1%, which seems low. The 24 seconds cover =S 1530092789 to =S 1530092812 in out5.txt from bugzilla.
I'll add a couple of points to Daniel's summary:

1) Initially I performed the test with the memeaters within a gui (KDE, but I tried gnome too). The system freezes occur when the test is performed within a gui.

2) Then, to isolate what seemed to be more symptoms (the freezes) than the problem itself, I started to perform the test using just text consoles. In these conditions there are no big, visible system freezes, although one of the tools used to snapshot vmstat (Michal's read_vmstat) complains with messages like "it took 28s to snapshot!" or something like that, so maybe there are still slowdowns.

3) Anyway, in both gui and text "mode", the test always causes huge swap activity which seems disproportionate to the low pressure the vm is put under (I tried allocation rates as low as 4MB/s).

4) These are the swap configurations in which the huge swap activity occurs:
- luks encrypted swap partition
- plain dm-crypt swap partition
- unencrypted swapfile located inside an encrypted (luks or plain dm-crypt) filesystem

5) The test does not cause any huge swap activity under these swap configs *when performed without a gui*:
a) unencrypted swap partition
b) encrypted swap file located inside an unencrypted ext4 filesystem (that is, ext4 directly on the partition, with the directory in which the swapfile resides encrypted with fs-level encryption, so no dm-crypt involved)

6) I was particularly happy about b) above because it seemed to be a viable workaround, but sadly I have to report that, although the test does not cause large swap activity with b), minute-long system freezes are occurring during normal usage even with b). These seem to be the same kind of freezes that prompted me to find a way to reliably reproduce them under controlled conditions. I haven't yet performed the test with b)+gui, though, only b)+text console, as I wrote in 5).
[I am really sorry to be slow on responding]

On Sat 21-07-18 10:39:05, Daniel Jordan wrote:
> John's issue only happens using a LUKS encrypted swap partition,
> unencrypted swap or swap encrypted without LUKS works fine.
>
> ...snip...
>
> The top five allocators (kmem:mm_page_alloc):
>
> # Overhead  Pid:Command
> # ........  .......................
> #
>     48.45%  2005:memeater    # the test program
>     32.08%  73:kswapd0
>      3.16%  1957:perf_4.17
>      1.41%  1748:watch
>      1.16%  2043:free

Huh, kswapd allocating memory sounds really wrong here. Is it possible that the swap device driver is double buffering and allocating a new page for each one to swap out?