Created attachment 277505 [details] iptables save After creating large number of cgroups and under memory pressure, iptables command fails with following error: "iptables-restor: vmalloc: allocation failure, allocated 3047424 of 3465216 bytes, mode:0x14010c0(GFP_KERNEL|__GFP_NORETRY), nodemask=(null)" System which is used to reproduce the bug is with 2 vcpus and 2GB of ram, but it happens on more powerfull systems. Steps to reproduce: mkdir /cgroup mount cgroup -t cgroup -omemory,pids,blkio,cpuacct /cgroup for a in `seq 1 1000`; do for b in `seq 1 4` ; do mkdir -p "/cgroup/user/$a/$b"; done; done Then in separate consoles cat /dev/vda > /dev/null ./test ./test i=0;while sleep 0 ; do iptables-restore < iptables.save ; i=$(($i+1)); echo $i; done Here is the source of "test" program and attached iptables.save. It happens also with smaller iptables.save file. #include <stdio.h> #include <stdlib.h> int main(void) { srand(time(NULL)); int i = 0, j = 0, randnum=0; int arr[6] = { 3072, 7168, 15360 , 31744, 64512, 130048}; while(1) { for (i = 0; i < 6 ; i++) { int *ptr = (int*) malloc(arr[i] * 93); for(j = 0 ; j < arr[i] * 93 / sizeof(int); j++) { *(ptr+j) = j+1; } free(ptr); } } }
(switched to email. Please respond via emailed reply-to-all, not via the bugzilla web interface). On Wed, 25 Jul 2018 11:42:57 +0000 bugzilla-daemon@bugzilla.kernel.org wrote: > https://bugzilla.kernel.org/show_bug.cgi?id=200651 > > Bug ID: 200651 > Summary: cgroups iptables-restor: vmalloc: allocation failure Thanks. Please do note the above request. > Product: Memory Management > Version: 2.5 > Kernel Version: 4.14 > Hardware: All > OS: Linux > Tree: Mainline > Status: NEW > Severity: normal > Priority: P1 > Component: Other > Assignee: akpm@linux-foundation.org > Reporter: gnikolov@icdsoft.com > Regression: No > > Created attachment 277505 [details] > --> https://bugzilla.kernel.org/attachment.cgi?id=277505&action=edit > iptables save > > After creating large number of cgroups and under memory pressure, iptables > command fails with following error: > > "iptables-restor: vmalloc: allocation failure, allocated 3047424 of 3465216 > bytes, mode:0x14010c0(GFP_KERNEL|__GFP_NORETRY), nodemask=(null)" I'm not sure what the problem is here, apart from iptables being over-optimistic about vmalloc()'s abilities. Are cgroups having any impact on this, or is it simply vmalloc arena fragmentation, and the iptables code should use some data structure more sophisticated than a massive array? Maybe all that ccgroup metadata is contributing to the arena fragmentation, but that allocations will be small and the two systems should be able to live alongside, by being realistic about vmalloc. > System which is used to reproduce the bug is with 2 vcpus and 2GB of ram, but > it happens on more powerfull systems. > > Steps to reproduce: > > mkdir /cgroup > mount cgroup -t cgroup -omemory,pids,blkio,cpuacct /cgroup > for a in `seq 1 1000`; do for b in `seq 1 4` ; do mkdir -p > "/cgroup/user/$a/$b"; done; done > > Then in separate consoles > > cat /dev/vda > /dev/null > ./test > ./test > i=0;while sleep 0 ; do iptables-restore < iptables.save ; i=$(($i+1)); echo > $i; > done > > Here is the source of "test" program and attached iptables.save. It happens > also with smaller iptables.save file. > > #include <stdio.h> > #include <stdlib.h> > > int main(void) { > > srand(time(NULL)); > int i = 0, j = 0, randnum=0; > int arr[6] = { 3072, 7168, 15360 , 31744, 64512, 130048}; > while(1) { > > for (i = 0; i < 6 ; i++) { > > int *ptr = (int*) malloc(arr[i] * 93); > > for(j = 0 ; j < arr[i] * 93 / sizeof(int); j++) { > *(ptr+j) = j+1; > } > > free(ptr); > } > } > } >
On Thu 26-07-18 09:18:57, Vlastimil Babka wrote: > On 07/25/2018 09:52 PM, Andrew Morton wrote: > > (switched to email. Please respond via emailed reply-to-all, not via the > > bugzilla web interface). > > > > On Wed, 25 Jul 2018 11:42:57 +0000 bugzilla-daemon@bugzilla.kernel.org > wrote: > > > >> https://bugzilla.kernel.org/show_bug.cgi?id=200651 > >> > >> Bug ID: 200651 > >> Summary: cgroups iptables-restor: vmalloc: allocation failure > > > > Thanks. Please do note the above request. > > > >> Product: Memory Management > >> Version: 2.5 > >> Kernel Version: 4.14 > >> Hardware: All > >> OS: Linux > >> Tree: Mainline > >> Status: NEW > >> Severity: normal > >> Priority: P1 > >> Component: Other > >> Assignee: akpm@linux-foundation.org > >> Reporter: gnikolov@icdsoft.com > >> Regression: No > >> > >> Created attachment 277505 [details] > >> --> https://bugzilla.kernel.org/attachment.cgi?id=277505&action=edit > >> iptables save > >> > >> After creating large number of cgroups and under memory pressure, iptables > >> command fails with following error: > >> > >> "iptables-restor: vmalloc: allocation failure, allocated 3047424 of > 3465216 > >> bytes, mode:0x14010c0(GFP_KERNEL|__GFP_NORETRY), nodemask=(null)" > > This is likely the kvmalloc() in xt_alloc_table_info(). Between 4.13 and > 4.17 it shouldn't use __GFP_NORETRY, but looks like commit 0537250fdc6c > ("netfilter: x_tables: make allocation less aggressive") was backported > to 4.14. Removing __GFP_NORETRY might help here, but bring back other > issues. Less than 4MB is not that much though, maybe find some "sane" > limit and use __GFP_NORETRY only above that? I have seen the same report via http://lkml.kernel.org/r/df6f501c-8546-1f55-40b1-7e3a8f54d872@icdsoft.com and the reported confirmed that kvmalloc is not a real culprit http://lkml.kernel.org/r/d99a9598-808a-6968-4131-c3949b752004@icdsoft.com > > I'm not sure what the problem is here, apart from iptables being > > over-optimistic about vmalloc()'s abilities. > > > > Are cgroups having any impact on this, or is it simply vmalloc arena > > fragmentation, and the iptables code should use some data structure > > more sophisticated than a massive array? > > > > Maybe all that ccgroup metadata is contributing to the arena > > fragmentation, but that allocations will be small and the two systems > > should be able to live alongside, by being realistic about vmalloc. > > > >> System which is used to reproduce the bug is with 2 vcpus and 2GB of ram, > but > >> it happens on more powerfull systems. > >> > >> Steps to reproduce: > >> > >> mkdir /cgroup > >> mount cgroup -t cgroup -omemory,pids,blkio,cpuacct /cgroup > >> for a in `seq 1 1000`; do for b in `seq 1 4` ; do mkdir -p > >> "/cgroup/user/$a/$b"; done; done > >> > >> Then in separate consoles > >> > >> cat /dev/vda > /dev/null > >> ./test > >> ./test > >> i=0;while sleep 0 ; do iptables-restore < iptables.save ; i=$(($i+1)); > echo $i; > >> done > >> > >> Here is the source of "test" program and attached iptables.save. It > happens > >> also with smaller iptables.save file. > >> > >> #include <stdio.h> > >> #include <stdlib.h> > >> > >> int main(void) { > >> > >> srand(time(NULL)); > >> int i = 0, j = 0, randnum=0; > >> int arr[6] = { 3072, 7168, 15360 , 31744, 64512, 130048}; > >> while(1) { > >> > >> for (i = 0; i < 6 ; i++) { > >> > >> int *ptr = (int*) malloc(arr[i] * 93); > >> > >> for(j = 0 ; j < arr[i] * 93 / sizeof(int); j++) { > >> *(ptr+j) = j+1; > >> } > >> > >> free(ptr); > >> } > >> } > >> } > >> > >
On 07/25/2018 09:52 PM, Andrew Morton wrote: > (switched to email. Please respond via emailed reply-to-all, not via the > bugzilla web interface). > > On Wed, 25 Jul 2018 11:42:57 +0000 bugzilla-daemon@bugzilla.kernel.org wrote: > >> https://bugzilla.kernel.org/show_bug.cgi?id=200651 >> >> Bug ID: 200651 >> Summary: cgroups iptables-restor: vmalloc: allocation failure > > Thanks. Please do note the above request. > >> Product: Memory Management >> Version: 2.5 >> Kernel Version: 4.14 >> Hardware: All >> OS: Linux >> Tree: Mainline >> Status: NEW >> Severity: normal >> Priority: P1 >> Component: Other >> Assignee: akpm@linux-foundation.org >> Reporter: gnikolov@icdsoft.com >> Regression: No >> >> Created attachment 277505 [details] >> --> https://bugzilla.kernel.org/attachment.cgi?id=277505&action=edit >> iptables save >> >> After creating large number of cgroups and under memory pressure, iptables >> command fails with following error: >> >> "iptables-restor: vmalloc: allocation failure, allocated 3047424 of 3465216 >> bytes, mode:0x14010c0(GFP_KERNEL|__GFP_NORETRY), nodemask=(null)" This is likely the kvmalloc() in xt_alloc_table_info(). Between 4.13 and 4.17 it shouldn't use __GFP_NORETRY, but looks like commit 0537250fdc6c ("netfilter: x_tables: make allocation less aggressive") was backported to 4.14. Removing __GFP_NORETRY might help here, but bring back other issues. Less than 4MB is not that much though, maybe find some "sane" limit and use __GFP_NORETRY only above that? > I'm not sure what the problem is here, apart from iptables being > over-optimistic about vmalloc()'s abilities. > > Are cgroups having any impact on this, or is it simply vmalloc arena > fragmentation, and the iptables code should use some data structure > more sophisticated than a massive array? > > Maybe all that ccgroup metadata is contributing to the arena > fragmentation, but that allocations will be small and the two systems > should be able to live alongside, by being realistic about vmalloc. > >> System which is used to reproduce the bug is with 2 vcpus and 2GB of ram, >> but >> it happens on more powerfull systems. >> >> Steps to reproduce: >> >> mkdir /cgroup >> mount cgroup -t cgroup -omemory,pids,blkio,cpuacct /cgroup >> for a in `seq 1 1000`; do for b in `seq 1 4` ; do mkdir -p >> "/cgroup/user/$a/$b"; done; done >> >> Then in separate consoles >> >> cat /dev/vda > /dev/null >> ./test >> ./test >> i=0;while sleep 0 ; do iptables-restore < iptables.save ; i=$(($i+1)); echo >> $i; >> done >> >> Here is the source of "test" program and attached iptables.save. It happens >> also with smaller iptables.save file. >> >> #include <stdio.h> >> #include <stdlib.h> >> >> int main(void) { >> >> srand(time(NULL)); >> int i = 0, j = 0, randnum=0; >> int arr[6] = { 3072, 7168, 15360 , 31744, 64512, 130048}; >> while(1) { >> >> for (i = 0; i < 6 ; i++) { >> >> int *ptr = (int*) malloc(arr[i] * 93); >> >> for(j = 0 ; j < arr[i] * 93 / sizeof(int); j++) { >> *(ptr+j) = j+1; >> } >> >> free(ptr); >> } >> } >> } >> >
On 07/26/2018 09:26 AM, Michal Hocko wrote: > On Thu 26-07-18 09:18:57, Vlastimil Babka wrote: >> On 07/25/2018 09:52 PM, Andrew Morton wrote: >>> (switched to email. Please respond via emailed reply-to-all, not via the >>> bugzilla web interface). >>> >>> On Wed, 25 Jul 2018 11:42:57 +0000 bugzilla-daemon@bugzilla.kernel.org >>> wrote: >>> >>>> https://bugzilla.kernel.org/show_bug.cgi?id=200651 >>>> >>>> Bug ID: 200651 >>>> Summary: cgroups iptables-restor: vmalloc: allocation failure >>> >>> Thanks. Please do note the above request. >>> >>>> Product: Memory Management >>>> Version: 2.5 >>>> Kernel Version: 4.14 >>>> Hardware: All >>>> OS: Linux >>>> Tree: Mainline >>>> Status: NEW >>>> Severity: normal >>>> Priority: P1 >>>> Component: Other >>>> Assignee: akpm@linux-foundation.org >>>> Reporter: gnikolov@icdsoft.com >>>> Regression: No >>>> >>>> Created attachment 277505 [details] >>>> --> https://bugzilla.kernel.org/attachment.cgi?id=277505&action=edit >>>> iptables save >>>> >>>> After creating large number of cgroups and under memory pressure, iptables >>>> command fails with following error: >>>> >>>> "iptables-restor: vmalloc: allocation failure, allocated 3047424 of >>>> 3465216 >>>> bytes, mode:0x14010c0(GFP_KERNEL|__GFP_NORETRY), nodemask=(null)" >> >> This is likely the kvmalloc() in xt_alloc_table_info(). Between 4.13 and >> 4.17 it shouldn't use __GFP_NORETRY, but looks like commit 0537250fdc6c >> ("netfilter: x_tables: make allocation less aggressive") was backported >> to 4.14. Removing __GFP_NORETRY might help here, but bring back other >> issues. Less than 4MB is not that much though, maybe find some "sane" >> limit and use __GFP_NORETRY only above that? > > I have seen the same report via > http://lkml.kernel.org/r/df6f501c-8546-1f55-40b1-7e3a8f54d872@icdsoft.com > and the reported confirmed that kvmalloc is not a real culprit > http://lkml.kernel.org/r/d99a9598-808a-6968-4131-c3949b752004@icdsoft.com Hmm but that was revert of eacd86ca3b03 ("net/netfilter/x_tables.c: use kvmalloc() in xt_alloc_table_info()") which was the 4.13 commit that removed __GFP_NORETRY (there's no __GFP_NORETRY under net/netfilter in v4.14). I assume it was reverted on top of vanilla v4.14 as there would be conflict on the stable with 0537250fdc6c backport. So what should be tested to be sure is either vanilla v4.14 without stable backports, or latest v4.14.y with revert of 0537250fdc6c.
On Thu 26-07-18 09:34:58, Vlastimil Babka wrote: > On 07/26/2018 09:26 AM, Michal Hocko wrote: > > On Thu 26-07-18 09:18:57, Vlastimil Babka wrote: > >> On 07/25/2018 09:52 PM, Andrew Morton wrote: > >>> (switched to email. Please respond via emailed reply-to-all, not via the > >>> bugzilla web interface). > >>> > >>> On Wed, 25 Jul 2018 11:42:57 +0000 bugzilla-daemon@bugzilla.kernel.org > wrote: > >>> > >>>> https://bugzilla.kernel.org/show_bug.cgi?id=200651 > >>>> > >>>> Bug ID: 200651 > >>>> Summary: cgroups iptables-restor: vmalloc: allocation failure > >>> > >>> Thanks. Please do note the above request. > >>> > >>>> Product: Memory Management > >>>> Version: 2.5 > >>>> Kernel Version: 4.14 > >>>> Hardware: All > >>>> OS: Linux > >>>> Tree: Mainline > >>>> Status: NEW > >>>> Severity: normal > >>>> Priority: P1 > >>>> Component: Other > >>>> Assignee: akpm@linux-foundation.org > >>>> Reporter: gnikolov@icdsoft.com > >>>> Regression: No > >>>> > >>>> Created attachment 277505 [details] > >>>> --> https://bugzilla.kernel.org/attachment.cgi?id=277505&action=edit > >>>> iptables save > >>>> > >>>> After creating large number of cgroups and under memory pressure, > iptables > >>>> command fails with following error: > >>>> > >>>> "iptables-restor: vmalloc: allocation failure, allocated 3047424 of > 3465216 > >>>> bytes, mode:0x14010c0(GFP_KERNEL|__GFP_NORETRY), nodemask=(null)" > >> > >> This is likely the kvmalloc() in xt_alloc_table_info(). Between 4.13 and > >> 4.17 it shouldn't use __GFP_NORETRY, but looks like commit 0537250fdc6c > >> ("netfilter: x_tables: make allocation less aggressive") was backported > >> to 4.14. Removing __GFP_NORETRY might help here, but bring back other > >> issues. Less than 4MB is not that much though, maybe find some "sane" > >> limit and use __GFP_NORETRY only above that? > > > > I have seen the same report via > http://lkml.kernel.org/r/df6f501c-8546-1f55-40b1-7e3a8f54d872@icdsoft.com > > and the reported confirmed that kvmalloc is not a real culprit > > http://lkml.kernel.org/r/d99a9598-808a-6968-4131-c3949b752004@icdsoft.com > > Hmm but that was revert of eacd86ca3b03 ("net/netfilter/x_tables.c: use > kvmalloc() in xt_alloc_table_info()") which was the 4.13 commit that > removed __GFP_NORETRY (there's no __GFP_NORETRY under net/netfilter in > v4.14). I assume it was reverted on top of vanilla v4.14 as there would > be conflict on the stable with 0537250fdc6c backport. So what should be > tested to be sure is either vanilla v4.14 without stable backports, or > latest v4.14.y with revert of 0537250fdc6c. But 0537250fdc6c simply restored the previous NORETRY behavior from before eacd86ca3b03. So whatever causes these issues doesn't seem to be directly related to the kvmalloc change. Or do I miss what you are saying?
On 07/26/2018 09:42 AM, Michal Hocko wrote: > On Thu 26-07-18 09:34:58, Vlastimil Babka wrote: >> On 07/26/2018 09:26 AM, Michal Hocko wrote: >>> On Thu 26-07-18 09:18:57, Vlastimil Babka wrote: >>>> On 07/25/2018 09:52 PM, Andrew Morton wrote: >>>>> (switched to email. Please respond via emailed reply-to-all, not via the >>>>> bugzilla web interface). >>>>> >>>>> On Wed, 25 Jul 2018 11:42:57 +0000 bugzilla-daemon@bugzilla.kernel.org >>>>> wrote: >>>>> >>>>>> https://bugzilla.kernel.org/show_bug.cgi?id=200651 >>>>>> >>>>>> Bug ID: 200651 >>>>>> Summary: cgroups iptables-restor: vmalloc: allocation failure >>>>> >>>>> Thanks. Please do note the above request. >>>>> >>>>>> Product: Memory Management >>>>>> Version: 2.5 >>>>>> Kernel Version: 4.14 >>>>>> Hardware: All >>>>>> OS: Linux >>>>>> Tree: Mainline >>>>>> Status: NEW >>>>>> Severity: normal >>>>>> Priority: P1 >>>>>> Component: Other >>>>>> Assignee: akpm@linux-foundation.org >>>>>> Reporter: gnikolov@icdsoft.com >>>>>> Regression: No >>>>>> >>>>>> Created attachment 277505 [details] >>>>>> --> https://bugzilla.kernel.org/attachment.cgi?id=277505&action=edit >>>>>> iptables save >>>>>> >>>>>> After creating large number of cgroups and under memory pressure, >>>>>> iptables >>>>>> command fails with following error: >>>>>> >>>>>> "iptables-restor: vmalloc: allocation failure, allocated 3047424 of >>>>>> 3465216 >>>>>> bytes, mode:0x14010c0(GFP_KERNEL|__GFP_NORETRY), nodemask=(null)" >>>> >>>> This is likely the kvmalloc() in xt_alloc_table_info(). Between 4.13 and >>>> 4.17 it shouldn't use __GFP_NORETRY, but looks like commit 0537250fdc6c >>>> ("netfilter: x_tables: make allocation less aggressive") was backported >>>> to 4.14. Removing __GFP_NORETRY might help here, but bring back other >>>> issues. Less than 4MB is not that much though, maybe find some "sane" >>>> limit and use __GFP_NORETRY only above that? >>> >>> I have seen the same report via >>> http://lkml.kernel.org/r/df6f501c-8546-1f55-40b1-7e3a8f54d872@icdsoft.com >>> and the reported confirmed that kvmalloc is not a real culprit >>> http://lkml.kernel.org/r/d99a9598-808a-6968-4131-c3949b752004@icdsoft.com >> >> Hmm but that was revert of eacd86ca3b03 ("net/netfilter/x_tables.c: use >> kvmalloc() in xt_alloc_table_info()") which was the 4.13 commit that >> removed __GFP_NORETRY (there's no __GFP_NORETRY under net/netfilter in >> v4.14). I assume it was reverted on top of vanilla v4.14 as there would >> be conflict on the stable with 0537250fdc6c backport. So what should be >> tested to be sure is either vanilla v4.14 without stable backports, or >> latest v4.14.y with revert of 0537250fdc6c. > > But 0537250fdc6c simply restored the previous NORETRY behavior from > before eacd86ca3b03. So whatever causes these issues doesn't seem to be > directly related to the kvmalloc change. Or do I miss what you are > saying? I'm saying that although it's not a regression, as you say (the vmalloc() there was only for a few kernel versions called without __GFP_NORETRY), it's still possible that removing __GFP_NORETRY will fix the issue and thus we will rule out other possibilities.
On Thu 26-07-18 09:50:45, Vlastimil Babka wrote: > On 07/26/2018 09:42 AM, Michal Hocko wrote: > > On Thu 26-07-18 09:34:58, Vlastimil Babka wrote: > >> On 07/26/2018 09:26 AM, Michal Hocko wrote: > >>> On Thu 26-07-18 09:18:57, Vlastimil Babka wrote: > >>>> On 07/25/2018 09:52 PM, Andrew Morton wrote: > >>>>> (switched to email. Please respond via emailed reply-to-all, not via > the > >>>>> bugzilla web interface). > >>>>> > >>>>> On Wed, 25 Jul 2018 11:42:57 +0000 bugzilla-daemon@bugzilla.kernel.org > wrote: > >>>>> > >>>>>> https://bugzilla.kernel.org/show_bug.cgi?id=200651 > >>>>>> > >>>>>> Bug ID: 200651 > >>>>>> Summary: cgroups iptables-restor: vmalloc: allocation > failure > >>>>> > >>>>> Thanks. Please do note the above request. > >>>>> > >>>>>> Product: Memory Management > >>>>>> Version: 2.5 > >>>>>> Kernel Version: 4.14 > >>>>>> Hardware: All > >>>>>> OS: Linux > >>>>>> Tree: Mainline > >>>>>> Status: NEW > >>>>>> Severity: normal > >>>>>> Priority: P1 > >>>>>> Component: Other > >>>>>> Assignee: akpm@linux-foundation.org > >>>>>> Reporter: gnikolov@icdsoft.com > >>>>>> Regression: No > >>>>>> > >>>>>> Created attachment 277505 [details] > >>>>>> --> https://bugzilla.kernel.org/attachment.cgi?id=277505&action=edit > >>>>>> iptables save > >>>>>> > >>>>>> After creating large number of cgroups and under memory pressure, > iptables > >>>>>> command fails with following error: > >>>>>> > >>>>>> "iptables-restor: vmalloc: allocation failure, allocated 3047424 of > 3465216 > >>>>>> bytes, mode:0x14010c0(GFP_KERNEL|__GFP_NORETRY), nodemask=(null)" > >>>> > >>>> This is likely the kvmalloc() in xt_alloc_table_info(). Between 4.13 and > >>>> 4.17 it shouldn't use __GFP_NORETRY, but looks like commit 0537250fdc6c > >>>> ("netfilter: x_tables: make allocation less aggressive") was backported > >>>> to 4.14. Removing __GFP_NORETRY might help here, but bring back other > >>>> issues. Less than 4MB is not that much though, maybe find some "sane" > >>>> limit and use __GFP_NORETRY only above that? > >>> > >>> I have seen the same report via > http://lkml.kernel.org/r/df6f501c-8546-1f55-40b1-7e3a8f54d872@icdsoft.com > >>> and the reported confirmed that kvmalloc is not a real culprit > >>> http://lkml.kernel.org/r/d99a9598-808a-6968-4131-c3949b752004@icdsoft.com > >> > >> Hmm but that was revert of eacd86ca3b03 ("net/netfilter/x_tables.c: use > >> kvmalloc() in xt_alloc_table_info()") which was the 4.13 commit that > >> removed __GFP_NORETRY (there's no __GFP_NORETRY under net/netfilter in > >> v4.14). I assume it was reverted on top of vanilla v4.14 as there would > >> be conflict on the stable with 0537250fdc6c backport. So what should be > >> tested to be sure is either vanilla v4.14 without stable backports, or > >> latest v4.14.y with revert of 0537250fdc6c. > > > > But 0537250fdc6c simply restored the previous NORETRY behavior from > > before eacd86ca3b03. So whatever causes these issues doesn't seem to be > > directly related to the kvmalloc change. Or do I miss what you are > > saying? > > I'm saying that although it's not a regression, as you say (the > vmalloc() there was only for a few kernel versions called without > __GFP_NORETRY), it's still possible that removing __GFP_NORETRY will fix > the issue and thus we will rule out other possibilities. http://lkml.kernel.org/r/d99a9598-808a-6968-4131-c3949b752004@icdsoft.com claims that reverting eacd86ca3b03 didn't really help.
On 07/26/2018 10:03 AM, Michal Hocko wrote: > On Thu 26-07-18 09:50:45, Vlastimil Babka wrote: >> On 07/26/2018 09:42 AM, Michal Hocko wrote: >>> On Thu 26-07-18 09:34:58, Vlastimil Babka wrote: >>>> On 07/26/2018 09:26 AM, Michal Hocko wrote: >>>>> On Thu 26-07-18 09:18:57, Vlastimil Babka wrote: >>>>>> On 07/25/2018 09:52 PM, Andrew Morton wrote: >>>>>> >>>>>> This is likely the kvmalloc() in xt_alloc_table_info(). Between 4.13 and >>>>>> 4.17 it shouldn't use __GFP_NORETRY, but looks like commit 0537250fdc6c >>>>>> ("netfilter: x_tables: make allocation less aggressive") was backported >>>>>> to 4.14. Removing __GFP_NORETRY might help here, but bring back other >>>>>> issues. Less than 4MB is not that much though, maybe find some "sane" >>>>>> limit and use __GFP_NORETRY only above that? >>>>> >>>>> I have seen the same report via >>>>> http://lkml.kernel.org/r/df6f501c-8546-1f55-40b1-7e3a8f54d872@icdsoft.com >>>>> and the reported confirmed that kvmalloc is not a real culprit >>>>> http://lkml.kernel.org/r/d99a9598-808a-6968-4131-c3949b752004@icdsoft.com >>>> >>>> Hmm but that was revert of eacd86ca3b03 ("net/netfilter/x_tables.c: use >>>> kvmalloc() in xt_alloc_table_info()") which was the 4.13 commit that >>>> removed __GFP_NORETRY (there's no __GFP_NORETRY under net/netfilter in >>>> v4.14). I assume it was reverted on top of vanilla v4.14 as there would >>>> be conflict on the stable with 0537250fdc6c backport. So what should be >>>> tested to be sure is either vanilla v4.14 without stable backports, or >>>> latest v4.14.y with revert of 0537250fdc6c. >>> >>> But 0537250fdc6c simply restored the previous NORETRY behavior from >>> before eacd86ca3b03. So whatever causes these issues doesn't seem to be >>> directly related to the kvmalloc change. Or do I miss what you are >>> saying? >> >> I'm saying that although it's not a regression, as you say (the >> vmalloc() there was only for a few kernel versions called without >> __GFP_NORETRY), it's still possible that removing __GFP_NORETRY will fix >> the issue and thus we will rule out other possibilities. > > http://lkml.kernel.org/r/d99a9598-808a-6968-4131-c3949b752004@icdsoft.com > claims that reverting eacd86ca3b03 didn't really help. Of course not. eacd86ca3b03 *removed* __GFP_NORETRY, so the revert reintroduced it. I tried to explain it in the quoted part above starting with "Hmm but that was revert of eacd86ca3b03 ...". What I'm saying is that eacd86ca3b03 might have actually *fixed* (or rather prevented) this alloc failure, if there was not 0537250fdc6c and its 4.14 stable backport (the kernel bugzilla report says 4.14, I'm assuming new enough stable to contain 0537250fdc6c as the failure message contains __GFP_NORETRY). The mail you reference also says "seems that old version is masking errors", which confirms that we are indeed looking at the right vmalloc(), because eacd86ca3b03 also removed __GFP_NOWARN there (and thus the revert reintroduced it).
On 07/26/2018 10:31 AM, Vlastimil Babka wrote: > On 07/26/2018 10:03 AM, Michal Hocko wrote: >> On Thu 26-07-18 09:50:45, Vlastimil Babka wrote: >>> On 07/26/2018 09:42 AM, Michal Hocko wrote: >>>> On Thu 26-07-18 09:34:58, Vlastimil Babka wrote: >>>>> On 07/26/2018 09:26 AM, Michal Hocko wrote: >>>>>> On Thu 26-07-18 09:18:57, Vlastimil Babka wrote: >>>>>>> On 07/25/2018 09:52 PM, Andrew Morton wrote: >>>>>>> >>>>>>> This is likely the kvmalloc() in xt_alloc_table_info(). Between 4.13 >>>>>>> and >>>>>>> 4.17 it shouldn't use __GFP_NORETRY, but looks like commit 0537250fdc6c >>>>>>> ("netfilter: x_tables: make allocation less aggressive") was backported >>>>>>> to 4.14. Removing __GFP_NORETRY might help here, but bring back other >>>>>>> issues. Less than 4MB is not that much though, maybe find some "sane" >>>>>>> limit and use __GFP_NORETRY only above that? >>>>>> >>>>>> I have seen the same report via >>>>>> http://lkml.kernel.org/r/df6f501c-8546-1f55-40b1-7e3a8f54d872@icdsoft.com >>>>>> and the reported confirmed that kvmalloc is not a real culprit >>>>>> >>>>>> http://lkml.kernel.org/r/d99a9598-808a-6968-4131-c3949b752004@icdsoft.com >>>>> >>>>> Hmm but that was revert of eacd86ca3b03 ("net/netfilter/x_tables.c: use >>>>> kvmalloc() in xt_alloc_table_info()") which was the 4.13 commit that >>>>> removed __GFP_NORETRY (there's no __GFP_NORETRY under net/netfilter in >>>>> v4.14). I assume it was reverted on top of vanilla v4.14 as there would >>>>> be conflict on the stable with 0537250fdc6c backport. So what should be >>>>> tested to be sure is either vanilla v4.14 without stable backports, or >>>>> latest v4.14.y with revert of 0537250fdc6c. >>>> >>>> But 0537250fdc6c simply restored the previous NORETRY behavior from >>>> before eacd86ca3b03. So whatever causes these issues doesn't seem to be >>>> directly related to the kvmalloc change. Or do I miss what you are >>>> saying? >>> >>> I'm saying that although it's not a regression, as you say (the >>> vmalloc() there was only for a few kernel versions called without >>> __GFP_NORETRY), it's still possible that removing __GFP_NORETRY will fix >>> the issue and thus we will rule out other possibilities. >> >> http://lkml.kernel.org/r/d99a9598-808a-6968-4131-c3949b752004@icdsoft.com >> claims that reverting eacd86ca3b03 didn't really help. Ah, I see, that mail thread references a different kernel bugzilla #200639 which doesn't mention 4.14, but outright blames commit eacd86ca3b03. Yet the alloc fail message contains __GFP_NORETRY, so I still suspect the kernel also had 0537250fdc6c backport. Georgi can you please clarify which exact kernel version had the alloc failures, and how exactly you tested the revert (which version was the baseline for revert). Thanks. > Of course not. eacd86ca3b03 *removed* __GFP_NORETRY, so the revert > reintroduced it. I tried to explain it in the quoted part above starting > with "Hmm but that was revert of eacd86ca3b03 ...". What I'm saying is > that eacd86ca3b03 might have actually *fixed* (or rather prevented) this > alloc failure, if there was not 0537250fdc6c and its 4.14 stable > backport (the kernel bugzilla report says 4.14, I'm assuming new enough > stable to contain 0537250fdc6c as the failure message contains > __GFP_NORETRY). > > The mail you reference also says "seems that old version is masking > errors", which confirms that we are indeed looking at the right > vmalloc(), because eacd86ca3b03 also removed __GFP_NOWARN there (and > thus the revert reintroduced it). > >
Created attachment 277549 [details] attachment-20091-0.html *Georgi Nikolov* System Administrator www.icdsoft.com <http://www.icdsoft.com> On 07/26/2018 11:48 AM, Vlastimil Babka wrote: > On 07/26/2018 10:31 AM, Vlastimil Babka wrote: >> On 07/26/2018 10:03 AM, Michal Hocko wrote: >>> On Thu 26-07-18 09:50:45, Vlastimil Babka wrote: >>>> On 07/26/2018 09:42 AM, Michal Hocko wrote: >>>>> On Thu 26-07-18 09:34:58, Vlastimil Babka wrote: >>>>>> On 07/26/2018 09:26 AM, Michal Hocko wrote: >>>>>>> On Thu 26-07-18 09:18:57, Vlastimil Babka wrote: >>>>>>>> On 07/25/2018 09:52 PM, Andrew Morton wrote: >>>>>>>> >>>>>>>> This is likely the kvmalloc() in xt_alloc_table_info(). Between 4.13 >>>>>>>> and >>>>>>>> 4.17 it shouldn't use __GFP_NORETRY, but looks like commit >>>>>>>> 0537250fdc6c >>>>>>>> ("netfilter: x_tables: make allocation less aggressive") was >>>>>>>> backported >>>>>>>> to 4.14. Removing __GFP_NORETRY might help here, but bring back other >>>>>>>> issues. Less than 4MB is not that much though, maybe find some "sane" >>>>>>>> limit and use __GFP_NORETRY only above that? >>>>>>> I have seen the same report via >>>>>>> http://lkml.kernel.org/r/df6f501c-8546-1f55-40b1-7e3a8f54d872@icdsoft.com >>>>>>> and the reported confirmed that kvmalloc is not a real culprit >>>>>>> >>>>>>> http://lkml.kernel.org/r/d99a9598-808a-6968-4131-c3949b752004@icdsoft.com >>>>>> Hmm but that was revert of eacd86ca3b03 ("net/netfilter/x_tables.c: use >>>>>> kvmalloc() in xt_alloc_table_info()") which was the 4.13 commit that >>>>>> removed __GFP_NORETRY (there's no __GFP_NORETRY under net/netfilter in >>>>>> v4.14). I assume it was reverted on top of vanilla v4.14 as there would >>>>>> be conflict on the stable with 0537250fdc6c backport. So what should be >>>>>> tested to be sure is either vanilla v4.14 without stable backports, or >>>>>> latest v4.14.y with revert of 0537250fdc6c. >>>>> But 0537250fdc6c simply restored the previous NORETRY behavior from >>>>> before eacd86ca3b03. So whatever causes these issues doesn't seem to be >>>>> directly related to the kvmalloc change. Or do I miss what you are >>>>> saying? >>>> I'm saying that although it's not a regression, as you say (the >>>> vmalloc() there was only for a few kernel versions called without >>>> __GFP_NORETRY), it's still possible that removing __GFP_NORETRY will fix >>>> the issue and thus we will rule out other possibilities. >>> http://lkml.kernel.org/r/d99a9598-808a-6968-4131-c3949b752004@icdsoft.com >>> claims that reverting eacd86ca3b03 didn't really help. > Ah, I see, that mail thread references a different kernel bugzilla > #200639 which doesn't mention 4.14, but outright blames commit > eacd86ca3b03. Yet the alloc fail message contains __GFP_NORETRY, so I > still suspect the kernel also had 0537250fdc6c backport. Georgi can you > please clarify which exact kernel version had the alloc failures, and > how exactly you tested the revert (which version was the baseline for > revert). Thanks. > >> Of course not. eacd86ca3b03 *removed* __GFP_NORETRY, so the revert >> reintroduced it. I tried to explain it in the quoted part above starting >> with "Hmm but that was revert of eacd86ca3b03 ...". What I'm saying is >> that eacd86ca3b03 might have actually *fixed* (or rather prevented) this >> alloc failure, if there was not 0537250fdc6c and its 4.14 stable >> backport (the kernel bugzilla report says 4.14, I'm assuming new enough >> stable to contain 0537250fdc6c as the failure message contains >> __GFP_NORETRY). >> >> The mail you reference also says "seems that old version is masking >> errors", which confirms that we are indeed looking at the right >> vmalloc(), because eacd86ca3b03 also removed __GFP_NOWARN there (and >> thus the revert reintroduced it). >> >> Hello, Kernel that has allocation failures is 4.14.50. Here is the patch applied to this version which masks errors: --- net/netfilter/x_tables.c 2018-06-18 14:18:21.138347416 +0300 +++ net/netfilter/x_tables.c 2018-07-26 11:58:01.721932962 +0300 @@ -1059,9 +1059,19 @@ * than shoot all processes down before realizing there is nothing * more to reclaim. */ - info = kvmalloc(sz, GFP_KERNEL | __GFP_NORETRY); +/* info = kvmalloc(sz, GFP_KERNEL | __GFP_NORETRY); if (!info) return NULL; +*/ + + if (sz <= (PAGE_SIZE << PAGE_ALLOC_COSTLY_ORDER)) + info = kmalloc(sz, GFP_KERNEL | __GFP_NOWARN | __GFP_NORETRY); + if (!info) { + info = __vmalloc(sz, GFP_KERNEL | __GFP_NOWARN | __GFP_NORETRY, + PAGE_KERNEL); + if (!info) + return NULL; + } memset(info, 0, sizeof(*info)); info->size = size; I will try to reproduce it with only info = kvmalloc(sz, GFP_KERNEL); Regards, -- Georgi Nikolov
On 07/26/2018 12:02 PM, Georgi Nikolov wrote: > On 07/26/2018 11:48 AM, Vlastimil Babka wrote: >> On 07/26/2018 10:31 AM, Vlastimil Babka wrote: >>> On 07/26/2018 10:03 AM, Michal Hocko wrote: >>>> On Thu 26-07-18 09:50:45, Vlastimil Babka wrote: >>>>> On 07/26/2018 09:42 AM, Michal Hocko wrote: >>>>>> On Thu 26-07-18 09:34:58, Vlastimil Babka wrote: >>>>>>> On 07/26/2018 09:26 AM, Michal Hocko wrote: >>>>>>>> On Thu 26-07-18 09:18:57, Vlastimil Babka wrote: >>>>>>>>> On 07/25/2018 09:52 PM, Andrew Morton wrote: >>>>>>>>> >>>>>>>>> This is likely the kvmalloc() in xt_alloc_table_info(). Between 4.13 >>>>>>>>> and >>>>>>>>> 4.17 it shouldn't use __GFP_NORETRY, but looks like commit >>>>>>>>> 0537250fdc6c >>>>>>>>> ("netfilter: x_tables: make allocation less aggressive") was >>>>>>>>> backported >>>>>>>>> to 4.14. Removing __GFP_NORETRY might help here, but bring back other >>>>>>>>> issues. Less than 4MB is not that much though, maybe find some "sane" >>>>>>>>> limit and use __GFP_NORETRY only above that? >>>>>>>> I have seen the same report via >>>>>>>> http://lkml.kernel.org/r/df6f501c-8546-1f55-40b1-7e3a8f54d872@icdsoft.com >>>>>>>> and the reported confirmed that kvmalloc is not a real culprit >>>>>>>> >>>>>>>> http://lkml.kernel.org/r/d99a9598-808a-6968-4131-c3949b752004@icdsoft.com >>>>>>> Hmm but that was revert of eacd86ca3b03 ("net/netfilter/x_tables.c: use >>>>>>> kvmalloc() in xt_alloc_table_info()") which was the 4.13 commit that >>>>>>> removed __GFP_NORETRY (there's no __GFP_NORETRY under net/netfilter in >>>>>>> v4.14). I assume it was reverted on top of vanilla v4.14 as there would >>>>>>> be conflict on the stable with 0537250fdc6c backport. So what should be >>>>>>> tested to be sure is either vanilla v4.14 without stable backports, or >>>>>>> latest v4.14.y with revert of 0537250fdc6c. >>>>>> But 0537250fdc6c simply restored the previous NORETRY behavior from >>>>>> before eacd86ca3b03. So whatever causes these issues doesn't seem to be >>>>>> directly related to the kvmalloc change. Or do I miss what you are >>>>>> saying? >>>>> I'm saying that although it's not a regression, as you say (the >>>>> vmalloc() there was only for a few kernel versions called without >>>>> __GFP_NORETRY), it's still possible that removing __GFP_NORETRY will fix >>>>> the issue and thus we will rule out other possibilities. >>>> http://lkml.kernel.org/r/d99a9598-808a-6968-4131-c3949b752004@icdsoft.com >>>> claims that reverting eacd86ca3b03 didn't really help. >> Ah, I see, that mail thread references a different kernel bugzilla >> #200639 which doesn't mention 4.14, but outright blames commit >> eacd86ca3b03. Yet the alloc fail message contains __GFP_NORETRY, so I >> still suspect the kernel also had 0537250fdc6c backport. Georgi can you >> please clarify which exact kernel version had the alloc failures, and >> how exactly you tested the revert (which version was the baseline for >> revert). Thanks. >> >>> Of course not. eacd86ca3b03 *removed* __GFP_NORETRY, so the revert >>> reintroduced it. I tried to explain it in the quoted part above starting >>> with "Hmm but that was revert of eacd86ca3b03 ...". What I'm saying is >>> that eacd86ca3b03 might have actually *fixed* (or rather prevented) this >>> alloc failure, if there was not 0537250fdc6c and its 4.14 stable >>> backport (the kernel bugzilla report says 4.14, I'm assuming new enough >>> stable to contain 0537250fdc6c as the failure message contains >>> __GFP_NORETRY). >>> >>> The mail you reference also says "seems that old version is masking >>> errors", which confirms that we are indeed looking at the right >>> vmalloc(), because eacd86ca3b03 also removed __GFP_NOWARN there (and >>> thus the revert reintroduced it). >>> >>> > > Hello, > Kernel that has allocation failures is 4.14.50. > Here is the patch applied to this version which masks errors: > > --- net/netfilter/x_tables.c 2018-06-18 14:18:21.138347416 +0300 > +++ net/netfilter/x_tables.c 2018-07-26 11:58:01.721932962 +0300 > @@ -1059,9 +1059,19 @@ > * than shoot all processes down before realizing there is nothing > * more to reclaim. > */ > - info = kvmalloc(sz, GFP_KERNEL | __GFP_NORETRY); > +/* info = kvmalloc(sz, GFP_KERNEL | __GFP_NORETRY); > if (!info) > return NULL; > +*/ > + > + if (sz <= (PAGE_SIZE << PAGE_ALLOC_COSTLY_ORDER)) > + info = kmalloc(sz, GFP_KERNEL | __GFP_NOWARN | __GFP_NORETRY); > + if (!info) { > + info = __vmalloc(sz, GFP_KERNEL | __GFP_NOWARN | __GFP_NORETRY, > + PAGE_KERNEL); > + if (!info) > + return NULL; > + } > > memset(info, 0, sizeof(*info)); > info->size = size; > > > I will try to reproduce it with only > > info = kvmalloc(sz, GFP_KERNEL); > > Regards, > > -- > Georgi Nikolov > Hello, Without GFP_NORETRY problem disappears. What is correct way to fix it. - inside xt_alloc_table_info remove GFP_NORETRY from kvmalloc or add this flag only for sizes bigger than some threshold - inside kvmalloc_node remove GFP_NORETRY from __vmalloc_node_flags_caller (i don't know if it honors this flag, or the problem is elsewhere) Regards, -- Georgi Nikolov
On Mon 30-07-18 16:37:07, Georgi Nikolov wrote: > On 07/26/2018 12:02 PM, Georgi Nikolov wrote: [...] > > Here is the patch applied to this version which masks errors: > > > > --- net/netfilter/x_tables.c 2018-06-18 14:18:21.138347416 +0300 > > +++ net/netfilter/x_tables.c 2018-07-26 11:58:01.721932962 +0300 > > @@ -1059,9 +1059,19 @@ > > * than shoot all processes down before realizing there is nothing > > * more to reclaim. > > */ > > - info = kvmalloc(sz, GFP_KERNEL | __GFP_NORETRY); > > +/* info = kvmalloc(sz, GFP_KERNEL | __GFP_NORETRY); > > if (!info) > > return NULL; > > +*/ > > + > > + if (sz <= (PAGE_SIZE << PAGE_ALLOC_COSTLY_ORDER)) > > + info = kmalloc(sz, GFP_KERNEL | __GFP_NOWARN | __GFP_NORETRY); > > + if (!info) { > > + info = __vmalloc(sz, GFP_KERNEL | __GFP_NOWARN | __GFP_NORETRY, > > + PAGE_KERNEL); > > + if (!info) > > + return NULL; > > + } > > > > memset(info, 0, sizeof(*info)); > > info->size = size; > > > > > > I will try to reproduce it with only > > > > info = kvmalloc(sz, GFP_KERNEL); > > > > Regards, > > > > -- > > Georgi Nikolov > > > > Hello, > > Without GFP_NORETRY problem disappears. Hmm, there are two allocation paths which have __GFP_NORETRY here. I expect you have removed both of them, right? kvmalloc implicitly performs __GFP_NORETRY on kmalloc path but it doesn't have it for the vmalloc fallback. This would match kvmalloc(GFP_KERNEL). I thought you were testing this code path previously but there is some confusion flying around because you have claimed that the regressions started with eacd86ca3b036. If the regression is really with __GFP_NORETRY being used for the vmalloc fallback which would be kvmalloc(GFP_KERNEL | __GFP_NORETRY) then I am still confused because that would match the original code. > What is correct way to fix it. > - inside xt_alloc_table_info remove GFP_NORETRY from kvmalloc or add > this flag only for sizes bigger than some threshold This would reintroduce issue fixed by 0537250fdc6c8. Note that kvmalloc(GFP_KERNEL | __GFP_NORETRY) is more or less equivalent to the original code (well, except for __GFP_NOWARN). > - inside kvmalloc_node remove GFP_NORETRY from > __vmalloc_node_flags_caller (i don't know if it honors this flag, or > the problem is elsewhere) No, not really. This is basically equivalent to kvmalloc(GFP_KERNEL). I strongly suspect that this is not a regression in this code but rather a side effect of larger memory fragmentation caused by something else. In any case do you see this failure also without artificial test case with a standard workload?
On 07/30/2018 04:57 PM, Michal Hocko wrote: > On Mon 30-07-18 16:37:07, Georgi Nikolov wrote: >> On 07/26/2018 12:02 PM, Georgi Nikolov wrote: > [...] >>> Here is the patch applied to this version which masks errors: >>> >>> --- net/netfilter/x_tables.c 2018-06-18 14:18:21.138347416 +0300 >>> +++ net/netfilter/x_tables.c 2018-07-26 11:58:01.721932962 +0300 >>> @@ -1059,9 +1059,19 @@ >>> * than shoot all processes down before realizing there is nothing >>> * more to reclaim. >>> */ >>> - info = kvmalloc(sz, GFP_KERNEL | __GFP_NORETRY); >>> +/* info = kvmalloc(sz, GFP_KERNEL | __GFP_NORETRY); >>> if (!info) >>> return NULL; >>> +*/ >>> + >>> + if (sz <= (PAGE_SIZE << PAGE_ALLOC_COSTLY_ORDER)) >>> + info = kmalloc(sz, GFP_KERNEL | __GFP_NOWARN | __GFP_NORETRY); >>> + if (!info) { >>> + info = __vmalloc(sz, GFP_KERNEL | __GFP_NOWARN | __GFP_NORETRY, >>> + PAGE_KERNEL); >>> + if (!info) >>> + return NULL; >>> + } >>> >>> memset(info, 0, sizeof(*info)); >>> info->size = size; >>> >>> >>> I will try to reproduce it with only >>> >>> info = kvmalloc(sz, GFP_KERNEL); >>> >>> Regards, >>> >>> -- >>> Georgi Nikolov >>> >> Hello, >> >> Without GFP_NORETRY problem disappears. > Hmm, there are two allocation paths which have __GFP_NORETRY here. > I expect you have removed both of them, right? > > kvmalloc implicitly performs __GFP_NORETRY on kmalloc path but it > doesn't have it for the vmalloc fallback. This would match > kvmalloc(GFP_KERNEL). I thought you were testing this code path > previously but there is some confusion flying around because you have > claimed that the regressions started with eacd86ca3b036. If the > regression is really with __GFP_NORETRY being used for the vmalloc > fallback which would be kvmalloc(GFP_KERNEL | __GFP_NORETRY) then > I am still confused because that would match the original code. No i was wrong. The regression starts actually with 0537250fdc6c8. - old code, which opencodes kvmalloc, is masking error but error is there - kvmalloc without GFP_NORETRY works fine, but probably can consume a lot of memory - commit: eacd86ca3b036 - kvmalloc with GFP_NORETRY shows error - commit: 0537250fdc6c8 >> What is correct way to fix it. >> - inside xt_alloc_table_info remove GFP_NORETRY from kvmalloc or add >> this flag only for sizes bigger than some threshold > This would reintroduce issue fixed by 0537250fdc6c8. Note that > kvmalloc(GFP_KERNEL | __GFP_NORETRY) is more or less equivalent to the > original code (well, except for __GFP_NOWARN). So probably we should pass GFP_NORETRY only for large requests (above some threshold). > >> - inside kvmalloc_node remove GFP_NORETRY from >> __vmalloc_node_flags_caller (i don't know if it honors this flag, or >> the problem is elsewhere) > No, not really. This is basically equivalent to kvmalloc(GFP_KERNEL). > > I strongly suspect that this is not a regression in this code but rather > a side effect of larger memory fragmentation caused by something else. > In any case do you see this failure also without artificial test case > with a standard workload? Yes i can see failures with standard workload, in fact it was hard to reproduce it. Here is the error from production servers where allocation is smaller: iptables: vmalloc: allocation failure, allocated 131072 of 225280 bytes, mode:0x14010c0(GFP_KERNEL|__GFP_NORETRY), nodemask=(null) I didn't understand if vmalloc honors GFP_NORETRY. Regards, -- Georgi Nikolov
On Mon 30-07-18 18:54:24, Georgi Nikolov wrote: [...] > No i was wrong. The regression starts actually with 0537250fdc6c8. > - old code, which opencodes kvmalloc, is masking error but error is there > - kvmalloc without GFP_NORETRY works fine, but probably can consume a > lot of memory - commit: eacd86ca3b036 > - kvmalloc with GFP_NORETRY shows error - commit: 0537250fdc6c8 OK. > >> What is correct way to fix it. > >> - inside xt_alloc_table_info remove GFP_NORETRY from kvmalloc or add > >> this flag only for sizes bigger than some threshold > > This would reintroduce issue fixed by 0537250fdc6c8. Note that > > kvmalloc(GFP_KERNEL | __GFP_NORETRY) is more or less equivalent to the > > original code (well, except for __GFP_NOWARN). > > So probably we should pass GFP_NORETRY only for large requests (above > some threshold). What would be the treshold? This is not really my area so I just wanted to keep the original code semantic. > >> - inside kvmalloc_node remove GFP_NORETRY from > >> __vmalloc_node_flags_caller (i don't know if it honors this flag, or > >> the problem is elsewhere) > > No, not really. This is basically equivalent to kvmalloc(GFP_KERNEL). > > > > I strongly suspect that this is not a regression in this code but rather > > a side effect of larger memory fragmentation caused by something else. > > In any case do you see this failure also without artificial test case > > with a standard workload? > > Yes i can see failures with standard workload, in fact it was hard to > reproduce it. > Here is the error from production servers where allocation is smaller: > iptables: vmalloc: allocation failure, allocated 131072 of 225280 bytes, > mode:0x14010c0(GFP_KERNEL|__GFP_NORETRY), nodemask=(null) > > I didn't understand if vmalloc honors GFP_NORETRY. 0537250fdc6c8 changelog tries to explain. kvmalloc doesn't really support the GFP_NORETRY remantic because that would imply the request wouldn't trigger the oom killer but in rare cases this might happen (e.g. when page tables are allocated because those are hardcoded GFP_KERNEL). That being said, I have no objection to use GFP_KERNEL if it helps real workloads but we probably need some cap...
On 07/30/2018 09:38 PM, Michal Hocko wrote: > On Mon 30-07-18 18:54:24, Georgi Nikolov wrote: > [...] >> No i was wrong. The regression starts actually with 0537250fdc6c8. >> - old code, which opencodes kvmalloc, is masking error but error is there >> - kvmalloc without GFP_NORETRY works fine, but probably can consume a >> lot of memory - commit: eacd86ca3b036 >> - kvmalloc with GFP_NORETRY shows error - commit: 0537250fdc6c8 > OK. > >>>> What is correct way to fix it. >>>> - inside xt_alloc_table_info remove GFP_NORETRY from kvmalloc or add >>>> this flag only for sizes bigger than some threshold >>> This would reintroduce issue fixed by 0537250fdc6c8. Note that >>> kvmalloc(GFP_KERNEL | __GFP_NORETRY) is more or less equivalent to the >>> original code (well, except for __GFP_NOWARN). >> So probably we should pass GFP_NORETRY only for large requests (above >> some threshold). > What would be the treshold? This is not really my area so I just wanted > to keep the original code semantic. > >>>> - inside kvmalloc_node remove GFP_NORETRY from >>>> __vmalloc_node_flags_caller (i don't know if it honors this flag, or >>>> the problem is elsewhere) >>> No, not really. This is basically equivalent to kvmalloc(GFP_KERNEL). >>> >>> I strongly suspect that this is not a regression in this code but rather >>> a side effect of larger memory fragmentation caused by something else. >>> In any case do you see this failure also without artificial test case >>> with a standard workload? >> Yes i can see failures with standard workload, in fact it was hard to >> reproduce it. >> Here is the error from production servers where allocation is smaller: >> iptables: vmalloc: allocation failure, allocated 131072 of 225280 bytes, >> mode:0x14010c0(GFP_KERNEL|__GFP_NORETRY), nodemask=(null) >> >> I didn't understand if vmalloc honors GFP_NORETRY. > 0537250fdc6c8 changelog tries to explain. kvmalloc doesn't really > support the GFP_NORETRY remantic because that would imply the request > wouldn't trigger the oom killer but in rare cases this might happen > (e.g. when page tables are allocated because those are hardcoded GFP_KERNEL). > > That being said, I have no objection to use GFP_KERNEL if it helps real > workloads but we probably need some cap... Probably Vlastimil Babka can propose some limit: On Thu 26-07-18 09:18:57, Vlastimil Babka wrote: This is likely the kvmalloc() in xt_alloc_table_info(). Between 4.13 and 4.17 it shouldn't use __GFP_NORETRY, but looks like commit 0537250fdc6c ("netfilter: x_tables: make allocation less aggressive") was backported to 4.14. Removing __GFP_NORETRY might help here, but bring back other issues. Less than 4MB is not that much though, maybe find some "sane" limit and use __GFP_NORETRY only above that? Regards, -- Georgi Nikolov
On 07/30/2018 08:51 PM, Georgi Nikolov wrote: > On 07/30/2018 09:38 PM, Michal Hocko wrote: >> On Mon 30-07-18 18:54:24, Georgi Nikolov wrote: >> [...] >>> No i was wrong. The regression starts actually with 0537250fdc6c8. >>> - old code, which opencodes kvmalloc, is masking error but error is there >>> - kvmalloc without GFP_NORETRY works fine, but probably can consume a >>> lot of memory - commit: eacd86ca3b036 >>> - kvmalloc with GFP_NORETRY shows error - commit: 0537250fdc6c8 >> OK. >> >>>>> What is correct way to fix it. >>>>> - inside xt_alloc_table_info remove GFP_NORETRY from kvmalloc or add >>>>> this flag only for sizes bigger than some threshold >>>> This would reintroduce issue fixed by 0537250fdc6c8. Note that >>>> kvmalloc(GFP_KERNEL | __GFP_NORETRY) is more or less equivalent to the >>>> original code (well, except for __GFP_NOWARN). >>> So probably we should pass GFP_NORETRY only for large requests (above >>> some threshold). >> What would be the treshold? This is not really my area so I just wanted >> to keep the original code semantic. >> >>>>> - inside kvmalloc_node remove GFP_NORETRY from >>>>> __vmalloc_node_flags_caller (i don't know if it honors this flag, or >>>>> the problem is elsewhere) >>>> No, not really. This is basically equivalent to kvmalloc(GFP_KERNEL). >>>> >>>> I strongly suspect that this is not a regression in this code but rather >>>> a side effect of larger memory fragmentation caused by something else. >>>> In any case do you see this failure also without artificial test case >>>> with a standard workload? >>> Yes i can see failures with standard workload, in fact it was hard to >>> reproduce it. >>> Here is the error from production servers where allocation is smaller: >>> iptables: vmalloc: allocation failure, allocated 131072 of 225280 bytes, >>> mode:0x14010c0(GFP_KERNEL|__GFP_NORETRY), nodemask=(null) >>> >>> I didn't understand if vmalloc honors GFP_NORETRY. >> 0537250fdc6c8 changelog tries to explain. kvmalloc doesn't really >> support the GFP_NORETRY remantic because that would imply the request >> wouldn't trigger the oom killer but in rare cases this might happen >> (e.g. when page tables are allocated because those are hardcoded >> GFP_KERNEL). >> >> That being said, I have no objection to use GFP_KERNEL if it helps real >> workloads but we probably need some cap... > > Probably Vlastimil Babka can propose some limit: No, I think that's rather for the netfilter folks to decide. However, it seems there has been the debate already [1] and it was not found. The conclusion was that __GFP_NORETRY worked fine before, so it should work again after it's added back. But now we know that it doesn't... [1] https://lore.kernel.org/lkml/20180130140104.GE21609@dhcp22.suse.cz/T/#u > On Thu 26-07-18 09:18:57, Vlastimil Babka wrote: > This is likely the kvmalloc() in xt_alloc_table_info(). Between 4.13 and > 4.17 it shouldn't use __GFP_NORETRY, but looks like commit 0537250fdc6c > ("netfilter: x_tables: make allocation less aggressive") was backported > to 4.14. Removing __GFP_NORETRY might help here, but bring back other > issues. Less than 4MB is not that much though, maybe find some "sane" > limit and use __GFP_NORETRY only above that? > > > Regards, > > -- > Georgi Nikolov > >
On 07/31/2018 09:38 AM, Vlastimil Babka wrote: > On 07/30/2018 08:51 PM, Georgi Nikolov wrote: >> On 07/30/2018 09:38 PM, Michal Hocko wrote: >>> On Mon 30-07-18 18:54:24, Georgi Nikolov wrote: >>> [...] >>>> No i was wrong. The regression starts actually with 0537250fdc6c8. >>>> - old code, which opencodes kvmalloc, is masking error but error is there >>>> - kvmalloc without GFP_NORETRY works fine, but probably can consume a >>>> lot of memory - commit: eacd86ca3b036 >>>> - kvmalloc with GFP_NORETRY shows error - commit: 0537250fdc6c8 >>> OK. >>> >>>>>> What is correct way to fix it. >>>>>> - inside xt_alloc_table_info remove GFP_NORETRY from kvmalloc or add >>>>>> this flag only for sizes bigger than some threshold >>>>> This would reintroduce issue fixed by 0537250fdc6c8. Note that >>>>> kvmalloc(GFP_KERNEL | __GFP_NORETRY) is more or less equivalent to the >>>>> original code (well, except for __GFP_NOWARN). >>>> So probably we should pass GFP_NORETRY only for large requests (above >>>> some threshold). >>> What would be the treshold? This is not really my area so I just wanted >>> to keep the original code semantic. >>> >>>>>> - inside kvmalloc_node remove GFP_NORETRY from >>>>>> __vmalloc_node_flags_caller (i don't know if it honors this flag, or >>>>>> the problem is elsewhere) >>>>> No, not really. This is basically equivalent to kvmalloc(GFP_KERNEL). >>>>> >>>>> I strongly suspect that this is not a regression in this code but rather >>>>> a side effect of larger memory fragmentation caused by something else. >>>>> In any case do you see this failure also without artificial test case >>>>> with a standard workload? >>>> Yes i can see failures with standard workload, in fact it was hard to >>>> reproduce it. >>>> Here is the error from production servers where allocation is smaller: >>>> iptables: vmalloc: allocation failure, allocated 131072 of 225280 bytes, >>>> mode:0x14010c0(GFP_KERNEL|__GFP_NORETRY), nodemask=(null) >>>> >>>> I didn't understand if vmalloc honors GFP_NORETRY. >>> 0537250fdc6c8 changelog tries to explain. kvmalloc doesn't really >>> support the GFP_NORETRY remantic because that would imply the request >>> wouldn't trigger the oom killer but in rare cases this might happen >>> (e.g. when page tables are allocated because those are hardcoded >>> GFP_KERNEL). >>> >>> That being said, I have no objection to use GFP_KERNEL if it helps real >>> workloads but we probably need some cap... >> Probably Vlastimil Babka can propose some limit: > No, I think that's rather for the netfilter folks to decide. However, it > seems there has been the debate already [1] and it was not found. The > conclusion was that __GFP_NORETRY worked fine before, so it should work > again after it's added back. But now we know that it doesn't... > > [1] https://lore.kernel.org/lkml/20180130140104.GE21609@dhcp22.suse.cz/T/#u Yes i see. I will add Florian Westphal to CC list. netfilter-devel is already in this list so probably have to wait for their opinion. >> On Thu 26-07-18 09:18:57, Vlastimil Babka wrote: >> This is likely the kvmalloc() in xt_alloc_table_info(). Between 4.13 and >> 4.17 it shouldn't use __GFP_NORETRY, but looks like commit 0537250fdc6c >> ("netfilter: x_tables: make allocation less aggressive") was backported >> to 4.14. Removing __GFP_NORETRY might help here, but bring back other >> issues. Less than 4MB is not that much though, maybe find some "sane" >> limit and use __GFP_NORETRY only above that? >> >> >> Regards, >> >> -- >> Georgi Nikolov >> >> Regards, -- Georgi Nikolov
On 07/31/2018 05:05 PM, Florian Westphal wrote: > Georgi Nikolov <gnikolov@icdsoft.com> wrote: >>> No, I think that's rather for the netfilter folks to decide. However, it >>> seems there has been the debate already [1] and it was not found. The >>> conclusion was that __GFP_NORETRY worked fine before, so it should work >>> again after it's added back. But now we know that it doesn't... >>> >>> [1] https://lore.kernel.org/lkml/20180130140104.GE21609@dhcp22.suse.cz/T/#u >> Yes i see. I will add Florian Westphal to CC list. netfilter-devel is >> already in this list so probably have to wait for their opinion. > It hasn't changed, I think having OOM killer zap random processes > just because userspace wants to import large iptables ruleset is not a > good idea. And what about passing GFP_NORETRY only above some reasonable threshold? Or situation has to be handled in userspace. Regards, -- Georgi Nikolov
Georgi Nikolov <gnikolov@icdsoft.com> wrote: > > No, I think that's rather for the netfilter folks to decide. However, it > > seems there has been the debate already [1] and it was not found. The > > conclusion was that __GFP_NORETRY worked fine before, so it should work > > again after it's added back. But now we know that it doesn't... > > > > [1] https://lore.kernel.org/lkml/20180130140104.GE21609@dhcp22.suse.cz/T/#u > > Yes i see. I will add Florian Westphal to CC list. netfilter-devel is > already in this list so probably have to wait for their opinion. It hasn't changed, I think having OOM killer zap random processes just because userspace wants to import large iptables ruleset is not a good idea.
On 07/31/2018 04:25 PM, Georgi Nikolov wrote: > On 07/31/2018 05:05 PM, Florian Westphal wrote: >> Georgi Nikolov <gnikolov@icdsoft.com> wrote: >>>> No, I think that's rather for the netfilter folks to decide. However, it >>>> seems there has been the debate already [1] and it was not found. The >>>> conclusion was that __GFP_NORETRY worked fine before, so it should work >>>> again after it's added back. But now we know that it doesn't... >>>> >>>> [1] >>>> https://lore.kernel.org/lkml/20180130140104.GE21609@dhcp22.suse.cz/T/#u >>> Yes i see. I will add Florian Westphal to CC list. netfilter-devel is >>> already in this list so probably have to wait for their opinion. >> It hasn't changed, I think having OOM killer zap random processes >> just because userspace wants to import large iptables ruleset is not a >> good idea. > And what about passing GFP_NORETRY only above some reasonable threshold? What is the reasonable threshold? > Or situation has to be handled in userspace. How? > > Regards, > > -- > Georgi Nikolov >
On 07/31/2018 04:05 PM, Florian Westphal wrote: > Georgi Nikolov <gnikolov@icdsoft.com> wrote: >>> No, I think that's rather for the netfilter folks to decide. However, it >>> seems there has been the debate already [1] and it was not found. The >>> conclusion was that __GFP_NORETRY worked fine before, so it should work >>> again after it's added back. But now we know that it doesn't... >>> >>> [1] https://lore.kernel.org/lkml/20180130140104.GE21609@dhcp22.suse.cz/T/#u >> >> Yes i see. I will add Florian Westphal to CC list. netfilter-devel is >> already in this list so probably have to wait for their opinion. > > It hasn't changed, I think having OOM killer zap random processes > just because userspace wants to import large iptables ruleset is not a > good idea. If we denied the allocation instead of OOM (e.g. by using __GFP_RETRY_MAYFAIL), a slightly smaller one may succeed, still leaving the system without much memory, so it will invoke OOM killer sooner or later anyway. I don't see any silver-bullet solution, unfortunately. If this can be abused by (multiple) namespaces, then they have to be contained by kmemcg as that's the generic mechanism intended for this. Then we could use the __GFP_RETRY_MAYFAIL. The only limit we could impose to outright deny the allocation (to prevent obvious bugs/admin mistakes or abuses) could be based on the amount of RAM, as was suggested in the old thread. __GFP_NORETRY might look like a good match at first sight as that stops allocating when "reclaim becomes hard" which means the system is still relatively far from OOM. But it's not reliable in principle, and as this bug report shows. That's fine when __GFP_NORETRY is used for optimistic allocations that have some other fallback (e.g. huge page with fallback to base page), but far from ideal when failure means returning -ENOMEM to userspace.
On Wed 01-08-18 09:34:23, Vlastimil Babka wrote: > On 07/31/2018 04:05 PM, Florian Westphal wrote: > > Georgi Nikolov <gnikolov@icdsoft.com> wrote: > >>> No, I think that's rather for the netfilter folks to decide. However, it > >>> seems there has been the debate already [1] and it was not found. The > >>> conclusion was that __GFP_NORETRY worked fine before, so it should work > >>> again after it's added back. But now we know that it doesn't... > >>> > >>> [1] > https://lore.kernel.org/lkml/20180130140104.GE21609@dhcp22.suse.cz/T/#u > >> > >> Yes i see. I will add Florian Westphal to CC list. netfilter-devel is > >> already in this list so probably have to wait for their opinion. > > > > It hasn't changed, I think having OOM killer zap random processes > > just because userspace wants to import large iptables ruleset is not a > > good idea. > > If we denied the allocation instead of OOM (e.g. by using > __GFP_RETRY_MAYFAIL), a slightly smaller one may succeed, still leaving > the system without much memory, so it will invoke OOM killer sooner or > later anyway. > > I don't see any silver-bullet solution, unfortunately. If this can be > abused by (multiple) namespaces, then they have to be contained by > kmemcg as that's the generic mechanism intended for this. Then we could > use the __GFP_RETRY_MAYFAIL. > The only limit we could impose to outright deny the allocation (to > prevent obvious bugs/admin mistakes or abuses) could be based on the > amount of RAM, as was suggested in the old thread. > > __GFP_NORETRY might look like a good match at first sight as that stops > allocating when "reclaim becomes hard" which means the system is still > relatively far from OOM. But it's not reliable in principle, and as this > bug report shows. That's fine when __GFP_NORETRY is used for optimistic > allocations that have some other fallback (e.g. huge page with fallback > to base page), but far from ideal when failure means returning -ENOMEM > to userspace. I absolutely agree. The whole __GFP_NORETRY is quite dubious TBH. I have used it to get the original behavior because the change wasn't really intended to make functional changes. But consideg ring this requires higher privileges then I fail to see where the distrust comes from. If this is really about untrusted root in a namespace then the proper way is to use __GFP_ACCOUNT and limit that via kmemc. __GFP_NORETRY can fail really easily if the kswapd doesn't keep the pace with the allocations which might be completely unrelated to this particular request.
*Georgi Nikolov* System Administrator www.icdsoft.com <http://www.icdsoft.com> On 08/01/2018 11:33 AM, Michal Hocko wrote: > On Wed 01-08-18 09:34:23, Vlastimil Babka wrote: >> On 07/31/2018 04:05 PM, Florian Westphal wrote: >>> Georgi Nikolov <gnikolov@icdsoft.com> wrote: >>>>> No, I think that's rather for the netfilter folks to decide. However, it >>>>> seems there has been the debate already [1] and it was not found. The >>>>> conclusion was that __GFP_NORETRY worked fine before, so it should work >>>>> again after it's added back. But now we know that it doesn't... >>>>> >>>>> [1] >>>>> https://lore.kernel.org/lkml/20180130140104.GE21609@dhcp22.suse.cz/T/#u >>>> Yes i see. I will add Florian Westphal to CC list. netfilter-devel is >>>> already in this list so probably have to wait for their opinion. >>> It hasn't changed, I think having OOM killer zap random processes >>> just because userspace wants to import large iptables ruleset is not a >>> good idea. >> If we denied the allocation instead of OOM (e.g. by using >> __GFP_RETRY_MAYFAIL), a slightly smaller one may succeed, still leaving >> the system without much memory, so it will invoke OOM killer sooner or >> later anyway. >> >> I don't see any silver-bullet solution, unfortunately. If this can be >> abused by (multiple) namespaces, then they have to be contained by >> kmemcg as that's the generic mechanism intended for this. Then we could >> use the __GFP_RETRY_MAYFAIL. >> The only limit we could impose to outright deny the allocation (to >> prevent obvious bugs/admin mistakes or abuses) could be based on the >> amount of RAM, as was suggested in the old thread. Can we make this configurable - on/off switch or size above which to pass GFP_NORETRY. Probably hard coded based on amount of RAM is a good idea too. >> __GFP_NORETRY might look like a good match at first sight as that stops >> allocating when "reclaim becomes hard" which means the system is still >> relatively far from OOM. But it's not reliable in principle, and as this >> bug report shows. That's fine when __GFP_NORETRY is used for optimistic >> allocations that have some other fallback (e.g. huge page with fallback >> to base page), but far from ideal when failure means returning -ENOMEM >> to userspace. > I absolutely agree. The whole __GFP_NORETRY is quite dubious TBH. I have > used it to get the original behavior because the change wasn't really > intended to make functional changes. But consideg ring this requires > higher privileges then I fail to see where the distrust comes from. If > this is really about untrusted root in a namespace then the proper way > is to use __GFP_ACCOUNT and limit that via kmemc. > > __GFP_NORETRY can fail really easily if the kswapd doesn't keep the pace > with the allocations which might be completely unrelated to this > particular request.
On Wed 01-08-18 19:03:03, Georgi Nikolov wrote: > > *Georgi Nikolov* > System Administrator > www.icdsoft.com <http://www.icdsoft.com> > > On 08/01/2018 11:33 AM, Michal Hocko wrote: > > On Wed 01-08-18 09:34:23, Vlastimil Babka wrote: > >> On 07/31/2018 04:05 PM, Florian Westphal wrote: > >>> Georgi Nikolov <gnikolov@icdsoft.com> wrote: > >>>>> No, I think that's rather for the netfilter folks to decide. However, > it > >>>>> seems there has been the debate already [1] and it was not found. The > >>>>> conclusion was that __GFP_NORETRY worked fine before, so it should work > >>>>> again after it's added back. But now we know that it doesn't... > >>>>> > >>>>> [1] > https://lore.kernel.org/lkml/20180130140104.GE21609@dhcp22.suse.cz/T/#u > >>>> Yes i see. I will add Florian Westphal to CC list. netfilter-devel is > >>>> already in this list so probably have to wait for their opinion. > >>> It hasn't changed, I think having OOM killer zap random processes > >>> just because userspace wants to import large iptables ruleset is not a > >>> good idea. > >> If we denied the allocation instead of OOM (e.g. by using > >> __GFP_RETRY_MAYFAIL), a slightly smaller one may succeed, still leaving > >> the system without much memory, so it will invoke OOM killer sooner or > >> later anyway. > >> > >> I don't see any silver-bullet solution, unfortunately. If this can be > >> abused by (multiple) namespaces, then they have to be contained by > >> kmemcg as that's the generic mechanism intended for this. Then we could > >> use the __GFP_RETRY_MAYFAIL. > >> The only limit we could impose to outright deny the allocation (to > >> prevent obvious bugs/admin mistakes or abuses) could be based on the > >> amount of RAM, as was suggested in the old thread. > > Can we make this configurable - on/off switch or size above which > to pass GFP_NORETRY. Yet another tunable? How do you decide which one to select? Seriously, configuration knobs sound attractive but they are rarely a good idea. Either we trust privileged users or we don't and we have kmem accounting for that. > Probably hard coded based on amount of RAM is a good idea too. How do you scale that? In other words, why don't we simply do the following? Note that this is not tested. I have also no idea what is the lifetime of this allocation. Is it bound to any specific process or is it a namespace bound? If the later then the memcg OOM killer might wipe the whole memcg down without making any progress. This would make the whole namespace unsuable until somebody intervenes. Is this acceptable? --- From 4dec96eb64954a7e58264ed551afadf62ca4c5f7 Mon Sep 17 00:00:00 2001 From: Michal Hocko <mhocko@suse.com> Date: Thu, 2 Aug 2018 10:38:57 +0200 Subject: [PATCH] netfilter/x_tables: do not fail xt_alloc_table_info too easilly eacd86ca3b03 ("net/netfilter/x_tables.c: use kvmalloc() in xt_alloc_table_info()") has unintentionally fortified xt_alloc_table_info allocation when __GFP_RETRY has been dropped from the vmalloc fallback. Later on there was a syzbot report that this can lead to OOM killer invocations when tables are too large and 0537250fdc6c ("netfilter: x_tables: make allocation less aggressive") has been merged to restore the original behavior. Georgi Nikolov however noticed that he is not able to install his iptables anymore so this can be seen as a regression. The primary argument for 0537250fdc6c was that this allocation path shouldn't really trigger the OOM killer and kill innocent tasks. On the other hand the interface requires root and as such should allow what the admin asks for. Root inside a namespaces makes this more complicated because those might be not trusted in general. If they are not then such namespaces should be restricted anyway. Therefore drop the __GFP_NORETRY and replace it by __GFP_ACCOUNT to enfore memcg constrains on it. Fixes: 0537250fdc6c ("netfilter: x_tables: make allocation less aggressive") Reported-by: Georgi Nikolov <gnikolov@icdsoft.com> Suggested-by: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Michal Hocko <mhocko@suse.com> --- net/netfilter/x_tables.c | 7 +------ 1 file changed, 1 insertion(+), 6 deletions(-) diff --git a/net/netfilter/x_tables.c b/net/netfilter/x_tables.c index d0d8397c9588..b769408e04ab 100644 --- a/net/netfilter/x_tables.c +++ b/net/netfilter/x_tables.c @@ -1178,12 +1178,7 @@ struct xt_table_info *xt_alloc_table_info(unsigned int size) if (sz < sizeof(*info) || sz >= XT_MAX_TABLE_SIZE) return NULL; - /* __GFP_NORETRY is not fully supported by kvmalloc but it should - * work reasonably well if sz is too large and bail out rather - * than shoot all processes down before realizing there is nothing - * more to reclaim. - */ - info = kvmalloc(sz, GFP_KERNEL | __GFP_NORETRY); + info = kvmalloc(sz, GFP_KERNEL | __GFP_ACCOUNT); if (!info) return NULL;
On Thu, Aug 02, 2018 at 10:50:43AM +0200, Michal Hocko wrote: > On Wed 01-08-18 19:03:03, Georgi Nikolov wrote: > > > > *Georgi Nikolov* > > System Administrator > > www.icdsoft.com <http://www.icdsoft.com> > > > > On 08/01/2018 11:33 AM, Michal Hocko wrote: > > > On Wed 01-08-18 09:34:23, Vlastimil Babka wrote: > > >> On 07/31/2018 04:05 PM, Florian Westphal wrote: > > >>> Georgi Nikolov <gnikolov@icdsoft.com> wrote: > > >>>>> No, I think that's rather for the netfilter folks to decide. However, > it > > >>>>> seems there has been the debate already [1] and it was not found. The > > >>>>> conclusion was that __GFP_NORETRY worked fine before, so it should > work > > >>>>> again after it's added back. But now we know that it doesn't... > > >>>>> > > >>>>> [1] > https://lore.kernel.org/lkml/20180130140104.GE21609@dhcp22.suse.cz/T/#u > > >>>> Yes i see. I will add Florian Westphal to CC list. netfilter-devel is > > >>>> already in this list so probably have to wait for their opinion. > > >>> It hasn't changed, I think having OOM killer zap random processes > > >>> just because userspace wants to import large iptables ruleset is not a > > >>> good idea. > > >> If we denied the allocation instead of OOM (e.g. by using > > >> __GFP_RETRY_MAYFAIL), a slightly smaller one may succeed, still leaving > > >> the system without much memory, so it will invoke OOM killer sooner or > > >> later anyway. > > >> > > >> I don't see any silver-bullet solution, unfortunately. If this can be > > >> abused by (multiple) namespaces, then they have to be contained by > > >> kmemcg as that's the generic mechanism intended for this. Then we could > > >> use the __GFP_RETRY_MAYFAIL. > > >> The only limit we could impose to outright deny the allocation (to > > >> prevent obvious bugs/admin mistakes or abuses) could be based on the > > >> amount of RAM, as was suggested in the old thread. > > > > Can we make this configurable - on/off switch or size above which > > to pass GFP_NORETRY. > > Yet another tunable? How do you decide which one to select? Seriously, > configuration knobs sound attractive but they are rarely a good idea. > Either we trust privileged users or we don't and we have kmem accounting > for that. > > > Probably hard coded based on amount of RAM is a good idea too. > > How do you scale that? > > In other words, why don't we simply do the following? Note that this is > not tested. I have also no idea what is the lifetime of this allocation. > Is it bound to any specific process or is it a namespace bound? If the > later then the memcg OOM killer might wipe the whole memcg down without > making any progress. This would make the whole namespace unsuable until > somebody intervenes. Is this acceptable? > --- > From 4dec96eb64954a7e58264ed551afadf62ca4c5f7 Mon Sep 17 00:00:00 2001 > From: Michal Hocko <mhocko@suse.com> > Date: Thu, 2 Aug 2018 10:38:57 +0200 > Subject: [PATCH] netfilter/x_tables: do not fail xt_alloc_table_info too > easilly > > eacd86ca3b03 ("net/netfilter/x_tables.c: use kvmalloc() > in xt_alloc_table_info()") has unintentionally fortified > xt_alloc_table_info allocation when __GFP_RETRY has been dropped from > the vmalloc fallback. Later on there was a syzbot report that this > can lead to OOM killer invocations when tables are too large and > 0537250fdc6c ("netfilter: x_tables: make allocation less aggressive") > has been merged to restore the original behavior. Georgi Nikolov however > noticed that he is not able to install his iptables anymore so this can > be seen as a regression. > > The primary argument for 0537250fdc6c was that this allocation path > shouldn't really trigger the OOM killer and kill innocent tasks. On the > other hand the interface requires root and as such should allow what the > admin asks for. Root inside a namespaces makes this more complicated > because those might be not trusted in general. If they are not then such > namespaces should be restricted anyway. Therefore drop the __GFP_NORETRY > and replace it by __GFP_ACCOUNT to enfore memcg constrains on it. > > Fixes: 0537250fdc6c ("netfilter: x_tables: make allocation less aggressive") > Reported-by: Georgi Nikolov <gnikolov@icdsoft.com> > Suggested-by: Vlastimil Babka <vbabka@suse.cz> > Signed-off-by: Michal Hocko <mhocko@suse.com> > --- > net/netfilter/x_tables.c | 7 +------ > 1 file changed, 1 insertion(+), 6 deletions(-) > > diff --git a/net/netfilter/x_tables.c b/net/netfilter/x_tables.c > index d0d8397c9588..b769408e04ab 100644 > --- a/net/netfilter/x_tables.c > +++ b/net/netfilter/x_tables.c > @@ -1178,12 +1178,7 @@ struct xt_table_info *xt_alloc_table_info(unsigned int > size) > if (sz < sizeof(*info) || sz >= XT_MAX_TABLE_SIZE) > return NULL; > > - /* __GFP_NORETRY is not fully supported by kvmalloc but it should > - * work reasonably well if sz is too large and bail out rather > - * than shoot all processes down before realizing there is nothing > - * more to reclaim. > - */ > - info = kvmalloc(sz, GFP_KERNEL | __GFP_NORETRY); > + info = kvmalloc(sz, GFP_KERNEL | __GFP_ACCOUNT); I guess the large number of cgroups match is helping to consume a lot of memory very quickly? We have a PATH_MAX in struct xt_cgroup_info_v1. Can we just trim off the path to something more reasonable? We did similar things to other extensions. Only concern here is what is a reasonable path length to describe the cgroup.
On Thu 02-08-18 11:25:49, Pablo Neira Ayuso wrote: > On Thu, Aug 02, 2018 at 10:50:43AM +0200, Michal Hocko wrote: [...] > > diff --git a/net/netfilter/x_tables.c b/net/netfilter/x_tables.c > > index d0d8397c9588..b769408e04ab 100644 > > --- a/net/netfilter/x_tables.c > > +++ b/net/netfilter/x_tables.c > > @@ -1178,12 +1178,7 @@ struct xt_table_info *xt_alloc_table_info(unsigned > int size) > > if (sz < sizeof(*info) || sz >= XT_MAX_TABLE_SIZE) > > return NULL; > > > > - /* __GFP_NORETRY is not fully supported by kvmalloc but it should > > - * work reasonably well if sz is too large and bail out rather > > - * than shoot all processes down before realizing there is nothing > > - * more to reclaim. > > - */ > > - info = kvmalloc(sz, GFP_KERNEL | __GFP_NORETRY); > > + info = kvmalloc(sz, GFP_KERNEL | __GFP_ACCOUNT); > > I guess the large number of cgroups match is helping to consume a lot > of memory very quickly? We have a PATH_MAX in struct xt_cgroup_info_v1. I really fail to see how that is related to the patch here.
On 08/02/2018 11:50 AM, Michal Hocko wrote: > On Wed 01-08-18 19:03:03, Georgi Nikolov wrote: >> *Georgi Nikolov* >> System Administrator >> www.icdsoft.com <http://www.icdsoft.com> >> >> On 08/01/2018 11:33 AM, Michal Hocko wrote: >>> On Wed 01-08-18 09:34:23, Vlastimil Babka wrote: >>>> On 07/31/2018 04:05 PM, Florian Westphal wrote: >>>>> Georgi Nikolov <gnikolov@icdsoft.com> wrote: >>>>>>> No, I think that's rather for the netfilter folks to decide. However, >>>>>>> it >>>>>>> seems there has been the debate already [1] and it was not found. The >>>>>>> conclusion was that __GFP_NORETRY worked fine before, so it should work >>>>>>> again after it's added back. But now we know that it doesn't... >>>>>>> >>>>>>> [1] >>>>>>> https://lore.kernel.org/lkml/20180130140104.GE21609@dhcp22.suse.cz/T/#u >>>>>> Yes i see. I will add Florian Westphal to CC list. netfilter-devel is >>>>>> already in this list so probably have to wait for their opinion. >>>>> It hasn't changed, I think having OOM killer zap random processes >>>>> just because userspace wants to import large iptables ruleset is not a >>>>> good idea. >>>> If we denied the allocation instead of OOM (e.g. by using >>>> __GFP_RETRY_MAYFAIL), a slightly smaller one may succeed, still leaving >>>> the system without much memory, so it will invoke OOM killer sooner or >>>> later anyway. >>>> >>>> I don't see any silver-bullet solution, unfortunately. If this can be >>>> abused by (multiple) namespaces, then they have to be contained by >>>> kmemcg as that's the generic mechanism intended for this. Then we could >>>> use the __GFP_RETRY_MAYFAIL. >>>> The only limit we could impose to outright deny the allocation (to >>>> prevent obvious bugs/admin mistakes or abuses) could be based on the >>>> amount of RAM, as was suggested in the old thread. >> Can we make this configurable - on/off switch or size above which >> to pass GFP_NORETRY. > Yet another tunable? How do you decide which one to select? Seriously, > configuration knobs sound attractive but they are rarely a good idea. > Either we trust privileged users or we don't and we have kmem accounting > for that. > >> Probably hard coded based on amount of RAM is a good idea too. > How do you scale that? > > In other words, why don't we simply do the following? Note that this is > not tested. I have also no idea what is the lifetime of this allocation. > Is it bound to any specific process or is it a namespace bound? If the > later then the memcg OOM killer might wipe the whole memcg down without > making any progress. This would make the whole namespace unsuable until > somebody intervenes. Is this acceptable? > --- > From 4dec96eb64954a7e58264ed551afadf62ca4c5f7 Mon Sep 17 00:00:00 2001 > From: Michal Hocko <mhocko@suse.com> > Date: Thu, 2 Aug 2018 10:38:57 +0200 > Subject: [PATCH] netfilter/x_tables: do not fail xt_alloc_table_info too > easilly > > eacd86ca3b03 ("net/netfilter/x_tables.c: use kvmalloc() > in xt_alloc_table_info()") has unintentionally fortified > xt_alloc_table_info allocation when __GFP_RETRY has been dropped from > the vmalloc fallback. Later on there was a syzbot report that this > can lead to OOM killer invocations when tables are too large and > 0537250fdc6c ("netfilter: x_tables: make allocation less aggressive") > has been merged to restore the original behavior. Georgi Nikolov however > noticed that he is not able to install his iptables anymore so this can > be seen as a regression. > > The primary argument for 0537250fdc6c was that this allocation path > shouldn't really trigger the OOM killer and kill innocent tasks. On the > other hand the interface requires root and as such should allow what the > admin asks for. Root inside a namespaces makes this more complicated > because those might be not trusted in general. If they are not then such > namespaces should be restricted anyway. Therefore drop the __GFP_NORETRY > and replace it by __GFP_ACCOUNT to enfore memcg constrains on it. > > Fixes: 0537250fdc6c ("netfilter: x_tables: make allocation less aggressive") > Reported-by: Georgi Nikolov <gnikolov@icdsoft.com> > Suggested-by: Vlastimil Babka <vbabka@suse.cz> > Signed-off-by: Michal Hocko <mhocko@suse.com> > --- > net/netfilter/x_tables.c | 7 +------ > 1 file changed, 1 insertion(+), 6 deletions(-) > > diff --git a/net/netfilter/x_tables.c b/net/netfilter/x_tables.c > index d0d8397c9588..b769408e04ab 100644 > --- a/net/netfilter/x_tables.c > +++ b/net/netfilter/x_tables.c > @@ -1178,12 +1178,7 @@ struct xt_table_info *xt_alloc_table_info(unsigned int > size) > if (sz < sizeof(*info) || sz >= XT_MAX_TABLE_SIZE) > return NULL; > > - /* __GFP_NORETRY is not fully supported by kvmalloc but it should > - * work reasonably well if sz is too large and bail out rather > - * than shoot all processes down before realizing there is nothing > - * more to reclaim. > - */ > - info = kvmalloc(sz, GFP_KERNEL | __GFP_NORETRY); > + info = kvmalloc(sz, GFP_KERNEL | __GFP_ACCOUNT); > if (!info) > return NULL; > I will check if this change fixes the problem. Regards, -- Georgi Nikolov
On 08/06/2018 11:42 AM, Georgi Nikolov wrote: > On 08/02/2018 11:50 AM, Michal Hocko wrote: >> In other words, why don't we simply do the following? Note that this is >> not tested. I have also no idea what is the lifetime of this allocation. >> Is it bound to any specific process or is it a namespace bound? If the >> later then the memcg OOM killer might wipe the whole memcg down without >> making any progress. This would make the whole namespace unsuable until >> somebody intervenes. Is this acceptable? >> --- >> From 4dec96eb64954a7e58264ed551afadf62ca4c5f7 Mon Sep 17 00:00:00 2001 >> From: Michal Hocko <mhocko@suse.com> >> Date: Thu, 2 Aug 2018 10:38:57 +0200 >> Subject: [PATCH] netfilter/x_tables: do not fail xt_alloc_table_info too >> easilly >> >> eacd86ca3b03 ("net/netfilter/x_tables.c: use kvmalloc() >> in xt_alloc_table_info()") has unintentionally fortified >> xt_alloc_table_info allocation when __GFP_RETRY has been dropped from >> the vmalloc fallback. Later on there was a syzbot report that this >> can lead to OOM killer invocations when tables are too large and >> 0537250fdc6c ("netfilter: x_tables: make allocation less aggressive") >> has been merged to restore the original behavior. Georgi Nikolov however >> noticed that he is not able to install his iptables anymore so this can >> be seen as a regression. >> >> The primary argument for 0537250fdc6c was that this allocation path >> shouldn't really trigger the OOM killer and kill innocent tasks. On the >> other hand the interface requires root and as such should allow what the >> admin asks for. Root inside a namespaces makes this more complicated >> because those might be not trusted in general. If they are not then such >> namespaces should be restricted anyway. Therefore drop the __GFP_NORETRY >> and replace it by __GFP_ACCOUNT to enfore memcg constrains on it. >> >> Fixes: 0537250fdc6c ("netfilter: x_tables: make allocation less aggressive") >> Reported-by: Georgi Nikolov <gnikolov@icdsoft.com> >> Suggested-by: Vlastimil Babka <vbabka@suse.cz> >> Signed-off-by: Michal Hocko <mhocko@suse.com> >> --- >> net/netfilter/x_tables.c | 7 +------ >> 1 file changed, 1 insertion(+), 6 deletions(-) >> >> diff --git a/net/netfilter/x_tables.c b/net/netfilter/x_tables.c >> index d0d8397c9588..b769408e04ab 100644 >> --- a/net/netfilter/x_tables.c >> +++ b/net/netfilter/x_tables.c >> @@ -1178,12 +1178,7 @@ struct xt_table_info *xt_alloc_table_info(unsigned >> int size) >> if (sz < sizeof(*info) || sz >= XT_MAX_TABLE_SIZE) >> return NULL; >> >> - /* __GFP_NORETRY is not fully supported by kvmalloc but it should >> - * work reasonably well if sz is too large and bail out rather >> - * than shoot all processes down before realizing there is nothing >> - * more to reclaim. >> - */ >> - info = kvmalloc(sz, GFP_KERNEL | __GFP_NORETRY); >> + info = kvmalloc(sz, GFP_KERNEL | __GFP_ACCOUNT); >> if (!info) >> return NULL; >> > I will check if this change fixes the problem. > > Regards, > > -- > Georgi Nikolov I can't reproduce it anymore. If i understand correctly this way memory allocated will be accounted to kmem of this cgroup (if inside cgroup).
On Tue 07-08-18 14:02:00, Georgi Nikolov wrote: > On 08/06/2018 11:42 AM, Georgi Nikolov wrote: > > On 08/02/2018 11:50 AM, Michal Hocko wrote: > >> In other words, why don't we simply do the following? Note that this is > >> not tested. I have also no idea what is the lifetime of this allocation. > >> Is it bound to any specific process or is it a namespace bound? If the > >> later then the memcg OOM killer might wipe the whole memcg down without > >> making any progress. This would make the whole namespace unsuable until > >> somebody intervenes. Is this acceptable? > >> --- > >> From 4dec96eb64954a7e58264ed551afadf62ca4c5f7 Mon Sep 17 00:00:00 2001 > >> From: Michal Hocko <mhocko@suse.com> > >> Date: Thu, 2 Aug 2018 10:38:57 +0200 > >> Subject: [PATCH] netfilter/x_tables: do not fail xt_alloc_table_info too > >> easilly > >> > >> eacd86ca3b03 ("net/netfilter/x_tables.c: use kvmalloc() > >> in xt_alloc_table_info()") has unintentionally fortified > >> xt_alloc_table_info allocation when __GFP_RETRY has been dropped from > >> the vmalloc fallback. Later on there was a syzbot report that this > >> can lead to OOM killer invocations when tables are too large and > >> 0537250fdc6c ("netfilter: x_tables: make allocation less aggressive") > >> has been merged to restore the original behavior. Georgi Nikolov however > >> noticed that he is not able to install his iptables anymore so this can > >> be seen as a regression. > >> > >> The primary argument for 0537250fdc6c was that this allocation path > >> shouldn't really trigger the OOM killer and kill innocent tasks. On the > >> other hand the interface requires root and as such should allow what the > >> admin asks for. Root inside a namespaces makes this more complicated > >> because those might be not trusted in general. If they are not then such > >> namespaces should be restricted anyway. Therefore drop the __GFP_NORETRY > >> and replace it by __GFP_ACCOUNT to enfore memcg constrains on it. > >> > >> Fixes: 0537250fdc6c ("netfilter: x_tables: make allocation less > aggressive") > >> Reported-by: Georgi Nikolov <gnikolov@icdsoft.com> > >> Suggested-by: Vlastimil Babka <vbabka@suse.cz> > >> Signed-off-by: Michal Hocko <mhocko@suse.com> > >> --- > >> net/netfilter/x_tables.c | 7 +------ > >> 1 file changed, 1 insertion(+), 6 deletions(-) > >> > >> diff --git a/net/netfilter/x_tables.c b/net/netfilter/x_tables.c > >> index d0d8397c9588..b769408e04ab 100644 > >> --- a/net/netfilter/x_tables.c > >> +++ b/net/netfilter/x_tables.c > >> @@ -1178,12 +1178,7 @@ struct xt_table_info *xt_alloc_table_info(unsigned > int size) > >> if (sz < sizeof(*info) || sz >= XT_MAX_TABLE_SIZE) > >> return NULL; > >> > >> - /* __GFP_NORETRY is not fully supported by kvmalloc but it should > >> - * work reasonably well if sz is too large and bail out rather > >> - * than shoot all processes down before realizing there is nothing > >> - * more to reclaim. > >> - */ > >> - info = kvmalloc(sz, GFP_KERNEL | __GFP_NORETRY); > >> + info = kvmalloc(sz, GFP_KERNEL | __GFP_ACCOUNT); > >> if (!info) > >> return NULL; > >> > > I will check if this change fixes the problem. > > > > Regards, > > > > -- > > Georgi Nikolov > > I can't reproduce it anymore. > If i understand correctly this way memory allocated will be > accounted to kmem of this cgroup (if inside cgroup). s@this@caller's@ Florian, is this patch acceptable?
Michal Hocko <mhocko@kernel.org> wrote: > > I can't reproduce it anymore. > > If i understand correctly this way memory allocated will be > > accounted to kmem of this cgroup (if inside cgroup). > > s@this@caller's@ > > Florian, is this patch acceptable I am no mm expert. Should all longlived GFP_KERNEL allocations set ACCOUNT? If so, there are more places that should get same treatment. The change looks fine to me, but again, I don't know when ACCOUNT should be set in the first place.
On Tue 07-08-18 13:19:26, Florian Westphal wrote: > Michal Hocko <mhocko@kernel.org> wrote: > > > I can't reproduce it anymore. > > > If i understand correctly this way memory allocated will be > > > accounted to kmem of this cgroup (if inside cgroup). > > > > s@this@caller's@ > > > > Florian, is this patch acceptable > > I am no mm expert. Should all longlived GFP_KERNEL allocations set ACCOUNT? No. We should focus only on those that are under direct userspace control and it can be triggered by an untrusted user. > If so, there are more places that should get same treatment. > The change looks fine to me, but again, I don't know when ACCOUNT should > be set in the first place. see above.
On 08/02/2018 10:50 AM, Michal Hocko wrote: > On Wed 01-08-18 19:03:03, Georgi Nikolov wrote: >> >> *Georgi Nikolov* >> System Administrator >> www.icdsoft.com <http://www.icdsoft.com> >> >> On 08/01/2018 11:33 AM, Michal Hocko wrote: >>> On Wed 01-08-18 09:34:23, Vlastimil Babka wrote: >>>> On 07/31/2018 04:05 PM, Florian Westphal wrote: >>>>> Georgi Nikolov <gnikolov@icdsoft.com> wrote: >>>>>>> No, I think that's rather for the netfilter folks to decide. However, >>>>>>> it >>>>>>> seems there has been the debate already [1] and it was not found. The >>>>>>> conclusion was that __GFP_NORETRY worked fine before, so it should work >>>>>>> again after it's added back. But now we know that it doesn't... >>>>>>> >>>>>>> [1] >>>>>>> https://lore.kernel.org/lkml/20180130140104.GE21609@dhcp22.suse.cz/T/#u >>>>>> Yes i see. I will add Florian Westphal to CC list. netfilter-devel is >>>>>> already in this list so probably have to wait for their opinion. >>>>> It hasn't changed, I think having OOM killer zap random processes >>>>> just because userspace wants to import large iptables ruleset is not a >>>>> good idea. >>>> If we denied the allocation instead of OOM (e.g. by using >>>> __GFP_RETRY_MAYFAIL), a slightly smaller one may succeed, still leaving >>>> the system without much memory, so it will invoke OOM killer sooner or >>>> later anyway. >>>> >>>> I don't see any silver-bullet solution, unfortunately. If this can be >>>> abused by (multiple) namespaces, then they have to be contained by >>>> kmemcg as that's the generic mechanism intended for this. Then we could >>>> use the __GFP_RETRY_MAYFAIL. >>>> The only limit we could impose to outright deny the allocation (to >>>> prevent obvious bugs/admin mistakes or abuses) could be based on the >>>> amount of RAM, as was suggested in the old thread. >> >> Can we make this configurable - on/off switch or size above which >> to pass GFP_NORETRY. > > Yet another tunable? How do you decide which one to select? Seriously, > configuration knobs sound attractive but they are rarely a good idea. > Either we trust privileged users or we don't and we have kmem accounting > for that. > >> Probably hard coded based on amount of RAM is a good idea too. > > How do you scale that? > > In other words, why don't we simply do the following? Note that this is > not tested. I have also no idea what is the lifetime of this allocation. > Is it bound to any specific process or is it a namespace bound? If the > later then the memcg OOM killer might wipe the whole memcg down without > making any progress. This would make the whole namespace unsuable until > somebody intervenes. Is this acceptable? > --- > From 4dec96eb64954a7e58264ed551afadf62ca4c5f7 Mon Sep 17 00:00:00 2001 > From: Michal Hocko <mhocko@suse.com> > Date: Thu, 2 Aug 2018 10:38:57 +0200 > Subject: [PATCH] netfilter/x_tables: do not fail xt_alloc_table_info too > easilly > > eacd86ca3b03 ("net/netfilter/x_tables.c: use kvmalloc() > in xt_alloc_table_info()") has unintentionally fortified > xt_alloc_table_info allocation when __GFP_RETRY has been dropped from > the vmalloc fallback. Later on there was a syzbot report that this > can lead to OOM killer invocations when tables are too large and > 0537250fdc6c ("netfilter: x_tables: make allocation less aggressive") > has been merged to restore the original behavior. Georgi Nikolov however > noticed that he is not able to install his iptables anymore so this can > be seen as a regression. > > The primary argument for 0537250fdc6c was that this allocation path > shouldn't really trigger the OOM killer and kill innocent tasks. On the > other hand the interface requires root and as such should allow what the > admin asks for. Root inside a namespaces makes this more complicated > because those might be not trusted in general. If they are not then such > namespaces should be restricted anyway. Therefore drop the __GFP_NORETRY > and replace it by __GFP_ACCOUNT to enfore memcg constrains on it. > > Fixes: 0537250fdc6c ("netfilter: x_tables: make allocation less aggressive") > Reported-by: Georgi Nikolov <gnikolov@icdsoft.com> > Suggested-by: Vlastimil Babka <vbabka@suse.cz> > Signed-off-by: Michal Hocko <mhocko@suse.com> > --- > net/netfilter/x_tables.c | 7 +------ > 1 file changed, 1 insertion(+), 6 deletions(-) > > diff --git a/net/netfilter/x_tables.c b/net/netfilter/x_tables.c > index d0d8397c9588..b769408e04ab 100644 > --- a/net/netfilter/x_tables.c > +++ b/net/netfilter/x_tables.c > @@ -1178,12 +1178,7 @@ struct xt_table_info *xt_alloc_table_info(unsigned int > size) > if (sz < sizeof(*info) || sz >= XT_MAX_TABLE_SIZE) > return NULL; > > - /* __GFP_NORETRY is not fully supported by kvmalloc but it should > - * work reasonably well if sz is too large and bail out rather > - * than shoot all processes down before realizing there is nothing > - * more to reclaim. > - */ > - info = kvmalloc(sz, GFP_KERNEL | __GFP_NORETRY); > + info = kvmalloc(sz, GFP_KERNEL | __GFP_ACCOUNT); GFP_KERNEL_ACCOUNT ? > if (!info) > return NULL; > >
Michal Hocko <mhocko@kernel.org> wrote: > On Tue 07-08-18 13:19:26, Florian Westphal wrote: > > Michal Hocko <mhocko@kernel.org> wrote: > > > > I can't reproduce it anymore. > > > > If i understand correctly this way memory allocated will be > > > > accounted to kmem of this cgroup (if inside cgroup). > > > > > > s@this@caller's@ > > > > > > Florian, is this patch acceptable > > > > I am no mm expert. Should all longlived GFP_KERNEL allocations set > ACCOUNT? > > No. We should focus only on those that are under direct userspace > control and it can be triggered by an untrusted user. In that case patch is fine and we will need similar patches for nf_tables_api.c .
On 08/07/2018 01:26 PM, Michal Hocko wrote: > On Tue 07-08-18 13:19:26, Florian Westphal wrote: >> Michal Hocko <mhocko@kernel.org> wrote: >>>> I can't reproduce it anymore. >>>> If i understand correctly this way memory allocated will be >>>> accounted to kmem of this cgroup (if inside cgroup). >>> >>> s@this@caller's@ >>> >>> Florian, is this patch acceptable >> >> I am no mm expert. Should all longlived GFP_KERNEL allocations set ACCOUNT? > > No. We should focus only on those that are under direct userspace > control and it can be triggered by an untrusted user. Looks like the description in include/linux/gfp.h could use some details to guide developers, possibly also Mike's new/improved docs (+CC). >> If so, there are more places that should get same treatment. >> The change looks fine to me, but again, I don't know when ACCOUNT should >> be set in the first place. > > see above. >
On Tue 07-08-18 13:29:15, Vlastimil Babka wrote: > On 08/02/2018 10:50 AM, Michal Hocko wrote: > > On Wed 01-08-18 19:03:03, Georgi Nikolov wrote: > >> > >> *Georgi Nikolov* > >> System Administrator > >> www.icdsoft.com <http://www.icdsoft.com> > >> > >> On 08/01/2018 11:33 AM, Michal Hocko wrote: > >>> On Wed 01-08-18 09:34:23, Vlastimil Babka wrote: > >>>> On 07/31/2018 04:05 PM, Florian Westphal wrote: > >>>>> Georgi Nikolov <gnikolov@icdsoft.com> wrote: > >>>>>>> No, I think that's rather for the netfilter folks to decide. However, > it > >>>>>>> seems there has been the debate already [1] and it was not found. The > >>>>>>> conclusion was that __GFP_NORETRY worked fine before, so it should > work > >>>>>>> again after it's added back. But now we know that it doesn't... > >>>>>>> > >>>>>>> [1] > https://lore.kernel.org/lkml/20180130140104.GE21609@dhcp22.suse.cz/T/#u > >>>>>> Yes i see. I will add Florian Westphal to CC list. netfilter-devel is > >>>>>> already in this list so probably have to wait for their opinion. > >>>>> It hasn't changed, I think having OOM killer zap random processes > >>>>> just because userspace wants to import large iptables ruleset is not a > >>>>> good idea. > >>>> If we denied the allocation instead of OOM (e.g. by using > >>>> __GFP_RETRY_MAYFAIL), a slightly smaller one may succeed, still leaving > >>>> the system without much memory, so it will invoke OOM killer sooner or > >>>> later anyway. > >>>> > >>>> I don't see any silver-bullet solution, unfortunately. If this can be > >>>> abused by (multiple) namespaces, then they have to be contained by > >>>> kmemcg as that's the generic mechanism intended for this. Then we could > >>>> use the __GFP_RETRY_MAYFAIL. > >>>> The only limit we could impose to outright deny the allocation (to > >>>> prevent obvious bugs/admin mistakes or abuses) could be based on the > >>>> amount of RAM, as was suggested in the old thread. > >> > >> Can we make this configurable - on/off switch or size above which > >> to pass GFP_NORETRY. > > > > Yet another tunable? How do you decide which one to select? Seriously, > > configuration knobs sound attractive but they are rarely a good idea. > > Either we trust privileged users or we don't and we have kmem accounting > > for that. > > > >> Probably hard coded based on amount of RAM is a good idea too. > > > > How do you scale that? > > > > In other words, why don't we simply do the following? Note that this is > > not tested. I have also no idea what is the lifetime of this allocation. > > Is it bound to any specific process or is it a namespace bound? If the > > later then the memcg OOM killer might wipe the whole memcg down without > > making any progress. This would make the whole namespace unsuable until > > somebody intervenes. Is this acceptable? > > --- > > From 4dec96eb64954a7e58264ed551afadf62ca4c5f7 Mon Sep 17 00:00:00 2001 > > From: Michal Hocko <mhocko@suse.com> > > Date: Thu, 2 Aug 2018 10:38:57 +0200 > > Subject: [PATCH] netfilter/x_tables: do not fail xt_alloc_table_info too > > easilly > > > > eacd86ca3b03 ("net/netfilter/x_tables.c: use kvmalloc() > > in xt_alloc_table_info()") has unintentionally fortified > > xt_alloc_table_info allocation when __GFP_RETRY has been dropped from > > the vmalloc fallback. Later on there was a syzbot report that this > > can lead to OOM killer invocations when tables are too large and > > 0537250fdc6c ("netfilter: x_tables: make allocation less aggressive") > > has been merged to restore the original behavior. Georgi Nikolov however > > noticed that he is not able to install his iptables anymore so this can > > be seen as a regression. > > > > The primary argument for 0537250fdc6c was that this allocation path > > shouldn't really trigger the OOM killer and kill innocent tasks. On the > > other hand the interface requires root and as such should allow what the > > admin asks for. Root inside a namespaces makes this more complicated > > because those might be not trusted in general. If they are not then such > > namespaces should be restricted anyway. Therefore drop the __GFP_NORETRY > > and replace it by __GFP_ACCOUNT to enfore memcg constrains on it. > > > > Fixes: 0537250fdc6c ("netfilter: x_tables: make allocation less > aggressive") > > Reported-by: Georgi Nikolov <gnikolov@icdsoft.com> > > Suggested-by: Vlastimil Babka <vbabka@suse.cz> > > Signed-off-by: Michal Hocko <mhocko@suse.com> > > --- > > net/netfilter/x_tables.c | 7 +------ > > 1 file changed, 1 insertion(+), 6 deletions(-) > > > > diff --git a/net/netfilter/x_tables.c b/net/netfilter/x_tables.c > > index d0d8397c9588..b769408e04ab 100644 > > --- a/net/netfilter/x_tables.c > > +++ b/net/netfilter/x_tables.c > > @@ -1178,12 +1178,7 @@ struct xt_table_info *xt_alloc_table_info(unsigned > int size) > > if (sz < sizeof(*info) || sz >= XT_MAX_TABLE_SIZE) > > return NULL; > > > > - /* __GFP_NORETRY is not fully supported by kvmalloc but it should > > - * work reasonably well if sz is too large and bail out rather > > - * than shoot all processes down before realizing there is nothing > > - * more to reclaim. > > - */ > > - info = kvmalloc(sz, GFP_KERNEL | __GFP_NORETRY); > > + info = kvmalloc(sz, GFP_KERNEL | __GFP_ACCOUNT); > > GFP_KERNEL_ACCOUNT ? Certainly possible, I guess I just wanted to call the __GFP_ACCOUNT. But I can change that of course.
On Tue 07-08-18 13:30:13, Florian Westphal wrote: > Michal Hocko <mhocko@kernel.org> wrote: > > On Tue 07-08-18 13:19:26, Florian Westphal wrote: > > > Michal Hocko <mhocko@kernel.org> wrote: > > > > > I can't reproduce it anymore. > > > > > If i understand correctly this way memory allocated will be > > > > > accounted to kmem of this cgroup (if inside cgroup). > > > > > > > > s@this@caller's@ > > > > > > > > Florian, is this patch acceptable > > > > > > I am no mm expert. Should all longlived GFP_KERNEL allocations set > ACCOUNT? > > > > No. We should focus only on those that are under direct userspace > > control and it can be triggered by an untrusted user. > > In that case patch is fine and we will need similar patches for > nf_tables_api.c . Can I assume your Acked-by or should I repost the standalong patch in its new thread?
On Tue, Aug 07, 2018 at 01:31:21PM +0200, Vlastimil Babka wrote: > On 08/07/2018 01:26 PM, Michal Hocko wrote: > > On Tue 07-08-18 13:19:26, Florian Westphal wrote: > >> Michal Hocko <mhocko@kernel.org> wrote: > >>>> I can't reproduce it anymore. > >>>> If i understand correctly this way memory allocated will be > >>>> accounted to kmem of this cgroup (if inside cgroup). > >>> > >>> s@this@caller's@ > >>> > >>> Florian, is this patch acceptable > >> > >> I am no mm expert. Should all longlived GFP_KERNEL allocations set > ACCOUNT? > > > > No. We should focus only on those that are under direct userspace > > control and it can be triggered by an untrusted user. > > Looks like the description in include/linux/gfp.h could use some details > to guide developers, possibly also Mike's new/improved docs (+CC). A "memory allocation guide" is definitely on my todo. If you and other mm developers have any notes, random thoughts or anything else you'd like to see there, send it my way, I'll convert them to ReST :-). > >> If so, there are more places that should get same treatment. > >> The change looks fine to me, but again, I don't know when ACCOUNT should > >> be set in the first place. > > > > see above. > > >
Michal Hocko <mhocko@kernel.org> wrote: > Subject: [PATCH] netfilter/x_tables: do not fail xt_alloc_table_info too > easilly [..] > - /* __GFP_NORETRY is not fully supported by kvmalloc but it should > - * work reasonably well if sz is too large and bail out rather > - * than shoot all processes down before realizing there is nothing > - * more to reclaim. > - */ > - info = kvmalloc(sz, GFP_KERNEL | __GFP_NORETRY); > + info = kvmalloc(sz, GFP_KERNEL | __GFP_ACCOUNT); > if (!info) > return NULL; Acked-by: Florian Westphal <fw@strlen.de> You can keep this acked-by in case you mangle this patch in a minor way such as using GFP_KERNEL_ACCOUNT. Thanks, Florian
On Tue 07-08-18 20:23:55, Florian Westphal wrote: > Michal Hocko <mhocko@kernel.org> wrote: > > Subject: [PATCH] netfilter/x_tables: do not fail xt_alloc_table_info too > > easilly > > [..] > > > - /* __GFP_NORETRY is not fully supported by kvmalloc but it should > > - * work reasonably well if sz is too large and bail out rather > > - * than shoot all processes down before realizing there is nothing > > - * more to reclaim. > > - */ > > - info = kvmalloc(sz, GFP_KERNEL | __GFP_NORETRY); > > + info = kvmalloc(sz, GFP_KERNEL | __GFP_ACCOUNT); > > if (!info) > > return NULL; > > Acked-by: Florian Westphal <fw@strlen.de> > > You can keep this acked-by in case you mangle this patch in a minor > way such as using GFP_KERNEL_ACCOUNT. Thanks!