Bug 200651 - cgroups iptables-restor: vmalloc: allocation failure
Summary: cgroups iptables-restor: vmalloc: allocation failure
Status: NEW
Alias: None
Product: Memory Management
Classification: Unclassified
Component: Other (show other bugs)
Hardware: All Linux
: P1 normal
Assignee: Andrew Morton
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2018-07-25 11:42 UTC by Georgi Georgiev
Modified: 2018-08-07 19:30 UTC (History)
0 users

See Also:
Kernel Version: 4.14
Subsystem:
Regression: No
Bisected commit-id:


Attachments
iptables save (3.64 MB, text/plain)
2018-07-25 11:42 UTC, Georgi Georgiev
Details
attachment-20091-0.html (6.49 KB, text/html)
2018-07-26 09:09 UTC, Georgi Georgiev
Details

Description Georgi Georgiev 2018-07-25 11:42:57 UTC
Created attachment 277505 [details]
iptables save

After creating large number of cgroups and under memory pressure, iptables command fails with following error:

"iptables-restor: vmalloc: allocation failure, allocated 3047424 of 3465216 bytes, mode:0x14010c0(GFP_KERNEL|__GFP_NORETRY), nodemask=(null)"

System which is used to reproduce the bug is with 2 vcpus and 2GB of ram, but it happens on more powerfull systems.

Steps to reproduce:

mkdir /cgroup
mount cgroup -t cgroup -omemory,pids,blkio,cpuacct /cgroup
for a in `seq 1 1000`; do for b in `seq 1 4` ; do mkdir -p "/cgroup/user/$a/$b"; done; done

Then in separate consoles

cat /dev/vda > /dev/null
./test
./test
i=0;while sleep 0 ; do iptables-restore < iptables.save ; i=$(($i+1)); echo $i; done

Here is the source of "test" program and attached iptables.save. It happens also with smaller iptables.save file.

#include <stdio.h>
#include <stdlib.h>

int main(void) {

    srand(time(NULL));
    int i = 0, j = 0, randnum=0;
    int arr[6] = { 3072, 7168, 15360 , 31744, 64512, 130048}; 
    while(1) {

        for (i = 0; i < 6 ; i++) {

            int *ptr = (int*) malloc(arr[i] * 93);  

            for(j = 0 ; j < arr[i] * 93 / sizeof(int); j++) {
                *(ptr+j) = j+1;
            }

            free(ptr);
        }
    }       
}
Comment 1 Andrew Morton 2018-07-25 19:52:42 UTC
(switched to email.  Please respond via emailed reply-to-all, not via the
bugzilla web interface).

On Wed, 25 Jul 2018 11:42:57 +0000 bugzilla-daemon@bugzilla.kernel.org wrote:

> https://bugzilla.kernel.org/show_bug.cgi?id=200651
> 
>             Bug ID: 200651
>            Summary: cgroups iptables-restor: vmalloc: allocation failure

Thanks.  Please do note the above request.

>            Product: Memory Management
>            Version: 2.5
>     Kernel Version: 4.14
>           Hardware: All
>                 OS: Linux
>               Tree: Mainline
>             Status: NEW
>           Severity: normal
>           Priority: P1
>          Component: Other
>           Assignee: akpm@linux-foundation.org
>           Reporter: gnikolov@icdsoft.com
>         Regression: No
> 
> Created attachment 277505 [details]
>   --> https://bugzilla.kernel.org/attachment.cgi?id=277505&action=edit
> iptables save
> 
> After creating large number of cgroups and under memory pressure, iptables
> command fails with following error:
> 
> "iptables-restor: vmalloc: allocation failure, allocated 3047424 of 3465216
> bytes, mode:0x14010c0(GFP_KERNEL|__GFP_NORETRY), nodemask=(null)"

I'm not sure what the problem is here, apart from iptables being
over-optimistic about vmalloc()'s abilities.

Are cgroups having any impact on this, or is it simply vmalloc arena
fragmentation, and the iptables code should use some data structure
more sophisticated than a massive array?

Maybe all that ccgroup metadata is contributing to the arena
fragmentation, but that allocations will be small and the two systems
should be able to live alongside, by being realistic about vmalloc.

> System which is used to reproduce the bug is with 2 vcpus and 2GB of ram, but
> it happens on more powerfull systems.
> 
> Steps to reproduce:
> 
> mkdir /cgroup
> mount cgroup -t cgroup -omemory,pids,blkio,cpuacct /cgroup
> for a in `seq 1 1000`; do for b in `seq 1 4` ; do mkdir -p
> "/cgroup/user/$a/$b"; done; done
> 
> Then in separate consoles
> 
> cat /dev/vda > /dev/null
> ./test
> ./test
> i=0;while sleep 0 ; do iptables-restore < iptables.save ; i=$(($i+1)); echo
> $i;
> done
> 
> Here is the source of "test" program and attached iptables.save. It happens
> also with smaller iptables.save file.
> 
> #include <stdio.h>
> #include <stdlib.h>
> 
> int main(void) {
> 
>     srand(time(NULL));
>     int i = 0, j = 0, randnum=0;
>     int arr[6] = { 3072, 7168, 15360 , 31744, 64512, 130048}; 
>     while(1) {
> 
>         for (i = 0; i < 6 ; i++) {
> 
>             int *ptr = (int*) malloc(arr[i] * 93);  
> 
>             for(j = 0 ; j < arr[i] * 93 / sizeof(int); j++) {
>                 *(ptr+j) = j+1;
>             }
> 
>             free(ptr);
>         }
>     }       
> }
>
Comment 2 Michal Hocko 2018-07-26 07:26:26 UTC
On Thu 26-07-18 09:18:57, Vlastimil Babka wrote:
> On 07/25/2018 09:52 PM, Andrew Morton wrote:
> > (switched to email.  Please respond via emailed reply-to-all, not via the
> > bugzilla web interface).
> > 
> > On Wed, 25 Jul 2018 11:42:57 +0000 bugzilla-daemon@bugzilla.kernel.org
> wrote:
> > 
> >> https://bugzilla.kernel.org/show_bug.cgi?id=200651
> >>
> >>             Bug ID: 200651
> >>            Summary: cgroups iptables-restor: vmalloc: allocation failure
> > 
> > Thanks.  Please do note the above request.
> > 
> >>            Product: Memory Management
> >>            Version: 2.5
> >>     Kernel Version: 4.14
> >>           Hardware: All
> >>                 OS: Linux
> >>               Tree: Mainline
> >>             Status: NEW
> >>           Severity: normal
> >>           Priority: P1
> >>          Component: Other
> >>           Assignee: akpm@linux-foundation.org
> >>           Reporter: gnikolov@icdsoft.com
> >>         Regression: No
> >>
> >> Created attachment 277505 [details]
> >>   --> https://bugzilla.kernel.org/attachment.cgi?id=277505&action=edit
> >> iptables save
> >>
> >> After creating large number of cgroups and under memory pressure, iptables
> >> command fails with following error:
> >>
> >> "iptables-restor: vmalloc: allocation failure, allocated 3047424 of
> 3465216
> >> bytes, mode:0x14010c0(GFP_KERNEL|__GFP_NORETRY), nodemask=(null)"
> 
> This is likely the kvmalloc() in xt_alloc_table_info(). Between 4.13 and
> 4.17 it shouldn't use __GFP_NORETRY, but looks like commit 0537250fdc6c
> ("netfilter: x_tables: make allocation less aggressive") was backported
> to 4.14. Removing __GFP_NORETRY might help here, but bring back other
> issues. Less than 4MB is not that much though, maybe find some "sane"
> limit and use __GFP_NORETRY only above that?

I have seen the same report via http://lkml.kernel.org/r/df6f501c-8546-1f55-40b1-7e3a8f54d872@icdsoft.com
and the reported confirmed that kvmalloc is not a real culprit
http://lkml.kernel.org/r/d99a9598-808a-6968-4131-c3949b752004@icdsoft.com
 
> > I'm not sure what the problem is here, apart from iptables being
> > over-optimistic about vmalloc()'s abilities.
> > 
> > Are cgroups having any impact on this, or is it simply vmalloc arena
> > fragmentation, and the iptables code should use some data structure
> > more sophisticated than a massive array?
> > 
> > Maybe all that ccgroup metadata is contributing to the arena
> > fragmentation, but that allocations will be small and the two systems
> > should be able to live alongside, by being realistic about vmalloc.
> > 
> >> System which is used to reproduce the bug is with 2 vcpus and 2GB of ram,
> but
> >> it happens on more powerfull systems.
> >>
> >> Steps to reproduce:
> >>
> >> mkdir /cgroup
> >> mount cgroup -t cgroup -omemory,pids,blkio,cpuacct /cgroup
> >> for a in `seq 1 1000`; do for b in `seq 1 4` ; do mkdir -p
> >> "/cgroup/user/$a/$b"; done; done
> >>
> >> Then in separate consoles
> >>
> >> cat /dev/vda > /dev/null
> >> ./test
> >> ./test
> >> i=0;while sleep 0 ; do iptables-restore < iptables.save ; i=$(($i+1));
> echo $i;
> >> done
> >>
> >> Here is the source of "test" program and attached iptables.save. It
> happens
> >> also with smaller iptables.save file.
> >>
> >> #include <stdio.h>
> >> #include <stdlib.h>
> >>
> >> int main(void) {
> >>
> >>     srand(time(NULL));
> >>     int i = 0, j = 0, randnum=0;
> >>     int arr[6] = { 3072, 7168, 15360 , 31744, 64512, 130048}; 
> >>     while(1) {
> >>
> >>         for (i = 0; i < 6 ; i++) {
> >>
> >>             int *ptr = (int*) malloc(arr[i] * 93);  
> >>
> >>             for(j = 0 ; j < arr[i] * 93 / sizeof(int); j++) {
> >>                 *(ptr+j) = j+1;
> >>             }
> >>
> >>             free(ptr);
> >>         }
> >>     }       
> >> }
> >>
> >
Comment 3 Vlastimil Babka 2018-07-26 07:33:18 UTC
On 07/25/2018 09:52 PM, Andrew Morton wrote:
> (switched to email.  Please respond via emailed reply-to-all, not via the
> bugzilla web interface).
> 
> On Wed, 25 Jul 2018 11:42:57 +0000 bugzilla-daemon@bugzilla.kernel.org wrote:
> 
>> https://bugzilla.kernel.org/show_bug.cgi?id=200651
>>
>>             Bug ID: 200651
>>            Summary: cgroups iptables-restor: vmalloc: allocation failure
> 
> Thanks.  Please do note the above request.
> 
>>            Product: Memory Management
>>            Version: 2.5
>>     Kernel Version: 4.14
>>           Hardware: All
>>                 OS: Linux
>>               Tree: Mainline
>>             Status: NEW
>>           Severity: normal
>>           Priority: P1
>>          Component: Other
>>           Assignee: akpm@linux-foundation.org
>>           Reporter: gnikolov@icdsoft.com
>>         Regression: No
>>
>> Created attachment 277505 [details]
>>   --> https://bugzilla.kernel.org/attachment.cgi?id=277505&action=edit
>> iptables save
>>
>> After creating large number of cgroups and under memory pressure, iptables
>> command fails with following error:
>>
>> "iptables-restor: vmalloc: allocation failure, allocated 3047424 of 3465216
>> bytes, mode:0x14010c0(GFP_KERNEL|__GFP_NORETRY), nodemask=(null)"

This is likely the kvmalloc() in xt_alloc_table_info(). Between 4.13 and
4.17 it shouldn't use __GFP_NORETRY, but looks like commit 0537250fdc6c
("netfilter: x_tables: make allocation less aggressive") was backported
to 4.14. Removing __GFP_NORETRY might help here, but bring back other
issues. Less than 4MB is not that much though, maybe find some "sane"
limit and use __GFP_NORETRY only above that?

> I'm not sure what the problem is here, apart from iptables being
> over-optimistic about vmalloc()'s abilities.
> 
> Are cgroups having any impact on this, or is it simply vmalloc arena
> fragmentation, and the iptables code should use some data structure
> more sophisticated than a massive array?
> 
> Maybe all that ccgroup metadata is contributing to the arena
> fragmentation, but that allocations will be small and the two systems
> should be able to live alongside, by being realistic about vmalloc.
> 
>> System which is used to reproduce the bug is with 2 vcpus and 2GB of ram,
>> but
>> it happens on more powerfull systems.
>>
>> Steps to reproduce:
>>
>> mkdir /cgroup
>> mount cgroup -t cgroup -omemory,pids,blkio,cpuacct /cgroup
>> for a in `seq 1 1000`; do for b in `seq 1 4` ; do mkdir -p
>> "/cgroup/user/$a/$b"; done; done
>>
>> Then in separate consoles
>>
>> cat /dev/vda > /dev/null
>> ./test
>> ./test
>> i=0;while sleep 0 ; do iptables-restore < iptables.save ; i=$(($i+1)); echo
>> $i;
>> done
>>
>> Here is the source of "test" program and attached iptables.save. It happens
>> also with smaller iptables.save file.
>>
>> #include <stdio.h>
>> #include <stdlib.h>
>>
>> int main(void) {
>>
>>     srand(time(NULL));
>>     int i = 0, j = 0, randnum=0;
>>     int arr[6] = { 3072, 7168, 15360 , 31744, 64512, 130048}; 
>>     while(1) {
>>
>>         for (i = 0; i < 6 ; i++) {
>>
>>             int *ptr = (int*) malloc(arr[i] * 93);  
>>
>>             for(j = 0 ; j < arr[i] * 93 / sizeof(int); j++) {
>>                 *(ptr+j) = j+1;
>>             }
>>
>>             free(ptr);
>>         }
>>     }       
>> }
>>
>
Comment 4 Vlastimil Babka 2018-07-26 07:35:03 UTC
On 07/26/2018 09:26 AM, Michal Hocko wrote:
> On Thu 26-07-18 09:18:57, Vlastimil Babka wrote:
>> On 07/25/2018 09:52 PM, Andrew Morton wrote:
>>> (switched to email.  Please respond via emailed reply-to-all, not via the
>>> bugzilla web interface).
>>>
>>> On Wed, 25 Jul 2018 11:42:57 +0000 bugzilla-daemon@bugzilla.kernel.org
>>> wrote:
>>>
>>>> https://bugzilla.kernel.org/show_bug.cgi?id=200651
>>>>
>>>>             Bug ID: 200651
>>>>            Summary: cgroups iptables-restor: vmalloc: allocation failure
>>>
>>> Thanks.  Please do note the above request.
>>>
>>>>            Product: Memory Management
>>>>            Version: 2.5
>>>>     Kernel Version: 4.14
>>>>           Hardware: All
>>>>                 OS: Linux
>>>>               Tree: Mainline
>>>>             Status: NEW
>>>>           Severity: normal
>>>>           Priority: P1
>>>>          Component: Other
>>>>           Assignee: akpm@linux-foundation.org
>>>>           Reporter: gnikolov@icdsoft.com
>>>>         Regression: No
>>>>
>>>> Created attachment 277505 [details]
>>>>   --> https://bugzilla.kernel.org/attachment.cgi?id=277505&action=edit
>>>> iptables save
>>>>
>>>> After creating large number of cgroups and under memory pressure, iptables
>>>> command fails with following error:
>>>>
>>>> "iptables-restor: vmalloc: allocation failure, allocated 3047424 of
>>>> 3465216
>>>> bytes, mode:0x14010c0(GFP_KERNEL|__GFP_NORETRY), nodemask=(null)"
>>
>> This is likely the kvmalloc() in xt_alloc_table_info(). Between 4.13 and
>> 4.17 it shouldn't use __GFP_NORETRY, but looks like commit 0537250fdc6c
>> ("netfilter: x_tables: make allocation less aggressive") was backported
>> to 4.14. Removing __GFP_NORETRY might help here, but bring back other
>> issues. Less than 4MB is not that much though, maybe find some "sane"
>> limit and use __GFP_NORETRY only above that?
> 
> I have seen the same report via
> http://lkml.kernel.org/r/df6f501c-8546-1f55-40b1-7e3a8f54d872@icdsoft.com
> and the reported confirmed that kvmalloc is not a real culprit
> http://lkml.kernel.org/r/d99a9598-808a-6968-4131-c3949b752004@icdsoft.com

Hmm but that was revert of eacd86ca3b03 ("net/netfilter/x_tables.c: use
kvmalloc() in xt_alloc_table_info()") which was the 4.13 commit that
removed __GFP_NORETRY (there's no __GFP_NORETRY under net/netfilter in
v4.14). I assume it was reverted on top of vanilla v4.14 as there would
be conflict on the stable with 0537250fdc6c backport. So what should be
tested to be sure is either vanilla v4.14 without stable backports, or
latest v4.14.y with revert of 0537250fdc6c.
Comment 5 Michal Hocko 2018-07-26 07:42:24 UTC
On Thu 26-07-18 09:34:58, Vlastimil Babka wrote:
> On 07/26/2018 09:26 AM, Michal Hocko wrote:
> > On Thu 26-07-18 09:18:57, Vlastimil Babka wrote:
> >> On 07/25/2018 09:52 PM, Andrew Morton wrote:
> >>> (switched to email.  Please respond via emailed reply-to-all, not via the
> >>> bugzilla web interface).
> >>>
> >>> On Wed, 25 Jul 2018 11:42:57 +0000 bugzilla-daemon@bugzilla.kernel.org
> wrote:
> >>>
> >>>> https://bugzilla.kernel.org/show_bug.cgi?id=200651
> >>>>
> >>>>             Bug ID: 200651
> >>>>            Summary: cgroups iptables-restor: vmalloc: allocation failure
> >>>
> >>> Thanks.  Please do note the above request.
> >>>
> >>>>            Product: Memory Management
> >>>>            Version: 2.5
> >>>>     Kernel Version: 4.14
> >>>>           Hardware: All
> >>>>                 OS: Linux
> >>>>               Tree: Mainline
> >>>>             Status: NEW
> >>>>           Severity: normal
> >>>>           Priority: P1
> >>>>          Component: Other
> >>>>           Assignee: akpm@linux-foundation.org
> >>>>           Reporter: gnikolov@icdsoft.com
> >>>>         Regression: No
> >>>>
> >>>> Created attachment 277505 [details]
> >>>>   --> https://bugzilla.kernel.org/attachment.cgi?id=277505&action=edit
> >>>> iptables save
> >>>>
> >>>> After creating large number of cgroups and under memory pressure,
> iptables
> >>>> command fails with following error:
> >>>>
> >>>> "iptables-restor: vmalloc: allocation failure, allocated 3047424 of
> 3465216
> >>>> bytes, mode:0x14010c0(GFP_KERNEL|__GFP_NORETRY), nodemask=(null)"
> >>
> >> This is likely the kvmalloc() in xt_alloc_table_info(). Between 4.13 and
> >> 4.17 it shouldn't use __GFP_NORETRY, but looks like commit 0537250fdc6c
> >> ("netfilter: x_tables: make allocation less aggressive") was backported
> >> to 4.14. Removing __GFP_NORETRY might help here, but bring back other
> >> issues. Less than 4MB is not that much though, maybe find some "sane"
> >> limit and use __GFP_NORETRY only above that?
> > 
> > I have seen the same report via
> http://lkml.kernel.org/r/df6f501c-8546-1f55-40b1-7e3a8f54d872@icdsoft.com
> > and the reported confirmed that kvmalloc is not a real culprit
> > http://lkml.kernel.org/r/d99a9598-808a-6968-4131-c3949b752004@icdsoft.com
> 
> Hmm but that was revert of eacd86ca3b03 ("net/netfilter/x_tables.c: use
> kvmalloc() in xt_alloc_table_info()") which was the 4.13 commit that
> removed __GFP_NORETRY (there's no __GFP_NORETRY under net/netfilter in
> v4.14). I assume it was reverted on top of vanilla v4.14 as there would
> be conflict on the stable with 0537250fdc6c backport. So what should be
> tested to be sure is either vanilla v4.14 without stable backports, or
> latest v4.14.y with revert of 0537250fdc6c.

But 0537250fdc6c simply restored the previous NORETRY behavior from
before eacd86ca3b03. So whatever causes these issues doesn't seem to be
directly related to the kvmalloc change. Or do I miss what you are
saying?
Comment 6 Vlastimil Babka 2018-07-26 07:50:50 UTC
On 07/26/2018 09:42 AM, Michal Hocko wrote:
> On Thu 26-07-18 09:34:58, Vlastimil Babka wrote:
>> On 07/26/2018 09:26 AM, Michal Hocko wrote:
>>> On Thu 26-07-18 09:18:57, Vlastimil Babka wrote:
>>>> On 07/25/2018 09:52 PM, Andrew Morton wrote:
>>>>> (switched to email.  Please respond via emailed reply-to-all, not via the
>>>>> bugzilla web interface).
>>>>>
>>>>> On Wed, 25 Jul 2018 11:42:57 +0000 bugzilla-daemon@bugzilla.kernel.org
>>>>> wrote:
>>>>>
>>>>>> https://bugzilla.kernel.org/show_bug.cgi?id=200651
>>>>>>
>>>>>>             Bug ID: 200651
>>>>>>            Summary: cgroups iptables-restor: vmalloc: allocation failure
>>>>>
>>>>> Thanks.  Please do note the above request.
>>>>>
>>>>>>            Product: Memory Management
>>>>>>            Version: 2.5
>>>>>>     Kernel Version: 4.14
>>>>>>           Hardware: All
>>>>>>                 OS: Linux
>>>>>>               Tree: Mainline
>>>>>>             Status: NEW
>>>>>>           Severity: normal
>>>>>>           Priority: P1
>>>>>>          Component: Other
>>>>>>           Assignee: akpm@linux-foundation.org
>>>>>>           Reporter: gnikolov@icdsoft.com
>>>>>>         Regression: No
>>>>>>
>>>>>> Created attachment 277505 [details]
>>>>>>   --> https://bugzilla.kernel.org/attachment.cgi?id=277505&action=edit
>>>>>> iptables save
>>>>>>
>>>>>> After creating large number of cgroups and under memory pressure,
>>>>>> iptables
>>>>>> command fails with following error:
>>>>>>
>>>>>> "iptables-restor: vmalloc: allocation failure, allocated 3047424 of
>>>>>> 3465216
>>>>>> bytes, mode:0x14010c0(GFP_KERNEL|__GFP_NORETRY), nodemask=(null)"
>>>>
>>>> This is likely the kvmalloc() in xt_alloc_table_info(). Between 4.13 and
>>>> 4.17 it shouldn't use __GFP_NORETRY, but looks like commit 0537250fdc6c
>>>> ("netfilter: x_tables: make allocation less aggressive") was backported
>>>> to 4.14. Removing __GFP_NORETRY might help here, but bring back other
>>>> issues. Less than 4MB is not that much though, maybe find some "sane"
>>>> limit and use __GFP_NORETRY only above that?
>>>
>>> I have seen the same report via
>>> http://lkml.kernel.org/r/df6f501c-8546-1f55-40b1-7e3a8f54d872@icdsoft.com
>>> and the reported confirmed that kvmalloc is not a real culprit
>>> http://lkml.kernel.org/r/d99a9598-808a-6968-4131-c3949b752004@icdsoft.com
>>
>> Hmm but that was revert of eacd86ca3b03 ("net/netfilter/x_tables.c: use
>> kvmalloc() in xt_alloc_table_info()") which was the 4.13 commit that
>> removed __GFP_NORETRY (there's no __GFP_NORETRY under net/netfilter in
>> v4.14). I assume it was reverted on top of vanilla v4.14 as there would
>> be conflict on the stable with 0537250fdc6c backport. So what should be
>> tested to be sure is either vanilla v4.14 without stable backports, or
>> latest v4.14.y with revert of 0537250fdc6c.
> 
> But 0537250fdc6c simply restored the previous NORETRY behavior from
> before eacd86ca3b03. So whatever causes these issues doesn't seem to be
> directly related to the kvmalloc change. Or do I miss what you are
> saying?

I'm saying that although it's not a regression, as you say (the
vmalloc() there was only for a few kernel versions called without
__GFP_NORETRY), it's still possible that removing __GFP_NORETRY will fix
the issue and thus we will rule out other possibilities.
Comment 7 Michal Hocko 2018-07-26 08:03:06 UTC
On Thu 26-07-18 09:50:45, Vlastimil Babka wrote:
> On 07/26/2018 09:42 AM, Michal Hocko wrote:
> > On Thu 26-07-18 09:34:58, Vlastimil Babka wrote:
> >> On 07/26/2018 09:26 AM, Michal Hocko wrote:
> >>> On Thu 26-07-18 09:18:57, Vlastimil Babka wrote:
> >>>> On 07/25/2018 09:52 PM, Andrew Morton wrote:
> >>>>> (switched to email.  Please respond via emailed reply-to-all, not via
> the
> >>>>> bugzilla web interface).
> >>>>>
> >>>>> On Wed, 25 Jul 2018 11:42:57 +0000 bugzilla-daemon@bugzilla.kernel.org
> wrote:
> >>>>>
> >>>>>> https://bugzilla.kernel.org/show_bug.cgi?id=200651
> >>>>>>
> >>>>>>             Bug ID: 200651
> >>>>>>            Summary: cgroups iptables-restor: vmalloc: allocation
> failure
> >>>>>
> >>>>> Thanks.  Please do note the above request.
> >>>>>
> >>>>>>            Product: Memory Management
> >>>>>>            Version: 2.5
> >>>>>>     Kernel Version: 4.14
> >>>>>>           Hardware: All
> >>>>>>                 OS: Linux
> >>>>>>               Tree: Mainline
> >>>>>>             Status: NEW
> >>>>>>           Severity: normal
> >>>>>>           Priority: P1
> >>>>>>          Component: Other
> >>>>>>           Assignee: akpm@linux-foundation.org
> >>>>>>           Reporter: gnikolov@icdsoft.com
> >>>>>>         Regression: No
> >>>>>>
> >>>>>> Created attachment 277505 [details]
> >>>>>>   --> https://bugzilla.kernel.org/attachment.cgi?id=277505&action=edit
> >>>>>> iptables save
> >>>>>>
> >>>>>> After creating large number of cgroups and under memory pressure,
> iptables
> >>>>>> command fails with following error:
> >>>>>>
> >>>>>> "iptables-restor: vmalloc: allocation failure, allocated 3047424 of
> 3465216
> >>>>>> bytes, mode:0x14010c0(GFP_KERNEL|__GFP_NORETRY), nodemask=(null)"
> >>>>
> >>>> This is likely the kvmalloc() in xt_alloc_table_info(). Between 4.13 and
> >>>> 4.17 it shouldn't use __GFP_NORETRY, but looks like commit 0537250fdc6c
> >>>> ("netfilter: x_tables: make allocation less aggressive") was backported
> >>>> to 4.14. Removing __GFP_NORETRY might help here, but bring back other
> >>>> issues. Less than 4MB is not that much though, maybe find some "sane"
> >>>> limit and use __GFP_NORETRY only above that?
> >>>
> >>> I have seen the same report via
> http://lkml.kernel.org/r/df6f501c-8546-1f55-40b1-7e3a8f54d872@icdsoft.com
> >>> and the reported confirmed that kvmalloc is not a real culprit
> >>> http://lkml.kernel.org/r/d99a9598-808a-6968-4131-c3949b752004@icdsoft.com
> >>
> >> Hmm but that was revert of eacd86ca3b03 ("net/netfilter/x_tables.c: use
> >> kvmalloc() in xt_alloc_table_info()") which was the 4.13 commit that
> >> removed __GFP_NORETRY (there's no __GFP_NORETRY under net/netfilter in
> >> v4.14). I assume it was reverted on top of vanilla v4.14 as there would
> >> be conflict on the stable with 0537250fdc6c backport. So what should be
> >> tested to be sure is either vanilla v4.14 without stable backports, or
> >> latest v4.14.y with revert of 0537250fdc6c.
> > 
> > But 0537250fdc6c simply restored the previous NORETRY behavior from
> > before eacd86ca3b03. So whatever causes these issues doesn't seem to be
> > directly related to the kvmalloc change. Or do I miss what you are
> > saying?
> 
> I'm saying that although it's not a regression, as you say (the
> vmalloc() there was only for a few kernel versions called without
> __GFP_NORETRY), it's still possible that removing __GFP_NORETRY will fix
> the issue and thus we will rule out other possibilities.

http://lkml.kernel.org/r/d99a9598-808a-6968-4131-c3949b752004@icdsoft.com
claims that reverting eacd86ca3b03 didn't really help.
Comment 8 Vlastimil Babka 2018-07-26 08:31:37 UTC
On 07/26/2018 10:03 AM, Michal Hocko wrote:
> On Thu 26-07-18 09:50:45, Vlastimil Babka wrote:
>> On 07/26/2018 09:42 AM, Michal Hocko wrote:
>>> On Thu 26-07-18 09:34:58, Vlastimil Babka wrote:
>>>> On 07/26/2018 09:26 AM, Michal Hocko wrote:
>>>>> On Thu 26-07-18 09:18:57, Vlastimil Babka wrote:
>>>>>> On 07/25/2018 09:52 PM, Andrew Morton wrote:
>>>>>>
>>>>>> This is likely the kvmalloc() in xt_alloc_table_info(). Between 4.13 and
>>>>>> 4.17 it shouldn't use __GFP_NORETRY, but looks like commit 0537250fdc6c
>>>>>> ("netfilter: x_tables: make allocation less aggressive") was backported
>>>>>> to 4.14. Removing __GFP_NORETRY might help here, but bring back other
>>>>>> issues. Less than 4MB is not that much though, maybe find some "sane"
>>>>>> limit and use __GFP_NORETRY only above that?
>>>>>
>>>>> I have seen the same report via
>>>>> http://lkml.kernel.org/r/df6f501c-8546-1f55-40b1-7e3a8f54d872@icdsoft.com
>>>>> and the reported confirmed that kvmalloc is not a real culprit
>>>>> http://lkml.kernel.org/r/d99a9598-808a-6968-4131-c3949b752004@icdsoft.com
>>>>
>>>> Hmm but that was revert of eacd86ca3b03 ("net/netfilter/x_tables.c: use
>>>> kvmalloc() in xt_alloc_table_info()") which was the 4.13 commit that
>>>> removed __GFP_NORETRY (there's no __GFP_NORETRY under net/netfilter in
>>>> v4.14). I assume it was reverted on top of vanilla v4.14 as there would
>>>> be conflict on the stable with 0537250fdc6c backport. So what should be
>>>> tested to be sure is either vanilla v4.14 without stable backports, or
>>>> latest v4.14.y with revert of 0537250fdc6c.
>>>
>>> But 0537250fdc6c simply restored the previous NORETRY behavior from
>>> before eacd86ca3b03. So whatever causes these issues doesn't seem to be
>>> directly related to the kvmalloc change. Or do I miss what you are
>>> saying?
>>
>> I'm saying that although it's not a regression, as you say (the
>> vmalloc() there was only for a few kernel versions called without
>> __GFP_NORETRY), it's still possible that removing __GFP_NORETRY will fix
>> the issue and thus we will rule out other possibilities.
> 
> http://lkml.kernel.org/r/d99a9598-808a-6968-4131-c3949b752004@icdsoft.com
> claims that reverting eacd86ca3b03 didn't really help.

Of course not. eacd86ca3b03 *removed* __GFP_NORETRY, so the revert
reintroduced it. I tried to explain it in the quoted part above starting
with "Hmm but that was revert of eacd86ca3b03 ...". What I'm saying is
that eacd86ca3b03 might have actually *fixed* (or rather prevented) this
alloc failure, if there was not 0537250fdc6c and its 4.14 stable
backport (the kernel bugzilla report says 4.14, I'm assuming new enough
stable to contain 0537250fdc6c as the failure message contains
__GFP_NORETRY).

The mail you reference also says "seems that old version is masking
errors", which confirms that we are indeed looking at the right
vmalloc(), because eacd86ca3b03 also removed __GFP_NOWARN there (and
thus the revert reintroduced it).
Comment 9 Vlastimil Babka 2018-07-26 08:48:11 UTC
On 07/26/2018 10:31 AM, Vlastimil Babka wrote:
> On 07/26/2018 10:03 AM, Michal Hocko wrote:
>> On Thu 26-07-18 09:50:45, Vlastimil Babka wrote:
>>> On 07/26/2018 09:42 AM, Michal Hocko wrote:
>>>> On Thu 26-07-18 09:34:58, Vlastimil Babka wrote:
>>>>> On 07/26/2018 09:26 AM, Michal Hocko wrote:
>>>>>> On Thu 26-07-18 09:18:57, Vlastimil Babka wrote:
>>>>>>> On 07/25/2018 09:52 PM, Andrew Morton wrote:
>>>>>>>
>>>>>>> This is likely the kvmalloc() in xt_alloc_table_info(). Between 4.13
>>>>>>> and
>>>>>>> 4.17 it shouldn't use __GFP_NORETRY, but looks like commit 0537250fdc6c
>>>>>>> ("netfilter: x_tables: make allocation less aggressive") was backported
>>>>>>> to 4.14. Removing __GFP_NORETRY might help here, but bring back other
>>>>>>> issues. Less than 4MB is not that much though, maybe find some "sane"
>>>>>>> limit and use __GFP_NORETRY only above that?
>>>>>>
>>>>>> I have seen the same report via
>>>>>> http://lkml.kernel.org/r/df6f501c-8546-1f55-40b1-7e3a8f54d872@icdsoft.com
>>>>>> and the reported confirmed that kvmalloc is not a real culprit
>>>>>>
>>>>>> http://lkml.kernel.org/r/d99a9598-808a-6968-4131-c3949b752004@icdsoft.com
>>>>>
>>>>> Hmm but that was revert of eacd86ca3b03 ("net/netfilter/x_tables.c: use
>>>>> kvmalloc() in xt_alloc_table_info()") which was the 4.13 commit that
>>>>> removed __GFP_NORETRY (there's no __GFP_NORETRY under net/netfilter in
>>>>> v4.14). I assume it was reverted on top of vanilla v4.14 as there would
>>>>> be conflict on the stable with 0537250fdc6c backport. So what should be
>>>>> tested to be sure is either vanilla v4.14 without stable backports, or
>>>>> latest v4.14.y with revert of 0537250fdc6c.
>>>>
>>>> But 0537250fdc6c simply restored the previous NORETRY behavior from
>>>> before eacd86ca3b03. So whatever causes these issues doesn't seem to be
>>>> directly related to the kvmalloc change. Or do I miss what you are
>>>> saying?
>>>
>>> I'm saying that although it's not a regression, as you say (the
>>> vmalloc() there was only for a few kernel versions called without
>>> __GFP_NORETRY), it's still possible that removing __GFP_NORETRY will fix
>>> the issue and thus we will rule out other possibilities.
>>
>> http://lkml.kernel.org/r/d99a9598-808a-6968-4131-c3949b752004@icdsoft.com
>> claims that reverting eacd86ca3b03 didn't really help.

Ah, I see, that mail thread references a different kernel bugzilla
#200639 which doesn't mention 4.14, but outright blames commit
eacd86ca3b03. Yet the alloc fail message contains __GFP_NORETRY, so I
still suspect the kernel also had 0537250fdc6c backport. Georgi can you
please clarify which exact kernel version had the alloc failures, and
how exactly you tested the revert (which version was the baseline for
revert). Thanks.

> Of course not. eacd86ca3b03 *removed* __GFP_NORETRY, so the revert
> reintroduced it. I tried to explain it in the quoted part above starting
> with "Hmm but that was revert of eacd86ca3b03 ...". What I'm saying is
> that eacd86ca3b03 might have actually *fixed* (or rather prevented) this
> alloc failure, if there was not 0537250fdc6c and its 4.14 stable
> backport (the kernel bugzilla report says 4.14, I'm assuming new enough
> stable to contain 0537250fdc6c as the failure message contains
> __GFP_NORETRY).
> 
> The mail you reference also says "seems that old version is masking
> errors", which confirms that we are indeed looking at the right
> vmalloc(), because eacd86ca3b03 also removed __GFP_NOWARN there (and
> thus the revert reintroduced it).
> 
>
Comment 10 Georgi Georgiev 2018-07-26 09:09:05 UTC
Created attachment 277549 [details]
attachment-20091-0.html

*Georgi Nikolov*
System Administrator
www.icdsoft.com <http://www.icdsoft.com>

On 07/26/2018 11:48 AM, Vlastimil Babka wrote:
> On 07/26/2018 10:31 AM, Vlastimil Babka wrote:
>> On 07/26/2018 10:03 AM, Michal Hocko wrote:
>>> On Thu 26-07-18 09:50:45, Vlastimil Babka wrote:
>>>> On 07/26/2018 09:42 AM, Michal Hocko wrote:
>>>>> On Thu 26-07-18 09:34:58, Vlastimil Babka wrote:
>>>>>> On 07/26/2018 09:26 AM, Michal Hocko wrote:
>>>>>>> On Thu 26-07-18 09:18:57, Vlastimil Babka wrote:
>>>>>>>> On 07/25/2018 09:52 PM, Andrew Morton wrote:
>>>>>>>>
>>>>>>>> This is likely the kvmalloc() in xt_alloc_table_info(). Between 4.13
>>>>>>>> and
>>>>>>>> 4.17 it shouldn't use __GFP_NORETRY, but looks like commit
>>>>>>>> 0537250fdc6c
>>>>>>>> ("netfilter: x_tables: make allocation less aggressive") was
>>>>>>>> backported
>>>>>>>> to 4.14. Removing __GFP_NORETRY might help here, but bring back other
>>>>>>>> issues. Less than 4MB is not that much though, maybe find some "sane"
>>>>>>>> limit and use __GFP_NORETRY only above that?
>>>>>>> I have seen the same report via
>>>>>>> http://lkml.kernel.org/r/df6f501c-8546-1f55-40b1-7e3a8f54d872@icdsoft.com
>>>>>>> and the reported confirmed that kvmalloc is not a real culprit
>>>>>>>
>>>>>>> http://lkml.kernel.org/r/d99a9598-808a-6968-4131-c3949b752004@icdsoft.com
>>>>>> Hmm but that was revert of eacd86ca3b03 ("net/netfilter/x_tables.c: use
>>>>>> kvmalloc() in xt_alloc_table_info()") which was the 4.13 commit that
>>>>>> removed __GFP_NORETRY (there's no __GFP_NORETRY under net/netfilter in
>>>>>> v4.14). I assume it was reverted on top of vanilla v4.14 as there would
>>>>>> be conflict on the stable with 0537250fdc6c backport. So what should be
>>>>>> tested to be sure is either vanilla v4.14 without stable backports, or
>>>>>> latest v4.14.y with revert of 0537250fdc6c.
>>>>> But 0537250fdc6c simply restored the previous NORETRY behavior from
>>>>> before eacd86ca3b03. So whatever causes these issues doesn't seem to be
>>>>> directly related to the kvmalloc change. Or do I miss what you are
>>>>> saying?
>>>> I'm saying that although it's not a regression, as you say (the
>>>> vmalloc() there was only for a few kernel versions called without
>>>> __GFP_NORETRY), it's still possible that removing __GFP_NORETRY will fix
>>>> the issue and thus we will rule out other possibilities.
>>> http://lkml.kernel.org/r/d99a9598-808a-6968-4131-c3949b752004@icdsoft.com
>>> claims that reverting eacd86ca3b03 didn't really help.
> Ah, I see, that mail thread references a different kernel bugzilla
> #200639 which doesn't mention 4.14, but outright blames commit
> eacd86ca3b03. Yet the alloc fail message contains __GFP_NORETRY, so I
> still suspect the kernel also had 0537250fdc6c backport. Georgi can you
> please clarify which exact kernel version had the alloc failures, and
> how exactly you tested the revert (which version was the baseline for
> revert). Thanks.
>
>> Of course not. eacd86ca3b03 *removed* __GFP_NORETRY, so the revert
>> reintroduced it. I tried to explain it in the quoted part above starting
>> with "Hmm but that was revert of eacd86ca3b03 ...". What I'm saying is
>> that eacd86ca3b03 might have actually *fixed* (or rather prevented) this
>> alloc failure, if there was not 0537250fdc6c and its 4.14 stable
>> backport (the kernel bugzilla report says 4.14, I'm assuming new enough
>> stable to contain 0537250fdc6c as the failure message contains
>> __GFP_NORETRY).
>>
>> The mail you reference also says "seems that old version is masking
>> errors", which confirms that we are indeed looking at the right
>> vmalloc(), because eacd86ca3b03 also removed __GFP_NOWARN there (and
>> thus the revert reintroduced it).
>>
>>

Hello,
Kernel that has allocation failures is 4.14.50.
Here is the patch applied to this version which masks errors:

--- net/netfilter/x_tables.c    2018-06-18 14:18:21.138347416 +0300
+++ net/netfilter/x_tables.c    2018-07-26 11:58:01.721932962 +0300
@@ -1059,9 +1059,19 @@
      * than shoot all processes down before realizing there is nothing
      * more to reclaim.
      */
-    info = kvmalloc(sz, GFP_KERNEL | __GFP_NORETRY);
+/*    info = kvmalloc(sz, GFP_KERNEL | __GFP_NORETRY);
     if (!info)
         return NULL;
+*/
+
+    if (sz <= (PAGE_SIZE << PAGE_ALLOC_COSTLY_ORDER))
+        info = kmalloc(sz, GFP_KERNEL | __GFP_NOWARN | __GFP_NORETRY);
+    if (!info) {
+        info = __vmalloc(sz, GFP_KERNEL | __GFP_NOWARN | __GFP_NORETRY,
+         PAGE_KERNEL);
+        if (!info)
+        return NULL;
+    }
 
     memset(info, 0, sizeof(*info));
     info->size = size;


I will try to reproduce it with only

info = kvmalloc(sz, GFP_KERNEL);

Regards,

--
Georgi Nikolov
Comment 11 Georgi Georgiev 2018-07-30 13:37:21 UTC
On 07/26/2018 12:02 PM, Georgi Nikolov wrote:
> On 07/26/2018 11:48 AM, Vlastimil Babka wrote:
>> On 07/26/2018 10:31 AM, Vlastimil Babka wrote:
>>> On 07/26/2018 10:03 AM, Michal Hocko wrote:
>>>> On Thu 26-07-18 09:50:45, Vlastimil Babka wrote:
>>>>> On 07/26/2018 09:42 AM, Michal Hocko wrote:
>>>>>> On Thu 26-07-18 09:34:58, Vlastimil Babka wrote:
>>>>>>> On 07/26/2018 09:26 AM, Michal Hocko wrote:
>>>>>>>> On Thu 26-07-18 09:18:57, Vlastimil Babka wrote:
>>>>>>>>> On 07/25/2018 09:52 PM, Andrew Morton wrote:
>>>>>>>>>
>>>>>>>>> This is likely the kvmalloc() in xt_alloc_table_info(). Between 4.13
>>>>>>>>> and
>>>>>>>>> 4.17 it shouldn't use __GFP_NORETRY, but looks like commit
>>>>>>>>> 0537250fdc6c
>>>>>>>>> ("netfilter: x_tables: make allocation less aggressive") was
>>>>>>>>> backported
>>>>>>>>> to 4.14. Removing __GFP_NORETRY might help here, but bring back other
>>>>>>>>> issues. Less than 4MB is not that much though, maybe find some "sane"
>>>>>>>>> limit and use __GFP_NORETRY only above that?
>>>>>>>> I have seen the same report via
>>>>>>>> http://lkml.kernel.org/r/df6f501c-8546-1f55-40b1-7e3a8f54d872@icdsoft.com
>>>>>>>> and the reported confirmed that kvmalloc is not a real culprit
>>>>>>>>
>>>>>>>> http://lkml.kernel.org/r/d99a9598-808a-6968-4131-c3949b752004@icdsoft.com
>>>>>>> Hmm but that was revert of eacd86ca3b03 ("net/netfilter/x_tables.c: use
>>>>>>> kvmalloc() in xt_alloc_table_info()") which was the 4.13 commit that
>>>>>>> removed __GFP_NORETRY (there's no __GFP_NORETRY under net/netfilter in
>>>>>>> v4.14). I assume it was reverted on top of vanilla v4.14 as there would
>>>>>>> be conflict on the stable with 0537250fdc6c backport. So what should be
>>>>>>> tested to be sure is either vanilla v4.14 without stable backports, or
>>>>>>> latest v4.14.y with revert of 0537250fdc6c.
>>>>>> But 0537250fdc6c simply restored the previous NORETRY behavior from
>>>>>> before eacd86ca3b03. So whatever causes these issues doesn't seem to be
>>>>>> directly related to the kvmalloc change. Or do I miss what you are
>>>>>> saying?
>>>>> I'm saying that although it's not a regression, as you say (the
>>>>> vmalloc() there was only for a few kernel versions called without
>>>>> __GFP_NORETRY), it's still possible that removing __GFP_NORETRY will fix
>>>>> the issue and thus we will rule out other possibilities.
>>>> http://lkml.kernel.org/r/d99a9598-808a-6968-4131-c3949b752004@icdsoft.com
>>>> claims that reverting eacd86ca3b03 didn't really help.
>> Ah, I see, that mail thread references a different kernel bugzilla
>> #200639 which doesn't mention 4.14, but outright blames commit
>> eacd86ca3b03. Yet the alloc fail message contains __GFP_NORETRY, so I
>> still suspect the kernel also had 0537250fdc6c backport. Georgi can you
>> please clarify which exact kernel version had the alloc failures, and
>> how exactly you tested the revert (which version was the baseline for
>> revert). Thanks.
>>
>>> Of course not. eacd86ca3b03 *removed* __GFP_NORETRY, so the revert
>>> reintroduced it. I tried to explain it in the quoted part above starting
>>> with "Hmm but that was revert of eacd86ca3b03 ...". What I'm saying is
>>> that eacd86ca3b03 might have actually *fixed* (or rather prevented) this
>>> alloc failure, if there was not 0537250fdc6c and its 4.14 stable
>>> backport (the kernel bugzilla report says 4.14, I'm assuming new enough
>>> stable to contain 0537250fdc6c as the failure message contains
>>> __GFP_NORETRY).
>>>
>>> The mail you reference also says "seems that old version is masking
>>> errors", which confirms that we are indeed looking at the right
>>> vmalloc(), because eacd86ca3b03 also removed __GFP_NOWARN there (and
>>> thus the revert reintroduced it).
>>>
>>>
>
> Hello,
> Kernel that has allocation failures is 4.14.50.
> Here is the patch applied to this version which masks errors:
>
> --- net/netfilter/x_tables.c    2018-06-18 14:18:21.138347416 +0300
> +++ net/netfilter/x_tables.c    2018-07-26 11:58:01.721932962 +0300
> @@ -1059,9 +1059,19 @@
>       * than shoot all processes down before realizing there is nothing
>       * more to reclaim.
>       */
> -    info = kvmalloc(sz, GFP_KERNEL | __GFP_NORETRY);
> +/*    info = kvmalloc(sz, GFP_KERNEL | __GFP_NORETRY);
>      if (!info)
>          return NULL;
> +*/
> +
> +    if (sz <= (PAGE_SIZE << PAGE_ALLOC_COSTLY_ORDER))
> +        info = kmalloc(sz, GFP_KERNEL | __GFP_NOWARN | __GFP_NORETRY);
> +    if (!info) {
> +        info = __vmalloc(sz, GFP_KERNEL | __GFP_NOWARN | __GFP_NORETRY,
> +         PAGE_KERNEL);
> +        if (!info)
> +        return NULL;
> +    }
>  
>      memset(info, 0, sizeof(*info));
>      info->size = size;
>
>
> I will try to reproduce it with only
>
> info = kvmalloc(sz, GFP_KERNEL);
>
> Regards,
>
> --
> Georgi Nikolov
>

Hello,

Without GFP_NORETRY problem disappears.

What is correct way to fix it.
- inside xt_alloc_table_info remove GFP_NORETRY from kvmalloc or add this flag only for sizes bigger than some threshold
- inside kvmalloc_node remove GFP_NORETRY from __vmalloc_node_flags_caller (i don't know if it honors this flag, or the problem is elsewhere)

Regards,

--
Georgi Nikolov
Comment 12 Michal Hocko 2018-07-30 13:57:50 UTC
On Mon 30-07-18 16:37:07, Georgi Nikolov wrote:
> On 07/26/2018 12:02 PM, Georgi Nikolov wrote:
[...]
> > Here is the patch applied to this version which masks errors:
> >
> > --- net/netfilter/x_tables.c    2018-06-18 14:18:21.138347416 +0300
> > +++ net/netfilter/x_tables.c    2018-07-26 11:58:01.721932962 +0300
> > @@ -1059,9 +1059,19 @@
> >       * than shoot all processes down before realizing there is nothing
> >       * more to reclaim.
> >       */
> > -    info = kvmalloc(sz, GFP_KERNEL | __GFP_NORETRY);
> > +/*    info = kvmalloc(sz, GFP_KERNEL | __GFP_NORETRY);
> >      if (!info)
> >          return NULL;
> > +*/
> > +
> > +    if (sz <= (PAGE_SIZE << PAGE_ALLOC_COSTLY_ORDER))
> > +        info = kmalloc(sz, GFP_KERNEL | __GFP_NOWARN | __GFP_NORETRY);
> > +    if (!info) {
> > +        info = __vmalloc(sz, GFP_KERNEL | __GFP_NOWARN | __GFP_NORETRY,
> > +         PAGE_KERNEL);
> > +        if (!info)
> > +        return NULL;
> > +    }
> >  
> >      memset(info, 0, sizeof(*info));
> >      info->size = size;
> >
> >
> > I will try to reproduce it with only
> >
> > info = kvmalloc(sz, GFP_KERNEL);
> >
> > Regards,
> >
> > --
> > Georgi Nikolov
> >
> 
> Hello,
> 
> Without GFP_NORETRY problem disappears.

Hmm, there are two allocation paths which have __GFP_NORETRY here.
I expect you have removed both of them, right?

kvmalloc implicitly performs __GFP_NORETRY on kmalloc path but it
doesn't have it for the vmalloc fallback. This would match
kvmalloc(GFP_KERNEL). I thought you were testing this code path
previously but there is some confusion flying around because you have
claimed that the regressions started with eacd86ca3b036. If the
regression is really with __GFP_NORETRY being used for the vmalloc
fallback which would be kvmalloc(GFP_KERNEL | __GFP_NORETRY) then
I am still confused because that would match the original code.

> What is correct way to fix it.
> - inside xt_alloc_table_info remove GFP_NORETRY from kvmalloc or add
> this flag only for sizes bigger than some threshold

This would reintroduce issue fixed by 0537250fdc6c8. Note that
kvmalloc(GFP_KERNEL | __GFP_NORETRY) is more or less equivalent to the
original code (well, except for __GFP_NOWARN).

> - inside kvmalloc_node remove GFP_NORETRY from
> __vmalloc_node_flags_caller (i don't know if it honors this flag, or
> the problem is elsewhere)

No, not really. This is basically equivalent to kvmalloc(GFP_KERNEL).

I strongly suspect that this is not a regression in this code but rather
a side effect of larger memory fragmentation caused by something else.
In any case do you see this failure also without artificial test case
with a standard workload?
Comment 13 Georgi Georgiev 2018-07-30 15:54:46 UTC
On 07/30/2018 04:57 PM, Michal Hocko wrote:

> On Mon 30-07-18 16:37:07, Georgi Nikolov wrote:
>> On 07/26/2018 12:02 PM, Georgi Nikolov wrote:
> [...]
>>> Here is the patch applied to this version which masks errors:
>>>
>>> --- net/netfilter/x_tables.c    2018-06-18 14:18:21.138347416 +0300
>>> +++ net/netfilter/x_tables.c    2018-07-26 11:58:01.721932962 +0300
>>> @@ -1059,9 +1059,19 @@
>>>       * than shoot all processes down before realizing there is nothing
>>>       * more to reclaim.
>>>       */
>>> -    info = kvmalloc(sz, GFP_KERNEL | __GFP_NORETRY);
>>> +/*    info = kvmalloc(sz, GFP_KERNEL | __GFP_NORETRY);
>>>      if (!info)
>>>          return NULL;
>>> +*/
>>> +
>>> +    if (sz <= (PAGE_SIZE << PAGE_ALLOC_COSTLY_ORDER))
>>> +        info = kmalloc(sz, GFP_KERNEL | __GFP_NOWARN | __GFP_NORETRY);
>>> +    if (!info) {
>>> +        info = __vmalloc(sz, GFP_KERNEL | __GFP_NOWARN | __GFP_NORETRY,
>>> +         PAGE_KERNEL);
>>> +        if (!info)
>>> +        return NULL;
>>> +    }
>>>  
>>>      memset(info, 0, sizeof(*info));
>>>      info->size = size;
>>>
>>>
>>> I will try to reproduce it with only
>>>
>>> info = kvmalloc(sz, GFP_KERNEL);
>>>
>>> Regards,
>>>
>>> --
>>> Georgi Nikolov
>>>
>> Hello,
>>
>> Without GFP_NORETRY problem disappears.
> Hmm, there are two allocation paths which have __GFP_NORETRY here.
> I expect you have removed both of them, right?
>
> kvmalloc implicitly performs __GFP_NORETRY on kmalloc path but it
> doesn't have it for the vmalloc fallback. This would match
> kvmalloc(GFP_KERNEL). I thought you were testing this code path
> previously but there is some confusion flying around because you have
> claimed that the regressions started with eacd86ca3b036. If the
> regression is really with __GFP_NORETRY being used for the vmalloc
> fallback which would be kvmalloc(GFP_KERNEL | __GFP_NORETRY) then
> I am still confused because that would match the original code.

No i was wrong. The regression starts actually with 0537250fdc6c8.
- old code, which opencodes kvmalloc, is masking error but error is there
- kvmalloc without GFP_NORETRY works fine, but probably can consume a
lot of memory - commit: eacd86ca3b036
- kvmalloc with GFP_NORETRY shows error - commit: 0537250fdc6c8

>> What is correct way to fix it.
>> - inside xt_alloc_table_info remove GFP_NORETRY from kvmalloc or add
>> this flag only for sizes bigger than some threshold
> This would reintroduce issue fixed by 0537250fdc6c8. Note that
> kvmalloc(GFP_KERNEL | __GFP_NORETRY) is more or less equivalent to the
> original code (well, except for __GFP_NOWARN).

So probably we should pass GFP_NORETRY only for large requests (above
some threshold).

>
>> - inside kvmalloc_node remove GFP_NORETRY from
>> __vmalloc_node_flags_caller (i don't know if it honors this flag, or
>> the problem is elsewhere)
> No, not really. This is basically equivalent to kvmalloc(GFP_KERNEL).
>
> I strongly suspect that this is not a regression in this code but rather
> a side effect of larger memory fragmentation caused by something else.
> In any case do you see this failure also without artificial test case
> with a standard workload?

Yes i can see failures with standard workload, in fact it was hard to
reproduce it.
Here is the error from production servers where allocation is smaller:
iptables: vmalloc: allocation failure, allocated 131072 of 225280 bytes,
mode:0x14010c0(GFP_KERNEL|__GFP_NORETRY), nodemask=(null)

I didn't understand if vmalloc honors GFP_NORETRY.

Regards,

--
Georgi Nikolov
Comment 14 Michal Hocko 2018-07-30 18:38:26 UTC
On Mon 30-07-18 18:54:24, Georgi Nikolov wrote:
[...]
> No i was wrong. The regression starts actually with 0537250fdc6c8.
> - old code, which opencodes kvmalloc, is masking error but error is there
> - kvmalloc without GFP_NORETRY works fine, but probably can consume a
> lot of memory - commit: eacd86ca3b036
> - kvmalloc with GFP_NORETRY shows error - commit: 0537250fdc6c8

OK.

> >> What is correct way to fix it.
> >> - inside xt_alloc_table_info remove GFP_NORETRY from kvmalloc or add
> >> this flag only for sizes bigger than some threshold
> > This would reintroduce issue fixed by 0537250fdc6c8. Note that
> > kvmalloc(GFP_KERNEL | __GFP_NORETRY) is more or less equivalent to the
> > original code (well, except for __GFP_NOWARN).
> 
> So probably we should pass GFP_NORETRY only for large requests (above
> some threshold).

What would be the treshold? This is not really my area so I just wanted
to keep the original code semantic.
 
> >> - inside kvmalloc_node remove GFP_NORETRY from
> >> __vmalloc_node_flags_caller (i don't know if it honors this flag, or
> >> the problem is elsewhere)
> > No, not really. This is basically equivalent to kvmalloc(GFP_KERNEL).
> >
> > I strongly suspect that this is not a regression in this code but rather
> > a side effect of larger memory fragmentation caused by something else.
> > In any case do you see this failure also without artificial test case
> > with a standard workload?
> 
> Yes i can see failures with standard workload, in fact it was hard to
> reproduce it.
> Here is the error from production servers where allocation is smaller:
> iptables: vmalloc: allocation failure, allocated 131072 of 225280 bytes,
> mode:0x14010c0(GFP_KERNEL|__GFP_NORETRY), nodemask=(null)
> 
> I didn't understand if vmalloc honors GFP_NORETRY.

0537250fdc6c8 changelog tries to explain. kvmalloc doesn't really
support the GFP_NORETRY remantic because that would imply the request
wouldn't trigger the oom killer but in rare cases this might happen
(e.g. when page tables are allocated because those are hardcoded GFP_KERNEL).

That being said, I have no objection to use GFP_KERNEL if it helps real
workloads but we probably need some cap...
Comment 15 Georgi Georgiev 2018-07-30 18:51:58 UTC
On 07/30/2018 09:38 PM, Michal Hocko wrote:
> On Mon 30-07-18 18:54:24, Georgi Nikolov wrote:
> [...]
>> No i was wrong. The regression starts actually with 0537250fdc6c8.
>> - old code, which opencodes kvmalloc, is masking error but error is there
>> - kvmalloc without GFP_NORETRY works fine, but probably can consume a
>> lot of memory - commit: eacd86ca3b036
>> - kvmalloc with GFP_NORETRY shows error - commit: 0537250fdc6c8
> OK.
>
>>>> What is correct way to fix it.
>>>> - inside xt_alloc_table_info remove GFP_NORETRY from kvmalloc or add
>>>> this flag only for sizes bigger than some threshold
>>> This would reintroduce issue fixed by 0537250fdc6c8. Note that
>>> kvmalloc(GFP_KERNEL | __GFP_NORETRY) is more or less equivalent to the
>>> original code (well, except for __GFP_NOWARN).
>> So probably we should pass GFP_NORETRY only for large requests (above
>> some threshold).
> What would be the treshold? This is not really my area so I just wanted
> to keep the original code semantic.
>  
>>>> - inside kvmalloc_node remove GFP_NORETRY from
>>>> __vmalloc_node_flags_caller (i don't know if it honors this flag, or
>>>> the problem is elsewhere)
>>> No, not really. This is basically equivalent to kvmalloc(GFP_KERNEL).
>>>
>>> I strongly suspect that this is not a regression in this code but rather
>>> a side effect of larger memory fragmentation caused by something else.
>>> In any case do you see this failure also without artificial test case
>>> with a standard workload?
>> Yes i can see failures with standard workload, in fact it was hard to
>> reproduce it.
>> Here is the error from production servers where allocation is smaller:
>> iptables: vmalloc: allocation failure, allocated 131072 of 225280 bytes,
>> mode:0x14010c0(GFP_KERNEL|__GFP_NORETRY), nodemask=(null)
>>
>> I didn't understand if vmalloc honors GFP_NORETRY.
> 0537250fdc6c8 changelog tries to explain. kvmalloc doesn't really
> support the GFP_NORETRY remantic because that would imply the request
> wouldn't trigger the oom killer but in rare cases this might happen
> (e.g. when page tables are allocated because those are hardcoded GFP_KERNEL).
>
> That being said, I have no objection to use GFP_KERNEL if it helps real
> workloads but we probably need some cap...

Probably Vlastimil Babka can propose some limit:

On Thu 26-07-18 09:18:57, Vlastimil Babka wrote:
This is likely the kvmalloc() in xt_alloc_table_info(). Between 4.13 and
4.17 it shouldn't use __GFP_NORETRY, but looks like commit 0537250fdc6c
("netfilter: x_tables: make allocation less aggressive") was backported
to 4.14. Removing __GFP_NORETRY might help here, but bring back other
issues. Less than 4MB is not that much though, maybe find some "sane"
limit and use __GFP_NORETRY only above that?


Regards,

--
Georgi Nikolov
Comment 16 Vlastimil Babka 2018-07-31 06:38:07 UTC
On 07/30/2018 08:51 PM, Georgi Nikolov wrote:
> On 07/30/2018 09:38 PM, Michal Hocko wrote:
>> On Mon 30-07-18 18:54:24, Georgi Nikolov wrote:
>> [...]
>>> No i was wrong. The regression starts actually with 0537250fdc6c8.
>>> - old code, which opencodes kvmalloc, is masking error but error is there
>>> - kvmalloc without GFP_NORETRY works fine, but probably can consume a
>>> lot of memory - commit: eacd86ca3b036
>>> - kvmalloc with GFP_NORETRY shows error - commit: 0537250fdc6c8
>> OK.
>>
>>>>> What is correct way to fix it.
>>>>> - inside xt_alloc_table_info remove GFP_NORETRY from kvmalloc or add
>>>>> this flag only for sizes bigger than some threshold
>>>> This would reintroduce issue fixed by 0537250fdc6c8. Note that
>>>> kvmalloc(GFP_KERNEL | __GFP_NORETRY) is more or less equivalent to the
>>>> original code (well, except for __GFP_NOWARN).
>>> So probably we should pass GFP_NORETRY only for large requests (above
>>> some threshold).
>> What would be the treshold? This is not really my area so I just wanted
>> to keep the original code semantic.
>>  
>>>>> - inside kvmalloc_node remove GFP_NORETRY from
>>>>> __vmalloc_node_flags_caller (i don't know if it honors this flag, or
>>>>> the problem is elsewhere)
>>>> No, not really. This is basically equivalent to kvmalloc(GFP_KERNEL).
>>>>
>>>> I strongly suspect that this is not a regression in this code but rather
>>>> a side effect of larger memory fragmentation caused by something else.
>>>> In any case do you see this failure also without artificial test case
>>>> with a standard workload?
>>> Yes i can see failures with standard workload, in fact it was hard to
>>> reproduce it.
>>> Here is the error from production servers where allocation is smaller:
>>> iptables: vmalloc: allocation failure, allocated 131072 of 225280 bytes,
>>> mode:0x14010c0(GFP_KERNEL|__GFP_NORETRY), nodemask=(null)
>>>
>>> I didn't understand if vmalloc honors GFP_NORETRY.
>> 0537250fdc6c8 changelog tries to explain. kvmalloc doesn't really
>> support the GFP_NORETRY remantic because that would imply the request
>> wouldn't trigger the oom killer but in rare cases this might happen
>> (e.g. when page tables are allocated because those are hardcoded
>> GFP_KERNEL).
>>
>> That being said, I have no objection to use GFP_KERNEL if it helps real
>> workloads but we probably need some cap...
> 
> Probably Vlastimil Babka can propose some limit:

No, I think that's rather for the netfilter folks to decide. However, it
seems there has been the debate already [1] and it was not found. The
conclusion was that __GFP_NORETRY worked fine before, so it should work
again after it's added back. But now we know that it doesn't...

[1] https://lore.kernel.org/lkml/20180130140104.GE21609@dhcp22.suse.cz/T/#u

> On Thu 26-07-18 09:18:57, Vlastimil Babka wrote:
> This is likely the kvmalloc() in xt_alloc_table_info(). Between 4.13 and
> 4.17 it shouldn't use __GFP_NORETRY, but looks like commit 0537250fdc6c
> ("netfilter: x_tables: make allocation less aggressive") was backported
> to 4.14. Removing __GFP_NORETRY might help here, but bring back other
> issues. Less than 4MB is not that much though, maybe find some "sane"
> limit and use __GFP_NORETRY only above that?
> 
> 
> Regards,
> 
> --
> Georgi Nikolov
> 
>
Comment 17 Georgi Georgiev 2018-07-31 13:55:41 UTC
On 07/31/2018 09:38 AM, Vlastimil Babka wrote:
> On 07/30/2018 08:51 PM, Georgi Nikolov wrote:
>> On 07/30/2018 09:38 PM, Michal Hocko wrote:
>>> On Mon 30-07-18 18:54:24, Georgi Nikolov wrote:
>>> [...]
>>>> No i was wrong. The regression starts actually with 0537250fdc6c8.
>>>> - old code, which opencodes kvmalloc, is masking error but error is there
>>>> - kvmalloc without GFP_NORETRY works fine, but probably can consume a
>>>> lot of memory - commit: eacd86ca3b036
>>>> - kvmalloc with GFP_NORETRY shows error - commit: 0537250fdc6c8
>>> OK.
>>>
>>>>>> What is correct way to fix it.
>>>>>> - inside xt_alloc_table_info remove GFP_NORETRY from kvmalloc or add
>>>>>> this flag only for sizes bigger than some threshold
>>>>> This would reintroduce issue fixed by 0537250fdc6c8. Note that
>>>>> kvmalloc(GFP_KERNEL | __GFP_NORETRY) is more or less equivalent to the
>>>>> original code (well, except for __GFP_NOWARN).
>>>> So probably we should pass GFP_NORETRY only for large requests (above
>>>> some threshold).
>>> What would be the treshold? This is not really my area so I just wanted
>>> to keep the original code semantic.
>>>  
>>>>>> - inside kvmalloc_node remove GFP_NORETRY from
>>>>>> __vmalloc_node_flags_caller (i don't know if it honors this flag, or
>>>>>> the problem is elsewhere)
>>>>> No, not really. This is basically equivalent to kvmalloc(GFP_KERNEL).
>>>>>
>>>>> I strongly suspect that this is not a regression in this code but rather
>>>>> a side effect of larger memory fragmentation caused by something else.
>>>>> In any case do you see this failure also without artificial test case
>>>>> with a standard workload?
>>>> Yes i can see failures with standard workload, in fact it was hard to
>>>> reproduce it.
>>>> Here is the error from production servers where allocation is smaller:
>>>> iptables: vmalloc: allocation failure, allocated 131072 of 225280 bytes,
>>>> mode:0x14010c0(GFP_KERNEL|__GFP_NORETRY), nodemask=(null)
>>>>
>>>> I didn't understand if vmalloc honors GFP_NORETRY.
>>> 0537250fdc6c8 changelog tries to explain. kvmalloc doesn't really
>>> support the GFP_NORETRY remantic because that would imply the request
>>> wouldn't trigger the oom killer but in rare cases this might happen
>>> (e.g. when page tables are allocated because those are hardcoded
>>> GFP_KERNEL).
>>>
>>> That being said, I have no objection to use GFP_KERNEL if it helps real
>>> workloads but we probably need some cap...
>> Probably Vlastimil Babka can propose some limit:
> No, I think that's rather for the netfilter folks to decide. However, it
> seems there has been the debate already [1] and it was not found. The
> conclusion was that __GFP_NORETRY worked fine before, so it should work
> again after it's added back. But now we know that it doesn't...
>
> [1] https://lore.kernel.org/lkml/20180130140104.GE21609@dhcp22.suse.cz/T/#u

Yes i see. I will add Florian Westphal to CC list. netfilter-devel is
already in this list so probably have to wait for their opinion.

>> On Thu 26-07-18 09:18:57, Vlastimil Babka wrote:
>> This is likely the kvmalloc() in xt_alloc_table_info(). Between 4.13 and
>> 4.17 it shouldn't use __GFP_NORETRY, but looks like commit 0537250fdc6c
>> ("netfilter: x_tables: make allocation less aggressive") was backported
>> to 4.14. Removing __GFP_NORETRY might help here, but bring back other
>> issues. Less than 4MB is not that much though, maybe find some "sane"
>> limit and use __GFP_NORETRY only above that?
>>
>>
>> Regards,
>>
>> --
>> Georgi Nikolov
>>
>>

Regards,

--
Georgi Nikolov
Comment 18 Georgi Georgiev 2018-07-31 14:25:18 UTC
On 07/31/2018 05:05 PM, Florian Westphal wrote:
> Georgi Nikolov <gnikolov@icdsoft.com> wrote:
>>> No, I think that's rather for the netfilter folks to decide. However, it
>>> seems there has been the debate already [1] and it was not found. The
>>> conclusion was that __GFP_NORETRY worked fine before, so it should work
>>> again after it's added back. But now we know that it doesn't...
>>>
>>> [1] https://lore.kernel.org/lkml/20180130140104.GE21609@dhcp22.suse.cz/T/#u
>> Yes i see. I will add Florian Westphal to CC list. netfilter-devel is
>> already in this list so probably have to wait for their opinion.
> It hasn't changed, I think having OOM killer zap random processes
> just because userspace wants to import large iptables ruleset is not a
> good idea.
And what about passing GFP_NORETRY only above some reasonable threshold?
Or situation has to be handled in userspace.


Regards,

--
Georgi Nikolov
Comment 19 Florian Westphal 2018-07-31 14:45:15 UTC
Georgi Nikolov <gnikolov@icdsoft.com> wrote:
> > No, I think that's rather for the netfilter folks to decide. However, it
> > seems there has been the debate already [1] and it was not found. The
> > conclusion was that __GFP_NORETRY worked fine before, so it should work
> > again after it's added back. But now we know that it doesn't...
> >
> > [1] https://lore.kernel.org/lkml/20180130140104.GE21609@dhcp22.suse.cz/T/#u
> 
> Yes i see. I will add Florian Westphal to CC list. netfilter-devel is
> already in this list so probably have to wait for their opinion.

It hasn't changed, I think having OOM killer zap random processes
just because userspace wants to import large iptables ruleset is not a
good idea.
Comment 20 Vlastimil Babka 2018-08-01 07:17:53 UTC
On 07/31/2018 04:25 PM, Georgi Nikolov wrote:
> On 07/31/2018 05:05 PM, Florian Westphal wrote:
>> Georgi Nikolov <gnikolov@icdsoft.com> wrote:
>>>> No, I think that's rather for the netfilter folks to decide. However, it
>>>> seems there has been the debate already [1] and it was not found. The
>>>> conclusion was that __GFP_NORETRY worked fine before, so it should work
>>>> again after it's added back. But now we know that it doesn't...
>>>>
>>>> [1]
>>>> https://lore.kernel.org/lkml/20180130140104.GE21609@dhcp22.suse.cz/T/#u
>>> Yes i see. I will add Florian Westphal to CC list. netfilter-devel is
>>> already in this list so probably have to wait for their opinion.
>> It hasn't changed, I think having OOM killer zap random processes
>> just because userspace wants to import large iptables ruleset is not a
>> good idea.
> And what about passing GFP_NORETRY only above some reasonable threshold?

What is the reasonable threshold?

> Or situation has to be handled in userspace.

How?

> 
> Regards,
> 
> --
> Georgi Nikolov
>
Comment 21 Vlastimil Babka 2018-08-01 07:34:28 UTC
On 07/31/2018 04:05 PM, Florian Westphal wrote:
> Georgi Nikolov <gnikolov@icdsoft.com> wrote:
>>> No, I think that's rather for the netfilter folks to decide. However, it
>>> seems there has been the debate already [1] and it was not found. The
>>> conclusion was that __GFP_NORETRY worked fine before, so it should work
>>> again after it's added back. But now we know that it doesn't...
>>>
>>> [1] https://lore.kernel.org/lkml/20180130140104.GE21609@dhcp22.suse.cz/T/#u
>>
>> Yes i see. I will add Florian Westphal to CC list. netfilter-devel is
>> already in this list so probably have to wait for their opinion.
> 
> It hasn't changed, I think having OOM killer zap random processes
> just because userspace wants to import large iptables ruleset is not a
> good idea.

If we denied the allocation instead of OOM (e.g. by using
__GFP_RETRY_MAYFAIL), a slightly smaller one may succeed, still leaving
the system without much memory, so it will invoke OOM killer sooner or
later anyway.

I don't see any silver-bullet solution, unfortunately. If this can be
abused by (multiple) namespaces, then they have to be contained by
kmemcg as that's the generic mechanism intended for this. Then we could
use the __GFP_RETRY_MAYFAIL.
The only limit we could impose to outright deny the allocation (to
prevent obvious bugs/admin mistakes or abuses) could be based on the
amount of RAM, as was suggested in the old thread.

__GFP_NORETRY might look like a good match at first sight as that stops
allocating when "reclaim becomes hard" which means the system is still
relatively far from OOM. But it's not reliable in principle, and as this
bug report shows. That's fine when __GFP_NORETRY is used for optimistic
allocations that have some other fallback (e.g. huge page with fallback
to base page), but far from ideal when failure means returning -ENOMEM
to userspace.
Comment 22 Michal Hocko 2018-08-01 08:33:55 UTC
On Wed 01-08-18 09:34:23, Vlastimil Babka wrote:
> On 07/31/2018 04:05 PM, Florian Westphal wrote:
> > Georgi Nikolov <gnikolov@icdsoft.com> wrote:
> >>> No, I think that's rather for the netfilter folks to decide. However, it
> >>> seems there has been the debate already [1] and it was not found. The
> >>> conclusion was that __GFP_NORETRY worked fine before, so it should work
> >>> again after it's added back. But now we know that it doesn't...
> >>>
> >>> [1]
> https://lore.kernel.org/lkml/20180130140104.GE21609@dhcp22.suse.cz/T/#u
> >>
> >> Yes i see. I will add Florian Westphal to CC list. netfilter-devel is
> >> already in this list so probably have to wait for their opinion.
> > 
> > It hasn't changed, I think having OOM killer zap random processes
> > just because userspace wants to import large iptables ruleset is not a
> > good idea.
> 
> If we denied the allocation instead of OOM (e.g. by using
> __GFP_RETRY_MAYFAIL), a slightly smaller one may succeed, still leaving
> the system without much memory, so it will invoke OOM killer sooner or
> later anyway.
> 
> I don't see any silver-bullet solution, unfortunately. If this can be
> abused by (multiple) namespaces, then they have to be contained by
> kmemcg as that's the generic mechanism intended for this. Then we could
> use the __GFP_RETRY_MAYFAIL.
> The only limit we could impose to outright deny the allocation (to
> prevent obvious bugs/admin mistakes or abuses) could be based on the
> amount of RAM, as was suggested in the old thread.
> 
> __GFP_NORETRY might look like a good match at first sight as that stops
> allocating when "reclaim becomes hard" which means the system is still
> relatively far from OOM. But it's not reliable in principle, and as this
> bug report shows. That's fine when __GFP_NORETRY is used for optimistic
> allocations that have some other fallback (e.g. huge page with fallback
> to base page), but far from ideal when failure means returning -ENOMEM
> to userspace.

I absolutely agree. The whole __GFP_NORETRY is quite dubious TBH. I have
used it to get the original behavior because the change wasn't really
intended to make functional changes. But consideg ring this requires
higher privileges then I fail to see where the distrust comes from. If
this is really about untrusted root in a namespace then the proper way
is to use __GFP_ACCOUNT and limit that via kmemc.

__GFP_NORETRY can fail really easily if the kswapd doesn't keep the pace
with the allocations which might be completely unrelated to this
particular request.
Comment 23 Georgi Georgiev 2018-08-01 16:03:17 UTC
*Georgi Nikolov*
System Administrator
www.icdsoft.com <http://www.icdsoft.com>

On 08/01/2018 11:33 AM, Michal Hocko wrote:
> On Wed 01-08-18 09:34:23, Vlastimil Babka wrote:
>> On 07/31/2018 04:05 PM, Florian Westphal wrote:
>>> Georgi Nikolov <gnikolov@icdsoft.com> wrote:
>>>>> No, I think that's rather for the netfilter folks to decide. However, it
>>>>> seems there has been the debate already [1] and it was not found. The
>>>>> conclusion was that __GFP_NORETRY worked fine before, so it should work
>>>>> again after it's added back. But now we know that it doesn't...
>>>>>
>>>>> [1]
>>>>> https://lore.kernel.org/lkml/20180130140104.GE21609@dhcp22.suse.cz/T/#u
>>>> Yes i see. I will add Florian Westphal to CC list. netfilter-devel is
>>>> already in this list so probably have to wait for their opinion.
>>> It hasn't changed, I think having OOM killer zap random processes
>>> just because userspace wants to import large iptables ruleset is not a
>>> good idea.
>> If we denied the allocation instead of OOM (e.g. by using
>> __GFP_RETRY_MAYFAIL), a slightly smaller one may succeed, still leaving
>> the system without much memory, so it will invoke OOM killer sooner or
>> later anyway.
>>
>> I don't see any silver-bullet solution, unfortunately. If this can be
>> abused by (multiple) namespaces, then they have to be contained by
>> kmemcg as that's the generic mechanism intended for this. Then we could
>> use the __GFP_RETRY_MAYFAIL.
>> The only limit we could impose to outright deny the allocation (to
>> prevent obvious bugs/admin mistakes or abuses) could be based on the
>> amount of RAM, as was suggested in the old thread.

Can we make this configurable - on/off switch or size above which
to pass GFP_NORETRY. Probably hard coded based on amount of RAM is a
good idea too.

>> __GFP_NORETRY might look like a good match at first sight as that stops
>> allocating when "reclaim becomes hard" which means the system is still
>> relatively far from OOM. But it's not reliable in principle, and as this
>> bug report shows. That's fine when __GFP_NORETRY is used for optimistic
>> allocations that have some other fallback (e.g. huge page with fallback
>> to base page), but far from ideal when failure means returning -ENOMEM
>> to userspace.
> I absolutely agree. The whole __GFP_NORETRY is quite dubious TBH. I have
> used it to get the original behavior because the change wasn't really
> intended to make functional changes. But consideg ring this requires
> higher privileges then I fail to see where the distrust comes from. If
> this is really about untrusted root in a namespace then the proper way
> is to use __GFP_ACCOUNT and limit that via kmemc.
>
> __GFP_NORETRY can fail really easily if the kswapd doesn't keep the pace
> with the allocations which might be completely unrelated to this
> particular request.
Comment 24 Michal Hocko 2018-08-02 08:50:49 UTC
On Wed 01-08-18 19:03:03, Georgi Nikolov wrote:
> 
> *Georgi Nikolov*
> System Administrator
> www.icdsoft.com <http://www.icdsoft.com>
> 
> On 08/01/2018 11:33 AM, Michal Hocko wrote:
> > On Wed 01-08-18 09:34:23, Vlastimil Babka wrote:
> >> On 07/31/2018 04:05 PM, Florian Westphal wrote:
> >>> Georgi Nikolov <gnikolov@icdsoft.com> wrote:
> >>>>> No, I think that's rather for the netfilter folks to decide. However,
> it
> >>>>> seems there has been the debate already [1] and it was not found. The
> >>>>> conclusion was that __GFP_NORETRY worked fine before, so it should work
> >>>>> again after it's added back. But now we know that it doesn't...
> >>>>>
> >>>>> [1]
> https://lore.kernel.org/lkml/20180130140104.GE21609@dhcp22.suse.cz/T/#u
> >>>> Yes i see. I will add Florian Westphal to CC list. netfilter-devel is
> >>>> already in this list so probably have to wait for their opinion.
> >>> It hasn't changed, I think having OOM killer zap random processes
> >>> just because userspace wants to import large iptables ruleset is not a
> >>> good idea.
> >> If we denied the allocation instead of OOM (e.g. by using
> >> __GFP_RETRY_MAYFAIL), a slightly smaller one may succeed, still leaving
> >> the system without much memory, so it will invoke OOM killer sooner or
> >> later anyway.
> >>
> >> I don't see any silver-bullet solution, unfortunately. If this can be
> >> abused by (multiple) namespaces, then they have to be contained by
> >> kmemcg as that's the generic mechanism intended for this. Then we could
> >> use the __GFP_RETRY_MAYFAIL.
> >> The only limit we could impose to outright deny the allocation (to
> >> prevent obvious bugs/admin mistakes or abuses) could be based on the
> >> amount of RAM, as was suggested in the old thread.
> 
> Can we make this configurable - on/off switch or size above which
> to pass GFP_NORETRY.

Yet another tunable? How do you decide which one to select? Seriously,
configuration knobs sound attractive but they are rarely a good idea.
Either we trust privileged users or we don't and we have kmem accounting
for that.

> Probably hard coded based on amount of RAM is a good idea too.

How do you scale that?

In other words, why don't we simply do the following? Note that this is
not tested. I have also no idea what is the lifetime of this allocation.
Is it bound to any specific process or is it a namespace bound? If the
later then the memcg OOM killer might wipe the whole memcg down without
making any progress. This would make the whole namespace unsuable until
somebody intervenes. Is this acceptable?
---
From 4dec96eb64954a7e58264ed551afadf62ca4c5f7 Mon Sep 17 00:00:00 2001
From: Michal Hocko <mhocko@suse.com>
Date: Thu, 2 Aug 2018 10:38:57 +0200
Subject: [PATCH] netfilter/x_tables: do not fail xt_alloc_table_info too
 easilly

eacd86ca3b03 ("net/netfilter/x_tables.c: use kvmalloc()
in xt_alloc_table_info()") has unintentionally fortified
xt_alloc_table_info allocation when __GFP_RETRY has been dropped from
the vmalloc fallback. Later on there was a syzbot report that this
can lead to OOM killer invocations when tables are too large and
0537250fdc6c ("netfilter: x_tables: make allocation less aggressive")
has been merged to restore the original behavior. Georgi Nikolov however
noticed that he is not able to install his iptables anymore so this can
be seen as a regression.

The primary argument for 0537250fdc6c was that this allocation path
shouldn't really trigger the OOM killer and kill innocent tasks. On the
other hand the interface requires root and as such should allow what the
admin asks for. Root inside a namespaces makes this more complicated
because those might be not trusted in general. If they are not then such
namespaces should be restricted anyway. Therefore drop the __GFP_NORETRY
and replace it by __GFP_ACCOUNT to enfore memcg constrains on it.

Fixes: 0537250fdc6c ("netfilter: x_tables: make allocation less aggressive")
Reported-by: Georgi Nikolov <gnikolov@icdsoft.com>
Suggested-by: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Michal Hocko <mhocko@suse.com>
---
 net/netfilter/x_tables.c | 7 +------
 1 file changed, 1 insertion(+), 6 deletions(-)

diff --git a/net/netfilter/x_tables.c b/net/netfilter/x_tables.c
index d0d8397c9588..b769408e04ab 100644
--- a/net/netfilter/x_tables.c
+++ b/net/netfilter/x_tables.c
@@ -1178,12 +1178,7 @@ struct xt_table_info *xt_alloc_table_info(unsigned int size)
 	if (sz < sizeof(*info) || sz >= XT_MAX_TABLE_SIZE)
 		return NULL;
 
-	/* __GFP_NORETRY is not fully supported by kvmalloc but it should
-	 * work reasonably well if sz is too large and bail out rather
-	 * than shoot all processes down before realizing there is nothing
-	 * more to reclaim.
-	 */
-	info = kvmalloc(sz, GFP_KERNEL | __GFP_NORETRY);
+	info = kvmalloc(sz, GFP_KERNEL | __GFP_ACCOUNT);
 	if (!info)
 		return NULL;
Comment 25 Pablo Neira Ayuso 2018-08-02 09:35:07 UTC
On Thu, Aug 02, 2018 at 10:50:43AM +0200, Michal Hocko wrote:
> On Wed 01-08-18 19:03:03, Georgi Nikolov wrote:
> > 
> > *Georgi Nikolov*
> > System Administrator
> > www.icdsoft.com <http://www.icdsoft.com>
> > 
> > On 08/01/2018 11:33 AM, Michal Hocko wrote:
> > > On Wed 01-08-18 09:34:23, Vlastimil Babka wrote:
> > >> On 07/31/2018 04:05 PM, Florian Westphal wrote:
> > >>> Georgi Nikolov <gnikolov@icdsoft.com> wrote:
> > >>>>> No, I think that's rather for the netfilter folks to decide. However,
> it
> > >>>>> seems there has been the debate already [1] and it was not found. The
> > >>>>> conclusion was that __GFP_NORETRY worked fine before, so it should
> work
> > >>>>> again after it's added back. But now we know that it doesn't...
> > >>>>>
> > >>>>> [1]
> https://lore.kernel.org/lkml/20180130140104.GE21609@dhcp22.suse.cz/T/#u
> > >>>> Yes i see. I will add Florian Westphal to CC list. netfilter-devel is
> > >>>> already in this list so probably have to wait for their opinion.
> > >>> It hasn't changed, I think having OOM killer zap random processes
> > >>> just because userspace wants to import large iptables ruleset is not a
> > >>> good idea.
> > >> If we denied the allocation instead of OOM (e.g. by using
> > >> __GFP_RETRY_MAYFAIL), a slightly smaller one may succeed, still leaving
> > >> the system without much memory, so it will invoke OOM killer sooner or
> > >> later anyway.
> > >>
> > >> I don't see any silver-bullet solution, unfortunately. If this can be
> > >> abused by (multiple) namespaces, then they have to be contained by
> > >> kmemcg as that's the generic mechanism intended for this. Then we could
> > >> use the __GFP_RETRY_MAYFAIL.
> > >> The only limit we could impose to outright deny the allocation (to
> > >> prevent obvious bugs/admin mistakes or abuses) could be based on the
> > >> amount of RAM, as was suggested in the old thread.
> > 
> > Can we make this configurable - on/off switch or size above which
> > to pass GFP_NORETRY.
> 
> Yet another tunable? How do you decide which one to select? Seriously,
> configuration knobs sound attractive but they are rarely a good idea.
> Either we trust privileged users or we don't and we have kmem accounting
> for that.
> 
> > Probably hard coded based on amount of RAM is a good idea too.
> 
> How do you scale that?
> 
> In other words, why don't we simply do the following? Note that this is
> not tested. I have also no idea what is the lifetime of this allocation.
> Is it bound to any specific process or is it a namespace bound? If the
> later then the memcg OOM killer might wipe the whole memcg down without
> making any progress. This would make the whole namespace unsuable until
> somebody intervenes. Is this acceptable?
> ---
> From 4dec96eb64954a7e58264ed551afadf62ca4c5f7 Mon Sep 17 00:00:00 2001
> From: Michal Hocko <mhocko@suse.com>
> Date: Thu, 2 Aug 2018 10:38:57 +0200
> Subject: [PATCH] netfilter/x_tables: do not fail xt_alloc_table_info too
>  easilly
> 
> eacd86ca3b03 ("net/netfilter/x_tables.c: use kvmalloc()
> in xt_alloc_table_info()") has unintentionally fortified
> xt_alloc_table_info allocation when __GFP_RETRY has been dropped from
> the vmalloc fallback. Later on there was a syzbot report that this
> can lead to OOM killer invocations when tables are too large and
> 0537250fdc6c ("netfilter: x_tables: make allocation less aggressive")
> has been merged to restore the original behavior. Georgi Nikolov however
> noticed that he is not able to install his iptables anymore so this can
> be seen as a regression.
> 
> The primary argument for 0537250fdc6c was that this allocation path
> shouldn't really trigger the OOM killer and kill innocent tasks. On the
> other hand the interface requires root and as such should allow what the
> admin asks for. Root inside a namespaces makes this more complicated
> because those might be not trusted in general. If they are not then such
> namespaces should be restricted anyway. Therefore drop the __GFP_NORETRY
> and replace it by __GFP_ACCOUNT to enfore memcg constrains on it.
> 
> Fixes: 0537250fdc6c ("netfilter: x_tables: make allocation less aggressive")
> Reported-by: Georgi Nikolov <gnikolov@icdsoft.com>
> Suggested-by: Vlastimil Babka <vbabka@suse.cz>
> Signed-off-by: Michal Hocko <mhocko@suse.com>
> ---
>  net/netfilter/x_tables.c | 7 +------
>  1 file changed, 1 insertion(+), 6 deletions(-)
> 
> diff --git a/net/netfilter/x_tables.c b/net/netfilter/x_tables.c
> index d0d8397c9588..b769408e04ab 100644
> --- a/net/netfilter/x_tables.c
> +++ b/net/netfilter/x_tables.c
> @@ -1178,12 +1178,7 @@ struct xt_table_info *xt_alloc_table_info(unsigned int
> size)
>       if (sz < sizeof(*info) || sz >= XT_MAX_TABLE_SIZE)
>               return NULL;
>  
> -     /* __GFP_NORETRY is not fully supported by kvmalloc but it should
> -      * work reasonably well if sz is too large and bail out rather
> -      * than shoot all processes down before realizing there is nothing
> -      * more to reclaim.
> -      */
> -     info = kvmalloc(sz, GFP_KERNEL | __GFP_NORETRY);
> +     info = kvmalloc(sz, GFP_KERNEL | __GFP_ACCOUNT);

I guess the large number of cgroups match is helping to consume a lot
of memory very quickly? We have a PATH_MAX in struct xt_cgroup_info_v1.

Can we just trim off the path to something more reasonable? We did
similar things to other extensions.

Only concern here is what is a reasonable path length to describe the
cgroup.
Comment 26 Michal Hocko 2018-08-02 10:44:08 UTC
On Thu 02-08-18 11:25:49, Pablo Neira Ayuso wrote:
> On Thu, Aug 02, 2018 at 10:50:43AM +0200, Michal Hocko wrote:
[...]
> > diff --git a/net/netfilter/x_tables.c b/net/netfilter/x_tables.c
> > index d0d8397c9588..b769408e04ab 100644
> > --- a/net/netfilter/x_tables.c
> > +++ b/net/netfilter/x_tables.c
> > @@ -1178,12 +1178,7 @@ struct xt_table_info *xt_alloc_table_info(unsigned
> int size)
> >     if (sz < sizeof(*info) || sz >= XT_MAX_TABLE_SIZE)
> >             return NULL;
> >  
> > -   /* __GFP_NORETRY is not fully supported by kvmalloc but it should
> > -    * work reasonably well if sz is too large and bail out rather
> > -    * than shoot all processes down before realizing there is nothing
> > -    * more to reclaim.
> > -    */
> > -   info = kvmalloc(sz, GFP_KERNEL | __GFP_NORETRY);
> > +   info = kvmalloc(sz, GFP_KERNEL | __GFP_ACCOUNT);
> 
> I guess the large number of cgroups match is helping to consume a lot
> of memory very quickly? We have a PATH_MAX in struct xt_cgroup_info_v1.

I really fail to see how that is related to the patch here.
Comment 27 Georgi Georgiev 2018-08-06 08:43:03 UTC
On 08/02/2018 11:50 AM, Michal Hocko wrote:
> On Wed 01-08-18 19:03:03, Georgi Nikolov wrote:
>> *Georgi Nikolov*
>> System Administrator
>> www.icdsoft.com <http://www.icdsoft.com>
>>
>> On 08/01/2018 11:33 AM, Michal Hocko wrote:
>>> On Wed 01-08-18 09:34:23, Vlastimil Babka wrote:
>>>> On 07/31/2018 04:05 PM, Florian Westphal wrote:
>>>>> Georgi Nikolov <gnikolov@icdsoft.com> wrote:
>>>>>>> No, I think that's rather for the netfilter folks to decide. However,
>>>>>>> it
>>>>>>> seems there has been the debate already [1] and it was not found. The
>>>>>>> conclusion was that __GFP_NORETRY worked fine before, so it should work
>>>>>>> again after it's added back. But now we know that it doesn't...
>>>>>>>
>>>>>>> [1]
>>>>>>> https://lore.kernel.org/lkml/20180130140104.GE21609@dhcp22.suse.cz/T/#u
>>>>>> Yes i see. I will add Florian Westphal to CC list. netfilter-devel is
>>>>>> already in this list so probably have to wait for their opinion.
>>>>> It hasn't changed, I think having OOM killer zap random processes
>>>>> just because userspace wants to import large iptables ruleset is not a
>>>>> good idea.
>>>> If we denied the allocation instead of OOM (e.g. by using
>>>> __GFP_RETRY_MAYFAIL), a slightly smaller one may succeed, still leaving
>>>> the system without much memory, so it will invoke OOM killer sooner or
>>>> later anyway.
>>>>
>>>> I don't see any silver-bullet solution, unfortunately. If this can be
>>>> abused by (multiple) namespaces, then they have to be contained by
>>>> kmemcg as that's the generic mechanism intended for this. Then we could
>>>> use the __GFP_RETRY_MAYFAIL.
>>>> The only limit we could impose to outright deny the allocation (to
>>>> prevent obvious bugs/admin mistakes or abuses) could be based on the
>>>> amount of RAM, as was suggested in the old thread.
>> Can we make this configurable - on/off switch or size above which
>> to pass GFP_NORETRY.
> Yet another tunable? How do you decide which one to select? Seriously,
> configuration knobs sound attractive but they are rarely a good idea.
> Either we trust privileged users or we don't and we have kmem accounting
> for that.
>
>> Probably hard coded based on amount of RAM is a good idea too.
> How do you scale that?
>
> In other words, why don't we simply do the following? Note that this is
> not tested. I have also no idea what is the lifetime of this allocation.
> Is it bound to any specific process or is it a namespace bound? If the
> later then the memcg OOM killer might wipe the whole memcg down without
> making any progress. This would make the whole namespace unsuable until
> somebody intervenes. Is this acceptable?
> ---
> From 4dec96eb64954a7e58264ed551afadf62ca4c5f7 Mon Sep 17 00:00:00 2001
> From: Michal Hocko <mhocko@suse.com>
> Date: Thu, 2 Aug 2018 10:38:57 +0200
> Subject: [PATCH] netfilter/x_tables: do not fail xt_alloc_table_info too
>  easilly
>
> eacd86ca3b03 ("net/netfilter/x_tables.c: use kvmalloc()
> in xt_alloc_table_info()") has unintentionally fortified
> xt_alloc_table_info allocation when __GFP_RETRY has been dropped from
> the vmalloc fallback. Later on there was a syzbot report that this
> can lead to OOM killer invocations when tables are too large and
> 0537250fdc6c ("netfilter: x_tables: make allocation less aggressive")
> has been merged to restore the original behavior. Georgi Nikolov however
> noticed that he is not able to install his iptables anymore so this can
> be seen as a regression.
>
> The primary argument for 0537250fdc6c was that this allocation path
> shouldn't really trigger the OOM killer and kill innocent tasks. On the
> other hand the interface requires root and as such should allow what the
> admin asks for. Root inside a namespaces makes this more complicated
> because those might be not trusted in general. If they are not then such
> namespaces should be restricted anyway. Therefore drop the __GFP_NORETRY
> and replace it by __GFP_ACCOUNT to enfore memcg constrains on it.
>
> Fixes: 0537250fdc6c ("netfilter: x_tables: make allocation less aggressive")
> Reported-by: Georgi Nikolov <gnikolov@icdsoft.com>
> Suggested-by: Vlastimil Babka <vbabka@suse.cz>
> Signed-off-by: Michal Hocko <mhocko@suse.com>
> ---
>  net/netfilter/x_tables.c | 7 +------
>  1 file changed, 1 insertion(+), 6 deletions(-)
>
> diff --git a/net/netfilter/x_tables.c b/net/netfilter/x_tables.c
> index d0d8397c9588..b769408e04ab 100644
> --- a/net/netfilter/x_tables.c
> +++ b/net/netfilter/x_tables.c
> @@ -1178,12 +1178,7 @@ struct xt_table_info *xt_alloc_table_info(unsigned int
> size)
>       if (sz < sizeof(*info) || sz >= XT_MAX_TABLE_SIZE)
>               return NULL;
>  
> -     /* __GFP_NORETRY is not fully supported by kvmalloc but it should
> -      * work reasonably well if sz is too large and bail out rather
> -      * than shoot all processes down before realizing there is nothing
> -      * more to reclaim.
> -      */
> -     info = kvmalloc(sz, GFP_KERNEL | __GFP_NORETRY);
> +     info = kvmalloc(sz, GFP_KERNEL | __GFP_ACCOUNT);
>       if (!info)
>               return NULL;
>  

I will check if this change fixes the problem.

Regards,

--
Georgi Nikolov
Comment 28 Georgi Georgiev 2018-08-07 11:02:14 UTC
On 08/06/2018 11:42 AM, Georgi Nikolov wrote:
> On 08/02/2018 11:50 AM, Michal Hocko wrote:
>> In other words, why don't we simply do the following? Note that this is
>> not tested. I have also no idea what is the lifetime of this allocation.
>> Is it bound to any specific process or is it a namespace bound? If the
>> later then the memcg OOM killer might wipe the whole memcg down without
>> making any progress. This would make the whole namespace unsuable until
>> somebody intervenes. Is this acceptable?
>> ---
>> From 4dec96eb64954a7e58264ed551afadf62ca4c5f7 Mon Sep 17 00:00:00 2001
>> From: Michal Hocko <mhocko@suse.com>
>> Date: Thu, 2 Aug 2018 10:38:57 +0200
>> Subject: [PATCH] netfilter/x_tables: do not fail xt_alloc_table_info too
>>  easilly
>>
>> eacd86ca3b03 ("net/netfilter/x_tables.c: use kvmalloc()
>> in xt_alloc_table_info()") has unintentionally fortified
>> xt_alloc_table_info allocation when __GFP_RETRY has been dropped from
>> the vmalloc fallback. Later on there was a syzbot report that this
>> can lead to OOM killer invocations when tables are too large and
>> 0537250fdc6c ("netfilter: x_tables: make allocation less aggressive")
>> has been merged to restore the original behavior. Georgi Nikolov however
>> noticed that he is not able to install his iptables anymore so this can
>> be seen as a regression.
>>
>> The primary argument for 0537250fdc6c was that this allocation path
>> shouldn't really trigger the OOM killer and kill innocent tasks. On the
>> other hand the interface requires root and as such should allow what the
>> admin asks for. Root inside a namespaces makes this more complicated
>> because those might be not trusted in general. If they are not then such
>> namespaces should be restricted anyway. Therefore drop the __GFP_NORETRY
>> and replace it by __GFP_ACCOUNT to enfore memcg constrains on it.
>>
>> Fixes: 0537250fdc6c ("netfilter: x_tables: make allocation less aggressive")
>> Reported-by: Georgi Nikolov <gnikolov@icdsoft.com>
>> Suggested-by: Vlastimil Babka <vbabka@suse.cz>
>> Signed-off-by: Michal Hocko <mhocko@suse.com>
>> ---
>>  net/netfilter/x_tables.c | 7 +------
>>  1 file changed, 1 insertion(+), 6 deletions(-)
>>
>> diff --git a/net/netfilter/x_tables.c b/net/netfilter/x_tables.c
>> index d0d8397c9588..b769408e04ab 100644
>> --- a/net/netfilter/x_tables.c
>> +++ b/net/netfilter/x_tables.c
>> @@ -1178,12 +1178,7 @@ struct xt_table_info *xt_alloc_table_info(unsigned
>> int size)
>>      if (sz < sizeof(*info) || sz >= XT_MAX_TABLE_SIZE)
>>              return NULL;
>>  
>> -    /* __GFP_NORETRY is not fully supported by kvmalloc but it should
>> -     * work reasonably well if sz is too large and bail out rather
>> -     * than shoot all processes down before realizing there is nothing
>> -     * more to reclaim.
>> -     */
>> -    info = kvmalloc(sz, GFP_KERNEL | __GFP_NORETRY);
>> +    info = kvmalloc(sz, GFP_KERNEL | __GFP_ACCOUNT);
>>      if (!info)
>>              return NULL;
>>  
> I will check if this change fixes the problem.
>
> Regards,
>
> --
> Georgi Nikolov

I can't reproduce it anymore.
If i understand correctly this way memory allocated will be
accounted to kmem of this cgroup (if inside cgroup).
Comment 29 Michal Hocko 2018-08-07 11:09:57 UTC
On Tue 07-08-18 14:02:00, Georgi Nikolov wrote:
> On 08/06/2018 11:42 AM, Georgi Nikolov wrote:
> > On 08/02/2018 11:50 AM, Michal Hocko wrote:
> >> In other words, why don't we simply do the following? Note that this is
> >> not tested. I have also no idea what is the lifetime of this allocation.
> >> Is it bound to any specific process or is it a namespace bound? If the
> >> later then the memcg OOM killer might wipe the whole memcg down without
> >> making any progress. This would make the whole namespace unsuable until
> >> somebody intervenes. Is this acceptable?
> >> ---
> >> From 4dec96eb64954a7e58264ed551afadf62ca4c5f7 Mon Sep 17 00:00:00 2001
> >> From: Michal Hocko <mhocko@suse.com>
> >> Date: Thu, 2 Aug 2018 10:38:57 +0200
> >> Subject: [PATCH] netfilter/x_tables: do not fail xt_alloc_table_info too
> >>  easilly
> >>
> >> eacd86ca3b03 ("net/netfilter/x_tables.c: use kvmalloc()
> >> in xt_alloc_table_info()") has unintentionally fortified
> >> xt_alloc_table_info allocation when __GFP_RETRY has been dropped from
> >> the vmalloc fallback. Later on there was a syzbot report that this
> >> can lead to OOM killer invocations when tables are too large and
> >> 0537250fdc6c ("netfilter: x_tables: make allocation less aggressive")
> >> has been merged to restore the original behavior. Georgi Nikolov however
> >> noticed that he is not able to install his iptables anymore so this can
> >> be seen as a regression.
> >>
> >> The primary argument for 0537250fdc6c was that this allocation path
> >> shouldn't really trigger the OOM killer and kill innocent tasks. On the
> >> other hand the interface requires root and as such should allow what the
> >> admin asks for. Root inside a namespaces makes this more complicated
> >> because those might be not trusted in general. If they are not then such
> >> namespaces should be restricted anyway. Therefore drop the __GFP_NORETRY
> >> and replace it by __GFP_ACCOUNT to enfore memcg constrains on it.
> >>
> >> Fixes: 0537250fdc6c ("netfilter: x_tables: make allocation less
> aggressive")
> >> Reported-by: Georgi Nikolov <gnikolov@icdsoft.com>
> >> Suggested-by: Vlastimil Babka <vbabka@suse.cz>
> >> Signed-off-by: Michal Hocko <mhocko@suse.com>
> >> ---
> >>  net/netfilter/x_tables.c | 7 +------
> >>  1 file changed, 1 insertion(+), 6 deletions(-)
> >>
> >> diff --git a/net/netfilter/x_tables.c b/net/netfilter/x_tables.c
> >> index d0d8397c9588..b769408e04ab 100644
> >> --- a/net/netfilter/x_tables.c
> >> +++ b/net/netfilter/x_tables.c
> >> @@ -1178,12 +1178,7 @@ struct xt_table_info *xt_alloc_table_info(unsigned
> int size)
> >>    if (sz < sizeof(*info) || sz >= XT_MAX_TABLE_SIZE)
> >>            return NULL;
> >>  
> >> -  /* __GFP_NORETRY is not fully supported by kvmalloc but it should
> >> -   * work reasonably well if sz is too large and bail out rather
> >> -   * than shoot all processes down before realizing there is nothing
> >> -   * more to reclaim.
> >> -   */
> >> -  info = kvmalloc(sz, GFP_KERNEL | __GFP_NORETRY);
> >> +  info = kvmalloc(sz, GFP_KERNEL | __GFP_ACCOUNT);
> >>    if (!info)
> >>            return NULL;
> >>  
> > I will check if this change fixes the problem.
> >
> > Regards,
> >
> > --
> > Georgi Nikolov
> 
> I can't reproduce it anymore.
> If i understand correctly this way memory allocated will be
> accounted to kmem of this cgroup (if inside cgroup).

s@this@caller's@

Florian, is this patch acceptable?
Comment 30 Florian Westphal 2018-08-07 11:19:35 UTC
Michal Hocko <mhocko@kernel.org> wrote:
> > I can't reproduce it anymore.
> > If i understand correctly this way memory allocated will be
> > accounted to kmem of this cgroup (if inside cgroup).
> 
> s@this@caller's@
> 
> Florian, is this patch acceptable

I am no mm expert.  Should all longlived GFP_KERNEL allocations set ACCOUNT?

If so, there are more places that should get same treatment.
The change looks fine to me, but again, I don't know when ACCOUNT should
be set in the first place.
Comment 31 Michal Hocko 2018-08-07 11:26:46 UTC
On Tue 07-08-18 13:19:26, Florian Westphal wrote:
> Michal Hocko <mhocko@kernel.org> wrote:
> > > I can't reproduce it anymore.
> > > If i understand correctly this way memory allocated will be
> > > accounted to kmem of this cgroup (if inside cgroup).
> > 
> > s@this@caller's@
> > 
> > Florian, is this patch acceptable
> 
> I am no mm expert.  Should all longlived GFP_KERNEL allocations set ACCOUNT?

No. We should focus only on those that are under direct userspace
control and it can be triggered by an untrusted user.

> If so, there are more places that should get same treatment.
> The change looks fine to me, but again, I don't know when ACCOUNT should
> be set in the first place.

see above.
Comment 32 Vlastimil Babka 2018-08-07 11:29:22 UTC
On 08/02/2018 10:50 AM, Michal Hocko wrote:
> On Wed 01-08-18 19:03:03, Georgi Nikolov wrote:
>>
>> *Georgi Nikolov*
>> System Administrator
>> www.icdsoft.com <http://www.icdsoft.com>
>>
>> On 08/01/2018 11:33 AM, Michal Hocko wrote:
>>> On Wed 01-08-18 09:34:23, Vlastimil Babka wrote:
>>>> On 07/31/2018 04:05 PM, Florian Westphal wrote:
>>>>> Georgi Nikolov <gnikolov@icdsoft.com> wrote:
>>>>>>> No, I think that's rather for the netfilter folks to decide. However,
>>>>>>> it
>>>>>>> seems there has been the debate already [1] and it was not found. The
>>>>>>> conclusion was that __GFP_NORETRY worked fine before, so it should work
>>>>>>> again after it's added back. But now we know that it doesn't...
>>>>>>>
>>>>>>> [1]
>>>>>>> https://lore.kernel.org/lkml/20180130140104.GE21609@dhcp22.suse.cz/T/#u
>>>>>> Yes i see. I will add Florian Westphal to CC list. netfilter-devel is
>>>>>> already in this list so probably have to wait for their opinion.
>>>>> It hasn't changed, I think having OOM killer zap random processes
>>>>> just because userspace wants to import large iptables ruleset is not a
>>>>> good idea.
>>>> If we denied the allocation instead of OOM (e.g. by using
>>>> __GFP_RETRY_MAYFAIL), a slightly smaller one may succeed, still leaving
>>>> the system without much memory, so it will invoke OOM killer sooner or
>>>> later anyway.
>>>>
>>>> I don't see any silver-bullet solution, unfortunately. If this can be
>>>> abused by (multiple) namespaces, then they have to be contained by
>>>> kmemcg as that's the generic mechanism intended for this. Then we could
>>>> use the __GFP_RETRY_MAYFAIL.
>>>> The only limit we could impose to outright deny the allocation (to
>>>> prevent obvious bugs/admin mistakes or abuses) could be based on the
>>>> amount of RAM, as was suggested in the old thread.
>>
>> Can we make this configurable - on/off switch or size above which
>> to pass GFP_NORETRY.
> 
> Yet another tunable? How do you decide which one to select? Seriously,
> configuration knobs sound attractive but they are rarely a good idea.
> Either we trust privileged users or we don't and we have kmem accounting
> for that.
> 
>> Probably hard coded based on amount of RAM is a good idea too.
> 
> How do you scale that?
> 
> In other words, why don't we simply do the following? Note that this is
> not tested. I have also no idea what is the lifetime of this allocation.
> Is it bound to any specific process or is it a namespace bound? If the
> later then the memcg OOM killer might wipe the whole memcg down without
> making any progress. This would make the whole namespace unsuable until
> somebody intervenes. Is this acceptable?
> ---
> From 4dec96eb64954a7e58264ed551afadf62ca4c5f7 Mon Sep 17 00:00:00 2001
> From: Michal Hocko <mhocko@suse.com>
> Date: Thu, 2 Aug 2018 10:38:57 +0200
> Subject: [PATCH] netfilter/x_tables: do not fail xt_alloc_table_info too
>  easilly
> 
> eacd86ca3b03 ("net/netfilter/x_tables.c: use kvmalloc()
> in xt_alloc_table_info()") has unintentionally fortified
> xt_alloc_table_info allocation when __GFP_RETRY has been dropped from
> the vmalloc fallback. Later on there was a syzbot report that this
> can lead to OOM killer invocations when tables are too large and
> 0537250fdc6c ("netfilter: x_tables: make allocation less aggressive")
> has been merged to restore the original behavior. Georgi Nikolov however
> noticed that he is not able to install his iptables anymore so this can
> be seen as a regression.
> 
> The primary argument for 0537250fdc6c was that this allocation path
> shouldn't really trigger the OOM killer and kill innocent tasks. On the
> other hand the interface requires root and as such should allow what the
> admin asks for. Root inside a namespaces makes this more complicated
> because those might be not trusted in general. If they are not then such
> namespaces should be restricted anyway. Therefore drop the __GFP_NORETRY
> and replace it by __GFP_ACCOUNT to enfore memcg constrains on it.
> 
> Fixes: 0537250fdc6c ("netfilter: x_tables: make allocation less aggressive")
> Reported-by: Georgi Nikolov <gnikolov@icdsoft.com>
> Suggested-by: Vlastimil Babka <vbabka@suse.cz>
> Signed-off-by: Michal Hocko <mhocko@suse.com>
> ---
>  net/netfilter/x_tables.c | 7 +------
>  1 file changed, 1 insertion(+), 6 deletions(-)
> 
> diff --git a/net/netfilter/x_tables.c b/net/netfilter/x_tables.c
> index d0d8397c9588..b769408e04ab 100644
> --- a/net/netfilter/x_tables.c
> +++ b/net/netfilter/x_tables.c
> @@ -1178,12 +1178,7 @@ struct xt_table_info *xt_alloc_table_info(unsigned int
> size)
>       if (sz < sizeof(*info) || sz >= XT_MAX_TABLE_SIZE)
>               return NULL;
>  
> -     /* __GFP_NORETRY is not fully supported by kvmalloc but it should
> -      * work reasonably well if sz is too large and bail out rather
> -      * than shoot all processes down before realizing there is nothing
> -      * more to reclaim.
> -      */
> -     info = kvmalloc(sz, GFP_KERNEL | __GFP_NORETRY);
> +     info = kvmalloc(sz, GFP_KERNEL | __GFP_ACCOUNT);

GFP_KERNEL_ACCOUNT ?

>       if (!info)
>               return NULL;
>  
>
Comment 33 Florian Westphal 2018-08-07 11:30:20 UTC
Michal Hocko <mhocko@kernel.org> wrote:
> On Tue 07-08-18 13:19:26, Florian Westphal wrote:
> > Michal Hocko <mhocko@kernel.org> wrote:
> > > > I can't reproduce it anymore.
> > > > If i understand correctly this way memory allocated will be
> > > > accounted to kmem of this cgroup (if inside cgroup).
> > > 
> > > s@this@caller's@
> > > 
> > > Florian, is this patch acceptable
> > 
> > I am no mm expert.  Should all longlived GFP_KERNEL allocations set
> ACCOUNT?
> 
> No. We should focus only on those that are under direct userspace
> control and it can be triggered by an untrusted user.

In that case patch is fine and we will need similar patches for
nf_tables_api.c .
Comment 34 Vlastimil Babka 2018-08-07 11:31:26 UTC
On 08/07/2018 01:26 PM, Michal Hocko wrote:
> On Tue 07-08-18 13:19:26, Florian Westphal wrote:
>> Michal Hocko <mhocko@kernel.org> wrote:
>>>> I can't reproduce it anymore.
>>>> If i understand correctly this way memory allocated will be
>>>> accounted to kmem of this cgroup (if inside cgroup).
>>>
>>> s@this@caller's@
>>>
>>> Florian, is this patch acceptable
>>
>> I am no mm expert.  Should all longlived GFP_KERNEL allocations set ACCOUNT?
> 
> No. We should focus only on those that are under direct userspace
> control and it can be triggered by an untrusted user.

Looks like the description in include/linux/gfp.h could use some details
to guide developers, possibly also Mike's new/improved docs (+CC).

>> If so, there are more places that should get same treatment.
>> The change looks fine to me, but again, I don't know when ACCOUNT should
>> be set in the first place.
> 
> see above.
>
Comment 35 Michal Hocko 2018-08-07 11:37:57 UTC
On Tue 07-08-18 13:29:15, Vlastimil Babka wrote:
> On 08/02/2018 10:50 AM, Michal Hocko wrote:
> > On Wed 01-08-18 19:03:03, Georgi Nikolov wrote:
> >>
> >> *Georgi Nikolov*
> >> System Administrator
> >> www.icdsoft.com <http://www.icdsoft.com>
> >>
> >> On 08/01/2018 11:33 AM, Michal Hocko wrote:
> >>> On Wed 01-08-18 09:34:23, Vlastimil Babka wrote:
> >>>> On 07/31/2018 04:05 PM, Florian Westphal wrote:
> >>>>> Georgi Nikolov <gnikolov@icdsoft.com> wrote:
> >>>>>>> No, I think that's rather for the netfilter folks to decide. However,
> it
> >>>>>>> seems there has been the debate already [1] and it was not found. The
> >>>>>>> conclusion was that __GFP_NORETRY worked fine before, so it should
> work
> >>>>>>> again after it's added back. But now we know that it doesn't...
> >>>>>>>
> >>>>>>> [1]
> https://lore.kernel.org/lkml/20180130140104.GE21609@dhcp22.suse.cz/T/#u
> >>>>>> Yes i see. I will add Florian Westphal to CC list. netfilter-devel is
> >>>>>> already in this list so probably have to wait for their opinion.
> >>>>> It hasn't changed, I think having OOM killer zap random processes
> >>>>> just because userspace wants to import large iptables ruleset is not a
> >>>>> good idea.
> >>>> If we denied the allocation instead of OOM (e.g. by using
> >>>> __GFP_RETRY_MAYFAIL), a slightly smaller one may succeed, still leaving
> >>>> the system without much memory, so it will invoke OOM killer sooner or
> >>>> later anyway.
> >>>>
> >>>> I don't see any silver-bullet solution, unfortunately. If this can be
> >>>> abused by (multiple) namespaces, then they have to be contained by
> >>>> kmemcg as that's the generic mechanism intended for this. Then we could
> >>>> use the __GFP_RETRY_MAYFAIL.
> >>>> The only limit we could impose to outright deny the allocation (to
> >>>> prevent obvious bugs/admin mistakes or abuses) could be based on the
> >>>> amount of RAM, as was suggested in the old thread.
> >>
> >> Can we make this configurable - on/off switch or size above which
> >> to pass GFP_NORETRY.
> > 
> > Yet another tunable? How do you decide which one to select? Seriously,
> > configuration knobs sound attractive but they are rarely a good idea.
> > Either we trust privileged users or we don't and we have kmem accounting
> > for that.
> > 
> >> Probably hard coded based on amount of RAM is a good idea too.
> > 
> > How do you scale that?
> > 
> > In other words, why don't we simply do the following? Note that this is
> > not tested. I have also no idea what is the lifetime of this allocation.
> > Is it bound to any specific process or is it a namespace bound? If the
> > later then the memcg OOM killer might wipe the whole memcg down without
> > making any progress. This would make the whole namespace unsuable until
> > somebody intervenes. Is this acceptable?
> > ---
> > From 4dec96eb64954a7e58264ed551afadf62ca4c5f7 Mon Sep 17 00:00:00 2001
> > From: Michal Hocko <mhocko@suse.com>
> > Date: Thu, 2 Aug 2018 10:38:57 +0200
> > Subject: [PATCH] netfilter/x_tables: do not fail xt_alloc_table_info too
> >  easilly
> > 
> > eacd86ca3b03 ("net/netfilter/x_tables.c: use kvmalloc()
> > in xt_alloc_table_info()") has unintentionally fortified
> > xt_alloc_table_info allocation when __GFP_RETRY has been dropped from
> > the vmalloc fallback. Later on there was a syzbot report that this
> > can lead to OOM killer invocations when tables are too large and
> > 0537250fdc6c ("netfilter: x_tables: make allocation less aggressive")
> > has been merged to restore the original behavior. Georgi Nikolov however
> > noticed that he is not able to install his iptables anymore so this can
> > be seen as a regression.
> > 
> > The primary argument for 0537250fdc6c was that this allocation path
> > shouldn't really trigger the OOM killer and kill innocent tasks. On the
> > other hand the interface requires root and as such should allow what the
> > admin asks for. Root inside a namespaces makes this more complicated
> > because those might be not trusted in general. If they are not then such
> > namespaces should be restricted anyway. Therefore drop the __GFP_NORETRY
> > and replace it by __GFP_ACCOUNT to enfore memcg constrains on it.
> > 
> > Fixes: 0537250fdc6c ("netfilter: x_tables: make allocation less
> aggressive")
> > Reported-by: Georgi Nikolov <gnikolov@icdsoft.com>
> > Suggested-by: Vlastimil Babka <vbabka@suse.cz>
> > Signed-off-by: Michal Hocko <mhocko@suse.com>
> > ---
> >  net/netfilter/x_tables.c | 7 +------
> >  1 file changed, 1 insertion(+), 6 deletions(-)
> > 
> > diff --git a/net/netfilter/x_tables.c b/net/netfilter/x_tables.c
> > index d0d8397c9588..b769408e04ab 100644
> > --- a/net/netfilter/x_tables.c
> > +++ b/net/netfilter/x_tables.c
> > @@ -1178,12 +1178,7 @@ struct xt_table_info *xt_alloc_table_info(unsigned
> int size)
> >     if (sz < sizeof(*info) || sz >= XT_MAX_TABLE_SIZE)
> >             return NULL;
> >  
> > -   /* __GFP_NORETRY is not fully supported by kvmalloc but it should
> > -    * work reasonably well if sz is too large and bail out rather
> > -    * than shoot all processes down before realizing there is nothing
> > -    * more to reclaim.
> > -    */
> > -   info = kvmalloc(sz, GFP_KERNEL | __GFP_NORETRY);
> > +   info = kvmalloc(sz, GFP_KERNEL | __GFP_ACCOUNT);
> 
> GFP_KERNEL_ACCOUNT ?

Certainly possible, I guess I just wanted to call the __GFP_ACCOUNT. But
I can change that of course.
Comment 36 Michal Hocko 2018-08-07 11:38:32 UTC
On Tue 07-08-18 13:30:13, Florian Westphal wrote:
> Michal Hocko <mhocko@kernel.org> wrote:
> > On Tue 07-08-18 13:19:26, Florian Westphal wrote:
> > > Michal Hocko <mhocko@kernel.org> wrote:
> > > > > I can't reproduce it anymore.
> > > > > If i understand correctly this way memory allocated will be
> > > > > accounted to kmem of this cgroup (if inside cgroup).
> > > > 
> > > > s@this@caller's@
> > > > 
> > > > Florian, is this patch acceptable
> > > 
> > > I am no mm expert.  Should all longlived GFP_KERNEL allocations set
> ACCOUNT?
> > 
> > No. We should focus only on those that are under direct userspace
> > control and it can be triggered by an untrusted user.
> 
> In that case patch is fine and we will need similar patches for
> nf_tables_api.c .

Can I assume your Acked-by or should I repost the standalong patch in
its new thread?
Comment 37 rppt 2018-08-07 14:06:35 UTC
On Tue, Aug 07, 2018 at 01:31:21PM +0200, Vlastimil Babka wrote:
> On 08/07/2018 01:26 PM, Michal Hocko wrote:
> > On Tue 07-08-18 13:19:26, Florian Westphal wrote:
> >> Michal Hocko <mhocko@kernel.org> wrote:
> >>>> I can't reproduce it anymore.
> >>>> If i understand correctly this way memory allocated will be
> >>>> accounted to kmem of this cgroup (if inside cgroup).
> >>>
> >>> s@this@caller's@
> >>>
> >>> Florian, is this patch acceptable
> >>
> >> I am no mm expert.  Should all longlived GFP_KERNEL allocations set
> ACCOUNT?
> > 
> > No. We should focus only on those that are under direct userspace
> > control and it can be triggered by an untrusted user.
> 
> Looks like the description in include/linux/gfp.h could use some details
> to guide developers, possibly also Mike's new/improved docs (+CC).

A "memory allocation guide" is definitely on my todo.

If you and other mm developers have any notes, random thoughts or anything else you'd
like to see there, send it my way, I'll convert them to ReST :-).

> >> If so, there are more places that should get same treatment.
> >> The change looks fine to me, but again, I don't know when ACCOUNT should
> >> be set in the first place.
> > 
> > see above.
> > 
>
Comment 38 Florian Westphal 2018-08-07 18:24:05 UTC
Michal Hocko <mhocko@kernel.org> wrote:
> Subject: [PATCH] netfilter/x_tables: do not fail xt_alloc_table_info too
>  easilly

[..]

> -     /* __GFP_NORETRY is not fully supported by kvmalloc but it should
> -      * work reasonably well if sz is too large and bail out rather
> -      * than shoot all processes down before realizing there is nothing
> -      * more to reclaim.
> -      */
> -     info = kvmalloc(sz, GFP_KERNEL | __GFP_NORETRY);
> +     info = kvmalloc(sz, GFP_KERNEL | __GFP_ACCOUNT);
>       if (!info)
>               return NULL;

Acked-by: Florian Westphal <fw@strlen.de>

You can keep this acked-by in case you mangle this patch in a minor
way such as using GFP_KERNEL_ACCOUNT.

Thanks,
Florian
Comment 39 Michal Hocko 2018-08-07 19:30:29 UTC
On Tue 07-08-18 20:23:55, Florian Westphal wrote:
> Michal Hocko <mhocko@kernel.org> wrote:
> > Subject: [PATCH] netfilter/x_tables: do not fail xt_alloc_table_info too
> >  easilly
> 
> [..]
> 
> > -   /* __GFP_NORETRY is not fully supported by kvmalloc but it should
> > -    * work reasonably well if sz is too large and bail out rather
> > -    * than shoot all processes down before realizing there is nothing
> > -    * more to reclaim.
> > -    */
> > -   info = kvmalloc(sz, GFP_KERNEL | __GFP_NORETRY);
> > +   info = kvmalloc(sz, GFP_KERNEL | __GFP_ACCOUNT);
> >     if (!info)
> >             return NULL;
> 
> Acked-by: Florian Westphal <fw@strlen.de>
> 
> You can keep this acked-by in case you mangle this patch in a minor
> way such as using GFP_KERNEL_ACCOUNT.

Thanks!

Note You need to log in before you can comment on or make changes to this bug.