Bug 199297 - OOMs writing to files from processes with cgroup memory limits
Summary: OOMs writing to files from processes with cgroup memory limits
Status: RESOLVED PATCH_ALREADY_AVAILABLE
Alias: None
Product: Memory Management
Classification: Unclassified
Component: Other
Hardware: All
OS: Linux
Priority: P1
Severity: high
Assignee: Andrew Morton
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2018-04-05 21:55 UTC by Chris Behrens
Modified: 2018-04-07 03:29 UTC
CC List: 1 user

See Also:
Kernel Version: 4.11+
Subsystem:
Regression: No
Bisected commit-id:


Attachments
script to reproduce issue + kernel config + oom log from dmesg (54.07 KB, application/x-gzip)
2018-04-05 21:55 UTC, Chris Behrens

Description Chris Behrens 2018-04-05 21:55:26 UTC
Created attachment 275113 [details]
script to reproduce issue + kernel config + oom log from dmesg

OVERVIEW:

Processes that have a cgroup memory limit can easily be OOM-killed just by writing to files. There appears to be no throttling, so a process can very quickly exceed its cgroup memory limit. vm.dirty_ratio appears to be applied to globally available memory rather than the memory available to the cgroup, at least in my case.

This issue came to light by using kubernetes and putting memory limits on pods, but is completely reproducible stand-alone.

STEPS TO REPRODUCE:

* create a memory cgroup
* put a memory limit on the cgroup, say 256M
* add a shell process to the cgroup
* use dd from that shell to write a bunch of data to a file

See the attached script, which reproduces the issue every time.
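
For reference, a minimal sketch of what that script does is below (the attachment is the authoritative version; the cgroup v1 mount point, file paths, and sizes here are just illustrative of my setup):

  #!/bin/sh
  # Repro sketch: assumes the cgroup v1 memory controller is mounted at
  # /sys/fs/cgroup/memory and that this runs as root.
  CG=/sys/fs/cgroup/memory/ddtest
  mkdir -p "$CG"

  # 256M memory limit on the cgroup
  echo $((256 * 1024 * 1024)) > "$CG/memory.limit_in_bytes"

  # move this shell into the cgroup, then write a bunch of data with dd
  echo $$ > "$CG/cgroup.procs"
  dd if=/dev/zero of=/tmp/ddtest.out bs=1M count=2048  # OOM-killed on affected kernels

  # cleanup: move the shell back to the root cgroup before removing ddtest
  echo $$ > /sys/fs/cgroup/memory/cgroup.procs
  rm -f /tmp/ddtest.out
  rmdir "$CG"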

dd ends up getting OOM-killed very shortly after starting, and the OOM logging shows dirty pages above the cgroup limit.

Kernels before 4.11 do not see this behavior. I've tracked the issue to the following commit:

https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux-stable.git/commit/?id=726d061fbd3658e4bfeffa1b8e82da97de2ca4dd

When I revert this commit, dd completes successfully.
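
(For anyone who wants to repeat that test: the revert is a one-liner in a checkout of an affected tree, possibly with minor conflict fixups on newer kernels, followed by a rebuild.)

  # revert the suspect commit, then rebuild and boot the kernel
  git revert 726d061fbd3658e4bfeffa1b8e82da97de2ca4dd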

I believe there is a larger issue here, though. From a bit of debugging: as mentioned above, there doesn't appear to be any throttling, and writeback doesn't appear to be kicked off as dirty pages build up for the cgroup. From what I can tell, the vm.dirty* settings only get applied against the cgroup limit if inode_cgwb_enabled(inode) returns true in balance_dirty_pages_ratelimited(), and that is returning false for me even though I'm using ext4 and CONFIG_CGROUP_WRITEBACK is enabled. The code removed by the commit above seems to have been saving things by reclaiming during try_charge(), but from what I can tell we really want to throttle in balance_dirty_pages() instead, and that isn't happening. This code is all new to me; I just wanted to dump what I saw while debugging.
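
My suspicion about the inode_cgwb_enabled() part (an assumption on my part, not something I've fully verified) is that per-cgroup writeback only takes effect when the memory and io controllers are on the cgroup v2 (unified) hierarchy, and my memory controller is on a legacy v1 mount. Checking which hierarchy the memory controller is on is straightforward:

  # list cgroup mounts: a "cgroup" (v1) mount carrying the memory controller
  # vs. a single "cgroup2" (v2/unified) mount
  grep cgroup /proc/mounts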

NOTE: If I set vm.dirty_bytes to a value lower than my cgroup memory limit, I no longer see OOMs; the process appears to get throttled correctly.
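
For anyone hitting this in the meantime, that workaround is just the global dirty limit (128M below is an arbitrary value under my 256M cgroup limit; note that writing vm.dirty_bytes overrides vm.dirty_ratio, and it affects the whole machine):

  # cap global dirty memory below the cgroup limit (128M in this example)
  sysctl -w vm.dirty_bytes=$((128 * 1024 * 1024))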

I'm attaching a script to reproduce the issue, my kernel config, and the OOM messages from dmesg, all from kernel 4.15.0.
Comment 1 Andrew Morton 2018-04-06 20:36:03 UTC
(switched to email.  Please respond via emailed reply-to-all, not via the
bugzilla web interface).

On Thu, 05 Apr 2018 21:55:26 +0000 bugzilla-daemon@bugzilla.kernel.org wrote:

> https://bugzilla.kernel.org/show_bug.cgi?id=199297
> 
>             Bug ID: 199297
>            Summary: OOMs writing to files from processes with cgroup
>                     memory limits
>            Product: Memory Management
>            Version: 2.5
>     Kernel Version: 4.11+
>           Hardware: All
>                 OS: Linux
>               Tree: Mainline
>             Status: NEW
>           Severity: high
>           Priority: P1
>          Component: Other
>           Assignee: akpm@linux-foundation.org
>           Reporter: cbehrens@codestud.com
>         Regression: No
> 
> [...]
Comment 2 Michal Hocko 2018-04-06 21:36:38 UTC
On Fri 06-04-18 13:36:00, Andrew Morton wrote:
[...]
> > Kernels before 4.11 do not see this behavior. I've tracked the issue to the
> > following commit:
> > 
> > https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux-stable.git/commit/?id=726d061fbd3658e4bfeffa1b8e82da97de2ca4dd

Thanks for the report! This should be fixed by 1c610d5f93c7 ("mm/vmscan:
wake up flushers for legacy cgroups too")
Comment 3 Chris Behrens 2018-04-06 23:18:23 UTC
> On Apr 6, 2018, at 2:36 PM, bugzilla-daemon@bugzilla.kernel.org wrote:
> 
> https://bugzilla.kernel.org/show_bug.cgi?id=199297
> 
> --- Comment #2 from Michal Hocko (mhocko@kernel.org) ---
> On Fri 06-04-18 13:36:00, Andrew Morton wrote:
> [...]
>>> Kernels before 4.11 do not see this behavior. I've tracked the issue to the
>>> following commit:
>>> 
>>> https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux-stable.git/commit/?id=726d061fbd3658e4bfeffa1b8e82da97de2ca4dd
> 
> Thanks for the report! This should be fixed by 1c610d5f93c7 ("mm/vmscan:
> wake up flushers for legacy cgroups too")

Aha, gosh... merged just two weeks ago. And I had missed the comment on sane_reclaim(), which largely explains the rest of my notes above about the process not getting throttled in balance_dirty_pages().

Thanks for the pointer! I'll give 4.16 a shot as I see it was just released a week ago and contains this fix.
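
(For the record, confirming which releases already carry that fix is easy in a mainline clone:)

  # list release tags that contain the fix commit
  git tag --contains 1c610d5f93c7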

- Chris
