Bug 207273

Summary: cgroup with 1.5GB limit and 100MB rss usage OOM-kills processes due to page cache usage after upgrading to kernel 5.4
Product: Memory Management       Reporter: Paul Furtado (paulfurtado91)
Component: Page Allocator        Assignee: Andrew Morton (akpm)
Status: NEW
Severity: normal                 CC: 0x44444444, mail.anchalagarwal, paulfurtado91, xiehuanjun
Priority: P1
Hardware: x86-64
OS: Linux
Kernel Version: 5.4.20           Subsystem:
Regression: Yes                  Bisected commit-id:

Description Paul Furtado 2020-04-15 01:32:12 UTC
Upon upgrading to kernel 5.4, we see constant OOM kills in database containers that are restoring from backups, with nearly no RSS memory usage. It appears all the memory is consumed by file_dirty, with applications using minimal memory. On kernel 4.14.146 and 4.19.75, we do not see this problem, so it appears to be a new regression.

The full OOM log from dmesg shows:

xtrabackup invoked oom-killer: gfp_mask=0x101c4a(GFP_NOFS|__GFP_HIGHMEM|__GFP_HARDWALL|__GFP_MOVABLE|__GFP_WRITE), order=0, oom_score_adj=993
CPU: 9 PID: 50206 Comm: xtrabackup Tainted: G            E     5.4.20-hs779.el6.x86_64 #1
Hardware name: Amazon EC2 c5d.9xlarge/, BIOS 1.0 10/16/2017
Call Trace:
 dump_stack+0x66/0x8b
 dump_header+0x4a/0x200
 oom_kill_process+0xd7/0x110
 out_of_memory+0x105/0x510
 mem_cgroup_out_of_memory+0xb5/0xd0
 try_charge+0x7b1/0x7f0
 mem_cgroup_try_charge+0x70/0x190
 __add_to_page_cache_locked+0x2b6/0x2f0
 ? scan_shadow_nodes+0x30/0x30
 add_to_page_cache_lru+0x4a/0xc0
 pagecache_get_page+0xf5/0x210
 grab_cache_page_write_begin+0x1f/0x40
 iomap_write_begin.constprop.33+0x1ee/0x320
 ? iomap_write_end+0x91/0x240
 iomap_write_actor+0x92/0x170
 ? iomap_dirty_actor+0x1b0/0x1b0
 iomap_apply+0xba/0x130
 ? iomap_dirty_actor+0x1b0/0x1b0
 iomap_file_buffered_write+0x62/0x90
 ? iomap_dirty_actor+0x1b0/0x1b0
 xfs_file_buffered_aio_write+0xca/0x310 [xfs]
 new_sync_write+0x11b/0x1b0
 vfs_write+0xad/0x1a0
 ksys_pwrite64+0x71/0x90
 do_syscall_64+0x4e/0x100
 entry_SYSCALL_64_after_hwframe+0x44/0xa9
RIP: 0033:0x7f6085b181a3
Code: 49 89 ca b8 12 00 00 00 0f 05 48 3d 01 f0 ff ff 73 34 c3 48 83 ec 08 e8 8b f0 ff ff 48 89 04 24 49 89 ca b8 12 00 00 00 0f 05 <48> 8b 3c 24 48 89 c2 e8 d1 f0 ff ff 48 89 d0 48 83 c4 08 48 3d 01
RSP: 002b:00007ffd43632320 EFLAGS: 00000293 ORIG_RAX: 0000000000000012
RAX: ffffffffffffffda RBX: 00007ffd43632400 RCX: 00007f6085b181a3
RDX: 0000000000100000 RSI: 0000000004a54000 RDI: 0000000000000004
RBP: 00007ffd43632590 R08: 0000000066e00000 R09: 00007ffd436325c0
R10: 0000000066e00000 R11: 0000000000000293 R12: 0000000000100000
R13: 0000000066e00000 R14: 0000000066e00000 R15: 0000000001acdd20
memory: usage 1536000kB, limit 1536000kB, failcnt 0
memory+swap: usage 1536000kB, limit 1536000kB, failcnt 490221
kmem: usage 23164kB, limit 9007199254740988kB, failcnt 0
Memory cgroup stats for /kubepods/burstable/pod6900693c-8b2c-4efe-ab52-26e4a6bd9e4c/83216944bb43baf32f0d43ef12c85ebaa2767b3f51846f5fa438bba00b4636d8:
anon 72507392
file 1474740224
kernel_stack 774144
slab 18673664
sock 0
shmem 0
file_mapped 0
file_dirty 1413857280
file_writeback 60555264
anon_thp 0
inactive_anon 0
active_anon 72585216
inactive_file 368873472
active_file 1106067456
unevictable 0
slab_reclaimable 11403264
slab_unreclaimable 7270400
pgfault 34848
pgmajfault 0
workingset_refault 0
workingset_activate 0
workingset_nodereclaim 0
pgrefill 17089962
pgscan 18425256
pgsteal 602912
pgactivate 17822046
pgdeactivate 17089962
pglazyfree 0
pglazyfreed 0
thp_fault_alloc 0
thp_collapse_alloc 0
Tasks state (memory values in pages):
[  pid  ]   uid  tgid total_vm      rss pgtables_bytes swapents oom_score_adj name
[  42046]   500 42046      257        1    32768        0           993 init
[  43157]   500 43157   164204    18473   335872        0           993 vttablet
[  50206]   500 50206   294931     8856   360448        0           993 xtrabackup
oom-kill:constraint=CONSTRAINT_MEMCG,nodemask=(null),cpuset=83216944bb43baf32f0d43ef12c85ebaa2767b3f51846f5fa438bba00b4636d8,mems_allowed=0,oom_memcg=/kubepods/burstable/pod6900693c-8b2c-4efe-ab52-26e4a6bd9e4c/83216944bb43baf32f0d43ef12c85ebaa2767b3f51846f5fa438bba00b4636d8,task_memcg=/kubepods/burstable/pod6900693c-8b2c-4efe-ab52-26e4a6bd9e4c/83216944bb43baf32f0d43ef12c85ebaa2767b3f51846f5fa438bba00b4636d8,task=vttablet,pid=43157,uid=500
Memory cgroup out of memory: Killed process 43157 (vttablet) total-vm:656816kB, anon-rss:50572kB, file-rss:23320kB, shmem-rss:0kB, UID:500 pgtables:328kB oom_score_adj:993
Comment 1 Andrew Morton 2020-04-15 04:26:00 UTC
(switched to email.  Please respond via emailed reply-to-all, not via the
bugzilla web interface).

On Wed, 15 Apr 2020 01:32:12 +0000 bugzilla-daemon@bugzilla.kernel.org wrote:

> https://bugzilla.kernel.org/show_bug.cgi?id=207273
> 
>             Bug ID: 207273
>            Summary: cgroup with 1.5GB limit and 100MB rss usage OOM-kills
>                     processes due to page cache usage after upgrading to
>                     kernel 5.4
>            Product: Memory Management
>            Version: 2.5
>     Kernel Version: 5.4.20
>           Hardware: x86-64
>                 OS: Linux
>               Tree: Mainline
>             Status: NEW
>           Severity: normal
>           Priority: P1
>          Component: Page Allocator
>           Assignee: akpm@linux-foundation.org
>           Reporter: paulfurtado91@gmail.com
>         Regression: No
> 
> Upon upgrading to kernel 5.4, we see constant OOM kills in database
> containers
> that are restoring from backups, with nearly no RSS memory usage. It appears
> all the memory is consumed by file_dirty, with applications using minimal
> memory. On kernel 4.14.146 and 4.19.75, we do not see this problem, so it
> appears to be a new regression.

Thanks.

That's an elderly kernel.  Are you in a position to determine whether
contemporary kernels behave similarly?

> The full OOM log from dmesg shows:
> [...]
Comment 2 Michal Hocko 2020-04-15 06:51:03 UTC
On Tue 14-04-20 21:25:58, Andrew Morton wrote:
[...]
> > Upon upgrading to kernel 5.4, we see constant OOM kills in database
> > containers that are restoring from backups, with nearly no RSS memory
> > usage. It appears all the memory is consumed by file_dirty, with
> > applications using minimal memory. On kernel 4.14.146 and 4.19.75, we do
> > not see this problem, so it appears to be a new regression.

OK, this is interesting, because the memcg OOM handling changed in 4.19.
Older kernels triggered the memcg OOM killer only from the page fault path,
while your stack trace points to the write(2) syscall. But if you do not see
any problem with 4.19 then this is not it.

[...]

> > gfp_mask=0x101c4a(GFP_NOFS|__GFP_HIGHMEM|__GFP_HARDWALL|__GFP_MOVABLE|__GFP_WRITE)
[...]
> > memory: usage 1536000kB, limit 1536000kB, failcnt 0
> > memory+swap: usage 1536000kB, limit 1536000kB, failcnt 490221
> > kmem: usage 23164kB, limit 9007199254740988kB, failcnt 0

Based on the output I assume you are using cgroup v1.

> > Memory cgroup stats for /kubepods/burstable/pod6900693c-8b2c-4efe-ab52-26e4a6bd9e4c/83216944bb43baf32f0d43ef12c85ebaa2767b3f51846f5fa438bba00b4636d8:
> > anon 72507392
> > file 1474740224
> > kernel_stack 774144
> > slab 18673664
> > sock 0
> > shmem 0
> > file_mapped 0
> > file_dirty 1413857280
> > file_writeback 60555264

This seems to be the crux of the problem. You cannot swap out any memory
due to the memory+swap limit, and roughly 95% of the file LRU is dirty.
That is quite a lot of dirty memory to flush. This alone shouldn't be a
disaster, because cgroup v1 does have a hack to throttle memory reclaim in
the presence of dirty/writeback pages. But note the gfp_mask for the
allocation: it says GFP_NOFS, which means that we cannot apply the
throttling and have to give up. We used to retry the reclaim even though
not much could be done with a restricted allocation context, and that led
to lockups. Then we merged f9c645621a28 ("memcg, oom: don't require
__GFP_FS when invoking memcg OOM killer") in 5.4, and it has been marked
for stable trees (4.19+). This is likely the primary culprit of the issue
you are seeing.

Now, what to do about that. Reverting f9c645621a28 doesn't sound like a
feasible solution. We could try to put a sleep for restricted allocations
after memory reclaim fails, but we know from past experience that this is
a bit fishy, because a sleep without any feedback from the flushing is
just not going to work reliably.

Another possibility is to work around the problem by configuration. You
can either try cgroup v2, which has a much better memcg-aware dirty
throttling implementation so such a large amount of dirty pages doesn't
accumulate in the first place, or you can reconfigure the global dirty
limits. I presume you are using the defaults for
/proc/sys/vm/dirty_{background_}ratio, which are a percentage of the
available memory. I would recommend switching to their respective *_bytes
alternatives and using something like 500M for dirty_background_bytes and
800M for dirty_bytes. That should help in your current situation. The
overall IO throughput might be lower, so you might need to tune those
values a bit.
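
For illustration only, using the example numbers above (untested sketch;
pick values that match the actual storage, and note that writing a *_bytes
sysctl clears the corresponding *_ratio):

# switch the global dirty limits to byte-based values
sysctl -w vm.dirty_background_bytes=$((500 * 1024 * 1024))   # ~500M background writeback threshold
sysctl -w vm.dirty_bytes=$((800 * 1024 * 1024))              # ~800M hard dirty limit
# persist via /etc/sysctl.d/ once the tuning proves out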

HTH
Comment 3 Paul Furtado 2020-04-15 08:35:10 UTC
> You can either try to use cgroup v2 which has much better memcg aware dirty
> throttling implementation so such a large amount of dirty pages doesn't
> accumulate in the first place

I'd love to use cgroup v2; however, this is Docker + Kubernetes, so that
would require a lot of changes on our end to make happen, given how
recently container runtimes gained cgroup v2 support.

> I pressume you are using defaults for
> /proc/sys/vm/dirty_{background_}ratio which is a percentage of the
> available memory. I would recommend using their resp. *_bytes
> alternatives and use something like 500M for background and 800M for
> dirty_bytes.

We're using the defaults right now. However, given that this is a
containerized environment, it's problematic to set these values too low
system-wide, since the containers all have dedicated volumes with varying
performance (from as low as 100MB/sec up to gigabytes per second). Looking
around, I see that there were patches in the past to set per-cgroup
vm.dirty settings, but it doesn't look like those ever made it into the
kernel, unless I'm missing something. In practice, maybe 500M and 800M
wouldn't be so bad, and they may improve latency in other ways. The other
problem is that this effectively puts a floor on the minimum container
size for anything that does IO. That said, I'll still tune these settings
in our infrastructure and see how things go. It does sound like something
should be done inside the kernel to help this situation, since it's so
easy to trigger, but looking at the threads that led to the commits you
referenced, I can see that this is complicated.
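
For reference, the per-cgroup dirty/writeback counters involved here can be
watched directly from the cgroup v1 memory.stat file while a backup runs
(the path below is illustrative, not the real pod path):

# substitute the actual pod/container cgroup from the OOM report
CG=/sys/fs/cgroup/memory/kubepods/burstable/<pod-uid>/<container-id>
watch -n1 "grep -E '^(dirty|writeback) ' $CG/memory.stat"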

Thanks,
Paul



On Wed, Apr 15, 2020 at 2:51 AM Michal Hocko <mhocko@kernel.org> wrote:
> [...]
Comment 4 Michal Hocko 2020-04-15 09:45:04 UTC
On Wed 15-04-20 04:34:56, Paul Furtado wrote:
> > You can either try to use cgroup v2 which has much better memcg aware dirty
> > throttling implementation so such a large amount of dirty pages doesn't
> > accumulate in the first place
> 
> I'd love to use cgroup v2, however this is docker + kubernetes so that
> would require a lot of changes on our end to make happen, given how
> recently container runtimes gained cgroup v2 support.
> 
> > I pressume you are using defaults for
> > /proc/sys/vm/dirty_{background_}ratio which is a percentage of the
> > available memory. I would recommend using their resp. *_bytes
> > alternatives and use something like 500M for background and 800M for
> > dirty_bytes.
> 
> We're using the defaults right now, however, given that this is a
> containerized environment, it's problematic to set these values too
> low system-wide since the containers all have dedicated volumes with
> varying performance (from as low as 100MB/sec to gigabyes). Looking
> around, I see that there were patches in the past to set per-cgroup
> vm.dirty settings, however it doesn't look like those ever made it
> into the kernel unless I'm missing something.

I am not aware of that work for memcg v1.

> In practice, maybe 500M
> and 800M wouldn't be so bad though and may improve latency in other
> ways. The other problem is that this also sets an upper bound on the
> minimum container size for anything that does do IO.

Well, this would be a conservative approach, but most allocations will
simply be throttled during reclaim. It is the restricted memory reclaim
context that is the bummer here. I have already brought up why this is the
case in the generic write(2) system call path [1]. Maybe we can reduce the
number of NOFS requests.

> That said, I'll
> still I'll tune these settings in our infrastructure and see how
> things go, but it sounds like something should be done inside the
> kernel to help this situation, since it's so easy to trigger, but
> looking at the threads that led to the commits you referenced, I can
> see that this is complicated.

Yeah, there are certainly things that we should be doing, and reducing the
NOFS allocations is the first step. From my past experience, a non-trivial
share of NOFS usage has turned out to be used incorrectly. I am not sure
how much we can do for cgroup v1, though. If tuning the global dirty
thresholds doesn't lead to better behavior, we can think of a band-aid of
some form. Something like this (only compile tested):

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 05b4ec2c6499..4e1e8d121785 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -2532,6 +2536,20 @@ static int try_charge(struct mem_cgroup *memcg, gfp_t gfp_mask,
 	if (mem_cgroup_wait_acct_move(mem_over_limit))
 		goto retry;
 
+	/*
+	 * Legacy memcg relies on dirty data throttling during the reclaim
+	 * but this cannot be done for GFP_NOFS requests so we might trigger
+	 * the oom way too early. Throttle here if we have way too many
+	 * dirty/writeback pages.
+	 */
+	if (!cgroup_subsys_on_dfl(memory_cgrp_subsys) && !(gfp_mask & __GFP_FS)) {
+		unsigned long dirty = memcg_page_state(memcg, NR_FILE_DIRTY),
+			      writeback = memcg_page_state(memcg, NR_WRITEBACK);
+
+		if (4*(dirty + writeback) > 3* page_counter_read(&memcg->memory))
+			schedule_timeout_interruptible(1);
+	}
+
 	if (nr_retries--)
 		goto retry;
 

[1] http://lkml.kernel.org/r/20200415070228.GW4629@dhcp22.suse.cz
Comment 5 Anchal Agarwal 2021-07-30 21:21:01 UTC
Hi Michal,
I am currently seeing the same issue when migrating a container from 4.14 to 5.4+ kernels. I tested with a configuration where the application requests more than the container's limit, and I could easily trigger OOMs on newer kernels. I tested your patch; however, I had to increase the timeout from 1 to 10 jiffies to make it work, and with that it works for both 5.4 and 5.10 on my workload.
Are there any plans to merge this patch upstream?

--
Thanks,
Anchal Agarwal
Comment 6 Anchal Agarwal 2021-08-02 22:49:34 UTC
A quick correction here, as it may look confusing: "application requests more than the container's limit". The application has nothing to do with requesting more memory here, other than that it reaches around the container's memory limit without crossing it. Just like the scenario above, where we see OOMs in the write syscall due to restricted memory reclaim, my crash relates to how the kernel allocates memory in the IO path rather than to the application using more memory.

Thanks,
Anchal
Comment 7 anchalag 2021-08-06 20:42:53 UTC
On Wed, Apr 15, 2020 at 11:44:58AM +0200, Michal Hocko wrote:
> [...]
Hi Michal,
Following up on my conversation from bugzilla here:
I am currently seeing the same issue when migrating a container from 4.14 to
5.4+ kernels. I tested with a configuration where the application reaches the
cgroup's memory limit while doing IO. The issue is similar to the one described
at https://bugzilla.kernel.org/show_bug.cgi?id=207273, where we see OOMs in the
write syscall due to restricted memory reclaim.
I tested your patch; however, I had to increase the timeout from 1 to 10
jiffies to make it work, and with that it works for both 5.4 and 5.10 on my
workload.
I also tried adjusting dirty_bytes* and it worked after some tuning; however,
there is no single set of values that suits all use cases. Hence it does not
look viable for me to change those defaults and expect them to work for every
kind of workload. I think working out a fix in the kernel may be a better
option, since this issue will show up in so many use cases where applications
are used to the old kernel behavior and suddenly start failing on newer ones.
I see the same stack trace on the 4.19 kernel too.

Here is the stack trace:

dd invoked oom-killer: gfp_mask=0x101c4a(GFP_NOFS|__GFP_HIGHMEM|__GFP_HARDWALL|__GFP_MOVABLE|__GFP_WRITE), order=0, oom_score_adj=997
CPU: 0 PID: 28766 Comm: dd Not tainted 5.4.129-62.227.amzn2.x86_64 #1
Hardware name: Amazon EC2 m5.large/, BIOS 1.0 10/16/2017
Call Trace:
dump_stack+0x50/0x6b
dump_header+0x4a/0x200
oom_kill_process+0xd7/0x110
out_of_memory+0x105/0x510
mem_cgroup_out_of_memory+0xb5/0xd0
try_charge+0x766/0x7c0
mem_cgroup_try_charge+0x70/0x190
__add_to_page_cache_locked+0x355/0x390
? scan_shadow_nodes+0x30/0x30
add_to_page_cache_lru+0x4a/0xc0
pagecache_get_page+0xf5/0x210
grab_cache_page_write_begin+0x1f/0x40
iomap_write_begin.constprop.34+0x1ee/0x340
? iomap_write_end+0x91/0x240
iomap_write_actor+0x92/0x170
? iomap_dirty_actor+0x1b0/0x1b0
iomap_apply+0xba/0x130
? iomap_dirty_actor+0x1b0/0x1b0
iomap_file_buffered_write+0x62/0x90
? iomap_dirty_actor+0x1b0/0x1b0
xfs_file_buffered_aio_write+0xca/0x310 [xfs]
new_sync_write+0x11b/0x1b0
vfs_write+0xad/0x1a0
ksys_write+0xa1/0xe0
do_syscall_64+0x48/0xf0
entry_SYSCALL_64_after_hwframe+0x44/0xa9
RIP: 0033:0x7fc956e853ad
Code: c3 8b 07 85 c0 75 24 49 89 fb 48 89 f0 48 89 d7 48 89 ce 4c 89 c2 4d 89 ca 4c 8b 44 24 08 4c 8b 4c 24 10 4c 89 5c 24 08 0f 05 <c3> e9 8a d2 ff ff 41 54 b8 02 00 00 00 49 89 f4 be 00 88 08 00 55
RSP: 002b:00007ffdf7960058 EFLAGS: 00000246 ORIG_RAX: 0000000000000001
RAX: ffffffffffffffda RBX: 00007fc956ec6b48 RCX: 00007fc956e853ad
RDX: 0000000000100000 RSI: 00007fc956cd9000 RDI: 0000000000000001
RBP: 00007fc956cd9000 R08: 0000000000000000 R09: 0000000000000000
R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000001
R13: 0000000000000000 R14: 00005558753057a0 R15: 0000000000100000
memory: usage 30720kB, limit 30720kB, failcnt 424
memory+swap: usage 30720kB, limit 9007199254740988kB, failcnt 0
kmem: usage 2416kB, limit 9007199254740988kB, failcnt 0
Memory cgroup stats for /kubepods/burstable/pod2d356ec7-5c92-4692-a184-380253ac6fbd:
anon 1089536
file 27475968
kernel_stack 73728
slab 1941504
sock 0
shmem 0
file_mapped 0
file_dirty 0
file_writeback 0
anon_thp 0
inactive_anon 0
active_anon 1351680
inactive_file 27705344
active_file 40960
unevictable 0
slab_reclaimable 819200
slab_unreclaimable 1122304
pgfault 23397
pgmajfault 0
workingset_refault 33
workingset_activate 33
workingset_nodereclaim 0
pgrefill 119108
pgscan 124436
pgsteal 928
pgactivate 123222
pgdeactivate 119083
pglazyfree 99
pglazyfreed 0
thp_fault_alloc 0
thp_collapse_alloc 0
Tasks state (memory values in pages):
[  pid  ]   uid  tgid total_vm      rss pgtables_bytes swapents oom_score_adj name
[  28589]     0 28589      242        1    28672        0          -998 pause
[  28703]     0 28703      399        1    40960        0           997 sh
[  28766]     0 28766      821      341    45056        0           997 dd
oom-kill:constraint=CONSTRAINT_MEMCG,nodemask=(null),cpuset=224eacdaa07c1a67f0cf2a5c85ffc6fe29d95f971743c7c7938de26e85351075,mems_allowed=0,oom_memcg=/kubepods/burstable/pod2d356ec7-5c92-4692-a184-380253ac6fbd,task_memcg=/kubepods/burstable/pod2d356ec7-5c92-4692-a184-380253ac6fbd/224eacdaa07c1a67f0cf2a5c85ffc6fe29d95f971743c7c7938de26e85351075,task=dd,pid=28766,uid=0
Memory cgroup out of memory: Killed process 28766 (dd) total-vm:3284kB, anon-rss:1036kB, file-rss:328kB, shmem-rss:0kB, UID:0 pgtables:44kB oom_score_adj:997
oom_reaper: reaped process 28766 (dd), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB


Here is a snippet of the container spec:

containers:
- image: docker.io/library/alpine:latest
  name: dd
  command:
  - sh
  args:
  - -c
  - cat /proc/meminfo && apk add coreutils && dd if=/dev/zero of=/data/file bs=1M count=1000 && cat /proc/meminfo && echo "OK" && sleep 300
  resources:
    requests:
      memory: 30Mi
      cpu: 20m
    limits:
      memory: 30Mi
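
(For reference, a rough non-Kubernetes sketch of the same reproducer against a
raw cgroup v1 memcg; untested here, and the /data mount point is an assumption:)

# create a 30MB memcg and do buffered writes well past the limit
mkdir /sys/fs/cgroup/memory/ddtest
echo $((30 * 1024 * 1024)) > /sys/fs/cgroup/memory/ddtest/memory.limit_in_bytes
echo $$ > /sys/fs/cgroup/memory/ddtest/cgroup.procs   # move the current shell into the cgroup
dd if=/dev/zero of=/data/file bs=1M count=1000        # /data: any writable filesystem (XFS in the reports above)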

Thanks,
Anchal Agarwal
Comment 8 anchalag 2021-08-11 19:41:36 UTC
On Fri, Aug 06, 2021 at 08:42:46PM +0000, Anchal Agarwal wrote:
> [...]
A gentle ping on this issue!

Thanks,
Anchal Agarwal
Comment 9 mhocko 2021-08-19 09:35:35 UTC
On Wed 11-08-21 19:41:23, Anchal Agarwal wrote:
> A gentle ping on this issue!

Sorry, I am currently swamped with other stuff and will be offline next
week. I will keep this on my todo list and will hopefully get back to it
soon.
Comment 10 benh 2021-08-19 09:41:22 UTC
On Thu, 2021-08-19 at 11:35 +0200, Michal Hocko wrote:
> 
> On Wed 11-08-21 19:41:23, Anchal Agarwal wrote:
> > A gentle ping on this issue!
> 
> Sorry, I am currently swamped by other stuff and will be offline next
> week. I still try to keep this on my todo list and will try to get
> back to this hopefully soon.

Hi Michal !

Thanks for your reply ! Enjoy your time off !

Cheers,
Ben.
Comment 11 Ujjal Roy 2023-06-09 09:55:42 UTC
I am also facing the same problem. BTW, it has been two years. Are there any further updates on this thread?