Upon upgrading to kernel 5.4, we see constant OOM kills in database containers
that are restoring from backups, with nearly no RSS memory usage. It appears
all the memory is consumed by file_dirty, with applications using minimal
memory. On kernel 4.14.146 and 4.19.75, we do not see this problem, so it
appears to be a new regression.

The full OOM log from dmesg shows:

xtrabackup invoked oom-killer: gfp_mask=0x101c4a(GFP_NOFS|__GFP_HIGHMEM|__GFP_HARDWALL|__GFP_MOVABLE|__GFP_WRITE), order=0, oom_score_adj=993
CPU: 9 PID: 50206 Comm: xtrabackup Tainted: G E 5.4.20-hs779.el6.x86_64 #1
Hardware name: Amazon EC2 c5d.9xlarge/, BIOS 1.0 10/16/2017
Call Trace:
 dump_stack+0x66/0x8b
 dump_header+0x4a/0x200
 oom_kill_process+0xd7/0x110
 out_of_memory+0x105/0x510
 mem_cgroup_out_of_memory+0xb5/0xd0
 try_charge+0x7b1/0x7f0
 mem_cgroup_try_charge+0x70/0x190
 __add_to_page_cache_locked+0x2b6/0x2f0
 ? scan_shadow_nodes+0x30/0x30
 add_to_page_cache_lru+0x4a/0xc0
 pagecache_get_page+0xf5/0x210
 grab_cache_page_write_begin+0x1f/0x40
 iomap_write_begin.constprop.33+0x1ee/0x320
 ? iomap_write_end+0x91/0x240
 iomap_write_actor+0x92/0x170
 ? iomap_dirty_actor+0x1b0/0x1b0
 iomap_apply+0xba/0x130
 ? iomap_dirty_actor+0x1b0/0x1b0
 iomap_file_buffered_write+0x62/0x90
 ? iomap_dirty_actor+0x1b0/0x1b0
 xfs_file_buffered_aio_write+0xca/0x310 [xfs]
 new_sync_write+0x11b/0x1b0
 vfs_write+0xad/0x1a0
 ksys_pwrite64+0x71/0x90
 do_syscall_64+0x4e/0x100
 entry_SYSCALL_64_after_hwframe+0x44/0xa9
RIP: 0033:0x7f6085b181a3
Code: 49 89 ca b8 12 00 00 00 0f 05 48 3d 01 f0 ff ff 73 34 c3 48 83 ec 08 e8 8b f0 ff ff 48 89 04 24 49 89 ca b8 12 00 00 00 0f 05 <48> 8b 3c 24 48 89 c2 e8 d1 f0 ff ff 48 89 d0 48 83 c4 08 48 3d 01
RSP: 002b:00007ffd43632320 EFLAGS: 00000293 ORIG_RAX: 0000000000000012
RAX: ffffffffffffffda RBX: 00007ffd43632400 RCX: 00007f6085b181a3
RDX: 0000000000100000 RSI: 0000000004a54000 RDI: 0000000000000004
RBP: 00007ffd43632590 R08: 0000000066e00000 R09: 00007ffd436325c0
R10: 0000000066e00000 R11: 0000000000000293 R12: 0000000000100000
R13: 0000000066e00000 R14: 0000000066e00000 R15: 0000000001acdd20
memory: usage 1536000kB, limit 1536000kB, failcnt 0
memory+swap: usage 1536000kB, limit 1536000kB, failcnt 490221
kmem: usage 23164kB, limit 9007199254740988kB, failcnt 0
Memory cgroup stats for /kubepods/burstable/pod6900693c-8b2c-4efe-ab52-26e4a6bd9e4c/83216944bb43baf32f0d43ef12c85ebaa2767b3f51846f5fa438bba00b4636d8:
anon 72507392
file 1474740224
kernel_stack 774144
slab 18673664
sock 0
shmem 0
file_mapped 0
file_dirty 1413857280
file_writeback 60555264
anon_thp 0
inactive_anon 0
active_anon 72585216
inactive_file 368873472
active_file 1106067456
unevictable 0
slab_reclaimable 11403264
slab_unreclaimable 7270400
pgfault 34848
pgmajfault 0
workingset_refault 0
workingset_activate 0
workingset_nodereclaim 0
pgrefill 17089962
pgscan 18425256
pgsteal 602912
pgactivate 17822046
pgdeactivate 17089962
pglazyfree 0
pglazyfreed 0
thp_fault_alloc 0
thp_collapse_alloc 0
Tasks state (memory values in pages):
[  pid  ]   uid  tgid total_vm      rss pgtables_bytes swapents oom_score_adj name
[  42046]   500 42046      257        1    32768        0           993 init
[  43157]   500 43157   164204    18473   335872        0           993 vttablet
[  50206]   500 50206   294931     8856   360448        0           993 xtrabackup
oom-kill:constraint=CONSTRAINT_MEMCG,nodemask=(null),cpuset=83216944bb43baf32f0d43ef12c85ebaa2767b3f51846f5fa438bba00b4636d8,mems_allowed=0,oom_memcg=/kubepods/burstable/pod6900693c-8b2c-4efe-ab52-26e4a6bd9e4c/83216944bb43baf32f0d43ef12c85ebaa2767b3f51846f5fa438bba00b4636d8,task_memcg=/kubepods/burstable/pod6900693c-8b2c-4efe-ab52-26e4a6bd9e4c/83216944bb43baf32f0d43ef12c85ebaa2767b3f51846f5fa438bba00b4636d8,task=vttablet,pid=43157,uid=500
Memory cgroup out of memory: Killed process 43157 (vttablet) total-vm:656816kB, anon-rss:50572kB, file-rss:23320kB, shmem-rss:0kB, UID:500 pgtables:328kB oom_score_adj:993
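For context while reading the report: the file_dirty and file_writeback numbers above can also be watched live from the cgroup-v1 memory controller while the restore is running. A minimal sketch, assuming the controller is mounted at the usual /sys/fs/cgroup/memory location; the pod path is taken from the log above, and the per-container memcg would be its subdirectory named after the container id:

# Pod-level memcg from the OOM report above (illustrative; append the
# container-id directory to look at the exact memcg that was killed).
CG=/sys/fs/cgroup/memory/kubepods/burstable/pod6900693c-8b2c-4efe-ab52-26e4a6bd9e4c

# Charged usage vs. limit, then the page cache / dirty / writeback accounting.
cat "$CG/memory.usage_in_bytes" "$CG/memory.limit_in_bytes"
grep -E '^total_(cache|rss|dirty|writeback) ' "$CG/memory.stat"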
(switched to email. Please respond via emailed reply-to-all, not via the
bugzilla web interface).

On Wed, 15 Apr 2020 01:32:12 +0000 bugzilla-daemon@bugzilla.kernel.org wrote:

> https://bugzilla.kernel.org/show_bug.cgi?id=207273
>
>             Bug ID: 207273
>            Summary: cgroup with 1.5GB limit and 100MB rss usage OOM-kills
>                     processes due to page cache usage after upgrading to
>                     kernel 5.4
>            Product: Memory Management
>            Version: 2.5
>     Kernel Version: 5.4.20
>           Hardware: x86-64
>                 OS: Linux
>               Tree: Mainline
>             Status: NEW
>           Severity: normal
>           Priority: P1
>          Component: Page Allocator
>           Assignee: akpm@linux-foundation.org
>           Reporter: paulfurtado91@gmail.com
>         Regression: No
>
> Upon upgrading to kernel 5.4, we see constant OOM kills in database containers
> that are restoring from backups, with nearly no RSS memory usage. It appears
> all the memory is consumed by file_dirty, with applications using minimal
> memory. On kernel 4.14.146 and 4.19.75, we do not see this problem, so it
> appears to be a new regression.

Thanks. That's an elderly kernel. Are you in a position to determine
whether contemporary kernels behave similarly?

> The full OOM log from dmesg shows:
>
[...]
On Tue 14-04-20 21:25:58, Andrew Morton wrote:
[...]
> > Upon upgrading to kernel 5.4, we see constant OOM kills in database containers
> > that are restoring from backups, with nearly no RSS memory usage. It appears
> > all the memory is consumed by file_dirty, with applications using minimal
> > memory. On kernel 4.14.146 and 4.19.75, we do not see this problem, so it
> > appears to be a new regression.

OK, this is interesting, because the memcg oom handling changed in 4.19.
Older kernels triggered the memcg oom only from the page fault path, while
your stack trace is pointing to the write(2) syscall. But if you do not see
any problem with 4.19, then this is not it.

[...]
> > gfp_mask=0x101c4a(GFP_NOFS|__GFP_HIGHMEM|__GFP_HARDWALL|__GFP_MOVABLE|__GFP_WRITE)
[...]
> > memory: usage 1536000kB, limit 1536000kB, failcnt 0
> > memory+swap: usage 1536000kB, limit 1536000kB, failcnt 490221
> > kmem: usage 23164kB, limit 9007199254740988kB, failcnt 0

Based on the output I assume you are using cgroup v1.

> > Memory cgroup stats for /kubepods/burstable/pod6900693c-8b2c-4efe-ab52-26e4a6bd9e4c/83216944bb43baf32f0d43ef12c85ebaa2767b3f51846f5fa438bba00b4636d8:
> > anon 72507392
> > file 1474740224
> > kernel_stack 774144
> > slab 18673664
> > sock 0
> > shmem 0
> > file_mapped 0
> > file_dirty 1413857280
> > file_writeback 60555264

This seems to be the crux of the problem. You cannot swap out any memory
due to the memory+swap limit, and 95% of the file LRU is dirty. That is
quite a lot of dirty memory to flush. This alone shouldn't be a disaster,
because cgroup v1 does have a hack to throttle memory reclaim in the
presence of dirty/writeback pages. But note the gfp_mask for the
allocation: it says GFP_NOFS, which means that we cannot apply the
throttling and have to give up. We used to retry the reclaim even though
not much could be done in a restricted allocation context, and that led to
lockups. Then we merged f9c645621a28 ("memcg, oom: don't require __GFP_FS
when invoking memcg OOM killer") in 5.4, and it has been marked for stable
trees (4.19+). This is likely the primary culprit of the issue you are
seeing.

Now, what to do about that. Reverting f9c645621a28 doesn't sound like a
feasible solution. We could try to put a sleep for restricted allocations
after memory reclaim fails, but we know from past experience that this is
a bit fishy, because a sleep without any feedback on the flushing is just
not going to work reliably.

Another possibility is to work around the problem by configuration. You
can either try to use cgroup v2, which has a much better memcg-aware dirty
throttling implementation so such a large amount of dirty pages doesn't
accumulate in the first place, or you can reconfigure the global dirty
limits. I presume you are using the defaults for
/proc/sys/vm/dirty_{background_}ratio, which is a percentage of the
available memory. I would recommend using their resp. *_bytes
alternatives and use something like 500M for background and 800M for
dirty_bytes. That should help in your current situation. The overall IO
throughput might be smaller, so you might need to tune those values a
bit.

HTH
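A minimal sketch of that suggested tuning, assuming a host where drop-in files under /etc/sysctl.d are applied at boot; the 500M/800M values are only the starting point suggested above, and the file name is illustrative. Note that writing the *_bytes sysctls makes the kernel ignore the corresponding *_ratio settings:

# Apply the byte-based dirty limits suggested above (500M background, 800M hard).
sysctl -w vm.dirty_background_bytes=$((500 * 1024 * 1024))
sysctl -w vm.dirty_bytes=$((800 * 1024 * 1024))

# Persist across reboots.
cat <<'EOF' > /etc/sysctl.d/90-dirty-bytes.conf
vm.dirty_background_bytes = 524288000
vm.dirty_bytes = 838860800
EOF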
> You can either try to use cgroup v2, which has a much better memcg-aware dirty
> throttling implementation so such a large amount of dirty pages doesn't
> accumulate in the first place

I'd love to use cgroup v2, however this is docker + kubernetes, so that
would require a lot of changes on our end to make happen, given how
recently container runtimes gained cgroup v2 support.

> I presume you are using the defaults for
> /proc/sys/vm/dirty_{background_}ratio, which is a percentage of the
> available memory. I would recommend using their resp. *_bytes
> alternatives and use something like 500M for background and 800M for
> dirty_bytes.

We're using the defaults right now. However, given that this is a
containerized environment, it's problematic to set these values too
low system-wide, since the containers all have dedicated volumes with
varying performance (from as low as 100MB/sec to gigabytes per second).
Looking around, I see that there were patches in the past to set per-cgroup
vm.dirty settings, however it doesn't look like those ever made it
into the kernel unless I'm missing something. In practice, maybe 500M
and 800M wouldn't be so bad though, and may improve latency in other
ways. The other problem is that this also effectively sets a floor on the
container size for anything that does IO.

That said, I'll still tune these settings in our infrastructure and see how
things go, but it sounds like something should be done inside the
kernel to help this situation, since it's so easy to trigger. Looking
at the threads that led to the commits you referenced, though, I can
see that this is complicated.

Thanks,
Paul

On Wed, Apr 15, 2020 at 2:51 AM Michal Hocko <mhocko@kernel.org> wrote:
>
[...]
> --
> Michal Hocko
> SUSE Labs
On Wed 15-04-20 04:34:56, Paul Furtado wrote:
> > You can either try to use cgroup v2, which has a much better memcg-aware dirty
> > throttling implementation so such a large amount of dirty pages doesn't
> > accumulate in the first place
>
> I'd love to use cgroup v2, however this is docker + kubernetes, so that
> would require a lot of changes on our end to make happen, given how
> recently container runtimes gained cgroup v2 support.
>
> > I presume you are using the defaults for
> > /proc/sys/vm/dirty_{background_}ratio, which is a percentage of the
> > available memory. I would recommend using their resp. *_bytes
> > alternatives and use something like 500M for background and 800M for
> > dirty_bytes.
>
> We're using the defaults right now. However, given that this is a
> containerized environment, it's problematic to set these values too
> low system-wide, since the containers all have dedicated volumes with
> varying performance (from as low as 100MB/sec to gigabytes per second).
> Looking around, I see that there were patches in the past to set per-cgroup
> vm.dirty settings, however it doesn't look like those ever made it
> into the kernel unless I'm missing something.

I am not aware of that work for memcg v1.

> In practice, maybe 500M
> and 800M wouldn't be so bad though, and may improve latency in other
> ways. The other problem is that this also effectively sets a floor on the
> container size for anything that does IO.

Well, this would be a conservative approach, but most allocations will
simply be throttled during reclaim. It is the restricted memory reclaim
context that is the bummer here. I have already brought up why this is
the case in the generic write(2) system call path [1]. Maybe we can
reduce the amount of NOFS requests.

> That said, I'll still tune these settings in our infrastructure and see how
> things go, but it sounds like something should be done inside the
> kernel to help this situation, since it's so easy to trigger. Looking
> at the threads that led to the commits you referenced, though, I can
> see that this is complicated.

Yeah, there are certainly things that we should be doing, and reducing
the NOFS allocations is the first step. From my past experience,
non-trivial NOFS usage has often turned out to be incorrect. I am not
sure how much we can do for cgroup v1, though. If tuning the global dirty
thresholds doesn't lead to a better behavior, we can think of a band aid
of some form. Something like this (only compile tested):

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 05b4ec2c6499..4e1e8d121785 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -2532,6 +2536,20 @@ static int try_charge(struct mem_cgroup *memcg, gfp_t gfp_mask,
 	if (mem_cgroup_wait_acct_move(mem_over_limit))
 		goto retry;
 
+	/*
+	 * Legacy memcg relies on dirty data throttling during the reclaim
+	 * but this cannot be done for GFP_NOFS requests so we might trigger
+	 * the oom way too early. Throttle here if we have way too many
+	 * dirty/writeback pages.
+	 */
+	if (!cgroup_subsys_on_dfl(memory_cgrp_subsys) && !(gfp_mask & __GFP_FS)) {
+		unsigned long dirty = memcg_page_state(memcg, NR_FILE_DIRTY),
+			writeback = memcg_page_state(memcg, NR_WRITEBACK);
+
+		if (4 * (dirty + writeback) > 3 * page_counter_read(&memcg->memory))
+			schedule_timeout_interruptible(1);
+	}
+
 	if (nr_retries--)
 		goto retry;

[1] http://lkml.kernel.org/r/20200415070228.GW4629@dhcp22.suse.cz
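For readers following along, the 4:3 dirty-to-usage condition that the band aid above keys on can be approximated from userspace before deciding whether a given memcg is in this state. A rough sketch, assuming cgroup v1 with the memory controller mounted at /sys/fs/cgroup/memory; the script name is hypothetical, and since memory.stat and memory.usage_in_bytes both report bytes, the ratio comparison matches the in-kernel page-based check:

#!/bin/sh
# Usage: ./check_dirty_ratio.sh /sys/fs/cgroup/memory/<path-to-memcg>
# Rough userspace approximation of the throttle condition in the patch above.
CG="$1"

usage=$(cat "$CG/memory.usage_in_bytes")
dirty=$(awk '$1 == "total_dirty" {print $2}' "$CG/memory.stat")
writeback=$(awk '$1 == "total_writeback" {print $2}' "$CG/memory.stat")

# Same condition as the proposed throttle: more than 3/4 of the charged
# memory is dirty or under writeback.
if [ $((4 * (dirty + writeback))) -gt $((3 * usage)) ]; then
    echo "memcg dominated by dirty/writeback pages; GFP_NOFS charges would throttle"
else
    echo "dirty/writeback below the 3/4 threshold"
fi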
Hi Michal,

I am currently seeing the same issue when migrating a container from 4.14
to 5.4+ kernels. I tested with a configuration where the application
requests more than the container's limit, and I could easily trigger OOMs
on newer kernels. I tested your patch; however, I had to increase the
timeout from 1 jiffy to 10 to make it work, and with that it works for
both 5.4 and 5.10 on my workload. Are there any plans to merge this patch
upstream?

--
Thanks,
Anchal Agarwal
A quick correction here, as it may look confusing: "application requests
more than the container's limit". The application isn't really requesting
more memory; it just gets close to the container's memory limit without
crossing it. Just like the scenario above, where we see OOMs in the write
syscall due to restricted memory reclaim, I have the same scenario where
the kill comes from how the kernel allocates memory in the IO path rather
than from the application using more memory.

Thanks,
Anchal
On Wed, Apr 15, 2020 at 11:44:58AM +0200, Michal Hocko wrote:
> On Wed 15-04-20 04:34:56, Paul Furtado wrote:
[...]
> If tuning the global dirty
> thresholds doesn't lead to a better behavior, we can think of a band aid
> of some form. Something like this (only compile tested):
[...]
> [1] http://lkml.kernel.org/r/20200415070228.GW4629@dhcp22.suse.cz
> --
> Michal Hocko
> SUSE Labs

Hi Michal,

Following up on my conversation from bugzilla here:

I am currently seeing the same issue when migrating a container from 4.14
to 5.4+ kernels. I tested this patch with a configuration where the
application reaches the cgroup's memory limit while doing IO. The issue is
similar to the one described at
https://bugzilla.kernel.org/show_bug.cgi?id=207273, where we see OOMs in
the write syscall due to restricted memory reclaim.

I tested your patch; however, I had to increase the timeout from 1 jiffy
to 10 to make it work, and with that it works for both 5.4 and 5.10 on my
workload. I also tried adjusting the dirty_*_bytes settings, and that
worked after some tuning; however, there is no single set of values that
suits all use cases, so changing those defaults here and expecting them to
work for every kind of workload does not look viable to me. I think
working out a fix in the kernel may be the better option, since this issue
will show up in so many use cases where applications are used to the old
kernel behavior and suddenly start failing on newer ones. I see the same
stack trace on 4.19 kernels too.

Here is the stack trace:

dd invoked oom-killer: gfp_mask=0x101c4a(GFP_NOFS|__GFP_HIGHMEM|__GFP_HARDWALL|__GFP_MOVABLE|__GFP_WRITE), order=0, oom_score_adj=997
CPU: 0 PID: 28766 Comm: dd Not tainted 5.4.129-62.227.amzn2.x86_64 #1
Hardware name: Amazon EC2 m5.large/, BIOS 1.0 10/16/2017
Call Trace:
 dump_stack+0x50/0x6b
 dump_header+0x4a/0x200
 oom_kill_process+0xd7/0x110
 out_of_memory+0x105/0x510
 mem_cgroup_out_of_memory+0xb5/0xd0
 try_charge+0x766/0x7c0
 mem_cgroup_try_charge+0x70/0x190
 __add_to_page_cache_locked+0x355/0x390
 ? scan_shadow_nodes+0x30/0x30
 add_to_page_cache_lru+0x4a/0xc0
 pagecache_get_page+0xf5/0x210
 grab_cache_page_write_begin+0x1f/0x40
 iomap_write_begin.constprop.34+0x1ee/0x340
 ? iomap_write_end+0x91/0x240
 iomap_write_actor+0x92/0x170
 ? iomap_dirty_actor+0x1b0/0x1b0
 iomap_apply+0xba/0x130
 ? iomap_dirty_actor+0x1b0/0x1b0
 iomap_file_buffered_write+0x62/0x90
 ? iomap_dirty_actor+0x1b0/0x1b0
 xfs_file_buffered_aio_write+0xca/0x310 [xfs]
 new_sync_write+0x11b/0x1b0
 vfs_write+0xad/0x1a0
 ksys_write+0xa1/0xe0
 do_syscall_64+0x48/0xf0
 entry_SYSCALL_64_after_hwframe+0x44/0xa9
RIP: 0033:0x7fc956e853ad
Code: c3 8b 07 85 c0 75 24 49 89 fb 48 89 f0 48 89 d7 48 89 ce 4c 89 c2 4d 89 ca 4c 8b 44 24 08 4c 8b 4c 24 10 4c 89 5c 24 08 0f 05 <c3> e9 8a d2 ff ff 41 54 b8 02 00 00 00 49 89 f4 be 00 88 08 00 55
RSP: 002b:00007ffdf7960058 EFLAGS: 00000246 ORIG_RAX: 0000000000000001
RAX: ffffffffffffffda RBX: 00007fc956ec6b48 RCX: 00007fc956e853ad
RDX: 0000000000100000 RSI: 00007fc956cd9000 RDI: 0000000000000001
RBP: 00007fc956cd9000 R08: 0000000000000000 R09: 0000000000000000
R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000001
R13: 0000000000000000 R14: 00005558753057a0 R15: 0000000000100000
memory: usage 30720kB, limit 30720kB, failcnt 424
memory+swap: usage 30720kB, limit 9007199254740988kB, failcnt 0
kmem: usage 2416kB, limit 9007199254740988kB, failcnt 0
Memory cgroup stats for /kubepods/burstable/pod2d356ec7-5c92-4692-a184-380253ac6fbd:
anon 1089536
file 27475968
kernel_stack 73728
slab 1941504
sock 0
shmem 0
file_mapped 0
file_dirty 0
file_writeback 0
anon_thp 0
inactive_anon 0
active_anon 1351680
inactive_file 27705344
active_file 40960
unevictable 0
slab_reclaimable 819200
slab_unreclaimable 1122304
pgfault 23397
pgmajfault 0
workingset_refault 33
workingset_activate 33
workingset_nodereclaim 0
pgrefill 119108
pgscan 124436
pgsteal 928
pgactivate 123222
pgdeactivate 119083
pglazyfree 99
pglazyfreed 0
thp_fault_alloc 0
thp_collapse_alloc 0
Tasks state (memory values in pages):
[  pid  ]   uid  tgid total_vm      rss pgtables_bytes swapents oom_score_adj name
[  28589]     0 28589      242        1    28672        0          -998 pause
[  28703]     0 28703      399        1    40960        0           997 sh
[  28766]     0 28766      821      341    45056        0           997 dd
oom-kill:constraint=CONSTRAINT_MEMCG,nodemask=(null),cpuset=224eacdaa07c1a67f0cf2a5c85ffc6fe29d95f971743c7c7938de26e85351075,mems_allowed=0,oom_memcg=/kubepods/burstable/pod2d356ec7-5c92-4692-a184-380253ac6fbd,task_memcg=/kubepods/burstable/pod2d356ec7-5c92-4692-a184-380253ac6fbd/224eacdaa07c1a67f0cf2a5c85ffc6fe29d95f971743c7c7938de26e85351075,task=dd,pid=28766,uid=0
Memory cgroup out of memory: Killed process 28766 (dd) total-vm:3284kB, anon-rss:1036kB, file-rss:328kB, shmem-rss:0kB, UID:0 pgtables:44kB oom_score_adj:997
oom_reaper: reaped process 28766 (dd), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB

Here is a snippet of the container spec:

containers:
  - image: docker.io/library/alpine:latest
    name: dd
    command:
      - sh
    args:
      - -c
      - cat /proc/meminfo && apk add coreutils && dd if=/dev/zero of=/data/file bs=1M count=1000 && cat /proc/meminfo && echo "OK" && sleep 300
    resources:
      requests:
        memory: 30Mi
        cpu: 20m
      limits:
        memory: 30Mi

Thanks,
Anchal Agarwal
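For anyone who wants to reproduce this without kubernetes, a minimal sketch using a raw cgroup-v1 memcg is below. It assumes the memory controller is mounted at /sys/fs/cgroup/memory, a root shell, and an iomap-based filesystem such as XFS backing the output file; the cgroup name, output path, and sizes are illustrative and mirror the pod spec above:

#!/bin/sh
# Minimal cgroup-v1 reproducer mirroring the pod spec above: a 30MB memcg
# doing 1GB of buffered writes. Run as root; paths and sizes are illustrative.
CG=/sys/fs/cgroup/memory/dirty-oom-repro
mkdir -p "$CG"
echo $((30 * 1024 * 1024)) > "$CG/memory.limit_in_bytes"

# Move this shell (and therefore the dd below) into the new cgroup.
echo $$ > "$CG/cgroup.procs"

# On affected kernels dd gets OOM-killed with almost no RSS, the charge being
# dominated by dirty page cache.
dd if=/dev/zero of=/data/file bs=1M count=1000

# Inspect the per-cgroup dirty/writeback accounting afterwards.
grep -E '^(dirty|writeback) ' "$CG/memory.stat"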
On Fri, Aug 06, 2021 at 08:42:46PM +0000, Anchal Agarwal wrote:
[...]
> Thanks,
> Anchal Agarwal

A gentle ping on this issue!

Thanks,
Anchal Agarwal
On Wed 11-08-21 19:41:23, Anchal Agarwal wrote:
> A gentle ping on this issue!

Sorry, I am currently swamped by other stuff and will be offline next
week. I still have this on my todo list and will try to get back to it
soon, hopefully.
On Thu, 2021-08-19 at 11:35 +0200, Michal Hocko wrote:
> On Wed 11-08-21 19:41:23, Anchal Agarwal wrote:
> > A gentle ping on this issue!
>
> Sorry, I am currently swamped by other stuff and will be offline next
> week. I still have this on my todo list and will try to get back to it
> soon, hopefully.

Hi Michal!

Thanks for your reply! Enjoy your time off!

Cheers,
Ben.
I am also facing the same problem. By the way, it has been two years since
this was reported. Are there any further updates on this thread?