Bug 190841 - [REGRESSION] Intensive Memory CGroup removal leads to high load average 10+
Summary: [REGRESSION] Intensive Memory CGroup removal leads to high load average 10+
Status: NEW
Alias: None
Product: File System
Classification: Unclassified
Component: SysFS
Hardware: All Linux
Importance: P1 normal
Assignee: Andrew Morton
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2016-12-21 19:56 UTC by Vlad Frolov
Modified: 2017-12-26 20:40 UTC

See Also:
Kernel Version: 4.7.0-rc1+
Subsystem:
Regression: Yes
Bisected commit-id:


Attachments

Description Vlad Frolov 2016-12-21 19:56:16 UTC
My simplified workflow looks like this:

1. Create a Memory CGroup with memory limit
2. Exec a child process
3. Add the child process PID into the Memory CGroup
4. Wait for the child process to finish
5. Remove the Memory CGroup

The child processes usually run less than 0.1 seconds, but I have lots of them. Normally, I could run over 10000 child processes per minute, but with newer kernels, I can only do 400-500 executions per minute, and my system becomes extremely sluggish (the only indicator of the weirdness I found is an unusually high load average, which sometimes goes over 250!).

Here is a simple reproduction script:

#!/bin/sh
CGROUP_BASE=/sys/fs/cgroup/memory/qq

for i in $(seq 1000); do
    echo "Iteration #$i"
    sh -c "
        mkdir '$CGROUP_BASE'
        sh -c 'echo \$$ > $CGROUP_BASE/tasks ; sleep 0.0'
        rmdir '$CGROUP_BASE' || true
    "
done
# ===

Running this script on 4.7.0-rc1 and above, I get a noticeable slowdown and a high load average, with no other indicators such as high CPU or IO usage reported in top/iotop/vmstat.
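
To see what I mean, it is enough to watch the load average next to the
CPU counters while the script runs; something like this will do (just an
illustration, any equivalent tools work as well):

watch -n 1 cat /proc/loadavg    # load average keeps climbing...
vmstat 1                        # ...while CPU stays mostly idle and there is no IO wait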

It used to work just fine up until kernel 4.7.0. In fact, I jumped straight from 4.4 to a 4.8 kernel, so I had to test several kernels before concluding that this seems to be a kernel regression. So far I have tried the following kernels (using a fresh minimal Ubuntu 16.04 on VirtualBox with the Ubuntu binary mainline kernels):

* Ubuntu 4.4.0-57 kernel works fine
* Mainline 4.4.39 and below seem to work just fine - https://youtu.be/tGD6sfwa-3c
* Mainline 4.6.7 kernel behaves semi-normally; the load average is higher than on 4.4, but not as bad as on 4.7+ - https://youtu.be/-CyhmkkPbKE
* Mainline 4.7.0-rc1 kernel is the first kernel after 4.6.7 that is available in binaries, so I chose to test it and it doesn't play nicely - https://youtu.be/C_J5es74Ars
* Mainline 4.9.0 kernel still doesn't play nicely - https://youtu.be/_o17U5x3bmY

OTHER NOTES:
1. Using VirtualBox, I have noticed that this bug is only reproducible when I have 2+ CPU cores!
2. This bug is also reproducible on other Linux distributions: Fedora 25 with the 4.8.14-300.fc25.x86_64 kernel, and the latest Arch Linux with 4.8.13 and 4.8.15 with the Liquorix patchset.
3. Commenting out `rmdir '$CGROUP_BASE'` in the reproduction script makes things fly yet again, but I don't want to leave leftovers after the runs.
Comment 1 Andrew Morton 2017-01-05 01:29:24 UTC
(switched to email.  Please respond via emailed reply-to-all, not via the
bugzilla web interface).

On Wed, 21 Dec 2016 19:56:16 +0000 bugzilla-daemon@bugzilla.kernel.org wrote:

> https://bugzilla.kernel.org/show_bug.cgi?id=190841
> 
>             Bug ID: 190841
>            Summary: [REGRESSION] Intensive Memory CGroup removal leads to
>                     high load average 10+
>            Product: Memory Management
>            Version: 2.5
>     Kernel Version: 4.7.0-rc1+
>           Hardware: All
>                 OS: Linux
>               Tree: Mainline
>             Status: NEW
>           Severity: normal
>           Priority: P1
>          Component: Other
>           Assignee: akpm@linux-foundation.org
>           Reporter: frolvlad@gmail.com
>         Regression: No
> 
> My simplified workflow looks like this:
> 
> 1. Create a Memory CGroup with memory limit
> 2. Exec a child process
> 3. Add the child process PID into the Memory CGroup
> 4. Wait for the child process to finish
> 5. Remove the Memory CGroup
> 
> The child processes usually run less than 0.1 seconds, but I have lots of
> them.
> Normally, I could run over 10000 child processes per minute, but with newer
> kernels, I can only do 400-500 executions per minute, and my system becomes
> extremely sluggish (the only indicator of the weirdness I found is an
> unusually
> high load average, which sometimes goes over 250!).
> 
> Here is a simple reproduction script:
> 
> #!/bin/sh
> CGROUP_BASE=/sys/fs/cgroup/memory/qq
> 
> for $i in $(seq 1000); do
>     echo "Iteration #$i"
>     sh -c "
>         mkdir '$CGROUP_BASE'
>         sh -c 'echo \$$ > $CGROUP_BASE/tasks ; sleep 0.0'
>         rmdir '$CGROUP_BASE' || true
>     "
> done
> # ===
> 
> Running this script on 4.7.0-rc1 and above I get a noticeable slowdown and
> also
> high load average with no other indicators like high CPU or IO usage reported
> in top/iotop/vmstat.
> 
> It used to work just fine up until Kernel 4.7.0. In fact, I have jumped from
> 4.4 to 4.8 kernel, so I had to test several kernels before I came to the
> conclusion that this seems to be a regression in Kernel. Currently, I have
> tried the following kernels (using a fresh minimal Ubuntu 16.04 on VirtualBox
> with their binary mainline kernels):
> 
> * Ubuntu 4.4.0-57 kernel works fine
> * Mainline 4.4.39 and below seem to work just fine -
> https://youtu.be/tGD6sfwa-3c
> * Mainline 4.6.7 kernel behaves seminormal, load average is higher than on
> 4.4,
> but not as bad as on 4.7+ - https://youtu.be/-CyhmkkPbKE
> * Mainline 4.7.0-rc1 kernel is the first kernel after 4.6.7 that is available
> in binaries, so I chose to test it and it doesn't play nicely -
> https://youtu.be/C_J5es74Ars
> * Mainline 4.9.0 kernel still doesn't play nicely -
> https://youtu.be/_o17U5x3bmY
> 
> OTHER NOTES:
> 1. Using VirtualBox I have noticed that this bug only reproducible when I
> have
> 2+ CPU cores!
> 2. This bug is also reproducible on other Linux distibutions: Fedora 25 with
> 4.8.14-300.fc25.x86_64 kernel, latest Arch Linux with 4.8.13 and 4.8.15 with
> Liquorix patchset.
> 3. Commenting out `rmdir '$CGROUP_BASE'` in the reproduction script makes
> things fly yet again, but I don't want to leave leftovers after the runs.
> 
> -- 
> You are receiving this mail because:
> You are the assignee for the bug.
Comment 2 Vlad Frolov 2017-01-05 20:27:17 UTC
> I would expect older kernels would just refuse to create new cgroups...
> Maybe that happens in your script and just gets unnoticed?

I have been running a production service doing this "intensive"
cgroup creation and cleanup for over a year now, and it just works
with 3.x - 4.5 kernels (currently I run it on an LTS 4.4 kernel),
triggering up to 100 cgroup creation/cleanup events per second
non-stop for months, and I haven't noticed any refusals to create new
cgroups whatsoever, even on 1 GB RAM boxes.


> Even without memcg involved. Are there any strong reasons you cannot reuse an
> existing cgroup?

I run concurrent executions (I use cgmemtime
[https://github.com/gsauthof/cgmemtime] to measure the high-water memory
usage of a group of processes), so I cannot reuse a single cgroup, and
I currently cannot maintain a pool of cgroups (it would add extra
complexity to my code and would require patching cgmemtime, while
older kernels worked just fine); a rough sketch of what one such
per-run measurement looks like follows the list below. Do you believe
there is no bug here and it is just slow by design? There are a few
odd things here:

1. 4.7+ kernels perform 20 times *slower*, while postponing the real
removal should in theory speed things up due to its "async" nature
2. Creation/cleanup of other cgroup controllers works like a charm; it
is only the `memory` cgroup that overloads my system
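
For context, the per-run measurement cgmemtime does for me is roughly
equivalent to this sketch (cgroup v1 memory controller paths; the real
tool differs in details, and "some-short-lived-job" is only a placeholder):

G=/sys/fs/cgroup/memory/run-$$            # one throwaway cgroup per measured run
mkdir "$G"
echo 0 > "$G/memory.max_usage_in_bytes"   # reset the high-water mark
sh -c "echo \$\$ > '$G/tasks'; exec some-short-lived-job"
cat "$G/memory.max_usage_in_bytes"        # peak memory charged to the job
rmdir "$G"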


> echo 1 > $CGROUP_BASE/memory.force_empty

This didn't help at all.

On 5 January 2017 at 14:33, Michal Hocko <mhocko@kernel.org> wrote:
> On Wed 04-01-17 17:30:37, Andrew Morton wrote:
>>
>> (switched to email.  Please respond via emailed reply-to-all, not via the
>> bugzilla web interface).
>>
>> On Wed, 21 Dec 2016 19:56:16 +0000 bugzilla-daemon@bugzilla.kernel.org
>> wrote:
>>
>> > https://bugzilla.kernel.org/show_bug.cgi?id=190841
>> >
>> >             Bug ID: 190841
>> >            Summary: [REGRESSION] Intensive Memory CGroup removal leads to
>> >                     high load average 10+
>> >            Product: Memory Management
>> >            Version: 2.5
>> >     Kernel Version: 4.7.0-rc1+
>> >           Hardware: All
>> >                 OS: Linux
>> >               Tree: Mainline
>> >             Status: NEW
>> >           Severity: normal
>> >           Priority: P1
>> >          Component: Other
>> >           Assignee: akpm@linux-foundation.org
>> >           Reporter: frolvlad@gmail.com
>> >         Regression: No
>> >
>> > My simplified workflow looks like this:
>> >
>> > 1. Create a Memory CGroup with memory limit
>> > 2. Exec a child process
>> > 3. Add the child process PID into the Memory CGroup
>> > 4. Wait for the child process to finish
>> > 5. Remove the Memory CGroup
>> >
>> > The child processes usually run less than 0.1 seconds, but I have lots of
>> them.
>> > Normally, I could run over 10000 child processes per minute, but with
>> newer
>> > kernels, I can only do 400-500 executions per minute, and my system
>> becomes
>> > extremely sluggish (the only indicator of the weirdness I found is an
>> unusually
>> > high load average, which sometimes goes over 250!).
>
> Well, yes, rmdir is not the cheapest operation... Since b2052564e66d
> ("mm: memcontrol: continue cache reclaim from offlined groups") we are
> postponing the real memcg removal to later, when there is a memory
> pressure. 73f576c04b94 ("mm: memcontrol: fix cgroup creation failure
> after many small jobs") fixed unbounded id space consumption. I would be
> quite surprised if this caused a new regression. But the report says
> that this is a 4.7+ thing. I would expect older kernels would just refuse
> to create new cgroups... Maybe that happens in your script and just
> goes unnoticed?
>
> We might come up with some more hardening in the offline path (e.g.
> count the number of dead memcgs and force their reclaim after some
> number gets accumulated). But all that just adds more code and risk of
> regression for something that is not used very often. Cgroup
> creation/destruction operations are too heavy to be done for a very
> short-lived process, even without memcg involved. Are there any strong
> reasons you cannot reuse an existing cgroup?
>
>> > Here is a simple reproduction script:
>> >
>> > #!/bin/sh
>> > CGROUP_BASE=/sys/fs/cgroup/memory/qq
>> >
>> > for $i in $(seq 1000); do
>> >     echo "Iteration #$i"
>> >     sh -c "
>> >         mkdir '$CGROUP_BASE'
>> >         sh -c 'echo \$$ > $CGROUP_BASE/tasks ; sleep 0.0'
>
> one possible workaround would be to do
>             echo 1 > $CGROUP_BASE/memory.force_empty
>
> before you remove the cgroup. That should drop the existing charges - at
> least for the page cache which might be what keeps those memcgs alive.
>
>> >         rmdir '$CGROUP_BASE' || true
>> >     "
>> > done
>> > # ===
>
> --
> Michal Hocko
> SUSE Labs
Comment 3 Johannes Weiner 2017-01-05 21:23:05 UTC
On Wed, Jan 04, 2017 at 05:30:37PM -0800, Andrew Morton wrote:
> > My simplified workflow looks like this:
> > 
> > 1. Create a Memory CGroup with memory limit
> > 2. Exec a child process
> > 3. Add the child process PID into the Memory CGroup
> > 4. Wait for the child process to finish
> > 5. Remove the Memory CGroup
> > 
> > The child processes usually run less than 0.1 seconds, but I have lots of
> them.
> > Normally, I could run over 10000 child processes per minute, but with newer
> > kernels, I can only do 400-500 executions per minute, and my system becomes
> > extremely sluggish (the only indicator of the weirdness I found is an
> unusually
> > high load average, which sometimes goes over 250!).
> > 
> > Here is a simple reproduction script:
> > 
> > #!/bin/sh
> > CGROUP_BASE=/sys/fs/cgroup/memory/qq
> > 
> > for $i in $(seq 1000); do
> >     echo "Iteration #$i"
> >     sh -c "
> >         mkdir '$CGROUP_BASE'
> >         sh -c 'echo \$$ > $CGROUP_BASE/tasks ; sleep 0.0'
> >         rmdir '$CGROUP_BASE' || true
> >     "
> > done
> > # ===

You're not even running anything concurrently. While I agree with
Michal that cgroup creation and destruction are not the fastest paths,
a load of 250 from a single-threaded testcase is silly.

We recently had a load spike issue with the on-demand memcg slab
cache duplication, but that should have happened in 4.6 already. I
don't see anything suspicious going into memcontrol.c after 4.6.

When the load is high like this, can you check with ps what the
blocked tasks are?

A run with perf record -a also might give us an idea if cycles go to
the wrong place.
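
Something along these lines should do (just a sketch; adjust the sleep so
it covers a window where the load is already high):

# tasks stuck in uninterruptible sleep (D state) and what they are waiting on
ps -eo pid,state,wchan:32,comm | awk '$2 ~ /^D/'

# system-wide profile for ~30 seconds while the reproduction script is running
perf record -a -g -- sleep 30
perf report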

I'll try to reproduce this once I have access to my test machine again
next week.
Comment 4 Michal Hocko 2017-01-06 14:08:11 UTC
On Thu 05-01-17 22:26:53, Vladyslav Frolov wrote:
[...]
> > Even without memcg involved. Are there any strong reasons you cannot reuse
> an existing cgroup?
> 
> I run concurrent executions (I run cgmemtime
> [https://github.com/gsauthof/cgmemtime] to measure high-water memory
> usage of a group of processes), so I cannot reuse a single cgroup, and
> I, currently, cannot maintain a pool of cgroups (it will add extra
> complexity in my code, and will require cgmemtime patching, while
> older kernels just worked fine). Do you believe there is no bug there
> and it is just slow by design?

> There are a few odd things here:
> 
> 1. 4.7+ kernels perform 20 times *slower* while postponing should in
> theory speed things up due to "async" nature
> 2. Other cgroup creation/cleaning work like a charm, it is only
> `memory` cgroup making my system overloaded
> 
> > echo 1 > $CGROUP_BASE/memory.force_empty
> 
> This didn't help at alll.

OK, then it is not just the page cache staying behind that prevents
those memcgs from going away. Another reason might be kmem charges. Memcg
kernel memory accounting has been enabled by default since 4.6 AFAIR. You
say the slowdown appeared in 4.7+, though, so this might be completely
unrelated. But it would be good to see whether the same happens with the
kernel command line option:
cgroup.memory=nokmem
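
(On a GRUB-based setup like the Ubuntu VMs mentioned above that would mean
roughly the following; just a sketch, any way of getting the parameter onto
the boot command line is fine:)

# add cgroup.memory=nokmem to GRUB_CMDLINE_LINUX in /etc/default/grub, then:
sudo update-grub     # or the grub2-mkconfig equivalent on non-Debian distros
sudo reboot
cat /proc/cmdline    # verify the option is present after the reboot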
Comment 5 Vlad Frolov 2017-01-12 13:56:24 UTC
Indeed, `cgroup.memory=nokmem` works around the high load average on
all the kernels!

The 4.10rc2 kernel without `cgroup.memory=nokmem` behaves much better
than the 4.7-4.9 kernels, yet it still reaches LA ~6 with my reproduction
script, while LA <= 1.0 is expected. 4.10rc2 feels like 4.6, which I
described as "semi-normal".

Running the reproduction script 3000 times gives the following results:

* 4.4 kernel takes 13 seconds to complete and LA <= 1.0
* 4.6-4.10rc2 kernels with `cgroup.memory=nokmem` also take 13
seconds to complete and LA <= 1.0
* 4.6 kernel takes 25 seconds to complete and LA ~= 5
* 4.7-4.9 kernels take 6-9 minutes (yes, 25-40 times slower than with
`nokmem`) to complete and LA > 20
* 4.10rc2 kernel takes 60 seconds (4 times slower than with `nokmem`)
to complete and LA ~= 6
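
The measurement itself is nothing fancy; with the script above saved as,
say, repro.sh (adjusted to do 3000 iterations), something like this is
all it takes:

time sh ./repro.sh    # wall-clock time for the whole run
cat /proc/loadavg     # load average right after it finishes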


On 6 January 2017 at 18:28, Vladimir Davydov <vdavydov@tarantool.org> wrote:
> Hello,
>
> The issue does look like kmemcg related - see below.
>
> On Wed, Jan 04, 2017 at 05:30:37PM -0800, Andrew Morton wrote:
>
>> > * Ubuntu 4.4.0-57 kernel works fine
>> > * Mainline 4.4.39 and below seem to work just fine -
>> > https://youtu.be/tGD6sfwa-3c
>
> kmemcg is disabled
>
>> > * Mainline 4.6.7 kernel behaves seminormal, load average is higher than on
>> 4.4,
>> > but not as bad as on 4.7+ - https://youtu.be/-CyhmkkPbKE
>
> 4.6+
>
> b313aeee25098 mm: memcontrol: enable kmem accounting for all cgroups in the
> legacy hierarchy
>
> kmemcg is enabled by default for all cgroups, which introduces extra
> overhead to memcg destruction path
>
>> > * Mainline 4.7.0-rc1 kernel is the first kernel after 4.6.7 that is
>> available
>> > in binaries, so I chose to test it and it doesn't play nicely -
>> > https://youtu.be/C_J5es74Ars
>
> 4.7+
>
> 81ae6d03952c1 mm/slub.c: replace kick_all_cpus_sync() with
> synchronize_sched() in kmem_cache_shrink()
>
> kick_all_cpus_sync(), which was used for synchronizing slub cache
> destruction before this commit, turns out to be too disruptive on big
> SMP machines as it generates a lot of IPIs, so it is replaced with more
> lightweight synchronize_sched(). The latter, however, blocks cgroup
> rmdir under the slab_mutex for relatively long, resulting in higher load
> average as well as stalling other processes trying to create or destroy
> a kmem cache.
>
>> > * Mainline 4.9.0 kernel still doesn't play nicely -
>> > https://youtu.be/_o17U5x3bmY
>
> The above-mentioned issue is still unfixed.
>
>> >
>> > OTHER NOTES:
>> > 1. Using VirtualBox I have noticed that this bug only reproducible when I
>> have
>> > 2+ CPU cores!
>
> synchronize_sched() is a no-op on UP machines, which explains why on a
> UP machine the problem goes away.
>
> If I'm correct, the issue must have been fixed in 4.10, which is yet to
> be released:
>
> 89e364db71fb5 slub: move synchronize_sched out of slab_mutex on shrink
>
> You can workaround it on older kernels by turning kmem accounting off.
> To do that, append 'cgroup.memory=nokmem' to the kernel command line.
> Alternatively, you can try to recompile the kernel choosing SLAB as the
> slab allocator, because only SLUB is affected IIRC.
>
> FWIW I tried the script you provided in a 4 CPU VM running 4.10-rc2 and
> didn't notice any significant stalls or latency spikes. Could you please
> check if this kernel fixes your problem? If it does it might be worth
> submitting the patch to stable..
Comment 6 Vlad Frolov 2017-03-01 19:31:37 UTC
Any progress on this issue?
Comment 7 Vlad Frolov 2017-12-26 20:40:35 UTC
It seems that this issue has been fixed in one of the recent major
releases. I cannot reproduce it on 4.14.8 now (I still can reproduce
the issue on the same host with the older kernels and even with 4.9.71
LTS).

Can someone close the issue on bugzilla?

Thank you!
