Bug 172981

Summary: [bisected] SLAB: extreme load averages and over 2000 kworker threads
Product: Memory Management
Reporter: Doug Smythies (dsmythies)
Component: Slab Allocator
Assignee: Andrew Morton (akpm)
Status: RESOLVED CODE_FIX
Severity: normal
CC: kernelorg, lee295012, szg00000
Priority: P1
Hardware: All
OS: Linux
Kernel Version: 4.7+
Subsystem:
Regression: No
Bisected commit-id:
Attachments: Just a very simple script used to create many kworker processes

Description Doug Smythies 2016-09-27 17:57:08 UTC
Immediately after boot, extreme load average numbers and over 2000 kworker processes are being observed on my main Linux test computer (basically an Ubuntu 16.04 server, no GUI). The worker threads appear to be idle, and do disappear after the nominal 5 minute timeout, depending on whatever else might run in the meantime. However, the number of threads can hugely increase again. The issue is easy to reproduce with kernels compiled using SLAB.

For SLAB, kernel bisection gave:
801faf0db8947e01877920e848a4d338dd7a99e7
"mm/slab: lockless decision to grow cache"

The following monitoring script was used for the examples below:

#!/bin/dash

while [ 1 ];
do
  echo $(uptime) ::: $(ps -A --no-headers | wc -l) ::: $(ps aux | grep kworker | grep -v u | grep -v H | wc -l)
  sleep 10.0
done
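
The two `grep -v` filters in the script above appear intended to drop unbound (`kworker/u...`) and high-priority (`...H`) workers, but they match anywhere on the `ps aux` line. A minimal sketch of a stricter count follows, assuming the `kworker/<cpu>:<id>` thread naming used by kernels of this era and a procps `ps` that accepts `-eo comm=`. The function names are illustrative and not part of the original script.

```sh
#!/bin/sh
# Sketch only: count per-CPU, normal-priority kworker threads by name.
# Assumes names of the form kworker/<cpu>:<id>, with an "H" suffix for
# high-priority workers and a "u" before the pool number for unbound ones.

# Reads one command name per line on stdin and prints the number of matches.
count_percpu_kworkers() {
    # grep -c exits non-zero when the count is 0, so mask that status.
    grep -Ec '^kworker/[0-9]+:[0-9]+($|[^0-9H])' || true
}

# Prints one sample: uptime, total processes, per-CPU kworkers.
sample() {
    total=$(ps -A --no-headers | wc -l)
    workers=$(ps -eo comm= | count_percpu_kworkers)
    echo "$(uptime) ::: $total ::: $workers"
}

# Hypothetical driver loop; the optional argument is the interval in seconds.
monitor() {
    while true; do
        sample
        sleep "${1:-10}"
    done
}
```

Calling `monitor` (or `monitor 5` for a shorter interval) prints lines in the same layout as the examples in this report, so the numbers stay comparable.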

Example (SLAB):

After boot:

22:26:21 up 1 min, 2 users, load average: 295.98, 85.67, 29.47 ::: 2240 ::: 2074
22:26:31 up 1 min, 2 users, load average: 250.47, 82.85, 29.15 ::: 2240 ::: 2074
22:26:41 up 1 min, 2 users, load average: 211.96, 80.12, 28.84 ::: 2240 ::: 2074
...
22:52:34 up 27 min, 3 users, load average: 0.00, 0.43, 5.40 ::: 165 ::: 17
22:52:44 up 27 min, 3 users, load average: 0.00, 0.42, 5.34 ::: 165 ::: 17

Now type: sudo echo "bla":

22:53:14 up 27 min, 3 users, load average: 0.00, 0.38, 5.17 ::: 493 ::: 345
22:53:24 up 28 min, 3 users, load average: 0.00, 0.36, 5.11 ::: 493 ::: 345

This caused 328 new kworker threads (345 - 17).
Now queue just a few (8 in this case) very simple jobs.

22:55:45 up 30 min, 3 users, load average: 0.11, 0.27, 4.38 ::: 493 ::: 345
22:55:55 up 30 min, 3 users, load average: 0.09, 0.26, 4.34 ::: 2207 ::: 2059
22:56:05 up 30 min, 3 users, load average: 0.08, 0.25, 4.29 ::: 2207 ::: 2059

If I look at linux/Documentation/workqueue.txt and do:

echo workqueue:workqueue_queue_work > /sys/kernel/debug/tracing/set_event

and:

cat /sys/kernel/debug/tracing/trace_pipe > out.txt

I get somewhere between 10,000 and 20,000 occurrences of memcg_kmem_cache_create_func in the file (using my simple test method).
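
To see which work functions dominate a capture such as out.txt, the event lines can be tallied per function. This is only a sketch: it assumes each workqueue_queue_work event line contains a `function=<name>` field, which to the best of my knowledge is how the tracepoint was printed on kernels of this period, so check a few lines of your own capture first. The helper name is hypothetical.

```sh
#!/bin/sh
# Sketch only: tally queued work items per function from ftrace text.
# Reads trace text on stdin; prints "<count> <function>", largest first.
count_work_functions() {
    # Keep only the value of the function= field, then count duplicates.
    sed -n 's/.*function=\([^ ]*\).*/\1/p' |
        sort | uniq -c | sort -rn |
        awk '{ print $1, $2 }'
}
```

Typical use would be `count_work_functions < out.txt | head`, which lists the most frequently queued functions at the top.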

Also tested with kernel 4.8-rc7.
Comment 1 Andrew Morton 2016-09-27 18:11:03 UTC
(switched to email.  Please respond via emailed reply-to-all, not via the
bugzilla web interface).

Comment 2 Doug Smythies 2016-09-27 18:20:31 UTC
Created attachment 239831
Just a very simple script used to create many kworker processes
Comment 3 Johannes Weiner 2016-09-28 02:23:22 UTC
[CC Vladimir]

These are the delayed memcg cache allocations, where in a fresh memcg
that doesn't have per-memcg caches yet, every accounted allocation
schedules a kmalloc work item in __memcg_schedule_kmem_cache_create()
until the cache is finally available. It looks like those can be many
more than the number of slab caches in existence, if there is a storm
of slab allocations before the workers get a chance to run.

Vladimir, what do you think of embedding the work item into the
memcg_cache_array? That way we make sure we have exactly one work per
cache and not an unbounded number of them. The downside of course is
that we'd have to keep these things around as long as the memcg is in
existence, but that's the only place I can think of that allows us to
serialize this.
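
The difference between the behaviour described above and the proposal can be illustrated with a toy model in shell (this is not kernel code). It assumes a storm of accounted allocations arrives before any worker has run, represented as one slab cache name per line. The cache names used to try it out and the function names are made up for illustration.

```sh
#!/bin/sh
# Toy model only: each input line is one accounted allocation, named by the
# slab cache it hits, arriving before any cache-creation worker has run.

# Behaviour described above: every such allocation queues its own work item.
queued_per_allocation() {
    awk 'END { print NR }'
}

# Proposed behaviour: at most one pending work item per cache.
queued_once_per_cache() {
    sort -u | awk 'END { print NR }'
}
```

Feeding both functions the same list shows the first growing with the number of allocations, while the second is bounded by the number of distinct caches.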

Comment 4 Doug Smythies 2016-09-28 03:22:13 UTC
By the way, I can eliminate the problem by doing this:
(see also: https://bugzilla.kernel.org/show_bug.cgi?id=172991)

diff --git a/mm/slab.c b/mm/slab.c
index b672710..a4edbfa 100644
--- a/mm/slab.c
+++ b/mm/slab.c
@@ -965,7 +965,7 @@ static int setup_kmem_cache_node(struct kmem_cache *cachep,
         * freed after synchronize_sched().
         */
        if (force_change)
-               synchronize_sched();
+               kick_all_cpus_sync();

 fail:
        kfree(old_shared);
Comment 5 Joonsoo Kim 2016-09-28 05:10:20 UTC
On Tue, Sep 27, 2016 at 08:13:58PM -0700, Doug Smythies wrote:
> By the way, I can eliminate the problem by doing this:
> (see also: https://bugzilla.kernel.org/show_bug.cgi?id=172991)

I think Johannes found the root cause of the problem, and he and
Vladimir will fix it.

However, there is also something useful we can do on the SLAB side.
Could you test the following patch, please?

Thanks.

---------->8--------------
diff --git a/mm/slab.c b/mm/slab.c
index 0eb6691..39e3bf2 100644
--- a/mm/slab.c
+++ b/mm/slab.c
@@ -965,7 +965,7 @@ static int setup_kmem_cache_node(struct kmem_cache *cachep,
         * guaranteed to be valid until irq is re-enabled, because it will be
         * freed after synchronize_sched().
         */
-       if (force_change)
+       if (n->shared && force_change)
                synchronize_sched();
 
 fail:
Comment 6 Joonsoo Kim 2016-09-28 06:12:02 UTC
On Wed, Sep 28, 2016 at 02:18:42PM +0900, Joonsoo Kim wrote:
> -       if (force_change)
> +       if (n->shared && force_change)
>                 synchronize_sched();

Oops...

s/n->shared/old_shared/

Thanks.
Comment 7 Doug Smythies 2016-09-28 15:22:33 UTC
On 2016.09.27 23:20 Joonsoo Kim wrote:
> Oops...
>
> s/n->shared/old_shared/

Yes, that seems to work fine. After boot everything is good.
Then I tried and tried to get it to mess up, but could not.
Comment 8 Joonsoo Kim 2016-09-29 01:42:19 UTC
On Wed, Sep 28, 2016 at 08:22:24AM -0700, Doug Smythies wrote:
> Yes, that seems to work fine. After boot everything is good.
> Then I tried and tried to get it to mess up, but could not.

Thanks for confirming.
I will send a formal patch soon.

Thanks.
Comment 9 Joonsoo Kim 2016-09-29 01:52:26 UTC
On Wed, Sep 28, 2016 at 11:09:53AM +0300, Vladimir Davydov wrote:
> We could set the entry of the root_cache->memcg_params.memcg_caches
> array corresponding to the cache being created to a special value, say
> (void*)1, and skip scheduling cache creation work on kmalloc if the
> caller sees it. I'm not sure it's really worth it though, because
> work_struct isn't that big (at least, in comparison with the cache
> itself) to avoid embedding it at all costs.

Hello, Johannes and Vladimir.

I'm not familiar with memcg, so I have a question about this solution.
It will solve the current issue, but if a burst of memcg creation
happens, a similar issue would happen again. Is my understanding correct?

I think that the other cause of the problem is that we call
synchronize_sched(), which is rather slow, while holding the slab_mutex,
and that blocks further kmem_cache creation. Should we fix that, too?

Thanks.
Comment 10 Joonsoo Kim 2016-09-30 08:11:14 UTC
On Thu, Sep 29, 2016 at 04:45:50PM +0300, Vladimir Davydov wrote:
> Yes, I think you're right - embedding the work_struct responsible for
> cache creation in kmem_cache struct won't help if a thousand of
> different cgroups call kmem_cache_alloc() simultaneously for a cache
> they haven't used yet.
> 
> Come to think of it, we could fix the issue by simply introducing a
> special single-threaded workqueue used exclusively for cache creation
> works - cache creation is done mostly under the slab_mutex, anyway. This
> way, we wouldn't have to keep those used-once work_structs for the whole
> kmem_cache life time.
> 
> > 
> > I think that the other cause of the problem is that we call
> > synchronize_sched() which is rather slow with holding a slab_mutex and
> > it blocks further kmem_cache creation. Should we fix that, too?
> 
> Well, the patch you posted looks pretty obvious and it helps the
> reporter, so personally I don't see any reason for not applying it.

Oops... I forgot to mention why I asked that.

There is another report that a similar problem also happens in SLUB:

https://bugzilla.kernel.org/show_bug.cgi?id=172991

There, synchronize_sched() is called in the cache shrinking path while
holding the slab_mutex. I guess that it blocks further kmem_cache creation.

If we use a special single-threaded workqueue, the number of kworkers would
be limited, but kmem_cache creation would be delayed for a long time in a
burst memcg creation/destruction scenario.

Do we need to remove synchronize_sched() in SLUB and find another
solution?

Thanks.
Comment 11 Doug Smythies 2016-10-06 05:04:34 UTC
On 2016.09.30 12:59 Vladimir Davydov wrote:

> Yeah, you're right. We'd better do something about this
> synchronize_sched(). I think moving it out of the slab_mutex and calling
> it once for all caches in memcg_deactivate_kmem_caches() would resolve
> the issue. I'll post the patches tomorrow.

Would someone please be kind enough to send me the patch set?

I didn't get them, and would like to test them.
I have searched and searched and did manage to find:
"[PATCH 2/2] slub: move synchronize_sched out of slab_mutex on shrink"
And a thread about patch 1 of 2:
"Re: [PATCH 1/2] mm: memcontrol: use special workqueue for creating per-memcg caches"
where I see myself listed as "reported by", but I guess "reported by" people don't get the e-mails.
I haven't found PATCH 0/2, nor do I know if what I did find is current.

... Doug
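
As a rough illustration of the suggestion quoted above (calling synchronize_sched() once for all caches, outside slab_mutex, instead of once per cache while holding it), consider a back-of-the-envelope model in which waits taken under the mutex serialize. The sketch below is shell arithmetic, not a measurement; the grace-period duration is a free parameter and the function names are hypothetical.

```sh
#!/bin/sh
# Toy cost model only; arguments are <number of caches> <grace period in ms>.

# One grace-period wait per cache, each taken while holding the shared mutex,
# so the waits add up.
serialized_wait_ms() {
    echo $(( $1 * $2 ))
}

# A single grace-period wait covering all caches.
batched_wait_ms() {
    echo "$2"
}
```

With made-up inputs, `serialized_wait_ms 2000 10` prints 20000 while `batched_wait_ms 2000 10` prints 10. The model ignores everything except the grace-period waits.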
Comment 12 Joonsoo Kim 2016-10-06 06:35:21 UTC
On Wed, Oct 05, 2016 at 10:04:27PM -0700, Doug Smythies wrote:
> Would someone please be kind enough to send me the patch set?
> 
> I didn't get them, and would like to test them.
> I have searched and searched and did manage to find:
> "[PATCH 2/2] slub: move synchronize_sched out of slab_mutex on shrink"
> And a thread about a patch 1 of 2:
> "Re: [PATCH 1/2] mm: memcontrol: use special workqueue for creating per-memcg
> caches"
> Where I see me as "reported by", but I guess "reported by" people don't get
> the e-mails.
> I haven't found PATCH 0/2, nor do I know if what I did find is current.

I think that what you found is the correct set. It has no cover letter, so
there is no [PATCH 0/2]. Anyway, to clarify, here are links to these
patches.

https://patchwork.kernel.org/patch/9361853
https://patchwork.kernel.org/patch/9359271

It would be very helpful if you could test these patches.

Thanks.
Comment 13 Doug Smythies 2016-10-06 16:02:10 UTC
On 2016.10.05 23:35 Joonsoo Kim wrote:
> I think that what you find is correct one. It has no cover-letter so
> there is no [PATCH 0/2]. Anyway, to clarify, I add links to these
> patches.
>
> https://patchwork.kernel.org/patch/9361853
> https://patchwork.kernel.org/patch/9359271
>
> It would be very helpful if you test these patches.

Yes, as best as I am able to test, the two-patch set
solves both this SLAB bug report and the other SLUB one.
Comment 14 Doug Smythies 2016-10-07 15:55:30 UTC
On 2016.10.06 09:02 Doug Smythies wrote:
> Yes, as best as I am able to test, the 2 patch set
> solves both this SLAB and the other SLUB bug reports.

I tested the patch from the other thread on top of these two,
and things continued to work fine. The additional patch
does seem a little faster under some of my hammering conditions.

Reference:
https://marc.info/?l=linux-kernel&m=147573486705407&w=2
Comment 15 Patrick Schaaf 2016-11-02 11:06:09 UTC
I've seen these issues (2000 kworker threads) on memcg-enabled 4.8.5 and 4.8.6 kernels in two situations:

1) Directly after boot, probably triggered by either systemd itself or libvirt KVM machine startups. There are 2000 kworkers and load up to 30; the load stabilizes almost immediately, and the 2000 kworkers are gone after some minutes without reappearing.
2) When using systemd-nspawn to fire up a small container with systemd inside. Here the issue happens when I log in to the container, with the same symptoms. Another login, after waiting for the workers to die, once more creates 2000 of them for a while and pushes up the load.

After applying the three patches linked to in Doug Smythies' previous comment, against mainline 4.8.6 (SLAB), I can no longer reproduce the issue.

I'm running such a patched 4.8.6 on one of my production boxes now (for all of two hours...), with pretty intense mysql and apache workloads (inside KVM machines) on it, and so far everything seems quite fine.

So, I'd plead for mainline + stable inclusion, and provide:

Tested-By: Patrick Schaaf <kernelorg@bof.de>