Bug 121121 - Kernel v4.7-rc5 - performance degradation up to 40% after disabling and re-enabling a core
Summary: Kernel v4.7-rc5 - performance degradation up to 40% after disabling and re-enabling a core
Status: NEW
Alias: None
Product: Process Management
Classification: Unclassified
Component: Scheduler
Hardware: All
OS: Linux
Importance: P1 normal
Assignee: Ingo Molnar
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2016-06-28 23:10 UTC by Jirka Hladky
Modified: 2016-09-12 11:41 UTC
CC List: 6 users

See Also:
Kernel Version: v4.7-rc5
Subsystem:
Regression: No
Bisected commit-id:


Attachments
Script to reproduce the problem (731 bytes, application/x-shellscript)
2016-06-28 23:12 UTC, Jirka Hladky

Description Jirka Hladky 2016-06-28 23:10:02 UTC
Hello,

on a NUMA-enabled server equipped with 4 Intel E5-4610 v2 CPUs we observe the following performance degradation:

Runtime of the "lu.C.x" test from the NAS Parallel Benchmarks right after booting the kernel:

real  1m57.834s
user  113m51.520s

Then we disable and re-enable one core:

echo 0 > /sys/devices/system/cpu/cpu1/online
echo 1 > /sys/devices/system/cpu/cpu1/online

and rerun the same test. The runtime is now degraded (by 40% in user time and by 30% in real (wall-clock) time) using all 64 cores:

real 2m47.746s
user 160m46.109s

The issue was first reported in the paper "The Linux Scheduler: a Decade of Wasted Cores":
http://www.ece.ubc.ca/~sasha/papers/eurosys16-final29.pdf
https://github.com/jplozi/wastedcores/issues/1

How to reproduce the issue:

A) Get the benchmark and compile it:

1) wget http://www.nas.nasa.gov/assets/npb/NPB3.3.1.tar.gz
2) tar zxvf NPB3.3.1.tar.gz
3) cd ~/NPB3.3.1/NPB3.3-OMP/config/
4) ln -sf NAS.samples/make.def.gcc_x86 make.def (assuming the gcc compiler)
5) ln -sf NAS.samples/suite.def.lu suite.def
6) cd ~/NPB3.3.1/NPB3.3-OMP
7) make suite
8) The directory ~/NPB3.3.1/NPB3.3-OMP/bin should now contain the lu.* benchmarks. Sorted alphabetically, the binaries are ordered by increasing runtime, with "lu.A.x" having the shortest runtime (a quick sanity check is sketched below).
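
A quick sanity check that the build succeeded (just a sketch; the thread count of 64 matches our machine and OMP_NUM_THREADS may need adjusting elsewhere):

cd ~/NPB3.3.1/NPB3.3-OMP
export OMP_NUM_THREADS=64   # number of logical CPUs on our test machine
time bin/lu.A.x             # smallest class, should finish in well under a minute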

B) Reproducing the issue (see also the attached script)

Remark: we ran the tests with autogroup disabled
sysctl -w kernel.sched_autogroup_enabled=0
to avoid the following issue on the 4.7 kernel:
https://bugzilla.kernel.org/show_bug.cgi?id=120481

The test was conducted on a NUMA server with 4 nodes, using all 64 cores.

1) (time bin/lu.C.x) |& tee $(uname -r)_lu.C.x.log_before_reenable_kernel.sched_autogroup_enabled=0

2) Disable and re-enable one core:
echo 0 > /sys/devices/system/cpu/cpu1/online
echo 1 > /sys/devices/system/cpu/cpu1/online

3) (time bin/lu.C.x) |& tee $(uname -r)_lu.C.x.log_after_reenable_kernel.sched_autogroup_enabled=0

grep "real\|user" *lu.C*

You will see a significant difference in both real and user time.

According to the authors of the paper, the root cause of the problem is a missing call to regenerate the scheduling domains inside NUMA nodes after re-enabling a CPU. The problem was introduced in the 3.19 kernel. Kernel 4.7 performance has improved significantly over the 4.6 kernel, but the bug still exists in 4.7. The authors of the paper have proposed a patch which applies to the 4.1 kernel. Here is the link:
https://github.com/jplozi/wastedcores/blob/master/patches/missing_sched_domains_linux_4.1.patch
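
One way to see the effect directly is to inspect the scheduler domain hierarchy before and after the offline/online cycle (a sketch; it assumes CONFIG_SCHED_DEBUG is enabled so that /proc/sys/kernel/sched_domain is populated):

# domain levels for cpu0 right after boot
for d in /proc/sys/kernel/sched_domain/cpu0/domain*/name; do cat "$d"; done

echo 0 > /sys/devices/system/cpu/cpu1/online
echo 1 > /sys/devices/system/cpu/cpu1/online

# on an affected kernel the NUMA level should no longer show up here
for d in /proc/sys/kernel/sched_domain/cpu0/domain*/name; do cat "$d"; done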

=========== For completeness, here are the results with the 4.6 kernel ===========

AFTER BOOT
real    1m31.639s
user    89m24.657s

AFTER core has been disabled and re-enabled
real    2m44.566s
user    157m59.814s

Please note that with the 4.6 kernel the problem is much more visible than with the 4.7-rc5 kernel.

At the same time, the 4.6 kernel delivers better performance after boot than the 4.7-rc5 kernel, which might indicate that another problem is in play.
=================================================================

I have also tested a kernel provided by Peter Zijlstra on Friday, June 24th, which provides a fix for https://bugzilla.kernel.org/show_bug.cgi?id=120481. It does not fix this issue, and right after boot that kernel performs worse than the 4.6 kernel right after boot, so we may in fact be facing two problems here.

========Results with 4.7.0-02548776ded1185e6e16ad0a475481e982741ee9 kernel=====
git://git.kernel.org/pub/scm/linux/kernel/git/peterz/queue.git sched/urgent
$ git rev-parse HEAD
02548776ded1185e6e16ad0a475481e982741ee9

 AFTER BOOT
real        1m58.549s
user        113m31.448s

AFTER core has been disabled and re-enabled
real 2m35.930s
user 148m20.795s
=================================================================
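
For anyone who wants to test the same tree, one way to check out that commit (a sketch; it assumes an existing Linux git tree, and the remote name "peterz" is only illustrative):

git remote add peterz git://git.kernel.org/pub/scm/linux/kernel/git/peterz/queue.git
git fetch peterz sched/urgent
git checkout 02548776ded1185e6e16ad0a475481e982741ee9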

Thanks a lot
Jirka
Comment 1 Jirka Hladky 2016-06-28 23:12:07 UTC
Created attachment 221421 [details]
Script to reproduce the problem

It assumes that the working directory is ~/NPB3.3.1/NPB3.3-OMP and that the "lu.C.x" binary is under the bin directory.
Comment 2 Jirka Hladky 2016-07-28 21:42:48 UTC
It turns out that the lu.C.x results show quite a big variation; the test has to be repeated several times and the mean value of the real time has to be used to get reliable results.
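
A rough sketch of such a repeated run (it assumes GNU time and that bin/lu.C.x is in the working directory; the 10 iterations match the run count used below):

rm -f real_times.txt
for i in $(seq 1 10); do
    /usr/bin/time -f "%e" -o real_times.txt -a bin/lu.C.x > /dev/null
done
awk '{ sum += $1 } END { printf "mean real time: %.1f s\n", sum/NR }' real_times.txt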

There is NO regression on the following CPUs:

4x Xeon(R) CPU E5-4610 v2 @ 2.30GHz
4x Xeon(R) CPU E5-2690 v3 @ 2.60GHz

but there is a regression (a slowdown by a factor of 6) on

AMD Opteron(TM) Processor 6272


Kernel 4.7.0-0.rc7.git0.1.el7.x86_64

Real time to run the ./lu.C.x benchmark (mean value over 10 runs):

Right after boot: 273 seconds
After disabling and enabling a core: 1702 seconds!
Comment 3 Jirka Hladky 2016-08-29 12:42:28 UTC
WORKLOG

==========================================================================
Jiri Olsa is working on a patch. This is the latest update from Jiri Olsa:
==========================================================================


Jirka, Peter and Jean-Pierre reported a performance drop on
some CPUs after taking a CPU offline and online again.

The reason is the kernel logic that falls back to an SMT
level topology if more than one node is detected within a
CPU package. During system boot this logic cuts out the
DIE topology level and the NUMA code adds a NUMA level
on top of this.

After boot, if you take the CPU offline and online again,
this logic resets the SMT level topology, removing the
whole NUMA level.

This patch ensures the SMT topology fallback happens only
once, during boot, so the NUMA topology level is kept once
it's built.

This problem is one of the issues reported in the wastedcores
article [1]. There's a similar patch for this issue attached
to the article [2].

[1] https://github.com/jplozi/wastedcores
[2] https://github.com/jplozi/wastedcores/blob/master/patches/missing_sched_domains_linux_4.1.patch

Cc: Jean-Pierre Lozi <jplozi@unice.fr>
Cc: Jirka Hladky <jhladky@redhat.com>
Cc: Petr Surý <psury@redhat.com>
Signed-off-by: Jiri Olsa <jolsa@kernel.org>
---
 arch/x86/kernel/smpboot.c | 12 +++++++++++-
 1 file changed, 11 insertions(+), 1 deletion(-)

diff --git a/arch/x86/kernel/smpboot.c b/arch/x86/kernel/smpboot.c
index 2a6e84a30a54..f2a769b2b3fe 100644
--- a/arch/x86/kernel/smpboot.c
+++ b/arch/x86/kernel/smpboot.c
@@ -487,7 +487,17 @@ static struct sched_domain_topology_level numa_inside_package_topology[] = {
  */
 static void primarily_use_numa_for_topology(void)
 {
-       set_sched_topology(numa_inside_package_topology);
+       static bool once;
+
+       /*
+        * We need to run this only during boot; once we get
+        * here because a CPU comes online again, the NUMA
+        * topology setup has already been done.
+        */
+       if (!once) {
+               set_sched_topology(numa_inside_package_topology);
+               once = true;
+       }
 }

 void set_cpu_sibling_map(int cpu)
Comment 4 Jirka Hladky 2016-09-12 11:41:55 UTC
WORKLOG

Fixed by the following upstream post:
  http://marc.info/?l=linux-kernel&m=147337376415456&w=2
