Bug 120481

Summary: Performance drop 30-40% for SPECjbb2005 and SPECjvm2008 benchmarks
Product: Process Management Reporter: Jirka Hladky (hladky.jiri)
Component: SchedulerAssignee: Ingo Molnar (mingo)
Status: RESOLVED CODE_FIX    
Severity: normal CC: a.p.zijlstra, hladky.jiri, jeder, kkolakow, mingo, regressions
Priority: P1    
Hardware: All   
OS: Linux   
Kernel Version: 4.7.0-0.rc0 Subsystem:
Regression: No Bisected commit-id:
Attachments: SPECjbb2005 results
SPECjvm2008 results for crypto operation
SPECjvm2008 results for xml operation
Script to quickly reproduce the results
List of rpm files installed on the box

Description Jirka Hladky 2016-06-16 16:19:27 UTC
Hello,

we see performance drop 30-40% for SPECjbb2005 and SPECjvm2008 benchmarks starting from 4.7.0-0.rc0 kernel compared to 4.6 kernel. 

We have tested kernels 4.7.0-0.rc1 and 4.7.0-0.rc3 and these are as well affected. 

We have observed the drop on variety of different x86_64 servers with different configuration (different CPU models, RAM sizes, both with Hyper Threading ON and OFF, different NUMA configurations (2 and 4 NUMA nodes) 

Linpack and Stream benchmarks do not show any performance drop. 

The performance drop increases with higher number of threads. The maximum number of threads in each benchmark is the same as number of CPUs. 

I will upload the graphs and tables comparing results on 4.6 kernel (baseline) and 4.7 rc3 kernel (target). On x-axis there is a number of threads, on y-axis there are number of operations per time unit.  

Do you have any hints which commit should we try to reverse? 

Thanks a lot!
Jirka
Comment 1 Jirka Hladky 2016-06-16 16:28:53 UTC
Created attachment 220321 [details]
SPECjbb2005 results

Single SPECjbb2005 benchmark executed with different number of threads 
Performance drop upto 37%  between 4.7rc3 and 4.6 kernels
Comment 2 Jirka Hladky 2016-06-16 16:30:20 UTC
Created attachment 220331 [details]
SPECjvm2008 results for crypto operation

Single SPECjvm2008 benchmark executed with different number of threads 
Performance drop upto 40%  between 4.7rc3 and 4.6 kernels
Comment 3 Jirka Hladky 2016-06-16 16:31:59 UTC
Created attachment 220341 [details]
SPECjvm2008 results for xml operation

SPECjvm2008 results for crypto operation

Single SPECjvm2008 benchmark executed with different number of threads 
Performance drop upto 31%  between 4.7rc3 and 4.6 kernels
Comment 4 The Linux kernel's regression tracker (Thorsten Leemhuis) 2016-06-19 13:22:29 UTC
(In reply to Jirka Hladky from comment #0)
> Do you have any hints […]
FWIW, there are at least three other performance regressions under investigation that might be related:
http://thread.gmane.org/gmane.linux.kernel/2233056/
http://thread.gmane.org/gmane.linux.kernel/2235794/
http://thread.gmane.org/gmane.linux.kernel/2231025/

Sincerely, your regression tracker for Linux 4.7 (http://bit.ly/28JRmJo )
Comment 5 Jirka Hladky 2016-06-21 13:42:38 UTC
Adding e-mail communication:
=======Message from Peter Zijlstra on Fri, Jun 17, 2016 at 1:04 AM =======

There were only 66 commits or so, and I think we can rule out the
hotplug changes, which should reduce it even further.

You could see what the parent of this one does:

  2159197d6677 sched/core: Enable increased load resolution on 64-bit kernels

If not that, maybe the parent of:

  c58d25f371f5 sched/fair: Move record_wakee()

After that I suppose you'll have to go bisect.
=========================================================================

My update:

Hi Peter,

I have an update for this performance issue. I have tested several
kernels, I'm not at the parent of

  2159197d6677 sched/core: Enable increased load resolution on 64-bit kernels

and I still see the performance regression for multi-threaded workloads.

There are only 27 commits remaining between v4.6 (last known to be OK)
and current HEAD (6ecdd74962f246dfe8750b7bea481a1c0816315d)
6ecdd74962f246dfe8750b7bea481a1c0816315d    sched/fair: Generalize the
load/util averages resolution definitionq hook unless util changed

See below [0].

Any hint which commit should I try now?

Thanks a lot!
Jirka

[0]
$ git log --pretty=oneline v4.6..HEAD kernel/sched 
6ecdd74962f246dfe8750b7bea481a1c0816315d sched/fair: Generalize the load/util averages resolution definition
2159197d66770ec01f75c93fb11dc66df81fd45b sched/core: Enable increased load resolution on 64-bit kernels
e7904a28f5331c21d17af638cb477c83662e3cb6 locking/lockdep, sched/core: Implement a better lock pinning scheme
eb58075149b7f0300ff19142e6245fe75db2a081 sched/core: Introduce 'struct rq_flags'
3e71a462dd483ce508a723356b293731e7d788ea sched/core: Move task_rq_lock() out of line
64b7aad5798478ffff52e110878ccaae4c3aaa34 Merge branch 'sched/urgent' into sched/core, to pick up fixes before applying new changes
f98db6013c557c216da5038d9c52045be55cd039 sched/core: Add switch_mm_irqs_off() and use it in the scheduler
594dd290cf5403a9a5818619dfff42d8e8e0518e sched/cpufreq: Optimize cpufreq update kicker to avoid update multiple times
fec148c000d0f9ac21679601722811eb60b4cc52 sched/deadline: Fix a bug in dl_overflow()
9fd81dd5ce0b12341c9f83346f8d32ac68bd3841 sched/fair: Optimize !CONFIG_NO_HZ_COMMON CPU load updates
1f41906a6fda1114debd3898668bd7ab6470ee41 sched/fair: Correctly handle nohz ticks CPU load accounting
cee1afce3053e7aa0793fbd5f2e845fa2cef9e33 sched/fair: Gather CPU load functions under a more conventional namespace
a2c6c91f98247fef0fe75216d607812485aeb0df sched/fair: Call cpufreq hook in additional paths
41e0d37f7ac81297c07ba311e4ad39465b8c8295 sched/fair: Do not call cpufreq hook unless util changed
21e96f88776deead303ecd30a17d1d7c2a1776e3 sched/fair: Move cpufreq hook to update_cfs_rq_load_avg()
1f621e028baf391f6684003e32e009bc934b750f sched/fair: Fix asym packing to select correct CPU
bd92883051a0228cc34996b8e766111ba10c9aac sched/cpuacct: Check for NULL when using task_pt_regs()
2c923e94cd9c6acff3b22f0ae29cfe65e2658b40 sched/clock: Make local_clock()/cpu_clock() inline
c78b17e28cc2c2df74264afc408bdc6aaf3fbcc8 sched/clock: Remove pointless test in cpu_clock/local_clock
fb90a6e93c0684ab2629a42462400603aa829b9c sched/debug: Don't dump sched debug info in SysRq-W
2b8c41daba327c633228169e8bd8ec067ab443f8 sched/fair: Initiate a new task's util avg to a bounded value
1c3de5e19fc96206dd086e634129d08e5f7b1000 sched/fair: Update comments after a variable rename
47252cfbac03644ee4a3adfa50c77896aa94f2bb sched/core: Add preempt checks in preempt_schedule() code
bfdb198ccd99472c5bded689699eb30dd06316bb sched/numa: Remove unnecessary NUMA dequeue update from non-SMP kernels
d02c071183e1c01a76811c878c8a52322201f81f sched/fair: Reset nr_balance_failed after active balancing
d740037fac7052e49450f6fa1454f1144a103b55 sched/cpuacct: Split usage accounting into user_usage and sys_usage
5ca3726af7f66a8cc71ce4414cfeb86deb784491 sched/cpuacct: Show all possible CPUs in cpuacct output
Comment 6 Jirka Hladky 2016-06-21 13:58:04 UTC
(In reply to Thorsten Leemhuis from comment #4)
> (In reply to Jirka Hladky from comment #0)
> > Do you have any hints […]
> FWIW, there are at least three other performance regressions under
> investigation that might be related:
> http://thread.gmane.org/gmane.linux.kernel/2233056/
> http://thread.gmane.org/gmane.linux.kernel/2235794/
> http://thread.gmane.org/gmane.linux.kernel/2231025/
> 
> Sincerely, your regression tracker for Linux 4.7 (http://bit.ly/28JRmJo )

Hi Thorsten,

thanks a lot for the hints! 

I don't think that any of issues you have mentioned above is related to this regression. The last commit which I have tested is

6ecdd74962f246dfe8750b7bea481a1c0816315d    sched/fair: Generalize the
load/util averages resolution definitionq hook unless util changed

and there is still regression. 

I have checked commits between v4.6..6ecdd74962f246dfe8750b7bea481a1c0816315d

git log --pretty=oneline v4.6..6ecdd74962f246dfe8750b7bea481a1c0816315d

and it does not include the commits which are responsible for performance degradation in the 3 issues you have named above. 

Jirka


PS: latest performance scores:

./4.6.0-1.el7.x86_64/4.6.0-1.el7.x86_64.log:Score on xml.validation: 3467.18 ops/m
./4.6.0v4.6-10530-g28165ec+/2016-Jun-20_16h18m50s.log:Score on xml.validation: 1994.87 ops/m
./4.6.0v4.6-rc5-34-g2159197+/2016-Jun-21_01h44m27s.log:Score on xml.validation: 2074.91 ops/m
./4.6.0v4.6-rc5-35-g6ecdd74+/2016-Jun-21_03h05m08s.log:Score on xml.validation: 2121.73 ops/m
Comment 7 Jirka Hladky 2016-06-22 00:10:24 UTC
Worklog:

last known good commit:21e96f88776deead303ecd30a17d1d7c2a1776e3
first known bad commit:6ecdd74962f246dfe8750b7bea481a1c0816315d

Kernel panic with commit: 2159197
Comment 8 Jirka Hladky 2016-06-22 07:54:14 UTC
Worklog:

last known good commit: e7904a28f5331c21d17af638cb477c83662e3cb6
first known bad commit: 6ecdd74962f246dfe8750b7bea481a1c0816315d

Last two commits to be checked:

 git log --pretty=oneline e7904a28f5331c21d17af638cb477c83662e3cb6..6ecdd74962f246dfe8750b7bea481a1c0816315d                                                                                                                                
6ecdd74962f246dfe8750b7bea481a1c0816315d sched/fair: Generalize the load/util averages resolution definition
2159197d66770ec01f75c93fb11dc66df81fd45b sched/core: Enable increased load resolution on 64-bit kernels

I will attach a reproducer and the list of installed rpm files. I'm using Java 1.7

/usr/lib/jvm/java-1.7.0

Jirka
Comment 9 Jirka Hladky 2016-06-22 08:09:13 UTC
Created attachment 220921 [details]
Script to quickly reproduce the results

1) Obtain SPECjvm2008 here:
https://www.spec.org/download.html

2) Install Java 1.7 
3) Install Window Manager  - Benchmark is creating graphs and it needs WM for this step

I will upload list of rpm files I have installed on test machine in the next attachment.

4) Copy the script from this attachment to directory where you have installed the benchmark. There should be script run-specjvm.sh

5) Review the script reproduce.sh and check JAVA_HOME variable and number of threads. Number of threads should be the same as number of CPUs you have in the box.

6) Run reproduce.sh

It will run only xml.transform xml.validation benchmarks to speed up the bug reproduction. 

7) To check all results you have you can run

find ./ -name "*log" | xargs grep -H Score | grep xml.validation | grep "[0-9]\{4\}[.][0-9]\{2\} ops/m"

To give you some numbers to expect, my bad result is ~ 2000 ops/m for xml.validation test and the good result is ~3500 ops/m.

./run-specjvm.sh --benchmarkThreads 32 --iterations 1 --iterationTime 180 --warmuptime 90 xml.transform xml.validation 2>&1 | tee > ${VER}/${DATE}.log
Comment 10 Jirka Hladky 2016-06-22 08:10:01 UTC
Created attachment 220931 [details]
List of rpm files installed on the box
Comment 11 Jirka Hladky 2016-06-22 08:34:55 UTC
Hi Peter,

I use this Java package.

java-1.7.0-openjdk

I think you don't thing you need to install Windows Manager. In fact,
I'm using just plain text log file (${VER}/${DATE}.log) produced with
this command

./run-specjvm.sh --benchmarkThreads 32 --iterations 1 --iterationTime
180 --warmuptime 90 xml.transform xml.validation 2>&1 | tee >
${VER}/${DATE}.log

Please let me finish bisecting. There are last two commits to be checked:

$ git log --pretty=oneline
e7904a28f5331c21d17af638cb477c83662e3cb6..6ecdd74962f246dfe8750b7bea481a1c0816315d
6ecdd74962f246dfe8750b7bea481a1c0816315d sched/fair: Generalize the
load/util averages resolution definition
2159197d66770ec01f75c93fb11dc66df81fd45b sched/core: Enable increased
load resolution on 64-bit kernels

I should be able to identify the commit which has caused this
regression in few hours.

Jirka

On Wed, Jun 22, 2016 at 10:17 AM, Peter Zijlstra <peterz@infradead.org> wrote:
> On Wed, Jun 22, 2016 at 08:09:13AM +0000, bugzilla-daemon@bugzilla.kernel.org
> wrote:
>> https://bugzilla.kernel.org/show_bug.cgi?id=120481
>>
>> --- Comment #9 from Jirka Hladky <hladky.jiri@gmail.com> ---
>> Created attachment 220921 [details]
>>   --> https://bugzilla.kernel.org/attachment.cgi?id=220921&action=edit
>> Script to quickly reproduce the results
>>
>> 1) Obtain SPECjvm2008 here:
>> https://www.spec.org/download.html
>>
>> 2) Install Java 1.7
>
> I don't have this, its not in Debian repositories. All I have is various
> versions of openjdk or gcj.
>
>> 3) Install Window Manager  - Benchmark is creating graphs and it needs WM
>> for
>> this step
>
> This machine doesn't even have a monitor or keyboard attached, wtf would
> I install desktop stuff on it, that's fail.
>
>> I will upload list of rpm files I have installed on test machine in the next
>> attachment.
>
> Useless, I'm on Debian.
>
>> 4) Copy the script from this attachment to directory where you have
>> installed
>> the benchmark. There should be script run-specjvm.sh
>
> Please just use email; bugzilla is crap.
Comment 12 Jirka Hladky 2016-06-22 10:02:55 UTC
The performance regression has been caused by this commit

=================================================
commit 6ecdd74962f246dfe8750b7bea481a1c0816315d
Author: Yuyang Du <yuyang.du@intel.com>
Date:   Tue Apr 5 12:12:26 2016 +0800

    sched/fair: Generalize the load/util averages resolution definition
=================================================
Comment 13 Jirka Hladky 2016-06-22 14:43:41 UTC
Please ignore the previous comment. 

I have reviewed my results once again:

This commit is fine:
64b7aad - Ingo Molnar, 7 weeks ago : Merge branch 'sched/urgent' into
sched/core, to pick up fixes before applying new changes

This version has already a problem:
2159197 - Peter Zijlstra, 8 weeks ago : sched/core: Enable increased
load resolution on 64-bit kernels

git bisect start
git bisect good 64b7aad
git bisect bad 2159197
Bisecting: 1 revision left to test after this (roughly 1 step)
[eb58075149b7f0300ff19142e6245fe75db2a081] sched/core: Introduce
'struct rq_flags'

The kernel I got after bisecting does not boot - I'm getting kernel panic. 

In any case, the regression was introduced between
git bisect good 64b7aad
git bisect bad 2159197

and there is just one commit in-between. 

Jirka
Comment 14 Jirka Hladky 2016-06-23 14:46:07 UTC
Latest update from Peter Zijlstra
peterz@infradead.org

=======================================================
Yes, that is indeed the culprit.

The below 'revert' makes it go fast again. I'll try and figure out
what's wrong tomorrow.

---

diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index bf6fea9..e7e312b 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -55,7 +55,7 @@ static inline void cpu_load_update_active(struct rq *this_rq) { }
  * Really only required when CONFIG_FAIR_GROUP_SCHED is also set, but to
  * increase coverage and consistency always enable it on 64bit platforms.
  */
-#ifdef CONFIG_64BIT
+#if 0 // def CONFIG_64BIT
 # define NICE_0_LOAD_SHIFT     (SCHED_FIXEDPOINT_SHIFT + SCHED_FIXEDPOINT_SHIFT)
 # define scale_load(w)         ((w) << SCHED_FIXEDPOINT_SHIFT)
 # define scale_load_down(w)    ((w) >> SCHED_FIXEDPOINT_SHIFT)

=======================================================
Comment 15 Jirka Hladky 2016-06-24 10:06:47 UTC
Latest update:

==============Peter Zijlstra======================
OK, so I'm fairly certain you _do_ run in a cgroup, because its made
almost impossible not to these days.

Run:

        grep java /proc/sched_debug

while the thing is running. That'll show you the actual cgroup the stuff
is running in.

This modern Linux stuff stinks loads. And even Debian seems infected to
the point of almost being useless :-(

The _only_ reason I could reproduce was because I recently did an
upgrade of Debian Testing and I hadn't noticed just how messed up things
had become.

When I run it in the root cgroup (I had to kill cgmanager and reboot)
the numbers are just fine.

In any case, now I gotta go look at the cgroup code..

That'll end up looking something like:

root@ivb-ep:/usr/src/linux-2.6# grep java /proc/sched_debug
            java  2714     18270.634925        89   120         0.000000         1.490023         0.000000 0 0 /user.slice/user-0.slice/session-2.scope
            java  2666     18643.629673         2   120         0.000000         0.063129         0.000000 0 0 /user.slice/user-0.slice/session-2.scope
            java  2676     18655.652878         3   120         0.000000         0.077127         0.000000 0 0 /user.slice/user-0.slice/session-2.scope
            java  2680     18655.683384         3   120         0.000000         0.082993         0.000000 0 0 /user.slice/user-0.slice/session-2.scope

which shows a 3 deep hierarchy. 

========== Jirka Hladky ========================
I have tried to disable autogroups

sysctl -w kernel.sched_autogroup_enabled=0

and I can confirm that performance is then back at level as in 4.6 kernel. 

CONFIG_SCHED_AUTOGROUP=y

is used both in RHEL6 and RHEL7.  We compile the upstream kernels with
config derived from RHEL7 config file.
===================================================
Comment 16 Jirka Hladky 2016-07-22 10:01:43 UTC
Adding e-mail communication for documentation purposes. The entire e-mail thread is here:
https://lkml.org/lkml/2016/6/16/621

==================================================================

On Wed, Jun 22, 2016 at 04:41:06PM +0200, Jirka Hladky wrote:
> This commit is bad:
> 2159197 - Peter Zijlstra, 8 weeks ago : sched/core: Enable increased
> load resolution on 64-bit kernels
>
> Could you please have a look?

Yes, that is indeed the culprit.

The below 'revert' makes it go fast again. I'll try and figure out
what's wrong tomorrow.

---

diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index bf6fea9..e7e312b 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -55,7 +55,7 @@ static inline void cpu_load_update_active(struct rq *this_rq) { }
  * Really only required when CONFIG_FAIR_GROUP_SCHED is also set, but to
  * increase coverage and consistency always enable it on 64bit platforms.
  */
-#ifdef CONFIG_64BIT
+#if 0 // def CONFIG_64BIT
 # define NICE_0_LOAD_SHIFT     (SCHED_FIXEDPOINT_SHIFT + SCHED_FIXEDPOINT_SHIFT)
 # define scale_load(w)         ((w) << SCHED_FIXEDPOINT_SHIFT)
 # define scale_load_down(w)    ((w) >> SCHED_FIXEDPOINT_SHIFT)




So unless the heat has made me do really silly things, the below seems
to cure things. Could you please verify?


---
 kernel/sched/fair.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 22d64b3f5876..d4f6fb2f3057 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -2484,7 +2484,7 @@ static inline long calc_tg_weight(struct task_group *tg, struct cfs_rq *cfs_rq)
         */
        tg_weight = atomic_long_read(&tg->load_avg);
        tg_weight -= cfs_rq->tg_load_avg_contrib;
-       tg_weight += cfs_rq->load.weight;
+       tg_weight += cfs_rq->avg.load_avg;

        return tg_weight;
 }
@@ -2494,7 +2494,7 @@ static long calc_cfs_shares(struct cfs_rq *cfs_rq, struct task_group *tg)
        long tg_weight, load, shares;

        tg_weight = calc_tg_weight(tg, cfs_rq);
-       load = cfs_rq->load.weight;
+       load = cfs_rq->avg.load_avg;

        shares = (tg->shares * load);
        if (tg_weight)

Ah!, I hadn't yet done a git-blame on this. Right you are, we should
have put a comment there.

So the problem here is that since commit:

  2159197d6677 ("sched/core: Enable increased load resolution on 64-bit kernels")

load.weight and avg.load_avg are in different metrics. Which completely
wrecked things.

The obvious alternative is using:

        scale_load_down(cfs_rq->load.weight);

Let me go run that through the benchmark.

> I have one question regarding the use of cfs_rq->avg.load_avg
> cfs_rq->tg_load_avg_contrib is the sampling of cfs_rq->avg.load_avg so
> I'm curious to understand why you use cfs_rq->avg.load_avg instead of
> keeping cfs_rq->tg_load_avg_contrib. Do you think that the sampling is
> not accurate enough to prevent any significant difference between both
> when we use tg->load_avg ?

I'm not entirely sure I understand your question; is it to the existence
of calc_tg_weight()? That is, why use calc_tg_weight() and not use
tg->load_avg as is?

It seemed like a simple and cheap way to increase accuracy, nothing more
behind it until the commit you referred to.
Comment 17 Jirka Hladky 2016-07-22 10:03:04 UTC
The issue is fixed. I have verified it on kernel v4.7-rc7.