Bug 120481
Summary: Performance drop 30-40% for SPECjbb2005 and SPECjvm2008 benchmarks

| Product: | Process Management | Reporter: | Jirka Hladky (hladky.jiri) |
|---|---|---|---|
| Component: | Scheduler | Assignee: | Ingo Molnar (mingo) |
| Status: | RESOLVED CODE_FIX | | |
| Severity: | normal | CC: | a.p.zijlstra, hladky.jiri, jeder, kkolakow, mingo, regressions |
| Priority: | P1 | | |
| Hardware: | All | | |
| OS: | Linux | | |
| Kernel Version: | 4.7.0-0.rc0 | Subsystem: | |
| Regression: | No | Bisected commit-id: | |

Attachments:
- SPECjbb2005 results
- SPECjvm2008 results for crypto operation
- SPECjvm2008 results for xml operation
- Script to quickly reproduce the results
- List of rpm files installed on the box
Description
Jirka Hladky
2016-06-16 16:19:27 UTC
Created attachment 220321 [details]
SPECjbb2005 results
Single SPECjbb2005 benchmark executed with different number of threads
Performance drop of up to 37% between the 4.7rc3 and 4.6 kernels
Created attachment 220331 [details]
SPECjvm2008 results for crypto operation
Single SPECjvm2008 benchmark executed with different number of threads
Performance drop of up to 40% between the 4.7rc3 and 4.6 kernels
Created attachment 220341 [details]
SPECjvm2008 results for xml operation
SPECjvm2008 results for crypto operation
Single SPECjvm2008 benchmark executed with different number of threads
Performance drop of up to 31% between the 4.7rc3 and 4.6 kernels
(In reply to Jirka Hladky from comment #0)
> Do you have any hints […]

FWIW, there are at least three other performance regressions under investigation that might be related:
http://thread.gmane.org/gmane.linux.kernel/2233056/
http://thread.gmane.org/gmane.linux.kernel/2235794/
http://thread.gmane.org/gmane.linux.kernel/2231025/

Sincerely, your regression tracker for Linux 4.7 (http://bit.ly/28JRmJo)

Adding e-mail communication:

======= Message from Peter Zijlstra on Fri, Jun 17, 2016 at 1:04 AM =======
There were only 66 commits or so, and I think we can rule out the hotplug
changes, which should reduce it even further.

You could see what the parent of this one does:

  2159197d6677 sched/core: Enable increased load resolution on 64-bit kernels

If not that, maybe the parent of:

  c58d25f371f5 sched/fair: Move record_wakee()

After that I suppose you'll have to go bisect.
=========================================================================

My update:

Hi Peter,

I have an update for this performance issue. I have tested several kernels.
I'm now at the parent of

  2159197d6677 sched/core: Enable increased load resolution on 64-bit kernels

and I still see the performance regression for multi-threaded workloads.

There are only 27 commits remaining between v4.6 (last known to be OK) and
the current HEAD:

  6ecdd74962f246dfe8750b7bea481a1c0816315d sched/fair: Generalize the load/util averages resolution definition

See below [0]. Any hint which commit should I try now?

Thanks a lot!
Jirka

[0] $ git log --pretty=oneline v4.6..HEAD kernel/sched
6ecdd74962f246dfe8750b7bea481a1c0816315d sched/fair: Generalize the load/util averages resolution definition
2159197d66770ec01f75c93fb11dc66df81fd45b sched/core: Enable increased load resolution on 64-bit kernels
e7904a28f5331c21d17af638cb477c83662e3cb6 locking/lockdep, sched/core: Implement a better lock pinning scheme
eb58075149b7f0300ff19142e6245fe75db2a081 sched/core: Introduce 'struct rq_flags'
3e71a462dd483ce508a723356b293731e7d788ea sched/core: Move task_rq_lock() out of line
64b7aad5798478ffff52e110878ccaae4c3aaa34 Merge branch 'sched/urgent' into sched/core, to pick up fixes before applying new changes
f98db6013c557c216da5038d9c52045be55cd039 sched/core: Add switch_mm_irqs_off() and use it in the scheduler
594dd290cf5403a9a5818619dfff42d8e8e0518e sched/cpufreq: Optimize cpufreq update kicker to avoid update multiple times
fec148c000d0f9ac21679601722811eb60b4cc52 sched/deadline: Fix a bug in dl_overflow()
9fd81dd5ce0b12341c9f83346f8d32ac68bd3841 sched/fair: Optimize !CONFIG_NO_HZ_COMMON CPU load updates
1f41906a6fda1114debd3898668bd7ab6470ee41 sched/fair: Correctly handle nohz ticks CPU load accounting
cee1afce3053e7aa0793fbd5f2e845fa2cef9e33 sched/fair: Gather CPU load functions under a more conventional namespace
a2c6c91f98247fef0fe75216d607812485aeb0df sched/fair: Call cpufreq hook in additional paths
41e0d37f7ac81297c07ba311e4ad39465b8c8295 sched/fair: Do not call cpufreq hook unless util changed
21e96f88776deead303ecd30a17d1d7c2a1776e3 sched/fair: Move cpufreq hook to update_cfs_rq_load_avg()
1f621e028baf391f6684003e32e009bc934b750f sched/fair: Fix asym packing to select correct CPU
bd92883051a0228cc34996b8e766111ba10c9aac sched/cpuacct: Check for NULL when using task_pt_regs()
2c923e94cd9c6acff3b22f0ae29cfe65e2658b40 sched/clock: Make local_clock()/cpu_clock() inline
c78b17e28cc2c2df74264afc408bdc6aaf3fbcc8 sched/clock: Remove pointless test in cpu_clock/local_clock
fb90a6e93c0684ab2629a42462400603aa829b9c sched/debug: Don't dump sched debug info in SysRq-W
2b8c41daba327c633228169e8bd8ec067ab443f8 sched/fair: Initiate a new task's util avg to a bounded value
1c3de5e19fc96206dd086e634129d08e5f7b1000 sched/fair: Update comments after a variable rename
47252cfbac03644ee4a3adfa50c77896aa94f2bb sched/core: Add preempt checks in preempt_schedule() code
bfdb198ccd99472c5bded689699eb30dd06316bb sched/numa: Remove unnecessary NUMA dequeue update from non-SMP kernels
d02c071183e1c01a76811c878c8a52322201f81f sched/fair: Reset nr_balance_failed after active balancing
d740037fac7052e49450f6fa1454f1144a103b55 sched/cpuacct: Split usage accounting into user_usage and sys_usage
5ca3726af7f66a8cc71ce4414cfeb86deb784491 sched/cpuacct: Show all possible CPUs in cpuacct output

(In reply to Thorsten Leemhuis from comment #4)
> (In reply to Jirka Hladky from comment #0)
> > Do you have any hints […]
> FWIW, there are at least three other performance regressions under
> investigation that might be related:
> http://thread.gmane.org/gmane.linux.kernel/2233056/
> http://thread.gmane.org/gmane.linux.kernel/2235794/
> http://thread.gmane.org/gmane.linux.kernel/2231025/
>
> Sincerely, your regression tracker for Linux 4.7 (http://bit.ly/28JRmJo )

Hi Thorsten,

thanks a lot for the hints! I don't think that any of the issues you have
mentioned above is related to this regression. The last commit which I have
tested is

  6ecdd74962f246dfe8750b7bea481a1c0816315d sched/fair: Generalize the load/util averages resolution definition

and there is still a regression. I have checked the commits between
v4.6..6ecdd74962f246dfe8750b7bea481a1c0816315d

git log --pretty=oneline v4.6..6ecdd74962f246dfe8750b7bea481a1c0816315d

and the range does not include the commits responsible for the performance
degradation in the 3 issues you have named above.
Jirka

PS: latest performance scores:
./4.6.0-1.el7.x86_64/4.6.0-1.el7.x86_64.log:Score on xml.validation: 3467.18 ops/m
./4.6.0v4.6-10530-g28165ec+/2016-Jun-20_16h18m50s.log:Score on xml.validation: 1994.87 ops/m
./4.6.0v4.6-rc5-34-g2159197+/2016-Jun-21_01h44m27s.log:Score on xml.validation: 2074.91 ops/m
./4.6.0v4.6-rc5-35-g6ecdd74+/2016-Jun-21_03h05m08s.log:Score on xml.validation: 2121.73 ops/m

Worklog:
last known good commit: 21e96f88776deead303ecd30a17d1d7c2a1776e3
first known bad commit: 6ecdd74962f246dfe8750b7bea481a1c0816315d
Kernel panic with commit: 2159197

Worklog:
last known good commit: e7904a28f5331c21d17af638cb477c83662e3cb6
first known bad commit: 6ecdd74962f246dfe8750b7bea481a1c0816315d

Last two commits to be checked:
git log --pretty=oneline e7904a28f5331c21d17af638cb477c83662e3cb6..6ecdd74962f246dfe8750b7bea481a1c0816315d
6ecdd74962f246dfe8750b7bea481a1c0816315d sched/fair: Generalize the load/util averages resolution definition
2159197d66770ec01f75c93fb11dc66df81fd45b sched/core: Enable increased load resolution on 64-bit kernels

I will attach a reproducer and the list of installed rpm files. I'm using
Java 1.7 (/usr/lib/jvm/java-1.7.0).

Jirka

Created attachment 220921 [details]
Script to quickly reproduce the results

1) Obtain SPECjvm2008 here: https://www.spec.org/download.html

2) Install Java 1.7

3) Install a Window Manager: the benchmark creates graphs and needs a WM for
this step. I will upload the list of rpm files installed on the test machine
in the next attachment.

4) Copy the script from this attachment to the directory where you installed
the benchmark. There should be a script run-specjvm.sh there.

5) Review the script reproduce.sh and check the JAVA_HOME variable and the
number of threads. The number of threads should match the number of CPUs in
the box.

6) Run reproduce.sh. It runs only the xml.transform and xml.validation
benchmarks to speed up reproducing the bug.
7) To check all the results you have, you can run:

find ./ -name "*log" | xargs grep -H Score | grep xml.validation | grep "[0-9]\{4\}[.][0-9]\{2\} ops/m"

To give you some numbers to expect: my bad result is ~2000 ops/m for the
xml.validation test and the good result is ~3500 ops/m.

./run-specjvm.sh --benchmarkThreads 32 --iterations 1 --iterationTime 180 --warmuptime 90 xml.transform xml.validation 2>&1 | tee > ${VER}/${DATE}.log

Created attachment 220931 [details]
List of rpm files installed on the box
Hi Peter,

I use this Java package: java-1.7.0-openjdk

I don't think you need to install a Window Manager. In fact, I'm using just
the plain text log file (${VER}/${DATE}.log) produced with this command:

./run-specjvm.sh --benchmarkThreads 32 --iterations 1 --iterationTime 180 --warmuptime 90 xml.transform xml.validation 2>&1 | tee > ${VER}/${DATE}.log

Please let me finish bisecting. There are two last commits to be checked:

$ git log --pretty=oneline e7904a28f5331c21d17af638cb477c83662e3cb6..6ecdd74962f246dfe8750b7bea481a1c0816315d
6ecdd74962f246dfe8750b7bea481a1c0816315d sched/fair: Generalize the load/util averages resolution definition
2159197d66770ec01f75c93fb11dc66df81fd45b sched/core: Enable increased load resolution on 64-bit kernels

I should be able to identify the commit which has caused this regression in a
few hours.

Jirka

On Wed, Jun 22, 2016 at 10:17 AM, Peter Zijlstra <peterz@infradead.org> wrote:
> On Wed, Jun 22, 2016 at 08:09:13AM +0000, bugzilla-daemon@bugzilla.kernel.org wrote:
>> https://bugzilla.kernel.org/show_bug.cgi?id=120481
>>
>> --- Comment #9 from Jirka Hladky <hladky.jiri@gmail.com> ---
>> Created attachment 220921 [details]
>> --> https://bugzilla.kernel.org/attachment.cgi?id=220921&action=edit
>> Script to quickly reproduce the results
>>
>> 1) Obtain SPECjvm2008 here:
>> https://www.spec.org/download.html
>>
>> 2) Install Java 1.7
>
> I don't have this, it's not in Debian repositories. All I have is various
> versions of openjdk or gcj.
>
>> 3) Install Window Manager - Benchmark is creating graphs and it needs WM for
>> this step
>
> This machine doesn't even have a monitor or keyboard attached, wtf would
> I install desktop stuff on it, that's fail.
>
>> I will upload list of rpm files I have installed on test machine in the next
>> attachment.
>
> Useless, I'm on Debian.
>
>> 4) Copy the script from this attachment to directory where you have installed
>> the benchmark.
>> There should be script run-specjvm.sh
>
> Please just use email; bugzilla is crap.

The performance regression has been caused by this commit:

=================================================
commit 6ecdd74962f246dfe8750b7bea481a1c0816315d
Author: Yuyang Du <yuyang.du@intel.com>
Date:   Tue Apr 5 12:12:26 2016 +0800

    sched/fair: Generalize the load/util averages resolution definition
=================================================

Please ignore the previous comment. I have reviewed my results once again:

This commit is fine:
64b7aad - Ingo Molnar, 7 weeks ago : Merge branch 'sched/urgent' into sched/core, to pick up fixes before applying new changes

This version already has the problem:
2159197 - Peter Zijlstra, 8 weeks ago : sched/core: Enable increased load resolution on 64-bit kernels

git bisect start
git bisect good 64b7aad
git bisect bad 2159197
Bisecting: 1 revision left to test after this (roughly 1 step)
[eb58075149b7f0300ff19142e6245fe75db2a081] sched/core: Introduce 'struct rq_flags'

The kernel I got after bisecting does not boot - I'm getting a kernel panic.
In any case, the regression was introduced between

git bisect good 64b7aad
git bisect bad 2159197

and there is just one commit in-between.

Jirka

Latest update from Peter Zijlstra <peterz@infradead.org>
=======================================================
Yes, that is indeed the culprit. The below 'revert' makes it go fast again.

I'll try and figure out what's wrong tomorrow.

---
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index bf6fea9..e7e312b 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -55,7 +55,7 @@ static inline void cpu_load_update_active(struct rq *this_rq) { }
  * Really only required when CONFIG_FAIR_GROUP_SCHED is also set, but to
  * increase coverage and consistency always enable it on 64bit platforms.
  */
-#ifdef CONFIG_64BIT
+#if 0 // def CONFIG_64BIT
 # define NICE_0_LOAD_SHIFT  (SCHED_FIXEDPOINT_SHIFT + SCHED_FIXEDPOINT_SHIFT)
 # define scale_load(w)      ((w) << SCHED_FIXEDPOINT_SHIFT)
 # define scale_load_down(w) ((w) >> SCHED_FIXEDPOINT_SHIFT)
=======================================================

Latest update:

============== Peter Zijlstra ======================
OK, so I'm fairly certain you _do_ run in a cgroup, because it's made almost
impossible not to these days.

Run:

grep java /proc/sched_debug

while the thing is running. That'll show you the actual cgroup the stuff is
running in.

This modern Linux stuff stinks loads. And even Debian seems infected to the
point of almost being useless :-(

The _only_ reason I could reproduce was because I recently did an upgrade of
Debian Testing and I hadn't noticed just how messed up things had become.

When I run it in the root cgroup (I had to kill cgmanager and reboot) the
numbers are just fine.

In any case, now I gotta go look at the cgroup code..

That'll end up looking something like:

root@ivb-ep:/usr/src/linux-2.6# grep java /proc/sched_debug
java  2714  18270.634925  89  120  0.000000  1.490023  0.000000  0  0  /user.slice/user-0.slice/session-2.scope
java  2666  18643.629673   2  120  0.000000  0.063129  0.000000  0  0  /user.slice/user-0.slice/session-2.scope
java  2676  18655.652878   3  120  0.000000  0.077127  0.000000  0  0  /user.slice/user-0.slice/session-2.scope
java  2680  18655.683384   3  120  0.000000  0.082993  0.000000  0  0  /user.slice/user-0.slice/session-2.scope

which shows a 3-deep hierarchy.

========== Jirka Hladky ========================
I have tried to disable autogroups

sysctl -w kernel.sched_autogroup_enabled=0

and I can confirm that performance is then back at the same level as with the
4.6 kernel.

CONFIG_SCHED_AUTOGROUP=y is used both in RHEL6 and RHEL7. We compile the
upstream kernels with a config derived from the RHEL7 config file.
===================================================

Adding e-mail communication for documentation purposes.
The entire e-mail thread is here: https://lkml.org/lkml/2016/6/16/621
==================================================================
On Wed, Jun 22, 2016 at 04:41:06PM +0200, Jirka Hladky wrote:
> This commit is bad:
> 2159197 - Peter Zijlstra, 8 weeks ago : sched/core: Enable increased
> load resolution on 64-bit kernels
>
> Could you please have a look?

Yes, that is indeed the culprit. The below 'revert' makes it go fast again.

I'll try and figure out what's wrong tomorrow.

---
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index bf6fea9..e7e312b 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -55,7 +55,7 @@ static inline void cpu_load_update_active(struct rq *this_rq) { }
  * Really only required when CONFIG_FAIR_GROUP_SCHED is also set, but to
  * increase coverage and consistency always enable it on 64bit platforms.
  */
-#ifdef CONFIG_64BIT
+#if 0 // def CONFIG_64BIT
 # define NICE_0_LOAD_SHIFT  (SCHED_FIXEDPOINT_SHIFT + SCHED_FIXEDPOINT_SHIFT)
 # define scale_load(w)      ((w) << SCHED_FIXEDPOINT_SHIFT)
 # define scale_load_down(w) ((w) >> SCHED_FIXEDPOINT_SHIFT)

So unless the heat has made me do really silly things, the below seems to
cure things. Could you please verify?
---
 kernel/sched/fair.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 22d64b3f5876..d4f6fb2f3057 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -2484,7 +2484,7 @@ static inline long calc_tg_weight(struct task_group *tg, struct cfs_rq *cfs_rq)
 	 */
 	tg_weight = atomic_long_read(&tg->load_avg);
 	tg_weight -= cfs_rq->tg_load_avg_contrib;
-	tg_weight += cfs_rq->load.weight;
+	tg_weight += cfs_rq->avg.load_avg;
 
 	return tg_weight;
 }
@@ -2494,7 +2494,7 @@ static long calc_cfs_shares(struct cfs_rq *cfs_rq, struct task_group *tg)
 	long tg_weight, load, shares;
 
 	tg_weight = calc_tg_weight(tg, cfs_rq);
-	load = cfs_rq->load.weight;
+	load = cfs_rq->avg.load_avg;
 
 	shares = (tg->shares * load);
 	if (tg_weight)

Ah!, I hadn't yet done a git-blame on this. Right you are, we should have put
a comment there.

So the problem here is that since commit:

  2159197d6677 ("sched/core: Enable increased load resolution on 64-bit kernels")

load.weight and avg.load_avg are in different metrics. Which completely
wrecked things.

The obvious alternative is using:

  scale_load_down(cfs_rq->load.weight);

Let me go run that through the benchmark.

> I have one question regarding the use of cfs_rq->avg.load_avg.
> cfs_rq->tg_load_avg_contrib is the sampling of cfs_rq->avg.load_avg so
> I'm curious to understand why you use cfs_rq->avg.load_avg instead of
> keeping cfs_rq->tg_load_avg_contrib. Do you think that the sampling is
> not accurate enough to prevent any significant difference between both
> when we use tg->load_avg?

I'm not entirely sure I understand your question; is it about the existence
of calc_tg_weight()? That is, why use calc_tg_weight() and not use
tg->load_avg as is?

It seemed like a simple and cheap way to increase accuracy, nothing more
behind it until the commit you referred to.

The issue is fixed. I have verified it on kernel v4.7-rc7.