Created attachment 24500 [details] "Good" 2.6.32.1 .config For some reason, the kernels starting from 2.6.32.2 show a much higher "<kernel IPI> : Rescheduling interrupts" wakeup count (according to powertop). E.g. with 2.6.32.1 I see no more than 5-15% of such wakeups (from the total number of wakeups in a measurement period), but with 2.6.32.2 and 2.6.32.3 it very commonly stays at about 30-50%, i.e. in many cases a half of all wakeups are due to "Rescheduling interrupts" (and the wakeup count itself is very high too), making that the single most frequent reason for wakeups. I observe this on two amd64 systems, an Athlon64 X2 on a nForce4 board, and a Phenom X4 on AMD 790FX board.
Created attachment 24501 [details] "Bad" 2.6.32.2 .config
Created attachment 24502 [details] "Bad" 2.6.32.3 .config
------- powertop during high LAN traffic in 2.6.32.1: ------- < Detailed C-state information is not P-states (frequencies) 2.40 Ghz 0.1% 1200 Mhz 99.9% Wakeups-from-idle per second : 2351.9 interval: 10.0s no ACPI power usage estimate available Top causes for wakeups: 92.8% (8136.3) <interrupt> : eth0 2.9% (250.0) bash : pty_write (delayed_work_timer_fn) 2.7% (235.5) <kernel IPI> : Rescheduling interrupts 0.4% ( 33.4) xfce4-terminal : do_nanosleep (hrtimer_wakeup) 0.4% ( 32.3) xfce4-terminal : schedule_hrtimeout_range (hrtimer_wakeup) 0.3% ( 23.9) <interrupt> : saa7134[0], saa7134[0] ------- powertop during high LAN traffic in 2.6.32.2: ------- PowerTOP version 1.11 (C) 2007 Intel Corporation < Detailed C-state information is not P-states (frequencies) 2.40 Ghz 0.5% 1200 Mhz 99.5% Wakeups-from-idle per second : 3564.7 interval: 10.0s no ACPI power usage estimate available Top causes for wakeups: 63.2% (8563.4) <interrupt> : eth0 33.7% (4563.6) <kernel IPI> : Rescheduling interrupts 1.8% (240.2) bash : pty_write (delayed_work_timer_fn) 0.3% ( 36.0) USB device 1-2.2 : ADNS-3080 Mouse (Agilent Mouse) 0.2% ( 33.7) xfce4-terminal : do_nanosleep (hrtimer_wakeup) 0.2% ( 27.3) xfce4-terminal : schedule_hrtimeout_range (hrtimer_wakeup)
Created attachment 24503 [details] "Good" 2.6.32.1 dmesg
Created attachment 24504 [details] "Bad" 2.6.32.2 dmesg
Still not fixed in the latest 2.6.33-rc5. I try the newest kernels from time to time, and all of them show much higher "kernel IPI" wakeup count than 2.6.32.1. I suggest that we limit the further investigation to 2.6.32.1 versus 2.6.32.2 for simplicity's sake. It looks like 2.6.32.2 saw a number of scheduler-related changes <http://www.kernel.org/pub/linux/kernel/v2.6/ChangeLog-2.6.32.2>, so it looks like one of these did cause this regression. I will try to find time to roll them back one by one and test the resulting kernels, but it'd be great if someone with more knowledge in this area could provide more information on precisely which change could've caused this, or what else could be the cause.
I think this is an arch bug, so reassigning. Would you be able to carry out a bisection of commits introduced between 2.6.32.1 and 2.6.32.2?
Hello! Rafael, thanks for the suggestion. This is the result of git bisect: c2f1953231c6e99777633761c3bc0d2ca7ead20a is the first bad commit commit c2f1953231c6e99777633761c3bc0d2ca7ead20a Author: Mike Galbraith <efault@gmx.de> Date: Tue Oct 27 15:35:38 2009 +0100 sched: Check for an idle shared cache in select_task_rq_fair() commit a1f84a3ab8e002159498814eaa7e48c33752b04b upstream. When waking affine, check for an idle shared cache, and if found, wake to that CPU/sibling instead of the waker's CPU. This improves pgsql+oltp ramp up by roughly 8%. Possibly more for other loads, depending on overlap. The trade-off is a roughly 1% peak downturn if tasks are truly synchronous. Signed-off-by: Mike Galbraith <efault@gmx.de> Cc: Arjan van de Ven <arjan@infradead.org> Cc: Peter Zijlstra <peterz@infradead.org> LKML-Reference: <1256654138.17752.7.camel@marge.simson.net> Signed-off-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de> :040000 040000 166c585327f726d5dc4dcdf6246f35e0cc98978c 6e629a6eb530efa0c56a8f73adcd18b72e7903f0 M kernel
On 2/8/2010 6:54, bugzilla-daemon@bugzilla.kernel.org wrote: > > c2f1953231c6e99777633761c3bc0d2ca7ead20a is the first bad commit > commit c2f1953231c6e99777633761c3bc0d2ca7ead20a > Author: Mike Galbraith<efault@gmx.de> > Date: Tue Oct 27 15:35:38 2009 +0100 > > sched: Check for an idle shared cache in select_task_rq_fair() > > commit a1f84a3ab8e002159498814eaa7e48c33752b04b upstream. > > When waking affine, check for an idle shared cache, and if > found, wake to that CPU/sibling instead of the waker's CPU. > > This improves pgsql+oltp ramp up by roughly 8%. Possibly more > for other loads, depending on overlap. The trade-off is a > roughly 1% peak downturn if tasks are truly synchronous. > > Signed-off-by: Mike Galbraith<efault@gmx.de> > Cc: Arjan van de Ven<arjan@infradead.org> > Cc: Peter Zijlstra<peterz@infradead.org> > LKML-Reference:<1256654138.17752.7.camel@marge.simson.net> > Signed-off-by: Ingo Molnar<mingo@elte.hu> > Signed-off-by: Greg Kroah-Hartman<gregkh@suse.de> > > :040000 040000 166c585327f726d5dc4dcdf6246f35e0cc98978c > 6e629a6eb530efa0c56a8f73adcd18b72e7903f0 M kernel while this commit may give an IPI, it gives the IPI to the hyperthreading sibbling, which does not cost power (in fact, this is saving power due to better efficiency). If this code hits, you really have 2 programs that are running concurrently, and while powertop reports this, just hiding the IPI and making the machine go slower is the wrong answer technically. The right answer is 1) run the programs as fast as you can (eg do the IPI) 2) see if the action to wake up the other program can just be avoided *as part of the program*.
> while this commit may give an IPI, it gives the IPI to the hyperthreading > sibbling, which does not cost power Well, in my case, the CPU is not a hyperthreading one, but a Phenom X4 9750, which has four independent cores with separate frequency control per core, 512 KB of dedicated cache on each, and 2 MB of level 3 cache shared between all cores. Continuing the experiments, I took a kernel 2.6.32.7, reverse-applied the patch [1] which was pointed at by bisecting (together with its correction [2]), and as expected, such version of 2.6.32.7 did not exhibit the high IPI wakeup count. [1] http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commit;h=a1f84a3ab8e002159498814eaa7e48c33752b04b [2] http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commit;h=fd210738f6601d0fb462df9a2fe5a41896ff6a8f
First-Bad-Commit : a1f84a3ab8e002159498814eaa7e48c33752b04b Is the problem reproducible with the current mainline?
On 2/8/2010 10:54, bugzilla-daemon@bugzilla.kernel.org wrote: > http://bugzilla.kernel.org/show_bug.cgi?id=15044 > > > > > > --- Comment #11 from Rafael J. Wysocki<rjw@sisk.pl> 2010-02-08 18:54:17 --- > First-Bad-Commit : a1f84a3ab8e002159498814eaa7e48c33752b04b > > Is the problem reproducible with the current mainline? > I'm sure it is. The problem is that this "we see IPIs in powertop" is not a regression by itself, it's a regression in a symptom. The old scheduler did not wake up "friend" CPUs as aggressively, causing performance cost (which in turn costs power) and causes latency. The new code is more aggressive in moving things to "friend" CPUs (ones that share cache and power plane), which helps performance, but gives you an IPI. Now, powertop reports this IPI obviously, BUT, and powertop does not currently know this, this IPI is near free in terms of power, since the CPU the IPI is sent to is sharing power with the current CPU anyway... If we were to fix the symptom by reverting the patch, powertop reports a better number, but behavior is actually worse for power, performance and latency. I suppose we need some way to let powertop know the difference between "friend" and "further away" scheduler IPI's.... that way we can still catch the bad ones while not blending the report to the user with the more benign ones.
On Mon, 2010-02-08 at 20:16 +0000, bugzilla-daemon@bugzilla.kernel.org wrote: > http://bugzilla.kernel.org/show_bug.cgi?id=15044 > > > > > > --- Comment #12 from Arjan van de Ven <arjan@linux.intel.com> 2010-02-08 > 20:16:34 --- > On 2/8/2010 10:54, bugzilla-daemon@bugzilla.kernel.org wrote: > > http://bugzilla.kernel.org/show_bug.cgi?id=15044 > > > > > > > > > > > > --- Comment #11 from Rafael J. Wysocki<rjw@sisk.pl> 2010-02-08 18:54:17 > --- > > First-Bad-Commit : a1f84a3ab8e002159498814eaa7e48c33752b04b > > > > Is the problem reproducible with the current mainline? > > > I'm sure it is. No doubt. > The problem is that this "we see IPIs in powertop" is not a regression by > itself, > it's a regression in a symptom. > > The old scheduler did not wake up "friend" CPUs as aggressively, causing > performance cost (which in turn costs power) > and causes latency. > The new code is more aggressive in moving things to "friend" CPUs (ones that > share cache and power plane), which > helps performance, but gives you an IPI. Now, powertop reports this IPI > obviously, BUT, and powertop does not currently know this, > this IPI is near free in terms of power, since the CPU the IPI is sent to is > sharing power with the current CPU anyway... > > If we were to fix the symptom by reverting the patch, powertop reports a > better > number, but behavior is actually worse for power, > performance and latency. It could well cost power, actually that seems likely to be the case. You can either minimize power consumption, or maximize throughput. The patch aims for the later, though it _could_ also save power by getting work done more efficiently so cores can go back to sleep sooner. Numbers would be interesting. > I suppose we need some way to let powertop know the difference between > "friend" > and "further away" scheduler IPI's.... that way > we can still catch the bad ones while not blending the report to the user > with > the more benign ones. Maybe a connect with cpu-freq to only kick in other CPUs when the waker's core is going full bore. That way, a user setting governor to power save would automatically disable the feature. The cost of getting that information has to be _extremely_ cheap though, otherwise, tuning knobs are the only option. (currently, that would be turning off SD_SHARE_PKG_RESOURCES flag in SIBLING/MC domains) -Mike
On Monday 15 February 2010, you wrote: > On Mon, 2010-02-15 at 00:38 +0100, Rafael J. Wysocki wrote: > > This message has been generated automatically as a part of a summary report > > of recent regressions. > > > > The following bug entry is on the current list of known regressions > > from 2.6.32. Please verify if it still should be listed and let the > tracking team > > know (either way). > > > > > > Bug-Entry : http://bugzilla.kernel.org/show_bug.cgi?id=15044 > > Subject : Much higher wakeups for "<kernel IPI> : Rescheduling > interrupts" since 2.6.32.2 > > Submitter : Roman Mamedov <roman@rm.pp.ru> > > Date : 2010-01-11 02:58 (35 days old) > > First-Bad-Commit: > http://git.kernel.org/git/linus/a1f84a3ab8e002159498814eaa7e48c33752b04b > > I don't know that this should be carried as a regression. > > Yes, the code in question increases cross cpu wakeups, but that's the > entire point. If there is any overlap in execution larger than the cost > of running a scheduler on another core, that time can be converted to > throughput. > > Tip AF_UNIX lmbench numbers show that throughput gain being realized, > TCP numbers below that (x) show what can be had for apps which do a lot > of what that microbenchmark does, given a tiny enabler patchlet. > > *Local* Communication bandwidths in MB/s - bigger is better > ----------------------------------------------------------------------------- > Host OS Pipe AF TCP File Mmap Bcopy Bcopy Mem Mem > UNIX reread reread (libc) (hand) read write > --------- ------------- ---- ---- ---- ------ ------ ------ ------ ---- ----- > marge 2.6.31.9-smp 2853 2923 1132 2829.3 4761.9 1235.0 1234.4 4472 1683. > marge 2.6.31.9-smp 2839 2921 1141 2846.5 4779.8 1242.5 1235.9 4455 1684. > marge 2.6.31.9-smp 2838 2935 751. 2838.5 4820.0 1243.6 1235.0 4472 1684. > marge 2.6.33-tip-smp 3057 5166 859. 2760.2 4827.8 1481.1 1466.1 4499 1811. > marge 2.6.33-tip-smp 1796 5165 1257 2748.6 4817.4 1481.1 1464.8 4487 1806. > marge 2.6.33-tip-smp 3055 5175 1262 2763.4 4812.4 1483.9 1462.7 4477 1810. > marge 2.6.33-tip-smpx 3063 5140 2940 2811.1 4740.0 1235.8 1237.0 4433 1673. > marge 2.6.33-tip-smpx 3065 5205 2945 2836.3 4794.4 1243.6 1233.7 4293 1686. > marge 2.6.33-tip-smpx 3058 5181 2940 2785.4 4700.2 1243.9 1234.5 4415 1682. > > (1. tip memory numbers are phase-of-moon anomaly.. irrelevant here) > (2. pipe numbers are only possible with pipe buffer increase patch in > tip. Often, pipes are truly synchronous, so waking cross CPU is small > loss. In tip, it's a win because of optimistic mutex spin.. context > switch cost is converted to throughput. That throughput gain also > cannot be had if you don't do the cross cpu wakeup to get the ball > rolling. The code in question is acting as enabler for spintex.) > > So yeah, the code in question _will_ cause more cross CPU wakeups, and > it _may_ cost power. It may _save_ power by getting the job done more > efficiently. Dunno. > > Regression? Depends on what you're measuring.
I find it difficult to believe that this is normal. < Detailed C-state information is not P-states (frequencies) 2.40 Ghz 0.3% 1200 Mhz 99.7% Wakeups-from-idle per second : 438.0 interval: 10.0s no ACPI power usage estimate available Top causes for wakeups: 77.4% (1031.4) <kernel IPI> : Rescheduling interrupts 7.0% ( 93.9) <interrupt> : ICE1724 5.6% ( 75.1) <interrupt> : bttv0 1.9% ( 25.1) tvtime : do_nanosleep (hrtimer_wakeup) 1.4% ( 18.5) Xorg : schedule_hrtimeout_range (hrtimer_wakeup) 1.3% ( 17.4) USB device 1-4.4 : ADNS-3080 Mouse (Agilent Mouse) Suggestion: Enable SATA ALPM link power management via: echo min_power > /sys/class/scsi_host/host0/link_power_management_policy or press the S key.
Hello. On Sat, 06 Mar 2010 18:56:54 +0100 Mike Galbraith <efault@gmx.de> wrote: > > Top causes for wakeups: > > 77.4% (1031.4) <kernel IPI> : Rescheduling interrupts > > From another thread, likely relevant. > > http://patchwork.kernel.org/patch/78544/ > Maybe, but it looks like [1] this patch was merged a day before the final 2.6.33, and the above output was recorded with the 2.6.33 release. [1] http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=shortlog;h=refs/tags/v2.6.33
Created attachment 25397 [details] Patch to restore pre-2.6.32.2 behaviour of the scheduler (applies to 2.6.32.9) In case anyone wants to compare the new rescheduling behaviour on their system with the way it was before, here's a patch for 2.6.32.9 which restores the old behaviour (reduces IPIs).
On Sun, 2010-03-07 at 22:25 +0000, bugzilla-daemon@bugzilla.kernel.org wrote: > http://bugzilla.kernel.org/show_bug.cgi?id=15044 > > > > > > --- Comment #17 from Roman Mamedov <roman@rm.pp.ru> 2010-03-07 22:25:28 --- > Created an attachment (id=25397) > --> (http://bugzilla.kernel.org/attachment.cgi?id=25397) > Patch to restore pre-2.6.32.2 behaviour of the scheduler (applies to > 2.6.32.9) > > In case anyone wants to compare the new rescheduling behaviour on their > system > with the way it was before, here's a patch for 2.6.32.9 which restores the > old > behaviour (reduces IPIs). No need to patch. You can turn off the SD_PREFER_SIBLING domain flag, or for newer kernels SD_SHARE_PKG_RESOURCES (what it should have been from the start), to accomplish the very same. -Mike