Bug 15044

Summary: Much higher wakeups for "<kernel IPI> : Rescheduling interrupts" since 2.6.32.2
Product: Platform Specific/Hardware Reporter: Roman Mamedov (rm+bko)
Component: x86-64Assignee: platform_x86_64 (platform_x86_64)
Status: CLOSED DOCUMENTED    
Severity: normal CC: alan, arjan, efault, greg, hanno, rjw, rm+bko
Priority: P1    
Hardware: All   
OS: Linux   
Kernel Version: 2.6.33 Subsystem:
Regression: Yes Bisected commit-id:
Bug Depends on:    
Bug Blocks: 14885    
Attachments: "Good" 2.6.32.1 .config
"Bad" 2.6.32.2 .config
"Bad" 2.6.32.3 .config
"Good" 2.6.32.1 dmesg
"Bad" 2.6.32.2 dmesg
Patch to restore pre-2.6.32.2 behaviour of the scheduler (applies to 2.6.32.9)

Description Roman Mamedov 2010-01-11 02:58:53 UTC
Created attachment 24500 [details]
"Good" 2.6.32.1 .config

For some reason, the kernels starting from 2.6.32.2 show a much higher "<kernel IPI> : Rescheduling interrupts" wakeup count (according to powertop).

E.g. with 2.6.32.1 I see no more than 5-15% of such wakeups (from the total number of wakeups in a measurement period), but with 2.6.32.2 and 2.6.32.3 it very commonly stays at about 30-50%, i.e. in many cases a half of all wakeups are due to "Rescheduling interrupts" (and the wakeup count itself is very high too), making that the single most frequent reason for wakeups.

I observe this on two amd64 systems, an Athlon64 X2 on a nForce4 board, and a Phenom X4 on AMD 790FX board.
Comment 1 Roman Mamedov 2010-01-11 02:59:32 UTC
Created attachment 24501 [details]
"Bad" 2.6.32.2 .config
Comment 2 Roman Mamedov 2010-01-11 02:59:54 UTC
Created attachment 24502 [details]
"Bad" 2.6.32.3 .config
Comment 3 Roman Mamedov 2010-01-11 03:29:28 UTC
------- powertop during high LAN traffic in 2.6.32.1: -------

< Detailed C-state information is not P-states (frequencies)
                                        2.40 Ghz     0.1%
                                        1200 Mhz    99.9%




Wakeups-from-idle per second : 2351.9   interval: 10.0s
no ACPI power usage estimate available

Top causes for wakeups:
  92.8% (8136.3)       <interrupt> : eth0 
   2.9% (250.0)              bash : pty_write (delayed_work_timer_fn) 
   2.7% (235.5)      <kernel IPI> : Rescheduling interrupts 
   0.4% ( 33.4)    xfce4-terminal : do_nanosleep (hrtimer_wakeup) 
   0.4% ( 32.3)    xfce4-terminal : schedule_hrtimeout_range (hrtimer_wakeup) 
   0.3% ( 23.9)       <interrupt> : saa7134[0], saa7134[0] 


------- powertop during high LAN traffic in 2.6.32.2: -------

     PowerTOP version 1.11      (C) 2007 Intel Corporation

< Detailed C-state information is not P-states (frequencies)
                                        2.40 Ghz     0.5%
                                        1200 Mhz    99.5%




Wakeups-from-idle per second : 3564.7   interval: 10.0s
no ACPI power usage estimate available

Top causes for wakeups:
  63.2% (8563.4)       <interrupt> : eth0 
  33.7% (4563.6)      <kernel IPI> : Rescheduling interrupts 
   1.8% (240.2)              bash : pty_write (delayed_work_timer_fn) 
   0.3% ( 36.0)   USB device 1-2.2 : ADNS-3080 Mouse (Agilent Mouse) 
   0.2% ( 33.7)    xfce4-terminal : do_nanosleep (hrtimer_wakeup) 
   0.2% ( 27.3)    xfce4-terminal : schedule_hrtimeout_range (hrtimer_wakeup)
Comment 4 Roman Mamedov 2010-01-11 05:23:38 UTC
Created attachment 24503 [details]
"Good" 2.6.32.1 dmesg
Comment 5 Roman Mamedov 2010-01-11 05:24:35 UTC
Created attachment 24504 [details]
"Bad" 2.6.32.2 dmesg
Comment 6 Roman Mamedov 2010-01-24 16:39:19 UTC
Still not fixed in the latest 2.6.33-rc5. I try the newest kernels from time to time, and all of them show much higher "kernel IPI" wakeup count than 2.6.32.1.

I suggest that we limit the further investigation to 2.6.32.1 versus 2.6.32.2 for simplicity's sake. It looks like 2.6.32.2 saw a number of scheduler-related changes <http://www.kernel.org/pub/linux/kernel/v2.6/ChangeLog-2.6.32.2>, so it looks like one of these did cause this regression.

I will try to find time to roll them back one by one and test the resulting kernels, but it'd be great if someone with more knowledge in this area could provide more information on precisely which change could've caused this, or what else could be the cause.
Comment 7 Rafael J. Wysocki 2010-02-07 13:17:47 UTC
I think this is an arch bug, so reassigning.

Would you be able to carry out a bisection of commits introduced between 2.6.32.1 and 2.6.32.2?
Comment 8 Roman Mamedov 2010-02-08 14:54:15 UTC
Hello!

Rafael, thanks for the suggestion. This is the result of git bisect:

c2f1953231c6e99777633761c3bc0d2ca7ead20a is the first bad commit
commit c2f1953231c6e99777633761c3bc0d2ca7ead20a
Author: Mike Galbraith <efault@gmx.de>
Date:   Tue Oct 27 15:35:38 2009 +0100

    sched: Check for an idle shared cache in select_task_rq_fair()
    
    commit a1f84a3ab8e002159498814eaa7e48c33752b04b upstream.
    
    When waking affine, check for an idle shared cache, and if
    found, wake to that CPU/sibling instead of the waker's CPU.
    
    This improves pgsql+oltp ramp up by roughly 8%. Possibly more
    for other loads, depending on overlap. The trade-off is a
    roughly 1% peak downturn if tasks are truly synchronous.
    
    Signed-off-by: Mike Galbraith <efault@gmx.de>
    Cc: Arjan van de Ven <arjan@infradead.org>
    Cc: Peter Zijlstra <peterz@infradead.org>
    LKML-Reference: <1256654138.17752.7.camel@marge.simson.net>
    Signed-off-by: Ingo Molnar <mingo@elte.hu>
    Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>

:040000 040000 166c585327f726d5dc4dcdf6246f35e0cc98978c 6e629a6eb530efa0c56a8f73adcd18b72e7903f0 M	kernel
Comment 9 Arjan van de Ven 2010-02-08 15:57:37 UTC
On 2/8/2010 6:54, bugzilla-daemon@bugzilla.kernel.org wrote:
>
> c2f1953231c6e99777633761c3bc0d2ca7ead20a is the first bad commit
> commit c2f1953231c6e99777633761c3bc0d2ca7ead20a
> Author: Mike Galbraith<efault@gmx.de>
> Date:   Tue Oct 27 15:35:38 2009 +0100
>
>      sched: Check for an idle shared cache in select_task_rq_fair()
>
>      commit a1f84a3ab8e002159498814eaa7e48c33752b04b upstream.
>
>      When waking affine, check for an idle shared cache, and if
>      found, wake to that CPU/sibling instead of the waker's CPU.
>
>      This improves pgsql+oltp ramp up by roughly 8%. Possibly more
>      for other loads, depending on overlap. The trade-off is a
>      roughly 1% peak downturn if tasks are truly synchronous.
>
>      Signed-off-by: Mike Galbraith<efault@gmx.de>
>      Cc: Arjan van de Ven<arjan@infradead.org>
>      Cc: Peter Zijlstra<peterz@infradead.org>
>      LKML-Reference:<1256654138.17752.7.camel@marge.simson.net>
>      Signed-off-by: Ingo Molnar<mingo@elte.hu>
>      Signed-off-by: Greg Kroah-Hartman<gregkh@suse.de>
>
> :040000 040000 166c585327f726d5dc4dcdf6246f35e0cc98978c
> 6e629a6eb530efa0c56a8f73adcd18b72e7903f0 M    kernel


while this commit may give an IPI, it gives the IPI to the hyperthreading sibbling,
which does not cost power (in fact, this is saving power due to better efficiency).

If this code hits, you really have 2 programs that are running concurrently, and
while powertop reports this, just hiding the IPI and making the machine go slower
is the wrong answer technically. The right answer is
1) run the programs as fast as you can (eg do the IPI)
2) see if the action to wake up the other program can just be avoided *as part of the program*.
Comment 10 Roman Mamedov 2010-02-08 16:46:54 UTC
> while this commit may give an IPI, it gives the IPI to the hyperthreading
> sibbling, which does not cost power

Well, in my case, the CPU is not a hyperthreading one, but a Phenom X4 9750, which has four independent cores with separate frequency control per core, 512 KB of dedicated cache on each, and 2 MB of level 3 cache shared between all cores.

Continuing the experiments, I took a kernel 2.6.32.7, reverse-applied the patch [1] which was pointed at by bisecting (together with its correction [2]), and as expected, such version of 2.6.32.7 did not exhibit the high IPI wakeup count.

[1] http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commit;h=a1f84a3ab8e002159498814eaa7e48c33752b04b
[2] http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commit;h=fd210738f6601d0fb462df9a2fe5a41896ff6a8f
Comment 11 Rafael J. Wysocki 2010-02-08 18:54:17 UTC
First-Bad-Commit : a1f84a3ab8e002159498814eaa7e48c33752b04b

Is the problem reproducible with the current mainline?
Comment 12 Arjan van de Ven 2010-02-08 20:16:34 UTC
On 2/8/2010 10:54, bugzilla-daemon@bugzilla.kernel.org wrote:
> http://bugzilla.kernel.org/show_bug.cgi?id=15044
>
>
>
>
>
> --- Comment #11 from Rafael J. Wysocki<rjw@sisk.pl>   2010-02-08 18:54:17 ---
> First-Bad-Commit : a1f84a3ab8e002159498814eaa7e48c33752b04b
>
> Is the problem reproducible with the current mainline?
>
I'm sure it is.

The problem is that this "we see IPIs in powertop" is not a regression by itself,
it's a regression in a symptom.

The old scheduler did not wake up "friend" CPUs as aggressively, causing performance cost (which in turn costs power)
and causes latency.
The new code is more aggressive in moving things to "friend" CPUs (ones that share cache and power plane), which
helps performance, but gives you an IPI. Now, powertop reports this IPI obviously, BUT, and powertop does not currently know this,
this IPI is near free in terms of power, since the CPU the IPI is sent to is sharing power with the current CPU anyway...

If we were to fix the symptom by reverting the patch, powertop reports a better number, but behavior is actually worse for power,
performance and latency.

I suppose we need some way to let powertop know the difference between "friend" and "further away" scheduler IPI's.... that way
we can still catch the bad ones while not blending the report to the user with the more benign ones.
Comment 13 Mike Galbraith 2010-02-09 06:21:30 UTC
On Mon, 2010-02-08 at 20:16 +0000, bugzilla-daemon@bugzilla.kernel.org
wrote:
> http://bugzilla.kernel.org/show_bug.cgi?id=15044
> 
> 
> 
> 
> 
> --- Comment #12 from Arjan van de Ven <arjan@linux.intel.com>  2010-02-08
> 20:16:34 ---
> On 2/8/2010 10:54, bugzilla-daemon@bugzilla.kernel.org wrote:
> > http://bugzilla.kernel.org/show_bug.cgi?id=15044
> >
> >
> >
> >
> >
> > --- Comment #11 from Rafael J. Wysocki<rjw@sisk.pl>   2010-02-08 18:54:17
> ---
> > First-Bad-Commit : a1f84a3ab8e002159498814eaa7e48c33752b04b
> >
> > Is the problem reproducible with the current mainline?
> >
> I'm sure it is.

No doubt.

> The problem is that this "we see IPIs in powertop" is not a regression by
> itself,
> it's a regression in a symptom.
> 
> The old scheduler did not wake up "friend" CPUs as aggressively, causing
> performance cost (which in turn costs power)
> and causes latency.
> The new code is more aggressive in moving things to "friend" CPUs (ones that
> share cache and power plane), which
> helps performance, but gives you an IPI. Now, powertop reports this IPI
> obviously, BUT, and powertop does not currently know this,
> this IPI is near free in terms of power, since the CPU the IPI is sent to is
> sharing power with the current CPU anyway...
> 
> If we were to fix the symptom by reverting the patch, powertop reports a
> better
> number, but behavior is actually worse for power,
> performance and latency.

It could well cost power, actually that seems likely to be the case.
You can either minimize power consumption, or maximize throughput.  The
patch aims for the later, though it _could_ also save power by getting
work done more efficiently so cores can go back to sleep sooner.

Numbers would be interesting.

> I suppose we need some way to let powertop know the difference between
> "friend"
> and "further away" scheduler IPI's.... that way
> we can still catch the bad ones while not blending the report to the user
> with
> the more benign ones.

Maybe a connect with cpu-freq to only kick in other CPUs when the
waker's core is going full bore.  That way, a user setting governor to
power save would automatically disable the feature. The cost of getting
that information has to be _extremely_ cheap though, otherwise, tuning
knobs are the only option.  (currently, that would be turning off
SD_SHARE_PKG_RESOURCES flag in SIBLING/MC domains)

	-Mike
Comment 14 Rafael J. Wysocki 2010-02-15 20:46:45 UTC
On Monday 15 February 2010, you wrote:
> On Mon, 2010-02-15 at 00:38 +0100, Rafael J. Wysocki wrote:
> > This message has been generated automatically as a part of a summary report
> > of recent regressions.
> > 
> > The following bug entry is on the current list of known regressions
> > from 2.6.32.  Please verify if it still should be listed and let the
> tracking team
> > know (either way).
> > 
> > 
> > Bug-Entry   : http://bugzilla.kernel.org/show_bug.cgi?id=15044
> > Subject             : Much higher wakeups for "<kernel IPI> : Rescheduling
> interrupts" since 2.6.32.2
> > Submitter   : Roman Mamedov <roman@rm.pp.ru>
> > Date                : 2010-01-11 02:58 (35 days old)
> > First-Bad-Commit:
> http://git.kernel.org/git/linus/a1f84a3ab8e002159498814eaa7e48c33752b04b
> 
> I don't know that this should be carried as a regression.
> 
> Yes, the code in question increases cross cpu wakeups, but that's the
> entire point.  If there is any overlap in execution larger than the cost
> of running a scheduler on another core, that time can be converted to
> throughput.
> 
> Tip AF_UNIX lmbench numbers show that throughput gain being realized,
> TCP numbers below that (x) show what can be had for apps which do a lot
> of what that microbenchmark does, given a tiny enabler patchlet.
> 
> *Local* Communication bandwidths in MB/s - bigger is better
> -----------------------------------------------------------------------------
> Host                OS  Pipe AF    TCP  File   Mmap  Bcopy  Bcopy  Mem   Mem
>                              UNIX      reread reread (libc) (hand) read write
> --------- ------------- ---- ---- ---- ------ ------ ------ ------ ---- -----
> marge      2.6.31.9-smp 2853 2923 1132 2829.3 4761.9 1235.0 1234.4 4472 1683.
> marge      2.6.31.9-smp 2839 2921 1141 2846.5 4779.8 1242.5 1235.9 4455 1684.
> marge      2.6.31.9-smp 2838 2935 751. 2838.5 4820.0 1243.6 1235.0 4472 1684.
> marge    2.6.33-tip-smp 3057 5166 859. 2760.2 4827.8 1481.1 1466.1 4499 1811.
> marge    2.6.33-tip-smp 1796 5165 1257 2748.6 4817.4 1481.1 1464.8 4487 1806.
> marge    2.6.33-tip-smp 3055 5175 1262 2763.4 4812.4 1483.9 1462.7 4477 1810.
> marge   2.6.33-tip-smpx 3063 5140 2940 2811.1 4740.0 1235.8 1237.0 4433 1673.
> marge   2.6.33-tip-smpx 3065 5205 2945 2836.3 4794.4 1243.6 1233.7 4293 1686.
> marge   2.6.33-tip-smpx 3058 5181 2940 2785.4 4700.2 1243.9 1234.5 4415 1682.
> 
> (1. tip memory numbers are phase-of-moon anomaly.. irrelevant here)
> (2. pipe numbers are only possible with pipe buffer increase patch in
> tip.  Often, pipes are truly synchronous, so waking cross CPU is small
> loss.  In tip, it's a win because of optimistic mutex spin.. context
> switch cost is converted to throughput.  That throughput gain also
> cannot be had if you don't do the cross cpu wakeup to get the ball
> rolling.  The code in question is acting as enabler for spintex.)
> 
> So yeah, the code in question _will_ cause more cross CPU wakeups, and
> it _may_ cost power.  It may _save_ power by getting the job done more
> efficiently.  Dunno.
> 
> Regression?  Depends on what you're measuring.
Comment 15 Roman Mamedov 2010-03-06 12:11:48 UTC
I find it difficult to believe that this is normal.

< Detailed C-state information is not P-states (frequencies)
                                        2.40 Ghz     0.3%
                                        1200 Mhz    99.7%




Wakeups-from-idle per second : 438.0    interval: 10.0s
no ACPI power usage estimate available

Top causes for wakeups:
  77.4% (1031.4)      <kernel IPI> : Rescheduling interrupts 
   7.0% ( 93.9)       <interrupt> : ICE1724 
   5.6% ( 75.1)       <interrupt> : bttv0 
   1.9% ( 25.1)            tvtime : do_nanosleep (hrtimer_wakeup) 
   1.4% ( 18.5)              Xorg : schedule_hrtimeout_range (hrtimer_wakeup) 
   1.3% ( 17.4)   USB device 1-4.4 : ADNS-3080 Mouse (Agilent Mouse) 

Suggestion: Enable SATA ALPM link power management via:
  echo min_power > /sys/class/scsi_host/host0/link_power_management_policy
or press the S key.
Comment 16 Roman Mamedov 2010-03-06 18:07:25 UTC
Hello.

On Sat, 06 Mar 2010 18:56:54 +0100
Mike Galbraith <efault@gmx.de> wrote:

> > Top causes for wakeups:
> >   77.4% (1031.4)      <kernel IPI> : Rescheduling interrupts 
> 
> From another thread, likely relevant.
> 
> http://patchwork.kernel.org/patch/78544/
> 

Maybe, but it looks like [1] this patch was merged a day before the final 2.6.33, and the above output was recorded with the 2.6.33 release.

[1] http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=shortlog;h=refs/tags/v2.6.33
Comment 17 Roman Mamedov 2010-03-07 22:25:28 UTC
Created attachment 25397 [details]
Patch to restore pre-2.6.32.2 behaviour of the scheduler (applies to 2.6.32.9)

In case anyone wants to compare the new rescheduling behaviour on their system with the way it was before, here's a patch for 2.6.32.9 which restores the old behaviour (reduces IPIs).
Comment 18 Mike Galbraith 2010-03-08 03:59:20 UTC
On Sun, 2010-03-07 at 22:25 +0000, bugzilla-daemon@bugzilla.kernel.org
wrote:
> http://bugzilla.kernel.org/show_bug.cgi?id=15044
> 
> 
> 
> 
> 
> --- Comment #17 from Roman Mamedov <roman@rm.pp.ru>  2010-03-07 22:25:28 ---
> Created an attachment (id=25397)
>  --> (http://bugzilla.kernel.org/attachment.cgi?id=25397)
> Patch to restore pre-2.6.32.2 behaviour of the scheduler (applies to
> 2.6.32.9)
> 
> In case anyone wants to compare the new rescheduling behaviour on their
> system
> with the way it was before, here's a patch for 2.6.32.9 which restores the
> old
> behaviour (reduces IPIs).

No need to patch.  You can turn off the SD_PREFER_SIBLING domain flag,
or for newer kernels SD_SHARE_PKG_RESOURCES (what it should have been
from the start), to accomplish the very same.

	-Mike