Bug 10761 - hackbench regression with 2.6.26-rc2 on tulsa machine
Status: CLOSED CODE_FIX
Alias: None
Product: Process Management
Classification: Unclassified
Component: Scheduler
Hardware: All Linux
Importance: P1 normal
Assignee: Ingo Molnar
Blocks: 10492
Reported: 2008-05-20 16:00 UTC by Rafael J. Wysocki
Modified: 2008-06-08 09:49 UTC (History)

See Also:
Kernel Version: 2.6.26-rc2
Subsystem:
Regression: Yes
Bisected commit-id:


Description Rafael J. Wysocki 2008-05-20 16:00:40 UTC
Subject    : hackbench regression with 2.6.26-rc2 on tulsa machine
Submitter  : "Zhang, Yanmin" <yanmin_zhang@linux.intel.com>
Date       : 2008-05-20 8:09
References : http://marc.info/?l=linux-kernel&m=121127121813708&w=2
Handled-By : Mike Galbraith <efault@gmx.de>

This entry is being used for tracking a regression from 2.6.25.  Please don't
close it until the problem is fixed in the mainline.
Comment 1 Rafael J. Wysocki 2008-05-20 16:01:52 UTC
Probably caused by:

commit 46151122e0a2e80e5a6b2889f595e371fe2b600d
Author: Mike Galbraith <efault@gmx.de>
Date:   Thu May 8 17:00:42 2008 +0200

    sched: fix weight calculations

    Reported-by: "Zhang, Yanmin" <yanmin_zhang@linux.intel.com>
    Signed-off-by: Mike Galbraith <efault@gmx.de>
    Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
    Signed-off-by: Ingo Molnar <mingo@elte.hu>
Comment 2 Rafael J. Wysocki 2008-05-21 04:59:14 UTC
Mike Galbraith says that the tested configuration is known broken.

Closing.
Comment 3 Adrian Bunk 2008-05-21 05:07:42 UTC
We were not that far from giving GROUP_SCHED a dependency on BROKEN during 2.6.25-rc.

Can we consider this now instead of adding yet another problem to a known problematic configuration? 
Comment 4 Adrian Bunk 2008-05-21 05:08:50 UTC
"known problematic" for us - a user who once enabled it in his kernel cannot know that it could cause such problems.
Comment 5 Rafael J. Wysocki 2008-05-21 05:17:39 UTC
(In reply to comment #3)
> We were not that far from giving GROUP_SCHED a dependency on BROKEN during
> 2.6.25-rc.

I sort of agree with this.  What was the reason, actually?
Comment 6 Adrian Bunk 2008-05-21 05:54:34 UTC
(In reply to comment #5)
> (In reply to comment #3)
> > We were not that far from giving GROUP_SCHED a dependency on BROKEN during
> > 2.6.25-rc.
> 
> I sort of agree with this.  What was the reason, actually?

It was discussed in the thread around http://lkml.org/lkml/2008/3/28/273

In 2.6.25-rc we had 6 CPU scheduler regressions.
3 or 4 of them were caused by group scheduling.
Including one that is still unfixed.

In 2.6.26-rc we already have 6 CPU scheduler regressions, 5 of them still unfixed.
3 of them seem to be group scheduler regressions.

The CPU scheduler is currently regressing horribly often, and half of the regressions are in group scheduling.
Comment 7 Peter Zijlstra 2008-05-21 06:00:17 UTC
On Wed, 2008-05-21 at 05:54 -0700, bugme-daemon@bugzilla.kernel.org
wrote:
> http://bugzilla.kernel.org/show_bug.cgi?id=10761
> 
> ------- Comment #6 from bunk@kernel.org  2008-05-21 05:54 -------
> (In reply to comment #5)
> > (In reply to comment #3)
> > > We were not that far from giving GROUP_SCHED a dependency on BROKEN during
> > > 2.6.25-rc.
> > 
> > I sort of agree with this.  What was the reason, actually?
> 
> It was discussed in the thread around http://lkml.org/lkml/2008/3/28/273
> 
> In 2.6.25-rc we had 6 CPU scheduler regressions.
> 3 or 4 of them were caused by group scheduling.
> Including one that is still unfixed.
> 
> In 2.6.26-rc we already have 6 CPU scheduler regressions, 5 of them still
> unfixed.
> 3 of them seem to be group scheduler regressions.
> 
> The CPU scheduler is currently regressing horribly often, and half of the
> regressions are in group scheduling.

That is because group scheduling is horribly complex and was never
feature complete - trying to solve that is high on my list of
priorities.
Comment 8 Adrian Bunk 2008-05-21 06:18:58 UTC
(In reply to comment #7)
> On Wed, 2008-05-21 at 05:54 -0700, bugme-daemon@bugzilla.kernel.org
> wrote:
> > http://bugzilla.kernel.org/show_bug.cgi?id=10761
> > ------- Comment #6 from bunk@kernel.org  2008-05-21 05:54 -------
> > (In reply to comment #5)
> > > (In reply to comment #3)
> > > > We were not that far from giving GROUP_SCHED a dependency on BROKEN during
> > > > 2.6.25-rc.
> > > 
> > > I sort of agree with this.  What was the reason, actually?
> > 
> > It was discussed in the thread around http://lkml.org/lkml/2008/3/28/273
> > 
> > In 2.6.25-rc we had 6 CPU scheduler regressions.
> > 3 or 4 of them were caused by group scheduling.
> > Including one that is still unfixed.
> > 
> > In 2.6.26-rc we already have 6 CPU scheduler regressions, 5 of them still
> > unfixed.
> > 3 of them seem to be group scheduler regressions.
> > 
> > The CPU scheduler is currently regressing horribly often, and half of the
> > regressions are in group scheduling.
> 
> That is because group scheduling is horribly complex and was never
> feature complete - trying to solve that is high on my list of
> priorities.

The current question is what to do for 2.6.26.

And getting it feature complete is nothing that would suit for 2.6.26.

Can we agree to add to GROUP_SCHED a dependency on BROKEN and keep this dependency in Linus' tree until the code is feature complete and considered ready for production use?

Currently it seems to be more of a pitfall (for users who enable it) than a useful feature.
Comment 9 Peter Zijlstra 2008-05-21 06:34:38 UTC
On Wed, 2008-05-21 at 06:18 -0700, bugme-daemon@bugzilla.kernel.org
wrote:
> http://bugzilla.kernel.org/show_bug.cgi?id=10761
> 
> ------- Comment #8 from bunk@kernel.org  2008-05-21 06:18 -------
> (In reply to comment #7)
> > On Wed, 2008-05-21 at 05:54 -0700, bugme-daemon@bugzilla.kernel.org
> > wrote:
> > > http://bugzilla.kernel.org/show_bug.cgi?id=10761
> > > ------- Comment #6 from bunk@kernel.org  2008-05-21 05:54 -------
> > > (In reply to comment #5)
> > > > (In reply to comment #3)
> > > > > We were not that far from giving GROUP_SCHED a dependency on BROKEN during
> > > > > 2.6.25-rc.
> > > > 
> > > > I sort of agree with this.  What was the reason, actually?
> > > 
> > > It was discussed in the thread around http://lkml.org/lkml/2008/3/28/273
> > > 
> > > In 2.6.25-rc we had 6 CPU scheduler regressions.
> > > 3 or 4 of them were caused by group scheduling.
> > > Including one that is still unfixed.
> > > 
> > > In 2.6.26-rc we already have 6 CPU scheduler regressions, 5 of them still
> > > unfixed.
> > > 3 of them seem to be group scheduler regressions.
> > > 
> > > The CPU scheduler is currently regressing horribly often, and half of the
> > > regressions are in group scheduling.
> > 
> > That is because group scheduling is horribly complex and was never
> > feature complete - trying to solve that is high on my list of
> > priorities.
> 
> The current question is what to do for 2.6.26.
> 
> And getting it feature complete is nothing that would suit for 2.6.26.
> 
> Can we agree to add to GROUP_SCHED a dependency on BROKEN and keep this
> dependency in Linus' tree until the code is feature complete and considered
> ready for production use?
> 
> Currently it seems to be more of a pitfall (for users who enable it) than a
> useful feature.

I think we changed the default to 'N' - isn't that enough?
Comment 10 Adrian Bunk 2008-05-21 06:44:32 UTC
(In reply to comment #9)

> I think we changed the default to 'N' - isn't that enough?

It does already default to N, and we know how many people run into problems with it.

How many hours have people wasted on bisecting regressions that turned out to be group scheduling problems?

If a feature isn't ready for being used on production systems it shouldn't be in stable kernels.
Comment 11 Mike Galbraith 2008-05-21 09:12:30 UTC
(IMHO, bugzilla shouldn't be used for tracking EXPERIMENTAL code, so I
shouldn't be replying to bugme-daemon, but...)

On Wed, 2008-05-21 at 06:18 -0700, bugme-daemon@bugzilla.kernel.org
wrote:

> The current question is what to do for 2.6.26.

My $.02 is that since it defaults to 'N' _and_ depends on EXPERIMENTAL,
all is just fine.

> And getting it feature complete is nothing that would suit for 2.6.26.

Heartily disagree given the above.

> Can we agree to add to GROUP_SCHED a dependency on BROKEN and keep this
> dependency in Linus' tree until the code is feature complete and considered
> ready for production use?

Why mark it BROKEN?  It's only 'broken' in so far as it has known
performance issues, which is quite normal for complex code under active
development.  BROKEN means "this gizmo don't work, and ain't being
fixed".  That does not apply to group scheduling.

> Currently it seems to be more of a pitfall (for users who enable it) than a
> useful feature.

If you explicitly enable features marked EXPERIMENTAL, you might indeed
encounter a developmental pitfall or two.  Nothing unusual here.

	-Mike
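
For readers less familiar with Kconfig, the positions argued in this thread can be sketched as a configuration fragment. This is an illustration only, not the text of any kernel tree: the prompt string is an assumption, and the `depends on BROKEN` line is the proposal from comment 8, which the thread does not show being adopted. The `default n` and `depends on EXPERIMENTAL` lines reflect what comments 9 to 11 say about the state of the option at the time.

```kconfig
# Illustrative sketch only; not copied from a kernel source tree.
config GROUP_SCHED
	# Prompt text is an assumption made for this example.
	bool "Group CPU scheduler"
	# Per comments 9-11: the option is EXPERIMENTAL-only and off by default.
	depends on EXPERIMENTAL
	default n
	# Proposed in comment 8 (hypothetical here): hide the option
	# until the code is considered ready for production use.
	depends on BROKEN
```

The practical difference between the two dependencies is the point of the disagreement: EXPERIMENTAL is enabled in most configurations, so it mostly acts as a warning label, whereas a symbol that depends on BROKEN generally cannot be selected through the normal configuration menus at all.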
Comment 12 Adrian Bunk 2008-05-21 09:24:50 UTC
(In reply to comment #11)
> (IMHO, bugzilla shouldn't be used for tracking EXPERIMENTAL code, so I
> shouldn't be replying to bugme-daemon, but...)
> 
> On Wed, 2008-05-21 at 06:18 -0700, bugme-daemon@bugzilla.kernel.org
> wrote:
> 
> > The current question is what to do for 2.6.26.
> 
> My $.02 is that since it defaults to 'N' _and_ depends on EXPERIMENTAL,
> all is just fine.

My €.02 (which is more than $.03) is that it's very common that it's impossible to use a kernel with CONFIG_EXPERIMENTAL=n (e.g. for hardware drivers), and users having CONFIG_EXPERIMENTAL=n set are therefore _very_ rare.

> > And getting it feature complete is nothing that would suit for 2.6.26.
> 
> Heartily disagree given the above.
> 
> > Can we agree to add to GROUP_SCHED a dependency on BROKEN and keep this
> > dependency in Linus' tree until the code is feature complete and considered
> > ready for production use?
> 
> Why mark it BROKEN?  It's only 'broken' in so far as it has known
> performance issues, which is quite normal for complex code under active
> development.  BROKEN means "this gizmo don't work, and ain't being
> fixed".  That does not apply to group scheduling.

Is it ready for being used in production today or not?

> > Currently it seems to be more of a pitfall (for users who enable it) than a
> > useful feature.
> 
> If you explicitly enable features marked EXPERIMENTAL, you might indeed
> encounter a developmental pitfall or two.  Nothing unusual here.

Please name one distribution that builds its kernels with CONFIG_EXPERIMENTAL=n.

Your expectations of CONFIG_EXPERIMENTAL do not match reality.
Comment 13 Mike Galbraith 2008-05-21 10:51:59 UTC
On Wed, 2008-05-21 at 09:24 -0700, bugme-daemon@bugzilla.kernel.org
wrote: 
> http://bugzilla.kernel.org/show_bug.cgi?id=10761
> 
> ------- Comment #12 from bunk@kernel.org  2008-05-21 09:24 -------
> (In reply to comment #11)
> > (IMHO, bugzilla shouldn't be used for tracking EXPERIMENTAL code, so I
> > shouldn't be replying to bugme-daemon, but...)
> > 
> > On Wed, 2008-05-21 at 06:18 -0700, bugme-daemon@bugzilla.kernel.org
> > wrote:
> > 
> > > The current question is what to do for 2.6.26.
> > 
> > My $.02 is that since it defaults to 'N' _and_ depends on EXPERIMENTAL,
> > all is just fine.
> 
> My €.02 (which is more than $.03) is that it's very common that it's
> impossible to use a kernel with CONFIG_EXPERIMENTAL=n (e.g. for hardware
> drivers), and users having CONFIG_EXPERIMENTAL=n set are therefore _very_
> rare.

Yes, some hardware needs experimental drivers.  That doesn't change the
definition of EXPERIMENTAL.  It still means "Aunt Tilly beware!", as it
always has.

> > > And getting it feature complete is nothing that would suit for 2.6.26.
> > 
> > Heartily disagree given the above.
> > 
> > > Can we agree to add to GROUP_SCHED a dependency on BROKEN and keep this
> > > dependency in Linus' tree until the code is feature complete and considered
> > > ready for production use?
> > 
> > Why mark it BROKEN?  It's only 'broken' in so far as it has known
> > performance issues, which is quite normal for complex code under active
> > development.  BROKEN means "this gizmo don't work, and ain't being
> > fixed".  That does not apply to group scheduling.
> 
> Is it ready for being used in production today or not?

Depends on the production load I suppose.  There are loads where EXT3
doesn't perform well.  Rhetorical: Shall we mark EXT3 BROKEN?

> > > Currently it seems to be more of a pitfall (for users who enable it) than a
> > > useful feature.
> > 
> > If you explicitly enable features marked EXPERIMENTAL, you might indeed
> > encounter a developmental pitfall or two.  Nothing unusual here.
> 
> Please name one distribution that builds its kernels with
> CONFIG_EXPERIMENTAL=n.

Rhetorical: Your point is?  If distros were hiring Aunt Tilly to
configure and test their kernels, they could run into trouble enabling
EXPERIMENTAL.  I don't think that's the case.

> Your expectations of CONFIG_EXPERIMENTAL do not match reality.

No.  Your redefinition thereof doesn't match past or current reality.

I've stated my position, and rebutted yours for the record.  Bugzilla
wasn't intended to be a debate podium, so I'm outta here ;-)

	EOT,

	-Mike
Comment 14 Rafael J. Wysocki 2008-06-05 14:39:34 UTC
Confirmed to have been improved recently:
References : http://lkml.org/lkml/2008/6/2/10
Comment 15 Rafael J. Wysocki 2008-06-08 09:49:33 UTC
The problem appears to be fixed in the mainline.
References : http://lkml.org/lkml/2008/6/7/227

Closing.
