Bug 67601 - [BISECTED] Regression: Erratic 3D performance after "sched/numa: Avoid overloading CPUs on a preferred NUMA node"
Summary: [BISECTED] Regression: Erratic 3D performance after "sched/numa: Avoid overloading CPUs on a preferred NUMA node"
Status: NEW
Alias: None
Product: Process Management
Classification: Unclassified
Component: Scheduler
Hardware: x86-64 Linux
Importance: P1 high
Assignee: Ingo Molnar
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2013-12-23 23:41 UTC by Thomas Hellstrom
Modified: 2016-03-23 18:25 UTC
CC List: 6 users

See Also:
Kernel Version: 3.13-rc1+
Subsystem:
Regression: Yes
Bisected commit-id:


Attachments
dmesg output (105.11 KB, text/plain), 2013-12-23 23:46 UTC, Thomas Hellstrom
output of /proc/cpuinfo (1.49 KB, text/plain), 2013-12-23 23:47 UTC, Thomas Hellstrom

Description Thomas Hellstrom 2013-12-23 23:41:22 UTC
The commit 58d081b5082dd85e02ac9a1fb151d97395340a09
sched/numa: Avoid overloading CPUs on a preferred NUMA node

for some reason causes degraded and erratic 3D performance on a Fedora 20 system running as a VMware virtual machine. For example, glxgears is down from around 4500 fps to 700-2000 fps, varying heavily.

The host is an Intel Core i7; the virtual machine runs with two virtual CPUs, one core each.
Comment 1 Thomas Hellstrom 2013-12-23 23:46:38 UTC
Created attachment 119431
dmesg output
Comment 2 Thomas Hellstrom 2013-12-23 23:47:43 UTC
Created attachment 119441
output of /proc/cpuinfo
Comment 3 Thomas Hellstrom 2013-12-23 23:49:44 UTC
The problematic commit was found by bisecting, and double-checked twice.
Comment 4 Mel Gorman 2013-12-24 11:43:13 UTC
Thanks for the report.

Is the physical machine a NUMA machine? The dmesg implies not, but it also appears to come from the virtual machine, which is running a kernel that would not include this commit. If it is a NUMA machine, is CONFIG_NUMA_BALANCING set and enabled? If so, does disabling it work around the problem? What is the ratio of virtual CPUs to physical CPUs on this machine?

What is considered normal behaviour for glxgears on this machine?

Have you noticed erratic behaviour in any workload other than glxgears? I ask because glxgears is widely considered an invalid benchmark. It tests a very limited number of 3D rendering features, and worse, better glxgears performance does not necessarily mean better 3D performance. It can sometimes be gamed by disabling 3D rendering entirely, which artificially yields a higher fps by virtue of the fact that not all frames are actually rendered.

As a heads-up, I'm mostly offline at the moment and will not be very responsive until the new year.
Comment 5 Thomas Hellstrom 2013-12-24 12:30:31 UTC
Hi. Some quick answers (I'll be more responsive after New Year's as well):

AFAICT, the host is not a NUMA machine; it's an Intel Core i7 with 8 cores (I think two physical cores with 4 HT cores each?).

The virtual machine uses two physical cores with 1 HT each.

The dmesg is from the virtual machine. The fact that you are seeing 3.12.0-rc4+ is because that's what you end up with when bisecting; the branch on which the work was done was probably based on 3.12.0-rc4+. So the dmesg output is from the exact commit, or possibly the commit before it.

Also a note on glxgears: it's true that it's not a good benchmark for triangle throughput, but it is a good swapbuffer or fill-rate benchmark, and as a GPU driver developer I find it useful because the GPU command queue typically contains two frames' worth of glxgears rendering plus swapbuffers. Hence the benchmark is GPU bound and should have a steady frame rate unless something is wrong: either the CPU is slow, or something is keeping the CPU from filling the GPU command queue, and in this case I think it's the latter. Since the X server, a compositing manager, and the app itself are all involved, a delay anywhere will get noticed.

More info after new year.

Thanks,
Thomas
Comment 6 Rik van Riel 2013-12-24 15:33:55 UTC
Are you running the same kernel on the guest and the host?

Did you bisect in just the guest?  If so, what is the host running?

It might be good to know whether it is scheduling in the host that is causing your problem, or scheduling in the guest, or both...
Comment 7 Thomas Hellstrom 2013-12-25 10:50:42 UTC
The host is Fedora 17 with kernel 3.9.10-100.fc17.x86_64, with

CONFIG_NUMA_BALANCING_DEFAULT_ENABLED=y
CONFIG_NUMA_BALANCING=y

The host configuration has remained unchanged throughout testing.

/Thomas
Comment 8 Thomas Hellstrom 2013-12-25 14:52:34 UTC
Some other data points:

If I alter the VM configuration to only advertise a single CPU core to the guest, performance is back to normal (or even slightly better).

If I disable CONFIG_NUMA_BALANCING in the guest kernel: No change. (still erratic performance)

If I disable CONFIG_NUMA in the guest kernel: No change. (still erratic performance)

/Thomas
Comment 9 Rik van Riel 2013-12-26 15:18:08 UTC
Thomas,

do you have CONFIG_FAIR_GROUP_SCHED enabled?

Without NUMA, the top parts of changeset 58d081b5082dd85e02ac9a1fb151d97395340a09 will make no difference, since that code will never get called.

That leaves only these hunks of the patch as having a possible effect:

@@ -3292,7 +3366,7 @@ static long effective_load(struct task_group *tg, int cpu, l
 {
        struct sched_entity *se = tg->se[cpu];
 
-       if (!tg->parent)        /* the trivial, non-cgroup case */
+       if (!tg->parent || !wl) /* the trivial, non-cgroup case */
                return wl;
 
        for_each_sched_entity(se) {
@@ -3345,8 +3419,7 @@ static long effective_load(struct task_group *tg, int cpu, l
 }
 #else
 
-static inline unsigned long effective_load(struct task_group *tg, int cpu,
-               unsigned long wl, unsigned long wg)
+static long effective_load(struct task_group *tg, int cpu, long wl, long wg)
 {
        return wl;
 }

If you have CONFIG_FAIR_GROUP_SCHED enabled, could you try undoing the first of the two patch hunks above, to see if that makes a difference?

I have not looked at the math in effective_load in any detail, but I simply do not know what else in the changeset could cause what you observed...
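For reference, here is a minimal, self-contained C sketch of the control-flow difference those hunks introduce. struct task_group and hierarchy_walk() below are stand-ins, not the kernel's definitions; only the early-return condition matches the patch. (If I read wake_affine() right, it calls effective_load() with wl == 0 for the previous CPU, so with FAIR_GROUP_SCHED the early return changes the load estimate used for the wake-affine decision.)

/*
 * Sketch only, not the kernel code: illustrates that the patched check
 * skips the cgroup hierarchy walk whenever wl == 0, whereas the old code
 * skipped it only for the root group.
 */
#include <stdio.h>

struct task_group { struct task_group *parent; };

static int walk_count;

/* Stand-in for the for_each_sched_entity() loop that re-weights wl
 * through the cgroup hierarchy. */
static long hierarchy_walk(struct task_group *tg, int cpu, long wl, long wg)
{
	(void)tg; (void)cpu; (void)wg;
	walk_count++;		/* the real code recomputes wl per level */
	return wl;
}

/* Before the commit: the hierarchy is walked even when wl == 0. */
static long effective_load_old(struct task_group *tg, int cpu, long wl, long wg)
{
	if (!tg->parent)	/* the trivial, non-cgroup case */
		return wl;
	return hierarchy_walk(tg, cpu, wl, wg);
}

/* After the commit: wl == 0 short-circuits, and wg is never consulted. */
static long effective_load_new(struct task_group *tg, int cpu, long wl, long wg)
{
	if (!tg->parent || !wl)
		return wl;
	return hierarchy_walk(tg, cpu, wl, wg);
}

int main(void)
{
	struct task_group root  = { .parent = NULL };
	struct task_group child = { .parent = &root };

	effective_load_old(&child, 0, 0, -1024);
	printf("old: hierarchy walked %d time(s) for wl == 0\n", walk_count);

	walk_count = 0;
	effective_load_new(&child, 0, 0, -1024);
	printf("new: hierarchy walked %d time(s) for wl == 0\n", walk_count);
	return 0;
}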
Comment 10 Thomas Hellstrom 2013-12-26 19:22:26 UTC
Indeed, as you suggest, the following diff returns things to normal...

Tested both on top of 58d081b5082dd85e02ac9a1fb151d97395340a09 and on top of Linus' master (3.13-rc5+).


diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index c7395d9..e64b079 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -3923,7 +3923,7 @@ static long effective_load(struct task_group *tg, int cpu,
 long wl, long wg)
 {
 	struct sched_entity *se = tg->se[cpu];
 
-	if (!tg->parent || !wl)	/* the trivial, non-cgroup case */
+	if (!tg->parent)	/* the trivial, non-cgroup case */
 		return wl;
 
 	for_each_sched_entity(se) {


/Thomas
Comment 11 Thomas Hellstrom 2014-01-06 08:10:47 UTC
Ping?

Is anybody picking this up?

Thanks,
Thomas
Comment 12 Mel Gorman 2014-01-06 11:41:53 UTC
(In reply to Thomas Hellstrom from comment #11)
> Ping?
> 
> Is anybody picking this up?
> 

Today was my first day back after weather-related downtime. A candidate patch has been posted for comment by Peter and Rik, with you cc'd. It should not take too long to document it properly with a changelog and punt the result, with proper signed-offs, to Ingo. Sorry about the delay.

Thanks
