Bug 116701 - Multi-threaded programs forced onto a set of isolated cores have all threads scheduled on one CPU.
Status: NEW
Alias: None
Product: Process Management
Classification: Unclassified
Component: Scheduler
Hardware: All
OS: Linux
Importance: P1 normal
Assignee: Ingo Molnar
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2016-04-19 14:16 UTC by Edd Barrett
Modified: 2016-06-11 20:15 UTC
CC List: 4 users

See Also:
Kernel Version: 3.16.7
Subsystem:
Regression: No
Bisected commit-id:


Attachments

Description Edd Barrett 2016-04-19 14:16:37 UTC
Hi,

I've noticed an odd thread scheduling behaviour on a Debian 8 system running a 3.16.7 kernel. This may be a bug, or it may be that my expectations are wrong.

Here is a test case that simply spawns 16 never-ending threads:

```
#include <stdio.h>
#include <pthread.h>
#include <err.h>
#include <errno.h>
#include <unistd.h>
#include <stdlib.h>

#define NTHR    16
#define TIME    60 * 5

void *
do_stuff(void *arg)
{
    int i = 0;

    (void) arg;
    while (1) {
        i += i;
        usleep(10000); /* don't dominate the CPU */
    }
}

int
main(void)
{
    pthread_t   threads[NTHR];
    int     rv, i;

    for (i = 0; i < NTHR; i++) {
        rv = pthread_create(&threads[i], NULL, do_stuff, NULL);
        if (rv) {
            errno = rv; /* pthread_create() returns the error number directly */
            perror("pthread_create");
            return (EXIT_FAILURE);
        }
    }
    sleep(TIME);
    exit(EXIT_SUCCESS);
}
```

 * If I compile and run this on a kernel with no isolated CPUs, then the threads are spread out over my 4 CPUs. This is expected as the default affinity mask includes all non-isolated cores.

 * Booted with `isolcpus=2,3`, running the program without taskset distributes threads over cores 0 and 1. This is expected as the default affinity mask now excludes cores 2 and 3.

 * Booted with `isolcpus=2,3`, running with `taskset -c 0,1` also distributes threads over cores 0 and 1. This is expected, as I constrained the affinity of the process to cores 0 and 1.

 * Booted with `isolcpus=2,3`, and running with `taskset -c 2,3` causes all threads to go onto the same core (I saw them all on core 2 in my setup). I did not expect this.

In this latter case, would we not expect the threads to spread out over cores 2 and 3 (just as `taskset -c 0,1` caused the threads to spread out over cores 0 and 1)? Why does the fact that the process is forced onto a set of isolated cores affect the thread scheduling decisions?
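
For what it's worth, the placement can also be checked from inside the program rather than with an external tool; here is a minimal sketch using glibc's `sched_getcpu()` (the helper name is illustrative, intended as a drop-in for `do_stuff()` in the test case above):

```
#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>
#include <unistd.h>

/* Drop-in replacement for do_stuff(): each thread periodically reports
 * the CPU it is currently running on. */
void *
report_cpu(void *arg)
{
    (void) arg;
    while (1) {
        printf("thread running on cpu %d\n", sched_getcpu());
        sleep(1);
    }
    return NULL; /* not reached */
}
```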

I also notice that if you use `chrt` to run the process with the FIFO scheduling policy (SCHED_FIFO) and then use `taskset -c 2,3` to force the process onto the isolated cores, the threads *do* spread out over cores 2 and 3. In other words, the "odd" behaviour only shows under the default scheduling policy, not under the FIFO real-time policy.
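
For reference, the same policy switch can also be made from inside the test program; a minimal sketch of the equivalent call (it requires root or CAP_SYS_NICE, and with NPTL the threads created afterwards inherit the policy by default):

```
#include <sched.h>
#include <err.h>
#include <stdlib.h>

/* Roughly what `chrt -f 1 ./threads` does from the command line: switch the
 * calling process to SCHED_FIFO before the threads are spawned. */
static void
use_fifo_policy(void)
{
    struct sched_param sp = { .sched_priority = 1 };

    if (sched_setscheduler(0, SCHED_FIFO, &sp) == -1)
        err(EXIT_FAILURE, "sched_setscheduler");
}
```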

In case it matters, this is a NO_HZ_FULL_ALL tickless kernel, so all cores apart from 0 are in adaptive-ticks mode. The CPU is an Intel i7-4790K with hyper-threading disabled.

This was briefly discussed in a Stack Overflow question I raised here:
http://stackoverflow.com/questions/36604360/why-does-using-taskset-to-run-a-multi-threaded-linux-program-on-a-set-of-isolate?noredirect=1#comment60835562_36604360

I'm re-raising the issue here, as no consensus was reached on whether this is actually a bug.

Thanks
Comment 1 Edd Barrett 2016-05-09 09:37:33 UTC
Hi,

On Friday afternoon I spent some more time on this. I learned how to use systemtap and I think I am a little closer to understanding the issue.

I installed Debian 8 in a qemu virtual machine. This came with Linux-3.16.0 (Debian patch level 4).

In my Stack Overflow thread (linked above), someone suggested that the function `select_task_rq_fair` always returns the same CPU when a multi-threaded process is pinned to an isolated core. This was my starting point.

I then devised the following systemtap script to use with my test case (as in the previous post, but with NTHR=4):

```
global N = 0

probe kernel.function("select_task_rq_fair") {
        if (execname() == "threads") {
                printf(">>> %s select_task_rq_fair\n", execname())
        }
}

probe kernel.function("select_task_rq_fair").return {
        if (execname() == "threads") {
                printf("<<< %s select_task_rq_fair: %s\n", execname(), $$return)

                N++

                // stop when 4 threads have been scheduled
                if (N == 4) {
                        exit()
                }
        }
}

probe kernel.statement("select_task_rq_fair@fair.c:*") {
        if (execname() == "threads") {
                printf("%s\n", pp())
        }
}

probe kernel.statement("select_task_rq_fair@fair.c:4496") {
        if (execname() == "threads") {
                printf("  want_affine=%d\n", $want_affine)
        }
}

probe begin {
        printf("stap running\n")
}

```

The binary I am testing is called "threads" hence the `if (execname() == "threads")` lines.

I'm printing info when:
 * select_task_rq_fair is entered.
 * select_task_rq_fair returned, including return value.
 * a line of select_task_rq_fair is executed, including line number.
 * want_affine at line 4496.

I quit tracing after the 4 threads have all been scheduled.

Now, with the VM booted with 4 CPUs and `isolcpus=2,3`, I captured the output from a normal run (`./threads`) and from a run forcing the process onto the set of isolated cores (`taskset -c 2,3 ./threads`). I then diffed the output.

The first thing I notice:

```
--- normal.trace        2016-05-06 18:30:27.964000000 +0100
+++ taskset.trace       2016-05-06 18:31:27.884000000 +0100
...
-<<< threads select_task_rq_fair: return=0x0
+<<< threads select_task_rq_fair: return=0x2
...
-<<< threads select_task_rq_fair: return=0x0
+<<< threads select_task_rq_fair: return=0x2
...
-<<< threads select_task_rq_fair: return=0x1
+<<< threads select_task_rq_fair: return=0x2
...
-<<< threads select_task_rq_fair: return=0x0
+<<< threads select_task_rq_fair: return=0x2

```

This confirms the effect we are seeing. The taskset threads all got scheduled on core 0x2, whereas the normal run put threads on cores 0x0 and 0x1 (remember, 0x2 and 0x3 are off-limits without taskset).

Looking deeper, aside from the return value, each thread's diff looks a little different, but what is consistently different between the runs is that line 4497, which is always executed on a normal run, is never executed on a taskset run, e.g.:

```
 kernel.statement("select_task_rq_fair@/build/linux-lqALYs/linux-3.16.7-ckt25/kernel/sched/fair.c:4488")
 kernel.statement("select_task_rq_fair@/build/linux-lqALYs/linux-3.16.7-ckt25/kernel/sched/fair.c:4472")
-kernel.statement("select_task_rq_fair@/build/linux-lqALYs/linux-3.16.7-ckt25/kernel/sched/fair.c:4497")
-kernel.statement("select_task_rq_fair@/build/linux-lqALYs/linux-3.16.7-ckt25/kernel/sched/fair.c:4489")
-  want_affine=1
-kernel.statement("select_task_rq_fair@/build/linux-lqALYs/linux-3.16.7-ckt25/kernel/sched/fair.c:4496")
-kernel.statement("select_task_rq_fair@/build/linux-lqALYs/linux-3.16.7-ckt25/kernel/sched/fair.c:4506")
 kernel.statement("select_task_rq_fair@/build/linux-lqALYs/linux-3.16.7-ckt25/kernel/sched/fair.c:4509")
```

This is the following section of code:


```
4488         for_each_domain(cpu, tmp) {
4489                 if (!(tmp->flags & SD_LOAD_BALANCE))
4490                         continue;
4491 
4492                 /*
4493                  * If both cpu and prev_cpu are part of this domain,
4494                  * cpu is a valid SD_WAKE_AFFINE target.
4495                  */
4496                 if (want_affine && (tmp->flags & SD_WAKE_AFFINE) &&
4497         ---->       cpumask_test_cpu(prev_cpu, sched_domain_span(tmp))) {
4498                         affine_sd = tmp;
4499                         break;
4500                 }
4501 
4502                 if (tmp->flags & sd_flag)
4503                         sd = tmp;
4504         }
4505 
4506         if (affine_sd && cpu != prev_cpu && wake_affine(affine_sd, p, sync))
```

Notice that I am also printing want_affine. Its value varies between threads on the normal run (1 for some threads, 0 for others), and it is never printed on the taskset run because the probe at line 4496 is not hit there. So, if the statement-to-line mapping is to be believed (I'm not sure it is), we know that one of the other conditions of that branch can also fail.

Without pumping more time into this, I think there is a bug in the affine-wake logic when isolated cores are in use. If I had to guess, it seems that for SCHED_OTHER, affine wakes are disabled when the process is forced onto a set of isolated cores.

Is there a way to capture the evaluation of `tmp->flags & SD_WAKE_AFFINE` and `cpumask_test_cpu(prev_cpu, sched_domain_span(tmp))` in systemtap?

Thanks
Comment 2 Edd Barrett 2016-05-09 09:39:41 UTC
For completeness, here is the whole trace diff generated from the above systemtap script:

```
--- normal.trace	2016-05-06 18:30:27.964000000 +0100
+++ taskset.trace	2016-05-06 18:31:27.884000000 +0100
@@ -3,36 +3,20 @@
 >>> threads select_task_rq_fair
 kernel.statement("select_task_rq_fair@/build/linux-lqALYs/linux-3.16.7-ckt25/kernel/sched/fair.c:4473")
 kernel.statement("select_task_rq_fair@/build/linux-lqALYs/linux-3.16.7-ckt25/kernel/sched/fair.c:4478")
-kernel.statement("select_task_rq_fair@/build/linux-lqALYs/linux-3.16.7-ckt25/kernel/sched/fair.c:4481")
-kernel.statement("select_task_rq_fair@/build/linux-lqALYs/linux-3.16.7-ckt25/kernel/sched/fair.c:4482")
-kernel.statement("select_task_rq_fair@/build/linux-lqALYs/linux-3.16.7-ckt25/kernel/sched/fair.c:4488")
-kernel.statement("select_task_rq_fair@/build/linux-lqALYs/linux-3.16.7-ckt25/kernel/sched/fair.c:4472")
-kernel.statement("select_task_rq_fair@/build/linux-lqALYs/linux-3.16.7-ckt25/kernel/sched/fair.c:4497")
-kernel.statement("select_task_rq_fair@/build/linux-lqALYs/linux-3.16.7-ckt25/kernel/sched/fair.c:4489")
-  want_affine=1
-kernel.statement("select_task_rq_fair@/build/linux-lqALYs/linux-3.16.7-ckt25/kernel/sched/fair.c:4496")
-kernel.statement("select_task_rq_fair@/build/linux-lqALYs/linux-3.16.7-ckt25/kernel/sched/fair.c:4506")
-kernel.statement("select_task_rq_fair@/build/linux-lqALYs/linux-3.16.7-ckt25/kernel/sched/fair.c:4509")
-kernel.statement("select_task_rq_fair@/build/linux-lqALYs/linux-3.16.7-ckt25/kernel/sched/fair.c:4511")
 kernel.statement("select_task_rq_fair@/build/linux-lqALYs/linux-3.16.7-ckt25/kernel/sched/fair.c:4552")
-<<< threads select_task_rq_fair: return=0x0
+<<< threads select_task_rq_fair: return=0x2
 kernel.statement("select_task_rq_fair@/build/linux-lqALYs/linux-3.16.7-ckt25/kernel/sched/fair.c:4471")
 >>> threads select_task_rq_fair
 kernel.statement("select_task_rq_fair@/build/linux-lqALYs/linux-3.16.7-ckt25/kernel/sched/fair.c:4473")
 kernel.statement("select_task_rq_fair@/build/linux-lqALYs/linux-3.16.7-ckt25/kernel/sched/fair.c:4478")
 kernel.statement("select_task_rq_fair@/build/linux-lqALYs/linux-3.16.7-ckt25/kernel/sched/fair.c:4481")
-kernel.statement("select_task_rq_fair@/build/linux-lqALYs/linux-3.16.7-ckt25/kernel/sched/fair.c:4482")
+kernel.statement("select_task_rq_fair@/build/linux-lqALYs/linux-3.16.7-ckt25/kernel/sched/fair.c:4475")
 kernel.statement("select_task_rq_fair@/build/linux-lqALYs/linux-3.16.7-ckt25/kernel/sched/fair.c:4488")
 kernel.statement("select_task_rq_fair@/build/linux-lqALYs/linux-3.16.7-ckt25/kernel/sched/fair.c:4472")
-kernel.statement("select_task_rq_fair@/build/linux-lqALYs/linux-3.16.7-ckt25/kernel/sched/fair.c:4497")
-kernel.statement("select_task_rq_fair@/build/linux-lqALYs/linux-3.16.7-ckt25/kernel/sched/fair.c:4489")
-  want_affine=1
-kernel.statement("select_task_rq_fair@/build/linux-lqALYs/linux-3.16.7-ckt25/kernel/sched/fair.c:4496")
-kernel.statement("select_task_rq_fair@/build/linux-lqALYs/linux-3.16.7-ckt25/kernel/sched/fair.c:4506")
 kernel.statement("select_task_rq_fair@/build/linux-lqALYs/linux-3.16.7-ckt25/kernel/sched/fair.c:4509")
-kernel.statement("select_task_rq_fair@/build/linux-lqALYs/linux-3.16.7-ckt25/kernel/sched/fair.c:4511")
+kernel.statement("select_task_rq_fair@/build/linux-lqALYs/linux-3.16.7-ckt25/kernel/sched/fair.c:4514")
 kernel.statement("select_task_rq_fair@/build/linux-lqALYs/linux-3.16.7-ckt25/kernel/sched/fair.c:4552")
-<<< threads select_task_rq_fair: return=0x0
+<<< threads select_task_rq_fair: return=0x2
 kernel.statement("select_task_rq_fair@/build/linux-lqALYs/linux-3.16.7-ckt25/kernel/sched/fair.c:4471")
 >>> threads select_task_rq_fair
 kernel.statement("select_task_rq_fair@/build/linux-lqALYs/linux-3.16.7-ckt25/kernel/sched/fair.c:4473")
@@ -41,20 +25,10 @@
 kernel.statement("select_task_rq_fair@/build/linux-lqALYs/linux-3.16.7-ckt25/kernel/sched/fair.c:4475")
 kernel.statement("select_task_rq_fair@/build/linux-lqALYs/linux-3.16.7-ckt25/kernel/sched/fair.c:4488")
 kernel.statement("select_task_rq_fair@/build/linux-lqALYs/linux-3.16.7-ckt25/kernel/sched/fair.c:4472")
-kernel.statement("select_task_rq_fair@/build/linux-lqALYs/linux-3.16.7-ckt25/kernel/sched/fair.c:4497")
-kernel.statement("select_task_rq_fair@/build/linux-lqALYs/linux-3.16.7-ckt25/kernel/sched/fair.c:4489")
-  want_affine=0
-kernel.statement("select_task_rq_fair@/build/linux-lqALYs/linux-3.16.7-ckt25/kernel/sched/fair.c:4496")
-kernel.statement("select_task_rq_fair@/build/linux-lqALYs/linux-3.16.7-ckt25/kernel/sched/fair.c:4502")
 kernel.statement("select_task_rq_fair@/build/linux-lqALYs/linux-3.16.7-ckt25/kernel/sched/fair.c:4509")
 kernel.statement("select_task_rq_fair@/build/linux-lqALYs/linux-3.16.7-ckt25/kernel/sched/fair.c:4514")
-kernel.statement("select_task_rq_fair@/build/linux-lqALYs/linux-3.16.7-ckt25/kernel/sched/fair.c:4540")
-kernel.statement("select_task_rq_fair@/build/linux-lqALYs/linux-3.16.7-ckt25/kernel/sched/fair.c:4518")
-kernel.statement("select_task_rq_fair@/build/linux-lqALYs/linux-3.16.7-ckt25/kernel/sched/fair.c:4530")
-kernel.statement("select_task_rq_fair@/build/linux-lqALYs/linux-3.16.7-ckt25/kernel/sched/fair.c:4538")
-kernel.statement("select_task_rq_fair@/build/linux-lqALYs/linux-3.16.7-ckt25/kernel/sched/fair.c:4541")
 kernel.statement("select_task_rq_fair@/build/linux-lqALYs/linux-3.16.7-ckt25/kernel/sched/fair.c:4552")
-<<< threads select_task_rq_fair: return=0x1
+<<< threads select_task_rq_fair: return=0x2
 kernel.statement("select_task_rq_fair@/build/linux-lqALYs/linux-3.16.7-ckt25/kernel/sched/fair.c:4471")
 >>> threads select_task_rq_fair
 kernel.statement("select_task_rq_fair@/build/linux-lqALYs/linux-3.16.7-ckt25/kernel/sched/fair.c:4473")
@@ -63,15 +37,7 @@
 kernel.statement("select_task_rq_fair@/build/linux-lqALYs/linux-3.16.7-ckt25/kernel/sched/fair.c:4475")
 kernel.statement("select_task_rq_fair@/build/linux-lqALYs/linux-3.16.7-ckt25/kernel/sched/fair.c:4488")
 kernel.statement("select_task_rq_fair@/build/linux-lqALYs/linux-3.16.7-ckt25/kernel/sched/fair.c:4472")
-kernel.statement("select_task_rq_fair@/build/linux-lqALYs/linux-3.16.7-ckt25/kernel/sched/fair.c:4497")
-kernel.statement("select_task_rq_fair@/build/linux-lqALYs/linux-3.16.7-ckt25/kernel/sched/fair.c:4489")
-  want_affine=0
-kernel.statement("select_task_rq_fair@/build/linux-lqALYs/linux-3.16.7-ckt25/kernel/sched/fair.c:4496")
-kernel.statement("select_task_rq_fair@/build/linux-lqALYs/linux-3.16.7-ckt25/kernel/sched/fair.c:4502")
 kernel.statement("select_task_rq_fair@/build/linux-lqALYs/linux-3.16.7-ckt25/kernel/sched/fair.c:4509")
 kernel.statement("select_task_rq_fair@/build/linux-lqALYs/linux-3.16.7-ckt25/kernel/sched/fair.c:4514")
-kernel.statement("select_task_rq_fair@/build/linux-lqALYs/linux-3.16.7-ckt25/kernel/sched/fair.c:4540")
-kernel.statement("select_task_rq_fair@/build/linux-lqALYs/linux-3.16.7-ckt25/kernel/sched/fair.c:4518")
-kernel.statement("select_task_rq_fair@/build/linux-lqALYs/linux-3.16.7-ckt25/kernel/sched/fair.c:4525")
 kernel.statement("select_task_rq_fair@/build/linux-lqALYs/linux-3.16.7-ckt25/kernel/sched/fair.c:4552")
-<<< threads select_task_rq_fair: return=0x0
+<<< threads select_task_rq_fair: return=0x2
```
Comment 3 Edd Barrett 2016-05-10 09:44:58 UTC
FWIW, `taskset -a` does not help either.
Comment 4 Paul Mehrer 2016-06-10 10:25:33 UTC
Hi Edd,

(In reply to Edd Barrett from comment #0)
> I've noticed an odd thread scheduling behaviour on a Debian 8 system running
> a 3.16.7 kernel. This may be a bug, or it may be that my expectations are
> wrong.


I could reproduce this behavior (on 4.4.12), but reading kernel/sched/core.c it seems NOT to be a bug:

```
/*
 * Set up scheduler domains and groups. [...]
 */
static int init_sched_domains(const struct cpumask *cpu_map)
{ [...]
        cpumask_andnot(doms_cur[0], cpu_map, cpu_isolated_map);
        err = build_sched_domains(doms_cur[0], NULL);
```

=> for isolated CPUs, no scheduling domains are built. These CPUs are excluded from the scheduler's load balancing, so tasks stay on whichever CPU they first land on unless they are placed there explicitly.
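
If you want to keep using isolcpus, the intended pattern is to place threads on the isolated CPUs explicitly, since nothing will spread them for you. A minimal sketch (the helper name is made up, and cores 2/3 are just the example set from this report):

```
#define _GNU_SOURCE
#include <pthread.h>
#include <sched.h>

/* Pin thread number `idx` onto one of the isolated cores, alternating
 * between cores 2 and 3.  Call this right after pthread_create(). */
static int
pin_to_isolated(pthread_t thr, int idx)
{
    static const int isolated[] = { 2, 3 };
    cpu_set_t set;

    CPU_ZERO(&set);
    CPU_SET(isolated[idx % 2], &set);
    return pthread_setaffinity_np(thr, sizeof(set), &set);
}
```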

That said, what you are looking for is actually cgroups (specifically the cpuset controller). See:
https://www.kernel.org/doc/Documentation/cgroup-v1/cgroups.txt

and maybe you even want to use a container runtime like Docker instead of dealing with cgroups yourself.

Consider closing this bug report

best regards
Paul Mehrer
Comment 5 Edd Barrett 2016-06-10 16:22:30 UTC
That may be so, but don't you find it odd (or at least inconsistent) that if you use a real-time scheduling policy, the threads *do* spread across the isolated cores? Arguably that is a bug in either the code or the documentation.

I did end up using cgroups, specifically `cset shield`, and this worked just fine.
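
For anyone finding this later: roughly speaking, `cset shield` sets up a cgroup-v1 cpuset for you (and also migrates other tasks off the shielded CPUs, which the sketch below does not attempt). A hand-rolled equivalent of the placement part, assuming the cpuset controller is mounted at /sys/fs/cgroup/cpuset, with a made-up group name and error handling trimmed:

```
#include <stdio.h>
#include <sys/stat.h>
#include <sys/types.h>

static void
write_str(const char *path, const char *val)
{
    FILE *f = fopen(path, "w");

    if (f != NULL) {
        fputs(val, f);
        fclose(f);
    }
}

/* Create a cpuset covering CPUs 2-3 and move the whole process (all of its
 * threads) into it via cgroup.procs.  cpuset.mems must be populated before
 * any task can be attached. */
static void
shield_pid(pid_t pid)
{
    char buf[32];

    mkdir("/sys/fs/cgroup/cpuset/shield", 0755);
    write_str("/sys/fs/cgroup/cpuset/shield/cpuset.cpus", "2-3");
    write_str("/sys/fs/cgroup/cpuset/shield/cpuset.mems", "0");
    snprintf(buf, sizeof(buf), "%d", (int) pid);
    write_str("/sys/fs/cgroup/cpuset/shield/cgroup.procs", buf);
}
```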

This does raise the question of why there are two CPU isolation mechanisms. Perhaps, if cgroups can do everything isolcpus can, isolcpus could be removed altogether. Although that's a bold move, and probably a separate issue.
