Hi,

I've noticed an odd thread-scheduling behaviour on a Debian 8 system running a 3.16.7 kernel. This may be a bug, or it may be that my expectations are wrong.

Here is a test case which simply spawns 16 never-ending threads:

```
#include <stdio.h>
#include <pthread.h>
#include <err.h>
#include <unistd.h>
#include <stdlib.h>
#include <string.h>

#define NTHR	16
#define TIME	(60 * 5)

void *
do_stuff(void *arg)
{
	int i = 0;

	(void) arg;
	while (1) {
		i += i;
		usleep(10000); /* don't dominate the CPU */
	}
}

int
main(void)
{
	pthread_t threads[NTHR];
	int rv, i;

	for (i = 0; i < NTHR; i++) {
		rv = pthread_create(&threads[i], NULL, do_stuff, NULL);
		if (rv) {
			/* pthread_create returns the error code; it does not set errno */
			errx(EXIT_FAILURE, "pthread_create: %s", strerror(rv));
		}
	}
	sleep(TIME);
	exit(EXIT_SUCCESS);
}
```

* If I compile and run this on a kernel with no isolated CPUs, the threads are spread out over my 4 CPUs. This is expected, as the default affinity mask includes all non-isolated cores.
* Booted with `isolcpus=2,3`, running the program without taskset distributes the threads over cores 0 and 1. This is expected, as the default affinity mask now excludes cores 2 and 3.
* Booted with `isolcpus=2,3`, running with `taskset -c 0,1` also distributes the threads over cores 0 and 1. This is expected, as I constrained the affinity of the process to cores 0 and 1.
* Booted with `isolcpus=2,3`, running with `taskset -c 2,3` causes all threads to land on the same core (I saw them all on core 2 in my setup). I did not expect this.

In this latter case, would we not expect the threads to spread out over cores 2 and 3, just as `taskset -c 0,1` spread them out over cores 0 and 1? Why does forcing the process onto a set of isolated cores affect the thread scheduling decisions?

I also notice that if you use `chrt` to run the process with the FIFO scheduling policy and then use `taskset -c 2,3` to force the process onto the isolated cores, the threads *do* spread out over cores 2 and 3. In other words, the "odd" behaviour only shows under the default scheduling policy, not under the FIFO real-time policy.

In case it matters, this is a NO_HZ_FULL_ALL tickless kernel, so all cores apart from 0 are in adaptive-ticks mode. The CPU is an Intel i7-4790K with hyper-threading disabled.

This was briefly discussed in a stack overflow question I raised here:
http://stackoverflow.com/questions/36604360/why-does-using-taskset-to-run-a-multi-threaded-linux-program-on-a-set-of-isolate?noredirect=1#comment60835562_36604360

I'm re-raising the issue here, as we didn't come to a consensus as to whether this is actually a bug.

Thanks
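(In case it helps anyone reproducing this: one way to see where each thread lands, without watching it externally, is to have the threads report their own CPU. Below is a minimal, untested sketch along those lines; it assumes glibc's `sched_getcpu()` and is not the code used for the observations above.)

```
/* Untested sketch: each worker periodically prints the CPU it is running on.
 * Assumes glibc's sched_getcpu(); compile with -pthread. */
#define _GNU_SOURCE
#include <stdio.h>
#include <pthread.h>
#include <sched.h>
#include <unistd.h>

#define NTHR 4

static void *
report_cpu(void *arg)
{
	long id = (long) arg;

	while (1) {
		printf("thread %ld on cpu %d\n", id, sched_getcpu());
		sleep(1);
	}
	return NULL; /* not reached */
}

int
main(void)
{
	pthread_t t[NTHR];
	long i;

	for (i = 0; i < NTHR; i++)
		pthread_create(&t[i], NULL, report_cpu, (void *) i);
	sleep(10);
	return 0;
}
```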
Hi,

On Friday afternoon I spent some more time on this. I learned how to use SystemTap and I think I am a little closer to understanding the issue.

I installed Debian 8 in a qemu virtual machine. This came with Linux 3.16.0-4 (Debian's build of the 3.16.7-ckt25 source, as the paths in the traces below show).

In my stack overflow thread (linked above), someone suggested that the function `select_task_rq_fair` always returns the same CPU when a multi-threaded process is pinned to an isolated core. This was my starting point.

I then devised the following SystemTap script to use with my test case (as in the previous post, but with NTHR=4):

```
global N = 0

probe kernel.function("select_task_rq_fair")
{
	if (execname() == "threads") {
		printf(">>> %s select_task_rq_fair\n", execname())
	}
}

probe kernel.function("select_task_rq_fair").return
{
	if (execname() == "threads") {
		printf("<<< %s select_task_rq_fair: %s\n", execname(), $$return)
		N++
		// stop when 4 threads have been scheduled
		if (N == 4) {
			exit()
		}
	}
}

probe kernel.statement("select_task_rq_fair@fair.c:*")
{
	if (execname() == "threads") {
		printf("%s\n", pp())
	}
}

probe kernel.statement("select_task_rq_fair@fair.c:4496")
{
	if (execname() == "threads") {
		printf(" want_affine=%d\n", $want_affine)
	}
}

probe begin
{
	printf("stap running\n")
}
```

The binary I am testing is called "threads", hence the `if (execname() == "threads")` lines. I print information when:

* `select_task_rq_fair` is entered;
* `select_task_rq_fair` returns, including the return value;
* a line of `select_task_rq_fair` is executed, including its line number;
* `want_affine` is evaluated at line 4496.

I stop tracing once all 4 threads have been scheduled.

Now, with the VM booted with 4 CPUs and `isolcpus=2,3`, I captured the output from a normal run (`./threads`) and from a run forcing the process onto the set of isolated cores (`taskset -c 2,3 ./threads`), then diffed the two outputs. The first thing I notice:

```
--- normal.trace	2016-05-06 18:30:27.964000000 +0100
+++ taskset.trace	2016-05-06 18:31:27.884000000 +0100
...
-<<< threads select_task_rq_fair: return=0x0
+<<< threads select_task_rq_fair: return=0x2
...
-<<< threads select_task_rq_fair: return=0x0
+<<< threads select_task_rq_fair: return=0x2
...
-<<< threads select_task_rq_fair: return=0x1
+<<< threads select_task_rq_fair: return=0x2
...
-<<< threads select_task_rq_fair: return=0x0
+<<< threads select_task_rq_fair: return=0x2
```

This confirms the effect we are seeing: the taskset run schedules all threads on core 0x2, whereas the normal run puts threads on cores 0x0 and 0x1 (remember, 0x2 and 0x3 are off-limits without taskset).
Looking deeper, aside from the return value, each thread's diff looks a little different, but what is consistently different between the runs is that line 4497, which is always executed on a normal run, is never executed on a taskset run, e.g.:

```
kernel.statement("select_task_rq_fair@/build/linux-lqALYs/linux-3.16.7-ckt25/kernel/sched/fair.c:4488")
kernel.statement("select_task_rq_fair@/build/linux-lqALYs/linux-3.16.7-ckt25/kernel/sched/fair.c:4472")
-kernel.statement("select_task_rq_fair@/build/linux-lqALYs/linux-3.16.7-ckt25/kernel/sched/fair.c:4497")
-kernel.statement("select_task_rq_fair@/build/linux-lqALYs/linux-3.16.7-ckt25/kernel/sched/fair.c:4489")
- want_affine=1
-kernel.statement("select_task_rq_fair@/build/linux-lqALYs/linux-3.16.7-ckt25/kernel/sched/fair.c:4496")
-kernel.statement("select_task_rq_fair@/build/linux-lqALYs/linux-3.16.7-ckt25/kernel/sched/fair.c:4506")
kernel.statement("select_task_rq_fair@/build/linux-lqALYs/linux-3.16.7-ckt25/kernel/sched/fair.c:4509")
```

This is the following section of code:

```
4488         for_each_domain(cpu, tmp) {
4489                 if (!(tmp->flags & SD_LOAD_BALANCE))
4490                         continue;
4491
4492                 /*
4493                  * If both cpu and prev_cpu are part of this domain,
4494                  * cpu is a valid SD_WAKE_AFFINE target.
4495                  */
4496                 if (want_affine && (tmp->flags & SD_WAKE_AFFINE) &&
4497 ---->               cpumask_test_cpu(prev_cpu, sched_domain_span(tmp))) {
4498                         affine_sd = tmp;
4499                         break;
4500                 }
4501
4502                 if (tmp->flags & sd_flag)
4503                         sd = tmp;
4504         }
4505
4506         if (affine_sd && cpu != prev_cpu && wake_affine(affine_sd, p, sync))
```

Notice that I am also printing `want_affine`. Its value varies between threads (1 for the first two wakeups, 0 for the later ones in the normal trace), so if the line numbers are to be believed (I'm not sure they are), we know that one of the other conditions in that branch can also fail.

Without sinking more time into this, I think there is a bug in the affine-wake logic when isolated cores are in use. If I had to guess, it seems that, for SCHED_OTHER, affine wakes are effectively switched off once the process is forced onto isolated cores.

Is there a way to capture the evaluation of `tmp->flags & SD_WAKE_AFFINE` and `cpumask_test_cpu(prev_cpu, sched_domain_span(tmp))` in systemtap?

Thanks
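P.S. Thinking about it a bit more, something like the probe below might at least dump the raw inputs to that condition. This is an untested sketch: it assumes `$tmp`, `$want_affine` and `$prev_cpu` are still visible at fair.c:4496 despite optimisation, and that SD_WAKE_AFFINE is 0x20 on this kernel (worth double-checking against include/linux/sched.h). Evaluating `cpumask_test_cpu()` itself would presumably need guru mode and embedded C.

```
probe kernel.statement("select_task_rq_fair@fair.c:4496")
{
	if (execname() == "threads") {
		/* dump the fields feeding the affine-wake test */
		printf("  want_affine=%d prev_cpu=%d tmp->flags=0x%x sd_wake_affine=%d\n",
		       $want_affine, $prev_cpu, $tmp->flags,
		       ($tmp->flags & 0x20) ? 1 : 0)
	}
}
```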
For completeness, here is the whole trace diff generated from the above systemtap script:

```
--- normal.trace	2016-05-06 18:30:27.964000000 +0100
+++ taskset.trace	2016-05-06 18:31:27.884000000 +0100
@@ -3,36 +3,20 @@
>>> threads select_task_rq_fair
kernel.statement("select_task_rq_fair@/build/linux-lqALYs/linux-3.16.7-ckt25/kernel/sched/fair.c:4473")
kernel.statement("select_task_rq_fair@/build/linux-lqALYs/linux-3.16.7-ckt25/kernel/sched/fair.c:4478")
-kernel.statement("select_task_rq_fair@/build/linux-lqALYs/linux-3.16.7-ckt25/kernel/sched/fair.c:4481")
-kernel.statement("select_task_rq_fair@/build/linux-lqALYs/linux-3.16.7-ckt25/kernel/sched/fair.c:4482")
-kernel.statement("select_task_rq_fair@/build/linux-lqALYs/linux-3.16.7-ckt25/kernel/sched/fair.c:4488")
-kernel.statement("select_task_rq_fair@/build/linux-lqALYs/linux-3.16.7-ckt25/kernel/sched/fair.c:4472")
-kernel.statement("select_task_rq_fair@/build/linux-lqALYs/linux-3.16.7-ckt25/kernel/sched/fair.c:4497")
-kernel.statement("select_task_rq_fair@/build/linux-lqALYs/linux-3.16.7-ckt25/kernel/sched/fair.c:4489")
- want_affine=1
-kernel.statement("select_task_rq_fair@/build/linux-lqALYs/linux-3.16.7-ckt25/kernel/sched/fair.c:4496")
-kernel.statement("select_task_rq_fair@/build/linux-lqALYs/linux-3.16.7-ckt25/kernel/sched/fair.c:4506")
-kernel.statement("select_task_rq_fair@/build/linux-lqALYs/linux-3.16.7-ckt25/kernel/sched/fair.c:4509")
-kernel.statement("select_task_rq_fair@/build/linux-lqALYs/linux-3.16.7-ckt25/kernel/sched/fair.c:4511")
kernel.statement("select_task_rq_fair@/build/linux-lqALYs/linux-3.16.7-ckt25/kernel/sched/fair.c:4552")
-<<< threads select_task_rq_fair: return=0x0
+<<< threads select_task_rq_fair: return=0x2
kernel.statement("select_task_rq_fair@/build/linux-lqALYs/linux-3.16.7-ckt25/kernel/sched/fair.c:4471")
>>> threads select_task_rq_fair
kernel.statement("select_task_rq_fair@/build/linux-lqALYs/linux-3.16.7-ckt25/kernel/sched/fair.c:4473")
kernel.statement("select_task_rq_fair@/build/linux-lqALYs/linux-3.16.7-ckt25/kernel/sched/fair.c:4478")
kernel.statement("select_task_rq_fair@/build/linux-lqALYs/linux-3.16.7-ckt25/kernel/sched/fair.c:4481")
-kernel.statement("select_task_rq_fair@/build/linux-lqALYs/linux-3.16.7-ckt25/kernel/sched/fair.c:4482")
+kernel.statement("select_task_rq_fair@/build/linux-lqALYs/linux-3.16.7-ckt25/kernel/sched/fair.c:4475")
kernel.statement("select_task_rq_fair@/build/linux-lqALYs/linux-3.16.7-ckt25/kernel/sched/fair.c:4488")
kernel.statement("select_task_rq_fair@/build/linux-lqALYs/linux-3.16.7-ckt25/kernel/sched/fair.c:4472")
-kernel.statement("select_task_rq_fair@/build/linux-lqALYs/linux-3.16.7-ckt25/kernel/sched/fair.c:4497")
-kernel.statement("select_task_rq_fair@/build/linux-lqALYs/linux-3.16.7-ckt25/kernel/sched/fair.c:4489")
- want_affine=1
-kernel.statement("select_task_rq_fair@/build/linux-lqALYs/linux-3.16.7-ckt25/kernel/sched/fair.c:4496")
-kernel.statement("select_task_rq_fair@/build/linux-lqALYs/linux-3.16.7-ckt25/kernel/sched/fair.c:4506")
kernel.statement("select_task_rq_fair@/build/linux-lqALYs/linux-3.16.7-ckt25/kernel/sched/fair.c:4509")
-kernel.statement("select_task_rq_fair@/build/linux-lqALYs/linux-3.16.7-ckt25/kernel/sched/fair.c:4511")
+kernel.statement("select_task_rq_fair@/build/linux-lqALYs/linux-3.16.7-ckt25/kernel/sched/fair.c:4514")
kernel.statement("select_task_rq_fair@/build/linux-lqALYs/linux-3.16.7-ckt25/kernel/sched/fair.c:4552")
-<<< threads select_task_rq_fair: return=0x0
+<<< threads select_task_rq_fair: return=0x2
kernel.statement("select_task_rq_fair@/build/linux-lqALYs/linux-3.16.7-ckt25/kernel/sched/fair.c:4471")
>>> threads select_task_rq_fair
kernel.statement("select_task_rq_fair@/build/linux-lqALYs/linux-3.16.7-ckt25/kernel/sched/fair.c:4473")
@@ -41,20 +25,10 @@
kernel.statement("select_task_rq_fair@/build/linux-lqALYs/linux-3.16.7-ckt25/kernel/sched/fair.c:4475")
kernel.statement("select_task_rq_fair@/build/linux-lqALYs/linux-3.16.7-ckt25/kernel/sched/fair.c:4488")
kernel.statement("select_task_rq_fair@/build/linux-lqALYs/linux-3.16.7-ckt25/kernel/sched/fair.c:4472")
-kernel.statement("select_task_rq_fair@/build/linux-lqALYs/linux-3.16.7-ckt25/kernel/sched/fair.c:4497")
-kernel.statement("select_task_rq_fair@/build/linux-lqALYs/linux-3.16.7-ckt25/kernel/sched/fair.c:4489")
- want_affine=0
-kernel.statement("select_task_rq_fair@/build/linux-lqALYs/linux-3.16.7-ckt25/kernel/sched/fair.c:4496")
-kernel.statement("select_task_rq_fair@/build/linux-lqALYs/linux-3.16.7-ckt25/kernel/sched/fair.c:4502")
kernel.statement("select_task_rq_fair@/build/linux-lqALYs/linux-3.16.7-ckt25/kernel/sched/fair.c:4509")
kernel.statement("select_task_rq_fair@/build/linux-lqALYs/linux-3.16.7-ckt25/kernel/sched/fair.c:4514")
-kernel.statement("select_task_rq_fair@/build/linux-lqALYs/linux-3.16.7-ckt25/kernel/sched/fair.c:4540")
-kernel.statement("select_task_rq_fair@/build/linux-lqALYs/linux-3.16.7-ckt25/kernel/sched/fair.c:4518")
-kernel.statement("select_task_rq_fair@/build/linux-lqALYs/linux-3.16.7-ckt25/kernel/sched/fair.c:4530")
-kernel.statement("select_task_rq_fair@/build/linux-lqALYs/linux-3.16.7-ckt25/kernel/sched/fair.c:4538")
-kernel.statement("select_task_rq_fair@/build/linux-lqALYs/linux-3.16.7-ckt25/kernel/sched/fair.c:4541")
kernel.statement("select_task_rq_fair@/build/linux-lqALYs/linux-3.16.7-ckt25/kernel/sched/fair.c:4552")
-<<< threads select_task_rq_fair: return=0x1
+<<< threads select_task_rq_fair: return=0x2
kernel.statement("select_task_rq_fair@/build/linux-lqALYs/linux-3.16.7-ckt25/kernel/sched/fair.c:4471")
>>> threads select_task_rq_fair
kernel.statement("select_task_rq_fair@/build/linux-lqALYs/linux-3.16.7-ckt25/kernel/sched/fair.c:4473")
@@ -63,15 +37,7 @@
kernel.statement("select_task_rq_fair@/build/linux-lqALYs/linux-3.16.7-ckt25/kernel/sched/fair.c:4475")
kernel.statement("select_task_rq_fair@/build/linux-lqALYs/linux-3.16.7-ckt25/kernel/sched/fair.c:4488")
kernel.statement("select_task_rq_fair@/build/linux-lqALYs/linux-3.16.7-ckt25/kernel/sched/fair.c:4472")
-kernel.statement("select_task_rq_fair@/build/linux-lqALYs/linux-3.16.7-ckt25/kernel/sched/fair.c:4497")
-kernel.statement("select_task_rq_fair@/build/linux-lqALYs/linux-3.16.7-ckt25/kernel/sched/fair.c:4489")
- want_affine=0
-kernel.statement("select_task_rq_fair@/build/linux-lqALYs/linux-3.16.7-ckt25/kernel/sched/fair.c:4496")
-kernel.statement("select_task_rq_fair@/build/linux-lqALYs/linux-3.16.7-ckt25/kernel/sched/fair.c:4502")
kernel.statement("select_task_rq_fair@/build/linux-lqALYs/linux-3.16.7-ckt25/kernel/sched/fair.c:4509")
kernel.statement("select_task_rq_fair@/build/linux-lqALYs/linux-3.16.7-ckt25/kernel/sched/fair.c:4514")
-kernel.statement("select_task_rq_fair@/build/linux-lqALYs/linux-3.16.7-ckt25/kernel/sched/fair.c:4540")
-kernel.statement("select_task_rq_fair@/build/linux-lqALYs/linux-3.16.7-ckt25/kernel/sched/fair.c:4518")
-kernel.statement("select_task_rq_fair@/build/linux-lqALYs/linux-3.16.7-ckt25/kernel/sched/fair.c:4525")
kernel.statement("select_task_rq_fair@/build/linux-lqALYs/linux-3.16.7-ckt25/kernel/sched/fair.c:4552")
-<<< threads select_task_rq_fair: return=0x0
+<<< threads select_task_rq_fair: return=0x2
```
FWIW, `taskset -a` does not help either.
Hi Edd,

(In reply to Edd Barrett from comment #0)
> I've noticed an odd thread scheduling behaviour on a Debian 8 system running
> a 3.16.7 kernel. This may be a bug, or it may be that my expectations are
> wrong.

I could reproduce this behaviour (on 4.4.12), but reading kernel/sched/core.c it seems NOT to be a bug:

```
/*
 * Set up scheduler domains and groups. [...]
 */
static int init_sched_domains(const struct cpumask *cpu_map)
{
	[...]
	cpumask_andnot(doms_cur[0], cpu_map, cpu_isolated_map);
	err = build_sched_domains(doms_cur[0], NULL);
```

=> For isolated CPUs no scheduling domains are built. These CPUs seem to be completely excluded from load balancing.

What you are looking for is actually "cgroups" (the cpuset controller in particular). See:
https://www.kernel.org/doc/Documentation/cgroup-v1/cgroups.txt

Maybe you even want to use a container like Docker instead of dealing with cgroups yourself.

Consider closing this bug report.

best regards
Paul Mehrer
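P.S. If you want to stay with isolcpus rather than cpusets, note that the usual pattern is to place each thread on an isolated CPU yourself, since the scheduler will not balance within the isolated set. A rough, untested sketch of that (the CPU numbers just mirror your isolcpus=2,3 setup):

```
/* Untested sketch: spread worker threads over isolated CPUs 2 and 3 by
 * pinning each thread explicitly. Compile with -pthread. */
#define _GNU_SOURCE
#include <pthread.h>
#include <sched.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

#define NTHR 4

static void *
do_stuff(void *arg)
{
	(void) arg;
	while (1)
		usleep(10000);
	return NULL; /* not reached */
}

int
main(void)
{
	static const int isolated[] = { 2, 3 };
	pthread_t threads[NTHR];
	cpu_set_t set;
	int i, rv;

	for (i = 0; i < NTHR; i++) {
		rv = pthread_create(&threads[i], NULL, do_stuff, NULL);
		if (rv) {
			fprintf(stderr, "pthread_create: %s\n", strerror(rv));
			return 1;
		}
		/* alternate the threads between the two isolated CPUs */
		CPU_ZERO(&set);
		CPU_SET(isolated[i % 2], &set);
		rv = pthread_setaffinity_np(threads[i], sizeof(set), &set);
		if (rv)
			fprintf(stderr, "pthread_setaffinity_np: %s\n", strerror(rv));
	}
	sleep(60);
	return 0;
}
```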
That may be so, but don't you find it odd (or at least inconsistent) that under a real-time scheduling policy the threads *do* spread across the isolated cores? Arguably that is a bug in either the code or the documentation.

I did end up using cgroups, specifically `cset shield`, and this worked just fine. This does raise the question of why there are two process-isolation mechanisms. If cgroups can do everything isolcpus can, perhaps isolcpus could be removed altogether, although that is a bold move and probably a separate issue.