Bug 116701
| Summary: | Multi-threaded programs forced onto a set of isolated cores have all threads scheduled on one CPU. | | |
|---|---|---|---|
| Product: | Process Management | Reporter: | Edd Barrett (edd) |
| Component: | Scheduler | Assignee: | Ingo Molnar (mingo) |
| Status: | NEW | | |
| Severity: | normal | CC: | edd, mount.sarah, paulmehrer, zala.lucian |
| Priority: | P1 | | |
| Hardware: | All | | |
| OS: | Linux | | |
| Kernel Version: | 3.16.7 | Subsystem: | |
| Regression: | No | Bisected commit-id: | |
Description
Edd Barrett
2016-04-19 14:16:37 UTC
Hi,

On Friday afternoon I spent some more time on this. I learned how to use systemtap, and I think I am a little closer to understanding the issue.

I installed Debian 8 in a qemu virtual machine. This came with Linux 3.16.0 (Debian patch level 4). In my Stack Overflow thread (linked above) someone suggests that the function `select_task_rq_fair` always returns the same CPU when a process with threads is pinned to an isolated core. This was my starting point.

I then devised the following systemtap script to use with my test case (as in the previous post, but with NTHR=4):

```
global N = 0

probe kernel.function("select_task_rq_fair") {
    if (execname() == "threads") {
        printf(">>> %s select_task_rq_fair\n", execname())
    }
}

probe kernel.function("select_task_rq_fair").return {
    if (execname() == "threads") {
        printf("<<< %s select_task_rq_fair: %s\n", execname(), $$return)
        N++
        // stop when 4 threads have been scheduled
        if (N == 4) {
            exit()
        }
    }
}

probe kernel.statement("select_task_rq_fair@fair.c:*") {
    if (execname() == "threads") {
        printf("%s\n", pp())
    }
}

probe kernel.statement("select_task_rq_fair@fair.c:4496") {
    if (execname() == "threads") {
        printf(" want_affine=%d\n", $want_affine)
    }
}

probe begin {
    printf("stap running\n")
}
```

The binary I am testing is called "threads", hence the `if (execname() == "threads")` guards. I'm printing info when:

* `select_task_rq_fair` is entered.
* `select_task_rq_fair` returns, including the return value.
* a line of `select_task_rq_fair` is executed, including the line number.
* `want_affine` is evaluated at line 4496.

I quit tracing after all 4 threads have been scheduled.

Now, with the VM booted with 4 CPUs and `isolcpus=2,3`, I captured the output from a normal run (`./threads`) and from a run forcing the process onto the set of isolated cores (`taskset -c 2,3 ./threads`), then diffed the output. The first thing I notice:

```
--- normal.trace 2016-05-06 18:30:27.964000000 +0100
+++ taskset.trace 2016-05-06 18:31:27.884000000 +0100
...
-<<< threads select_task_rq_fair: return=0x0
+<<< threads select_task_rq_fair: return=0x2
...
-<<< threads select_task_rq_fair: return=0x0
+<<< threads select_task_rq_fair: return=0x2
...
-<<< threads select_task_rq_fair: return=0x1
+<<< threads select_task_rq_fair: return=0x2
...
-<<< threads select_task_rq_fair: return=0x0
+<<< threads select_task_rq_fair: return=0x2
```

This confirms the effect we are seeing. The taskset threads were all scheduled on CPU 0x2, whereas the normal run put threads on CPUs 0x0 and 0x1 (remember, 0x2 and 0x3 are off-limits without taskset).

Looking deeper, aside from the return value each thread's diff looks a little different, but what is consistently different between the runs is that line 4497, which is always executed on a normal run, is never executed on a taskset run, e.g.:

```
kernel.statement("select_task_rq_fair@/build/linux-lqALYs/linux-3.16.7-ckt25/kernel/sched/fair.c:4488")
kernel.statement("select_task_rq_fair@/build/linux-lqALYs/linux-3.16.7-ckt25/kernel/sched/fair.c:4472")
-kernel.statement("select_task_rq_fair@/build/linux-lqALYs/linux-3.16.7-ckt25/kernel/sched/fair.c:4497")
-kernel.statement("select_task_rq_fair@/build/linux-lqALYs/linux-3.16.7-ckt25/kernel/sched/fair.c:4489")
- want_affine=1
-kernel.statement("select_task_rq_fair@/build/linux-lqALYs/linux-3.16.7-ckt25/kernel/sched/fair.c:4496")
-kernel.statement("select_task_rq_fair@/build/linux-lqALYs/linux-3.16.7-ckt25/kernel/sched/fair.c:4506")
kernel.statement("select_task_rq_fair@/build/linux-lqALYs/linux-3.16.7-ckt25/kernel/sched/fair.c:4509")
```

This is the following section of code:

```
4488	for_each_domain(cpu, tmp) {
4489		if (!(tmp->flags & SD_LOAD_BALANCE))
4490			continue;
4491
4492		/*
4493		 * If both cpu and prev_cpu are part of this domain,
4494		 * cpu is a valid SD_WAKE_AFFINE target.
4495		 */
4496		if (want_affine && (tmp->flags & SD_WAKE_AFFINE) &&
4497 ---->	    cpumask_test_cpu(prev_cpu, sched_domain_span(tmp))) {
4498			affine_sd = tmp;
4499			break;
4500		}
4501
4502		if (tmp->flags & sd_flag)
4503			sd = tmp;
4504	}
4505
4506	if (affine_sd && cpu != prev_cpu && wake_affine(affine_sd, p, sync))
```

Notice that I am also printing want_affine. The value of want_affine varies on the taskset threads, so if the line numbers are to be believed (I'm not sure they are), then we know that one of the other conditions of that branch can also fail.

Without pumping more time into this, I think there is a bug in the affine wake logic when isolated cores are in use. If I had to guess, it seems that for SCHED_OTHER, affine wakes are off unless the process is forced onto isolated cores.

Is there a way to capture the evaluation of `tmp->flags & SD_WAKE_AFFINE` and `cpumask_test_cpu(prev_cpu, sched_domain_span(tmp))` in systemtap?

Thanks.

For completeness, here is the whole trace diff generated from the above systemtap script:

```
--- normal.trace 2016-05-06 18:30:27.964000000 +0100
+++ taskset.trace 2016-05-06 18:31:27.884000000 +0100
@@ -3,36 +3,20 @@
>>> threads select_task_rq_fair
kernel.statement("select_task_rq_fair@/build/linux-lqALYs/linux-3.16.7-ckt25/kernel/sched/fair.c:4473")
kernel.statement("select_task_rq_fair@/build/linux-lqALYs/linux-3.16.7-ckt25/kernel/sched/fair.c:4478")
-kernel.statement("select_task_rq_fair@/build/linux-lqALYs/linux-3.16.7-ckt25/kernel/sched/fair.c:4481")
-kernel.statement("select_task_rq_fair@/build/linux-lqALYs/linux-3.16.7-ckt25/kernel/sched/fair.c:4482")
-kernel.statement("select_task_rq_fair@/build/linux-lqALYs/linux-3.16.7-ckt25/kernel/sched/fair.c:4488")
-kernel.statement("select_task_rq_fair@/build/linux-lqALYs/linux-3.16.7-ckt25/kernel/sched/fair.c:4472")
-kernel.statement("select_task_rq_fair@/build/linux-lqALYs/linux-3.16.7-ckt25/kernel/sched/fair.c:4497")
-kernel.statement("select_task_rq_fair@/build/linux-lqALYs/linux-3.16.7-ckt25/kernel/sched/fair.c:4489")
- want_affine=1
-kernel.statement("select_task_rq_fair@/build/linux-lqALYs/linux-3.16.7-ckt25/kernel/sched/fair.c:4496")
-kernel.statement("select_task_rq_fair@/build/linux-lqALYs/linux-3.16.7-ckt25/kernel/sched/fair.c:4506")
-kernel.statement("select_task_rq_fair@/build/linux-lqALYs/linux-3.16.7-ckt25/kernel/sched/fair.c:4509")
-kernel.statement("select_task_rq_fair@/build/linux-lqALYs/linux-3.16.7-ckt25/kernel/sched/fair.c:4511")
kernel.statement("select_task_rq_fair@/build/linux-lqALYs/linux-3.16.7-ckt25/kernel/sched/fair.c:4552")
-<<< threads select_task_rq_fair: return=0x0
+<<< threads select_task_rq_fair: return=0x2
kernel.statement("select_task_rq_fair@/build/linux-lqALYs/linux-3.16.7-ckt25/kernel/sched/fair.c:4471")
>>> threads select_task_rq_fair
kernel.statement("select_task_rq_fair@/build/linux-lqALYs/linux-3.16.7-ckt25/kernel/sched/fair.c:4473")
kernel.statement("select_task_rq_fair@/build/linux-lqALYs/linux-3.16.7-ckt25/kernel/sched/fair.c:4478")
kernel.statement("select_task_rq_fair@/build/linux-lqALYs/linux-3.16.7-ckt25/kernel/sched/fair.c:4481")
-kernel.statement("select_task_rq_fair@/build/linux-lqALYs/linux-3.16.7-ckt25/kernel/sched/fair.c:4482")
+kernel.statement("select_task_rq_fair@/build/linux-lqALYs/linux-3.16.7-ckt25/kernel/sched/fair.c:4475")
kernel.statement("select_task_rq_fair@/build/linux-lqALYs/linux-3.16.7-ckt25/kernel/sched/fair.c:4488")
kernel.statement("select_task_rq_fair@/build/linux-lqALYs/linux-3.16.7-ckt25/kernel/sched/fair.c:4472")
-kernel.statement("select_task_rq_fair@/build/linux-lqALYs/linux-3.16.7-ckt25/kernel/sched/fair.c:4497")
-kernel.statement("select_task_rq_fair@/build/linux-lqALYs/linux-3.16.7-ckt25/kernel/sched/fair.c:4489")
- want_affine=1
-kernel.statement("select_task_rq_fair@/build/linux-lqALYs/linux-3.16.7-ckt25/kernel/sched/fair.c:4496")
-kernel.statement("select_task_rq_fair@/build/linux-lqALYs/linux-3.16.7-ckt25/kernel/sched/fair.c:4506")
kernel.statement("select_task_rq_fair@/build/linux-lqALYs/linux-3.16.7-ckt25/kernel/sched/fair.c:4509")
-kernel.statement("select_task_rq_fair@/build/linux-lqALYs/linux-3.16.7-ckt25/kernel/sched/fair.c:4511")
+kernel.statement("select_task_rq_fair@/build/linux-lqALYs/linux-3.16.7-ckt25/kernel/sched/fair.c:4514")
kernel.statement("select_task_rq_fair@/build/linux-lqALYs/linux-3.16.7-ckt25/kernel/sched/fair.c:4552")
-<<< threads select_task_rq_fair: return=0x0
+<<< threads select_task_rq_fair: return=0x2
kernel.statement("select_task_rq_fair@/build/linux-lqALYs/linux-3.16.7-ckt25/kernel/sched/fair.c:4471")
>>> threads select_task_rq_fair
kernel.statement("select_task_rq_fair@/build/linux-lqALYs/linux-3.16.7-ckt25/kernel/sched/fair.c:4473")
@@ -41,20 +25,10 @@
kernel.statement("select_task_rq_fair@/build/linux-lqALYs/linux-3.16.7-ckt25/kernel/sched/fair.c:4475")
kernel.statement("select_task_rq_fair@/build/linux-lqALYs/linux-3.16.7-ckt25/kernel/sched/fair.c:4488")
kernel.statement("select_task_rq_fair@/build/linux-lqALYs/linux-3.16.7-ckt25/kernel/sched/fair.c:4472")
-kernel.statement("select_task_rq_fair@/build/linux-lqALYs/linux-3.16.7-ckt25/kernel/sched/fair.c:4497")
-kernel.statement("select_task_rq_fair@/build/linux-lqALYs/linux-3.16.7-ckt25/kernel/sched/fair.c:4489")
- want_affine=0
-kernel.statement("select_task_rq_fair@/build/linux-lqALYs/linux-3.16.7-ckt25/kernel/sched/fair.c:4496")
-kernel.statement("select_task_rq_fair@/build/linux-lqALYs/linux-3.16.7-ckt25/kernel/sched/fair.c:4502")
kernel.statement("select_task_rq_fair@/build/linux-lqALYs/linux-3.16.7-ckt25/kernel/sched/fair.c:4509")
kernel.statement("select_task_rq_fair@/build/linux-lqALYs/linux-3.16.7-ckt25/kernel/sched/fair.c:4514")
-kernel.statement("select_task_rq_fair@/build/linux-lqALYs/linux-3.16.7-ckt25/kernel/sched/fair.c:4540")
-kernel.statement("select_task_rq_fair@/build/linux-lqALYs/linux-3.16.7-ckt25/kernel/sched/fair.c:4518")
-kernel.statement("select_task_rq_fair@/build/linux-lqALYs/linux-3.16.7-ckt25/kernel/sched/fair.c:4530")
-kernel.statement("select_task_rq_fair@/build/linux-lqALYs/linux-3.16.7-ckt25/kernel/sched/fair.c:4538")
-kernel.statement("select_task_rq_fair@/build/linux-lqALYs/linux-3.16.7-ckt25/kernel/sched/fair.c:4541")
kernel.statement("select_task_rq_fair@/build/linux-lqALYs/linux-3.16.7-ckt25/kernel/sched/fair.c:4552")
-<<< threads select_task_rq_fair: return=0x1
+<<< threads select_task_rq_fair: return=0x2
kernel.statement("select_task_rq_fair@/build/linux-lqALYs/linux-3.16.7-ckt25/kernel/sched/fair.c:4471")
>>> threads select_task_rq_fair
kernel.statement("select_task_rq_fair@/build/linux-lqALYs/linux-3.16.7-ckt25/kernel/sched/fair.c:4473")
@@ -63,15 +37,7 @@
kernel.statement("select_task_rq_fair@/build/linux-lqALYs/linux-3.16.7-ckt25/kernel/sched/fair.c:4475")
kernel.statement("select_task_rq_fair@/build/linux-lqALYs/linux-3.16.7-ckt25/kernel/sched/fair.c:4488")
kernel.statement("select_task_rq_fair@/build/linux-lqALYs/linux-3.16.7-ckt25/kernel/sched/fair.c:4472")
-kernel.statement("select_task_rq_fair@/build/linux-lqALYs/linux-3.16.7-ckt25/kernel/sched/fair.c:4497")
-kernel.statement("select_task_rq_fair@/build/linux-lqALYs/linux-3.16.7-ckt25/kernel/sched/fair.c:4489")
- want_affine=0
-kernel.statement("select_task_rq_fair@/build/linux-lqALYs/linux-3.16.7-ckt25/kernel/sched/fair.c:4496")
-kernel.statement("select_task_rq_fair@/build/linux-lqALYs/linux-3.16.7-ckt25/kernel/sched/fair.c:4502")
kernel.statement("select_task_rq_fair@/build/linux-lqALYs/linux-3.16.7-ckt25/kernel/sched/fair.c:4509")
kernel.statement("select_task_rq_fair@/build/linux-lqALYs/linux-3.16.7-ckt25/kernel/sched/fair.c:4514")
-kernel.statement("select_task_rq_fair@/build/linux-lqALYs/linux-3.16.7-ckt25/kernel/sched/fair.c:4540")
-kernel.statement("select_task_rq_fair@/build/linux-lqALYs/linux-3.16.7-ckt25/kernel/sched/fair.c:4518")
-kernel.statement("select_task_rq_fair@/build/linux-lqALYs/linux-3.16.7-ckt25/kernel/sched/fair.c:4525")
kernel.statement("select_task_rq_fair@/build/linux-lqALYs/linux-3.16.7-ckt25/kernel/sched/fair.c:4552")
-<<< threads select_task_rq_fair: return=0x0
+<<< threads select_task_rq_fair: return=0x2
```

FWIW, `taskset -a` does not help either.

Hi Edd,

(In reply to Edd Barrett from comment #0)
> I've noticed an odd thread scheduling behaviour on a Debian 8 system running
> a 3.16.7 kernel. This may be a bug, or it may be that my expectations are
> wrong.

I could reproduce this behavior (on 4.4.12), but reading ./kernel/sched/core.c it seems NOT to be a bug:

```
/*
 * Set up scheduler domains and groups. [...]
 */
static int init_sched_domains(const struct cpumask *cpu_map)
{
	[...]
	cpumask_andnot(doms_cur[0], cpu_map, cpu_isolated_map);
	err = build_sched_domains(doms_cur[0], NULL);
```

=> no scheduling domains are built for the isolated CPUs. These CPUs seem to be completely excluded from load balancing.

What you are looking for is actually "cgroups". See:

https://www.kernel.org/doc/Documentation/cgroup-v1/cgroups.txt

and maybe you even want to use a container like Docker instead of dealing with cgroups yourself.

Consider closing this bug report.

best regards
Paul Mehrer

That may be so, but don't you find it odd (or at least inconsistent) that if you use a real-time scheduling policy, then the threads *do* span the cores? Arguably that is a bug in either the code or the documentation.

I did end up using cgroups, specifically `cset shield`, and this worked just fine.

This does beg the question of why there are two process isolation mechanisms. Perhaps, if cgroups can do everything that isolcpus can, isolcpus could be removed altogether. Although that's a bold move, and probably a separate issue.