Bug 26842
| Summary: | [2.6.37 regression] threads with CPU affinity cannot be killed | | |
|---|---|---|---|
| Product: | Process Management | Reporter: | tim blechmann (tim) |
| Component: | Scheduler | Assignee: | Ingo Molnar (mingo) |
| Status: | CLOSED CODE_FIX | | |
| Severity: | normal | CC: | a.p.zijlstra, florian, maciej.rutecki, rjw |
| Priority: | P1 | | |
| Hardware: | All | | |
| OS: | Linux | | |
| Kernel Version: | 2.6.37 | Subsystem: | |
| Regression: | Yes | Bisected commit-id: | |
| Bug Depends on: | | | |
| Bug Blocks: | 21782 | | |
Description
tim blechmann
2011-01-16 13:39:42 UTC
2.6.38-rc2 still has the issue.

it seems, i cannot kill any thread which is running with real-time scheduling.

i bisected it, the first bad commit is:

commit 34f971f6f7988be4d014eec3e3526bee6d007ffa
Author: Peter Zijlstra <a.p.zijlstra@chello.nl>
Date:   Wed Sep 22 13:53:15 2010 +0200

    sched: Create special class for stop/migrate work

    In order to separate the stop/migrate work thread from the SCHED_FIFO
    implementation, create a special class for it that is of higher priority
    than SCHED_FIFO itself.

    This currently solves a problem where cpu-hotplug consumes so much
    cpu-time that the SCHED_FIFO class gets throttled, but has the bandwidth
    replenishment timer pending on the now dead cpu.

    It is also required for when we add the planned deadline scheduling
    class above SCHED_FIFO, as the stop/migrate thread still needs to
    transcent those tasks.

    Tested-by: Heiko Carstens <heiko.carstens@de.ibm.com>
    Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
    LKML-Reference: <1285165776.2275.1022.camel@laptop>
    Signed-off-by: Ingo Molnar <mingo@elte.hu>

First-Bad-Commit : 34f971f6f7988be4d014eec3e3526bee6d007ffa

this two-liner seems to resolve the symptoms:

diff --git a/kernel/stop_machine.c b/kernel/stop_machine.c
index 2df820b..2035b4f 100644
--- a/kernel/stop_machine.c
+++ b/kernel/stop_machine.c
@@ -293,6 +293,7 @@ extern void sched_set_stop_task(int cpu, struct task_struct *stop);
 static int __cpuinit cpu_stop_cpu_callback(struct notifier_block *nfb,
                                            unsigned long action, void *hcpu)
 {
+        struct sched_param param = { .sched_priority = MAX_RT_PRIO - 1 };
         unsigned int cpu = (unsigned long)hcpu;
         struct cpu_stopper *stopper = &per_cpu(cpu_stopper, cpu);
         struct task_struct *p;
@@ -305,6 +306,7 @@ static int __cpuinit cpu_stop_cpu_callback(struct notifier_block *nfb,
                                    cpu);
                 if (IS_ERR(p))
                         return notifier_from_errno(PTR_ERR(p));
+                sched_setscheduler_nocheck(p, SCHED_FIFO, &param);
                 get_task_struct(p);
                 kthread_bind(p, cpu);
                 sched_set_stop_task(cpu, p);

hm actually not quite: while i can finally kill my application, it doesn't behave correctly: the application has 4 real-time threads, each pinned to separate CPU cores. some of the threads do not seem to get scheduled (they don't consume any CPU time)

On Sun, 2011-02-06 at 14:26 +0000, bugzilla-daemon@bugzilla.kernel.org wrote:
> https://bugzilla.kernel.org/show_bug.cgi?id=26842
>
> --- Comment #4 from tim blechmann <tim@klingt.org> 2011-02-06 14:26:01 ---
> this two-liner seems to resolve the symptoms:
>
> diff --git a/kernel/stop_machine.c b/kernel/stop_machine.c
> index 2df820b..2035b4f 100644
> --- a/kernel/stop_machine.c
> +++ b/kernel/stop_machine.c
> @@ -293,6 +293,7 @@ extern void sched_set_stop_task(int cpu, struct task_struct *stop);
>  static int __cpuinit cpu_stop_cpu_callback(struct notifier_block *nfb,
>                                             unsigned long action, void *hcpu)
>  {
> +        struct sched_param param = { .sched_priority = MAX_RT_PRIO - 1 };
>          unsigned int cpu = (unsigned long)hcpu;
>          struct cpu_stopper *stopper = &per_cpu(cpu_stopper, cpu);
>          struct task_struct *p;
> @@ -305,6 +306,7 @@ static int __cpuinit cpu_stop_cpu_callback(struct notifier_block *nfb,
>                                     cpu);
>                  if (IS_ERR(p))
>                          return notifier_from_errno(PTR_ERR(p));
> +                sched_setscheduler_nocheck(p, SCHED_FIFO, &param);
>                  get_task_struct(p);
>                  kthread_bind(p, cpu);
>                  sched_set_stop_task(cpu, p);
>

That should be an absolute NOP, the task shouldn't be running and sched_set_stop_task() already fiddles with sched_setscheduler.

Does the below patch cure things? If not, I'll try and write a proglet that does what you describe to see if I can reproduce.

---
commit 06c3bc655697b19521901f9254eb0bbb2c67e7e8
Author: Peter Zijlstra <a.p.zijlstra@chello.nl>
Date:   Wed Feb 2 13:19:48 2011 +0100

    sched: Fix update_curr_rt()

    cpu_stopper_thread()
      migration_cpu_stop()
        __migrate_task()
          deactivate_task()
            dequeue_task()
              dequeue_task_rq()
                update_curr_rt()

    Will call update_curr_rt() on rq->curr, which at that time is
    rq->stop. The problem is that rq->stop.prio matches an RT prio and
    thus falsely assumes its a rt_sched_class task.

    Reported-Debuged-Tested-Acked-by: Thomas Gleixner <tglx@linutronix.de>
    Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
    LKML-Reference: <new-submission>
    Cc: stable@kernel.org # .37
    Signed-off-by: Ingo Molnar <mingo@elte.hu>

diff --git a/kernel/sched_rt.c b/kernel/sched_rt.c
index c914ec7..ad62677 100644
--- a/kernel/sched_rt.c
+++ b/kernel/sched_rt.c
@@ -625,7 +625,7 @@ static void update_curr_rt(struct rq *rq)
         struct rt_rq *rt_rq = rt_rq_of_se(rt_se);
         u64 delta_exec;
 
-        if (!task_has_rt_policy(curr))
+        if (curr->sched_class != &rt_sched_class)
                 return;
 
         delta_exec = rq->clock_task - curr->se.exec_start;

this patch seems to fix it ...

Fixed by commit 06c3bc655697b19521901f9254eb0bbb2c67e7e8 .
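
For anyone trying to recreate the reported setup (real-time threads, each pinned to its own CPU core), the per-thread part might look roughly like the sketch below. It is not the reporter's application and not the "proglet" mentioned in the thread. The names `single_cpu_mask`, `pin_and_make_rt`, `rt_worker` and `keep_running`, the priority value 10 and the 1 ms sleep are all assumptions made for illustration. It relies on the Linux behaviour that `sched_setaffinity()` and `sched_setscheduler()` with a pid of 0 act on the calling thread.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE             /* needed for CPU_SET() and sched_setaffinity() */
#endif
#include <assert.h>
#include <errno.h>
#include <sched.h>
#include <stdatomic.h>
#include <stdint.h>
#include <time.h>

/* Hypothetical stop flag; workers exit once it is cleared. */
atomic_int keep_running = 1;

/* Build a CPU mask that contains only `cpu`. */
void single_cpu_mask(int cpu, cpu_set_t *set)
{
    assert(cpu >= 0 && cpu < CPU_SETSIZE);
    CPU_ZERO(set);
    CPU_SET(cpu, set);
}

/*
 * Pin the calling thread to `cpu` and switch it to SCHED_FIFO at `prio`.
 * Returns 0 on success or an errno value (EPERM without root/CAP_SYS_NICE).
 */
int pin_and_make_rt(int cpu, int prio)
{
    cpu_set_t set;
    struct sched_param sp = { .sched_priority = prio };

    single_cpu_mask(cpu, &set);
    if (sched_setaffinity(0, sizeof(set), &set) != 0)
        return errno;
    if (sched_setscheduler(0, SCHED_FIFO, &sp) != 0)
        return errno;
    return 0;
}

/*
 * Thread body suitable for pthread_create(); `arg` carries the CPU index.
 * Sleeps briefly in each iteration so a SCHED_FIFO thread cannot
 * monopolise its core while experimenting.
 */
void *rt_worker(void *arg)
{
    int cpu = (int)(intptr_t)arg;
    struct timespec pause = { .tv_sec = 0, .tv_nsec = 1000000 };
    int err = pin_and_make_rt(cpu, 10);     /* priority 10 is arbitrary */

    if (err)
        return (void *)(intptr_t)err;
    while (atomic_load(&keep_running))
        nanosleep(&pause, NULL);
    return NULL;
}
```

Starting four such workers with `pthread_create()` for CPU indices 0 to 3 and then trying to kill the process approximates the scenario described above. Whether this triggers the regression on an affected kernel has not been verified here.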
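
The difference between the old and new checks in update_curr_rt() can be illustrated with a small user-space model. This is a simplified sketch, not kernel code. The `struct task`, `enum policy` and the two `*_check_is_rt` helpers are hypothetical stand-ins. It assumes, following the commit message and the remark that sched_set_stop_task() already calls sched_setscheduler, that the stop task carries an RT policy and priority while belonging to a separate scheduling class.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-ins for the kernel's scheduling classes. */
struct sched_class { const char *name; };

static const struct sched_class stop_sched_class = { "stop" };
static const struct sched_class rt_sched_class   = { "rt" };
static const struct sched_class fair_sched_class = { "fair" };

enum policy { POLICY_NORMAL, POLICY_FIFO, POLICY_RR };

/* Minimal model of a task: the policy it advertises and the class it uses. */
struct task {
    enum policy policy;
    const struct sched_class *sched_class;
};

/* Models the pre-fix test: only the policy is inspected. */
bool old_check_is_rt(const struct task *t)
{
    return t->policy == POLICY_FIFO || t->policy == POLICY_RR;
}

/* Models the post-fix test: the scheduling class itself is compared. */
bool new_check_is_rt(const struct task *t)
{
    return t->sched_class == &rt_sched_class;
}

/* Walk through the three interesting cases. */
void model_self_check(void)
{
    struct task stop   = { POLICY_FIFO,   &stop_sched_class };
    struct task rt     = { POLICY_FIFO,   &rt_sched_class };
    struct task normal = { POLICY_NORMAL, &fair_sched_class };

    /* The stop task looks like SCHED_FIFO, so the old test misfires. */
    assert(old_check_is_rt(&stop));
    assert(!new_check_is_rt(&stop));

    /* Genuine RT and normal tasks are classified the same by both tests. */
    assert(old_check_is_rt(&rt) && new_check_is_rt(&rt));
    assert(!old_check_is_rt(&normal) && !new_check_is_rt(&normal));
}
```

In this model only the stop task is classified differently by the two checks, which matches the one-line nature of the fix: RT runtime accounting is skipped for a task that merely looks like SCHED_FIFO.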