Reproduced on 4.1.17 PREEMPT_RT using a 4-core SMP processor, but the problematic code appears in the latest 4.20 kernel.

./kernel/time/hrtimer.c:1398 __run_hrtimer calls fn(timer) with IRQs enabled. This can call ./kernel/sched/fair.c:4860 sched_cfs_period_timer(), which locks cfs_b. The hard IRQ smp_apic_timer_interrupt can then occur. It can call ttwu_queue, which grabs the spinlock for its CPU run queue, and can then try to enqueue a task via the CFS scheduler. This can call check_enqueue_throttle, which can call assign_cfs_rq_runtime, which tries to obtain the cfs_b lock. It is now blocked.

Typical outcomes:

thread 4 - hard IRQ - smp_apic_timer_interrupt, blocked on rq0, trying to wake pid 922, python3, prio=107, on cpu 0
thread 3 - hard IRQ - smp_apic_timer_interrupt. Owns spinlock rq0, blocked on cfs_b->lock, trying to wake pid 995, python3, prio=110, on cpu 0. Owner of cfs_b is cpu 2, 0xffff88007b65cb60 - ksoftirqd
thread 2 - ksoftirqd, pid 18, prio 98, blocked on rq0, try_to_wake_up(pid 7, rcu_preempt, prio=98, on_rq 0)
thread 1 - trying to schedule work, blocked waiting for its rq0 spinlock
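Reduced to its essentials, this is a spinlock-vs-hard-IRQ lock inversion. A schematic sketch of the single-CPU form (pseudocode, not a compilable kernel excerpt; function names taken from the traces below - the reported hang is the multi-CPU variant of the same cycle, chained through the rq0 lock):

/* Timer-callback context, IRQs ENABLED (PREEMPT_RT runs hrtimer
 * callbacks this way via hrtimer_rt_run_pending):
 */
sched_cfs_period_timer()
    raw_spin_lock(&cfs_b->lock);    /* taken WITHOUT disabling IRQs */
    /* <-- hard IRQ can fire here, on this CPU, while we hold it */

/* Hard-IRQ context, interrupting the holder of cfs_b->lock: */
smp_apic_timer_interrupt()
  -> try_to_wake_up()
       raw_spin_lock(&rq->lock);          /* rq lock acquired */
       -> assign_cfs_rq_runtime()
            raw_spin_lock(&cfs_b->lock);  /* spins forever: the owner
                                           * is underneath us on this
                                           * CPU and can never run */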
The platform is 64-bit Intel. The cfs_b lock is taken with raw_spin_lock (no IRQ disable) and so was not intended for use inside a hard IRQ, but the CFS scheduler does just that when it uses an hrtimer interrupt to wake up and enqueue work. My initial impression is that cfs_b->lock needs to become a raw_spin_lock_irqsave.

Here is the hard-IRQ stack trace:

#0  arch_spin_lock (lock=<optimised out>) at arch/x86/include/asm/spinlock.h:118
#1  __spin_lock_debug (lock=<optimised out>) at kernel/locking/spinlock_debug.c:132
#2  do_raw_spin_lock (lock=0xffff88006f435570) at kernel/locking/spinlock_debug.c:139
#3  0xffffffff81823621 in __raw_spin_lock (lock=<optimised out>) at include/linux/spinlock_api_smp.h:145
#4  _raw_spin_lock (lock=0xffff88006f435570) at kernel/locking/spinlock.c:154
#5  0xffffffff810bca84 in assign_cfs_rq_runtime (cfs_rq=<optimised out>) at kernel/sched/fair.c:3475
#6  __account_cfs_rq_runtime (cfs_rq=0xffff880070fcc200, delta_exec=<optimised out>) at kernel/sched/fair.c:3559
#7  0xffffffff810bfa75 in account_cfs_rq_runtime (delta_exec=<optimised out>, cfs_rq=<optimised out>) at kernel/sched/fair.c:3569
#8  check_enqueue_throttle (cfs_rq=<optimised out>) at kernel/sched/fair.c:3973
#9  enqueue_entity (cfs_rq=0xffff880070fcc200, se=0xffff88006fd4bc00, flags=2103279296) at kernel/sched/fair.c:3095
#10 0xffffffff810c08da in enqueue_task_fair (rq=<optimised out>, p=<optimised out>, flags=<optimised out>) at kernel/sched/fair.c:4231
#11 0xffffffff810add8a in enqueue_task (rq=0xffff88007d5d7ec0, p=0xffff88004d948000, flags=5) at kernel/sched/core.c:896
#12 0xffffffff810af5b3 in activate_task (rq=<optimised out>, p=<optimised out>, flags=<optimised out>) at kernel/sched/core.c:911
#13 0xffffffff810afad3 in ttwu_activate (en_flags=<optimised out>, p=<optimised out>, rq=<optimised out>) at kernel/sched/core.c:1550
#14 ttwu_do_activate (rq=0xffffffff, p=0x40 <irq_stack_union+64>, wake_flags=<optimised out>) at kernel/sched/core.c:1590
#15 0xffffffff810b3937 in ttwu_queue (cpu=<optimised out>, p=<optimised out>) at kernel/sched/core.c:1732
#16 try_to_wake_up (p=0xffff88004d948000, state=<optimised out>, wake_flags=4) at kernel/sched/core.c:1820
#17 0xffffffff810b3d40 in wake_up_process (p=0xffff88004d948000) at kernel/sched/core.c:1844
#18 0xffffffff810fc1f2 in hrtimer_wakeup (timer=<optimised out>) at kernel/time/hrtimer.c:1736
#19 0xffffffff810fd628 in __run_hrtimer (timer=0xffffffff, now=0x40 <irq_stack_union+64>) at kernel/time/hrtimer.c:1334
#20 0xffffffff810fe708 in hrtimer_interrupt (dev=<optimised out>) at kernel/time/hrtimer.c:1543
#21 0xffffffff810397ba in local_apic_timer_interrupt () at arch/x86/kernel/apic/apic.c:897
#22 0xffffffff81827181 in smp_apic_timer_interrupt (regs=<optimised out>) at arch/x86/kernel/apic/apic.c:921
#23 <signal handler called>
#24 0xffffffffffffff10 in ?? ()
Backtrace stopped: Cannot access memory at address 0x246

This is my interpretation. The softirq side, which takes cfs_b->lock with IRQs enabled:

run_ksoftirqd
|-> do_current_softirqs
  |-> do_single_softirq
    |-> handle_softirq
      |-> run_hrtimer_softirq
        |-> hrtimer_rt_run_pending
          |-> raw_spin_unlock_irq(&cpu_base->lock);
          |-> restart = fn(timer); ===> sched_cfs_period_timer
            |-> raw_spin_lock(&cfs_b->lock);

The hard-IRQ side, which interrupts it:

smp_apic_timer_interrupt
|-> local_apic_timer_interrupt
  |-> hrtimer_interrupt
    |-> __run_hrtimer
      |-> hrtimer_wakeup
        |-> wake_up_process
          |-> try_to_wake_up
            |-> raw_spin_lock_irqsave(&p->pi_lock, flags); {{25,26}}
            |-> ttwu_queue (task cpu=0, pid 995, python3, prio=110, on_rq 0)
              |-> _raw_spin_lock(&rq->lock) - runqueue lock for CPU 0 obtained
              |-> ttwu_do_activate
                |-> ttwu_activate
                  |-> activate_task
                    |-> enqueue_task(rq0)
                      |-> enqueue_task_fair from p->sched_class->enqueue_task(rq, p, flags);
                        |-> for_each_sched_entity(se)
                          |-> enqueue_entity()
                            |-> check_enqueue_throttle
                              |-> account_cfs_rq_runtime
                                |-> __account_cfs_rq_runtime
                                  |-> assign_cfs_rq_runtime
                                    |-> raw_spin_lock(&cfs_b->lock); {{17,19}}, owner cpu 0

My client's 4.1.17 PREEMPT_RT build freezes quickly and so is good for testing any fix.
It looks likely that I will get a one-month extension to my contract, so I should be able to get permission to test any fix. I currently plan to create a patch that upgrades the cfs_b lock to disable hard IRQs.
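For concreteness, the shape I have in mind is roughly the following (a sketch only, against the 4.1-rt sources; the exact set of call sites would need checking, and every other taker of cfs_b->lock - sched_cfs_period_timer, do_sched_cfs_period_timer, the slack timer, tg_set_cfs_bandwidth - would need the same treatment):

```diff
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
 static int assign_cfs_rq_runtime(struct cfs_rq *cfs_rq)
 {
 	struct cfs_bandwidth *cfs_b = tg_cfs_bandwidth(cfs_rq->tg);
+	unsigned long flags;
 	...
-	raw_spin_lock(&cfs_b->lock);
+	/* disable hard IRQs while holding cfs_b->lock, so the APIC
+	 * timer interrupt cannot come in on top of the lock holder */
+	raw_spin_lock_irqsave(&cfs_b->lock, flags);
 	...
-	raw_spin_unlock(&cfs_b->lock);
+	raw_spin_unlock_irqrestore(&cfs_b->lock, flags);
 }
```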
I see that core.c takes an IRQ-disabling spin lock and then calls __start_cfs_bandwidth, which does an unlock, leaving IRQs disabled whilst it calls cpu_relax(). I guess this gives another CPU the opportunity to service its cfs_b needs, but IRQs remain locally disabled whilst we relax the CPU - piña coladas on the terrace. Is this good?

/* requires cfs_b->lock, may release to reprogram timer */
void __start_cfs_bandwidth(struct cfs_bandwidth *cfs_b, bool force)
{
	/*
	 * The timer may be active because we're trying to set a new bandwidth
	 * period or because we're racing with the tear-down path
	 * (timer_active==0 becomes visible before the hrtimer call-back
	 * terminates).  In either case we ensure that it's re-programmed
	 */
	while (unlikely(hrtimer_active(&cfs_b->period_timer)) &&
	       hrtimer_try_to_cancel(&cfs_b->period_timer) < 0) {
		/* bounce the lock to allow do_sched_cfs_period_timer to run */
		raw_spin_unlock(&cfs_b->lock);
		cpu_relax();
		raw_spin_lock(&cfs_b->lock);
		/* if someone else restarted the timer then we're done */
		if (!force && cfs_b->timer_active)
			return;
	}
	cfs_b->timer_active = 1;
	start_bandwidth_timer(&cfs_b->period_timer, cfs_b->period);
}

The patch I am trying out extends this approach. I now lock cfs_b with IRQs always disabled and then allow __start_cfs_bandwidth and do_sched_cfs_period_timer to unlock cfs_b whilst IRQs remain disabled. Perhaps for do_sched_cfs_period_timer this is not so clever. Feedback most welcome.
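One alternative that avoids spinning in cpu_relax() with IRQs off would be to thread the caller's saved flags through the bounce, so interrupts are restored while we wait for the timer callback to finish. A sketch only (untested, assumes the callers are converted to pass their irqsave flags in):

/* Sketch: caller holds cfs_b->lock with IRQs disabled and passes the
 * saved flags in, so the "bounce" can restore IRQs while relaxing. */
void __start_cfs_bandwidth(struct cfs_bandwidth *cfs_b, bool force,
			   unsigned long *flags)
{
	while (unlikely(hrtimer_active(&cfs_b->period_timer)) &&
	       hrtimer_try_to_cancel(&cfs_b->period_timer) < 0) {
		/* bounce the lock AND re-enable IRQs while we relax */
		raw_spin_unlock_irqrestore(&cfs_b->lock, *flags);
		cpu_relax();
		raw_spin_lock_irqsave(&cfs_b->lock, *flags);
		/* if someone else restarted the timer then we're done */
		if (!force && cfs_b->timer_active)
			return;
	}
	cfs_b->timer_active = 1;
	start_bandwidth_timer(&cfs_b->period_timer, cfs_b->period);
}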
ref: https://lore.kernel.org/lkml/20190107125231.GE14122@hirez.programming.kicks-ass.net/T/

n.b. The PREEMPT_RT patch enables IRQs for hrtimer callbacks at one added call point - the added function hrtimer_rt_run_pending:

+static void hrtimer_rt_run_pending(void)
+{
...
+	/*
+	 * Same as the above __run_hrtimer function
+	 * just we run with interrupts enabled.
+	 */
....
+	raw_spin_unlock_irq(&cpu_base->lock);
+	restart = fn(timer);
+	raw_spin_lock_irq(&cpu_base->lock);

At all the other call sites, hrtimer callbacks still run with IRQs disabled. Now that there is a mechanism allowing hrtimer callbacks to run with IRQs enabled, it may be worth reviewing whether to enable IRQs elsewhere too - for example, around the call to raise_softirq_irqoff(HRTIMER_SOFTIRQ).