Bug 201993 - hard interrupt on hrt softirq causes deadlock
Status: RESOLVED CODE_FIX
Alias: None
Product: Timers
Classification: Unclassified
Component: Other
Hardware: Intel Linux
Importance: P1 blocking
Assignee: john stultz
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2018-12-14 18:45 UTC by Mike Pearce
Modified: 2019-01-25 13:00 UTC

See Also:
Kernel Version: 4.20
Subsystem:
Regression: No
Bisected commit-id:


Attachments

Description Mike Pearce 2018-12-14 18:45:45 UTC
Reproduced on 4.1.17 PREEMPT_RT using a 4-core SMP processor, but the problematic code appears in the latest 4.20 kernel.

./kernel/time/hrtimer.c:1398 __run_hrtimer calls fn(timer) with IRQs enabled.
This can call ./kernel/sched/fair.c:4860 sched_cfs_period_timer(), which locks cfs_b.

The hard IRQ smp_apic_timer_interrupt can then occur.
It can call ttwu_queue, which grabs the spinlock for its CPU run queue, and can then try to enqueue a task via the CFS scheduler.
This can call check_enqueue_throttle, which can call assign_cfs_rq_runtime, which tries to obtain the cfs_b lock. It is now blocked.

Typical outcomes:
thread 4 - hard IRQ - smp_apic_timer_interrupt, blocked on rq0, trying to wake pid 922, python3, prio=107, on cpu 0
thread 3 - hard IRQ - smp_apic_timer_interrupt, owns spinlock rq0, blocked on cfs_b->lock, trying to wake pid 995, python3, prio=110, on cpu 0
                                                owner of cfs_b is cpu 2, 0xffff88007b65cb60 - ksoftirqd
thread 2 - ksoftirqd, pid 18, prio 98, blocked on rq0, in try_to_wake_up(pid 7, rcu_preempt, prio=98, on_rq 0)
thread 1 - trying to schedule work, blocked waiting for the rq0 spinlock.
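
Condensed, the two paths take the same pair of locks in opposite order. Here is a minimal sketch of the inversion as I read the traces (illustrative pseudocode, not literal kernel source):

/* Path A: hrtimer softirq (ksoftirqd on cpu 2), runs with IRQs enabled */
sched_cfs_period_timer()
	raw_spin_lock(&cfs_b->lock);		/* now owns cfs_b->lock */
	-> try_to_wake_up()
		raw_spin_lock(&rq->lock);	/* spins: rq0 held by path B */

/* Path B: hard IRQ (smp_apic_timer_interrupt) */
hrtimer_wakeup()
	-> ttwu_queue()
		raw_spin_lock(&rq->lock);	/* now owns rq0 */
		-> assign_cfs_rq_runtime()
			raw_spin_lock(&cfs_b->lock); /* spins: held by path A */

Neither path can make progress, so both spin forever.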
Comment 1 Mike Pearce 2018-12-19 10:57:14 UTC
The platform is a 64-bit Intel architecture. The cfs_b lock uses spin_lock and so was not intended for use inside a hard IRQ, but the CFS scheduler does just that when it uses an hrtimer interrupt to wake up and enqueue work. My initial impression is that the cfs_b lock needs to become a spin_lock_irqsave.
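
For illustration, the conversion pattern I have in mind at each cfs_b->lock site (an untested sketch, elisions marked with ...):

	/* before: unsafe if the apic timer hard IRQ fires while held */
	raw_spin_lock(&cfs_b->lock);
	...
	raw_spin_unlock(&cfs_b->lock);

	/* after: hard IRQs stay off for the whole critical section */
	unsigned long flags;

	raw_spin_lock_irqsave(&cfs_b->lock, flags);
	...
	raw_spin_unlock_irqrestore(&cfs_b->lock, flags);

Keeping IRQs off while cfs_b->lock is held means the apic timer interrupt can no longer fire on top of a holder, which is exactly the window the traces show.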

Here is the hard IRQ stack trace:
#0  arch_spin_lock (lock=<optimised out>) at arch/x86/include/asm/spinlock.h:118
#1  __spin_lock_debug (lock=<optimised out>) at kernel/locking/spinlock_debug.c:132
#2  do_raw_spin_lock (lock=0xffff88006f435570) at kernel/locking/spinlock_debug.c:139
#3  0xffffffff81823621 in __raw_spin_lock (lock=<optimised out>) at include/linux/spinlock_api_smp.h:145
#4  _raw_spin_lock (lock=0xffff88006f435570) at kernel/locking/spinlock.c:154
#5  0xffffffff810bca84 in assign_cfs_rq_runtime (cfs_rq=<optimised out>) at kernel/sched/fair.c:3475
#6  __account_cfs_rq_runtime (cfs_rq=0xffff880070fcc200, delta_exec=<optimised out>) at kernel/sched/fair.c:3559
#7  0xffffffff810bfa75 in account_cfs_rq_runtime (delta_exec=<optimised out>, cfs_rq=<optimised out>) at kernel/sched/fair.c:3569
#8  check_enqueue_throttle (cfs_rq=<optimised out>) at kernel/sched/fair.c:3973
#9  enqueue_entity (cfs_rq=0xffff880070fcc200, se=0xffff88006fd4bc00, flags=2103279296) at kernel/sched/fair.c:3095
#10 0xffffffff810c08da in enqueue_task_fair (rq=<optimised out>, p=<optimised out>, flags=<optimised out>) at kernel/sched/fair.c:4231
#11 0xffffffff810add8a in enqueue_task (rq=0xffff88007d5d7ec0, p=0xffff88004d948000, flags=5) at kernel/sched/core.c:896
#12 0xffffffff810af5b3 in activate_task (rq=<optimised out>, p=<optimised out>, flags=<optimised out>) at kernel/sched/core.c:911
#13 0xffffffff810afad3 in ttwu_activate (en_flags=<optimised out>, p=<optimised out>, rq=<optimised out>) at kernel/sched/core.c:1550
#14 ttwu_do_activate (rq=0xffffffff, p=0x40 <irq_stack_union+64>, wake_flags=<optimised out>) at kernel/sched/core.c:1590
#15 0xffffffff810b3937 in ttwu_queue (cpu=<optimised out>, p=<optimised out>) at kernel/sched/core.c:1732
#16 try_to_wake_up (p=0xffff88004d948000, state=<optimised out>, wake_flags=4) at kernel/sched/core.c:1820
#17 0xffffffff810b3d40 in wake_up_process (p=0xffff88004d948000) at kernel/sched/core.c:1844
#18 0xffffffff810fc1f2 in hrtimer_wakeup (timer=<optimised out>) at kernel/time/hrtimer.c:1736
#19 0xffffffff810fd628 in __run_hrtimer (timer=0xffffffff, now=0x40 <irq_stack_union+64>) at kernel/time/hrtimer.c:1334
#20 0xffffffff810fe708 in hrtimer_interrupt (dev=<optimised out>) at kernel/time/hrtimer.c:1543
#21 0xffffffff810397ba in local_apic_timer_interrupt () at arch/x86/kernel/apic/apic.c:897
#22 0xffffffff81827181 in smp_apic_timer_interrupt (regs=<optimised out>) at arch/x86/kernel/apic/apic.c:921
#23 <signal handler called>
#24 0xffffffffffffff10 in ?? ()
Backtrace stopped: Cannot access memory at address 0x246

This is my interpretation:
run_ksoftirqd
 |-> do_current_softirqs
      |-> do_single_softirq
           |-> handle_softirq
                |-> run_hrtimer_softirq
                     |-> hrtimer_rt_run_pending
                          |-> raw_spin_unlock_irq(&cpu_base->lock);
                          |-> restart = fn(timer); ===> sched_cfs_period_timer
                               |-> raw_spin_lock(&cfs_b->lock);

smp_apic_timer_interrupt
  |-> local_apic_timer_interrupt
        |-> hrtimer_interrupt
              |-> __run_hrtimer
                    |-> hrtimer_wakeup
                          |-> wake_up_process
                                |-> try_to_wake_up
                                      |-> raw_spin_lock_irqsave(&p->pi_lock, flags); {{25,26}} 
                                      |-> ttwu_queue (task cpu=0, pid 995, python3, prio=110, on_rq 0)
                                            |-> _raw_spin_lock(&rq->lock) - Runqueue lock for CPU 0 obtained
                                            |-> ttwu_do_activate
                                                 |-> ttwu_activate
                                                      |-> activate_task
                                                           |-> enqueue_task(rq0)
                                                                |-> enqueue_task_fair from p->sched_class->enqueue_task(rq, p, flags);
                                                                    |-> for_each_sched_entity(se)
                                                                         |-> enqueue_entity()
                                                                             |-> check_enqueue_throttle
                                                                                 |-> account_cfs_rq_runtime
                                                                                     |-> __account_cfs_rq_runtime
                                                                                         |-> assign_cfs_rq_runtime
                                                                                              |-> raw_spin_lock(&cfs_b->lock); {{17,19}}, owner cpu 0,

My client's 4.1.17 PREEMPT_RT build freezes quickly and so is good for testing any fix. It looks likely that I will see a one-month extension to my contract and so can get permission to test any fix. I currently plan to create a patch that upgrades the cfs_b lock to disable hard IRQs.
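
To make the idea concrete, here is roughly the shape of the change on the timer-callback side (a simplified, untested sketch against the 4.1 sources - the real function also manages the timer_active flag):

static enum hrtimer_restart sched_cfs_period_timer(struct hrtimer *timer)
{
	struct cfs_bandwidth *cfs_b =
		container_of(timer, struct cfs_bandwidth, period_timer);
	unsigned long flags;
	int overrun;
	int idle = 0;

	/* was raw_spin_lock(); hard IRQs now stay off while cfs_b is held */
	raw_spin_lock_irqsave(&cfs_b->lock, flags);
	for (;;) {
		overrun = hrtimer_forward_now(timer, cfs_b->period);
		if (!overrun)
			break;
		idle = do_sched_cfs_period_timer(cfs_b, overrun);
	}
	raw_spin_unlock_irqrestore(&cfs_b->lock, flags);

	return idle ? HRTIMER_NORESTART : HRTIMER_RESTART;
}

do_sched_cfs_period_timer would presumably need the flags threaded through as well, since it drops and retakes cfs_b->lock around distribute_cfs_runtime().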
Comment 2 Mike Pearce 2018-12-20 12:47:36 UTC
I see that core.c takes the cfs_b lock with IRQs disabled and then calls __start_cfs_bandwidth, which does an unlock, leaving IRQs disabled whilst it calls cpu_relax(). I guess this gives another CPU the opportunity to service its cfs_b needs, but IRQs remain locally disabled whilst we relax the CPU - piña coladas on the terrace. Is this good?

/* requires cfs_b->lock, may release to reprogram timer */
void __start_cfs_bandwidth(struct cfs_bandwidth *cfs_b, bool force)
{
	/*
	 * The timer may be active because we're trying to set a new bandwidth
	 * period or because we're racing with the tear-down path
	 * (timer_active==0 becomes visible before the hrtimer call-back
	 * terminates).  In either case we ensure that it's re-programmed
	 */
	while (unlikely(hrtimer_active(&cfs_b->period_timer)) &&
	       hrtimer_try_to_cancel(&cfs_b->period_timer) < 0) {
		/* bounce the lock to allow do_sched_cfs_period_timer to run */
		raw_spin_unlock(&cfs_b->lock);
		cpu_relax();
		raw_spin_lock(&cfs_b->lock);
		/* if someone else restarted the timer then we're done */
		if (!force && cfs_b->timer_active)
			return;
	}

	cfs_b->timer_active = 1;
	start_bandwidth_timer(&cfs_b->period_timer, cfs_b->period);
}

The patch I am trying out extends this approach. I now take the cfs_b lock with IRQs always disabled, and then allow __start_cfs_bandwidth and do_sched_cfs_period_timer to drop cfs_b->lock whilst IRQs remain disabled. Perhaps for do_sched_cfs_period_timer this is not so clever. Feedback most welcome.
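
Concretely, a tg_set_cfs_bandwidth()-style caller then looks like this, with the bounce inside __start_cfs_bandwidth() left on the plain lock/unlock so IRQs stay disabled across cpu_relax() (a sketch only, elisions marked with ...):

	unsigned long flags;

	raw_spin_lock_irqsave(&cfs_b->lock, flags);
	...
	__start_cfs_bandwidth(cfs_b, true);	/* may bounce cfs_b->lock */
	raw_spin_unlock_irqrestore(&cfs_b->lock, flags);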
Comment 3 Mike Pearce 2019-01-25 13:00:05 UTC
ref: https://lore.kernel.org/lkml/20190107125231.GE14122@hirez.programming.kicks-ass.net/T/

n.b.
The PREEMPT_RT patch enables IRQs for hrtimer callbacks via one added call point - the new function hrtimer_rt_run_pending():

+static void hrtimer_rt_run_pending(void)
+{
...
+			/*
+			 * Same as the above __run_hrtimer function
+			 * just we run with interrupts enabled.
+			 */
....
+			raw_spin_unlock_irq(&cpu_base->lock);
+			restart = fn(timer);
+			raw_spin_lock_irq(&cpu_base->lock);

In all other call paths the hrtimer callbacks still run with IRQs disabled.

Now that there is a solution allowing hrtimer callbacks to work with IRQs enabled, you may want to review whether to enable IRQs elsewhere - for example, around the call to raise_softirq_irqoff(HRTIMER_SOFTIRQ).
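
For reference, mainline's raise_softirq() is just an irqsave wrapper around raise_softirq_irqoff(), which makes the constraint explicit - the _irqoff variant expects hard IRQs to already be disabled:

void raise_softirq(unsigned int nr)
{
	unsigned long flags;

	local_irq_save(flags);
	raise_softirq_irqoff(nr);
	local_irq_restore(flags);
}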
