Bug 61451 - Race condition in tasklet scheduling
Summary: Race condition in tasklet scheduling
Status: ASSIGNED
Alias: None
Product: Process Management
Classification: Unclassified
Component: Other
Hardware: All
OS: Linux
Importance: P1 high
Assignee: process_other
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2013-09-16 15:05 UTC by fredrik.markstrom
Modified: 2022-02-11 10:57 UTC
CC: 5 users

See Also:
Kernel Version: All recent -rt
Subsystem:
Regression: No
Bisected commit-id:


Attachments
Some kernel code that demonstrates the race condition in tasklet scheduling (5.54 KB, text/plain)
2013-09-16 15:06 UTC, fredrik.markstrom
attachment-17809-0.html (2.24 KB, text/html)
2020-03-17 12:54 UTC, fredrik.markstrom

Description fredrik.markstrom 2013-09-16 15:05:12 UTC
A race condition in the state-bit flipping in softirq.c might cause a tasklet to be added to more than one CPU's tasklet_vec list. This might result in the tasklet executing too many times and/or on the wrong CPU. It might potentially also cause unrelated tasklets scheduled after it to execute on the wrong CPU. The most noticeable symptom in our system is that the WARN_ON() in __tasklet_action() is triggered.

An artificial example of how to trigger this bug is attached; it also contains lots of comments describing the race. This code (softirq.c) seems to have a lot of history and I have a hard time understanding why things are done the way they are, so I'll save myself the embarrassment of proposing a solution.

The example has been tested on qemux86 and powerpc, but should work equally well on all architectures and kernel versions.
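
To give a feel for the shape of the reproducer (the attachment is the authoritative version), here is a minimal, hypothetical sketch; all names below are made up for illustration:

/* Hypothetical sketch only -- the authoritative reproducer is the
 * attachment on this bug.  The general shape: hammer tasklet_schedule()
 * for a single tasklet from several CPUs at once, so that one caller
 * can be preempted between setting SCHED and taking the RUN "lock".
 */
#include <linux/cpu.h>
#include <linux/err.h>
#include <linux/interrupt.h>
#include <linux/kthread.h>
#include <linux/module.h>

static void race_fn(unsigned long data)
{
	/* If the race hits, this runs more often than it was scheduled
	 * and the WARN_ON() in __tasklet_action() fires. */
}
static DECLARE_TASKLET(race_tasklet, race_fn, 0);

static int hammer(void *unused)
{
	while (!kthread_should_stop()) {
		tasklet_schedule(&race_tasklet);
		cond_resched();
	}
	return 0;
}

static int __init race_init(void)
{
	int cpu;

	/* One hammering thread pinned to each CPU (needs >= 3 CPUs). */
	for_each_online_cpu(cpu) {
		struct task_struct *tsk;

		tsk = kthread_create(hammer, NULL, "tasklet-race/%d", cpu);
		if (!IS_ERR(tsk)) {
			kthread_bind(tsk, cpu);
			wake_up_process(tsk);
		}
	}
	return 0;
}
module_init(race_init);
MODULE_LICENSE("GPL");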

/Fredrik
Comment 1 fredrik.markstrom 2013-09-16 15:06:14 UTC
Created attachment 108481 [details]
Some kernel code that demonstrates the race condition in tasklet scheduling
Comment 2 fredrik.markstrom 2013-09-16 15:07:27 UTC
You need at least three cores to run the attached example (start qemu with -smp 3)
Comment 3 ZhangXiao 2013-11-26 08:35:34 UTC
Process "tasklet_schedule":

S-1) test_and_set_bit TASKLET_STATE_SCHED
S-2) local_irq_save
S-3) test_and_set_bit TASKLET_STATE_RUN
S-4) if TASKLET_STATE_SCHED is still set
S-5) Chain the tasklet on tasklet_head
S-6) Clear bit TASKLET_STATE_RUN
S-7) local_irq_restore

Process "__tasklet_action", for every chained tasklet:
A-1) test_and_set_bit TASKLET_STATE_RUN
A-2) test_and_clear_bit TASKLET_STATE_SCHED
A-3) Spawn the handler function
A-4) conditionally test and clear bit TASKLET_STATEF_RUN (cmpxchg)
A-5) test_and_clear_bit TASKLET_STATE_SCHED; if it was set, go to A-3
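
In C terms, the two paths above look roughly like this (a simplified, illustrative paraphrase of the -rt softirq.c quoted later in this report; disable/kill handling is omitted, so it is not the exact source):

/* Simplified rendering of the two racing paths.  The S-x / A-x
 * markers match the step lists above. */
static void schedule_path(struct tasklet_struct *t,
			  struct tasklet_head *head, unsigned int nr)
{
	unsigned long flags;

	if (test_and_set_bit(TASKLET_STATE_SCHED, &t->state))	/* S-1 */
		return;
	local_irq_save(flags);					/* S-2 */
	if (tasklet_trylock(t)) {				/* S-3 */
		if (test_bit(TASKLET_STATE_SCHED, &t->state)) {	/* S-4 */
			t->next = NULL;				/* S-5 */
			*head->tail = t;
			head->tail = &t->next;
			raise_softirq_irqoff(nr);
		}
		tasklet_unlock(t);				/* S-6 */
	}
	local_irq_restore(flags);				/* S-7 */
}

static void action_path(struct tasklet_struct *t)
{
	if (!tasklet_trylock(t))				/* A-1 */
		return;
again:
	test_and_clear_bit(TASKLET_STATE_SCHED, &t->state);	/* A-2 */
	t->func(t->data);					/* A-3 */
	/* A-4: only the RUN -> 0 transition unlocks the tasklet */
	while (cmpxchg(&t->state, TASKLET_STATEF_RUN, 0)
	       != TASKLET_STATEF_RUN) {
		/* A-5: SCHED was set again meanwhile -- run once more */
		if (test_and_clear_bit(TASKLET_STATE_SCHED, &t->state))
			goto again;
	}
}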

Three CPUs, one tasklet:

Time  CPU1          CPU2          CPU3      SCHED
T-1   S-1~7                                 1
T-2   A-1,2,3                               0
T-3                 S-1 preempted           1
T-4   A-4,5 -> A-3                          0
T-5   A-4,5 finish                          0
T-6                               S-1~7     1
T-7                 S-2~7                   1

After T-7, this tasklet is chained on both CPU2 and CPU3.

The key lies at time point T-3. CPU1 is handling the tasklet
while CPU2 is trying to schedule it, but CPU2 is preempted just
after S-1 (SCHED is set). CPU1 moves on, calls the handler
function twice and finally clears SCHED. CPU3 then successfully
schedules this very tasklet and chains it on CPU3. When CPU2
comes back it sees SCHED set, has no idea that the tasklet has
already been chained on CPU3, moves on and chains it on itself
as well.

So how about adding one state bit to avoid this race?
Add a CHAINED state bit that records whether the tasklet is already
chained on any CPU.
-) In the schedule path, after S-4 (SCHED bit confirmed set),
test-and-set CHAINED; if it was already set, clear the RUN bit and
just return.

-) In the action path, modify step A-4. The original A-4 compares-and-
exchanges the RUN bit: if only RUN was set, clear it, otherwise leave
the state word untouched. My modification tests the RUN and CHAINED
bits together: if exactly these two bits are set, clear them both,
otherwise do nothing.

(Patch not tested yet!!!)
diff --git a/include/linux/interrupt.h b/include/linux/interrupt.h
index 838d4bd..f8681ba 100644
--- a/include/linux/interrupt.h
+++ b/include/linux/interrupt.h
@@ -534,11 +534,14 @@ enum
        TASKLET_STATE_SCHED,    /* Tasklet is scheduled for execution */
        TASKLET_STATE_RUN,      /* Tasklet is running (SMP only) */
-       TASKLET_STATE_PENDING   /* Tasklet is pending */
+       TASKLET_STATE_PENDING,  /* Tasklet is pending */
+       TASKLET_STATE_CHAINED   /* Tasklet is chained */
 };
 
 #define TASKLET_STATEF_SCHED   (1 << TASKLET_STATE_SCHED)
 #define TASKLET_STATEF_RUN     (1 << TASKLET_STATE_RUN)
 #define TASKLET_STATEF_PENDING (1 << TASKLET_STATE_PENDING)
+#define TASKLET_STATEF_CHAINED (1 << TASKLET_STATE_CHAINED)
+#define TASKLET_STATEF_RC      (TASKLET_STATEF_RUN | TASKLET_STATEF_CHAINED)
 
 #if defined(CONFIG_SMP) || defined(CONFIG_PREEMPT_RT_FULL)
 static inline int tasklet_trylock(struct tasklet_struct *t)
diff --git a/kernel/softirq.c b/kernel/softirq.c
index 3298e55..87d5219 100644
--- a/kernel/softirq.c
+++ b/kernel/softirq.c
@@ -785,6 +785,10 @@ again:
                 * is locked before adding it to the list.
                 */
                if (test_bit(TASKLET_STATE_SCHED, &t->state)) {
+                       if (test_and_set_bit(TASKLET_STATE_CHAINED, &t->state)) {
+                               tasklet_unlock(t);
+                               return;
+                       }
                        t->next = NULL;
                        *head->tail = t;
                        head->tail = &(t->next);
@@ -907,9 +911,9 @@ again:
                /*
                 * Try to unlock the tasklet. We must use cmpxchg, because
                 * another CPU might have scheduled or disabled the tasklet.
-                * We only allow the STATE_RUN -> 0 transition here.
+                * We only allow the STATE_RC -> 0 transition here.
                 */
-               while (!tasklet_tryunlock(t)) {
+               while (cmpxchg(&t->state, TASKLET_STATEF_RC, 0) != TASKLET_STATEF_RC) {
                        /*
                         * If it got disabled meanwhile, bail out:
                         */

(Patch not tested yet!!!)
Comment 4 fredrik.markstrom 2013-11-26 14:32:46 UTC
Hello ZhangXiao, 

Thanks for paying attention to this bug. I was just preparing to submit a patch when I read your response. My patch takes a slightly different approach, setting RUN and SCHED simultaneously. It is larger but does not introduce any extra state, so I'm not sure which one is better. Have you analyzed the effect of your patch in relation to the tasklet_kill() stuff?

I'm currently running some extensive tests on both 3.4-rt (which we use) and the latest 3.10-rt.

What do you think ?


--- 8< ---

If a thread is preempted in tasklet_schedule() after setting the SCHED bit
but before taking the "lock", the restart code in tasklet_action() can run
the tasklet and clear SCHED. If another thread in the system schedules the
tasklet again before the first tasklet_schedule() resumes, it will
successfully set SCHED again and add the tasklet to a tasklet queue. When
the first tasklet_schedule() then continues, it will successfully take the
lock, see that the SCHED bit is set and add the tasklet to a queue again;
this is wrong since it is already on a tasklet queue. The consequences are
that the tasklet runs one time too many, and the tasklets behind it in the
first list potentially run on the wrong CPU.

Signed-off-by: Fredrik Markström <fredrik.markstrom@gmail.com>
---
 include/linux/interrupt.h |    6 ++--
 kernel/softirq.c          |   73 ++++++++++++++++++++++++++++-----------------
 2 files changed, 49 insertions(+), 30 deletions(-)

diff --git a/include/linux/interrupt.h b/include/linux/interrupt.h
index 838d4bd..0b59d29 100644
--- a/include/linux/interrupt.h
+++ b/include/linux/interrupt.h
@@ -570,7 +570,7 @@ extern void __tasklet_schedule(struct tasklet_struct *t);
 
 static inline void tasklet_schedule(struct tasklet_struct *t)
 {
-	if (!test_and_set_bit(TASKLET_STATE_SCHED, &t->state))
+	if (!test_bit(TASKLET_STATE_SCHED, &t->state))
 		__tasklet_schedule(t);
 }
 
@@ -578,7 +578,7 @@ extern void __tasklet_hi_schedule(struct tasklet_struct *t);
 
 static inline void tasklet_hi_schedule(struct tasklet_struct *t)
 {
-	if (!test_and_set_bit(TASKLET_STATE_SCHED, &t->state))
+	if (!test_bit(TASKLET_STATE_SCHED, &t->state))
 		__tasklet_hi_schedule(t);
 }
 
@@ -592,7 +592,7 @@ extern void __tasklet_hi_schedule_first(struct tasklet_struct *t);
  */
 static inline void tasklet_hi_schedule_first(struct tasklet_struct *t)
 {
-	if (!test_and_set_bit(TASKLET_STATE_SCHED, &t->state))
+	if (!test_bit(TASKLET_STATE_SCHED, &t->state))
 		__tasklet_hi_schedule_first(t);
 }
 
diff --git a/kernel/softirq.c b/kernel/softirq.c
index 4cacbcd..188ed51 100644
--- a/kernel/softirq.c
+++ b/kernel/softirq.c
@@ -787,34 +787,53 @@ static DEFINE_PER_CPU(struct tasklet_head, tasklet_hi_vec);
 static void inline
 __tasklet_common_schedule(struct tasklet_struct *t, struct tasklet_head *head, unsigned int nr)
 {
-	if (tasklet_trylock(t)) {
-again:
-		/* We may have been preempted before tasklet_trylock
-		 * and __tasklet_action may have already run.
-		 * So double check the sched bit while the takslet
-		 * is locked before adding it to the list.
+	do {
+		/* Read the current value of t->state and mask out RUN and
+		 * SCHED; we need to do this to be agnostic to bits other
+		 * (PENDING) than the two we are interested in.
 		 */
-		if (test_bit(TASKLET_STATE_SCHED, &t->state)) {
-			t->next = NULL;
-			*head->tail = t;
-			head->tail = &(t->next);
-			raise_softirq_irqoff(nr);
-			tasklet_unlock(t);
-		} else {
-			/* This is subtle. If we hit the corner case above
-			 * It is possible that we get preempted right here,
-			 * and another task has successfully called
-			 * tasklet_schedule(), then this function, and
-			 * failed on the trylock. Thus we must be sure
-			 * before releasing the tasklet lock, that the
-			 * SCHED_BIT is clear. Otherwise the tasklet
-			 * may get its SCHED_BIT set, but not added to the
-			 * list
-			 */
-			if (!tasklet_tryunlock(t))
-				goto again;
-		}
-	}
+		unsigned long exp =
+			t->state & ~(TASKLET_STATEF_RUN | TASKLET_STATEF_SCHED);
+
+		/* If neither RUN nor SCHED is set, set them both and schedule
+		 * the tasklet.
+		 */
+		if (cmpxchg(&t->state, exp,
+			    (exp | TASKLET_STATEF_RUN | TASKLET_STATEF_SCHED))
+		    == exp)
+			break;
+
+		/* If RUN is set but not SCHED, set SCHED and return. The
+		 * restart code (goto again) in __tasklet_action() will take
+		 * care of running the tasklet once more.
+		 */
+		else if (cmpxchg(&t->state, exp | TASKLET_STATEF_RUN,
+				 exp|TASKLET_STATEF_RUN|TASKLET_STATEF_SCHED)
+			 == (exp | TASKLET_STATEF_RUN))
+			return;
+
+		/* If SCHED is already set, do nothing...
+		 * This test is also done in interrupt.h, but SCHED might have
+		 * been set after that; if it has, we don't want to end up
+		 * looping forever.
+		 */
+		else if (t->state & TASKLET_STATEF_SCHED)
+			return;
+
+		/* We didn't match any of the if():s above, which means
+		 * t->state changed between the tests, because the tests cover
+		 * all possible combinations. Try again and let's hope
+		 * nothing changes this time.
+		 */
+	} while (1);
+
+	t->next = NULL;
+	*head->tail = t;
+	head->tail = &(t->next);
+	raise_softirq_irqoff(nr);
+
+
+	tasklet_unlock(t);
 }
 
 void __tasklet_schedule(struct tasklet_struct *t)
-- 
1.7.9.5
Comment 5 ZhangXiao 2013-11-27 02:49:48 UTC
(In reply to fredrik.markstrom from comment #4)
> Hello ZhangXiao, 
> 
> Thanks for paying attention to this bug. I was just preparing to submit a
> patch when I read your response. My patch has a slightly different approach
> setting RUN and SCHED simultaneously. My patch is larger but does not
> introduce any extra state, so I'm not sure which one is better. Have you
> analyzed the effect of your patch in relation to the tasklet_kill() stuff ?
> 

? I can't find any race condition caused by it. Any comments?

Before it tests and waits for a zero RUN bit, tasklet_kill() only sets/tests the SCHED bit. From my point of view it is safe. Please correct me if I missed or misunderstood anything.
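
For reference, mainline tasklet_kill() of that era looked roughly like this (quoted from memory, so treat it as an approximation; the -rt variant waits differently). Note it only sets/tests SCHED until the final wait for RUN:

/* Approximate 3.x mainline tasklet_kill(), from memory. */
void tasklet_kill(struct tasklet_struct *t)
{
	if (in_interrupt())
		printk("Attempt to kill tasklet from interrupt\n");

	/* Claim SCHED; if someone else holds it, wait for it to clear. */
	while (test_and_set_bit(TASKLET_STATE_SCHED, &t->state)) {
		do {
			yield();
		} while (test_bit(TASKLET_STATE_SCHED, &t->state));
	}
	tasklet_unlock_wait(t);		/* waits for the RUN bit to drop */
	clear_bit(TASKLET_STATE_SCHED, &t->state);
}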

> I'm currently running some extensive tests on both 3.4-rt (which we use) and
> the latest 3.10-rt.
> 
> What do you think ?
> 
> 
Good idea. :-) Delaying the setting of the SCHED bit into __tasklet_common_schedule() gets rid of this race condition without any additional flag bit. But it leaves a bigger footprint than mine: compared with my patch you use more cmpxchg() calls, which may kill more brain cells of human readers. :-)

Another way to resolve it: how about enlarging the scope of the local_irq_save()/local_irq_restore() pair in __tasklet_schedule() and __tasklet_hi_schedule()? Today __tasklet_schedule() does nothing but call __tasklet_common_schedule() under that local_irq protection, while its caller tasklet_schedule() does nothing but call it after a test_and_set_bit() on SCHED; __tasklet_hi_schedule() is implemented the same way. So how about enlarging the local_irq protection to include the test_and_set_bit() on SCHED? It should resolve the issue, but I am not familiar enough with the use cases to know how much overhead this modification would add. It depends on the probability that SCHED is already set when the function is called. :-/
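
A rough, untested sketch of what I mean (it assumes __tasklet_schedule() is changed to expect interrupts already disabled, since today it does its own local_irq_save/restore):

/* Rough, untested sketch -- NOT a proposed patch.  Pull the SCHED
 * test-and-set under the same irq-off region that chains the tasklet,
 * so nothing on this CPU can observe SCHED set while the tasklet is
 * not yet on a list. */
static inline void tasklet_schedule(struct tasklet_struct *t)
{
	unsigned long flags;

	local_irq_save(flags);
	if (!test_and_set_bit(TASKLET_STATE_SCHED, &t->state))
		__tasklet_schedule(t);	/* chains t and raises the softirq */
	local_irq_restore(flags);
}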

BTW: why do we use the RUN bit as an xxx_lock without renaming it to anything else? Can we just use it directly without wrappers like tasklet_trylock(), or just rename it to LOCK?

Thanks
Xiao
Comment 6 fredrik.markstrom 2013-11-27 12:04:03 UTC
Which solution is easier to understand is probably subjective. I think mine is, but that might be because I spent more time with it.

I don't think disabling interrupts will solve anything, since it only affects the local CPU. The whole sequence could theoretically happen with different cores racing, without any interrupts involved.

I agree on the naming of the RUN bit/xxx_lock, but I didn't want to complicate things and I think that could/should be a separate patch.
Comment 7 ZhangXiao 2013-11-28 01:30:26 UTC
On the local_irq pairs, you are right.

I suppose you are the maintainer of -rt. I don't care which patch gets used, but I hope this issue can be resolved soon.
Comment 8 StefanCiotec 2017-02-27 15:21:54 UTC
Hi guys,

Do you know if there has been any fix merged related to this?

I am facing a very similar issue in 3.12.69-rt92 and 4.1.5-rt5.
Comment 9 Sebastian A. Siewior 2020-03-17 12:12:50 UTC
(In reply to ZhangXiao from comment #5)
> ? I can't find any race condition caused by it. Any comments?

Any feedback from someone here in the past few years? 

In the meantime I posted ZhangXiao's patch on the rt-users mailing list for more review. It would be nice if you could reply adding your SOB.

> Thanks
> Xiao


Sebastian
Comment 10 fredrik.markstrom 2020-03-17 12:54:25 UTC
Created attachment 287957 [details]
attachment-17809-0.html

I have no further knowledge about this (and I've forgotten most of it).
We've been running with the patch I posted for years without any issues. We
have also been able to drop preempt-rt entirely, although I don't remember
if this issue was preempt-RT specific.

/Fredrik

Comment 11 wisape 2022-02-11 10:57:35 UTC

(In reply to fredrik.markstrom from comment #4)
> [Fredrik's full patch quoted; see comment 4 above.]

Hi Fredrik,

Is there a commit corresponding to this patch? We encountered a similar problem recently, but we did not find a corresponding modification in the -rt patches.

Note You need to log in before you can comment on or make changes to this bug.