Bug 16011 - Real-time policies hang some other processes running on other available cores
Summary: Real-time policies hang some other processes running on other available cores
Status: RESOLVED OBSOLETE
Alias: None
Product: Process Management
Classification: Unclassified
Component: Scheduler
Hardware: All
OS: Linux
Importance: P1 normal
Assignee: Ingo Molnar
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2010-05-20 10:57 UTC by Marc
Modified: 2013-12-10 21:36 UTC
CC List: 3 users

See Also:
Kernel Version: 2.6.33
Subsystem:
Regression: No
Bisected commit-id:


Attachments
Call Traces for hung sshd and hung Xorg 2.6.27.41-170.2.117.fc10.i686 (2.75 KB, text/plain)
2010-05-20 11:00 UTC, Marc
Call Trace for hung sshd with 2.6.33.1-24.fc13.i686 (1.93 KB, text/plain)
2010-05-20 11:04 UTC, Marc
Call Trace for hung sshd with Ubuntu 2.6.32-21.32-generic (1.70 KB, text/plain)
2010-05-20 11:07 UTC, Marc

Description Marc 2010-05-20 10:57:16 UTC
As soon as any process is scheduled with real-time priority (either
SCHED_FIFO or SCHED_RR), then "some" other processes stop being
scheduled by the kernel and hang forever. All seem stuck in
schedule_timeout() one way or the other (see Call Traces attached).

This report contains one dead-simple test case to systematically hang
_sshd_. Other software unrelated to sshd also hangs in more convoluted
test cases.

I have reproduced this on a variety of SMP hardware with the following
kernels:

 2.6.27.41-170.2.117.fc10.i686
 2.6.32.11-99.fc12.x86_64
 2.6.32-21-generic #32-Ubuntu
 2.6.33.1-24.fc13.i686


To reproduce systematically, simply run this:

   echo 10 > /proc/sys/kernel/hung_task_timeout_secs
   ssh localhost

   # In the same or different terminal:
   cat /dev/urandom >/dev/null  &  burnerPID=$!
   taskset -p 1 $burnerPID # optional, see below
   chrt --fifo -p 1 $burnerPID

THEN type "exit" in your ssh session: sshd systematically hangs
forever. Most other processes keep running fine. Warning: Xorg or text
consoles might also freeze, making the system unusable. Ironically,
logging *IN* using ssh seems the least vulnerable (only logging OUT hangs).

To unlock all hung processes just run:

   chrt --other -p 0 $burnerPID

To switch back to "hanging mode" again simply run:

   chrt --fifo -p 1 $burnerPID

Etc.


Not using taskset makes the problem slightly less annoying, though
still very annoying:
- with taskset, programs are locked forever
- without taskset, programs are locked for a possibly very long
  time. They seem to be unlocked whenever the real-time process is
  moved to a different core (which happens from time to time).

Note: apart from the hung sshd, the "/proc/sys/kernel/hung_task_*" feature
does not always notice and report hung tasks. In that case,
"echo t >/proc/sysrq-trigger" provides the call traces.
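
For reference, the shell recipe above boils down to a few system calls. A
minimal C sketch (rt_hog.c is a hypothetical name; it pins itself to CPU 0,
switches to SCHED_FIFO priority 1, and spins; needs root):

   /* rt_hog.c -- hypothetical; build: gcc -o rt_hog rt_hog.c, run as root */
   #define _GNU_SOURCE
   #include <sched.h>
   #include <stdio.h>

   int main(void)
   {
       cpu_set_t set;
       CPU_ZERO(&set);
       CPU_SET(0, &set);                  /* same effect as: taskset -p 1 $PID */
       if (sched_setaffinity(0, sizeof(set), &set))
           perror("sched_setaffinity");   /* optional, as noted above */

       struct sched_param sp = { .sched_priority = 1 };
       if (sched_setscheduler(0, SCHED_FIFO, &sp)) {  /* chrt --fifo -p 1 $PID */
           perror("sched_setscheduler");
           return 1;
       }
       for (;;)
           ;                              /* never blocks, never yields */
   }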
Comment 1 Marc 2010-05-20 11:00:29 UTC
Created attachment 26453 [details]
Call Traces for hung sshd and hung Xorg 2.6.27.41-170.2.117.fc10.i686
Comment 2 Marc 2010-05-20 11:04:01 UTC
Created attachment 26454 [details]
Call Trace for hung sshd with 2.6.33.1-24.fc13.i686
Comment 3 Marc 2010-05-20 11:07:19 UTC
Created attachment 26455 [details]
Call Trace for hung sshd with Ubuntu 2.6.32-21.32-generic
Comment 4 Andrew Morton 2010-05-21 20:46:11 UTC
(cc Peter)

sshd got stuck because it is waiting on some per-cpu kernel thread, and that kernel thread isn't getting scheduled because there's a SCHED_FIFO task spinning on its CPU.

We consider that a user error, sorry - SCHED_FIFO tasks should yield the CPU sometimes.  Any simple fix to this would degrade SCHED_FIFO's latency benefits.  And I can't think of a complex fix :(
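
As a hedged illustration of "yield the CPU sometimes": an RT loop that blocks
every cycle, e.g. via clock_nanosleep(), leaves the per-cpu kernel threads a
window in which to run. A minimal sketch (the 1 ms period is illustrative,
not from this report):

   #include <sched.h>
   #include <stdio.h>
   #include <time.h>

   int main(void)
   {
       struct sched_param sp = { .sched_priority = 1 };
       if (sched_setscheduler(0, SCHED_FIFO, &sp))    /* needs root */
           perror("sched_setscheduler");

       struct timespec next;
       clock_gettime(CLOCK_MONOTONIC, &next);
       for (;;) {
           /* ... one chunk of real-time work here ... */

           /* Sleep until the next 1 ms boundary. This blocking call is
            * what lets per-cpu kernel threads (and sshd) make progress. */
           next.tv_nsec += 1000000;
           if (next.tv_nsec >= 1000000000L) {
               next.tv_nsec -= 1000000000L;
               next.tv_sec++;
           }
           clock_nanosleep(CLOCK_MONOTONIC, TIMER_ABSTIME, &next, NULL);
       }
   }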

As for the second part of the bug report: yeah, perhaps it's improvable.  The scheduler could migrate non-pinned rt-tasks to other CPUs more aggressively if it sees that a pinned task has become runnable on the CPU which the rt task is using.
Comment 5 Marc 2010-05-27 22:11:28 UTC
Thanks Andrew for the insightful answer, much appreciated.

(In reply to comment #4)
> sshd got stuck because it is waiting on some per-cpu kernel thread, and that
> kernel thread isn't getting scheduled because there's a SCHED_FIFO task
> spinning on its CPU.
> 
> We consider that a user error, sorry - SCHED_FIFO tasks should yield the CPU
> sometimes.

Could you please elaborate a tiny bit on how bad this "user error" is? I mean: what kinds of per-cpu kernel threads are there, and how critical are they? Please consider only the most favorable case, where the (obviously embedded) system is fully dedicated to the real-time task(s) and not interested in running any other process, sshd or anything else.

And also: would any of the "real-time" forks of the kernel by any chance support such a "user error"? Or is supporting such a use case pure science fiction?
Comment 6 Andrew Morton 2010-05-27 22:36:47 UTC
The kernel considers that privileged SCHED_FIFO userspace knows what it is doing and won't make things lock up.  The only alternative would be for the kernel to preempt realtime userspace for its own purposes.  This would degrade the latency performance for rt tasks, so we didn't do it.

I doubt if any of the rt kernels alter that policy.  I expect that it'd be a pretty simple patch to alter this and the switch could be made via a /proc knob, but the addition of such a knob wouldn't be very popular, I expect.

As for the "scheduler could migrate non-pinned rt-tasks to other CPUs more aggressively" thing: Peter didn't take the bait ;)
Comment 7 Anonymous Emailer 2010-05-31 11:05:22 UTC
Reply-To: peterz@infradead.org

On Thu, 2010-05-27 at 22:36 +0000, bugzilla-daemon@bugzilla.kernel.org
wrote:

> --- Comment #6 from Andrew Morton <akpm@linux-foundation.org>  2010-05-27 22:36:47 ---
> The kernel considers that privileged SCHED_FIFO userspace knows what it is
> doing and won't make things lock up.  The only alternative would be for the
> kernel to preempt realtime userspace for its own purposes.  This would
> degrade the latency performance for rt tasks, so we didn't do it.
> 
> I doubt if any of the rt kernels alter that policy.  I expect that it'd be a
> pretty simple patch to alter this and the switch could be made via a /proc
> knob, but the addition of such a knob wouldn't be very popular, I expect.

Right, so there are a number of issues. A SCHED_FIFO-99 while(1) loop will
only ever get preempted by migration/# (the per-cpu stop threads), which is
used by kstopmachine and explicit migration.

On -rt we made all blocking locks PI locks; this limits the inversion
cases to non-CPU resources, for which we don't currently have a proper
resource model (see the workqueue example below).

As to jitter, SCHED_FIFO-99 while(1) loops will still always be
preempted by hardware interrupts. On -rt we try to mitigate this by
having very short hardware interrupt handlers that basically do nothing
more than check the hardware and kick a kernel thread, after which it's
back to scheduling as normal.

Things like the timer interrupt (including the tick) still happen in
hardirq context though, so you'll still see some of that.

There is talk of extending NO_HZ to the case where there is only a
single runnable task, since at that point the kernel doesn't need to
preempt, etc. The hard work there is ensuring this only happens when all
the regular housekeeping tasks done by the tick have either been moved
out of the tick or completed.

Once we know the tick is redundant, we can actually stop it. It does
take a lot of work to go through all the cases, but it shouldn't be too
hard, since it's basically the same process as going into NO_HZ for the
idle case, just with a few interesting extra cases.

Of course, as soon as the process isn't strictly userspace-bound it will
very likely generate work for the CPU in question and prevent this from
happening, but some HPC/RT workloads are perfectly happy staying in
userspace for a very long time indeed -- which makes this whole project
worth doing (it just needs someone to do it).

But even then, a SCHED_FIFO-99 task that never sleeps is always going to
stress the OS, as it's not going to give the OS much time for
housekeeping. Things like the SLAB and page allocators have per-cpu
workqueue thingies that want to run at times; by hogging the CPU with a
SCHED_FIFO task you'll block these and create inversion problems.

As Andrew stated, SCHED_FIFO is a privileged environment and we
generally expect users to know what they're doing (i.e. you get to keep
the pieces).

As to the knob, we actually have something like that; see
Documentation/scheduler/sched-rt-group.txt. But note that an RT task
that does get throttled is basically a buggy app, and in order to avoid
serious starvation issues you really need -rt (as that has the proper PI
for blocking locks).
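
For context: the document referenced above describes the global throttling
pair /proc/sys/kernel/sched_rt_period_us and
/proc/sys/kernel/sched_rt_runtime_us (mainline defaults 1000000 and 950000,
i.e. RT tasks may consume at most 0.95 s of every 1 s before being
throttled). A minimal sketch that just reads the two knobs back, assuming
the stock /proc paths:

   #include <stdio.h>

   /* Read a single integer sysctl value; returns -1 on error. */
   static long knob(const char *path)
   {
       long v = -1;
       FILE *f = fopen(path, "r");
       if (f) {
           if (fscanf(f, "%ld", &v) != 1)
               v = -1;
           fclose(f);
       }
       return v;
   }

   int main(void)
   {
       printf("sched_rt_period_us  = %ld\n",
              knob("/proc/sys/kernel/sched_rt_period_us"));
       printf("sched_rt_runtime_us = %ld\n",
              knob("/proc/sys/kernel/sched_rt_runtime_us"));
       return 0;
   }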

> As for the "scheduler could migrate non-pinned rt-tasks to other CPUs more
> aggressively" thing: Peter didn't take the bait ;)

:-) It's on the todo list; we have the infrastructure to do so these
days, it just needs a bit of time and care to make it work 'as
expected', I guess.
