As soon as any process is scheduled with real-time priority (either SCHED_FIFO or SCHED_RR), "some" other processes stop being scheduled by the kernel and hang forever. All seem stuck in schedule_timeout() one way or another (see Call Traces attached). This report contains one dead-simple test case that systematically hangs sshd. Other software unrelated to sshd also hangs in more convoluted test cases.

I have reproduced this on a variety of SMP hardware with the following kernels:

  2.6.27.41-170.2.117.fc10.i686
  2.6.32.11-99.fc12.x86_64
  2.6.32-21-generic #32-Ubuntu
  2.6.33.1-24.fc13.i686

To reproduce systematically, simply run:

  echo 10 > /proc/sys/kernel/hung_task_timeout_secs
  ssh localhost

  # In the same or a different terminal:
  cat /dev/urandom >/dev/null &
  burnerPID=$!
  taskset -p 1 $burnerPID    # optional, see below
  chrt --fifo -p 1 $burnerPID &

THEN type "exit" in your ssh session: sshd systematically hangs forever. Most other processes keep running fine. Warning: Xorg or text consoles might also freeze, making the system unusable. Ironically, logging *IN* using ssh seems the least vulnerable (only logging OUT hangs).

To unlock all hung processes, just run:

  chrt --other -p 0 $burnerPID

To switch back to "hanging mode", simply run again:

  chrt --fifo -p 1 $burnerPID &

Etc.

Not using taskset makes the problem slightly less annoying, but still very annoying:
- with taskset, programs are locked forever;
- without taskset, programs are locked for a possibly very long time. They seem to be unlocked whenever the real-time process is moved to a different core (which happens from time to time).

Note: apart from the hung sshd, the "/proc/sys/kernel/hung_task_*" feature does not always notice and report hung tasks. In that case "echo t > /proc/sysrq-trigger" provides Call Traces.
Created attachment 26453 [details] Call Traces for hung sshd and hung Xorg 2.6.27.41-170.2.117.fc10.i686
Created attachment 26454 [details] Call Trace for hung sshd with 2.6.33.1-24.fc13.i686
Created attachment 26455 [details] Call Trace for hung sshd with Ubuntu 2.6.32-21.32-generic
(cc Peter)

sshd got stuck because it is waiting on some per-cpu kernel thread, and that kernel thread isn't getting scheduled because there's a SCHED_FIFO task spinning on its CPU.

We consider that a user error, sorry - SCHED_FIFO tasks should yield the CPU sometimes. Any simple fix to this would degrade SCHED_FIFO's latency benefits. And I can't think of a complex fix :(

As for the second part of the bug report: yeah, perhaps it's improvable. The scheduler could migrate non-pinned rt-tasks to other CPUs more aggressively if it sees that a pinned task has become runnable on the CPU which the rt task is using.
Thanks Andrew for the insightful answer, much appreciated.

(In reply to comment #4)
> sshd got stuck because it is waiting on some per-cpu kernel thread, and that
> kernel thread isn't getting scheduled because there's a SCHED_FIFO task
> spinning on its CPU.
>
> We consider that a user error, sorry - SCHED_FIFO tasks should yield the CPU
> sometimes.

Could you please elaborate a tiny bit on how bad this "user error" is? I mean: what kinds of per-cpu kernel threads are there, and how critical are they? Please consider only the most favorable case, where the (obviously embedded) system is fully dedicated to the real-time task(s) and not interested in running any other process, sshd or anything else.

And also: would any of the "real-time" forks of the kernel by any chance support such a "user error"? Or is supporting such a use case just pure science fiction?
The kernel considers that privileged SCHED_FIFO userspace knows what it is doing and won't make things lock up. The only alternative would be for the kernel to preempt realtime userspace for its own purposes. This would degrade the latency performance of rt tasks, so we didn't do it.

I doubt that any of the rt kernels alter that policy. I expect that it'd be a pretty simple patch to alter this, and the switch could be made via a /proc knob, but the addition of such a knob wouldn't be very popular, I expect.

As for the "scheduler could migrate non-pinned rt-tasks to other CPUs more aggressively" thing: Peter didn't take the bait ;)
Reply-To: peterz@infradead.org

On Thu, 2010-05-27 at 22:36 +0000, bugzilla-daemon@bugzilla.kernel.org wrote:
> --- Comment #6 from Andrew Morton <akpm@linux-foundation.org> 2010-05-27
> 22:36:47 ---
> The kernel considers that privileged SCHED_FIFO userspace knows what it is
> doing and won't make things lock up. The only alternative would be for the
> kernel to preempt realtime userspace for its own purposes. This would
> degrade the latency performance for rt tasks so we didn't do it.
>
> I doubt if any of the rt kernels alter that policy. I expect that it'd be a
> pretty simple patch to alter this and the switch could be made via a /proc
> knob, but the addition of such a knob wouldn't be very popular, I expect.

Right, so there are a number of issues. A SCHED_FIFO-99 while(1) loop will only ever get preempted by migration/N, which is used by kstopmachine and explicit migration.

On -rt we made all blocking locks PI locks; this limits the inversion cases to !cpu resources, for which we don't currently have a proper resource model (see the workqueue example below).

As to jitter, SCHED_FIFO-99 while(1) loops will always still be preempted by hardware interrupts. On -rt we try to mitigate this by having very short hardware interrupt handlers that basically do nothing more than check the hardware and kick a kernel thread, after which it's back to scheduling as normal. Things like the timer interrupt (including the tick) still happen in hardirq context though, so you'll still see some of that.

There is talk of extending NO_HZ to the case where there is only a single runnable task, since at that point the kernel doesn't need to preempt etc. The hard work there is ensuring this only happens when all the regular housekeeping tasks done by the tick are either moved out of the tick or completed.
Once we know the tick is redundant we can actually stop it. It does take a lot of work to go through all the cases, but it shouldn't be too hard, since it's basically the same process as going into NO_HZ for the idle case, just with a few interesting extra cases.

Of course, if the process isn't strictly user bound it will very likely generate work for the CPU in question and prevent this from happening, but some HPC/RT workloads are perfectly happy with staying in userspace for a very long time indeed -- which renders this whole project worth doing (it just needs someone doing it).

But even then, a SCHED_FIFO-99 task that never sleeps is always going to stress the OS, as it's not going to give the OS much time for housekeeping. Things like the SLAB and page allocators have per-cpu workqueue thingies that want to run at times; by hogging the cpu with a SCHED_FIFO task, you'll block these and create inversion problems.

As Andrew stated, SCHED_FIFO is a privileged environment and we generally expect users to know what they're doing (i.e. you get to keep the pieces).

As to the knob, we actually have something like that, see Documentation/scheduler/sched-rt-group.txt, but note that an RT task that does get throttled is basically a buggy app, and in order to avoid serious starvation issues you really need -rt (as that has the proper PI for blocking locks).

> As for the "scheduler could migrate non-pinned rt-tasks to other CPUs more
> aggressively" thing: Peter didn't take the bait ;)

:-), it's on the todo list; we have the infrastructure to do so these days, it just needs a bit of time and care to make it work 'as expected' I guess.