Bug 5877 - Suspected scheduling starvation
Summary: Suspected scheduling starvation
Status: CLOSED CODE_FIX
Alias: None
Product: Process Management
Classification: Unclassified
Component: Scheduler
Hardware: i386 Linux
Importance: P2 normal
Assignee: Con Kolivas
Reported: 2006-01-12 11:59 UTC by Heikki Orsila
Modified: 2006-03-22 03:46 UTC
CC List: 1 user

Kernel Version: 2.6.15-rc7


Attachments
sched improve task noninteractive patch (5.38 KB, patch)
2006-01-12 17:06 UTC, Con Kolivas
2.6.15 interactivity rollup (5.74 KB, patch)
2006-01-12 21:19 UTC, Con Kolivas
2.6.15-rc7 interbench results without any patch (2.12 KB, text/plain)
2006-01-13 17:45 UTC, Heikki Orsila
2.6.15 interbench results with the second patch applied (2.11 KB, text/plain)
2006-01-13 17:47 UTC, Heikki Orsila

Description Heikki Orsila 2006-01-12 11:59:40 UTC
Most recent kernel where this bug did not occur:
Distribution: Gentoo

Hardware Environment:
Software Environment:
Linux e275d 2.6.15-rc7 #1 Fri Dec 30 03:58:06 EET 2005 x86_64 AMD Athlon(tm) 64
Processor 3000+ AuthenticAMD GNU/Linux

Gnu C                  3.4.4
Gnu make               3.80
binutils               2.16.1
util-linux             2.12r
mount                  2.12r
module-init-tools      3.0
e2fsprogs              1.38
jfsutils               1.1.8
reiserfsprogs          line
reiser4progs           line
xfsprogs               2.6.25
nfs-utils              1.0.6
Linux C Library        2.3.5
Dynamic linker (ldd)   2.3.5
Procps                 3.2.5
Net-tools              1.60
Kbd                    1.12
Sh-utils               5.2.1
udev                   070
Modules Loaded

The kernel is built without preemption and uses a 250 Hz timer.

Problem Description:

Currently Firefox and X can starve two of my processes so that they get
no timeslice for a few wall-clock seconds. I am getting huge
buffer underruns when playing sound with uade123 (uade 2.01 at
http://zakalwe.virtuaalipalvelin.net/uade/ ). The uadecore process is
connected to uade123 by two pipes. uadecore synthesizes sound data
and passes it to uade123. uade123 pushes the sound data to libao,
which pushes it to ALSA. Here is a short strace dump of what happens
when I open 3 tabs in Firefox and hold down Ctrl-Page Up so that Firefox
switches tabs rapidly (consuming lots of CPU). Normally
uadecore and uade123 together consume only 4.0% of CPU, but when starved
they get only a fraction of that. 'make soundcheck' should reproduce the
problem reliably.

22796 21:40:54 write(1, "Playing time position 2.9s in su"..., 60) = 60
22796 21:40:54 select(1, [0], NULL, NULL, {0, 0}) = 0 (Timeout)
22796 21:40:54 write(4, "\0\0\0\5\0\0\0\4\0\0\17\370", 12) = 12
22796 21:40:54 write(4, "\0\0\0\20\0\0\0\0", 8) = 8
22796 21:40:54 ioctl(6, 0x40184150, 0x7fffffc285b0) = 0
22796 21:40:54 read(5,  <unfinished ...>
* A full second without system calls
22797 21:40:56 <... read resumed> "\0\0\0\5\0\0\0\4", 8) = 8
22797 21:40:56 read(3, "\0\0\17\370", 4) = 4
22797 21:40:56 read(3, "\0\0\0\20\0\0\0\0", 8) = 8
22797 21:40:56 write(6, "\0\0\0\31\0\0\17\370\31\313\23\256\31\313\0261\31\313\$
22796 21:40:56 <... read resumed> "\0\0\0\31\0\0\17\370", 8) = 8
22797 21:40:56 <... write resumed> )    = 4096
22796 21:40:56 read(5, "\31\313\23\256\31\313\0261\31\313\27&\31\313\27\24\31\3$
22796 21:40:56 read(5,  <unfinished ...>
22797 21:40:56 write(6, "\0\0\0\20\0\0\0\0", 8 <unfinished ...>
22796 21:40:56 <... read resumed> "\0\0\0\20\0\0\0\0", 8) = 8
22797 21:40:56 <... write resumed> )    = 8
22796 21:40:56 write(1, "Playing time position 3.0s in su"..., 60) = 60
22796 21:40:56 select(1, [0], NULL, NULL, {0, 0}) = 0 (Timeout)
22796 21:40:56 write(4, "\0\0\0\5\0\0\0\4\0\0\17\370", 12) = 12
22796 21:40:56 write(4, "\0\0\0\20\0\0\0\0", 8) = 8
22796 21:40:56 ioctl(6, 0x40184150, 0x7fffffc285b0) = -1 EPIPE (Broken pipe)
22796 21:40:56 write(2, "ALSA: underrun, at least 0ms.\n", 30) = 30
22796 21:40:56 ioctl(6, 0x4140, 0x1e)   = 0
22796 21:40:56 read(5,  <unfinished ...>
22797 21:40:56 read(3, "\0\0\0\5\0\0\0\4", 8) = 8
22797 21:40:56 read(3, "\0\0\17\370", 4) = 4
22797 21:40:56 read(3, "\0\0\0\20\0\0\0\0", 8) = 8
22797 21:40:56 write(6, "\0\0\0\31\0\0\17\370\324Y\353C\324Y\352u\324Y\351(\324$
22796 21:40:56 <... read resumed> "\0\0\0\31\0\0\17\370", 8) = 8
22797 21:40:56 <... write resumed> )    = 4096
22796 21:40:56 read(5, "\324Y\353C\324Y\352u\324Y\351(\324Y\347\250\324Y\3462\3$
22796 21:40:56 read(5,  <unfinished ...>
22797 21:40:56 write(6, "\0\0\0\20\0\0\0\0", 8 <unfinished ...>
22796 21:40:56 <... read resumed> "\0\0\0\20\0\0\0\0", 8) = 8
22797 21:40:56 <... write resumed> )    = 8

I should also try this on the BSDs and on a 2.4.x kernel, but I do not have such
systems at hand.

Steps to reproduce:

Run 'make soundcheck' for uade123 (uade 2.01), open 3 tabs in Firefox, and hold
down Ctrl-Page Up so that Firefox switches tabs rapidly. This causes huge underruns.
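
A minimal sketch of how the stall could be measured without uade, assuming a
producer process feeding a consumer through a pipe in the same way uadecore
feeds uade123. The function names (run_probe, largest_gap), the block size and
the period are hypothetical choices for illustration and are not part of uade:

```python
import os
import time


def largest_gap(timestamps):
    """Return the largest interval between consecutive timestamps
    (0.0 if there are fewer than two)."""
    return max((b - a for a, b in zip(timestamps, timestamps[1:])), default=0.0)


def run_probe(chunks=50, chunk_size=4096, period=0.01):
    """Fork a producer that writes `chunks` blocks into a pipe, one per
    `period` seconds, and return the arrival time of every read made by
    the consumer."""
    read_fd, write_fd = os.pipe()
    pid = os.fork()
    if pid == 0:
        # Child: stands in for uadecore, producing data at a steady rate.
        os.close(read_fd)
        block = b"\0" * chunk_size
        for _ in range(chunks):
            os.write(write_fd, block)
            time.sleep(period)
        os._exit(0)

    # Parent: stands in for uade123, timestamping each block it receives.
    os.close(write_fd)
    arrivals = []
    while True:
        data = os.read(read_fd, chunk_size)
        if not data:  # EOF: the producer has closed its end of the pipe
            break
        arrivals.append(time.monotonic())
    os.close(read_fd)
    os.waitpid(pid, 0)
    return arrivals
```

Running largest_gap(run_probe()) on an idle machine and again while Firefox
is switching tabs would show whether the consumer stalls far beyond the
producer's period.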
Comment 1 Con Kolivas 2006-01-12 15:42:08 UTC
This looks like it is related to the TASK_NONINTERACTIVE flag for pipes. Can you
check to see if the problem existed prior to this change? 2.6.12 had the new
fatter deeper pipes but did not have the TASK_NONINTERACTIVE flag if I recall
correctly.
Comment 2 Heikki Orsila 2006-01-12 16:54:31 UTC
It seems I don't get any underruns on 2.6.12, which is great. I hope you can fix
this problem soon.
Comment 3 Con Kolivas 2006-01-12 17:06:40 UTC
Created attachment 7007 [details]
sched improve task noninteractive patch

Alter the activated mechanism to count all sleep time linearly, and let
TASK_NONINTERACTIVE-flagged tasks gain sleep average from it instead of
gaining no sleep average at all.
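
As a rough illustration of the change described in this comment, the toy
model below compares a task whose non-interactive sleeps earn no sleep
average with one whose sleeps are all credited linearly. It is not the kernel
code: the constants MAX_SLEEP_AVG_MS and MAX_BONUS, the bonus formula and all
function names are assumptions chosen for illustration.

```python
MAX_SLEEP_AVG_MS = 1000  # assumed cap on accumulated sleep credit
MAX_BONUS = 10           # assumed width of the dynamic-priority range


def credit_sleep(sleep_avg_ms, slept_ms, noninteractive, linear_credit):
    """Return the new sleep average after a task wakes up.

    linear_credit=False models the reported behaviour: a sleep flagged
    non-interactive earns nothing. linear_credit=True models the patch
    description: every sleep is counted linearly, up to the cap.
    """
    if noninteractive and not linear_credit:
        return sleep_avg_ms
    return min(MAX_SLEEP_AVG_MS, sleep_avg_ms + slept_ms)


def charge_run(sleep_avg_ms, ran_ms):
    """Running on the CPU burns sleep credit, never going below zero."""
    return max(0, sleep_avg_ms - ran_ms)


def priority_bonus(sleep_avg_ms):
    """Map the sleep average onto a bonus between -MAX_BONUS/2 and +MAX_BONUS/2."""
    return sleep_avg_ms * MAX_BONUS // MAX_SLEEP_AVG_MS - MAX_BONUS // 2


def simulate(cycles, slept_ms, ran_ms, linear_credit):
    """Run a pipe reader through sleep/run rounds and return its final bonus."""
    sleep_avg = 0
    for _ in range(cycles):
        sleep_avg = credit_sleep(sleep_avg, slept_ms, True, linear_credit)
        sleep_avg = charge_run(sleep_avg, ran_ms)
    return priority_bonus(sleep_avg)
```

For a reader that sleeps 24 ms and runs 1 ms per round (about the 4% CPU load
mentioned in the report), simulate(100, 24, 1, linear_credit=False) returns -5
and simulate(100, 24, 1, linear_credit=True) returns 4 in this model, so the
mostly sleeping task moves from the worst bonus to nearly the best.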
Comment 4 Con Kolivas 2006-01-12 17:07:42 UTC
Please try the patch I attached here to see if it helps 2.6.15. Also I am
interested in any detrimental interactivity effects of this patch.
Comment 5 Heikki Orsila 2006-01-12 17:24:07 UTC
> Please try the patch I attached here to see if it helps 2.6.15. Also I
> am interested in any detrimental interactivity effects of this patch.

Do you have any tips on what to test? Are there any scheduling test suites?

Comment 6 Con Kolivas 2006-01-12 17:29:43 UTC
I wrote an interactivity benchmark which covers some of the basics
(interbench.kolivas.org). You can use this for some hard measurements. A lot is
still up to you to test in your normal environment, to see how smoothly windows
move about and how well audio and video play back under your _normal_ workloads. I
don't really care how it feels with 'make -j16' in the background because
optimising for something like that is pointless and tends to favour unfair
scheduling.
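
For a quick hard number alongside a full benchmark run, a wakeup-latency
probe can be sketched as below. This is not how interbench is implemented; it
is a minimal illustration, with hypothetical function names, of measuring how
late a periodic task is woken while some other load runs.

```python
import time


def wakeup_latencies(period=0.01, samples=100):
    """Sleep until each deadline on a fixed schedule and return, in
    seconds, how late every wakeup was."""
    latencies = []
    deadline = time.monotonic() + period
    for _ in range(samples):
        delay = deadline - time.monotonic()
        if delay > 0:
            time.sleep(delay)
        # Lateness is the time that passed beyond the intended deadline.
        latencies.append(max(0.0, time.monotonic() - deadline))
        deadline += period
    return latencies


def summarize(latencies):
    """Return (mean, maximum) of a non-empty list of latencies."""
    return sum(latencies) / len(latencies), max(latencies)
```

Comparing summarize(wakeup_latencies()) on an idle desktop with the same call
made under the normal workload gives a rough before-and-after figure for a
patch.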
Comment 7 Heikki Orsila 2006-01-12 18:24:33 UTC
I compiled 2.6.15 with your patch and I'm not getting any underruns anymore :)
Comment 8 Con Kolivas 2006-01-12 21:19:39 UTC
Created attachment 7008 [details]
2.6.15 interactivity rollup

Great. That patch was part of a series I'm working on to correct a few current
quirks. Can you test this rolled up patch which contains all those in the
series to ensure it still fixes your problem? You will need to back out the
previous patch first.
Comment 9 Heikki Orsila 2006-01-13 03:41:23 UTC
I'll try it in the evening. -> work
Comment 10 Heikki Orsila 2006-01-13 12:21:36 UTC
I tested your newer patch too. It also worked well; no underruns. I will post
interbench results later for 2.6.15-rc7 and 2.6.15-interactive-patch.
Comment 11 Heikki Orsila 2006-01-13 17:45:57 UTC
Created attachment 7019 [details]
2.6.15-rc7 interbench results without any patch
Comment 12 Heikki Orsila 2006-01-13 17:47:29 UTC
Created attachment 7020 [details]
2.6.15 interbench results with the second patch applied

Here are both of the interbench results. The first one (2.6.15-rc7) is without
any patch, and the second one is with the second interactivity patch.
Comment 13 Con Kolivas 2006-01-13 18:18:37 UTC
Thanks for the results. The changes are consistent with what we would expect:
the CPU-heavy interactive tasks (like X) suffer more under I/O load, since these
patches also increase the bonuses of I/O-bound tasks (see the lkml thread).
These patches have been queued up for the next -mm, so I'm marking this bug as fixed.
Comment 14 Heikki Orsila 2006-03-22 02:38:12 UTC
The bug occurs with 2.6.16 too. Is this going to be fixed in the future? Is
merging the 2.6.15 interactivity patch safe for 2.6.16?
Comment 15 Con Kolivas 2006-03-22 02:46:44 UTC
The patches were merged into -mm and thus are following the normal cycle for
mainstream inclusion. They have been in the -mm kernel since then and I was planning
on pushing them for 2.6.17. There are some changes in 2.6.16 that prevent the
patch from applying cleanly. I closed this bug because a code fix was pushed to
-mm and will eventually be merged upstream. Only reopen the bug if the problem
is present in the current -mm kernel please.
Comment 16 Heikki Orsila 2006-03-22 02:54:51 UTC
> I was planning on pushing them for 2.6.17

Thanks. Will it also be pushed into the new 2.6.16.* series?
Comment 17 Con Kolivas 2006-03-22 02:58:37 UTC
No, because it was too big a change to go into 2.6.16, therefore it is
definitely too big a change to go into 2.6.16.x
For convenience I've posted a patch for 2.6.16 here:
http://ck.kolivas.org/patches/interactivity/2.6.16-O22.1int.patch
Comment 18 Heikki Orsila 2006-03-22 03:43:00 UTC
> No, because it was too big a change to go into 2.6.16, therefore it is
> definitely too big a change to go into 2.6.16.x

According to this:

 http://lkml.org/lkml/2005/12/3/55

2.6.16.x might be maintained for as long as 2 to 3 years. Having a deficient
scheduler for that long would greatly reduce the usability of that kernel
series. No doubt this will cause more bug reports, and I dislike the idea of
working around this problem in the application.
Comment 19 Con Kolivas 2006-03-22 03:46:19 UTC
Being maintained doesn't make it the "current" stable kernel. I'm not going to
argue the development model here. Feel free to debate this issue in the
appropriate place.
