Bug 12562 - High overhead while switching or synchronizing threads on different cores
Summary: High overhead while switching or synchronizing threads on different cores
Status: REJECTED INVALID
Alias: None
Product: Process Management
Classification: Unclassified
Component: Scheduler
Hardware: All
OS: Linux
Importance: P1 normal
Assignee: Ingo Molnar
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2009-01-28 06:35 UTC by Thomas Pilarski
Modified: 2009-02-02 21:20 UTC
CC List: 0 users

See Also:
Kernel Version: 2.6.28
Subsystem:
Regression: ---
Bisected commit-id:


Attachments
testcase (8.21 KB, text/x-csrc), 2009-01-28 06:37 UTC, Thomas Pilarski
testcase (8.12 KB, text/x-csrc), 2009-01-28 07:17 UTC, Thomas Pilarski

Description Thomas Pilarski 2009-01-28 06:35:14 UTC
Hardware Environment: Core2Duo 2.4GHz / 4GB RAM 
Software Environment: Ubuntu 8.10 + Vanilla 2.6.28

Hardware Environment: AMD64 X2 2.1GHz / 6GB RAM 
Software Environment: Ubuntu 8.10 + Vanilla 2.6.28.2

Problem Description:
The overhead on a dual core while switching between tasks is extremely high (>60% of cpu time). It is produced by synchronization with pthread mutex/cond.

Execute the attached program as "schedulingissue 1 1024 8 20", which creates a producer and a consumer thread with eight 8kb buffers. The producer fills a buffer with 1024 randomly generated double values; the consumer does the same work after receiving the buffer.

While executing one instance of the program the throughput is ~1.6 msg/s. While executing two instances of the program, the throughput is much higher (2 * 8.7 msg/s = 17.4 msg/s).

There is a small improvement when using jiffies as the clocksource instead of acpi_pm or hpet (1.8 msg/s instead of 1.6). Disabling NO_HZ and HIGH_RES_TIMERS gives no improvement. Performance is much higher with kernels <= 2.6.24, but still four times slower.
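
For illustration, here is a minimal sketch of the kind of pthread mutex/cond producer-consumer handoff the testcase exercises. This is an assumption for readability, not the attached code: the buffer counts follow the example invocation, and the per-buffer work is done while holding the lock only to keep the sketch short.

#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>

#define NBUF      8     /* number of buffers, as in the example invocation */
#define NDOUBLES  1024  /* doubles per buffer (8kb) */
#define NMESSAGES 20

static double buffers[NBUF][NDOUBLES];
static int head, tail, filled;                 /* ring of filled buffers */
static pthread_mutex_t lock      = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  not_empty = PTHREAD_COND_INITIALIZER;
static pthread_cond_t  not_full  = PTHREAD_COND_INITIALIZER;
static double sink;                            /* keeps the consumer's work alive */

static void *producer(void *arg)
{
    int m, i;
    (void)arg;
    for (m = 0; m < NMESSAGES; m++) {
        pthread_mutex_lock(&lock);
        while (filled == NBUF)                 /* wait for a free buffer */
            pthread_cond_wait(&not_full, &lock);
        for (i = 0; i < NDOUBLES; i++)         /* generate one message */
            buffers[head][i] = (double)random();
        head = (head + 1) % NBUF;
        filled++;
        pthread_cond_signal(&not_empty);
        pthread_mutex_unlock(&lock);
    }
    return NULL;
}

static void *consumer(void *arg)
{
    int m, i;
    (void)arg;
    for (m = 0; m < NMESSAGES; m++) {
        pthread_mutex_lock(&lock);
        while (filled == 0)                    /* wait for a filled buffer */
            pthread_cond_wait(&not_empty, &lock);
        for (i = 0; i < NDOUBLES; i++)         /* same amount of work as the producer */
            sink += buffers[tail][i] + (double)random();
        tail = (tail + 1) % NBUF;
        filled--;
        pthread_cond_signal(&not_full);
        pthread_mutex_unlock(&lock);
    }
    return NULL;
}

int main(void)
{
    pthread_t p, c;
    pthread_create(&p, NULL, producer, NULL);
    pthread_create(&c, NULL, consumer, NULL);
    pthread_join(p, NULL);
    pthread_join(c, NULL);
    printf("done (%f)\n", sink);
    return 0;
}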

---------------------------------------
Linux bugs-laptop 2.6.28-hz-hrt #4 SMP Wed Jan 28 13:33:18 CET 2009 x86_64 GNU/Linux
acpi_pm (same results as hpet)
schedulerissue 1 1024 8 20
All threads finished: 20 messages in 12.295 seconds / 1.627 msg/s
schedulerissue 1 1024 8 200 & schedulerissue 1 1024 8 200
All threads finished: 200 messages in 22.882 seconds / 8.741 msg/s
All threads finished: 200 messages in 22.934 seconds / 8.721 msg/s
---------------------------------------
Linux bugs-laptop 2.6.28-hz-hrt #4 SMP Wed Jan 28 13:33:18 CET 2009 x86_64 GNU/Linux
jiffies
schedulerissue 1 1024 8 20
All threads finished: 20 messages in 10.704 seconds / 1.868 msg/s
schedulerissue 1 1024 8 200 & schedulerissue 1 1024 8 200
All threads finished: 200 messages in 23.372 seconds / 8.557 msg/s
All threads finished: 200 messages in 23.460 seconds / 8.525 msg/s
--------------------------------------
Linux bugs-laptop 2.6.24.7 #1 SMP Wed Jan 14 10:21:04 CET 2009 x86_64 GNU/Linux
hpet 
schedulerissue 1 1024 8 20
All threads finished: 20 messages in 5.290 seconds / 3.781 msg/s
schedulerissue 1 1024 8 200 & schedulerissue 1 1024 8 200
All threads finished: 200 messages in 23.000 seconds / 8.695 msg/s
All threads finished: 200 messages in 23.078 seconds / 8.666 msg/s


AMD64 X2 @ 2.1GHz
Linux bugs-desktop 2.6.28.2 #4 SMP Mon Jan 26 20:26:12 CET 2009 x86_64 GNU/Linux
acpi_pm
schedulerissue 1 1024 8 20
All threads finished: 20 messages in 9.288 seconds / 2.153 msg/s
schedulerissue 1 1024 8 200
All threads finished: 200 messages in 17.049 seconds / 11.731 msg/s
All threads finished: 200 messages in 18.539 seconds / 10.788 msg/s
Comment 1 Thomas Pilarski 2009-01-28 06:37:19 UTC
Created attachment 20030 [details]
testcase

gcc -O3 -lm -lrt -lpthread ThreadSchedulingIssue.c -o schedulingissue
Comment 2 Thomas Pilarski 2009-01-28 07:17:26 UTC
Created attachment 20031 [details]
testcase

Removed constants from testcase

The results in the description were obtained with the parameters "schedulingissue 1 524288 4 20" and "schedulingissue 1 524288 4 200".
Comment 3 Anonymous Emailer 2009-01-28 12:56:35 UTC
Reply-To: akpm@linux-foundation.org


(switched to email.  Please respond via emailed reply-to-all, not via the
bugzilla web interface).

On Wed, 28 Jan 2009 06:35:20 -0800 (PST)
bugme-daemon@bugzilla.kernel.org wrote:

> http://bugzilla.kernel.org/show_bug.cgi?id=12562
> 
>            Summary: High overhead while switching or synchronizing threads
>                     on different cores

Thanks for the report, and the testcase.

>            Product: Process Management
>            Version: 2.5
>      KernelVersion: 2.6.28
>           Platform: All
>         OS/Version: Linux
>               Tree: Mainline
>             Status: NEW
>           Severity: normal
>           Priority: P1
>          Component: Scheduler
>         AssignedTo: mingo@elte.hu
>         ReportedBy: thomas.pi@arcor.de

(There's testcase code in the bugzilla report)

(Seems to be a regression)

> 
> Hardware Environment: Core2Duo 2.4GHz / 4GB RAM 
> Software Environment: Ubuntu 8.10 + Vanilla 2.6.28
> 
> Hardware Environment: AMD64 X2 2.1GHz / 6GB RAM 
> Software Environment: Ubuntu 8.10 + Vanilla 2.6.28.2
> 
> Problem Description:
> The overhead on a dual core while switching between tasks is extremely high
> (>60% of cputime). If is produced by synchronization with pthread and
> mutex/cond. 
> 
> Executing the attaches program schedulingissue 1 1024 8 20, which create a
> producer and a consumer thread with eight 8kb big buffers. The producer
> creates
> 1024 random generated double values, consumer makes the same after receiving
> the buffer.
> 
> While executing the program the thoughtput is ~1.6 msg/s. While executing two
> instances of the program, the thoughtput is much higher (2 * 8.7 msg/s = 17,4
> msg/s). 
> 
> Small improvement while using jiffies as clocksource instead of acpi_pm or
> hpet
> (1.8 messages instead of 1.6). Disabling NO_HZ and HIGH_RESOLUTION_TIME gives
> no improvement. Much higher performance with kernel <= 2.6.24, but still four
> times slower.

Unclear.  What is four times slower than what?  You're saying that the
app progresses four times faster when there are two instances of it
running, rather than one instance?


> ---------------------------------------
> Linux bugs-laptop 2.6.28-hz-hrt #4 SMP Wed Jan 28 13:33:18 CET 2009 x86_64
> GNU/Linux
> acpi_pm (equal with htep)
> schedulerissue 1 1024 8 20
> All threads finished: 20 messages in 12.295 seconds / 1.627 msg/s
> schedulerissue 1 1024 8 200 & schedulerissue 1 1024 8 200
> All threads finished: 200 messages in 22.882 seconds / 8.741 msg/s
> All threads finished: 200 messages in 22.934 seconds / 8.721 msg/s
> ---------------------------------------
> Linux bugs-laptop 2.6.28-hz-hrt #4 SMP Wed Jan 28 13:33:18 CET 2009 x86_64
> GNU/Linux
> jiffies
> schedulerissue 1 1024 8 20
> All threads finished: 20 messages in 10.704 seconds / 1.868 msg/s
> schedulerissue 1 1024 8 200 & schedulerissue 1 1024 8 200
> All threads finished: 200 messages in 23.372 seconds / 8.557 msg/s
> All threads finished: 200 messages in 23.460 seconds / 8.525 msg/s
> --------------------------------------
> Linux bugs-laptop 2.6.24.7 #1 SMP Wed Jan 14 10:21:04 CET 2009 x86_64
> GNU/Linux
> hpet 
> schedulerissue 1 1024 8 20
> All threads finished: 20 messages in 5.290 seconds / 3.781 msg/s
> schedulerissue 1 1024 8 200 & schedulerissue 1 1024 8 200
> All threads finished: 200 messages in 23.000 seconds / 8.695 msg/s
> All threads finished: 200 messages in 23.078 seconds / 8.666 msg/s
> 

Seems that 2.6.24 is faster than 2.6.28 with 20 messages, but 2.6.24
and 2.6.28 run at the same speed when 200 messages are sent?

If so, that seems rather odd, doesn't it?  Is it possible that cpufreq
does something bad once the CPU gets hot?


> AMD64 X2 @ 2.1GHz
> Linux bugs-desktop 2.6.28.2 #4 SMP Mon Jan 26 20:26:12 CET 2009 x86_64
> GNU/Linux
> acpi_pm
> schedulerissue 1 1024 8 20
> All threads finished: 20 messages in 9.288 seconds / 2.153 msg/s
> schedulerissue 1 1024 8 200
> All threads finished: 200 messages in 17.049 seconds / 11.731 msg/s
> All threads finished: 200 messages in 18.539 seconds / 10.788 msg/s
Comment 4 Peter Zijlstra 2009-01-28 14:16:14 UTC
On Wed, 2009-01-28 at 12:56 -0800, Andrew Morton wrote:
> (switched to email.  Please respond via emailed reply-to-all, not via the
> bugzilla web interface).
> 
> On Wed, 28 Jan 2009 06:35:20 -0800 (PST)
> bugme-daemon@bugzilla.kernel.org wrote:
> 
> > http://bugzilla.kernel.org/show_bug.cgi?id=12562
> > 
> >            Summary: High overhead while switching or synchronizing threads
> >                     on different cores
> 
> Thanks for the report, and the testcase.
> 
> >            Product: Process Management
> >            Version: 2.5
> >      KernelVersion: 2.6.28
> >           Platform: All
> >         OS/Version: Linux
> >               Tree: Mainline
> >             Status: NEW
> >           Severity: normal
> >           Priority: P1
> >          Component: Scheduler
> >         AssignedTo: mingo@elte.hu
> >         ReportedBy: thomas.pi@arcor.de
> 
> (There's testcase code in the bugzilla report)
> 
> (Seems to be a regression)

Is there a known good kernel?

> > 
> > Hardware Environment: Core2Duo 2.4GHz / 4GB RAM 
> > Software Environment: Ubuntu 8.10 + Vanilla 2.6.28
> > 
> > Hardware Environment: AMD64 X2 2.1GHz / 6GB RAM 
> > Software Environment: Ubuntu 8.10 + Vanilla 2.6.28.2
> > 
> > Problem Description:
> > The overhead on a dual core while switching between tasks is extremely high
> > (>60% of cputime). If is produced by synchronization with pthread and
> > mutex/cond. 
> > 
> > Executing the attaches program schedulingissue 1 1024 8 20, which create a
> > producer and a consumer thread with eight 8kb big buffers. The producer creates
> > 1024 random generated double values, consumer makes the same after receiving
> > the buffer.
> > 
> > While executing the program the thoughtput is ~1.6 msg/s. While executing two
> > instances of the program, the thoughtput is much higher (2 * 8.7 msg/s = 17,4
> > msg/s). 
> > 
> > Small improvement while using jiffies as clocksource instead of acpi_pm or hpet
> > (1.8 messages instead of 1.6). Disabling NO_HZ and HIGH_RESOLUTION_TIME gives
> > no improvement. Much higher performance with kernel <= 2.6.24, but still four
> > times slower.
> 
> Unclear.  What is four times slower than what?  You're saying that the
> app progresses four times faster when there are two instances of it
> running, rather than one instance?

It seems that way indeed, a bit more clarity would be good though.

> > ---------------------------------------
> > Linux bugs-laptop 2.6.28-hz-hrt #4 SMP Wed Jan 28 13:33:18 CET 2009 x86_64
> > GNU/Linux
> > acpi_pm (equal with htep)
> > schedulerissue 1 1024 8 20
> > All threads finished: 20 messages in 12.295 seconds / 1.627 msg/s
> > schedulerissue 1 1024 8 200 & schedulerissue 1 1024 8 200
> > All threads finished: 200 messages in 22.882 seconds / 8.741 msg/s
> > All threads finished: 200 messages in 22.934 seconds / 8.721 msg/s
> > ---------------------------------------
> > Linux bugs-laptop 2.6.28-hz-hrt #4 SMP Wed Jan 28 13:33:18 CET 2009 x86_64
> > GNU/Linux
> > jiffies
> > schedulerissue 1 1024 8 20
> > All threads finished: 20 messages in 10.704 seconds / 1.868 msg/s
> > schedulerissue 1 1024 8 200 & schedulerissue 1 1024 8 200
> > All threads finished: 200 messages in 23.372 seconds / 8.557 msg/s
> > All threads finished: 200 messages in 23.460 seconds / 8.525 msg/s
> > --------------------------------------
> > Linux bugs-laptop 2.6.24.7 #1 SMP Wed Jan 14 10:21:04 CET 2009 x86_64 GNU/Linux
> > hpet 
> > schedulerissue 1 1024 8 20
> > All threads finished: 20 messages in 5.290 seconds / 3.781 msg/s
> > schedulerissue 1 1024 8 200 & schedulerissue 1 1024 8 200
> > All threads finished: 200 messages in 23.000 seconds / 8.695 msg/s
> > All threads finished: 200 messages in 23.078 seconds / 8.666 msg/s
> > 
> 
> Seems that 2.6.24 is faster than 2.6.28 with 20 messages, but 2.6.24
> and 2.6.28 run at the same speed when 200 messages are sent?
> 
> If so, that seems rather odd, doesn't it?  Is it possible that cpufreq
> does something bad once the CPU gets hot?

Nah, I'll bet it's a cache affinity issue.

Some applications like strong wakeup affinity, others not so much. This one
looks to be a lover.

With a single instance, the producer and consumer get scheduled on two
different cores for some reason (maybe wake-idle is too strong).

With two instances, they get to stay on the same cpu, since the other
cpu is already busy.

I'll start up the browser in the morning to download this proglet and
poke at it some, but sleep comes first.
Comment 5 Thomas Pilarski 2009-01-28 14:26:23 UTC
On Wednesday, 28.01.2009, at 12:56 -0800, Andrew Morton wrote: 

> (There's testcase code in the bugzilla report)
> 
> (Seems to be a regression)

There is a regression, caused by the improved cpu switching, but the problem exists in every kernel. 
It takes a lot of time to switch between the threads when they are executed on different cores.
Perhaps because of the big buffer size of 512KB?
 
> > Small improvement while using jiffies as clocksource instead of acpi_pm or hpet
> > (1.8 messages instead of 1.6). Disabling NO_HZ and HIGH_RESOLUTION_TIME gives
> > no improvement. Much higher performance with kernel <= 2.6.24, but still four
> > times slower.
> 
> Unclear.  What is four times slower than what?  You're saying that the
> app progresses four times faster when there are two instances of it
> running, rather than one instance?

About 4 messages per second while executing only one instance, and
about 8 messages per second while executing two instances of the test.
That makes 16 messages per second when the two threads of an instance
are executed on only one core.

> Seems that 2.6.24 is faster than 2.6.28 with 20 messages, but 2.6.24
> and 2.6.28 run at the same speed when 200 messages are sent?

I have executed the test twenty times. The result stays constant on
2.6.28. On 2.6.24, about one in ten runs is slower.

******* kernel 2.6.28:
All threads finished: 20 messages in 12.853 seconds / 1.556 msg/s
real	0m12.857s
user	0m8.589s
sys	0m16.629s

******* kernel 2.6.24:
All threads finished: 20 messages in 4.939 seconds / 4.050 msg/s
real	0m4.942s
user	0m5.248s
sys	0m4.352s

About one in ten executions goes down to 1.806 msg/s.
All threads finished: 20 messages in 11.074 seconds / 1.806 msg/s
real	0m11.077s
user	0m8.817s
sys	0m12.925s

> If so, that seems rather odd, doesn't it?  Is it possible that cpufreq
> does something bad once the CPU gets hot?

I have disabled acpid, clocked the cpu to 2.4GHz and watched the
temperature of the cores and the frequency. The clock always stays at
2.4GHz and the temperature is always below 67°C. My cpu only clocks down
at 95°C.
Comment 6 Peter Zijlstra 2009-01-29 01:07:36 UTC
On Wed, 2009-01-28 at 23:25 +0100, Thomas Pilarski wrote:
> On Wednesday, 28.01.2009, at 12:56 -0800, Andrew Morton wrote: 
> 
> > (There's testcase code in the bugzilla report)
> > 
> > (Seems to be a regression)
> 
> There is a regression, because of the improved cpu switching. The
> problem exists in every kernel. 

This is a contradiction in terms - twice.

If it is a regression, then clearly things haven't improved.

If it is a regression, state clearly when it worked last. If it never
worked, it cannot be a regression.

> I takes a lot of time to switch between the threads, when they are
> executed on different cores.
> Perhaps of the big buffer size of 512KB?

Of course, pushing 512kb to another cpu means lots and lots of cache
misses.
Comment 7 Thomas Pilarski 2009-01-29 02:13:18 UTC
> > There is a regression, because of the improved cpu switching. The
> > problem exists in every kernel. 
> 
> This is a contradiction in terms - twice.
> 
> If it is a regression, then clearly things haven't improved.
> 
> If it is a regression, state clearly when it worked last. If it never
> worked, it cannot be a regression.

There is an improvement in load balancing for single-threaded
applications. It's a regression for my problem, but the problem exists
in every kernel I have tested.

> > I takes a lot of time to switch between the threads, when they are
> > executed on different cores.
> > Perhaps of the big buffer size of 512KB?
> 
> Of course, pushing 512kb to another cpu means lots and lots of cache
> misses.

I have tried 2.6.15, 2.6.18 and 2.6.20 too, with the same behavior as in
2.6.24.
With Windows I can get 64 messages per second with a buffer size of 512
KB. It is reduced to 16 messages with a buffer size of 1MB. But I think
it is not really comparable, because there is nearly no cpu consumption
with 512kB; perhaps random() works differently. By increasing the cpu
usage in the producer eight times, I can get 16 msg/s and both cores are
used at about ~50%. Doing the same with Linux I get a throughput of
~2 msg/s.

If it is a caching issue, shouldn't it exist on Windows too?

With a smaller buffer of 4KB, the test is executed on one core only. 
./schedulerissue 1 4096 8 2000
All threads finished: 2000 messages in 1.631 seconds / 1226.076 msg/s
real	0m1.635s
user	0m1.352s
sys	0m0.052s


But I want to use both cores to increase the performance. Adding a
second producer and a second consumer reduces the performance to 33%.
Both cores are used.
./schedulerissue 2 4096 8 2000
All threads finished: 1999 messages in 4.744 seconds / 421.379 msg/s
real	0m4.748s
user	0m3.280s
sys	0m5.852s

I have added a new version as there was a possible deadlock during
shut-down.
Comment 8 Thomas Pilarski 2009-01-29 02:24:23 UTC
Some explanation of the test program. 

./schedulerissue 1 4096 8 2000
1 producer and 1 consumer
buffer size of 4096 doubles * 8 bytes
8 buffers (256kB total buffer)
2000 messages

./schedulerissue 2 4096 8 2000
2 producers and 2 consumers
buffer size of 4096 doubles * 8 bytes
8 buffers (256kB total buffer)
2000 messages


It was not 512KB in the test before, but 4MB.
But there is the same problem with a total buffer size of 48kB and 4
threads (./schedulerissue 2 2048 3 20000).
Comment 9 Peter Zijlstra 2009-01-29 02:32:03 UTC
On Thu, 2009-01-29 at 11:24 +0100, Thomas Pilarski wrote:
> Some explanation of the test program. 
> 
> ../schedulerissue 1 4096 8 2000
> 1 producer and 1 consumer
> buffer size of 4096 doubles * 8byte 
> 8 buffer (256kB total buffer)
> 2000 messages
> 
> ../schedulerissue 2 4096 8 2000
> 2 producer and 2 consumer
> buffer size of 4096 doubles * 8byte 
> 8 buffer (256kB total buffer)
> 2000 messages
> 
> 
> It was not 512KB bytes in the test before, but 4MB.
> But there is the same problem with a total buffer size of 48kB and 4
> threads (./schedulerissue 2 2048 3 20000).

Right, read the proglet (and removed that usleep(1)) and am poking at
it.
Comment 10 Peter Zijlstra 2009-01-29 03:37:50 UTC
On Thu, 2009-01-29 at 11:24 +0100, Thomas Pilarski wrote:
> Some explanation of the test program. 
> 
> ../schedulerissue 1 4096 8 2000
> 1 producer and 1 consumer
> buffer size of 4096 doubles * 8byte 
> 8 buffer (256kB total buffer)
> 2000 messages
> 
> ../schedulerissue 2 4096 8 2000
> 2 producer and 2 consumer
> buffer size of 4096 doubles * 8byte 
> 8 buffer (256kB total buffer)
> 2000 messages
> 
> 
> It was not 512KB bytes in the test before, but 4MB.
> But there is the same problem with a total buffer size of 48kB and 4
> threads (./schedulerissue 2 2048 3 20000).

Linux opteron 2.6.29-rc3-tip #61 SMP PREEMPT Thu Jan 29 11:59:15 CET
2009 x86_64 x86_64 x86_64 GNU/Linux

[root@opteron bench]# schedtool -a 1 -e ./ThreadSchedulingIssue 1 4096 8 20000
All threads finished: 19992 messages in 6.485 seconds / 3082.877 msg/s
[root@opteron bench]# ./ThreadSchedulingIssue 1 4096 8 20000
All threads finished: 19992 messages in 6.496 seconds / 3077.604 msg/s
[root@opteron bench]# ./ThreadSchedulingIssue 1 4096 8 20000 & ./ThreadSchedulingIssue 1 4096 8 20000 &
[1] 10314
[2] 10315
[root@opteron bench]# All threads finished: 19992 messages in 6.720 seconds / 2975.009 msg/s
All threads finished: 19992 messages in 6.792 seconds / 2943.574 msg/s

[1]-  Done                    ./ThreadSchedulingIssue 1 4096 8 20000
[2]+  Done                    ./ThreadSchedulingIssue 1 4096 8 20000
[root@opteron bench]# ./ThreadSchedulingIssue 2 4096 8 20000
All threads finished: 19992 messages in 17.299 seconds / 1155.667 msg/s


[root@opteron bench]# for i in 4 8 16 32 64 128 256 ; do 
> echo -n $((i*1024)) $((80000/i)) " " ; 
> schedtool -a 1 -e ./ThreadSchedulingIssue 1 $((i*1024)) 8 $((80000/i)) ;
> done
4096 20000  All threads finished: 19992 messages in 6.368 seconds / 3139.251 msg/s
8192 10000  All threads finished: 9992 messages in 5.363 seconds / 1863.083 msg/s
16384 5000  All threads finished: 4992 messages in 5.471 seconds / 912.479 msg/s
32768 2500  All threads finished: 2493 messages in 5.730 seconds / 435.059 msg/s
65536 1250  All threads finished: 1242 messages in 5.544 seconds / 224.021 msg/s
131072 625  All threads finished: 617 messages in 5.755 seconds / 107.217 msg/s
262144 312  All threads finished: 305 messages in 6.014 seconds / 50.713 msg/s

[root@opteron bench]# for i in 4 8 16 32 64 128 256 ; do
> echo -n $((i*1024)) $((80000/i)) " " ;
> ./ThreadSchedulingIssue 1 $((i*1024)) 8 $((80000/i)) ;
> done
4096 20000  All threads finished: 19992 messages in 6.462 seconds / 3093.717 msg/s
8192 10000  All threads finished: 9992 messages in 8.767 seconds / 1139.738 msg/s
16384 5000  All threads finished: 5000 messages in 5.366 seconds / 931.798 msg/s
32768 2500  All threads finished: 2494 messages in 20.720 seconds / 120.369 msg/s
65536 1250  All threads finished: 1242 messages in 11.521 seconds / 107.805 msg/s
131072 625  All threads finished: 618 messages in 14.035 seconds / 44.032 msg/s
262144 312  All threads finished: 305 messages in 17.342 seconds / 17.587 msg/s

The above point between 16 and 32 is exactly where the total working set
no longer fits into cache -- I suspect that pushes the producer's
go-to-sleep latency over the edge and everything collapses.


We use wakeup patterns to determine if two tasks are working together
and should thus be kept together.

Task A should wake up B, and B should wake up A. Furthermore, any task
should quickly go to sleep after waking up the other.

This program does neither: with a single pair, the producer continues
production after waking the consumer (until the queue is filled --
which, if the consumer is fast enough, might never happen).

With multiple pairs there is no strict pair relation at all, since they
all work on the same global buffer queue, so P1 can wake Cn etc.

Furthermore the program uses shared memory (not a bad design), and thus
misses out on the explicit affinity hints of pipes, sockets, etc.


In short this program is carefully crafted to defeat all our affinity
tests - and I'm not sure what to do.
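
For comparison, pinning both threads onto one CPU from inside the program has roughly the same effect as running the whole thing under "schedtool -a 1 -e ...". A minimal sketch of that workaround (an assumption for illustration, not part of the testcase) using pthread_setaffinity_np():

#define _GNU_SOURCE
#include <pthread.h>
#include <sched.h>

/* Pin the calling thread to one CPU; returns 0 on success or an errno value.
 * Calling this from both the producer and the consumer with the same cpu
 * number forces them onto one core, like "schedtool -a 1" does externally. */
static int pin_self_to_cpu(int cpu)
{
    cpu_set_t set;

    CPU_ZERO(&set);
    CPU_SET(cpu, &set);
    return pthread_setaffinity_np(pthread_self(), sizeof(set), &set);
}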
Comment 11 Thomas Pilarski 2009-01-29 06:06:37 UTC
> In short this program is carefully crafted to defeat all our affinity
> tests - and I'm not sure what to do.

I am sorry, but it was not carefully crafted. The function random()
is causing my problem. I currently have no real data, so I tried to
generate some random utilization and data.

Without the random() function it works even with 80MB of data and I get
great results.

./ThreadSchedulingIssue 1 10485760 8 312
All threads finished: 309 messages in 29.369 seconds / 10.521 msg/s

schedtool -a 1 -e ./ThreadSchedulingIssue 1 10485760 8 312
All threads finished: 312 messages in 44.284 seconds / 7.045 msg/s

It does not even regress with more than two threads. 

./ThreadSchedulingIssue 2 10485760 8 312
All threads finished: 311 messages in 28.040 seconds / 11.091 msg/s

./ThreadSchedulingIssue 4 10485760 8 312
All threads finished: 309 messages in 28.021 seconds / 11.027 msg/s

With small amounts of data the speed on two cores is even doubled. 

schedtool -a 1 -e ./ThreadSchedulingIssue 1 1048 8 312000
All threads finished: 311992 messages in 19.437 seconds / 16051.247 msg/s

./ThreadSchedulingIssue 3 1048 8 312000
All threads finished: 311998 messages in 9.652 seconds / 32324.411 msg/s

./ThreadSchedulingIssue 8 1048 8 312000
All threads finished: 311997 messages in 9.339 seconds / 33406.370 msg/s

--------------
Perhaps it is as it should be, but when I run the test (without
random()) with 2*8 threads, it uses ~186% of the cpu, while an instance
of "bzip2 -9 -c /dev/urandom >/dev/null" gets only 12%.
Comment 12 Mike Galbraith 2009-01-29 23:58:03 UTC
On Thu, 2009-01-29 at 15:05 +0100, Thomas Pilarski wrote:
> > In short this program is carefully crafted to defeat all our affinity
> > tests - and I'm not sure what to do.
> 
> I am sorry, although it is not carefully crafted. The function random()
> is causing my problem. I currently have no real data, so I tried to make
> some random utilization and data.

Yeah, rather big difference, mega-contention vs zero-contention.

2.6.28.2, profile of ThreadSchedulingIssue 4 524288 8 200

vma              samples  %        app name                 symbol name
ffffffff80251efa 2574819  31.6774  vmlinux                  futex_wake
ffffffff80251a39 1367613  16.8255  vmlinux                  futex_wait
0000000000411790 815426   10.0320  ThreadSchedulingIssue    random
ffffffff8022b3b5 343692    4.2284  vmlinux                  task_rq_lock
0000000000404e30 299316    3.6824  ThreadSchedulingIssue    __lll_lock_wait_private
ffffffff8030d430 262906    3.2345  vmlinux                  copy_user_generic_string
ffffffff80462af2 235176    2.8933  vmlinux                  schedule
0000000000411b90 210984    2.5957  ThreadSchedulingIssue    random_r
ffffffff80251730 129376    1.5917  vmlinux                  hash_futex
ffffffff8020be10 123548    1.5200  vmlinux                  system_call
ffffffff8020a679 119398    1.4689  vmlinux                  __switch_to
ffffffff8022f49b 110068    1.3541  vmlinux                  try_to_wake_up
ffffffff8024c4d1 106352    1.3084  vmlinux                  sched_clock_cpu
ffffffff8020be20 102709    1.2636  vmlinux                  system_call_after_swapgs
ffffffff80229a2d 100614    1.2378  vmlinux                  update_curr
ffffffff80248309 86475     1.0639  vmlinux                  add_wait_queue
ffffffff80253149 85969     1.0577  vmlinux                  do_futex

Versus using the myrand() free-sample cruft generator from the rand(3) manpage.  Poof.

vma      samples  %        app name                 symbol name
004002f4 979506   90.7113  ThreadSchedulingIssue    myrand
00400b00 53348     4.9405  ThreadSchedulingIssue    thread_consumer
00400c25 42710     3.9553  ThreadSchedulingIssue    thread_producer

One of those "don't _ever_ do that" things?

	-Mike
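
The lock-free generator referred to above is the sample myrand() shown in the rand(3) manpage, roughly:

/* Portable example generator from the rand(3) manpage. The static state is
 * not thread-safe, so for threaded use each thread should keep its own copy
 * of `next' (or pass the state in explicitly). */
static unsigned long next = 1;

/* RAND_MAX assumed to be 32767 */
int myrand(void)
{
    next = next * 1103515245 + 12345;
    return (unsigned)(next / 65536) % 32768;
}

void mysrand(unsigned int seed)
{
    next = seed;
}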
Comment 13 Thomas Pilarski 2009-02-01 23:48:01 UTC
On Friday, 30.01.2009, at 08:57 +0100, Mike Galbraith wrote:
> One of those "don't _ever_ do that" things?

I did not know random() uses a system call. It's rather unrealistic to
have five million system calls in a second. By adding a small loop with
some calculations near the random() call, the problem disappears too.
It was an unluckily chosen data generator.
Comment 14 Peter Zijlstra 2009-02-02 00:20:10 UTC
On Mon, 2009-02-02 at 08:43 +0100, Thomas Pilarski wrote:
> On Friday, 30.01.2009, at 08:57 +0100, Mike Galbraith wrote:
> > One of those "don't _ever_ do that" things?
> 
> I did not known random() uses a system call. It's rather unrealistic to
> have five million system calls in a second. By adding a small loop with
> some calculations near the random, the problem disappears too.
> It is a unlucky chosen data generator.

I suppose you'll have to go bug the glibc people about their random()
implementation.

If you really need random() to perform for your application (Monte Carlo
stuff?) you might be better off writing a PRNG with TLS state or
something.
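
A minimal sketch of that suggestion (illustrative only, not code from this report): a tiny xorshift generator whose state lives in thread-local storage, so the threads share no lock and generate no futex traffic.

#include <stdint.h>

/* __thread gives every thread its own copy of the state, so no locking is
 * needed; seed rng_state per thread (e.g. from the thread id) before use. */
static __thread uint64_t rng_state = 88172645463325252ULL;

static inline uint64_t tls_rand(void)
{
    /* one Marsaglia xorshift64 step on the thread-local state */
    rng_state ^= rng_state << 13;
    rng_state ^= rng_state >> 7;
    rng_state ^= rng_state << 17;
    return rng_state;
}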
Comment 15 Thomas Pilarski 2009-02-02 00:37:25 UTC
On Monday, 02.02.2009, at 09:19 +0100, Peter Zijlstra wrote:
> I suppose you'll have to go bug the glibc people about their random()
> implementation.

Yes, I will.

> If you really need random() to perform for your application (monte-carlo
> stuff?) You might be better off writing a PRNG with TLS state or
> something.

I just need some noise in my images.
Comment 16 Mike Galbraith 2009-02-02 00:52:25 UTC
On Mon, 2009-02-02 at 09:33 +0100, Thomas Pilarski wrote:
> On Monday, 02.02.2009, at 09:19 +0100, Peter Zijlstra wrote:
> > I suppose you'll have to go bug the glibc people about their random()
> > implementation.
> 
> Yes, I will.

Finding the below was easy enough...

/* POSIX.1c requires that there is mutual exclusion for the `rand' and
   `srand' functions to prevent concurrent calls from modifying common
   data.  */
__libc_lock_define_initialized (static, lock)

...

long int
__random ()
{
  int32_t retval;

  __libc_lock_lock (lock);

  (void) __random_r (&unsafe_state, &retval);

  __libc_lock_unlock (lock);

  return retval;
}

...but finding the plumbing leading to __lll_lock_wait_private()
over-taxed my attention span.

	-Mike
Comment 17 Peter Zijlstra 2009-02-02 00:56:03 UTC
On Mon, 2009-02-02 at 09:52 +0100, Mike Galbraith wrote:
> On Mon, 2009-02-02 at 09:33 +0100, Thomas Pilarski wrote:
> > On Monday, 02.02.2009, at 09:19 +0100, Peter Zijlstra wrote:
> > > I suppose you'll have to go bug the glibc people about their random()
> > > implementation.
> > 
> > Yes, I will.
> 
> Finding the below was easy enough...

Ah, that was a good clue; apparently all you need to do is use
random_r() and provide your own state, and all should be well.
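
A minimal sketch of that approach (assumed usage, see the random_r(3) manpage): each thread owns its own struct random_data and state buffer, so glibc never has to take the shared lock.

#define _GNU_SOURCE
#include <stdlib.h>
#include <stdint.h>
#include <string.h>

struct thread_rng {
    struct random_data data;
    char statebuf[128];        /* a larger state array gives a better generator */
};

/* struct random_data must be zeroed before the first initstate_r() call. */
static int thread_rng_init(struct thread_rng *r, unsigned int seed)
{
    memset(&r->data, 0, sizeof(r->data));
    return initstate_r(seed, r->statebuf, sizeof(r->statebuf), &r->data);
}

static long thread_rng_next(struct thread_rng *r)
{
    int32_t val;

    random_r(&r->data, &val);
    return val;
}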
Comment 18 Anonymous Emailer 2009-02-02 04:16:04 UTC
Reply-To: peterz@infradead.org

On Mon, 2009-02-02 at 09:55 +0100, Peter Zijlstra wrote:

> Ah, that was a good clue, apparently all you need to so it use
> random_r() and provide your own state and all should be well.

Michael, would it make sense to add the random_r() family to the "SEE
ALSO" section of the random() man page?

(Admittedly, my random() manpage is ancient: 2008-03-07, so it might be
that this is already the case, in which case, ignore me :)
Comment 19 Anonymous Emailer 2009-02-02 10:30:00 UTC
Reply-To: mtk.manpages@googlemail.com

Hi Peter,

On Tue, Feb 3, 2009 at 1:15 AM, Peter Zijlstra <peterz@infradead.org> wrote:
> On Mon, 2009-02-02 at 09:55 +0100, Peter Zijlstra wrote:
>
>> Ah, that was a good clue, apparently all you need to so it use
>> random_r() and provide your own state and all should be well.
>
> Michael, would it make sense to add the random_r() family to the "SEE
> ALSO" section of the random() man page?
>
> (Admittedly, my random() manpage is ancient: 2008-03-07, so it might be
> this is already the case, in which case, ignore me :)

(Up-to-date version of the pages can always be found online at the
location in the .sig.)

Well, the man page already had this text under notes:

       This  function  should  not  be  used  in  cases where multiple
       threads use random() and the behavior should  be  reproducible.
       Use random_r(3) for that purpose.

But it certainly doesn't hurt to have random_r(3) also listed under
the SEE ALSO, and I've added it for man-pages-3.18.

Cheers,

Michael
Comment 20 Anonymous Emailer 2009-02-02 10:35:26 UTC
Reply-To: peterz@infradead.org

On Tue, 2009-02-03 at 07:29 +1300, Michael Kerrisk wrote:
> Hi Peter,
> 
> On Tue, Feb 3, 2009 at 1:15 AM, Peter Zijlstra <peterz@infradead.org> wrote:
> > On Mon, 2009-02-02 at 09:55 +0100, Peter Zijlstra wrote:
> >
> >> Ah, that was a good clue, apparently all you need to so it use
> >> random_r() and provide your own state and all should be well.
> >
> > Michael, would it make sense to add the random_r() family to the "SEE
> > ALSO" section of the random() man page?
> >
> > (Admittedly, my random() manpage is ancient: 2008-03-07, so it might be
> > this is already the case, in which case, ignore me :)
> 
> (Up-to-date version of the pages can always be found online at the
> location in the .sig.)

Ah, I'll try to remember that.

> Well, the man page already had this text under notes:
> 
>        This  function  should  not  be  used  in  cases where multiple
>        threads use random() and the behavior should  be  reproducible.
>        Use random_r(3) for that purpose.

Yeah, I found it eventually, but I generally don't read a full
manpage when I'm looking for related functions, only the SEE ALSO
section.

> But it certainly doesn't hurt to have random_r(3) also listed under
> the SEE ALSO, and I've added it for man-pages-3.18.

Thanks.
Comment 21 Valdis Kletnieks 2009-02-02 19:57:28 UTC
On Mon, 02 Feb 2009 08:43:55 +0100, Thomas Pilarski said:
> On Friday, 30.01.2009, at 08:57 +0100, Mike Galbraith wrote:
> > One of those "don't _ever_ do that" things?
> 
> I did not known random() uses a system call. It's rather unrealistic to
> have five million system calls in a second. By adding a small loop with
> some calculations near the random, the problem disappears too.
> It is a unlucky chosen data generator.

Am I the only one that's scared by the concept of anything that beats
on random numbers enough to need 5 million of them a second, but is still
using the relatively sucky one that's in most glibc's? :) 
Comment 22 Mike Galbraith 2009-02-02 20:56:11 UTC
This bug is now dead... so who closes it?

	-Mike
Comment 23 Andrew Morton 2009-02-02 21:20:23 UTC
This was a glibc thing... closed.
