Hardware Environment: Core2Duo 2.4GHz / 4GB RAM
Software Environment: Ubuntu 8.10 + Vanilla 2.6.28

Hardware Environment: AMD64 X2 2.1GHz / 6GB RAM
Software Environment: Ubuntu 8.10 + Vanilla 2.6.28.2

Problem Description:
The overhead on a dual core while switching between tasks is extremely high (>60% of cputime). It is produced by synchronization with pthread mutex/cond. Executing the attached program as "schedulingissue 1 1024 8 20" creates one producer and one consumer thread with eight 8kB buffers. The producer fills a buffer with 1024 randomly generated double values; the consumer does the same after receiving the buffer.

While executing one instance of the program the throughput is ~1.6 msg/s. While executing two instances, the combined throughput is much higher (2 * 8.7 msg/s = 17.4 msg/s).

There is a small improvement when using jiffies as the clocksource instead of acpi_pm or hpet (1.8 msg/s instead of 1.6). Disabling NO_HZ and HIGH_RES_TIMERS gives no improvement. Kernels <= 2.6.24 perform much better, but are still four times slower.

---------------------------------------
Linux bugs-laptop 2.6.28-hz-hrt #4 SMP Wed Jan 28 13:33:18 CET 2009 x86_64 GNU/Linux
acpi_pm (equal with hpet)
schedulerissue 1 1024 8 20
All threads finished: 20 messages in 12.295 seconds / 1.627 msg/s
schedulerissue 1 1024 8 200 & schedulerissue 1 1024 8 200
All threads finished: 200 messages in 22.882 seconds / 8.741 msg/s
All threads finished: 200 messages in 22.934 seconds / 8.721 msg/s
---------------------------------------
Linux bugs-laptop 2.6.28-hz-hrt #4 SMP Wed Jan 28 13:33:18 CET 2009 x86_64 GNU/Linux
jiffies
schedulerissue 1 1024 8 20
All threads finished: 20 messages in 10.704 seconds / 1.868 msg/s
schedulerissue 1 1024 8 200 & schedulerissue 1 1024 8 200
All threads finished: 200 messages in 23.372 seconds / 8.557 msg/s
All threads finished: 200 messages in 23.460 seconds / 8.525 msg/s
---------------------------------------
Linux bugs-laptop 2.6.24.7 #1 SMP Wed Jan 14 10:21:04 CET 2009 x86_64 GNU/Linux
hpet
schedulerissue 1 1024 8 20
All threads finished: 20 messages in 5.290 seconds / 3.781 msg/s
schedulerissue 1 1024 8 200 & schedulerissue 1 1024 8 200
All threads finished: 200 messages in 23.000 seconds / 8.695 msg/s
All threads finished: 200 messages in 23.078 seconds / 8.666 msg/s
---------------------------------------
AMD64 X2 @ 2.1GHz
Linux bugs-desktop 2.6.28.2 #4 SMP Mon Jan 26 20:26:12 CET 2009 x86_64 GNU/Linux
acpi_pm
schedulerissue 1 1024 8 20
All threads finished: 20 messages in 9.288 seconds / 2.153 msg/s
schedulerissue 1 1024 8 200
All threads finished: 200 messages in 17.049 seconds / 11.731 msg/s
All threads finished: 200 messages in 18.539 seconds / 10.788 msg/s
Created attachment 20030 [details]
testcase

gcc -O3 -lm -lrt -lpthread ThreadSchedulingIssue.c -o schedulingissue
Created attachment 20031 [details]
testcase

Removed constants from the testcase. The results in the description were produced with the parameters "schedulingissue 1 524288 4 20" and "schedulingissue 1 524288 4 200".
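For reference, here is a minimal sketch of the producer/consumer pattern the report describes: one producer filling 8kB buffers with random() doubles, one consumer reading them, handshaking through a pthread mutex and condition variables. This is not the attached ThreadSchedulingIssue.c; all identifiers, the ring-buffer layout, the fixed sizes and the build command are illustrative assumptions only.

/*
 * Minimal sketch of the reported producer/consumer test, NOT the attached
 * ThreadSchedulingIssue.c.  Names, ring layout and sizes are assumptions.
 *
 * Build (assumption, mirroring the attachment's command):
 *   gcc -O2 -o pc-sketch pc-sketch.c -lpthread
 */
#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>

#define BUF_DOUBLES  1024   /* doubles per buffer -> 8 kB                */
#define NUM_BUFFERS  8      /* slots in the shared ring                  */
#define NUM_MESSAGES 20     /* buffers handed from producer to consumer  */

static double buffers[NUM_BUFFERS][BUF_DOUBLES];
static int head, tail, queued;  /* ring indices and fill count           */
static pthread_mutex_t lock      = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  not_empty = PTHREAD_COND_INITIALIZER;
static pthread_cond_t  not_full  = PTHREAD_COND_INITIALIZER;

static void *producer(void *arg)
{
    int m, i;
    for (m = 0; m < NUM_MESSAGES; m++) {
        double *buf;

        pthread_mutex_lock(&lock);
        while (queued == NUM_BUFFERS)         /* ring full: wait         */
            pthread_cond_wait(&not_full, &lock);
        buf = buffers[head];
        pthread_mutex_unlock(&lock);

        for (i = 0; i < BUF_DOUBLES; i++)     /* generate the payload    */
            buf[i] = (double)random();

        pthread_mutex_lock(&lock);
        head = (head + 1) % NUM_BUFFERS;
        queued++;
        pthread_cond_signal(&not_empty);      /* wake the consumer       */
        pthread_mutex_unlock(&lock);
    }
    return NULL;
}

static void *consumer(void *arg)
{
    int m, i;
    double sum = 0.0;
    for (m = 0; m < NUM_MESSAGES; m++) {
        double *buf;

        pthread_mutex_lock(&lock);
        while (queued == 0)                   /* ring empty: wait        */
            pthread_cond_wait(&not_empty, &lock);
        buf = buffers[tail];
        pthread_mutex_unlock(&lock);

        for (i = 0; i < BUF_DOUBLES; i++)     /* consume: same kind of work */
            sum += buf[i] + (double)random();

        pthread_mutex_lock(&lock);
        tail = (tail + 1) % NUM_BUFFERS;
        queued--;
        pthread_cond_signal(&not_full);       /* wake the producer       */
        pthread_mutex_unlock(&lock);
    }
    printf("checksum %f\n", sum);
    return NULL;
}

int main(void)
{
    pthread_t p, c;
    pthread_create(&p, NULL, producer, NULL);
    pthread_create(&c, NULL, consumer, NULL);
    pthread_join(p, NULL);
    pthread_join(c, NULL);
    return 0;
}

The real testcase additionally runs several such producer/consumer pairs against one shared buffer queue (its first parameter), which is what later turns out to matter for the scheduler's affinity heuristics.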
Reply-To: akpm@linux-foundation.org (switched to email. Please respond via emailed reply-to-all, not via the bugzilla web interface). On Wed, 28 Jan 2009 06:35:20 -0800 (PST) bugme-daemon@bugzilla.kernel.org wrote: > http://bugzilla.kernel.org/show_bug.cgi?id=12562 > > Summary: High overhead while switching or synchronizing threads > on different cores Thanks for the report, and the testcase. > Product: Process Management > Version: 2.5 > KernelVersion: 2.6.28 > Platform: All > OS/Version: Linux > Tree: Mainline > Status: NEW > Severity: normal > Priority: P1 > Component: Scheduler > AssignedTo: mingo@elte.hu > ReportedBy: thomas.pi@arcor.de (There's testcase code in the bugzilla report) (Seems to be a regression) > > Hardware Environment: Core2Duo 2.4GHz / 4GB RAM > Software Environment: Ubuntu 8.10 + Vanilla 2.6.28 > > Hardware Environment: AMD64 X2 2.1GHz / 6GB RAM > Software Environment: Ubuntu 8.10 + Vanilla 2.6.28.2 > > Problem Description: > The overhead on a dual core while switching between tasks is extremely high > (>60% of cputime). If is produced by synchronization with pthread and > mutex/cond. > > Executing the attaches program schedulingissue 1 1024 8 20, which create a > producer and a consumer thread with eight 8kb big buffers. The producer > creates > 1024 random generated double values, consumer makes the same after receiving > the buffer. > > While executing the program the thoughtput is ~1.6 msg/s. While executing two > instances of the program, the thoughtput is much higher (2 * 8.7 msg/s = 17,4 > msg/s). > > Small improvement while using jiffies as clocksource instead of acpi_pm or > hpet > (1.8 messages instead of 1.6). Disabling NO_HZ and HIGH_RESOLUTION_TIME gives > no improvement. Much higher performance with kernel <= 2.6.24, but still four > times slower. Unclear. What is four times slower than what? You're saying that the app progresses four times faster when there are two instances of it running, rather than one instance? > --------------------------------------- > Linux bugs-laptop 2.6.28-hz-hrt #4 SMP Wed Jan 28 13:33:18 CET 2009 x86_64 > GNU/Linux > acpi_pm (equal with htep) > schedulerissue 1 1024 8 20 > All threads finished: 20 messages in 12.295 seconds / 1.627 msg/s > schedulerissue 1 1024 8 200 & schedulerissue 1 1024 8 200 > All threads finished: 200 messages in 22.882 seconds / 8.741 msg/s > All threads finished: 200 messages in 22.934 seconds / 8.721 msg/s > --------------------------------------- > Linux bugs-laptop 2.6.28-hz-hrt #4 SMP Wed Jan 28 13:33:18 CET 2009 x86_64 > GNU/Linux > jiffies > schedulerissue 1 1024 8 20 > All threads finished: 20 messages in 10.704 seconds / 1.868 msg/s > schedulerissue 1 1024 8 200 & schedulerissue 1 1024 8 200 > All threads finished: 200 messages in 23.372 seconds / 8.557 msg/s > All threads finished: 200 messages in 23.460 seconds / 8.525 msg/s > -------------------------------------- > Linux bugs-laptop 2.6.24.7 #1 SMP Wed Jan 14 10:21:04 CET 2009 x86_64 > GNU/Linux > hpet > schedulerissue 1 1024 8 20 > All threads finished: 20 messages in 5.290 seconds / 3.781 msg/s > schedulerissue 1 1024 8 200 & schedulerissue 1 1024 8 200 > All threads finished: 200 messages in 23.000 seconds / 8.695 msg/s > All threads finished: 200 messages in 23.078 seconds / 8.666 msg/s > Seems that 2.6.24 is faster than 2.6.28 with 20 messages, but 2.6.24 and 2.6.28 run at the same speed when 200 messages are sent? If so, that seems rather odd, doesn't it? Is it possible that cpufreq does something bad once the CPU gets hot? 
> AMD64 X2 @ 2.1GHz > Linux bugs-desktop 2.6.28.2 #4 SMP Mon Jan 26 20:26:12 CET 2009 x86_64 > GNU/Linux > acpi_pm > schedulerissue 1 1024 8 20 > All threads finished: 20 messages in 9.288 seconds / 2.153 msg/s > schedulerissue 1 1024 8 200 > All threads finished: 200 messages in 17.049 seconds / 11.731 msg/s > All threads finished: 200 messages in 18.539 seconds / 10.788 msg/s
On Wed, 2009-01-28 at 12:56 -0800, Andrew Morton wrote: > (switched to email. Please respond via emailed reply-to-all, not via the > bugzilla web interface). > > On Wed, 28 Jan 2009 06:35:20 -0800 (PST) > bugme-daemon@bugzilla.kernel.org wrote: > > > http://bugzilla.kernel.org/show_bug.cgi?id=12562 > > > > Summary: High overhead while switching or synchronizing threads > > on different cores > > Thanks for the report, and the testcase. > > > Product: Process Management > > Version: 2.5 > > KernelVersion: 2.6.28 > > Platform: All > > OS/Version: Linux > > Tree: Mainline > > Status: NEW > > Severity: normal > > Priority: P1 > > Component: Scheduler > > AssignedTo: mingo@elte.hu > > ReportedBy: thomas.pi@arcor.de > > (There's testcase code in the bugzilla report) > > (Seems to be a regression) Is there a known good kernel? > > > > Hardware Environment: Core2Duo 2.4GHz / 4GB RAM > > Software Environment: Ubuntu 8.10 + Vanilla 2.6.28 > > > > Hardware Environment: AMD64 X2 2.1GHz / 6GB RAM > > Software Environment: Ubuntu 8.10 + Vanilla 2.6.28.2 > > > > Problem Description: > > The overhead on a dual core while switching between tasks is extremely high > > (>60% of cputime). If is produced by synchronization with pthread and > > mutex/cond. > > > > Executing the attaches program schedulingissue 1 1024 8 20, which create a > > producer and a consumer thread with eight 8kb big buffers. The producer > creates > > 1024 random generated double values, consumer makes the same after > receiving > > the buffer. > > > > While executing the program the thoughtput is ~1.6 msg/s. While executing > two > > instances of the program, the thoughtput is much higher (2 * 8.7 msg/s = > 17,4 > > msg/s). > > > > Small improvement while using jiffies as clocksource instead of acpi_pm or > hpet > > (1.8 messages instead of 1.6). Disabling NO_HZ and HIGH_RESOLUTION_TIME > gives > > no improvement. Much higher performance with kernel <= 2.6.24, but still > four > > times slower. > > Unclear. What is four times slower than what? You're saying that the > app progresses four times faster when there are two instances of it > running, rather than one instance? It seems that way indeed, a bit more clarity would be good though. 
> > ---------------------------------------
> > Linux bugs-laptop 2.6.28-hz-hrt #4 SMP Wed Jan 28 13:33:18 CET 2009 x86_64
> > GNU/Linux
> > acpi_pm (equal with htep)
> > schedulerissue 1 1024 8 20
> > All threads finished: 20 messages in 12.295 seconds / 1.627 msg/s
> > schedulerissue 1 1024 8 200 & schedulerissue 1 1024 8 200
> > All threads finished: 200 messages in 22.882 seconds / 8.741 msg/s
> > All threads finished: 200 messages in 22.934 seconds / 8.721 msg/s
> > ---------------------------------------
> > Linux bugs-laptop 2.6.28-hz-hrt #4 SMP Wed Jan 28 13:33:18 CET 2009 x86_64
> > GNU/Linux
> > jiffies
> > schedulerissue 1 1024 8 20
> > All threads finished: 20 messages in 10.704 seconds / 1.868 msg/s
> > schedulerissue 1 1024 8 200 & schedulerissue 1 1024 8 200
> > All threads finished: 200 messages in 23.372 seconds / 8.557 msg/s
> > All threads finished: 200 messages in 23.460 seconds / 8.525 msg/s
> > --------------------------------------
> > Linux bugs-laptop 2.6.24.7 #1 SMP Wed Jan 14 10:21:04 CET 2009 x86_64
> > GNU/Linux
> > hpet
> > schedulerissue 1 1024 8 20
> > All threads finished: 20 messages in 5.290 seconds / 3.781 msg/s
> > schedulerissue 1 1024 8 200 & schedulerissue 1 1024 8 200
> > All threads finished: 200 messages in 23.000 seconds / 8.695 msg/s
> > All threads finished: 200 messages in 23.078 seconds / 8.666 msg/s
> >
> Seems that 2.6.24 is faster than 2.6.28 with 20 messages, but 2.6.24
> and 2.6.28 run at the same speed when 200 messages are sent?
>
> If so, that seems rather odd, doesn't it?  Is it possible that cpufreq
> does something bad once the CPU gets hot?

Nah, I'll bet it's a cache affinity issue. Some applications like strong
wakeup affinity, others not so. This looks to be a lover.

With a single instance, the producer and consumer get scheduled on two
different cores for some reason (maybe wake idle too strong). With two
instances, they get to stay on the same cpu, since the other cpu is
already busy.

I'll start up the browser in the morning to download this proglet and
poke at it some, but sleep comes first.
On Wednesday, 28.01.2009, 12:56 -0800, Andrew Morton wrote:

> (There's testcase code in the bugzilla report)
>
> (Seems to be a regression)

There is a regression because of the improved cpu switching, but the problem
itself exists in every kernel. It takes a lot of time to switch between the
threads when they are executed on different cores. Perhaps because of the big
buffer size of 512KB?

> > Small improvement while using jiffies as clocksource instead of acpi_pm or hpet
> > (1.8 messages instead of 1.6). Disabling NO_HZ and HIGH_RESOLUTION_TIME gives
> > no improvement. Much higher performance with kernel <= 2.6.24, but still four
> > times slower.
>
> Unclear.  What is four times slower than what?  You're saying that the
> app progresses four times faster when there are two instances of it
> running, rather than one instance?

About 4 messages per second while executing only one instance, and about 8
messages per second per instance while executing two instances of the test.
It reaches 16 messages per second when the two threads of one instance are
executed on only one core.

> Seems that 2.6.24 is faster than 2.6.28 with 20 messages, but 2.6.24
> and 2.6.28 run at the same speed when 200 messages are sent?

I have executed the test twenty times. The result stays constant on 2.6.28.
On 2.6.24 about one in ten runs is slower.

******* kernel 2.6.28:
All threads finished: 20 messages in 12.853 seconds / 1.556 msg/s
real    0m12.857s
user    0m8.589s
sys     0m16.629s

******* kernel 2.6.24:
All threads finished: 20 messages in 4.939 seconds / 4.050 msg/s
real    0m4.942s
user    0m5.248s
sys     0m4.352s

One in ten executions drops to 1.806 msg/s:
All threads finished: 20 messages in 11.074 seconds / 1.806 msg/s
real    0m11.077s
user    0m8.817s
sys     0m12.925s

> If so, that seems rather odd, doesn't it?  Is it possible that cpufreq
> does something bad once the CPU gets hot?

I have disabled acpid, clocked the cpu to 2.4GHz and watched the temperature
of the cores and the frequency. The clock always stays at 2.4GHz and the
temperature is always below 67°C. My cpu only throttles at 95°C.
On Wed, 2009-01-28 at 23:25 +0100, Thomas Pilarski wrote: > Am Mittwoch, den 28.01.2009, 12:56 -0800 schrieb Andrew Morton: > > > (There's testcase code in the bugzilla report) > > > > (Seems to be a regression) > > There is a regression, because of the improved cpu switching. The > problem exists in every kernel. This is a contradiction in terms - twice. If it is a regression, then clearly things haven't improved. If it is a regression, state clearly when it worked last. If it never worked, it cannot be a regression. > I takes a lot of time to switch between the threads, when they are > executed on different cores. > Perhaps of the big buffer size of 512KB? Of course, pushing 512kb to another cpu means lots and lots of cache misses.
> > There is a regression, because of the improved cpu switching. The
> > problem exists in every kernel.
>
> This is a contradiction in terms - twice.
>
> If it is a regression, then clearly things haven't improved.
>
> If it is a regression, state clearly when it worked last. If it never
> worked, it cannot be a regression.

There is an improvement in load balancing for single-threaded applications,
but it is a regression for my problem. The underlying problem, however,
exists in every kernel I have tested.

> > I takes a lot of time to switch between the threads, when they are
> > executed on different cores.
> > Perhaps of the big buffer size of 512KB?
>
> Of course, pushing 512kb to another cpu means lots and lots of cache
> misses.

I have tried 2.6.15, 2.6.18 and 2.6.20 too; they show the same behavior as
2.6.24.

On Windows I can get 64 messages per second with a buffer size of 512 KB. It
drops to 16 messages with a buffer size of 1 MB. But I think it is not really
comparable, because there is nearly no cpu consumption with 512 kB -- perhaps
random() works differently there. By increasing the cpu usage in the producer
eight times, I get 16 msg/s and both cores are used at about 50%. Doing the
same on Linux I get a throughput of ~2 msg/s. If it were a caching issue,
shouldn't it exist on Windows too?

Using a smaller buffer of 4KB, the test is executed on one core only.

./schedulerissue 1 4096 8 2000
All threads finished: 2000 messages in 1.631 seconds / 1226.076 msg/s
real    0m1.635s
user    0m1.352s
sys     0m0.052s

But I want to use both cores to increase the performance. Adding a second
producer and a second consumer reduces the performance to 33%. Both cores
are used.

./schedulerissue 2 4096 8 2000
All threads finished: 1999 messages in 4.744 seconds / 421.379 msg/s
real    0m4.748s
user    0m3.280s
sys     0m5.852s

I have added a new version, as there was a possible deadlock during shutdown.
Some explanation of the test program:

./schedulerissue 1 4096 8 2000
  1 producer and 1 consumer
  buffer size of 4096 doubles * 8 byte
  8 buffers (256kB total buffer)
  2000 messages

./schedulerissue 2 4096 8 2000
  2 producers and 2 consumers
  buffer size of 4096 doubles * 8 byte
  8 buffers (256kB total buffer)
  2000 messages

It was not 512kB per buffer in the earlier test, but 4MB. But the same
problem shows up with a total buffer size of 48kB and 4 threads
(./schedulerissue 2 2048 3 20000).
On Thu, 2009-01-29 at 11:24 +0100, Thomas Pilarski wrote: > Some explanation of the test program. > > ../schedulerissue 1 4096 8 2000 > 1 producer and 1 consumer > buffer size of 4096 doubles * 8byte > 8 buffer (256kB total buffer) > 2000 messages > > ../schedulerissue 2 4096 8 2000 > 2 producer and 2 consumer > buffer size of 4096 doubles * 8byte > 8 buffer (256kB total buffer) > 2000 messages > > > It was not 512KB bytes in the test before, but 4MB. > But there is the same problem with a total buffer size of 48kB and 4 > threads (./schedulerissue 2 2048 3 20000). Right, read the proglet (and removed that usleep(1)) and am poking at it.
On Thu, 2009-01-29 at 11:24 +0100, Thomas Pilarski wrote:
> Some explanation of the test program.
>
> ./schedulerissue 1 4096 8 2000
> 1 producer and 1 consumer
> buffer size of 4096 doubles * 8byte
> 8 buffer (256kB total buffer)
> 2000 messages
>
> ./schedulerissue 2 4096 8 2000
> 2 producer and 2 consumer
> buffer size of 4096 doubles * 8byte
> 8 buffer (256kB total buffer)
> 2000 messages
>
> It was not 512KB bytes in the test before, but 4MB.
> But there is the same problem with a total buffer size of 48kB and 4
> threads (./schedulerissue 2 2048 3 20000).

Linux opteron 2.6.29-rc3-tip #61 SMP PREEMPT Thu Jan 29 11:59:15 CET 2009 x86_64 x86_64 x86_64 GNU/Linux

[root@opteron bench]# schedtool -a 1 -e ./ThreadSchedulingIssue 1 4096 8 20000
All threads finished: 19992 messages in 6.485 seconds / 3082.877 msg/s
[root@opteron bench]# ./ThreadSchedulingIssue 1 4096 8 20000
All threads finished: 19992 messages in 6.496 seconds / 3077.604 msg/s
[root@opteron bench]# ./ThreadSchedulingIssue 1 4096 8 20000 & ./ThreadSchedulingIssue 1 4096 8 20000 &
[1] 10314
[2] 10315
[root@opteron bench]# All threads finished: 19992 messages in 6.720 seconds / 2975.009 msg/s
All threads finished: 19992 messages in 6.792 seconds / 2943.574 msg/s

[1]-  Done                    ./ThreadSchedulingIssue 1 4096 8 20000
[2]+  Done                    ./ThreadSchedulingIssue 1 4096 8 20000
[root@opteron bench]# ./ThreadSchedulingIssue 2 4096 8 20000
All threads finished: 19992 messages in 17.299 seconds / 1155.667 msg/s

[root@opteron bench]# for i in 4 8 16 32 64 128 256 ; do
> echo -n $((i*1024)) $((80000/i)) " " ;
> schedtool -a 1 -e ./ThreadSchedulingIssue 1 $((i*1024)) 8 $((80000/i)) ;
> done
4096 20000    All threads finished: 19992 messages in 6.368 seconds / 3139.251 msg/s
8192 10000    All threads finished: 9992 messages in 5.363 seconds / 1863.083 msg/s
16384 5000    All threads finished: 4992 messages in 5.471 seconds / 912.479 msg/s
32768 2500    All threads finished: 2493 messages in 5.730 seconds / 435.059 msg/s
65536 1250    All threads finished: 1242 messages in 5.544 seconds / 224.021 msg/s
131072 625    All threads finished: 617 messages in 5.755 seconds / 107.217 msg/s
262144 312    All threads finished: 305 messages in 6.014 seconds / 50.713 msg/s

[root@opteron bench]# for i in 4 8 16 32 64 128 256 ; do
> echo -n $((i*1024)) $((80000/i)) " " ;
> ./ThreadSchedulingIssue 1 $((i*1024)) 8 $((80000/i)) ;
> done
4096 20000    All threads finished: 19992 messages in 6.462 seconds / 3093.717 msg/s
8192 10000    All threads finished: 9992 messages in 8.767 seconds / 1139.738 msg/s
16384 5000    All threads finished: 5000 messages in 5.366 seconds / 931.798 msg/s
32768 2500    All threads finished: 2494 messages in 20.720 seconds / 120.369 msg/s
65536 1250    All threads finished: 1242 messages in 11.521 seconds / 107.805 msg/s
131072 625    All threads finished: 618 messages in 14.035 seconds / 44.032 msg/s
262144 312    All threads finished: 305 messages in 17.342 seconds / 17.587 msg/s

The above point between 16 and 32 is exactly where the total working set
doesn't fit into cache anymore -- I suspect that pushes the producer's
latency to go to sleep over the edge and everything collapses.

We use wakeup patterns to determine if two tasks are working together and
should thus be kept together. Task A should wake up B, and B should wake up
A. Furthermore, any task should quickly go to sleep after waking up the
other.
This program does neither: with a single pair, the producer continues
production after waking the consumer (until the queue is filled -- which, if
the consumer is fast enough, might never happen). With multiple pairs there
is no strict pair relation at all, since they all work on the same global
buffer queue, so P1 can wake Cn etc. Furthermore the program uses shared
memory (not a bad design), and thus misses out on the explicit affinity
hints of pipes, sockets, etc.

In short, this program is carefully crafted to defeat all our affinity
tests - and I'm not sure what to do.
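For contrast, the wakeup pattern that heuristic is looking for is a strict ping-pong, where each thread wakes its peer and then immediately blocks again. A minimal illustrative sketch (not taken from the testcase; all names and counts are made up):

/*
 * Strict ping-pong between two threads: each wakes the other and then
 * immediately blocks again.  Illustrative sketch only.
 */
#include <pthread.h>
#include <stdio.h>

#define ROUNDS 100000

static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  cond = PTHREAD_COND_INITIALIZER;
static int turn;                        /* whose move it is: 0 or 1     */

static void *player(void *arg)
{
    int me = (int)(long)arg;
    int i;

    for (i = 0; i < ROUNDS; i++) {
        pthread_mutex_lock(&lock);
        while (turn != me)              /* sleep until it is our turn   */
            pthread_cond_wait(&cond, &lock);
        turn = !me;                     /* hand the token to the peer   */
        pthread_cond_signal(&cond);     /* wake the peer ...            */
        pthread_mutex_unlock(&lock);    /* ... and loop straight back
                                         * into pthread_cond_wait()     */
    }
    return NULL;
}

int main(void)
{
    pthread_t a, b;
    pthread_create(&a, NULL, player, (void *)0L);
    pthread_create(&b, NULL, player, (void *)1L);
    pthread_join(a, NULL);
    pthread_join(b, NULL);
    puts("done");
    return 0;
}

Here A wakes B, B wakes A, and each goes back to sleep right after the wakeup, which is the signature the producer/consumer testcase above never produces.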
> In short this program is carefully crafted to defeat all our affinity
> tests - and I'm not sure what to do.

I am sorry, it is not intentionally crafted that way. The function random()
is causing my problem. I currently have no real data, so I tried to generate
some random utilization and data.

Without the random() function it works even with 80MB of data and I get
great results.

./ThreadSchedulingIssue 1 10485760 8 312
All threads finished: 309 messages in 29.369 seconds / 10.521 msg/s
schedtool -a 1 -e ./ThreadSchedulingIssue 1 10485760 8 312
All threads finished: 312 messages in 44.284 seconds / 7.045 msg/s

It does not even regress with more than two threads.

./ThreadSchedulingIssue 2 10485760 8 312
All threads finished: 311 messages in 28.040 seconds / 11.091 msg/s
./ThreadSchedulingIssue 4 10485760 8 312
All threads finished: 309 messages in 28.021 seconds / 11.027 msg/s

With small amounts of data the speed on two cores is even doubled.

schedtool -a 1 -e ./ThreadSchedulingIssue 1 1048 8 312000
All threads finished: 311992 messages in 19.437 seconds / 16051.247 msg/s
./ThreadSchedulingIssue 3 1048 8 312000
All threads finished: 311998 messages in 9.652 seconds / 32324.411 msg/s
./ThreadSchedulingIssue 8 1048 8 312000
All threads finished: 311997 messages in 9.339 seconds / 33406.370 msg/s

--------------

Perhaps it is as it should be, but when I run the test (without random())
with 2*8 threads, it uses ~186% of the cpu, while an instance of
"bzip2 -9 -c /dev/urandom >/dev/null" gets only 12%.
On Thu, 2009-01-29 at 15:05 +0100, Thomas Pilarski wrote:
> > In short this program is carefully crafted to defeat all our affinity
> > tests - and I'm not sure what to do.
>
> I am sorry, although it is not carefully crafted. The function random()
> is causing my problem. I currently have no real data, so I tried to make
> some random utilization and data.

Yeah, rather big difference, mega-contention vs zero-contention.

2.6.28.2, profile of ThreadSchedulingIssue 4 524288 8 200

vma               samples  %        app name               symbol name
ffffffff80251efa  2574819  31.6774  vmlinux                futex_wake
ffffffff80251a39  1367613  16.8255  vmlinux                futex_wait
0000000000411790   815426  10.0320  ThreadSchedulingIssue  random
ffffffff8022b3b5   343692   4.2284  vmlinux                task_rq_lock
0000000000404e30   299316   3.6824  ThreadSchedulingIssue  __lll_lock_wait_private
ffffffff8030d430   262906   3.2345  vmlinux                copy_user_generic_string
ffffffff80462af2   235176   2.8933  vmlinux                schedule
0000000000411b90   210984   2.5957  ThreadSchedulingIssue  random_r
ffffffff80251730   129376   1.5917  vmlinux                hash_futex
ffffffff8020be10   123548   1.5200  vmlinux                system_call
ffffffff8020a679   119398   1.4689  vmlinux                __switch_to
ffffffff8022f49b   110068   1.3541  vmlinux                try_to_wake_up
ffffffff8024c4d1   106352   1.3084  vmlinux                sched_clock_cpu
ffffffff8020be20   102709   1.2636  vmlinux                system_call_after_swapgs
ffffffff80229a2d   100614   1.2378  vmlinux                update_curr
ffffffff80248309    86475   1.0639  vmlinux                add_wait_queue
ffffffff80253149    85969   1.0577  vmlinux                do_futex

Versus using myrand() free sample cruft generator from rand(3) manpage. Poof.

vma       samples  %        app name               symbol name
004002f4   979506  90.7113  ThreadSchedulingIssue  myrand
00400b00    53348   4.9405  ThreadSchedulingIssue  thread_consumer
00400c25    42710   3.9553  ThreadSchedulingIssue  thread_producer

One of those "don't _ever_ do that" things?

	-Mike
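The myrand() in the second profile is presumably the free sample generator from the rand(3) man page. A sketch of that approach; making the state per-thread with __thread is an assumption here (a plain static, as in the man page, would also bypass glibc's lock, at the cost of a benign race between threads):

/*
 * Lockless generator in the style of the rand(3) man page example.
 * The __thread storage class (a GNU extension) is an assumption; it
 * gives each thread its own state, so no lock is needed at all.
 */
static __thread unsigned long next = 1;

/* RAND_MAX assumed to be 32767, as in the man page example. */
static int myrand(void)
{
    next = next * 1103515245 + 12345;
    return (unsigned)(next / 65536) % 32768;
}

static void mysrand(unsigned int seed)
{
    next = seed;
}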
On Friday, 30.01.2009, 08:57 +0100, Mike Galbraith wrote:
> One of those "don't _ever_ do that" things?

I did not know that random() can end up in a system call. It is rather
unrealistic to have five million system calls per second. By adding a small
loop with some calculations next to the random() call, the problem
disappears as well. It was just an unluckily chosen data generator.
On Mon, 2009-02-02 at 08:43 +0100, Thomas Pilarski wrote:
> Am Freitag, den 30.01.2009, 08:57 +0100 schrieb Mike Galbraith:
> > One of those "don't _ever_ do that" things?
>
> I did not known random() uses a system call. It's rather unrealistic to
> have five million system calls in a second. By adding a small loop with
> some calculations near the random, the problem disappears too.
> It is a unlucky chosen data generator.

I suppose you'll have to go bug the glibc people about their random()
implementation.

If you really need random() to perform for your application (monte-carlo
stuff?), you might be better off writing a PRNG with TLS state or something.
On Monday, 02.02.2009, 09:19 +0100, Peter Zijlstra wrote:
> I suppose you'll have to go bug the glibc people about their random()
> implementation.

Yes, I will.

> If you really need random() to perform for your application (monte-carlo
> stuff?) You might be better off writing a PRNG with TLS state or
> something.

I just need some noise in my images.
On Mon, 2009-02-02 at 09:33 +0100, Thomas Pilarski wrote:
> Am Montag, den 02.02.2009, 09:19 +0100 schrieb Peter Zijlstra:
> > I suppose you'll have to go bug the glibc people about their random()
> > implementation.
>
> Yes, I will.

Finding the below was easy enough...

/* POSIX.1c requires that there is mutual exclusion for the `rand' and
   `srand' functions to prevent concurrent calls from modifying common
   data.  */
__libc_lock_define_initialized (static, lock)

...

long int
__random ()
{
  int32_t retval;

  __libc_lock_lock (lock);
  (void) __random_r (&unsafe_state, &retval);
  __libc_lock_unlock (lock);

  return retval;
}

...but finding the plumbing leading to __lll_lock_wait_private()
over-taxed my attention span.

	-Mike
On Mon, 2009-02-02 at 09:52 +0100, Mike Galbraith wrote:
> On Mon, 2009-02-02 at 09:33 +0100, Thomas Pilarski wrote:
> > Am Montag, den 02.02.2009, 09:19 +0100 schrieb Peter Zijlstra:
> > > I suppose you'll have to go bug the glibc people about their random()
> > > implementation.
> >
> > Yes, I will.
>
> Finding the below was easy enough...

Ah, that was a good clue. Apparently all you need to do is use random_r()
and provide your own state, and all should be well.
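A minimal sketch of that suggestion, using glibc's reentrant random_r()/initstate_r(). Apart from those two library calls, every name here (thread_rng, rng_init, rng_next, the state and iteration sizes) is made up for illustration. Each thread owns its own struct random_data, so the global lock inside random() is never touched:

/*
 * Per-thread PRNG state via glibc's reentrant random_r()/initstate_r().
 * Illustrative sketch only; names and sizes are assumptions.
 */
#define _GNU_SOURCE
#include <pthread.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define PRNG_STATE_SIZE 128             /* per-thread state buffer        */

struct thread_rng {
    struct random_data data;            /* must be zeroed before init     */
    char state[PRNG_STATE_SIZE];
};

/* Initialise one thread's private generator. */
static void rng_init(struct thread_rng *rng, unsigned int seed)
{
    memset(rng, 0, sizeof(*rng));       /* random_data must start zeroed  */
    initstate_r(seed, rng->state, sizeof(rng->state), &rng->data);
}

/* One draw from the thread's private generator: no global lock is taken,
 * so there is no futex contention between threads. */
static long rng_next(struct thread_rng *rng)
{
    int32_t r;
    random_r(&rng->data, &r);
    return r;
}

static void *worker(void *arg)
{
    struct thread_rng rng;
    long sum = 0;
    int i;

    rng_init(&rng, 12345u + (unsigned int)(long)arg);
    for (i = 0; i < 5000000; i++)       /* five million draws, no lock    */
        sum += rng_next(&rng);
    printf("thread %ld: sum %ld\n", (long)arg, sum);
    return NULL;
}

int main(void)
{
    pthread_t t[2];
    long i;

    for (i = 0; i < 2; i++)
        pthread_create(&t[i], NULL, worker, (void *)i);
    for (i = 0; i < 2; i++)
        pthread_join(t[i], NULL);
    return 0;
}

Note that struct random_data has to be zero-initialised before initstate_r() is called, hence the memset() above.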
Reply-To: peterz@infradead.org On Mon, 2009-02-02 at 09:55 +0100, Peter Zijlstra wrote: > Ah, that was a good clue, apparently all you need to so it use > random_r() and provide your own state and all should be well. Michael, would it make sense to add the random_r() family to the "SEE ALSO" section of the random() man page? (Admittedly, my random() manpage is ancient: 2008-03-07, so it might be this is already the case, in which case, ignore me :)
Reply-To: mtk.manpages@googlemail.com Hi Peter, On Tue, Feb 3, 2009 at 1:15 AM, Peter Zijlstra <peterz@infradead.org> wrote: > On Mon, 2009-02-02 at 09:55 +0100, Peter Zijlstra wrote: > >> Ah, that was a good clue, apparently all you need to so it use >> random_r() and provide your own state and all should be well. > > Michael, would it make sense to add the random_r() family to the "SEE > ALSO" section of the random() man page? > > (Admittedly, my random() manpage is ancient: 2008-03-07, so it might be > this is already the case, in which case, ignore me :) (Up-to-date version of the pages can always be found online at the location in the .sig.) Well, the man page already had this text under notes: This function should not be used in cases where multiple threads use random() and the behavior should be reproducible. Use random_r(3) for that purpose. But it certainly doesn't hurt to have random_r(3) also listed under the SEE ALSO, and I've added it for man-pages-3.18. Cheers, Michael
Reply-To: peterz@infradead.org On Tue, 2009-02-03 at 07:29 +1300, Michael Kerrisk wrote: > Hi Peter, > > On Tue, Feb 3, 2009 at 1:15 AM, Peter Zijlstra <peterz@infradead.org> wrote: > > On Mon, 2009-02-02 at 09:55 +0100, Peter Zijlstra wrote: > > > >> Ah, that was a good clue, apparently all you need to so it use > >> random_r() and provide your own state and all should be well. > > > > Michael, would it make sense to add the random_r() family to the "SEE > > ALSO" section of the random() man page? > > > > (Admittedly, my random() manpage is ancient: 2008-03-07, so it might be > > this is already the case, in which case, ignore me :) > > (Up-to-date version of the pages can always be found online at the > location in the .sig.) Ah, I'll try to remember that. > Well, the man page already had this text under notes: > > This function should not be used in cases where multiple > threads use random() and the behavior should be reproducible. > Use random_r(3) for that purpose. Yeah, but I found it eventually, but I generally don't read a full manpage when I'm looking for related functions, only the SEE ALSO section. > But it certainly doesn't hurt to have random_r(3) also listed under > the SEE ALSO, and I've added it for man-pages-3.18. Thanks.
On Mon, 02 Feb 2009 08:43:55 +0100, Thomas Pilarski said: > Am Freitag, den 30.01.2009, 08:57 +0100 schrieb Mike Galbraith: > > One of those "don't _ever_ do that" things? > > I did not known random() uses a system call. It's rather unrealistic to > have five million system calls in a second. By adding a small loop with > some calculations near the random, the problem disappears too. > It is a unlucky chosen data generator. Am I the only one that's scared by the concept of anything that beats on random numbers enough to need 5 million of them a second, but is still using the relatively sucky one that's in most glibc's? :)
This bug is now dead... so who closes it? -Mike
Was a glibc thing .. closed.