Bug 12369 - Performance drop from 2.6.10 to 2.6.21
Summary: Performance drop from 2.6.10 to 2.6.21
Status: REJECTED WILL_NOT_FIX
Alias: None
Product: Other
Classification: Unclassified
Component: Other
Hardware: All Linux
Importance: P1 normal
Assignee: other_other
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2009-01-05 22:35 UTC by Evan
Modified: 2009-01-12 00:31 UTC
3 users

See Also:
Kernel Version: 2.6.21
Subsystem:
Regression: Yes
Bisected commit-id:


Attachments
Sample code that exhibits the issue (5.66 KB, text/plain)
2009-01-08 11:34 UTC, AndyR

Description Evan 2009-01-05 22:35:46 UTC
Latest working kernel version: 2.6.10
Earliest failing kernel version: 2.6.18
Distribution: MontaVista CGE4/CGE5
Hardware Environment: PowerPC 750
Software Environment: MontaVista CGE4/CGE5
Problem Description:

We are working on an embedded Linux system using MontaVista CGE. Since we upgraded from CGE4 (2.6.10) to CGE5 (2.6.18), we have found that our applications' performance drops significantly. By writing a small test program and running it under strace, I found that the main performance difference is in the futex system call. It seems futex on CGE5 is much heavier than on CGE4.

strace data on CGE4 looks like this:
% time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- ----------------
 39.26    0.060724         222       274           write
 38.64    0.059762           2     27246           time
 19.51    0.030168           1     27244           nanosleep
  1.29    0.002002         501         4         1 futex
  0.64    0.000995         995         1           execve
  0.37    0.000579           2       274           gettimeofday
------ ----------- ----------- --------- --------- ----------------
100.00    0.154657                 55102        12 total

While on CGE5 it is:
% time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- ----------------
 69.14    0.347917       69583         5         1 futex
 23.53    0.118406           4     29890           nanosleep
  6.62    0.033335           1     29892           time
  0.52    0.002606           9       300           write
  0.07    0.000357           1       300           gettimeofday
  0.05    0.000239         239         1           execve
------ ----------- ----------- --------- --------- ----------------
100.00    0.503226                 60446        10 total


Steps to reproduce:
1. Write a simple C program that loops for a given duration, calling nanosleep for 5 ms on each iteration.
2. Build the program into an executable named tnano.
3. Run it on both 2.6.10 and 2.6.18 with "strace -Ffctrr -o nosleep.trace ./tnano", then cat nosleep.trace to see the data.
Comment 1 Evan 2009-01-05 22:43:47 UTC
Correct an error in bug description: 2.6.18 should be 2.6.21.
Comment 2 Andrew Morton 2009-01-05 23:06:16 UTC
gee, that's nearly three years worth of kernels.  And 2.6.21 is nearly
two years old!

Are you able to send a simple C program which demonstrates this?  Preferably
one which runs on x86 too, so we can see if the regression is there as well,
so we can bisect it.

Thanks.
Comment 3 Evan 2009-01-05 23:20:12 UTC
Below is my test program:

tnano.c
===========================

#include <string.h>
#include <math.h>
#include <stdio.h>
#include <stdarg.h>
#include <stdlib.h>
#include <errno.h>
#include <unistd.h>
#include <time.h>
#include <pthread.h>
#include <sys/timeb.h>

#define DURATION 300

#define EMF_LOG 1
#define PLOG_ERROR 1
#define SUCCESS 0
#define FAIL -1

void LogMessage(int a, int b, const char *func, const char *fmt, ...)
{
	struct timeb tp;
	struct tm mytm;
	char buf[1024];
	va_list ap;


	ftime(&tp);
	localtime_r(&tp.time, &mytm);
	va_start(ap, fmt);
	vsnprintf(buf, sizeof(buf), fmt, ap);
	va_end(ap);

	printf("%02d:%02d:%02d.%03d - %08x - %s - %s\n", mytm.tm_hour, mytm.tm_min,
		mytm.tm_sec, tp.millitm, pthread_self(), func, buf);
}

void *thr1(void *arg)
{
	int count;
	time_t now, beginTime;	
	struct timespec tv;

	time(&beginTime);
	LogMessage(EMF_LOG, PLOG_ERROR, __FUNCTION__, "started");
	
	count = 0;
	while (1)
	{
		time(&now);
		if (DURATION > 0 && now - beginTime > DURATION)
			break;

		tv.tv_sec = 0;
		tv.tv_nsec = (0x5F5E100 / 10);	/* 100000000 / 10 ns = 10 ms */
		nanosleep(&tv, NULL);

		count ++;
		if (count >= 100)
		{
			LogMessage(EMF_LOG, PLOG_ERROR, __FUNCTION__, "wake up");
			count = 0;
		}
	}

	return NULL;
}

int main(void)
{
	pthread_t tid1;
	pthread_attr_t attr;
	struct sched_param param;
	int ret;
	pthread_attr_init(&attr);
	param.sched_priority = 20;
	pthread_attr_setschedparam(&attr, &param);
	pthread_attr_setschedpolicy(&attr, SCHED_RR);
	pthread_attr_setinheritsched(&attr, PTHREAD_EXPLICIT_SCHED);
	ret = pthread_create(&tid1, &attr, thr1, (void *)10);
	if (ret != 0)
	{
		LogMessage(EMF_LOG, PLOG_ERROR, __FUNCTION__, "create thread fail , %d, %s", ret, strerror(ret));
		return 0;
	}

	LogMessage(EMF_LOG, PLOG_ERROR, __FUNCTION__, "start join");
	pthread_join(tid1, NULL);
	return 0;
}
Comment 4 Oleg Nesterov 2009-01-06 06:44:54 UTC
> It seems futex on CGE5 is much heavier than on CGE4.
> ...
> Steps to reproduce:
> 1. Write a simple C program that loops for a given duration, calling
> nanosleep for 5 ms on each iteration.

Could you clarify? This (and the test program you showed) doesn't
use futex at all.

Except for pthread_join(), but in that case the caller just sleeps
waiting for sub-thread completion.
Comment 5 Evan 2009-01-06 17:42:07 UTC
I don't know how futex is called; my program doesn't call it directly. Have you tried to compile my program and run it with strace?
Comment 6 Oleg Nesterov 2009-01-07 09:51:12 UTC
> Have you tried to compile my program and run it with strace?

Err... currently I can't. On my machine strace -f is broken ;)

Did you? Please, just do "strace -ffT -o xxx" and grep for
futex. And I bet you will see just a few futex syscalls which
are needed for pthread_create/pthread_join. OK, perhaps a couple
more, for the first printf() from the main thread.

But all of them should be very quick, except the one used by
pthread_join(). And this is _normal_, because the main thread
just sleeps DURATION seconds until sub-thread exits.

No?

To clarify, I mean futex == syscall, strace can't see the fast
path in the userspace.
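
To illustrate what I mean by the fast path, here is a simplified sketch
(my own code, not the real glibc implementation) of a futex-based lock.
The uncontended case is a single atomic operation in userspace and never
enters the kernel, so strace cannot see it; only the contended slow path
shows up as a futex() syscall:

#include <linux/futex.h>
#include <sys/syscall.h>
#include <unistd.h>

static int lock_word;	/* 0 = unlocked, 1 = locked */

static void sketch_lock(void)
{
	/* fast path: a plain compare-and-swap, no syscall, invisible to strace */
	while (!__sync_bool_compare_and_swap(&lock_word, 0, 1)) {
		/* slow path: only under contention do we enter the kernel */
		syscall(SYS_futex, &lock_word, FUTEX_WAIT, 1, NULL, NULL, 0);
	}
}

static void sketch_unlock(void)
{
	__sync_lock_release(&lock_word);	/* sets lock_word back to 0 */
	/* a real implementation tracks waiters and skips the wake when nobody waits */
	syscall(SYS_futex, &lock_word, FUTEX_WAKE, 1, NULL, NULL, 0);
}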
Comment 7 Evan 2009-01-07 23:24:34 UTC
I tried "strace -ffT -o xxx", the result is same as Oleg said. So that, then I would ask questions:

1. time shown by strace is duration the call last or actual CPU time the call used?

2. As you said, in pthread_join(), futex's time is very long. But when it a thread sleep, doesn't it release CPU?

3. How to explain the difference of strace data caught on CGE4 and CGE5? Those are same programs and same CPU. From top command, my test program occupies more CPU time on CGE5 than on CGE4. strace data also shows that.
Comment 8 Oleg Nesterov 2009-01-08 05:19:46 UTC
1. strace just reports the time spent in syscall, it can't know
   whether the tracee sleeps or burns CPU.

        $ strace -e nanosleep -T perl -e 'sleep 1'
        nanosleep({1, 0}, {1, 0})               = 0 <1.000121>

   note also that "-r" does _not_ measure the syscall itself
   (please read the man ;)

        $ strace -e nanosleep -r perl -e 'sleep 1; sleep 1'
             0.000000 nanosleep({1, 0}, {1, 0}) = 0
             1.000630 nanosleep({1, 0}, {1, 0}) = 0

   note the difference in report for two subsequent nanosleep's.

2. see above.

3. I don't know, sorry.

Did you do the testing with the same userspace (libc/strace) ?
Could you do "strace -ffT -o xxx" on CGE4 ? You can also add -tt.

Anyway. I don't trust top/strace/etc when it comes to performance
regressions. We need the test-case which demonstrates the slowdown
via the simple /usr/bin/time or something similar.

That said,

        > While on CGE5 is:
        > % time     seconds  usecs/call     calls    errors syscall
        > ------ ----------- ----------- --------- --------- ----------------
        >  69.14    0.347917       69583         5         1 futex

looks a bit strange for this test-case, but I don't really know what
strace does with "-c". Perhaps just a bug in strace... But, again,
please don't forget you used -r, this has nothing to do with the
performance of futex.

Evan, please either give more info about regression, or close this
bug.
Comment 9 Oleg Nesterov 2009-01-08 05:33:09 UTC
> From top command, my test program occupies more
> CPU time on CGE5 than on CGE4.

this _might_ be explained by CONFIG_HZ/etc difference, the
modern kernels make the sleep much more "precise".

please modify this program so that thr1() prints
"how_many_times_i_called_nanosleep" at the end.

not that I really think we will see the difference, but just
in case.
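
For instance, something like this self-contained sketch (my own code and
constants, not the reporter's tnano.c) would count the nanosleep() calls
and report the CPU time the process actually consumed, so the comparison
does not depend on strace at all:

#include <stdio.h>
#include <time.h>

int main(void)
{
	struct timespec req = { 0, 0x5F5E100 / 10 };	/* 10 ms, same constant as tnano.c */
	struct timespec cpu;
	time_t begin = time(NULL);
	long sleeps = 0;

	while (time(NULL) - begin < 30) {	/* 30 s run instead of DURATION=300 */
		nanosleep(&req, NULL);
		sleeps++;
	}

	clock_gettime(CLOCK_PROCESS_CPUTIME_ID, &cpu);
	printf("nanosleep calls: %ld, process CPU time: %ld.%09ld s\n",
	       sleeps, (long)cpu.tv_sec, cpu.tv_nsec);
	return 0;
}

(link with -lrt for clock_gettime() on older glibc)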
Comment 10 AndyR 2009-01-08 11:32:24 UTC
Hi folks,

Let me try to fill in some of the blanks here. The real application uses a timer signal to periodically unlock a semaphore (10ms intervals). In the case of the complete (embedded) system, there are multiple instances of the process running. In upgrading from CGE4 to CGE5, it was noted that the CPU utilization, as reported by top, was higher for what, in the test cases, were idle processes that should be doing little other than waking up, seeing that there is nothing to do, and going back to sleep via a semaphore.
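
(Roughly, that pattern looks like the sketch below. This is a hypothetical
illustration of the shape of the real application, not the attached
tsleep.c; the names and the 2000-cycle count are mine.)

#include <semaphore.h>
#include <signal.h>
#include <stdio.h>
#include <string.h>
#include <time.h>

static sem_t sem;

static void tick(int sig)
{
	(void)sig;
	sem_post(&sem);			/* async-signal-safe */
}

int main(void)
{
	struct sigaction sa;
	struct sigevent sev;
	struct itimerspec its;
	timer_t timer;
	int i;

	sem_init(&sem, 0, 0);

	memset(&sa, 0, sizeof(sa));
	sa.sa_handler = tick;
	sigaction(SIGALRM, &sa, NULL);

	memset(&sev, 0, sizeof(sev));
	sev.sigev_notify = SIGEV_SIGNAL;
	sev.sigev_signo = SIGALRM;
	timer_create(CLOCK_MONOTONIC, &sev, &timer);

	its.it_value.tv_sec = 0;
	its.it_value.tv_nsec = 10 * 1000 * 1000;	/* 10 ms */
	its.it_interval = its.it_value;
	timer_settime(timer, 0, &its, NULL);		/* fire every 10 ms */

	for (i = 0; i < 2000; i++)			/* ~20 s of ticks */
		while (sem_wait(&sem) != 0)
			;				/* restart after EINTR */

	printf("done\n");
	return 0;
}

(built the same way as the attachment, e.g. gcc -lrt timer_sem_sketch.c -o sketch,
where the file name is just a placeholder)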

After investigating, it appears that futex calls take significantly more time under the newer version. From test runs with the attached code, under the 2.6.21-based kernel, futex calls are reported as taking 10usec, vs 3usec in the 2.6.10 version. Doesn't seem like much, but we have enough instances running per processor that reported CPU utilization jumps 10%, which is enough to cause problems.

The attached file (tsleep.c) is a cut-down version of the test code we've been using to analyze the issue. In our case, it is cross compiled, so we do a ppc_7xx-gcc -lrt tsleep.c -o posix_sleeper to build it. POSIX and SYS V semaphores are both affected, and there's more of a penalty when the sem unlock is called from a timer signal vs. a sleeping thread, but this is the most simple case I could put together.

I tried benchmarking on x86 desktop versions, but only under VMware, and the numbers were so inconsistent that I don't think they're valid :-(. 

Testing with strace -fc returns numbers that seem appropriate as far as the counts go, and the overall CPU time is close enough to what one sees from top that it all seems consistent. The downside of '-c' is that it gives you averages, but running enough loops should reduce the effect of outliers. (One could also use -p to attach and get per-thread data once it has launched.)

Thanks for your time investigating this.
Comment 11 AndyR 2009-01-08 11:34:14 UTC
Created attachment 19721 [details]
Sample code that exhibits the issue

Simple test case, use gcc -lrt tsleep.c -o yourappnamehere
Comment 12 Oleg Nesterov 2009-01-09 04:59:07 UTC
I am sorry, but when I saw this test-case, my first impulse was to just close
the window and forget about this bug report ;)

> After investigating, it appears that futex calls take significantly more time
> under the newer version. From test runs with the attached code, under the
> 2.6.21-based kernel, futex calls are reported as taking 10usec, vs 3usec in
> the
> 2.6.10 version.

How did you get these numbers?

Did you use the same userspace (libc/etc) for the testing?

_Which_ futex calls were slowed down?


What this test-case does is very simple. But it is not immediately obvious
because it is overcomplicated. Please simplify the code as much as possible
before sending. try_wait was added (I guess) just to add more confusion.
And you didn't explain how this test-case should be used!

Afaics, what it does is

        int Cycles = 2000;
        sem_t MyPsem;
        int DoneRunning = 0;

        thr1()
        {
                for (0 .. Cycles)
                         sem_wait(&MyPsem);

                DoneRunning = 1;
        }

        main()
        {
                while (!DoneRunning) {
                        nanosleep(0x5F5E100 / 10);
                        sem_post(&MyPsem);
                }
        }

that's all; all the printfs are meaningless.

So. thr1() does futex(FUTEX_WAIT_). It should take 0x5F5E100 / 10 + delta.
main() does futex(FUTEX_WAKE_). Where (and how) do you see the slowdown?
If in delta, the reason might be not the futex but (say) the scheduler.
If in FUTEX_WAKE_, it might be the scheduler again. You will get different
numbers depending on whether they run on the same CPU or not. Please
check.
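
For example, a minimal sketch (my code, not part of the attached test) that
pins the whole process, and hence both threads, to one CPU before the test
starts:

#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>

int pin_to_cpu0(void)
{
	cpu_set_t set;

	CPU_ZERO(&set);
	CPU_SET(0, &set);
	/* pid 0 == the calling process; threads created afterwards inherit the mask */
	if (sched_setaffinity(0, sizeof(set), &set) != 0) {
		perror("sched_setaffinity");
		return -1;
	}
	return 0;
}

Call it at the top of main(), before pthread_create(), and compare with an
unpinned run.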

And please re-test with a recent kernel. I bet nobody will even try to
fix 2.6.21.

That said, I won't be surprised by a 3usec-to-10usec slowdown, but I am not an
expert on futexes, and I have no idea how much the code differs compared
to 2.6.10.
Comment 13 Oleg Nesterov 2009-01-09 05:16:42 UTC
forgot to mention...

Please note that it is just silly to use strace to
get the performance numbers, especially to measure
the "fast" syscalls. And "-c" makes this even more
silly, because it mixes the _quite different_ futex
syscalls.
