Latest working kernel version: 2.6.10
Earliest failing kernel version: 2.6.18
Distribution: MontaVista CGE4/CGE5
Hardware Environment: PowerPC 750
Software Environment: MontaVista CGE4/CGE5

Problem Description:

We are working on an embedded Linux system using MontaVista CGE. Since we upgraded from CGE4 (2.6.10) to CGE5 (2.6.18), our applications' performance has dropped considerably. By writing a small test program and running it under strace, I found that the main performance difference is in the futex system call. It seems futex on CGE5 is much heavier than on CGE4.

strace data on CGE4 looks like this:

% time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- ----------------
 39.26    0.060724         222       274           write
 38.64    0.059762           2     27246           time
 19.51    0.030168           1     27244           nanosleep
  1.29    0.002002         501         4         1 futex
  0.64    0.000995         995         1           execve
  0.37    0.000579           2       274           gettimeofday
------ ----------- ----------- --------- --------- ----------------
100.00    0.154657                 55102        12 total

While on CGE5 it is:

% time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- ----------------
 69.14    0.347917       69583         5         1 futex
 23.53    0.118406           4     29890           nanosleep
  6.62    0.033335           1     29892           time
  0.52    0.002606           9       300           write
  0.07    0.000357           1       300           gettimeofday
  0.05    0.000239         239         1           execve
------ ----------- ----------- --------- --------- ----------------
100.00    0.503226                 60446        10 total

Steps to reproduce:
1. Write a simple C program that just loops for a certain time, doing a 5ms nanosleep on each iteration.
2. Build the program into an executable named tnano.
3. Run "strace -Ffctrr -o nosleep.trace ./tnano" on both 2.6.10 and 2.6.18, then cat nosleep.trace to compare the data.
Correct an error in bug description: 2.6.18 should be 2.6.21.
gee, that's nearly three years worth of kernels. And 2.6.21 is nearly two years old! Are you able to send a simple C program which demonstrates this? Preferably one which runs on x86 too, so we can see if the regression is there as well, so we can bisect it. Thanks.
Below is my test program:

tnano.c
===========================
#include <string.h>
#include <math.h>
#include <stdio.h>
#include <stdarg.h>
#include <stdlib.h>
#include <errno.h>
#include <unistd.h>
#include <time.h>
#include <pthread.h>
#include <sys/timeb.h>

#define DURATION 300
#define EMF_LOG 1
#define PLOG_ERROR 1
#define SUCCESS 0
#define FAIL -1

void LogMessage(int a, int b, const char *func, const char *fmt, ...)
{
    struct timeb tp;
    struct tm mytm;
    char buf[1024];
    va_list ap;

    ftime(&tp);
    localtime_r(&tp.time, &mytm);
    va_start(ap, fmt);
    vsnprintf(buf, sizeof(buf), fmt, ap);
    va_end(ap);
    printf("%02d:%02d:%02d.%03d - %08x - %s - %s\n",
           mytm.tm_hour, mytm.tm_min, mytm.tm_sec, tp.millitm,
           (unsigned int)pthread_self(), func, buf);
}

void *thr1(void *arg)
{
    int count;
    time_t now, beginTime;
    struct timespec tv;

    time(&beginTime);
    LogMessage(EMF_LOG, PLOG_ERROR, __FUNCTION__, "started");
    count = 0;
    while (1) {
        time(&now);
        if (DURATION > 0 && now - beginTime > DURATION)
            break;
        tv.tv_sec = 0;
        tv.tv_nsec = (0x5F5E100 / 10);
        nanosleep(&tv, NULL);
        count++;
        if (count >= 100) {
            LogMessage(EMF_LOG, PLOG_ERROR, __FUNCTION__, "wake up");
            count = 0;
        }
    }
    return NULL;
}

int main(void)
{
    pthread_t tid1;
    pthread_attr_t attr;
    struct sched_param param;
    int ret;

    pthread_attr_init(&attr);
    param.sched_priority = 20;
    pthread_attr_setschedparam(&attr, &param);
    pthread_attr_setschedpolicy(&attr, SCHED_RR);
    pthread_attr_setinheritsched(&attr, PTHREAD_EXPLICIT_SCHED);
    ret = pthread_create(&tid1, &attr, thr1, (void *)10);
    if (ret != 0) {
        LogMessage(EMF_LOG, PLOG_ERROR, __FUNCTION__,
                   "create thread fail, %d, %s", ret, strerror(ret));
        return 0;
    }
    LogMessage(EMF_LOG, PLOG_ERROR, __FUNCTION__, "start join");
    pthread_join(tid1, NULL);
    return 0;
}
> It seems futex on CGE5 is much heavier than on CGE4. > ... > Steps to reproduce: > 1. Write a simple C program that just looply certain time to do nanosleep > 5ms. Could you clarify? This (and the test program you showed) doesn't use futex at all. Except for pthread_join(), but in that case the caller just sleeps waiting for sub-thread completion.
I don't know how futex gets called. My program doesn't call it directly. Have you tried compiling my program and running it with strace?
> Have you try to compile my program and run it with strace? Err... currently I can't. On my machine strace -f is broken ;) Did you? Please, just do "strace -ffT -o xxx" and grep for futex. And I bet you will see just a few futex syscalls which are needed for pthread_create/pthread_join. OK, perhaps a couple more, for the first printf() from the main thread. But all of them should be very quick, except the one used by pthread_join(). And this is _normal_, because the main thread just sleeps DURATION seconds until sub-thread exits. No? To clarify, I mean futex == syscall, strace can't see the fast path in the userspace.
I tried "strace -ffT -o xxx"; the result is as Oleg said. That leads me to a few questions:

1. Is the time shown by strace the duration the call lasted, or the actual CPU time the call used?
2. As you said, the futex time in pthread_join() is very long. But when a thread sleeps, doesn't it release the CPU?
3. How do we explain the difference between the strace data caught on CGE4 and CGE5? It is the same program and the same CPU.

From the top command, my test program occupies more CPU time on CGE5 than on CGE4. The strace data also shows that.
1. strace just reports the time spent in the syscall; it can't know whether the tracee sleeps or burns CPU.

$ strace -e nanosleep -T perl -e 'sleep 1'
nanosleep({1, 0}, {1, 0}) = 0 <1.000121>

note also that "-r" does _not_ measure the syscall itself (please read the man ;)

$ strace -e nanosleep -r perl -e 'sleep 1; sleep 1'
     0.000000 nanosleep({1, 0}, {1, 0}) = 0
     1.000630 nanosleep({1, 0}, {1, 0}) = 0

note the difference in the report for the two subsequent nanosleeps.

2. see above.

3. I don't know, sorry. Did you do the testing with the same userspace (libc/strace)? Could you do "strace -ffT -o xxx" on CGE4? You can also add -tt.

Anyway. I don't trust top/strace/etc when it comes to performance regressions. We need a test-case which demonstrates the slowdown via the simple /usr/bin/time or something similar.

That said,

> While on CGE5 is:
> % time     seconds  usecs/call     calls    errors syscall
> ------ ----------- ----------- --------- --------- ----------------
>  69.14    0.347917       69583         5         1 futex

looks a bit strange for this test-case, but I don't really know what strace does with "-c". Perhaps just a bug in strace... But, again, please don't forget you used -r; this has nothing to do with the performance of futex.

Evan, please either give more info about the regression, or close this bug.
> From top command, my test program occupies more
> CPU time on CGE5 than on CGE4.

this _might_ be explained by a CONFIG_HZ/etc difference; the modern kernels make the sleep much more "precise".

please modify this program so that thr1() prints "how_many_times_i_called_nanosleep" at the end. Not that I really think we will see a difference, but just in case.
Hi folks,

Let me try to fill in some of the blanks here. The real application uses a timer signal to periodically unlock a semaphore (10ms intervals). In the complete (embedded) system, there are multiple instances of the process running. In upgrading from CGE4 to CGE5, it was noted that CPU utilization, as reported by top, was higher for what, in the test cases, were idle processes that should be doing little other than waking up, seeing that there is nothing to do, and going back to sleep on a semaphore.

After investigating, it appears that futex calls take significantly more time under the newer version. From test runs with the attached code, under the 2.6.21-based kernel, futex calls are reported as taking 10usec, vs 3usec under 2.6.10. That doesn't seem like much, but we have enough instances running per processor that reported CPU utilization jumps 10%, which is enough to cause problems.

The attached file (tsleep.c) is a cut-down version of the test code we've been using to analyze the issue. In our case it is cross-compiled, so we do a

ppc_7xx-gcc -lrt tsleep.c -o posix_sleeper

to build it. POSIX and SYS V semaphores are both affected, and there's more of a penalty when the sem unlock is called from a timer signal vs. a sleeping thread, but this is the simplest case I could put together. I tried benchmarking on x86 desktop versions, but only under VMware, and the numbers were so inconsistent that I don't think they're valid :-(.

Testing with strace -fc returns numbers that seem appropriate as far as counts go, and the overall CPU time is close enough to what one sees from top that it all seems consistent. The downside of '-c' is that it gives you averages, but running enough loops should reduce the effect of outliers. (One could also use -p to get thread data once it has launched.)

Thanks for your time investigating this.
Created attachment 19721 [details] Sample code that exhibits the issue Simple test case, use gcc -lrt tsleep.c -o yourappnamehere
I am sorry, but when I saw this test-case, my first intent was to just close the window and forget about this bug report ;)

> After investigating, it appears that futex calls take significantly more time
> under the newer version. From test runs with the attached code, under the
> 2.6.21-based kernel, futex calls are reported as taking 10usec, vs 3usec in
> the 2.6.10 version.

How did you get these numbers? Did you use the same userspace (libc/etc) for the testing? _Which_ futex calls were slowed down?

What this test-case does is very simple. But it is not immediately obvious, because it is overcomplicated. Please simplify the code as much as possible before sending. try_wait was added (I guess) just to add more confusion. And you didn't explain how this test-case should be used!

Afaics, what it does is:

int Cycles = 2000;
sem_t MyPSem;
int DoneRunning = 0;

thr1()
{
    for (0 .. Cycles)
        sem_wait(&MyPSem);
    DoneRunning = 1;
}

main()
{
    while (!DoneRunning) {
        nanosleep(0x5F5E100 / 10);
        sem_post(&MyPSem);
    }
}

that's all, and all the printfs are meaningless.

So. thr1() does futex(FUTEX_WAIT). It should take 0x5F5E100 / 10 + delta. main() does futex(FUTEX_WAKE).

Where (and how) do you see the slowdown? If in delta, the reason might be not the futex but (say) the scheduler. If in FUTEX_WAKE, it might be the scheduler again. You will get different numbers depending on whether the threads run on the same CPU or not. Please check.

And please re-test with a recent kernel. I bet nobody will even try to fix 2.6.21.

That said, I won't be surprised by a 3usec-to-10usec slowdown, but I am not an expert on futexes, and I have no idea how much the code differs compared to 2.6.10.
forgot to mention... Please note that it is just silly to use strace to get the performance numbers, especially to measure the "fast" syscalls. And "-c" makes this even more silly, because it mixes the _quite different_ futex syscalls.