Created attachment 27401 [details] ver_linux output from one affected system In a program intensive in user time CPU, the 'utime' field in /proc/<pid>/stat becomes freezed at some point (utime will never be incremented again) and the 'stime' starts to (wrongly) receive the increments instead. At the same time /proc/stat correctly assings the CPU spent as user mode time (in fact, commands like "top" still show proper global CPU reports). But this problem causes getrusage() to return incorrect values, reporting user mode time as kernel time for the running process (for example, the GNU gmp library benchmark reports wrong timings). Affected kernel versions: 2.6.33, 2.6.34, 2.6.35 Last kernel version not affected by this bug: 2.6.32 Machines known to reproduce the bug: Core 2 Duo E8500, Atom N270, Pentium D. Steps to reproduce the problem: 1) Make a tiny C program doing an infinite loop and run it 2) Examine /proc/PID/stat: - After 2-4 minutes, the User CPU time becomes stalled (in a value from about 15000 to 25000) and will never increase again. - At the same time, the Kernel CPU time incorrectly begins to increase, reporting 100% of the CPU ticks. - However /proc/stat correctly continues reporting the CPU being spent as User time. In example bellow, the user time becomes freezed at "21533" after just 215.33 seconds: ciri@quark:/proc/12096$ cat stat | awk '{ print $14 " " $15 }' ; head -1 ../stat | awk '{ print $2 " " $4 }' ; date 20549 21 116668 6587 Mon Aug 9 23:53:53 CEST 2010 ciri@quark:/proc/12096$ cat stat | awk '{ print $14 " " $15 }' ; head -1 ../stat | awk '{ print $2 " " $4 }' ; date 21533 21 117656 6588 Mon Aug 9 23:54:03 CEST 2010 ciri@quark:/proc/12096$ cat stat | awk '{ print $14 " " $15 }' ; head -1 ../stat | awk '{ print $2 " " $4 }' ; date 21533 639 118275 6589 Mon Aug 9 23:54:09 CEST 2010 ciri@quark:/proc/12096$ cat stat | awk '{ print $14 " " $15 }' ; head -1 ../stat | awk '{ print $2 " " $4 }' ; date 21533 1795 119434 6590 Mon Aug 9 23:54:20 CEST 2010 ciri@quark:/proc/12096$ cat stat | awk '{ print $14 " " $15 }' ; head -1 ../stat | awk '{ print $2 " " $4 }' ; date 21533 16337 134138 6670 Mon Aug 9 23:56:46 CEST 2010
Apparently happens only on 32-bit systems, e.g. x86_64 does not have this problem while i686 does.
An easy way to test this is with openssl. Notice that the last two tests complete in 0 seconds and supposedly compute an infinite number of keys per second, and the 20 seconds that they take are accounted as system time: $ time openssl speed rsa Doing 512 bit private rsa's for 10s: 22511 512 bit private RSA's in 9.97s Doing 512 bit public rsa's for 10s: 244429 512 bit public RSA's in 9.99s Doing 1024 bit private rsa's for 10s: 4103 1024 bit private RSA's in 10.00s Doing 1024 bit public rsa's for 10s: 78098 1024 bit public RSA's in 9.99s Doing 2048 bit private rsa's for 10s: 624 2048 bit private RSA's in 10.00s Doing 2048 bit public rsa's for 10s: 22588 2048 bit public RSA's in 10.00s Doing 4096 bit private rsa's for 10s: 94 4096 bit private RSA's in 0.00s Doing 4096 bit public rsa's for 10s: 5502 4096 bit public RSA's in 0.00s ... sign verify sign/s verify/s rsa 512 bits 0.000443s 0.000041s 2257.9 24467.4 rsa 1024 bits 0.002437s 0.000128s 410.3 7817.6 rsa 2048 bits 0.016026s 0.000443s 62.4 2258.8 rsa 4096 bits 0.000000s 0.000000s inf inf real 1m20.278s user 1m0.086s sys 0m20.114s $ uname -r 2.6.34.6-54.fc13.i686.PAE
Fixed in 2.6.35.6 and in other branches: http://lkml.org/lkml/2010/9/14/180 See also: https://bugzilla.redhat.com/show_bug.cgi?id=633037