Bug 16559

Summary: /proc/<pid>/stat incorrect user and kernel time accounting
Product: Timers Reporter: Ciriaco Garcia de Celis (sysman)
Component: OtherAssignee: john stultz (john.stultz)
Status: CLOSED CODE_FIX    
Severity: normal CC: cebbert, kernel-bugzilla
Priority: P1    
Hardware: All   
OS: Linux   
Kernel Version: 2.6.33, 2.6.34, 2.6.35 Subsystem:
Regression: No Bisected commit-id:
Attachments: ver_linux output from one affected system

Description Ciriaco Garcia de Celis 2010-08-10 23:56:48 UTC
Created attachment 27401 [details]
ver_linux output from one affected system

In a program intensive in user time CPU, the 'utime' field in /proc/<pid>/stat becomes freezed at some point (utime will never be incremented again) and the 'stime' starts to (wrongly) receive the increments instead.

At the same time /proc/stat correctly assings the CPU spent as user mode time (in fact, commands like "top" still show proper global CPU reports). But this problem causes getrusage() to return incorrect values, reporting user mode time as kernel time for the running process (for example, the GNU gmp library benchmark reports wrong timings).

Affected kernel versions: 2.6.33, 2.6.34, 2.6.35
Last kernel version not affected by this bug: 2.6.32
Machines known to reproduce the bug: Core 2 Duo E8500, Atom N270, Pentium D.

Steps to reproduce the problem:

1) Make a tiny C program doing an infinite loop and run it

2) Examine /proc/PID/stat:
   - After 2-4 minutes, the User CPU time becomes stalled (in a value from about 15000 to 25000) and will never increase again.
   - At the same time, the Kernel CPU time incorrectly begins to increase, reporting 100% of the CPU ticks.
   - However /proc/stat correctly continues reporting the CPU being spent as User time.

In example bellow, the user time becomes freezed at "21533" after just 215.33 seconds:

ciri@quark:/proc/12096$ cat stat | awk '{ print $14 " " $15 }' ; head -1 ../stat | awk '{ print $2 " " $4 }' ; date
20549 21
116668 6587
Mon Aug  9 23:53:53 CEST 2010

ciri@quark:/proc/12096$ cat stat | awk '{ print $14 " " $15 }' ; head -1 ../stat | awk '{ print $2 " " $4 }' ; date
21533 21
117656 6588
Mon Aug  9 23:54:03 CEST 2010

ciri@quark:/proc/12096$ cat stat | awk '{ print $14 " " $15 }' ; head -1 ../stat | awk '{ print $2 " " $4 }' ; date
21533 639
118275 6589
Mon Aug  9 23:54:09 CEST 2010

ciri@quark:/proc/12096$ cat stat | awk '{ print $14 " " $15 }' ; head -1 ../stat | awk '{ print $2 " " $4 }' ; date
21533 1795
119434 6590
Mon Aug  9 23:54:20 CEST 2010

ciri@quark:/proc/12096$ cat stat | awk '{ print $14 " " $15 }' ; head -1 ../stat | awk '{ print $2 " " $4 }' ; date
21533 16337
134138 6670
Mon Aug  9 23:56:46 CEST 2010
Comment 1 Chuck Ebbert 2010-09-14 13:09:14 UTC
Apparently happens only on 32-bit systems, e.g. x86_64 does not have this problem while i686 does.
Comment 2 Chuck Ebbert 2010-09-14 13:22:02 UTC
An easy way to test this is with openssl. Notice that the last two tests complete in 0 seconds and supposedly compute an infinite number of keys per second, and the 20 seconds that they take are accounted as system time:

$ time openssl speed rsa
Doing 512 bit private rsa's for 10s: 22511 512 bit private RSA's in 9.97s
Doing 512 bit public rsa's for 10s: 244429 512 bit public RSA's in 9.99s
Doing 1024 bit private rsa's for 10s: 4103 1024 bit private RSA's in 10.00s
Doing 1024 bit public rsa's for 10s: 78098 1024 bit public RSA's in 9.99s
Doing 2048 bit private rsa's for 10s: 624 2048 bit private RSA's in 10.00s
Doing 2048 bit public rsa's for 10s: 22588 2048 bit public RSA's in 10.00s
Doing 4096 bit private rsa's for 10s: 94 4096 bit private RSA's in 0.00s
Doing 4096 bit public rsa's for 10s: 5502 4096 bit public RSA's in 0.00s
...
                  sign    verify    sign/s verify/s
rsa  512 bits 0.000443s 0.000041s   2257.9  24467.4
rsa 1024 bits 0.002437s 0.000128s    410.3   7817.6
rsa 2048 bits 0.016026s 0.000443s     62.4   2258.8
rsa 4096 bits 0.000000s 0.000000s      inf      inf

real	1m20.278s
user	1m0.086s
sys	0m20.114s

$ uname -r
2.6.34.6-54.fc13.i686.PAE
Comment 3 Ciriaco Garcia de Celis 2010-09-27 23:07:44 UTC
Fixed in 2.6.35.6 and in other branches:

  http://lkml.org/lkml/2010/9/14/180

See also:

  https://bugzilla.redhat.com/show_bug.cgi?id=633037