Bug 9135

Summary: top displaying 9999% CPU usage
Product: Process Management
Reporter: Rafael J. Wysocki (rjwysocki)
Component: Other
Assignee: process_other
Status: RESOLVED PATCH_ALREADY_AVAILABLE
Severity: normal
CC: borntraeger, bsingharora, cebbert, elendil, spencer
Priority: P1
Hardware: All
OS: Linux
Kernel Version: 2.6.23-rc9
Subsystem:
Regression: Yes
Bisected commit-id:
Bug Depends on:
Bug Blocks: 9056

Description Rafael J. Wysocki 2007-10-08 12:46:23 UTC
References      : http://lkml.org/lkml/2007/10/3/123
Submitter       : Frans Pop <elendil@planet.nl>
Handled-By      : Chuck Ebbert <cebbert@redhat.com>
                  Christian Borntraeger <borntraeger@de.ibm.com>
Comment 1 Rafael J. Wysocki 2007-10-08 12:47:11 UTC
A patch has been proposed: http://lkml.org/lkml/2007/10/4/389
Comment 2 Rafael J. Wysocki 2007-10-08 12:47:56 UTC
Another patch has been proposed: http://lkml.org/lkml/2007/10/4/405
Comment 3 Frans Pop 2007-10-12 13:22:49 UTC
The second proposed patch has been shown to not completely fix the issue:
http://lkml.org/lkml/2007/10/5/128

The first proposed patch is basically a reversion to 2.6.22 behavior and has been tested to fix the regression, although it may not be the best solution in the long run as it reduces accuracy:
http://lkml.org/lkml/2007/10/5/150
Comment 4 Frans Pop 2007-11-05 14:02:13 UTC
Fix committed in Linus' git tree for 2.6.24 with the following two commits:

commit 73a2bcb0edb9ffb0b007b3546b430e2c6e415eee
Author: Peter Zijlstra <a.p.zijlstra@chello.nl>
    sched: keep utime/stime monotonic

commit 9301899be75b464ef097f0b5af7af6d9bd8f68a7
Author: Balbir Singh <balbir@linux.vnet.ibm.com>
    sched: fix /proc/<PID>/stat stime/utime monotonicity, part 2

Currently waiting for the stable team to respond to a request to include these changes in the next point release.
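
The commit titles above talk about keeping utime/stime monotonic. As a rough illustration of that general idea only (this is not the kernel code from either commit, and the class and method names are hypothetical), a reporting layer can remember the last value it handed out and never return anything smaller:

```python
class MonotonicTimes:
    """Hypothetical helper: never report a utime/stime lower than before."""

    def __init__(self):
        # Last values handed out to the reader, in clock ticks.
        self.prev_utime = 0
        self.prev_stime = 0

    def report(self, raw_utime, raw_stime):
        # Clamp each freshly computed value against the previous report,
        # so a reader sampling repeatedly never sees time go backwards.
        self.prev_utime = max(self.prev_utime, raw_utime)
        self.prev_stime = max(self.prev_stime, raw_stime)
        return self.prev_utime, self.prev_stime
```

With this sketch, a raw sample of (9, 7) arriving after (10, 5) is reported as (10, 7): utime is held at its previous value while stime advances.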
Comment 5 Spencer Candland 2008-08-21 13:05:05 UTC
Unfortunately commits 73a2bcb0edb9ffb0b007b3546b430e2c6e415eee and 9301899be75b464ef097f0b5af7af6d9bd8f68a7 do not appear to completely fix this bug.  I am still seeing this, at least in certain circumstances.

I am seeing this behavior with all 2.6.23+ kernels, including 2.6.26.3 and 2.6.27-rc4, on x86_64 SMP Xeon and Opteron machines.  I am able to replicate this reliably by running Apache with the worker MPM and running a simple Apache benchmark, for example:

ab -r -n 100000 -c 1000 http://localhost/

Watching stime and utime for the Apache children, I see both jump around a bit and decrease quite frequently.
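
For anyone who wants to watch for the same symptom, here is a minimal sketch of one way to do it. It is not the script used for the observations above. It reads utime and stime (fields 14 and 15) from /proc/<PID>/stat and flags any sample where either value is lower than in the previous sample. The function names and the 0.1 second polling interval are assumptions made for illustration; per-thread values can be read the same way from /proc/<PID>/task/<TID>/stat.

```python
import time


def parse_utime_stime(stat_line):
    """Return (utime, stime) in clock ticks from one /proc/<PID>/stat line."""
    # The command name (field 2) is parenthesised and may contain spaces,
    # so split on the last ')' instead of on whitespace alone.
    rest = stat_line.rsplit(")", 1)[1].split()
    # rest[0] is field 3 (state); utime and stime are fields 14 and 15.
    return int(rest[11]), int(rest[12])


def find_decreases(samples):
    """Return indices of samples where utime or stime went backwards."""
    bad = []
    for i in range(1, len(samples)):
        prev, cur = samples[i - 1], samples[i]
        if cur[0] < prev[0] or cur[1] < prev[1]:
            bad.append(i)
    return bad


def watch(pid, interval=0.1):
    """Poll /proc/<pid>/stat forever and print every backwards step."""
    prev = None
    while True:
        with open(f"/proc/{pid}/stat") as f:
            cur = parse_utime_stime(f.read())
        if prev is not None and (cur[0] < prev[0] or cur[1] < prev[1]):
            print(f"pid {pid}: utime/stime went backwards: {prev} -> {cur}")
        prev = cur
        time.sleep(interval)
```

Calling watch() with the PID of one of the Apache children while the benchmark runs prints a line each time a decrease is observed.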

The first patch from this bug, which reverts to the 2.6.22 behavior, appears to correct the problem.

I have been able to reliably duplicate this bug using x86_64 defconfig on all 2.6.23+ kernels.  I have also tried many timer- and clock-related config options (with and without NO_HZ, HIGH_RES_TIMERS, different RTC options, etc.) and changing the clocksource (tsc, hpet, jiffies), but with no success.  I never see any complaints about the clock or timer, even when booting with report_lost_ticks.

I am unable to duplicate this when running with nosmp, and I have only seen it on threaded processes; however, I am unable to say for sure whether this only occurs on SMP and with threaded processes, or whether my testing is simply insufficient to replicate it elsewhere.