Bug 16314 - Erroneous idle times for processes.
Summary: Erroneous idle times for processes.
Status: RESOLVED CODE_FIX
Alias: None
Product: Timers
Classification: Unclassified
Component: Other (show other bugs)
Hardware: All Linux
: P1 normal
Assignee: Jeremy Fitzhardinge
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2010-06-28 22:15 UTC by overlordq
Modified: 2010-09-01 18:35 UTC (History)
1 user (show)

See Also:
Kernel Version: 2.6.34
Subsystem:
Regression: Yes
Bisected commit-id:


Attachments

Description overlordq 2010-06-28 22:15:52 UTC
First of all, this a kernel running in a pv-grub domU instance.

That out of the way. Let's take a look at /proc/uptime which was changed[1] recently.

# cat /proc/uptime
266799.19 2122033.45
# bc -l <<<'100-(2122033.45*100/266799.19)'
-695.36727604008093128018

Total uptime is about 700% smaller then idle time. I do belive this qualifies as 'interesting results'[2].

Now lets take a look at some running processes:

# ps aux
USER       PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
root         1  0.0  0.0   8396   316 ?        Ss   Jun25   0:38 init [2]
root         2  0.0  0.0      0     0 ?        S    Jun25   0:00 [kthreadd]
root         3  0.0  0.0      0     0 ?        S    Jun25   0:00 [migration/0]
root         4 6908619  0.0   0     0 ?        S    Jun25 21067849:40 [ksoftirqd/0]
root         5  0.0  0.0      0     0 ?        S    Jun25   0:00 [migration/1]
root         6  0.0  0.0      0     0 ?        S    Jun25   0:00 [ksoftirqd/1]
root         7  0.0  0.0      0     0 ?        S    Jun25   0:00 [migration/2]
root         8  0.0  0.0      0     0 ?        S    Jun25   0:02 [ksoftirqd/2]
root         9  0.0  0.0      0     0 ?        S    Jun25   0:00 [migration/3]
root        10 6908619  0.0   0     0 ?        S    Jun25 21067849:44 [ksoftirqd/3]
root        11  0.0  0.0      0     0 ?        S    Jun25   0:29 [events/0]
root        12 6908618  0.0   0     0 ?        S    Jun25 21067846:52 [events/1]
root        13 13818288  0.0  0     0 ?        S    Jun25 42182429:26 [events/2]
root        14 6908618  0.0   0     0 ?        S    Jun25 21067848:43 [events/3]

Pretty sure it hasn't been running for 4812 years.

# cat /proc/13/stat
13 (events/2) S 2 0 0 0 -1 2216722496 0 0 0 0 1844394016203 1844674397263 0 0 20 0 1 0 0 0 0 18446744073709551615 0 0 0 0 0 0 0 2147483647 0 0 0 0 17 2 0 0 0 0 0

# cat /proc/11/stat
11 (events/0) S 2 0 0 0 -1 2216722496 0 0 0 0 31 2926 0 0 20 0 1 0 0 0 0 18446744073709551615 0 0 0 0 0 0 0 2147483647 0 0 0 0 17 0 0 0 0 0 0

Current kernel config can be found here[3]. Fairly certain this didn't do this on older kernels, but I'm not 100% sure when it shown up.

Misc likely irrelevant info:

# cat /proc/version
Linux version 2.6.34 (root@thedarkcitadel.com) (gcc version 4.4.4 (Debian 4.4.4-4) ) #2 SMP Fri Jun 11 14:54:47 UTC 2010

# cat /proc/cpuinfo[4]

# cat /proc/ioports
0000-ffff : paravirt-ioport
  0070-0071 : rtc_cmos

# cat /proc/iomem
00000000-00000fff : reserved
00001000-0009ffff : System RAM
000a0000-000fffff : reserved
00100000-1fffffff : System RAM
  01000000-01502d84 : Kernel code
  01502d85-01895e6f : Kernel data
  0190f000-019bac8f : Kernel bss

# cat /proc/stat
cpu  122050 0 175567 212686266 558050 4 2353 11722 0 0
cpu0 64035 0 108214 52607311 454021 3 1824 4115 0 0
cpu1 23684 0 23669 53366541 35141 0 175 2715 0 0
cpu2 16829 0 19625 53367490 30240 0 154 2442 0 0
cpu3 17501 0 24057 53344922 38646 0 199 2448 0 0
intr 38758769 9459273 860276 482 0 253 7418083 652162 709 0 2366 5916902 350321 700 0 1089 6046488 316670 571 0 514 358 93 3822286 1 1723059 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
ctxt 40317508
btime 1277495889
processes 83837
procs_running 1
procs_blocked 0
softirq 38955506 0 17994681 503954 2186164 0 0 1 2723076 25210 15522420

1-http://www.kernel.org/pub/linux/kernel/v2.6/ChangeLog-2.6.30.9
2-http://lkml.indiana.edu/hypermail/linux/kernel/0905.1/01059.html
3-http://www.thedarkcitadel.com/config.txt
3-http://www.thedarkcitadel.com/cpuinfo.txt
Comment 1 john stultz 2010-06-28 23:09:06 UTC
So what was the last kernel version that worked properly for you?
Comment 2 overlordq 2010-06-29 01:29:55 UTC
# umount /foot/mouth

Seems I can get 2.6.31.13 to work, but 2.6.34 is a no go.
Comment 3 john stultz 2010-06-29 01:51:22 UTC
Jeremy: Any thoughts on this one?
Comment 4 Jeremy Fitzhardinge 2010-06-29 09:09:43 UTC
That's interesting.  I'm not sure what timebase uptime is computed with.  It sounds like it may be using something like the scheduler clock, which under Xen does not include time stolen by other domains.

How busy is the domain in question, and how busy is the overall system?  Hm, but according to /proc/stat, there's no stolen time on any cpu (second last column, I think), so that suggests the rest of the system is very idle.

Has the uptime computation changed between .31 and .34?
Comment 5 overlordq 2010-06-30 07:54:34 UTC
The domain is effectively idle, very low traffic, not sure on the overall system, I'd have to check.

The problem shows up more the longer the box is up. Like right now only
root        13 16857174  0.0  0     0 ?        S    Jun29 21067849:29 [events/2]
out of the kernel threads is showing up wrong, other process tend toward this the longer its up. 

This weekend I'll double check .31 to see if leaving it up longer manifests problems and try to narrow down a specific revision when I have time to recompile a lot of kernels.
Comment 6 Jeremy Fitzhardinge 2010-08-30 17:37:00 UTC
Recent mainline and stable kernels have a fix to convert Xen's sched_clock to use plain wallclock time.  Do they still exhibit this problem?

Thanks,
    J
Comment 7 overlordq 2010-09-01 15:19:06 UTC
I'll test a new stock kernel and see if the problem still occurs.  In the mean time I have been using a proposed patch[1] from the xen-devel lists which also mitigates the problem.

1 - http://lists.xensource.com/archives/html/xen-devel/2010-07/msg00738.html
Comment 8 Jeremy Fitzhardinge 2010-09-01 16:11:43 UTC
OK, that fixes the symptom that the upstream patch fixes properly.
Comment 9 overlordq 2010-09-01 16:47:37 UTC
2.6.36-rc3 at a glance appears to be working correctly.

Note You need to log in before you can comment on or make changes to this bug.