Bug 120151
Summary: | top/htop no longer show the CPU usage by process | ||
---|---|---|---|
Product: | Other | Reporter: | Benjamin Robin (benjarobin+kernel) |
Component: | Other | Assignee: | process_other |
Status: | RESOLVED INVALID | ||
Severity: | high | CC: | benjarobin+kernel, kernel, tglx |
Priority: | P1 | ||
Hardware: | x86-64 | ||
OS: | Linux | ||
Kernel Version: | 4.6.2 | Subsystem: | |
Regression: | Yes | Bisected commit-id: | |
Attachments: | bisect log, cpuinfo, Script to automate test, Dmesg with bug (linux lts 4.4.16) | ||
Created attachment 219791 [details]
cpuinfo
Created attachment 219871 [details]
Script to automate test
I am using this script to check the reporting of CPU usage per process.
I did find a couple of weird things:
* After an hour or less (I have no idea exactly), top/ps start working
* I have exactly the same problem with the LTS branch 4.4.14
* With 4.5.4 I cannot reproduce the problem just after booting
So is it a hardware problem? The bisect looks coherent, but since I can reproduce the problem with the LTS branch, I do not really understand it...

Looks like exactly the same problem that was reported by Vladimir Panteleev: https://lkml.org/lkml/2016/7/3/103

Hi Benjamin, I'm curious, do you also have the responsiveness problem I described in my LKML bug report? One thing I noticed in common is that you also use an Intel CPU with 6 cores, perhaps that has something to do with it.

I do have some responsiveness problems with Firefox: it takes much longer to start, and the interface freezes...

I included a bash command in my LKML email, perhaps you could try that? Note that on my machine it makes it almost completely unresponsive, so you may need to use Magic SysRq or physically reset your PC.

I ran the test with the kernel 4.6.3-1-ARCH.
The bash command:
for N in $(seq $(nproc)) ; do while true ; do : ; done & done
Before launching the bash command I checked whether top was working:
* Top was working normally; I launched the bash command anyway => No freeze, the PC can still be used normally (top reports all CPUs at ~100%)
* I rebooted the PC; this time top was not working normally. After running the bash command => Freeze, cannot even switch to a tty

Interesting; I also noticed that top would start working again after a few hours, but couldn't quantify it, so I omitted it from my report. It's interesting to know that you can reproduce both parts of my problem (which means it's not an isolated case), and that the performance problem is "fixed" by waiting some time together with the CPU accounting problem.
(In reply to Benjamin Robin from comment #3)
> I did find a couple of weird things:
> * I have exactly the same problem with the LTS branch 4.4.14

I couldn't reproduce it after building 4.4.14 from source. Can you confirm?

> * With 4.5.4 I cannot reproduce the problem just after booting

I don't think I've ever noticed this problem *appear* after not manifesting straight after booting. How long does it typically take to appear?

Created attachment 227021 [details]
Dmesg with bug (linux lts 4.4.16)
See https://lkml.org/lkml/2016/7/5/97

(In reply to Vladimir Panteleev from comment #10)
> I couldn't reproduce it after building 4.4.14 from source. Can you confirm?

I have not tried... Sorry, I was very busy (and I am still very busy)

> I don't think I've ever noticed this problem *appear* after not manifesting
> straight after booting. How long does it typically take to appear?

Sorry, I misspoke: I never see this problem appear after running normally. I only see it disappear...

I also tried without updating the microcode, but I still have the same error in dmesg and the same problem...

I've been unavailable (vacation + work) as well. One thing I noticed, though, is that the problem only happens after a reboot, but never after a cold boot (power off then power on). This seems to indicate that the root cause is a hardware or firmware bug. This also provides a simple workaround. Does the same happen for you?

(In reply to Vladimir Panteleev from comment #13)
> Does the same happen for you?

OK, now I am puzzled... It looks like I can only reproduce it after a reboot with 4.4.16 or 4.6.4 => After a cold boot, even with 4.6, it works.
But how did I first see this problem? I almost never reboot, unless I update the kernel... I have a doubt; well, the future will tell us.
Why did both of us find the same commit? Why can I reproduce it with the LTS kernel, which doesn't contain this commit?
Or is this just a change in the kernel that reveals a hardware bug that is hidden in a specific configuration?

I updated the BIOS of my motherboard (Gigabyte X99-UD4-CF) from version F12 to F22: http://www.gigabyte.com/products/product-page.aspx?pid=5123&dl=1#bios
and I noticed the following changes:
* TSC clock initialization now succeeds
* The tsc clock is now listed in /sys/devices/system/clocksource/clocksource0/available_clocksource
* The tsc clock is used instead of hpet (/sys/devices/system/clocksource/clocksource0/current_clocksource)
* The bug no longer reproduces after a reboot (needs more testing)

I no longer reproduce the problem (for now), even if I boot with "clocksource=hpet"
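
The clocksource state mentioned in this comment can be read from a shell. The following is only a minimal sketch, assuming the sysfs paths quoted above; the function names and the CLOCKSOURCE_DIR override are made up for illustration and are not part of any tool.

```shell
#!/bin/sh
# Sysfs directory quoted in the report; CLOCKSOURCE_DIR is a hypothetical
# override so the functions can be pointed at another directory.
CLOCKSOURCE_DIR="${CLOCKSOURCE_DIR:-/sys/devices/system/clocksource/clocksource0}"

# Print the clocksource currently in use, or "unknown" if the file is unreadable.
current_clocksource() {
    if [ -r "$CLOCKSOURCE_DIR/current_clocksource" ]; then
        cat "$CLOCKSOURCE_DIR/current_clocksource"
    else
        echo "unknown"
    fi
}

# Succeed if the clocksource named in $1 appears in available_clocksource.
clocksource_available() {
    [ -r "$CLOCKSOURCE_DIR/available_clocksource" ] || return 1
    for name in $(cat "$CLOCKSOURCE_DIR/available_clocksource"); do
        [ "$name" = "$1" ] && return 0
    done
    return 1
}

echo "current: $(current_clocksource)"
if clocksource_available tsc; then
    echo "tsc: available"
else
    echo "tsc: not listed"
fi
```

Running it once before and once after a BIOS update (or after a reboot versus a cold boot) would show whether tsc is listed and which clocksource is in use, which is the difference described in the comment.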
Created attachment 219781 [details]
bisect log

After updating from 4.5.4 to 4.6.2, top and htop do not work properly. The global CPU usage still works, but the per-process CPU usage no longer does...

I test it by running the following command:
stress -c 2 -t 30 &
followed by top or htop...

I did a git bisect between the tags v4.5 and v4.6, and found this bad commit:
# first bad commit: [1cf4f629d9d246519a1e76c021806f2a51ddba4d] cpu/hotplug: Move online calls to hotplugged cpu

I was not able to revert it. I tried to revert "710d60c Merge branch 'smp-hotplug-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip", but without success.
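
One way to cross-check what top/htop display is to read the per-process counters directly. The following is a minimal sketch, assuming those tools derive per-process usage from the utime and stime fields (fields 14 and 15) of /proc/<pid>/stat. The function name is hypothetical, and this is not the attached test script.

```shell
#!/bin/sh
# Sum utime + stime (in clock ticks) from one /proc/<pid>/stat line given as $1.
cpu_ticks_from_stat() {
    # Drop "pid (comm) "; comm may itself contain spaces or ")".
    rest=${1##*) }
    # Re-split the remainder: $1 is now field 3 (state), so
    # field 14 (utime) is ${12} and field 15 (stime) is ${13}.
    set -- $rest
    echo $(( ${12} + ${13} ))
}

# When a PID is passed as the first argument, sample it twice, one second apart.
if [ -n "${1:-}" ]; then
    before=$(cpu_ticks_from_stat "$(cat "/proc/$1/stat")")
    sleep 1
    after=$(cpu_ticks_from_stat "$(cat "/proc/$1/stat")")
    echo "pid $1 used $((after - before)) ticks in 1 s"
fi
```

For example, after starting `stress -c 2 -t 30 &`, the script could be given the PID of one stress worker. If the kernel-side accounting itself is affected, the delta would be expected to stay near 0 even though the worker is busy; the thread does not confirm this.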