Bug 120151

Summary: top/htop no longer show the CPU usage by process
Product: Other Reporter: Benjamin Robin (benjarobin+kernel)
Component: Other Assignee: process_other
Status: RESOLVED INVALID    
Severity: high CC: benjarobin+kernel, kernel, tglx
Priority: P1    
Hardware: x86-64   
OS: Linux   
Kernel Version: 4.6.2 Subsystem:
Regression: Yes Bisected commit-id:
Attachments: bisect log
cpuinfo
Script to automate test
Dmesg with bug (linux lts 4.4.16)

Description Benjamin Robin 2016-06-13 19:22:53 UTC
Created attachment 219781
bisect log

After updating from 4.5.4 to 4.6.2, top and htop do not work properly.
The global CPU usage is still reported correctly, but the per-process CPU usage no longer is...

I test it by running the following command: stress -c 2 -t 30 &
Followed by top or htop...
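
For context: top takes its global figure from the aggregate "cpu" line of /proc/stat, while per-process figures come from /proc/<pid>/stat, which may be why one can look fine while the other does not. Below is a minimal sketch of the global side only, assuming the usual column order (user nice system idle iowait irq softirq steal). The function name stat_busy_total is made up for illustration and is not part of any tool mentioned here.

```shell
# Print "busy total" jiffies from an aggregate "cpu ..." line of /proc/stat.
# Only the first eight counters are summed; guest time is already
# included in user/nice, so it is left out to avoid double counting.
stat_busy_total() {
    # Split the line on whitespace; $1 becomes the "cpu" label.
    set -- $1
    shift
    idle=$(( $4 + ${5:-0} ))    # idle + iowait
    total=$(( $1 + $2 + $3 + $4 + ${5:-0} + ${6:-0} + ${7:-0} + ${8:-0} ))
    echo "$(( total - idle )) $total"
}

# Example (reads the live counters):
#   stat_busy_total "$(head -n 1 /proc/stat)"
```

Sampling this twice and comparing the deltas gives the global busy share over the interval.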

I did a git bisect between the tags v4.5 and v4.6, and it pointed to this commit:

# first bad commit: [1cf4f629d9d246519a1e76c021806f2a51ddba4d] cpu/hotplug: Move online calls to hotplugged cpu

I was not able to revert it. I also tried to revert the merge "710d60c Merge branch 'smp-hotplug-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip", but without success.
Comment 1 Benjamin Robin 2016-06-13 19:23:15 UTC
Created attachment 219791
cpuinfo
Comment 2 Benjamin Robin 2016-06-14 09:58:46 UTC
Created attachment 219871
Script to automate test

I am using this script to check the reporting of CPU usage per process
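
The attached script is not reproduced here. As a rough idea of what such a check can look like, the sketch below samples utime + stime (fields 14 and 15 of /proc/<pid>/stat) twice and turns the difference into a percentage of one CPU. The function names stat_cpu_ticks and pid_cpu_percent are hypothetical, and the interval is assumed to be a whole number of seconds.

```shell
# Sum utime and stime, in clock ticks, from one /proc/<pid>/stat line.
# The comm field may contain spaces, so drop everything up to the last ") ".
stat_cpu_ticks() {
    rest=${1##*") "}
    # After stripping, field 3 (state) is $1, so utime/stime are $12/$13.
    set -- $rest
    echo $(( ${12} + ${13} ))
}

# Print the CPU usage of a PID over an interval, as a percentage of one CPU.
pid_cpu_percent() {
    pid=$1
    interval=${2:-1}            # whole seconds (assumption of this sketch)
    hz=$(getconf CLK_TCK)       # clock ticks per second
    before=$(stat_cpu_ticks "$(cat "/proc/$pid/stat")")
    sleep "$interval"
    after=$(stat_cpu_ticks "$(cat "/proc/$pid/stat")")
    echo $(( (after - before) * 100 / (hz * interval) ))
}

# Example: stress -c 1 -t 30 & sleep 1; pid_cpu_percent "$(pgrep -n stress)" 5
```

A busy stress worker should report close to 100 here. A value stuck near 0 while the global usage is high would match the symptom described above.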
Comment 3 Benjamin Robin 2016-07-03 18:01:10 UTC
I did find a couple of weird things:
 * After an hour or less (I have no idea exactly), top/ps start working again
 * I have exactly the same problem with the LTS branch 4.4.14
 * With 4.5.4 I cannot reproduce the problem just after booting

So is it a hardware problem? The bisect looks coherent, but since I can reproduce the problem with the LTS branch, I do not really understand it...
Comment 4 Benjamin Robin 2016-07-03 18:03:46 UTC
This looks like exactly the same problem that was reported by Vladimir Panteleev: https://lkml.org/lkml/2016/7/3/103
Comment 5 Vladimir Panteleev 2016-07-03 18:30:49 UTC
Hi Benjamin,

I'm curious, do you also have the responsiveness problem I described in my LKML bug report?

One thing I noticed in common is that you also use an Intel CPU with 6 cores, perhaps that has something to do with it.
Comment 6 Benjamin Robin 2016-07-03 18:32:54 UTC
I do have some responsiveness problems with Firefox... It takes much longer to start and the interface freezes...
Comment 7 Vladimir Panteleev 2016-07-03 18:35:46 UTC
I included a bash command in my LKML email, perhaps you could try that? Note that on my machine it makes it almost completely unresponsive, so you may need to use Magic SysRq or physically reset your PC.
Comment 8 Benjamin Robin 2016-07-03 18:56:14 UTC
I did run the test with the kernel 4.6.3-1-ARCH

The bash command: for N in $(seq $(nproc)) ; do while true ; do : ; done & done

First run: before launching the bash command I checked whether top was misbehaving. Top was working normally, but I launched the bash command anyway => no freeze, the PC can still be used normally (top reports all CPUs at ~100%).
Second run: I rebooted the PC, and this time top was not working normally. After running the bash command => freeze, I cannot even switch to a tty.
Comment 9 Vladimir Panteleev 2016-07-04 20:19:09 UTC
Interesting; I also noticed that top would start working again after a few hours, but couldn't quantify it, so I omitted it from my report. It's interesting to know that you can reproduce both parts of my problem (which means it's not an isolated case), and that the performance problem is "fixed" by waiting some time together with the CPU accounting problem.
Comment 10 Vladimir Panteleev 2016-07-04 21:07:36 UTC
(In reply to Benjamin Robin from comment #3)
> I did find a couple of weird things:
>  * I do have exactly the same problem with the LTS branch 4.4.14

I couldn't reproduce it after building 4.4.14 from source. Can you confirm?

>  * With 4.5.4 I cannot reproduce the problem just after booting

I don't think I've ever noticed this problem *appear* after not manifesting straight after booting. How long does it typically take to appear?
Comment 11 Benjamin Robin 2016-07-31 17:44:18 UTC
Created attachment 227021
Dmesg with bug (linux lts 4.4.16)

See https://lkml.org/lkml/2016/7/5/97
Comment 12 Benjamin Robin 2016-07-31 17:57:41 UTC
(In reply to Vladimir Panteleev from comment #10)
> I couldn't reproduce it after building 4.4.14 from source. Can you confirm?
I did not try... Sorry, I was very busy (and I am still very busy)

> I don't think I've ever noticed this problem *appear* after not manifesting
> straight after booting. How long does it typically take to appear?
Sorry, I misspoke: I have never seen this problem appear after the system was running normally. I have only seen it disappear...

I also tried without updating the microcode, but I still have the same error in dmesg and the same problem...
Comment 13 Vladimir Panteleev 2016-07-31 19:11:01 UTC
I've been unavailable (vacation + work) as well. One thing I noticed, though, is that the problem only happens after a reboot, but never after a cold boot (power off then power on). This seems to indicate that the root cause is a hardware or firmware bug. This also provides a simple workaround.

Does the same happen for you?
Comment 14 Benjamin Robin 2016-07-31 20:36:38 UTC
(In reply to Vladimir Panteleev from comment #13)
> Does the same happen for you?
Ok, now I am puzzled... It looks like I can only reproduce it after a reboot, with 4.4.16 or 4.6.4 => after a cold boot, even with 4.6, it works.

But then how did I run into this problem? I almost never reboot, unless I update the kernel... I have my doubts; well, the future will tell us.

Why did both of us find the same commit? Why can I reproduce it with LTS, which doesn't contain this commit? Or is this just a change in the kernel that reveals a latent hardware bug in one particular configuration?
Comment 15 Benjamin Robin 2016-07-31 21:54:39 UTC
I updated the BIOS of my motherboard (Gigabyte X99-UD4-CF) from version F12 to F22: http://www.gigabyte.com/products/product-page.aspx?pid=5123&dl=1#bios and I noticed the following changes:
 * TSC clock initialization now succeeds
 * The tsc clock is now listed in /sys/devices/system/clocksource/clocksource0/available_clocksource
 * The tsc clock is used instead of hpet (/sys/devices/system/clocksource/clocksource0/current_clocksource)
 * I can no longer reproduce the bug after a reboot (needs more testing)
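
The two sysfs files above can be read directly to see which clocksource the kernel picked. A minimal sketch follows. The function name clocksource_report is made up for illustration, and the directory argument exists only so the function can be pointed at a copy of the files.

```shell
# Report the active clocksource and whether tsc is among the available ones.
clocksource_report() {
    dir=${1:-/sys/devices/system/clocksource/clocksource0}
    current=$(cat "$dir/current_clocksource")
    available=$(cat "$dir/available_clocksource")
    echo "current=$current"
    # Pad with spaces so "tsc" only matches as a whole word.
    case " $available " in
        *" tsc "*) echo "tsc=available" ;;
        *)         echo "tsc=missing" ;;
    esac
}

# Example: clocksource_report
```

On a machine in the state described before the BIOS update, this would be expected to print current=hpet and tsc=missing.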
Comment 16 Benjamin Robin 2016-08-01 11:45:23 UTC
I can no longer reproduce the problem (for now), even when I boot with "clocksource=hpet"