Bug 207245 - huge CPU temperature increase from 5.2 to 5.5 ... and when using intel_pstate
Summary: huge CPU temperature increase from 5.2 to 5.5 ... and when using intel_pstate
Status: NEW
Alias: None
Product: Platform Specific/Hardware
Classification: Unclassified
Component: x86-64 (show other bugs)
Hardware: x86-64 Linux
: P1 blocking
Assignee: platform_x86_64@kernel-bugs.osdl.org
URL: https://lore.kernel.org/lkml/ce809769...
Keywords:
Depends on:
Blocks:
 
Reported: 2020-04-14 22:31 UTC by Christoph Anton Mitterer
Modified: 2020-07-06 17:52 UTC (History)
0 users

See Also:
Kernel Version: anything beyond 5.2.x
Subsystem:
Regression: Yes
Bisected commit-id:


Attachments

Description Christoph Anton Mitterer 2020-04-14 22:31:08 UTC
Since no one seems to care at lkml (https://lore.kernel.org/lkml/ce8097694ddfab616616f8f81521495d99c74416.camel@scientia.net/T/#re454505a9f78ec90a0e7d7810c42d103c32fd4e6), adding the following severe regression to bugzilla for the records:


Hey.


For several months now, I've been chasing a tremendous heat increase
(CPU/GPU) respectively power usage on my notebook.

It basically started after upgrading from 5.2 to 5.3, at least I
haven't explicitly noted any grave changes from before 5.2 to 5.2.
The issue (actually there might be several) persists until at least 5.4
and 5.5.

Things are so bad, that when just type this mail,... that I can hear
the fan go up considerably (and temps up to 90°C) just by typing the
mail in the mail client (while it goes back to - still insane - 65°C
idle, when not typing... ok idle here(!) is with firefox running).
Similar things when I scroll through a terminal window, Alt-Tab cycle
between windows, and so on.


Testing is a bit difficult for me, as I couldn't come up with an easy
way to reproducibly generate real world load (like this typing, or
scrolling terminal windows), yet I tried to do an extensive test
series, which I think will illustrate some things.


Not really sure what the normal average or idle temps of that CPU are,
but I guess getting at average >80°C by just typing shouldn't be the
case.




1) Previous tests
*****************
When first searching for the reason of the temperature increase, I've
had opened several tickets:
https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=945055
https://lore.kernel.org/lkml/d05aba2742ae42783788c954e2a380e7fcb10830.camel@scientia.net/

Finally to find (by coincidence):
https://gitlab.freedesktop.org/drm/intel/issues/614
when reporting:
https://gitlab.freedesktop.org/drm/intel/-/issues/953
myself.

At first I thought #614 would be the bug, but the fix for that went
into 5.5-rc, and in fact, with 5.5.x I do see the GPU entering RC6
sleep states again, yet the temperature of my system is still crazy.




2) Testing Environment
**********************
(for these new tests here)
- Fujitsu Lifebook U757
- most recent BIOS version (1.25) in the tests below (I've had used an
  older one in previous tests from the links)
- 32GB memory, some Sandisk SSD
- Intel(R) Core(TM) i7-7600U CPU @ 2.80GHz
- microcode: sig=0x806e9, pf=0x80, revision=0xca
- Debian sid, all packages (unless some totally unrelated stuff at
  their newest versions in unstable)
- all used kernels are stock kernels from Debian
- I do use full dm-crypt encryption of the system, but that shouldn't
  be a cause for the problems, I guess.
- in my /etc/sysfs.conf I have:
  devices/system/cpu/intel_pstate/no_turbo = 1
  basically since I have that laptop... with turbo enabled I always
got 
  these:
  Apr  5 18:27:07 heisenberg kernel: [ 9884.510420] mce: CPU3: Package temperature above threshold, cpu clock throttled (total events = 2609)
  Apr  5 18:27:07 heisenberg kernel: [ 9884.510422] mce: CPU1: Package temperature above threshold, cpu clock throttled (total events = 2609)
  Apr  5 18:27:07 heisenberg kernel: [ 9884.510465] mce: CPU0: Package temperature above threshold, cpu clock throttled (total events = 2609)
  Apr  5 18:27:07 heisenberg kernel: [ 9884.510467] mce: CPU2: Package temperature above threshold, cpu clock throttled (total events = 2609)
  Apr  5 18:27:07 heisenberg kernel: [ 9884.511427] mce: CPU3: Package temperature/speed normal
  Apr  5 18:27:07 heisenberg kernel: [ 9884.511430] mce: CPU0: Package temperature/speed normal
  Apr  5 18:27:07 heisenberg kernel: [ 9884.511431] mce: CPU1: Package temperature/speed normal
  Apr  5 18:27:07 heisenberg kernel: [ 9884.511436] mce: CPU2: Package temperature/speed normal
  => so for the tests with ipntel_pstate not being disabled, turbo mode
     was always disabled




3) How tests were made
**********************
I've tested with the following combinations:
- kernels 5.2.17 and 5.5.13
- with and without intel_pstate=disable
- with Cinnamon and GNOME Shell in classic mode

For all tests the notebook was placed in the same position and ran with
the same commands for tests, no other major processes (like firefox or
so) were running, just the respective bare desktop environment
(cinnamon or gnome shell classic), cron/anacron were stopped.

I always took temperature measurements with the output from sensors and
CSV output from powertop (which contains all the sleep states and high
energy users).

Temperature and powertop measurements were started at basically the
same time. powertop running for n iterations each 20s.
But since powertop takes a while to start the temperature measurements
are effectively shorter.


a) deep-idle
For these tests I've waited very long (like 5 minutes or more) for the
system to cool down.
Measurements with, e.g.:
export NAME="5.2.17/ipstate-disable/thermald-no/gnome-shell-classic/deep-idle" ; timeout 80 sh -c "while true; do sleep 1; sensors; done | grep °C > ${NAME}.temp"
and
export NAME="5.2.17/ipstate-disable/thermald-no/cinnamon/deep-idle" ; powertop -i 4  --csv=${NAME}.powertop.csv


b) idle
Basically the same as (a), just not waiting so long to cool down.
Effectively I've always produced some load (with the fan and CPU temp
noticeably going up over 65°C), then stopped and waited for a minute.


c) winmove
After waiting a while for the CPU to cool down, starting the
measurement and then moving a terminal window fast and constantly in
circles over the whole screen.

The measurement is split in 3 phases. During warmup and main I've moved
the window like crazy. As soon as the cooldown phase began, I've
stopped that and did nothing more until the measurement finished.

Measurements with, e.g.:
export NAME="5.2.17/ipstate-disable/thermald-no/cinnamon/winmove" ; powertop -i 3 --csv=${NAME}.warmup.powertop.csv ; beep ; powertop -i 4 --csv=${NAME}.main.powertop.csv ; beep ; powertop -i 4 --csv=${NAME}.cooldown.powertop.csv
export NAME="5.2.17/ipstate-disable/thermald-no/gnome-shell-classic/winmove" ; powertop -i 3 --csv=${NAME}.warmup.powertop.csv ; beep ; powertop -i 4 --csv=${NAME}.main.powertop.csv ; beep ; powertop -i 4 --csv=${NAME}.cooldown.powertop.csv


d) verify, verify-data
That measure when running a poorly written shell script of mine, the
shell script would read a list of regular files from find, and tries to
verify the SHA512 sum of that file, potentially stored as XATTR on it.
The script is poorly written, does quite a number of forks, pipes, and
so on, which seemed good for these test.

Difference between verify and verify-data are directories on which I
let find run.

with verify: I did it on /home/, where many files don't have my hash
XATTRs set

with verify-data: I did it on a dir, where basically all files have
them set an such there's more going into actual SHA512 calculation

Measurements with, e.g.:
export NAME="5.2.17/ipstate-disable/thermald-no/gnome-shell-classic/verifyxattr" ; timeout 160 sh -c "while true; do sleep 1; sensors; done | grep °C > ${NAME}.warmup+main.temp" ; beep
export NAME="5.2.17/ipstate-disable/thermald-no/cinnamon/verifyxattr" ; powertop -i 8 --csv=${NAME}.warmup+main.powertop.csv ; bee


e) mpv-gpu-vaapi
Playing back a:
    Stream #0:0: Video: h264 (High), yuv420p(tv, bt709, progressive), 1920x1080 [SAR 1:1 DAR 16:9], 59.94 fps, 59.94 tbr, 1k tbn, 119.88 tbc (default)
video via:
mpv someVideo.mkv -ao=null
in fullscreen.
.config/mpv/mpv.conf has:
script-opts=osc-deadzonesize=0
hwdec=auto

So the video plays back with:
(+) Video --vid=1 (*) (h264 1920x1080 59.940fps)
 (+) Audio --aid=1 --alang=eng (*) (opus 2ch 48000Hz)
Using hardware decoding (vaapi-copy).
AO: [null] 48000Hz stereo 2ch floatp
VO: [gpu] 1920x1080 nv12
AV: 00:00:01 / 00:36:23 (0%) A-V:  0.000

i.e. gpu and vaapi-copy

I've made two phases:
warmup+main: starting the video, going immediately to fullscreen
cooldown: as soon as the beep for it came, stopping mpv

Measurements with, e.g.:
export NAME="5.2.17/ipstate-disable/thermald-no/gnome-shell-classic/mpv-gpu-vaapi" ; timeout 80 sh -c "while true; do sleep 1; sensors; done | grep °C > ${NAME}.warmup+main.temp" ; beep ; timeout 80 sh -c "while true; do sleep 1; sensors; done | grep °C > ${NAME}.cooldown.temp"
export NAME="5.2.17/ipstate-disable/thermald-no/cinnamon/mpv-gpu-vaapi" ; powertop -i 4 --csv=${NAME}.warmup+main.powertop.csv ; beep ; powertop -i 4 --csv=${NAME}.cooldown.powertop.csv


f) unhide-brute
Running the unhide program in brute mode (using the C version, not the
ruby version).
unhide seems to do a lot of forking, which also seems to CPU to go
crazy in terms of temperature (at least in some cases).

Measurements with, e.g.:
export NAME="5.5.13/ipstate-active-hwp/thermald-no/cinnamon/unhide-brute" ; timeout 240 sh -c "while true; do sleep 1; sensors; done | grep °C > ${NAME}.warmup+main.temp" ; beep
export NAME="5.5.13/ipstate-active-hwp/thermald-no/cinnamon/unhide-brute" ; powertop -i 12 --csv=${NAME}.warmup+main.powertop.csv ; beep




4) Results
**********
For ease of use I've placed all my original test files and derived ones
in a git repo:
https://github.com/calestyo/cpu-tests


At first, I'm just looking at the bare temperatures of the Package,
which I've extracted to ./pack-temps for each of the tests.


I use ips = intel_pstate. in the following



a) deep-idle and idle
cinnamon_deep-idle.svg
=> 5.2 with ips=off runs ~2 °C cooler than 5.5 with ips=on
                         ~4 °C cooler than 5.5 with ips=off
                         ~3 °C cooler than 5.2 with ips=on
=> so from hottest to coolest:
   5.5/ips=off 5.2/ips=on  5.5/ips=on  5.2/ips=off

cinnamon_idle.svg
=> at least it's quite obvious that 5.2 with ips=off runs coolest
  
compared to 5.5 with ips=off it's 4-5 °C...
   compared to 5.5 with
ips=on  it's still quite noticeable in the
                             
beginning (could be testing though)
   it's also noticeable cooler than
5.2 with ips=on
=> so from hottest to coolest:
   5.5/ips=off 5.2/ips=on 
5.5/ips=on  5.2/ips=off
   (at least in the end)


gnome-shell-classic_deep-idle.svg
=> I'd say the results are too close, as one could deduce anything
valuable here,... but  5.5 looks better here (and ips=on looks even
better than having it disabled)

gnome-shell-classic_idle.svg
=> similarly here, the numbers are rather close, but 5.5 with ips=on
looks better... could however depend on the testing and the difference
is only ~2°C


=> One also sees, cinnamon runs considerably hotter than gnome-shell-
classic, none of them have any special applets or so running (just task
bar, workspace switcher, clock).



b) winmove.*
cinnamon_winmove.warmup.svg
cinnamon_winmove.main.svg
cinnamon_winmove.cooldown.svg
=> 5.2/ips=off runs considerably cooler than everything else, something
   around 10-15°C?
=> 5.5/ips=on is clearly the worst
=> 5.2/ips=on and 5.5/ips=off are similar, but 5.2 seems still a bit
   better during warmup and main

In this test it seems:
=> ips=off is considerably cooler/better for each kernel
=> 5.2 is considerably better or at least equal than the best of 5.5


gnome-shell-classic_winmove.warmup.svg
gnome-shell-classic_winmove.main.svg
gnome-shell-classic_winmove.cooldown.svg
=> again, g-s-c seems to do much better than cinnamon, but
=> 5.5/ips=off seems worst
   5.5/ips=on  seems slightly best
   5.2/* in the mid-range
=> but again, the numbers are pretty close so this could be just from
   testing



c) mpv-gpu-vaapi
mpv-gpu-vaapi.warmup+main.svg
=> most likely the blue line (5.2/ips=off/cinnamon) is just bogus,
   could redo it if someone needs
difficult to say something,...
=> all end up at 95-100°C, the ones from cinnamon much faster
=> the only thing I'd personally deduce from these is, that hardware
   acceleration has some severe problem in my setup. I'd have expected
   that playing back a video in fullscreen should be not problem at all
   for the GPU

mpv-gpu-vaapi.cooldown.svg
=> the only clear thing, I guess, is that 5.5/ips=off is worst with
cinnamon



d) verifyxattr and verifyxattr-data

cinnamon_verifyxattr.warmup+main.svg
cinnamon_verifyxattr-data.warmup+main.svg
=> the ones with ips=on are considerably worse...
=> interestingly, for verifyxattr 5.2/ips=off is better with
   5.5/ips=off being the 2nd
=> but for verifyxattr-data (which is ought to be more actual SHA512
   computation intensive and less just-forking), it's vice-versa
   and 5.5/ips=off is better than 5.2/ips=off

=> seems like a hint that forking and/or process switches or similar
   things could cause the temperature issues


gnome-shell-classic_verifyxattr.warmup+main.svg
gnome-shell-classic_verifyxattr-data.warmup+main.svg
similarly:
=> the ones with ips=on are considerably worse...
=> interestingly for verifyxattr 5.2 and 5.5 with ips=off
   are more or less the same
=> but for verifyxattr-data 5.5 is noticeably better


=> these differences cannot be directly explained by some GPU issues,
   at least not to my knowledge, since not much graphical output was
   produced



e) unhide-brute
cinnamon_unhide-brute.warmup+main.svg
gnome-shell-classic_unhide-brute.warmup+main.svg
=> both are nearly the same, except that under g-s-c, 5.5/ips=on
   is noticeable worse than under 5.2/ips=on
=> again, ips=off is *much* better than ips=on
=> 5.2 is considerably better than 5.5


=> these differences cannot be directly explained by some GPU issues,
   at least not to my knowledge, since not much graphical output was
   produced




I think overall conclusions are mostly:
Especially when the temperatures vary greatly, then
- 5.2 is much better than 5.5
- ips=off is much better or at least similar to ips=on





5) the powertop files
*********************
So far I've only taken a glance on them in trying to deduce anything
meaningful.
My hope would have been that some experts here have more experience on
reading them. ;-)


Looking for example at the files from my unhide-brute tests, comparing
5.2/ips=on/cinnamon with 5.5/ips=on/cinnamon, it seemed sometimes that:
Timer;tick_sched_timer
kWork;intel_atomic_commit_work
kWork;free_work

might be offenders... but that's all not so obvious (at least to me).




6) Observations / Other
***********************
a) One thing I've noted sometimes, but not always:
When the system was under "some" load that caused extreme
temperatures... even when I stopped that load, temperatures didn't
always go back.
I mean it's clear that cooling takes a while, but sometimes things went
on for 5 mins or more.


b) Sometimes it might have seemed, that putting the system to suspend
cured the symptoms for a while,... but not always and I haven't tested
this in both kernels and ips=* variations.


c) While writing this email I'm in 5.5/ips=off/g-s-c ...
   (not sure whether ips=on would have been much better)
   The short time idle-temperature is already pretty bad (~65°C)
   (firefox, which seems to be a bad temperature offender, runs
   though).
   Closing FF and the idle temp goes to around ~58-59 (would probably
   go lower if I wait for longer)
   But when I now start moving around the mouse pointer, everything
   stays the roughly same... but I had also situations, where just by
   moving the pointer, temperatures when to 80°C or higher,... stopped
   moving the pointer, and the fell again. 
 
   When I know press&hold a key in a gnome-terminal window, say
   constantly writing "n" to it... nothing changes.

   When I do the same in Evolution's mail compose window...
   temperatures go up to 74°C.
   Correspondingly I see
   /usr/lib/x86_64-linux-gnu/webkit2gtk-4.0/WebKitWebProcess
   going through the roof in powertop.

   Now one could argue there's just something fishy in
   Evolution/WebKit, but from what I remember it's by far not that bad
   (if at all) under 5.2.


d) Similarly, if I just move around the pointer via the touchpad:
   1.71 W     13,2 ms/s     436,3        Interrupt      PS/2 Touchpad / Keyboard / Mouse
   becomes the top power consumer with quite a lot it seems?


e) At least until 5.5 (which fixed the i915 GPU doesn't go to RC6
   issue), I quite often saw the temperatures go crazy, while top
   didn't show that much CPU utilisation.
   Well it's quite clear if the issue was only in the GPU, but even
   with that fixed it still seemed at least sometimes during my tests,
   that I saw extreme temperatures while top didn't show even close to
   100% CPU utilisation.

f) right now, eventstat shows something like this:
Event/s PID      %CPU  PR  NI Task               Init Function                
   81.92   10050   2.1   0   0 WebKitWebProces    hrtimer_wakeup               
   51.95    1600   0.1   0   0 Xorg               it_real_fn                   
   49.95    1600   0.1   0   0 Xorg               hrtimer_wakeup               
   46.95   10050   2.1   0   0 WebKitWebProces    tick_sched_timer             
   32.97    3184   0.7   0   0 gnome-terminal-    hrtimer_wakeup               
   28.97   63710   0.0   0   0 [kworker/0:2-event intel_uncore_fw_release_timer
   11.99    2831   0.1   0   0 gnome-shell        hrtimer_wakeup               
    8.99    3417   0.1   0   0 evolution          hrtimer_wakeup               
    5.00    1600   0.1   0   0 Xorg               tick_sched_timer             
    5.00   71575   0.3   0   0 top                it_real_fn                   
    4.00    3184   0.7   0   0 gnome-terminal-    tick_sched_timer             
    4.00   59885   0.1   0   0 diodon             hrtimer_wakeup               
    3.00   71575   0.3   0   0 top                tick_sched_timer             
    2.00   72584   0.0   0   0 <unknown>          tick_sched_timer             
    2.00    2831   0.1   0   0 gnome-shell        timerfd_tmrproc              
    1.00   72585   0.0   0   0 sleep              hrtimer_wakeup               
    1.00   66991   0.0   0 -20 [kworker/u9:2-i915 tick_sched_timer             
    1.00     750   0.0   0   0 haveged            hrtimer_wakeup               

that hrtimer_wapeup and tick_sched_timer appear quite often in the top list.
Anything that changed there after 5.2?

Attaching with strace to that WebKitWebProcess:
$ strace -p 10050
strace: Process 10050 attached
restart_syscall(<... resuming interrupted read ...>) = 0
recvmsg(3, {msg_namelen=0}, 0)          = -1 EAGAIN (Resource temporarily unavailable)
recvmsg(3, {msg_namelen=0}, 0)          = -1 EAGAIN (Resource temporarily unavailable)
poll([{fd=3, events=POLLIN}, {fd=4, events=POLLIN}, {fd=5, events=POLLIN}], 3, 0) = 0 (Timeout)
recvmsg(3, {msg_namelen=0}, 0)          = -1 EAGAIN (Resource temporarily unavailable)
recvmsg(3, {msg_namelen=0}, 0)          = -1 EAGAIN (Resource temporarily unavailable)
poll([{fd=3, events=POLLIN}, {fd=4, events=POLLIN}, {fd=5, events=POLLIN}], 3, 0) = 0 (Timeout)
recvmsg(3, {msg_namelen=0}, 0)          = -1 EAGAIN (Resource temporarily unavailable)
recvmsg(3, {msg_namelen=0}, 0)          = -1 EAGAIN (Resource temporarily unavailable)
poll([{fd=3, events=POLLIN}, {fd=4, events=POLLIN}, {fd=5, events=POLLIN}], 3, 0) = 0 (Timeout)
recvmsg(3, {msg_namelen=0}, 0)          = -1 EAGAIN (Resource temporarily unavailable)
recvmsg(3, {msg_namelen=0}, 0)          = -1 EAGAIN (Resource temporarily unavailable)
poll([{fd=3, events=POLLIN}, {fd=4, events=POLLIN}, {fd=5, events=POLLIN}], 3, 0) = 0 (Timeout)
recvmsg(3, {msg_namelen=0}, 0)          = -1 EAGAIN (Resource temporarily unavailable)
recvmsg(3, {msg_namelen=0}, 0)          = -1 EAGAIN (Resource temporarily unavailable)
poll([{fd=3, events=POLLIN}, {fd=4, events=POLLIN}, {fd=5, events=POLLIN}], 3, 14) = 0 (Timeout)

in a very fast "loop"...
Interestingly, printing that out, which is really! fast... doesn't seem
to increase the temperature much.

When I press&hold a key in evolution strace output shows a lot of
these:
recvmsg(3, {msg_namelen=0}, 0)          = -1 EAGAIN (Resource temporarily unavailable)
recvmsg(3, {msg_namelen=0}, 0)          = -1 EAGAIN (Resource temporarily unavailable)
poll([{fd=3, events=POLLIN}, {fd=4, events=POLLIN}, {fd=5, events=POLLIN}], 3, 0) = 1 ([{fd=4, revents=POLLIN}])
read(4, "\2\0\0\0\0\0\0\0", 16)         = 8
write(4, "\1\0\0\0\0\0\0\0", 8)         = 8
write(15, "\1\0\0\0\0\0\0\0", 8)        = 8
futex(0x7f74d8000b60, FUTEX_WAKE_PRIVATE, 1) = 1
write(15, "\1\0\0\0\0\0\0\0", 8)        = 8
futex(0x7f74d8000b60, FUTEX_WAKE_PRIVATE, 1) = 1
write(15, "\1\0\0\0\0\0\0\0", 8)        = 8
futex(0x7f74d8000b60, FUTEX_WAKE_PRIVATE, 1) = 1
write(15, "\1\0\0\0\0\0\0\0", 8)        = 8
futex(0x7f74d8000b60, FUTEX_WAKE_PRIVATE, 1) = 1
write(10, "\1\0\0\0\0\0\0\0", 8)        = 8
futex(0x55c670f5d690, FUTEX_WAKE_PRIVATE, 1) = 1
futex(0x7f74c0005ab0, FUTEX_WAKE_PRIVATE, 1) = 1
futex(0x55c670eff418, FUTEX_WAKE_PRIVATE, 1) = 1
madvise(0x7f74af5ea000, 4096, MADV_NORMAL) = 0
madvise(0x7f74af5ea000, 4096, MADV_DODUMP) = 0
write(15, "\1\0\0\0\0\0\0\0", 8)        = 8
futex(0x7f74d8000b60, FUTEX_WAKE_PRIVATE, 1) = 1
write(15, "\1\0\0\0\0\0\0\0", 8)        = 8
futex(0x7f74d8000b60, FUTEX_WAKE_PRIVATE, 1) = 1
write(10, "\1\0\0\0\0\0\0\0", 8)        = 8
futex(0x55c670f5d690, FUTEX_WAKE_PRIVATE, 1) = 1
futex(0x7f74c0005ab0, FUTEX_WAKE_PRIVATE, 1) = 1
futex(0x55c670eff418, FUTEX_WAKE_PRIVATE, 1) = 1
write(15, "\1\0\0\0\0\0\0\0", 8)        = 8
futex(0x7f74d8000b60, FUTEX_WAKE_PRIVATE, 1) = 1
write(15, "\1\0\0\0\0\0\0\0", 8)        = 8
futex(0x7f74d8000b60, FUTEX_WAKE_PRIVATE, 1) = 1
write(10, "\1\0\0\0\0\0\0\0", 8)        = 8
futex(0x55c670f5d690, FUTEX_WAKE_PRIVATE, 1) = 1
futex(0x7f74c0005ab0, FUTEX_WAKE_PRIVATE, 1) = 1
futex(0x55c670eff418, FUTEX_WAKE_PRIVATE, 1) = 1
write(10, "\1\0\0\0\0\0\0\0", 8)        = 8
futex(0x55c670f5d690, FUTEX_WAKE_PRIVATE, 1) = 1
futex(0x7f74c0005ab0, FUTEX_WAKE_PRIVATE, 1) = 1
futex(0x55c670eff418, FUTEX_WAKE_PRIVATE, 1) = 1
write(15, "\1\0\0\0\0\0\0\0", 8)        = 8
futex(0x7f74d8000b60, FUTEX_WAKE_PRIVATE, 1) = 1
recvmsg(3, {msg_namelen=0}, 0)          = -1 EAGAIN (Resource temporarily unavailable)
recvmsg(3, {msg_namelen=0}, 0)          = -1 EAGAIN (Resource temporarily unavailable)
poll([{fd=3, events=POLLIN}, {fd=4, events=POLLIN}, {fd=5, events=POLLIN}], 3, 0) = 1 ([{fd=4, revents=POLLIN}])
read(4, "\2\0\0\0\0\0\0\0", 16)         = 8
write(4, "\1\0\0\0\0\0\0\0", 8)         = 8
write(15, "\1\0\0\0\0\0\0\0", 8)        = 8
futex(0x7f74d8000b60, FUTEX_WAKE_PRIVATE, 1) = 1
memfd_create("WebKitSharedMemory", MFD_CLOEXEC) = 23
ftruncate(23, 18585600)                 = 0
mmap(NULL, 18585600, PROT_READ|PROT_WRITE, MAP_SHARED, 23, 0) = 0x7f74ae346000
fcntl(23, F_DUPFD_CLOEXEC, 0)           = 18
munmap(0x7f74ae346000, 18585600)        = 0
close(23)                               = 0
write(15, "\1\0\0\0\0\0\0\0", 8)        = 8
futex(0x7f74d8000b60, FUTEX_WAKE_PRIVATE, 1) = 1
write(15, "\1\0\0\0\0\0\0\0", 8)        = 8
futex(0x7f74d8000b60, FUTEX_WAKE_PRIVATE, 1) = 1
write(15, "\1\0\0\0\0\0\0\0", 8)        = 8
futex(0x7f74d8000b60, FUTEX_WAKE_PRIVATE, 1) = 1
write(15, "\1\0\0\0\0\0\0\0", 8)        = 8
futex(0x7f74d8000b60, FUTEX_WAKE_PRIVATE, 1) = 1
write(15, "\1\0\0\0\0\0\0\0", 8)        = 8
futex(0x7f74d8000b60, FUTEX_WAKE_PRIVATE, 1) = 1
write(15, "\1\0\0\0\0\0\0\0", 8)        = 8
futex(0x7f74d8000b60, FUTEX_WAKE_PRIVATE, 1) = 1
write(15, "\1\0\0\0\0\0\0\0", 8)        = 8
futex(0x7f74d8000b60, FUTEX_WAKE_PRIVATE, 1) = 1
write(15, "\1\0\0\0\0\0\0\0", 8)        = 8
futex(0x7f74d8000b60, FUTEX_WAKE_PRIVATE, 1) = 1
write(10, "\1\0\0\0\0\0\0\0", 8)        = 8
futex(0x55c670f5d690, FUTEX_WAKE_PRIVATE, 1) = 1
futex(0x7f74c0005ab0, FUTEX_WAKE_PRIVATE, 1) = 1
futex(0x55c670eff418, FUTEX_WAKE_PRIVATE, 1) = 1
madvise(0x7f74af5e9000, 4096, MADV_NORMAL) = 0
madvise(0x7f74af5e9000, 4096, MADV_DODUMP) = 0
write(15, "\1\0\0\0\0\0\0\0", 8)        = 8
futex(0x7f74d8000b60, FUTEX_WAKE_PRIVATE, 1) = 1
write(15, "\1\0\0\0\0\0\0\0", 8)        = 8
futex(0x7f74d8000b60, FUTEX_WAKE_PRIVATE, 1) = 1
write(10, "\1\0\0\0\0\0\0\0", 8)        = 8
futex(0x55c670f5d690, FUTEX_WAKE_PRIVATE, 1) = 1
futex(0x7f74c0005ab0, FUTEX_WAKE_PRIVATE, 1) = 1
futex(0x55c670eff418, FUTEX_WAKE_PRIVATE, 1) = 1
write(15, "\1\0\0\0\0\0\0\0", 8)        = 8
futex(0x7f74d8000b60, FUTEX_WAKE_PRIVATE, 1) = 1
write(15, "\1\0\0\0\0\0\0\0", 8)        = 8
futex(0x7f74d8000b60, FUTEX_WAKE_PRIVATE, 1) = 1
write(10, "\1\0\0\0\0\0\0\0", 8)        = 8
futex(0x55c670f5d690, FUTEX_WAKE_PRIVATE, 1) = 1
futex(0x55c670f5d690, FUTEX_WAIT_PRIVATE, 2, NULL) = -1 EAGAIN (Resource temporarily unavailable)
futex(0x55c670f5d690, FUTEX_WAKE_PRIVATE, 1) = 0
futex(0x7f74c0005ab0, FUTEX_WAKE_PRIVATE, 1) = 1
futex(0x55c670eff418, FUTEX_WAKE_PRIVATE, 1) = 1
write(10, "\1\0\0\0\0\0\0\0", 8)        = 8
futex(0x55c670f5d690, FUTEX_WAKE_PRIVATE, 1) = 1
futex(0x7f74c0005ab0, FUTEX_WAKE_PRIVATE, 1) = 1
futex(0x55c670eff418, FUTEX_WAKE_PRIVATE, 1) = 1
write(15, "\1\0\0\0\0\0\0\0", 8)        = 8
futex(0x7f74d8000b60, FUTEX_WAKE_PRIVATE, 1) = 1
recvmsg(3, {msg_namelen=0}, 0)          = -1 EAGAIN (Resource temporarily unavailable)
recvmsg(3, {msg_namelen=0}, 0)          = -1 EAGAIN (Resource temporarily unavailable)
poll([{fd=3, events=POLLIN}, {fd=4, events=POLLIN}, {fd=5, events=POLLIN}], 3, 0) = 1 ([{fd=4, revents=POLLIN}])
read(4, "\2\0\0\0\0\0\0\0", 16)         = 8
write(4, "\1\0\0\0\0\0\0\0", 8)         = 8
write(15, "\1\0\0\0\0\0\0\0", 8)        = 8
futex(0x7f74d8000b60, FUTEX_WAKE_PRIVATE, 1) = 1
memfd_create("WebKitSharedMemory", MFD_CLOEXEC) = 24
ftruncate(24, 18585600)                 = 0
mmap(NULL, 18585600, PROT_READ|PROT_WRITE, MAP_SHARED, 24, 0) = 0x7f74ae346000
fcntl(24, F_DUPFD_CLOEXEC, 0)           = 18
munmap(0x7f74ae346000, 18585600)        = 0
close(24)                               = 0
write(15, "\1\0\0\0\0\0\0\0", 8)        = 8
futex(0x7f74d8000b60, FUTEX_WAKE_PRIVATE, 1) = 1
write(15, "\1\0\0\0\0\0\0\0", 8)        = 8
futex(0x7f74d8000b60, FUTEX_WAKE_PRIVATE, 1) = 1
write(15, "\1\0\0\0\0\0\0\0", 8)        = 8
futex(0x7f74d8000b60, FUTEX_WAKE_PRIVATE, 1) = 1
write(15, "\1\0\0\0\0\0\0\0", 8)        = 8
futex(0x7f74d8000b60, FUTEX_WAKE_PRIVATE, 1) = 1
write(15, "\1\0\0\0\0\0\0\0", 8)        = 8
futex(0x7f74d8000b60, FUTEX_WAKE_PRIVATE, 1) = 1
write(15, "\1\0\0\0\0\0\0\0", 8)        = 8
futex(0x7f74d8000b60, FUTEX_WAKE_PRIVATE, 1) = 1
write(15, "\1\0\0\0\0\0\0\0", 8)        = 8
futex(0x7f74d8000b60, FUTEX_WAKE_PRIVATE, 1) = 1
recvmsg(3, {msg_namelen=0}, 0)          = -1 EAGAIN (Resource temporarily unavailable)
recvmsg(3, {msg_namelen=0}, 0)          = -1 EAGAIN (Resource temporarily unavailable)
poll([{fd=3, events=POLLIN}, {fd=4, events=POLLIN}, {fd=5, events=POLLIN}], 3, 0) = 1 ([{fd=4, revents=POLLIN}])
read(4, "\2\0\0\0\0\0\0\0", 16)         = 8
write(4, "\1\0\0\0\0\0\0\0", 8)         = 8
write(4, "\1\0\0\0\0\0\0\0", 8)         = 8
write(4, "\1\0\0\0\0\0\0\0", 8)         = 8
write(4, "\1\0\0\0\0\0\0\0", 8)         = 8
write(15, "\1\0\0\0\0\0\0\0", 8)        = 8
futex(0x7f74d8000b60, FUTEX_WAKE_PRIVATE, 1) = 1
write(10, "\1\0\0\0\0\0\0\0", 8)        = 8
futex(0x55c670f5d690, FUTEX_WAKE_PRIVATE, 1) = 1
futex(0x7f74c0005ab0, FUTEX_WAKE_PRIVATE, 1) = 1
futex(0x55c670eff418, FUTEX_WAKE_PRIVATE, 1) = 1
write(15, "\1\0\0\0\0\0\0\0", 8)        = 8
futex(0x7f74d8000b60, FUTEX_WAKE_PRIVATE, 1) = 1
write(15, "\1\0\0\0\0\0\0\0", 8)        = 8
futex(0x7f74d8000b60, FUTEX_WAKE_PRIVATE, 1) = 1
write(10, "\1\0\0\0\0\0\0\0", 8)        = 8
futex(0x55c670f5d690, FUTEX_WAKE_PRIVATE, 1) = 1
futex(0x7f74c0005ab0, FUTEX_WAKE_PRIVATE, 1) = 0
futex(0x55c670eff418, FUTEX_WAKE_PRIVATE, 1) = 1
write(15, "\1\0\0\0\0\0\0\0", 8)        = 8
futex(0x7f74d8000b60, FUTEX_WAKE_PRIVATE, 1) = 1
write(15, "\1\0\0\0\0\0\0\0", 8)        = 8
futex(0x7f74d8000b60, FUTEX_WAKE_PRIVATE, 1) = 1
write(10, "\1\0\0\0\0\0\0\0", 8)        = 8
futex(0x55c670f5d690, FUTEX_WAKE_PRIVATE, 1) = 1
futex(0x7f74c0005ab0, FUTEX_WAKE_PRIVATE, 1) = 1
futex(0x55c670eff418, FUTEX_WAKE_PRIVATE, 1) = 1
write(10, "\1\0\0\0\0\0\0\0", 8)        = 8
futex(0x55c670f5d690, FUTEX_WAKE_PRIVATE, 1) = 1
futex(0x7f74c0005ab0, FUTEX_WAKE_PRIVATE, 1) = 1
futex(0x55c670eff418, FUTEX_WAKE_PRIVATE, 1) = 1
write(15, "\1\0\0\0\0\0\0\0", 8)        = 8
futex(0x7f74d8000b60, FUTEX_WAKE_PRIVATE, 1) = 1
recvmsg(3, {msg_namelen=0}, 0)          = -1 EAGAIN (Resource temporarily unavailable)
recvmsg(3, {msg_namelen=0}, 0)          = -1 EAGAIN (Resource temporarily unavailable)
poll([{fd=3, events=POLLIN}, {fd=4, events=POLLIN}, {fd=5, events=POLLIN}], 3, 0) = 1 ([{fd=4, revents=POLLIN}])
read(4, "\5\0\0\0\0\0\0\0", 16)         = 8
write(4, "\1\0\0\0\0\0\0\0", 8)         = 8
write(15, "\1\0\0\0\0\0\0\0", 8)        = 8
futex(0x7f74d8000b60, FUTEX_WAKE_PRIVATE, 1) = 1
memfd_create("WebKitSharedMemory", MFD_CLOEXEC) = 24
ftruncate(24, 18585600)                 = 0
mmap(NULL, 18585600, PROT_READ|PROT_WRITE, MAP_SHARED, 24, 0) = 0x7f74ae346000
fcntl(24, F_DUPFD_CLOEXEC, 0)           = 18

Not sure if that's normal...


Anyway, back to my needle-in-the-haystack-search:
diodon, which is a small little clipboard helper also shows up in
powertop since a while, not it that much mW, but still much more I'd
have expected from a little tool that does basically nothing.
Attaching to it with strace again reveals a lot of polling:

writev(7, [{iov_base="\203(\3\0\336\5\0\0\2\0\0\0", iov_len=12}, {iov_base=NULL, iov_len=0}, {iov_base="", iov_len=0}], 3) = 12
poll([{fd=7, events=POLLIN}], 1, -1)    = 1 ([{fd=7, revents=POLLIN}])
recvmsg(7, {msg_name=NULL, msg_namelen=0, msg_iov=[{iov_base="\1(!T\7\0\0\0\336\5\0\0[\201S\2\0\0]\0\0\0!\1\0\0]\0\0\0!\1"..., iov_len=4096}], msg_iovlen=1, msg_controllen=0, msg_flags=0}, 0) = 60
poll([{fd=3, events=POLLIN}, {fd=7, events=POLLIN}, {fd=8, events=POLLIN}], 3, 0) = 1 ([{fd=3, revents=POLLIN}])
read(3, "\1\0\0\0\0\0\0\0", 16)         = 8
recvmsg(7, {msg_namelen=0}, 0)          = -1 EAGAIN (Resource temporarily unavailable)
recvmsg(7, {msg_namelen=0}, 0)          = -1 EAGAIN (Resource temporarily unavailable)
poll([{fd=3, events=POLLIN}, {fd=7, events=POLLIN}, {fd=8, events=POLLIN}], 3, 494) = 0 (Timeout)
recvmsg(7, {msg_namelen=0}, 0)          = -1 EAGAIN (Resource temporarily unavailable)
recvmsg(7, {msg_namelen=0}, 0)          = -1 EAGAIN (Resource temporarily unavailable)
poll([{fd=3, events=POLLIN}, {fd=7, events=POLLIN}, {fd=8, events=POLLIN}], 3, 3) = 0 (Timeout)
poll([{fd=7, events=POLLIN|POLLOUT}], 1, -1) = 1 ([{fd=7, revents=POLLOUT}])
writev(7, [{iov_base="\22\0\7\0\2\0\300\4L\2\0\0L\2\0\0\10\0\0\0\1\0\0\0a\1\0\0", iov_len=28}, {iov_base=NULL, iov_len=0}, {iov_base="", iov_len=0}], 3) = 28
poll([{fd=7, events=POLLIN}], 1, -1)    = 1 ([{fd=7, revents=POLLIN}])
recvmsg(7, {msg_name=NULL, msg_namelen=0, msg_iov=[{iov_base="\34\0\"T\2\0\300\4L\2\0\0\207f\24\2\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0", iov_len=4096}], msg_iovlen=1, msg_controllen=0, msg_flags=0}, 0) = 32
recvmsg(7, {msg_namelen=0}, 0)          = -1 EAGAIN (Resource temporarily unavailable)
poll([{fd=7, events=POLLIN|POLLOUT}], 1, -1) = 1 ([{fd=7, revents=POLLOUT}])
writev(7, [{iov_base="\27\0\2\0\1\0\0\0", iov_len=8}, {iov_base=NULL, iov_len=0}, {iov_base="", iov_len=0}], 3) = 8
poll([{fd=7, events=POLLIN}], 1, -1)    = 1 ([{fd=7, revents=POLLIN}])
recvmsg(7, {msg_name=NULL, msg_namelen=0, msg_iov=[{iov_base="\1\0#T\0\0\0\0\2\0@\2\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0", iov_len=4096}], msg_iovlen=1, msg_controllen=0, msg_flags=0}, 0) = 32
poll([{fd=7, events=POLLIN|POLLOUT}], 1, -1) = 1 ([{fd=7, revents=POLLOUT}])
writev(7, [{iov_base="\30\0\6\0\2\0\300\4\1\0\0\0M\2\0\0\230\1\0\0\207f\24\2", iov_len=24}, {iov_base=NULL, iov_len=0}, {iov_base="", iov_len=0}], 3) = 24
recvmsg(7, {msg_namelen=0}, 0)          = -1 EAGAIN (Resource temporarily unavailable)
recvmsg(7, {msg_namelen=0}, 0)          = -1 EAGAIN (Resource temporarily unavailable)
poll([{fd=3, events=POLLIN}, {fd=7, events=POLLIN}, {fd=8, events=POLLIN}], 3, 494) = 1 ([{fd=7, revents=POLLIN}])
recvmsg(7, {msg_name=NULL, msg_namelen=0, msg_iov=[{iov_base="\34\0$T\2\0\300\4\230\1\0\0\214f\24\2\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., iov_len=4096}], msg_iovlen=1, msg_controllen=0, msg_flags=0}, 0) = 64
recvmsg(7, {msg_namelen=0}, 0)          = -1 EAGAIN (Resource temporarily unavailable)

All this again with 5.5/ips=off/g-c-s.


g) Looking at powertop:
  1.56 W      9,8 ms/s     397,8        Interrupt      PS/2 Touchpad / Keyboard / Mouse
  836 mW      1,6 ms/s     213,9        kWork          dbs_work_handler
  804 mW      2,7 ms/s     205,5        Timer          tick_sched_timer
  594 mW     30,2 ms/s     144,6        Process        [PID 2831] /usr/bin/gnome-shell
  585 mW     35,6 ms/s     140,9        Process        [PID 1600] /usr/lib/xorg/Xorg :0 -seat seat0 -auth /var/run/lightdm/root/:0 -nolisten
(this is during and effectively idly system)

tick_sched_timer and dbs_work_handler appear there quite often at the top.
And they keyboard/mouse/touchpad, too.




7) Conclusions
**************
Well not that many except:
- video acceleration seems not really working
- cinnamon generally worse than gnome-shell-classic (which of course
  doesn't have to be a kernel issue, but it still seems to have gotten
  worse with >5.2 ... and it might be a pointer to what's wrong in the
  kernel
- when the temperature differs more greatly in the measurements, than
  5.2 seems typically much better than 5.5
  and ips=off, too.


My tests are obviously somehow limited. None of them simulates the
"normal" usage, like just switching between windows (well the winmove
tests does to some extent), scrolling up and down in a window and so
on.... and these use cases also greatly increase temperature... not
rarely over 80°C.


Any ideas what could cause all this? Context switches? Spectre&friends
protections that were added after 5.2? Interrupts? Something related to
polling?

Or any ideas what to do in terms of further tests? Other kernel
options? Or other tools (things like eventstat and so on)?


Any help would be appreciated, cause right now my laptop is more of an
oven and it starts to literally burn my legs when I work with it.



Thanks,
Chris.
Comment 1 Christoph Anton Mitterer 2020-04-14 22:38:35 UTC
I've made some further very extensive tests in the meantime, but these were mostly for clearly GPU related stuff, i.e. the problem that the temperatures go through the roof when playing back any video.
These were reported here:
https://gitlab.freedesktop.org/drm/intel/-/issues/953#note_463451

But I haven't made any plots/conclusions for that new set of tests, yet (will keep this ticket updated once I've done).



As for the general (I mean even when doing non-graphics intensive stuff like the unhide-brute or sha512 sum verify tests that I've described above) extreme temperature increase since >5.2 that I see, ... what I would try next is whether
mitigations=off changes anything (it didn't for video playback).


Also I found out about the nice features of perf record respectively perf report.
I've played a bit with that already and the first "results" showed that when I do anyting (like just typing at the keyboard, quickly moving up/down in e.g. Evolutions mail list, or just Alt-Tab-ing between windows, the number of events recorded there increases by magnitudes(!!).


I'd be thankful for any guide in what to actually test to better nail down that problem I see.

Thanks!
Comment 2 Christoph Anton Mitterer 2020-04-20 02:17:11 UTC
I've upgraded to 5.5.17 (again the stock Debian sid package), and all future tests with 5.5.x will be with this.

Problems unchanged.
Comment 3 Christoph Anton Mitterer 2020-04-20 02:21:21 UTC
I've also checked 5.5.17 with intel_pstate being enabled but at the same time using:

iommu=off mitigations=off pci=nomsi


I didn't repeat all tests as extensively as they're in the git repo, but I've played back a video with mpv and did some casual working (Atl-Tab-switching between windows, scrolling/up down in some windows, etc.).

None of these seem to help in terms of my CPU temperature going through the roof.
Comment 4 Christoph Anton Mitterer 2020-04-27 03:07:25 UTC
I've added a number of plots for my test series specifically about video playback (mentioned in Comment 1) here:
https://gitlab.freedesktop.org/drm/intel/issues/956#note_477885
and in the following comments.

Not sure if these plots give any good general clues, since they're mostly video acceleration related (which is likely a bug of it's own).

Yet in these tests there is one notable exception that works more or less well, and that in turn might - for an expert - be a hint what changed in kernels >5.2 that cause parts or all of the overheating issues I see since then (which occur, as stated several times, even for non-GPU-related processes).


And just to point that out again:
Even in 5.2 and before: video playback never worked really well and caused quite some temperature (though never nearly close to 100°C, though this in turn I tried only with cinnamon and could thus be something cinnamon related)... but now with >5.2 even non-fullscreen videos that take up just some centimetres of the screen cause the temperatures to go through the roof, while with fullscreen, as you can see, it goes nearly always to 100°C.
Comment 5 Christoph Anton Mitterer 2020-04-27 03:24:09 UTC
btw:

I've also added moving average plots to the git repo for my first test series, that is the one from the beginning of this ticket, where I compare 5.2 with 5.5 and for each intel_pstate=enabled with disabled under a number of differen workloads, both GPU intensive, GPU-related but not so intensive and GPU-unrelated.

These plots show more clearly how dramatic the temperature rise from 5.2 to 5.5 is,... and also how within each version how big the difference between intel_pstate being enabled and not.

https://github.com/calestyo/cpu-tests/commit/b256e5457b32dabc19035031073e2f09fe882cc4
Comment 6 Christoph Anton Mitterer 2020-07-06 17:52:46 UTC
Still completely broken in both 5.6 and 5.7 ... scrolling up a website in e.g. chromium gives mitt 96°C ... guess that's Intel hard&software :-D

Note You need to log in before you can comment on or make changes to this bug.