Bug 114551

Summary: Regression: bogus passive trip point causes processors throttled unexpectedly - Lenovo G510, Y50-70 laptop
Product: Power Management
Reporter: greeenify
Component: Thermal
Assignee: Zhang Rui (rui.zhang)
Status: CLOSED CODE_FIX
Severity: high
CC: cunio, dmatej, doaxan77, dsmythies, frolvlad, greeenify, johanneicher, lenb, marcin.j.nowak, megamak, rui.zhang, yu.c.chen
Priority: P1
Hardware: Intel
OS: Linux
Kernel Version: 4.4.5-1
Subsystem:
Regression: Yes
Bisected commit-id:
Attachments: attachment-997-0.html
This patch causes the regression
Fix suggested by Zhang Rui

Description greeenify 2016-03-14 00:50:05 UTC
There seems to be a regression in the pstate behavior after suspend in the 4.4 line. After I resume from suspend, the max. scaling frequency is locked to 2.0 GHz; after the second resume it is locked to 1.36 GHz and doesn't change on further suspends. It is possible to lower the max. scaling frequency further, but not to raise it. Writes to the kernel sysfs files fail, as do other tools like "cpupower". Moreover, turbo is also irreversibly disabled after resume.

The min. scaling frequency does not seem to be affected.

I discovered this behavior with 4.4.5-1 running ArchLinux x86_64.
CPU: Intel(R) Core(TM) i7-4700MQ CPU @ 2.40GHz

The only thing that helped was to downgrade to 4.3.3-3.

The behavior is nearly identical to that reported in:

https://bugzilla.kernel.org/show_bug.cgi?id=90421
https://bugzilla.kernel.org/show_bug.cgi?id=61241

Is there any way I could help to trace or debug this behavior?
Comment 1 Chen Yu 2016-03-14 14:43:13 UTC
Hi
1. did you test with 100% cpuload on a single cpu?
2. does this problem still reproduce on latest 4.5?
3. could you plz provide:
grep . /sys/devices/system/cpu/intel_pstate/*
grep . /sys/devices/system/cpu/cpu3/cpufreq/*
(before and after suspend)
thx
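
One way to collect the before/after data requested above is to snapshot every readable file in a sysfs directory and diff two snapshots. This is a minimal sketch, not a tool used in this thread; the function names `snapshot` and `diff_snapshots` are made up for illustration.

```python
from pathlib import Path


def snapshot(directory):
    """Return {file name: stripped contents} for readable regular files."""
    values = {}
    for path in sorted(Path(directory).iterdir()):
        if not path.is_file():
            continue
        try:
            values[path.name] = path.read_text().strip()
        except (OSError, UnicodeDecodeError):
            # Some sysfs attributes are write-only or fail to read; skip them.
            continue
    return values


def diff_snapshots(before, after):
    """Return {name: (old, new)} for every attribute whose value changed."""
    changed = {}
    for name in sorted(set(before) | set(after)):
        old, new = before.get(name), after.get(name)
        if old != new:
            changed[name] = (old, new)
    return changed
```

To use it, call `snapshot("/sys/devices/system/cpu/intel_pstate")` before suspending, call it again after resume, and pass both results to `diff_snapshots`. The same works for `/sys/devices/system/cpu/cpu3/cpufreq`.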
Comment 2 greeenify 2016-03-14 23:22:29 UTC
Hi, 

Thanks for your fast reply. I have to use my laptop during the week - will report back at the latest on the weekend.
Comment 3 Vlad Frolov 2016-03-15 05:25:31 UTC
Hi,

I can confirm similar behaviour on my Lenovo Y50-70 laptop. I have narrowed it down so far: the 4.4.3 mainline kernel works fine and 4.4.4 introduces the regression, but I don't have enough time to narrow it down further. Another Ubuntu user reported that the issue was backported to the 4.2.0-28 .. 4.2.0-30 releases (4.2.0-27 worked fine): http://askubuntu.com/questions/745087/cpu-clock-slower-after-each-resume-from-sleep
Comment 4 Vlad Frolov 2016-03-15 06:00:28 UTC
I forgot to summarize my findings here; I posted them in my comments on the askubuntu question.

I have up-to-date ArchLinux x86_64 and I have tested the following kernels:

* Mainline 4.4.0 - no regression
* Mainline 4.4.1 - no regression
* Mainline 4.4.2 - no regression
* Mainline 4.4.3 - no regression
* Mainline 4.4.4 - regression starts here
* Mainline 4.4.5 - regression is still here
* Mainline 4.5rc7 - regression is still here
* Mainline 4.5.0 - regression is still here

My preferred kernel includes Liquorix patches, so I first encountered the regression there, but I have built and checked vanilla kernels to confirm that the issue is not caused by the custom patches.
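
The test results listed above pin the regression to a single stable release. A minimal sketch of that narrowing, using only the versions and outcomes reported in this comment (the `first_bad` helper is hypothetical):

```python
# (version, works_fine) pairs in release order, as reported in this comment.
RESULTS = [
    ("4.4.0", True),
    ("4.4.1", True),
    ("4.4.2", True),
    ("4.4.3", True),
    ("4.4.4", False),
    ("4.4.5", False),
    ("4.5rc7", False),
    ("4.5.0", False),
]


def first_bad(results):
    """Return the first version that shows the regression, or None."""
    for version, works_fine in results:
        if not works_fine:
            return version
    return None


print(first_bad(RESULTS))  # prints 4.4.4
```

This only identifies the release; finding the offending commit still requires looking at what changed between 4.4.3 and 4.4.4.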
Comment 5 Chen Yu 2016-03-15 06:41:25 UTC
*** Bug 113531 has been marked as a duplicate of this bug. ***
Comment 6 Vlad Frolov 2016-03-15 07:00:00 UTC
Here is how CPU frequency changes on each suspend/resume (copied from the askubuntu question with some extra comments). Notice changes in "frequency should be within 800 MHz and XXX GHz.", "current CPU frequency is XXX MHz", and /sys/devices/system/cpu/intel_pstate/max_perf_pct.


After a fresh boot:

==========================================================================
root@alain-Y50-70:~# cpupower frequency-info
analyzing CPU 0:
  driver: intel_pstate
  CPUs which run at the same hardware frequency: 0
  CPUs which need to have their frequency coordinated by software: 0
  maximum transition latency: 0.97 ms.
  hardware limits: 800 MHz - 3.60 GHz
  available cpufreq governors: performance, powersave
  current policy: frequency should be within 800 MHz and 3.60 GHz.
                  The governor "powersave" may decide which speed to use
                  within this range.
  current CPU frequency is 817 MHz (asserted by call to hardware).
  boost state support:
    Supported: yes
    Active: yes
root@alain-Y50-70:~# cat /sys/devices/system/cpu/intel_pstate/max_perf_pct
100

==========================================================================


After the first suspend/resume:

==========================================================================
root@alain-Y50-70:~# cpupower frequency-info
analyzing CPU 0:
  driver: intel_pstate
  CPUs which run at the same hardware frequency: 0
  CPUs which need to have their frequency coordinated by software: 0
  maximum transition latency: 0.97 ms.
  hardware limits: 800 MHz - 3.60 GHz
  available cpufreq governors: performance, powersave
  current policy: frequency should be within 800 MHz and 2.88 GHz.
                  The governor "powersave" may decide which speed to use
                  within this range.
  current CPU frequency is 800 MHz (asserted by call to hardware).
  boost state support:
    Supported: yes
    Active: yes
root@alain-Y50-70:~# cat /sys/devices/system/cpu/intel_pstate/max_perf_pct
80

==========================================================================

After the second:

==========================================================================
root@alain-Y50-70:~# cat /sys/devices/system/cpu/intel_pstate/max_perf_pct
60
==========================================================================

After the third:

==========================================================================
root@alain-Y50-70:~# cat /sys/devices/system/cpu/intel_pstate/max_perf_pct
40
==========================================================================

After the fourth and subsequent resumes, max_perf_pct stays at 40, but the current CPU frequency drops even further...

==========================================================================
root@alain-Y50-70:~# cat /sys/devices/system/cpu/intel_pstate/max_perf_pct
40
root@alain-Y50-70:~# cpupower frequency-info
analyzing CPU 0:
  driver: intel_pstate
  CPUs which run at the same hardware frequency: 0
  CPUs which need to have their frequency coordinated by software: 0
  maximum transition latency: 0.97 ms.
  hardware limits: 800 MHz - 3.60 GHz
  available cpufreq governors: performance, powersave
  current policy: frequency should be within 800 MHz and 1.44 GHz.
                  The governor "powersave" may decide which speed to use
                  within this range.
  current CPU frequency is 699 MHz (asserted by call to hardware).
  boost state support:
    Supported: yes
    Active: yes
==========================================================================

==========================================================================
root@alain-Y50-70:~# cat /sys/devices/system/cpu/intel_pstate/max_perf_pct
40
root@alain-Y50-70:~# cpupower frequency-info
analyzing CPU 0:
  driver: intel_pstate
  CPUs which run at the same hardware frequency: 0
  CPUs which need to have their frequency coordinated by software: 0
  maximum transition latency: 0.97 ms.
  hardware limits: 800 MHz - 3.60 GHz
  available cpufreq governors: performance, powersave
  current policy: frequency should be within 800 MHz and 1.44 GHz.
                  The governor "powersave" may decide which speed to use
                  within this range.
  current CPU frequency is 605 MHz (asserted by call to hardware).
  boost state support:
    Supported: yes
    Active: yes
==========================================================================

The machine becomes noticeably less responsive even after 3 suspend/resume cycles, and going beyond that makes it almost unusable. I could go down to 200 MHz...

You may also notice that "hardware limits" states a minimum of 800 MHz, but "current CPU frequency" goes even lower than that limit.
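
The policy limits reported above are consistent with max_perf_pct scaling the 3.60 GHz hardware maximum: 80% gives 2.88 GHz and 40% gives 1.44 GHz. A small sketch of that arithmetic follows. The helper name is hypothetical, and the assumption that intel_pstate derives the policy limit exactly this way is inferred from the readings, not confirmed in this thread.

```python
HW_MAX_MHZ = 3600  # "hardware limits: 800 MHz - 3.60 GHz" from cpupower


def policy_max_mhz(max_perf_pct, hw_max_mhz=HW_MAX_MHZ):
    """Policy ceiling implied by max_perf_pct, in MHz (integer arithmetic)."""
    return hw_max_mhz * max_perf_pct // 100


# max_perf_pct values observed after boot and after each suspend/resume.
for pct in (100, 80, 60, 40):
    print(f"max_perf_pct={pct} -> policy max {policy_max_mhz(pct)} MHz")
```

The 60% step corresponds to 2160 MHz under the same assumption; no cpupower output was posted for that step, so it is not verified here.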
Comment 7 Jacek Pawlyta 2016-03-15 13:37:12 UTC
I am confirming the same behaviour.

I have tested kernel 4.4.5 and 4.5rc7 from Fedora on
Lenovo G550 laptop with 
Architecture:          x86_64
CPU op-mode(s):        32-bit, 64-bit
Byte Order:            Little Endian
CPU(s):                2
On-line CPU(s) list:   0,1
Thread(s) per core:    1
Core(s) per socket:    2
Socket(s):             1
NUMA node(s):          1
Vendor ID:             GenuineIntel
CPU family:            6
Model:                 23
Model name:            Intel(R) Core(TM)2 Duo CPU     T9300  @ 2.50GHz
Stepping:              6
CPU MHz:               800.000
CPU max MHz:           2501.0000
CPU min MHz:           800.0000
BogoMIPS:              4987.79
Virtualization:        VT-x
L1d cache:             32K
L1i cache:             32K
L2 cache:              6144K
NUMA node0 CPU(s):     0,1
Flags:                 fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx lm constant_tsc arch_perfmon pebs bts rep_good nopl aperfmperf pni dtes64 monitor ds_cpl vmx est tm2 ssse3 cx16 xtpr pdcm sse4_1 lahf_lm ida dtherm tpr_shadow vnmi flexpriority

Please note that my CPU does not use the intel_pstate driver but acpi-cpufreq:
analyzing CPU 0:
  driver: acpi-cpufreq
  CPUs which run at the same hardware frequency: 0
  CPUs which need to have their frequency coordinated by software: 0
  maximum transition latency: 10.0 us.
  hardware limits: 800 MHz - 2.50 GHz
  available frequency steps: 2.50 GHz, 2.50 GHz, 2.00 GHz, 1.60 GHz, 1.20 GHz, 800 MHz
  available cpufreq governors: conservative, userspace, powersave, ondemand, performance
  current policy: frequency should be within 800 MHz and 1.20 GHz.
                  The governor "conservative" may decide which speed to use
                  within this range.
  current CPU frequency is 1.20 GHz (asserted by call to hardware).
  boost state support:
    Supported: yes
    Active: yes
analyzing CPU 1:
  driver: acpi-cpufreq
  CPUs which run at the same hardware frequency: 1
  CPUs which need to have their frequency coordinated by software: 1
  maximum transition latency: 10.0 us.
  hardware limits: 800 MHz - 2.50 GHz
  available frequency steps: 2.50 GHz, 2.50 GHz, 2.00 GHz, 1.60 GHz, 1.20 GHz, 800 MHz
  available cpufreq governors: conservative, userspace, powersave, ondemand, performance
  current policy: frequency should be within 800 MHz and 1.20 GHz.
                  The governor "conservative" may decide which speed to use
                  within this range.
  current CPU frequency is 1.20 GHz (asserted by call to hardware).
  boost state support:
    Supported: yes
    Active: yes
Comment 8 Vlad Frolov 2016-03-15 14:04:40 UTC
Jacek Pawlyta noted (the comment was actually hard to spot) that he encountered the regression even with the acpi-cpufreq driver, while all previous reporters were using the intel_pstate driver and assumed that the issue was related to it. I think the issue title should be changed, since the regression does not seem to be limited to Haswell and intel_pstate (Jacek Pawlyta has a Core 2 Duo).
Comment 9 Doug Smythies 2016-03-15 15:11:45 UTC
@Vlad: Yes, it is my understanding that this issue is not intel_pstate specific.

@greeenify: What brand and model is your computer? So far, I have only heard of this issue on Lenovo computers.

Other references, in addition to the one Vlad gave:
http://ubuntuforums.org/showthread.php?t=2316101
Comment 10 greeenify 2016-03-15 15:33:39 UTC
Created attachment 209231 [details]
attachment-997-0.html

Lenovo G510

Comment 11 Jacek Pawlyta 2016-03-15 21:10:23 UTC
I am confirming findings made by Vlad Frolov, 

Fedora's kernel-4.4.3-300.fc23.x86_64 works fine on my Lenovo G550 with C2Duo T9300 controlled by acpi-cpufreq

The first broken Fedora kernel is 4.4.4 as well.
Comment 12 Johann Eicher 2016-03-16 14:51:43 UTC
I'd like to confirm that I have the same issue on my Lenovo Y50-70, with either the intel_pstate or the acpi-cpufreq driver. Currently I am running the 4.2.0-34 kernel.
Comment 13 Srinivas Pandruvada 2016-03-16 15:13:25 UTC
So, based on the comments, this issue has nothing to do with intel_pstate, as acpi-cpufreq has the same issue. Suspend/resume can be broken by many kernel drivers. It is possible that each reporter here has a different issue.
Please do the following steps:
- Upload dmesg output that includes both the suspend and resume sequences
- Test suspend/resume with the kernel command line option test_suspend=mem,10 to check whether the built-in drivers and core kernel are causing the issue
- After resume, run turbostat --debug -i 1 --msr=0x199
Comment 14 Maknesium 2016-03-16 21:38:42 UTC
I'd like to confirm exactly the same issue on my Lenovo y580 laptop.

I ran kernel 3.19 generic - no issues.

As soon as I started using 4.2.0-34-generic (most recent version in Ubuntu 15.10 right now) the issue popped up.

Using the older Kernel is my workaround atm.
Comment 15 Vlad Frolov 2016-03-17 04:13:31 UTC
I have caught the misbehaving module! It is `thermal`! Doing `rmmod thermal && modprobe thermal` I immediately see the drop in CPU frequency!

I'm going to revert the patches made to the `thermal` module in 4.4.4 patchset and recompile the kernel now.
Comment 16 Vlad Frolov 2016-03-17 05:07:33 UTC
Created attachment 209611 [details]
This patch causes the regression

Reverting these patches resolves the regression for both suspend/resume and rmmod/modprobe of the `thermal` module!
Comment 17 Doug Smythies 2016-03-17 05:17:10 UTC
@Vlad: Good work.
Please be aware of this thread, from just a few hours ago, about the same commit.
http://marc.info/?t=145816738700001&r=1&w=2
Comment 18 Vlad Frolov 2016-03-18 02:36:29 UTC
Just for reference, I have provided the following details to the kernel developers in the mailing list thread mentioned by @Doug:


> Can you send me the output of "grep . /sys/class/thermal/*/*" both w/ and w/o
> the broken patch series?

4.4.4 without thermal patches (deduplicated output):

=================================================================
/sys/class/thermal/cooling_device0/cur_state:0
/sys/class/thermal/cooling_device0/max_state:10
/sys/class/thermal/cooling_device0/type:Processor
... (7 more cooling_deviceN groups with the same values as above)
/sys/class/thermal/cooling_device8/cur_state:-1
/sys/class/thermal/cooling_device8/max_state:50
/sys/class/thermal/cooling_device8/type:intel_powerclamp
/sys/class/thermal/thermal_zone0/available_policies:power_allocator user_space bang_bang fair_share step_wise
/sys/class/thermal/thermal_zone0/cdev0_trip_point:2
/sys/class/thermal/thermal_zone0/cdev0_weight:0
... (7 more cdevN_trip_point & cdevN_weight with the same values)
/sys/class/thermal/thermal_zone0/mode:enabled
/sys/class/thermal/thermal_zone0/policy:step_wise
/sys/class/thermal/thermal_zone0/temp:62000
/sys/class/thermal/thermal_zone0/trip_point_0_temp:127000
/sys/class/thermal/thermal_zone0/trip_point_0_type:critical
/sys/class/thermal/thermal_zone0/trip_point_1_temp:127000
/sys/class/thermal/thermal_zone0/trip_point_1_type:hot
/sys/class/thermal/thermal_zone0/trip_point_2_temp:0
/sys/class/thermal/thermal_zone0/trip_point_2_type:passive
/sys/class/thermal/thermal_zone0/type:acpitz
/sys/class/thermal/thermal_zone1/available_policies:power_allocator user_space bang_bang fair_share step_wise
/sys/class/thermal/thermal_zone1/integral_cutoff:0
/sys/class/thermal/thermal_zone1/k_d:0
/sys/class/thermal/thermal_zone1/k_i:0
/sys/class/thermal/thermal_zone1/k_po:0
/sys/class/thermal/thermal_zone1/k_pu:0
/sys/class/thermal/thermal_zone1/offset:0
/sys/class/thermal/thermal_zone1/policy:user_space
/sys/class/thermal/thermal_zone1/slope:0
/sys/class/thermal/thermal_zone1/sustainable_power:0
/sys/class/thermal/thermal_zone1/temp:59000
/sys/class/thermal/thermal_zone1/trip_point_0_temp:72000
/sys/class/thermal/thermal_zone1/trip_point_0_type:passive
/sys/class/thermal/thermal_zone1/trip_point_1_temp:0
/sys/class/thermal/thermal_zone1/trip_point_1_type:passive
/sys/class/thermal/thermal_zone1/type:x86_pkg_temp
=================================================================

original 4.4.4 (with thermal patches) (also deduplicated output):

=================================================================
/sys/class/thermal/cooling_device0/cur_state:0
/sys/class/thermal/cooling_device0/max_state:10
/sys/class/thermal/cooling_device0/type:Processor
... (7 more cooling_deviceN groups with the same values as above)
/sys/class/thermal/cooling_device8/cur_state:-1
/sys/class/thermal/cooling_device8/max_state:50
/sys/class/thermal/cooling_device8/type:intel_powerclamp
/sys/class/thermal/thermal_zone0/available_policies:user_space bang_bang fair_share step_wise
/sys/class/thermal/thermal_zone0/cdev0_trip_point:2
/sys/class/thermal/thermal_zone0/cdev0_weight:0
... (7 more cdevN_trip_point & cdevN_weight with the same values)
/sys/class/thermal/thermal_zone0/mode:enabled
/sys/class/thermal/thermal_zone0/policy:step_wise
/sys/class/thermal/thermal_zone0/temp:63000
/sys/class/thermal/thermal_zone0/trip_point_0_temp:127000
/sys/class/thermal/thermal_zone0/trip_point_0_type:critical
/sys/class/thermal/thermal_zone0/trip_point_1_temp:127000
/sys/class/thermal/thermal_zone0/trip_point_1_type:hot
/sys/class/thermal/thermal_zone0/trip_point_2_temp:0
/sys/class/thermal/thermal_zone0/trip_point_2_type:passive
/sys/class/thermal/thermal_zone0/type:acpitz
/sys/class/thermal/thermal_zone1/available_policies:user_space bang_bang fair_share step_wise
/sys/class/thermal/thermal_zone1/integral_cutoff:0
/sys/class/thermal/thermal_zone1/k_d:0
/sys/class/thermal/thermal_zone1/k_i:0
/sys/class/thermal/thermal_zone1/k_po:0
/sys/class/thermal/thermal_zone1/k_pu:0
/sys/class/thermal/thermal_zone1/offset:0
/sys/class/thermal/thermal_zone1/policy:user_space
/sys/class/thermal/thermal_zone1/slope:0
/sys/class/thermal/thermal_zone1/sustainable_power:0
/sys/class/thermal/thermal_zone1/temp:65000
/sys/class/thermal/thermal_zone1/trip_point_0_temp:72000
/sys/class/thermal/thermal_zone1/trip_point_0_type:passive
/sys/class/thermal/thermal_zone1/trip_point_1_temp:0
/sys/class/thermal/thermal_zone1/trip_point_1_type:passive
/sys/class/thermal/thermal_zone1/type:x86_pkg_temp
=================================================================
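
Both listings above contain passive trip points with a temperature of 0 (thermal_zone0 trip 2 and thermal_zone1 trip 1), with and without the patches. A minimal sketch for spotting such entries in `grep . /sys/class/thermal/*/*` output follows. The helper names are hypothetical, and treating a temperature of 0 or below as suspicious is an assumption made for this illustration.

```python
import re

TRIP_RE = re.compile(
    r"^/sys/class/thermal/(thermal_zone\d+)/trip_point_(\d+)_(temp|type):(.*)$"
)


def parse_trip_points(grep_output):
    """Return {(zone, index): {"temp": int, "type": str}} from `grep .` output."""
    trips = {}
    for line in grep_output.splitlines():
        match = TRIP_RE.match(line.strip())
        if not match:
            continue
        zone, index, field, value = match.groups()
        entry = trips.setdefault((zone, int(index)), {})
        entry[field] = int(value) if field == "temp" else value
    return trips


def suspicious_trips(trips):
    """Trip points whose temperature is 0 millidegrees C or below."""
    return sorted(key for key, trip in trips.items() if trip.get("temp", 1) <= 0)


# A few lines copied from the listings above.
SAMPLE = """\
/sys/class/thermal/thermal_zone0/trip_point_0_temp:127000
/sys/class/thermal/thermal_zone0/trip_point_0_type:critical
/sys/class/thermal/thermal_zone0/trip_point_2_temp:0
/sys/class/thermal/thermal_zone0/trip_point_2_type:passive
/sys/class/thermal/thermal_zone1/trip_point_0_temp:72000
/sys/class/thermal/thermal_zone1/trip_point_1_temp:0
"""

print(suspicious_trips(parse_trip_points(SAMPLE)))
```

Because the zero-temperature trip points appear in both listings, the sysfs values alone do not distinguish the good kernel from the bad one; the difference lies in how the thermal core acts on them.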

> What does it show here when performance drops?
> grep . /sys/devices/system/cpu/intel_pstate/*

# grep . /sys/devices/system/cpu/intel_pstate/*
/sys/devices/system/cpu/intel_pstate/max_perf_pct:100
/sys/devices/system/cpu/intel_pstate/min_perf_pct:22
/sys/devices/system/cpu/intel_pstate/no_turbo:0
/sys/devices/system/cpu/intel_pstate/num_pstates:28
/sys/devices/system/cpu/intel_pstate/turbo_pct:36

# rmmod thermal

# grep . /sys/devices/system/cpu/intel_pstate/*
/sys/devices/system/cpu/intel_pstate/max_perf_pct:100
/sys/devices/system/cpu/intel_pstate/min_perf_pct:22
/sys/devices/system/cpu/intel_pstate/no_turbo:0
/sys/devices/system/cpu/intel_pstate/num_pstates:28
/sys/devices/system/cpu/intel_pstate/turbo_pct:36

# modprobe thermal

# grep . /sys/devices/system/cpu/intel_pstate/*
/sys/devices/system/cpu/intel_pstate/max_perf_pct:80
/sys/devices/system/cpu/intel_pstate/min_perf_pct:22
/sys/devices/system/cpu/intel_pstate/no_turbo:0
/sys/devices/system/cpu/intel_pstate/num_pstates:28
/sys/devices/system/cpu/intel_pstate/turbo_pct:36


> Does the problem still occur if you set
> /sys/class/thermal/thermal_zone*/mode to "disabled"?

Yes, the problem still occurs. (I have tested it just like above and the outcome is the same.)


> please do the following test both w/ and w/o the patches,
> 1. # echo 'module thermal_sys +fp' > /sys/kernel/debug/dynamic_debug/control
> 2. rmmod and insmod thermal
> 3. get the dmesg output

With the thermal patches (original 4.4.4):

==========================================================
[ 8354.627365] update_temperature: thermal thermal_zone0: last_temperature N/A, current_temperature=59000
[ 8354.627375] thermal_zone_trip_update: thermal thermal_zone0: Trip2[type=1,temp=0]:trend=0,throttle=1
[ 8354.627380] get_target_state: thermal cooling_device7: cur_state=2
[ 8354.627383] thermal_zone_trip_update: thermal cooling_device7: old_target=-1, target=3
[ 8354.627386] get_target_state: thermal cooling_device6: cur_state=2
[ 8354.627389] thermal_zone_trip_update: thermal cooling_device6: old_target=-1, target=3
[ 8354.627393] get_target_state: thermal cooling_device5: cur_state=2
[ 8354.627396] thermal_zone_trip_update: thermal cooling_device5: old_target=-1, target=3
[ 8354.627399] get_target_state: thermal cooling_device4: cur_state=2
[ 8354.627402] thermal_zone_trip_update: thermal cooling_device4: old_target=-1, target=3
[ 8354.627405] get_target_state: thermal cooling_device3: cur_state=2
[ 8354.627408] thermal_zone_trip_update: thermal cooling_device3: old_target=-1, target=3
[ 8354.627412] get_target_state: thermal cooling_device2: cur_state=2
[ 8354.627415] thermal_zone_trip_update: thermal cooling_device2: old_target=-1, target=3
[ 8354.627418] get_target_state: thermal cooling_device1: cur_state=2
[ 8354.627421] thermal_zone_trip_update: thermal cooling_device1: old_target=-1, target=3
[ 8354.627425] get_target_state: thermal cooling_device0: cur_state=2
[ 8354.627428] thermal_zone_trip_update: thermal cooling_device0: old_target=-1, target=3
[ 8354.627432] thermal_cdev_update: thermal cooling_device7: zone0->target=3
[ 8354.627441] thermal_cdev_update: thermal cooling_device7: set to state 3
[ 8354.627444] thermal_cdev_update: thermal cooling_device6: zone0->target=3
[ 8354.627451] thermal_cdev_update: thermal cooling_device6: set to state 3
[ 8354.627454] thermal_cdev_update: thermal cooling_device5: zone0->target=3
[ 8354.627461] thermal_cdev_update: thermal cooling_device5: set to state 3
[ 8354.627464] thermal_cdev_update: thermal cooling_device4: zone0->target=3
[ 8354.627471] thermal_cdev_update: thermal cooling_device4: set to state 3
[ 8354.627473] thermal_cdev_update: thermal cooling_device3: zone0->target=3
[ 8354.627480] thermal_cdev_update: thermal cooling_device3: set to state 3
[ 8354.627483] thermal_cdev_update: thermal cooling_device2: zone0->target=3
[ 8354.627490] thermal_cdev_update: thermal cooling_device2: set to state 3
[ 8354.627493] thermal_cdev_update: thermal cooling_device1: zone0->target=3
[ 8354.627501] thermal_cdev_update: thermal cooling_device1: set to state 3
[ 8354.627504] thermal_cdev_update: thermal cooling_device0: zone0->target=3
[ 8354.627511] thermal_cdev_update: thermal cooling_device0: set to state 3
[ 8354.627519] thermal LNXTHERM:00: registered as thermal_zone0
[ 8354.627521] ACPI: Thermal Zone [TZ00] (59 C)
==========================================================

Without the thermal patches (4.4.4 without the patches [reverted]):

==========================================================
[   28.144010] update_temperature: thermal thermal_zone1: last_temperature=69000, current_temperature=63000
[   34.154054] update_temperature: thermal thermal_zone1: last_temperature=63000, current_temperature=62000
[   37.094852] update_temperature: thermal thermal_zone0: last_temperature=0, current_temperature=65000
[   37.094857] thermal_zone_trip_update: thermal thermal_zone0: Trip2[type=1,temp=0]:trend=0,throttle=1
[   37.094859] get_target_state: thermal cooling_device7: cur_state=0
[   37.094860] thermal_zone_trip_update: thermal cooling_device7: old_target=-1, target=-1
[   37.094862] get_target_state: thermal cooling_device6: cur_state=0
[   37.094863] thermal_zone_trip_update: thermal cooling_device6: old_target=-1, target=-1
[   37.094864] get_target_state: thermal cooling_device5: cur_state=0
[   37.094865] thermal_zone_trip_update: thermal cooling_device5: old_target=-1, target=-1
[   37.094867] get_target_state: thermal cooling_device4: cur_state=0
[   37.094868] thermal_zone_trip_update: thermal cooling_device4: old_target=-1, target=-1
[   37.094869] get_target_state: thermal cooling_device3: cur_state=0
[   37.094870] thermal_zone_trip_update: thermal cooling_device3: old_target=-1, target=-1
[   37.094872] get_target_state: thermal cooling_device2: cur_state=0
[   37.094873] thermal_zone_trip_update: thermal cooling_device2: old_target=-1, target=-1
[   37.094874] get_target_state: thermal cooling_device1: cur_state=0
[   37.094875] thermal_zone_trip_update: thermal cooling_device1: old_target=-1, target=-1
[   37.094877] get_target_state: thermal cooling_device0: cur_state=0
[   37.094878] thermal_zone_trip_update: thermal cooling_device0: old_target=-1, target=-1
[   37.094882] thermal LNXTHERM:00: registered as thermal_zone0
[   37.094883] ACPI: Thermal Zone [TZ00] (65 C)
==========================================================
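
The key difference between the two logs is the target chosen for each cooling device: target=3 (from cur_state=2) with the patches, and target=-1 without them, which appears to mean "no target". A minimal sketch for extracting those values from the dynamic debug output follows; the function name is hypothetical and the regular expression only covers the line format shown above.

```python
import re

TARGET_RE = re.compile(
    r"thermal_zone_trip_update: thermal (cooling_device\d+): "
    r"old_target=(-?\d+), target=(-?\d+)"
)


def cooling_targets(dmesg_text):
    """Return {cooling device: (old_target, target)} from thermal_sys debug lines."""
    targets = {}
    for line in dmesg_text.splitlines():
        match = TARGET_RE.search(line)
        if match:
            device, old, new = match.groups()
            targets[device] = (int(old), int(new))
    return targets
```

Running this over each capture and comparing the resulting dictionaries shows every Processor cooling device being pushed to state 3 on the patched kernel and left without a target on the reverted one.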



Here is Srinivas's guess about the cause:

> I think, the problem is your device has a passive trip temp of 0
> /sys/class/thermal/thermal_zone0/trip_point_2_temp:0
> /sys/class/thermal/thermal_zone0/trip_point_2_type:passive

> Which triggers a false throttle = true. I think we should treat this trip as
> invalid in the case of 
> if (tz->temperature >= trip_temp) {} check
> in thermal_zone_trip_update().
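
The reasoning quoted above can be illustrated with a small model of the comparison. This is a sketch in Python rather than kernel code; the `ignore_zero_trip` flag is a hypothetical stand-in for treating a trip temperature of 0 as invalid and is not the kernel's actual interface.

```python
def is_throttled(temperature, trip_temp, ignore_zero_trip=False):
    """Model of the `tz->temperature >= trip_temp` check quoted above.

    Temperatures are in millidegrees Celsius, as in sysfs.
    """
    if ignore_zero_trip and trip_temp == 0:
        return False  # bogus trip point: take no action
    return temperature >= trip_temp


# 59 C against the bogus passive trip at 0: always "too hot".
print(is_throttled(59000, 0))                         # prints True
# Same reading with the zero trip point treated as invalid.
print(is_throttled(59000, 0, ignore_zero_trip=True))  # prints False
# A real trip point (72 C) still behaves normally.
print(is_throttled(75000, 72000, ignore_zero_trip=True))  # prints True
```

With a trip temperature of 0, any realistic reading satisfies the comparison, which matches the throttle=1 seen in the debug log for Trip2[type=1,temp=0].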


P.S. I guess the "Component" field of this issue should be changed from intel_pstate to thermal.
Comment 19 David Matějček 2016-03-19 12:28:06 UTC
Same problem with Lenovo Y510P:
uname -a
Linux dmatej-lenovo 4.2.0-34-generic #39-Ubuntu SMP Thu Mar 10 22:13:01 UTC 2016 x86_64 x86_64 x86_64 GNU/Linux

cpufreq-info 
cpufrequtils 008: cpufreq-info (C) Dominik Brodowski 2004-2009
Report errors and bugs to cpufreq@vger.kernel.org, please.
analyzing CPU 0:
  driver: intel_pstate
  CPUs which run at the same hardware frequency: 0
  CPUs which need to have their frequency coordinated by software: 0
  maximum transition latency: 0.97 ms.
  hardware limits: 800 MHz - 3.10 GHz
  available cpufreq governors: performance, powersave
  current policy: frequency should be within 800 MHz and 1.86 GHz.
                  The governor "powersave" may decide which speed to use
                  within this range.
  current CPU frequency is 1.80 GHz.
...

After several suspends I ended up at 580 MHz, even below the minimum limit (the policy was then 800-800; it is now 800-1860). I can change the policy, but only inside this range; I suppose it should be limited only by the hardware limits ...?
Comment 20 David Matějček 2016-03-19 13:03:04 UTC
I tried also latest kernel from http://kernel.ubuntu.com/~kernel-ppa/mainline/v4.5-wily/

uname -a
Linux dmatej-lenovo 4.5.0-040500-generic #201603140130 SMP Mon Mar 14 05:32:22 UTC 2016 x86_64 x86_64 x86_64 GNU/Linux

This is after 3 suspends (after the second one the system woke up spontaneously).
...
  hardware limits: 800 MHz - 3.10 GHz
  available cpufreq governors: performance, powersave
  current policy: frequency should be within 800 MHz and 1.24 GHz.
...
Comment 21 Maknesium 2016-03-20 16:08:24 UTC
I use the Lenovo y580 - a fallback to Kernel 4.2.0-19-generic is my current workaround for the problem...
Comment 22 Vlad Frolov 2016-03-21 13:10:31 UTC
Created attachment 210111 [details]
Fix suggested by Zhang Rui

I have tested this fix and it works fine for both `rmmod & modprobe thermal` and suspend & resume use-cases.
Comment 23 Vlad Frolov 2016-03-28 03:58:09 UTC
I have been running the fix on top of the 4.4.4 mainline kernel and the 4.4.6 kernel with Liquorix patches since the last comment. I have not encountered any problems with the patch at all. When will it land in mainline?
Comment 24 Zhang Rui 2016-03-28 23:20:39 UTC
Patch has been shipped in 4.6-rc1. Bug closed.

commit 81ad4276b505e987dd8ebbdf63605f92cd172b52
Author: Zhang Rui <rui.zhang@intel.com>
Date:   Fri Mar 18 10:03:24 2016 +0800

    Thermal: Ignore invalid trip points
    
    In some cases, platform thermal driver may report invalid trip points,
    thermal core should not take any action for these trip points.
    
    This fixed a regression that bogus trip point starts to screw up thermal
    control on some Lenovo laptops, after
    commit bb431ba26c5cd0a17c941ca6c3a195a3a6d5d461
    Author: Zhang Rui <rui.zhang@intel.com>
    Date:   Fri Oct 30 16:31:47 2015 +0800
    
        Thermal: initialize thermal zone device correctly
    
        After thermal zone device registered, as we have not read any
        temperature before, thus tz->temperature should not be 0,
        which actually means 0C, and thermal trend is not available.
        In this case, we need specially handling for the first
        thermal_zone_device_update().
    
        Both thermal core framework and step_wise governor is
        enhanced to handle this. And since the step_wise governor
        is the only one that uses trends, so it's the only thermal
        governor that needs to be updated.
    
        Tested-by: Manuel Krause <manuelkrause@netscape.net>
        Tested-by: szegad <szegadlo@poczta.onet.pl>
        Tested-by: prash <prash.n.rao@gmail.com>
        Tested-by: amish <ammdispose-arch@yahoo.com>
        Tested-by: Matthias <morpheusxyz123@yahoo.de>
        Reviewed-by: Javi Merino <javi.merino@arm.com>
        Signed-off-by: Zhang Rui <rui.zhang@intel.com>
        Signed-off-by: Chen Yu <yu.c.chen@intel.com>
    
    CC: <stable@vger.kernel.org> #3.18+
    Link: https://bugzilla.redhat.com/show_bug.cgi?id=1317190
    Link: https://bugzilla.kernel.org/show_bug.cgi?id=114551
    Signed-off-by: Zhang Rui <rui.zhang@intel.com>
Comment 25 Vlad Frolov 2016-03-30 09:05:36 UTC
I have tested and can confirm that 4.6-rc1 works fine! Thank you, Zhang!
Comment 26 Marcin Nowak 2016-05-21 19:13:58 UTC
Intel(R) Core(TM) i5-3337U CPU @ 1.80GHz on Linux 4.6.0-1-MANJARO finally seems to work fine. God bless you!