Bug 91411 - After suspend to ram fan spins always at full speed
Status: CLOSED CODE_FIX
Alias: None
Product: ACPI
Classification: Unclassified
Component: Power-Fan
Hardware: All Linux
Importance: P1 normal
Assignee: Chen Yu
Duplicates: 92431
 
Reported: 2015-01-16 11:57 UTC by Matthias
Modified: 2016-01-25 02:50 UTC
Kernel Version: 3.18.0
Tree: Mainline
Regression: Yes


Attachments
dmesg after resume from supend to ram (59.38 KB, application/octet-stream)
2015-01-16 11:57 UTC, Matthias
System environment (4.98 KB, text/plain)
2015-01-16 11:59 UTC, Matthias
kernel configuration (81.63 KB, text/plain)
2015-01-16 11:59 UTC, Matthias
lspci (1.82 KB, text/plain)
2015-01-16 12:00 UTC, Matthias
dmesg after resume from supend to ram (59.38 KB, text/plain)
2015-01-16 12:00 UTC, Matthias
acpidump NW9440 (95.15 KB, application/octet-stream)
2015-01-31 08:21 UTC, Matthias
working drivers/acpi/fan.c from kernel 3.17.8 (5.58 KB, text/x-csrc)
2015-02-14 08:54 UTC, szegad
acpidump kernel 3.18.6 (373.79 KB, text/plain)
2015-02-16 11:00 UTC, szegad
# ls -l /sys/bus/platform/drivers/acpi-fan (910 bytes, text/plain)
2015-02-17 08:57 UTC, szegad
ls -l /sys/bus/platform/devices/PNP0C0B\:*/ (4.79 KB, text/plain)
2015-02-17 08:58 UTC, szegad
# grep . /sys/bus/platform/devices/PNP0C0B\:*/firmware_node/* (3.79 KB, text/plain)
2015-02-17 08:59 UTC, szegad
(before suspend) grep . /sys/class/thermal/thermal_zone*/* (3.00 KB, text/plain)
2015-02-25 09:01 UTC, szegad
(after suspend) grep . /sys/class/thermal/thermal_zone*/* (3.00 KB, text/plain)
2015-02-25 09:01 UTC, szegad
(before suspend) find -L /sys/class/thermal/ -maxdepth 2 -name "cur_state" -print -exec cat {} \; (331 bytes, text/plain)
2015-02-26 17:46 UTC, szegad
(after) find -L /sys/class/thermal/ -maxdepth 2 -name "cur_state" -print -exec cat {} \; (331 bytes, text/plain)
2015-02-26 17:47 UTC, szegad
ls -l /sys/class/thermal/thermal_zone*/ (5.15 KB, text/plain)
2015-02-26 17:49 UTC, szegad
dmesg with debug (115.37 KB, text/plain)
2015-02-27 09:15 UTC, szegad
My tests with linux-3.19.0 (9.69 KB, text/plain)
2015-02-28 09:03 UTC, Matthias
tzp=300, dmesg just afer resume, when tzp works (109.60 KB, text/plain)
2015-03-03 08:07 UTC, szegad
tzp=300, dmesg after fan down, when tzp works (113.86 KB, text/plain)
2015-03-03 08:07 UTC, szegad
tzp=300, dmesg just afer resume, when tzp doesn't work (137.96 KB, text/plain)
2015-03-03 08:08 UTC, szegad
tzp=300, dmesg just after 30s, when tzp doesn't work (141.92 KB, text/plain)
2015-03-03 08:09 UTC, szegad
tzp=300, dmesg after fan down, when tzp doesn't work initially (155.79 KB, text/plain)
2015-03-03 08:10 UTC, szegad
dmesg after long suspend (178.52 KB, text/plain)
2015-03-05 13:53 UTC, szegad
resume after long suspend some minutes later (177.59 KB, text/plain)
2015-03-05 14:13 UTC, szegad
debug patch (3.46 KB, patch)
2015-03-09 07:58 UTC, Zhang Rui
fan full speed, system is cool (127.25 KB, text/plain)
2015-03-10 10:10 UTC, szegad
dmesg after long suspend, fan is full speed, system cool (145.59 KB, text/plain)
2015-03-10 15:29 UTC, szegad
Patch to avoid racing problem when doing thermal update(by move trend calculation into each thermal_instance) (8.27 KB, application/octet-stream)
2015-09-22 10:48 UTC, Chen Yu
V6-0001-Thermal-initialize-thermal-zone- (4.96 KB, application/octet-stream)
2015-10-22 08:00 UTC, Chen Yu
Thermal-handle-thermal-zone-device- (3.99 KB, application/octet-stream)
2015-10-22 08:01 UTC, Chen Yu
V6-0003-Thermal-do-thermal-zone-update-after-a-cooling-devic (3.32 KB, application/octet-stream)
2015-10-22 08:01 UTC, Chen Yu

Description Matthias 2015-01-16 11:57:54 UTC
Created attachment 163591 [details]
dmesg after resume from supend to ram

Hello,

after suspend to RAM the fan of my HP NW9440 always spins at full speed.

Last known good version: 3.17.7
First known bad version: 3.18.0
Last known bad version: 3.18.2

Behaviour: After suspend to RAM the fan spins at full speed
Expected behaviour: Fan speed should be picked according to the current cpu and gpu temperatures.

Steps to reproduce: Suspend to ram and then resume.
Comment 1 Matthias 2015-01-16 11:59:03 UTC
Created attachment 163601 [details]
System environment
Comment 2 Matthias 2015-01-16 11:59:49 UTC
Created attachment 163611 [details]
kernel configuration
Comment 3 Matthias 2015-01-16 12:00:12 UTC
Created attachment 163621 [details]
lspci
Comment 4 Matthias 2015-01-16 12:00:47 UTC
Created attachment 163631 [details]
dmesg after resume from supend to ram
Comment 5 Len Brown 2015-01-27 01:28:35 UTC
can you show the system temperature using turbostat(1) before and after to verify that the fan is not running in response to a real change in temperature?
Comment 6 szegad 2015-01-27 16:42:14 UTC
Hello!
I can confirm this problem on my laptop, an HP 6830s running Fedora 21.
Kernel 3.17.8-300 is fine; kernel 3.18.3-201 causes this problem.
It's not a matter of real temperatures, because it happens even if the laptop is left in suspend for a long time (and gets absolutely cold). The CPU is 96-97% idle.
Turbostat doesn't work on my system.
A similar problem existed in earlier kernel versions, got fixed, and now it is back again.
Comment 7 Matthias 2015-01-28 16:14:16 UTC
Got one better. Did a git bisect.

Testcase was:
boot with init=/bin/bb
mount sysfs
echo mem > /sys/power/state
resume

Output from git bisect:

19593a1fb1f6718406afca5b867dab184289d406 is the first bad commit
commit 19593a1fb1f6718406afca5b867dab184289d406
Author: Aaron Lu <aaron.lu@intel.com>
Date:   Tue Nov 19 16:59:20 2013 +0800

    ACPI / fan: convert to platform driver
    
    Convert ACPI fan driver to a platform driver for the purpose of phasing
    out ACPI bus.
    
    Signed-off-by: Aaron Lu <aaron.lu@intel.com>
    Signed-off-by: Zhang Rui <rui.zhang@intel.com>

:040000 040000 0933047229261d8e4fe4d1614377ec55b2459f82 b19fc40556b8e55a0c591ac00cd4af4b938f73dc M      drivers


git bisect start
# good: [bfe01a5ba2490f299e1d2d5508cbbbadd897bbe9] Linux 3.17
git bisect good bfe01a5ba2490f299e1d2d5508cbbbadd897bbe9
# bad: [b2776bf7149bddd1f4161f14f79520f17fc1d71d] Linux 3.18
git bisect bad b2776bf7149bddd1f4161f14f79520f17fc1d71d
# good: [754c780953397dd5ee5191b7b3ca67e09088ce7a] Merge branch 'for-v3.18' of git://git.linaro.org/people/mszyprowski/linux-dma-mapping
git bisect good 754c780953397dd5ee5191b7b3ca67e09088ce7a
# good: [2d65a9f48fcdf7866aab6457bc707ca233e0c791] Merge branch 'drm-next' of git://people.freedesktop.org/~airlied/linux
git bisect good 2d65a9f48fcdf7866aab6457bc707ca233e0c791
# bad: [88e237610b426897f0e9935adb6a60bd38bfe6c6] Merge tag 'armsoc-for-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/arm/arm-soc
git bisect bad 88e237610b426897f0e9935adb6a60bd38bfe6c6
# skip: [8a5de18239e418fe7b1f36504834689f754d8ccc] Merge tag 'kvm-arm-for-3.18-take-2' of git://git.kernel.org/pub/scm/linux/kernel/git/kvmarm/kvmarm
git bisect skip 8a5de18239e418fe7b1f36504834689f754d8ccc
# good: [1c2150283cae895526d0db3953d13d139f4e7a03] ext4: convert ext4_bread() to use the ERR_PTR convention
git bisect good 1c2150283cae895526d0db3953d13d139f4e7a03
# good: [c5bbcb5822b25c9f738db98e6d6ad2506cab8136] cxgb4i: Remove duplicate call to dst_neigh_lookup()
git bisect good c5bbcb5822b25c9f738db98e6d6ad2506cab8136
# skip: [0a582821d4f8edf41d9b56ae057ee2002fc275f0] Merge tag 'fbdev-3.18' of git://git.kernel.org/pub/scm/linux/kernel/git/tomba/linux
git bisect skip 0a582821d4f8edf41d9b56ae057ee2002fc275f0
# good: [a687ecaf50f18329206c6b78764a8c7bd30a9df0] ceph: export ceph_session_state_name function
git bisect good a687ecaf50f18329206c6b78764a8c7bd30a9df0
# good: [0ef090151345e693bd9b50c9b7aaf34ae5e9cac3] MAINTAINERS: add atmel ssc driver maintainer entry
git bisect good 0ef090151345e693bd9b50c9b7aaf34ae5e9cac3
# good: [3d32e4dbe71374a6780eaf51d719d76f9a9bf22f] kvm: fix excessive pages un-pinning in kvm_iommu_map error path.
git bisect good 3d32e4dbe71374a6780eaf51d719d76f9a9bf22f
# bad: [1c45d9a920e6ef4fce38921e4fc776c2abca3197] Merge tag 'pm+acpi-3.18-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm
git bisect bad 1c45d9a920e6ef4fce38921e4fc776c2abca3197
# good: [816fb4175c29b16948fb24a92053bea1e79908cc] Merge tag 'remove-weak-declarations' of git://git.kernel.org/pub/scm/linux/kernel/git/helgaas/pci
git bisect good 816fb4175c29b16948fb24a92053bea1e79908cc
# good: [a91e99e27a683608d221fb18b70d7de9d801de4a] Merge branches 'pm-cpuidle' and 'pm-cpufreq'
git bisect good a91e99e27a683608d221fb18b70d7de9d801de4a
# bad: [4384b8fe162d8aa03905d02073707bcf364cc7ce] Thermal: introduce int3403 thermal driver
git bisect bad 4384b8fe162d8aa03905d02073707bcf364cc7ce
# good: [8dd41f78adebb57909cccb0272e74c79e38b5238] ACPI / fan: remove no need check for device pointer
git bisect good 8dd41f78adebb57909cccb0272e74c79e38b5238
# bad: [9519a6356cbf63b1f22a7a208385dc56092c8b7d] ACPI / Fan: add ACPI 4.0 style fan support
git bisect bad 9519a6356cbf63b1f22a7a208385dc56092c8b7d
# bad: [19593a1fb1f6718406afca5b867dab184289d406] ACPI / fan: convert to platform driver
git bisect bad 19593a1fb1f6718406afca5b867dab184289d406
# good: [2bb3a2bf9939f3361e25045f4ef7b136b864c3b8] ACPI / fan: use acpi_device_xxx_power instead of acpi_bus equivelant
git bisect good 2bb3a2bf9939f3361e25045f4ef7b136b864c3b8
# first bad commit: [19593a1fb1f6718406afca5b867dab184289d406] ACPI / fan: convert to platform driver
Comment 8 Matthias 2015-01-31 08:19:41 UTC
Turbostat does not work for me either. Gkrellm shows a temperature of 35°C after resuming for both cpu cores and gpu. Idle temperature lies at about 42°C for this machine. The highest fan speed is reached at 80°C. At normal room temperature (20°C) and at highest load, this fan speed is never reached. The last but one fan speed suffices to keep the temperature at 78°C under heavy load. 
On a side note, my Lenovo X220 is not affected.
I attached an acpidump of my HP NW9440.

The skipped git bisect steps did not compile.

cat /proc/cpuinfo 
processor       : 0
vendor_id       : GenuineIntel
cpu family      : 6
model           : 15
model name      : Intel(R) Core(TM)2 CPU         T7400  @ 2.16GHz
stepping        : 6
microcode       : 0xd1
cpu MHz         : 2167.000
cache size      : 4096 KB
physical id     : 0
siblings        : 2
core id         : 0
cpu cores       : 2
apicid          : 0
initial apicid  : 0
fpu             : yes
fpu_exception   : yes
cpuid level     : 10
wp              : yes
flags           : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx lm constant_tsc arch_perfmon pebs bts rep_good nopl aperfmperf pni dtes64 monitor ds_cpl vmx est tm2 ssse3 cx16 xtpr pdcm lahf_lm dtherm tpr_shadow
bugs            :
bogomips        : 4322.51
clflush size    : 64
cache_alignment : 64
address sizes   : 36 bits physical, 48 bits virtual
power management:

processor       : 1
vendor_id       : GenuineIntel
cpu family      : 6
model           : 15
model name      : Intel(R) Core(TM)2 CPU         T7400  @ 2.16GHz
stepping        : 6
microcode       : 0xd1
cpu MHz         : 2167.000
cache size      : 4096 KB
physical id     : 0
siblings        : 2
core id         : 1
cpu cores       : 2
apicid          : 1
initial apicid  : 1
fpu             : yes
fpu_exception   : yes
cpuid level     : 10
wp              : yes
flags           : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx lm constant_tsc arch_perfmon pebs bts rep_good nopl aperfmperf pni dtes64 monitor ds_cpl vmx est tm2 ssse3 cx16 xtpr pdcm lahf_lm dtherm tpr_shadow
bugs            :
bogomips        : 4322.51
clflush size    : 64
cache_alignment : 64
address sizes   : 36 bits physical, 48 bits virtual
power management:
Comment 9 Matthias 2015-01-31 08:21:23 UTC
Created attachment 165351 [details]
acpidump NW9440
Comment 10 Zhang Rui 2015-02-09 05:49:17 UTC
Matthias,

(In reply to Matthias from comment #7)
> Got one better. Did a git bisect.
> 
> Testcase was:
> boot with init=/bin/bb
> mount sysfs
> echo mem > /sys/power/state
> resume
> 
> Output from git bisect:
> 
> 19593a1fb1f6718406afca5b867dab184289d406 is the first bad commit
> commit 19593a1fb1f6718406afca5b867dab184289d406
> Author: Aaron Lu <aaron.lu@intel.com>
> Date:   Tue Nov 19 16:59:20 2013 +0800
> 
>     ACPI / fan: convert to platform driver
>     
>     Convert ACPI fan driver to a platform driver for the purpose of phasing
>     out ACPI bus.
>     
>     Signed-off-by: Aaron Lu <aaron.lu@intel.com>
>     Signed-off-by: Zhang Rui <rui.zhang@intel.com>
> 

does the problem still exist if you revert this patch?

thanks,
rui
Comment 11 amish 2015-02-13 06:58:39 UTC
The bug linked below reports a similar fan issue that has been appearing since 3.18 (it worked up to 3.17).

The only difference is that the fan starts running at full speed somewhere in the middle of the boot sequence, i.e. there is no need to suspend and resume.

https://bugzilla.kernel.org/show_bug.cgi?id=92431

Related ARCH forum thread
https://bbs.archlinux.org/viewtopic.php?id=192255
Comment 12 szegad 2015-02-14 08:52:18 UTC
I can confirm that taking drivers/acpi/fan.c from Fedora's kernel-3.17.8-300.fc21, putting it into Fedora's kernel-3.18.6-200.fc21, and compiling and running that kernel SOLVES the problem of the fan running at full speed after resume from sleep.
Laptop HP 6820s.
Comment 13 szegad 2015-02-14 08:52:50 UTC
I've attached the working drivers/acpi/fan.c
Comment 14 szegad 2015-02-14 08:54:36 UTC
Created attachment 166781 [details]
working drivers/acpi/fan.c from kernel 3.17.8
Comment 15 amish 2015-02-14 10:22:30 UTC
I am not a kernel or hardware expert, but looking at the difference between the 3.17 and 3.19 source (drivers/acpi/fan.c), it appears that the 3.19 kernel now differentiates between two types of fans: acpi4 and non-acpi4.

Maybe the detection of the fan type goes wrong somewhere, and as a side effect the fans run at full speed?

I don't know how to detect whether our fan is really acpi4 or non-acpi4, nor how to find out which type the kernel actually detected (guessed).

If someone can tell me a way to find this out, I can check it.
Comment 16 szegad 2015-02-14 10:26:44 UTC
I've also tried to generate from git the series of patches bringing fan.c from 3.17.8 to 3.18.6:
looking at this history:
https://github.com/torvalds/linux/commits/master/drivers/acpi/fan.c

and generating the patches from git repo:
git format-patch -8 bbb16fef19122ec9f20fb865c45375e12f85d2a1 fan.c
It creates:
0001-ACPI-fan-printk-replacement.patch
0002-ACPI-fan-remove-unused-macro.patch
0003-ACPI-fan-remove-no-need-check-for-device-pointer.patch
0004-ACPI-fan-use-acpi_device_xxx_power-instead-of-acpi_b.patch
0005-ACPI-fan-convert-to-platform-driver.patch
0006-ACPI-Fan-add-ACPI-4.0-style-fan-support.patch
0007-ACPI-Fan-support-INT3404-thermal-device.patch
0008-ACPI-Fan-Use-bus-id-as-the-name-for-non-PNP0C0B-Fan-.patch

1.
I tried to move forward from 3.17.8.
Git applied 0001-ACPI-fan-printk-replacement.patch cleanly
and failed on 0002-ACPI-fan-remove-unused-macro.patch
(it seems that 0002-ACPI-fan-remove-unused-macro.patch is looking for "printk code" removed in 0001-ACPI-fan-printk-replacement.patch).
I did some manual merging, but it became harder with every patch, so I tried a different approach.

2.
I tried reversing the patches down from 3.18.6.
0008-ACPI-Fan-Use-bus-id-as-the-name-for-non-PNP0C0B-Fan-.patch is a later patch, so I skipped it.
0007-ACPI-Fan-support-INT3404-thermal-device.patch reversed cleanly.
0006-ACPI-Fan-add-ACPI-4.0-style-fan-support.patch did not reverse cleanly; there were a lot of problems.

Now I'm stuck. I'm probably doing something wrong. If you have any hints on how to prove that this bug is gone without ACPI-fan-convert-to-platform-driver.patch, please tell me.
Comment 17 szegad 2015-02-14 10:31:31 UTC
(In reply to amish from comment #15)
> I am not a kernel or hardware expert, but looking at the difference between
> the 3.17 and 3.19 source (drivers/acpi/fan.c), it appears that the 3.19 kernel
> now differentiates between two types of fans: acpi4 and non-acpi4.
You're probably talking about this commit:
https://github.com/torvalds/linux/commit/9519a6356cbf63b1f22a7a208385dc56092c8b7d
which is add-ACPI-4.0-style-fan-support.patch.
It was added AFTER ACPI-fan-convert-to-platform-driver, which seems to be the problem that Matthias nailed down using git bisect.
I've tried to prove that it is indeed the problem, but failed (see comment #16).
Otherwise your guess seems legit. We just need a way to prove it.
Comment 18 amish 2015-02-14 10:43:15 UTC
(In reply to bojanvuk from comment #16)
> 2.
> I tried reversing the patches down from 3.18.6.
> 0008-ACPI-Fan-Use-bus-id-as-the-name-for-non-PNP0C0B-Fan-.patch is a later
> patch, so I skipped it.
> 0007-ACPI-Fan-support-INT3404-thermal-device.patch reversed cleanly.
> 0006-ACPI-Fan-add-ACPI-4.0-style-fan-support.patch did not reverse cleanly;
> there were a lot of problems.
> 
> Now I'm stuck. I'm probably doing something wrong. If you have any hints on
> how to prove that this bug is gone without
> ACPI-fan-convert-to-platform-driver.patch, please tell me.

Sometimes commits are related to each other, so I think you need to reverse all patches committed on Oct 10, 2014 (6 in total).
Comment 19 szegad 2015-02-14 10:45:19 UTC
Maybe, but it won't do any good, because we already know that the bug is probably somewhere among the patches applied on Oct 10, 2014 :)
Comment 20 amish 2015-02-14 11:01:35 UTC
Can you modify the following function in the source code when you compile the kernel, and check which variant does NOT cause the issue?

drivers/acpi/fan.c (line 228)

static bool acpi_fan_is_acpi4(struct acpi_device *device)
{
        return true; //add this line here
        return acpi_has_method(device->handle, "_FIF") &&
        ...
}

First try with "return true;" as the result, and then try again with "return false;".

If the issue still persists, that means the fan type detection code is not at fault and the issue is somewhere else.
Comment 21 Aaron Lu 2015-02-15 05:30:32 UTC
To verify if the said commit is the culprit, you can do so:
$ cd /your/linux/git/tree
$ git reset --hard 19593a1fb1f6718406afca5b867dab184289d406
Build kernel and test;
$ git reset --hard HEAD~1
Build kernel and test again.
Comment 22 szegad 2015-02-15 12:15:02 UTC
(In reply to amish from comment #20)
> If the issue still persists, that means the fan type detection code is not at
> fault and the issue is somewhere else.
Yes, the issue is somewhere else.
return false = everything is like it was before
return true = can't even reset the fan by hand (/sys/class/thermal/cooling_deviceN)
Comment 23 szegad 2015-02-15 19:26:52 UTC
Thank you, Aaron.

That's right. With a kernel built one commit before 19593a1fb1f6718406afca5b867dab184289d406, the fan works as expected.
Comment 24 Radek Podgorny 2015-02-15 23:01:25 UTC
See https://bugzilla.kernel.org/show_bug.cgi?id=93301 for the bug report I've just created. It may (or may not) be a somewhat connected issue.
Comment 25 Aaron Lu 2015-02-16 03:02:50 UTC
(In reply to szegad from comment #23)
> Thank you, Aaron.
> 
> That's right. With a kernel built one commit before
> 19593a1fb1f6718406afca5b867dab184289d406, the fan works as expected.

And the kernel with the 19593a1fb1f6718406afca5b867dab184289d406 commit doesn't work, right?

And your acpidump please:
# acpidump > acpidump.txt
Comment 26 szegad 2015-02-16 11:00:06 UTC
Created attachment 167091 [details]
acpidump kernel 3.18.6

Right, it doesn't
Comment 27 Aaron Lu 2015-02-17 05:13:53 UTC
With the culprit commit, please show me the output of:
# ls -l /sys/bus/platform/drivers/acpi-fan
...
# ls -l /sys/bus/platform/devices/PNP0C0B\:*/
...
# grep . /sys/bus/platform/devices/PNP0C0B\:*/firmware_node/*
...

And then, for each of the cooling devices:
/sys/bus/platform/devices/PNP0C0B:XX/thermal_cooling, does writing the values 1 and 0 to the cur_state file make the fan spin and stop?
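
A minimal sketch of that per-fan check, for anyone who wants to script it. It assumes each PNP0C0B:XX/thermal_cooling directory exposes a writable cur_state file as described above; the function names toggle_fan and toggle_all_fans and the 5 second pause are made up for illustration and are not part of this report.

```shell
#!/bin/sh
# Sketch only: toggle_fan and toggle_all_fans are hypothetical helper names.

# Turn one cooling device on, wait, then turn it off again.
# $1: thermal_cooling directory of one fan, $2: seconds to wait (default 5).
toggle_fan() {
    state_file="$1/cur_state"
    pause=${2:-5}
    if [ ! -w "$state_file" ]; then
        echo "skipping $1: cur_state is not writable" >&2
        return 0
    fi
    echo 1 > "$state_file"   # the fan should start spinning
    sleep "$pause"
    echo 0 > "$state_file"   # the fan should stop
}

# Walk every PNP0C0B fan below $1 (default: the platform bus device directory).
toggle_all_fans() {
    root=${1:-/sys/bus/platform/devices}
    for dev in "$root"/PNP0C0B:*/thermal_cooling; do
        [ -d "$dev" ] || continue
        echo "testing $dev"
        toggle_fan "$dev" "${2:-5}"
    done
}
```

Run as root, calling toggle_all_fans with no arguments walks the real sysfs tree; listen to the fan while each device is reported.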
Comment 28 szegad 2015-02-17 08:57:40 UTC
Created attachment 167231 [details]
# ls -l /sys/bus/platform/drivers/acpi-fan
Comment 29 szegad 2015-02-17 08:58:26 UTC
Created attachment 167241 [details]
ls -l /sys/bus/platform/devices/PNP0C0B\:*/
Comment 30 szegad 2015-02-17 08:59:07 UTC
Created attachment 167271 [details]
# grep . /sys/bus/platform/devices/PNP0C0B\:*/firmware_node/*
Comment 31 szegad 2015-02-17 09:00:56 UTC
/sys/bus/platform/devices/PNP0C0B:XX/thermal_cooling

PNP0C0B:00 works
PNP0C0B:01 works
PNP0C0B:02 works
PNP0C0B:03 works
PNP0C0B:04 does nothing
PNP0C0B:05 works and it sounds like going full speed just like after the resume from sleep
PNP0C0B:06 works
PNP0C0B:07 does nothing

Question:
How is this different from accessing /sys/class/thermal/cooling_deviceN?
There I have cooling_device0 through cooling_device10.
Comment 32 Manuel Krause 2015-02-18 22:04:17 UTC
Can one of you please have a look at bug 78201 and check whether we have a duplicate here? Rui?

Thanks,
Manuel
Comment 33 Aaron Lu 2015-02-25 06:35:27 UTC
(In reply to szegad from comment #31)
> /sys/bus/platform/devices/PNP0C0B:XX/thermal_cooling
> 
> PNP0C0B:00 works
> PNP0C0B:01 works
> PNP0C0B:02 works
> PNP0C0B:03 works
> PNP0C0B:04 does nothing
> PNP0C0B:05 works and it sounds like going full speed just like after the
> resume from sleep
> PNP0C0B:06 works
> PNP0C0B:07 does nothing
> 
> Question:
> How is this different from accessing /sys/class/thermal/cooling_deviceN?
> There I have cooling_device0 through cooling_device10.

Sorry for the long delay, it's Chinese new year here.

PNP0C0B:0X corresponds to cooling_deviceX according to your attachment:
https://bugzilla.kernel.org/attachment.cgi?id=167241.

So the problem only occurs after resume, and then the thermal zone's temperature is still correct, only the FAN spins at full speed, right?

Please attach the following output before suspend and after resume:
# grep . /sys/class/thermal/thermal_zone*/*
Comment 34 szegad 2015-02-25 09:01:04 UTC
Created attachment 168211 [details]
(before suspend) grep . /sys/class/thermal/thermal_zone*/*
Comment 35 szegad 2015-02-25 09:01:47 UTC
Created attachment 168221 [details]
(after suspend) grep . /sys/class/thermal/thermal_zone*/*
Comment 36 szegad 2015-02-25 09:07:44 UTC
Happy New Year!

Files attached. 
What bothers me is thermal_zone2: its temp is 5000 before suspend and 100000 after resume (I left my laptop for an hour to cool down!). However, its critical trip point 0 is 110000.
What's more, after running my "cool all" script (echo 0 > /sys/class/thermal/cooling_deviceX/cur_state) the temp fell within 1-2 seconds to 8400 and then to 0. Then after some time it came back to 5000.
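
The "cool all" script itself was not attached. A sketch of what such a script might look like, based only on the echo command quoted above (the function name cool_all and the optional directory argument are assumptions):

```shell
#!/bin/sh
# Sketch only: cool_all is a hypothetical name for the "cool all" script.

# Request state 0 on every cooling device below $1
# (default: /sys/class/thermal). Must be run as root on real hardware.
cool_all() {
    root=${1:-/sys/class/thermal}
    for f in "$root"/cooling_device*/cur_state; do
        [ -w "$f" ] || continue          # skip missing or read-only entries
        echo 0 > "$f" || echo "could not reset $f" >&2
    done
}
```

Note that cooling_device* may also include non-fan devices, so this resets more than just the fans.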
Comment 37 Aaron Lu 2015-02-26 02:59:36 UTC
(In reply to szegad from comment #36)
> Happy New Year!

Thanks :-)

> 
> Files attached. 
> What bothers me is thermal_zone2: its temp is 5000 before suspend and
> 100000 after resume (I left my laptop for an hour to cool down!). However,
> its critical trip point 0 is 110000.

It doesn't matter: thermal_zone2 doesn't have any cooling device (i.e. fan) bound to it, so no matter what its temperature is, no cooling operation would occur.

> What's more, after running my "cool all" script (echo 0 >
> /sys/class/thermal/cooling_deviceX/cur_state) the temp fell within 1-2
> seconds to 8400 and then to 0. Then after some time it came back to 5000.

After resume, before you run any script, can you please check the cur_state of the cooling_device{0-7}? i.e. /sys/class/thermal/cooling_device?/cur_state. Since the fan is spinning at full speed, I suppose some of them should be set to 1?

BTW, please attach the output of:
ls -l /sys/class/thermal/thermal_zone*/
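
One way to collect the cur_state values requested above is to snapshot them before suspend and again after resume, then compare the two. A sketch, where snapshot_cur_state and the file names in the note below are hypothetical:

```shell
#!/bin/sh
# Sketch only: snapshot_cur_state is a hypothetical helper name.

# Print "<device>/cur_state <value>" for every cooling device below $1
# (default: /sys/class/thermal), one per line.
snapshot_cur_state() {
    root=${1:-/sys/class/thermal}
    for f in "$root"/cooling_device*/cur_state; do
        [ -r "$f" ] || continue
        printf '%s %s\n' "${f#"$root"/}" "$(cat "$f")"
    done
}
```

For example: run `snapshot_cur_state > before.txt`, suspend and resume, run `snapshot_cur_state > after.txt`, then `diff before.txt after.txt` to see which devices changed state.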
Comment 38 szegad 2015-02-26 17:46:43 UTC
Created attachment 168321 [details]
(before suspend) find -L /sys/class/thermal/ -maxdepth 2 -name "cur_state" -print -exec cat {} \;
Comment 39 szegad 2015-02-26 17:47:30 UTC
Created attachment 168331 [details]
(after) find -L /sys/class/thermal/ -maxdepth 2 -name "cur_state" -print -exec cat {} \;
Comment 40 szegad 2015-02-26 17:49:55 UTC
Created attachment 168341 [details]
ls -l /sys/class/thermal/thermal_zone*/
Comment 41 Aaron Lu 2015-02-27 02:41:15 UTC
Every fan's cur_state is set to 1 after resume, i.e. all fans are turned on. I think this is related to the thermal core; please run this after boot:
# echo 'module thermal_sys +fp' > /sys/kernel/debug/dynamic_debug/control

And then do a suspend-resume, attach the dmesg, thanks.
Comment 42 szegad 2015-02-27 09:15:51 UTC
Created attachment 168381 [details]
dmesg with debug
Comment 43 Manuel Krause 2015-02-27 14:18:13 UTC
(In reply to Aaron Lu from comment #41)
> Every fan's cur_state is set to 1 after resume, i.e. all fans are turned on. I
> think this is related to the thermal core; please run this after boot:
> # echo 'module thermal_sys +fp' > /sys/kernel/debug/dynamic_debug/control
> 
> And then do a suspend-resume, attach the dmesg, thanks.

Can you, Aaron Lu, please, perhaps, risk a little look at BUG 78201 https://bugzilla.kernel.org/show_bug.cgi?id=78201
Title: "Lower fan speeds are forgotten after resume from ram/disk"

There you'd find many more dmesg logs from the past. Over several kernels. With several patches' alternatives to look at.

Maybe you could/ should even exchange your knowledge with Zhang Rui.

BR, Manuel
Comment 44 Aaron Lu 2015-02-28 08:06:15 UTC
[  187.823166] update_temperature: thermal thermal_zone0: last_temperature=48000, current_temperature=48000
[  187.823172] thermal_zone_trip_update: thermal thermal_zone0: Trip1[type=0,temp=105000]:trend=0,throttle=0
[  187.823193] get_target_state: thermal cooling_device7: cur_state=1
[  187.823197] thermal_zone_trip_update: thermal cooling_device7: old_target=-1, target=-1
[  187.823202] thermal_zone_trip_update: thermal thermal_zone0: Trip2[type=0,temp=70000]:trend=0,throttle=0
[  187.823256] get_target_state: thermal cooling_device5: cur_state=1
[  187.823259] thermal_zone_trip_update: thermal cooling_device5: old_target=-1, target=-1
[  187.823264] thermal_zone_trip_update: thermal thermal_zone0: Trip3[type=0,temp=60000]:trend=0,throttle=0
[  187.823316] get_target_state: thermal cooling_device6: cur_state=1
[  187.823320] thermal_zone_trip_update: thermal cooling_device6: old_target=-1, target=-1

So the cur_state of cooling_device[5-7] is 1, which means the fan devices are all turned on after resume. This is done by the platform bus's power domain callback (acpi_dev_resume_early->acpi_dev_pm_full_power), since the fan device is a platform device now. After that, thermal_zone0's temperature keeps dropping, so no trip point is ever crossed and the thermal zone is never updated again. There is a problem in get_target_state for the trend = RAISE or STABLE case, where the cooling device's state is not properly set. If we set a polling interval for this thermal zone, the trend=DROP case will occur, and I think that should cure the problem. This can be verified by adding thermal.tzp=300 to the kernel cmdline; can you please check this?
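
To pick the affected devices out of a long debug dmesg, something like the following might help. It is only a sketch that assumes the log lines look exactly like the excerpt above; stuck_fans is a made-up name, and reading target=-1 as "no target requested" is an interpretation of the excerpt, not something confirmed in this report.

```shell
#!/bin/sh
# Sketch only: stuck_fans is a hypothetical helper name.

# Read thermal debug output (from files given as arguments, or stdin) and
# print each cooling device that reports cur_state=1 while the following
# thermal_zone_trip_update line for it shows target=-1.
stuck_fans() {
    awk '
        /get_target_state:/ {
            dev = ""
            for (i = 1; i <= NF; i++) {
                if ($i ~ /^cooling_device[0-9]+:$/) dev = substr($i, 1, length($i) - 1)
                if ($i ~ /^cur_state=/ && dev != "") state[dev] = substr($i, 11)
            }
        }
        /thermal_zone_trip_update:/ {
            dev = ""; target = ""
            for (i = 1; i <= NF; i++) {
                if ($i ~ /^cooling_device[0-9]+:$/) dev = substr($i, 1, length($i) - 1)
                if ($i ~ /^target=/) target = substr($i, 8)
            }
            # Report each device once: switched on, but no target state set.
            if (dev != "" && state[dev] == "1" && target == "-1" && !(dev in reported)) {
                reported[dev] = 1
                print dev
            }
        }
    ' "$@"
}
```

Usage would be along the lines of `dmesg | stuck_fans`; fed the excerpt above, it should list cooling_device7, cooling_device5 and cooling_device6.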
Comment 45 Matthias 2015-02-28 09:03:45 UTC
Created attachment 168421 [details]
My tests with linux-3.19.0

I had the time to do some testing. My findings are attached.
Comment 46 szegad 2015-02-28 16:07:31 UTC
That's right, Aaron: when I resumed, the fan started to spin at full speed, but
after a couple of seconds it went down!!!
Comment 47 szegad 2015-02-28 16:19:03 UTC
...however it worked 2 times out of 3.
Comment 48 Matthias 2015-02-28 16:32:38 UTC
The current fan state changes from resume to resume, as can be seen below in comparison to comment 45. The output differs from the one after the first resume. I will keep an eye on it.

/sys/class/thermal/cooling_device0/cur_state:
1
/sys/class/thermal/cooling_device1/cur_state:
1
/sys/class/thermal/cooling_device2/cur_state:
1
/sys/class/thermal/cooling_device3/cur_state:
1
/sys/class/thermal/cooling_device4/cur_state:
1
/sys/class/thermal/cooling_device5/cur_state:
1
/sys/class/thermal/cooling_device6/cur_state:
1
/sys/class/thermal/cooling_device7/cur_state:
1
/sys/class/thermal/cooling_device8/cur_state:
1
/sys/class/thermal/cooling_device9/cur_state:
1
/sys/class/thermal/cooling_device10/cur_state:
1
/sys/class/thermal/cooling_device11/cur_state:
0
/sys/class/thermal/cooling_device12/cur_state:
0
/sys/class/thermal/cooling_device13/cur_state:
0

Fan speed for each cooling device:
cooling_device0 100%
cooling_device1 70%
cooling_device2 60%
cooling_device3 40%
cooling_device4 25%
cooling_device5 100%
cooling_device6 70%
cooling_device7 60%
cooling_device8 40%
cooling_device9 25%
The following fan speeds seem to be the lowest fan speed, so I can't really tell whether the fan speed differs between these four devices.
cooling_device10 20%
cooling_device11 20%
cooling_device12 20%
cooling_device13 20%
Comment 49 Aaron Lu 2015-03-03 05:15:05 UTC
(In reply to szegad from comment #47)
> ...however it worked 2 times out of 3.

For the case it doesn't work, can you please attach the debug dmesg? I wonder if the polling for thermal_zone0 stopped?
Comment 50 szegad 2015-03-03 08:07:06 UTC
Created attachment 168661 [details]
tzp=300, dmesg just afer resume, when tzp works
Comment 51 szegad 2015-03-03 08:07:58 UTC
Created attachment 168671 [details]
tzp=300, dmesg after fan down, when tzp works
Comment 52 szegad 2015-03-03 08:08:29 UTC
Created attachment 168681 [details]
tzp=300, dmesg just afer resume, when tzp doesn't work
Comment 53 szegad 2015-03-03 08:09:10 UTC
Created attachment 168691 [details]
tzp=300, dmesg just after 30s, when tzp doesn't work
Comment 54 szegad 2015-03-03 08:10:03 UTC
Created attachment 168701 [details]
tzp=300, dmesg after fan down, when tzp doesn't work initially
Comment 55 szegad 2015-03-03 08:15:21 UTC
Ok, so here it goes.
Case 1: TZP works.
Just after resume, when fan is full speed:
https://bugzilla.kernel.org/attachment.cgi?id=168661
30s after resume, when the fan is down
https://bugzilla.kernel.org/attachment.cgi?id=168671

Case 2: TZP doesn't work (initially)
Just after resume, when fan is full speed:
https://bugzilla.kernel.org/attachment.cgi?id=168681
30s after resume, fan is still at full speed
https://bugzilla.kernel.org/attachment.cgi?id=168691
After some time and another TZP check I guess, it finally goes down:
https://bugzilla.kernel.org/attachment.cgi?id=168701

The difference between these two cases is that in case 1 I resumed the system after a few seconds, when it was still warm, but in case 2 I left it sleeping to cool down completely.
Comment 56 Aaron Lu 2015-03-03 08:56:34 UTC
Thanks szegad, that's very useful. The fan will slow down when thermal_zone0's temperature goes down. In case 2, thermal_zone0's temperature keeps rising for a long time, which is pretty surprising since the fan is spinning at full speed. Anyway, this just verifies that the problem is indeed due to the thermal zone's handling of the fan cooling devices.

Rui has some patches that should handle this situation.

Rui,
can you please give a pointer to your patches?
Comment 57 Zhang Rui 2015-03-04 02:57:39 UTC
yes, please check if the patches at
https://bugzilla.kernel.org/show_bug.cgi?id=78201#c142
https://bugzilla.kernel.org/show_bug.cgi?id=78201#c143
work for you or not.
Note, they are based on 4.0-rc2.
Comment 58 Manuel Krause 2015-03-04 23:23:09 UTC
If you're running 3.19.0 and "lazy" you can try my unofficial ones from:
https://bugzilla.kernel.org/show_bug.cgi?id=78201#c145 and
https://bugzilla.kernel.org/show_bug.cgi?id=78201#c146
(they only differ in _one_ context line from Rui's original ones)

They work for me from bootup over hibernation (disk) or sleep (RAM).

Maybe, they don't work after very first boot with the changed kernel. (https://bugzilla.kernel.org/show_bug.cgi?id=78201#c144)

Best regards,
Manuel
Comment 59 Matthias 2015-03-05 08:02:01 UTC
Patches work for my nw9440 with linux-4.0-rc2. Thanks!
Comment 60 szegad 2015-03-05 09:05:19 UTC
I'm still evaluating those patches on 3.18.7, some cases are ok, but I have to dig into others.
Should I do it on 4.0rc2 instead?
Comment 61 szegad 2015-03-05 13:53:56 UTC
Created attachment 169231 [details]
dmesg after long suspend
Comment 62 szegad 2015-03-05 14:07:47 UTC
When I suspend the system for a longer period of time and let it cool down completely, it still starts with the fan at full speed and doesn't want to step down even after a few minutes.
dmesg just after resume attached:
https://bugzilla.kernel.org/attachment.cgi?id=169231
Comment 63 szegad 2015-03-05 14:13:52 UTC
Created attachment 169241 [details]
resume after long suspend some minutes later

I tried to change the fan speed by changing the system load from low to high and back again, but it's still spinning at full speed.
Comment 64 Manuel Krause 2015-03-05 19:19:12 UTC
(In reply to szegad from comment #63)
> 
> I tried to change the fan speed by changing the system load from low to high
> and back again, but it's still spinning at full speed.

I'm a bit confused: According to the logs cur_state of several cooling_devices are set to 0 or 1 and back when resuming and later with the temperature changes. So I assume the patches are applied correctly. Have you re-checked this? Or rebooted another time (last sentence of Comment 58) ?

With slightly changed patches I've now a 3.18.8 running well. And the 3.19.0 also managed the fan correctly after last night's hibernation cooldown. Between 3.18.8 and 3.19.0 there are not many significant changes to the affected files. Maybe this one can help?: https://github.com/torvalds/linux/commit/a940cb34fed73b2d4809a4575f2981d5927e2c21
It's in 3.19.0 but not yet in 3.18.8. I seem to not need it so far.

Best regards,
Manuel
Comment 65 Zhang Rui 2015-03-09 07:57:45 UTC
(In reply to szegad from comment #62)
> When I suspend the system for a longer period of time and let it cool down
> completely, it still starts with the fan at full speed and doesn't want to
> step down even after a few minutes.
> dmesg just after resume attached:
> https://bugzilla.kernel.org/attachment.cgi?id=169231

In the log, I see you've done three suspends, at 3860s, 5181s and 5200s. Does this include the one where you resumed with the system cool and the fan at full speed?
Comment 66 Zhang Rui 2015-03-09 07:58:58 UTC
Created attachment 169941 [details]
debug patch

Please apply this debug patch on top and re-do the test.
PS: if the problem is reproduced, please attach the dmesg and point out to me when the bug happens so that I can check the dmesg accordingly.
Comment 67 szegad 2015-03-10 07:36:25 UTC
(In reply to Zhang Rui from comment #65)
> In the log, I see you've done three suspends, at 3860s, 5181s and 5200s. Does
> this include the one where you resumed with the system cool and the fan at
> full speed?

It's the last one.

However, with the debug patch applied I can't reproduce this behaviour yet. It's strange, because it happened many times before. I'm still trying.
Comment 68 szegad 2015-03-10 10:10:37 UTC
Created attachment 170261 [details]
fan full speed, system is cool

Dmesg just after resume from suspend. System is cool, fan is running at full speed.
Applied patches https://bugzilla.kernel.org/show_bug.cgi?id=78201#c142
https://bugzilla.kernel.org/show_bug.cgi?id=78201#c143 ,
but NOT the debug patch https://bugzilla.kernel.org/attachment.cgi?id=169941 -> see comment https://bugzilla.kernel.org/show_bug.cgi?id=91411#c67.
Comment 69 szegad 2015-03-10 10:13:10 UTC
Rui,
 I can't reproduce the behaviour with the debug patch applied, but I can do it straight away with the last two patches live. It seems to me that this debug patch changes something important.
Comment 70 szegad 2015-03-10 15:29:21 UTC
Created attachment 170301 [details]
dmesg after long suspend, fan is full speed, system cool

both patches + debug patch
Comment 71 szegad 2015-03-10 15:30:07 UTC
Ok, I've managed to reproduce this. Last resume from suspend:
https://bugzilla.kernel.org/attachment.cgi?id=170301
Comment 72 Manuel Krause 2015-03-11 00:37:58 UTC
(In reply to szegad from comment #71)
> Ok, I've managed to reproduce this. Last resume from suspend:
> https://bugzilla.kernel.org/attachment.cgi?id=170301

Hi, szegad!

Are you absolutely sure with the last dmesg, that you've applied all three patches correctly and also verified that the intended code landed where it is supposed to, before making the kernel?

In this most recent log you don't even get a 
"last_temperature N/A, current_temperature=....." for any of your suspends/resumes, which is normally a significant output from Rui's new patches. It seems to me that the patches are not applied at all.

Please, be so kind, to re-check safely and maybe re-build your kernel.

Excuse me, I definitely don't want to bother you, but this bug is lasting so long now, that I don't like any possible "false negatives" any more.

Regards, Manuel
Comment 73 szegad 2015-03-11 08:34:11 UTC
Yes, they're applied correctly. Otherwise I would have the fan spinning at 100% after resume all the time.
Comment 74 szegad 2015-03-11 09:17:14 UTC
Manuel,
Anyway - look at this:
[ 6149.022160] update_temperature: thermal thermal_zone1: last_temperature=23100, current_temperature=23100

it's in there.
Comment 75 Manuel Krause 2015-03-12 00:57:27 UTC
It is the term "last_temperature N/A, current_temperature=....."
................................^^^^^...
that should occur on each resume to invoke an update shortly.
With Rui's newest patches. It's the "N/A" that's missing in your logs.
Comment 76 Manuel Krause 2015-03-12 01:20:30 UTC
Rui, so szegad's THERMAL_TEMP_INVALID doesn't get set in the right place. So he may get into the "else" case in:

@@ -469,8 +476,12 @@ static void update_temperature(struct thermal_zone_device *tz)
 	mutex_unlock(&tz->lock);
 
 	trace_thermal_temperature(tz);
-	dev_dbg(&tz->device, "last_temperature=%d, current_temperature=%d\n",
-				tz->last_temperature, tz->temperature);
+	if (tz->last_temperature == THERMAL_TEMP_INVALID)
+		dev_dbg(&tz->device, "last_temperature N/A, current_temperature=%d\n",
+			tz->temperature);
+	else
+		dev_dbg(&tz->device, "last_temperature=%d, current_temperature=%d\n",
+			tz->last_temperature, tz->temperature);
 }
 
 void thermal_zone_device_update(struct thermal_zone_device *tz)
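
For readers following along, the branch in the hunk above can be sketched in user-space C. This is only an illustration and not the kernel code: the value used for THERMAL_TEMP_INVALID, the cut-down struct zone, and the helper format_temps() are assumptions made for this example. It shows why a log line containing "N/A" can only appear if last_temperature was reset to the sentinel before the update ran.

```c
#include <stdio.h>

/* Assumed sentinel meaning "no previous reading"; the value used by
 * the actual patch may differ. */
#define THERMAL_TEMP_INVALID (-274000)

/* Hypothetical stand-in for the two thermal_zone_device fields used here. */
struct zone {
	int last_temperature;
	int temperature;
};

/* Hypothetical helper that mirrors the two dev_dbg() branches of the hunk.
 * Writes the message into buf and returns snprintf()'s result. */
static int format_temps(const struct zone *tz, char *buf, size_t len)
{
	if (tz->last_temperature == THERMAL_TEMP_INVALID)
		return snprintf(buf, len,
				"last_temperature N/A, current_temperature=%d",
				tz->temperature);

	return snprintf(buf, len,
			"last_temperature=%d, current_temperature=%d",
			tz->last_temperature, tz->temperature);
}
```

With last_temperature left at an ordinary value such as 23100, the helper produces the "last_temperature=23100, current_temperature=23100" form seen in Comment 74, which is the "else" case.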
Comment 77 szegad 2015-03-12 09:17:18 UTC
Ok, I will verify this once again just to be 1000% sure.
Comment 78 szegad 2015-03-12 09:36:26 UTC
Ok, Manuel - you're right.
So Comments 70 & 71 are invalid, but Comment 69 is valid.

With these two patches applied I can get the fan at 100% after a cool system resume, but adding the debug patch seems to change something and I can't reproduce the faulty behaviour yet. I'm still trying anyway.
Comment 79 Manuel Krause 2015-03-12 21:42:36 UTC
@ szegad: Thank you very very much for rechecking and clarifying this and for your continued testing!

Now, it would be a fine addition if the guys whose fan(s) don't get adjusted after bootup (BUG 93301 and BUG 92431) also submitted positive test results with these patches for their issue. ;-)
Comment 80 szegad 2015-03-13 08:29:49 UTC
One thing bothers me: is this "debug patch" meant to fix a problem or it's just for debugging purposes and it fixes a bug only as a side effect?
Comment 81 Manuel Krause 2015-03-13 21:30:44 UTC
You definitely shouldn't let a misleading patch title bother you during the testing phase.
I'm no programmer, so with my limited expertise/ knowledge about mutex locking: This "debug" patch changes the way of locking in several places of the code, where trip_points, related devices and temperatures are read in from the system and/or reset. So, this is not for debugging purposes at all, but for fixing previous wrong or impossible access to the mentioned data.

As you're not reproducing the wrong behaviour so far, and I'm using this "debug" patch on top on my system, which didn't need it before, still with the desired behaviour as a result, applying all three patches would be the right way.

Rui, in case that my attempt to explain things is wrong or inaccurate, please correct me.
Comment 82 Zhang Rui 2015-03-14 09:29:12 UTC
Hi, Manuel,

thanks a lot for your effort on this bug.
First of all, yes, the debug patch does have some functional change. The reason I call it "debug patch" is that I've not found the root cause and this patch may help me debug further.
The debug patch actually makes thermal_zone_device_update() an atomic operation, because I've seen this symptom:
1. thermal core updates all thermal zones after resume
2. ACPI thermal driver also schedules a thermal_zone_device_update() work during resume, which is invoked while thermal core is updating the thermal zones.
3. the second thermal zone update pollutes the data used by the first one, and causes some problem.

Patch 3/10 in https://bugzilla.kernel.org/show_bug.cgi?id=78201#c149 should remove the redundant thermal zone update raised by ACPI thermal driver. But, IMO, I should also consider this debug patch as an upstream candidate.
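
The sequence in steps 1 to 3 can be illustrated with a small user-space sketch. It is not the debug patch itself: struct zone, zone_update(), the trend enum and the C11 atomic_flag used as a lock are all assumptions made for illustration, and the kernel uses its own locking on the thermal zone. The only point is that shifting the old reading, storing the new one and deriving the trend all happen under one lock, so a second update cannot interleave with the first.

```c
#include <stdatomic.h>

/* Hypothetical trend values for this sketch only. */
enum trend { TREND_STABLE, TREND_RAISING, TREND_DROPPING };

/* Hypothetical, cut-down thermal zone. */
struct zone {
	atomic_flag busy;	/* stand-in for the zone lock */
	int last_temperature;
	int temperature;
	enum trend trend;
};

/* Perform one complete update while holding the lock. */
static void zone_update(struct zone *tz, int new_temp)
{
	/* Spin until the flag was previously clear, i.e. we own the lock. */
	while (atomic_flag_test_and_set(&tz->busy))
		;

	/* The steps below must not be interleaved with another update. */
	tz->last_temperature = tz->temperature;
	tz->temperature = new_temp;

	if (tz->temperature > tz->last_temperature)
		tz->trend = TREND_RAISING;
	else if (tz->temperature < tz->last_temperature)
		tz->trend = TREND_DROPPING;
	else
		tz->trend = TREND_STABLE;

	atomic_flag_clear(&tz->busy);
}
```

Without the lock, a second caller could overwrite last_temperature between the first caller's store and its trend comparison, which is the kind of data pollution described in step 3.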
Comment 83 szegad 2015-03-14 13:03:44 UTC
 Thanks for the explanation. So I'm happy to announce my failure to reproduce the bug with all three patches applied.
 Do you need me to check something more?
Comment 84 Zhang Rui 2015-03-14 13:38:15 UTC
Please help me do the test mentioned in https://bugzilla.kernel.org/show_bug.cgi?id=78201#c150
Comment 85 szegad 2015-03-15 23:00:56 UTC
I'm testing with patches 1-4 and all is working fine. Now I'll apply the rest of the patches.
Comment 86 Zhang Rui 2015-03-16 13:40:16 UTC
please try the patches at https://bugzilla.kernel.org/show_bug.cgi?id=78201#c157 instead to see if they work for you or not.
Comment 87 Manuel Krause 2015-03-16 23:45:18 UTC
@szegad: You can safely test and give an additional "good positive". ;-)

Best regards and thank you for investing your time!
Comment 88 szegad 2015-03-17 22:57:22 UTC
Ok, testing patches from https://bugzilla.kernel.org/show_bug.cgi?id=78201#c157 gives no signs of a problem. I will do some more tomorrow, however I do not expect any trouble. Great!
Comment 89 szegad 2015-03-18 22:53:38 UTC
I can say that it's working for me. Thank you, Rui!
Comment 90 Zhang Rui 2015-03-24 07:19:45 UTC
Patches to fix the problem sent out.
https://patchwork.kernel.org/patch/6077231/
https://patchwork.kernel.org/patch/6077241/
https://patchwork.kernel.org/patch/6077251/

These three patches are necessary to fix the regression introduced by
commit 19593a1fb1f6718406afca5b867dab184289d406
Author: Aaron Lu <aaron.lu@intel.com>
Date:   Tue Nov 19 16:59:20 2013 +0800

    ACPI / fan: convert to platform driver
    
    Convert ACPI fan driver to a platform driver for the purpose of phasing
    out ACPI bus.
    
    Signed-off-by: Aaron Lu <aaron.lu@intel.com>
    Signed-off-by: Zhang Rui <rui.zhang@intel.com>

About the thermal_zone_device_update() locking fix, I found the patch still has a deadlock issue, i.e. an upstream fix is not ready at the moment, so I decided to push the above three patches upstream first, in order to be shipped in the 4.0 release.

For the locking issue, I would work on this as 4.1 material.

Mark the bug as Resolved as the regressions should be fixed by the above three patches.

If you want to track the thermal locking issue, please open another thread, or else I will send out the patch directly, in 4.1 release window.
Comment 91 Manuel Krause 2015-03-26 19:35:58 UTC
(In reply to Zhang Rui from comment #90)
...
> About the thermal_zone_device_update() locking fix, I found the patch still
> has a deadlock issue, i.e. an upstream fix is not ready at the moment, so I
> decided to push the above three patches upstream first, in order to be
> shipped in the 4.0 release.
> 
> For the locking issue, I would work on this as 4.1 material.
> 
> Mark the bug as Resolved as the regressions should be fixed by the above
> three patches.
> 
> If you want to track the thermal locking issue, please open another thread,
> or else I will send out the patch directly, in 4.1 release window.

What do you mean by "deadlock issue"? Is there a risk when having the patch applied?
Comment 92 Zhang Rui 2015-04-01 07:05:49 UTC
yes, there is.
As the problem fixed by that patch is not a regression (the problem actually exists from the beginning), I'd prefer to rework it and send it as 4.1 material.
Comment 93 Manuel Krause 2015-04-01 19:51:18 UTC
Please, don't request us to open another BUG for this. I think all of us on this thread here would be glad, if you would just report here, when you have new patches available for this issue in the 4.1 kernel queue. Could make it a bit easier for all participants.

Thanks,
Manuel
Comment 94 Zhang Rui 2015-04-07 05:32:15 UTC
sure, no problem.
Comment 95 Zhang Rui 2015-04-08 12:59:27 UTC
Hi, guy,

please help check if the patches at comment #183/#184/#185 in bug #78201 work for you or not.
As there are some functional changes, I need to make sure they have been tested before sending upstream.
Comment 96 Zhang Rui 2015-04-08 12:59:45 UTC
s/guy/guys
Comment 97 Matthias 2015-04-22 09:24:02 UTC
Your patches do work with linux-3.19.4 and 3.19.5. No problems here so far. 
Thank you Rui! Your work is greatly appreciated.
Comment 98 Zhang Rui 2015-09-01 05:27:47 UTC
Yu will take care of the patch sets and push for upstream.
Comment 99 Chen Yu 2015-09-22 10:48:49 UTC
Created attachment 188091 [details]
Patch to avoid racing problem when doing thermal update(by move trend calculation into each thermal_instance)

Hi, Matthias 
Can you please help me test the latest patch, which is supposed to fix a race problem in thermal management? This patch is not related to your bug, but I'd like to send it together with Rui's patches. Can you please help to test this patch on top of Rui's:
https://patchwork.kernel.org/patch/6166681/
https://patchwork.kernel.org/patch/6166691/
https://patchwork.kernel.org/patch/6166701/
and please enable thermal dynamic debug
to see if there are any fan problems. Thanks!
Comment 100 Len Brown 2015-09-29 14:47:40 UTC
*** Bug 92431 has been marked as a duplicate of this bug. ***
Comment 101 Matthias 2015-10-04 10:58:29 UTC
Sorry for getting back so late. I can't compile linux-4.1.9 with your Patch Chen. Got this error: 

In file included from drivers/thermal/thermal_core.c:34:0:
include/linux/thermal.h:348:14: warning: 'struct thermal_instance' declared inside parameter list
    int, enum thermal_trip_type);
              ^
include/linux/thermal.h:348:14: warning: its scope is only this definition or declaration, which is probably not what you want
  CC      drivers/tty/pty.o
drivers/thermal/thermal_core.c:192:5: error: conflicting types for 'get_instance_trend'
 int get_instance_trend(struct thermal_zone_device *tz,
     ^
In file included from drivers/thermal/thermal_core.c:34:0:
include/linux/thermal.h:347:5: note: previous declaration of 'get_instance_trend' was here
 int get_instance_trend(struct thermal_zone_device *, struct thermal_instance *,
     ^
In file included from include/linux/linkage.h:6:0,
                 from include/linux/kernel.h:6,
                 from include/linux/list.h:8,
                 from include/linux/module.h:9,
                 from drivers/thermal/thermal_core.c:28:
drivers/thermal/thermal_core.c:209:15: error: conflicting types for 'get_instance_trend'
 EXPORT_SYMBOL(get_instance_trend);
               ^
include/linux/export.h:57:21: note: in definition of macro '__EXPORT_SYMBOL'
  extern typeof(sym) sym;     \
                     ^
drivers/thermal/thermal_core.c:209:1: note: in expansion of macro 'EXPORT_SYMBOL'
 EXPORT_SYMBOL(get_instance_trend);
 ^
In file included from drivers/thermal/thermal_core.c:34:0:
include/linux/thermal.h:347:5: note: previous declaration of 'get_instance_trend' was here
 int get_instance_trend(struct thermal_zone_device *, struct thermal_instance *,
     ^
scripts/Makefile.build:258: recipe for target 'drivers/thermal/thermal_core.o' failed
make[2]: *** [drivers/thermal/thermal_core.o] Error 1
scripts/Makefile.build:403: recipe for target 'drivers/thermal' failed
make[1]: *** [drivers/thermal] Error 2

With linux-4.2.3 your patch gets rejected at the following part:

--- drivers/thermal/step_wise.c
+++ drivers/thermal/step_wise.c
@@ -154,11 +154,17 @@ static void thermal_zone_trip_update(struct thermal_zone_device *tz, int trip)
 		if (instance->trip != trip)
 			continue;
 
+		instance->temperature = tz->temperature;
+		instance_trend = get_instance_trend(tz, instance, trip_temp, trip_type);
 		old_target = instance->target;
-		instance->target = get_target_state(instance, trend, throttle);
+		if (instance_trend)
+			instance->target = get_target_state(instance, instance_trend, throttle);
+		else
+			instance->target = get_target_state(instance, trend, throttle);
 		dev_dbg(&instance->cdev->device, "old_target=%d, target=%d\n",
 					old_target, (int)instance->target);
 
+		instance->last_temperature = tz->temperature;
 		if (instance->initialized &&
 		    old_target == instance->target)
 			continue;
Comment 102 Chen Yu 2015-10-22 08:00:35 UTC
Created attachment 190821 [details]
V6-0001-Thermal-initialize-thermal-zone-
Comment 103 Chen Yu 2015-10-22 08:01:06 UTC
Created attachment 190831 [details]
Thermal-handle-thermal-zone-device-
Comment 104 Chen Yu 2015-10-22 08:01:51 UTC
Created attachment 190841 [details]
V6-0003-Thermal-do-thermal-zone-update-after-a-cooling-devic
Comment 105 Chen Yu 2015-10-22 08:02:24 UTC
Plz help test this version of patches, thanks!
Comment 106 Matt Devo 2015-11-02 16:11:55 UTC
(In reply to Chen Yu from comment #105)
> Plz help test this version of patches, thanks!

the patches in comments 102-104 fix the issue on my Haswell 2955U-based Asus ChromeBox when applied to kernel 4.3 (final) - thanks very much!
Comment 107 Martin Rejda 2015-11-05 13:06:45 UTC
I have an older notebook - HP 6510b with core2duo 9300T. I've tested the patches against archlinux kernel 4.2.5-1 and everything is working very well. No more full speed after resume. I hope that the patches will be merged soon.
Comment 108 Matthias 2015-11-13 09:10:34 UTC
I have been running kernel 4.2.5 and 4.2.6 with the patches for more than ten days now. No issues so far on my NW9440. Thanks!
Comment 109 Chen Yu 2016-01-25 02:50:20 UTC
Patch set has been merged, so close current thread as fixed.

commit bb431ba26c5cd0a17c941ca6c3a195a3a6d5d461
Author: Zhang Rui <rui.zhang@intel.com>
Date:   Fri Oct 30 16:31:47 2015 +0800

    Thermal: initialize thermal zone device correctly

commit ff140fea847e1c2002a220571ab106c2456ed252
Author: Zhang Rui <rui.zhang@intel.com>
Date:   Fri Oct 30 16:31:58 2015 +0800

    Thermal: handle thermal zone device properly during system sleep

commit 4511f7166a2deb5f7a578cf87fd2fe1ae83527e3
Author: Chen Yu <yu.c.chen@intel.com>
Date:   Fri Oct 30 16:32:10 2015 +0800

    Thermal: do thermal zone update after a cooling device registered
