Created attachment 163591 [details] dmesg after resume from suspend to RAM
Hello, after suspend to RAM the fan of my HP NW9440 always spins at full speed.
Last known good version: 3.17.7
First known bad version: 3.18.0
Last known bad version: 3.18.2
Behaviour: After suspend to RAM the fan spins at full speed.
Expected behaviour: Fan speed should be picked according to the current CPU and GPU temperatures.
Steps to reproduce: Suspend to RAM and then resume.
Created attachment 163601 [details] System environment
Created attachment 163611 [details] kernel configuration
Created attachment 163621 [details] lspci
Created attachment 163631 [details] dmesg after resume from suspend to RAM
can you show the system temperature using turbostat(1) before and after to verify that the fan is not running in response to a real change in temperature?
Hello! I can confirm this problem on my laptop HP 6830s running Fedora 21. Kernel 3.17.8-300 is fine, kernel 3.18.3-201 causes this problem. It's not a matter of real temperatures, because it shows up even if the laptop is left in suspend for a long time (and gets absolutely cold). The CPU is 96-97 % idle. Turbostat doesn't work on my system. A similar problem existed in earlier kernel versions, got fixed, and is now back again.
Got one better. Did a git bisect.
Testcase was:
boot with init=/bin/bb
mount sysfs
echo mem > /sys/power/state
resume
Output from git bisect:
19593a1fb1f6718406afca5b867dab184289d406 is the first bad commit
commit 19593a1fb1f6718406afca5b867dab184289d406
Author: Aaron Lu <aaron.lu@intel.com>
Date: Tue Nov 19 16:59:20 2013 +0800
    ACPI / fan: convert to platform driver
    Convert ACPI fan driver to a platform driver for the purpose of phasing out ACPI bus.
    Signed-off-by: Aaron Lu <aaron.lu@intel.com>
    Signed-off-by: Zhang Rui <rui.zhang@intel.com>
:040000 040000 0933047229261d8e4fe4d1614377ec55b2459f82 b19fc40556b8e55a0c591ac00cd4af4b938f73dc M drivers
git bisect start
# good: [bfe01a5ba2490f299e1d2d5508cbbbadd897bbe9] Linux 3.17
git bisect good bfe01a5ba2490f299e1d2d5508cbbbadd897bbe9
# bad: [b2776bf7149bddd1f4161f14f79520f17fc1d71d] Linux 3.18
git bisect bad b2776bf7149bddd1f4161f14f79520f17fc1d71d
# good: [754c780953397dd5ee5191b7b3ca67e09088ce7a] Merge branch 'for-v3.18' of git://git.linaro.org/people/mszyprowski/linux-dma-mapping
git bisect good 754c780953397dd5ee5191b7b3ca67e09088ce7a
# good: [2d65a9f48fcdf7866aab6457bc707ca233e0c791] Merge branch 'drm-next' of git://people.freedesktop.org/~airlied/linux
git bisect good 2d65a9f48fcdf7866aab6457bc707ca233e0c791
# bad: [88e237610b426897f0e9935adb6a60bd38bfe6c6] Merge tag 'armsoc-for-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/arm/arm-soc
git bisect bad 88e237610b426897f0e9935adb6a60bd38bfe6c6
# skip: [8a5de18239e418fe7b1f36504834689f754d8ccc] Merge tag 'kvm-arm-for-3.18-take-2' of git://git.kernel.org/pub/scm/linux/kernel/git/kvmarm/kvmarm
git bisect skip 8a5de18239e418fe7b1f36504834689f754d8ccc
# good: [1c2150283cae895526d0db3953d13d139f4e7a03] ext4: convert ext4_bread() to use the ERR_PTR convention
git bisect good 1c2150283cae895526d0db3953d13d139f4e7a03
# good: [c5bbcb5822b25c9f738db98e6d6ad2506cab8136] cxgb4i: Remove duplicate call to dst_neigh_lookup()
git bisect good c5bbcb5822b25c9f738db98e6d6ad2506cab8136
# skip: [0a582821d4f8edf41d9b56ae057ee2002fc275f0] Merge tag 'fbdev-3.18' of git://git.kernel.org/pub/scm/linux/kernel/git/tomba/linux
git bisect skip 0a582821d4f8edf41d9b56ae057ee2002fc275f0
# good: [a687ecaf50f18329206c6b78764a8c7bd30a9df0] ceph: export ceph_session_state_name function
git bisect good a687ecaf50f18329206c6b78764a8c7bd30a9df0
# good: [0ef090151345e693bd9b50c9b7aaf34ae5e9cac3] MAINTAINERS: add atmel ssc driver maintainer entry
git bisect good 0ef090151345e693bd9b50c9b7aaf34ae5e9cac3
# good: [3d32e4dbe71374a6780eaf51d719d76f9a9bf22f] kvm: fix excessive pages un-pinning in kvm_iommu_map error path.
git bisect good 3d32e4dbe71374a6780eaf51d719d76f9a9bf22f
# bad: [1c45d9a920e6ef4fce38921e4fc776c2abca3197] Merge tag 'pm+acpi-3.18-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm
git bisect bad 1c45d9a920e6ef4fce38921e4fc776c2abca3197
# good: [816fb4175c29b16948fb24a92053bea1e79908cc] Merge tag 'remove-weak-declarations' of git://git.kernel.org/pub/scm/linux/kernel/git/helgaas/pci
git bisect good 816fb4175c29b16948fb24a92053bea1e79908cc
# good: [a91e99e27a683608d221fb18b70d7de9d801de4a] Merge branches 'pm-cpuidle' and 'pm-cpufreq'
git bisect good a91e99e27a683608d221fb18b70d7de9d801de4a
# bad: [4384b8fe162d8aa03905d02073707bcf364cc7ce] Thermal: introduce int3403 thermal driver
git bisect bad 4384b8fe162d8aa03905d02073707bcf364cc7ce
# good: [8dd41f78adebb57909cccb0272e74c79e38b5238] ACPI / fan: remove no need check for device pointer
git bisect good 8dd41f78adebb57909cccb0272e74c79e38b5238
# bad: [9519a6356cbf63b1f22a7a208385dc56092c8b7d] ACPI / Fan: add ACPI 4.0 style fan support
git bisect bad 9519a6356cbf63b1f22a7a208385dc56092c8b7d
# bad: [19593a1fb1f6718406afca5b867dab184289d406] ACPI / fan: convert to platform driver
git bisect bad 19593a1fb1f6718406afca5b867dab184289d406
# good: [2bb3a2bf9939f3361e25045f4ef7b136b864c3b8] ACPI / fan: use acpi_device_xxx_power instead of acpi_bus equivelant
git bisect good 2bb3a2bf9939f3361e25045f4ef7b136b864c3b8
# first bad commit: [19593a1fb1f6718406afca5b867dab184289d406] ACPI / fan: convert to platform driver
Turbostat does not work for me either. Gkrellm shows a temperature of 35°C after resuming for both CPU cores and the GPU. The idle temperature is about 42°C for this machine. The highest fan speed is reached at 80°C. At normal room temperature (20°C) and at highest load, this fan speed is never reached. The second-highest fan speed suffices to keep the temperature at 78°C under heavy load. On a side note, my Lenovo X220 is not affected. I attached an acpidump of my HP NW9440. The skipped git bisect steps did not compile.
cat /proc/cpuinfo
processor : 0
vendor_id : GenuineIntel
cpu family : 6
model : 15
model name : Intel(R) Core(TM)2 CPU T7400 @ 2.16GHz
stepping : 6
microcode : 0xd1
cpu MHz : 2167.000
cache size : 4096 KB
physical id : 0
siblings : 2
core id : 0
cpu cores : 2
apicid : 0
initial apicid : 0
fpu : yes
fpu_exception : yes
cpuid level : 10
wp : yes
flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx lm constant_tsc arch_perfmon pebs bts rep_good nopl aperfmperf pni dtes64 monitor ds_cpl vmx est tm2 ssse3 cx16 xtpr pdcm lahf_lm dtherm tpr_shadow
bugs :
bogomips : 4322.51
clflush size : 64
cache_alignment : 64
address sizes : 36 bits physical, 48 bits virtual
power management:
processor : 1
vendor_id : GenuineIntel
cpu family : 6
model : 15
model name : Intel(R) Core(TM)2 CPU T7400 @ 2.16GHz
stepping : 6
microcode : 0xd1
cpu MHz : 2167.000
cache size : 4096 KB
physical id : 0
siblings : 2
core id : 1
cpu cores : 2
apicid : 1
initial apicid : 1
fpu : yes
fpu_exception : yes
cpuid level : 10
wp : yes
flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx lm constant_tsc arch_perfmon pebs bts rep_good nopl aperfmperf pni dtes64 monitor ds_cpl vmx est tm2 ssse3 cx16 xtpr pdcm lahf_lm dtherm tpr_shadow
bugs :
bogomips : 4322.51
clflush size : 64
cache_alignment : 64
address sizes : 36 bits physical, 48 bits virtual
power management:
Created attachment 165351 [details] acpidump NW9440
Matthias,
(In reply to Matthias from comment #7)
> Got one better. Did a git bisect.
>
> Testcase was:
> boot with init=/bin/bb
> mount sysfs
> echo mem > /sys/power/state
> resume
>
> Output from git bisect:
>
> 19593a1fb1f6718406afca5b867dab184289d406 is the first bad commit
> commit 19593a1fb1f6718406afca5b867dab184289d406
> Author: Aaron Lu <aaron.lu@intel.com>
> Date: Tue Nov 19 16:59:20 2013 +0800
>
> ACPI / fan: convert to platform driver
>
> Convert ACPI fan driver to a platform driver for the purpose of phasing
> out ACPI bus.
>
> Signed-off-by: Aaron Lu <aaron.lu@intel.com>
> Signed-off-by: Zhang Rui <rui.zhang@intel.com>
>
Does the problem still exist if you revert this patch?
thanks, rui
The following bug also reports a similar fan issue that has appeared since 3.18 (it worked up to 3.17). The only difference is that the fan starts running at full speed somewhere in the middle of the boot sequence, i.e. no suspend and resume is needed.
https://bugzilla.kernel.org/show_bug.cgi?id=92431
Related Arch forum thread:
https://bbs.archlinux.org/viewtopic.php?id=192255
I can confirm that taking drivers/acpi/fan.c from fedora's kernel-3.17.8-300.fc21 and putting it to fedora's kernel-3.18.6-200.fc21 , compiling and running kernel SOLVES the problem of fan running full speed after resume from sleep. Laptop HP 6820s.
I've attached the working drivers/acpi/fan.c
Created attachment 166781 [details] working drivers/acpi/fan.c from kernel 3.17.8
I am not a kernel or hardware expert, but looking at the difference between the 3.17 and 3.19 sources (drivers/acpi/fan.c), it appears that the 3.19 kernel now differentiates between two types of fans: acpi4 and non-acpi4. Maybe the detection of the fan type goes wrong somewhere, and as a side effect the fans run at full speed? I don't know how to detect whether our fan is really acpi4 or non-acpi4, nor how to find out which type the kernel actually detected (guessed). If someone can tell me a way to find out, I can check it.
I've also tried to generate from git the series of patches bringing fan.c from 3.17.8 to 3.18.6, looking at this history:
https://github.com/torvalds/linux/commits/master/drivers/acpi/fan.c
and generating the patches from the git repo:
git format-patch -8 bbb16fef19122ec9f20fb865c45375e12f85d2a1 fan.c
It creates:
0001-ACPI-fan-printk-replacement.patch
0002-ACPI-fan-remove-unused-macro.patch
0003-ACPI-fan-remove-no-need-check-for-device-pointer.patch
0004-ACPI-fan-use-acpi_device_xxx_power-instead-of-acpi_b.patch
0005-ACPI-fan-convert-to-platform-driver.patch
0006-ACPI-Fan-add-ACPI-4.0-style-fan-support.patch
0007-ACPI-Fan-support-INT3404-thermal-device.patch
0008-ACPI-Fan-Use-bus-id-as-the-name-for-non-PNP0C0B-Fan-.patch
1. I tried to move forward from 3.17.8. Git applied 0001-ACPI-fan-printk-replacement.patch cleanly and failed on 0002-ACPI-fan-remove-unused-macro.patch (it seems that 0002-ACPI-fan-remove-unused-macro.patch is looking for "printk code" removed in 0001-ACPI-fan-printk-replacement.patch). I did some manual merging, but with every patch it became harder, so I tried a different approach.
2. I tried reversing patches down from 3.18.6. 0008-ACPI-Fan-Use-bus-id-as-the-name-for-non-PNP0C0B-Fan-.patch is a later patch, so I skipped it. 0007-ACPI-Fan-support-INT3404-thermal-device.patch was reversed cleanly. 0006-ACPI-Fan-add-ACPI-4.0-style-fan-support.patch did not apply, with a lot of problems.
Now I'm stuck. I'm probably doing something wrong. If you have any hints on how to prove that this bug is gone without ACPI-fan-convert-to-platform-driver.patch, please tell me.
(In reply to amish from comment #15)
> I am not kernel OR hardware expert but looking at difference between 3.17
> and 3.19 source (drivers/acpi/fan.c), it appears that 3.19 kernel now
> differentiates between two types of fans. One is acpi4 and other non-acpi4.
You're probably talking about this commit:
https://github.com/torvalds/linux/commit/9519a6356cbf63b1f22a7a208385dc56092c8b7d
which is add-ACPI-4.0-style-fan-support.patch. It was added AFTER ACPI-fan-convert-to-platform-driver, which is the commit Matthias nailed down using git bisect. I've tried to prove that it is indeed the problem, but failed (see comment #16). Otherwise your guess seems legit. We just need a way to prove it.
(In reply to bojanvuk from comment #16)
> 2.
> I tried reversing patches down from 3.18.6
> 0008-ACPI-Fan-Use-bus-id-as-the-name-for-non-PNP0C0B-Fan-.patch is a later
> patch, so I skipped it.
> 0007-ACPI-Fan-support-INT3404-thermal-device.patch was reversed cleanly.
> 0006-ACPI-Fan-add-ACPI-4.0-style-fan-support.patch did not apply with a lot
> of problems.
>
> Now I'm stuck. I'm probably doing something wrong. If you got any hints how
> to prove that without ACPI-fan-convert-to-platform-driver.patch this bug is
> gone - please tell me.
Sometimes commits are related to each other, so I think you need to reverse all patches committed on Oct 10, 2014 (6 in total).
Maybe, but it won't do any good, because we know that that bug is probably somewhere between the patches applied on Oct 10, 2014 :)
Can you modify the following function in the source code when you compile the kernel, and check which variant does NOT cause the issue?
drivers/acpi/fan.c (line 228)
static bool acpi_fan_is_acpi4(struct acpi_device *device)
{
        return true; //add this line here
        return acpi_has_method(device->handle, "_FIF") && ...
}
Try first with "return true;" as the result, and then try again with "return false;". If the issue still persists, that means the fan type detection code is not wrong and the issue is somewhere else.
To verify whether the said commit is the culprit, you can do this:
$ cd /your/linux/git/tree
$ git reset --hard 19593a1fb1f6718406afca5b867dab184289d406
Build kernel and test;
$ git reset --hard HEAD~1
Build kernel and test again.
(In reply to amish from comment #20)
> If issue still persists that means fan type detection code is not wrong and
> issue is somewhere else.
Yes, the issue is somewhere else.
return false = everything is like it was before
return true = can't even reset the fan by hand (/sys/class/thermal/cooling_deviceN)
Thank you, Aaron. That's right. The fan under kernel 1 commit back from 19593a1fb1f6718406afca5b867dab184289d406 works as expected.
See https://bugzilla.kernel.org/show_bug.cgi?id=93301 for the bug report I've just created. It may (or may not) be a somewhat connected issue.
(In reply to szegad from comment #23)
> Thank you, Aaron.
>
> That's right. The fan under kernel 1 commit back from
> 19593a1fb1f6718406afca5b867dab184289d406 works as expected.
And the kernel with the 19593a1fb1f6718406afca5b867dab184289d406 commit doesn't work, right?
And your acpidump please:
# acpidump > acpidump.txt
Created attachment 167091 [details] acpidump kernel 3.18.6 Right, it doesn't
With the culprit commit, please show me the output of:
# ls -l /sys/bus/platform/drivers/acpi-fan
...
# ls -l /sys/bus/platform/devices/PNP0C0B\:*/
...
# grep . /sys/bus/platform/devices/PNP0C0B\:*/firmware_node/*
...
And then for each of the cooling devices, /sys/bus/platform/devices/PNP0C0B:XX/thermal_cooling: does writing the values 0 and 1 to the cur_state file make the fan stop and spin?
Created attachment 167231 [details] # ls -l /sys/bus/platform/drivers/acpi-fan
Created attachment 167241 [details] ls -l /sys/bus/platform/devices/PNP0C0B\:*/
Created attachment 167271 [details] # grep . /sys/bus/platform/devices/PNP0C0B\:*/firmware_node/*
/sys/bus/platform/devices/PNP0C0B:XX/thermal_cooling
PNP0C0B:00 works
PNP0C0B:01 works
PNP0C0B:02 works
PNP0C0B:03 works
PNP0C0B:04 does nothing
PNP0C0B:05 works and it sounds like going full speed just like after the resume from sleep
PNP0C0B:06 works
PNP0C0B:07 does nothing
Question: How is this different from accessing /sys/class/thermal/cooling_deviceN? There I have cooling_device0 to cooling_device10.
Can someone of you, please, have a look @ BUG 78201, and check whether we have a duplicate here? Rui? Thanks, Manuel
(In reply to szegad from comment #31)
> /sys/bus/platform/devices/PNP0C0B:XX/thermal_cooling
>
> PNP0C0B:00 works
> PNP0C0B:01 works
> PNP0C0B:02 works
> PNP0C0B:03 works
> PNP0C0B:04 does nothing
> PNP0C0B:05 works and it sounds like going full speed just like after the
> resume from sleep
> PNP0C0B:06 works
> PNP0C0B:07 does nothing
>
> Question:
> How is this different from accessing /sys/class/thermal/cooling_deviceN ?
> I've got there from cooling_device0 to cooling_device10 .
Sorry for the long delay, it's Chinese new year here.
PNP0C0B:0X corresponds to cooling_device0X according to your attachment: https://bugzilla.kernel.org/attachment.cgi?id=167241.
So the problem only occurs after resume, and then the thermal zone's temperature is still correct, only the FAN spins at full speed, right?
Please attach the following output before suspend and after resume:
# grep . /sys/class/thermal/thermal_zone*/*
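A minimal sketch of how the requested before/after comparison could be captured in a compact form. The function name thermal_snapshot and its optional root argument are hypothetical helpers, not part of any kernel tooling; it only reads the temp and cur_state files that this thread already refers to, so it is narrower than the full "grep ." output asked for above.

```shell
#!/bin/sh
# Sketch: print "relative/path:value" for every thermal zone temperature
# and every cooling device state. The function name and the optional
# root argument (default /sys/class/thermal) are illustrative only.
thermal_snapshot() {
    root="${1:-/sys/class/thermal}"
    for f in "$root"/thermal_zone*/temp "$root"/cooling_device*/cur_state; do
        # Skip unexpanded globs and unreadable attributes.
        [ -r "$f" ] || continue
        printf '%s:%s\n' "${f#"$root"/}" "$(cat "$f" 2>/dev/null)"
    done
}
```

Possible usage: run `thermal_snapshot > before.txt` before suspending, `thermal_snapshot > after.txt` after resume, then `diff before.txt after.txt` to see which zones or cooling devices changed.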
Created attachment 168211 [details] (before suspend) grep . /sys/class/thermal/thermal_zone*/*
Created attachment 168221 [details] (after suspend) grep . /sys/class/thermal/thermal_zone*/*
Happy New Year! Files attached. What bothers me is thermal_zone2: its temp is 5000 before suspend and 100000 after resume (I left my laptop for an hour to cool down!). However, its critical trip point 0 is 110000. What's more, after running my "cool all" script (echo 0 > /sys/class/thermal/cooling_deviceX/cur_state) the temp fell within 1-2 seconds to 8400 and then to 0. Then after some time it came back to 5000.
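The "cool all" script itself was not posted; a minimal sketch of what it might look like, assuming it simply writes 0 to every cooling device's cur_state as the command in the comment suggests. The function name cool_all and the optional root argument are hypothetical.

```shell
#!/bin/sh
# Sketch of a "cool all" helper: write 0 to the cur_state file of every
# cooling device. The function name and the optional root argument
# (default /sys/class/thermal) are illustrative, not from the thread.
cool_all() {
    root="${1:-/sys/class/thermal}"
    for state in "$root"/cooling_device*/cur_state; do
        # Skip the unexpanded glob when no cooling device exists.
        [ -e "$state" ] || continue
        # Some devices may reject the write; report it and keep going.
        echo 0 > "$state" || echo "could not reset $state" >&2
    done
}
```

On a real system this would need to run as root, e.g. `cool_all`, and it only changes the requested state; the thermal core may set the devices again on its next update.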
(In reply to szegad from comment #36)
> Happy New Year!
Thanks :-)
>
> Files attached.
> What bothers me is the thermal_zone2: its temp is 5000 before suspend and
> 100000 after resume (I left my laptop for an hour to cool down!). However
> it's critical trip point 0 is 110000.
It doesn't matter: thermal_zone2 doesn't have any cooling device (i.e. fan) bound to it, so no matter what temperature it is, no cooling operation would occur.
> What's more after running my "cool all" script (echo 0 >
> /sys/class/thermal/cooling_deviceX/cur_state) the temp fell down in 1-2
> seconds to 8400 and then to 0. Then after some time it came back to 5000.
After resume, before you run any script, can you please check the cur_state of cooling_device{0-7}, i.e. /sys/class/thermal/cooling_device?/cur_state? Since the fan is spinning at full speed, I suppose some of them should be set to 1?
BTW, please attach the output of:
ls -l /sys/class/thermal/thermal_zone*/
Created attachment 168321 [details] (before suspend) find -L /sys/class/thermal/ -maxdepth 2 -name "cur_state" -print -exec cat {} \;
Created attachment 168331 [details] (after) find -L /sys/class/thermal/ -maxdepth 2 -name "cur_state" -print -exec cat {} \;
Created attachment 168341 [details] ls -l /sys/class/thermal/thermal_zone*/
All fans' cur_state is set to 1 after resume, i.e. all fans are turned on. I think this is related to the thermal core; please run this after boot:
# echo 'module thermal_sys +fp' > /sys/kernel/debug/dynamic_debug/control
And then do a suspend-resume and attach the dmesg, thanks.
Created attachment 168381 [details] dmesg with debug
(In reply to Aaron Lu from comment #41) > All fan's cur_state is set to 1 after resume, i.e. all fan is turned on. I > think this is related to thermal core, please run this after boot: > # echo 'module thermal_sys +fp' > /sys/kernel/debug/dynamic_debug/control > > And then do a suspend-resume, attach the dmesg, thanks. Can you, Aaron Lu, please, perhaps, risk a little look at BUG 78201 https://bugzilla.kernel.org/show_bug.cgi?id=78201 Title: "Lower fan speeds are forgotten after resume from ram/disk" There you'd find many more dmesg logs from the past. Over several kernels. With several patches' alternatives to look at. Maybe you could/ should even exchange your knowledge with Zhang Rui. BR, Manuel
[ 187.823166] update_temperature: thermal thermal_zone0: last_temperature=48000, current_temperature=48000
[ 187.823172] thermal_zone_trip_update: thermal thermal_zone0: Trip1[type=0,temp=105000]:trend=0,throttle=0
[ 187.823193] get_target_state: thermal cooling_device7: cur_state=1
[ 187.823197] thermal_zone_trip_update: thermal cooling_device7: old_target=-1, target=-1
[ 187.823202] thermal_zone_trip_update: thermal thermal_zone0: Trip2[type=0,temp=70000]:trend=0,throttle=0
[ 187.823256] get_target_state: thermal cooling_device5: cur_state=1
[ 187.823259] thermal_zone_trip_update: thermal cooling_device5: old_target=-1, target=-1
[ 187.823264] thermal_zone_trip_update: thermal thermal_zone0: Trip3[type=0,temp=60000]:trend=0,throttle=0
[ 187.823316] get_target_state: thermal cooling_device6: cur_state=1
[ 187.823320] thermal_zone_trip_update: thermal cooling_device6: old_target=-1, target=-1
So the cur_state of cooling_device[5-7] is 1, which means the FAN devices are all turned on after resume. This is done by the platform bus's power domain callback (acpi_dev_resume_early->acpi_dev_pm_full_power), since the FAN device is a platform device now. Later, thermal_zone0's temperature gets lower and lower, so no trip point crossing event ever occurs, and the update for this thermal zone never happens again.
There is a problem in get_target_state for the trend = RAISE or STABLE case, where the cooling device's state is not properly set. If we set a poll for this thermal zone, the trend=DROP case will occur, and I think that should cure the problem. This can be verified by adding thermal.tzp=300 to the kernel cmdline; can you please check this?
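Before testing, it may help to confirm that the thermal.tzp parameter actually reached the running kernel. A minimal sketch, assuming the parameter is passed on the kernel command line as suggested above; the function name get_tzp and its optional file argument are hypothetical. To my understanding the value is a polling interval in tenths of a second, so 300 would correspond to a 30-second poll.

```shell
#!/bin/sh
# Sketch: print the value of thermal.tzp found on a kernel command line,
# or print nothing if the parameter is absent. The function name and the
# optional file argument (default /proc/cmdline) are illustrative only.
get_tzp() {
    # Put each command-line word on its own line, then strip the prefix.
    tr ' ' '\n' < "${1:-/proc/cmdline}" | sed -n 's/^thermal\.tzp=//p'
}
```

Possible usage: after rebooting with the extra parameter, `get_tzp` should print 300; empty output suggests the boot loader entry was not updated.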
Created attachment 168421 [details] My tests with linux-3.19.0 I had the time to do some testing. My findings are attached.
That's right, Aaron: when I resumed, the fan started to spin at full speed, but after a couple of seconds it went down!
...however it worked 2 times out of 3.
The current fan state changes from resume to resume, as can be seen below in comparison to comment 45. The output is different from the first one after resume. I will keep an eye on it.
/sys/class/thermal/cooling_device0/cur_state: 1
/sys/class/thermal/cooling_device1/cur_state: 1
/sys/class/thermal/cooling_device2/cur_state: 1
/sys/class/thermal/cooling_device3/cur_state: 1
/sys/class/thermal/cooling_device4/cur_state: 1
/sys/class/thermal/cooling_device5/cur_state: 1
/sys/class/thermal/cooling_device6/cur_state: 1
/sys/class/thermal/cooling_device7/cur_state: 1
/sys/class/thermal/cooling_device8/cur_state: 1
/sys/class/thermal/cooling_device9/cur_state: 1
/sys/class/thermal/cooling_device10/cur_state: 1
/sys/class/thermal/cooling_device11/cur_state: 0
/sys/class/thermal/cooling_device12/cur_state: 0
/sys/class/thermal/cooling_device13/cur_state: 0
Fan speed for each cooling device:
cooling_device0 100%
cooling_device1 70%
cooling_device2 60%
cooling_device3 40%
cooling_device4 25%
cooling_device5 100%
cooling_device6 70%
cooling_device7 60%
cooling_device8 40%
cooling_device9 25%
The following fan speeds seem to be the lowest fan speed, so I can't really tell whether there is a difference in fan speed between these four devices.
cooling_device10 20%
cooling_device11 20%
cooling_device12 20%
cooling_device13 20%
(In reply to szegad from comment #47) > ...however it worked 2 times out of 3. For the case it doesn't work, can you please attach the debug dmesg? I wonder if the polling for thermal_zone0 stopped?
Created attachment 168661 [details] tzp=300, dmesg just after resume, when tzp works
Created attachment 168671 [details] tzp=300, dmesg after fan down, when tzp works
Created attachment 168681 [details] tzp=300, dmesg just after resume, when tzp doesn't work
Created attachment 168691 [details] tzp=300, dmesg just after 30s, when tzp doesn't work
Created attachment 168701 [details] tzp=300, dmesg after fan down, when tzp doesn't work initially
Ok, so here it goes.
Case 1: TZP works.
Just after resume, when the fan is at full speed: https://bugzilla.kernel.org/attachment.cgi?id=168661
30s after resume, when the fan is down: https://bugzilla.kernel.org/attachment.cgi?id=168671
Case 2: TZP doesn't work (initially).
Just after resume, when the fan is at full speed: https://bugzilla.kernel.org/attachment.cgi?id=168681
30s after resume, the fan is still at full speed: https://bugzilla.kernel.org/attachment.cgi?id=168691
After some time and, I guess, another TZP check, it finally goes down: https://bugzilla.kernel.org/attachment.cgi?id=168701
The difference between these two cases is that in case 1 I resumed the system after a few seconds, when it was still warm, but in case 2 I left it sleeping to cool down completely.
Thanks szegad, it's very useful. The FAN will go down when the thermal_zone0 temperature goes down. For case 2, the thermal_zone0 temperature keeps rising for a long time, which is pretty surprising since the FAN is spinning at full speed. Anyway, this is just a verification that the problem is indeed due to the thermal zone's handling of the FAN cooling devices. Rui has some patches that should handle this situation. Rui, can you please give a pointer to your patches?
Yes, please check whether the patches at
https://bugzilla.kernel.org/show_bug.cgi?id=78201#c142
https://bugzilla.kernel.org/show_bug.cgi?id=78201#c143
work for you. Note that they are based on 4.0-rc2.
If you're running 3.19.0 and are "lazy", you can try my unofficial ones from:
https://bugzilla.kernel.org/show_bug.cgi?id=78201#c145
and
https://bugzilla.kernel.org/show_bug.cgi?id=78201#c146
(they differ in only _one_ context line from Rui's original ones)
They work for me from bootup through hibernation (disk) or sleep (RAM). They may not work after the very first boot with the changed kernel (https://bugzilla.kernel.org/show_bug.cgi?id=78201#c144).
Best regards, Manuel
Patches work for my nw9440 with linux-4.0-rc2. Thanks!
I'm still evaluating those patches on 3.18.7, some cases are ok, but I have to dig into others. Should I do it on 4.0rc2 instead?
Created attachment 169231 [details] dmesg after long suspend
When I suspend the system for a longer period of time and let it cool down completely, it still starts with the fan at full speed and doesn't step down even after a few minutes. dmesg just after resume attached: https://bugzilla.kernel.org/attachment.cgi?id=169231
Created attachment 169241 [details] resume after long suspend some minutes later
I tried to change the fan speed by changing the system load from low to high and back again, but it's still spinning at full speed.
(In reply to szegad from comment #63)
> I tried to change the fan speed by changing the system load from low to high
> and back again, but it's still spinning at full speed.
I'm a bit confused: according to the logs, the cur_state of several cooling devices is set to 0 or 1 and back when resuming, and later with the temperature changes. So I assume the patches are applied correctly. Have you re-checked this? Or rebooted another time (last sentence of Comment 58)?
With slightly changed patches I now have a 3.18.8 running well. And the 3.19.0 also managed the fan correctly after last night's hibernation cooldown. Between 3.18.8 and 3.19.0 there are not many significant changes to the affected files. Maybe this one can help?
https://github.com/torvalds/linux/commit/a940cb34fed73b2d4809a4575f2981d5927e2c21
It's in 3.19.0 but not yet in 3.18.8. I don't seem to need it so far.
Best regards, Manuel
(In reply to szegad from comment #62) > When suspend system for a longer period of time and let it cooldown > completely it still starts with fan at full speed and doesn't want to step > down even after a few minutes. > dmesg just after resume attached: > https://bugzilla.kernel.org/attachment.cgi?id=169231 in the log, I see you've done three suspend, at 3860s, 5181s, 5200s, does this contain the one that you resume with system cool and fan at full speed?
Created attachment 169941 [details] debug patch please apply this debug patch on top and re-do the test. PS: if the problem is reproduced, please attach the dmesg and point to me when the bug happens so that I can check the dmesg accordingly.
(In reply to Zhang Rui from comment #65)
> in the log, I see you've done three suspend, at 3860s, 5181s, 5200s, does
> this contain the one that you resume with system cool and fan at full speed?
It's the last one. However, with the debug patch applied I can't reproduce this behaviour yet. It's strange, because it happened many times before. I'm still trying.
Created attachment 170261 [details] fan full speed, system is cool Dmesg just after resume from suspend. System is cool, fan is running at full speed. Applied patches https://bugzilla.kernel.org/show_bug.cgi?id=78201#c142 https://bugzilla.kernel.org/show_bug.cgi?id=78201#c143 , but NOT the debug patch https://bugzilla.kernel.org/attachment.cgi?id=169941 -> see comment https://bugzilla.kernel.org/show_bug.cgi?id=91411#c67.
Rui, I can't reproduce the behaviour with the debug patch applied, but I can do it straight away with the last two patches live. It seems to me that this debug patch changes something important.
Created attachment 170301 [details] dmesg after long suspend, fan is full speed, system cool both patches + debug patch
Ok, I've managed to reproduce this. Last resume from suspend: https://bugzilla.kernel.org/attachment.cgi?id=170301
(In reply to szegad from comment #71) > Ok, I've managed to reproduce this. Last resume from suspend: > https://bugzilla.kernel.org/attachment.cgi?id=170301 Hi, szegad! Are you absolutely sure with the last dmesg, that you've applied all three patches correctly and also verified that the intended code landed where it is supposed to, before making the kernel? In this most recent log you don't even get a "last_temperature N/A, current_temperature=....." for any of your suspends/resumes what is a significant output from Rui's new patches, normally. It seems to me, that the patches are not applied at all. Please, be so kind, to re-check safely and maybe re-build your kernel. Excuse me, I definitely don't want to bother you, but this bug is lasting so long now, that I don't like any possible "false negatives" any more. Regards, Manuel
Yes, they're applied correctly. In other case I would have fan spinning at 100% after resume all the time.
Manuel, Anyway - look at this: [ 6149.022160] update_temperature: thermal thermal_zone1: last_temperature=23100, current_temperature=23100 it's in there.
It is the term "last_temperature N/A, current_temperature=....." ................................^^^^^... that should occur on each resume to invoke an update shortly. With Rui's newest patches. It's the "N/A" that's missing in your logs.
Rui, so szegad's THERMAL_TEMP_INVALID doesn't get set in the right place. So he may get into the "else" case in:
@@ -469,8 +476,12 @@ static void update_temperature(struct thermal_zone_device *tz)
 	mutex_unlock(&tz->lock);
 
 	trace_thermal_temperature(tz);
-	dev_dbg(&tz->device, "last_temperature=%d, current_temperature=%d\n",
-				tz->last_temperature, tz->temperature);
+	if (tz->last_temperature == THERMAL_TEMP_INVALID)
+		dev_dbg(&tz->device, "last_temperature N/A, current_temperature=%d\n",
+			tz->temperature);
+	else
+		dev_dbg(&tz->device, "last_temperature=%d, current_temperature=%d\n",
+			tz->last_temperature, tz->temperature);
 }
 
 void thermal_zone_device_update(struct thermal_zone_device *tz)
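One quick way to check whether a given kernel build really carries these patches is to look for the "last_temperature N/A" debug line in a saved dmesg after a resume. A minimal sketch, assuming thermal_sys dynamic debug is enabled and the log was saved to a file first; the function name count_na_updates and the file name dmesg.txt are hypothetical.

```shell
#!/bin/sh
# Sketch: count the "last_temperature N/A" debug lines in a saved dmesg
# file. The message text is taken from the hunk quoted above; the
# function name is illustrative only.
count_na_updates() {
    # grep -c prints 0 and exits non-zero when nothing matches, so mask
    # the exit status to keep callers running under "set -e" alive.
    grep -c 'last_temperature N/A, current_temperature=' "$1" || true
}
```

Possible usage: `dmesg > dmesg.txt` after a suspend/resume cycle, then `count_na_updates dmesg.txt`. A count of 0 would suggest that the code path setting THERMAL_TEMP_INVALID was not reached, or that the patches are not in the running kernel.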
Ok, I will verify this once again just to be 1000% sure.
Ok, Manuel, you're right. So Comments 70 & 71 are invalid, but Comment 69 is valid. With these two patches applied I can get the fan at 100% after resuming a cool system, but adding the debug patch seems to change something and I can't reproduce the faulty behaviour yet. I'm still trying anyway.
@ szegad: Thank you very very much for rechecking and clarifying this and for your continued testing! Now, it would be a fine addition if the guys whose fan(s) don't get adjusted after bootup (BUG 93301 and BUG 92431) also submitted positive test results with these patches for their issue. ;-)
One thing bothers me: is this "debug patch" meant to fix a problem, or is it just for debugging purposes and fixes a bug only as a side effect?
You definitely shouldn't let yourself be bothered by a misleading patch title during the testing phase. I'm no programmer, so take this with my limited knowledge of mutex locking: this "debug" patch changes the locking in several places in the code where trip points, related devices and temperatures are read from the system and/or reset. So this is not for debugging purposes at all, but for fixing previously wrong or impossible access to the mentioned data. Since you have not reproduced the wrong behaviour so far, and I'm using this "debug" patch on top on my system, which didn't need it before, still with the desired behaviour as a result, using all three patches would be the right way. Rui, in case my attempt to explain things is wrong or inaccurate, please correct me.
Hi, Manuel, thanks a lot for your effort on this bug. First of all, yes, the debug patch does have some functional changes. The reason I call it a "debug patch" is that I've not found the root cause, and this patch may help me debug further. The debug patch actually makes thermal_zone_device_update() atomic, because I've seen this symptom:
1. The thermal core updates all thermal zones after resume.
2. The ACPI thermal driver also schedules a thermal_zone_device_update() work item during resume, which is invoked while the thermal core is updating the thermal zones.
3. The second thermal zone update pollutes the data used by the first one, and causes some problems.
Patch 3/10 in https://bugzilla.kernel.org/show_bug.cgi?id=78201#c149 should remove the redundant thermal zone update raised by the ACPI thermal driver. But, IMO, I should also consider this debug patch as an upstream candidate.
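Editorial sketch: the three steps above describe a second update running in the middle of a first one. The C below replays one such interleaving in a single thread so that it is deterministic. It is an illustration only: the `race_*` names are hypothetical, and which fields actually get polluted in the kernel is an assumption here (the comment only says "the data used by the first one").

```c
#include <assert.h>
#include <stddef.h>

enum race_trend { RACE_STABLE, RACE_RAISING, RACE_DROPPING };

/* Hypothetical stand-in for the state shared by both updaters. */
struct race_zone {
	int temperature;
	int last_temperature;
};

/* Step 1 of an update: shift the current reading into the history slot. */
static void race_sample(struct race_zone *z, int reading)
{
	assert(z != NULL);
	z->last_temperature = z->temperature;
	z->temperature = reading;
}

/* Step 2 of an update: derive the trend from the two stored readings. */
static enum race_trend race_trend_of(const struct race_zone *z)
{
	if (z->temperature > z->last_temperature)
		return RACE_RAISING;
	if (z->temperature < z->last_temperature)
		return RACE_DROPPING;
	return RACE_STABLE;
}

/*
 * Atomic update: both steps run back to back. In concurrent code a lock
 * held across both steps would be needed to guarantee this.
 */
static enum race_trend race_update_atomic(struct race_zone *z, int reading)
{
	race_sample(z, reading);
	return race_trend_of(z);
}

/*
 * Replay of the problematic interleaving: updater B samples between
 * updater A's two steps, so A sees last_temperature == temperature.
 */
static enum race_trend race_update_interleaved(struct race_zone *z, int reading)
{
	race_sample(z, reading);	/* updater A, step 1 */
	race_sample(z, reading);	/* updater B, step 1 sneaks in */
	return race_trend_of(z);	/* updater A, step 2 */
}
```

Starting from 40000 in both fields, a reading of 60000 yields a raising trend with the atomic update but a stable trend in the interleaved replay, so the rise is lost to whoever decides the cooling state.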
Thanks for the explanation. So I'm happy to announce my failure to reproduce the bug with all three patches applied. Do you need me to check something more?
Please help me do the test mentioned in https://bugzilla.kernel.org/show_bug.cgi?id=78201#c150
I'm testing with patches 1-4 and everything is working fine. Now I'll apply the rest of the patches.
please try the patches at https://bugzilla.kernel.org/show_bug.cgi?id=78201#c157 instead to see if they work for you or not.
@szegad: You can safely test and give an additional "good positive". ;-) Best regards and thank you for investing your time!
Ok, testing the patches from https://bugzilla.kernel.org/show_bug.cgi?id=78201#c157 shows no sign of the problem. I will do some more testing tomorrow, however I do not expect any trouble. Great!
I can say that it's working for me. Thank you, Rui!
Patches to fix the problem have been sent out.
https://patchwork.kernel.org/patch/6077231/
https://patchwork.kernel.org/patch/6077241/
https://patchwork.kernel.org/patch/6077251/
These three patches are necessary to fix the regression introduced by
commit 19593a1fb1f6718406afca5b867dab184289d406
Author: Aaron Lu <aaron.lu@intel.com>
Date: Tue Nov 19 16:59:20 2013 +0800
    ACPI / fan: convert to platform driver
    Convert ACPI fan driver to a platform driver for the purpose of phasing out ACPI bus.
    Signed-off-by: Aaron Lu <aaron.lu@intel.com>
    Signed-off-by: Zhang Rui <rui.zhang@intel.com>
About the thermal_zone_device_update() locking fix: I found the patch still has a deadlock issue, i.e. an upstream fix is not ready at the moment, so I decided to push the above three patches upstream first, in order to have them shipped in the 4.0 release.
For the locking issue, I will work on this as 4.1 material.
Marking the bug as Resolved, as the regressions should be fixed by the above three patches.
If you want to track the thermal locking issue, please open another thread, or else I will send out the patch directly in the 4.1 release window.
(In reply to Zhang Rui from comment #90)
...
> About the thermal_zone_device_update() locking fix, I found the patch still
> has deadlock issue, aka, an upstream fix is not ready at the moment, thus I
> decided to push the above three patches to upstream first, in order to be
> shipped in 4.0 release.
>
> For the locking issue, I would on this as 4.1 material.
>
> Mark the bug as Resolved as the regressions should be fixed by the above
> three patches.
>
> If you want to track the thermal locking issue, please open another thread,
> or else I will send out the patch directly, in 4.1 release window.
What do you mean by "deadlock issue"? Is there a risk in having the patch applied?
yes, there is. As the problem fixed by that patch is not a regression (the problem actually exists from the beginning), I'd prefer to rework it and send it as 4.1 material.
Please, don't request us to open another BUG for this. I think all of us on this thread here would be glad, if you would just report here, when you have new patches available for this issue in the 4.1 kernel queue. Could make it a bit easier for all participants. Thanks, Manuel
sure, no problem.
Hi, guy, please help check whether the patches at comment #183/#184/#185 in bug #78201 work for you or not. As there are some functional changes, I need to make sure they have been tested before sending them upstream.
s/guy/guys
Your patches do work with linux-3.19.4 and 3.19.5. No problems here so far. Thank you Rui! Your work is greatly appreciated.
Yu will take care of the patch sets and push for upstream.
Created attachment 188091 [details] Patch to avoid racing problem when doing thermal update(by move trend calculation into each thermal_instance)
Hi, Matthias,
Can you please help me test the latest patch, which is supposed to fix a racing problem in thermal management? This patch is not related to your bug, but I'd like to send it together with Rui's patches. Can you please test this patch on top of Rui's:
https://patchwork.kernel.org/patch/6166681/
https://patchwork.kernel.org/patch/6166691/
https://patchwork.kernel.org/patch/6166701/
and please enable the thermal dynamic debug to see if there are any fan problems. Thanks!
*** Bug 92431 has been marked as a duplicate of this bug. ***
Sorry for getting back so late. I can't compile linux-4.1.9 with your patch, Chen. I got this error:
In file included from drivers/thermal/thermal_core.c:34:0:
include/linux/thermal.h:348:14: warning: 'struct thermal_instance' declared inside parameter list
 int, enum thermal_trip_type);
 ^
include/linux/thermal.h:348:14: warning: its scope is only this definition or declaration, which is probably not what you want
 CC drivers/tty/pty.o
drivers/thermal/thermal_core.c:192:5: error: conflicting types for 'get_instance_trend'
 int get_instance_trend(struct thermal_zone_device *tz,
 ^
In file included from drivers/thermal/thermal_core.c:34:0:
include/linux/thermal.h:347:5: note: previous declaration of 'get_instance_trend' was here
 int get_instance_trend(struct thermal_zone_device *, struct thermal_instance *,
 ^
In file included from include/linux/linkage.h:6:0,
 from include/linux/kernel.h:6,
 from include/linux/list.h:8,
 from include/linux/module.h:9,
 from drivers/thermal/thermal_core.c:28:
drivers/thermal/thermal_core.c:209:15: error: conflicting types for 'get_instance_trend'
 EXPORT_SYMBOL(get_instance_trend);
 ^
include/linux/export.h:57:21: note: in definition of macro '__EXPORT_SYMBOL'
 extern typeof(sym) sym; \
 ^
drivers/thermal/thermal_core.c:209:1: note: in expansion of macro 'EXPORT_SYMBOL'
 EXPORT_SYMBOL(get_instance_trend);
 ^
In file included from drivers/thermal/thermal_core.c:34:0:
include/linux/thermal.h:347:5: note: previous declaration of 'get_instance_trend' was here
 int get_instance_trend(struct thermal_zone_device *, struct thermal_instance *,
 ^
scripts/Makefile.build:258: recipe for target 'drivers/thermal/thermal_core.o' failed
make[2]: *** [drivers/thermal/thermal_core.o] Error 1
scripts/Makefile.build:403: recipe for target 'drivers/thermal' failed
make[1]: *** [drivers/thermal] Error 2
With linux-4.2.3 your patch gets rejected at the following part:
--- drivers/thermal/step_wise.c
+++ drivers/thermal/step_wise.c
@@ -154,11 +154,17 @@ static void thermal_zone_trip_update(struct thermal_zone_device *tz, int trip)
 		if (instance->trip != trip)
 			continue;
+		instance->temperature = tz->temperature;
+		instance_trend = get_instance_trend(tz, instance, trip_temp, trip_type);
 		old_target = instance->target;
-		instance->target = get_target_state(instance, trend, throttle);
+		if (instance_trend)
+			instance->target = get_target_state(instance, instance_trend, throttle);
+		else
+			instance->target = get_target_state(instance, trend, throttle);
 		dev_dbg(&instance->cdev->device, "old_target=%d, target=%d\n",
 			old_target, (int)instance->target);
+		instance->last_temperature = tz->temperature;
 		if (instance->initialized && old_target == instance->target)
 			continue;
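Editorial sketch: in C, the "declared inside parameter list" warning in the 4.1.9 build log generally means a prototype mentions a struct type that has not been declared at file scope yet. The struct named in the prototype is then a different type from the one the later definition uses, which produces "conflicting types". Whether that is exactly what happens with this patch on 4.1.9 is an assumption, since the patched header is not shown here. The usual remedy is a forward declaration before the prototype, as in this standalone example (all `fwd_*` names are hypothetical):

```c
#include <assert.h>
#include <stddef.h>

/*
 * Forward declarations at file scope. Without the second one, the
 * prototype below would declare a struct visible only inside its own
 * parameter list, and the definition further down would conflict.
 */
struct fwd_zone;
struct fwd_instance;

/* Prototype, as it might appear in a shared header. */
int fwd_get_trend(struct fwd_zone *, struct fwd_instance *, int);

/* Full definitions, as they might appear in a private header. */
struct fwd_zone {
	int temperature;
};

struct fwd_instance {
	int last_temperature;
};

/* Returns 1 if rising, -1 if dropping, 0 if unchanged. */
int fwd_get_trend(struct fwd_zone *tz, struct fwd_instance *inst, int trip_temp)
{
	assert(tz != NULL && inst != NULL);
	(void)trip_temp;	/* unused in this sketch */
	if (tz->temperature > inst->last_temperature)
		return 1;
	if (tz->temperature < inst->last_temperature)
		return -1;
	return 0;
}
```

Removing the `struct fwd_instance;` line reproduces the same class of diagnostics with GCC: first the scope warning on the prototype, then a conflicting-types error at the definition.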
Created attachment 190821 [details] V6-0001-Thermal-initialize-thermal-zone-
Created attachment 190831 [details] Thermal-handle-thermal-zone-device-
Created attachment 190841 [details] V6-0003-Thermal-do-thermal-zone-update-after-a-cooling-devic
Plz help test this version of patches, thanks!
(In reply to Chen Yu from comment #105)
> Plz help test this version of patches, thanks!
The patches in comments 102-104 fix the issue on my Haswell 2955U-based Asus ChromeBox when applied to kernel 4.3 (final) - thanks very much!
I have an older notebook - HP 6510b with core2duo 9300T. I've tested the patches against archlinux kernel 4.2.5-1 and everything is working very well. No more full speed after resume. I hope that the patches will be merged soon.
I have been running kernel 4.2.5 and 4.2.6 with the patches for more than ten days now. No issues so far on my NW9440. Thanks!
Patch set has been merged, so close current thread as fixed.
commit bb431ba26c5cd0a17c941ca6c3a195a3a6d5d461
Author: Zhang Rui <rui.zhang@intel.com>
Date: Fri Oct 30 16:31:47 2015 +0800
    Thermal: initialize thermal zone device correctly
commit ff140fea847e1c2002a220571ab106c2456ed252
Author: Zhang Rui <rui.zhang@intel.com>
Date: Fri Oct 30 16:31:58 2015 +0800
    Thermal: handle thermal zone device properly during system sleep
commit 4511f7166a2deb5f7a578cf87fd2fe1ae83527e3
Author: Chen Yu <yu.c.chen@intel.com>
Date: Fri Oct 30 16:32:10 2015 +0800
    Thermal: do thermal zone update after a cooling device registered