Bug 78201

Summary: Lower fan speeds are forgotten after resume from ram/disk
Product: Power Management Reporter: Manuel Krause (manuelkrause)
Component: ThermalAssignee: Chen Yu (yu.c.chen)
Status: CLOSED CODE_FIX    
Severity: blocking CC: aaron.lu, ammdispose-arch, auxsvr, diesal3, doctorjellyface, kernel, lenb, liquid.acid, manuelkrause, mb, peter.klotz99, prash.n.rao, radek, registrace, rui.zhang, syrjala, szegadlo, vevais
Priority: P1    
Hardware: All   
OS: Linux   
Kernel Version: 3.12 - 4.4 Subsystem:
Regression: No Bisected commit-id:
Attachments: MK-3.15.0.thermal__thermal_zone_paths
MK-3.15.0.thermal__cooling_device_paths
MK-3.15.0.dmesg-pre-suspend
MK-3.15.0.tmon-pre-suspend.log
MK-3.15.0.dmesg-post-suspend
MK-3.15.0.tmon-post-suspend.log
Reverting PATCH for a 3.14.7 kernel
REVERT thermal: step_wise: cdev only needs update on a new target state @ 3.12.22
PATCH: initialize thermal zone device correctly
handle system sleep correctly
MK-3.16-rc2+.dmesg-before-S2RAM.log
MK-3.16-rc2+.tmon-before-S2RAM.log
MK-3.16-rc2+.dmesg-after-S2RAM.log
MK-3.16-rc2+.tmon-after-S2RAM.log
MK-3.16-rc2+.dmesg-after-S2DISK.log
MK-3.16-rc2+.tmon-after-S2DISK.log
PATCH V2: handle system sleep correctly (1.39 KB,
PATCH: initialize thermal zone device correctly
Patch: handle thermal zone device update events correctly during system sleep
PATCH v2: initialize thermal zone device correctly
Patch v2: handle thermal zone device update events correctly during system sleep
dmesg of a 3.17-rc4 with Zhang Rui's most recent patches and dynamic_debug enabled
tmon.log of a 3.17-rc4 with Zhang Rui's most recent patches and dynamic_debug enabled
PATCH v3: initialize thermal zone device correctly
Patch v3: handle thermal zone device update events correctly during system sleep
dmesg output of a 3.17-rc4 with V3 of the patches
debug patch
Patch v4: handle thermal zone device update events correctly during system sleep
dmesg output of a 3.17-rc4 with V4 of the patches + debug.patch + dyn_debug
Working patch 1of2 for 3.16 kernels
Working patch 2of2 for 3.16 kernels
tmon.log of a 3.17-rc4 with V4 of the patches according to Comment 52 (reproduced)
dmesg.log of a 3.17-rc4 with V4 of the patches according to Comment 52 (reproduced)
tmon.log of a 3.17-rc4 with V4 of the patches SUSPEND TO RAM
dmesg.log of a 3.17-rc4 with V4 of the patches SUSPEND_TO_RAM
Patch-for-upstream: 1/3
Patch-for-upstream: 2/3
Patch-for-upstream: 3/3 debug patch
patch-know-to-work: 1/3
patch-know-to-work: 2/3
patch-know-to-work: 3/3 debug patch
dmesg of 3.17-rc4 with "known-to-work" patches with suspend-to-disk
dmesg of 3.17-rc4 with "known-to-work" patches with suspend-to-RAM
dmesg of 3.17-rc4 with "for-upstream" patches with suspend-to-disk
dmesg of 3.17-rc4 with "for-upstream" patches with suspend-to-RAM
dmesg of a 3.16.3+ with patches for-upstream +acpi.debug +dynamic_debug
[PATCH] ACPI: invoke acpi_device_wakeup() with correct parameters
debug patch
Kernel-Bugzilla-78201-knowntowork-V2-PLUS-for3.17.x PATCH
PATCH for 3.19, 1/2
PATCH for 3.19, 2/2
kernel-bugzilla-78201-MK.dmesg-3.17.6.log
kernel-bugzilla-78201-MK.dmesg-3.18.0.log
BUG_78201.kernel-3.18.7.0001-THERMAL-initialize-thermal-zone-device-correctly.MOD6
BUG_78201.kernel-3.18.7.0002-Thermal-handle-thermal-zone-device-update-events-cor.MOD1
patch for upstream: 1-2
patch for upstream: 2-2
patch-for-upstream:1-2 corrected version
dmesg with 2 hibernations with Rui's two 3.19-related patches
0001-Thermal-initialize-thermal-zone-device-correctly.patch
0002-Thermal-handle-thermal-zone-device-update-events-cor.patch
dmesg with 2 hibernations with Rui's two 4.0-rc2-related patches
Newest 0001 (1of2) patch for the 3.19.0 kernel
Newest 0002 (2of2) patch for the 3.19.0 kernel
dmesg.log from working 3.19.1 according to Comment 147
patches for upstream
patches for upstream - 2
dmesg of 3.19.1 with patches 0001..0004 applied
dmesg of 3.19.1 with patches 0001..0011 applied
patch v3
dmesg of 4.0.0-rc3 with only the four v3 patches applied
Chromebox dmesg after patch
dmesg of a 4.0-rc4 with the five relevant patches with bouncing fan
patch v5 1/3
patch v5 2/3
patch v5 3/3
This patch is to avoid lock racing when doing thermal_update.
V6-0001-Thermal-initialize-thermal-zone-device-correctly
V6-0002-Thermal-handle-thermal-zone-device-properly-during-s
V6-0003-Thermal-do-thermal-zone-update-after-a-cooling-devic

Description Manuel Krause 2014-06-17 21:23:26 UTC
I've found this issue _after_ helping to debug https://bugzilla.kernel.org/show_bug.cgi?id=71711

My System is a Core 2 Duo Notebook HP Compaq 6730b. (More from the dmesg below.)

When freshly booting my system, and using it, the fan is working fine, adding something CPU stressing like worldcommunitygrid, also no problem.

Suspending the system to RAM / disk and resuming results in:
My fan only triggers the next possible fan level(s) after resume.

If I immediately resume with a still hot system, I'd get only the 75°C step or as next the emergency cooling. Dropping the CPU load with a so resumed system makes the fan go off. Then at, say 75°C, the fan does it's work again until the related low trigger temperature for this trip point is met and the fan goes off again.
This may be dangerous for systems that suspend and resume above the last met temperature for a trip point. What can be high!

Resuming a colder system results in lower temperatures to be taken as next fan kick in, then lower trip points are reckognized over runtime after suspend.

Zhang Rui advised me to attach following files:
attach output of "grep . /sys/class/thermal/thermal_zone*/cdev*/device/path"
attach output of "grep . /sys/class/thermal/cdev*/device/path"
 run "echo 'module thermal_sys +fp' > /sys/kernel/debug/dynamic_debug/control"
 reproduce the problem
attach the dmesg output and tmon output

So, I'd do now.
Comment 1 Manuel Krause 2014-06-17 21:26:29 UTC
Created attachment 140111 [details]
MK-3.15.0.thermal__thermal_zone_paths
Comment 2 Manuel Krause 2014-06-17 21:27:31 UTC
Created attachment 140121 [details]
MK-3.15.0.thermal__cooling_device_paths
Comment 3 Manuel Krause 2014-06-17 21:29:36 UTC
Created attachment 140131 [details]
MK-3.15.0.dmesg-pre-suspend
Comment 4 Manuel Krause 2014-06-17 21:31:14 UTC
Created attachment 140141 [details]
MK-3.15.0.tmon-pre-suspend.log
Comment 5 Manuel Krause 2014-06-17 21:33:13 UTC
Created attachment 140151 [details]
MK-3.15.0.dmesg-post-suspend
Comment 6 Manuel Krause 2014-06-17 21:33:59 UTC
Created attachment 140161 [details]
MK-3.15.0.tmon-post-suspend.log
Comment 7 Manuel Krause 2014-06-17 21:44:00 UTC
As Guenter Roeck already asked me to test older kernels, I was able to narrow it down to the not working kernels: 3.12.20 -> 3.15.0. (meant 3.11.9 as last good from my archives.)

I've then done some bisecting over the weekend and got this as result:

178c2490b99f898efc06d1ad75cadc84f13021a6 is the first bad commit
commit 178c2490b99f898efc06d1ad75cadc84f13021a6
Author: Shawn Guo <shawn.guo@linaro.org>
Date:   Mon Jun 17 21:24:23 2013 +0800

    thermal: step_wise: cdev only needs update on a new target state
    
    The cooling device only needs update on a new target state.  Since we
    already check old target in thermal_zone_trip_update(), we can do one
    more check to see if it's a new target state.  If not, we can reasonably
    save some uncecesary code execution.
    
    Signed-off-by: Shawn Guo <shawn.guo@linaro.org>
    Acked-by: Eduardo Valentin <eduardo.valentin@ti.com>
    Signed-off-by: Zhang Rui <rui.zhang@intel.com>

:040000 040000 34a6fb0f2405e5366f753d1da69bf3cb0f49a869 c4b953739e5b1ba259149eb68c48e0e2585ddd67 M	drivers

Best regards,
Manuel Krause
Comment 8 Manuel Krause 2014-06-17 22:03:00 UTC
Created attachment 140181 [details]
Reverting PATCH for a 3.14.7 kernel

If I'd revert the above mentioned commit, also a 3.14.7 kernel does work very well, without any issues.

This is a reverting PATCH only.
Comment 9 Manuel Krause 2014-06-21 22:29:36 UTC
Created attachment 140561 [details]
REVERT thermal: step_wise: cdev only needs update on a new target state @ 3.12.22

This is another working patch for now reverting a kernel 3.12.22.

Thanks for attention and best regards, 
Manuel
Comment 10 Zhang Rui 2014-06-23 05:25:53 UTC
Created attachment 140681 [details]
PATCH: initialize thermal zone device correctly
Comment 11 Zhang Rui 2014-06-23 05:26:29 UTC
Created attachment 140691 [details]
handle system sleep correctly
Comment 12 Zhang Rui 2014-06-23 05:29:17 UTC
Well, the real problem of this bug is that after resume, thermal core still uses some cached value to make thermal decisions.
IMO, the two patches above, based on 3.16-rc2, is the real fix for this issue.
Please apply them on top of 3.16-rc2 and check if the problem still exists.
If yes, please
run "echo 'module thermal_sys +fp' > /sys/kernel/debug/dynamic_debug/control"
reproduce the problem
attach the dmesg output and tmon output
Comment 13 Manuel Krause 2014-06-23 18:59:30 UTC
Unfortunately, 3.16-rc2 + your 2 patches don't help for resuming from RAM/ DISK. A freshly booted system works well, though.

In the following you'll find both the dmesg & tmon output from before suspend-to-ram, after this task, and after suspend-to disk. All actions taken subsequently. The dmesg files are cut to each task and the tmon output comes from separate runs of 'tmon -l'. I hope this workflow doesn't make it too complicated to review this issue.

Best regards, and thank you for your work!
Comment 14 Manuel Krause 2014-06-23 19:01:04 UTC
Created attachment 140741 [details]
MK-3.16-rc2+.dmesg-before-S2RAM.log
Comment 15 Manuel Krause 2014-06-23 19:02:00 UTC
Created attachment 140751 [details]
MK-3.16-rc2+.tmon-before-S2RAM.log
Comment 16 Manuel Krause 2014-06-23 19:03:02 UTC
Created attachment 140761 [details]
MK-3.16-rc2+.dmesg-after-S2RAM.log
Comment 17 Manuel Krause 2014-06-23 19:03:47 UTC
Created attachment 140771 [details]
MK-3.16-rc2+.tmon-after-S2RAM.log
Comment 18 Manuel Krause 2014-06-23 19:04:39 UTC
Created attachment 140781 [details]
MK-3.16-rc2+.dmesg-after-S2DISK.log
Comment 19 Manuel Krause 2014-06-23 19:05:26 UTC
Created attachment 140791 [details]
MK-3.16-rc2+.tmon-after-S2DISK.log
Comment 20 Zhang Rui 2014-06-23 23:53:43 UTC
Created attachment 140801 [details]
PATCH V2:  handle system sleep correctly (1.39 KB,

this is an updated version of the second patch, please apply and retry.
Note: when doing the tests, please boot with kernel parameter "initcall_debug" this time.
Comment 21 Manuel Krause 2014-06-24 16:34:38 UTC
(In reply to Zhang Rui from comment #20)
> Created attachment 140801 [details]
> PATCH V2:  handle system sleep correctly (1.39 KB,
> 
> this is an updated version of the second patch, please apply and retry.
> Note: when doing the tests, please boot with kernel parameter
> "initcall_debug" this time.

Great work! Thank you very much: 
 Version 2 of the second patch does the magic!

Many following suspends+resumes to RAM/ disk now show the desired behaviour, i.e. passing trip_point temperatures trigger the right fan speed in upward and downward direction. No "forgotten" fan speed levels at all.

Do you still want related test logs?

And a last question, do you consider these patches to be applicable to 3.15 / 3.14 / 3.12 kernels as well?

Best regards, and keep up your good work,
Manuel
Comment 22 Zhang Rui 2014-06-27 02:25:22 UTC
Well, IMO, this is still not the perfect fix, because what we really want is to deffer the execution of thermal_zone_device_update() until we have successfully reset the thermal zone devices during resume.
Currently, this patch does the trick by doing this in the .late_suspend() callback and no one invokes thermal_zone_device_update() later than that.
IMO, a wait queue would be a better choice, I will investigate if it is possible to do it in this way.
Comment 23 Manuel Krause 2014-06-27 16:05:54 UTC
O.K., I'll stay tuned on this. Please, feel free to post any patch that follows your improved approach: I'd really like to test it!

Best regarrds, Manuel
Comment 24 Manuel Krause 2014-07-08 18:26:17 UTC
*ping* Any news on your better choice? 
I would be glad to either receive a preliminary patch from you or to see this issue being resolved with a "not perfect but working" patch in actual and affected kernels.

Thanks for your work in any case, 
Manuel
Comment 25 Manuel Krause 2014-08-21 23:47:26 UTC
Apparently also kernel 3.16.1 doesn't have a proper fix.

Your, Zhang Rui's, proposed patches keep to work, though.

Manuel
Comment 26 Zhang Rui 2014-08-24 06:44:47 UTC
Created attachment 147891 [details]
PATCH: initialize thermal zone device correctly
Comment 27 Zhang Rui 2014-08-24 06:47:06 UTC
Created attachment 147901 [details]
Patch: handle thermal zone device update events correctly during system sleep

please try these two patches on clean 3.17-rc1 and see if the problem still exists.
Comment 28 Manuel Krause 2014-08-26 04:18:51 UTC
One of your both most recent patches leads to general overheating on here. It (fan + trigger) doesn't step up after resume.

The first patch, as I compared, is quite equal. But: The second one???

Please, also don't tag your old patches as "obsolete", unless the newer ones are tagged as "good". My 3.16.1 kernel is running quite well with your old(!) patches.

Please think over plastics overheating & so on...
Manuel
Comment 29 Zhang Rui 2014-08-26 06:41:30 UTC
Created attachment 148241 [details]
PATCH v2: initialize thermal zone device correctly
Comment 30 Zhang Rui 2014-08-26 06:43:17 UTC
Created attachment 148251 [details]
Patch v2: handle thermal zone device update events correctly during system sleep

Sorry that I posted an out-of-date version for PATCH 1/2.
please try these two updated patches instead.
Comment 31 Manuel Krause 2014-08-27 03:56:45 UTC
OMG, NO! What have you done?

3.17-rc1+ your new patches: Now lower temperatures <-> fan speeds are set upon resume, but they will _NEVER_ step up from this "loaded level" according to a temperature increase. --> Burning notebook. Same as with your patches from 20140824.

Manuel
Comment 32 Zhang Rui 2014-08-27 05:55:32 UTC
(In reply to Manuel Krause from comment #31)
> OMG, NO! What have you done?
> 
> 3.17-rc1+ your new patches: Now lower temperatures <-> fan speeds are set
> upon resume, but they will _NEVER_ step up from this "loaded level"
> according to a temperature increase. --> Burning notebook. Same as with your
> patches from 20140824.
> 
can you please attach the dmesg output after resume with the patch applied, and with command "echo 'module thermal_sys +fp' > /sys/kernel/debug/dynamic_debug/control"?
Comment 33 Manuel Krause 2014-09-06 22:33:36 UTC
Please, be so kind, to excuse the delay.

For several weeks now, I'm not able to boot a 3.17-rcX kernel on here.

As it was possible before, Comment 28+, and now it is NOT, I'm quite clueless.

Maybe I've updated some openSUSE package that prevents it? Or I'm missing one? A freshly compiled 3.16.2 boots well.

If you know anything about this preventing issue, please, let me know.

I still want to help you and others to fix this BUG.

Best regards, Manuel
Comment 34 Zhang Rui 2014-09-09 01:47:40 UTC
hmmm,
3.17-rc1 works for you but later -rc releases don't?
I was not aware of such problems in 3.17-rc kernels, can you please try a git bisect to find out which commit introduces the problem?
Comment 35 Len Brown 2014-09-09 14:17:36 UTC
as this is a longstanding bug, exposed by recent changes,
we're marking this as not a regression.

3.17-rc is problematic, since latest kernel is not booting at all.
(is that boot failure filed as a regression???), preventing debugging
this issue.

It sounds like 3.17-rc1 is latest that boots, and so perhaps debugging
of the fan issue could continue on that release?
Comment 36 Manuel Krause 2014-09-11 19:45:09 UTC
O.k., I'm now able to boot a 3.17-rc kernel again. It is now 3.17-rc4. 

As I normally install the openSUSE .rpm (once they puplishe), it's not quite sure what was causing the issue, me not beeing able to boot -rc2 to -rc3. Can be a packaging failure from them or something from their git state. I retrieved the 3.17-rc1 binaries from a backup(!) with which I did the first testing on Zhang Rui's new patches and it worked (booted but the BUG's issues remained, of course).

This "preventing issue" was obviously solved somehow inbetween, so I hope you won't advise me to bisect as it may take much needless time.

In the following I'll attach the dmesg output and a parallel tmon log after issuing "echo 'module thermal_sys +fp' > /sys/kernel/debug/dynamic_debug/control". Unfortunately our current ambient temperatures with the upcoming autumn do not bring up high temperatures in a short period of time, so the logs are ... lengthy.

Best regards, Manuel
Comment 37 Manuel Krause 2014-09-11 19:54:31 UTC
Created attachment 149931 [details]
dmesg of a 3.17-rc4 with Zhang Rui's most recent patches and dynamic_debug enabled
Comment 38 Manuel Krause 2014-09-11 19:56:21 UTC
Created attachment 149941 [details]
tmon.log of a 3.17-rc4 with Zhang Rui's most recent patches and dynamic_debug enabled
Comment 39 Manuel Krause 2014-09-11 20:11:20 UTC
And to be absolutly clear, with the most recent patches, on 3.17-rcX, now the higher fan speeds / triggers aren't reached. (Remember, before, the lower speeds had been forgotten.)

There had been some patches from Zhang Rui in this long time, that I continue to apply to my 3.16 kernels, and they do work, definitely.

Best regards, Manuel
Comment 40 Zhang Rui 2014-09-18 03:30:23 UTC
Created attachment 150751 [details]
PATCH v3: initialize thermal zone device correctly
Comment 41 Zhang Rui 2014-09-18 03:31:02 UTC
Created attachment 150761 [details]
Patch v3: handle thermal zone device update events correctly during system sleep
Comment 42 Zhang Rui 2014-09-18 03:32:32 UTC
Hi, Manuel,
there do is a bug in the previous patches, so please apply these two updated ones and see if they can fix the problem for you or not.

BTY, please attach the dmesg output after suspend/hibernation, with dynamic_debug enabled as well.
Comment 43 Manuel Krause 2014-09-18 23:43:34 UTC
Created attachment 150811 [details]
dmesg output of a 3.17-rc4 with V3 of the patches

I hope this dmesg output helps a bit for your further conclusions.

After resume from disk/ram changing temperatures still DO NOT trigger a fan level change.

The attached log may only cover the upwards (temp) path, but I've also generated logs on here, proving, that the downwards path doesn't work either: After resume the fan won't change it's speed according to the CPU temp any more (seems fixed to the temp gathered at resume time). Freshly booted everything works well.

I can send you the other logs, if you wish, but maybe you first think about the severe error that must be in the patches for the resume case not allowing the fan speed to change!

Best regards, Manuel Krause
Comment 44 Zhang Rui 2014-09-19 01:45:06 UTC
Created attachment 150821 [details]
debug patch
Comment 45 Zhang Rui 2014-09-19 02:00:19 UTC
Created attachment 150831 [details]
Patch v4: handle thermal zone device update events correctly during system sleep
Comment 46 Zhang Rui 2014-09-19 02:12:38 UTC
please apply the updated second patch, and the debug patch and re-do the test.

(In reply to Manuel Krause from comment #43)
> Created attachment 150811 [details]
> dmesg output of a 3.17-rc4 with V3 of the patches
> 
> I hope this dmesg output helps a bit for your further conclusions.
> 
> After resume from disk/ram changing temperatures still DO NOT trigger a fan
> level change.
> 
From the dmesg output, we can see that
1. when resuming from hibernation, the temperature is 63C, the Fan cooling device 2,3,4 is turned on as we expected. Please confirm the fan is do running at a certain speed level.
2. there is no thermal zone update() request then, I'm not sure how you change the temperature, but the last thermal_zone_device_update() is invoked at 299s, upon resuming from hibernation.

> The attached log may only cover the upwards (temp) path, but I've also
> generated logs on here, proving, that the downwards path doesn't work
> either: After resume the fan won't change it's speed according to the CPU
> temp any more (seems fixed to the temp gathered at resume time). Freshly
> booted everything works well.
> 
please apply the updated second patch and the debug patch, I want to make sure that thermal_zone_device_update() is invoked correctly after resume, when you change the temperature.
Comment 47 Manuel Krause 2014-09-19 22:41:14 UTC
At first the answers to Comment 46:
1. The fan is running well at all levels (25, 34, 45, 58, 100 <-emergency) and I, of course, did observe them always before reporting that changing fan speed is not happening. And, with the V3 patches the system comes up from resume with a properly adjusted fan-speed, according to the temp. But it lost the ability to step up or down.
2. The way I change the temp. is not manually issued. I just start a worldcommunitygrid client or a kernel compile to increase the temp. And stop them, if I want to see the fan level go down.

Manuel
Comment 48 Manuel Krause 2014-09-19 22:54:39 UTC
Created attachment 151001 [details]
dmesg output of a 3.17-rc4 with V4 of the patches + debug.patch + dyn_debug

At the moment, with these patches, the system after hibernation resume, sticks at 100% fan speed, not depending on the load.

Not the final goal.

Manuel
Comment 49 Manuel Krause 2014-09-19 23:18:26 UTC
Created attachment 151011 [details]
Working patch 1of2 for 3.16 kernels

Maybe, to remember you... Manuel
Comment 50 Manuel Krause 2014-09-19 23:21:53 UTC
Created attachment 151021 [details]
Working patch 2of2 for 3.16 kernels

Best regards, Manuel
Comment 51 Zhang Rui 2014-09-22 02:03:45 UTC
(In reply to Manuel Krause from comment #48)
> Created attachment 151001 [details]
> dmesg output of a 3.17-rc4 with V4 of the patches + debug.patch + dyn_debug
> 
> At the moment, with these patches, the system after hibernation resume,
> sticks at 100% fan speed, not depending on the load.
> 

well, from the dmesg output, we can see that your hibernation started at 567s, and ended at 569s, and the last dmesg output is 572s.
so I could not see how the system "sticks at 100% fan speed, not depending on the load."
Question, is your dmesg output got soon after hibernation, and then you did some test and made such conclusion?
If yes, please send me the full dmesg output, because I want to see how the thermal core re-act when the temperature changes after hibernation.
If no, then this seems to be another problem that "thermal notification is missing after resuming from hibernation".
Comment 52 Manuel Krause 2014-09-22 22:51:31 UTC
Though, I had an extra long dmesg log for the last reported test run, and there didn't happen anything after that last dmesg output @572s for another 96s, I preferred to reproduce the problem and generate a new set of logs for you, now including a corresponding tmon.log, too, so that you hopefully can see the related reactions of the system. I'll attach them in the two follow-up posts.

I hope that I applied the right patches!? I used 
"PATCH v3: initialize thermal zone device correctly" (attachment 150751 [details]), 
"Patch v4: handle thermal zone device update events correctly during system sleep" (attachment 150831 [details]) and the 
"debug patch" (attachment 150821 [details]). 
Can you, please, also recheck, that these are the right ones?

It's a pity: This latest combination of patches falls into your second category from Comment 51: 'another problem that "thermal notification is missing after resuming from hibernation" ' -- Directly with resume from hibernation the fan goes to 100% and stays there, whatever I'd do to increase or decrease CPU load up- or down beyond known trigger temperatures.

Detailed description what these actual logs are about:
 booting, echoing the dynamic debug command,
 start worldcommunitygrid client + wait + stop it again + wait,
 hibernate + wait + resume (now fan at 100%),
 start worldcommunitygrid client + wait + stop it again + wait,
 connect to the web,
 stop tmon + save the outputs to tmon.log and dmesg.log

I really hope we'll get this solved soon and I thank you very very much for your continued engagement.

Best regards, Manuel
Comment 53 Manuel Krause 2014-09-22 22:58:06 UTC
Created attachment 151431 [details]
tmon.log of a 3.17-rc4 with V4 of the patches according to Comment 52 (reproduced)
Comment 54 Manuel Krause 2014-09-22 23:00:08 UTC
Created attachment 151441 [details]
dmesg.log of a 3.17-rc4 with V4 of the patches according to Comment 52 (reproduced)
Comment 55 Manuel Krause 2014-09-23 00:10:32 UTC
I assume you won't like me after this and the following two posts... With the same setup and testing, but now suspending _to_RAM_ instead of hibernating (to disk) I generated the following logs: Showing here, the fan now sticks to the lowest previous level:
Comment 56 Manuel Krause 2014-09-23 00:12:57 UTC
Created attachment 151481 [details]
tmon.log of a 3.17-rc4 with V4 of the patches SUSPEND TO RAM
Comment 57 Manuel Krause 2014-09-23 00:16:17 UTC
Created attachment 151491 [details]
dmesg.log of a 3.17-rc4 with V4 of the patches SUSPEND_TO_RAM
Comment 58 Manuel Krause 2014-09-23 00:29:10 UTC
This obviously proves your second case theory from Comment 51: 'another problem that "thermal notification is missing after resuming from hibernation" '. Meaning now == also from suspend.

But I don't understand, why hibernation (to disk) or suspend (to RAM) do behave that different. The recovery from each of them must check for actual temperatures IMO.

Best regards, Manuel
Comment 59 Manuel Krause 2014-09-23 00:35:16 UTC
... AND reset the fan's levels, too!!!
Comment 60 Zhang Rui 2014-09-23 02:24:02 UTC
(In reply to Manuel Krause from comment #52)
> Though, I had an extra long dmesg log for the last reported test run, and
> there didn't happen anything after that last dmesg output @572s for another
> 96s, I preferred to reproduce the problem and generate a new set of logs for
> you, now including a corresponding tmon.log, too, so that you hopefully can
> see the related reactions of the system. I'll attach them in the two
> follow-up posts.
> 
> I hope that I applied the right patches!? I used 
> "PATCH v3: initialize thermal zone device correctly" (attachment 150751 [details]
> [details]), 
> "Patch v4: handle thermal zone device update events correctly during system
> sleep" (attachment 150831 [details]) and the 
> "debug patch" (attachment 150821 [details]). 
> Can you, please, also recheck, that these are the right ones?
> 
yes, they are.

> It's a pity: This latest combination of patches falls into your second
> category from Comment 51: 'another problem that "thermal notification is
> missing after resuming from hibernation" ' -- Directly with resume from
> hibernation the fan goes to 100% and stays there, whatever I'd do to
> increase or decrease CPU load up- or down beyond known trigger temperatures.
> 
well, bad new to me. I think we need to do further debug to check why Linux behaves differently with patch v1 and v4.
Comment 61 Zhang Rui 2014-09-23 02:26:37 UTC
Created attachment 151501 [details]
Patch-for-upstream: 1/3
Comment 62 Zhang Rui 2014-09-23 02:27:41 UTC
Created attachment 151511 [details]
Patch-for-upstream: 2/3
Comment 63 Zhang Rui 2014-09-23 02:28:33 UTC
Created attachment 151521 [details]
Patch-for-upstream: 3/3 debug patch
Comment 64 Zhang Rui 2014-09-23 02:37:25 UTC
Created attachment 151531 [details]
patch-know-to-work: 1/3
Comment 65 Zhang Rui 2014-09-23 02:37:51 UTC
Created attachment 151541 [details]
patch-know-to-work: 2/3
Comment 66 Zhang Rui 2014-09-23 02:38:21 UTC
Created attachment 151551 [details]
patch-know-to-work: 3/3 debug patch
Comment 67 Zhang Rui 2014-09-23 02:45:44 UTC
now here are two patch sets, patch-for-upstream and patch-known-to-work.
For each patch set, please do the following test
1. rebuild your kernel with CONFIG_ACPI_DEBUG=y.
2. reboot with boot option "acpi.debug_level=0x04 acpi.debug_layer=0x04"
3. enable dynamic debug
4. run suspend/hibernate
5. after resuming, try to change the temperature, either lower or higher.
6. attach the dmesg output

Note that I've made a change in the patch-for-upstream patch set, so do please apply these refreshed ones.
Comment 68 Manuel Krause 2014-09-25 22:52:22 UTC
Created attachment 151941 [details]
dmesg of 3.17-rc4 with "known-to-work" patches with suspend-to-disk
Comment 69 Manuel Krause 2014-09-25 22:53:44 UTC
Created attachment 151951 [details]
dmesg of 3.17-rc4 with "known-to-work" patches with suspend-to-RAM
Comment 70 Manuel Krause 2014-09-25 22:55:20 UTC
Created attachment 151961 [details]
dmesg of 3.17-rc4 with "for-upstream" patches with suspend-to-disk
Comment 71 Manuel Krause 2014-09-25 22:56:47 UTC
Created attachment 151971 [details]
dmesg of 3.17-rc4 with "for-upstream" patches with suspend-to-RAM
Comment 72 Manuel Krause 2014-09-25 23:18:38 UTC
The from you so called "known-to-work" patches from you do not make the fan step up after resume (disk+RAM). It sticks at some low temp/speed.

The "for-upstream" patches show: After Suspend-to-disk (hibernate) my fan sticks at 100% -- but after Suspend-to-RAM -- it shows the old behaviour, not stepping up from some low temp/speed.

What you'd see in the logs at the end: That I mount some partition on here: sda11. This was always the beginning of a new kernel compilation to increase the system's temperature. As I know, that the system's temp highly increases afterwards, by this means, there's something wrong, either with your debug logging or with THIS BUG's topic, to not refresh the fan <-> thermal_zone relations upon resume.

Best regards, Manuel
Comment 73 Zhang Rui 2014-09-26 02:26:27 UTC
(In reply to Manuel Krause from comment #72)
> The from you so called "known-to-work" patches from you do not make the fan
> step up after resume (disk+RAM). It sticks at some low temp/speed.
> 
this is weird, then can you please attach the patches that are know to work for you? I don't think I made some functional changes.
Comment 74 Manuel Krause 2014-09-28 17:47:03 UTC
The patches you've taken and called "known-to-work" are quite right and are identical to mine, regarding the code, to the ones I already posted here in Comment 49: attachment 151011 [details] and Comment 50: attachment 151021 [details]. Unfortunately I haven't compared them earlier, so I didn't find better ideas at that time.

Over the last days I found time to do some cross testing, that I obviously should have done much earlier. Sorry for that.

Long story in short:
* known-to-work patches with 3.17.0-rc6: 
  - after suspend-to-disk: fan speed stays @ a low value (45%) despite 
    increasing CPU load beyond temp. trigger point
* known-to-work patches with 3.17.0-rc1:
   == same as with 3.17.0-rc6 and == 3.17.0-rc4
=> These two and my previously posted results with -rc4 for BOTH patchsets lead me to the assumption that some new 3.17 code inhibits updating temperature and fan correlations completely after resume from disk/RAM. You are more involved into development than me, so, it's your turn.

* for-upstream patches with 3.16.3 (+BFS +BFQ +some others):
       --First time that I tried this combination--
   - after suspend-to-RAM: everything o.k.
   - after suspend-to-disk: fan sticks @ 100% until a lower temp. trigger is 
      met, then it adjusts to the right fan speed. Depending on the CPU load, 
      maybe WCG client is running, this may take forever, as the fan cools too 
      much unneededly, but the CPU doesn't emit enough heat, so that the 
      algorithm seems to hang "in the middle". Now the fan will forever stay 
      @ 100%. And, yes, this is then noisy! Even with an idle CPU this 
      "getting a lower temp. trigger" can take from 1 to 3 minutes at this 
      season. So this method seems not to be as usable as the very old one.

If you've time to explain, please, tell me what you want to achieve with the new set of patches in general.

Best regards,
Manuel Krause
Comment 75 Manuel Krause 2014-09-28 18:03:15 UTC
Created attachment 152031 [details]
dmesg of a 3.16.3+ with patches for-upstream +acpi.debug +dynamic_debug

Unfortunately the beginning of the log is missing. I don't know why. I hope this log is helpful, nontheless.

dmesg of a 3.16.3+ with patches for-upstream +acpi.debug +dynamic_debug
Comment 76 Manuel Krause 2014-09-28 18:11:10 UTC
In the log in Comment 75 I did one suspend-to-ram, and two suspend-to-disk.
Comment 77 Manuel Krause 2014-10-05 00:34:35 UTC
I've, until now, done some bisecting for your convenience. On the topic, when your "known-to-work" patches began to fail to work since 3.16.

Result:
git bisect bad
f35cec255557d1037ff0d772edfd6e7b1e92cdc0 is the first bad commit
commit f35cec255557d1037ff0d772edfd6e7b1e92cdc0
Author: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Date:   Wed Jul 23 01:00:53 2014 +0200

    ACPI / PM: Always enable wakeup GPEs when enabling device wakeup
    
    Wakeup GPEs are currently only enabled when setting up devices for
    remote wakeup at run time.  During system-wide transitions they are
    enabled by ACPICA at the very last stage of suspend (before asking
    the BIOS to take over).  Of course, that only works for system
    sleep states supported by ACPI, so in particular it doesn't work
    for the "freeze" sleep state.
    
    For this reason, modify the ACPI core device PM code to enable wakeup
    GPEs for devices when setting them up for wakeup regardless of whether
    that is remote wakeup at runtime or system wakeup.  That allows the
    same device wakeup setup routine to be used for both runtime PM and
    system-wide PM and makes it possible to reduce code size quite a bit.
    
    This make ACPI-based PCI Wake-on-LAN work with the "freeze" sleep
    state on my venerable Toshiba Portege R500 and should help other
    systems too.
    
    Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>

:040000 040000 4818deb40271308e35a69fae46673090737c9dd9 40965a4df729f5cb0ea848b1c0e70ea475b8a558 M      drivers
:040000 040000 d08af810abb0b5768615cdd4dea34b577704171a 0d113d89c1c831c294170c9a9d7a5cbedf884904 M      include


I currently don't know, what patches I'd need to revert, based on this above, to prove that YOUR patches would also work with 3.17 (without the compromising commit above).

Please advise me.
Thanks in advance and best regards,

Manuel
Comment 78 Manuel Krause 2014-10-09 23:02:16 UTC
Some new test results with kernel 3.17.0:

1. Vanilla without any additional patches clearly shows the BUG on here: After Suspend-to-disk the fan stays @ 0 forever.


2. Now with REVERTED commit f35cec255557d1037ff0d772edfd6e7b1e92cdc0
(http://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/patch/?id=f35cec255557d1037ff0d772edfd6e7b1e92cdc0):

2.1. With the "known-to-work" patches: Everything works as expected and wished

2.2. With the "for-upstream" patches: Same results as in my Comment 74, section 
      for-upstream patches. So, still not really usable.

Maybe, we can invite Rafael J. Wysocki to have a look on his patch again, to find out why it breaks both of your patch-sets for 3.17.

Thanks in advance,
Manuel
Comment 79 Zhang Rui 2014-10-23 12:27:19 UTC
Created attachment 154651 [details]
[PATCH] ACPI: invoke acpi_device_wakeup() with correct parameters

first, please apply this patch on top of rafael' patch and see if there is anything different, because there is a bug in his patch. And I will check if his patch changes any behavior that breaks our "known-to-work" patches.
Comment 80 Manuel Krause 2014-10-23 19:23:11 UTC
Tested 3.17.0 (as 3.17.1 doesn't change anything for our work) without reverting Rafael's commit but WITH your fix in Comment 79:

known-to-work patches don't trigger fan changes after suspend-to-disk

Although this patch is a real fix for the wrong parameters, this seems to be not enough to be fixed.
Comment 81 Zhang Rui 2014-10-24 06:22:15 UTC
Created attachment 154731 [details]
debug patch

please apply this patch on top of the "known-to-work" patches and the fix patch attached yesterday, and see if it helps.
Comment 82 Manuel Krause 2014-10-24 20:00:43 UTC
BTW: Thank you very much for your continued efforts!

To summarise: 
3.17.0 with nothing reverted + the 2 "known-to-work" patches (+ the related debug message patch of those days)  + attachment 154731 [details] debug patch + '[PATCH] ACPI: invoke acpi_device_wakeup() with correct parameters', attachment 154651 [details] 
brings back a working base! Cheers!:

Fan speed is stepping up and down as desired, initial fan speed is appropriate, after bootup, after suspend-to-disk and as well as after suspend-to-RAM.

Now, I really hope that your last debug patch from Comment 81 doesn't break things that Rafael wanted to fix with his commit f35cec255557d1037ff0d772edfd6e7b1e92cdc0.

And... What's next? ;-)
Comment 83 Zhang Rui 2014-10-29 06:50:39 UTC
I'm trying to fix this issue right now, can you please send me the acpidump of this machine?
Comment 84 Manuel Krause 2014-10-29 20:52:13 UTC
As I can't estimate how sensitive the acpidump data is, I decided to send it to your email address directly.

I hope it helps!
Comment 85 Manuel Krause 2014-11-11 20:51:02 UTC
I've just seen that your fix from Comment 79 will most likely make it into 3.17.3!
Hopefully you are able to get ahead with the rest of this BUG, too.
Currently, I'm using the 3.17.2 with BFS, BFQ and the patches mentioned in Comment 82 without the debug-message-throwing one -- and it works very well! Thanks!

Best wishes and best regards, Manuel
Comment 86 Mario Becroft 2014-11-23 10:02:01 UTC
I have the identical issue on my HP 8510w using latest 3.17.4 kernel. As this is a production system I'm not able to test and retest all the patches, so I have reverted to an older kernel pre this bug. But if there is any info I can provide from the running system that would help, please let me know.
Comment 87 Manuel Krause 2014-11-24 00:52:57 UTC
Created attachment 158651 [details]
Kernel-Bugzilla-78201-knowntowork-V2-PLUS-for3.17.x PATCH

@Mario Becroft:
You can try this patch, what is a combination of Zhang Rui's two V2 patches, which worked well (known-to-work), with the third one, needed for 3.17.x.

It should be safe to use it on your machine, as it doesn't seem to differ much from mine. 
Please report back, I'm "glad" not to be the only one with this BUG. ;-)

This combination works very well on here @hp compaq 6730b with 3.17.4.

Best regards, Manuel Krause
Comment 88 Mario Becroft 2014-11-24 03:40:03 UTC
I'm not able to reboot this production system on demand so a bit difficult for me to test, unless it's really essential to get my feedback I can schedule a maintenance window.

Is there some reason these patches are being held back from inclusion in the next 3.17 release?
Comment 89 Manuel Krause 2014-11-26 00:09:22 UTC
I don't know why we don't get the working patches into the kernel. Zhang Rui never explained. :-(

What I see from testing 3.18-rc5-6, these patches won't work there any more. There had been modifications to the thermal and acpi subsystems and not only to the files in question. But the plain vanilla doesn't do the job either.
When I tested each of the 3.18 modifications vs. the 3.17 working solution (by modifying patches) I didn't get ONE positive result. So we do depend on someone else's expertise/ experience.
Comment 90 Manuel Krause 2014-11-28 22:58:43 UTC
*PING* @ Zhang Rui
Comment 91 Manuel Krause 2014-12-01 23:23:01 UTC
I'm really feeling sad and disappointed, that this BUG is originating from 3.12 and we would apparently not get our a fixes into upcoming 3.18 and not into 3.17.5.

Zhang Rui, please, finish this work, it's annoying over this much time now.

Manuel
Comment 92 Zhang Rui 2014-12-02 13:11:44 UTC
Hi, guys,
sorry for the long delay.
The reason why I did not push the working patches is that
the root cause of the bug is that we should not use cached value after resume.
And the "working" patches clear the cached value during suspend, but it is not guaranteed to be the last step of suspend (some driver may do thermal_zone_device_update() later during suspend, and then cached values are still used during resume).
Thus although it "fixes" the problem you have, it hides the real problem that may come again sometime, and that's why I don't want to push upstream.
But you're right, "a working solution today is better than a perfect plan tomorrow", I will try to push those two patches to 3.19, and I'll rewrite a better solution sometime later, wish I could get your help to test them again then.
Comment 93 Zhang Rui 2014-12-02 14:43:44 UTC
Created attachment 159431 [details]
PATCH for 3.19, 1/2
Comment 94 Zhang Rui 2014-12-02 14:46:32 UTC
Created attachment 159441 [details]
PATCH for 3.19, 2/2

I've rebased the "working" patches on top of 3.18-rc.
please double check if they work for you or not, after applying the debug patch at https://bugzilla.kernel.org/attachment.cgi?id=154731
I need to send them out with your Tested-by ASAP as the 3.19 merge window will open soon.
Comment 95 Manuel Krause 2014-12-02 20:03:49 UTC
Mmmh, now I really need your help with your knowledge. Like I already indicated in Comment 89, also kernel 3.18.0-rc7 showed: the thermal system fails.

Symptoms: After booting and increasing cpu load the fan won't trigger any fan speed increase, although well known trigger points are passed. After hibernate+resume the fan hangs at 100% forever.

The patch from https://bugzilla.kernel.org/attachment.cgi?id=154731 contains one hunk (No. 3) that doesn't correspond to 3.18-rc kernels any more (The line in error in this hunk is line 3):

diff --git a/drivers/acpi/device_pm.c b/drivers/acpi/device_pm.c
index 67075f8..40a485ce 100644
--- a/drivers/acpi/device_pm.c
+++ b/drivers/acpi/device_pm.c
@@ -878,13 +886,13 @@ int acpi_dev_suspend_late(struct device *dev)
 
 	target_state = acpi_target_system_state();
 	wakeup = device_may_wakeup(dev);
-	error = acpi_device_wakeup(adev, target_state, wakeup);
+	error = __acpi_device_sleep_wake(adev, target_state, wakeup);
 	if (wakeup && error)
 		return error;
 
 	error = acpi_dev_pm_low_power(dev, adev, target_state);
 	if (error)
-		acpi_device_wakeup(adev, ACPI_STATE_UNKNOWN, false);
+		__acpi_device_sleep_wake(adev, ACPI_STATE_UNKNOWN, false);
 
 	return error;
 }

So I've tested 3 kernels for you during this afternoon.
One with all three patches + leaving the kernel original line,
one with all three patches + with your debug patch original line,
one without any of your patches.

All 3 (each verified twice) tests were leading to the results reported on the top of this comment.

Can you please check the surrounding of the files our patches touch for changes in the 3.18-rc development timeline? Maybe the failing hunk gives a hint where to begin the investigation?

Good luck, and yes, of course, I'll stay tuned to help in testing!!!
Comment 96 Manuel Krause 2014-12-02 21:04:01 UTC
(In reply to Zhang Rui from comment #92)
> Hi, guys,
> sorry for the long delay.
> The reason why I did not push the working patches is that
> the root cause of the bug is that we should not use cached value after
> resume.
> And the "working" patches clear the cached value during suspend, but it is
> not guaranteed to be the last step of suspend (some driver may do
> thermal_zone_device_update() later during suspend, and then cached values
> are still used during resume).
> Thus although it "fixes" the problem you have, it hides the real problem
> that may come again sometime, and that's why I don't want to push upstream.
> But you're right, "a working solution today is better than a perfect plan
> tomorrow", I will try to push those two patches to 3.19, and I'll rewrite a
> better solution sometime later, wish I could get your help to test them
> again then.

My thoughts on this: Isn't it possible to reset the fan(s) and temperatures and their correspondence during resume, like it happens at a normal reboot? (O.k. it does not do with the current 3.18-rc at boot at all, on here.) But normally in the booting process in kernel 3.17 we've got a temp and a trigger point and then a resulting fan speed. So, why should that process NOT occur at resume again? Any speed gain in this case is useless if it sacrifices system safety.

And now to your approach, that may sound different from my idea above, but may be the same idea, why don't you delete the cached value also at resume (again, double-safe) and then re-evaluate the the real temperature data?
Comment 97 Zhang Rui 2014-12-03 02:28:11 UTC
(In reply to Manuel Krause from comment #89)
> I don't know why we don't get the working patches into the kernel. Zhang Rui
> never explained. :-(
> 
> What I see from testing 3.18-rc5-6, these patches won't work there any more.
> There had been modifications to the thermal and acpi subsystems and not only
> to the files in question. But the plain vanilla doesn't do the job either.
> When I tested each of the 3.18 modifications vs. the 3.17 working solution
> (by modifying patches) I didn't get ONE positive result.

do you mean the 3.17 working solution does not work for you any more?
Comment 98 Zhang Rui 2014-12-03 02:31:04 UTC
(In reply to Manuel Krause from comment #96)
> (In reply to Zhang Rui from comment #92)
> > Hi, guys,
> > sorry for the long delay.
> > The reason why I did not push the working patches is that
> > the root cause of the bug is that we should not use cached value after
> > resume.
> > And the "working" patches clear the cached value during suspend, but it is
> > not guaranteed to be the last step of suspend (some driver may do
> > thermal_zone_device_update() later during suspend, and then cached values
> > are still used during resume).
> > Thus although it "fixes" the problem you have, it hides the real problem
> > that may come again sometime, and that's why I don't want to push upstream.
> > But you're right, "a working solution today is better than a perfect plan
> > tomorrow", I will try to push those two patches to 3.19, and I'll rewrite a
> > better solution sometime later, wish I could get your help to test them
> > again then.
> 
> My thoughts on this: Isn't it possible to reset the fan(s) and temperatures
> and their correspondence during resume, like it happens at a normal reboot?
> (O.k. it does not do with the current 3.18-rc at boot at all, on here.) But
> normally in the booting process in kernel 3.17 we've got a temp and a
> trigger point and then a resulting fan speed. So, why should that process
> NOT occur at resume again? Any speed gain in this case is useless if it
> sacrifices system safety.
> 
> And now to your approach, that may sound different from my idea above, but
> may be the same idea, why don't you delete the cached value also at resume
> (again, double-safe) and then re-evaluate the the real temperature data?

No, the same reason: the thermal .resume_early() callback is not guaranteed to be the first one during resume, which means that some thermal driver may already issued thermal_zone_device_update() before thermal core .resume_early() callback, and your approach would clear the meaningful values for those drivers.
Comment 99 Zhang Rui 2014-12-03 05:01:56 UTC
(In reply to Manuel Krause from comment #95)
> Mmmh, now I really need your help with your knowledge. Like I already
> indicated in Comment 89, also kernel 3.18.0-rc7 showed: the thermal system
> fails.
> 
> Symptoms: After booting and increasing cpu load the fan won't trigger any
> fan speed increase, although well known trigger points are passed.

Well, this is sad.
It seems that we've got another regression that thermal system fails after boot.

please make sure your kernel is built with CONFIG_DYNAMIC_DEBUG=y,
and run echo 'module thermal_sys +fp' > /sys/kernel/debug/dynamic_debug/control
after kernel boot.
and try to warm up the system and attach the dmesg output. We need to fix this first.
Comment 100 Manuel Krause 2014-12-03 17:55:35 UTC
To answer all in one:
* The two working patches + debug patch work for my 3.17.4 very well.
* To the regression: I've tested 3.18.0-rc5, -rc6 and -rc7 -- and for none of them the fan control works. I don't even need to bother with sending a dmesg (with the dynamic_debug command run) as there is exactly NO related output in it, although I've passed several trigger points up- and downwards. I've rechecked the .config for eventual misconfiguration, but it's all in it's place. Do I need some special kernel command line in addition?
* From comparing a 3.17.3 dmesg with a 3.18.0-rc7 after bootup I miss some lines completely in the latter (indicated with --- instead of timestamp):

...
[    0.671244] pcie_pme 0000:00:1c.5:pcie01: service driver pcie_pme loaded
[    0.671294] intel_idle: does not run on family 6 model 23
[    ---     ] ACPI: Fan [FAN0] (off)
[    ---     ] ACPI: Fan [FAN1] (off)
[    ---     ] ACPI: Fan [FAN2] (off)
[    ---     ] ACPI: Fan [FAN3] (off)
[    ---     ] ACPI: Fan [FAN4] (off)
[    0.672777] Monitor-Mwait will be used to enter C-1 state
[    0.672783] Monitor-Mwait will be used to enter C-2 state
[    0.672788] tsc: Marking TSC unstable due to TSC halts in idle
[    0.672794] ACPI: acpi_idle registered with cpuidle
...

I don't know if that's a hint?! According to tmon these do exist and they (the fan levels) are manually adjustible with e.g. 'echo 1 > /sys/class/thermal/cooling_device1/cur_state'.

* To be honest, I don't want to bisect the early 3.18-rcs, testing the -rc5 again showed up that it's highly buggy on here.


Best regards and thank you in advance for any new idea to try out,
Manuel
Comment 101 Manuel Krause 2014-12-04 23:15:39 UTC
(In reply to Zhang Rui from comment #98)
> ...
> No, the same reason: the thermal .resume_early() callback is not guaranteed
> to be the first one during resume, which means that some thermal driver may
> already issued thermal_zone_device_update() before thermal core
> .resume_early() callback, and your approach would clear the meaningful
> values for those drivers.

So, wouldn't it then make sense to have a kind of "subscription list" for drivers that do change their thermal settings? Meaning, for the suspend case as well as for the resume case, from there could be read out that/ what may have changed inbetween (e.g. some differences changed by the BIOS). What value then needs to be re-evaluated would depend on this comparison and driver/ hardware quality. New values then would be then set depending on the previuos comparison.
Comment 102 Zhang Rui 2014-12-09 15:21:18 UTC
(In reply to Manuel Krause from comment #100)
> To answer all in one:
> * The two working patches + debug patch work for my 3.17.4 very well.
> * To the regression: I've tested 3.18.0-rc5, -rc6 and -rc7 -- and for none
> of them the fan control works. I don't even need to bother with sending a
> dmesg (with the dynamic_debug command run) as there is exactly NO related
> output in it, although I've passed several trigger points up- and downwards.
> I've rechecked the .config for eventual misconfiguration, but it's all in
> it's place. Do I need some special kernel command line in addition?
> * From comparing a 3.17.3 dmesg with a 3.18.0-rc7 after bootup I miss some
> lines completely in the latter (indicated with --- instead of timestamp):
> 
> ...
> [    0.671244] pcie_pme 0000:00:1c.5:pcie01: service driver pcie_pme loaded
> [    0.671294] intel_idle: does not run on family 6 model 23
> [    ---     ] ACPI: Fan [FAN0] (off)
> [    ---     ] ACPI: Fan [FAN1] (off)
> [    ---     ] ACPI: Fan [FAN2] (off)
> [    ---     ] ACPI: Fan [FAN3] (off)
> [    ---     ] ACPI: Fan [FAN4] (off)
> [    0.672777] Monitor-Mwait will be used to enter C-1 state
> [    0.672783] Monitor-Mwait will be used to enter C-2 state
> [    0.672788] tsc: Marking TSC unstable due to TSC halts in idle
> [    0.672794] ACPI: acpi_idle registered with cpuidle
> ...
> 
> I don't know if that's a hint?

No, this message is removed by commit 9519a6356cbf63b1f22a7a208385dc56092c8b7d.

> ! According to tmon these do exist and they
> (the fan levels) are manually adjustible with e.g. 'echo 1 >
> /sys/class/thermal/cooling_device1/cur_state'.
> 
questions:
1. does manually changing the cooling state turn on/off the fan?
2. can you please boot with "acpi.debug_level=0x04 acpi.debug_layer=0x04" for both 3.17 and 3.18 kernel? I think we are missing thermal events during boot.
Comment 103 Manuel Krause 2014-12-10 23:24:12 UTC
(In reply to Zhang Rui from comment #102)
> questions:
> 1. does manually changing the cooling state turn on/off the fan?
> 2. can you please boot with "acpi.debug_level=0x04 acpi.debug_layer=0x04"
> for both 3.17 and 3.18 kernel? I think we are missing thermal events during
> boot.

1. Manual changes to the cooling state (cur_state) like I mentioned above do work for both kernel versions. But at the moment I see: not all are working @ 3.17.6. I have 6 fan speed levels, meaning 5 cooling states plus 1 state "all 0", each representing one fan speed: 0% (all 5 @ 0), 25%, 34%, 45%, 58%, 100% (all 5 @ 1). I don't know when or why one of the levels (25%) doesn't work all of the sudden.

2. In the following I'll attach two dmesg logs for a 3.17.6 kernel and a 3.18.0 kernel. For both I've used
- kernel command line with acpi.debug_level=0x04 acpi.debug_layer=0x04
- issued echo 'module thermal_sys +fp' > /sys/kernel/debug/dynamic_debug/control
- heated the system over the last trip point (74 C) before emergency would kick in
- hibernate & resume
For 3.18.0 I needed to issue extra cooling commands to the fans (maybe you'll read them).
Comment 104 Manuel Krause 2014-12-10 23:30:50 UTC
Created attachment 160361 [details]
kernel-bugzilla-78201-MK.dmesg-3.17.6.log
Comment 105 Manuel Krause 2014-12-10 23:32:09 UTC
Created attachment 160371 [details]
kernel-bugzilla-78201-MK.dmesg-3.18.0.log
Comment 106 Manuel Krause 2014-12-17 21:11:28 UTC
Any insights or news on why kernel's automatic thermal control fails for my system @ 3.18.x kernels?

I've now tested a 3.18.1 with your patches and with a carefully reviewed and partly changed .config. But the above mentioned problems (for 3.18.0-rc) do remain completely. 
Without manually issuing a 1 to /sys/class/thermal/cooling_deviceX/cur_state the system fan's X level would always remain at it's bootup state. After resume I need to manually issue a "echo 0 > /sys/class/thermal/cooling_device0/cur_state" to at least calm down the fan from 100%.

The 3.17.7 works well with your 3 patches.
Comment 107 Manuel Krause 2014-12-17 21:43:22 UTC
And just to ask in advance: If I'd find time to bisect the latter case with 3.18... in the coming days:
a) What should I do with results that don't boot well (3.18-rc1 to -4 were buggy) 
b) should I always apply your patches, too, or should I only look for the commit, when thermal management failed on here?

Thanks for your advise, Manuel
Comment 108 Manuel Krause 2014-12-26 01:16:21 UTC
Coming back from quite a weird odyssey of bisecting and testing...
 ...I'm clueless. :-(

* (offtopic) I had first needed to bisect the culprit of not beeing able to boot my machine with a 3.17.0-rc2 (Comment 34, Comment 35) to finish the follow-up bisect:
 commit 208bf9fdcd3575aa4a5d48b3e0295f7cdaf6fc44
 drm/i915: Fix locking for intel_enable_pipe_a()

* the follow-up was, to bisect, what inhibits automatic fan control in 3.18.x:
I got a bisecting result:
 commit 6ab3430129e258ea31dd214adf1c760dfafde67a
 mfd: Add ACPI support

Reverting this commit on a previously failing 3.17.0-rc2 brings back automatic fan control after bootup --  BUT REVERTING this on a previously failing 3.18.x does NOT help. ((We already know, that my 3.17.7 works well, also with your patches applied.))
So, I honestly don't know WHAT ELSE I need to revert for 3.18.x. Please, have a look and advise!

* Then I've tested a 3.19.0-rc1: Here, automatic fan control does work from bootup, but your patches don't work any more for the hibernation case (neither, without them). Maybe you can review them for this kernel, too?!

Best regards, Manuel
Comment 109 Manuel Krause 2015-01-03 08:59:59 UTC
I'm quite astonished, that there aren't more people affected. 
Kernel 3.18.x is completely useless for my system without manually echoing ones or zeros to /sys/class/thermal/cooling_deviceX/cur_state via console. It's a shame.
Comment 110 Manuel Krause 2015-01-10 03:38:51 UTC
*PING* Zhang Rui, 
can you please answer to Comment 108 (and 109), regarding kernel 3.18.x at least?! Now, that the 3.17.x series got the EOL - the last kernel that worked with your knowntowork patches.
I would need to file a new bug, but that could complicate things even more.

Thanks, and a Happy New Year, Manuel
Comment 111 Manuel Krause 2015-01-18 18:40:28 UTC
See Comment 106. It's still valid for 3.18.3. Please, find some time to help to fix it.
Comment 112 Zhang Rui 2015-01-20 00:26:39 UTC
Hi, Manuel,
(In reply to Manuel Krause from comment #108)
> Coming back from quite a weird odyssey of bisecting and testing...
>  ...I'm clueless. :-(
> 
> * (offtopic) I had first needed to bisect the culprit of not beeing able to
> boot my machine with a 3.17.0-rc2 (Comment 34, Comment 35) to finish the
> follow-up bisect:
>  commit 208bf9fdcd3575aa4a5d48b3e0295f7cdaf6fc44
>  drm/i915: Fix locking for intel_enable_pipe_a()
> 
Does this regression got fixed in the latest upstream kernel?

> * the follow-up was, to bisect, what inhibits automatic fan control in
> 3.18.x:
> I got a bisecting result:
>  commit 6ab3430129e258ea31dd214adf1c760dfafde67a
>  mfd: Add ACPI support
> 
> Reverting this commit on a previously failing 3.17.0-rc2 brings back
> automatic fan control after bootup

Hah, this is really weird, I can not explain how this could happen.


> --  BUT REVERTING this on a previously
> failing 3.18.x does NOT help. ((We already know, that my 3.17.7 works well,
> also with your patches applied.))
> So, I honestly don't know WHAT ELSE I need to revert for 3.18.x. Please,
> have a look and advise!
>
There is a known issue of commit f35cec255557d1037ff0d772edfd6e7b1e92cdc0, which is fixed in 3.19-rc1. But for 3.18, I don't think you can simply revert it. can you please apply commit 78579b7c7eb45f0e7ec5e9437087ed21749f9a9c
and commit 175f8e2650f7ca6b33d338be3ccc1c00e89594ea on top of 3.18.x?

 
> * Then I've tested a 3.19.0-rc1: Here, automatic fan control does work from
> bootup,

which I think is fixed by the above two patches.

> but your patches don't work any more for the hibernation case
> (neither, without them). Maybe you can review them for this kernel, too?!
>
yes, will do.
Comment 113 Manuel Krause 2015-01-20 23:47:26 UTC
(In reply to Zhang Rui from comment #112)
> Hi, Manuel,
> (In reply to Manuel Krause from comment #108)
> > Coming back from quite a weird odyssey of bisecting and testing...
> >  ...I'm clueless. :-(
> > 
> > * (offtopic) I had first needed to bisect the culprit of not beeing able to
> > boot my machine with a 3.17.0-rc2 (Comment 34, Comment 35) to finish the
> > follow-up bisect:
> >  commit 208bf9fdcd3575aa4a5d48b3e0295f7cdaf6fc44
> >  drm/i915: Fix locking for intel_enable_pipe_a()
> > 
> Does this regression got fixed in the latest upstream kernel?

This regression only appeared while bisecting this issue ON HERE with 3.17-rc2+-. It is fixed.

> 
> > * the follow-up was, to bisect, what inhibits automatic fan control in
> > 3.18.x:
> > I got a bisecting result:
> >  commit 6ab3430129e258ea31dd214adf1c760dfafde67a
> >  mfd: Add ACPI support
> > 
> > Reverting this commit on a previously failing 3.17.0-rc2 brings back
> > automatic fan control after bootup
> 
> Hah, this is really weird, I can not explain how this could happen.
> 
> 
> > --  BUT REVERTING this on a previously
> > failing 3.18.x does NOT help. ((We already know, that my 3.17.7 works well,
> > also with your patches applied.))
> > So, I honestly don't know WHAT ELSE I need to revert for 3.18.x. Please,
> > have a look and advise!
> >
> There is a known issue of commit f35cec255557d1037ff0d772edfd6e7b1e92cdc0,
> which is fixed in 3.19-rc1. But for 3.18, I don't think you can simply
> revert it. can you please apply commit
> 78579b7c7eb45f0e7ec5e9437087ed21749f9a9c
> and commit 175f8e2650f7ca6b33d338be3ccc1c00e89594ea on top of 3.18.x?
> 

Commit 78579b7c7eb45f0e7ec5e9437087ed21749f9a9c already made it's way into the kernel as of 3.18.3.
Commit 175f8e2650f7ca6b33d338be3ccc1c00e89594ea is still missing BUT NEEDED: This one brings back automatic fan control for me!!!

Together with your last knowntowork patches this 3.18.3 even works well after suspend-to-disk.

Regards, Manuel Krause

>  
> > * Then I've tested a 3.19.0-rc1: Here, automatic fan control does work from
> > bootup,
> 
> which I think is fixed by the above two patches.
> 
> > but your patches don't work any more for the hibernation case
> > (neither, without them). Maybe you can review them for this kernel, too?!
> >
> yes, will do.
Comment 114 Manuel Krause 2015-01-21 19:47:10 UTC
Any idea, when commit 175f8e2650f7ca6b33d338be3ccc1c00e89594ea can get into the 3.18.x kernel series -- or in other words, can you push it a bit?!

Thank you very much for your work,
Manuel
Comment 115 Manuel Krause 2015-01-27 22:15:11 UTC
Apparently I was too fast with my "good" experience from Comment 113.
The patches you mentioned there regain automatic fan control (vs. temperature).

Regarding your knowntowork patches, they don't work reliably with 3.18.3. The first shot in the above mentioned comment worked well, but I've made a bad experience over time with more reboots and more hibernate+resumes.

The worst case is, with your patches, that the fan goes up to 100% un-neededly and staying there forever, indicated by "/":

Boot:       1   2   3
Hibernate1: /   /   +
2           /   +   +
3               /   /
4               /   +
5               /
6               /

Best regards, Manuel
Comment 116 Claire Farron 2015-01-30 14:21:12 UTC
Hi, I don't know if this is related, but quite a few users on ArchLinux are suffering from the fan running at full speed all the time starting with 3.18.2.

Reverting to 3.18.1 fixed the problem for one, whilst the rest reverted back to 3.17.6 without trying 3.18.1.

This is the related BBS thread for information: https://bbs.archlinux.org/viewtopic.php?id=192255

Do you want me to see if commit 175f8e2650f7ca6b33d338be3ccc1c00e89594ea fixes their problem?
Comment 117 Manuel Krause 2015-01-30 18:16:12 UTC
Hi Claire,
I can definitely encourage you to either apply the commit 175f8e2650f7ca6b33d338be3ccc1c00e89594ea on kernels before 3.18.5 or better just use this newer kernel. Currently, I'm testing it (good results!) but will need more boots and hibernates+resumes to have a reasonable final report.

Regarding the linked forum thread (that I've been following for some weeks now), I'm not sure if all the mentioned problems there are connected to this bug here or they get fixed by the above commit(s). At least, it's certainly the same problem complex, you only need to read that the involved laptops belong to one HP/Compaq notebook family -- most probably incorporating one weird BIOS bug/glitch.

Try, and please report back! Thank you!
Comment 118 Claire Farron 2015-02-01 14:16:50 UTC
(In reply to Manuel Krause from comment #117)
> Hi Claire,
> I can definitely encourage you to either apply the commit
> 175f8e2650f7ca6b33d338be3ccc1c00e89594ea on kernels before 3.18.5 or better
> just use this newer kernel. Currently, I'm testing it (good results!) but
> will need more boots and hibernates+resumes to have a reasonable final
> report.
> 
> Regarding the linked forum thread (that I've been following for some weeks
> now), I'm not sure if all the mentioned problems there are connected to this
> bug here or they get fixed by the above commit(s). At least, it's certainly
> the same problem complex, you only need to read that the involved laptops
> belong to one HP/Compaq notebook family -- most probably incorporating one
> weird BIOS bug/glitch.
> 
> Try, and please report back! Thank you!

It would appear that whilst it's the looks to be the same problem area, it seems that this patch isn't working for them. I reported it as https://bugzilla.kernel.org/show_bug.cgi?id=92431
Comment 119 Manuel Krause 2015-02-01 23:51:51 UTC
Claire's right in opening the new BUG 92431 for the problems that are shown in the archlinux forum thread (comment 116).

The commit 175f8e2650f7ca6b33d338be3ccc1c00e89594ea only brings back automatic fan control from bootup for my system. It doesn't improve anything else regarding this bug here.

To amend comment 115, the known-to-work patches don't only "work unreliably" on 3.18.x (now with .5) -- they honestly don't make any difference vs. a kernel without them. 
Actual behaviour: After resume from suspend to disk the fan goes to 100% and will only get down to lower and normal values if the trigger temperature drops downwards below one of the next (lower) temp trip points. The reason for the positive report in comment 113, what I was able to investigate so far, was having a warm system and a dropping trigger temp during resume, so that the next lower trip point was passed, always inducing a fan speed adjustment to lower speed. Sad, that this was apparently a row of coincidents for that testing. 
Unfortunately a kernel 3.18.5 without these three known-to-work patches behaves the same way, so I need to conclude: they don't work for 3.18.x any more, too.
Comment 120 Manuel Krause 2015-02-07 21:10:14 UTC
Can you, PLEASE, come back to this bug? 
I want to get rid of the current behavoiur, that after resuming from suspend the fan is at 100% unnededly. (And I may need to heat up the system beyond next upper "real" trip_point and then cool down to next lower trip_point to get the automatic back working again.) 
Your patches seem to NOT work any more.

Why can't there simply be read in the actual temperatures after resume for the purpose to figure out the needed fan level?

Please, come over with new patches for 3.18.6+ !

Thanks, in advance,
Manuel
Comment 121 Mario Becroft 2015-02-07 22:35:29 UTC
FWIW, on my similar laptop exhibiting the same issue, I simply found the the EC register controling fan aspeed and wrote a userland script to bend it to my will using the ec debug driver. Not exactly elegant, but effective. I can share with you notes on how to do this if it would be useful.
Comment 122 Manuel Krause 2015-02-08 18:27:25 UTC
Actually, I don't want to discorage Zhang Rui to fix the correct temp. initialisation on this notebook family...

@Mario Becroft, if it doesn't make too much extra work, you can put the usage instructions on top of your userland script and upload it somewhere and post the link here (or, of course, post it to my email address only). But if it's useful for you and you want to share it, I think other affected users should benefit from it, too.

I also thought of simply issuing a command, after every resume, to at least silence the 100% fan level ('echo 0 > /sys/class/thermal/cooling_device0/cur_state'). So far I have not investigated, if that would/could interfere any later automatic control from kernel side.

Thank you in advance and best regards,
Manuel
Comment 123 Manuel Krause 2015-02-08 21:10:01 UTC
Also, a pure 3.19.0-rc7 doesn't improve anything. :-(
Comment 124 Manuel Krause 2015-02-14 00:23:35 UTC
To bring back some sunshine :) to this BUG:

@ Zhang Rui: I've now done a new row of new tests with the 3.18.7 kernel with the digged out "for-upstream-patches". They are originating from Comment 61 and Comment 62. With modified versions to apply to the current kernel I was able to get a good behaviour after resume.

Most important change needed to get it work, was, so to say:

--- a/drivers/thermal/thermal_core.c
+++ b/drivers/thermal/thermal_core.c
@@ -169,6 +169,13 @@ int get_tz_trend(struct thermal_zone_device *tz, int trip)
 {
 	enum thermal_trend trend;
 
+	/*
+	 * If it's the first time to read the temperature, pretend it is DECREASING
+	 * -- otherwise we'd get 100% fan speed (emergency level) after resume.
+	*/
+	if (tz->last_temperature == THERMAL_TEMP_INVALID)
+		return THERMAL_TREND_DROPPING;
+	 
 	if (tz->emul_temperature || !tz->ops->get_trend ||
 	    tz->ops->get_trend(tz, trip, &trend)) {
 		if (tz->temperature > tz->last_temperature)

Regarding the safety issue, why you've wanted to set THERMAL_TREND_RAISING, I can say, either while booting or while resuming, this is counter-productive: The fan has passed all tests with actualising temperatures inbetween (boot/resume).

I'll report back, on how 3.19.0 works with this combo.

Best regards,
Manuel
Comment 125 Manuel Krause 2015-02-14 01:12:29 UTC
Created attachment 166701 [details]
BUG_78201.kernel-3.18.7.0001-THERMAL-initialize-thermal-zone-device-correctly.MOD6

1 of 2 good patches for 3.18.7
Comment 126 Manuel Krause 2015-02-14 01:17:03 UTC
Created attachment 166711 [details]
BUG_78201.kernel-3.18.7.0002-Thermal-handle-thermal-zone-device-update-events-cor.MOD1

1 of 2 good patches for 3.18.7
Comment 127 Manuel Krause 2015-02-14 01:19:38 UTC
Comment 126 is: sorry, for the typo, No. 2 of 2 good patches for 3.18.7
Comment 128 Manuel Krause 2015-02-14 20:57:45 UTC
O.k., again good news: Several reboots and hibernation cycles later I can say, that also 3.19.0 works with these two patches. Do you want me to attach debugging output, for you to check the technical / programmer's aspects of this approach?

Can you, please, also say something about if there exists any relation to BUG 91411?

Thank you in advance,
Manuel
Comment 129 Ville Syrjala 2015-02-15 23:08:01 UTC
Just FYI my HP/Compaq nc6000 is also hit by this bug. The patches from comment 125 and comment 126 seem to do the trick and get it back to working condition.

I had an additional problem of thermal notifications being AWOL but I bisected that down and sent a revert [1].

Now, the last time thermal was broken on this machine wasn't that long ago (bug 56591). These HP/Compaq machines seem to be extremely fragile wrt. thermal/acpi issues, so I would suggest Rui or someone dealing with the thermal subsystem get themselves at least one of these machines for some regular regression testing. Otherwise I fear these regressions will keep on coming.

[1] http://marc.info/?l=linux-acpi&m=142403948106603&w=2
Comment 130 Zhang Rui 2015-02-16 02:22:46 UTC
Hi, Manuel,

first of all, I'd like to say thank you for your continuous help on this issue and your feedback really helps a lot. I'm sorry that usually my response is a bit slow because I'm not full time work on bugzilla, plus I need to handle tens of bugs at the same time.

your recent comments are really helpful. If I'm right, the only difference from my original patches is that you change the thermal trend to dropping instead of increasing, right? I choose trend raising just for safety reason, because if choose raising, the bad effect would be the fan spinning unnecessarily, but if choose dropping, the bad effect would be cause hardware damage if the system resumes while overheating.

I will rework the patches a bit and update soon.
Comment 131 Zhang Rui 2015-02-16 03:08:30 UTC
Created attachment 166981 [details]
patch for upstream: 1-2
Comment 132 Zhang Rui 2015-02-16 03:08:54 UTC
Created attachment 166991 [details]
patch for upstream: 2-2
Comment 133 Zhang Rui 2015-02-16 03:09:33 UTC
Hi, Manuel,

please try these two patches on top of 3.19 and see if the problem still exists.
Comment 134 Manuel Krause 2015-02-17 00:14:49 UTC
Created attachment 167201 [details]
patch-for-upstream:1-2 corrected version

I now attach a corrected version of the FIRST patch that did not apply and not compile cleanly with 3.19.0

Many thanks go to Olaf Hering from BUG 67101 in his Comment 24 for his work. https://bugzilla.kernel.org/attachment.cgi?id=167121&action=diff
Comment 135 Manuel Krause 2015-02-17 00:41:50 UTC
Hi, Rui, 

no, these new patches do result in the wrong behaviour, again: After resuming from hibernation, always the emergency (100%) fan level is on, what is not needed at all. And from my observation this also is NOT always the next upper fan level from the current temperature <-> trip_point relation!

I think "the automatic" should stay at least as low as the level before hibernate and then re-consider by next time reading the actual temperatures. 
Actually with the patches from my Comments 125 & 126, here is no problem with choosing a lower next trip_point, as it is actually re-read during booting or during resuming from hibernate. Read it like: I can HEAR the proper fan adjustment upwardly during the boot/resume procedures, although we had a DROPPING policy.

Best regards,
Manuel
Comment 136 Manuel Krause 2015-02-17 01:09:37 UTC
From my observations: I need to make drop temperature over 2 trip_points, after resume from hibernation, downwards, to get a working fan again, due the false initialisation by code (100% ->58% fan), temp < 68°C, and possibly the other one below (58%->45% fan) temp < 54°C.

Please, always consider, the system is going through the BIOS (that is setting fan speeds as well) also for resuming from hibernate.

Manuel
Comment 137 Sebastian M. 2015-02-18 04:15:12 UTC
I am experiencing the same problem on my HP 6910p. Fan starts to spin at full speed after middle of boot-process.
Comment 138 Manuel Krause 2015-02-18 22:00:40 UTC
(In reply to Sebastian M. from comment #137)
> I am experiencing the same problem on my HP 6910p. Fan starts to spin at
> full speed after middle of boot-process.

Hi, Sebastian,

this BUG here covers the issue of 100% fan speed after suspend to disk/ram.

For issues with high fan speeds already from bootup you should, please, have a look at the newly appeared BUG 92431 AND BUG 93301 -- to see if they better match your system's failure.

Best regards,
Manuel
Comment 139 Manuel Krause 2015-02-27 14:33:09 UTC
(In reply to Zhang Rui from comment #133)
> Hi, Manuel,
> 
> please try these two patches on top of 3.19 and see if the problem still
> exists.

This failure must not go into kernel 4.0.

Please, let us fix it before.

BR, Manuel
Comment 140 Zhang Rui 2015-03-02 02:44:58 UTC
(In reply to Manuel Krause from comment #135)
> Hi, Rui, 
> 
> no, these new patches do result in the wrong behaviour, again: After
> resuming from hibernation, always the emergency (100%) fan level is on, what
> is not needed at all. And from my observation this also is NOT always the
> next upper fan level from the current temperature <-> trip_point relation!
> 
please enable the dynamic debug in drivers/thermal/thermal_core.c and drivers/thermal/step_wise.c and attach the dmesg output after this happens.
Comment 141 Manuel Krause 2015-03-03 18:38:00 UTC
Created attachment 168751 [details]
dmesg with 2 hibernations with Rui's two 3.19-related patches

This dmesg.log describes the following scenario:
booting into the relatively cold system, "echo 'module thermal_sys +fp' > /sys/kernel/debug/dynamic_debug/control", suspend to disk and resume shortly afterwards,
fan is @ 100% and it takes about 81s after resume, that the next lower trip point is passed, so the fan then goes down,
now I increase CPU load and wait to warm the system and suspend to disk again and resume again shortly afterwards,
fan again @ 100%, nothing happens for a long period of time, then I reduce CPU load, and shortly afterwards the fan spins down by 3(!) levels.

Reminder: For my system thermal_zone1 [CPUZ] is the relevant one for cooling and only cooling_device0..4 are the fan levels (25%, 34%, 45%, 58%, 100%) which are themselves shown in thermal_zone4 [FDTZ]. And the TuxOnIce debugging info comes when resume has almost finished but before my ethernet attaches.

This time I invested some time to try to understand, what this debugging output possibly means. But I don't understand, what thermal_sys does NOT do. E.g. after both resumes, why are cooling_device2..4 set to 1 (their state already is 1, BTW), but devices 0..1 are left at cur_state 1 and not reset? The related temperature would allow setting the latter devices to state 0. In my opinion / understanding, this is the point. Can they eventually be left out, because the temp is lower than their corresponding trip_point temp? That shouldn't happen.

I hope this helps, best regards,
Manuel
Comment 142 Zhang Rui 2015-03-04 02:54:31 UTC
Created attachment 168791 [details]
0001-Thermal-initialize-thermal-zone-device-correctly.patch
Comment 143 Zhang Rui 2015-03-04 02:56:04 UTC
Created attachment 168801 [details]
0002-Thermal-handle-thermal-zone-device-update-events-cor.patch

Great, I think I've found the root cause.

I've made several changes in this new patch set and rebase it on top of 4.0-rc2, can you please help check if they fix the problem for you or not?
Comment 144 Manuel Krause 2015-03-04 22:46:14 UTC
Created attachment 169051 [details]
dmesg with 2 hibernations with Rui's two 4.0-rc2-related patches

Great work, I'd like to say! The new patches DO WORK FOR ME.
Please, have a close look to the attached log, to check if thermal_sys is doing everything like you intended. I found it a bit confusing to follow, in which sequence and how often things get checked and states adjusted (and what not).

The test in the log follows the scenario from Comment 141 run on a 4.0-rc1. During boot everything is o.k. The fan does not resume with 100% fan anymore and also quickly adjusts according to any increased/decreased load. I only had one problem: The very first resume with 4.0-rc1 with your new patches with a relatively cold system came up with 100% fan, but adjusted after ~91s. I was not able to reproduce it, but don't want to omit to mention it. (Any ideas, how this may happen?) 
The attached log is from the third successful run of my scenario.

I've now also patched my actual 3.19.0 setup with your new ones (needed only one context line to be changed, the one with the typo in original 3.19.0) -- and it also passed my tests. Resuming from suspend to RAM also works fine.

So, please, check if the log shows the right handling from your programmer's point of view.

Me being lucky now, best regards and many thanks to you, especially for your patience and endurance,
Manuel
Comment 145 Manuel Krause 2015-03-04 23:08:11 UTC
Created attachment 169061 [details]
Newest 0001 (1of2) patch for the 3.19.0 kernel

For testers on the "old" 3.19 kernel series.
Comment 146 Manuel Krause 2015-03-04 23:11:11 UTC
Created attachment 169071 [details]
Newest 0002 (2of2) patch for the 3.19.0 kernel

For testers on the "old" 3.19 kernel series.

Good luck, and please, report back! :-)
Comment 147 Manuel Krause 2015-03-11 00:51:15 UTC
Patches from Comment 145 and Comment 146 still working with 3.19.1 plus the debug patch from BUG 91411 https://bugzilla.kernel.org/attachment.cgi?id=169941 (although I didn't need it).
Comment 148 Manuel Krause 2015-03-11 01:12:37 UTC
Created attachment 170341 [details]
dmesg.log from working 3.19.1 according to Comment 147
Comment 149 Zhang Rui 2015-03-14 08:32:00 UTC
Created attachment 170581 [details]
patches for upstream

Hi,

This is the new version of patches that I'd like to submit for upstream.
The first three patches, which can fix the problem you have IMO, are likely to go to 4.0, and the follow seven patches are some enhancements based on the first three patches.
Please check if they work for you or not. (Please attach the dmesg output after resume, with dynamic debug enabled)
Comment 150 Zhang Rui 2015-03-14 13:37:13 UTC
Created attachment 170611 [details]
patches for upstream - 2

Please help me do the test in two steps.
1. please check if patch 1~4 fixes the problem for you or not.
2. please apply all the patches in this tarball and check if the problem is gone.
Comment 151 Manuel Krause 2015-03-14 20:55:18 UTC
Created attachment 170641 [details]
dmesg of 3.19.1 with patches 0001..0004 applied

So, the first round of tests succeeded! :-) This is with the first four patches. Please don't mind, that I've tested kernel 3.19.1, first. (Let me know, if you want me to redo it on kernel 4.0-rc3.)
BTW, I've left out the "0001-Debug-patch-to-sync-thermal-zone-update.patch" from the tarball as the code obviously reappears under the correct new name "0004-Thermal-make-thermal_zone_device_update-atomic.patch".

The attached dmesg shows one suspend-to-disk with relatively cold system and low load, then one suspend-to-disk with a hot system followed by a suspend-to-RAM.

From bootup through all resumes everything works fine and the fan levels are adjusted quickly on load change. The log also doesn't show any double thermal zone update confusions any more, like it was before. 
Good work! Thank you very much!
And kernel with the next seven patches is compiling atm.
Comment 152 Manuel Krause 2015-03-14 22:45:26 UTC
Created attachment 170651 [details]
dmesg of 3.19.1 with patches 0001..0011 applied

O.k. The second round succeeded, too. :-)))
Nice to read, that this bothering BUG may lead to an overall improvement for the thermal system!

With all 11 (eleven) patches applied. Plus the one that 3.19.1 doesn't have so far, to be applied _before_ "0005-Thermal-make-thermal-framework-be-aware-of-thermal-m.patch":
direct link: https://github.com/torvalds/linux/commit/12ca7188468ee29c4e717f73db4bf43c90954fc7.patch

Same tests as in Comment 151. 
Same good results also in the log.

For the case of an over night really cooled down system, I'll report back tomorrow.

Best regards,
Manuel
Comment 153 Manuel Krause 2015-03-14 23:29:54 UTC
As a question aside:

My system now (running your 11 patches) sometimes gets something like this:

[  871.712504] thermal_zone_trip_update: thermal thermal_zone1: Trip4[type=0,temp=60000]:trend=1,throttle=1
..................^^^^^
[  871.716620] thermal_zone_trip_update: thermal thermal_zone1: Trip4[type=0,temp=55000]:trend=1,throttle=1
..................^^^^^

This is cut out from the last posted dmesg. And I see this 55000 as trip_point in the running tmon.

This is contradicting somehow. And I think, the 60000 (60°C) would be more useful for cooling on here. Why does the trip point temp get changed?

I know, that this occurred before on here, but not what patches were involved at that former times.

Please, have a look on it.

Thanks,
Manuel
Comment 154 Manuel Krause 2015-03-15 00:17:08 UTC
The beahviour described in Comment 153 has NOTHING to do with the current patch testing!
Comment 155 Manuel Krause 2015-03-15 18:34:29 UTC
O.k., also with a really cold system (with the 11 patches) everything works fine.

For me the issue is fixed and gets my tested-by tag.

I hope the slight change between the "0001-Debug-patch-to-sync-thermal-zone-update.patch" and the "0004-Thermal-make-thermal_zone_device_update-atomic.patch" is really intended.

Please, let me know, if you wish me to perform more/ different tests.
Comment 156 Zhang Rui 2015-03-16 13:29:57 UTC
I think this is done by BIOS, i.e. BIOS changes the trip point value after a trip point is crossed.
The reason is to avoid fan device bumping around a certain temperature, say, fan on at 60c, temperature drops, and then fan off, and temperature raises and fan on...
Comment 157 Zhang Rui 2015-03-16 13:39:37 UTC
Created attachment 170771 [details]
patch v3

please try this patch set instead.
I made some changes in this patch set, and consider this as 4.0 material.
Please give it a try.
Comment 158 Manuel Krause 2015-03-16 20:30:06 UTC
Created attachment 170831 [details]
dmesg of 4.0.0-rc3 with only the four v3 patches applied

This time I booted with kernel command line append module.dyndbg="module thermal_sys +fp" dyndbg="file thermal_core.c +fp; file step_wise.c +fp" to see and show what happens at boot for my system.

dmesg shows bootup, one relatively cold suspend-to disk, one hot suspend-to disk and one suspend-to RAM.

Everything works fine and, if I understand it correctly, the log also shows that all is in it's place. :-)
Comment 159 Manuel Krause 2015-03-16 22:29:23 UTC
Just for the records: The patches from Comment 157 do also cleanly apply, compile and work fine on a 3.19.1 kernel. (I don't bother you with another dmesg.log as I compared both and they just show the same behaviour for the same tests.)

Rui, I hope that you can come to a final decision about the smartest solution for this BUG soon and give it to the stable and longterm maintainers, too. Remember that this bug is open since 3.12.x kernels.

Thank you very much again and best regards!
Comment 160 Martin Spacek 2015-03-18 01:17:28 UTC
Hello! I had this problem on an ASUS Chromebox (Intel Haswell Celeron 2955U CPU) running Xubuntu 14.04. It only has a single fan, for the CPU. With newer kernels, starting around 3.18, the CPU fan would spin up to max after resuming from suspend, making quite a loud whine for about 10-20 sec before settling back down to its normal, nearly inaudible level.

I just tried Zhang's patchset (comment 157), applying it to the latest Ubuntu 3.19.1 based kernel (ubuntu-vervet 3.19.0-9.9), and the problem is fixed! Fan seems to be running at normal speed now on resume. Thanks!
Comment 161 Martin Spacek 2015-03-18 01:41:33 UTC
I just noticed that after the first boot with the patchset, and after maybe the second suspend, the fan would briefly pulse up quite high on resume (maybe 75% of maximum noise) for maybe 1 second, before settling back down. On that boot, that was consistent for several suspend/resume cycles in a row.

After rebooting, I haven't been able to replicate it. It does still pulse up very briefly on resume, but not nearly as loudly. I have to put my ear near it to notice. I don't think this is due to an actual temperature difference. On the current boot, with no really noticeable fan noise on resume, I tried running a Python loop indefinitely, eating up 100% of a core. Then suspended and resumed, no effect on fan resume noise.
Comment 162 Martin Spacek 2015-03-18 01:57:33 UTC
Created attachment 170961 [details]
Chromebox dmesg after patch

Sorry for the spam, but I just noticed what was causing the brief fan pulse on resume: it pulses if I have an SD card in the slot. If I remove it, I get virtually no fan pulse on resume. Weird. Maybe the order of initialization on resume changes with an SD card in the slot, which allows the fan to spin up higher by default before thermal controls have a chance to bring it back down. Here's a dmesg with 8 suspend/resume cycles. The first 4 had the SD card and high fan pulse (1 sec) noise, the next 2 had no SD card and virtually no pulse noise, and the last 2 had the SD card and high fan pulse noise again.
Comment 163 Martin Spacek 2015-03-18 02:07:51 UTC
And to further clarify, the brief fan pulse doesn't happen with an inserted USB stick. I've tried two different USB sticks in two different slots. It only happens with an SD card (I've tried two different ones) in the SD card slot.
Comment 164 Manuel Krause 2015-03-18 02:17:43 UTC
With my 3.19.1 kernel I use the TuxOnIce pathces, that seem to use CPU heavily upon suspend-to-disk/resume. The Fan would sometimes go to 100% in each procedure. But that is o.k. as a safety measuere. With the current patches from Rui the system always calms down, at least when X is up, just before the desktop (KDE) re-login shows up. And that's less then 20s after boot/resume.
Comment 165 Manuel Krause 2015-03-18 23:25:42 UTC
@ Zhang Rui:
While looking at BUG 67101, and manually adjusting fan levels myself by issuing:
"echo 0 > /sys/class/thermal/cooling_device0/cur_state" # top fan speed 100%
"echo 0 > /sys/class/thermal/cooling_device1/cur_state" 
"echo 0 > /sys/class/thermal/cooling_device2/cur_state" 
"echo 0 > /sys/class/thermal/cooling_device3/cur_state" 
"echo 0 > /sys/class/thermal/cooling_device4/cur_state" # lowest fan speed 25%/0%

the FAN later on only takes the actual trip_point <-> temp into account. The lower trip_point <-> temp relations are not recalled/ used. The fan then bounces from 58% (actual and last used trip_point temp met) to 0% (when temp lower than this actual trip_point).
If I refill the appropriate lower /sys/class/thermal/cooling_device?/cur_state with "1" then (with appropriate device numbers), manually, everything works well again afterwards. (E.g. on load change it'll find next upper/lower fan level.)

Watching tmon during this testing shows this behaviour, too.

-- I can't consider this as failure safe. People are not said to manually change their FAN levels... -- But this behaviour is not correct. Even if manual interference is not expected, the "automatic" should find back to the next right lower/upper fan level.
Comment 166 Sebastian M. 2015-03-19 21:07:06 UTC
Hello! 
It appears that this bug can't be reproduced after applying these patches and thus is fixed. My question, though, is when are these patches going to flow into a new release? Will linux-4.0.0 come without this bug? 

Thanks...
Comment 167 Manuel Krause 2015-03-20 01:12:30 UTC
(In reply to Sebastian M. from comment #166)
> Hello! 
> It appears that this bug can't be reproduced after applying these patches
> and thus is fixed. My question, though, is when are these patches going to
> flow into a new release? Will linux-4.0.0 come without this bug? 
> 
> Thanks...

There do remain danger zones. Comment 165. So I don't like you to talk of "a fixed" BUG.
Comment 168 Zhang Rui 2015-03-20 04:10:10 UTC
(In reply to Manuel Krause from comment #165)
> @ Zhang Rui:
> While looking at BUG 67101, and manually adjusting fan levels myself by
> issuing:
> "echo 0 > /sys/class/thermal/cooling_device0/cur_state" # top fan speed 100%
> "echo 0 > /sys/class/thermal/cooling_device1/cur_state" 
> "echo 0 > /sys/class/thermal/cooling_device2/cur_state" 
> "echo 0 > /sys/class/thermal/cooling_device3/cur_state" 
> "echo 0 > /sys/class/thermal/cooling_device4/cur_state" # lowest fan speed
> 25%/0%
> 
> the FAN later on only takes the actual trip_point <-> temp into account. The
> lower trip_point <-> temp relations are not recalled/ used. The fan then
> bounces from 58% (actual and last used trip_point temp met) to 0% (when temp
> lower than this actual trip_point).
> If I refill the appropriate lower
> /sys/class/thermal/cooling_device?/cur_state with "1" then (with appropriate
> device numbers), manually, everything works well again afterwards. (E.g. on
> load change it'll find next upper/lower fan level.)
> 
please redo the test with thermal dynamic debug turned on, and attach the dmesg output after the fan bounces from 58% to 0%.
Comment 169 Zhang Rui 2015-03-20 04:14:22 UTC
IMO, the symptom is something related with buggy BIOS, and it is a different problem.
Thus I'd like to send these four patches together with the patch at https://bugzilla.kernel.org/show_bug.cgi?id=92431#c78 out ASAP, so that they can be included in 4.0, and I will push these patches for 3.19 stable kernel.

what do you think?
Comment 170 Manuel Krause 2015-03-20 23:04:51 UTC
(In reply to Zhang Rui from comment #169)
> IMO, the symptom is something related with buggy BIOS, and it is a different
> problem.
> Thus I'd like to send these four patches together with the patch at
> https://bugzilla.kernel.org/show_bug.cgi?id=92431#c78 out ASAP, so that they
> can be included in 4.0, and I will push these patches for 3.19 stable kernel.
> 
> what do you think?

Of course, yes: Please send out these five patches for 4.0 and push them in 3.19 !
Best working solution for these notebooks, so far, since the bug exists.

Noone normally bothers with adjusting fan speeds manually, so my Comment 165 shouldn't be a show stopper for the more important fixes that are in your patches.

I would be glad to see them in the kernel, best regards and thank you for your great work,

Manuel
Comment 171 Manuel Krause 2015-03-20 23:39:38 UTC
Created attachment 171521 [details]
dmesg of a 4.0-rc4 with the five relevant patches with bouncing fan

Unfortunately, the vanilla 4.0-rc4 seems to dissipate more temperature than my 3.19.2 with several other patches, so I needed to verify the logs first for a longer time. The fan HERE actually bounces from 100% to 0% and back. (The 3.19.2 kernel bounced from 58 to 0 and back.)

The dmesg log only shows the warming up of my notebook, under CPU load, while I continuously deactivate the relevant fan levels.
"echo 0 > /sys/class/thermal/cooling_device0/cur_state" # top fan speed 100%
"echo 0 > /sys/class/thermal/cooling_device1/cur_state" 
"echo 0 > /sys/class/thermal/cooling_device2/cur_state" 
"echo 0 > /sys/class/thermal/cooling_device3/cur_state" 
"echo 0 > /sys/class/thermal/cooling_device4/cur_state" # lowest fan speed 25%/0%

Then it ends up bouncing between 100% and 0%. Same seen on tmon + gkrellm.

I've written into the log, where I first suspect a failure in the automatic: 
"NO THERMAL_CDEV_UPDATE for cooling devices lower than thermal cooling_device0 happens (meaning devices 1 to 4), ALTHOUGH their target state is correctly set to 1 and their cur_state is 0.

A BIOS bug? Maybe. But maybe, that you have a nice idea for a failure safe coded workaround.

Thank you in advance,
Manuel

P.S.: I can repeat the test on 3.19.2 again, if you wish, but that would only show the bouncing between 58 and 0 but not bring more evidence.
Comment 172 Manuel Krause 2015-03-21 01:03:54 UTC
Mmmh. Maybe, you should simply forget my Comment 171 and Comment 165. After spending more thoughts on them:

If someone really manually turns off all the fans in the Linux internals, she/he should know of the danger that this action imposes. And, on the other side, everyone should be able to turn off special fan levels. (This is possible with your patches.)

As your patches prevent conditions, that need manual intervention, at system side on boot and resumes from disk/RAM, I'd say for the current state that there is no manual action needed any more.

So, please, push these five patches!

Manuel
Comment 173 Manuel Krause 2015-03-21 20:04:00 UTC
Maybe my last bothering question regarding this bug complex, especially regarding the patch from https://bugzilla.kernel.org/show_bug.cgi?id=92431#c78:

If I don't get the 
printk("head 0x%p, head->next 0x%p, pos->cdev_node 0x%p, pos->tz 0x%p, next->tz 0x%p\n", &cdev->thermal_instances, (&cdev->thermal_instances)->next, &pos->cdev_node, pos->tz, next->tz);

from this applied patch in my logs like prash's logs from https://bugzilla.kernel.org/show_bug.cgi?id=92431#c79 show them whenever a thermal_zone gets registered 
-- is this indicating a false behaviour or just that my system just does not pass this section of code in a correct sense?

Thank you in advance for clarifying!
Comment 174 Zhang Rui 2015-03-23 08:19:31 UTC
Hah, the patch from https://bugzilla.kernel.org/show_bug.cgi?id=92431#c79 puts the fan devices to proper state if the fan is registered AFTER thermal zone device registered.
And this rarely happens.
Comment 176 Manuel Krause 2015-03-24 23:53:23 UTC
Thank you for submitting these first three patches.

Should we monitor these patchwork URLs for the ongoing discussion or would you inform us on here about patch changes?
Comment 177 Manuel Krause 2015-03-31 23:16:58 UTC
Mmh, should we worry???
See my question @ https://bugzilla.kernel.org/show_bug.cgi?id=91411#c91
that you have not answered so far.

Also, you've changed things within the first three patches. Up to V3:
 https://patchwork.kernel.org/patch/6127231/
 https://patchwork.kernel.org/patch/6127221/
 https://patchwork.kernel.org/patch/6127241/

Are these patches still compatible with "the old" other two:
 https://bugzilla.kernel.org/attachment.cgi?id=171921
 https://bugzilla.kernel.org/attachment.cgi?id=171931
?

Please, also read all the replies more carefully in the kernel patchwork area. I've now seen more than two cases where you did not answer to reviewers' questions. You've unfortunally archived them 'away'.

Last one:
> > +/* Invalid/uninitialized temperature */
> > +#define THERMAL_TEMP_INVALID       -27400
> > +
> 
> Did you mean -274000 (note the missing '0') which is < 0K? A comment would be
> helpful for anybody reading the code. Took me a while to parse it.
> 
> >  /* Unit conversion macros */
> >  #define KELVIN_TO_CELSIUS(t)       (long)(((long)t-2732 >= 0) ?    \
> >                             ((long)t-2732+5)/10 : ((long)t-2732-5)/10)

Thanks for more transparency,
Manuel
Comment 178 Manuel Krause 2015-03-31 23:24:29 UTC
And what do the re-newed patches change? Should we re-test with them?
Comment 179 Zhang Rui 2015-04-01 07:22:40 UTC
yes, thanks for reminding me. :)
I didn't ask you guys to test because I think there is no functional change.
But please, kindly help me test the patch V4, to make a double check.
Comment 180 Manuel Krause 2015-04-01 19:21:11 UTC
Maybe for V4 to appear we need April 1st to finish?! ;-)

I've tested the V3 patches from the links written in Comment 177, instead, and, although you don't like it, tested them on my 3.19.3 kernel (only with the three patches, omitting the two other patches that you consider as 4.1 material).

Works well so far, from booting over to hibernating with a hot and also with a relatively cold system. Also, Suspend-to-RAM is resuming well.

Maybe I've still some issue with the very lowest fan level, like I may have reported before, already. Although the cooling device 4 is set to 1, system being at very low temperature, I don't see it beeing reported via the fan RPM representation in acpitz4. I watch this in tmon and in gkrellm, too. 
And, it seems to be a bit difficult for me to verify that this fan level (25%) gets REALLY activated in the hardware. When the next upper fan level (34%) kicks in, I can verify this by hearing a slightly higher noise than my hdds usually produce and by watching a flickering lighter's flame near the fan outlet. That's not possible with the 25% level, either because the fan is still off -- or it is running that slowly that no flame can flicker. For now I don't want to use a screwdriver to be able to see the fan spinning at 25%.
I fyou have a possible explanation from the programming side or any other idea for me to verify on here, please, let me know.

Thank you very much,
Manuel
Comment 181 Manuel Krause 2015-04-04 17:35:46 UTC
Was it intended by you to tag the three V3 patches @patchwork as "superseeded" and no newer ones (e.g. V4) showed up there? I'm just wondering.
Comment 182 Manuel Krause 2015-04-07 22:52:10 UTC
The time frame for 4.0-final is getting tight.
Please, consider Eduardo Valentin's comments {I mentioned earlier, that you should read more closely...} @ https://patchwork.kernel.org/patch/6166681/  and push out a V5 patchset soon, that I can test. The V4 vs. V3 comparison results in one poor added 0. ;-(
Comment 183 Zhang Rui 2015-04-08 12:55:32 UTC
Created attachment 173531 [details]
patch v5 1/3
Comment 184 Zhang Rui 2015-04-08 12:55:57 UTC
Created attachment 173541 [details]
patch v5 2/3
Comment 185 Zhang Rui 2015-04-08 12:56:19 UTC
Created attachment 173551 [details]
patch v5 3/3
Comment 186 Zhang Rui 2015-04-08 12:57:54 UTC
well, my evolution is broken and I did missed some email when using outlook.
Sorry about this.
Please give the above patches a try.
Comment 187 Manuel Krause 2015-04-08 16:22:51 UTC
Hi, Rui!
Thank you for the updates.
The three V5 patches work very well in all scenarios (bootup cold, cold+hot resume from disk, as well as resumes from RAM)! Thank you very much!

BTW, please be aware, that Eduardo has even written another comment @ V4-1of3: https://patchwork.kernel.org/patch/6166681/

Best regards,
Manuel
Comment 188 Manuel Krause 2015-04-09 19:56:43 UTC
@Rui: Your last two messages to https://patchwork.kernel.org/patch/6166681/ are not readable, please check, what went wrong.
Comment 189 amish 2015-04-10 02:45:25 UTC
Its in base64. You can read them by passing those crypted lines to "base64 -d"
Comment 190 Manuel Krause 2015-04-10 15:43:47 UTC
Thank you! I didn't know it's that easy. ;-)
So, lets hope for V6 patches to surface...
Comment 191 Manuel Krause 2015-04-12 20:27:41 UTC
@ Rui:
Are we going to see a V6 patches' revision somewhere, or could you just send in _tested_ V5 for inclusion into kernel 4.0 final?
Comment 192 Manuel Krause 2015-04-15 23:44:41 UTC
ping ...
Comment 193 Manuel Krause 2015-04-30 18:29:53 UTC
Any news from you, Rui?

As it didn't go into 4.0.x ...
Where is V6 of your patches?

And what about the locking fix work?

What about your findings with 
https://bugzilla.kernel.org/show_bug.cgi?id=96881

The three posted patches on here, v5, do work with my system until now, with kernel 4.0.1 at the moment.
Comment 194 Zhang Rui 2015-05-01 03:33:18 UTC
it fails to get into 4.0, but I will post refreshed ones and push for 4.1, and then backport to 3.18, 3.19, and 4.0.
Comment 195 Manuel Krause 2015-05-09 19:59:59 UTC
@Rui:
Please, don't forget to send in, at least, the three V5 patches for kernel 4.1 !!!
Comment 196 Manuel Krause 2015-05-29 20:42:49 UTC
Kernel 4.1.0-rc5 still doesn't have these patches. :-(((

Just FYI, without the patches from Comment 183, Comment 184 and Comment 185 -- the V5 patches -- my thermal system would not work correctly.
I've tested the three patches on 3.18.14 (only context modifications needed to backport), on 4.0.4 and on 4.1.0-rc5. With them everything works fine.

Please, don't wait for the answer from the "almost" dead thread https://bugzilla.kernel.org/show_bug.cgi?id=71711 --
just send in the known good patches.

Thank you in advance,
Manuel Krause
Comment 197 Juraj Fiala 2015-05-30 11:18:44 UTC
I second that. The three patches work perfectly. The locking issue is another issue. Please send the patches in.
Comment 198 Manuel Krause 2015-06-15 14:08:14 UTC
I mark this as REOPENED, as noone has sent in the patches for 4.1, nor initiated a backport to 4.0.x. 
Rui, that would have been your job for, now how many months?
Comment 199 prash 2015-06-23 14:33:58 UTC
I see that 4.1 got released. I assume Zhang Rui is on vacation. Can someone else please push the patches into the release?
Comment 200 Sebastian M. 2015-07-30 06:31:05 UTC
It's really sad to see that this hasn't been pushed yet.
Comment 201 amish 2015-07-30 06:46:05 UTC
I wonder if only Zhang Rui can push it? Why can't someone else with authority push it?

Dependence on one person is always sad for open source software / codes.
Comment 202 Manuel Krause 2015-08-08 13:22:04 UTC
I just re-tested with the upcoming kernel version, atm with 4.2.0-rc5, and the patches are still not included, but urgently needed.
With applied V5 patches, see the references in my comment 196, everything works fine in suspend-to-disk and suspend-to-RAM.

So, please, can someone push the patches to the kernel finally? It's so disappointing to have this BUG still open. :-(((

Best regards,
Manuel Krause
Comment 203 Mario Becroft 2015-08-08 13:38:46 UTC
bugzilla-daemon@bugzilla.kernel.org writes:

> So, please, can someone push the patches to the kernel finally? It's so
> disappointing to have this BUG still open. :-(((

Agreed. This is preventing me from upgrading the kernel on my laptops.
Comment 204 Aaron Lu 2015-08-24 05:24:25 UTC
Rui,

What is the status of these patches?
Comment 205 prash 2015-08-25 15:49:04 UTC
Aaron,

FWIW, kernel 4.1.6 still has the bug. I tested on the latest git checkout (4.2.0-rc7-g2c6625c), and the patches work just fine. Can you please push the patches yourself? A cursory search on bugzilla indicates that Zhang hasn't been active here for the last few months.
Comment 206 Manuel Krause 2015-08-25 19:30:25 UTC
The last I've heard from Rui was in May, that he wanted to "refresh" the patches (whatever this meant). Before, he had ideas to enhance them in the future, too.

Aaron, if you're a person in charge or capable of reviewing/ reworking the "V5" patches' code in Comment 183, Comment 184 and Comment 185 -- we would be very thankful. Especially -- if you'd finally send them in for kernel inclusion!

The code works well on here @ kernel 4.1.6 and the last one I've checked was 4.2.0-rc5. (Maybe there were needed minor adjustments in the patches' context lines to apply cleanly.)

Best regards, 
Manuel Krause
Comment 207 Aaron Lu 2015-08-26 02:07:37 UTC
OK, I'll talk to Rui when he comes back from Linux Plumber.
Comment 208 Zhang Rui 2015-09-01 02:42:53 UTC
reassign to Chen yu, who will take and refresh the previous patches and push for upstream.
Comment 209 Juraj Fiala 2015-09-01 10:29:11 UTC
Any idea if the patches will get into 4.2?
Comment 210 Chen Yu 2015-09-13 16:03:06 UTC
Just go through this thread, and the story is so long but it's interesting to watch the process on how to debug these bugs. Will send out a serie of patches recently. Maybe the patch will get merged into 4.3-rcn.
Yu
Comment 211 Manuel Krause 2015-09-15 16:14:11 UTC
First: Rui, thank you very much for your longterm work, and, for handing it over!

Second: Welcome, Yu! 
You don't need to reconstruct the whole thread, unless you need it for your history studies... ;-)

Third: The three patches I've lastly mentioned / referenced to in my Comment 206 are working well with a 4.2.0 kernel. (Maybe context adjustments needed.)

Final: It'would be very nice, to see this BUGfix working in a kernel in really near future. If you'd send in reworked patches to LKML/etc., please, let us know in this place. We want to test them, too.

Best regards, 
Manuel Krause
Comment 212 Chen Yu 2015-09-22 10:37:05 UTC
Hi Manuel,
Thanks a lot for tracking this thread, BTW, do you have bandwidth to help me test the latest patch, which is supposed to fix racing problem in thermal management. This patch is not related to your bug though... I'd like to send this patch together with Rui's patches. Can you please help to test this patch on top of Rui's:
https://patchwork.kernel.org/patch/6166681/
https://patchwork.kernel.org/patch/6166691/
https://patchwork.kernel.org/patch/6166701/
and please enable the thermal dynamic debug 
to see if there is any fan problems after resume, Thanks!

Yu
Comment 213 Chen Yu 2015-09-22 10:39:00 UTC
Created attachment 188081 [details]
This patch is to avoid lock racing when doing thermal_update.

This patch is to avoid lock racing when doing thermal_update, please apply on top of
https://patchwork.kernel.org/patch/6166681/
https://patchwork.kernel.org/patch/6166691/
https://patchwork.kernel.org/patch/6166701/
Comment 214 Juraj Fiala 2015-09-22 14:46:45 UTC
Hey, I might test it too, depends on if I find time.

Also, will this make it into 4.3? I would really like to give this old laptop to my mum who really needs it but it's quite unusable without these patches, and I don't really want to give her Arch since she needs it as stable as possible.
Comment 215 Manuel Krause 2015-09-22 17:03:23 UTC
Hi Yu,
can you please have a look at your referenced patches and compare them with the three v5 patches that Rui had attached to this thread (and not published on Patchwork)? They are:
https://bugzilla.kernel.org/attachment.cgi?id=173531
https://bugzilla.kernel.org/attachment.cgi?id=173541
https://bugzilla.kernel.org/attachment.cgi?id=173551

If I compared them correctly, patch 1 has one significant change in the very first hunk.

And then... coming to your new lock race avoiding patch: Everything works fine and as expected with it applied on top (now with kernel 4.2.1). Tested with a hot system as well as with a relatively cold one, with suspend-to-disk and with suspend-to-RAM. No problems at all. I need to add, that I kept using the v5 patches (that worked well for months now) and didn't take the v4 ones that you refer to. In case that you want to have debugging logs, please let me know.

Best regards,
Manuel Krause
Comment 216 prash 2015-09-22 17:23:46 UTC
Hi Chen Yu,

Do you mind refreshing the patches and releasing just *one* patchset that we could apply? With the large time-frame of patches it's become difficult for me to keep track of all the patches. If you were to release one patchset, I could test that too, and we could base all further discussion (and patches) on that patchset, henceforth.
Comment 218 Chen Yu 2015-09-27 08:11:53 UTC
BTW: The locking fixing patch will not be sent at this moment, since we
need to do more research on it.
Comment 219 Juraj Fiala 2015-09-27 09:07:03 UTC
Hey, thanks for sending those patches in! What about the racing patch, will you send that in too? I can try compiling the kernel today and test it if you want.
Comment 220 Manuel Krause 2015-09-28 20:26:55 UTC
Hi, Yu,
your patches are apparently for 4.3-rc kernels (one hunk with fuzz at a 4.2.1) but no code change in comparison to Rui's V5 patches. -- And yes they keep working on 4.2.1. Haven't tested 4.3-rc yet.
I've been reading the review comments added to your patchwork submissions, so I'm looking forward to the improved next revision.

I really appreciate progress on this longstanding topic and want to thank you in advance for your work. Please, keep us informed about new patches to test.

Best regards, 
Manuel
Comment 221 Manuel Krause 2015-10-18 00:29:30 UTC
Any chance to see reworked patches soon? -- Or do we need to wait until kernel 10.x.y to have this fixed? Unbelievable non-progress on this BUG fix. :-(
Comment 222 Chen Yu 2015-10-19 02:05:17 UTC
Javi has some comments on locking implementation with the third patch:
https://patchwork.kernel.org/patch/7273041/
although this patch works in current kernel, javi suggests we find a more graceful solution for scalability consideration. 
Since the locking always needs careful design, we are still testing different implementation on patch3.
Comment 223 Chen Yu 2015-10-22 08:04:34 UTC
Created attachment 190851 [details]
V6-0001-Thermal-initialize-thermal-zone-device-correctly

Hi, Manuel
can you please help test these three V6 patches?(only a small change from
V5)
thanks
Comment 224 Chen Yu 2015-10-22 08:05:11 UTC
Created attachment 190861 [details]
V6-0002-Thermal-handle-thermal-zone-device-properly-during-s
Comment 225 Chen Yu 2015-10-22 08:06:56 UTC
Created attachment 190871 [details]
V6-0003-Thermal-do-thermal-zone-update-after-a-cooling-devic
Comment 226 Chen Yu 2015-10-22 08:08:08 UTC
Plz test with thermal dynamic debug enabled and provide dmesg , thanks!
Comment 227 Manuel Krause 2015-10-23 16:14:07 UTC
O.k. -- I've waited for 4.2.4 to show up -- and tested the patches. 
(Only some little context variations, fuzzes, for this kernel.)
First news: Everything still works fine.

If I'd attach the dmesg with dynamic debug info, how would you like me to invoke it? Is it sufficient for you if I issued "echo 'module thermal_sys +fp' > /sys/kernel/debug/dynamic_debug/control" after system startup and play a bit with suspend-to-disk and different temperatures? Or do you want me to boot with a special kernel command line append? 
Please, advise me, as I don't recall the most efficient/useful logging version, especially regarding the command line.

Best regards, and many many thanks for your work,
Manuel
Comment 228 Chen Yu 2015-10-25 12:32:14 UTC
Well if it works..there is no need to gather the dmesg. will send out this patch to maillist.
Thanks
Yu
Comment 229 Manuel Krause 2015-11-06 19:43:51 UTC
I've now tested the most recent posted patch revision with a 4.3.0 kernel:
The three patches:
https://patchwork.kernel.org/patch/7525501/
https://patchwork.kernel.org/patch/7525491/
https://patchwork.kernel.org/patch/7525431/

And they do work very well and as expected.

As they're still not in kernel 4.3.0, and there didn't appear any correctional proposals for some days, I would thank you in advance if you pushed them to kernel as soon as possible.

Best regards, Manuel
Comment 230 Chen Yu 2015-11-07 02:53:32 UTC
This series of patch will be merged into 4.4-rc1.
Yu
Comment 231 Juraj Fiala 2015-11-07 17:14:35 UTC
Thank you!
Comment 232 René Kliment 2015-11-16 23:23:26 UTC
Hello. 

I experience this problem as well and I'm very happy that there exists a patch for it (thanks!).

I compiled 4.4.0-rc1-g8005c49 today and the problem persists. I also see the patches at patchwork still marked as new. Could you estimate when the patches will be merged? (another 4.4 rc?) I don't mean to push - I just can't wait for the patches to be merged :-) Appreciate the hard work.

Thank you, René.
Comment 233 Manuel Krause 2015-11-25 20:04:19 UTC
A short lookup into the sources of 4.4-rc2 shows, that the patches are still NOT included.

What is it about your continuous problems with the kernel inclusion of the working patches? Wrong channels? Laziness?

I'm quite upset about that for now. Since kernel 3.12 !!!

Manuel
Comment 234 Chen Yu 2015-12-01 07:29:42 UTC
I think it is just merged into the thermal tree now, will inform you once it get pulled into upstream.
Comment 235 Mario Becroft 2015-12-28 04:07:37 UTC
I has been 1.5 years and this is *still* not in mainline? Here I was thinking I could finally upgrade my laptop over Christmas to a current kernel...

Glad to hear Chen that you "think" it is merged into the thermal tree and will be pulled in upstream.

Last time I submitted a patch to the kernel, which is a few years ago, it was accepted within days. If there is anything I or someone else here can do to expedite this process, someone in the know please explain.

Sorry for the slightly rant-ish tone. I really want to know if there is anything else that needs to be done here, to get this closed.

Cheers,
Mario
Comment 236 Manuel Krause 2015-12-28 23:58:15 UTC
No need to apologize for possible rant-ish tones on this issue any more.

The patches work well with 4.3.3.
Looked up into 4.4-rc7 sources: They're STILL NOT INCLUDED.

I don't understand this delay any more. It's simply ridiculous. :-(

Maybe we need to send in the three last good patches to LKML on our own. Unfortunately, for 4.4 it again seems to be too late.

Pains,
Manuel
Comment 237 René Kliment 2015-12-29 09:08:33 UTC
It seems that it has been merged - https://patchwork.kernel.org/patch/7525501/ It was marked as "New" a few days back. Not sure where it lands, since 4.4-rc7 is out and it is not there, but waiting for 4.5 is nothing compared to the 1.5 years.
Comment 238 Manuel Krause 2016-01-15 20:41:39 UTC
Someone available and CAPABLE to send in the patches to LKML directly, _bypassing_ the current assignees CHEN YU and ZHANG RUI?

The patches are still not included into 4.4.0.
Comment 239 Zhang Rui 2016-01-18 01:09:02 UTC
The patches have been merged in thermal tree and will be queued for 4.5.
This is a big change for thermal core, thus it should be a -rc1 material instead of -rc5 and later.
Comment 240 Chen Yu 2016-01-25 02:48:49 UTC
Patch set has been merged, so close current thread as fixed.

commit bb431ba26c5cd0a17c941ca6c3a195a3a6d5d461
Author: Zhang Rui <rui.zhang@intel.com>
Date:   Fri Oct 30 16:31:47 2015 +0800

    Thermal: initialize thermal zone device correctly

commit ff140fea847e1c2002a220571ab106c2456ed252
Author: Zhang Rui <rui.zhang@intel.com>
Date:   Fri Oct 30 16:31:58 2015 +0800

    Thermal: handle thermal zone device properly during system sleep

commit 4511f7166a2deb5f7a578cf87fd2fe1ae83527e3
Author: Chen Yu <yu.c.chen@intel.com>
Date:   Fri Oct 30 16:32:10 2015 +0800

    Thermal: do thermal zone update after a cooling device registered