Bug 78201
Description
Manuel Krause
2014-06-17 21:23:26 UTC
Created attachment 140111 [details]
MK-3.15.0.thermal__thermal_zone_paths
Created attachment 140121 [details]
MK-3.15.0.thermal__cooling_device_paths
Created attachment 140131 [details]
MK-3.15.0.dmesg-pre-suspend
Created attachment 140141 [details]
MK-3.15.0.tmon-pre-suspend.log
Created attachment 140151 [details]
MK-3.15.0.dmesg-post-suspend
Created attachment 140161 [details]
MK-3.15.0.tmon-post-suspend.log
As Guenter Roeck already asked me to test older kernels, I was able to narrow it down to the not working kernels: 3.12.20 -> 3.15.0. (meant 3.11.9 as last good from my archives.) I've then done some bisecting over the weekend and got this as result: 178c2490b99f898efc06d1ad75cadc84f13021a6 is the first bad commit commit 178c2490b99f898efc06d1ad75cadc84f13021a6 Author: Shawn Guo <shawn.guo@linaro.org> Date: Mon Jun 17 21:24:23 2013 +0800 thermal: step_wise: cdev only needs update on a new target state The cooling device only needs update on a new target state. Since we already check old target in thermal_zone_trip_update(), we can do one more check to see if it's a new target state. If not, we can reasonably save some uncecesary code execution. Signed-off-by: Shawn Guo <shawn.guo@linaro.org> Acked-by: Eduardo Valentin <eduardo.valentin@ti.com> Signed-off-by: Zhang Rui <rui.zhang@intel.com> :040000 040000 34a6fb0f2405e5366f753d1da69bf3cb0f49a869 c4b953739e5b1ba259149eb68c48e0e2585ddd67 M drivers Best regards, Manuel Krause Created attachment 140181 [details]
Reverting PATCH for a 3.14.7 kernel
If I'd revert the above mentioned commit, also a 3.14.7 kernel does work very well, without any issues.
This is a reverting PATCH only.
Created attachment 140561 [details]
REVERT thermal: step_wise: cdev only needs update on a new target state @ 3.12.22
This is another working patch for now reverting a kernel 3.12.22.
Thanks for attention and best regards,
Manuel
Created attachment 140681 [details]
PATCH: initialize thermal zone device correctly
Created attachment 140691 [details]
handle system sleep correctly
Well, the real problem of this bug is that after resume, thermal core still uses some cached value to make thermal decisions. IMO, the two patches above, based on 3.16-rc2, is the real fix for this issue. Please apply them on top of 3.16-rc2 and check if the problem still exists. If yes, please run "echo 'module thermal_sys +fp' > /sys/kernel/debug/dynamic_debug/control" reproduce the problem attach the dmesg output and tmon output Unfortunately, 3.16-rc2 + your 2 patches don't help for resuming from RAM/ DISK. A freshly booted system works well, though. In the following you'll find both the dmesg & tmon output from before suspend-to-ram, after this task, and after suspend-to disk. All actions taken subsequently. The dmesg files are cut to each task and the tmon output comes from separate runs of 'tmon -l'. I hope this workflow doesn't make it too complicated to review this issue. Best regards, and thank you for your work! Created attachment 140741 [details]
MK-3.16-rc2+.dmesg-before-S2RAM.log
Created attachment 140751 [details]
MK-3.16-rc2+.tmon-before-S2RAM.log
Created attachment 140761 [details]
MK-3.16-rc2+.dmesg-after-S2RAM.log
Created attachment 140771 [details]
MK-3.16-rc2+.tmon-after-S2RAM.log
Created attachment 140781 [details]
MK-3.16-rc2+.dmesg-after-S2DISK.log
Created attachment 140791 [details]
MK-3.16-rc2+.tmon-after-S2DISK.log
Created attachment 140801 [details]
PATCH V2: handle system sleep correctly (1.39 KB,
this is an updated version of the second patch, please apply and retry.
Note: when doing the tests, please boot with kernel parameter "initcall_debug" this time.
(In reply to Zhang Rui from comment #20) > Created attachment 140801 [details] > PATCH V2: handle system sleep correctly (1.39 KB, > > this is an updated version of the second patch, please apply and retry. > Note: when doing the tests, please boot with kernel parameter > "initcall_debug" this time. Great work! Thank you very much: Version 2 of the second patch does the magic! Many following suspends+resumes to RAM/ disk now show the desired behaviour, i.e. passing trip_point temperatures trigger the right fan speed in upward and downward direction. No "forgotten" fan speed levels at all. Do you still want related test logs? And a last question, do you consider these patches to be applicable to 3.15 / 3.14 / 3.12 kernels as well? Best regards, and keep up your good work, Manuel Well, IMO, this is still not the perfect fix, because what we really want is to deffer the execution of thermal_zone_device_update() until we have successfully reset the thermal zone devices during resume. Currently, this patch does the trick by doing this in the .late_suspend() callback and no one invokes thermal_zone_device_update() later than that. IMO, a wait queue would be a better choice, I will investigate if it is possible to do it in this way. O.K., I'll stay tuned on this. Please, feel free to post any patch that follows your improved approach: I'd really like to test it! Best regarrds, Manuel *ping* Any news on your better choice? I would be glad to either receive a preliminary patch from you or to see this issue being resolved with a "not perfect but working" patch in actual and affected kernels. Thanks for your work in any case, Manuel Apparently also kernel 3.16.1 doesn't have a proper fix. Your, Zhang Rui's, proposed patches keep to work, though. Manuel Created attachment 147891 [details]
PATCH: initialize thermal zone device correctly
Created attachment 147901 [details]
Patch: handle thermal zone device update events correctly during system sleep
please try these two patches on clean 3.17-rc1 and see if the problem still exists.
One of your both most recent patches leads to general overheating on here. It (fan + trigger) doesn't step up after resume. The first patch, as I compared, is quite equal. But: The second one??? Please, also don't tag your old patches as "obsolete", unless the newer ones are tagged as "good". My 3.16.1 kernel is running quite well with your old(!) patches. Please think over plastics overheating & so on... Manuel Created attachment 148241 [details]
PATCH v2: initialize thermal zone device correctly
Created attachment 148251 [details]
Patch v2: handle thermal zone device update events correctly during system sleep
Sorry that I posted an out-of-date version for PATCH 1/2.
please try these two updated patches instead.
OMG, NO! What have you done? 3.17-rc1+ your new patches: Now lower temperatures <-> fan speeds are set upon resume, but they will _NEVER_ step up from this "loaded level" according to a temperature increase. --> Burning notebook. Same as with your patches from 20140824. Manuel (In reply to Manuel Krause from comment #31) > OMG, NO! What have you done? > > 3.17-rc1+ your new patches: Now lower temperatures <-> fan speeds are set > upon resume, but they will _NEVER_ step up from this "loaded level" > according to a temperature increase. --> Burning notebook. Same as with your > patches from 20140824. > can you please attach the dmesg output after resume with the patch applied, and with command "echo 'module thermal_sys +fp' > /sys/kernel/debug/dynamic_debug/control"? Please, be so kind, to excuse the delay. For several weeks now, I'm not able to boot a 3.17-rcX kernel on here. As it was possible before, Comment 28+, and now it is NOT, I'm quite clueless. Maybe I've updated some openSUSE package that prevents it? Or I'm missing one? A freshly compiled 3.16.2 boots well. If you know anything about this preventing issue, please, let me know. I still want to help you and others to fix this BUG. Best regards, Manuel hmmm, 3.17-rc1 works for you but later -rc releases don't? I was not aware of such problems in 3.17-rc kernels, can you please try a git bisect to find out which commit introduces the problem? as this is a longstanding bug, exposed by recent changes, we're marking this as not a regression. 3.17-rc is problematic, since latest kernel is not booting at all. (is that boot failure filed as a regression???), preventing debugging this issue. It sounds like 3.17-rc1 is latest that boots, and so perhaps debugging of the fan issue could continue on that release? O.k., I'm now able to boot a 3.17-rc kernel again. It is now 3.17-rc4. As I normally install the openSUSE .rpm (once they puplishe), it's not quite sure what was causing the issue, me not beeing able to boot -rc2 to -rc3. Can be a packaging failure from them or something from their git state. I retrieved the 3.17-rc1 binaries from a backup(!) with which I did the first testing on Zhang Rui's new patches and it worked (booted but the BUG's issues remained, of course). This "preventing issue" was obviously solved somehow inbetween, so I hope you won't advise me to bisect as it may take much needless time. In the following I'll attach the dmesg output and a parallel tmon log after issuing "echo 'module thermal_sys +fp' > /sys/kernel/debug/dynamic_debug/control". Unfortunately our current ambient temperatures with the upcoming autumn do not bring up high temperatures in a short period of time, so the logs are ... lengthy. Best regards, Manuel Created attachment 149931 [details]
dmesg of a 3.17-rc4 with Zhang Rui's most recent patches and dynamic_debug enabled
Created attachment 149941 [details]
tmon.log of a 3.17-rc4 with Zhang Rui's most recent patches and dynamic_debug enabled
And to be absolutly clear, with the most recent patches, on 3.17-rcX, now the higher fan speeds / triggers aren't reached. (Remember, before, the lower speeds had been forgotten.) There had been some patches from Zhang Rui in this long time, that I continue to apply to my 3.16 kernels, and they do work, definitely. Best regards, Manuel Created attachment 150751 [details]
PATCH v3: initialize thermal zone device correctly
Created attachment 150761 [details]
Patch v3: handle thermal zone device update events correctly during system sleep
Hi, Manuel, there do is a bug in the previous patches, so please apply these two updated ones and see if they can fix the problem for you or not. BTY, please attach the dmesg output after suspend/hibernation, with dynamic_debug enabled as well. Created attachment 150811 [details]
dmesg output of a 3.17-rc4 with V3 of the patches
I hope this dmesg output helps a bit for your further conclusions.
After resume from disk/ram changing temperatures still DO NOT trigger a fan level change.
The attached log may only cover the upwards (temp) path, but I've also generated logs on here, proving, that the downwards path doesn't work either: After resume the fan won't change it's speed according to the CPU temp any more (seems fixed to the temp gathered at resume time). Freshly booted everything works well.
I can send you the other logs, if you wish, but maybe you first think about the severe error that must be in the patches for the resume case not allowing the fan speed to change!
Best regards, Manuel Krause
Created attachment 150821 [details]
debug patch
Created attachment 150831 [details]
Patch v4: handle thermal zone device update events correctly during system sleep
please apply the updated second patch, and the debug patch and re-do the test. (In reply to Manuel Krause from comment #43) > Created attachment 150811 [details] > dmesg output of a 3.17-rc4 with V3 of the patches > > I hope this dmesg output helps a bit for your further conclusions. > > After resume from disk/ram changing temperatures still DO NOT trigger a fan > level change. > From the dmesg output, we can see that 1. when resuming from hibernation, the temperature is 63C, the Fan cooling device 2,3,4 is turned on as we expected. Please confirm the fan is do running at a certain speed level. 2. there is no thermal zone update() request then, I'm not sure how you change the temperature, but the last thermal_zone_device_update() is invoked at 299s, upon resuming from hibernation. > The attached log may only cover the upwards (temp) path, but I've also > generated logs on here, proving, that the downwards path doesn't work > either: After resume the fan won't change it's speed according to the CPU > temp any more (seems fixed to the temp gathered at resume time). Freshly > booted everything works well. > please apply the updated second patch and the debug patch, I want to make sure that thermal_zone_device_update() is invoked correctly after resume, when you change the temperature. At first the answers to Comment 46: 1. The fan is running well at all levels (25, 34, 45, 58, 100 <-emergency) and I, of course, did observe them always before reporting that changing fan speed is not happening. And, with the V3 patches the system comes up from resume with a properly adjusted fan-speed, according to the temp. But it lost the ability to step up or down. 2. The way I change the temp. is not manually issued. I just start a worldcommunitygrid client or a kernel compile to increase the temp. And stop them, if I want to see the fan level go down. Manuel Created attachment 151001 [details]
dmesg output of a 3.17-rc4 with V4 of the patches + debug.patch + dyn_debug
At the moment, with these patches, the system after hibernation resume, sticks at 100% fan speed, not depending on the load.
Not the final goal.
Manuel
Created attachment 151011 [details]
Working patch 1of2 for 3.16 kernels
Maybe, to remember you... Manuel
Created attachment 151021 [details]
Working patch 2of2 for 3.16 kernels
Best regards, Manuel
(In reply to Manuel Krause from comment #48) > Created attachment 151001 [details] > dmesg output of a 3.17-rc4 with V4 of the patches + debug.patch + dyn_debug > > At the moment, with these patches, the system after hibernation resume, > sticks at 100% fan speed, not depending on the load. > well, from the dmesg output, we can see that your hibernation started at 567s, and ended at 569s, and the last dmesg output is 572s. so I could not see how the system "sticks at 100% fan speed, not depending on the load." Question, is your dmesg output got soon after hibernation, and then you did some test and made such conclusion? If yes, please send me the full dmesg output, because I want to see how the thermal core re-act when the temperature changes after hibernation. If no, then this seems to be another problem that "thermal notification is missing after resuming from hibernation". Though, I had an extra long dmesg log for the last reported test run, and there didn't happen anything after that last dmesg output @572s for another 96s, I preferred to reproduce the problem and generate a new set of logs for you, now including a corresponding tmon.log, too, so that you hopefully can see the related reactions of the system. I'll attach them in the two follow-up posts. I hope that I applied the right patches!? I used "PATCH v3: initialize thermal zone device correctly" (attachment 150751 [details]), "Patch v4: handle thermal zone device update events correctly during system sleep" (attachment 150831 [details]) and the "debug patch" (attachment 150821 [details]). Can you, please, also recheck, that these are the right ones? It's a pity: This latest combination of patches falls into your second category from Comment 51: 'another problem that "thermal notification is missing after resuming from hibernation" ' -- Directly with resume from hibernation the fan goes to 100% and stays there, whatever I'd do to increase or decrease CPU load up- or down beyond known trigger temperatures. Detailed description what these actual logs are about: booting, echoing the dynamic debug command, start worldcommunitygrid client + wait + stop it again + wait, hibernate + wait + resume (now fan at 100%), start worldcommunitygrid client + wait + stop it again + wait, connect to the web, stop tmon + save the outputs to tmon.log and dmesg.log I really hope we'll get this solved soon and I thank you very very much for your continued engagement. Best regards, Manuel Created attachment 151431 [details] tmon.log of a 3.17-rc4 with V4 of the patches according to Comment 52 (reproduced) Created attachment 151441 [details] dmesg.log of a 3.17-rc4 with V4 of the patches according to Comment 52 (reproduced) I assume you won't like me after this and the following two posts... With the same setup and testing, but now suspending _to_RAM_ instead of hibernating (to disk) I generated the following logs: Showing here, the fan now sticks to the lowest previous level: Created attachment 151481 [details]
tmon.log of a 3.17-rc4 with V4 of the patches SUSPEND TO RAM
Created attachment 151491 [details]
dmesg.log of a 3.17-rc4 with V4 of the patches SUSPEND_TO_RAM
This obviously proves your second case theory from Comment 51: 'another problem that "thermal notification is missing after resuming from hibernation" '. Meaning now == also from suspend. But I don't understand, why hibernation (to disk) or suspend (to RAM) do behave that different. The recovery from each of them must check for actual temperatures IMO. Best regards, Manuel ... AND reset the fan's levels, too!!! (In reply to Manuel Krause from comment #52) > Though, I had an extra long dmesg log for the last reported test run, and > there didn't happen anything after that last dmesg output @572s for another > 96s, I preferred to reproduce the problem and generate a new set of logs for > you, now including a corresponding tmon.log, too, so that you hopefully can > see the related reactions of the system. I'll attach them in the two > follow-up posts. > > I hope that I applied the right patches!? I used > "PATCH v3: initialize thermal zone device correctly" (attachment 150751 [details] > [details]), > "Patch v4: handle thermal zone device update events correctly during system > sleep" (attachment 150831 [details]) and the > "debug patch" (attachment 150821 [details]). > Can you, please, also recheck, that these are the right ones? > yes, they are. > It's a pity: This latest combination of patches falls into your second > category from Comment 51: 'another problem that "thermal notification is > missing after resuming from hibernation" ' -- Directly with resume from > hibernation the fan goes to 100% and stays there, whatever I'd do to > increase or decrease CPU load up- or down beyond known trigger temperatures. > well, bad new to me. I think we need to do further debug to check why Linux behaves differently with patch v1 and v4. Created attachment 151501 [details]
Patch-for-upstream: 1/3
Created attachment 151511 [details]
Patch-for-upstream: 2/3
Created attachment 151521 [details]
Patch-for-upstream: 3/3 debug patch
Created attachment 151531 [details]
patch-know-to-work: 1/3
Created attachment 151541 [details]
patch-know-to-work: 2/3
Created attachment 151551 [details]
patch-know-to-work: 3/3 debug patch
now here are two patch sets, patch-for-upstream and patch-known-to-work. For each patch set, please do the following test 1. rebuild your kernel with CONFIG_ACPI_DEBUG=y. 2. reboot with boot option "acpi.debug_level=0x04 acpi.debug_layer=0x04" 3. enable dynamic debug 4. run suspend/hibernate 5. after resuming, try to change the temperature, either lower or higher. 6. attach the dmesg output Note that I've made a change in the patch-for-upstream patch set, so do please apply these refreshed ones. Created attachment 151941 [details]
dmesg of 3.17-rc4 with "known-to-work" patches with suspend-to-disk
Created attachment 151951 [details]
dmesg of 3.17-rc4 with "known-to-work" patches with suspend-to-RAM
Created attachment 151961 [details]
dmesg of 3.17-rc4 with "for-upstream" patches with suspend-to-disk
Created attachment 151971 [details]
dmesg of 3.17-rc4 with "for-upstream" patches with suspend-to-RAM
The from you so called "known-to-work" patches from you do not make the fan step up after resume (disk+RAM). It sticks at some low temp/speed. The "for-upstream" patches show: After Suspend-to-disk (hibernate) my fan sticks at 100% -- but after Suspend-to-RAM -- it shows the old behaviour, not stepping up from some low temp/speed. What you'd see in the logs at the end: That I mount some partition on here: sda11. This was always the beginning of a new kernel compilation to increase the system's temperature. As I know, that the system's temp highly increases afterwards, by this means, there's something wrong, either with your debug logging or with THIS BUG's topic, to not refresh the fan <-> thermal_zone relations upon resume. Best regards, Manuel (In reply to Manuel Krause from comment #72) > The from you so called "known-to-work" patches from you do not make the fan > step up after resume (disk+RAM). It sticks at some low temp/speed. > this is weird, then can you please attach the patches that are know to work for you? I don't think I made some functional changes. The patches you've taken and called "known-to-work" are quite right and are identical to mine, regarding the code, to the ones I already posted here in Comment 49: attachment 151011 [details] and Comment 50: attachment 151021 [details]. Unfortunately I haven't compared them earlier, so I didn't find better ideas at that time. Over the last days I found time to do some cross testing, that I obviously should have done much earlier. Sorry for that. Long story in short: * known-to-work patches with 3.17.0-rc6: - after suspend-to-disk: fan speed stays @ a low value (45%) despite increasing CPU load beyond temp. trigger point * known-to-work patches with 3.17.0-rc1: == same as with 3.17.0-rc6 and == 3.17.0-rc4 => These two and my previously posted results with -rc4 for BOTH patchsets lead me to the assumption that some new 3.17 code inhibits updating temperature and fan correlations completely after resume from disk/RAM. You are more involved into development than me, so, it's your turn. * for-upstream patches with 3.16.3 (+BFS +BFQ +some others): --First time that I tried this combination-- - after suspend-to-RAM: everything o.k. - after suspend-to-disk: fan sticks @ 100% until a lower temp. trigger is met, then it adjusts to the right fan speed. Depending on the CPU load, maybe WCG client is running, this may take forever, as the fan cools too much unneededly, but the CPU doesn't emit enough heat, so that the algorithm seems to hang "in the middle". Now the fan will forever stay @ 100%. And, yes, this is then noisy! Even with an idle CPU this "getting a lower temp. trigger" can take from 1 to 3 minutes at this season. So this method seems not to be as usable as the very old one. If you've time to explain, please, tell me what you want to achieve with the new set of patches in general. Best regards, Manuel Krause Created attachment 152031 [details]
dmesg of a 3.16.3+ with patches for-upstream +acpi.debug +dynamic_debug
Unfortunately the beginning of the log is missing. I don't know why. I hope this log is helpful, nontheless.
dmesg of a 3.16.3+ with patches for-upstream +acpi.debug +dynamic_debug
In the log in Comment 75 I did one suspend-to-ram, and two suspend-to-disk. I've, until now, done some bisecting for your convenience. On the topic, when your "known-to-work" patches began to fail to work since 3.16. Result: git bisect bad f35cec255557d1037ff0d772edfd6e7b1e92cdc0 is the first bad commit commit f35cec255557d1037ff0d772edfd6e7b1e92cdc0 Author: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Date: Wed Jul 23 01:00:53 2014 +0200 ACPI / PM: Always enable wakeup GPEs when enabling device wakeup Wakeup GPEs are currently only enabled when setting up devices for remote wakeup at run time. During system-wide transitions they are enabled by ACPICA at the very last stage of suspend (before asking the BIOS to take over). Of course, that only works for system sleep states supported by ACPI, so in particular it doesn't work for the "freeze" sleep state. For this reason, modify the ACPI core device PM code to enable wakeup GPEs for devices when setting them up for wakeup regardless of whether that is remote wakeup at runtime or system wakeup. That allows the same device wakeup setup routine to be used for both runtime PM and system-wide PM and makes it possible to reduce code size quite a bit. This make ACPI-based PCI Wake-on-LAN work with the "freeze" sleep state on my venerable Toshiba Portege R500 and should help other systems too. Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> :040000 040000 4818deb40271308e35a69fae46673090737c9dd9 40965a4df729f5cb0ea848b1c0e70ea475b8a558 M drivers :040000 040000 d08af810abb0b5768615cdd4dea34b577704171a 0d113d89c1c831c294170c9a9d7a5cbedf884904 M include I currently don't know, what patches I'd need to revert, based on this above, to prove that YOUR patches would also work with 3.17 (without the compromising commit above). Please advise me. Thanks in advance and best regards, Manuel Some new test results with kernel 3.17.0: 1. Vanilla without any additional patches clearly shows the BUG on here: After Suspend-to-disk the fan stays @ 0 forever. 2. Now with REVERTED commit f35cec255557d1037ff0d772edfd6e7b1e92cdc0 (http://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/patch/?id=f35cec255557d1037ff0d772edfd6e7b1e92cdc0): 2.1. With the "known-to-work" patches: Everything works as expected and wished 2.2. With the "for-upstream" patches: Same results as in my Comment 74, section for-upstream patches. So, still not really usable. Maybe, we can invite Rafael J. Wysocki to have a look on his patch again, to find out why it breaks both of your patch-sets for 3.17. Thanks in advance, Manuel Created attachment 154651 [details]
[PATCH] ACPI: invoke acpi_device_wakeup() with correct parameters
first, please apply this patch on top of rafael' patch and see if there is anything different, because there is a bug in his patch. And I will check if his patch changes any behavior that breaks our "known-to-work" patches.
Tested 3.17.0 (as 3.17.1 doesn't change anything for our work) without reverting Rafael's commit but WITH your fix in Comment 79: known-to-work patches don't trigger fan changes after suspend-to-disk Although this patch is a real fix for the wrong parameters, this seems to be not enough to be fixed. Created attachment 154731 [details]
debug patch
please apply this patch on top of the "known-to-work" patches and the fix patch attached yesterday, and see if it helps.
BTW: Thank you very much for your continued efforts! To summarise: 3.17.0 with nothing reverted + the 2 "known-to-work" patches (+ the related debug message patch of those days) + attachment 154731 [details] debug patch + '[PATCH] ACPI: invoke acpi_device_wakeup() with correct parameters', attachment 154651 [details] brings back a working base! Cheers!: Fan speed is stepping up and down as desired, initial fan speed is appropriate, after bootup, after suspend-to-disk and as well as after suspend-to-RAM. Now, I really hope that your last debug patch from Comment 81 doesn't break things that Rafael wanted to fix with his commit f35cec255557d1037ff0d772edfd6e7b1e92cdc0. And... What's next? ;-) I'm trying to fix this issue right now, can you please send me the acpidump of this machine? As I can't estimate how sensitive the acpidump data is, I decided to send it to your email address directly. I hope it helps! I've just seen that your fix from Comment 79 will most likely make it into 3.17.3! Hopefully you are able to get ahead with the rest of this BUG, too. Currently, I'm using the 3.17.2 with BFS, BFQ and the patches mentioned in Comment 82 without the debug-message-throwing one -- and it works very well! Thanks! Best wishes and best regards, Manuel I have the identical issue on my HP 8510w using latest 3.17.4 kernel. As this is a production system I'm not able to test and retest all the patches, so I have reverted to an older kernel pre this bug. But if there is any info I can provide from the running system that would help, please let me know. Created attachment 158651 [details]
Kernel-Bugzilla-78201-knowntowork-V2-PLUS-for3.17.x PATCH
@Mario Becroft:
You can try this patch, what is a combination of Zhang Rui's two V2 patches, which worked well (known-to-work), with the third one, needed for 3.17.x.
It should be safe to use it on your machine, as it doesn't seem to differ much from mine.
Please report back, I'm "glad" not to be the only one with this BUG. ;-)
This combination works very well on here @hp compaq 6730b with 3.17.4.
Best regards, Manuel Krause
I'm not able to reboot this production system on demand so a bit difficult for me to test, unless it's really essential to get my feedback I can schedule a maintenance window. Is there some reason these patches are being held back from inclusion in the next 3.17 release? I don't know why we don't get the working patches into the kernel. Zhang Rui never explained. :-( What I see from testing 3.18-rc5-6, these patches won't work there any more. There had been modifications to the thermal and acpi subsystems and not only to the files in question. But the plain vanilla doesn't do the job either. When I tested each of the 3.18 modifications vs. the 3.17 working solution (by modifying patches) I didn't get ONE positive result. So we do depend on someone else's expertise/ experience. *PING* @ Zhang Rui I'm really feeling sad and disappointed, that this BUG is originating from 3.12 and we would apparently not get our a fixes into upcoming 3.18 and not into 3.17.5. Zhang Rui, please, finish this work, it's annoying over this much time now. Manuel Hi, guys, sorry for the long delay. The reason why I did not push the working patches is that the root cause of the bug is that we should not use cached value after resume. And the "working" patches clear the cached value during suspend, but it is not guaranteed to be the last step of suspend (some driver may do thermal_zone_device_update() later during suspend, and then cached values are still used during resume). Thus although it "fixes" the problem you have, it hides the real problem that may come again sometime, and that's why I don't want to push upstream. But you're right, "a working solution today is better than a perfect plan tomorrow", I will try to push those two patches to 3.19, and I'll rewrite a better solution sometime later, wish I could get your help to test them again then. Created attachment 159431 [details]
PATCH for 3.19, 1/2
Created attachment 159441 [details] PATCH for 3.19, 2/2 I've rebased the "working" patches on top of 3.18-rc. please double check if they work for you or not, after applying the debug patch at https://bugzilla.kernel.org/attachment.cgi?id=154731 I need to send them out with your Tested-by ASAP as the 3.19 merge window will open soon. Mmmh, now I really need your help with your knowledge. Like I already indicated in Comment 89, also kernel 3.18.0-rc7 showed: the thermal system fails. Symptoms: After booting and increasing cpu load the fan won't trigger any fan speed increase, although well known trigger points are passed. After hibernate+resume the fan hangs at 100% forever. The patch from https://bugzilla.kernel.org/attachment.cgi?id=154731 contains one hunk (No. 3) that doesn't correspond to 3.18-rc kernels any more (The line in error in this hunk is line 3): diff --git a/drivers/acpi/device_pm.c b/drivers/acpi/device_pm.c index 67075f8..40a485ce 100644 --- a/drivers/acpi/device_pm.c +++ b/drivers/acpi/device_pm.c @@ -878,13 +886,13 @@ int acpi_dev_suspend_late(struct device *dev) target_state = acpi_target_system_state(); wakeup = device_may_wakeup(dev); - error = acpi_device_wakeup(adev, target_state, wakeup); + error = __acpi_device_sleep_wake(adev, target_state, wakeup); if (wakeup && error) return error; error = acpi_dev_pm_low_power(dev, adev, target_state); if (error) - acpi_device_wakeup(adev, ACPI_STATE_UNKNOWN, false); + __acpi_device_sleep_wake(adev, ACPI_STATE_UNKNOWN, false); return error; } So I've tested 3 kernels for you during this afternoon. One with all three patches + leaving the kernel original line, one with all three patches + with your debug patch original line, one without any of your patches. All 3 (each verified twice) tests were leading to the results reported on the top of this comment. Can you please check the surrounding of the files our patches touch for changes in the 3.18-rc development timeline? Maybe the failing hunk gives a hint where to begin the investigation? Good luck, and yes, of course, I'll stay tuned to help in testing!!! (In reply to Zhang Rui from comment #92) > Hi, guys, > sorry for the long delay. > The reason why I did not push the working patches is that > the root cause of the bug is that we should not use cached value after > resume. > And the "working" patches clear the cached value during suspend, but it is > not guaranteed to be the last step of suspend (some driver may do > thermal_zone_device_update() later during suspend, and then cached values > are still used during resume). > Thus although it "fixes" the problem you have, it hides the real problem > that may come again sometime, and that's why I don't want to push upstream. > But you're right, "a working solution today is better than a perfect plan > tomorrow", I will try to push those two patches to 3.19, and I'll rewrite a > better solution sometime later, wish I could get your help to test them > again then. My thoughts on this: Isn't it possible to reset the fan(s) and temperatures and their correspondence during resume, like it happens at a normal reboot? (O.k. it does not do with the current 3.18-rc at boot at all, on here.) But normally in the booting process in kernel 3.17 we've got a temp and a trigger point and then a resulting fan speed. So, why should that process NOT occur at resume again? Any speed gain in this case is useless if it sacrifices system safety. And now to your approach, that may sound different from my idea above, but may be the same idea, why don't you delete the cached value also at resume (again, double-safe) and then re-evaluate the the real temperature data? (In reply to Manuel Krause from comment #89) > I don't know why we don't get the working patches into the kernel. Zhang Rui > never explained. :-( > > What I see from testing 3.18-rc5-6, these patches won't work there any more. > There had been modifications to the thermal and acpi subsystems and not only > to the files in question. But the plain vanilla doesn't do the job either. > When I tested each of the 3.18 modifications vs. the 3.17 working solution > (by modifying patches) I didn't get ONE positive result. do you mean the 3.17 working solution does not work for you any more? (In reply to Manuel Krause from comment #96) > (In reply to Zhang Rui from comment #92) > > Hi, guys, > > sorry for the long delay. > > The reason why I did not push the working patches is that > > the root cause of the bug is that we should not use cached value after > > resume. > > And the "working" patches clear the cached value during suspend, but it is > > not guaranteed to be the last step of suspend (some driver may do > > thermal_zone_device_update() later during suspend, and then cached values > > are still used during resume). > > Thus although it "fixes" the problem you have, it hides the real problem > > that may come again sometime, and that's why I don't want to push upstream. > > But you're right, "a working solution today is better than a perfect plan > > tomorrow", I will try to push those two patches to 3.19, and I'll rewrite a > > better solution sometime later, wish I could get your help to test them > > again then. > > My thoughts on this: Isn't it possible to reset the fan(s) and temperatures > and their correspondence during resume, like it happens at a normal reboot? > (O.k. it does not do with the current 3.18-rc at boot at all, on here.) But > normally in the booting process in kernel 3.17 we've got a temp and a > trigger point and then a resulting fan speed. So, why should that process > NOT occur at resume again? Any speed gain in this case is useless if it > sacrifices system safety. > > And now to your approach, that may sound different from my idea above, but > may be the same idea, why don't you delete the cached value also at resume > (again, double-safe) and then re-evaluate the the real temperature data? No, the same reason: the thermal .resume_early() callback is not guaranteed to be the first one during resume, which means that some thermal driver may already issued thermal_zone_device_update() before thermal core .resume_early() callback, and your approach would clear the meaningful values for those drivers. (In reply to Manuel Krause from comment #95) > Mmmh, now I really need your help with your knowledge. Like I already > indicated in Comment 89, also kernel 3.18.0-rc7 showed: the thermal system > fails. > > Symptoms: After booting and increasing cpu load the fan won't trigger any > fan speed increase, although well known trigger points are passed. Well, this is sad. It seems that we've got another regression that thermal system fails after boot. please make sure your kernel is built with CONFIG_DYNAMIC_DEBUG=y, and run echo 'module thermal_sys +fp' > /sys/kernel/debug/dynamic_debug/control after kernel boot. and try to warm up the system and attach the dmesg output. We need to fix this first. To answer all in one: * The two working patches + debug patch work for my 3.17.4 very well. * To the regression: I've tested 3.18.0-rc5, -rc6 and -rc7 -- and for none of them the fan control works. I don't even need to bother with sending a dmesg (with the dynamic_debug command run) as there is exactly NO related output in it, although I've passed several trigger points up- and downwards. I've rechecked the .config for eventual misconfiguration, but it's all in it's place. Do I need some special kernel command line in addition? * From comparing a 3.17.3 dmesg with a 3.18.0-rc7 after bootup I miss some lines completely in the latter (indicated with --- instead of timestamp): ... [ 0.671244] pcie_pme 0000:00:1c.5:pcie01: service driver pcie_pme loaded [ 0.671294] intel_idle: does not run on family 6 model 23 [ --- ] ACPI: Fan [FAN0] (off) [ --- ] ACPI: Fan [FAN1] (off) [ --- ] ACPI: Fan [FAN2] (off) [ --- ] ACPI: Fan [FAN3] (off) [ --- ] ACPI: Fan [FAN4] (off) [ 0.672777] Monitor-Mwait will be used to enter C-1 state [ 0.672783] Monitor-Mwait will be used to enter C-2 state [ 0.672788] tsc: Marking TSC unstable due to TSC halts in idle [ 0.672794] ACPI: acpi_idle registered with cpuidle ... I don't know if that's a hint?! According to tmon these do exist and they (the fan levels) are manually adjustible with e.g. 'echo 1 > /sys/class/thermal/cooling_device1/cur_state'. * To be honest, I don't want to bisect the early 3.18-rcs, testing the -rc5 again showed up that it's highly buggy on here. Best regards and thank you in advance for any new idea to try out, Manuel (In reply to Zhang Rui from comment #98) > ... > No, the same reason: the thermal .resume_early() callback is not guaranteed > to be the first one during resume, which means that some thermal driver may > already issued thermal_zone_device_update() before thermal core > .resume_early() callback, and your approach would clear the meaningful > values for those drivers. So, wouldn't it then make sense to have a kind of "subscription list" for drivers that do change their thermal settings? Meaning, for the suspend case as well as for the resume case, from there could be read out that/ what may have changed inbetween (e.g. some differences changed by the BIOS). What value then needs to be re-evaluated would depend on this comparison and driver/ hardware quality. New values then would be then set depending on the previuos comparison. (In reply to Manuel Krause from comment #100) > To answer all in one: > * The two working patches + debug patch work for my 3.17.4 very well. > * To the regression: I've tested 3.18.0-rc5, -rc6 and -rc7 -- and for none > of them the fan control works. I don't even need to bother with sending a > dmesg (with the dynamic_debug command run) as there is exactly NO related > output in it, although I've passed several trigger points up- and downwards. > I've rechecked the .config for eventual misconfiguration, but it's all in > it's place. Do I need some special kernel command line in addition? > * From comparing a 3.17.3 dmesg with a 3.18.0-rc7 after bootup I miss some > lines completely in the latter (indicated with --- instead of timestamp): > > ... > [ 0.671244] pcie_pme 0000:00:1c.5:pcie01: service driver pcie_pme loaded > [ 0.671294] intel_idle: does not run on family 6 model 23 > [ --- ] ACPI: Fan [FAN0] (off) > [ --- ] ACPI: Fan [FAN1] (off) > [ --- ] ACPI: Fan [FAN2] (off) > [ --- ] ACPI: Fan [FAN3] (off) > [ --- ] ACPI: Fan [FAN4] (off) > [ 0.672777] Monitor-Mwait will be used to enter C-1 state > [ 0.672783] Monitor-Mwait will be used to enter C-2 state > [ 0.672788] tsc: Marking TSC unstable due to TSC halts in idle > [ 0.672794] ACPI: acpi_idle registered with cpuidle > ... > > I don't know if that's a hint? No, this message is removed by commit 9519a6356cbf63b1f22a7a208385dc56092c8b7d. > ! According to tmon these do exist and they > (the fan levels) are manually adjustible with e.g. 'echo 1 > > /sys/class/thermal/cooling_device1/cur_state'. > questions: 1. does manually changing the cooling state turn on/off the fan? 2. can you please boot with "acpi.debug_level=0x04 acpi.debug_layer=0x04" for both 3.17 and 3.18 kernel? I think we are missing thermal events during boot. (In reply to Zhang Rui from comment #102) > questions: > 1. does manually changing the cooling state turn on/off the fan? > 2. can you please boot with "acpi.debug_level=0x04 acpi.debug_layer=0x04" > for both 3.17 and 3.18 kernel? I think we are missing thermal events during > boot. 1. Manual changes to the cooling state (cur_state) like I mentioned above do work for both kernel versions. But at the moment I see: not all are working @ 3.17.6. I have 6 fan speed levels, meaning 5 cooling states plus 1 state "all 0", each representing one fan speed: 0% (all 5 @ 0), 25%, 34%, 45%, 58%, 100% (all 5 @ 1). I don't know when or why one of the levels (25%) doesn't work all of the sudden. 2. In the following I'll attach two dmesg logs for a 3.17.6 kernel and a 3.18.0 kernel. For both I've used - kernel command line with acpi.debug_level=0x04 acpi.debug_layer=0x04 - issued echo 'module thermal_sys +fp' > /sys/kernel/debug/dynamic_debug/control - heated the system over the last trip point (74 C) before emergency would kick in - hibernate & resume For 3.18.0 I needed to issue extra cooling commands to the fans (maybe you'll read them). Created attachment 160361 [details]
kernel-bugzilla-78201-MK.dmesg-3.17.6.log
Created attachment 160371 [details]
kernel-bugzilla-78201-MK.dmesg-3.18.0.log
Any insights or news on why kernel's automatic thermal control fails for my system @ 3.18.x kernels? I've now tested a 3.18.1 with your patches and with a carefully reviewed and partly changed .config. But the above mentioned problems (for 3.18.0-rc) do remain completely. Without manually issuing a 1 to /sys/class/thermal/cooling_deviceX/cur_state the system fan's X level would always remain at it's bootup state. After resume I need to manually issue a "echo 0 > /sys/class/thermal/cooling_device0/cur_state" to at least calm down the fan from 100%. The 3.17.7 works well with your 3 patches. And just to ask in advance: If I'd find time to bisect the latter case with 3.18... in the coming days: a) What should I do with results that don't boot well (3.18-rc1 to -4 were buggy) b) should I always apply your patches, too, or should I only look for the commit, when thermal management failed on here? Thanks for your advise, Manuel Coming back from quite a weird odyssey of bisecting and testing... ...I'm clueless. :-( * (offtopic) I had first needed to bisect the culprit of not beeing able to boot my machine with a 3.17.0-rc2 (Comment 34, Comment 35) to finish the follow-up bisect: commit 208bf9fdcd3575aa4a5d48b3e0295f7cdaf6fc44 drm/i915: Fix locking for intel_enable_pipe_a() * the follow-up was, to bisect, what inhibits automatic fan control in 3.18.x: I got a bisecting result: commit 6ab3430129e258ea31dd214adf1c760dfafde67a mfd: Add ACPI support Reverting this commit on a previously failing 3.17.0-rc2 brings back automatic fan control after bootup -- BUT REVERTING this on a previously failing 3.18.x does NOT help. ((We already know, that my 3.17.7 works well, also with your patches applied.)) So, I honestly don't know WHAT ELSE I need to revert for 3.18.x. Please, have a look and advise! * Then I've tested a 3.19.0-rc1: Here, automatic fan control does work from bootup, but your patches don't work any more for the hibernation case (neither, without them). Maybe you can review them for this kernel, too?! Best regards, Manuel I'm quite astonished, that there aren't more people affected. Kernel 3.18.x is completely useless for my system without manually echoing ones or zeros to /sys/class/thermal/cooling_deviceX/cur_state via console. It's a shame. *PING* Zhang Rui, can you please answer to Comment 108 (and 109), regarding kernel 3.18.x at least?! Now, that the 3.17.x series got the EOL - the last kernel that worked with your knowntowork patches. I would need to file a new bug, but that could complicate things even more. Thanks, and a Happy New Year, Manuel See Comment 106. It's still valid for 3.18.3. Please, find some time to help to fix it. Hi, Manuel, (In reply to Manuel Krause from comment #108) > Coming back from quite a weird odyssey of bisecting and testing... > ...I'm clueless. :-( > > * (offtopic) I had first needed to bisect the culprit of not beeing able to > boot my machine with a 3.17.0-rc2 (Comment 34, Comment 35) to finish the > follow-up bisect: > commit 208bf9fdcd3575aa4a5d48b3e0295f7cdaf6fc44 > drm/i915: Fix locking for intel_enable_pipe_a() > Does this regression got fixed in the latest upstream kernel? > * the follow-up was, to bisect, what inhibits automatic fan control in > 3.18.x: > I got a bisecting result: > commit 6ab3430129e258ea31dd214adf1c760dfafde67a > mfd: Add ACPI support > > Reverting this commit on a previously failing 3.17.0-rc2 brings back > automatic fan control after bootup Hah, this is really weird, I can not explain how this could happen. > -- BUT REVERTING this on a previously > failing 3.18.x does NOT help. ((We already know, that my 3.17.7 works well, > also with your patches applied.)) > So, I honestly don't know WHAT ELSE I need to revert for 3.18.x. Please, > have a look and advise! > There is a known issue of commit f35cec255557d1037ff0d772edfd6e7b1e92cdc0, which is fixed in 3.19-rc1. But for 3.18, I don't think you can simply revert it. can you please apply commit 78579b7c7eb45f0e7ec5e9437087ed21749f9a9c and commit 175f8e2650f7ca6b33d338be3ccc1c00e89594ea on top of 3.18.x? > * Then I've tested a 3.19.0-rc1: Here, automatic fan control does work from > bootup, which I think is fixed by the above two patches. > but your patches don't work any more for the hibernation case > (neither, without them). Maybe you can review them for this kernel, too?! > yes, will do. (In reply to Zhang Rui from comment #112) > Hi, Manuel, > (In reply to Manuel Krause from comment #108) > > Coming back from quite a weird odyssey of bisecting and testing... > > ...I'm clueless. :-( > > > > * (offtopic) I had first needed to bisect the culprit of not beeing able to > > boot my machine with a 3.17.0-rc2 (Comment 34, Comment 35) to finish the > > follow-up bisect: > > commit 208bf9fdcd3575aa4a5d48b3e0295f7cdaf6fc44 > > drm/i915: Fix locking for intel_enable_pipe_a() > > > Does this regression got fixed in the latest upstream kernel? This regression only appeared while bisecting this issue ON HERE with 3.17-rc2+-. It is fixed. > > > * the follow-up was, to bisect, what inhibits automatic fan control in > > 3.18.x: > > I got a bisecting result: > > commit 6ab3430129e258ea31dd214adf1c760dfafde67a > > mfd: Add ACPI support > > > > Reverting this commit on a previously failing 3.17.0-rc2 brings back > > automatic fan control after bootup > > Hah, this is really weird, I can not explain how this could happen. > > > > -- BUT REVERTING this on a previously > > failing 3.18.x does NOT help. ((We already know, that my 3.17.7 works well, > > also with your patches applied.)) > > So, I honestly don't know WHAT ELSE I need to revert for 3.18.x. Please, > > have a look and advise! > > > There is a known issue of commit f35cec255557d1037ff0d772edfd6e7b1e92cdc0, > which is fixed in 3.19-rc1. But for 3.18, I don't think you can simply > revert it. can you please apply commit > 78579b7c7eb45f0e7ec5e9437087ed21749f9a9c > and commit 175f8e2650f7ca6b33d338be3ccc1c00e89594ea on top of 3.18.x? > Commit 78579b7c7eb45f0e7ec5e9437087ed21749f9a9c already made it's way into the kernel as of 3.18.3. Commit 175f8e2650f7ca6b33d338be3ccc1c00e89594ea is still missing BUT NEEDED: This one brings back automatic fan control for me!!! Together with your last knowntowork patches this 3.18.3 even works well after suspend-to-disk. Regards, Manuel Krause > > > * Then I've tested a 3.19.0-rc1: Here, automatic fan control does work from > > bootup, > > which I think is fixed by the above two patches. > > > but your patches don't work any more for the hibernation case > > (neither, without them). Maybe you can review them for this kernel, too?! > > > yes, will do. Any idea, when commit 175f8e2650f7ca6b33d338be3ccc1c00e89594ea can get into the 3.18.x kernel series -- or in other words, can you push it a bit?! Thank you very much for your work, Manuel Apparently I was too fast with my "good" experience from Comment 113. The patches you mentioned there regain automatic fan control (vs. temperature). Regarding your knowntowork patches, they don't work reliably with 3.18.3. The first shot in the above mentioned comment worked well, but I've made a bad experience over time with more reboots and more hibernate+resumes. The worst case is, with your patches, that the fan goes up to 100% un-neededly and staying there forever, indicated by "/": Boot: 1 2 3 Hibernate1: / / + 2 / + + 3 / / 4 / + 5 / 6 / Best regards, Manuel Hi, I don't know if this is related, but quite a few users on ArchLinux are suffering from the fan running at full speed all the time starting with 3.18.2. Reverting to 3.18.1 fixed the problem for one, whilst the rest reverted back to 3.17.6 without trying 3.18.1. This is the related BBS thread for information: https://bbs.archlinux.org/viewtopic.php?id=192255 Do you want me to see if commit 175f8e2650f7ca6b33d338be3ccc1c00e89594ea fixes their problem? Hi Claire, I can definitely encourage you to either apply the commit 175f8e2650f7ca6b33d338be3ccc1c00e89594ea on kernels before 3.18.5 or better just use this newer kernel. Currently, I'm testing it (good results!) but will need more boots and hibernates+resumes to have a reasonable final report. Regarding the linked forum thread (that I've been following for some weeks now), I'm not sure if all the mentioned problems there are connected to this bug here or they get fixed by the above commit(s). At least, it's certainly the same problem complex, you only need to read that the involved laptops belong to one HP/Compaq notebook family -- most probably incorporating one weird BIOS bug/glitch. Try, and please report back! Thank you! (In reply to Manuel Krause from comment #117) > Hi Claire, > I can definitely encourage you to either apply the commit > 175f8e2650f7ca6b33d338be3ccc1c00e89594ea on kernels before 3.18.5 or better > just use this newer kernel. Currently, I'm testing it (good results!) but > will need more boots and hibernates+resumes to have a reasonable final > report. > > Regarding the linked forum thread (that I've been following for some weeks > now), I'm not sure if all the mentioned problems there are connected to this > bug here or they get fixed by the above commit(s). At least, it's certainly > the same problem complex, you only need to read that the involved laptops > belong to one HP/Compaq notebook family -- most probably incorporating one > weird BIOS bug/glitch. > > Try, and please report back! Thank you! It would appear that whilst it's the looks to be the same problem area, it seems that this patch isn't working for them. I reported it as https://bugzilla.kernel.org/show_bug.cgi?id=92431 Claire's right in opening the new BUG 92431 for the problems that are shown in the archlinux forum thread (comment 116). The commit 175f8e2650f7ca6b33d338be3ccc1c00e89594ea only brings back automatic fan control from bootup for my system. It doesn't improve anything else regarding this bug here. To amend comment 115, the known-to-work patches don't only "work unreliably" on 3.18.x (now with .5) -- they honestly don't make any difference vs. a kernel without them. Actual behaviour: After resume from suspend to disk the fan goes to 100% and will only get down to lower and normal values if the trigger temperature drops downwards below one of the next (lower) temp trip points. The reason for the positive report in comment 113, what I was able to investigate so far, was having a warm system and a dropping trigger temp during resume, so that the next lower trip point was passed, always inducing a fan speed adjustment to lower speed. Sad, that this was apparently a row of coincidents for that testing. Unfortunately a kernel 3.18.5 without these three known-to-work patches behaves the same way, so I need to conclude: they don't work for 3.18.x any more, too. Can you, PLEASE, come back to this bug? I want to get rid of the current behavoiur, that after resuming from suspend the fan is at 100% unnededly. (And I may need to heat up the system beyond next upper "real" trip_point and then cool down to next lower trip_point to get the automatic back working again.) Your patches seem to NOT work any more. Why can't there simply be read in the actual temperatures after resume for the purpose to figure out the needed fan level? Please, come over with new patches for 3.18.6+ ! Thanks, in advance, Manuel FWIW, on my similar laptop exhibiting the same issue, I simply found the the EC register controling fan aspeed and wrote a userland script to bend it to my will using the ec debug driver. Not exactly elegant, but effective. I can share with you notes on how to do this if it would be useful. Actually, I don't want to discorage Zhang Rui to fix the correct temp. initialisation on this notebook family... @Mario Becroft, if it doesn't make too much extra work, you can put the usage instructions on top of your userland script and upload it somewhere and post the link here (or, of course, post it to my email address only). But if it's useful for you and you want to share it, I think other affected users should benefit from it, too. I also thought of simply issuing a command, after every resume, to at least silence the 100% fan level ('echo 0 > /sys/class/thermal/cooling_device0/cur_state'). So far I have not investigated, if that would/could interfere any later automatic control from kernel side. Thank you in advance and best regards, Manuel Also, a pure 3.19.0-rc7 doesn't improve anything. :-( To bring back some sunshine :) to this BUG: @ Zhang Rui: I've now done a new row of new tests with the 3.18.7 kernel with the digged out "for-upstream-patches". They are originating from Comment 61 and Comment 62. With modified versions to apply to the current kernel I was able to get a good behaviour after resume. Most important change needed to get it work, was, so to say: --- a/drivers/thermal/thermal_core.c +++ b/drivers/thermal/thermal_core.c @@ -169,6 +169,13 @@ int get_tz_trend(struct thermal_zone_device *tz, int trip) { enum thermal_trend trend; + /* + * If it's the first time to read the temperature, pretend it is DECREASING + * -- otherwise we'd get 100% fan speed (emergency level) after resume. + */ + if (tz->last_temperature == THERMAL_TEMP_INVALID) + return THERMAL_TREND_DROPPING; + if (tz->emul_temperature || !tz->ops->get_trend || tz->ops->get_trend(tz, trip, &trend)) { if (tz->temperature > tz->last_temperature) Regarding the safety issue, why you've wanted to set THERMAL_TREND_RAISING, I can say, either while booting or while resuming, this is counter-productive: The fan has passed all tests with actualising temperatures inbetween (boot/resume). I'll report back, on how 3.19.0 works with this combo. Best regards, Manuel Created attachment 166701 [details]
BUG_78201.kernel-3.18.7.0001-THERMAL-initialize-thermal-zone-device-correctly.MOD6
1 of 2 good patches for 3.18.7
Created attachment 166711 [details]
BUG_78201.kernel-3.18.7.0002-Thermal-handle-thermal-zone-device-update-events-cor.MOD1
1 of 2 good patches for 3.18.7
Comment 126 is: sorry, for the typo, No. 2 of 2 good patches for 3.18.7 O.k., again good news: Several reboots and hibernation cycles later I can say, that also 3.19.0 works with these two patches. Do you want me to attach debugging output, for you to check the technical / programmer's aspects of this approach? Can you, please, also say something about if there exists any relation to BUG 91411? Thank you in advance, Manuel Just FYI my HP/Compaq nc6000 is also hit by this bug. The patches from comment 125 and comment 126 seem to do the trick and get it back to working condition. I had an additional problem of thermal notifications being AWOL but I bisected that down and sent a revert [1]. Now, the last time thermal was broken on this machine wasn't that long ago (bug 56591). These HP/Compaq machines seem to be extremely fragile wrt. thermal/acpi issues, so I would suggest Rui or someone dealing with the thermal subsystem get themselves at least one of these machines for some regular regression testing. Otherwise I fear these regressions will keep on coming. [1] http://marc.info/?l=linux-acpi&m=142403948106603&w=2 Hi, Manuel, first of all, I'd like to say thank you for your continuous help on this issue and your feedback really helps a lot. I'm sorry that usually my response is a bit slow because I'm not full time work on bugzilla, plus I need to handle tens of bugs at the same time. your recent comments are really helpful. If I'm right, the only difference from my original patches is that you change the thermal trend to dropping instead of increasing, right? I choose trend raising just for safety reason, because if choose raising, the bad effect would be the fan spinning unnecessarily, but if choose dropping, the bad effect would be cause hardware damage if the system resumes while overheating. I will rework the patches a bit and update soon. Created attachment 166981 [details]
patch for upstream: 1-2
Created attachment 166991 [details]
patch for upstream: 2-2
Hi, Manuel, please try these two patches on top of 3.19 and see if the problem still exists. Created attachment 167201 [details] patch-for-upstream:1-2 corrected version I now attach a corrected version of the FIRST patch that did not apply and not compile cleanly with 3.19.0 Many thanks go to Olaf Hering from BUG 67101 in his Comment 24 for his work. https://bugzilla.kernel.org/attachment.cgi?id=167121&action=diff Hi, Rui, no, these new patches do result in the wrong behaviour, again: After resuming from hibernation, always the emergency (100%) fan level is on, what is not needed at all. And from my observation this also is NOT always the next upper fan level from the current temperature <-> trip_point relation! I think "the automatic" should stay at least as low as the level before hibernate and then re-consider by next time reading the actual temperatures. Actually with the patches from my Comments 125 & 126, here is no problem with choosing a lower next trip_point, as it is actually re-read during booting or during resuming from hibernate. Read it like: I can HEAR the proper fan adjustment upwardly during the boot/resume procedures, although we had a DROPPING policy. Best regards, Manuel From my observations: I need to make drop temperature over 2 trip_points, after resume from hibernation, downwards, to get a working fan again, due the false initialisation by code (100% ->58% fan), temp < 68°C, and possibly the other one below (58%->45% fan) temp < 54°C. Please, always consider, the system is going through the BIOS (that is setting fan speeds as well) also for resuming from hibernate. Manuel I am experiencing the same problem on my HP 6910p. Fan starts to spin at full speed after middle of boot-process. (In reply to Sebastian M. from comment #137) > I am experiencing the same problem on my HP 6910p. Fan starts to spin at > full speed after middle of boot-process. Hi, Sebastian, this BUG here covers the issue of 100% fan speed after suspend to disk/ram. For issues with high fan speeds already from bootup you should, please, have a look at the newly appeared BUG 92431 AND BUG 93301 -- to see if they better match your system's failure. Best regards, Manuel (In reply to Zhang Rui from comment #133) > Hi, Manuel, > > please try these two patches on top of 3.19 and see if the problem still > exists. This failure must not go into kernel 4.0. Please, let us fix it before. BR, Manuel (In reply to Manuel Krause from comment #135) > Hi, Rui, > > no, these new patches do result in the wrong behaviour, again: After > resuming from hibernation, always the emergency (100%) fan level is on, what > is not needed at all. And from my observation this also is NOT always the > next upper fan level from the current temperature <-> trip_point relation! > please enable the dynamic debug in drivers/thermal/thermal_core.c and drivers/thermal/step_wise.c and attach the dmesg output after this happens. Created attachment 168751 [details]
dmesg with 2 hibernations with Rui's two 3.19-related patches
This dmesg.log describes the following scenario:
booting into the relatively cold system, "echo 'module thermal_sys +fp' > /sys/kernel/debug/dynamic_debug/control", suspend to disk and resume shortly afterwards,
fan is @ 100% and it takes about 81s after resume, that the next lower trip point is passed, so the fan then goes down,
now I increase CPU load and wait to warm the system and suspend to disk again and resume again shortly afterwards,
fan again @ 100%, nothing happens for a long period of time, then I reduce CPU load, and shortly afterwards the fan spins down by 3(!) levels.
Reminder: For my system thermal_zone1 [CPUZ] is the relevant one for cooling and only cooling_device0..4 are the fan levels (25%, 34%, 45%, 58%, 100%) which are themselves shown in thermal_zone4 [FDTZ]. And the TuxOnIce debugging info comes when resume has almost finished but before my ethernet attaches.
This time I invested some time to try to understand, what this debugging output possibly means. But I don't understand, what thermal_sys does NOT do. E.g. after both resumes, why are cooling_device2..4 set to 1 (their state already is 1, BTW), but devices 0..1 are left at cur_state 1 and not reset? The related temperature would allow setting the latter devices to state 0. In my opinion / understanding, this is the point. Can they eventually be left out, because the temp is lower than their corresponding trip_point temp? That shouldn't happen.
I hope this helps, best regards,
Manuel
Created attachment 168791 [details]
0001-Thermal-initialize-thermal-zone-device-correctly.patch
Created attachment 168801 [details]
0002-Thermal-handle-thermal-zone-device-update-events-cor.patch
Great, I think I've found the root cause.
I've made several changes in this new patch set and rebase it on top of 4.0-rc2, can you please help check if they fix the problem for you or not?
Created attachment 169051 [details] dmesg with 2 hibernations with Rui's two 4.0-rc2-related patches Great work, I'd like to say! The new patches DO WORK FOR ME. Please, have a close look to the attached log, to check if thermal_sys is doing everything like you intended. I found it a bit confusing to follow, in which sequence and how often things get checked and states adjusted (and what not). The test in the log follows the scenario from Comment 141 run on a 4.0-rc1. During boot everything is o.k. The fan does not resume with 100% fan anymore and also quickly adjusts according to any increased/decreased load. I only had one problem: The very first resume with 4.0-rc1 with your new patches with a relatively cold system came up with 100% fan, but adjusted after ~91s. I was not able to reproduce it, but don't want to omit to mention it. (Any ideas, how this may happen?) The attached log is from the third successful run of my scenario. I've now also patched my actual 3.19.0 setup with your new ones (needed only one context line to be changed, the one with the typo in original 3.19.0) -- and it also passed my tests. Resuming from suspend to RAM also works fine. So, please, check if the log shows the right handling from your programmer's point of view. Me being lucky now, best regards and many thanks to you, especially for your patience and endurance, Manuel Created attachment 169061 [details]
Newest 0001 (1of2) patch for the 3.19.0 kernel
For testers on the "old" 3.19 kernel series.
Created attachment 169071 [details]
Newest 0002 (2of2) patch for the 3.19.0 kernel
For testers on the "old" 3.19 kernel series.
Good luck, and please, report back! :-)
Patches from Comment 145 and Comment 146 still working with 3.19.1 plus the debug patch from BUG 91411 https://bugzilla.kernel.org/attachment.cgi?id=169941 (although I didn't need it). Created attachment 170341 [details] dmesg.log from working 3.19.1 according to Comment 147 Created attachment 170581 [details]
patches for upstream
Hi,
This is the new version of patches that I'd like to submit for upstream.
The first three patches, which can fix the problem you have IMO, are likely to go to 4.0, and the follow seven patches are some enhancements based on the first three patches.
Please check if they work for you or not. (Please attach the dmesg output after resume, with dynamic debug enabled)
Created attachment 170611 [details]
patches for upstream - 2
Please help me do the test in two steps.
1. please check if patch 1~4 fixes the problem for you or not.
2. please apply all the patches in this tarball and check if the problem is gone.
Created attachment 170641 [details]
dmesg of 3.19.1 with patches 0001..0004 applied
So, the first round of tests succeeded! :-) This is with the first four patches. Please don't mind, that I've tested kernel 3.19.1, first. (Let me know, if you want me to redo it on kernel 4.0-rc3.)
BTW, I've left out the "0001-Debug-patch-to-sync-thermal-zone-update.patch" from the tarball as the code obviously reappears under the correct new name "0004-Thermal-make-thermal_zone_device_update-atomic.patch".
The attached dmesg shows one suspend-to-disk with relatively cold system and low load, then one suspend-to-disk with a hot system followed by a suspend-to-RAM.
From bootup through all resumes everything works fine and the fan levels are adjusted quickly on load change. The log also doesn't show any double thermal zone update confusions any more, like it was before.
Good work! Thank you very much!
And kernel with the next seven patches is compiling atm.
Created attachment 170651 [details] dmesg of 3.19.1 with patches 0001..0011 applied O.k. The second round succeeded, too. :-))) Nice to read, that this bothering BUG may lead to an overall improvement for the thermal system! With all 11 (eleven) patches applied. Plus the one that 3.19.1 doesn't have so far, to be applied _before_ "0005-Thermal-make-thermal-framework-be-aware-of-thermal-m.patch": direct link: https://github.com/torvalds/linux/commit/12ca7188468ee29c4e717f73db4bf43c90954fc7.patch Same tests as in Comment 151. Same good results also in the log. For the case of an over night really cooled down system, I'll report back tomorrow. Best regards, Manuel As a question aside: My system now (running your 11 patches) sometimes gets something like this: [ 871.712504] thermal_zone_trip_update: thermal thermal_zone1: Trip4[type=0,temp=60000]:trend=1,throttle=1 ..................^^^^^ [ 871.716620] thermal_zone_trip_update: thermal thermal_zone1: Trip4[type=0,temp=55000]:trend=1,throttle=1 ..................^^^^^ This is cut out from the last posted dmesg. And I see this 55000 as trip_point in the running tmon. This is contradicting somehow. And I think, the 60000 (60°C) would be more useful for cooling on here. Why does the trip point temp get changed? I know, that this occurred before on here, but not what patches were involved at that former times. Please, have a look on it. Thanks, Manuel The beahviour described in Comment 153 has NOTHING to do with the current patch testing! O.k., also with a really cold system (with the 11 patches) everything works fine. For me the issue is fixed and gets my tested-by tag. I hope the slight change between the "0001-Debug-patch-to-sync-thermal-zone-update.patch" and the "0004-Thermal-make-thermal_zone_device_update-atomic.patch" is really intended. Please, let me know, if you wish me to perform more/ different tests. I think this is done by BIOS, i.e. BIOS changes the trip point value after a trip point is crossed. The reason is to avoid fan device bumping around a certain temperature, say, fan on at 60c, temperature drops, and then fan off, and temperature raises and fan on... Created attachment 170771 [details]
patch v3
please try this patch set instead.
I made some changes in this patch set, and consider this as 4.0 material.
Please give it a try.
Created attachment 170831 [details]
dmesg of 4.0.0-rc3 with only the four v3 patches applied
This time I booted with kernel command line append module.dyndbg="module thermal_sys +fp" dyndbg="file thermal_core.c +fp; file step_wise.c +fp" to see and show what happens at boot for my system.
dmesg shows bootup, one relatively cold suspend-to disk, one hot suspend-to disk and one suspend-to RAM.
Everything works fine and, if I understand it correctly, the log also shows that all is in it's place. :-)
Just for the records: The patches from Comment 157 do also cleanly apply, compile and work fine on a 3.19.1 kernel. (I don't bother you with another dmesg.log as I compared both and they just show the same behaviour for the same tests.) Rui, I hope that you can come to a final decision about the smartest solution for this BUG soon and give it to the stable and longterm maintainers, too. Remember that this bug is open since 3.12.x kernels. Thank you very much again and best regards! Hello! I had this problem on an ASUS Chromebox (Intel Haswell Celeron 2955U CPU) running Xubuntu 14.04. It only has a single fan, for the CPU. With newer kernels, starting around 3.18, the CPU fan would spin up to max after resuming from suspend, making quite a loud whine for about 10-20 sec before settling back down to its normal, nearly inaudible level. I just tried Zhang's patchset (comment 157), applying it to the latest Ubuntu 3.19.1 based kernel (ubuntu-vervet 3.19.0-9.9), and the problem is fixed! Fan seems to be running at normal speed now on resume. Thanks! I just noticed that after the first boot with the patchset, and after maybe the second suspend, the fan would briefly pulse up quite high on resume (maybe 75% of maximum noise) for maybe 1 second, before settling back down. On that boot, that was consistent for several suspend/resume cycles in a row. After rebooting, I haven't been able to replicate it. It does still pulse up very briefly on resume, but not nearly as loudly. I have to put my ear near it to notice. I don't think this is due to an actual temperature difference. On the current boot, with no really noticeable fan noise on resume, I tried running a Python loop indefinitely, eating up 100% of a core. Then suspended and resumed, no effect on fan resume noise. Created attachment 170961 [details]
Chromebox dmesg after patch
Sorry for the spam, but I just noticed what was causing the brief fan pulse on resume: it pulses if I have an SD card in the slot. If I remove it, I get virtually no fan pulse on resume. Weird. Maybe the order of initialization on resume changes with an SD card in the slot, which allows the fan to spin up higher by default before thermal controls have a chance to bring it back down. Here's a dmesg with 8 suspend/resume cycles. The first 4 had the SD card and high fan pulse (1 sec) noise, the next 2 had no SD card and virtually no pulse noise, and the last 2 had the SD card and high fan pulse noise again.
And to further clarify, the brief fan pulse doesn't happen with an inserted USB stick. I've tried two different USB sticks in two different slots. It only happens with an SD card (I've tried two different ones) in the SD card slot. With my 3.19.1 kernel I use the TuxOnIce pathces, that seem to use CPU heavily upon suspend-to-disk/resume. The Fan would sometimes go to 100% in each procedure. But that is o.k. as a safety measuere. With the current patches from Rui the system always calms down, at least when X is up, just before the desktop (KDE) re-login shows up. And that's less then 20s after boot/resume. @ Zhang Rui: While looking at BUG 67101, and manually adjusting fan levels myself by issuing: "echo 0 > /sys/class/thermal/cooling_device0/cur_state" # top fan speed 100% "echo 0 > /sys/class/thermal/cooling_device1/cur_state" "echo 0 > /sys/class/thermal/cooling_device2/cur_state" "echo 0 > /sys/class/thermal/cooling_device3/cur_state" "echo 0 > /sys/class/thermal/cooling_device4/cur_state" # lowest fan speed 25%/0% the FAN later on only takes the actual trip_point <-> temp into account. The lower trip_point <-> temp relations are not recalled/ used. The fan then bounces from 58% (actual and last used trip_point temp met) to 0% (when temp lower than this actual trip_point). If I refill the appropriate lower /sys/class/thermal/cooling_device?/cur_state with "1" then (with appropriate device numbers), manually, everything works well again afterwards. (E.g. on load change it'll find next upper/lower fan level.) Watching tmon during this testing shows this behaviour, too. -- I can't consider this as failure safe. People are not said to manually change their FAN levels... -- But this behaviour is not correct. Even if manual interference is not expected, the "automatic" should find back to the next right lower/upper fan level. Hello! It appears that this bug can't be reproduced after applying these patches and thus is fixed. My question, though, is when are these patches going to flow into a new release? Will linux-4.0.0 come without this bug? Thanks... (In reply to Sebastian M. from comment #166) > Hello! > It appears that this bug can't be reproduced after applying these patches > and thus is fixed. My question, though, is when are these patches going to > flow into a new release? Will linux-4.0.0 come without this bug? > > Thanks... There do remain danger zones. Comment 165. So I don't like you to talk of "a fixed" BUG. (In reply to Manuel Krause from comment #165) > @ Zhang Rui: > While looking at BUG 67101, and manually adjusting fan levels myself by > issuing: > "echo 0 > /sys/class/thermal/cooling_device0/cur_state" # top fan speed 100% > "echo 0 > /sys/class/thermal/cooling_device1/cur_state" > "echo 0 > /sys/class/thermal/cooling_device2/cur_state" > "echo 0 > /sys/class/thermal/cooling_device3/cur_state" > "echo 0 > /sys/class/thermal/cooling_device4/cur_state" # lowest fan speed > 25%/0% > > the FAN later on only takes the actual trip_point <-> temp into account. The > lower trip_point <-> temp relations are not recalled/ used. The fan then > bounces from 58% (actual and last used trip_point temp met) to 0% (when temp > lower than this actual trip_point). > If I refill the appropriate lower > /sys/class/thermal/cooling_device?/cur_state with "1" then (with appropriate > device numbers), manually, everything works well again afterwards. (E.g. on > load change it'll find next upper/lower fan level.) > please redo the test with thermal dynamic debug turned on, and attach the dmesg output after the fan bounces from 58% to 0%. IMO, the symptom is something related with buggy BIOS, and it is a different problem. Thus I'd like to send these four patches together with the patch at https://bugzilla.kernel.org/show_bug.cgi?id=92431#c78 out ASAP, so that they can be included in 4.0, and I will push these patches for 3.19 stable kernel. what do you think? (In reply to Zhang Rui from comment #169) > IMO, the symptom is something related with buggy BIOS, and it is a different > problem. > Thus I'd like to send these four patches together with the patch at > https://bugzilla.kernel.org/show_bug.cgi?id=92431#c78 out ASAP, so that they > can be included in 4.0, and I will push these patches for 3.19 stable kernel. > > what do you think? Of course, yes: Please send out these five patches for 4.0 and push them in 3.19 ! Best working solution for these notebooks, so far, since the bug exists. Noone normally bothers with adjusting fan speeds manually, so my Comment 165 shouldn't be a show stopper for the more important fixes that are in your patches. I would be glad to see them in the kernel, best regards and thank you for your great work, Manuel Created attachment 171521 [details]
dmesg of a 4.0-rc4 with the five relevant patches with bouncing fan
Unfortunately, the vanilla 4.0-rc4 seems to dissipate more temperature than my 3.19.2 with several other patches, so I needed to verify the logs first for a longer time. The fan HERE actually bounces from 100% to 0% and back. (The 3.19.2 kernel bounced from 58 to 0 and back.)
The dmesg log only shows the warming up of my notebook, under CPU load, while I continuously deactivate the relevant fan levels.
"echo 0 > /sys/class/thermal/cooling_device0/cur_state" # top fan speed 100%
"echo 0 > /sys/class/thermal/cooling_device1/cur_state"
"echo 0 > /sys/class/thermal/cooling_device2/cur_state"
"echo 0 > /sys/class/thermal/cooling_device3/cur_state"
"echo 0 > /sys/class/thermal/cooling_device4/cur_state" # lowest fan speed 25%/0%
Then it ends up bouncing between 100% and 0%. Same seen on tmon + gkrellm.
I've written into the log, where I first suspect a failure in the automatic:
"NO THERMAL_CDEV_UPDATE for cooling devices lower than thermal cooling_device0 happens (meaning devices 1 to 4), ALTHOUGH their target state is correctly set to 1 and their cur_state is 0.
A BIOS bug? Maybe. But maybe, that you have a nice idea for a failure safe coded workaround.
Thank you in advance,
Manuel
P.S.: I can repeat the test on 3.19.2 again, if you wish, but that would only show the bouncing between 58 and 0 but not bring more evidence.
Mmmh. Maybe, you should simply forget my Comment 171 and Comment 165. After spending more thoughts on them: If someone really manually turns off all the fans in the Linux internals, she/he should know of the danger that this action imposes. And, on the other side, everyone should be able to turn off special fan levels. (This is possible with your patches.) As your patches prevent conditions, that need manual intervention, at system side on boot and resumes from disk/RAM, I'd say for the current state that there is no manual action needed any more. So, please, push these five patches! Manuel Maybe my last bothering question regarding this bug complex, especially regarding the patch from https://bugzilla.kernel.org/show_bug.cgi?id=92431#c78: If I don't get the printk("head 0x%p, head->next 0x%p, pos->cdev_node 0x%p, pos->tz 0x%p, next->tz 0x%p\n", &cdev->thermal_instances, (&cdev->thermal_instances)->next, &pos->cdev_node, pos->tz, next->tz); from this applied patch in my logs like prash's logs from https://bugzilla.kernel.org/show_bug.cgi?id=92431#c79 show them whenever a thermal_zone gets registered -- is this indicating a false behaviour or just that my system just does not pass this section of code in a correct sense? Thank you in advance for clarifying! Hah, the patch from https://bugzilla.kernel.org/show_bug.cgi?id=92431#c79 puts the fan devices to proper state if the fan is registered AFTER thermal zone device registered. And this rarely happens. Patches to fix the problem sent out. https://patchwork.kernel.org/patch/6077231/ https://patchwork.kernel.org/patch/6077241/ https://patchwork.kernel.org/patch/6077251/ Thank you for submitting these first three patches. Should we monitor these patchwork URLs for the ongoing discussion or would you inform us on here about patch changes? Mmh, should we worry??? See my question @ https://bugzilla.kernel.org/show_bug.cgi?id=91411#c91 that you have not answered so far. Also, you've changed things within the first three patches. Up to V3: https://patchwork.kernel.org/patch/6127231/ https://patchwork.kernel.org/patch/6127221/ https://patchwork.kernel.org/patch/6127241/ Are these patches still compatible with "the old" other two: https://bugzilla.kernel.org/attachment.cgi?id=171921 https://bugzilla.kernel.org/attachment.cgi?id=171931 ? Please, also read all the replies more carefully in the kernel patchwork area. I've now seen more than two cases where you did not answer to reviewers' questions. You've unfortunally archived them 'away'. Last one: > > +/* Invalid/uninitialized temperature */ > > +#define THERMAL_TEMP_INVALID -27400 > > + > > Did you mean -274000 (note the missing '0') which is < 0K? A comment would be > helpful for anybody reading the code. Took me a while to parse it. > > > /* Unit conversion macros */ > > #define KELVIN_TO_CELSIUS(t) (long)(((long)t-2732 >= 0) ? \ > > ((long)t-2732+5)/10 : ((long)t-2732-5)/10) Thanks for more transparency, Manuel And what do the re-newed patches change? Should we re-test with them? yes, thanks for reminding me. :) I didn't ask you guys to test because I think there is no functional change. But please, kindly help me test the patch V4, to make a double check. Maybe for V4 to appear we need April 1st to finish?! ;-) I've tested the V3 patches from the links written in Comment 177, instead, and, although you don't like it, tested them on my 3.19.3 kernel (only with the three patches, omitting the two other patches that you consider as 4.1 material). Works well so far, from booting over to hibernating with a hot and also with a relatively cold system. Also, Suspend-to-RAM is resuming well. Maybe I've still some issue with the very lowest fan level, like I may have reported before, already. Although the cooling device 4 is set to 1, system being at very low temperature, I don't see it beeing reported via the fan RPM representation in acpitz4. I watch this in tmon and in gkrellm, too. And, it seems to be a bit difficult for me to verify that this fan level (25%) gets REALLY activated in the hardware. When the next upper fan level (34%) kicks in, I can verify this by hearing a slightly higher noise than my hdds usually produce and by watching a flickering lighter's flame near the fan outlet. That's not possible with the 25% level, either because the fan is still off -- or it is running that slowly that no flame can flicker. For now I don't want to use a screwdriver to be able to see the fan spinning at 25%. I fyou have a possible explanation from the programming side or any other idea for me to verify on here, please, let me know. Thank you very much, Manuel Was it intended by you to tag the three V3 patches @patchwork as "superseeded" and no newer ones (e.g. V4) showed up there? I'm just wondering. The time frame for 4.0-final is getting tight. Please, consider Eduardo Valentin's comments {I mentioned earlier, that you should read more closely...} @ https://patchwork.kernel.org/patch/6166681/ and push out a V5 patchset soon, that I can test. The V4 vs. V3 comparison results in one poor added 0. ;-( Created attachment 173531 [details]
patch v5 1/3
Created attachment 173541 [details]
patch v5 2/3
Created attachment 173551 [details]
patch v5 3/3
well, my evolution is broken and I did missed some email when using outlook. Sorry about this. Please give the above patches a try. Hi, Rui! Thank you for the updates. The three V5 patches work very well in all scenarios (bootup cold, cold+hot resume from disk, as well as resumes from RAM)! Thank you very much! BTW, please be aware, that Eduardo has even written another comment @ V4-1of3: https://patchwork.kernel.org/patch/6166681/ Best regards, Manuel @Rui: Your last two messages to https://patchwork.kernel.org/patch/6166681/ are not readable, please check, what went wrong. Its in base64. You can read them by passing those crypted lines to "base64 -d" Thank you! I didn't know it's that easy. ;-) So, lets hope for V6 patches to surface... @ Rui: Are we going to see a V6 patches' revision somewhere, or could you just send in _tested_ V5 for inclusion into kernel 4.0 final? ping ... Any news from you, Rui? As it didn't go into 4.0.x ... Where is V6 of your patches? And what about the locking fix work? What about your findings with https://bugzilla.kernel.org/show_bug.cgi?id=96881 The three posted patches on here, v5, do work with my system until now, with kernel 4.0.1 at the moment. it fails to get into 4.0, but I will post refreshed ones and push for 4.1, and then backport to 3.18, 3.19, and 4.0. @Rui: Please, don't forget to send in, at least, the three V5 patches for kernel 4.1 !!! Kernel 4.1.0-rc5 still doesn't have these patches. :-((( Just FYI, without the patches from Comment 183, Comment 184 and Comment 185 -- the V5 patches -- my thermal system would not work correctly. I've tested the three patches on 3.18.14 (only context modifications needed to backport), on 4.0.4 and on 4.1.0-rc5. With them everything works fine. Please, don't wait for the answer from the "almost" dead thread https://bugzilla.kernel.org/show_bug.cgi?id=71711 -- just send in the known good patches. Thank you in advance, Manuel Krause I second that. The three patches work perfectly. The locking issue is another issue. Please send the patches in. I mark this as REOPENED, as noone has sent in the patches for 4.1, nor initiated a backport to 4.0.x. Rui, that would have been your job for, now how many months? I see that 4.1 got released. I assume Zhang Rui is on vacation. Can someone else please push the patches into the release? It's really sad to see that this hasn't been pushed yet. I wonder if only Zhang Rui can push it? Why can't someone else with authority push it? Dependence on one person is always sad for open source software / codes. I just re-tested with the upcoming kernel version, atm with 4.2.0-rc5, and the patches are still not included, but urgently needed. With applied V5 patches, see the references in my comment 196, everything works fine in suspend-to-disk and suspend-to-RAM. So, please, can someone push the patches to the kernel finally? It's so disappointing to have this BUG still open. :-((( Best regards, Manuel Krause bugzilla-daemon@bugzilla.kernel.org writes: > So, please, can someone push the patches to the kernel finally? It's so > disappointing to have this BUG still open. :-((( Agreed. This is preventing me from upgrading the kernel on my laptops. Rui, What is the status of these patches? Aaron, FWIW, kernel 4.1.6 still has the bug. I tested on the latest git checkout (4.2.0-rc7-g2c6625c), and the patches work just fine. Can you please push the patches yourself? A cursory search on bugzilla indicates that Zhang hasn't been active here for the last few months. The last I've heard from Rui was in May, that he wanted to "refresh" the patches (whatever this meant). Before, he had ideas to enhance them in the future, too. Aaron, if you're a person in charge or capable of reviewing/ reworking the "V5" patches' code in Comment 183, Comment 184 and Comment 185 -- we would be very thankful. Especially -- if you'd finally send them in for kernel inclusion! The code works well on here @ kernel 4.1.6 and the last one I've checked was 4.2.0-rc5. (Maybe there were needed minor adjustments in the patches' context lines to apply cleanly.) Best regards, Manuel Krause OK, I'll talk to Rui when he comes back from Linux Plumber. reassign to Chen yu, who will take and refresh the previous patches and push for upstream. Any idea if the patches will get into 4.2? Just go through this thread, and the story is so long but it's interesting to watch the process on how to debug these bugs. Will send out a serie of patches recently. Maybe the patch will get merged into 4.3-rcn. Yu First: Rui, thank you very much for your longterm work, and, for handing it over! Second: Welcome, Yu! You don't need to reconstruct the whole thread, unless you need it for your history studies... ;-) Third: The three patches I've lastly mentioned / referenced to in my Comment 206 are working well with a 4.2.0 kernel. (Maybe context adjustments needed.) Final: It'would be very nice, to see this BUGfix working in a kernel in really near future. If you'd send in reworked patches to LKML/etc., please, let us know in this place. We want to test them, too. Best regards, Manuel Krause Hi Manuel, Thanks a lot for tracking this thread, BTW, do you have bandwidth to help me test the latest patch, which is supposed to fix racing problem in thermal management. This patch is not related to your bug though... I'd like to send this patch together with Rui's patches. Can you please help to test this patch on top of Rui's: https://patchwork.kernel.org/patch/6166681/ https://patchwork.kernel.org/patch/6166691/ https://patchwork.kernel.org/patch/6166701/ and please enable the thermal dynamic debug to see if there is any fan problems after resume, Thanks! Yu Created attachment 188081 [details] This patch is to avoid lock racing when doing thermal_update. This patch is to avoid lock racing when doing thermal_update, please apply on top of https://patchwork.kernel.org/patch/6166681/ https://patchwork.kernel.org/patch/6166691/ https://patchwork.kernel.org/patch/6166701/ Hey, I might test it too, depends on if I find time. Also, will this make it into 4.3? I would really like to give this old laptop to my mum who really needs it but it's quite unusable without these patches, and I don't really want to give her Arch since she needs it as stable as possible. Hi Yu, can you please have a look at your referenced patches and compare them with the three v5 patches that Rui had attached to this thread (and not published on Patchwork)? They are: https://bugzilla.kernel.org/attachment.cgi?id=173531 https://bugzilla.kernel.org/attachment.cgi?id=173541 https://bugzilla.kernel.org/attachment.cgi?id=173551 If I compared them correctly, patch 1 has one significant change in the very first hunk. And then... coming to your new lock race avoiding patch: Everything works fine and as expected with it applied on top (now with kernel 4.2.1). Tested with a hot system as well as with a relatively cold one, with suspend-to-disk and with suspend-to-RAM. No problems at all. I need to add, that I kept using the v5 patches (that worked well for months now) and didn't take the v4 ones that you refer to. In case that you want to have debugging logs, please let me know. Best regards, Manuel Krause Hi Chen Yu, Do you mind refreshing the patches and releasing just *one* patchset that we could apply? With the large time-frame of patches it's become difficult for me to keep track of all the patches. If you were to release one patchset, I could test that too, and we could base all further discussion (and patches) on that patchset, henceforth. Hi, all Patch sent at: https://patchwork.kernel.org/patch/7273051/ https://patchwork.kernel.org/patch/7273001/ https://patchwork.kernel.org/patch/7273041/ BTW: The locking fixing patch will not be sent at this moment, since we need to do more research on it. Hey, thanks for sending those patches in! What about the racing patch, will you send that in too? I can try compiling the kernel today and test it if you want. Hi, Yu, your patches are apparently for 4.3-rc kernels (one hunk with fuzz at a 4.2.1) but no code change in comparison to Rui's V5 patches. -- And yes they keep working on 4.2.1. Haven't tested 4.3-rc yet. I've been reading the review comments added to your patchwork submissions, so I'm looking forward to the improved next revision. I really appreciate progress on this longstanding topic and want to thank you in advance for your work. Please, keep us informed about new patches to test. Best regards, Manuel Any chance to see reworked patches soon? -- Or do we need to wait until kernel 10.x.y to have this fixed? Unbelievable non-progress on this BUG fix. :-( Javi has some comments on locking implementation with the third patch: https://patchwork.kernel.org/patch/7273041/ although this patch works in current kernel, javi suggests we find a more graceful solution for scalability consideration. Since the locking always needs careful design, we are still testing different implementation on patch3. Created attachment 190851 [details]
V6-0001-Thermal-initialize-thermal-zone-device-correctly
Hi, Manuel
can you please help test these three V6 patches?(only a small change from
V5)
thanks
Created attachment 190861 [details]
V6-0002-Thermal-handle-thermal-zone-device-properly-during-s
Created attachment 190871 [details]
V6-0003-Thermal-do-thermal-zone-update-after-a-cooling-devic
Plz test with thermal dynamic debug enabled and provide dmesg , thanks! O.k. -- I've waited for 4.2.4 to show up -- and tested the patches. (Only some little context variations, fuzzes, for this kernel.) First news: Everything still works fine. If I'd attach the dmesg with dynamic debug info, how would you like me to invoke it? Is it sufficient for you if I issued "echo 'module thermal_sys +fp' > /sys/kernel/debug/dynamic_debug/control" after system startup and play a bit with suspend-to-disk and different temperatures? Or do you want me to boot with a special kernel command line append? Please, advise me, as I don't recall the most efficient/useful logging version, especially regarding the command line. Best regards, and many many thanks for your work, Manuel Well if it works..there is no need to gather the dmesg. will send out this patch to maillist. Thanks Yu I've now tested the most recent posted patch revision with a 4.3.0 kernel: The three patches: https://patchwork.kernel.org/patch/7525501/ https://patchwork.kernel.org/patch/7525491/ https://patchwork.kernel.org/patch/7525431/ And they do work very well and as expected. As they're still not in kernel 4.3.0, and there didn't appear any correctional proposals for some days, I would thank you in advance if you pushed them to kernel as soon as possible. Best regards, Manuel This series of patch will be merged into 4.4-rc1. Yu Thank you! Hello. I experience this problem as well and I'm very happy that there exists a patch for it (thanks!). I compiled 4.4.0-rc1-g8005c49 today and the problem persists. I also see the patches at patchwork still marked as new. Could you estimate when the patches will be merged? (another 4.4 rc?) I don't mean to push - I just can't wait for the patches to be merged :-) Appreciate the hard work. Thank you, René. A short lookup into the sources of 4.4-rc2 shows, that the patches are still NOT included. What is it about your continuous problems with the kernel inclusion of the working patches? Wrong channels? Laziness? I'm quite upset about that for now. Since kernel 3.12 !!! Manuel I think it is just merged into the thermal tree now, will inform you once it get pulled into upstream. I has been 1.5 years and this is *still* not in mainline? Here I was thinking I could finally upgrade my laptop over Christmas to a current kernel... Glad to hear Chen that you "think" it is merged into the thermal tree and will be pulled in upstream. Last time I submitted a patch to the kernel, which is a few years ago, it was accepted within days. If there is anything I or someone else here can do to expedite this process, someone in the know please explain. Sorry for the slightly rant-ish tone. I really want to know if there is anything else that needs to be done here, to get this closed. Cheers, Mario No need to apologize for possible rant-ish tones on this issue any more. The patches work well with 4.3.3. Looked up into 4.4-rc7 sources: They're STILL NOT INCLUDED. I don't understand this delay any more. It's simply ridiculous. :-( Maybe we need to send in the three last good patches to LKML on our own. Unfortunately, for 4.4 it again seems to be too late. Pains, Manuel It seems that it has been merged - https://patchwork.kernel.org/patch/7525501/ It was marked as "New" a few days back. Not sure where it lands, since 4.4-rc7 is out and it is not there, but waiting for 4.5 is nothing compared to the 1.5 years. Someone available and CAPABLE to send in the patches to LKML directly, _bypassing_ the current assignees CHEN YU and ZHANG RUI? The patches are still not included into 4.4.0. The patches have been merged in thermal tree and will be queued for 4.5. This is a big change for thermal core, thus it should be a -rc1 material instead of -rc5 and later. Patch set has been merged, so close current thread as fixed. commit bb431ba26c5cd0a17c941ca6c3a195a3a6d5d461 Author: Zhang Rui <rui.zhang@intel.com> Date: Fri Oct 30 16:31:47 2015 +0800 Thermal: initialize thermal zone device correctly commit ff140fea847e1c2002a220571ab106c2456ed252 Author: Zhang Rui <rui.zhang@intel.com> Date: Fri Oct 30 16:31:58 2015 +0800 Thermal: handle thermal zone device properly during system sleep commit 4511f7166a2deb5f7a578cf87fd2fe1ae83527e3 Author: Chen Yu <yu.c.chen@intel.com> Date: Fri Oct 30 16:32:10 2015 +0800 Thermal: do thermal zone update after a cooling device registered |