Bug 92431

Summary: Fan Running at Full Speed with 3.18.1+
Product: Power Management
Reporter: Claire Farron (diesal3)
Component: Thermal
Assignee: Chen Yu (yu.c.chen)
Status: CLOSED DUPLICATE
Severity: normal
CC: aaron.lu, alan, ammdispose-arch, diesal3, lenb, manuelkrause, marius, prash.n.rao, radek, sluckxz, szegadlo
Priority: P1
Hardware: All
OS: Linux
URL: https://bbs.archlinux.org/viewtopic.php?id=192255
Kernel Version: 3.18.0 - 3.19
Subsystem:
Regression: Yes
Bisected commit-id:
Attachments: amish-3.17.6
amish-3.18.2
marcoc-3.18.2
ristic-3.17.6
ristic-3.18.1
ristic-3.18.2
triple_star-3.14.x
triple_star-3.18.2
ubone-lts
ubone-3.18.2
output of "acpi -tc" on 3.14.32 (LTS) kernel
output of "acpi -tc" on 3.18.6 (release) kernel
output of "acpi -tc" on 3.19 kernel
output of "acpi -tc" on 3.18.0 kernel
acpidump kernel 3.17.6
acpidump kernel 3.18.6
prash-class-thermal-3.14.34.txt
prash-class-thermal-device-path-3.14.34.txt
prash-class-thermal-3.18.6.txt
prash-class-thermal-device-path-3.18.6.txt
prash-acpidump
amish-acpidump
amish-class-thermal-3.17.6-1-ARCH.txt
amish-class-thermal-3.18.6-1-ARCH.txt
amish-class-thermal-device-path-3.17.6-1-ARCH.txt
amish-class-thermal-device-path-3.18.6-1-ARCH.txt
prash-class-thermal-4.0.0-rc3-g9eccca0
prash-class-thermal-4.0.0-rc3-g9eccca0
prash-dmesg-4.0.0-rc3-g9eccca0.xz
prash-class-thermal-4.0.0-rc3-g9eccca0
prash-dmesg-4.0.0-rc3-g9eccca0.tar.xz
0001-Thermal-do-thermal-zone-update-after-a-cooling-devic.patch
prash-dmesg-4.0.0-rc4-g06e5801-7.xz
prash-journalctl-4.0.0-rc4-g06e5801.tar.xz
0001-Thermal-do-thermal-zone-update-after-a-cooling-devic.patch
prash-thermal-logs.xz

Description Claire Farron 2015-02-01 14:11:26 UTC
The fan runs at full speed on certain HP/Compaq laptops after upgrading to 3.18.2 or higher.

Known Fixes: Downgrade to 3.18.1 or any previous kernel version.

Main Information Thread: https://bbs.archlinux.org/viewtopic.php?id=192255

This is not a duplicate of https://bugzilla.kernel.org/show_bug.cgi?id=78201 as 3.18.5 does not resolve the problem.
Comment 1 Claire Farron 2015-02-01 14:12:11 UTC
Created attachment 165411 [details]
amish-3.17.6
Comment 2 Claire Farron 2015-02-01 14:12:30 UTC
Created attachment 165421 [details]
amish-3.18.2
Comment 3 Claire Farron 2015-02-01 14:12:51 UTC
Created attachment 165431 [details]
marcoc-3.18.2
Comment 4 Claire Farron 2015-02-01 14:13:11 UTC
Created attachment 165441 [details]
ristic-3.17.6
Comment 5 Claire Farron 2015-02-01 14:13:32 UTC
Created attachment 165451 [details]
ristic-3.18.1
Comment 6 Claire Farron 2015-02-01 14:13:55 UTC
Created attachment 165461 [details]
ristic-3.18.2
Comment 7 Claire Farron 2015-02-01 14:14:21 UTC
Created attachment 165471 [details]
triple_star-3.14.x
Comment 8 Claire Farron 2015-02-01 14:14:50 UTC
Created attachment 165481 [details]
triple_star-3.18.2
Comment 9 Claire Farron 2015-02-01 14:15:07 UTC
Created attachment 165491 [details]
ubone-lts
Comment 10 Claire Farron 2015-02-01 14:15:26 UTC
Created attachment 165501 [details]
ubone-3.18.2
Comment 11 Aaron Lu 2015-02-06 05:40:25 UTC
I briefly read the topic in the Arch Linux forum. Is it correct that the problem first appears in v3.18.2 and that the last known good kernel is v3.18.1?
Comment 12 Radek Podgorny 2015-02-06 09:47:12 UTC
well, at least for me, it's not that easy.

i've (unsuccessfully) tried to bisect the issue and it seems like the problem is in 3.18.0 as well. but since for some boots, it's ok even on 3.18.5, it's hard to tell. :-(
Comment 13 Claire Farron 2015-02-06 10:36:28 UTC
It looks like 3.18.1 is good for @amish and then 3.18.2 is bad.

For @ristic, it seems the issue starts with 3.18.1 (so maybe, the same as Radek here).

I think for the others, we'd need to ask them to try 3.18.1 (and maybe 3.18.0) to see where it kicks in.

Does that sound reasonable?
Comment 14 sluckxz 2015-02-07 02:29:10 UTC
3.18.1 fan runs full speed for me. 
3.17.6-1-ARCH is normal.
Probook 4510s.
Comment 15 prash 2015-02-08 17:09:13 UTC
Created attachment 166081 [details]
output of "acpi -tc" on 3.14.32 (LTS) kernel

run on HP ProBook 4410s
Comment 16 prash 2015-02-08 17:11:01 UTC
Created attachment 166091 [details]
output of "acpi -tc" on 3.18.6 (release) kernel

run on HP ProBook 4410s
Comment 17 prash 2015-02-08 17:13:06 UTC
As indicated by my attachments, the problem remains in 3.18.6; all the fan control bits are set to '1' on bootup.
Comment 18 prash 2015-02-09 20:56:38 UTC
Created attachment 166241 [details]
output of "acpi -tc" on 3.19 kernel

This is what I get when I execute:
% (uname -srvmo && acpi -tc) > prash-`uname -r`.log

on my HP ProBook 4410s.

I have tested all the releases from 3.18.1 to 3.19, and in each, the fan runs at its maximum speed, right from bootup.

On my system, Thermal 0, which corresponds to FDTZ and thermal_zone5, refers to the fan speed.
Comment 19 Manuel Krause 2015-02-09 22:28:43 UTC
*PING* *PING* *PING* @ Zhang Rui
Comment 20 amish 2015-02-10 02:14:30 UTC
Same issue in HP Probook 4510s.

PS: I had already reported in arch forum. Just adding weight to bug.
Comment 21 prash 2015-02-10 20:28:50 UTC
Created attachment 166411 [details]
output of "acpi -tc" on 3.18.0 kernel

This is what I get when I execute:
% (uname -srvmo; acpi -tc) > prash-`uname -r`.log

on my HP ProBook 4410s.

As indicated by Thermal 0, and all the Cooling channels, the fan is running at its maximum speed.
Comment 22 amish 2015-02-11 10:37:32 UTC
Just wanted to know why is status still NEEDINFO even after so many responses?

I think developers normally see status and do not look into tickets marked NEEDINFO thinking that it is still awaiting response from reporter or some user?

Btw, I do not know what kind of info is needed apart from kernel version and hardware make?
Comment 23 Claire Farron 2015-02-11 11:52:57 UTC
(In reply to amish from comment #22)
> Just wanted to know why is status still NEEDINFO even after so many
> responses?
> 
> I think developers normally see status and do not look into tickets marked
> NEEDINFO thinking that it is still awaiting response from reporter or some
> user?

I should have said that I (the reporter) am not affected by this problem, but I am reporting on other people's behalf.
Comment 24 amish 2015-02-11 16:42:40 UTC
I know that you are the reporter and are not affected by it. (Thank you for reporting it even so.)

What I mean (politely) is that the bug should at least be moved to "CONFIRMED" status now.

Also, since there has been no response from Zhang Rui, to whom this bug is assigned, I just wanted to know if he (or some other kernel developer) is aware of it.

3.19 has also been released, and the bug exists in that too (as per reports in the Arch forum).

I am seeking urgent attention because in Arch Linux, sticking to an older kernel is not recommended; you must remain "up-to-date" with all packages to avoid future issues due to older packages.
Comment 25 Alan 2015-02-11 18:45:20 UTC
This is not a support forum.

If you have a service level agreement with your supplier then talk to them. Bugs in bugzilla get dealt with as and when someone feels like fixing one.
Comment 26 amish 2015-02-13 05:43:31 UTC
One more similar-looking bug, which also reports that the issue occurs with 3.18.x kernels but not with 3.17:

https://bugzilla.kernel.org/show_bug.cgi?id=91411

Someone has identified a bad commit using git bisect (I have no idea what it is!)
Comment 27 szegad 2015-02-13 10:30:43 UTC
See:
https://bugzilla.kernel.org/show_bug.cgi?id=91411#c10

Matthias bisected the first bad commit in the similar issue.
You could try it and see if it fixes your problem.
Comment 28 Aaron Lu 2015-02-15 02:35:59 UTC
So v3.17 is OK. Does the issue start with v3.18 itself, or only with the v3.18.x stable kernels?
Also, it seems multiple people are affected; please respond to the above question and attach your acpidump:
# acpidump > acpidump.txt
Comment 29 amish 2015-02-15 07:01:42 UTC
Created attachment 166931 [details]
acpidump kernel 3.17.6
Comment 30 amish 2015-02-15 07:02:06 UTC
Created attachment 166941 [details]
acpidump kernel 3.18.6
Comment 31 amish 2015-02-15 07:03:44 UTC
Yes 3.17 is OK.

And yes starts with 3.18.

For me it starts from 3.18.2 but for others 3.18.1. I did not test twice with 3.18.1 because it had graphical issue.

But for most people it appears to start right from start of 3.18.

Please see comments above for acpidump for kernel 3.17.6 and 3.18.6
Comment 32 Aaron Lu 2015-02-15 07:56:02 UTC
Different people start to have the problem with different kernel versions, which suggests the root causes may differ. Please do a git bisect to find the offending commit.

For people whose problem starts with a v3.18.x release while v3.18 works, the bisect should not take much time, since the problem starts between two stable kernel versions, i.e. v3.18.x works while v3.18.x+1 doesn't. Please use the stable git tree to do the bisect:
https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux-stable.git linux-3.18.y

The bad commit mentioned in comment #26 should not be your problem, since it is already present in v3.18 and v3.18 works for you.

BTW, the acpidump is the same no matter which kernel version is in use, so one upload is enough :-)
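
A minimal sketch of the bisect setup described above, assuming git is installed. The helper names run and start_bisect and the DRY_RUN variable are hypothetical, introduced only for illustration, and the good/bad tags are whatever each tester has actually verified on their own machine.

```shell
# Print commands instead of running them when DRY_RUN=1 (hypothetical helper).
run() {
    if [ "${DRY_RUN:-0}" = 1 ]; then
        echo "$*"
    else
        "$@"
    fi
}

# start_bisect GOOD_TAG BAD_TAG
# Clones the stable tree and starts a bisect between the two tags.
start_bisect() {
    good="$1"
    bad="$2"
    run git clone https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux-stable.git
    run cd linux-stable
    # "git bisect start <bad> <good>" marks both endpoints in one step.
    run git bisect start "$bad" "$good"
}
```

For example, "DRY_RUN=1 start_bisect v3.18.1 v3.18.2" only prints the commands (the tags here are placeholders). After each build and boot test, the tester runs "git bisect good" or "git bisect bad" until git names the first bad commit, then "git bisect reset".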
Comment 33 prash 2015-02-15 12:03:01 UTC
(In reply to amish from comment #31)
> For me it starts from 3.18.2 but for others 3.18.1. I did not test twice
> with 3.18.1 because it had graphical issue.

Notwithstanding the graphics issue, could you please retest with 3.18.0 and 3.18.1? I'm asking because it's very likely that you and I have identical system boards (see http://h20628.www2.hp.com/km-ext/kmcsdirect/emr_na-c01905586-6.pdf). I suspect that fan control seemed to work for you only temporarily, as it has, for many of us, with one or more kernel versions. Therefore, I also suspect that starting to bisect *after* 3.18.0 will prove futile for you.
Comment 34 amish 2015-02-15 12:31:00 UTC
I have no idea/experience about kernel compilation OR git. :(

Just learning that is going to take me a long time and unfortunately I do not have so much time due to other commitments. I really feel sorry that I am helpless.

I will re-test with 3.18.1 and report back. This time I will run multiple reboots just to be sure.

If someone (trusted user) can compile and upload kernel 3.18.0 for Arch Linux - 64bit, I can definitely test that too.

Thanks.
Comment 35 prash 2015-02-15 12:35:29 UTC
(In reply to amish from comment #34)
> If someone (trusted user) can compile and upload kernel 3.18.0 for Arch
> Linux - 64bit, I can definitely test that too.

Claire Farron has already uploaded 3.18.0. See https://bbs.archlinux.org/viewtopic.php?pid=1501688#p1501688.
Comment 36 amish 2015-02-15 13:47:03 UTC
Ok so I tested again with 3.18.1! And PROBLEM EXISTS in 3.18.1 as well.

I don't know how come it did not occur last time OR did I overlook something?

Just to re-iterate things I observed.


With 3.18 series
-----------------
During GRUB boot fan speed is normal.

But half way during booting it starts running at full speed. And always remains at FULL SPEED. Even when CPU usage is negligible, it still runs at FULL SPEED.

acpi -tc gives:
Thermal 0: ok, 90.0 degrees C

90.0 degrees may be some hint


With 3.17 series
-----------------
Fan is normal till KDE starts loading. Fan then goes to FULL speed while KDE is still loading.

After KDE is loaded, FAN is back to normal.

acpi -tc gives:
Thermal 0: ok, 55.0 degrees C


So only Thermal 0 differs. Rest of the temperatures are similar for both series.


I believe that the case would remain same even with 3.18.0.
Comment 37 amish 2015-02-15 14:47:40 UTC
So tried with 3.18.0 too (compiled by Claire Farron - from link above)

Same issue and same symptom as 3.18.1 :(

Surprising that acpi -tc always gives:
Thermal 0: ok, 90.0 degrees C

And did not change even after 5 minutes
Comment 38 Radek Podgorny 2015-02-15 23:01:05 UTC
see https://bugzilla.kernel.org/show_bug.cgi?id=93301 for the bug report i've just created. it may (or may not) be a somewhat connected issue.
Comment 39 Manuel Krause 2015-02-18 21:50:27 UTC
BTW, one of the Thermal X values that stays constant for you, at either a high or a low level, most probably represents the fan speed, as it does on my HP/Compaq 6730b at Thermal 0. These systems don't have a separate fan speed sensor apart from that "Thermal 0" value.

So, seeing that value not change is evidence of the error (no fan speed change), and not generally the cause of the misbehaviour.

Best regards,
Manuel
(from BUG 78201)
Comment 40 Zhang Rui 2015-03-02 03:25:54 UTC
please attach the output of "cat /sys/class/thermal/thermal_zone0/device/path"
Comment 41 Zhang Rui 2015-03-02 03:45:49 UTC
I suspect this is the same problem as in bug 93301 because
1. the cause of the problem is that the temperature does not change after boot.
2. they are all on HP platforms.

So please check whether reverting commit 6ab3430129e258ea31dd214adf1c760dfafde67a, or building your kernel from "git checkout 6ab3430129e258ea31dd214adf1c760dfafde67a", fixes the problem.
Comment 42 prash 2015-03-02 06:15:03 UTC
@Zhang Rui,

I ran this:
% for i in /sys/class/thermal/thermal_zone*; do; echo -n $i "-- "; cat $i/device/path; done 
/sys/class/thermal/thermal_zone0 -- \_TZ_.GFXZ
/sys/class/thermal/thermal_zone1 -- \_TZ_.DTSZ
/sys/class/thermal/thermal_zone2 -- \_TZ_.CPUZ
/sys/class/thermal/thermal_zone3 -- \_TZ_.SKNZ
/sys/class/thermal/thermal_zone4 -- \_TZ_.BATZ
/sys/class/thermal/thermal_zone5 -- \_TZ_.FDTZ

% for i in /sys/class/thermal/thermal_zone*; do; echo -n $i " -- "; cat $i/temp; done
/sys/class/thermal/thermal_zone0  -- 16000
/sys/class/thermal/thermal_zone1  -- 43000
/sys/class/thermal/thermal_zone2  -- 41000
/sys/class/thermal/thermal_zone3  -- 44000
/sys/class/thermal/thermal_zone4  -- 24800
/sys/class/thermal/thermal_zone5  -- 30000

Please note that this thermal_zone0 corresponds to "Thermal 5" as reported by "acpi -t"; the counting goes backwards. Moreover, the GFXZ readout has never been meaningful for me. It always reports 16°C (or 16000).

It takes me quite a few hours to compile the kernel on my old laptop, so I'll wait for someone with a faster machine to try it out first. If no one does it in the next few days, I'll do it myself.
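
The two loops above can be folded into one helper that prints each zone with its ACPI path and its temperature in degrees Celsius. This is only a sketch, assuming the sysfs layout shown in the output above (a device/path file and a temp file in millidegrees per zone); the function name and the optional root-directory argument are hypothetical and exist only so the helper can be pointed at a copy of the tree.

```shell
# list_thermal_zones [ROOT]
# Prints "<zone> <acpi path> <temp> C" for every thermal zone under ROOT
# (default: /sys/class/thermal).
list_thermal_zones() {
    root="${1:-/sys/class/thermal}"
    for zone in "$root"/thermal_zone*; do
        [ -d "$zone" ] || continue
        # device/path holds the ACPI object name, e.g. \_TZ_.CPUZ; print ? if absent.
        acpi_path=$(cat "$zone/device/path" 2>/dev/null || echo "?")
        # temp is reported in millidegrees Celsius.
        milli=$(cat "$zone/temp" 2>/dev/null) || continue
        [ -n "$milli" ] || continue
        printf '%s %s %s.%s C\n' "${zone##*/}" "$acpi_path" \
            "$((milli / 1000))" "$((milli % 1000 / 100))"
    done
}
```

Running it once on a good kernel and once on a bad one, and comparing the two listings, shows which zone stays fixed.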
Comment 43 prash 2015-03-05 18:14:45 UTC
Created attachment 169261 [details]
prash-class-thermal-3.14.34.txt

output of grep -s . /sys/class/thermal/*/*
Comment 44 prash 2015-03-05 18:16:27 UTC
Created attachment 169271 [details]
prash-class-thermal-device-path-3.14.34.txt

Linux 3.14.34 output of grep . /sys/class/thermal/*/device/path
Comment 45 prash 2015-03-05 18:18:32 UTC
Created attachment 169281 [details]
prash-class-thermal-3.18.6.txt

Kernel 3.18.6 output of grep -s . /sys/class/thermal/*/*
Comment 46 prash 2015-03-05 18:19:43 UTC
Created attachment 169291 [details]
prash-class-thermal-device-path-3.18.6.txt

Kernel 3.18.6 output of grep . /sys/class/thermal/*/device/path
Comment 47 prash 2015-03-05 18:34:30 UTC
@Zhang Rui,

I have attached the command outputs that you had asked for on the Archlinux BBS.

My system:
% inxi -F                  
System:    Host: Prash5 Kernel: 3.14.34-1-lts x86_64 (64 bit) Desktop: KDE 5 Distro: Arch Linux
Machine:   System: Hewlett-Packard product: HP ProBook 4410s v: F.20
           Mobo: Hewlett-Packard model: 3072 v: KBC Version 24.0F
           Bios: Hewlett-Packard v: 68PZI Ver. F.20 date: 12/09/2011
CPU:       Dual core Intel Core2 Duo T6570 (-MCP-) cache: 2048 KB 
           clock speeds: max: 2101 MHz 1: 1200 MHz 2: 1200 MHz
Graphics:  Card: Intel Mobile 4 Series Integrated Graphics Controller
           Display Server: N/A driver: intel Resolution: 104x39
Audio:     Card Intel 82801I (ICH9 Family) HD Audio Controller driver: snd_hda_intel
           Sound: Advanced Linux Sound Architecture v: k3.14.34-1-lts
Network:   Card-1: Intel PRO/Wireless 5100 AGN [Shiloh] Network Connection driver: iwlwifi
           IF: wls1 state: up mac: 00:22:fa:f7:2a:34
           Card-2: Marvell 88E8072 PCI-E Gigabit Ethernet Controller driver: sky2
           IF: ens5 state: down mac: 00:25:b3:5d:c8:70
Drives:    HDD Total Size: 120.0GB (84.5% used) ID-1: /dev/sda model: Samsung_SSD_840 size: 120.0GB
Partition: ID-1: / size: 109G used: 95G (92%) fs: ext4 dev: /dev/sda4
           ID-2: /boot size: 976M used: 94M (11%) fs: ext4 dev: /dev/sda3
Sensors:   System Temperatures: cpu: 38.0C mobo: N/A
           Fan Speeds (in rpm): cpu: N/A
Info:      Processes: 170 Uptime: 3 min Memory: 1105.0/5874.1MB Init: systemd
           Client: Shell (zsh) inxi: 2.2.19 

> 5. in 3.18 kernel, when the problem is reproduced, please confirm whether the
> temperature changes or not if you change the workload manually.

On 3.18.6, I ensured that both cores of my processor were 100% utilized, waited for a minute, and saw "sensors" report that my cores were running at ~50°C. I then killed the CPU intensive tasks, and watched the temperature go back to ~35°C. The fan was running at its max speed ever since bootup, and it remained that way no matter how I made the CPU temperature rise or fall.
Comment 48 prash 2015-03-05 18:40:17 UTC
Created attachment 169301 [details]
prash-acpidump
Comment 49 amish 2015-03-06 03:27:55 UTC
Created attachment 169461 [details]
amish-acpidump

Output of acpidump
Comment 50 amish 2015-03-06 03:30:32 UTC
Created attachment 169471 [details]
amish-class-thermal-3.17.6-1-ARCH.txt

Kernel 3.17.6
grep -s . /sys/class/thermal/*/*
Comment 51 amish 2015-03-06 03:31:43 UTC
Created attachment 169481 [details]
amish-class-thermal-3.18.6-1-ARCH.txt

Kernel 3.17.6
grep -s . /sys/class/thermal/*/*
Comment 52 amish 2015-03-06 03:32:35 UTC
(In reply to amish from comment #51)
> Created attachment 169481 [details]
> amish-class-thermal-3.18.6-1-ARCH.txt
> 
> Kernel 3.17.6
> grep -s . /sys/class/thermal/*/*

Please read as Kernel 3.18.6.
Comment 53 amish 2015-03-06 03:34:36 UTC
Created attachment 169491 [details]
amish-class-thermal-device-path-3.17.6-1-ARCH.txt

Kernel 3.17.6
grep . /sys/class/thermal/*/device/path
Comment 54 amish 2015-03-06 03:36:10 UTC
Created attachment 169501 [details]
amish-class-thermal-device-path-3.18.6-1-ARCH.txt

Kernel 3.18.6
grep . /sys/class/thermal/*/device/path

Please note there is a big size difference in the output compared to 3.17.6

3.17.6 size is 1280 bytes
3.18.6 size is 520 bytes
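
To see which entries account for the size difference, the two dumps can be compared by file name. The following is a rough sketch, assuming both files were produced by the grep commands above (one "path:content" line per entry); the function name is hypothetical.

```shell
# missing_entries OLD_DUMP NEW_DUMP
# Lists sysfs paths that appear in OLD_DUMP but not in NEW_DUMP.
missing_entries() {
    old="$1"
    new="$2"
    a=$(mktemp)
    b=$(mktemp)
    # Keep only the file name part (before the first colon) of each line.
    cut -d: -f1 "$old" | sort -u > "$a"
    cut -d: -f1 "$new" | sort -u > "$b"
    # comm -23 prints lines that are unique to the first file.
    comm -23 "$a" "$b"
    rm -f "$a" "$b"
}
```

For example: missing_entries amish-class-thermal-device-path-3.17.6-1-ARCH.txt amish-class-thermal-device-path-3.18.6-1-ARCH.txt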
Comment 55 amish 2015-03-06 03:44:35 UTC
Uploaded files as per Zhang Rui's post here:
https://bbs.archlinux.org/viewtopic.php?pid=1507923#p1507923

NOTE:
Mine and prash's system should be more or less similar.
Mine is HP Probook 4510s and his is 4410s


Question 2 to 4 - files attached above

Question 1.
% inxi -F (removed HDD partition info)

System:    Host: amish Kernel: 3.17.6-1-ARCH x86_64 (64 bit) Desktop: KDE 5 Distro: Arch Linux
Machine:   System: Hewlett-Packard product: HP ProBook 4510s v: F.12
           Mobo: Hewlett-Packard model: 3072 v: KBC Version 24.0D
           Bios: Hewlett-Packard v: 68PZI Ver. F.12 date: 11/30/2009
CPU:       Dual core Intel Core2 Duo T6570 (-MCP-) cache: 2048 KB
           clock speeds: max: 2101 MHz 1: 1600 MHz 2: 1600 MHz
Graphics:  Card: Intel Mobile 4 Series Integrated Graphics Controller
           Display Server: X.Org 1.17.1 driver: intel Resolution: 1366x768@59.64hz
           GLX Renderer: Mesa DRI Mobile Intel GM45 Express GLX Version: 2.1 Mesa 10.4.5
Audio:     Card Intel 82801I (ICH9 Family) HD Audio Controller driver: snd_hda_intel
           Sound: Advanced Linux Sound Architecture v: k3.17.6-1-ARCH
Network:   Card-1: Intel PRO/Wireless 5100 AGN [Shiloh] Network Connection driver: iwlwifi
           IF: wls1 state: down mac: xxx
           Card-2: Marvell 88E8072 PCI-E Gigabit Ethernet Controller driver: sky2
           IF: ens5 state: up speed: 100 Mbps duplex: full mac: xxx
Sensors:   System Temperatures: cpu: 47.0C mobo: N/A
           Fan Speeds (in rpm): cpu: N/A
Info:      Processes: 191 Uptime: 2:10 Memory: 2085.5/3858.0MB Client: Shell (zsh) inxi: 2.2.19


Question 5: in 3.18 kernel, when the problem is reproduced, please confirm whether the temperature changes or not if you change the workload manually.


Same observation as prash. CPU temperatures increase (on adding load) and decrease (on reducing load) but FAN is always at full speed.
Comment 56 Zhang Rui 2015-03-09 08:00:31 UTC
please apply the patch at comment #142 and comment #143 at bug #78201, and see if the problem still exists.
If yes, please attach the output of "grep -s . /sys/class/thermal/*/*" when the bug is reproduced.
Comment 57 Manuel Krause 2015-03-12 21:51:42 UTC
If these two patches from Comment 56 alone don't cure the issue you can try one additional debug patch from Zhang Rui from Comment https://bugzilla.kernel.org/show_bug.cgi?id=91411#c66, from BUG 91411, direct link to the patch: https://bugzilla.kernel.org/attachment.cgi?id=169941 that may be of benefit.

Thank you in advance for reporting back!
Comment 58 Zhang Rui 2015-03-14 13:45:52 UTC
please apply the patches at https://bugzilla.kernel.org/show_bug.cgi?id=78201#c150 and see if the problem still exists.
If yes, please run echo 'module thermal_sys +fp' > /sys/kernel/debug/dynamic_debug/control, and attach the dmesg output after the problem is reproduced.
Comment 59 prash 2015-03-15 01:47:56 UTC
Created attachment 170661 [details]
prash-class-thermal-4.0.0-rc3-g9eccca0

Reporting for the latest linux-stable.git with the patches referred to at comment #56.

The bug is seen here too; the fan stays running at max speed, right from bootup.

I will report the status of the other patches over the next couple of days, as I slowly compile the kernels.

For those of you who want to test this release, you can find it at http://www41.zippyshare.com/v/BGHdvPMI/file.html
Comment 60 prash 2015-03-15 09:30:10 UTC
@Zhang Rui,

I tried applying the patches from comment #58, but it looks like there are some conflicts in the patch set. It's asking me if I want to revert previously applied patches. The offending file was 0004-Thermal-make-thermal_zone_device_update-atomic.patch.

For the record, I also tried applying the patch set to a fresh checkout, without the patches from comments #56 and #57.

Can you please generate me a fresh patchset?
Comment 61 prash 2015-03-15 13:34:24 UTC
Created attachment 170681 [details]
prash-class-thermal-4.0.0-rc3-g9eccca0

Output after applying the patchset from https://bugzilla.kernel.org/show_bug.cgi?id=78201#c150. Since, per https://bugzilla.kernel.org/show_bug.cgi?id=78201#c151, "0001-Debug-patch-to-sync-thermal-zone-update.patch" has been superseded by 0004, I omitted that file from the patches I applied.

Current status: the same problem as before: fan runs at max speed.

Per comment #58, I also ran echo 'module thermal_sys +fp' > /sys/kernel/debug/dynamic_debug/control (as root). It produced the following in dmesg:

---- begin paste ----
[Mar15 14:09] update_temperature: thermal thermal_zone2: last_temperature=38000, current_temperature=30000
[  +0.000007] thermal_zone_trip_update: thermal thermal_zone2: Trip1[type=1,temp=105000]:trend=2,throttle=0
[  +0.000005] get_target_state: thermal cooling_device1: cur_state=0
[  +0.000003] thermal_zone_trip_update: thermal cooling_device1: old_target=-1, target=-1
[  +0.000003] get_target_state: thermal cooling_device0: cur_state=0
[  +0.000002] thermal_zone_trip_update: thermal cooling_device0: old_target=-1, target=-1
[  +0.000005] thermal_zone_trip_update: thermal thermal_zone2: Trip2[type=0,temp=84000]:trend=2,throttle=0
[  +0.000035] get_target_state: thermal cooling_device9: cur_state=0
[  +0.000003] thermal_zone_trip_update: thermal cooling_device9: old_target=-1, target=-1
[  +0.000003] thermal_zone_trip_update: thermal thermal_zone2: Trip3[type=0,temp=74000]:trend=2,throttle=0
[  +0.000031] get_target_state: thermal cooling_device10: cur_state=0
[  +0.000003] thermal_zone_trip_update: thermal cooling_device10: old_target=-1, target=-1
[  +0.000004] thermal_zone_trip_update: thermal thermal_zone2: Trip4[type=0,temp=62000]:trend=2,throttle=0
[  +0.000030] get_target_state: thermal cooling_device11: cur_state=0
[  +0.000002] thermal_zone_trip_update: thermal cooling_device11: old_target=-1, target=-1
[  +0.000004] thermal_zone_trip_update: thermal thermal_zone2: Trip5[type=0,temp=52000]:trend=2,throttle=0
[  +0.000029] get_target_state: thermal cooling_device12: cur_state=0
[  +0.000003] thermal_zone_trip_update: thermal cooling_device12: old_target=-1, target=-1
[  +0.000003] thermal_zone_trip_update: thermal thermal_zone2: Trip6[type=0,temp=44000]:trend=2,throttle=0
[  +0.000304] get_target_state: thermal cooling_device13: cur_state=0
[  +0.000003] thermal_zone_trip_update: thermal cooling_device13: old_target=0, target=-1
[  +0.000003] thermal_cdev_update: thermal cooling_device13: zone2->target=18446744073709551615
[  +0.000004] thermal_cdev_update: thermal cooling_device13: set to state 0
[  +0.000004] thermal_zone_trip_update: thermal thermal_zone2: Trip7[type=0,temp=30000]:trend=2,throttle=1
[  +0.000031] get_target_state: thermal cooling_device14: cur_state=1
[  +0.000002] thermal_zone_trip_update: thermal cooling_device14: old_target=1, target=0
[  +0.000003] thermal_cdev_update: thermal cooling_device14: zone2->target=0
[  +0.003391] thermal_cdev_update: thermal cooling_device14: set to state 0
[  +0.001679] update_temperature: thermal thermal_zone2: last_temperature=30000, current_temperature=30000
[  +0.000006] thermal_zone_trip_update: thermal thermal_zone2: Trip1[type=1,temp=105000]:trend=2,throttle=0
[  +0.000005] get_target_state: thermal cooling_device1: cur_state=0
[  +0.000003] thermal_zone_trip_update: thermal cooling_device1: old_target=-1, target=-1
[  +0.000004] get_target_state: thermal cooling_device0: cur_state=0
[  +0.000003] thermal_zone_trip_update: thermal cooling_device0: old_target=-1, target=-1
[  +0.000004] thermal_zone_trip_update: thermal thermal_zone2: Trip2[type=0,temp=84000]:trend=0,throttle=0
[  +0.000042] get_target_state: thermal cooling_device9: cur_state=0
[  +0.000004] thermal_zone_trip_update: thermal cooling_device9: old_target=-1, target=-1
[  +0.000004] thermal_zone_trip_update: thermal thermal_zone2: Trip3[type=0,temp=74000]:trend=0,throttle=0
[  +0.000040] get_target_state: thermal cooling_device10: cur_state=0
[  +0.000003] thermal_zone_trip_update: thermal cooling_device10: old_target=-1, target=-1
[  +0.000005] thermal_zone_trip_update: thermal thermal_zone2: Trip4[type=0,temp=62000]:trend=0,throttle=0
[  +0.000038] get_target_state: thermal cooling_device11: cur_state=0
[  +0.000004] thermal_zone_trip_update: thermal cooling_device11: old_target=-1, target=-1
[  +0.000004] thermal_zone_trip_update: thermal thermal_zone2: Trip5[type=0,temp=52000]:trend=0,throttle=0
[  +0.000038] get_target_state: thermal cooling_device12: cur_state=0
[  +0.000004] thermal_zone_trip_update: thermal cooling_device12: old_target=-1, target=-1
[  +0.000004] thermal_zone_trip_update: thermal thermal_zone2: Trip6[type=0,temp=44000]:trend=0,throttle=0
[  +0.000038] get_target_state: thermal cooling_device13: cur_state=0
[  +0.000004] thermal_zone_trip_update: thermal cooling_device13: old_target=-1, target=-1
[  +0.000004] thermal_zone_trip_update: thermal thermal_zone2: Trip7[type=0,temp=37000]:trend=0,throttle=0
[  +0.000037] get_target_state: thermal cooling_device14: cur_state=0
[  +0.000004] thermal_zone_trip_update: thermal cooling_device14: old_target=0, target=0
---- end paste ----

For anyone else who wants to test it, I have uploaded this build to http://www26.zippyshare.com/v/Eozr4aq0/file.html
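
The paste above is dense, so a small filter helps to pull out only the lines where a cooling device was actually switched. This is a sketch that assumes the message format shown in the paste; the function name is hypothetical. As a side note, 18446744073709551615 is 2^64 - 1, which is how -1 prints as an unsigned 64-bit value, so the zone2->target line appears to carry the same target=-1 seen on the line before it.

```shell
# cdev_changes
# Reads dmesg text on stdin and prints "<cooling device> <new state>"
# for every "set to state" message from thermal_cdev_update.
cdev_changes() {
    sed -n 's/.*thermal_cdev_update: thermal \(cooling_device[0-9]*\): set to state \([0-9]*\).*/\1 \2/p'
}
```

Typical use would be "dmesg | cdev_changes" right after reproducing the problem.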
Comment 62 Zhang Rui 2015-03-16 02:59:52 UTC
please use boot option module.dyndbg="module thermal_sys +fp" and attach the dmesg after boot, with the patches applied.
Comment 63 prash 2015-03-16 08:36:10 UTC
Created attachment 170721 [details]
prash-dmesg-4.0.0-rc3-g9eccca0.xz

Output of dmesg on 4.0.0-rc3 with applied patches and boot option module.dyndbg="module thermal_sys +fp"
Comment 64 Zhang Rui 2015-03-16 08:43:51 UTC
no, you should use module.dyndbg="module thermal_sys +fp", rather than  "module.dyndbg=module thermal_sys +fp"
Comment 65 prash 2015-03-16 09:26:42 UTC
I did do that. And just to be sure, I did it again. I start my system with grub2 and systemd. Then I wondered if the double quotes are clashing with one of them, and tried single quotes. Apparently they get treated the same, and the dmesg output remains more or less the same each time. Later into the boot, the order in which USB and graphics drivers get initialized changes.

Here's what I've tried:
https://imgur.com/41EnHxg -- double quotes
https://imgur.com/CD0zcgw -- single quotes

If I'm still doing something wrong, can you please let me know how I can get it right?
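
One way to rule out quoting problems is to look at what the kernel actually received rather than at what was typed into the boot loader. The sketch below assumes that the running kernel's command line is readable from /proc/cmdline and that a correctly passed option still carries its double quotes there; whether the quotes survive may depend on the boot loader. The function name and the optional file argument are hypothetical.

```shell
# dyndbg_args [FILE]
# Prints every *dyndbg="..." option found in FILE (default: /proc/cmdline).
# Returns non-zero if no option in that quoted form is present.
dyndbg_args() {
    cmdline="${1:-/proc/cmdline}"
    grep -o '[A-Za-z_.]*dyndbg="[^"]*"' "$cmdline"
}
```

If this prints nothing after booting with the option, the quotes were probably moved or dropped somewhere between the boot menu and the kernel.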
Comment 66 Zhang Rui 2015-03-16 11:44:25 UTC
hmmm, please use the following instead as it works in bug #67101.
Xodule.Xyndbg="module thermal_sys +fp" dyndbg="file thermal_core.c +fp; file step_wise.c +fp"
Comment 67 prash 2015-03-16 12:19:04 UTC
Created attachment 170751 [details]
prash-class-thermal-4.0.0-rc3-g9eccca0

I have attached the dmesg output for two instances. For one of them, I passed "module.dyndbg...", and for the other "Xodule.Xyndbg...". I did it twice because I thought Xodule was a typo. Apparently, the output is similar in both cases.

Anyhow, this time around there is more information in the dmesg output.
Comment 68 Manuel Krause 2015-03-16 23:15:37 UTC
BTW, Rui has made new patches for the other BUG 78201 https://bugzilla.kernel.org/show_bug.cgi?id=78201#c157. For me they do work, but as you know, our problem is a bit different. 

Mainly, I don't understand what Rui describes in BUG 67101,
{https://bugzilla.kernel.org/show_bug.cgi?id=67101#c27}
namely that the log shows the fan levels being reset and adjusted, but the changes don't reach the real hardware itself(?). Is the code missing the right cooling_device* ?

Maybe, with the next log, you can give an example with increasing load/temperature/fan speed and then decreasing all three on your system?

Another BTW: For me, with old grub, adding module.dyndbg="module thermal_sys +fp" dyndbg="file thermal_core.c +fp; file step_wise.c +fp"
to the kernel command line produces some of the wanted debugging output.

Best regards and thank you for your time!
Comment 69 Zhang Rui 2015-03-17 01:36:10 UTC
hmmm, this sounds like a grub bug?
can you please append module.dyndbg="module thermal_sys +fp" dyndbg="file thermal_core.c +fp; file step_wise.c +fp"  to grub.cfg file directly and then reboot and see if we have any luck?
Comment 70 prash 2015-03-17 08:14:37 UTC
Created attachment 170911 [details]
prash-dmesg-4.0.0-rc3-g9eccca0.tar.xz

This time, I made the changes to grub.cfg. However, it still appears the same way in dmesg.

In addition to that, as Manuel Krause suggested, I took the CPU temperature from 35°C (boot) to 50°C (heavy load) and back to 35°C. I got a dump of dmesg after the entire process.

I have also attached the contents of /sys/class/thermal/*/device/path and /sys/class/thermal/*/*.

I don't know if it is a grub bug, but I see additional output in dmesg, with get_target_state, and thermal_zone_trip_update. So maybe it's not a bug, and the kernel handles it the same in either case.
Comment 71 Zhang Rui 2015-03-18 08:51:56 UTC
Created attachment 171061 [details]
0001-Thermal-do-thermal-zone-update-after-a-cooling-devic.patch

please apply this patch on top and see if the problem still exists.
Comment 72 prash 2015-03-18 08:54:13 UTC
Sorry, I don't understand what you mean by "on top". Does it mean I should apply this patch *before* (top line in my build script) all the others or *after* (an addition to the other patches)?
Comment 73 Zhang Rui 2015-03-18 09:05:43 UTC
you should apply it after the four patches have been applied.
Comment 74 prash 2015-03-18 09:14:11 UTC
Thanks. Got it.

By the way, I have been applying all these patches so far:

0001-Thermal-initialize-thermal-zone-device-correctly.patch
0002-Thermal-handle-thermal-zone-device-update-events-cor.patch
0003-ACPI-thermal-remove-unused-thermal-suspend-callbacks.patch
0004-Thermal-make-thermal_zone_device_update-atomic.patch
0005-Thermal-make-thermal-framework-be-aware-of-thermal-m.patch
0006-ACPI-thermal-remove-redundant-code-for-thermal-mode-.patch
0007-platform-acerhdf-remove-redundant-thermal_zone_devic.patch
0008-Thermal-db8500_thermal-remove-redundant-thermal_zone.patch
0009-Thermal-imx_thermal-remove-redundant-code-after-ther.patch
0010-Thermal-of-thermal-remove-redundant-code-after-therm.patch
0011-Thermal-ti-soc-thermal-remove-redundant-code-after-t.patch

I understand your latest comment to mean that instead of the above, I should do this:

0001-Thermal-initialize-thermal-zone-device-correctly.patch
0002-Thermal-handle-thermal-zone-device-update-events-cor.patch
0003-ACPI-thermal-remove-unused-thermal-suspend-callbacks.patch
0004-Thermal-make-thermal_zone_device_update-atomic.patch
0001-Thermal-do-thermal-zone-update-after-a-cooling-devic.patch
Comment 75 Zhang Rui 2015-03-18 09:27:18 UTC
I mean you should apply this patch on top of the patches at https://bugzilla.kernel.org/show_bug.cgi?id=78201#c157
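
To illustrate what "on top" means in practice: the base series is applied first, in order, and the extra patch goes last. The sketch below assumes GNU patch and is run from the top of the kernel source tree. The function name, the PATCH_CMD variable and the file names in the usage line are placeholders, not part of this bug.

```shell
#!/bin/sh
# Sketch: apply a base series first, then one extra patch "on top" of it.
# PATCH_CMD is a hypothetical override, handy for experimenting.
PATCH_CMD="${PATCH_CMD:-patch}"

apply_in_order() {
    # Arguments are patch files in the order they must be applied
    # (base series first, the extra patch last).
    for p in "$@"; do
        # Check first, so a failing patch does not leave a half-patched tree.
        $PATCH_CMD -p1 --dry-run -i "$p" >/dev/null || {
            echo "does not apply cleanly: $p" >&2
            return 1
        }
        $PATCH_CMD -p1 -i "$p" || return 1
    done
}

# Usage: apply_in_order ../base/*.patch ../extra-fix.patch
```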
Comment 76 prash 2015-03-18 12:09:37 UTC
Created attachment 171071 [details]
prash-dmesg-4.0.0-rc4-g06e5801-7.xz

dmesg dump of kernel v 4.0.0-rc4 with patches from https://bugzilla.kernel.org/show_bug.cgi?id=78201#c157.

When I applied the patch at https://bugzilla.kernel.org/show_bug.cgi?id=92431#c71, the system would not boot. I was not able to capture a dmesg dump of that, but I have screenshots: https://imgur.com/jKsGZE8,HIL0cqD,pTEVbVL Please note that there are three image files here.
Comment 77 prash 2015-03-18 14:16:45 UTC
Created attachment 171131 [details]
prash-journalctl-4.0.0-rc4-g06e5801.tar.xz

Please disregard the screenshots I attached in my previous message. I went through my journalctl logs, and discovered that the messages had been saved there after all.

The current attachment contains two files:
prash-journalctl-4.0.0-rc4-g06e5801.1.log -- This refers to the instance for which I attached screenshots. I waited for about 10 minutes for it to boot. However, I had not passed the dyndbg when booting it up. I forced a shutdown after that.
prash-journalctl-4.0.0-rc4-g06e5801.2.log -- This refers to a later instance, where I passed the dyndbg flags. I forced a shutdown as soon as I had recorded the thermal debug messages.
Comment 78 Zhang Rui 2015-03-19 06:19:34 UTC
Created attachment 171161 [details]
0001-Thermal-do-thermal-zone-update-after-a-cooling-devic.patch

then please drop the previous one and apply this one instead.
please attach the sys log output after boot, even if the problem still exists.
Comment 79 prash 2015-03-19 09:37:11 UTC
Created attachment 171171 [details]
prash-thermal-logs.xz

The latest patch seems to have fixed the fan problem.

I have attached the dmesg dump of 4.0.0-rc4-g06e5801 with patches from #75 and #78.

After bootup, I got my CPU cores fully loaded, waited for their temperature to hit ~50°C, and dumped the sensor readings to *-warm.log. I basically did:
sensors > prash-`uname -r`-warm.log && acpi -tc >> prash-`uname -r`-warm.log

I then killed the load processes, waited until the CPU temperature reached ~40°C, and dumped the readings to *-cool.log. Then I waited until the CPU cooled further, and the fan stopped completely. The sensor readings for that are in *-cold.log.

I took a dmesg dump after all the above steps, so you can see all the thermal debug messages.

I then rebooted to my LTS kernel, and got the sensor readings as above. I have attached that set too.

It looks like the fan control behaves well now.
Thank you!

PS: To anyone else who wants to test it on their own machine, you can download my build from http://www39.zippyshare.com/v/RRB3uKle/file.html
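
For anyone repeating the warm/cool/cold measurements, the one-liner above can be wrapped in a small helper. This is only a sketch: the function name and the prefix argument are invented here, and it assumes the same sensors and acpi tools are installed.

```shell
#!/bin/sh
# Sketch: save one labelled snapshot of the sensor readings.
snapshot_thermal() {
    # $1 is a label such as warm, cool or cold.
    # $2 is an optional file name prefix (hypothetical, defaults to "thermal").
    label="$1"
    out="${2:-thermal}-$(uname -r)-$label.log"
    # Same two commands as the one-liner: sensors first, then acpi -tc appended.
    sensors > "$out" && acpi -tc >> "$out" && echo "$out"
}

# Usage: snapshot_thermal warm   (under load)
#        snapshot_thermal cool   (after killing the load)
#        snapshot_thermal cold   (once the fan has stopped)
```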
Comment 80 amish 2015-03-20 07:56:30 UTC
Thanks prash for lots of debugging and patience.

And thanks Rui for patches!

I tried the kernel uploaded by prash (from above post) and it works fine. Checked 3-4 times by rebooting.

One strange thing I noticed: when I was running "sensors" every second (with the system idle), it mostly showed 47 deg C, but for one second it suddenly jumped to 58 deg C, and the next second it was back to 47 deg C.

But Fan was normal at that time.

Also, once (just after boot) I noticed sensors showing a temperature of around 75 deg C, but the fan was normal and only started running faster after 4-5 seconds.

I am not sure if this is normal behaviour. But I did not see it again.

At least the issue of "Fan running at Full speed" is gone.

Fan speed goes up and down as expected based on load.
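
If someone wants to catch short spikes like the 47 to 58 deg C jump without watching the screen, something along these lines might help. It is a rough sketch: it reads one temperature per line in millidegrees Celsius (the unit used by the thermal_zone temp files in sysfs), the function name is made up, and thermal_zone0 in the usage line is only an assumption about which zone is relevant. The sysfs value may also come from a different sensor than the one "sensors" reports.

```shell
#!/bin/sh
# Sketch: report sudden jumps between consecutive temperature readings.
report_jumps() {
    # $1 is the minimum change (in millidegrees C) worth reporting.
    threshold="${1:-10000}"
    prev=""
    while read -r cur; do
        if [ -n "$prev" ]; then
            delta=$((cur - prev))
            # Use the absolute value so drops are reported as well as rises.
            [ "$delta" -lt 0 ] && delta=$((0 - delta))
            [ "$delta" -ge "$threshold" ] && echo "jump: $prev -> $cur"
        fi
        prev="$cur"
    done
}

# Usage:
#   while sleep 1; do cat /sys/class/thermal/thermal_zone0/temp; done | report_jumps 10000
```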
Comment 81 Zhang Rui 2015-03-24 07:14:37 UTC
Patches to fix the problem sent out.
https://patchwork.kernel.org/patch/6077231/
https://patchwork.kernel.org/patch/6077241/
https://patchwork.kernel.org/patch/6077251/

Mark the bug as Resolved.
Will close the bug once the patches are merged into the upstream kernel.
Comment 82 Zhang Rui 2015-04-08 13:00:21 UTC
Hi, guys,

please help check if the patches at comment #183/#184/#185 in bug #78201 work for you or not.
As there are some functional changes, I need to make sure they have been tested before sending upstream.
Comment 83 prash 2015-04-08 13:04:44 UTC
Hi Zhang Rui,

Do you want me to apply these patches after the previous patches, or do you want me to apply them on a fresh checkout? I think you mean the latter, but I'd rather be safe than sorry.
Comment 84 Zhang Rui 2015-04-08 13:06:18 UTC
Yes, the latter. Please apply them on a vanilla kernel, preferably 4.0-rc.
Comment 85 prash 2015-04-08 13:43:27 UTC
I applied the patches to a clean checkout of 4.0.0-rc5-gbc465aa. I performed some basic CPU load testing. The kernel behaves just the way it did with the previous set of patches. Looks like everything is in order.
Comment 86 Zhang Rui 2015-04-08 13:59:03 UTC
Great to know. Thanks.
I will resend the patches tomorrow, after the patches have been tested by others.
Comment 87 Manuel Krause 2015-06-15 15:11:51 UTC
The code fix has still not been sent to the kernel. This should be marked as REOPENED until someone really does send in the patches and they're accepted.

*ping* @ Rui...
Comment 88 Aaron Lu 2015-08-24 05:27:10 UTC
Rui,

Are these patches merged?
Comment 89 Zhang Rui 2015-09-01 05:28:49 UTC
Yu will take over the patches and push them upstream.
Comment 91 Len Brown 2015-09-29 14:47:40 UTC
same cause as bug 91411 - duplicate.

*** This bug has been marked as a duplicate of bug 91411 ***
Comment 92 amish 2015-09-30 02:27:54 UTC
This bug is not an exact duplicate of #91411.

In this bug, the fan runs at full speed right after system boot (at the same stage) and never slows down.

In the other bug, it runs at full speed only after suspend (indicating that it runs normally until the system is suspended).

But maybe the cause of the issue is the same.
Comment 93 Chen Yu 2015-10-22 08:11:50 UTC
We marked this one as a duplicate because we've sent out a series of patches to fix the fan problem, and one of them will fix your boot-up problem.
Please refer to https://bugzilla.kernel.org/show_bug.cgi?id=78201 for the latest patches, thanks!
Comment 94 Manuel Krause 2015-11-06 19:54:04 UTC
The most recent three patches are for kernel 4.3.0:
https://patchwork.kernel.org/patch/7525501/
https://patchwork.kernel.org/patch/7525491/
https://patchwork.kernel.org/patch/7525431/

and they work fine.

Manuel, 
from BUG 78201