Fan runs at full speed on certain HP/Compaq laptops after upgrading to 3.18.2 or higher. Known fixes: downgrade to 3.18.1 or any earlier kernel version. Main information thread: https://bbs.archlinux.org/viewtopic.php?id=192255
This is not a duplicate of https://bugzilla.kernel.org/show_bug.cgi?id=78201, as 3.18.5 does not resolve the problem.
Created attachment 165411 [details] amish-3.17.6
Created attachment 165421 [details] amish-3.18.2
Created attachment 165431 [details] marcoc-3.18.2
Created attachment 165441 [details] ristic-3.17.6
Created attachment 165451 [details] ristic-3.18.1
Created attachment 165461 [details] ristic-3.18.2
Created attachment 165471 [details] triple_star-3.14.x
Created attachment 165481 [details] triple_star-3.18.2
Created attachment 165491 [details] ubone-lts
Created attachment 165501 [details] ubone-3.18.2
I briefly looked at the topic in the Arch Linux forum. Is it the case that the problem first appears in v3.18.2 and the last known good kernel is v3.18.1?
Well, at least for me, it's not that easy. I've tried (unsuccessfully) to bisect the issue, and it seems the problem is in 3.18.0 as well. But since some boots are fine even on 3.18.5, it's hard to tell. :-(
It looks like 3.18.1 is good for @amish and then 3.18.2 is bad. For @ristic, it seems the issue starts with 3.18.1 (so maybe, the same as Radek here). I think for the others, we'd need to ask them to try 3.18.1 (and maybe 3.18.0) to see where it kicks in. Does that sound reasonable?
On 3.18.1 the fan runs at full speed for me; 3.17.6-1-ARCH is normal. ProBook 4510s.
Created attachment 166081 [details] output of "acpi -tc" on 3.14.32 (LTS) kernel run on HP ProBook 4410s
Created attachment 166091 [details] output of "acpi -tc" on 3.18.6 (release) kernel run on HP ProBook 4410s
As indicated by my attachments, the problem remains in 3.18.6; all the fan control bits are set to '1' on bootup.
Created attachment 166241 [details] output of "acpi -tc" on 3.19 kernel This is what I get when I execute: % (uname -srvmo && acpi -tc) > prash-`uname -r`.log on my HP ProBook 4410s. I have tested all the releases from 3.18.1 to 3.19, and in each, the fan runs at its maximum speed, right from bootup. On my system, Thermal 0, which corresponds to FDTZ and thermal_zone5, refers to the fan speed.
*PING* *PING* *PING* @ Zhang Rui
Same issue on HP ProBook 4510s. PS: I had already reported it in the Arch forum; just adding weight to the bug.
Created attachment 166411 [details] output of "acpi -tc" on 3.18.0 kernel This is what I get when I execute: % (uname -srvmo; acpi -tc) > prash-`uname -r`.log on my HP ProBook 4410s. As indicated by Thermal 0, and all the Cooling channels, the fan is running at its maximum speed.
I just wanted to know why the status is still NEEDINFO even after so many responses. I think developers normally look at the status and skip tickets marked NEEDINFO, assuming they are still awaiting a response from the reporter or some user. By the way, I do not know what kind of info is needed apart from the kernel version and hardware make.
(In reply to amish from comment #22) > Just wanted to know why is status still NEEDINFO even after so many > responses? > > I think developers normally see status and do not look into tickets marked > NEEDINFO thinking that it is still awaiting response from reporter or some > user? I should have said that I (the reporter) am not affected by this problem, but I am reporting on other people's behalf.
I know that you are the reporter and not affected by it (thank you for reporting it anyway). What I mean (politely) is that the bug should at least be moved to "CONFIRMED" status now. Also, there has been no response from Zhang Rui, to whom this bug is assigned, so I just wanted to know whether he (or some other kernel developer) is aware of it. 3.19 has also been released and the bug exists there too (as per reports in the Arch forum). I am seeking urgent attention because on Arch Linux, sticking to an older kernel is not recommended; you must stay up to date with all packages to avoid future issues caused by outdated packages.
This is not a support forum. If you have a service level agreement with your supplier then talk to them. Bugs in bugzilla get dealt with as and when someone feels like fixing one.
One more similar-looking bug, which also reports that the issue occurs from 3.18.x onward but not with 3.17: https://bugzilla.kernel.org/show_bug.cgi?id=91411 Someone has identified the bad commit with git bisect (I have no idea what it is!)
See: https://bugzilla.kernel.org/show_bug.cgi?id=91411#c10 Matthias bisected the first bad commit in the similar issue. You could check whether it fixes your problem.
So v3.17 is OK; does v3.18 already have this issue, or only the v3.18.x kernels? Also, since multiple people seem to be affected, please answer the question above and attach your acpidump: # acpidump > acpidump.txt
Created attachment 166931 [details] acpidump kernel 3.17.6
Created attachment 166941 [details] acpidump kernel 3.18.6
Yes, 3.17 is OK, and yes, it starts with 3.18. For me it starts from 3.18.2, but for others from 3.18.1. I did not test 3.18.1 a second time because it had a graphics issue. But for most people it appears to start right from the beginning of 3.18. Please see the attachments above for the acpidump on kernels 3.17.6 and 3.18.6.
Since different people start to have the problem from different kernel versions, the root cause may be different; please do a git bisect to find the offending commit. For people whose problem starts in v3.18.x while v3.18 works, the bisect should not take much time, because the problem starts between two stable releases, i.e. v3.18.x works while v3.18.x+1 doesn't. Please use the stable git tree (branch linux-3.18.y) to do the bisect: https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux-stable.git The bad commit mentioned in comment #26 should not be your problem, since it is already in v3.18 and v3.18 works for you. BTW, the acpidump is the same no matter which kernel version is in use, so one upload is enough :-)
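Why a bisect between two stable releases is cheap: git bisect is a binary search over the commit range, so n candidate commits need at most about log2(n) test builds. The sketch below is a hypothetical Python model of that search, not git's implementation; `commits` and the `is_bad` callback (standing in for "build, boot, and watch the fan") are illustrative assumptions. With git itself, the same loop is driven by `git bisect start`, `git bisect good <rev>` and `git bisect bad <rev>`.

```python
from typing import Callable, Sequence, Tuple


def first_bad(commits: Sequence[str], is_bad: Callable[[str], bool]) -> Tuple[str, int]:
    """Binary-search for the first bad commit; return it and the number of tests.

    Assumes `commits` is ordered oldest to newest, the last entry is known bad,
    and the regression is a single step (good ... good, bad ... bad).
    """
    lo, hi = 0, len(commits) - 1  # invariant: commits[hi] is bad
    tests = 0
    while lo < hi:
        mid = (lo + hi) // 2
        tests += 1
        if is_bad(commits[mid]):  # hypothetical "build, boot, watch the fan" step
            hi = mid
        else:
            lo = mid + 1
    return commits[hi], tests


def max_tests(n: int) -> int:
    """Worst-case number of test builds for n candidate commits (ceil(log2 n))."""
    return max(n - 1, 0).bit_length()
```

For example, a hypothetical range of 100 commits between two stable tags needs at most `max_tests(100) == 7` test builds, whereas bisecting from v3.17 to v3.18 covers a whole merge window and therefore far more builds.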
(In reply to amish from comment #31) > For me it starts from 3.18.2 but for others 3.18.1. I did not test twice > with 3.18.1 because it had graphical issue. Notwithstanding the graphics issue, could you please retest with 3.18.0 and 3.18.1? I'm asking because it's very likely that you and I have identical system boards (see http://h20628.www2.hp.com/km-ext/kmcsdirect/emr_na-c01905586-6.pdf). I suspect that fan control seemed to work for you only temporarily, as it has, for many of us, with one or more kernel versions. Therefore, I also suspect that starting to bisect *after* 3.18.0 will prove futile for you.
I have no experience with kernel compilation or git. :( Just learning that would take me a long time, and unfortunately I do not have much time due to other commitments. I really feel sorry that I cannot help more. I will re-test with 3.18.1 and report back; this time I will reboot multiple times just to be sure. If someone (a trusted user) can compile and upload kernel 3.18.0 for 64-bit Arch Linux, I can definitely test that too. Thanks.
(In reply to amish from comment #34) > If someone (trusted user) can compile and upload kernel 3.18.0 for Arch > Linux - 64bit, I can definitely test that too. Claire Farron has already uploaded 3.18.0. See https://bbs.archlinux.org/viewtopic.php?pid=1501688#p1501688.
OK, so I tested again with 3.18.1, and the PROBLEM EXISTS in 3.18.1 as well. I don't know why it did not occur last time, or whether I overlooked something.

Just to reiterate what I observed:

With 3.18 series
-----------------
During GRUB boot the fan speed is normal, but halfway through booting it starts running at full speed and always stays at FULL SPEED, even when CPU usage is negligible.
acpi -tc gives:
Thermal 0: ok, 90.0 degrees C
The 90.0 degrees may be a hint.

With 3.17 series
-----------------
The fan is normal until KDE starts loading. It then goes to full speed while KDE is still loading, and after KDE has loaded, the fan is back to normal.
acpi -tc gives:
Thermal 0: ok, 55.0 degrees C

So only Thermal 0 differs; the rest of the temperatures are similar for both series. I believe the behavior would be the same with 3.18.0.
So I tried 3.18.0 too (compiled by Claire Farron, from the link above): same issue and same symptoms as 3.18.1. :( Surprisingly, acpi -tc always gives:
Thermal 0: ok, 90.0 degrees C
and the value did not change even after 5 minutes.
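Comparing `acpi -tc` captures from two kernels by hand is error-prone once there are several zones. A minimal Python sketch, assuming the "Thermal N: status, T degrees C" line format quoted above (the function names are hypothetical), parses two captures and reports only the zones whose reading differs:

```python
import re
from typing import Dict, Tuple

# Matches temperature lines such as "Thermal 0: ok, 90.0 degrees C"
# (format as quoted in this thread; other acpi output lines are ignored).
THERMAL_RE = re.compile(r"^Thermal (\d+): (\w+), (-?\d+(?:\.\d+)?) degrees C")


def parse_acpi_thermal(text: str) -> Dict[int, float]:
    """Map each 'Thermal N' index to its temperature in degrees C.

    Lines that do not match, such as the 'Cooling N: ...' lines added by
    'acpi -tc', are skipped.
    """
    readings: Dict[int, float] = {}
    for line in text.splitlines():
        match = THERMAL_RE.match(line.strip())
        if match:
            readings[int(match.group(1))] = float(match.group(3))
    return readings


def changed_zones(old: Dict[int, float], new: Dict[int, float]) -> Dict[int, Tuple[float, float]]:
    """Return {index: (old, new)} for zones present in both captures whose reading differs."""
    return {i: (old[i], new[i]) for i in sorted(old.keys() & new.keys()) if old[i] != new[i]}
```

Fed the 3.17 and 3.18 captures described above, `changed_zones` would return `{0: (55.0, 90.0)}`, matching the observation that only Thermal 0 differs.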
See https://bugzilla.kernel.org/show_bug.cgi?id=93301 for the bug report I've just created. It may (or may not) be a related issue.
BTW, one of the Thermal X values that stays constant for you (whether at a high or a low level) most probably represents the fan speed, as Thermal 0 does on my HP/Compaq 6730b. These systems have no separate fan speed sensor apart from that "Thermal 0" value. So that value not changing for you is evidence of the error (no fan speed change), not, in general, the cause of the misbehaviour. Best regards, Manuel (from BUG 78201)
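Spotting the zone that "stays" is easier with repeated samples taken while the load changes. A minimal sketch, assuming a hypothetical `read()` callable that returns one snapshot of zone readings (for instance from each zone's `temp` file); it only flags zones that never moved, which on these HP models is a hint of the stuck fan-speed value, not proof:

```python
import time
from typing import Callable, Dict, List, Set

Snapshot = Dict[str, int]  # zone name -> reading (e.g. millidegrees C)


def sample(read: Callable[[], Snapshot], count: int, interval_s: float) -> List[Snapshot]:
    """Call read() `count` times, sleeping `interval_s` seconds between calls."""
    snapshots: List[Snapshot] = []
    for i in range(count):
        snapshots.append(read())
        if i + 1 < count:
            time.sleep(interval_s)
    return snapshots


def constant_zones(snapshots: List[Snapshot]) -> Set[str]:
    """Zones present in every snapshot whose reading never changed."""
    if not snapshots:
        return set()
    common = set.intersection(*(set(s) for s in snapshots))
    return {z for z in common if len({s[z] for s in snapshots}) == 1}
```

Taking samples while loading the CPU and then letting it idle makes the result meaningful: a zone that stays fixed while the CPU zone rises and falls is the suspect.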
please attach the output of "cat /sys/class/thermal/thermal_zone0/device/path"
I suspect this is the same problem as in bug 93301, because:
1. the cause of the problem is that the temperature does not change after boot;
2. they are all HP platforms.
So please check whether reverting commit 6ab3430129e258ea31dd214adf1c760dfafde67a, or building your kernel with "git checkout 6ab3430129e258ea31dd214adf1c760dfafde67a", fixes the problem.
@Zhang Rui, I ran this:

% for i in /sys/class/thermal/thermal_zone*; do; echo -n $i "-- "; cat $i/device/path; done
/sys/class/thermal/thermal_zone0 -- \_TZ_.GFXZ
/sys/class/thermal/thermal_zone1 -- \_TZ_.DTSZ
/sys/class/thermal/thermal_zone2 -- \_TZ_.CPUZ
/sys/class/thermal/thermal_zone3 -- \_TZ_.SKNZ
/sys/class/thermal/thermal_zone4 -- \_TZ_.BATZ
/sys/class/thermal/thermal_zone5 -- \_TZ_.FDTZ
% for i in /sys/class/thermal/thermal_zone*; do; echo -n $i " -- "; cat $i/temp; done
/sys/class/thermal/thermal_zone0 -- 16000
/sys/class/thermal/thermal_zone1 -- 43000
/sys/class/thermal/thermal_zone2 -- 41000
/sys/class/thermal/thermal_zone3 -- 44000
/sys/class/thermal/thermal_zone4 -- 24800
/sys/class/thermal/thermal_zone5 -- 30000

Please note that thermal_zone0 here corresponds to "Thermal 5" as reported by "acpi -t"; the numbering is reversed. Moreover, the GFXZ readout has never been meaningful for me: it always reports 16°C (16000). It takes me quite a few hours to compile the kernel on my old laptop, so I'll wait for someone with a faster machine to try it out first. If no one does it in the next few days, I'll do it myself.
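The two shell loops above can be folded into one table that also shows the matching `acpi -t` index. A minimal sketch with hypothetical function names; the reversed numbering it applies is an ASSUMPTION taken from the single ProBook 4410s observation above (thermal_zone0 = Thermal 5 of 6 zones), not documented behaviour:

```python
from pathlib import Path
from typing import List, Optional, Tuple


def _zone_number(zone: Path) -> int:
    # "thermal_zone5" -> 5; sort numerically so thermal_zone10 follows thermal_zone9
    return int(zone.name[len("thermal_zone"):])


def read_zones(root: str = "/sys/class/thermal") -> List[Tuple[str, str, Optional[int]]]:
    """Return (zone name, ACPI path, temperature in millidegrees C) for every zone."""
    zones = []
    for zone in sorted(Path(root).glob("thermal_zone*"), key=_zone_number):
        try:
            acpi_path = (zone / "device" / "path").read_text().strip()
        except OSError:
            acpi_path = "?"  # zone without an ACPI device path
        try:
            temp: Optional[int] = int((zone / "temp").read_text().strip())
        except (OSError, ValueError):
            temp = None  # some zones refuse reads
        zones.append((zone.name, acpi_path, temp))
    return zones


def acpi_thermal_index(zone_index: int, zone_count: int) -> int:
    """ASSUMPTION: 'acpi -t' lists zones in reverse sysfs order,
    as observed on one HP ProBook 4410s in this thread."""
    return zone_count - 1 - zone_index
```

Usage: `for i, (name, path, temp) in enumerate(zones := read_zones()): print(name, path, temp, "Thermal", acpi_thermal_index(i, len(zones)))`. Check the mapping against `acpi -t` on your own machine before relying on it.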
Created attachment 169261 [details] prash-class-thermal-3.14.34.txt output of grep -s . /sys/class/thermal/*/*
Created attachment 169271 [details] prash-class-thermal-device-path-3.14.34.txt Linux 3.14.34 output of grep . /sys/class/thermal/*/device/path
Created attachment 169281 [details] prash-class-thermal-3.18.6.txt Kernel 3.18.6 output of grep -s . /sys/class/thermal/*/*
Created attachment 169291 [details] prash-class-thermal-device-path-3.18.6.txt Kernel 3.18.6 output of grep . /sys/class/thermal/*/device/path
@Zhang Rui, I have attached the command outputs that you had asked for on the Archlinux BBS. My system:

% inxi -F
System:    Host: Prash5 Kernel: 3.14.34-1-lts x86_64 (64 bit) Desktop: KDE 5 Distro: Arch Linux
Machine:   System: Hewlett-Packard product: HP ProBook 4410s v: F.20
           Mobo: Hewlett-Packard model: 3072 v: KBC Version 24.0F
           Bios: Hewlett-Packard v: 68PZI Ver. F.20 date: 12/09/2011
CPU:       Dual core Intel Core2 Duo T6570 (-MCP-) cache: 2048 KB
           clock speeds: max: 2101 MHz 1: 1200 MHz 2: 1200 MHz
Graphics:  Card: Intel Mobile 4 Series Integrated Graphics Controller
           Display Server: N/A driver: intel Resolution: 104x39
Audio:     Card Intel 82801I (ICH9 Family) HD Audio Controller driver: snd_hda_intel
           Sound: Advanced Linux Sound Architecture v: k3.14.34-1-lts
Network:   Card-1: Intel PRO/Wireless 5100 AGN [Shiloh] Network Connection driver: iwlwifi
           IF: wls1 state: up mac: 00:22:fa:f7:2a:34
           Card-2: Marvell 88E8072 PCI-E Gigabit Ethernet Controller driver: sky2
           IF: ens5 state: down mac: 00:25:b3:5d:c8:70
Drives:    HDD Total Size: 120.0GB (84.5% used)
           ID-1: /dev/sda model: Samsung_SSD_840 size: 120.0GB
Partition: ID-1: / size: 109G used: 95G (92%) fs: ext4 dev: /dev/sda4
           ID-2: /boot size: 976M used: 94M (11%) fs: ext4 dev: /dev/sda3
Sensors:   System Temperatures: cpu: 38.0C mobo: N/A
           Fan Speeds (in rpm): cpu: N/A
Info:      Processes: 170 Uptime: 3 min Memory: 1105.0/5874.1MB
           Init: systemd Client: Shell (zsh) inxi: 2.2.19

> 5. in 3.18 kernel, when the problem is reproduced, please confirm whether the
> temperature changes or not if you change the workload manually.

On 3.18.6, I ensured that both cores of my processor were 100% utilized, waited for a minute, and saw "sensors" report that my cores were running at ~50°C. I then killed the CPU intensive tasks, and watched the temperature go back to ~35°C. The fan was running at its max speed ever since bootup, and it remained that way no matter how I made the CPU temperature rise or fall.
Created attachment 169301 [details] prash-acpidump
Created attachment 169461 [details] amish-acpidump Output of acpidump
Created attachment 169471 [details] amish-class-thermal-3.17.6-1-ARCH.txt Kernel 3.17.6 grep -s . /sys/class/thermal/*/*
Created attachment 169481 [details] amish-class-thermal-3.18.6-1-ARCH.txt Kernel 3.17.6 grep -s . /sys/class/thermal/*/*
(In reply to amish from comment #51) > Created attachment 169481 [details] > amish-class-thermal-3.18.6-1-ARCH.txt > > Kernel 3.17.6 > grep -s . /sys/class/thermal/*/* Please read as Kernel 3.18.6.
Created attachment 169491 [details] amish-class-thermal-device-path-3.17.6-1-ARCH.txt Kernel 3.17.6 grep . /sys/class/thermal/*/device/path
Created attachment 169501 [details] amish-class-thermal-device-path-3.18.6-1-ARCH.txt Kernel 3.18.6 grep . /sys/class/thermal/*/device/path Please note there is a big difference in output size compared to 3.17.6: the 3.17.6 output is 1280 bytes, the 3.18.6 output is 520 bytes.
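The size difference suggests that fewer entries report a device path under 3.18.6, but byte counts alone do not say which ones. A minimal sketch (hypothetical function names) that parses two `grep . /sys/class/thermal/*/device/path` outputs and lists the ACPI paths present under the old kernel but missing under the new one; it compares by ACPI path rather than by entry name because, as an assumption, thermal_zoneN/cooling_deviceN numbering may shift between kernels:

```python
from typing import Dict, List


def parse_device_paths(text: str) -> Dict[str, str]:
    """Parse 'grep . /sys/class/thermal/*/device/path' output.

    With several files, grep prefixes each match with 'FILE:', e.g.
    '/sys/class/thermal/thermal_zone0/device/path:\\_TZ_.GFXZ'.
    Returns {sysfs entry name: ACPI path}.
    """
    entries: Dict[str, str] = {}
    for line in text.splitlines():
        file_name, sep, value = line.partition(":")
        if not sep:
            continue  # not a 'FILE:match' line
        parts = file_name.split("/")
        # .../<entry>/device/path -> <entry>, e.g. "thermal_zone0"
        entry = parts[-3] if len(parts) >= 3 else file_name
        entries[entry] = value.strip()
    return entries


def missing_paths(old_text: str, new_text: str) -> List[str]:
    """ACPI paths listed under the old kernel but absent under the new one."""
    old = set(parse_device_paths(old_text).values())
    new = set(parse_device_paths(new_text).values())
    return sorted(old - new)
```

Running `missing_paths` on the contents of the 3.17.6 and 3.18.6 attachments would show exactly which thermal zones or cooling devices disappeared.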
Uploaded the files as per Zhang Rui's post here: https://bbs.archlinux.org/viewtopic.php?pid=1507923#p1507923
NOTE: My system and prash's should be more or less similar: mine is an HP ProBook 4510s and his is a 4410s.

Questions 2 to 4: files attached above.

Question 1:
% inxi -F (HDD partition info removed)
System:    Host: amish Kernel: 3.17.6-1-ARCH x86_64 (64 bit) Desktop: KDE 5 Distro: Arch Linux
Machine:   System: Hewlett-Packard product: HP ProBook 4510s v: F.12
           Mobo: Hewlett-Packard model: 3072 v: KBC Version 24.0D
           Bios: Hewlett-Packard v: 68PZI Ver. F.12 date: 11/30/2009
CPU:       Dual core Intel Core2 Duo T6570 (-MCP-) cache: 2048 KB
           clock speeds: max: 2101 MHz 1: 1600 MHz 2: 1600 MHz
Graphics:  Card: Intel Mobile 4 Series Integrated Graphics Controller
           Display Server: X.Org 1.17.1 driver: intel Resolution: 1366x768@59.64hz
           GLX Renderer: Mesa DRI Mobile Intel GM45 Express GLX Version: 2.1 Mesa 10.4.5
Audio:     Card Intel 82801I (ICH9 Family) HD Audio Controller driver: snd_hda_intel
           Sound: Advanced Linux Sound Architecture v: k3.17.6-1-ARCH
Network:   Card-1: Intel PRO/Wireless 5100 AGN [Shiloh] Network Connection driver: iwlwifi
           IF: wls1 state: down mac: xxx
           Card-2: Marvell 88E8072 PCI-E Gigabit Ethernet Controller driver: sky2
           IF: ens5 state: up speed: 100 Mbps duplex: full mac: xxx
Sensors:   System Temperatures: cpu: 47.0C mobo: N/A
           Fan Speeds (in rpm): cpu: N/A
Info:      Processes: 191 Uptime: 2:10 Memory: 2085.5/3858.0MB
           Client: Shell (zsh) inxi: 2.2.19

Question 5: in 3.18 kernel, when the problem is reproduced, please confirm whether the temperature changes or not if you change the workload manually.
Same observation as prash: CPU temperatures increase (on adding load) and decrease (on reducing load), but the fan is always at full speed.
Please apply the patches at comments #142 and #143 in bug #78201 and see whether the problem still exists. If it does, please attach the output of "grep -s . /sys/class/thermal/*/*" once the bug is reproduced.
If the two patches from comment 56 alone don't cure the issue, you can also try an additional debug patch from Zhang Rui in BUG 91411 (comment https://bugzilla.kernel.org/show_bug.cgi?id=91411#c66) that may be of benefit; direct link to the patch: https://bugzilla.kernel.org/attachment.cgi?id=169941 Thank you in advance for reporting back!
Please apply the patches at https://bugzilla.kernel.org/show_bug.cgi?id=78201#c150 and see whether the problem still exists. If it does, please run echo 'module thermal_sys +fp' > /sys/kernel/debug/dynamic_debug/control and attach the dmesg output after the problem is reproduced.
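The echo command above writes a single query into the dynamic debug control file. For reference, a minimal Python equivalent, assuming root privileges and debugfs mounted at /sys/kernel/debug (the function name `enable_dyndbg` is hypothetical). In the query, `p` enables the pr_debug()/dev_dbg() call sites matched by `module thermal_sys`, and `f` prefixes each message with its function name, which is why the resulting dmesg lines start with names such as `update_temperature:`.

```python
DYNDBG_CONTROL = "/sys/kernel/debug/dynamic_debug/control"


def enable_dyndbg(query: str = "module thermal_sys +fp", control: str = DYNDBG_CONTROL) -> None:
    """Write one dynamic-debug query, like: echo '<query>' > <control>.

    Needs root and a mounted debugfs; the kernel rejects malformed
    queries with an error on write.
    """
    with open(control, "w") as f:
        f.write(query + "\n")
```

Note that this only affects the running kernel; messages emitted during early boot need the boot-time options discussed later in this thread.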
Created attachment 170661 [details] prash-class-thermal-4.0.0-rc3-g9eccca0 Reporting for the latest linux-stable.git with the patches referred to at comment #56. The bug is seen here too; the fan stays running at max speed, right from bootup. I will report the status of the other patches over the next couple of days, as I slowly compile the kernels. For those of you who want to test this release, you can find it at http://www41.zippyshare.com/v/BGHdvPMI/file.html
@Zhang Rui, I tried applying the patches from comment #58, but there seem to be some conflicts in the patch set: it's asking me whether I want to revert previously applied patches. The offending file was 0004-Thermal-make-thermal_zone_device_update-atomic.patch. For the record, I also tried applying the patch set to a fresh checkout, without the patches from comments #56 and #57. Could you please generate a fresh patch set for me?
Created attachment 170681 [details] prash-class-thermal-4.0.0-rc3-g9eccca0
Output after applying the patch set from https://bugzilla.kernel.org/show_bug.cgi?id=78201#c150. Since, per https://bugzilla.kernel.org/show_bug.cgi?id=78201#c151, "0001-Debug-patch-to-sync-thermal-zone-update.patch" has been superseded by 0004, I omitted that file from the patches I applied.

Current status: the same problem as before; the fan runs at max speed.

Per comment #58, I also ran
echo 'module thermal_sys +fp' > /sys/kernel/debug/dynamic_debug/control
(as root). It produced the following in dmesg:

---- begin paste ----
[Mar15 14:09] update_temperature: thermal thermal_zone2: last_temperature=38000, current_temperature=30000
[ +0.000007] thermal_zone_trip_update: thermal thermal_zone2: Trip1[type=1,temp=105000]:trend=2,throttle=0
[ +0.000005] get_target_state: thermal cooling_device1: cur_state=0
[ +0.000003] thermal_zone_trip_update: thermal cooling_device1: old_target=-1, target=-1
[ +0.000003] get_target_state: thermal cooling_device0: cur_state=0
[ +0.000002] thermal_zone_trip_update: thermal cooling_device0: old_target=-1, target=-1
[ +0.000005] thermal_zone_trip_update: thermal thermal_zone2: Trip2[type=0,temp=84000]:trend=2,throttle=0
[ +0.000035] get_target_state: thermal cooling_device9: cur_state=0
[ +0.000003] thermal_zone_trip_update: thermal cooling_device9: old_target=-1, target=-1
[ +0.000003] thermal_zone_trip_update: thermal thermal_zone2: Trip3[type=0,temp=74000]:trend=2,throttle=0
[ +0.000031] get_target_state: thermal cooling_device10: cur_state=0
[ +0.000003] thermal_zone_trip_update: thermal cooling_device10: old_target=-1, target=-1
[ +0.000004] thermal_zone_trip_update: thermal thermal_zone2: Trip4[type=0,temp=62000]:trend=2,throttle=0
[ +0.000030] get_target_state: thermal cooling_device11: cur_state=0
[ +0.000002] thermal_zone_trip_update: thermal cooling_device11: old_target=-1, target=-1
[ +0.000004] thermal_zone_trip_update: thermal thermal_zone2: Trip5[type=0,temp=52000]:trend=2,throttle=0
[ +0.000029] get_target_state: thermal cooling_device12: cur_state=0
[ +0.000003] thermal_zone_trip_update: thermal cooling_device12: old_target=-1, target=-1
[ +0.000003] thermal_zone_trip_update: thermal thermal_zone2: Trip6[type=0,temp=44000]:trend=2,throttle=0
[ +0.000304] get_target_state: thermal cooling_device13: cur_state=0
[ +0.000003] thermal_zone_trip_update: thermal cooling_device13: old_target=0, target=-1
[ +0.000003] thermal_cdev_update: thermal cooling_device13: zone2->target=18446744073709551615
[ +0.000004] thermal_cdev_update: thermal cooling_device13: set to state 0
[ +0.000004] thermal_zone_trip_update: thermal thermal_zone2: Trip7[type=0,temp=30000]:trend=2,throttle=1
[ +0.000031] get_target_state: thermal cooling_device14: cur_state=1
[ +0.000002] thermal_zone_trip_update: thermal cooling_device14: old_target=1, target=0
[ +0.000003] thermal_cdev_update: thermal cooling_device14: zone2->target=0
[ +0.003391] thermal_cdev_update: thermal cooling_device14: set to state 0
[ +0.001679] update_temperature: thermal thermal_zone2: last_temperature=30000, current_temperature=30000
[ +0.000006] thermal_zone_trip_update: thermal thermal_zone2: Trip1[type=1,temp=105000]:trend=2,throttle=0
[ +0.000005] get_target_state: thermal cooling_device1: cur_state=0
[ +0.000003] thermal_zone_trip_update: thermal cooling_device1: old_target=-1, target=-1
[ +0.000004] get_target_state: thermal cooling_device0: cur_state=0
[ +0.000003] thermal_zone_trip_update: thermal cooling_device0: old_target=-1, target=-1
[ +0.000004] thermal_zone_trip_update: thermal thermal_zone2: Trip2[type=0,temp=84000]:trend=0,throttle=0
[ +0.000042] get_target_state: thermal cooling_device9: cur_state=0
[ +0.000004] thermal_zone_trip_update: thermal cooling_device9: old_target=-1, target=-1
[ +0.000004] thermal_zone_trip_update: thermal thermal_zone2: Trip3[type=0,temp=74000]:trend=0,throttle=0
[ +0.000040] get_target_state: thermal cooling_device10: cur_state=0
[ +0.000003] thermal_zone_trip_update: thermal cooling_device10: old_target=-1, target=-1
[ +0.000005] thermal_zone_trip_update: thermal thermal_zone2: Trip4[type=0,temp=62000]:trend=0,throttle=0
[ +0.000038] get_target_state: thermal cooling_device11: cur_state=0
[ +0.000004] thermal_zone_trip_update: thermal cooling_device11: old_target=-1, target=-1
[ +0.000004] thermal_zone_trip_update: thermal thermal_zone2: Trip5[type=0,temp=52000]:trend=0,throttle=0
[ +0.000038] get_target_state: thermal cooling_device12: cur_state=0
[ +0.000004] thermal_zone_trip_update: thermal cooling_device12: old_target=-1, target=-1
[ +0.000004] thermal_zone_trip_update: thermal thermal_zone2: Trip6[type=0,temp=44000]:trend=0,throttle=0
[ +0.000038] get_target_state: thermal cooling_device13: cur_state=0
[ +0.000004] thermal_zone_trip_update: thermal cooling_device13: old_target=-1, target=-1
[ +0.000004] thermal_zone_trip_update: thermal thermal_zone2: Trip7[type=0,temp=37000]:trend=0,throttle=0
[ +0.000037] get_target_state: thermal cooling_device14: cur_state=0
[ +0.000004] thermal_zone_trip_update: thermal cooling_device14: old_target=0, target=0
---- end paste ----

For anyone else who wants to test it, I have uploaded this build to http://www26.zippyshare.com/v/Eozr4aq0/file.html
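Longer thermal debug logs are easier to read once reduced to temperature updates, cooling-device targets and state changes. A minimal sketch (the regexes are written against the line formats in the paste above, and the function names are hypothetical). One detail: `zone2->target=18446744073709551615` is 2**64 - 1, which most likely is the same "no target" value that appears as `target=-1` elsewhere in the log, printed as an unsigned 64-bit number; `as_signed64` converts it back.

```python
import re
from typing import List, Tuple

TEMP_RE = re.compile(
    r"update_temperature: thermal (thermal_zone\d+): "
    r"last_temperature=(-?\d+), current_temperature=(-?\d+)"
)
TARGET_RE = re.compile(r"thermal_cdev_update: thermal (cooling_device\d+): zone\d+->target=(\d+)")
STATE_RE = re.compile(r"thermal_cdev_update: thermal (cooling_device\d+): set to state (\d+)")


def as_signed64(value: int) -> int:
    """Reinterpret an unsigned 64-bit value as signed (2**64 - 1 -> -1)."""
    return value - (1 << 64) if value >= (1 << 63) else value


def summarize(log: str) -> Tuple[List[Tuple[str, int, int]], List[Tuple[str, int]], List[Tuple[str, int]]]:
    """Return (temperature updates, cooling-device targets, state changes) in log order."""
    temps: List[Tuple[str, int, int]] = []
    targets: List[Tuple[str, int]] = []
    states: List[Tuple[str, int]] = []
    for line in log.splitlines():
        m = TEMP_RE.search(line)
        if m:
            temps.append((m.group(1), int(m.group(2)), int(m.group(3))))
            continue
        m = TARGET_RE.search(line)
        if m:
            targets.append((m.group(1), as_signed64(int(m.group(2)))))
            continue
        m = STATE_RE.search(line)
        if m:
            states.append((m.group(1), int(m.group(2))))
    return temps, targets, states
```

Applied to the paste above, this shows thermal_zone2 dropping from 38000 to 30000 and both cooling_device13 and cooling_device14 being set to state 0. It only restructures what the log says; whether the hardware actually follows those states is exactly the open question in this bug.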
Please use the boot option module.dyndbg="module thermal_sys +fp" and attach the dmesg output after boot, with the patches applied.
Created attachment 170721 [details] prash-dmesg-4.0.0-rc3-g9eccca0.xz Output of dmesg on 4.0.0-rc3 with applied patches and boot option module.dyndbg="module thermal_sys +fp"
No, you should use module.dyndbg="module thermal_sys +fp", rather than "module.dyndbg=module thermal_sys +fp".
I did do that. And just to be sure, I did it again. I start my system with grub2 and systemd. Then I wondered whether the double quotes were clashing with one of them, so I tried single quotes. Apparently they are treated the same, and the dmesg output remains more or less the same each time; only later in the boot does the order in which the USB and graphics drivers get initialized change. Here's what I've tried:
https://imgur.com/41EnHxg -- double quotes
https://imgur.com/CD0zcgw -- single quotes
If I'm still doing something wrong, can you please let me know how I can get it right?
hmmm, please use the following instead as it works in bug #67101. Xodule.Xyndbg="module thermal_sys +fp" dyndbg="file thermal_core.c +fp; file step_wise.c +fp"
Created attachment 170751 [details] prash-class-thermal-4.0.0-rc3-g9eccca0 I have attached the dmesg output for two instances. For one of them, I passed "module.dyndbg...", and for the other "Xodule.Xyndbg...". I did it twice because I thought Xodule was a typo. Apparently, the output is similar in both cases. Anyhow, this time around there is more information in the dmesg output.
BTW, Rui has made new patches for the other bug, BUG 78201: https://bugzilla.kernel.org/show_bug.cgi?id=78201#c157. They do work for me, but as you know, our problem is a bit different. Mainly, I don't understand what Rui describes in BUG 67101 (https://bugzilla.kernel.org/show_bug.cgi?id=67101#c27): that the log shows the fan levels being reset and adjusted, but the change never reaches the hardware itself(?). Is the code missing the right cooling_device*? Maybe with the next log you could give an example of increasing load, temperature and fan speed, and then all three decreasing, on your system? Another BTW: with old grub, adding module.dyndbg="module thermal_sys +fp" dyndbg="file thermal_core.c +fp; file step_wise.c +fp" to the kernel command line gives me the wanted debugging output. Best regards and thank you for your time!
Hmm, this sounds like a grub bug? Can you please append module.dyndbg="module thermal_sys +fp" dyndbg="file thermal_core.c +fp; file step_wise.c +fp" to the kernel command line in the grub.cfg file directly, then reboot and see if we have any luck?
Created attachment 170911 [details] prash-dmesg-4.0.0-rc3-g9eccca0.tar.xz This time, I made the changes to grub.cfg. However, it still appears the same way in dmesg. In addition to that, as Manuel Krause suggested, I took the CPU temperature from 35°C (boot) to 50°C (heavy load) and back to 35°C. I got a dump of dmesg after the entire process. I have also attached the contents of /sys/class/thermal/*/device/path and /sys/class/thermal/*/*. I don't know if it is a grub bug, but I see additional output in dmesg, with get_target_state, and thermal_zone_trip_update. So maybe it's not a bug, and the kernel handles it the same in either case.
Created attachment 171061 [details] 0001-Thermal-do-thermal-zone-update-after-a-cooling-devic.patch please apply this patch on top and see if the problem still exists.
Sorry, I don't understand what you mean by "on top". Does it mean I should apply this patch *before* (top line in my build script) all the others or *after* (an addition to the other patches)?
you should apply it after the four patches have been applied.
Thanks. Got it. By the way, I have been applying all these patches so far:
0001-Thermal-initialize-thermal-zone-device-correctly.patch
0002-Thermal-handle-thermal-zone-device-update-events-cor.patch
0003-ACPI-thermal-remove-unused-thermal-suspend-callbacks.patch
0004-Thermal-make-thermal_zone_device_update-atomic.patch
0005-Thermal-make-thermal-framework-be-aware-of-thermal-m.patch
0006-ACPI-thermal-remove-redundant-code-for-thermal-mode-.patch
0007-platform-acerhdf-remove-redundant-thermal_zone_devic.patch
0008-Thermal-db8500_thermal-remove-redundant-thermal_zone.patch
0009-Thermal-imx_thermal-remove-redundant-code-after-ther.patch
0010-Thermal-of-thermal-remove-redundant-code-after-therm.patch
0011-Thermal-ti-soc-thermal-remove-redundant-code-after-t.patch
I understand your latest comment to mean that, instead of the above, I should apply these:
0001-Thermal-initialize-thermal-zone-device-correctly.patch
0002-Thermal-handle-thermal-zone-device-update-events-cor.patch
0003-ACPI-thermal-remove-unused-thermal-suspend-callbacks.patch
0004-Thermal-make-thermal_zone_device_update-atomic.patch
0001-Thermal-do-thermal-zone-update-after-a-cooling-devic.patch
I mean you should apply this patch on top of the patches at https://bugzilla.kernel.org/show_bug.cgi?id=78201#c157
Created attachment 171071 [details] prash-dmesg-4.0.0-rc4-g06e5801-7.xz dmesg dump of kernel v 4.0.0-rc4 with patches from https://bugzilla.kernel.org/show_bug.cgi?id=78201#c157. When I applied the patch at https://bugzilla.kernel.org/show_bug.cgi?id=92431#c71, the system would not boot. I was not able to capture a dmesg dump of that, but I have screenshots: https://imgur.com/jKsGZE8,HIL0cqD,pTEVbVL Please note that there are three image files here.
Created attachment 171131 [details] prash-journalctl-4.0.0-rc4-g06e5801.tar.xz Please disregard the screenshots I attached in my previous message. I went through my journalctl logs and discovered that the messages had been saved there after all. The current attachment contains two files:
prash-journalctl-4.0.0-rc4-g06e5801.1.log -- This refers to the instance for which I attached screenshots. I waited about 10 minutes for it to boot, then forced a shutdown. However, I had not passed the dyndbg flags when booting it.
prash-journalctl-4.0.0-rc4-g06e5801.2.log -- This refers to a later instance, where I passed the dyndbg flags. I forced a shutdown as soon as I had recorded the thermal debug messages.
Created attachment 171161 [details] 0001-Thermal-do-thermal-zone-update-after-a-cooling-devic.patch Then please drop the previous one and apply this one instead. Please attach the system log output after boot, even if the problem still exists.
Created attachment 171171 [details] prash-thermal-logs.xz The latest patch seems to have fixed the fan problem. I have attached the dmesg dump of 4.0.0-rc4-g06e5801 with patches from #75 and #78. After bootup, I got my CPU cores fully loaded, waited for their temperature to hit ~50°C, and dumped the sensor readings to *-warm.log. I basically did: sensors > prash-`uname -r`-warm.log && acpi -tc >> prash-`uname -r`-warm.log I then killed the load processes, waited until the CPU temperature reached ~40°C, and dumped the readings to *-cool.log. Then I waited until the CPU cooled further, and the fan stopped completely. The sensor readings for that are in *-cold.log. I took a dmesg dump after all the above steps, so you can see all the thermal debug messages. I then rebooted to my LTS kernel, and got the sensor readings as above. I have attached that set too. It looks like the fan control behaves well now. Thank you! PS: To anyone else who wants to test it on their own machine, you can download my build from http://www39.zippyshare.com/v/RRB3uKle/file.html
Thanks, prash, for lots of debugging and patience, and thanks, Rui, for the patches! I tried the kernel uploaded by prash (from the post above) and it works fine; I checked 3-4 times by rebooting. One strange thing I noticed: while I was running "sensors" every second (with the system idle), it mostly showed 47 deg C, but for 1 second it suddenly jumped to 58 deg C and the next second it was back to 47 deg C. The fan was normal at that time. Also, once (just after boot) I noticed sensors showing a temperature of around 75 deg C, but the fan was normal and only started running faster after 4-5 seconds. I am not sure whether this is normal behaviour, but I did not see it again. At least the issue of "Fan running at Full speed" is gone: fan speed goes up and down with load, as expected.
Patches to fix the problem sent out:
https://patchwork.kernel.org/patch/6077231/
https://patchwork.kernel.org/patch/6077241/
https://patchwork.kernel.org/patch/6077251/
Marking the bug as Resolved; I will close it once the patches are merged into the upstream kernel.
Hi guys, please help check whether the patches at comments #183/#184/#185 in bug #78201 work for you. As there are some functional changes, I need to make sure they have been tested before sending them upstream.
Hi Zhang Rui, Do you want me to apply these patches after the previous patches, or do you want me to apply them on a fresh checkout? I think you mean the latter, but I'd rather be safe than sorry.
Yes, the latter: please apply them to a vanilla kernel, preferably 4.0-rc.
I applied the patches to a clean checkout of 4.0.0-rc5-gbc465aa. I performed some basic CPU load testing. The kernel behaves just the way it did with the previous set of patches. Looks like everything is in order.
Great to know. Thanks. I will resend the patches tomorrow, after the patches have been tested by others.
The code fix has still not been sent upstream; this should be marked REOPENED until someone actually sends in the patches and they are accepted. *ping* @ Rui...
Rui, Are these patches merged?
Yu will take over the patches and push for upstream.
Patch sent at: https://patchwork.kernel.org/patch/7273051/ https://patchwork.kernel.org/patch/7273001/ https://patchwork.kernel.org/patch/7273041/
same cause as bug 91411 - duplicate. *** This bug has been marked as a duplicate of bug 91411 ***
This bug is not an exact duplicate of #91411. In this bug the fan runs at full speed right after system boot (at the same stage every time) and never slows down. In the other bug it runs at full speed only after suspend (indicating that it runs normally until the system is suspended). But maybe the cause of the issue is the same.
We marked this one as a duplicate because we've sent out a series of patches to fix the fan problem, and one of them will fix your boot-up problem. Please refer to https://bugzilla.kernel.org/show_bug.cgi?id=78201 for the latest patches, thanks!
The most recent three patches are for kernel 4.3.0: https://patchwork.kernel.org/patch/7525501/ https://patchwork.kernel.org/patch/7525491/ https://patchwork.kernel.org/patch/7525431/ and they work fine. Manuel, from BUG 78201