Bug 71711 - Strange / dangerous fan policy since 3.13
Status: CLOSED INSUFFICIENT_DATA
Product: Power Management
Classification: Unclassified
Component: Thermal
Hardware: All Linux
Importance: P1 high
Assigned To: Zhang Rui
Reported: 2014-03-08 05:05 UTC by step-ali
Modified: 2015-06-03 23:45 UTC (History)
Kernel Version: 3.13.5
Tree: Mainline
Regression: Yes


Attachments
acpidump HP Compaq 6730b (508.14 KB, text/plain)
2014-04-28 18:03 UTC, Manuel Krause
ACPI / AC: Use proper name for netlink event generation (687 bytes, patch)
2014-05-07 10:32 UTC, Rafael J. Wysocki
Guenter Roecks patch adapted for a 3.14.4 vanilla kernel (6.62 KB, patch)
2014-05-20 22:52 UTC, Manuel Krause
patch 4 (4.98 KB, patch)
2015-03-24 07:45 UTC, Zhang Rui
patch-5 (5.08 KB, patch)
2015-03-24 07:46 UTC, Zhang Rui

Description step-ali 2014-03-08 05:05:48 UTC
My fans have been acting strangely since the 3.13 upgrade.

Behaviour on 3.12:

Fans running pretty much all the time at 30%, temperatures 30-40 C.

Behaviour on 3.13:

Fans are idle until temperatures rise to 84 C (this is hot!), then ramp up to 75% (high noise) for a few seconds until temperatures drop to 72 C. Then they idle again.

This seems pretty dangerous, because the threshold of 84 degrees is just too high. I'd be fine with 60.

Laptop: MacBook Air 2013
OS: Arch Linux
Comment 1 Guenter Roeck 2014-03-08 05:28:01 UTC
Seen with other systems as well.

Additional information: https://bugs.archlinux.org/task/39005
Comment 2 Guenter Roeck 2014-03-09 17:25:05 UTC
e-mail exchange on the subject.

On 2014-03-08 16:59, Guenter Roeck wrote:
> On 03/08/2014 03:08 AM, Jean Delvare wrote:
>> On Fri, 7 Mar 2014 14:52:30 -0800, Guenter Roeck wrote:
>>> On Fri, Mar 07, 2014 at 11:04:29PM +0100, Manuel Krause wrote:
>>>> Hi, and thanks for the quick response!
>>>> No special fancy "fan control policy". 'fancontrol' isn't up or
>>>> running.
>>>> Vanilla kernels 3.11.* and 3.12.* had been working on here
>>>> without
>>>> any extra work.
>>>> -- 
>>>> # sensors
>>>> acpitz-virtual-0
>>>> Adapter: Virtual device
>>>> temp1:        +71.0°C  (crit = +256.0°C)
>>>> temp2:        +69.0°C  (crit = +110.0°C)
>>>> temp3:        +52.0°C  (crit = +105.0°C)
>>>> temp4:        +25.0°C  (crit = +110.0°C)
>>>> temp5:        +58.0°C  (crit = +110.0°C)
>>>>
>>>> coretemp-isa-0000
>>>> Adapter: ISA adapter
>>>> Core 0:       +62.0°C  (high = +105.0°C, crit = +105.0°C)
>>>> Core 1:       +60.0°C  (high = +105.0°C, crit = +105.0°C)
>>>> -- 
>>>> My notebook (HP/Compaq 6730b) does not have a separate fan
>>>> sensor.
>>>> This is with 3.12.13 with my normal workload.
>>>>
>>>> Please, trust my above-mentioned values of 94 °C vs. 74 °C, as I
>>>> don't like to boot 3.13.6 anymore, to avoid harm to the
>>>> notebook's
>>>> casing.
>>>
>>> Understood. Unfortunately, we'll need to get information
>>> from the new kernel to be able to track down the problem.
>>
>> Indeed. Not only the run-time temperatures, but also the high
>> and crit
>> limits.
>>
>>>> But I'd be glad to test any patch that might improve this.
>>>
>>> So far I have no idea what is going on. I don't see anything
>>> in the
>>> drivers providing above data that would explain the behavior,
>>> but I might be missing something.
>>
>> Looks like a regression in the acpi subsystem or in power
>> management,
>> not hwmon. Hwmon is merely reporting the temperatures, it's not
>> responsible for the actual temperatures.
>>
>
> I would agree. I don't think we have enough information to be sure,
> though. There might be some unintended interaction or interference.
>
> gpu is a good hint ... for example, look at commit b9ed919f1c8
> (drm/nouveau/drm/pm: remove everything except the hwmon interfaces
> to THERM). nouveau does export pwm and fan control information,
> so any change in that code may have unintended side effects.
> Similarly, I don't know how ec39f64bba (drm/radeon/dpm: Convert to
> use devm_hwmon_register_with_groups) could have the observed impact,
> as it is purely passive, but I would rather be safe than sorry.
>
> This problem has now been submitted into bugzilla as
> https://bugzilla.kernel.org/show_bug.cgi?id=71711.
>
> Guenter
>

Sorry for being late; I had to search for and gather a lot of information for you.
I hope it is fine that I put it all into one answer, CCing all of you.

My GFX is an Intel GM45 (mobile), shared memory, running the open-source Mesa drivers/extensions.
kernel-module: i915

According to the output of 'cpupower': I have
CPUidle driver: acpi_idle
CPUidle governor: menu

CPUfreq:
  driver: acpi-cpufreq
  available cpufreq governors: ondemand, performance
-
And "ondemand" is running.
-- 

# sensors
acpitz-virtual-0
Adapter: Virtual device
temp1:        +41.0°C  (crit = +256.0°C)
temp2:        +92.0°C  (crit = +110.0°C)
temp3:        +71.0°C  (crit = +105.0°C)
temp4:        +26.5°C  (crit = +110.0°C)
temp5:        +25.0°C  (crit = +110.0°C)

coretemp-isa-0000
Adapter: ISA adapter
Core 0:       +86.0°C  (high = +105.0°C, crit = +105.0°C)
Core 1:       +84.0°C  (high = +105.0°C, crit = +105.0°C)

Taken during a critical "smelly" situation today: kernel compilation, fan at 100%.
-- 

Additional findings:

Identification from bootup ACPI initialisation vs. sensors:
temp1 = DTSZ
temp2 = CPUZ --> triggering Cooling in 3.12.13 if > 74°C
temp3 = SKNZ
temp4 = BATZ "Battery Zone" always calm ~ +6°C of ambient T
temp5 = FDTZ --- in 3.12.13 a representation of the cooling-fan (25 - 45 - 58 - max?)
Core 0 & Core 1 are the internal CPU T sensors.

With the 3.13.x (.5+) kernels, the cooling settings first gathered at bootup stay in place forever. That means rebooting a hot system yields an FDTZ at 45°C or more and causes no problems, as it cools enough (even for kernel compiling here). If it gets 25°C at bootup, the system goes into emergency cooling at some point. The same happens with a suspend/resume.

Kernel 3.12.13 adjusts the cooling on its own, and appropriately.


Thank you all for your engagement, best regards,
Manuel Krause.
Comment 3 Manuel Krause 2014-03-11 17:36:00 UTC
# Based on the email shown in Comment 2, Rafael J. Wysocki asked me on 2014-03-09 18:58:
> This almost certainly is an ACPI regression, but I'm not sure whether
> thermal management or CPU power management is broken on your system.
>
> Can you compare the contents of /sys/class/thermal/ from working and
> not working kernels, please?
>
> Rafael
>
# which I answered as follows (I hope it is complete here):

Hi again,
unfortunately you didn't specify how deeply I should dig into /sys/class/thermal. So you get the lines from # BOF # to # EOF # below. I hope they're readable without more comments.

The most remarkable changes, in my eyes, had happened within "thermal_zone1".

Best regards,
Manuel Krause


# BOF #
The following are all from /sys/class/thermal/, whose entries are links to ../../devices/virtual/thermal/

I've listed the directories in separate sections for cooling_devices and thermal_zones, for each bad/good kernel. This layout is for emailing purposes only; you can merge them into a spreadsheet for your own evaluation. I've left out some subdirectories and values that _really_ didn't seem to need attention.

Also, I've collected the "# sensors" output for each readout, having reproduced nearly the same workload, as represented by the "fan speed" (thermal_zone4==FDTZ).

And I've done my very best not to produce typos or copy-and-paste errors.


 3.13.5 -- 20140309 -- 20:52 -- bad
=============================
dir             |-
                 /type       /cur_state  /max_state
cooling_device0  Processor    0          10
cooling_device1  Processor    0          10
cooling_device2  Fan          0           1
cooling_device3  Fan          1           1
cooling_device4  Fan          0           1
cooling_device5  Fan          0           1
cooling_device6  Fan          0           1
cooling_device7  LCD          0          24

 3.12.13 -- 20140310 -- 00:26 -- good
==============================
dir             |-
                 /type       /cur_state  /max_state
cooling_device0  Processor    0          10
cooling_device1  Processor    0          10
cooling_device2  Fan          0           1
cooling_device3  Fan          1           1
cooling_device4  Fan          1           1
cooling_device5  Fan          1           1
cooling_device6  Fan          1           1
cooling_device7  LCD          0          24


 3.13.5 -- 20140309 -- 20:52 -- bad
=============================
dir          |-
              /passive /temp  |-     /cdev?_  /trip_   /trip_
                                      trip_    point_   point_
                                      point    ?_temp   ?_type
thermal_zone0  0        68000   ?=0    n.a.   256000   critical
thermal_zone1   n.a.    70000 |-
                                ?=0   6       110000   critical
                                ?=1   5       107000   passive
                                ?=2   4        90000   active
                                ?=3   3        75000   active
                                ?=4   2        55000   active
                                ?=5   1        45000   active
                                ?=6   1        30000   active
thermal_zone2   n.a.    54000 |-
                                ?=0   1       105000   critical
                                ?=1   1        95000   passive
thermal_zone3   n.a.    25800 |-
                                ?=0   1       110000   critical
                                ?=1   1        60000   passive
thermal_zone4  0        58000   ?=0    n.a.   110000   critical


 3.12.13 -- 20140310 -- 00:26 -- good
==============================
dir          |-
              /passive /temp  |-     /cdev?_  /trip_   /trip_
                                      trip_    point_   point_
                                      point    ?_temp   ?_type
thermal_zone0  0        50000   ?=0    n.a.   256000   critical
thermal_zone1   n.a.    70000 |-
                                ?=0   1       110000   critical
                                ?=1   1       107000   passive
                                ?=2   2        90000   active
                                ?=3   3        67000   active
                                ?=4   4        55000   active
                                ?=5   5        45000   active
                                ?=6   6        30000   active
thermal_zone2   n.a.    53000 |-
                                ?=0   1       105000   critical
                                ?=1   1        95000   passive
thermal_zone3   n.a.    25600 |-
                                ?=0   1       110000   critical
                                ?=1   1        60000   passive
thermal_zone4  0        58000   ?=0    n.a.   110000   critical

---
Legend here:
       /type  is always  acpitz
       /mode             enabled
       /policy           step_wise

      - from kernel ACPI initialisation: thermal_zone0==DTSZ,
         thermal_zone1==CPUZ, thermal_zone2==SKNZ,
         thermal_zone3==BATZ, thermal_zone4==FDTZ
      - n.a. means      file or value is not available
___
Legend in general:
             /power/control          is always  auto
             /power/runtime_status              unsupported
             /uevent                            ''==empty

----------------------------------------------------------------

 3.13.5 -- 20140309 -- 20:52 -- bad
=============================
# sensors
acpitz-virtual-0
Adapter: Virtual device
temp1:        +68.0°C  (crit = +256.0°C)
temp2:        +70.0°C  (crit = +110.0°C)
temp3:        +54.0°C  (crit = +105.0°C)
temp4:        +25.8°C  (crit = +110.0°C)
temp5:        +58.0°C  (crit = +110.0°C)

coretemp-isa-0000
Adapter: ISA adapter
Core 0:       +66.0°C  (high = +105.0°C, crit = +105.0°C)
Core 1:       +63.0°C  (high = +105.0°C, crit = +105.0°C)


 3.12.13 -- 20140310 -- 00:26 -- good
==============================
# sensors
acpitz-virtual-0
Adapter: Virtual device
temp1:        +50.0°C  (crit = +256.0°C)
temp2:        +70.0°C  (crit = +110.0°C)
temp3:        +53.0°C  (crit = +105.0°C)
temp4:        +25.6°C  (crit = +110.0°C)
temp5:        +58.0°C  (crit = +110.0°C)

coretemp-isa-0000
Adapter: ISA adapter
Core 0:       +65.0°C  (high = +105.0°C, crit = +105.0°C)
Core 1:       +61.0°C  (high = +105.0°C, crit = +105.0°C)

# EOF #
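
[Editorial sketch] The tables above were assembled by hand from sysfs. The same data could be collected with a short Python script. This is a minimal sketch, assuming the attribute names described in Documentation/thermal/sysfs-api.txt (trip_point_N_temp, trip_point_N_type, cdevN_trip_point); the helper names read_text, read_trip_points and read_cdev_bindings are made up for illustration and are not part of any kernel tool.

```python
from pathlib import Path


def read_text(path):
    """Return the stripped contents of a sysfs file, or None if it is unreadable."""
    try:
        return Path(path).read_text().strip()
    except OSError:
        return None


def read_trip_points(zone_dir):
    """Return sorted (index, temp in millidegrees C, type) tuples for one thermal zone."""
    zone = Path(zone_dir)
    trips = []
    for type_file in zone.glob("trip_point_*_type"):
        # File names look like trip_point_3_type; field 2 is the trip index.
        index = int(type_file.name.split("_")[2])
        temp = read_text(zone / f"trip_point_{index}_temp")
        trips.append((index, int(temp) if temp is not None else None,
                      read_text(type_file)))
    return sorted(trips)


def read_cdev_bindings(zone_dir):
    """Return sorted (cdev index, trip point index) tuples for one thermal zone."""
    zone = Path(zone_dir)
    bindings = []
    for trip_file in zone.glob("cdev*_trip_point"):
        # File names look like cdev4_trip_point; strip "cdev" and the suffix.
        index = int(trip_file.name[len("cdev"):].split("_")[0])
        value = read_text(trip_file)
        bindings.append((index, int(value) if value is not None else None))
    return sorted(bindings)
```

Called as read_trip_points("/sys/class/thermal/thermal_zone1") on a good and a bad kernel, the two results can be compared directly instead of being copied into a table by hand.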
Comment 4 Manuel Krause 2014-03-11 22:25:42 UTC
# also posted to linux-kernel && linux-pm
# my findings from tonight:

Hi, and thank you for your attention ^^

at the bottom of this email you'll find the current values for the new 3.12.14 kernel for two different levels of usage and ambient temperature.
As you can see, in kernel 3.12.14 the /cdev?_trip_point enumeration has changed to match 3.13.?, and so has one /trip_point_?_temp. But 3.12.14 works as well as 3.12.13. (So the first thing that caught my eye didn't lead anywhere useful.)
I'm not capable of finding or understanding the related code, but, please, let me present an idea of what MAY be going on:

In 3.12.13+, on my system, the effective cooling fan speed seems to be an accumulation, maybe bitwise, of cooling_device[2-6]/cur_state, each of which gets activated (=1) at a certain temperature value or level; each cooling_device[2-6]/cur_state stays at 1 as long as the temperature does not fall below its reference temperature. On my system this reference temperature would most likely be triggered by temp2 == thermal_zone1/temp [CPUZ].

In 3.13.?, only one of cooling_device[2-6]/cur_state seems to get set to 1, with the others left at or rewritten with 0. The fan speed algorithm then accumulates only a single 1, without seeing the [_LEVEL_] number of cooling_device[2-6] or re-requesting the related trigger temperature.

I hope this leads you developers nearer to a conclusion on how to fix it,
best regards, Manuel Krause

_____________________________
3.12.14 -- 20140311 -- 19:07 -- changed, not broken -- normal use
=============================
/sys/class/thermal/*  which
are links to -> ../../devices/virtual/thermal/*

dir             |-
                 /type       /cur_state  /max_state  Maybe
                                                      trigger
                                                      /PWM
...
cooling_device2  Fan          0           1          not yet
                                                      observed
cooling_device3  Fan          0           1          FDTZ==58°C
cooling_device4  Fan          1           1          FDTZ==45°C
cooling_device5  Fan          1           1          FDTZ==34°C
cooling_device6  Fan          1           1          FDTZ==25°C
...

dir          |-
              /passive /temp  |-     /cdev?_  /trip_   /trip_
                                      trip_    point_   point_
                                      point    ?_temp   ?_type
...
thermal_zone1   n.a.    73000 |- (CPUZ)
                                ?=0   6       110000   critical
                                ?=1   5       107000   passive
                                ?=2   4        90000   active
                                ?=3   3        75000   active
                                ?=4   2        55000   active
                                ?=5   1        45000   active
                                ?=6   1        30000   active
...
thermal_zone4   n.a.    45000   ?=0    n.a.   110000   critical (FDTZ)
...

# sensors
acpitz-virtual-0
Adapter: Virtual device
temp1:        +46.0°C  (crit = +256.0°C)
temp2:        +73.0°C  (crit = +110.0°C)
temp3:        +57.0°C  (crit = +105.0°C)
temp4:        +26.3°C  (crit = +110.0°C)
temp5:        +45.0°C  (crit = +110.0°C)

coretemp-isa-0000
Adapter: ISA adapter
Core 0:       +68.0°C  (high = +105.0°C, crit = +105.0°C)
Core 1:       +66.0°C  (high = +105.0°C, crit = +105.0°C)


_____________________________
3.12.14 -- 20140311 -- 21:09 -- changed, not broken -- idle state
=============================

dir             |-
                 /type       /cur_state  /max_state  Maybe
                                                      trigger
                                                      /PWM
...
cooling_device2  Fan          0           1          not yet
                                                      observed
cooling_device3  Fan          0           1          FDTZ==58°C
cooling_device4  Fan          0           1          FDTZ==45°C
cooling_device5  Fan          0           1          FDTZ==34°C
cooling_device6  Fan          1           1          FDTZ==25°C
...

dir          |-
              /passive /temp
thermal_zone1   n.a.    46000 ... (CPUZ)
...
thermal_zone4   n.a.    25000 ... (FDTZ)
...

# sensors
acpitz-virtual-0
Adapter: Virtual device
temp1:        +50.0°C  (crit = +256.0°C)
temp2:        +46.0°C  (crit = +110.0°C)
temp3:        +44.0°C  (crit = +105.0°C)
temp4:        +25.7°C  (crit = +110.0°C)
temp5:        +25.0°C  (crit = +110.0°C)

coretemp-isa-0000
Adapter: ISA adapter
Core 0:       +41.0°C  (high = +105.0°C, crit = +105.0°C)
Core 1:       +41.0°C  (high = +105.0°C, crit = +105.0°C)
_____________________________
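
[Editorial sketch] The accumulation idea from this comment can be illustrated with the cur_state values reported in comment 3. The Python snippet below only restates the commenter's hypothesis (fan level = number of fan cooling devices switched on); it is not a description of what the kernel or the firmware actually does, and the names FAN_STATES and active_fan_levels are invented for the example.

```python
# cur_state of cooling_device2..cooling_device6 (the five "Fan" devices),
# copied from the cooling-device tables in comment 3.
FAN_STATES = {
    "3.13.5 (bad)": [0, 1, 0, 0, 0],
    "3.12.13 (good)": [0, 1, 1, 1, 1],
}


def active_fan_levels(states):
    """Count the fan cooling devices that are switched on (cur_state == 1)."""
    return sum(1 for state in states if state == 1)


for kernel, states in FAN_STATES.items():
    print(f"{kernel}: {active_fan_levels(states)} of {len(states)} fan devices active")
# prints:
# 3.13.5 (bad): 1 of 5 fan devices active
# 3.12.13 (good): 4 of 5 fan devices active
```

Under that reading, the bad kernel runs the fan at one accumulated level where the good kernel reaches four at a comparable CPUZ temperature of 70°C.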
Comment 5 Manuel Krause 2014-03-20 20:28:21 UTC
[SNIP]

Long time no reply from you... Have I overlooked an unwritten convention? Or were my charts that unusable for your analysis/work?

Two days ago, I tried the 3.14.0-rc7-vanilla. And the problem persists. "Strange / dangerous fan policy..."

Since kernel 3.13.6 I've managed to 'fix' the potential overheating problem by manually issuing a:
"echo 1 > /sys/class/thermal/cooling_device3/cur_state" *)
_before_ obviously critical temperatures occur. Note: this particular setting may only work on my system! It also keeps working on 3.14-rc.

In the following, I'd like to present a modified dump of my /sys/class/thermal, produced by a script I wrote (for my system) that shows the results in the layout of linux/Documentation/thermal/sysfs-api.txt, point 3:
{I've uploaded the files to pastebin so as not to swamp you and the lists with so many lines of logs.}

For the last good kernel -- 3.12.14 -- in-use:
 http://pastebin.com/HL1PNcda
For my first bad kernel revision 3.13 -- at critical temp:
 http://pastebin.com/98hgf1a9
For the last bad kernel -- 3.14.0-rc7 -- at critical temp:
 http://pastebin.com/MuTwTnjD
For the last bad kernel -- 3.14.0-rc7 -- after issuing the
 *) command:
 http://pastebin.com/2peda54z

Please, have a look at them! And maybe, give me hints on how I can help you to further debug this issue, as my manual method works but it's annoying.

And, PLEASE CC: ME, as I'm not on the lists. Or lead this Email-thread to someone in charge.

Thank you for your work && best regards,
Manuel Krause
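
[Editorial sketch] The manual workaround in this comment (writing 1 to cooling_device3/cur_state) could be wrapped in a small Python helper that first checks the requested state against max_state. This is a sketch only: set_cooling_state is a hypothetical name, writing to sysfs needs root, and, as the commenter stresses, cooling_device3 is specific to his HP Compaq 6730b.

```python
from pathlib import Path


def set_cooling_state(device_dir, state):
    """Write `state` to <device_dir>/cur_state after checking it against max_state."""
    device = Path(device_dir)
    max_state = int((device / "max_state").read_text().strip())
    if not 0 <= state <= max_state:
        raise ValueError(f"state {state} is outside 0..{max_state}")
    (device / "cur_state").write_text(f"{state}\n")


# Example call, equivalent to the echo command quoted in this comment
# (run as root, and only on a machine where this device is the right fan):
#   set_cooling_state("/sys/class/thermal/cooling_device3", 1)
```

The range check matters little for the on/off fan devices listed in this report (max_state is 1), but it avoids writing out-of-range values to devices such as the Processor or LCD entries.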
Comment 6 Manuel Krause 2014-03-31 23:17:17 UTC
3.12.15 works very well
3.13.7 fails
3.14.0-rc8 fails

I've tried the tmon tool, now, too. Nice eyecandy and for monitoring!

I've tried reverting all "thermal"-related patches between 3.12.14 and 3.13.7 on top of 3.13.7, but they don't seem to matter. (Nor does applying them the other way round to 3.12.15.)

So "thermal" is out?

On the failing kernels, none of the reached (active) trip points triggers a single fan action!

ACPI would be the next thing to investigate.

THX for this audience,
Manuel Krause
Comment 7 Roman Spirgi 2014-04-02 08:39:24 UTC
I'm not sure if this is related to this bug, but since kernel 3.13 my fan speed is far too high and noisy as soon as the system boots up ... I'm using Fedora 20. With kernel 3.12.X everything was fine and the fan speed was at an acceptable level ...

[ant@fedorant ~]$ sensors
nouveau-pci-0100
Adapter: PCI adapter
fan1:        6693 RPM
temp1:        +69.0°C  (high = +95.0°C, hyst =  +3.0°C)
                       (crit = +105.0°C, hyst =  +5.0°C)
                       (emerg = +135.0°C, hyst =  +5.0°C)

coretemp-isa-0000
Adapter: ISA adapter
Core 0:       +52.0°C  (high = +83.0°C, crit = +99.0°C)
Core 1:       +50.0°C  (high = +83.0°C, crit = +99.0°C)
Core 2:       +52.0°C  (high = +83.0°C, crit = +99.0°C)
Core 3:       +50.0°C  (high = +83.0°C, crit = +99.0°C)

it8720-isa-0a10
Adapter: ISA adapter
in0:          +0.86 V  (min =  +0.00 V, max =  +4.08 V)  ALARM
in1:          +3.04 V  (min =  +0.00 V, max =  +4.08 V)  ALARM
in2:          +3.33 V  (min =  +0.00 V, max =  +4.08 V)  ALARM
+5V:          +3.04 V  (min =  +0.00 V, max =  +4.08 V)  ALARM
in4:          +2.94 V  (min =  +0.00 V, max =  +4.08 V)  ALARM
in5:          +2.16 V  (min =  +0.00 V, max =  +4.08 V)  ALARM
in6:          +2.16 V  (min =  +0.00 V, max =  +4.08 V)  ALARM
5VSB:         +2.96 V  (min =  +0.00 V, max =  +4.08 V)  ALARM
Vbat:         +2.99 V  
fan1:         838 RPM  (min =    0 RPM)
fan2:         949 RPM  (min =    0 RPM)
temp1:       +127.0°C  (low  =  -1.0°C, high = +127.0°C)  ALARM  sensor = thermal diode
temp2:        +22.0°C  (low  =  -1.0°C, high = +127.0°C)  ALARM  sensor = thermistor
temp3:        -47.0°C  (low  =  -1.0°C, high = +127.0°C)  sensor = Intel PECI
cpu0_vid:    +0.000 V
intrusion0:  ALARM

Any ideas?
Comment 8 Guenter Roeck 2014-04-02 14:02:47 UTC
On 04/02/2014 01:39 AM, bugzilla-daemon@bugzilla.kernel.org wrote:
> https://bugzilla.kernel.org/show_bug.cgi?id=71711
>
> --- Comment #7 from Roman Spirgi <the.ant@gmx.net> ---
> I'm not sure if this is related to this bug but since Kernel 3.13 my fan speed
> is far too high and noisy as soon as the system is booting up ... I'm using
> Fedora 20. With Kernel 3.12.X everything was fine instead and fan speed was on
> an acceptable level ...
>
> [ant@fedorant ~]$ sensors
> nouveau-pci-0100
> Adapter: PCI adapter
> fan1:        6693 RPM
> temp1:        +69.0°C  (high = +95.0°C, hyst =  +3.0°C)
>                         (crit = +105.0°C, hyst =  +5.0°C)
>                         (emerg = +135.0°C, hyst =  +5.0°C)
>
Looks like Nouveau fan control does not work. No idea what may be causing this ...
well, possibly. There are two suspicious commits between 3.12 and 3.13.
Maybe the "remove everything" commit has undesirable side effects.

eec9901 drm/nouveau/hwmon: fix compilation without CONFIG_HWMON
b9ed919 drm/nouveau/drm/pm: remove everything except the hwmon interfaces to THERM

I would suggest opening a separate bug against the Nouveau component.

[ Side note: The displayed values for hyst are wrong. Those should be absolute
   temperatures, not temperature differences. But that is yet another bug. ]

> coretemp-isa-0000
> Adapter: ISA adapter
> Core 0:       +52.0°C  (high = +83.0°C, crit = +99.0°C)
> Core 1:       +50.0°C  (high = +83.0°C, crit = +99.0°C)
> Core 2:       +52.0°C  (high = +83.0°C, crit = +99.0°C)
> Core 3:       +50.0°C  (high = +83.0°C, crit = +99.0°C)
>
> it8720-isa-0a10
> Adapter: ISA adapter
> in0:          +0.86 V  (min =  +0.00 V, max =  +4.08 V)  ALARM
> in1:          +3.04 V  (min =  +0.00 V, max =  +4.08 V)  ALARM
> in2:          +3.33 V  (min =  +0.00 V, max =  +4.08 V)  ALARM
> +5V:          +3.04 V  (min =  +0.00 V, max =  +4.08 V)  ALARM
> in4:          +2.94 V  (min =  +0.00 V, max =  +4.08 V)  ALARM
> in5:          +2.16 V  (min =  +0.00 V, max =  +4.08 V)  ALARM
> in6:          +2.16 V  (min =  +0.00 V, max =  +4.08 V)  ALARM
> 5VSB:         +2.96 V  (min =  +0.00 V, max =  +4.08 V)  ALARM
> Vbat:         +2.99 V
> fan1:         838 RPM  (min =    0 RPM)
> fan2:         949 RPM  (min =    0 RPM)
> temp1:       +127.0°C  (low  =  -1.0°C, high = +127.0°C)  ALARM  sensor =
> thermal diode
> temp2:        +22.0°C  (low  =  -1.0°C, high = +127.0°C)  ALARM  sensor =
> thermistor
> temp3:        -47.0°C  (low  =  -1.0°C, high = +127.0°C)  sensor = Intel PECI
> cpu0_vid:    +0.000 V
> intrusion0:  ALARM
>

Something in your system configuration is wrong. Usually this comes from the BIOS,
so you might want to check if there is a BIOS upgrade available. It looks like
the system believes that your CPU is freezing and therefore runs the CPU fan at
minimum speed. That may be ok with the current load, but might be a problem
if the CPUs get busy and run hot. That is not related to the nouveau problem,
though.

Guenter
Comment 9 Daniele 2014-04-02 17:29:28 UTC
I can confirm the original bug reported. I reproduced it with a HP 625 (AMD athlon processor with AMD HD 4200 graphics) laptop.

I tested ubuntu 3.12, 3.13 and 3.14 kernels, and the problem appeared in 3.13.

Best regards,
Daniele
Comment 10 Jean Delvare 2014-04-03 07:51:51 UTC
(In reply to Guenter Roeck from comment #8)
> Something in your system configuration is wrong. Usually this comes from the
> BIOS, so you might want to check if there is a BIOS upgrade available. It
> looks like the system believes that your CPU is freezing and therefore runs
> the CPU fan at minimum speed.

As I recall the IT87xx chips need an offset programmed by the
BIOS in order to return "sane" temperature values from PECI sources.
Without the offset, the driver returns the thermal margin as a negative
value (-47°C here would mean the CPU runs 47 pseudo-°C below its
critical temperature.) This matches the values returned by coretemp (99
- 47 = 52). This would justify the low fan speeds.

The original poster could try setting temp3_offset to 99 (in the right chip section of sensors.conf, followed by "sensors -s" as root) and see if it makes the system behave differently.
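
[Editorial sketch] The arithmetic in this comment (99 - 47 = 52) can be written out as a tiny Python function. It assumes, as the comment does, that the negative temp3 reading is a thermal margin relative to the CPU's critical temperature; the function name is made up for illustration.

```python
def peci_margin_to_absolute(margin_c, critical_c):
    """Convert a PECI thermal margin (negative, degrees below critical) to an absolute temperature."""
    return critical_c + margin_c


# Values from the sensors output in comment 7:
# it8720 temp3 = -47.0 (Intel PECI), coretemp crit = +99.0.
print(peci_margin_to_absolute(-47.0, 99.0))  # prints 52.0
```

The result matches the Core 0 reading of +52.0°C in the same sensors output, which is why an offset of 99 is suggested for temp3.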
Comment 11 Roman Spirgi 2014-04-03 13:51:56 UTC
Jean, indeed:
...
temp3:        +46.0°C  (low  =  -1.0°C, high = +127.0°C)  sensor = Intel PECI
...
But it's definitely noisier now ;)

Guenter, thank you, I did open "https://bugs.freedesktop.org/show_bug.cgi?id=77003" for the NVIDIA fan speed issue.

Thank you guys,
Roman
Comment 12 Jean Delvare 2014-04-03 16:13:48 UTC
It really all depends on what the automatic fan control setup expects. Unfortunately I don't think the it87 driver exposes its trip points to user-space so you'd have to poke at the registers directly.
Comment 13 Jernej Jakob 2014-04-04 19:03:19 UTC
Hello everyone,

I can confirm this bug as well on an HP Probook 4710s. So there are now at least 5 confirmed reports.

Please see:
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1290110 (my bug report)
http://lkml.iu.edu//hypermail/linux/kernel/1404.0/02012.html (my archived post to the linux-kernel mailing list)

It would also be worth taking a look at the DSDT, as there are other minor quirks on my system that could point there... (brightness always on max after reboot/suspend, coarse brightness setting range)
I've already disassembled mine but am stumped at what to do next (this is my first look at anything ACPI related), how to debug...

But as previous kernels worked okay with this same DSDT, maybe they didn't control the fan speed through ACPI but left it to the BIOS?

For info on disassembling the DSDT see https://wiki.archlinux.org/index.php/DSDT
Comment 14 Manuel Krause 2014-04-06 02:40:55 UTC
I've now bisected twice, from two different kernel starting points, as I'm new to this stupid-and-lengthy method and wanted to be sure that I hadn't given a false positive in between out of boredom.

In the end it says each time:
# git bisect bad | tee -a /var/log/bisect.log
cc8ef52707341e67a12067d6ead991d56ea017ca is the first bad commit
commit cc8ef52707341e67a12067d6ead991d56ea017ca
Author: Zhang Rui <rui.zhang@intel.com>
Date:   Wed Sep 25 20:39:45 2013 +0800

    ACPI / AC: convert ACPI ac driver to platform bus

    Signed-off-by: Zhang Rui <rui.zhang@intel.com>
    Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>

:040000 040000 5a0d397cfcbf53c03390f2805b83754cb7837d84 4a2af1454f65d67f1d1a507c08e3b9ef3ffe57e7 M      drivers


Please tell me how I can help debug this further, and please also read the newest from
https://bugzilla.kernel.org/show_bug.cgi?id=71711

Manuel Krause
Comment 15 Zhang Rui 2014-04-07 12:06:27 UTC
Hi, Manuel,

nice report.

(In reply to Manuel Krause from comment #3)
> 
>  3.13.5 -- 20140309 -- 20:52 -- bad
> =============================
> dir             |-
>                  /type       /cur_state  /max_state
> cooling_device0  Processor    0          10
> cooling_device1  Processor    0          10
> cooling_device2  Fan          0           1
> cooling_device3  Fan          1           1
> cooling_device4  Fan          0           1
> cooling_device5  Fan          0           1
> cooling_device6  Fan          0           1
> cooling_device7  LCD          0          24
> 
>  3.12.13 -- 20140310 -- 00:26 -- good
> ==============================
> dir             |-
>                  /type       /cur_state  /max_state
> cooling_device0  Processor    0          10
> cooling_device1  Processor    0          10
> cooling_device2  Fan          0           1
> cooling_device3  Fan          1           1
> cooling_device4  Fan          1           1
> cooling_device5  Fan          1           1
> cooling_device6  Fan          1           1
> cooling_device7  LCD          0          24
> 
> 
>  3.13.5 -- 20140309 -- 20:52 -- bad
> =============================
> dir          |-
>               /passive /temp  |-     /cdev?_  /trip_   /trip_
>                                       trip_    point_   point_
>                                       point    ?_temp   ?_type
> thermal_zone0  0        68000   ?=0    n.a.   256000   critical
> thermal_zone1   n.a.    70000 |-
>                                 ?=0   6       110000   critical
>                                 ?=1   5       107000   passive
>                                 ?=2   4        90000   active
>                                 ?=3   3        75000   active
>                                 ?=4   2        55000   active
>                                 ?=5   1        45000   active
>                                 ?=6   1        30000   active
> thermal_zone2   n.a.    54000 |-
>                                 ?=0   1       105000   critical
>                                 ?=1   1        95000   passive
> thermal_zone3   n.a.    25800 |-
>                                 ?=0   1       110000   critical
>                                 ?=1   1        60000   passive
> thermal_zone4  0        58000   ?=0    n.a.   110000   critical
> 
> 
>  3.12.13 -- 20140310 -- 00:26 -- good
> ==============================
> dir          |-
>               /passive /temp  |-     /cdev?_  /trip_   /trip_
>                                       trip_    point_   point_
>                                       point    ?_temp   ?_type
> thermal_zone0  0        50000   ?=0    n.a.   256000   critical
> thermal_zone1   n.a.    70000 |-
>                                 ?=0   1       110000   critical
>                                 ?=1   1       107000   passive
>                                 ?=2   2        90000   active
>                                 ?=3   3        67000   active
>                                 ?=4   4        55000   active
>                                 ?=5   5        45000   active
>                                 ?=6   6        30000   active
> thermal_zone2   n.a.    53000 |-
>                                 ?=0   1       105000   critical
>                                 ?=1   1        95000   passive
> thermal_zone3   n.a.    25600 |-
>                                 ?=0   1       110000   critical
>                                 ?=1   1        60000   passive
> thermal_zone4  0        58000   ?=0    n.a.   110000   critical
> 
This is not enough. Can you please attach the output of
"grep . /sys/class/thermal/thermal_zone*/cdev*/device/path"

I need to figure out why /sys/class/thermal/thermal_zone1/cdev0_trip_point equals 1 in 3.12, while it equals 6 in 3.13.

Also, can you please attach the output of "grep . /sys/class/thermal/cooling_device*/device/path" in both 3.12 and 3.13 as well.
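
[Editorial sketch] The "grep ." idiom requested here prints every non-empty line of each matching file, prefixed with the file name. Where scripting is more convenient than pasting shell output, roughly the same listing could be produced in Python. The function name grep_nonempty is hypothetical, and unlike grep this sketch always prints the path prefix, even for a single file.

```python
import glob


def grep_nonempty(pattern):
    """Return 'path:line' for every non-empty line of every file matching `pattern`."""
    results = []
    for path in sorted(glob.glob(pattern)):
        try:
            with open(path, errors="replace") as handle:
                for line in handle:
                    line = line.rstrip("\n")
                    if line:
                        results.append(f"{path}:{line}")
        except OSError:
            # Unreadable, vanished, or directory entries are skipped.
            continue
    return results


# Example, mirroring the two commands requested in this comment:
#   grep_nonempty("/sys/class/thermal/thermal_zone*/cdev*/device/path")
#   grep_nonempty("/sys/class/thermal/cooling_device*/device/path")
```

Running it on both a 3.12 and a 3.13 kernel yields two lists that can be diffed to see whether the cooling devices are bound to the same ACPI paths.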
Comment 16 Manuel Krause 2014-04-08 01:00:43 UTC
Let's start with my current GOOD kernel:

# uname -r
3.12.16-ck2
# grep . /sys/class/thermal/thermal_zone*/cdev*/device/path
/sys/class/thermal/thermal_zone1/cdev0/device/path:\_TZ_.FAN4
/sys/class/thermal/thermal_zone1/cdev1/device/path:\_TZ_.FAN3
/sys/class/thermal/thermal_zone1/cdev2/device/path:\_TZ_.FAN2
/sys/class/thermal/thermal_zone1/cdev3/device/path:\_TZ_.FAN1
/sys/class/thermal/thermal_zone1/cdev4/device/path:\_TZ_.FAN0
/sys/class/thermal/thermal_zone1/cdev5/device/path:\_PR_.CPU1
/sys/class/thermal/thermal_zone1/cdev6/device/path:\_PR_.CPU0
/sys/class/thermal/thermal_zone2/cdev0/device/path:\_PR_.CPU1
/sys/class/thermal/thermal_zone2/cdev1/device/path:\_PR_.CPU0
/sys/class/thermal/thermal_zone3/cdev0/device/path:\_PR_.CPU1
/sys/class/thermal/thermal_zone3/cdev1/device/path:\_PR_.CPU0
# grep . /sys/class/thermal/cooling_device*/device/path
/sys/class/thermal/cooling_device0/device/path:\_PR_.CPU0
/sys/class/thermal/cooling_device1/device/path:\_PR_.CPU1
/sys/class/thermal/cooling_device2/device/path:\_TZ_.FAN0
/sys/class/thermal/cooling_device3/device/path:\_TZ_.FAN1
/sys/class/thermal/cooling_device4/device/path:\_TZ_.FAN2
/sys/class/thermal/cooling_device5/device/path:\_TZ_.FAN3
/sys/class/thermal/cooling_device6/device/path:\_TZ_.FAN4
/sys/class/thermal/cooling_device7/device/path:\_SB_.PCI0.GFX0.DD02

And have a newer BAD kernel:

# uname -r
3.13.8-ck1
# grep . /sys/class/thermal/thermal_zone*/cdev*/device/path
/sys/class/thermal/thermal_zone1/cdev0/device/path:\_TZ_.FAN4
/sys/class/thermal/thermal_zone1/cdev1/device/path:\_TZ_.FAN3
/sys/class/thermal/thermal_zone1/cdev2/device/path:\_TZ_.FAN2
/sys/class/thermal/thermal_zone1/cdev3/device/path:\_TZ_.FAN1
/sys/class/thermal/thermal_zone1/cdev4/device/path:\_TZ_.FAN0
/sys/class/thermal/thermal_zone1/cdev5/device/path:\_PR_.CPU1
/sys/class/thermal/thermal_zone1/cdev6/device/path:\_PR_.CPU0
/sys/class/thermal/thermal_zone2/cdev0/device/path:\_PR_.CPU1
/sys/class/thermal/thermal_zone2/cdev1/device/path:\_PR_.CPU0
/sys/class/thermal/thermal_zone3/cdev0/device/path:\_PR_.CPU1
/sys/class/thermal/thermal_zone3/cdev1/device/path:\_PR_.CPU0
# grep . /sys/class/thermal/cooling_device*/device/path
/sys/class/thermal/cooling_device0/device/path:\_PR_.CPU0
/sys/class/thermal/cooling_device1/device/path:\_PR_.CPU1
/sys/class/thermal/cooling_device2/device/path:\_TZ_.FAN0
/sys/class/thermal/cooling_device3/device/path:\_TZ_.FAN1
/sys/class/thermal/cooling_device4/device/path:\_TZ_.FAN2
/sys/class/thermal/cooling_device5/device/path:\_TZ_.FAN3
/sys/class/thermal/cooling_device6/device/path:\_TZ_.FAN4
/sys/class/thermal/cooling_device7/device/path:\_SB_.PCI0.GFX0.DD02

The "grep . /sys/class/thermal/cooling_device*/device/path" results stay 
always the same as above, so I omit them in the following.

There are generally only two different re-occurring scenarios for
"grep . /sys/class/thermal/thermal_zone*/cdev*/device/path", so that I 
want to abbreviate them in the following:

Scenario-1:
# grep . /sys/class/thermal/thermal_zone*/cdev*/device/path
/sys/class/thermal/thermal_zone1/cdev0/device/path:\_PR_.CPU1
/sys/class/thermal/thermal_zone1/cdev1/device/path:\_PR_.CPU0
/sys/class/thermal/thermal_zone1/cdev2/device/path:\_TZ_.FAN0
/sys/class/thermal/thermal_zone1/cdev3/device/path:\_TZ_.FAN1
/sys/class/thermal/thermal_zone1/cdev4/device/path:\_TZ_.FAN2
/sys/class/thermal/thermal_zone1/cdev5/device/path:\_TZ_.FAN3
/sys/class/thermal/thermal_zone1/cdev6/device/path:\_TZ_.FAN4
/sys/class/thermal/thermal_zone2/cdev0/device/path:\_PR_.CPU1
/sys/class/thermal/thermal_zone2/cdev1/device/path:\_PR_.CPU0
/sys/class/thermal/thermal_zone3/cdev0/device/path:\_PR_.CPU1
/sys/class/thermal/thermal_zone3/cdev1/device/path:\_PR_.CPU0

Scenario-2:
# grep . /sys/class/thermal/thermal_zone*/cdev*/device/path
/sys/class/thermal/thermal_zone1/cdev0/device/path:\_TZ_.FAN4
/sys/class/thermal/thermal_zone1/cdev1/device/path:\_TZ_.FAN3
/sys/class/thermal/thermal_zone1/cdev2/device/path:\_TZ_.FAN2
/sys/class/thermal/thermal_zone1/cdev3/device/path:\_TZ_.FAN1
/sys/class/thermal/thermal_zone1/cdev4/device/path:\_TZ_.FAN0
/sys/class/thermal/thermal_zone1/cdev5/device/path:\_PR_.CPU1
/sys/class/thermal/thermal_zone1/cdev6/device/path:\_PR_.CPU0
/sys/class/thermal/thermal_zone2/cdev0/device/path:\_PR_.CPU1
/sys/class/thermal/thermal_zone2/cdev1/device/path:\_PR_.CPU0
/sys/class/thermal/thermal_zone3/cdev0/device/path:\_PR_.CPU1
/sys/class/thermal/thermal_zone3/cdev1/device/path:\_PR_.CPU0

While bisecting this issue, I had already found out that these scenarios
have something to do with rebooting, so I rebooted each newly bisected kernel
twice in the second round.
But I did not expect the following disorder:

This is a series of results from last night, rebooting different kernels one
after the other and capturing some relevant data.


# uname -r
3.12.16
# grep . /sys/class/thermal/thermal_zone*/cdev*/device/path
Scenario-2

# uname -r
3.13.8
# grep . /sys/class/thermal/thermal_zone*/cdev*/device/path
Scenario-2

# uname -r
3.13.8
# grep . /sys/class/thermal/thermal_zone*/cdev*/device/path
Scenario-1

# uname -r
3.12.13
# grep . /sys/class/thermal/thermal_zone*/cdev*/device/path
Scenario-2

# uname -r
3.12.13
# grep . /sys/class/thermal/thermal_zone*/cdev*/device/path
Scenario-1

# uname -r
3.12.13
# grep . /sys/class/thermal/thermal_zone*/cdev*/device/path
Scenario-2

# uname -r
3.13.5
# grep . /sys/class/thermal/thermal_zone*/cdev*/device/path
Scenario-2

# uname -r
3.13.5
# grep . /sys/class/thermal/thermal_zone*/cdev*/device/path
Scenario-1

# uname -r
3.13.5
# grep . /sys/class/thermal/thermal_zone*/cdev*/device/path
Scenario-1

# uname -r
3.13.8
# grep . /sys/class/thermal/thermal_zone*/cdev*/device/path
Scenario-1

# uname -r
3.13.8
# grep . /sys/class/thermal/thermal_zone*/cdev*/device/path
Scenario-1

# uname -r
3.13.8
# grep . /sys/class/thermal/thermal_zone*/cdev*/device/path
Scenario-1

# uname -r
3.13.8
# grep . /sys/class/thermal/thermal_zone*/cdev*/device/path
Scenario-1

# uname -r
3.12.16
# grep . /sys/class/thermal/thermal_zone*/cdev*/device/path
Scenario-1

# uname -r
3.12.16
# grep . /sys/class/thermal/thermal_zone*/cdev*/device/path
Scenario-1

# uname -r
3.12.16
# grep . /sys/class/thermal/thermal_zone*/cdev*/device/path
Scenario-1

# uname -r
3.13.8
# grep . /sys/class/thermal/thermal_zone*/cdev*/device/path
Scenario-2

# uname -r
3.13.8
# grep . /sys/class/thermal/thermal_zone*/cdev*/device/path
Scenario-1

# uname -r
3.13.8
# grep . /sys/class/thermal/thermal_zone*/cdev*/device/path
Scenario-1

# uname -r
3.12.16
# grep . /sys/class/thermal/thermal_zone*/cdev*/device/path
Scenario-2

# uname -r
3.12.16
# grep . /sys/class/thermal/thermal_zone*/cdev*/device/path
Scenario-1

# uname -r
3.12.16
# grep . /sys/class/thermal/thermal_zone*/cdev*/device/path
Scenario-1
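
The boot list above can be tallied per kernel series to check whether the cdev ordering follows the kernel version. The sketch below transcribes the 22 boots from the list as "<uname -r> <scenario>" pairs; the variable name boots and the function name tally are illustrative assumptions.

```shell
# Boot log transcribed from the list above: "<uname -r> <scenario number>" per boot.
boots='3.12.16 2
3.13.8 2
3.13.8 1
3.12.13 2
3.12.13 1
3.12.13 2
3.13.5 2
3.13.5 1
3.13.5 1
3.13.8 1
3.13.8 1
3.13.8 1
3.13.8 1
3.12.16 1
3.12.16 1
3.12.16 1
3.13.8 2
3.13.8 1
3.13.8 1
3.12.16 2
3.12.16 1
3.12.16 1'

# Count how often each scenario occurred per kernel series (3.12 vs 3.13).
tally() {
    awk '{ split($1, v, "[.]"); n[v[1] "." v[2] " Scenario-" $2]++ }
         END { for (k in n) print k, n[k] }' | sort
}

printf '%s\n' "$boots" | tally
# 3.12 Scenario-1 6
# 3.12 Scenario-2 4
# 3.13 Scenario-1 9
# 3.13 Scenario-2 3
```

Both series produce both orderings, which suggests the ordering varies from boot to boot rather than with the kernel version.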


Please note what this data does not show: 3.13.x _never_ triggers
a new fan speed when needed (higher/lower). 3.12.x _always_ does, at
least after hitting a higher active temp trigger!

Manuel Krause
Comment 17 Lan Tianyu 2014-04-18 02:19:52 UTC
ping Rui ... Please have a look at this bug.
Comment 18 Manuel Krause 2014-04-18 19:08:23 UTC
There have been additional steps in the meantime, but unfortunately no solution so far.

You can read the related postings to lkml e.g. with:
http://marc.info/?l=linux-kernel&w=2&r=1&s=dangerous+fan+policy&q=b

Best regards, Manuel Krause
Comment 19 Pohjoistuuli 2014-04-21 16:47:55 UTC
Hi!

I have a similar problem on an HP ProBook 4510s (Firmware F.20, Intel T3000) running 64-bit kernel 3.13.9 (Kubuntu) or 64-bit kernel 3.13 - 3.14 on Arch. I noticed that the regulation of the fan (not necessarily the fan itself!) stops after boot. So, if the system is cold, the fan is running at 0% (= off) or at 20% (which is an unusual number, as the fan speed usually rises in 15% steps on this hardware). On reboot, when the machine is warm, fan speeds of 30% or 45% are often observed, depending on the CPU temperature at boot time. After booting, the fan speed does not change anymore and stays constant. So, when the machine was started cold, the fan is off until the temperature reaches critical values and then runs at 90% (= full speed) until the temperature drops. It then goes off again completely. This is not nice, as the cooling might not be sufficient and my machine may shut down hard. Such behaviour is also not in line with the idea of a 'laptop', because the machine gets so hot that I don't want to leave it on my lap, to avoid burning myself :-)
According to my interpretation, the system ignores all active trip points, but reacts to the passive and critical trip points.

I also found a not-so-perfect workaround after some trial and error with boot parameters: passing 'thermal.tzp=1' (or any other higher number) to the kernel at boot time restores the temperature-dependent fan speed regulation (unloading and reloading thermal with the tzp parameter does not help). Unfortunately, this workaround comes with the trade-off of two or three kworker processes that consume up to the full capacity of one CPU, which makes the system sluggish and raises power consumption.
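
For context, the kernel parameter documentation describes thermal.tzp as the ACPI thermal zone polling rate in deciseconds, so tzp=1 asks for a poll every 0.1 s, which would be consistent with the kworker load described above. A minimal sketch for checking the value in effect follows; the function name tzp_interval is illustrative, and the sysfs path is an assumption about where the running kernel exposes the parameter.

```shell
# Hypothetical helper: report the polling interval implied by thermal.tzp.
# The default path is an assumption; pass another file for testing.
tzp_interval() {
    param_file=${1:-/sys/module/thermal/parameters/tzp}
    tzp=$(cat "$param_file" 2>/dev/null) || { echo "tzp not readable" >&2; return 1; }
    if [ "$tzp" -gt 0 ] 2>/dev/null; then
        # tzp is given in tenths of a second
        printf 'polling every %d.%d s\n' $((tzp / 10)) $((tzp % 10))
    else
        echo "tzp=$tzp: no override set"
    fi
}

# Usage on a live system: tzp_interval
```

A larger value such as thermal.tzp=50 (one poll every 5 s) might keep the regulation working at a lower CPU cost, but that is untested here.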

I hope that this info on the problem helps finding a real fix, which would be appreciated.

Regards, Thomas
Comment 20 Manuel Krause 2014-04-23 01:23:56 UTC
(In reply to Pohjoistuuli from comment #19)
[...]
> I have similar problem on HP ProBook 4510s (Firmware F.20, Intel T3000)
> running 64bit-kernel 3.13.9 (Kubuntu) or 64-bit kernel 3.13 - 3.14 on arch.
[...]

@Pohjoistuuli // Thomas
Your machine has the same symptoms as mine with 3.13.x +
Have you tried a 3.12.y kernel of your distro (or even vanilla)?

BTW, you can set a fan level at runtime or via a startup script with, e.g., "echo 1 > /sys/class/thermal/cooling_device3/cur_state" (my favourite). cooling_device6 is the lowest of the cooling_device entries representing fan speed knobs. Just try.
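
A slightly more defensive version of that manual override checks the requested state against the device's max_state before writing. This is only a sketch: the function name set_cooling_state is illustrative, and which cooling_device index maps to which fan level is machine-specific.

```shell
# Hypothetical helper: write a cooling state only if it is within 0..max_state.
# Arguments: cooling device directory, wanted state.
set_cooling_state() {
    cdev_dir=$1
    wanted=$2
    case $wanted in
        ''|*[!0-9]*) echo "state must be a non-negative integer" >&2; return 1 ;;
    esac
    max=$(cat "$cdev_dir/max_state") || return 1
    if [ "$wanted" -gt "$max" ]; then
        echo "state $wanted out of range 0..$max" >&2
        return 1
    fi
    echo "$wanted" > "$cdev_dir/cur_state"
}

# Usage on a live system (as root):
#   set_cooling_state /sys/class/thermal/cooling_device3 1
```

Note that a manual write like this only sets the state once; it does not restore automatic regulation.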

@ Rui Zhang
I don't want this to be handled as a HP-Laptop-only problem, as 3.12.x is able to serve the fans and temps appropriately.

Best regards, Manuel
Comment 21 Zhang Rui 2014-04-23 06:40:27 UTC
(In reply to Pohjoistuuli from comment #19)
> Hi!
> 
> I have similar problem on HP ProBook 4510s (Firmware F.20, Intel T3000)
> running 64bit-kernel 3.13.9 (Kubuntu) or 64-bit kernel 3.13 - 3.14 on arch.
> I remarked that the regulation of the fan (not necessarily the fan itself!)
> stops after boot. So, if the system is cold, the fan is running at 0% (=
> off) or at 20% (which is an unusual number as the fan speed rises usually in
> 15% steps on this hardware).
> On reboot, when the machine is warm, fan speeds of 30% or 45% are often
> observed depending on the CPU temperature at boot time. After booting, the
> fan speed does not change anymore and keeps constant. So, when the machine
> was started cold, the fan is off until the temperature reaches critical
> values and runs then with 90% (= full speed) until the temperature drops. It
> goes then off again completely.

I've seen exactly the same behavior on one of my test laptops.
The problem there is that ACPICA cannot handle certain kinds of AML code well, and the fix for that problem ships in 3.13-rc1.
So the symptom I've seen is not a regression; it exists in all previous Linux releases.
Anyway, please attach the acpidump of your machine, so that I can check whether it is the same AML problem.

BTW, it would be nice if you could try a 3.12 kernel to verify whether this is a regression or not.
Comment 22 Zhang Rui 2014-04-23 07:06:02 UTC
(In reply to Manuel Krause from comment #16)
> There are generally only two different re-occurring scenarios for
> "grep . /sys/class/thermal/thermal_zone*/cdev*/device/path", so that I 
> want to abbreviate them in the following:
> 
> Scenario-1:
> # grep . /sys/class/thermal/thermal_zone*/cdev*/device/path
> /sys/class/thermal/thermal_zone1/cdev0/device/path:\_PR_.CPU1
> /sys/class/thermal/thermal_zone1/cdev1/device/path:\_PR_.CPU0
> /sys/class/thermal/thermal_zone1/cdev2/device/path:\_TZ_.FAN0
> /sys/class/thermal/thermal_zone1/cdev3/device/path:\_TZ_.FAN1
> /sys/class/thermal/thermal_zone1/cdev4/device/path:\_TZ_.FAN2
> /sys/class/thermal/thermal_zone1/cdev5/device/path:\_TZ_.FAN3
> /sys/class/thermal/thermal_zone1/cdev6/device/path:\_TZ_.FAN4
> /sys/class/thermal/thermal_zone2/cdev0/device/path:\_PR_.CPU1
> /sys/class/thermal/thermal_zone2/cdev1/device/path:\_PR_.CPU0
> /sys/class/thermal/thermal_zone3/cdev0/device/path:\_PR_.CPU1
> /sys/class/thermal/thermal_zone3/cdev1/device/path:\_PR_.CPU0
> 
> Scenario-2:
> # grep . /sys/class/thermal/thermal_zone*/cdev*/device/path
> /sys/class/thermal/thermal_zone1/cdev0/device/path:\_TZ_.FAN4
> /sys/class/thermal/thermal_zone1/cdev1/device/path:\_TZ_.FAN3
> /sys/class/thermal/thermal_zone1/cdev2/device/path:\_TZ_.FAN2
> /sys/class/thermal/thermal_zone1/cdev3/device/path:\_TZ_.FAN1
> /sys/class/thermal/thermal_zone1/cdev4/device/path:\_TZ_.FAN0
> /sys/class/thermal/thermal_zone1/cdev5/device/path:\_PR_.CPU1
> /sys/class/thermal/thermal_zone1/cdev6/device/path:\_PR_.CPU0
> /sys/class/thermal/thermal_zone2/cdev0/device/path:\_PR_.CPU1
> /sys/class/thermal/thermal_zone2/cdev1/device/path:\_PR_.CPU0
> /sys/class/thermal/thermal_zone3/cdev0/device/path:\_PR_.CPU1
> /sys/class/thermal/thermal_zone3/cdev1/device/path:\_PR_.CPU0
> 
> Already, during bisecting this issue, I've found out, that these scenarios
> have something to do with rebooting: So, I've rebooted the new bisected
> kernel
> twice in the second roundup.
> But I haven't expected the following disorder:
> 
> This is a row of results from last night, rebooting different kernels, one
> after the other, and capturing some relevant data.
> 
> 
> # uname -r
> 3.12.16
> # grep . /sys/class/thermal/thermal_zone*/cdev*/device/path
> Scenario-2
> 
> # uname -r
> 3.13.8
> # grep . /sys/class/thermal/thermal_zone*/cdev*/device/path
> Scenario-2
> 
> # uname -r
> 3.13.8
> # grep . /sys/class/thermal/thermal_zone*/cdev*/device/path
> Scenario-1
> 
> # uname -r
> 3.12.13
> # grep . /sys/class/thermal/thermal_zone*/cdev*/device/path
> Scenario-2
> 
> # uname -r
> 3.12.13
> # grep . /sys/class/thermal/thermal_zone*/cdev*/device/path
> Scenario-1
> 
> # uname -r
> 3.12.13
> # grep . /sys/class/thermal/thermal_zone*/cdev*/device/path
> Scenario-2
> 
I suppose these 3.12.13 kernels are exactly the same kernel without any rebuilding, right?
Could you please change your config file to always build in the ACPI thermal and fan drivers, and see if this problem still exists?
Comment 23 Jan Kriho 2014-04-23 12:21:14 UTC
(In reply to Zhang Rui from comment #21)
> (In reply to Pohjoistuuli from comment #19)
> > Hi!
> > 
> > I have similar problem on HP ProBook 4510s (Firmware F.20, Intel T3000)
> > running 64bit-kernel 3.13.9 (Kubuntu) or 64-bit kernel 3.13 - 3.14 on arch.
> > I remarked that the regulation of the fan (not necessarily the fan itself!)
> > stops after boot. So, if the system is cold, the fan is running at 0% (=
> > off) or at 20% (which is an unusual number as the fan speed rises usually in
> > 15% steps on this hardware).
> > On reboot, when the machine is warm, fan speeds of 30% or 45% are often
> > observed depending on the CPU temperature at boot time. After booting, the
> > fan speed does not change anymore and keeps constant. So, when the machine
> > was started cold, the fan is off until the temperature reaches critical
> > values and runs then with 90% (= full speed) until the temperature drops. It
> > goes then off again completely.
> 
> I've seen exactly the same behavior on one of my test laptop.
> And the problem is that ACPICA can not handle some kind of AML code well,
> PLUS, the fix for the problem ships in 3.13-rc1.
> So the symptom I've seen is not a regression and exists in all Linux
> previous release.
> Anyway, please attach the acpidump of your machine, so that I can check if
> they are the same AML problem.
> 
> BTW, it would be nice if you can try 3.12 kernel to verify if this is a
> regression or not.

I can confirm having the same problem with HP Compaq 6830s -- the fan is off until temperature reaches critical, then runs full speed. When the temperature drops below 8x °C, the fan stops completely. This is happening both on 3.13 and 3.14

3.12 works fine

I'll post my acpidump when I get to the machine. Are there any more listings you are interested in?
Comment 24 Jernej Jakob 2014-04-23 15:19:11 UTC
These symptoms are exactly the ones I am experiencing. Please see comment 13 and my post to the mailing list: http://lkml.iu.edu//hypermail/linux/kernel/1404.0/02012.html

I have disassembled the DSDT from my machine, fixed most errors and warnings and tried booting with this one, but no change. I haven't dumped the other tables yet, but I will post them when I do.

3.12 is what was on this laptop until now (Ubuntu Saucy), then everything worked fine. No other changes, no fan control utilities, no negative temperatures (checked with lm-sensors). Just stock installs...
Comment 25 E.Glorg 2014-04-23 22:17:31 UTC
Got the same bug on Debian 7.4 with kernel 3.13-0, HP 4310s laptop. While 3.12 kernels worked correctly, after installing 3.13 the fan went off after boot and turned on only when the temperature reached 80 C, and then at very high speed. After cooling to ~75 C the fan went off again. The only thing I can state now is that this bug seems to be chipset-independent: it shows up on AMD and Intel laptops and even on an old Athlon-based desktop box.
Comment 26 Manuel Krause 2014-04-23 23:20:01 UTC
(In reply to Zhang Rui from comment #22)
> (In reply to Manuel Krause from comment #16)
> > There are generally only two different re-occurring scenarios for
> > "grep . /sys/class/thermal/thermal_zone*/cdev*/device/path", so that I 
> > want to abbreviate them in the following:
> > 
> > Scenario-1:
> > # grep . /sys/class/thermal/thermal_zone*/cdev*/device/path
> > /sys/class/thermal/thermal_zone1/cdev0/device/path:\_PR_.CPU1
> > /sys/class/thermal/thermal_zone1/cdev1/device/path:\_PR_.CPU0
> > /sys/class/thermal/thermal_zone1/cdev2/device/path:\_TZ_.FAN0
> > /sys/class/thermal/thermal_zone1/cdev3/device/path:\_TZ_.FAN1
> > /sys/class/thermal/thermal_zone1/cdev4/device/path:\_TZ_.FAN2
> > /sys/class/thermal/thermal_zone1/cdev5/device/path:\_TZ_.FAN3
> > /sys/class/thermal/thermal_zone1/cdev6/device/path:\_TZ_.FAN4
> > /sys/class/thermal/thermal_zone2/cdev0/device/path:\_PR_.CPU1
> > /sys/class/thermal/thermal_zone2/cdev1/device/path:\_PR_.CPU0
> > /sys/class/thermal/thermal_zone3/cdev0/device/path:\_PR_.CPU1
> > /sys/class/thermal/thermal_zone3/cdev1/device/path:\_PR_.CPU0
> > 
> > Scenario-2:
> > # grep . /sys/class/thermal/thermal_zone*/cdev*/device/path
> > /sys/class/thermal/thermal_zone1/cdev0/device/path:\_TZ_.FAN4
> > /sys/class/thermal/thermal_zone1/cdev1/device/path:\_TZ_.FAN3
> > /sys/class/thermal/thermal_zone1/cdev2/device/path:\_TZ_.FAN2
> > /sys/class/thermal/thermal_zone1/cdev3/device/path:\_TZ_.FAN1
> > /sys/class/thermal/thermal_zone1/cdev4/device/path:\_TZ_.FAN0
> > /sys/class/thermal/thermal_zone1/cdev5/device/path:\_PR_.CPU1
> > /sys/class/thermal/thermal_zone1/cdev6/device/path:\_PR_.CPU0
> > /sys/class/thermal/thermal_zone2/cdev0/device/path:\_PR_.CPU1
> > /sys/class/thermal/thermal_zone2/cdev1/device/path:\_PR_.CPU0
> > /sys/class/thermal/thermal_zone3/cdev0/device/path:\_PR_.CPU1
> > /sys/class/thermal/thermal_zone3/cdev1/device/path:\_PR_.CPU0
> > 
> > Already, during bisecting this issue, I've found out, that these scenarios
> > have something to do with rebooting: So, I've rebooted the new bisected
> > kernel
> > twice in the second roundup.
> > But I haven't expected the following disorder:
> > 
> > This is a row of results from last night, rebooting different kernels, one
> > after the other, and capturing some relevant data.
> > 
> > 
> > # uname -r
> > 3.12.16
> > # grep . /sys/class/thermal/thermal_zone*/cdev*/device/path
> > Scenario-2
> > 
> > # uname -r
> > 3.13.8
> > # grep . /sys/class/thermal/thermal_zone*/cdev*/device/path
> > Scenario-2
> > 
> > # uname -r
> > 3.13.8
> > # grep . /sys/class/thermal/thermal_zone*/cdev*/device/path
> > Scenario-1
> > 
> > # uname -r
> > 3.12.13
> > # grep . /sys/class/thermal/thermal_zone*/cdev*/device/path
> > Scenario-2
> > 
> > # uname -r
> > 3.12.13
> > # grep . /sys/class/thermal/thermal_zone*/cdev*/device/path
> > Scenario-1
> > 
> > # uname -r
> > 3.12.13
> > # grep . /sys/class/thermal/thermal_zone*/cdev*/device/path
> > Scenario-2
> > 
> I suppose these 3.12.13 kernel are the exactly the same kernel without any
> rebuilding, right?

Yes, of course, without rebuilding. Only re-/booting previously built kernels, to show you the obvious differences after rebooting.

> could you please change your config file and always build in the ACPI
> thermal and fan driver and see if this problem still exists?

I've done so for a 3.12.13 kernel and a 3.13.11.

We'd get a new Scenario-3:
# grep . /sys/class/thermal/thermal_zone*/cdev*/device/path
/sys/class/thermal/thermal_zone1/cdev0/device/path:\_PR_.CPU1
/sys/class/thermal/thermal_zone1/cdev1/device/path:\_PR_.CPU0
/sys/class/thermal/thermal_zone1/cdev2/device/path:\_TZ_.FAN4
/sys/class/thermal/thermal_zone1/cdev3/device/path:\_TZ_.FAN3
/sys/class/thermal/thermal_zone1/cdev4/device/path:\_TZ_.FAN2
/sys/class/thermal/thermal_zone1/cdev5/device/path:\_TZ_.FAN1
/sys/class/thermal/thermal_zone1/cdev6/device/path:\_TZ_.FAN0
/sys/class/thermal/thermal_zone2/cdev0/device/path:\_PR_.CPU1
/sys/class/thermal/thermal_zone2/cdev1/device/path:\_PR_.CPU0
/sys/class/thermal/thermal_zone3/cdev0/device/path:\_PR_.CPU1
/sys/class/thermal/thermal_zone3/cdev1/device/path:\_PR_.CPU0
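
The three orderings differ only in the sequence of devices bound to thermal_zone1, so the grep output can be reduced to a scenario label automatically. A minimal sketch, assuming the output format shown above; the function name classify_scenario is illustrative.

```shell
# Hypothetical helper: read the grep output on stdin and name the scenario
# by the order of devices bound to thermal_zone1.
classify_scenario() {
    sig=$(awk -F: '/thermal_zone1\/cdev/ {
              n = split($NF, p, "[.]")   # "\_TZ_.FAN4" -> last part "FAN4"
              printf "%s ", p[n]
          }')
    case "$sig" in
        "CPU1 CPU0 FAN0 FAN1 FAN2 FAN3 FAN4 ") echo "Scenario-1" ;;
        "FAN4 FAN3 FAN2 FAN1 FAN0 CPU1 CPU0 ") echo "Scenario-2" ;;
        "CPU1 CPU0 FAN4 FAN3 FAN2 FAN1 FAN0 ") echo "Scenario-3" ;;
        *) echo "unknown: $sig" ;;
    esac
}

# Usage on a live system:
#   grep . /sys/class/thermal/thermal_zone*/cdev*/device/path | classify_scenario
```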

As new comparison:
fan, thermal, processor as MODULES; sequentially rebooted same kernel:
3.12.17 - 1. boot: Scenario-1
3.12.17 - 2. boot: Scenario-1
3.12.17 - 3. boot: Scenario-2
3.12.17 - 4. boot: Scenario-2
3.12.17 - 5. boot: Scenario-2
3.12.17 - 6. boot: Scenario-1

fan, thermal, processor as BUILT IN:
3.12.13 - 6 sequential reboots: all Scenario-3

fan, thermal, processor as BUILT IN:
3.13.11 - 6 sequential reboots: all Scenario-3

After that config change 3.12 still works fine / 3.13 still FAILS:

In my opinion, this has nothing to do with the original fan / trip point problem. But fine, if you can fix this little bug, too, in addition. ;-)

Best regards, Manuel Krause
Comment 27 Pohjoistuuli 2014-04-27 07:16:41 UTC
(In reply to Zhang Rui from comment #21)
> (In reply to Pohjoistuuli from comment #19)

Sorry for answering quite late. I am usually busy during the week, and testing this is surprisingly time-consuming (waiting for the system to have the right start temperature and then waiting for it to rise, etc.). I now use tmon, which makes testing the thermal behaviour of laptops much easier. It is also quite a handy tool for regulating the fan speed. I usually raise the CPU temperature with 'openssl speed'. Finding this 'technique' sped up testing considerably.
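
Where tmon is not available, a rough equivalent of that test procedure is to log the zone temperatures while a load such as 'openssl speed' runs in another terminal. A minimal sketch; the function name read_temps is illustrative, and it assumes non-negative readings in millidegrees Celsius, as in the tables earlier in this thread.

```shell
# Hypothetical helper: print each thermal zone's temperature in degrees C.
# ROOT defaults to the real sysfs location; pass another directory for testing.
read_temps() {
    root=${1:-/sys/class/thermal}
    for f in "$root"/thermal_zone*/temp; do
        [ -r "$f" ] || continue                  # glob matched nothing
        milli=$(cat "$f")                        # e.g. 25600
        zone=$(basename "$(dirname "$f")")
        printf '%s %d.%d\n' "$zone" $((milli / 1000)) $((milli % 1000 / 100))
    done
}

# Usage on a live system, while 'openssl speed' runs elsewhere:
#   while sleep 5; do date +%T; read_temps; done
```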

> Anyway, please attach the acpidump of your machine, so that I can check if
> they are the same AML problem.

The acpidump is now on my hard drive, but I did not find a function to attach a file to this message. I also ran a check with fwts on my machine (on Ubuntu 14.04). fwts reported problems in the DSDT. I can also provide this log if needed (and when I know how ;-).
 
> BTW, it would be nice if you can try 3.12 kernel to verify if this is a
> regression or not.

I have checked out ArchLinux kernels 3.12.9-2 and 3.13.1-1. 3.12.9-2 runs fine and 3.13.1-1 does not regulate the fan speed when passing an active trip point temperature. Other ArchLinux kernels that I have tested so far are 3.10.37-1 (lts), which works fine, and 3.14.1-1 (today's kernel), which does not regulate the fan speed.

Some other remarks:
- I can confirm Manuel's observations regarding cdev*_trip_point. I can also see all three numbering versions on my laptop (version 3 on Ubuntu 14.04, which has the ACPI routines compiled into the kernel). tmon does not have any problems with this; under kernels 3.10, 3.12, 3.13 and 3.14 it shows the same setup and works without any differences. Additionally, checking dmesg did not reveal any relevant differences between 3.12 and 3.13 to me.
- My machine has a thermal zone GFXZ (acpitz0), which is not connected to any hardware because my computer has only chipset graphics. The 'temperature' is constant at 16'C. Is this perhaps a problem in this context? Is the ACPI system looking only at the wrong thermal zone?
- The behaviour of my machine is different when on battery and when on AC. The reason for this is a BIOS setting, which affects the lowest fan speed level. On battery, it is always 0% rpm (= completely off). When on AC, it is possible to choose in the BIOS between 0% rpm (like when on battery) or 20% rpm as the minimum value (my setup). This difference between AC and battery made noticing this error quite difficult in the beginning.
- For cooling my machine at normal CPU load, 20-30% rpm are often sufficient. Under full load, the CPU temperature rarely exceeds 60'C when the fans are running with 45% of max. rpm. Therefore, problems with overheating and fan regulation were first quite confusing.
- tmon is really nice - including the user interface!!!
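
Since the behaviour differs between AC and battery, it may help to record the power source next to each temperature reading. A minimal sketch using the power_supply class in sysfs; the function name ac_status is illustrative, and the supply directory names vary between machines.

```shell
# Hypothetical helper: print "AC", "battery" or "unknown" based on the first
# power supply of type "Mains" found under ROOT.
ac_status() {
    root=${1:-/sys/class/power_supply}
    for d in "$root"/*; do
        [ -r "$d/type" ] || continue
        [ "$(cat "$d/type")" = "Mains" ] || continue
        if [ "$(cat "$d/online")" = "1" ]; then
            echo "AC"
        else
            echo "battery"
        fi
        return 0
    done
    echo "unknown"                               # no Mains supply found
}

# Usage on a live system: ac_status
```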

Thanks for looking into this, Thomas
Comment 28 Manuel Krause 2014-04-28 18:03:46 UTC
Created attachment 134061 [details]
acpidump HP Compaq 6730b

Maybe an acpidump from my machine can help?

@Pohjoistuuli / Thomas: At the top, above the comments and below the header of this bugzilla page, there is the box "Attachment" with the function to add one. (I also needed a while to find it.) ;-)

I hope there's still someone working on this bug?!
Regards, Manuel
Comment 29 Manuel Krause 2014-04-29 04:54:04 UTC
And kernel 3.15.0-rc2 also fails in (all) the same way(s). Regards, Manuel
Comment 30 Rafael J. Wysocki 2014-04-30 21:35:40 UTC
Rui, care to prepare a revert of commit cc8ef5270734 (ACPI / AC: convert ACPI ac driver to platform bus) on top of 3.15-rc3 so that Manuel can test it?
Comment 31 Manuel Krause 2014-04-30 21:45:53 UTC
Rui, best for me would be a patch to apply to some released kernels, as I don't want to go bisecting again for nothing. Thx!
Comment 32 Rafael J. Wysocki 2014-04-30 22:26:13 UTC
It would be most useful to us to know if the revert on top of the current mainline (that is, 3.15-rc3) works, though.  If it doesn't, we need to look somewhere else anyway.
Comment 33 Manuel Krause 2014-04-30 22:36:30 UTC
O.K. You're right, indeed. 3.15-rc3 is here. So, please: Give me a patch!!!
Comment 34 Manuel Krause 2014-05-07 01:26:49 UTC
Without any patch from you... :-(

3.14.3 fails and
3.15.0-rc4 fails, too.
Comment 35 Guenter Roeck 2014-05-07 02:17:38 UTC
I'll send a compile-tested-only patch in a minute. For the Brave ...
Comment 36 Rafael J. Wysocki 2014-05-07 10:11:22 UTC
Patch to test: https://patchwork.kernel.org/patch/4124871/

Thanks Guenter!
Comment 37 Rafael J. Wysocki 2014-05-07 10:32:47 UTC
Created attachment 135301 [details]
ACPI / AC: Use proper name for netlink event generation

Manuel, if the Guenter's patch from the previous comment helps, can you please check if this one helps too?
Comment 38 Manuel Krause 2014-05-07 22:21:19 UTC
Thank you both for finally providing something to test!!! :-)))

I've now tested the two variants with 3.15.0-rc4, they apply && compile fine. (For now only with the thermal, fan and processor _built into_ the kernel.)

Guenters reverting patch works !!!
Rafaels does not, it does not change fan speeds when passing the trip point temperatures.

And now?
Comment 39 Rafael J. Wysocki 2014-05-07 22:23:46 UTC
Well, I'll queue up the revert for 3.15 and then we'll need to figure out what was wrong with that commit.

Thanks!
Comment 40 Manuel Krause 2014-05-07 23:09:38 UTC
Oh, and in the meantime I've patched my 3.14.3 with Guenters reverting patch (with some fuzzes and offsets o.k.) -- and it also works very well!

I'll stay tuned to this bug -- and would still like to help you figure it out.

Best regards to all participants, Manuel
Comment 41 Manuel Krause 2014-05-20 22:52:30 UTC
Created attachment 136881 [details]
Guenter Roecks patch adapted for a 3.14.4 vanilla kernel

Unfortunately I haven't seen anyone add Guenters reverting patch to 3.14.x kernels so far.
So I'd like to post something adapted for 3.14.4. Only cosmetic changes were needed from Guenters original version for 3.15-rcX. And, yes, it works here.
Comment 42 Guenter Roeck 2014-05-21 00:45:23 UTC
Unless I am missing something, the patch is not yet upstream, so we can not back-port it to 3.14.
Comment 43 Angelo Compagnucci 2014-05-23 13:22:15 UTC
Just compiled and installed kernel 3.15-rc6 on my Intel ICH9 laptop. The problem still remains and it's very dangerous.

With this kernel at least the fan runs at a very low speed, but it doesn't follow temperature changes, so the temperature can easily rise to 80C.

So this is not resolved for me.
Comment 44 Guenter Roeck 2014-05-23 13:41:47 UTC
Quite surprising, because 3.15-rc6 does include the fix,
as tested by Manuel.

Manuel, any chance you can re-test with 3.15-rc6 ?
Comment 45 Angelo Compagnucci 2014-05-23 14:09:50 UTC
Hi Guenter,

My fault, I was running 3.15-rc5 instead of rc6! RC6 works wonderfully,
the fan runs more smoothly than with any previous kernel's thermal management.
There is only one hiccup: the fan never reaches 100% full speed. Even if the
temperature rises over 77C, the fan runs at 70% max.

I have to manually write 1 into
/sys/devices/virtual/thermal/cooling_device0/cur_state to cool the
CPU down to a normal level. This is particularly annoying when I'm
compiling, because I have to reissue the command occasionally.

Thank you for your support!

Comment 46 Manuel Krause 2014-05-23 21:56:21 UTC
(In reply to Guenter Roeck from comment #44)
> Quite surprising, because 3.15-rc6 does include the fix,
> as tested by Manuel.
> 
> Manuel, any chance you can re-test with 3.15-rc6 ?

Yes, I've just tested it -- and it works fine for me, as expected.

And, I'm not concerned about the temp. <-> fan levels as Angelo mentions. IIRC, this is the normal behaviour also known from kernels before 3.13 .

Thanks to you, Guenter!
Comment 47 Manuel Krause 2014-06-01 16:24:41 UTC
3.14.5 is out now... without this fix... Can someone of you sleepy guys, please, ... begin to... at least think of... bringing Guenters patch to the so called "stable" kernel... finally ??!
My simply converted patch for 3.14.4 is still working with 3.14.5. See Comment 41.

This is quite a disappointing thread. Has someone begun to work on the original failure, i.e. why the conversion of AC to platform bus didn't work?

Thanks, Manuel
Comment 48 Guenter Roeck 2014-06-01 18:44:51 UTC
On Sun, Jun 01, 2014 at 04:24:41PM +0000, bugzilla-daemon@bugzilla.kernel.org wrote:
> https://bugzilla.kernel.org/show_bug.cgi?id=71711
> 
> --- Comment #47 from Manuel Krause <manuelkrause@netscape.net> ---
> 3.14.5 is out now... without this fix... Can someone of you sleepy guys,
> please, ... begin to... at least think of... bringing Guenters patch to the so
> called "stable" kernel... finally ??!
> My simply converted patch for 3.14.4 is still working with 3.14.5. See Comment
> 41.
> 
Please chill down. You do have a working solution, don't you ?

The 3.14 maintainer mentioned a couple of days ago that he has more than 200
patches pending for 3.14, on top of 3.14.5. Greg is doing an excellent job
maintaining the stable kernel releases. Calling him sleepy is, to say it very
politely, not appropriate.

> This is a quite disappointing thread. Has someone begun to work on the original
> failure, why the conversion of AC to platform bus didn't work?
> 

As far as I know, no one who actually helped fix your problem is getting paid
for this task, including me. Actually, I am specifically _not_ paid for anything
I do in the upstream kernel. In addition to that, it occurs to me that you are
most likely not paying anything to anyone for providing you support either.
You might want to consider adjusting your expectations a bit, or switch to a
pay-for-use operating system.

Having said that, Linux being an open source operating system, I am sure the
responsible maintainer would be happy to get a patch from you to fix the
original failure.

Thanks,
Guenter
Comment 49 Joonas Saarinen 2014-06-02 11:22:00 UTC
HP 2230s is also affected. A fresh kernel pulled from the Linus tree seems to work fine now.
Comment 50 Manuel Krause 2014-06-05 11:54:44 UTC
First, I want to apologize for my words in Comment 47. I'm not a native English speaker, so I obviously did not find the *right* words to express my disappointment with the progress of this thread since early 2014/03. And I felt that I should not "chill down" until this is included in the current kernel series.

Of course, I did NOT want to question the work of the people *working* on this bug, nor of those helping me to help resolve it for other people, too. Guenter is a great helper.

I don't think my disappointment is worth a discussion about paid support or anything related. IIRC, I have provided the needed info ASAP and also invested some of my spare time in your debugging work, just as you and others have. And I'd do it again in future, too.
Don't blame me for not having enough Linux programming knowledge, so far, to just provide a better "convert AC to platform bus" patch -- that's a bit inappropriate, too.
---
According to yesterday's message from Greg and a look at the stable queue: Guenter's revert patch would be included in 4.14.6.
---
Cheers!
And thank you for your understanding,

Manuel
Comment 51 Manuel Krause 2014-06-05 11:59:00 UTC
- revert patch would be included in 4.14.6.
+ revert patch would be included in 3.14.6.

Sorry for the typo.
Comment 52 Manuel Krause 2014-06-12 17:22:29 UTC
HOUSTON, WE'VE GOT A PROBLEM...

I don't know why I haven't tested it thoroughly so far... Maybe, due to the ambient temperatures and my usual workflow for testing this one, only aiming at high temperatures? (I used worldcommunitygrid to achieve this.)

This patch's settings DO NOT survive a SUSPEND TO DISK: the settings for the actually needed trip point <-> fan speed mapping are, unfortunately, forgotten afterwards?

For the suspend-to-disk case I've checked several kernels today:
3.15.0  pure vanilla                  NOGO
3.14.5  +BFQ +CK/BFS + revert patch   NOGO
3.14.6  +BFQ +CK/BFS +TuxOnIce        NOGO
3.14.7  +BFQ +CK/BFS +TuxOnIce        NOGO
3.12.18 +BFQ +CK/BFS                  NOGO

It's a pity, to bother you again,

any ideas?!

Best regards, Manuel
Comment 53 Guenter Roeck 2014-06-12 17:31:03 UTC
On Thu, Jun 12, 2014 at 05:22:29PM +0000, bugzilla-daemon@bugzilla.kernel.org wrote:
> https://bugzilla.kernel.org/show_bug.cgi?id=71711
> 
> --- Comment #52 from Manuel Krause <manuelkrause@netscape.net> ---
> HOUSTON, WE'VE GOT A PROBLEM...
> 
> I don't know why I haven't tested it thoroughly so far... Maybe, due to the
> ambient temperatures and my usual workflow for testing this one, only aiming at
> high temperatures? (I used worldcommunitygrid to achieve this.)
> 
> This patch's settings DO NOT survive a SUSPEND TO DISK: the settings for the
> actually needed trip point <-> fan speed mapping are, unfortunately, forgotten
> afterwards?
> 
> For the suspend-to-disk way I've checked several kernels, today,
> 3.15.0 pure vanilla            NOGO
> 3.14.5 +BFQ +CK/BFS + revert patch     NOGO
> 3.14.6 +BFQ +CK/BFS +TuxOnIce         NOGO
> 3.14.7 +BFQ +CK/BFS +TuxOnIce         NOGO
> 3.12.18 +BFQ +CK/BFS             NOGO
> 
> It's a pity, to bother you again,
> 
> any ideas?!
> 
Unless I am missing something, looks like a separate problem.
Does this work with any earlier kernels?

Guenter
Comment 54 Manuel Krause 2014-06-12 18:55:00 UTC
To be more accurate: the last trip point triggered before suspend seems to be taken as the next one to act on after resume. But there is no longer any correlation to the lower fan speeds. Is it lost, then?

When the temperature passes this trip point upwards, the fan goes to the related level. Going below it, the fan may drop to speed 0.

The higher-numbered cooling devices then show up as 0, see (B). (The cooling device numbers correspond to the fan's speed levels here, but in reverse order: 04 is 24% fan; 03 is 34%; 02 is 45%; 01 is 58%; 00 is 100%.)

This is what it looks like with the help of the "tmon" tool:

(A) At boot everything is ok (for all the mentioned kernels):

ID  Cooling Dev   Cur    Max   Thermal Zone Binding
00          Fan     0      1   │││││││││││ ││││*││││││ │││││││││││ │││││││││││ ││││││││││││
01          Fan     1      1   │││││││││││ │││*│││││││ │││││││││││ │││││││││││ ││││││││││││
02          Fan     1      1   │││││││││││ ││*││││││││ │││││││││││ │││││││││││ ││││││││││││
03          Fan     1      1   │││││││││││ │*│││││││││ │││││││││││ │││││││││││ ││││││││││││
04          Fan     1      1   │││││││││││ *││││││││││ │││││││││││ │││││││││││ ││││││││││││


(B) At resume NOT ok:

00          Fan     0      1   │││││││││││ ││││*││││││ │││││││││││ │││││││││││ ││││││││││││
01          Fan     1      1   │││││││││││ │││*│││││││ │││││││││││ │││││││││││ ││││││││││││
02          Fan     0      1   │││││││││││ ││*││││││││ │││││││││││ │││││││││││ ││││││││││││
03          Fan     0      1   │││││││││││ │*│││││││││ │││││││││││ │││││││││││ ││││││││││││
04          Fan     0      1   │││││││││││ *││││││││││ │││││││││││ │││││││││││ ││││││││││││


This affects suspend-to-RAM here, too.
(I already reported this symptom at the beginning of this thread, around Comment 3.)
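
The difference between the two tmon snapshots can be spelled out with a small Python sketch. It is only an illustration, not output from any kernel tool: the dictionaries are copied by hand from the Cur column of (A) and (B) and from the fan levels listed above, and the name dropped_levels is made up for this example.

```python
# Fan level per cooling device ID, as listed in this comment.
FAN_PERCENT = {"00": 100, "01": 58, "02": 45, "03": 34, "04": 24}

# "Cur" column of the two tmon snapshots, copied by hand.
CUR_AT_BOOT = {"00": 0, "01": 1, "02": 1, "03": 1, "04": 1}    # (A)
CUR_AT_RESUME = {"00": 0, "01": 1, "02": 0, "03": 0, "04": 0}  # (B)


def dropped_levels(before, after):
    """Return (device ID, fan percent) pairs that were on before and are off after."""
    return [
        (dev, FAN_PERCENT[dev])
        for dev in sorted(before)
        if before[dev] == 1 and after.get(dev, 0) == 0
    ]


if __name__ == "__main__":
    for dev, percent in dropped_levels(CUR_AT_BOOT, CUR_AT_RESUME):
        print(f"cooling device {dev} ({percent}% fan) is off after resume")
```

With the values above, this reports cooling devices 02, 03 and 04, i.e. the three lowest fan levels.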

@Guenter: Do I really need to dig out kernels from before 3.12? 

Best regards, Manuel
Comment 55 Zhang Rui 2014-06-13 04:20:55 UTC
First of all, this seems to be a different problem.
Could you please file a new bug, build the latest upstream kernel, say 3.15, boot it, and
1. attach the output of "grep . /sys/class/thermal/thermal_zone*/cdev*/device/path"
2. attach the output of "# grep . /sys/class/thermal/cdev*/device/path"
3. run "# echo 'module thermal_sys +fp' > /sys/kernel/debug/dynamic_debug/control"
4. reproduce the problem you showed in comment #54
5. attach the dmesg output and tmon output.
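
Steps 1 and 2 could also be collected with a small Python sketch like the following. It assumes the sysfs layout implied by the two glob patterns above and simply reports nothing for a pattern that matches no file; grep_dot and collect_thermal_paths are hypothetical helper names, and the output only roughly mimics "grep ." (it always prefixes the file name).

```python
import glob
import os


def grep_dot(pattern):
    """Roughly mimic `grep . <pattern>`: return "path:line" for every non-empty line."""
    results = []
    for path in sorted(glob.glob(pattern)):
        if not os.path.isfile(path):
            continue
        try:
            with open(path, "r", errors="replace") as handle:
                for line in handle:
                    line = line.rstrip("\n")
                    if line:
                        results.append(f"{path}:{line}")
        except OSError:
            # Some sysfs attributes cannot be read; skip them.
            continue
    return results


def collect_thermal_paths(root="/sys/class/thermal"):
    """Collect the outputs requested in steps 1 and 2, keyed by glob pattern."""
    patterns = [
        os.path.join(root, "thermal_zone*/cdev*/device/path"),
        os.path.join(root, "cdev*/device/path"),
    ]
    return {pattern: grep_dot(pattern) for pattern in patterns}
```

Calling collect_thermal_paths() and printing each list line by line gives text that can be attached to the bug in place of the two grep outputs.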
Comment 56 Joonas Saarinen 2014-06-13 11:40:07 UTC
For my hardware both suspend and hibernate are OK.
Comment 57 Manuel Krause 2014-06-13 18:48:44 UTC
(In reply to Zhang Rui from comment #55)
> First of all, this seems to be a different problem.
> could you please file a new bug, build the latest upstream kernel, say 3.15,
> boot and
> 1. attach the output of "grep .
> /sys/class/thermal/thermal_zone*/cdev*/device/path"
> 2. attach the output of "# grep . /sys/class/thermal/cdev*/device/path"
> 3. run "# echo 'module thermal_sys +fp' >
> /sys/kernel/debug/dynamic_debug/control"
> 4. reproduce the problem you showed in comment #54
> 5. attach the dmesg output and tmon output.

Thank you very much, for pointing out the details that would be helpful. Of course, I can file a new bug.
But before I'd do this -- could you, please, have a look at 
 https://bugzilla.kernel.org/show_bug.cgi?id=67101
 "weird fan control with 3.12, was ok in 3.9"
that I've found by coincidence. The symptoms seem to be the same (except for my system not needing to shut down, as the thermal's emergency cooling is very effective). Unfortunately the original poster didn't finish. 
What do you say?
Please, advise me, whether it would be better to revive that bug and add my additional info or to file a new one.

Thank you in advance, Manuel
Comment 58 Manuel Krause 2014-06-13 18:58:47 UTC
(In reply to Joonas Saarinen from comment #56)
> For my hardware both suspend and hibernate are OK.

Can you, please, tell me which BIOS version you're running? I'm running the one before the latest as the latest is only installable via Windows with much more addon software.

Mine is a: (excerpt from 'dmesg | grep BIOS')
DMI: Hewlett-Packard HP Compaq 6730b (KU489ET#ABD)/30DD, BIOS 68PDD Ver. F.17 12/02/2010

Thank you in advance, Manuel
Comment 59 Joonas Saarinen 2014-06-13 20:01:58 UTC
DMI: Hewlett-Packard HP 2230s /3037, BIOS 68PHU Ver. F.20 12/10/2011
Comment 60 Zhang Rui 2014-06-14 14:16:05 UTC
Manuel, please file a new bug.
Comment 61 Manuel Krause 2014-06-17 22:19:56 UTC
(In reply to Zhang Rui from comment #60)
> Manuel, please file a new bug.

A BIOS update from F.17 to F.20 did not have any effect.

Btw., some distro-specific bug reports (not written by me) wrongly point here.

I've now filed a new bug based on my Comment 52 and the following ones:
https://bugzilla.kernel.org/show_bug.cgi?id=78201

Thank you all for your guidance, 
Manuel
Comment 62 Oliver Joos 2014-07-04 09:52:10 UTC
Our 3 laptops Compaq nx8220 run Mint 17 and I just upgraded to 3.13.0-30. They are still affected. After resume they heat up to 100°C until cpu throttling occurs. A quite serious issue.

Jörg-Karl Bösner did a reverse-bisect and may have found the evil commit: https://launchpad.net/bugs/1312860

Please backport the fix also to 3.13.x, since this kernel is part of many "Long Term Support" distros.
Comment 63 Joonas Saarinen 2014-07-04 13:51:49 UTC
That 3.13.0-30 is an Ubuntu kernel and is always based on upstream 3.13.0 with Canonical's own selection of patches applied on top of it. From there the same kernel seems to trickle to Mint. So Ubuntu would have to apply the patch "ACPI / AC: convert ACPI ac driver to platform bus" to the 3.13.0-?? patch queue.
Comment 64 Manuel Krause 2014-07-04 19:58:26 UTC
I don't know if it's still valid, but the patch had been picked up by Kamal Mostafa, who has said that he maintains 3.13.y.z.
Patch: http://patchwork.ozlabs.org/patch/360895/

Maybe you'd also like to read
 https://wiki.ubuntu.com/Kernel/Dev/ExtendedStable
and https://lkml.org/lkml/2014/4/23/516

Best regards, 
Manuel Krause
Comment 65 Manuel Krause 2014-07-04 20:06:54 UTC
(In reply to Oliver Joos from comment #62)
> Our 3 laptops Compaq nx8220 run Mint 17 and I just upgraded to 3.13.0-30.
> They are still affected. After resume they heat up to 100°C until cpu
> throttling occurs. A quite serious issue.
> 
> Jörg-Karl Bösner did a reverse-bisect and may have found the evil commit:
> https://launchpad.net/bugs/1312860
> 
> Please backport the fix also to 3.13.x, since this kernel is part of many
> "Long Term Support" distros.

This bug only covers wrong fan speeds after booting.

For the issue of high temperatures without fan action after resume from disk/RAM, please add your report to https://bugzilla.kernel.org/show_bug.cgi?id=78201.

Thank you in advance,
Manuel Krause
Comment 66 Joonas Saarinen 2014-07-05 08:51:46 UTC
> So Ubuntu would have to apply the patch "ACPI / AC: convert ACPI ac driver to
> platform bus" to the 3.13.0-?? patch queue.

Just to refine my message a bit...they obviously should apply the *revert* patch. :)

Here's also a direct link to the aforementioned "extended stable" Ubuntu kernel, where it already is reverted:

http://kernel.ubuntu.com/~kernel-ppa/mainline/v3.13.11.4-trusty/

But as Manuel says, in Oliver's case it might actually be a different bug if it reveals itself after suspend.
Comment 67 Joonas Saarinen 2014-07-05 15:10:51 UTC
Looking at this changelog, the revert patch seems to be already part of upcoming 3.13.0-31 Ubuntu kernel.

https://launchpad.net/ubuntu/trusty/+source/linux/+changelog
Comment 68 jose_wojnacki 2015-01-18 14:46:02 UTC
I'm running Archlinux (up to date) and I'm having the same issue. With the latest kernel 3.18.2 the problem is still there for me. Every time I unplug the ac power the laptop fan stops until temp reaches 84°C and then ramps down to 74°C with full speed fan.
With kernel 3.11.4 I have no problem at all.
Do you guys still have this issue?
Comment 69 Manuel Krause 2015-01-19 20:40:05 UTC
There are currently many HP/Compaq notebook owners having problems with kernel 3.18.x. We are waiting for Zhang Rui to wake up from his winter sleep & him to catch up. See: https://bbs.archlinux.org/viewtopic.php?id=192255&p=2 (Read from the first page to get full info, and, some people on there don't handle the full fan speed value correctly.)

Most probably you will need to file a new bug; I would then add my logs to it soon.

Best regards,
 Manuel
Comment 70 Manuel Krause 2015-01-19 20:53:00 UTC
You can also have a look at https://bugzilla.kernel.org/show_bug.cgi?id=78201, if that's something regarding your fan problem.

Best regards, Manuel
Comment 71 step-ali 2015-01-19 23:25:43 UTC
For my hardware, the problem seems to be resolved by installing the latest beta of OS X 10.10.2; it has a firmware update that solves the issue under Linux and Windows.
Hope this helps.
Comment 72 step-ali 2015-01-30 01:06:47 UTC
Nope, the problem is still there:

The temperature is fine, around 35 to 40 C,

but the fan kicks up from 2000 to 5900 rpm and then back to 4100,

while CPU utilization is 10-13%.

kernel : 3.18.4
os : Archlinux
Hardware : Macbook Air 2013
Comment 73 step-ali 2015-02-16 19:13:25 UTC
PLEASE RESPOND.

The problem is solved by updating to a new firmware with OS X 10.10.2.

In Linux 3.19, the patch you have made makes the laptop very noisy, with the fans spinning at a very high rpm.

In Linux 3.14.33, everything is fine (thermal, fan rpm).

So can you please kindly revert or remove the patch, as it's not necessary any more after the OS X 10.10.2 update.
Comment 74 Zhang Rui 2015-02-17 09:12:41 UTC
(In reply to step-ali from comment #73)
> PLEASE RESPOND ,
> 
> the problem is solved by updating to a new firmware with osx 10.10.2 ,
> 
> in linux 3.19 , the patch you have made make the laptop very noisy & fans
> 
> spinning at a very high rpm .
> 
Which patch are you referring to?

> in linux 3.14.33 , everything is fine ( thermal , fan rpm ) ,
> 
> 
> so can you please kindly revert or remove the patch , as it's not necessary
> any 
> 
> more after osx 10.10.2 update .
Comment 75 step-ali 2015-02-20 18:27:16 UTC
(In reply to Zhang Rui from comment #74)
> (In reply to step-ali from comment #73)
> > PLEASE RESPOND ,
> > 
> > the problem is solved by updating to a new firmware with osx 10.10.2 ,
> > 
> > in linux 3.19 , the patch you have made make the laptop very noisy & fans
> > 
> > spinning at a very high rpm .
> > 
> which patch are you referring to?
> 
> > in linux 3.14.33 , everything is fine ( thermal , fan rpm ) ,
> > 
> > 
> > so can you please kindly revert or remove the patch , as it's not necessary
> > any 
> > 
> > more after osx 10.10.2 update .

The patch that made the fans spin harder.

All I know is that on 3.19 there is no heat, but the fans spin at high rpm at 10-15% CPU utilization.

On 3.14.33 there is heat up to 89 C, and the fans don't spin up at the same CPU utilization.
Comment 76 Zhang Rui 2015-03-03 13:23:15 UTC
(In reply to step-ali from comment #75)
> (In reply to Zhang Rui from comment #74)
> > (In reply to step-ali from comment #73)
> > > PLEASE RESPOND ,
> > > 
> > > the problem is solved by updating to a new firmware with osx 10.10.2 ,
> > > 
> > > in linux 3.19 , the patch you have made make the laptop very noisy & fans
> > > 
> > > spinning at a very high rpm .
> > > 
> > which patch are you referring to?
> > 
> > > in linux 3.14.33 , everything is fine ( thermal , fan rpm ) ,
> > > 
> > > 
> > > so can you please kindly revert or remove the patch , as it's not necessary
> > > any 
> > > 
> > > more after osx 10.10.2 update .
> 
> the patch that made the fans spin harder ,
> 
step-ali,
Actually, I don't know which patch introduces this problem.
But there are indeed some bug reports complaining that the fan speed never changes after boot, since 3.18.
So can you please refer to bug #93301 and check whether it is the same commit (6ab3430129e258ea31dd214adf1c760dfafde67a) that introduces this problem for you?
Comment 77 step-ali 2015-03-03 14:53:25 UTC
(In reply to Zhang Rui from comment #76)
> (In reply to step-ali from comment #75)
> > (In reply to Zhang Rui from comment #74)
> > > (In reply to step-ali from comment #73)
> > > > PLEASE RESPOND ,
> > > > 
> > > > the problem is solved by updating to a new firmware with osx 10.10.2 ,
> > > > 
> > > > in linux 3.19 , the patch you have made make the laptop very noisy & fans
> > > > 
> > > > spinning at a very high rpm .
> > > > 
> > > which patch are you referring to?
> > > 
> > > > in linux 3.14.33 , everything is fine ( thermal , fan rpm ) ,
> > > > 
> > > > 
> > > > so can you please kindly revert or remove the patch , as it's not necessary
> > > > any 
> > > > 
> > > > more after osx 10.10.2 update .
> > 
> > the patch that made the fans spin harder ,
> > 
> step-ali,
> actually, I don't think which patch introduces this problem.
> But there is indeed some bug report complaining that the fan speed never
> changes after boot, since 3.18.
> so can you please refer to bug #93301 and check if it is the same commit
> (6ab3430129e258ea31dd214adf1c760dfafde67a) that introduces this problem for
> you?

I don't think so.

Before 3.18 we had high CPU utilization (25 to 30%), which was fixed by the recent Apple OS X 10.10.2 update (the problem had been worked around temporarily by disabling some GPE), but there wasn't any fan or heat problem.

After the OS X 10.10.2 update (which came during Linux 3.18), the fan spins up to a very high rpm on very little CPU utilization (watching a video in Chrome) and then spins down when idling.

On 3.14.33 it's the reverse: the fan doesn't spin up, but the temperature rises to 90 degrees Celsius (also while watching videos in Chrome), which is harmful to the laptop.

The solution would be something in the middle.

BUT PLEASE HURRY, MY MACHINE IS FRYING.
Comment 78 Zhang Rui 2015-03-14 13:41:48 UTC
Please
1. rebuild your kernel with the patches at https://bugzilla.kernel.org/show_bug.cgi?id=78201#c150 applied.
2. run echo 'module thermal_sys +fp' > /sys/kernel/debug/dynamic_debug/control after boot
3. attach the dmesg output after the problem is reproduced.
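
For reference, step 2 amounts to writing a query string to the dynamic debug control file. A minimal Python sketch, assuming debugfs is mounted at /sys/kernel/debug, the kernel was built with dynamic debug support, and the script runs as root; enable_dynamic_debug is a hypothetical helper name, not an existing tool:

```python
DEFAULT_CONTROL = "/sys/kernel/debug/dynamic_debug/control"


def enable_dynamic_debug(queries, control=DEFAULT_CONTROL):
    """Write each dynamic debug query to the control file, one write per query."""
    for query in queries:
        with open(control, "w") as handle:
            handle.write(query + "\n")


# Usage on the affected machine (as root), equivalent to the echo in step 2:
#     enable_dynamic_debug(["module thermal_sys +fp"])
```

The extra messages then show up in the dmesg output requested in step 3.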
Comment 79 step-ali 2015-03-15 14:36:25 UTC
(In reply to Zhang Rui from comment #78)
> Please
> 1. rebuild your kernel with the patches at
> https://bugzilla.kernel.org/show_bug.cgi?id=78201#c150 applied.
> 2. run echo 'module thermal_sys +fp' >
> /sys/kernel/debug/dynamic_debug/control after boot
> 3. attach the dmesg output after the problem is reproduced.

Sorry, I don't know how to merge a patch and compile.

After weeks of testing, it looks like another firmware issue that needs an update from Apple, like the gpe66 issue, because the issue (high temperature) also occurs in Windows.

When I first bought the laptop it ran fine with Linux. I wish I had never updated OS X; I never use it anyway.

I will submit a bug report to Apple and see what happens.
Comment 80 Zhang Rui 2015-03-16 02:55:48 UTC
As there is a firmware update, can you please try the 3.14 kernel again with your new firmware?
(In reply to step-ali from comment #77)
> (In reply to Zhang Rui from comment #76)
> > (In reply to step-ali from comment #75)
> > > (In reply to Zhang Rui from comment #74)
> > > > (In reply to step-ali from comment #73)
> > > > > PLEASE RESPOND ,
> > > > > 
> > > > > the problem is solved by updating to a new firmware with osx 10.10.2 ,
> > > > > 
> > > > > in linux 3.19 , the patch you have made make the laptop very noisy & fans
> > > > > 
> > > > > spinning at a very high rpm .
> > > > > 
> > > > which patch are you referring to?
> > > > 
> > > > > in linux 3.14.33 , everything is fine ( thermal , fan rpm ) ,
> > > > > 
> > > > > 
> > > > > so can you please kindly revert or remove the patch , as it's not necessary
> > > > > any 
> > > > > 
> > > > > more after osx 10.10.2 update .
> > > 
> > > the patch that made the fans spin harder ,
> > > 
> > step-ali,
> > actually, I don't think which patch introduces this problem.
> > But there is indeed some bug report complaining that the fan speed never
> > changes after boot, since 3.18.
> > so can you please refer to bug #93301 and check if it is the same commit
> > (6ab3430129e258ea31dd214adf1c760dfafde67a) that introduces this problem for
> > you?
> 
> I don't think so ,
> 
> before 3.18 we had a high cpu utilization (25 to 30%) that was fixed by
> recent 
> 
> apple osx 10.10.2 update , ( the problem was solved temporarily by disabling 
> 
> some gpe ) but there wasn't any fan or heat problem .
> 
> 
> After the osx 10.10.2 update ( was during linux 3.18 ) the fan spins up (
> very 
> 
> high rpm )on very little cpu utilization ( watching a video in chrome ) &
> then 
> 
> spins down when idling .
> 
> 
> on 3.14.33 it's the reverse , the fan doesn't spin up but the temperature
> rises 
> 
> to 90 degree celsius ( also while watching videos on chrome ) , which is 
> 
> harmful to the laptop.
> 
Did you get this symptom with the updated firmware?
Comment 81 step-ali 2015-03-16 03:01:08 UTC
It's after kernel 3.18 and the OS X firmware update.
Comment 82 Zhang Rui 2015-03-24 07:43:19 UTC
> > 
> > on 3.14.33 it's the reverse , the fan doesn't spin up but the temperature
> > rises 
> > 
> > to 90 degree celsius ( also while watching videos on chrome ) , which is 
> > 
> > harmful to the laptop.
> > 
> is this symptom got with updated firmware?

I mean: did you get this symptom with the 3.14 kernel after the firmware was updated?

Please do the following test on a 4.0-rc kernel:
1. apply the patches at
https://patchwork.kernel.org/patch/6077231/
https://patchwork.kernel.org/patch/6077241/
https://patchwork.kernel.org/patch/6077251/
2. apply the two patches that I will attach after this comment
3. after the build, boot with the kernel parameters module.dyndbg="module thermal_sys +fp" dyndbg="file thermal_core.c +fp; file step_wise.c +fp"
4. attach the acpidump output of your MacBook
5. attach the output of "grep . /sys/class/thermal/*/*/path" after boot
6. attach the dmesg output after the bug is reproduced
7. attach the output of "grep . /sys/class/thermal/thermal*/*" after the bug is reproduced
Comment 83 Zhang Rui 2015-03-24 07:45:47 UTC
Created attachment 171921 [details]
patch 4
Comment 84 Zhang Rui 2015-03-24 07:46:04 UTC
Created attachment 171931 [details]
patch-5
Comment 85 Zhang Rui 2015-03-29 13:16:12 UTC
ping...
Comment 86 step-ali 2015-03-29 14:24:57 UTC
(In reply to Zhang Rui from comment #82)
> > > 
> > > on 3.14.33 it's the reverse , the fan doesn't spin up but the temperature
> > > rises 
> > > 
> > > to 90 degree celsius ( also while watching videos on chrome ) , which is 
> > > 
> > > harmful to the laptop.
> > > 
> > is this symptom got with updated firmware?
> 
> I mean did you get this symptom with 3.14 kernel, after firmware updated?
> 
> Please do the following test on 4.0-rc kernel
> 1. apply the patches at
> https://patchwork.kernel.org/patch/6077231/
> https://patchwork.kernel.org/patch/6077241/
> https://patchwork.kernel.org/patch/6077251/
> 2. please apply the two patches attached later
> 3. after build, please boot with kernel parameter module.dyndbg="module
> thermal_sys +fp" dyndbg="file thermal_core.c +fp; file step_wise.c +fp"
> 4. attach the acpidump output of your mac book
> 5. attach the output of "grep . /sys/class/thermal/*/*/path" after boot
> 6. attach the dmesg output after the bug reproduced
> 7. attach the output of "grep . /sys/class/thermal/thermal*/*" after the bug
> reproduced

Yes, the symptom is there after the firmware update, on 3.14 LTS and 3.19.
Comment 87 E.Glorg 2015-03-29 16:15:42 UTC
I upgraded to kernel 3.16 from the Debian Jessie repos. I performed no firmware upgrade, just upgraded the OS. Strange, but the problem is gone.
Here's uname output:
$ uname -srvom
Linux 3.16.0-4-amd64 #1 SMP Debian 3.16.7-ckt7-1 (2015-03-01) x86_64 GNU/Linux
$ cat /etc/issue
Debian GNU/Linux 8
Installation of 3.16 on Debian 7.x still gives that old problem.
Comment 88 step-ali 2015-03-29 16:29:32 UTC
(In reply to E.Glorg from comment #87)
> Upgraded to kernel 3.16 from Debian Jessie repos. Having performed no
> firmware upgrade, just upgraded OS. Strange, but problem has gone.
> Here's uname output:
> $ uname -srvom
> Linux 3.16.0-4-amd64 #1 SMP Debian 3.16.7-ckt7-1 (2015-03-01) x86_64
> GNU/Linux
> $ cat /etc/issue
> Debian GNU/Linux 8
> Installation of 3.16 on Debian 7.x still gives that old problem.

Yep, upgrade to any kernel above 3.17 and you will have the problem again.
Comment 89 step-ali 2015-03-29 19:11:10 UTC
Ha,

I discovered something strange:

under Wayland everything runs normally at 69 degrees Celsius (where it was around 85-90 under Xorg), and the fan rpm is 1300 (5000 under Xorg), all using the same kernel 3.14.37 LTS on Archlinux.

Could it be an xorg-server issue??!

If so, then how come the problem disappears with kernels below 3.17??!
Comment 90 Zhang Rui 2015-03-30 00:58:49 UTC
(In reply to step-ali from comment #86)
> (In reply to Zhang Rui from comment #82)
> > > > 
> > > > on 3.14.33 it's the reverse , the fan doesn't spin up but the temperature
> > > > rises 
> > > > 
> > > > to 90 degree celsius ( also while watching videos on chrome ) , which is 
> > > > 
> > > > harmful to the laptop.
> > > > 
> > > is this symptom got with updated firmware?
> > 
> > I mean did you get this symptom with 3.14 kernel, after firmware updated?
> > 
> > Please do the following test on 4.0-rc kernel
> > 1. apply the patches at
> > https://patchwork.kernel.org/patch/6077231/
> > https://patchwork.kernel.org/patch/6077241/
> > https://patchwork.kernel.org/patch/6077251/
> > 2. please apply the two patches attached later
> > 3. after build, please boot with kernel parameter module.dyndbg="module
> > thermal_sys +fp" dyndbg="file thermal_core.c +fp; file step_wise.c +fp"
> > 4. attach the acpidump output of your mac book
> > 5. attach the output of "grep . /sys/class/thermal/*/*/path" after boot
> > 6. attach the dmesg output after the bug reproduced
> > 7. attach the output of "grep . /sys/class/thermal/thermal*/*" after the bug
> > reproduced
> 
> yes , the symptom is htere after firmware update on 3.14 lts & 3.19

Please do the test and attach the debug information requested above.
Comment 91 Zhang Rui 2015-04-13 05:24:12 UTC
ping...
Comment 92 step-ali 2015-04-19 18:36:10 UTC
Sorry, I don't know how to apply patches to the kernel,

but the problem is still there with kernel 4.0.
Comment 93 Zhang Rui 2015-05-04 04:58:21 UTC
Do you know how to build a customized kernel?
Please download the patches, run "patch -p1 < foo.patch" to apply each of them in ascending order, and then build the kernel.
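
Applying several patches in ascending order could be scripted roughly as in the Python sketch below. It is an illustration only: the file names in the usage note are made-up examples, natural_key, patch_commands and apply_patches are hypothetical helper names, and it uses "patch -p1 -i FILE", which reads the patch from FILE instead of standard input.

```python
import re
import subprocess


def natural_key(name):
    """Sort key that orders 'patch-10' after 'patch-5' (numbers compare as numbers)."""
    return [int(part) if part.isdigit() else part for part in re.split(r"(\d+)", name)]


def patch_commands(patch_files):
    """Return one `patch -p1 -i FILE` command per patch, in ascending order."""
    return [["patch", "-p1", "-i", path] for path in sorted(patch_files, key=natural_key)]


def apply_patches(patch_files, source_dir):
    """Run the commands inside the kernel source tree, stopping at the first failure.

    Pass absolute patch paths, because the commands run with source_dir as
    their working directory.
    """
    for command in patch_commands(patch_files):
        subprocess.run(command, cwd=source_dir, check=True)
```

For example, patch_commands(["/tmp/patch-5.patch", "/tmp/patch-4.patch"]) lists the command for patch-4 first; apply_patches would then run them from the top of the kernel source directory before the build.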
Comment 94 Zhang Rui 2015-05-25 04:10:16 UTC
ping...
Comment 95 Zhang Rui 2015-06-03 08:08:11 UTC
Bug closed, as we cannot make any more progress without the bug reporter's response and help.
Please feel free to reopen it if you can build a customized kernel to help debug the issue.
Comment 96 step-ali 2015-06-03 23:45:10 UTC
(In reply to Zhang Rui from comment #95)
> bug closed as we can more make any progress w/o bug reporter' response and
> help.
> Please feel free to reopen it if you can build customized kernel to help
> debug the issue.

Sorry, I just don't have the time to build a customized kernel.

I will test with 4.1.
