Bug 56591

Summary: HP/Compaq NC6000 fan speed regression on 3.7+
Product: Power Management Reporter: Ville Syrjala (syrjala)
Component: ThermalAssignee: Zhang Rui (rui.zhang)
Status: CLOSED CODE_FIX    
Severity: normal CC: auxsvr, jake, kernel-bugs, rjw, rui.zhang
Priority: P1    
Hardware: All   
OS: Linux   
Kernel Version: 3.7+ Subsystem:
Regression: Yes Bisected commit-id:
Attachments: Logs and dumps from 3.7.10 and 3.9-rc6 kernels
the output of acpidump for HP 2510p laptop
debug patch to do a thermal_zone_device_update after reading temperature
debug patch to check cooling state transition in step_wise governor
kernel log with debug patches
patch: do not always return THERMAL_TREND_RAISING for ACPI active cooling
dmesg with patch from comment #13
patch v2: do not always return THERMAL_TREND_RAISING for ACPI active cooling
dmesg with v2 patch from comment #15
patch v3: do not always return THERMAL_TREND_RAISING for ACPI active cooling
dmesg with v3 patch from comment #17
patch v4: do not always return THERMAL_TREND_RAISING for ACPI active cooling
dmesg with v4 patch from comment #20
debug oatch to check acpi thermal_get_trend
dmesg with debug patch from comment #23
Patch to fix temp units before comparison
Patch to make sure coolong devices turn off when temp drops to trip point
dmesg with patches from comments #25 and #26
patch v5: do not always return THERMAL_TREND_RAISING for ACPI active cooling
patch for upstream

Description Ville Syrjala 2013-04-13 23:45:12 UTC
Created attachment 98531 [details]
Logs and dumps from 3.7.10 and 3.9-rc6 kernels

There's a regression in thermal subsystem in 3.7 kernel, which results in fan speed staying after during resume.

Other people have seen the problem, although it may be that the exact nature of the bug is a bit different between different machines.

Here are the mailing list references:
https://lkml.org/lkml/2013/3/2/171
https://lkml.org/lkml/2013/4/13/142

For me on 3.7.10 the symptoms of the bug depend on whether I resume the laptop when it's totally cooled down, or still warm. When the bug occurs, the reported temperature somehow goes out of sync with the trip points, and the cooling device cur_state doesn't always match the fan speed.

On 3.9-rc6, there's actually another regression, which causes the fan speed to misbehave already without any suspend/resume cycles. What happens is that the fan speeds up when the temperature rises, but when the temperature comes back down, the fan doesn't slow down. When cooling down, the temperature of the trip points does seem to change due to hysteresis, so some activity happens when the trip points are reached, but somehow the cooling devices don't get turned off. The cooling device cur_state accurately reflects the fan speed in this case.

I also tested
git://git.kernel.org/pub/scm/linux/kernel/git/rzhang/linux.git thermal
but it appears to behave exactly like 3.9-rc6.

I've attached a tarball with logs from both bug scenarios on a 3.7.10 kernel. I've included logs from a 3.9-rc6 kernel as well, but as stated you don't actually need to suspend/resume to trigger the misbehavior on 3.9-rc6. The "desc.txt" files inside the tarball describe the steps I took when I collected the logs. The tarball also contains acpidump output.
Comment 1 Jake Edge 2013-04-14 12:38:08 UTC
Created attachment 98571 [details]
the output of acpidump for HP 2510p laptop
Comment 2 Jake Edge 2013-04-14 12:42:24 UTC
I have seen the same basic behavior on all kernels > 3.6.11 on an HP/Compaq 2510p laptop (acpidump attached as requested).  After a resume, the fan spins up to full speed and stays there, no matter what the system is doing.  "temp6" in "acpitz-virtual-0" shows +100°C and stays there, which is presumably why the fan stays on.  In earlier kernels, I have seen some odd behavior with temp6, but always 50° or lower (including some amusing 0° readings).  I have verified this with 3.9-rc6 and tried Rui's thermal branch yesterday (13 April 2013) without success.
Comment 3 Zhang Rui 2013-04-14 16:29:27 UTC
(In reply to comment #0)
> Created an attachment (id=98531) [details]
> Logs and dumps from 3.7.10 and 3.9-rc6 kernels
> 
> There's a regression in thermal subsystem in 3.7 kernel, which results in fan
> speed staying after during resume.
> 
> Other people have seen the problem, although it may be that the exact nature
> of
> the bug is a bit different between different machines.
> 
> Here are the mailing list references:
> https://lkml.org/lkml/2013/3/2/171
> https://lkml.org/lkml/2013/4/13/142
> 
> For me on 3.7.10 the symptoms of the bug depend on whether I resume the
> laptop
> when it's totally cooled down, or still warm.

According to Ville,
"My HP Compaq NC6000 exhibits slightly different behaviour. For me the
reported temp is always accurate but the trip points get out of sync
with the actual temperature.
With 3.7 I saw two different kinds of problems coming out of resume.
In one case the fan stays on until the temp rises high enough to get the
trip points back into sync. In the other case the fan goes off, and stays
off even when the temperature rises above the highest active trip point,
but it does appear to get back into sync when the temp starts to come back
down. I think the difference might stem from resuming when the laptop has
cooled down fully vs. when it's still warm."
Comment 4 Zhang Rui 2013-04-14 17:26:22 UTC
Created attachment 98621 [details]
debug patch to do a thermal_zone_device_update after reading temperature

please apply this patch on top of thermal -thermal branch and do a "grep . /sys/class/thermal/*/*" after resume and see if it helps.
Comment 5 Ville Syrjala 2013-04-14 22:31:36 UTC
(In reply to comment #4)
> Created an attachment (id=98621) [details]
> debug patch to do a thermal_zone_device_update after reading temperature
> 
> please apply this patch on top of thermal -thermal branch and do a "grep .
> /sys/class/thermal/*/*" after resume and see if it helps.

No change with that patch. As I stated before, I don't even need to suspend the laptop to make 3.9+ misbehave (3.7 only started to misbehave after suspend). The reported temperature is always correct, but with 3.9+ the cooling devices turn on correctly when temperature rises, but they never turn off when the temperature drops back down.

I think Jake's bug is something different since his machine seems to report the wrong temperature. Maybe he should file another bug for it, so we don't get confused...
Comment 6 Zhang Rui 2013-04-15 00:31:12 UTC
please attach the kernel config file for 3.9-rc6.
Comment 7 Zhang Rui 2013-04-15 00:43:56 UTC
I just want to make sure that you're using the step_wise governor.
Comment 8 Zhang Rui 2013-04-15 01:01:47 UTC
Created attachment 98671 [details]
debug patch to check cooling state transition in step_wise governor

please apply this patch on top of thermal -thermal branch and previous patch.
and attach the dmesg output after grep . /sys/class/thermal/thermal*/temp, when the fan is running in full speed during idle.
Comment 9 Zhang Rui 2013-04-15 01:11:47 UTC
Jade,

can we please debug your issue in bug #56601?
I think your problem is a duplicate of that one. :)
Comment 10 Ville Syrjala 2013-04-15 13:55:15 UTC
(In reply to comment #6)
> please attach the kernel config file for 3.9-rc6.

The config is included in the tarball I attached.

But anyways, these are the thermal options I have:
#
# ACPI drivers
#
# CONFIG_SENSORS_ACPI_POWER is not set
# CONFIG_SENSORS_ATK0110 is not set
CONFIG_THERMAL=y
CONFIG_THERMAL_HWMON=y
CONFIG_THERMAL_DEFAULT_GOV_STEP_WISE=y
# CONFIG_THERMAL_DEFAULT_GOV_FAIR_SHARE is not set
# CONFIG_THERMAL_DEFAULT_GOV_USER_SPACE is not set
# CONFIG_THERMAL_GOV_FAIR_SHARE is not set
CONFIG_THERMAL_GOV_STEP_WISE=y
# CONFIG_THERMAL_GOV_USER_SPACE is not set
# CONFIG_CPU_THERMAL is not set
# CONFIG_THERMAL_EMULATION is not set
# CONFIG_INTEL_POWERCLAMP is not set
CONFIG_WATCHDOG=y
CONFIG_WATCHDOG_CORE=y
# CONFIG_WATCHDOG_NOWAYOUT is not set


I'll try the debug patch later tonight.
Comment 11 Ville Syrjala 2013-04-15 20:36:33 UTC
Created attachment 98821 [details]
kernel log with debug patches

Here's the dmesg log with -thermal + the two debug patches applied.

These are the steps I followed:

1. boot
<printk timestamp ~79 here>
2. run 'while true ; do true ; done'
   wait for fan to speed up 
<printk timestamp ~179 here>
3. kill while loop and wait
<print timestamp ~253 here>
4. grep /sys/class/thermal/*/temp
5. dmesg > dmesg.txt
Comment 12 Zhang Rui 2013-04-16 05:53:03 UTC
fist, I think it is the follow commit that introduces the regression.

commit 4ae46befb49d4173122e0afa995c4e93d01948a2
Author: Zhang Rui <rui.zhang@intel.com>
Date:   Wed Jun 27 10:05:39 2012 +0800

    Thermal: Introduce thermal_zone_trip_update()
    
    This function is used to update the cooling state of
    all the cooling devices that are bound to an active trip point.
    
    This will be used for passive cooling as well, in the future patches.
    as both active and passive cooling can share the same algorithm,
    which is
    
    1. if the temperature is higher than a trip point,
       a. if the trend is THERMAL_TREND_RAISING, use higher cooling
          state for this trip point
       b. if the trend is THERMAL_TREND_DROPPING, use lower cooling
          state for this trip point
    
    2. if the temperature is lower than a trip point, use lower
       cooling state for this trip point.
    
    Signed-off-by: Zhang Rui <rui.zhang@intel.com>
    Reviewed-by: Rafael J. Wysocki <rjw@sisk.pl>
    Reviewed-by: Eduardo Valentin <eduardo.valentin@ti.com>

here is the piece of code that actually cause the problem:
-       /* Only PASSIVE trip points need TREND */
-       if (type != THERMAL_TRIP_PASSIVE)
-               return -EINVAL;
+       if (type == THERMAL_TRIP_ACTIVE) {
+               /* aggressive active cooling */
+               *trend = THERMAL_TREND_RAISING;
+               return 0;
+       }

My original idea is that:
when temperature is higher than the trip point, keep the fan on, even if the temperature is dropping.

But the code will do:
when temperature is higher than the trip point, the fan keeps on.
when the temperature is lower than the trip point,
the fan is still on because thermal_get_trend in driver/acpi/thermal.c returns THERMAL_TREND_RAISING, thus the thermal_instance will never be deactived.
Comment 13 Zhang Rui 2013-04-16 08:11:25 UTC
Created attachment 98841 [details]
patch: do not always return THERMAL_TREND_RAISING for ACPI active cooling

please apply this patch on top and see if it helps.
If not, please attach the dmesg after the same test.
Comment 14 Ville Syrjala 2013-04-16 23:39:29 UTC
Created attachment 98891 [details]
dmesg with patch from comment #13

The fan speed increases as temp rises, but then when temp starts to drop, the fan gets turned off too early. It seems to go from max->off directly instead of stepping down.
Comment 15 Zhang Rui 2013-04-17 01:24:58 UTC
Created attachment 98901 [details]
patch v2: do not always return THERMAL_TREND_RAISING for ACPI active cooling

please try this refreshed patch and see if it helps.
Comment 16 Ville Syrjala 2013-04-17 07:27:43 UTC
Created attachment 98921 [details]
dmesg with v2 patch from comment #15

Unfortunately the v2 patch didn't help. Fan still gets turned off too early instead of stepping down.
Comment 17 Zhang Rui 2013-04-17 13:38:26 UTC
Created attachment 98961 [details]
patch v3: do not always return THERMAL_TREND_RAISING for ACPI active cooling

well, please help me try this refreshed one. and attach the dmesg output.
BTW, can you please verify if the fan spins down much  quicker than it did before, say, 3.6?
Comment 18 Ville Syrjala 2013-04-17 22:55:57 UTC
Created attachment 99101 [details]
dmesg with v3 patch from comment #17

With the v3 patch the fan doesn't come on at all.
Comment 19 Ville Syrjala 2013-04-17 23:21:41 UTC
(In reply to comment #17)
> well, please help me try this refreshed one. and attach the dmesg output.
> BTW, can you please verify if the fan spins down much  quicker than it did
> before, say, 3.6?

The fan speed directly matches cooling devices 3 to 0 (from lowest speed to highest). The difference in noise between to 3 to 1 is very clear, but between 1 and 0 it's more difficult to hear. The difference between fan off, and lowest fan speed (cdev3) can also be heard.

With a working kernel (even with 3.7, assuming I haven't suspended the laptop) the fan speed seems to follow the trip points exactly like this:
cdev3.cur_state = temp > trip5
cdev2.cur_state = temp > trip4
cdev1.cur_state = temp > trip3
cdev0.cur_state = temp > trip2

Well, except under normal circumstances the laptop won't reach trip_point_2 (80C). The temp normally stays below 70C even under heavy load.
Comment 20 Zhang Rui 2013-04-19 00:40:55 UTC
Created attachment 99251 [details]
patch v4: do not always return THERMAL_TREND_RAISING for ACPI active cooling

please try if this patch helps or not.
Comment 21 Zhang Rui 2013-04-19 00:44:47 UTC
(In reply to comment #19)
> (In reply to comment #17)
> > well, please help me try this refreshed one. and attach the dmesg output.
> > BTW, can you please verify if the fan spins down much  quicker than it did
> > before, say, 3.6?
> 
> The fan speed directly matches cooling devices 3 to 0 (from lowest speed to
> highest). The difference in noise between to 3 to 1 is very clear, but
> between
> 1 and 0 it's more difficult to hear. The difference between fan off, and
> lowest
> fan speed (cdev3) can also be heard.
> 
> With a working kernel (even with 3.7, assuming I haven't suspended the
> laptop)
> the fan speed seems to follow the trip points exactly like this:
> cdev3.cur_state = temp > trip5
> cdev2.cur_state = temp > trip4
> cdev1.cur_state = temp > trip3
> cdev0.cur_state = temp > trip2
> 
this is the behavior before 3.7.
But I want to make a few changes for this,
say,
cdev3.cur_state = temp >= trip5 ? 1 : (temperature is raising ? 1 : 0);
but I failed to handle the temperature stable case in the previous 3 fixes attached, which happens all the time on your machines because there are two thermal events within a second even if there is no temperature changes, according to the log you attached.

so please try this one which I think would be a real fix.
Comment 22 Ville Syrjala 2013-04-21 21:34:23 UTC
Created attachment 99551 [details]
dmesg with v4 patch from comment #20

The fan still turns off too early with the v4 patch.

Checking the cdev state from sysfs also agrees with this assesment. Even though the temperature can be above the lower trip points, all cdevs have cur_state=0.
Comment 23 Zhang Rui 2013-04-22 05:07:28 UTC
Created attachment 99571 [details]
debug oatch to check acpi thermal_get_trend

you're right.

[  398.254232] step_wise_throttle for zone 0, trip 5
[  398.254237] step_wise trips update
[  398.254242] trip temp 40000, type 0
[  398.254247] trend 2
[  398.254252] temperature 42000, throttle 1
[  398.254315] get_target_state: cur_state 0
[  398.254321] old_target 0, target 0
[  398.254327] cdev_update: cdev 3, zone 0, target 0
[  398.254333] set cdev 3 to state 0

the current temperature is 42C, while the trip point is 40C.
but trend is 2, which equals THERMAL_TREND_DROPPING.
this is definitely not what we want, because acpi thermal_get_trend should always return THERMAL_TREND_RAISING when temperature is above the trip point.

please apply this debug patch on top and attach the dmesg output.
I need to check why acpi thermal_get_trend returns THERMAL_TREND_DROPPING in this case.
Comment 24 Ville Syrjala 2013-04-22 17:13:54 UTC
Created attachment 99711 [details]
dmesg with debug patch from comment #23

Here's the log with the debug patch. I can immediately see that the problem is wrong units.
Comment 25 Ville Syrjala 2013-04-22 17:15:09 UTC
Created attachment 99721 [details]
Patch to fix temp units before comparison
Comment 26 Ville Syrjala 2013-04-22 17:16:19 UTC
Created attachment 99731 [details]
Patch to make sure coolong devices turn off when temp drops to trip point
Comment 27 Ville Syrjala 2013-04-22 17:18:17 UTC
Created attachment 99741 [details]
dmesg with patches from comments #25 and #26

With these two patches the fan seems to behave quite normally.

Note that I haven't tried suspend/resume yet, but I'll do that now.
Comment 28 Ville Syrjala 2013-04-22 21:12:41 UTC
Things seems mostly OK after suspend/resume as well.


I did hit one additional issue though. I did the while loop and let the temperature rise a bit, and then stopped the while loop. The fan started to slow down appropriately, but then for some reason I never saw any temperature notifications after the temp dropped to 43C. That obviously left the fan on at the lowest speed since the trip point is at 40C. I was monitoring the temp through sysfs (I had reverted the thermal_zone_device_update debug patch so that I can read the temp without disturbing the system). I waited until the temp had fallen to 30C but there still wasn't anything in the dmesg log after the 43C messages, and the fan was still on.

So I'm guessing either ACPI failed to notify us that the trip point was reached, or somehow the tz->temperature wasn't up to date (but I can't how this could happen since the temperature is updated as the first thing in thermal_zone_device_update()).
Comment 29 Zhang Rui 2013-04-23 02:11:40 UTC
Created attachment 99811 [details]
patch v5: do not always return THERMAL_TREND_RAISING for ACPI active cooling

this is the final version of the patch that I'll push upstream. please check if it works or not.
Comment 30 Zhang Rui 2013-04-23 02:13:54 UTC
(In reply to comment #27)

> With these two patches the fan seems to behave quite normally.
> 
Great, ville, thanks for your patch, that's really helpful.
As non of these patches has been upstream yet, I merged these three altogether, instead of separate incremental patches and push it to upstream.
Comment 31 Ville Syrjala 2013-04-23 09:21:14 UTC
(In reply to comment #30)
> (In reply to comment #27)
> 
> > With these two patches the fan seems to behave quite normally.
> > 
> Great, ville, thanks for your patch, that's really helpful.
> As non of these patches has been upstream yet, I merged these three
> altogether,
> instead of separate incremental patches and push it to upstream.

The patch looks good. I just applied it alone on top of the thermal branch to make sure none of the debug patches affected the behaviour, and then I performed the same tests again. Everything seems to be in order.

So feel free to add
Tested-by: Ville Syrjälä <syrjala@sci.fi>
to the patch.
Comment 32 Zhang Rui 2013-04-24 01:31:08 UTC
Created attachment 99861 [details]
patch for upstream
Comment 33 Zhang Rui 2013-04-24 01:31:58 UTC
Mark the bug as Resolved.
Will close this bug report when the patch in #32 hits upstream kernel.
Comment 34 Rafael J. Wysocki 2013-06-24 23:27:52 UTC
Fixed by 94a4093 "ACPI / thermal: do not always return THERMAL_TREND_RAISING for active trip points".