Bug 56591
Description
Ville Syrjala
2013-04-13 23:45:12 UTC
Created attachment 98571 [details]
the output of acpidump for HP 2510p laptop
I have seen the same basic behavior on all kernels > 3.6.11 on an HP/Compaq 2510p laptop (acpidump attached as requested). After a resume, the fan spins up to full speed and stays there, no matter what the system is doing. "temp6" in "acpitz-virtual-0" shows +100°C and stays there, which is presumably why the fan stays on. In earlier kernels, I have seen some odd behavior with temp6, but always 50° or lower (including some amusing 0° readings). I have verified this with 3.9-rc6 and tried Rui's thermal branch yesterday (13 April 2013) without success. (In reply to comment #0) > Created an attachment (id=98531) [details] > Logs and dumps from 3.7.10 and 3.9-rc6 kernels > > There's a regression in thermal subsystem in 3.7 kernel, which results in fan > speed staying after during resume. > > Other people have seen the problem, although it may be that the exact nature > of > the bug is a bit different between different machines. > > Here are the mailing list references: > https://lkml.org/lkml/2013/3/2/171 > https://lkml.org/lkml/2013/4/13/142 > > For me on 3.7.10 the symptoms of the bug depend on whether I resume the > laptop > when it's totally cooled down, or still warm. According to Ville, "My HP Compaq NC6000 exhibits slightly different behaviour. For me the reported temp is always accurate but the trip points get out of sync with the actual temperature. With 3.7 I saw two different kinds of problems coming out of resume. In one case the fan stays on until the temp rises high enough to get the trip points back into sync. In the other case the fan goes off, and stays off even when the temperature rises above the highest active trip point, but it does appear to get back into sync when the temp starts to come back down. I think the difference might stem from resuming when the laptop has cooled down fully vs. when it's still warm." Created attachment 98621 [details]
debug patch to do a thermal_zone_device_update after reading temperature
please apply this patch on top of thermal -thermal branch and do a "grep . /sys/class/thermal/*/*" after resume and see if it helps.
(In reply to comment #4) > Created an attachment (id=98621) [details] > debug patch to do a thermal_zone_device_update after reading temperature > > please apply this patch on top of thermal -thermal branch and do a "grep . > /sys/class/thermal/*/*" after resume and see if it helps. No change with that patch. As I stated before, I don't even need to suspend the laptop to make 3.9+ misbehave (3.7 only started to misbehave after suspend). The reported temperature is always correct, but with 3.9+ the cooling devices turn on correctly when temperature rises, but they never turn off when the temperature drops back down. I think Jake's bug is something different since his machine seems to report the wrong temperature. Maybe he should file another bug for it, so we don't get confused... please attach the kernel config file for 3.9-rc6. I just want to make sure that you're using the step_wise governor. Created attachment 98671 [details]
debug patch to check cooling state transition in step_wise governor
please apply this patch on top of thermal -thermal branch and previous patch.
and attach the dmesg output after grep . /sys/class/thermal/thermal*/temp, when the fan is running in full speed during idle.
Jade, can we please debug your issue in bug #56601? I think your problem is a duplicate of that one. :) (In reply to comment #6) > please attach the kernel config file for 3.9-rc6. The config is included in the tarball I attached. But anyways, these are the thermal options I have: # # ACPI drivers # # CONFIG_SENSORS_ACPI_POWER is not set # CONFIG_SENSORS_ATK0110 is not set CONFIG_THERMAL=y CONFIG_THERMAL_HWMON=y CONFIG_THERMAL_DEFAULT_GOV_STEP_WISE=y # CONFIG_THERMAL_DEFAULT_GOV_FAIR_SHARE is not set # CONFIG_THERMAL_DEFAULT_GOV_USER_SPACE is not set # CONFIG_THERMAL_GOV_FAIR_SHARE is not set CONFIG_THERMAL_GOV_STEP_WISE=y # CONFIG_THERMAL_GOV_USER_SPACE is not set # CONFIG_CPU_THERMAL is not set # CONFIG_THERMAL_EMULATION is not set # CONFIG_INTEL_POWERCLAMP is not set CONFIG_WATCHDOG=y CONFIG_WATCHDOG_CORE=y # CONFIG_WATCHDOG_NOWAYOUT is not set I'll try the debug patch later tonight. Created attachment 98821 [details]
kernel log with debug patches
Here's the dmesg log with -thermal + the two debug patches applied.
These are the steps I followed:
1. boot
<printk timestamp ~79 here>
2. run 'while true ; do true ; done'
wait for fan to speed up
<printk timestamp ~179 here>
3. kill while loop and wait
<print timestamp ~253 here>
4. grep /sys/class/thermal/*/temp
5. dmesg > dmesg.txt
fist, I think it is the follow commit that introduces the regression. commit 4ae46befb49d4173122e0afa995c4e93d01948a2 Author: Zhang Rui <rui.zhang@intel.com> Date: Wed Jun 27 10:05:39 2012 +0800 Thermal: Introduce thermal_zone_trip_update() This function is used to update the cooling state of all the cooling devices that are bound to an active trip point. This will be used for passive cooling as well, in the future patches. as both active and passive cooling can share the same algorithm, which is 1. if the temperature is higher than a trip point, a. if the trend is THERMAL_TREND_RAISING, use higher cooling state for this trip point b. if the trend is THERMAL_TREND_DROPPING, use lower cooling state for this trip point 2. if the temperature is lower than a trip point, use lower cooling state for this trip point. Signed-off-by: Zhang Rui <rui.zhang@intel.com> Reviewed-by: Rafael J. Wysocki <rjw@sisk.pl> Reviewed-by: Eduardo Valentin <eduardo.valentin@ti.com> here is the piece of code that actually cause the problem: - /* Only PASSIVE trip points need TREND */ - if (type != THERMAL_TRIP_PASSIVE) - return -EINVAL; + if (type == THERMAL_TRIP_ACTIVE) { + /* aggressive active cooling */ + *trend = THERMAL_TREND_RAISING; + return 0; + } My original idea is that: when temperature is higher than the trip point, keep the fan on, even if the temperature is dropping. But the code will do: when temperature is higher than the trip point, the fan keeps on. when the temperature is lower than the trip point, the fan is still on because thermal_get_trend in driver/acpi/thermal.c returns THERMAL_TREND_RAISING, thus the thermal_instance will never be deactived. Created attachment 98841 [details]
patch: do not always return THERMAL_TREND_RAISING for ACPI active cooling
please apply this patch on top and see if it helps.
If not, please attach the dmesg after the same test.
Created attachment 98891 [details] dmesg with patch from comment #13 The fan speed increases as temp rises, but then when temp starts to drop, the fan gets turned off too early. It seems to go from max->off directly instead of stepping down. Created attachment 98901 [details]
patch v2: do not always return THERMAL_TREND_RAISING for ACPI active cooling
please try this refreshed patch and see if it helps.
Created attachment 98921 [details] dmesg with v2 patch from comment #15 Unfortunately the v2 patch didn't help. Fan still gets turned off too early instead of stepping down. Created attachment 98961 [details]
patch v3: do not always return THERMAL_TREND_RAISING for ACPI active cooling
well, please help me try this refreshed one. and attach the dmesg output.
BTW, can you please verify if the fan spins down much quicker than it did before, say, 3.6?
Created attachment 99101 [details] dmesg with v3 patch from comment #17 With the v3 patch the fan doesn't come on at all. (In reply to comment #17) > well, please help me try this refreshed one. and attach the dmesg output. > BTW, can you please verify if the fan spins down much quicker than it did > before, say, 3.6? The fan speed directly matches cooling devices 3 to 0 (from lowest speed to highest). The difference in noise between to 3 to 1 is very clear, but between 1 and 0 it's more difficult to hear. The difference between fan off, and lowest fan speed (cdev3) can also be heard. With a working kernel (even with 3.7, assuming I haven't suspended the laptop) the fan speed seems to follow the trip points exactly like this: cdev3.cur_state = temp > trip5 cdev2.cur_state = temp > trip4 cdev1.cur_state = temp > trip3 cdev0.cur_state = temp > trip2 Well, except under normal circumstances the laptop won't reach trip_point_2 (80C). The temp normally stays below 70C even under heavy load. Created attachment 99251 [details]
patch v4: do not always return THERMAL_TREND_RAISING for ACPI active cooling
please try if this patch helps or not.
(In reply to comment #19) > (In reply to comment #17) > > well, please help me try this refreshed one. and attach the dmesg output. > > BTW, can you please verify if the fan spins down much quicker than it did > > before, say, 3.6? > > The fan speed directly matches cooling devices 3 to 0 (from lowest speed to > highest). The difference in noise between to 3 to 1 is very clear, but > between > 1 and 0 it's more difficult to hear. The difference between fan off, and > lowest > fan speed (cdev3) can also be heard. > > With a working kernel (even with 3.7, assuming I haven't suspended the > laptop) > the fan speed seems to follow the trip points exactly like this: > cdev3.cur_state = temp > trip5 > cdev2.cur_state = temp > trip4 > cdev1.cur_state = temp > trip3 > cdev0.cur_state = temp > trip2 > this is the behavior before 3.7. But I want to make a few changes for this, say, cdev3.cur_state = temp >= trip5 ? 1 : (temperature is raising ? 1 : 0); but I failed to handle the temperature stable case in the previous 3 fixes attached, which happens all the time on your machines because there are two thermal events within a second even if there is no temperature changes, according to the log you attached. so please try this one which I think would be a real fix. Created attachment 99551 [details] dmesg with v4 patch from comment #20 The fan still turns off too early with the v4 patch. Checking the cdev state from sysfs also agrees with this assesment. Even though the temperature can be above the lower trip points, all cdevs have cur_state=0. Created attachment 99571 [details]
debug oatch to check acpi thermal_get_trend
you're right.
[ 398.254232] step_wise_throttle for zone 0, trip 5
[ 398.254237] step_wise trips update
[ 398.254242] trip temp 40000, type 0
[ 398.254247] trend 2
[ 398.254252] temperature 42000, throttle 1
[ 398.254315] get_target_state: cur_state 0
[ 398.254321] old_target 0, target 0
[ 398.254327] cdev_update: cdev 3, zone 0, target 0
[ 398.254333] set cdev 3 to state 0
the current temperature is 42C, while the trip point is 40C.
but trend is 2, which equals THERMAL_TREND_DROPPING.
this is definitely not what we want, because acpi thermal_get_trend should always return THERMAL_TREND_RAISING when temperature is above the trip point.
please apply this debug patch on top and attach the dmesg output.
I need to check why acpi thermal_get_trend returns THERMAL_TREND_DROPPING in this case.
Created attachment 99711 [details] dmesg with debug patch from comment #23 Here's the log with the debug patch. I can immediately see that the problem is wrong units. Created attachment 99721 [details]
Patch to fix temp units before comparison
Created attachment 99731 [details]
Patch to make sure coolong devices turn off when temp drops to trip point
Created attachment 99741 [details]
dmesg with patches from comments #25 and #26
With these two patches the fan seems to behave quite normally.
Note that I haven't tried suspend/resume yet, but I'll do that now.
Things seems mostly OK after suspend/resume as well. I did hit one additional issue though. I did the while loop and let the temperature rise a bit, and then stopped the while loop. The fan started to slow down appropriately, but then for some reason I never saw any temperature notifications after the temp dropped to 43C. That obviously left the fan on at the lowest speed since the trip point is at 40C. I was monitoring the temp through sysfs (I had reverted the thermal_zone_device_update debug patch so that I can read the temp without disturbing the system). I waited until the temp had fallen to 30C but there still wasn't anything in the dmesg log after the 43C messages, and the fan was still on. So I'm guessing either ACPI failed to notify us that the trip point was reached, or somehow the tz->temperature wasn't up to date (but I can't how this could happen since the temperature is updated as the first thing in thermal_zone_device_update()). Created attachment 99811 [details]
patch v5: do not always return THERMAL_TREND_RAISING for ACPI active cooling
this is the final version of the patch that I'll push upstream. please check if it works or not.
(In reply to comment #27) > With these two patches the fan seems to behave quite normally. > Great, ville, thanks for your patch, that's really helpful. As non of these patches has been upstream yet, I merged these three altogether, instead of separate incremental patches and push it to upstream. (In reply to comment #30) > (In reply to comment #27) > > > With these two patches the fan seems to behave quite normally. > > > Great, ville, thanks for your patch, that's really helpful. > As non of these patches has been upstream yet, I merged these three > altogether, > instead of separate incremental patches and push it to upstream. The patch looks good. I just applied it alone on top of the thermal branch to make sure none of the debug patches affected the behaviour, and then I performed the same tests again. Everything seems to be in order. So feel free to add Tested-by: Ville Syrjälä <syrjala@sci.fi> to the patch. Created attachment 99861 [details]
patch for upstream
Mark the bug as Resolved. Will close this bug report when the patch in #32 hits upstream kernel. Fixed by 94a4093 "ACPI / thermal: do not always return THERMAL_TREND_RAISING for active trip points". |