Bug 58311 - Laptop fans turn on after suspend/resume and don't turn off
Summary: Laptop fans turn on after suspend/resume and don't turn off
Status: CLOSED PATCH_ALREADY_AVAILABLE
Alias: None
Product: Power Management
Classification: Unclassified
Component: Thermal (show other bugs)
Hardware: All Linux
: P1 high
Assignee: Zhang Rui
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2013-05-16 03:38 UTC by sonnyrao
Modified: 2015-12-06 22:54 UTC (History)
5 users (show)

See Also:
Kernel Version: 3.8.13
Subsystem:
Regression: Yes
Bisected commit-id:


Attachments
A thermal logging patch for 3.8 (3.39 KB, patch)
2013-05-23 00:46 UTC, sonnyrao
Details | Diff
Log for 3.8.13 across resume with previous patch (87.76 KB, application/octet-stream)
2013-05-23 01:24 UTC, sonnyrao
Details

Description sonnyrao 2013-05-16 03:38:18 UTC
See https://bugzilla.redhat.com/show_bug.cgi?id=895276 for more details

and additional thread on lkml:
http://lkml.indiana.edu/hypermail/linux/kernel/1305.1/03870.html

Summary is that post 3.7 up to 3.8.13, I see cooling devices corresponding to fan levels turn on after a resume from S3 and don't turn off. 

The bug appears to be fixed in 3.9.1 and on and 3.10-rc1

Copying my comment #32 from above bugzilla:

-------------------------------
I see this problem on the Samsung 550 Chromebook (on the Pixel we either have the EC auto-set fan speed, or have user-space communicate directly with the EC to control fan speeds)

It looks like this was introduced somewhere in this merge:


Merge: 125c4c7 c072fed                                                                                                  
Author:     Len Brown <len.brown@intel.com>                                                                             
AuthorDate: Tue Oct 9 01:35:52 2012 -0400                                                                               
Commit:     Len Brown <len.brown@intel.com>                                                                             
CommitDate: Tue Oct 9 01:35:52 2012 -0400                                                                               
                                                                                
    Merge branch 'release' of git://git.kernel.org/pub/scm/linux/kernel/git/rzhang/linux into thermal
                                                                             
    Conflicts:                                                                                                          
        drivers/staging/omap-thermal/omap-thermal-common.                     
                OMAP supplied dummy TC1 and TC2,                                                                        
                at the same time that the thermal tree removed them                                                     
                from thermal_zone_device_register()                                                                     
                                                                               
        drivers/thermal/cpu_cooling.c b/drivers/thermal/cpu_cooling.c                                                   
                propogate the upstream MAX_IDR_LEVEL re-name                                                            
                        to prevent a build failure                                                                      
                                                                                                                        
        Previously-fixed-by: Stephen Rothwell <sfr@canb.auug.org.au>                                                    
                                                                                                                        
    Signed-off-by: Len Brown <len.brown@intel.com>


but unfortunately, those patches within that merge branch aren't completing a suspend-resume for me, so I can't bisect down to a specific patch, but it's probably one of the ones to the generic thermal framework.

I will try to rebase those patches later and see which one is the culprit later.

---------------------------


This patch seemed to fix the problem in 3.9:

commit 94a409319561ec1847fd9bf996a2d5843ad00932
Author: Zhang Rui <rui.zhang@intel.com>
Date:   Fri Apr 26 09:19:53 2013 +0000

    ACPI / thermal: do not always return THERMAL_TREND_RAISING for active trip points
    
    Commit 4ae46be "Thermal: Introduce thermal_zone_trip_update()"
    introduced a regression causing the fan to be always on even when
    the system is idle.

however, I'm still seeing it in 3.8.13, which has that patch.
Comment 1 Zhang Rui 2013-05-20 03:05:35 UTC
(In reply to comment #0)
> See https://bugzilla.redhat.com/show_bug.cgi?id=895276 for more details
> 
> and additional thread on lkml:
> http://lkml.indiana.edu/hypermail/linux/kernel/1305.1/03870.html
> 
> Summary is that post 3.7 up to 3.8.13, I see cooling devices corresponding to
> fan levels turn on after a resume from S3 and don't turn off. 
> 
> The bug appears to be fixed in 3.9.1 and on and 3.10-rc1
> 
what do you mean?
the problem can not be reproduced with 3.10-rc1 kernel?
Comment 2 Zhang Rui 2013-05-20 03:29:19 UTC
(In reply to comment #1)
> (In reply to comment #0)
> > See https://bugzilla.redhat.com/show_bug.cgi?id=895276 for more details
> > 
> > and additional thread on lkml:
> > http://lkml.indiana.edu/hypermail/linux/kernel/1305.1/03870.html
> > 
> > Summary is that post 3.7 up to 3.8.13, I see cooling devices corresponding
> to
> > fan levels turn on after a resume from S3 and don't turn off. 
> > 
> > The bug appears to be fixed in 3.9.1 and on and 3.10-rc1
> > 
> what do you mean?
> the problem can not be reproduced with 3.10-rc1 kernel?

if the problem still exists. please do the same test as requested at https://bugzilla.kernel.org/show_bug.cgi?id=58301#c3
Comment 3 Zhang Rui 2013-05-20 09:22:48 UTC
(In reply to comment #2)
> 
> if the problem still exists. please do the same test as requested at
> https://bugzilla.kernel.org/show_bug.cgi?id=58301#c3

please try the refreshed patch in comment #6 and #7 in bug #58301, and see if it helps or not.
Comment 4 sonnyrao 2013-05-22 21:44:30 UTC
That patch doesn't seem to apply to my 3.8 tree:

patching file drivers/thermal/step_wise.c
Hunk #1 FAILED at 119.
Hunk #2 succeeded at 154 (offset -11 lines).
1 out of 2 hunks FAILED -- saving rejects to file drivers/thermal/step_wise.c.rej
can't find file to patch at input line 53
Perhaps you used the wrong -p or --strip option?
The text leading up to this was:
--------------------------
|diff --git a/drivers/thermal/thermal_core.c b/drivers/thermal/thermal_core.c
|index d755440..aeca923 100644
|--- a/drivers/thermal/thermal_core.c
|+++ b/drivers/thermal/thermal_core.c
Comment 5 sonnyrao 2013-05-23 00:46:57 UTC
Created attachment 102281 [details]
A thermal logging patch for 3.8

Here's a version of the logging patch that I created for 3.8
Comment 6 sonnyrao 2013-05-23 01:24:38 UTC
Created attachment 102301 [details]
Log for 3.8.13 across resume with previous patch

Here's a log of a 3.8.13 with the patch from 102281 enabled
Comment 7 sonnyrao 2013-05-23 01:27:42 UTC
This is the state after resume assosicated with logs in comment #6

(none) ~ # cat /sys/class/thermal/cooling_device[01234]/cur_state
1
1
1
0
1
Comment 8 Zhang Rui 2013-05-23 02:23:13 UTC
(In reply to comment #4)
> That patch doesn't seem to apply to my 3.8 tree:
> 
> patching file drivers/thermal/step_wise.c
> Hunk #1 FAILED at 119.
> Hunk #2 succeeded at 154 (offset -11 lines).
> 1 out of 2 hunks FAILED -- saving rejects to file
> drivers/thermal/step_wise.c.rej
> can't find file to patch at input line 53
> Perhaps you used the wrong -p or --strip option?
> The text leading up to this was:
> --------------------------
> |diff --git a/drivers/thermal/thermal_core.c b/drivers/thermal/thermal_core.c
> |index d755440..aeca923 100644
> |--- a/drivers/thermal/thermal_core.c
> |+++ b/drivers/thermal/thermal_core.c

you should replace thermal_core.c with thermal_sys.c in the patch.
Comment 9 sonnyrao 2013-05-23 03:14:52 UTC
Replacing thermal_core.c with thermal_sys.c still doesn't apply:

Applying: Debug patch to check cooling state transition.
error: patch failed: drivers/thermal/step_wise.c:119
error: drivers/thermal/step_wise.c: patch does not apply
error: patch failed: drivers/thermal/thermal_sys.c:441
error: drivers/thermal/thermal_sys.c: patch does not apply
Patch failed at 0001 Debug patch to check cooling state transition.
The copy of the patch that failed is found in:
   /home/sonnyrao/trunk/src/third_party/kernel-next/.git/rebase-apply/patch
When you have resolved this problem, run "git am --resolved".
If you prefer to skip this patch, run "git am --skip" instead.
To restore the original branch and stop patching, run "git am --abort".


can you look at the patch in the attachment in comment #6 and see if that is roughly equivalent?

Also, I didn't have patch7 (since it didn't apply) in there when I made the log the first time, I made a port of it to 3.8:

diff --git a/drivers/thermal/step_wise.c b/drivers/thermal/step_wise.c
index 6840230..6b7ac56 100644
--- a/drivers/thermal/step_wise.c
+++ b/drivers/thermal/step_wise.c
@@ -51,6 +51,9 @@ static unsigned long get_target_state(struct thermal_instance *instance,
        } else if (trend == THERMAL_TREND_DROPPING) {
                cur_state = cur_state > instance->lower ?
                            (cur_state - 1) : instance->lower;
+       } else if (trend == THERMAL_TREND_STABLE) {
+               /* Keep the cooling state as it is */
+               cur_state = instance->target;
        }
        printk("%s: new cur_state %d\n", __func__, cur_state);
        return cur_state;

but that still didn't fix the issue.
Comment 10 sonnyrao 2013-05-23 04:11:42 UTC
Also, I just rebased the patches that  I mentioned above, instead of using a merge commit, and bisected that down to this commit (sha won't match since rebased)

7b4977c3b139b9096ad17083ebd6de00d94804e7 is the first bad commit
commit 7b4977c3b139b9096ad17083ebd6de00d94804e7
Author: Zhang Rui <rui.zhang@intel.com>
Date:   Wed Jun 27 14:13:04 2012 +0800

    Thermal: Introduce simple arbitrator for setting device cooling state
    
    This fixes the problem that a cooling device may be referenced by
    by multiple trip points in multiple thermal zones.
    
    With this patch, we have two stages for updating a thermal zone,
    1. check if a thermal_instance needs to be updated or not
    2. update the cooling device, based on the target cooling state
       of all its instances.
    
    Note that, currently, the cooling device is set to the deepest
    cooling state required.
    
    Signed-off-by: Zhang Rui <rui.zhang@intel.com>
    Reviewed-by: Rafael J. Wysocki <rjw@sisk.pl>
    Reviewed-by: Eduardo Valentin <eduardo.valentin@ti.com>
Comment 11 sonnyrao 2013-05-23 05:09:24 UTC
I'm starting to think that the problem could be that cdev instances which are "inactive" according to linux state really are active, because they were activated by acpi_fan_suspend() in drivers/acpi/fan.c

static int acpi_fan_suspend(struct device *dev)
{
        if (!dev)
                return -EINVAL;

        acpi_bus_set_power(to_acpi_device(dev)->handle, ACPI_STATE_D0);

        return AE_OK;
}

that codes sets the state to D0 just prior to suspend, and they are still in D0 coming out of suspend into resume, but as far as the thermal zone code is concerned, they are marked inactive, so it does nothing to them.
Comment 12 Zhang Rui 2013-05-23 06:00:33 UTC
yes, that is the problem.
and my patch in comment #7 bug #50301 should fix it.
please forget about the debug patch and try the fix patch only.
Comment 13 Dutch Guy 2013-05-23 06:23:45 UTC
I don't know if it is direct related, but I see the same problem (patch not tested and no know how (yet) how to do that).

But what I see on Ubuntu 12.10 with 3.9.2 custom compiled kernel is after suspend no fancontrol starting. I have had the sleepd script, but that was not always working.
By accident I came across this threat.

Did test this (before and after suspend, both resulting in "0"):
cat /sys/class/thermal/cooling_device[01234]/cur_state

Furthermore, no errors in logs.

Main problem is before suspend the fancontrol is running smooth (by a fresh startup) after suspend they are running at max speed.
Before suspend an RPM of 3k and after suspend 6k.
The only thing I conclude so far is that it ain't been fixed in 3.9.2

Can you clarify how to proceed?
Comment 14 Dutch Guy 2013-05-23 06:25:26 UTC
I don't know if it is direct related, but I see the same problem (patch not tested and no know how (yet) how to do that).

But what I see on Ubuntu 12.10 with 3.9.2 custom compiled kernel is after suspend no fancontrol starting. I have had the sleepd script, but that was not always working.
By accident I came across this threat.

Did test this (before and after suspend, both resulting in "0"):
cat /sys/class/thermal/cooling_device[01234]/cur_state

Furthermore, no errors in logs.

Main problem is before suspend the fancontrol is running smooth (by a fresh startup) after suspend they are running at max speed.
Before suspend an RPM of 3k and after suspend 6k.
The only thing I conclude so far is that it ain't been fixed in 3.9.2

Can you clarify how to proceed?
Comment 15 Zhang Rui 2013-05-23 06:26:41 UTC
(In reply to comment #12)
> yes, that is the problem.
> and my patch in comment #7 bug #50301 should fix it.
> please forget about the debug patch and try the fix patch only.

sorry I made a mistake here.
The fix patch should be in #7 bug #58301.
https://bugzilla.kernel.org/show_bug.cgi?id=58301#c7
Comment 16 Zhang Rui 2013-05-23 06:27:24 UTC
(In reply to comment #14)
> Can you clarify how to proceed?

please try the patch at
https://bugzilla.kernel.org/show_bug.cgi?id=58301#c7
and see if it helps.
Comment 17 Dutch Guy 2013-05-23 11:41:54 UTC
Okay, just recompiled kernel with option (fan from acpi on -> just to be sure named here)

Furthermore the step_wise.c as followed:

        case THERMAL_TREND_DROP_FULL:
                if (cur_state == instance->lower) {
                        if (!throttle)
                                cur_state = -1;
                } else
                        cur_state = instance->lower;
                break;
        case THERMAL_TREND_STABLE:
                /* Keep the cooling state as it is */
                cur_state = instance->target;
        default:
                break;
        }

This is not working. Before suspend my service fancontrol is running fine. However after suspend it is not comming back online.
This needs to be done by manual restart.

Bios all auto options are disabled.

Please advise (also if more logs are needed)
Comment 18 Zhang Rui 2013-05-27 04:03:52 UTC
(In reply to comment #17)

> 
> This is not working. Before suspend my service fancontrol is running fine.

Hmm, I do not know how this works.
But as the fan should be controlled by kernel driver, can you reproduce the problem without this service?
Comment 19 Zhang Rui 2013-05-27 05:01:15 UTC
BTW, please try the 4 patches attached in bug #58301 and see if they help or not.
Comment 20 Zhang Rui 2013-05-30 01:47:34 UTC
please apply the patch at
https://patchwork.kernel.org/patch/2633071/
on top of the four patches and see if the problem still exists.
Comment 21 sonnyrao 2013-05-30 06:45:46 UTC
I did a reverse bisect of 3.8 -> 3.9 + the thermal trend rising patch applied during the bisect, which seemed to fix the problem on 3.9.2, just to try to narrow down the combination of patches which fix the problem for me on 3.8 and it bisected to this patch:


commit 3dbfff3dfe6714aeefb615c65bec0800dc5a4c51
Author: Zhang Rui <rui.zhang@intel.com>
Date:   Mon Nov 19 16:10:20 2012 +0800

    Introduce THERMAL_TREND_RAISE/DROP_FULL support for step_wise governor


the patch right after this, where the problem is fixes is:
commit b8bb6cb999858043489c1ddef08eed2127559169
Author: Zhang Rui <rui.zhang@intel.com>
Date:   Thu Nov 22 15:45:02 2012 +0800

    step_wise: Unify the code for both throttle and dethrottle


Below is the thermal trend rising patch which fixes the problem for me in 3.9.2 but not 3.8.13 by itself.

commit 94a409319561ec1847fd9bf996a2d5843ad00932
Author: Zhang Rui <rui.zhang@intel.com>
Date:   Fri Apr 26 09:19:53 2013 +0000

    ACPI / thermal: do not always return THERMAL_TREND_RAISING for active trip points
    

so, if I cherry pick in these three patches onto 3.8.13, then the problem is fixed (which also contains the THERMAL_TREND_RAISING patch above):

commit d069015e268bcac6c8bd8997b3235f5f977d3ab6
Author: Zhang Rui <rui.zhang@intel.com>
Date:   Mon Nov 19 15:33:51 2012 +0800

    Introduce THERMAL_TREND_RAISE_FULL and THERMAL_TREND_DROP_FULL

commit 3dbfff3dfe6714aeefb615c65bec0800dc5a4c51
Author: Zhang Rui <rui.zhang@intel.com>
Date:   Mon Nov 19 16:10:20 2012 +0800

    Introduce THERMAL_TREND_RAISE/DROP_FULL support for step_wise governor
 

commit b8bb6cb999858043489c1ddef08eed2127559169
Author: Zhang Rui <rui.zhang@intel.com>
Date:   Thu Nov 22 15:45:02 2012 +0800

    step_wise: Unify the code for both throttle and dethrottle
    
and I confirmed that the last patch is indeed required, which was a bit surprising.
Comment 22 Zhang Rui 2013-06-21 08:12:56 UTC
(In reply to comment #21)
> I did a reverse bisect of 3.8 -> 3.9 + the thermal trend rising patch applied
> during the bisect, which seemed to fix the problem on 3.9.2, just to try to
> narrow down the combination of patches which fix the problem for me on 3.8
> and
> it bisected to this patch:
> 
> 
> commit 3dbfff3dfe6714aeefb615c65bec0800dc5a4c51
> Author: Zhang Rui <rui.zhang@intel.com>
> Date:   Mon Nov 19 16:10:20 2012 +0800
> 
>     Introduce THERMAL_TREND_RAISE/DROP_FULL support for step_wise governor
> 
> 
> the patch right after this, where the problem is fixes is:
> commit b8bb6cb999858043489c1ddef08eed2127559169
> Author: Zhang Rui <rui.zhang@intel.com>
> Date:   Thu Nov 22 15:45:02 2012 +0800
> 
>     step_wise: Unify the code for both throttle and dethrottle
> 
> 
> Below is the thermal trend rising patch which fixes the problem for me in
> 3.9.2
> but not 3.8.13 by itself.
> 
so the thermal trend patch fixes the problem for 3.9.
and the thermal trend patch plus commit b8bb6cb fixes the problem in 3.8 stable kernel, right?

> commit 94a409319561ec1847fd9bf996a2d5843ad00932
> Author: Zhang Rui <rui.zhang@intel.com>
> Date:   Fri Apr 26 09:19:53 2013 +0000
> 
>     ACPI / thermal: do not always return THERMAL_TREND_RAISING for active
>     trip
> points
> 
> 
> so, if I cherry pick in these three patches onto 3.8.13, then the problem is
> fixed (which also contains the THERMAL_TREND_RAISING patch above):
> 
> commit d069015e268bcac6c8bd8997b3235f5f977d3ab6
> Author: Zhang Rui <rui.zhang@intel.com>
> Date:   Mon Nov 19 15:33:51 2012 +0800
> 
>     Introduce THERMAL_TREND_RAISE_FULL and THERMAL_TREND_DROP_FULL
> 
> commit 3dbfff3dfe6714aeefb615c65bec0800dc5a4c51
> Author: Zhang Rui <rui.zhang@intel.com>
> Date:   Mon Nov 19 16:10:20 2012 +0800
> 
>     Introduce THERMAL_TREND_RAISE/DROP_FULL support for step_wise governor
> 
> 
> commit b8bb6cb999858043489c1ddef08eed2127559169
> Author: Zhang Rui <rui.zhang@intel.com>
> Date:   Thu Nov 22 15:45:02 2012 +0800
> 
>     step_wise: Unify the code for both throttle and dethrottle
> 
> and I confirmed that the last patch is indeed required, which was a bit
> surprising.

"the last patch", I suppose you mean commit b8bb6cb99, right?
Comment 23 sonnyrao 2013-06-21 08:42:09 UTC
(In reply to comment #22)
> (In reply to comment #21)
> > I did a reverse bisect of 3.8 -> 3.9 + the thermal trend rising patch
> applied
> > during the bisect, which seemed to fix the problem on 3.9.2, just to try to
> > narrow down the combination of patches which fix the problem for me on 3.8
> and
> > it bisected to this patch:
> > 
> > 
> > commit 3dbfff3dfe6714aeefb615c65bec0800dc5a4c51
> > Author: Zhang Rui <rui.zhang@intel.com>
> > Date:   Mon Nov 19 16:10:20 2012 +0800
> > 
> >     Introduce THERMAL_TREND_RAISE/DROP_FULL support for step_wise governor
> > 
> > 
> > the patch right after this, where the problem is fixes is:
> > commit b8bb6cb999858043489c1ddef08eed2127559169
> > Author: Zhang Rui <rui.zhang@intel.com>
> > Date:   Thu Nov 22 15:45:02 2012 +0800
> > 
> >     step_wise: Unify the code for both throttle and dethrottle
> > 
> > 
> > Below is the thermal trend rising patch which fixes the problem for me in
> 3.9.2
> > but not 3.8.13 by itself.
> > 
> so the thermal trend patch fixes the problem for 3.9.
> and the thermal trend patch plus commit b8bb6cb fixes the problem in 3.8
> stable
> kernel, right?

The thermal trend patch did fix it in 3.9, and thermal trend + all 3 of those patches below fix it in 3.8

> 
> > commit 94a409319561ec1847fd9bf996a2d5843ad00932
> > Author: Zhang Rui <rui.zhang@intel.com>
> > Date:   Fri Apr 26 09:19:53 2013 +0000
> > 
> >     ACPI / thermal: do not always return THERMAL_TREND_RAISING for active
> trip
> > points
> > 
> > 
> > so, if I cherry pick in these three patches onto 3.8.13, then the problem
> is
> > fixed (which also contains the THERMAL_TREND_RAISING patch above):
> > 
> > commit d069015e268bcac6c8bd8997b3235f5f977d3ab6
> > Author: Zhang Rui <rui.zhang@intel.com>
> > Date:   Mon Nov 19 15:33:51 2012 +0800
> > 
> >     Introduce THERMAL_TREND_RAISE_FULL and THERMAL_TREND_DROP_FULL
> > 
> > commit 3dbfff3dfe6714aeefb615c65bec0800dc5a4c51
> > Author: Zhang Rui <rui.zhang@intel.com>
> > Date:   Mon Nov 19 16:10:20 2012 +0800
> > 
> >     Introduce THERMAL_TREND_RAISE/DROP_FULL support for step_wise governor
> > 
> > 
> > commit b8bb6cb999858043489c1ddef08eed2127559169
> > Author: Zhang Rui <rui.zhang@intel.com>
> > Date:   Thu Nov 22 15:45:02 2012 +0800
> > 
> >     step_wise: Unify the code for both throttle and dethrottle
> > 
> > and I confirmed that the last patch is indeed required, which was a bit
> > surprising.
> 
> "the last patch", I suppose you mean commit b8bb6cb99, right?

yes
Comment 24 Zhang Rui 2013-06-24 01:14:02 UTC
I see.
Close this bug report as the problem has been fixed in upstream kernel.
will backport the three patches to stable tree.
Comment 25 Oliver Joos 2015-12-06 19:46:14 UTC
Sorry to reactivate such an old bug report, but I fear that this bug came back in kernel 3.17.x and is still alive in 4.2.x (I did not check 4.3.x or later yet)

I confirm that it was fixed at least in 3.13.x and 3.16.x.

Please comment if you think it is a regression or a new bug,
and if I shall open a new bug report for it.

Some infos from affected users:
  https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1504751
  http://ubuntuforums.org/showthread.php?t=2294906
Comment 26 sonnyrao 2015-12-06 22:54:33 UTC
At this point someone with a device that exhibits the problem probably needs to run a git bisect on it to find out when it broke again.

Note You need to log in before you can comment on or make changes to this bug.