Bug 196975 - Fan still blows up after fixing the regression - Thinkpad T470
Summary: Fan still blows up after fixing the regression - Thinkpad T470
Status: CLOSED CODE_FIX
Alias: None
Product: Power Management
Classification: Unclassified
Component: cpuidle
Hardware: All Linux
: P1 normal
Assignee: Len Brown
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2017-09-18 01:56 UTC by Lv Zheng
Modified: 2017-12-13 04:38 UTC

See Also:
Kernel Version: 4.13
Tree: Mainline
Regression: No


Attachments
lspci output (30.24 KB, text/plain)
2017-09-18 01:57 UTC, Lv Zheng
dmidecode output (15.16 KB, text/plain)
2017-09-18 01:58 UTC, Lv Zheng
dmesg output (83.51 KB, text/plain)
2017-09-18 23:33 UTC, Tomislav Ivek
acpidump output (774.35 KB, text/plain)
2017-09-18 23:34 UTC, Tomislav Ivek
dmidecode output (15.16 KB, text/plain)
2017-09-18 23:35 UTC, Tomislav Ivek
lspci output (30.24 KB, text/plain)
2017-09-18 23:36 UTC, Tomislav Ivek
turbostat output for kernel 4.8.0 (3.92 KB, text/plain)
2017-10-18 07:48 UTC, Tomislav Ivek
turbostat output for kernel 4.13.5 (3.92 KB, text/plain)
2017-10-18 07:48 UTC, Tomislav Ivek
turbostat output for kernel 4.8.0 (75.15 KB, text/plain)
2017-10-18 20:37 UTC, Tomislav Ivek
turbostat output for kernel 4.13.5 (55.06 KB, text/plain)
2017-10-18 20:37 UTC, Tomislav Ivek
turbostat output for kernel 4.13.5 after EC fail (66.78 KB, text/plain)
2017-10-18 20:38 UTC, Tomislav Ivek

Description Lv Zheng 2017-09-18 01:56:45 UTC
Split from bug 196129, reported by Tomislav Ivek (tomislav.ivek@gmail.com):

Comment 36:
With patch 256927 on 4.12.0rc6 I got the problem on T470 within ten days, with both acpi.ec_freeze_events=N and Y.

Now I am testing patches 9835823 and 9835825 on kernel 4.12.0 without any specific kernel boot options. A couple of days in, so far so good.

Comment 40:
T470, patched kernel 4.12.4 and the bug is still present. The symptoms do not occur as often as without the patch, but every couple of days the fan still spins fast after resume and acpitz-virtual-0 is stuck at 48°C. BIOS version: N1QET56W, BIOS Revision: 1.31, Firmware Revision: 1.14

Comment 43:
(In reply to Lv Zheng from comment #42)
> To Tomislav:
> 
> Could you try to patch the followings and see if the situation can be
> improved:
> 
> https://patchwork.kernel.org/patch/9870917/
> https://patchwork.kernel.org/patch/9870915/
> https://patchwork.kernel.org/patch/9870919/
> https://patchwork.kernel.org/patch/9870925/
> 
> Thanks
> Lv


Lv, I am now running 4.13.0-rc4 with the four patches above. So far so good, but as the bug only appears sporadically I would like to test the new kernel for a couple of days under normal workloads.

Comment 45:
After three days of normal behavior, today the bug resurfaced on T470 running 4.13.0-rc4 and the four latest patches, after a 5-hour standby. This is better than hearing fans spin up every resume, but still not fixed.

I do not see anything suspicious in dmesg except perhaps this message which might be unrelated:
mei_me 0000:00:16.0: can't suspend (mei_me_pm_runtime_suspend [mei_me] returned -62)
mei_me 0000:00:16.0: unexpected reset: dev_state = ENABLED fw status = 90000245 80100306 00000020 00084400 00000000 40400AD9

Anything else I can try? Thank you for your hard work.

Comment 54:
(In reply to Lv Zheng from comment #52)
> If you mean the regression fixes. It should be in 4.13 kernels.
> $ git tag --contains 66259146
> v4.13
> v4.13-rc1
> v4.13-rc2
> v4.13-rc3
> v4.13-rc4
> v4.13-rc5
> v4.13-rc6
> v4.13-rc7
> 
> Anyone that still can reproduce this issue on latest v4.13 kernels (even
> with lowered replication rate), please upload lspci/dmidecode output here.
> Thanks in advance.

On ThinkPad T470 I do see the issue as reported in comment 45 on 4.13-rc4 with patches:
https://patchwork.kernel.org/patch/9870917/
https://patchwork.kernel.org/patch/9870915/
https://patchwork.kernel.org/patch/9870919/
https://patchwork.kernel.org/patch/9870925/

(will attach dmidecode output below). Next I can try the latest 4.13 as well as the recently published BIOS update.
Comment 1 Lv Zheng 2017-09-18 01:57:49 UTC
Created attachment 258451 [details]
lspci output
Comment 2 Lv Zheng 2017-09-18 01:58:14 UTC
Created attachment 258453 [details]
dmidecode output
Comment 3 Lv Zheng 2017-09-18 01:59:16 UTC
To Tomislav:

Please upload acpidump output and full dmesg output here.

I've split this bug in order to have PM guys to investigate if the ME failure message matters.

Thanks
Lv
Comment 4 Tomislav Ivek 2017-09-18 23:33:49 UTC
Created attachment 258475 [details]
dmesg output

My apologies for the slow reply, I was away from this particular machine.

In order to reproduce the bug on T470 I am now running 4.13.0. In the attached dmesg the first standby-resume did not show the bug, but immediately after the second standby-resume did.
Comment 5 Tomislav Ivek 2017-09-18 23:34:30 UTC
Created attachment 258477 [details]
acpidump output
Comment 6 Tomislav Ivek 2017-09-18 23:35:31 UTC
Created attachment 258479 [details]
dmidecode output
Comment 7 Tomislav Ivek 2017-09-18 23:36:03 UTC
Created attachment 258481 [details]
lspci output
Comment 8 Tomislav Ivek 2017-09-18 23:38:13 UTC
I forgot to mention, the 4.13.0 I am running in the outputs above is the unpatched vanilla kernel.
Comment 9 Zhang Rui 2017-09-19 02:18:11 UTC
(In reply to Tomislav Ivek from comment #4)
> Created attachment 258475 [details]
> dmesg output
> 
> My apologies for the slow reply, I was away from this particular machine.
> 
> In order to reproduce the bug on T470 I am now running 4.13.0. In the
> attached dmesg the first standby-resume did not show the bug, but
> immediately after the second standby-resume did.

show what bug? the fan spinning bug?
I didn't see mei_me errors this time.
so if the fan spins without the mei error, probably these are unrelated.

Please confirm whether the problem can be reproduced on a 4.8 kernel, which predates the offending EC patch in the upstream kernel.
Comment 10 Zhang Rui 2017-09-19 02:22:50 UTC
say we have three stages
1. kernel before 4.9, EC offending patches are not in upstream, no one reports the fan spinning problem.
2. from 4.9 to 4.12, problems are reported by different users
3. kernel 4.13 and later, problems are hard to reproduce after reverting the EC patches.

Now the problem to me is that
does the problem always exist?
if yes, no one reports the problem before 4.9 because it is hard to reproduce.
if no, there must be something else that also breaks suspend after 4.8
Comment 11 Tomislav Ivek 2017-09-19 06:10:55 UTC
(In reply to Zhang Rui from comment #9)
> (In reply to Tomislav Ivek from comment #4)
> > Created attachment 258475 [details]
> > dmesg output
> > 
> > My apologies for the slow reply, I was away from this particular machine.
> > 
> > In order to reproduce the bug on T470 I am now running 4.13.0. In the
> > attached dmesg the first standby-resume did not show the bug, but
> > immediately after the second standby-resume did.
> 
> show what bug? the fan spinning bug?

Yes, the fan spinning bug after resume, from the title.


> I didn't see mei_me errors this time.
> so if the fan spins without the mei error, probably these are unrelated.
> 

You are correct, there was no mei error this time.

> please confirm the problem can be reproduced in 4.8 kernel, where the EC
> offending patch has not been introduced into the upstream kernel or not.

I can do that. Building 4.8, will report back.
Comment 12 Darrick J. Wong 2017-09-21 19:32:16 UTC
FWIW I'm on stock 4.13.2 and still experience this problem periodically.

When it does, I see:

"thermal thermal_zone3: failed to read out thermal zone (-5)"

in dmesg.  Same lspci as the one posted, etc.  Will try the n1qur09w.iso firmware and report back.

(I cannot revert to 4.8, it's too old to support the XFS filesystems on this laptop.)
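
For anyone else trying to catch this state, the stuck zone and the failed read can also be checked from sysfs. The following is only a minimal sketch, assuming the usual /sys/class/thermal/thermal_zone*/type and temp files with temperatures in millidegrees Celsius; the helper name thermal_report is made up for illustration. The -5 in the dmesg line above is EIO, so a zone in that state would presumably also fail when read this way.

```shell
# thermal_report [BASE]: print one line per thermal zone found under BASE
# (default /sys/class/thermal). Hypothetical helper, not an existing tool.
thermal_report() {
    base=${1:-/sys/class/thermal}
    for zone in "$base"/thermal_zone*; do
        [ -d "$zone" ] || continue
        ztype=$(cat "$zone/type" 2>/dev/null) || ztype=unknown
        if milli=$(cat "$zone/temp" 2>/dev/null) && [ -n "$milli" ]; then
            # sysfs reports millidegrees Celsius; integer division is enough here
            printf '%s %s %s C\n' "${zone##*/}" "$ztype" "$((milli / 1000))"
        else
            # the read failed or returned nothing
            printf '%s %s read-failed\n' "${zone##*/}" "$ztype"
        fi
    done
}
```

Running thermal_report with no argument after a resume should show whether any zone is stuck at one value or cannot be read at all.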
Comment 13 Lv Zheng 2017-09-25 03:11:35 UTC
This actually is explainable.

The problem seems to be:
When the CPU is busy exiting/entering C-states, the EC FW can fail when reading the CPU temperature via PECI.
I think this could even happen during runtime, not just post-resume.
Comment 14 Darrick J. Wong 2017-09-25 16:30:20 UTC
FWIW, the n1qur09w firmware did not magically fix things. :(
Comment 15 Tomislav Ivek 2017-09-26 02:59:49 UTC
(In reply to Zhang Rui from comment #9)
> please confirm the problem can be reproduced in 4.8 kernel, where the EC
> offending patch has not been introduced into the upstream kernel or not.


After a full week on 4.8, I am not able to reproduce the bug with the fan blowing after resume.

For comparison, on 4.13.0 the bug appears within the first couple of resumes. The bug appears within three-five days with regular standby-resumes on 4.13-rc4 with the following patches:
https://patchwork.kernel.org/patch/9870917/
https://patchwork.kernel.org/patch/9870915/
https://patchwork.kernel.org/patch/9870919/
https://patchwork.kernel.org/patch/9870925/
Comment 16 Lv Zheng 2017-10-17 06:41:49 UTC
Hi,

Can you try several tests:

1. Boot a kernel with the following parameters to see if the problem still can be reproduced:
 idle=X
Where X can be a string of "halt/nomwait/poll", hope you can give all of them a try.

2. Make sure you are using intel_idle, then try to boot a kernel with the following parameters to see if the problem can still be reproduced:
  intel_idle.max_cstate=Y
Where Y can be a number of 0-9, hope you can at least try 0,1,3,6,9.

Thanks and best regards
Lv
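
Before comparing results from the two tests above, it may help to confirm that the parameter actually took effect on the running kernel. A minimal sketch, assuming the standard /proc/cmdline, /sys/devices/system/cpu/cpuidle/current_driver and /sys/module/intel_idle/parameters/max_cstate files; the helper name cmdline_param is made up for illustration.

```shell
# cmdline_param NAME [FILE]: print the value of kernel parameter NAME as found
# in FILE (default /proc/cmdline); return 1 if the parameter is not present.
# Hypothetical helper, not part of any existing tool.
cmdline_param() {
    name=$1
    file=${2:-/proc/cmdline}
    # unquoted on purpose: split the command line into whitespace-separated words
    for word in $(cat "$file"); do
        case $word in
            "$name"=*) printf '%s\n' "${word#*=}"; return 0 ;;
        esac
    done
    return 1
}

# Possible usage on the test machine:
#   cmdline_param idle
#   cmdline_param intel_idle.max_cstate
#   cat /sys/devices/system/cpu/cpuidle/current_driver
#   cat /sys/module/intel_idle/parameters/max_cstate
```

The current_driver file should read intel_idle for test 2 to be meaningful.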
Comment 17 Lv Zheng 2017-10-17 06:49:24 UTC
It is said and confirmed that the original regression can be fixed by the following commit:

commit 662591461c4b9a1e3b9b159dbf37648a585ebaae
Author: Lv Zheng <lv.zheng@intel.com>
Date:   Wed Jul 12 11:09:09 2017 +0800

    ACPI / EC: Drop EC noirq hooks to fix a regression

    According to bug reports, although the busy polling mode can make
    noirq stages execute faster, it causes abnormal fan blowing up after
    system resume (see the first link below for a video demonstration)
    on Lenovo ThinkPad X1 Carbon - the 5th Generation.  The problem can
    be fixed by upgrading the EC firmware on that machine.

    However, many reporters confirm that the problem can be fixed by
    stopping busy polling during suspend/resume and for some of them
    upgrading the EC firmware is not an option.

    For this reason, drop the noirq stage hooks from the EC driver
    to fix the regression.

    Fixes: c3a696b6e8f8 (ACPI / EC: Use busy polling mode when GPE is not enabled)
    Link: https://youtu.be/9NQ9x-Jm99Q
    Link: https://bugzilla.kernel.org/show_bug.cgi?id=196129

And what I learned from a different route is:
1. The bug occurred because there is a failure in EC FW reading a CPU temperature.
2. The failure can be triggered by different reasons. For the regression, it is triggered due to a C6-exit conflicting with an internal EC FW operation, and busy polling makes it easier to occur.

For the bug reported here, probably there is a different reason.

Thanks and best regards
Lv
Comment 18 Tomislav Ivek 2017-10-17 23:52:01 UTC
Hi Lv,

To be as expedient as possible I've done the tests on Fedora's 4.13.5-200.fc26.x86_64.

Hardware: T470 with the latest BIOS N1QET67W (1.42), Firmware revision 1.27 released on 10/10/17.

Methodology: for each proposed boot parameter I have repeatedly put the machine to standby and then resumed it until either the fan-blowing issue was reproduced or 30min have passed. Considering that I can usually get the fan to blow within the first couple of standby-resumes, and within 3 minutes tops, this seems like a reasonable way to test all the options.

(In reply to Lv Zheng from comment #16)
> 1. Boot a kernel with the following parameters to see if the problem still
> can be reproduced:
>  idle=X
> Where X can be a string of "halt/nomwait/poll", hope you can give all of
> them a try.

I was not able to reproduce the problem with either of idle=halt, idle=nomwait, and idle=poll. The maximum idle state with idle=nomwait was C1 and with idle=poll was C0. When using idle=halt, powertop did not report the C-state.


> 2. Make sure you are using intel_idle, then try to boot a kernel with the
> following parameters to see if the problem can still be reproduced:
>   intel_idle.max_cstate=Y
> Where Y can be a number of 0-9, hope you can at least try 0,1,3,6,9.

I have verified that intel_idle is indeed in use. I was not able to reproduce the problem with intel_idle.max_cstate=0 up to 5. But, the problem is immediately reproduced with values 6, 7, 8, and 9.

So, on T470 it seems the EC FW failure when reading a CPU temperature is triggered when C6 and C7 are enabled.

Thank you and please let me know if I can provide any other info.

Best,
Tomislav
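
The thread does not spell out how the intel_idle.max_cstate number maps to the named idle states on this CPU, so listing the states the kernel actually exposes for each tested value may make the results easier to compare. A minimal sketch, assuming the usual /sys/devices/system/cpu/cpu0/cpuidle/state*/name layout; the helper name list_cstates is made up for illustration.

```shell
# list_cstates [DIR]: print "stateN NAME" for every cpuidle state under DIR
# (default: the cpuidle directory of cpu0). Hypothetical helper.
list_cstates() {
    dir=${1:-/sys/devices/system/cpu/cpu0/cpuidle}
    for state in "$dir"/state*; do
        [ -r "$state/name" ] || continue
        printf '%s %s\n' "${state##*/}" "$(cat "$state/name")"
    done
}
```

Saving this output once per boot parameter would show which named states were present when the fan problem did or did not reproduce.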
Comment 19 Zhang Rui 2017-10-18 01:21:37 UTC
(In reply to Tomislav Ivek from comment #18)
> > 2. Make sure you are using intel_idle, then try to boot a kernel with the
> > following parameters to see if the problem can still be reproduced:
> >   intel_idle.max_cstate=Y
> > Where Y can be a number of 0-9, hope you can at least try 0,1,3,6,9.
> 
> I have verified that intel_idle is indeed in use. I was not able to
> reproduce the problem with intel_idle.max_cstate=0 up to 5. But, the problem
> is immediately reproduced with values 6, 7, 8, and 9.
> 
are these done with or without the EC regression fixes?
Comment 20 Zhang Rui 2017-10-18 01:29:36 UTC
As this also seems to be c-state related, please attach the output of "turbostat -debug" for both 4.9 kernel, and 4.13 kernel.
Comment 21 Tomislav Ivek 2017-10-18 07:12:57 UTC
(In reply to Zhang Rui from comment #19)
> (In reply to Tomislav Ivek from comment #18)
> > > 2. Make sure you are using intel_idle, then try to boot a kernel with the
> > > following parameters to see if the problem can still be reproduced:
> > >   intel_idle.max_cstate=Y
> > > Where Y can be a number of 0-9, hope you can at least try 0,1,3,6,9.
> > 
> > I have verified that intel_idle is indeed in use. I was not able to
> > reproduce the problem with intel_idle.max_cstate=0 up to 5. But, the problem
> > is immediately reproduced with values 6, 7, 8, and 9.
> > 
> are these done with or without the EC regression fixes?

The fix for 196129 is included:
$ git tag --contains 66259146
--snip--
kernel-4.13.5-200.fc26
--snip

Any others I should check for?
Comment 22 Zhang Rui 2017-10-18 07:28:27 UTC
please verify if the same thing (disable C6 and deeper) works for kernel without the EC fix or not.
Comment 23 Lv Zheng 2017-10-18 07:29:11 UTC
It is said that when C6 and deeper states are allowed, a CPU temperature read, which leads to a PECI access, may come too early, causing a conflict between the C6 exit and the PECI interface handling in the EC FW.

The 66259146 only fixes C6-exit cause in EC driver, there could be different C6-exit cause in the linux upstream changes.

I'm wondering if the CPU temperature read is caused by OS evaluation of _WAK or something else, but I have no chance to confirm that as we are not EC FW developers.
Comment 24 Tomislav Ivek 2017-10-18 07:48:18 UTC
Created attachment 260259 [details]
turbostat output for kernel 4.8.0
Comment 25 Tomislav Ivek 2017-10-18 07:48:46 UTC
Created attachment 260261 [details]
turbostat output for kernel 4.13.5
Comment 26 Zhang Rui 2017-10-18 07:57:10 UTC
well, my bad, you should leave it running for a while to get the output like following,

    Core     CPU Avg_MHz   Busy% Bzy_MHz TSC_MHz     IRQ     SMI  CPU%c1  CPU%c3  CPU%c6  CPU%c7 CoreTmp  PkgTmp GFX%rc6  GFXMHz Totl%C0  Any%C0  GFX%C0 CPUGFX% Pkg%pc2 Pkg%pc3 Pkg%pc6 Pkg%pc7 Pkg%pc8 Pkg%pc9 Pk%pc10 PkgWatt RAMWatt   PKG_%   RAM_%
       -       -       8    1.48     542    2208     927       0    2.64    0.04    0.91   94.94      27      27   99.23    1050    6.42    5.61    0.59    0.32   43.07   47.76    1.83    0.00    0.00    0.00    0.00    0.89    0.34    0.00    0.00
       0       0       8    0.93     881    2208     107       0    2.74    0.03    1.51   94.79      26      27   99.23    1050    6.42    5.61    0.59    0.32   43.07   47.76    1.84    0.00    0.00    0.00    0.00    0.89    0.34    0.00    0.00
       0       2       9    1.85     462    2208     235       0    1.82
       1       1      10    1.96     498    2208     182       0    2.62    0.04    0.31   95.08      27
       1       3       6    1.18     472    2208     403       0    3.39
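
To compare two saved turbostat logs, the summary row of a table like the one above can be read by column name. This is a rough sketch only: it assumes the header line starts with "Core" and that the line right after it is the summary row, as in the sample above; the helper name turbostat_col is made up for illustration.

```shell
# turbostat_col COLUMN FILE: print the value of COLUMN from the first row after
# the first header line (the one whose first field is "Core").
# Returns 1 if the column is not found. Hypothetical helper.
turbostat_col() {
    col=$1
    file=$2
    awk -v col="$col" '
        !hdr && $1 == "Core" {
            # remember the position of the requested column in the header
            for (i = 1; i <= NF; i++) if ($i == col) idx = i
            hdr = 1
            next
        }
        hdr { if (idx) { print $idx; found = 1 }; exit }
        END { exit (found ? 0 : 1) }
    ' "$file"
}
```

For example, calling it with CPU%c6 and CPU%c7 on the 4.8 and 4.13 logs would give the residency figures side by side.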
Comment 27 Tomislav Ivek 2017-10-18 08:04:11 UTC
(In reply to Zhang Rui from comment #26)
> well, my bad, you should leave it running for a while to get the output like
> following,

Terribly sorry, I presumed the debug output is given in the first part. Will do it later in the day as soon as I get back to that machine.


(In reply to Lv Zheng from comment #23)
> It is said that when C6 and upper is allowed, CPU temperature read which
> leads to a PECI access may be too early, causing a conflict between c6-exit
> and PECI I/F related stuffs in EC FW.
> 
> The 66259146 only fixes C6-exit cause in EC driver, there could be different
> C6-exit cause in the linux upstream changes.
> 
> I'm wondering if the CPU temperature read is caused by OS evaluation of _WAK
> or something else, but I have no chance to confirm that as we are not EC FW
> developers.

I noticed that these patches helped a lot on 4.13.0-rc4:
https://patchwork.kernel.org/patch/9870917/
https://patchwork.kernel.org/patch/9870915/
https://patchwork.kernel.org/patch/9870919/
https://patchwork.kernel.org/patch/9870925/
in the sense that I could run for days and dozens of standby-resume cycles before the problem would resurface. Without these patches I can reproduce the problem within minutes with only a couple of resumes. So, you may be right, there are probably more C6-exit causes, but these four patches touch at least some.
Comment 28 Tomislav Ivek 2017-10-18 18:23:54 UTC
(In reply to Zhang Rui from comment #22)
> please verify if the same thing (disable C6 and deeper) works for kernel
> without the EC fix or not.

I booted 4.12.14 which does not have the EC fix commit 66259146. C6 and deeper are disabled using the intel_idle.max_cstate=5 boot parameter. I am not able to reproduce the EC problem, so disabling C6 and deeper seems to help.
Comment 29 Tomislav Ivek 2017-10-18 20:37:26 UTC
Created attachment 260273 [details]
turbostat output for kernel 4.8.0
Comment 30 Tomislav Ivek 2017-10-18 20:37:52 UTC
Created attachment 260275 [details]
turbostat output for kernel 4.13.5
Comment 31 Tomislav Ivek 2017-10-18 20:38:20 UTC
Created attachment 260277 [details]
turbostat output for kernel 4.13.5 after EC fail
Comment 32 Tomislav Ivek 2017-10-18 20:40:56 UTC
(In reply to Tomislav Ivek from comment #28)
> (In reply to Zhang Rui from comment #22)
> > please verify if the same thing (disable C6 and deeper) works for kernel
> > without the EC fix or not.
> 
> I booted 4.12.14 which does not have the EC fix commit 66259146. C6 and
> deeper are disabled using the intel_idle.max_cstate=5 boot parameter. I am
> not able to reproduce the EC problem, so disabling C6 and deeper seems to
> help.

Noticed that Fedora's 4.12.14 has ec.c patched with its own EC fix, my mistake, so I now report that I cannot reproduce the EC problem on mainline 4.12.0 without EC fix and with disabled C6 & deeper.
Comment 33 Lv Zheng 2017-10-19 01:29:46 UTC
> Noticed that Fedora's 4.12.14 has ec.c patched with its own EC fix.

Yes, regression fixes will be back ported by stable kernels.
So care should be taken when you test with stable kernels (x.y.z) rather than upstream kernels (x.y-rcz).
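
As a rough aid for the x.y.z versus x.y-rcz distinction, the release string can be classified before a test result is reported. A minimal sketch, assuming version strings shaped like the ones in this thread (4.13.0-rc4, 4.13.0, 4.12.14, 4.13.5-200.fc26.x86_64); the helper name kernel_flavor is made up for illustration.

```shell
# kernel_flavor [VERSION]: classify a kernel release string (default: uname -r)
# as "rc", "mainline" or "stable". Hypothetical helper; string matching only.
kernel_flavor() {
    ver=${1:-$(uname -r)}
    case $ver in
        *-rc[0-9]*) echo rc; return 0 ;;
    esac
    base=${ver%%-*}      # "4.13.5-200.fc26.x86_64" -> "4.13.5"
    patch=${base#*.*.}   # "4.13.5" -> "5"; unchanged if there is no third component
    if [ "$patch" = "$base" ] || [ "$patch" = "0" ]; then
        echo mainline
    else
        echo stable
    fi
}
```

A version string alone cannot reveal distribution patches, as comment 32 found with Fedora's 4.12.14, so the result is only a first hint that backported fixes may be present.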
Comment 34 Tomislav Ivek 2017-10-23 07:11:44 UTC
A related report on Red Hat's bugzilla has news that Lenovo is planning a firmware update in November. They will address the fan-blowing-up issue on T470 and other Kaby Lake models: https://bugzilla.redhat.com/show_bug.cgi?id=1480844
Comment 35 Len Brown 2017-10-23 23:56:14 UTC
Tomislav,
Regarding your test result in comment #27
Are all four patches there required to see this big improvement,
or are the first three sufficient?
Comment 36 Tomislav Ivek 2017-10-24 05:22:10 UTC
(In reply to Len Brown from comment #35)
> Tomislav,
> Regarding your test result in comment #27
> Are all four patches there required to see this big improvement,
> or are the first three sufficient?

I believe I tested only with all four of them applied. I can try them one by one but it will take a while to verify.
Comment 37 Lv Zheng 2017-10-24 06:56:39 UTC
Tomislav,

Glad to know you'll help to provide us useful information related to the EC changes.

If you can use latest upstream kernel, please help to do the validation using latest versioned patches.

1. applying the following 2 patches:
https://patchwork.kernel.org/patch/9971553/
https://patchwork.kernel.org/patch/9971557/

2. applying the following 1 patch on top of the 2 patches:
https://patchwork.kernel.org/patch/9977051/

3. applying the following 2 patches on top of the 3 patches:
https://patchwork.kernel.org/patch/9977049/
https://patchwork.kernel.org/patch/9977047/

It's ok if you are not willing to switch to the latest upstream kernel. Then please follow Len's suggestion, thanks in advance.

Cheers,
Lv
Comment 38 Tomislav Ivek 2017-10-24 09:13:59 UTC
Hi Lv, I can do these tests and will report back as soon as I am sure about the results (sometimes it takes a bit to be sure the fan is behaving).

Tomislav
Comment 39 Tim Sinthofen 2017-10-24 17:43:05 UTC
Hi,
I use a T470 with Fedora 27 (Kernel 4.13.8-300.fc27.x86_64).
For me the problem was completely gone for some time. But since some 4.13 versions it is back when the laptop wakes up from sleep while plugged into AC (always, can be easily reproduced). When waking up on battery it never happens.
I will be happy to provide logs for debugging. Please tell me what you need.

Tim
Comment 40 Zhang Rui 2017-10-25 02:22:19 UTC
Tim, first, please check the previous comments of this bug report, and confirm if it is exactly the same problem or not.
To me, it should be. But if there is anything different from what Tomislav reported, please update it here.
Comment 41 Tomislav Ivek 2017-10-25 08:38:36 UTC
(In reply to Lv Zheng from comment #37)
> Tomislav,
> 
> Glad to know you'll help to provide us useful information related to the EC
> changes.
> 
> If you can use latest upstream kernel, please help to do the validation
> using latest versioned patches.
> 
> 1. applying the following 2 patches:
> https://patchwork.kernel.org/patch/9971553/
> https://patchwork.kernel.org/patch/9971557/
> 
> 2. applying the following 1 patch on top of the 2 patches:
> https://patchwork.kernel.org/patch/9977051/
> 
> 3. applying the following 2 patches on top of the 3 patches:
> https://patchwork.kernel.org/patch/9977049/
> https://patchwork.kernel.org/patch/9977047/
> 
> It's ok if you are not willing to switch to the latest upstream kernel. Then
> please follow Len's suggestion, thanks in advance.
> 
> Cheers,
> Lv

Lv,

I did the tests on:
0. unpatched upstream tag v4.14-rc6,
1. v4.14-rc6 with the first two patches as above
2. v4.14-rc6 with the first three patches,
3. v4.14-rc6 with all five patches you proposed.

In all these cases I am still able to reproduce the fan blowing after resume. For case 2. it took some hours and reboots to reproduce but that may have been a fluke.

(In reply to Tim Sinthofen from comment #39)
> For me the problem was completely gone for some time. But since some 4.13
> versions it is back when the Laptop wakes up from sleep when plugged into AC
> (always, can be easily reproduced). When waking up from battery it does
> never happen.

Up to v4.13-rc4 about a month ago I did all tests with AC as well as on battery and I was always able to reproduce the bug. I did not think power was relevant so I did the more recent tests AC only. In the meantime the EC FW got updated, maybe my assumption is now wrong. Later in the day I can try the above patches on battery, too.

Tomislav
Comment 42 Lv Zheng 2017-10-26 02:10:24 UTC
Tim,

> For me the problem was completely gone for some time. But since some 4.13
> versions it is back when the Laptop wakes up from sleep when plugged into AC
> (always, can be easily reproduced).

After reverting the EC noirq changes, I only added commits related to ECDT, which is boot related, not S3 related.
Rafael added code related to freeze mode not related to S3.

I wonder if your test is correct if you are directly using sysfs, you need to use a different command line sequence to achieve S3:
# echo deep > /sys/power/mem_sleep
# echo mem > /sys/power/state

So if the situation is getting worse, probably is because some other causes start to create more C6-exit chances during resume. And such causes may not be limited to the EC driver.

Would you please try attachment 260383 [details] to confirm if the problem disappears with kernel boot parameter of "acpi_resume_latency=25"? This patch actually is closer to the root cause than EC changes.
If the problem can be fixed by the workaround, please give attachment 260385 [details] a try to see if it can also fix the problem, it's just an experiment.

Tomislav,

> In all these cases I am still able to reproduce the fan blowing after resume.
> For case 2. it took some hours and reboots to reproduce but that may have
> been a fluke.

I also wonder if you are using the following command lines:
# echo deep > /sys/power/mem_sleep
# echo mem > /sys/power/state
If the command line is not wrong, we can ignore your comment #27.
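
Whether a given suspend really used S3 can be checked by reading back /sys/power/mem_sleep, where the active mode is shown in square brackets ("deep" is suspend-to-RAM, "s2idle" is suspend-to-idle). A minimal sketch, assuming that file format; the helper name active_mem_sleep is made up for illustration.

```shell
# active_mem_sleep [FILE]: print the bracketed (currently selected) mode from
# FILE (default /sys/power/mem_sleep), e.g. "deep" for "s2idle [deep]".
# Hypothetical helper.
active_mem_sleep() {
    file=${1:-/sys/power/mem_sleep}
    sed -n 's/.*\[\([^]]*\)].*/\1/p' "$file"
}
```

If this prints s2idle rather than deep before suspending, the test exercised freeze mode and not S3.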
Comment 43 Tomislav Ivek 2017-10-26 06:09:12 UTC
(In reply to Lv Zheng from comment #42)
> Tomislav,
> 
> > In all these cases I am still able to reproduce the fan blowing after resume.
> > For case 2. it took some hours and reboots to reproduce but that may have
> > been a fluke.
> 
> I also wonder if you are using the following command lines:
> # echo deep > /sys/power/mem_sleep
> # echo mem > /sys/power/state
> If the command line is not wrong, we can ignore your comment #27.

Lv,

I tried again with the above command lines and see the same effect as reported in all cases (v4.14-rc6 no patch, patches 1-2, patches 1-3, patches 1-5).

Tomislav
Comment 44 Lv Zheng 2017-10-27 03:53:25 UTC
Tomislav,

Thanks for the confirmation.

Could you give the following test a try?
1. apply only attachment 260383 [details] on top of the failing kernel
   boot with "acpi_resume_latency=25"
2. apply only attachment 260385 [details] on top of the failing kernel
   boot without any parameter

Thanks in advance.
Comment 45 Tim Sinthofen 2017-10-28 07:35:50 UTC
Lv,
Sorry for the late reply. This was the first time I built my own kernel, had to figure out how first.
I just took the vanilla upstream kernel (4.14.0-rc6) and applied this patch: https://bugzilla.kernel.org/attachment.cgi?id=260383&action=diff
Unfortunately using this kernel did not fix the issue.

Tim
Comment 46 Tim Sinthofen 2017-10-28 07:46:47 UTC
Sorry,
wrote too quickly. I forgot to set the boot parameter.
With the boot parameter set, the issue seems to be gone. (Tried 10 times going into standby and waking it up again)
Comment 47 Tomislav Ivek 2017-10-29 23:10:18 UTC
(In reply to Lv Zheng from comment #44)
> Tomislav,
> 
> Thanks for the confirmation.
> 
> Could you give the following test a try?
> 1. apply only attachment 260383 [details] on top of the failing kernel
>    boot with "acpi_resume_latency=25"
> 2. apply only attachment 260385 [details] on top of the failing kernel
>    boot without any parameter
> 
> Thanks in advance.

I can confirm Tim's recent observation. I am on 4.14.0-rc6 patched only with attachment 260383 [details].

Without the boot parameter I can reproduce the fan blowing bug.

With the boot parameter "acpi_resume_latency=25" I am now entering day three of regular standby-resumes without the fan blowing after resume. I would still like to give it a couple of days more of testing, but so far this patch together with the boot parameter does seem like a significant improvement.

Tomislav
Comment 48 Tomislav Ivek 2017-10-29 23:12:25 UTC
(In reply to Tomislav Ivek from comment #47)
> (In reply to Lv Zheng from comment #44)
> > Tomislav,
> > 
> > Thanks for the confirmation.
> > 
> > Could you give the following test a try?
> > 1. apply only attachment 260383 [details] on top of the failing kernel
> >    boot with "acpi_resume_latency=25"
> > 2. apply only attachment 260385 [details] on top of the failing kernel
> >    boot without any parameter
> > 
> > Thanks in advance.
> 
... Forgot to mention I still have to test 260385 without any boot parameter, but as I mentioned I would still like to give 260383 a bit more time before doing that.
Comment 49 Lv Zheng 2017-10-31 09:47:11 UTC
I think we needn't try attachment 260383 [details], as it cannot fix the issue. :)
I managed to obtain a failing machine from our customers, and tested the patch on 4.11 kernel, it didn't help.

I'll try to find if there is a better solution than attachment 260385 [details] on the failing system.
Comment 50 Tim Sinthofen 2017-10-31 11:28:31 UTC
Well, I guess our best bet at the moment is Lenovo fixing it properly with a BIOS update. I guess all the solutions we tried here were just workarounds, right?
Let's hope that this comment will stay true! (https://bugzilla.redhat.com/show_bug.cgi?id=1480844#c52)
Comment 51 Tomislav Ivek 2017-10-31 23:25:56 UTC
(In reply to Lv Zheng from comment #49)
> I think we needn't try attachment 260383 [details], as it cannot fix the
> issue. :)
> I managed to obtain a failing machine from our customers, and tested the
> patch on 4.11 kernel, it didn't help.
> 
> I'll try to find if there is a better solution than attachment 260385 [details]
> on the failing system.

Hi Lv,

I tested the attachment 260385 [details] on v4.14.0-rc6 without any boot parameters. The fan-spinning issue appeared immediately after first resume.

Tomislav
Comment 52 Lv Zheng 2017-11-01 07:29:17 UTC
> I guess all the solutions we tried here were just workarounds, right?

Yes, the only known fact to me is:
Any exit from a deeper enough C-state during S3-resume can trigger the problem.

But there is nothing reasonable from OS point of view.
I'm just developing something that can lower down the reproduce ratio, closer to what we might have on Windows.

Thanks and best regards
Lv
Comment 53 Tim Sinthofen 2017-11-09 11:13:32 UTC
Updated to BIOS 1.42 today. It did not fix the bug. (No special kernel, stock fedora 27)
Comment 54 Tomislav Ivek 2017-11-23 19:01:23 UTC
I updated my T470 to today's BIOS Revision 1.43, Firmware Revision 1.29. From the changelog at https://download.lenovo.com/pccbbs/mobiles/n1qur12w.txt :
"- Fixed an issue where fan might rotated with max speed due to not reading CPU temperature correctly."

On v4.14.0-rc6, a couple of standby-resume cycles in and so far the fan is behaving nicely as opposed to previous firmware revision where the fan-spinning issue appeared immediately after first resume.

I plan to test this further for at least a couple of days more.
Comment 55 Darrick J. Wong 2017-11-24 16:17:56 UTC
I have also installed n1qur12w and will report back in a few days.
Comment 56 Tim Sinthofen 2017-11-25 02:25:15 UTC
Using the Bios 1.43 for the last 2 days, I did not have the issue a single time. It seems to be fixed completely.
Comment 57 Tim Sinthofen 2017-12-02 22:11:38 UTC
After another week of usage, the bug did not occur a single time.
I think this bug can be closed. Thanks!
Comment 58 Tomislav Ivek 2017-12-02 22:15:04 UTC
(In reply to Tim Sinthofen from comment #57)
> After another week of usage, the bug did not occur a single time.
> I think this bug can be closed. Thanks!

Uptime of 8 days here, the bug did not occur. I agree it seems to be fixed with the new firmware.
Comment 59 Darrick J. Wong 2017-12-04 05:16:22 UTC
Sleep/resume seems fine here.  Thank you for your time!
Comment 60 Darrick J. Wong 2017-12-13 04:38:16 UTC
Drat, it did it again.  The only operational difference is that I had previously put the machine to sleep while in its docking station and resumed it after undocking.
