Bug 196975
Summary: | Fan still blows up after fixing the regression - Thinkpad T470 | ||
---|---|---|---|
Product: | Power Management | Reporter: | Lv Zheng (lv.zheng) |
Component: | cpuidle | Assignee: | Len Brown (lenb) |
Status: | CLOSED CODE_FIX | ||
Severity: | normal | CC: | djwong, fengziyonghu, rui.zhang, t.sinthofen, tomislav.ivek, yu.c.chen |
Priority: | P1 | ||
Hardware: | All | ||
OS: | Linux | ||
Kernel Version: | 4.13 | Subsystem: | |
Regression: | No | Bisected commit-id: | |
Attachments: |
lspci output
dmidecode output
dmesg output
acpidump output
dmidecode output
lspci output
turbostat output for kernel 4.8.0
turbostat output for kernel 4.13.5
turbostat output for kernel 4.8.0
turbostat output for kernel 4.13.5
turbostat output for kernel 4.13.5 after EC fail |
Description
Lv Zheng
2017-09-18 01:56:45 UTC
Created attachment 258451 [details]
lspci output
Created attachment 258453 [details]
dmidecode output
To Tomislav:
Please upload the acpidump output and the full dmesg output here.
I've split this bug so that the PM folks can investigate whether the ME failure message matters.

Thanks
Lv

Created attachment 258475 [details]
dmesg output
My apologies for the slow reply, I was away from this particular machine.
In order to reproduce the bug on T470 I am now running 4.13.0. In the attached dmesg the first standby-resume did not show the bug, but immediately after the second standby-resume did.
Created attachment 258477 [details]
acpidump output
Created attachment 258479 [details]
dmidecode output
Created attachment 258481 [details]
lspci output
I forgot to mention, the 4.13.0 I am running in the outputs above is the unpatched vanilla kernel.

(In reply to Tomislav Ivek from comment #4)
> Created attachment 258475 [details]
> dmesg output
>
> My apologies for the slow reply, I was away from this particular machine.
>
> In order to reproduce the bug on T470 I am now running 4.13.0. In the
> attached dmesg the first standby-resume did not show the bug, but
> immediately after the second standby-resume did.

Show what bug? The fan spinning bug?
I didn't see mei_me errors this time, so if the fan spins without the mei error, these are probably unrelated.

Please confirm whether the problem can be reproduced on a 4.8 kernel, where the offending EC patch had not yet been introduced upstream.

Say we have three stages:
1. Kernels before 4.9: the offending EC patches are not upstream, and no one reports the fan spinning problem.
2. From 4.9 to 4.12: problems are reported by different users.
3. Kernel 4.13 and later: problems are hard to reproduce after reverting the EC patches.

Now the question to me is: does the problem always exist?
If yes, no one reported it before 4.9 because it is hard to reproduce.
If no, there must be something else that also breaks suspend after 4.8.

(In reply to Zhang Rui from comment #9)
> (In reply to Tomislav Ivek from comment #4)
> > Created attachment 258475 [details]
> > dmesg output
> >
> > My apologies for the slow reply, I was away from this particular machine.
> >
> > In order to reproduce the bug on T470 I am now running 4.13.0. In the
> > attached dmesg the first standby-resume did not show the bug, but
> > immediately after the second standby-resume did.
>
> show what bug? the fan spinning bug?

Yes, the fan spinning bug after resume, from the title.

> I didn't see mei_me errors this time.
> so if the fan spins without the mei error, probably these are unrelated.
>

You are correct, there was no mei error this time.

> please confirm the problem can be reproduced in 4.8 kernel, where the EC
> offending patch has not been introduced into the upstream kernel or not.

I can do that. Building 4.8, will report back.

FWIW I'm on stock 4.13.2 and still experience this problem periodically. When it does, I see:

"thermal thermal_zone3: failed to read out thermal zone (-5)"

in dmesg. Same lspci as the one posted, etc. Will try the n1qur09w.iso firmware and report back.

(I cannot revert to 4.8, it's too old to support the XFS filesystems on this laptop.)

This actually is explainable.
The problem seems to be: when the CPU is busy exiting/entering C-states, the EC FW gets a failure when reading the CPU temperature via PECI.

I think this could even happen during runtime, not just post-resume.

FWIW, the n1qur09w firmware did not magically fix things. :(

(In reply to Zhang Rui from comment #9)
> please confirm the problem can be reproduced in 4.8 kernel, where the EC
> offending patch has not been introduced into the upstream kernel or not.

After a full week on 4.8, I am not able to reproduce the bug with the fan blowing after resume. For comparison, on 4.13.0 the bug appears within the first couple of resumes.

The bug appears within three to five days with regular standby-resumes on 4.13-rc4 with the following patches:
https://patchwork.kernel.org/patch/9870917/
https://patchwork.kernel.org/patch/9870915/
https://patchwork.kernel.org/patch/9870919/
https://patchwork.kernel.org/patch/9870925/

Hi,

Can you try several tests:
1. Boot a kernel with the following parameter to see if the problem can still be reproduced:
   idle=X
   where X can be one of "halt", "nomwait", or "poll"; I hope you can give all of them a try.
2. Make sure you are using intel_idle, then boot a kernel with the following parameter to see if the problem can still be reproduced:
   intel_idle.max_cstate=Y
   where Y can be a number from 0 to 9; I hope you can at least try 0, 1, 3, 6, and 9.
Thanks and best regards
Lv

It is said and confirmed that the original regression can be fixed by the following commit:

commit 662591461c4b9a1e3b9b159dbf37648a585ebaae
Author: Lv Zheng <lv.zheng@intel.com>
Date:   Wed Jul 12 11:09:09 2017 +0800

    ACPI / EC: Drop EC noirq hooks to fix a regression

    According to bug reports, although the busy polling mode can make
    noirq stages execute faster, it causes abnormal fan blowing up after
    system resume (see the first link below for a video demonstration)
    on Lenovo ThinkPad X1 Carbon - the 5th Generation. The problem can be
    fixed by upgrading the EC firmware on that machine.

    However, many reporters confirm that the problem can be fixed by
    stopping busy polling during suspend/resume and for some of them
    upgrading the EC firmware is not an option.

    For this reason, drop the noirq stage hooks from the EC driver to
    fix the regression.

    Fixes: c3a696b6e8f8 (ACPI / EC: Use busy polling mode when GPE is not enabled)
    Link: https://youtu.be/9NQ9x-Jm99Q
    Link: https://bugzilla.kernel.org/show_bug.cgi?id=196129

And what I learned from a different route is:
1. The bug occurred because there is a failure in EC FW reading a CPU temperature.
2. The failure can be triggered by different reasons.
   For the regression, it is triggered due to a C6-exit conflicting with an internal EC FW operation, and busy polling makes it easier to occur.
   For the bug reported here, probably there is a different reason.

Thanks and best regards
Lv

Hi Lv,

To be as expedient as possible I've done the tests on Fedora's 4.13.5-200.fc26.x86_64.

Hardware: T470 with the latest BIOS N1QET67W (1.42), Firmware revision 1.27 released on 10/10/17.

Methodology: for each proposed boot parameter I have repeatedly put the machine to standby and then resumed it until either the fan-blowing issue was reproduced or 30min have passed.
Considering that I can usually get the fan to blow within the first couple of standby-resumes, and within 3 minutes tops, this seems like a reasonable way to test all the options.

(In reply to Lv Zheng from comment #16)
> 1. Boot a kernel with the following parameters to see if the problem still
> can be reproduced:
> idle=X
> Where X can be a string of "halt/nomwait/poll", hope you can give all of
> them a try.

I was not able to reproduce the problem with any of idle=halt, idle=nomwait, and idle=poll. The maximum idle state with idle=nomwait was C1 and with idle=poll was C0. When using idle=halt, powertop did not report the C-state.

> 2. Make sure you are using intel_idle, then try to boot a kernel with the
> following parameters to see if the problem can still be reproduced:
> intel_idle.max_cstate=Y
> Where Y can be a number of 0-9, hope you can at least try 0,1,3,6,9.

I have verified that intel_idle is indeed in use. I was not able to reproduce the problem with intel_idle.max_cstate=0 up to 5. But the problem is immediately reproduced with values 6, 7, 8, and 9.

So, on T470 it seems the EC FW failure when reading a CPU temperature is triggered when C6 and C7 are enabled.

Thank you and please let me know if I can provide any other info.

Best,
Tomislav

(In reply to Tomislav Ivek from comment #18)
> > 2. Make sure you are using intel_idle, then try to boot a kernel with the
> > following parameters to see if the problem can still be reproduced:
> > intel_idle.max_cstate=Y
> > Where Y can be a number of 0-9, hope you can at least try 0,1,3,6,9.
>
> I have verified that intel_idle is indeed in use. I was not able to
> reproduce the problem with intel_idle.max_cstate=0 up to 5. But, the problem
> is immediately reproduced with values 6, 7, 8, and 9.
>

Are these done with or without the EC regression fixes?

As this also seems to be C-state related, please attach the output of "turbostat -debug" for both the 4.9 kernel and the 4.13 kernel.
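
For anyone repeating the intel_idle.max_cstate test above, a minimal sketch for confirming which cpuidle driver is active and which cap is in effect before starting the suspend/resume cycles. The two sysfs paths are the standard cpuidle and module-parameter locations; SYSFS_ROOT and cap_in_reported_safe_range are hypothetical helpers introduced here for illustration. The "safe range" only mirrors the values reported in this thread for the T470 (0 to 5 not reproducing, 6 to 9 reproducing); how a max_cstate number maps to a hardware C-state depends on the CPU's intel_idle table, so it should not be read as a general rule.

```shell
#!/bin/sh
# Sketch only. SYSFS_ROOT is a hypothetical override so the helpers can be
# pointed at a fake directory tree; leave it empty on a real machine.
SYSFS_ROOT="${SYSFS_ROOT:-}"

# Print the active cpuidle driver (for example "intel_idle"), or "unknown".
cpuidle_driver() {
    cat "$SYSFS_ROOT/sys/devices/system/cpu/cpuidle/current_driver" 2>/dev/null \
        || echo unknown
}

# Print the intel_idle.max_cstate value currently in effect, or "unknown".
intel_idle_max_cstate() {
    cat "$SYSFS_ROOT/sys/module/intel_idle/parameters/max_cstate" 2>/dev/null \
        || echo unknown
}

# Hypothetical helper: "yes" for the caps that did NOT reproduce the fan
# problem in the T470 report above (0-5), "no" for those that did (6-9).
cap_in_reported_safe_range() {
    case "$1" in
        [0-5]) echo yes ;;
        [6-9]) echo no ;;
        *)     echo unknown ;;
    esac
}

echo "cpuidle driver:        $(cpuidle_driver)"
echo "intel_idle.max_cstate: $(intel_idle_max_cstate)"
```

Recording both values next to each test result makes it easier to tell afterwards whether a run really used intel_idle with the intended cap.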
(In reply to Zhang Rui from comment #19)
> (In reply to Tomislav Ivek from comment #18)
> > > 2. Make sure you are using intel_idle, then try to boot a kernel with the
> > > following parameters to see if the problem can still be reproduced:
> > > intel_idle.max_cstate=Y
> > > Where Y can be a number of 0-9, hope you can at least try 0,1,3,6,9.
> >
> > I have verified that intel_idle is indeed in use. I was not able to
> > reproduce the problem with intel_idle.max_cstate=0 up to 5. But, the problem
> > is immediately reproduced with values 6, 7, 8, and 9.
> >
> are these done with or without the EC regression fixes?

The fix for 196129 is included:

$ git tag --contains 66259146
--snip--
kernel-4.13.5-200.fc26
--snip

Any others I should check for?

Please verify whether the same thing (disabling C6 and deeper) also works on a kernel without the EC fix.

It is said that when C6 and deeper states are allowed, the CPU temperature read, which leads to a PECI access, may happen too early, causing a conflict between the C6 exit and the PECI interface handling in the EC FW.

Commit 66259146 only fixes the C6-exit cause in the EC driver; there could be other C6-exit causes in upstream Linux changes.

I'm wondering if the CPU temperature read is caused by the OS evaluation of _WAK or something else, but I have no chance to confirm that as we are not EC FW developers.

Created attachment 260259 [details]
turbostat output for kernel 4.8.0
Created attachment 260261 [details]
turbostat output for kernel 4.13.5
Well, my bad, you should leave it running for a while to get output like the following:

Core CPU Avg_MHz Busy% Bzy_MHz TSC_MHz IRQ SMI CPU%c1 CPU%c3 CPU%c6 CPU%c7 CoreTmp PkgTmp GFX%rc6 GFXMHz Totl%C0 Any%C0 GFX%C0 CPUGFX% Pkg%pc2 Pkg%pc3 Pkg%pc6 Pkg%pc7 Pkg%pc8 Pkg%pc9 Pk%pc10 PkgWatt RAMWatt PKG_% RAM_%
- - 8 1.48 542 2208 927 0 2.64 0.04 0.91 94.94 27 27 99.23 1050 6.42 5.61 0.59 0.32 43.07 47.76 1.83 0.00 0.00 0.00 0.00 0.89 0.34 0.00 0.00
0 0 8 0.93 881 2208 107 0 2.74 0.03 1.51 94.79 26 27 99.23 1050 6.42 5.61 0.59 0.32 43.07 47.76 1.84 0.00 0.00 0.00 0.00 0.89 0.34 0.00 0.00
0 2 9 1.85 462 2208 235 0 1.82
1 1 10 1.96 498 2208 182 0 2.62 0.04 0.31 95.08 27
1 3 6 1.18 472 2208 403 0 3.39

(In reply to Zhang Rui from comment #26)
> well, my bad, you should leave it running for a while to get the output like
> following,

Terribly sorry, I presumed the debug output is given in the first part. Will do it later in the day as soon as I get back to that machine.

(In reply to Lv Zheng from comment #23)
> It is said that when C6 and upper is allowed, CPU temperature read which
> leads to a PECI access may be too early, causing a conflict between c6-exit
> and PECI I/F related stuffs in EC FW.
>
> The 66259146 only fixes C6-exit cause in EC driver, there could be different
> C6-exit cause in the linux upstream changes.
>
> I'm wondering if the CPU temperature read is caused by OS evaluation of _WAK
> or something else, but I have no chance to confirm that as we are not EC FW
> developers.

I noticed that these patches helped a lot on 4.13.0-rc4:
https://patchwork.kernel.org/patch/9870917/
https://patchwork.kernel.org/patch/9870915/
https://patchwork.kernel.org/patch/9870919/
https://patchwork.kernel.org/patch/9870925/
in the sense that I could run for days and dozens of standby-resume cycles before the problem would resurface. Without these patches I can reproduce the problem within minutes with only a couple of resumes.
So, you may be right, there are probably more C6-exit causes, but these four patches touch at least some.

(In reply to Zhang Rui from comment #22)
> please verify if the same thing (disable C6 and deeper) works for kernel
> without the EC fix or not.

I booted 4.12.14 which does not have the EC fix commit 66259146. C6 and deeper are disabled using the intel_idle.max_cstate=5 boot parameter. I am not able to reproduce the EC problem, so disabling C6 and deeper seems to help.

Created attachment 260273 [details]
turbostat output for kernel 4.8.0
Created attachment 260275 [details]
turbostat output for kernel 4.13.5
Created attachment 260277 [details]
turbostat output for kernel 4.13.5 after EC fail
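
When comparing the attached turbostat logs between kernels, it can help to pull a single column (for example the C6 or C7 residency) out of the summary row. The following is a small sketch, assuming the whitespace-separated layout of the sample quoted earlier in this thread; turbostat_col is a hypothetical helper name, and the two-line sample is copied from that quoted output, not from the attachments.

```shell
#!/bin/sh
# turbostat_col NAME: read turbostat-style text on stdin, locate the column
# called NAME in the header row, and print that column's value from the row
# immediately following the header (the system-wide summary row in the
# sample quoted in this thread).
turbostat_col() {
    awk -v want="$1" '
        col == 0 {
            # Still looking for the header row that contains the column name.
            for (i = 1; i <= NF; i++) if ($i == want) col = i
            next
        }
        { print $col; exit }
    '
}

# Header and summary row, truncated to the first twelve columns of the
# sample output quoted earlier in the thread.
sample='Core CPU Avg_MHz Busy% Bzy_MHz TSC_MHz IRQ SMI CPU%c1 CPU%c3 CPU%c6 CPU%c7
- - 8 1.48 542 2208 927 0 2.64 0.04 0.91 94.94'

printf '%s\n' "$sample" | turbostat_col 'CPU%c6'   # prints 0.91
printf '%s\n' "$sample" | turbostat_col 'CPU%c7'   # prints 94.94
```

On a saved log the same helper could be used as `turbostat_col 'CPU%c7' < saved-output.txt`, where the file name is a placeholder.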
(In reply to Tomislav Ivek from comment #28)
> (In reply to Zhang Rui from comment #22)
> > please verify if the same thing (disable C6 and deeper) works for kernel
> > without the EC fix or not.
>
> I booted 4.12.14 which does not have the EC fix commit 66259146. C6 and
> deeper are disabled using the intel_idle.max_cstate=5 boot parameter. I am
> not able to reproduce the EC problem, so disabling C6 and deeper seems to
> help.

I noticed that Fedora's 4.12.14 has ec.c patched with its own EC fix, my mistake. So I now report that I cannot reproduce the EC problem on mainline 4.12.0 without the EC fix and with C6 and deeper disabled.

> Noticed that Fedora's 4.12.14 has ec.c patched with its own EC fix.
Yes, regression fixes will be backported to stable kernels.
So care should be taken when you test with stable kernels (x.y.z) rather than upstream kernels (x.y-rcz).
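
To make the stable-versus-upstream distinction above concrete, here is a rough sketch that classifies a kernel.org style tag name by its shape. kernel_tag_kind is a hypothetical helper introduced for illustration, and the patterns are an assumption based on the tag forms mentioned in this thread (v4.14-rc6, v4.13, 4.13.5). It says nothing about distribution kernels such as Fedora's, which may carry their own backports regardless of version number.

```shell
#!/bin/sh
# kernel_tag_kind TAG: classify a tag name by its shape only.
#   vX.Y.Z (optionally with a suffix) -> stable     (may carry backported fixes)
#   vX.Y-rcN                          -> mainline-rc
#   vX.Y                              -> mainline
kernel_tag_kind() {
    case "$1" in
        v[0-9]*.[0-9]*.[0-9]*)   echo stable ;;
        v[0-9]*.[0-9]*-rc[0-9]*) echo mainline-rc ;;
        v[0-9]*.[0-9]*)          echo mainline ;;
        *)                       echo unknown ;;
    esac
}

for tag in v4.12 v4.13.5 v4.14-rc6; do
    echo "$tag: $(kernel_tag_kind "$tag")"
done
```

The tag shape is only a hint. Whether a given tree really contains the EC fix is better checked directly, for example with the `git tag --contains 66259146` query used earlier in this thread, or with `git merge-base --is-ancestor 66259146 <tag>` in a kernel git checkout.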
A related report on Red Hat's bugzilla has news that Lenovo is planning a firmware update in November. They will address the fan-blowing-up issue on T470 and other Kaby Lake models:
https://bugzilla.redhat.com/show_bug.cgi?id=1480844

Tomislav,
Regarding your test result in comment #27:
Are all four patches there required to see this big improvement, or are the first three sufficient?

(In reply to Len Brown from comment #35)
> Tomislav,
> Regarding your test result in comment #27
> Are all four patches there required to see this big improvement,
> or are the first three sufficient?

I believe I tested only with all four of them applied. I can try them one by one but it will take a while to verify.

Tomislav,

Glad to know you'll help to provide us useful information related to the EC changes.

If you can use the latest upstream kernel, please help to do the validation using the latest versions of the patches.

1. Apply the following 2 patches:
   https://patchwork.kernel.org/patch/9971553/
   https://patchwork.kernel.org/patch/9971557/

2. Apply the following 1 patch on top of the 2 patches:
   https://patchwork.kernel.org/patch/9977051/

3. Apply the following 2 patches on top of the 3 patches:
   https://patchwork.kernel.org/patch/9977049/
   https://patchwork.kernel.org/patch/9977047/

It's OK if you are not willing to switch to the latest upstream kernel. Then please follow Len's suggestion, thanks in advance.

Cheers,
Lv

Hi Lv,

I can do these tests and will report back as soon as I am sure about the results (sometimes it takes a bit to be sure the fan is behaving).

Tomislav

Hi,

I use a T470 with Fedora 27 (Kernel 4.13.8-300.fc27.x86_64).

For me the problem was completely gone for some time. But since some 4.13 versions it is back when the laptop wakes up from sleep while plugged into AC (always, can be easily reproduced). When waking up on battery it never happens.

I will be happy to provide logs for debugging. Please tell me what you need.
Tim

Tim, first, please check the previous comments of this bug report, and confirm whether it is exactly the same problem or not. To me, it should be. But if there is anything different from what Tomislav reported, please update it here.

(In reply to Lv Zheng from comment #37)
> Tomislav,
>
> Glad to know you'll help to provide us useful information related to the EC
> changes.
>
> If you can use latest upstream kernel, please help to do the validation
> using latest versioned patches.
>
> 1. applying the following 2 patches:
> https://patchwork.kernel.org/patch/9971553/
> https://patchwork.kernel.org/patch/9971557/
>
> 2. applying the following 1 patch on top of the 2 patches:
> https://patchwork.kernel.org/patch/9977051/
>
> 3. applying the following 2 patches on top of the 3 patches:
> https://patchwork.kernel.org/patch/9977049/
> https://patchwork.kernel.org/patch/9977047/
>
> It's ok if you are not willing to switch to the latest upstream kernel. Then
> please following Len's suggestion, thanks in advance.
>
> Cheers,
> Lv

Lv,

I did the tests on:
0. unpatched upstream tag v4.14-rc6,
1. v4.14-rc6 with the first two patches as above,
2. v4.14-rc6 with the first three patches,
3. v4.14-rc6 with all five patches you proposed.

In all these cases I am still able to reproduce the fan blowing after resume.
For case 2 it took some hours and reboots to reproduce, but that may have been a fluke.

(In reply to Tim Sinthofen from comment #39)
> For me the problem was completely gone for some time. But since some 4.13
> versions it is back when the Laptop wakes up from sleep when plugged into AC
> (always, can be easily reproduced). When waking up from battery it does
> never happen.

Up to v4.13-rc4 about a month ago I did all tests on AC as well as on battery, and I was always able to reproduce the bug. I did not think power was relevant, so I did the more recent tests on AC only. In the meantime the EC FW got updated, so maybe my assumption is now wrong.
Later in the day I can try the above patches on battery, too.

Tomislav

Tim,

> For me the problem was completely gone for some time. But since some 4.13
> versions it is back when the Laptop wakes up from sleep when plugged into AC
> (always, can be easily reproduced).

After reverting the EC noirq changes, I only added commits related to ECDT, which is boot related, not S3 related.
Rafael added code related to freeze mode, not related to S3.

I wonder if your test is correct. If you are directly using sysfs, you need to use a different command line sequence to achieve S3:
# echo deep > /sys/power/mem_sleep
# echo mem > /sys/power/state

So if the situation is getting worse, it is probably because some other causes have started to create more C6-exit chances during resume.
And such causes may not be limited to the EC driver.

Would you please try attachment 260383 [details] to confirm whether the problem disappears with the kernel boot parameter "acpi_resume_latency=25"?
This patch actually is closer to the root cause than the EC changes.

If the problem can be fixed by the workaround, please give attachment 260385 [details] a try to see if it can also fix the problem; it's just an experiment.

Tomislav,

> In all these cases I am still able to reproduce the fan blowing after resume.
> For case 2. it took some hours and reboots to reproduce but that may have
> been a fluke.

I also wonder if you are using the following command lines:
# echo deep > /sys/power/mem_sleep
# echo mem > /sys/power/state
If the command line is not wrong, we can ignore your comment #27.

(In reply to Lv Zheng from comment #42)
> Tomislav,
>
> > In all these cases I am still able to reproduce the fan blowing after resume.
> > For case 2. it took some hours and reboots to reproduce but that may have
> > been a fluke.
>
> I also wonder if you are using the following command lines:
> # echo deep > /sys/power/mem_sleep
> # echo mem > /sys/power/state
> If the command line is not wrong, we can ignore your comment #27.
Lv,

I tried again with the above command lines and see the same effect as reported in all cases (v4.14-rc6 no patch, patches 1-2, patches 1-3, patches 1-5).

Tomislav

Tomislav,

Thanks for the confirmation.

Could you give the following tests a try?
1. Apply only attachment 260383 [details] on top of the failing kernel;
   boot with "acpi_resume_latency=25".
2. Apply only attachment 260385 [details] on top of the failing kernel;
   boot without any parameter.

Thanks in advance.

Lv,

Sorry for the late reply. This was the first time I built my own kernel, so I had to figure out how first.

I just took the vanilla upstream kernel (4.14.0-rc6) and applied this patch:
https://bugzilla.kernel.org/attachment.cgi?id=260383&action=diff

Unfortunately using this kernel did not fix the issue.

Tim

Sorry, wrote too quick. I forgot to set the boot parameter.

With the boot parameter set, the issue seems to be gone. (Tried 10 times going into standby and waking it up again.)

(In reply to Lv Zheng from comment #44)
> Tomislav,
>
> Thanks for the confirmation.
>
> Could you give the following test a try?
> 1. apply only attachment 260383 [details] on top of the failing kernel
> boot with "acpi_resume_latency=25"
> 2. apply only attachment 260385 [details] on top of the failing kernel
> boot without any parameter
>
> Thanks in advance.

I can confirm Tim's recent observation. I am on 4.14.0-rc6 patched only with attachment 260383 [details]. Without the boot parameter I can reproduce the fan blowing bug. With the boot parameter "acpi_resume_latency=25" I am now entering day three of regular standby-resumes without the fan blowing after resume.

I would still like to give it a couple more days of testing, but so far this patch together with the boot parameter does seem like a significant improvement.

Tomislav

(In reply to Tomislav Ivek from comment #47)
> (In reply to Lv Zheng from comment #44)
> > Tomislav,
> >
> > Thanks for the confirmation.
> >
> > Could you give the following test a try?
> > 1. apply only attachment 260383 [details] on top of the failing kernel
> > boot with "acpi_resume_latency=25"
> > 2. apply only attachment 260385 [details] on top of the failing kernel
> > boot without any parameter
> >
> > Thanks in advance.
> ...

Forgot to mention I still have to test 260385 without any boot parameter, but as I mentioned I would still like to give 260383 a bit more time before doing that.

I think we needn't try attachment 260383 [details], as it cannot fix the issue. :)
I managed to obtain a failing machine from our customers and tested the patch on a 4.11 kernel; it didn't help.

I'll try to find out whether there is a better solution than attachment 260385 [details] on the failing system.

Well, I guess our best bet at the moment is Lenovo fixing it properly with a BIOS update. I guess all the solutions we tried here were just workarounds, right? Let's hope that this comment will stay true! (https://bugzilla.redhat.com/show_bug.cgi?id=1480844#c52)

(In reply to Lv Zheng from comment #49)
> I think we needn't try attachment 260383 [details], as it cannot fix the
> issue. :)
> I managed to obtain a failing machine from our customers, and tested the
> patch on 4.11 kernel, it didn't help.
>
> I'll try to find if there is a better solution than attachment 260385
> [details] on the failing system.

Hi Lv,

I tested attachment 260385 [details] on v4.14.0-rc6 without any boot parameters. The fan-spinning issue appeared immediately after the first resume.

Tomislav

> I guess all the solutions we tried here were just workarounds, right?
Yes, the only known fact to me is:
Any exit from a deep enough C-state during S3 resume can trigger the problem.
But there is nothing reasonable from the OS point of view.
I'm just developing something that can lower the reproduction rate, closer to what we might have on Windows.
Thanks and best regards
Lv
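
Since the observation above is specific to S3 resume, it may be worth confirming before each test run that "deep" (S3) is the selected suspend mode, following the mem_sleep command sequence given earlier in this thread. A minimal sketch, assuming the kernel exposes /sys/power/mem_sleep and marks the selected mode with square brackets (for example "s2idle [deep]"); active_mem_sleep is a hypothetical helper name.

```shell
#!/bin/sh
# active_mem_sleep: read mem_sleep-style text on stdin, such as
# "s2idle [deep]", and print the bracketed (currently selected) entry.
active_mem_sleep() {
    sed -n 's/.*\[\([^]]*\)].*/\1/p'
}

MEM_SLEEP_FILE=/sys/power/mem_sleep
if [ -r "$MEM_SLEEP_FILE" ]; then
    mode=$(active_mem_sleep < "$MEM_SLEEP_FILE")
    if [ "$mode" = "deep" ]; then
        echo "mem_sleep: deep (S3) is selected"
    else
        echo "mem_sleep: '$mode' is selected, not deep"
    fi
else
    echo "mem_sleep: $MEM_SLEEP_FILE is not readable on this system"
fi
```

If the selected mode is not deep, a suspend triggered through /sys/power/state would not be an S3 cycle, and results from such a run would not be comparable with the S3 reports here.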
Updated to BIOS 1.42 today. It did not fix the bug. (No special kernel, stock Fedora 27)

I updated my T470 to today's BIOS Revision 1.43, Firmware Revision 1.29. From the changelog at https://download.lenovo.com/pccbbs/mobiles/n1qur12w.txt :

"- Fixed an issue where fan might rotated with max speed due to not reading CPU temperature correctly."

On v4.14.0-rc6, a couple of standby-resume cycles in, the fan is so far behaving nicely, as opposed to the previous firmware revision where the fan-spinning issue appeared immediately after the first resume.

I plan to test this further for at least a couple more days.

I have also installed n1qur12w and will report back in a few days.

Using BIOS 1.43 for the last 2 days, I did not have the issue a single time. It seems to be fixed completely.

After another week of usage, the bug did not occur a single time.
I think this bug can be closed. Thanks!

(In reply to Tim Sinthofen from comment #57)
> After another week of usage, the bug did not occur a single time.
> I think this bug can be closed. Thanks!

Uptime of 8 days here, the bug did not occur. I agree it seems to be fixed with the new firmware.

Sleep/resume seems fine here. Thank you for your time!

Drat, it did it again. The only operational difference is that I had previously put the machine to sleep while in its docking station and resumed it after undocking. |