Created attachment 258449 [details]
acpidump of x1c gen5 (2017)

I'm sorry, I had not tested this thoroughly. I think the issue still exists on my PC, because I found the following patch in the Fedora src.rpm package:

$ rpm2cpio kernel-4.12.11-300.fc26.src.rpm | cpio -div
$ cat patch-4.12.11 | grep '@@.*acpi_ec_ecdt_probe' -A31
@@ -1812,24 +1812,6 @@ int __init acpi_ec_ecdt_probe(void)
 }
 
 #ifdef CONFIG_PM_SLEEP
-static int acpi_ec_suspend_noirq(struct device *dev)
-{
-	struct acpi_ec *ec =
-		acpi_driver_data(to_acpi_device(dev));
-
-	acpi_ec_enter_noirq(ec);
-	return 0;
-}
-
-static int acpi_ec_resume_noirq(struct device *dev)
-{
-	struct acpi_ec *ec =
-		acpi_driver_data(to_acpi_device(dev));
-
-	acpi_ec_leave_noirq(ec);
-	return 0;
-}
-
 static int acpi_ec_suspend(struct device *dev)
 {
 	struct acpi_ec *ec =
@@ -1851,7 +1833,6 @@ static int acpi_ec_resume(struct device *dev)
 #endif
 
 static const struct dev_pm_ops acpi_ec_pm = {
-	SET_NOIRQ_SYSTEM_SLEEP_PM_OPS(acpi_ec_suspend_noirq, acpi_ec_resume_noirq)
 	SET_SYSTEM_SLEEP_PM_OPS(acpi_ec_suspend, acpi_ec_resume)
 };

After receiving your e-mail, I tested the 4.13.1 kernel and reproduced the problem.
I copied the output record:
[fengzi@x1c ~]% while [ 1 ] ; do sensors | awk '{if ($0 ~ /Package/) temp = $4; else if ($0 ~ /fan/) {fan = $2; unit = $3}} END{print temp" "fan" "unit}'; sleep 2; done
+48.0°C 0 RPM
+47.0°C 0 RPM
+67.0°C 0 RPM
ERROR: Can't get value of subfeature temp1_input: I/O error
+60.0°C 65535 RPM
+61.0°C 65535 RPM
+49.0°C 65535 RPM
+48.0°C 65535 RPM
+47.0°C 0 RPM
ERROR: Can't get value of subfeature temp1_input: I/O error
+76.0°C 3492 RPM
+49.0°C 4538 RPM
+47.0°C 5208 RPM
+46.0°C 5836 RPM
+45.0°C 6423 RPM
+44.0°C 7025 RPM
+45.0°C 6976 RPM
+46.0°C 6960 RPM
+45.0°C 6960 RPM
+47.0°C 6960 RPM
+43.0°C 6960 RPM
+44.0°C 6960 RPM
^C%
[fengzi@x1c ~]% sensors
coretemp-isa-0000
Adapter: ISA adapter
Package id 0: +45.0°C (high = +100.0°C, crit = +100.0°C)
Core 0: +45.0°C (high = +100.0°C, crit = +100.0°C)
Core 1: +42.0°C (high = +100.0°C, crit = +100.0°C)

pch_skylake-virtual-0
Adapter: Virtual device
temp1: +43.5°C

acpitz-virtual-0
Adapter: Virtual device
temp1: +48.0°C (crit = +128.0°C)

iwlwifi-virtual-0
Adapter: Virtual device
temp1: +41.0°C

thinkpad-isa-0000
Adapter: ISA adapter
fan1: 6960 RPM

[fengzi@x1c ~]% uname -r
4.13.1-303.fc27.x86_64

Do I need to provide any additional information? Thanks an
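The awk one-liner above pulls the "Package id 0" temperature and the fan1 speed out of each sensors run. A minimal C sketch of the same per-line parsing follows. It assumes the plain-text sensors layout shown above (which is meant for humans, not a stable interface). It also flags 65535 RPM as suspect: that value is 0xFFFF, the all-ones 16-bit value, so it looks more like an invalid EC read right after resume than a real fan speed. That interpretation is an assumption, and all helper names here are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>

/* Hypothetical sentinel: 65535 == 0xFFFF. Assumed to mean "invalid read"
 * rather than a genuine fan speed. */
#define FAN_RPM_SUSPECT 65535L

/* Parse a line such as "fan1: 6960 RPM" from sensors output.
 * Returns true and stores the speed in *rpm only for fan1 lines. */
static bool parse_fan1_line(const char *line, long *rpm)
{
    assert(line != NULL && rpm != NULL);
    long value;
    /* The literal "fan1:" must match; %ld skips any padding spaces. */
    if (sscanf(line, " fan1: %ld", &value) != 1)
        return false;
    *rpm = value;
    return true;
}

/* Parse a line such as "Package id 0: +45.0°C (high = ...)" and store the
 * temperature in degrees Celsius. %lf stops at the degree sign. */
static bool parse_package_temp_line(const char *line, double *celsius)
{
    assert(line != NULL && celsius != NULL);
    int id;
    double value;
    if (sscanf(line, " Package id %d: %lf", &id, &value) != 2)
        return false;
    *celsius = value;
    return true;
}

/* True for readings that look like the all-ones value rather than a speed. */
static bool fan_rpm_is_suspect(long rpm)
{
    return rpm == FAN_RPM_SUSPECT;
}
```

Feeding each line of sensors output through these helpers and logging suspect readings separately would keep the brief 65535 bursts after resume from being mistaken for a real fan ramp.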
Oh, one thing I forgot: in comment 1, each "ERROR: Can't get value of subfeature temp1_input: I/O error" line in the output corresponds to one suspend and wake-up.
Have you tried to boot
This bug was reported by shian (fengziyonghu@163.com) and split from bug 196129.
Let me ask:
1. Could you upload the full dmesg here?
2. Have you tried booting with acpi.ec_freeze_events=N?
Thanks
Split off here so the thermal developers can investigate whether the "temp1_input" failure matters.
Created attachment 258455 [details]
full dmesg output after fan blows up

And here is the output from the shortest reproduction of the fan blow-up I could manage:
[fengzi@x1c ~]% while ... sensors ...
+44.0°C 0 RPM
+45.0°C 0 RPM
ERROR: Can't get value of subfeature temp1_input: I/O error
+54.0°C 65535 RPM
+45.0°C 65535 RPM
+44.0°C 65535 RPM
+44.0°C 65535 RPM
+44.0°C 0 RPM
ERROR: Can't get value of subfeature temp1_input: I/O error
+57.0°C 65535 RPM
+44.0°C 4470 RPM
+43.0°C 5136 RPM
+44.0°C 5747 RPM
+44.0°C 6396 RPM
+43.0°C 6960 RPM
+43.0°C 6993 RPM
^C%
[fengzi@x1c ~]% cat /proc/cmdline
BOOT_IMAGE=/vmlinuz-4.13.2-300.fc27.x86_64 root=UUID=c2247d70-f254-4b50-810a-231290113dd1 ro rhgb quiet LANG=zh_CN.UTF-8 acpi.ec_freeze_events=N
Please attach the output of "grep . /sys/class/thermal/thermal*/*" and "grep . /sys/class/thermal/thermal*/device/path" when the error happens.
Anyway, this looks more like proof that there is an EC FW bug. What the reverted commit did was make the problem trigger 100% of the time on S3 exit. And if there is a cause other than EC FW robustness (this needs to be confirmed by the thermal developers), then the Lenovo EC FW developers apparently need to improve something. So let me assign this to the thermal experts.
OTOH, can a similar test be done on Windows?
Your symptom cannot be reproduced on platforms that are fixed by the following regression fix:

commit 662591461c4b9a1e3b9b159dbf37648a585ebaae
Author: Lv Zheng <lv.zheng@intel.com>
Date:   Wed Jul 12 11:09:09 2017 +0800

    ACPI / EC: Drop EC noirq hooks to fix a regression

    According to bug reports, although the busy polling mode can make
    noirq stages execute faster, it causes abnormal fan blowing up after
    system resume (see the first link below for a video demonstration)
    on Lenovo ThinkPad X1 Carbon - the 5th Generation.

    The problem can be fixed by upgrading the EC firmware on that machine.

    However, many reporters confirm that the problem can be fixed by
    stopping busy polling during suspend/resume and for some of them
    upgrading the EC firmware is not an option.

    For this reason, drop the noirq stage hooks from the EC driver to fix
    the regression.

    Fixes: c3a696b6e8f8 (ACPI / EC: Use busy polling mode when GPE is not enabled)
    Link: https://youtu.be/9NQ9x-Jm99Q
    Link: https://bugzilla.kernel.org/show_bug.cgi?id=196129

The test result on that kind of platform is:
The fan RPM is always reported as 0 except for the moment after S3 resume to desktop. Upon S3 resume to desktop, I see three to four instances of 65536 rpm. At no time did I see an I/O error.

In conclusion, it looks like your issue is a different "invalid CPU temperature" issue and is not related to the original regression reported in bug 196129.
Created attachment 260251 [details]
grep /sys/class/thermal/* output

I added some comments to annotate the output; it also includes the kernel parameters.
(In reply to shian from comment #9)
> Created attachment 260251 [details]
> grep /sys/class/thermal/* output
>
> There are some comments that indicate the output,
> And contains kernel parameters,

Does the "ERROR: Can't get value of subfeature temp1_input: I/O error" error happen while the fan is blowing?
Please also attach the output of "grep . /sys/class/hwmon/hwmon*/*" both before and after the "ERROR: Can't get value of subfeature temp1_input: I/O error" error.
Created attachment 260253 [details]
sensors record (4.13.5)

(In reply to Lv Zheng from comment #8)
After upgrading to 4.13+, it seems the error never occurs on battery power. The attachment is my test record (before this, I had been running the new kernel for nearly a month).
As the record shows, the "ERROR: Can't get value of subfeature temp1_input: I/O error" message does not always appear.
Lenovo has released a new BIOS, n1mur10w. I flashed it, but the fan blow-up still happens.
(In reply to shian from comment #11)
> Created attachment 260253 [details]
> sensors record (4.13.5)
>
> (In reply to Lv Zheng from comment #8)
>
> After upgrading to 4.13+, It seem the error never occur when the battery
> powered.

You mean with the AC adapter plugged in, right?
Can you reproduce the error at all in 4.13+, including when not running on battery?

> The attachment is my test record

On which kernel?
Comment on attachment 260253 [details]
sensors record (4.13.5)

>[fengzi@x1c ~]% while [ 1 ] ; do sensors | awk '{if ($1 ~ /Package/) temp=$4;
>else if ($1 ~ /fan1/) rpm=$2;} END{print temp" "rpm" RPM"}' ; sleep 2; done
>+40.0°C 0 RPM  /* Start, battery only */
>+40.0°C 0 RPM  /* Before I pressed <fn + 4> (suspend) */
>ERROR: Can't get value of subfeature temp1_input: I/O error  /* After I pressed <fn> (wake up and good) */
>+45.0°C 65535 RPM
>+40.0°C 65535 RPM
>+40.0°C 65535 RPM
>+39.0°C 65535 RPM
>+40.0°C 0 RPM  /* Before I pressed <fn + 4> (suspend) */
>ERROR: Can't get value of subfeature temp1_input: I/O error  /* After I pressed <fn> (wake up and good) */
>+48.0°C 65535 RPM
>+40.0°C 65535 RPM
>+59.0°C 65535 RPM  /* Before I pressed <fn + 4> (suspend) */
>ERROR: Can't get value of subfeature temp1_input: I/O error  /* After I pressed <fn> (wake up and good) */
>+46.0°C 65535 RPM
>+40.0°C 65535 RPM
>+52.0°C 65535 RPM
>+40.0°C 65535 RPM
>+40.0°C 0 RPM  /* Before I pressed <fn + 4> (suspend) */
>ERROR: Can't get value of subfeature temp1_input: I/O error  /* After I pressed <fn> (wake up and good) */
>+52.0°C 65535 RPM
>+40.0°C 65535 RPM
>+45.0°C 65535 RPM
>+39.0°C 65535 RPM
>+40.0°C 0 RPM  /* Before I pressed <fn + 4> (suspend) */
>ERROR: Can't get value of subfeature temp1_input: I/O error  /* After I pressed <fn> (wake up and good) */
>+45.0°C 65535 RPM
>+40.0°C 65535 RPM
>+39.0°C 65535 RPM
>+40.0°C 65535 RPM
>+39.0°C 0 RPM  /* Before I pressed <fn + 4> (suspend) */
>ERROR: Can't get value of subfeature temp1_input: I/O error  /* *******Now, I plugged power (wake up and fan blow) */
>+43.0°C 3307 RPM
>+35.0°C 4354 RPM
>+35.0°C 5008 RPM
>+35.0°C 5597 RPM
>+35.0°C 6237 RPM  /* Before I pressed <fn + 4> (suspend) */
>+42.0°C 65535 RPM  /* After I pressed <fn> (wake up and good) */
>+36.0°C 65535 RPM
>+35.0°C 65535 RPM
>+35.0°C 65535 RPM
>+35.0°C 0 RPM  /* Before I pressed <fn + 4> (suspend) */
>ERROR: Can't get value of subfeature temp1_input: I/O error  /* After I pressed <fn> (wake up and fan blow) */
>+45.0°C 2790 RPM
>+36.0°C 4354 RPM
>+36.0°C 5008 RPM
>+36.0°C 5617 RPM  /* Before I pressed <fn + 4> (suspend) */
>ERROR: Can't get value of subfeature temp1_input: I/O error  /* After I pressed <fn> (wake up and fan blow) */
>+43.0°C 65535 RPM
>+36.0°C 4261 RPM
>+55.0°C 4966 RPM  /* Before I pressed <fn + 4> (suspend) */
>ERROR: Can't get value of subfeature temp1_input: I/O error  /* After I pressed <fn> (wake up and fan blow) */
>+45.0°C 65535 RPM
>+36.0°C 4279 RPM
>+53.0°C 4966 RPM  /* Before I pressed <fn + 4> (suspend) */
>ERROR: Can't get value of subfeature temp1_input: I/O error  /* After I pressed <fn> (wake up and fan blow) */
>+42.0°C 65535 RPM
>+36.0°C 4201 RPM
>+36.0°C 4870 RPM  /* Before I pressed <fn + 4> (suspend) */
>ERROR: Can't get value of subfeature temp1_input: I/O error  /* After I pressed <fn> (wake up and fan blow) */
>+43.0°C 65535 RPM
>+37.0°C 4219 RPM  /* Before I pressed <fn + 4> (suspend) */
>ERROR: Can't get value of subfeature temp1_input: I/O error  /* After I pressed <fn> (wake up and good) */
>+46.0°C 65535 RPM
>+37.0°C 65535 RPM
>+37.0°C 65535 RPM
>+36.0°C 65535 RPM
>+36.0°C 0 RPM
>^C%
>[fengzi@x1c ~]% cat /proc/cmdline
>BOOT_IMAGE=/vmlinuz-4.13.5-200.fc26.x86_64
>root=UUID=c2247d70-f254-4b50-810a-231290113dd1 ro rhgb quiet LANG=zh_CN.UTF-8
Please attach the output of "grep . /sys/class/hwmon/hwmon*/*" when the error happens.
Created attachment 260255 [details]
grep /sys/class/hwmon/hwmon*/*

(In reply to Zhang Rui from comment #10)
The "ERROR: Can't get value of subfeature temp1_input: I/O error" message only appears in the first output after wakeup (and not every time). After that, it no longer appears, whether or not the fan blows up.
So I can't capture the output at the moment the "ERROR: Can't get value of subfeature temp1_input: I/O error" message appears.

(In reply to Zhang Rui from comment #12)
With kernel 4.13.5, I can still reproduce the error with the AC adapter plugged in, but not on battery.
(In reply to shian from comment #15)
> Created attachment 260255 [details]
> grep /sys/class/hwmon/hwmon*/*
>
> (In reply to Zhang Rui from comment #10)
>
> "ERROR: Can't get value of subfeature temp1_input: I/O error" info only
> appears when the the first output after the wakeup(and does not always
> appear), After this, no matter whether there is a fan blow, this info will
> no longer appear.
> So, I can't get output when "ERROR: Can't get value of subfeature
> temp1_input: I/O error" appears.

I see. Then please attach the full sensors output. I want to confirm which sensor device raises the error message.
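Identifying the failing device can also be done without sensors by reading each hwmon chip's attributes directly. Below is a minimal C sketch, assuming the standard hwmon sysfs layout (a "name" attribute and a temp1_input attribute in millidegrees Celsius under /sys/class/hwmon/hwmonN). As far as I know, sensors derives chip names such as "iwlwifi-virtual-0" from that "name" attribute plus the bus type. The function names are hypothetical, and the root directory is a parameter so the walk can be pointed at a copy of the tree.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <dirent.h>
#include <errno.h>
#include <stdio.h>
#include <string.h>

/* Read a sysfs attribute holding one integer, e.g. temp1_input.
 * Returns 0 on success, or a positive errno value such as EIO. */
static int read_sysfs_long(const char *path, long *out)
{
    assert(path != NULL && out != NULL);
    FILE *f = fopen(path, "r");
    if (f == NULL)
        return errno;
    long value;
    int rc = 0;
    errno = 0;
    if (fscanf(f, "%ld", &value) == 1)
        *out = value;
    else
        rc = errno != 0 ? errno : EINVAL; /* read error vs. unparsable text */
    fclose(f);
    return rc;
}

/* Walk <root>/hwmon* (normally root = "/sys/class/hwmon") and print each
 * chip's "name" next to its temp1_input value or the read error.
 * Returns 0, or an errno value if root itself cannot be opened. */
static int report_hwmon_temps(const char *root, FILE *out)
{
    DIR *dir = opendir(root);
    if (dir == NULL)
        return errno;
    struct dirent *ent;
    while ((ent = readdir(dir)) != NULL) {
        if (strncmp(ent->d_name, "hwmon", 5) != 0)
            continue;
        char path[1024];
        char name[64] = "?";
        snprintf(path, sizeof path, "%s/%s/name", root, ent->d_name);
        FILE *nf = fopen(path, "r");
        if (nf != NULL) {
            if (fscanf(nf, "%63s", name) != 1)
                strcpy(name, "?");
            fclose(nf);
        }
        snprintf(path, sizeof path, "%s/%s/temp1_input", root, ent->d_name);
        long millideg = 0;
        int rc = read_sysfs_long(path, &millideg);
        if (rc == 0)
            fprintf(out, "%s (%s): %.1f C\n", ent->d_name, name, millideg / 1000.0);
        else
            fprintf(out, "%s (%s): %s\n", ent->d_name, name, strerror(rc));
    }
    closedir(dir);
    return 0;
}
```

Run right after resume with root set to "/sys/class/hwmon", one would expect only the iwlwifi entry to print "Input/output error", consistent with the full sensors output; chips without a temp1_input attribute simply report "No such file or directory". This expectation is untested on the affected machine.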
Created attachment 260257 [details] full sensors output Haha~ It is "iwlwifi-virtual-0".
Then I think I see what the problem is: the iwlwifi firmware is not loaded yet when you read the sysfs interface, so -EIO is returned.
Does the problem still exist if you wait for a while (say, 5 minutes) before running the sensors command after wakeup?
(In reply to Zhang Rui from comment #18)
> does the problem still exists if you wait for a while (say, 5 minutes)
> before running sensors command after wakeup?
No, it does not happen. As recorded in attachment 260257 [details], when sensors runs for the second time 2 seconds later, the iwlwifi output is correct.
So the fan blow-up seems to be independent of whether sensors hits the read error?
(In reply to shian from comment #19)
> (In reply to Zhang Rui from comment #18)
>
> > does the problem still exists if you wait for a while (say, 5 minutes)
> > before running sensors command after wakeup?
> No, It will not exist.
> As recorded in the attachment 260257 [details], When sensors run for the
> second time after 2 seconds, iwlwifi output is correct.

I mean: after wakeup, wait for a while, and then run the sensors command for the first time. Does the problem still happen?
IMO, this is a race: it can be reproduced if you read the sysfs interface (via the sensors command) before the iwlwifi firmware is loaded. So if we read the sysfs interface later, the problem will not occur.

> It seems the fan blow is independent of whether the sensor has read the
> error?

Yes, these are two different problems.
Now, let's look at the fan blow-up issue. Please confirm whether this is a duplicate of bug #196975.
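If the EIO really is a resume-time race with iwlwifi firmware loading, a monitoring tool can tolerate it by retrying only on EIO after a short delay, and failing fast on any other error. A minimal C sketch of that retry policy follows. The reader callback, the attempt count and the delay are hypothetical parameters, not values taken from iwlwifi or lm-sensors, and the scripted reader exists only to simulate a device that becomes ready after a few polls.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <errno.h>
#include <time.h>

/* A reader returns 0 on success or a positive errno value. */
typedef int (*reader_fn)(void *ctx, long *out);

static void sleep_ms(long ms)
{
    struct timespec ts = { .tv_sec = ms / 1000, .tv_nsec = (ms % 1000) * 1000000L };
    /* nanosleep stores the remaining time in ts if a signal interrupts it. */
    while (nanosleep(&ts, &ts) == -1 && errno == EINTR)
        ;
}

/* Retry only on EIO (the "device not ready yet" case assumed here);
 * success or any other error is returned immediately.
 * Returns 0 on success or the last errno value seen. */
static int read_with_retry(reader_fn reader, void *ctx, int attempts,
                           long delay_ms, long *out)
{
    assert(reader != NULL && attempts > 0);
    int rc = EIO;
    for (int i = 0; i < attempts; i++) {
        rc = reader(ctx, out);
        if (rc != EIO)
            return rc;
        if (i + 1 < attempts)
            sleep_ms(delay_ms);
    }
    return rc;
}

/* Hypothetical test double: replays a fixed list of results (e.g. EIO,
 * EIO, 0) and then keeps succeeding, reporting `value` on success. */
struct scripted_reader {
    const int *results;
    int n, pos;
    long value;
};

static int scripted_read(void *ctx, long *out)
{
    struct scripted_reader *s = ctx;
    int rc = s->pos < s->n ? s->results[s->pos] : 0;
    s->pos++;
    if (rc == 0)
        *out = s->value;
    return rc;
}
```

In a real tool the reader would wrap a sysfs read of temp1_input; waiting once after resume, as tested in this thread, achieves the same effect without any retry logic.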
(In reply to Zhang Rui from comment #20)
> I mean, after wakeup, wait for a while, and then run sensors command for the
> first time. does the problem exist?
OK: after wakeup, I waited for a while before running sensors, and the output was correct.

> Please confirm if this is a duplicate of bug #196975.
Yes, I think this is the same problem as bug #196975.
> > does the problem still exists if you wait for a while (say, 5 minutes)
> > before running sensors command after wakeup?
> No, It will not exist.
> As recorded in the attachment 260257 [details], When sensors run for the
> second time after 2 seconds, iwlwifi output is correct.

I think Rui was asking: "Does the fan still blow up if you wait for a while before running the sensors command after wakeup?"

> Yes, I think this is the same problem with bug #196975.

I still do not think they are the same issue.
Whenever the CPU temperature is invalid, the symptom is the same: the fan blows up.
But this symptom can have various root causes.
And bug #196975 is a known issue related to the EC FW specific to that platform, which is known to be different from yours.

Thanks and best regards
Lv
(In reply to Lv Zheng from comment #22)
> > > does the problem still exists if you wait for a while (say, 5 minutes)
> > > before running sensors command after wakeup?
> > No, It will not exist.
> > As recorded in the attachment 260257 [details], When sensors run for the
> > second time after 2 seconds, iwlwifi output is correct.
>
> I think Rui was asking if "FAN still blows up if you wait for a while before
> running sensors command after wakeup?".
>
> > Yes, I think this is the same problem with bug #196975.
>
> I still do not think they are same issue.
> It's same sympton if CPU temperature is invalid - FAN blows up.
>
> But this sympton can be caused by various root causes.
> And bug #196975 is a known one related to the EC FW specific to that
> platform which is known to be different from yours.
>

Why is bug #196975 known to be different from this one?
To confirm whether it is the same issue, please check:
1. Can the problem be reproduced with a 4.9 kernel?
2. In 4.13, do the same test as https://bugzilla.kernel.org/show_bug.cgi?id=196975#c18
(In reply to Lv Zheng from comment #22)
> I think Rui was asking if "FAN still blows up if you wait for a while before
> running sensors command after wakeup?".
I tested again and confirmed: even when the fan blows up, if I wait a few seconds before running sensors, the read never fails (there is no "ERROR: Can't get value of subfeature temp1_input: I/O error" message and the iwlwifi temperature is read correctly).

(In reply to Zhang Rui from comment #23)
> 1. can the problem be reproduced in 4.9 kernel?
Does the 4.9 kernel have to be a vanilla kernel? On Debian 9.1 with the 4.9 LTS kernel, I did not see this problem.

> 2. in 4.13, do the same test as
> https://bugzilla.kernel.org/show_bug.cgi?id=196975#c18
I ran some tests (they may not be rigorous enough); I will sort out the results and attach them later.
Created attachment 260267 [details]
boot kernel with idle=[halt/nomwait/poll] or intel_idle.max_cstate=[0-9]

My CPU model:
$ cat /proc/cpuinfo | grep -i name | head -n 1
model name : Intel(R) Core(TM) i7-7600U CPU @ 2.80GHz
Created attachment 260269 [details]
turbostat output in kernel 4.13.5

In the attachment, I added comments marking when each suspend and wake-up occurred.
I think the iwlwifi error is now understood: the firmware is not ready yet when its thermal interface is accessed.
From now on, "the problem" refers to the fan blow-up issue.
This problem looks similar to bug #196975; the difference is that intel_idle.max_cstate=5 fixes the problem for #196975 but not for you, and you need intel_idle.max_cstate=3.
> why bug #196975 is known to be different from this one?

Maybe shian should upload the dmidecode output here.
Created attachment 260287 [details]
dmidecode output (kernel 4.13.5)

(In reply to Lv Zheng from comment #28)
Yes, bug 196975 and my fan blow-up seem to be somewhat different. I tested again: when booted with intel_idle.max_cstate=3, the fan ran correctly across 40 suspend/wakeup cycles. When booted with intel_idle.max_cstate=4, the fan blew up after the 2nd wakeup.
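Because these results depend on which intel_idle.max_cstate (or acpi.ec_freeze_events) value was actually in effect, it helps to read the parameter back from /proc/cmdline before each test run. A minimal C sketch of such a lookup follows. It is a simplification that assumes parameters are separated by spaces, does not handle quoted values containing spaces, and returns the first occurrence of a key; the function name is hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Find "key=value" in a kernel command line (e.g. the contents of
 * /proc/cmdline) and copy the value into val, truncating it to fit.
 * Bare flags such as "ro" have no '=' and never match.
 * Returns true if the key is present. */
static bool cmdline_get(const char *cmdline, const char *key,
                        char *val, size_t len)
{
    assert(cmdline != NULL && key != NULL && val != NULL && len > 0);
    size_t klen = strlen(key);
    const char *p = cmdline;
    while (*p != '\0') {
        while (*p == ' ' || *p == '\n')      /* skip separators */
            p++;
        const char *end = p;
        while (*end != '\0' && *end != ' ' && *end != '\n')
            end++;
        size_t tlen = (size_t)(end - p);     /* length of this token */
        if (tlen > klen && strncmp(p, key, klen) == 0 && p[klen] == '=') {
            size_t n = tlen - klen - 1;
            if (n >= len)
                n = len - 1;
            memcpy(val, p + klen + 1, n);
            val[n] = '\0';
            return true;
        }
        p = end;
    }
    return false;
}
```

For example, applied to the command line captured earlier in this thread, looking up "acpi.ec_freeze_events" yields "N", while "intel_idle.max_cstate" is reported as absent.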
To shian: Would you please try attachment 260383 [details] to confirm if the problem disappears with kernel boot parameter of "acpi_resume_latency=25"? If the problem can be fixed by the workaround, please give attachment 260385 [details] a try to see if it can also fix the problem. Thanks in advance.
(In reply to Lv Zheng from comment #30)
> To shian:
>
> Would you please try attachment 260383 [details] to confirm if the problem
> disappears with kernel boot parameter of "acpi_resume_latency=25"?
>
Great! I tried attachment 260383 [details] with acpi_resume_latency=25 on master commit ae59df0349ba. After 30 suspend/wakeup cycles, I did not see the fan blow-up problem.
I will try attachment 260385 [details] later.
OK. With attachment 260385 [details], I saw no fan blow-up in 30 suspend/wakeup cycles on kernel tag v4.10 without any boot parameters.
The patch turns runtime power resources _ON before invoking _WAK.
The following is what I got from the Windows AMLi debugger:

AMLI: ffffe000cb4ad040: AsyncEvalObject(\_SB.PCI0.PEG0.PG00._OFF)
AMLI: ffffe000cb4ad040: AsyncEvalObject(\_SB.PCI0.PEG1.PG01._OFF)
AMLI: ffffe000cb4ad040: AsyncEvalObject(\_SB.PCI0.PEG2.PG02._OFF)
AMLI: ffffe000ca279540: AsyncEvalObject(\_PTS)
AMLI: ffffe000ca279540: AsyncEvalObject(\_SB.PCI0._ADR)
...
AMLI: ffffe000cb4aa040: AsyncEvalObject(\_WAK)
AMLI: ffffe000ca279540: AsyncEvalObject(\_SB.PCI0.RP01._ADR)
...
AMLI: ffffe000ca279540: AsyncEvalObject(\_SB.PCI0.LPCB.EC._Q1C)
AMLI: ffffe000ca279540: AsyncEvalObject(\_SB.PCI0.RP08._ADR)
...
AMLI: ffffe000ca279540: AsyncEvalObject(\_SB.LNKA._DIS)
AMLI: ffffe000ca279540: AsyncEvalObject(\_SB.LNKB._DIS)
AMLI: ffffe000cb8b0040: EvalNameSpaceObject(\_TZ.TZ0._CRT)
AMLI: ffffe000ca279540: AsyncEvalObject(\_SB.LNKC._DIS)
AMLI: ffffe000ca279540: AsyncEvalObject(\_SB.LNKD._DIS)
AMLI: ffffe000ca279540: AsyncEvalObject(\_SB.LNKE._DIS)
AMLI: ffffe000cb8b0040: AsyncEvalObject(\_TZ.TZ0._TMP)
AMLI: ffffe000ca279540: AsyncEvalObject(\_SB.LNKF._DIS)
AMLI: ffffe000ca279540: AsyncEvalObject(\_SB.LNKG._DIS)
AMLI: ffffe000ca279540: AsyncEvalObject(\_SB.LNKH._DIS)
AMLI: ffffe000ca279540: AsyncEvalObject(\_SB.PCI0.PEG2.PG02._ON)
AMLI: ffffe000ca279540: AsyncEvalObject(\_SB.PCI0.PEG1.PG01._ON)
AMLI: ffffe000ca279540: AsyncEvalObject(\_SB.PCI0.PEG0.PG00._OFF)
AMLI: ffffe000cb4a7040: AsyncEvalObject(\_SB.LID0._LID)
AMLI: ffffe000ca279540: AsyncEvalObject(\_SB.LID0._PSW)
AMLI: ffffe000cb4aa040: AsyncEvalObject(\_SB.PCI0.PEG0.PG00._ON)
...

On Windows, no _ON/_OFF is executed between _PTS and _WAK, which is probably significant.
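The observation about the Windows trace can be checked mechanically by counting power-resource _ON/_OFF evaluations between the first \_PTS and the following \_WAK. A minimal C sketch follows, assuming AMLi-style lines in which each evaluated object path is wrapped in parentheses as above. The function names are hypothetical, and the "Linux-like" ordering in the usage example is a made-up trace illustrating the patch's behavior of turning a power resource _ON before _WAK, not a captured log.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* True if the line evaluates a power resource _ON or _OFF method,
 * e.g. "AsyncEvalObject(\_SB.PCI0.PEG0.PG00._OFF)". */
static bool is_power_resource_method(const char *line)
{
    return strstr(line, "._ON)") != NULL || strstr(line, "._OFF)") != NULL;
}

/* Count _ON/_OFF evaluations between the first \_PTS and the next \_WAK.
 * Returns -1 if either marker is missing. */
static int count_power_ops_between_pts_and_wak(const char *const *lines,
                                               size_t n)
{
    assert(lines != NULL);
    size_t i = 0;
    while (i < n && strstr(lines[i], "(\\_PTS)") == NULL)
        i++;
    if (i == n)
        return -1;                      /* no _PTS in the trace */
    int count = 0;
    for (i++; i < n; i++) {
        if (strstr(lines[i], "(\\_WAK)") != NULL)
            return count;
        if (is_power_resource_method(lines[i]))
            count++;
    }
    return -1;                          /* _PTS seen but no _WAK after it */
}
```

On the Windows excerpt above this should return 0. On a hypothetical trace with an _ON evaluation between \_PTS and \_WAK it returns 1, which is the ordering difference this comment points at.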
I don't think attachment 260383 [details] fixes the issue. :) I obtained a failing machine from one of our customers and tested the patch on a 4.11 kernel; it didn't help. I'll look for a better solution than attachment 260385 [details] on the failing system.
I'm worried that the problem on my PC is not the same as on other PCs. I tested attachment 260383 [details] on v4.11 with acpi_resume_latency=25 to see whether the result differs from master commit ae59df0349ba, but I still cannot reproduce the problem on v4.11 with the patch.
I flashed the new BIOS n1mur11w, which fixes an issue where the fan might spin at maximum speed because the CPU temperature was not read correctly. However, that does not seem to be why my fan spins at maximum speed: I can still reproduce the problem with this BIOS.
Today I tested the new BIOS for a long time. I now find that when the fan spins up after wakeup, it is because the machine's temperature really is high; once the temperature drops to normal levels, the fan stops. I tested hundreds of suspend/wakeup cycles. Perhaps the problem is gone?
Good news. The fan blows either because the temperature is really high or because the CPU temperature is read incorrectly. As the latter case seems to be gone, I'm closing this bug. Please feel free to reopen it if the problem comes back.
Created attachment 261173 [details]
turbostat output (kernel-4.14.3 BIOSCD[n1mur11w].iso)

The fan blow-up happened again. It has become very hard to reproduce, so I have uploaded all the relevant information captured at the time it occurred.
Created attachment 261175 [details] dmidecode output (kernel-4.14.3 BIOSCD[n1mur11w].iso)
Created attachment 261177 [details] sensors output (kernel-4.14.3 BIOSCD[n1mur11w].iso)
Created attachment 261179 [details] full dmesg output (kernel-4.14.3 BIOSCD[n1mur11w].iso)
Created attachment 261181 [details] acpidump output (kernel-4.14.3 BIOSCD[n1mur11w].iso)