Bug 191181
Summary: | Fans blowing at max speed after resuming - ThinkPad X1/T4xx series | ||
---|---|---|---|
Product: | ACPI | Reporter: | Tatsuyuki Ishi (ishitatsuyuki) |
Component: | Power-Thermal | Assignee: | Zhang Rui (rui.zhang) |
Status: | RESOLVED DUPLICATE | ||
Severity: | normal | CC: | andreas, ateplih, brown, claudio.sacerdoticoen, dewmax, dikiy_evrej, fengziyonghu, haoxian.zeng, info, j.gjorgji, karolszk, lenb, lukemidworth, lv.zheng, m-bugzilla, marcoen, markus.t.h.kernel, max.deineko, nanochaves, nelg, nickj21, nicolopiazzalunga, njkkow, njlmerchant, opensource+kernel, ormandj, p.lettich, philipp.keller, robert.n.sharp, rui.zhang, sander, skyler, srinivas.pandruvada, stgraber, tnelis, tomislav.ivek, yu.c.chen, zach.moazeni, zeba.hrvoje |
Priority: | P1 | ||
Hardware: | Intel | ||
OS: | Linux | ||
Kernel Version: | 4.9 | Subsystem: | |
Regression: | Yes | Bisected commit-id: | |
Attachments: |
required logs
Force D3
config.gz from 4.10.6-1-ARCH
dmesg,powertop,turbostat
sensors
logs showing wakeup with connected/disconnected power adapter
sensors during stress
acpidump
customized _TMP method for Nicolo'
X270 acpidump, 4.10.8 config
customized DSDT : dump _TMP method for Nicolo'
grep and dmesg before and after problem
grep and dmesg before and after problem -- corrected
acpidump from old firmware
x270 acpi dump without issue
testA
[PATCH] ACPI: EC: Revert back to default wait polling style processing in noirq stage
[PATCH] ACPI: EC: Mark a possible IRQ storm period
Disable deeper C-states
[PATCH] Tune S3 resume step order
attachment-15943-0.html |
Description
Tatsuyuki Ishi
2016-12-26 13:45:10 UTC
Still not fixed in 4.10, and downgrading to 4.8.14 resolves the problem. Please take a look.

I can confirm this exact same problem on a ThinkPad X1 Carbon 4th. More precise machine information:
DMI: LENOVO 20FB0043GE/20FB0043GE, BIOS N1FET49W (1.23 ) 02/08/2017

Same problem with a ThinkPad T460s and kernels of the 4.9 and 4.10 series: after suspend, the sensor acpitz-virtual-0 sometimes (seemingly at random) reports 20+ degrees more than the actual temperature, thus triggering the fan. Perhaps the handling of this sensor was somehow broken in the newer kernels.

Confirmed as well on a ThinkPad X1 Carbon 4th.

This doesn't happen with a 4.11 kernel built without ISH support; adding maintainers to the CC list.
PS: the sensors were working very poorly, so I suspected it's corrupting something too.

I'm also suffering from https://lkml.org/lkml/2017/3/21/900 on 4.11-rc4, and it makes my tests unreliable. It *always* crashes on the second resume. Waiting for the next merge to perform a test with ISH enabled.
Interesting factor: after the BUG is triggered, I guess the kernel is somewhat frozen, and that also makes the fan spin up indefinitely. (This is reproduced with ISH turned off.)

I confirm that the problem is about the ISH support. Hibernation, instead, has no problem. Moreover, when the fan is running at full throttle after a resume from suspend, hibernating and resuming fixes the issue.

Could you try these steps:
- Build 4.11 with ISH support, which you suspect here.
- Boot.
- Attach dmesg.
- Attach lsmod | grep -i ish
- systemctl status iio-sensor-proxy
- Attach cat /proc/interrupts just before suspend.
- Suspend and resume.
- Attach cat /proc/interrupts again. Does it show a huge difference in the counts?
Boot again, and next time run
# systemctl stop iio-sensor-proxy
and then suspend/resume.

Created attachment 255671 [details]
required logs
All the logs required, in a .tgz file
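The comparison requested above (whether /proc/interrupts shows a huge difference in counts across a suspend/resume cycle) can be sketched as follows. This is only a minimal illustration, assuming the usual /proc/interrupts layout of a CPU header row followed by one row per interrupt; the function names and the sample snapshots are hypothetical and are not taken from the attached logs.

```python
def irq_totals(snapshot: str) -> dict:
    """Sum the per-CPU counts of every row in a /proc/interrupts snapshot."""
    lines = snapshot.strip().splitlines()
    ncpu = len(lines[0].split())  # header row: CPU0 CPU1 ...
    totals = {}
    for line in lines[1:]:
        name, sep, rest = line.partition(":")
        if not sep:
            continue  # not an interrupt row
        # Only the first ncpu columns can be counters; the rest is description.
        counts = [int(tok) for tok in rest.split()[:ncpu] if tok.isdigit()]
        totals[name.strip()] = sum(counts)
    return totals


def irq_deltas(before: str, after: str) -> list:
    """Return (irq, increase) pairs, largest increase first."""
    b, a = irq_totals(before), irq_totals(after)
    deltas = {name: a[name] - b.get(name, 0) for name in a}
    return sorted(deltas.items(), key=lambda kv: kv[1], reverse=True)


# Hypothetical two-CPU snapshots, for illustration only.
BEFORE = """
           CPU0       CPU1
  1:         10          5   IO-APIC   1-edge      i8042
  9:        100         50   IO-APIC   9-fasteoi   acpi
"""
AFTER = """
           CPU0       CPU1
  1:         10          5   IO-APIC   1-edge      i8042
  9:       5100         50   IO-APIC   9-fasteoi   acpi
"""

if __name__ == "__main__":
    for irq, delta in irq_deltas(BEFORE, AFTER):
        print(irq, delta)
```

A row whose count jumps by several orders of magnitude during a short suspend/resume cycle would be the candidate for an interrupt storm; small increases everywhere suggest there is none.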
(In reply to Claudio Sacerdoti Coen from comment #9)
> Created attachment 255671 [details]
> required logs
>
> All the logs required, in a .tgz file

The names of the files should map easily to the logs requested:
interrupts_{before,after} were taken with iio-sensor-proxy not stopped;
interrupts_stopped_{before,after} were taken with iio-sensor-proxy stopped.
Stopping iio-sensor-proxy did not solve the issue.

Created attachment 255673 [details]
Force D3
I don't see any interrupt storm. Do you have a Fan device here?
# cat /sys/class/thermal/cooling_device*/type
Also a test/debug patch is attached. Please try it and send me the dmesg.

claudio@zenone:~$ cat /sys/class/thermal/cooling_device*/type
Processor
Processor
Processor
Processor
intel_powerclamp
iwlwifi
I will try the test/debug patch and let you know.

(In reply to Tatsuyuki Ishi from comment #0)
> I haven't noticed any suspicious dmesg, but this one is a little bit strange:
> thermal thermal_zone2: failed to read out thermal zone (-5)
> This was zone3 when I'm running 4.8 though.

I am using a ThinkPad T470s. I was having this problem (fan runs at maximum speed after resuming from suspend to RAM) several days ago. But the problem has been gone since the day before yesterday, when I upgraded my OS (openSUSE Tumbleweed) to the latest snapshots, 20170324 and 20170328. (I was testing something, so these two were left to be updated together.) The kernel source is 4.10.4-1.4 or 4.10.5.
Today, when I was testing to confirm that the problem had vanished, I noticed that I have this same warning in the output of `dmesg` too:
`thermal thermal_zone2: failed to read out thermal zone (-5)`
It is not relevant to this problem, at least not for me.

Zeng, please confirm that your kernel is built with ISH support (CONFIG_INTEL_ISH) and that you have tried multiple times to reproduce the problem.
Please also check that the fan spins up under heavy load; there's a rare case where it will not spin at all, burning your machine.

(In reply to Tatsuyuki Ishi from comment #15)
> Zeng, please confirm that your kernel is built with ISH support
> (CONFIG_INTEL_ISH) and you have tried multiple times to reproduce the
> problem.

Hi, I think I do not have ISH support because `lsmod | grep -i ish` outputs nothing. So maybe my previous problem is not the same as this one, although the symptoms match well.

> Please also check if the fan spins up on heavy load, there's a rare case
> where it will not spin at all burning your machine.

Yes, the fan runs well under a load test.

Zeng, it's known that this bug won't trigger on kernels with CONFIG_INTEL_ISH disabled. Have nice days with that build (it's now on by default on most distros despite the driver's poor quality).

Hi Ishi, thanks for your confirmation. I hope you guys get a solution soon.

Hi Ishi, one remark: for me too, the command `lsmod | grep -i ish` does not output anything, but I clearly have the problem: on a T460s with 4.10.6 the sensor sometimes registers 20+ degrees more and triggers the fan. Is that strange?

Nicolo, I haven't built a custom build of 4.10.6 yet, so I'm not 100% confident. I've heard you're running Arch. If so, the stock kernel is built with CONFIG_INTEL_ISH. Do you blacklist the modules? That won't solve the problem at all (the reason remains unknown).

Tatsuyuki Ishi: You are saying that blacklisting the module doesn't help. Is this correct?

Yes, I'm running Arch, but I'm not blacklisting anything, at least not intentionally; I just take the stock kernel from Arch.

Nicolo: Can you upload your kernel config (/boot/config-4.10 ..)?

Sure, could you be more precise? My boot folder only contains
BOOT initramfs-linux-fallback.img intel-ucode.img vmlinuz-linux
EFI initramfs-linux.img loader
Where can I find the file you mention? (sorry for my ignorance)

I don't use Arch Linux, but on a running system (provided Arch Linux enabled the in-kernel config) you can try
# zcat /proc/config.gz

Can everyone confirm whether you are using tpacpi-bat from <https://github.com/teleshoes/tpacpi-bat>? And whether the problem remains after removing tpacpi-bat?

Created attachment 255697 [details]
config.gz from 4.10.6-1-ARCH
Srinivas: that worked, here's the file.
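For anyone else wanting to answer the "is ISH built in, built as a module, or disabled" question from a running system, here is a minimal sketch that reads the same /proc/config.gz file. It assumes the kernel exposes its configuration there; the helper names are hypothetical, and CONFIG_INTEL_ISH is simply the symbol name used in this thread, so the exact symbol in a given tree should be checked rather than taken from this example.

```python
import gzip


def parse_kconfig(text: str) -> dict:
    """Map CONFIG_* symbols to 'y', 'm', a literal value, or 'n' if not set."""
    options = {}
    suffix = " is not set"
    for line in text.splitlines():
        line = line.strip()
        if line.startswith("CONFIG_") and "=" in line:
            key, _, value = line.partition("=")
            options[key] = value
        elif line.startswith("# CONFIG_") and line.endswith(suffix):
            # "# CONFIG_FOO is not set" means the option is disabled.
            options[line[2:-len(suffix)]] = "n"
    return options


def load_running_config(path: str = "/proc/config.gz") -> dict:
    """Parse the configuration of the running kernel, if it is exposed."""
    with gzip.open(path, "rt") as fh:
        return parse_kconfig(fh.read())


if __name__ == "__main__":
    config = load_running_config()
    # Symbol name as written in this thread; treat it as an assumption.
    print(config.get("CONFIG_INTEL_ISH", "symbol not present"))
```

A value of "m" only says the driver can be loaded as a module; whether the hardware is actually present is a separate question that the lsmod check answers.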
Thanks Nicolo. You have ISH enabled as a module. Since you don't see any ish module in lsmod, your system doesn't have ISH. Also, thermal sensors are not handled by ISH, so not all sensors are owned by ISH. I guess there is some problem somewhere else; maybe on ISH-enabled systems the problem gets noticed more often because there are some more interrupts after resume. Anyway, in case it is ISH, I uploaded a test patch, and Claudio Sacerdoti will probably try it. We can go from there. Based on his current logs there are no interrupt storms or anything, but we will try a couple more options to isolate it. I have two other ThinkPads, and I don't see any issue on them. If it is not ISH, we will treat this as a thermal regression and debug it further. There are a number of reasons this could be triggered, so if anybody wants to help debug, please do.

I see, thanks. At least in my case it is clearly the sensor acpitz-virtual-0 that sometimes reports 20+ degrees more (say 48C instead of 28C) after suspend to RAM, and triggers the fan. I'm on a T460s (no GPU) with the above kernel. Let me know whether I should perform some tests.

Nicole: Try this
# turbostat --debug
In another window run
# sudo powertop --time=30 --html
# echo mem > /sys/power/state
Wake up with the power button after suspend. Let powertop and turbostat run for a few minutes, then copy and paste the screen output to a file and send it. powertop will also generate an html file; attach both.

Nicole: Also include the output of dmesg, from boot until suspend and resume are complete, once you think you have reached the problem condition:
# dmesg > ~/dmesg.txt

Created attachment 255699 [details]
dmesg,powertop,turbostat
I'm attaching:
-dmesg after the problem
-powertop files from before suspending, after suspending without the problem, and after suspending with the problem
-turbostat files with and without the problem
Btw I'm Nicolo' :)
Created attachment 255703 [details]
sensors
Perhaps I'm overemphasizing this, but here it's clear which sensor is not working. Also, I tried to reproduce the problem by using echo mem > /sys/power/state (as opposed to just closing the laptop lid), but it seems to happen less often (or maybe I was just unlucky).
Nicolo: Thanks for sending logs.
From dmesg
- There is no ISH in the system.
From both the "with" and "without" turbostat/powertop logs
- Temperature is around 30C, which is not a high temperature of concern (105C is max).
- The processor is almost idle (look at CPU idle). You are even reaching PC7. The processor is active less than 3% of the time. Actually, your "with" logs show less activity than "without".
So from the core CPU perspective there is no concern here. The system is idle with deep C states. Package and core temperatures are low.

The only concern here is why the fan is still blowing. Your "powertop with after" log shows
4660 rpm Device Laptop fan
Are you charging the laptop?
Also dump
grep . /sys/class/thermal/*
If Linux OS thermal is supposed to control the fan, it will be here. If it is controlled by the "Embedded controller", then it will not be here. You may want to look at a ThinkPad fan control program. There is a way to control the fan from user space. This may be in /sys/devices/platform/thinkpad_hwmon by controlling pwm. Also, some ThinkPads will have /proc/acpi/ibm/fan.

Regarding your sensor.tar
acpitz-virtual-0
Adapter: Virtual device
temp1: +48.0°C (crit = +128.0°C)
coretemp-isa-0000
Adapter: ISA adapter
Package id 0: +23.0°C (high = +100.0°C, crit = +100.0°C)
Core 0: +23.0°C (high = +100.0°C, crit = +100.0°C)
Core 1: +23.0°C (high = +100.0°C, crit = +100.0°C)
The ACPI virtual temperature should be close to the package temperature, which is close to what turbostat also pointed out. I think the fan controller in the EC is looking at acpi-virtual, which is not correct here. So this is broken.
We can also look at
# acpidump > acpi.out
Also, you can run some workload like
# stress -c 4
and run sensors and see whether "acpitz-virtual-0" changes. I guess it will show 48C.

This bug doesn't seem to trigger (the fan stays quiet) if the laptop is woken up from standby while connected to the power adapter.

Just upgraded to 4.10.8; the bug seems to be triggered less often. All the things I know:
1. I'm not supposing that this problem is directly related to the thermal sensors, but it's likely that something is corrupting EC registers.
2. Manually setting the EC fan control level works, but the fan will blow back up when you revert it to "auto".
0xbb: No, the bug is very likely to trigger when connected to the power adapter.

Created attachment 255727 [details]
logs showing wakeup with connected/disconnected power adapter
I am almost sure that this bug never triggered during the last month when I woke up the laptop while charging. I attached some dmesg logs showing this behavior:
1. dmesg_before_unpugged.txt: normal running; disconnected the power adapter.
2. dmesg_wakeup_disconnected.txt: suspend; wake up; fan is at max speed.
3. dmesg_wakeup_pluggedin.txt: suspend again; connected power adapter; wake, fan control is working again.

I just tried and it is triggered for me even while the laptop is charging.

Srinivas: I can confirm the problem may appear both while charging and on battery.
grep . /sys/class/thermal/*
grep: /sys/class/thermal/cooling_device0: Is a directory
grep: /sys/class/thermal/cooling_device1: Is a directory
grep: /sys/class/thermal/cooling_device2: Is a directory
grep: /sys/class/thermal/cooling_device3: Is a directory
grep: /sys/class/thermal/cooling_device4: Is a directory
grep: /sys/class/thermal/cooling_device5: Is a directory
grep: /sys/class/thermal/thermal_zone0: Is a directory
grep: /sys/class/thermal/thermal_zone1: Is a directory
grep: /sys/class/thermal/thermal_zone2: Is a directory
grep: /sys/class/thermal/thermal_zone3: Is a directory
ls /sys/devices/platform/thinkpad_hwmon/
driver fan1_input modalias power pwm1_enable uevent
driver_override hwmon name pwm1 subsystem
I didn't try the specific stress you mention, but with a similar program generating prime numbers there are 2 cases: either we start with the normal situation, then the system heats and the fan regularly starts, or we start with the fan problem, in which case the fan basically blows at 4k rpm all the time. I will try this again tomorrow to monitor the sensor's behavior more closely.
As for acpidump, I cannot find a simple way to install it in Arch Linux; maybe someone can help?
I agree that something is looking at acpi-virtual-0, which is sometimes broken after suspend to RAM for kernels after 4.8. Please let me know if I should perform any other tests.
Srinivas: I have some problems recompiling 4.11 git on my laptop; I will try harder. However, I have applied your patch to 4.9 and recompiled. Results:
1) I see the output of the debugging line, thus I assume the patch works. However, the problem is still there, no benefits.
2) I have tried not compiling ISH at all, as suggested by Tatsuyuki, and I have suspended/resumed multiple times, with and without AC power. The issue happens less often, but eventually the problem is triggered anyway. Sigh :-(

The patch is for debugging only and just adds the logging. You should post the line you saw. I recommend compiling 4.10; it should contain some more fixes (not a big difference though).
It seems ISH is somewhat related but not the exact cause of the issue. I have no idea what's happening. There was no problem when I used 4.9-rc8 compiled with the 4.8-ARCH config (there was no "official" 4.9 config at the moment).

This happens even without ISH on the system (Nicolo's system), and it also seems to happen even if you blacklist the ish modules. On Claudio Sacerdoti Coen's system it also happens eventually.
I suggest trying to get the data as I suggested in comment #30. It will show whether this is a fan issue even though the system is cool, or a real high-temperature issue.

I see, I can do that: basically stress the system (say with prime95) and monitor it, both in the case where the problem arises and when it does not. I suspect that the system will be cool, so it should be more of a fan issue (triggered by the sometimes broken sensor), but I will run more stress tests and report as you say, so we can check better.

This is what I see in the logs "logs showing wakeup with connected/disconnected power adapter":
In two cases "acpitz-virtual-0" was following "coretemp-isa-0000", which should mostly be the case. But after wakeup
acpitz-virtual-0
Adapter: Virtual device
temp1: +48.0°C (crit = +128.0°C)
coretemp-isa-0000
Adapter: ISA adapter
Package id 0: +37.0°C (high = +100.0°C, crit = +100.0°C)
Core 0: +34.0°C (high = +100.0°C, crit = +100.0°C)
Core 1: +35.0°C (high = +100.0°C, crit = +100.0°C)
Basically, acpitz-virtual-0 is stuck.
The idea of doing a stress test (after the issue happens) is to see whether the reading of the sensor "acpitz-virtual-0" ever changes or always stays at 48C. When you run stress the coretemp will go higher; in that case "acpitz-virtual-0" should go higher too. If it does not, then it is stuck. If it goes high but doesn't come back down, that means some other sensor is also in play.

It would be great if someone posted an acpidump of a system with the problem. On Arch Linux it is part of the iasl package. For other distros I see a package called acpidump.

Created attachment 255741 [details]
sensors during stress
I attach snapshots of sensors during a quick stress test, with the problem showing up, first and last entries being before and after stress. I can do more extensive study with turbostat etc, but basically it is stuck as you suggested.
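The stuck-sensor test described in this thread (vary the load, then see whether acpitz-virtual-0 follows the package temperature) could be scripted roughly as below. This is only a sketch: the function name, the 1-degree tolerance, and the readings marked as hypothetical are assumptions, not values from the attached snapshots.

```python
def classify_zone(samples, tolerance=1.0):
    """Classify an ACPI zone from (acpitz_c, package_c) readings.

    samples should be taken while the load varies, e.g. before, during,
    and after a stress run. tolerance is a hypothetical noise margin in C.
    """
    acpitz = [a for a, _ in samples]
    package = [p for _, p in samples]
    acpitz_span = max(acpitz) - min(acpitz)
    package_span = max(package) - min(package)
    if package_span <= tolerance:
        # The CPU temperature never moved, so nothing can be concluded.
        return "inconclusive"
    if acpitz_span <= tolerance:
        # The package moved but the ACPI zone did not follow it.
        return "stuck"
    return "tracking"


if __name__ == "__main__":
    # 48/23 matches the idle reading quoted earlier in this thread;
    # the 60 C and 35 C package readings are hypothetical.
    print(classify_zone([(48.0, 23.0), (48.0, 60.0), (48.0, 35.0)]))
```

The "goes high but doesn't come back down" case mentioned earlier in the thread is not covered by this simple span check; it would need the samples to be compared in time order.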
Created attachment 255743 [details]
acpidump
Here is acpidump from my system.
The stress test shows that the acpi-virtual temperature got stuck at 48C even when the CPU temperature was high. The acpi-virtual temperature is read via the EC (according to the acpidump output), which is what controls the fan. So the fan control algorithm never sees the temperature drop again, and the fan stays on. Added Rui Zhang for suggestions.

If it was a regression in 4.9, then git bisect can help to find whether some ACPI changes caused this.

Correct. I think the problem started with the 4.9 series.
[ To Srinivas: ISH is no longer the suspect, but since you asked, I was able to see "... require reinit after resume" ]

This may be interesting: my acpitz-virtual-0 is also stuck _exactly_ at +48.0C until the next hibernate. (Restart does not fix the issue: only power down or hibernate does.) It does not matter whether the real temperature is higher or lower than +48.0 at the time of resume, or whether it later moves above or below +48.0.

Precisely, that's also what I kind of pointed out in my comment 3 :)

Claudio Sacerdoti Coen: Can you do a git bisect between 4.8 and 4.9 with ISH disabled?

Yes, I am doing it. 12 iterations needed, don't hold your breath :-)

Added LV.
LV, do you think that enabling #debug in drivers/acpi/ec.c will help?
TMP is at offset 0x78 of the EC opregion. It looks like this now reads 0x80, which will result in a temperature of 48.

Method (_TMP, 0, NotSerialized)  // _TMP: Temperature
{
    If (\H8DR)
    {
        Local0 = \_SB.PCI0.LPC.EC.TMP0
        Local1 = \_SB.PCI0.LPC.EC.TSL2
        Local2 = \_SB.PCI0.LPC.EC.TSL3
    }
    Else
    {
        Local0 = \RBEC (0x78)
        Local1 = (\RBEC (0x8A) & 0x7F)
        Local2 = (\RBEC (0x8B) & 0x7F)
    }
    If (Local0 == 0x80)
    {
        Local0 = 0x30
    }
    ..

I am halfway. I will continue tomorrow.
In case the partial information is useful to anyone, here is the bisect log so far: git bisect start # bad: [1001354ca34179f3db924eb66672442a173147dc] Linux 4.9-rc1 git bisect bad 1001354ca34179f3db924eb66672442a173147dc # good: [c8d2bc9bc39ebea8437fd974fdbc21847bb897a3] Linux 4.8 git bisect good c8d2bc9bc39ebea8437fd974fdbc21847bb897a3 # bad: [e6e3d8f8f4f06caf25004c749bb2ba84f18c7d39] Merge tag 'pci-v4.9-changes' of git://git.kernel.org/pub/scm/linux/kernel/git/helgaas/pci git bisect bad e6e3d8f8f4f06caf25004c749bb2ba84f18c7d39 # bad: [687ee0ad4e897e29f4b41f7a20c866d74c5e0660] Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next git bisect bad 687ee0ad4e897e29f4b41f7a20c866d74c5e0660 # bad: [e6dce825fba05f447bd22c865e27233182ab3d79] Merge tag 'tty-4.9-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/tty git bisect bad e6dce825fba05f447bd22c865e27233182ab3d79 # bad: [12b7bcb43e6ea834ab2f5dc52d971e379a0ca109] Merge branch 'perf-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip git bisect bad 12b7bcb43e6ea834ab2f5dc52d971e379a0ca109 # good: [72a9cdd083005900f15934e8568f1ac43a6bb755] Merge tag 'pnp-4.9-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm git bisect good 72a9cdd083005900f15934e8568f1ac43a6bb755 # good: [a9e57009dacd58052755cf58463ce41a14a01db5] perf record: Fix documentation 'event_sources' -> 'event_source' git bisect good a9e57009dacd58052755cf58463ce41a14a01db5 It's quite tricky since this bug is very less likely to be triggered when ISH is disabled. Make sure you repeat the test for many times when bisecting. Here is the bisect result and log. Unfortunately, it does not point to something easily related. Note: all the bad states were bad. I made my best to detect the good states repeating suspend/resume multiple times. 
4b978934a440c1aafce986353001b03289eaa040 is the first bad commit git bisect start # bad: [1001354ca34179f3db924eb66672442a173147dc] Linux 4.9-rc1 git bisect bad 1001354ca34179f3db924eb66672442a173147dc # good: [c8d2bc9bc39ebea8437fd974fdbc21847bb897a3] Linux 4.8 git bisect good c8d2bc9bc39ebea8437fd974fdbc21847bb897a3 # bad: [e6e3d8f8f4f06caf25004c749bb2ba84f18c7d39] Merge tag 'pci-v4.9-changes' of git://git.kernel.org/pub/scm/linux/kernel/git/helgaas/pci git bisect bad e6e3d8f8f4f06caf25004c749bb2ba84f18c7d39 # bad: [687ee0ad4e897e29f4b41f7a20c866d74c5e0660] Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next git bisect bad 687ee0ad4e897e29f4b41f7a20c866d74c5e0660 # bad: [e6dce825fba05f447bd22c865e27233182ab3d79] Merge tag 'tty-4.9-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/tty git bisect bad e6dce825fba05f447bd22c865e27233182ab3d79 # bad: [12b7bcb43e6ea834ab2f5dc52d971e379a0ca109] Merge branch 'perf-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip git bisect bad 12b7bcb43e6ea834ab2f5dc52d971e379a0ca109 # good: [72a9cdd083005900f15934e8568f1ac43a6bb755] Merge tag 'pnp-4.9-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm git bisect good 72a9cdd083005900f15934e8568f1ac43a6bb755 # good: [a9e57009dacd58052755cf58463ce41a14a01db5] perf record: Fix documentation 'event_sources' -> 'event_source' git bisect bad de956b8f45b3338cfb66a725e22b4050109daf2a # good: [2ab78a724b1fd885b65199707b8e053677745457] Merge tag 'efi-next' of git://git.kernel.org/pub/scm/linux/kernel/git/mfleming/efi into efi/core git bisect good 2ab78a724b1fd885b65199707b8e053677745457 # good: [d74b62bc3241af8ebf5141f5b12e89d9d7f341e1] Merge branches 'doc.2016.08.22c', 'exp.2016.08.22c', 'fixes.2016.09.14a', 'hotplug.2016.08.22c' and 'torture.2016.08.22c' into HEAD git bisect good d74b62bc3241af8ebf5141f5b12e89d9d7f341e1 # good: [e23f22b5cb9e44da24cb8494707536211adff8d1] dcdbas: Make use of smp_call_on_cpu() git bisect good 
e23f22b5cb9e44da24cb8494707536211adff8d1 # good: [8db549491c4a3ce9e1d509b75f78516e497f48ec] smp: Allocate smp_call_on_cpu() workqueue on stack too git bisect good 8db549491c4a3ce9e1d509b75f78516e497f48ec # bad: [4b978934a440c1aafce986353001b03289eaa040] Merge branch 'core-rcu-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip git bisect bad 4b978934a440c1aafce986353001b03289eaa040 # good: [2d8fbcd13ea1d0be3a7ea5f20c3a5b44b592e79c] Merge branch 'for-mingo' of git://git.kernel.org/pub/scm/linux/kernel/git/paulmck/linux-rcu into core/rcu git bisect good 2d8fbcd13ea1d0be3a7ea5f20c3a5b44b592e79c # first bad commit: [4b978934a440c1aafce986353001b03289eaa040] Merge branch 'core-rcu-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip The commit is a merge commit for x86-tip tree, so will have many commits in it. Indeed. Can I help in some other way? I am waiting for comments from the person who knows EC more. The category of the bug is changed to power/thermal which gets monitored every week. If you want to deep dive there, there is a #debug in the top of file drivers/acpi/ec.c If you enable you will see lots of EC transaction. Temperature is at offset 0x78. You will see a request and then response. *** Bug 195239 has been marked as a duplicate of this bug. *** Claudio Sacerdoti Coen: I want make sure that in your tests you didn't use ISH (removed from CONFIG). I don't think this has anything to do because it happens on laptop without ISH as (Nicolos's system), but atleast will allow to focus on some other changes. I confirm. Either ISH was disabled from configure, or ISH support was not even merged in the kernel yet. Claudio Sacerdoti Coen: Thanks. It is possible that changes in folder drivers/acpi may have some changes causing this issue. So one try we can do is forklift /drivers/acpi from working version of kernel tree to non working version. So basically delete drivers/acpi folder and copy from working version. 
There may be some compile issues; if they are minor, we can give it a try.

So the problem is that the temperature reported by the ACPI _TMP method is stuck at 48C, resulting in the fan spinning all the time, right?
Let's see why 48C is returned by _TMP.

Created attachment 255777 [details]
customized _TMP method for Nicolo'
Nicolo'
please rebuild your kernel with CONFIG_ACPI_CUSTOM_METHOD=y, and reboot with acpi.aml_debug_output=1
after boot, please override the _TMP method with this new one attached, by
"cat /tmp/tmp.aml > /sys/kernel/debug/acpi/custom_method"
And then please attach the dmesg output after the problem is reproduced.
NOTE: For the others, please don't try the binary attached in comment #68, as it is based on the acpidump attached by Nicolo and should work for his platform only.

Zhang:
Correct. So I should take my current config file (say from 4.10.8 Arch), edit the line you suggest, and compile the kernel: right? Sorry for my ignorance, but could you please explain in more detail how I reboot with acpi.aml_debug_output=1?

Zhang:
when I try to do the last step (as root), I get
cat: write error: Invalid argument
What should I do?

(In reply to Nicolo' from comment #70)
> Zhang:
> Correct. So I should take my current config file (say from 4.10.8 Arch),
> edit the line you suggest, and compile the kernel: right? Sorry for my
> ignorance, but could you please explain in more detail how do I reboot with
> acpi.aml_debug_output=1?

This is a kernel option; you just need to append "acpi.aml_debug_output=1" to your kernel command line.

(In reply to Nicolo' from comment #71)
> Zhang:
> when I try to do the last step (as root), I get
> cat: write error: Invalid argument
> What should I do?

When this happens, do you see any new messages in the dmesg output?
Let me try on my local machine to see if this feature is broken.

no new messages in dmesg output

Created attachment 255803 [details]
X270 acpidump, 4.10.8 config
Also seem to be experiencing this on an X270 running 4.10.8.
Right, actually the name 'x1 yoga' in the title is a bit misleading, since this bug appears in multiple models, e.g. t460s or x1 carbon.

Yeah, it seems that this "customize DSDT" feature is broken in the Linux kernel. I will file another report to track that issue.

Created attachment 255901 [details]
customized DSDT : dump _TMP method for Nicolo'
Until that feature is fixed, please follow this link
https://01.org/linux-acpi/documentation/overriding-dsdt
to override the DSDT with the one attached, boot with the kernel parameter "acpi.aml_debug_output=1", and then attach the dmesg output after running "grep . /sys/class/thermal/thermal*/*", both before and after the problem is reproduced.

Zhang: I guess you have already done part of the job, up until I have to
Put it where the kernel build can include it:
$ cp DSDT.hex $SRC/include/
Add this to the kernel .config:
CONFIG_STANDALONE=n
CONFIG_ACPI_CUSTOM_DSDT=y
CONFIG_ACPI_CUSTOM_DSDT_FILE="DSDT.hex"
and recompile the kernel: is that right?

Created attachment 255903 [details]
grep and dmesg before and after problem
I hope I've done things correctly, please let me know in case I haven't.
[ 0.000000] Command line: initrd=\intel-ucode.img initrd=\initramfs-linux-custom.img root=/dev/nvme0n1p5 rw acpi.aml.debug_output=1

It seems that you're using 'acpi.aml.debug_output=1' instead of 'acpi.aml_debug_output=1'.
Please retest with the correct kernel option. :)

BTW, please attach the output of
"grep . /sys/class/thermal/thermal_zone*/temp"
instead of
"grep . /sys/class/thermal/thermal*/*"

Created attachment 255915 [details]
grep and dmesg before and after problem -- corrected
Too bad, sorry for the silly mistake ;)
is it better now?
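A quick way to avoid this kind of typo is to check the booted command line for the exact parameter before collecting logs. The sketch below is only an illustration; the helper name is hypothetical, and it does a plain exact-token comparison rather than anything the kernel itself does when parsing parameters.

```python
def cmdline_has(cmdline: str, name: str, value: str) -> bool:
    """True if the exact token name=value appears on the kernel command line."""
    return f"{name}={value}" in cmdline.split()


def read_cmdline(path: str = "/proc/cmdline") -> str:
    """Return the command line the running kernel was booted with."""
    with open(path) as fh:
        return fh.read()


if __name__ == "__main__":
    if cmdline_has(read_cmdline(), "acpi.aml_debug_output", "1"):
        print("AML debug output requested on the command line")
    else:
        print("acpi.aml_debug_output=1 not found; check for typos")
```

Run against the command line quoted in the first dmesg, this check would have flagged that "acpi.aml.debug_output=1" is not the requested "acpi.aml_debug_output=1".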
before the bug
[ 107.621890] [ACPI Debug] "_TMP Started"
[ 107.621930] [ACPI Debug] 0x0000000000000001
[ 107.622728] [ACPI Debug] "Dump Local0/1/2"
[ 107.622732] [ACPI Debug] 0x000000000000001F
[ 107.622736] [ACPI Debug] 0x0000000000000000
[ 107.622739] [ACPI Debug] 0x0000000000000000
[ 107.622748] [ACPI Debug] "Dump DHKC"
[ 107.622759] [ACPI Debug] 0x0000000000000001
[ 107.622769] [ACPI Debug] "_TMP Finished"

after the bug
[ 273.928856] [ACPI Debug] "_TMP Started"
[ 273.928912] [ACPI Debug] 0x0000000000000001
[ 273.929886] [ACPI Debug] "Dump Local0/1/2"
[ 273.929890] [ACPI Debug] 0x0000000000000080
[ 273.929893] [ACPI Debug] 0x0000000000000000
[ 273.929896] [ACPI Debug] 0x0000000000000000
[ 273.929905] [ACPI Debug] "Dump DHKC"
[ 273.929915] [ACPI Debug] 0x0000000000000001
[ 273.929923] [ACPI Debug] "_TMP Finished"

So the key change is that
Method (RBEC, 1, NotSerialized)
{
    Return (SMI (0x00, 0x03, Arg0, 0x00, 0x00))
}
returns a fixed value 0x80, rather than a meaningful temperature value.

Hmmm, I'm not sure if this is the root cause of the problem, because I don't see how a kernel change impacts the SMI call.
Nicolo', can you confirm that the acpi_tz always returns the real temperature, rather than the fixed 48C, in working kernels like 4.8?
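To connect the two Local0 values in the debug output with the 48C reading, here is a small sketch of the one step of _TMP that was quoted from the DSDT earlier in this thread: a raw value of 0x80 is replaced by 0x30. It assumes the raw byte is a temperature in degrees Celsius, which matches the 48C seen in sensors but is not stated in the quoted ASL; the rest of the method is elided in the thread and is not modelled. The names are hypothetical.

```python
EC_SENTINEL = 0x80  # raw value seen in Local0 after the bug triggers
FALLBACK_C = 0x30   # substituted by the quoted ASL; 0x30 is 48 decimal


def tmp0_celsius(raw: int) -> int:
    """Mirror the quoted 'If (Local0 == 0x80) { Local0 = 0x30 }' step.

    Assumption: the raw EC byte is in degrees Celsius.
    """
    return FALLBACK_C if raw == EC_SENTINEL else raw


if __name__ == "__main__":
    print(tmp0_celsius(0x1F))  # raw value logged before the bug
    print(tmp0_celsius(0x80))  # raw value logged after the bug
```

Read this way, a constant 48C would not be a measured temperature at all but the placeholder the firmware substitutes whenever the EC read comes back as 0x80.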
(In reply to Zhang Rui from comment #84) > before the bug > [ 107.621890] [ACPI Debug] "_TMP Started" > [ 107.621930] [ACPI Debug] 0x0000000000000001 > [ 107.622728] [ACPI Debug] "Dump Local0/1/2" > [ 107.622732] [ACPI Debug] 0x000000000000001F > [ 107.622736] [ACPI Debug] 0x0000000000000000 > [ 107.622739] [ACPI Debug] 0x0000000000000000 > [ 107.622748] [ACPI Debug] "Dump DHKC" > [ 107.622759] [ACPI Debug] 0x0000000000000001 > [ 107.622769] [ACPI Debug] "_TMP Finished" > > after the bug > [ 273.928856] [ACPI Debug] "_TMP Started" > [ 273.928912] [ACPI Debug] 0x0000000000000001 > [ 273.929886] [ACPI Debug] "Dump Local0/1/2" > [ 273.929890] [ACPI Debug] 0x0000000000000080 > [ 273.929893] [ACPI Debug] 0x0000000000000000 > [ 273.929896] [ACPI Debug] 0x0000000000000000 > [ 273.929905] [ACPI Debug] "Dump DHKC" > [ 273.929915] [ACPI Debug] 0x0000000000000001 > [ 273.929923] [ACPI Debug] "_TMP Finished" > > so the key change is that > Method (RBEC, 1, NotSerialized) > { > Return (SMI (0x00, 0x03, Arg0, 0x00, 0x00)) > } > returns a fixed value 0x80, rather than a meaningful temperature value. > > hmmm, I'm not sure if this is the rootcause of the problem because I don't > see how kernel change impacts SMI call. Plus, I don't see how this temperature change impacts the fan because there is no ACPI fan binding to this thermal zone. That is correct: acpitz returns what I believe is the correct T value, also in current 4.10 kernel, when it is not stuck at 48C; when it is stuck on 48C, then the fan is always triggered; this happens I would say 1/3 of the times, but (usually) another suspend solves the issue. By the way, just out of curiosity: is there a simple algorithm that determines fan speed out of temperatures? @Zhang on my laptop the real temperature is always reported in place of 0x80 for the kernels that show no bug. You can have a look at my kernel bisect above that points to the merge of a new IPC mechanism (or something like that). 
As far as I can tell, that merge has nothing to do with ACPI code. Does that make any sense to you? I have this issue as well, on a 20FB Thinkpad (X1 Carbon gen4). I know for a fact that BIOS 1.19 NEVER showed the issue, but any newer version does. I just tried 1.24 today, and the problem is still there. Unfortunately I can't downgrade from 1.24 to earlier versions, so now I'm stuck with the issue. Previously, after upgrading from 1.19 and noticing the issue, I would just downgrade and the problem would go away. I always run custom kernels, ISH is not included. Let me know if you guys need more info. I'm currently running 4.11-rc7 and the problem is still there. What changed between firmware 1.19 and newer? That would likely provide a good clue as to what caused this very annoying change in behavior. Jens: I'm surprised you think it's a BIOS issue; could you try kernels of the 4.8 series with current BIOS and check whether the problem is there? Nicolo: I'm saying that with BIOS 1.19 the problem NEVER happens, and any version newer than that, it happens on what appears to be every resume. Those are the indisputable facts on my laptop. I can run the very latest kernels on 1.19 and I don't have the fan issue. So whatever the real problem is, it only happens on BIOS > 1.19 here. That doesn't mean it's a BIOS issue, it could be a change that triggers a problem in later kernels. And I'm now peeved I can't downgrade the BIOS anymore, hence the issue is high priority problem for me now. I'll compile a 4.8 and 4.9 kernel on the laptop and see if 4.8 works and 4.9 does not, running with BIOS 1.24. I tested v4.8 and it works fine (no crazy fan after resume), and v4.9-rc1 which is broken (full speed fan after resume). So 1.19 and any kernel is fine, or >1.19 and v4.9-rc1 and later is broken. I've got the same problem on my X1 2017 (Gen5). I'm on Arch Linux and after resuming the fan runs at max speed pretty often. Resuming again most of the times fixes the problem. 
One thing that isn't mentioned here already is the following: If the power plug isn't plugged, then I've NEVER had this bug. Don't know if that is of any help. Jens: so it is most likely a kernel issue, not a bios one; also, how can you say 1.19 works with any kernel if you're not able to test it now? Markus: to me, it happens both plugged and unplugged. Nicolo, I already explained why it could be both a firmware issue and a kernel bug further up. If you read what I wrote, you'd also see that I was running 2.10 until yesterday and tried newer versions and rolled back when I saw they had this bug. Hence I know any kernel up to 4.11-rc7 work with 2.19 just fine. Thanks phone. The 2.10 and 2.19 above should be 1.19, of course. I read you, but could you (or someone else) repeat a test now with 1.19 and newer kernels? if not, as you seem to say, I don't care what you remember etc; if yes, then I agree it may be BIOS related, which could be useful info. No, you are clearly NOT reading me, because if you did, you would not ask questions I already answered. As I wrote in comment 89 and 92, I tested 1.19 YESTERDAY with 4.11-rc7. I don't need to test again, and in fact I cannot, since I can't roll back from my 1.24 since the tool no longer allows downgrades of BIOS. That last bit of info was also in the above comments. This is why I'm 100% positive that 1.19 works - because whenever they released a new BIOS, I'd update, see the bug was there, and then downgrade again. Since I develop Linux, I run the latest kernels on my laptop all the time. Hence I know that I've never had the issue with newer kernels and 1.19. Read and understand what is being written and stop wasting peoples time. Dear Jens, I understood from the beginning that you were able to test with 1.19 until a few days ago with all kernels, and you say that you had no problems, no need to repeat it, but you're not able to reproduce it now, likely as anyone else who upgraded to latest bios. 
I agree that you may have a point in correlating with bios, but I'd be happier if someone can repeat the test with 1.19 and confirm what you say; also in debugging it could be useful to work with 1.19 It's a fact that there's 100% correlation, as I went back and forth between 1.19 and newer BIOS' while I could. It always triggered on newer BIOS, it never triggered on 1.19. I don't need to reproduce it now, as I reproduced it _yesterday_. The folks currently on 1.19 will not be finding this bug report, as they don't run into the issue. They will find it when/if they do upgrade, and at that point it'll be too late, as they can no longer downgrade to 1.19. I'm going to try and bisect this issue, as we have the luxury of knowing it works on 4.8 and doesn't on 4.9-rc1. I'll be happy to try suggestions from the people actually working on fixing this issue. If you look above Claudio Sacerdoti has already done that. Does Lenovo supporting documentation for 1.19 and later give any hint? have you tried to contact them as well? I did see that someone ran a bisect, but it was basically useless as it didn't yield any real information. I also see that it was run twice, with different results. So I don't have a high level of confidence in it. I have not talked to Lenovo. Looks like it's the embedded controller (ECP) update. Since I cannot downgrade the actual BIOS anymore, I flashed BIOS 1.24 with ECP 1.13 (that's the one that BIOS 1.19 ships with) on the laptop. 
With that, the fan issue doesn't seem to be there (output from dmidecode): Handle 0x000C, DMI type 0, 24 bytes BIOS Information Vendor: LENOVO Version: N1FET50W (1.24 ) Release Date: 03/08/2017 Address: 0xE0000 Runtime Size: 128 kB ROM Size: 16384 kB Characteristics: PCI is supported PNP is supported BIOS is upgradeable BIOS shadowing is allowed Boot from CD is supported Selectable boot is supported EDD is supported 3.5"/720 kB floppy services are supported (int 13h) Print screen service is supported (int 5h) 8042 keyboard services are supported (int 9h) Serial services are supported (int 14h) Printer services are supported (int 17h) CGA/mono video services are supported (int 10h) ACPI is supported USB legacy is supported BIOS boot specification is supported Targeted content distribution is supported UEFI is supported BIOS Revision: 1.24 Firmware Revision: 1.13 Note how "Firmware Revision" is 1.13, it should be 1.16 with this BIOS. So likely the BIOS is fine, but the ECP update from 1.13 to 1.14 broke something or triggered a bug in the kernel. To follow-up on Jens's last comment: I have BIOS revision 1.23 and firmware revision 1.15 (> 1.13) and I see the bug (with new kernels) Claudio, you can try and get firmware 1.24 and 1.19, then copy the smaller firmware file from 1.19 into the 1.24 folder and flash it. It'll warn that the ECP firmware is older, but just say yes. Would be interesting to see if it fixes it for you, too. On my thinkpad t460s, dmidecode gives firmware 1.10, but I have the problem: is it possible that they use different version numbers for different models? 
BIOS Information Vendor: LENOVO Version: N1CET54W (1.22 ) Release Date: 02/10/2017 Address: 0xE0000 Runtime Size: 128 kB ROM Size: 16384 kB Characteristics: PCI is supported PNP is supported BIOS is upgradeable BIOS shadowing is allowed Boot from CD is supported Selectable boot is supported EDD is supported 3.5"/720 kB floppy services are supported (int 13h) Print screen service is supported (int 5h) 8042 keyboard services are supported (int 9h) Serial services are supported (int 14h) Printer services are supported (int 17h) CGA/mono video services are supported (int 10h) ACPI is supported USB legacy is supported BIOS boot specification is supported Targeted content distribution is supported UEFI is supported BIOS Revision: 1.22 Firmware Revision: 1.10 Jens: can I ask how did you distinguish what is bios and what is ecp update? in my case they only provide one iso, which I can extract to img, but then in the flash folder besides readme there are some pat files, one efi and another folder with two files fl1 and fl2 extension; these last two seem the only ones to be different among different versions: was is the same for you? which file did you substitute? The fl2 file is the smaller one, that is the ECP firmware. The fl1 file is the BIOS. 
Jens: you were right, now my dmidecode reads BIOS Information Vendor: LENOVO Version: N1CET54W (1.22 ) Release Date: 02/10/2017 Address: 0xE0000 Runtime Size: 128 kB ROM Size: 16384 kB Characteristics: PCI is supported PNP is supported BIOS is upgradeable BIOS shadowing is allowed Boot from CD is supported Selectable boot is supported EDD is supported 3.5"/720 kB floppy services are supported (int 13h) Print screen service is supported (int 5h) 8042 keyboard services are supported (int 9h) Serial services are supported (int 14h) Printer services are supported (int 17h) CGA/mono video services are supported (int 10h) ACPI is supported USB legacy is supported BIOS boot specification is supported Targeted content distribution is supported UEFI is supported BIOS Revision: 1.22 Firmware Revision: 1.9 Probably the numbering is different among different models, but the upshot is that either older ECB firmware and all Linux kernels, or older kernels and all ECB firmware work (at least I did several suspends and did not had the issue), but both newer trigger some bug, as you correctly observed. Do you think there may be any compatibility issue using a newer UEFI with an older ECB, at least temporarily? probabaly. As kernel git bisect does not give us any clue, it would be good to know the differences between different EC firmware versions. BTW, I also want to make sure the fixed 48C reported is related with the problem or not. Say, with older EC firmware or older kernel, does the thermal zone report 48C or not? Zhang: with older EC, for example now I get acpitz-virtual-0 Adapter: Virtual device temp1: +27.0°C (crit = +128.0°C) which I believe is correct. As for differences between different ECs, I could try to contact Lenovo, unless you know a better way. 
(In reply to Nicolo' from comment #113)
> Zhang: with older EC, for example now I get
> acpitz-virtual-0
> Adapter: Virtual device
> temp1: +27.0°C (crit = +128.0°C)
>
The problem is that we also get the correct value with the later kernel/ECP. We have confirmed that the temperature is 48C when the problem happens. Now it would be better to confirm whether 48C is never shown with the later kernel/ECP. To do this, we should monitor the temperature for a longer period, say 1 day, to see if the fixed 48C is ever reported. (A temperature that rises or drops to 48C smoothly from 45C/50C is a reasonable 48C; a temperature that jumps from 30C to 48C is the fixed 48C.)
> which I believe is correct. As for differences between different ECs, I
> could try to contact Lenovo, unless you know a better way.
Yes, please contact Lenovo to understand the details of the difference, which may give us a clue about the problem.
Lv, I also saw a couple of EC driver changes between 4.8 and 4.9; can you please help verify whether they could be related?
df45db6177f8dde380d44149cca46ad800a00575
750f628be68e8b8e1624d8abd003b9f1fc758ed6
e923e8e79e18fd6be9162f1be6b99a002e9df2cb
c2b46d679b30c5c0d7eb47a21085943242bdd8dc
39a2a2aa3e9e5538984e9130c92a6c889ad86435
d30283057ecdf8c543ae757ae34db3d7fd2d7732
72c77b7ea9ce781f4987840984a462e4456ba98e
46922d2a3aff5122253d97e64500801c08f4f2c0
2a5708409e4e05446eb1a89ecb48641d6fd5d5a9
97cb159fd91d00f8d7d1adeb075503dc0d946bff
These are the commits shipped in the 4.9 kernel. Please cherry-pick them on top of the v4.8 kernel and see whether the problem can be reproduced.
or... Perhaps it would be simpler to try running the old ec.c; the result would tell us whether the changes to the ec.c driver either broke this or have nothing to do with it.
Probably should just try to revert this commit https://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm.git/commit/?h=linux-next&id=d30283057ecdf8c543ae757ae34db3d7fd2d7732 and see what happens.
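For anyone doing the long-period monitoring suggested above, the "reasonable 48C" versus "fixed 48C" distinction could be checked mechanically on a list of logged readings. This is only a rough sketch: the function name and the 5-degree step threshold are assumptions made for illustration, and the right threshold depends on how often the sensor is sampled.

```python
STUCK_VALUE_C = 48.0  # the value reported in this thread when the bug occurs
MAX_STEP_C = 5.0      # assumed threshold; tune it for the sampling interval


def find_suspicious_readings(samples, stuck=STUCK_VALUE_C, max_step=MAX_STEP_C):
    """Return indices where a reading lands exactly on `stuck` after an abrupt jump."""
    suspicious = []
    for i in range(1, len(samples)):
        previous, current = samples[i - 1], samples[i]
        # A smooth approach (e.g. 47 -> 48) is a plausible real temperature;
        # a jump (e.g. 31 -> 48) matches the "fixed 48C" pattern.
        if current == stuck and abs(current - previous) > max_step:
            suspicious.append(i)
    return suspicious


if __name__ == "__main__":
    print(find_suspicious_readings([45.0, 46.0, 47.0, 48.0, 49.0, 50.0]))
    print(find_suspicious_readings([30.0, 31.0, 48.0, 48.0, 48.0]))
```

Only the first sample of a stuck run is flagged, since later samples no longer differ from their predecessor.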
I confirm that previous ECP firmware works fine with current kernel: the sensor seems to work just fine, going smoothly from below to above 48C when some load is present. I will compile a kernel that reverts the commit and test with newer ECP to see whether the issue is present. To Srinivas:
> LV, Do you think that enabling #debug in drivers/ec.c will help?
> I am waiting for comments from the person who knows EC more.
I should say sorry as I've just noticed it this morning.
Let me take a look at what's going on.
The _TMP implementation looks interesting:
Method (_TMP, 0, NotSerialized) // _TMP: Temperature
{
If (\H8DR)
^^^^^^^^^^
{
Store (\_SB.PCI0.LPC.EC.TMP0, Local0)
Store (\_SB.PCI0.LPC.EC.TSL2, Local1)
Store (\_SB.PCI0.LPC.EC.TSL3, Local2)
}
Else
{
Store (\RBEC (0x78), Local0)
Store (And (\RBEC (0x8A), 0x7F), Local1)
Store (And (\RBEC (0x8B), 0x7F), Local2)
}
If (LEqual (Local0, 0x80))
{
Store (0x30, Local0)
}
...
}
}
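One thing worth spelling out from the snippet above: 0x30 is 48 decimal, which matches the stuck 48C reading reported in this thread. A minimal sketch of just that substitution, assuming Local0 is in degrees Celsius (the rest of the method is elided above, so nothing past the substitution is modelled, and the Python names are made up for illustration):

```python
EC_INVALID_READING = 0x80  # value seen in Local0 after the buggy resume
FALLBACK_READING = 0x30    # value _TMP stores in its place


def tmp_local0(raw):
    """Mirror the `If (LEqual (Local0, 0x80))` substitution from _TMP."""
    return FALLBACK_READING if raw == EC_INVALID_READING else raw


print(tmp_local0(0x1F))  # 31, the healthy reading from the earlier debug log
print(tmp_local0(0x80))  # 48, the value the thermal zone gets stuck at
```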
I'm not sure if \H8DR matters. It's only set in the \_SB._INI:
Scope (\_SB)
{
Method (_INI, 0, NotSerialized) // _INI: Initialize
{
...
If (LGreaterEqual (\_REV, 0x02))
{
Store (0x01, \H8DR)
}
...
}
}
So have you guys looked into the _REV-related stuff?
There are 2 code paths: if _REV >= 2, _TMP reads the EC opregion; otherwise it invokes the SMI functionality implemented via RBEC.
By default, _REV returns 2, and you can change it to 5 via "acpi_rev_override" (it might be better to have a different boot parameter to change it to 1).
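The path selection described above can be sketched as follows. This is a toy model, not the firmware logic itself: the function names are made up, and the initial value of H8DR before _INI runs is an assumption (taken as 0 here), since the DSDT excerpt does not show it.

```python
def h8dr_after_ini(acpi_rev, initial_h8dr=0):
    """Model _SB._INI: H8DR is set to 1 only when _REV >= 2.

    `initial_h8dr` is an assumption; the excerpt does not show the default.
    """
    return 0x01 if acpi_rev >= 0x02 else initial_h8dr


def tmp_read_path(h8dr):
    """Return which branch of _TMP supplies Local0."""
    return "EC opregion (TMP0)" if h8dr else "SMI via RBEC(0x78)"


for rev in (1, 2, 5):
    print(rev, tmp_read_path(h8dr_after_ini(rev)))
```

With the default _REV of 2, or 5 via "acpi_rev_override", the model always takes the EC opregion branch; only a _REV below 2 would reach RBEC.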
If RBEC is executed, it is implemented this way:
Method (RBEC, 1, NotSerialized)
{
Return (SMI (0x00, 0x03, Arg0, 0x00, 0x00))
}
Then this is not related to the EC driver.
The SMI method is:
Mutex (MSMI, 0x00)
Method (SMI, 5, Serialized)
{
Acquire (MSMI, 0xFFFF)
Store (Arg0, CMD)
Store (0x01, ERR)
Store (Arg1, PAR0)
Store (Arg2, PAR1)
Store (Arg3, PAR2)
Store (Arg4, PAR3)
Store (0xF5, APMC)
While (LEqual (ERR, 0x01))
{
Sleep (0x01)
Store (0xF5, APMC)
}
Store (PAR0, Local0)
Release (MSMI)
Return (Local0)
}
It accesses the following opregions, none of which are handled by the EC driver:
OperationRegion (SMI0, SystemIO, 0xB2, 0x01)
Field (SMI0, ByteAcc, NoLock, Preserve)
{
APMC, 8
}
OperationRegion (MNVS, SystemMemory, 0xAFFBC018, 0x1000)
Field (MNVS, AnyAcc, NoLock, Preserve)
{
Offset (0xFC0),
CMD, 8,
ERR, 32,
PAR0, 32,
PAR1, 32,
PAR2, 32,
PAR3, 32
}
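To make the handshake in the SMI method easier to follow, here is a small Python model of it. It is only a sketch: the `Mailbox` class is a hypothetical stand-in for the MNVS fields and the APMC port, the fake firmware handler is invented for the example, and the mutex and the Sleep (0x01) call are left out.

```python
class Mailbox:
    """Hypothetical stand-in for the MNVS fields (CMD, ERR, PAR0..PAR3) and APMC."""

    def __init__(self, handler, triggers_until_done=1):
        self.cmd = 0
        self.err = 0
        self.par = [0, 0, 0, 0]
        self._handler = handler                # fake firmware SMI handler
        self._triggers_left = triggers_until_done

    def write_apmc(self, value):
        """Model a write to the APMC port; 0xF5 raises the SMI."""
        if value != 0xF5:
            return
        self._triggers_left -= 1
        if self._triggers_left <= 0:
            # Firmware finishes: result goes to PAR0 and ERR is cleared.
            self.par[0] = self._handler(self.cmd, list(self.par))
            self.err = 0


def smi(mailbox, arg0, arg1, arg2, arg3, arg4):
    """Follow the ASL SMI method: fill the mailbox, trigger, poll ERR, read PAR0."""
    mailbox.cmd = arg0
    mailbox.err = 0x01
    mailbox.par = [arg1, arg2, arg3, arg4]
    mailbox.write_apmc(0xF5)
    while mailbox.err == 0x01:
        # The ASL sleeps 1 ms here before re-triggering.
        mailbox.write_apmc(0xF5)
    return mailbox.par[0]


def rbec(mailbox, offset):
    """RBEC as quoted above: SMI (0x00, 0x03, Arg0, 0x00, 0x00)."""
    return smi(mailbox, 0x00, 0x03, offset, 0x00, 0x00)


if __name__ == "__main__":
    # Fake firmware that always answers 0x80, like the failing case in this bug.
    stuck = Mailbox(lambda cmd, par: 0x80, triggers_until_done=3)
    print(hex(rbec(stuck, 0x78)))
```

Note that the loop only ends when the firmware clears ERR, so in this model a handler that never completes would spin forever.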
I guess:
1. The old firmware doesn't contain the SMI invocations. Could someone upload the acpidump from the old working ECP?
2. If reverting the above-mentioned commit fixes this, it might mean that the SMI method gets stuck when the EC driver stops handling events too early.
Thanks and best regards
Lv
To Rui:
The following are not related; they concern the boot EC:
72c77b7ea9ce781f4987840984a462e4456ba98e
46922d2a3aff5122253d97e64500801c08f4f2c0
2a5708409e4e05446eb1a89ecb48641d6fd5d5a9
97cb159fd91d00f8d7d1adeb075503dc0d946bff
These are also not related; they are simple fixes and cleanups:
df45db6177f8dde380d44149cca46ad800a00575
750f628be68e8b8e1624d8abd003b9f1fc758ed6
e923e8e79e18fd6be9162f1be6b99a002e9df2cb
As a first step, we might focus only on the following 3 commits:
c2b46d679b30c5c0d7eb47a21085943242bdd8dc <- this is not likely a root cause, it only changes post-resume behavior.
39a2a2aa3e9e5538984e9130c92a6c889ad86435 <- this is a no-op and d30283057e enables the feature introduced in this commit.
d30283057ecdf8c543ae757ae34db3d7fd2d7732
Thanks
Lv
Created attachment 256003 [details]
acpidump from old firmware
Here's my acpi dump from old working firmware with new kernel; I will try to compile and test a kernel with the commit removed later.
The same situation (https://bugzilla.kernel.org/show_bug.cgi?id=191181#c0) on a Lenovo Edge E540 (ThinkPad) Type 20C6, on kernels 4.8 and higher. BIOS 2.24.
Hi, Nicolo'
A. Could you re-do the test of comment 78 with non-buggy kernels but buggy firmware and upload the new result here?
B. You can also do the following test:
Boot the kernel with:
dyndbg=\"file ec.c +p\"
Under the kernel source code tree:
# cd tools
# make acpi
You can find the "ec" tool under tools/power/acpi.
It looks like H8DR is 1 according to comment 84, so TMP0 will be read; the problem occurs when it returns 0x80.
TMP0 is here:
OperationRegion (ECOR, EmbeddedControl, 0x00, 0x0100)
Field (ECOR, ByteAcc, NoLock, Preserve)
{
...
Offset (0x78),
TMP0, 8,
...
}
It's at offset 0x78; you can trigger an EC transaction to obtain its value from userspace with "sudo ec -b 0x78".
So can you execute this command for the following combinations and post the results here:
1. buggy firmware, non-buggy kernel, before suspend
2. buggy firmware, non-buggy kernel, after resume
3. buggy firmware, buggy kernel, before suspend
4. buggy firmware, buggy kernel, after resume
5. non-buggy firmware, non-buggy kernel, before suspend
6. non-buggy firmware, non-buggy kernel, after resume
7. non-buggy firmware, buggy kernel, before suspend
8. non-buggy firmware, buggy kernel, after resume
And upload the "dmesg" output for tests 2, 3 and 4 here.
To obtain the dmesg output, please run the following commands:
# dmesg -c
# ec -b 0x78
# dmesg -c > dmesg-2/3/4.log
Thanks in advance.
Best regards
Lv
Hi, Nicolo'
Here is another test, let me call it test C:
1. Boot "buggy firmware, buggy kernel" with:
dyndbg=\"file ec.c +p\"
2. After boot, run the following commands and post the logs here:
# dmesg -c > boot.log
# echo mem > /sys/power/state
Resume the system by pressing the right buttons
# dmesg -c > s2ram.log
# ec -b 0x78
# dmesg -c > tmp0.log
Thanks in advance
Lv
Hi all, has anybody actually checked the CPU frequency scaling?
I don't have the problem of running the fans at full speed only after resume - my problem is that its running all the time directly after boot. The 4.11 actually has a problem with the cpu frequency scalling in my eyes. Today i installed the 4.11-rc8 and directly noticed, that the frequencies of my i7-6600U are residing in > 3.0 Ghz domain. The frequency just don't goes down. It's a t460s with p-state driver and performance governor, running tlp. Before i was using 4.10.0 and the cpu frequencies stayed at minimum 400 mhz in idle states. Now as an implication of high frequencies > the fan is running almost all the time. I already had this issue when the p-state driver first came out with my x230. I had to wait for a newer kernel and the problem gone, the frequencies went to mininum in idle. Here is my bios if that helps: BIOS Information Vendor: LENOVO Version: N1CET47W (1.15 ) Release Date: 08/08/2016 Address: 0xE0000 Runtime Size: 128 kB ROM Size: 16384 kB Characteristics: PCI is supported PNP is supported BIOS is upgradeable BIOS shadowing is allowed Boot from CD is supported Selectable boot is supported EDD is supported 3.5"/720 kB floppy services are supported (int 13h) Print screen service is supported (int 5h) 8042 keyboard services are supported (int 9h) Serial services are supported (int 14h) Printer services are supported (int 17h) CGA/mono video services are supported (int 10h) ACPI is supported USB legacy is supported BIOS boot specification is supported Targeted content distribution is supported UEFI is supported BIOS Revision: 1.15 Firmware Revision: 1.9 Again, my problem is slightly different, but i don't see any info regarding the cpu frequency. Perhaps you can check your frequencies after the machine resumes. If my issue is not related i'll open a new ticket. Best regards, Hermann Hermann: you can take a look at my powertop with the problem in the attachments, and I think my CPU could go to lower states even with the fan issue. 
Lv: I will try to do the tests you suggested soon, and report back. Hermann: do not use the performance governor. It tries to keep high frequency, and should be used for environments that latency matters (e.g. audio). I'm noticing that my computer is more likely to overheat than before, no idea what's the cause. i7 is too beefy. Created attachment 256183 [details]
x270 acpi dump without issue
For what it's worth, I have an x270 and I am not having this particular issue. I attached an ACPI dump from my machine in case there is some clue there. I've been switching back and forth between the 4.8 kernel and the 4.11 releases in Linux Mint since I got the laptop and haven't noticed any issues with the fan. I am having another weird problem though. When I close the lid while running on battery, it generates an unhandled HKEY event 0x6032 and CPU temp/throttling warnings in dmesg every so often. When I run the laptop with the lid closed like this, the CPU appears to be limited to 2.0GHz and stays around 60C even when I'm doing something intensive like encoding video. It seems to think the CPU is hotter than it really is? It also seems like some kind of issue reading the EC to me, so I thought maybe it was related. It's triggered during very light use. Symptoms are described exactly in this Fedora bug: https://bugzilla.redhat.com/show_bug.cgi?id=924570 I would be curious to know what BIOS / Firmware versions Max Deineko (other x270 owner) is running and if he also has this problem. (I am running BIOS 1.11 / Firmware revision 1.11) @Tatsuyuki Ishi thank you very much for the hint. All the time i had the wrong information that the performance and powersave governors are acting like ondemand from cpufreq with the difference in speed of scalling:( And the weird thing is that kernels i've used before with performance governor like the 4.9 put the frequencies at the minimum like ~400Mhz. Very strange! But you are completely right: https://www.kernel.org/doc/Documentation/cpu-freq/intel-pstate.txt: For example the "performance" policy is similar to cpufreq’s "performance" governor, but "powersave" is completely different than the cpufreq "powersave" governor. The strategy here is similar to cpufreq "ondemand", where the requested P-State is related to the system load. Thanks again for the hint. Now i'm with powersave. So much wasted energy... 
Hahah:) Just a "me too". I have a T470 with BIOS Revision: 1.29, Firmware Revision: 1.12, running Arch Linux with a the standard 4.10.13 kernel. same thing on ThinkPad X1 Carbon 5th gen Manufacturer: LENOVO Product Name: 20HQS0LV00 BIOS Revision: 1.18 Firmware Revision: 1.12 I've tried kernels 4.10.13-ARCH, self-compiled 4.11 with the Arch config, and a self-compiled torvalds/master kernel (4.12-rc0) with the same config. (In reply to Nicolo' from comment #118) > I confirm that previous ECP firmware works fine with current kernel: the > sensor seems to work just fine, going smoothly from below to above 48C when > some load is present.
I can confirm that too:
System Information
Manufacturer: LENOVO
Product Name: 20F90045GE
Version: ThinkPad T460s
BIOS Information
Vendor: LENOVO
Version: N1CET56W (1.24 )
Release Date: 04/19/2017
BIOS Revision: 1.24
Firmware Revision: 1.9
# uname -r
4.11.3-200.fc25.x86_64
After flashing BIOS/ECP == 1.24(N1CET56W)/1.09(N1CHT27W) following Jens' instructions (thanks a ton!) the nasty fan issue is gone. ECP 1.10(N1CHT28W) and 1.11(N1CHT29W) do not work with recent kernels.
Updated the X1 Carbon (5th gen) to UEFI: 1.20 / ECP: 1.14 (2017/05/26); still the same issue :(
Nicolo, are you going to run the tests suggested by Lv Zheng in comment 124+125? It'd be nice if we can get this closer to being resolved; no progress has been made in over a month.
Sorry, I've been busy with other stuff. I will try to do those tomorrow.
One thing that anyone can try is to just remove commit d30283057ecdf8c543ae757ae34db3d7fd2d7732. Has this been tried?
I tried and failed today to do test A, due to a (possibly unrelated) compile error with the 4.8 kernel series; I will try again after someone either here or at Arch helps with the issue, which is as follows:
mkdir build
cd build/
git clone git://git.archlinux.org/svntogit/packages.git --single-branch --branch "packages/linux"
mkdir build
cd packages/trunk
git checkout d59764443634990fb9c058e31515af5692de44ce
cd ../..
cp -r packages/trunk/ linux
makepkg -o
(edit PKGBUILD to change the name to custom, edit config and add the file to the include folder as described in comment #79)
updpkgsums
makepkg -s
gives the error
LD init/built-in.o
kernel/built-in.o: In function `update_wall_time':
(.text+0x7c377): undefined reference to `____ilog2_NaN'
make: *** [Makefile:951: vmlinux] Error 1
==> ERROR: A failure occurred in build().
Aborting...
(In reply to Nicolo' from comment #139) > > LD init/built-in.o > kernel/built-in.o: In function `update_wall_time': > (.text+0x7c377): undefined reference to `____ilog2_NaN' > make: *** [Makefile:951: vmlinux] Error 1 > ==> ERROR: A failure occurred in build(). > Aborting... gcc7, i assume? Might be this one: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=474c90156c8dcc2fa815e6716cc9394d7930cb9c Thank you, will try to apply the patch. Hi, I still get an error even with the patch: .... LD drivers/built-in.o ==> ERROR: A failure occurred in build(). Aborting... I can confirm revert commit d30283057ecdf8c543ae757ae34db3d7fd2d7732 in kernel 4.11.3 (archlinux) solves acpitz-virtual-0 stuck in 48º, but not solve the fan issue. I will try in theses days to revert the others commits and see what happen Interesting. I was finally able to build 4.8 kernel, so now I can easily do tests ABC. (In reply to Fernando Chaves from comment #143) > I can confirm revert commit d30283057ecdf8c543ae757ae34db3d7fd2d7732 in > kernel 4.11.3 (archlinux) solves acpitz-virtual-0 stuck in 48º, but not > solve the fan issue. > > I will try in theses days to revert the others commits and see what happen not for me on 4.12-rc4. acpitz-virtual-0 is still stuck at +48.0°C when the issue appears. Created attachment 256897 [details]
testA
I'm attaching the result of test A.
Soon will also do B and C.
About test B, I get the error
sudo ./ec -b 0x78
ec: /sys/kernel/debug/ec/ec0/io: No such file or directory
What am I doing wrong? I'm assuming I can make ec at any time and don't necessarily need to make it when I compile the kernel, is that right?
(In reply to Damjan Georgievski from comment #145)
> (In reply to Fernando Chaves from comment #143)
> > I can confirm revert commit d30283057ecdf8c543ae757ae34db3d7fd2d7732 in
> > kernel 4.11.3 (archlinux) solves acpitz-virtual-0 stuck in 48º, but not
> > solve the fan issue.
> >
> > I will try in theses days to revert the others commits and see what happen
>
> not for me on 4.12-rc4. acpitz-virtual-0 is still stuck at +48.0°C when the
> issue appears.
You are right, I was watching another sensor; sorry, I was tired. Now I have reverted all of the following commits in 4.11.3 and the issue is gone: no more fan blowing, and no more acpitz-virtual-0 stuck at +48.0°C.
df45db6177f8dde380d44149cca46ad800a00575
750f628be68e8b8e1624d8abd003b9f1fc758ed6
e923e8e79e18fd6be9162f1be6b99a002e9df2cb
c2b46d679b30c5c0d7eb47a21085943242bdd8dc
39a2a2aa3e9e5538984e9130c92a6c889ad86435
d30283057ecdf8c543ae757ae34db3d7fd2d7732
72c77b7ea9ce781f4987840984a462e4456ba98e
46922d2a3aff5122253d97e64500801c08f4f2c0
2a5708409e4e05446eb1a89ecb48641d6fd5d5a9
97cb159fd91d00f8d7d1adeb075503dc0d946bff
eab05ec38073f72389386f4a77fb58c06e246a4c
4c237371f290d1ed3b2071dd43554362137b1cce
c3a696b6e8f8f75f9f75e556a9f9f6472eae2655
I haven't dug into the changes in these commits because I lack the knowledge and time, but if I have time I will try to see what happens in these commits.
And I forget to say, my laptop is X1 Carbon 5th Gen Have you tried to revert just this commit: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=d3028305 Thanks Lv Also you can help to confirm by just reverting the followings: 72c77b7ea9ce781f4987840984a462e4456ba98e 46922d2a3aff5122253d97e64500801c08f4f2c0 2a5708409e4e05446eb1a89ecb48641d6fd5d5a9 97cb159fd91d00f8d7d1adeb075503dc0d946bff eab05ec38073f72389386f4a77fb58c06e246a4c 4c237371f290d1ed3b2071dd43554362137b1cce c3a696b6e8f8f75f9f75e556a9f9f6472eae2655 And see whether the problem can disappear. Let's narrow down the problem. Fernando Chaves: In comment 143: You've confirmed "reverting d30283057ecdf8c543ae757ae34db3d7fd2d7732" can solve acpitz-virtual-0 stuck in 48º. But fan still blows. Could you just try to comment out the following line and try again: SET_NOIRQ_SYSTEM_SLEEP_PM_OPS(acpi_ec_suspend_noirq, acpi_ec_resume_noirq) Could you help to do the test as comment 150 to see whether the commits in comment 150 relate to this issue. To Damjan Georgievski: In comment 145, you said "reverting d30283057ecdf8c543ae757ae34db3d7fd2d7732" cannot solve acpitz-virtual-0 stuck at +48.0°C. You may have a different issue. Could you confirm this again, or let's root cause Fernando's one first. Thanks Lv Fernando Chaves/Damjan Georgievski: Is CONFIG_SMP enabled in your configuration file? (In reply to Lv Zheng from comment #152) > Fernando Chaves/Damjan Georgievski: > > Is CONFIG_SMP enabled in your configuration file? of course. (In reply to Lv Zheng from comment #150) > Also you can help to confirm by just reverting the followings: > > 72c77b7ea9ce781f4987840984a462e4456ba98e > 46922d2a3aff5122253d97e64500801c08f4f2c0 > 2a5708409e4e05446eb1a89ecb48641d6fd5d5a9 > 97cb159fd91d00f8d7d1adeb075503dc0d946bff > eab05ec38073f72389386f4a77fb58c06e246a4c > 4c237371f290d1ed3b2071dd43554362137b1cce > c3a696b6e8f8f75f9f75e556a9f9f6472eae2655 > > And see whether the problem can disappear. 
I reverted all of those, and also d30283057ecdf8c543ae757ae34db3d7fd2d7732 didn't help in my case. here's the branch I compiled https://github.com/gdamjan/linux/commits/bugzilla-191181 Let me observe that with the new ECP firmware released by Lenovo a few weeks ago and 4.11 series kernel, it seems my system does not suffer anymore from the problem: at least I've had the new update for a few days and never experienced the problem, but I will keep monitoring it. On a ThinkPad X1 Carbon 4th gen Version 1.25 (UEFI: 1.25 / ECP: 1.17) This bug still persists with 4.11.3-1-ARCH #1 SMP PREEMPT. Concur with comment 156 - just upgraded my gen 4 x1 carbon to 1.25, and the problem is still there. I combined 1.25 with ECP 1.13, and then it's fine. Created attachment 256925 [details] attachment-6102-0.html I will keep monitoring it, but mine seems to have no problems (it seems the # are different) 4.11.3-1-ARCH BIOS Information Vendor: LENOVO Version: N1CET56W (1.24 ) Release Date: 04/19/2017 Address: 0xE0000 Runtime Size: 128 kB ROM Size: 16 MB BIOS Revision: 1.24 Firmware Revision: 1.11 On Thu, Jun 8, 2017 at 2:49 PM, <bugzilla-daemon@bugzilla.kernel.org> wrote: > https://bugzilla.kernel.org/show_bug.cgi?id=191181 > > --- Comment #157 from Jens Axboe (axboe@kernel.dk) --- > Concur with comment 156 - just upgraded my gen 4 x1 carbon to 1.25, and the > problem is still there. I combined 1.25 with ECP 1.13, and then it's fine. > > -- > You are receiving this mail because: > You are on the CC list for the bug. > (In reply to Lv Zheng from comment #151) > Let's narrow down the problem. > > Fernando Chaves: > In comment 143: > You've confirmed "reverting d30283057ecdf8c543ae757ae34db3d7fd2d7732" can > solve acpitz-virtual-0 stuck in 48º. But fan still blows. 
> Could you just try to comment out the following line and try again: > SET_NOIRQ_SYSTEM_SLEEP_PM_OPS(acpi_ec_suspend_noirq, acpi_ec_resume_noirq) > > Could you help to do the test as comment 150 to see whether the commits in > comment 150 relate to this issue. > > To Damjan Georgievski: > In comment 145, you said "reverting > d30283057ecdf8c543ae757ae34db3d7fd2d7732" cannot solve acpitz-virtual-0 > stuck at +48.0°C. > You may have a different issue. > Could you confirm this again, or let's root cause Fernando's one first. > > Thanks > Lv Yes I will do the test from comment #150 and comment #151 today or tomorrow I was wrong respect at comment #143 , as I say in comment #148, sorry for that (In reply to Lv Zheng from comment #152) > Fernando Chaves/Damjan Georgievski: > > Is CONFIG_SMP enabled in your configuration file? Yes, It's enabled (In reply to Nicolo' from comment #155) > Let me observe that with the new ECP firmware released by Lenovo a few weeks > ago and 4.11 series kernel, it seems my system does not suffer anymore from > the problem: at least I've had the new update for a few days and never > experienced the problem, but I will keep monitoring it. 4.11.3-1-ARCH stock Kernel the issue persist in my case, If I revert the commits in comment #148, the fan and the sensor acpitz-virtual-0 works perfectly Other info I discover today, If I run stock ARCH kernel (4.11.3-1-ARCH) and if I have my USB 3 HUB connected and IN THE HUB I have a keyboard or a mouse, then the issue not appear (I tested this about 10 times), but if I connect a Pendrive in the HUB, or If I connect keyboard/mouse directy in the laptop, the issue appear in the first resume (I tested this about other 10 times) May be is related that a keyboard/mouse can awake from sleep?? 
I'm using the official PKGBUILD from the stock Arch Linux kernel to build my test kernels; this PKGBUILD downloads the vanilla kernel tarball from https://www.kernel.org/
BIOS Information
Vendor: LENOVO
Version: N1MET35W (1.20 )
Release Date: 05/17/2017
Address: 0xE0000
Runtime Size: 128 kB
ROM Size: 16 MB
BIOS Revision: 1.20
Firmware Revision: 1.11

Ok, so the culprit seems to be this one:
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=c3a696b6e8f8f75f9f75e556a9f9f6472eae2655
But it seems useful for making the suspend/resume cycle faster, and there should be users who want this feature.
So instead of reverting it, let me prepare a boot option for users who want to use it.
Thanks
Lv

Created attachment 256927 [details]
[PATCH] ACPI: EC: Revert back to default wait polling style processing in noirq stage
Could someone try to:
1. apply this commit.
2. boot the kernel with "acpi.ec_freeze_events=N acpi.ec_suspend_yield=Y" and let me know the result.
3. boot the kernel with "acpi.ec_freeze_events=Y acpi.ec_suspend_yield=Y" and let me know the result.
Thanks in advance.
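When reporting results for the two boot configurations above, it helps to confirm which options the running kernel was actually booted with. The following is a minimal Python sketch, not part of the patch: the function names are hypothetical, and `acpi.ec_suspend_yield` is assumed to exist only on kernels carrying attachment 256927.

```python
from typing import Dict

# Options discussed in this thread. ec_suspend_yield comes from the test patch
# (attachment 256927), so it may not exist on an unpatched kernel.
EC_OPTIONS = ("acpi.ec_freeze_events", "acpi.ec_suspend_yield")


def parse_ec_options(cmdline: str) -> Dict[str, str]:
    """Return the EC options present on a kernel command line."""
    found = {}
    for token in cmdline.split():
        name, sep, value = token.partition("=")
        if sep and name in EC_OPTIONS:
            found[name] = value
    return found


def describe(cmdline: str) -> str:
    """One-line summary suitable for pasting into a test report."""
    found = parse_ec_options(cmdline)
    return " ".join(f"{name}={found.get(name, '(default)')}" for name in EC_OPTIONS)


def read_cmdline(path: str = "/proc/cmdline") -> str:
    """Read the command line of the running kernel (Linux only)."""
    with open(path) as f:
        return f.read()
```

On a test machine, `print(describe(read_cmdline()))` prints the values in effect; options that were not passed are shown as `(default)`.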
It seems there are multi-layered issues/concepts revealed by these EC commits.
I'm not able to root cause all of them.
Let me just do what I need to do in the EC driver.
Thanks
Lv

Created attachment 256931 [details]
attachment-3214-0.html
Actually, today the issue came back for me as well, so at least there is no mystery :)
On Thu, Jun 8, 2017 at 10:34 PM, <bugzilla-daemon@bugzilla.kernel.org> wrote:
> https://bugzilla.kernel.org/show_bug.cgi?id=191181
>
> --- Comment #162 from Lv Zheng (lv.zheng@intel.com) ---
> It seems there are multi-layered issues/concepts revealed by these EC
> commits.
> I'm not able to root cause all of them.
> Let me just do what I need to do in the EC driver.
>
> Thanks
> Lv

(In reply to Lv Zheng from comment #161)
> Created attachment 256927 [details]
> [PATCH] ACPI: EC: Revert back to default wait polling style processing in
> noirq stage
>
> Could someone try to:
>
> 1. apply this commit.
> 2. boot the kernel with "acpi.ec_freeze_events=N acpi.ec_suspend_yield=Y"
> and let me know the result.
> 3. boot the kernel with "acpi.ec_freeze_events=Y acpi.ec_suspend_yield=Y"
> and let me know the result.
>
> Thanks in advance.
No issues with 2, 3, or without parameters (as the ec_suspend_yield default is TRUE); tested 5 times with each.
If booted with acpi.ec_suspend_yield=N, the issue appears on the first suspend.

Created attachment 256933 [details]
attachment-14163-0.html
Lv: let me know whether, now that you have isolated the commit, you still want me to do tests B and C.
On Fri, Jun 9, 2017 at 9:02 AM, <bugzilla-daemon@bugzilla.kernel.org> wrote:
> https://bugzilla.kernel.org/show_bug.cgi?id=191181
>
> --- Comment #164 from Fernando Chaves (nanochaves@gmail.com) ---
> (In reply to Lv Zheng from comment #161)
> > Created attachment 256927 [details]
> > [PATCH] ACPI: EC: Revert back to default wait polling style processing in
> > noirq stage
> >
> > Could someone try to:
> >
> > 1. apply this commit.
> > 2. boot the kernel with "acpi.ec_freeze_events=N acpi.ec_suspend_yield=Y"
> > and let me know the result.
> > 3. boot the kernel with "acpi.ec_freeze_events=Y acpi.ec_suspend_yield=Y"
> > and let me know the result.
> >
> > Thanks in advance.
>
>
> No issues with 2, 3 and without params (as ec_suspend_yield default is
> TRUE),
> tested 5 times with each
>
> If boot with acpi.ec_suspend_yield=N, issues appears in first suspend

(In reply to Lv Zheng from comment #161)
> Created attachment 256927 [details]
> [PATCH] ACPI: EC: Revert back to default wait polling style processing in
> noirq stage
>
> Could someone try to:
>
> 1. apply this commit.
> 2. boot the kernel with "acpi.ec_freeze_events=N acpi.ec_suspend_yield=Y"
> and let me know the result.
> 3. boot the kernel with "acpi.ec_freeze_events=Y acpi.ec_suspend_yield=Y"
> and let me know the result.
The issue persists in both cases for me. (Carbon 5th gen)

(In reply to Damjan Georgievski from comment #166)
> The (In reply to Lv Zheng from comment #161)
> > Created attachment 256927 [details]
> > [PATCH] ACPI: EC: Revert back to default wait polling style processing in
> > noirq stage
> >
> > Could someone try to:
> >
> > 1. apply this commit.
> > 2. boot the kernel with "acpi.ec_freeze_events=N acpi.ec_suspend_yield=Y"
> > and let me know the result.
> > 3. boot the kernel with "acpi.ec_freeze_events=Y acpi.ec_suspend_yield=Y"
> > and let me know the result.
>
> the issue persists in both cases for me. (Carbon 5th gen)
I retract this comment; by mistake I didn't run the patched kernel.
I'll be testing this branch for several hours:
https://github.com/gdamjan/linux/commits/bugzilla-191181-try4
i.e. the latest Linus tree plus the "EC: Revert..." patch.

(In reply to Lv Zheng from comment #161)
> Created attachment 256927 [details]
> [PATCH] ACPI: EC: Revert back to default wait polling style processing in
> noirq stage
>
> Could someone try to:
>
> 1. apply this commit.
> 2. boot the kernel with "acpi.ec_freeze_events=N acpi.ec_suspend_yield=Y"
> and let me know the result.
> 3. boot the kernel with "acpi.ec_freeze_events=Y acpi.ec_suspend_yield=Y"
> and let me know the result.
>
> Thanks in advance.
After testing it more carefully, it seems to work with "acpi.ec_freeze_events=Y acpi.ec_suspend_yield=Y".
Carbon 5th gen 20HQS0LV00
BIOS N1MET35W (1.20 ) 05/17/2017
Firmware Revision: 1.14

(In reply to Lv Zheng from comment #161)
> Created attachment 256927 [details]
> [PATCH] ACPI: EC: Revert back to default wait polling style processing in
> noirq stage
>
> Could someone try to:
>
> 1. apply this commit.
> 2. boot the kernel with "acpi.ec_freeze_events=N acpi.ec_suspend_yield=Y"
> and let me know the result.
> 3. boot the kernel with "acpi.ec_freeze_events=Y acpi.ec_suspend_yield=Y"
> and let me know the result.
>
> Thanks in advance.
After applying the patch it seems to be fixed.
Kernel: 4.11.4
BIOS: N1QET55W (1.30 ) 05/23/2017
Firmware Revision: 1.13
Model: T470

OK, it looks like your issue is related to this breakage.
I'll send the patch upstream.
Here is another fix; I hope someone could try it.
Let me post it later.
Thanks
Lv

Created attachment 256971 [details]
[PATCH] ACPI: EC: Mark a possible IRQ storm period
This is a known issue, not sure if it is related to the reported bug.
Please help to confirm.
Thanks
Lv
Please do not apply attachment 256927 [details] but apply attachment 256971 [details] instead and test again to see if the problem is fixed.

Just some further info: from what I've noticed, every time this happens the acpitz-virtual-0 sensor as reported by sensors is stuck at 48C.

Hi Lv: I will test the new patch soon.

(In reply to Lv Zheng from comment #171)
> Created attachment 256971 [details]
> [PATCH] ACPI: EC: Mark a possible IRQ storm period
>
> This is a known issue, not sure if it is related to the reported bug.
> Please help to confirm.
No, this patch alone didn't fix the issue (on top of kernel 4.12-rc5).
This is the kernel I've compiled:
https://github.com/gdamjan/linux/commits/bugzilla-191181-try5

Created attachment 256985 [details]
attachment-4105-0.html
Lv: the new patch does not work for me.
On Tue, Jun 13, 2017 at 7:45 AM, <bugzilla-daemon@bugzilla.kernel.org> wrote:
> https://bugzilla.kernel.org/show_bug.cgi?id=191181
>
> --- Comment #175 from Damjan Georgievski (gdamjan@gmail.com) ---
> (In reply to Lv Zheng from comment #171)
> > Created attachment 256971 [details]
> > [PATCH] ACPI: EC: Mark a possible IRQ storm period
> >
> > This is a known issue, not sure if it is related to the reported bug.
> > Please help to confirm.
>
> no, this patch alone didn't fix the issue (over kernel 4.12-rc5)
>
> this is the kernel I've compiled
> https://github.com/gdamjan/linux/commits/bugzilla-191181-try5

Thanks for the test, so I know it's not related.
I'll send refined attachment 256927 [details] to upstream.
Marking this as resolved.
Thanks
Lv
Here you can find the patch:
https://patchwork.kernel.org/patch/9785497/
If there is something wrong with it, please let me know.
Thanks
Lv

(In reply to Lv Zheng from comment #178)
> Here you can find the patch:
> https://patchwork.kernel.org/patch/9785497/
> If there is something wrong with it, please let me know.
I've applied these 3 patches from patchwork on top of 4.12-rc5, but I still have the issue.
This is the tree I built:
https://github.com/gdamjan/linux/commits/bugzilla-191181-try6

Actually I also suspected that the fix won't work.
Lv, are you misunderstanding the issue? The problem persists after it happens once, until a complete power reset. Rebooting into Windows does not solve it. It's definitely not only "during some kind of stage".

Created attachment 257005 [details]
attachment-6782-0.html
Hi Lv, I can confirm that if I'm experiencing the issue with some kernel and reboot into the patched kernel with either YY or NY parameters, then the issue is still there; the only solution for me at the moment is downgrading the EC firmware.
On Wed, Jun 14, 2017 at 7:26 AM, <bugzilla-daemon@bugzilla.kernel.org> wrote:
> https://bugzilla.kernel.org/show_bug.cgi?id=191181
>
> --- Comment #180 from Tatsuyuki Ishi (ishitatsuyuki@gmail.com) ---
> Actually I also suspected that the fix won't work.
>
> Lv, are you misunderstanding the issue? The problem persists after it
> happens
> once, until a complete power reset. Rebooting into Windows does not solve
> it.
> It's definitely not only "during some kind of stage".

> Lv, are you misunderstanding the issue?
My original understanding is: on some platforms, if you type "make -j XXX" to build a very big program, the fan can blow up. It might mean that the BIOS on such platforms has implemented a thermal policy that has something to do with CPU usage. So when we saw this problem, we were thinking "this bug is trivial".
Linux simply executes many driver teardowns/setups during suspend/resume, and all CPUs can be busy executing driver code when these run in parallel, filling up all CPUs. So if the thermal layer is still actively making decisions during this period, we may get a wrong fan action. The commit in question can actually fill up CPU usage: c3a696b6e8f8f75f9f75e556a9f9f6472eae2655. When you are trying to make things faster, how can you avoid using the CPU more? So it is no surprise that this can be bisected as the culprit of this kind of regression. TBH, I was actually thinking that fixing this regression is meaningless. Any driver can cause such a regression just because it becomes more active. This time it's the EC, simply because on some platforms the EC is busier than other components during the suspend/resume process. Maybe I'm wrong.

> The problem persists after it happens once, until a complete power reset.
> Rebooting into Windows does not solve it. It's definitely not only "during
> some kind of stage".
If it affects Windows as well, this is definitely a firmware bug. You should upgrade the firmware. But if Windows resumes with silent fans, it might not be related to the firmware but to an OS gap; it could be an architectural issue or a small gap.

OTOH, I think this bug seems to be common on many platforms. I was suffering from this several years ago when I started to contribute to the community. My test platforms always blow up to maximum speed after resume. I'm sure this is not a regression. It can just become more significant when some changes are made.

In this thread, we can see different bug reports from different reporters. They might be different bugs. Maybe this commit is useful for some of the reporters:
https://patchwork.kernel.org/patch/9785497/
There are 3 patches in the series, and 2 of them are not related to this issue.
https://patchwork.kernel.org/patch/9785499/: only relates to debugging logs.
https://patchwork.kernel.org/patch/9785489/: it's just a common deferred IRQ handling style: disable the IRQ, handle it, and re-enable it. During this period, the EC driver won't be busy polling, but will yield to other processes; there are always idle periods because wait() is still invoked. So it shouldn't be related to this issue.

First, this is not some kind of bug that just lasts for seconds. The effect won't go away until you completely power off.
The problem is: fan control gets broken and stays at either maximum speed or zero, regardless of the temperature.
I suspect some regression is triggering a bug in the firmware, or worse, undefined behavior.
Lv, I hope you understand with this explanation.

> First, this is not some kind of bug that just lasts for seconds. The effect
> won't go away until you completely power off.
I can see exactly the same problem on my Dell Latitude 6430u with very old kernels.
> The problem is: fan control got broken and either stays at maximum speed or
> zero, regardless the temperature.
Can this be improved in the thermal layer?
> I suspect some regression is triggering a bug in the firmware, or worse,
> undefined behavior.
Could you bisect the regression in your case?
From comment 154, in Fernando Chaves's case, it seems the culprit is c3a696b6e8f8f75f9f75e556a9f9f6472eae2655.
And this patch functionally reverts that commit:
https://patchwork.kernel.org/patch/9785497/
But for others this may not be the case.
> Lv, I hope you understand with this explanation.
No, I'm actually confused.

To Damjan:
For your case, please boot with "acpi.ec_freeze_events=Y" and confirm again.
Thanks in advance.

To Damjan:
Sorry, for your case, please boot with "acpi.ec_freeze_events=Y" or "acpi.ec_freeze_events=N" and confirm again.
Thanks in advance.

To Nicolo' and Tatsuyuki:
From your comments, I cannot see what the problem on your platforms is related to.
Could you also try to apply:
https://patchwork.kernel.org/patch/9785497/
And test with "acpi.ec_freeze_events=N"

I just asked a thermal expert. Rui told me that there could be many such fan-blowing bugs with various causes. So there must be several different bugs in the same report. Only the issues in the Fernando/Gjorgji/Damjan cases relate to the EC driver change; the others seem to be unrelated. So my patch description is not correct. I'll change it later.

Some more info: the bug happened again with this patch: https://bugzilla.kernel.org/attachment.cgi?id=256927&action=diff
It's mostly working fine, though: it happened only once in a week, whereas before it happened on almost every suspend/wakeup.

I just applied this: https://patchwork.kernel.org/patch/9785497/
This is applied on top of 4.11.5.
With "acpi.ec_freeze_events=N" it's broken, but "acpi.ec_freeze_events=Y" seems to be working fine for now. I'll be running this, so we'll see how it behaves.

Never mind that, it doesn't work; it appears to have been just luck the first few times.

Hi,
I've recently got a Lenovo Yoga X1 2nd gen (20JD) and it exhibits the same behavior, not only when it comes out of sleep but also when the AC is plugged in. I'm running the newest BIOS.
# dmidecode
...
Handle 0x000B, DMI type 0, 24 bytes
BIOS Information
Vendor: LENOVO
Version: N1NET24W (1.11 )
Release Date: 05/26/2017
Address: 0xE0000
Runtime Size: 128 kB
ROM Size: 16 MB
Characteristics:
PCI is supported
PNP is supported
BIOS is upgradeable
BIOS shadowing is allowed
Boot from CD is supported
Selectable boot is supported
EDD is supported
3.5"/720 kB floppy services are supported (int 13h)
Print screen service is supported (int 5h)
8042 keyboard services are supported (int 9h)
Serial services are supported (int 14h)
Printer services are supported (int 17h)
CGA/mono video services are supported (int 10h)
ACPI is supported
USB legacy is supported
BIOS boot specification is supported
Targeted content distribution is supported
UEFI is supported
BIOS Revision: 1.11
Firmware Revision: 1.9
...
# uname -a
Linux littletwo 4.11.5-1-ARCH #1 SMP PREEMPT Wed Jun 14 16:19:27 CEST 2017 x86_64 GNU/Linux
Is there anything I can do to help with diagnosing this issue? My knowledge of kernel internals is limited.

To: Gjorgji/Fernando/Damjan
There seem to be many different bugs in this report.
For your platforms, I opened a new one to track them.
Please find it at bug 196129.
Let's leave this bug to the thermal developers.
Thanks
Lv

In this bug report, we have
Tatsuyuki Ishi - ThinkPad X1 Yoga
0xbb - ThinkPad X1 Carbon 4th
Nicolo' - Thinkpad t460s
Claudio Sacerdoti Coen - ThinkPad X1 Carbon 4th
H Zeng - ThinkPad T470s
Max Deineko - X270
Jens Axboe - Thinkpad X1 Carbon gen4
Markus T.H. - Thinkpad X1 2017 Gen5
Alexander T. - Lenovo Edge E540 (ThinkPad)
Neil Kownacki - x270
Marcoen Hirschberg - T470
Damjan Georgievski - ThinkPad X1 Carbon 5th gen
a.piesk@gmx.net - ThinkPad T460s
Fernando Chaves - X1 Carbon 5th Gen
Hrvoje Zeba - Lenovo Yoga X1 2nd gen
TBH, comments from so many bug reporters in the same thread may be misleading, and I'm concerned about whether we have exactly the same problem on those laptops.
Thus I need input from all of you to make things clear.
1. There is a known solution that fixes all the problems, which is tracked separately in bug #196129. So, all of you, please confirm whether that solution works; if yes, please drop a note here and then switch to that thread.
2. For Ishi, the original bug reporter: please describe the current status of the problem again, with and without the solution in #196129.
3. For the others: if the solution in #196129 does not work, please wait and check Ishi's latest description of the problem; if it is exactly the same, please drop a note here, or else please also drop a note and let me check whether we should track it separately.
Thanks, all.

With the patch applied to 4.11.5-1-ARCH and kernel parameters set to 'acpi.ec_freeze_events=N acpi.ec_suspend_yield=Y', the system is behaving normally for now (about a day's worth). I put it to sleep and disconnect/connect the power adapter every now and then. I'll give it a few more days and then switch to 'acpi.ec_freeze_events=Y acpi.ec_suspend_yield=Y' to test it out. I'll report back with the results.

The solution in bug #196129 fixes all the problems for me (X1 Carbon 5th Gen).

Using 'acpi.ec_freeze_events=Y acpi.ec_suspend_yield=Y' has some weird effects on my system. It froze up multiple times and I had to go through multiple reboot/power cycles to get it up and running again.
So I would say the patched kernel works as expected with 'acpi.ec_freeze_events=N acpi.ec_suspend_yield=Y' and doesn't work with 'acpi.ec_freeze_events=Y acpi.ec_suspend_yield=Y'.
System is Lenovo Yoga X1 2nd gen (20JD).

(In reply to Fernando Chaves from comment #194)
> Solution in bug #196129 fix all the problems for me (X1 Carbon 5th Gen)
Please refer to https://bugzilla.kernel.org/show_bug.cgi?id=196129#c5 and confirm which solution fixes your problem.

It looks like by applying attachment 256927 [details] the problem hasn't been triggered so far. Should I try with the parameters as you said?
(As you know, I'm running X1 Yoga 1st gen)
(In reply to Hrvoje Zeba from comment #193)
> With the patch applied to 4.11.5-1-ARCH and kernel parameters set to
> 'acpi.ec_freeze_events=N acpi.ec_suspend_yield=Y', system is behaving
> normally for now (about a day's worth). I put it to sleep and
> disconnect/connect the power adapter every now and then. I'll give it a few
> more days and then switch to 'acpi.ec_freeze_events=Y
> acpi.ec_suspend_yield=Y' to test it out. I'll report back with the results.
so this is a duplicate of #196129

(In reply to Fernando Chaves from comment #194)
> Solution in bug #196129 fix all the problems for me (X1 Carbon 5th Gen)
so this is a duplicate of bug #196129.

(In reply to Tatsuyuki Ishi from comment #197)
> It looks like by applying attachment 256927 [details] the problem hasn't be
> triggered so far. Should I try with the parameters as you said?
>
> (As you know, I'm running X1 Yoga 1st gen)
so this is a duplicate of bug #196129

Tatsuyuki Ishi - ThinkPad X1 Yoga - duplicate of bug #196129
0xbb - ThinkPad X1 Carbon 4th
Nicolo' - Thinkpad t460s
Claudio Sacerdoti Coen - ThinkPad X1 Carbon 4th
H Zeng - ThinkPad T470s
Max Deineko - X270
Jens Axboe - Thinkpad X1 Carbon gen4
Markus T.H. - Thinkpad X1 2017 Gen5
Alexander T. - Lenovo Edge E540 (ThinkPad)
Neil Kownacki - x270
Marcoen Hirschberg - T470
Damjan Georgievski - ThinkPad X1 Carbon 5th gen - handled in bug #196129
Gjorgji Jankovski - T470 - handled in bug #196129
a.piesk@gmx.net - ThinkPad T460s
Fernando Chaves - X1 Carbon 5th Gen - duplicate of bug #196129
Hrvoje Zeba - Lenovo Yoga X1 2nd gen - duplicate of bug #196129
As all the people with the latest updates have confirmed that this can be handled by bug #196129, including the original bug reporter Ishi, I think this bug report and bug #196129 are duplicates.
As there are too many reports in this thread, which is misleading, I will close this bug report and focus on the EC problem in bug #196129.
For people who have a similar problem and have confirmed that the solution in bug #196129 does not work, please open a new bug report.

*** This bug has been marked as a duplicate of bug 196129 ***

I just came across https://bugzilla.redhat.com/show_bug.cgi?id=1480844#c48
It seems to be the same issue and was fixed by a new BIOS/EC for the T470s.

Created attachment 260383 [details]
Disable deeper C-states
Created attachment 260385 [details]
[PATCH] Tune S3 resume step order
To Andreas Piesk <a.piesk@gmx.net>:
Are you still suffering from this issue and monitoring this thread?
I think you are a T460s user.
If you are still suffering from this issue, would you please try attachment 260383 [details] to confirm whether the problem disappears with "acpi_resume_latency=25"?
If the problem can be fixed by the workaround, please give attachment 260385 [details] a try.
They are patches based on the 4.10 upstream kernel; if you have trouble applying them on the latest kernel, please give the 4.10 tag a try.
Thanks and best regards
Lv

Yes, I use a T460s and I'm still at EC 1.09, the last working firmware version. For now I will wait for the Lenovo BIOS team to check whether the T460s has the same issue and whether it can be fixed by a new firmware too. The issue looks the same but maybe it isn't.
If it cannot or will not be fixed by firmware, I will try your patch; thank you for posting it.
I just wanted to let people know that Lenovo is aware of the problem and is fixing it.
Thanks,
-ap

> I just wanted to let the people know that Lenovo is aware of the problem and
> is fixing it.
I think the same logic is in your EC FW (T460s), so from the EC FW's point of view, it should be the same issue IMO.
It can, however, be brought to the surface by different OS changes.
BIOS 1.26 was released for the X1 Carbon (5th gen) on 2017-11-17.
From https://download.lenovo.com/pccbbs/mobiles/n1mur11w.txt
> [Problem fixes]
> - Fixed an issue where fan might rotated with max speed due to not reading
> CPU
> temperature correctly.

(In reply to Damjan Georgievski from comment #208)
> 1.26 bios has been released for the X1 Carbon (5gen) on 2017-11-17
>
> From
> https://download.lenovo.com/pccbbs/mobiles/n1mur11w.txt
>
> > [Problem fixes]
> > - Fixed an issue where fan might rotated with max speed due to not reading
> > CPU
> > temperature correctly.
BIOS version 1.20 with the same patch notes came out a while back for the Lenovo Yoga X1 2nd gen. I'm using 4.13.12-1-ARCH and the problem persists if the scaling_governor is set to performance with the intel_pstate driver. The fan spinning after waking from sleep doesn't happen if it is set to powersave. Temperature readings seem to be correct (i.e., it's not stuck at 48 anymore).

I forgot to mention: when the machine wakes up, everything seems to be OK until CPU load increases. The fan starts blowing as it should, but it never spins down, no matter the temperature reading.

Created attachment 261061 [details]
attachment-320-0.html
It seems Lenovo finally released a BIOS+EC update that solves this issue for the t460s as well.
On Mon, Nov 20, 2017 at 12:09 PM, <bugzilla-daemon@bugzilla.kernel.org> wrote:
> https://bugzilla.kernel.org/show_bug.cgi?id=191181
>
> --- Comment #210 from Hrvoje Zeba (zeba.hrvoje@gmail.com) ---
> I forgot to mention. When the machine wakes up, everything seems to be ok
> until
> cpu load increases. Fan stars blowing as it should but it never spins
> down, no
> matter the temperature reading.

Created attachment 261063 [details]
attachment-1531-0.html
Actually, I spoke too early: the second time I suspended, the sensor got stuck at 48C again.
On Thu, Dec 7, 2017 at 5:05 PM, Nicolo' Piazzalunga <nicolopiazzalunga@gmail.com> wrote:
> it seems lenovo finally released a bios+ec update that solves such issue
> also for the t460s.
>
> On Mon, Nov 20, 2017 at 12:09 PM, <bugzilla-daemon@bugzilla.kernel.org>
> wrote:
>
>> https://bugzilla.kernel.org/show_bug.cgi?id=191181
>>
>> --- Comment #210 from Hrvoje Zeba (zeba.hrvoje@gmail.com) ---
>> I forgot to mention. When the machine wakes up, everything seems to be ok
>> until
>> cpu load increases. Fan stars blowing as it should but it never spins
>> down, no
>> matter the temperature reading.

(In reply to Nicolo' from comment #212)
> Created attachment 261063 [details]
> attachment-1531-0.html
>
> Actually, I spoke too early, as on the second time I suspended the sensor
> is stuck at 48C again..
>
Yep, it doesn't fix the issue; I reverted back to EC 1.09, my last known good EC version.

I just came here from the email notification of new comments. I want to report that I have not run into this situation for a long time with my T470s. For me, this bug was fixed when the patch from Lv Zheng was merged into the kernel (I think in version 4.11 or 4.12 or so). Since then, I have upgraded my openSUSE Tumbleweed now and then, and updated the BIOS firmware from Lenovo as new versions were released (2 or 3 new versions since then). This issue has never shown up again, although I use sleep almost all the time (1 to 3 times per day), with or without the Lenovo firmware fix for this issue.

Just wanted to chime in here in case others come looking as well. I read through the whole thread but have not applied any of the suggested fixes. I have a Lenovo X1 Yoga and have only recently run into this issue, but it is reproducible 100% of the time.
BIOS Revision: 1.33
Firmware Revision: 1.18
Kernel: 4.14.14-1-ARCH

I was running into this 100% of the time with BIOS v1.33 + 4.13.0-32-generic on my X1 Yoga. However, after looking online, they only seem to publicize v1.32 https://pcsupport.lenovo.com/sa/en/products/laptops-and-netbooks/thinkpad-x-series-laptops/thinkpad-x1-yoga-type-20fq-20fr/downloads/ds111756, so I wonder if they're silently rolling back v1.33.
I downloaded v1.32 and rolled back to that version. On v1.32 I sporadically run into this issue, but if I do, I just suspend again (via GNOME's top right power button, while holding Alt) and re-awaken. The fans will still occasionally spin at full speed again, but more often than not re-suspending and re-awakening fixes it.
BIOS Revision: 1.32
Firmware Revision: 1.18
Kernel: 4.13.0-32-generic

BIOS Revision: 1.32
Kernel: 4.13.0-32-generic
Laptop: X1 Carbon, 4th generation
I'm also still having this issue. Contrary to Zach's report above, the fan gets stuck at 100% every time after sleep. Going back to 1.32 did not help at all.
Is there really no workaround for this problem? Hibernating does not seem to be supported for my model, so that's not an option.
Currently I just sleep/wake up 3-4 times until the fan suddenly stays low, but I'm confused because I thought the bug had been fixed in recent kernel versions. Or is it fixed only for some models?
Is there anything I can still do to help fix this?

Just wanted to confirm that a BIOS upgrade to 1.24 fixed the issue on my X270 as well (running a 4.9.77 kernel now).

A cold winter triggers the problem:
1. Start the computer.
2. Put it into the sleep state.
3. Go outside to cool down the computer (under 2 °C or so).
4. After wake-up, the fan blows at full speed.

I hesitate to post this because it's not apples-to-apples, but I switched to Arch (via Antergos) and this problem has gone away with the updated kernel for me. I'm still running the same BIOS and firmware.
BIOS Revision: 1.32
Firmware Revision: 1.18
Kernel: Linux 4.15.3-1-ARCH #1 SMP PREEMPT Mon Feb 12 23:01:17 UTC 2018 x86_64 GNU/Linux

Created attachment 274183 [details]
attachment-15943-0.html
OOO till Feb 28th.
A downgrade to BIOS version 1.29 (n1fur22w) has now solved the issue for me for 2 consecutive weeks.
My environment: Laptop: ThinkPad X1 Carbon 20FC, kernel: 4.13.0-32-generic.
See also this Ask Ubuntu post: https://askubuntu.com/a/1005165/255917

Created attachment 274479 [details]
attachment-29289-0.html
The problem is solved on my t460s with recent EC update by Lenovo,
currently kernel 4.15.5, BIOS Revision: 1.33, Firmware Revision: 1.14.
The issue happens on Mageia with 4.14.65-desktop. Downgrading to BIOS version 1.29 (n1fur22w) seems to solve the problem.

Downgrading the BIOS from 1.37 to 1.29 fixed it for me, although I had to disable "secure rollback prevention" in the BIOS first.
Hardware: X1 Carbon 4th Gen, type: 20FC

Hi!
This issue is still present on Linux:
4.18.0-2-amd64 #1 SMP Debian 4.18.10-2 (2018-11-02) x86_64 GNU/Linux
debian/testing
on Lenovo Carbon X1 4th Generation, type 20FC
BIOS 1.39
ThinkPad BIOS N1FET65W (1.39 )
and is annoying as ...
Regards,

A temporary fix for this problem is doing 2 consecutive suspend/resume cycles; after that, fan1 goes from speeding (~6k RPM):
iwlwifi-virtual-0
Adapter: Virtual device
temp1: +33.0°C

pch_skylake-virtual-0
Adapter: Virtual device
temp1: +36.5°C

acpitz-virtual-0
Adapter: Virtual device
temp1: +48.0°C (crit = +128.0°C)

thinkpad-isa-0000
Adapter: ISA adapter
fan1: 6912 RPM
fan2: 65535 RPM

to 0 RPM:
iwlwifi-virtual-0
Adapter: Virtual device
temp1: +30.0°C

pch_skylake-virtual-0
Adapter: Virtual device
temp1: +36.0°C

acpitz-virtual-0
Adapter: Virtual device
temp1: +36.0°C (crit = +128.0°C)

thinkpad-isa-0000
Adapter: ISA adapter
fan1: 0 RPM
fan2: 65535 RPM

coretemp-isa-0000
Adapter: ISA adapter
Package id 0: +36.0°C (high = +100.0°C, crit = +100.0°C)
Core 0: +32.0°C (high = +100.0°C, crit = +100.0°C)
Core 1: +33.0°C (high = +100.0°C, crit = +100.0°C)

Hi, I'm using the 5.9.1 kernel on a ThinkPad X1 Carbon 4th gen, 20FC000RAU, with BIOS version N1FET73W, and I still experience this problem regularly. After waking from suspend, 30-40% of the time the fans will hit full speed and stay on until I suspend and wake again, which fixes it 100% of the time. The BIOS is updated to the most recent release (13 July, BIOS v 1.47 (N1FET73W), ECP v 1.18 (N1FHT35W)).
I'm not sure whether to report this bug here or at #196129, but I have been experiencing this problem on this laptop on 5.4, 5.7 and 5.8 kernels as well as 5.9 currently.
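The symptom reported repeatedly in this thread (acpitz pinned at +48.0°C while fan1 runs near full speed) can be checked mechanically from `sensors` output. Below is a minimal Python sketch, assuming the plain-text layout shown in the comments above; the function names and the 6000 RPM threshold are hypothetical, and the 48.0 value is only what reporters observed here, not a documented constant.

```python
import re
from typing import Optional, Tuple

STUCK_TEMP_C = 48.0   # value reporters saw on acpitz when the bug triggers
FAN_BUSY_RPM = 6000   # hypothetical threshold, roughly the ~6k RPM reported


def parse_sensors(text: str) -> Tuple[Optional[float], Optional[int]]:
    """Return (acpitz temp1 in °C, thinkpad fan1 in RPM) from `sensors` output."""
    acpitz_temp = None
    fan1_rpm = None
    chip = ""
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        if ":" not in line:
            chip = line  # chip header such as "acpitz-virtual-0"
            continue
        label, value = line.split(":", 1)
        if chip.startswith("acpitz") and label == "temp1":
            match = re.search(r"[-+]?\d+(?:\.\d+)?", value)
            if match:
                acpitz_temp = float(match.group())
        elif chip.startswith("thinkpad") and label == "fan1":
            match = re.search(r"\d+", value)
            if match:
                fan1_rpm = int(match.group())
    return acpitz_temp, fan1_rpm


def looks_stuck(text: str) -> bool:
    """True if the output matches the stuck-sensor pattern from this thread."""
    temp, rpm = parse_sensors(text)
    return temp == STUCK_TEMP_C and rpm is not None and rpm >= FAN_BUSY_RPM
```

Feeding it the output of `sensors` captured right after a resume gives a quick yes/no to include in a report; it only recognises the pattern described here and says nothing about the cause.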
I can confirm the above comment by Luke Midworth that this bug is still present also on my Lenovo ThinkPad X1 Carbon Gen 4. acpitz-acpi-0 seems to get stuck at 48 degrees C after resuming and the fan keeps blowing as a result. I'm running kernel 5.9.11.

Should a new bug report be created for this, since this one has been marked "resolved"?

Updated to a newer distro and a newer kernel, 5.4.0.
The bug is still there: ThinkPad T470s.

(In reply to permaer from comment #228)
> I can confirm the above comment by Luke Midworth that this bug is still
> present also on my Lenovo Thinkpad X1 Carbon Gen 4. acpitz-acpi-0 seems to
> get stuck at 48 degrees C after resuming and fan keeps blowing as a result.
> I'm running kernel 5.9.11.
>
> Should a new bug report be created for this, since this one has been marked
> "resolved"?

(In reply to Ilya from comment #229)
> Updated to newer distro, newer kernel: 5.4.0
>
> The bug still there: Thinkpad T470s
If there are three of us experiencing this bug, I think we should create a new bug report. Would you mind doing this, @permaer?

@Luke Midworth @Ilya I just created a new bug report here: https://bugzilla.kernel.org/show_bug.cgi?id=211313
You could both help by reporting your experience in that report as well. |