Bug 191181 - Fans blowing at max speed after resuming - ThinkPad X1/T4xx series
Status: RESOLVED DUPLICATE of bug 196129
Alias: None
Product: ACPI
Classification: Unclassified
Component: Power-Thermal
Hardware: Intel Linux
Importance: P1 normal
Assignee: Zhang Rui
Duplicates: 195239

Reported: 2016-12-26 13:45 UTC by Tatsuyuki Ishi
Modified: 2018-12-30 14:44 UTC
CC List: 38 users

Kernel Version: 4.9
Tree: Mainline
Regression: Yes


Attachments
required logs (33.36 KB, application/x-compressed-tar)
2017-03-31 20:57 UTC, Claudio Sacerdoti Coen
Force D3 (1.55 KB, application/mbox)
2017-03-31 21:27 UTC, Srinivas Pandruvada
config.gz from 4.10.6-1-ARCH (44.59 KB, application/gzip)
2017-04-01 20:49 UTC, Nicolo'
dmesg,powertop,turbostat (50.02 KB, application/x-xz)
2017-04-02 02:17 UTC, Nicolo'
sensors (408 bytes, application/x-xz)
2017-04-02 13:27 UTC, Nicolo'
logs showing wakeup with connected/disconnected power adapter (180.00 KB, application/x-tar)
2017-04-02 23:54 UTC, 0xbb
sensors during stress (3.53 KB, text/plain)
2017-04-03 17:17 UTC, Nicolo'
acpidump (524.74 KB, text/plain)
2017-04-03 17:19 UTC, Nicolo'
customized _TMP method for Nicolo' (648 bytes, application/octet-stream)
2017-04-07 03:26 UTC, Zhang Rui
X270 acpidump, 4.10.8 config (180.00 KB, application/x-tar)
2017-04-10 06:25 UTC, Max Deineko
customized DSDT : dump _TMP method for Nicolo' (627.98 KB, application/octet-stream)
2017-04-17 06:02 UTC, Zhang Rui
grep and dmesg before and after problem (15.13 KB, application/x-xz)
2017-04-17 15:58 UTC, Nicolo'
grep and dmesg before and after problem -- corrected (15.49 KB, application/x-xz)
2017-04-18 01:33 UTC, Nicolo'
acpidump from old firmware (524.74 KB, text/plain)
2017-04-25 12:13 UTC, Nicolo'
x270 acpi dump without issue (761.13 KB, text/plain)
2017-05-04 02:27 UTC, Neil Kownacki
testA (51.20 KB, text/plain)
2017-06-07 01:52 UTC, Nicolo'
attachment-6102-0.html (1.66 KB, text/html)
2017-06-08 18:57 UTC, Nicolo'
[PATCH] ACPI: EC: Revert back to default wait polling style processing in noirq stage (1.38 KB, patch)
2017-06-09 02:30 UTC, Lv Zheng
attachment-3214-0.html (1.07 KB, text/html)
2017-06-09 12:15 UTC, Nicolo'
attachment-14163-0.html (1.68 KB, text/html)
2017-06-09 13:13 UTC, Nicolo'
[PATCH] ACPI: EC: Mark a possible IRQ storm period (1.34 KB, patch)
2017-06-13 04:40 UTC, Lv Zheng
attachment-4105-0.html (1.40 KB, text/html)
2017-06-13 14:33 UTC, Nicolo'
attachment-6782-0.html (1.35 KB, text/html)
2017-06-14 17:28 UTC, Nicolo'
Disable deeper C-states (3.42 KB, patch)
2017-10-25 07:06 UTC, Lv Zheng
[PATCH] Tune S3 resume step order (2.38 KB, patch)
2017-10-25 07:06 UTC, Lv Zheng
attachment-320-0.html (1.08 KB, text/html)
2017-12-07 22:05 UTC, Nicolo'
attachment-1531-0.html (1.67 KB, text/html)
2017-12-07 22:19 UTC, Nicolo'
attachment-15943-0.html (1.61 KB, text/html)
2018-02-15 16:04 UTC, Zhang Rui
attachment-29289-0.html (159 bytes, text/html)
2018-02-28 03:42 UTC, Nicolo'

Description Tatsuyuki Ishi 2016-12-26 13:45:10 UTC
Device: ThinkPad X1 Yoga

The fan was normal on 4.8 (4.8.13 worked well), but after upgrading to 4.9 it started to malfunction. I'm running 4.9.0-ARCH.

After resuming, the fan randomly accelerates to maximum speed and doesn't stop. Another suspend + resume sometimes quiets it down. This happens randomly, but is reproduced frequently (>30% occurrence).

I haven't noticed any suspicious dmesg, but this one is a little bit strange:
thermal thermal_zone2: failed to read out thermal zone (-5)
This was zone3 when I'm running 4.8 though.

An interesting thing is that when I rebooted and resumed the hibernated Windows session (fast boot) while the fan was blowing, it continued to blow as well; I'm not sure whether it's some register corruption.
Comment 1 Tatsuyuki Ishi 2017-03-14 09:48:00 UTC
Still not fixed in 4.10, and downgrading to 4.8.14 resolves the problem. Please take a look.
Comment 2 0xbb 2017-03-14 10:03:05 UTC
I can confirm this exact same problem on a ThinkPad X1 Carbon 4th.

More precise machine information:
DMI: LENOVO 20FB0043GE/20FB0043GE, BIOS N1FET49W (1.23 ) 02/08/2017
Comment 3 Nicolo' 2017-03-22 14:56:08 UTC
Same problem with a ThinkPad T460s and kernels of the 4.9 and 4.10 series: after suspend, the sensor acpitz-virtual-0 sometimes (randomly) reports 20+ degrees more than the actual temperature, thus triggering the fan. Perhaps the control for this sensor was somehow broken in the newer kernels.
Comment 4 Claudio Sacerdoti Coen 2017-03-24 11:32:09 UTC
Confirmed as well on a ThinkPad X1 Carbon 4th.
Comment 5 Tatsuyuki Ishi 2017-03-28 09:43:10 UTC
This doesn't happen with 4.11 kernel built without ISH support, adding maintainers to CC list.

PS: the sensors were working very poorly, so I suspected it was corrupting something too.
Comment 6 Tatsuyuki Ishi 2017-03-28 11:18:29 UTC
I'm also suffering from https://lkml.org/lkml/2017/3/21/900 on 4.11-rc4, and it makes my test unreliable. It *always* crashes on second resume. Waiting for next merge to perform a test with ISH enabled.

Interesting factor: after the BUG is triggered, I guess the kernel is somewhat frozen, and that also makes the fan blow indefinitely. (This is reproduced with ISH turned off.)
Comment 7 Claudio Sacerdoti Coen 2017-03-31 09:40:20 UTC
I confirm that the problem is about the ISH support.

Hibernation, instead, has no problem. Moreover, when the fan is running at full throttle after a resume from suspend, hibernating and resuming fixes the issues.
Comment 8 Srinivas Pandruvada 2017-03-31 19:06:35 UTC
Could you try these steps:
- Build 4.11 with ISH support, which you suspect here.
- Boot.
- Attach dmesg.
- Attach the output of
lsmod | grep -i ish
- Attach the output of
systemctl status iio-sensor-proxy
- Attach cat /proc/interrupts just before suspend.
- Suspend and resume.
- Attach cat /proc/interrupts again.
Does it show a huge difference in counts?

Boot again, and next time run
# systemctl stop iio-sensor-proxy
and then suspend/resume.
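
The before/after comparison of /proc/interrupts requested above can also be done mechanically. The following is only a minimal sketch: the function names, the threshold and the sample snapshots are made up for illustration, and the parser assumes the usual layout of a CPU header row followed by "label: count count ... description" rows.

```python
def parse_interrupts(text):
    """Return {irq_label: total_count}, summed over all CPU columns."""
    lines = text.strip().splitlines()
    n_cpus = len(lines[0].split())  # header row: CPU0 CPU1 ...
    totals = {}
    for line in lines[1:]:
        label, _, rest = line.partition(":")
        counts = []
        for field in rest.split()[:n_cpus]:
            if not field.isdigit():
                break  # reached the descriptive part of the row
            counts.append(int(field))
        totals[label.strip()] = sum(counts)
    return totals


def interrupt_deltas(before, after, threshold=10000):
    """Return IRQs whose total grew by at least `threshold` (arbitrary cutoff)."""
    b, a = parse_interrupts(before), parse_interrupts(after)
    deltas = {irq: a[irq] - b.get(irq, 0) for irq in a}
    return {irq: d for irq, d in deltas.items() if d >= threshold}


# Made-up snapshots, not taken from the attached logs.
SAMPLE_BEFORE = """\
           CPU0       CPU1
  0:         20          0   IO-APIC    2-edge      timer
  9:        100        200   IO-APIC    9-fasteoi   acpi
ERR:          0
"""
SAMPLE_AFTER = """\
           CPU0       CPU1
  0:         20          0   IO-APIC    2-edge      timer
  9:      50100        200   IO-APIC    9-fasteoi   acpi
ERR:          0
"""

if __name__ == "__main__":
    print(interrupt_deltas(SAMPLE_BEFORE, SAMPLE_AFTER))
```

With real data, the two arguments would be the contents of the files saved before and after the suspend/resume cycle.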
Comment 9 Claudio Sacerdoti Coen 2017-03-31 20:57:34 UTC
Created attachment 255671 [details]
required logs

All the logs required, in a .tgz file
Comment 10 Claudio Sacerdoti Coen 2017-03-31 21:00:08 UTC
(In reply to Claudio Sacerdoti Coen from comment #9)
> Created attachment 255671 [details]
> required logs
> 
> All the logs required, in a .tgz file

The name of the files should be easily mapped to the logs required.
interrupts_{before,after} are with iio-sensor-proxy not stopped
interrupts_stopped_{before,after} are with iio-sensor-proxy stopped

stopping iio-sensor-proxy did not solve the issue
Comment 11 Srinivas Pandruvada 2017-03-31 21:27:03 UTC
Created attachment 255673 [details]
Force D3
Comment 12 Srinivas Pandruvada 2017-03-31 21:28:35 UTC
I don't see any interrupt storm. Do you have a fan device here?
# cat /sys/class/thermal/cooling_device*/type

Also a test/debug patch is attached. Please try and send me dmesg.
Comment 13 Claudio Sacerdoti Coen 2017-03-31 22:17:09 UTC
claudio@zenone:~$ cat /sys/class/thermal/cooling_device*/type
Processor
Processor
Processor
Processor
intel_powerclamp
iwlwifi

I will try the test/debug patch and let you know.
Comment 14 H Zeng 2017-04-01 12:55:07 UTC
(In reply to Tatsuyuki Ishi from comment #0)
> I haven't noticed any suspicious dmesg, but this one is a little bit strange:
> thermal thermal_zone2: failed to read out thermal zone (-5)
> This was zone3 when I'm running 4.8 though.

I am using a ThinkPad T470s. I was having this problem (fan runs at maximum speed after resuming from suspend to RAM) several days ago. But the problem has been gone since the day before yesterday, when I upgraded my OS (openSUSE Tumbleweed) to the latest snapshots, 20170324 and 20170328. (I was testing something, so these two were left to be updated together.) The kernel source is 4.10.4-1.4 or 4.10.5.

Today, when I was testing to confirm that the problem had vanished, I noticed that I have this same warning in the output of `dmesg` too.

`thermal thermal_zone2: failed to read out thermal zone (-5)`

It is not relevant to this problem, at least not for me.
Comment 15 Tatsuyuki Ishi 2017-04-01 12:57:39 UTC
Zeng, please confirm that your kernel is built with ISH support (CONFIG_INTEL_ISH) and you have tried multiple times to reproduce the problem.

Please also check if the fan spins up on heavy load, there's a rare case where it will not spin at all burning your machine.
Comment 16 H Zeng 2017-04-01 13:06:18 UTC
(In reply to Tatsuyuki Ishi from comment #15)
> Zeng, please confirm that your kernel is built with ISH support
> (CONFIG_INTEL_ISH) and you have tried multiple times to reproduce the
> problem.
Hi, I think I do not have ISH support because `lsmod | grep -i ish` outputs nothing. So maybe my previous problem is not the same as this one although the symptoms match well.

> 
> Please also check if the fan spins up on heavy load, there's a rare case
> where it will not spin at all burning your machine.
Yes, the fan runs well under load test.
Comment 17 Tatsuyuki Ishi 2017-04-01 13:07:53 UTC
Zeng, it's known that this bug won't trigger on kernels with the CONFIG_INTEL_ISH disabled. Have nice days with that build (it's now on by default on most distros despite the driver's crappiness).
Comment 18 H Zeng 2017-04-01 13:14:07 UTC
Hi Ishi, thanks for your confirmation. Hope you guys get a solution soon.
Comment 19 Nicolo' 2017-04-01 14:27:27 UTC
Hi Ishi, one remark: also for me the command `lsmod | grep -i ish` does not output anything, but I clearly have the problem, on a t460s with 4.10.6 the sensor sometimes registers 20+ degrees more, and triggers fan. Is that strange?
Comment 20 Tatsuyuki Ishi 2017-04-01 14:30:02 UTC
Nicolo, I haven't built a custom build of 4.10.6 yet so I'm not 100% confident.

I've heard you're running Arch. If so, the stock kernel is built with CONFIG_INTEL_ISH. Do you blacklist the modules? That won't solve the problem at all (the reason remains unknown).
Comment 21 Srinivas Pandruvada 2017-04-01 15:06:09 UTC
Tatsuyuki Ishi: You are saying blacklist of module doesn't help. Is this correct?
Comment 22 Nicolo' 2017-04-01 15:45:12 UTC
Yes, I'm running Arch, but I'm not blacklisting anything, at least not intentionally, namely I just take the stock kernel from Arch.
Comment 23 Srinivas Pandruvada 2017-04-01 17:05:16 UTC
Nicolo: Can you upload your kernel config (/boot/config-4.10 ..)?
Comment 24 Nicolo' 2017-04-01 17:31:01 UTC
Sure, could you be more precise? My boot folder only contains

BOOT  initramfs-linux-fallback.img  intel-ucode.img  vmlinuz-linux
EFI   initramfs-linux.img           loader

Where can I find the file you mention? (sorry for my ignorance)
Comment 25 Srinivas Pandruvada 2017-04-01 18:06:57 UTC
I don't use Arch Linux, but on a running system (provided Arch Linux enables the in-kernel config) you can try
#zcat /proc/config.gz
Comment 26 H Zeng 2017-04-01 18:25:33 UTC
Can everyone confirm whether you are using tpacpi-bat from <https://github.com/teleshoes/tpacpi-bat>? And whether the problem remains if removing tpacpi-bat?
Comment 27 Nicolo' 2017-04-01 20:49:00 UTC
Created attachment 255697 [details]
config.gz from 4.10.6-1-ARCH

Srinivas: that worked, here's the file.
Comment 28 Srinivas Pandruvada 2017-04-01 21:25:03 UTC
Thanks Nicolo. You have ISH enabled as a module. Since you don't see any ish module in lsmod, your system doesn't have ISH.

Also, thermal sensors are not handled by ISH; not all sensors are owned by ISH.

I guess there is some problem somewhere else. Maybe it is noticed more on ISH-enabled systems because they get some more interrupts after resume.

Anyway, in case it is ISH, I uploaded a test patch, and Claudio Sacerdoti will probably try it.
We can go from there. Based on his current logs, there are no interrupt storms or anything, but we will try a couple more options to isolate this. I have two other ThinkPads, and I don't see any issue on them.

If it is not ISH, we will treat this as a thermal regression and debug the issue further.
There are a number of reasons this could be triggered, so if anybody wants to help debug, please do.
Comment 29 Nicolo' 2017-04-01 22:14:40 UTC
I see, thanks. At least in my case it is clearly sensor acpitz-virtual-0 that sometimes  reports 20+ degrees (say 48C instead of 28C) after suspend to RAM, and triggers the fan. I'm on a t460s (no GPU) with the above kernel. Let me know whether I should perform some tests.
Comment 30 Srinivas Pandruvada 2017-04-01 23:48:19 UTC
Nicole: Try this
#turbostat --debug
# In another window run
sudo powertop --time=30 --html

# echo mem > /sys/power/state
Wake up with the power button after suspend.
Let powertop and turbostat run for a few minutes, then copy and paste the screen output
to a file and send it. powertop will also generate an HTML file.

Attach both.
Comment 31 Srinivas Pandruvada 2017-04-02 01:19:49 UTC
Nicole: Also include the output of dmesg covering boot through the completed suspend and resume, taken when you think you have reached the problem condition
#dmesg > ~/dmesg.txt
Comment 32 Nicolo' 2017-04-02 02:17:48 UTC
Created attachment 255699 [details]
dmesg,powertop,turbostat

I'm attaching:
-dmesg after the problem
-powertop files from before suspending, after suspending without the problem, and after suspending with the problem
-turbostat files with and without the problem

Btw I'm Nicolo' :)
Comment 33 Nicolo' 2017-04-02 13:27:50 UTC
Created attachment 255703 [details]
sensors

Perhaps I'm overemphasizing this, but here it's clear which sensor is not working. Also, I tried to reproduce the problem by using echo mem > /sys/power/state (as opposed to just closing the laptop lid), but it seems to happen less often (or maybe I was just unlucky).
Comment 34 Srinivas Pandruvada 2017-04-02 20:54:55 UTC
Nicolo: Thanks for sending logs.

From dmesg
- There is no ISH in the system.

From both the "with" and "without" turbostat/powertop logs
- Temperature is around 30C, which is not high enough to be of concern (105C is max).
- The processor is almost idle (look at CPU idle). You are even reaching PC7. The processor is active less than 3% of the time. Actually, activity in your "with" logs is lower than in the "without" logs.
So from a CPU core perspective there is no concern here. The system is idle in deep C-states. Package and core temperatures are low.

The only concern here is why the fan is still blowing. Your "powertop with after" log shows
4660 rpm		Device	Laptop fan

Are you charging the laptop? Also dump
grep . /sys/class/thermal/*
If Linux OS thermal is supposed to control the fan, it will be listed here. If it is controlled by the "Embedded controller", it will not be here.

You may want to look at the ThinkPad fan control program. There is a way to control the fan from user space.
This may be in /sys/devices/platform/thinkpad_hwmon, by controlling pwm. Some ThinkPads will also have /proc/acpi/ibm/fan.
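
To look at the user-space fan interface mentioned above without changing anything, a read-only sketch like the one below may help. It assumes the standard hwmon attribute names fan1_input, pwm1 and pwm1_enable sit directly under /sys/devices/platform/thinkpad_hwmon; on other kernels they may live in a hwmon/hwmonN subdirectory instead. The function name is hypothetical.

```python
from pathlib import Path


def read_fan_state(hwmon_dir="/sys/devices/platform/thinkpad_hwmon"):
    """Read fan-related hwmon attributes as integers; skip anything unreadable.

    The directory layout is an assumption; only reads are performed.
    """
    base = Path(hwmon_dir)
    state = {}
    for name in ("fan1_input", "pwm1_enable", "pwm1"):
        attr = base / name
        try:
            state[name] = int(attr.read_text().strip())
        except (OSError, ValueError):
            continue  # attribute missing, unreadable, or not a number
    return state


if __name__ == "__main__":
    # fan1_input is the fan speed in RPM per the hwmon naming convention.
    print(read_fan_state())
```

An empty result simply means none of the assumed attributes were found at that path.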
Comment 35 Srinivas Pandruvada 2017-04-02 21:00:10 UTC
Regarding your sensor.tar
acpitz-virtual-0
Adapter: Virtual device
temp1:        +48.0°C  (crit = +128.0°C)

coretemp-isa-0000
Adapter: ISA adapter
Package id 0:  +23.0°C  (high = +100.0°C, crit = +100.0°C)
Core 0:        +23.0°C  (high = +100.0°C, crit = +100.0°C)
Core 1:        +23.0°C  (high = +100.0°C, crit = +100.0°C)

The ACPI virtual sensor should be close to the package temperature, which is also close to what turbostat reported.

I think the fan controller in the EC is looking at acpitz-virtual-0, which is not correct here. So this is broken.

We can also look at
#acpidump > acpi.out
Comment 36 Srinivas Pandruvada 2017-04-02 21:02:09 UTC
Also you can run some workload like
# stress -c 4

and run sensors to see whether "acpitz-virtual-0" changes. I guess it will show 48C.
Comment 37 0xbb 2017-04-02 23:04:48 UTC
This bug doesn't seem to trigger (the fan stays quiet) if the laptop is woken up from standby while being connected to the power adapter.
Comment 38 Tatsuyuki Ishi 2017-04-02 23:37:42 UTC
Just upgraded to 4.10.8, the bug seems to be less likely triggered.

All the things I know:

1. I'm not supposing that this problem is directly related to the thermal sensors, but it's likely that something is corrupting EC registers.
2. Manually setting EC fan control level works, but it will blow back up when you revert it to "auto".

0xbb: No, the bug is very likely to trigger when being connected to the power adapter.
Comment 39 0xbb 2017-04-02 23:54:59 UTC
Created attachment 255727 [details]
logs showing wakeup with connected/disconnected power adapter
Comment 40 0xbb 2017-04-03 00:08:00 UTC
I am almost sure that this bug never triggered during the last month when I woke up the laptop while charging.

I attached some dmesg logs showing this behavior:

1. dmesg_before_unpugged.txt: normal running; disconnected the power adapter.
2. dmesg_wakeup_disconnected.txt: suspend; wake up; fan is at max speed.
3. dmesg_wakeup_pluggedin.txt: suspend again; connected  power adapter; wake, fan control is working again.
Comment 41 Claudio Sacerdoti Coen 2017-04-03 00:55:42 UTC
I just tried and it is triggered for me even while the laptop is charging.
Comment 42 Nicolo' 2017-04-03 02:53:18 UTC
Srinivas: I can confirm the problem may appear both while charging and on battery.

grep . /sys/class/thermal/*
grep: /sys/class/thermal/cooling_device0: Is a directory
grep: /sys/class/thermal/cooling_device1: Is a directory
grep: /sys/class/thermal/cooling_device2: Is a directory
grep: /sys/class/thermal/cooling_device3: Is a directory
grep: /sys/class/thermal/cooling_device4: Is a directory
grep: /sys/class/thermal/cooling_device5: Is a directory
grep: /sys/class/thermal/thermal_zone0: Is a directory
grep: /sys/class/thermal/thermal_zone1: Is a directory
grep: /sys/class/thermal/thermal_zone2: Is a directory
grep: /sys/class/thermal/thermal_zone3: Is a directory

ls /sys/devices/platform/thinkpad_hwmon/
driver           fan1_input  modalias  power  pwm1_enable  uevent
driver_override  hwmon       name      pwm1   subsystem

I didn't try the specific stress you mention, but with a similar program generating prime numbers there are 2 cases: either we start with the normal situation, then the system heats and fan regularly starts, or we start with the fan problem, in which case the fan basically blows at 4k rpm all the time. I will try this again tomorrow to monitor more closely the sensor's behavior.

As for acpidump, I cannot find a simple way to install it in Arch linux, maybe someone can help?

I agree that something is looking at acpi-virtual-0, which is sometimes broken after suspend to ram for kernels after 4.8.
Please let me know if I should perform any other tests.
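
The grep above only reports "Is a directory" because each entry under /sys/class/thermal is a directory; the readings are in the files inside. As a rough sketch (the function name is made up), the type and temperature of every thermal zone can be collected like this, relying on the sysfs convention that temp is given in millidegrees Celsius:

```python
from pathlib import Path


def read_thermal_zones(base="/sys/class/thermal"):
    """Return [(zone_name, zone_type, temp_celsius_or_None), ...]."""
    zones = []
    for zone in sorted(Path(base).glob("thermal_zone*")):
        try:
            zone_type = (zone / "type").read_text().strip()
        except OSError:
            zone_type = "unknown"
        try:
            # sysfs reports the temperature in millidegrees Celsius
            temp_c = int((zone / "temp").read_text().strip()) / 1000.0
        except (OSError, ValueError):
            temp_c = None  # the zone could not be read
        zones.append((zone.name, zone_type, temp_c))
    return zones


if __name__ == "__main__":
    for name, zone_type, temp_c in read_thermal_zones():
        print(name, zone_type, temp_c)
```

A zone whose temperature cannot be read shows up with None instead of aborting the whole dump.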
Comment 43 Claudio Sacerdoti Coen 2017-04-03 07:56:42 UTC
Srinivas:

I have some problem recompiling 4.11 git on my laptop, I will try harder.
However, I have applied your patch to 4.9 and I recompiled. Results:

1) I see the output of the debugging line, thus I assume the patch works.
 However, the problem is still there, no benefits.

2) I have tried to not compile ISH at all as suggested by Tatsuyuki and I have suspend/resumed multiple times, with and without AC power. The issue happens less often, but eventually the problem is triggered anyway.

Sigh :-(
Comment 44 Tatsuyuki Ishi 2017-04-03 07:59:36 UTC
The patch is for debugging only and just adds logging. You should say which line you saw.

I recommend compiling 4.10, it should contain some more fixes. (not a big difference though)

Seems ISH is somewhat related but not the exact cause of the issue. I have no idea what's happening. There was no problem when I used 4.9-rc8 compiled with 4.8-ARCH config (there was no "official" 4.9 config at the moment).
Comment 45 Srinivas Pandruvada 2017-04-03 15:37:49 UTC
This happens even without ISH on the system (Nicolo's system), and it seems that it still happens even if you blacklist the ish modules. On Claudio Sacerdoti Coen's system it also happens eventually.
I suggest collecting data as described in comment #30. It will show whether this is a fan issue on a cool system or a real high-temperature issue.
Comment 46 Nicolo' 2017-04-03 15:48:35 UTC
I see, I can do that: basically stress the system (say with prime95) and monitor it, both in the case where the problem arises and when it does not. I suspect that the system will be cool, so it should be more of a fan issue (triggered by the sometimes broken sensor), but I will run more stress tests and report as you say, so we can check better.
Comment 47 Srinivas Pandruvada 2017-04-03 16:10:50 UTC
This is what I see in the logs "logs showing wakeup with connected/disconnected power adapter":

In two cases "acpitz-virtual-0" was following "coretemp-isa-0000", which should mostly be the case.
But after wakeup
acpitz-virtual-0
Adapter: Virtual device
temp1:        +48.0°C  (crit = +128.0°C)

coretemp-isa-0000
Adapter: ISA adapter
Package id 0:  +37.0°C  (high = +100.0°C, crit = +100.0°C)
Core 0:        +34.0°C  (high = +100.0°C, crit = +100.0°C)
Core 1:        +35.0°C  (high = +100.0°C, crit = +100.0°C)

Basically, acpitz-virtual-0 is stuck.

The idea of doing a stress test (after the issue happens) is to see whether the reading of the sensor "acpitz-virtual-0" ever changes or always stays at 48C. When you run stress, coretemp will go higher, and "acpitz-virtual-0" should go higher too. If it does not, it is stuck. If it goes higher but doesn't come back down, some other sensor is also in play.


It would be great if someone posted an acpidump of a problem system. On Arch Linux it is part of the iasl package. For other distros I see a package called acpidump.
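
The stuck-sensor reasoning in this comment can be written down as a small check. This is only an illustrative sketch: the function name, the labels, the 1-degree tolerance and the sample readings are all assumptions, and the input is a time-ordered list of (acpitz, package) temperatures taken before, during and after a stress run.

```python
def classify_acpitz(samples, tolerance=1.0):
    """Classify how acpitz-virtual-0 follows the package temperature.

    samples: time-ordered list of (acpitz_c, package_c) readings.
    Returns "inconclusive", "stuck", "rises-only" or "tracking".
    """
    acpitz = [a for a, _ in samples]
    package = [p for _, p in samples]

    # Without a real change in package temperature nothing can be concluded.
    if max(package) - min(package) <= tolerance:
        return "inconclusive"
    # Package temperature moved but the ACPI zone did not: stuck.
    if max(acpitz) - min(acpitz) <= tolerance:
        return "stuck"
    package_cooled = max(package) - package[-1] > tolerance
    acpitz_cooled = max(acpitz) - acpitz[-1] > tolerance
    # Went up with the load but never came back down: another sensor may be in play.
    if package_cooled and not acpitz_cooled:
        return "rises-only"
    return "tracking"


if __name__ == "__main__":
    # Made-up readings: idle, under stress, idle again.
    print(classify_acpitz([(48.0, 37.0), (48.0, 70.0), (48.0, 40.0)]))
```

The readings would come from repeated runs of sensors while the stress workload starts and stops.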
Comment 48 Nicolo' 2017-04-03 17:17:24 UTC
Created attachment 255741 [details]
sensors during stress

I attach snapshots of sensors during a quick stress test, with the problem showing up, first and last entries being before and after stress. I can do more extensive study with turbostat etc, but basically it is stuck as you suggested.
Comment 49 Nicolo' 2017-04-03 17:19:47 UTC
Created attachment 255743 [details]
acpidump

Here is acpidump from my system.
Comment 50 Srinivas Pandruvada 2017-04-03 17:31:35 UTC
The stress test shows that the acpitz-virtual temp stayed stuck at 48C even when the CPU temp was high. The acpitz-virtual temp is read via the EC (per the acpidump output), which is what controls the fan. So the fan control algorithm never sees the temp drop again, and the fan stays on.

Added Rui Zhang for suggestions.


If it was regression on 4.9, then git bisect can help to find if some ACPI changes caused this.
Comment 51 Nicolo' 2017-04-03 17:34:09 UTC
Correct. I think the problem started with 4.9 series.
Comment 52 Claudio Sacerdoti Coen 2017-04-03 19:18:56 UTC
[ To Srinivas:

ISH is no longer the suspect, but since you asked I was able to see
"... require reinit after resume"]

This may be interesting: my acpitz-virtual-0 is also stuck _exactly_ at +48.0C
until the next hibernate. (Restart does not fix the issue: only power down or
hibernate). It does not matter if the real temperature is higher/lower than +48.0 at the time of resume or if it is lowered/increased above/below +48.0 after.
Comment 53 Nicolo' 2017-04-03 19:24:26 UTC
Precisely, that's also what I kind of pointed out in my comment 3 :)
Comment 54 Srinivas Pandruvada 2017-04-03 19:27:07 UTC
Claudio Sacerdoti Coen : Can you do git bisect between 4.8 and 4.9 disabling ISH?
Comment 55 Claudio Sacerdoti Coen 2017-04-03 21:52:40 UTC
Yes, I am doing it. 12 iterations needed, don't hold your breath :-)
Comment 56 Srinivas Pandruvada 2017-04-03 22:10:26 UTC
Added LV.
LV,
Do you think that enabling #debug in drivers/acpi/ec.c will help?
TMP is at offset 0x78 of the EC opregion. It looks like the value read there is now 0x80, which the _TMP method below turns into a temperature of 48 (0x30).


            Method (_TMP, 0, NotSerialized)  // _TMP: Temperature
            {
                If (\H8DR)
                {
                    Local0 = \_SB.PCI0.LPC.EC.TMP0
                    Local1 = \_SB.PCI0.LPC.EC.TSL2
                    Local2 = \_SB.PCI0.LPC.EC.TSL3
                }
                Else
                {
                    Local0 = \RBEC (0x78)
                    Local1 = (\RBEC (0x8A) & 0x7F)
                    Local2 = (\RBEC (0x8B) & 0x7F)
                }

                If (Local0 == 0x80)
                {
                    Local0 = 0x30
                }


..
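
To make the arithmetic explicit, the visible part of the excerpt can be mirrored in a few lines. This sketch models only the quoted 0x80 check; the elided remainder of the method is not reproduced, and the constant and function names are hypothetical.

```python
EC_TMP_SPECIAL = 0x80  # raw value special-cased in the quoted _TMP excerpt
TMP_FALLBACK = 0x30    # value the excerpt substitutes for it (48 decimal)


def tmp_after_fallback(raw):
    """Mirror 'If (Local0 == 0x80) { Local0 = 0x30 }' from the excerpt."""
    return TMP_FALLBACK if raw == EC_TMP_SPECIAL else raw


if __name__ == "__main__":
    print(tmp_after_fallback(0x80))  # 48, matching the stuck 48C reading
    print(tmp_after_fallback(0x1C))  # 28, any other raw value passes through
```

So an EC that keeps answering 0x80 at offset 0x78 would explain a reading that never leaves 48.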
Comment 57 Claudio Sacerdoti Coen 2017-04-04 00:47:50 UTC
I am halfway. I will continue tomorrow. In case the partial information is useful to anyone, here is the bisect log so far:

git bisect start
# bad: [1001354ca34179f3db924eb66672442a173147dc] Linux 4.9-rc1
git bisect bad 1001354ca34179f3db924eb66672442a173147dc
# good: [c8d2bc9bc39ebea8437fd974fdbc21847bb897a3] Linux 4.8
git bisect good c8d2bc9bc39ebea8437fd974fdbc21847bb897a3
# bad: [e6e3d8f8f4f06caf25004c749bb2ba84f18c7d39] Merge tag 'pci-v4.9-changes' of git://git.kernel.org/pub/scm/linux/kernel/git/helgaas/pci
git bisect bad e6e3d8f8f4f06caf25004c749bb2ba84f18c7d39
# bad: [687ee0ad4e897e29f4b41f7a20c866d74c5e0660] Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next
git bisect bad 687ee0ad4e897e29f4b41f7a20c866d74c5e0660
# bad: [e6dce825fba05f447bd22c865e27233182ab3d79] Merge tag 'tty-4.9-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/tty
git bisect bad e6dce825fba05f447bd22c865e27233182ab3d79
# bad: [12b7bcb43e6ea834ab2f5dc52d971e379a0ca109] Merge branch 'perf-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
git bisect bad 12b7bcb43e6ea834ab2f5dc52d971e379a0ca109
# good: [72a9cdd083005900f15934e8568f1ac43a6bb755] Merge tag 'pnp-4.9-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm
git bisect good 72a9cdd083005900f15934e8568f1ac43a6bb755
# good: [a9e57009dacd58052755cf58463ce41a14a01db5] perf record: Fix documentation 'event_sources' -> 'event_source'
git bisect good a9e57009dacd58052755cf58463ce41a14a01db5
Comment 58 Tatsuyuki Ishi 2017-04-04 00:49:14 UTC
It's quite tricky since this bug is much less likely to be triggered when ISH is disabled. Make sure you repeat the test many times when bisecting.
Comment 59 Claudio Sacerdoti Coen 2017-04-04 09:40:17 UTC
Here is the bisect result and log. Unfortunately, it does not point to something easily related.

Note: all the bad states were bad. I did my best to confirm the good states by repeating suspend/resume multiple times.

4b978934a440c1aafce986353001b03289eaa040 is the first bad commit


git bisect start
# bad: [1001354ca34179f3db924eb66672442a173147dc] Linux 4.9-rc1
git bisect bad 1001354ca34179f3db924eb66672442a173147dc
# good: [c8d2bc9bc39ebea8437fd974fdbc21847bb897a3] Linux 4.8
git bisect good c8d2bc9bc39ebea8437fd974fdbc21847bb897a3
# bad: [e6e3d8f8f4f06caf25004c749bb2ba84f18c7d39] Merge tag 'pci-v4.9-changes' of git://git.kernel.org/pub/scm/linux/kernel/git/helgaas/pci
git bisect bad e6e3d8f8f4f06caf25004c749bb2ba84f18c7d39
# bad: [687ee0ad4e897e29f4b41f7a20c866d74c5e0660] Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next
git bisect bad 687ee0ad4e897e29f4b41f7a20c866d74c5e0660
# bad: [e6dce825fba05f447bd22c865e27233182ab3d79] Merge tag 'tty-4.9-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/tty
git bisect bad e6dce825fba05f447bd22c865e27233182ab3d79
# bad: [12b7bcb43e6ea834ab2f5dc52d971e379a0ca109] Merge branch 'perf-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
git bisect bad 12b7bcb43e6ea834ab2f5dc52d971e379a0ca109
# good: [72a9cdd083005900f15934e8568f1ac43a6bb755] Merge tag 'pnp-4.9-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm
git bisect good 72a9cdd083005900f15934e8568f1ac43a6bb755
# good: [a9e57009dacd58052755cf58463ce41a14a01db5] perf record: Fix documentation 'event_sources' -> 'event_source'
git bisect bad de956b8f45b3338cfb66a725e22b4050109daf2a
# good: [2ab78a724b1fd885b65199707b8e053677745457] Merge tag 'efi-next' of git://git.kernel.org/pub/scm/linux/kernel/git/mfleming/efi into efi/core
git bisect good 2ab78a724b1fd885b65199707b8e053677745457
# good: [d74b62bc3241af8ebf5141f5b12e89d9d7f341e1] Merge branches 'doc.2016.08.22c', 'exp.2016.08.22c', 'fixes.2016.09.14a', 'hotplug.2016.08.22c' and 'torture.2016.08.22c' into HEAD
git bisect good d74b62bc3241af8ebf5141f5b12e89d9d7f341e1
# good: [e23f22b5cb9e44da24cb8494707536211adff8d1] dcdbas: Make use of smp_call_on_cpu()
git bisect good e23f22b5cb9e44da24cb8494707536211adff8d1
# good: [8db549491c4a3ce9e1d509b75f78516e497f48ec] smp: Allocate smp_call_on_cpu() workqueue on stack too
git bisect good 8db549491c4a3ce9e1d509b75f78516e497f48ec
# bad: [4b978934a440c1aafce986353001b03289eaa040] Merge branch 'core-rcu-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
git bisect bad 4b978934a440c1aafce986353001b03289eaa040
# good: [2d8fbcd13ea1d0be3a7ea5f20c3a5b44b592e79c] Merge branch 'for-mingo' of git://git.kernel.org/pub/scm/linux/kernel/git/paulmck/linux-rcu into core/rcu
git bisect good 2d8fbcd13ea1d0be3a7ea5f20c3a5b44b592e79c
# first bad commit: [4b978934a440c1aafce986353001b03289eaa040] Merge branch 'core-rcu-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Comment 60 Srinivas Pandruvada 2017-04-04 15:29:48 UTC
The commit is a merge commit for x86-tip tree, so will have many commits in it.
Comment 61 Claudio Sacerdoti Coen 2017-04-04 22:34:13 UTC
Indeed. Can I help in some other way?
Comment 62 Srinivas Pandruvada 2017-04-04 22:42:50 UTC
I am waiting for comments from someone who knows the EC better. The category of the bug has been changed to power/thermal, which gets monitored every week.

If you want to dive deeper, there is a
#debug at the top of the file drivers/acpi/ec.c
If you enable it you will see lots of EC transactions. Temperature is at offset 0x78. You will see a request and then a response.
Comment 63 CircleCode 2017-04-05 15:36:44 UTC
*** Bug 195239 has been marked as a duplicate of this bug. ***
Comment 64 Srinivas Pandruvada 2017-04-05 15:48:06 UTC
Claudio Sacerdoti Coen: I want to make sure that you didn't use ISH in your tests (removed from CONFIG). I don't think it has anything to do with this, because it also happens on a laptop without ISH (Nicolo's system), but it will at least allow us to focus on other changes.
Comment 65 Claudio Sacerdoti Coen 2017-04-06 07:36:11 UTC
I confirm. Either ISH was disabled from configure, or ISH support was not even merged in the kernel yet.
Comment 66 Srinivas Pandruvada 2017-04-06 17:57:14 UTC
Claudio Sacerdoti Coen: Thanks.
It is possible that changes in the drivers/acpi folder are causing this issue. So one thing we can try is to forklift drivers/acpi from the working version of the kernel tree into the non-working version: basically delete the drivers/acpi folder and copy it over from the working version. There may be some compile issues; if they are minor, we can give it a try.
Comment 67 Zhang Rui 2017-04-07 03:03:33 UTC
So the problem is that:
temperature reported by ACPI _TMP method stuck at 48C, resulting in fan spinning all the time
right?

Let's see why 48C is returned by _TMP.
Comment 68 Zhang Rui 2017-04-07 03:26:52 UTC
Created attachment 255777 [details]
customized _TMP method for Nicolo'

Nicolo'
please rebuild your kernel with CONFIG_ACPI_CUSTOM_METHOD=y, and reboot with acpi.aml_debug_output=1

after boot, please override the _TMP method with this new one attached, by
      "cat /tmp/tmp.aml > /sys/kernel/debug/acpi/custom_method"
And then please attach the dmesg output after the problem is reproduced.
Comment 69 Zhang Rui 2017-04-07 03:28:51 UTC
NOTE:
For the others, please don't try the binary attached in comment #68, as it is made based on the acpidump attached by Nicolo, which should work for his platform only.
Comment 70 Nicolo' 2017-04-07 15:54:16 UTC
Zhang:
Correct. So I should take my current config file (say from 4.10.8 Arch), edit the line you suggest, and compile the kernel: right? Sorry for my ignorance, but could you please explain in more detail how do I reboot with acpi.aml_debug_output=1?
Comment 71 Nicolo' 2017-04-08 20:02:56 UTC
Zhang:
when I try to do the last step (as root), I get
cat: write error: Invalid argument
What should I do?
Comment 72 Zhang Rui 2017-04-10 02:15:40 UTC
(In reply to Nicolo' from comment #70)
> Zhang:
> Correct. So I should take my current config file (say from 4.10.8 Arch),
> edit the line you suggest, and compile the kernel: right? Sorry for my
> ignorance, but could you please explain in more detail how do I reboot with
> acpi.aml_debug_output=1?

This is a kernel option; you just need to append "acpi.aml_debug_output=1" to your kernel command line.
Comment 73 Zhang Rui 2017-04-10 02:16:57 UTC
(In reply to Nicolo' from comment #71)
> Zhang:
> when I try to do the last step (as root), I get
> cat: write error: Invalid argument
> What should I do?

when this happens, do you see some new messages in dmesg output?

Let me try on my local machine to see if this feature is broken.
Comment 74 Nicolo' 2017-04-10 02:32:05 UTC
no new messages in dmesg output
Comment 75 Max Deineko 2017-04-10 06:25:29 UTC
Created attachment 255803 [details]
X270 acpidump, 4.10.8 config

Also seem to be experiencing this on an X270 running 4.10.8.
Comment 76 Nicolo' 2017-04-10 14:21:53 UTC
Right, actually the name 'x1 yoga' in the title is a bit misleading, since this bug appears in multiple models, e.g. t460s or x1 carbon.
Comment 77 Zhang Rui 2017-04-11 07:20:18 UTC
yeah. it seems that this "customize DSDT" feature is broken in Linux kernel. I will file another report to track that issue.
Comment 78 Zhang Rui 2017-04-17 06:02:01 UTC
Created attachment 255901 [details]
customized DSDT : dump _TMP method for Nicolo'

Until that feature is fixed, please follow this link
https://01.org/linux-acpi/documentation/overriding-dsdt
to override the DSDT with the one attached, boot with the kernel parameter "acpi.aml_debug_output=1", and then attach the dmesg output captured after running "grep . /sys/class/thermal/thermal*/*", both before and after the problem is reproduced.
Comment 79 Nicolo' 2017-04-17 14:03:24 UTC
Zhang: I guess you have already done part of the job, up until I have to

Put it where the kernel build can include it:
$ cp DSDT.hex $SRC/include/
Add this to the kernel .config:
CONFIG_STANDALONE=n
CONFIG_ACPI_CUSTOM_DSDT=y
CONFIG_ACPI_CUSTOM_DSDT_FILE="DSDT.hex"

and recompile the kernel: is that right?
Comment 80 Nicolo' 2017-04-17 15:58:35 UTC
Created attachment 255903 [details]
grep and dmesg before and after problem

I hope I've done things correctly, please let me know in case I haven't.
Comment 81 Zhang Rui 2017-04-18 00:18:00 UTC
[    0.000000] Command line: initrd=\intel-ucode.img initrd=\initramfs-linux-custom.img root=/dev/nvme0n1p5 rw acpi.aml.debug_output=1

it seems that you're using 'acpi.aml.debug_output=1' instead of 'acpi.aml_debug_output=1'
please retest with the correct kernel option. :)
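
A mistyped parameter like the one in this comment is easy to miss, so it can be worth checking the command line before collecting logs. A minimal sketch in Python, assuming a simple whitespace-separated command line (quoted values containing spaces are not handled); find_param is a hypothetical helper, not an existing tool:

```python
def find_param(cmdline, name):
    """Return the value of `name=value` on a kernel command line, or None."""
    for token in cmdline.split():
        key, sep, value = token.partition("=")
        if sep and key == name:
            return value
    return None


# Command line copied from the dmesg excerpt in this comment (note the typo).
cmdline = (
    "initrd=\\intel-ucode.img initrd=\\initramfs-linux-custom.img "
    "root=/dev/nvme0n1p5 rw acpi.aml.debug_output=1"
)

if __name__ == "__main__":
    # The correctly spelled parameter is absent, so this prints None.
    print(find_param(cmdline, "acpi.aml_debug_output"))
```

On a running system the string could be read from /proc/cmdline instead of being pasted in.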
Comment 82 Zhang Rui 2017-04-18 00:20:31 UTC
BTW, please attach the output of "grep . /sys/class/thermal/thermal_zone*/temp" instead of "grep . /sys/class/thermal/thermal*/*"
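
For anyone who wants to log these values over time rather than run grep by hand, the same files can be read from a script. A rough sketch, assuming the usual sysfs layout in which each thermal_zone*/temp file holds millidegrees Celsius; read_zone_temps is a hypothetical name:

```python
from pathlib import Path


def read_zone_temps(root="/sys/class/thermal"):
    """Return {zone_name: degrees_celsius} for every readable thermal zone."""
    temps = {}
    for temp_file in sorted(Path(root).glob("thermal_zone*/temp")):
        try:
            millideg = int(temp_file.read_text().strip())
        except (OSError, ValueError):
            # Some zones refuse reads or return garbage; skip them.
            continue
        temps[temp_file.parent.name] = millideg / 1000.0
    return temps


if __name__ == "__main__":
    for zone, temp in read_zone_temps().items():
        print(f"{zone}: {temp:.1f} C")
```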
Comment 83 Nicolo' 2017-04-18 01:33:09 UTC
Created attachment 255915 [details]
grep and dmesg before and after problem -- corrected

Too bad, sorry for the silly mistake ;)
is it better now?
Comment 84 Zhang Rui 2017-04-18 02:29:35 UTC
before the bug
[  107.621890] [ACPI Debug]  "_TMP Started"
[  107.621930] [ACPI Debug]  0x0000000000000001
[  107.622728] [ACPI Debug]  "Dump Local0/1/2"
[  107.622732] [ACPI Debug]  0x000000000000001F
[  107.622736] [ACPI Debug]  0x0000000000000000
[  107.622739] [ACPI Debug]  0x0000000000000000
[  107.622748] [ACPI Debug]  "Dump DHKC"
[  107.622759] [ACPI Debug]  0x0000000000000001
[  107.622769] [ACPI Debug]  "_TMP Finished"

after the bug
[  273.928856] [ACPI Debug]  "_TMP Started"
[  273.928912] [ACPI Debug]  0x0000000000000001
[  273.929886] [ACPI Debug]  "Dump Local0/1/2"
[  273.929890] [ACPI Debug]  0x0000000000000080
[  273.929893] [ACPI Debug]  0x0000000000000000
[  273.929896] [ACPI Debug]  0x0000000000000000
[  273.929905] [ACPI Debug]  "Dump DHKC"
[  273.929915] [ACPI Debug]  0x0000000000000001
[  273.929923] [ACPI Debug]  "_TMP Finished"

so the key change is that
    Method (RBEC, 1, NotSerialized)
    {
        Return (SMI (0x00, 0x03, Arg0, 0x00, 0x00))
    }
returns a fixed value 0x80, rather than a meaningful temperature value.

hmmm, I'm not sure if this is the root cause of the problem, because I don't see how a kernel change could affect the SMI call.

Nicolo', can you confirm that acpitz always returns the real temperature, rather than the fixed 48C, in working kernels like 4.8?
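
Comparing the two dumps by eye works for one sample, but a longer dmesg capture may be easier to scan with a script. A sketch only, assuming the log format shown in this comment and that the first value logged after "Dump Local0/1/2" is Local0; local0_after_dump is a hypothetical helper:

```python
import re

DEBUG_LINE = re.compile(r"\[ACPI Debug\]\s+(.*)$")


def local0_after_dump(dmesg_text):
    """Return the integer logged right after each "Dump Local0/1/2" marker."""
    values = []
    expect_local0 = False
    for line in dmesg_text.splitlines():
        match = DEBUG_LINE.search(line)
        if not match:
            continue
        payload = match.group(1).strip()
        if expect_local0:
            values.append(int(payload, 16))  # e.g. 0x000000000000001F
            expect_local0 = False
        elif payload == '"Dump Local0/1/2"':
            expect_local0 = True
    return values


# Lines copied from the "before" and "after" excerpts in this comment.
sample = """\
[  107.622728] [ACPI Debug]  "Dump Local0/1/2"
[  107.622732] [ACPI Debug]  0x000000000000001F
[  107.622736] [ACPI Debug]  0x0000000000000000
[  273.929886] [ACPI Debug]  "Dump Local0/1/2"
[  273.929890] [ACPI Debug]  0x0000000000000080
[  273.929893] [ACPI Debug]  0x0000000000000000
"""

if __name__ == "__main__":
    print([hex(v) for v in local0_after_dump(sample)])
```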
Comment 85 Zhang Rui 2017-04-18 02:30:51 UTC
(In reply to Zhang Rui from comment #84)
> before the bug
> [  107.621890] [ACPI Debug]  "_TMP Started"
> [  107.621930] [ACPI Debug]  0x0000000000000001
> [  107.622728] [ACPI Debug]  "Dump Local0/1/2"
> [  107.622732] [ACPI Debug]  0x000000000000001F
> [  107.622736] [ACPI Debug]  0x0000000000000000
> [  107.622739] [ACPI Debug]  0x0000000000000000
> [  107.622748] [ACPI Debug]  "Dump DHKC"
> [  107.622759] [ACPI Debug]  0x0000000000000001
> [  107.622769] [ACPI Debug]  "_TMP Finished"
> 
> after the bug
> [  273.928856] [ACPI Debug]  "_TMP Started"
> [  273.928912] [ACPI Debug]  0x0000000000000001
> [  273.929886] [ACPI Debug]  "Dump Local0/1/2"
> [  273.929890] [ACPI Debug]  0x0000000000000080
> [  273.929893] [ACPI Debug]  0x0000000000000000
> [  273.929896] [ACPI Debug]  0x0000000000000000
> [  273.929905] [ACPI Debug]  "Dump DHKC"
> [  273.929915] [ACPI Debug]  0x0000000000000001
> [  273.929923] [ACPI Debug]  "_TMP Finished"
> 
> so the key change is that
>     Method (RBEC, 1, NotSerialized)
>     {
>         Return (SMI (0x00, 0x03, Arg0, 0x00, 0x00))
>     }
> returns a fixed value 0x80, rather than a meaningful temperature value.
> 
> hmmm, I'm not sure if this is the root cause of the problem, because I
> don't see how a kernel change could affect the SMI call.

Plus, I don't see how this temperature change impacts the fan because there is no ACPI fan binding to this thermal zone.
Comment 86 Nicolo' 2017-04-18 03:03:53 UTC
That is correct: acpitz returns what I believe is the correct T value, also in current 4.10 kernel, when it is not stuck at 48C; when it is stuck on 48C, then the fan is always triggered; this happens I would say 1/3 of the times, but (usually) another suspend solves the issue.
Comment 87 Nicolo' 2017-04-18 13:33:52 UTC
By the way, just out of curiosity: is there a simple algorithm that determines fan speed out of temperatures?
Comment 88 Claudio Sacerdoti Coen 2017-04-19 19:44:12 UTC
@Zhang

on my laptop the real temperature is always reported in place of 0x80 for the kernels that show no bug. You can have a look at my kernel bisect above that points to the merge of a new IPC mechanism (or something like that). As far as I can tell, that merge has nothing to do with ACPI code. Does that make any sense to you?
Comment 89 Jens Axboe 2017-04-20 01:04:59 UTC
I have this issue as well, on a 20FB Thinkpad (X1 Carbon gen4). I know for a fact that BIOS 1.19 NEVER showed the issue, but any newer version does. I just tried 1.24 today, and the problem is still there. Unfortunately I can't downgrade from 1.24 to earlier versions, so now I'm stuck with the issue. Previously, after upgrading from 1.19 and noticing the issue, I would just downgrade and the problem would go away.

I always run custom kernels, ISH is not included.

Let me know if you guys need more info. I'm currently running 4.11-rc7 and the problem is still there.
Comment 90 Jens Axboe 2017-04-20 01:05:54 UTC
What changed between firmware 1.19 and newer? That would likely provide a good clue as to what caused this very annoying change in behavior.
Comment 91 Nicolo' 2017-04-20 01:20:13 UTC
Jens: I'm surprised you think it's a BIOS issue; could you try kernels of the 4.8 series with current BIOS and check whether the problem is there?
Comment 92 Jens Axboe 2017-04-20 01:41:32 UTC
Nicolo: I'm saying that with BIOS 1.19 the problem NEVER happens, and any version newer than that, it happens on what appears to be every resume. Those are the indisputable facts on my laptop. I can run the very latest kernels on 1.19 and I don't have the fan issue.

So whatever the real problem is, it only happens on BIOS > 1.19 here. That doesn't mean it's a BIOS issue, it could be a change that triggers a problem in later kernels.

And I'm now peeved I can't downgrade the BIOS anymore, hence the issue is high priority problem for me now.
Comment 93 Jens Axboe 2017-04-20 01:49:46 UTC
I'll compile a 4.8 and 4.9 kernel on the laptop and see if 4.8 works and 4.9 does not, running with BIOS 1.24.
Comment 94 Jens Axboe 2017-04-20 03:00:38 UTC
I tested v4.8 and it works fine (no crazy fan after resume), and v4.9-rc1 which is broken (full speed fan after resume).

So 1.19 and any kernel is fine, or >1.19 and v4.9-rc1 and later is broken.
Comment 95 Markus T.H. 2017-04-20 09:09:42 UTC
I've got the same problem on my X1 2017 (Gen5).

I'm on Arch Linux and after resuming the fan runs at max speed pretty often. Resuming again most of the times fixes the problem.

One thing that isn't mentioned here already is the following:
If the power plug isn't plugged, then I've NEVER had this bug. Don't know if that is of any help.
Comment 96 Nicolo' 2017-04-20 12:11:51 UTC
Jens: so it is most likely a kernel issue, not a bios one; also, how can you say 1.19 works with any kernel if you're not able to test it now?

Markus: to me, it happens both plugged and unplugged.
Comment 97 Jens Axboe 2017-04-20 12:36:28 UTC
Nicolo, I already explained why it could be both a firmware issue and a kernel bug further up. If you read what I wrote, you'd also see that I was running 2.10 until yesterday and tried newer versions and rolled back when I saw they had this bug. Hence I know any kernel up to 4.11-rc7 work with 2.19 just fine.
Comment 98 Jens Axboe 2017-04-20 12:38:13 UTC
Thanks phone. The 2.10 and 2.19 above should be 1.19, of course.
Comment 99 Nicolo' 2017-04-20 13:16:00 UTC
I read you, but could you (or someone else) repeat a test now with 1.19 and newer kernels? if not, as you seem to say, I don't care what you remember etc; if yes, then I agree it may be BIOS related, which could be useful info.
Comment 100 Jens Axboe 2017-04-20 14:02:57 UTC
No, you are clearly NOT reading me, because if you did, you would not ask questions I already answered. As I wrote in comment 89 and 92, I tested 1.19 YESTERDAY with 4.11-rc7. I don't need to test again, and in fact I cannot, since I can't roll back from my 1.24 since the tool no longer allows downgrades of BIOS. That last bit of info was also in the above comments.

This is why I'm 100% positive that 1.19 works - because whenever they released a new BIOS, I'd update, see the bug was there, and then downgrade again. Since I develop Linux, I run the latest kernels on my laptop all the time. Hence I know that I've never had the issue with newer kernels and 1.19.

Read and understand what is being written and stop wasting peoples time.
Comment 101 Nicolo' 2017-04-20 14:31:48 UTC
Dear Jens, I understood from the beginning that you were able to test with 1.19 until a few days ago with all kernels, and you say that you had no problems, no need to repeat it, but you're not able to reproduce it now, likely as anyone else who upgraded to latest bios. I agree that you may have a point in correlating with bios, but I'd be happier if someone can repeat the test with 1.19 and confirm what you say; also in debugging it could be useful to work with 1.19
Comment 102 Jens Axboe 2017-04-20 14:44:36 UTC
It's a fact that there's 100% correlation, as I went back and forth between 1.19 and newer BIOS' while I could. It always triggered on newer BIOS, it never triggered on 1.19. I don't need to reproduce it now, as I reproduced it _yesterday_.

The folks currently on 1.19 will not be finding this bug report, as they don't run into the issue. They will find it when/if they do upgrade, and at that point it'll be too late, as they can no longer downgrade to 1.19.

I'm going to try and bisect this issue, as we have the luxury of knowing it works on 4.8 and doesn't on 4.9-rc1. I'll be happy to try suggestions from the people actually working on fixing this issue.
Comment 103 Nicolo' 2017-04-20 14:49:34 UTC
If you look above Claudio Sacerdoti has already done that. Does Lenovo supporting documentation for 1.19 and later give any hint? have you tried to contact them as well?
Comment 104 Jens Axboe 2017-04-20 14:53:51 UTC
I did see that someone ran a bisect, but it was basically useless as it didn't yield any real information. I also see that it was run twice, with different results. So I don't have a high level of confidence in it.

I have not talked to Lenovo.
Comment 105 Jens Axboe 2017-04-20 15:30:16 UTC
Looks like it's the embedded controller (ECP) update. Since I cannot downgrade the actual BIOS anymore, I flashed BIOS 1.24 with ECP 1.13 (that's the one that BIOS 1.19 ships with) on the laptop. With that, the fan issue doesn't seem to be there (output from dmidecode):

Handle 0x000C, DMI type 0, 24 bytes
BIOS Information
	Vendor: LENOVO
	Version: N1FET50W (1.24 )
	Release Date: 03/08/2017
	Address: 0xE0000
	Runtime Size: 128 kB
	ROM Size: 16384 kB
	Characteristics:
		PCI is supported
		PNP is supported
		BIOS is upgradeable
		BIOS shadowing is allowed
		Boot from CD is supported
		Selectable boot is supported
		EDD is supported
		3.5"/720 kB floppy services are supported (int 13h)
		Print screen service is supported (int 5h)
		8042 keyboard services are supported (int 9h)
		Serial services are supported (int 14h)
		Printer services are supported (int 17h)
		CGA/mono video services are supported (int 10h)
		ACPI is supported
		USB legacy is supported
		BIOS boot specification is supported
		Targeted content distribution is supported
		UEFI is supported
	BIOS Revision: 1.24
	Firmware Revision: 1.13

Note how "Firmware Revision" is 1.13, it should be 1.16 with this BIOS.

So likely the BIOS is fine, but the ECP update from 1.13 to 1.14 broke something or triggered a bug in the kernel.
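
The two revision fields compared in this comment can also be pulled out of dmidecode output with a script. A sketch, assuming the text layout shown in this comment; parse_revisions is a hypothetical helper. Revisions are kept as integer tuples so that 1.9 sorts before 1.10, which is an assumption about how the vendor numbers releases:

```python
import re


def parse_revisions(dmidecode_text):
    """Extract BIOS and EC firmware revisions as (major, minor) tuples."""
    revisions = {}
    for key in ("BIOS Revision", "Firmware Revision"):
        match = re.search(
            rf"^\s*{key}:\s*(\d+)\.(\d+)\s*$", dmidecode_text, re.MULTILINE
        )
        if match:
            revisions[key] = (int(match.group(1)), int(match.group(2)))
    return revisions


# Tail of the dmidecode output quoted in this comment.
sample = """\
\t\tUEFI is supported
\tBIOS Revision: 1.24
\tFirmware Revision: 1.13
"""

if __name__ == "__main__":
    print(parse_revisions(sample))
```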
Comment 106 Claudio Sacerdoti Coen 2017-04-20 17:01:32 UTC
To follow-up on Jens's last comment: I have BIOS revision 1.23 and firmware revision 1.15 (> 1.13) and I see the bug (with new kernels)
Comment 107 Jens Axboe 2017-04-20 17:02:44 UTC
Claudio, you can try and get firmware 1.24 and 1.19, then copy the smaller firmware file from 1.19 into the 1.24 folder and flash it. It'll warn that the ECP firmware is older, but just say yes. Would be interesting to see if it fixes it for you, too.
Comment 108 Nicolo' 2017-04-20 19:48:05 UTC
On my thinkpad t460s, dmidecode gives firmware 1.10, but I have the problem: is it possible that they use different version numbers for different models?

BIOS Information
	Vendor: LENOVO
	Version: N1CET54W (1.22 )
	Release Date: 02/10/2017
	Address: 0xE0000
	Runtime Size: 128 kB
	ROM Size: 16384 kB
	Characteristics:
		PCI is supported
		PNP is supported
		BIOS is upgradeable
		BIOS shadowing is allowed
		Boot from CD is supported
		Selectable boot is supported
		EDD is supported
		3.5"/720 kB floppy services are supported (int 13h)
		Print screen service is supported (int 5h)
		8042 keyboard services are supported (int 9h)
		Serial services are supported (int 14h)
		Printer services are supported (int 17h)
		CGA/mono video services are supported (int 10h)
		ACPI is supported
		USB legacy is supported
		BIOS boot specification is supported
		Targeted content distribution is supported
		UEFI is supported
	BIOS Revision: 1.22
	Firmware Revision: 1.10
Comment 109 Nicolo' 2017-04-20 21:05:26 UTC
Jens: can I ask how you distinguished the BIOS update from the ECP update? In my case they only provide one iso, which I can extract to an img, but then in the flash folder, besides the readme, there are some pat files, one efi file, and another folder with two files with fl1 and fl2 extensions; these last two seem to be the only ones that differ between versions. Was it the same for you? Which file did you substitute?
Comment 110 Jens Axboe 2017-04-20 21:06:24 UTC
The fl2 file is the smaller one, that is the ECP firmware. The fl1 file is the BIOS.
Comment 111 Nicolo' 2017-04-20 22:05:10 UTC
Jens: you were right, now my dmidecode reads

BIOS Information
	Vendor: LENOVO
	Version: N1CET54W (1.22 )
	Release Date: 02/10/2017
	Address: 0xE0000
	Runtime Size: 128 kB
	ROM Size: 16384 kB
	Characteristics:
		PCI is supported
		PNP is supported
		BIOS is upgradeable
		BIOS shadowing is allowed
		Boot from CD is supported
		Selectable boot is supported
		EDD is supported
		3.5"/720 kB floppy services are supported (int 13h)
		Print screen service is supported (int 5h)
		8042 keyboard services are supported (int 9h)
		Serial services are supported (int 14h)
		Printer services are supported (int 17h)
		CGA/mono video services are supported (int 10h)
		ACPI is supported
		USB legacy is supported
		BIOS boot specification is supported
		Targeted content distribution is supported
		UEFI is supported
	BIOS Revision: 1.22
	Firmware Revision: 1.9

Probably the numbering differs between models, but the upshot is that either older ECP firmware with any Linux kernel, or older kernels with any ECP firmware, works (at least I did several suspends and did not have the issue), while the combination of newer firmware and newer kernel triggers some bug, as you correctly observed.

Do you think there may be any compatibility issue using a newer UEFI with an older ECP, at least temporarily?
Comment 112 Zhang Rui 2017-04-21 00:25:57 UTC
Probably.
As kernel git bisect does not give us any clue, it would be good to know the differences between different EC firmware versions.

BTW, I also want to make sure the fixed 48C reported is related with the problem or not. Say, with older EC firmware or older kernel, does the thermal zone report 48C or not?
Comment 113 Nicolo' 2017-04-21 00:35:47 UTC
Zhang: with older EC, for example now I get
acpitz-virtual-0
Adapter: Virtual device
temp1:        +27.0°C  (crit = +128.0°C)

which I believe is correct. As for differences between different ECs, I could try to contact Lenovo, unless you know a better way.
Comment 114 Zhang Rui 2017-04-21 01:09:14 UTC
(In reply to Nicolo' from comment #113)
> Zhang: with older EC, for example now I get
> acpitz-virtual-0
> Adapter: Virtual device
> temp1:        +27.0°C  (crit = +128.0°C)
> 
The problem is that we also get correct values with the later kernel/ECP.
We have confirmed that the temperature is 48C when the problem happens.
And now, it would be good to confirm whether 48C is never shown with later kernel/ECP. To do this, we had better monitor the temperature over a longer period, say 1 day, to see if the fixed 48C is ever reported. (A temperature that rises/drops smoothly to 48C from 45C/50C is a reasonable 48C; a temperature that jumps from 30C to 48C is the fixed 48C.)

> which I believe is correct. As for differences between different ECs, I
> could try to contact Lenovo, unless you know a better way.

yes, please contact Lenovo to understand the details of the difference, which may give us a clue about the problem.


Lv, I also saw a couple of EC driver changes between 4.8 and 4.9, can you please help verify if it could be related?
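
The distinction drawn in this comment between a reasonable 48C and the fixed 48C can be checked mechanically on a logged series of readings. A minimal sketch, not a definitive detector: the 5-degree step threshold is an arbitrary assumption, and find_fixed_48 is a hypothetical name:

```python
def find_fixed_48(samples, stuck_value=48.0, max_step=5.0):
    """Return indices where the reading lands on stuck_value after a jump
    larger than max_step from the previous sample."""
    suspicious = []
    for i in range(1, len(samples)):
        previous, current = samples[i - 1], samples[i]
        if current == stuck_value and abs(current - previous) > max_step:
            suspicious.append(i)
    return suspicious


if __name__ == "__main__":
    smooth = [45, 47, 48, 50, 48]   # drifts through 48C: plausible
    jumped = [30, 48, 48, 48]       # 30C straight to 48C: suspicious
    print(find_fixed_48(smooth), find_fixed_48(jumped))
```

Only the first sample of a stuck run is flagged; the following 48C readings differ from their predecessor by zero.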
Comment 115 Zhang Rui 2017-04-21 01:16:25 UTC
df45db6177f8dde380d44149cca46ad800a00575
750f628be68e8b8e1624d8abd003b9f1fc758ed6
e923e8e79e18fd6be9162f1be6b99a002e9df2cb
c2b46d679b30c5c0d7eb47a21085943242bdd8dc
39a2a2aa3e9e5538984e9130c92a6c889ad86435
d30283057ecdf8c543ae757ae34db3d7fd2d7732
72c77b7ea9ce781f4987840984a462e4456ba98e
46922d2a3aff5122253d97e64500801c08f4f2c0
2a5708409e4e05446eb1a89ecb48641d6fd5d5a9
97cb159fd91d00f8d7d1adeb075503dc0d946bff

These are the EC driver commits that went into the 4.9 kernel.
Please cherry-pick them on top of the v4.8 kernel and see whether the problem can be reproduced.
Comment 116 Len Brown 2017-04-25 00:17:22 UTC
or...
Perhaps it would be simpler to try running the old ec.c
and if that works, then we know that changes to the ec.c
driver either broke this, or have nothing to do with this.
Comment 117 Lv Zheng 2017-04-25 00:59:03 UTC
Probably should just try to revert this commit

https://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm.git/commit/?h=linux-next&id=d30283057ecdf8c543ae757ae34db3d7fd2d7732

and see what happens.
Comment 118 Nicolo' 2017-04-25 02:01:46 UTC
I confirm that previous ECP firmware works fine with current kernel: the sensor seems to work just fine, going smoothly from below to above 48C when some load is present.

I will compile a kernel that reverts the commit and test with newer ECP to see whether the issue is present.
Comment 119 Lv Zheng 2017-04-25 04:31:17 UTC
To Srinivas:

> LV, Do you think that enabling #debug in drivers/ec.c will help?
> I am waiting for comments from the person who knows EC more.

I should say sorry as I've just noticed it this morning.

Let me take a look at what's going on.
The _TMP implementation looks interesting:
            Method (_TMP, 0, NotSerialized)  // _TMP: Temperature
            {
                If (\H8DR)
                ^^^^^^^^^^
                {
                    Store (\_SB.PCI0.LPC.EC.TMP0, Local0)
                    Store (\_SB.PCI0.LPC.EC.TSL2, Local1)
                    Store (\_SB.PCI0.LPC.EC.TSL3, Local2)
                }
                Else
                {
                    Store (\RBEC (0x78), Local0)
                    Store (And (\RBEC (0x8A), 0x7F), Local1)
                    Store (And (\RBEC (0x8B), 0x7F), Local2)
                }

                If (LEqual (Local0, 0x80))
                {
                    Store (0x30, Local0)
                }

                ...
            }
        }

I'm not sure if \H8DR matters. It's only set in the \_SB._INI:
    Scope (\_SB)
    {
        Method (_INI, 0, NotSerialized)  // _INI: Initialize
        {
            ...
            If (LGreaterEqual (\_REV, 0x02))
            {
                Store (0x01, \H8DR)
            }
            ...
        }
    }
So have you guys looked into the _REV related stuffs?

It goes 2 code paths, if _REV >= 2, it runs EC opregion, else it invokes SMI functionality implemented via RBEC.
By default, _REV returns 2, while you can change it to 5 via "acpi_rev_override" (it might be better to have a different boot parameter to change it to 1).

If RBEC is executed, it implemented in this way:
    Method (RBEC, 1, NotSerialized)
    {
        Return (SMI (0x00, 0x03, Arg0, 0x00, 0x00))
    }
Then this is not related to the EC driver.
The SMI method is:
    Mutex (MSMI, 0x00)
    Method (SMI, 5, Serialized)
    {
        Acquire (MSMI, 0xFFFF)
        Store (Arg0, CMD)
        Store (0x01, ERR)
        Store (Arg1, PAR0)
        Store (Arg2, PAR1)
        Store (Arg3, PAR2)
        Store (Arg4, PAR3)
        Store (0xF5, APMC)
        While (LEqual (ERR, 0x01))
        {
            Sleep (0x01)
            Store (0xF5, APMC)
        }
        Store (PAR0, Local0)
        Release (MSMI)
        Return (Local0)
    }
It accesses the following opregions which are all not handled by the EC driver:
    OperationRegion (SMI0, SystemIO, 0xB2, 0x01)
    Field (SMI0, ByteAcc, NoLock, Preserve)
    {
        APMC,   8
    }
    OperationRegion (MNVS, SystemMemory, 0xAFFBC018, 0x1000)
    Field (MNVS, AnyAcc, NoLock, Preserve)
    {
        Offset (0xFC0), 
        CMD,    8, 
        ERR,    32, 
        PAR0,   32, 
        PAR1,   32, 
        PAR2,   32, 
        PAR3,   32
    }

I guess:
1. The old firmware doesn't contain the SMI invocations. Could someone upload the acpidump from the old working ECP?
2. If reverting the above-mentioned commit fixes this, it might mean that the SMI method gets stuck if the EC driver stops handling events too early.

Thanks and best regards
Lv
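
The visible part of the _TMP method quoted in this comment explains where the fixed 48C comes from: a raw reading of 0x80 is replaced by 0x30, which is 48 in decimal. A small Python model of just that branch, assuming Local0 is in degrees Celsius (the rest of the method is elided in the thread, so any further conversion is unknown); tmp_local0 is a hypothetical name:

```python
STUCK_RAW = 0x80      # value treated as invalid by the quoted _TMP code
FALLBACK_RAW = 0x30   # value substituted for it (48 decimal)


def tmp_local0(raw_tmp0):
    """Mirror `If (LEqual (Local0, 0x80)) { Store (0x30, Local0) }`."""
    return FALLBACK_RAW if raw_tmp0 == STUCK_RAW else raw_tmp0


if __name__ == "__main__":
    print(tmp_local0(0x1F))  # reading from the "before the bug" dump
    print(tmp_local0(0x80))  # reading from the "after the bug" dump
```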
Comment 120 Lv Zheng 2017-04-25 08:43:58 UTC
To Rui:

The following are not related; they concern the boot EC:
72c77b7ea9ce781f4987840984a462e4456ba98e
46922d2a3aff5122253d97e64500801c08f4f2c0
2a5708409e4e05446eb1a89ecb48641d6fd5d5a9
97cb159fd91d00f8d7d1adeb075503dc0d946bff

These are also not related; they are simple fixes and cleanups:
df45db6177f8dde380d44149cca46ad800a00575
750f628be68e8b8e1624d8abd003b9f1fc758ed6
e923e8e79e18fd6be9162f1be6b99a002e9df2cb

For the 1st step, we might only focus on the following 3 commits:
c2b46d679b30c5c0d7eb47a21085943242bdd8dc <- this is not likely a root cause, it only changes post-resume behavior.
39a2a2aa3e9e5538984e9130c92a6c889ad86435 <- this is a no-op and d30283057e enables the feature introduced in this commit.
d30283057ecdf8c543ae757ae34db3d7fd2d7732

Thanks
Lv
Comment 121 Nicolo' 2017-04-25 12:13:41 UTC
Created attachment 256003 [details]
acpidump from old firmware

Here's my acpi dump from old working firmware with new kernel; I will try to compile and test a kernel with the commit removed later.
Comment 122 Alexander T. 2017-04-25 13:52:46 UTC
The same situation (https://bugzilla.kernel.org/show_bug.cgi?id=191181#c0) on a Lenovo Edge E540 (ThinkPad) Type 20C6, on kernels 4.8 and higher. BIOS 2.24.
Comment 123 Lv Zheng 2017-04-26 02:59:01 UTC
Hi, Nicolo'

A. Could you re-do the test of comment 78 with non-buggy kernels but buggy firmware and upload the new result here?

B. You can also do the following test:

Boot the kernel with:
 dyndbg=\"file ec.c +p\"

Under kernel source code tree:
# cd tools
# make acpi

You can find "ec" tool under tools/power/acpi.

It looks like H8DR is 1 according to comment 84,
so TMP0 will be read; the problem occurs when it returns 0x80.
TMP0 is here:
                    OperationRegion (ECOR, EmbeddedControl, 0x00, 0x0100)
                    Field (ECOR, ByteAcc, NoLock, Preserve)
                    {
                        ...
                        Offset (0x78), 
                        TMP0,   8, 
                        ...
                    }
It's at offset 0x78, you can trigger EC transaction to obtain its value from userspace by "sudo ec -b 0x78".

So can you execute this command for the following combinations and post the result here:

1. buggy firmware, non-buggy kernel, before suspend
2. buggy firmware, non-buggy kernel, after resume
3. buggy firmware, buggy kernel, before suspend
4. buggy firmware, buggy kernel, after resume
5. non-buggy firmware, non-buggy kernel, before suspend
6. non-buggy firmware, non-buggy kernel, after resume
7. non-buggy firmware, buggy kernel, before suspend
8. non-buggy firmware, buggy kernel, after resume

And upload "dmesg" output for test 2, 3, 4 here. To obtain the dmesg output, please run the following commands:
# dmesg -c
# ec -b 0x78
# dmesg -c > dmesg-2/3/4.log
Thanks in advance.

Best regards
Lv
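
To keep track of which of the eight combinations listed in this comment have been tested, the matrix can be generated rather than typed. A sketch in Python; build_test_matrix is a hypothetical helper and the label wording simply follows the list in this comment:

```python
from itertools import product


def build_test_matrix():
    """Return the eight firmware/kernel/phase labels in the order of tests 1-8."""
    firmware = ["buggy", "non-buggy"]
    kernel = ["non-buggy", "buggy"]
    phase = ["before suspend", "after resume"]
    return [
        f"{fw} firmware, {k} kernel, {p}"
        for fw, k, p in product(firmware, kernel, phase)
    ]


if __name__ == "__main__":
    for number, label in enumerate(build_test_matrix(), start=1):
        print(f"{number}. {label}")
```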
Comment 124 Lv Zheng 2017-04-26 03:04:30 UTC
Hi, Nicolo'

Here is another test, let me call it as test C:

1. Boot "buggy firmware, buggy kernel" with:
    dyndbg=\"file ec.c +p\"
2. After boot, run the following commands and post the logs here:
   # dmesg -c > boot.log
   # echo mem > /sys/power/state
   Resume the system by pressing the right buttons
   # dmesg -c > s2ram.log
   # ec -b 0x78
   # dmesg -c > tmp0.log

Thanks in advance
Lv
Comment 125 Hermann Bier 2017-05-03 13:49:52 UTC
Hi all,

did anybody actually check the CPU frequency scaling? I don't have the problem of the fans running at full speed only after resume; my problem is that the fan is running all the time, directly after boot. In my eyes, 4.11 has a problem with CPU frequency scaling. Today I installed 4.11-rc8 and immediately noticed that the frequencies of my i7-6600U stay in the > 3.0 GHz range. The frequency just doesn't go down. It's a T460s with the p-state driver and the performance governor, running tlp. Before, I was using 4.10.0 and the CPU frequencies stayed at the minimum of 400 MHz when idle. Now, as a consequence of the high frequencies, the fan is running almost all the time. I already had this issue with my X230 when the p-state driver first came out. I had to wait for a newer kernel and the problem was gone; the frequencies went down to the minimum when idle. Here is my BIOS info, if that helps:

BIOS Information
	Vendor: LENOVO
	Version: N1CET47W (1.15 )
	Release Date: 08/08/2016
	Address: 0xE0000
	Runtime Size: 128 kB
	ROM Size: 16384 kB
	Characteristics:
		PCI is supported
		PNP is supported
		BIOS is upgradeable
		BIOS shadowing is allowed
		Boot from CD is supported
		Selectable boot is supported
		EDD is supported
		3.5"/720 kB floppy services are supported (int 13h)
		Print screen service is supported (int 5h)
		8042 keyboard services are supported (int 9h)
		Serial services are supported (int 14h)
		Printer services are supported (int 17h)
		CGA/mono video services are supported (int 10h)
		ACPI is supported
		USB legacy is supported
		BIOS boot specification is supported
		Targeted content distribution is supported
		UEFI is supported
	BIOS Revision: 1.15
	Firmware Revision: 1.9

Again, my problem is slightly different, but i don't see any info regarding the cpu frequency. Perhaps you can check your frequencies after the machine resumes. If my issue is not related i'll open a new ticket.

Best regards,
Hermann
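
For anyone who wants to check the frequencies after resume as suggested in this comment, the per-CPU values can be read from sysfs. A rough sketch, assuming the cpufreq sysfs layout in which scaling_cur_freq is reported in kHz; read_cpu_freqs_mhz is a hypothetical name:

```python
from pathlib import Path


def read_cpu_freqs_mhz(root="/sys/devices/system/cpu"):
    """Return {cpu_name: current_frequency_in_MHz} for each CPU with cpufreq."""
    freqs = {}
    pattern = "cpu[0-9]*/cpufreq/scaling_cur_freq"
    for freq_file in sorted(Path(root).glob(pattern)):
        try:
            khz = int(freq_file.read_text().strip())
        except (OSError, ValueError):
            continue  # unreadable or malformed entry
        freqs[freq_file.parent.parent.name] = khz / 1000.0
    return freqs


if __name__ == "__main__":
    for cpu, mhz in read_cpu_freqs_mhz().items():
        print(f"{cpu}: {mhz:.0f} MHz")
```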
Comment 126 Nicolo' 2017-05-03 15:21:40 UTC
Hermann: you can take a look at my powertop with the problem in the attachments, and I think my CPU could go to lower states even with the fan issue.

Lv: I will try to do the tests you suggested soon, and report back.
Comment 127 Tatsuyuki Ishi 2017-05-04 00:12:08 UTC
Hermann: do not use the performance governor. It tries to keep high frequency, and should be used for environments that latency matters (e.g. audio).

I'm noticing that my computer is more likely to overheat than before, no idea what's the cause. i7 is too beefy.
Comment 128 Neil Kownacki 2017-05-04 02:27:19 UTC
Created attachment 256183 [details]
x270 acpi dump without issue
Comment 129 Neil Kownacki 2017-05-04 02:43:02 UTC
For what it's worth, I have an x270 and I am not having this particular issue. I attached an ACPI dump from my machine in case there is some clue there.  I've been switching back and forth between the 4.8 kernel and the 4.11 releases in Linux Mint since I got the laptop and haven't noticed any issues with the fan. 


I am having another weird problem though.  When I close the lid while running on battery, it generates an unhandled HKEY event 0x6032 and CPU temp/throttling warnings in dmesg every so often.  When I run the laptop with the lid closed like this, the CPU appears to be limited to 2.0GHz and stays around 60C even when I'm doing something intensive like encoding video.  It seems to think the CPU is hotter than it really is?  It also seems like some kind of issue reading the EC to me, so I thought maybe it was related.  It's triggered during very light use.  Symptoms are described exactly in this Fedora bug:


https://bugzilla.redhat.com/show_bug.cgi?id=924570


I would be curious to know what BIOS / Firmware versions Max Deineko (other x270 owner) is running and if he also has this problem.
Comment 130 Neil Kownacki 2017-05-04 02:46:44 UTC
(I am running BIOS 1.11 / Firmware revision 1.11)
Comment 131 Hermann Bier 2017-05-04 18:21:41 UTC
@Tatsuyuki Ishi thank you very much for the hint. All this time I had the wrong idea that the performance and powersave governors act like cpufreq's ondemand, differing only in the speed of scaling :( And the weird thing is that kernels I've used before with the performance governor, like 4.9, kept the frequencies at the minimum, around ~400 MHz. Very strange! But you are completely right:

https://www.kernel.org/doc/Documentation/cpu-freq/intel-pstate.txt:
For example the "performance" policy is
similar to cpufreq’s "performance" governor, but "powersave" is completely
different than the cpufreq "powersave" governor. The strategy here is similar
to cpufreq "ondemand", where the requested P-State is related to the system load.

Thanks again for the hint. Now i'm with powersave. So much wasted energy... Hahah:)
Comment 132 Hermann Bier 2017-05-04 21:13:54 UTC
(In reply to Tatsuyuki Ishi from comment #127)
> Hermann: do not use the performance governor. It tries to keep high
> frequency, and should be used for environments that latency matters (e.g.
> audio).
> 
> I'm noticing that my computer is more likely to overheat than before, no
> idea what's the cause. i7 is too beefy.

Comment 133 Marcoen Hirschberg 2017-05-06 13:52:02 UTC
Just a "me too". I have a T470 with BIOS Revision: 1.29, Firmware Revision: 1.12, running Arch Linux with the standard 4.10.13 kernel.
Comment 134 Damjan Georgievski 2017-05-13 14:58:32 UTC
same thing on ThinkPad X1 Carbon 5th gen
Manufacturer: LENOVO
Product Name: 20HQS0LV00
BIOS Revision: 1.18
Firmware Revision: 1.12

I've tried kernels 4.10.13-ARCH, self-compiled 4.11 with the Arch config, and a self-compiled torvalds/master kernel (4.12-rc0) with the same config.
Comment 135 a.piesk 2017-06-03 09:50:59 UTC
(In reply to Nicolo' from comment #118)
> I confirm that previous ECP firmware works fine with current kernel: the
> sensor seems to work just fine, going smoothly from below to above 48C when
> some load is present.

I can confirm that too:

System Information
	Manufacturer: LENOVO
	Product Name: 20F90045GE
	Version: ThinkPad T460s

BIOS Information
	Vendor: LENOVO
	Version: N1CET56W (1.24 )
	Release Date: 04/19/2017
	BIOS Revision: 1.24
	Firmware Revision: 1.9

# uname -r
4.11.3-200.fc25.x86_64

After flashing BIOS/ECP == 1.24(N1CET56W)/1.09(N1CHT27W) following Jens' instructions (thanks a ton!) the nasty fan issue is gone.
ECP 1.10(N1CHT28W) and 1.11(N1CHT29W) do not work with recent kernels.
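
Since the BIOS/ECP combination keeps coming up, a small helper may make reports easier to compare. This is only a sketch: it assumes dmidecode output in the format pasted above, and the function name bios_ec_revisions is invented here.

```sh
# Extract the "BIOS Revision" and "Firmware Revision" lines from
# dmidecode output supplied on stdin and print them as key=value.
bios_ec_revisions() {
    awk -F': *' '/^[ \t]*(BIOS|Firmware) Revision:/ {
        sub(/^[ \t]+/, "", $1)   # drop the leading indentation
        print $1 "=" $2
    }'
}
```

Typical use would be "sudo dmidecode -t bios | bios_ec_revisions". As far as this thread suggests, the "Firmware Revision" field is the embedded controller (ECP) version on these ThinkPads.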
Comment 136 Damjan Georgievski 2017-06-03 11:09:35 UTC
updated the X1 Carbon (5th gen) to UEFI: 1.20 / ECP: 1.14 (2017/05/26) 

still the same issue :(
Comment 137 Jens Axboe 2017-06-03 18:16:20 UTC
Nicolo, are you going to run the tests suggested by Lv Zheng in comment 124+125? It'd be nice if we can get this closer to being resolved, no progress has been made in over a month.
Comment 138 Nicolo' 2017-06-03 23:36:17 UTC
Sorry, I've been busy with other stuff. I will try to do those tomorrow.
One thing that anyone can try is simply reverting commit d30283057ecdf8c543ae757ae34db3d7fd2d7732
Has this been tried?
Comment 139 Nicolo' 2017-06-04 21:25:59 UTC
I tried and failed today to do test A, due to a (possibly unrelated) compile error with kernel series 4.8; will try again after someone either here or at Arch helps with the issue, which is as follows:

mkdir build
cd build/
git clone git://git.archlinux.org/svntogit/packages.git --single-branch --branch "packages/linux"
cd packages/trunk
git checkout d59764443634990fb9c058e31515af5692de44ce
cd ../..
cp -r packages/trunk/ linux
makepkg -o

edit PKGBUILD to change name to custom, edit config and add file to include folder as described in comment#79

updpkgsums
makepkg -s

gives error

LD      init/built-in.o
kernel/built-in.o: In function `update_wall_time':
(.text+0x7c377): undefined reference to `____ilog2_NaN'
make: *** [Makefile:951: vmlinux] Error 1
==> ERROR: A failure occurred in build().
    Aborting...
Comment 140 a.piesk 2017-06-04 22:29:12 UTC
(In reply to Nicolo' from comment #139)
> 
> LD      init/built-in.o
> kernel/built-in.o: In function `update_wall_time':
> (.text+0x7c377): undefined reference to `____ilog2_NaN'
> make: *** [Makefile:951: vmlinux] Error 1
> ==> ERROR: A failure occurred in build().
>     Aborting...

gcc7, i assume? Might be this one:

https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=474c90156c8dcc2fa815e6716cc9394d7930cb9c
Comment 141 Nicolo' 2017-06-04 22:48:07 UTC
Thank you, will try to apply the patch.
Comment 142 Nicolo' 2017-06-05 15:41:46 UTC
Hi, I still get an error even with the patch:
....
LD      drivers/built-in.o
==> ERROR: A failure occurred in build().
    Aborting...
Comment 143 Fernando Chaves 2017-06-06 18:39:37 UTC
I can confirm that reverting commit d30283057ecdf8c543ae757ae34db3d7fd2d7732 in kernel 4.11.3 (Arch Linux) fixes acpitz-virtual-0 being stuck at 48°C, but it does not fix the fan issue.

Over the next few days I will try reverting the other commits and see what happens.
Comment 144 Nicolo' 2017-06-06 19:58:29 UTC
Interesting.

I was finally able to build 4.8 kernel, so now I can easily do tests ABC.
Comment 145 Damjan Georgievski 2017-06-06 22:34:18 UTC
(In reply to Fernando Chaves from comment #143)
> I can confirm revert  commit  d30283057ecdf8c543ae757ae34db3d7fd2d7732 in
> kernel 4.11.3  (archlinux) solves acpitz-virtual-0 stuck in 48º, but not
> solve the fan issue.
> 
> I will try in theses days to revert the others commits and see what happen

not for me on 4.12-rc4. acpitz-virtual-0 is still stuck at +48.0°C when the issue appears.
Comment 146 Nicolo' 2017-06-07 01:52:05 UTC
Created attachment 256897 [details]
testA

I'm attaching the result of test A.
Soon will also do B and C.
Comment 147 Nicolo' 2017-06-07 02:10:17 UTC
About test B, I get the error
sudo ./ec -b 0x78
ec: /sys/kernel/debug/ec/ec0/io: No such file or directory
What am I doing wrong?
I'm assuming I can make ec at any time, don't necessarily need to make it when I compile the kernel, is that right?
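
A possible cause, not confirmed anywhere in this thread: the file /sys/kernel/debug/ec/ec0/io is provided by the ec_sys module (CONFIG_ACPI_EC_DEBUGFS) and only appears once debugfs is mounted and that module is loaded. The sketch below just reports whether the file is there; the function name ec_io_hint and its optional path argument are hypothetical.

```sh
# Report whether the EC debugfs I/O file used by the ec tool exists.
# $1 (optional, hypothetical): path to check instead of the default.
ec_io_hint() {
    io="${1:-/sys/kernel/debug/ec/ec0/io}"
    if [ -e "$io" ]; then
        echo "present: $io"
    else
        echo "missing: $io (try: mount debugfs, then modprobe ec_sys)"
    fi
}
```

If the file is missing, "sudo modprobe ec_sys" followed by running the ec tool again would be the first thing to try, assuming the kernel was built with that option.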
Comment 148 Fernando Chaves 2017-06-07 18:30:26 UTC
(In reply to Damjan Georgievski from comment #145)
> (In reply to Fernando Chaves from comment #143)
> > I can confirm revert  commit  d30283057ecdf8c543ae757ae34db3d7fd2d7732 in
> > kernel 4.11.3  (archlinux) solves acpitz-virtual-0 stuck in 48º, but not
> > solve the fan issue.
> > 
> > I will try in theses days to revert the others commits and see what happen
> 
> not for me on 4.12-rc4. acpitz-virtual-0 is still stuck at +48.0°C when the
> issue appears.

You are right, I was watching another sensor, sorry I was tired.



I have now reverted all of the following commits in 4.11.3 and the issue is gone: the fan no longer blows at full speed and acpitz-virtual-0 is no longer stuck at +48.0°C.

df45db6177f8dde380d44149cca46ad800a00575
750f628be68e8b8e1624d8abd003b9f1fc758ed6
e923e8e79e18fd6be9162f1be6b99a002e9df2cb
c2b46d679b30c5c0d7eb47a21085943242bdd8dc
39a2a2aa3e9e5538984e9130c92a6c889ad86435
d30283057ecdf8c543ae757ae34db3d7fd2d7732
72c77b7ea9ce781f4987840984a462e4456ba98e
46922d2a3aff5122253d97e64500801c08f4f2c0
2a5708409e4e05446eb1a89ecb48641d6fd5d5a9
97cb159fd91d00f8d7d1adeb075503dc0d946bff
eab05ec38073f72389386f4a77fb58c06e246a4c
4c237371f290d1ed3b2071dd43554362137b1cce
c3a696b6e8f8f75f9f75e556a9f9f6472eae2655

I have not dug into what these commits change because I lack the knowledge and the time, but if I find time I will try to see what they do.

And I forgot to say, my laptop is an X1 Carbon 5th Gen.
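
For others who want to repeat this test, the revert commands can be generated from the hash list rather than typed by hand. This is a sketch only: the function name revert_cmds is made up, it prints commands instead of running them, and it assumes the hashes are fed in an order that reverts cleanly (normally newest first), which this comment does not state.

```sh
# Read commit hashes from stdin (one per line) and print the matching
# "git revert" commands. Nothing is executed; review the output first.
revert_cmds() {
    while read -r hash; do
        case "$hash" in
            # Skip blank lines and anything that is not lowercase hex.
            ''|*[!0-9a-f]*) continue ;;
        esac
        printf 'git revert --no-edit %s\n' "$hash"
    done
}
```

Saving the list above to a file and running "revert_cmds < commits.txt" (a hypothetical file name) inside the 4.11.3 tree prints one command per commit, which can then be piped to sh once the order has been checked.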
Comment 149 Lv Zheng 2017-06-08 02:49:20 UTC
Have you tried to revert just this commit:
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=d3028305

Thanks
Lv
Comment 150 Lv Zheng 2017-06-08 02:56:29 UTC
You can also help confirm by reverting just the following commits:

72c77b7ea9ce781f4987840984a462e4456ba98e
46922d2a3aff5122253d97e64500801c08f4f2c0
2a5708409e4e05446eb1a89ecb48641d6fd5d5a9
97cb159fd91d00f8d7d1adeb075503dc0d946bff
eab05ec38073f72389386f4a77fb58c06e246a4c
4c237371f290d1ed3b2071dd43554362137b1cce
c3a696b6e8f8f75f9f75e556a9f9f6472eae2655

and see whether the problem disappears.
Comment 151 Lv Zheng 2017-06-08 03:14:13 UTC
Let's narrow down the problem.

Fernando Chaves:
In comment 143:
You've confirmed "reverting d30283057ecdf8c543ae757ae34db3d7fd2d7732" can solve acpitz-virtual-0 stuck in 48º. But fan still blows.
Could you just try to comment out the following line and try again:
SET_NOIRQ_SYSTEM_SLEEP_PM_OPS(acpi_ec_suspend_noirq, acpi_ec_resume_noirq)

Could you also run the test from comment 150 to see whether the commits listed there are related to this issue?

To Damjan Georgievski:
In comment 145, you said "reverting d30283057ecdf8c543ae757ae34db3d7fd2d7732" cannot solve acpitz-virtual-0 stuck at +48.0°C.
You may have a different issue.
Could you confirm this again? Otherwise, let's root-cause Fernando's case first.

Thanks
Lv
Comment 152 Lv Zheng 2017-06-08 03:31:53 UTC
Fernando Chaves/Damjan Georgievski:

Is CONFIG_SMP enabled in your configuration file?
Comment 153 Damjan Georgievski 2017-06-08 06:58:36 UTC
(In reply to Lv Zheng from comment #152)
> Fernando Chaves/Damjan Georgievski:
> 
> Is CONFIG_SMP enabled in your configuration file?

of course.
Comment 154 Damjan Georgievski 2017-06-08 11:43:50 UTC
(In reply to Lv Zheng from comment #150)
> Also you can help to confirm by just reverting the followings:
> 
> 72c77b7ea9ce781f4987840984a462e4456ba98e
> 46922d2a3aff5122253d97e64500801c08f4f2c0
> 2a5708409e4e05446eb1a89ecb48641d6fd5d5a9
> 97cb159fd91d00f8d7d1adeb075503dc0d946bff
> eab05ec38073f72389386f4a77fb58c06e246a4c
> 4c237371f290d1ed3b2071dd43554362137b1cce
> c3a696b6e8f8f75f9f75e556a9f9f6472eae2655
> 
> And see whether the problem can disappear.

I reverted all of those, and also
d30283057ecdf8c543ae757ae34db3d7fd2d7732

didn't help in my case.

here's the branch I compiled 
https://github.com/gdamjan/linux/commits/bugzilla-191181
Comment 155 Nicolo' 2017-06-08 11:58:07 UTC
Let me observe that with the new ECP firmware released by Lenovo a few weeks ago and 4.11 series kernel, it seems my system does not suffer anymore from the problem: at least I've had the new update for a few days and never experienced the problem, but I will keep monitoring it.
Comment 156 0xbb 2017-06-08 12:22:14 UTC
On a ThinkPad X1 Carbon 4th gen
Version 1.25 (UEFI: 1.25 / ECP: 1.17)
This bug still persists with 4.11.3-1-ARCH #1 SMP PREEMPT.
Comment 157 Jens Axboe 2017-06-08 18:49:03 UTC
Concur with comment 156 - just upgraded my gen 4 x1 carbon to 1.25, and the problem is still there. I combined 1.25 with ECP 1.13, and then it's fine.
Comment 158 Nicolo' 2017-06-08 18:56:58 UTC
Created attachment 256925 [details]
attachment-6102-0.html

I will keep monitoring it, but mine seems to have no problems (it seems the version numbers are different)

4.11.3-1-ARCH
BIOS Information
Vendor: LENOVO
Version: N1CET56W (1.24 )
Release Date: 04/19/2017
Address: 0xE0000
Runtime Size: 128 kB
ROM Size: 16 MB
BIOS Revision: 1.24
Firmware Revision: 1.11


Comment 159 Fernando Chaves 2017-06-08 19:51:40 UTC
(In reply to Lv Zheng from comment #151)
> Let's narrow down the problem.
> 
> Fernando Chaves:
> In comment 143:
> You've confirmed "reverting d30283057ecdf8c543ae757ae34db3d7fd2d7732" can
> solve acpitz-virtual-0 stuck in 48º. But fan still blows.
> Could you just try to comment out the following line and try again:
> SET_NOIRQ_SYSTEM_SLEEP_PM_OPS(acpi_ec_suspend_noirq, acpi_ec_resume_noirq)
> 
> Could you help to do the test as comment 150 to see whether the commits in
> comment 150 relate to this issue.
> 
> To Damjan Georgievski:
> In comment 145, you said "reverting
> d30283057ecdf8c543ae757ae34db3d7fd2d7732" cannot solve acpitz-virtual-0
> stuck at +48.0°C.
> You may have a different issue.
> Could you confirm this again, or let's root cause Fernando's one first.
> 
> Thanks
> Lv

Yes, I will run the tests from comment #150 and comment #151 today or tomorrow.

I was wrong in comment #143, as I said in comment #148. Sorry for that.


(In reply to Lv Zheng from comment #152)
> Fernando Chaves/Damjan Georgievski:
> 
> Is CONFIG_SMP enabled in your configuration file?

Yes, It's enabled




(In reply to Nicolo' from comment #155)
> Let me observe that with the new ECP firmware released by Lenovo a few weeks
> ago and 4.11 series kernel, it seems my system does not suffer anymore from
> the problem: at least I've had the new update for a few days and never
> experienced the problem, but I will keep monitoring it.

With the stock 4.11.3-1-ARCH kernel the issue persists in my case. If I revert the commits in comment #148, the fan and the acpitz-virtual-0 sensor work perfectly.


Something else I discovered today: running the stock Arch kernel (4.11.3-1-ARCH) with my USB 3 hub connected and a keyboard or mouse plugged INTO THE HUB, the issue does not appear (tested about 10 times). But if I plug a pendrive into the hub, or connect the keyboard/mouse directly to the laptop, the issue appears on the first resume (tested about 10 more times).

Maybe it is related to the fact that a keyboard/mouse can wake the machine from sleep?
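
To check that guess, one could list which devices are currently allowed to wake the machine. The sketch below assumes the usual /proc/acpi/wakeup table (columns Device, S-state, Status, Sysfs node); the function name enabled_wakeup_devices is invented for this example, and whether wakeup sources have anything to do with the fan issue is not established here.

```sh
# Print the names of ACPI devices whose wakeup status is "enabled".
# $1 (optional, hypothetical): file to parse instead of /proc/acpi/wakeup.
enabled_wakeup_devices() {
    # NR > 1 skips the header row; column 3 holds *enabled or *disabled.
    awk 'NR > 1 && $3 ~ /enabled/ { print $1 }' "${1:-/proc/acpi/wakeup}"
}
```

Comparing the output with the keyboard on the hub and with it plugged directly into the laptop would show whether the set of wakeup-enabled devices actually differs between the two setups.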


I'm using the official PKGBUILD from the stock Arch Linux kernel to build my test kernels; this PKGBUILD downloads the vanilla kernel tarball from https://www.kernel.org/

BIOS Information
Vendor: LENOVO
Version: N1MET35W (1.20 )
Release Date: 05/17/2017
Address: 0xE0000
Runtime Size: 128 kB
ROM Size: 16 MB
BIOS Revision: 1.20
Firmware Revision: 1.11
Comment 160 Lv Zheng 2017-06-09 02:26:42 UTC
Ok, so the culprit seems to be this one:
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=c3a696b6e8f8f75f9f75e556a9f9f6472eae2655

But this commit seems useful for making the suspend/resume cycle faster, and there should be users who want this feature.
So instead of reverting it, let me prepare a boot option for users who want to use it.

Thanks
Lv
Comment 161 Lv Zheng 2017-06-09 02:30:31 UTC
Created attachment 256927 [details]
[PATCH] ACPI: EC: Revert back to default wait polling style processing in noirq stage

Could someone try to:

1. apply this commit.
2. boot the kernel with "acpi.ec_freeze_events=N acpi.ec_suspend_yield=Y" and let me know the result.
3. boot the kernel with "acpi.ec_freeze_events=Y acpi.ec_suspend_yield=Y" and let me know the result.

Thanks in advance.
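
When reporting results for steps 2 and 3, it helps to confirm that the options really reached the kernel. A minimal sketch, assuming the parameters are passed on the kernel command line as written above; the function name cmdline_param is made up, and acpi.ec_suspend_yield only exists with the patch from this attachment applied.

```sh
# Print the value of a kernel command-line parameter, or an empty line
# if it is not set. The last occurrence wins.
# $1: parameter name, e.g. acpi.ec_freeze_events
# $2 (optional, hypothetical): file to read instead of /proc/cmdline
cmdline_param() {
    name="$1"
    value=""
    # Intentionally unquoted: split the command line into words.
    for word in $(cat "${2:-/proc/cmdline}"); do
        case "$word" in
            "$name"=*) value="${word#*=}" ;;
        esac
    done
    printf '%s\n' "$value"
}
```

For example, "cmdline_param acpi.ec_freeze_events" should print N for step 2 and Y for step 3; an empty line means the option was not passed at all.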
Comment 162 Lv Zheng 2017-06-09 02:34:56 UTC
It seems there are multi-layered issues/concepts revealed by these EC commits.
I'm not able to root cause all of them.
Let me just do what I need to do in the EC driver.

Thanks
Lv
Comment 163 Nicolo' 2017-06-09 12:15:05 UTC
Created attachment 256931 [details]
attachment-3214-0.html

Actually, today the issue came back for me as well, so at least there is no
mystery :)

Comment 164 Fernando Chaves 2017-06-09 13:02:43 UTC
(In reply to Lv Zheng from comment #161)
> Created attachment 256927 [details]
> [PATCH] ACPI: EC: Revert back to default wait polling style processing in
> noirq stage
> 
> Could someone try to:
> 
> 1. apply this commit.
> 2. boot the kernel with "acpi.ec_freeze_events=N acpi.ec_suspend_yield=Y"
> and let me know the result.
> 3. boot the kernel with "acpi.ec_freeze_events=Y acpi.ec_suspend_yield=Y"
> and let me know the result.
> 
> Thanks in advance.


No issues with 2, with 3, or without parameters (since ec_suspend_yield defaults to TRUE); tested 5 times each.

If I boot with acpi.ec_suspend_yield=N, the issue appears on the first suspend.
Comment 165 Nicolo' 2017-06-09 13:13:48 UTC
Created attachment 256933 [details]
attachment-14163-0.html

Lv: let me know whether now that you have isolated the commit you still
want me to do tests B and C.

Comment 166 Damjan Georgievski 2017-06-10 08:39:34 UTC
(In reply to Lv Zheng from comment #161)
> Created attachment 256927 [details]
> [PATCH] ACPI: EC: Revert back to default wait polling style processing in
> noirq stage
> 
> Could someone try to:
> 
> 1. apply this commit.
> 2. boot the kernel with "acpi.ec_freeze_events=N acpi.ec_suspend_yield=Y"
> and let me know the result.
> 3. boot the kernel with "acpi.ec_freeze_events=Y acpi.ec_suspend_yield=Y"
> and let me know the result.

the issue persists in both cases for me. (Carbon 5th gen)
Comment 167 Damjan Georgievski 2017-06-10 10:23:15 UTC
(In reply to Damjan Georgievski from comment #166)
> The (In reply to Lv Zheng from comment #161)
> > Created attachment 256927 [details]
> > [PATCH] ACPI: EC: Revert back to default wait polling style processing in
> > noirq stage
> > 
> > Could someone try to:
> > 
> > 1. apply this commit.
> > 2. boot the kernel with "acpi.ec_freeze_events=N acpi.ec_suspend_yield=Y"
> > and let me know the result.
> > 3. boot the kernel with "acpi.ec_freeze_events=Y acpi.ec_suspend_yield=Y"
> > and let me know the result.
> 
> the issue persists in both cases for me. (Carbon 5th gen)

I retract this comment, by mistake I didn't run the patched kernel.

I'll be testing this branch for several hours:
https://github.com/gdamjan/linux/commits/bugzilla-191181-try4

i.e. latest linus tree and the "EC: Revert..." patch
Comment 168 Damjan Georgievski 2017-06-12 16:03:49 UTC
(In reply to Lv Zheng from comment #161)
> Created attachment 256927 [details]
> [PATCH] ACPI: EC: Revert back to default wait polling style processing in
> noirq stage
> 
> Could someone try to:
> 
> 1. apply this commit.
> 2. boot the kernel with "acpi.ec_freeze_events=N acpi.ec_suspend_yield=Y"
> and let me know the result.
> 3. boot the kernel with "acpi.ec_freeze_events=Y acpi.ec_suspend_yield=Y"
> and let me know the result.
> 
> Thanks in advance.

After testing it more carefully, it seems to work with "acpi.ec_freeze_events=Y acpi.ec_suspend_yield=Y"

Carbon 5th gen
20HQS0LV00
BIOS N1MET35W (1.20 ) 05/17/2017
Firmware Revision: 1.14
Comment 169 Gjorgji Jankovski 2017-06-12 16:09:14 UTC
(In reply to Lv Zheng from comment #161)
> Created attachment 256927 [details]
> [PATCH] ACPI: EC: Revert back to default wait polling style processing in
> noirq stage
> 
> Could someone try to:
> 
> 1. apply this commit.
> 2. boot the kernel with "acpi.ec_freeze_events=N acpi.ec_suspend_yield=Y"
> and let me know the result.
> 3. boot the kernel with "acpi.ec_freeze_events=Y acpi.ec_suspend_yield=Y"
> and let me know the result.
> 
> Thanks in advance.
After applying the patch it seems to be fixed.
Kernel: 4.11.4
BIOS: N1QET55W (1.30 ) 05/23/2017
Firmware Revision: 1.13
Model: T470
Comment 170 Lv Zheng 2017-06-13 04:18:41 UTC
OK, it looks like your issue is related to this breakage.
I'll send the patch upstream.

Here is another fix, hope someone could try it.
Let me post it later.

Thanks
Lv
Comment 171 Lv Zheng 2017-06-13 04:40:01 UTC
Created attachment 256971 [details]
[PATCH] ACPI: EC: Mark a possible IRQ storm period

This is a known issue, not sure if it is related to the reported bug.
Please help to confirm.

Thanks
Lv
Comment 172 Lv Zheng 2017-06-13 04:41:07 UTC
Please do not apply attachment 256927 [details] but apply attachment 256971 [details] instead and test again to see if the problem is fixed.
Comment 173 Gjorgji Jankovski 2017-06-13 08:29:26 UTC
Just some further info, from what I've noticed every time this happens the acpitz-virtual-0 sensor as reported by sensors is stuck at 48C.
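
The same reading can be taken without the sensors tool. The sketch below assumes that acpitz-virtual-0 corresponds to a thermal zone of type "acpitz" under /sys/class/thermal, where temp is reported in millidegrees Celsius; the function name check_acpitz is made up. A single reading of 48000 only matches the symptom, so it should be sampled several times under load before concluding the sensor is stuck.

```sh
# Print the temperature of every ACPI thermal zone and flag the value
# that this thread associates with the stuck sensor (48.0 C).
# $1 (optional, hypothetical): alternative root for dry runs.
check_acpitz() {
    root="${1:-/sys/class/thermal}"
    for zone in "$root"/thermal_zone*; do
        [ -r "$zone/type" ] && [ -r "$zone/temp" ] || continue
        [ "$(cat "$zone/type")" = "acpitz" ] || continue
        temp=$(cat "$zone/temp")   # millidegrees Celsius
        if [ "$temp" -eq 48000 ]; then
            echo "${zone##*/}: $temp (matches the stuck value)"
        else
            echo "${zone##*/}: $temp"
        fi
    done
}
```

Running it in a loop, for instance "while sleep 2; do check_acpitz; done", while something CPU-heavy is running should show the value moving on a healthy system.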
Comment 174 Nicolo' 2017-06-13 11:41:04 UTC
Hi Lv: I will test the new patch soon.
Comment 175 Damjan Georgievski 2017-06-13 11:45:15 UTC
(In reply to Lv Zheng from comment #171)
> Created attachment 256971 [details]
> [PATCH] ACPI: EC: Mark a possible IRQ storm period
> 
> This is a known issue, not sure if it is related to the reported bug.
> Please help to confirm.

no, this patch alone didn't fix the issue (over kernel 4.12-rc5)

this is the kernel I've compiled
https://github.com/gdamjan/linux/commits/bugzilla-191181-try5
Comment 176 Nicolo' 2017-06-13 14:33:39 UTC
Created attachment 256985 [details]
attachment-4105-0.html

Lv: the new patch does not work for me.

Comment 177 Lv Zheng 2017-06-14 01:01:09 UTC
Thanks for the test, so I know it's not related.
I'll send a refined version of attachment 256927 [details] upstream.
Marking this as resolved.

Thanks
Lv
Comment 178 Lv Zheng 2017-06-14 06:25:22 UTC
Here you can find the patch:
https://patchwork.kernel.org/patch/9785497/
If there is something wrong with it, please let me know.

Thanks
Lv
Comment 179 Damjan Georgievski 2017-06-14 11:05:47 UTC
(In reply to Lv Zheng from comment #178)
> Here you can find the patch:
> https://patchwork.kernel.org/patch/9785497/
> If there is something wrong with it, please let me know.

I've applied these 3 patches from patchwork over 4.12-rc5, but now I still have the issue.

this is the tree I built
https://github.com/gdamjan/linux/commits/bugzilla-191181-try6
Comment 180 Tatsuyuki Ishi 2017-06-14 11:26:58 UTC
Actually I also suspected that the fix won't work.

Lv, are you misunderstanding the issue? The problem persists after it happens once, until a complete power reset. Rebooting into Windows does not solve it. It's definitely not only "during some kind of stage".
Comment 181 Nicolo' 2017-06-14 17:28:42 UTC
Created attachment 257005 [details]
attachment-6782-0.html

Hi Lv,

I can confirm that if I'm experiencing the issue with some kernel and
reboot into patched kernel with either YY or NY parameters, then the issue
is still there; the only solution for me at the moment is downgrading EC
firmware.

Comment 182 Lv Zheng 2017-06-16 03:09:39 UTC
> Lv, are you misunderstanding the issue?

My original understanding is:
On some platforms, if you run "make -j XXX" to build a very big program, the fan can spin up. This may mean that the BIOS on such platforms implements a thermal policy that depends on CPU usage.
So when we saw this problem, we thought "this bug is trivial".
Linux executes many driver teardowns/setups during suspend/resume, and when they run in parallel they can keep all CPUs busy.
So if the thermal layer is still actively making decisions during this period, we may get a wrong fan action.
The commit in question can indeed increase CPU usage:
c3a696b6e8f8f75f9f75e556a9f9f6472eae2655
When you are trying to make things faster, how can you avoid using more CPU?
So it is no surprise that this commit can be bisected as the culprit of this kind of regression.
TBH, I was actually thinking that fixing this regression is meaningless. Any driver could cause such a regression just by being more active. This time it is the EC only because, on some platforms, the EC is the busiest component during the suspend/resume process.

Maybe I'm wrong.

> The problem persists after it happens once, until a complete power reset.
> Rebooting into Windows does not solve it. It's definitely not only "during
> some kind of stage".

If the problem also shows up under Windows, this is definitely a firmware bug.
You should upgrade the firmware.
But if Windows resumes with silent fans, it might not be related to the firmware but to an OS gap; it could be an architectural issue or a small gap.

OTOH, I think this bug is common on many platforms.
I suffered from it several years ago when I started contributing to the community.
The fans on my test platforms always spin up to maximum speed after resume.
I'm sure this is not a regression.
It can just become more noticeable when some changes are made.

In this thread we are simply seeing different bug reports from different reporters.
They might be different bugs.
Maybe this commit is useful for some of the reporters:
https://patchwork.kernel.org/patch/9785497/

There are 3 patches in the series, and 2 of them are not related to this issue.
https://patchwork.kernel.org/patch/9785499/: only relates to debugging logs.
https://patchwork.kernel.org/patch/9785489/: just a common deferred IRQ handling style: disable the IRQ, handle it, and re-enable it.
During this period the EC driver does not busy-poll but yields to other processes; there are always idle periods because wait() is still invoked. So it shouldn't be related to this issue.
Comment 183 Tatsuyuki Ishi 2017-06-16 05:36:18 UTC
First, this is not some kind of bug that just lasts for seconds. The effect won't go away until you completely power off.

The problem is: fan control breaks and the fan stays at either maximum speed or zero, regardless of the temperature.

I suspect some regression is triggering a bug in the firmware, or worse, undefined behavior.

Lv, I hope you understand with this explanation.
Comment 184 Lv Zheng 2017-06-16 05:49:35 UTC
> First, this is not some kind of bug that just lasts for seconds. The effect
> won't go away until you completely power off.

I can see exactly the same problem on my Dell Latitude 6430u with very old kernels.

> The problem is: fan control got broken and either stays at maximum speed or
> zero, regardless the temperature.

Can this be improved in thermal layer?

> I suspect some regression is triggering a bug in the firmware, or worse,
> undefined behavior.

Could you bisect out the regression in your case?

From comment 154, in Fernando Chaves's case.
It seems the culprit is c3a696b6e8f8f75f9f75e556a9f9f6472eae2655.
And this patch functionally reverts this commit:
https://patchwork.kernel.org/patch/9785497/
But this may not hold for the other reporters.

> Lv, I hope you understand with this explanation.

No, I'm actually confused.
Comment 185 Lv Zheng 2017-06-16 05:57:43 UTC
To Damjan:

For your case, please boot with "acpi.ec_freeze_events=Y" and confirm again.
Thanks in advance.
Comment 186 Lv Zheng 2017-06-16 06:13:24 UTC
To Damjan:

Sorry, for your case, please boot with "acpi.ec_freeze_events=Y" or "acpi.ec_freeze_events=N" and confirm again.
Thanks in advance.

To Nicolo' and Tatsuyuki:

From your comments, I cannot see what is the problem related to your platforms.
Could you also try to apply:
https://patchwork.kernel.org/patch/9785497/
And test with "acpi.ec_freeze_events=N"
Comment 187 Lv Zheng 2017-06-16 06:23:41 UTC
I just asked a thermal expert.
Rui told me that there can be many such fan bugs with various causes.
So there must be several different bugs in this one report.

Only the cases of Fernando, Gjorgji and Damjan relate to the EC driver change.
The others seem to be unrelated.
So my patch description is not correct.
I'll change it later.
Comment 188 Gjorgji Jankovski 2017-06-17 09:19:17 UTC
Some more info, the bug happened again with this patch:

https://bugzilla.kernel.org/attachment.cgi?id=256927&action=diff

It's mostly working fine, though: it happened only once in a week, whereas before it happened on almost every suspend/wakeup.

I just applied this:

https://patchwork.kernel.org/patch/9785497/

This is applied on top of 4.11.5

With "acpi.ec_freeze_events=N" it's broken but "acpi.ec_freeze_events=Y" seems to be working fine for now. I'll be running this so we'll see how it behaves.
Comment 189 Gjorgji Jankovski 2017-06-18 09:27:43 UTC
Never mind that, it doesn't work; it appears to have been just luck the first few times.
Comment 190 Hrvoje Zeba 2017-06-18 22:09:03 UTC
Hi,

I recently got a Lenovo Yoga X1 2nd gen (20JD) and it exhibits the same behavior, not only when it comes out of sleep but also when AC is plugged in. I'm running the newest BIOS.

# dmidecode
...
Handle 0x000B, DMI type 0, 24 bytes
BIOS Information
	Vendor: LENOVO
	Version: N1NET24W (1.11 )
	Release Date: 05/26/2017
	Address: 0xE0000
	Runtime Size: 128 kB
	ROM Size: 16 MB
	Characteristics:
		PCI is supported
		PNP is supported
		BIOS is upgradeable
		BIOS shadowing is allowed
		Boot from CD is supported
		Selectable boot is supported
		EDD is supported
		3.5"/720 kB floppy services are supported (int 13h)
		Print screen service is supported (int 5h)
		8042 keyboard services are supported (int 9h)
		Serial services are supported (int 14h)
		Printer services are supported (int 17h)
		CGA/mono video services are supported (int 10h)
		ACPI is supported
		USB legacy is supported
		BIOS boot specification is supported
		Targeted content distribution is supported
		UEFI is supported
	BIOS Revision: 1.11
	Firmware Revision: 1.9
...

# uname -a
Linux littletwo 4.11.5-1-ARCH #1 SMP PREEMPT Wed Jun 14 16:19:27 CEST 2017 x86_64 GNU/Linux

Is there anything I can do to help with diagnosing this issue? My knowledge of kernel internals is limited.
Comment 191 Lv Zheng 2017-06-19 23:49:59 UTC
To: Gjorgji/Fernando/Damjan

There seem to be many different bugs in this report.
For your platforms, I opened a new one to track.
Please find it on bug 196129.

Let's leave this bug to thermal developers.

Thanks
Lv
Comment 192 Zhang Rui 2017-06-20 03:27:45 UTC
In this bug report, we have

Tatsuyuki Ishi         - ThinkPad X1 Yoga
0xbb                   - ThinkPad X1 Carbon 4th
Nicolo'                - Thinkpad t460s
Claudio Sacerdoti Coen - ThinkPad X1 Carbon 4th
H Zeng                 - ThinkPad T470s
Max Deineko            - X270
Jens Axboe             - Thinkpad X1 Carbon gen4
Markus T.H.            - Thinkpad X1 2017 Gen5
Alexander T.           - Lenovo Edge E540 (ThinkPad)
Neil Kownacki          - x270
Marcoen Hirschberg     - T470
Damjan Georgievski     - ThinkPad X1 Carbon 5th gen
a.piesk@gmx.net        - ThinkPad T460s
Fernando Chaves        - X1 Carbon 5th Gen
Hrvoje Zeba            - Lenovo Yoga X1 2nd gen

TBH, having comments from so many bug reporters in the same thread may be misleading, and I'm not sure we have exactly the same problem on all of these laptops.
So I need input from all of you to make things clear.

1. There is a known solution that fixes all the problems, which is tracked separately at bug #196129. So, all of you, please confirm whether that solution works. If it does, please drop a note here and then switch to that thread.

2. Ishi, as the original bug reporter, please describe the current status of the problem again, with and without the solution in #196129.

3. For the others: if the solution in #196129 does not work, please wait and check Ishi's latest description of the problem. If it is exactly the same, please drop a note here; otherwise, please also drop a note and let me check whether we should track it separately.

Thanks, all.
Comment 193 Hrvoje Zeba 2017-06-20 15:47:44 UTC
With the patch applied to 4.11.5-1-ARCH and kernel parameters set to 'acpi.ec_freeze_events=N acpi.ec_suspend_yield=Y', system is behaving normally for now (about a day's worth). I put it to sleep and disconnect/connect the power adapter every now and then. I'll give it a few more days and then switch to 'acpi.ec_freeze_events=Y acpi.ec_suspend_yield=Y' to test it out. I'll report back with the results.
Comment 194 Fernando Chaves 2017-06-20 17:03:51 UTC
The solution in bug #196129 fixes all the problems for me (X1 Carbon 5th Gen)
Comment 195 Hrvoje Zeba 2017-06-22 02:12:23 UTC
Using 'acpi.ec_freeze_events=Y acpi.ec_suspend_yield=Y' has some weird effects on my system. It froze up multiple times and I had to go through multiple reboot/power cycles to get it up and running again. So I would say the patched kernel works as expected with 'acpi.ec_freeze_events=N acpi.ec_suspend_yield=Y' and doesn't work with 'acpi.ec_freeze_events=Y acpi.ec_suspend_yield=Y'. System is Lenovo Yoga X1 2nd gen (20JD).
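
When comparing results like these across machines, it helps to record which EC parameters were actually on the boot command line. A minimal Python sketch, assuming the parameters appear in the `acpi.ec_name=value` form quoted above; `ec_params` is a hypothetical helper name, and `acpi.ec_suspend_yield` is only expected to be recognized by a kernel with the patch applied:

```python
def ec_params(cmdline):
    """Return the acpi.ec_* parameters found in a kernel command line string."""
    params = {}
    for token in cmdline.split():
        # Only keep tokens of the form acpi.ec_<name>=<value>.
        if token.startswith("acpi.ec_") and "=" in token:
            name, value = token.split("=", 1)
            params[name] = value
    return params
```

On a running system the input would typically be the contents of `/proc/cmdline`, e.g. `ec_params(open("/proc/cmdline").read())`.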
Comment 196 Zhang Rui 2017-06-26 07:33:03 UTC
(In reply to Fernando Chaves from comment #194)
> The solution in bug #196129 fixes all the problems for me (X1 Carbon 5th Gen)

please refer to https://bugzilla.kernel.org/show_bug.cgi?id=196129#c5
and confirm what solution fixes your problem
Comment 197 Tatsuyuki Ishi 2017-07-01 09:53:51 UTC
It looks like by applying attachment 256927 [details] the problem hasn't been triggered so far. Should I try with the parameters as you said?

(As you know, I'm running X1 Yoga 1st gen)
Comment 198 Zhang Rui 2017-07-02 14:34:09 UTC
(In reply to Hrvoje Zeba from comment #193)
> With the patch applied to 4.11.5-1-ARCH and kernel parameters set to
> 'acpi.ec_freeze_events=N acpi.ec_suspend_yield=Y', system is behaving
> normally for now (about a day's worth). I put it to sleep and
> disconnect/connect the power adapter every now and then. I'll give it a few
> more days and then switch to 'acpi.ec_freeze_events=Y
> acpi.ec_suspend_yield=Y' to test it out. I'll report back with the results.

so this is a duplicate of #196129
Comment 199 Zhang Rui 2017-07-02 14:40:04 UTC
(In reply to Fernando Chaves from comment #194)
> The solution in bug #196129 fixes all the problems for me (X1 Carbon 5th Gen)

so this is a duplicate of bug #196129.
Comment 200 Zhang Rui 2017-07-02 14:41:58 UTC
(In reply to Tatsuyuki Ishi from comment #197)
> It looks like by applying attachment 256927 [details] the problem hasn't been
> triggered so far. Should I try with the parameters as you said?
> 
> (As you know, I'm running X1 Yoga 1st gen)

so this is a duplicate of bug #196129
Comment 201 Zhang Rui 2017-07-02 14:54:54 UTC
Tatsuyuki Ishi         - ThinkPad X1 Yoga            - duplicate of bug #196129
0xbb                   - ThinkPad X1 Carbon 4th
Nicolo'                - Thinkpad t460s
Claudio Sacerdoti Coen - ThinkPad X1 Carbon 4th
H Zeng                 - ThinkPad T470s
Max Deineko            - X270
Jens Axboe             - Thinkpad X1 Carbon gen4
Markus T.H.            - Thinkpad X1 2017 Gen5
Alexander T.           - Lenovo Edge E540 (ThinkPad)
Neil Kownacki          - x270
Marcoen Hirschberg     - T470
Damjan Georgievski     - ThinkPad X1 Carbon 5th gen  - handled in bug #196129
Gjorgji Jankovski      - T470                        - handled in bug #196129
a.piesk@gmx.net        - ThinkPad T460s
Fernando Chaves        - X1 Carbon 5th Gen           - duplicate of bug #196129
Hrvoje Zeba            - Lenovo Yoga X1 2nd gen      - duplicate of bug #196129

As everyone who posted a recent update has confirmed that this can be handled by bug #196129, including the original bug reporter Ishi, I think this bug report and bug #196129 are duplicates.

As there are too many reports in this thread, which is misleading, I will close this bug report and focus on the EC problem in bug #196129.
For people who have a similar problem and have confirmed that the solution in bug #196129 does not work, please open a new bug report.

*** This bug has been marked as a duplicate of bug 196129 ***
Comment 202 a.piesk 2017-10-22 16:42:56 UTC
i just came across 

https://bugzilla.redhat.com/show_bug.cgi?id=1480844#c48

Seems to be the same issue and was fixed by a new BIOS/EC for T470s.
Comment 203 Lv Zheng 2017-10-25 07:06:04 UTC
Created attachment 260383 [details]
Disable deeper C-states
Comment 204 Lv Zheng 2017-10-25 07:06:44 UTC
Created attachment 260385 [details]
[PATCH] Tune S3 resume step order
Comment 205 Lv Zheng 2017-10-25 07:09:30 UTC
To Andreas Piesk <a.piesk@gmx.net>:

Are you still suffering from this issue and monitoring here?
I think you are a T460s user.

If you are still suffering from this issue, would you please try attachment 260383 [details] to confirm if the problem disappears with "acpi_resume_latency=25"?

If the problem can be fixed by the workaround, please give attachment 260385 [details] a try.

They are patches based on the 4.10 upstream kernel. If you have trouble applying them to the latest kernel, please give the 4.10 tag a try.

Thanks and best regards
Lv
Comment 206 a.piesk 2017-10-25 19:08:55 UTC
Yes, i use a t460s and i'm still at EC 1.09, the last working firmware version.

For now i will wait for the Lenovo BIOS team to check whether the t460s has the same issue and whether it can be fixed by a new firmware too. The issue looks the same but maybe it isn't.

If it cannot or will not be fixed by firmware i will try your patch, thank you for posting it.

I just wanted to let the people know that Lenovo is aware of the problem and is fixing it.

Thanks,
-ap
Comment 207 Lv Zheng 2017-10-26 02:16:21 UTC
> I just wanted to let the people know that Lenovo is aware of the problem and
> is fixing it.

I think the same logic is in your EC FW (T460s), so from the EC FW's point of view, it should be the same issue IMO.
However, it can be brought to the surface by different OS changes.
Comment 208 Damjan Georgievski 2017-11-20 10:42:48 UTC
1.26 bios has been released for the X1 Carbon (5gen) on 2017-11-17

From
https://download.lenovo.com/pccbbs/mobiles/n1mur11w.txt

> [Problem fixes]
> - Fixed an issue where fan might rotated with max speed due to not reading
> CPU 
>   temperature correctly.
Comment 209 Hrvoje Zeba 2017-11-20 17:04:43 UTC
(In reply to Damjan Georgievski from comment #208)
> 1.26 bios has been released for the X1 Carbon (5gen) on 2017-11-17
> 
> From
> https://download.lenovo.com/pccbbs/mobiles/n1mur11w.txt
> 
> > [Problem fixes]
> > - Fixed an issue where fan might rotated with max speed due to not reading
> > CPU 
> >   temperature correctly.

BIOS version 1.20 with the same patch notes came out a while back for the Lenovo Yoga X1 2nd gen. I'm using 4.13.12-1-ARCH and the problem persists if scaling_governor is set to performance with the intel_pstate driver. The fan spin-up after waking from sleep doesn't happen if the governor is set to powersave. Temperature readings seem to be correct (i.e., it's not stuck at 48 anymore).
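
To check which governor each CPU is using before and after a resume, something like the following Python sketch could be used. It assumes the standard cpufreq sysfs layout (`/sys/devices/system/cpu/cpuN/cpufreq/scaling_governor`); `read_governors` is a hypothetical helper, and the `base` argument exists only so it can be pointed at a test directory:

```python
from pathlib import Path


def read_governors(base="/sys/devices/system/cpu"):
    """Map each cpuN directory name to its current scaling governor."""
    governors = {}
    pattern = "cpu[0-9]*/cpufreq/scaling_governor"
    for path in sorted(Path(base).glob(pattern)):
        # path is .../cpuN/cpufreq/scaling_governor, so cpuN is two levels up.
        governors[path.parent.parent.name] = path.read_text().strip()
    return governors
```

Comparing the result of `read_governors()` taken before suspend with one taken after resume would show whether the governor setting itself changed across the cycle.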
Comment 210 Hrvoje Zeba 2017-11-20 17:09:19 UTC
I forgot to mention: when the machine wakes up, everything seems to be OK until CPU load increases. The fan starts blowing as it should, but it never spins down, no matter the temperature reading.
Comment 211 Nicolo' 2017-12-07 22:05:42 UTC
Created attachment 261061 [details]
attachment-320-0.html

It seems Lenovo finally released a BIOS+EC update that solves this issue
for the T460s as well.

Comment 212 Nicolo' 2017-12-07 22:19:16 UTC
Created attachment 261063 [details]
attachment-1531-0.html

Actually, I spoke too soon: the second time I suspended, the sensor
got stuck at 48C again.

Comment 213 a.piesk 2017-12-07 22:22:16 UTC
(In reply to Nicolo' from comment #212)
> Created attachment 261063 [details]
> attachment-1531-0.html
> 
> Actually, I spoke too soon: the second time I suspended, the sensor
> got stuck at 48C again.
> 

Yep, it doesn't fix the issue, i reverted back to EC 1.09, my last known good EC version.
Comment 214 H Zeng 2017-12-07 22:46:33 UTC
I just came here because of the email notification about new comments, and I want to report that I have not run into this situation for a long time with my T470s. For me, this bug was fixed when the patch from Lv Zheng was merged into the kernel (I think it was version 4.11 or 4.12 or so).

During this time, I have upgraded my openSUSE Tumbleweed now and then, and updated the BIOS firmware from Lenovo as new versions were released (2 or 3 new versions since then). This issue has never shown up again, although I am using SLEEP almost all the time (1 to 3 times per day), with or without the Lenovo firmware fix for this issue.
Comment 215 Mr Brown 2018-01-23 06:28:59 UTC
Just wanted to chime in here in case others come looking as well. I read through the whole thread but have not applied any specific suggested fixes. I have a Lenovo X1 Yoga and have only recently run into this issue, but it is reproducible 100% of the time.

BIOS Revision: 1.33
Firmware Revision: 1.18
Kernel: 4.14.14-1-ARCH
Comment 216 zach.moazeni 2018-02-03 14:47:07 UTC
I was running into this 100% of the time with BIOS v1.33 + 4.13.0-32-generic on my X1 Yoga. However, after looking online, they only seem to publicize v1.32 https://pcsupport.lenovo.com/sa/en/products/laptops-and-netbooks/thinkpad-x-series-laptops/thinkpad-x1-yoga-type-20fq-20fr/downloads/ds111756, so I wonder if they're silently rolling back v1.33.

I downloaded v1.32 and rolled back to that version.

On v1.32 I sporadically run into this issue, but if I do, I just suspend again (via GNOME's top right power button, while holding Alt) and re-awaken. The fans will still occasionally spin full speed again, but more often than not re-suspending and re-awakening will fix it.

BIOS Revision: 1.32
Firmware Revision: 1.18
Kernel: 4.13.0-32-generic
Comment 217 Philipp Keller 2018-02-11 10:19:41 UTC
BIOS Revision: 1.32
Kernel: 4.13.0-32-generic
Laptop: X1 Carbon, 4th generation

I'm also still having this issue. Contrary to Zach's report above, the fan gets stuck at 100% every time after sleep. Going back to 1.32 did not help at all. Is there really no workaround for this problem?

Hibernating does not seem to be supported for my model, so that's not an option. Currently I just sleep/wake up 3-4 times until the fan suddenly stays low, but I'm confused because I thought the bug had been fixed in recent kernel versions. Or is it only fixed for some models? Is there anything I can still do to help fix this?
Comment 218 Max Deineko 2018-02-11 11:19:43 UTC
Just wanted to confirm that a BIOS upgrade to 1.24 fixed the issue on my X270 as well (running a 4.9.77 kernel now).
Comment 219 Ilya 2018-02-15 03:04:36 UTC
A cold winter triggers the problem:

1. Start the computer.
2. Put it to sleep.
3. Take it outside to cool the computer down (below 2 °C or so).
4. After wake-up, the fan runs at full speed.
Comment 220 zach.moazeni 2018-02-15 15:54:34 UTC
I hesitate to post this because it's not apples-to-apples, but I switched to Arch (via Antergos) and this problem has gone away with the updated kernel for me.

I'm still running the same bios and firmware.

BIOS Revision: 1.32
Firmware Revision: 1.18
Kernel: Linux 4.15.3-1-ARCH #1 SMP PREEMPT Mon Feb 12 23:01:17 UTC 2018 x86_64 GNU/Linux
Comment 221 Zhang Rui 2018-02-15 16:04:13 UTC
Created attachment 274183 [details]
attachment-15943-0.html

OOO till Feb 28th.
Comment 222 Philipp Keller 2018-02-23 14:28:23 UTC
A downgrade to BIOS version 1.29 (n1fur22w) has now solved the issue for me for 2 consecutive weeks. My environment: Laptop: ThinkPad X1 Carbon 20FC, kernel: 4.13.0-32-generic. See also this Ask Ubuntu post: https://askubuntu.com/a/1005165/255917
Comment 223 Nicolo' 2018-02-28 03:42:47 UTC
Created attachment 274479 [details]
attachment-29289-0.html

The problem is solved on my t460s with the recent EC update by Lenovo,
currently kernel 4.15.5, BIOS Revision: 1.33, Firmware Revision: 1.14.
Comment 224 Glen Ogilvie 2018-08-28 22:47:05 UTC
The issue happens on Mageia with kernel 4.14.65-desktop.
Downgrading to BIOS version 1.29 (n1fur22w) seems to solve the problem.

The BIOS downgrade from 1.37 to 1.29 fixed it for me, although I had to disable "secure rollback prevention" in the BIOS first.

Hardware: X1 Carbon 4th Gen, type: 20FC
Comment 225 karolszk 2018-11-30 18:47:12 UTC
Hi! This issue is still present on linux: 4.18.0-2-amd64 #1 SMP Debian 4.18.10-2 (2018-11-02) x86_64 GNU/Linux 

debian/testing on Lenovo Carbon X1 4th Generation, type 20FC BIOS 1.39 ThinkPad BIOS N1FET65W (1.39 )

and is annoying as ...

Regards,
Comment 226 karolszk 2018-12-29 12:48:38 UTC
A temporary fix for this problem is to do 2 consecutive suspend/resume cycles. After that, fan1, instead of spinning at high speed (~6k RPM):

iwlwifi-virtual-0
Adapter: Virtual device
temp1:        +33.0°C  

pch_skylake-virtual-0
Adapter: Virtual device
temp1:        +36.5°C  

acpitz-virtual-0
Adapter: Virtual device
temp1:        +48.0°C  (crit = +128.0°C)

thinkpad-isa-0000
Adapter: ISA adapter
fan1:        6912 RPM
fan2:        65535 RPM

goes to 0 RPM:

iwlwifi-virtual-0
Adapter: Virtual device
temp1:        +30.0°C  

pch_skylake-virtual-0
Adapter: Virtual device
temp1:        +36.0°C  

acpitz-virtual-0
Adapter: Virtual device
temp1:        +36.0°C  (crit = +128.0°C)

thinkpad-isa-0000
Adapter: ISA adapter
fan1:           0 RPM
fan2:        65535 RPM

coretemp-isa-0000
Adapter: ISA adapter
Package id 0:  +36.0°C  (high = +100.0°C, crit = +100.0°C)
Core 0:        +32.0°C  (high = +100.0°C, crit = +100.0°C)
Core 1:        +33.0°C  (high = +100.0°C, crit = +100.0°C)
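
The two readings above differ mainly in the `acpitz` temperature (+48.0°C in the bad state) and in `fan1`. A minimal Python sketch for pulling those values out of `sensors` output, assuming the text layout shown above; `parse_sensors` and `looks_stuck` are hypothetical helpers, and the 48.0°C check is just the symptom reported in this thread, not a documented threshold:

```python
import re

# Matches lines such as "temp1:  +48.0°C  (crit = ...)" or "fan1:  6912 RPM".
READING = re.compile(r"^([^:]+):\s*\+?(\d+(?:\.\d+)?)\s*(°C|RPM)")


def parse_sensors(text):
    """Return {chip: {label: value}} for temperature and fan lines."""
    readings = {}
    chip = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if ":" not in line:
            # Chip header lines (e.g. "acpitz-virtual-0") contain no colon.
            chip = line
            readings[chip] = {}
            continue
        match = READING.match(line)
        if match and chip is not None:
            label, value, _unit = match.groups()
            readings[chip][label] = float(value)
    return readings


def looks_stuck(readings):
    """True if acpitz reports exactly 48.0 while fan1 is spinning."""
    temp = readings.get("acpitz-virtual-0", {}).get("temp1")
    fan = readings.get("thinkpad-isa-0000", {}).get("fan1")
    return temp == 48.0 and fan is not None and fan > 0
```

The input text could come from `subprocess.run(["sensors"], capture_output=True, text=True).stdout`. The `fan2` value of 65535 is 0xFFFF, which probably means "no reading" rather than a real speed, so the sketch ignores it.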
