Bug 45291 - high frequency causes thermal shutdown - Lenovo T410
Status: CLOSED INSUFFICIENT_DATA
Alias: None
Product: ACPI
Classification: Unclassified
Component: Power-Thermal
Hardware: x86-64 Linux
Importance: P1 high
Assignee: Zhang Rui
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2012-07-29 10:34 UTC by Raymond Wooninck
Modified: 2013-04-13 16:22 UTC

See Also:
Kernel Version: 3.4.x and 3.5.x
Subsystem:
Regression: Yes
Bisected commit-id:


Description Raymond Wooninck 2012-07-29 10:34:07 UTC
After upgrading to the 3.4 kernel, I noticed that under heavy load the kernel
gets a strange temperature reading and, based on this reading, shuts the
notebook down.

I see the following behavior:

1)  Load the notebook with gcc activity until all processors are at 100% load.
2)  Shortly afterwards, notifications appear that a thermal event has been
reached (CPUs > 100 degrees).
3)  The kernel reacts and shuts down the system.
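
The steps above can be approximated with a small script. This is a rough sketch, not what the reporter ran: it substitutes plain busy loops for the gcc workload, and it assumes that /proc/acpi/ibm/thermal (provided by thinkpad_acpi) contains a "temperatures:" line of whitespace-separated readings in degrees Celsius. The helper names max_temp and load_and_watch are made up for illustration.

```shell
#!/bin/sh
# Sketch only: helper names and defaults are hypothetical, not from the report.

# Print the highest reading from a thinkpad_acpi-style thermal file.
# Assumed format: "temperatures:<TAB>62 48 0 -128 ..." (degrees Celsius).
max_temp() {
    awk '/^temperatures:/ {
             max = $2 + 0
             for (i = 3; i <= NF; i++)
                 if ($i + 0 > max) max = $i + 0
             print max
         }' "${1:-/proc/acpi/ibm/thermal}"
}

# Run one busy loop per online CPU for N seconds (default 60) and print the
# highest temperature once per second. Busy loops stand in for the gcc build.
# Usage: load_and_watch [seconds] [thermal-file]
load_and_watch() {
    secs="${1:-60}"
    pids=""
    # Make sure the busy loops are killed even if the script is interrupted.
    trap 'kill $pids 2>/dev/null' EXIT
    trap 'exit 130' INT TERM
    n=$(getconf _NPROCESSORS_ONLN)
    i=0
    while [ "$i" -lt "$n" ]; do
        ( while :; do :; done ) &
        pids="$pids $!"
        i=$((i + 1))
    done
    t=0
    while [ "$t" -lt "$secs" ]; do
        echo "t=${t}s max=$(max_temp "$2")C"
        sleep 1
        t=$((t + 1))
    done
    kill $pids 2>/dev/null
    trap - EXIT INT TERM
}
```

Calling load_and_watch 120 would load all CPUs for two minutes. On an affected machine this may trigger the shutdown described in step 3, so save any open work first.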

Strangely enough, the same task easily survives on a 3.3 kernel, and under
Windows I do not have any problems either. Checking bugs.kernel.org, I found
a report of a similar effect, which suggested loading the thermal module with
nocrt=1 so that the action on the first trip-point is not initiated. This
indeed helps a lot: even under high load the notebook doesn't seem to get
that hot, and no shutdown is initiated.

The notebook is a Lenovo T410 with an Intel Core i5. Running with the nocrt=1
parameter or switching back to the 3.3 kernel resolves the issue.
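
For reference, the nocrt=1 workaround can be applied in two usual ways: as a modprobe option when thermal is built as a module, or as thermal.nocrt=1 on the kernel command line. The sketch below assumes the standard modprobe.d "options" syntax; the file name thermal-nocrt.conf and both helper names are arbitrary choices for illustration, and regenerating the initrd afterwards is distribution-specific and not shown.

```shell
# Sketch only: helper names and the config file name are hypothetical.

# Write a modprobe.d fragment that sets nocrt=1 for the thermal module.
# Writing to the default path needs root.
write_nocrt_conf() {
    conf="${1:-/etc/modprobe.d/thermal-nocrt.conf}"
    printf 'options thermal nocrt=1\n' > "$conf"
}

# Succeed if the given kernel command line (default: the running kernel's)
# contains thermal.nocrt=1 as a separate argument.
has_nocrt_cmdline() {
    grep -Eq '(^| )thermal\.nocrt=1( |$)' "${1:-/proc/cmdline}"
}
```

Note that nocrt=1 only suppresses the kernel's reaction to the trip point. It does not lower the temperature.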

I did some more testing and found something quite interesting. I took a vanilla 3.5.x kernel from my distribution, regenerated the initrd without the nocrt=1 parameter, and rebooted. After the reboot I again put all 4 cores under 100% load. Strangely enough, the notebook kept running, with the fans spinning and blowing out hot air. The output of /proc/acpi/ibm/thermal showed a temperature of around 62 degrees. The task completed successfully and the temperature never rose above 65 degrees. What I noticed, however, is that all 4 cores (2.4GHz) were running at 2.390GHz.

Checking the loaded modules showed that the acpi_cpufreq module was not loaded. This is a known bug that was resolved for openSUSE (see https://bugzilla.novell.com/show_bug.cgi?id=756085). I loaded the acpi_cpufreq module manually and ran the same load again. This time I got the same behavior as with the desktop version: after a couple of seconds of load, the notebook shut down because a critical temperature had been reached. This was confirmed by a temperature of 101 degrees reported in /proc/acpi/ibm/thermal.
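
The two states compared above (no cpufreq driver bound versus acpi_cpufreq loaded) can be told apart from sysfs. The following is a minimal sketch, assuming the usual /sys/devices/system/cpu/cpuN/cpufreq/ layout with scaling_driver and scaling_cur_freq (in kHz); the function name cpufreq_report is made up here and is not something the reporter used.

```shell
# Sketch only: cpufreq_report is a hypothetical helper name.

# For each CPU, print the bound cpufreq driver and current frequency in MHz,
# or "driver=none" when the CPU has no cpufreq directory.
cpufreq_report() {
    base="${1:-/sys/devices/system/cpu}"
    for cpu in "$base"/cpu[0-9]*; do
        [ -d "$cpu" ] || continue
        name=${cpu##*/}
        if [ -r "$cpu/cpufreq/scaling_driver" ]; then
            drv=$(cat "$cpu/cpufreq/scaling_driver")
            khz=$(cat "$cpu/cpufreq/scaling_cur_freq")
            echo "$name driver=$drv freq=$((khz / 1000))MHz"
        else
            echo "$name driver=none"
        fi
    done
}
```

Running cpufreq_report once before and once after "modprobe acpi_cpufreq" would show whether the driver is bound and what frequency each CPU reports under load.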

This seems to be a regression caused by the changes made to the acpi_cpufreq sources in 3.4.
Comment 1 Len Brown 2013-01-29 04:14:42 UTC
My guess is that you have a fan full of dust.
When you clean it out, you'll not be able to reproduce this bug.
(so don't clean it out till we fix the bug:-)

I also venture that cpufreq and turbo mode are working properly now,
and that it was "just luck" that they were previously screwed up and
not running properly, so that you ran artificially slow and thus didn't
run into the thermal issue.  But let's check...

Note that thermal.nocrt=1 should simply disable the _action_ on hitting
hot and critical trip points.  Keep this parameter in place.

Please show the output from
grep . /sys/class/thermal/*/*
or if you have one...
grep . /proc/acpi/thermal_zone/*/*

The question is whether you have a passive trip point below
the critical trip point where we should have throttled
to prevent going critical.  My guess is that you do,
and that Windows responded better to it than Linux did.
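
The passive-versus-critical question can also be answered mechanically from the same sysfs files. This is a minimal sketch, assuming each zone under /sys/class/thermal exposes trip_point_N_type and trip_point_N_temp (in millidegrees Celsius); the function name trip_report is hypothetical.

```shell
# Sketch only: trip_report is a hypothetical helper name.

# Print every trip point per thermal zone, then say whether the zone has a
# passive trip point below its critical one.
trip_report() {
    base="${1:-/sys/class/thermal}"
    for zone in "$base"/thermal_zone*; do
        [ -d "$zone" ] || continue
        zname=${zone##*/}
        crit=""
        passive=""
        for tf in "$zone"/trip_point_*_type; do
            [ -r "$tf" ] || continue
            kind=$(cat "$tf")
            temp=$(cat "${tf%_type}_temp")
            echo "$zname $kind $((temp / 1000))C"
            case "$kind" in
                critical) crit=$temp ;;
                passive)  passive=$temp ;;
            esac
        done
        if [ -n "$crit" ] && [ -n "$passive" ] && [ "$passive" -lt "$crit" ]; then
            echo "$zname passive-below-critical=yes"
        else
            echo "$zname passive-below-critical=no"
        fi
    done
}
```

A "no" result would mean there is no passive trip point at which throttling could start before the critical temperature is reached.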

If you attach the output from acpidump,
that may also be helpful.

Get turbostat from the kernel source tree, tools/power/x86/turbostat/
and use it to monitor temperature and frequency.

Please invoke it with the -v option to show what frequency range
this processor has, and then show its output with
and without acpi-cpufreq loaded.
Comment 2 Zhang Rui 2013-03-08 06:01:46 UTC
Hi, Raymond,
please follow Len's suggestion in comment #1.
And please check if the problem still exists in the latest upstream kernel, say 3.9-rc1.
Comment 3 Zhang Rui 2013-03-27 07:33:02 UTC
ping ...
Comment 4 Zhang Rui 2013-04-13 16:22:13 UTC
Bug closed as there is no response from the bug reporter.
Please feel free to re-open it if you can reproduce the problem again.
