After upgrading to the 3.4 kernel, I noticed that under heavy load the kernel gets a strange temperature reading and, based on this reading, shuts the notebook down. I see the following behavior:

1) Load the notebook with gcc activity until all processors are at 100% load.
2) Shortly afterwards, notifications appear that a thermal event has been reached (CPUs > 100 degrees).
3) The kernel reacts and shuts down the system.

Strangely enough, the same task survives easily with a 3.3 kernel, and under Windows I do not have any problems either.

Checking bugs.kernel.org, I found a report about this effect. It suggested loading the thermal module with nocrt=1 so that the action on the first trip point is not initiated. This indeed helps a lot, and it seems that even under high load the notebook doesn't get that hot. Also, no shutdown is initiated. The notebook is a Lenovo T410 with an Intel i5. Running with the parameter or switching back to the 3.3 kernel resolves the issue.

I did some more testing and found out something quite interesting. I got a vanilla 3.5.x kernel from my distribution, regenerated the initrd without the nocrt=1 parameter, and rebooted. After the reboot I put all 4 cores under 100% load again. Strangely enough, the notebook kept running, with the fans spinning and blowing out hot air. The output of /proc/acpi/ibm/thermal indicated a temperature of around 62 degrees. The task completed successfully and the temperature never got above 65 degrees.

What I noticed, however, is that all 4 cores (2.4 GHz) were running at 2.390 GHz. Checking the loaded modules, it appeared that the acpi_cpufreq module was not loaded. This is a known bug which was resolved for openSUSE (see https://bugzilla.novell.com/show_bug.cgi?id=756085). I loaded the acpi_cpufreq module manually and ran the same load again. This time I got the same behavior as with the desktop version: after a couple of seconds of load, the notebook issued a shutdown because a critical temperature had been reached. This was confirmed by a temperature of 101 degrees in /proc/acpi/ibm/thermal. This seems to be a regression from the changes made in 3.4 to the acpi_cpufreq sources.
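The two manual checks in the report (reading /proc/acpi/ibm/thermal and checking whether acpi_cpufreq is loaded) could be scripted roughly as follows. This is only a sketch: the function names are made up for illustration, the "temperatures:" line format is assumed to be what thinkpad_acpi prints, and treating values of 0 or below as unused sensors is an assumption as well. Note that a driver built into the kernel would not show up in /proc/modules.

```python
from pathlib import Path


def parse_ibm_thermal(text):
    """Return sensor readings (degrees C) from /proc/acpi/ibm/thermal text.

    Assumes a line of the form "temperatures:<TAB>62 0 -128 45 ...".
    """
    for line in text.splitlines():
        if line.startswith("temperatures:"):
            values = [int(v) for v in line.split(":", 1)[1].split()]
            # Assumption: values <= 0 (for example -128) mark unused sensors.
            return [v for v in values if v > 0]
    return []


def module_loaded(name, modules_text):
    """Check whether a module name appears in /proc/modules-style text."""
    return any(
        line.split()[0] == name
        for line in modules_text.splitlines()
        if line.strip()
    )


def main():
    thermal = Path("/proc/acpi/ibm/thermal")
    modules = Path("/proc/modules")
    if thermal.exists():
        temps = parse_ibm_thermal(thermal.read_text())
        print("hottest sensor:", max(temps) if temps else "n/a")
    if modules.exists():
        loaded = module_loaded("acpi_cpufreq", modules.read_text())
        print("acpi_cpufreq loaded:", loaded)


if __name__ == "__main__":
    main()
```

Running it once before and once after loading acpi_cpufreq, while the compile job is active, would give the two data points described above.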
My guess is that you have a fan full of dust. When you clean it out, you'll not be able to reproduce this bug. (So don't clean it out till we fix the bug :-)

I also venture that cpufreq and turbo mode are working properly now, and that it was "just luck" that they were previously screwed up, so that you ran artificially slow and thus didn't run into the thermal issue before. But let's check...

Note that thermal.nocrt=1 should simply disable the _action_ on hitting hot and critical trip points. Keep this parameter in place.

Please show the output from

    grep . /sys/class/thermal/*/*

or, if you have one...

    grep . /proc/acpi/thermal_zone/*/*

The question is whether you have a passive trip point below the critical trip point, where we should have throttled to prevent going critical. My guess is that you do, and that Windows responded better to it than Linux did.

If you attach the output from acpidump, that may also be helpful.

Get turbostat from the kernel source tree, tools/power/x86/turbostat/, and use it to monitor temperature and frequency. Please invoke it with the -v option to show what frequency range this processor has, and then show its output with and without acpi-cpufreq loaded.
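The passive-versus-critical question can also be checked directly from the sysfs thermal files, rather than read off the grep output. The sketch below assumes the usual trip_point_N_type / trip_point_N_temp files under each /sys/class/thermal/thermal_zone* directory, with temperatures in millidegrees Celsius. The helper names are hypothetical and this is not part of any kernel tool.

```python
from pathlib import Path


def read_trip_points(zone_dir):
    """Return a list of (type, temp_in_degrees_C) for one thermal zone."""
    zone = Path(zone_dir)
    trips = []
    for type_file in sorted(zone.glob("trip_point_*_type")):
        temp_file = zone / type_file.name.replace("_type", "_temp")
        if not temp_file.exists():
            continue
        kind = type_file.read_text().strip()
        # sysfs reports trip temperatures in millidegrees Celsius.
        temp_c = int(temp_file.read_text().strip()) / 1000.0
        trips.append((kind, temp_c))
    return trips


def passive_below_critical(trips):
    """True if some passive trip point lies below the lowest critical one."""
    critical = [t for kind, t in trips if kind == "critical"]
    passive = [t for kind, t in trips if kind == "passive"]
    if not critical or not passive:
        return False
    return min(passive) < min(critical)


def main():
    base = Path("/sys/class/thermal")
    if not base.is_dir():
        return
    for zone in sorted(base.glob("thermal_zone*")):
        try:
            trips = read_trip_points(zone)
        except (OSError, ValueError):
            continue  # some zones refuse reads or report odd values
        print(zone.name, trips, "passive below critical:",
              passive_below_critical(trips))


if __name__ == "__main__":
    main()
```

If a zone reports no passive trip point at all, the function returns False, which would suggest there is nothing for the kernel to throttle on before the critical point is reached.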
Hi Raymond, please follow Len's suggestion in comment #1. Also, please check whether the problem still exists in the latest upstream kernel, say 3.9-rc1.
ping ...
Bug closed, as there has been no response from the bug reporter. Please feel free to re-open it if you can reproduce the problem again.