Subject    : 2.6.27-rc1: critical thermal shutdown on thinkpad x60
Submitter  : Pavel Machek <pavel@suse.cz>
Date       : 2008-08-06 9:02
References : http://marc.info/?l=linux-kernel&m=121802744007517&w=4
Handled-By : Andi Kleen <andi@firstfloor.org>

This entry is being used for tracking a regression from 2.6.26. Please don't close it until the problem is fixed in the mainline.
Aug 6 11:00:10 amd kernel: Critical temperature reached (128 C), shutting down.

This could be a shutdown trigger.
Then the temperature is not a sensor, but just a variable as a switch for the BIOS developer to shut down the machine gracefully. I saw this several times and the high critical temp without a passive trip point points to such a construct.

Just guessing for now, acpidump would be great, even better acpi.debug_level=0x1F or 0x21F when the machine shuts down.
> even better acpi.debug_level=0x1F or 0x21F when the machine shuts down.

I mean in addition to acpidump. acpidump and dmesg are a must before anyone can start looking at this at all.
> ------- Comment #1 from trenn@suse.de 2008-08-12 02:53 -------
> Aug 6 11:00:10 amd kernel: Critical temperature reached (128 C),
> shutting down.
>
> This could be a shutdown trigger.
> Then the temperature is not a sensor, but just a variable as a switch for the
> BIOS developer to shut down the machine gracefully. I saw this several times
> and the high critical temp without a passive trip point points to such a
> construct.

No, it does not seem so. This machine has two thermal zones, and both seem to be reporting reasonable (and similar) temperatures normally.

root@amd:~# cat /proc/acpi/thermal_zone/THM0/*
<setting not supported>
<polling disabled>
state: ok
temperature: 53 C
critical (S5): 127 C
root@amd:~#

...but when I load the machine under -rc2, THM0 and THM1 go up to 95C or something, and then THM0 goes to 128C suddenly.

Basic reason seems to be that fan is running too slow. OTOH fan is controlled by hardware, so...

								Pavel
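To compare the reported temperature against the critical trip point without reading the numbers by eye, the zone dump above can be parsed. This is only a minimal sketch: the sample text is copied from the report, while the function names `parse_zone` and `margin_to_critical` are hypothetical helpers, not part of any existing tool. On a live system the text would come from the files under /proc/acpi/thermal_zone/THM0/.

```python
import re

# Sample text copied from the report; a live system would read the
# files under /proc/acpi/thermal_zone/THM0/ instead.
SAMPLE = """\
<setting not supported>
<polling disabled>
state: ok
temperature: 53 C
critical (S5): 127 C
"""


def parse_zone(text):
    """Return (temperature_c, critical_c) parsed from a zone dump."""
    temp = re.search(r"^temperature:\s*(\d+) C", text, re.MULTILINE)
    crit = re.search(r"^critical \(S5\):\s*(\d+) C", text, re.MULTILINE)
    if temp is None or crit is None:
        raise ValueError("temperature or critical trip point not found")
    return int(temp.group(1)), int(crit.group(1))


def margin_to_critical(text):
    """Degrees Celsius left before the critical trip point is reached."""
    temp, crit = parse_zone(text)
    return crit - temp


if __name__ == "__main__":
    print(parse_zone(SAMPLE), margin_to_critical(SAMPLE))
```

With the idle reading from the report there is a 74 C margin; the shutdown message at 128 C is one degree above the 127 C critical trip point.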
Thomas tried to debug this, here is my reply:

> Hmm, relatively obvious is the Warning after resume?
> Does this only happen after suspend?

No, it happens after fresh boot, too.

> Any way to trigger this?

Do you have thinkpad x60 near you? Run two

	while true; do echo -n; done

loops on 2.6.27-rc2. (Or were you asking about triggering one specific warning?)

> It seems to be some real HW accessed, at least it is
> an EC byte read for this zone's temp.
> If this is a regression then likely to be related with an EC change.

Actually, I believe that critical shutdown works as designed -- I believe I have seen it once after doing something really stupid like leaving the thinkpad in direct sun with the lid closed. I actually want to try to reproduce the shutdown on 2.6.26 after forcing the fan off.

> In the _TMP function of the first thermal zone it is likely that
> one of these two code paths is hit. 0x80 should evaluate to a temp of 128C.
> But this again depends on EC reads...
> If (Local2)
> {
>     Return (C2K (0x80))
> }
>
> If (LNot (\_SB.PCI0.LPC.EC.HKEY.DHKC))
> {
>     If (Local1)
>     {
>         Return (C2K (0x80))
>     }
> }
>
> The first one is the temperature of our affected thermal zone.
> It may happen when sensors are used now, that the same temperature
> (or related values) are read from EC. While EC should always return
> sane values, maybe you get wrong ones when reading too often or
> uncoordinated?

No, this is real overheat. Readings are very consistent, and values go to 95C range. Machine is hot, too, and fails to start after critical shutdown.
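The arithmetic behind "0x80 should evaluate to a temp of 128C" can be sketched as follows. ACPI's _TMP returns tenths of a kelvin, so the sketch assumes C2K converts degrees Celsius to that unit; the DSDT body of C2K is not shown in this thread, so `c2k` below is a hypothetical stand-in, and `deci_kelvin_to_celsius` and `is_critical` are illustrative names rather than kernel functions.

```python
# 273.2 K expressed in tenths of a kelvin, the unit _TMP uses.
KELVIN_OFFSET_DECI = 2732


def c2k(celsius):
    """Hypothetical stand-in for the DSDT's C2K helper (assumption)."""
    return celsius * 10 + KELVIN_OFFSET_DECI


def deci_kelvin_to_celsius(deci_kelvin):
    """Convert a _TMP-style value back to whole degrees Celsius."""
    return (deci_kelvin - KELVIN_OFFSET_DECI) // 10


def is_critical(tmp_deci_kelvin, critical_c):
    """True when the reported temperature reaches the critical trip point."""
    return deci_kelvin_to_celsius(tmp_deci_kelvin) >= critical_c


if __name__ == "__main__":
    tmp = c2k(0x80)
    print(tmp, deci_kelvin_to_celsius(tmp), is_critical(tmp, 127))
```

Under that assumption, either Return (C2K (0x80)) path yields 128 C, which is just above the 127 C critical (S5) trip point, while a reading in the 95 C range is not.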
> No, this is real overheat. Readings are very consistent, and values go
> to 95C range. Machine is hot, too, and fails to start after critical
> shutdown.

It looks like the trip points don't work. I reproduced it on my hp nx6325 once, after a resume from hibernation. I was running

$ watch cat /proc/acpi/fan/C3*/state /proc/acpi/thermal_zone/TZ*/temperature

in one xterm while I ran 'while true; do echo -n; done' in two other xterms, the trip points didn't trigger and the passive cooling went on really quickly (it throttles the box down to its knees).

However, this is not readily reproducible on my box.

Pavel, can you please check if the state of the fan(s) changes as the trip points are being passed?
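The check requested above (does the fan state change as trip points are passed?) could be run over logged samples rather than watched live. This is a rough sketch only: the sample values and trip points in it are made up for illustration, `check_crossings` is a hypothetical helper, and a real fan may react with some delay that this simple comparison ignores.

```python
def check_crossings(samples, trip_points):
    """Report the fan state at each upward trip-point crossing.

    samples: list of (temp_c, fan_on) tuples in time order.
    trip_points: list of trip temperatures in degrees Celsius.
    Returns a list of (trip, fan_on_after_crossing) tuples.
    """
    results = []
    for (prev_temp, _), (cur_temp, cur_fan) in zip(samples, samples[1:]):
        for trip in trip_points:
            # An upward crossing: below the trip before, at or above it now.
            if prev_temp < trip <= cur_temp:
                results.append((trip, cur_fan))
    return results


if __name__ == "__main__":
    # Illustrative data only, not taken from the report.
    log = [(50, False), (62, False), (75, False), (96, False)]
    for trip, fan_on in check_crossings(log, [60, 90]):
        print(f"crossed {trip} C, fan on: {fan_on}")
```

A crossing reported with the fan still off would match the symptom described here, where the trip points did not trigger.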
IMO, this is the same bug introduced by a1531acd43310a7e4571d52e8846640667f4c74b
Handled-By : Milan Broz <mbroz@redhat.com>
Patch      : http://lkml.org/lkml/2008/8/13/141
Fixed by: http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commit;h=9f497bcc695fb828da023d74ad3c966b1e58ad21