Bug 11281

Summary: 2.6.27-rc1: critical thermal shutdown on thinkpad x60
Product: ACPI
Reporter: Rafael J. Wysocki (rjw)
Component: Power-Thermal
Assignee: acpi_power-thermal
Status: CLOSED CODE_FIX    
Severity: normal
CC: acpi-bugzilla, andi-bz, astarikovskiy, bunk, gmazyland, pavel, rui.zhang, trenn
Priority: P1    
Hardware: All   
OS: Linux   
Kernel Version: 2.6.27-rc1
Subsystem:
Regression: Yes
Bisected commit-id:
Bug Depends on:    
Bug Blocks: 11167    

Description Rafael J. Wysocki 2008-08-07 14:10:05 UTC
Subject    : 2.6.27-rc1: critical thermal shutdown on thinkpad x60
Submitter  : Pavel Machek <pavel@suse.cz>
Date       : 2008-08-06 9:02
References : http://marc.info/?l=linux-kernel&m=121802744007517&w=4
Handled-By : Andi Kleen <andi@firstfloor.org>

This entry is being used for tracking a regression from 2.6.26.  Please don't
close it until the problem is fixed in the mainline.
Comment 1 Thomas Renninger 2008-08-12 02:53:43 UTC
Aug  6 11:00:10 amd kernel: Critical temperature reached (128 C),
shutting down.

This could be a shutdown trigger.
In that case the temperature is not a sensor reading but just a variable the BIOS developer uses as a switch to shut the machine down gracefully. I have seen this several times, and a high critical temperature without a passive trip point points to such a construct.
Just guessing for now; an acpidump would be great, and even better would be output with acpi.debug_level=0x1F or 0x21F when the machine shuts down.
Comment 2 Thomas Renninger 2008-08-12 02:55:41 UTC
> even better acpi.debug_level=0x1F or 0x21F when the machine shuts down.
I mean in addition to acpidump. acpidump and dmesg are a must to even start looking at this.
Comment 3 Pavel Machek 2008-08-12 03:12:01 UTC
> ------- Comment #1 from trenn@suse.de  2008-08-12 02:53 -------
> Aug  6 11:00:10 amd kernel: Critical temperature reached (128 C),
> shutting down.
> 
> This could be a shutdown trigger.
> Then the temperature is not a sensor, but just a variable as a switch for the
> BIOS developer to shut down the machine gracefully. I saw this several times
> and the high critical temp without a passive trip point points to such a
> construct.

No, it does not seem so. This machine has two thermal zones, and both
seem to be reporting reasonable (and similar) temperatures normally. 

root@amd:~# cat /proc/acpi/thermal_zone/THM0/*
<setting not supported>
<polling disabled>
state:                   ok
temperature:             53 C
critical (S5):           127 C
root@amd:~#

...but when I load the machine under -rc2, THM0 and THM1 go up to 95C
or something, and then THM0 goes to 128C suddenly.

Basic reason seems to be that fan is running too slow. OTOH fan is
controlled by hardware, so... 
								Pavel
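
For anyone logging these readings, here is a minimal Python sketch that parses the concatenated /proc/acpi/thermal_zone/THM0/* output shown above and reports the headroom to the critical trip point. The function names parse_zone and headroom are illustrative, not part of any kernel tool, and the layout is assumed to match the sample in this comment.

```python
import re

# Sample taken verbatim from the THM0 output quoted in this comment.
SAMPLE = """\
<setting not supported>
<polling disabled>
state:                   ok
temperature:             53 C
critical (S5):           127 C
"""

def parse_zone(text):
    """Extract the current and critical temperatures (degrees C) from the text."""
    fields = {}
    for line in text.splitlines():
        # Matches "temperature:  53 C" and "critical (S5):  127 C".
        m = re.match(r"^(temperature|critical)\b[^:]*:\s*(-?\d+)\s*C\s*$", line)
        if m:
            fields[m.group(1)] = int(m.group(2))
    return fields

def headroom(fields):
    """Degrees C remaining before the critical trip point is reached."""
    return fields["critical"] - fields["temperature"]

zone = parse_zone(SAMPLE)
print(zone, headroom(zone))
```

On a live system the text would come from reading the files under the zone directory instead of SAMPLE.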
Comment 4 Pavel Machek 2008-08-12 05:50:06 UTC
Thomas tried to debug this, here is my reply:

> Hmm, relatively obvious is the Warning after resume?
> Does this only happen after suspend?

No, it happens after fresh boot, too.

> Any way to trigger this?

Do you have thinkpad x60 near you? Run two while true; do echo -n;
done loops on 2.6.27-rc2. (Or were you asking about triggering one
specific warning?)

> It seems to be some real HW being accessed; at least it is
> an EC byte read for this zone's temp.
> If this is a regression, then it is likely related to an EC change.

Actually, I believe that critical shutdown works as designed -- I
believe I have seen it once after doing something really stupid like
leaving the thinkpad in direct sun with the lid closed.

I actually want to try to reproduce the shutdown on 2.6.26 after forcing fan off.

> In the _TMP function of the first thermal zone it is likely that
> one of these two code paths is hit. 0x80 should evaluate to a temp of 128C.
> But this again depends on EC reads...
>                 If (Local2)
>                 {
>                     Return (C2K (0x80))
>                 }
>
>                 If (LNot (\_SB.PCI0.LPC.EC.HKEY.DHKC))
>                 {
>                     If (Local1)
>                     {
>                         Return (C2K (0x80))
>                     }
>                 }
>
> The first one is the temperature of our affected thermal zone.
> It may happen when sensors are used now, that the same temperature
> (or related values) are read from EC. While EC should always return
> sane values, maybe you get wrong ones when reading too often or
> uncoordinated?

No, this is real overheat. Readings are very consistent, and values go
to 95C range. Machine is hot, too, and fails to start after critical shutdown.
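
The arithmetic behind the quoted Return (C2K (0x80)) paths can be sketched in Python. This assumes C2K converts degrees Celsius to the tenths-of-Kelvin units that ACPI _TMP returns; the body of C2K is not shown in this report, so the formula below is an assumption. The kernel-side conversion is likewise a simplified stand-in, and the 127 C limit is the critical (S5) value from comment 3.

```python
def c2k(celsius):
    """Assumed behaviour of the DSDT helper: Celsius -> tenths of Kelvin."""
    return celsius * 10 + 2732

def deci_kelvin_to_celsius(deci_kelvin):
    """Simplified conversion back to whole degrees C (valid for values >= 0 C)."""
    return (deci_kelvin - 2732 + 5) // 10

def is_critical(temp_c, critical_c):
    """Assumed check: shut down once the reading reaches the critical trip point."""
    return temp_c >= critical_c

raw = c2k(0x80)                       # value _TMP would return on those paths
temp_c = deci_kelvin_to_celsius(raw)  # what the log message would then report
print(raw, temp_c, is_critical(temp_c, 127))
```

Under these assumptions, either of the two code paths yields a reading one degree above the 127 C critical trip point, which is consistent with the "Critical temperature reached (128 C)" message.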
Comment 5 Rafael J. Wysocki 2008-08-12 07:17:04 UTC
> No, this is real overheat. Readings are very consistent, and values go
> to 95C range. Machine is hot, too, and fails to start after critical
> shutdown.

It looks like the trip points don't work.

I reproduced it on my hp nx6325 once, after a resume from hibernation.  I was running

$ watch cat /proc/acpi/fan/C3*/state /proc/acpi/thermal_zone/TZ*/temperature

in one xterm while I ran 'while true; do echo -n; done' in two other xterms. The trip points didn't trigger and passive cooling kicked in really quickly (it throttles the box down to its knees). However, this is not readily reproducible on my box.

Pavel, can you please check if the state of the fan(s) changes as the trip points are being passed?
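
One way to make that check systematic is to log (temperature, fan state) samples and flag every trip point that was crossed without a change in fan state. The Python sketch below is only an illustration: the function name, the trip point values, and the fan state strings are hypothetical and are not taken from either machine.

```python
def crossings_without_fan_change(samples, trip_points):
    """Find upward trip-point crossings during which the fan state did not change.

    samples: list of (temp_c, fan_state) tuples in time order.
    trip_points: trip temperatures in degrees C (hypothetical values).
    Returns a list of (trip, sample_index) pairs.
    """
    missed = []
    for i in range(1, len(samples)):
        prev_temp, prev_fan = samples[i - 1]
        cur_temp, cur_fan = samples[i]
        for trip in trip_points:
            crossed = prev_temp < trip <= cur_temp
            if crossed and prev_fan == cur_fan:
                missed.append((trip, i))
    return missed

# Hypothetical log: temperature climbs past both trips, fan never reacts.
log = [(53, "off"), (70, "off"), (95, "off"), (128, "off")]
print(crossings_without_fan_change(log, [65, 85]))
```

An empty result means the fan state changed at every crossing in the log. A non-empty result lists the trip points that appear to have been ignored.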
Comment 6 Zhang Rui 2008-08-12 18:09:48 UTC
IMO, this is the same bug introduced by a1531acd43310a7e4571d52e8846640667f4c74b
Comment 7 Adrian Bunk 2008-08-13 03:43:30 UTC
Handled-By      : Milan Broz <mbroz@redhat.com>
Patch           : http://lkml.org/lkml/2008/8/13/141