Bug 8884

Summary: Hardware shutoff at 83C - HP OmniBook XE3 GD
Product: ACPI Reporter: Pavel Machek (pavel)
Component: Power-Fan Assignee: Len Brown (lenb)
Status: CLOSED CODE_FIX    
Severity: normal CC: acpi-bugzilla
Priority: P1    
Hardware: All   
OS: Linux   
Kernel Version: 2.6.22 Subsystem:
Regression: Yes Bisected commit-id:
Attachments: acpidump
dmesg
tar of acpi/thermal
dmidecode
patch vs 2.6.23-rc3 creating thermal.crt=C

Description Pavel Machek 2007-08-13 05:38:59 UTC
Most recent kernel where this bug did not occur: 2.6.21
Distribution:
Hardware Environment:
Software Environment:
Problem Description:

Steps to reproduce:
Comment 1 Len Brown 2007-08-13 06:47:18 UTC
Please describe how the failing machine fails,
attach the output from acpidump, dmesg -s64000,
dmidecode, and the contents of /proc/acpi/thermal_zone/*/*

In particular, please note if /proc/acpi/thermal_zone/*/polling_frequency
is non-zero, and any change in behaviour when it is zero.
Comment 2 Pavel Machek 2007-08-13 08:55:54 UTC
Oops, sorry, I forgot to add description of the bug.

Done, xe3 was re-built from parts.

/proc/acpi/.../trip_points:
critical (S5):          100 C
passive:                83 C...
active[0]:              100 C...

(hmm, active=critical? Interesting. Fortunately fan seems to be driven
by BIOS).

Temperature is ~63 C in "normal" use. Now let's simulate fan failure...
and let's load the cpu...

temperature slowly rises, 1min00 -- 72C, 1min15 -- 75C, 1min30 --
77C, 1min45 -- 80C, 1min00 -- 82C, 1min15 -- 83C, 1min45 -- sudden
powerdown, presumably because of hardware failsafe.

So we have two bugs here: machine should have attempted to use passive
cooling sooner, so that critical temperature would not be reached, and
machine should have attempted shutdown before hardware failsafe killed
the power. In 2.6.21 I could do both by echoing new trip points and
enabling polling.


Polling frequency is 2 seconds.
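
A timeline like the one above can be recorded with a small sampling loop. The sketch below is not part of the original report. It assumes the temperature file uses the "temperature:   69 C" layout shown in this bug, and the function names read_temp and log_temp are made up for illustration.

```shell
# Sketch only: sample an ACPI thermal zone file at a fixed interval.
# read_temp and log_temp are hypothetical helper names.

# read_temp FILE: print the integer from a line such as
# "temperature:             69 C".
read_temp() {
    awk '/^temperature:/ { print $2 }' "$1"
}

# log_temp FILE INTERVAL COUNT: print "<elapsed>s -- <temp>C" COUNT times,
# sleeping INTERVAL seconds between samples.
log_temp() {
    file=$1
    interval=$2
    count=$3
    elapsed=0
    i=0
    while [ "$i" -lt "$count" ]; do
        printf '%ss -- %sC\n' "$elapsed" "$(read_temp "$file")"
        i=$((i + 1))
        if [ "$i" -lt "$count" ]; then
            sleep "$interval"
        fi
        elapsed=$((elapsed + interval))
    done
}
```

Possible use on this machine, sampling every 15 seconds for three minutes: log_temp /proc/acpi/thermal_zone/THRM/temperature 15 13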
Comment 3 Pavel Machek 2007-08-13 08:56:58 UTC
Created attachment 12360
acpidump
Comment 4 Pavel Machek 2007-08-13 08:57:17 UTC
Created attachment 12361
dmesg
Comment 5 Pavel Machek 2007-08-13 08:57:52 UTC
Created attachment 12362
tar of acpi/thermal
Comment 6 Pavel Machek 2007-08-13 08:59:52 UTC
Provided the files, hopefully I did not forget anything this time.
Comment 7 Pavel Machek 2007-08-13 09:01:23 UTC
(aha, and I should say that 2 seconds polling is there by default, without any tweaks).
Comment 8 Pavel Machek 2007-08-13 09:04:45 UTC
Created attachment 12363
dmidecode

of course I did forget to attach something. Fixed now.
Comment 9 Len Brown 2007-08-13 09:14:23 UTC
Assuming the system always shuts down just above 83C,
the failure at hand appears to be a thermally induced
hardware or firmware induced hardware poweroff.

This isn't a particularly high temperature.
Is BIOS SETUP running with defaults?
Are there any cooling related options in BIOS SETUP?

When the system is booted with "acpi=off", does the same shutdown occur?
How about with ACPI enabled, but CONFIG_ACPI_THERMAL=n?

> temperature slowly rises, 1min00 -- 72C, 1min15 -- 75C, 1min30 --
> 77C, 1min45 -- 80C, 1min00 -- 82C, 1min15 -- 83C, 1min45 -- sudden
> powerdown, presumably because of hardware failsafe.

At what temperature do you notice the fan start spinning?
Is it spinning at full speed when the system shuts off?

> powernow-k8: Processor cpuid 661 not supported
> powernow: PowerNOW! Technology present. Can scale: frequency and voltage.
> powernow: SGTC: 10000
> powernow: Minimum speed 300 MHz. Maximum speed 900 MHz.

Does this system fail the same way when the powernow software is disabled?
Comment 10 Anonymous Emailer 2007-08-13 13:24:01 UTC
Reply-To: pavel@ucw.cz

Hi!

> ------- Comment #9 from len.brown@intel.com  2007-08-13 09:14 -------
> Assuming the system always shuts down just above 83C,
> the failure at hand appears to be a thermally induced
> hardware or firmware induced hardware poweroff.

Yes, that was my diagnosis, too.

> This isn't a particularly high temperature.
> Is BIOS SETUP running with defaults?
> Are there any cooling related options in BIOS SETUP?

I do not think BIOS has any cooling-related options, I'll check again.

Actually, 83C is too _low_ for emergency-shutdown
temperature. Apparently, they put HDD near the CPU, and running at the
temperature near the 83C for extended period overheats the HDD,
producing read/write errors.

Will check with acpi-off, rmmod thermal, etc.
								Pavel
Comment 11 Len Brown 2007-08-13 21:52:39 UTC
> tar of acpi/thermal

The files in attachment #5 are empty
please just copy/paste their contents into bugzilla
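
One way to capture the contents as pasteable text is to cat each file with a header line instead of archiving the directory. This is a sketch added for illustration, not something from the original exchange; dump_zone is a hypothetical name. The empty archive is possibly explained by procfs files reporting a size of zero, but that is an assumption and was not confirmed in this report.

```shell
# Sketch only: print every regular file found under DIR/*/ preceded by
# a "== path" header, so the output can be pasted into a bug comment.
# dump_zone is a hypothetical helper name.
dump_zone() {
    for f in "$1"/*/*; do
        [ -f "$f" ] || continue
        printf '== %s\n' "$f"
        cat "$f"
    done
}
```

Possible use: dump_zone /proc/acpi/thermal_zone > thermal.txt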
Comment 12 Len Brown 2007-08-13 22:07:43 UTC
> Polling frequency is 2 seconds.

There is no _TZP in the DSDT, so the kernel should not enable polling
on this box.  Please disable/override whatever in user-space requests
the 2-second polling interval and report how the machine
behaves with polling disabled -- its default configuration.

Also, please kill acpid and cat /proc/acpi/event and see if changes
in temperature give any thermal events.

Re: fan control
The DSDT does show real-live fan _ON/_OFF methods.
In 2.6.21 were you able to control the state of the fan
by modifying the (100C) active trip point?
Are you able to cause the fan to start running at a specific temperature?

Are you able to change the behaviour of the fan by echo 0 or 3
into the fan state files -- and is the change in state indicated
in those files?
Comment 13 Anonymous Emailer 2007-08-14 00:49:30 UTC
Reply-To: pavel@ucw.cz


> The files in attachment #5 are empty
> please just copy/paste their contents into bugzilla

Sorry. Here they are; will find out why I have 2 seconds polling
there.

dream:/proc/acpi/thermal_zone/THRM # cat cooling_mode
cooling mode:   active
dream:/proc/acpi/thermal_zone/THRM # cat polling_frequency
polling frequency:       2 seconds
dream:/proc/acpi/thermal_zone/THRM # cat state
state:                   ok
dream:/proc/acpi/thermal_zone/THRM # cat temperature
temperature:             69 C
dream:/proc/acpi/thermal_zone/THRM # cat trip_points
critical (S5):           100 C
passive:                 83 C: tc1=2 tc2=5 tsp=300 devices=0xcffdb338
active[0]:               100 C: devices=0xcffd4e14
dream:/proc/acpi/thermal_zone/THRM #

Machine is stable if I simulate fan failure, load the cpu and rmmod
thermal and rmmod powernow_k7. I guess 300MHz cpu is not hot enough to
reach any interesting temperature. Inserting powernow_k7 kills the
machine in minute-or-so.
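
To compare the current temperature against a trip point without reading the file by eye, the trip_points output above can be parsed. The following is a sketch under the assumption that the file keeps the "label: N C" layout shown in this comment; trip_temp is a hypothetical helper, not an existing tool.

```shell
# Sketch only: print the temperature of the first trip point whose line
# starts with KIND (for example "passive", "critical" or "active[0]").
# trip_temp is a hypothetical helper name.
trip_temp() {
    awk -v kind="$2" '
        index($0, kind) == 1 {
            sub(/^[^:]*:[ \t]*/, "")   # drop the label up to the first colon
            print $1                   # first remaining field is the number
            exit
        }' "$1"
}
```

Possible use: trip_temp /proc/acpi/thermal_zone/THRM/trip_points passive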
Comment 14 Anonymous Emailer 2007-08-14 00:52:51 UTC
Reply-To: pavel@ucw.cz

Hi!

> Re: fan control
> The DSDT does show real-live fan _ON/_OFF methods.
> In 2.6.21 were you able to control the state of the fan
> by modifying the (100C) active trip point?
> Are you able to cause the fan to start running at a specific temperature?

Strange, yes, fan interface is there. But it says

dream:/proc/acpi/fan/FAN # cat state
status:                  off
dream:/proc/acpi/fan/FAN #

...and fan is clearly spinning.

> Are you able to change the behaviour of the fan by echo 0 or 3
> into the fan state files -- and is the change in state indicated
> in those files?

I can echo 0/3, and the value in state file changes, but it does not
seem to affect the physical fan. It just spins.

								Pavel
Comment 15 Anonymous Emailer 2007-08-14 00:57:34 UTC
Reply-To: pavel@ucw.cz


> Re: fan control
> The DSDT does show real-live fan _ON/_OFF methods.
> In 2.6.21 were you able to control the state of the fan
> by modifying the (100C) active trip point?
> Are you able to cause the fan to start running at a specific temperature?
> 
> Are you able to change the behaviour of the fan by echo 0 or 3
> into the fan state files -- and is the change in state indicated
> in those files?

Fan seems to have a mind of its own. It comes to life at ~60
celsius, and turns itself off at ~55. Nothing I do can control it. I
was not trying to control it in 2.6.21 because fan was physically dead
at that time... I was trying to lower passive trip points to cool
machine that way.
								Pavel 
Comment 16 Pavel Machek 2007-08-14 01:27:32 UTC
With acpi=off (and powernow-k7 loaded, fan failed, and cpu loaded) the machine fails after about 2min30 with the same thermally induced hardware poweroff.

No thermal-related settings in BIOS, but I reset the value to defaults, anyway.

Hmm, polling_frequency indeed says <disabled> on init=/bin/bash boot. Sorry for confusion, something in suse10.2 is playing with me.
Comment 17 Pavel Machek 2007-08-14 01:36:00 UTC
I tried cat /proc/acpi/event with acpid and polling disabled, but did not see anything -- but the 83C passive trip point is so close to hw shutdown that I may have missed it.
Comment 18 Len Brown 2007-08-14 11:22:49 UTC
> Machine is stable if I simulate fan failure, load the cpu and rmmod
> thermal and rmmod powernow_k7. I guess 300MHz cpu is not hot enough to
> reach any interesting temperature. Inserting powernow_k7 kills the
> machine in minute-or-so.

If you boot without powernow_k7, the machine comes up and
runs at constant 300MHz and does not fail under load --
even if you disconnect the fan?  How hot is it getting?
Can you load thermal here and verify that you are not
exceeding 83C?

Is it possible to run the same experiment, but at peak MHz?
I'd like to see powernow not loaded, if possible, or at least
not running -- say, by using the performance governor.

> Fan seems to have a mind of its own. It comes to life at ~60
> celsius, and turns itself off at ~55. Nothing I do can control it.

Apparently ACPI fan control on this system is a facade with nothing
behind it.  Certainly with no _TZP and no thermal events, Windows
will simply ignore it.  Linux should just ignore it too.   This
can be done in a pretty way in 2.6.23-rc3 via "thermal.act=-1"

But the fact that thermal events don't work suggests that Windows
will also not notice the passive trip point at 83C -- and it should
run into the same hardware malfunction that Linux runs into.

It would be useful if you can verify that Windows causes this unit to
malfunction the same way Linux does, or if Windows behaves differently.

Does more than one of these machines exist?
If so, do they all fail the same way?

The fact that the fan kicks in starting at 60C but the system
continues to heat up to 83C suggests that the thermal solution
is simply broken on this unit.

But the malfunction occurs at 83C, which happens to be the passive
trip point.  Either that is a very large coincidence
(possible, since you also get a failure with acpi=off,
though it is unclear at what temperature),
or throttling may actually be involved in provoking the failure.

What do you see if you boot 2.6.23-rc3 with "thermal.psv=70" and
heat up the system?  Does it continue to function
while keeping the temperature down to 70?  Note that since
there seem to be no thermal events on this box, you'll also
need thermal.tzp=5 or use what SuSE uses to enable polling from proc.
Comment 19 Len Brown 2007-08-14 12:55:27 UTC
Created attachment 12386
patch vs 2.6.23-rc3 creating thermal.crt=C

Please apply this patch to 2.6.23-rc3 and boot with "thermal.crt=80"
or something that will provoke a graceful shutdown before the hardware
malfunctions.  Note that since this system doesn't seem to provide any
thermal events, you'll need to invoke polling (eg. thermal.tzp=10)
to get this threshold noticed.

I hesitate to add a DMI option to invoke workarounds for this model
automatically, because the cooling failure and hardware malfunction
look like a unit failure that may not be shared with other units.
We could add a DMI entry to ignore the false active cooling hooks --
but that would be cosmetic only, since they do no harm today,
other than perhaps confuse users of this model.

Assuming the powernow thing is not related to the failure,
perhaps the (manually invoked) .psv and .crt thermal hooks
are sufficient to make this box usable?
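
After booting with overrides such as "thermal.crt=80 thermal.tzp=10", it may help to confirm that the parameters actually reached the kernel command line. The sketch below is an illustration only; param_value is a hypothetical helper, and the example values are the ones suggested in this comment. The unit of thermal.tzp is believed to be tenths of a second, following ACPI _TZP, but treat that as an assumption.

```shell
# Sketch only: look up NAME=value in a kernel command line string and
# print the value; returns non-zero if NAME is absent.
# param_value is a hypothetical helper name.
param_value() {
    # $1 is left unquoted on purpose so it splits into words.
    for word in $1; do
        case $word in
            "$2"=*)
                printf '%s\n' "${word#*=}"
                return 0
                ;;
        esac
    done
    return 1
}
```

Possible use: param_value "$(cat /proc/cmdline)" thermal.crt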
Comment 20 Len Brown 2007-08-20 13:18:51 UTC
patch in comment #19 is in acpi test tree.
Please let me know if it is insufficient to resolve this issue.
Comment 21 Len Brown 2007-08-25 21:23:58 UTC
the patch in comment #19 shipped in linux-2.6.23-rc3-git9
closed.
Comment 22 Anonymous Emailer 2007-09-05 00:24:49 UTC
Reply-To: pavel@ucw.cz

Hi!

> the patch in comment #19 shipped in linux-2.6.23-rc3-git9
> closed.

Thanks a lot, configurable critical trip point should solve the issue.