Bug 9129 - Unreasonable BIOS trip points cause critical shutdown after 4 minutes of CPUBurn
Summary: Unreasonable BIOS trip points cause critical shutdown after 4 minutes of CPUBurn
Status: CLOSED PATCH_ALREADY_AVAILABLE
Alias: None
Product: ACPI
Classification: Unclassified
Component: BIOS
Hardware: All Linux
Importance: P1 high
Assignee: Zhang Rui
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2007-10-06 10:48 UTC by Jon Becker
Modified: 2008-10-24 23:03 UTC

See Also:
Kernel Version: 2.6.22-12
Subsystem:
Regression: ---
Bisected commit-id:


Attachments
ACPIDump file (82.77 KB, application/octet-stream)
2007-10-06 10:49 UTC, Jon Becker
DSDT.dsl (160.23 KB, text/plain)
2007-10-06 10:51 UTC, Jon Becker
patch-allow-override-critical-trip-point (946 bytes, patch)
2007-10-08 23:45 UTC, Zhang Rui
patch: allow overriding critical threshold to higher value (1.06 KB, patch)
2008-06-16 20:41 UTC, Zhang Rui

Description Jon Becker 2007-10-06 10:48:23 UTC
Most recent kernel where this bug did not occur:  2.6.20
Distribution: Ubuntu 7.10 beta
Hardware Environment: Avalue ECM-945GM embedded core duo board
Software Environment:
Problem Description:
The BIOS does not seem to communicate the correct temperature trip points to the OS.
It sets the CPU temperature trip points as:
critical: 60C
passive: 50C
active: 50C

I am using the T7200 CPU (rated up to 100C, critical is 125C).

With 2.6.20 I can successfully override the critical trip point and with two copies of CPUBurn running the system stabilizes at about 67C, (at ~23C room temp), and does not suffer from shutdown.

But of course that's not an option with 2.6.22.
I would much rather be able to do that from user space.  The other options (modifying ACPI tables, turning off thermal module, preventing power down command from running), all seem pretty hokey.  Staying with the older version fixes that problem but introduces another.

There is no fan control on this embedded system; the fan is always on. High CPU performance is also a requirement.

attachments to follow.
Comment 1 Jon Becker 2007-10-06 10:49:49 UTC
Created attachment 13058
ACPIDump file
Comment 2 Jon Becker 2007-10-06 10:51:48 UTC
Created attachment 13059
DSDT.dsl

DSDT dsl file, note recompile fails with:
Intel ACPI Component Architecture
ASL Optimizing Compiler version 20061109 [May 16 2007]
Copyright (C) 2000 - 2006 Intel Corporation
Supports ACPI Specification Revision 3.0a

DSDT.dsl   373:     Method (\_WAK, 1, NotSerialized)
Warning  1079 -                 ^ Reserved method must return a value (_WAK)

DSDT.dsl   405:             Store (Local0, Local0)
Error    4049 -                         ^ Method local variable is not initialized (Local0)

DSDT.dsl   410:             Store (Local0, Local0)
Error    4049 -                         ^ Method local variable is not initialized (Local0)

ASL Input:  DSDT.dsl - 4997 lines, 164074 bytes, 1790 keywords
Compilation complete. 2 Errors, 1 Warnings, 0 Remarks, 531 Optimizations
Comment 3 Zhang Rui 2007-10-08 23:45:46 UTC
Created attachment 13087
patch-allow-override-critical-trip-point

Hi, Jon,
Are there any options in the BIOS to set trip points?
You can override them in the BIOS if the answer is yes.
If not, please apply the patch I attached and try the boot parameter "thermal.crt=xxx".

Note that the temperature of the thermal zone is not equal to the temperature 
of the processor, so a high critical trip point (like 100C, 125C) may be dangerous.
Comment 4 Zhang Rui 2007-11-06 23:13:05 UTC
Hi, Jon,
any updates on this?
can the patch work for you?
Comment 5 Zhang Rui 2007-12-09 19:01:16 UTC
Hi, Len,
thermal.crt can only be used to lower the critical trip point in the current code, and this can't solve the problem shown in this bug.
The patch in comment #3 allows userspace to override the trip point to higher temperatures as well.
Any comments on this?
Comment 6 Len Brown 2008-06-13 21:26:44 UTC
Does Windows work on this board?

Rui, I guess I'm okay with allowing higher crt thresholds
if we include a warning.  please rebase patch to tip.
Comment 7 Zhang Rui 2008-06-16 20:41:35 UTC
Created attachment 16511
patch: allow overriding critical threshold to higher value
Comment 8 Zhang Rui 2008-06-16 20:43:01 UTC
len,
patch in comment #7 is on top of 2.6.26-rc6. :)
Comment 9 Andi Kleen 2008-07-16 13:29:31 UTC
I don't see the warning Len asked for in the latest patch? Can you please add it?

I'm a little uneasy with the concept in general. If it's an embedded 
system you control why can't you change the ACPI tables? And is it useful
on a wider range of systems? Could people cook their systems with 
increasing the trip point? (I think yes)

Please someone reopen the bug, I am not allowed to do that.
Comment 10 Jon Becker 2008-07-16 13:40:43 UTC
The ACPI tables will not recompile with available tools as above.  The manufacturer hasn't / won't fix them.  I'm just a user who needs a fix, it shouldn't be my problem to fix the manufacturer's broken BIOS (beyond my ability)!
Comment 11 Zhang Rui 2008-07-16 19:01:54 UTC
(In reply to comment #9)
> I don't see the warning Len asked for in the latest patch? Can you please add
> it?
> 
-				if (crt_k < tz->trips.critical.temperature)
-					tz->trips.critical.temperature = crt_k;
+				if (crt_k > tz->trips.critical.temperature)
+					printk(KERN_WARNING PREFIX
+						"Critical threshold %d C\n", crt);
+				tz->trips.critical.temperature = crt_k;

This is the warning printed when the user tries to increase the critical threshold.
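
For illustration only, the behavioural change in the hunk above can be modelled in user space. This is a sketch, not the kernel code: `zone_model`, `celsius_to_decikelvin`, `apply_crt_lower_only` and `apply_crt_with_warning` are hypothetical names, and holding the trip point in tenths of a kelvin is an assumption based on how ACPI reports temperatures.

```c
#include <assert.h>
#include <stdio.h>

/* Assumption: trip points are held in tenths of a kelvin. */
static long celsius_to_decikelvin(long c)
{
    return c * 10 + 2732;
}

/* Hypothetical stand-in for the thermal zone's critical trip point. */
struct zone_model {
    long critical_dk; /* tenths of a kelvin */
};

/* Old behaviour: the override is applied only if it lowers the threshold. */
static void apply_crt_lower_only(struct zone_model *z, int crt_c)
{
    assert(crt_c > 0); /* only positive overrides are modelled here */
    long crt_dk = celsius_to_decikelvin(crt_c);

    if (crt_dk < z->critical_dk)
        z->critical_dk = crt_dk;
}

/*
 * Patched behaviour: the override is always applied, and a warning is
 * printed when it raises the threshold. Returns 1 if a warning was
 * printed, 0 otherwise.
 */
static int apply_crt_with_warning(struct zone_model *z, int crt_c)
{
    assert(crt_c > 0);
    long crt_dk = celsius_to_decikelvin(crt_c);
    int raised = crt_dk > z->critical_dk;

    if (raised)
        fprintf(stderr, "Critical threshold %d C\n", crt_c);
    z->critical_dk = crt_dk;
    return raised;
}
```

With a BIOS value of 60 C, a request for 100 C is silently ignored by the first function, while the second applies it and warns. A request for 50 C takes effect in both.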

> I'm a little uneasy with the concept in general. If it's an embedded 
> system you control why can't you change the ACPI tables? And is it useful
> on a wider range of systems? Could people cook their systems with 
> increasing the trip point? (I think yes)
> 
yes, they could.
I agree that overriding the critical threshold with higher values is dangerous.
But this could be used to fix a lot of other laptops, as we used to be able to.
This is a regression for the users who have a bogus critical threshold on their laptops.
Comment 12 Len Brown 2008-07-17 10:32:09 UTC
The bogus critical trip point is a BIOS bug --
apparently one that the manufacturer is unwilling to fix.
So i'm changing the category of this report to ACPI/BIOS.

No word on if Windows works properly on this board,
the assumption is that it does not.

Jon,
Does thermal.nocrt make the problem go away?

Note that re-defining the thermal trip point does not guarantee
that the EC on the board will actually trigger around when
that temperature is exceeded, so it is not necessarily a solid
solution.

Please verify that thermal.nocrt makes the issue go away.

If it does, please attach the output from dmidecode.
The practical solution may be simply to invoke "thermal.nocrt"
on this board automatically.
Comment 13 Jon Becker 2008-07-18 07:21:23 UTC
Len,

Let me understand the procedure to do this (currently on Ubuntu 8.04, 2.6.24-19).
Adding thermal.nocrt=1 as a kernel boot option does not work (feature not in this version?).

After hunting around I have attempted to modify
/etc/modprobe.d/options
adding the line
options thermal nocrt=1

but I can't seem to find a way to tell whether this is correct, other than trying the thermal stress test. I will try that, but it would be nice to know whether this is the correct way to disable the critical check.

I think disabling critical actions entirely is much more dangerous than just bumping up the critical trip temperature to a more reasonable value.
Comment 14 Zhang Rui 2008-07-20 19:08:44 UTC
(In reply to comment #13)
> I think disabling critical actions entirely is much more dangerous than just
> bumping up the critical trip temperature to a more reasonable value.
> 
agree.
Considering that we used to allow users to override the crt threshold with a higher value, the patch in comment #7 is not that bad.
Without the patch, users can only
1. wait for the computer to restart once the temperature reaches 60 C, or
2. run the risk of cooking their system with the module parameter thermal.nocrt=1.
Comment 15 Zhang Rui 2008-08-28 01:42:55 UTC
Andi, Len,

I agree that thermal.nocrt=1 is more dangerous than using a higher critical threshold here.
can you apply the patch in comment #7 please?
Comment 16 Len Brown 2008-10-16 23:45:17 UTC
Jon,
re: comment #13
if you succeed in overriding the trip point, your setting
will be visible in /proc/acpi/thermal_zone/*/trip_points

in the case of thermal.nocrt, the trip point is unchanged,
but the trip action is ignored.

In the case of thermal.crt=-1, the trip point will simply
vanish from the files above.

in the case of thermal.crt=N, it will be set to N
The change here is the ability to make N higher than the BIOS default.
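
The three cases above can be summarised in a small user-space model. This is only a sketch of the semantics as described in this comment, not the driver code: `crit_state` and `effective_critical` are hypothetical names, and treating `crt == 0` as "no override requested" is an assumption.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical summary of what the user ends up with. */
struct crit_state {
    bool trip_visible; /* critical trip point listed in trip_points */
    int  trip_c;       /* degrees C; meaningful only if trip_visible */
    bool action_taken; /* crossing the trip point triggers the critical action */
};

/*
 * bios_c: critical trip point reported by the BIOS, in degrees C
 * crt:    value of thermal.crt (assumed 0 when not set)
 * nocrt:  whether thermal.nocrt is set
 */
static struct crit_state effective_critical(int bios_c, int crt, bool nocrt)
{
    struct crit_state s = { true, bios_c, true };

    if (crt == -1) {
        /* thermal.crt=-1: the trip point vanishes altogether. */
        s.trip_visible = false;
        s.action_taken = false;
    } else if (crt > 0) {
        /* thermal.crt=N: the trip point becomes N, higher or lower. */
        s.trip_c = crt;
    }

    if (nocrt) {
        /* thermal.nocrt: the trip point is unchanged, the action is ignored. */
        s.action_taken = false;
    }
    return s;
}
```

On this board, `effective_critical(60, 90, false)` would describe a trip point shown as 90 C with the critical action still armed. Whether a thermal event is actually delivered near that temperature is outside what this model can express.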

The problem with reasoning "bumping up critical is less dangerous
than disabling critical" is that bumping up the critical
trip point may actually just give you the illusion of control
that you don't actually have.  ie. the EC decides when/if
to send a thermal event which is what we use to compare
the temperature to the trip points.  There is absolutely
no assurance that the EC will do this near the new
fake trip point.

and...

"The greatest obstacle to discovery is not ignorance -- it is the illusion of knowledge."

so i don't really like it, but i'll apply the patch in comment #7
to 'keep the customer satisfied' :-)
Comment 17 Len Brown 2008-10-24 23:03:56 UTC
shipped in linux-2.6.28-rc1
closed

commit 22a94d79a34bf010d11996d30eed8ee3fc1a4fbf
Author: Zhang Rui <rui.zhang@intel.com>
Date:   Fri Oct 17 02:41:20 2008 -0400

    ACPI: Allow overriding to higher critical trip point.

    http://bugzilla.kernel.org/show_bug.cgi?id=9129
